SpecQuant: Spectral Decomposition and Adaptive Truncation for Ultra-Low-Bit LLMs Quantization
📝 Abstract
The emergence of accurate open large language models (LLMs) has sparked a push for advanced quantization techniques to enable efficient deployment on end-user devices. In this paper, we revisit the challenge of extreme LLM compression, targeting ultra-low-bit quantization for both activations and weights, from a Fourier frequency-domain perspective. We propose SpecQuant, a two-stage framework that tackles activation outliers and cross-channel variance. In the first stage, activation outliers are smoothed and transferred into the weight matrix to simplify downstream quantization. In the second stage, we apply channel-wise low-frequency Fourier truncation to suppress high-frequency components while preserving essential signal energy, improving quantization robustness. Our method builds on the principle that most of the weight energy is concentrated in low-frequency components, which can be retained with minimal impact on model accuracy. To enable runtime adaptability, we introduce a lightweight truncation module during inference that adjusts truncation thresholds based on channel characteristics. On LLaMA-3 8B, SpecQuant achieves 4-bit quantization for both weights and activations, narrowing the zero-shot accuracy gap to only 1.5% compared to full precision, while delivering 2× faster inference and 3× lower memory usage.
📄 Content
Large Language Models (LLMs) (Touvron et al. 2023; Guo et al. 2025; Achiam et al. 2023) have demonstrated impressive performance across a wide range of natural language processing tasks, including code generation and open-ended reasoning. These advances are largely driven by massive model scales and extensive pretraining data. However, the resulting memory footprint and computational cost pose significant challenges for deployment on edge devices (Lin et al. 2024a; Heo et al. 2023).

Figure 1: Activation and weight distributions before and after naive smoothing. While smoothing aims to mitigate activation outliers, it often transfers the quantization burden to weights, introducing new outliers and degrading the robustness of both activations and weights under quantization.
To reduce memory and accelerate inference, recent quantization methods (Liu et al. 2024; Hu et al. 2025a) aim to represent both weights and activations in low-bit formats (Zhao et al. 2025; Lin et al. 2024b), enabling faster matrix multiplications and smaller model sizes (Xiao et al. 2023). Despite their effectiveness, a core challenge remains: activation outliers, which expand the dynamic range and induce significant accuracy degradation when quantized (Wei et al. 2022).
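To make the outlier problem concrete, the sketch below applies plain symmetric 4-bit fake quantization to a tensor containing a single outlier; the function name and synthetic data are illustrative, not from the paper. Because the quantization scale is set by the maximum absolute value, one outlier coarsens the grid for every other entry:

```python
import numpy as np

def fake_quantize(x: np.ndarray, n_bits: int = 4) -> np.ndarray:
    """Symmetric per-tensor fake quantization: round onto a signed n-bit grid."""
    qmax = 2 ** (n_bits - 1) - 1           # 7 levels on each side for 4-bit
    scale = np.max(np.abs(x)) / qmax       # a single outlier inflates this scale
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale                       # dequantize back to float

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
x[0] = 50.0                                # one activation outlier

# Error on the non-outlier entries, with and without the outlier present.
err_with_outlier = np.mean((x[1:] - fake_quantize(x)[1:]) ** 2)
err_without = np.mean((x[1:] - fake_quantize(x[1:])) ** 2)
assert err_with_outlier > err_without      # the outlier degrades everything else
```

With the outlier present, the grid step is roughly 50/7, so typical activations collapse to zero; removing it shrinks the step by two orders of magnitude.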
Existing approaches attempt to mitigate outlier impact via distributional transformations. SmoothQuant (Xiao et al. 2023) shifts activation outliers into weights via layer-wise scaling; OSTQuant (Hu et al. 2025b), SpinQuant (Liu et al. 2024) and QuaRot (Ashkboos et al. 2024) employ lightweight rotation layers to reduce activation variance; SVDQuant (Li et al. 2024) applies global low-rank approximation to absorb outliers. However, these strategies face fundamental limitations. As shown in Figure 1, smoothing-based methods often transfer the burden of quantization from activations to weights without eliminating it. Rotation-based techniques introduce non-negligible runtime overhead, while SVD-based approximations fail to preserve channel-wise outlier structures critical for contextual understanding.
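A minimal sketch of the smoothing idea behind SmoothQuant (helper name and the α = 0.5 default are illustrative): divide each activation channel by a per-channel scale s and fold s into the matching weight rows, so the matrix product is mathematically unchanged while the activation range shrinks:

```python
import numpy as np

def smooth(X: np.ndarray, W: np.ndarray, alpha: float = 0.5):
    """SmoothQuant-style migration: X @ W == (X / s) @ (s * W).
    alpha balances how much quantization difficulty moves onto the weights."""
    act_max = np.max(np.abs(X), axis=0)            # per-input-channel activation range
    w_max = np.max(np.abs(W), axis=1)              # per-row weight range
    s = act_max ** alpha / np.maximum(w_max ** (1 - alpha), 1e-8)
    s = np.maximum(s, 1e-8)
    return X / s, W * s[:, None], s

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 16))
X[:, 3] *= 30.0                                    # channel 3 carries outliers
W = rng.normal(size=(16, 32))

Xs, Ws, s = smooth(X, W)
assert np.allclose(X @ W, Xs @ Ws)                 # output preserved exactly
assert np.max(np.abs(Xs)) < np.max(np.abs(X))      # activation range shrinks
```

Note the trade-off the figure above highlights: the rows of `Ws` scaled by large `s` now carry the outliers, which is precisely the burden transfer that motivates SpecQuant's second stage.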
Recent work (Jin et al. 2025) further reveals that extreme activation values often encode fine-grained contextual cues, essential for reasoning tasks. Thus, indiscriminate quantization of these values leads to substantial performance loss on long-context benchmarks. This motivates the need for a more principled and robust quantization strategy that preserves informative outliers without incurring high computational cost.
In this work, we propose SpecQuant, a novel compression framework based on adaptive Fourier-domain decomposition, which explicitly targets the spectral structure of weights induced by smoothed activations. SpecQuant consists of two stages: (1) activation smoothing to migrate outliers into the weight domain, and (2) channel-wise low-frequency truncation to suppress the transferred high-frequency noise while preserving signal fidelity. We observe that weights exhibit a strong low-frequency bias in the Fourier domain. This property allows us to truncate high-frequency components with minimal impact on model accuracy, enabling more robust low-bit quantization. Our contributions are summarized as follows:
• Frequency-Domain Approximation: We are the first to establish a connection between frequency-domain compression and quantization robustness in LLMs. Our analysis leverages Fourier energy-decay properties to provide theoretical guarantees for preserving accuracy under aggressive quantization.
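As an illustration of the second stage described above (a sketch, not the authors' exact module), the snippet below keeps only each channel's lowest-frequency FFT bins and inverts the transform; `keep_ratio` stands in for the adaptive per-channel threshold chosen at runtime:

```python
import numpy as np

def low_freq_truncate(W: np.ndarray, keep_ratio: float = 0.25) -> np.ndarray:
    """Channel-wise low-frequency truncation: FFT each row (channel),
    zero the highest-frequency bins, and reconstruct via inverse FFT."""
    spec = np.fft.rfft(W, axis=1)                  # per-channel real FFT
    k = max(1, int(np.ceil(keep_ratio * spec.shape[1])))
    spec[:, k:] = 0.0                              # drop high-frequency components
    return np.fft.irfft(spec, n=W.shape[1], axis=1)

rng = np.random.default_rng(0)
# Channels with smooth low-frequency structure plus small high-frequency noise.
t = np.linspace(0, 2 * np.pi, 256, endpoint=False)
W = np.stack([np.sin((i + 1) * t) for i in range(4)])   # low-frequency signal
W_noisy = W + 0.05 * rng.normal(size=W.shape)           # broadband noise

W_trunc = low_freq_truncate(W_noisy, keep_ratio=0.1)
# Truncation removes most of the noise while keeping the dominant energy.
assert np.mean((W_trunc - W) ** 2) < np.mean((W_noisy - W) ** 2)
```

This mirrors the paper's premise: when most channel energy sits in low-frequency bins, discarding the tail keeps the signal nearly intact while removing exactly the broadband components that inflate quantization error.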
Frequency-domain optimization has emerged as a powerful tool across machine learning, offering unique advantages in compression and efficiency. In computer vision, spectral representations help stabilize training and improve generalization. For example, FcaNet (Qin et al. 2021) applies discrete cosine transforms (DCT) to enhance channel attention, while F3D (Liu et al. 2021) replaces 3D convolutions with frequency-domain operations to improve hardware efficiency. In time-series forecasting, FreDF (Wang et al. 2024) uses spectral decomposition with consistency constraints to address temporal error propagation. Recently, frequency-domain principles have begun influencing LLM optimization. FourierFT (Gao et al. 2024) enables parameter-efficient fine-tuning by learning sparse frequency coefficients and reconstructing full model weights via inverse DFT. Notably, frequency projections align well with LLM channel structures and naturally absorb multi-scale activation outliers through bandwidth filtering.
This compatibility motivates frequency-domain decomposition as an effective strategy for robust quantization. Despite this promising synergy, however, systematic exploration of frequency-domain quantization for LLMs remains limited.
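The sparse-spectral idea behind FourierFT mentioned above can be sketched as follows (function and variable names are ours, not from that paper): store only a handful of spectral coefficients and rebuild a dense weight update with an inverse 2-D FFT when needed:

```python
import numpy as np

def reconstruct_delta(coeffs: np.ndarray, idx: tuple, shape: tuple) -> np.ndarray:
    """FourierFT-style reconstruction (illustrative): place a few learned
    spectral coefficients into an otherwise-zero spectrum, then invert."""
    spec = np.zeros(shape, dtype=complex)
    spec[idx] = coeffs                    # sparse spectrum: n entries, not d1 * d2
    return np.fft.ifft2(spec).real       # dense weight update, rebuilt on the fly

rng = np.random.default_rng(0)
shape = (64, 64)
n = 16                                    # trainable coefficients vs. 4096 weights
rows = rng.integers(0, shape[0], size=n)
cols = rng.integers(0, shape[1], size=n)
coeffs = rng.normal(size=n) + 1j * rng.normal(size=n)

delta_W = reconstruct_delta(coeffs, (rows, cols), shape)
assert delta_W.shape == shape             # a full update from only 16 parameters
```

The design point is the storage asymmetry: the trainable state is n complex numbers plus their indices, while the reconstructed update is dense, which is why spectral parameterizations pair naturally with the compression goals discussed here.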
In this section, we first establish the theoretical foundation of Low-Frequency Fourier Projection for channel vectors, laying the groundwork for our subsequent methodology. We then introduce SpecQuant, a novel quantization paradigm.