Enhancing Post-Training Quantization via Future Activation Awareness
Post-training quantization (PTQ) is a widely used method to compress large language models (LLMs) without fine-tuning. It typically sets quantization hyperparameters (e.g., scaling factors) based on current-layer activations. Although this method is efficient, it suffers from quantization bias and error accumulation, resulting in suboptimal and unstable quantization, especially when the calibration data is biased. To overcome these issues, we propose Future-Aware Quantization (FAQ), which leverages future-layer activations to guide quantization. This allows better identification and preservation of important weights, while reducing sensitivity to calibration noise. We further introduce a window-wise preview mechanism to softly aggregate multiple future-layer activations, mitigating over-reliance on any single layer. To avoid expensive greedy search, we use a pre-searched configuration to minimize overhead. Experiments show that FAQ consistently outperforms prior methods with negligible extra cost, requiring no backward passes, data reconstruction, or tuning, making it well-suited for edge deployment.
💡 Research Summary
The paper addresses a critical limitation of existing post‑training quantization (PTQ) methods for large language models (LLMs). Conventional PTQ determines the quantization scale of each layer solely from the statistics of that layer’s activations. This “local‑only” approach leads to two major problems: (1) quantization bias, where channels that dominate the current layer’s activation range cause the scale to be overly large, thereby compressing downstream‑important but smaller‑magnitude channels; and (2) error accumulation, where quantization errors introduced early in the network propagate forward and amplify, especially when the calibration dataset is distributionally mismatched with real‑world data.
To mitigate these issues, the authors propose Future‑Aware Quantization (FAQ). The key idea is to incorporate activations from future (downstream) layers when computing the scale for the current layer. Specifically, for a given layer i, a preview activation a_pvw_i is obtained by averaging the activations of the next j layers (the window‑wise preview). The current activation a_i and the preview are then fused with a weighting factor γ: ã_i = γ·a_i + (1−γ)·a_pvw_i. This fused activation serves as the base scale s_i = ã_i, which is further multiplied by a learnable factor c_i to produce the effective quantization scale s*_i = c_i·s_i. The scale thus reflects not only the immediate statistics but also the sensitivity of downstream layers, allowing the quantizer to preserve weights that are crucial for later computations while still aggressively compressing less important ones.
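The fusion step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the choice of per‑channel mean absolute activation as the statistic, and the helper name `fused_scale`, are assumptions here; γ, the window size j, and the factor c_i follow the paper's notation.

```python
import numpy as np

def fused_scale(acts, i, gamma=0.85, window=3, c_i=1.0):
    """Sketch of FAQ's fused quantization scale for layer i.

    acts: list of per-channel activation-magnitude arrays, one per layer
    (assumed here to be mean absolute activations from calibration).
    """
    a_i = acts[i]
    # Window-wise preview: average the activations of the next `window` layers.
    future = acts[i + 1 : i + 1 + window]
    a_pvw = np.mean(future, axis=0)
    # Fuse current and preview activations with weighting factor gamma:
    # a_fused = gamma * a_i + (1 - gamma) * a_pvw
    a_fused = gamma * a_i + (1.0 - gamma) * a_pvw
    # Effective scale s*_i = c_i * s_i, with base scale s_i = a_fused.
    return c_i * a_fused
```

With γ = 1 this degenerates to a current‑layer‑only scale, so the baseline behavior is recovered as a special case of the fusion.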
FAQ introduces three practical design choices to keep overhead minimal: (1) a window‑wise preview that aggregates multiple future layers, reducing reliance on any single noisy layer; (2) a pre‑searched configuration for γ and the window size j (the paper reports γ=0.85 and j=3 as generally effective), eliminating the need for costly greedy hyper‑parameter search during deployment; and (3) a purely forward‑only procedure—no backward passes, data reconstruction, or fine‑tuning—making it suitable for edge devices.
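The forward‑only character of the procedure can be illustrated with an AWQ‑style weight transform: scale the salient input channels by the effective scale, round to nearest on an integer grid, then fold the scale back. This is a hedged sketch under assumed conventions (symmetric per‑tensor grid, scale folded into the weight); the paper's exact quantizer may differ, but nothing here requires a backward pass or reconstruction.

```python
import numpy as np

def quantize_weight(W, s, n_bits=4):
    """Forward-only weight quantization sketch using a per-input-channel
    scale s (e.g., the FAQ effective scale). Grid details are assumptions."""
    # Scale up important input channels before quantization (AWQ-style).
    W_s = W * s[np.newaxis, :]
    # Symmetric round-to-nearest quantization of the scaled weights.
    qmax = 2 ** (n_bits - 1) - 1
    step = np.abs(W_s).max() / qmax
    W_q = np.clip(np.round(W_s / step), -qmax - 1, qmax) * step
    # Fold the channel scale back so the layer's output mapping is preserved
    # (in practice the inverse scale is absorbed into preceding activations).
    return W_q / s[np.newaxis, :]
```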
Theoretical analysis (Theorem 1) formalizes why FAQ reduces quantization error compared with activation‑wise quantization (AWQ). Under assumptions that one channel’s activation magnitude dominates and that larger activations lead to larger scale search ranges, the authors prove that the error norm δ_FAQ is strictly smaller than δ_AWQ. The proof hinges on the fact that the fused scale incorporates a product of diagonal matrices representing both current and future activations, effectively balancing the scaling across channels.
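The intuition behind the theorem can be shown with a toy example (this is an illustration of the dominant‑channel setting, not the paper's proof; all numbers are invented): when one channel dominates the current layer but future layers depend more on another channel, the fused scale yields a more balanced grid across channels.

```python
import numpy as np

# Toy dominant-channel setting: channel 0 dominates the current layer's
# activations, while the averaged future layers rely more on channel 1.
a_cur = np.array([10.0, 1.0])   # current-layer per-channel magnitudes
a_pvw = np.array([1.0, 8.0])    # window-averaged future-layer magnitudes
gamma = 0.85

s_local = a_cur                                # current-layer-only scale
s_faq = gamma * a_cur + (1 - gamma) * a_pvw   # FAQ fused scale

# A smaller ratio between channel scales means a more balanced grid, so the
# downstream-important channel 1 is compressed less aggressively.
print(s_local[0] / s_local[1])  # 10.0
print(s_faq[0] / s_faq[1])      # ~4.22
```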
Empirical evaluation spans several open‑source LLMs (Qwen‑3 4B/8B, Qwen‑2.5 0.5B/7B, LLaMA‑3.2 3B, LLaMA‑2 7B) under weight‑only 3‑bit and 4‑bit quantization. Benchmarks include perplexity on WikiText‑2 and C4, and zero‑shot accuracy on ARC‑Challenge/Easy, PIQA, BoolQ, HellaSwag, and Winogrande. Across all models and tasks, FAQ consistently outperforms the Round‑to‑Nearest (RTN) and AWQ baselines. For example, on Qwen‑3 8B, FAQ improves ARC‑Challenge accuracy from 0.4778 (AWQ) to 0.5043 and PIQA from 0.7459 to 0.7535. The gains are more pronounced at 3‑bit, where quantization noise is severe; there, FAQ's reduction of error accumulation and bias delivers up to a 5‑percentage‑point accuracy boost.
Robustness to calibration data bias is also examined. By varying the number of calibration samples N (16, 32, 64, 128), the authors show that FAQ maintains higher mean performance and lower variance than AWQ, indicating better resilience to limited or skewed calibration sets.
In summary, FAQ offers a lightweight yet effective PTQ strategy that leverages future‑layer activations through a windowed preview and a simple fusion mechanism. It alleviates quantization bias and error propagation without incurring additional training or reconstruction costs, making it an attractive solution for deploying LLMs on resource‑constrained edge hardware.