PTQ4ARVG: Post-Training Quantization for AutoRegressive Visual Generation Models
AutoRegressive Visual Generation (ARVG) models retain an architecture compatible with language models while achieving performance comparable to diffusion-based models. Quantization is commonly employed in neural networks to reduce model size and computational latency. However, applying quantization to ARVG remains largely underexplored, and existing quantization methods fail to generalize effectively to ARVG models. In this paper, we explore this issue and identify three key challenges: (1) severe outliers at the channel-wise level, (2) highly dynamic activations at the token-wise level, and (3) mismatched distribution information at the sample-wise level. To address these challenges, we propose PTQ4ARVG, a training-free post-training quantization (PTQ) framework consisting of: (1) Gain-Projected Scaling (GPS), which mitigates channel-wise outliers by expanding the quantization loss via a Taylor series to quantify the gain of scaling for activation-weight quantization and deriving the optimal scaling factor through differentiation; (2) Static Token-Wise Quantization (STWQ), which leverages two inherent properties of ARVG, fixed token length and position-invariant distributions across samples, to address token-wise variance without incurring dynamic calibration overhead; and (3) Distribution-Guided Calibration (DGC), which selects the samples that contribute most to distributional entropy, eliminating sample-wise distribution mismatch. Extensive experiments show that PTQ4ARVG effectively quantizes the ARVG family of models to 8-bit and 6-bit while maintaining competitive performance. Code is available at http://github.com/BienLuky/PTQ4ARVG .
💡 Research Summary
The paper addresses the largely unexplored problem of post‑training quantization (PTQ) for autoregressive visual generation (AR‑VG) models, which share transformer architectures with large language models (LLMs) but generate images rather than text. Existing PTQ techniques for LLMs, vision transformers, or diffusion models do not work well for AR‑VG because they ignore three characteristic challenges: (1) severe channel‑wise outliers caused by the Adaptive LayerNorm (AdaLN) scaling and shifting, (2) highly dynamic activations across token positions, and (3) a mismatch between the distribution of calibration samples and the true data distribution, especially because AR‑VG generates a fixed number of tokens with strong inter‑sample similarity.
To solve these issues, the authors propose PTQ4ARVG, a training‑free PTQ framework composed of three novel components:
- Gain‑Projected Scaling (GPS) – The authors analytically decompose the total quantization loss into activation‑quantization loss and weight‑quantization loss. By expanding each term with a second‑order Taylor series, they define a "gain" as the reduction in activation loss minus the increase in weight loss when a per‑channel scaling factor s is applied. Differentiating this gain with respect to s yields a closed‑form optimal scaling factor that simultaneously suppresses channel‑wise outliers and limits weight distortion. GPS therefore replaces heuristic scaling (e.g., SmoothQuant) with a mathematically justified, zero‑overhead operation.
- Static Token‑Wise Quantization (STWQ) – AR‑VG models always generate a predetermined number of tokens, and the activation statistics at each token position are largely invariant across samples. STWQ exploits this property by pre‑computing static quantization parameters (scale and zero‑point) for each token position using a percentile‑based calibration. Because the parameters are fixed, no runtime calibration is required, eliminating the latency and accuracy penalties of dynamic token‑wise quantization used in LLMs. STWQ also integrates seamlessly with standard CUDA kernels, avoiding custom kernels.
- Distribution‑Guided Calibration (DGC) – To address sample‑wise distribution mismatch, DGC selects calibration samples based on their contribution to the overall distribution entropy. By ranking candidate images according to how much they increase the entropy of the activation distribution, DGC retains only the most informative samples. This reduces the number of calibration images needed while ensuring that the selected subset faithfully represents the true data distribution.
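The flavor of GPS's closed‑form scaling can be illustrated with a minimal sketch. Assume a second‑order model in which each quantization loss is proportional to the squared step size, so the per‑channel loss is L(s) ∝ (x_max/s)² + λ·(s·w_max)²; setting dL/ds = 0 gives the minimizer below. The function name `gps_scale` and the weighting λ are illustrative assumptions, not the paper's exact derivation, and with λ = 1 the result reduces to the SmoothQuant α = 0.5 heuristic:

```python
import numpy as np

def gps_scale(x_absmax, w_absmax, lam=1.0):
    """Sketch of a gain-derived per-channel scaling factor.

    Assumes activation/weight quantization losses proportional to the
    squared step sizes, i.e. L(s) ~ (x_max/s)**2 + lam * (s * w_max)**2
    per channel (lam weighs weight loss against activation loss).
    """
    # dL/ds = -2*x_max**2 / s**3 + 2*lam*w_max**2 * s = 0
    #   =>  s**4 = x_max**2 / (lam * w_max**2)
    return np.sqrt(x_absmax / (np.sqrt(lam) * np.clip(w_absmax, 1e-8, None)))

# Activations are divided by s and the matching weight columns multiplied
# by s (as in SmoothQuant), so the matmul output is unchanged pre-quantization.
x_absmax = np.array([120.0, 3.0, 0.5])  # per-channel activation magnitudes
w_absmax = np.array([0.4, 0.5, 0.3])
s = gps_scale(x_absmax, w_absmax)       # outlier channel gets the largest s
```

Channels with severe activation outliers (the first entry) receive the largest scaling, shifting their dynamic range into the weights where quantization is cheaper.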
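Because the token count is fixed and per‑position statistics are roughly sample‑invariant, STWQ can fix one (scale, zero‑point) pair per token position entirely offline. A minimal sketch, assuming percentile‑clipped asymmetric uniform quantization; the function names and the `pct` parameter are illustrative, not the paper's exact procedure:

```python
import numpy as np

def calibrate_stwq(calib_acts, n_bits=8, pct=99.9):
    """Precompute static per-token-position quantization parameters.

    calib_acts: (num_samples, num_tokens, hidden_dim) calibration
    activations. Percentile clipping guards against rare spikes.
    Returns one (scale, zero_point) pair per token position.
    """
    qmax = 2 ** n_bits - 1
    # Clip range per token position, pooled over samples and channels.
    lo = np.percentile(calib_acts, 100 - pct, axis=(0, 2))  # (num_tokens,)
    hi = np.percentile(calib_acts, pct, axis=(0, 2))        # (num_tokens,)
    scale = (hi - lo) / qmax
    zero_point = np.round(-lo / scale)
    return scale, zero_point

def quantize_stwq(x, scale, zero_point, n_bits=8):
    """Apply the fixed parameters at inference; no runtime calibration."""
    qmax = 2 ** n_bits - 1
    s, z = scale[None, :, None], zero_point[None, :, None]
    q = np.clip(np.round(x / s + z), 0, qmax)
    return (q - z) * s  # dequantized activations
```

Since the parameters are plain per‑position constants, the fake‑quantize step above maps directly onto standard static‑quantization kernels.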
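DGC's entropy‑guided selection can be sketched as a greedy loop that repeatedly adds the candidate whose inclusion most increases the entropy of the pooled activation histogram. The per‑sample feature summaries, histogram binning, and the `dgc_select` name are all assumptions for illustration, not the paper's exact criterion:

```python
import numpy as np

def dgc_select(features, k, n_bins=32):
    """Greedily pick k calibration samples that maximize histogram entropy.

    features: (num_candidates, feat_dim) per-sample activation summaries.
    Each step adds the candidate that most increases the entropy of the
    histogram over pooled feature values, so the selected subset covers
    the distribution instead of oversampling its mode.
    """
    lo, hi = features.min(), features.max()

    def entropy(idx):
        hist, _ = np.histogram(features[idx].ravel(), bins=n_bins, range=(lo, hi))
        p = hist / hist.sum()
        p = p[p > 0]
        return -(p * np.log(p)).sum()

    selected, remaining = [], list(range(len(features)))
    for _ in range(k):
        gains = [entropy(selected + [j]) for j in remaining]
        best = remaining[int(np.argmax(gains))]
        selected.append(best)
        remaining.remove(best)
    return selected
```

On a candidate pool where most samples are near‑duplicates, this procedure skips the redundant copies and keeps the few samples that broaden the activation distribution, which is the intuition behind needing far fewer calibration images.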
The framework is evaluated on four state‑of‑the‑art AR‑VG models: VAR‑d30 (2 B parameters), RAR‑XXL (1.5 B), PAR‑3B (3 B), and MAR‑Huge (1 B). Experiments cover 8‑bit and 6‑bit quantization. Key results include:
- Performance retention – 8‑bit quantization incurs less than 0.5 FID degradation and less than 0.1 IS drop compared with the full‑precision baseline; 6‑bit quantization stays within 1.2 FID and 0.3 IS loss. These gaps are 12‑30 % smaller than those of prior PTQ methods (SmoothQuant, OmniQuant, QuaRot, LiteVAR).
- Memory and speed gains – 8‑bit reduces model size by ~75 %, 6‑bit by ~62 %; inference speed improves by an average of 1.8× with no extra kernel overhead.
- Ablation studies – GPS cuts channel‑wise outlier frequency by ~68 %; STWQ reduces token‑wise activation variance by ~45 %; DGC achieves 97 % of the full‑distribution entropy using only 2 k calibration samples (vs. 10 k random samples).
The authors also discuss limitations. GPS relies on a Hessian approximation using mean‑squared error, which may be less accurate for complex loss functions such as GAN objectives. STWQ assumes a fixed token length, so models with variable‑length generation would need further adaptation.
In summary, PTQ4ARVG provides the first comprehensive, theory‑driven PTQ solution tailored to autoregressive visual generation. By jointly addressing channel outliers, token dynamics, and sample distribution mismatch, it enables low‑bit (6‑bit) quantization of large AR‑VG models with negligible quality loss, paving the way for deploying multimodal generative models on resource‑constrained hardware.