SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights
Post-training quantization has emerged as the most widely used strategy for deploying large language models at low precision. Still, current methods show perplexity degradation at bit-widths less than or equal to 4, partly because representing outliers causes precision issues in parameters that share the same scales as these outliers. This problem is especially pronounced for calibration-free, uniform quantization methods. We introduce SINQ to augment existing post-training quantizers with an additional second-axis scale factor and a fast Sinkhorn-Knopp-style algorithm that finds scales to normalize per-row and per-column variances. We show that this approximates activation-aware quantization by recovering column scales from the weight matrix structure that are predictive of the typical activation magnitudes the matrix received during training. Our method has no interactions between layers and can be trivially applied to new architectures to quantize any linear layer. We evaluate our method on the Qwen3 model family, among others. SINQ reduces the perplexity gap on WikiText2 and C4 by over 50% against uncalibrated uniform quantization baselines, incurs zero to negligible compute overhead, and can be further enhanced by combining it with calibration and non-uniform quantization levels. Code is available at https://github.com/huawei-csl/SINQ.
💡 Research Summary
This paper addresses the persistent degradation in perplexity that occurs when large language models (LLMs) are post‑training quantized (PTQ) to very low bit‑widths (≤ 4 bits). Existing calibration‑free uniform quantizers suffer because outliers force a single scale to accommodate both typical and extreme weight values, leading to severe precision loss, especially in weight‑only quantization scenarios. The authors propose SINQ (Sinkhorn‑Normalized Quantization), a method that augments any PTQ pipeline with a second‑axis (column‑wise) scale vector and a fast iterative algorithm inspired by the Sinkhorn‑Knopp matrix balancing procedure.
The core idea is to represent a weight matrix W ∈ ℝ^{m×n} as
Ŵ = W ⊘ (s tᵀ),  i.e.  Ŵ_ij = W_ij / (s_i · t_j),
where s ∈ ℝ^{m} is a row‑scale vector and t ∈ ℝ^{n} is a column‑scale vector. Quantization then proceeds on Ŵ using standard rounding (and optional shift) to produce an integer matrix Q. The challenge is to choose s and t so that the row‑ and column‑wise standard deviations of Ŵ are balanced, preventing any dimension from dominating the quantization error.
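A minimal NumPy sketch of this dual-scale round trip may help make the setup concrete. The uniform min-max rounding, the shift (zero point), and all function names below are our illustrative assumptions, not SINQ's actual implementation; the scale vectors s and t are assumed given (e.g. from the balancing step described next).

```python
# Hypothetical sketch of dual-scale (row + column) quantization: divide W by
# the outer product of row scales s and column scales t, then apply uniform
# rounding with a shift. Names and the min-max grid are assumptions.
import numpy as np

def dual_scale_quantize(W, s, t, bits=4):
    """Quantize W after dividing out row scales s and column scales t."""
    W_hat = W / (s[:, None] * t[None, :])          # balanced matrix
    lo, hi = W_hat.min(), W_hat.max()
    levels = 2**bits - 1
    step = (hi - lo) / levels                      # uniform step size
    zero = lo                                      # shift (zero point)
    Q = np.clip(np.round((W_hat - zero) / step), 0, levels).astype(np.int32)
    return Q, step, zero

def dual_scale_dequantize(Q, s, t, step, zero):
    """Reconstruct an approximation of W from the integer matrix Q."""
    W_hat = Q * step + zero
    return W_hat * (s[:, None] * t[None, :])

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
s = W.std(axis=1)                                  # toy row scales
t = np.ones(16)                                    # toy column scales
Q, step, zero = dual_scale_quantize(W, s, t, bits=4)
W_rec = dual_scale_dequantize(Q, s, t, step, zero)
print(np.abs(W - W_rec).max())
```

Note that storing t adds only n extra parameters per matrix, which is why the memory overhead discussed later is negligible.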
To find these scales, the authors formulate an "imbalance" metric I(Ŵ) = max(max σ_row, max σ_col) / min(min σ_row, min σ_col), where σ_row and σ_col are the per‑row and per‑column standard deviations. Starting from the raw weight matrix, they iteratively normalize the rows and columns in log‑space: at each iteration they compute the current row and column standard deviations, compare them to a target standard deviation τ (the smallest observed one), clamp the log‑ratio to a bounded interval, and update the log‑scale vectors u and v accordingly. After a fixed number of iterations K, the pair of log‑scales with the lowest observed imbalance is exponentiated to obtain the final s and t. The procedure is lightweight (O(K·mn) operations) and converges within a few dozen iterations in practice.
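The balancing loop can be sketched in a few lines of NumPy. This follows the description above loosely; the exact clamp interval, iteration count, and target choice are assumptions for illustration, not the authors' reference code.

```python
# Sketch of a Sinkhorn-Knopp-style row/column balancing loop: repeatedly
# normalize row and column standard deviations toward a shared target in
# log-space, tracking the lowest-imbalance scales seen. Constants (K, clamp)
# are illustrative assumptions.
import numpy as np

def imbalance(W):
    sr, sc = W.std(axis=1), W.std(axis=0)
    return max(sr.max(), sc.max()) / min(sr.min(), sc.min())

def sinkhorn_balance(W, K=30, clamp=2.0, eps=1e-8):
    m, n = W.shape
    u, v = np.zeros(m), np.zeros(n)               # log row/column scales
    best = (np.inf, u.copy(), v.copy())
    for _ in range(K):
        Wb = W / np.exp(u)[:, None] / np.exp(v)[None, :]
        sr = Wb.std(axis=1) + eps
        sc = Wb.std(axis=0) + eps
        tau = min(sr.min(), sc.min())             # target std: smallest observed
        u += np.clip(np.log(sr / tau), -clamp, clamp)
        v += np.clip(np.log(sc / tau), -clamp, clamp)
        Wb = W / np.exp(u)[:, None] / np.exp(v)[None, :]
        score = imbalance(Wb)
        if score < best[0]:                       # keep the best pair seen so far
            best = (score, u.copy(), v.copy())
    _, u, v = best
    return np.exp(u), np.exp(v)                   # final s, t

rng = np.random.default_rng(1)
W = rng.normal(size=(16, 32)) * rng.lognormal(size=(16, 1))  # row-wise outliers
s, t = sinkhorn_balance(W)
print(imbalance(W), imbalance(W / s[:, None] / t[None, :]))
```

Because the imbalance metric is scale-invariant, only the relative sizes of the entries of s and t matter; the absolute scale is absorbed by the quantization step size.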
A key empirical observation is that the inverse of a column’s weight standard deviation (1/σ_col) correlates strongly (R² ≈ 0.9) with the average magnitude of the inputs that feed that column during training (μ_x). This relationship, demonstrated on Qwen‑3, Llama, Phi, and other models, implies that the weight matrix itself encodes a proxy for activation scales, enabling a form of activation‑aware quantization without any calibration data. However, naïvely scaling each column by 1/σ_col inflates row‑wise outliers, as shown by increased kurtosis. SINQ’s joint row‑column balancing mitigates this effect, preserving row statistics while still aligning column scales with activation magnitudes.
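A small synthetic experiment illustrates why 1/σ_col can act as a calibration-free proxy. The generative assumption here (weights drawn inversely proportional to each column's typical input magnitude) is ours, made purely to demonstrate the correlation the paper reports on real trained models.

```python
# Synthetic illustration: if training leaves a column's weights inversely
# proportional to that column's typical input magnitude mu_x (as the paper
# observes empirically), then 1/sigma_col recovers mu_x from the weights
# alone, with no calibration data. The data-generating assumption is ours.
import numpy as np

rng = np.random.default_rng(2)
n = 64
mu_x = rng.lognormal(mean=0.0, sigma=1.0, size=n)   # typical |input| per column
W = rng.normal(size=(128, n)) / mu_x[None, :]        # weights shrink as mu_x grows

inv_sigma_col = 1.0 / W.std(axis=0)                  # calibration-free proxy
r = np.corrcoef(inv_sigma_col, mu_x)[0, 1]
print(f"correlation(1/sigma_col, mu_x) = {r:.3f}")
```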
The authors also show how SINQ can be combined with existing activation‑aware quantizers such as AWQ. By first applying SINQ’s balancing, the subsequent AWQ optimization (which minimizes the L₂ error between original and quantized layer outputs) avoids the row‑wise kurtosis explosion that would otherwise occur. This hybrid approach, dubbed “ASINQ,” yields further perplexity reductions.
Experimental evaluation focuses on the Qwen‑3 family (1.7 B, 14 B, 32 B) quantized to 3‑bit and 4‑bit precision. Using standard language modeling benchmarks (WikiText‑2, C4) and QA flip‑rate metrics, SINQ consistently outperforms strong baselines: RTN, Hadamard + RTN, and HQQ. For 4‑bit quantization, SINQ cuts the perplexity gap to the BF16 baseline by more than 50 % on average, and reduces flip rates across models. Memory consumption remains comparable because the additional column‑scale vector adds only O(n) parameters, which is negligible relative to the full weight matrix. The runtime overhead is limited to a single element‑wise scaling of activations per linear layer, which is shown to be practically zero in measured inference latency.
Further experiments on Mixture‑of‑Experts (MoE) models, DeepSeek‑V3, Phi, and other LLMs confirm that SINQ is architecture‑agnostic: it requires no inter‑layer dependencies, works out‑of‑the‑box with default group sizes (64), and can be applied to any linear layer. Some models (e.g., Qwen‑3) require sharing the column‑scale across multiple heads (Q, K, V) or across gate/up‑projection layers; the authors discuss the modest trade‑off between a slight accuracy dip and the benefit of reduced scaling overhead.
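The scale-sharing idea mentioned above can be sketched simply: since the Q, K, and V projections consume the same input activations, one column-scale vector over the shared input dimension suffices for all three, so the activations need to be rescaled only once. The function below derives a shared scale from per-column standard deviations of the stacked matrices; it is a simplified stand-in for SINQ's balanced scales, and all names are illustrative.

```python
# Sketch of sharing one column-scale vector across projections that read the
# same input (e.g. Q, K, V). Here the shared scale is simply the per-column
# std of the row-wise concatenation; a hypothetical simplification of the
# balanced scales, not the authors' implementation.
import numpy as np

def shared_column_scale(mats):
    """One column scale per input feature, shared across stacked matrices."""
    stacked = np.concatenate(mats, axis=0)        # shape (sum of rows, n)
    return stacked.std(axis=0)                    # shared t, shape (n,)

rng = np.random.default_rng(3)
d = 32
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
t = shared_column_scale([Wq, Wk, Wv])
# all three projections are balanced with the same t, so the input
# activations only need to be rescaled once at inference time
balanced = [W / t[None, :] for W in (Wq, Wk, Wv)]
print(t.shape, balanced[0].shape)
```

The accuracy cost of sharing, as the authors note, is modest relative to the saved scaling work.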
In summary, SINQ introduces a simple yet powerful dual‑scale quantization scheme that leverages intrinsic weight statistics to approximate activation‑aware scaling without calibration data. The Sinkhorn‑style balancing algorithm efficiently equalizes row and column variances, preventing outlier‑driven quantization errors while preserving the predictive relationship between column scales and activation magnitudes. Empirically, SINQ delivers substantial perplexity improvements across a wide range of LLM sizes and architectures, with negligible memory and compute penalties, making it a compelling solution for deploying LLMs in memory‑constrained or latency‑sensitive environments. Future work may explore tighter integration with non‑uniform quantization schemes, adaptive per‑layer iteration budgets, and extensions to convolutional or transformer‑style attention kernels beyond linear layers.