Price of universality in vector quantization is at most 0.11 bit
Fast computation of a matrix product $W^\top X$ is a workhorse of modern LLMs. To make their deployment more efficient, a popular approach is to use a low-precision approximation $\widehat W$ in place of the true $W$ (“weight-only quantization’’). Information theory shows that an optimal algorithm for reducing the precision of $W$ depends on the (second-order) statistics of $X$ and requires a careful alignment of the vector quantization codebook with the PCA directions of $X$ (a process known as “waterfilling allocation’’). Dependence of the codebook on the statistics of $X$, however, is highly impractical. This paper proves that there exists a universal codebook that is simultaneously near-optimal for all possible statistics of $X$, in the sense of being at least as good as an $X$-adapted waterfilling codebook with rate reduced by 0.11 bit per dimension. Such a universal codebook would be an ideal candidate for the low-precision storage format, a topic of active modern research, but alas the existence proof is non-constructive. Equivalently, our result shows the existence of a net in $\mathbb{R}^n$ that is a nearly-optimal covering of a sphere simultaneously with respect to all Hilbert norms.
💡 Research Summary
The paper addresses a fundamental problem in modern large language models (LLMs): how to store the weight matrix W in low‑precision while preserving the accuracy of the inner product WᵀX that is computed billions of times during inference. Existing theory shows that if the second‑order statistics of the activation vector X (i.e., its covariance Σₓ) are known, the optimal quantization codebook should be aligned with the principal components of Σₓ. This “water‑filling” allocation distributes bits preferentially to directions with large variance and yields the information‑theoretic rate‑distortion bound for the weighted mean‑squared error d_Σₓ(W, Ŵ) = (W − Ŵ)ᵀ Σₓ (W − Ŵ). However, in practice the decoder (hardware) must be fixed in advance and cannot depend on Σₓ, which makes a Σₓ‑dependent codebook impractical.
The authors formalize this restriction as a “universal codebook” (or Σₓ‑oblivious decoder) that must work simultaneously for all possible covariance matrices Σₓ ∈ Sₙ⁺ with trace n. They assume the weight vector W is isotropic Gaussian, W ∼ N(0, Iₙ), and define the distortion for a given Σₓ as the average weighted MSE per dimension. The classical water‑filling solution gives a parametric curve (λᵢ are the eigenvalues of Σₓ): D_wf(Σₓ, t) = (1/n) ∑ᵢ min{λᵢ, t}, R_wf(Σₓ, t) = (1/2n) ∑ᵢ max{0, log(λᵢ/t)}. Any encoder/decoder pair achieving distortion D must satisfy R ≥ R_wf(Σₓ, D).
The paper’s first main contribution is a non‑constructive existence proof of a universal codebook that, for any fixed rate R, achieves a distortion described by a different parametric curve derived from random coding: D_rc(λ, T) = (1/n) ∑ᵢ λᵢ/(1 + λᵢT), R_rc(λ, T) = (1/2n) ∑ᵢ log(1 + λᵢT). For each rate R, a suitable T can be chosen so that R_rc = R. Using a random Gaussian codebook together with a shared random seed S∈