RaBiT: Residual-Aware Binarization Training for Accurate and Efficient LLMs
Efficient deployment of large language models (LLMs) requires extreme quantization, forcing a critical trade-off between low-bit efficiency and performance. Residual binarization enables hardware-friendly, matmul-free inference by stacking binary ($\pm$1) layers, but is plagued by pathological feature co-adaptation. We identify a key failure mode, which we term inter-path adaptation: during quantization-aware training (QAT), parallel residual binary paths learn redundant features, degrading the error-compensation structure and limiting the expressive capacity of the model. While prior work relies on heuristic workarounds (e.g., path freezing) that constrain the solution space, we propose RaBiT, a novel quantization framework that resolves co-adaptation by algorithmically enforcing a residual hierarchy. Its core mechanism sequentially derives each binary path from a single shared full-precision weight, which ensures that every path corrects the error of the preceding one. This process is stabilized by a robust initialization that prioritizes functional preservation over mere weight approximation. RaBiT redefines the 2-bit accuracy-efficiency frontier: it achieves state-of-the-art performance, rivals even hardware-intensive Vector Quantization (VQ) methods, and delivers a $4.49\times$ inference speed-up over full-precision models on an RTX 4090.
💡 Research Summary
The paper introduces RaBiT (Residual‑Aware Binarization Training), a novel quantization‑aware training (QAT) framework designed to enable accurate and efficient 2‑bit large language models (LLMs). Existing 2‑bit approaches fall into two camps: vector quantization (VQ), which preserves high accuracy but incurs heavy hardware overhead due to lookup tables and complex rotations, and residual binarization, which stacks multiple binary (±1) paths to achieve matmul‑free inference. The latter suffers from a severe training pathology the authors call inter‑path adaptation, a form of feature co‑adaptation where parallel binary paths receive the same global gradient and consequently learn redundant representations. This collapses the intended residual hierarchy—later paths should correct the errors of earlier ones—thereby limiting expressive capacity.
RaBiT solves this problem structurally. Instead of maintaining independent latent weights for each binary path, it keeps a single shared full‑precision weight matrix $W_{FP}$. During the forward pass, binary cores are derived on the fly in a sequential manner: the first path is obtained by binarizing $W_{FP}$ (a sign operation) and scaling it with learnable per‑channel vectors $g_1$ and $h_1$. The residual error $R_1 = W_{FP} - \hat{W}_1$ is then computed, and the second binary core is the sign of this residual, scaled by independent vectors $g_2, h_2$. The final effective weight is the sum $\hat{W}^{(2)} = \hat{W}_1 + \hat{W}_2$. Because each subsequent path is forced to operate on the remaining error, the architecture inherently enforces a residual hierarchy: later paths cannot ignore the mistakes of earlier ones.
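The sequential derivation above can be sketched in a few lines of NumPy. The shapes of the scaling vectors ($g_k$ per output row, $h_k$ per input column) are our assumption for illustration; the paper may parameterize them differently:

```python
import numpy as np

def residual_binarize(W_fp, g1, h1, g2, h2):
    """Two-path residual binarization: each path is derived on the fly
    from the shared full-precision weight W_fp, the second operating on
    the residual error left by the first."""
    # Path 1: binarize the shared weight and scale per-channel.
    B1 = np.sign(W_fp)
    B1[B1 == 0] = 1.0                      # break sign ties toward +1
    W_hat1 = g1[:, None] * B1 * h1[None, :]
    # Residual error R1 = W_fp - W_hat1 left by path 1.
    R1 = W_fp - W_hat1
    # Path 2: binarize the *residual*, not the original weight.
    B2 = np.sign(R1)
    B2[B2 == 0] = 1.0
    W_hat2 = g2[:, None] * B2 * h2[None, :]
    # Effective 2-bit weight is the sum of the two scaled binary paths.
    return W_hat1 + W_hat2
```

With reasonable scales (e.g. $g_k$ set to the per-row mean absolute value of the matrix being binarized), the second path strictly reduces the approximation error relative to a single binary path.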
Training uses a Straight‑Through Estimator (STE) that back‑propagates the gradient of the loss with respect to the summed effective weight directly to $W_{FP}$. The scaling vectors are updated as ordinary parameters via the chain rule, treating the dynamically generated binary cores as constants during back‑propagation. This design yields a strong negative correlation between the outputs of the two paths, as shown analytically by decomposing the mean‑squared error (MSE) into a base error term and an interaction term proportional to the product of the path standard deviations and their Pearson correlation. Empirically, RaBiT achieves correlations around –0.5, turning the interaction term into a substantial loss‑reducing bonus, whereas standard QAT yields near‑zero correlation and no such benefit.
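A minimal sketch of the backward pass just described, written as explicit NumPy gradients rather than an autograd implementation (the function name and argument layout are ours, not the paper's):

```python
import numpy as np

def ste_backward(grad_W_eff, B1, B2, g1, h1, g2, h2):
    """Straight-through backward pass for the two-path effective weight
    W_eff = g1*B1*h1 + g2*B2*h2, with binary cores B1, B2 treated as
    constants during differentiation."""
    # STE: the sign() is passed through as identity, so the gradient
    # w.r.t. the shared full-precision weight equals dL/dW_eff.
    grad_W_fp = grad_W_eff.copy()
    # Chain rule for the per-channel scales of path k, where
    # W_hat_k[i, j] = g_k[i] * B_k[i, j] * h_k[j]:
    #   dL/dg_k[i] = sum_j grad[i, j] * B_k[i, j] * h_k[j]
    #   dL/dh_k[j] = sum_i grad[i, j] * B_k[i, j] * g_k[i]
    grad_g1 = np.einsum('ij,ij,j->i', grad_W_eff, B1, h1)
    grad_h1 = np.einsum('ij,ij,i->j', grad_W_eff, B1, g1)
    grad_g2 = np.einsum('ij,ij,j->i', grad_W_eff, B2, h2)
    grad_h2 = np.einsum('ij,ij,i->j', grad_W_eff, B2, g2)
    return grad_W_fp, (grad_g1, grad_h1), (grad_g2, grad_h2)
```

In a framework like PyTorch the same effect is usually obtained with a detach-based trick or a custom autograd function; the explicit form above just makes the "cores as constants" assumption visible.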
A second contribution is a function‑preserving initialization. The authors observe that 2‑bit QAT is highly sensitive to the starting point; naïve weight‑approximation initializations often trap the model in poor minima. They therefore employ an Iterative Residual Sign‑Value‑Independent Decomposition (SVID), a Gauss‑Seidel‑style procedure that iteratively refines both binary cores and scaling vectors for each path, ensuring that the initial representation already respects the residual structure. This initialization focuses on preserving the model’s functional behavior rather than merely matching weight values.
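One plausible reading of this Gauss‑Seidel‑style procedure, sketched below under the assumption that each SVID step fits a path as $\mathrm{sign}(T) \odot (g h^{\top})$ with $g, h$ taken from the leading singular vectors of $|T|$ (the paper's exact variant may differ):

```python
import numpy as np

def svid(T):
    """Sign-Value-Independent Decomposition of one path:
    T ~= sign(T) * outer(g, h), with g, h from the leading singular
    vectors of |T| (an assumed rank-1 choice for illustration)."""
    B = np.sign(T)
    B[B == 0] = 1.0                          # break sign ties toward +1
    U, S, Vt = np.linalg.svd(np.abs(T), full_matrices=False)
    g = U[:, 0] * np.sqrt(S[0])
    h = Vt[0, :] * np.sqrt(S[0])
    return B, g, h

def iterative_residual_svid(W, n_paths=2, n_sweeps=5):
    """Initialize paths sequentially on residuals, then refine each path
    Gauss-Seidel style against the residual left by all the others."""
    approx = lambda B, g, h: B * np.outer(g, h)
    paths, R = [], W.copy()
    for _ in range(n_paths):                 # sequential residual init
        B, g, h = svid(R)
        paths.append((B, g, h))
        R = R - approx(B, g, h)
    for _ in range(n_sweeps):                # iterative refinement sweeps
        for p in range(n_paths):
            others = sum(approx(*paths[q]) for q in range(n_paths) if q != p)
            paths[p] = svid(W - others)      # re-fit path p to the residual
    return paths
```

Because each sweep re-fits one path while holding the others fixed, the total approximation error is non-increasing over sweeps, which is the sense in which the refinement "respects the residual structure" from the start.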
Experiments are conducted on Llama‑2‑7B and 13B models across a suite of benchmarks (MMLU, GSM‑8K, BBH, etc.). RaBiT's 2‑bit models match or exceed the performance of state‑of‑the‑art VQ methods, often improving by 0.3–0.5 percentage points in accuracy and up to 1% on more challenging tasks. Inference speed is measured on an RTX 4090 GPU: the matmul‑free binary paths enable a 4.49× speed‑up compared to the full‑precision baseline. Because the binary cores are derived from $W_{FP}$ and then frozen, the full‑precision weight can be discarded after training, halving the optimizer‑state memory footprint—a critical advantage for fine‑tuning large models.
The paper also provides a detailed analysis of the MSE decomposition across early, middle, and late layers of Llama‑2‑7B, confirming that RaBiT consistently yields a strong negative correlation term, while standard QAT does not. This demonstrates that the residual hierarchy not only improves training stability but also leads to better generalization.
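One consistent way to write the decomposition referenced here, using the variance-of-a-sum identity (the symbols $\sigma_1, \sigma_2, \rho_{12}$ are our notation for the two paths' error standard deviations and Pearson correlation, not necessarily the paper's):

$$\mathrm{MSE} \;=\; \underbrace{\sigma_1^{2} + \sigma_2^{2}}_{\text{base error}} \;+\; \underbrace{2\,\rho_{12}\,\sigma_1\sigma_2}_{\text{interaction}}$$

At the reported $\rho_{12} \approx -0.5$, the interaction term contributes $-\sigma_1\sigma_2$, a loss‑reducing bonus; at $\rho_{12} \approx 0$, as in standard QAT, it vanishes.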
Limitations are acknowledged: the current design is specific to 2‑bit quantization; extending the sequential residual scheme to 3‑ or 4‑bit settings would increase architectural complexity. Moreover, while the method reduces memory and compute on GPUs, hardware‑specific implementations (ASICs, FPGAs) need further validation.
In summary, RaBiT introduces a principled way to enforce residual coupling in binary stacked architectures, eliminating inter‑path co‑adaptation without resorting to heuristic constraints like path freezing. By coupling a shared full‑precision anchor, dynamic residual‑based binary derivation, and a function‑preserving initialization, RaBiT pushes the 2‑bit accuracy‑efficiency frontier forward, delivering near‑VQ accuracy with substantially lower inference cost and memory usage—an important step toward practical deployment of ultra‑compact LLMs.