MatGPTQ: Accurate and Efficient Post-Training Matryoshka Quantization

Notice: This research summary and analysis were automatically generated using AI. For authoritative details, please refer to the original arXiv source.

Matryoshka Quantization (MatQuant) is a recent quantization approach showing that a single integer-quantized model can be served across multiple precisions by slicing the most significant bits (MSBs) at inference time. This enables a single checkpoint to cover a wide range of memory and latency budgets, but makes quantization considerably more challenging. In particular, the original MatQuant relies on expensive quantization-aware training (QAT) variants, rather than fast one-shot post-training quantization (PTQ), and lacks open-source and kernel support. We address all of these limitations by introducing Post-Training Matryoshka Quantization (MatGPTQ), a new PTQ pipeline that produces a single parent model jointly optimized for multiple target precisions in one shot, based on a small calibration set. MatGPTQ casts Matryoshka quantization as a multi-precision objective with bit-slicing and cross-bit error compensation, yielding an algorithm that produces a multi-bit-width, “sliceable” model in a single pass. We also incorporate a new budget-aware search for heterogeneous per-layer bit-widths and provide efficient kernels that implement slicing and mixed-precision execution. Across standard LLMs and benchmarks, MatGPTQ preserves high-bit accuracy while substantially improving performance at low bit-widths. Overall, we establish a new state of the art for Matryoshka-style post-training quantization and make single-checkpoint, multi-precision deployment open and practical. Code is available at https://github.com/IST-DASLab/MatGPTQ.


💡 Research Summary

MatGPTQ introduces a post‑training quantization (PTQ) framework that brings the multi‑precision “Matryoshka” concept into a practical, one‑shot setting. The authors start from the observation that the original Matryoshka Quantization (MatQuant) required expensive quantization‑aware training (QAT) and was limited to a fixed set of bit‑widths (typically 8‑4‑2). To overcome these drawbacks, MatGPTQ jointly optimizes a single parent model for an arbitrary set of target precisions R (e.g., {2, 3, 4, 6, 8} bits) using a novel multi‑precision loss. The loss aggregates reconstruction errors for each target bit‑width r ∈ R, weighted by user‑defined importance factors λ_r, and incorporates the MSB‑slicing operation S(q_c, r), which extracts the r most significant bits from a c‑bit quantized weight.
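The slicing operator and the per-weight multi-precision error can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function names, the uniform min–max dequantization grid, and the per-bit-width rescaling are all assumptions.

```python
def slice_msb(q: int, c: int, r: int) -> int:
    """S(q, r): keep the r most significant bits of a c-bit unsigned code."""
    assert 0 <= q < 2 ** c and 1 <= r <= c
    return q >> (c - r)  # drop the (c - r) least significant bits

def dequant(q_r: int, r: int, w_min: float, w_max: float) -> float:
    """Map an r-bit code back to a real value on a uniform [w_min, w_max] grid."""
    scale = (w_max - w_min) / (2 ** r - 1)
    return w_min + q_r * scale

def multi_precision_error(w: float, q: int, c: int, R: list, lam: dict,
                          w_min: float, w_max: float) -> float:
    """Sum of lambda_r-weighted squared errors of all sliced precisions r in R."""
    return sum(lam[r] * (dequant(slice_msb(q, c, r), r, w_min, w_max) - w) ** 2
               for r in R)
```

For example, an 8-bit code sliced to 4 bits keeps only its top nibble, and a code that reconstructs a weight well at 8 bits may still incur a penalty from its 4-bit slice, which is exactly what the weighted objective trades off.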

The core algorithm adapts the well‑known GPTQ method. For each layer, all possible quantized integer values (0…2^c−1) are enumerated. For every candidate, the algorithm de‑quantizes the value for each target bit‑width, applies the slicing function, computes the weighted error across all r ∈ R, and selects the integer that minimizes the total error. This “cross‑bit error sharing” replaces the standard nearest‑grid rounding of GPTQ. Error propagation is also extended: instead of propagating a single error term, MatGPTQ averages the errors generated by all target bit‑widths and distributes this average to the remaining unquantized weights, ensuring that the final model performs well across the whole precision spectrum.
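The candidate-enumeration step described above can be sketched for a single weight as follows; assuming a uniform grid, with all names illustrative. In GPTQ proper this would replace the nearest-grid rounding inside the column-wise loop, and the propagated residual would be averaged over r ∈ R as described.

```python
def round_cross_bit(w: float, c: int, R: list, lam: dict,
                    w_min: float, w_max: float) -> int:
    """Pick the c-bit integer whose slices minimize the weighted error over R."""
    def dequant(q_r, r):
        # Uniform grid: rescale an r-bit code into [w_min, w_max].
        return w_min + q_r * (w_max - w_min) / (2 ** r - 1)

    best_q, best_err = 0, float("inf")
    for q in range(2 ** c):                     # enumerate all c-bit candidates
        err = sum(lam[r] * (dequant(q >> (c - r), r) - w) ** 2 for r in R)
        if err < best_err:
            best_q, best_err = q, err
    return best_q
```

Note that with R = {c} and a single unit weight this degenerates to ordinary nearest-grid rounding; additional lower-precision targets pull the choice toward codes whose high bits are also accurate.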

A second major contribution is a budget‑aware search for heterogeneous per‑layer bit‑width assignments. Using the EvoPress evolutionary algorithm, MatGPTQ starts from the globally quantized parent model and iteratively generates offspring with different per‑layer bit‑width configurations. Offspring are evaluated on a calibration set, and the best configuration (balancing accuracy and memory footprint) is kept for the next generation. This enables “mix‑and‑match” quantization where some layers can run at higher precision while others are aggressively compressed, often achieving Pareto‑superior trade‑offs (e.g., average 2.5‑bit models that match or exceed uniform 3‑bit accuracy).
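The search loop above can be sketched as a toy mutate-and-select procedure. This is a simplified EvoPress-style skeleton under stated assumptions: the `fitness` callable (calibration accuracy) and the average-bits budget check are placeholders, not the paper's actual implementation.

```python
import random

def evo_search(n_layers: int, choices: list, budget_bits: float,
               fitness, generations: int = 50, pop: int = 16) -> list:
    """Evolve per-layer bit-width assignments under an average-bit budget."""
    def feasible(cfg):
        return sum(cfg) / len(cfg) <= budget_bits

    # Start from the uniform parent configuration closest to the budget.
    parent = [min(choices, key=lambda b: abs(b - budget_bits))] * n_layers
    for _ in range(generations):
        offspring = []
        for _ in range(pop):
            child = parent[:]
            i = random.randrange(n_layers)      # mutate one layer's bit-width
            child[i] = random.choice(choices)
            if feasible(child):                 # discard over-budget children
                offspring.append(child)
        # Keep the best feasible configuration (higher fitness = better).
        parent = max(offspring + [parent], key=fitness)
    return parent
```

The key point is that feasibility is checked on the aggregate memory footprint, so the search can trade a high-precision layer in one place for an aggressively compressed one elsewhere.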

The authors evaluate MatGPTQ on several state‑of‑the‑art large language models, including LLaMA 3.1 8B, Qwen‑3 (8B/14B), and Phi‑3‑Medium. Using a modest calibration set of 128–256 samples, MatGPTQ matches or outperforms standard GPTQ baselines in the 4‑8‑bit regime (within 0.7 % accuracy loss) and delivers a 1.34 % average accuracy gain at 3 bits. Importantly, the model retains high accuracy when sliced at intermediate bit‑widths (e.g., 6 bits) that were not explicitly optimized, demonstrating strong interpolation capabilities.

To make the approach deployable, the authors provide custom CUDA kernels that pack/unpack integer weights, perform fast MSB slicing, and support mixed‑precision execution. Benchmarks show that 3‑bit inference of LLaMA 3.1 8B runs nearly three times faster than FP16 on modern NVIDIA GPUs, with proportional memory savings. All code, kernels, and a vLLM integration are released publicly (https://github.com/IST-DASLab/MatGPTQ), allowing practitioners to serve a single checkpoint across a wide range of hardware budgets.
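The pack/unpack-with-slicing operation can be modeled on the host side as follows. This is a NumPy illustration of the idea only, not the released CUDA kernels; the LSB-first word layout and the divisibility assumption are simplifications.

```python
import numpy as np

def pack(codes: np.ndarray, c: int) -> np.ndarray:
    """Pack c-bit codes into uint32 words (len(codes) divisible by 32 // c)."""
    per_word = 32 // c
    words = codes.astype(np.uint32).reshape(-1, per_word)
    shifts = np.arange(per_word, dtype=np.uint32) * c  # LSB-first positions
    return (words << shifts).sum(axis=1, dtype=np.uint64).astype(np.uint32)

def unpack_sliced(packed: np.ndarray, c: int, r: int) -> np.ndarray:
    """Unpack c-bit codes and keep only the r most significant bits of each."""
    per_word = 32 // c
    shifts = np.arange(per_word, dtype=np.uint32) * c
    codes = (packed[:, None] >> shifts) & np.uint32(2 ** c - 1)
    return (codes >> np.uint32(c - r)).ravel()   # MSB slicing on unpack
```

Because slicing is just a shift applied during unpacking, a kernel can materialize any target precision r ≤ c from the same packed buffer without a separate checkpoint per bit-width.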

In summary, MatGPTQ delivers: (1) a one‑shot PTQ pipeline for Matryoshka‑style quantization, (2) a multi‑precision objective with cross‑bit error compensation, (3) an evolutionary search for heterogeneous per‑layer bit‑widths under memory constraints, and (4) efficient GPU kernels for real‑world deployment. This work substantially lowers the barrier to multi‑precision LLM deployment, offering both theoretical novelty and practical tooling.

