CoopQ: Cooperative Game Inspired Layerwise Mixed Precision Quantization for LLMs
Large Language Models (LLMs) promise impressive capabilities, yet their multi-billion-parameter scale makes on-device or low-resource deployment prohibitive. Mixed-precision quantization offers a compelling solution, but existing methods struggle when the average precision drops below four bits, as they rely on isolated, layer-specific metrics that overlook the critical inter-layer interactions affecting overall performance. To address these limitations, we first frame the mixed-precision quantization problem as a cooperative game among layers and introduce Shapley-based Progressive Quantization Estimation (SPQE) to efficiently obtain accurate Shapley estimates of layer sensitivities and inter-layer interactions. Leveraging the SPQE estimates, we propose Cooperative Game Inspired Mixed-Precision Quantization (CoopQ), which translates these Shapley estimates into a binary quadratic optimization formulation, assigning either 2-bit or 4-bit precision to each layer under strict memory constraints. Comprehensive experiments on Llama-3, Gemma-2, and Qwen-3 models across three independent PTQ backends (Quanto, HQQ, GPTQ) demonstrate CoopQ's scalability and consistently superior performance compared to methods relying solely on isolated metrics. Across average precisions spanning 4 bits down to 2 bits, CoopQ cuts perplexity by 20-80% relative to the best baseline, with the margin growing as the bit-width tightens.
💡 Research Summary
CoopQ tackles the pressing challenge of deploying large language models (LLMs) on resource‑constrained devices by introducing a game‑theoretic framework for mixed‑precision post‑training quantization (PTQ). The authors first reinterpret each transformer layer as a “player” in a cooperative game, where the payoff is the increase in per‑token negative log‑likelihood (NLL) caused by quantization. To estimate each layer’s contribution while capturing inter‑layer interactions, they propose Shapley‑based Progressive Quantization Estimation (SPQE). SPQE starts from a uniform moderate precision (typically 4‑bit) and, for each Monte‑Carlo sampled permutation of layers, progressively lowers one layer at a time to a lower precision (2‑bit). The immediate NLL change at each step is recorded, and averaging these marginal contributions across many permutations yields low‑variance Shapley values for all layers. This progressive approach avoids the catastrophic performance drops and high variance associated with pruning‑based Shapley estimates, while naturally incorporating interaction effects.
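The SPQE procedure described above can be sketched in a few lines. This is a minimal illustrative implementation, not the authors' code: `nll_at` is a hypothetical callback standing in for a real quantize-and-evaluate step (it receives the set of layers currently at 2-bit, with all others at the 4-bit baseline, and returns per-token NLL on calibration data).

```python
import random
from collections import defaultdict

def spqe_shapley(layers, nll_at, num_permutations=200, seed=0):
    """Estimate layerwise Shapley values via progressive quantization.

    nll_at(low_set) -> per-token NLL when the layers in `low_set` are at
    2-bit precision and all remaining layers stay at the 4-bit baseline.
    (Hypothetical callback; in practice this quantizes and evaluates the model.)
    """
    rng = random.Random(seed)
    totals = defaultdict(float)
    for _ in range(num_permutations):
        order = layers[:]
        rng.shuffle(order)                      # one Monte-Carlo permutation
        low = set()
        prev = nll_at(frozenset(low))           # uniform 4-bit starting point
        for layer in order:                     # progressively lower one layer
            low.add(layer)
            cur = nll_at(frozenset(low))
            totals[layer] += cur - prev         # immediate marginal NLL change
            prev = cur
    # averaging marginal contributions over permutations -> Shapley estimates
    return {l: totals[l] / num_permutations for l in layers}
```

For a purely additive game (no inter-layer interactions), each layer's estimate reduces to its individual NLL cost regardless of permutation order; the interaction effects the paper exploits show up as deviations from that additive baseline across permutations.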
Using the Shapley values and their empirical covariance, the authors construct a stabilized interaction matrix K via diagonal shrinkage (controlled by a hyper‑parameter α). First‑order sensitivities a_i are derived by subtracting interaction contributions from the raw Shapley values. The mixed‑precision allocation problem is then formulated as a binary quadratic program: a binary variable q_i indicates whether layer i remains at low precision (q_i=1) or is promoted to high precision (q_i=0). The objective ΔL(q)=aᵀq+qᵀKq approximates the total loss increase caused by the chosen bit‑width configuration, and a linear constraint Σ_i c_i(1−q_i) ≤ B enforces a global memory budget B (c_i denotes the byte savings when a layer is quantized to low precision). By linearizing the quadratic term with auxiliary binary variables y_ij (standard MILP linearization), the problem becomes a Mixed‑Integer Linear Program solvable to global optimality with off‑the‑shelf solvers such as SCIP.
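To make the allocation problem concrete, here is a small brute-force sketch of the objective and budget constraint above. It enumerates all binary configurations rather than solving the MILP linearization with y_ij variables in SCIP as the paper does, so it is only viable for toy instances; the inputs `a`, `K`, `c`, and `budget` correspond to the sensitivities a_i, interaction matrix K, byte savings c_i, and budget B.

```python
from itertools import product

def solve_bqp(a, K, c, budget):
    """Exhaustively minimize ΔL(q) = aᵀq + qᵀKq over q_i ∈ {0, 1},
    subject to sum_i c_i * (1 - q_i) <= budget.

    q_i = 1 keeps layer i at low (2-bit) precision; q_i = 0 promotes it
    to high (4-bit) precision, spending c_i bytes of the budget.
    Illustrative only: the paper linearizes qᵀKq with auxiliary binary
    variables y_ij and solves the resulting MILP exactly.
    """
    n = len(a)
    best_q, best_val = None, float("inf")
    for q in product((0, 1), repeat=n):
        # memory spent promoting layers to high precision must fit the budget
        if sum(c[i] * (1 - q[i]) for i in range(n)) > budget:
            continue
        val = sum(a[i] * q[i] for i in range(n))                # linear term aᵀq
        val += sum(K[i][j] * q[i] * q[j]                        # quadratic term qᵀKq
                   for i in range(n) for j in range(n))
        if val < best_val:
            best_q, best_val = q, val
    return best_q, best_val
```

With K stabilized by diagonal shrinkage as described, the quadratic term rewards (or penalizes) promoting layers whose quantization errors interact, which is exactly the information isolated layer-wise metrics discard.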
The authors evaluate CoopQ on three families of LLMs—Llama‑3 (3.2B and 8B), Gemma‑2 (2B and 9B), and Qwen‑3 (4B and 8B)—using three independent PTQ backends: Quanto, HQQ, and GPTQ. For Shapley estimation they employ Quanto's fast in‑place quantization, while the same calibration settings are used across all baselines for a fair comparison. Experiments span average bit‑widths from 4‑bit down to 2‑bit. Across all model‑backend combinations, CoopQ consistently outperforms prior mixed‑precision methods that rely on isolated layer‑wise metrics (e.g., LLM‑MQ, LLM‑MQ‑Sensitivity, HAWQ). Perplexity reductions range from 20% to 80% relative to the best baseline, with the gap widening as the target average precision tightens, especially in the 2‑bit regime. Ablation studies show that sampling 200–500 permutations yields stable Shapley estimates, and a diagonal shrinkage α≈0.5 balances noise suppression with preservation of genuine interaction information.
Key contributions of the paper are: (1) framing mixed‑precision quantization as a cooperative game, enabling principled measurement of both individual layer sensitivities and cross‑layer interactions; (2) introducing SPQE, a progressive quantization scheme that provides accurate, low‑variance Shapley estimates without costly full‑model retraining; (3) translating these estimates into a binary quadratic objective and solving it exactly via MILP, guaranteeing globally optimal bit‑width assignments under strict memory constraints; (4) demonstrating scalability and practical effectiveness across multiple LLM families and PTQ toolchains. Limitations include the computational cost of Monte‑Carlo permutation sampling for extremely large models and the current restriction to a binary 2‑bit/4‑bit decision space. Future work could explore more efficient sampling strategies, finer granularity of bit‑width choices, and tighter integration with hardware‑aware cost models.
Overall, CoopQ presents a compelling, theoretically grounded, and empirically validated approach that advances the state of the art in mixed‑precision quantization for LLMs, making high‑performing language models more accessible on low‑resource platforms.