You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations
Many applications of large language models (LLMs) require only a narrow capability, yet common post-training quantization (PTQ) pipelines assign precision largely without regard to the target task. As a result, they may spend bits on layers that are less relevant to the task. We propose per-task mixed-precision PTQ guided by hidden representations. Given a small set of unlabeled calibration prompts from the target task, we estimate layer importance and allocate higher precision to task-relevant layers and lower precision to the rest, under a bit-allocation budget. We introduce three task-aware allocation signals: **TAQ**, which scores layers using an information-stability criterion derived from activation geometry; **TAQO**, which ranks layers by direct sensitivity to single-layer quantization; and **TAQ-KL**, which measures output sensitivity via KL divergence under a noise proxy for quantization error. Together, these methods provide a simple post-training framework that connects mechanistic signals to quantization decisions, enabling task-aligned compression without additional training.
💡 Research Summary
The paper addresses a key limitation of existing post‑training quantization (PTQ) methods for large language models (LLMs): they allocate precision uniformly or according to generic heuristics, ignoring the fact that many downstream applications only require a narrow subset of the model’s capabilities. To remedy this, the authors introduce Task‑Aware Quantization (TAQ), a framework that uses a small set of unlabeled calibration prompts from the target task to estimate the importance of each transformer layer and then distributes a limited bit‑budget accordingly. Three complementary layer‑importance signals are proposed. The first, TAQ, combines an information‑entropy measure derived from the eigen‑spectrum of a layer’s activation covariance matrix with a stability metric based on activation variance; both are z‑normalized and linearly combined. The second, TAQO, performs an “oracle” sensitivity test by quantizing each layer individually to 4‑bit, measuring the immediate drop in task performance, and preserving the most sensitive layers in full precision. The third, TAQ‑KL, injects Gaussian noise as a proxy for quantization error, computes the KL‑divergence between the original and perturbed output distributions, and ranks layers by the magnitude of this divergence.
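The TAQ signal described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `alpha` mixing weight, and the exact form of the stability term (inverse activation variance) are assumptions; the summary only states that an entropy measure over the covariance eigen-spectrum and a variance-based stability metric are z-normalized and linearly combined.

```python
import numpy as np

def taq_layer_score(acts, eps=1e-8):
    """Hypothetical sketch of the two per-layer TAQ quantities.

    acts: (n_tokens, d_model) hidden activations collected at one layer
    from the calibration prompts. Returns (entropy, stability), where
    entropy is computed over the eigen-spectrum of the activation
    covariance and stability is a (assumed) inverse-variance proxy.
    """
    centered = acts - acts.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(len(acts) - 1, 1)
    eigvals = np.clip(np.linalg.eigvalsh(cov), eps, None)
    p = eigvals / eigvals.sum()               # normalize spectrum to a distribution
    entropy = -(p * np.log(p)).sum()          # information spread across directions
    stability = 1.0 / (centered.var() + eps)  # assumed: lower variance = more stable
    return entropy, stability

def combine_scores(entropies, stabilities, alpha=0.5):
    """Z-normalize each signal across layers, then linearly combine."""
    def z(x):
        x = np.asarray(x, dtype=float)
        return (x - x.mean()) / (x.std() + 1e-8)
    return alpha * z(entropies) + (1.0 - alpha) * z(stabilities)
```

In practice one would run the calibration prompts through the model once, cache per-layer activations, and feed each layer's cache to `taq_layer_score` before combining.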
Given a global budget (e.g., total memory or compute cost), the framework sorts layers by their importance scores and assigns higher precision (typically 8‑bit) to the top K % while quantizing the remainder to 4‑bit; embedding and output layers are always kept in FP16. Quantization itself follows a group‑wise affine scheme, ensuring compatibility with existing hardware accelerators.
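The allocation and quantization steps above can be sketched as below. This is a simplified illustration under stated assumptions: group size, the asymmetric (min/max) affine parameterization, and divisibility of the weight vector by the group size are choices made here for brevity, not details confirmed by the paper.

```python
import numpy as np

def allocate_bits(scores, top_frac=0.25, hi=8, lo=4):
    """Assign hi bits to the top-K% of layers by importance, lo to the rest.

    Embedding and output layers are excluded from `scores` here, mirroring
    the summary's note that they stay in FP16.
    """
    k = max(1, int(round(top_frac * len(scores))))
    order = np.argsort(scores)[::-1]          # descending importance
    bits = np.full(len(scores), lo)
    bits[order[:k]] = hi
    return bits

def groupwise_affine_quant(w, bits, group=128):
    """Group-wise affine quantization of a flat weight vector (sketch).

    Each group gets its own scale and zero point; assumes len(w) is a
    multiple of `group` for simplicity. Returns the dequantized weights
    so the rounding error is easy to inspect.
    """
    qmax = 2 ** bits - 1
    w = w.reshape(-1, group)
    wmin = w.min(axis=1, keepdims=True)
    scale = (w.max(axis=1, keepdims=True) - wmin) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard constant groups
    q = np.clip(np.round((w - wmin) / scale), 0, qmax)
    return (q * scale + wmin).reshape(-1)
```

A layer whose importance score lands in the top K% would then be quantized with `bits=8`, the rest with `bits=4`, keeping the total bit budget fixed.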
Empirical evaluation is conducted on open‑weight models such as Gemma‑2‑9B and Qwen2.5‑7B across three task families: code completion, mathematical reasoning, and trivia/question‑answering. Compared with state‑of‑the‑art PTQ baselines like GPTQ and Activation‑Aware Weight Quantization (AWQ), TAQ, TAQO, and TAQ‑KL consistently achieve equal or higher accuracy while reducing memory usage by 30‑40 %. In several cases the task‑aware quantized models even surpass the original FP16 baseline, demonstrating that preserving precision in task‑critical layers can improve performance. The study also shows that the three importance signals are largely consistent, with TAQ and TAQ‑KL providing reliable rankings even when only a few calibration examples are available.
Overall, the work bridges mechanistic interpretability and model compression, proposing a practical, training‑free method to tailor LLM quantization to specific downstream tasks. By leveraging hidden‑state statistics, it offers a principled way to allocate bits where they matter most, enabling more efficient deployment of LLMs in resource‑constrained environments without sacrificing task‑specific quality.