When Reasoning Meets Compression: Understanding the Effects of LLMs Compression on Large Reasoning Models

Notice: This research summary and analysis were automatically generated using AI technology. For complete accuracy, please refer to the original ArXiv source.

Compression methods, including quantization, distillation, and pruning, improve the computational efficiency of large reasoning models (LRMs). However, existing studies either fail to sufficiently compare all three compression methods on LRMs or lack in-depth interpretation analysis. In this paper, we investigate how the reasoning capabilities of LRMs are compromised during compression, through performance benchmarking and mechanistic interpretation. To uncover the effects of compression on reasoning performance, we benchmark quantized, distilled, and pruned DeepSeek-R1 models on four reasoning datasets (AIME 2024, FOLIO, Temporal Sequences, and MuSiQue). To precisely locate compression effects on model weights, we adapt difference of means and attribution patching techniques, focusing on the activation of every linear component in compressed LRMs, to interpret fine-grained causal relationships between weights and various reasoning capabilities. This fine-grained interpretation addresses a fundamental question of compression: which weights are the most important for reasoning? Overall, we find dynamically quantized 2.51-bit R1 reaches close-to-R1 performance. With empirical verification, we present three main findings that generalize across both R1 and non-R1 LRMs: (1) Weight count has a greater impact on LRMs’ knowledge memorization than reasoning, highlighting the risks of pruning and distillation; (2) The MLP up projection in the final layer of distilled LRMs is one of the most important components, offering a new perspective on locating critical weights - a fundamental problem in model compression; and (3) Current quantization methods overly compress the final-layer modules and MLP gate projections, so protecting just 2% of all weights that are excessively compressed can raise average accuracy by 6.57%, greatly surpassing the state-of-the-art.


💡 Research Summary

This paper investigates how three major compression techniques—quantization, knowledge distillation, and pruning—affect the reasoning capabilities of large reasoning models (LRMs), using DeepSeek‑R1 as the primary testbed. The authors benchmark compressed variants on four reasoning datasets of varying difficulty: AIME 2024 (mathematical reasoning), FOLIO (logical reasoning), Temporal Sequences (temporal reasoning), and MuSiQue (multi-hop reasoning with knowledge retrieval). For quantization they evaluate dynamic quantization from Unsloth (2.51‑bit, 1.73‑bit, 1.58‑bit) and several state‑of‑the‑art post‑training methods (AWQ, GPTQ, GPTAQ, ANY4/3) at 4‑bit and 3‑bit precision. Distillation experiments involve four distilled R1 models (Llama‑70B, Qwen‑32B, Llama‑8B, Qwen‑7B). Pruning is performed with SparseGPT and AlphaPruning at a default 50% sparsity level.
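The benchmarking protocol above can be sketched as a simple per-dataset accuracy aggregation. This is a minimal, hypothetical harness: the dataset names follow the paper, but the exact-match scoring and the averaging scheme are illustrative assumptions, and the model calls are replaced by pre-recorded toy predictions.

```python
# Hedged sketch: aggregate per-dataset accuracy for one compressed variant.
# In a real run, `outputs` would come from prompting the compressed model;
# here it is stubbed with toy answers.

DATASETS = ["AIME 2024", "FOLIO", "Temporal Sequences", "MuSiQue"]

def accuracy(predictions, references):
    """Exact-match accuracy between predicted and gold answers."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

def benchmark(variant_outputs, gold):
    """variant_outputs / gold: {dataset_name: [answers]} dictionaries."""
    per_task = {d: accuracy(variant_outputs[d], gold[d]) for d in DATASETS}
    per_task["average"] = sum(per_task[d] for d in DATASETS) / len(DATASETS)
    return per_task

# Toy example with two items per dataset:
gold = {d: ["a", "b"] for d in DATASETS}
outputs = {d: ["a", "b"] for d in DATASETS}
outputs["MuSiQue"] = ["a", "x"]  # one miss on the multi-hop task
scores = benchmark(outputs, gold)
```

Reporting both per-task and averaged accuracy mirrors how the paper contrasts variants: a method can match the baseline on average while collapsing on a single hard task such as AIME 2024.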

Performance results show that the 2.51‑bit dynamically quantized model achieves the highest average accuracy across all tasks, closely matching the original R1. All 4‑bit quantization methods retain performance comparable to the unquantized baseline, while 3‑bit methods begin to exhibit collapse, especially on the more demanding AIME 2024 and MuSiQue tasks. Pruning at 50% sparsity leads to a dramatic drop in accuracy, rendering the models largely unusable for complex reasoning. Distilled Qwen models generally outperform distilled Llama models, indicating that the student base model's architecture influences downstream reasoning robustness.

Beyond raw performance, the authors conduct a fine‑grained mechanistic interpretability analysis to locate which weight matrices are most critical for reasoning. They define four core reasoning behaviors—backtracking, uncertainty estimation, example testing, and adding knowledge—annotated on 120 instances using GPT‑4o. Using a Difference‑of‑Means approach they extract steering vectors for each linear module, then apply attribution patching to compute an importance score for each module with respect to each behavior. Scores are normalized to relative importance (RI) and compared between original and compressed models to quantify “importance shift.”
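The interpretability pipeline above can be illustrated on toy activations. This is a hedged sketch, not the authors' implementation: the tensor shapes, the random stand-in for loss gradients, and the treatment of each module as a single activation vector are all simplifying assumptions made for illustration.

```python
import numpy as np

# acts_pos / acts_neg stand in for a module's activations on instances that
# do / do not exhibit a reasoning behavior (e.g. backtracking).
rng = np.random.default_rng(0)
n_modules, d = 4, 8
acts_pos = rng.normal(1.0, 0.1, size=(n_modules, 30, d))  # behavior present
acts_neg = rng.normal(0.0, 0.1, size=(n_modules, 30, d))  # behavior absent

# Difference-of-Means steering vector for each linear module.
steering = acts_pos.mean(axis=1) - acts_neg.mean(axis=1)  # (n_modules, d)

# Attribution patching approximates a module's causal effect as the dot
# product of its steering direction with the loss gradient at that module;
# a random gradient is substituted here purely for illustration.
grads = rng.normal(size=(n_modules, d))
importance = np.abs(np.einsum("md,md->m", steering, grads))

# Normalize to relative importance (RI) so scores sum to 1, making them
# comparable between the original and compressed models.
ri = importance / importance.sum()
```

Comparing the `ri` vectors of an original model and its compressed counterpart then yields the "importance shift" the paper uses to localize compression damage.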

Key interpretability findings are: (1) The sheer number of weights impacts knowledge memorization more than reasoning ability, suggesting that pruning and distillation can be hazardous for tasks requiring extensive parametric knowledge. (2) In distilled models, the final‑layer MLP up‑projection matrix consistently receives the highest importance across all four reasoning behaviors. Quantizing this matrix down to 3‑bit causes a 16.3% drop in average accuracy, confirming its critical role. (3) Current quantization pipelines over‑compress the final‑layer modules and the MLP gate projection. Protecting merely 2% of the total weights—specifically those that are excessively compressed—raises average accuracy by 6.57%, with gains up to 23.17% over state‑of‑the‑art quantized models. This effect also holds for pruning, indicating that the final layer is a universal bottleneck.
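The "protect a small fraction of weights" idea can be sketched in a few lines. This is a toy illustration under stated assumptions: a simple symmetric uniform quantizer stands in for real 3-bit pipelines, and largest-magnitude entries stand in for the paper's attribution-based importance scores; the point is only the mechanism of keeping a protected subset in full precision.

```python
import numpy as np

def quantize(w, bits=3):
    """Symmetric uniform quantization to 2**(bits-1)-1 positive levels."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

def protect_and_quantize(w, bits=3, protect_frac=0.02):
    """Quantize w, then restore the top protect_frac of entries (by
    magnitude, as a stand-in importance score) to full precision."""
    q = quantize(w, bits)
    k = max(1, int(protect_frac * w.size))
    idx = np.argsort(np.abs(w).ravel())[-k:]  # indices of protected weights
    q_flat = q.ravel().copy()
    q_flat[idx] = w.ravel()[idx]              # keep these in full precision
    return q_flat.reshape(w.shape)

rng = np.random.default_rng(1)
w = rng.normal(size=(64, 64))
err_plain = np.abs(w - quantize(w)).mean()
err_prot = np.abs(w - protect_and_quantize(w)).mean()
# Protecting 2% of the weights lowers mean reconstruction error.
```

In the paper the protected set is chosen by attribution-based importance rather than magnitude, and the payoff is measured in task accuracy rather than weight reconstruction error, but the mixed-precision mechanism is the same.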

The paper releases all code and data (https://github.com/psunlpgroup/Compression-Effects) to ensure reproducibility. Additional analyses—including test‑time compute, collapse points for higher compression ratios, and extensions to non‑R1 model families—are provided in the appendices.

Overall, the study delivers three major contributions: (i) a comprehensive benchmark of quantization, distillation, and pruning on diverse reasoning tasks, highlighting that modest compression ratios (e.g., 2.51‑bit) can preserve performance while higher ratios risk collapse; (ii) a novel, fine‑grained interpretability framework that pinpoints the most reasoning‑critical weight matrices, especially the final‑layer MLP up‑projection; and (iii) actionable guidance for future compression research, emphasizing the need to protect final‑layer components during quantization or pruning. The findings suggest that next‑generation compression algorithms should incorporate dynamic protection of critical weights, potentially via meta‑learning or adaptive precision schemes, to maintain reasoning fidelity while still achieving substantial efficiency gains.

