Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2

Structured width pruning of GLU-MLP layers, guided by the Maximum Absolute Weight (MAW) criterion, reveals a systematic dichotomy in how reducing the expansion ratio affects different model capabilities. While performance on tasks relying on parametric knowledge (e.g., MMLU, GSM8K) and perplexity metrics degrades predictably, instruction-following improves substantially (+46% and +75% on IFEval for the Llama-3.2-1B and 3B models, respectively), and multi-step reasoning (MUSR) remains robust. This pattern challenges the prevailing assumption that pruning induces uniform degradation. We evaluated seven expansion-ratio configurations using comprehensive benchmarks assessing factual knowledge, mathematical reasoning, language comprehension, instruction-following, and truthfulness. Our analysis identifies the expansion ratio as a critical architectural parameter that selectively modulates cognitive capabilities, rather than merely serving as a compression metric, and provides the first systematic characterization of this selective preservation phenomenon. Notably, we document a robust inverse correlation (r = -0.864, p = 0.012 for Llama-3.2-3B) between factual knowledge capacity (MMLU) and truthfulness metrics (TruthfulQA-MC2): as knowledge degrades, the model’s ability to discriminate misconceptions improves consistently. This connects two previously distinct research areas, demonstrating that MAW-guided width pruning acts as a selective filter, reducing parametric knowledge while preserving or enhancing behavioral alignment. Additionally, we quantify context-dependent efficiency trade-offs: pruned configurations achieve up to a 23% reduction in energy consumption (J/token) but incur penalties in single-request latency, whereas batch-processing workloads benefit uniformly.


💡 Research Summary

This paper presents the first systematic study of structured width pruning applied to the GLU-MLP layers of Llama-3.2 (1B and 3B variants) and demonstrates that reducing the expansion ratio does not uniformly degrade model performance. Using a Maximum Absolute Weight (MAW) criterion, the authors prune neurons with the smallest absolute weight values across seven expansion-ratio settings, thereby creating a family of compressed models with progressively fewer parameters.
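The paper does not reproduce its pruning code here, but the MAW criterion as described can be sketched as follows: score each intermediate neuron of a GLU MLP by the largest absolute weight among its incoming projections, then keep the highest-scoring fraction. This is a minimal illustration under the assumption that scores are taken over the gate and up projection rows; the function name and exact scoring details are hypothetical, not the authors' implementation.

```python
import numpy as np

def maw_prune_glu_mlp(w_gate, w_up, w_down, keep_ratio):
    """Width-prune a GLU MLP's intermediate neurons by Maximum Absolute Weight.

    w_gate, w_up : (d_inter, d_model) projections into the intermediate space.
    w_down       : (d_model, d_inter) projection back to the model dimension.
    keep_ratio   : fraction of intermediate neurons to retain (the pruned
                   expansion ratio relative to the original).
    """
    d_inter = w_gate.shape[0]
    # Score each intermediate neuron by the largest |weight| among its
    # incoming gate/up rows -- one plausible reading of the MAW criterion.
    scores = np.maximum(np.abs(w_gate).max(axis=1), np.abs(w_up).max(axis=1))
    k = max(1, int(round(keep_ratio * d_inter)))
    # Keep the k highest-scoring neurons, preserving their original order.
    keep = np.sort(np.argsort(scores)[-k:])
    return w_gate[keep], w_up[keep], w_down[:, keep], keep
```

Pruning the same index set from the gate, up, and down projections keeps the layer's input/output dimensions intact, so the rest of the network is unaffected.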

The evaluation covers five capability domains: factual knowledge (MMLU, GSM8K), mathematical reasoning (GSM8K, MUSR), language comprehension (LAMBADA, PIQA), instruction-following (IFEval), and truthfulness (TruthfulQA-MC2). Results show a clear dichotomy. On the knowledge-heavy benchmarks (MMLU, GSM8K), performance drops predictably as the expansion ratio shrinks, confirming the intuition that parametric knowledge is proportional to raw weight capacity. Conversely, instruction-following improves dramatically: the 1B model gains +46% and the 3B model +75% on IFEval relative to the unpruned baseline. Multi-step reasoning measured by MUSR remains essentially unchanged, indicating that reasoning chains rely more on the stability of the computational graph than on raw knowledge.

A particularly striking finding is the strong inverse correlation (r = -0.864, p = 0.012) between MMLU scores and TruthfulQA-MC2 scores in the 3B series. As factual knowledge erodes, the model’s ability to avoid false statements improves consistently. The authors interpret this as evidence that MAW-guided width pruning acts as a selective filter: it suppresses memorized facts while amplifying alignment-related signals that guide the model toward truthful, instruction-compliant behavior.
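The reported value is a standard Pearson correlation coefficient over the per-configuration score pairs. As a reference, it can be computed as below; this is a generic sketch, not the paper's analysis script, and the paper's exact score vectors are not reproduced here.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score series
    (e.g., MMLU vs. TruthfulQA-MC2 across pruned configurations)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()          # center both series
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))
```

A perfectly anti-correlated pair yields r = -1.0; the paper's r = -0.864 indicates a strong, though not perfect, inverse relationship across the seven configurations.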

Beyond capability metrics, the paper quantifies efficiency gains. Pruned configurations reduce energy consumption per token by up to 23 % and show uniform benefits in batch‑processing workloads. Single‑request latency suffers a modest increase due to altered memory access patterns, but the overall cost‑performance trade‑off remains favorable for most deployment scenarios.
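The J/token figure normalizes measured energy draw by the number of generated tokens. A minimal sketch of the metric follows; the function and its inputs are illustrative, not the paper's measurement harness.

```python
def joules_per_token(avg_power_watts, wall_time_s, tokens_generated):
    """Energy efficiency of a generation run.

    Energy (J) = average power (W) * wall-clock time (s),
    then normalized by the number of tokens produced.
    """
    return avg_power_watts * wall_time_s / tokens_generated
```

Under this metric, a pruned model can win on J/token even if single-request latency rises, as long as the power and throughput savings (especially in batched serving) outweigh the per-request slowdown.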

The authors argue that the expansion ratio should be viewed not merely as a compression knob but as a critical architectural hyper‑parameter that can be tuned to balance knowledge capacity against behavioral alignment. This reframes width pruning from a blunt tool for model size reduction into a nuanced lever for shaping model cognition. The work opens several avenues for future research: exploring alternative pruning criteria (e.g., learned importance scores), extending the analysis to attention layers, and investigating whether similar dichotomies emerge in larger models or multimodal architectures. In sum, the study challenges the prevailing assumption that pruning inevitably harms all downstream abilities and provides concrete evidence that selective pruning can simultaneously shrink models, cut energy use, and enhance alignment‑related performance.

