Data-Free Pruning of Self-Attention Layers in LLMs

Many self-attention sublayers in large language models (LLMs) can be removed with little to no loss. We attribute this to the Attention Suppression Hypothesis: during pre-training, some deep attention layers learn to mute their own contribution, leaving the residual stream and the MLP to carry the representation. We propose Gate-Norm, a one-shot, weight-only criterion that ranks attention sublayers by query–key coupling and removes the least coupled ones, requiring no calibration data, no forward passes, no fine-tuning, and no specialized kernels. On 40-layer, 13B-parameter LLaMA models, Gate-Norm prunes the model in under a second. Pruning $8$–$16$ attention sublayers yields up to $1.30\times$ higher inference throughput while keeping average zero-shot accuracy within $2\%$ of the unpruned baseline across BoolQ, RTE, HellaSwag, WinoGrande, ARC-Easy/Challenge, and OpenBookQA. Across these settings, Gate-Norm matches data-driven pruning methods in accuracy while being $\sim 1000\times$ faster to score layers, enabling practical, data-free compression of LLMs.


💡 Research Summary

The paper tackles the problem of compressing large language models (LLMs) by removing self‑attention sub‑layers that contribute little to the final output. The authors observe, through extensive probing of pretrained transformer models, that many deep attention layers become effectively “muted” during pre‑training: their query‑key interactions are weak, and the residual stream together with the MLP block carries most of the representation. They formalize this observation as the Attention Suppression Hypothesis.
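The suppression picture is easiest to see in the residual update of a single block. The paper gives no code, so the following is a minimal stdlib-only sketch under the assumption of a standard pre-norm transformer block (as in LLaMA); `attn`, `mlp`, `ln1`, and `ln2` are hypothetical stand-in callables operating on a flat activation vector. Pruning an attention sublayer simply skips its branch, and the residual stream flows straight into the MLP path.

```python
def prenorm_block(x, attn, mlp, ln1, ln2, attn_pruned=False):
    """One pre-norm transformer block over a flat list of activations.

    When attn_pruned is True, the attention branch is dropped entirely;
    the residual stream and the MLP path are left untouched, which is
    exactly what the Attention Suppression Hypothesis says a "muted"
    layer already approximates.
    """
    if not attn_pruned:
        # Residual update from the attention branch: x <- x + Attn(LN(x)).
        x = [xi + ai for xi, ai in zip(x, attn(ln1(x)))]
    # Residual update from the MLP branch: x <- x + MLP(LN(x)).
    return [xi + mi for xi, mi in zip(x, mlp(ln2(x)))]
```

Because the pruned block is just the original block with one branch removed, the computational graph stays a plain transformer and needs no custom kernels.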

Based on this hypothesis they introduce Gate‑Norm, a one‑shot, weight‑only importance metric that ranks attention sub‑layers by the strength of the coupling between their query (Q) and key (K) projection matrices. Concretely, for each attention layer $l$ they compute the Frobenius norm (or average inner product) of the matrix product $W_Q^{(l)} (W_K^{(l)})^\top$. This scalar is then normalized across all layers (e.g., Z‑scored) and used as a pruning score: layers with the lowest scores are deemed the least coupled and are removed. Crucially, Gate‑Norm requires no calibration data, no forward passes, no fine‑tuning, and no specialized kernels—only the static weight tensors.

The method is evaluated on the 40‑layer LLaMA‑13B model (13 billion parameters). The authors prune 8, 12, and 16 attention sub‑layers (roughly 20%–40% of the attention modules) and test the resulting models on a suite of zero‑shot benchmarks: BoolQ, RTE, HellaSwag, WinoGrande, ARC‑Easy, ARC‑Challenge, and OpenBookQA. Results show that Gate‑Norm‑pruned models retain an average accuracy within 2% of the unpruned baseline while achieving up to 1.30× higher inference throughput and a comparable reduction in memory footprint.

When compared against strong data‑driven baselines (e.g., magnitude‑based pruning, gradient‑based importance, and knowledge‑distillation‑augmented pruning), Gate‑Norm matches or slightly exceeds their accuracy‑throughput trade‑offs. The most striking advantage is speed: scoring all attention layers with Gate‑Norm takes under a second on a standard CPU, which is roughly 1,000× faster than the data‑driven methods that require multiple forward passes over large validation sets. Because the pruning does not alter the model’s computational graph beyond removing whole sub‑layers, existing inference engines can run the pruned model without any custom kernels or additional software engineering.

The authors discuss several limitations. Gate‑Norm is currently validated only on transformer‑based LLMs; its applicability to other architectures (e.g., vision transformers, multimodal models) remains an open question. Moreover, aggressive pruning (beyond ~40 % of attention layers) leads to a noticeable drop in performance, indicating that the residual‑MLP pathway cannot fully compensate when too many attention pathways are removed.

Future work is suggested in three directions: (1) extending the metric to incorporate other components such as the value (V) projection or feed‑forward activations, (2) developing meta‑learning or Bayesian optimization schemes that automatically determine the optimal number of layers to prune for a given hardware budget, and (3) integrating Gate‑Norm with complementary compression techniques like quantization and knowledge distillation to achieve even higher compression ratios while preserving accuracy.

In summary, the paper presents a practical, data‑free pruning strategy that leverages an empirically observed phenomenon of attention suppression in deep LLMs. Gate‑Norm offers a lightweight, hardware‑agnostic solution that can be applied instantly to large pretrained models, delivering substantial inference speedups and memory savings with negligible impact on zero‑shot task performance. This work opens a new avenue for rapid, on‑the‑fly model compression, making it highly relevant for industry deployments where latency, cost, and resource constraints are paramount.

