DenoiseRotator: Enhance Pruning Robustness for LLMs via Importance Concentration

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Pruning is a widely used technique to compress large language models (LLMs) by removing unimportant weights, but it often suffers from significant performance degradation - especially under semi-structured sparsity constraints. Existing pruning methods primarily focus on estimating the importance of individual weights, which limits their ability to preserve critical capabilities of the model. In this work, we propose a new perspective: rather than merely selecting which weights to prune, we first redistribute parameter importance to make the model inherently more amenable to pruning. By minimizing the information entropy of normalized importance scores, our approach concentrates importance onto a smaller subset of weights, thereby enhancing pruning robustness. We instantiate this idea through DenoiseRotator, which applies learnable orthogonal transformations to the model’s weight matrices. Our method can be seamlessly integrated with existing pruning techniques such as Magnitude, SparseGPT, and Wanda. Evaluated on LLaMA3, Qwen2.5, and Mistral models under 50% unstructured and 2:4 semi-structured sparsity, DenoiseRotator consistently improves perplexity and zero-shot accuracy. For instance, on LLaMA3-70B pruned with SparseGPT at 2:4 semi-structured sparsity, DenoiseRotator reduces the perplexity gap to the dense model by 58%, narrowing the degradation from 8.1 to 3.4 points. Codes are available at https://github.com/Axel-gu/DenoiseRotator.


💡 Research Summary

The paper tackles a persistent problem in compressing large language models (LLMs): pruning, especially under semi‑structured sparsity constraints such as 2:4, often leads to severe performance loss. Existing pruning methods focus exclusively on estimating an importance score for each weight (e.g., magnitude, output‑sensitivity, or second‑order Taylor approximations) and then discarding the lowest‑scoring parameters. This approach treats the pretrained model’s importance distribution as immutable, which limits robustness when the pruning pattern restricts which weights can be removed.

DenoiseRotator introduces a fundamentally different perspective: before any weights are removed, reshape the distribution of importance scores so that most of the “mass” is concentrated on a small subset of parameters. The authors formalize this by normalizing importance scores into a probability distribution and minimizing its information entropy. Entropy is maximal for a uniform distribution and minimal when probability mass is focused on few elements; thus, entropy minimization naturally drives importance concentration. A concentrated importance profile reduces the total importance of the weights that will be pruned, which in turn minimizes the output deviation predicted by Taylor‑series‑based pruning criteria.
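The entropy objective is easy to see on a toy example. The sketch below (illustrative, not the authors' code) normalizes importance scores into a probability distribution and computes its Shannon entropy, showing that a concentrated score profile has strictly lower entropy than a uniform one:

```python
import numpy as np

def importance_entropy(scores):
    """Shannon entropy of importance scores normalized to a distribution."""
    p = np.asarray(scores, dtype=float)
    p = p / p.sum()
    return -(p * np.log(p + 1e-12)).sum()

uniform = [1.0, 1.0, 1.0, 1.0]          # importance spread evenly
concentrated = [100.0, 1.0, 1.0, 1.0]   # mass focused on one weight

h_u = importance_entropy(uniform)        # equals log(4), the maximum for 4 weights
h_c = importance_entropy(concentrated)   # much smaller
assert h_c < h_u                         # concentration lowers entropy
```

Minimizing this entropy therefore pushes the importance mass onto few weights, so a fixed sparsity budget removes less total importance.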

To realize entropy‑driven concentration in practice, the authors propose DenoiseRotator, a lightweight module that learns orthogonal (rotation) matrices and inserts them into each Transformer layer. Two pairs of orthogonal matrices are used: (1) layer‑level rotations (R₁ and its transpose) placed around the RMSNorm and residual connection, and (2) attention‑level rotations (R₂ and its transpose) applied to the value and output projections. Because orthogonal transformations preserve vector norms and inner products, the overall function of the network remains unchanged (the “computational invariance” property). After rotation, a weight matrix W becomes W′ = R₁ᵀ W R₂ and the corresponding input X becomes X′ = R₂ᵀ X; the OBD importance score transforms accordingly to S′ᵢⱼ = |W′ᵢⱼ|² (X′X′ᵀ)ⱼⱼ. Hence, importance is redistributed indirectly through the rotations.
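Both properties can be checked numerically. The sketch below (a simplified stand‑in using random orthogonal matrices rather than learned ones) verifies that R₂ cancels inside the matrix product, so the layer's function is only rotated by R₁ᵀ, while the OBD‑style importance map S′ᵢⱼ = |W′ᵢⱼ|² (X′X′ᵀ)ⱼⱼ is genuinely redistributed:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(n):
    """QR-based random orthogonal matrix (stand-in for the learned rotations)."""
    Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return Q

n = 8
W = rng.normal(size=(n, n))    # a weight matrix
X = rng.normal(size=(n, 16))   # calibration activations
R1, R2 = random_orthogonal(n), random_orthogonal(n)

W2 = R1.T @ W @ R2             # rotated weights  W' = R1^T W R2
X2 = R2.T @ X                  # rotated inputs   X' = R2^T X

# R2 cancels inside the product, so the output is only rotated by R1^T,
# which the surrounding rotations undo (computational invariance).
assert np.allclose(W2 @ X2, R1.T @ (W @ X))

# OBD-style importance S_ij = |W_ij|^2 (X X^T)_jj before and after rotation.
S  = W**2  * np.diag(X  @ X.T)
S2 = W2**2 * np.diag(X2 @ X2.T)
assert not np.allclose(S, S2)  # same model function, new importance map
```

A random rotation only shows that the importance map moves; the learned rotations additionally steer it toward low entropy.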

Training proceeds with the original model parameters frozen; only the orthogonal matrices are optimized to minimize the sum of entropies computed over appropriate groups (rows, columns, or the whole matrix, depending on where the rotation is applied). The entropy loss is differentiable, allowing standard gradient‑based optimization. Once training converges, the orthogonal matrices (except the ones that must stay at the layer boundaries) are merged into the weight tensors, so the final model contains no extra parameters. Existing pruning pipelines—Magnitude, SparseGPT, Wanda—can then be applied unchanged. Because importance has already been concentrated, the same sparsity budget yields far less degradation.
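As an illustrative sketch of this training loop (not the authors' implementation), one can parametrize an orthogonal matrix via the Cayley transform of a skew‑symmetric matrix and descend the entropy of a simplified magnitude‑based importance map; finite differences stand in for the autograd an actual pipeline would use:

```python
import numpy as np

def cayley(theta, n):
    """Map free parameters to an orthogonal matrix via the Cayley transform."""
    A = np.zeros((n, n))
    A[np.triu_indices(n, k=1)] = theta
    A = A - A.T                              # skew-symmetric
    I = np.eye(n)
    return np.linalg.solve(I + A, I - A)     # (I + A)^{-1}(I - A) is orthogonal

def entropy_loss(theta, W, n):
    """Entropy of a simplified |RW|^2 importance map (stand-in criterion)."""
    s = (cayley(theta, n) @ W) ** 2
    p = s / s.sum()
    return -(p * np.log(p + 1e-12)).sum()

rng = np.random.default_rng(0)
n = 4
W = rng.normal(size=(n, n))                  # frozen model weights
theta = np.zeros(n * (n - 1) // 2)           # rotation starts at the identity

lr, eps = 0.05, 1e-5
losses = [entropy_loss(theta, W, n)]
for _ in range(200):                         # finite-difference gradient descent
    g = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta)
        d[i] = eps
        g[i] = (entropy_loss(theta + d, W, n)
                - entropy_loss(theta - d, W, n)) / (2 * eps)
    theta -= lr * g
    losses.append(entropy_loss(theta, W, n))

R = cayley(theta, n)
assert np.allclose(R @ R.T, np.eye(n), atol=1e-8)   # rotation stays orthogonal
assert min(losses) < losses[0]                       # entropy was reduced
```

The Cayley parametrization keeps the optimized matrix exactly orthogonal at every step, mirroring how the learned rotations must preserve the network's function throughout training.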

Empirical evaluation covers several open‑source LLM families: LLaMA‑3 (8B and 70B), Qwen2.5 (7B‑72B), and Mistral‑7B. Experiments span 50% unstructured sparsity and 2:4 semi‑structured sparsity. Across all settings DenoiseRotator consistently improves perplexity and zero‑shot benchmark accuracy. Notably, on LLaMA‑3‑70B pruned with SparseGPT at 2:4, the perplexity gap to the dense model shrinks from 8.1 to 3.4 points—a 58% reduction. Similar gains appear for Qwen2.5 and Mistral, with average zero‑shot accuracy lifts of 1.2–2.5% over baseline pruning. The method also works as a plug‑and‑play add‑on: the same orthogonal‑rotation training can be combined with any pruning algorithm without modifying the latter’s code.

Key contributions are: (1) proposing importance‑concentration as a pre‑pruning objective, (2) grounding the objective in entropy minimization, (3) implementing a practical, low‑overhead orthogonal‑rotation module that preserves model functionality, (4) demonstrating broad compatibility and substantial empirical gains across multiple models and sparsity patterns. Limitations include the need for transformer‑style architectures (RMSNorm, residual connections) to host the rotations and reliance on calibration data for the Hessian used in importance estimation. Future work could explore non‑orthogonal or non‑linear transformations, joint optimization of pruning and quantization, or more efficient entropy‑minimization strategies. Overall, DenoiseRotator offers a novel, theoretically motivated, and empirically validated pathway to make LLM pruning more robust and effective.

