SparseSwaps: Tractable LLM Pruning Mask Refinement at Scale

Reading time: 6 minutes

📝 Original Info

  • Title: SparseSwaps: Tractable LLM Pruning Mask Refinement at Scale
  • ArXiv ID: 2512.10922
  • Date: 2025-12-11
  • Authors: Max Zimmer, Christophe Roux, Moritz Wagner, Deborah Hendrych, Sebastian Pokutta

📝 Abstract

The resource requirements of neural networks can be significantly reduced through pruning - the removal of seemingly less important parameters. However, for LLMs, full retraining to recover pruning-induced performance degradation is often prohibitive and classical approaches such as magnitude pruning are suboptimal on Transformers. State-of-the-art methods hence solve a layer-wise mask selection problem: finding a pruning mask that minimizes per-layer pruning error on a small set of calibration data. Exactly solving this problem is computationally infeasible due to its combinatorial nature and the size of the search space, and existing approaches rely on approximations or heuristics. We demonstrate that the mask selection problem can be made drastically more tractable at LLM scale. To that end, we decouple the rows by enforcing equal sparsity levels per row. This allows us to derive optimal 1-swaps (exchanging one kept and one pruned weight) computable efficiently via the Gram matrix. We propose a simple 1-swap algorithm that warmstarts from any pruning mask, runs efficiently on GPUs at LLM scale, and is essentially hyperparameter-free. Our approach reduces per-layer pruning error by up to 60% over Wanda (Sun et al., 2024) and consistently improves perplexity and zero-shot accuracy across state-of-the-art GPT architectures.


📄 Full Content

Pruning after training (Han et al., 2015; Gale et al., 2019; Lin et al., 2020; Hoefler et al., 2021; Zimmer et al., 2025) is a state-of-the-art technique to reduce the resource requirements of neural networks. A simple yet effective approach to obtain such sparse models starts from a pretrained dense model, removes seemingly unimportant parameters based on their magnitude, and requires retraining to compensate for pruning-induced performance degradation. However, while the inexpensive, data-free magnitude criterion has often achieved strong performance on traditional architectures (Gale et al., 2019; Zimmer et al., 2023b), pruning has undergone a paradigm shift with the rise of pretrained foundation models such as Large Language Models (LLMs).

First, the size of the models has shifted the focus toward retraining-free pruning criteria, as retraining is often computationally expensive if not infeasible, with parameter-efficient fine-tuning (Lialin et al., 2023; Zimmer et al., 2023a) being an exception. Secondly, systematic activation outliers (Dettmers et al., 2022) and highly important super-weights (Yu et al., 2025) in sufficiently large Transformers (Vaswani et al., 2017) have rendered magnitude pruning no better than random pruning for LLMs (Sun et al., 2024; Yin et al., 2023). Lastly, state-of-the-art methods (Frantar & Alistarh, 2023; Sun et al., 2024; Zhang et al., 2024a) prune layer-wise: they split the pruning problem into per-layer subproblems, pruning layers sequentially and independently using a small calibration dataset to estimate parameter importance. Rather than optimizing the global loss, such approaches minimize a per-layer local pruning loss. Specifically, for a single layer with calibration input matrix X ∈ R^{d_in × B} and weights W ∈ R^{d_out × d_in}, the objective becomes

min_M ‖W X − (M ⊙ W) X‖_F²,   (1)

where M ∈ {0, 1}^{d_out × d_in} is a binary pruning mask achieving a desired level of sparsity, e.g., ‖M‖_0 ≤ k for unstructured sparsity, and ⊙ denotes element-wise multiplication (the Hadamard product). Here, B = N · L, with N the number of samples in the calibration batch and L the sequence length.
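The role of the Gram matrix G = XX^⊤ in this objective can be checked numerically: writing d_i = (1 − m_i) ⊙ w_i for the pruned part of row i, the per-layer error equals Σ_i d_i^⊤ G d_i, so once G is formed the calibration data X is no longer needed. A minimal numpy sketch (toy sizes and a magnitude warmstart are illustrative choices, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, B = 8, 16, 32          # tiny stand-ins for layer and calibration sizes
W = rng.normal(size=(d_out, d_in))
X = rng.normal(size=(d_in, B))

# Magnitude mask with equal sparsity per row: keep the k largest |w| per row.
k = d_in // 2
M = np.zeros_like(W)
idx = np.argsort(-np.abs(W), axis=1)[:, :k]
np.put_along_axis(M, idx, 1.0, axis=1)

# Direct per-layer pruning error: ||W X - (M ⊙ W) X||_F^2
direct = np.linalg.norm(W @ X - (M * W) @ X) ** 2

# Same quantity via the Gram matrix G = X X^T: sum_i d_i^T G d_i,
# where D = (1 - M) ⊙ W stacks the pruned parts of all rows.
G = X @ X.T
D = (1.0 - M) * W
via_gram = np.einsum("ij,jk,ik->", D, G, D)

assert np.isclose(direct, via_gram)
```

The einsum contraction computes Σ_i d_i^⊤ G d_i in one call; it also makes the row-wise separability visible, since each row contributes an independent quadratic term.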

Solving this combinatorial mask selection problem to optimality is NP-hard due to feature correlations: selecting k of the d_out · d_in weights yields a cardinality-constrained binary quadratic program (a variant of best-subset selection). Even for a single row i the problem remains hard, despite reducing to

min_{m_i} ‖w_i^⊤ X − (m_i ⊙ w_i)^⊤ X‖_2²,

where w_i ∈ R^{d_in} and m_i ∈ {0, 1}^{d_in} denote the i-th row of W and M, respectively. While Integer Programming (IP) solvers could in principle provide optimal solutions, the combinatorial search over mask entries makes this infeasible for LLMs. In practice, existing methods therefore relax Equation 1 or approximate it.
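At toy scale, the combinatorial nature of the per-row problem can be made concrete by brute force: with correlated calibration features, exhaustive search over all C(d_in, k) masks can beat the magnitude mask, while the search space makes this hopeless at LLM dimensions. A numpy sketch under illustrative assumptions (the correlation structure and sizes are invented for the demo):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
d_in, B, k = 10, 64, 5
# Correlated calibration features: every input shares a common component,
# which is what makes magnitude selection suboptimal.
base = rng.normal(size=(1, B))
X = 0.8 * base + 0.2 * rng.normal(size=(d_in, B))
w = rng.normal(size=d_in)
G = X @ X.T

def row_loss(keep):
    """Exact row objective d^T G d, with d = (1 - m) ⊙ w the pruned part."""
    d = w.copy()
    d[list(keep)] = 0.0
    return d @ G @ d

# Exhaustive search over all C(10, 5) = 252 masks: only feasible at toy scale.
best = min(itertools.combinations(range(d_in), k), key=row_loss)
mag = tuple(np.argsort(-np.abs(w))[:k])

# The magnitude mask is one candidate among all masks, so it can never win.
assert row_loss(best) <= row_loss(mag) + 1e-9
```

For a single LLM layer row with d_in in the thousands and k in the hundreds, C(d_in, k) is astronomically large, which is why exact enumeration (or an IP solver) is off the table and local refinement becomes attractive.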

However, with deployed LLMs now serving millions of users, it becomes increasingly worthwhile to invest substantial resources into obtaining pruned models that reach high performance: the pruning cost is paid once, whereas inference costs scale with the number of requests. In this work, we revisit the per-layer mask selection problem and demonstrate that it can be operationalized at LLM scale, enabling monotone improvements with each optimization step rather than relying on proxy importance scores. To that end, we observe that enforcing equal sparsity levels across rows makes the objective row-wise separable into independent subproblems. This renders the problem drastically more tractable while still yielding strong practical performance for LLMs. Rather than seeking exact solutions via IP solvers, we propose a GPU-accelerated local optimization algorithm based on 1-swaps (exchanging one kept and one pruned weight) that performs exact and efficient local refinement with incremental cost updates via the Gram matrix G = XX^⊤, monotonically decreasing the objective from any warmstart.

The resulting method, which we term SparseSwaps, can start from any warmstart mask, evaluates the exact per-row quadratic loss, and is scalable, parallelizable across rows, almost hyperparameter-free, and deterministic for a fixed warmstart. With only a few 1-swap iterations, it can reduce the per-layer pruning error by up to 60% compared to Wanda and improves final perplexity and zero-shot accuracy across architectures. Our approach is a post-hoc refinement of existing pruning methods that can significantly improve upon the state of the art for unstructured, per-row, or N :M sparsity.
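The 1-swap refinement described above can be sketched as a greedy loop per row: maintain g = G d for the pruned part d, score every pair (kept j, pruned i) by the exact loss change Δ = 2 w_j g_j − 2 w_i g_i + w_j² G_jj + w_i² G_ii − 2 w_i w_j G_ij (which follows from expanding (d + w_j e_j − w_i e_i)^⊤ G (·)), and apply the best swap while Δ < 0. A toy numpy sketch of this idea, not the paper's GPU implementation; sizes, warmstart, and iteration cap are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, B, k = 32, 128, 16
X = rng.normal(size=(d_in, B))
w = rng.normal(size=d_in)
G = X @ X.T

# Warmstart: magnitude mask keeping the k largest |w|.
m = np.zeros(d_in)
m[np.argsort(-np.abs(w))[:k]] = 1.0

def loss(m):
    d = (1.0 - m) * w
    return d @ G @ d

init = loss(m)
cur = init
diag = np.diag(G)
for _ in range(100):
    d = (1.0 - m) * w
    g = G @ d                        # one matvec; per-swap deltas are then cheap lookups
    kept = np.flatnonzero(m == 1)
    pruned = np.flatnonzero(m == 0)
    # Δ(prune j, keep i) = 2 w_j g_j - 2 w_i g_i + w_j^2 G_jj + w_i^2 G_ii - 2 w_i w_j G_ij
    delta = (2 * w[kept] * g[kept] + w[kept] ** 2 * diag[kept])[:, None] \
          + (-2 * w[pruned] * g[pruned] + w[pruned] ** 2 * diag[pruned])[None, :] \
          - 2 * np.outer(w[kept], w[pruned]) * G[np.ix_(kept, pruned)]
    jj, ii = np.unravel_index(np.argmin(delta), delta.shape)
    if delta[jj, ii] >= -1e-10:
        break                        # no improving 1-swap left: local optimum
    m[kept[jj]], m[pruned[ii]] = 0.0, 1.0
    cur += delta[jj, ii]             # incremental cost update, no recomputation

assert np.isclose(cur, loss(m))      # incremental tracking matches the exact loss
assert loss(m) <= init + 1e-9        # monotone decrease from the warmstart
```

Each accepted swap strictly decreases the exact row objective, which is the monotonicity property claimed for the method; in this sketch the full delta matrix is formed densely, which is also what makes the scheme amenable to batched GPU execution across rows.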

Contributions. Our contributions are as follows:

  1. Making the mask selection problem tractable. We observe that a) enforcing equal sparsity levels per row decouples the rows, and that b) optimal 1-swaps (exchanging one kept and one pruned weight) can be evaluated efficiently using the Gram matrix G = XX^⊤ of the calibration data, ensuring efficient lookups when determining the most beneficial swap.

  2. SparseSwaps: a practical post-hoc pruning algorithm. Building on these observations, we propose SparseSwaps, a plug-and-play 1-

…(Full text truncated)…

📸 Image Gallery

perplexity_vs_reconstruct_n_samples_50, perplexity_vs_reconstruct_n_samples_60, reconstruction_improvement

Reference

This content is AI-processed based on ArXiv data.
