ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization


We introduce ReplaceMe, a generalized training-free depth pruning method that effectively replaces transformer blocks with a linear operation, while maintaining high performance for low compression ratios. In contrast to conventional pruning approaches that require additional training or fine-tuning, our approach requires only a small calibration dataset that is used to estimate a linear transformation, which approximates the pruned blocks. The estimated linear mapping can be seamlessly merged with the remaining transformer blocks, eliminating the need for any additional network parameters. Our experiments show that ReplaceMe consistently outperforms other training-free approaches and remains highly competitive with state-of-the-art pruning methods that involve extensive retraining/fine-tuning and architectural modifications. Applied to several large language models (LLMs), ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original model's performance on open benchmarks, without any training or healing steps, resulting in minimal computational overhead. We provide an open-source library implementing ReplaceMe alongside several state-of-the-art depth pruning techniques, available at https://github.com/mts-ai/ReplaceMe.


💡 Research Summary

ReplaceMe introduces a training‑free depth pruning technique for large transformer‑based language models that replaces a contiguous sequence of transformer blocks with a single linear transformation. The method proceeds in two main stages. First, a small calibration dataset (on the order of tens of thousands of tokens) is used to compute hidden‑state activations for each block. By measuring the distance between the activations before and after a candidate cut (using metrics such as cosine distance, L2, etc.), the algorithm selects the cut index i* that minimizes this distance for a predefined number of blocks n to be removed. Empirical results show that cosine distance yields the most reliable identification of low‑impact block groups.
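The cut-selection step above can be sketched as follows. This is an illustrative helper (not the authors' reference implementation) that takes per-block calibration activations and returns the start index of the n-block group whose removal changes the hidden states least, measured by mean cosine distance:

```python
import numpy as np

def select_cut(hidden_states, n):
    """Pick the start index i* of n consecutive blocks to prune by
    minimizing the mean cosine distance between the activations entering
    block i and those leaving block i + n - 1.

    hidden_states[i] holds calibration activations entering block i,
    shape (num_tokens, hidden_dim); the list has num_blocks + 1 entries
    (the last entry is the output of the final block)."""
    num_blocks = len(hidden_states) - 1
    best_i, best_dist = 0, float("inf")
    for i in range(num_blocks - n + 1):
        a, b = hidden_states[i], hidden_states[i + n]
        cos = np.sum(a * b, axis=1) / (
            np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8
        )
        dist = float(np.mean(1.0 - cos))  # mean cosine distance over tokens
        if dist < best_dist:
            best_i, best_dist = i, dist
    return best_i, best_dist
```

Swapping the cosine distance for an L2 distance requires only changing the per-token metric inside the loop; per the summary, cosine proved the more reliable selection criterion.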

Second, the method estimates a linear matrix T that maps the output of the MLP sub‑layer of block i* (plus its residual Y_i) to the hidden state expected by block i* + n + 1. Two objective functions are considered. When the loss is L2 distance, a closed‑form least‑squares solution is derived: T = (M_iᵀ M_i)⁻¹ M_iᵀ (L_{i+n} − Y_i). When the loss is cosine distance, a non‑convex optimization problem is solved with Adam (learning rate 1e‑4, batch 1024, 10 epochs). To reduce memory consumption, the optimization is simplified to align M_i · T with (L_{i+n} − Y_i) rather than the full residual sum, which empirically has negligible impact on performance.
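The closed-form L2 solution above can be sketched in a few lines of NumPy. This is a hedged illustration with assumed variable names (M_i, L_target, Y_i); it uses np.linalg.lstsq rather than the explicit normal equations, which yields the same minimizer with better numerical stability:

```python
import numpy as np

def estimate_linear_transform(M_i, L_target, Y_i):
    """Least-squares estimate of T minimizing ||M_i @ T - (L_target - Y_i)||_F.

    Mathematically equivalent to the closed form
    T = (M_i^T M_i)^{-1} M_i^T (L_target - Y_i), but solved via lstsq,
    which avoids explicitly inverting M_i^T M_i."""
    T, *_ = np.linalg.lstsq(M_i, L_target - Y_i, rcond=None)
    return T
```

The cosine-distance objective has no such closed form, which is why the summary notes it is optimized iteratively with Adam instead.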

The estimated T is then fused into the existing MLP down‑projection weight matrix of block i*, resulting in a new weight W′ = T · W_down. This fusion introduces no additional parameters and leaves the overall architecture unchanged except for the removed blocks. Regularization terms (L1, L2) can be added to the objective to promote sparsity or stability; L1 sparsity reduces memory but slightly raises perplexity, illustrating a trade‑off.
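The fusion step amounts to a single matrix product. The sketch below assumes a column-vector convention (m = W_down @ x), matching the W′ = T · W_down form above; the exact multiplication order in practice depends on the framework's weight layout (e.g., PyTorch stores Linear weights as (out_features, in_features)):

```python
import numpy as np

def fuse_transform(W_down, T):
    """Fuse the estimated linear map T into the MLP down-projection.

    With column-vector activations (m = W_down @ x), applying T after the
    down-projection equals a single projection with W' = T @ W_down, so the
    pruned model gains no extra parameters or layers."""
    return T @ W_down
```

Because the fused weight replaces the original down-projection in place, the pruned model runs with the stock architecture and inference code, just with fewer blocks.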

ReplaceMe also supports multi‑linear‑transform extensions, allowing separate T matrices for multiple non‑overlapping block groups (Multi‑LT). Experiments with non‑consecutive groups show improved perplexity but a modest drop in task accuracy, indicating interaction effects between independent linear corrections.

The authors evaluate ReplaceMe on several state‑of‑the‑art LLMs: LLaMA‑2‑7B, LLaMA‑3‑8B‑Instruct, Qwen2.5‑7B, and Falcon‑11B. Using a 25% depth reduction, the method retains on average 92.5% of the baseline accuracy across a suite of benchmarks (CMNLI, HellaSwag, PIQA, CHID, WSC, MMLU, CMMLU, Race, etc.) and achieves a 0.9% reduction in Lambada OpenAI perplexity. Compared with leading structured pruning baselines that require post‑pruning fine‑tuning (UIDL, SVD‑LLM, LLM‑Pruner), ReplaceMe delivers comparable or superior performance while eliminating any healing phase. Compression time is reduced by a factor of five or more, and CO₂ emissions measured with CodeCarbon drop by roughly 70%, highlighting the method's sustainability advantages.

Ablation studies explore the influence of calibration dataset size, distance metric choice, optimizer selection, and regularization strength. Cosine‑based layer selection consistently identifies near‑optimal cut points, while both L2‑based and cosine‑based T estimation perform similarly when sufficient calibration data are provided. Regularization improves accuracy at the expense of a slight perplexity increase.

Limitations are acknowledged: compression ratios beyond ~30% cause the linear approximation to break down, leading to steep performance degradation. The method also assumes that the calibration data are representative of the model's downstream tasks; poor calibration coverage can misguide block selection. Future work may incorporate non‑linear correction layers, hybrid width‑depth pruning, or adaptive calibration strategies to extend the viable compression range.

In summary, ReplaceMe offers a practical, training‑free, parameter‑free depth pruning solution that is broadly applicable to modern LLMs, delivering significant speed‑up, memory savings, and environmental benefits without sacrificing most of the original model’s capabilities.

