RegMean++: Enhancing Effectiveness and Generalization of Regression Mean for Model Merging


Regression Mean (RegMean), an approach that formulates model merging as a linear regression problem, aims to find the optimal weights for each linear layer in the merge model by minimizing the discrepancy in predictions between the merge and candidate models. RegMean provides a precise closed-form solution for the merging problem; therefore, it offers explainability and computational efficiency. However, RegMean merges each linear layer independently, overlooking how the features and information in the earlier layers propagate through the layers and influence the final prediction in the merge model. In this paper, we introduce RegMean++, a simple yet effective alternative to RegMean, that explicitly incorporates both intra- and cross-layer dependencies between merge models’ layers into RegMean’s objective. By accounting for these dependencies, RegMean++ better captures the behaviors of the merge model. Extensive experiments demonstrate that RegMean++ consistently outperforms RegMean across diverse settings, including in-domain (ID) and out-of-domain (OOD) generalization, sequential merging, large-scale tasks, and robustness under several types of distribution shifts. Furthermore, RegMean++ achieves competitive or state-of-the-art performance compared to various recent advanced model merging methods. Our code is available at https://github.com/nthehai01/RegMean-plusplus.


💡 Research Summary

Model merging has become a practical solution for leveraging the ever‑growing pool of fine‑tuned models without retraining on all tasks. The recent Regression Mean (RegMean) method treats each linear layer of a transformer as an independent linear regression problem, yielding a closed‑form solution that is both explainable and computationally cheap. However, RegMean assumes that the input features for each layer are the activations produced by the corresponding layer in each candidate model. This assumption ignores the fact that, in the merged model, the activations feeding a layer are generated by the merged previous layer, not by the original candidates. Because deep networks contain non‑linearities (GELU, LayerNorm, etc.), a small discrepancy in early‑layer representations can be amplified downstream, leading to a mismatch between the statistics RegMean uses and the actual statistics the merged model will encounter at inference time. Consequently, RegMean’s layer‑wise independent regression can limit both in‑domain (ID) accuracy and out‑of‑domain (OOD) robustness.
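The per-layer closed form that RegMean solves can be sketched in a few lines of numpy. This is an illustrative sketch, not the paper's implementation: the function name and the diagonal-shrinkage regularizer `reg` are assumptions, chosen to mirror the common trick of damping off-diagonal Gram entries so the summed matrix stays well conditioned.

```python
import numpy as np

def regmean_merge_layer(weights, grams, reg=0.9):
    """Merge one linear layer's weights with a RegMean-style closed form.

    weights : list of (d_in, d_out) weight matrices, one per candidate model.
    grams   : list of (d_in, d_in) Gram matrices G_i = X_i^T X_i, where X_i
              holds the layer's input activations collected from candidate i.
    reg     : scalar in (0, 1] that shrinks off-diagonal Gram entries
              (illustrative regularization to keep the sum invertible).
    """
    lhs = np.zeros_like(grams[0])
    rhs = np.zeros_like(weights[0])
    for G, W in zip(grams, weights):
        # Keep the diagonal, scale down off-diagonal entries.
        G_reg = reg * G + (1.0 - reg) * np.diag(np.diag(G))
        lhs += G_reg
        rhs += G_reg @ W
    # Closed-form solution: W_M = (sum_i G_i)^(-1) sum_i G_i W_i
    return np.linalg.solve(lhs, rhs)
```

One sanity check of the formula: if all candidates share the same weight matrix, the merged layer recovers it exactly, regardless of the activation statistics.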

RegMean++ addresses this limitation by redefining the feature matrix X_i^(l,j) used in the regression. Instead of collecting X from each candidate f_i, it collects X from the merged model f_M: for each candidate's data, it performs a forward pass through the already-merged layers up to f_M^(l−1). In practice, for every transformer layer l and every linear sub-layer j, the algorithm (see Algorithm 1) gathers the activations X_i^(l,j) produced by f_M^(l−1) and then computes the inner-product matrix G_i^(l,j) = X_i^(l,j)ᵀ X_i^(l,j). The regularized regression objective remains the same, and the closed-form solution W_M = (∑_i Ĝ_i)⁻¹ ∑_i Ĝ_i W_i still applies, with the regularized Gram matrices Ĝ_i now constructed from the merged-model statistics. This modification introduces only a modest overhead: an extra forward pass per layer to collect statistics, while the final merging step retains the O(K·J·d²) complexity of RegMean.
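The key difference from RegMean can be made concrete on a toy stack of linear layers with ReLU activations. This is a minimal sketch under stated assumptions (the toy network, names, and regularizer are illustrative; the paper applies the idea per transformer sub-layer, as in its Algorithm 1), but it shows the essential change: layer l's Gram matrices come from activations produced by the already-merged layers 0..l−1.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def regmean_pp_merge(candidate_weights, candidate_inputs, reg=0.9):
    """Merge a toy stack of linear layers in the RegMean++ style.

    candidate_weights : per-candidate list of per-layer (d_in, d_out) matrices.
    candidate_inputs  : per-candidate input batch of shape (n_i, d_in).
    """
    num_layers = len(candidate_weights[0])
    acts = [x.copy() for x in candidate_inputs]  # start from raw inputs
    merged = []
    for l in range(num_layers):
        lhs, rhs = None, None
        for i, x in enumerate(acts):
            G = x.T @ x
            # Shrink off-diagonal Gram entries (illustrative regularization).
            G = reg * G + (1.0 - reg) * np.diag(np.diag(G))
            W = candidate_weights[i][l]
            lhs = G if lhs is None else lhs + G
            rhs = G @ W if rhs is None else rhs + G @ W
        W_M = np.linalg.solve(lhs, rhs)  # per-layer closed-form solution
        merged.append(W_M)
        # Key difference from RegMean: propagate each candidate's data
        # through the *merged* layer before merging the next one.
        acts = [relu(x @ W_M) for x in acts]
    return merged
```

In plain RegMean the inner loop would read activations from each candidate's own forward pass; here the final line reroutes every candidate's data through the merged layers, so the statistics match what the merged model actually sees.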

The authors evaluate RegMean++ on two large‑scale benchmarks. For vision, they use three ViT variants (B/32, B/16, L/14) and eight classification datasets (SUN397, Cars, RESISC45, EuroSAT, SVHN, GTSRB, MNIST, DTD). For language, they merge two Llama‑3 variants (3B and 8B) across eleven tasks covering instruction following, mathematics, multilingual understanding, coding, and safety. Candidate models are off‑the‑shelf checkpoints; no additional training data are required. RegMean++ is compared against eleven recent merging methods spanning data‑free (Model Soups, Task Arithmetic, TIES‑Merging, TSV‑M, Iso‑C/CTS, DOGE‑TA), training‑free (Fisher Merging, original RegMean), and test‑time adaptation (AdaMerging, DOGE‑AM) families, as well as a multi‑task learning (MTL) upper bound.

Key findings:

  1. Accuracy Gains – Across all vision tasks, RegMean++ improves average accuracy by 0.7–2.5 percentage points over RegMean, often achieving the best or runner‑up score among all baselines. Similar improvements (1–3 pp) are observed on language tasks, especially on coding and safety benchmarks where representation fidelity is critical.

  2. ID/OOD Robustness – Under distribution shifts (noise, style changes, domain‑shifted test sets), RegMean++ maintains higher accuracy and lower performance variance, indicating that aligning the regression statistics with the merged model’s actual feature flow mitigates representation bias.

  3. Layer‑wise Insights – Ablation studies reveal that merging only the linear weights from middle to deep transformer layers preserves >98 % of the original model’s accuracy, while early layers contribute little and can even degrade performance if merged. Moreover, using MLP sub‑layer weights yields consistently better results than attention‑head weights, suggesting that the feed‑forward network carries more task‑specific information.

  4. Sequential and Large‑Scale Merging – When candidates are added incrementally (sequential merging), RegMean++ shows stable performance without catastrophic forgetting, unlike some data‑free methods that suffer from drift. The method also scales to models with hundreds of millions to billions of parameters, with memory and runtime comparable to RegMean because the extra forward passes are lightweight and can be batched.

  5. Comparison to Advanced Methods – RegMean++ matches or exceeds state‑of‑the‑art techniques such as Iso‑C/CTS, TSV‑M, and test‑time adaptation methods, while retaining the simplicity and privacy‑preserving nature of a training‑free, closed‑form approach (only activation statistics, not raw data or gradients, are needed).
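One way to picture the sequential setting in finding 4 is running-sum bookkeeping over the Gram statistics, so a new candidate can be folded in without revisiting earlier ones. The class below is an illustrative per-layer sketch of that idea, not the paper's full procedure (RegMean++ in particular would also refresh the activations from the updated merged model after each addition).

```python
import numpy as np

class SequentialRegMeanLayer:
    """Incrementally merge candidates into one linear layer by keeping
    running sums of G_i and G_i W_i (illustrative bookkeeping sketch)."""

    def __init__(self, d_in, d_out):
        self.lhs = np.zeros((d_in, d_in))   # accumulates sum_i G_i
        self.rhs = np.zeros((d_in, d_out))  # accumulates sum_i G_i W_i

    def add_candidate(self, X, W, reg=0.9):
        """Fold in one candidate's activations X (n, d_in) and weight W."""
        G = X.T @ X
        # Shrink off-diagonal entries (illustrative regularization).
        G = reg * G + (1.0 - reg) * np.diag(np.diag(G))
        self.lhs += G
        self.rhs += G @ W

    def merged_weight(self):
        # Same closed form as one-shot merging: W_M = lhs^(-1) rhs
        return np.linalg.solve(self.lhs, self.rhs)
```

Because the closed form depends on the candidates only through these two sums, adding a candidate later yields the same merged weight as merging all candidates at once, which is consistent with the stability the paper reports for sequential merging.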

Limitations noted by the authors include the treatment of non‑linear components (LayerNorm, biases) via simple averaging, which may be suboptimal for tasks heavily reliant on normalization statistics. Additionally, the need to run a forward pass over the candidate data to collect merged‑model activations could be memory‑intensive for extremely large datasets; the authors suggest possible batch‑wise approximations as future work.

In summary, RegMean++ introduces a principled modification to the regression‑based model merging framework: by feeding the merged model’s own activations into the regression, it captures intra‑ and cross‑layer dependencies that were previously ignored. This yields a method that retains RegMean’s analytical elegance and computational efficiency while delivering markedly better ID performance, OOD generalization, and robustness in sequential and large‑scale merging scenarios. The work opens a new direction for training‑free merging research, explicitly modeling the dynamics of feature propagation, and may inspire extensions to other architectures (CNNs, RNNs) as well as more sophisticated handling of non‑linear parameters.
