A Pipeline to Assess Merging Methods via Behavior and Internals

Merging methods combine the weights of multiple language models (LMs) to leverage their capacities, such as for domain adaptation. While existing studies investigate merged models from a solely behavioral perspective, we offer the first comprehensive view by assessing and connecting their behavior and internals. We present a novel evaluation pipeline that first merges multiple parent LMs, and then evaluates the merged models in comparison to the initial ones based on their behavior on downstream tasks, like MMLU, and the internal encoded linguistic competence. We showcase this pipeline by assessing merges of an instruction fine-tuned LM with math- and code-adapted LMs from the Qwen2.5 family. Our results show that merging methods impact behavior and internals differently. While the performance of merged models is typically between that of the two parent models, their encoded information about linguistic phenomena, particularly in morphology and syntax, can surpass the parent models. Moreover, we find weak ranking correlation between this behavior and internal evaluation. With our pipeline and initial results, we emphasize the need for more comprehensive evaluations of model merging methods to gain a faithful understanding of their capabilities and reliability, beyond potential superficial behavioral advances.


💡 Research Summary

The paper addresses a notable gap in the literature on large language model (LLM) merging: while many works evaluate merged models solely on downstream task performance, none have systematically examined how merging affects the internal representations of the models. To fill this void, the authors propose a three‑stage evaluation pipeline that (1) merges multiple parent LLMs using a variety of weight‑combination techniques, (2) assesses both the external behavior and the internal linguistic competence of the resulting models, and (3) connects the two evaluation streams to understand their relationship.

In the merging stage, the authors employ MergeKit to implement five representative methods: Linear averaging, spherical linear interpolation (SLERP), Task Arithmetic, TIES, and DARE‑TIES. These methods span a spectrum of complexity, computational cost, and support for multi‑model merging. Linear is the simplest, averaging corresponding weights; SLERP interpolates along a spherical path, potentially finding smoother loss surfaces; Task Arithmetic adds “task vectors” derived from fine‑tuning; TIES trims insignificant changes and resolves sign conflicts; DARE‑TIES further sparsifies task vectors before applying TIES.
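The weight‑combination ideas behind these methods can be sketched on toy tensors. This is a deliberate simplification: MergeKit applies such operations per parameter across full checkpoints, and the function names and the `density` default here are illustrative, not MergeKit's API.

```python
import numpy as np

def linear_merge(weights, alphas):
    """Linear averaging: element-wise weighted mean of parent weights."""
    return sum(a * w for a, w in zip(alphas, weights))

def slerp(w0, w1, t):
    """Spherical linear interpolation between two weight tensors."""
    v0, v1 = w0.ravel(), w1.ravel()
    cos = np.dot(v0, v1) / (np.linalg.norm(v0) * np.linalg.norm(v1))
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    if theta < 1e-6:                      # nearly parallel: fall back to lerp
        return (1 - t) * w0 + t * w1
    s = np.sin(theta)
    out = (np.sin((1 - t) * theta) / s) * v0 + (np.sin(t * theta) / s) * v1
    return out.reshape(w0.shape)

def task_arithmetic(base, fine_tuned, scale=1.0):
    """Add scaled task vectors (fine-tuned minus base) back onto the base."""
    return base + scale * sum(ft - base for ft in fine_tuned)

def trim(task_vector, density=0.5):
    """TIES-style trimming: keep only the top-`density` fraction of
    weight changes by magnitude, zeroing out the insignificant rest."""
    k = max(1, int(density * task_vector.size))
    thresh = np.sort(np.abs(task_vector).ravel())[-k]
    return np.where(np.abs(task_vector) >= thresh, task_vector, 0.0)
```

TIES additionally resolves sign conflicts across the trimmed task vectors before summing them, and DARE‑TIES randomly drops and rescales task‑vector entries first; both steps are omitted here for brevity.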

The evaluation stage uses two complementary suites. For behavioral assessment, the LM Evaluation Harness (lm‑eval) runs a collection of widely used benchmarks (BBH, Math Hard, MUSR, GPQA, MMLU‑PRO), providing a standardized measure of downstream performance. For internal analysis, the authors adopt Holmes, specifically its streamlined Flash‑Holmes variant, which probes over 160 linguistic tasks across morphology, syntax, semantics, reasoning, and discourse by training linear classifiers on the final‑layer representations.
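The probing idea can be illustrated in miniature: a linear classifier is trained on frozen representations, and its held‑out accuracy serves as a proxy for how linearly decodable a linguistic phenomenon is. The sketch below substitutes synthetic vectors for real LM hidden states; all data and the injected signal are illustrative, not drawn from Holmes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for final-layer representations: in the real pipeline these
# would be hidden states extracted from a (merged) LM for each instance
# of a probing dataset, paired with a gold linguistic label.
n, dim = 600, 32
labels = rng.integers(0, 2, size=n)        # e.g. a binary morphology label
reps = rng.normal(size=(n, dim))
reps[:, 0] += 2.0 * labels                 # inject a linearly decodable signal

X_tr, X_te, y_tr, y_te = train_test_split(reps, labels, random_state=0)

# The linear probe: if a simple classifier recovers the label from frozen
# representations, the model is said to "encode" the phenomenon.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
accuracy = probe.score(X_te, y_te)
```

Repeating this per probing task and per model yields the per‑category internal scores that the paper compares across parents and merges.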

Experiments focus on the Qwen2.5 7B family, merging an instruction‑tuned model with two domain‑adapted variants (Math and Code). The results reveal a clear divergence between behavioral and internal outcomes. Behaviorally, merged models typically land between the two parents; they never surpass the best parent, and on the challenging Math Hard sub‑task they often fall short of both. Simpler merging strategies (SLERP, Linear) tend to achieve the highest scores, while the more sophisticated interference‑mitigation methods (TIES, DARE‑TIES) perform worst, suggesting that aggressive conflict resolution can harm the model’s ability to solve complex, multi‑step tasks.

Conversely, internal probing shows that merging can enrich the encoded linguistic knowledge. Across most probing categories, merged models match or exceed the parents, with the most pronounced gains in morphology and syntax. This indicates that weight‑level combination can successfully fuse complementary structural knowledge from the parents, yielding richer internal representations even when external performance does not improve.

Crucially, the correlation analysis between the two evaluation dimensions yields a low rank‑order correlation, confirming that behavioral benchmarks and internal probing capture largely independent aspects of model quality. The authors argue that relying on a single perspective can be misleading, and that a comprehensive assessment—incorporating both behavior and internals—is essential for trustworthy model merging.
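The rank‑order comparison between the two axes amounts to a Spearman correlation over per‑model scores, as sketched below. The numbers are made‑up placeholders standing in for a behavioral benchmark mean and a probing mean per model; they are not the paper's results and serve only to show the computation.

```python
from scipy.stats import spearmanr

# Hypothetical per-model scores (illustrative values only): one mean
# behavioral benchmark score and one mean probing score per model.
models = ["instruct", "math", "code", "linear", "slerp", "ties", "dare_ties"]
behavior = [0.42, 0.45, 0.40, 0.43, 0.44, 0.38, 0.37]
internals = [0.71, 0.69, 0.70, 0.74, 0.73, 0.72, 0.73]

# Spearman's rho compares the *rankings* the two evaluations induce over
# the models; a value near zero means the axes order models differently.
rho, p_value = spearmanr(behavior, internals)
```

A weak |rho|, as the paper reports, means a model ranking derived from benchmarks alone would look quite different from one derived from probing.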

Finally, the paper contributes all code for the pipeline and analysis, encouraging reproducibility and extension to other model families and domain‑adaptation scenarios. By highlighting the nuanced trade‑offs between external performance and internal linguistic competence, the work sets a new standard for evaluating LLM merging methods and underscores the need for multi‑faceted evaluation frameworks in future research.

