Task arithmetic is a powerful technique for transferring skills between Large Language Models (LLMs), but it often suffers from negative interference when models have diverged during training. We address this limitation by first aligning the models' parameter spaces, leveraging the inherent permutation, rotation, and scaling symmetries of Transformer architectures. We adapt parameter space alignment for modern Grouped-Query Attention (GQA) and SwiGLU layers, exploring both weight-based and activation-based approaches. Using this alignment-first strategy, we successfully transfer advanced reasoning skills to a non-reasoning model. Experiments on challenging reasoning benchmarks show that our method consistently outperforms standard task arithmetic. This work provides an effective approach for merging and transferring specialized skills across evolving LLM families, reducing redundant fine-tuning and enhancing model adaptability.
The proliferation of open-weight Large Language Models (LLMs) [9,15,3,21] has fostered a decentralized ecosystem of model development. Once a foundational model is released, research teams independently adapt it to diverse applications. However, the rapid release of next-generation base models introduces a developmental dilemma: teams must either continue relying on their older, customized models or incur the substantial cost of re-adapting their work to a newer, more capable version. This challenge underscores the need for methods that can efficiently integrate the specialized skills of an adapted model with the enhanced capabilities of its successor.
A promising strategy for skill transfer is task arithmetic [14], which represents the effect of fine-tuning as a task vector, the difference between the fine-tuned and pre-trained weights. This vector encodes the acquired skill, and by adding multiple task vectors to a base model, their respective skills can be combined [14]. Despite its appeal, task arithmetic is prone to negative interference, where conflicting parameter updates from independently trained models degrade performance [23,24]. This problem becomes especially severe when models have substantially diverged during adaptation. This paper addresses two challenges at this intersection: 1) efficiently transferring skills across different versions of a foundational model, and 2) mitigating the negative interference that arises when applying task arithmetic to models with diverging training trajectories. We focus on the crucial skill of LLM reasoning. Cultivating robust reasoning in LLMs is a notoriously difficult and computationally expensive process. Therefore, the ability to transfer these hard-won reasoning skills between model versions is a vital research objective, allowing teams to leverage state-of-the-art advancements without invalidating their prior investments.
Contributions. We make two main contributions. First, we extend methods that leverage parameter space symmetries, specifically rotation and scale symmetries for attention layers [25] and permutation symmetry for feed-forward network (FFN) layers [2], to their widely used, modern counterparts, Grouped-Query Attention (GQA) [1] and SwiGLU layers [18]. Second, we use these alignment methods to transfer the reasoning skills of Nemotron-Nano [3] from its reference model, Llama-3.1-8B-Instruct [9], to the independently instruction-tuned Tulu3-8B model. Through extensive evaluations, we show that aligning the parameter spaces before transferring the reasoning skills via task arithmetic significantly improves the final model’s performance on hard reasoning benchmarks.
Task arithmetic [14] provides an efficient framework for editing model capabilities by performing linear operations directly on their weights ($\theta$). The core idea is to represent a learned skill as a task vector ($\tau_{\text{skill}}$), which is the difference between a fine-tuned model’s parameters and its original base model’s parameters. This vector isolates the changes that encode a specific new skill.
This formulation is particularly powerful for skill transfer. A skill vector, such as one for reasoning, can be extracted from a reference model pair (e.g., $\tau_{\text{reason}} = \theta_{\text{ref+reason}} - \theta_{\text{ref}}$) and then applied to an entirely different target model ($\theta_{\text{target}}$) to imbue it with that same skill. The new model’s parameters, $\theta_{\text{target+reason}}$, are synthesized through simple vector addition:
$$\theta_{\text{target+reason}} = \theta_{\text{target}} + \tau_{\text{reason}}$$
This creates an analogy where the target model is enhanced in the same way as the reference model. While effective, this process can suffer from negative interference [23,24], where conflicting parameter updates between the models degrade performance, especially when their parameter spaces have significantly diverged.
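As a concrete illustration, a minimal sketch of this recipe over PyTorch-style state dicts is shown below; the helper names (`extract_task_vector`, `apply_task_vector`) and the optional scaling coefficient are our own additions for exposition, not part of the method description above.

```python
import torch  # state dicts map parameter names to torch.Tensor

def extract_task_vector(finetuned: dict, base: dict) -> dict:
    """Task vector: parameter-wise difference between a fine-tuned model and its base."""
    return {name: finetuned[name] - base[name] for name in base}

def apply_task_vector(target: dict, tau: dict, scale: float = 1.0) -> dict:
    """Add an (optionally scaled) task vector to a target model's parameters."""
    return {name: target[name] + scale * tau[name] for name in target}

# Hypothetical usage, assuming the three state dicts share keys and shapes:
#   tau_reason = extract_task_vector(theta_ref_reason, theta_ref)
#   theta_target_reason = apply_task_vector(theta_target, tau_reason)
```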
The architectures of modern neural networks exhibit multiple parameter space symmetries, meaning distinct sets of weights can produce an identical output function. For instance, neurons in a feed-forward layer can be reordered, or the internal representations in an attention layer can be rotated, without changing the model’s behavior. When two models are trained independently, from different initializations or along diverging training trajectories, these symmetries can leave their parameter spaces misaligned, hindering direct arithmetic operations such as skill transfer.
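As a quick numerical illustration of the permutation symmetry (a toy two-layer MLP sketch, not drawn from the paper), reordering the hidden units and compensating in the next layer leaves the network’s output unchanged:

```python
import torch

torch.manual_seed(0)
d, m = 8, 16                        # hidden size, intermediate size
W1, W2 = torch.randn(d, m), torch.randn(m, d)
x = torch.randn(4, d)

perm = torch.randperm(m)
P = torch.eye(m)[:, perm]           # permutation matrix

y_original = torch.relu(x @ W1) @ W2
y_permuted = torch.relu(x @ (W1 @ P)) @ (P.T @ W2)   # permute columns of W1, rows of W2

print(torch.allclose(y_original, y_permuted, atol=1e-5))  # True: identical function
```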
The SwiGLU activation function for a given FFN layer in model $i$, $\mathrm{Swish}_\beta(x W_G^{(i)}) \odot (x W_U^{(i)})$, involves gate ($W_G^{(i)}$) and up-projection ($W_U^{(i)}$) matrices whose intermediate outputs can be permuted without changing the function, provided the down-projection ($W_D^{(i)}$) is adjusted accordingly. We find the optimal permutation matrix $P$ that aligns the weights of two models (1 and 2) by solving the linear assignment problem:
$$P^{*} = \arg\max_{P} \; \big\langle W_G^{(1)},\, W_G^{(2)} P \big\rangle_F + \big\langle W_U^{(1)},\, W_U^{(2)} P \big\rangle_F$$
This problem is solved efficiently using the Hungarian algorithm. Model 2 weights are then updated:
$$W_G^{(2)} \to W_G^{(2)} P^{*}, \qquad W_U^{(2)} \to W_U^{(2)} P^{*}, \qquad W_D^{(2)} \to P^{*\top} W_D^{(2)}$$
These updates leave model 2’s function unchanged while aligning its FFN intermediate units with those of model 1.
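The following sketch shows one way this alignment step could be implemented with `scipy.optimize.linear_sum_assignment` (a Hungarian-style solver); the $xW$ weight layout (columns indexing the FFN intermediate dimension) and the function name are assumptions made for illustration, not the authors’ released code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_swiglu_ffn(Wg1, Wu1, Wg2, Wu2, Wd2):
    """Permute model 2's FFN intermediate units to best match model 1.

    Assumes the x @ W convention: Wg, Wu have shape (hidden, intermediate)
    and Wd has shape (intermediate, hidden).
    """
    # Similarity between every pair of intermediate units of the two models.
    score = Wg1.T @ Wg2 + Wu1.T @ Wu2                 # (intermediate, intermediate)
    row, col = linear_sum_assignment(score, maximize=True)
    perm = col[np.argsort(row)]                       # unit i of model 1 <- unit perm[i] of model 2

    # Function-preserving update: permute columns of Wg/Wu and rows of Wd.
    return Wg2[:, perm], Wu2[:, perm], Wd2[perm, :]
```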
The absence of element-wise non-linearities between the query, key, and value projections in attention permits richer symmetries than permutations: the projections can be transformed by rotations and scalings without changing the layer’s output, provided the paired projection matrices are adjusted consistently [25].
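As an illustrative check of this invariance (a sketch under the $xW$ convention with hypothetical dimensions, not the paper’s implementation), transforming the query projection by any invertible matrix $M$ and the key projection by $M^{-\top}$ leaves the attention logits $Q K^\top$ unchanged:

```python
import torch

torch.manual_seed(0)
d_model, d_head = 16, 8
Wq, Wk = torch.randn(d_model, d_head), torch.randn(d_model, d_head)
x = torch.randn(4, d_model)

M = torch.randn(d_head, d_head) + 3 * torch.eye(d_head)   # generic invertible transform
Wq_t, Wk_t = Wq @ M, Wk @ torch.linalg.inv(M).T            # paired update keeps Q K^T fixed

logits = (x @ Wq) @ (x @ Wk).T
logits_t = (x @ Wq_t) @ (x @ Wk_t).T

print(torch.allclose(logits, logits_t, atol=1e-4))         # True: identical attention scores
```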