Optimal Transport-Based Mask Fusion for Multi-Task Model Merging
📝 Abstract
Merging models fine-tuned for different tasks into a single unified model has become an increasingly important direction for building versatile, efficient multi-task systems. Existing approaches predominantly rely on parameter interpolation in weight space, which we show introduces significant distribution shift in the feature space and undermines task-specific knowledge. In this paper, we propose OTMF (Optimal Transport-based Masked Fusion), a novel model merging framework rooted in optimal transport theory to address the distribution shift that arises from naive parameter interpolation. Instead of directly aggregating features or weights, OTMF aligns the semantic geometry of task-specific models by discovering common masks applied to task vectors through optimal transport plans. These masks selectively extract transferable and task-agnostic components while preserving the unique structural identities of each task. To ensure scalability in real-world settings, OTMF further supports a continual fusion paradigm that incrementally integrates each new task vector without revisiting previous ones, maintaining a bounded memory footprint and enabling efficient fusion across a growing number of tasks. We conduct comprehensive experiments on multiple vision and language benchmarks, and results show that OTMF achieves state-of-the-art performance in terms of both accuracy and efficiency. These findings highlight the practical and theoretical value of our approach to model merging.
📄 Content
MERGING WITHOUT FORGETTING: CONTINUAL FUSION OF TASK-SPECIFIC MODELS VIA OPTIMAL TRANSPORT

Zecheng Pan*1, Zhikang Chen*1,2, Ding Li*1, Min Zhang†3, Sen Cui1, Hongshuo Jin4, Luqi Tao†1, Yi Yang†1, Deheng Ye5, Yu Zhang6, Tingting Zhu2, Tian-Ling Ren†1

1Tsinghua University, 2University of Oxford, 3East China Normal University, 4Zhejiang University, 5Tencent, 6Southern University of Science and Technology
1 INTRODUCTION

Large-scale pretrained models (PTMs) have achieved remarkable success across natural language processing, computer vision, and multimodal understanding (Bommasani et al., 2021; Radford et al., 2021). As their adoption accelerates, integrating multiple fine-tuned models into a unified multi-task system has become a key challenge (Yang et al., 2024e; Tang et al., 2024a). Traditional multi-task learning (MTL) methods rely on joint training with shared representations (Misra et al., 2016; Sener & Koltun, 2018; Ma et al., 2018), but such approaches are often infeasible when data access is restricted by privacy, communication, or resource constraints (Wortsman et al., 2022a; Li et al., 2023; Wu et al., 2024). To overcome these limitations, model merging has emerged as a promising alternative, enabling the construction of unified models by directly combining independently fine-tuned task models (Wortsman et al., 2022b; Yang et al., 2024a; Zhou et al., 2024). Existing methods include weight averaging and Fisher-weighted averaging (Wortsman et al., 2022b; Matena & Raffel, 2022b), Task Arithmetic (Ilharco et al., 2022; Ortiz-Jimenez et al., 2024), and Ties-Merging (Yadav et al., 2023; Davari & Belilovsky, 2023), often relying on the assumption of mode connectivity, i.e., that smooth paths exist between optima in parameter space (Garipov et al., 2018; Draxler et al., 2018).

*These authors contributed equally to this work.
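Several of the baselines named above operate on task vectors, i.e., the difference between fine-tuned and pretrained weights. As a point of reference, Task Arithmetic merges models by adding the scaled sum of task vectors back to the pretrained weights. A minimal NumPy sketch (the function name and the coefficient `lam` are our own illustrative choices, not the paper's):

```python
import numpy as np

def task_arithmetic_merge(theta_pre, finetuned_weights, lam=0.3):
    """Task arithmetic: theta_merged = theta_pre + lam * sum_t (theta_t - theta_pre).

    theta_pre: pretrained weights (flattened); finetuned_weights: list of
    task-specific weights of the same shape; lam: merging coefficient.
    """
    task_vectors = [theta_t - theta_pre for theta_t in finetuned_weights]
    return theta_pre + lam * np.sum(task_vectors, axis=0)

# Toy 4-parameter model: two tasks that edit both disjoint and shared entries.
theta_pre = np.zeros(4)
theta_a = np.array([1.0, 0.0, 0.5, 0.0])  # fine-tuned on task A
theta_b = np.array([0.0, 1.0, 0.5, 0.0])  # fine-tuned on task B
theta_merged = task_arithmetic_merge(theta_pre, [theta_a, theta_b], lam=0.5)
print(theta_merged)  # [0.5 0.5 0.5 0. ]
```

Interference on shared coordinates (the third entry above) is exactly what sign- and magnitude-based methods such as Ties-Merging prune for, and what the common masks proposed in this paper are meant to resolve at the distributional level.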
†Corresponding authors

arXiv:2511.19561v1 [cs.LG] 24 Nov 2025

Figure 1: Left: OTMF captures common information between pre/post weights while reducing distribution shift. Middle: T-SNE visualizations show that OTMF yields output distributions closely aligned with the pre model's distributions, outperforming Task-wise AdaMerging. Right: OTMF outperforms other sequential methods in average accuracy while using less CUDA memory than Task-Wise AdaMerging, highlighting its advantages in both performance and efficiency. Average accuracy on ViT-B-32 from the right panel:

| Method | Avg Acc |
|---|---|
| TW AM | 66.3 |
| Ties Merging | 51.4 |
| Task Arithmetic | 50.2 |
| OT Mask | 79.7 |

However, most of these methods operate solely at the parameter level, assuming linear interpolation can preserve task knowledge. In practice, they often disrupt feature distributions, leading to degraded performance in heterogeneous settings (Ilharco et al., 2022; Yadav et al., 2023). This raises a key question: how can we merge models while preserving the distributional structure of each task and promoting cross-task knowledge integration? We argue that prior methods fail to maintain the semantic geometry of task-specific feature spaces. Instead of directly editing parameters, we propose a principled solution grounded in optimal transport: aligning l
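For readers unfamiliar with the optimal transport machinery invoked here, an entropy-regularized (Sinkhorn) solver is the standard way to compute a transport plan between two distributions. The sketch below is generic and illustrative, not the authors' OTMF procedure; the histograms, cost matrix, and regularization strength `eps` are assumptions made for the example:

```python
import numpy as np

def sinkhorn_plan(C, a, b, eps=0.5, n_iter=1000):
    """Entropy-regularized OT: find a plan P with row sums a and column
    sums b, minimizing <P, C> - eps * H(P), via alternating scaling."""
    K = np.exp(-C / eps)          # Gibbs kernel of the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)         # rescale to match column marginals
        u = a / (K @ v)           # rescale to match row marginals
    return u[:, None] * K * v[None, :]

# Toy example: align a "pre" and a "post" 3-bin feature histogram.
a = np.array([0.5, 0.3, 0.2])     # pre-model feature distribution
b = np.array([0.2, 0.3, 0.5])     # post-model feature distribution
C = np.abs(np.arange(3)[:, None] - np.arange(3)[None, :]).astype(float)
P = sinkhorn_plan(C, a, b)
# Rows of P sum to a and columns to b; each entry says how much mass
# moves between bins when transporting one distribution onto the other.
```

OTMF uses optimal transport plans of this kind to discover common masks over task vectors; the snippet only illustrates the underlying transport computation, not the mask construction itself.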