Linear Model Merging Unlocks Simple and Scalable Multimodal Data Mixture Optimization
Selecting the best data mixture is critical for successful Supervised Fine-Tuning (SFT) of Multimodal Large Language Models. However, determining the optimal mixture weights across multiple domain-specific datasets remains a significant bottleneck due to the combinatorial search space and the high cost associated with even a single training run. This is the so-called Data Mixture Optimization (DMO) problem. On the other hand, model merging unifies domain-specific experts through parameter interpolation. This strategy is efficient, as it only requires a single training run per domain, yet oftentimes leads to suboptimal models. In this work, we take the best of both worlds, studying model merging as an efficient strategy for estimating the performance of different data mixtures. We train domain-specific multimodal experts and evaluate their weighted parameter-space combinations to estimate the efficacy of corresponding data mixtures. We conduct extensive experiments on 14 multimodal benchmarks, and empirically demonstrate that the merged proxy models exhibit a high rank correlation with models trained on actual data mixtures. This decouples the search for optimal mixtures from the resource-intensive training process, thereby providing a scalable and efficient strategy for navigating the complex landscape of mixture weights. Code is publicly available at https://github.com/BerasiDavide/mLLMs_merging_4_DMO.
💡 Research Summary
The paper tackles the problem of Data Mixture Optimization (DMO) for multimodal large language models (MLLMs), a task that traditionally requires training a separate model for every candidate mixture of domain‑specific instruction‑tuning data. Because the number of possible mixtures grows combinatorially with the number of domains, exhaustive search quickly becomes prohibitively expensive. Existing approaches mitigate this cost by fitting scaling laws or regression models to a limited set of fully fine‑tuned runs, yet they still demand dozens or hundreds of costly training runs.
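To make the combinatorial blow‑up concrete: the number of mixture weight vectors on a regular grid over the K‑domain simplex with step 1/steps is C(steps + K − 1, K − 1), which grows rapidly in K. The sketch below (my own illustration, not code from the paper) enumerates such a grid.

```python
from itertools import combinations_with_replacement
from math import comb

def grid_mixtures(num_domains: int, steps: int):
    """Enumerate all mixture weight vectors on a regular grid over the
    (num_domains-1)-simplex, with weights in multiples of 1/steps."""
    mixtures = []
    # Each multiset assigns the `steps` unit shares to domains.
    for split in combinations_with_replacement(range(num_domains), steps):
        counts = [split.count(d) for d in range(num_domains)]
        mixtures.append(tuple(c / steps for c in counts))
    return mixtures

# Grid size is C(steps + K - 1, K - 1); at K = 4 domains and a 10% grid,
# there are already 286 candidate mixtures, each a full training run
# under the naive approach.
assert len(grid_mixtures(4, 10)) == comb(10 + 4 - 1, 4 - 1)  # 286
```

Every candidate on this grid would traditionally require its own fine‑tuning run, which is exactly the cost the proxy approaches try to avoid.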
The authors propose a radically different, much cheaper proxy: use model merging as a surrogate for actual mixture training. First, they fine‑tune a base MLLM on each domain individually, producing K expert models θ₁,…,θ_K. For any candidate mixture weight vector w ∈ Δ^{K‑1}, they construct a merged model by simple linear interpolation of the expert parameters: θ_M(w)=∑_k w_k θ_k. The key hypothesis is that while θ_M(w) will not match the true mixture‑trained model θ*_w in absolute performance, it will preserve the ranking of mixtures: if mixture w₁ yields higher true performance than w₂, the same ordering should appear when evaluating the merged proxies. Consequently, the merged model can be used as a cheap evaluator, turning the DMO problem into a ranking problem: once the K experts have been trained, each candidate mixture costs only a parameter‑space interpolation and an evaluation pass, with no additional training.
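The interpolation θ_M(w)=∑_k w_k θ_k is just a per‑parameter weighted average of the expert checkpoints. A minimal sketch, using plain Python floats in place of the tensors a real state dict would hold:

```python
def merge_experts(experts, weights):
    """Linearly interpolate expert parameter dicts:
    theta_M(w) = sum_k w_k * theta_k, applied per parameter."""
    assert abs(sum(weights) - 1.0) < 1e-8, "weights must lie on the simplex"
    merged = {}
    for name in experts[0]:
        merged[name] = sum(w, 0.0) if False else sum(
            wk * e[name] for wk, e in zip(weights, experts)
        )
    return merged

# Toy two-expert example; real MLLM experts would share one architecture
# and the same parameter names, with tensors in place of scalars.
theta_1 = {"layer.w": 1.0, "layer.b": 0.0}
theta_2 = {"layer.w": 3.0, "layer.b": 2.0}
merged = merge_experts([theta_1, theta_2], [0.25, 0.75])
assert merged["layer.w"] == 2.5 and merged["layer.b"] == 1.5
```

With real checkpoints the same loop runs over `state_dict()` tensors; nothing else changes, which is why each candidate mixture is so cheap to realize once the experts exist.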
The experimental protocol is extensive. Two recent multimodal families—Qwen2‑VL and InternVL3.5—are examined, each in 2 B and 8 B parameter variants, and both LoRA‑based low‑rank adaptation and full fine‑tuning are considered. A corpus of 23 instruction‑tuning datasets is assembled, grouped into four semantic domains (general multimodal understanding, OCR, visual perception & counting, and chart understanding). For each domain a 100 k‑sample subset is created, yielding K = 2, 3, or 4 experts depending on the experiment. Fourteen downstream benchmarks spanning the same domains (e.g., GQA, VQA‑v2, OCR‑Bench, ChartQA, MME) serve as the evaluation suite.
Results show a consistently high Spearman rank correlation between merged proxies and true mixture‑trained models, ranging from 0.57 to 0.78 across all settings. The correlation remains strong even as the number of domains increases (up to 0.77 for four domains) and across model families and sizes. Importantly, the authors demonstrate that experts trained on a fraction of the full data budget (10 k or 50 k samples) still produce reliable proxies, indicating that the upfront cost scales only with the number of domains, not with the total number of candidate mixtures. Compared against a state‑of‑the‑art regression‑based DMO method (Li et al., 2025), the merging approach achieves comparable or better ranking performance while requiring far fewer training runs (K versus tens or hundreds).
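The quality of the proxy is measured by Spearman rank correlation: only the ordering of mixtures matters, not absolute scores. For distinct scores (no ties) it reduces to ρ = 1 − 6Σd²/(n(n²−1)), where d is the per‑mixture rank difference. A self‑contained sketch of that check, with made‑up scores for illustration:

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation via the no-ties closed form
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical scores: merged-proxy accuracy vs. true mixture-trained
# accuracy for four candidate mixtures. Orderings agree, so rho = 1.
proxy_scores = [0.41, 0.55, 0.38, 0.62]
true_scores  = [0.58, 0.66, 0.52, 0.71]
assert spearman_rho(proxy_scores, true_scores) == 1.0
```

In practice one would use `scipy.stats.spearmanr`, which also handles ties; the point is that a ρ of 0.57–0.78 means the proxy reliably steers the search toward the same mixtures a full sweep would pick.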
A theoretical justification is provided via a second‑order Taylor expansion of the loss L(θ, D_w) around the expert parameters, assuming local convexity. Under this approximation, the loss of the linearly merged model equals the weighted sum of the individual expert losses, mirroring the loss that would be incurred by training on the actual mixture. This analysis explains why linear interpolation can serve as a faithful surrogate for ranking purposes.
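The argument summarized above can be reconstructed as follows (my own sketch of the derivation, following the paper's outline). The mixture loss decomposes over domains by linearity, and each domain loss is expanded to second order around its own expert θ_k, whose gradient vanishes at convergence:

```latex
% Mixture loss is a weighted sum of per-domain losses:
L(\theta, D_w) = \sum_k w_k \, L(\theta, D_k).
% Second-order expansion around the expert \theta_k,
% using \nabla L(\theta_k, D_k) \approx 0 and local convexity H_k \succeq 0:
L(\theta, D_k) \approx L(\theta_k, D_k)
  + \tfrac{1}{2}\,(\theta - \theta_k)^\top H_k\, (\theta - \theta_k).
% Evaluating at the merged model \theta_M(w) = \sum_k w_k \theta_k:
L(\theta_M(w), D_w) \approx \sum_k w_k\, L(\theta_k, D_k)
  + \tfrac{1}{2} \sum_k w_k\, \delta_k^\top H_k\, \delta_k,
\qquad \delta_k = \theta_M(w) - \theta_k.
```

When the experts lie close together in parameter space, the quadratic remainder is small and roughly mixture‑independent, so the merged model's loss tracks the weighted sum of expert losses—which is enough to preserve the *ranking* of mixtures even if absolute losses differ.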
The paper’s contributions are threefold: (1) introducing model merging as an efficient DMO proxy, (2) empirically validating its effectiveness across diverse models, domains, and data budgets, and (3) offering a theoretical perspective that grounds the empirical observations. Limitations include reliance on linear interpolation and the need for all experts to share the same architecture; future work could explore non‑linear merging, Fisher‑information‑weighted combinations, or meta‑learning of merging weights to further improve proxy fidelity.
In summary, the work demonstrates that a simple linear combination of domain‑specific fine‑tuned models can replace costly mixture training for the purpose of mixture selection. This dramatically reduces the computational barrier to DMO, enabling practitioners to explore large mixture spaces with only a handful of fine‑tuning runs, and opens the door to more scalable, adaptable multimodal LLM deployment.