AdaRank: Adaptive Rank Pruning for Enhanced Model Merging

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Model merging has emerged as a promising approach for unifying independently fine-tuned models into an integrated framework, significantly enhancing computational efficiency in multi-task learning. Recently, several SVD-based techniques have been introduced to exploit low-rank structures for enhanced merging, but their reliance on manually designed rank selection often leads to cross-task interference and suboptimal performance. In this paper, we propose AdaRank, a novel model merging framework that adaptively selects the most beneficial singular directions of task vectors to merge multiple models. We empirically show that the dominant singular components of task vectors can cause critical interference with other tasks, and that naive truncation across tasks and layers degrades performance. In contrast, AdaRank dynamically prunes the singular components that cause interference and offers an optimal amount of information to each task vector by learning to prune ranks at test time via entropy minimization. Our analysis demonstrates that this method mitigates detrimental overlaps among tasks, while empirical results show that AdaRank consistently achieves state-of-the-art performance across various backbones and numbers of tasks, reducing the performance gap with individually fine-tuned models to nearly 1%.


💡 Research Summary

Model merging aims to combine several independently fine‑tuned models into a single network, thereby saving memory and inference cost while preserving multi‑task performance. Recent work has shown that representing task vectors (the difference between a fine‑tuned model and its pre‑trained backbone) in the spectral domain via Singular Value Decomposition (SVD) can reduce inter‑task interference compared with element‑wise pruning. However, all existing SVD‑based approaches rely on a heuristic “top‑k” truncation: they keep the k largest singular components for every task and every layer. The authors of this paper identify two critical flaws in that heuristic. First, the singular components with the largest singular values, while most informative for their own task, often cause the greatest interference with other tasks, sometimes increasing the overall multi‑task loss. Second, the intrinsic rank required to capture a sufficient amount of spectral energy varies dramatically across tasks and across layers; a fixed k cannot accommodate this variability. Consequently, naïve top‑k truncation either discards essential information for some tasks or retains unnecessary components that harm others.
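To make the criticized heuristic concrete, the following minimal NumPy sketch (function and variable names are illustrative, not from the paper) applies the same fixed top-k truncation to a task-vector matrix, regardless of which task or layer it came from:

```python
import numpy as np

def topk_truncate(task_vector: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k largest singular components of a task vector.

    This is the fixed top-k heuristic the paper argues against: the same
    k is applied to every task and every layer.
    """
    U, s, Vt = np.linalg.svd(task_vector, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
tau = rng.standard_normal((8, 6))      # a toy task-vector matrix
approx = topk_truncate(tau, k=2)

# The truncated matrix has rank at most k, whatever the task needed.
assert approx.shape == (8, 6)
assert np.linalg.matrix_rank(approx) <= 2
```

If one task's layer needs rank 5 and another's needs rank 1, a single shared k serves neither well, which is exactly the variability the paper highlights.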

AdaRank (Adaptive Rank Pruning) addresses these issues by introducing a learnable binary mask for each singular component of every task vector in each layer. Each entry of the mask Bℓᵢ ∈ {0,1}ᵐ (where m = min(d, d′) for a weight matrix of shape d×d′) determines whether the corresponding singular component is kept (1) or pruned (0). With these masks the merged weight for layer ℓ becomes

θℓ(Bℓ) = θℓ⁰ + λℓ ∑ᵢ Uℓᵢ (diag(Bℓᵢ) Σℓᵢ) Vℓᵢᵀ,

where Uℓᵢ, Σℓᵢ, Vℓᵢ are the left singular vectors, the diagonal matrix of singular values, and the right singular vectors of the task vector τℓᵢ. Setting all entries of Bℓᵢ to 1 recovers full-rank Task Arithmetic; setting the first k entries to 1 and the rest to 0 reproduces the conventional top-k SVD method. AdaRank can therefore express any combination of singular components, allowing per-task and per-layer adaptive rank selection.
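The masked merge above can be sketched as follows; `adarank_merge` is a hypothetical helper written for illustration, not the authors' implementation, and a scalar λ stands in for the per-layer coefficients:

```python
import numpy as np

def adarank_merge(theta0, task_vectors, masks, lam=0.3):
    """Merge task vectors into the backbone, keeping only the singular
    components whose mask entry is 1 (illustrative sketch)."""
    merged = theta0.copy()
    for tau, b in zip(task_vectors, masks):
        U, s, Vt = np.linalg.svd(tau, full_matrices=False)
        merged += lam * (U @ np.diag(b * s) @ Vt)  # diag(B) Σ applied
    return merged

rng = np.random.default_rng(1)
theta0 = rng.standard_normal((5, 4))
taus = [rng.standard_normal((5, 4)) for _ in range(2)]
m = min(5, 4)

# All-ones masks recover full-rank Task Arithmetic ...
full = adarank_merge(theta0, taus, [np.ones(m)] * 2)
assert np.allclose(full, theta0 + 0.3 * sum(taus))

# ... and a "first-k ones" mask reproduces conventional top-k SVD merging.
k = 2
topk_mask = np.array([1.0] * k + [0.0] * (m - k))
pruned = adarank_merge(theta0, taus, [topk_mask] * 2)

def topk_trunc(tau, k):
    U, s, Vt = np.linalg.svd(tau, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

assert np.allclose(pruned, theta0 + 0.3 * sum(topk_trunc(t, k) for t in taus))
```

Any other 0/1 pattern, e.g. pruning only the largest component, is equally expressible, which is the point of the formulation.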

The central challenge is how to choose B without access to any labeled training data (the whole point of model merging is to avoid retraining). AdaRank adopts test‑time adaptation: it minimizes the sum of Shannon entropies of the model’s predictions on unlabeled test samples for each task. Entropy minimization has been shown to correlate strongly with supervised loss in multi‑task settings, making it a suitable surrogate objective. Formally, AdaRank solves

min_B ∑_{i∈tasks} ∑_{x∈Dᵢ} H(f(θ(B), x)),

where Di is an unlabeled test set for task i and H denotes the entropy of the softmax output. The binary masks are optimized using the Straight‑Through Estimator (STE). During the forward pass, mask entries are rounded to 0 or 1; during the backward pass, they are treated as continuous sigmoid‑scaled parameters, allowing gradients from the entropy loss to flow to the mask variables. After convergence, the masks are fixed and the final merged model is constructed.
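The two ingredients of this procedure, the entropy surrogate and the straight-through rounding of the mask, can be sketched in NumPy as follows. These are hypothetical helpers for illustration only; a real implementation would run inside an autodiff framework so that entropy gradients flow back to the mask logits:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mean_entropy(logits):
    """Shannon entropy of the softmax outputs, averaged over a batch of
    unlabeled samples -- the label-free surrogate loss."""
    p = softmax(logits)
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())

def ste_mask(mask_logits):
    """Straight-through forward pass: round sigmoid(logits) to a hard
    0/1 mask. In the backward pass the rounding is treated as the
    identity, so gradients reach the continuous logits."""
    soft = 1.0 / (1.0 + np.exp(-mask_logits))
    return (soft > 0.5).astype(np.float64)

# Confident predictions have low entropy; uniform ones are maximal.
confident = np.array([[10.0, 0.0, 0.0]])
uniform = np.zeros((1, 3))
assert mean_entropy(confident) < mean_entropy(uniform)
assert np.allclose(mean_entropy(uniform), np.log(3), atol=1e-6)

# Positive logits keep a component, negative logits prune it.
assert ste_mask(np.array([3.0, -2.0])).tolist() == [1.0, 0.0]
```

Minimizing `mean_entropy` of the merged model's predictions pushes the mask logits toward component subsets that yield confident, and empirically more accurate, multi-task predictions.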

Extensive experiments are conducted on vision and language backbones (ViT‑B/32, ResNet‑50, BERT‑base) across 4–12 downstream tasks, including image classification (MNIST, SVHN, SUN397, EuroSAT, DTD, RESISC‑45, GTSRB) and natural‑language tasks. AdaRank is compared against Task Arithmetic, SVD‑k, CAR‑T, AdaMerging, and recent router‑based adaptive merging methods. Results show that AdaRank consistently narrows the performance gap to individually fine‑tuned models, achieving 1–2 percentage‑point improvements in accuracy/F1 over the best prior SVD‑based baselines. The gains are especially pronounced for complex tasks that require higher intrinsic rank, confirming that adaptive rank selection is crucial. Moreover, AdaRank introduces no additional trainable parameters, yet matches or exceeds the performance of parameter‑heavy router approaches, highlighting its efficiency.

Ablation studies reveal that the learned masks tend to prune high‑interference singular components (often those with the largest singular values) while preserving mid‑range components that contribute positively across tasks. Layer‑wise analysis shows that early layers retain more components (reflecting shared low‑level features), whereas later layers prune aggressively, aligning with the intuition that task‑specific representations are more prone to interference.

In summary, AdaRank demonstrates that (1) the magnitude of singular values is not a reliable indicator of usefulness in a multi‑task merger; (2) task‑ and layer‑specific rank requirements must be respected; and (3) unsupervised entropy minimization at test time provides a practical, label‑free signal for selecting the optimal subset of singular components. By integrating a simple binary‑mask mechanism into existing SVD‑based pipelines, AdaRank offers a versatile, low‑overhead solution that substantially improves model merging across diverse architectures and tasks, paving the way for scalable, multi‑task deployment without costly retraining.

