MOMA: Masked Orthogonal Matrix Alignment for Zero-Additional-Parameter Model Merging


Model merging offers a scalable alternative to multi-task learning but often yields suboptimal performance on classification tasks. We attribute this degradation to a geometric misalignment between the merged encoder and the static task-specific classifier heads. Existing methods typically rely on auxiliary parameters to enforce strict representation alignment. We challenge this approach by showing that the misalignment is predominantly an orthogonal transformation, rendering such strict alignment unnecessary. Leveraging this insight, we propose MOMA (Masked Orthogonal Matrix Alignment), which rectifies the misalignment by jointly optimizing a global multi-task vector mask and task-specific orthogonal transformations. Crucially, MOMA absorbs all newly introduced parameters directly into the existing model weights, achieving performance comparable to state-of-the-art baselines with zero additional parameters and zero added inference cost.


💡 Research Summary

The paper tackles a fundamental limitation of model merging for multi‑task learning: after merging the encoders of several fine‑tuned models, the static task‑specific classifier heads often become misaligned, leading to a noticeable drop in classification accuracy. Existing remedies introduce extra parameters (e.g., adapters, alignment modules) to force the merged representations to match the fine‑tuned ones, but this contradicts the core promise of model merging—no additional storage or inference cost.

Through extensive visualizations (t‑SNE) and a K‑Nearest‑Neighbors probe, the authors demonstrate that the merged encoder still preserves class‑discriminative structure; the performance loss is not due to information loss but to a geometric drift between the merged representation space and the fixed heads. By fitting various transformation families (affine, linear, scaling, orthogonal) to map merged representations back to the fine‑tuned ones, they find that a simple orthogonal (rotation) transformation is sufficient to recover most of the lost accuracy. This observation suggests that the discrepancy is primarily a rotation that preserves inner products, rather than a complex non‑linear distortion.
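The fitting step described above, restricted to the orthogonal family, is the classical orthogonal Procrustes problem, which has a closed-form SVD solution. The sketch below (an illustration on synthetic features, not the paper's code; `Z` stands for merged-encoder representations and `H` for fine-tuned ones) shows how such a transform can be fitted and verified:

```python
import numpy as np

def fit_orthogonal(Z, H):
    """Orthogonal Procrustes: find Q with Q^T Q = I minimizing ||Z Q - H||_F.

    Z: merged-encoder features, shape (n, d)
    H: fine-tuned features, shape (n, d)
    Closed-form solution: SVD of Z^T H = U S V^T, then Q = U V^T.
    """
    U, _, Vt = np.linalg.svd(Z.T @ H)
    return U @ Vt

# Sanity check on synthetic data: recover a known rotation exactly.
rng = np.random.default_rng(0)
d = 8
Q_true, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal matrix
Z = rng.normal(size=(100, d))                      # stand-in merged representations
H = Z @ Q_true                                     # fine-tuned reps differ by a rotation
Q_hat = fit_orthogonal(Z, H)
print(np.allclose(Q_hat, Q_true))  # → True
```

If the residual ||Z Q_hat - H|| is small, as the authors observe, the drift between the two spaces is essentially a rotation, which motivates the method in the next paragraph.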

Building on this insight, the authors propose MOMA (Masked Orthogonal Matrix Alignment). MOMA jointly optimizes (1) a global binary/real-valued mask that selectively zeros out redundant dimensions of the merged encoder, and (2) a task-specific orthogonal matrix Q_t that rotates the merged representation for each task. Crucially, the orthogonal matrix can be absorbed into the existing classifier head weights: the new head becomes W′_t = W_t Q_t. The mask is applied element-wise to the merged encoder parameters, so no new tensors are introduced at inference time. Because orthogonal transformations preserve dot products, the rotated representations remain fully compatible with the original linear classifier, and the mask reduces the optimization burden, leading to stable convergence.
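The absorption trick above can be made concrete in a few lines. In this sketch (illustrative shapes and randomly generated W, Q, and mask, not the paper's learned values), applying the rotation explicitly and folding it into the head give identical logits, which is why inference needs no extra parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 16, 5                                 # feature dim and class count (hypothetical)

W = rng.normal(size=(k, d))                  # original task-specific head
Q, _ = np.linalg.qr(rng.normal(size=(d, d))) # stand-in for the learned orthogonal Q_t
m = (rng.random(d) > 0.2).astype(float)      # stand-in for the learned global mask

z = rng.normal(size=d)                       # merged-encoder feature for one input
z_masked = m * z                             # mask applied element-wise
                                             # (folded into encoder weights in practice)

# Explicit form: rotate the representation, then classify.
logits_explicit = W @ (Q @ z_masked)

# Absorbed form: fold Q into the head once (W' = W Q), then classify directly.
W_prime = W @ Q
logits_absorbed = W_prime @ z_masked

print(np.allclose(logits_explicit, logits_absorbed))  # → True
```

Since W′ has the same shape as W, the rotation costs nothing at inference; only the one-time matrix product W Q is needed after training.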

Empirical evaluation spans several NLP benchmarks (DBpedia, AG News, Yelp) and vision datasets (CIFAR‑100, ImageNet‑subset). MOMA consistently matches or exceeds state‑of‑the‑art merging baselines such as Weight Averaging, Task Arithmetic, Surgery, and Ties, while incurring zero additional parameters and zero extra FLOPs. The K‑NN experiments further confirm that the merged encoder retains rich semantic information; the orthogonal alignment simply unlocks this information for the fixed heads.
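A K-NN probe of the kind used in these experiments is straightforward to sketch: classify each feature by majority vote among its nearest neighbors in feature space, so that high accuracy indicates the encoder still separates classes even when the fixed heads do not. The example below runs the probe on synthetic, well-separated clusters (a stand-in for real encoder features; function name and data are illustrative, not from the paper):

```python
import numpy as np

def knn_probe_accuracy(train_X, train_y, test_X, test_y, k=5):
    """Classify each test feature by majority vote among its k nearest
    training features (Euclidean distance); return the accuracy."""
    correct = 0
    for x, y in zip(test_X, test_y):
        dists = np.linalg.norm(train_X - x, axis=1)
        nn_labels = train_y[np.argsort(dists)[:k]]
        pred = np.bincount(nn_labels).argmax()
        correct += int(pred == y)
    return correct / len(test_y)

# Two cleanly separated Gaussian clusters as a toy feature space.
rng = np.random.default_rng(2)
X0 = rng.normal(loc=0.0, size=(50, 4))
X1 = rng.normal(loc=5.0, size=(50, 4))
train_X = np.vstack([X0[:40], X1[:40]])
train_y = np.array([0] * 40 + [1] * 40)
test_X = np.vstack([X0[40:], X1[40:]])
test_y = np.array([0] * 10 + [1] * 10)

acc = knn_probe_accuracy(train_X, train_y, test_X, test_y)
print(acc)  # near-perfect on well-separated clusters
```

Because K-NN depends only on local geometry, not on any classifier head, a merged encoder that scores well under this probe has preserved class structure; the accuracy drop must then come from head misalignment, which is exactly the paper's argument.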

The contributions are threefold: (1) identifying that encoder‑head misalignment after merging is predominantly orthogonal, thereby questioning the necessity of strict representation alignment; (2) introducing a zero‑parameter merging framework that corrects this misalignment via a mask and orthogonal rotations; (3) providing extensive experiments that demonstrate state‑of‑the‑art performance without any storage or runtime overhead.

Future work could explore richer linear transformations (scaling + rotation), automatic mask discovery for heterogeneous architectures, and unsupervised alignment objectives that do not rely on task labels. Overall, MOMA offers a practical, theoretically grounded solution that preserves the efficiency ethos of model merging while delivering near‑fine‑tuned performance across diverse tasks.

