Uniform Spectral Growth and Convergence of Muon in LoRA-Style Matrix Factorization

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Spectral gradient descent (SpecGD) orthogonalizes matrix parameter updates and has inspired practical optimizers such as Muon. Optimizers in this family often perform well in large language model (LLM) training, but their dynamics remain poorly understood. In the low-rank adaptation (LoRA) setting, where weight updates are parameterized as a product of two low-rank factors, we find a distinctive spectral phenomenon under Muon in LoRA fine-tuning of LLMs: the singular values of the LoRA product show near-uniform growth across the spectrum, despite orthogonalization being performed on the two factors separately. Motivated by this observation, we analyze spectral gradient flow (SpecGF), a continuous-time analogue of SpecGD, in a simplified LoRA-style matrix factorization setting and prove "equal-rate" dynamics: all singular values grow at equal rates up to small deviations. Consequently, smaller singular values attain their target values earlier than larger ones, in sharp contrast with the largest-first stepwise learning observed under standard gradient flow. Moreover, we prove that SpecGF in our setting converges to a global minimum from almost every initialization, provided the factor norms remain bounded; with $\ell_2$ regularization, we obtain global convergence. Lastly, we corroborate our theory with experiments in the same setting.


💡 Research Summary

This paper investigates a striking spectral phenomenon that emerges when the Muon optimizer is combined with Low-Rank Adaptation (LoRA) for fine-tuning large language models (LLMs). LoRA re-parameterizes the trainable weight update as a product of two low-rank matrices $A\in\mathbb{R}^{m\times r}$ and $B\in\mathbb{R}^{r\times n}$ (with $r\ll\min\{m,n\}$), dramatically reducing the number of trainable parameters. Muon, a recent optimizer inspired by spectral gradient descent (SpecGD), orthogonalizes the momentum-accumulated gradient of each matrix factor before applying it, effectively normalizing all non-zero singular values of the update to one.
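To make the orthogonalization concrete, here is a minimal NumPy sketch (not the paper's code) that maps all non-zero singular values of an update to one by taking its polar factor; practical Muon implementations approximate this step with Newton-Schulz iterations rather than an exact SVD:

```python
import numpy as np

def orthogonalize(update: np.ndarray) -> np.ndarray:
    """Return the polar factor of `update`: same singular vectors,
    but every non-zero singular value mapped to one."""
    U, _, Vt = np.linalg.svd(update, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
G = rng.standard_normal((16, 8))   # stand-in for the gradient of one LoRA factor
O = orthogonalize(G)
# All singular values of the orthogonalized update are (numerically) one.
print(np.linalg.svd(O, compute_uv=False))
```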

Empirically, the authors fine-tune two models, RoBERTa-Base on the SST-2 GLUE task and LLaMA-3.2-1B on the Alpaca dataset, using LoRA adapters of rank eight. When Muon is used, the singular values of the product $AB$ evolve almost identically: all eight trajectories remain parallel throughout training, keeping the effective rank close to eight. By contrast, the standard AdamW optimizer exhibits a "largest-first" pattern, in which larger singular values grow quickly while smaller ones lag, producing a non-uniform spectrum. This uniform growth is surprising because Muon orthogonalizes the gradients of $A$ and $B$ separately, not their product.
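The notion of "effective rank" can be quantified in several ways; one common choice (an assumption here, the paper may use a different definition) is the entropy-based effective rank of Roy and Vetterli, sketched below. A uniform spectrum like the one observed under Muon scores close to the nominal rank, while a skewed "largest-first" spectrum scores lower:

```python
import numpy as np

def effective_rank(M: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank: exp of the Shannon entropy of the
    normalized singular-value distribution."""
    s = np.linalg.svd(M, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]
    return float(np.exp(-(p * np.log(p)).sum()))

# Illustrative 8x8 spectra (not measured from the paper's experiments):
uniform = np.diag([1.0] * 8)                               # Muon-like
skewed = np.diag([8.0, 4, 2, 1, 0.5, 0.25, 0.1, 0.05])     # largest-first
print(effective_rank(uniform))  # ≈ 8.0
print(effective_rank(skewed))   # noticeably below 8
```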

To explain this behavior, the authors introduce a continuous-time analogue of SpecGD called Spectral Gradient Flow (SpecGF). In SpecGF, each factor's gradient is replaced by its orthogonalized (polar) part, so that the dynamics are
$$\dot{A} = -\,\mathrm{polar}\!\big(\nabla_A \mathcal{L}(A,B)\big), \qquad \dot{B} = -\,\mathrm{polar}\!\big(\nabla_B \mathcal{L}(A,B)\big),$$
where $\mathrm{polar}(G) = UV^\top$ for the reduced SVD $G = U\Sigma V^\top$, i.e., $G$ with all of its non-zero singular values set to one.
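As a rough sketch, the equal-rate behavior can be reproduced with a forward-Euler discretization of SpecGF (i.e., SpecGD with a small step size) on a toy low-rank factorization. The target, dimensions, step size, and step count below are illustrative choices, not the paper's experimental setup:

```python
import numpy as np

def polar_factor(G: np.ndarray) -> np.ndarray:
    """Orthogonalize G: keep its singular vectors, set non-zero singular values to one."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
m, n, r = 10, 10, 4
# Rank-r target with well-separated singular values 4 > 3 > 2 > 1.
U, _ = np.linalg.qr(rng.standard_normal((m, r)))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))
M = U @ np.diag([4.0, 3.0, 2.0, 1.0]) @ V.T

A = 0.01 * rng.standard_normal((m, r))   # LoRA-style factors, small init
B = 0.01 * rng.standard_normal((r, n))
eta = 0.02                               # Euler step size

for step in range(600):
    R = A @ B - M                        # residual of 0.5 * ||AB - M||_F^2
    A, B = (A - eta * polar_factor(R @ B.T),
            B - eta * polar_factor(A.T @ R))
    if step % 100 == 0:
        # The singular-value trajectories of AB stay nearly parallel
        # (equal-rate growth), so small targets are reached first.
        print(step, np.round(np.linalg.svd(A @ B, compute_uv=False), 3))

final_loss = 0.5 * np.linalg.norm(A @ B - M) ** 2
print("final loss:", final_loss)
```

Because each discrete step has unit spectral norm scaled by `eta`, the loss plateaus at a small step-size-dependent floor instead of decaying to exactly zero; shrinking `eta` tightens the floor, approaching the continuous-time flow.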

