MARS-M: When Variance Reduction Meets Matrices
Matrix-based preconditioned optimizers, such as Muon, have recently been shown to be more efficient than scalar-based optimizers for training large-scale neural networks, including large language models (LLMs). Recent benchmark studies of LLM pretraining optimizers have demonstrated that variance-reduction techniques such as MARS can substantially speed up training compared with standard optimizers that do not employ variance reduction. In this paper, we introduce MARS-M, a new optimizer that integrates MARS-style variance reduction with Muon. Under standard regularity conditions, we prove that MARS-M converges to a first-order stationary point at a rate of $\tilde{\mathcal{O}}(T^{-1/3})$, improving upon the $\tilde{\mathcal{O}}(T^{-1/4})$ rate attained by Muon. Empirical results on language modeling and computer vision tasks demonstrate that MARS-M consistently yields lower losses and improved performance across various downstream benchmarks. The implementation of MARS-M is available at https://github.com/AGI-Arena/MARS/tree/main/MARS_M.
💡 Research Summary
The paper introduces MARS‑M, a novel optimizer that fuses the variance‑reduction technique MARS with the matrix‑based preconditioned optimizer Muon (and its lightweight variant Moonlight). The authors first motivate the need for combining these two strands of recent research: matrix‑based optimizers preserve the two‑dimensional structure of weight matrices, exploiting singular‑value information via SVD and Newton‑Schulz iterations, while variance‑reduction methods such as STORM (and its scaled version MARS) mitigate the high stochastic noise that slows down convergence in large‑scale training.
MARS‑M is constructed by inserting the MARS‑scaled gradient correction into the Moonlight update rule. Specifically, at iteration t the corrected gradient is
cₜ = ∇f(Xₜ, ξₜ) + γₜ·β/(1−β)·(∇f(Xₜ, ξₜ) − ∇f(Xₜ₋₁, ξₜ)),
which is then clipped to unit norm, accumulated into a momentum term mₜ with coefficient β, and finally used to update the parameter matrix Xₜ via
Xₜ₊₁ = Xₜ − ηₜ·(0.2·√(max(m,n))·Oₜ + λXₜ),
where Oₜ ≈ UₜVₜᵀ is the orthogonalized momentum, obtained via Newton–Schulz iterations as an approximation to the UVᵀ factor from the SVD of the momentum matrix, and m, n are the dimensions of the parameter matrix. The scaling factor 0.2·√(max(m,n)) and the weight decay λ are inherited from Moonlight and are crucial for balancing update magnitudes between matrix-shaped and vector-shaped parameters in LLMs.
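The update above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the Newton–Schulz coefficients are the quintic ones popularized by Muon, the scale is taken as Moonlight's 0.2·√(max(m,n)), and the hyper-parameter values (β, γ, learning rate, weight decay) are placeholders rather than the paper's settings.

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G, i.e. push it toward the U V^T factor
    of its SVD, using the quintic iteration popularized by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)       # normalize so singular values lie in (0, 1]
    tall = G.shape[0] > G.shape[1]
    if tall:                                # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if tall else X

def mars_m_step(X, grad, prev_grad, m, lr=0.02, beta=0.95, gamma=0.025, wd=0.1):
    """One MARS-M update (exact variant: grad and prev_grad are evaluated on
    the same sample xi_t). Hyper-parameter values here are illustrative."""
    c = grad + gamma * beta / (1.0 - beta) * (grad - prev_grad)  # variance-reduced correction
    nrm = np.linalg.norm(c)
    if nrm > 1.0:                           # clip corrected gradient to unit Frobenius norm
        c = c / nrm
    m = beta * m + (1.0 - beta) * c         # momentum accumulation
    O = newton_schulz(m)                    # O ~ U V^T of the momentum matrix
    scale = 0.2 * np.sqrt(max(X.shape))     # Moonlight's RMS-matching factor
    return X - lr * (scale * O + wd * X), m
```

As a sanity check, `newton_schulz` applied to a random matrix should return a matrix whose singular values cluster near 1, which is exactly the "ignore singular-value magnitudes, keep directions" behavior that motivates Muon-style updates.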
Theoretical contributions include a convergence proof under standard smoothness, bounded variance, and bounded second‑moment assumptions. The analysis shows that the variance‑reduction scaling γₜ drives the stochastic gradient variance down to O(1/T), while the matrix preconditioner yields an effective curvature approximation. Together they lead to an expected squared gradient norm decreasing at a rate of Õ(T⁻¹/³), improving upon Muon’s previously established Õ(T⁻¹/⁴) rate. The proof leverages a recursive inequality that captures both the variance‑reduction effect and the contraction induced by the preconditioner, and it carefully handles the additional scaling and weight‑decay terms.
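Stated compactly (a paraphrase of the claimed guarantee; the precise constants, the norm used, and the hidden logarithmic factors are in the paper), the rate comparison is:

```latex
\min_{1 \le t \le T} \mathbb{E}\big[\|\nabla f(X_t)\|\big]
\;\le\; \tilde{\mathcal{O}}\big(T^{-1/3}\big)
\quad \text{for MARS-M,}
\qquad \text{vs.} \qquad
\tilde{\mathcal{O}}\big(T^{-1/4}\big)
\quad \text{for Muon.}
```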
Empirically, the authors evaluate MARS‑M on three GPT‑2 variants (small, medium, large) trained on OpenWebText and FineWeb‑Edu 100B datasets. Across all model sizes, MARS‑M achieves lower training loss (3–5 % reduction) and better validation perplexity than Muon, Moonlight, AdamW, and Lion under identical hyper‑parameter settings. Downstream zero‑shot benchmarks (Hellaswag, SciQ) show accuracy gains of 1.8 and 2.3 points respectively. In computer‑vision experiments on CIFAR‑10/100, MARS‑M improves top‑1 accuracy by roughly 1 % relative to baselines.
The paper also proposes an approximate variant that reuses the previous‑step gradient ∇f(Xₜ₋₁, ξₜ₋₁) instead of recomputing ∇f(Xₜ₋₁, ξₜ), thereby cutting the number of gradient evaluations by about 30 % with negligible performance loss. A detailed comparison with the earlier MARS‑Shampoo optimizer highlights two key differences: (1) MARS‑M scales the preconditioner by 0.2·p_{max(m,n)} and adds weight decay, which are essential for large‑scale LLM training; (2) MARS‑Shampoo directly multiplies UₜVₜᵀ without these adjustments and underperforms in practice.
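The computational difference between the two variants can be sketched with a toy loop that counts gradient evaluations. This is an illustration under assumed hyper-parameters, not the authors' code; the exact variant here performs two stochastic-gradient evaluations per step while the approximate variant performs one (the paper's ~30 % figure reflects practical training cost, where the gradient step is only part of the per-iteration work).

```python
import numpy as np

def run(approximate, T=10, dim=4, seed=0):
    """Toy MARS-style loop on f(X) = 0.5 * ||X||^2 with noisy gradients.
    Returns the total number of stochastic-gradient evaluations."""
    rng = np.random.default_rng(seed)
    evals = {"n": 0}

    def stoch_grad(X, xi):
        evals["n"] += 1
        return X + 0.01 * xi            # gradient of f plus sample noise

    beta, gamma, lr = 0.95, 0.025, 0.1  # illustrative values
    X = rng.standard_normal((dim, dim))
    X_prev, m, g_prev = X.copy(), np.zeros_like(X), np.zeros_like(X)
    for _ in range(T):
        xi = rng.standard_normal((dim, dim))
        g = stoch_grad(X, xi)
        if approximate:
            ref = g_prev                     # reuse grad(X_{t-1}, xi_{t-1}): 1 eval/step
        else:
            ref = stoch_grad(X_prev, xi)     # recompute grad(X_{t-1}, xi_t): 2 evals/step
        c = g + gamma * beta / (1.0 - beta) * (g - ref)
        nrm = np.linalg.norm(c)
        if nrm > 1.0:
            c = c / nrm
        m = beta * m + (1.0 - beta) * c
        X_prev, g_prev = X.copy(), g
        X = X - lr * m                       # orthogonalization step omitted for brevity
    return evals["n"]
```

Running `run(approximate=True)` performs T evaluations versus 2T for `run(approximate=False)`, which is the trade-off the approximate variant exploits.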
In summary, MARS‑M successfully integrates matrix‑based preconditioning with modern variance reduction, delivering both a theoretically faster convergence rate and consistent empirical improvements on language‑modeling and vision tasks. The work opens several avenues for future research, including coupling with more sophisticated preconditioners such as PolarGrad, sharpening the non‑convex analysis (e.g., under adaptive batch sizes), and scaling to LLMs with tens of billions of parameters.