Insights on Muon from Simple Quadratics
Muon updates weight matrices along (approximate) polar factors of the gradients and has shown strong empirical performance in large-scale training. Existing attempts at explaining its performance largely focus on single-step comparisons (on quadratic proxies) and worst-case guarantees that treat the inexactness of the polar factor as a nuisance "to be argued away". We show that already on simple strongly convex functions such as $L(W)=\frac12\|W\|_{\text{F}}^2$, these perspectives are insufficient, suggesting that understanding Muon requires going beyond local proxies and pessimistic worst-case bounds. Instead, our analysis exposes two observations that already affect behavior on simple quadratics and are not well captured by prevailing abstractions: (i) approximation error in the polar step can qualitatively alter discrete-time dynamics and improve reachability and finite-time performance, an effect practitioners exploit to tune Muon but that existing theory largely treats as a pure accuracy compromise; and (ii) structural properties of the objective affect finite-budget constants beyond the prevailing conditioning-based explanations. Thus, any general theory covering these cases must either incorporate these ingredients explicitly or explain why they are irrelevant in the regimes of interest.
💡 Research Summary
The paper investigates the recently popular Muon optimizer—Momentum Orthogonalized by Newton–Schulz—by reducing its dynamics to the simplest possible setting: a smooth, strongly convex quadratic loss L(W)=½‖W‖_F². Muon augments Nesterov‑style momentum with a low‑degree Newton–Schulz iteration that approximates the polar factor (matrix sign) of the momentum matrix, thereby projecting the update direction onto (approximately) the Stiefel manifold before taking a step. Existing theory typically assumes either exact polar factorization or treats the inexactness as a mere cost‑accuracy trade‑off, and often evaluates Muon by comparing per‑iteration loss decrease on local quadratic proxies. The authors ask three fundamental questions: (1) Does Muon converge to the global minimizer of a smooth, strongly convex function when used with the non‑vanishing step sizes common in practice? (2) Can an approximate polar step ever improve dynamics compared to an exact one? (3) Does per‑step superiority guarantee faster end‑to‑end convergence?
Exact polar factor with fixed step size – grid confinement.
When the polar step is computed exactly and a constant learning rate α>0 is used, the update reduces to W_{t+1}=W_t−α P(W_t), where P(·) denotes the polar factor; this follows because ∇L(W)=W for this loss. Writing W_t=U S_t Vᵀ (SVD) shows that U and V stay fixed while each singular value follows the scalar recursion s_{t+1}=s_t−α sign(s_t). This is a 1‑D “sign‑GD” that moves on the lattice s₀+αℤ. Unless every initial singular value happens to be an integer multiple of α, so that its lattice contains zero (a measure‑zero event), the iterates become trapped in a two‑point cycle around zero and the loss stays bounded away from zero. Hence even on the simplest strongly convex quadratic, Muon with an exact polar step and a constant step size fails to converge; the only way to guarantee convergence is to let the effective step size vanish. This phenomenon, termed “grid confinement”, directly contradicts the intuition that a better per‑step direction automatically yields global convergence.
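The scalar recursion above is easy to simulate. The following is a minimal sketch, not the authors' code; the function names `sign_gd_trajectory` and `confinement_floor` and the values s₀=1.03, α=0.1 are illustrative assumptions.

```python
import numpy as np

def sign_gd_trajectory(s0, alpha, steps):
    """Iterate s_{t+1} = s_t - alpha * sign(s_t) and return all iterates."""
    traj = np.empty(steps + 1)
    traj[0] = s0
    for t in range(steps):
        traj[t + 1] = traj[t] - alpha * np.sign(traj[t])
    return traj

def confinement_floor(s0, alpha):
    """Distance from zero to the nearest point of the lattice s0 + alpha*Z."""
    r = abs(s0) % alpha
    return min(r, alpha - r)

# Hypothetical example values: s0 is not an integer multiple of alpha.
traj = sign_gd_trajectory(s0=1.03, alpha=0.1, steps=200)
print("last iterates:", traj[-4:])
print("smallest |s_t| seen:", np.min(np.abs(traj)))
print("predicted floor:", confinement_floor(1.03, 0.1))
```

With these values the iterates descend to roughly 0.03 and then alternate between about 0.03 and −0.07, so |s_t| never drops below the lattice distance min(r, α−r), where r = s₀ mod α.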
Inexact polar steps can be beneficial.
The authors then deliberately introduce approximation error into the polar step, either by adding stochastic perturbations or by limiting Newton–Schulz to one or two iterations. Surprisingly, a moderate amount of error breaks the lattice structure, so the singular values are no longer locked into the two‑point cycle and can continue decreasing. Empirically, the relationship between perturbation magnitude and iteration count is non‑monotonic: too little error leaves the grid confinement intact, too much error destroys the useful alignment of the update direction, and an intermediate error level yields the fastest convergence on the toy quadratic. This demonstrates that approximation error is not merely a nuisance; it can act as a constructive algorithmic ingredient that reshapes the discrete‑time dynamics.
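To make "limiting Newton–Schulz iterations" concrete, here is a hedged sketch that compares a truncated iteration against the exact polar factor. It assumes the classical cubic map X ← 1.5X − 0.5XXᵀX after Frobenius normalisation; practical Muon implementations use a tuned polynomial instead, and the summary does not say which variant the paper uses. The function names and the 4×4 random test matrix are illustrative.

```python
import numpy as np

def exact_polar(M):
    """Orthogonal polar factor U V^T computed from a thin SVD."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def newton_schulz_polar(M, steps):
    """Approximate the polar factor with `steps` cubic Newton-Schulz updates.

    Assumption: classical cubic iteration, not the tuned polynomial used in
    practical Muon code.
    """
    X = M / np.linalg.norm(M)  # Frobenius norm, so all singular values are <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))  # hypothetical test matrix
errors = {
    k: np.linalg.norm(newton_schulz_polar(M, k) - exact_polar(M))
    for k in (1, 2, 5)
}
for k, err in errors.items():
    print(f"{k} step(s): Frobenius error vs exact polar = {err:.4f}")
```

Each cubic step maps a singular value x in (0, 1] to 1.5x − 0.5x³, which is about 1.5x for small x. After only one or two steps, directions with small normalised singular values therefore receive a visibly damped update, unlike the exact polar step, which moves every direction by the same amount.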
Spectral shape matters more than condition number.
A common narrative in the Muon literature is that the method should be “condition‑number insensitive” and therefore outperform gradient descent (GD) on ill‑conditioned problems. The authors test this claim by generating families of quadratic objectives that share the same condition number κ but differ in the distribution of singular values (e.g., rapidly decaying versus flat spectra). Their experiments reveal that Muon’s advantage over GD flips depending on the spectral shape: for some spectra Muon reaches a target loss in fewer iterations, while for others GD is strictly faster. Consequently, the condition number alone does not predict performance; finer spectral properties (such as the gap between large and small singular values) dominate the constant factors that matter in finite‑budget regimes.
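The point that κ alone does not fix finite-budget iteration counts can be illustrated on the GD side with a diagonal quadratic. This is a sketch under stated assumptions, not the paper's experiment: the objective f(x)=½Σλᵢxᵢ², the all-ones start, the step size 1/λ_max, the two spectra, and the 10⁻³ relative target are all chosen here for illustration, and Muon is not simulated.

```python
import numpy as np

def gd_iterations_to_target(eigs, rel_tol=1e-3, max_iters=100_000):
    """Count GD steps on f(x) = 0.5 * sum(eigs * x**2) until f <= rel_tol * f(x0)."""
    eigs = np.asarray(eigs, dtype=float)
    x = np.ones_like(eigs)       # assumed starting point
    step = 1.0 / eigs.max()      # standard 1/L step size

    def loss(v):
        return 0.5 * np.sum(eigs * v**2)

    target = rel_tol * loss(x)
    for t in range(max_iters):
        if loss(x) <= target:
            return t
        x = x - step * eigs * x  # gradient of f is eigs * x
    return max_iters

kappa, n = 100.0, 50
# Two hypothetical spectra with the same condition number kappa.
mostly_large = np.r_[np.full(n - 1, kappa), 1.0]
mostly_small = np.r_[kappa, np.full(n - 1, 1.0)]

iters_large = gd_iterations_to_target(mostly_large)
iters_small = gd_iterations_to_target(mostly_small)
print("mostly large eigenvalues:", iters_large, "iterations")
print("mostly small eigenvalues:", iters_small, "iterations")
```

Both spectra have κ=100, yet the first reaches the target almost immediately because nearly all of the initial loss sits in directions that GD resolves in one step, while the second needs hundreds of iterations.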
One‑step superiority is misleading.
To probe whether per‑iteration improvement translates into overall speed, the authors construct a greedy policy that, at each iteration, selects the update (either a plain gradient step or a polar/Stiefel step) that yields the larger immediate loss reduction, using the optimal step size for each. Even though the policy always picks the polar step (which is locally optimal on the quadratic), the full trajectory of GD still reaches the same loss level in fewer steps. This counter‑example shows that local, one‑step superiority does not guarantee faster end‑to‑end convergence, especially when the dynamics are nonlinear and the step size is fixed.
Implications and future directions.
The paper’s contributions can be summarized as follows:
- Negative result: Exact polar factorization with a constant step size can cause grid confinement, preventing convergence even on strongly convex quadratics.
- Positive role of inexactness: Moderate approximation error can break the confinement and accelerate convergence; the relationship between error magnitude and speed is non‑monotonic.
- Spectral‑dependent constants: Performance differences between Muon and GD are governed by detailed spectral shape rather than the condition number alone.
- Limitations of per‑step analysis: One‑step improvement does not imply overall speedup, highlighting the need for analyses that capture the full discrete‑time trajectory.
These findings challenge the prevailing theoretical frameworks that rely on (i) local quadratic proxies and per‑step improvement arguments, and (ii) worst‑case bounds that degrade monotonically with polar approximation error. The authors argue that any comprehensive theory of Muon must explicitly incorporate (a) the constructive effects of inexact polar steps, and (b) problem‑dependent spectral constants that affect finite‑budget performance. Future work should aim to (i) characterize optimal levels of approximation error, (ii) develop convergence guarantees that allow non‑vanishing step sizes while accounting for spectral structure, and (iii) design adaptive schemes that balance error, step size, and momentum to exploit the beneficial dynamics uncovered in this study.