Muon in Associative Memory Learning: Training Dynamics and Scaling Laws
Muon updates matrix parameters via the matrix sign of the gradient and has shown strong empirical gains, yet its dynamics and scaling behavior remain theoretically unclear. We study Muon in a linear associative memory model with softmax retrieval and a hierarchical frequency spectrum over query-answer pairs, with and without label noise. In this setting, we show that Gradient Descent (GD) learns frequency components at highly imbalanced rates, so convergence is bottlenecked by the low-frequency components. In contrast, the Muon optimizer mitigates this imbalance, making progress faster and more uniform across components. Specifically, in the noiseless case, Muon achieves an exponential speedup over GD; in the noisy case with a power-decay frequency spectrum, we derive Muon’s optimization scaling law and demonstrate its superior scaling efficiency over GD. Furthermore, we show that Muon can be interpreted as an implicit matrix preconditioner arising from adaptive task alignment and a block-symmetric gradient structure. By contrast, a preconditioner built on the coordinate-wise sign operator could match Muon only with oracle access to the unknown task representations, which is infeasible for SignGD in practice. Experiments on synthetic long-tail classification and LLaMA-style pre-training corroborate the theory.
💡 Research Summary
The paper provides a rigorous theoretical analysis of the Muon optimizer, a recent method that updates matrix‑valued parameters using the matrix sign of the gradient. While Muon has demonstrated strong empirical gains in large‑scale language model pre‑training, its dynamics and scaling behavior have remained poorly understood. To address this gap, the authors study Muon within a linear associative memory model equipped with a softmax retrieval layer and a hierarchical frequency spectrum over query‑answer pairs, considering both noiseless and noisy label settings.
The model assumes K orthogonal, unit‑norm embeddings for K knowledge items, grouped into M frequency tiers of size C (K = M·C). Each tier i has a uniform frequency p_i = r_{p_i}·C with r_{p_1} > … > r_{p_M} > 0, mimicking the head‑tail distribution observed in real LLM data. The learning objective is the cross‑entropy loss over the softmax probabilities p(i|j;W) = exp(E_i^T W E_j)/∑_k exp(E_k^T W E_j). Two optimizers are compared: standard Gradient Descent (GD) and Muon (with momentum disabled), where Muon updates W ← W − η·msgn(∇L), with msgn(·) the matrix‑sign operator obtained from the SVD of the gradient (msgn(G) = UVᵀ for G = UΣVᵀ).
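With orthonormal embeddings, the logit E_i^T W E_j reduces to the single entry W[i, j], so the retrieval distribution and loss take a compact form. A minimal sketch under simplifying assumptions (identity embeddings E = I, the answer for query j taken to be item j itself, and uniform query frequencies purely for illustration):

```python
import numpy as np

K = 8
p = np.ones(K) / K                         # query frequencies (uniform here for illustration)
W = np.zeros((K, K))                       # with E = I, E_i^T W E_j = W[i, j]

# softmax over answers i for each query j (columns), numerically stabilized
logits = W - W.max(axis=0, keepdims=True)
probs = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)

# frequency-weighted cross-entropy on the correct (i = j) retrievals
loss = -(p * np.log(probs[np.arange(K), np.arange(K)])).sum()
print(loss)  # at W = 0 retrieval is uniform, so the loss equals log K
```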
Noiseless case (α = 0). The authors decompose the gradient into a product of query frequency, prediction residual, and the outer product of answer and query embeddings. They prove that GD learns each frequency tier at a rate L_{GD}^{(j)}(t) ≍ 1/(p_j t), so high‑frequency tiers converge quickly while low‑frequency tiers dominate the overall convergence, yielding total loss L_{GD}(t) ≍ K/t. In contrast, Muon’s matrix‑sign operation normalizes singular values to ±1 while preserving singular vectors, effectively equalizing progress across tiers. Consequently every tier enjoys exponential decay L_{Muon}^{(j)}(t) ≍ exp(−c t), and the total loss decays as L_{Muon}(t) ≍ K·exp(−c t), an exponential speed‑up over GD.
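The contrast between the two update rules can be reproduced in a small simulation of the noiseless setting (identity embeddings, two frequency tiers; the step sizes, tier frequencies, and horizon below are illustrative choices, not the paper's):

```python
import numpy as np

K, C = 8, 4                                   # K items: head tier of size C, tail tier of size K - C
p = np.array([0.22] * C + [0.03] * (K - C))   # tiered query frequencies, summing to 1
T, eta_gd, eta_muon = 300, 1.0, 0.3           # illustrative horizon and step sizes

def loss_and_grad(W):
    # column j holds the logits for query j; the correct answer is item j itself
    logits = W - W.max(axis=0, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
    per_query = -np.log(probs[np.arange(K), np.arange(K)])
    grad = (probs - np.eye(K)) * p            # grad[:, j] = p_j * (softmax - e_j)
    return per_query, grad

def msgn(G):
    # matrix sign via SVD: keep singular vectors, set singular values to 1
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def run(use_muon):
    W = np.zeros((K, K))
    for _ in range(T):
        _, G = loss_and_grad(W)
        W -= eta_muon * msgn(G) if use_muon else eta_gd * G
    return loss_and_grad(W)[0]

gd, muon = run(False), run(True)
print("GD   total:", (p * gd).sum(), "head/tail per-query:", gd[:C].mean(), gd[C:].mean())
print("Muon total:", (p * muon).sum(), "head/tail per-query:", muon[:C].mean(), muon[C:].mean())
```

In line with the analysis, GD's tail tier lags far behind its head tier (the 1/(p_j t) rates), while Muon's normalized update drives all tiers down at a comparable, much faster pace.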
Noisy case (0 < α < 1). Label noise flips the true answer with probability α, attenuating the learning signal. The authors identify three Muon training phases: an initial rapid drop driven by high‑frequency tiers, a middle regime where lower tiers begin to improve, and a final asymptotic regime where loss scales as \tilde{O}(1/T²). They show a speed‑up factor of Ω(p C^q), indicating that larger knowledge groups amplify Muon’s advantage.
Scaling laws. Assuming a power‑law frequency spectrum p_i ∝ i^{−β} with β > 1, they derive that GD’s loss lower bound scales as \tilde{Ω}(1/T^{1−1/β}), which deteriorates for larger β. Muon, however, achieves an upper bound of \tilde{O}(1/T²) independent of β, demonstrating a universally steeper scaling exponent.
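Plugging a few values of β into the quoted bounds makes the gap concrete: GD's exponent 1 − 1/β saturates below 1 as β grows, while Muon's exponent stays at 2 (a purely arithmetic illustration; the horizon T is an arbitrary choice):

```python
T = 10**5  # illustrative training horizon
for beta in (1.5, 2.0, 4.0):
    gd_exp = 1 - 1 / beta                  # GD lower bound ~ 1/T^(1 - 1/beta)
    print(f"beta={beta}: GD ~ T^-{gd_exp:.2f} = {T**-gd_exp:.1e}, "
          f"Muon ~ T^-2 = {T**-2:.1e}")
```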
Preconditioning perspective and relation to SignGD. Muon can be interpreted as an implicit matrix preconditioner: the matrix‑sign update aligns the weight matrix with the optimal EEᵀ structure while normalizing singular values, a process the authors term “task‑aligned block‑symmetric preconditioning.” By contrast, coordinate‑wise SignGD would need oracle access to the singular vectors of the task matrix to replicate Muon’s effect; without such an oracle, SignGD cannot exploit the latent structure.
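The oracle argument can be checked numerically. For a symmetric gradient G = QΛQᵀ, the matrix sign equals Q·sign(Λ)·Qᵀ, i.e., a coordinate-wise sign applied in the task's eigenbasis Q; applying sign coordinate-wise in the standard basis gives a different matrix. A sketch with a randomly generated basis standing in for the unknown task representation:

```python
import numpy as np

rng = np.random.default_rng(1)

# a symmetric "task-aligned" gradient G = Q diag(lam) Q^T with known eigenbasis Q
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))   # orthogonal "oracle" task basis
lam = rng.standard_normal(5)                       # generic (nonzero) eigenvalues
G = Q @ np.diag(lam) @ Q.T

# Muon's update direction: matrix sign via SVD
U, _, Vt = np.linalg.svd(G)
msgn_G = U @ Vt

# coordinate-wise sign applied in the oracle basis reproduces the matrix sign...
oracle_sign = Q @ np.diag(np.sign(lam)) @ Q.T
print(np.allclose(msgn_G, oracle_sign))

# ...but coordinate-wise sign in the standard basis (SignGD) does not
print(np.allclose(np.sign(G), oracle_sign))
```

The first comparison holds because the matrix sign of a full-rank matrix is its polar orthogonal factor, which for symmetric G is exactly Q·sign(Λ)·Qᵀ; the second fails because sign(G) ignores the eigenbasis entirely.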
Experiments. Synthetic long‑tail classification experiments varying group size C and power‑law exponent β confirm the theoretical predictions of exponential acceleration (noiseless) and 1/T² scaling (noisy). A LLaMA‑style pre‑training run (7 B parameters) shows that Muon reaches comparable perplexity 1.8× faster in FLOPs and improves rare‑token accuracy, corroborating the analysis.
In summary, the paper demonstrates that Muon’s matrix‑sign update equalizes learning across frequency tiers, eliminates the low‑frequency bottleneck that plagues GD, and yields provably faster convergence and superior scaling. The work bridges the gap between Muon’s empirical success and theoretical understanding, offering a clear mechanism—implicit spectral preconditioning—that can guide future optimizer design for large‑scale, long‑tail learning scenarios.