MSign: An Optimizer Preventing Training Instability in Large Language Models via Stable Rank Restoration

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Training instability remains a critical challenge in large language model (LLM) pretraining, often manifesting as sudden gradient explosions that waste significant computational resources. We study training failures in a 5M-parameter NanoGPT model scaled via $μ$P, identifying two key phenomena preceding collapse: (1) rapid decline in weight matrix stable rank (ratio of squared Frobenius norm to squared spectral norm), and (2) increasing alignment between adjacent layer Jacobians. We prove theoretically that these two conditions jointly cause exponential gradient norm growth with network depth. To break this instability mechanism, we propose MSign, a new optimizer that periodically applies matrix sign operations to restore stable rank. Experiments on models from 5M to 3B parameters demonstrate that MSign effectively prevents training failures with a computational overhead of less than 7.0%.


💡 Research Summary

The paper tackles a persistent problem in large‑language‑model (LLM) pre‑training: sudden, unrecoverable gradient explosions that waste massive compute. Using a reproducible 5‑million‑parameter NanoGPT model scaled with µP, the authors identify two precursors that consistently appear just before a training collapse. First, the stable rank of weight matrices (‖W‖_F²/‖W‖₂²) drops sharply, indicating that spectral energy concentrates in a few top singular directions. Second, the alignment between the Jacobians of adjacent layers rises, meaning the top right singular vector of one layer aligns with the top left singular vector of the next. The authors argue that these two phenomena together create a positive feedback loop that drives exponential growth of the total Jacobian norm and, consequently, the gradient norm.
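Both diagnostics are straightforward to compute from a weight matrix or Jacobian. Below is a minimal NumPy sketch; the `jacobian_alignment` helper and its `J_next @ J_prev` composition order are illustrative assumptions, not the paper's exact definition:

```python
import numpy as np

def stable_rank(W: np.ndarray) -> float:
    """Stable rank: squared Frobenius norm over squared spectral norm."""
    fro_sq = float(np.sum(W ** 2))
    spec = float(np.linalg.norm(W, 2))  # largest singular value
    return fro_sq / spec ** 2

def jacobian_alignment(J_prev: np.ndarray, J_next: np.ndarray) -> float:
    """Alignment of adjacent-layer Jacobians: |cosine| between the top
    right singular vector of the later layer and the top left singular
    vector of the earlier one (assuming composition order J_next @ J_prev)."""
    U_prev, _, _ = np.linalg.svd(J_prev)
    _, _, Vt_next = np.linalg.svd(J_next)
    return float(abs(Vt_next[0] @ U_prev[:, 0]))

# A rank-1 matrix has stable rank 1; the identity attains the maximum n.
W = np.outer(np.ones(4), np.ones(4))
print(stable_rank(W))          # ≈ 1.0
print(stable_rank(np.eye(4)))  # ≈ 4.0
```

A sharp drop in `stable_rank` together with `jacobian_alignment` approaching 1 is exactly the precursor pattern the paper reports.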

The theoretical contribution formalizes this intuition. Theorem 4.4 shows that for a fixed Frobenius norm, a lower stable rank forces a higher operator norm (‖W‖₂ = ‖W‖_F / √srank(W)). Theorem 4.2 proves that if each layer Jacobian has norm ≥ M and the alignment between consecutive Jacobians is at least a, then the norm of the product Jacobian satisfies ‖J_total‖₂ ≥ (a M)ᴸ. When a·M > 1, the bound grows exponentially with depth L, providing a sufficient condition for gradient explosion. Theorem 4.8 then links a large total Jacobian norm to a lower bound on weight‑gradient norms via the chain rule, completing the causal chain: low stable rank + high alignment → large Jacobian norm → exploding gradients.
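The exponential bound of Theorem 4.2 can be checked numerically in a toy setting. The sketch below assumes the perfectly aligned case a = 1, where a chain of rank-1 Jacobians sharing a single singular direction attains the bound (a·M)ᴸ exactly:

```python
import numpy as np

# With per-layer norm M and perfect alignment (a = 1), the bound
# ||J_total||_2 >= (a*M)**L is attained exactly by rank-1 Jacobians
# that all share the same top singular direction u.
rng = np.random.default_rng(0)
n, L, M = 16, 8, 1.5

u = rng.standard_normal(n)
u /= np.linalg.norm(u)
J = M * np.outer(u, u)   # ||J||_2 = M; top left/right singular vectors both u

J_total = np.eye(n)
for _ in range(L):
    J_total = J @ J_total

print(np.linalg.norm(J_total, 2))  # ≈ M**L ≈ 25.63
print(M ** L)
```

Since a·M = 1.5 > 1, the product norm grows exponentially in depth L, matching the sufficient condition for gradient explosion stated above.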

To break this mechanism, the authors introduce MSign, a lightweight optimizer that periodically (every P ≈ 100 steps) applies a matrix-sign operation to selected weight matrices. Given a weight matrix W = U S Vᵀ, the sign operation replaces it with sign(W) = U Vᵀ, setting all non-zero singular values to 1 while preserving the left and right singular subspaces. This maximizes the stable rank (making it equal to the matrix rank); the Frobenius norm is then rescaled back to its original value after the sign step. The operation is applied mainly to attention output projection matrices; ablations show that applying it only to MLP layers does not prevent failures.
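A minimal sketch of the restoration step, assuming a plain full SVD; the `eps` rank cutoff and the post-hoc Frobenius rescaling are illustrative choices consistent with the description above, not the authors' exact implementation:

```python
import numpy as np

def srank(W: np.ndarray) -> float:
    """Stable rank ||W||_F^2 / ||W||_2^2."""
    return float(np.sum(W ** 2) / np.linalg.norm(W, 2) ** 2)

def msign_restore(W: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """One MSign restoration (sketch): set every numerically nonzero
    singular value to 1, keeping the singular subspaces, then rescale
    so the Frobenius norm matches the original matrix."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    keep = s > eps * s[0]            # drop numerically zero directions
    sign_W = U[:, keep] @ Vt[keep]   # sign(W) = U V^T
    return sign_W * (np.linalg.norm(W) / np.linalg.norm(sign_W))

# A near rank-1 matrix (stable rank ~1) regains maximal stable rank.
rng = np.random.default_rng(0)
W = np.outer(rng.standard_normal(8), rng.standard_normal(8))
W += 0.01 * rng.standard_normal((8, 8))
W_restored = msign_restore(W)
print(srank(W))           # close to 1
print(srank(W_restored))  # close to 8 (maximal for an 8x8 matrix)
```

Because all retained singular values become equal, the restored matrix's stable rank equals its rank, while the rescaling leaves the Frobenius norm unchanged.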

Empirical validation spans four architectures: NanoGPT‑5M, Sigma‑40M (hybrid attention), LLaMA‑1B, and LLaMA‑MoE‑3B. In baseline runs with standard Adam, all models experience loss spikes and gradient explosions at moderate learning rates (≈ 6 × 10⁻⁴). With MSign, stable rank remains above a critical threshold, Jacobian alignment stays low, and training proceeds smoothly to convergence. The additional computational cost is modest—less than 7 % of total training time—and memory overhead is minimal. Ablation studies confirm that (1) targeting attention output projections is sufficient, and (2) MSign outperforms direct alignment regularizers in cost‑effectiveness.

Strengths of the work include a clear empirical‑theoretical‑practical pipeline, a simple plug‑in optimizer that works with existing training pipelines, and extensive experiments across scales and model families. Limitations are the reliance on full SVD, which may become prohibitive for models larger than a few billion parameters, and a lack of quantitative analysis on how the periodic rank restoration interacts with the optimizer’s convergence dynamics. Moreover, the paper does not deeply explore why alignment grows—whether it is driven by data distribution, learning‑rate schedules, or layer‑norm dynamics.

Future directions suggested are: (i) integrating low‑cost SVD approximations (e.g., randomized SVD) to scale MSign to >10 B‑parameter models, (ii) developing an adaptive schedule that adjusts the period P based on real‑time monitoring of stable rank and alignment, (iii) testing whether the same instability mechanism appears in non‑Transformer architectures such as CNNs or RNNs, and (iv) combining MSign with other regularization techniques (spectral norm clipping, orthogonal constraints) to further improve stability. Overall, the paper provides a compelling new lens on LLM training instability and a practical tool—MSign—that could become a standard component of large‑scale language model training pipelines.
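As one illustration of direction (ii), an adaptive schedule could fire the restoration when the monitored stable rank falls below a fraction of its maximum possible value, rather than on a fixed period. The trigger and its 0.25 threshold below are hypothetical choices, not from the paper:

```python
import numpy as np

def srank(W: np.ndarray) -> float:
    """Stable rank ||W||_F^2 / ||W||_2^2."""
    return float(np.sum(W ** 2) / np.linalg.norm(W, 2) ** 2)

def should_restore(W: np.ndarray, rel_threshold: float = 0.25) -> bool:
    """Hypothetical adaptive trigger: fire the MSign step when the stable
    rank drops below a fraction of its maximum possible value min(W.shape),
    instead of on a fixed period P. The 0.25 threshold is illustrative."""
    return srank(W) < rel_threshold * min(W.shape)

print(should_restore(np.eye(8)))                         # False (srank = 8)
print(should_restore(np.outer(np.ones(8), np.ones(8))))  # True  (srank = 1)
```

Such a trigger would concentrate the SVD cost on the steps where collapse is actually imminent, which matters if MSign is scaled to models where full SVD is expensive.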

