Mano: Restriking Manifold Optimization for LLM Training
While large language models (LLMs) have emerged as a significant advance in artificial intelligence, the hardware and computational costs of training them remain burdensome. Among state-of-the-art optimizers, AdamW relies on diagonal curvature estimates and ignores structural properties, while Muon applies global spectral normalization at the cost of discarding curvature information. In this study, we restrike manifold optimization methods for training LLMs to address both optimizers' limitations; conventional manifold optimization has been largely overlooked due to its poor performance in large-scale model optimization. By projecting the momentum onto the tangent space of the model parameters and constraining it on a rotating Oblique manifold, we propose Mano, a novel, powerful, and efficient optimizer that is the first to close the performance gap between manifold optimization and modern optimizers. Extensive experiments on the LLaMA and Qwen3 models demonstrate that Mano consistently and significantly outperforms AdamW and Muon while using less memory and lower computational complexity, respectively, suggesting an expanded Pareto frontier in space and time efficiency.
💡 Research Summary
The paper introduces Mano, a novel optimizer designed for the pre‑training of large language models (LLMs) that bridges the gap between traditional manifold optimization and modern adaptive methods such as AdamW and Muon. The authors argue that AdamW, while widely used, suffers from high memory overhead (due to storing first‑ and second‑moment estimates) and ignores the spectral structure of weight matrices because it only uses diagonal curvature information. Muon, on the other hand, applies a global spectral normalization via Newton‑Schulz iterations, which equalizes the magnitude of updates across all eigen‑directions but discards curvature information contained in the gradients and momentum.
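The Newton‑Schulz orthogonalization attributed to Muon above can be sketched with the standard cubic iteration (Muon itself uses a tuned quintic polynomial; the function name and the Frobenius pre-normalization here are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def newton_schulz_orthogonalize(g, iters=20):
    """Cubic Newton-Schulz iteration driving g toward its nearest
    semi-orthogonal factor U V^T (all singular values pushed to 1).
    Muon uses a tuned quintic variant; this cubic form is a sketch."""
    # Pre-scale so every singular value is <= 1 (Frobenius bound suffices).
    x = g / (np.linalg.norm(g) + 1e-8)
    for _ in range(iters):
        # Each step maps every singular value s to 1.5*s - 0.5*s^3,
        # a fixed-point iteration with attractor s = 1.
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x
```

This is exactly the "equalize all eigen-directions" behavior the summary describes: whatever curvature information the momentum's singular values carried is flattened to unit magnitude.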
Mano’s key insight is to project the momentum onto the tangent space of the model parameters and then constrain the resulting update onto a rotational Oblique manifold. Rather than forcing the parameters themselves onto a manifold (as in classic Riemannian optimization), Mano only imposes a “soft” manifold constraint on the update direction. The workflow for each training step is as follows:
1. Manifold normalization – The current weight matrix θₜ is column‑wise (or row‑wise) normalized to unit ℓ₂ norm, yielding θ̂ₜ = θₜ ⊘ ‖θₜ‖₂,k. The index k alternates each step (odd steps use column normalization, even steps use row normalization), creating a rotating Oblique manifold.
2. Tangent‑space projection – The momentum Mₜ is projected onto the tangent space of the Oblique manifold at θ̂ₜ: vₜ = Mₜ − θ̂ₜ ⊙ ⟨Mₜ, θ̂ₜ⟩ₖ. This removes the component of the momentum that would move the parameters off the manifold surface.
3. Second normalization – The projected direction vₜ is normalized back to the manifold: v̂ₜ = vₜ ⊘ ‖vₜ‖₂,k.
4. Parameter update – The final Euclidean update incorporates weight decay and a dimension‑scaled learning rate: θₜ₊₁ = θₜ − ηₜ (0.2 √nₖ v̂ₜ + λθₜ).
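The four steps above can be sketched in NumPy as a single update (a minimal sketch, not the authors' implementation: `mano_step` is a hypothetical name, the `eps` guard is an assumption, and the momentum Mₜ is taken as a given input since the summary does not spell out its accumulation rule):

```python
import numpy as np

def mano_step(theta, momentum, step, lr, weight_decay=0.0, eps=1e-8):
    """One Mano-style update. axis rotates each step between
    column-wise (odd steps) and row-wise (even steps) normalization."""
    axis = 0 if step % 2 == 1 else 1  # axis=0: column norms, axis=1: row norms

    # 1. Manifold normalization: unit l2-norm columns (or rows).
    theta_hat = theta / (np.linalg.norm(theta, axis=axis, keepdims=True) + eps)

    # 2. Tangent-space projection on the Oblique manifold: subtract the
    #    component of the momentum along each normalized column/row.
    inner = np.sum(momentum * theta_hat, axis=axis, keepdims=True)
    v = momentum - theta_hat * inner

    # 3. Second normalization of the projected direction.
    v_hat = v / (np.linalg.norm(v, axis=axis, keepdims=True) + eps)

    # 4. Euclidean update with dimension-scaled step and weight decay;
    #    n_k is the size along the normalized dimension (an assumption
    #    about the paper's 0.2*sqrt(n_k) scaling convention).
    n_k = theta.shape[axis]
    return theta - lr * (0.2 * np.sqrt(n_k) * v_hat + weight_decay * theta)
```

Note that only element-wise norms, sums, and multiplies appear, which is the source of the cost advantage over Muon's repeated matrix products.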
The algorithm introduces only two inexpensive element‑wise normalizations and one tangent‑space projection per step, avoiding costly matrix decompositions (QR, SVD) or the Newton‑Schulz iterations required by Muon. Consequently, Mano’s memory footprint is comparable to vanilla SGD and substantially lower than AdamW (which stores two extra tensors per parameter) and Muon (which stores additional auxiliary matrices).
Why the Oblique manifold? The authors empirically measured average geodesic distances over 1,000 consecutive updates of a Qwen3‑0.6B model trained with AdamW. The Oblique manifold exhibited the shortest distances compared with the Sphere and Stiefel manifolds, suggesting that its geometry aligns better with the natural learning trajectory of transformer weight matrices.
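The paper's exact measurement protocol is not reproduced here, but the standard geodesic distance on the Oblique manifold (a product of unit spheres, one per column) between two weight snapshots can be computed as follows (function name and column-wise convention are assumptions):

```python
import numpy as np

def oblique_geodesic(a, b, axis=0, eps=1e-8):
    """Geodesic distance on the Oblique manifold: project both matrices
    onto unit-norm columns (axis=0) or rows (axis=1), take the per-sphere
    arc length arccos(<a_i, b_i>), and combine the arcs in l2."""
    a_hat = a / (np.linalg.norm(a, axis=axis, keepdims=True) + eps)
    b_hat = b / (np.linalg.norm(b, axis=axis, keepdims=True) + eps)
    cos = np.clip(np.sum(a_hat * b_hat, axis=axis), -1.0, 1.0)
    return float(np.sqrt(np.sum(np.arccos(cos) ** 2)))
```

Averaging this quantity over consecutive parameter snapshots gives the kind of per-manifold comparison the authors report.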
Rotating manifold scheme – By alternating column‑wise and row‑wise normalizations, Mano mitigates the bias of a static Oblique manifold (which would privilege column directions) and approximates the uniform spectral treatment that Muon aims for, while still preserving curvature information through the tangent‑space projection.
Experimental validation – The authors evaluated Mano on three LLMs: LLaMA‑350M, LLaMA‑1.3B, and Qwen3‑0.6B, all pretrained on the Pile dataset. Results show:
- Speed – For the same token budget (10B tokens for the 350M model, 2.8B for the 1.3B model), Mano converged 1.75× (350M) and 1.38× (1.3B) faster in wall‑clock time than Muon.
- Memory – Because no second‑moment tensor is stored, Mano reduces GPU memory consumption by roughly 40% relative to AdamW and matches or slightly improves upon Muon's memory usage.
- Final perplexity – Across all models, Mano achieves lower test perplexity (0.5–1.2% improvement) compared with AdamW and Muon, indicating better generalization.
- Stability – Training curves are smoother, with fewer spikes in loss, suggesting that the manifold‑constrained momentum reduces variance and helps escape shallow local minima.
Theoretical perspective – By projecting momentum onto the tangent space, Mano effectively reduces the variance of the stochastic update, akin to variance‑reduction techniques in SGD‑M. The subsequent normalization keeps the update magnitude balanced across dimensions, preserving spectral information that AdamW discards while avoiding the full loss of curvature that Muon’s orthogonalization incurs.
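In the summary's notation, this norm-shrinking effect follows directly from the projection being orthogonal per column (or row) k; a one-line check (a sketch of the intuition, not the paper's formal variance bound):

```latex
% v_t = M_t - \hat\theta_t \odot \langle M_t, \hat\theta_t \rangle_k
% satisfies \langle v_t, \hat\theta_t \rangle_k = 0, so since each
% column/row of \hat\theta_t has unit norm, Pythagoras gives
\|M_t\|_{2,k}^2 \;=\; \|v_t\|_{2,k}^2 + \langle M_t, \hat\theta_t \rangle_k^2
\quad\Longrightarrow\quad
\|v_t\|_{2,k} \;\le\; \|M_t\|_{2,k}.
```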
Limitations and future work – The study focuses exclusively on the Oblique manifold; other manifolds (Grassmann, low‑rank) remain unexplored. The rotating scheme is deterministic and may not capture more complex curvature patterns that adaptive spectral methods could. Moreover, the paper does not provide a formal convergence proof for the non‑convex, high‑dimensional setting of LLMs. Extending Mano to fine‑tuning, multimodal models, or integrating it with distributed training pipelines are promising directions.
Conclusion – Mano demonstrates that a carefully re‑engineered manifold‑based optimizer can outperform state‑of‑the‑art adaptive methods on large‑scale language model training while being simpler to implement and more resource‑efficient. By marrying geometric insight (tangent‑space projection, rotating Oblique manifold) with practical considerations (low memory, few extra operations), the work opens a new avenue for optimizer design that could become a standard component in future LLM training stacks.