Lions and Muons: Optimization via Stochastic Frank-Wolfe
Stochastic Frank-Wolfe is a classical optimization method for solving constrained optimization problems. Meanwhile, recent optimizers such as Lion and Muon have gained significant popularity in deep learning. In this work, building on recent initiatives, we provide a unifying perspective by interpreting these seemingly disparate methods through the lens of Stochastic Frank-Wolfe. Specifically, we show that Lion and Muon with weight decay can be viewed as special instances of Stochastic Frank-Wolfe, and we establish their convergence guarantees in terms of the Frank-Wolfe gap, a standard stationarity measure for Frank-Wolfe methods in non-convex optimization. We further show that convergence in this gap implies, for Lion and Muon, convergence to a KKT point of the original problem under a norm constraint. Moreover, motivated by recent empirical findings that stochastic gradients in modern machine learning tasks often exhibit heavy-tailed distributions, we extend Stochastic Frank-Wolfe to settings with heavy-tailed noise by developing two robust variants with strong theoretical guarantees that hold for general compact convex sets without requiring large batch sizes, filling a gap in the literature on Stochastic Frank-Wolfe for non-convex optimization. These contributions, in turn, yield new variants of Lion and Muon that better accommodate heavy-tailed gradient noise, thereby enhancing their practical scope.
💡 Research Summary
This paper establishes a unifying theoretical framework that connects two recently popular deep‑learning optimizers—Lion and Muon—with the classical stochastic Frank‑Wolfe (FW) method. The authors first show that Lion is exactly a stochastic FW algorithm under an ℓ∞‑norm ball constraint, while Muon (with weight decay) corresponds to stochastic FW under a spectral‑norm constraint. By mapping the momentum and weight‑decay terms of Lion and the orthogonal‑projection step of Muon to the linear minimization oracle of FW, they prove that the iterates of both optimizers are mathematically identical to those produced by a generic stochastic FW scheme (Algorithm 3) when the appropriate parameters (β₁,t, γ_t, η_t) and constraint set C are chosen.
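The correspondence hinges on the linear minimization oracle (LMO): over an ℓ∞‑norm ball the LMO returns a sign vector (Lion's update direction), and over a spectral‑norm ball it returns the orthogonalized factor UVᵀ of the gradient's SVD (Muon's update direction). A minimal NumPy sketch of both oracles, with the ball radius `r` as an assumed parameter:

```python
import numpy as np

def lmo_linf_ball(grad, r):
    """LMO over the l_inf ball {v : ||v||_inf <= r}:
    argmin_v <grad, v> = -r * sign(grad),
    which matches Lion's sign-of-momentum update direction."""
    return -r * np.sign(grad)

def lmo_spectral_ball(grad, r):
    """LMO over the spectral-norm ball {V : ||V||_2 <= r}:
    with grad = U S V^T, argmin_V <grad, V> = -r * U V^T,
    which matches Muon's orthogonalized update direction."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return -r * (u @ vt)
```

Both oracles minimize a linear function over the constraint set, which is why the sign and orthogonalization steps of Lion and Muon line up with the generic FW scheme.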
Using this equivalence, the paper derives convergence guarantees in terms of the Frank‑Wolfe gap G(x)=max_{v∈C}⟨v−x,−∇F(x)⟩, a standard stationarity measure for constrained non‑convex optimization. Under the usual L‑smoothness and bounded‑variance assumptions, the expected FW gap decays at O(1/√T). More importantly, the authors extend the analysis to heavy‑tailed stochastic gradients that satisfy a bounded p‑th moment condition with p∈(1,2]. In this regime, they propose two robust variants of stochastic FW:
- Clipped‑FW – applies element‑wise clipping to stochastic gradients before the linear oracle. It achieves a high‑probability convergence rate O(log(T/δ)·T^{-(p‑1)/(3p‑2)}).
- Clipped‑VR‑FW – combines clipping with a variance‑reduction technique (e.g., SARAH or SPIDER‑like updates). Under an additional Lipschitz‑gradient assumption on the stochastic gradients, it improves the rate to O(log(T/δ)·T^{-(p‑1)/(2p‑1)}).
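The core clipping idea behind Clipped‑FW can be sketched in a few lines. The element‑wise threshold `tau`, the ℓ∞ constraint set, and the step‑size handling below are illustrative assumptions, not the paper's exact scheme:

```python
import numpy as np

def clipped_fw_step(x, stoch_grad, r, step_size, tau):
    """One illustrative clipped stochastic Frank-Wolfe step:
    1. element-wise clip the stochastic gradient to [-tau, tau]
       (robustifies against heavy-tailed noise),
    2. query the LMO over the l_inf ball of radius r,
    3. move by a convex combination toward the oracle point,
       which keeps the iterate inside the constraint set."""
    g = np.clip(stoch_grad, -tau, tau)
    v = -r * np.sign(g)                          # LMO over l_inf ball
    return (1 - step_size) * x + step_size * v   # x + step_size * (v - x)
```

Because the update is a convex combination of feasible points, no projection is ever needed, and clipping only alters which vertex the oracle selects, not feasibility.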
Both methods work for any compact convex set C without requiring large mini‑batch sizes, thereby filling a gap in the literature on stochastic FW for non‑convex problems with heavy‑tailed noise. When p=2 (the bounded‑variance case), an expected FW gap of ε can be achieved with O(1/ε³) stochastic gradient evaluations, matching the best known complexities for stochastic gradient methods while avoiding projection steps.
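For the ℓ∞ ball, the FW gap G(x)=max_{v∈C}⟨v−x,−∇F(x)⟩ used throughout these guarantees admits a closed form, since the maximizer is v* = −r·sign(∇F(x)). A small sketch (radius `r` assumed), checked against a brute‑force maximization over the ball's vertices:

```python
import numpy as np

def fw_gap_linf(x, grad, r):
    """Frank-Wolfe gap G(x) = max_{v in C} <v - x, -grad F(x)>
    over the l_inf ball of radius r. The maximizer is
    v* = -r * sign(grad), giving the closed form
    G(x) = r * ||grad||_1 + <grad, x>."""
    return r * np.abs(grad).sum() + grad @ x
```

G(x) is non‑negative on the feasible set and vanishes exactly at stationary points of the constrained problem, which is why it serves as the stationarity measure in the rates above.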
Leveraging the established Lion‑FW and Muon‑FW equivalence, the authors instantiate the two robust FW variants as new “Heavy‑Tail Lion” and “Heavy‑Tail Muon” algorithms. These inherit the theoretical guarantees of the underlying FW methods, offering strong convergence under heavy‑tailed noise while preserving the practical benefits (e.g., sign‑gradient updates for Lion, orthogonal‑matrix updates for Muon). Empirical results (briefly mentioned) suggest that the heavy‑tail variants outperform their vanilla counterparts in regimes where gradient noise exhibits heavy tails.
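A "Heavy‑Tail Lion" step of the kind described here can be sketched by prepending element‑wise clipping to the standard Lion update (sign of interpolated momentum, plus decoupled weight decay). The threshold `tau`, the hyperparameter defaults, and the exact placement of clipping are illustrative assumptions, not the paper's prescribed values:

```python
import numpy as np

def heavy_tail_lion_step(x, m, grad, lr=1e-4, beta1=0.9, beta2=0.99,
                         weight_decay=0.1, tau=1.0):
    """One Lion step with element-wise gradient clipping prepended
    (a sketch of a heavy-tail-robust variant).
    Returns the updated parameters and momentum buffer."""
    g = np.clip(grad, -tau, tau)                   # robustify the gradient
    update = np.sign(beta1 * m + (1 - beta1) * g)  # Lion's sign update
    x_new = x - lr * (update + weight_decay * x)   # decoupled weight decay
    m_new = beta2 * m + (1 - beta2) * g            # momentum update
    return x_new, m_new
```

Note how the sign update already bounds each coordinate of the step; clipping additionally bounds the momentum buffer itself, so a single heavy‑tailed sample cannot corrupt future iterations.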
In summary, the paper makes three major contributions: (i) a precise mathematical unification of Lion and Muon with stochastic Frank‑Wolfe, providing convergence guarantees in terms of the FW gap and KKT optimality; (ii) the development of two novel, theoretically sound stochastic FW algorithms robust to heavy‑tailed gradient noise, with high‑probability convergence rates and without large batch requirements; (iii) the translation of these robust FW methods into practical, heavy‑tail‑aware variants of Lion and Muon, thereby extending their applicability to modern deep‑learning tasks where gradient distributions are often heavy‑tailed. This work bridges a gap between classical projection‑free optimization theory and state‑of‑the‑art deep‑learning optimizers, opening avenues for further research on constrained, projection‑free methods in large‑scale, noisy learning environments.