Adaptive Matrix Online Learning through Smoothing with Guarantees for Nonsmooth Nonconvex Optimization


We study online linear optimization with matrix variables constrained by the operator norm, a setting whose geometry makes the design of efficient, data-dependent adaptive algorithms challenging. The best-known adaptive regret bounds are achieved by Shampoo-like methods, but they require solving a costly quadratic projection subproblem. To address this, we extend the gradient-based prediction scheme to adaptive matrix online learning and cast algorithm design as constructing a family of smoothed potentials for the nuclear norm. We define a notion of admissibility for such smoothings and prove that any admissible smoothing yields a regret bound matching the best-known guarantees of one-sided Shampoo. We instantiate this framework with two efficient methods that avoid quadratic projections. The first is an adaptive Follow-the-Perturbed-Leader (FTPL) method using Gaussian stochastic smoothing. The second is Follow-the-Augmented-Matrix-Leader (FAML), which uses a deterministic hyperbolic smoothing in an augmented matrix space. By analyzing the admissibility of these smoothings, we show both methods admit closed-form updates and match one-sided Shampoo’s regret up to a constant factor, while significantly reducing computational cost. Lastly, using the online-to-nonconvex conversion, we derive two matrix-based optimizers, Pion (from FTPL) and Leon (from FAML). We prove convergence guarantees for these methods in nonsmooth nonconvex settings, a guarantee that the popular Muon optimizer lacks.


💡 Research Summary

The paper tackles online linear optimization (OLO) over matrix variables subject to an operator‑norm ball constraint ‖X‖ₒₚ ≤ D, a setting where standard vector‑wise adaptive methods such as AdaGrad or Shampoo encounter severe difficulties. Existing state‑of‑the‑art adaptive algorithms (one‑sided Shampoo/ASGO) achieve optimal data‑dependent regret but require solving a costly quadratic projection onto the operator‑norm ball at every round; this projection has no closed‑form solution and typically demands iterative SVD‑based solvers, making the methods impractical for large‑scale problems.

To overcome this bottleneck, the authors extend the Gradient‑Based Prediction Algorithm (GBPA) to the matrix domain and recast algorithm design as constructing a family of smoothed potentials for the nuclear norm (the dual of the operator norm). They introduce the notion of an (α, β)‑admissible smoothing Ψ̃(·; L), parameterized by a positive‑semidefinite matrix L that captures the problem geometry. Four conditions define admissibility: (a) feasibility (the gradient of the smoothed potential respects the operator‑norm bound), (b) dominance (the smoothed potential upper‑bounds the nuclear norm), (c) upper stability (potential differences are controlled by the trace of L), and (d) smoothness (the Bregman divergence induced by the potential is bounded by a quadratic form involving L⁻¹). These conditions directly control the three terms that appear in the regret decomposition of GBPA.
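The first two conditions can be probed numerically. The sketch below uses a toy smoothing Ψ̃(S) = Tr((SᵀS + εI)^{1/2}) of the nuclear norm — an illustrative choice of ours, not one of the paper's constructions — and checks feasibility (a) and dominance (b) at random points:

```python
import numpy as np

def nuclear_norm(S):
    return np.linalg.svd(S, compute_uv=False).sum()

def smoothed_potential(S, eps):
    # Toy smoothing Psi(S) = Tr sqrt(S^T S + eps I) of the nuclear norm
    n = S.shape[1]
    w = np.linalg.eigvalsh(S.T @ S + eps * np.eye(n))
    return np.sqrt(w).sum()

def smoothed_grad(S, eps):
    # Gradient: d/dS Tr sqrt(S^T S + eps I) = S (S^T S + eps I)^{-1/2}
    n = S.shape[1]
    w, V = np.linalg.eigh(S.T @ S + eps * np.eye(n))
    inv_sqrt = (V / np.sqrt(w)) @ V.T
    return S @ inv_sqrt

rng = np.random.default_rng(0)
for _ in range(100):
    S = rng.standard_normal((5, 3))
    G = smoothed_grad(S, eps=0.1)
    # (a) feasibility: the gradient stays in the unit operator-norm ball
    assert np.linalg.svd(G, compute_uv=False).max() <= 1 + 1e-9
    # (b) dominance: the smoothed potential upper-bounds the nuclear norm
    assert smoothed_potential(S, eps=0.1) >= nuclear_norm(S) - 1e-9
```

Here both conditions hold because each singular value σᵢ of S contributes σᵢ/√(σᵢ² + ε) ≤ 1 to the gradient and √(σᵢ² + ε) ≥ σᵢ to the potential.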

The main theoretical result (Theorem 3.2) shows that if one chooses Lₜ = G²I + Mₜ, where Mₜ = ∑_{s≤t} GₛGₛᵀ, and sets the learning rate η = α/β, then the GBPA update Xₜ₊₁ = −D ∇Ψ̃(Sₜ; Lₜ/η) achieves regret Reg_T ≤ 2√(αβ)·D·Tr((G²I + M_T)^{1/2}) + (1−√(αβ))·D·‖G₁‖*. Thus the regret matches the one‑sided Shampoo bound up to the factor √(αβ). Proposition 3.3 proves a universal lower bound αβ ≥ ½ for any admissible smoothing and exhibits a concrete smoothing Ψ̃_R that attains (α, β) = (½, 1), i.e., the optimal product αβ = ½.
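To make the preconditioner concrete, here is a minimal NumPy sketch (dimensions, the placeholder gradient model, and the value of the gradient-norm bound G are our assumptions) that accumulates Mₜ round by round and evaluates the data-dependent quantity Tr((G²I + M_T)^{1/2}) from the Theorem 3.2 bound:

```python
import numpy as np

def trace_sqrt(A):
    # Tr(A^{1/2}) for a symmetric PSD matrix, via its eigenvalues
    return np.sqrt(np.maximum(np.linalg.eigvalsh(A), 0.0)).sum()

rng = np.random.default_rng(1)
m, n, T = 4, 3, 50
G_bound = 1.0                          # assumed bound on per-round gradient norms
M = np.zeros((m, m))                   # M_t = sum_{s<=t} G_s G_s^T
for t in range(T):
    G_t = rng.standard_normal((m, n)) / np.sqrt(m * n)  # placeholder gradients
    M += G_t @ G_t.T
    L_t = G_bound**2 * np.eye(m) + M   # preconditioner L_t fed to the smoothing
# data-dependent term appearing in the Theorem 3.2 regret bound
bound_term = trace_sqrt(G_bound**2 * np.eye(m) + M)
```

Note that Lₜ is always positive definite (the G²I term bounds its eigenvalues away from zero), which is what makes the inverse and inverse square root in the later updates well defined.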

Two concrete algorithms instantiate this framework while avoiding quadratic projections:

  1. Adaptive Follow‑the‑Perturbed‑Leader (FTPL) with Gaussian stochastic smoothing. At each round a Gaussian perturbation Zₜ ∼ 𝒩(0, σ² Lₜ⁻¹) is added to the cumulative gradient Sₜ, and the update uses the gradient of the smoothed potential evaluated at Sₜ + Zₜ. By leveraging non‑central Wishart theory, the expected update can be expressed in closed form using only two matrix primitives (multiplication and inversion). Theorem 4.1 shows this smoothing is (α, β)‑admissible with α = O(log(m + n)) and β = 1, incurring only a mild logarithmic dimension factor.

  2. Follow‑the‑Augmented‑Matrix‑Leader (FAML) with deterministic hyperbolic smoothing. The method lifts the original matrix X to an augmented space (X, 1) and defines a potential Ψ̃_FAML(S; L) = max_{‖X‖ₒₚ≤1} ⟨S, X⟩ − ½ Tr(XᵀLX) + ½ Tr(L). The maximizer admits a closed‑form expression X* = L^{−½} sign(L^{−½}S), where sign(·) denotes the matrix sign (the semi‑orthogonal polar factor), leading to the update Xₜ₊₁ = −D Lₜ^{−½} sign(Lₜ^{−½}Sₜ). Theorem 4.2 proves this smoothing is (½, 1)‑admissible, achieving the optimal αβ = ½ and thus matching Shampoo’s regret up to a factor of two. Importantly, FAML requires no random sampling and only matrix multiplications and sign operations, making it highly efficient for large‑scale settings.
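Both updates reduce to a handful of dense linear-algebra primitives. The NumPy sketch below is our own illustration: the SVD-based matrix sign, the row-wise covariance convention for the Gaussian perturbation, and all shapes are assumptions, and it shows only a single stochastic FTPL draw rather than the paper's closed-form expected update via non-central Wishart identities.

```python
import numpy as np

def msign(A):
    # matrix sign / semi-orthogonal polar factor: U V^T from the SVD of A
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ Vt

def sym_inv_sqrt(L):
    # L^{-1/2} for a symmetric positive-definite L, via eigendecomposition
    w, V = np.linalg.eigh(L)
    return (V / np.sqrt(w)) @ V.T

def faml_update(S, L, D):
    # closed-form FAML iterate: X = -D L^{-1/2} msign(L^{-1/2} S)
    R = sym_inv_sqrt(L)
    return -D * R @ msign(R @ S)

def ftpl_update(S, L, D, sigma, rng):
    # one stochastic FTPL draw: perturb S with Z ~ N(0, sigma^2 L^{-1})
    # (row covariance assumed), then follow the perturbed leader over
    # the operator-norm ball of radius D
    Z = sigma * sym_inv_sqrt(L) @ rng.standard_normal(S.shape)
    return -D * msign(S + Z)
```

Each step costs a few m×m or m×n multiplications plus one eigendecomposition/SVD, in contrast to the iterative SVD-based solvers needed for the quadratic projection in one-sided Shampoo.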

Having established adaptive matrix OLO algorithms with Shampoo‑level regret, the authors apply the Online‑to‑Nonconvex Conversion (O2NC) framework to obtain optimizers for nonsmooth nonconvex problems. By interpreting each OLO round as a surrogate for a stochastic gradient step, they derive two matrix‑based optimizers:

  • Pion – the optimizer obtained from the FTPL scheme.
  • Leon – the optimizer obtained from the FAML scheme.
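As a rough illustration of how an OLO iterate can drive a weight update under O2NC, here is a hypothetical single-layer step built on the FAML closed-form iterate; the state layout, the constants D and g_bound, and the direct use of the iterate as the weight increment are our simplifying assumptions, not the paper's tuned algorithm.

```python
import numpy as np

def leon_step(W, grad, state, D=1e-3, g_bound=1.0):
    # Hypothetical O2NC-style step: the online learner accumulates
    # gradients, and the matrix it "plays" becomes the weight increment.
    m = W.shape[0]
    state.setdefault("S", np.zeros_like(W))
    state.setdefault("M", np.zeros((m, m)))
    state["S"] += grad                        # cumulative gradient sum S_t
    state["M"] += grad @ grad.T               # M_t = sum_s G_s G_s^T
    L = g_bound**2 * np.eye(m) + state["M"]   # L_t = G^2 I + M_t
    w, V = np.linalg.eigh(L)
    R = (V / np.sqrt(w)) @ V.T                # L_t^{-1/2}
    U, _, Vt = np.linalg.svd(R @ state["S"], full_matrices=False)
    delta = -D * R @ (U @ Vt)                 # FAML iterate as the update
    return W + delta
```

With g_bound ≥ 1 the eigenvalues of L stay at least 1, so each increment has operator norm at most D, mirroring the operator-norm ball constraint of the underlying OLO problem.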

Both algorithms are proved (Theorems 5.2 and 5.3) to converge to (ρ, ε)‑stationary points of arbitrary nonsmooth nonconvex objectives, a guarantee that the widely used Muon optimizer lacks. The paper clarifies that Muon is essentially a spectrally constrained Follow‑the‑Leader (FTL) method, which, due to the nonsmoothness of the nuclear norm, cannot guarantee sublinear regret or convergence in the nonsmooth regime.

Empirical evaluations (summarized in the paper) on deep‑learning tasks and large matrix factorization problems demonstrate that Pion and Leon achieve comparable or better test performance than Shampoo and Muon while reducing memory consumption and wall‑clock time by 2–3×, thanks to the elimination of quadratic projection steps.

In summary, the work introduces a principled smoothing‑based design paradigm for adaptive matrix online learning, proves that any (α, β)‑admissible smoothing yields Shampoo‑type regret, provides two practically efficient algorithms (FTPL and FAML) that meet this bound without costly projections, and leverages them to construct the first matrix‑based optimizers with provable convergence guarantees for nonsmooth nonconvex optimization. This bridges a critical gap between theory and practice in modern matrix‑structured learning systems.

