Non-Euclidean Gradient Descent Operates at the Edge of Stability


The Edge of Stability (EoS) is a phenomenon in which the sharpness (the largest eigenvalue of the Hessian) converges to $2/η$ during training with gradient descent (GD) at step size $η$. Despite (apparently) violating classical smoothness assumptions, EoS has been widely observed in deep learning, but its theoretical foundations remain incomplete. We provide an interpretation of EoS through the lens of Directional Smoothness [Mishkin et al., 2024]. This interpretation naturally extends to non-Euclidean norms, which we use to define generalized sharpness under an arbitrary norm. Our generalized sharpness measure includes previously studied vanilla GD and preconditioned GD as special cases, as well as methods for which EoS has not been studied, such as $\ell_{\infty}$-descent, Block CD, Spectral GD, and Muon without momentum. Through experiments on neural networks, we show that non-Euclidean GD with our generalized sharpness also exhibits progressive sharpening followed by oscillations around or above the threshold $2/η$. Practically, our framework provides a single, geometry-aware spectral measure that works across optimizers.


💡 Research Summary

The paper “Non‑Euclidean Gradient Descent Operates at the Edge of Stability” extends the recently observed Edge‑of‑Stability (EoS) phenomenon—originally documented for vanilla gradient descent (GD) with a fixed learning rate η—to a broad class of first‑order methods that are defined with respect to arbitrary norms. The authors begin by recalling that classical convergence theory for L‑smooth convex functions guarantees descent only for step sizes up to 2/L, yet deep neural networks routinely train with η far larger than 2/λ_max(∇²L), where λ_max denotes the largest Hessian eigenvalue (the “sharpness”). Empirically, training exhibits two phases: an initial “progressive sharpening” where loss decreases monotonically while sharpness grows, followed by an EoS phase where loss becomes non‑monotonic but still trends downward over long horizons, and sharpness hovers around the critical value 2/η.

To explain this, the authors introduce Directional Smoothness (DS), a trajectory‑aware curvature measure defined between two consecutive iterates w_t and w_{t+1}. By substituting the update rule of a non‑Euclidean GD step into the definition of DS, they derive the exact identity

ΔL_t = –η ∥∇L(w_t)∥_* ⟨∇L(w_t), d_t⟩ + (η²/2) · D_{∥·∥}(w_t, w_{t+1}) · ∥∇L(w_t)∥_*²,

where d_t is the unit-norm descent direction under the chosen norm (so that ⟨∇L(w_t), d_t⟩ = ∥∇L(w_t)∥_*) and ∥·∥_* is the dual norm. From this identity they obtain a simple criterion: the loss decreases (ΔL_t ≤ 0) exactly when DS ≤ 2/η, so if the loss oscillates, DS must oscillate around 2/η. This provides a clean, norm‑independent criterion linking loss dynamics to curvature.
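For the Euclidean special case, where the dual norm is again ℓ₂ and the update reduces to vanilla GD, this criterion is easy to check numerically. The sketch below uses a toy quadratic of our own choosing (not the paper's code) and the standard trajectory-wise definition of directional smoothness:

```python
import numpy as np

# Toy check of the Edge-of-Stability criterion for Euclidean GD:
# the loss decreases over a step exactly when DS <= 2 / eta.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
A = Q @ np.diag([10.0, 4.0, 2.0, 1.0, 0.5]) @ Q.T  # PSD "Hessian"

def loss(w):  return 0.5 * w @ A @ w
def grad(w):  return A @ w

def dir_smoothness(x, y):
    """D(x, y) = 2 (L(y) - L(x) - <grad L(x), y - x>) / ||y - x||^2."""
    dw = y - x
    return 2.0 * (loss(y) - loss(x) - grad(x) @ dw) / (dw @ dw)

w = rng.normal(size=5)
for eta in (0.05, 0.5):                  # a stable and an unstable step size
    w_next = w - eta * grad(w)           # Euclidean GD step
    dL = loss(w_next) - loss(w)
    D = dir_smoothness(w, w_next)
    # The criterion: dL <= 0 exactly when D <= 2 / eta.
    assert (dL <= 0) == (D <= 2.0 / eta)
```

On a quadratic, D(x, y) is exactly the Rayleigh quotient of the Hessian along the step, so it always lies between the smallest and largest eigenvalues of A.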

The next step is to relate DS to a generalized sharpness. By upper‑bounding DS with the maximum quadratic form of the Hessian over the unit ball of the chosen norm, they define

S∥·∥(w) = max_{∥d∥≤1} dᵀ∇²L(w)d.

When the norm is Euclidean, S∥·∥ reduces to λ_max(∇²L), the classic sharpness. For a Mahalanobis (preconditioned) norm defined by a positive‑definite matrix P, S∥·∥ becomes the largest eigenvalue of P^{–½}∇²L P^{–½}, matching prior definitions for adaptive methods such as AdaGrad or RMSProp. For the ℓ_∞ norm, the maximization problem becomes equivalent to finding the ground‑state energy of an Ising spin glass, which is NP‑hard; the authors therefore propose a practical approximation using a multi‑restart Frank‑Wolfe algorithm (Algorithm 2) with appropriate projection operators (e.g., sign for ℓ_∞).
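The ℓ_∞ case can be sketched concretely. The following is a hedged, minimal reading of the multi-restart Frank‑Wolfe idea; the function name and details are our own, and the paper's Algorithm 2 may differ:

```python
import numpy as np

# Estimate the l_inf generalized sharpness S = max_{||d||_inf <= 1} d^T H d
# with multi-restart Frank-Wolfe; on the box, the linear oracle is the sign map.
def linf_sharpness_fw(H, restarts=20, iters=100, seed=0):
    n = H.shape[0]
    rng = np.random.default_rng(seed)
    # A convex quadratic over the box attains its max at a vertex in {-1,+1}^n,
    # and adding c*I to H only shifts the value by the constant c*n at vertices,
    # so we may shift H to be PSD without changing the maximizing vertex.
    shift = max(0.0, -np.linalg.eigvalsh(H)[0])
    Hs = H + shift * np.eye(n)
    best = -np.inf
    for _ in range(restarts):
        d = rng.choice([-1.0, 1.0], size=n)
        for _ in range(iters):
            s = np.sign(Hs @ d)           # Frank-Wolfe linear oracle on the box
            s[s == 0] = 1.0
            if np.array_equal(s, d):      # vertex fixed point reached
                break
            d = s                         # full step to the vertex (monotone for PSD Hs)
        best = max(best, float(d @ H @ d))
    return best

# Sanity checks on toy matrices of our own choosing.
H = np.diag([3.0, 1.0, 0.5])
assert np.isclose(linf_sharpness_fw(H), H.trace())  # diagonal case: the trace
# The l_inf ball contains the l2 ball, so S_inf >= lambda_max (Euclidean sharpness).
assert linf_sharpness_fw(H) >= np.max(np.linalg.eigvalsh(H)) - 1e-9
```

Each restart converges to some locally optimal vertex, which is why multiple random restarts are needed: the underlying problem is NP-hard, and a single run can get stuck at a suboptimal sign pattern.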

Two concrete formulations of non‑Euclidean GD are presented. Definition 1.1 (regularized linearization) yields the update

w_{t+1} = w_t – η ∥∇L(w_t)∥_* (∇L(w_t))_*,

while Definition 1.2 (normalized GD) drops the dual‑norm scaling:

w_{t+1} = w_t – η (∇L(w_t))_*.

Special cases include ℓ_∞‑Descent (sign GD), Block Coordinate Descent, Spectral GD (or Muon without momentum), and preconditioned GD. The authors emphasize that all these algorithms fit the same analytical framework.
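Two of these special cases can be written down directly. These are hedged, minimal forms of our own (conventions in the paper may differ):

```python
import numpy as np

def sign_gd_step(w, g, eta):
    """l_inf-descent under Definition 1.2 (normalized GD): the maximizer
    of <g, d> over the unit l_inf ball is sign(g)."""
    return w - eta * np.sign(g)

def sign_gd_step_regularized(w, g, eta):
    """Definition 1.1 variant: also scale by the dual norm, here ||g||_1
    (the dual of the l_inf norm)."""
    return w - eta * np.sum(np.abs(g)) * np.sign(g)

def spectral_gd_step(W, G, eta):
    """Spectral GD / Muon without momentum: for the spectral norm, the
    maximizer of <G, D> over ||D||_op <= 1 is U V^T from the reduced SVD
    G = U S V^T, i.e. the orthogonalized gradient."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return W - eta * (U @ Vt)
```

Note that under the Euclidean norm, Definition 1.1 recovers exactly the vanilla GD update w − η∇L(w), since ∥g∥₂ · g/∥g∥₂ = g.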

Empirical validation is extensive. Experiments on multilayer perceptrons, convolutional networks, and transformers trained on CIFAR‑10‑5k and Tiny‑Shakespeare demonstrate that, for each optimizer, the generalized sharpness S∥·∥ exhibits an initial rise followed by stabilization near the theoretical threshold 2/η. In several ℓ_∞ and block‑coordinate experiments, S∥·∥ slightly exceeds 2/η, suggesting that higher‑order curvature terms or norm‑induced anisotropy can push the system marginally beyond the classic bound. Plots of loss, gradient norm, DS, and S∥·∥ consistently show the predicted two‑phase behavior across architectures and norms.

A theoretical analysis on quadratic objectives shows that non‑Euclidean GD aligns its descent direction with the top eigenvector of the Hessian defined under the chosen norm, and that the DS‑sharpness relationship holds exactly. This provides a rigorous justification for the observed EoS dynamics beyond the Euclidean setting.
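The alignment mechanism is easiest to see in the Euclidean special case. In the toy demonstration below (our own, not the paper's analysis), each GD step on a quadratic scales the i-th eigencomponent by (1 − ηλ_i), so once η exceeds 2/λ_max the iterate is quickly dominated by the top Hessian eigenvector:

```python
import numpy as np

# GD on L(w) = 0.5 * w^T A w with eta slightly above 2 / lambda_max:
# the component along the top eigenvector grows in magnitude (|1 - eta*10| > 1)
# while all other components shrink, so the direction of w aligns with it.
A = np.diag([10.0, 1.0])      # eigenvectors are the coordinate axes
eta = 0.21                    # 2 / eta ~ 9.52 < lambda_max = 10
w = np.array([0.1, 1.0])      # starts nearly orthogonal to the top direction
for _ in range(40):
    w = w - eta * (A @ w)     # plain GD step on the quadratic
alignment = abs(w[0]) / np.linalg.norm(w)
assert alignment > 0.999      # the iterate has aligned with e_1
```

For non-Euclidean GD the same reasoning applies with the "top eigenvector" replaced by the maximizer of the generalized sharpness under the chosen norm.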

Finally, the paper proposes a geometry‑aware spectral measure: by feeding any norm into the definition of S∥·∥, practitioners obtain a single, comparable sharpness metric that works uniformly across optimizers. This unifies prior disparate notions of sharpness (vanilla GD, adaptive methods, SAM, etc.) and offers a practical diagnostic tool for monitoring training stability and for designing new optimization schemes.

In summary, the work delivers a principled, norm‑agnostic theory of Edge‑of‑Stability, introduces a generalized sharpness that subsumes existing measures, validates the theory on a wide range of modern deep‑learning models, and supplies an algorithmic recipe (Frank‑Wolfe approximation) for computing the metric in practice. It bridges a significant gap in our understanding of why large learning rates can be safely employed and opens avenues for optimizer design that explicitly leverages the geometry of the underlying parameter space.

