High-dimensional limit theorems for SGD: Momentum and Adaptive Step-sizes
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

We develop a high-dimensional scaling limit for Stochastic Gradient Descent with Polyak Momentum (SGD-M) and adaptive step-sizes. This provides a framework to rigorously compare online SGD with some of its popular variants. We show that the scaling limits of SGD-M coincide with those of online SGD after an appropriate time rescaling and a specific choice of step-size. However, if the step-size is kept the same between the two algorithms, SGD-M will amplify high-dimensional effects, potentially degrading performance relative to online SGD. We demonstrate our framework on two popular learning problems: Spiked Tensor PCA and Single Index Models. In both cases, we also examine online SGD with an adaptive step-size based on normalized gradients. In the high-dimensional regime, this algorithm yields multiple benefits: its dynamics admit fixed points closer to the population minimum, and the range of admissible step-sizes for which the iterates converge to such solutions widens. These examples provide a rigorous account, aligning with empirical motivation, of how early preconditioners can stabilize and improve dynamics in settings where online SGD fails.


💡 Research Summary

The paper develops a rigorous high‑dimensional scaling limit for stochastic gradient descent (SGD) when equipped with Polyak momentum (SGD‑M) and when using adaptive step‑sizes based on gradient normalization (SGD‑U). The authors extend the framework of recent high‑dimensional limit theorems, which previously handled plain online SGD, to these two widely used variants. Their analysis rests on two technical conditions—δₙ‑localizability and asymptotic closability—that control the behavior of a finite set of summary statistics (e.g., alignment with a signal, norm of the parameter vector) as both the data dimension and the number of iterations grow while the learning rate δₙ shrinks. Under these conditions, Theorem 2.3 shows that the interpolated statistics converge weakly to a stochastic differential equation (SDE) of the form

 duₜ = h(β, uₜ) dt + (1 − β)⁻¹ Σ(uₜ) dBₜ,

where β is the momentum coefficient, h is an effective drift that decomposes into a “signal” term f(u) and a “population‑corrector” term g(u), and Σ captures the covariance of the stochastic gradient noise. When β = 0 the result reduces to the known limit for plain SGD.
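To make the limiting object concrete, the SDE above can be simulated with a standard Euler–Maruyama scheme. This is a minimal sketch, not the paper's code: the drift `h` and diffusion `sigma` are hypothetical placeholders supplied by the caller, standing in for the effective drift h(β, u) and noise covariance Σ(u) of Theorem 2.3.

```python
import numpy as np

def euler_maruyama(h, sigma, u0, beta, T=1.0, n_steps=1000, rng=None):
    """Simulate du_t = h(beta, u_t) dt + (1 - beta)^{-1} sigma(u_t) dB_t
    with an Euler--Maruyama discretization. h and sigma are caller-supplied
    stand-ins for the effective drift and diffusion of the scaling limit."""
    rng = rng or np.random.default_rng(0)
    dt = T / n_steps
    u = np.asarray(u0, dtype=float)
    for _ in range(n_steps):
        dB = rng.normal(0.0, np.sqrt(dt), size=u.shape)
        # momentum inflates the diffusion term by (1 - beta)^{-1}
        u = u + h(beta, u) * dt + sigma(u) * dB / (1.0 - beta)
    return u

# Toy example: drift pulling u toward 1, zero noise; the iterates track
# the ODE u' = 1 - u and approach its fixed point u = 1.
u_end = euler_maruyama(lambda b, u: 1.0 - u, lambda u: 0.0 * u,
                       u0=np.array([0.0]), beta=0.5, T=5.0)
```

Note that with β = 0 and the same `h` and `sigma`, the scheme reduces to a discretization of the plain-SGD limit, matching the remark above.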

A key insight (Remark 2.4) is that the dynamics of SGD‑M can be matched to those of plain SGD by rescaling time by (1 − β)⁻¹ and adjusting the learning rate to δₙ/(1 − β). If the step‑size is kept identical for both algorithms, the corrector term g is amplified by a factor (1 − β)⁻², which can dominate the signal term f in the critical scaling regime (δₙ ≈ cδ/n). Consequently, SGD‑M may exacerbate high‑dimensional noise and degrade performance unless the step‑size is appropriately reduced.
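The amplification factor is simple enough to state in one line of code. The sketch below (illustrative only; the function name is ours) computes the (1 − β)⁻² factor by which the corrector term g is inflated when momentum is added without reducing the step-size:

```python
def corrector_amplification(beta):
    """Factor multiplying the population-corrector term g when SGD-M is run
    with the SAME step-size as plain SGD (cf. Remark 2.4)."""
    return (1.0 - beta) ** -2

# A typical momentum value of beta = 0.9 amplifies the corrector 100-fold,
# which can easily dominate the signal term f in the critical regime.
factor = corrector_amplification(0.9)
```

This makes the practical takeaway quantitative: at β = 0.9 the corrector is two orders of magnitude larger, so the step-size must shrink accordingly for SGD-M to match plain SGD.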

The paper then introduces a scalar preconditioner ηₙ(x, y) that normalizes the stochastic gradient by its Euclidean norm, specifically ηₙ = √n ‖∇Lₙ(x, y)‖. This choice ensures that the gradient magnitude, which typically scales as O(√n) in high dimensions, is brought to O(1), preserving a non‑trivial limit. By verifying that the normalized gradients satisfy the same localizability and closability conditions, the authors extend Theorem 2.3 to this preconditioned setting, yielding an SDE with modified drift and diffusion coefficients. They denote this algorithm “SGD‑U”.
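One plausible reading of the SGD-U update is sketched below. This is our illustrative implementation, not the paper's: we interpret the preconditioner as dividing the gradient by ηₙ = √n ‖∇Lₙ‖, so that each step has deterministic length δ/√n regardless of the raw gradient scale; the paper's exact normalization convention may differ.

```python
import numpy as np

def sgd_u_step(x, grad, delta, eps=1e-12):
    """One SGD-U step (sketch): divide the stochastic gradient by the scalar
    preconditioner eta_n = sqrt(n) * ||grad||, so the update has fixed
    magnitude delta / sqrt(n) independent of the raw gradient norm."""
    n = x.shape[0]
    return x - delta * grad / (np.sqrt(n) * np.linalg.norm(grad) + eps)

# The step length is delta / sqrt(n) no matter how large grad is:
x = np.zeros(100)
x_new = sgd_u_step(x, np.ones(100), delta=0.5)
```

Because the step length is fixed, an occasional huge stochastic gradient cannot blow up the iterate, which is one intuition for the enlarged stability region discussed below.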

Two canonical high‑dimensional inference problems are used as testbeds:

  1. Spiked Tensor PCA – Data are generated as Y = λ v^{⊗k} + W, where v is a unit signal vector, λ is the signal‑to‑noise ratio, and W is a Gaussian k‑tensor. The loss is the squared reconstruction error L(x, Y) = ‖Y − x^{⊗k}‖². The authors track the alignment m = ⟨x, v⟩ and the residual norm r² = ‖x − mv‖². With the critical learning‑rate scaling δₙ = cδ/n, Proposition 3.1 shows that SGD‑M obeys an ODE system where the corrector term appears with a factor (1 − β)⁻², while SGD‑U’s ODE lacks this amplification. Fixed‑point analysis reveals critical thresholds λ_{M,crit}(k, β, cδ) and λ_{U,crit}(k, cδ). For k = 2 (matrix PCA) the authors find λ_{U,crit} < λ_{M,crit} for any β > 0, meaning that SGD‑U can successfully recover the signal in regimes where both plain SGD and SGD‑M fail. Moreover, the admissible step‑size range for convergence is larger for SGD‑U, confirming that gradient‑norm normalization mitigates high‑dimensional noise.
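The matrix case (k = 2) of the spiked model can be simulated directly. The sketch below is our own illustrative implementation of online SGD with heavy-ball momentum on L(x, Y) = ‖Y − x xᵀ‖²_F (the function name and the 1/√n noise scaling are our choices, not the paper's); it returns the tracked alignment m = ⟨x, v⟩.

```python
import numpy as np

def matrix_pca_sgdm(x, v, lam, delta, beta=0.0, steps=1000, noise=1.0, seed=0):
    """Online SGD-M on the k=2 spiked PCA loss L(x, Y) = ||Y - x x^T||_F^2
    with a fresh sample Y = lam * v v^T + noise * W each step (illustrative
    sketch). Returns the final alignment m = <x, v>."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    mom = np.zeros_like(x)
    for _ in range(steps):
        W = noise * rng.normal(size=(n, n)) / np.sqrt(n)
        Y = lam * np.outer(v, v) + W
        # gradient of ||Y - x x^T||_F^2 in x
        grad = -2.0 * (Y + Y.T) @ x + 4.0 * (x @ x) * x
        mom = beta * mom + grad          # heavy-ball accumulator
        x = x - delta * mom
    return x @ v
```

In the noiseless case with λ = 1, the population dynamics along v reduce to s ← s + 4δ s(1 − s²), whose stable fixed point is the alignment m = √λ = 1; turning `noise` back on (and varying β) lets one observe the corrector amplification discussed above.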

  2. Single‑Index Model – Observations follow y = f(⟨θ*, x⟩) + ε with a non‑linear link f. The summary statistics are the cosine similarity between the iterate θ and the true parameter θ* and the norm of θ. Again, applying the same normalization yields an SDE with a reduced corrector term, leading to a broader basin of attraction and faster convergence compared with SGD‑M under the same learning‑rate scaling.
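The single-index dynamics can likewise be simulated with fresh Gaussian samples each step. The sketch below is our own illustration (function name, loss convention, and the optional SGD-U-style normalization are our choices): it runs online SGD on the squared loss for y = f(⟨θ*, x⟩) + ε, where `link` supplies the link f and its derivative f′.

```python
import numpy as np

def single_index_sgd(theta, theta_star, link, delta, steps=1000,
                     noise=0.1, seed=0, normalize=False):
    """Online SGD on the squared loss for y = f(<theta*, x>) + eps with a
    fresh Gaussian sample each step (illustrative sketch). link = (f, fprime).
    If normalize is True, apply the SGD-U-style gradient normalization."""
    f, fprime = link
    rng = np.random.default_rng(seed)
    n = theta.shape[0]
    for _ in range(steps):
        x = rng.normal(size=n)
        y = f(theta_star @ x) + noise * rng.normal()
        pred = theta @ x
        # gradient of (f(<theta, x>) - y)^2 in theta
        grad = 2.0 * (f(pred) - y) * fprime(pred) * x
        if normalize:
            grad = grad / (np.sqrt(n) * np.linalg.norm(grad) + 1e-12)
        theta = theta - delta * grad
    return theta
```

With a linear link and no label noise this is plain online linear regression, so the iterate recovers θ* and the cosine-similarity summary statistic tends to 1; swapping in a non-linear link and `normalize=True` gives a hands-on view of the broader basin of attraction described above.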

Across both examples, the authors provide numerical simulations that match the theoretical predictions, illustrating how the dynamics evolve under different β, λ, and cδ values. The simulations confirm that SGD‑U’s fixed points lie closer to the population optimum and that its stability region is substantially larger.

In summary, the paper makes three major contributions:

  • It extends high‑dimensional limit theory to momentum‑based SGD, quantifying precisely how momentum amplifies the population‑corrector term and under what rescaling the dynamics coincide with plain SGD.
  • It shows that keeping the step‑size unchanged when adding momentum can be detrimental in high dimensions, because the amplified corrector can overwhelm the descent direction.
  • It demonstrates that a simple adaptive step‑size based on gradient norm (SGD‑U) effectively neutralizes the high‑dimensional amplification, widens the admissible learning‑rate window, and yields fixed points nearer to the true optimum.

These results provide a rigorous foundation for empirical observations that early preconditioners (gradient clipping, normalization, or adaptive learning‑rates) improve training stability in deep learning. They also give concrete guidance on how to choose momentum and learning‑rate schedules when operating in regimes where the number of parameters far exceeds the number of samples.
