Exact Risk Curves of signSGD in High-Dimensions: Quantifying Preconditioning and Noise-Compression Effects


In recent years, signSGD has garnered interest both as a practical optimizer and as a simple model for understanding adaptive optimizers like Adam. Though there is general consensus that signSGD preconditions the optimization and reshapes the gradient noise, quantifying these effects in theoretically solvable settings remains difficult. We present an analysis of signSGD in a high-dimensional limit and derive a limiting SDE and ODE that describe the risk. Using this framework, we quantify four effects of signSGD: effective learning rate, noise compression, diagonal preconditioning, and gradient noise reshaping. Our analysis is consistent with experimental observations but goes further by quantifying the dependence of these effects on the data and noise distributions. We conclude with a conjecture on how these results might be extended to Adam.


💡 Research Summary

This paper provides a rigorous high‑dimensional analysis of signSGD, a simple yet practically important variant of stochastic gradient descent that replaces the stochastic gradient with its element‑wise sign. The authors consider a one‑pass linear regression problem with Gaussian data vectors x∈ℝ^d (zero mean, covariance K) and labels generated by y=⟨x,θ*⟩+ε, where ε is label noise with a C² density around zero. Under a set of technical assumptions—bounded spectrum of K, bounded operator norm of the sign‑data covariance Kσ, a specific scaling of the learning rate η′_t=η(t/d)/d, and a bounded initialization—the discrete signSGD iterates are shown to be well approximated by a continuous‑time stochastic differential equation (SDE) called Sign‑Homogenized SGD (SIGN‑HSGD).
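The setting above can be sketched in a few lines of NumPy. This is a minimal illustration of one-pass signSGD on streaming linear regression, not the paper's exact experimental setup; the learning-rate scaling η/d follows the scaling described above, and the specific values (dimension, noise level, step size) are arbitrary choices for illustration.

```python
import numpy as np

def signsgd_linear_regression(theta_star, K_sqrt, eta, noise_std, steps, rng):
    """One-pass signSGD on streaming linear regression.

    Each step draws a fresh sample x ~ N(0, K) (via K_sqrt) and a label
    y = <x, theta_star> + eps, then moves along the element-wise sign of
    the stochastic gradient with the eta/d learning-rate scaling.
    Illustrative sketch only; hyperparameters are not from the paper.
    """
    d = theta_star.shape[0]
    theta = np.zeros(d)
    risks = []
    for _ in range(steps):
        x = K_sqrt @ rng.standard_normal(d)                 # x ~ N(0, K)
        y = x @ theta_star + noise_std * rng.standard_normal()
        grad = (x @ theta - y) * x                          # grad of (1/2)(<x,theta> - y)^2
        theta -= (eta / d) * np.sign(grad)                  # element-wise sign step
        # Excess risk (1/2)(theta - theta*)^T K (theta - theta*); here K = K_sqrt @ K_sqrt.T
        diff = theta - theta_star
        risks.append(0.5 * diff @ (K_sqrt @ (K_sqrt.T @ diff)))
    return theta, risks

rng = np.random.default_rng(0)
d = 100
theta_star = np.ones(d)
theta, risks = signsgd_linear_regression(theta_star, np.eye(d),
                                         eta=0.5, noise_std=0.1,
                                         steps=2000, rng=rng)
```

Tracking the `risks` trajectory for increasing `d` is the kind of curve that, per the paper's result, concentrates around a deterministic limit.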

The SDE captures two essential mechanisms introduced by the sign operator: (i) a deterministic drift term that preconditions the gradient with the data covariance K, modulated by a scalar function φ(R) that depends on the current risk R; and (ii) a diffusion term whose covariance is proportional to Kσ, reflecting how the sign compresses and reshapes the original gradient noise. The function φ(R) is derived from the distribution of the label noise; for common noises (Gaussian, Rademacher, uniform, Lévy) φ can be computed analytically or numerically.
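Because φ(R) arises from expectations of sign(·) under the effective noise, its dependence on the noise law can be probed numerically. As a hedged illustration (this is not the paper's formula for φ), the mean of the sign of a Gaussian-perturbed quantity has the closed form E[sign(a + σZ)] = erf(a/(σ√2)) for Z ~ N(0, 1), which a Monte Carlo estimate recovers:

```python
import numpy as np
from math import erf, sqrt

def mean_sign_gaussian(a, sigma, n=1_000_000, seed=0):
    """Monte Carlo estimate of E[sign(a + sigma*Z)] for Z ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    return float(np.mean(np.sign(a + sigma * rng.standard_normal(n))))

a, sigma = 0.3, 1.0
mc = mean_sign_gaussian(a, sigma)
exact = erf(a / (sigma * sqrt(2)))   # closed form: 2*Phi(a/sigma) - 1
print(f"Monte Carlo: {mc:.4f}, exact: {exact:.4f}")
```

Swapping the Gaussian draw for, say, a uniform or heavy-tailed one changes this expectation, which is one way to see why φ depends on the label-noise distribution.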

Using concentration results for high-dimensional random matrices, the authors prove that the risk R(θ_k) = ½E[(⟨x, θ_k⟩ − y)²] of the discrete signSGD iterates concentrates, as d → ∞, around the risk of the SIGN‑HSGD process, which in turn is described by a deterministic limiting ODE.

