Sharp High-Probability Rates for Nonlinear SGD under Heavy-Tailed Noise via Symmetrization

We study high-probability convergence of SGD-type methods for non-convex optimization in the presence of heavy-tailed noise. To combat the heavy-tailed noise, a general black-box nonlinear framework is considered, subsuming nonlinearities like sign, clipping, normalization and their smooth counterparts. Our first result shows that nonlinear SGD (N-SGD) achieves the rate $\widetilde{\mathcal{O}}(t^{-1/2})$ for any noise with unbounded moments and a symmetric probability density function (PDF). Crucially, N-SGD has exponentially decaying tails, matching the performance of linear SGD under light-tailed noise. To handle non-symmetric noise, we propose two novel estimators based on the idea of noise symmetrization. The first, dubbed the Symmetrized Gradient Estimator (SGE), assumes a noiseless gradient at any reference point is available at the start of training, while the second, dubbed Mini-batch SGE (MSGE), uses mini-batches to estimate the noiseless gradient. Combined with the nonlinear framework, we get the N-SGE and N-MSGE methods, respectively, both achieving the same convergence rate and exponentially decaying tails as N-SGD, while allowing for non-symmetric noise with unbounded moments and a PDF satisfying a mild technical condition, with N-MSGE additionally requiring a bounded noise moment of order $p \in (1,2]$. Compared to works assuming noise with a bounded $p$-th moment, our results: 1) are based on a novel symmetrization approach; 2) provide a unified framework and relaxed moment conditions; 3) imply optimal oracle complexity of N-SGD and N-SGE, strictly better than that of existing works when $p < 2$, while the complexity of N-MSGE is close to that of existing works. Compared to works assuming symmetric noise with unbounded moments, we: 1) provide a sharper analysis and improved rates; 2) facilitate state-dependent symmetric noise; 3) extend the strong guarantees to non-symmetric noise.
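
To make the symmetrization idea concrete, here is a minimal sketch of how the SGE and MSGE estimators could be formed from a stochastic gradient oracle, based on the description in the abstract. This is our own illustrative reading, not the authors' code: the names `stoch_grad`, `sge`, `msge` and the exact batching choices are assumptions.

```python
import numpy as np

def sge(stoch_grad, x, x_ref, grad_ref):
    """Symmetrized Gradient Estimator (illustrative sketch).

    `grad_ref` is the *noiseless* gradient at a fixed reference point
    `x_ref`, assumed available once before training.  A fresh stochastic
    gradient at `x_ref` isolates a noise sample z'; subtracting it
    symmetrizes the noise at x: the result equals grad f(x) + (z - z'),
    and z - z' has a symmetric PDF when z and z' are i.i.d.
    """
    noise_ref = stoch_grad(x_ref) - grad_ref   # fresh noise sample z'
    return stoch_grad(x) - noise_ref           # = grad f(x) + (z - z')

def msge(stoch_grad, x, x_ref, batch_size):
    """Mini-batch SGE (illustrative sketch): the noiseless reference
    gradient is replaced by a mini-batch average at the reference point."""
    grad_ref_est = np.mean([stoch_grad(x_ref) for _ in range(batch_size)], axis=0)
    return sge(stoch_grad, x, x_ref, grad_ref_est)
```

Feeding either estimate through the nonlinearity then gives the N-SGE and N-MSGE updates discussed below.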


💡 Research Summary

This paper investigates high‑probability convergence guarantees for stochastic gradient descent (SGD) methods that incorporate nonlinear transformations—such as sign, clipping, and normalization—when the stochastic gradient noise exhibits heavy tails. Classical analyses of SGD either assume light‑tailed (bounded‑variance) noise or, for heavy‑tailed settings, require that the noise possesses a bounded p‑th moment for some p ∈ (1, 2]. Moreover, many recent works on nonlinear SGD rely on the additional assumption that the noise distribution is symmetric around zero. Both assumptions are restrictive for modern deep‑learning training, where empirical evidence points to asymmetric, possibly α‑stable or Pareto‑type noise with unbounded moments.

General nonlinear framework
The authors define a black‑box nonlinear mapping Ψ: ℝᵈ → ℝᵈ that can represent any component‑wise or joint transformation, including non‑smooth operators (sign, hard clipping) and their smooth approximations. The key analytical object is the “denoised” mapping Φ(x) ≜ E_z[Ψ(∇f(x) + z)], i.e., the expectation, over the noise z, of the nonlinearity applied to the noisy gradient.
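
As a concrete, non-authoritative illustration of this framework, the sketch below implements a few standard choices of Ψ, the generic nonlinear SGD update x_{t+1} = x_t − α_t Ψ(g_t), and a Monte-Carlo estimate of the denoised mapping Φ. The specific nonlinearities, smoothing constants, and step-size handling are our own illustrative choices, not the paper's.

```python
import numpy as np

# Illustrative nonlinearities Psi (names and constants are ours).
def sign_psi(g):
    return np.sign(g)                                        # component-wise sign

def clip_psi(g, tau=1.0):
    return g * min(1.0, tau / (np.linalg.norm(g) + 1e-12))   # joint (norm) clipping

def normalize_psi(g):
    return g / (np.linalg.norm(g) + 1e-12)                   # joint normalization

def nsgd(stoch_grad, psi, x0, step_sizes):
    """Generic N-SGD sketch: x_{t+1} = x_t - a_t * Psi(g_t), where g_t is a
    (possibly heavy-tailed) stochastic gradient at x_t."""
    x = np.asarray(x0, dtype=float)
    for a_t in step_sizes:
        x = x - a_t * psi(stoch_grad(x))
    return x

def phi_mc(true_grad, noise_sampler, psi, x, n_samples=1000):
    """Monte-Carlo estimate of the 'denoised' mapping
    Phi(x) = E_z[Psi(grad f(x) + z)]."""
    samples = [psi(true_grad(x) + noise_sampler()) for _ in range(n_samples)]
    return np.mean(samples, axis=0)
```

Any of the symmetrized estimators sketched earlier can be passed in place of `stoch_grad` to obtain N-SGE or N-MSGE.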

