Functional Central Limit Theorem for Stochastic Gradient Descent

Reading time: 6 minutes
...

📝 Original Info

  • Title: Functional Central Limit Theorem for Stochastic Gradient Descent
  • ArXiv ID: 2602.15538
  • Date: 2026-02-17
  • Authors: Author information was not provided in the source data.

📝 Abstract

We study the asymptotic shape of the trajectory of the stochastic gradient descent algorithm applied to a convex objective function. Under mild regularity assumptions, we prove a functional central limit theorem for the properly rescaled trajectory. Our result characterizes the long-term fluctuations of the algorithm around the minimizer by providing a diffusion limit for the trajectory. In contrast with classical central limit theorems for the last iterate or Polyak-Ruppert averages, this functional result captures the temporal structure of the fluctuations and applies to non-smooth settings such as robust location estimation, including the geometric median.

💡 Deep Analysis

📄 Full Content

In this work, we are interested in the asymptotic properties of the whole trajectory of stochastic algorithms for the minimization of convex objectives. Namely, let Φ : R^d → R (d ≥ 1) be a convex function, and let G : R^d → R^d be a measurable function such that G(θ) is a subgradient of Φ at θ, for all θ ∈ R^d. If Φ is differentiable, then G is simply ∇Φ. We consider algorithms based on iterations of the form

θ_n = θ_{n−1} − t_n G_n,  n ≥ 1,  (1)

where θ_0 ∈ R^d is the algorithm initialization, which we will always consider fixed and non-random for simplicity; (t_n)_{n≥1} is a sequence of deterministic step-sizes; and, for all n ≥ 1, G_n is a noisy version of G(θ_{n−1}) that can be written as G_n = G(θ_{n−1}) + ε_n for some random vector ε_n satisfying E[ε_n | θ_{n−1}] = 0 almost surely.
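To make the recursion concrete, here is a minimal Python sketch of iteration (1). The helper `noisy_subgradient` and the step-size schedule are hypothetical stand-ins for G_n = G(θ_{n−1}) + ε_n and (t_n), not part of the paper:

```python
import numpy as np

def sgd_trajectory(theta0, noisy_subgradient, step_size, n_iters, rng):
    """Iterate theta_n = theta_{n-1} - t_n * G_n and keep the whole trajectory."""
    theta = np.asarray(theta0, dtype=float)    # theta_0: fixed, non-random initialization
    trajectory = [theta.copy()]
    for n in range(1, n_iters + 1):
        t_n = step_size(n)                     # deterministic step-size t_n
        G_n = noisy_subgradient(theta, rng)    # G_n = G(theta_{n-1}) + eps_n, E[eps_n | theta_{n-1}] = 0
        theta = theta - t_n * G_n
        trajectory.append(theta.copy())
    return np.array(trajectory)                # shape (n_iters + 1, d): the object the FCLT describes
```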

More precisely, we focus on the case where Φ can be expressed as the expectation of a convex loss:

Φ(θ) = E[ϕ(X, θ)],  θ ∈ R^d,

where X is a random variable taking values in some abstract measurable space (E, E) and ϕ : E × R^d → R is a given map that is measurable in its first argument, convex in its second, and such that ϕ(X, θ) is integrable for all θ ∈ R^d. In that case, [9, Theorem 2] shows that there exists a map g : E × R^d → R^d that is measurable in its first argument and satisfies, with probability 1, that g(X, θ) is a subgradient of ϕ(X, ·) at θ, for all θ ∈ R^d. Setting G(θ) = E[g(X, θ)] for all θ ∈ R^d, [9, Theorem 3] ensures that the function G is well defined, measurable, and that G(θ) is a subgradient of Φ at θ, for all θ ∈ R^d. Thus, given i.i.d. random variables X_1, X_2, . . . with the same distribution as X, we can set G_n = g(X_n, θ_{n−1}) for all n ≥ 1, so that ε_n := G_n − G(θ_{n−1}) satisfies E[ε_n | θ_{n−1}] = 0.
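As a concrete instance of this setup, consider the geometric-median loss ϕ(x, θ) = ‖x − θ‖ mentioned in the abstract; a subgradient g(x, θ) = (θ − x)/‖θ − x‖ (and 0 at θ = x) yields the online update below. The data model and step-size constants are illustrative choices of ours, not taken from the paper:

```python
import numpy as np

def geomedian_subgradient(x, theta):
    """Subgradient in theta of phi(x, theta) = ||x - theta||: (theta - x)/||theta - x||, and 0 at theta = x."""
    diff = theta - x
    norm = np.linalg.norm(diff)
    return diff / norm if norm > 0 else np.zeros_like(diff)

# Online geometric-median estimation: one sample X_n per update.
rng = np.random.default_rng(0)
d, n_iters, c, alpha = 3, 10_000, 1.0, 0.75    # step-sizes t_n = c * n^{-alpha}: illustrative constants
theta = np.zeros(d)                            # theta_0: fixed, non-random
for n in range(1, n_iters + 1):
    X_n = rng.standard_t(df=2, size=d)         # heavy-tailed data, where a median-type target is natural
    theta -= c * n ** (-alpha) * geomedian_subgradient(X_n, theta)
print(theta)                                   # close to the geometric median (the origin, by symmetry)
```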

In an offline context, the estimation of a minimizer θ* of Φ based on i.i.d. samples X_1, . . . , X_n typically resorts to M-estimation, or empirical risk minimization, where one seeks a minimizer of the empirical loss n^{−1} Σ_{i=1}^n ϕ(X_i, θ), θ ∈ R^d [18, 17, 25, 19, 9]. Here, we consider the online problem, where the algorithm takes one data point X_n at a time to update its output θ_n.
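For contrast, a minimal sketch of the two regimes, assuming the same geometric-median loss as above; `scipy.optimize.minimize` stands in for a generic offline solver:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.standard_t(df=2, size=(500, 3))            # offline: the whole sample is available at once

# Offline M-estimation: minimize the empirical loss n^{-1} sum_i phi(X_i, theta).
emp_loss = lambda theta: np.mean(np.linalg.norm(X - theta, axis=1))
theta_offline = minimize(emp_loss, x0=np.zeros(3)).x

# Online: stream through the data once, updating theta_n from X_n alone.
theta = np.zeros(3)
for n, x in enumerate(X, start=1):
    diff = theta - x
    norm = np.linalg.norm(diff)
    g = diff / norm if norm > 0 else np.zeros(3)   # subgradient g(X_n, theta_{n-1})
    theta -= 1.0 * n ** (-0.75) * g                # t_n = c * n^{-alpha}, illustrative constants
```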

Under minimal convexity assumptions on Φ, we obtain a functional central limit theorem (FCLT) for the trajectories of the stochastic gradient descent (SGD) iterates (1). Compared with classical central limit theorems for last iterates (or Polyak-Ruppert averages), our result recovers information on the fluctuations of the trajectory in long-term regimes. Notably, unlike classical asymptotic results on SGD, we do not require global strong convexity of Φ: we show that local strong convexity on an arbitrarily small neighborhood of the minimizer suffices, encompassing situations such as geometric median estimation or, more generally, robust location estimation.
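To fix ideas, the following is a schematic of the type of statement involved, not the paper's exact result (the precise rescaling and limit process may differ). Assume steps t_n = c n^{−α} with 1/2 < α < 1, a unique minimizer θ* with positive-definite Hessian H = ∇²Φ(θ*), and noise covariance Γ at θ*. A last-iterate CLT gives a single Gaussian limit, while a functional limit describes the whole rescaled path as a diffusion:

```latex
% Last-iterate CLT: a single Gaussian limit at "time" n.
\[
  t_n^{-1/2}\,(\theta_n - \theta^*) \;\xrightarrow{d}\; \mathcal{N}(0, \Sigma),
  \qquad H\Sigma + \Sigma H^{\top} = \Gamma .
\]
% Functional CLT (schematic): the rescaled trajectory, viewed as a process in s,
% converges weakly to a stationary Ornstein-Uhlenbeck-type diffusion, which is
% what captures the temporal structure of the fluctuations.
\[
  \Bigl( t_{\lfloor ns \rfloor}^{-1/2}\,\bigl(\theta_{\lfloor ns \rfloor} - \theta^*\bigr) \Bigr)_{s > 0}
  \;\Longrightarrow\; (U_s)_{s > 0},
  \qquad dU_s = -H\,U_s\,ds + \Gamma^{1/2}\,dB_s .
\]
```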

Stochastic gradient descent (SGD), or the Robbins-Monro procedure, has been widely studied since its introduction in [28]. The first central limit theorem (CLT) is due to [11] and gives an n^{−1/2} rate of convergence when the step-size is of the form c n^{−1} for a specific choice of c > 0; for larger step-sizes t_n = c n^{−α} with 1/2 < α < 1, a CLT is obtained with rate n^{−α/2}. Later, [30] and then [14] established a CLT in the case t_n = c/n using other methods that allowed for weaker assumptions on the moments of the noise and extended the results to the multidimensional setting. [14] further highlighted the crucial role of the step-size constant c: if c is too small, convergence can be arbitrarily slow, while for larger values ensuring a √n-rate CLT, the asymptotic variance increases with c. The optimal value of c depends on the Hessian of the objective function at the minimizer, so its calibration requires prior knowledge of the local curvature.

Since then, other CLT-type results have been obtained, such as in the case of multiple targets in [26] or, very recently, of infinite-variance noise in the evaluation of gradients in [8]; in these two cases, the asymptotic distribution is not normal but is given as the stationary law of a stochastic process. Later, [27] showed that the average of the first n SGD iterates with large step-sizes converges at rate n^{−1/2}, with optimal asymptotic variance and a step-size that requires no prior information. This version of SGD has also been widely studied, for example in very recent works on the specific case of geometric medians [10], where the objective is neither smooth nor strongly convex, or even in non-convex cases [12].
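As an illustration of the averaging scheme attributed to [27] above, here is a minimal sketch (function names and constants are ours, not the paper's): large steps t_n = c n^{−α} drive the iterates, and a running mean of the iterates attains the n^{−1/2} rate without tuning c to the local curvature:

```python
import numpy as np

def averaged_sgd(theta0, noisy_subgradient, n_iters, rng, c=1.0, alpha=0.75):
    """SGD with large steps t_n = c * n^{-alpha} (1/2 < alpha < 1) plus Polyak-Ruppert averaging."""
    theta = np.asarray(theta0, dtype=float)
    theta_bar = theta.copy()                        # running average of the iterates
    for n in range(1, n_iters + 1):
        theta = theta - c * n ** (-alpha) * noisy_subgradient(theta, rng)
        theta_bar += (theta - theta_bar) / (n + 1)  # online mean of theta_0, ..., theta_n
    return theta, theta_bar                         # last iterate vs. averaged estimator
```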

Results of functional type, that is, describing the distribution of the entire trajectory of SGD, are much less common in the literature. Notably, [4] established an almost sure invariance principle in the special case of linear filtering and regression, thereby providing the asymptotic temporal correlations of the iterates. In this specific linear setting, the problem can be reduced to the study of a system of linear equations.

Reference

This content is AI-processed based on open access ArXiv data.
