Functional Central Limit Theorem for Stochastic Gradient Descent
We study the asymptotic shape of the trajectory of the stochastic gradient descent algorithm applied to a convex objective function. Under mild regularity assumptions, we prove a functional central limit theorem for the properly rescaled trajectory. Our result characterizes the long-term fluctuations of the algorithm around the minimizer by providing a diffusion limit for the trajectory. In contrast with classical central limit theorems for the last iterate or Polyak–Ruppert averages, this functional result captures the temporal structure of the fluctuations and applies to non-smooth settings such as robust location estimation, including the geometric median.
Authors: Kessang Flamand and Victor-Emmanuel Brunel, CREST-ENSAE (kessang.flamand@ensae.fr, victor-emmanuel.brunel@ensae.fr)
Keywords: Stochastic gradient descent, online convex optimization, stochastic approximation, functional central limit theorem, asymptotic fluctuations

1 Introduction

1.1 Framework

In this work, we are interested in the asymptotic properties of the whole trajectory of stochastic algorithms for the minimization of convex objectives. Namely, let Φ: R^d → R (d ≥ 1) be a convex function. Let G: R^d → R^d be a measurable function such that G(θ) is a subgradient of Φ at θ, for all θ ∈ R^d. If Φ is differentiable, then G is simply given by ∇Φ. We consider algorithms based on iterations of the form

θ_0 ∈ R^d;  θ_n = θ_{n−1} − t_n G_n, ∀n ≥ 1,  (1)

where θ_0 ∈ R^d is the algorithm initialization, which we will always consider fixed and non-random for simplicity; (t_n)_{n≥1} is a sequence of (deterministic) step-sizes; and for all n ≥ 1, G_n is a noisy version of G(θ_{n−1}) that can be written as G_n = G(θ_{n−1}) + ε_n for some random vector ε_n satisfying E[ε_n | θ_{n−1}] = 0 almost surely.

More precisely, we focus on the case when Φ can be expressed as the expectation of a convex loss:

Φ(θ) = E[φ(X, θ)], ∀θ ∈ R^d,  (2)

where X is a random variable taking values in some abstract measurable space (E, ℰ) and φ: E × R^d → R is a given map that is measurable in its first argument, convex in its second, and such that φ(X, θ) is integrable for all θ ∈ R^d. In that case, [9, Theorem 2] shows that there exists a map g: E × R^d → R^d that is measurable in its first argument and such that, with probability 1, g(X, θ) is a subgradient of φ(X, ·) at θ, for all θ ∈ R^d. Setting G(θ) = E[g(X, θ)] for all θ ∈ R^d, [9, Theorem 3] ensures that the function G is well defined and measurable, and that G(θ) is a subgradient of Φ at θ, for all θ ∈ R^d. Thus, given i.i.d. random variables X_1, X_2, ... with the same distribution as X, we can set G_n = g(X_n, θ_{n−1}) for all n ≥ 1, so that ε_n := G_n − G(θ_{n−1}) satisfies E[ε_n | θ_{n−1}] = 0.

In an offline context, the estimation of a minimizer θ* of Φ based on i.i.d. samples X_1, ..., X_n typically resorts to M-estimation, or empirical risk minimization, where one seeks a minimizer of the empirical loss n^{−1} Σ_{i=1}^n φ(X_i, θ), θ ∈ R^d [18, 17, 25, 19, 9]. Here, we consider the online problem, where the algorithm takes one data point X_n at a time to update its output θ_n.
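To fix ideas, here is a minimal Python sketch (ours, not part of the paper) of the recursion (1) with G_n = g(X_n, θ_{n−1}), instantiated on the geometric-median loss φ(x, θ) = ∥x − θ∥ that appears later in the paper as a motivating example; the function names, the Laplace sampling and the step-size constant are illustrative choices only.

```python
import numpy as np

def sgd(subgrad, sample, theta0, step, n_iter, rng):
    """Run theta_n = theta_{n-1} - t_n * g(X_n, theta_{n-1}) and return the whole path."""
    theta = np.array(theta0, dtype=float)
    path = [theta.copy()]
    for n in range(1, n_iter + 1):
        x = sample(rng)                               # draw X_n i.i.d.
        theta = theta - step(n) * subgrad(x, theta)   # noisy subgradient step
        path.append(theta.copy())
    return np.array(path)

def g_median(x, theta):
    """A subgradient of phi(x, theta) = ||x - theta|| with respect to theta."""
    diff = theta - x
    nrm = np.linalg.norm(diff)
    return diff / nrm if nrm > 0 else np.zeros_like(diff)

rng = np.random.default_rng(0)
path = sgd(g_median,
           sample=lambda r: r.laplace(size=2),   # X ~ Laplace(0,1), independent coordinates
           theta0=[5.0, -3.0],
           step=lambda n: 2.0 / n,               # t_n = delta/n with delta = 2
           n_iter=50_000,
           rng=rng)
print(path[-1])  # last iterate; the geometric median here is (0, 0)
```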
1.2 Contributions

Under minimal convexity assumptions on Φ, we obtain a functional central limit theorem (FCLT) for the trajectories of the stochastic gradient descent (SGD) iterates (1). Compared with classical central limit theorems for last iterates (or Polyak–Ruppert averages), our result allows one to recover information on the fluctuations of the trajectory in long-term regimes. Notably, compared to classical asymptotic results on SGD, we do not require global strong convexity of Φ. We show that local strong convexity on an arbitrarily small neighborhood of the minimizer suffices, encompassing situations such as geometric median estimation or, more generally, robust location estimation.

1.3 Related work

Stochastic gradient descent (SGD), or the Robbins–Monro procedure, has been widely studied since its introduction in [28]. The first central limit theorem (CLT) is from [11] and gives an n^{−1/2} rate of convergence when the step-size is of the form c n^{−1} for some specific choice of c > 0. For larger step-sizes t_n = c n^{−α} with 1/2 < α < 1, a CLT is obtained with rate n^{−α/2}. Later, [30] and then [14] established a central limit theorem in the case t_n = c/n using other methods, which allowed for weaker assumptions on the moments of the noise and extended the results to the multidimensional setting. [14] further highlighted the crucial role of the step-size constant c: if c is too small, convergence can be arbitrarily slow, while for larger values ensuring a CLT at rate n^{1/2}, the asymptotic variance increases with c. The optimal value of c depends on the Hessian of the objective function at the minimizer, so its calibration requires prior knowledge of the local curvature. Since then, other CLT-type results have been shown, such as in the case of multiple targets in [26] or, very recently, infinite variance of the noise in the evaluation of gradients in [8]. In these two cases, the asymptotic distribution is not normal but is given as the stationary law of a stochastic process. Later, [27] showed that the average of the first n SGD iterates with large step-sizes converges at rate n^{−1/2}, with optimal asymptotic variance and a step-size that does not require prior information. This version of SGD has also been widely studied, for example in very recent works in the specific case of geometric medians [10] – where the objective is neither smooth nor strongly convex – or even in non-convex cases [12].

Results of functional type, that is, describing the distribution of the entire trajectory of SGD, are much less common in the literature. Notably, [4] established an almost sure invariance principle in the special case of linear filtering and regression, thereby providing the asymptotic temporal correlations of the iterates. In this specific linear setting, the problem can be reduced to the study of a system of linear equations whose coefficients are observed with noise. [2] established a functional central limit theorem in the context of barycenter estimation on a Riemannian manifold, using a Riemannian version of stochastic gradient descent.
Their result is based on a convergence theorem for Markov chains to diffusion processes (Theorem 11.2.3 in [32]), which in turn relies on the convergence of the Markov transition operators to the generator of the limiting diffusion – a result that we also employ in our analysis. The authors assume bounded noise in the gradient evaluations and strong convexity of the objective function. Notably, Donsker's theorem [6, Theorem 8.2] provides a similar result for the case of mean estimation in Euclidean spaces.

Non-asymptotic properties of SGD have also raised a lot of interest. First, [24, 23] give bounds on the L² distance between the n-th iterate of SGD and the target minimizer for any finite n. Different rates of convergence have then been obtained in various settings. In particular, some bounds depending on the choice of the step-size were obtained in [22], and later extended to the non-strongly convex case in [3]. These bounds are typically written as the sum of a term that depends on the initial condition θ_0 and another term that does not. This decomposition highlights the two regimes that characterize the convergence of stochastic gradient descent: a first regime in which the iterates move on average in the direction of the minimizer and approach it rapidly, followed by a second regime in which the iterates are close to the minimizer and fluctuate around it without any globally preferred direction and with a decreasing variance. In particular, the trajectory in the first regime strongly depends on the initial condition, whereas this dependence is lost in the second regime. [22] showed that in the strongly convex case with t_n = c/n for some c > 0, the initial conditions are forgotten at rate O(1/n^{α/2}) with α > 1 depending on c. In the case of bounded gradients, it also shows that the fluctuations around the minimizer are of order O(1/√n), which is consistent with the central limit theorems of [11], [30], [14] and with our own result. In particular, our functional central limit theorem aims to quantify the fluctuations in the second regime.

For a version of SGD with a constant step-size, the iterates remain too noisy and do not converge to the minimizer θ*. Instead, they converge in distribution to the unique stationary distribution ([13]), which collapses to θ* as the constant step-size is chosen arbitrarily close to zero. It has been shown ([21], [20]) that stochastic gradient algorithms with constant step-size can be approximated by a stochastic differential equation of the form

dX_t = −∇Φ(X_t) dt + √η Σ(X_t) dB_t,

where η is the constant step-size and Σ(X_t) is a covariance matrix. This result quantifies the fluctuations of the SGD trajectory around that of the deterministic gradient descent, showing that the diffusive term vanishes as the step-size (and hence the noise magnitude) goes to zero. Consequently, this result is fundamentally different from our FCLT, which describes the fluctuations of SGD with a decreasing step-size around the minimizer θ*, in a regime where the iterates are already close to the optimum and the gradient of Φ is well approximated by its first-order Taylor expansion.

2 Main results

Before stating our results, we first give our main working assumptions.

Assumption 1. The function Φ has a unique minimizer θ*.

Assumption 2.
The function Φ has a unique minimizer θ* and it is twice continuously differentiable in a neighborhood of θ*, with positive definite Hessian ∇²Φ(θ*) at θ*.

Assumption 3. There exists L > 0 such that ∥G(θ)∥ ≤ L ∥θ − θ*∥, for all θ ∈ R^d.

This assumption states that Φ grows at most quadratically away from θ*. Note that we do not require Φ to be differentiable or to have Lipschitz gradients. In the absence of further regularity assumptions on Φ, Assumption 3 is necessary even to obtain convergence of SGD, since convergence could otherwise fail even in the noiseless case.

Assumption 4. There exists σ² > 0 such that E[∥g(X_1, θ) − G(θ)∥²] ≤ σ² for all θ ∈ R^d.

This assumption states that for any query of a subgradient of Φ, the error always has a second moment that is uniformly bounded, irrespective of the point θ ∈ R^d. In particular, under this assumption, independence of the X_i's yields that E[∥g(X_n, θ_{n−1}) − G(θ_{n−1})∥² | F_{n−1}] ≤ σ² almost surely, for all n ≥ 1, which is a common assumption. The next assumption is more stringent, as it requires that, at least in a neighborhood of θ*, that error is uniformly bounded in an L² sense.

Assumption 5. There exists η > 0 such that E[sup_{θ ∈ B(θ*, η)} ∥g(X_1, θ) − G(θ)∥²] < ∞.

Below, we explain how Assumption 5 can be replaced by a set of two assumptions which, in some cases, may be less restrictive (see Assumptions 6 and 7).

Our first theorem is the consistency of the sequence (θ_n)_{n≥0} defined in (1), provided the step-sizes are chosen appropriately. For all n ≥ 1, we denote by F_n the σ-algebra spanned by X_1, ..., X_n and by F_0 the trivial σ-algebra (we implicitly assume that all the X_n's are defined on some probability space (Ω, F, P); the σ-algebras F_0, F_1, ... are included in F).

Theorem 1. Let the sequence of step-sizes (t_n)_{n≥1} satisfy Σ_{n≥1} t_n = ∞ and Σ_{n≥1} t_n² < ∞. Let Assumptions 1, 3 and 4 hold. Then θ_n → θ* almost surely, as n → ∞.

Proof. For all n ≥ 1, denote by G_n = g(X_n, θ_{n−1}) and by ε_n = G_n − G(θ_{n−1}). Assumption 4 yields that

E[∥ε_n∥² | F_{n−1}] ≤ σ²  (3)

for all n ≥ 1. Therefore,

∥θ_n − θ*∥² = ∥θ_{n−1} − t_n G_n − θ*∥²
= ∥θ_{n−1} − θ*∥² − 2 t_n (θ_{n−1} − θ*)^⊤ G_n + t_n² ∥G_n∥²
= ∥θ_{n−1} − θ*∥² − 2 t_n (θ_{n−1} − θ*)^⊤ G(θ_{n−1}) − 2 t_n (θ_{n−1} − θ*)^⊤ ε_n + t_n² ∥G(θ_{n−1})∥² + 2 t_n² G(θ_{n−1})^⊤ ε_n + t_n² ∥ε_n∥²
≤ (1 + L² t_n²) ∥θ_{n−1} − θ*∥² − 2 t_n (Φ(θ_{n−1}) − Φ(θ*)) − 2 t_n (θ_{n−1} − θ*)^⊤ ε_n + 2 t_n² G(θ_{n−1})^⊤ ε_n + t_n² ∥ε_n∥²,

where we used the convexity of Φ and Assumption 3 in the last inequality. Taking the conditional expectation given F_{n−1} on both sides yields

E[∥θ_n − θ*∥² | F_{n−1}] ≤ (1 + L² t_n²) ∥θ_{n−1} − θ*∥² − 2 t_n (Φ(θ_{n−1}) − Φ(θ*)) + t_n² σ²  (4)

thanks to (3). Now, by the Robbins–Siegmund theorem [29], the sequence (∥θ_n − θ*∥)_{n≥0} must converge almost surely to a non-negative random variable Z, and the sum Σ_{n≥1} t_n (Φ(θ_{n−1}) − Φ(θ*)) must be finite with probability 1. By Lemma 2, it must hold that Z = 0 almost surely, hence θ_n → θ* almost surely, as n → ∞. □

Now, if we also assume Assumption 2, we have the following result, which is essential for our main theorem.

Theorem 2. Let Assumptions 2, 3 and 4 hold. Then, the sequence (√n (θ_n − θ*))_{n≥0} is tight.
We defer the proof of this theorem to the appendix. The next proposition introduces a diffusion process, which will be the limit of a rescaled version of the trajectory of the sequence (θ_n)_{n≥0}.

Proposition 1. Let H, Σ ∈ R^{d×d} be positive semi-definite, symmetric matrices. Assume that H is invertible and let 0 < µ_1 ≤ ... ≤ µ_d be its eigenvalues and (e_1, ..., e_d) an associated orthonormal basis of eigenvectors. For all t > 0 and twice continuously differentiable functions f: R^d → R, let G_t f: R^d → R be the function defined as

G_t f(y) = −t^{−1} y^⊤ H ∇f(y) + (1/2) Tr(Σ ∇²f(y)), ∀y ∈ R^d.

Let (B_t)_{t≥0} be a Brownian motion adapted to a filtration A = (A_t)_{t≥0}. There exists a unique A-adapted diffusion process Y (up to indistinguishability) whose generator is given by (G_t)_{t>0} and that satisfies Y_t → 0 in probability as t ↓ 0. Moreover, Y_t → 0 almost surely as t ↓ 0 and one can write, with probability 1, for all t > 0,

Y_t = ∫_0^t e^{log(s/t) H} Σ^{1/2} dB_s = Σ_{i=1}^d t^{−µ_i} (e_i^⊤ ∫_0^t s^{µ_i} Σ^{1/2} dB_s) e_i.

By [33, Theorem 7.3.3], the process (Y_t)_{t>0} must satisfy the stochastic differential equation

dY_t = −t^{−1} H Y_t dt + Σ^{1/2} dB_t, ∀t > 0.

In the sequel, we let Assumption 2 hold. We denote by 0 < λ_1 ≤ ... ≤ λ_d the eigenvalues of ∇²Φ(θ*), with an associated orthonormal basis of eigenvectors (e_1, ..., e_d). We also assume that g(X_1, θ*) has two moments and we let Γ = E[g(X_1, θ*) g(X_1, θ*)^⊤] be its covariance matrix. Note that E[g(X_1, θ*)] = G(θ*) = ∇Φ(θ*) = 0, since Assumption 2 implies differentiability of Φ at θ*. Moreover, since Φ is differentiable at θ*, φ(X_1, ·) must be almost surely differentiable at θ*, by [9, Lemma 13]. Hence, the choice of g(X_1, θ*) in the definition of Γ is almost surely unique.

Fix a number δ > 1/λ_1 and consider step-sizes given by t_n = δ/n, for all n ≥ 1. Applying Proposition 1 to H = δ∇²Φ(θ*) − I_d and Σ = δ²Γ yields the existence of a unique (up to indistinguishability) diffusion process Y in C_0((0, ∞), R^d) with generator (G_t)_{t>0} given by

G_t f(y) = t^{−1} y^⊤ (I_d − δ∇²Φ(θ*)) ∇f(y) + (δ²/2) Tr(Γ ∇²f(y)), y ∈ R^d, t > 0, f ∈ C²(R^d, R),  (5)

and satisfying both

dY_t = t^{−1} (I_d − δ∇²Φ(θ*)) Y_t dt + δ Γ^{1/2} dB_t, ∀t > 0,  (6)

where (B_t)_{t≥0} is a standard Brownian motion, and Y_t → 0 in probability as t ↓ 0. In the following, we let a(t, y) = t^{−1}(I_d − δ∇²Φ(θ*)) y and b(t, y) = δ²Γ, for all (t, y) ∈ (0, ∞) × R^d, which we refer to as the drift and diffusion terms of (Y_t)_{t>0}, respectively.

Now, we define rescaled, continuous-time trajectories of the sequence of iterates (θ_n)_{n≥1} as follows. For every n ≥ 1, consider the sequence (Ỹ^n_k)_{k≥0} defined by setting

Ỹ^n_k = (k/√n)(θ_k − θ*), ∀k ≥ 0.  (7)

Then, (Ỹ^n_k)_{k≥0} is an inhomogeneous, discrete-time Markov chain. We introduce the continuous-time process (Y^n_t)_{t>0} by linear interpolation of the values of Ỹ^n_k, that is, we set, for all t > 0,

Y^n_t = (k − nt) Ỹ^n_{k−1} + (nt − k + 1) Ỹ^n_k, (k − 1)/n < t ≤ k/n.

By construction, Y^n is a continuous map defined on (0, ∞) and taking values in R^d.
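To make the rescaling (7) and the limiting diffusion (6) concrete, here is a short Python sketch (ours, not from the paper): it rescales a given SGD path into Ỹ^n_k and simulates the SDE dY_t = t^{−1}(I_d − δ∇²Φ(θ*)) Y_t dt + δΓ^{1/2} dB_t by an Euler–Maruyama scheme started near time 0. The arguments `hess` and `Gamma` stand for ∇²Φ(θ*) and Γ and must be supplied by the user; all names and discretization choices are illustrative.

```python
import numpy as np

def rescaled_path(theta_path, theta_star):
    """Compute Y~^n_k = (k / sqrt(n)) * (theta_k - theta*), k = 0..n, from an SGD path."""
    theta_path = np.asarray(theta_path)
    n = len(theta_path) - 1
    k = np.arange(n + 1)[:, None]
    return k / np.sqrt(n) * (theta_path - np.asarray(theta_star))

def euler_maruyama_limit(hess, Gamma, delta, T=1.0, eps=1e-3, n_steps=10_000, rng=None):
    """Simulate dY_t = t^{-1}(I - delta*hess) Y_t dt + delta*Gamma^{1/2} dB_t on [eps, T], Y_eps = 0."""
    rng = rng or np.random.default_rng()
    d = hess.shape[0]
    sqrt_Gamma = np.linalg.cholesky(Gamma)      # one square root of Gamma (any works in law)
    drift_mat = np.eye(d) - delta * hess
    dt = (T - eps) / n_steps
    t, y = eps, np.zeros(d)
    path = [y.copy()]
    for _ in range(n_steps):
        dB = rng.normal(scale=np.sqrt(dt), size=d)
        y = y + (drift_mat @ y) * dt / t + delta * (sqrt_Gamma @ dB)
        t += dt
        path.append(y.copy())
    return np.array(path)
```

Overlaying `rescaled_path(...)` against a few simulated limit paths reproduces the diffusion-like behavior illustrated in the third panel of Figure 1 below.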
We denote by C_0((0, ∞), R^d) the space of all such maps, which we equip with the topology induced by uniform convergence on compact intervals of (0, ∞). Let us state our main result, providing a functional central limit theorem for (Y^n)_{n≥1}.

Theorem 3. Let Assumptions 2, 3, 4 and 5 hold and let Y be the diffusion defined in (6). Then, Y^n converges to Y in distribution, as n → ∞, in the space C_0((0, ∞), R^d) endowed with the topology induced by uniform convergence on compact intervals.

Remark 1. The limiting law given by Theorem 3 is centered and does not depend on the initial condition θ_0. However, the first iterations of stochastic gradient descent generally depend on the initial condition and, in particular, there is no reason for them to be centered around the minimizer. This is because the dependence on the initial condition decreases faster than the amplitude of the fluctuations (see, for instance, [22]). Thus, under our rescaling, the term depending on the initial condition disappears asymptotically, and the remaining fluctuations are centered and Gaussian. The topology fixed on C_0((0, ∞), R^d) must, however, account for the time required to forget the initial condition, and consequently cannot describe convergence on time intervals that include 0.

Figure 1 below illustrates the previous remark through a realization of stochastic gradient descent. The first panel shows the entire trajectory, exhibiting a first regime corresponding to a noisy gradient descent, followed by a second regime in which the iterates remain close to the minimizer. The second panel zooms in on this second regime and shows centered fluctuations of decreasing amplitude. The third panel shows the rescaled trajectory obtained via (7). The resulting trajectory illustrates the result of Theorem 3, as it exhibits a diffusion-like behavior with fluctuations of constant variance.

Figure 1: Stochastic gradient descent trajectory for the estimation of the median of a Laplace(0, 1) distribution in R² with independent coordinates, based on n = 50000 samples. The first panel shows the full trajectory. The second panel zooms in on the fluctuations around the minimizer, with the first 2000 iterations removed. The third panel shows the rescaled trajectory, illustrating the diffusion-like behavior predicted by Theorem 3. Color indicates time, from light (start) to dark (end). We set the step-size to 2/k.

In order to present a proof of Theorem 3, let us introduce some important quantities. For all n ≥ 1, we let (P^n_k)_{k≥1} be the sequence of transition kernels of the Markov chain (Ỹ^n_k)_{k≥0}, that is, P^n_k(y, A) = P(Ỹ^n_k ∈ A | Ỹ^n_{k−1} = y) for all y ∈ R^d and all Borel sets A in R^d. Now, for all t > 0 and y ∈ R^d, we also define a_n(t, y) and b_n(t, y) as the (rescaled) first and second conditional moments of the increments:

a_n(t, y) = n E[Ỹ^n_{⌊nt⌋} − Ỹ^n_{⌊nt⌋−1} | Ỹ^n_{⌊nt⌋−1} = y]

and

b_n(t, y) = n E[(Ỹ^n_{⌊nt⌋} − Ỹ^n_{⌊nt⌋−1})(Ỹ^n_{⌊nt⌋} − Ỹ^n_{⌊nt⌋−1})^⊤ | Ỹ^n_{⌊nt⌋−1} = y].

The next result, which is key to our main theorem, shows in particular that a_n and b_n converge uniformly on compact subsets of (0, ∞) × R^d to the drift and diffusion terms a and b of Y, respectively.

Proposition 2. Let Assumptions 2, 3 and 5 hold. Fix R, r, ε, T with 0 < ε < T.
Then, the following statements hold.

(i) sup_{ε ≤ t ≤ T, y ∈ B(θ*, R)} n P^n_{⌊nt⌋}(y, B(y, r)^∁) → 0 as n → ∞;

(ii) sup_{ε ≤ t ≤ T, y ∈ B(θ*, R)} ∥a_n(t, y) − a(t, y)∥ → 0 as n → ∞;

(iii) sup_{ε ≤ t ≤ T, y ∈ B(θ*, R)} ∥b_n(t, y) − δ²Γ∥ → 0 as n → ∞.

Proof. First, for n ≥ 1 and k ≥ 1, rewrite

Ỹ^n_k = (k/√n)(θ_k − θ*) = (k/√n)(θ_{k−1} − (δ/k) g(X_k, θ_{k−1}) − θ*)
= ((k−1)/√n)(θ_{k−1} − θ*) + (1/√n)(θ_{k−1} − θ*) − (δ/√n) g(X_k, θ_{k−1})
= Ỹ^n_{k−1} + Ỹ^n_{k−1}/(k−1) − (δ/√n) g(X_k, θ* + √n Ỹ^n_{k−1}/(k−1)).  (8)

Now, fix t ∈ [ε, T] and y ∈ B(θ*, R). First, note that using (8) and independence of the X_i's, P^n_{⌊nt⌋}(y, B(y, r)^∁) can be rewritten as

P^n_{⌊nt⌋}(y, B(y, r)^∁) = P( ∥ y/(⌊nt⌋−1) − (δ/√n) g(X_{⌊nt⌋}, θ* + √n y/(⌊nt⌋−1)) ∥ > r ).  (9)

Now, by the triangle inequality,

∥ y/(⌊nt⌋−1) − (δ/√n) g(X_{⌊nt⌋}, θ* + √n y/(⌊nt⌋−1)) ∥
≤ ∥y∥/(⌊nt⌋−1) + (δ/√n) ∥ g(X_{⌊nt⌋}, θ* + √n y/(⌊nt⌋−1)) ∥
≤ ∥y∥/(⌊nt⌋−1) + (δ/√n) ∥ G(θ* + √n y/(⌊nt⌋−1)) ∥ + (δ/√n) ∥ g(X_{⌊nt⌋}, θ* + √n y/(⌊nt⌋−1)) − G(θ* + √n y/(⌊nt⌋−1)) ∥
≤ (1 + δL) ∥y∥/(⌊nt⌋−1) + (δ/√n) ∥ g(X_{⌊nt⌋}, θ* + √n y/(⌊nt⌋−1)) − G(θ* + √n y/(⌊nt⌋−1)) ∥
≤ (1 + δL)(∥θ*∥ + R)/(⌊nε⌋−1) + (δ/√n) ∥ g(X_{⌊nt⌋}, θ* + √n y/(⌊nt⌋−1)) − G(θ* + √n y/(⌊nt⌋−1)) ∥,

where we used Assumption 3 in the penultimate inequality. Therefore, if n is sufficiently large so that the first term in the last line is not larger than r/2, this and (9) yield

P^n_{⌊nt⌋}(y, B(y, r)^∁) ≤ P( ∥ g(X_{⌊nt⌋}, θ* + √n y/(⌊nt⌋−1)) − G(θ* + √n y/(⌊nt⌋−1)) ∥ > r√n/(2δ) )
= P( ∥ g(X_1, θ* + √n y/(⌊nt⌋−1)) − G(θ* + √n y/(⌊nt⌋−1)) ∥ > r√n/(2δ) ).  (10)

Now, further assume that n is sufficiently large so that √n R/(nε − 1) ≤ η, where η is defined in Assumption 5. Letting V = sup_{θ ∈ B(θ*, η)} ∥g(X_1, θ) − G(θ)∥², we obtain that

n P^n_{⌊nt⌋}(y, B(y, r)^∁) ≤ n P(V > r² n/(2δ)²),

which goes to 0 as n → ∞ by Lemma 3. This proves the first statement.

For the second statement, fix again t ∈ [ε, T] and y ∈ B(θ*, R), let n ≥ 1 be sufficiently large so that nε ≥ 2, and use independence of the X_i's and (8) to write a_n(t, y) as

a_n(t, y) = n E[ y/(⌊nt⌋−1) − (δ/√n) g(X_{⌊nt⌋}, θ* + √n y/(⌊nt⌋−1)) ]
= n y/(⌊nt⌋−1) − √n δ G(θ* + √n y/(⌊nt⌋−1)).

Now, Assumption 2 implies that Φ is differentiable in a neighborhood of θ*, yielding that G(θ* + √n y/(⌊nt⌋−1)) = ∇Φ(θ* + √n y/(⌊nt⌋−1)). Moreover, Φ is twice continuously differentiable in a neighborhood of θ* and ∇Φ(θ*) = 0, so

√n ∇Φ(θ* + √n y/(⌊nt⌋−1)) → t^{−1} ∇²Φ(θ*) y as n → ∞,

uniformly in t ∈ [ε, T] and y ∈ B(θ*, R). Hence,

a_n(t, y) → t^{−1}(y − δ ∇²Φ(θ*) y) = a(t, y) as n → ∞,

uniformly in t ∈ [ε, T] and y ∈ B(θ*, R), which proves the second statement.

Finally, for the third statement, again using (8), we write

b_n(t, y) = n E[ (y/(⌊nt⌋−1) − (δ/√n) g(X_1, θ* + √n y/(⌊nt⌋−1))) (y/(⌊nt⌋−1) − (δ/√n) g(X_1, θ* + √n y/(⌊nt⌋−1)))^⊤ ]
= (n/(⌊nt⌋−1)²) y y^⊤ + δ² E[ g(X_1, θ* + √n y/(⌊nt⌋−1)) g(X_1, θ* + √n y/(⌊nt⌋−1))^⊤ ]
− (√n δ/(⌊nt⌋−1)) y G(θ* + √n y/(⌊nt⌋−1))^⊤ − (√n δ/(⌊nt⌋−1)) G(θ* + √n y/(⌊nt⌋−1)) y^⊤.

Using a similar argument as above, the last two terms go to 0 uniformly in t ∈ [ε, T] and y ∈ B(θ*, R), and it is clear that the first term does too.
Hence, it is now sufficient to check that

F(t, y) := E[ g(X_1, θ* + √n y/(⌊nt⌋−1)) g(X_1, θ* + √n y/(⌊nt⌋−1))^⊤ ] → Γ as n → ∞,

uniformly in t ∈ [ε, T] and y ∈ B(θ*, R). First, by Lemma 4, it is sufficient to show that for all sequences (t_n)_{n≥1} ⊆ [ε, T] and (y_n)_{n≥1} ⊆ B(θ*, R), F(t_n, y_n) → Γ as n → ∞. Note that for such sequences, we have √n y_n/(⌊n t_n⌋ − 1) → 0 as n → ∞. Hence, let us simply consider any sequence (u_n)_{n≥1} in R^d such that u_n → 0 as n → ∞ and show that

E[ g(X_1, θ* + u_n) g(X_1, θ* + u_n)^⊤ ] → Γ as n → ∞.  (11)

First, without loss of generality, assume that ∥u_n∥ ≤ η for all n ≥ 1. Since φ(X_1, ·) is almost surely differentiable at θ*, [9, Lemma 9] yields that g(X_1, θ* + u_n) → g(X_1, θ*) almost surely as n → ∞. Moreover, the operator norm of g(X_1, θ* + u_n) g(X_1, θ* + u_n)^⊤ is given by ∥g(X_1, θ* + u_n)∥², which is dominated by sup_{θ ∈ B(θ*, η)} ∥g(X_1, θ)∥², which, by Assumption 5, is integrable. Therefore, the dominated convergence theorem yields (11). □

We are now ready to prove Theorem 3.

Proof of Theorem 3. First, for all positive integers p and n, let Y^{n,p} be the restriction of the stochastic process Y^n to the interval [1/p, ∞). Then, we have the following lemma, whose proof is deferred to the appendix.

Lemma 1. For all p ≥ 1, the process (Y^{n,p})_{n≥1} is tight in C_0([1/p, ∞), R^d) equipped with the topology induced by uniform convergence on compact intervals.

We will now show that any subsequence of (Y^n)_{n≥1} has a further subsequence that converges weakly to a process (Z_t)_{t>0} with generator G and that satisfies Z_t → 0 in probability as t ↓ 0. In order to avoid renumbering, let us simply show that (Y^n)_{n≥1} has such a subsequence. Lemma 1 shows the existence of a subsequence (Y^{φ_1(n),1})_{n≥1} of (Y^{n,1})_{n≥1} that converges weakly in C_0([1, ∞), R^d) to some process with generator (G_t)_{t≥1}. Similarly, one can extract a subsequence (Y^{φ_2(n),2})_{n≥1} of (Y^{φ_1(n),2})_{n≥1} that converges weakly in C_0([1/2, ∞), R^d) to some process with generator (G_t)_{t≥1/2}. Reiterating this construction, for every integer p ≥ 1, we can extract a subsequence (Y^{φ_p(n),p})_{n≥1} of (Y^{φ_{p−1}(n),p})_{n≥1} that converges weakly in C_0([1/p, ∞), R^d) to some process with generator (G_t)_{t≥1/p}. Now, fix ε > 0 and consider the diagonal subsequence (Y^{φ_n(n)})_{n≥1} of (Y^n)_{n≥1}. Let p ≥ 1 be a sufficiently large integer such that 1/p ≤ ε. Then – except maybe for the first terms – the restriction of (Y^{φ_n(n)})_{n≥1} to [1/p, ∞) is a subsequence of (Y^{φ_p(n),p})_{n≥1}; hence, it converges weakly in C_0([1/p, ∞), R^d) to some process with generator (G_t)_{t≥1/p}. Therefore, the restriction of (Y^{φ_n(n)})_{n≥1} to [ε, ∞) converges weakly in C_0([ε, ∞), R^d) to some process with generator (G_t)_{t≥ε}. In particular, the restriction of (Y^{φ_n(n)})_{n≥1} to [ε, ∞) is tight in C_0([ε, ∞), R^d) for every ε > 0. Hence, Lemma 6 ensures that (Y^{φ_n(n)})_{n≥1} is tight in C_0((0, ∞), R^d). Moreover, if Z is an accumulation point of (Y^{φ_n(n)})_{n≥1}, then Z is a process with generator G. To conclude, we need to show that Z_t → 0 in probability as t ↓ 0.
Consider a subsequence (Y^{ψ(n)})_{n≥1} of (Y^{φ_n(n)})_{n≥1} that converges in distribution to Z. First, note that for all t > 0, Y^{ψ(n)}_t → Z_t in distribution as n → ∞, since the map g ∈ C_0((0, ∞), R^d) ↦ g(t) is continuous for the topology induced by uniform convergence on compact intervals. Therefore, for all α > 0 and t > 0,

P(∥Z_t∥ > α) = lim_{n→∞} P(∥Y^{ψ(n)}_t∥ > α) ≤ sup_{n≥1} P(√t √n ∥θ_n − θ*∥ > α) → 0 as t ↓ 0,

by Theorem 2. Therefore, Z_t → 0 in probability as t ↓ 0. Let Y be the process defined in (6). Then, the uniqueness statement in Proposition 1 ensures that Z and Y are indistinguishable. Hence, Y is the unique accumulation point of the tight sequence (Y^{φ_n(n)})_{n≥1}, and the result follows. □

For the proof of Theorem 3, we could in fact replace Assumption 5, which may be quite stringent in certain situations, with the following two assumptions.

Assumption 6. There exist η, ε, M > 0 such that E[∥g(X_1, θ) − G(θ)∥^{2+ε}] ≤ M for all θ ∈ B(θ*, η).

Assumption 7. The map θ ∈ R^d ↦ E[g(X_1, θ) g(X_1, θ)^⊤] is continuous at θ*.

Indeed, Assumption 5 was used in two places. First, to prove that the right-hand side in (10) goes to 0 as n → ∞ (via Lemma 3); this would still hold under Assumption 6. Second, to prove (11), which would be a direct consequence of Assumption 7. Moreover, note that under Assumptions 2 and 3, Assumption 7 is equivalent to the continuity at θ* of the map θ ∈ R^d ↦ Var(g(X_1, θ)), where Var(g(X_1, θ)) is simply the covariance matrix of the noise term in the SGD step when the subgradient G of Φ is evaluated at θ.

As a consequence of Theorem 3, we obtain the asymptotic normality of the iterates (θ_n)_{n≥1} of the stochastic algorithm. We defer its proof to the appendix.

Corollary 1. Under the same assumptions as in Theorem 3, we have that √n (θ_n − θ*) → N_d(0, Σ) in distribution as n → ∞, where

Σ = δ ∫_0^∞ e^{t/δ} e^{−t∇²Φ(θ*)} Γ e^{−t∇²Φ(θ*)} dt.

For n ≥ 1, define ˆθ_n as a (measurable) minimizer of the (offline) empirical risk θ ∈ R^d ↦ (1/n) Σ_{i=1}^n φ(X_i, θ). It is known that under Assumptions 2 and 4, √n (ˆθ_n − θ*) → N_d(0, ∆) in distribution, where ∆ = ∇²Φ(θ*)^{−1} Γ ∇²Φ(θ*)^{−1} [17]. The following result compares ∆ with the asymptotic variance Σ of θ_n obtained in Corollary 1. In that result, ∥·∥_op stands for the operator norm; that is, for any symmetric matrix M ∈ R^{d×d}, ∥M∥_op is the largest eigenvalue of M in absolute value.

Proposition 3. The matrix Σ − ∆ is positive semi-definite and

∥Σ − ∆∥_op ≤ ((δλ_d − 1)²/(2δλ_d − 1)) ∥∆∥_op.
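As a numerical illustration of Proposition 3 (our own sketch, not part of the paper), both asymptotic covariance matrices can be evaluated in the eigenbasis of ∇²Φ(θ*), using the closed forms Σ_{i,j} = δ Γ_{i,j}/(λ_i + λ_j − 1/δ) and ∆_{i,j} = Γ_{i,j}/(λ_i λ_j) derived in the proof in Appendix B.5. The example Hessian, noise covariance and value of δ below are arbitrary placeholders.

```python
import numpy as np

def asymptotic_covariances(hess, Gamma, delta):
    """Return (Sigma, Delta) in the original basis, via the eigendecomposition of hess."""
    lam, P = np.linalg.eigh(hess)                     # hess = P diag(lam) P^T, lam ascending
    assert np.all(lam > 1.0 / delta), "requires delta > 1/lambda_1"
    G = P.T @ Gamma @ P                               # coordinates of Gamma in the eigenbasis
    denom = lam[:, None] + lam[None, :] - 1.0 / delta
    Sigma = P @ (delta * G / denom) @ P.T             # Sigma_{ij} = delta*G_{ij}/(lam_i+lam_j-1/delta)
    Delta = P @ (G / (lam[:, None] * lam[None, :])) @ P.T
    return Sigma, Delta

# Arbitrary placeholder example.
rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
hess = A @ A.T + np.eye(3)                   # a positive definite "Hessian"
B = rng.normal(size=(3, 3))
Gamma = B @ B.T                              # a noise covariance
delta = 2.0 / np.linalg.eigvalsh(hess)[0]    # some delta > 1/lambda_1
Sigma, Delta = asymptotic_covariances(hess, Gamma, delta)
print(np.linalg.eigvalsh(Sigma - Delta).min() >= -1e-10)   # Sigma - Delta is PSD (Proposition 3)
```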
Now, we give a brief description of the asymptotic stochastic process Y. First, note that Y is a centered Gaussian process, thanks to the integral representation given in Proposition 1. The next result gives an estimate for the supremum of its norm on any bounded interval.

Theorem 4. There exists a universal constant C > 0 such that for all T > 0,

E[sup_{0 < t ≤ T} ∥Y_t∥] ≤ C δ ∥Γ^{1/2}∥_F √T,

where ∥·∥_F denotes the Frobenius norm.

A Auxiliary results

Lemma 2. Let f: R^d → R be a convex function with a unique minimizer x*. Let (t_n)_{n≥1} be a sequence of positive numbers with Σ_{n≥1} t_n = ∞ and let (x_n)_{n≥0} be a sequence in R^d such that ∥x_n − x*∥ converges to some z ≥ 0 and Σ_{n≥1} t_n (f(x_n) − f(x*)) < ∞. Then z = 0.

Proof. Assume that z > 0, for the sake of contradiction. Since f is convex, it is continuous. And since x* is its unique minimizer, it must hold that

η := inf_{x ∈ R^d: ∥x − x*∥ ≥ z/2} (f(x) − f(x*)) > 0.

Then, for all sufficiently large n, ∥x_n − x*∥ ≥ z/2, hence f(x_n) − f(x*) ≥ η, contradicting the second assumption of the lemma, since Σ_{n≥1} t_n = ∞. □

Lemma 3. Let V be a non-negative, integrable random variable. Then, s P(V > s) → 0 as s → ∞.

Proof. Since V is non-negative and integrable, the Fubini–Tonelli theorem implies that the map s ≥ 0 ↦ P(V > s) is integrable and that E[V] = ∫_0^∞ P(V > s) ds. Moreover, this map is non-increasing, yielding, for all s ≥ 0, that

(s/2) P(V > s) ≤ ∫_{s/2}^s P(V > u) du → 0 as s → ∞,

yielding the result. □

Lemma 4. Let K ⊆ R^p, for some p ≥ 1, and let f, f_1, f_2, ... be functions defined on K with values in R^q, for some q ≥ 1. The following statements are equivalent:

1. sup_{x ∈ K} ∥f_n(x) − f(x)∥ → 0 as n → ∞;
2. For all sequences (x_n)_{n≥1} in K, f_n(x_n) − f(x_n) → 0 as n → ∞.

Proof. It is obvious that the first statement implies the second one, so let us only prove the converse. Assume the second statement is true and suppose, for the sake of contradiction, that sup_{x ∈ K} ∥f_n(x) − f(x)∥ does not go to 0 as n → ∞. Then, there must exist ε > 0 and an increasing map φ: N* → N* such that sup_{x ∈ K} ∥f_{φ(n)}(x) − f(x)∥ ≥ ε for all n ≥ 1. In particular, for each n ≥ 1, there must exist y_n ∈ K satisfying ∥f_{φ(n)}(y_n) − f(y_n)∥ > ε/2. Now, consider any sequence (x_n)_{n≥1} in K such that x_{φ(n)} = y_n for all n ≥ 1. The sequence ∥f_n(x_n) − f(x_n)∥ does not go to zero as n → ∞, as it remains larger than ε/2 along a subsequence. This yields the contradiction we sought. □

Lemma 5. Let K be a subset of C_0((0, ∞), R^d). Then K is precompact with respect to the topology induced by uniform convergence on compact intervals if and only if, for every ε, T with 0 < ε < T, K_{ε,T} := {x|_{[ε,T]} : x ∈ K} is precompact for the supremum norm on C_0([ε, T], R^d).

Proof. Let K be a subset of C_0((0, ∞), R^d) and suppose that for every ε, T such that 0 < ε < T, K_{ε,T} is precompact for the supremum norm on C_0([ε, T], R^d). Let (x_n)_{n≥1} be a sequence in K. Then, there exists a subsequence (x_{σ_1(n)})_{n≥1} that converges uniformly on [1/2, 2]. We then recursively find subsequences (x_{σ_1∘σ_2∘...∘σ_k(n)})_{n≥1} that converge uniformly on [2^{−k}, 2^k], for every k ≥ 1. Define η(n) = σ_1∘σ_2∘...∘σ_n(n) for n ≥ 1. By construction, the diagonal subsequence (x_{η(n)})_{n≥1} converges uniformly on every compact interval of (0, ∞). Hence, K is precompact in C_0((0, ∞), R^d). □

Lemma 6. For every interval I ⊆ R_+, we endow C_0(I, R^d) with the topology induced by uniform convergence on compact intervals of I. Let (Z^n)_{n≥1} be a sequence of stochastic processes in C_0((0, ∞), R^d) and denote by (Z^n|_I)_{n≥1} the sequence of their restrictions to I. Assume that for every ε > 0, the sequence (Z^n|_{[ε,∞)})_{n≥1} is tight in C_0([ε, ∞), R^d). Then, (Z^n)_{n≥1} is tight in C_0((0, ∞), R^d).

Proof. Consider a sequence of random variables (Z^n)_{n≥1} taking values in C_0((0, ∞), R^d) such that for every ε > 0, the sequence (Z^n|_{[ε,∞)})_{n≥1} is tight. Fix ν > 0 and, for every j ≥ 1, let K_j be a compact subset of C_0([1/j, ∞), R^d) such that

sup_n P(Z^n|_{[1/j,∞)} ∈ K_j^∁) < ν/2^j,

and define K̃_j := {z ∈ C_0((0, ∞), R^d) : z|_{[1/j,∞)} ∈ K_j}. We now set K = ∩_{j≥1} K̃_j. Then for any n ≥ 1,

P(Z^n ∈ K^∁) = P(Z^n ∈ ∪_{j≥1} K̃_j^∁) ≤ Σ_{j≥1} P(Z^n|_{[1/j,∞)} ∈ K_j^∁) < ν.  (12)

It remains to check that K is precompact in C_0((0, ∞), R^d). We will use Lemma 5.
Let ε, T be such that 0 < ε < T and consider K_{ε,T} := {x|_{[ε,T]} : x ∈ K}. Take a sequence (x_n|_{[ε,T]})_{n≥1} in K_{ε,T}, with x_n ∈ K for all n ≥ 1, and let j be an integer such that 1/j < ε. Then, for every n ≥ 1, x_n|_{[1/j,∞)} ∈ K_j. Hence, it has a subsequence that converges uniformly on every compact interval, and in particular on [ε, T]. So K_{ε,T} is precompact in C_0([ε, T], R^d), and by Lemma 5, K is precompact in C_0((0, ∞), R^d). Finally, the closure K̄ of K is compact in C_0((0, ∞), R^d), and with equation (12), we obtain that P(Z^n ∈ K̄) ≥ 1 − ν. □

B Proofs

B.1 Proof of Theorem 2

We need to check that for all ε > 0, there exists C > 0 with P(n ∥θ_n − θ*∥² ≥ C) ≤ ε for all large enough integers n. First, fix some r > 0 and α > δ^{−1} such that ∇²Φ(θ) ⪰ α I_d for all θ ∈ B(θ*, r). Such r and α exist thanks to Assumption 2 and by definition of δ. For all integers k, l with k ≤ l, let A_{k:l} be the event where θ_j ∈ B(θ*, r) for all j = k, ..., l. Fix some integers N ≥ 1 and n ≥ N + 1, and let k ∈ {N+1, ..., n}. Using (4) and the fact that A_{N:k} ⊆ A_{N:k−1}, and noting that the event A_{N:k−1} is F_{k−1}-measurable, we have

E[∥θ_k − θ*∥² 1_{A_{N:k}}] ≤ E[∥θ_k − θ*∥² 1_{A_{N:k−1}}]
= E[ E[∥θ_k − θ*∥² | F_{k−1}] 1_{A_{N:k−1}} ]
≤ E[∥θ_{k−1} − θ*∥² 1_{A_{N:k−1}}] − 2 t_k E[(Φ(θ_{k−1}) − Φ(θ*)) 1_{A_{N:k−1}}] + t_k² σ²
≤ E[∥θ_{k−1} − θ*∥² 1_{A_{N:k−1}}] − α t_k E[∥θ_{k−1} − θ*∥² 1_{A_{N:k−1}}] + t_k² σ²
= (1 − α t_k) E[∥θ_{k−1} − θ*∥² 1_{A_{N:k−1}}] + t_k² σ²
= (1 − αδ/k) E[∥θ_{k−1} − θ*∥² 1_{A_{N:k−1}}] + δ²σ²/k².

In the last inequality above, we used that for all θ ∈ B(θ*, r), Φ(θ) ≥ Φ(θ*) + (α/2)∥θ − θ*∥². Hence, setting V_k = E[∥θ_k − θ*∥² 1_{A_{N:k}}], we have obtained that

V_k ≤ ((k − 1 − γ)/k) V_{k−1} + δ²σ²/k²,

where we set γ = αδ − 1 > 0. Using the inequality 1 − γu ≤ (1 − u)^{γ/2} for all u ∈ [0, 1/2], we obtain (applying the inequality to u = 1/(k−1)), for k ≥ N + 1 ≥ 2,

V_k ≤ [(1 − γ/(k−1))/(1 + 1/(k−1))] V_{k−1} + δ²σ²/k²
≤ [(1 − 1/(k−1))^{γ/2}/(1 + 1/(k−1))] V_{k−1} + δ²σ²/k²
= ((k−2)/(k−1))^{γ/2} ((k−1)/k) V_{k−1} + δ²σ²/k²,

and, multiplying both sides by (k−1)^{γ/2} k,

(k−1)^{γ/2} k V_k ≤ (k−2)^{γ/2} (k−1) V_{k−1} + (k−1)^{γ/2} δ²σ²/k ≤ (k−2)^{γ/2} (k−1) V_{k−1} + δ²σ²/(k−1)^{1−γ/2}.

Summing these inequalities for k = N+1, ..., n, we obtain

(n−1)^{γ/2} n V_n ≤ (N−1)^{γ/2} N V_N + K n^{γ/2} δ²σ²

for some positive constant K that only depends on γ. Therefore, for all n ≥ N+1,

n V_n ≤ N^{γ/2} V_N + (n/(n−1))^{γ/2} K δ²σ² ≤ N^{γ/2} V_N + K′ δ²σ²,

with K′ = 2^{γ/2} K. Now, fix C > 0, to be chosen later, and denote by A_N = ∩_{n≥N} A_{N:n}. For n ≥ N+1,

P(n ∥θ_n − θ*∥² ≥ C) ≤ P(n ∥θ_n − θ*∥² ≥ C, A_{N:n}) + P(A_{N:n}^∁)
≤ P(n ∥θ_n − θ*∥² 1_{A_{N:n}} ≥ C) + P(A_N^∁)
≤ n V_n/C + P(A_N^∁)
≤ C^{−1}(N^{γ/2} V_N + K′ δ²σ²) + P(A_N^∁),

where we used Markov's inequality in the third inequality. By Theorem 1, P(A_N^∁) → 0 as N → ∞; hence, one can fix some N guaranteeing that P(A_N^∁) ≤ ε/2. Moreover, one can choose C large enough to ensure that the first term in the right-hand side of the last display is at most ε/2, which completes the proof. □

B.2 Proof of Proposition 1

Fix ε > 0.
By [33, Theorem 7.3.3], if Y is a diffusion process on [ε, ∞) with generator (G_t)_{t≥ε}, then Y must be a solution of the SDE

dY_t = −t^{−1} H Y_t dt + Σ^{1/2} dB_t, ∀t ≥ ε.  (13)

For ε > 0, there exists K_ε such that for any t ∈ [ε, ∞), the function y ↦ −t^{−1} H y is K_ε-Lipschitz on R^d. Hence, [16, Theorem 7.1] ensures that for any random variable y_ε in R^d, equation (13) has a unique solution (up to indistinguishability) on [ε, ∞) started at y_ε. We now determine this solution. Let 0 < µ_1 ≤ ... ≤ µ_d be the eigenvalues of H, which is assumed to be positive definite, and let e_1, ..., e_d be corresponding eigenvectors. Let Y = Σ_{i=1}^d Y^i e_i be a solution of (13) starting from Y_ε = Σ_{i=1}^d Y^i_ε e_i at t = ε. By Itô's lemma, we have, for all i = 1, ..., d and t ≥ ε,

d(Y^i_t t^{µ_i}) = µ_i Y^i_t t^{µ_i−1} dt + t^{µ_i} dY^i_t = t^{µ_i} e_i^⊤ Σ^{1/2} dB_t.

Hence, we have obtained that necessarily,

Y^i_t t^{µ_i} = Y^i_ε ε^{µ_i} + e_i^⊤ ∫_ε^t s^{µ_i} Σ^{1/2} dB_s, ∀t ≥ ε,

and therefore,

Y^i_t = Y^i_ε ε^{µ_i} t^{−µ_i} + t^{−µ_i} e_i^⊤ ∫_ε^t s^{µ_i} Σ^{1/2} dB_s, ∀t ≥ ε.  (14)

One can check that the process Y = Σ_{i=1}^d Y^i e_i with the Y^i's defined in (14) is indeed a solution to (13) on [ε, ∞). Therefore, once the Brownian motion is fixed, this is the unique solution starting from Y_ε at t = ε. Now, let Y be a solution to (13) on the whole interval (0, ∞), satisfying Y_t → 0 in probability as t ↓ 0. Fix some ε > 0. Then, the restriction of Y to [ε, ∞) is a solution to (13) starting from Y_ε at t = ε, so, thanks to the previous argument, its coordinates in the basis (e_1, ..., e_d) must satisfy (14). Letting ε → 0 (recalling that µ_i > 0) then yields that

Y_t = Σ_{i=1}^d t^{−µ_i} (e_i^⊤ ∫_0^t s^{µ_i} Σ^{1/2} dB_s) e_i  (15)

for all t > 0. □

B.3 Proof of Lemma 1

Let p ≥ 1. Let us show that (Y^{n,p})_{n≥1} is relatively compact, i.e., that from any subsequence of (Y^{n,p})_{n≥1} we can extract a further subsequence weakly converging in C_0([1/p, ∞), R^d). The desired result will then follow from Prokhorov's theorem [6, Theorem 5.2]. For simplicity (and to avoid renumbering the terms of the sequence), let us simply show that (Y^{n,p})_{n≥1} has a weakly converging subsequence. First, thanks to Theorem 2, the sequence (Y^{n,p}_{1/p})_{n≥1} is tight – recall that Y^{n,p}_{1/p} is a convex combination of (⌈n/p⌉/√n)(θ_{⌈n/p⌉} − θ*) and ((⌈n/p⌉−1)/√n)(θ_{⌈n/p⌉−1} − θ*), where ⌈·⌉ stands for the upper integer part. Therefore, there exists an increasing map φ: N* → N* and a random vector Z^p in R^d such that Y^{φ(n),p}_{1/p} → Z^p in distribution as n → ∞. By Skorokhod's representation theorem [7, Theorem 6.7], one can assume that the convergence holds with probability 1. Then, by [32, Theorem 11.2.3] (this theorem is stated for time-homogeneous Markov chains, but its proof is easily adapted to the non-homogeneous setup), Proposition 2 yields that (Y^{φ(n),p})_{n≥1} converges in distribution to a stochastic process in C_0([1/p, ∞), R^d) with generator (G_t)_{t≥1/p} given by (5) and starting from Z^p. □

B.4 Proof of Corollary 1

Since the map g ∈ C_0((0, ∞), R^d) ↦ g(1) is continuous with respect to the topology induced by uniform convergence on compact sets, it follows from Theorem 3 that Y^n_1 → Y_1 in distribution as n → ∞, that is, √n (θ_n − θ*) → Y_1 in distribution.
Now, by Proposition 1, Y_1 can be written as

Y_1 = δ ∫_0^1 e^{−log(s)(I_d − δ∇²Φ(θ*))} Γ^{1/2} dB_s,

which has the d-variate, centered, normal distribution with covariance matrix given by

Σ = δ² ∫_0^1 e^{−log(s)(I_d − δ∇²Φ(θ*))} Γ e^{−log(s)(I_d − δ∇²Φ(θ*))} ds
= δ² ∫_0^1 e^{−2 log(s)} e^{δ log(s) ∇²Φ(θ*)} Γ e^{δ log(s) ∇²Φ(θ*)} ds
= δ ∫_0^∞ e^{u/δ} e^{−u ∇²Φ(θ*)} Γ e^{−u ∇²Φ(θ*)} du,

where we used the change of variables u = −δ log(s) in the last line. □

B.5 Proof of Proposition 3

Recall that 0 < λ_1 ≤ ... ≤ λ_d are the ordered eigenvalues of ∇²Φ(θ*) and that we denote by e_1, ..., e_d a collection of associated unit eigenvectors, so that ∇²Φ(θ*) = Σ_{i=1}^d λ_i e_i e_i^⊤. Then, for all t ∈ R, e^{−t∇²Φ(θ*)} = Σ_{i=1}^d e^{−λ_i t} e_i e_i^⊤, so one can write

Σ = δ ∫_0^∞ e^{t/δ} (Σ_{i=1}^d e^{−λ_i t} e_i e_i^⊤) Γ (Σ_{j=1}^d e^{−λ_j t} e_j e_j^⊤) dt
= δ Σ_{1≤i,j≤d} (∫_0^∞ e^{t/δ} e^{−t(λ_i + λ_j)} dt) e_i e_i^⊤ Γ e_j e_j^⊤
= Σ_{1≤i,j≤d} (δ/(λ_i + λ_j − 1/δ)) Γ_{i,j} e_i e_j^⊤,  (16)

where we denote by Γ_{i,j} = e_i^⊤ Γ e_j, for all i, j = 1, ..., d. On the other hand, write

∆ = (Σ_{i=1}^d λ_i^{−1} e_i e_i^⊤) Γ (Σ_{j=1}^d λ_j^{−1} e_j e_j^⊤) = Σ_{1≤i,j≤d} (Γ_{i,j}/(λ_i λ_j)) e_i e_j^⊤.  (17)

Therefore, using (16) and (17), we obtain

Σ − ∆ = Σ_{1≤i,j≤d} (δ/(λ_i + λ_j − 1/δ) − 1/(λ_i λ_j)) Γ_{i,j} e_i e_j^⊤
= Σ_{1≤i,j≤d} ((δλ_i − 1)(δλ_j − 1)/(δλ_i + δλ_j − 1)) (Γ_{i,j}/(λ_i λ_j)) e_i e_j^⊤
=: Σ_{1≤i,j≤d} A_{i,j} (Γ_{i,j}/(λ_i λ_j)) e_i e_j^⊤,  (18)

where the coefficients A_{i,j}, 1 ≤ i, j ≤ d, are defined in the obvious way. Let P be the orthogonal matrix whose columns are e_1, ..., e_d and let A ∈ R^{d×d} be the symmetric matrix whose entries are the A_{i,j}'s. Recalling (17), (18) shows that P^⊤(Σ − ∆)P is the Hadamard product of A and P^⊤∆P. Moreover, A can be written as A = D B D, where D ∈ R^{d×d} is the diagonal matrix with entries δλ_i − 1, i = 1, ..., d, and B ∈ R^{d×d} is the Cauchy matrix with entries B_{i,j} = 1/(δλ_i + δλ_j − 1), i, j = 1, ..., d. Since λ_i > 1/δ for all i = 1, ..., d, by definition of δ, both matrices B and D are positive definite, and so is A. Therefore, the Hadamard product of A and P^⊤∆P – which is a positive semi-definite matrix – is positive semi-definite, yielding that Σ − ∆ is positive semi-definite. Moreover, since A is positive definite, [5, Theorem 1.4.1] yields that

∥P^⊤(Σ − ∆)P∥_op ≤ max_{1≤i≤d} A_{i,i} ∥P^⊤∆P∥_op,

that is,

∥Σ − ∆∥_op ≤ ((δλ_d − 1)²/(2δλ_d − 1)) ∥∆∥_op. □

B.6 Proof of Theorem 4

Fix T > 0. For a centered Gaussian process z taking values in R, we consider the pseudo-metric d_z on [0, T] induced by z, defined by d_z(s, t) = √(E[|z_s − z_t|²]). Dudley's bound [31, Theorem 1.4.2] tells us that the supremum of z is controlled by the entropy number N([0, T], d_z, ε), defined for ε > 0 as the minimal number of open d_z-balls of radius ε required to cover [0, T]. This quantity is not always easy to compute, but if we have another process z′ with d_z ≤ d_{z′} and z′ has an explicit entropy number, then N([0, T], d_z, ε) ≤ N([0, T], d_{z′}, ε) and

E[sup_{t ∈ [0,T]} |z_t|] ≤ 48 ∫_0^∞ √(log N([0, T], d_{z′}, ε)) dε.  (19)
Recalling that for t > 0,

Y_t = Σ_{i=1}^d t^{1−δλ_i} (e_i^⊤ ∫_0^t s^{δλ_i−1} δ Γ^{1/2} dB_s) e_i,

it suffices to bound its coordinates

Y_t(i) := t^{1−δλ_i} e_i^⊤ ∫_0^t s^{δλ_i−1} δ Γ^{1/2} dB_s,

for i = 1, ..., d. Fix i ∈ {1, ..., d} and let us compute the quantity

E[|Y_s(i) − Y_t(i)|²] = E[( s^{1−δλ_i} e_i^⊤ ∫_0^s u^{δλ_i−1} δ Γ^{1/2} dB_u − t^{1−δλ_i} e_i^⊤ ∫_0^t u^{δλ_i−1} δ Γ^{1/2} dB_u )²].  (20)

First, writing (g^i_k)_{k=1}^d = Γ^{1/2} e_i for i = 1, ..., d, we have

E[( e_i^⊤ ∫_0^s u^{δλ_i−1} δ Γ^{1/2} dB_u )²] = δ² ∫_0^s u^{2δλ_i−2} du Σ_{k=1}^d (g^i_k)² = (δ² s^{2δλ_i−1}/(2δλ_i − 1)) ⟨e_i, Γ e_i⟩.

Plugging this into (20) yields

E[|Y_s(i) − Y_t(i)|²] = s^{2−2δλ_i} E[( e_i^⊤ ∫_0^s u^{δλ_i−1} δ Γ^{1/2} dB_u )²] − 2 (st)^{1−δλ_i} E[( e_i^⊤ ∫_0^{min(s,t)} u^{δλ_i−1} δ Γ^{1/2} dB_u )²] + t^{2−2δλ_i} E[( e_i^⊤ ∫_0^t u^{δλ_i−1} δ Γ^{1/2} dB_u )²]
= (δ²/(2δλ_i − 1)) ⟨e_i, Γ e_i⟩ ( s − 2 min(s,t)^{δλ_i} max(s,t)^{1−δλ_i} + t ).  (21)

Define g_i(s, t) := s − 2 min(s,t)^{δλ_i} max(s,t)^{1−δλ_i} + t and fix s ∈ (0, T). Then, for t < s,

∂_t g_i(s, t) = 1 − 2δλ_i (t/s)^{δλ_i−1},  ∂²_t g_i(s, t) = 2δλ_i (1 − δλ_i) s^{1−δλ_i} t^{δλ_i−2}.

Recalling that δλ_i > 1, we obtain that the function t ↦ g_i(s, t) is concave on [0, s]. Moreover, ∂_t g_i(s, s) = 1 − 2δλ_i, hence by concavity we obtain that g_i(s, t) ≤ (2δλ_i − 1)(s − t). Performing the same computations for t > s, we obtain that for every t ∈ (0, T), g_i(s, t) ≤ (2δλ_i − 1)|s − t|. Hence, substituting this bound into (21) gives

E[|Y_s(i) − Y_t(i)|²] ≤ δ² |s − t| ⟨e_i, Γ e_i⟩.  (22)

To conclude, we will need the following lemma.

Lemma 7. Let K > 0 and let d be the metric on [0, T] given by d(s, t) = √(K |s − t|). Then

∫_0^∞ √(log N([0, T], d, ε)) dε ≤ c √(T K)

for some c > 0. Moreover, d is the metric associated with the Brownian motion rescaled by √K.

Proof of Lemma 7. For s = 4ε²/K, we have d(0, s) = √(K s) = 2ε = d(t, t + s) for every t ∈ [0, T − s]. Hence, TK/(4ε²) open d-balls suffice to cover [0, T]. We then compute

∫_0^∞ √(log N([0, T], d, ε)) dε ≤ ∫_0^∞ √(log⌈TK/(4ε²)⌉) dε = (√(TK)/2) ∫_0^∞ (1/2) √(log⌈u⌉) u^{−3/2} du.

Letting c = (1/4) ∫_0^∞ √(log⌈u⌉) u^{−3/2} du, we obtain the lemma. □

Dudley's bound [31, Theorem 1.4.2], together with equation (22) and the result of Lemma 7 applied with K = δ² ⟨e_i, Γ e_i⟩, gives

E[ sup_{t ∈ [0,T]} | t^{1−δλ_i} e_i^⊤ ∫_0^t s^{δλ_i−1} δ Γ^{1/2} dB_s | ] ≤ c δ √(T ⟨e_i, Γ e_i⟩),

where we let c = 24 ∫_0^∞ √(log⌈u⌉) u^{−3/2} du. Now, since (e_i)_{i=1}^d is an orthonormal basis,

E[ sup_{t ∈ [0,T]} ∥Y_t∥ ] ≤ ( Σ_{i=1}^d E[ sup_{t ∈ [0,T]} | t^{1−δλ_i} e_i^⊤ ∫_0^t s^{δλ_i−1} δ Γ^{1/2} dB_s |² ] )^{1/2}.  (23)

It therefore remains to bound the second moments of the suprema of the coordinates Y(i) to obtain the result of Theorem 4. First, for i = 1, ..., d,

E[|Y_t(i)|²] = δ² t^{2−2δλ_i} ∫_0^t s^{2δλ_i−2} ds ⟨e_i, Γ e_i⟩ = (δ² t/(2δλ_i − 1)) ⟨e_i, Γ e_i⟩.

Define x_T := sup_{0 < t ≤ T} |Y_t(i)| and σ² := sup_{0 < t ≤ T} E[|Y_t(i)|²] = (δ² T/(2δλ_i − 1)) ⟨e_i, Γ e_i⟩. The Borel–TIS inequality [1, Theorem 2.1.1] then gives, for all u > 0,

P( x_T > E[x_T] + u ) ≤ e^{−u²/(2σ²)}.

Hence, we can compute

E[x_T²] = 2 ∫_0^∞ u P(x_T > u) du
= 2 ∫_0^∞ (u − E[x_T]) P(x_T − E[x_T] > u − E[x_T]) du + 2 E[x_T] ∫_0^∞ P(x_T > u) du
≤ 2 ∫_0^∞ u e^{−u²/(2σ²)} du + 2 E[x_T]²
≤ 2σ² + 2 E[x_T]²
≤ (2δ² T/(2δλ_i − 1)) ⟨e_i, Γ e_i⟩ + 2 c² δ² T ⟨e_i, Γ e_i⟩.
This result, together with equation (23), gives

E[ sup_{t ∈ [0,T]} ∥Y_t∥ ] ≤ C δ ∥Γ^{1/2}∥_F √T,

where C = (2 + 2c²)^{1/2}, ∥·∥_F denotes the Frobenius norm, and we used the fact that 2δλ_i − 1 > 1 for every i = 1, ..., d. The following lemma shows that this bound is of the order of the expected supremum of a Brownian motion rescaled by δΓ^{1/2}. This is probably a well-known result, but we give a short proof below.

Lemma 8. Let B be a d-dimensional Brownian motion and Σ a covariance matrix of size d. Then there exists a universal constant c such that for all T > 0,

√(2/π) ∥Σ^{1/2}∥_F √T ≤ E[ sup_{0 ≤ t ≤ T} ∥Σ^{1/2} B_t∥ ] ≤ c ∥Σ^{1/2}∥_F √T.

Proof. By rescaling B, we can assume that T = 1 without loss of generality. Let us first prove the lower bound. If X = (X_i)_{i=1}^d is a d-dimensional standard Gaussian vector, then

E[ sup_{0 ≤ t ≤ 1} ∥Σ^{1/2} B_t∥ ] ≥ E[∥Σ^{1/2} X∥].

Now, letting 0 ≤ µ_1 ≤ ... ≤ µ_d be the eigenvalues of Σ^{1/2}, we have

E[∥Σ^{1/2} X∥] = E[( Σ_{i=1}^d µ_i² X_i² )^{1/2}] = ∥Σ^{1/2}∥_F E[( Σ_{i=1}^d p_i X_i² )^{1/2}],

where p_i = µ_i² (Σ_{j=1}^d µ_j²)^{−1} = µ_i² ∥Σ^{1/2}∥_F^{−2}. Hence,

E[∥Σ^{1/2} X∥] ≥ ∥Σ^{1/2}∥_F inf_{p_1,...,p_d ≥ 0, p_1+...+p_d=1} E[( Σ_{i=1}^d p_i X_i² )^{1/2}].

For all p = (p_1, ..., p_d) ∈ (R_+)^d, let M(p) = E[( Σ_{i=1}^d p_i X_i² )^{1/2}]. For all u ∈ R^d with u ≠ 0,

u^⊤ ∇²M(p) u = −(1/4) E[ ( Σ_{i=1}^d u_i X_i² )² ( Σ_{i=1}^d p_i X_i² )^{−3/2} ] < 0,

so M is a strictly concave function. Hence, on the simplex {(p_1, ..., p_d): p_i ≥ 0, i = 1, ..., d, p_1 + ... + p_d = 1}, its minimum is attained at an extreme point, that is, at a point p whose coordinates are all zero except for one equal to 1. For such a p, M(p) = E[|X_1|] = √(2/π), and the lower bound follows. The upper bound is obtained from the result of Lemma 7 together with the Borel–TIS bound (Theorem 2.1.1 in [1]), following the same reasoning as in the last part of the proof of Theorem 4. □
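As a quick Monte Carlo illustration of the lower bound in Lemma 8 (a sketch of ours with arbitrary Σ, horizon and discretization; it only illustrates the inequality on a discretized grid, it does not prove it), one can simulate paths of Σ^{1/2}B on [0, T] and compare the empirical mean of the supremum of the norm with √(2/π) ∥Σ^{1/2}∥_F √T.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, n_steps, n_paths = 3, 2.0, 2000, 500
A = rng.normal(size=(d, d))
Sigma = A @ A.T                        # arbitrary covariance matrix
sqrt_Sigma = np.linalg.cholesky(Sigma) # a square root of Sigma

dt = T / n_steps
dB = rng.normal(scale=np.sqrt(dt), size=(n_paths, n_steps, d))  # Brownian increments
B = np.cumsum(dB, axis=1)                                       # discretized Brownian paths
sup_norm = np.linalg.norm(B @ sqrt_Sigma.T, axis=2).max(axis=1) # sup_t ||Sigma^{1/2} B_t|| per path

lower = np.sqrt(2 / np.pi) * np.linalg.norm(sqrt_Sigma, 'fro') * np.sqrt(T)
print(sup_norm.mean(), ">=", lower)    # the empirical mean should exceed the lower bound
```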