The Dimension Strikes Back with Gradients: Generalization of Gradient Methods in Stochastic Convex Optimization
Authors: Matan Schliserman, Uri Sherman, Tomer Koren
January 23, 2024

Abstract

We study the generalization performance of gradient methods in the fundamental stochastic convex optimization setting, focusing on its dimension dependence. First, for full-batch gradient descent (GD) we give a construction of a learning problem in dimension d = O(n²), where the canonical version of GD (tuned for optimal performance on the empirical risk) trained with n training examples converges, with constant probability, to an approximate empirical risk minimizer with Ω(1) population excess risk. Our bound translates to a lower bound of Ω(√d) on the number of training examples required for standard GD to reach a non-trivial test error, answering an open question raised by Feldman (2016) and Amir, Koren, and Livni (2021b) and showing that a non-trivial dimension dependence is unavoidable. Furthermore, for standard one-pass stochastic gradient descent (SGD), we show that an application of the same construction technique provides a similar Ω(√d) lower bound for the sample complexity of SGD to reach a non-trivial empirical error, despite achieving optimal test performance. This again provides an exponential improvement in the dimension dependence compared to previous work (Koren, Livni, Mansour, and Sherman, 2022), resolving an open question left therein.

1 Introduction

The study of generalization properties of stochastic optimization algorithms has been at the heart of contemporary machine learning research. While more classical frameworks largely focused on properties of the learning problem itself (e.g., Alon et al., 1997; Blumer et al., 1989), in the past decade it has become clear that in modern scenarios the particular algorithm used to learn the model plays a vital role in its generalization performance. As a prominent example, heavily over-parameterized deep neural networks trained by first-order methods output models that generalize well, despite the fact that an arbitrarily chosen Empirical Risk Minimizer (ERM) may perform poorly (Zhang et al., 2017; Neyshabur et al., 2014, 2017). The present paper aims at understanding the generalization behavior of gradient methods, specifically in connection with the problem dimension, in the fundamental Stochastic Convex Optimization (SCO) learning setup; a well-studied theoretical framework widely used to analyze stochastic optimization algorithms.

The seminal work of Shalev-Shwartz et al. (2010) was the first to show that uniform convergence, the canonical condition for generalization in statistical learning (e.g., Vapnik, 1971; Bartlett and Mendelson, 2002), may not hold in high-dimensional SCO: they demonstrated learning problems where there exist certain ERMs that overfit the training data (i.e., exhibit large population risk), while models produced by, e.g., Stochastic Gradient Descent (SGD) or regularized empirical risk minimization generalize well.
The construction presented by Shalev-Shwartz et al. (2010), however, featured a learning problem with dimension exponential in the number of training examples, which only served to prove an Ω(log d) lower bound on the sample complexity for reaching non-trivial population risk, where d is the problem dimension. In a followup work, Feldman (2016) showed how to dramatically improve the dimension dependence and established an Ω(d) sample complexity lower bound, matching (in terms of d) the well-known upper bound obtained from standard covering number arguments (see, e.g., Shalev-Shwartz and Ben-David, 2014).

Despite settling the dimension dependence of uniform convergence in SCO, it remained unclear from Shalev-Shwartz et al. (2010); Feldman (2016) whether the sample complexity lower bounds for uniform convergence actually transfer to natural learning algorithms in this framework, and in particular, to common gradient-based optimization methods. Indeed, it is well known that in SCO there exist simple algorithms, such as SGD, whose output models generalize well with high probability (see, e.g., Shalev-Shwartz and Ben-David, 2014), despite these lower bounds. More technically, the construction of Feldman (2016) relied heavily on the existence of a "peculiar" ERM which does not seem reachable by gradient steps from a data-independent initialization, and it was not at all clear (and in fact, stated as an open problem in Feldman, 2016) how to adapt the construction so as to pertain to ERMs that could be found by gradient methods.

In an attempt to address this issue, Amir et al. (2021b) recently studied the population performance of batch Gradient Descent (GD) in SCO, and demonstrated problem instances where it leads (with constant probability) to an approximate ERM that generalizes poorly, unless the number of training examples is dimension-dependent.¹ Subsequently, Amir et al. (2021a) generalized this result to the broader class of batch first-order algorithms. However, due to technical complications, the constructions in these papers were based in part on the earlier arguments of Shalev-Shwartz et al. (2010) rather than the developments of Feldman (2016), and therefore fell short of establishing their results in dimension polynomial in the number of training examples. As a consequence, their results are unable to rule out a sample complexity upper bound for GD that depends only (poly-)logarithmically on the problem dimension.

¹ Here we refer to GD as performing T = n iterations with stepsize η = Θ(1/√n), where n denotes the size of the training set, but our results hold more generally; see below for a more detailed discussion of the various regimes.

In this work, we resolve the open questions posed in both Feldman (2016) and Amir et al. (2021b): our first main result demonstrates a convex learning problem where GD, unless trained with at least Ω(√d) training examples, outputs a bad ERM with constant probability. This bridges the gap between the results of Feldman (2016) and actual, concrete learning algorithms (albeit with a slightly weaker rate of Ω(√d), compared to the Ω(d) of the latter paper) and greatly improves on the previous Ω(log d) lower bound of Amir et al. (2021b), establishing that the sample complexity of batch GD in SCO has a significant, polynomial dependence on the problem dimension.
Furthermore, in our second main result we show how an application of the same construction technique provides a similar improvement in the dimension dependence of the empirical risk lower bound presented in the recent work of Koren et al. (2022), thus also resolving the open question left in their work. That work demonstrated that in SCO, well-tuned SGD may underfit the training data despite achieving optimal population risk performance. At a deeper level, the overfitting of GD and the underfitting of SGD both stem from a combination of two conditions: lack of algorithmic stability, and failure of uniform convergence; as it turns out, this combination allows the output models to exhibit a large generalization gap, defined as the absolute value of the difference between the empirical and population risks. Our work presents a construction technique for such generalization gap lower bounds that achieves small polynomial dimension dependence, providing an exponential improvement over previous works.

1.1 Our contributions

In some more detail, our main contributions are as follows:

(i) We present a construction of a learning problem in dimension d = O(nT + n² + η²T²) where running GD for T iterations with step size η over a training set of n i.i.d. examples leads, with constant probability, to a solution with population error Ω(η√T + 1/(ηT)).² In particular, for the canonical configuration T = n and η = Θ(1/√n), the lower bound becomes Ω(1) and demonstrates that GD suffers from catastrophic overfitting already in dimension d = O(n²). Put differently, this translates to an Ω̃(√d) lower bound on the number of training examples required for GD to reach a non-trivial test error. See Theorem 1 below for a formal statement and further implications of this result.

² By population error (or test error) we mean the population excess risk, namely the gap in population risk between the returned solution and the optimal solution.

(ii) Furthermore, we give a construction of dimension d = Õ(n²) where the empirical error of one-pass SGD trained over T = n training examples is Ω(η√n + 1/(ηn)). Assuming the standard setting of η = Θ(1/√n), chosen for optimal test performance, the empirical error lower bound becomes Ω(1), showing that the "benign underfitting" phenomenon of one-pass SGD is exhibited already in dimension polynomial in the number of training samples. Rephrasing this lower bound in terms of the number of training examples required to reach non-trivial empirical risk, we again obtain an Ω̃(√d) sample complexity lower bound. See Theorem 2 for the formal statement and further implications.

Both of the results above are tight (up to logarithmic factors) in view of existing matching upper bounds of Bassily et al. (2020). We remark that the constructions leading to these results feature differentiable, Lipschitz and convex loss functions, whereas the lower bounds in previous works concerned with gradient methods (Amir et al., 2021b,a; Koren et al., 2022) crucially applied only to the class of non-differentiable loss functions. From the perspective of general non-smooth convex optimization, this feature implies that our lower bounds remain valid under any choice of a subgradient oracle for the loss function (as opposed to only claiming that there exists a subgradient oracle under which they apply, as prior results do).
1.2 Main ideas and techniques

Our work builds primarily on two basic ideas. The first is due to Feldman (2016), whereby an exponential number (in n) of approximately orthogonal directions, representing the potential candidates for a "bad ERM," are embedded in a Θ(n)-dimensional space. The second idea, underlying Bassily et al. (2020); Amir et al. (2021b,a); Koren et al. (2022), is to augment the loss function with a highly non-smooth component that is capable of generating large (sub-)gradients around the initialization, directed at all candidate directions, which could steer GD towards a bad ERM that overfits the training set.

The major challenge is in making these two components play in tandem: since the candidate directions of Feldman (2016) are only nearly orthogonal, the progress of GD towards one specific direction gets hampered by its movement in other, irrelevant directions. And indeed, previous work in this context fell short of resolving this incompatibility and instead opted for a simpler construction with a perfectly orthogonal set of candidate directions, which was used in the earlier work of Shalev-Shwartz et al. (2010). Unfortunately, this latter construction requires the ambient dimensionality to be exponential in the number of samples n, which is precisely what we aim to avoid.

Our solution for overcoming this obstacle, which we describe at length in Section 3, is based on several novel ideas. Firstly, we employ multiple copies of the original construction of Feldman (2016) in orthogonal subspaces, in such a way that it suffices for GD to make a single step within each copy so as to reach, across all copies, a bad ERM solution; this serves to circumvent the "collisions" between consecutive GD steps alluded to above. Secondly, we carefully design a convex loss term that, when added to the loss function, forces successive gradient steps to be taken in a round-robin fashion between the different copies, so that each subspace indeed sees a single update step throughout the GD execution. Lastly, we introduce a novel technique that memorizes the full training set by "encoding" it into the iterates in a convex and differentiable manner, so that the GD iterate itself (to which the subgradient oracle has access) contains the information required to "decode" the right movement direction towards a bad ERM. We further show how all of these added loss components can be made differentiable, so as to allow for a differentiable construction overall. A detailed overview of these construction techniques and a virtually complete description of our construction are provided in Section 3.

1.3 Additional related work

Learnability and generalization in the SCO model. Our work belongs to the body of literature on stability and generalization in modern statistical learning theory, pioneered by Shalev-Shwartz et al. (2010) and the earlier foundational work of Bousquet and Elisseeff (2002).
In this line of research, Hardt et al. (2016); Bassily et al. (2020) study algorithmic stability of SGD and GD in the smooth and non-smooth (convex) cases, respectively. In the general non-smooth case which we study here, Bassily et al. (2020) gave an upper bound of O(η√T + 1/(ηT) + ηT/n) on the test error of T iterations with step size η over a training set of size n. The more recent work of Amir et al. (2021b) showed this to be tight up to log-factors in the dimension-independent regime, and Amir et al. (2021a) further extended this result to any optimization algorithm making use of only batch gradients (i.e., gradients of the empirical risk). Even more recently, Kale et al. (2021) considered (multi-pass) SGD and GD in a more general SCO model where individual losses may be non-convex (but still convex on average), and proved a sample complexity lower bound for GD showing it learns at a suboptimal rate with any step size and any number of iterations.

Sample complexity of ERMs. In relation to the sample complexity of an (arbitrary) ERM in SCO, Feldman (2016) showed that reaching ε test error requires Ω(d/ε + 1/ε²) training samples, but did not establish optimality of this bound. In a recent work, Carmon et al. (2023) show this to be nearly tight and present an Õ(d/ε + 1/ε²) upper bound for any ERM, improving over the O(d/ε²) upper bound that can be derived from standard covering number arguments. Another recent work related to ours is that of Magen and Shamir (2023), who provided another example of a setting in which learnability can be achieved without uniform convergence, showing that uniform convergence may not hold in the class of vector-valued linear (multi-class) predictors. However, the dimension of their problem instance was exponential in the number of training examples.

Implicit regularization and benign overfitting. Another relevant body of research focuses on understanding the effective generalization of over-parameterized models trained to achieve zero training error through gradient methods (see, e.g., Bartlett et al., 2020, 2021; Belkin, 2021). This phenomenon appears to challenge conventional statistical wisdom, which emphasizes the importance of balancing data fit and model complexity, and it motivated the study of implicit regularization (or bias) as a notion for explaining generalization in over-parameterized regimes. Our findings in this paper could be viewed as an indication that, at least in SCO, generalization does not stem from some form of an implicit bias or regularization; see Amir et al. (2021b); Koren et al. (2022) for a more detailed discussion.

2 Problem setup and main results

We consider the standard setting of Stochastic Convex Optimization (SCO). The problem is characterized by a population distribution D over an instance set Z, and a loss function f : W × Z → R defined over a convex domain W ⊆ R^d in d-dimensional Euclidean space. We assume that, for any fixed instance z ∈ Z, the function f(w, z) is both convex and L-Lipschitz with respect to its first argument w.
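For concreteness, a standard example of a problem fitting this template (our own illustration, not one of the constructions studied in this paper) is linear prediction with the absolute loss over the unit ball: take W = B_d, instances z = (x, y) with ‖x‖ ≤ 1 and y ∈ [−1, 1], and

\[ f\big(w, (x, y)\big) = \big|\langle w, x\rangle - y\big|, \]

which is convex and 1-Lipschitz in w for every fixed instance.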
In this setting, the learner is interested in minimizing the population loss (or risk), which corresponds to the expected value of the loss function over D, defined as

\[ F(w) = \mathbb{E}_{z\sim \mathcal D}\big[ f(w, z) \big], \qquad \text{(population risk/loss)} \]

namely, finding a model w ∈ W that achieves an ε-optimal population loss, i.e., such that F(w) ≤ F(w*) + ε, where w* ∈ argmin_{w∈W} F(w) is a population minimizer. To find such a model w, the learner uses a set of n training examples S = {z_1, . . . , z_n}, drawn i.i.d. from the unknown distribution D. Given the sample S, the corresponding empirical loss (or risk), denoted F̂(w), is defined as the average loss over the samples in S:

\[ \widehat F(w) = \frac{1}{n}\sum_{i=1}^{n} f(w, z_i). \qquad \text{(empirical risk/loss)} \]

We let ŵ* ∈ argmin_{w∈W} F̂(w) denote a minimizer of the empirical risk, referred to as an empirical risk minimizer (ERM). Moreover, for every w ∈ W, we define the generalization gap at w as the absolute value of the difference between the population loss and the empirical loss, i.e., |F(w) − F̂(w)|.

Optimization algorithms. We consider several canonical first-order optimization algorithms in the context of SCO. First-order algorithms make use of a (deterministic) subgradient oracle that takes as input a pair (w, z) and returns a subgradient g(w, z) ∈ ∂_w f(w, z) of the convex loss function f(w, z) with respect to w. If |∂_w f(w, z)| = 1, the loss f(·, z) is differentiable at w and the subgradient oracle simply returns the gradient at w; otherwise, the subgradient oracle is allowed to emit any subgradient in the subdifferential set ∂_w f(w, z).

First, we consider standard gradient descent (GD) with a fixed step size η > 0 applied to the empirical risk F̂. We allow for a potentially projected, m-suffix averaged version of the algorithm, which takes the following form:

\[ \text{initialize } w_1 \in W; \qquad w_{t+1} = \Pi_W\Big[ w_t - \frac{\eta}{n}\sum_{i=1}^{n} g(w_t, z_i) \Big], \ \ \forall\, 1 \le t < T; \qquad \text{return } \overline{w}_{T,m} := \frac{1}{m}\sum_{i=1}^{m} w_{T-i+1}. \tag{1} \]

Here Π_W : R^d → W denotes the Euclidean projection onto the set W; when W is the entire space R^d, this becomes simply unprojected GD. The algorithm returns either the final iterate, the average of the iterates, or more generally, any m-suffix average (1 ≤ m ≤ T) of the iterates.

The second method that we analyze is Stochastic Gradient Descent (SGD), which is again potentially projected and/or suffix averaged. This method uses a fixed stepsize η > 0 and takes the following form:

\[ \text{initialize } w_1 \in W; \qquad w_{t+1} = \Pi_W\big[ w_t - \eta\, g(w_t, z_t) \big], \ \ \forall\, 1 \le t < T; \qquad \text{return } \overline{w}_{T,m} := \frac{1}{m}\sum_{i=1}^{m} w_{T-i+1}. \tag{2} \]
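To make the two procedures concrete, here is a minimal NumPy sketch of Eqs. (1) and (2); this is our own illustration, and the function names, the optional projection argument, and the use of a Python callable `g(w, z)` for the subgradient oracle are implementation choices rather than notation from the paper.

```python
import numpy as np

def project_ball(w, radius=1.0):
    # Euclidean projection onto the ball of the given radius (identity inside the ball).
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

def gd(g, sample, d, eta, T, m=1, project=None):
    """Full-batch GD, Eq. (1): each step uses the subgradient averaged over the sample."""
    w = np.zeros(d)                      # w_1 = 0, as in Theorems 1 and 2
    iterates = [w]
    for _ in range(T - 1):
        grad = np.mean([g(w, z) for z in sample], axis=0)
        w = w - eta * grad
        if project is not None:
            w = project(w)
        iterates.append(w)
    return np.mean(iterates[-m:], axis=0)  # m-suffix average of the last m iterates

def sgd(g, sample, d, eta, m=1, project=None):
    """One-pass SGD, Eq. (2): the t-th step uses the subgradient at the t-th example only."""
    w = np.zeros(d)
    iterates = [w]
    for z in sample[:-1]:                # T = len(sample) iterates, hence T - 1 updates
        w = w - eta * g(w, z)
        if project is not None:
            w = project(w)
        iterates.append(w)
    return np.mean(iterates[-m:], axis=0)
```

Passing `project=project_ball` corresponds to the projected variants with W = B_d; omitting it gives the unprojected variants with W = R^d.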
Main results. Our main contributions in the context of SCO are tight lower bounds for the population loss of GD and for the empirical loss of SGD, where the problem dimension is polynomial in the number of samples n and the number of steps T. First, for the population risk performance of GD, we prove the following.

Theorem 1. Fix n > 0, T > 3200², and 0 ≤ η ≤ 1/(5√T), and let d = 178nT + 2n² + max{1, 25η²T²}. There exists a distribution D over an instance set Z and a convex, differentiable and 1-Lipschitz loss function f : R^d × Z → R such that for GD (either projected or unprojected; cf. Eq. (1) with W = B_d or W = R^d, respectively) initialized at w_1 = 0 with step size η, for all m = 1, . . . , T, the m-suffix averaged iterate satisfies, with probability at least 1/6 over the choice of the training sample,

\[ F(\overline{w}_{T,m}) - F(w^\ast) = \Omega\Big( \min\Big\{ \eta\sqrt{T} + \frac{1}{\eta T},\; 1 \Big\} \Big). \tag{3} \]

For SGD, we prove the following theorem concerning its convergence on the empirical risk.

Theorem 2. Fix n > 2048 and 0 ≤ η ≤ 1/(5√n), and let d = 712n log n + 2n² + max{1, 25η²n²}. There exists a distribution D over an instance set Z and a convex, 1-Lipschitz and differentiable loss function f : R^d × Z → R such that for one-pass SGD (either projected or unprojected; cf. Eq. (2) with W = B_d or W = R^d, respectively) over T = n steps initialized at w_1 = 0 with step size η, for all m = 1, . . . , T, the m-suffix averaged iterate satisfies, with probability at least 1/2 over the choice of the training sample,

\[ \widehat F(\overline{w}_{T,m}) - \widehat F(\widehat w^\ast) = \Omega\Big( \min\Big\{ \eta\sqrt{T} + \frac{1}{\eta T},\; 1 \Big\} \Big). \tag{4} \]

Discussion. As noted in the introduction, both of the bounds above are tight up to logarithmic factors in view of matching upper bounds due to Bassily et al. (2020). For GD tuned for optimal convergence on the empirical risk, where T = n and η = Θ(1/√n), Theorem 1 gives an Ω(1) lower bound for the population error, which precludes any sample complexity upper bound for this algorithm of the form O(d^p/ε^q) unless p ≥ 1/2. In particular, this implies an Ω(√d) lower bound on the number of training examples required for GD to reach a non-trivial population risk. In contrast, lower bounds in previous work (Amir et al., 2021b) only imply an exponentially weaker Ω(log d) dimension dependence in the sample complexity. We note, however, that there is still a small polynomial gap between our sample complexity lower bounds and the known (nearly tight) bounds for generic ERMs (Feldman, 2016; Carmon et al., 2023); we leave narrowing this gap as an open problem for future investigation.

More generally, with GD fixed to perform T = n^α (α > 0) steps, and setting η so as to optimize the lower bound, the right-hand side of Eq. (3) becomes Θ(n^{−α/4}), which rules out any sample complexity upper bound of the form O(d^p/ε^q) unless it satisfies max{2, α+1}·p + (1/4)·α·q ≥ 1.³ Specifically, we see that any dimension-free upper bound with T = n must have at least a 1/ε⁴ dependence on ε; and that to match the statistically optimal sample complexity rate of 1/ε², one must either run GD for T = n² steps or suffer a polynomial dimension dependence in the rate (e.g., for T = n this dependence is at least d^{1/4}).

³ To see this, let r = max{2, α+1} and note that for our construction d = O(nT + n²) = O(n^r) and ε = Ω(n^{−α/4}); the sample complexity upper bound O(d^p/ε^q) can therefore be rewritten in terms of n as O(n^{rp + αq/4}), and since this should asymptotically upper bound the number of samples n, one must have rp + (1/4)αq ≥ 1.

Similar lower bounds (up to a logarithmic factor) are obtained for SGD through Theorem 2, but for the empirical risk of the algorithm when tuned for optimal performance on the population risk with T = n. In this case, the bounds provide an exponential improvement in the dimension dependence over the recent results of Koren et al. (2022), showing that the "benign underfitting" phenomenon they revealed for one-pass SGD is exhibited already in dimension polynomial in the number of training samples.

Finally, we remark that our restriction on η is only meant to place focus on the more common and interesting range of stepsizes in the context of stochastic optimization. It is not hard to extend the results of Theorems 1 and 2 to larger values of η (in which case the lower bounds are Ω(1), the same rate the theorems give for η = Θ(1/√T)), in the same way this is done in previous work (e.g., Amir et al., 2021b; Koren et al., 2022).
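To make the instantiations in the discussion above explicit, here are the direct plug-ins of the constraint from the footnote (a worked check, not a new result):

\[
\begin{aligned}
\alpha = 1\ (T = n):\quad & 2p + \tfrac{q}{4} \ge 1, \ \text{ so } p = 0 \text{ forces } q \ge 4, \ \text{ while } q = 2 \text{ forces } p \ge \tfrac14;\\
\alpha = 2\ (T = n^2):\quad & 3p + \tfrac{q}{2} \ge 1, \ \text{ which is already satisfied by } p = 0,\ q = 2.
\end{aligned}
\]

These match the three claims above: a dimension-free bound with T = n requires a 1/ε⁴ rate, the rate 1/ε² with T = n forces at least a d^{1/4} factor, and T = n² steps leave room for a dimension-free 1/ε² bound.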
3 Overview of constructions and proof ideas

In this section we outline the main ideas leading to our main results and give an overview of the lower bound constructions. As discussed above, the main technical contribution of this paper is in establishing the first Ω(η√T) term in Eqs. (3) and (4) using a loss function in dimension polynomial in n and T, and this is also the focus of our presentation in this section. In Sections 3.1 to 3.4 we focus on GD and describe the main ideas and technical steps towards proving our first main result; in Section 3.6 we survey the additional steps and adjustments needed to obtain our second main result concerning SGD.

Starting with GD and Theorem 1, recall that our goal is to establish a learning scenario where GD is likely to converge to a "bad ERM", namely a minimizer of the empirical risk whose population loss is large. We will do that in four steps: we will first establish that such a "bad ERM" actually exists; then, we will show how to make such a solution reachable by gradient steps from the origin; we next describe how the information required to identify this solution can be "memorized" by GD into its iterates; and finally, we show how to combine these components and actually drive GD towards a bad ERM.

3.1 A preliminary: existence of bad ERMs

Our starting point is the work of Feldman (2016), which demonstrated that in SCO an empirical risk minimizer might fail to generalize, already in dimension linear in the number of training samples. More concretely, they showed that for any sample size n, there exists a distribution D over convex loss functions in dimension d = Θ(n) such that, with constant probability, there exists a "bad ERM": one that overfits the training sample and admits a large generalization gap. Their approach was based on a construction of a set of unit vectors of size 2^{Ω(n)}, denoted U, that are "nearly orthogonal": the dot product between any two distinct u, v ∈ U satisfies |⟨u, v⟩| ≤ 1/8.⁴

⁴ The original construction by Feldman (2016) satisfied slightly different conditions, which we adjust here for our analysis.

Then, they take the power set Z = P(U) of U as the sample space (namely, identifying samples with subsets of U), the distribution D to be uniform over Z, and the (convex, Lipschitz) loss h_F16 : R^{Θ(n)} × Z → R to be defined as follows:

\[ h_{\mathrm{F16}}(w, V) = \max\Big\{ \tfrac12,\; \max_{u \in V} \langle u, w\rangle \Big\}. \tag{5} \]

For this problem instance, they show that with constant probability over the choice of a sample S = {V_1, . . . , V_n} drawn i.i.d. from D^n with n = O(d), at least one of the vectors in U, say u_0 ∈ U, will not be observed in any of the sets V_1, . . . , V_n; namely, u_0 ∈ U \ ∪_{i=1}^n V_i. Finally, they prove that such a vector u_0 is in fact an Ω(1)-bad ERM (one for which the generalization gap is Ω(1)).

To see why this is the case, note that, since every vector u ∈ U is included in each training example V_i with probability 1/2, the set U (whose size is exponential in n) is large enough to guarantee the existence of a vector u_0 ∉ ∪_{i=1}^n V_i with constant probability. Consequently, the empirical loss of such a u_0 equals 1/2 (since ⟨u_0, v⟩ ≤ 1/8 for any v ∈ V_i, and therefore h_F16(u_0, V_i) = 1/2 for all i). However, for a fresh example V ∼ D, with probability 1/2 it holds that u_0 ∈ V, in which case h_F16(u_0, V) = 1; the population risk of u_0 is therefore at least (1/2)·1 + (1/2)·(1/2) = 3/4, and the generalization gap is at least 1/4.
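The following toy script reproduces this phenomenon end to end; it is our own numerical illustration, and the random-sign construction of U, the particular sizes, and the Monte-Carlo estimate of the population risk are choices made here for the demonstration rather than details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8                                    # training-set size (toy)
num_dirs, d = 1024, 2048                 # |U| and the ambient dimension (toy values)

# Random sign vectors scaled to unit norm; pairwise inner products are small w.h.p.
U = rng.choice([-1.0, 1.0], size=(num_dirs, d)) / np.sqrt(d)

def h_f16(w, V_idx):
    # Eq. (5): max{1/2, max_{u in V} <u, w>}; V is given by row indices into U.
    return max(0.5, float(np.max(U[V_idx] @ w))) if len(V_idx) else 0.5

# Each training example V_i contains every u in U independently with probability 1/2.
sample = [np.flatnonzero(rng.random(num_dirs) < 0.5) for _ in range(n)]

observed = np.zeros(num_dirs, dtype=bool)
for V_idx in sample:
    observed[V_idx] = True
missed = np.flatnonzero(~observed)       # candidate indices for a "bad ERM" direction

if len(missed) == 0:
    print("every direction was observed; re-run with a larger |U|")
else:
    u0 = U[missed[0]]
    emp = np.mean([h_f16(u0, V_idx) for V_idx in sample])            # equals 1/2
    pop = np.mean([h_f16(u0, np.flatnonzero(rng.random(num_dirs) < 0.5))
                   for _ in range(200)])                             # approx. 3/4
    print(f"empirical risk of u0: {emp:.3f}; population risk estimate: {pop:.3f}")
```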
3.2 Ensuring that bad ERMs are reachable by GD

As Feldman (2016) explains, although there exists an ERM with a large generalization gap, it is not guaranteed that such a minimizer is at all reachable by gradient methods within a reasonable (say, polynomial in n) number of steps. This is because in their construction, the loss function h_F16 remains flat (and equals 1/2; see Eq. (5)) inside a ball of radius Ω(1) around the origin, where GD is initialized; within this ball, all models are essentially "good ERMs" that admit zero generalization gap. It is unclear how to steer GD, with stepsize of order η = O(1/√T) over T steps, away from this flat region of the loss towards a bad ERM such as the u_0 identified above.

To address this challenge, we modify the construction of Feldman (2016) in a fundamental way. The key idea is to increase the dimensionality and replicate Feldman's construction in T orthogonal subspaces; this allows us to decrease, in each of the subspaces, the distance to a bad ERM to only O(η), rather than Θ(1) as before. Then, while each of these subspace ERMs is only Ω(η)-bad, taken together they constitute an Ω(1)-bad ERM in the lifted space whose distance from the origin is roughly η√T = Θ(1), yet which is still reachable by T steps of GD. More concretely, we introduce a loss function h : R^{d′} × P(U) → R (for d′ = Θ(n)) that resembles Feldman's function from Eq. (5) up to a minor adjustment:

\[ h(w', V) = \max\Big\{ \tfrac{3}{32}\eta,\; \max_{u \in V} \langle u, w'\rangle \Big\}. \tag{6} \]

As in the original construction by Feldman, V here ranges over subsets of a set U ⊆ R^{d′} of size 2^{Ω(d′)}, the elements of which are nearly orthogonal unit vectors. Then, we construct a loss function in dimension d = Td′ by applying h in T orthogonal subspaces of dimension d′, denoted W^(1), . . . , W^(T), as follows:⁵

\[ \ell_1(w, V) = \sqrt{ \sum_{k=2}^{T} h\big(w^{(k)}, V\big)^2 }. \tag{7} \]

⁵ The summation starts at k = 2 for technical reasons that will become apparent later in this proof sketch.

Here and throughout, w^(k) refers to the k'th orthogonal component of the vector w, which resides in the subspace W^(k). Finally, the distribution D is again taken to be uniform over Z = P(U), and a training set is formed by sampling S = {V_1, . . . , V_n} ∼ D^n. As before, we know that with probability at least 1/2 there exists a vector u_0 such that u_0 ∈ U \ ∪_{i=1}^n V_i.

With this setup, it can be shown that ℓ_1 is indeed convex and O(1)-Lipschitz, and further, that any vector w satisfying w^(k) = cη u_0, for a sufficiently large constant c > 0 and Ω(T)-many components k, is an Ω(1)-bad ERM with respect to ℓ_1. The important point is that, unlike in Feldman's original construction, such bad ERMs are potentially reachable by GD: it suffices to guide the algorithm to make a single, small step (with stepsize η) towards u_0 in each subspace W^(k).
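A rough back-of-envelope for why such a vector is an Ω(1)-bad ERM (a heuristic version of the formal argument in Section 4, stated under the event that u_0 is unobserved): on every training set V_i, near-orthogonality gives ⟨u, cη u_0⟩ ≤ cη/8 for all u ∈ V_i, so each summand in Eq. (7) is at most max{3/32, c/8}·η, whereas for a fresh V ∼ D the event u_0 ∈ V has probability 1/2, and on that event each of the Ω(T) relevant summands is at least cη. Hence, roughly,

\[ \frac{1}{n}\sum_{i=1}^{n}\ell_1(w, V_i) \;\le\; \max\Big\{\tfrac{3}{32}, \tfrac{c}{8}\Big\}\,\eta\sqrt{T}, \qquad \mathbb{E}_{V\sim\mathcal D}\big[\ell_1(w, V)\big] \;=\; \Omega\big(c\,\eta\sqrt{T}\big), \]

so for a sufficiently large constant c and η = Θ(1/√T), the generalization gap at w is Ω(η√T) = Ω(1).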
3.3 Memorizing the dataset in the iterate

There is one notable obstacle to the plan we just described: the vector u_0 is determined in a rather complex way by the full description of the training set, and it is unclear how to reproduce such a vector through subgradients of the loss function. Indeed, recall that the only access GD has to the training set is through subgradients of individual functions g(w, V_1), . . . , g(w, V_n) (and linear combinations thereof), and none of these has direct access to the full training set that would allow determining a vector u_0 ∈ U \ ∪_{i=1}^n V_i.

To circumvent this difficulty, another key aspect of our construction involves a mechanism that effectively memorizes the full training set in the iterate w itself, using the first few steps of GD. For this memorization, we further increase the dimension of the domain W and create an "encoding subspace," denoted W^(0) (the corresponding component of a vector w ∈ W is denoted w^(0)), which is orthogonal to W^(1), . . . , W^(T). In this subspace, each step taken with respect to (a linear combination of) individual gradients g(w_t, V_i) encodes the information about the sets V_1, . . . , V_n into the iterate w_t. Then, since the subgradient oracle receives w_t as input, it can reconstruct the training set encoded in w_t^(0) and recover u_0 in every subsequent step.

On its own, the task of memorizing the training set is not particularly challenging and can be addressed in a rather straightforward manner.⁶ What turns out to be more challenging is to design the encoding in such a way that u_0 is realized as the unique subgradient (i.e., the gradient) of the loss function. This is crucial for establishing that our lower bound is valid for any subgradient oracle, and not only for an adversarially chosen one (as well as for making the construction differentiable; we discuss this later on, in Section 3.5).

⁶ One simple approach is to utilize a one-dimensional encoding space and encode every individual training example V ∈ P(U) in the least significant bits of w_t^(0), in a way that guarantees there are no collisions between different possible values of such sets. This allows the subgradient oracle to compute the specific u_0 from the training set encoded in the least significant bits of w_t^(0) and return it as a subgradient.

Let us describe an encoding mechanism where u_0 acts as the unique subgradient at w_1 = 0. We employ an encoding subspace W^(0) of dimension Θ(n²), and augment samples with a number j ∈ [n²] drawn uniformly at random; namely, each sample in the training set is now a pair (V_i, j_i) ∈ P(U) × [n²], for i = 1, . . . , n. We then create an encoding function φ : P(U) × [n²] → W^(0) such that φ(V, j) maps the set V into the j'th (2-dimensional) subspace of the encoding space. The role of j is to ensure that, with constant probability, sets in the training sample are mapped to distinct subspaces of the encoding space W^(0), and thus can be uniquely inferred given the encoding. To implement the encoding within the optimization process, we introduce the following term into the loss:

\[ \ell_2\big(w, (V, j)\big) := \big\langle -\phi(V, j),\; w^{(0)} \big\rangle. \tag{8} \]

Following a single step of GD, the iterate becomes w_2^(0) = (η/n) Σ_{i=1}^n φ(V_i, j_i), and by the properties of the encoding φ it is then possible, with constant probability, to fully recover the sets V_1, . . . , V_n of the training set given the iterate w_2.

Next, we introduce an additional term into the loss function whose role is to "decode" the training set V_1, . . . , V_n from w_t and produce a vector u_0 ∈ U \ ∪_{i=1}^n V_i as a subgradient. For this, we represent every potential training set by a vector ψ ∈ Ψ ⊆ B_{2n²}, and define a mapping α : R^{2n²} → U that, for every ψ ∈ Ψ, provides a vector α(ψ) ∈ U that does not appear in any of the sets V_i of the training sample associated with ψ (if such a vector exists). Finally, we add the following term to the loss function:

\[ \ell_3(w) := \max\Big\{ \delta_1,\; \max_{\psi \in \Psi} \big\{ \langle \psi, w^{(0)}\rangle - \beta\,\langle \alpha(\psi), w^{(1)}\rangle \big\} \Big\}, \tag{9} \]

where β, δ_1 > 0 are small predefined constants. We can show that, assuming the training set was encoded into the iterate in the first step (so that w_2^(0) = (η/n) Σ_{i=1}^n φ(V_i, j_i)), for a suitable choice of the encoder (φ) and decoder (Ψ and α), in the following iteration the vector ψ* ∈ Ψ representing the actual training set V_1, . . . , V_n is realized as the unique maximizer in Eq. (9), which in turn triggers a gradient step along u_0 := α(ψ*) in the subspace W^(1).
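The following toy script illustrates the memorization idea of footnote 6 in isolation; this is our own sketch using exact rational arithmetic and a single scalar "coordinate", whereas the construction in the paper uses a Θ(n²)-dimensional encoding subspace, a convex loss term, and the random indices j_i to avoid collisions.

```python
from fractions import Fraction

U_SIZE = 16          # |U| in this toy example
BASE = 2 ** U_SIZE   # each set V subseteq U fits in one base-BASE digit (a bitmask)

def encode(sample):
    # sample: list of sets of indices into U; pack them into the fractional digits
    # of one rational number in [0, 1), mimicking a "least significant bits" encoding.
    code = Fraction(0)
    for i, V in enumerate(sample, start=1):
        bitmask = sum(1 << u for u in V)
        code += Fraction(bitmask, BASE ** i)
    return code

def decode(code, n):
    # Recover the n bitmasks digit by digit, then the sets themselves.
    sample = []
    for _ in range(n):
        code *= BASE
        bitmask = int(code)
        code -= bitmask
        sample.append({u for u in range(U_SIZE) if bitmask & (1 << u)})
    return sample

def missing_direction(sample):
    # The decoder's job in the construction: return some u_0 unobserved in every V_i.
    seen = set().union(*sample) if sample else set()
    unseen = [u for u in range(U_SIZE) if u not in seen]
    return unseen[0] if unseen else None

sample = [{0, 3, 5}, {1, 3}, {2, 5, 7}]
code = encode(sample)
assert decode(code, len(sample)) == sample
print("u_0 index:", missing_direction(decode(code, len(sample))))
```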
3.4 Making GD converge to a bad ERM

Our final task is to actually make GD converge to a "bad ERM," namely to a model w such that w^(k) = cη u_0, for a sufficiently large constant c > 0 and Ω(T)-many values of k, assuming the process was successfully initialized at a vector w with w^(1) = c_1 u_0 (and w^(k) = 0 for k > 1), as we just detailed. We accomplish this by forcing GD to make a single step towards u_0 in Ω(T) of the subspaces W^(1), . . . , W^(T).

To this end, we employ a variation of a technique used in previous lower bound constructions (Bassily et al., 2020; Amir et al., 2021b; Koren et al., 2022) to induce gradient instability around the origin. In those prior instances, however, the potential directions of progress (analogous to the vectors in our set U) were perfectly orthogonal, and thus the dimension of the space was required to be exponential in n. In contrast, in our scenario the vectors in U are only approximately orthogonal, and directly applying this approach could lead to situations where gradient steps from consecutive iterations interfere with progress made in correlated directions in previous iterations. To address this, we introduce a careful variation on this technique, based on augmenting the loss function with the following convex term:

\[ \ell_4(w) = \max\Big\{ \delta_2,\; \max_{u \in U,\, k < T} \Big\{ \tfrac{3}{8}\langle u, w^{(k)}\rangle - \tfrac{1}{2}\langle u, w^{(k+1)}\rangle \Big\} \Big\}, \tag{10} \]

where δ_2 > 0 is a small constant (that will be set later).

The key idea here is that, following the initialization stage, the inner maximization above is always attained at the same vector u = u_0, and at values of k that increase by 1 in every iteration of GD. Consequently, subgradient steps with respect to this term make a step towards u_0 in each of the components w^(1), w^(2), . . . one by one, avoiding interference between consecutive steps. At the end of this process, there are Ω(T) values of k such that w^(k) = (η/8)u_0, which is what we set out to achieve.

In some more detail, assuming GD is successfully initialized at a vector w with w^(1) = c_1 u_0 and w^(k) = 0 for k > 1 (where c_1 > 0 is a small constant), note that the maximum in Eq. (10) is uniquely attained at k = 1 and u = u_0. Consequently, the subgradient of ℓ_4 at initialization is a vector g such that g^(1) = (3/8)u_0 and g^(2) = −(1/2)u_0 (and g^(k) = 0 for k ≠ 1, 2), and taking a subgradient step with stepsize η results in w^(1) = (ηβ − 3η/8)u_0 and w^(2) = (η/2)u_0 (for k ≠ 1, 2, w^(k) remains as is). In each subsequent iteration, the maximization in Eq. (10) is attained at the index k for which w^(k) = (η/2)u_0, and at u = u_0.⁷ Subsequently, every gradient step adds −(3η/8)u_0 to w^(k) and (η/2)u_0 to w^(k+1), and results in w^(2) = w^(3) = · · · = w^(k) = (η/8)u_0 and w^(k+1) = (η/2)u_0 (whereas for all s > k + 1, w^(s) remains zero).

⁷ For this value of k, it holds that max_{u∈U} {(3/8)⟨u, w^(k)⟩ − (1/2)⟨u, w^(k+1)⟩} = (3/16)η (attained at u = u_0), whereas for other values of k this quantity is at most ≈ (1/8)·(3/8)η + (1/8)·(1/2)·(η/2) < (1/8)η due to the near-orthogonality of the vectors in U. It follows that the subgradient is a vector g such that g^(k) = (3/8)u_0, g^(k+1) = −(1/2)u_0 (and zeros elsewhere).

Finally, we note that the GD dynamics described above ensure that the iterates w_1, . . . , w_T remain strictly within the unit ball B_d, even when the algorithm does not employ any projections. As a consequence, the construction we described applies equally to a projected version of GD with projections onto the unit ball, and the resulting lower bound applies to both versions of the algorithm.
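To see the round-robin mechanism of Eq. (10) in isolation, here is a toy NumPy simulation of plain gradient descent on ℓ_4 alone; this is our own illustration, using a single candidate direction u_0, an idealized post-initialization state with (η/2)u_0 in the first subspace, and the ℓ_1, ℓ_2, ℓ_3 terms dropped. Exactly one subspace is updated per step, leaving behind components equal to (η/8)u_0.

```python
import numpy as np

eta, T, d_sub = 0.1, 8, 4
u0 = np.zeros(d_sub); u0[0] = 1.0          # single candidate direction (unit vector)
W = np.zeros((T, d_sub))                    # row k plays the role of one subspace component
W[0] = 0.5 * eta * u0                       # idealized state after the initialization stage

def ell4_grad(W, delta2=1e-6):
    # Subgradient of Eq. (10) restricted to U = {u0}: find the maximizing index k.
    scores = [0.375 * W[k] @ u0 - 0.5 * W[k + 1] @ u0 for k in range(T - 1)]
    k_star = int(np.argmax(scores))
    grad = np.zeros_like(W)
    if scores[k_star] > delta2:             # otherwise the constant delta2 attains the max
        grad[k_star] = 0.375 * u0
        grad[k_star + 1] = -0.5 * u0
    return grad

for t in range(T - 1):
    W = W - eta * ell4_grad(W)

# After T-1 steps, all but the last subspace hold (eta/8) u0 and the last holds (eta/2) u0.
print(np.round(W @ u0 / eta, 3))            # expect [0.125, 0.125, ..., 0.125, 0.5]
```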
3.5 Putting things together

We can now integrate the ideas described in Sections 3.1 to 3.4 into a construction of a learning problem where GD overfits the training data (with constant probability), which serves to prove our lower bound for gradient descent. To summarize this construction:

• The examples in the learning problem are parameterized by pairs (V, j) ∈ Z := P(U) × [n²], where U is the set of nearly orthogonal vectors described in Section 3.1, and P(U) is its power set;

• The population distribution D is uniform over pairs (V, j) ∈ Z, namely such that V ∼ Unif(P(U)) (i.e., V is formed by including every element u ∈ U independently with probability 1/2) and j ∼ Unif([n²]);

• The loss function in this construction, f : W × (P(U) × [n²]) → R, is then given by

\[ f\big(w, (V, j)\big) := \ell_1(w, V) + \ell_2\big(w, (V, j)\big) + \ell_3(w) + \ell_4(w), \qquad \forall\, (V, j) \in Z, \tag{11} \]

with the terms ℓ_1, ℓ_2, ℓ_3, ℓ_4 as defined in Eqs. (7) to (10), respectively.

With a suitable choice of parameters, this construction serves to prove Theorem 1. We remark that, while f in this construction is convex and O(1)-Lipschitz, it is evidently non-differentiable. To obtain a construction with a differentiable objective that maintains the same lower bound and establishes the full claim of Theorem 1, we add one final step of randomized smoothing of the objective. This argument hinges on the fact that the subgradients of f are unique along any possible trajectory of GD, so that smoothing in a sufficiently small neighborhood preserves the gradients along any such trajectory (and thus does not affect the dynamics of GD), while making the objective differentiable everywhere. The full proof of Theorem 1 is deferred to Appendix A.
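For concreteness, one generic way to implement such a smoothing step (we state the standard randomized-smoothing operator here; the exact smoothing used in Appendix A and its parameter choices may differ) is to average the loss over a small random perturbation of the iterate,

\[ f_\gamma(w, z) := \mathbb{E}_{v \sim \mathrm{Unif}(B_d)}\big[ f(w + \gamma v,\, z) \big]. \]

For a convex L-Lipschitz f, the function f_γ is convex, L-Lipschitz, everywhere differentiable, and uniformly γL-close to f; and wherever f is affine on a γ-neighborhood of a point (as happens along the GD trajectory thanks to the uniqueness of the subgradients and the margins in Eqs. (9) and (10)), the gradient of f_γ coincides with that of f.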
3.6 Additional adjustments for SGD

Moving on to our second main result for SGD, we provide here a brief overview of the necessary modifications of the construction for GD needed to establish the lower bound for SGD in Theorem 2; further details can be found in Section 5. In the case of SGD, our goal is to establish underfitting: namely, to show that the algorithm may converge to a solution with an excessively large empirical risk despite successfully converging on the population risk.

The main ideas leading to our construction for SGD are similar to what we discussed above, but several necessary modifications arise from the fact that, whereas in GD the entire training set is revealed already in the first iteration, in SGD it is revealed sequentially, one training sample at a time. In particular, unlike in the case of GD, where it is possible to identify a bad ERM u_0 within the first few steps of the algorithm and steer the algorithm in this direction in every subspace W^(1), W^(2), . . ., for SGD the required progress direction in W^(t), represented by a "bad solution" u_t, can only be determined at the t'th step, based on the encoded training set up to that point, V_1, . . . , V_{t−1}. As a result, it is crucial to modify the loss function such that the process of decoding such a u_t from V_1, . . . , V_{t−1} occurs in every iteration t.

Another essential adjustment involves identifying a solution with a large generalization gap (namely, large empirical risk and low population risk) and guiding the SGD iterates to converge to such a solution. Considering the function ℓ_1 defined in Eq. (7), such a solution is represented by a vector u ∈ U that appears in all of the sets V_1, . . . , V_n of the training sample. However, since u_t cannot depend on future examples, our goal within every subspace W^(t) is to take a single gradient step towards a vector u_t present in all of the sets seen up to that point, namely in ∩_{i=1}^{t−1} V_i (note that such a u_t maximizes the corresponding loss functions ℓ_1(w, V_1), . . . , ℓ_1(w, V_{t−1})). Additionally, to ensure that gradients of future loss functions remain zero and do not affect the algorithm's dynamics, it is necessary to guarantee that u_t ∈ ∩_{i=t}^{n} V_i; in other words, we are looking for a solution u_t ∈ ∩_{i=1}^{t−1} V_i ∩ ∩_{i=t}^{n} V_i. To ensure that such a vector actually exists (with constant probability), we lift the dimension of the set U and of the subspaces {W^(k)}_{k=1}^{n} to Θ(n log n) (instead of Θ(n) as before), and we modify the distribution D so that V is sampled by including every element u ∈ U independently with probability 1/(4n²). With these adaptations in place, we obtain Theorem 2; for more details we refer to Section 5.
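A rough calculation for why the dimension lift is needed (a heuristic sanity check, not the actual proof of Lemma 9 below): with per-subspace dimension d′, Lemma 1 yields |U| ≥ 2^{d′/178}, and a fixed u ∈ U lies in all n training sets with probability (1/(4n²))^n, so the expected number of vectors common to all sets is at least

\[ |U|\cdot\Big(\frac{1}{4n^2}\Big)^{n} \;\ge\; 2^{\,d'/178 \,-\, n\log_2(4n^2)}. \]

For d′ = Θ(n) the exponent is negative for large n and this expectation vanishes, whereas d′ = c·n log n with a large enough constant c (the proof uses d′ = 712 n log n) makes it grow, leaving room for the refined event E′ of Section 5 to hold with constant probability.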
4 Overfitting of GD: Proof of Theorem 1

In this section we provide a formal proof of our main result for GD. We establish a lower bound of Ω(η√T) for the population loss of GD, where the hard loss function is defined in a d-dimensional Euclidean space with dimension d polynomial in the number of examples n. In Appendix A we complete the proof of Theorem 1 by showing a lower bound of min{1/(ηT), 1} and by giving a construction with a differentiable objective that attains the lower bound stated in Theorem 1.

Full construction. For the first step, for a dimension d′ that will be set later, we use a set of approximately orthogonal vectors in R^{d′} of size (at least) exponential in d′, whose existence is given by the following lemma, adapted from Feldman (2016).

Lemma 1. For any d′ ≥ 256, there exists a set U_{d′} ⊆ R^{d′} with |U_{d′}| ≥ 2^{d′/178}, such that for all u, v ∈ U_{d′} with u ≠ v it holds that |⟨u, v⟩| ≤ 1/8.
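Sets of this kind are commonly obtained by a probabilistic argument, for instance via random sign vectors and a union bound. The following small script (our own illustration, not the construction used in the proof of Lemma 1) checks the near-orthogonality property empirically at modest sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
d_prime, m = 4096, 256             # toy sizes; Lemma 1 allows as many as 2^{d'/178} vectors

# Uniform random sign vectors, normalized to unit length; pairwise inner products
# concentrate around 0 at scale 1/sqrt(d'), and a union bound extends the guarantee
# |<u, v>| <= 1/8 to exponentially many vectors.
U = rng.choice([-1.0, 1.0], size=(m, d_prime)) / np.sqrt(d_prime)

G = U @ U.T                         # Gram matrix; diagonal entries are exactly 1
off_diag = np.abs(G - np.eye(m)).max()
print(f"{m} unit vectors in R^{d_prime}; max |<u,v>| over distinct pairs = {off_diag:.3f}")
assert off_diag <= 1 / 8
```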
Now, let n be the number of examples in the training set. We define the set U := U_{d′} to be a set as specified by Lemma 1 with d′ = 178n. Then, as outlined in Section 3, we define the sample space Z := {(V, j) : V ⊆ U, j ∈ [n²]} and take the hard distribution D to be uniform over Z. Moreover, we consider the loss function f : R^d × Z → R defined in Eq. (11), with d := Td′ + 2n² = 178nT + 2n². This loss function is convex and 5-Lipschitz over R^d, as established in the following lemma.

Lemma 2. For every (V, j) ∈ Z, the loss function f(w, (V, j)) is convex and 5-Lipschitz over R^d with respect to its first argument.

For this construction of distribution and loss function, we obtain the following theorem.

Theorem 3. Assume that n > 0, T > 3200² and η ≤ 1/√T. Consider the distribution D and the loss function f defined in Section 3.5 with d = 178nT + 2n², ε = (1/n²)(1 − cos(2π/|P(U)|)), β = ε/(4T²), δ_1 = η/(2n) and δ_2 = 3ηβ/16. Then, for unprojected GD (cf. Eq. (1) with W = R^d) on F̂, initialized at w_1 = 0 with step size η, we have, with probability at least 1/6 over the choice of the training sample:
(i) The iterates of GD remain within the unit ball, namely w_t ∈ B_d for all t = 1, . . . , T;
(ii) For all m = 1, . . . , T, the m-suffix averaged iterate satisfies F(\overline{w}_{T,m}) − F(w*) = Ω(η√T).

Algorithm's dynamics. We next give a key lemma that characterizes the trajectory of GD when applied to the empirical risk F̂ formed by the loss function f and the training sample S = {(V_i, j_i)}_{i=1}^n. The characterization holds under a certain "good event", given as follows:

\[ E := \Big\{ \bigcup_{i=1}^{n} V_i \neq U \Big\} \cap \Big\{ j_k \neq j_l,\ \ \forall\, k \neq l \Big\}. \tag{12} \]

In words, under the event E there exists at least one "bad direction" (a vector in the set U \ ∪_{i=1}^n V_i), and there is no collision between the indices j_1, . . . , j_n. In the following lemma we show that E holds with constant probability; the proof is deferred to Appendix B.2.

Lemma 3. For the event E defined in Eq. (12), it holds that Pr(E) ≥ 1/6.

Under this event, the dynamics of GD are characterized as follows.

Lemma 4. Assume the conditions of Theorem 3, and consider the iterates of unprojected GD on F̂ with step size η ≤ 1/√T, initialized at w_1 = 0. Under the event E, we have for all t ≥ 5 that

\[
w_t^{(k)} = \begin{cases}
\frac{\eta}{n}\sum_{i=1}^{n}\phi(V_i, j_i) & k = 0;\\
\big( -\frac{3}{8} + \frac{(t-2)\,\epsilon}{4T^2} \big)\eta\, u_0 & k = 1;\\
\frac{1}{8}\eta\, u_0 & 2 \le k \le t-3;\\
\frac{1}{2}\eta\, u_0 & k = t-2;\\
0 & t-1 \le k \le T,
\end{cases} \tag{13}
\]

where u_0 is a vector such that u_0 ∈ U \ ∪_{i=1}^n V_i.

To prove Lemma 4, we break the loss down into its components and analyze how the terms ℓ_1, ℓ_3 and ℓ_4 affect the dynamics of GD under the event E. For each of these components, which involve maxima over linear functions, we show which term achieves the maximum value at each w_t and derive the expressions for the gradients at those points from the maximizing terms.

First, we show that under this event the gradients of ℓ_1 do not affect the dynamics, since at any iteration t the gradient of ℓ_1 is zero, as stated in the following lemma. The proof is deferred to Appendix B.

Lemma 5. Assume the conditions of Theorem 3 and the event E. Let w ∈ R^d be such that for every 2 ≤ k ≤ T, w^(k) = cη u_0 with c ≤ 1/2 and u_0 ∈ U \ ∪_{i=1}^n V_i. Then, for every i, it holds that: (i) for every k ≥ 2, max_{u∈V_i} ⟨w^(k), u⟩ ≤ η/16; (ii) ℓ_1 is differentiable at w and for all i ∈ [n] we have ∇ℓ_1(w, V_i) = 0.

Next, for the term ℓ_3: as outlined in Section 3.3, it is used to identify the actual training set S = {(V_i, j_i)}_{i=1}^n given the encoding ψ* = (1/n)Σ_{i=1}^n φ(V_i, j_i) in the iterate w_t^(0), and to ensure that a gradient step in W^(1) towards a corresponding vector u_0 ∈ U \ ∪_{i=1}^n V_i is performed in the following iteration. This is achieved by realizing ψ* as the maximizer, with a positive constant margin, of a maximum of linear functions over the set Ψ, which contains all possible encoded datasets. This idea is formalized in the following lemma.

Lemma 6. Assume the conditions of Theorem 3 and the event E. Let ψ* = (1/n)Σ_{i=1}^n φ(V_i, j_i), and let w ∈ R^d be such that w^(0) = ηψ* and w^(1) = cη u_0 with |c| ≤ 1 and u_0 ∈ U \ ∪_{i=1}^n V_i. Then:
(i) For every ψ ∈ Ψ, ψ ≠ ψ*:
\[ \langle w^{(0)}, \psi^\ast\rangle - \frac{\epsilon}{4T^2}\langle \alpha(\psi^\ast), w^{(1)}\rangle \;>\; \langle w^{(0)}, \psi\rangle - \frac{\epsilon}{4T^2}\langle \alpha(\psi), w^{(1)}\rangle + \frac{\eta\epsilon}{4}; \]
(ii) For ψ = ψ*, it holds that
\[ \langle w^{(0)}, \psi^\ast\rangle - \frac{\epsilon}{4T^2}\langle \alpha(\psi^\ast), w^{(1)}\rangle \;>\; \delta_1 + \frac{\eta}{16 n}; \]
(iii) ℓ_3 is differentiable at w and its gradient is given by
\[ \big(\nabla \ell_3(w)\big)^{(k)} = \begin{cases} \psi^\ast & k = 0;\\ -\frac{\epsilon}{4T^2}\, u_0 & k = 1;\\ 0 & \text{otherwise}. \end{cases} \]

Finally, for ℓ_4, as detailed in Section 3.4, the role of this term is to make the last iterate w_T satisfy w_T^(k) = (η/8)u_0 for Ω(T)-many subspaces W^(k). In the following lemma we show that each gradient step increases the number of such k's by one; namely, in every iteration t, the maximum in ℓ_4 is attained at u = u_0 and at the largest index k_t for which w_t^(k) ≠ 0, which increases by 1 in every iteration, making w_{t+1}^{(k_t)} = (η/8)u_0.
Lemma 7. Assume the conditions of Theorem 3 and the event E. Let w ∈ R^d, u_0 ∈ U \ ∪_{i=1}^n V_i and 3 ≤ m < T be such that w^(1) = cη u_0 with −3/8 ≤ c ≤ 0, w^(k) = (η/8)u_0 for every 2 ≤ k ≤ m−1, w^(m) = (η/2)u_0, and w^(k) = 0 for every k > m. Then it holds that:
(i) For every pair u ∈ U and k < T such that k ≠ m or u ≠ u_0,
\[ \tfrac{3}{8}\langle u_0, w^{(m)}\rangle - \tfrac{1}{2}\langle u_0, w^{(m+1)}\rangle \;>\; \tfrac{3}{8}\langle u, w^{(k)}\rangle - \tfrac{1}{2}\langle u, w^{(k+1)}\rangle + \tfrac{\eta}{64}; \]
(ii) \( \tfrac{3}{8}\langle u_0, w^{(m)}\rangle - \tfrac{1}{2}\langle u_0, w^{(m+1)}\rangle > \delta_2 + \tfrac{\eta}{64}; \)
(iii) ℓ_4 is differentiable at w and the gradient is given by
\[ \big(\nabla \ell_4(w)\big)^{(k)} = \begin{cases} \tfrac{3}{8} u_0 & k = m;\\ -\tfrac{1}{2} u_0 & k = m+1;\\ 0 & \text{otherwise}. \end{cases} \]

Proof of Lemma 4. We prove the lemma by induction on t; the base case, t = 5, is proved in Lemma 21 in Appendix B, and here we focus on the induction step. Fix any t ≥ 5 and assume that the lemma holds for w_t; we will prove the claim for w_{t+1}.

First, for ℓ_1, note that by the induction hypothesis, for every 2 ≤ k ≤ T, w_t^(k) = cη u_0 with c ≤ 1/2; thus, by Lemma 5, for every i, ∇ℓ_1(w_t, V_i) = 0.

For ℓ_2, we know that for every i,
\[ \big(\nabla \ell_2(w_t, (V_i, j_i))\big)^{(k)} = \begin{cases} -\phi(V_i, j_i) & k = 0;\\ 0 & \text{otherwise}. \end{cases} \]

For ℓ_3, the induction hypothesis implies that w_t^(1) = cη u_0 with |c| ≤ 1 and w_t^(0) = (η/n)Σ_{i=1}^n φ(V_i, j_i); hence, by Lemma 6,
\[ \big(\nabla \ell_3(w_t)\big)^{(k)} = \begin{cases} \frac{1}{n}\sum_{i=1}^{n}\phi(V_i, j_i) & k = 0;\\ -\frac{\epsilon}{4T^2}\, u_0 & k = 1;\\ 0 & \text{otherwise}. \end{cases} \]

For ℓ_4, by the induction hypothesis we know that w_t^(1) = (−3/8 + (t−2)ε/(4T²))η u_0, so that w_t^(1) = cη u_0 with −3/8 ≤ c ≤ 0. Then the conditions of Lemma 7 hold with m = t−2, and therefore
\[ \big(\nabla \ell_4(w_t)\big)^{(k)} = \begin{cases} \tfrac{3}{8} u_0 & k = t-2;\\ -\tfrac{1}{2} u_0 & k = t-1;\\ 0 & \text{otherwise}. \end{cases} \]

Combining all of the above, we get that
\[ \big(\nabla \widehat F(w_t)\big)^{(k)} = \begin{cases} -\frac{\epsilon}{4T^2}\, u_0 & k = 1;\\ \tfrac{3}{8} u_0 & k = t-2;\\ -\tfrac{1}{2} u_0 & k = t-1;\\ 0 & \text{otherwise}, \end{cases} \]
where u_0 ∈ U \ ∪_{i=1}^n V_i, and the lemma follows.

Proof of the lower bound. We can now turn to prove Theorem 3. Here we prove the lower bound for the case of suffix averaging with m = 1, namely when the output solution is the final iterate w_T of GD; the full proof of the more general case can be found in Appendix B.3.

Proof of Theorem 3 (m = 1 case). We prove the theorem under the condition that E occurs. First, by Lemma 22 in Appendix B we know that ‖w_t‖ ≤ 1 for every t. Next, w_T is as in Eq. (13). Now, observe that if a vector v ∈ U belongs to a set V ⊆ U then max_{u∈V}⟨u, v⟩ = 1, whereas if v ∉ V then max_{u∈V}⟨u, v⟩ ≤ 1/8. As a result, since for a fresh pair (V, j) ∼ D the vector u_0 belongs to V with probability 1/2, the following holds:

\[
\begin{aligned}
\mathbb{E}_V\sqrt{\sum_{k=2}^{T}\max\Big\{\tfrac{3\eta}{32},\, \max_{u\in V}\langle u, w_T^{(k)}\rangle\Big\}^2}
&\ge \mathbb{E}_V\sqrt{\sum_{k=2}^{T-3}\max\Big\{\tfrac{3\eta}{32},\, \max_{u\in V}\langle u, w_T^{(k)}\rangle\Big\}^2}\\
&= \mathbb{E}_V\sqrt{(T-4)\,\max\Big\{\tfrac{3\eta}{32},\, \max_{u\in V}\big\langle u, \tfrac{\eta}{8}u_0\big\rangle\Big\}^2}\\
&= \frac{\eta\sqrt{T-4}}{8}\,\mathbb{E}_V\Big[\max\Big\{\tfrac34,\, \max_{u\in V}\langle u, u_0\rangle\Big\}\Big]\\
&\ge \frac{\eta\sqrt{T-4}}{8}\Big(\tfrac34\Pr(u_0\notin V) + \Pr(u_0\in V)\Big) = \frac{7\eta}{64}\sqrt{T-4}.
\end{aligned}
\]

Moreover, we note that for every t, every V ⊆ U and every j ∈ [n²], we have ℓ_2(w_t, (V, j)) ≥ −‖w_t^(0)‖ ≥ −η, ℓ_3(w_t) ≥ δ_1 and ℓ_4(w_t) ≥ δ_2; thus,

\[ F(w_T) \ge \tfrac{7\eta}{64}\sqrt{T-4} + \delta_1 + \delta_2 - \eta \ge \eta\Big(\tfrac{7}{64}\sqrt{T-4} - 1\Big), \qquad F(w^\ast) \le F(0) \le \tfrac{3\eta}{32}\sqrt{T} + \eta. \]

Then, using √(T−4) ≥ √T − 1 and the assumption that T is large enough so that 3 ≤ (1/128)√T, we conclude

\[ F(w_T) - F(w^\ast) \ge \eta\Big(\tfrac{1}{64}\sqrt{T} - 3\Big) \ge \tfrac{\eta}{128}\sqrt{T}. \]
5 Underfitting of SGD: Proof of Theorem 2

In this section we give a formal proof of our main result for SGD. As for GD, we construct a hard loss function defined in a d-dimensional Euclidean space with d polynomial in the number of examples n. Using this construction, we establish a lower bound of Ω(η√T) for the empirical loss of SGD with T = n iterations. We complete the proof of Theorem 2 in Appendix A.

Full construction. For the first step of the construction, we use Lemma 1 (see Section 4), which provides, for every dimension d′, a set of approximately orthogonal vectors in R^{d′} of size exponential in d′. We define the set U to be U := U_{d′} for d′ = 712 n log n, and the sample space to be Z_SGD := {V : V ⊆ U}. Moreover, we define the hard distribution D_SGD so that every u ∈ U is included in V ⊆ U independently with probability δ = 1/(4n²).

For the hard loss function, we continue referring to every vector w ∈ R^d as a concatenation of vectors, w = (w^(0), w^(1), w^(2), . . . , w^(n)), where for 1 ≤ k ≤ n, w^(k) ∈ R^{712 n log n} and w^(0) ∈ R^{2n²}. In this construction, w^(0) is itself a concatenation of n vectors w^(0,1), . . . , w^(0,n), where w^(0,r) ∈ R^{2n} for every r ∈ [n].

Our approach is, as for GD, to encode in every iteration t the set V_t, sampled from D_SGD, into the iterate w_{t+1}^(0). For this, we construct an encoder φ : P(U) × [n] → R^{2n}, a decoder α : R^{2n} → U, a real number ε > 0, and n sets denoted Ψ_1, . . . , Ψ_n. The idea behind the construction is that the set Ψ_k represents all possible training prefixes of k examples, {V_1, . . . , V_k}, and in every iteration t it is possible to obtain the vector ψ*_{t−1} ∈ Ψ_{t−1} identified with the actual sets V_1, . . . , V_{t−1} sampled before that iteration, as a maximizer of a linear function with margin ε. Then, as outlined in Section 3.6, we aim to output a vector u_t ∈ ∩_{i=t}^{n} V_i. The exact construction of ε, φ, α, Ψ_1, . . . , Ψ_n is detailed in Lemma 23 in Appendix C.

Then, for these ε, φ, α, Ψ_1, . . . , Ψ_n and d := n d′ + 2n² = 712 n² log n + 2n², we define the loss function in our construction. The loss function f_SGD is composed of three terms, ℓ_1^SGD, ℓ_2^SGD, ℓ_3^SGD, and is defined as follows:

\[
\begin{aligned}
f^{\mathrm{SGD}}(w, V) :={}& \underbrace{\sqrt{\sum_{k=2}^{T} \max\Big\{\tfrac{3\eta}{32},\, \max_{u\in V}\langle u, w^{(k)}\rangle\Big\}^2}}_{=:\ \ell^{\mathrm{SGD}}_1(w, V)}\\
&+ \max\Big\{\delta_1,\; \max_{k\in[n-1],\ u\in U,\ \psi\in\Psi_k} \Big\{ \tfrac{3}{8}\langle u, w^{(k)}\rangle - \tfrac12\langle \alpha(\psi), w^{(k+1)}\rangle + \big\langle w^{(0,k)}, \tfrac{1}{4n}\psi\big\rangle - \big\langle w^{(0,k+1)}, \tfrac{1}{4n}\psi\big\rangle + \big\langle w^{(0,k+1)}, -\tfrac{1}{4n^2}\phi(V, k+1)\big\rangle \Big\}\Big\}\\
&+ \underbrace{\big\langle w^{(0,1)}, -\tfrac{1}{4n^2}\phi(V, 1)\big\rangle - \tfrac{1}{n^3}\langle u_1, w^{(1)}\rangle}_{=:\ \ell^{\mathrm{SGD}}_3(w, V)},
\end{aligned} \tag{14}
\]

where the second term is denoted ℓ_2^SGD(w, V) and u_1 is an arbitrary (fixed) vector in U.

In the following lemma we establish that the above loss function is indeed convex and Lipschitz over R^d. The proof appears in Appendix C.

Lemma 8. For every V ∈ Z_SGD, the loss function f_SGD(w, V) is convex and 4-Lipschitz over R^d with respect to its first argument.

For this construction of distribution and loss function, we show the following theorem.
For this construction of distribution and loss function, we show the following theorem.

Theorem 4. Assume that $n > 2048$ and $\eta \le \tfrac{1}{\sqrt T}$. Consider the distribution $\mathcal D_{\mathrm{SGD}}$ and the loss function $f^{\mathrm{SGD}}$ with $d = 712 n^2 \log n + 2n^2$, $\epsilon = \tfrac{1}{n^2}\big(1 - \cos\tfrac{2\pi}{|\mathcal P(U)|}\big)$ and $\delta_1 = \tfrac{\eta}{8n^3}$. Then, for Unprojected SGD (cf. Eq. (2) with $\mathcal W = \mathbb{R}^d$) with $T = n$ iterations, initialized at $w_1 = 0$ with step size $\eta$, we have, with probability at least $\tfrac12$ over the choice of the training sample:

(i) the iterates of SGD remain within the unit ball, namely $w_t \in B^d$ for all $t = 1, \ldots, T$;

(ii) for all $m = 1, \ldots, T$, the $m$-suffix averaged iterate has $\widehat F^{\mathrm{SGD}}(w_{T,m}) - \widehat F^{\mathrm{SGD}}(\widehat w^\star) = \Omega\big(\eta\sqrt T\big)$.

Algorithm's dynamics. As in GD, we provide a key lemma that characterizes the trajectory of SGD under a certain "good event". For this event, given a random training sample $S = \{V_i\}_{i=1}^n$, we denote $P_t = \bigcap_{i=1}^{t-1} V_i$ and $S_t = \bigcap_{i=t}^{n} V_i$. Moreover, if $P_t \ne \emptyset$, we denote $r_t = \arg\min\{r : v_r \in P_t\}$ and $J_t = v_{r_t} \in U$. The good event is given as follows:
\[
E' = \{\forall\, t \le T:\ P_t \ne \emptyset \text{ and } J_t \in S_t\}. \tag{15}
\]
In the following lemma we show that $E'$ occurs with constant probability. The proof appears in Appendix C.

Lemma 9. For $T = n$ and the event $E'$ defined in Eq. (15), it holds that $\Pr(E') \ge \tfrac12$.

Under this event, the dynamics of SGD are characterized as follows.

Lemma 10. Assume the conditions of Theorem 4, and consider the iterates of unprojected SGD with step size $\eta \le \tfrac{1}{\sqrt T}$, initialized at $w_1 = 0$. Under the event $E'$, we have for $t \ge 4$,
\[
w_t(k) =
\begin{cases}
-\tfrac38\eta u_1 + (t-1)\tfrac{\eta}{n^3} u_1 & k = 1;\\
\tfrac18\eta u_k & 2 \le k \le t-2;\\
\tfrac12\eta u_{t-1} & k = t-1;\\
0 & t \le k \le n,
\end{cases}
\qquad
w_t(0,k) =
\begin{cases}
\tfrac{\eta}{4n^2}\sum_{i=2}^{t-1}\varphi(V_i, 1) & k = 1;\\
\tfrac{\eta}{4n^2}\sum_{i=1}^{t-1}\varphi(V_i, i) & k = t-1;\\
0 & k \notin \{1, t-1\},
\end{cases}
\]
where $u_1 \in U$ and every other vector $u_k$ satisfies $u_k \in \bigcap_{i=1}^{k-1} V_i \cap \bigcap_{i=k}^{n} V_i$.

To prove this key lemma, we analyze how the terms $\ell_1^{\mathrm{SGD}}$ and $\ell_2^{\mathrm{SGD}}$ affect the dynamics of SGD under the event $E'$. First, we show that the gradient of $\ell_1^{\mathrm{SGD}}$ does not affect the dynamics of SGD, as the gradient of this term at any iterate $w_t$ is zero. The idea is formalized in the following lemma; the proof is deferred to Appendix C.

Lemma 11. Assume the conditions of Theorem 4 and the event $E'$. Let $w \in \mathbb{R}^d$ and $t$ be such that for every $2 \le k \le t-1$, $w(k) = c\eta u_k$ for $c \le \tfrac12$ where every such $u_k$ satisfies $u_k \in \bigcap_{i=1}^{k-1} V_i \cap \bigcap_{i=k}^{n} V_i$, and for every $t \le k \le T$, $w(k) = 0$. Then, for every $t$, $\ell_1^{\mathrm{SGD}}$ is differentiable at $(w, V_t)$ and $\nabla \ell_1^{\mathrm{SGD}}(w, V_t) = 0$.

Next, we analyze the gradient of $\ell_2^{\mathrm{SGD}}$. The role of this component is to decode the next "bad solution" $\alpha\big(\tfrac1n\sum_{i=1}^{t-1}\varphi(V_i, i)\big)$ from the sets $V_1, \ldots, V_{t-1}$, and to make progress in this direction in some subspace $W^{(t-1)}$. In the following lemma, we show that the gradient of $\ell_2^{\mathrm{SGD}}$ serves this goal.

Lemma 12. Assume the conditions of Theorem 4 and the event $E'$. For every $k$, let $\psi^\star_k = \tfrac1n\sum_{t=1}^{k}\varphi(V_t, t)$. Moreover, let $m \ge 3$ and $w \in \mathbb{R}^d$ be such that $w(1) = c\eta u_1$ for $-\tfrac38 \le c \le 0$ and $u_1 \in U$; for every $2 \le k \le m-1$, $w(k) = \tfrac18\eta u_k$ where every such $u_k$ satisfies $u_k \in \bigcap_{t=1}^{k-1} V_t \cap \bigcap_{t=k}^{n} V_t$; $w(m) = \tfrac12\eta u_m$ where $u_m$ satisfies $u_m \in \bigcap_{t=1}^{m-1} V_t \cap \bigcap_{t=m}^{n} V_t$; and for every $m+1 \le k \le T$, $w(k) = 0$. Moreover, assume that $w$ satisfies $w(0,m) = \tfrac{\eta}{4n}\psi^\star_m$, $\|w(0,1)\| \le \eta$, and $w(0,k) = 0$ for every $k \notin \{m, 1\}$.
Then, for every $V \subseteq U$, $\ell_2^{\mathrm{SGD}}$ is differentiable at $(w, V)$ and we have, for $k \ne 0$,
\[
\nabla \ell_2^{\mathrm{SGD}}(w, V)^{(k)} =
\begin{cases}
\tfrac38 u_m & k = m;\\
-\tfrac12 \alpha(\psi^\star_m) & k = m+1;\\
0 & k \notin \{m, m+1\},
\end{cases}
\]
and
\[
\nabla \ell_2^{\mathrm{SGD}}(w, V)^{(0,k)} =
\begin{cases}
\tfrac{1}{4n^2}\sum_{t=1}^{m}\varphi(V_t, t) & k = m;\\
-\tfrac{1}{4n^2}\sum_{t=1}^{m}\varphi(V_t, t) - \tfrac{1}{4n^2}\varphi(V, m+1) & k = m+1;\\
0 & k \notin \{m, m+1\}.
\end{cases}
\]

Now we can prove Lemma 10.

Proof of Lemma 10. We assume that $E'$ holds and prove the lemma by induction on $t$. The base case, $t = 4$, is proved in Lemma 26 in Appendix C. Now, assume the induction hypothesis, namely that the lemma holds for iteration $t$, and let us show the claim for iteration $t+1$. First, notice that for every $2 \le k \le t-1$, $w_t(k) = c\eta u_k$ for $c \le \tfrac12$ where every such $u_k$ satisfies $u_k \in \bigcap_{i=1}^{k-1} V_i \cap \bigcap_{i=k}^{n} V_i$, and for every $t \le k \le T$, $w_t(k) = 0$. Then, by Lemma 11, we have $\nabla\ell_1^{\mathrm{SGD}}(w_t, V_t) = 0$. Second, $\ell_3^{\mathrm{SGD}}$ is a linear function, thus
\[
\nabla \ell_3^{\mathrm{SGD}}(w_t, V_t)^{(s)} =
\begin{cases}
-\tfrac{1}{n^3} u_1 & s = 1;\\
-\tfrac{1}{4n^2}\varphi(V_t, 1) & s = (0,1);\\
0 & \text{otherwise}.
\end{cases}
\]
Third, for $\ell_2^{\mathrm{SGD}}(w_t, V_t)$, notice that for $m = t-1 \ge 3$ it holds that $w_t(1) = c\eta u_1$ for $-\tfrac38 \le c \le 0$ and $u_1 \in U$; for every $2 \le k \le m-1$, $w_t(k) = \tfrac18\eta u_k$ where every such $u_k$ satisfies $u_k \in \bigcap_{i=1}^{k-1} V_i \cap \bigcap_{i=k}^{n} V_i$; $w_t(m) = \tfrac12\eta u_m$ where $u_m \in \bigcap_{i=1}^{m-1} V_i \cap \bigcap_{i=m}^{n} V_i$; and for every $m+1 \le k \le T$, $w_t(k) = 0$. Moreover, $w_t$ satisfies $w_t(0,m) = \tfrac{\eta}{4n}\psi^\star_m$, $\|w_t(0,1)\| \le \eta$, and $w_t(0,k) = 0$ for every $k \notin \{m, 1\}$. Then, by Lemma 12, we get that for $k \ne 0$,
\[
\nabla \ell_2^{\mathrm{SGD}}(w_t, V_t)^{(k)} =
\begin{cases}
\tfrac38 u_{t-1} & k = t-1;\\
-\tfrac12 \alpha(\psi^\star_{t-1}) & k = t;\\
0 & k \notin \{t-1, t\},
\end{cases}
\]
and
\[
\nabla \ell_2^{\mathrm{SGD}}(w_t, V_t)^{(0,k)} =
\begin{cases}
\tfrac{1}{4n^2}\sum_{i=1}^{t-1}\varphi(V_i, i) & k = t-1;\\
-\tfrac{1}{4n^2}\sum_{i=1}^{t}\varphi(V_i, i) & k = t;\\
0 & k \notin \{t-1, t\}.
\end{cases}
\]
Now, by Lemma 23, for $j = \arg\min_i\{i : v_i \in \bigcap_{l=1}^{t-1} V_l\}$ we get that $\alpha(\psi^\star_{t-1}) = v_j \in \bigcap_{i=1}^{t-1} V_i$. Notice that $\bigcap_{i=1}^{t-1} V_i = P_t$, and thus $\alpha(\psi^\star_{t-1}) = J_t$. Then, by $E'$, we also have $\alpha(\psi^\star_{t-1}) \in S_t$. Combining the above, we get, for $u_t = \alpha(\psi^\star_{t-1}) \in P_t \cap S_t$,
\[
\nabla f(w_t, V_t)^{(k)} =
\begin{cases}
-\tfrac{1}{n^3} u_1 & k = 1;\\
\tfrac38 u_{t-1} & k = t-1;\\
-\tfrac12 u_t & k = t;\\
0 & k \notin \{1, t-1, t\},
\end{cases}
\qquad
\nabla f(w_t, V_t)^{(0,k)} =
\begin{cases}
-\tfrac{1}{4n^2}\varphi(V_t, 1) & k = 1;\\
\tfrac{1}{4n^2}\sum_{i=1}^{t-1}\varphi(V_i, i) & k = t-1;\\
-\tfrac{1}{4n^2}\sum_{i=1}^{t}\varphi(V_i, i) & k = t;\\
0 & k \notin \{1, t-1, t\},
\end{cases}
\]
and the lemma follows.

The proof of Theorem 4 is similar to that of Theorem 3, using Lemma 10 instead of Lemma 4, and is deferred to Appendix C.

Acknowledgments

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreements No. 101078075; 882396). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them. This work received additional support from the Israel Science Foundation (ISF, grant number 2549/19), from the Len Blavatnik and the Blavatnik Family foundation, and from the Adelis Foundation.

References

N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM (JACM), 44(4):615–631, 1997.
I. Amir, Y. Carmon, T. Koren, and R. Livni. Never go full batch (in stochastic convex optimization). Advances in Neural Information Processing Systems, 34:25033–25043, 2021a.

I. Amir, T. Koren, and R. Livni. SGD generalizes better than GD (and regularization doesn't help). In Conference on Learning Theory, pages 63–92. PMLR, 2021b.

P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

P. L. Bartlett, P. M. Long, G. Lugosi, and A. Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020.

P. L. Bartlett, A. Montanari, and A. Rakhlin. Deep learning: a statistical viewpoint. Acta Numerica, 30:87–201, 2021.

R. Bassily, V. Feldman, C. Guzmán, and K. Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. Advances in Neural Information Processing Systems, 33, 2020.

M. Belkin. Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation. Acta Numerica, 30:203–248, 2021.

A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik–Chervonenkis dimension. Journal of the ACM (JACM), 36(4):929–965, 1989.

O. Bousquet and A. Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2:499–526, 2002.

D. Carmon, R. Livni, and A. Yehudayoff. The sample complexity of ERMs in stochastic convex optimization. arXiv preprint arXiv:2311.05398, 2023.

V. Feldman. Generalization of ERM in stochastic convex optimization: The dimension strikes back. In Advances in Neural Information Processing Systems, volume 29, 2016.

A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 385–394, 2005.

M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pages 1225–1234. PMLR, 2016.

S. Kale, A. Sekhari, and K. Sridharan. SGD: The role of implicit regularization, batch-size and multiple-epochs. arXiv preprint arXiv:2107.05074, 2021.

T. Koren, R. Livni, Y. Mansour, and U. Sherman. Benign underfitting of stochastic gradient descent. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 19605–19617. Curran Associates, Inc., 2022.

R. Magen and O. Shamir. Initialization-dependent sample complexity of linear predictors and neural networks. arXiv preprint arXiv:2305.16475, 2023.

M. E. Muller. A note on a method for generating points uniformly on n-dimensional spheres. Communications of the ACM, 2(4):19–20, 1959.

B. Neyshabur, R. Tomioka, and N. Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614, 2014.

B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro. Exploring generalization in deep learning. Advances in Neural Information Processing Systems, 30, 2017.

S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014. ISBN 9781107057135.
S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Learnability, stability and uniform convergence. The Journal of Machine Learning Research, 11:2635–2670, 2010.

V. Vapnik. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–281, 1971.

C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations, ICLR 2017, 2017.

A Differentiability and Proofs of Theorems 1 and 2

In this section, we complete the proofs of Theorems 1 and 2 by showing a construction of a differentiable objective that maintains the same lower bounds given in Theorems 3 and 4 and Lemma 33. Our general approach is to use a randomized smoothing of the original objectives. We then use the fact that the subgradients are unique along any possible trajectory of GD to show that, when the smoothing is applied within a sufficiently small neighborhood, gradients along any such trajectory are preserved. Consequently, this approach does not impact the dynamics of the optimization algorithm, while simultaneously ensuring the objectives become differentiable everywhere.

A.1 Proof of Theorem 1

Full construction. The hard distribution $\mathcal D$ is defined as in Section 4. The hard loss function is a smoothing of $f$ (Eq. (11)), defined as
\[
\tilde f(w, (V, j)) := \mathbb E_{v \in B}\big[f(w + \delta v, (V, j))\big], \tag{16}
\]
for a sufficiently small $\delta > 0$ and the $d$-dimensional unit ball $B$. Analogously, we denote the empirical loss and the population loss with respect to $\tilde f$ by $\widehat{\tilde F}(w) = \tfrac1n\sum_{i=1}^n \tilde f(w, (V_i, j_i))$ and $\tilde F(w) = \mathbb E_{(V,j)\sim\mathcal D}\,\tilde f(w, (V, j))$, respectively. The loss function $\tilde f$ is differentiable, 5-Lipschitz with respect to its first argument, and convex over $\mathbb{R}^d$, as stated in the following lemma.

Lemma 13. For every $(V, j) \in Z$, the loss function $\tilde f$ is differentiable, convex and $5$-Lipschitz with respect to its first argument over $\mathbb{R}^d$.

We first prove the following theorem.

Theorem 5. Assume that $n > 0$, $T > 3200^2$ and $\eta \le \tfrac{1}{\sqrt T}$. Consider the distribution $\mathcal D$ and the loss function $\tilde f$ for $d = 178 nT + 2n^2$, $\epsilon = \tfrac{1}{n^2}\big(1 - \cos\tfrac{2\pi}{|\mathcal P(U)|}\big)$, $\beta = \tfrac{\epsilon}{4T^2}$, $\delta = \tfrac{\eta\beta}{32}$, $\delta_1 = \tfrac{\eta}{2n}$ and $\delta_2 = \tfrac{3\eta\beta}{16}$. Then, for Unprojected GD (cf. Eq. (1) with $\mathcal W = \mathbb{R}^d$) on $\widehat F$, initialized at $w_1 = 0$ with step size $\eta$, we have, with probability at least $\tfrac16$ over the choice of the training sample:

(i) the iterates of GD remain within the unit ball, namely $w_t \in B^d$ for all $t = 1, \ldots, T$;

(ii) for all $m = 1, \ldots, T$, the $m$-suffix averaged iterate has $\tilde F(w_{T,m}) - \tilde F(w^\star) = \Omega\big(\eta\sqrt T\big)$.

Algorithm dynamics. We now show that the dynamics of GD applied to $\widehat{\tilde F}$ are identical to the dynamics of the algorithm on $\widehat F$, as stated in the following lemma.

Lemma 14. Under the conditions of Theorems 3 and 5, let $w_t, \tilde w_t$ be the iterates of Unprojected GD with step size $\eta \le \tfrac{1}{\sqrt T}$ and $w_1 = 0$, on $\widehat F$ and $\widehat{\tilde F}$ respectively. Then, if $E$ occurs, for every $t \in [T]$ it holds that $w_t = \tilde w_t$.
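The following minimal sketch illustrates the randomized smoothing of Eq. (16) by Monte Carlo, using a simple stand-in nonsmooth convex objective rather than the paper's hard instance; the ball-sampling routine (uniform direction scaled by a radius $\sim U^{1/d}$) is an implementation choice for the illustration, not something specified in the text. For an $L$-Lipschitz $f$, one expects $|\tilde f(w) - f(w)| \le L\delta$, which is the kind of $O(\delta)$ slack used in the proofs below.

```python
import numpy as np

# Sketch of Eq. (16): f_tilde(w) = E_{v ~ Unif(B)} f(w + delta*v), approximated by Monte Carlo,
# for a simple 1-Lipschitz convex stand-in objective (a max of linear functions with a floor).
rng = np.random.default_rng(1)
d, delta, L = 50, 1e-2, 1.0

A = rng.standard_normal((8, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)          # rows a_i with unit norm

def f(w):
    return max(0.1, np.max(A @ w))                     # convex, nonsmooth, 1-Lipschitz

def sample_ball(num):
    """Uniform samples from the unit ball: uniform direction times a radius ~ U^{1/d}."""
    g = rng.standard_normal((num, d))
    dirs = g / np.linalg.norm(g, axis=1, keepdims=True)
    radii = rng.random(num) ** (1.0 / d)
    return dirs * radii[:, None]

def f_smooth(w, num=20_000):
    V = sample_ball(num)
    return np.mean([f(w + delta * v) for v in V])

w = rng.standard_normal(d) / np.sqrt(d)
print("f(w)       =", f(w))
print("f_tilde(w) =", f_smooth(w))
print("gap <= L*delta?", abs(f_smooth(w) - f(w)) <= L * delta)
```

In the actual argument no Monte Carlo approximation is needed: the exact expectation is used, and the point of Lemma 14 is that for $\delta$ small enough the smoothed and unsmoothed objectives generate exactly the same GD trajectory.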
Pro of of Theorem 5. Next, we set out to establish the p ro of for Theorem 5. Pr o of of The or em 5. Let w T ,m b e the m -suffix a v erage of GD w hen is app lied on b F . Let w ∗ = arg min w F ( w ). By Lemma 14, w e kno w that, w ith probabilit y of at least 1 6 , E o ccurs and w T ,m = w T ,m . T hen, by Theorem 3 and Lemma 32, η 3200 √ T ≤ F ( w T ,m ) − F ( w ∗ ) = F ( w T ,m ) − F ( w ∗ ) ≤ ˜ F ( w T ,m ) + 5 δ − ˜ F ( w ∗ ) + 5 δ ≤ ˜ F ( w T ,m ) + 5 δ − ˜ F ( w ∗ ) + 5 δ , and, ˜ F ( w T ,m ) − ˜ F ( w ∗ ) ≥ η 3200 √ T − 10 ηǫ 128 T 2 22 ≥ η 3200 √ T − η 10 T 2 ≥ η 6400 √ T . ( T ≥ 30) No w we can finally pro ve Theorem 1. The pro of is an immediate corollary from T h eorem 5 and the lo w er bou n d of Ω min 1 ηT , 1 giv en in Lemma 35 in App en d ix E. It’s imp ortant to highligh t that w e offe r a rigorous p ro of f or a mo dified v ersion of Theorem 1, where the loss function f p ossesses a Lipsc hitz constan t of only 5. By scaling d o wn this loss function by a factor of 1 5 and sim u ltaneously adjusting the step size η by a factor of 5, we can emplo y the same p ro of to establish the v alidit y of Th eorem 1. Pr o of of The or em 1. W e kno w that η ≤ 1 5 √ T . First, by Theorem 5 , we know that for Unpro jected GD and d 1 = 178 nT + 2 n 2 , there exist a distribution D o v er a probabilit y space Z , a constan t C 1 and a loss fun ction ˜ f : R d 1 × Z → R su c h that, with probabilit y of at least 1 6 , ˜ F ( w T ,m ) − ˜ F ( w ∗ ) ≥ C 1 η √ T . Second, b y Lemma 35, we know that for Unp ro jected GD and d 2 = max(25 η 2 T 2 , 1), there exist a constan t C 2 and a deterministic loss function ˜ f OPT : R d 2 → R s uc h that ˜ f OPT ( w T ,m ) − ˜ f OPT ( w ∗ ) ≥ C 2 min 1 , 1 η T No w, let C = 1 2 min ( C 1 , C 2 ). If η ≥ T − 3 4 , then, η √ T ≥ min(1 , 1 ηT ), and we get, ˜ F ( w T ,m ) − ˜ F ( w ∗ ) ≥ C η √ T + min 1 , 1 η T ≥ C min 1 , η √ T + 1 η T . Otherwise, w e get that, ˜ f OPT ( w T ,m ) − ˜ f OPT ( w ∗ ) ≥ C η √ T + min 1 , 1 η T ≥ C min 1 , η √ T + 1 η T . Since in b oth cases, b y Lemma 35 and Th eorem 5, w t ∈ B d for ev ery t ∈ [ T ], the theorem is applicable also for Pr o jected GD. A.2 Pro of of Theorem 2 F ull construction. The hard distribution D SGD is d efined to b e as in Sectio n 5. The h ard loss function is a smo othing of f SGD (Eq. (14)), and is defined as ˜ f SGD ( w, V ) : = E v ∈ B h f SGD ( w + δ v , V ) i , (17) for a sufficien tly small δ > 0 and the d -dimensional unit b all B . Analogously , we denote the empirical loss and the p opulation loss with r esp ect to the loss function ˜ f SGD as b ˜ F SGD ( w ) = 1 n P n i =1 ˜ f SGD ( w, V i ) and ˜ F SGD ( w ) = E V ∼D ˜ f SGD ( w, V ), resp ectiv ely . The loss function ˜ f SGD is differen tiable, 4-Lipsc h itz with resp ect to its fi r st argumen t and con v ex o ver R d , as stated in the follo wing lemma. 23 Lemma 15. F or every V ∈ Z , the loss function ˜ f SGD is differ entiable, c onvex and 4 -Lipschitz with r esp e ct to its first ar gument and over R d . W e first pro v e the follo win g theorem, Theorem 6. A ssume that n > 2048 and η ≤ 1 √ n . Consider the distribution D SGD and the loss function ˜ f SGD with d = 712 n 2 log n + 2 n 2 , ε = 1 n 2 (1 − cos( 2 π | P ( U ) | )) , δ = ηε 32 n 3 and δ 1 = η 8 n 3 . Then, for Unpr oje c te d SGD (cf. Eq. 
(2) with W = R d ) with T = n iter ations, initialize d at w 1 = 0 with step size η , we have, with pr ob ability at le ast 1 2 over the choic e of the tr aining sampl e, (i) Th e iter ates of SGD r emain within the unit b al l, namely w t ∈ B d for al l t = 1 , . . . , n ; (ii) F or al l m = 1 , . . . , n , the m -suffix aver age d iter ate has: b ˜ F SGD ( w n,m ) − b ˜ F SGD ( b w ∗ ) = Ω η √ n . Algorithm’s dynamics. No w , As in GD, the main step in pr o ving Theorem 6 is to show that taking exp ectation of f SGD for ev ery p oin t w in a ball with small enough radius do es n ot c hange the dynamics of S GD . Lemma 16. Under the c onditions of The or ems 4 and 6, let w t , ˜ w t b e the iter ates of Unpr oje cte d SGD with step size η ≤ 1 √ T and w 1 = 0 , on b F SGD and b ˜ F SGD r esp e ctively. Then, if E ′ o c curs, then for every t ∈ [ T ] , it holds that w t = ˜ w t . Pro of of Theorem 6. Next, we set out to establish the p ro of for Theorem 6. Pr o of of The or em 6. Let w n,m b e th e m - suffix a v erage of S GD when is app lied on f SGD and let b w ∗ = arg min w b F SGD ( w ). By Lemma 16, we kno w that, with a probabilit y 1 2 , w n,m = w SGD n,m . Then, b y Theorem 4 and Lemma 32, η 64000 √ n ≤ b F SGD ( w n,m ) − b F SGD ( b w ∗ ) = b F SGD ( w n,m ) − b F SGD ( b w ∗ ) ≤ b ˜ F SGD ( w n,m ) + 4 δ − b ˜ F SGD ( b w ∗ ) + 4 δ ≤ b ˜ F SGD ( w n,m ) + 4 δ − b ˜ F SGD ( b w ∗ ) + 4 δ , and, b ˜ F SGD ( w n,m ) − b ˜ F SGD ( b w ∗ ) ≥ η 64000 √ n − η ǫ 4 n 3 ≥ η 64000 √ n − η 4 n 3 ≥ η 12800 0 √ n. ( n ≥ 40) No w w e can fi n ally p ro ve T heorem 2. 24 Pr o of pr o of of The or em 2. W e kn o w th at T = n and η ≤ 1 5 √ T . First, b y Th eorem 6, we kno w that for Unpro jected S GD and d 1 = 712 n 2 log n + 2 n 2 , t here exist a distribu tion D SGD o ve r a p robabilit y space Z , a constant C 1 and a loss f u nction ˜ f SGD : R d 1 × Z → R su c h that, with probabilit y of at least 1 2 , b ˜ F SGD ( w T ,m ) − b ˜ F SGD ( b w ∗ ) ≥ C 1 η √ T . Second, by Lemma 35, w e kn o w that for Unp ro jected S GD and d 2 = max (25 η 2 T 2 , 1), there exist a constan t C 2 and a deterministic loss function ˜ f OPT : R d 2 → R s uc h that ˜ f OPT ( w T ,m ) − ˜ f OPT ( b w ∗ ) ≥ C 2 min 1 , 1 η T No w, let C = 1 2 min ( C 1 , C 2 ). If η ≥ T − 3 4 , then, η √ T ≥ min(1 , 1 ηT ), and we get, b ˜ F SGD ( w T ,m ) − b ˜ F SGD ( b w ∗ ) ≥ C η √ T + min 1 , 1 η T ≥ C min 1 , η √ T + 1 η T . Otherwise, w e get that, ˜ f OPT ( w T ,m ) − ˜ f OPT ( b w ∗ ) ≥ C η √ T + min 1 , 1 η T ≥ C min 1 , η √ T + 1 η T . Since in b oth cases, b y Lemma 35 and Th eorem 6, w t ∈ B d for ev ery t ∈ [ T ], the theorem is applicable also for Pr o jected S GD. B Pro ofs of Section 4 B.1 Pro ofs for the full construction Pr o of of L emma 1. Let r = 2 − d ′ 178 . F or ev ery 1 ≤ i ≤ r and 1 ≤ j ≤ d ′ w e define the random v ariable u j i b e a rand om v ariable to be 1 √ d ′ with probabilit y 1 2 and − 1 √ d ′ with probabilit y 1 2 . Then, for every 1 ≤ i ≤ r , we define the vec tor u i whic h its j th ent ry is u j i and lo ok at the set U = { u 1 , u 2 , ...u r } . This set will h old the required p rop erty w ith p ositiv e probabilit y . First, f or ev ery i 6 = k , h u i , u k i are sums of d random v ariables that taking v alues in [ − 1 d ′ , 1 d ′ ] with E h u i , u k i = 0. Then by Ho effdin g’s inequalit y , P r ( |h u i , u k i| ≥ 1 8 ) ≤ 2 e − 2 ( 1 8 ) 2 d ′ · 4 d ′ 2 = 2 e − d ′ 128 Then, by union b ound on the r 2 pairs of vec tors in U , P r ( ∃ i, k |h u i , u k i| ≥ 1 8 ) ≤ 2 e − d ′ 128 · r 2 ! 
< 2 e − d ′ 128 · 1 2 r 2 ≤ 1 . Lemma 17. L et n, d ≥ 1 and a set U ⊆ B d . L et P ( U ) b e the p ower set of U . Then, ther e exist a set Ψ ⊆ R 2 n 2 , a numb er 0 < ǫ < 1 n and two mappings φ : P ( U ) × [ n 2 ] → R 2 n 2 , α : R 2 n 2 → U such that, (i) F or every j ∈ [ n 2 ] and V ⊆ U , it hol ds k φ ( V , j ) k ≤ 1 ; 25 (ii) F or eve ry ψ ∈ Ψ , it holds k ψ k ≤ 1 ; (iii) L et V 1 , . . . , V n b e arbitr ary subsets of U . If j 1 , . . . , j n hold that j i 6 = j k for i 6 = k , ψ ∗ = 1 n P n i =1 φ ( V i , j i ) is that, • D ψ ∗ , 1 n n X i =1 φ ( V i , j i ) E > 7 8 n ; • F or e v ery ψ ∈ Ψ , ψ 6 = ψ ∗ : D ψ ∗ , 1 n n X i =1 φ ( V i , j i ) E ≥ D ψ , 1 n n X i =1 φ ( V i , j i ) E + ǫ ; • If S n i =1 V i 6 = U , then it h olds that α ( ψ ∗ ) = v i ∗ ∈ U \ S n i =1 V i for i ∗ = min { i : v i ∈ U \ S n i =1 V i } . Pr o of. First, w e consider an arbitrary enumeratio n of P ( U ) = { V 1 , ...V | P ( U ) | } and defin e g : P ( U ) → R 2 , g ( V i ) = sin 2 π i | P ( U ) | , cos 2 π i | P ( U ) | . Now, w e refer to a ve ctor a ∈ R 2 n 2 as a concatenat ion of n 2 v ectors in R 2 , a (1) , ..., a ( n 2 ) . T hen, w e define δ = 1 − cos 2 π | P ( U ) | , ǫ = δ n 2 and φ ( V , j ) ( i ) = ( g ( V ) i = j 0 otherwise As a result, for ev ery V i , j it h olds that k φ ( V i , j ) k = k g ( V i ) k = s sin 2 π i | P ( U ) | 2 + cos 2 π i | P ( U ) | 2 = 1 Moreo v er, if j 1 6 = j 2 , h φ ( V i , j 1 ) , φ ( V i , j 2 ) i = 0 , and if i > k , h φ ( V i , j ) , φ ( V k , j ) i = h g ( V i ) , g ( V k ) i = sin 2 π i | P ( U ) | sin 2 π k | P ( U ) | + cos 2 π i | P ( U ) | cos 2 π k | P ( U ) | = cos 2 π ( i − k ) | P ( U ) | ≤ cos 2 π | P ( U ) | (cos is monotonic decreasing in [0 , π / 2]) = 1 − δ W e notice that 0 < δ < 1. No w, we consider an arbitrary en u meration of U = { v 1 , ...v | U | } , and define the follo w ing set Ψ ⊆ R 2 n 2 and the follo wing tw o mappings σ : R 2 n 2 → P ( U ) , α : R 2 n 2 → U , Ψ = { 1 n n X i =1 φ ( V i , j i ) : ∀ i V i ⊆ U, j i ∈ [ n 2 ] and i 6 = l = ⇒ j i 6 = j ℓ } 26 Note that, for eve ry ψ ∈ Ψ , k ψ k = k 1 n n X i =1 φ ( V i , j i ) k ≤ 1 n n X i =1 k φ ( V i , j i ) k = 1 . Then, for ev ery a ∈ R 2 n 2 and j ∈ [ n 2 ], we denote the ind ex q ( a, j ) ∈ [ | P ( U ) | ] as q ( a, j ) = arg max r h g ( V r ) , a ( j ) i , and define the follo w ing mapping σ : R 2 n 2 → P ( U ), σ ( a ) = n 2 [ j =1 ,a ( j ) 6 =0 V q ( a,j ) . Moreo v er, for ev ery a ∈ R 2 n 2 , w e denote the index p ( a ) ∈ [ | U | ] as p ( a ) = arg min i { i : v i ∈ U \ σ ( a ) } , and define the follo w ing mapping α : R 2 n 2 → U , α ( a ) = ( v | U | σ ( a ) = U v p ( a ) σ ( a ) 6 = U . No w, Let V 1 , . . . , V n ⊆ U and j 1 , ...j n that are sampled uniformly from [ n 2 ], W e p ro ve the last part of the lemma u nder the condition that j i 6 = j k for i 6 = k . ψ ∗ = 1 n P n i =1 φ ( V i , j i ) holds h ψ ∗ , 1 n n X i =1 φ ( V i , j i ) i = h 1 n n X i =1 φ ( V i , j i ) , 1 n n X i =1 φ ( V i , j i ) i = 1 n 2 n X i =1 h φ ( V i , j i ) , φ ( V i , j i ) i = 1 n > 7 8 n F or ψ = 1 n P n l =1 φ ( V ′ l , j ′ l ) suc h that ψ 6 = ψ ∗ , there are at most n pairs i, l such that h φ ( V ′ i , j ′ i ) , φ ( V ′ l , j ′ l ) i 6 = 0. thus, there exists a p air ( V ′ r , j ′ r ) that ( V ′ r , j ′ r ) / ∈ { ( V i , j i ) : i ∈ [ n ] } . and for ev ery i , h φ ( V i , j i ) , φ ( V ′ l , j ′ l ) i ≤ 1 − δ . 
As a result, h ψ , 1 n n X l =1 φ ( V i , j i ) i = h 1 n n X i =1 φ ( V ′ l , j ′ l ) , 1 n n X i =1 φ ( V i , j i ) i = 1 n 2 n X i =1 n X l =1 h φ ( V i , j i ) , φ ( V ′ l , j ′ l ) i ≤ 1 n 2 1 − δ + n X i =1 ,i 6 = r 1 ≤ 1 n 2 (1 − δ + n − 1) 27 = 1 n − δ n 2 = h ψ ∗ , 1 n n X i =1 φ ( V i , j i ) i − ǫ F urtherm ore, since if all j i are distinct, for ev ery i it holds that, 1 n P n i =1 φ = ( V i , i ) ( j i ) = 1 n g ( V i ), th u s, q 1 n n X i =1 φ ( V i , j i ) , j i ! = arg max r h g ( V r ) , 1 n n X i =1 φ ( V i , j i ) ( j i ) i = arg max r h g ( V r ) , 1 n g ( V i ) i = i, and we get, σ ( ψ ∗ ) = σ 1 n n X i =1 φ ( V i , j i ) ! = n 2 [ j =1 , 1 n P n i =1 φ ( V l ,j i ) ( j ) 6 =0 V q ( 1 n P n i =1 φ ( V i ,j i ) ,j ) = n [ i =1 V q ( 1 n P n i =1 φ ( V i ,j i ) ,j i ) (The indices that are non-zero are { j i } n i =1 } ) = n [ i =1 V i Finally , assuming that S n i =1 V i 6 = U , α ( ψ ∗ ) = v p ( a ) ∈ U \ n [ i =1 V i . Pr o of of L emma 2. W e pro v e that ℓ 1 , ℓ 2 and ℓ 4 are conv ex and 1-Lipschitz and ℓ 3 is con vex and 12-Lipsc hitz. First, b y Lemmas 1 and 17 for ev ery u ∈ U and V ∈ P ( U ), j ∈ [ n 2 ], it holds that k u k = 1, k φ ( V , j ) k = 1. Then, ℓ 2 is a 1-Lipsc hitz linea r fun ction, and ℓ 4 is a maximum o ver 1-Lipsc hitz linear functions, th us, b oth fu n ctions are con vex and 1-Lipsc hitz. Moreo v er, for ev ery p ossible ψ ∈ Ψ k ψ k = k 1 n n X l =1 φ ( V l ) k ≤ 1 n n X l =1 k φ ( V l ) k = 1 . th u s, ℓ 3 is a maxim u m o v er 2-Lipschitz linear functions, th u s, it is con ve x and 2-Lipsc hitz. No w, for ℓ 1 , for ev ery set V ⊆ U , let α V ( w ) ∈ R T − 1 to b e the ve ctor wh ich it s k ’th co ord inate is α V ( w ) ( k ) = max 3 η 32 , max u ∈ V h uw ( k +1) i and prov e con v exit y and 1-Lipsh itzness. F or establishing con vexi t y , observe v u u t T X k =2 max 3 η 32 , max u ∈ V h u, ( λx + (1 − λ ) y ) ( k ) i 2 28 = v u u t T X k =2 max 3 η 32 , max u ∈ V λ h u, x ( k ) i + (1 − λ ) h u, y ( k ) i 2 ≤ v u u t T X k =2 max 3 η 32 , max u ∈ V λ h u, x ( k ) i + max u ∈ V (1 − λ ) h u, y ( k ) i 2 (con ve xit y of m ax & monotonicit y of square ro ot) ≤ v u u t T X k =2 λ max 3 η 32 , max u ∈ V h u, x ( k ) i + (1 − λ ) max 3 η 32 , max u ∈ V h u, y ( k ) i 2 = k λα V ( x ) + (1 − λ ) α V ( y ) k 2 ≤ λ k α V ( x ) k 2 + (1 − λ ) α V ( y ) k 2 (con ve xit y of ℓ 2 norm) = λ v u u t T X k =2 max 3 η 32 , max u ∈ V h ux ( k ) i 2 + (1 − λ ) v u u t T X k =2 max 3 η 32 , max u ∈ V h uy ( k ) i 2 . F or 1-Lipsc hitzness, for eve ry w ∈ R d and sub-gradien t g ( w , V ) ∈ ∂ ℓ 1 ( w, V ), there exists a sub gradien t h ( w, V ) ∈ ∂ P T k =2 max 3 η 32 , max u ∈ V h uw ( k ) i 2 suc h that k g ( w, V ) k = k h ( w, V ) k 2 r P T k =2 max 3 η 32 , max u ∈ V h uw ( k ) i 2 = k h ( w, V ) k 2 q P T k =2 α V ( w ) ( k ) 2 . Moreo v er, for ev ery k and sub g radien t b k ,V ( w ) ∈ ∂ α V ( w ) ( k ) w e denote r k ,V ( w ) ∈ R d the v ector with r k ,V ( w ) ( k ) = b k ,V ( w ) and for j 6 = k , r k ,V ( w ) ( j ) = 0. Then, for ev ery sub gradien t h ( w, V ) ∈ ∂ P T k =2 max 3 η 32 , max u ∈ V h uw ( k ) i 2 , there exi sts T − 1 suc h v ectors r k ,V ( w ) ∈ R d (2 ≤ k ≤ T ) suc h that, h ( w, V ) = 2 T X k =2 r k ,V ( w ) α V ( w ) ( k ) As a result, b y the fact that every sub gradient of b k ,V ( w ) ∈ ∂ α V ( w ) ( k ) is either 0 or λ 1 u 1 + λ 2 u 2 + . . . 
+ λ p u p for P i λ i ≤ 1, suc h that for all ev ery j, k , u j ∈ U and α V ( w ) ( k ) = h u j , w ( ) k i , com bining by the facts that for distinct k , k ′ , r k ,V , r k ′ ,V are orthogonal, it h olds, for u 2 j , . . . u T j ∈ U suc h that for eve ry k , α V ( w ) ( k ) = h u k j , w ( k ) i , k h ( w, V ) k = k 2 T X k =2 r k ,V ( w ) α V ( w ) ( k ) ik = k 2 T X k =2 ,b k,V r k ,V ( w ) α V ( w ) ( k ) ik = 2 k T X k =2 r k ,V h u k j , w ( k ) ik . 29 No w, w e denote c k j ( w ) ∈ R d the ve ctor with c k j ( w ) ( k ) = u k j and f or j 6 = k , c k j ( w ) ( j ) = 0, and, k h ( w, V ) k = 2 k T X k =2 r k ,V h c k j , w ik ≤ 2 v u u t h T X k =2 r k ,V h c k j , w i , T X l =2 r l,V h c l j , w ii = 2 v u u t T X k =2 k r k ,V k 2 h c k j , w i 2 ≤ 2 v u u t T X k =2 h u k j , w ( k ) i 2 = 2 v u u t T X k =2 α V ( w ) ( k ) 2 . The lemma follo ws . B.2 Pro of of algorithm ’s dyna mics In this section we describ e the dynamics of GD wh en applied on b F fo r training set S that is sampled from a distribution D . W e b egin with showing that the go o d eve n t E (Eq. (12)) occurs with a constan t probabilit y . Pr o of of L e mma 3. By the fact that ev ery V i and j i are indep enden t, it is enough to show that Pr n [ i =1 V i 6 = U d ′ ! ≥ 1 2 , and, Pr (for every k 6 = l , j k 6 = j l ) ≥ 1 3 . F or the former, for every u ∈ U d ′ , since V i are sampled indep enden tly and ev ery ve ctor u ∈ U d ′ is in ev ery V i with probabilit y 1 2 , Pr u ∈ n [ i =1 V i ! = 1 − Pr u / ∈ n [ i =1 V i ! = 1 − 2 − n , th u s, since by Lemma 1, | U d ′ | ≥ d ′ 178 = n , it h olds that, Pr n [ i =1 V i = U d ′ ! = Pr ∀ u ∈ U d ′ u ∈ n [ i =1 V i ! = 1 − 2 − n | U d ′ | ≤ 1 − 2 − n 2 d ′ 178 = 1 − 2 − n 2 n ≤ 1 e 30 < 1 2 . W e conclude, Pr n [ i =1 V i 6 = U d ′ ! ≥ 1 2 . F or th e latter, since all j i s are sampled ind ep endently , for a single pair k 6 = l , it holds that Pr( j k 6 = j l ) = 1 − 1 n 2 As a result, Pr (for ev ery k 6 = l , j k 6 = j l ) = 1 − 1 n 2 n ( n − 1) 2 ≥ 1 − 1 n 2 n 2 2 ≥ 1 √ 2 e ≥ 1 e . F rom now on, w e analyze the dyn amics of the GD conditioned on E (Eq. (12)). W e b egin with sev eral lemmas. Pr o of of L emma 5. F or the first part, we kno w that, for every 2 ≤ k ≤ T , w ( k ) = cη u 0 for c ≤ 1 2 . In addition, by the facts that u 0 ∈ U \ S n i =1 V i and that for eve ry u 6 = v ∈ U , it holds that h u, v i ≤ 1 8 , w e get for ev ery i , max u ∈ V i h u 0 , u i ≤ 1 8 , th u s, for eve ry i and k ≥ 2, max u ∈ V i uw ( k ) = max u ∈ V i h u, cη u 0 i ≤ 1 8 · cη ≤ η 16 . F or the second part, for ev ery sub -gradien t g ( w , V i ) ∈ ∂ ℓ 1 ( w, V i ), there exists a sub gradien t h ( w, V i ) ∈ ∂ P T k =2 max 3 η 32 , max u ∈ V h uw ( k ) i 2 suc h that g ( w, V i ) = h ( w, V i ) 2 r P T k =2 max 3 η 32 , max u ∈ V i h uw ( k ) i 2 . Then, since for ev ery k , it holds that max u ∈ U h w ( k ) , u 0 i ≤ η 16 , ev ery suc h sub -gradien t h ( w, V i ) is zero, ∇ ℓ 1 ( w, V i ) = 0. Pr o of of L emma 6. 
First, for the first part, by Lemma 17, the fact that for ev ery ψ , k α ( ψ ) k ≤ 1, and by k w (1) k ≤ η , for eve ry ψ ∈ Ψ, ψ ∗ = 1 n P n i =1 φ ( V i , j i ) holds, h w (0) , ψ ∗ i − 1 4 ǫ T 2 h α ( ψ ∗ ) , w (1) i ≥ h η n n X i =1 φ ( V i , j i ) , ψ ∗ i − η ǫ 4 ≥ η h 1 n n X i =1 φ ( V i , j i ) , ψ ∗ i − η ǫ 4 ≥ η h 1 n n X i =1 φ ( V i , j i ) , ψ i + η ǫ − η ǫ 4 (Lemma 17) = η h 1 n n X i =1 φ ( V i , j i ) , ψ i + η ǫ 2 + η ǫ 4 > h w (0) , ψ i − 1 4 ǫ T 2 h α ( ψ ) , w (1) i + η ǫ 4 , 31 th u s, arg max ψ ∈ Ψ h w (0) , ψ i − 1 4 ǫ T 2 h α ( ψ ) , w (1) i = ψ ∗ = 1 n n X i =1 φ ( V i , j i ) . F or th e second part, by the fact that ǫ < 1 n and Lemma 17, h w (0) , ψ ∗ i − 1 4 ǫ T 2 h α ( ψ ∗ ) , w (1) i ≥ 7 η 8 n − η 4 n > η 2 n + η 16 n = δ 1 + η 16 n . No w, b y E , f or u 0 = α ( ψ ∗ ), whic h is the u w ith th e minimal index in U \ S n i =1 V i , α ( ψ ∗ ) = u 0 ∈ U \ n [ i =1 V i . As a result, by the fact that the maximum is attained uniqu ely at ψ ∗ , we derive that, ∇ ℓ 3 ( w ) ( k ) = 1 n P n i =1 φ ( V i , j i ) k = 0 − 1 4 ǫ T 2 u 0 k = 1 0 otherwise . Pr o of of L emma 7. W e sho w that the maximum is attained uniqu ely at k = m and u = u 0 . F or k = 1 and ev ery u ∈ U , 3 8 h u, w t (1) i − 1 2 h u, w t (2) i = 3 8 h u, cη u 0 i − 1 2 h u, η 8 u 0 i ≤ 9 η 512 + η 128 = 13 η 512 . Moreo v er, for ev ery 2 ≤ k ≤ m − 2 and every u ∈ U , 3 8 h u, w ( k ) i − 1 2 h u, w ( k +1) i = 3 8 h u, η 8 u 0 i − 1 2 h u, η 8 u 0 i ≤ 3 η 64 + η 128 = 7 η 128 . F or k = m − 1 and ev ery u ∈ U , 3 8 h u, w ( m − 1) i − 1 2 h u, w ( m ) i = 3 8 h u, η 8 u 0 i − 1 2 h u, η 2 u 0 i ≤ 3 η 64 + η 32 = 5 η 64 . F or k = m and u = u 0 , 3 8 h u, w ( m ) i − 1 2 h u, w ( m +1) i = 3 8 h u 0 , η 2 u 0 i − 1 2 h u 0 , 0 i = 3 η 16 . F or k = m and u 6 = u 0 , 3 8 h u, w ( m ) i − 1 2 h u ′ , w ( m +1) i = 3 8 h u, η 2 u 0 i − 1 2 h u ′ , 0 i ≤ 3 η 128 . F or every m + 1 ≤ k < T − 1 and ev ery u ∈ U , 3 8 h u, w ( k ) i − 1 2 h u ′ , w ( k +1) i = 0 . Moreo v er, since T ≥ 4 , η < 1 , ǫ < 1, δ 1 ≤ 3 η 1024 , and 3 8 h u, w ( m ) i − 1 2 h u, w ( m +1) i = 3 η 16 > δ 2 + η 64 . 32 W e deriv e that, ∇ ℓ 4 ( w ) ( k ) = 3 8 u 0 k = m − 1 2 u 0 k = m + 1 0 otherwise . Lemma 18. Under the c onditions of The or e m 3, if E o c curs and w t is the iter ate of Unpr oje cte d GD on b F , with step size η ≤ 1 √ T and w 1 = 0 , then, for t = 2 it holds tha t, w 2 ( k ) = ( η n P n i =1 φ ( V i , j i ) k = 0 0 otherwise . Pr o of. F or t = 1, w 1 = 0. By Lemma 5 w e kn ow that for ev ery i , ∇ ℓ 1 ( w 1 , V i ) = 0. Moreo ver, b y the fact that δ 1 , δ 2 > 0 the maxim um in ℓ 3 and ℓ 4 is attai ned in δ 1 and δ 2 , resp ectiv ely , th us w e get that ∇ ℓ 3 ( w 1 ) = ∇ ℓ 4 ( w ) = 0 As a result, ∇ b F ( w 1 ) ( k ) = 1 n n X i =1 ∇ ℓ 2 ( w 1 , ( V i , j i )) ( k ) = ( − 1 n P n i =1 ( V i , j i ) k = 0 0 otherwise , and, w 2 ( k ) = ( η n P n i =1 φ ( V i , j i ) k = 0 0 otherwise . Lemma 19. Under the c onditions of The or em 3, if E o c c u rs and w t is the iter ate U npr oje cte d GD on b F , with step size η ≤ 1 √ T and w 1 = 0 , then, for t = 3 it holds tha t, w 3 ( k ) = η 4 ǫ T 2 u 0 k = 1 0 2 ≤ k ≤ T η n P n i =1 φ ( V i , j i ) k = 0 , wher e u 0 ∈ U \ S n i =1 V i . Pr o of. By L emm a 18, w 2 (1) , ...w 2 ( T ) = 0, th us, by Lemma 5, w e kno w th at for ev ery i , ∇ ℓ 1 ( w 1 , V i ) = 0. Moreo ver, b y the fact that δ 2 > 0, w e get th at ∇ ℓ 4 ( w 2 ) = 0. 
F or ℓ 3 ( w 2 ), by Lemma 6, u sing the fact that w 2 (1) = 0 and w 2 (0) = η n P n i =1 φ ( V i , j i ), we get that ∇ ℓ 3 ( w 2 ) ( k ) = 1 n P n i =1 φ ( V i , j i ) k = 0 − 1 4 ǫ T 2 u 0 k = 1 0 otherwise . 33 F or ℓ 2 ( w 2 ), for every i , the gradien t is ∇ ℓ 2 ( w 2 , ( V i , j i )) ( k ) = ( − φ ( V i , j i ) k = 0 0 otherwise . Com b ining all together, we conclude that, for u 0 ∈ U \ S n i =1 V i , it holds that, ∇ b F ( w 2 ) ( k ) = − 1 4 ǫ T 2 u 0 k = 1 0 2 ≤ k ≤ T 0 k = 0 , and w 3 ( k ) = η 4 ǫ T 2 u 0 k = 1 0 2 ≤ k ≤ T η n P n i =1 φ ( V i , j i ) k = 0 . Lemma 20. Under the c onditions of The or em 3, if E o c c u rs and w t is the iter ate U npr oje cte d GD on b F , with step size η ≤ 1 √ T and w 1 = 0 , then, for t = 4 it holds tha t, w 4 ( k ) = − 3 η 8 u 0 + η 2 ǫ T 2 u 0 k = 1 η 2 u 0 k = 2 0 3 ≤ k ≤ T η n P n i =1 φ ( V i , j i ) k = 0 , wher e u 0 ∈ U \ S n i =1 V i . Pr o of. W e start with ℓ 1 , ℓ 2 , ℓ 3 . F or ℓ 1 , by Lemma 19, for eve ry 2 ≤ k ≤ T , w 3 ( k ) = 0, th us, by Lemma 5, w e kno w that f or ev ery i , ∇ ℓ 1 ( w 1 , V i ) = 0. F or ℓ 2 , w e kno w that, for every i , ∇ ℓ 2 ( w 3 , ( V i , j i )) ( k ) = ( − φ ( V i , j i ) k = 0 0 otherwise . F or ℓ 3 , by Lemma 6, using the fact that w 3 (1) = cη u 0 for | c | ≤ 1 and w 3 (0) = η n P n i =1 φ ( V i , j i ), w e get that, ∇ ℓ 3 ( w 3 ) ( k ) = 1 n P n i =1 φ ( V i , j i ) k = 0 − 1 4 ǫ T 2 u 0 k = 1 0 otherwise . No w, for ℓ 4 , we show th at the maximum is attained u niquely in k = 1 and u = u 0 : F or k 6 = 1, for ev ery u ∈ U 3 8 h u, w 3 ( k ) i − 1 2 h u, w 3 ( k +1) i = 0 . F or k = 1 and u 6 = u 0 , 3 8 h u, w 3 ( k ) i − 1 2 h u, w 3 ( k +1) i = 3 8 h u, w 3 (1) i − 1 2 h u, w 3 (2) i 34 = 3 8 h u, η 4 ǫ T 2 u 0 i ≤ 3 η 256 ǫ T 2 F or k = 1 and u = u 0 , 3 8 h u, w 3 ( k ) i − 1 2 h u, w 3 ( k +1) i = 3 8 h u 0 , w 3 (1) i − 1 2 h u 0 , w 3 (2) i = 3 8 h u 0 , η 4 ǫ T 2 u 0 i = 3 η 32 ǫ T 2 > δ 2 W e deriv e that, ∇ ℓ 4 ( w 3 ) ( k ) = 3 8 u 0 k = 1 − 1 2 u 0 k = 2 0 3 ≤ k ≤ T 0 k = 0 . Com b ining all together, we get that, ∇ b F ( w 3 ) ( k ) = 3 8 u 0 − 1 4 ǫ T 2 u 0 k = 1 − 1 2 u 0 k = 2 0 3 ≤ k ≤ T 0 k = 0 , and w 4 ( k ) = − 3 η 8 u 0 + η 2 ǫ T 2 u 0 k = 1 η 2 u 0 k = 2 0 3 ≤ s ≤ T η n P n i =1 φ ( V i , j i ) k = 0 , where u 0 ∈ U \ S n i =1 V i . Lemma 21. Under the c onditions of The or em 3, if E o c c u rs and w t is the iter ate U npr oje cte d GD on b F , with step size η ≤ 1 √ T and w 1 = 0 , then, for t = 5 it holds tha t, w 5 ( k ) = − 3 8 η u 0 + 3 η 4 ǫ T 2 u 0 k = 1 1 8 η u 0 k = 2 1 2 η u 0 k = 3 0 4 ≤ s ≤ T 1 n P n i =1 φ ( V i , j i ) k = 0 , wher e u 0 ∈ U \ S n i =1 V i . 35 Pr o of. W e b egin with ℓ 1 , ℓ 2 , ℓ 3 . Note that, b y Lemma 20, for ev ery 2 ≤ k ≤ T , w 4 ( k ) = cη u 0 for c ≤ 1 2 , th u s, by Lemma 5, for every i , ∇ ℓ 1 ( w 4 , V i ) = 0. F or ℓ 2 , we know that, for every i , ∇ ℓ 2 ( w 4 , ( V i , j i )) ( k ) = ( − φ ( V i , j i ) k = 0 0 otherwise . F or ℓ 3 , b y Lemma 6, using Lemma 20, wh ere we sho w ed that w 4 (1) = cη u 0 for | c | ≤ 1 and w 4 (0) = η n P n i =1 φ ( V i , j i ), we get that, ∇ ℓ 3 ( w 4 ) ( k ) = 1 n P n i =1 φ ( V i , j i ) k = 0 − 1 4 ǫ T 2 u 0 k = 1 0 otherwise . It is left to calculate ∇ ℓ 4 ( w 4 ). W e sho w that the maxim um is attained uniquely at k = 2 and u = u 0 . First, 3 8 h u, η 2 ǫ T 2 u 0 i = 3 η 16 ǫ T 2 h u, u 0 i ≤ 3 η 16 T 2 , th u s, since T ≥ 4, 3 8 h u, w 4 (1) i − 1 2 h u, w 4 (2) i = 3 8 h u, − 3 η 8 u 0 + η 2 ǫ T 2 u 0 i − 1 2 h u, η 2 u 0 i ≤ 9 η 512 + η 32 + 9 η 256 = 43 η 512 < 3 η 16 . 
F or k = 2 and u = u 0 , 3 8 h u, w 4 (2) i − 1 2 h u, w 4 (3) i = 3 8 h u 0 , η 2 u 0 i − 1 2 h u 0 , 0 i = 3 η 16 ( > δ 2 ) . F or k = 2 and u 6 = u t − 2 , 3 8 h u, w 4 (2) i − 1 2 h u ′ , w 3 (3) i = 3 8 h u, η 2 u 0 i − 1 2 h u, 0 i ≤ 3 η 128 . F or every 3 ≤ k ≤ T − 1 , 3 8 h u, w 4 ( k ) i − 1 2 h u ′ , w 4 ( k +1) i = 0 . W e deriv e that, ∇ ℓ 4 ( w 4 ) ( k ) = 3 8 u 0 k = 2 − 1 2 u 0 k = 3 0 3 ≤ k ≤ T 0 k = 0 . Com b ining all together, we get that, ∇ b F ( w 4 ) ( k ) = − 1 4 ǫ T 2 u 0 k = 1 3 8 u 0 k = 2 − 1 2 u 0 k = 3 0 4 ≤ k ≤ T 0 k = 0 36 and w 5 ( k ) = − 3 8 η u 0 + 3 η 4 ǫ T 2 u 0 k = 1 1 8 η u 0 k = 2 1 2 η u 0 k = 3 0 4 ≤ s ≤ T 1 n P n i =1 φ ( V i , j i ) k = 0 , where u 0 ∈ U \ S n i =1 V i . Lemma 22. A ssume the c onditions of The or em 3, and c onsider the iter ates of unpr oje cte d GD on b F , with step size η ≤ 1 / √ T initial ize d at w 1 = 0 . Under the e vent E , we have for al l t ∈ [ T ] that k w t k ≤ 1 . Pr o of. I f E holds, b y Lemmas 4 and 18 to 20, w e kn o w that for ev ery t ≥ 2, k w t (1) k ≤ η 2 , k w t ( t − 1) k ≤ η 2 and for ev ery k ∈ { 2 , . . . , t − 2 } , k w t ( t − 1) k ≤ η 8 . As a result, k w t k 2 ≤ d X i =1 w t [ i ] 2 ≤ T X k =0 k w t ( k ) k 2 < 2 · η 2 2 + ( T − 3) η 8 2 + η n n X i =1 φ ( V i , j i ) 2 ≤ η 2 2 + η 2 ( T − 3) 64 + η 2 ≤ 1 64 + 3 2 T ( η ≤ 1 √ T ) ≤ 1 ( T ≥ 2) B.3 Pro of of Theorem 3 Pr o of of The or em 3. By Lemma 3, with probabilit y of at least 1 6 , E o ccurs and b y Lemma 4, it holds for ev ery 2 ≤ k ≤ T − 3 that, w T ,m ( k ) = 1 m m X i =1 w T − i +1 ( k ) = ( η 8 u 0 k ≤ T − m − 2 1 m η 2 + η 8 ( T − k − 2) u 0 k ≥ T − m − 1 (18) = ( η 8 u 0 k ≤ T − m − 2 η ( T − k +2) 8 m u 0 k ≥ T − m − 1 Then, w e denote α V ∈ R T − 4 the v ector which its k th entry is max 3 η 32 , max u ∈ V h u, w T ,m ( k +1) i . By the fact that eve ry vec tor u ∈ U is in V with pr obabilit y 1 2 , the follo wing holds, E V v u u t T X k =2 max 3 η 32 , max u ∈ V h u, w T ,m ( k ) i 2 ≥ E V v u u t T − 3 X k =2 max 3 η 32 , max u ∈ V h u, w T ,m ( k ) i 2 37 = E V v u u t T − 4 X k =1 max 3 η 32 , max u ∈ V h u, w T ,m ( k +1) i 2 = E V k α V k ≥ k E V α V k = v u u t T − 3 X k =2 E V max 3 η 32 , max u ∈ V h u, w T ,m ( k ) i 2 Then, by Eq. 
(18), E V v u u t T X k =2 max 3 η 32 , max u ∈ V h u, w T ,m ( k ) i 2 ≥ v u u t T − m − 2 X k =2 E V max 3 η 32 , max u ∈ V h u, w T ,m ( k ) i 2 + T − 3 X k = T − m − 1 E V max 3 η 32 , max u ∈ V h u, w T ,m ( k ) i 2 ≥ v u u t T − m − 2 X k =2 E V max 3 η 32 , max u ∈ V h u, η 8 u 0 i 2 + T − 3 X k = T − m − 1 E V max 3 η 32 , max u ∈ V h u, η ( T − k + 2) 8 m u 0 i 2 = η 8 v u u t T − m − 2 X k =2 E V max 3 4 , max u ∈ V h u, u 0 i 2 + T − 3 X k = T − m − 1 E V max 3 4 , T − k + 2 m max u ∈ V h u, u 0 i 2 ≥ η 8 v u u t T − m − 2 X k =2 E V max 3 4 , max u ∈ V h u, u 0 i 2 + T − 3 X k = T − m − 1 E V max 3 4 , T − k + 2 T max u ∈ V h u, u 0 i 2 = η 8 v u u t T − m − 2 X k =2 E V max 3 4 , max u ∈ V h u, u 0 i 2 + m − 1 X k =1 E V max 3 4 , k + 4 T max u ∈ V h u, u 0 i 2 No w, treating eac h of the term separately , with probabilit y 1 2 on V , max u ∈ V h u, u 0 i ≤ 1 8 (otherwise it is 1), thus, E V max 3 4 , max u ∈ V h u, u 0 i = 1 2 · 3 4 + 1 2 · 1 = 7 8 Moreo v er, if k ≤ 3 T 4 − 4 E V max 3 4 , k + 4 T max u ∈ V h u, u 0 i = 3 4 , otherwise, E V max 3 4 , k + 4 T max u ∈ V h u, u 0 i ≥ 1 2 max 3 4 , k + 4 T + 1 2 · 3 4 ≥ 3 8 + k + 4 2 T Then, we get, if m ≥ T − 3, (note that it implies l − 1 ≥ 3 T 4 − 4), E V v u u t T X k =2 max 3 η 32 , max u ∈ V h u, w T ,m ( k ) i 2 38 ≥ η 8 v u u t m − 1 X k =1 E V max 3 4 , k + 4 T max u ∈ V h u, u 0 i 2 ≥ η 8 v u u u t X k :1 ≤ k ≤ 3 T 4 − 4 9 16 + X k : 3 T 4 − 4 k , h φ ( V i , j ) , φ ( V k , j ) i = h g ( V i ) , g ( V k ) i = sin 2 π i | P ( U ) | sin 2 π k | P ( U ) | + cos 2 π i | P ( U ) | cos 2 π k | P ( U ) | = cos 2 π ( i − k ) | P ( U ) | ≤ cos 2 π | P ( U ) | (cos is monotonic decreasing in [0 , π / 2]) = 1 − δ W e notice that 0 < δ < 1. No w, w e consider an arbitrary enumeration of U = { v 1 , ...v | U | } , and define the follo wing sets Ψ 1 , . . . Ψ n ⊆ R 2 n and th e follo wing t wo mappings σ : R 2 n → P ( U ) , α : R 2 n → U , Ψ k = { 1 n k X i =1 φ ( V i , i ) : ∀ i V i ⊆ U } Note that, for eve ry ψ ∈ Ψ , k ψ k = k 1 n k X i =1 φ ( V i , j i ) k ≤ 1 n k X i =1 k φ ( V i , j i ) k ≤ 1 . Then, for ev ery a ∈ R 2 n and j ∈ [ n ], w e denote the index q ( a, j ) ∈ [ | P ( U ) | ] as q ( a, j ) = arg max r h g ( V r ) , a ( j ) i , and define the follo w ing mapping σ : R 2 n → P ( U ), σ ( a ) = n \ j =1 ,a ( j ) 6 =0 V q ( a,j ) . Moreo v er, for ev ery a ∈ R 2 n , we denote the in d ex p ( a ) ∈ [ | U | ] as p ( a ) = arg min i { i : v i ∈ σ ( a ) } , and define the follo w ing mapping α : R 2 n 2 → U , α ( a ) = ( v | U | σ ( a ) = ∅ v p ( a ) σ ( a ) 6 = ∅ . 41 Note that for eve ry a ∈ R 2 n , α ( a ) ∈ U , th u s, k α ( a ) k ≤ 1. No w, Let V 1 , . . . , V n ⊆ U , k ∈ [ n ] and ψ ∗ k = 1 n P k i =1 φ ( V i , i ). Then, h ψ ∗ , 1 n k X i =1 φ ( V i , i ) i = h 1 n k X i =1 φ ( V i , i ) , 1 n k X i =1 φ ( V i , i ) i = 1 n 2 k X i =1 h φ ( V i , i ) , φ ( V i , i ) i = k n 2 F or ψ = 1 n P k i =1 φ ( V ′ i , i ) suc h that ψ 6 = ψ ∗ , there exists a in dex r su ch that V ′ r 6 = V r ,th us , h ψ , 1 n k X l =1 φ ( V i , i ) i = h 1 n k X i =1 φ ( V ′ i , i ) , 1 n k X i =1 φ ( V i , i ) i = 1 n 2 k X i =1 h φ ( V i , i ) , φ ( V ′ i , i ) i ≤ 1 n 2 1 − δ + k X i =1 ,i 6 = r 1 ≤ 1 n 2 (1 − δ + k − 1) = k n 2 − δ n 2 = h ψ ∗ k , 1 n n X i =1 φ ( V i , j i ) i − ǫ F urtherm ore, it holds that, 1 n P n i =1 φ ( V i , i ) ( i ) = 1 n g ( V i ), thus, q 1 n k X i =1 φ ( V i , i ) , i ! 
= arg max r h g ( V r ) , 1 n k X i =1 φ ( V i , i ) ( i ) i = arg max r h g ( V r ) , 1 n g ( V i ) i = i, th u s, w e get, σ ( ψ ∗ ) = σ 1 n k X i =1 φ ( V i , i ) ! = n \ j =1 , 1 n P k i =1 φ ( V i ,i ) ( j ) 6 =0 V q ( 1 n P k i =1 φ ( V i ,i ) ( j ) ,j ) = k \ j =1 V q ( 1 n P k i =1 φ ( V i ,i ) ,j ) (The indices that are non-zero are j = 1 , . . . , k ) = k \ i =1 V i 42 Then, assuming that T k i =1 V i 6 = ∅ , and let and m = arg min i { i : v i ∈ T k i =1 V i } , p ( a ) = m and, α ( ψ ∗ ) = v m ∈ k \ i =1 V i . Pr o of of L emma 8. First, ℓ SGD 1 is conv ex and 1-Lipschit z by the fact that ℓ SGD 1 = ℓ 1 and Lemma 2. Moreo v er, by Lemma 23, ℓ SGD 2 is a maxim um o ve r 1-Lipsc hitz linear f u nctions, t h us, ℓ SGD 2 is conv ex and 1-Lipsc h itz. Finally , ℓ SGD 3 is a summation of t w o 1-Lipsc hitz linear functions, th us, ℓ SGD 3 is con vex and 2-Lipsc h itz. Com bining all together, we get the lemma. C.2 Pro of of algorithm’s dynamics In this section w e describ e the dynamics of S GD. W e b egin with sho wing th at the goo d ev en t E ′ (Eq. (15)) o ccurs with a constan t p r obabilit y . Pr o of of L e mma 9. First, b y union b ound, Pr ( ∀ t ∈ [ n ] P t 6 = ∅ and J t ∈ S t ) ≥ 1 2 = 1 − P r ( ∃ t P t = ∅ or J t / ∈ S t ) ≥ 1 − n X t =1 P r ( P t = ∅ or ( P t 6 = ∅ and J t / ∈ S t )) ≥ 1 − n X t =1 Pr ( P t = ∅ ) − n X t =1 Pr ( P t 6 = ∅ and J t / ∈ S t ) = 1 − n X t =1 Pr ( P t = ∅ ) − n X t =1 Pr ( P t 6 = ∅ ) P r ( J t / ∈ S t | P t 6 = ∅ ) . No w, for ev ery v l ∈ U , Pr( v l / ∈ t − 1 \ i =1 V i ) = 1 − Pr ( v l / ∈ t − 1 \ i =1 V i ) = 1 − δ t − 1 , and, Pr ( v l / ∈ S t ) = 1 − (1 − δ ) n − t +1 ≤ 1 − (1 − δ ) n . Then, Pr ( P t = ∅ ) = Pr t \ i =1 V i = ∅ ! = Pr( ∀ v l ∈ U w / ∈ t \ i =1 V i ) = (1 − δ t − 1 ) | U | ≤ (1 − δ n ) | U | . Moreo v er, by the fact that for ev ery t , P t is indep end en t of V t +1 , ...V n , P r ( P t 6 = ∅ ) P r ( J t / ∈ S t | P t 6 = ∅ ) = X l : v l ∈ U Pr ( P t 6 = ∅ ) Pr ( v l / ∈ S t | P t 6 = ∅ ) P r( J t = v l ) 43 = X l : v l ∈ U Pr ( P t 6 = ∅ ) Pr ( v l / ∈ S t ) Pr( J t = v l ) ≤ 1 − (1 − δ ) n Com b ining all of the ab o ve , we get that, Pr( ∀ t ∈ [ n ] P t 6 = ∅ and J t ∈ S t ) = 1 − n X t =1 Pr ( P t = ∅ ) − n X t =1 Pr ( P t 6 = ∅ ) P r ( J t / ∈ S t | P t 6 = ∅ ) ≥ 1 − n (1 − δ n ) | U | − n (1 − (1 − δ ) n ) . F or δ = 1 4 n 2 , b y the fact that | U | ≥ 2 d ′ 178 = 2 4 n log( n ) = n 4 n , | U | δ n ≥ n 4 n n − 2 n 4 − n ≥ n 2 n 4 − n ≥ log (4 n ) Pr ( ∀ t ∈ [ n ] P t 6 = ∅ and J t ∈ S t ) ≥ 1 2 ≥ 1 − n (1 − δ n 500 ) | U | − n (1 − (1 − δ ) n ) ≥ 1 − ne −| U | δ n 500 − n (1 − (1 − nδ )) ≥ 1 − ne − l og(4 n ) − n 2 δ ≥ 1 − 1 4 − 1 4 = 1 2 . Pr o of of L e mma 11. First, by the fact that for ev ery t ≤ k ≤ T , w ( k ) = 0, for ev ery such k , max u ∈ V t h u, w ( k ) i = 0 < 3 η 32 , F or 2 ≤ k ≤ t − 1, w ( k ) = cη u k , where c ≤ 1 2 and ev ery u k ∈ T T i = k V i ⊆ V t , thus, max u ∈ V t h u, w ( k ) i ≤ η 2 · 1 8 < 3 η 32 . W e deriv e that ∇ ℓ SGD 1 ( w t , V t ) = 0. Pr o of of L emma 12. First, we show that the maximum of ℓ SGD 2 ( w, V ) is attained w ith k = m and u = u m . F or k ≥ m + 1, for ev ery u ∈ U an d ψ ∈ Ψ k , 3 8 h u, w ( k ) i − 1 2 h α ( ψ ) , w ( k +1) i + h w (0 ,k ) , 1 4 n ψ i − h w (0 ,k + 1) , 1 4 n ψ i + h w (0 ,k + 1) , − 1 4 n 2 φ ( V , k + 1) i = 0 . 
F or k = 1, for ev ery u ∈ U and ψ ∈ Ψ 1 , by Lemma 23, we kn o w that for ev ery ψ, V , j , k ψ k , k φ ( V , j ) k ≤ 1 , and α ( ψ ) ∈ U , th u s, 3 8 h u, w ( k ) i − 1 2 h α ( ψ ) , w ( k +1) i + h w (0 ,k ) , 1 4 n ψ i − h w (0 ,k + 1) , 1 4 n ψ i + h w (0 ,k + 1) , − 1 4 n 2 φ ( V , k + 1) i = 3 c 8 h u 1 , u i − η 16 h u 2 , α ( ψ ) i + h w (0 ,k ) , 1 4 n ψ i − 0 + 0 44 ≤ 9 η 512 + η 128 + η 4 n < η 8 . ( n ≥ 4) F or 2 ≤ k ≤ m − 2, for ev ery u ∈ U and ψ ∈ Ψ k , by Lemma 23, w e kno w that for ev ery ψ , V , j , k ψ k , k φ ( V , j ) k ≤ 1 , and α ( ψ ) ∈ U , th u s, 3 8 h u, w ( k ) i − 1 2 h α ( ψ ) , w ( k +1) i + h w (0 ,k ) , 1 4 n ψ i − h w (0 ,k + 1) , 1 4 n ψ i + h w (0 ,k + 1) , − 1 4 n 2 φ ( V , k + 1) i = 3 64 h u k , u i − η 16 h u k +1 , α ( ψ ) i + 0 − 0 + 0 ≤ 3 η 64 + η 16 < η 8 . F or k = m − 1, f or ev ery u ∈ U and ψ ∈ Ψ k , by Lemma 23, we kno w that f or e v ery ψ , V , j , k ψ k , k φ ( V , j ) k ≤ 1 , and α ( ψ ) ∈ U , th u s, 3 8 h u, w ( k ) i − 1 2 h α ( ψ ) , w ( k +1) i + h w (0 ,k ) , 1 4 n ψ i − h w (0 ,k + 1) , 1 4 n ψ i + h w (0 ,k + 1) , − 1 4 n 2 φ ( V , k + 1) i = 3 64 h u k , u i − η 4 h u k +1 , α ( ψ ) i + 0 − h w (0 ,k + 1) , 1 4 n ψ i + h w (0 ,k + 1) , 1 4 n 2 φ ( V , m ) i ≤ 3 η 64 + η 32 + 1 16 n 2 + 1 16 n 3 < η 8 . ( n ≥ 4) F or k = m , u 6 = u m and ev ery ψ ∈ Ψ m , b y Lemma 23, we know th at for ev ery ψ , V , j , k ψ k , k φ ( V , j ) k ≤ 1 , and α ( ψ ) ∈ U , th u s, 3 8 h u, w ( k ) i − 1 2 h α ( ψ ) , w ( k +1) i + h w (0 ,k ) , 1 4 n ψ i − h w (0 ,k + 1) , 1 4 n ψ i + h w (0 ,k + 1) , − 1 4 n 2 φ ( V , k + 1) i = 3 8 h u, w ( k ) i + h 1 4 n ψ , w (0 ,k ) i = 3 8 h u, η 2 u m i + h 1 4 n ψ , w (0 ,k ) i ≤ 3 η 128 + η 16 n 2 < η 32 . ( n ≥ 4) F or k = m , u = u m and ev ery ψ ∈ Ψ m , by Le mma 23, we know that for ev ery ψ , V , j , k ψ k , k φ ( V , j ) k ≤ 1, and α ( ψ ) ∈ U , thus, 3 8 h u, w ( k ) i − 1 2 h α ( ψ ) , w ( k +1) i + h w (0 ,k ) , 1 4 n ψ i − h w (0 ,k + 1) , 1 4 n ψ i + h w (0 ,k + 1) , − 1 4 n 2 φ ( V , k + 1) i = 3 8 h u, w t ( k ) i + h 1 4 n ψ , w t (0 ,k ) i = 3 8 h u, η 2 u m i + h 1 4 n ψ , w t (0 ,k ) i ≥ 3 η 16 − η 16 n 2 45 > 5 η 32 ( n ≥ 4) > δ 1 . Second, w e sho w that when k = m and u = u m , the maxim um among ψ ∈ Ψ m is attained uniquely in ψ ∗ m = 1 n P m t =1 φ ( V t , t ). F or an y ψ ∈ Ψ m , w ith ψ 6 = ψ ∗ m , by Lemma 23, for k = m , u = u m , 3 8 h u, w ( m ) i − 1 2 h α ( ψ ∗ m ) , w ( m +1) i + h w (0 ,m ) , 1 4 n ψ ∗ m i − h w (0 ,m +1) , 1 4 n ψ ∗ m i + h w (0 ,m +1) , − 1 4 n 2 φ ( V , m + 1) i = 3 8 h u, w ( m ) i + h 1 4 n ψ ∗ m , w (0 ,m ) i = 3 η 16 + η 16 n 2 h ψ ∗ m , 1 n m X t =1 φ ( V t , t ) i ≥ 3 η 16 + η 16 n 2 h ψ , 1 n m X t =1 φ ( V t , t ) i + η ǫ 16 n 2 = 3 8 h u, w ( k ) i + h 1 4 n ψ , w (0 ,m ) i + η ǫ 16 n 2 = 3 8 h u, w ( m ) i − 1 2 h α ( ψ ) , w ( m +1) i + h w (0 ,m ) , 1 4 n ψ i − h w (0 ,m +1) , 1 4 n ψ i + h w (0 ,m +1) , − 1 4 n 2 φ ( V , m + 1) i + η ǫ 16 n 2 W e deriv e that, ∇ ℓ SGD 2 ( w, V ) ( k ) = 3 8 u m k = m − 1 2 α ( ψ ∗ m ) k = m + 1 0 k / ∈ { m, m + 1 } ∇ ℓ SGD 2 ( w, V ) (0 ,k ) = 1 4 n 2 P m t =1 φ ( V t , t ) k = m − 1 4 n 2 P m t =1 φ ( V t , t ) − 1 4 n 2 φ ( V , m + 1) k = m + 1 0 k / ∈ { m, m + 1 } . Lemma 24. U nder the c onditions of The or em 4, if E ′ o c curs and w t is the iter ate of Unpr oje c te d SGD with step size η ≤ 1 √ n and w 1 = 0 , w 2 ( k ) = ( η n 3 u 1 k = 1 0 k ≥ 2 , and, w 2 (0 ,k ) = ( η 4 n 2 φ ( V 1 , 1) k = 1 0 k 6 = 1 . Pr o of. 
w 1 = 0, th u s, for eve ry k , max u ∈ V 1 h u, w 1 ( k ) i = 0 < 3 η 32 , 46 and w e d erive that ∇ ℓ SGD 1 ( w 1 , V 1 ) = 0. By the same argumen t, ∇ ℓ SGD 2 ( w 1 , V 1 ) = 0 (where the maxim um is attained un iquely in δ 2 ). Moreo ver, ℓ SGD 3 is a linear fu nction, then, we get that, ∇ ℓ SGD 3 ( w 1 , V 1 ) ( k ) = ( − 1 n 3 u 1 k = 1 0 k ≥ 2 , and, ∇ ℓ SGD 3 ( w 1 , V 1 ) (0 ,k ) = ( − 1 4 n 2 φ ( V 1 , 1) k = 1 0 k 6 = 1 , and the lemma follo ws . Lemma 25. U nder the c onditions of The or em 4, if E ′ o c curs and w t is the iter ate of Unpr oje c te d SGD with step size η ≤ 1 √ n and w 1 = 0 , w 3 ( k ) = 2 η n 3 u 1 − 3 η 8 u 1 k = 1 η 2 u 2 k = 2 0 3 ≤ k ≤ n w 3 (0 ,k ) = η 4 n 2 φ ( V 2 , 1) k = 1 η 4 n 2 φ ( V 1 , 1) + η 4 n 2 φ ( V 2 , 2) k = 2 0 k ≥ 3 . wher e u 1 ∈ U , and u 2 holds u 2 ∈ P 2 ∩ S 2 . Pr o of. First, b y the fact that f or ev ery 2 ≤ k ≤ T , w 2 ( k ) = 0, for eve ry such k , max u ∈ V 2 h u, w 2 ( k ) i = 0 < 3 η 32 , and we derive that ∇ ℓ SGD 1 ( w 2 , V 2 ) = 0. Moreo v er, ℓ SGD 3 is a linear fu nction, thus, ℓ SGD 3 ( w 2 , V 2 ) ( k ) = ( η n 3 u 1 k = 1 0 k ≥ 2 , and, ℓ SGD 3 ( w 2 , V 2 ) ( k ) = ( η 4 n 2 φ ( V 2 , 1) k = 1 0 k 6 = 1 . F or ℓ SGD 2 ( w 2 , V 2 ), we get by th e fact that for every k ≥ 1, w 2 ( k +1) = w 2 (0 ,k + 1) = 0, ℓ SGD 2 ( w 2 , V 2 ) = max δ 2 , max k ∈ [ n − 1] ,u ∈ U,ψ ∈ Ψ k 3 8 h u, w 2 ( k ) i + h 1 4 n ψ , w 2 (0 ,k ) i ! As a fi rst step, w e sh o w that the the maxim um is attained with k = 1 and u = u 1 , F or k 6 = 1, for ev ery u ∈ U and ψ ∈ Ψ k , 3 8 h u, w 2 ( k ) i + h 1 4 n ψ , w 2 (0 ,k ) i = 0 . 47 F or k = 1, u 6 = u 1 and every ψ ∈ Ψ 1 , by the fact th at k ψ k , k φ ( V 1 , 1) k ≤ 1, 3 8 h u, w 2 ( k ) i + h 1 4 n ψ , w 2 (0 ,k ) i ≤ 3 η 64 n 3 + η 16 n 3 = 7 η 64 n 3 < 3 η 16 n 3 . F or k = 1, u = u 1 and every ψ ∈ Ψ 1 , by the fact th at k ψ k , k φ ( V 1 , 1) k ≤ 1, 3 8 h u, w 2 ( k ) i + h 1 4 n 2 ψ , w 2 (0 ,k ) i ≥ 3 η 8 n 3 − η 16 n 3 > 3 η 16 n 3 > δ 1 . As a second step we sho w that the maxim u m among ψ ∈ Ψ 1 is attained uniquely in ψ ∗ 1 = 1 n φ ( V 1 , 1). F or an y ψ ∈ Ψ 1 , with ψ 6 = ψ ∗ 1 . By Lemma 23, for k = 1, u = u 1 , 3 8 h u, w 2 ( k ) i + h 1 4 n ψ ∗ 1 , w 2 (0 ,k ) i = 3 η 8 n 3 + h 1 4 n ψ ∗ 1 , η 4 n 2 φ ( V 1 , 1) i = 3 η 8 n 3 + η 16 n 2 h ψ ∗ 1 , 1 n φ ( V 1 , 1) i ≥ 3 η 8 n 3 + η 16 n 2 h ψ , 1 n φ ( V 1 , 1) i + η ǫ 16 n 2 = 3 8 h u, w 2 ( k ) i + h 1 4 n ψ , w 2 (0 ,k ) i + η ǫ 16 n 2 W e got that the maxim um is uniquely attai ned at k = 1 , u = u 1 , ψ = 1 n φ ( V 1 , 1). No w, by Lemma 23, for j = arg min i { i : v i ∈ V 1 } , w e get that α ( ψ ) = v j ∈ V 1 . W e notice that V 1 = P 2 and thus α ( ψ ) = J 2 . Then, by E ′ , α ( ψ ) also h olds α ( ψ ) ∈ S 2 . Combining the ab ov e together, we get, for u 2 = α ( ψ ) ∈ P 2 ∩ S 2 , ∇ f ( w 2 , V 2 ) ( k ) = 3 8 u 1 − 1 n 3 u 1 k = 1 − 1 2 u 2 k = 2 0 k ≥ 3 and, ∇ f ( w 2 , V 2 ) (0 ,k ) = 1 4 n 2 φ ( V 1 , 1) − 1 4 n 2 φ ( V 2 , 1) k = 1 − 1 4 n 2 φ ( V 1 , 1) − 1 4 n 2 φ ( V 2 , 2) k = 2 0 k ≥ 3 , and the lemma follo ws . Lemma 26. U nder the c onditions of The or em 4, if E ′ o c curs and w t is the iter ate of Unpr oje c te d SGD with step size η ≤ 1 √ n and w 1 = 0 , w 4 ( k ) = 3 η n 3 u 1 − 3 η 8 u 1 k = 1 η 8 u 2 k = 2 η 2 u 3 k = 3 0 k ≥ 4 , and, w 4 (0 ,k ) = η 4 n 2 φ ( V 2 , 1) + η 4 n 2 φ ( V 3 , 1) k = 1 η 4 n 2 φ ( V 1 , 1) + η 4 n 2 φ ( V 2 , 2) + η 4 n 2 φ ( V 3 , 3) k = 3 0 k / ∈ { 1 , 3 } . 48 Pr o of. 
First, w e notice that by Lemma 25, it holds th at w 3 (2) = cη u 2 for c ≤ 1 2 and u 2 holds u 2 ∈ V 1 ∩ T n i =2 V i , a nd for ev ery 3 ≤ k ≤ T , w t ( k ) = 0. Th en, b y Lemma 11, w e ha v e that ∇ ℓ SGD 1 ( w 3 , V 3 ) = 0 . Moreo v er, ℓ SGD 3 is a linear fu nction, thus, ℓ SGD 3 ( w 3 , V 3 ) ( k ) = ( η n 3 u 1 k = 1 0 k ≥ 2 , and, ℓ SGD 3 ( w 3 , V 3 ) ( k ) = ( η 4 n 2 φ ( V 3 , 1) k = 1 0 k 6 = 1 . F or ℓ SGD 2 ( w 3 , V 3 ), w e first s h o w that the the maxim um is attained w ith k = 2 and u = u 2 . F or k ≥ 3, for eve ry u ∈ U and ψ ∈ Ψ k , 3 8 h u, w 3 ( k ) i − 1 2 h α ( ψ ) , w 3 ( k +1) i + h w 3 (0 ,k ) , 1 4 n ψ i − h w 3 (0 ,k + 1) , 1 4 n ψ i + h w 3 (0 ,k + 1) , − 1 4 n 2 φ ( V , k + 1) i = 0 . F or k = 1, for eve ry u ∈ U and ψ ∈ Ψ 1 , b y the fact that for ev ery ψ , V , j , k ψ k , k φ ( V , j ) k ≤ 1, 3 8 h u, w 3 ( k ) i − 1 2 h α ( ψ ) , w 3 ( k +1) i + h w (0 ,k ) , 1 4 n ψ i − h w 3 (0 ,k + 1) , 1 4 n ψ i + h w 3 (0 ,k + 1) , − 1 4 n 2 φ ( V , k + 1) i = 3 8 ( 2 η n 3 − 3 η 8 ) h u 1 , u i − η 4 h u 2 , α ( ψ ) i + h 1 4 n 2 φ ( V 2 , 1) , ψ i − h η 4 n 2 φ ( V 1 , 1) + η 4 n 2 φ ( V 2 , 2) , ψ i + h η 4 n 2 φ ( V 1 , 1) + η 4 n 2 φ ( V 2 , 2) , 1 4 n 2 φ ( V 3 , 2) i ≤ 9 η 512 + η 32 + η 4 n 2 + η 2 n 2 + η 8 n 4 < 29 η 256 ( n ≥ 4) < η 8 . F or k = 2, u 6 = u 2 and every ψ ∈ Ψ 2 , , by the fact th at for eve ry ψ , V , j , k ψ k , k φ ( V , j ) k ≤ 1, 3 8 h u, w 3 ( k ) i − 1 2 h α ( ψ ) , w 3 ( k +1) i + h w 3 (0 ,k ) , 1 4 n ψ i − h w 3 (0 ,k + 1) , 1 4 n ψ i + h w 3 (0 ,k + 1) , − 1 4 n 2 φ ( V , k + 1) i = 3 8 h u, w 3 ( k ) i + h 1 4 n ψ , w 3 (0 ,k ) i = 3 8 h u, η 2 u 2 i + h 1 4 n ψ , η 4 n 2 φ ( V 1 , 1) + η 4 n 2 φ ( V 2 , 2) i ≤ 3 η 128 + η 8 n 3 < η 32 . ( n ≥ 4) F or k = 2, u = u 2 and every ψ ∈ Ψ 2 , by the fact th at k ψ k , k φ ( V 1 , 1) k ≤ 1, 3 8 h u, w 3 ( k ) i − 1 2 h α ( ψ ) , w 3 ( k +1) i + h w 3 (0 ,k ) , 1 4 n ψ i − h w 3 (0 ,k + 1) , 1 4 n ψ i + h w 3 (0 ,k + 1) , − 1 4 n 2 φ ( V , k + 1) i = 3 8 h u, w 3 ( k ) i + h 1 4 n ψ , w 3 (0 ,k ) i 49 = 3 8 h u, η 2 u 2 i + h 1 4 n ψ , η 4 n 2 φ ( V 1 , 1) + η 4 n 2 φ ( V 2 , 2) i ≥ 3 η 16 − η 8 n 3 > 5 η 32 ( n ≥ 4) > δ 1 . Second, we sh ow that the maxim um among ψ ∈ Ψ 2 is atta ined uniquely in ψ ∗ 2 = 1 n φ ( V 1 , 1) + 1 n φ ( V 2 , 2). F or an y ψ ∈ Ψ 2 , with ψ 6 = ψ ∗ 2 , by Lemma 23, f or k = 2, u = u 2 , 3 8 h u, w 3 ( k ) i − 1 2 h α ( ψ ∗ 2 ) , w 3 ( k +1) i + h w 3 (0 ,k ) , 1 4 n ψ ∗ 2 i − h w 3 (0 ,k + 1) , 1 4 n ψ ∗ 2 i + h w 3 (0 ,k + 1) , − 1 4 n 2 φ ( V , k + 1) i = 3 8 h u, w 3 ( k ) i + h 1 4 n ψ ∗ 2 , w 3 (0 ,k ) i = 3 η 16 + h 1 4 n ψ ∗ 2 , η 4 n 2 φ ( V 1 , 1) + η 4 n 2 φ ( V 2 , 2) i = 3 η 16 + η 16 n 2 h ψ ∗ 2 , 1 n φ ( V 1 , 1) + 1 n φ ( V 2 , 2) i ≥ 3 η 16 + η 16 n 2 h ψ , 1 n φ ( V 1 , 1) + 1 n φ ( V 2 , 2) i + η ǫ 16 n 2 = 3 8 h u, w 3 ( k ) i + h 1 4 n ψ , w 3 (0 ,k ) i + η ǫ 16 n 2 = 3 8 h u, w 3 ( k ) i − 1 2 h α ( ψ ) , w 3 ( k +1) i + h w 3 (0 ,k ) , 1 4 n ψ i − h w 3 (0 ,k + 1) , 1 4 n ψ i + h w 3 (0 ,k + 1) , − 1 4 n 2 φ ( V , k + 1) i + η ǫ 16 n 2 W e got that the maximum is un iquely attained at k = 2 , u = u 2 , ψ = ψ ∗ 2 . No w, by Lemma 23, for j = arg min i { i : v i ∈ V 1 ∩ V 2 } , w e get that α ( ψ ) = v j ∈ V 1 ∩ V 2 . W e notice that V 1 ∩ V 2 = P 3 and th u s α ( ψ ) = J 3 . Then, b y E ′ , α ( ψ ) also holds α ( ψ ) ∈ S 3 . 
Com b ining the ab o ve together, we get, for u 1 ∈ U , u 2 ∈ P 2 ∩ S 2 and u 3 = α ( ψ ∗ 2 ) ∈ P 3 ∩ S 3 , ∇ f ( w 3 , V 3 ) = − 1 n 3 u 1 s = 1 3 8 u 2 s = 2 − 1 2 u 3 s = 3 0 4 ≤ s ≤ n − 1 4 n 2 φ ( V 3 , 1) s = 0 , 1 1 4 n 2 φ ( V 1 , 1) + 1 4 n 2 φ ( V 2 , 2) s = 0 , 2 − 1 4 n 2 φ ( V 1 , 1) − 1 4 n 2 φ ( V 2 , 2) − 1 4 n 2 φ ( V 3 , 3) s = 0 , 3 0 s = 0 , k for k ≥ 3 , and the lemma follo ws . C.3 Pro of of Theorem 4 Pr o of of The or em 4 . W e sho w that the theorem holds if the ev en t E ′ o ccurs. First, w e p ro ve that for ev ery t , k w t k ≤ 1. By Lemma 10 , k w t k ≤ v u u t d X i =1 w t [ i ] 2 50 ≤ v u u t n X k =1 k w t ( k ) k 2 + n X l =1 k w t (0 ,l ) k 2 < s 2 · η 2 2 + ( n − 2) η 8 2 + 2 · η 4 n 2 ≤ s η 2 2 + η 2 ( n − 2 ) 64 + 2 η 2 ≤ r 1 64 + 5 2 n ( η ≤ 1 √ n ) ≤ 1 ( n ≥ 4) No w, denote α V ∈ R n − 3 the ve ctor which its k th en tr y is max η 16 , max u ∈ V i h u, ( n − k + 2) η 8 u k +1 i . F or w n = w n,n , and any 2 ≤ s ≤ n − 2, w n ( s ) = η 2 n u s + ( n − s − 1) η 8 n u s = ( n − s + 3) η 8 n u s . Then, 1 n n X i =1 v u u t n X k =2 max 3 η 32 , max u ∈ V i h u, w n ( k ) i 2 ≥ 1 n n X i =1 v u u t n − 2 X k =2 max 3 η 32 , max u ∈ V i h u, w n ( k ) i 2 = 1 n n X i =1 v u u t n − 2 X k =2 max 3 η 32 , max u ∈ V i h u, ( n − k + 3) η 8 n u k i 2 = 1 n n X i =1 v u u t n − 3 X k =1 max 3 η 32 , max u ∈ V i h u, ( n − k + 2) η 8 n u k +1 i 2 = 1 n n X i =1 v u u t n − 3 X k =1 max 3 η 32 , max u ∈ V i h u, ( n − k + 2) η 8 n u k +1 i 2 = 1 n n X i =1 k α V i k ≥ k 1 n n X i =1 α V i k = v u u t n − 2 X k =2 1 n n X i =1 max 3 η 32 , max u ∈ V i h u, ( n − k + 3) η 8 n u k i ! 2 = η 8 v u u t n − 2 X k =2 1 n n X i =1 max 3 4 , max u ∈ V i h u, n − k + 3 n u k i ! 2 = η 8 v u u t n − 2 X k =2 1 n n X i =1 max 3 4 , max u ∈ V i h u, n − k + 2 n u k i ! 2 No w, b y the fact that if E ′ holds, by Lemma 10, for 2 ≤ k ≤ n − 2, u k ∈ P k = T k − 1 i =1 V k , 1 n n X i =1 v u u t n X k =2 max 3 η 32 , max u ∈ V i h u, w n ( k ) i 2 51 ≥ η 8 v u u u t n − 2 X k =2 1 n k − 1 X i =1 max 3 4 , max u ∈ V i h u, n − k + 2 n u k i + 1 n n X i = k max 3 4 , max u ∈ V i h u, n − k + 2 n u k i ! 2 ≥ η 8 v u u t n − 2 X k =2 3( n − k + 1) 4 n + k − 1 n max 3 4 , n − k + 2 n 2 ≥ η 8 v u u u t X 2 ≤ k ≤ n 4 − 2 3( n − k + 1) 4 n + ( k − 1)( n − k + 1) n 2 2 + X n 4 − 3 0 , B b e the d -dimensional unit b al l and S b e the d -dimensional unit spher e. Mor e over, let D B and D S b e the uniform distributions on B , S r esp e ctively. If ˜ f ( x ) = E v ∼D B [ f ( x + δ v )] , then, ∇ ˜ f ( x ) = d δ E a ∼D S [ f ( x + δa ) a ] Lemma 28. (e.g., Mul ler (1959)) L et d . L et S b e the d -dimensional unit spher e a nd D S the uniform distributions on S . M or e over, we define r andom variables Y 1 , . . . , Y d ∈ R , X 1 , . . . , X d ∈ R and Y ∈ R d such that X i ∼ N (0 , 1) (w her e N (0 , 1) is the normal univariate distribution with exp e ctation 0 and varianc e 1 ), Y i = x i q P d i =1 X 2 i and Y = ( Y 1 , . . . , Y d ) . Then, Y ∼ D S . Lemma 29. L et d . L et B b e the d -dimensional unit b al l and D B the uniform distributions on B . L et ζ 1 > ζ 2 > 0 , a function g : R → R and a 1 , ...a l ∈ B . Mor e over, let h : B → R , h ( x ) = g (max( ζ 1 , max 1 ≤ r ≤ l h a r , x i ) and x 0 ∈ B such that max 1 ≤ r ≤ l h a r , x 0 i ≤ ζ 2 . W e define ˜ h ( x ) : = E v ∼D B [ h ( x + δv )] . Then, for any 0 < δ < ζ 1 − ζ 2 , ∇ ˜ h ( x 0 ) = 0 , ˜ h ( x 0 ) = g ( ζ 1 ) . Pr o of. 
Lemma 29. Let $d \in \mathbb{N}$, let $B$ be the $d$-dimensional unit ball and $\mathcal{D}_B$ the uniform distribution on $B$. Let $\zeta_1 > \zeta_2 > 0$, a function $g : \mathbb{R} \to \mathbb{R}$ and $a_1,\dots,a_l \in B$. Moreover, let $h : B \to \mathbb{R}$, $h(x) = g(\max(\zeta_1, \max_{1\le r\le l}\langle a_r, x\rangle))$, and let $x_0 \in B$ be such that $\max_{1\le r\le l}\langle a_r, x_0\rangle \le \zeta_2$. We define $\tilde h(x) := \mathbb{E}_{v\sim\mathcal{D}_B}[h(x+\delta v)]$. Then, for any $0 < \delta < \zeta_1 - \zeta_2$,
$$\nabla\tilde h(x_0) = 0, \qquad \tilde h(x_0) = g(\zeta_1).$$

Proof. First, for every $r$ and $v \in B$, by the Cauchy-Schwarz inequality,
$$\langle a_r, x_0 + \delta v\rangle = \langle a_r, x_0\rangle + \langle a_r, \delta v\rangle \le \zeta_2 + \delta < \zeta_1.$$
Then $\max(\zeta_1, \max_{1\le r\le l}\langle a_r, x_0+\delta v\rangle) = \zeta_1$, and
$$h(x_0+\delta v) = g\big(\max(\zeta_1, \max_{1\le r\le l}\langle a_r, x_0+\delta v\rangle)\big) = g(\zeta_1).$$
As a result, $\tilde h(x_0) = \mathbb{E}_{v\sim\mathcal{D}_B}[h(x_0+\delta v)] = g(\zeta_1)$, and by Lemma 27,
$$\nabla\tilde h(x_0) = \frac{d}{\delta}\mathbb{E}_{v\sim\mathcal{D}_S}[h(x_0+\delta v)v] = \frac{d}{\delta}\mathbb{E}_{v\sim\mathcal{D}_S}\big[g\big(\max(\zeta_1, \max_{1\le r\le l}\langle a_r, x_0+\delta v\rangle)\big)v\big] = \frac{d}{\delta}\mathbb{E}_{v\sim\mathcal{D}_S}[g(\zeta_1)v] = \frac{d}{\delta}g(\zeta_1)\,\mathbb{E}_{v\sim\mathcal{D}_S}[v] = 0.$$

Lemma 30. Let $d, K \in \mathbb{N}$, let $B$ be the $dK$-dimensional unit ball and $\mathcal{D}_B$ the uniform distribution on $B$. Let $\zeta_1 > \zeta_2 > 0$ and $a_1,\dots,a_l \in B_d$. Moreover, let $g : B_d \to \mathbb{R}$, $g(x) = \max(\zeta_1, \max_{1\le r\le l}\langle a_r, x\rangle)$ and $h : B \to \mathbb{R}$, $h(x) = \sqrt{\sum_{k=1}^K g(x^{(k)})^2}$. Let $x_0 \in B$ be such that for every $k$, $\max_{1\le r\le l}\langle a_r, x_0^{(k)}\rangle \le \zeta_2$. We define $\tilde h(x) := \mathbb{E}_{v\sim\mathcal{D}_B}[h(x+\delta v)]$. Then, for any $0 < \delta < \zeta_1 - \zeta_2$,
$$\nabla\tilde h(x_0) = 0, \qquad \tilde h(x_0) = \zeta_1\sqrt{K}.$$

Proof. First, for every $k$, $r$ and $u \in B_d$, by the Cauchy-Schwarz inequality,
$$\langle a_r, x_0^{(k)} + \delta u\rangle = \langle a_r, x_0^{(k)}\rangle + \langle a_r, \delta u\rangle \le \zeta_2 + \delta < \zeta_1.$$
Then $g(x_0^{(k)} + \delta u) = \max(\zeta_1, \max_{1\le r\le l}\langle a_r, x_0^{(k)}+\delta u\rangle) = \zeta_1$, and for every $v \in B$,
$$h(x_0+\delta v) = \sqrt{\sum_{k=1}^K g\big(x_0^{(k)} + \delta v^{(k)}\big)^2} = \zeta_1\sqrt{K}.$$
As a result, $\tilde h(x_0) = \mathbb{E}_{v\sim\mathcal{D}_B}[h(x_0+\delta v)] = \zeta_1\sqrt{K}$. Now, by Lemma 27,
$$\nabla\tilde h(x_0) = \frac{d}{\delta}\mathbb{E}_{v\sim\mathcal{D}_S}[h(x_0+\delta v)v] = \frac{d}{\delta}\mathbb{E}_{v\sim\mathcal{D}_S}\bigg[\sqrt{\sum_{k=1}^K \max\big(\zeta_1, \max_{1\le r\le l}\langle a_r, x_0^{(k)}+\delta v^{(k)}\rangle\big)^2}\cdot v\bigg] = \frac{d}{\delta}\mathbb{E}_{v\sim\mathcal{D}_S}\big[\zeta_1\sqrt{K}\,v\big] = \frac{d}{\delta}\zeta_1\sqrt{K}\,\mathbb{E}_{v\sim\mathcal{D}_S}[v] = 0.$$

Lemma 31. Let $d \in \mathbb{N}$, let $B$ be the $d$-dimensional unit ball and $\mathcal{D}_B$ the uniform distribution on $B$. Let $\zeta_1 > \zeta_2$, $\zeta_3 > 0$ and vectors $a_1,\dots,a_l \in B^d_G(0)$. Moreover, let $h : B \to \mathbb{R}$, $h(x) = \max(\zeta_3, \max_{1\le r\le l}\langle a_r, x\rangle)$, and let $x_0 \in B$, $r_0 \in [l]$ be such that $\langle a_{r_0}, x_0\rangle = \zeta_1$ and $\max_{1\le r\le l,\, r\ne r_0}\langle a_r, x_0\rangle \le \zeta_2$. We define $\tilde h(x) := \mathbb{E}_{v\sim\mathcal{D}_B}[h(x+\delta v)]$. Then, for any $0 < \delta < \frac{1}{2G}(\zeta_1 - \max(\zeta_2, \zeta_3))$,
$$\tilde h(x_0) = \langle a_{r_0}, x_0\rangle, \qquad \nabla\tilde h(x_0) = a_{r_0}.$$

Proof. First, by the Cauchy-Schwarz inequality,
$$\max\Big(\zeta_3, \max_{r\ne r_0}\langle a_r, x_0+\delta v\rangle\Big) \le \max\Big(\zeta_3, \max_{r\ne r_0}\langle a_r, x_0\rangle + \max_{r\ne r_0}\langle \delta v, a_r\rangle\Big) \le \max(\zeta_3, \zeta_2 + G\delta) \le \max(\zeta_3, \zeta_2) + G\delta < \tfrac12\big(\zeta_1 + \max(\zeta_3,\zeta_2)\big),$$
and for $r_0$,
$$\langle a_{r_0}, x_0+\delta v\rangle = \langle a_{r_0}, x_0\rangle + \langle \delta v, a_{r_0}\rangle \ge \zeta_1 - G\delta > \tfrac12\big(\zeta_1 + \max(\zeta_2,\zeta_3)\big).$$
We derive that for every $v \in B$,
$$h(x_0+\delta v) = \max\Big(\zeta_3, \max_{1\le r\le l}\langle a_r, x_0+\delta v\rangle\Big) = \langle a_{r_0}, x_0+\delta v\rangle,$$
and that the maximum is attained at $r_0$. Then,
$$\tilde h(x_0) = \mathbb{E}_{v\sim\mathcal{D}_B}[h(x_0+\delta v)] = \mathbb{E}_{v\sim\mathcal{D}_B}[\langle a_{r_0}, x_0+\delta v\rangle] = \langle a_{r_0}, x_0 + \delta\,\mathbb{E}_{v\sim\mathcal{D}_B}[v]\rangle = \langle a_{r_0}, x_0\rangle,$$
and by Lemma 27,
$$\nabla\tilde h(x_0) = \frac{d}{\delta}\mathbb{E}_{v\sim\mathcal{D}_S}\Big[\max\Big(\zeta_3, \max_{1\le r\le l}\langle a_r, x_0+\delta v\rangle\Big)v\Big] = \frac{d}{\delta}\mathbb{E}_{v\sim\mathcal{D}_S}[\langle a_{r_0}, x_0+\delta v\rangle v] = \langle a_{r_0}, x_0\rangle\frac{d}{\delta}\mathbb{E}_{v\sim\mathcal{D}_S}[v] + \frac{d}{\delta}\mathbb{E}_{v\sim\mathcal{D}_S}[\langle a_{r_0}, \delta v\rangle v] = 0 + d\,a_{r_0}^\top\mathbb{E}_{v\sim\mathcal{D}_S}[vv^\top] = d\,a_{r_0}^\top\mathbb{E}_{v\sim\mathcal{D}_S}[vv^\top].$$
Now, we define random variables $Y_1,\dots,Y_d \in \mathbb{R}$ and $X_1,\dots,X_d \in \mathbb{R}$ such that $X_i \sim \mathcal{N}(0,1)$ (where $\mathcal{N}(0,1)$ is the univariate normal distribution with expectation $0$ and variance $1$) and $Y_i = X_i/\sqrt{\sum_{i=1}^d X_i^2}$. By Lemma 28 we get, for the standard basis vectors $e_1,\dots,e_d$,
$$\nabla\tilde h(x_0) = d\,a_{r_0}^\top\,\mathbb{E}_{Y_1,\dots,Y_d}\bigg[\sum_{i=1}^d Y_i^2\, e_ie_i^\top\bigg] = d\,a_{r_0}^\top\sum_{i=1}^d \mathbb{E}_{Y_i}\big[Y_i^2\big]e_ie_i^\top = d\,a_{r_0}^\top\sum_{i=1}^d \mathbb{E}_{X_1,\dots,X_d}\bigg[\frac{X_i^2}{\sum_{l=1}^d X_l^2}\bigg]e_ie_i^\top = d\,a_{r_0}^\top\sum_{i=1}^d \frac1d\, e_ie_i^\top = a_{r_0},$$
where the off-diagonal entries of $\mathbb{E}_{v\sim\mathcal{D}_S}[vv^\top]$ vanish by symmetry.

Proof of Lemma 14. We assume that $E$ (Eq. (12)) holds and show Lemma 14 under this event. We prove the claim by induction on $t$. For $t = 1$ the claim is trivial. Now, we assume that $w_t = \tilde w_t$.

For $\ell_1$, in every $t$, by the proofs of Lemmas 18 to 21 and Lemma 4, it can be observed that for every $i \in [n]$ and $k \ge 2$, $\max_t\max_{u\in V_i}\langle u, w_t^{(k)}\rangle \le \frac{\eta}{16}$; thus, in every iteration the term that attains the maximum is $\frac{3\eta}{32}$. Then, by Lemma 30 and the induction hypothesis, for every $i$,
$$\nabla\tilde\ell_1(\tilde w_t, V_i) = \nabla\tilde\ell_1(w_t, V_i) = 0 = \nabla\ell_1(w_t, V_i).$$
For $\tilde\ell_2$ and every $w \in \mathbb{R}^d$, $V \subseteq U$ and $j \in [n^2]$, by linearity of expectation,
$$\tilde\ell_2(w, (V,j)) = \mathbb{E}_{v\sim\delta B}\big[\langle w^{(0)} + v^{(0)}, -\phi(V,j)\rangle\big] = \ell_2(w, (V,j)) + \langle \mathbb{E}_{v\sim\delta B}[v^{(0)}], -\phi(V,j)\rangle = \ell_2(w, (V,j)).$$
Then, we derive that for every $w$ and $i$, $\nabla\tilde\ell_2(w, (V_i, j_i)) = \nabla\ell_2(w, (V_i, j_i))$.

For $\tilde\ell_3$, which is a 2-Lipschitz function, for $t = 1$, by the proof of Lemma 18, the term that attains the maximum is $\delta_1$. Moreover, it can be observed that for such $t$,
$$\max_{\psi\in\Psi}\Big\{\langle w_1^{(0)}, \psi\rangle - \frac{\epsilon}{4T^2}\langle\alpha(\psi), w_1^{(1)}\rangle\Big\} = 0.$$
Then, we can apply Lemma 29 and get, by the induction hypothesis,
$$\nabla\tilde\ell_3(\tilde w_1) = \nabla\tilde\ell_3(w_1) = 0 = \nabla\ell_3(w_1).$$
If $t \ge 2$, it can be observed that
$$\ell_3(w_t) = \max\Big(\delta_2, \max_{\psi\in\Psi}\Big\{\langle w_t^{(0)}, \psi\rangle - \frac{\epsilon}{4T^2}\langle\alpha(\psi), w_t^{(1)}\rangle\Big\}\Big) = \max_{\psi\in\Psi}\Big\{\langle w_t^{(0)}, \psi\rangle - \frac{\epsilon}{4T^2}\langle\alpha(\psi), w_t^{(1)}\rangle\Big\}.$$
Then, by the proofs of Lemmas 19 to 21 and Lemma 4, the maximal value of $\langle w^{(0)}, \psi\rangle - \frac{\epsilon}{4T^2}\langle\alpha(\psi), w^{(1)}\rangle$ is attained at $\psi = \psi^*$, and the difference from the second largest possible value of this term is at least $\frac{\epsilon}{2T^2}$. As a result, using the fact that this maximum is also larger than $\delta_2$ by at least $\frac{\eta}{8n}$ (which is also larger than $\delta$), we can apply Lemma 31 and get, by the induction hypothesis, that
$$\nabla\tilde\ell_3(\tilde w_t)^{(k)} = \nabla\tilde\ell_3(w_t)^{(k)} = \begin{cases}\frac1n\sum_{i=1}^n\phi(V_i, j_i) & k = 0\\ -\frac{\epsilon}{4T^2}\,\alpha\big(\frac1n\sum_{i=1}^n\phi(V_i, j_i)\big) & k = 1\\ 0 & \text{otherwise}\end{cases} = \nabla\ell_3(w_t)^{(k)}.$$
For $\tilde\ell_4$, for $t \in \{1, 2\}$, by the proofs of Lemmas 18 and 19, the term that attains the maximum is $\delta_2$. Moreover, it can be observed that for every such $t$, and every $k \in [T-1]$ and $u \in U$,
$$\tfrac38\langle u, w_t^{(k)}\rangle - \tfrac12\langle u, w_t^{(k+1)}\rangle = 0.$$
Then, we can apply Lemma 29 and get, by the induction hypothesis,
$$\nabla\tilde\ell_4(\tilde w_t) = \nabla\tilde\ell_4(w_t) = 0 = \nabla\ell_4(w_t).$$
For $t = 3$, it can be observed from the proof of Lemma 20 that
$$\ell_4(w_t) = \max\Big(\delta_2, \max_{k\in[T-1],\,u\in U}\Big\{\tfrac38\langle u, w_t^{(k)}\rangle - \tfrac12\langle u, w_t^{(k+1)}\rangle\Big\}\Big) = \max_{k\in[T-1],\,u\in U}\Big\{\tfrac38\langle u, w_t^{(k)}\rangle - \tfrac12\langle u, w_t^{(k+1)}\rangle\Big\}.$$
Moreover, the maximal value is $\frac{3\eta\epsilon}{32T^2}$ and is attained at $k_0 = 1$, $u = u_0 = \alpha(\psi^*)$. The second largest possible value of this term is $\delta_2 = \frac{3\eta\epsilon}{64T^2}$; then, by the fact that $\delta < \frac{3\eta\epsilon}{32T^2} - \frac{3\eta\epsilon}{64T^2} = \frac{3\eta\epsilon}{64T^2}$, we can apply Lemma 31 and get, by the induction hypothesis, that
$$\nabla\tilde\ell_4(\tilde w_t)^{(k)} = \nabla\tilde\ell_4(w_t)^{(k)} = \begin{cases}\tfrac38 u_0 & k = 1\\ -\tfrac12 u_0 & k = 2\\ 0 & \text{otherwise}\end{cases} = \nabla\ell_4(w_t)^{(k)}.$$
For $t \ge 4$, it can be observed from the proofs of Lemmas 4 and 21 that
$$\ell_4(w_t) = \max\Big(\delta_2, \max_{k\in[T-1],\,u\in U}\Big\{\tfrac38\langle u, w_t^{(k)}\rangle - \tfrac12\langle u, w_t^{(k+1)}\rangle\Big\}\Big) = \max_{k\in[T-1],\,u\in U}\Big\{\tfrac38\langle u, w_t^{(k)}\rangle - \tfrac12\langle u, w_t^{(k+1)}\rangle\Big\}.$$
Moreover, the maximal value is $\frac{3\eta}{16}$ and is attained at $k_0 = t-2$, $u = u_0 = \alpha(\psi^*)$. The second largest possible value of this term is smaller than $\frac{5\eta}{64}$; then we can again apply Lemma 31 and get, by the induction hypothesis, that
$$\nabla\tilde\ell_4(\tilde w_t)^{(k)} = \nabla\tilde\ell_4(w_t)^{(k)} = \begin{cases}\tfrac38 u_0 & k = t-2\\ -\tfrac12 u_0 & k = t-1\\ 0 & \text{otherwise}\end{cases} = \nabla\ell_4(w_t)^{(k)}.$$
In conclusion, we proved that $\nabla\widehat F(w_t) = \nabla\widehat{\tilde F}(\tilde w_t)$; thus, by the induction hypothesis,
$$w_{t+1} = w_t - \eta\nabla\widehat F(w_t) = \tilde w_t - \eta\nabla\widehat{\tilde F}(\tilde w_t) = \tilde w_{t+1}.$$

Lemma 32. Let $d \in \mathbb{N}$ and $\delta > 0$. Let $f : \mathbb{R}^d \to \mathbb{R}$ be a $G$-Lipschitz function, let $B$ be the $d$-dimensional unit ball, and let $\mathcal{D}_B$ be the uniform distribution on $B$. If $\tilde f(x) = \mathbb{E}_{v\sim\mathcal{D}_B}[f(x+\delta v)]$, then for every $x$,
$$|\tilde f(x) - f(x)| \le G\delta.$$
Proof. By the fact that $f$ is $G$-Lipschitz,
$$|\tilde f(x) - f(x)| = \big|\mathbb{E}_{v\sim\mathcal{D}_B}[f(x+\delta v)] - f(x)\big| \le \big|\mathbb{E}_{v\sim\mathcal{D}_B}[f(x)] + G\delta\,\mathbb{E}_{v\sim\mathcal{D}_B}[\|v\|] - f(x)\big| = G\delta\,\mathbb{E}_{v\sim\mathcal{D}_B}[\|v\|] \le G\delta.$$

D.2 Proofs of Appendix A.2

Proof of Lemma 15. First, differentiability follows immediately from Lemma 27. Second, for 4-Lipschitzness, for every $V \in Z$ we define $\tilde f^{\mathrm{SGD}}_V : \mathbb{R}^d \to \mathbb{R}$ by $\tilde f^{\mathrm{SGD}}_V(w) := \tilde f^{\mathrm{SGD}}(w, V)$. By the 4-Lipschitzness of $f^{\mathrm{SGD}}$ with respect to its first argument and Jensen's inequality, for every $x, y \in \mathbb{R}^d$ it holds that
$$\big|\tilde f^{\mathrm{SGD}}_V(x) - \tilde f^{\mathrm{SGD}}_V(y)\big| = \big|\mathbb{E}_{v\sim\delta B}\big[f^{\mathrm{SGD}}_V(x+v) - f^{\mathrm{SGD}}_V(y+v)\big]\big| \le \mathbb{E}_{v\sim\delta B}\big|f^{\mathrm{SGD}}_V(x+v) - f^{\mathrm{SGD}}_V(y+v)\big| \le 4\|x-y\|.$$
Third, for convexity, by the convexity of $f^{\mathrm{SGD}}$, for every $x, y \in \mathbb{R}^d$ and $\alpha \in [0,1]$,
$$\tilde f^{\mathrm{SGD}}_V(\alpha x + (1-\alpha)y) = \mathbb{E}_{v\sim\delta B}\big[f^{\mathrm{SGD}}_V(\alpha x + (1-\alpha)y + v)\big] = \mathbb{E}_{v\sim\delta B}\big[f^{\mathrm{SGD}}_V(\alpha(x+v) + (1-\alpha)(y+v))\big] \le \mathbb{E}_{v\sim\delta B}\big[\alpha f^{\mathrm{SGD}}_V(x+v) + (1-\alpha)f^{\mathrm{SGD}}_V(y+v)\big] = \alpha\tilde f^{\mathrm{SGD}}_V(x) + (1-\alpha)\tilde f^{\mathrm{SGD}}_V(y).$$
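As a quick numerical illustration of the ball-smoothing operator used throughout this appendix (an editorial addition; the 1-Lipschitz test function $f(x)=\|x\|$ and the parameters below are arbitrary choices), the following sketch estimates $\tilde f(x) = \mathbb{E}_{v\sim\mathcal D_B}[f(x+\delta v)]$ by Monte Carlo and checks the bound $|\tilde f(x) - f(x)| \le G\delta$ of Lemma 32.

```python
import numpy as np

rng = np.random.default_rng(1)
d, delta, m = 10, 0.01, 100_000

def ball_samples(m, d):
    # Uniform points in the unit ball: uniform direction times a radius with CDF r^d.
    u = rng.standard_normal((m, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    r = rng.random(m) ** (1.0 / d)
    return u * r[:, None]

f = lambda x: np.linalg.norm(x, axis=-1)        # 1-Lipschitz test function (G = 1)
x = rng.standard_normal(d)
v = ball_samples(m, d)
f_smooth = np.mean(f(x + delta * v))            # Monte Carlo estimate of the smoothed value
print(abs(f_smooth - f(x)), "<=", 1.0 * delta)  # Lemma 32: |f~(x) - f(x)| <= G * delta
```

The same averaging operator is the one applied to the SGD losses above; Jensen's inequality is what guarantees that it preserves the Lipschitz constant and convexity, as in the proof of Lemma 15.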
Proof of Lemma 16. We assume that $E'$ (Eq. (15)) holds and prove Lemma 16 under this event. We prove the claim by induction on $t$. For $t = 1$ the claim is trivial. Now, we assume that $w_t = \tilde w_t$. First, for $\tilde\ell^{\mathrm{SGD}}_3$ and every $w$ and $V$, by linearity of expectation,
$$\tilde\ell^{\mathrm{SGD}}_3(w, V) = \mathbb{E}_{v\sim\delta B}\Big[\big\langle w^{(0,1)} + v^{(0,1)}, -\tfrac{1}{4n^2}\phi(V,1)\big\rangle - \big\langle \tfrac{1}{n^3}u_1,\, w^{(1)} + v^{(1)}\big\rangle\Big] = \ell^{\mathrm{SGD}}_3(w, V) + \big\langle \mathbb{E}_{v\sim\delta B}[v^{(0,1)}], -\tfrac{1}{4n^2}\phi(V,1)\big\rangle - \big\langle \tfrac{1}{n^3}u_1,\, \mathbb{E}_{v\sim\delta B}[v^{(1)}]\big\rangle = \ell^{\mathrm{SGD}}_3(w, V).$$
Then we derive that for every $w$, $\nabla\tilde\ell^{\mathrm{SGD}}_3(w, V) = \nabla\ell^{\mathrm{SGD}}_3(w, V)$.

Now, for $r \in \{1, 2\}$ we show that in each term $\tilde\ell^{\mathrm{SGD}}_r(w_t, V_t)$, the argument that attains the maximum value is the same as in $\ell^{\mathrm{SGD}}_r(w_t, V_t)$. For $\tilde\ell^{\mathrm{SGD}}_1(w_t, V_t)$, in every $t$, by the proofs of Lemmas 24, 25 and 10, the maximal value is $\frac{3\eta}{32}$. Moreover, it can be observed that for every $k \ge 2$, $\max_t\max_{u\in V_t}\langle u, w_t^{(k)}\rangle \le \frac{\eta}{16}$. Then, by Lemma 30 and the induction hypothesis,
$$\nabla\tilde\ell^{\mathrm{SGD}}_1(\tilde w_t, V_t) = \nabla\tilde\ell^{\mathrm{SGD}}_1(w_t, V_t) = 0 = \nabla\ell^{\mathrm{SGD}}_1(w_t, V_t).$$
Now, for $\tilde\ell^{\mathrm{SGD}}_2$, for $t = 1$, $\nabla\ell^{\mathrm{SGD}}_2(w_1, V_1) = 0$ and the maximum is attained uniquely at $\delta_1 = \frac{\eta}{8n^3}$ (the second largest value is zero). Then, we can apply Lemma 29 and, by the induction hypothesis, it follows that
$$\nabla\tilde\ell^{\mathrm{SGD}}_2(\tilde w_1, V_1) = \nabla\tilde\ell^{\mathrm{SGD}}_2(w_1, V_1) = 0 = \nabla\ell^{\mathrm{SGD}}_2(w_1, V_1).$$
For every $t \ge 2$, the maximum is attained uniquely at the linear term with $k = t-1$, $u = u_{t-1}$ and $\psi = \psi^*_{t-1}$, and the difference between the maximal value and the second largest value is larger than $\frac{\eta\epsilon}{16n^2}$. Then, we can apply Lemma 31 and, by the induction hypothesis, it follows that
$$\nabla\tilde\ell^{\mathrm{SGD}}_2(\tilde w_t, V_t) = \nabla\tilde\ell^{\mathrm{SGD}}_2(w_t, V_t) = \nabla\ell^{\mathrm{SGD}}_2(w_t, V_t).$$
In conclusion, we proved that $\nabla\tilde f^{\mathrm{SGD}}(\tilde w_t, V_t) = \nabla f^{\mathrm{SGD}}(w_t, V_t)$; thus, by the induction hypothesis,
$$w_{t+1} = w_t - \eta\nabla f^{\mathrm{SGD}}(w_t, V_t) = \tilde w_t - \eta\nabla\tilde f^{\mathrm{SGD}}(\tilde w_t, V_t) = \tilde w_{t+1}.$$

E  Lower bound of $\Omega\big(\min\big(1, \frac{1}{\eta T}\big)\big)$

In this section, we prove the $\Omega\big(\min\big(1, \frac{1}{\eta T}\big)\big)$ lower bound. Since our hard construction for obtaining this bound involves a deterministic loss function, GD is equivalent to SGD. For clarity, we refer in our proof to the performance of GD; however, the same result also applies to SGD with $T = n$ iterations.

E.1 Construction of a non-differentiable loss function.

For $d = \max(25\eta^2T^2, 1)$, we define the hard loss function $f_{\mathrm{OPT}} : \mathbb{R}^d \to \mathbb{R}$ as follows:
$$f_{\mathrm{OPT}}(w) = \max\Big(0, \max_{i\in[d]}\Big\{\frac{1}{\sqrt d} - w[i] - \frac{\eta i}{4d}\Big\}\Big). \qquad (19)$$
For this loss function, we prove the following lemma.

Lemma 33. Assume $n, T > 0$ and $\eta \le \frac{1}{5\sqrt T}$. Consider the loss function $f_{\mathrm{OPT}}$ defined in Eq. (19) for $d = \max(25\eta^2T^2, 1)$. Then, for Unprojected GD (cf. Eq. (1) with $\mathcal{W} = \mathbb{R}^d$) on $f_{\mathrm{OPT}}$, initialized at $w_1 = 0$ with step size $\eta$, we have:
(i) the iterates of GD remain within the unit ball, namely $w_t \in B_d$ for all $t = 1, \dots, T$;
(ii) for all $m = 1, \dots, T$, the $m$-suffix averaged iterate satisfies
$$f_{\mathrm{OPT}}(\bar w_{T,m}) - f_{\mathrm{OPT}}(w^*) = \Omega\Big(\min\Big(1, \frac{1}{\eta T}\Big)\Big).$$

Algorithm's dynamics. We start by proving a lemma that characterizes the dynamics of the algorithm.

Lemma 34. Assume the conditions of Lemma 33, and let $w_t$ be the iterates of Unprojected GD on $f_{\mathrm{OPT}}$, initialized at $w_1 = 0$ with step size $\eta \le \frac{1}{5\sqrt T}$. Then, it holds that:
(i) for every $i \in [d]$ and every $t \in [T]$, $w_t[i] \le \frac{1}{2\sqrt d}$;
(ii) for every $t \in [T]$, there exists an index $j_t \in [d]$ such that for every $k \ne j_t$,
$$\frac{1}{\sqrt d} - w_t[j_t] - \frac{\eta j_t}{4d} > \frac{1}{\sqrt d} - w_t[k] - \frac{\eta k}{4d} + \frac{\eta}{8d};$$
(iii) for every $t \in [T]$, $j_t$ also satisfies $\frac{1}{\sqrt d} - w_t[j_t] - \frac{\eta j_t}{4d} > \frac{\eta}{8d}$.

Proof. We prove the claim by induction on $t$. For $t = 1$, $w_1 = 0$, thus $w_1[i] = 0 \le \frac{1}{2\sqrt d}$. Moreover, the maximizer is $j_1 = 1$. Then, we notice that for both $d = 1$ and $d = 25\eta^2T^2$,
$$\eta \le \frac{1}{5\sqrt T} \implies \eta \le \frac{1}{5\sqrt d}.$$
Then, it holds that
$$\frac{1}{\sqrt d} - w_1[j_1] - \frac{\eta j_1}{4d} \ge \frac{1}{\sqrt d} - w_1[j_1] - \frac{\eta}{4} \ge \frac{19}{20\sqrt d} > \frac{\eta}{8d},$$
and, for every $k \ne j_1$,
$$\frac{1}{\sqrt d} - w_1[j_1] - \frac{\eta j_1}{4d} = \frac{1}{\sqrt d} - w_1[k] - \frac{\eta k}{4d} + \frac{\eta(k - j_1)}{4d} \ge \frac{1}{\sqrt d} - w_1[k] - \frac{\eta k}{4d} + \frac{\eta}{4d} > \frac{1}{\sqrt d} - w_1[k] - \frac{\eta k}{4d} + \frac{\eta}{8d}.$$
In the induction step, we assume that the lemma holds for every $s \le t$ and prove it for $s = t + 1$.
By the induction hypothesis, we know that for every iteration $s \le t$, $\|w_s\|_2 \le \frac12$; as a result, the projections do not affect the dynamics of the algorithm up to iteration $t$. Moreover, we know that for every iteration $s \le t$ there exists an index $j_s \in [d]$ such that the term attaining the maximum value at $w_s$ is $\frac{1}{\sqrt d} - w_s[j_s] - \frac{\eta j_s}{4d}$, and this maximum is attained uniquely at $j_s$ by a margin strictly larger than $\frac{\eta}{8d}$. As a result, we derive that, for every $s \le t$, $\nabla f(w_s) = -e_{j_s}$.

Now, for every index $m \in [d]$, we define
$$n^m_t = \Big|\Big\{s \le t : m = \arg\max_{i\in[d]}\Big\{\frac{1}{\sqrt d} - w_s[i] - \frac{\eta i}{4d}\Big\}\Big\}\Big|.$$
We get that, for every $i$, $w_{t+1}[i] = \eta\, n^i_t$. Then,
$$\|w_{t+1}\|_1 = \sum_i \eta\, n^i_t \le \eta t,$$
and thus there exists an entry $k \in [d]$ with $w_{t+1}[k] \le \frac{\eta t}{d}$.

Now, we prove the first part of the lemma using this observation and the induction hypothesis. For every $i \ne j_t$, $w_{t+1}[i] = w_t[i] \le \frac{1}{2\sqrt d}$. Otherwise, by the definition of $j_t$,
$$\frac{1}{\sqrt d} - w_t[i] - \frac{\eta i}{4d} > \frac{1}{\sqrt d} - w_t[k] - \frac{\eta k}{4d} + \frac{\eta}{8d}, \qquad w_t[i] < w_t[k] + \frac{\eta(k-i)}{4d} - \frac{\eta}{8d} \le \frac{\eta t}{d} + \frac{\eta}{4} \le \frac{1}{25\sqrt d} + \frac{1}{20\sqrt d},$$
and
$$w_{t+1}[i] \le w_t[i] + \eta \le \frac{1}{25\sqrt d} + \frac{1}{20\sqrt d} + \frac{1}{5\sqrt d} \le \frac{1}{2\sqrt d},$$
where we again used the fact that $\eta \le \frac{1}{5\sqrt T}$ implies $\eta \le \frac{1}{5\sqrt d}$ for both $d = 1$ and $d = 25\eta^2T^2$.

For the second part of the lemma, we define $J_t \subseteq [d]$, $J_t = \arg\min_j\{n^j_t\}$ and $j_{t+1} = \min\{j \in J_t\}$, and show that $j_{t+1}$ satisfies the required property. We know that, for every $i \ne j \in [d]$,
$$w_{t+1}[i] - w_{t+1}[j] = \eta\big(n^i_t - n^j_t\big).$$
For $k \ne j_{t+1}$ with $n^k_t < n^{j_{t+1}}_t$,
$$\frac{1}{\sqrt d} - w_{t+1}[j_{t+1}] - \frac{\eta j_{t+1}}{4d} \le \frac{1}{\sqrt d} - w_{t+1}[k] - \eta - \frac{\eta j_{t+1}}{4d} = \frac{1}{\sqrt d} - w_{t+1}[k] - \eta - \frac{\eta k}{4d} + \frac{\eta(k - j_{t+1})}{4d} \le \frac{1}{\sqrt d} - w_{t+1}[k] - \eta - \frac{\eta k}{4d} + \frac{\eta}{4} < \frac{1}{\sqrt d} - w_{t+1}[k] - \frac{\eta k}{4d} - \frac{\eta}{2},$$
in contradiction to the fact that $j_{t+1}$ attains the maximal value. For $k \ne j_{t+1}$ with $n^k_t > n^{j_{t+1}}_t$, it holds that $w_{t+1}[j_{t+1}] \le w_{t+1}[k] - \eta$, and
$$\frac{1}{\sqrt d} - w_{t+1}[j_{t+1}] - \frac{\eta j_{t+1}}{4d} \ge \frac{1}{\sqrt d} - w_{t+1}[k] + \eta - \frac{\eta j_{t+1}}{4d} = \frac{1}{\sqrt d} - w_{t+1}[k] + \eta - \frac{\eta k}{4d} + \frac{\eta(k - j_{t+1})}{4d} \ge \frac{1}{\sqrt d} - w_{t+1}[k] + \eta - \frac{\eta k}{4d} - \frac{\eta}{4} > \frac{1}{\sqrt d} - w_{t+1}[k] - \frac{\eta k}{4d} + \frac{\eta}{8d},$$
as required.

For the third part of the lemma, we know that for every $i \in [d]$,
$$\frac{1}{\sqrt d} - w_{t+1}[i] - \frac{\eta i}{4d} \ge \frac{1}{2\sqrt d} - \frac{\eta}{4} \ge \frac{9}{20\sqrt d} > \frac{\eta}{8d}.$$

Proof of the lower bound. Now we can prove Lemma 33.

Proof of Lemma 33. The first part of the lemma is an immediate corollary of Lemma 34. Moreover, by applying that lemma again, we know that, for every $i \in [d]$, $\bar w_{T,m}[i] \le \frac{1}{2\sqrt d}$; thus,
$$f_{\mathrm{OPT}}(\bar w_{T,m}) - f_{\mathrm{OPT}}(w^*) \ge \frac{1}{2\sqrt d} - \frac{\eta}{4} - 0 \ge \frac{1}{2\sqrt d} - \frac{1}{20\sqrt d} > \frac{1}{4\sqrt d} = \min\Big(\frac14, \frac{1}{20\eta T}\Big).$$
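The following sketch is an editorial numerical illustration (not part of the proof) of the dynamics established in Lemma 34: it runs unprojected (sub)gradient descent on $f_{\mathrm{OPT}}$ from Eq. (19), with an arbitrary choice of $T$ and with $\eta = \frac{1}{5\sqrt T}$, breaking ties in the max toward the smallest index. In line with Lemma 33, the iterates stay in the unit ball and every $m$-suffix average retains value at least $\frac{1}{4\sqrt d}$ (recall that $\min_w f_{\mathrm{OPT}}(w) = 0$).

```python
import numpy as np

T = 400
eta = 1.0 / (5 * np.sqrt(T))          # step size satisfying eta <= 1/(5*sqrt(T))
d = max(int(25 * eta**2 * T**2), 1)   # d = max(25*eta^2*T^2, 1), as in Eq. (19)

def f_opt(w):
    return max(0.0, np.max(1 / np.sqrt(d) - w - eta * np.arange(1, d + 1) / (4 * d)))

w = np.zeros(d)
iterates = []
for t in range(T):
    iterates.append(w.copy())
    terms = 1 / np.sqrt(d) - w - eta * np.arange(1, d + 1) / (4 * d)
    j = int(np.argmax(terms))         # Lemma 34: unique maximizer with positive value
    g = np.zeros(d)
    if terms[j] > 0:
        g[j] = -1.0                   # gradient of the active affine piece
    w = w - eta * g                   # unprojected GD step

iterates = np.array(iterates)
print("max ||w_t||:", np.linalg.norm(iterates, axis=1).max())     # stays <= 1, Lemma 33(i)
for m in (1, T // 2, T):
    w_bar = iterates[T - m:].mean(axis=0)                          # m-suffix average
    print(m, f_opt(w_bar), ">=", 1 / (4 * np.sqrt(d)))             # Lemma 33(ii)
```

By Lemma 36 below, running the same loop on the smoothed objective of Eq. (20) with $\delta = \frac{\eta}{16d}$ would produce identical iterates.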
E.2 Construction of a differentiable loss function.

In this section, we prove the lower bound for a smoothing of $f_{\mathrm{OPT}}$, defined as
$$\tilde f_{\mathrm{OPT}}(w) = \mathbb{E}_{v\sim B_d}\Big[\max\Big(0, \max_{i\in[d]}\Big\{\frac{1}{\sqrt d} - w[i] - \delta v[i] - \frac{\eta i}{4d}\Big\}\Big)\Big]; \qquad (20)$$
namely, we prove the following lemma.

Lemma 35. Assume $n, T > 0$ and $\eta \le \frac{1}{5\sqrt T}$. Consider the loss function $\tilde f_{\mathrm{OPT}}$ defined in Eq. (20) for $d = \max(25\eta^2T^2, 1)$ and $\delta = \frac{\eta}{16d}$. Then, for Unprojected GD (cf. Eq. (1) with $\mathcal{W} = \mathbb{R}^d$) on $\tilde f_{\mathrm{OPT}}$, initialized at $w_1 = 0$ with step size $\eta$, we have:
(i) the iterates of GD remain within the unit ball, namely $w_t \in B_d$ for all $t = 1, \dots, T$;
(ii) for all $m = 1, \dots, T$, the $m$-suffix averaged iterate satisfies
$$\tilde f_{\mathrm{OPT}}(\bar{\tilde w}_{T,m}) - \tilde f_{\mathrm{OPT}}(w^*) = \Omega\Big(\min\Big(1, \frac{1}{\eta T}\Big)\Big).$$

First, we prove that the smoothing of the loss function does not affect the dynamics of the algorithm, as stated in the following lemma.

Lemma 36. Under the conditions of Lemmas 33 and 35, let $w_t$, $\tilde w_t$ be the iterates of Unprojected GD with step size $\eta \le \frac{1}{5\sqrt T}$ and $w_1 = 0$, on $f_{\mathrm{OPT}}$ and $\tilde f_{\mathrm{OPT}}$ respectively. Then, for every $t \in [T]$, it holds that $w_t = \tilde w_t$.

Proof. We prove the lemma by induction on $t$. For $t = 1$, we know that $w_1 = \tilde w_1 = 0$. Now, we assume that $w_t = \tilde w_t$. By Lemma 34, we know that the maximum of the loss function is attained uniquely, with the property that the difference between the maximal value and the second largest value is larger than $\frac{\eta}{8d}$. As a result, by the facts that $f_{\mathrm{OPT}}$ is 1-Lipschitz and $\delta = \frac{\eta}{16d}$, we can use Lemma 31 and get that
$$\nabla f_{\mathrm{OPT}}(w_t) = \nabla\tilde f_{\mathrm{OPT}}(w_t) = \nabla\tilde f_{\mathrm{OPT}}(\tilde w_t).$$
It follows by the induction hypothesis that
$$w_{t+1} = w_t - \eta\nabla f_{\mathrm{OPT}}(w_t) = \tilde w_t - \eta\nabla\tilde f_{\mathrm{OPT}}(\tilde w_t) = \tilde w_{t+1}.$$

Now we can prove Lemma 35.

Proof of Lemma 35. Let $\bar w_{T,m}$ and $\bar{\tilde w}_{T,m}$ be the $m$-suffix averages of GD when applied to $f_{\mathrm{OPT}}$ and to $\tilde f_{\mathrm{OPT}}$, respectively, and let $w^* = \arg\min_w f_{\mathrm{OPT}}(w)$. By Lemma 36, we know that $\bar w_{T,m} = \bar{\tilde w}_{T,m}$. Then, by Lemmas 33 and 32,
$$\frac{1}{4\sqrt d} \le f_{\mathrm{OPT}}(\bar w_{T,m}) - f_{\mathrm{OPT}}(w^*) = f_{\mathrm{OPT}}(\bar{\tilde w}_{T,m}) - f_{\mathrm{OPT}}(w^*) \le \big(\tilde f_{\mathrm{OPT}}(\bar{\tilde w}_{T,m}) + \delta\big) - \big(\tilde f_{\mathrm{OPT}}(w^*) - \delta\big),$$
and therefore
$$\tilde f_{\mathrm{OPT}}(\bar{\tilde w}_{T,m}) - \tilde f_{\mathrm{OPT}}(w^*) \ge \frac{1}{4\sqrt d} - \frac{\eta}{8d} \ge \frac{1}{4\sqrt d} - \frac{1}{8\sqrt d} \ge \frac{1}{8\sqrt d} \ge \min\Big(\frac18, \frac{1}{40\eta T}\Big).$$