Finding Approximate Local Minima Faster than Gradient Descent

Naman Agarwal (namana@cs.princeton.edu, Princeton University), Zeyuan Allen-Zhu (zeyuan@csail.mit.edu, Institute for Advanced Study), Brian Bullins (bbullins@cs.princeton.edu, Princeton University), Elad Hazan (ehazan@cs.princeton.edu, Princeton University), Tengyu Ma (tengyu@cs.princeton.edu, Princeton University)

November 3, 2016

Abstract

We design a non-convex second-order optimization algorithm that is guaranteed to return an approximate local minimum in time which scales linearly in the underlying dimension and the number of training examples. The time complexity of our algorithm to find an approximate local minimum is even faster than that of gradient descent to find a critical point. Our algorithm applies to a general class of optimization problems including training a neural network and other non-convex objectives arising in machine learning.

1 Introduction

Finding a global minimizer of a non-convex optimization problem is NP-hard. Thus, the standard goal of efficient non-convex optimization algorithms is instead to find a local minimum. This problem has become increasingly important as the state of the art in machine learning is attained by non-convex models, many of which are variants of deep neural networks. Experiments in [10, 11, 21] suggest that fast convergence to a local minimum is sufficient for training neural nets, while convergence to critical points (points with vanishing gradients) is not. Theoretical works have affirmed the same phenomenon for other machine learning problems (see [5, 6, 18, 19] and the references therein).

In this paper we give a provable linear-time algorithm for finding an approximate local minimum in smooth non-convex optimization. It applies to a general setting of machine learning optimization, and in particular to the optimization problem of training deep neural networks.
Furthermore, the running time bound of our algorithm is the fastest known, for a wide range of parameters, even for the more lenient task of computing a point with vanishing gradient (called a critical point).

Formally, the problem of unconstrained mathematical optimization is stated in general terms as that of finding the minimum value that a function attains over Euclidean space, i.e.

    min_{x ∈ R^d} f(x) .    (1.1)

If f is convex, the above formulation is convex optimization and is solvable in (randomized) polynomial time even if only a valuation oracle to f is provided. A crucial property of convex functions is that "local optimality implies global optimality", allowing greedy algorithms to reach the global optimum efficiently. Unfortunately, this is no longer the case if f is non-convex; indeed, even a degree-four polynomial can be NP-hard to optimize [23], and it can be NP-hard even just to check whether a point is not a local minimum [25]. Thus, for non-convex optimization one has to settle for the more modest goal of reaching approximate local optimality efficiently.

Of particular interest to machine learning is the optimization of functions f : R^d → R of the finite-sum form

    f(x) = (1/n) · Σ_{i=1}^n f_i(x) .    (1.2)

Such functions arise when minimizing loss over a training set, where each example i in the set corresponds to one loss function f_i in the summation.

We say that the function f is second-order smooth if it has a Lipschitz-continuous gradient and a Lipschitz-continuous Hessian. We say that a point x is an ε-approximate local minimum if it satisfies (following the tradition of [28]):

    ‖∇f(x)‖ ≤ ε   and   ∇²f(x) ⪰ −√ε · I ,

where ‖·‖ denotes the Euclidean norm of a vector. We say that a point x is an ε-critical point if it satisfies the gradient condition above, but not necessarily the second-order condition. Critical points include saddle points in addition to local minima.
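The two conditions above are easy to verify numerically when the dimension is small enough to form the gradient and Hessian explicitly. The following sketch is our own illustration (not part of the paper) of the definition on a dense toy instance:

```python
import numpy as np

def is_approx_local_min(grad, hess, eps):
    """Check the two conditions of an eps-approximate local minimum:
    ||grad|| <= eps and smallest Hessian eigenvalue >= -sqrt(eps)."""
    small_gradient = np.linalg.norm(grad) <= eps
    # eigvalsh is exact for small dense symmetric matrices
    curvature_ok = np.linalg.eigvalsh(hess).min() >= -np.sqrt(eps)
    return small_gradient and curvature_ok

# f(x, y) = x^2 - y^2 has a saddle at the origin: the gradient vanishes,
# but the Hessian has eigenvalue -2, so the second-order test fails.
grad = np.zeros(2)
hess = np.diag([2.0, -2.0])
print(is_approx_local_min(grad, hess, eps=1e-2))                  # False (saddle)
print(is_approx_local_min(grad, np.diag([2.0, 2.0]), eps=1e-2))   # True
```

This makes the distinction in the text concrete: the origin of x² − y² is an ε-critical point but not an ε-approximate local minimum.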
We remark that ε-approximate local minima (even with ε = 0) are not necessarily close to any local minimum, neither in domain nor in function value. However, if we additionally assume that the function satisfies the (robust) strict-saddle property [15, 24] (see Section 2 for the precise definition), then an ε-approximate local minimum is guaranteed to be close to a local minimum for sufficiently small ε.

Our main theorem below states the time required for the proposed algorithm FastCubic to find an ε-approximate local minimum for second-order smooth functions.

Theorem 1 (informal). Ignoring smoothness parameters, the running time of FastCubic to return an ε-approximate local minimum is

    Õ( n/ε^{3/2} + n^{3/4}/ε^{7/4} ) · T_{h,1}   for (1.2),   or   Õ( 1/ε^{7/4} ) · T_h   for the general (1.1).

Above, T_h is the time to compute a Hessian-vector product with ∇²f(x), and T_{h,1} is that for an arbitrary ∇²f_i(x).

The full statement of Theorem 1 can be found in Section 2. Hessian-vector products can be computed in linear time (meaning T_{h,1} = O(d) and T_h = O(nd)) for many machine learning problems such as generalized linear models and training neural networks [1, 29]. We explain this more generally in Appendix A. Therefore:

Corollary 1.1. Algorithm FastCubic returns an ε-approximate local minimum for the optimization problem of training a neural network in time

    Õ( nd/ε^{3/2} + n^{3/4}d/ε^{7/4} ) .

Another important aspect of our algorithm is that even in terms of just reaching an ε-critical point, i.e. a point that satisfies ‖∇f(x)‖ ≤ ε without any second-order guarantee, FastCubic is faster than all previous results (see Table 1 for a comparison). The fastest methods to find critical points of a smooth non-convex function are gradient descent and its extensions, jointly known as first-order methods.
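To get a feel for Corollary 1.1, one can plug sample values into the two leading-order bounds. This is only an order-of-magnitude illustration of our own: it ignores the hidden logarithmic factors and all smoothness constants.

```python
# Leading terms only: gradient descent needs ~ n*d/eps^2 work to find a
# critical point, while FastCubic (Corollary 1.1) needs
# ~ n*d/eps^1.5 + n^0.75*d/eps^1.75 to find an approximate local minimum.
def gd_time(n, d, eps):
    return n * d / eps ** 2

def fastcubic_time(n, d, eps):
    return n * d / eps ** 1.5 + n ** 0.75 * d / eps ** 1.75

n, d, eps = 10 ** 6, 10 ** 3, 10 ** -3
print(f"GD        ~ {gd_time(n, d, eps):.2e}")
print(f"FastCubic ~ {fastcubic_time(n, d, eps):.2e}")
# With these values FastCubic's bound is roughly 25-30x smaller, despite
# also certifying approximate second-order optimality.
assert fastcubic_time(n, d, eps) < gd_time(n, d, eps)
```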
These methods are extremely efficient in terms of per-iteration complexity; however, they necessarily suffer from a 1/ε² convergence rate [27]. To the best of our knowledge, among previous results only higher-order methods seem capable of breaking this 1/ε² bottleneck [28]. For certain ranges of parameters, our FastCubic finds local minima even faster than first-order methods find critical points. This is depicted in Table 1.

Paper                     | Total time achieving ‖∇f(x)‖ ≤ ε   | Second-order guarantee
--------------------------|-------------------------------------|------------------------
Gradient descent (GD)     | O(nd/ε²)                            | n/a
SVRG [2]                  | O(nd + n^{2/3}d/ε²)                 | n/a
SGD [20]                  | O(d/ε⁴)                             | n/a
noisy SGD [16] (a)        | O(d^{C₁}/ε⁴)                        | ∇²f(x) ⪰ −ε^{1/C₂} · I
cubic regularization [28] | Õ((nd^{ω−1} + d^ω)/ε^{3/2})         | ∇²f(x) ⪰ −ε^{1/2} · I
this paper                | Õ(nd/ε^{7/4})                       | ∇²f(x) ⪰ −ε^{1/2} · I
this paper                | Õ(nd/ε^{3/2} + n^{3/4}d/ε^{7/4})    | ∇²f(x) ⪰ −ε^{1/2} · I

Table 1: Comparison of known methods. (a) Here C₁, C₂ are two constants that are not explicitly written. We believe C₁ ≥ 4.

1.1 Related Work

Methods that provably reach critical points. Recall that only a gradient oracle is needed to reach a critical point. The algorithm most commonly used in practice for training non-convex learning machines such as deep neural networks is stochastic gradient descent (SGD), also known as stochastic approximation [30], together with its derivatives. Some widely used practical enhancements are based on Nesterov's acceleration [26] and adaptive regularization [12]. The variance-reduction technique, introduced in [32], was extremely successful in convex optimization, but only recently was a non-convex counterpart with theoretical benefits introduced [2].

Methods that provably reach local minima. The recent work of Ge et al. [17] showed that a noise-injected version of SGD in fact converges to local minima instead of critical points, as long as the underlying non-convex function is strict-saddle.
Their theoretical running time is a large polynomial in the dimension and not competitive with our method (see Table 1). The work of Lee et al. [24] shows that gradient descent, starting from a random point, almost surely converges to a local minimum of a strict-saddle function. The rates of convergence and the precise step sizes required are, however, as yet unknown.

If second-order information (i.e., the Hessian oracle) is provided, the cubic-regularization method of Nesterov and Polyak [28] converges in O(1/ε^{3/2}) iterations. However, each iteration of Nesterov-Polyak requires solving a cubic function which, in general, takes time super-linear in the input representation.

One natural direction is to apply an approximate trust-region solver, such as the linear-time solver of [22], to approximately solve the cubic-regularization subroutine of Nesterov-Polyak. However, the approximation needed by a naive calculation makes this approach even slower than vanilla gradient descent. Our main challenge is to obtain approximate second-order local minima and simultaneously improve upon gradient descent.

Independently of this paper and concurrently,¹ Carmon et al. [7] develop an accelerated gradient descent method that achieves the same running time for finding an approximate local minimum as in our paper. Remarkably, the same running time is obtained via a very different technique.

1.2 Our Techniques

Our algorithm is based on the cubic regularization method of Nesterov and Polyak [8, 9, 28]. At a high level, cubic regularization states that if we can exactly minimize a cubic function

    m(h) ≜ g⊤h + (1/2) h⊤H h + (L/6)‖h‖³ ,

where g = ∇f(x), H = ∇²f(x), and L is the second-order smoothness of the function f, then we can iteratively perform updates x′ ← x + h, and this algorithm converges to an ε-approximate local minimum in O(1/ε^{3/2}) iterations.
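The cubic model and its gradient ∇m(h) = g + Hh + (L/2)‖h‖h are simple to write down explicitly. The following sketch, on a random toy instance of our own, evaluates both and verifies the gradient formula against central finite differences:

```python
import numpy as np

def cubic_model(g, H, L, h):
    """m(h) = g.h + 0.5*h'Hh + (L/6)*||h||^3, the Nesterov-Polyak model."""
    return g @ h + 0.5 * h @ H @ h + (L / 6.0) * np.linalg.norm(h) ** 3

def cubic_model_grad(g, H, L, h):
    """grad m(h) = g + H h + (L/2)*||h||*h."""
    return g + H @ h + (L / 2.0) * np.linalg.norm(h) * h

rng = np.random.default_rng(0)
d, L = 5, 2.0
A = rng.standard_normal((d, d))
H = (A + A.T) / 2                     # symmetric, possibly indefinite
g, h = rng.standard_normal(d), rng.standard_normal(d)

# central-difference check of the gradient formula, coordinate by coordinate
num = np.array([(cubic_model(g, H, L, h + 1e-6 * e) -
                 cubic_model(g, H, L, h - 1e-6 * e)) / 2e-6
                for e in np.eye(d)])
print(np.max(np.abs(num - cubic_model_grad(g, H, L, h))))  # tiny (FD error)
```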
Unfortunately, solving this cubic minimization problem exactly requires, to the best of our knowledge, a running time of O(d^ω), where ω is the matrix multiplication constant. Getting around this requires five observations.

The first observation is that minimizing m(h) up to a constant multiplicative approximation (plus a few other constraints) is sufficient for an iteration complexity of O(1/ε^{3/2}).² The proof techniques behind this observation are based on extending Nesterov and Polyak.

The second observation is that the minimizer h* of m(h) must be of the form

    h* = −(H + λ* I)⁺ g + v ,

where λ* ≥ 0 is some constant satisfying H + λ* I ⪰ 0, v lies in the direction of the smallest eigenvector of H, and ⁺ denotes the pseudo-inverse of a matrix. This can be viewed as moving in a mixture direction between choosing h ← v and choosing h to follow a shifted Newton direction h ← −(H + λ* I)⁺ g. Intuitively, we wish to reduce both the computation of (H + λ* I)⁺ g and of v to Hessian-vector products.

The first task, computing (H + λ* I)⁺ g, can be slow: even if H + λ* I is strictly positive definite, computing it has a complexity depending on the (possibly huge) condition number of H + λ* I [34]. The third observation is that it suffices to pick some λ′ > λ* so that both (1) the condition number of H + λ′ I is small and (2) the vectors (H + λ* I)⁻¹ g and (H + λ′ I)⁻¹ g are close. This relies on the structure of m(h).

The second task, computing v, has a complexity depending on 1/√δ, where δ is the target additive error [13, 14].

¹ To be precise, their manuscript appeared online approximately 24 hours before ours.
² More specifically, we need m_t(h) ≤ (1/C) · min_h { m_t(h) } for some constant C. In addition, we need to have good bounds on ‖h‖ and ‖∇m(h)‖.
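For intuition on the second task, the simplest routine that extracts a smallest eigenvector from Hessian-vector products alone is power iteration on the shifted matrix L₂I − H (which is PSD and shares eigenvectors with H). The solvers cited above [13, 14] are faster, with a 1/√δ rather than 1/δ dependence; this sketch of ours is only meant to show that matrix-vector products suffice:

```python
import numpy as np

def smallest_eigvec(hvp, d, L2, iters=2000, seed=0):
    """Approximate the smallest eigenvector of a symmetric H given only a
    Hessian-vector-product oracle hvp(v) = H @ v.  Power iteration is run
    on B = L2*I - H, whose top eigenvector is H's smallest eigenvector."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    for _ in range(iters):
        w = L2 * w - hvp(w)            # one matvec with B = L2*I - H
        w /= np.linalg.norm(w)
    return w

H = np.diag([1.0, 0.5, -0.8])          # smallest eigenvalue is -0.8
v = smallest_eigvec(lambda x: H @ x, d=3, L2=1.0)
print(v @ H @ v)                       # close to -0.8
```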
The fourth observation is that the choice δ = √ε suffices for the outer loop of cubic regularization to make sufficient progress. This reduces the complexity of computing v.

Finally, finding the correct value λ* itself is as hard as minimizing m_t(h). The fifth step is to design an iterative scheme that makes only a logarithmic number of guesses for λ*. This procedure either finds the correct one (via binary search), or finds an approximate one, λ′, such that (H + λ* I)⁻¹ g and (H + λ′ I)⁻¹ g are sufficiently close.

Putting all the observations together, and balancing all the parameters, we obtain a cubic minimization subroutine (see FastCubicMin in Algorithm 2) that runs in time O(nd + n^{3/4}d/ε^{1/4}).

2 Preliminaries and Main Theorem

We use ‖·‖ to denote the Euclidean norm of a vector and the spectral norm of a matrix. For a symmetric matrix M, we denote by λ_max(M) and λ_min(M) respectively the maximum and minimum eigenvalues of M. We write A ⪰ B to denote that A − B is positive semidefinite (PSD). For a PSD matrix M that is not strictly positive definite, we denote by M⁺ its pseudo-inverse.

We make the following Lipschitz-continuity assumptions on the gradient and Hessian of the target function f: namely, there exist L₂, L > 0 such that

    ∀x, y ∈ R^d :   ‖∇²f(x)‖ ≤ L₂   and   ‖∇²f(x) − ∇²f(y)‖ ≤ L‖x − y‖ .    (2.1)

Definition 2.1. We assume the following complexity parameters on the access to f(x):
• Let T_g be the time complexity to compute ∇f(x) for any x ∈ R^d.
• Let T_h be the time complexity to compute ∇²f(x)·v for any x, v ∈ R^d.

Definition 2.2. We say that f is of finite-sum form if f = (1/n) Σ_{i=1}^n f_i(x) and ‖∇²f_i(x)‖ ≤ L₂ for each i ∈ [n]. In this case, we define T_{h,1} to be the time complexity to compute ∇²f_i(x)·v for arbitrary x, v ∈ R^d and i ∈ [n].
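On small dense instances, the binary-search idea can be realized directly: ‖(H + λI)⁻¹g‖ decreases in λ while 2λ/L increases, so their crossing point λ* can be found by bisection, after which h = −(H + λ*I)⁻¹g approximately minimizes m(h). The following is our own dense-matrix illustration of that monotonicity, not the paper's linear-time routine, and it assumes the nondegenerate ("easy") case where g has a component along the smallest eigenvector:

```python
import numpy as np

def solve_cubic_dense(g, H, L):
    """Minimize m(h) = g.h + 0.5*h'Hh + (L/6)*||h||^3 on a small dense
    instance: bisect on the scalar equation L*||(H+lam I)^-1 g|| = 2*lam,
    which is monotone in lam (nondegenerate case assumed)."""
    d = len(g)
    lam_lo = max(0.0, -np.linalg.eigvalsh(H).min()) + 1e-12
    lam_hi = lam_lo + max(1.0, 2 * np.linalg.norm(H, 2)
                          + np.sqrt(L * np.linalg.norm(g)))
    phi = lambda lam: (L * np.linalg.norm(np.linalg.solve(H + lam * np.eye(d), g))
                       - 2 * lam)
    for _ in range(200):
        mid = (lam_lo + lam_hi) / 2
        lam_lo, lam_hi = (mid, lam_hi) if phi(mid) > 0 else (lam_lo, mid)
    lam = (lam_lo + lam_hi) / 2
    return lam, -np.linalg.solve(H + lam * np.eye(d), g)

H = np.diag([1.0, -0.5])     # indefinite: a plain Newton step would fail here
g = np.array([0.3, 0.4])
L = 1.0
lam, h = solve_cubic_dense(g, H, L)
# at the minimizer, the model gradient g + H h + (L/2)*||h||*h vanishes
print(np.linalg.norm(g + H @ h + (L / 2) * np.linalg.norm(h) * h))
```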
Next we define strict-saddle functions, for which an ε-approximate local minimum is almost equivalent to a local minimum [15, 24].

Definition 2.3 (strict saddle). Suppose f(·) : R^d → R is twice differentiable. For α, β, γ ≥ 0, we say f is (α, β, γ)-strict saddle if every x ∈ R^d satisfies at least one of the following three conditions:
1. ‖∇f(x)‖ ≥ α.
2. λ_min(∇²f(x)) ≤ −β.
3. There exists a local minimum x★ that is γ-close to x in Euclidean distance.

We see that if a function is (α, β, γ)-strict saddle, then for ε < min{α, β²} an ε-approximate local minimum is γ-close to some local minimum.

Algorithm 1 FastCubic(f, x₀, ε, L, L₂)
Input: f(x) that satisfies (2.1) with L₂ and L; a starting vector x₀; a target accuracy ε.
1: κ ← 900/(εL)^{1/2}.
2: for t = 0 to ∞ do
3:   m_t(h) ≜ ∇f(x_t)⊤h + h⊤∇²f(x_t)h/2 + (L/6)‖h‖³
4:   (λ, v, v_min) ← FastCubicMin(∇f(x_t), ∇²f(x_t), L, L₂, κ)
5:   h′ ← either v or λv_min/(2L), whichever gives the smaller value of m_t(h);
6:   Set x_{t+1} ≜ x_t + h′
7:   if m_t(h′) > −ε^{3/2}/(c√L) then return x_{t+1}.   (c is a constant; we proved that c = 2.4 × 10⁶ works)
8: end for

2.1 Main Results

The finite-sum setting captures much of supervised learning, including neural networks and generalized linear models. The main theorem which we show in our paper is as follows:

Theorem 1. FastCubic (Algorithm 1) starts from a point x₀ and outputs a point x such that ‖∇f(x)‖ ≤ ε and λ_min(∇²f(x)) ≥ −√(Lε), in total time (denoting D ≜ f(x₀) − f(x*)):
• Õ( (D√L/ε^{3/2}) · T_g + (D L^{1/4}√L₂/ε^{7/4}) · T_h ), or
• Õ( (D√L/ε^{3/2}) · (T_g + n·T_{h,1}) + (D n^{3/4} L^{1/4}√L₂/ε^{7/4}) · T_{h,1} ) in the finite-sum setting (see Definition 2.2).
Here Õ hides logarithmic factors in L, L₂, 1/ε, d, and in max_x ‖∇f(x)‖.

Two Known Subroutines.
The running time of FastCubic relies on the following recent results for approximate matrix inverse and approximate PCA.

Theorem 2.4 (approximate matrix inverse). Suppose a matrix M ∈ R^{d×d} satisfies ‖M‖ ≤ L₂ and λI + M ⪰ δI for constants λ, δ, L₂ > 0. Let κ ≜ (λ + L₂)/δ. Then we can compute a vector x satisfying

    ‖x − (λI + M)⁻¹ b‖ ≤ ε‖b‖ ,    (2.2)

using accelerated gradient descent (AGD) in O(κ^{1/2} log(κ/ε)) iterations, each requiring O(d) time plus the time needed to multiply M with a vector.
Moreover, suppose M = (1/n) Σ_{i=1}^n M_i, where each M_i is symmetric and satisfies ‖M_i‖ ≤ L₂. If M_i·b can be computed in time O(d′) for each i and vector b, then accelerated SVRG [4, 33] computes a vector x that satisfies equation (2.2) in time O( max{n, n^{3/4}κ^{1/2}} · d′ · log²(κ/ε) ).
We refer to the running time for this computation as T_inverse(κ, ε) and to the algorithm as A.

Above, the SVRG-based running time will be used only for our finite-sum case in Definition 2.2.

Theorem 2.5 (AppxPCA [3, 13, 14]). Let M ∈ R^{d×d} be a symmetric matrix with eigenvalues 1 ≥ λ₁ ≥ ⋯ ≥ λ_d ≥ 0. With probability at least 1 − p, AppxPCA produces a unit vector w satisfying

    w⊤ M w ≥ (1 − δ_×)(1 − ε) λ_max(M) .

The total running time is Õ( T_inverse(1/δ_×, εδ_×) ).

3 Our Fast Cubic Regularization Algorithm

Recall that the cubic regularization method of Nesterov and Polyak [28] studies the following upper bound on the change in objective value as we move from a point x_t to x_t + h (it follows simply from the Taylor series truncated to the third order):

    ∀h ∈ R^d :   f(x_t + h) − f(x_t) ≤ m_t(h) ≜ ∇f(x_t)⊤h + h⊤∇²f(x_t)h/2 + (L/6)‖h‖³ .    (3.1)

Denote by h* an arbitrary minimizer of m_t(h). We propose in this paper a subroutine FastCubicMin that minimizes m_t(h) approximately.
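The quantity needed from Theorem 2.4 is the application of (λI + M)⁻¹ to a vector using only products of M with vectors. Any Krylov-type solver has this access pattern; the sketch below is ours and uses plain conjugate gradients in place of AGD or accelerated SVRG (same oracle model and a similar √κ-style iteration behavior, but not the paper's exact algorithm):

```python
import numpy as np

def apply_inverse(matvec, lam, b, iters=500, tol=1e-10):
    """Approximate (lam*I + M)^-1 b given only matvec(v) = M @ v, via
    conjugate gradients on the positive-definite system (lam*I + M) x = b."""
    shifted = lambda v: lam * v + matvec(v)
    x = np.zeros_like(b)
    r = b - shifted(x)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = shifted(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol * np.linalg.norm(b):
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

M = np.diag([2.0, -0.5, 1.0])      # symmetric; lam*I + M must be PD
b = np.array([1.0, 2.0, 3.0])
lam = 1.0
x = apply_inverse(lambda v: M @ v, lam, b)
print(np.linalg.norm(x - np.linalg.solve(lam * np.eye(3) + M, b)))  # ~0
```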
Note that FastCubicMin returns two vectors v and v_min. We then choose h′ to be either v or λv_min/(2L), whichever gives the smaller value of m_t(h). Before discussing the details of FastCubicMin, let us first state the main theorem for FastCubicMin:³

Theorem 2 (guarantees of FastCubicMin). The algorithm FastCubicMin finds a vector h′ that satisfies:
(a) It produces a vector h′ satisfying m_t(h′) ≤ 0, and either 3000·m_t(h′) ≤ m_t(h*) or m_t(h*) ≥ −ε^{3/2}/(800√L).
(b) If m(h*) ≥ −ε^{3/2}/(300√L), then ‖h′‖ ≤ ‖h*‖ + √ε/(4√L) and ‖∇m_t(h′)‖ ≤ ε/2.
(c) FastCubicMin runs in time (using Õ to hide logarithmic factors in L, L₂, 1/ε, d, ‖∇f(x_t)‖):
• Õ( √L₂/(εL)^{1/4} · T_h ), where T_h is the time to multiply ∇²f(x_t) with a vector;
• Õ( max{ n, n^{3/4}√L₂/(εL)^{1/4} } · T_{h,1} ), where T_{h,1} is the time to multiply some ∇²f_i(x_t) with a vector.

Above, the first guarantee promises that we are either done (because m_t(h*) is close to zero), or we obtain a 1/3000 multiplicative approximation to m_t(h*). The second guarantee in Theorem 2 promises that when we are done (because m_t(h*) is close to zero), the output vector h′ and h* are roughly similar in Euclidean norm, and h′ has a small gradient ‖∇m_t(h′)‖. The third guarantee gives the time complexity of FastCubicMin.

Now, our final algorithm FastCubic for finding an ε-approximate local minimum of f(x) is given in Algorithm 1. It simply calls FastCubicMin iteratively to find an approximate minimizer, and it stops whenever m_t(h′) > −ε^{3/2}/(c√L) for some large constant c.

Roadmap. In Section 4 we show why Theorem 2 implies Theorem 1. All the remaining sections serve to prove Theorem 2. Because our FastCubicMin is very technical, instead of stating the algorithm right away, we take a different path.
In Section 5, we first state a lemma characterizing "what h* looks like". In Section 6, we provide a set of sufficient conditions which "look similar" to the characterization of h*, and show that as long as these conditions are met, Theorems 2-a and 2-b follow easily. Finally, in Section 7, we state FastCubicMin and explain why it satisfies these sufficient conditions and why it runs in the aforementioned time.

4 Theorem 2 implies Theorem 1

In this section we show that Theorem 2 implies Theorem 1. The proof relies on the following lemma (proved in Appendix B) giving a sufficient condition for reaching an ε-approximate local minimum.

³ To present the simplest result, we have not tried to improve the constant dependency in this paper.

Lemma 4.1. If m_t(h*) ≥ −ε^{3/2}/(800√L), and h′ is an approximate minimizer of m_t(h) satisfying ‖h′‖ ≤ ‖h*‖ + √ε/(4√L) and ‖∇m_t(h′)‖ ≤ ε/2, then

    ‖∇f(x_t + h′)‖ ≤ ε   and   λ_min(∇²f(x_t + h′)) ≥ −√(Lε) .

Proof of Theorem 1 from Theorem 2. When FastCubic terminates, we have m_t(h′) > −ε^{3/2}/(c√L); therefore, m_t(h*) ≥ −ε^{3/2}/(800√L) according to Theorem 2-a. Combining this with Theorem 2-b and Lemma 4.1, we conclude that in the last iteration of FastCubic, the output satisfies ‖∇f(x_t + h′)‖ ≤ ε and λ_min(∇²f(x_t + h′)) ≥ −√(Lε). This finishes the proof with respect to the accuracy conditions.

As for the running time: in every iteration except the last one, FastCubic satisfies m_t(h′) ≤ −Ω(ε^{3/2}/√L). Therefore, by (3.1), we must decrease the objective by at least Ω(ε^{3/2}/√L) in each such round, and this cannot happen for more than O( (f(x₀) − f*)·√L/ε^{3/2} ) iterations. The final running time of FastCubic follows from this bound together with Theorem 2-c. □

Therefore, in the rest of the paper it suffices to study FastCubicMin and prove Theorem 2.
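The descent argument above can be watched numerically. Below is a toy outer loop of our own on a 2-D nonconvex function, using a dense bisection-based model minimizer in place of FastCubicMin (so it illustrates Algorithm 1's skeleton and stopping rule, not the paper's linear-time subroutine); the iteration count stays far below the O(Dε^{−3/2}) ceiling because convergence is locally fast near a nondegenerate minimum:

```python
import numpy as np

def cubic_step(g, H, L):
    """One Nesterov-Polyak step on a small dense problem: bisect on
    L*||(H + lam I)^-1 g|| = 2*lam (nondegenerate case), then return
    h = -(H + lam I)^-1 g."""
    d = len(g)
    lo = max(0.0, -np.linalg.eigvalsh(H).min()) + 1e-12
    hi = lo + max(1.0, 2 * np.linalg.norm(H, 2) + np.sqrt(L * np.linalg.norm(g)))
    phi = lambda lam: (L * np.linalg.norm(np.linalg.solve(H + lam * np.eye(d), g))
                       - 2 * lam)
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if phi(mid) > 0 else (lo, mid)
    lam = (lo + hi) / 2
    return -np.linalg.solve(H + lam * np.eye(d), g)

# toy nonconvex objective f(x, y) = (x^2 - 1)^2 + y^2, started in a
# negative-curvature region; L = 50 is a crude Hessian-Lipschitz bound here
grad = lambda x: np.array([4 * x[0] * (x[0] ** 2 - 1), 2 * x[1]])
hess = lambda x: np.diag([12 * x[0] ** 2 - 4, 2.0])

L, eps = 50.0, 1e-4
x = np.array([0.5, 0.5])
for t in range(100):
    h = cubic_step(grad(x), hess(x), L)
    m = grad(x) @ h + 0.5 * h @ hess(x) @ h + (L / 6) * np.linalg.norm(h) ** 3
    x = x + h
    if m > -eps ** 1.5 / np.sqrt(L):
        break                          # same stopping rule as Algorithm 1
print(t, x)                            # few iterations; x near the minimum (1, 0)
```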
5 Characterization Lemma for the Minimizer h*

For notational simplicity, in this and the subsequent sections we focus on the following problem:

    minimize  m(h) ≜ g⊤h + h⊤Hh/2 + (L/6)‖h‖³ ,

where H is a symmetric matrix with ‖H‖₂ ≤ L₂. Recall from the previous section that we denote by h* an arbitrary minimizer of m(h). We have the following lemma, which characterizes h* (a variant of this lemma appeared in [8]; we prove it in the appendix for the sake of completeness):

Lemma 5.1. A vector h* is a minimizer of m(h) if and only if there exists λ* ≥ 0 such that

    H + λ* I ⪰ 0 ,   (H + λ* I) h* = −g ,   ‖h*‖ = 2λ*/L .

The objective value in this case is given by

    m(h*) = −(1/2) g⊤(H + λ* I)⁺ g − 2(λ*)³/(3L²) ≤ 0 .

The following corollary comes from Lemma 5.1 and its proof:

Corollary 5.2. The value λ* in Lemma 5.1 is unique, and for every λ satisfying H + λI ≻ 0 we have

    ‖(H + λI)⁻¹ g‖ > 2λ/L ⟺ λ* > λ   and   ‖(H + λI)⁻¹ g‖ < 2λ/L ⟺ λ* < λ .

From the above characterization we get a crude upper bound on λ*:

Proposition 5.3. We have λ* ≤ B ≜ max{ 2L₂ + √(L‖g‖), 1 }, with λ* defined in Lemma 5.1.

Proof. We have L‖(H + BI)⁻¹g‖ ≤ L‖g‖/λ_min(H + BI) ≤ L‖g‖/(B − L₂) < 2B, and therefore λ* ≤ B by Corollary 5.2. □

6 Sufficient Conditions for Theorems 2-a and 2-b

Without worrying about the design of FastCubicMin for the moment, let us first state a set of sufficient conditions under which the assumptions in Theorem 2-a can be satisfied.

Main Lemma 1. Consider an algorithm that outputs a real λ ∈ [0, 2B], a vector v ∈ R^d, and a unit vector v_min ∈ R^d.
Additionally, suppose numbers κ, ε̃ ≥ 0 satisfy the following conditions:

    ε̃ ≤ (1/10000) · 1/( max{ κ, L, L₂, ‖g‖, ‖(H + λI)⁻¹‖, B } )²⁰    (6.1)
    (H + (λ − Lε̃) I)⁻¹ ≻ 0    (6.2)

Moreover, suppose that the outputs (λ, v, v_min) satisfy one of the following two cases:

Case 1: L‖(H + λI)⁻¹g‖ ∈ [2λ − 2Lε̃, 2λ + 2Lε̃] and ‖v + (H + λI)⁻¹g‖ ≤ ε̃.

Case 2: The following conditions are satisfied:
(a) λ ≥ λ* and λ + λ_min(H) ≤ 1/κ;
(b) L‖(H + λI)⁻¹g‖ ≤ 2λ and ‖v + (H + λI)⁻¹g‖ ≤ ε̃;
(c) v_min⊤ H v_min ≤ λ_min(H) + 1/(10κ).

Then at least one of the two choices h′ ∈ { v, λv_min/(2L) } satisfies either m(h*) ≥ 3000·m(h′) or m(h*) ≥ −32/(κ³L²).

Let us compare these sufficient conditions to the characterization in Lemma 5.1.
• In Case 1, up to a very small error ε̃, we have essentially found a vector v that satisfies v ≈ −(H + λI)⁻¹g and ‖v‖ ≈ 2λ/L. Therefore, this v should be close to h* for the obvious reason. (This is the simple case.)
• In Case 2, we have only found a vector v that satisfies v ≈ −(H + λI)⁻¹g and ‖v‖ ≲ 2λ/L. In this case, we also compute an approximate smallest eigenvector v_min of H, accurate up to an additive 1/(10κ) (see Case 2-c). We will make sure that, as long as the conditions in Case 2-a hold, either v or λv_min/(2L) is an approximate minimizer of m_t(h). (This is the hard case.)

Proof of Main Lemma 1. We first consider Case 1. According to Corollary 5.2, if ε̃ = 0 then v is a minimizer of m(h). The following claim extends this argument to the setting ε̃ > 0:

Claim 6.1. If λ and v satisfy Case 1 and ε̃ satisfies (6.1), then m(v) ≤ m(h*) + 1/(250κ³L²).

From the above claim it follows that either m(h*) ≥ −8/(κ³L²), or otherwise m(h*) ≥ 1.1·m(v), which satisfies the conditions of the lemma.
We now consider Case 2, and in this case we make the following two claims:

Claim 6.2. If λ_min(H) ≤ −1/κ, then m(h*) ≥ 1500·min{ m(v), m(λv_min/(2L)) } − 1/(500κ³L²).

Claim 6.3. If λ_min(H) ≥ −1/κ, then m(h*) ≥ 2·m(v) − 16/(κ³L²).

Main Lemma 1 now follows from the two claims, because we can output the vector h′ with the lower value of m(h′) among the two choices h′ ∈ { v, λv_min/(2L) }; this h′ satisfies either m(h*) ≥ 3000·m(h′) or m(h*) ≥ −32/(κ³L²). The missing proofs of the three claims are deferred to Appendix D. □

The next main lemma shows that, under the same sufficient conditions as in Main Lemma 1, Theorem 2-b also holds. (Its proof is contained in Appendix E.)

Main Lemma 2. In the same setting as Main Lemma 1, suppose m(h*) ≥ −ε^{3/2}/(300√L). Then the output vector v satisfies

    ‖v‖ ≤ ‖h*‖ + 3/(κL)   and   ‖∇m(v)‖ ≤ ε/4 + 15/(κ²L) .

7 Main Algorithms for Theorem 2

We are now ready to state our main algorithm FastCubicMin and sketch why it satisfies the sufficient conditions in Main Lemma 1. As described in Algorithm 2, our algorithm starts with a very large choice λ₀ ← 2B and decreases it gradually. At each iteration i, it computes an approximate inverse v satisfying ‖v + (H + λ_i I)⁻¹g‖ ≤ ε̃ with respect to the current λ_i. Then there are three cases, depending on whether L‖v‖ is approximately equal to, larger than, or smaller than 2λ_i. At a high level, if it is "equal", then we have met Case 1 of Main Lemma 1; if it is "larger", then we can binary-search the correct value of λ* in the interval [λ_i, λ_{i−1}]; and if it is "smaller", then we need to compute an approximate eigenvector and carefully choose the next point λ_{i+1}.

We state our main lemma below regarding the correctness and running time of FastCubicMin.

Main Lemma 3.
FastCubicMin in Algorithm 2 outputs a real λ ∈ [0, 2B], a vector v ∈ R^d, and a unit vector v_min ∈ R^d satisfying one of the two sufficient conditions in Main Lemma 1. Moreover, the procedure can be implemented in a total running time of
• Õ( √(κL₂) ) · T_h, if accelerated gradient descent is used in Theorem 2.4 to invert matrices;
• Õ( max{ n, n^{3/4}√(κL₂) } ) · T_{h,1}, if we use accelerated SVRG as the subprocedure A in Theorem 2.4.
Here Õ hides logarithmic factors in L, L₂, κ, d, B.

We prove the correctness half of Main Lemma 3 below, and defer its running-time analysis to Appendix G.

7.1 Correctness Half of Main Lemma 3

We now establish the correctness of our algorithm. We first observe that the BinarySearch subroutine returns (λ, v, ∅) satisfying Case 1 of Main Lemma 1.

Fact 7.1. BinarySearch outputs a pair λ and v such that L‖(H + λI)⁻¹g‖ ∈ [2λ − 2Lε̃, 2λ + 2Lε̃] and ‖v + (H + λI)⁻¹g‖ ≤ ε̃.

Proof. The latter is guaranteed by Line 3 of BinarySearch, and the former is implied by the latter because L‖(H + λI)⁻¹g‖ ∈ [L‖v‖ − Lε̃/2, L‖v‖ + Lε̃/2] ⊆ [2λ − 2Lε̃, 2λ + 2Lε̃]. □

We also establish the following invariants regarding the values λ_i. (Proof in Appendix F.)

Lemma 7.2. The following statements hold for all i until FastCubicMin terminates:
(a) λ_i ∈ [0, 2B] and λ_i + λ_max(H) ≤ 3B;
(b) λ_i + λ_min(H) ≥ 3/(10κ);
(c) λ_{i+1} + λ_min(H) ≤ (3/4)(λ_i + λ_min(H)), unless λ_{i+1} = 0.
Moreover, when FastCubicMin terminates at Line 20, we have λ_i + λ_min(H) ≤ 1/κ.

We now prove that the output (λ, v, v_min) of FastCubicMin satisfies the sufficient conditions of Main Lemma 1.

Algorithm 2 FastCubicMin(g, H, L, L₂, κ)   (main algorithm for cubic minimization)
Input: a vector g; a symmetric matrix H satisfying −L₂I ⪯ H ⪯ L₂I; parameters κ, L, and L₂.
Output: (λ, v, v_min)
1: B ← L₂ + √(L‖g‖) + 1/κ.
2: ε̃ ← 1/( 10000 · max{ L, ‖g‖, 10κ/3, B, 1 }²⁰ )
3: λ₀ ← 2B.
4: for i = 0 to ∞ do
5:   Compute v such that ‖v + (H + λ_i I)⁻¹g‖ ≤ ε̃.
6:   if L‖v‖ ∈ [2λ_i − Lε̃, 2λ_i + Lε̃] then
7:     return (λ_i, v, ∅).
8:   else if L‖v‖ > 2λ_i + Lε̃ then
9:     return BinarySearch(λ₁ = λ_{i−1}, λ₂ = λ_i, ε̃).
10:  else if L‖v‖ < 2λ_i − Lε̃ then
11:    Let PowerMethod find a vector w that is a 9/10-approximate leading eigenvector of (H + λ_i I)⁻¹:
         (9/10)·λ_max((H + λ_i I)⁻¹) ≤ w⊤(H + λ_i I)⁻¹w ≤ λ_max((H + λ_i I)⁻¹).
12:    Compute a vector w̃ such that ‖w̃ − (H + λ_i I)⁻¹w‖ ≤ ε̂ ≜ 1/(60B).
13:    Δ ← (1/2) · 1/(w̃⊤w − ε̂).
14:    if Δ > 1/(2κ) then
15:      λ̃_{i+1} ← λ_i − Δ/2.
16:      if λ̃_{i+1} > 0 then λ_{i+1} ← λ̃_{i+1} else λ_{i+1} ← 0.
17:    else
18:      Use AppxPCA to find any unit vector v_min such that v_min⊤ H v_min ≤ λ_min(H) + 1/(10κ).
19:      Flip the sign of v_min so that g⊤v_min ≤ 0.
20:      return (λ_i, v, v_min).
21:    end if
22:  end if
23: end for

Algorithm 3 BinarySearch(λ₁, λ₂, ε̃)   (binary-search subroutine)
Input: λ₁ ≥ λ₂ with L‖(H + λ₁I)⁻¹g‖ ≤ 2λ₁, L‖(H + λ₂I)⁻¹g‖ ≥ 2λ₂, and λ₂ + λ_min(H) > 0
Output: (λ, v, ∅)
1: for t = 1 to ∞ do
2:   λ_mid ← (λ₁ + λ₂)/2
3:   Compute a vector v such that ‖v + (H + λ_mid I)⁻¹g‖ ≤ ε̃/2
4:   if L‖v‖ ∈ [2λ_mid − Lε̃, 2λ_mid + Lε̃] then
5:     return (λ_mid, v, ∅)
6:   else if L‖v‖ + Lε̃ ≤ 2λ_mid then
7:     λ₁ ← λ_mid
8:   else if L‖v‖ − Lε̃ ≥ 2λ_mid then
9:     λ₂ ← λ_mid
10:  end if
11: end for

Correctness proof of Main Lemma 3. We carefully verify the sufficient conditions:
• Lemma 7.2 implies λ_i ∈ [0, 2B].
• λ_i + λ_min(H) ≥ 3/(10κ) from Lemma 7.2 implies ‖(H + λ_i I)⁻¹‖ ≤ 4κ. It is now immediate that the choice of ε̃ on Line 2 satisfies Condition (6.1) in the assumption of Main Lemma 1.
• Since ε̃ ≤ 1/(10κL) and λ_i + λ_min(H) ≥ 3/(10κ), it follows that (H + (λ_i − Lε̃)I)⁻¹ ≻ 0, which proves Condition (6.2) of Main Lemma 1.
• We now verify Cases 1 and 2 in the assumption of Main Lemma 1. At the beginning of the algorithm, our choice λ₀ = 2B ensures (using Proposition 5.3) that L‖(H + λ₀I)⁻¹g‖ < 2λ₀. Let us now consider the various places where the algorithm outputs:
  – If FastCubicMin terminates at Line 7, then we have ‖v + (H + λ_i I)⁻¹g‖ ≤ ε̃ and additionally L‖(H + λ_i I)⁻¹g‖ ∈ [L‖v‖ − Lε̃, L‖v‖ + Lε̃] ⊆ [2λ_i − 2Lε̃, 2λ_i + 2Lε̃]. Therefore, the output meets the Case 1 requirement of Main Lemma 1 with λ = λ_i.
  – If FastCubicMin terminates at Line 9, then L‖(H + λ_i I)⁻¹g‖ > L‖v‖ − Lε̃ ≥ 2λ_i. Obviously, we must have i ≥ 1 in this case because L‖(H + λ₀I)⁻¹g‖ < 2λ₀. Therefore, Line 10 must have been reached in the previous iteration, which implies L‖(H + λ_{i−1} I)⁻¹g‖ < 2λ_{i−1}. Together, these two facts imply that we may call BinarySearch with (λ_{i−1}, λ_i). Owing to Fact 7.1, the subroutine outputs a pair (λ, v) satisfying the Case 1 requirement of Main Lemma 1.
  – If FastCubicMin terminates at Line 20, we verify that Case 2 of Main Lemma 1 holds with λ = λ_i. We first have L‖(H + λ_i I)⁻¹g‖ ≤ L‖v‖ + Lε̃ ≤ 2λ_i. By Corollary 5.2, we then have λ_i ≥ λ*. Lemma 7.2 tells us that λ_i satisfies λ_i + λ_min(H) ≤ 1/κ. The vector v satisfies ‖v + (H + λ_i I)⁻¹g‖ ≤ ε̃. The vector v_min satisfies v_min⊤ H v_min ≤ λ_min(H) + 1/(10κ).
In sum, we have verified that all the assumptions of Main Lemma 1 hold. □

Final proof of Theorem 2. Theorem 2 is a direct corollary of our main lemmas. Main Lemma 3 ensures that the assumptions of Main Lemma 1 and Main Lemma 2 both hold.
Now, using the special choice of $\kappa$ in FastCubic, Theorem 2-a immediately comes from Main Lemma 1; Theorem 2-b immediately comes from Main Lemma 2; and Theorem 2-c immediately comes from Main Lemma 3. This finishes the proof of Theorem 2.

Acknowledgements

We thank Ben Recht for very helpful suggestions and corrections to a previous version. Z. Allen-Zhu is supported by an NSF Grant, no. CCF-1412958, and a Microsoft Research Grant, no. 0518584. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of NSF or Microsoft.

Appendix

A Computing Hessian-Vector Products in Linear Time

In this section we sketch the intuition for why Hessian-vector products can be computed in linear time in many interesting (especially machine learning) problems. We start by showing that the gradient can be computed in linear time. The algorithm is often referred to as back-propagation, which dates back to Werbos's PhD thesis [35], and was popularized by Rumelhart et al. [31] for training neural networks.

Claim A.1 (back-propagation, informally stated). Suppose a real-valued function $f : \mathbb{R}^d \to \mathbb{R}$ can be evaluated by a differentiable circuit of size $N$. Then, the gradient $\nabla f$ can be computed in time $O(N + d)$ (using a circuit of size $O(N + d)$).

The claim follows from simple induction and the chain rule, and is left to the reader. In the training of neural networks, the size of the circuit that computes the objective $f$ is often proportional to (or equal to) the number of parameters $d$. Thus the gradient $\nabla f$ can be computed in time $O(d)$ using a circuit of size $O(d)$.

Next, we consider computing $\nabla^2 f(x) \cdot v$ where $v \in \mathbb{R}^d$. Let $g(x) := \langle \nabla f(x), v \rangle$ be a function from $\mathbb{R}^d$ to $\mathbb{R}$. Then, we see that it suffices to compute the gradient of $g$, since $\nabla^2 f(x) \cdot v = \nabla g(x)$.
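As a concrete illustration of this identity (this numerical sketch and all names in it are ours, not part of the paper's algorithm), one can verify on a toy non-convex function that the gradient of $g(x) = \langle \nabla f(x), v \rangle$ equals the Hessian-vector product $\nabla^2 f(x)\, v$. Here $f(x) = \frac{1}{4}\|x\|^4 + b^\top x$, whose gradient and Hessian are available in closed form, and $\nabla g$ is approximated by finite differences (an autodiff system would instead back-propagate through $g$ at the cost of one extra gradient computation):

```python
import numpy as np

# Illustrative function: f(x) = 0.25 * ||x||^4 + b.x, with closed forms
#   grad f(x) = ||x||^2 * x + b,    Hess f(x) = ||x||^2 * I + 2 x x^T.
d = 5
rng = np.random.default_rng(0)
b = rng.standard_normal(d)
x = rng.standard_normal(d)
v = rng.standard_normal(d)

def grad_f(x):
    return np.dot(x, x) * x + b

def hess_f(x):
    return np.dot(x, x) * np.eye(d) + 2.0 * np.outer(x, x)

# g(x) = <grad f(x), v>; its gradient is exactly the product Hess f(x) v.
def g(x):
    return grad_f(x) @ v

# Central finite differences stand in for a second back-propagation pass.
def grad_g_fd(x, h=1e-5):
    out = np.zeros(d)
    for i in range(d):
        e = np.zeros(d)
        e[i] = h
        out[i] = (g(x + e) - g(x - e)) / (2 * h)
    return out

hvp = hess_f(x) @ v            # exact Hessian-vector product
assert np.allclose(grad_g_fd(x), hvp, atol=1e-6)
```

The point of the identity is that the second pass never forms the $d \times d$ Hessian: only gradients of the scalar function $g$ are needed, which keeps the cost linear in $d$.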
We observe that $g(x)$ can be evaluated in linear time using a circuit of size $O(d)$, since we have shown $\nabla f(x)$ can. Thus, using Claim A.1 again on the function $g$, we conclude that $\nabla g(x)$ can also be computed in linear time. (Technically, we assume here that the gradient of each gate can be computed in $O(1)$ time, and that the original circuits are twice differentiable.)

B Proof of Lemma B.1 and Corollary 4.1

Lemma B.1. For all $h' \in \mathbb{R}^d$, it satisfies

$\|\nabla f(x_t + h')\| \le L\|h'\|^2 + \|\nabla m_t(h')\|$ and $\lambda_{\min}(\nabla^2 f(x_t + h')) \ge -\left(\frac{3L^2 \max\{0, -m_t(h^*)\}}{2}\right)^{1/3} - L\|h'\|$.

Proof of Lemma B.1. Let us denote by $g = \nabla f(x_t)$ and $H = \nabla^2 f(x_t)$ in this proof. We begin by proving the first-order condition. Note that we have $\nabla m_t(h) = g + Hh + \frac{L}{2}\|h\| h$. Recall $h^*$ is a minimizer of $m_t(h)$. The characterization result in Lemma 5.1 shows $H + \frac{L\|h^*\|}{2} I \succeq 0$, and thus

$g^\top h^* + (h^*)^\top H h^* + \frac{L}{2}\|h^*\|^3 = \nabla m_t(h^*)^\top h^* = 0$   (B.1)
$(h^*)^\top H h^* + \frac{L}{2}\|h^*\|^3 = (h^*)^\top \left(H + \frac{L\|h^*\|}{2} I\right) h^* \ge 0$.   (B.2)

They imply

$m_t(h^*) = g^\top h^* + \frac{(h^*)^\top H h^*}{2} + \frac{L\|h^*\|^3}{6} \stackrel{①}{=} -\frac{(h^*)^\top H h^*}{2} - \frac{L}{3}\|h^*\|^3 \stackrel{②}{\le} \frac{L}{4}\|h^*\|^3 - \frac{L}{3}\|h^*\|^3 = -\frac{L}{12}\|h^*\|^3$,   (B.3)

where ① uses (B.1) and ② uses (B.2). We compute the norm of the gradient at a point $x_t + h'$ for any $h' \in \mathbb{R}^d$:

$\|\nabla f(x_t + h')\| \le \|\nabla f(x_t + h') - \nabla m_t(h')\| + \|\nabla m_t(h')\|$
$= \left\| \nabla f(x_t) + \int_0^1 \nabla^2 f(x_t + \tau h') h' \, d\tau - \left(g + H h' + \frac{L}{2}\|h'\| h'\right) \right\| + \|\nabla m_t(h')\|$
$\le \left\| \int_0^1 \left(\nabla^2 f(x_t + \tau h') - H\right) h' \, d\tau \right\| + \frac{L}{2}\|h'\|^2 + \|\nabla m_t(h')\|$
$\stackrel{③}{\le} L\|h'\|^2 \int_0^1 \tau \, d\tau + \frac{L}{2}\|h'\|^2 + \|\nabla m_t(h')\| = L\|h'\|^2 + \|\nabla m_t(h')\|$,   (B.4)

where ③ follows from the Lipschitz continuity of the Hessian (2.1). This proves the first conclusion of the lemma.
As for the second-order condition, we first note that for all $h' \in \mathbb{R}^d$, by the Lipschitz continuity of the Hessian (2.1), we have $\|\nabla^2 f(x_t + h') - \nabla^2 f(x_t)\| \le L\|h'\|$. This implies

$\lambda_{\min}(\nabla^2 f(x_t + h')) \ge \lambda_{\min}(\nabla^2 f(x_t)) - L\|h'\|$,   (B.5)

because if two matrices $A$ and $B$ satisfy $\|A - B\| \le p$, then it must hold that $|\lambda_{\min}(A) - \lambda_{\min}(B)| \le p$ as well. We consider two cases. If $\lambda_{\min}(\nabla^2 f(x_t)) \ge 0$, then we have

$\lambda_{\min}(\nabla^2 f(x_t + h')) \ge -L\|h'\|$.   (B.6)

Otherwise, we consider the case where $\lambda_{\min}(\nabla^2 f(x_t)) = \lambda_{\min}(H) < 0$. Let $\nu_d$ be the normalized eigenvector corresponding to $\lambda_{\min}(H)$, and define $\tilde{h} \triangleq \operatorname{sign}(g^\top \nu_d) \cdot \frac{2\lambda_{\min}(H)}{L} \nu_d$. We calculate $m_t(\tilde{h})$ as follows:

$m_t(\tilde{h}) = g^\top \tilde{h} + \frac{\tilde{h}^\top H \tilde{h}}{2} + \frac{L}{6}\|\tilde{h}\|^3 \le \frac{\tilde{h}^\top H \tilde{h}}{2} + \frac{L}{6}\|\tilde{h}\|^3 = \frac{2(\lambda_{\min}(H))^2}{L^2} \nu_d^\top H \nu_d + \frac{4|\lambda_{\min}(H)|^3}{3L^2} \stackrel{①}{=} \frac{2(\lambda_{\min}(H))^3}{L^2} + \frac{4|\lambda_{\min}(H)|^3}{3L^2} \stackrel{②}{=} \frac{2(\lambda_{\min}(H))^3}{3L^2}$,   (B.7)

where ① uses $\nu_d^\top H \nu_d = \lambda_{\min}(H) < 0$, and ② uses the assumption that $\lambda_{\min}(H) < 0$. Since by definition $m_t(h^*) \le m_t(\tilde{h})$, we can deduce from inequality (B.7) that

$\lambda_{\min}(\nabla^2 f(x_t)) = \lambda_{\min}(H) \ge -\left(\frac{3L^2 |m_t(h^*)|}{2}\right)^{1/3}$.   (B.8)

Now we put together inequalities (B.5) and (B.8) and obtain

$\lambda_{\min}(\nabla^2 f(x_t + h')) \ge -\left(\frac{3L^2 |m_t(h^*)|}{2}\right)^{1/3} - L\|h'\|$.   (B.9)

Combining (B.6) and (B.9), we finish the proof of Lemma B.1.

Corollary 4.1. If $m_t(h^*) \ge -\frac{\varepsilon^{3/2}}{800\sqrt{L}}$ and $h'$ is an approximate minimizer of $m_t(h)$ satisfying $\|h'\| \le \|h^*\| + \frac{\sqrt{\varepsilon}}{4\sqrt{L}}$ and $\|\nabla m_t(h')\| \le \frac{\varepsilon}{2}$, then we have $\|\nabla f(x_t + h')\| \le \varepsilon$ and $\lambda_{\min}(\nabla^2 f(x_t + h')) \ge -\sqrt{L\varepsilon}$.

Proof of Corollary 4.1. First of all, our assumption that $m_t(h^*) \ge -\frac{\varepsilon^{3/2}}{800\sqrt{L}}$, along with inequality (B.3), tells us that $\|h^*\| \le \frac{\sqrt{\varepsilon}}{4\sqrt{L}}$. This, together with our assumption on $\|h'\|$, implies $\|h'\| \le \frac{\sqrt{\varepsilon}}{2\sqrt{L}}$.
Since we also assume $\|\nabla m_t(h')\| \le \frac{\varepsilon}{2}$, we have from Lemma B.1 that

$\|\nabla f(x_t + h')\| \le L\|h'\|^2 + \|\nabla m_t(h')\| \le \frac{\varepsilon}{4} + \frac{\varepsilon}{2} \le \varepsilon$.

For the second-order condition, we can again apply Lemma B.1 to get

$\lambda_{\min}(\nabla^2 f(x_t + h')) \ge -\left(\frac{3L^2 \max\{0, -m_t(h^*)\}}{2}\right)^{1/3} - L\|h'\| \ge -\left(\frac{3L^{3/2}\varepsilon^{3/2}}{1600}\right)^{1/3} - \frac{\sqrt{L\varepsilon}}{2} \ge -\sqrt{L\varepsilon}$.

C Proof of Lemma 5.1 and Corollary 5.2

We begin by proving a few lemmas that characterize the system of equations.

Lemma C.1. Consider the following system of equations/inequalities in variables $\lambda, h$:

$H + \lambda I \succeq 0$,  $(H + \lambda I) h = -g$,  $\|h\| = \frac{2\lambda}{L}$.   (C.1)

The following statements hold for any solution $(\lambda_0, h_0)$ of the above system:
• There is a unique value $\lambda_0$ that satisfies the above equations, and it satisfies $\lambda_0 \ge -\lambda_{\min}(H)$.
• If $\lambda_0 > -\lambda_{\min}(H)$, then the corresponding $h_0$ is also unique and is given by $h_0 = -(H + \lambda_0 I)^{-1} g$.
• If $\lambda_0 = -\lambda_{\min}(H)$, then $g^\top v = 0$ for any vector $v$ belonging to the eigenspace corresponding to $\lambda_{\min}(H)$. Subsequently, we also have that the corresponding $h_0$ is of the form $h_0 = -(H + \lambda_0 I)^{+} g + \gamma v$ for some $\gamma$ and some $v$ in the lowest eigenspace of $H$.

Proof of Lemma C.1. Note that $H + \lambda I \succeq 0$ ensures that for any solution $\lambda_0$, we have $\lambda_0 \ge -\lambda_{\min}(H)$. Furthermore, for any $\lambda_0 > -\lambda_{\min}(H)$, the corresponding $h_0$ is uniquely defined by $h_0 = -(H + \lambda_0 I)^{-1} g$, since $H + \lambda_0 I$ is invertible. If indeed $\lambda_0 = -\lambda_{\min}(H)$, then we have that the equation $(H - \lambda_{\min}(H) I) h = -g$ has a solution. This implies that $g$ has no component in the null space of $H - \lambda_{\min}(H) I$, or equivalently that it has no component in the eigenspace corresponding to $\lambda_{\min}(H)$. We also have that every solution of $(H - \lambda_{\min}(H) I) h = -g$ is necessarily of the form $h = -(H - \lambda_{\min}(H) I)^{+} g + \gamma v$ for some $\gamma$ and some $v$ in the lowest eigenspace of $H$. We will now prove the uniqueness of $\lambda_0$ by contradiction.
Consider two distinct values $\lambda_1, \lambda_2$ that satisfy the system (C.1). If both $\lambda_1, \lambda_2 > -\lambda_{\min}(H)$, we get that

$\|(H + \lambda_1 I)^{-1} g\| = \frac{2\lambda_1}{L}$ and $\|(H + \lambda_2 I)^{-1} g\| = \frac{2\lambda_2}{L}$.

Now note that $\|(H + \lambda I)^{-1} g\|$ is a strictly decreasing function over the domain $\lambda \in (-\lambda_{\min}(H), \infty)$, while $\frac{2\lambda}{L}$ is strictly increasing over the same domain. Therefore the above two equations cannot both be satisfied for two distinct $\lambda_1, \lambda_2 > -\lambda_{\min}(H)$, which is a contradiction.

Suppose now, without loss of generality, that $\lambda_1 = -\lambda_{\min}(H)$. Then we have that the corresponding solution is of the form $h = -(H + \lambda_1 I)^{+} g + \gamma v$ for some $\gamma$ and some $v$ in the lowest eigenspace of $H$, and $g$ has no component in the lowest eigenspace of $H$. It follows that $\|(H - \lambda_{\min}(H) I)^{+} g\| \ge \|(H + \lambda I)^{-1} g\|$ for any $\lambda > -\lambda_{\min}(H)$. By a similar argument as in the first case, we can now see that the conditions

$\|-(H + \lambda_1 I)^{+} g + \gamma v\| = \frac{2\lambda_1}{L}$ and $\|(H + \lambda_2 I)^{-1} g\| = \frac{2\lambda_2}{L}$

cannot both be satisfied for $\lambda_2 > \lambda_1 = -\lambda_{\min}(H)$, giving us a contradiction. This finishes the proof of Lemma C.1.

Lemma C.2. Let $(\lambda, h)$ be a solution of the system (C.1). Then we have

$m(h) = -\frac{1}{2} g^\top (H + \lambda I)^{+} g - \frac{2\lambda^3}{3L^2}$.

Proof of Lemma C.2. By the definition of the system (C.1), any solution $(\lambda, h)$ must be such that $h = -(H + \lambda I)^{+} g + \gamma v_0$ for some $\gamma$, where $v_0$ is in the null space of $H + \lambda I$ if it exists; otherwise $\gamma = 0$. This gives us the following:

$m(h) = g^\top h + \frac{h^\top H h}{2} + \frac{L}{6}\|h\|^3 \stackrel{①}{=} -\frac{1}{2} h^\top (H + \lambda I) h - \frac{\lambda}{2}\|h\|^2 + \frac{L}{6}\|h\|^3 \stackrel{②}{=} -\frac{1}{2} g^\top (H + \lambda I)^{+} g - \frac{2\lambda^3}{3L^2}$.

Equality ① follows because $(H + \lambda I) h = -g$. Equality ② follows because $h = -(H + \lambda I)^{+} g + \gamma v_0$ and $\|h\| = \frac{2\lambda}{L}$.

Lemma 5.1. $h^*$ is a minimizer of $m(h)$ if and only if there exists $\lambda^* \ge 0$ such that

$H + \lambda^* I \succeq 0$,  $(H + \lambda^* I) h^* = -g$,  $\|h^*\| = \frac{2\lambda^*}{L}$.
The objective value in this case is given by $m(h^*) = -\frac{1}{2} g^\top (H + \lambda^* I)^{+} g - \frac{2(\lambda^*)^3}{3L^2} \le 0$.

Proof of Lemma 5.1. We first compute that

$\nabla m(h) = g + Hh + \frac{L}{2}\|h\| h$ and $\nabla^2 m(h) = H + \frac{L}{2}\|h\| I + \frac{L}{2}\|h\| \cdot \frac{h}{\|h\|} \left(\frac{h}{\|h\|}\right)^\top$.

For the forward direction, suppose $h^*$ is a minimizer of $m(h)$. Let $\lambda^* = \frac{L}{2}\|h^*\|$. Then, the necessary conditions $\nabla m(h^*) = 0$ and $\nabla^2 m(h^*) \succeq 0$ can be written as

$g + (H + \lambda^* I) h^* = 0$ and $w^\top \left(H + \lambda^* I + \lambda^* \frac{h^*}{\|h^*\|} \left(\frac{h^*}{\|h^*\|}\right)^\top\right) w \ge 0$ for all $w \in \mathbb{R}^d$.   (C.2)

From this we see $(H + \lambda^* I) h^* = -g$ and $\|h^*\| = \frac{2\lambda^*}{L}$, and the only thing left to verify is $H + \lambda^* I \succeq 0$. Note that if $h^* = 0$, then the second inequality in (C.2) directly implies $H + \lambda^* I \succeq 0$. Thus, we only need to focus on $h^* \ne 0$.

We want to show that $w^\top (H + \lambda^* I) w \ge 0$ for every $w \in \mathbb{R}^d$. Now, if $w^\top h^* = 0$, then this trivially follows from (C.2), so it suffices to focus on those $w$ that satisfy $w^\top h^* \ne 0$. Since $w$ and $h^*$ are not orthogonal, there exists $\gamma \in \mathbb{R} \setminus \{0\}$ such that $\|h^* + \gamma w\| = \|h^*\|$. (This can be seen by squaring both sides and solving for $\gamma$.) Squaring both sides, we have

$(\gamma w)^\top h^* + \frac{\gamma^2 \|w\|^2}{2} = 0$.   (C.3)

Now we bound the difference (note that the cubic terms cancel because $\|h^* + \gamma w\| = \|h^*\|$):

$m(h^* + \gamma w) - m(h^*) = g^\top ((h^* + \gamma w) - h^*) + \frac{(h^* + \gamma w)^\top H (h^* + \gamma w)}{2} - \frac{(h^*)^\top H h^*}{2}$
$\stackrel{①}{=} (h^* - (h^* + \gamma w))^\top (H + \lambda^* I) h^* + \frac{(h^* + \gamma w)^\top H (h^* + \gamma w)}{2} - \frac{(h^*)^\top H h^*}{2}$
$\stackrel{②}{=} \frac{\lambda^* \gamma^2}{2} \|w\|^2 + (h^* - (h^* + \gamma w))^\top H h^* + \frac{(h^* + \gamma w)^\top H (h^* + \gamma w)}{2} - \frac{(h^*)^\top H h^*}{2}$
$= \frac{\lambda^* \gamma^2}{2} \|w\|^2 + \frac{(h^*)^\top H h^*}{2} - (h^* + \gamma w)^\top H h^* + \frac{(h^* + \gamma w)^\top H (h^* + \gamma w)}{2}$
$= \frac{\lambda^* \gamma^2}{2} \|w\|^2 + \frac{\gamma^2}{2} w^\top H w = \frac{\gamma^2}{2} w^\top (H + \lambda^* I) w$,   (C.4)

where ① and ② follow from (C.2) and (C.3), respectively. Since $h^*$ is a minimizer of $m(h)$, we immediately have $m(h^* + \gamma w) - m(h^*) = \frac{\gamma^2}{2} w^\top (H + \lambda^* I) w \ge 0$, and we conclude that $H + \lambda^* I \succeq 0$.
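As an illustrative aside (this sketch is ours, not the paper's implementation), the characterization in Lemma 5.1 can be checked numerically: by Corollary 5.2 below, the function $p(\lambda) = \frac{2\lambda}{L} - \|(H + \lambda I)^{-1} g\|$ is strictly increasing and vanishes exactly at $\lambda^*$, so a simple bisection on $\lambda$ (in the spirit of the BinarySearch subroutine of Algorithm 3) recovers the global minimizer of $m(h)$:

```python
import numpy as np

# Minimal sketch: minimize m(h) = g.h + h'Hh/2 + (L/6)||h||^3 via the
# characterization (H + lam I) h = -g, ||h|| = 2 lam / L, H + lam I >= 0.
rng = np.random.default_rng(1)
d, L = 4, 2.0
A = rng.standard_normal((d, d))
H = (A + A.T) / 2                       # symmetric, typically indefinite
g = rng.standard_normal(d)

def m(h):
    return g @ h + h @ H @ h / 2 + (L / 6) * np.linalg.norm(h) ** 3

lam_min = np.linalg.eigvalsh(H)[0]

def p(lam):
    # Strictly increasing on (-lam_min, infinity); its root is lam*.
    h = np.linalg.solve(H + lam * np.eye(d), -g)
    return 2 * lam / L - np.linalg.norm(h)

# Bisection for the root of p, mirroring the BinarySearch subroutine.
lo = max(0.0, -lam_min) + 1e-9
hi = lo + 1.0
while p(hi) < 0:                        # grow the bracket until p changes sign
    hi *= 2
for _ in range(200):
    mid = (lo + hi) / 2
    if p(mid) < 0:
        lo = mid
    else:
        hi = mid
lam_star = (lo + hi) / 2
h_star = np.linalg.solve(H + lam_star * np.eye(d), -g)

# Check the optimality conditions and global minimality on random trial points.
assert abs(np.linalg.norm(h_star) - 2 * lam_star / L) < 1e-6
trials = rng.standard_normal((1000, d))
assert all(m(h_star) <= m(h) + 1e-8 for h in trials)
```

The final assertions confirm both $\|h^*\| = 2\lambda^*/L$ and that no random trial point beats $h^*$, consistent with $m(h^*) \le 0$ in Lemma 5.1.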
We prove the backward direction by showing that the conditions in Equation (C.2) determine the minimizer up to its norm. To this end we will use Lemma C.1 and Lemma C.2. First, we note that the function $m(h)$ is continuous, bounded from below, and tends to $+\infty$ as $\|h\| \to \infty$, so there exists at least one minimizer $h^*$. Suppose now there exist a $\lambda^*$ and a corresponding $h^*$ such that $(\lambda^*, h^*)$ is a solution to the system (C.1). The backward direction requires us to prove that $h^*$ must be a minimizer of $m(h)$. By Lemma C.1 we get the following two cases.
• If $\lambda^* > -\lambda_{\min}(H)$, then $(\lambda^*, h^*)$ is the only solution to the system (C.1). By the proof of the forward direction, any minimizer of $m(h)$ must satisfy the system (C.1), and therefore $h^*$ must be the minimizer.
• Otherwise, $\lambda^* = -\lambda_{\min}(H)$. Let $h_0$ be any minimizer of $m(h)$. Lemma C.1 and the proof of the forward direction ensure that $(\lambda^*, h_0)$ also satisfies the system (C.1). By Lemma C.2 we get $m(h^*) = m(h_0)$, and therefore $h^*$ is a minimizer too.

Corollary 5.2. This value $\lambda^*$ is unique, and for every $\lambda$ satisfying $H + \lambda I \succ 0$, we have

$\|(H + \lambda I)^{-1} g\| > \frac{2\lambda}{L} \iff \lambda^* > \lambda$ and $\|(H + \lambda I)^{-1} g\| < \frac{2\lambda}{L} \iff \lambda^* < \lambda$.

Proof of Corollary 5.2. The uniqueness of $\lambda^*$ follows from Lemma C.1. To prove the second part, we first make some observations about the function

$p(y) \triangleq \frac{2y}{L} - \|(H + y I)^{-1} g\|$

defined on the domain $y \in (-\lambda_{\min}(H), \infty)$.
Note that $p(y)$ is continuous and strictly increasing over this domain, and $p(y) \to \infty$ as $y \to \infty$. The corollary requires us to show that $p(\lambda) < 0 \iff \lambda^* > \lambda$ and $p(\lambda) > 0 \iff \lambda^* < \lambda$.

We begin by showing the first equivalence. To see the backward direction, note that if $\lambda^* > \lambda > -\lambda_{\min}(H)$, then by the characterization of $\lambda^*$ in Lemma C.1 we have $\|(H + \lambda^* I)^{-1} g\| = \frac{2\lambda^*}{L}$, i.e., $p(\lambda^*) = 0$, which implies $p(\lambda) < 0$ as $p(y)$ is strictly increasing. For the forward direction, note that since $p(y)$ is continuous and strictly increasing, the range of the function contains $[p(\lambda), \infty)$. Since $p(\lambda) < 0$, there must exist $\lambda^* > \lambda$ such that $p(\lambda^*) = 0$, which by the characterization in Lemma C.1 finishes the proof.

Now we prove that $p(\lambda) > 0 \iff \lambda^* < \lambda$. To see the forward direction, note that if $\lambda^* \ge \lambda$, then $p(\lambda^*) = 0$ and $p(\lambda) > 0$, which contradicts the fact that $p(y)$ is strictly increasing. For the backward direction, we consider two cases. First, if $\lambda^* > -\lambda_{\min}(H)$, the conclusion follows by the monotonicity of $p(y)$. If $\lambda^* = -\lambda_{\min}(H)$, then by Lemma C.1 we have that $g$ has no component in the lowest eigenspace of $H$, and therefore if we extend $p(y)$ to $-\lambda_{\min}(H)$ by defining

$p(-\lambda_{\min}(H)) \triangleq \frac{-2\lambda_{\min}(H)}{L} - \|(H - \lambda_{\min}(H) I)^{+} g\|$,

we get that $p(y)$ is increasing on the domain $y \in [-\lambda_{\min}(H), \infty)$. Now, from the characterization of the solution in Lemma C.1, we can see that $p(-\lambda_{\min}(H)) \ge 0$, and therefore by monotonicity $p(\lambda) > 0$. This finishes the proof.

D Proof of Main Lemma 1

D.1 Proof of Claim 6.1

Claim 6.1. If $\lambda$ and $v$ satisfy Case 1 and $\tilde{\varepsilon}$ satisfies (6.1), then $m(v) \le m(h^*) + \frac{1}{250\kappa^3 L^2}$.

Proof of Claim 6.1.
Note that by the conditions of the theorem we have $(H + (\lambda - L\tilde{\varepsilon}) I)^{-1} \succ 0$ and

$L\|(H + (\lambda - L\tilde{\varepsilon}) I)^{-1} g\| \ge 2\lambda - 2L\tilde{\varepsilon}$ and $L\|(H + (\lambda + L\tilde{\varepsilon}) I)^{-1} g\| \le 2\lambda + 2L\tilde{\varepsilon}$,

so according to Corollary 5.2 we must have

$\lambda^* \in [\lambda - L\tilde{\varepsilon},\, \lambda + L\tilde{\varepsilon}]$.   (D.1)

This also implies (using our assumption on $\tilde{\varepsilon}$) that $L\|v\| \in [2\lambda^* - 5L\tilde{\varepsilon},\, 2\lambda^* + 5L\tilde{\varepsilon}]$. Next, consider the value $m(v)$:

$m(v) = g^\top v + \frac{v^\top H v}{2} + \frac{L}{6}\|v\|^3 = \left(g^\top v + \frac{v^\top (H + \lambda I) v}{2}\right) - \|v\|^2 \left(\frac{\lambda}{2} - \frac{L\|v\|}{6}\right)$.   (D.2)

We bound the two parts on the right-hand side of (D.2) separately. For the first part,

$g^\top v + \frac{v^\top (H + \lambda I) v}{2} \le -\frac{g^\top (H + \lambda I)^{-1} g}{2} + \|g\|\tilde{\varepsilon} + \|(H + \lambda I)^{-1} g\|\tilde{\varepsilon} + \frac{\|(H + \lambda I)^{-1}\|\tilde{\varepsilon}^2}{2}$
$\stackrel{①}{\le} -\frac{g^\top (H + \lambda I)^{-1} g}{2} + \frac{1}{1000\kappa^3 L^2}$   (D.3)
$\stackrel{②}{\le} -\frac{g^\top (H + \lambda^* I)^{-1} g}{2} + L\|g\|^2 \|(H + \lambda I)^{-1}\| \|(H + (\lambda + 2L\tilde{\varepsilon}) I)^{-1}\| \tilde{\varepsilon} + \frac{1}{1000\kappa^3 L^2}$
$\stackrel{③}{\le} -\frac{g^\top (H + \lambda^* I)^{-1} g}{2} + \frac{1}{500\kappa^3 L^2}$.

Above, inequalities ① and ③ use the assumption on $\tilde{\varepsilon}$ in (6.1), and inequality ② uses

$-(H + \lambda I)^{-1} \preceq -(H + (\lambda^* + L\tilde{\varepsilon}) I)^{-1} = -(H + \lambda^* I)^{-1} + L\tilde{\varepsilon}\, (H + \lambda^* I)^{-1} (H + (\lambda^* + L\tilde{\varepsilon}) I)^{-1}$.

Note that $(H + \lambda^* I)^{-1} \succ 0$ by Equations (D.1) and (6.2). The second part of (D.2) can be bounded as follows:

$\|v\|^2 \left(\frac{\lambda}{2} - \frac{L\|v\|}{6}\right) \ge \left(\frac{2\lambda^* - 5L\tilde{\varepsilon}}{L}\right)^2 \left(\frac{\lambda^* - L\tilde{\varepsilon}}{2} - \frac{2\lambda^* + 5L\tilde{\varepsilon}}{6}\right) \ge \frac{2(\lambda^*)^3}{3L^2} - \frac{1000\tilde{\varepsilon}(\lambda^*)^2}{L} \stackrel{①}{\ge} \frac{2(\lambda^*)^3}{3L^2} - \frac{1}{500\kappa^3 L^2}$.

Above, inequality ① uses $\lambda^* \le B$ (owing to Proposition 5.3) and our assumption on $\tilde{\varepsilon}$ from (6.1). Putting these together, we get $m(v) \le m(h^*) + \frac{1}{250\kappa^3 L^2}$.

D.2 Proofs of Claims 6.2 and 6.3

For notational simplicity, let us rotate the space into the eigenbasis of $H$; let the $i$-th dimension correspond to the $i$-th largest eigenvalue $\lambda_i$ of $H$, so that $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d = \lambda_{\min}$. Let $g_i$ denote the $i$-th coordinate of $g$ in this basis.
Lemma 5.1 implies

$m(h^*) = -\frac{1}{2} \sum_i \frac{g_i^2}{\lambda_i + \lambda^*} - \frac{2(\lambda^*)^3}{3L^2} =: \frac{1}{2}(S_1 + S_2) - \frac{2(\lambda^*)^3}{3L^2}$,   (D.4)

where we denote

$S_1 = -\sum_{i : \lambda_i + \lambda^* \ge \frac{1}{\kappa}} \frac{g_i^2}{\lambda_i + \lambda^*}$ and $S_2 = -\sum_{i : 0 < \lambda_i + \lambda^* < \frac{1}{\kappa}} \frac{g_i^2}{\lambda_i + \lambda^*}$.

From Corollary 5.2 we can also obtain

$\sum_{i : \lambda_i + \lambda^* > 0} \frac{g_i^2}{(\lambda_i + \lambda^*)^2} \le \frac{4(\lambda^*)^2}{L^2}$.   (D.5)

Moreover, the assumption $\|(H + \lambda I)^{-1} g\| \le \frac{2\lambda}{L}$ is equivalent to

$\sum_i \frac{g_i^2}{(\lambda_i + \lambda)^2} \le \frac{4\lambda^2}{L^2}$.   (D.6)

We begin with a few auxiliary claims.

Claim D.1. If $\lambda_{\min}(H) \le -\frac{1}{\kappa}$, then $S_2 \ge 1000 \cdot m\!\left(\frac{\lambda v_{\min}}{2L}\right)$.

Proof of Claim D.1. We compute that

$S_2 = -\sum_{i : 0 < \lambda_i + \lambda^* < \frac{1}{\kappa}} \frac{g_i^2}{\lambda_i + \lambda^*} = -\sum_{i : 0 < \lambda_i + \lambda^* < \frac{1}{\kappa}} \frac{g_i^2 (\lambda_i + \lambda^*)}{(\lambda_i + \lambda^*)^2} \ge -\frac{1}{\kappa} \sum_{i : 0 < \lambda_i + \lambda^* < \frac{1}{\kappa}} \frac{g_i^2}{(\lambda_i + \lambda^*)^2} \stackrel{①}{\ge} -\frac{4(\lambda^*)^2}{\kappa L^2} \stackrel{②}{\ge} -\frac{16 |\lambda_{\min}|^3}{L^2}$.   (D.7)

Above, ① uses (D.5), and ② follows because we have $\lambda_{\min}(H) \le -\frac{1}{\kappa}$ in the assumption and $\lambda^* \le -\lambda_{\min}(H) + \frac{1}{\kappa}$ in the assumption of Case 2 of Main Lemma 1.

Let us now consider the value of $m$ at the vector $\frac{\lambda v_{\min}}{2L}$. We have that

$m\!\left(\frac{\lambda v_{\min}}{2L}\right) = \frac{\lambda g^\top v_{\min}}{2L} + \frac{\lambda^2 v_{\min}^\top H v_{\min}}{8L^2} + \frac{\lambda^3}{48L^2} \stackrel{①}{\le} \frac{\lambda g^\top v_{\min}}{2L} + \frac{\lambda^2 \lambda_{\min}}{16L^2} + \frac{\lambda^3}{48L^2} \stackrel{②}{\le} \frac{\lambda g^\top v_{\min}}{2L} + \frac{\lambda^2 \lambda_{\min}}{16L^2} - \frac{\lambda^2 \lambda_{\min}}{24L^2} \le \frac{\lambda g^\top v_{\min}}{2L} + \frac{\lambda^2 \lambda_{\min}}{48L^2}$.

Above, ① holds because our assumption $\lambda_{\min}(H) \le -\frac{1}{\kappa}$ and the assumption $v_{\min}^\top H v_{\min} \le \lambda_{\min}(H) + \frac{1}{10\kappa}$ together imply $v_{\min}^\top H v_{\min} \le \frac{\lambda_{\min}}{2}$; ② follows from $\lambda_{\min}(H) \le -\frac{1}{\kappa}$ and $\lambda \le -\lambda_{\min}(H) + \frac{1}{\kappa}$. Now, recall that the sign of $v_{\min}$ is chosen so that $g^\top v_{\min}$ is non-positive; therefore, by our assumptions $\lambda_{\min}(H) \le -\frac{1}{\kappa}$ and $\lambda \le -\lambda_{\min}(H) + \frac{1}{\kappa}$, we get the following inequality:

$m\!\left(\frac{\lambda v_{\min}}{2L}\right) \le -\frac{|\lambda_{\min}|^3}{48L^2}$.   (D.8)

Putting inequalities (D.8) and (D.7) together finishes the proof of Claim D.1.

We also record the following lemma, whose proof can be seen from inequality (D.3) in the proof of Claim 6.1 above.

Lemma D.2.
If we have $\lambda, v$ such that $L\|(H + \lambda I)^{-1} g\| \le 2\lambda$ and $\|v + (H + \lambda I)^{-1} g\| \le \tilde{\varepsilon}$ with $\tilde{\varepsilon}$ satisfying condition (6.1), then we have

$g^\top v + \frac{v^\top (H + \lambda I) v}{2} \le -\frac{g^\top (H + \lambda I)^{-1} g}{2} + \frac{1}{1000\kappa^3 L^2}$.

Claim D.3. $S_1 \ge 4 m(v) - \frac{1}{250\kappa^3 L^2}$.

Proof of Claim D.3. We have that

$m(v) = g^\top v + \frac{v^\top (H + \lambda I) v}{2} - \frac{\lambda}{2}\|v\|^2 + \frac{L}{6}\|v\|^3$
$\stackrel{①}{\le} -\frac{g^\top (H + \lambda I)^{-1} g}{2} - \|v\|^2 \left(\frac{\lambda}{2} - \frac{L}{6}\|v\|\right) + \frac{1}{1000\kappa^3 L^2}$
$\stackrel{②}{\le} -\frac{g^\top (H + \lambda I)^{-1} g}{2} - \left(\frac{2\lambda - 3L\tilde{\varepsilon}}{L}\right)^2 \left(\frac{\lambda}{6} - \frac{L\tilde{\varepsilon}}{2}\right) + \frac{1}{1000\kappa^3 L^2}$
$\stackrel{③}{\le} -\frac{g^\top (H + \lambda I)^{-1} g}{2} - \frac{2\lambda^3}{3L^2} + \frac{1}{500\kappa^3 L^2} \le -\frac{g^\top (H + \lambda I)^{-1} g}{2} + \frac{1}{500\kappa^3 L^2}$.   (D.9)

Above, ① is due to Lemma D.2; ② uses our condition on $v$, which gives $L\|v\| \in [2\lambda - 3L\tilde{\varepsilon},\, 2\lambda + 3L\tilde{\varepsilon}]$; ③ uses our condition (6.1) on $\tilde{\varepsilon}$.

We now bound $S_1$. For this purpose, first note that if $\lambda_i + \lambda^* \ge \frac{1}{\kappa}$ and $\lambda - \lambda^* \le \frac{1}{\kappa}$, then $2(\lambda_i + \lambda^*) \ge \frac{1}{\kappa} + \lambda_i + \lambda^* \ge \lambda_i + \lambda$. Therefore, the sum $S_1$ satisfies

$S_1 = -\sum_{i : \lambda_i + \lambda^* \ge \frac{1}{\kappa}} \frac{g_i^2}{\lambda_i + \lambda^*} \ge -2 \sum_{i : \lambda_i + \lambda^* \ge \frac{1}{\kappa}} \frac{g_i^2}{\lambda_i + \lambda} \ge -2\, g^\top (H + \lambda I)^{-1} g \ge 4 m(v) - \frac{1}{250\kappa^3 L^2}$.

(Note that we have $H + \lambda I \succ 0$.) This finishes the proof of Claim D.3.

Claim 6.2. If $\lambda_{\min}(H) \le -\frac{1}{\kappa}$, then $m(h^*) \ge 1500 \min\left\{m(v),\, m\!\left(\frac{\lambda v_{\min}}{2L}\right)\right\} - \frac{1}{500\kappa^3 L^2}$.

Proof of Claim 6.2. We derive that

$m(h^*) \stackrel{①}{=} \frac{1}{2}(S_1 + S_2) - \frac{2(\lambda^*)^3}{3L^2} \stackrel{②}{\ge} \frac{1}{2}(S_1 + S_2) - \frac{16|\lambda_{\min}|^3}{3L^2} \stackrel{③}{\ge} 2 m(v) - \frac{1}{500\kappa^3 L^2} + 500 \cdot m\!\left(\frac{\lambda v_{\min}}{2L}\right) - \frac{16|\lambda_{\min}|^3}{3L^2} \stackrel{④}{\ge} 2 m(v) - \frac{1}{500\kappa^3 L^2} + 1500 \cdot m\!\left(\frac{\lambda v_{\min}}{2L}\right) \ge 1500 \min\left\{m(v),\, m\!\left(\frac{\lambda v_{\min}}{2L}\right)\right\} - \frac{1}{500\kappa^3 L^2}$.

Above, ① uses equation (D.4); inequality ② follows because we have $\lambda_{\min}(H) \le -\frac{1}{\kappa}$ in the assumption and $\lambda^* \le -\lambda_{\min}(H) + \frac{1}{\kappa}$ in the assumption of Case 2 of Main Lemma 1; inequality ③ uses Claim D.3 and Claim D.1; and inequality ④ uses (D.8). This finishes the proof of Claim 6.2.

Claim 6.3. If $\lambda_{\min}(H) \ge -\frac{1}{\kappa}$, then $m(h^*) \ge 2 m(v) - \frac{16}{\kappa^3 L^2}$.

Proof of Claim 6.3.
This time we lower bound $S_2$ slightly differently:

$S_2 \stackrel{①}{\ge} -\frac{4(\lambda^*)^2}{\kappa L^2} \stackrel{②}{\ge} -\frac{16}{\kappa^3 L^2}$,   (D.10)

where ① comes from the second-to-last inequality in (D.7), and ② comes from $\lambda^* \le \lambda \le -\lambda_{\min}(H) + \frac{1}{\kappa} \le \frac{2}{\kappa}$ using our assumption in Case 2 of Main Lemma 1. Putting these together, we get

$m(h^*) \stackrel{①}{=} \frac{1}{2}(S_1 + S_2) - \frac{2(\lambda^*)^3}{3L^2} \stackrel{②}{\ge} 2 m(v) - \frac{1}{500\kappa^3 L^2} - \frac{15}{\kappa^3 L^2} \ge 2 m(v) - \frac{16}{\kappa^3 L^2}$.

Above, ① comes from (D.4), and ② uses Claim D.3, the lower bound (D.10), and $\frac{2(\lambda^*)^3}{3L^2} \le \frac{16}{3\kappa^3 L^2}$.

E Proof of Main Lemma 2

Main Lemma 2. In the same setting as Main Lemma 1, suppose $m(h^*) \ge -\frac{\varepsilon^{3/2}}{300\sqrt{L}}$. Then the output vector $v$ satisfies the following conditions:

$\|v\| \le \|h^*\| + \frac{3}{\kappa L}$ and $\|\nabla m(v)\| \le \frac{\varepsilon}{4} + \frac{15}{\kappa^2 L}$.

Proof of Main Lemma 2. Let us first note that, from the value given in Lemma 5.1,

$(\lambda^*)^3 \le \frac{3L^2 |m(h^*)|}{2} \le \frac{L^{3/2} \varepsilon^{3/2}}{200}$.   (E.1)

If Case 1 occurs, we have

$\|v\| \stackrel{①}{\le} \|(H + \lambda I)^{-1} g\| + \tilde{\varepsilon} \stackrel{②}{\le} \frac{2\lambda + 2L\tilde{\varepsilon}}{L} + \tilde{\varepsilon} \stackrel{③}{\le} \frac{2\lambda^*}{L} + 5\tilde{\varepsilon} \stackrel{④}{\le} \|h^*\| + \frac{1}{20\kappa L}$.

Above, inequalities ① and ② both use the assumptions of Case 1; inequality ③ uses the fact that $\lambda^* \in [\lambda - L\tilde{\varepsilon},\, \lambda + L\tilde{\varepsilon}]$, which again follows from the assumptions of Case 1 (see (D.1)); inequality ④ uses $\|h^*\| = \frac{2\lambda^*}{L}$ from Lemma 5.1 as well as our assumption (6.1) on $\tilde{\varepsilon}$. As for the quantity $\|\nabla m(v)\|$, we bound it as follows:

$\|\nabla m(v)\| = \left\|g + Hv + \frac{L\|v\|}{2} v\right\| \stackrel{①}{\le} \|g + (H + \lambda I) v\| + \lambda \|v\| + L\|v\|^2 \stackrel{②}{\le} \|H + \lambda I\| \tilde{\varepsilon} + \lambda \|v\| + L\|v\|^2$
$\stackrel{③}{\le} (L_2 + 2B)\tilde{\varepsilon} + \frac{\lambda(2\lambda + 3L\tilde{\varepsilon})}{L} + \frac{(2\lambda + 3L\tilde{\varepsilon})^2}{L} = (L_2 + 2B)\tilde{\varepsilon} + \frac{6\lambda^2}{L} + 15\tilde{\varepsilon}\lambda + 9L\tilde{\varepsilon}^2$
$\stackrel{④}{\le} \frac{6(\lambda^* + L\tilde{\varepsilon})^2}{L} + (L_2 + 32B)\tilde{\varepsilon} + 9L\tilde{\varepsilon}^2 \stackrel{⑤}{\le} \frac{6(\lambda^*)^2}{L} + (L_2 + 56B)\tilde{\varepsilon} + 15L\tilde{\varepsilon}^2 \stackrel{⑥}{\le} \frac{\varepsilon}{4} + \frac{15}{\kappa^2 L}$.
Above, inequality ① uses the triangle inequality; inequality ② uses $\|v + (H + \lambda I)^{-1} g\| \le \tilde{\varepsilon}$; inequality ③ uses $\|H + \lambda I\| \le L_2 + 2B$ and $L\|v\| \le 2\lambda + 3L\tilde{\varepsilon}$, which comes from our upper bound on $\|v\|$ above; ④ uses the fact that $\lambda^* \in [\lambda - L\tilde{\varepsilon},\, \lambda + L\tilde{\varepsilon}]$, which again follows from the assumptions of Case 1 (see (D.1)); inequality ⑤ uses $\lambda^* \le 2B$; and inequality ⑥ uses (E.1) together with our assumption (6.1) on $\tilde{\varepsilon}$.

If Case 2 occurs, we have

$\|v\| \stackrel{①}{\le} \|(H + \lambda I)^{-1} g\| + \tilde{\varepsilon} \stackrel{②}{\le} \frac{2\lambda}{L} + \tilde{\varepsilon} \stackrel{③}{\le} \frac{2(\lambda^* + 1/\kappa)}{L} + \tilde{\varepsilon} \stackrel{④}{\le} \|h^*\| + \frac{3}{\kappa L}$.   (E.2)

Above, inequalities ① and ② both use the assumptions of Case 2; inequality ③ uses $\lambda \le -\lambda_{\min}(H) + \frac{1}{\kappa}$ from our assumption of Case 2, as well as $-\lambda_{\min}(H) \le \lambda^*$, which comes from Lemma 5.1; inequality ④ uses $\|h^*\| = \frac{2\lambda^*}{L}$ from Lemma 5.1 as well as our assumption (6.1) on $\tilde{\varepsilon}$. The quantity $\|\nabla m(v)\|$ can be bounded in a manner analogous to Case 1:

$\|\nabla m(v)\| \le \|H + \lambda I\| \tilde{\varepsilon} + \lambda \|v\| + L\|v\|^2 \le (L_2 + 2B)\tilde{\varepsilon} + \frac{\lambda(2\lambda + L\tilde{\varepsilon})}{L} + \frac{(2\lambda + L\tilde{\varepsilon})^2}{L} \stackrel{①}{\le} \frac{6\lambda^2}{L} + \frac{1}{10\kappa^2 L} \stackrel{②}{\le} \frac{6(\lambda^* + \frac{1}{\kappa})^2}{L} + \frac{1}{10\kappa^2 L} \le \frac{12(\lambda^*)^2}{L} + \frac{15}{\kappa^2 L} \stackrel{③}{\le} \frac{\varepsilon}{4} + \frac{15}{\kappa^2 L}$.

Above, inequality ① uses our assumption (6.1) on $\tilde{\varepsilon}$; inequality ② uses $\lambda \le \lambda^* + \frac{1}{\kappa}$, which appeared in (E.2); inequality ③ uses (E.1).

F Proof of Lemma 7.2

Lemma 7.2. The following statements hold for all $i$ until FastCubicMin terminates:
(a) $\lambda_i \in [0, 2B]$ and $\lambda_i + \lambda_{\max}(H) \le 3B$;
(b) $\lambda_i + \lambda_{\min}(H) \ge \frac{3}{10\kappa}$;
(c) $\lambda_{i+1} + \lambda_{\min}(H) \le \frac{3}{4}(\lambda_i + \lambda_{\min}(H))$ unless $\lambda_{i+1} = 0$.
Moreover, when FastCubicMin terminates at Line 20, we have $\lambda_i + \lambda_{\min}(H) \le \frac{1}{\kappa}$.

Proof of Lemma 7.2. The lemma follows via induction. To see (a) and (b) in the base case $i = 0$, recall that the definitions of $B$ and $L_2$ together ensure $\lambda_0 + \lambda_{\max}(H) \le 3B$ and $\lambda_0 + \lambda_{\min}(H) \ge \frac{3}{10\kappa}$; also $\lambda_0 \in [0, 2B]$. Suppose now that for some $i \ge 0$, properties (a) and (b) hold.
It is easy to check that $\lambda_{i+1} \le \lambda_i$, and thus we have $\lambda_{i+1} + \lambda_{\max}(H) \le 3B$ and $\lambda_{i+1} \in [0, 2B]$. This implies that property (a) also holds at iteration $i+1$. We now proceed to show property (c) at iteration $i$ and property (b) at iteration $i+1$. Recall that the algorithm ensures

$\frac{9}{10} \lambda_{\max}((H + \lambda_i I)^{-1}) \le w^\top (H + \lambda_i I)^{-1} w \le \lambda_{\max}((H + \lambda_i I)^{-1})$,

and by the definition of $\tilde{w}$ we have

$\frac{9}{10} \lambda_{\max}((H + \lambda_i I)^{-1}) - 2\hat{\varepsilon} \le \tilde{w}^\top w - \hat{\varepsilon} \le \lambda_{\max}((H + \lambda_i I)^{-1})$.   (F.1)

Now, since $\frac{3}{10\kappa} \le \lambda_i + \lambda_{\min}(H) \le 3B$ by the inductive assumption, it follows from the choice of $\hat{\varepsilon}$ that

$2\hat{\varepsilon} \le \frac{1}{30B} \le \frac{1}{10(\lambda_i + \lambda_{\min}(H))} = \frac{\lambda_{\max}((H + \lambda_i I)^{-1})}{10}$.   (F.2)

Plugging Equation (F.2) into Equation (F.1), we get

$\frac{8}{10} \cdot \frac{1}{\lambda_i + \lambda_{\min}(H)} = \frac{8}{10} \lambda_{\max}((H + \lambda_i I)^{-1}) \le \tilde{w}^\top w - \hat{\varepsilon} \le \lambda_{\max}((H + \lambda_i I)^{-1}) = \frac{1}{\lambda_i + \lambda_{\min}(H)}$.

Inverting this chain of inequalities, we have

$\frac{\lambda_i + \lambda_{\min}(H)}{2} \le \Delta \le \frac{5(\lambda_i + \lambda_{\min}(H))}{8}$.   (F.3)

From this we derive the following implications:

$\Delta \le \frac{1}{2\kappa} \implies \lambda_i + \lambda_{\min}(H) \le \frac{1}{\kappa}$   (F.4)
$\Delta > \frac{1}{2\kappa} \implies \lambda_i + \lambda_{\min}(H) > \frac{4}{5\kappa}$   (F.5)

If Condition (F.4) happens, our algorithm FastCubicMin outputs on Line 20; in such a case, (F.4) implies our desired inequality $\lambda_i + \lambda_{\min}(H) \le \frac{1}{\kappa}$. If Condition (F.5) happens, our choice $\tilde{\lambda}_{i+1} \leftarrow \lambda_i - \frac{\Delta}{2}$ and Equation (F.3) together imply that

$\frac{3}{4}(\lambda_i + \lambda_{\min}(H)) \ge \tilde{\lambda}_{i+1} + \lambda_{\min}(H) \ge \frac{11}{16}(\lambda_i + \lambda_{\min}(H))$.

Combining this with (F.5), we get that

$\frac{3}{4}(\lambda_i + \lambda_{\min}(H)) \ge \tilde{\lambda}_{i+1} + \lambda_{\min}(H) \ge \frac{11}{16} \cdot \frac{4}{5\kappa} \ge \frac{3}{10\kappa}$.

Therefore, we conclude that property (c) at iteration $i$ holds and property (b) at iteration $i+1$ holds because $\lambda_{i+1} \ge \tilde{\lambda}_{i+1}$. This finishes the proof of Lemma 7.2.

G Proof of Main Lemma 3: Running-Time Half

Having proven the correctness of the algorithm, we now aim to bound the overall running time of FastCubicMin, completing the proof of Main Lemma 3.
We prove in Appendix H the following lemma:

Lemma G.1. If $\lambda_2 + \lambda_{\min}(H) \ge c_1$ for some $c_1 \in (0, 1)$, then BinarySearch ends in $O\!\left(\log \frac{(\lambda_1 - \lambda_2) B}{c_1 \cdot L \cdot \tilde{\varepsilon}}\right)$ iterations.

Since in our FastCubicMin algorithm we have $\lambda_i \le 2B$ and $\lambda_i + \lambda_{\min}(H) \ge \frac{3}{10\kappa}$ (see Lemma 7.2), taken together with our choice of $\tilde{\varepsilon}$ we have:

Claim G.2. Each invocation of BinarySearch ends in $O(\log(1/\tilde{\varepsilon}))$ iterations.

Claim G.3. FastCubicMin ends in at most $O(\log(B\kappa))$ outer loops.

Proof. According to Lemma 7.2, we have $\frac{3}{4}(\lambda_{i-1} + \lambda_{\min}(H)) \ge \lambda_i + \lambda_{\min}(H)$, so the quantity $\lambda_i + \lambda_{\min}(H)$ decreases by a constant factor per iteration (except possibly $\lambda_i = 0$ in the last outer loop, in which case we terminate within one more iteration). On one hand, we began with $\lambda_0 + \lambda_{\min}(H) \le 3B$. On the other hand, we always have $\lambda_i + \lambda_{\min}(H) \ge \frac{3}{10\kappa}$ according to Lemma 7.2. Therefore, the total number of outer loops is at most $O(\log(B\kappa))$.

G.1 Matrix Inverse

Since the key component of the running time is the computation of $(H + \lambda_i I)^{-1} b$ for different vectors $b$, we first bound the condition number of the matrix $H + \lambda_i I$ via the following claim.

Claim G.4. Throughout the execution of FastCubicMin and BinarySearch, whenever we compute $(H + \lambda_i I)^{-1} b$ for some vector $b$, it satisfies

$\frac{\lambda_i + L_2}{\lambda_i + \lambda_{\min}(H)} \le 10\kappa L_2$.

Proof of Claim G.4. We first focus on Line 5 and Line 11 of FastCubicMin. There are two cases. If $\lambda_i \ge 2L_2$, then according to $-L_2 I \preceq H \preceq L_2 I$ we can bound $\frac{\lambda_i + L_2}{\lambda_i + \lambda_{\min}(H)} \le 3$, because the left-hand side is largest when $\lambda_i = 2L_2$. If $\lambda_i < 2L_2$, then by Lemma 7.2 we know $\lambda_i + \lambda_{\min}(H) \ge \frac{3}{10\kappa}$. This implies $\frac{\lambda_i + L_2}{\lambda_i + \lambda_{\min}(H)} \le 10\kappa L_2$. We now focus on Line 3 of BinarySearch.
We claim that all values $\lambda_{\mathrm{mid}}$ iterated over by BinarySearch also satisfy $\lambda_{\mathrm{mid}} + \lambda_{\min}(H) \ge \frac{3}{10\kappa}$ (because each $\lambda_{\mathrm{mid}} \ge \lambda_i$, and $\lambda_i$ satisfies $\lambda_i + \lambda_{\min}(H) \ge \frac{3}{10\kappa}$ according to Lemma 7.2). Therefore, the same case analysis (with respect to $\lambda_{\mathrm{mid}} \ge 2L_2$ and $\lambda_{\mathrm{mid}} < 2L_2$) also gives $\frac{\lambda_{\mathrm{mid}} + L_2}{\lambda_{\mathrm{mid}} + \lambda_{\min}(H)} \le 10\kappa L_2$.

Claim G.5. Line 5 of FastCubicMin and Line 3 of BinarySearch run in time $\tilde{O}(\mathcal{T}_{\mathrm{inverse}}(\kappa L_2, \tilde{\varepsilon}))$.

Proof. Whenever we compute $(H + \lambda_i I)^{-1} b$ for some vector $b$, it satisfies $\|b\| \le 1/\tilde{\varepsilon}$; therefore, to find $v$ satisfying $\|v + (H + \lambda_i I)^{-1} b\| \le \tilde{\varepsilon}$, it suffices to find $v$ with $\|v + (H + \lambda_i I)^{-1} b\| \le \tilde{\varepsilon}^2 \|b\|$. This costs a total running time of $\tilde{O}(\mathcal{T}_{\mathrm{inverse}}(\kappa L_2, \tilde{\varepsilon}))$ according to Theorem 2.4.

In other words, by Theorem 2.4, every time we need to multiply a vector by $(H + \lambda I)^{-1}$ to error $\delta$, the time required to approximately solve such a linear system is $\mathcal{T}_{\mathrm{inverse}}(O(\kappa L_2), \delta)$. We state our running time with respect to $\mathcal{T}_{\mathrm{inverse}}$, as it is the dominant operation in the algorithm.

G.2 Power Method

We now bound the running time of the power method in Line 11 of FastCubicMin. It is folklore (cf. [3, Appendix A]) that obtaining any constant multiplicative approximation to the leading eigenvector of a PSD matrix $M \in \mathbb{R}^{d \times d}$ requires only $O(\log d)$ iterations, each computing $Mb$ for some vector $b$. In our case, we have $M = (H + \lambda_i I)^{-1}$, so we cannot compute $Mb$ exactly. Fortunately, folklore results on the inexact power method suggest that, as long as each $Mb$ is computed to a very good accuracy such as $\tilde{\varepsilon}^{\Omega(\log d)}$, we can still obtain a constant multiplicative approximate leading eigenvector that satisfies Line 11 of FastCubicMin. Omitting the details (which are quite standard and can be found, for instance, in [3, Appendix A]), we claim:

Claim G.6.
Line 11 of FastCubicMin runs in time $\tilde{O}\big(\mathcal{T}_{\mathrm{inverse}}(\kappa L_2, \varepsilon^{\Theta(\log d)})\big) = \tilde{O}(\mathcal{T}_{\mathrm{inverse}}(\kappa L_2, \varepsilon))$.

G.3 Lowest Eigenvector

We now focus on the running time for the computation of the lowest eigenvector of the Hessian, which is required in Line 18. We recall Theorem 2.5 from Section 2, which uses shift-and-invert to compute the largest eigenvalue of a matrix. Since we are concerned with the lowest eigenvector of $H$, and by assumption $-L_2 I \preceq H \preceq L_2 I$, we can equivalently compute the largest eigenvector of

$M \triangleq I - \frac{H + L_2 I}{2L_2}$,

which satisfies $0 \preceq M \preceq I$. Note that computing $Mv$ has the same time complexity as computing $Hv$. By setting $\varepsilon = \delta_\times = \frac{0.01}{\kappa L_2}$ in Theorem 2.5 and running AppxPCA, we obtain a unit vector $w$ such that

$1 - \frac{w^\top H w + L_2}{2L_2} = w^\top M w \ge (1 - 2\delta_\times) \lambda_{\max}(M) \stackrel{①}{\ge} \lambda_{\max}(M) - 2\delta_\times \ge 1 - \frac{\lambda_{\min}(H) + L_2}{2L_2} - 2\delta_\times$.

Above, ① uses $\lambda_{\max}(M) \le 1$. Rearranging the terms, we obtain $w^\top H w \le \lambda_{\min}(H) + \frac{0.05}{\kappa}$, as desired. In sum:

Claim G.7. The approximate lowest-eigenvector computation on Line 18 runs in time $\tilde{O}(\mathcal{T}_{\mathrm{inverse}}(\kappa L_2, \tilde{\varepsilon}))$.

G.4 Putting It All Together

Running-Time Proof of Main Lemma 3. Putting together Claim G.2 and Claim G.3, which bound the numbers of iterations, as well as our bounds in Claim G.6, Claim G.5, and Claim G.7 for the power method, matrix inverse, and lowest-eigenvector computations, we conclude that the total running time of FastCubicMin is at most $\tilde{O}(\mathcal{T}_{\mathrm{inverse}}(\kappa L_2, \tilde{\varepsilon}))$, where $\tilde{O}$ hides factors polylogarithmic in $\kappa, L, L_2, B, d$. Plugging our choice of $\tilde{\varepsilon}$ in Line 2, as well as the running time of either accelerated gradient descent or accelerated SVRG from Theorem 2.4, into $\mathcal{T}_{\mathrm{inverse}}(\kappa L_2, \tilde{\varepsilon})$, we finish the proof of the running-time part of Main Lemma 3.

H Proof of Lemma G.1

Lemma G.1. If $\lambda_2 + \lambda_{\min}(H) \ge c_1$ for some $c_1 \in (0, 1)$, then BinarySearch ends in $O\!\left(\log \frac{(\lambda_1 - \lambda_2) B}{c_1 \cdot L \cdot \tilde{\varepsilon}}\right)$ iterations.
Proof of Lemma G.1. We first note that throughout all iterations of BinarySearch it always holds that

L‖(H + λ_1 I)^{-1} g‖ ≤ 2λ_1  and  L‖(H + λ_2 I)^{-1} g‖ ≥ 2λ_2 . (H.1)

This is true at the beginning. In each of the follow-up iterations, if we have set λ_1 ← λ_mid, then it must satisfy L‖v‖ + Lε̃ ≤ 2λ_mid, and this implies L‖(H + λ_mid I)^{-1} g‖ ≤ 2λ_mid by the triangle inequality together with ‖v + (H + λ_mid I)^{-1} g‖ ≤ ε̃; similarly, if we have set λ_2 ← λ_mid, then it must satisfy L‖(H + λ_mid I)^{-1} g‖ ≥ 2λ_mid.

Suppose now the loop has run for at least log_2((λ_1 − λ_2)/ε̂) iterations, where ε̂ ≜ Lε̃c_1/(40B). Then it must satisfy λ_1 − λ_2 ≤ ε̂. At this point, we compute

(H + λ_1 I)^{-1} = (H + λ_2 I)^{-1} − (λ_1 − λ_2)(H + λ_2 I)^{-1}(H + λ_1 I)^{-1} ,

and therefore

L‖(H + λ_1 I)^{-1} g‖ ≥ L‖(H + λ_2 I)^{-1} g‖ − L(λ_1 − λ_2)‖(H + λ_2 I)^{-1}(H + λ_1 I)^{-1} g‖
 ≥① 2λ_2 − ε̂ ‖(H + λ_2 I)^{-1}‖ · 2λ_1
 ≥② 2λ_1 − 2ε̂ − ε̂ ‖(H + λ_2 I)^{-1}‖ · 2λ_1 .

Above, inequality ① uses (H.1) and λ_1 − λ_2 ≤ ε̂; inequality ② uses λ_1 − λ_2 ≤ ε̂ again. Now, we notice that ‖(H + λ_2 I)^{-1}‖ ≤ 1/c_1, and λ_1 ≤ 2B because λ_2 only increases and λ_1 only decreases throughout the execution of the algorithm. Therefore, by the choice of ε̂ = Lε̃c_1/(40B), we get

L‖(H + λ_1 I)^{-1} g‖ ≥ 2λ_1 − Lε̃/5 .

A completely analogous argument also shows that L‖(H + λ_2 I)^{-1} g‖ ≤ 2λ_2 + Lε̃/5. Therefore, in the immediate next iteration, when picking λ_mid ← (λ_1 + λ_2)/2, it must satisfy

2λ_mid − Lε̃/2 ≤ 2λ_1 − Lε̃/5 ≤ L‖(H + λ_mid I)^{-1} g‖ ≤ 2λ_2 + Lε̃/5 ≤ 2λ_mid + Lε̃/2 .

Then, at this iteration, when v is computed to satisfy ‖v + (H + λ_mid I)^{-1} g‖ ≤ ε̃/2, we also have 2λ_mid − Lε̃ ≤ L‖v‖ ≤ 2λ_mid + Lε̃, which means BinarySearch will stop in this iteration.
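The search analyzed in this proof can be sketched as a simple bisection loop. This is a schematic numpy illustration only, not the paper's exact BinarySearch: an exact linear solve stands in for the inexact solver, and the endpoints and tolerance are illustrative assumptions.

```python
import numpy as np

def binary_search_lambda(H, g, lam_2, lam_1, L, eps, max_iters=200):
    """Halve the bracket [lam_2, lam_1] until L * ||(H + lam*I)^{-1} g||
    lies within +/- L*eps of 2*lam (the stopping band from the proof)."""
    d = H.shape[0]
    for _ in range(max_iters):
        lam_mid = 0.5 * (lam_2 + lam_1)
        v = np.linalg.solve(H + lam_mid * np.eye(d), g)  # exact stand-in solve
        lhs = L * np.linalg.norm(v)
        if lhs <= 2.0 * lam_mid - L * eps:
            lam_1 = lam_mid   # norm too small: shrink the upper endpoint
        elif lhs >= 2.0 * lam_mid + L * eps:
            lam_2 = lam_mid   # norm too large: raise the lower endpoint
        else:
            return lam_mid    # within the band of width 2*L*eps: stop
    return 0.5 * (lam_2 + lam_1)
```

Since λ ↦ L‖(H + λI)^{-1} g‖ − 2λ is monotonically decreasing on the bracket, each step keeps the crossing point bracketed, matching the invariant (H.1).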
In sum, we have concluded that there will be no more than O(log((λ_1 − λ_2) B / (c_1 · L · ε̃))) iterations. □

Acknowledgements

We thank Ben Recht for helpful suggestions and corrections to a previous version.

References

[1] Naman Agarwal, Brian Bullins, and Elad Hazan. Second order stochastic optimization for machine learning in linear time. arXiv preprint arXiv:1602.03943, 2016.

[2] Zeyuan Allen-Zhu and Elad Hazan. Variance Reduction for Faster Non-Convex Optimization. In ICML, 2016.

[3] Zeyuan Allen-Zhu and Yuanzhi Li. Even Faster SVD Decomposition Yet Without Agonizing Pain. In NIPS, 2016.

[4] Zeyuan Allen-Zhu and Yang Yuan. Improved SVRG for Non-Strongly-Convex or Sum-of-Non-Convex Objectives. In ICML, 2016.

[5] Afonso S. Bandeira, Nicolas Boumal, and Vladislav Voroninski. On the low-rank approach for semidefinite programs arising in synchronization and community detection. arXiv preprint arXiv:1602.04426, 2016.

[6] S. Bhojanapalli, B. Neyshabur, and N. Srebro. Global Optimality of Local Search for Low Rank Matrix Recovery. ArXiv e-prints, May 2016.

[7] Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for non-convex optimization. arXiv preprint arXiv:1611.00756, 2016.

[8] Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Mathematical Programming, 127(2):245–295, 2011.

[9] Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity. Mathematical Programming, 130(2):295–319, 2011.

[10] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In AISTATS, 2015.

[11] Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941, 2014.

[12] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.

[13] Dan Garber and Elad Hazan. Fast and simple PCA via convex optimization. ArXiv e-prints, September 2015.

[14] Dan Garber, Elad Hazan, Chi Jin, Sham M. Kakade, Cameron Musco, Praneeth Netrapalli, and Aaron Sidford. Robust shift-and-invert preconditioning: Faster and more sample efficient algorithms for eigenvector computation. In ICML, 2016.

[15] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition, 2015.

[16] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Proceedings of the 28th Annual Conference on Learning Theory, COLT 2015, 2015.

[17] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3–6, 2015, pages 797–842, 2015.

[18] Rong Ge, Jason Lee, and Tengyu Ma. Matrix Completion has No Spurious Local Minimum. ArXiv e-prints, May 2016.

[19] Rong Ge and Tengyu Ma. On the optimization landscape of tensor decompositions, 2016.

[20] Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, pages 1–26, February 2015.

[21] I. J. Goodfellow, O. Vinyals, and A. M. Saxe. Qualitatively characterizing neural network optimization problems. ArXiv e-prints, December 2014.

[22] Elad Hazan and Tomer Koren. A linear-time algorithm for trust region problems. Mathematical Programming, pages 1–19, 2015.

[23] Christopher J. Hillar and Lek-Heng Lim. Most tensor problems are NP-hard. Journal of the ACM, 60(6):45, 2013.

[24] Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23–26, 2016, pages 1246–1257, 2016.

[25] Katta G. Murty and Santosh N. Kabadi. Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39(2):117–129, 1987.

[26] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). In Doklady AN SSSR (translated as Soviet Mathematics Doklady), volume 269, pages 543–547, 1983.

[27] Yurii Nesterov. Introductory Lectures on Convex Programming, Volume I: A Basic Course. Kluwer Academic Publishers, 2004.

[28] Yurii Nesterov and Boris T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.

[29] Barak A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147–160, 1994.

[30] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.

[31] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988.

[32] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388, pages 1–45, 2013. Preliminary version appeared in NIPS 2012.

[33] Shai Shalev-Shwartz. SDCA without Duality, Regularization, and Individual Convexity. In ICML, 2016.

[34] Jonathan Richard Shewchuk. An introduction to the conjugate gradient method without the agonizing pain, 1994.

[35] Paul Werbos. Beyond regression: New tools for prediction and analysis in the behavioral sciences. 1974.