Finding Approximate Local Minima Faster than Gradient Descent

Naman Agarwal (namana@cs.princeton.edu, Princeton University), Zeyuan Allen-Zhu (zeyuan@csail.mit.edu, Institute for Advanced Study), Brian Bullins (bbullins@cs.princeton.edu, Princeton University), Elad Hazan (ehazan@cs.princeton.edu, Princeton University), Tengyu Ma (tengyu@cs.princeton.edu, Princeton University)

November 3, 2016

Abstract

We design a non-convex second-order optimization algorithm that is guaranteed to return an approximate local minimum in time which scales linearly in the underlying dimension and the number of training examples. The time complexity of our algorithm to find an approximate local minimum is even faster than that of gradient descent to find a critical point. Our algorithm applies to a general class of optimization problems including training a neural network and other non-convex objectives arising in machine learning.

1 Introduction

Finding a global minimizer of a non-convex optimization problem is NP-hard. Thus, the standard goal of efficient non-convex optimization algorithms is instead to find a local minimum. This problem has become increasingly important as the state of the art in machine learning is attained by non-convex models, many of which are variants of deep neural networks. Experiments in [10, 11, 21] suggest that fast convergence to a local minimum is sufficient for training neural nets, while convergence to critical points (points with vanishing gradients) is not. Theoretical works have affirmed the same phenomenon for other machine learning problems (see [5, 6, 18, 19] and the references therein).

In this paper we give a provable linear-time algorithm for finding an approximate local minimum in smooth non-convex optimization. It applies to a general setting of machine learning optimization, and in particular to the optimization problem of training deep neural networks.
Furthermore, the running time bound of our algorithm is the fastest known, for a wide range of parameters, even for the more lenient task of computing a point with vanishing gradient (called a critical point).

Formally, the problem of unconstrained mathematical optimization is stated in general terms as that of finding the minimum value that a function attains over Euclidean space, i.e.

    min_{x ∈ R^d} f(x) .    (1.1)

If f is convex, the above formulation is convex optimization and is solvable in (randomized) polynomial time even if only a valuation oracle to f is provided. A crucial property of convex functions is that "local optimality implies global optimality", allowing greedy algorithms to reach the global optimum efficiently. Unfortunately, this is no longer the case if f is non-convex; indeed, even a degree-four polynomial can be NP-hard to optimize [23], and it can be NP-hard even just to check whether a point is not a local minimum [25]. Thus, for non-convex optimization one has to settle for the more modest goal of reaching approximate local optimality efficiently.

Of particular interest to machine learning is the optimization of functions f : R^d → R of the finite-sum form

    f(x) = (1/n) · Σ_{i=1}^n f_i(x) .    (1.2)

Such functions arise when minimizing loss over a training set, where each example i in the set corresponds to one loss function f_i in the summation.

We say that the function f is second-order smooth if it has a Lipschitz-continuous gradient and a Lipschitz-continuous Hessian. We say that a point x is an ε-approximate local minimum if it satisfies (following the tradition of [28]):

    ‖∇f(x)‖ ≤ ε   and   ∇²f(x) ⪰ −√ε · I ,

where ‖·‖ denotes the Euclidean norm of a vector. We say that a point x is an ε-critical point if it satisfies the gradient condition above, but not necessarily the second-order condition. Critical points include saddle points in addition to local minima.
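The two conditions above are easy to verify numerically when the dimension is small enough to form the gradient and Hessian explicitly. The following sketch is our own illustration (not part of the paper) of the definition on a dense toy instance:

```python
import numpy as np

def is_approx_local_min(grad, hess, eps):
    """Check the two conditions of an eps-approximate local minimum:
    ||grad|| <= eps and smallest Hessian eigenvalue >= -sqrt(eps)."""
    small_gradient = np.linalg.norm(grad) <= eps
    # eigvalsh is exact for small dense symmetric matrices
    curvature_ok = np.linalg.eigvalsh(hess).min() >= -np.sqrt(eps)
    return small_gradient and curvature_ok

# f(x, y) = x^2 - y^2 has a saddle at the origin: the gradient vanishes,
# but the Hessian has eigenvalue -2, so the second-order test fails.
grad = np.zeros(2)
hess = np.diag([2.0, -2.0])
print(is_approx_local_min(grad, hess, eps=1e-2))                  # False (saddle)
print(is_approx_local_min(grad, np.diag([2.0, 2.0]), eps=1e-2))   # True
```

This makes the distinction in the text concrete: the origin of x² − y² is an ε-critical point but not an ε-approximate local minimum.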
We remark that ε-approximate local minima (even with ε = 0) are not necessarily close to any local minimum, neither in domain nor in function value. However, if we additionally assume that the function satisfies the (robust) strict-saddle property [15, 24] (see Section 2 for the precise definition), then an ε-approximate local minimum is guaranteed to be close to a local minimum for sufficiently small ε.

Our main theorem below states the time required for the proposed algorithm FastCubic to find an ε-approximate local minimum for second-order smooth functions.

Theorem 1 (informal). Ignoring smoothness parameters, the running time of FastCubic to return an ε-approximate local minimum is

    Õ( n/ε^{3/2} + n^{3/4}/ε^{7/4} ) · T_{h,1}   for (1.2),   or   Õ( 1/ε^{7/4} ) · T_h   for the general (1.1).

Above, T_h is the time to compute a Hessian-vector product with ∇²f(x), and T_{h,1} is that for an arbitrary ∇²f_i(x).

The full statement of Theorem 1 can be found in Section 2. Hessian-vector products can be computed in linear time (meaning T_{h,1} = O(d) and T_h = O(nd)) for many machine learning problems such as generalized linear models and training neural networks [1, 29]. We explain this more generally in Appendix A. Therefore:

Corollary 1.1. Algorithm FastCubic returns an ε-approximate local minimum for the optimization problem of training a neural network in time

    Õ( nd/ε^{3/2} + n^{3/4}d/ε^{7/4} ) .

Another important aspect of our algorithm is that even in terms of just reaching an ε-critical point, i.e. a point that satisfies ‖∇f(x)‖ ≤ ε without any second-order guarantee, FastCubic is faster than all previous results (see Table 1 for a comparison). The fastest methods to find critical points of a smooth non-convex function are gradient descent and its extensions, jointly known as first-order methods.
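To get a feel for Corollary 1.1, one can plug sample values into the two leading-order bounds. This is only an order-of-magnitude illustration of our own: it ignores the hidden logarithmic factors and all smoothness constants.

```python
# Leading terms only: gradient descent needs ~ n*d/eps^2 work to find a
# critical point, while FastCubic (Corollary 1.1) needs
# ~ n*d/eps^1.5 + n^0.75*d/eps^1.75 to find an approximate local minimum.
def gd_time(n, d, eps):
    return n * d / eps ** 2

def fastcubic_time(n, d, eps):
    return n * d / eps ** 1.5 + n ** 0.75 * d / eps ** 1.75

n, d, eps = 10 ** 6, 10 ** 3, 10 ** -3
print(f"GD        ~ {gd_time(n, d, eps):.2e}")
print(f"FastCubic ~ {fastcubic_time(n, d, eps):.2e}")
# With these values FastCubic's bound is roughly 25-30x smaller, despite
# also certifying approximate second-order optimality.
assert fastcubic_time(n, d, eps) < gd_time(n, d, eps)
```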
These methods are extremely efficient in terms of per-iteration complexity; however, they necessarily suffer from a 1/ε² convergence rate [27]. To the best of our knowledge, among previous results only higher-order methods seem capable of breaking this 1/ε² bottleneck [28]. For certain ranges of parameters, our FastCubic finds local minima even faster than first-order methods find critical points. This is depicted in Table 1.

Paper                     | Total time achieving ‖∇f(x)‖ ≤ ε   | Second-order guarantee
--------------------------|-------------------------------------|------------------------
Gradient descent (GD)     | O(nd/ε²)                            | n/a
SVRG [2]                  | O(nd + n^{2/3}d/ε²)                 | n/a
SGD [20]                  | O(d/ε⁴)                             | n/a
noisy SGD [16] (a)        | O(d^{C₁}/ε⁴)                        | ∇²f(x) ⪰ −ε^{1/C₂} · I
cubic regularization [28] | Õ((nd^{ω−1} + d^ω)/ε^{3/2})         | ∇²f(x) ⪰ −ε^{1/2} · I
this paper                | Õ(nd/ε^{7/4})                       | ∇²f(x) ⪰ −ε^{1/2} · I
this paper                | Õ(nd/ε^{3/2} + n^{3/4}d/ε^{7/4})    | ∇²f(x) ⪰ −ε^{1/2} · I

Table 1: Comparison of known methods. (a) Here C₁, C₂ are two constants that are not explicitly written. We believe C₁ ≥ 4.

1.1 Related Work

Methods that provably reach critical points. Recall that only a gradient oracle is needed to reach a critical point. The algorithm most commonly used in practice for training non-convex learning machines such as deep neural networks is stochastic gradient descent (SGD), also known as stochastic approximation [30], together with its derivatives. Some widely used practical enhancements are based on Nesterov's acceleration [26] and adaptive regularization [12]. The variance-reduction technique, introduced in [32], was extremely successful in convex optimization, but only recently was a non-convex counterpart with theoretical benefits introduced [2].

Methods that provably reach local minima. The recent work of Ge et al. [17] showed that a noise-injected version of SGD in fact converges to local minima instead of critical points, as long as the underlying non-convex function is strict-saddle.
Their theoretical running time is a large polynomial in the dimension and not competitive with our method (see Table 1). The work of Lee et al. [24] shows that gradient descent, starting from a random point, almost surely converges to a local minimum of a strict-saddle function. The rates of convergence and the precise step sizes required are, however, as yet unknown.

If second-order information (i.e., the Hessian oracle) is provided, the cubic-regularization method of Nesterov and Polyak [28] converges in O(1/ε^{3/2}) iterations. However, each iteration of Nesterov-Polyak requires solving a cubic function which, in general, takes time super-linear in the input representation.

One natural direction is to apply an approximate trust-region solver, such as the linear-time solver of [22], to approximately solve the cubic-regularization subroutine of Nesterov-Polyak. However, the approximation needed by a naive calculation makes this approach even slower than vanilla gradient descent. Our main challenge is to obtain approximate second-order local minima and simultaneously improve upon gradient descent.

Independently of this paper and concurrently,¹ Carmon et al. [7] develop an accelerated gradient descent method that achieves the same running time for finding an approximate local minimum as in our paper. Remarkably, the same running time is obtained via a very different technique.

1.2 Our Techniques

Our algorithm is based on the cubic regularization method of Nesterov and Polyak [8, 9, 28]. At a high level, cubic regularization states that if we can exactly minimize a cubic function

    m(h) ≜ g⊤h + (1/2) h⊤H h + (L/6)‖h‖³ ,

where g = ∇f(x), H = ∇²f(x), and L is the second-order smoothness of the function f, then we can iteratively perform updates x′ ← x + h, and this algorithm converges to an ε-approximate local minimum in O(1/ε^{3/2}) iterations.
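The cubic model and its gradient ∇m(h) = g + Hh + (L/2)‖h‖h are simple to write down explicitly. The following sketch, on a random toy instance of our own, evaluates both and verifies the gradient formula against central finite differences:

```python
import numpy as np

def cubic_model(g, H, L, h):
    """m(h) = g.h + 0.5*h'Hh + (L/6)*||h||^3, the Nesterov-Polyak model."""
    return g @ h + 0.5 * h @ H @ h + (L / 6.0) * np.linalg.norm(h) ** 3

def cubic_model_grad(g, H, L, h):
    """grad m(h) = g + H h + (L/2)*||h||*h."""
    return g + H @ h + (L / 2.0) * np.linalg.norm(h) * h

rng = np.random.default_rng(0)
d, L = 5, 2.0
A = rng.standard_normal((d, d))
H = (A + A.T) / 2                     # symmetric, possibly indefinite
g, h = rng.standard_normal(d), rng.standard_normal(d)

# central-difference check of the gradient formula, coordinate by coordinate
num = np.array([(cubic_model(g, H, L, h + 1e-6 * e) -
                 cubic_model(g, H, L, h - 1e-6 * e)) / 2e-6
                for e in np.eye(d)])
print(np.max(np.abs(num - cubic_model_grad(g, H, L, h))))  # tiny (FD error)
```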
Unfortunately, solving this cubic minimization problem exactly requires, to the best of our knowledge, a running time of O(d^ω), where ω is the matrix multiplication constant. Getting around this requires five observations.

The first observation is that minimizing m(h) up to a constant multiplicative approximation (plus a few other constraints) is sufficient for an iteration complexity of O(1/ε^{3/2}).² The proof techniques behind this observation are based on extending Nesterov and Polyak.

The second observation is that the minimizer h* of m(h) must be of the form

    h* = −(H + λ* I)⁺ g + v ,

where λ* ≥ 0 is some constant satisfying H + λ* I ⪰ 0, v lies in the direction of the smallest eigenvector of H, and ⁺ denotes the pseudo-inverse of a matrix. This can be viewed as moving in a mixture direction between choosing h ← v and choosing h to follow a shifted Newton direction h ← −(H + λ* I)⁺ g. Intuitively, we wish to reduce both the computation of (H + λ* I)⁺ g and of v to Hessian-vector products.

The first task, computing (H + λ* I)⁺ g, can be slow: even if H + λ* I is strictly positive definite, computing it has a complexity depending on the (possibly huge) condition number of H + λ* I [34]. The third observation is that it suffices to pick some λ′ > λ* so that both (1) the condition number of H + λ′ I is small and (2) the vectors (H + λ* I)⁻¹ g and (H + λ′ I)⁻¹ g are close. This relies on the structure of m(h).

The second task, computing v, has a complexity depending on 1/√δ, where δ is the target additive error [13, 14].

¹ To be precise, their manuscript appeared online approximately 24 hours before ours.
² More specifically, we need m_t(h) ≤ (1/C) · min_h { m_t(h) } for some constant C. In addition, we need to have good bounds on ‖h‖ and ‖∇m(h)‖.
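For intuition on the second task, the simplest routine that extracts a smallest eigenvector from Hessian-vector products alone is power iteration on the shifted matrix L₂I − H (which is PSD and shares eigenvectors with H). The solvers cited above [13, 14] are faster, with a 1/√δ rather than 1/δ dependence; this sketch of ours is only meant to show that matrix-vector products suffice:

```python
import numpy as np

def smallest_eigvec(hvp, d, L2, iters=2000, seed=0):
    """Approximate the smallest eigenvector of a symmetric H given only a
    Hessian-vector-product oracle hvp(v) = H @ v.  Power iteration is run
    on B = L2*I - H, whose top eigenvector is H's smallest eigenvector."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    for _ in range(iters):
        w = L2 * w - hvp(w)            # one matvec with B = L2*I - H
        w /= np.linalg.norm(w)
    return w

H = np.diag([1.0, 0.5, -0.8])          # smallest eigenvalue is -0.8
v = smallest_eigvec(lambda x: H @ x, d=3, L2=1.0)
print(v @ H @ v)                       # close to -0.8
```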
The fourth observation is that the choice δ = √ε suffices for the outer loop of cubic regularization to make sufficient progress. This reduces the complexity of computing v.

Finally, finding the correct value λ* itself is as hard as minimizing m_t(h). The fifth step is to design an iterative scheme that makes only a logarithmic number of guesses for λ*. This procedure either finds the correct one (via binary search), or finds an approximate one, λ′, such that (H + λ* I)⁻¹ g and (H + λ′ I)⁻¹ g are sufficiently close.

Putting all the observations together, and balancing all the parameters, we obtain a cubic minimization subroutine (see FastCubicMin in Algorithm 2) that runs in time O(nd + n^{3/4}d/ε^{1/4}).

2 Preliminaries and Main Theorem

We use ‖·‖ to denote the Euclidean norm of a vector and the spectral norm of a matrix. For a symmetric matrix M, we denote by λ_max(M) and λ_min(M) respectively the maximum and minimum eigenvalues of M. We write A ⪰ B to denote that A − B is positive semidefinite (PSD). For a PSD matrix M that is not strictly positive definite, we denote by M⁺ its pseudo-inverse.

We make the following Lipschitz-continuity assumptions on the gradient and Hessian of the target function f: namely, there exist L₂, L > 0 such that

    ∀x, y ∈ R^d :   ‖∇²f(x)‖ ≤ L₂   and   ‖∇²f(x) − ∇²f(y)‖ ≤ L‖x − y‖ .    (2.1)

Definition 2.1. We assume the following complexity parameters on the access to f(x):
• Let T_g be the time complexity to compute ∇f(x) for any x ∈ R^d.
• Let T_h be the time complexity to compute ∇²f(x)·v for any x, v ∈ R^d.

Definition 2.2. We say that f is of finite-sum form if f = (1/n) Σ_{i=1}^n f_i(x) and ‖∇²f_i(x)‖ ≤ L₂ for each i ∈ [n]. In this case, we define T_{h,1} to be the time complexity to compute ∇²f_i(x)·v for arbitrary x, v ∈ R^d and i ∈ [n].
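On small dense instances, the binary-search idea can be realized directly: ‖(H + λI)⁻¹g‖ decreases in λ while 2λ/L increases, so their crossing point λ* can be found by bisection, after which h = −(H + λ*I)⁻¹g approximately minimizes m(h). The following is our own dense-matrix illustration of that monotonicity, not the paper's linear-time routine, and it assumes the nondegenerate ("easy") case where g has a component along the smallest eigenvector:

```python
import numpy as np

def solve_cubic_dense(g, H, L):
    """Minimize m(h) = g.h + 0.5*h'Hh + (L/6)*||h||^3 on a small dense
    instance: bisect on the scalar equation L*||(H+lam I)^-1 g|| = 2*lam,
    which is monotone in lam (nondegenerate case assumed)."""
    d = len(g)
    lam_lo = max(0.0, -np.linalg.eigvalsh(H).min()) + 1e-12
    lam_hi = lam_lo + max(1.0, 2 * np.linalg.norm(H, 2)
                          + np.sqrt(L * np.linalg.norm(g)))
    phi = lambda lam: (L * np.linalg.norm(np.linalg.solve(H + lam * np.eye(d), g))
                       - 2 * lam)
    for _ in range(200):
        mid = (lam_lo + lam_hi) / 2
        lam_lo, lam_hi = (mid, lam_hi) if phi(mid) > 0 else (lam_lo, mid)
    lam = (lam_lo + lam_hi) / 2
    return lam, -np.linalg.solve(H + lam * np.eye(d), g)

H = np.diag([1.0, -0.5])     # indefinite: a plain Newton step would fail here
g = np.array([0.3, 0.4])
L = 1.0
lam, h = solve_cubic_dense(g, H, L)
# at the minimizer, the model gradient g + H h + (L/2)*||h||*h vanishes
print(np.linalg.norm(g + H @ h + (L / 2) * np.linalg.norm(h) * h))
```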
Next we define strict-saddle functions, for which an ε-approximate local minimum is almost equivalent to a local minimum [15, 24].

Definition 2.3 (strict saddle). Suppose f(·) : R^d → R is twice differentiable. For α, β, γ ≥ 0, we say f is (α, β, γ)-strict saddle if every x ∈ R^d satisfies at least one of the following three conditions:
1. ‖∇f(x)‖ ≥ α.
2. λ_min(∇²f(x)) ≤ −β.
3. There exists a local minimum x★ that is γ-close to x in Euclidean distance.

We see that if a function is (α, β, γ)-strict saddle, then for ε < min{α, β²} an ε-approximate local minimum is γ-close to some local minimum.

Algorithm 1 FastCubic(f, x₀, ε, L, L₂)
Input: f(x) that satisfies (2.1) with L₂ and L; a starting vector x₀; a target accuracy ε.
1: κ ← 900/(εL)^{1/2}.
2: for t = 0 to ∞ do
3:   m_t(h) ≜ ∇f(x_t)⊤h + h⊤∇²f(x_t)h/2 + (L/6)‖h‖³
4:   (λ, v, v_min) ← FastCubicMin(∇f(x_t), ∇²f(x_t), L, L₂, κ)
5:   h′ ← either v or λv_min/(2L), whichever gives the smaller value of m_t(h);
6:   Set x_{t+1} ≜ x_t + h′
7:   if m_t(h′) > −ε^{3/2}/(c√L) then return x_{t+1}.   (c is a constant; we proved that c = 2.4 × 10⁶ works)
8: end for

2.1 Main Results

The finite-sum setting captures much of supervised learning, including neural networks and generalized linear models. The main theorem which we show in our paper is as follows:

Theorem 1. FastCubic (Algorithm 1) starts from a point x₀ and outputs a point x such that ‖∇f(x)‖ ≤ ε and λ_min(∇²f(x)) ≥ −√(Lε), in total time (denoting D ≜ f(x₀) − f(x*)):
• Õ( (D√L/ε^{3/2}) · T_g + (D L^{1/4}√L₂/ε^{7/4}) · T_h ), or
• Õ( (D√L/ε^{3/2}) · (T_g + n·T_{h,1}) + (D n^{3/4} L^{1/4}√L₂/ε^{7/4}) · T_{h,1} ) in the finite-sum setting (see Definition 2.2).
Here Õ hides logarithmic factors in L, L₂, 1/ε, d, and in max_x ‖∇f(x)‖.

Two Known Subroutines.
The running time of FastCubic relies on the following recent results for approximate matrix inverse and approximate PCA.

Theorem 2.4 (approximate matrix inverse). Suppose a matrix M ∈ R^{d×d} satisfies ‖M‖ ≤ L₂ and λI + M ⪰ δI for constants λ, δ, L₂ > 0. Let κ ≜ (λ + L₂)/δ. Then we can compute a vector x satisfying

    ‖x − (λI + M)⁻¹ b‖ ≤ ε‖b‖ ,    (2.2)

using accelerated gradient descent (AGD) in O(κ^{1/2} log(κ/ε)) iterations, each requiring O(d) time plus the time needed to multiply M with a vector.
Moreover, suppose M = (1/n) Σ_{i=1}^n M_i, where each M_i is symmetric and satisfies ‖M_i‖ ≤ L₂. If M_i·b can be computed in time O(d′) for each i and vector b, then accelerated SVRG [4, 33] computes a vector x that satisfies equation (2.2) in time O( max{n, n^{3/4}κ^{1/2}} · d′ · log²(κ/ε) ).
We refer to the running time for this computation as T_inverse(κ, ε) and to the algorithm as A.

Above, the SVRG-based running time will be used only for our finite-sum case in Definition 2.2.

Theorem 2.5 (AppxPCA [3, 13, 14]). Let M ∈ R^{d×d} be a symmetric matrix with eigenvalues 1 ≥ λ₁ ≥ ⋯ ≥ λ_d ≥ 0. With probability at least 1 − p, AppxPCA produces a unit vector w satisfying

    w⊤ M w ≥ (1 − δ_×)(1 − ε) λ_max(M) .

The total running time is Õ( T_inverse(1/δ_×, εδ_×) ).

3 Our Fast Cubic Regularization Algorithm

Recall that the cubic regularization method of Nesterov and Polyak [28] studies the following upper bound on the change in objective value as we move from a point x_t to x_t + h (it follows simply from the Taylor series truncated to the third order):

    ∀h ∈ R^d :   f(x_t + h) − f(x_t) ≤ m_t(h) ≜ ∇f(x_t)⊤h + h⊤∇²f(x_t)h/2 + (L/6)‖h‖³ .    (3.1)

Denote by h* an arbitrary minimizer of m_t(h). We propose in this paper a subroutine FastCubicMin that minimizes m_t(h) approximately.
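The quantity needed from Theorem 2.4 is the application of (λI + M)⁻¹ to a vector using only products of M with vectors. Any Krylov-type solver has this access pattern; the sketch below is ours and uses plain conjugate gradients in place of AGD or accelerated SVRG (same oracle model and a similar √κ-style iteration behavior, but not the paper's exact algorithm):

```python
import numpy as np

def apply_inverse(matvec, lam, b, iters=500, tol=1e-10):
    """Approximate (lam*I + M)^-1 b given only matvec(v) = M @ v, via
    conjugate gradients on the positive-definite system (lam*I + M) x = b."""
    shifted = lambda v: lam * v + matvec(v)
    x = np.zeros_like(b)
    r = b - shifted(x)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = shifted(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol * np.linalg.norm(b):
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

M = np.diag([2.0, -0.5, 1.0])      # symmetric; lam*I + M must be PD
b = np.array([1.0, 2.0, 3.0])
lam = 1.0
x = apply_inverse(lambda v: M @ v, lam, b)
print(np.linalg.norm(x - np.linalg.solve(lam * np.eye(3) + M, b)))  # ~0
```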
Note that FastCubicMin returns two vectors v and v_min. We then choose h′ to be either v or λv_min/(2L), whichever gives the smaller value of m_t(h). Before discussing the details of FastCubicMin, let us first state the main theorem for FastCubicMin:³

Theorem 2 (guarantees of FastCubicMin). The algorithm FastCubicMin finds a vector h′ that satisfies:
(a) It produces a vector h′ satisfying m_t(h′) ≤ 0, and either 3000·m_t(h′) ≤ m_t(h*) or m_t(h*) ≥ −ε^{3/2}/(800√L).
(b) If m(h*) ≥ −ε^{3/2}/(300√L), then ‖h′‖ ≤ ‖h*‖ + √ε/(4√L) and ‖∇m_t(h′)‖ ≤ ε/2.
(c) FastCubicMin runs in time (using Õ to hide logarithmic factors in L, L₂, 1/ε, d, ‖∇f(x_t)‖):
• Õ( √L₂/(εL)^{1/4} · T_h ), where T_h is the time to multiply ∇²f(x_t) with a vector;
• Õ( max{ n, n^{3/4}√L₂/(εL)^{1/4} } · T_{h,1} ), where T_{h,1} is the time to multiply some ∇²f_i(x_t) with a vector.

Above, the first guarantee promises that we are either done (because m_t(h*) is close to zero), or we obtain a 1/3000 multiplicative approximation to m_t(h*). The second guarantee in Theorem 2 promises that when we are done (because m_t(h*) is close to zero), the output vector h′ and h* are roughly similar in Euclidean norm, and h′ has a small gradient ‖∇m_t(h′)‖. The third guarantee gives the time complexity of FastCubicMin.

Now, our final algorithm FastCubic for finding an ε-approximate local minimum of f(x) is given in Algorithm 1. It simply calls FastCubicMin iteratively to find an approximate minimizer, and it stops whenever m_t(h′) > −ε^{3/2}/(c√L) for some large constant c.

Roadmap. In Section 4 we show why Theorem 2 implies Theorem 1. All the remaining sections serve to prove Theorem 2. Because our FastCubicMin is very technical, instead of stating the algorithm right away, we take a different path.
In Section 5, we first state a lemma characterizing "what h* looks like". In Section 6, we provide a set of sufficient conditions which "look similar" to the characterization of h*, and show that as long as these conditions are met, Theorems 2-a and 2-b follow easily. Finally, in Section 7, we state FastCubicMin and explain why it satisfies these sufficient conditions and why it runs in the aforementioned time.

4 Theorem 2 implies Theorem 1

In this section we show that Theorem 2 implies Theorem 1. The proof relies on the following lemma (proved in Appendix B) giving a sufficient condition for reaching an ε-approximate local minimum.

³ To present the simplest result, we have not tried to improve the constant dependency in this paper.

Lemma 4.1. If m_t(h*) ≥ −ε^{3/2}/(800√L), and h′ is an approximate minimizer of m_t(h) satisfying ‖h′‖ ≤ ‖h*‖ + √ε/(4√L) and ‖∇m_t(h′)‖ ≤ ε/2, then

    ‖∇f(x_t + h′)‖ ≤ ε   and   λ_min(∇²f(x_t + h′)) ≥ −√(Lε) .

Proof of Theorem 1 from Theorem 2. When FastCubic terminates, we have m_t(h′) > −ε^{3/2}/(c√L); therefore, m_t(h*) ≥ −ε^{3/2}/(800√L) according to Theorem 2-a. Combining this with Theorem 2-b and Lemma 4.1, we conclude that in the last iteration of FastCubic, the output satisfies ‖∇f(x_t + h′)‖ ≤ ε and λ_min(∇²f(x_t + h′)) ≥ −√(Lε). This finishes the proof with respect to the accuracy conditions.

As for the running time: in every iteration except the last one, FastCubic satisfies m_t(h′) ≤ −Ω(ε^{3/2}/√L). Therefore, by (3.1), we must decrease the objective by at least Ω(ε^{3/2}/√L) in each such round, and this cannot happen for more than O( (f(x₀) − f*)·√L/ε^{3/2} ) iterations. The final running time of FastCubic follows from this bound together with Theorem 2-c. □

Therefore, in the rest of the paper it suffices to study FastCubicMin and prove Theorem 2.
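The descent argument above can be watched numerically. Below is a toy outer loop of our own on a 2-D nonconvex function, using a dense bisection-based model minimizer in place of FastCubicMin (so it illustrates Algorithm 1's skeleton and stopping rule, not the paper's linear-time subroutine); the iteration count stays far below the O(Dε^{−3/2}) ceiling because convergence is locally fast near a nondegenerate minimum:

```python
import numpy as np

def cubic_step(g, H, L):
    """One Nesterov-Polyak step on a small dense problem: bisect on
    L*||(H + lam I)^-1 g|| = 2*lam (nondegenerate case), then return
    h = -(H + lam I)^-1 g."""
    d = len(g)
    lo = max(0.0, -np.linalg.eigvalsh(H).min()) + 1e-12
    hi = lo + max(1.0, 2 * np.linalg.norm(H, 2) + np.sqrt(L * np.linalg.norm(g)))
    phi = lambda lam: (L * np.linalg.norm(np.linalg.solve(H + lam * np.eye(d), g))
                       - 2 * lam)
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if phi(mid) > 0 else (lo, mid)
    lam = (lo + hi) / 2
    return -np.linalg.solve(H + lam * np.eye(d), g)

# toy nonconvex objective f(x, y) = (x^2 - 1)^2 + y^2, started in a
# negative-curvature region; L = 50 is a crude Hessian-Lipschitz bound here
grad = lambda x: np.array([4 * x[0] * (x[0] ** 2 - 1), 2 * x[1]])
hess = lambda x: np.diag([12 * x[0] ** 2 - 4, 2.0])

L, eps = 50.0, 1e-4
x = np.array([0.5, 0.5])
for t in range(100):
    h = cubic_step(grad(x), hess(x), L)
    m = grad(x) @ h + 0.5 * h @ hess(x) @ h + (L / 6) * np.linalg.norm(h) ** 3
    x = x + h
    if m > -eps ** 1.5 / np.sqrt(L):
        break                          # same stopping rule as Algorithm 1
print(t, x)                            # few iterations; x near the minimum (1, 0)
```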
5 Characterization Lemma for the Minimizer h*

For notational simplicity, in this and the subsequent sections we focus on the following problem:

    minimize  m(h) ≜ g⊤h + h⊤Hh/2 + (L/6)‖h‖³ ,

where H is a symmetric matrix with ‖H‖₂ ≤ L₂. Recall from the previous section that we denote by h* an arbitrary minimizer of m(h). We have the following lemma, which characterizes h* (a variant of this lemma appeared in [8]; we prove it in the appendix for the sake of completeness):

Lemma 5.1. A vector h* is a minimizer of m(h) if and only if there exists λ* ≥ 0 such that

    H + λ* I ⪰ 0 ,   (H + λ* I) h* = −g ,   ‖h*‖ = 2λ*/L .

The objective value in this case is given by

    m(h*) = −(1/2) g⊤(H + λ* I)⁺ g − 2(λ*)³/(3L²) ≤ 0 .

The following corollary comes from Lemma 5.1 and its proof:

Corollary 5.2. The value λ* in Lemma 5.1 is unique, and for every λ satisfying H + λI ≻ 0 we have

    ‖(H + λI)⁻¹ g‖ > 2λ/L ⟺ λ* > λ   and   ‖(H + λI)⁻¹ g‖ < 2λ/L ⟺ λ* < λ .

From the above characterization we get a crude upper bound on λ*:

Proposition 5.3. We have λ* ≤ B ≜ max{ 2L₂ + √(L‖g‖), 1 }, with λ* defined in Lemma 5.1.

Proof. We have L‖(H + BI)⁻¹g‖ ≤ L‖g‖/λ_min(H + BI) ≤ L‖g‖/(B − L₂) < 2B, and therefore λ* ≤ B by Corollary 5.2. □

6 Sufficient Conditions for Theorems 2-a and 2-b

Without worrying about the design of FastCubicMin for the moment, let us first state a set of sufficient conditions under which the assumptions in Theorem 2-a can be satisfied.

Main Lemma 1. Consider an algorithm that outputs a real λ ∈ [0, 2B], a vector v ∈ R^d, and a unit vector v_min ∈ R^d.
Additionally, suppose numbers κ, ε̃ ≥ 0 satisfy the following conditions:

    ε̃ ≤ (1/10000) · 1/( max{ κ, L, L₂, ‖g‖, ‖(H + λI)⁻¹‖, B } )²⁰    (6.1)
    (H + (λ − Lε̃) I)⁻¹ ≻ 0    (6.2)

Moreover, suppose that the outputs (λ, v, v_min) satisfy one of the following two cases:

Case 1: L‖(H + λI)⁻¹g‖ ∈ [2λ − 2Lε̃, 2λ + 2Lε̃] and ‖v + (H + λI)⁻¹g‖ ≤ ε̃.

Case 2: The following conditions are satisfied:
(a) λ ≥ λ* and λ + λ_min(H) ≤ 1/κ;
(b) L‖(H + λI)⁻¹g‖ ≤ 2λ and ‖v + (H + λI)⁻¹g‖ ≤ ε̃;
(c) v_min⊤ H v_min ≤ λ_min(H) + 1/(10κ).

Then at least one of the two choices h′ ∈ { v, λv_min/(2L) } satisfies either m(h*) ≥ 3000·m(h′) or m(h*) ≥ −32/(κ³L²).

Let us compare these sufficient conditions to the characterization in Lemma 5.1.
• In Case 1, up to a very small error ε̃, we have essentially found a vector v that satisfies v ≈ −(H + λI)⁻¹g and ‖v‖ ≈ 2λ/L. Therefore, this v should be close to h* for the obvious reason. (This is the simple case.)
• In Case 2, we have only found a vector v that satisfies v ≈ −(H + λI)⁻¹g and ‖v‖ ≲ 2λ/L. In this case, we also compute an approximate smallest eigenvector v_min of H, accurate up to an additive 1/(10κ) (see Case 2-c). We will make sure that, as long as the conditions in Case 2-a hold, either v or λv_min/(2L) is an approximate minimizer of m_t(h). (This is the hard case.)

Proof of Main Lemma 1. We first consider Case 1. According to Corollary 5.2, if ε̃ = 0 then v is a minimizer of m(h). The following claim extends this argument to the setting ε̃ > 0:

Claim 6.1. If λ and v satisfy Case 1 and ε̃ satisfies (6.1), then m(v) ≤ m(h*) + 1/(250κ³L²).

From the above claim it follows that either m(h*) ≥ −8/(κ³L²), or otherwise m(h*) ≥ 1.1·m(v), which satisfies the conditions of the lemma.
We now consider Case 2, and in this case we make the following two claims:

Claim 6.2. If λ_min(H) ≤ −1/κ, then m(h*) ≥ 1500·min{ m(v), m(λv_min/(2L)) } − 1/(500κ³L²).

Claim 6.3. If λ_min(H) ≥ −1/κ, then m(h*) ≥ 2·m(v) − 16/(κ³L²).

Main Lemma 1 now follows from the two claims, because we can output the vector h′ with the lower value of m(h′) among the two choices h′ ∈ { v, λv_min/(2L) }; this h′ satisfies either m(h*) ≥ 3000·m(h′) or m(h*) ≥ −32/(κ³L²). The missing proofs of the three claims are deferred to Appendix D. □

The next main lemma shows that, under the same sufficient conditions as in Main Lemma 1, Theorem 2-b also holds. (Its proof is contained in Appendix E.)

Main Lemma 2. In the same setting as Main Lemma 1, suppose m(h*) ≥ −ε^{3/2}/(300√L). Then the output vector v satisfies

    ‖v‖ ≤ ‖h*‖ + 3/(κL)   and   ‖∇m(v)‖ ≤ ε/4 + 15/(κ²L) .

7 Main Algorithms for Theorem 2

We are now ready to state our main algorithm FastCubicMin and sketch why it satisfies the sufficient conditions in Main Lemma 1. As described in Algorithm 2, our algorithm starts with a very large choice λ₀ ← 2B and decreases it gradually. At each iteration i, it computes an approximate inverse v satisfying ‖v + (H + λ_i I)⁻¹g‖ ≤ ε̃ with respect to the current λ_i. Then there are three cases, depending on whether L‖v‖ is approximately equal to, larger than, or smaller than 2λ_i. At a high level, if it is "equal", then we have met Case 1 of Main Lemma 1; if it is "larger", then we can binary-search the correct value of λ* in the interval [λ_i, λ_{i−1}]; and if it is "smaller", then we need to compute an approximate eigenvector and carefully choose the next point λ_{i+1}.

We state our main lemma below regarding the correctness and running time of FastCubicMin.

Main Lemma 3.
FastCubicMin in Algorithm 2 outputs a real λ ∈ [0, 2B], a vector v ∈ R^d, and a unit vector v_min ∈ R^d satisfying one of the two sufficient conditions in Main Lemma 1. Moreover, the procedure can be implemented in a total running time of
• Õ( √(κL₂) ) · T_h, if accelerated gradient descent is used in Theorem 2.4 to invert matrices;
• Õ( max{ n, n^{3/4}√(κL₂) } ) · T_{h,1}, if we use accelerated SVRG as the subprocedure A in Theorem 2.4.
Here Õ hides logarithmic factors in L, L₂, κ, d, B.

We prove the correctness half of Main Lemma 3 below, and defer its running-time analysis to Appendix G.

7.1 Correctness Half of Main Lemma 3

We now establish the correctness of our algorithm. We first observe that the BinarySearch subroutine returns (λ, v, ∅) satisfying Case 1 of Main Lemma 1.

Fact 7.1. BinarySearch outputs a pair λ and v such that L‖(H + λI)⁻¹g‖ ∈ [2λ − 2Lε̃, 2λ + 2Lε̃] and ‖v + (H + λI)⁻¹g‖ ≤ ε̃.

Proof. The latter is guaranteed by Line 3 of BinarySearch, and the former is implied by the latter because L‖(H + λI)⁻¹g‖ ∈ [L‖v‖ − Lε̃/2, L‖v‖ + Lε̃/2] ⊆ [2λ − 2Lε̃, 2λ + 2Lε̃]. □

We also establish the following invariants regarding the values λ_i. (Proof in Appendix F.)

Lemma 7.2. The following statements hold for all i until FastCubicMin terminates:
(a) λ_i ∈ [0, 2B] and λ_i + λ_max(H) ≤ 3B;
(b) λ_i + λ_min(H) ≥ 3/(10κ);
(c) λ_{i+1} + λ_min(H) ≤ (3/4)(λ_i + λ_min(H)), unless λ_{i+1} = 0.
Moreover, when FastCubicMin terminates at Line 20, we have λ_i + λ_min(H) ≤ 1/κ.

We now prove that the output (λ, v, v_min) of FastCubicMin satisfies the sufficient conditions of Main Lemma 1.

Algorithm 2 FastCubicMin(g, H, L, L₂, κ)   (main algorithm for cubic minimization)
Input: a vector g; a symmetric matrix H satisfying −L₂I ⪯ H ⪯ L₂I; parameters κ, L, and L₂.
Output: (λ, v, v_min)
1: B ← L₂ + √(L‖g‖) + 1/κ.
2: ε̃ ← 1/( 10000 · max{ L, ‖g‖, 10κ/3, B, 1 }²⁰ )
3: λ₀ ← 2B.
4: for i = 0 to ∞ do
5:   Compute v such that ‖v + (H + λ_i I)⁻¹g‖ ≤ ε̃.
6:   if L‖v‖ ∈ [2λ_i − Lε̃, 2λ_i + Lε̃] then
7:     return (λ_i, v, ∅).
8:   else if L‖v‖ > 2λ_i + Lε̃ then
9:     return BinarySearch(λ₁ = λ_{i−1}, λ₂ = λ_i, ε̃).
10:  else if L‖v‖ < 2λ_i − Lε̃ then
11:    Let PowerMethod find a vector w that is a 9/10-approximate leading eigenvector of (H + λ_i I)⁻¹:
         (9/10)·λ_max((H + λ_i I)⁻¹) ≤ w⊤(H + λ_i I)⁻¹w ≤ λ_max((H + λ_i I)⁻¹).
12:    Compute a vector w̃ such that ‖w̃ − (H + λ_i I)⁻¹w‖ ≤ ε̂ ≜ 1/(60B).
13:    Δ ← (1/2) · 1/(w̃⊤w − ε̂).
14:    if Δ > 1/(2κ) then
15:      λ̃_{i+1} ← λ_i − Δ/2.
16:      if λ̃_{i+1} > 0 then λ_{i+1} ← λ̃_{i+1} else λ_{i+1} ← 0.
17:    else
18:      Use AppxPCA to find any unit vector v_min such that v_min⊤ H v_min ≤ λ_min(H) + 1/(10κ).
19:      Flip the sign of v_min so that g⊤v_min ≤ 0.
20:      return (λ_i, v, v_min).
21:    end if
22:  end if
23: end for

Algorithm 3 BinarySearch(λ₁, λ₂, ε̃)   (binary-search subroutine)
Input: λ₁ ≥ λ₂ with L‖(H + λ₁I)⁻¹g‖ ≤ 2λ₁, L‖(H + λ₂I)⁻¹g‖ ≥ 2λ₂, and λ₂ + λ_min(H) > 0
Output: (λ, v, ∅)
1: for t = 1 to ∞ do
2:   λ_mid ← (λ₁ + λ₂)/2
3:   Compute a vector v such that ‖v + (H + λ_mid I)⁻¹g‖ ≤ ε̃/2
4:   if L‖v‖ ∈ [2λ_mid − Lε̃, 2λ_mid + Lε̃] then
5:     return (λ_mid, v, ∅)
6:   else if L‖v‖ + Lε̃ ≤ 2λ_mid then
7:     λ₁ ← λ_mid
8:   else if L‖v‖ − Lε̃ ≥ 2λ_mid then
9:     λ₂ ← λ_mid
10:  end if
11: end for

Correctness proof of Main Lemma 3. We carefully verify the sufficient conditions:
• Lemma 7.2 implies λ_i ∈ [0, 2B].
• λ_i + λ_min(H) ≥ 3/(10κ) from Lemma 7.2 implies ‖(H + λ_i I)⁻¹‖ ≤ 4κ. It is now immediate that the choice of ε̃ on Line 2 satisfies Condition (6.1) in the assumption of Main Lemma 1.
• Since ε̃ ≤ 1/(10κL) and λ_i + λ_min(H) ≥ 3/(10κ), it follows that (H + (λ_i − Lε̃)I)⁻¹ ≻ 0, which proves Condition (6.2) of Main Lemma 1.
• We now verify Cases 1 and 2 in the assumption of Main Lemma 1. At the beginning of the algorithm, our choice λ₀ = 2B ensures (using Proposition 5.3) that L‖(H + λ₀I)⁻¹g‖ < 2λ₀. Let us now consider the various places where the algorithm outputs:
  – If FastCubicMin terminates at Line 7, then we have ‖v + (H + λ_i I)⁻¹g‖ ≤ ε̃ and additionally L‖(H + λ_i I)⁻¹g‖ ∈ [L‖v‖ − Lε̃, L‖v‖ + Lε̃] ⊆ [2λ_i − 2Lε̃, 2λ_i + 2Lε̃]. Therefore, the output meets the Case 1 requirement of Main Lemma 1 with λ = λ_i.
  – If FastCubicMin terminates at Line 9, then L‖(H + λ_i I)⁻¹g‖ > L‖v‖ − Lε̃ ≥ 2λ_i. Obviously, we must have i ≥ 1 in this case because L‖(H + λ₀I)⁻¹g‖ < 2λ₀. Therefore, Line 10 must have been reached in the previous iteration, which implies L‖(H + λ_{i−1} I)⁻¹g‖ < 2λ_{i−1}. Together, these two facts imply that we may call BinarySearch with (λ_{i−1}, λ_i). Owing to Fact 7.1, the subroutine outputs a pair (λ, v) satisfying the Case 1 requirement of Main Lemma 1.
  – If FastCubicMin terminates at Line 20, we verify that Case 2 of Main Lemma 1 holds with λ = λ_i. We first have L‖(H + λ_i I)⁻¹g‖ ≤ L‖v‖ + Lε̃ ≤ 2λ_i. By Corollary 5.2, we then have λ_i ≥ λ*. Lemma 7.2 tells us that λ_i satisfies λ_i + λ_min(H) ≤ 1/κ. The vector v satisfies ‖v + (H + λ_i I)⁻¹g‖ ≤ ε̃. The vector v_min satisfies v_min⊤ H v_min ≤ λ_min(H) + 1/(10κ).
In sum, we have verified that all the assumptions of Main Lemma 1 hold. □

Final proof of Theorem 2. Theorem 2 is a direct corollary of our main lemmas. Main Lemma 3 ensures that the assumptions of Main Lemma 1 and Main Lemma 2 both hold.
Now, using the special choice of $\kappa$ in FastCubic, Theorem 2-a immediately comes from Main Lemma 1; Theorem 2-b immediately comes from Main Lemma 2; and Theorem 2-c immediately comes from Main Lemma 3. This finishes the proof of Theorem 2.

Acknowledgements

We thank Ben Recht for very helpful suggestions and corrections to a previous version. Z. Allen-Zhu is supported by an NSF Grant, no. CCF-1412958, and a Microsoft Research Grant, no. 0518584. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of NSF or Microsoft.

Appendix

A Computing Hessian-Vector Products in Linear Time

In this section we sketch the intuition for why Hessian-vector products can be computed in linear time in many interesting (especially machine learning) problems. We start by showing that the gradient can be computed in linear time. The algorithm is often referred to as back-propagation, which dates back to Werbos's PhD thesis [35], and was popularized by Rumelhart et al. [31] for training neural networks.

Claim A.1 (back-propagation, informally stated). Suppose a real-valued function $f : \mathbb{R}^d \to \mathbb{R}$ can be evaluated by a differentiable circuit of size $N$. Then, the gradient $\nabla f$ can be computed in time $O(N + d)$ (using a circuit of size $O(N + d)$).

The claim follows from simple induction and the chain rule, and is left to the reader. In the training of neural networks, the size of the circuit that computes the objective $f$ is often proportional to (or equal to) the number of parameters $d$. Thus the gradient $\nabla f$ can be computed in time $O(d)$ using a circuit of size $O(d)$.

Next, we consider computing $\nabla^2 f(x) \cdot v$ where $v \in \mathbb{R}^d$. Let $g(x) := \langle \nabla f(x), v \rangle$ be a function from $\mathbb{R}^d$ to $\mathbb{R}$. Then, we see that it suffices to compute the gradient of $g$, since $\nabla^2 f(x) \cdot v = \nabla g(x)$.
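As a concrete illustration of this identity (this numerical sketch and all names in it are ours, not part of the paper's algorithm), one can verify on a toy non-convex function that the gradient of $g(x) = \langle \nabla f(x), v \rangle$ equals the Hessian-vector product $\nabla^2 f(x)\, v$. Here $f(x) = \frac{1}{4}\|x\|^4 + b^\top x$, whose gradient and Hessian are available in closed form, and $\nabla g$ is approximated by finite differences (an autodiff system would instead back-propagate through $g$ at the cost of one extra gradient computation):

```python
import numpy as np

# Illustrative function: f(x) = 0.25 * ||x||^4 + b.x, with closed forms
#   grad f(x) = ||x||^2 * x + b,    Hess f(x) = ||x||^2 * I + 2 x x^T.
d = 5
rng = np.random.default_rng(0)
b = rng.standard_normal(d)
x = rng.standard_normal(d)
v = rng.standard_normal(d)

def grad_f(x):
    return np.dot(x, x) * x + b

def hess_f(x):
    return np.dot(x, x) * np.eye(d) + 2.0 * np.outer(x, x)

# g(x) = <grad f(x), v>; its gradient is exactly the product Hess f(x) v.
def g(x):
    return grad_f(x) @ v

# Central finite differences stand in for a second back-propagation pass.
def grad_g_fd(x, h=1e-5):
    out = np.zeros(d)
    for i in range(d):
        e = np.zeros(d)
        e[i] = h
        out[i] = (g(x + e) - g(x - e)) / (2 * h)
    return out

hvp = hess_f(x) @ v            # exact Hessian-vector product
assert np.allclose(grad_g_fd(x), hvp, atol=1e-6)
```

The point of the identity is that the second pass never forms the $d \times d$ Hessian: only gradients of the scalar function $g$ are needed, which keeps the cost linear in $d$.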
We observe that $g(x)$ can be evaluated in linear time using a circuit of size $O(d)$, since we have shown $\nabla f(x)$ can. Thus, using Claim A.1 again on the function $g$, we conclude that $\nabla g(x)$ can also be computed in linear time. (Technically, we assume here that the gradient of each gate can be computed in $O(1)$ time, and that the original circuits are twice differentiable.)

B Proof of Lemma B.1 and Corollary 4.1

Lemma B.1. For all $h' \in \mathbb{R}^d$, it satisfies

$\|\nabla f(x_t + h')\| \le L\|h'\|^2 + \|\nabla m_t(h')\|$ and $\lambda_{\min}(\nabla^2 f(x_t + h')) \ge -\left(\frac{3L^2 \max\{0, -m_t(h^*)\}}{2}\right)^{1/3} - L\|h'\|$.

Proof of Lemma B.1. Let us denote by $g = \nabla f(x_t)$ and $H = \nabla^2 f(x_t)$ in this proof. We begin by proving the first-order condition. Note that we have $\nabla m_t(h) = g + Hh + \frac{L}{2}\|h\| h$. Recall $h^*$ is a minimizer of $m_t(h)$. The characterization result in Lemma 5.1 shows $H + \frac{L\|h^*\|}{2} I \succeq 0$, and thus

$g^\top h^* + (h^*)^\top H h^* + \frac{L}{2}\|h^*\|^3 = \nabla m_t(h^*)^\top h^* = 0$   (B.1)
$(h^*)^\top H h^* + \frac{L}{2}\|h^*\|^3 = (h^*)^\top \left(H + \frac{L\|h^*\|}{2} I\right) h^* \ge 0$.   (B.2)

They imply

$m_t(h^*) = g^\top h^* + \frac{(h^*)^\top H h^*}{2} + \frac{L\|h^*\|^3}{6} \stackrel{①}{=} -\frac{(h^*)^\top H h^*}{2} - \frac{L}{3}\|h^*\|^3 \stackrel{②}{\le} \frac{L}{4}\|h^*\|^3 - \frac{L}{3}\|h^*\|^3 = -\frac{L}{12}\|h^*\|^3$,   (B.3)

where ① uses (B.1) and ② uses (B.2). We compute the norm of the gradient at a point $x_t + h'$ for any $h' \in \mathbb{R}^d$:

$\|\nabla f(x_t + h')\| \le \|\nabla f(x_t + h') - \nabla m_t(h')\| + \|\nabla m_t(h')\|$
$= \left\| \nabla f(x_t) + \int_0^1 \nabla^2 f(x_t + \tau h') h' \, d\tau - \left(g + H h' + \frac{L}{2}\|h'\| h'\right) \right\| + \|\nabla m_t(h')\|$
$\le \left\| \int_0^1 \left(\nabla^2 f(x_t + \tau h') - H\right) h' \, d\tau \right\| + \frac{L}{2}\|h'\|^2 + \|\nabla m_t(h')\|$
$\stackrel{③}{\le} L\|h'\|^2 \int_0^1 \tau \, d\tau + \frac{L}{2}\|h'\|^2 + \|\nabla m_t(h')\| = L\|h'\|^2 + \|\nabla m_t(h')\|$,   (B.4)

where ③ follows from the Lipschitz continuity of the Hessian (2.1). This proves the first conclusion of the lemma.
As for the second-order condition, we first note that for all $h' \in \mathbb{R}^d$, by the Lipschitz continuity of the Hessian (2.1), we have $\|\nabla^2 f(x_t + h') - \nabla^2 f(x_t)\| \le L\|h'\|$. This implies

$\lambda_{\min}(\nabla^2 f(x_t + h')) \ge \lambda_{\min}(\nabla^2 f(x_t)) - L\|h'\|$,   (B.5)

because if two matrices $A$ and $B$ satisfy $\|A - B\| \le p$, then it must hold that $|\lambda_{\min}(A) - \lambda_{\min}(B)| \le p$ as well. We consider two cases. If $\lambda_{\min}(\nabla^2 f(x_t)) \ge 0$, then we have

$\lambda_{\min}(\nabla^2 f(x_t + h')) \ge -L\|h'\|$.   (B.6)

Otherwise, we consider the case where $\lambda_{\min}(\nabla^2 f(x_t)) = \lambda_{\min}(H) < 0$. Let $\nu_d$ be the normalized eigenvector corresponding to $\lambda_{\min}(H)$, and define $\tilde{h} \triangleq \operatorname{sign}(g^\top \nu_d) \cdot \frac{2\lambda_{\min}(H)}{L} \nu_d$. We calculate $m_t(\tilde{h})$ as follows:

$m_t(\tilde{h}) = g^\top \tilde{h} + \frac{\tilde{h}^\top H \tilde{h}}{2} + \frac{L}{6}\|\tilde{h}\|^3 \le \frac{\tilde{h}^\top H \tilde{h}}{2} + \frac{L}{6}\|\tilde{h}\|^3 = \frac{2(\lambda_{\min}(H))^2}{L^2} \nu_d^\top H \nu_d + \frac{4|\lambda_{\min}(H)|^3}{3L^2} \stackrel{①}{=} \frac{2(\lambda_{\min}(H))^3}{L^2} + \frac{4|\lambda_{\min}(H)|^3}{3L^2} \stackrel{②}{=} \frac{2(\lambda_{\min}(H))^3}{3L^2}$,   (B.7)

where ① uses $\nu_d^\top H \nu_d = \lambda_{\min}(H) < 0$, and ② uses the assumption that $\lambda_{\min}(H) < 0$. Since by definition $m_t(h^*) \le m_t(\tilde{h})$, we can deduce from inequality (B.7) that

$\lambda_{\min}(\nabla^2 f(x_t)) = \lambda_{\min}(H) \ge -\left(\frac{3L^2 |m_t(h^*)|}{2}\right)^{1/3}$.   (B.8)

Now we put together inequalities (B.5) and (B.8) and obtain

$\lambda_{\min}(\nabla^2 f(x_t + h')) \ge -\left(\frac{3L^2 |m_t(h^*)|}{2}\right)^{1/3} - L\|h'\|$.   (B.9)

Combining (B.6) and (B.9), we finish the proof of Lemma B.1.

Corollary 4.1. If $m_t(h^*) \ge -\frac{\varepsilon^{3/2}}{800\sqrt{L}}$ and $h'$ is an approximate minimizer of $m_t(h)$ satisfying $\|h'\| \le \|h^*\| + \frac{\sqrt{\varepsilon}}{4\sqrt{L}}$ and $\|\nabla m_t(h')\| \le \frac{\varepsilon}{2}$, then we have $\|\nabla f(x_t + h')\| \le \varepsilon$ and $\lambda_{\min}(\nabla^2 f(x_t + h')) \ge -\sqrt{L\varepsilon}$.

Proof of Corollary 4.1. First of all, our assumption that $m_t(h^*) \ge -\frac{\varepsilon^{3/2}}{800\sqrt{L}}$, along with inequality (B.3), tells us that $\|h^*\| \le \frac{\sqrt{\varepsilon}}{4\sqrt{L}}$. This, together with our assumption on $\|h'\|$, implies $\|h'\| \le \frac{\sqrt{\varepsilon}}{2\sqrt{L}}$.
Since we also assume $\|\nabla m_t(h')\| \le \frac{\varepsilon}{2}$, we have from Lemma B.1 that

$\|\nabla f(x_t + h')\| \le L\|h'\|^2 + \|\nabla m_t(h')\| \le \frac{\varepsilon}{4} + \frac{\varepsilon}{2} \le \varepsilon$.

For the second-order condition, we can again apply Lemma B.1 to get

$\lambda_{\min}(\nabla^2 f(x_t + h')) \ge -\left(\frac{3L^2 \max\{0, -m_t(h^*)\}}{2}\right)^{1/3} - L\|h'\| \ge -\left(\frac{3L^{3/2}\varepsilon^{3/2}}{1600}\right)^{1/3} - \frac{\sqrt{L\varepsilon}}{2} \ge -\sqrt{L\varepsilon}$.

C Proof of Lemma 5.1 and Corollary 5.2

We begin by proving a few lemmas that characterize the system of equations.

Lemma C.1. Consider the following system of equations/inequalities in variables $\lambda, h$:

$H + \lambda I \succeq 0$,  $(H + \lambda I) h = -g$,  $\|h\| = \frac{2\lambda}{L}$.   (C.1)

The following statements hold for any solution $(\lambda_0, h_0)$ of the above system:
• There is a unique value $\lambda_0$ that satisfies the above equations, and it satisfies $\lambda_0 \ge -\lambda_{\min}(H)$.
• If $\lambda_0 > -\lambda_{\min}(H)$, then the corresponding $h_0$ is also unique and is given by $h_0 = -(H + \lambda_0 I)^{-1} g$.
• If $\lambda_0 = -\lambda_{\min}(H)$, then $g^\top v = 0$ for any vector $v$ belonging to the eigenspace corresponding to $\lambda_{\min}(H)$. Subsequently, we also have that the corresponding $h_0$ is of the form $h_0 = -(H + \lambda_0 I)^{+} g + \gamma v$ for some $\gamma$ and some $v$ in the lowest eigenspace of $H$.

Proof of Lemma C.1. Note that $H + \lambda I \succeq 0$ ensures that for any solution $\lambda_0$, we have $\lambda_0 \ge -\lambda_{\min}(H)$. Furthermore, for any $\lambda_0 > -\lambda_{\min}(H)$, the corresponding $h_0$ is uniquely defined by $h_0 = -(H + \lambda_0 I)^{-1} g$, since $H + \lambda_0 I$ is invertible. If indeed $\lambda_0 = -\lambda_{\min}(H)$, then we have that the equation $(H - \lambda_{\min}(H) I) h = -g$ has a solution. This implies that $g$ has no component in the null space of $H - \lambda_{\min}(H) I$, or equivalently that it has no component in the eigenspace corresponding to $\lambda_{\min}(H)$. We also have that every solution of $(H - \lambda_{\min}(H) I) h = -g$ is necessarily of the form $h = -(H - \lambda_{\min}(H) I)^{+} g + \gamma v$ for some $\gamma$ and some $v$ in the lowest eigenspace of $H$. We will now prove the uniqueness of $\lambda_0$ by contradiction.
Consider two distinct values $\lambda_1, \lambda_2$ that satisfy the system (C.1). If both $\lambda_1, \lambda_2 > -\lambda_{\min}(H)$, we get that

$\|(H + \lambda_1 I)^{-1} g\| = \frac{2\lambda_1}{L}$ and $\|(H + \lambda_2 I)^{-1} g\| = \frac{2\lambda_2}{L}$.

Now note that $\|(H + \lambda I)^{-1} g\|$ is a strictly decreasing function over the domain $\lambda \in (-\lambda_{\min}(H), \infty)$, while $\frac{2\lambda}{L}$ is strictly increasing over the same domain. Therefore the above two equations cannot both be satisfied for two distinct $\lambda_1, \lambda_2 > -\lambda_{\min}(H)$, which is a contradiction.

Suppose now, without loss of generality, that $\lambda_1 = -\lambda_{\min}(H)$. Then we have that the corresponding solution is of the form $h = -(H + \lambda_1 I)^{+} g + \gamma v$ for some $\gamma$ and some $v$ in the lowest eigenspace of $H$, and $g$ has no component in the lowest eigenspace of $H$. It follows that $\|(H - \lambda_{\min}(H) I)^{+} g\| \ge \|(H + \lambda I)^{-1} g\|$ for any $\lambda > -\lambda_{\min}(H)$. By a similar argument as in the first case, we can now see that the conditions

$\|-(H + \lambda_1 I)^{+} g + \gamma v\| = \frac{2\lambda_1}{L}$ and $\|(H + \lambda_2 I)^{-1} g\| = \frac{2\lambda_2}{L}$

cannot both be satisfied for $\lambda_2 > \lambda_1 = -\lambda_{\min}(H)$, giving us a contradiction. This finishes the proof of Lemma C.1.

Lemma C.2. Let $(\lambda, h)$ be a solution of the system (C.1). Then we have

$m(h) = -\frac{1}{2} g^\top (H + \lambda I)^{+} g - \frac{2\lambda^3}{3L^2}$.

Proof of Lemma C.2. By the definition of the system (C.1), any solution $(\lambda, h)$ must be such that $h = -(H + \lambda I)^{+} g + \gamma v_0$ for some $\gamma$, where $v_0$ is in the null space of $H + \lambda I$ if it exists; otherwise $\gamma = 0$. This gives us the following:

$m(h) = g^\top h + \frac{h^\top H h}{2} + \frac{L}{6}\|h\|^3 \stackrel{①}{=} -\frac{1}{2} h^\top (H + \lambda I) h - \frac{\lambda}{2}\|h\|^2 + \frac{L}{6}\|h\|^3 \stackrel{②}{=} -\frac{1}{2} g^\top (H + \lambda I)^{+} g - \frac{2\lambda^3}{3L^2}$.

Equality ① follows because $(H + \lambda I) h = -g$. Equality ② follows because $h = -(H + \lambda I)^{+} g + \gamma v_0$ and $\|h\| = \frac{2\lambda}{L}$.

Lemma 5.1. $h^*$ is a minimizer of $m(h)$ if and only if there exists $\lambda^* \ge 0$ such that

$H + \lambda^* I \succeq 0$,  $(H + \lambda^* I) h^* = -g$,  $\|h^*\| = \frac{2\lambda^*}{L}$.
The objective value in this case is given by $m(h^*) = -\frac{1}{2} g^\top (H + \lambda^* I)^{+} g - \frac{2(\lambda^*)^3}{3L^2} \le 0$.

Proof of Lemma 5.1. We first compute that

$\nabla m(h) = g + Hh + \frac{L}{2}\|h\| h$ and $\nabla^2 m(h) = H + \frac{L}{2}\|h\| I + \frac{L}{2}\|h\| \cdot \frac{h}{\|h\|} \left(\frac{h}{\|h\|}\right)^\top$.

For the forward direction, suppose $h^*$ is a minimizer of $m(h)$. Let $\lambda^* = \frac{L}{2}\|h^*\|$. Then, the necessary conditions $\nabla m(h^*) = 0$ and $\nabla^2 m(h^*) \succeq 0$ can be written as

$g + (H + \lambda^* I) h^* = 0$ and $w^\top \left(H + \lambda^* I + \lambda^* \frac{h^*}{\|h^*\|} \left(\frac{h^*}{\|h^*\|}\right)^\top\right) w \ge 0$ for all $w \in \mathbb{R}^d$.   (C.2)

From this we see $(H + \lambda^* I) h^* = -g$ and $\|h^*\| = \frac{2\lambda^*}{L}$, and the only thing left to verify is $H + \lambda^* I \succeq 0$. Note that if $h^* = 0$, then the second inequality in (C.2) directly implies $H + \lambda^* I \succeq 0$. Thus, we only need to focus on $h^* \ne 0$.

We want to show that $w^\top (H + \lambda^* I) w \ge 0$ for every $w \in \mathbb{R}^d$. Now, if $w^\top h^* = 0$, then this trivially follows from (C.2), so it suffices to focus on those $w$ that satisfy $w^\top h^* \ne 0$. Since $w$ and $h^*$ are not orthogonal, there exists $\gamma \in \mathbb{R} \setminus \{0\}$ such that $\|h^* + \gamma w\| = \|h^*\|$. (This can be seen by squaring both sides and solving for $\gamma$.) Squaring both sides, we have

$(\gamma w)^\top h^* + \frac{\gamma^2 \|w\|^2}{2} = 0$.   (C.3)

Now we bound the difference (note that the cubic terms cancel because $\|h^* + \gamma w\| = \|h^*\|$):

$m(h^* + \gamma w) - m(h^*) = g^\top ((h^* + \gamma w) - h^*) + \frac{(h^* + \gamma w)^\top H (h^* + \gamma w)}{2} - \frac{(h^*)^\top H h^*}{2}$
$\stackrel{①}{=} (h^* - (h^* + \gamma w))^\top (H + \lambda^* I) h^* + \frac{(h^* + \gamma w)^\top H (h^* + \gamma w)}{2} - \frac{(h^*)^\top H h^*}{2}$
$\stackrel{②}{=} \frac{\lambda^* \gamma^2}{2} \|w\|^2 + (h^* - (h^* + \gamma w))^\top H h^* + \frac{(h^* + \gamma w)^\top H (h^* + \gamma w)}{2} - \frac{(h^*)^\top H h^*}{2}$
$= \frac{\lambda^* \gamma^2}{2} \|w\|^2 + \frac{(h^*)^\top H h^*}{2} - (h^* + \gamma w)^\top H h^* + \frac{(h^* + \gamma w)^\top H (h^* + \gamma w)}{2}$
$= \frac{\lambda^* \gamma^2}{2} \|w\|^2 + \frac{\gamma^2}{2} w^\top H w = \frac{\gamma^2}{2} w^\top (H + \lambda^* I) w$,   (C.4)

where ① and ② follow from (C.2) and (C.3), respectively. Since $h^*$ is a minimizer of $m(h)$, we immediately have $m(h^* + \gamma w) - m(h^*) = \frac{\gamma^2}{2} w^\top (H + \lambda^* I) w \ge 0$, and we conclude that $H + \lambda^* I \succeq 0$.
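As an illustrative aside (this sketch is ours, not the paper's implementation), the characterization in Lemma 5.1 can be checked numerically: by Corollary 5.2 below, the function $p(\lambda) = \frac{2\lambda}{L} - \|(H + \lambda I)^{-1} g\|$ is strictly increasing and vanishes exactly at $\lambda^*$, so a simple bisection on $\lambda$ (in the spirit of the BinarySearch subroutine of Algorithm 3) recovers the global minimizer of $m(h)$:

```python
import numpy as np

# Minimal sketch: minimize m(h) = g.h + h'Hh/2 + (L/6)||h||^3 via the
# characterization (H + lam I) h = -g, ||h|| = 2 lam / L, H + lam I >= 0.
rng = np.random.default_rng(1)
d, L = 4, 2.0
A = rng.standard_normal((d, d))
H = (A + A.T) / 2                       # symmetric, typically indefinite
g = rng.standard_normal(d)

def m(h):
    return g @ h + h @ H @ h / 2 + (L / 6) * np.linalg.norm(h) ** 3

lam_min = np.linalg.eigvalsh(H)[0]

def p(lam):
    # Strictly increasing on (-lam_min, infinity); its root is lam*.
    h = np.linalg.solve(H + lam * np.eye(d), -g)
    return 2 * lam / L - np.linalg.norm(h)

# Bisection for the root of p, mirroring the BinarySearch subroutine.
lo = max(0.0, -lam_min) + 1e-9
hi = lo + 1.0
while p(hi) < 0:                        # grow the bracket until p changes sign
    hi *= 2
for _ in range(200):
    mid = (lo + hi) / 2
    if p(mid) < 0:
        lo = mid
    else:
        hi = mid
lam_star = (lo + hi) / 2
h_star = np.linalg.solve(H + lam_star * np.eye(d), -g)

# Check the optimality conditions and global minimality on random trial points.
assert abs(np.linalg.norm(h_star) - 2 * lam_star / L) < 1e-6
trials = rng.standard_normal((1000, d))
assert all(m(h_star) <= m(h) + 1e-8 for h in trials)
```

The final assertions confirm both $\|h^*\| = 2\lambda^*/L$ and that no random trial point beats $h^*$, consistent with $m(h^*) \le 0$ in Lemma 5.1.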
We prove the backward direction by showing that the conditions in Equation (C.2) determine the minimizer up to its norm. To this end we will use Lemma C.1 and Lemma C.2. First, we note that the function $m(h)$ is continuous, bounded from below, and tends to $+\infty$ as $\|h\| \to \infty$, so there exists at least one minimizer $h^*$. Suppose now there exist a $\lambda^*$ and a corresponding $h^*$ such that $(\lambda^*, h^*)$ is a solution to the system (C.1). The backward direction requires us to prove that $h^*$ must be a minimizer of $m(h)$. By Lemma C.1 we get the following two cases.
• If $\lambda^* > -\lambda_{\min}(H)$, then $(\lambda^*, h^*)$ is the only solution to the system (C.1). By the proof of the forward direction, any minimizer of $m(h)$ must satisfy the system (C.1), and therefore $h^*$ must be the minimizer.
• Otherwise, $\lambda^* = -\lambda_{\min}(H)$. Let $h_0$ be any minimizer of $m(h)$. Lemma C.1 and the proof of the forward direction ensure that $(\lambda^*, h_0)$ also satisfies the system (C.1). By Lemma C.2 we get $m(h^*) = m(h_0)$, and therefore $h^*$ is a minimizer too.

Corollary 5.2. This value $\lambda^*$ is unique, and for every $\lambda$ satisfying $H + \lambda I \succ 0$, we have

$\|(H + \lambda I)^{-1} g\| > \frac{2\lambda}{L} \iff \lambda^* > \lambda$ and $\|(H + \lambda I)^{-1} g\| < \frac{2\lambda}{L} \iff \lambda^* < \lambda$.

Proof of Corollary 5.2. The uniqueness of $\lambda^*$ follows from Lemma C.1. To prove the second part, we first make some observations about the function

$p(y) \triangleq \frac{2y}{L} - \|(H + y I)^{-1} g\|$

defined on the domain $y \in (-\lambda_{\min}(H), \infty)$.
Note that $p(y)$ is continuous and strictly increasing over this domain, and $p(y) \to \infty$ as $y \to \infty$. The corollary requires us to show that $p(\lambda) < 0 \iff \lambda^* > \lambda$ and $p(\lambda) > 0 \iff \lambda^* < \lambda$.

We begin by showing the first equivalence. To see the backward direction, note that if $\lambda^* > \lambda > -\lambda_{\min}(H)$, then by the characterization of $\lambda^*$ in Lemma C.1 we have $\|(H + \lambda^* I)^{-1} g\| = \frac{2\lambda^*}{L}$, i.e., $p(\lambda^*) = 0$, which implies $p(\lambda) < 0$ as $p(y)$ is strictly increasing. For the forward direction, note that since $p(y)$ is continuous and strictly increasing, the range of the function contains $[p(\lambda), \infty)$. Since $p(\lambda) < 0$, there must exist $\lambda^* > \lambda$ such that $p(\lambda^*) = 0$, which by the characterization in Lemma C.1 finishes the proof.

Now we prove that $p(\lambda) > 0 \iff \lambda^* < \lambda$. To see the forward direction, note that if $\lambda^* \ge \lambda$, then $p(\lambda^*) = 0$ and $p(\lambda) > 0$, which contradicts the fact that $p(y)$ is strictly increasing. For the backward direction, we consider two cases. First, if $\lambda^* > -\lambda_{\min}(H)$, the conclusion follows by the monotonicity of $p(y)$. If $\lambda^* = -\lambda_{\min}(H)$, then by Lemma C.1 we have that $g$ has no component in the lowest eigenspace of $H$, and therefore if we extend $p(y)$ to $-\lambda_{\min}(H)$ by defining

$p(-\lambda_{\min}(H)) \triangleq \frac{-2\lambda_{\min}(H)}{L} - \|(H - \lambda_{\min}(H) I)^{+} g\|$,

we get that $p(y)$ is increasing on the domain $y \in [-\lambda_{\min}(H), \infty)$. Now, from the characterization of the solution in Lemma C.1, we can see that $p(-\lambda_{\min}(H)) \ge 0$, and therefore by monotonicity $p(\lambda) > 0$. This finishes the proof.

D Proof of Main Lemma 1

D.1 Proof of Claim 6.1

Claim 6.1. If $\lambda$ and $v$ satisfy Case 1 and $\tilde{\varepsilon}$ satisfies (6.1), then $m(v) \le m(h^*) + \frac{1}{250\kappa^3 L^2}$.

Proof of Claim 6.1.
Note that by the conditions of the theorem we have $(H + (\lambda - L\tilde{\varepsilon}) I)^{-1} \succ 0$ and

$L\|(H + (\lambda - L\tilde{\varepsilon}) I)^{-1} g\| \ge 2\lambda - 2L\tilde{\varepsilon}$ and $L\|(H + (\lambda + L\tilde{\varepsilon}) I)^{-1} g\| \le 2\lambda + 2L\tilde{\varepsilon}$,

so according to Corollary 5.2 we must have

$\lambda^* \in [\lambda - L\tilde{\varepsilon},\, \lambda + L\tilde{\varepsilon}]$.   (D.1)

This also implies (using our assumption on $\tilde{\varepsilon}$) that $L\|v\| \in [2\lambda^* - 5L\tilde{\varepsilon},\, 2\lambda^* + 5L\tilde{\varepsilon}]$. Next, consider the value $m(v)$:

$m(v) = g^\top v + \frac{v^\top H v}{2} + \frac{L}{6}\|v\|^3 = \left(g^\top v + \frac{v^\top (H + \lambda I) v}{2}\right) - \|v\|^2 \left(\frac{\lambda}{2} - \frac{L\|v\|}{6}\right)$.   (D.2)

We bound the two parts on the right-hand side of (D.2) separately. For the first part,

$g^\top v + \frac{v^\top (H + \lambda I) v}{2} \le -\frac{g^\top (H + \lambda I)^{-1} g}{2} + \|g\|\tilde{\varepsilon} + \|(H + \lambda I)^{-1} g\|\tilde{\varepsilon} + \frac{\|(H + \lambda I)^{-1}\|\tilde{\varepsilon}^2}{2}$
$\stackrel{①}{\le} -\frac{g^\top (H + \lambda I)^{-1} g}{2} + \frac{1}{1000\kappa^3 L^2}$   (D.3)
$\stackrel{②}{\le} -\frac{g^\top (H + \lambda^* I)^{-1} g}{2} + L\|g\|^2 \|(H + \lambda I)^{-1}\| \|(H + (\lambda + 2L\tilde{\varepsilon}) I)^{-1}\| \tilde{\varepsilon} + \frac{1}{1000\kappa^3 L^2}$
$\stackrel{③}{\le} -\frac{g^\top (H + \lambda^* I)^{-1} g}{2} + \frac{1}{500\kappa^3 L^2}$.

Above, inequalities ① and ③ use the assumption on $\tilde{\varepsilon}$ in (6.1), and inequality ② uses

$-(H + \lambda I)^{-1} \preceq -(H + (\lambda^* + L\tilde{\varepsilon}) I)^{-1} = -(H + \lambda^* I)^{-1} + L\tilde{\varepsilon}\, (H + \lambda^* I)^{-1} (H + (\lambda^* + L\tilde{\varepsilon}) I)^{-1}$.

Note that $(H + \lambda^* I)^{-1} \succ 0$ by Equations (D.1) and (6.2). The second part of (D.2) can be bounded as follows:

$\|v\|^2 \left(\frac{\lambda}{2} - \frac{L\|v\|}{6}\right) \ge \left(\frac{2\lambda^* - 5L\tilde{\varepsilon}}{L}\right)^2 \left(\frac{\lambda^* - L\tilde{\varepsilon}}{2} - \frac{2\lambda^* + 5L\tilde{\varepsilon}}{6}\right) \ge \frac{2(\lambda^*)^3}{3L^2} - \frac{1000\tilde{\varepsilon}(\lambda^*)^2}{L} \stackrel{①}{\ge} \frac{2(\lambda^*)^3}{3L^2} - \frac{1}{500\kappa^3 L^2}$.

Above, inequality ① uses $\lambda^* \le B$ (owing to Proposition 5.3) and our assumption on $\tilde{\varepsilon}$ from (6.1). Putting these together, we get $m(v) \le m(h^*) + \frac{1}{250\kappa^3 L^2}$.

D.2 Proofs of Claims 6.2 and 6.3

For notational simplicity, let us rotate the space into the eigenbasis of $H$; let the $i$-th dimension correspond to the $i$-th largest eigenvalue $\lambda_i$ of $H$, so that $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d = \lambda_{\min}$. Let $g_i$ denote the $i$-th coordinate of $g$ in this basis.
Lemma 5.1 implies

$m(h^*) = -\frac{1}{2} \sum_i \frac{g_i^2}{\lambda_i + \lambda^*} - \frac{2(\lambda^*)^3}{3L^2} =: \frac{1}{2}(S_1 + S_2) - \frac{2(\lambda^*)^3}{3L^2}$,   (D.4)

where we denote

$S_1 = -\sum_{i : \lambda_i + \lambda^* \ge \frac{1}{\kappa}} \frac{g_i^2}{\lambda_i + \lambda^*}$ and $S_2 = -\sum_{i : 0 < \lambda_i + \lambda^* < \frac{1}{\kappa}} \frac{g_i^2}{\lambda_i + \lambda^*}$.

From Corollary 5.2 we can also obtain

$\sum_{i : \lambda_i + \lambda^* > 0} \frac{g_i^2}{(\lambda_i + \lambda^*)^2} \le \frac{4(\lambda^*)^2}{L^2}$.   (D.5)

Moreover, the assumption $\|(H + \lambda I)^{-1} g\| \le \frac{2\lambda}{L}$ is equivalent to

$\sum_i \frac{g_i^2}{(\lambda_i + \lambda)^2} \le \frac{4\lambda^2}{L^2}$.   (D.6)

We begin with a few auxiliary claims.

Claim D.1. If $\lambda_{\min}(H) \le -\frac{1}{\kappa}$, then $S_2 \ge 1000 \cdot m\!\left(\frac{\lambda v_{\min}}{2L}\right)$.

Proof of Claim D.1. We compute that

$S_2 = -\sum_{i : 0 < \lambda_i + \lambda^* < \frac{1}{\kappa}} \frac{g_i^2}{\lambda_i + \lambda^*} = -\sum_{i : 0 < \lambda_i + \lambda^* < \frac{1}{\kappa}} \frac{g_i^2 (\lambda_i + \lambda^*)}{(\lambda_i + \lambda^*)^2} \ge -\frac{1}{\kappa} \sum_{i : 0 < \lambda_i + \lambda^* < \frac{1}{\kappa}} \frac{g_i^2}{(\lambda_i + \lambda^*)^2} \stackrel{①}{\ge} -\frac{4(\lambda^*)^2}{\kappa L^2} \stackrel{②}{\ge} -\frac{16 |\lambda_{\min}|^3}{L^2}$.   (D.7)

Above, ① uses (D.5), and ② follows because we have $\lambda_{\min}(H) \le -\frac{1}{\kappa}$ in the assumption and $\lambda^* \le -\lambda_{\min}(H) + \frac{1}{\kappa}$ in the assumption of Case 2 of Main Lemma 1.

Let us now consider the value of $m$ at the vector $\frac{\lambda v_{\min}}{2L}$. We have that

$m\!\left(\frac{\lambda v_{\min}}{2L}\right) = \frac{\lambda g^\top v_{\min}}{2L} + \frac{\lambda^2 v_{\min}^\top H v_{\min}}{8L^2} + \frac{\lambda^3}{48L^2} \stackrel{①}{\le} \frac{\lambda g^\top v_{\min}}{2L} + \frac{\lambda^2 \lambda_{\min}}{16L^2} + \frac{\lambda^3}{48L^2} \stackrel{②}{\le} \frac{\lambda g^\top v_{\min}}{2L} + \frac{\lambda^2 \lambda_{\min}}{16L^2} - \frac{\lambda^2 \lambda_{\min}}{24L^2} \le \frac{\lambda g^\top v_{\min}}{2L} + \frac{\lambda^2 \lambda_{\min}}{48L^2}$.

Above, ① holds because our assumption $\lambda_{\min}(H) \le -\frac{1}{\kappa}$ and the assumption $v_{\min}^\top H v_{\min} \le \lambda_{\min}(H) + \frac{1}{10\kappa}$ together imply $v_{\min}^\top H v_{\min} \le \frac{\lambda_{\min}}{2}$; ② follows from $\lambda_{\min}(H) \le -\frac{1}{\kappa}$ and $\lambda \le -\lambda_{\min}(H) + \frac{1}{\kappa}$. Now, recall that the sign of $v_{\min}$ is chosen so that $g^\top v_{\min}$ is non-positive; therefore, by our assumptions $\lambda_{\min}(H) \le -\frac{1}{\kappa}$ and $\lambda \le -\lambda_{\min}(H) + \frac{1}{\kappa}$, we get the following inequality:

$m\!\left(\frac{\lambda v_{\min}}{2L}\right) \le -\frac{|\lambda_{\min}|^3}{48L^2}$.   (D.8)

Putting inequalities (D.8) and (D.7) together finishes the proof of Claim D.1.

We also record the following lemma, whose proof can be seen from inequality (D.3) in the proof of Claim 6.1 above.

Lemma D.2.
If we have $\lambda, v$ such that $L\|(H + \lambda I)^{-1} g\| \le 2\lambda$ and $\|v + (H + \lambda I)^{-1} g\| \le \tilde{\varepsilon}$ with $\tilde{\varepsilon}$ satisfying condition (6.1), then we have

$g^\top v + \frac{v^\top (H + \lambda I) v}{2} \le -\frac{g^\top (H + \lambda I)^{-1} g}{2} + \frac{1}{1000\kappa^3 L^2}$.

Claim D.3. $S_1 \ge 4 m(v) - \frac{1}{250\kappa^3 L^2}$.

Proof of Claim D.3. We have that

$m(v) = g^\top v + \frac{v^\top (H + \lambda I) v}{2} - \frac{\lambda}{2}\|v\|^2 + \frac{L}{6}\|v\|^3$
$\stackrel{①}{\le} -\frac{g^\top (H + \lambda I)^{-1} g}{2} - \|v\|^2 \left(\frac{\lambda}{2} - \frac{L}{6}\|v\|\right) + \frac{1}{1000\kappa^3 L^2}$
$\stackrel{②}{\le} -\frac{g^\top (H + \lambda I)^{-1} g}{2} - \left(\frac{2\lambda - 3L\tilde{\varepsilon}}{L}\right)^2 \left(\frac{\lambda}{6} - \frac{L\tilde{\varepsilon}}{2}\right) + \frac{1}{1000\kappa^3 L^2}$
$\stackrel{③}{\le} -\frac{g^\top (H + \lambda I)^{-1} g}{2} - \frac{2\lambda^3}{3L^2} + \frac{1}{500\kappa^3 L^2} \le -\frac{g^\top (H + \lambda I)^{-1} g}{2} + \frac{1}{500\kappa^3 L^2}$.   (D.9)

Above, ① is due to Lemma D.2; ② uses our condition on $v$, which gives $L\|v\| \in [2\lambda - 3L\tilde{\varepsilon},\, 2\lambda + 3L\tilde{\varepsilon}]$; ③ uses our condition (6.1) on $\tilde{\varepsilon}$.

We now bound $S_1$. For this purpose, first note that if $\lambda_i + \lambda^* \ge \frac{1}{\kappa}$ and $\lambda - \lambda^* \le \frac{1}{\kappa}$, then $2(\lambda_i + \lambda^*) \ge \frac{1}{\kappa} + \lambda_i + \lambda^* \ge \lambda_i + \lambda$. Therefore, the sum $S_1$ satisfies

$S_1 = -\sum_{i : \lambda_i + \lambda^* \ge \frac{1}{\kappa}} \frac{g_i^2}{\lambda_i + \lambda^*} \ge -2 \sum_{i : \lambda_i + \lambda^* \ge \frac{1}{\kappa}} \frac{g_i^2}{\lambda_i + \lambda} \ge -2\, g^\top (H + \lambda I)^{-1} g \ge 4 m(v) - \frac{1}{250\kappa^3 L^2}$.

(Note that we have $H + \lambda I \succ 0$.) This finishes the proof of Claim D.3.

Claim 6.2. If $\lambda_{\min}(H) \le -\frac{1}{\kappa}$, then $m(h^*) \ge 1500 \min\left\{m(v),\, m\!\left(\frac{\lambda v_{\min}}{2L}\right)\right\} - \frac{1}{500\kappa^3 L^2}$.

Proof of Claim 6.2. We derive that

$m(h^*) \stackrel{①}{=} \frac{1}{2}(S_1 + S_2) - \frac{2(\lambda^*)^3}{3L^2} \stackrel{②}{\ge} \frac{1}{2}(S_1 + S_2) - \frac{16|\lambda_{\min}|^3}{3L^2} \stackrel{③}{\ge} 2 m(v) - \frac{1}{500\kappa^3 L^2} + 500 \cdot m\!\left(\frac{\lambda v_{\min}}{2L}\right) - \frac{16|\lambda_{\min}|^3}{3L^2} \stackrel{④}{\ge} 2 m(v) - \frac{1}{500\kappa^3 L^2} + 1500 \cdot m\!\left(\frac{\lambda v_{\min}}{2L}\right) \ge 1500 \min\left\{m(v),\, m\!\left(\frac{\lambda v_{\min}}{2L}\right)\right\} - \frac{1}{500\kappa^3 L^2}$.

Above, ① uses equation (D.4); inequality ② follows because we have $\lambda_{\min}(H) \le -\frac{1}{\kappa}$ in the assumption and $\lambda^* \le -\lambda_{\min}(H) + \frac{1}{\kappa}$ in the assumption of Case 2 of Main Lemma 1; inequality ③ uses Claim D.3 and Claim D.1; and inequality ④ uses (D.8). This finishes the proof of Claim 6.2.

Claim 6.3. If $\lambda_{\min}(H) \ge -\frac{1}{\kappa}$, then $m(h^*) \ge 2 m(v) - \frac{16}{\kappa^3 L^2}$.

Proof of Claim 6.3.
This time we lower bound $S_2$ slightly differently:

$S_2 \stackrel{①}{\ge} -\frac{4(\lambda^*)^2}{\kappa L^2} \stackrel{②}{\ge} -\frac{16}{\kappa^3 L^2}$,   (D.10)

where ① comes from the second-to-last inequality in (D.7), and ② comes from $\lambda^* \le \lambda \le -\lambda_{\min}(H) + \frac{1}{\kappa} \le \frac{2}{\kappa}$ using our assumption in Case 2 of Main Lemma 1. Putting these together, we get

$m(h^*) \stackrel{①}{=} \frac{1}{2}(S_1 + S_2) - \frac{2(\lambda^*)^3}{3L^2} \stackrel{②}{\ge} 2 m(v) - \frac{1}{500\kappa^3 L^2} - \frac{15}{\kappa^3 L^2} \ge 2 m(v) - \frac{16}{\kappa^3 L^2}$.

Above, ① comes from (D.4), and ② uses Claim D.3, the lower bound (D.10), and $\frac{2(\lambda^*)^3}{3L^2} \le \frac{16}{3\kappa^3 L^2}$.

E Proof of Main Lemma 2

Main Lemma 2. In the same setting as Main Lemma 1, suppose $m(h^*) \ge -\frac{\varepsilon^{3/2}}{300\sqrt{L}}$. Then the output vector $v$ satisfies the following conditions:

$\|v\| \le \|h^*\| + \frac{3}{\kappa L}$ and $\|\nabla m(v)\| \le \frac{\varepsilon}{4} + \frac{15}{\kappa^2 L}$.

Proof of Main Lemma 2. Let us first note that, from the value given in Lemma 5.1,

$(\lambda^*)^3 \le \frac{3L^2 |m(h^*)|}{2} \le \frac{L^{3/2} \varepsilon^{3/2}}{200}$.   (E.1)

If Case 1 occurs, we have

$\|v\| \stackrel{①}{\le} \|(H + \lambda I)^{-1} g\| + \tilde{\varepsilon} \stackrel{②}{\le} \frac{2\lambda + 2L\tilde{\varepsilon}}{L} + \tilde{\varepsilon} \stackrel{③}{\le} \frac{2\lambda^*}{L} + 5\tilde{\varepsilon} \stackrel{④}{\le} \|h^*\| + \frac{1}{20\kappa L}$.

Above, inequalities ① and ② both use the assumptions of Case 1; inequality ③ uses the fact that $\lambda^* \in [\lambda - L\tilde{\varepsilon},\, \lambda + L\tilde{\varepsilon}]$, which again follows from the assumptions of Case 1 (see (D.1)); inequality ④ uses $\|h^*\| = \frac{2\lambda^*}{L}$ from Lemma 5.1 as well as our assumption (6.1) on $\tilde{\varepsilon}$. As for the quantity $\|\nabla m(v)\|$, we bound it as follows:

$\|\nabla m(v)\| = \left\|g + Hv + \frac{L\|v\|}{2} v\right\| \stackrel{①}{\le} \|g + (H + \lambda I) v\| + \lambda \|v\| + L\|v\|^2 \stackrel{②}{\le} \|H + \lambda I\| \tilde{\varepsilon} + \lambda \|v\| + L\|v\|^2$
$\stackrel{③}{\le} (L_2 + 2B)\tilde{\varepsilon} + \frac{\lambda(2\lambda + 3L\tilde{\varepsilon})}{L} + \frac{(2\lambda + 3L\tilde{\varepsilon})^2}{L} = (L_2 + 2B)\tilde{\varepsilon} + \frac{6\lambda^2}{L} + 15\tilde{\varepsilon}\lambda + 9L\tilde{\varepsilon}^2$
$\stackrel{④}{\le} \frac{6(\lambda^* + L\tilde{\varepsilon})^2}{L} + (L_2 + 32B)\tilde{\varepsilon} + 9L\tilde{\varepsilon}^2 \stackrel{⑤}{\le} \frac{6(\lambda^*)^2}{L} + (L_2 + 56B)\tilde{\varepsilon} + 15L\tilde{\varepsilon}^2 \stackrel{⑥}{\le} \frac{\varepsilon}{4} + \frac{15}{\kappa^2 L}$.
Above, inequality ① uses the triangle inequality; inequality ② uses $\|v + (H + \lambda I)^{-1} g\| \le \tilde{\varepsilon}$; inequality ③ uses $\|H + \lambda I\| \le L_2 + 2B$ and $L\|v\| \le 2\lambda + 3L\tilde{\varepsilon}$, which comes from our upper bound on $\|v\|$ above; ④ uses the fact that $\lambda^* \in [\lambda - L\tilde{\varepsilon},\, \lambda + L\tilde{\varepsilon}]$, which again follows from the assumptions of Case 1 (see (D.1)); inequality ⑤ uses $\lambda^* \le 2B$; and inequality ⑥ uses (E.1) together with our assumption (6.1) on $\tilde{\varepsilon}$.

If Case 2 occurs, we have

$\|v\| \stackrel{①}{\le} \|(H + \lambda I)^{-1} g\| + \tilde{\varepsilon} \stackrel{②}{\le} \frac{2\lambda}{L} + \tilde{\varepsilon} \stackrel{③}{\le} \frac{2(\lambda^* + 1/\kappa)}{L} + \tilde{\varepsilon} \stackrel{④}{\le} \|h^*\| + \frac{3}{\kappa L}$.   (E.2)

Above, inequalities ① and ② both use the assumptions of Case 2; inequality ③ uses $\lambda \le -\lambda_{\min}(H) + \frac{1}{\kappa}$ from our assumption of Case 2, as well as $-\lambda_{\min}(H) \le \lambda^*$, which comes from Lemma 5.1; inequality ④ uses $\|h^*\| = \frac{2\lambda^*}{L}$ from Lemma 5.1 as well as our assumption (6.1) on $\tilde{\varepsilon}$. The quantity $\|\nabla m(v)\|$ can be bounded in a manner analogous to Case 1:

$\|\nabla m(v)\| \le \|H + \lambda I\| \tilde{\varepsilon} + \lambda \|v\| + L\|v\|^2 \le (L_2 + 2B)\tilde{\varepsilon} + \frac{\lambda(2\lambda + L\tilde{\varepsilon})}{L} + \frac{(2\lambda + L\tilde{\varepsilon})^2}{L} \stackrel{①}{\le} \frac{6\lambda^2}{L} + \frac{1}{10\kappa^2 L} \stackrel{②}{\le} \frac{6(\lambda^* + \frac{1}{\kappa})^2}{L} + \frac{1}{10\kappa^2 L} \le \frac{12(\lambda^*)^2}{L} + \frac{15}{\kappa^2 L} \stackrel{③}{\le} \frac{\varepsilon}{4} + \frac{15}{\kappa^2 L}$.

Above, inequality ① uses our assumption (6.1) on $\tilde{\varepsilon}$; inequality ② uses $\lambda \le \lambda^* + \frac{1}{\kappa}$, which appeared in (E.2); inequality ③ uses (E.1).

F Proof of Lemma 7.2

Lemma 7.2. The following statements hold for all $i$ until FastCubicMin terminates:
(a) $\lambda_i \in [0, 2B]$ and $\lambda_i + \lambda_{\max}(H) \le 3B$;
(b) $\lambda_i + \lambda_{\min}(H) \ge \frac{3}{10\kappa}$;
(c) $\lambda_{i+1} + \lambda_{\min}(H) \le \frac{3}{4}(\lambda_i + \lambda_{\min}(H))$ unless $\lambda_{i+1} = 0$.
Moreover, when FastCubicMin terminates at Line 20, we have $\lambda_i + \lambda_{\min}(H) \le \frac{1}{\kappa}$.

Proof of Lemma 7.2. The lemma follows via induction. To see (a) and (b) in the base case $i = 0$, recall that the definitions of $B$ and $L_2$ together ensure $\lambda_0 + \lambda_{\max}(H) \le 3B$ and $\lambda_0 + \lambda_{\min}(H) \ge \frac{3}{10\kappa}$; also $\lambda_0 \in [0, 2B]$. Suppose now that for some $i \ge 0$, properties (a) and (b) hold.
It is easy to check that $\lambda_{i+1} \le \lambda_i$, and thus we have $\lambda_{i+1} + \lambda_{\max}(H) \le 3B$ and $\lambda_{i+1} \in [0, 2B]$. This implies that property (a) also holds at iteration $i+1$. We now proceed to show property (c) at iteration $i$ and property (b) at iteration $i+1$. Recall that the algorithm ensures

$\frac{9}{10} \lambda_{\max}((H + \lambda_i I)^{-1}) \le w^\top (H + \lambda_i I)^{-1} w \le \lambda_{\max}((H + \lambda_i I)^{-1})$,

and by the definition of $\tilde{w}$ we have

$\frac{9}{10} \lambda_{\max}((H + \lambda_i I)^{-1}) - 2\hat{\varepsilon} \le \tilde{w}^\top w - \hat{\varepsilon} \le \lambda_{\max}((H + \lambda_i I)^{-1})$.   (F.1)

Now, since $\frac{3}{10\kappa} \le \lambda_i + \lambda_{\min}(H) \le 3B$ by the inductive assumption, it follows from the choice of $\hat{\varepsilon}$ that

$2\hat{\varepsilon} \le \frac{1}{30B} \le \frac{1}{10(\lambda_i + \lambda_{\min}(H))} = \frac{\lambda_{\max}((H + \lambda_i I)^{-1})}{10}$.   (F.2)

Plugging Equation (F.2) into Equation (F.1), we get

$\frac{8}{10} \cdot \frac{1}{\lambda_i + \lambda_{\min}(H)} = \frac{8}{10} \lambda_{\max}((H + \lambda_i I)^{-1}) \le \tilde{w}^\top w - \hat{\varepsilon} \le \lambda_{\max}((H + \lambda_i I)^{-1}) = \frac{1}{\lambda_i + \lambda_{\min}(H)}$.

Inverting this chain of inequalities, we have

$\frac{\lambda_i + \lambda_{\min}(H)}{2} \le \Delta \le \frac{5(\lambda_i + \lambda_{\min}(H))}{8}$.   (F.3)

From this we derive the following implications:

$\Delta \le \frac{1}{2\kappa} \implies \lambda_i + \lambda_{\min}(H) \le \frac{1}{\kappa}$   (F.4)
$\Delta > \frac{1}{2\kappa} \implies \lambda_i + \lambda_{\min}(H) > \frac{4}{5\kappa}$   (F.5)

If Condition (F.4) happens, our algorithm FastCubicMin outputs on Line 20; in such a case, (F.4) implies our desired inequality $\lambda_i + \lambda_{\min}(H) \le \frac{1}{\kappa}$. If Condition (F.5) happens, our choice $\tilde{\lambda}_{i+1} \leftarrow \lambda_i - \frac{\Delta}{2}$ and Equation (F.3) together imply that

$\frac{3}{4}(\lambda_i + \lambda_{\min}(H)) \ge \tilde{\lambda}_{i+1} + \lambda_{\min}(H) \ge \frac{11}{16}(\lambda_i + \lambda_{\min}(H))$.

Combining this with (F.5), we get that

$\frac{3}{4}(\lambda_i + \lambda_{\min}(H)) \ge \tilde{\lambda}_{i+1} + \lambda_{\min}(H) \ge \frac{11}{16} \cdot \frac{4}{5\kappa} \ge \frac{3}{10\kappa}$.

Therefore, we conclude that property (c) at iteration $i$ holds and property (b) at iteration $i+1$ holds because $\lambda_{i+1} \ge \tilde{\lambda}_{i+1}$. This finishes the proof of Lemma 7.2.

G Proof of Main Lemma 3: Running-Time Half

Having proven the correctness of the algorithm, we now aim to bound the overall running time of FastCubicMin, completing the proof of Main Lemma 3.
We prove in Appendix H the following lemma:

Lemma G.1. If $\lambda_2 + \lambda_{\min}(H) \ge c_1$ for some $c_1 \in (0, 1)$, then BinarySearch ends in $O\!\left(\log \frac{(\lambda_1 - \lambda_2) B}{c_1 \cdot L \cdot \tilde{\varepsilon}}\right)$ iterations.

Since in our FastCubicMin algorithm we have $\lambda_i \le 2B$ and $\lambda_i + \lambda_{\min}(H) \ge \frac{3}{10\kappa}$ (see Lemma 7.2), taken together with our choice of $\tilde{\varepsilon}$ we have:

Claim G.2. Each invocation of BinarySearch ends in $O(\log(1/\tilde{\varepsilon}))$ iterations.

Claim G.3. FastCubicMin ends in at most $O(\log(B\kappa))$ outer loops.

Proof. According to Lemma 7.2, we have $\frac{3}{4}(\lambda_{i-1} + \lambda_{\min}(H)) \ge \lambda_i + \lambda_{\min}(H)$, so the quantity $\lambda_i + \lambda_{\min}(H)$ decreases by a constant factor per iteration (except possibly $\lambda_i = 0$ in the last outer loop, in which case we terminate within one more iteration). On one hand, we began with $\lambda_0 + \lambda_{\min}(H) \le 3B$. On the other hand, we always have $\lambda_i + \lambda_{\min}(H) \ge \frac{3}{10\kappa}$ according to Lemma 7.2. Therefore, the total number of outer loops is at most $O(\log(B\kappa))$.

G.1 Matrix Inverse

Since the key component of the running time is the computation of $(H + \lambda_i I)^{-1} b$ for different vectors $b$, we first bound the condition number of the matrix $H + \lambda_i I$ via the following claim.

Claim G.4. Throughout the execution of FastCubicMin and BinarySearch, whenever we compute $(H + \lambda_i I)^{-1} b$ for some vector $b$, it satisfies

$\frac{\lambda_i + L_2}{\lambda_i + \lambda_{\min}(H)} \le 10\kappa L_2$.

Proof of Claim G.4. We first focus on Line 5 and Line 11 of FastCubicMin. There are two cases. If $\lambda_i \ge 2L_2$, then according to $-L_2 I \preceq H \preceq L_2 I$ we can bound $\frac{\lambda_i + L_2}{\lambda_i + \lambda_{\min}(H)} \le 3$, because the left-hand side is largest when $\lambda_i = 2L_2$. If $\lambda_i < 2L_2$, then by Lemma 7.2 we know $\lambda_i + \lambda_{\min}(H) \ge \frac{3}{10\kappa}$. This implies $\frac{\lambda_i + L_2}{\lambda_i + \lambda_{\min}(H)} \le 10\kappa L_2$. We now focus on Line 3 of BinarySearch.
We claim that all values $\lambda_{\mathrm{mid}}$ iterated over by BinarySearch also satisfy $\lambda_{\mathrm{mid}} + \lambda_{\min}(H) \ge \frac{3}{10\kappa}$ (because each $\lambda_{\mathrm{mid}} \ge \lambda_i$, and $\lambda_i$ satisfies $\lambda_i + \lambda_{\min}(H) \ge \frac{3}{10\kappa}$ according to Lemma 7.2). Therefore, the same case analysis (with respect to $\lambda_{\mathrm{mid}} \ge 2L_2$ and $\lambda_{\mathrm{mid}} < 2L_2$) also gives $\frac{\lambda_{\mathrm{mid}} + L_2}{\lambda_{\mathrm{mid}} + \lambda_{\min}(H)} \le 10\kappa L_2$.

Claim G.5. Line 5 of FastCubicMin and Line 3 of BinarySearch run in time $\tilde{O}(\mathcal{T}_{\mathrm{inverse}}(\kappa L_2, \tilde{\varepsilon}))$.

Proof. Whenever we compute $(H + \lambda_i I)^{-1} b$ for some vector $b$, it satisfies $\|b\| \le 1/\tilde{\varepsilon}$; therefore, to find $v$ satisfying $\|v + (H + \lambda_i I)^{-1} b\| \le \tilde{\varepsilon}$, it suffices to find $v$ with $\|v + (H + \lambda_i I)^{-1} b\| \le \tilde{\varepsilon}^2 \|b\|$. This costs a total running time of $\tilde{O}(\mathcal{T}_{\mathrm{inverse}}(\kappa L_2, \tilde{\varepsilon}))$ according to Theorem 2.4.

In other words, by Theorem 2.4, every time we need to multiply a vector by $(H + \lambda I)^{-1}$ to error $\delta$, the time required to approximately solve such a linear system is $\mathcal{T}_{\mathrm{inverse}}(O(\kappa L_2), \delta)$. We state our running time with respect to $\mathcal{T}_{\mathrm{inverse}}$, as it is the dominant operation in the algorithm.

G.2 Power Method

We now bound the running time of the power method in Line 11 of FastCubicMin. It is folklore (cf. [3, Appendix A]) that obtaining any constant multiplicative approximation to the leading eigenvector of a PSD matrix $M \in \mathbb{R}^{d \times d}$ requires only $O(\log d)$ iterations, each computing $Mb$ for some vector $b$. In our case, we have $M = (H + \lambda_i I)^{-1}$, so we cannot compute $Mb$ exactly. Fortunately, folklore results on the inexact power method suggest that, as long as each $Mb$ is computed to a very good accuracy such as $\tilde{\varepsilon}^{\Omega(\log d)}$, we can still obtain a constant multiplicative approximate leading eigenvector that satisfies Line 11 of FastCubicMin. Omitting the details (which are quite standard and can be found, for instance, in [3, Appendix A]), we claim:

Claim G.6.
Line 11 of FastCubicMin runs in time $\tilde{O}\big(\mathcal{T}_{\mathrm{inverse}}(\kappa L_2, \varepsilon^{\Theta(\log d)})\big) = \tilde{O}(\mathcal{T}_{\mathrm{inverse}}(\kappa L_2, \varepsilon))$.

G.3 Lowest Eigenvector

We now focus on the running time for the computation of the lowest eigenvector of the Hessian, which is required in Line 18. We recall Theorem 2.5 from Section 2, which uses shift-and-invert to compute the largest eigenvalue of a matrix. Since we are concerned with the lowest eigenvector of $H$, and by assumption $-L_2 I \preceq H \preceq L_2 I$, we can equivalently compute the largest eigenvector of

$M \triangleq I - \frac{H + L_2 I}{2L_2}$,

which satisfies $0 \preceq M \preceq I$. Note that computing $Mv$ has the same time complexity as computing $Hv$. By setting $\varepsilon = \delta_\times = \frac{0.01}{\kappa L_2}$ in Theorem 2.5 and running AppxPCA, we obtain a unit vector $w$ such that

$1 - \frac{w^\top H w + L_2}{2L_2} = w^\top M w \ge (1 - 2\delta_\times) \lambda_{\max}(M) \stackrel{①}{\ge} \lambda_{\max}(M) - 2\delta_\times \ge 1 - \frac{\lambda_{\min}(H) + L_2}{2L_2} - 2\delta_\times$.

Above, ① uses $\lambda_{\max}(M) \le 1$. Rearranging the terms, we obtain $w^\top H w \le \lambda_{\min}(H) + \frac{0.05}{\kappa}$, as desired. In sum:

Claim G.7. The approximate lowest-eigenvector computation on Line 18 runs in time $\tilde{O}(\mathcal{T}_{\mathrm{inverse}}(\kappa L_2, \tilde{\varepsilon}))$.

G.4 Putting It All Together

Running-Time Proof of Main Lemma 3. Putting together Claim G.2 and Claim G.3, which bound the numbers of iterations, as well as our bounds in Claim G.6, Claim G.5, and Claim G.7 for the power method, matrix inverse, and lowest-eigenvector computations, we conclude that the total running time of FastCubicMin is at most $\tilde{O}(\mathcal{T}_{\mathrm{inverse}}(\kappa L_2, \tilde{\varepsilon}))$, where $\tilde{O}$ hides factors polylogarithmic in $\kappa, L, L_2, B, d$. Plugging our choice of $\tilde{\varepsilon}$ in Line 2, as well as the running time of either accelerated gradient descent or accelerated SVRG from Theorem 2.4, into $\mathcal{T}_{\mathrm{inverse}}(\kappa L_2, \tilde{\varepsilon})$, we finish the proof of the running-time part of Main Lemma 3.

H Proof of Lemma G.1

Lemma G.1. If $\lambda_2 + \lambda_{\min}(H) \ge c_1$ for some $c_1 \in (0, 1)$, then BinarySearch ends in $O\!\left(\log \frac{(\lambda_1 - \lambda_2) B}{c_1 \cdot L \cdot \tilde{\varepsilon}}\right)$ iterations.
Proof of Lemma G.1. We first note that throughout all iterations of BinarySearch it always holds that

L‖(H + λ_1 I)^{-1} g‖ ≤ 2λ_1  and  L‖(H + λ_2 I)^{-1} g‖ ≥ 2λ_2 . (H.1)

This is true at the beginning. In each of the follow-up iterations, if we have set λ_1 ← λ_mid, then it must satisfy L‖v‖ + Lε̃ ≤ 2λ_mid, and this implies L‖(H + λ_mid I)^{-1} g‖ ≤ 2λ_mid by the triangle inequality together with ‖v + (H + λ_mid I)^{-1} g‖ ≤ ε̃; similarly, if we have set λ_2 ← λ_mid, then it must satisfy L‖(H + λ_mid I)^{-1} g‖ ≥ 2λ_mid.

Suppose now the loop has run for at least log_2((λ_1 − λ_2)/ε̂) iterations, where ε̂ ≜ Lε̃c_1/(40B). Then it must satisfy λ_1 − λ_2 ≤ ε̂. At this point, we compute

(H + λ_1 I)^{-1} = (H + λ_2 I)^{-1} − (λ_1 − λ_2)(H + λ_2 I)^{-1}(H + λ_1 I)^{-1} ,

and therefore

L‖(H + λ_1 I)^{-1} g‖ ≥ L‖(H + λ_2 I)^{-1} g‖ − L(λ_1 − λ_2)‖(H + λ_2 I)^{-1}(H + λ_1 I)^{-1} g‖
 ≥① 2λ_2 − ε̂ ‖(H + λ_2 I)^{-1}‖ · 2λ_1
 ≥② 2λ_1 − 2ε̂ − ε̂ ‖(H + λ_2 I)^{-1}‖ · 2λ_1 .

Above, inequality ① uses (H.1) and λ_1 − λ_2 ≤ ε̂; inequality ② uses λ_1 − λ_2 ≤ ε̂ again. Now, we notice that ‖(H + λ_2 I)^{-1}‖ ≤ 1/c_1, and λ_1 ≤ 2B because λ_2 only increases and λ_1 only decreases throughout the execution of the algorithm. Therefore, by the choice of ε̂ = Lε̃c_1/(40B), we get

L‖(H + λ_1 I)^{-1} g‖ ≥ 2λ_1 − Lε̃/5 .

A completely analogous argument also shows that L‖(H + λ_2 I)^{-1} g‖ ≤ 2λ_2 + Lε̃/5. Therefore, in the immediate next iteration, when picking λ_mid ← (λ_1 + λ_2)/2, it must satisfy

2λ_mid − Lε̃/2 ≤ 2λ_1 − Lε̃/5 ≤ L‖(H + λ_mid I)^{-1} g‖ ≤ 2λ_2 + Lε̃/5 ≤ 2λ_mid + Lε̃/2 .

Then, at this iteration, when v is computed to satisfy ‖v + (H + λ_mid I)^{-1} g‖ ≤ ε̃/2, we also have 2λ_mid − Lε̃ ≤ L‖v‖ ≤ 2λ_mid + Lε̃, which means BinarySearch will stop in this iteration.
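The search analyzed in this proof can be sketched as a simple bisection loop. This is a schematic numpy illustration only, not the paper's exact BinarySearch: an exact linear solve stands in for the inexact solver, and the endpoints and tolerance are illustrative assumptions.

```python
import numpy as np

def binary_search_lambda(H, g, lam_2, lam_1, L, eps, max_iters=200):
    """Halve the bracket [lam_2, lam_1] until L * ||(H + lam*I)^{-1} g||
    lies within +/- L*eps of 2*lam (the stopping band from the proof)."""
    d = H.shape[0]
    for _ in range(max_iters):
        lam_mid = 0.5 * (lam_2 + lam_1)
        v = np.linalg.solve(H + lam_mid * np.eye(d), g)  # exact stand-in solve
        lhs = L * np.linalg.norm(v)
        if lhs <= 2.0 * lam_mid - L * eps:
            lam_1 = lam_mid   # norm too small: shrink the upper endpoint
        elif lhs >= 2.0 * lam_mid + L * eps:
            lam_2 = lam_mid   # norm too large: raise the lower endpoint
        else:
            return lam_mid    # within the band of width 2*L*eps: stop
    return 0.5 * (lam_2 + lam_1)
```

Since λ ↦ L‖(H + λI)^{-1} g‖ − 2λ is monotonically decreasing on the bracket, each step keeps the crossing point bracketed, matching the invariant (H.1).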
In sum, we have concluded that there will be no more than O(log((λ_1 − λ_2) B / (c_1 · L · ε̃))) iterations. □

Acknowledgements

We thank Ben Recht for helpful suggestions and corrections to a previous version.

References

[1] Naman Agarwal, Brian Bullins, and Elad Hazan. Second order stochastic optimization for machine learning in linear time. arXiv preprint arXiv:1602.03943, 2016.

[2] Zeyuan Allen-Zhu and Elad Hazan. Variance Reduction for Faster Non-Convex Optimization. In ICML, 2016.

[3] Zeyuan Allen-Zhu and Yuanzhi Li. Even Faster SVD Decomposition Yet Without Agonizing Pain. In NIPS, 2016.

[4] Zeyuan Allen-Zhu and Yang Yuan. Improved SVRG for Non-Strongly-Convex or Sum-of-Non-Convex Objectives. In ICML, 2016.

[5] Afonso S. Bandeira, Nicolas Boumal, and Vladislav Voroninski. On the low-rank approach for semidefinite programs arising in synchronization and community detection. arXiv preprint arXiv:1602.04426, 2016.

[6] S. Bhojanapalli, B. Neyshabur, and N. Srebro. Global Optimality of Local Search for Low Rank Matrix Recovery. ArXiv e-prints, May 2016.

[7] Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for non-convex optimization. arXiv preprint arXiv:1611.00756, 2016.

[8] Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Mathematical Programming, 127(2):245–295, 2011.

[9] Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity. Mathematical Programming, 130(2):295–319, 2011.

[10] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In AISTATS, 2015.

[11] Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941, 2014.

[12] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.

[13] Dan Garber and Elad Hazan. Fast and simple PCA via convex optimization. ArXiv e-prints, September 2015.

[14] Dan Garber, Elad Hazan, Chi Jin, Sham M. Kakade, Cameron Musco, Praneeth Netrapalli, and Aaron Sidford. Robust shift-and-invert preconditioning: Faster and more sample efficient algorithms for eigenvector computation. In ICML, 2016.

[15] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition, 2015.

[16] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Proceedings of the 28th Annual Conference on Learning Theory, COLT 2015, 2015.

[17] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3–6, 2015, pages 797–842, 2015.

[18] Rong Ge, Jason Lee, and Tengyu Ma. Matrix Completion has No Spurious Local Minimum. ArXiv e-prints, May 2016.

[19] Rong Ge and Tengyu Ma. On the optimization landscape of tensor decompositions, 2016.

[20] Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, pages 1–26, February 2015.

[21] I. J. Goodfellow, O. Vinyals, and A. M. Saxe. Qualitatively characterizing neural network optimization problems. ArXiv e-prints, December 2014.

[22] Elad Hazan and Tomer Koren. A linear-time algorithm for trust region problems. Mathematical Programming, pages 1–19, 2015.

[23] Christopher J. Hillar and Lek-Heng Lim. Most tensor problems are NP-hard. Journal of the ACM, 60(6):45, 2013.

[24] Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23–26, 2016, pages 1246–1257, 2016.

[25] Katta G. Murty and Santosh N. Kabadi. Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39(2):117–129, 1987.

[26] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). In Doklady AN SSSR (translated as Soviet Mathematics Doklady), volume 269, pages 543–547, 1983.

[27] Yurii Nesterov. Introductory Lectures on Convex Programming, Volume I: A Basic Course. Kluwer Academic Publishers, 2004.

[28] Yurii Nesterov and Boris T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.

[29] Barak A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147–160, 1994.

[30] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.

[31] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988.

[32] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388, pages 1–45, 2013. Preliminary version appeared in NIPS 2012.

[33] Shai Shalev-Shwartz. SDCA without Duality, Regularization, and Individual Convexity. In ICML, 2016.

[34] Jonathan Richard Shewchuk. An introduction to the conjugate gradient method without the agonizing pain, 1994.

[35] Paul Werbos. Beyond regression: New tools for prediction and analysis in the behavioral sciences. 1974.