Near-Optimal Methods for Minimizing Star-Convex Functions and Beyond


Oliver Hinder, Aaron Sidford, Nimit S. Sohoni

Stanford University
{ohinder, sidford, nims}@stanford.edu

February 27, 2023

Abstract

In this paper, we provide near-optimal accelerated first-order methods for minimizing a broad class of smooth nonconvex functions that are unimodal on all lines through a minimizer. This function class, which we call the class of smooth quasar-convex functions, is parameterized by a constant $\gamma \in (0, 1]$: $\gamma = 1$ encompasses the classes of smooth convex and star-convex functions, and smaller values of $\gamma$ indicate that the function can be "more nonconvex." We develop a variant of accelerated gradient descent that computes an $\epsilon$-approximate minimizer of a smooth $\gamma$-quasar-convex function with at most $O(\gamma^{-1}\epsilon^{-1/2}\log(\gamma^{-1}\epsilon^{-1}))$ total function and gradient evaluations. We also derive a lower bound of $\Omega(\gamma^{-1}\epsilon^{-1/2})$ on the worst-case number of gradient evaluations required by any deterministic first-order method, showing that, up to a logarithmic factor, no deterministic first-order method can improve upon ours.

1 Introduction

Acceleration [42, 44] is a powerful tool for improving the performance of first-order optimization methods. Accelerated gradient descent (AGD) obtains asymptotically optimal runtimes for smooth convex minimization. Furthermore, acceleration has been used to obtain improved rates for stochastic optimization [31, 2, 25, 57, 58], coordinate descent methods [46, 20, 27, 51], proximal methods [22, 36, 38], and higher-order optimization [10, 23, 30]. Acceleration has also been successful in a wide variety of practical applications, such as image deblurring [7] and neural network training [53]. In addition, there has been extensive work giving alternative interpretations of acceleration [3, 9, 52].

More recently, acceleration techniques have been applied to speed up the computation of $\epsilon$-stationary points (points where the gradient has norm at most $\epsilon$) of smooth nonconvex functions [1, 11, 12]. In particular, while gradient descent's $O(\epsilon^{-2})$ rate for finding $\epsilon$-stationary points of nonconvex functions with Lipschitz gradients is optimal among first-order methods, if higher-order smoothness assumptions are made, accelerated methods can improve this to $O(\epsilon^{-5/3}\log(\epsilon^{-1}))$ [11]. Further, [14] shows that under the same assumptions, any dimension-free deterministic first-order method requires at least $\Omega(\epsilon^{-8/5})$ iterations to compute an $\epsilon$-stationary point in the worst case. These bounds are significantly worse than the corresponding $O(\epsilon^{-1/2})$ bound that AGD achieves for smooth convex functions.

Still, in practice it is often possible to find approximate stationary points, and even approximate global minimizers, of nonconvex functions faster than these lower bounds suggest. This performance gap stems from the fairly weak assumptions underpinning these generic bounds. For example, [14, 13] only assume Lipschitz continuity of the gradient and some higher-order derivatives. However, functions minimized in practice often admit significantly more structure, even if they are not convex. For example, under suitable assumptions on their inputs, several popular nonconvex optimization problems, including matrix completion, deep learning, and phase retrieval, display "convexity-like" properties, e.g.
that all local minimizers are global [6, 24]. Much more research is needed to characterize structured sets of functions for which minimizers can be efficiently found; our work is a step in this direction.

The class of "structured" nonconvex functions that we focus on in this work is the class of functions we term quasar-convex. Informally, quasar-convex functions are unimodal on all lines that pass through a global minimizer. This function class is parameterized by a constant $\gamma \in (0, 1]$, where $\gamma = 1$ implies the function is star-convex [47] (itself a generalization of convexity), and smaller values of $\gamma$ indicate the function can be "even more nonconvex." (Footnote 1: An example of a practical problem that is quasar-convex but not star-convex is the objective for learning linear dynamical systems, under certain conditions [29]. We present numerical experiments for this problem in Appendix B.) We produce an algorithm that, given a smooth $\gamma$-quasar-convex function, uses $O(\gamma^{-1}\epsilon^{-1/2}\log(\gamma^{-1}\epsilon^{-1}))$ function and gradient queries to find an $\epsilon$-optimal point. Additionally, we provide nearly matching query complexity lower bounds of $\Omega(\gamma^{-1}\epsilon^{-1/2})$ for any deterministic first-order method applied to this function class. Minimization on this function class has been studied previously [26, 48]; our bounds more precisely characterize its complexity.

Basic notation. Throughout this paper, we use $\|\cdot\|$ to denote the Euclidean norm (i.e. $\|\cdot\|_2$). We say that a function $f : \mathbb{R}^n \to \mathbb{R}$ is $L$-smooth, or $L$-Lipschitz differentiable, if $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$ for all $x, y \in \mathbb{R}^n$. (We say a function is smooth if it is $L$-smooth for some $L \in [0, \infty)$.) We denote a minimizer of $f$ by $x^*$, and we say that a point $x$ is "$\epsilon$-optimal" or an "$\epsilon$-minimizer" if $f(x) \le f(x^*) + \epsilon$. We use '$\log$' to denote the natural logarithm and $\log_+(\cdot)$ to denote $\max\{\log(\cdot), 1\}$.

1.1 Quasar-Convexity: Definition, Motivation, and Prior Work

In this work, we improve upon the state-of-the-art complexity of first-order minimization of quasar-convex functions, which are defined as follows. (Footnote 2: The concept of quasar-convexity was first introduced by [29], who refer to it as 'weak quasi-convexity'. We introduce the term 'quasar-convexity' because we believe it is linguistically clearer. In particular, 'weak quasi-convexity' is a misnomer because it does not subsume quasi-convexity. Moreover, using this terminology, strong quasar-convexity would be confusingly termed 'strong weak quasi-convexity.')

Definition 1. Let $\gamma \in (0, 1]$ and let $x^*$ be a minimizer of the differentiable function $f : \mathbb{R}^n \to \mathbb{R}$. The function $f$ is $\gamma$-quasar-convex with respect to $x^*$ if for all $x \in \mathbb{R}^n$,
$$f(x^*) \ge f(x) + \frac{1}{\gamma}\nabla f(x)^\top (x^* - x). \tag{1}$$
Further, for $\mu \ge 0$, the function $f$ is $(\gamma, \mu)$-strongly quasar-convex (or $(\gamma, \mu)$-quasar-convex for short) if for all $x \in \mathbb{R}^n$,
$$f(x^*) \ge f(x) + \frac{1}{\gamma}\nabla f(x)^\top (x^* - x) + \frac{\mu}{2}\|x^* - x\|^2. \tag{2}$$
(Footnote 3: By Observation 4, $x^*$ is unique if $\mu > 0$.)

We simply say that $f$ is quasar-convex if (1) holds for some minimizer $x^*$ of $f$ and some constant $\gamma \in (0, 1]$, and strongly quasar-convex if (2) holds with some constants $\gamma \in (0, 1]$, $\mu > 0$. We refer to $x^*$ as the "quasar-convex point" of $f$.
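To make Definition 1 concrete, the following small Python sketch (our own illustration, not an example from the paper) numerically spot-checks inequality (1) on a grid for a simple nonconvex one-dimensional function; the maximum violation found is nonpositive, consistent with the function being $0.4$-quasar-convex with respect to its minimizer $x^* = 0$.

```python
import numpy as np

# Illustration of Definition 1 (a toy function we pick ourselves): the nonconvex
# function f(x) = x^2 + 3*sin(x)^2 has its only minimizer at x* = 0, and a grid
# check suggests it satisfies inequality (1) with, e.g., gamma = 0.4.
gamma, x_star = 0.4, 0.0
f = lambda x: x ** 2 + 3.0 * np.sin(x) ** 2
fprime = lambda x: 2.0 * x + 3.0 * np.sin(2.0 * x)

xs = np.linspace(-30.0, 30.0, 300001)
violation = f(xs) + (1.0 / gamma) * fprime(xs) * (x_star - xs) - f(x_star)
print("max violation of (1) on the grid:", float(np.max(violation)))  # <= 0 means (1) holds
```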
Assuming differentiability, in the case $\gamma = 1$, condition (1) is simply star-convexity [47]. (Footnote 4: When $\gamma = 1$, condition (2) is also known as quasi-strong convexity [41] or weak strong convexity [33].) If in addition conditions (1) or (2) hold for all $y \in \mathbb{R}^n$ in place of $x^*$, they become the standard definitions of convexity or $\mu$-strong convexity, respectively [8]. Definition 1 can also be straightforwardly generalized to the case where the domain of $f$ is a convex subset of $\mathbb{R}^n$ (see Definition 3 in Appendix D). Thus, our definition of quasar-convexity generalizes the standard notions of convexity and star-convexity in the differentiable case. Lemma 11 in Appendix D.2 shows that quasar-convexity is equivalent to a certain "convexity-like" condition on line segments to $x^*$. In Figure 1, we plot example quasar-convex functions.

[Figure 1: Examples of quasar-convex functions. Left: $f(x) = (x^2 + \tfrac{1}{8})^{1/6}$ (quasar-convex with $\gamma = \tfrac{1}{2}$). Middle: $f(x, y) = x^2 y^2$ (star-convex). The rightmost function is described in Appendix D.3.]

We say that a one-dimensional function is unimodal if it monotonically decreases to its minimizer and then monotonically increases thereafter. As Observation 1 shows, quasar-convexity is closely related to unimodality. Therefore, like the well-known quasiconvexity [5] and pseudoconvexity [39], quasar-convexity can be viewed as an approximate generalization of unimodality to higher dimensions. We remark that beyond one dimension, neither quasiconvexity nor pseudoconvexity subsumes or is subsumed by quasar-convexity. The proof of Observation 1 appears in Appendix D.1, and follows readily from the definitions.

Observation 1. Let $a < b$ and let $f : [a, b] \to \mathbb{R}$ be continuously differentiable. The function $f$ is $\gamma$-quasar-convex for some $\gamma \in (0, 1]$ iff $f$ is unimodal and all critical points of $f$ are minimizers. Additionally, if $h : \mathbb{R}^n \to \mathbb{R}$ is $\gamma$-quasar-convex with respect to a minimizer $x^*$, then for any $d \in \mathbb{R}^n$ with $\|d\| = 1$, the 1-D function $f(\theta) \triangleq h(x^* + \theta d)$ is $\gamma$-quasar-convex.

1.1.1 Related Work

There are several other 'convexity-like' conditions in the literature related to quasar-convexity. For example, star-convexity is a condition that relaxes convexity, and is a strict subset of quasar-convexity in the differentiable case. [47] introduces this condition when analyzing cubic regularization. [35] further investigates star-convexity, developing a cutting-plane method to minimize general star-convex functions. Star-convexity is an interesting property because there is some evidence to suggest the loss function of neural networks might conform to this structure in large neighborhoods of the minimizers [34, 61]. Furthermore, under mild assumptions, the objective for learning linear dynamical systems is quasar-convex [29]; this problem is closely related to the training of recurrent neural networks. Another relevant class of functions is those for which a small gradient implies approximate optimality. This is known as the Polyak-Łojasiewicz (PL) condition [50] and is weaker than strong quasar-convexity [26]. For linear residual networks, the PL condition holds in large regions of parameter space [28].
In addition to pseudoconvexity, quasiconvexity, star-convexity, and the PL condition, other relaxations of convexity or strong convexity include invexity [16], semiconvexity [55], quasi-strong convexity [41], restricted strong convexity [59], one-point convexity [37], variational coherence [62], the quadratic growth condition [4], and the error bound property [19]. A more thorough discussion is provided in Appendix A.1.

We are not the first to study acceleration on quasar-convex functions; recent work by [26] and [48] shows how to achieve accelerated rates for minimizing quasar-convex functions. For a function that is $L$-smooth and $\gamma$-quasar-convex with respect to a minimizer $x^*$, with initial distance to $x^*$ bounded by $R$, the algorithm of [26] yields an $\epsilon$-optimal point in $O(\gamma^{-1} L^{1/2} R\,\epsilon^{-1/2})$ iterations, while the algorithm of [48] does so in $O(\gamma^{-3/2} L^{1/2} R\,\epsilon^{-1/2})$ iterations. For convex functions (which have $\gamma = 1$), these bounds match the iteration bounds achieved by AGD [44], but use a different oracle model. In particular, to achieve these iteration bounds, the method in [26] relies on a low-dimensional subspace optimization method within each iteration, while [48] uses a one-dimensional line search over the function value in each iteration, as well as a restart criterion that requires knowledge of the true optimal function value. (Footnote 5: We discuss [48] in more detail in Appendix A.2.) However, quasar-convex functions are not necessarily unimodal along the arbitrary low-dimensional regions or line segments being searched over. Therefore, even finding an approximate minimizer within these subregions may be computationally expensive, making each iteration potentially costly; by contrast, our methods only require a function and gradient oracle. In addition, neither paper provides lower bounds nor studies the "strongly quasar-convex" regime.

Independently, recent work by [60] uses a differential equation discretization to approach the accelerated $O(\kappa^{1/2}\log(\epsilon^{-1}))$ rate for minimization of smooth strongly quasar-convex functions in a neighborhood of the optimum, in the special case $\gamma = 1$. (Footnote 6: $\kappa = L/\mu$ denotes the condition number of an $L$-smooth $(\gamma, \mu)$-strongly quasar-convex function.) Similarly, in the $\gamma = 1$ case, geometric descent [9] achieves $O(\kappa^{1/2}\log(\epsilon^{-1}))$ running times in terms of the number of calls to a one-dimensional line search oracle (although, as previously noted, the number of function and gradient evaluations required may still be large). (Footnote 7: Although this result is not explicitly stated in the literature, upon careful inspection of the analysis in [9] it can be observed that the $\mu$-strong convexity requirement in [9] may be relaxed to the requirement of $(1, \mu)$-strong quasar-convexity, with no changes to the algorithm necessary.)

1.2 Summary of Results

For functions that are $L$-smooth and $\gamma$-quasar-convex, we provide an algorithm which finds an $\epsilon$-optimal solution in $O(\gamma^{-1} L^{1/2} R\,\epsilon^{-1/2})$ iterations (where, as before, $R$ is an upper bound on the initial distance to the quasar-convex point $x^*$). Our iteration bound is the same as that of [26], and a factor of $\gamma^{1/2}$ better than the $O(\gamma^{-3/2} L^{1/2} R\,\epsilon^{-1/2})$ bound of [48]. Additionally, we are the first to provide bounds on the total number of function and gradient evaluations required; our algorithm uses $O(\gamma^{-1} L^{1/2} R\,\epsilon^{-1/2}\log(\gamma^{-1}\epsilon^{-1}))$ evaluations to find an $\epsilon$-optimal solution.
We also provide an algorithm for $L$-smooth, $(\gamma, \mu)$-strongly quasar-convex functions; our algorithm uses $O(\gamma^{-1}\kappa^{1/2}\log(\gamma^{-1}\epsilon^{-1}))$ iterations and $O(\gamma^{-1}\kappa^{1/2}\log(\gamma^{-1}\kappa)\log(\gamma^{-1}\epsilon^{-1}))$ total function and gradient evaluations to find an $\epsilon$-optimal point, where $\kappa \triangleq L/\mu$ ($\kappa$ is typically referred to as the condition number). For constant $\gamma$, this matches accelerated gradient descent's bound for smooth strongly convex functions, up to a logarithmic factor.

The key idea behind our algorithm is to take a close look at which essential invariants need to hold during the momentum step of AGD, and use this insight to carefully redesign the algorithm to accelerate on general smooth quasar-convex functions. By observing how the function behaves along the line segment between current iterates $x^{(k)}$ and $v^{(k)}$, we show that for any smooth quasar-convex function, there always exists a point $y^{(k)}$ along this segment with the properties needed for acceleration. Furthermore, we show that an efficient binary search can be used to find such a point, even without the assumption of convexity along the segment.

To complement our upper bounds, we provide lower bounds of $\Omega(\gamma^{-1} L^{1/2} R\,\epsilon^{-1/2})$ for the number of gradient evaluations that any deterministic first-order method requires to find an $\epsilon$-minimizer of a quasar-convex function. This shows that up to logarithmic factors, our lower and upper bounds are tight. Our lower bounds extend the techniques from [14] to the class of smooth quasar-convex functions, allowing an almost exact characterization of the complexity of minimizing these functions.

Paper outline. In Section 2, we provide a general framework for accelerating the minimization of smooth quasar-convex functions. In Section 3, we apply our framework to develop specific algorithms tailored to both quasar-convex and strongly quasar-convex functions. In Section 4, we provide lower bounds to show that the upper bounds for quasar-convex minimization of Section 3 are tight up to logarithmic factors. Full proofs, additional results, and numerical experiments are in the Appendix.

2 Quasar-Convex Minimization Framework

In this section, we provide and analyze a general algorithmic template for accelerated minimization of smooth quasar-convex functions. In Section 3.1 we show how to leverage this framework to achieve accelerated rates for minimizing strongly quasar-convex functions, and in Section 3.2 we show how to achieve accelerated rates for minimizing non-strongly quasar-convex functions (i.e. when $\mu = 0$). For simplicity, we assume the domain is $\mathbb{R}^n$.

Our algorithm (Algorithm 1) is a simple generalization of accelerated gradient descent (AGD). Indeed, standard AGD can be written in the form of Algorithm 1, for particular choices of the parameters $\alpha^{(k)}$, $\beta^{(k)}$, $\eta^{(k)}$.
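For concreteness, the following minimal Python sketch (ours, not from the paper) spells out one iteration of the update template that Algorithm 1 below formalizes; the per-iteration parameter choices are left to the caller.

```python
import numpy as np

def framework_step(grad_f, x, v, alpha, beta, L_k, eta):
    """One iteration of the general AGD template (Algorithm 1), as a sketch.

    The per-iteration parameters alpha, beta, L_k, eta are supplied by the
    particular instantiation (standard AGD, or Algorithms 3 and 4 below).
    """
    y = alpha * x + (1.0 - alpha) * v                  # momentum combination (line 3)
    g = grad_f(y)
    x_next = y - g / L_k                               # gradient step from y (line 4)
    v_next = beta * v + (1.0 - beta) * y - eta * g     # second sequence update (line 5)
    return y, x_next, v_next
```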
Given a differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ with smoothness parameter $L > 0$ and initial point $x^{(0)} = v^{(0)} \in \mathbb{R}^n$, the algorithm iteratively computes points $x^{(k)}, v^{(k)} \in \mathbb{R}^n$ of improving "quality." However, it is challenging to argue that Algorithm 1 actually performs optimally without the assumption of convexity. The crux of circumventing convexity is to show that there exists a way to efficiently compute the momentum parameter $\alpha^{(k)}$ to yield convergence at the desired rate. In this section, we provide general tools for analyzing this algorithm; in Section 3, we leverage this analysis with specific choices of the parameters $\alpha^{(k)}$, $\beta^{(k)}$, and $\eta^{(k)}$ to derive our fully-specified accelerated schemes for both quasar-convex and strongly quasar-convex functions.

Algorithm 1: General AGD Framework
Input: $L$-smooth function $f : \mathbb{R}^n \to \mathbb{R}$, initial point $x^{(0)} \in \mathbb{R}^n$, number of iterations $K$.
Sequences $\{\alpha^{(k)}\}_{k=0}^{K-1}$, $\{\beta^{(k)}\}_{k=0}^{K-1}$, $\{L^{(k)}\}_{k=0}^{K-1}$, $\{\eta^{(k)}\}_{k=0}^{K-1}$ are defined by the particular algorithm instance, where $\alpha^{(k)} \in [0, 1]$, $\beta^{(k)} \in [0, 1]$, $L^{(k)} \in (0, 2L)$, $\eta^{(k)} \ge \frac{\gamma}{L^{(k)}}$.
1: Set $v^{(0)} = x^{(0)}$
2: for $k = 0, 1, 2, \ldots, K - 1$ do
3:   Set $y^{(k)} = \alpha^{(k)} x^{(k)} + (1 - \alpha^{(k)}) v^{(k)}$
4:   Set $x^{(k+1)} = y^{(k)} - \frac{1}{L^{(k)}}\nabla f(y^{(k)})$   # $L^{(k)}$ computed s.t. $f(x^{(k+1)}) \le f(y^{(k)}) - \frac{1}{2L^{(k)}}\|\nabla f(y^{(k)})\|^2$
5:   Set $v^{(k+1)} = \beta^{(k)} v^{(k)} + (1 - \beta^{(k)}) y^{(k)} - \eta^{(k)}\nabla f(y^{(k)})$
   end
6: return $x^{(K)}$

We first define notation that will be used throughout Sections 2 and 3:

Definition 2. Let
$$\epsilon^{(k)} \triangleq f(x^{(k)}) - f(x^*), \qquad \epsilon_y^{(k)} \triangleq f(y^{(k)}) - f(x^*), \qquad r^{(k)} \triangleq \|v^{(k)} - x^*\|^2, \qquad r_y^{(k)} \triangleq \|y^{(k)} - x^*\|^2,$$
$$Q^{(k)} \triangleq \beta^{(k)}\Big(2\eta^{(k)}\alpha^{(k)}\nabla f(y^{(k)})^\top(x^{(k)} - v^{(k)}) - (\alpha^{(k)})^2(1 - \beta^{(k)})\|x^{(k)} - v^{(k)}\|^2\Big).$$

In the remainder of this section, we analyze Algorithm 1. We assume that $f$ is $L$-smooth and $(\gamma, \mu)$-strongly quasar-convex (possibly with $\mu = 0$) with respect to a minimizer $x^*$. First, we use Lemma 1 to bound how much the function error of $x^{(k)}$ and the distance from $v^{(k)}$ to $x^*$ decrease at each iteration.

Lemma 1 (One Step Framework Analysis). Suppose $f$ is $L$-smooth and $(\gamma, \mu)$-quasar-convex with respect to a minimizer $x^*$. Then, in each iteration $k \ge 0$ of Algorithm 1 applied to $f$, it is the case that
$$2(\eta^{(k)})^2 L^{(k)}\epsilon^{(k+1)} + r^{(k+1)} \le \beta^{(k)} r^{(k)} + \big[(1 - \beta^{(k)}) - \gamma\mu\eta^{(k)}\big] r_y^{(k)} + 2\eta^{(k)}\big[L^{(k)}\eta^{(k)} - \gamma\big]\epsilon_y^{(k)} + Q^{(k)}.$$

Proof. Let $z^{(k)} \triangleq \beta^{(k)} v^{(k)} + (1 - \beta^{(k)}) y^{(k)}$. Since $v^{(k+1)} = z^{(k)} - \eta^{(k)}\nabla f(y^{(k)})$, direct algebraic manipulation yields that
$$r^{(k+1)} = \|v^{(k+1)} - x^*\|^2 = \|z^{(k)} - x^* - \eta^{(k)}\nabla f(y^{(k)})\|^2 = \|z^{(k)} - x^*\|^2 + 2\eta^{(k)}\nabla f(y^{(k)})^\top(x^* - z^{(k)}) + (\eta^{(k)})^2\|\nabla f(y^{(k)})\|^2. \tag{3}$$
Using the definitions of $z^{(k)}$ and $y^{(k)}$, we have
$$\|z^{(k)} - x^*\|^2 = \beta^{(k)}\|v^{(k)} - x^*\|^2 + (1 - \beta^{(k)})\|y^{(k)} - x^*\|^2 - \beta^{(k)}(1 - \beta^{(k)})\|v^{(k)} - y^{(k)}\|^2 = \beta^{(k)} r^{(k)} + (1 - \beta^{(k)}) r_y^{(k)} - \beta^{(k)}(1 - \beta^{(k)})(\alpha^{(k)})^2\|v^{(k)} - x^{(k)}\|^2. \tag{4}$$
Further, since $v^{(k)} = y^{(k)} + \alpha^{(k)}(v^{(k)} - x^{(k)})$ and $z^{(k)} = \beta^{(k)} v^{(k)} + (1 - \beta^{(k)}) y^{(k)} = y^{(k)} + \alpha^{(k)}\beta^{(k)}(v^{(k)} - x^{(k)})$, it follows that
$$\nabla f(y^{(k)})^\top(x^* - z^{(k)}) = \nabla f(y^{(k)})^\top(x^* - y^{(k)}) + \alpha^{(k)}\beta^{(k)}\nabla f(y^{(k)})^\top(x^{(k)} - v^{(k)}). \tag{5}$$
Since $(\gamma, \mu)$-strong quasar-convexity of $f$ implies $-\epsilon_y^{(k)} \ge \frac{1}{\gamma}\nabla f(y^{(k)})^\top(x^* - y^{(k)}) + \frac{\mu}{2} r_y^{(k)}$, and the definition of $x^{(k+1)}$ and $L^{(k)}$ implies $0 \le \|\nabla f(y^{(k)})\|^2 \le 2L^{(k)}[\epsilon_y^{(k)} - \epsilon^{(k+1)}]$, combining with (3), (4), and (5) yields the result.

Note that $L^{(k)}$ in Line 4 of Algorithm 1 can be set to the Lipschitz constant $L$ if it is known; otherwise, it can be efficiently computed to make $f(x^{(k+1)}) = f(y^{(k)} - \frac{1}{L^{(k)}}\nabla f(y^{(k)})) \le f(y^{(k)}) - \frac{1}{2L^{(k)}}\|\nabla f(y^{(k)})\|^2$ and $L^{(k)} \in (0, 2L)$ hold using backtracking line search. See Lemma 9 (Appendix C.1) for more details.

Lemma 1 provides our main bound on how the error $\epsilon^{(k)}$ changes between successive iterations of Algorithm 1. The key step necessary to apply this lemma is to relate $f(y^{(k)})$ and $\nabla f(y^{(k)})^\top(x^{(k)} - v^{(k)})$ to $f(x^{(k)})$, in order to bound $Q^{(k)}$. In the standard analysis of accelerated gradient descent, convexity is used to obtain such a connection. (Footnote 8: See Appendix C.4.1 for more details.) In our algorithms, we instead perform binary search to compute the momentum parameter $\alpha^{(k)}$ for which the necessary relationship holds without assuming convexity. The following lemma shows that there always exists a setting of $\alpha^{(k)}$ that satisfies the necessary relationship.

Lemma 2 (Existence of "Good" $\alpha$). Let $f : \mathbb{R}^n \to \mathbb{R}$ be differentiable and let $x, v \in \mathbb{R}^n$. For $\alpha \in \mathbb{R}$ define $y_\alpha \triangleq \alpha x + (1 - \alpha)v$. For any $c \ge 0$ there exists $\alpha \in [0, 1]$ such that
$$\alpha\,\nabla f(y_\alpha)^\top(x - v) \le c\,[f(x) - f(y_\alpha)]. \tag{6}$$

Proof. Define $g(\alpha) \triangleq f(y_\alpha)$. Then for all $\alpha \in \mathbb{R}$ we have $g'(\alpha) = \nabla f(y_\alpha)^\top(x - v)$. Consequently, (6) is equivalent to the condition $\alpha g'(\alpha) \le c[g(1) - g(\alpha)]$. If $g'(1) \le 0$, inequality (6) trivially holds at $\alpha = 1$; if $f(v) = g(0) \le g(1) = f(x)$, the inequality trivially holds at $\alpha = 0$. If neither of these conditions hold, $g'(1) > 0$ and $g(0) > g(1)$, so Fact 1 from Appendix C.2 implies that there is a value of $\alpha \in (0, 1)$ such that $g'(\alpha) = 0$ and $g(\alpha) \le g(1)$, and therefore this value of $\alpha$ satisfies (6). Figure 2 illustrates this third case graphically.

[Figure 2: Illustration of Lemma 2. $g(\alpha)$ is defined as in the proof of the lemma; here, we depict the case where $g(0) > g(1)$ and $g'(1) > 0$. The points highlighted in green satisfy inequality (6); the circled point has $g'(\alpha) = 0$ and $g(\alpha) \le g(1)$. Here $c = 10$.]

In our algorithms, we will not seek an $\alpha$ satisfying (6) exactly, but instead $\alpha \in [0, 1]$ such that
$$\alpha\,\nabla f(y_\alpha)^\top(x - v) - \alpha^2 b\,\|x - v\|^2 \le c\,[f(x) - f(y_\alpha)] + \tilde\epsilon, \tag{7}$$
for some $b, c, \tilde\epsilon \ge 0$. As (7) is a weaker statement than (6), the existence of $\alpha$ satisfying (7) follows from Lemma 2. Moreover, we will show how to lower bound the size of the set of points satisfying (7), which we use to bound the time required to compute such a point. We can thus bound the quantity $Q^{(k)}$ from Lemma 1 by selecting $\alpha^{(k)}$ to satisfy (7) with appropriate settings of $b, c, \tilde\epsilon$, which we do in Lemma 3.
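Before turning to Lemma 3, here is a small numerical illustration of Lemma 2 (a toy stand-in for $g$, chosen by us rather than taken from the paper): in this instance neither endpoint of $[0, 1]$ satisfies (6), yet a positive fraction of interior points does, as Lemma 2 guarantees.

```python
import numpy as np

# Toy illustration of Lemma 2: for a nonconvex restriction g(alpha), some alpha
# in [0, 1] satisfies alpha * g'(alpha) <= c * (g(1) - g(alpha)), i.e. (6).
c = 10.0
g = lambda a: np.cos(3.0 * a) + 0.5 * a            # stand-in for f along the segment
gprime = lambda a: -3.0 * np.sin(3.0 * a) + 0.5

alphas = np.linspace(0.0, 1.0, 10001)
ok = alphas * gprime(alphas) <= c * (g(1.0) - g(alphas))
print("fraction of the grid satisfying (6):", ok.mean())  # strictly positive, per Lemma 2
```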
Lemma 3. If $\beta^{(k)} > 0$ and $\alpha^{(k)} \in [0, 1]$ satisfies (7) with $x = x^{(k)}$, $v = v^{(k)}$, $b = \frac{1 - \beta^{(k)}}{2\eta^{(k)}}$, and $c = \frac{L^{(k)}\eta^{(k)} - \gamma}{\beta^{(k)}}$, or if $\beta^{(k)} = 0$ and $\alpha^{(k)} = 1$, then
$$Q^{(k)} \le 2\eta^{(k)}\Big[(L^{(k)}\eta^{(k)} - \gamma)\cdot(\epsilon^{(k)} - \epsilon_y^{(k)}) + \beta^{(k)}\tilde\epsilon\Big]. \tag{8}$$

Proof. First suppose $\beta^{(k)} > 0$. As by definition $y^{(k)} = \alpha^{(k)} x^{(k)} + (1 - \alpha^{(k)}) v^{(k)}$ and $L^{(k)}\eta^{(k)} \ge \gamma$, applying (7) yields
$$Q^{(k)} = 2\beta^{(k)}\eta^{(k)}\left(\alpha^{(k)}\nabla f(y^{(k)})^\top(x^{(k)} - v^{(k)}) - \frac{(\alpha^{(k)})^2(1 - \beta^{(k)})\|x^{(k)} - v^{(k)}\|^2}{2\eta^{(k)}}\right) \le 2\beta^{(k)}\eta^{(k)}\left(\frac{L^{(k)}\eta^{(k)} - \gamma}{\beta^{(k)}}\,[f(x^{(k)}) - f(y^{(k)})] + \tilde\epsilon\right) = 2\eta^{(k)}\Big([L^{(k)}\eta^{(k)} - \gamma]\cdot[\epsilon^{(k)} - \epsilon_y^{(k)}] + \beta^{(k)}\tilde\epsilon\Big).$$
Alternatively, suppose $\beta^{(k)} = 0$. Then $Q^{(k)} = 0$ as well; if we select $\alpha^{(k)} = 1$, then $y^{(k)} = x^{(k)}$ and (8) trivially holds for any $\tilde\epsilon$, as $\epsilon_y^{(k)} = \epsilon^{(k)}$.

Now, in Algorithm 2 we show how to efficiently compute an $\alpha$ satisfying inequality (7).

Algorithm 2: BinaryLineSearch($f, x, v, b, c, \tilde\epsilon, [\text{guess}]$)
Assumptions: $f$ is $L$-smooth; $x, v \in \mathbb{R}^n$; $b, c, \tilde\epsilon \ge 0$; "guess" (optional) is in $[0, 1]$ if provided.
Define $g(\alpha) \triangleq f(\alpha x + (1 - \alpha)v)$ and $p \triangleq b\|x - v\|^2$.
1: if guess provided and $c\,g(\mathrm{guess}) + \mathrm{guess}\cdot(g'(\mathrm{guess}) - \mathrm{guess}\cdot p) \le c\,g(1) + \tilde\epsilon$ then return guess
2: if $g'(1) \le \tilde\epsilon + p$ then return 1
3: else if $c = 0$ or $g(0) \le g(1) + \tilde\epsilon/c$ then return 0
4: $\tau \leftarrow 1 - g'(1)/\mathrm{BacktrackingSearch}(g, p, 1)$   # one step of gradient descent on $g$ from 1, using backtracking to select the step size; see Algorithm 5 for BacktrackingSearch pseudocode
5: lo $\leftarrow 0$, hi $\leftarrow \tau$, $\alpha \leftarrow \tau$
6: while $c\,g(\alpha) + \alpha(g'(\alpha) - \alpha p) > c\,g(1) + \tilde\epsilon$ do
7:   $\alpha \leftarrow (\text{lo} + \text{hi})/2$
8:   if $g(\alpha) \le g(\tau)$ then hi $\leftarrow \alpha$
9:   else lo $\leftarrow \alpha$
   end
10: return $\alpha$

The core idea behind Algorithm 2 is as follows: as in the proof of Lemma 2, let $g(\alpha) \triangleq f(\alpha x + (1 - \alpha)v)$ be the restriction of the function $f$ to the line from $v$ to $x$. If either $g(0) \le g(1)$, or $g$ is decreasing at $\alpha = 1$, then (6) is immediately satisfied. If this does not happen, then $g(0) > g(1)$ but $g'(1) > 0$, which means that $g$ switches from decreasing to increasing at some $\alpha \in (0, 1)$, and so $g'(\alpha) = 0$. Such a value of $\alpha$ also satisfies (6). Algorithm 2 uses binary search to exploit this fact and thereby efficiently compute a value of $\alpha$ approximately satisfying (6) (i.e., satisfying (7)). In Lemma 4, we bound the maximum number of iterations that Algorithm 2 can take until (7) holds and it thereby terminates. Lemma 4 is proved in Appendix C.2.

"guess" is an optional argument to Algorithm 2; if given, the value of "guess" will be tested first, and chosen as the value of $\alpha$ if it satisfies (7). For instance, we can use the value of $\alpha^{(k)}$ prescribed by the standard version of AGD as an initial guess. We discuss this further in Section C.4.2.

Lemma 4 (Line Search Runtime). For $L$-smooth $f : \mathbb{R}^n \to \mathbb{R}$, points $x, v \in \mathbb{R}^n$ and scalars $b, c, \tilde\epsilon \ge 0$, Algorithm 2 computes $\alpha \in [0, 1]$ satisfying (7) with at most
$$8 + 3\left\lceil \log_2^+\left((4 + c)\min\left\{\frac{2L^3}{b^3},\; \frac{L\|x - v\|^2}{2\tilde\epsilon}\right\}\right)\right\rceil$$
function and gradient evaluations.

In summary, we achieve our accelerated quasar-convex minimization procedures (presented below) by setting $\eta^{(k)}$, $\beta^{(k)}$, $\tilde\epsilon$, and $\alpha^{(k)}$ appropriately. In standard AGD, convexity is used to set a particular value of $\alpha^{(k)}$; by contrast, our accelerated quasar-convex minimization procedures relax the convexity assumption by computing an $\alpha^{(k)}$ satisfying (7) via binary search (Algorithm 2). By lower bounding the length of the interval of values of $\alpha^{(k)}$ satisfying (7), we show that this binary search only costs a logarithmic factor in the overall runtime.
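As a simplified sketch of this binary search, the Python function below (ours) mirrors the structure of Algorithm 2; the paper's BacktrackingSearch subroutine (Algorithm 5, in the appendix) is replaced by a crude step-size halving loop, so the constants differ from the paper's analysis.

```python
import numpy as np

def binary_line_search(f, grad_f, x, v, b, c, eps_tilde, max_iter=60):
    """Simplified sketch of Algorithm 2: find alpha in [0, 1] approximately
    satisfying condition (7) for y_alpha = alpha * x + (1 - alpha) * v."""
    d = x - v
    p = b * float(np.dot(d, d))
    g = lambda a: float(f(v + a * d))                    # restriction of f to the segment
    gp = lambda a: float(np.dot(grad_f(v + a * d), d))   # g'(alpha)

    def ok(a):  # condition (7), rearranged as in lines 1 and 6 of Algorithm 2
        return c * g(a) + a * (gp(a) - a * p) <= c * g(1.0) + eps_tilde

    if gp(1.0) <= eps_tilde + p:                         # alpha = 1 works (line 2)
        return 1.0
    if c == 0 or g(0.0) <= g(1.0) + eps_tilde / c:       # alpha = 0 works (line 3)
        return 0.0

    step = 1.0                                           # crude backtracking gradient step on g from alpha = 1
    while step > 1e-12 and g(1.0 - step * gp(1.0)) > g(1.0) - 0.5 * step * gp(1.0) ** 2:
        step *= 0.5
    tau = min(max(1.0 - step * gp(1.0), 0.0), 1.0)

    lo, hi, alpha = 0.0, tau, tau
    for _ in range(max_iter):                            # bisection (lines 6-9)
        if ok(alpha):
            break
        alpha = 0.5 * (lo + hi)
        if g(alpha) <= g(tau):
            hi = alpha
        else:
            lo = alpha
    return alpha
```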
3 Algorithms

In this section, we develop algorithms for accelerated minimization of strongly quasar-convex functions and quasar-convex functions, respectively, and analyze their running times in terms of the number of function and gradient evaluations required. We note that the Lipschitz constant $L$ does not need to be known; however, a lower bound $\hat\gamma > 0$ on $\gamma$ does need to be known, and the runtime depends inversely on $\hat\gamma$. In Appendix B, we provide numerical experiments on different types of quasar-convex functions, which validate the claim that our algorithm is not only efficient in theory but also empirically competitive with other first-order methods such as standard AGD.

3.1 Strongly Quasar-Convex Minimization

First, we provide and analyze our algorithm for $(\gamma, \mu)$-strongly quasar-convex function minimization, where $\mu > 0$. The algorithm (Algorithm 3) is a carefully constructed instance of the general AGD framework (Algorithm 1). As in the general AGD framework, the algorithm maintains two current points denoted $x^{(k)}$ and $v^{(k)}$ and at each step appropriately selects $y^{(k)} = \alpha^{(k)} x^{(k)} + (1 - \alpha^{(k)}) v^{(k)}$ as a convex combination of these two points.

Intuitively, the algorithm iteratively seeks to decrease quadratic upper and lower bounds on the function value. $L$-smoothness of $f$ implies for all $x, y \in \mathbb{R}^n$ that $f(x) \le UB_y(x) \triangleq f(y) + \nabla f(y)^\top(x - y) + \frac{L}{2}\|x - y\|^2$; if $L^{(k)} = L$, then $x^{(k+1)}$ is the minimizer $y^{(k)} - \frac{1}{L}\nabla f(y^{(k)})$ of the upper bound $UB_{y^{(k)}}$. Similarly, by $(\gamma, \mu)$-quasar-convexity, $f(x) \ge f(x^*) \ge \min_z LB_y(z)$ for all $x, y \in \mathbb{R}^n$, where $LB_y(x) \triangleq f(y) + \frac{1}{\gamma}\nabla f(y)^\top(x - y) + \frac{\mu}{2}\|x - y\|^2$. The minimizer of the lower bound $LB_{y^{(k)}}$ is $y^{(k)} - \frac{1}{\gamma\mu}\nabla f(y^{(k)})$; we set $v^{(k+1)}$ to be a convex combination of $v^{(k)}$ and the minimizer of $LB_{y^{(k)}}$.

Algorithm 3: Accelerated Strongly Quasar-Convex Function Minimization
Input: $L$-smooth $f : \mathbb{R}^n \to \mathbb{R}$ that is $(\gamma, \mu)$-strongly quasar-convex (with $\mu > 0$), initial point $x^{(0)} \in \mathbb{R}^n$, number of iterations $K$, solution tolerance $\epsilon > 0$.
1: return the output of Algorithm 1 on $f$ with initial point $x^{(0)}$, where for all $k$, $L^{(k)} = \mathrm{BacktrackingSearch}(f, \frac{\gamma\mu}{2 - \gamma}, x^{(k)})$, $\beta^{(k)} = 1 - \gamma\sqrt{\mu/L^{(k)}}$, $\eta^{(k)} = \frac{1}{\sqrt{\mu L^{(k)}}}$, and $\alpha^{(k)} = \mathrm{BinaryLineSearch}(f, x^{(k)}, v^{(k)}, b = \frac{\gamma\mu}{2}, c = \sqrt{L^{(k)}/\mu}, \tilde\epsilon = 0)$ if $\beta^{(k)} > 0$, else $\alpha^{(k)} = 1$.

We leverage the analysis from Section 2 to analyze Algorithm 3. First, in Lemma 5 we show that the algorithm converges at the desired rate, by building off of Lemma 1 and using the specific parameter choices in Algorithm 3.

Lemma 5 (Strongly Quasar-Convex Convergence). If $f$ is $L$-smooth and $(\gamma, \mu)$-strongly quasar-convex with minimizer $x^*$, $\gamma \in (0, 1]$, and $\mu > 0$, then in each iteration $k \ge 0$ of Algorithm 3,
$$\epsilon^{(k+1)} + \frac{\mu}{2} r^{(k+1)} \le \left(1 - \frac{\gamma}{\sqrt{2\kappa}}\right)\left[\epsilon^{(k)} + \frac{\mu}{2} r^{(k)}\right], \tag{9}$$
where $\epsilon^{(k)} \triangleq f(x^{(k)}) - f(x^*)$, $r^{(k)} \triangleq \|v^{(k)} - x^*\|^2$, and $\kappa \triangleq L/\mu$.
Therefore, if the number of iterations $K \ge \left\lceil\frac{\sqrt{2\kappa}}{\gamma}\log_+\left(\frac{3\epsilon^{(0)}}{\gamma\epsilon}\right)\right\rceil$, then the output $x^{(K)}$ satisfies $f(x^{(K)}) \le f(x^*) + \epsilon$.

Proof. For all $k$, $\eta^{(k)} = \frac{1}{\sqrt{\mu L^{(k)}}} \ge \sqrt{\frac{\gamma}{(2 - \gamma)(L^{(k)})^2}} \ge \frac{\gamma}{L^{(k)}}$ as required by Algorithm 1, since $\frac{x}{2 - x} \ge x^2$ for all $x \in [0, 1]$ and since $\frac{(2 - \gamma)L^{(k)}}{\gamma} \ge \mu > 0$ by definition of $L^{(k)}$, because we use $\frac{\gamma\mu}{2 - \gamma}$ (which is $\le L$ by Observation 2) as the initial guess for $L^{(k)}$ and only increase it during the backtracking search. Similarly, since $0 < \frac{\mu}{L^{(k)}} \le \frac{2 - \gamma}{\gamma}$ and $\gamma \in (0, 1]$, we have $0 < \gamma\sqrt{\mu/L^{(k)}} \le \sqrt{\gamma(2 - \gamma)} \le 1$, meaning that $\beta^{(k)} \in [0, 1)$. Additionally, by construction, either $\beta^{(k)} = 0$ and $\alpha^{(k)} = 1$, or $\beta^{(k)} > 0$, $\alpha^{(k)} \in [0, 1]$, and $(\alpha, x, y_\alpha, v) = (\alpha^{(k)}, x^{(k)}, y^{(k)}, v^{(k)})$ satisfies (7) with $b = \frac{\gamma\mu}{2} = \frac{1 - \beta^{(k)}}{2\eta^{(k)}}$, $c = \sqrt{L^{(k)}/\mu} = \frac{L^{(k)}\eta^{(k)} - \gamma}{\beta^{(k)}}$, $\tilde\epsilon = 0$. Consequently, by combining Lemmas 1 and 3, for each iteration $k \ge 0$ of Algorithm 3 we have
$$2(\eta^{(k)})^2 L^{(k)}\epsilon^{(k+1)} + r^{(k+1)} \le \beta^{(k)} r^{(k)} + \big[(1 - \beta^{(k)}) - \gamma\mu\eta^{(k)}\big] r_y^{(k)} + 2\eta^{(k)}\big[L^{(k)}\eta^{(k)} - \gamma\big]\epsilon^{(k)} + 2\beta^{(k)}\eta^{(k)}\tilde\epsilon.$$
Substituting in $\eta^{(k)} = \frac{1}{\sqrt{\mu L^{(k)}}} = \frac{1 - \beta^{(k)}}{\gamma\mu}$ and $\tilde\epsilon = 0$, this implies that
$$\frac{2}{\mu}\epsilon^{(k+1)} + r^{(k+1)} \le \beta^{(k)} r^{(k)} + \frac{2}{\sqrt{\mu L^{(k)}}}\left(\sqrt{\frac{L^{(k)}}{\mu}} - \gamma\right)\epsilon^{(k)} = \beta^{(k)}\left(r^{(k)} + \frac{2}{\mu}\epsilon^{(k)}\right).$$
Multiplying by $\mu/2$ and using the definition of $\beta^{(k)}$ as $1 - \gamma\sqrt{\mu/L^{(k)}}$ and the fact that $0 < L^{(k)} < 2L$ yields (9).

Now, by (9) and induction,
$$\epsilon^{(k)} + \frac{\mu}{2} r^{(k)} \le \left(1 - \frac{\gamma}{\sqrt{2\kappa}}\right)^k\left[\epsilon^{(0)} + \frac{\mu}{2} r^{(0)}\right] \le \exp\left(-\frac{k\gamma}{\sqrt{2\kappa}}\right)\left[\epsilon^{(0)} + \frac{\mu}{2} r^{(0)}\right].$$
Therefore, whenever $k \ge \frac{\sqrt{2\kappa}}{\gamma}\log_+\left(\frac{\epsilon^{(0)} + \frac{\mu}{2} r^{(0)}}{\epsilon}\right)$ we have $\epsilon^{(k)} = f(x^{(k)}) - f(x^*) \le \epsilon$, as $r^{(k)} \ge 0$ always. By Corollary 1, $\frac{2\epsilon^{(0)}}{\gamma} \ge \frac{\mu}{2} r^{(0)}$, so it suffices to run $k \ge \left\lceil\frac{\sqrt{2\kappa}}{\gamma}\log_+\left(\frac{3\epsilon^{(0)}}{\gamma\epsilon}\right)\right\rceil$ iterations.

Note that when $f$ is $(1, \mu)$-strongly quasar-convex with $\mu > 0$, Lemma 5 implies that the number of iterations Algorithm 3 needs to find an $\epsilon$-minimizer of $f$ is of the same order as the number of iterations required by standard AGD to find an $\epsilon$-minimizer of a $\mu$-strongly convex function.

In each iteration of Algorithm 3, we compute $\alpha^{(k)}$ and then simply perform $O(1)$ vector operations to compute $y^{(k)}$, $x^{(k+1)}$, and $v^{(k+1)}$. Consequently, to obtain a complete bound on the overall complexity of Algorithm 3, it remains to bound the cost of computing $\alpha^{(k)}$, which we do using Lemma 4. This leads to Theorem 1.

Theorem 1. If $f$ is $L$-smooth and $(\gamma, \mu)$-strongly quasar-convex with $\gamma \in (0, 1]$ and $\mu > 0$, then Algorithm 3 produces an $\epsilon$-optimal point after
$$O\left(\gamma^{-1}\kappa^{1/2}\log\left(\gamma^{-1}\kappa\right)\log_+\left(\frac{f(x^{(0)}) - f(x^*)}{\gamma\epsilon}\right)\right)$$
function and gradient evaluations.

Proof. Lemma 5 implies that $O\left(\frac{\sqrt\kappa}{\gamma}\log_+\left(\frac{\epsilon^{(0)}}{\gamma\epsilon}\right)\right)$ iterations are needed to get an $\epsilon$-optimal point. Lemma 4 implies that each iteration uses $O\left(\log_+\left((1 + c)\min\left\{\frac{L\|x - v\|^2}{\tilde\epsilon}, \frac{L^3}{b^3}\right\}\right)\right)$ function and gradient evaluations. In this case, $b = \frac{\gamma\mu}{2}$, $c = \sqrt{L^{(k)}/\mu} \in \left[\sqrt{\gamma/2}, \sqrt{2L/\mu}\right]$, and $\tilde\epsilon = 0$. Thus, this reduces to $O\left(\log_+\left(\sqrt\kappa\cdot\frac{L^3}{\gamma^3\mu^3}\right)\right) = O\left(\log\left(\frac{\kappa}{\gamma}\right)\right)$. So, the total number of required function and gradient evaluations is $O\left(\frac{\sqrt\kappa}{\gamma}\log\left(\frac{\kappa}{\gamma}\right)\log_+\left(\frac{\epsilon^{(0)}}{\gamma\epsilon}\right)\right)$ as claimed.
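As a sketch of how Algorithm 3 instantiates the framework and the line search, the helper below (ours, assuming the backtracking estimate L_k has already been computed) returns the per-iteration parameters; the comments record the identities $b = (1 - \beta)/(2\eta)$ and $c = (L^{(k)}\eta - \gamma)/\beta$ used in the proof of Lemma 5.

```python
import math

def strongly_qc_parameters(L_k, mu, gamma):
    """Per-iteration parameter choices of Algorithm 3 (sketch).

    Given the local smoothness estimate L_k and the strong quasar-convexity
    parameters (gamma, mu), returns (beta, eta, b, c); eps_tilde is fixed to 0.
    """
    beta = 1.0 - gamma * math.sqrt(mu / L_k)
    eta = 1.0 / math.sqrt(mu * L_k)
    b = gamma * mu / 2.0           # equals (1 - beta) / (2 * eta)
    c = math.sqrt(L_k / mu)        # equals (L_k * eta - gamma) / beta when beta > 0
    return beta, eta, b, c
```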
Note that Lemma 5 shows that $x^{(k)}$ will be $\epsilon$-optimal if $k = \left\lceil\frac{\sqrt{2\kappa}}{\gamma}\log_+\left(\frac{3\epsilon^{(0)}}{\gamma\epsilon}\right)\right\rceil$, while the above argument shows that $O\left(\frac{\sqrt\kappa}{\gamma}\log\left(\frac{\kappa}{\gamma}\right)\log_+\left(\frac{\epsilon^{(0)}}{\gamma\epsilon}\right)\right)$ function and gradient evaluations are required to compute such an $x^{(k)}$. Thus, Algorithm 3 produces an $\epsilon$-optimal point using at most this many evaluations; however, of course, the algorithm need not return instantly and may still continue to run if the specified number of iterations $K$ is larger. (Future iterates will also be $\epsilon$-optimal.)

Standard AGD on $L$-smooth $\mu$-strongly convex functions requires $O\left(\kappa^{1/2}\log_+\left(\frac{f(x^{(0)}) - f(x^*)}{\epsilon}\right)\right)$ function and gradient evaluations to find an $\epsilon$-optimal point [45]. Thus, as the class of $L$-smooth $(1, \mu)$-strongly quasar-convex functions contains the class of $L$-smooth $\mu$-strongly convex functions, our algorithm requires only an $O(\log(\kappa))$ factor extra function and gradient evaluations in the smooth strongly convex case, while also being able to efficiently minimize a much broader class of functions than standard AGD.

3.2 Non-Strongly Quasar-Convex Minimization

Now, we provide and analyze our algorithm (Algorithm 4) for non-strongly quasar-convex function minimization, i.e. when $\mu = 0$. Once again, this algorithm is an instance of Algorithm 1, the general AGD framework, with a different choice of parameters. We assume $L > 0$, since otherwise quasar-convexity implies the function is constant.

Algorithm 4: Accelerated Non-Strongly Quasar-Convex Function Minimization
Input: $L$-smooth $f : \mathbb{R}^n \to \mathbb{R}$ that is $\gamma$-quasar-convex, initial point $x^{(0)} \in \mathbb{R}^n$, number of iterations $K$, solution tolerance $\epsilon > 0$.
Define $\omega^{(-1)} = 1$, and $\omega^{(k)} = \frac{\omega^{(k-1)}}{2}\left(\sqrt{(\omega^{(k-1)})^2 + 4} - \omega^{(k-1)}\right)$ for $k \ge 0$.
1: Set $L^{(-1)} = \mathrm{BacktrackingSearch}(f,\,\cdot\,, x^{(0)}, \mathrm{run\_halving{=}True})$
2: return the output of Algorithm 1 on $f$ with initial point $x^{(0)}$, where for all $k$, $\beta^{(k)} = 1$, $L^{(k)} = \mathrm{BacktrackingSearch}(f, \max_{k' \in [-1, k-1]} L^{(k')}, x^{(k)})$, $\eta^{(k)} = \frac{\gamma}{L^{(k)}\omega^{(k)}}$, and $\alpha^{(k)} = \mathrm{BinaryLineSearch}\left(f, x^{(k)}, v^{(k)}, b = 0, c = \gamma\left(\frac{1}{\omega^{(k)}} - 1\right), \tilde\epsilon = \frac{\gamma\epsilon}{2}\right)$.

Lemma 6 (Non-Strongly Quasar-Convex AGD Convergence). If $f$ is $L$-smooth and $\gamma$-quasar-convex with respect to a minimizer $x^*$, with $\gamma \in (0, 1]$, then in each iteration $k \ge 0$ of Algorithm 4,
$$\epsilon^{(k)} \le \frac{8}{(k + 2)^2}\left(\epsilon^{(0)} + \frac{L}{2\gamma^2} r^{(0)}\right) + \frac{\epsilon}{2}, \tag{10}$$
where $\epsilon^{(k)} \triangleq f(x^{(k)}) - f(x^*)$ and $r^{(k)} \triangleq \|v^{(k)} - x^*\|^2$. Therefore, if $R \ge \|x^{(0)} - x^*\|$ and the number of iterations $K \ge \left\lceil 8\gamma^{-1} L^{1/2} R\,\epsilon^{-1/2}\right\rceil$, then the output $x^{(K)}$ satisfies $f(x^{(K)}) \le f(x^*) + \epsilon$.

Combining the bound on the number of iterations from Lemma 6, and the bound from Lemma 4 on the number of function and gradient evaluations during the line search, leads to the bound in Theorem 2 on the total number of function and gradient evaluations required to find an $\epsilon$-optimal point. The proofs of Lemma 6 and Theorem 2 are given in Appendix C.3.

Theorem 2. If $f$ is $L$-smooth and $\gamma$-quasar-convex with respect to a minimizer $x^*$, with $\gamma \in (0, 1]$ and $\|x^{(0)} - x^*\| \le R$, then Algorithm 4 produces an $\epsilon$-optimal point after $O\left(\gamma^{-1} L^{1/2} R\,\epsilon^{-1/2}\log_+\left(\gamma^{-1} L^{1/2} R\,\epsilon^{-1/2}\right)\right)$ function and gradient evaluations.
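Analogously, here is a minimal sketch (ours, assuming the backtracking estimate L_k is available) of Algorithm 4's parameter schedule; the $\omega$ recursion and the BinaryLineSearch constants are taken directly from the pseudocode above.

```python
import math

def omega_sequence(K):
    """omega^(-1) = 1 and omega^(k) = (omega^(k-1)/2)*(sqrt(omega^(k-1)^2 + 4) - omega^(k-1));
    returns [omega^(0), ..., omega^(K-1)]."""
    omegas, w = [], 1.0
    for _ in range(K):
        w = 0.5 * w * (math.sqrt(w * w + 4.0) - w)
        omegas.append(w)
    return omegas

def non_strongly_qc_parameters(L_k, gamma, eps, omega_k):
    """Per-iteration parameters of Algorithm 4 (sketch): beta = 1, eta = gamma/(L_k*omega_k),
    and the line-search constants b = 0, c = gamma*(1/omega_k - 1), eps_tilde = gamma*eps/2."""
    beta = 1.0
    eta = gamma / (L_k * omega_k)
    b = 0.0
    c = gamma * (1.0 / omega_k - 1.0)
    eps_tilde = gamma * eps / 2.0
    return beta, eta, b, c, eps_tilde
```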
Note that standard AGD on the class of $L$-smooth convex functions requires $O\left(L^{1/2} R\,\epsilon^{-1/2}\right)$ function and gradient evaluations to find an $\epsilon$-optimal point; so, again, our algorithm requires only a logarithmic factor more evaluations than does standard AGD.

4 Lower bounds

In this section, we construct lower bounds which demonstrate that the algorithms we presented in Section 3 obtain, up to logarithmic factors, the best possible worst-case iteration bounds for deterministic first-order minimization of quasar-convex functions. To do so, we extend the ideas from [13], a seminal paper which mechanized the process of constructing lower bounds. The key idea is to construct a zero-chain, which is defined as a function $f$ for which, if $x_j = 0$ for all $j \ge t$, then $\frac{\partial f(x)}{\partial x_{t+1}} = 0$. On these zero-chains, one can provide lower bounds for a particular class of methods known as first-order zero-respecting (FOZR) algorithms, which are algorithms that only query the gradient at points $x^{(t)}$ with $x_i^{(t)} \ne 0$ if there exists some $j < t$ with $\nabla_i f(x^{(j)}) \ne 0$. Examples of FOZR algorithms include gradient descent, accelerated gradient descent, and nonlinear conjugate gradient [21]. It is relatively easy to form lower bounds for FOZR algorithms applied to zero-chains, because one can prove that if the initial point is $x^{(0)} = 0$, then $x^{(T)}$ has at most $T$ nonzeros [13, Observation 1].

The particular zero-chain we use to derive our lower bounds is
$$\bar f_{T,\sigma}(x) \triangleq q(x) + \sigma\sum_{i=1}^T \Upsilon(x_i), \quad \text{where} \quad \Upsilon(\theta) \triangleq 120\int_1^\theta \frac{t^2(t - 1)}{1 + t^2}\,dt, \qquad q(x) \triangleq \frac{1}{4}(x_1 - 1)^2 + \frac{1}{4}\sum_{i=1}^{T-1}(x_i - x_{i+1})^2.$$
This function $\bar f_{T,\sigma}$ is similar to the function $\bar f_{T,\mu,r}$ of [14]. However, the lower bound proof is different because the primary challenge is to show $\bar f_{T,\sigma}$ is quasar-convex, rather than showing that $\|\nabla\bar f_{T,\sigma}(x)\| \ge \epsilon$ for all $x$ with $x_T = 0$. Our main lemma shows that this function is in fact $\frac{1}{100 T\sqrt\sigma}$-quasar-convex.

Lemma 7. Let $\sigma \in (0, 10^{-4}]$, $T \in [\sigma^{-1/2}, \infty) \cap \mathbb{Z}$. The function $\bar f_{T,\sigma}$ is $\frac{1}{100 T\sqrt\sigma}$-quasar-convex and $3$-smooth, with unique minimizer $x^* = \mathbf{1}$. Furthermore, if $x_t = 0$ for all $t = \lceil T/2\rceil, \ldots, T$, then $\bar f_{T,\sigma}(x) - \bar f_{T,\sigma}(\mathbf{1}) \ge 2T\sigma$.

The proof of Lemma 7 appears in Appendix E.1. The argument rests on showing that the quasar-convexity inequality
$$\frac{1}{100 T\sqrt\sigma}\left(\bar f_{T,\sigma}(x) - \bar f_{T,\sigma}(\mathbf{1})\right) \le \nabla\bar f_{T,\sigma}(x)^\top(x - \mathbf{1})$$
holds for all $x \in \mathbb{R}^T$. The nontrivial situation is when there exists some $j_1 < j_2$ such that $x_{j_1} \ge 0.9$, $x_{j_2} \le 0.1$, and $0.1 \le x_i \le 0.9$ for $i \in \{j_1 + 1, \ldots, j_2 - 1\}$. In this situation, we use ideas closely related to the transition region arguments made in Lemma 3 of [14]. The intuition is as follows. If the gaps $x_{i+1} - x_i$ are large, then the convex function $q(x)$ dominates the function value and gradient of $\bar f_{T,\sigma}(x)$, allowing us to establish quasar-convexity. Conversely, if the $x_{i+1} - x_i$'s are small, then a large portion of the $x_i$'s must lie in the quasar-convex region of $\Upsilon$, and the corresponding $\Upsilon'(x_i)(x_i - 1)$ terms make $\nabla\bar f_{T,\sigma}(x)^\top(x - \mathbf{1})$ sufficiently positive.
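Before stating the scaled hard instance (Lemma 8 below), here is a sketch implementation of $\bar f_{T,\sigma}$ and its gradient (ours, for numerical experimentation as in Appendix B): $\Upsilon$ is evaluated by numerical quadrature with SciPy, and the gradient uses the closed form $\Upsilon'(\theta) = 120\,\theta^2(\theta - 1)/(1 + \theta^2)$.

```python
import numpy as np
from scipy.integrate import quad

def upsilon(theta):
    """Upsilon(theta) = 120 * integral_1^theta t^2 (t - 1) / (1 + t^2) dt."""
    val, _ = quad(lambda t: t * t * (t - 1.0) / (1.0 + t * t), 1.0, theta)
    return 120.0 * val

def f_bar(x, sigma):
    """The zero-chain hard instance f_bar_{T,sigma}(x) = q(x) + sigma * sum_i Upsilon(x_i)."""
    x = np.asarray(x, dtype=float)
    q = 0.25 * (x[0] - 1.0) ** 2 + 0.25 * np.sum((x[:-1] - x[1:]) ** 2)
    return q + sigma * sum(upsilon(xi) for xi in x)

def grad_f_bar(x, sigma):
    """Gradient of f_bar, using Upsilon'(theta) = 120 * theta^2 (theta - 1) / (1 + theta^2)."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    g[0] += 0.5 * (x[0] - 1.0)
    diffs = x[:-1] - x[1:]
    g[:-1] += 0.5 * diffs
    g[1:] -= 0.5 * diffs
    g += sigma * 120.0 * x ** 2 * (x - 1.0) / (1.0 + x ** 2)
    return g
```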
Lemma 8. Let $\epsilon \in (0, \infty)$, $\gamma \in (0, 10^{-2}]$, $T = \left\lceil 10^{-3}\gamma^{-1} L^{1/2} R\,\epsilon^{-1/2}\right\rceil$, and $\sigma = \frac{1}{10^4 T^2\gamma^2}$, and assume $L^{1/2} R\,\epsilon^{-1/2} \ge 10^3$. Consider the function
$$\hat f(x) \triangleq \frac{LR^2}{3T}\cdot\bar f_{T,\sigma}\left(x\,T^{1/2} R^{-1}\right). \tag{11}$$
This function is $L$-smooth and $\gamma$-quasar-convex, and its minimizer $x^*$ is unique and has $\|x^*\| = R$. Furthermore, if $x_t = 0$ for all $t \in \mathbb{Z}\cap[T/2, T]$, then $\hat f(x) - \inf_z \hat f(z) > \epsilon$.

The proof of Lemma 8 appears in Appendix E.1. Combining Lemma 8 with Observation 1 from [13] yields a lower bound for first-order zero-respecting algorithms, and an extension of this lower bound to the class of all deterministic first-order methods. This leads to Theorem 3, whose proof appears in Appendix E.2.

Theorem 3. Let $\epsilon, R, L \in (0, \infty)$, $\gamma \in (0, 1]$, and assume $L^{1/2} R\,\epsilon^{-1/2} \ge 1$. Let $\mathcal{F}$ denote the set of $L$-smooth functions that are $\gamma$-quasar-convex with respect to some point with Euclidean norm less than or equal to $R$. Then, given any deterministic first-order method, there exists a function $f \in \mathcal{F}$ such that the method requires at least $\Omega(\gamma^{-1} L^{1/2} R\,\epsilon^{-1/2})$ gradient evaluations to find an $\epsilon$-optimal point of $f$.

Theorem 3 demonstrates that the upper bound for our algorithm for quasar-convex minimization is tight within logarithmic factors. We note that by reduction (Remark 4), one can prove a lower bound of $\Omega(\gamma^{-1}\kappa^{1/2})$ for strongly quasar-convex functions; thus, our algorithm for strongly quasar-convex minimization is also optimal within logarithmic factors.

Although the construction of our lower bounds is similar to that of [14], there are important differences between our lower bounds and theirs. First, the assumptions differ significantly; we assume quasar-convexity and Lipschitz continuity of the first derivative, while [14] assumes Lipschitz continuity of the first three derivatives. Next, the bounds in [13, 14] apply to finding $\epsilon$-stationary points, rather than $\epsilon$-optimal points. In addition, our lower and upper bounds only differ by logarithmic factors, whereas there is a gap of $\tilde O(\epsilon^{-1/15})$ between the lower bound of $\Omega(\epsilon^{-8/5})$ given by [14] and the best known corresponding upper bound of $O(\epsilon^{-5/3}\log(\epsilon^{-1}))$ [11]. Finally, we require $x_t = 0$ for all $t > T/2$ to guarantee $\hat f(x) - \inf_z\hat f(z) > \epsilon$, whereas [13, 14] only need $x_T = 0$ to guarantee $\|\nabla\hat f(x)\| > \epsilon$.

5 Conclusion

In this work, we introduce a generalization of star-convexity called quasar-convexity and provide insight into the structure of quasar-convex functions. We show how to obtain a near-optimal accelerated rate for the minimization of any smooth function in this broad class, using a simple but novel binary search technique. In addition, we provide nearly matching theoretical lower bounds for the performance of any first-order method on this function class. Interesting topics for future research are to further understand the prevalence of quasar-convexity in problems of practical interest, and to develop new accelerated methods for other structured classes of nonconvex problems.

Acknowledgements

The work of Aaron Sidford was supported by NSF CAREER Award CCF-1844855.

References

[1] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima faster than gradient descent. In Symposium on Theory of Computing (STOC), pages 1195–1199. ACM, 2017.
[2] Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. Journal of Machine Learning Research, 18(1):8194–8244, 2017.
[3] Zeyuan Allen-Zhu and Lorenzo Orecchia.
Linear coupling: An ultimate unification of gradient and mirror descent. In Innovations in Theoretical Computer Science (ITCS), pages 1–22, 2017.
[4] Mihai Anitescu. Degenerate nonlinear programming with a quadratic growth condition. SIAM Journal on Optimization, 10(4):1116–1135, 2000.
[5] Kenneth Arrow and Alain Enthoven. Quasi-concave programming. Econometrica, 16(5):779–800, 1961.
[6] Peter Bartlett, David Helmbold, and Philip Long. Gradient descent with identity initialization efficiently learns positive-definite linear transformations by deep residual networks. Neural Computation, 31(3):477–502, 2019.
[7] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[8] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[9] Sébastien Bubeck, Yin Tat Lee, and Mohit Singh. A geometric alternative to Nesterov's accelerated gradient descent. arXiv preprint arXiv:1506.08187, 2015.
[10] Sébastien Bubeck, Qijia Jiang, Yin Tat Lee, Yuanzhi Li, and Aaron Sidford. Near-optimal method for highly smooth convex optimization. In Conference on Learning Theory (COLT), pages 492–507, 2019.
[11] Yair Carmon, John Duchi, Oliver Hinder, and Aaron Sidford. Convex until proven guilty: Dimension-free acceleration of gradient descent on non-convex functions. In International Conference on Machine Learning (ICML), pages 654–663, 2017.
[12] Yair Carmon, John Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for nonconvex optimization. SIAM Journal on Optimization, 28(2):1751–1772, 2018.
[13] Yair Carmon, John Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points I. Mathematical Programming, pages 1–50, 2019.
[14] Yair Carmon, John Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points II: First-order methods. Mathematical Programming, 2019.
[15] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[16] Bruce Craven and Barney Glover. Invex functions and duality. Journal of the Australian Mathematical Society, 39(1):1–20, 1985.
[17] Cong Dang and Guanghui Lan. On the convergence properties of non-Euclidean extragradient methods for variational inequalities with generalized monotone operators. Computational Optimization and Applications, 60:277–310, 2015.
[18] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. http://archive.ics.uci.edu/ml.
[19] Marian Fabian, René Henrion, Alexander Kruger, and Jiří Outrata. Error bounds: Necessary and sufficient conditions. Set-Valued and Variational Analysis, 18(2):121–149, 2010.
[20] Olivier Fercoq and Peter Richtárik. Accelerated, parallel, and proximal coordinate descent. SIAM Journal on Optimization, 25(4):1997–2023, 2015.
[21] Roger Fletcher and Colin Reeves. Function minimization by conjugate gradients. The Computer Journal, 7(2):149–154, 1964.
[22] Roy Frostig, Rong Ge, Sham Kakade, and Aaron Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In International Conference on Machine Learning (ICML), pages 2540–2548, 2015.
[23] Alexander Gasnikov, Pavel Dvurechensky, Eduard Gorbunov, Evgeniya Vorontsova, Daniil Selikhanovych, and César Uribe. The global rate of convergence for optimal tensor methods in smooth convex optimization. Computer Research and Modeling, 10(6):737–753, 2018.
[24] Rong Ge, Jason Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems (NeurIPS), pages 2973–2981, 2016.
[25] Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59–99, 2016.
[26] Sergey Guminov and Alexander Gasnikov. Accelerated methods for α-weakly-quasi-convex problems. arXiv preprint arXiv:1710.00797, 2017.
[27] Filip Hanzely and Peter Richtárik. Accelerated coordinate descent with arbitrary sampling and best rates for minibatches. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 304–312, 2019.
[28] Moritz Hardt and Tengyu Ma. Identity matters in deep learning. In International Conference on Learning Representations (ICLR), 2017.
[29] Moritz Hardt, Tengyu Ma, and Benjamin Recht. Gradient descent learns linear dynamical systems. Journal of Machine Learning Research, 19(29):1–44, 2018.
[30] Bo Jiang, Haoyue Wang, and Shuzhong Zhang. An optimal high-order tensor method for convex optimization. In Conference on Learning Theory (COLT), pages 1799–1801, 2019.
[31] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems (NeurIPS), pages 315–323, 2013.
[32] Pooria Joulani, András György, and Csaba Szepesvári. A modular analysis of adaptive (non-)convex optimization: Optimism, composite objectives, and variational bounds. In International Conference on Algorithmic Learning Theory (ALT), 2017.
[33] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML-PKDD), pages 795–811. Springer, 2016.
[34] Robert Kleinberg, Yuanzhi Li, and Yang Yuan. An alternative view: When does SGD escape local minima? In International Conference on Machine Learning (ICML), pages 2698–2707, 2018.
[35] Jasper Lee and Paul Valiant. Optimizing star-convex functions. In Symposium on Foundations of Computer Science (FOCS), pages 603–614. IEEE, 2016.
[36] Huan Li and Zhouchen Lin. Accelerated proximal gradient methods for nonconvex programming. In Advances in Neural Information Processing Systems (NeurIPS), pages 379–387, 2015.
[37] Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with ReLU activation. In Advances in Neural Information Processing Systems (NeurIPS), pages 597–607, 2017.
[38] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems (NeurIPS), pages 3384–3392, 2015.
[39] Olvi Mangasarian. Pseudo-convex functions. Journal of the Society for Industrial and Applied Mathematics Series A Control, 3(2):281–290, 1965.
[40] James Munkres. Topology. Pearson, 1975.
[41] Ion Necoara, Yurii Nesterov, and François Glineur.
Linear convergence of first order methods for non-strongly convex optimization. Mathematical Programming, 175(1):69–107, 2019.
[42] Arkadi Nemirovski. Orth-method for smooth convex optimization. Izvestia AN SSSR, Ser. Tekhnicheskaya Kibernetika, 2, 1982.
[43] Arkadi Nemirovski and David Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.
[44] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376, 1983.
[45] Yurii Nesterov. Introductory Lectures on Convex Optimization. Kluwer Academic Publishers, 2004.
[46] Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
[47] Yurii Nesterov and Boris Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
[48] Yurii Nesterov, Alexander Gasnikov, Sergey Guminov, and Pavel Dvurechensky. Primal-dual accelerated gradient descent with line search for convex and nonconvex optimization problems. Proceedings of the Russian Academy of Sciences (RAS), 485(1):15–18, 2019.
[49] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In Advances in Neural Information Processing Systems (NeurIPS) - Autodiff Workshop, 2017.
[50] Boris Polyak. Gradient methods for minimizing functionals. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 3(4):643–653, 1963.
[51] Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In International Conference on Machine Learning (ICML), pages 64–72, 2014.
[52] Weijie Su, Stephen Boyd, and Emmanuel Candès. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems (NeurIPS), pages 2510–2518, 2014.
[53] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning (ICML), pages 1139–1147, 2013.
[54] Alexander Tyurin. Mirror version of similar triangles method for constrained optimization problems. arXiv preprint arXiv:1705.09809, 2017.
[55] Huynh Van Ngai and Jean-Paul Penot. Approximately convex functions and approximately monotonic operators. Nonlinear Analysis: Theory, Methods & Applications, 66(3):547–564, 2007.
[56] Jean-Philippe Vial. Strong and weak convexity of sets and functions. Mathematics of Operations Research, 8(2):231–259, 1983.
[57] Blake Woodworth and Nati Srebro. Tight complexity bounds for optimizing composite objectives. In Advances in Neural Information Processing Systems (NeurIPS), pages 3639–3647, 2016.
[58] Peng Xu, Bryan He, Christopher De Sa, Ioannis Mitliagkas, and Christopher Ré. Accelerated stochastic power iteration. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 58–67, 2018.
[59] Hui Zhang and Wotao Yin. Gradient methods for convex minimization: better rates under weaker conditions. arXiv preprint arXiv:1303.4645, 2013.
[60] Jingzhao Zhang, Suvrit Sra, and Ali Jadbabaie.
Acceleration in first order quasi-strongly convex optimization by ODE discretization. In IEEE Conference on Decision and Control (CDC), 2019.
[61] Yi Zhou, Junjie Yang, Huishuai Zhang, Yingbin Liang, and Vahid Tarokh. SGD converges to global minimum in deep learning via star-convex path. In International Conference on Learning Representations (ICLR), 2019.
[62] Zhengyuan Zhou, Panayotis Mertikopoulos, Nicholas Bambos, Stephen Boyd, and Peter Glynn. Stochastic mirror descent in variationally coherent optimization problems. In Advances in Neural Information Processing Systems (NeurIPS), pages 7040–7049, 2017.

A Extended Related Work

A.1 Related Function Classes to Quasar-Convexity

In this section, we provide a brief taxonomy of related conditions (relaxations of convexity or strong convexity), and describe how they relate to quasar-convexity. For simplicity, here we assume $f$ is $L$-smooth with domain $\mathcal{X} = \mathbb{R}^n$. We denote the minimum of $f$ by $f^*$ and the set of minimizers of $f$ by $\mathcal{X}^*$; when $\mathcal{X}^*$ consists of a single point, we denote the point by $x^*$.

First, we review the definitions of quasar-convexity, star-convexity, and convexity. Recall that (strong) quasar-convexity is a generalization of (strong) star-convexity, which itself generalizes (strong) convexity.

• (Strong) quasar-convexity (with parameters $\gamma \in (0,1]$, $\mu \ge 0$): for some $x^* \in \mathcal{X}^*$, $f(x^*) \ge f(x) + \frac{1}{\gamma}\nabla f(x)^\top(x^* - x) + \frac{\mu}{2}\|x^* - x\|^2$ for all $x \in \mathcal{X}$.
  – When $\mu = 0$, this is merely referred to as quasar-convexity, which is also known as weak quasi-convexity [29].
  – When $\mu > 0$, $f$ has exactly one minimizer $x^*$.
• (Strong) star-convexity (with parameter $\mu \ge 0$): for some $x^* \in \mathcal{X}^*$, $f(x^*) \ge f(x) + \nabla f(x)^\top(x^* - x) + \frac{\mu}{2}\|x^* - x\|^2$ for all $x \in \mathcal{X}$.
  – When $\mu = 0$, this is merely referred to as star-convexity.
  – When $\mu > 0$, this is also known as quasi-strong convexity [41].
  – When $\mu = 0$, $f$ may not have a unique minimizer; some authors require the condition to hold for all $x^* \in \mathcal{X}^*$ [47], while others only require it for some $x^* \in \mathcal{X}^*$ [35]; we use the latter definition.
  – When $\mu > 0$, $f$ has exactly one minimizer $x^*$.
• (Strong) convexity (with parameter $\mu \ge 0$): $f(y) \ge f(x) + \nabla f(x)^\top(y - x) + \frac{\mu}{2}\|y - x\|^2$ for all $x, y \in \mathcal{X}$.
  – When $\mu = 0$, this is merely referred to as convexity.

Next, we enumerate some other generalizations of strong convexity from the literature, and state whether they generalize quasar-convexity, are generalized by quasar-convexity, or neither.

• Weak convexity [56] (with parameter $\mu > 0$): $f(y) \ge f(x) + \nabla f(x)^\top(y - x) - \frac{\mu}{2}\|y - x\|^2$ for all $x, y \in \mathcal{X}$.
  – Neither implies nor is implied by quasar-convexity.
• Quadratic growth condition (with parameter $\mu > 0$) [4]: $f(x) \ge f(x^*) + \frac{\mu}{2}\|x^* - x\|^2$ for all $x \in \mathcal{X}$.
  – Neither implies nor is implied by quasar-convexity.
• Restricted secant condition (with parameter $\mu > 0$) [59]: $0 \ge \nabla f(x)^\top(x^* - x) + \frac{\mu}{2}\|x^* - x\|^2$ for all $x \in \mathcal{X}$.
  – Implied by $(\gamma, \frac{\mu}{\gamma})$-strong quasar-convexity (for any choice of $\gamma \in (0,1]$).
• One-point strong convexity (with parameter $\mu > 0$) [37]: for some $y \in \mathcal{X}$, $0 \ge \nabla f(x)^\top(y - x) + \frac{\mu}{2}\|y - x\|^2$ for all $x \in \mathcal{X}$.
  – This is a generalization of the restricted secant property (which is one-point strong convexity in the special case $y = x^*$), and is therefore likewise implied by strong quasar-convexity.
• Variational coherence [62]: 0 ≥ ∇f(x)^T (x* − x) for all x ∈ X and x* ∈ X*, with equality iff x ∈ X*.
  – Implied by strong quasar-convexity (for any μ > 0 and γ ∈ (0, 1]). The closely related weaker condition "for some x* ∈ X*, 0 ≥ ∇f(x)^T (x* − x) for all x ∈ X, with equality iff x ∈ X*" is implied by quasar-convexity (for any μ ≥ 0, γ ∈ (0, 1]). In fact, the set of functions satisfying this weaker condition is the limiting set of the class of γ-quasar-convex functions as γ → 0; it is the set of differentiable functions with star-convex sublevel sets.

• Polyak-Łojasiewicz condition [50] (with parameter μ > 0): (1/2) ‖∇f(x)‖² ≥ μ (f(x) − f*) for all x ∈ X.
  – This is implied by the restricted secant property [33], and therefore by strong quasar-convexity.

• Quasiconvexity [5]: f(λx + (1 − λ)y) ≤ max{f(x), f(y)} for all x, y ∈ X and λ ∈ [0, 1].
  – Neither implies nor is implied by quasar-convexity. (However, the set of differentiable quasiconvex functions is contained in the limiting set of the class of γ-quasar-convex functions as γ → 0.)

• Pseudoconvexity [39]: f(y) ≥ f(x) for all x, y ∈ X such that ∇f(x)^T (y − x) ≥ 0.
  – Neither implies nor is implied by quasar-convexity.

• Invexity [16]: x ∈ X* for all x ∈ X such that ∇f(x) = 0.
  – Implied by quasar-convexity (for any μ ≥ 0, γ ∈ (0, 1]).

A.2 Comparison to Nesterov et al. (2018)

As discussed in Section 1.1.1, both our method (Algorithm 4) and Algorithm 2 of [48] minimize γ-quasar-convex functions at accelerated rates. Compared to the method in [48], our algorithm attains a better runtime bound by a factor of γ^{1/2} and does not require the optimal function value to be known. Meanwhile, the method of [48] can handle functions that are L-smooth (where L need not be known) with respect to more general norms (not necessarily Euclidean). The algorithms themselves are quite similar (in the case of γ-quasar-convex functions that are L-smooth with respect to the Euclidean norm), as both are generalizations of standard AGD. However, unlike [48], our algorithm does not rely on any restarts. Furthermore, while both algorithms conduct a one-dimensional minimization between the current iterates x^(k) and v^(k) in each iteration, we use the insight that this minimization can be carried out efficiently via a carefully implemented binary search.

B Numerical Experiments

We first consider optimizing a "hard function", an example of the type of function used to construct the lower bound in Theorem 2. This function class is parameterized by σ and the dimension T; we denote these functions by f̄_{T,σ} (see Section 4 for the definition). We compare our method to other commonly used first-order methods: gradient descent (GD), [standard] accelerated gradient descent (AGD), nonlinear conjugate gradients (CG), and the limited-memory BFGS (L-BFGS) algorithm. (Out of all these algorithms, only our method and GD offer theoretical guarantees for quasar-convex function minimization.)

We next evaluate our algorithm on real-world tasks: we use our algorithm to train a support vector machine (SVM) on the nine LIBSVM UCI binary classification datasets [15] (which are derived from the UCI "Adult" datasets [18]).
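Several of the objectives in these experiments are claimed to be γ-quasar-convex rather than convex, and when setting up such comparisons it can be useful to sanity-check the quasar-convexity inequality numerically. The following is a minimal Python sketch of such a check; it is our own illustration (not part of the paper's code), and the helper name and grid-based test are hypothetical. It estimates the largest γ for which f(x*) ≥ f(x) + (1/γ) ∇f(x)^T (x* − x) appears to hold at sampled points.

```python
import numpy as np

def estimate_gamma(f, grad_f, x_star, sample_points, tol=1e-12):
    """Numerically estimate the largest gamma in (0, 1] such that
    f(x*) >= f(x) + (1/gamma) * grad_f(x)^T (x* - x) holds at every sampled x.
    This is only a finite-grid sanity check, not a proof of quasar-convexity."""
    gamma = 1.0
    f_star = f(x_star)
    for x in sample_points:
        gap = f(x) - f_star                      # >= 0 when x_star is a minimizer
        slope = float(grad_f(x) @ (x - x_star))  # should be >= gamma * gap
        if gap <= tol:
            continue                             # (near-)minimizers impose no constraint
        if slope <= 0:
            return 0.0                           # the inequality fails for every gamma > 0
        gamma = min(gamma, slope / gap)
    return gamma

# Example: f(x) = ||x||^2 is convex, hence 1-quasar-convex about x* = 0.
f = lambda x: float(x @ x)
grad_f = lambda x: 2.0 * x
points = [np.random.randn(5) for _ in range(1000)]
print(estimate_gamma(f, grad_f, np.zeros(5), points))   # prints 1.0
```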
The SVM loss function we use is a smoothed version of the hinge loss:
f(x) = Σ_{i=1}^n φ_α(1 − b_i a_i^T x),
where a_i ∈ R^d and b_i = ±1 are given by the training data (the a_i are the covariates and the b_i are the labels), and
φ_α(t) = 0 for t ≤ 0,   φ_α(t) = t²/2 for t ∈ [0, 1],   φ_α(t) = (t^α − 1)/α + 1/2 for t ≥ 1.
When α = 1, φ_α is convex, and thus f is convex as well. For all α ∈ (0, 1], φ_α is smooth and α-quasar-convex. Line searches for this objective are inexpensive, as the quantities b_i a_i^T x need only be calculated once per outer loop iteration. Results are given in Table 1.

Finally, we evaluate on the problem of learning linear dynamical systems, which was shown to be quasar-convex (under certain assumptions) by [29]. In this problem, we are given observations {(x_t, y_t)}_{t=1}^T generated by the time-invariant linear system
h_{t+1} = A h_t + B x_t,   y_t = C h_t + D x_t,
where x_t, y_t ∈ R, h_t ∈ R^n is the hidden state at time t, and Θ = (A, B, C, D) are the (unknown) parameters of the system. Informally, we seek to learn Θ̂ = (Â, B̂, Ĉ, D̂) to minimize (1/T) Σ_{t=1}^T (y_t − ŷ_t)², where ĥ_{t+1} = Â ĥ_t + B̂ x_t, ŷ_t = Ĉ ĥ_t + D̂ x_t, and ĥ_0 = 0. When parameterized in controllable canonical form, this problem was shown in [29] to be quasar-convex on a subset of the domain near the optimum. We describe this problem and our experimental approach in more detail in Appendix B.1. Representative plots are given in Figure 3. Despite the nonconvexity, AGD performs quite well on this problem. Nonetheless, we observe that our method is competitive with AGD in terms of iteration count; we use more function evaluations due to the line search, but gradient evaluations are about twice as expensive in this setting, and the line search can also be parallelized. The design of better heuristics to speed up our method is an interesting question for future empirical investigation.

In all experiments, we use adaptive step sizes for our method as well as for GD and AGD, since in practice L may not be known a priori. We do not use an initial guess for the line search.

Function ↓ / Algorithm →                  | Ours (Alg. 4)   | Gradient Descent (GD) | Standard AGD     | Nonlinear CG    | L-BFGS
f̄_{T,σ} (σ = 10⁻¹, T = 10²; ε = 10⁻⁴)     | 422; 1,451      | 336; 738              | 272; 869         | 312; 1,599      | 354; 1,778
f̄_{T,σ} (σ = 10⁻⁴, T = 10³; ε = 10⁻⁶)     | 12,057; 55,357  | 18,607; 40,684        | 3,891; 12,399    | 1,251; 3,647    | 1,093; 6,554
f̄_{T,σ} (σ = 10⁻⁶, T = 10³; ε = 10⁻⁸)     | 17,135; 167,447 | 275,572; 602,561      | 55,623; 177,247  | 10,007; 30,023  | 2,079; 12,476
LIBSVM UCI (α = 1; ε = 10⁻⁴)              | 0.92; +0.017%   | 4.65; +0.036%         | —                | 0.46; +0.001%   | 0.29; +0.010%
LIBSVM UCI (α = 0.5; ε = 10⁻⁴)            | 1.32; +0.016%   | 4.78; +0.033%         | —                | 0.48; +0.001%   | 0.30; +0.011%

Table 1: Experimental results. The stopping criterion is ‖∇f(x)‖_∞ ≤ ε. For f̄_{T,σ} we report (# iterations; # function+gradient evaluations); the initial point is x₀ = 0. For the LIBSVM UCI datasets, we report the ratio of the total number of iterations required relative to standard AGD, averaged over all 9 datasets and 3 different random initializations (shared across algorithms) per dataset, and the average final test classification accuracy difference relative to AGD.

Figure 3: Results on learning linear dynamical systems, for two different problem instances. We evaluate our method with γ ∈ {0.5, 1}, and compare to GD and AGD.
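For concreteness, here is a small NumPy sketch of the smoothed hinge objective defined at the start of this section and its gradient. This is our own illustrative code, not the paper's implementation; the function and variable names are ours, and the t ≥ 1 branch implements the piecewise definition as written above.

```python
import numpy as np

def phi(t, alpha):
    """Smoothed hinge: 0 for t <= 0, t^2/2 on [0, 1], (t^alpha - 1)/alpha + 1/2 for t >= 1."""
    t = np.asarray(t, dtype=float)
    out = np.where(t <= 0, 0.0, 0.5 * t ** 2)
    return np.where(t >= 1, (np.maximum(t, 1.0) ** alpha - 1.0) / alpha + 0.5, out)

def phi_prime(t, alpha):
    """Derivative: 0 for t <= 0, t on [0, 1], t^(alpha - 1) for t >= 1 (continuous at 0 and 1)."""
    t = np.asarray(t, dtype=float)
    out = np.where(t <= 0, 0.0, t)
    return np.where(t >= 1, np.maximum(t, 1.0) ** (alpha - 1.0), out)

def svm_loss_and_grad(x, A, b, alpha):
    """f(x) = sum_i phi(1 - b_i a_i^T x); A has rows a_i, b holds labels in {-1, +1}."""
    margins = 1.0 - b * (A @ x)          # computed once per call, as noted in the text
    loss = float(phi(margins, alpha).sum())
    grad = -(A.T @ (phi_prime(margins, alpha) * b))
    return loss, grad
```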
We run until the loss is below 10⁻⁴ or 1000 iterations have been reached. Our method uses roughly 4× as many total evaluations as AGD; for instance, in the first setting all methods run for 1000 iterations and use 2195, 3195, 13562, and 14626 total evaluations respectively (of which 1000 are gradient evaluations).

B.1 Additional Experimental Details

We implement our algorithm, as well as AGD and GD, in Julia and Python.⁹ We run our experiments on learning linear dynamical systems (LDS) using the PyTorch framework [49]. We generate the true parameters and the dynamical model inputs in the same way as [29], using the same parameters n = 20, T = 500. However, unlike that paper, we do not generate fresh sequences {(x_t, y_t)} at each iteration; instead, we generate 100 sequences at the beginning, which are used throughout (so it is no longer a stochastic optimization problem). As in [29], we actually minimize the loss
(1/|B|) Σ_{(x,y)∈B} [ (1/(T − T₁)) Σ_{t>T₁} (y_t − ŷ_t)² ],
where the outer summation is over the batch B of 100 sequences and the inner summation starts at time T₁ := T/4, to mitigate the fact that the initial hidden state is not known. In addition, we generate the initial point (Â₀, Ĉ₀, D̂₀) by perturbing the true dynamical system parameters (A, C, D) with random noise; we additionally ensure that the spectral radius of Â₀ remains less than 1.

⁹ Code for our implementation and experiments is available at https://github.com/nimz/quasar-convex-acceleration.

The quasar-convexity parameter γ derived in [29] for the LDS objective is defined as the supremum of the real part of a ratio of two degree-n univariate polynomials over the complex unit circle. It is therefore difficult to calculate in practice. We instead simply evaluate different values of γ in our experiments; we find that, while the choice of γ does affect performance somewhat, our method does not break down even if the "wrong" choice is used. [29] presented two better-performing alternatives to fixed-stepsize SGD: SGD with gradient clipping, and projected SGD. By contrast, since we use an adaptive step size, there is no need to clip gradients; in addition, we find projection to be unnecessary, as the initial iterate we generate already has ρ(Â₀) < 1 by construction. In the LDS experiments, we use forward differences to approximate the 1-D gradients in the line search, since full gradient evaluations require backpropagation and are thus more expensive than function evaluations in this setting; we do not find this to incur significant numerical error.

For the adaptive step sizes, we use a standard scheme in which the step size at iteration k > 0 [which we denote 1/L^(k)] is initialized to the previous step size 1/L^(k−1) times a fixed value ζ₁ ≥ 1, and is then multiplied by a fixed value ζ₂ ∈ (0, 1) until it is small enough that the function value decrease is sufficient,¹⁰ where ζ₁, ζ₂ are constant hyperparameters. (This slightly generalizes Algorithm 5, which simply sets ζ₂ = 1/2.) In all experiments for GD, AGD, and our method, we used ζ₁ = 1.1, ζ₂ = 0.6, and L^(0) = 1 (these values were only coarsely tuned; the algorithms are fairly insensitive to them when reasonable settings are used).

C Algorithm Analysis

Here, we provide omitted proofs and details for Sections 2–3.
C.1 Backtracking Step Size Search Analysis

In Algorithm 5 (analyzed in Lemma 9), we show how to efficiently compute an L^(k) such that
f(y^(k) − (1/L^(k)) ∇f(y^(k))) ≤ f(y^(k)) − (1/(2L^(k))) ‖∇f(y^(k))‖²
holds in Line 3 of Algorithm 1, even when the true Lipschitz constant L is unknown. This is done using standard backtracking line search; we provide the details of the algorithm and its analysis for completeness (Algorithm 5). [Note that run_halving means we halve L̂ repeatedly as long as the descent inequality is satisfied, corresponding to doubling the step size each time.]

¹⁰ Specifically, for GD, we decrease the step size 1/L^(k) until the criterion f(x^(k+1)) ≤ f(x^(k)) − (1/(2L^(k))) ‖∇f(x^(k))‖² is satisfied; for AGD and our method, the criterion is f(x^(k+1)) ≤ f(y^(k)) − (1/(2L^(k))) ‖∇f(y^(k))‖². These criteria are guaranteed to hold when L^(k) ≥ L.

Algorithm 5: BacktrackingSearch(f, ζ, x, run_halving = False)
Assumptions: f : R^n → R is L-smooth; x ∈ R^n; ζ > 0 and (ζ < 2L or run_halving = False)
1: L̂ ← ζ
2: if run_halving then
3:   while f(x − (1/L̂) ∇f(x)) ≤ f(x) − (1/(2L̂)) ‖∇f(x)‖² do
4:     L̂ ← L̂/2
     end
5:   L̂ ← 2L̂
   end
6: while f(x − (1/L̂) ∇f(x)) > f(x) − (1/(2L̂)) ‖∇f(x)‖² do
7:   L̂ ← 2L̂
   end
8: return L̂

Lemma 9. Let L be the minimum real number such that f : R^n → R is L-smooth. Then Algorithm 5 computes an "inverse step size" L̂ such that
f(x − (1/L̂) ∇f(x)) ≤ f(x) − (1/(2L̂)) ‖∇f(x)‖².
If run_halving is False, then L̂ ∈ [ζ, 2L) and Algorithm 5 uses at most ⌈log⁺₂(L/ζ)⌉ + 3 function and gradient evaluations. If run_halving is True, then L̂ ∈ (0, 2L) and Algorithm 5 uses at most ⌈log⁺₂ max{L/ζ, ζ/L}⌉ + 3 evaluations.

Proof. We use the elementary fact that if f is L-smooth, then for any x ∈ R^n, defining y := x − (1/L) ∇f(x), we have f(y) ≤ f(x) − (1/(2L)) ‖∇f(x)‖² (see, e.g., [45] for a proof). In Algorithm 5, we use ζ as the initial guess for L̂, and when run_halving is False we simply double L̂ until the desired condition holds. Note that since an L-smooth function is also L′-smooth for any L′ ≥ L, the desired condition holds whenever L̂ ≥ L; we use L to denote the minimum value of L′ such that f is L′-smooth. We need to double L̂ at most ⌈log⁺₂(L/ζ)⌉ times until it is greater than or equal to L, so the while-loop condition is checked at most ⌈log⁺₂(L/ζ)⌉ + 1 times. Since we stop increasing L̂ as soon as the desired condition holds, and it holds whenever L̂ ≥ L, the final value of L̂ is less than 2L. Each check of the while-loop condition requires computing f(x − (1/L̂) ∇f(x)) for the current value of L̂; we also need to compute f(x) and ∇f(x) at the beginning. When run_halving is True (the branch in Line 2), we also halve the initial guess L̂ until the condition no longer holds, then double this value to recover the last value of L̂ for which the condition held. Similarly, at most ⌈log⁺₂(ζ/L)⌉ iterations of this halving procedure are required. Finally, notice that if the while-loop condition in Line 3 ever evaluates to True, then the value of L̂ at the end of Line 5 satisfies f(x − (1/L̂) ∇f(x)) ≤ f(x) − (1/(2L̂)) ‖∇f(x)‖², which means that the while loop on Line 6 terminates immediately.
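For reference, a direct Python transcription of Algorithm 5 might look as follows. This is a sketch under the same assumptions as the lemma; evaluation counting and caching are left implicit, and the early return for a zero gradient is an added guard rather than part of the algorithm as stated.

```python
def backtracking_search(f, grad_f, x, zeta, run_halving=False):
    """Algorithm 5 sketch: return L_hat such that
    f(x - grad/L_hat) <= f(x) - ||grad||^2 / (2 * L_hat),
    assuming f is L-smooth, zeta > 0, and (zeta < 2L or run_halving is False)."""
    fx, g = f(x), grad_f(x)
    g_sq = float(g @ g)
    if g_sq == 0.0:
        return zeta  # added guard: x is stationary, so any L_hat satisfies the inequality

    def descent_ok(L_hat):
        # Sufficient-decrease test for step size 1/L_hat (the Line 3 / Line 6 condition).
        return f(x - g / L_hat) <= fx - g_sq / (2.0 * L_hat)

    L_hat = zeta
    if run_halving:
        while descent_ok(L_hat):   # halve L_hat (double the step size) while the test passes
            L_hat /= 2.0
        L_hat *= 2.0               # undo the last halving (Line 5)
    while not descent_ok(L_hat):   # double L_hat (halve the step size) until the test passes
        L_hat *= 2.0
    return L_hat
```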
Note that the constan t 2 used in Algorithm 5 is arbitrary; we can use any constan t larger than 1 to 25 m ultiplicatively increase ˆ L eac h time, which merely c hanges b oth the run time and the final upp er b ound on ˆ L b y a constant factor. The term “backtrac king” is used b ecause increasing ˆ L corresp onds to decreasing the “step size.” C.2 Analysis of Algorithm 2 W e first present a simple fact that is useful in our pro ofs of Lemmas 2 and 4. F act 1 Supp ose that a < b , g : R → R is differ entiable, and that g ( a ) ≥ g ( b ) . Then, ther e is a c ∈ ( a, b ] such that g ( c ) ≤ g ( b ) and either g 0 ( c ) = 0 , or c = b and g 0 ( c ) ≤ 0 . Pro of If g 0 ( b ) ≤ 0 , the claim is trivially true. If not, then g 0 ( b ) > 0 , so the minimum v alue of g on [ a, b ] is strictly less than g ( b ) (and therefore strictly less than g ( a ) as well). By contin uit y of g and the extreme v alue theorem, g m ust therefore attain its minimum on [ a, b ] at some point in c ∈ ( a, b ) . By differentiabilit y of g and the fact that c minimizes g , we then ha ve g 0 ( c ) = 0 . F act 2 Supp ose f is L -smo oth. Define g ( α ) , f ( αx + (1 − α ) v ) ; then, g is L k x − v k 2 -smo oth. Pro of By L -smo othness of f , k∇ f ( x ) − ∇ f ( y ) k ≤ L k x − y k for all x, y . So, k∇ f ( y ( α 1 )) − ∇ f ( y ( α 2 )) k = k∇ f ( α 1 x + (1 − α 1 ) v ) − ∇ f ( α 2 x + (1 − α 2 ) v ) k ≤ L k ( α 1 − α 2 ) x − ( α 1 − α 2 ) v k = L | α 1 − α 2 | k x − v k . By definition of g and the Cauc hy-Sc hw arz inequality , | g 0 ( α 1 ) − g 0 ( α 2 ) | = |∇ f ( y ( α 1 )) > ( x − v ) − ∇ f ( y ( α 2 )) > ( x − v ) | ≤ k∇ f ( y ( α 1 )) − ∇ f ( y ( α 2 )) k k x − v k , so | g 0 ( α 1 ) − g 0 ( α 2 ) | ≤ L k x − v k 2 | α 1 − α 2 | as desired. Using Lemma 2 and F act 2, we pro ve Lemma 4. Lemma 4 (Line Search Run time) F or L -smo oth f : R n → R , p oints x, v ∈ R n and sc alars b, c, ˜  ≥ 0 , Algorithm 2 c omputes α ∈ [0 , 1] satisfying (7) with at most 8 + 3 l log + 2  (4 + c ) min n 2 L 3 b 3 , L k x − v k 2 2˜  om function and gr adient evaluations. Pro of Define ˆ L , L k x − v k 2 ; by F act 2, g is ˆ L -smo oth. Note that if p + ˜  ≥ ˆ L and g 0 ( α ) = 0 , then b y ˆ L -smo othness of g , we hav e g 0 (1) ≤ ˜  + p . So, it m ust b e the case that p + ˜  < ˆ L if Algorithm 2 en ters the binary searc h phase. Th us, if g 0 (1) > ˜  + p , then by Lemma 9 and the definition of τ w e 26 ha ve g 0 ( τ ) > 0 and g ( τ ) − g (1) ≤ − (˜  + p ) 2 4 ˆ L . Recall that the loop termination condition in Algorithm 2 is α ( g 0 ( α ) − αp ) ≤ c ( g (1) − g ( α )) + ˜  . First, we claim that the inv ariants g ( lo ) > g ( τ ) , g ( hi ) ≤ g ( τ ) , and g 0 ( hi ) > ˜  hold at the start of ev ery lo op iteration. This is true at the b eginning of the lo op, since otherwise the algorithm would return b efore en tering it. In the lo op b ody , hi is only ev er set to a new v alue α if g ( α ) ≤ g ( τ ) . If the lo op do es not subsequently terminate, this also implies g 0 ( α ) > ˜  since then α ( g 0 ( α ) − αp ) > c ( g (1) − g ( α )) + ˜  ≥ c ( g (1) − g ( τ )) + ˜  ≥ ˜  . Similarly , lo is only ever set to a new v alue α if g ( α ) > g ( τ ) . Thus, these in v ariants indeed hold at the start of each loop iteration. No w, supp ose α = ( lo + hi ) / 2 does not satisfy the termination condition. If g ( α ) ≤ g ( τ ) , this implies g 0 ( α ) > ˜  . 
As g ( lo ) > g ( τ ) ≥ g ( α ) , b y F act 1, there m ust b e an ˆ α ∈ ( lo , α ) with g 0 ( ˆ α ) = 0 and g ( ˆ α ) ≤ g ( τ ) [and thus satisfying the termination condition]. The algorithm sets hi to α , which will keep ˆ α in the new searc h interv al [ lo , α ] . Similarly , if g ( α ) > g ( τ ) , then since g ( τ ) ≥ g ( hi ) and g 0 ( hi ) > 0 , there must b e an ˆ α ∈ ( α, hi ) with g 0 ( ˆ α ) = 0 and g ( ˆ α ) ≤ g ( τ ) [and thus satisfying the termination condition], b y applying F act 1. The algorithm sets lo to α , whic h will keep ˆ α in the search interv al. Thus, there is alw ays at least one p oin t ˆ α ∈ [ lo , hi ] satisfying the termination condition. In addition, note that if an in terv al [ z 1 , z 2 ] ⊆ [0 , 1] of p oin ts satisfies the termination condition, then at every lo op iteration, either the entire in terv al lies in [ lo , hi ] or none of the in terv al do es, i.e. either [ z 1 , z 2 ] ⊆ [ lo , hi ] or [ z 1 , z 2 ] ∩ [ lo , hi ] = ∅ . The reason is that if a p oin t α satisfies the termination condition w e terminate immediately . If not, then α is not in an interv al of points satisfying the termination condition, so either z 2 < α or z 1 > α . Th us, all in terv als of p oin ts satisfying the termination condition either disjointly lie in the set of p oints that remain in our search interv al, or the set of p oin ts we thro w aw ay (i.e. an in terv al of satisfying p oin ts never gets split). Supp ose that α ∈ [0 , τ ] , g 0 ( α ) = 0 , and g ( α ) ≤ g ( τ ) . By ˆ L -Lipsc hitz contin uity of g 0 , w e hav e that for all t , | g 0 ( t ) | = | g 0 ( t ) − g 0 ( α ) | ≤ ˆ L | t − α | and g ( t ) − g (1) ≤ g ( t ) − g ( τ ) ≤ g ( t ) − g ( α ) ≤ ˆ L 2 ( t − α ) 2 . So, for all t ∈ [ α/ 2 , τ ] , t ( g 0 ( t ) − tp ) + c ( g ( t ) − g ( τ )) ≤ t ( ˆ L | t − α | − ( t − α ) p ) + c ˆ L 2 ( t − α ) 2 − αtp ≤  ˆ L (1 + c 2 ) + p  · | t − α | − α 2 p/ 2 . Supp ose | t − α | ≤ α 2 p/ 2 + ˜  ˆ L (1 + c 2 ) + p . Then,  ˆ L (1 + c 2 ) + p  · | t − α | − α 2 p/ 2 ≤ ˜  . So, if α ∈ [0 , τ ] , g 0 ( α ) = 0 , and g ( α ) ≤ g ( τ ) , then all t ∈ h α − α 2 p/ 2+˜  ˆ L (1+ c/ 2)+ p , α + α 2 p/ 2+˜  ˆ L (1+ c/ 2)+ p i ∩ [ α/ 2 , τ ] also satisfy the termination condition t ( g 0 ( t ) − tp ) + c ( g ( t ) − g (1)) ≤ ˜  . If α 2 p/ 2+˜  ˆ L (1+ c/ 2)+ p ≤ α/ 2 , the low er b ound of the first interv al is ≥ α/ 2 and the in tersection of the tw o interv als con tains [ α − α 2 p/ 2+˜  ˆ L (1+ c/ 2)+ p , α ] . If not, then the first interv al con tains [ α/ 2 , α ] as do es the second interv al, so the intersection of the tw o interv als contains [ α/ 2 , α ] . Therefore, the length of the interv al of p oin ts satisfying the termination condition is at least min { α 2 , α 2 p/ 2+˜  ˆ L (1+ c/ 2)+ p } . 27 If g 0 ( α ) = 0 and g ( α ) ≤ g ( τ ) , then g (0) ≤ g ( τ ) + ˆ L 2 α 2 b y ˆ L -smo othness. Since g ( τ ) + ( p +˜  ) 2 4 ˆ L ≤ g (1) < g (0) , this implies α ≥ p +˜  ˆ L √ 2 . Therefore, the in terv al length is at least min ( p + ˜  2 √ 2 ˆ L , p 3 / (4 ˆ L 2 ) + ˜  (1 + c/ 2) ˆ L + p ) ≥ min ( p + ˜  ˆ L √ 8 , p 3 / (4 ˆ L 2 ) + ˜  (2 + c/ 2) ˆ L ) ≥ p 3 / (4 ˆ L 2 ) + ˜ / √ 2 (2 + c/ 2) ˆ L . p 3 / (4 ˆ L 2 ) + ˜ / √ 2 (2 + c/ 2) ˆ L ≥ max ( p 3 (8 + 2 c ) ˆ L 3 , ˜  (4 + c ) ˆ L ) = max ( b 3 (8 + 2 c ) L 3 , ˜  (4 + c ) ˆ L ) , using the fact that ˆ L = L k x − v k 2 and p = b k x − v k 2 . 
Since we know at least one such in terv al of p oin ts satisfying the termination condition is alwa ys con tained within our curren t search interv al, this implies that if w e run the algorithm until the curren t searc h interv al has length at most max n b 3 (8+2 c ) L 3 , ˜  (4+ c ) L k x − v k 2 o , we will terminate with a p oint satisfying the necessary condition. As w e halve our searc h interv al (whic h is initially [0 , τ ] ⊂ [0 , 1] ) at ev ery iteration, w e must therefore terminate in at most l log + 2  (4 + c ) min n 2 L 3 b 3 , L k x − v k 2 ˜  om iterations. Before each loop iteration (including the last which do es not get executed when the termination condition is satisfied), we compute g ( α ) and g 0 ( α ) , so there are tw o function and gradient ev aluations p er iteration. Before the lo op b egins, we require (at most) three function and gradient ev aluations to ev aluate g (0) , g (1) , g 0 (1) , in addition to the ev aluations required to compute τ . [This b ecomes five if an initial guess is pro vided, as then we also compute g ( guess ) , g 0 ( guess ) .] As argued earlier, if p + ˜  ≥ ˆ L , Algorithm 2 terminates b efore Line 3. Thus, w e compute τ only if g 0 (1) ≥ p + ˜  , in whic h case Lemma 9 sa ys that at most l log 2 ( ˆ L p +˜  ) m + 1 additional function ev aluations are required to compute τ . Note that ˆ L p +˜  ≤ min n ˆ L p , ˆ L ˜  o since p, ˜  ≥ 0 ; thus, l log 2 ( ˆ L p +˜  ) m ≤ l log 2  min n ˆ L p , ˆ L ˜  om ≤ l log + 2  (4 + c ) min n 2 L 3 b 3 , L k x − v k 2 ˜  om . Th us, the total num b er of function and gradient ev aluations made is at most 8 + 3 l log + 2  (4 + c ) min n 2 L 3 b 3 , L k x − v k 2 2˜  om . Note that w e define min { x, + ∞} = x for any x ∈ R ∪ {±∞} . Note also that if b = 0 and L = 0 , or if ˜  = 0 and either L = 0 or x = v , the ab o v e expression is technically indeterminate; ho wev er, observ e that g is constant in all of these cases, so at most one gradient ev aluation is performed and the p oin t α = 1 is returned (or, if an initial guess is passed in, then there are three ev aluations — g(guess), g’(guess), and g(1) — and the point “guess” is returned). C.3 Non-Strongly Quasar-Conv ex Algorithm Analysis Lemma 10 Supp ose ω ( − 1) = 1 and ω ( k ) = 1 2  ω ( k − 1)  q  ω ( k − 1)  2 + 4 − ω ( k − 1)  for k ≥ 0 . In the fol lowing sub-lemmas, we pr ove various simple pr op erties of this se quenc e: 28 Lemma 10.1 ω ( k ) ≤ 4 k + 6 for al l k ≥ 0 . Pro of The case k = 0 is clearly true as ω (0) = √ 5 − 1 2 < 2 3 . Suppose that ω ( i − 1) ≤ 4 i + 5 for some i ≥ 1 . ω ( i ) = ω ( i − 1) 2  q  ω ( i − 1)  2 + 4 − ω ( i − 1)  . Using the fact that √ x 2 + 1 ≤ 1 + x 2 2 for all x and the fact that ω ( i − 1) ∈ (0 , 1) , ω ( i ) ≤ ω ( i − 1) 2 2 − ω ( i − 1) +  ω ( i − 1)  2 2 ! ≤ ω ( i − 1)  1 − ω ( i − 1) 4  . If y > 0 , then x (1 − x 4 ) < 4 y +1 for all 0 ≤ x ≤ 4 y . Thus, setting y = i + 5 yields that ω ( i ) ≤ 4 i +6 b y the inductive hypothesis. Lemma 10.2 ω ( k ) ≥ 1 k + 2 for al l k ≥ 0 . Pro of The case k = 0 is clearly true as ω (0) = √ 5 − 1 2 > 1 2 . Suppose that ω ( i − 1) ≥ 1 i + 1 for some i ≥ 1 . Observ e that the function h ( x ) = 1 2 ( x ( √ x 2 + 4 − x )) is increasing for all x . Therefore, ω ( i ) = h ( ω ( i − 1) ) ≥ h ( 1 i +1 ) = 1 2( i +1)  q 1 ( i +1) 2 + 4 − 1 i +1  = 1 2( i +1) 2  p 4( i + 1) 2 + 1 − 1  . No w, it just remains to sho w that √ 4 x 2 + 1 ≥ 2 x 2 x + 1 + 1 for all x ≥ 0 . 
T o prov e this, note that 4 x 2 ( x + 1) 2 = 4 x 4 + 8 x 3 + 4 x 2 , so 4 x 2 + 1 = 4 x 4 + 8 x 3 + 4 x 2 ( x + 1) 2 + 1 ≥ 4 x 4 + 4 x 3 + 4 x 2 ( x + 1) 2 + 1 =  2 x 2 x + 1 + 1  2 . Th us, ω ( i ) ≥ 1 2( i + 1) 2  p 4( i + 1) 2 + 1 − 1  ≥ 1 2( i + 1) 2 · 2( i + 1) 2 ( i + 2) = 1 i + 2 . Lemma 10.3 ω ( k ) ∈ (0 , 1) for al l k ≥ 0 . A dditional ly, ω ( k ) < ω ( k − 1) for al l k ≥ 0 . Pro of The fact that ω ( k ) > 0 follows from Lemma 10.2. T o show the rest, we simply observ e that 1 2 ( √ x 2 + 4 − x ) < 2 2 = 1 for all x > 0 ; as ω ( − 1) = 1 and ω ( k ) = 1 2 ( p ( ω ( k − 1) ) 2 + 4 − ω ( k − 1) ) · ω ( k − 1) for all k ≥ 0 , the result follo ws. 29 Lemma 10.4 Define s ( k ) = 1 + k − 1 X i =0 1 ω ( i ) . Then,  s ( k )  − 1 ≤ 8 ( k + 2) 2 for al l k ≥ 0 . Pro of Applying Lemma 10.1, s ( k ) ≥ 1 + k − 1 X i =0  i + 6 4  = k ( k + 11) + 8 8 ≥ k ( k + 4) + 4 8 = 1 8 ( k + 2) 2 , and so  s ( k )  − 1 ≤ 8 ( k + 2) 2 . Lemma 10.5 1 ( ω ( k ) ) 2 − 1 ω ( k ) = k − 1 X i = − 1 1 ω ( i ) = s ( k ) for al l k ≥ 0 . Pro of Notice that ( ω ( k ) ) 2 = (1 − ω ( k ) )( ω ( k − 1) ) 2 for all k ≥ 0 , by definition of the sequence { ω ( k ) } . Th us, since w ( k ) ∈ (0 , 1) for all k ≥ 0 , 1 ( ω ( k ) ) 2 − 1 ω ( k ) = 1 ( ω ( k − 1) ) 2 . This prov es the base case k = 0 , since ω ( − 1) = 1 . No w, for k ≥ 0 define B ( k ) = 1 ( ω ( k ) ) 2 − 1 ω ( k ) . Then for all k ≥ 0 , B ( k +1) −  B ( k ) + 1 ω ( k )  = 1 ( ω ( k +1) ) 2 − 1 ω ( k +1) − 1 ( ω ( k ) ) 2 = 0 . Th us B ( k +1) = B ( k ) + 1 ω ( k ) = 1 ω ( k ) + k − 1 X i = − 1 1 ω ( i ) b y the inductiv e hypothesis. Lemma 6 (Non-Strongly Quasar-Conv ex A GD Con v ergence) If f is L -smo oth and γ -quasar- c onvex with r esp e ct to a minimizer x ∗ , with γ ∈ (0 , 1] , then in e ach iter ation k ≥ 0 of Algorithm 4,  ( k ) ≤ 8 ( k + 2) 2   (0) + L 2 γ 2 r (0)  +  2 , (10) wher e  ( k ) , f ( x ( k ) ) − f ( x ∗ ) and r ( k ) ,   v ( k ) − x ∗   2 . Ther efor e, if R ≥   x (0) − x ∗   and the numb er of iter ations K ≥  8 γ − 1 L 1 / 2 R − 1 / 2  , then the output x ( K ) satisfies f ( x ( K ) ) ≤ f ( x ∗ ) +  . Pro of F or simplicity of exp osition, we present the pro of in the case where L ( k ) = L for all k (i.e., L is known). The general case can b e handled b y tightening the analysis, similarly to the analysis of standard AGD with adaptive step size on conv ex functions. In the non-strongly quasar-con vex case, µ = 0 and β = 1 . F or all k , η ( k ) = γ L ( k ) ω ( k ) ≥ γ L ( k ) since ω ( k ) ∈ (0 , 1) by Lemma 10.3. A dditionally , α ( k ) is in [0 , 1] and ( α, x, y α , v ) = ( α ( k ) , x ( k ) , y ( k ) , v ( k ) ) satisfies (7) with b = 1 − β 2 η ( k ) = 0 , c = L ( k ) η ( k ) − γ β = L ( k ) η ( k ) − γ b y construction. Lemmas 1 and 3 th us imply that for all k ≥ 0 , 2( η ( k ) ) 2 L ( k )  ( k +1) + r ( k +1) ≤ r ( k ) + 2 η ( k )  L ( k ) η ( k ) − γ   ( k ) + 2 η ( k ) ˜  . (12) 30 Define A ( k ) , 2  η ( k )  2 L ( k ) − 2 η ( k ) γ . So, ( A ( k ) + 2 η ( k ) γ )  ( k +1) + r ( k +1) ≤ A ( k )  ( k ) + r ( k ) + 2 η ( k ) ˜  . Recall that ( ω ( k +1) ) 2 = (1 − ω ( k +1) )( ω ( k ) ) 2 and ω ( k ) ∈ (0 , 1) for all k ≥ 0 . 
So, A ( k +1) − ( A ( k ) + 2 η ( k ) γ ) = 2( η ( k +1) ) 2 L ( k +1) − 2 η ( k +1) γ − 2( η ( k ) ) 2 L ( k ) = 2  γ 2 L ( k +1) ( L ( k +1) ) 2 ( ω ( k +1) ) 2 − γ 2 L ( k +1) ω ( k +1) − γ 2 L ( k ) ( L ( k ) ) 2 ( ω ( k ) ) 2  = 2 γ 2  1 L ( k +1) · 1 − ω ( k +1) ( ω ( k +1) ) 2 − 1 L ( k ) · 1 ( ω ( k ) ) 2  = 2 γ 2  1 L ( k +1) · 1 ( ω ( k ) ) 2 − 1 L ( k ) · 1 ( ω ( k ) ) 2  ≤ 0 . The final inequality comes from the fact that L ( k +1) ≥ L ( k ) , by definition of the sequence { L ( k ) } in Algorithm 4. So, A ( k +1) = L ( k ) L ( k +1) ( A ( k ) + 2 η ( k ) γ ) ≤ A ( k ) + 2 η ( k ) γ and th us A ( k +1)  ( k +1) + r ( k +1) ≤ ( A ( k ) + 2 η ( k ) γ )  ( k +1) + r ( k +1) ≤ A ( k )  ( k ) + r ( k ) + 2 η ( k ) ˜  . Applying (12) repeatedly , we th us hav e A ( k )  ( k ) + r ( k ) ≤ A ( k − 1)  ( k − 1) + r ( k − 1) + 2 η ( k − 1) ˜  ≤ · · · ≤ A (0)  (0) + r (0) + 2˜  k − 1 X i =0 η ( i ) . (13) By Lemma 10.5, A ( k ) = 2( η ( k ) ) 2 L ( k ) − 2 η ( k ) γ = 2 γ 2 L ( k )  1 ( ω ( k ) ) 2 − 1 ω ( k )  = 2 γ 2 L ( k ) s ( k ) , where s ( k ) , 1 + k − 1 X i =0 1 ω ( i ) ! . Since 0 < L ( k ) < 2 L for all k ≥ 0 , we thus ha ve A ( k ) ≥ γ 2 L s ( k ) . Also, A (0) = 2( η (0) ) 2 L (0) − 2 η (0) γ = 2 γ 2 L (0) ( ω (0) ) 2 − 2 γ 2 L (0) ω (0) = 2 γ 2 L (0) , as ω (0) = √ 5 − 1 2 . So, as r ( k ) ≥ 0 and (b y our simplifying assumption) L ( k ) = L ,  ( k ) ≤ ( A ( k ) ) − 1  A (0)  (0) + r (0)  + 2( A ( k ) ) − 1 ˜  k − 1 X i =0 η ( i ) ≤ L γ 2 ( s ( k ) ) − 1  2 γ 2 L (0)  (0) + r (0)  + 2˜ L γ ( s ( k ) ) − 1   k − 1 X i =0 η ( i )   Then, the previous expression becomes ( s ( k ) ) − 1  2  (0) + L γ 2 r (0)  + γ − 1 ˜  . ˜  = γ  2 b y definition and  s ( k )  − 1 ≤ 8 ( k + 2) 2 b y Lemma 10.4, whic h prov es the b ound on  ( k ) . F or the iteration b ound, w e simply require K large enough suc h that 8 ( K +2) 2   (0) + L 2 γ 2 r (0)  ≤  2 . Observ e that as f ( x (0) ) ≤ f ( x ∗ ) + L 2   x (0) − x ∗   2 b y F act 3, 2  (0) ≤ Lr (0) ≤ L γ 2 r (0) . 31 So, it suffices to hav e 8 ( K +2) 2  2 L γ 2 r (0)  ≤  2 . Rearranging, this is equiv alent to K +2 ≥ 8 γ − 1 L 1 / 2 R − 1 / 2 , as r (0) = R 2 . As K must be a nonnegativ e integer, it suffices to hav e K ≥  8 γ − 1 L 1 / 2 R − 1 / 2  . Theorem 2 If f is L -smo oth and γ -quasar-c onvex with r esp e ct to a minimizer x ∗ , with γ ∈ (0 , 1] and   x (0) − x ∗   ≤ R , then Algorithm 4 pr o duc es an  -optimal p oint after O  γ − 1 L 1 / 2 R − 1 / 2 log +  γ − 1 L 1 / 2 R − 1 / 2  function and gr adient evaluations. Pro of Lemma 6 implies O ( γ − 1 L 1 / 2 R − 1 / 2 ) iterations are needed to get an  -optimal p oint. Lemma 4 implies that eac h line searc h uses O  log +  (1 + c ) min  L k x ( k ) − v ( k ) k 2 ˜  , L 3 b 3  function and gradien t ev aluations. Again, for simplicity we fo cus on the case where L ( k ) = L for all k ≥ 0 ; the analysis for the general case pro ceeds analogously . In this case, b = 0 , c = Lη ( k ) − γ = γ  1 ω ( k ) − 1  , and ˜  = γ  2 . By Lemma 10.2 and 10.3, 1 < 1 ω ( k ) ≤ k + 2 for all k ≥ 0 . Th us, the n umber of function and gradien t ev aluations required for the line search at iteration k of Algorithm 4 is O  log +  ( γ k + 1) L k x ( k ) − v ( k ) k 2 γ   . No w, we bound   x ( k ) − v ( k )   2 . T o do so, we first bound   v ( k ) − x ∗   2 = r ( k ) . 
Recall that equation (13) in the pro of of Lemma 6 says that A ( k )  ( k ) + r ( k ) ≤ A (0)  (0) + r (0) + 2 ˜  k − 1 P i =0 η ( i ) , where A ( j ) , 2 γ 2 L  1 + j − 1 P i =0 1 ω ( i )  . As A ( k ) ,  ( k ) ≥ 0 , this means that r ( k ) ≤ A (0)  (0) + r (0) + 2˜  k − 1 X i =0 η ( i ) = 2 γ 2 L  (0) + r (0) + γ 2  L k − 1 X i =0 1 ω ( i ) , using that η ( i ) = γ Lω ( i ) , ˜  = γ  2 , and A (0) = 2 γ 2 L (as previously shown in the pro of of Lemma 6). No w, by Lemma 10.2 we ha ve that k − 1 P i =0 1 ω ( i ) ≤ k − 1 P i =0 ( i + 2) = k ( k +3) 2 , and b y L -smo othness of f and F act 3 we ha ve that  (0) ≤ L 2 r (0) ≤ L 2 γ 2 r (0) . Thus, for all k ≥ 1 , we hav e r ( k ) ≤ 2 r (0) + γ 2 k ( k +3) 2 L ≤ 2( R 2 + γ 2 k 2 L ) , as r (0) = R 2 and k + 3 ≤ 4 k for all k ≥ 1 . In fact, the ab o ve holds for k = 0 as well, b ecause r ( k ) is simply r (0) in this case. By the triangle inequalit y ,   v ( k ) − v ( k − 1)   ≤   v ( k ) − x ∗   +   v ( k − 1) − x ∗   ≤ 2 q 2( R 2 + γ 2 k 2 L ) . Since β = 1 , we ha ve that v ( k − 1) − η ( k − 1) ∇ f ( y ( k − 1) ) and so   v ( k ) − v ( k − 1)   = η ( k − 1)   ∇ f ( y ( k − 1) )   . Thus,    ∇ f ( y ( k − 1) )    ≤ ( η ( k − 1) ) − 1 · 2 q 2( R 2 + γ 2 k 2 L ) = Lω ( k − 1) γ − 1 q 8( R 2 + γ 2 k 2 L ) . (14) 32 No w, by definition of x ( k ) , v ( k ) , and y ( k − 1) , x ( k ) − v ( k ) = y ( k − 1) − 1 L ∇ f ( y ( k − 1) ) − v ( k ) = α ( k − 1) x ( k − 1) + (1 − α ( k − 1) ) v ( k − 1) − 1 L ∇ f ( y ( k − 1) ) − v ( k ) = α ( k − 1) x ( k − 1) + (1 − α ( k − 1) ) v ( k − 1) − 1 L ∇ f ( y ( k − 1) ) −  v ( k − 1) − η ( k − 1) ∇ f ( y ( k − 1) )  = α ( k − 1) ( x ( k − 1) − v ( k − 1) ) + ( η ( k − 1) − 1 L ) ∇ f ( y ( k − 1) ) . Therefore,    x ( k ) − v ( k )    ≤ α ( k − 1)    x ( k − 1) − v ( k − 1)    +    η ( k − 1) − 1 L    ·    ∇ f ( y ( k − 1) )    ≤    x ( k − 1) − v ( k − 1)    +  η ( k − 1) + 1 L  ·    ∇ f ( y ( k − 1) )    ≤    x ( k − 1) − v ( k − 1)    + 2 Lω ( k − 1) ·    ∇ f ( y ( k − 1) )    ≤    x ( k − 1) − v ( k − 1)    + γ − 1 q 32( R 2 + γ 2 k 2 L ) ≤    x ( k − 1) − v ( k − 1)    + √ 32 γ − 1  R + γ k q  L  , where the first inequality is the triangle inequalit y , the third inequality uses that η ( k − 1) = γ Lω ( k − 1) and that γ , ω ( k − 1) ∈ (0 , 1] , the fourth inequalit y uses (14) , and the final inequality uses that √ a + b ≤ √ a + √ b for any a, b ≥ 0 . As this holds for all k ≥ 1 , w e hav e by induction that for all k ≥ 0 ,    x ( k ) − v ( k )    ≤    x (0) − v (0)    + k X j =1 √ 32 γ − 1  R + γ j q  L  = √ 32 γ − 1 k X j =1  R + γ j q  L  , since x (0) = v (0) . Simplification yields   x ( k ) − v ( k )   ≤ √ 32 k γ − 1 R + √ 8 k ( k + 1) p  L . F or all k ≥ 1 , it is the case that k + 1 ≤ 2 k , so   x ( k ) − v ( k )   ≤ √ 32  k γ − 1 R + k 2 p  L  ; this inequalit y holds for k = 0 as w ell, as   x (0) − v (0)   = 0 in this case. Supp ose k ≤  4 γ − 1 L 1 / 2 R − 1 / 2  . Then    x ( k ) − v ( k )    ≤ √ 32  4 γ − 1 L 1 / 2 R − 1 / 2 · γ − 1 R + 16 γ − 2 LR 2  − 1 · q  L  = 80 √ 2 · γ − 2 L 1 / 2 R 2  − 1 / 2 . Recall that the line search at iteration k requires O  log +  ( γ k + 1) L k x ( k ) − v ( k ) k 2 γ   function and gradien t ev aluations. ( γ k + 1) L k x ( k ) − v ( k ) k 2 γ  ≤ (4 L 1 / 2 R − 1 / 2 + 1) · 12800( γ − 5 L 2 R 4  − 2 ) . 
Therefore, eac h line searc h indeed requires O  log +  γ − 1 L 1 / 2 R − 1 / 2  function and gradient ev aluations. As the num b er of iterations k is O ( γ − 1 L 1 / 2 R − 1 / 2 ) , the total n umber of function and gradient ev aluations required is th us O  γ − 1 L 1 / 2 R − 1 / 2 log +  γ − 1 L 1 / 2 R − 1 / 2  , as claimed. 33 As in the strongly con vex case, the algorithm ma y con tin ue to run if the sp ecified num b er of iterations K is larger; ho wev er, this theorem com bined with Lemma 6 sho ws that x ( k ) will be  -optimal if k =  4 γ − 1 L 1 / 2 R − 1 / 2  , and this x ( k ) will b e produced using O  γ − 1 L 1 / 2 R − 1 / 2 log +  γ − 1 L 1 / 2 R − 1 / 2  function and gradient ev aluations. (F uture iterates x ( k 0 ) with k 0 >  4 γ − 1 L 1 / 2 R − 1 / 2  will also b e  -optimal.) Remark 1 If f is L -smo oth and γ -quasar-c onvex with γ ∈ (0 , 1] and   x (0) − x ∗   ≤ R , then gr adient desc ent with step size 1 L r eturns a p oint x with f ( x ) ≤ f ( x ∗ ) +  after O  γ − 1 LR 2  − 1  function and gr adient evaluations. Pro of See Theorem 1 in [26]. C.4 Comparisons with Standard AGD W e hav e describ ed ho w our algorithms relate to standard (conv ex) AGD. W e briefly comment on the difference b et ween the analysis of our algorithms and that of standard A GD, and on the use of a line search “initial guess” inspired by the setting of α ( k ) in standard AGD. C.4.1 Analysis Concretely , a k ey step in the proof of conv ergence of algorithms that extend Algorithm 1 (including our algorithms as well as standard AGD) is to b ound Q ( k ) ; this b ound is then combined with Lemma 1 to get the final conv ergence b ound. In standard AGD, w e b ound Q ( k ) b y setting α ( k ) to a sp ecific predetermined v alue. F or instance, in the non-strongly conv ex case (where β = 1 ), α ( k ) is set suc h that α ( k ) 1 − α ( k ) = L ( k ) η ( k ) − 1 . 11 W e then ha ve Q ( k ) = 2 η ( k ) · α ( k ) 1 − α ( k ) ∇ f ( y ( k ) ) > ( x ( k ) − y ( k ) ) = 2 η ( k ) ( L ( k ) η ( k ) − 1) ∇ f ( y ( k ) ) > ( x ( k ) − y ( k ) ) , and then we use con vexit y to obtain that Q ( k ) ≤ 2 η ( k ) ( L ( k ) η ( k ) − 1)( f ( x ( k ) ) − f ( y ( k ) )) = 2 η ( k ) ( L ( k ) η ( k ) − 1)(  ( k ) −  ( k ) y ) . By con trast, for our algo- rithms, we b ound Q ( k ) using Lemma 3. C.4.2 Line Search Initial Guess In sp ecial cases, sp ecifying an “initial guess” for α in the binary line search (Algorithm 2) can sp eed up our algorithms, by allo wing the line search to b e circumv ented a large portion of the time. F or instance, at each step k w e can use the α ( k ) prescrib ed by the standard version of AGD as a guess: this v alue is √ L ( k ) /µ 1+ √ L ( k ) /µ in the strongly con vex case (Algorithm 3), and 1 − ω ( k ) in the non-strongly con vex case (Algorithm 4). Thus, when f is con vex or strongly conv ex (and th us γ = 1 ), our resp ectiv e algorithms using the initial guess are equiv alent to standard AGD (as described in 11 As η ( k ) = 1 L ( k ) ω ( k ) , this implies that α ( k ) = 1 − ω ( k ) . 34 [ 45 ]), since this initial guess alwa ys satisfies the necessary condition (7) b y conv exity [in fact, it satisfies the stronger (6) ] and will th us b e chosen as the v alue of α ( k ) . Moreo ver, even when f is noncon vex, c hecking this initial guess costs at most one extra function and gradient ev aluation eac h p er inv o cation of Algorithm 2. 
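As a sketch of this initial-guess mechanism (the callback interface below is our own illustration; check_condition stands in for testing (7) and binary_line_search for Algorithm 2), the guess is simply tried first, and the binary search is entered only if it fails:

```python
def choose_alpha(check_condition, binary_line_search, omega_k):
    """Try the alpha prescribed by standard AGD (1 - omega_k in the non-strongly
    convex case) before resorting to the binary search of Algorithm 2.
    `check_condition` and `binary_line_search` are hypothetical callbacks standing
    in for the acceptance test (7) and for Algorithm 2, respectively."""
    guess = 1.0 - omega_k          # costs at most one extra function/gradient evaluation
    if check_condition(guess):
        return guess               # coincides with standard AGD whenever the guess is accepted
    return binary_line_search()    # otherwise, fall back to the binary search
```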
So, when γ = 1 we can interpret the o verall algorithm as a “robustified” version of standard A GD—each iteration is iden tical to that of standard A GD unless a “con vexit y violation” b et ween x ( k ) and v ( k ) is detected, in whic h case we fall bac k to the binary searc h. C.4.3 Analysis T echniques W e remark that our analysis can also be recast in the framework of estimate sequences (for instance, follo wing [ 45 ]), by generalizing the analysis for standard AGD. The analysis presented in this w ork is an adaptation of a somewhat differen t st yle of analysis of standard AGD, based on analyzing the one-step decrease in the more general potential function presented in Lemma 1. Indeed, as men tioned, the standard AGD algorithms for b oth con vex and strongly conv ex minimization are also sp ecific instances of the framework presented in Algorithm 1. D The Structure of Quasar-Con v ex F unctions In this section, w e pro ve v arious prop erties of quasar-conv ex functions. First, we state a slightly more general definition of quasar-conv exity on a con vex domain. Definition 3 L et X ⊆ R n b e c onvex. F urthermor e, supp ose that either X is op en or n = 1 . L et γ ∈ (0 , 1] and let x ∗ ∈ X b e a minimizer of the differ entiable function f : X → R . The function f is γ -quasar-conv ex on X with r esp e ct to x ∗ if for al l x ∈ X , f ( x ∗ ) ≥ f ( x ) + 1 γ ∇ f ( x ) > ( x ∗ − x ) . Supp ose also µ ≥ 0 . The function f is ( γ , µ ) -strongly quasar-conv ex on X if for al l x ∈ X , f ( x ∗ ) ≥ f ( x ) + 1 γ ∇ f ( x ) > ( x ∗ − x ) + µ 2 k x ∗ − x k 2 . If X is of the form [ a, b ] ⊆ R , then ∇ f ( a ) and ∇ f ( b ) her e denote lim h → 0 + f ( a + h ) − f ( a ) h and lim h → 0 − f ( b + h ) − f ( b ) h , r esp e ctively. Differ entiability simply me ans that ∇ f ( x ) exists for al l x ∈ X . Definition 3 is exactly the same as Definition 1 if the domain X = R n . W e remark that it is p ossible to generalize Definition 3 ev en further to the case where X is a star-c onvex set with star center x ∗ . D.1 Pro of of Observ ation 1 Observ ation 1 L et a < b and let f : [ a, b ] → R b e c ontinuously differ entiable. The function f is γ -quasar-c onvex for some γ ∈ (0 , 1] iff f is unimo dal and al l critic al p oints of f ar e minimizers. 35 A dditional ly, if h : R n → R is γ -quasar-c onvex with r esp e ct to a minimizer x ∗ , then for any d ∈ R n with k d k = 1 , the 1-D function f ( θ ) , h ( x ∗ + θ d ) is γ -quasar-c onvex. Pro of First, we pro ve that if f is contin uously differen tiable and unimodal with nonzero deriv ativ e except at minimizers, then f is γ -quasar-conv ex for some γ > 0 . Let x ∗ b e a minimizer of f on [ a, b ] , and let x ∈ [ a, b ] b e arbitrary . Define g x ( t ) = f ((1 − t ) x ∗ + tx ) . By unimo dalit y of f , g x is differentiable and increasing on [0 , 1] , so g 0 x ( t ) ≥ 0 for t ∈ [0 , 1] , and f ( x ) − f ( x ∗ ) = g x (1) − g x (0) = 1 Z 0 g 0 x ( t ) dt . Also, g 0 x (1) = f 0 ( x )( x − x ∗ ) 6 = 0 b y assumption for all x with f ( x ) > f ( x ∗ ) . Note that if f ( x ) = f ( x ∗ ) , then g x ( t ) is constant on [0 , 1] b y unimo dalit y and so g 0 x ( t ) = 0 for all t ∈ [0 , 1] . Define C x ∗ = sup x ∈ [ a,b ] sup t ∈ [0 , 1] g 0 x ( t ) g 0 x (1) , where w e define the inner supremum to be 1 if f ( x ) = f ( x ∗ ) . By con tinuit y of eac h g 0 x o ver [0 , 1] and the fact that g 0 x (1) > 0 for all x ∈ [ a, b ] with f ( x ) > f ( x ∗ ) , sup t ∈ [0 , 1] g 0 x ( t ) g 0 x (1) is a con tinuous function of x . 
Thus as the outer supremum is ov er the compact in terv al [ a, b ] , C x ∗ indeed exists; note that C x ∗ ∈ [1 , ∞ ) . F or any x ∈ [ a, b ] with f ( x ) > f ( x ∗ ) , we th us hav e f ( x ) − f ( x ∗ ) f 0 ( x )( x − x ∗ ) = R 1 0 g 0 x ( t ) dt g 0 x (1) ≤ C x ∗ , meaning f ( x ∗ ) ≥ f ( x ) + C x ∗ ( f 0 ( x )( x ∗ − x )) . This also holds for all x suc h that f ( x ) = f ( x ∗ ) , as either x = x ∗ or f 0 ( x ) = 0 in these cases. Th us, f is 1 C x ∗ quasar-con vex on [ a, b ] with resp ect to x ∗ . Finally , if w e define C max = max x ∗ ∈ argmin x ∈ [ a,b ] f ( x ) C x ∗ , w e hav e that f is 1 C max quasar-con vex on [ a, b ] where 1 C max ∈ (0 , 1] is a constan t dep ending only on f , a , and b . This completes the pro of. No w, we pro ve the other direction (whic h is muc h simpler). Supp ose that f : [ a, b ] → R is differen tiable and quasar-con vex for some γ ∈ (0 , 1] . Then 1 γ f 0 ( x )( x − x ∗ ) ≥ f ( x ) − f ( x ∗ ) ≥ 0 . If x is not a minimizer of f , then the last inequality is strict; otherwise, either x ∈ { a, b } or f 0 ( x ) = 0 . In other words, assuming x is not a minimizer, when x < x ∗ [i.e. to the left of x ∗ ], f 0 < 0 and so f is strictly decreasing, while when x > x ∗ [i.e. to the right of x ∗ ], f 0 > 0 and so f is strictly increasing. This implies that f is unimodal. Finally , supp ose h : R n → R is γ -quasar-con vex with resp ect to a minimizer x ∗ , supp ose d ∈ R n has k d k = 1 , and define f ( θ ) , h ( x ∗ + θ d ) . Note that f 0 ( θ ) = d > ∇ h ( x ∗ + θ d ) and that θ = 0 minimizes f . By γ -quasar-con vexit y of h with resp ect to x ∗ , we hav e for all θ ∈ R that f (0) = h ( x ∗ ) ≥ h ( x ∗ + θ d ) + 1 γ ∇ h ( x ∗ + θ d ) > ( x ∗ − ( x ∗ + θ d )) = f ( θ ) + 1 γ f 0 ( θ )(0 − θ ) , meaning that f is γ -quasar-con vex. 36 D.2 Characterizations of Quasar-Con v exit y Lemma 11 L et f : X → R b e differ entiable with a minimizer x ∗ ∈ X , wher e the domain X ⊆ R n is op en and c onvex. 12 Then, the fol lowing two statements: f ( tx ∗ + (1 − t ) x ) + t  1 − t 2 − γ  γ µ 2 k x ∗ − x k 2 ≤ γ tf ( x ∗ ) + (1 − γ t ) f ( x ) ∀ x ∈ X , t ∈ [0 , 1] (15) f ( x ∗ ) ≥ f ( x ) + 1 γ ∇ f ( x ) > ( x ∗ − x ) + µ 2 k x ∗ − x k 2 ∀ x ∈ X (16) ar e e quivalent for al l µ ≥ 0 , γ ∈ (0 , 1] . Pro of First, we prov e that (16) implies (15). Supp ose (16) holds and µ = 0 . Let x ∈ X b e arbitrary and for all t ∈ [0 , 1] let x t , (1 − t ) x ∗ + tx and let g ( t ) , f ( x t ) − f ( x ∗ ) . Since g 0 ( t ) = ∇ f ( x t ) > ( x − x ∗ ) and x ∗ − x t = − t ( x ∗ − x ) , substituting these equalities into (16) yields that g ( t ) ≤ t γ g 0 ( t ) for all t ∈ [0 , 1] . Rearranging, we see that the inequalit y in (15) [for fixed x ] is equiv alent to the condition that g ( t ) ≤ ` ( t ) for all t ∈ [0 , 1] , where ` ( t ) , (1 − γ (1 − t )) g (1) . W e pro ceed b y contradiction: supp ose that for some α ∈ [0 , 1] it is the case that g ( α ) > ` ( α ) . Note that α > 0 necessarily . Let β b e the minim um element of the set { t ∈ [ α, 1] : g ( t ) = ` ( t ) } . Since g (1) = ` (1) , such a β exists with α < β . Consequen tly , for all t ∈ ( α, β ) we hav e g ( t ) ≥ ` ( t ) and so Z β α g 0 ( t ) dt = g ( β ) − g ( α ) < ` ( β ) − ` ( α ) = γ ( β − α ) g (1) (17) and ( β − α ) g (1) = Z β α ` ( t ) 1 − γ (1 − t ) dt ≤ Z β α g ( t ) 1 − γ (1 − t ) dt . 
(18) Com bining (17) and (18) and using that g ( t ) ≤ t γ g 0 ( t ) , we hav e Z β α  1 t − 1 1 − γ (1 − t )  g ( t ) dt ≤ Z β α g 0 ( t ) γ dt − Z β α g ( t ) 1 − γ (1 − t ) dt < 0 As g ( t ) = f ( x t ) − f ( x ∗ ) ≥ 0 and 1 /t ≥ 1 / (1 − γ (1 − t )) for all t ∈ [ α, β ] ⊂ (0 , 1] , we hav e a con tradiction. No w, supp ose µ > 0 . Define h ( x ) , f ( x ) − γ µ 2(2 − γ ) k x ∗ − x k 2 . Observ e that h ( x ∗ ) = f ( x ∗ ) , ∇ h ( x ) = ∇ f ( x ) − γ µ 2 − γ ( x − x ∗ ) , and ∇ h ( x ) > ( x ∗ − x ) = ∇ f ( x ) > ( x ∗ − x ) + γ µ 2 − γ k x ∗ − x k 2 . Thus, 12 W e remark that this lemma still holds if X is open and star-conv ex with star cen ter x ∗ , or if X is an y subinterv al of R . 37 b y algebraic simplification and then application of (16) by assumption, h ( x ) + 1 γ ∇ h ( x ) > ( x ∗ − x ) = f ( x ) − γ µ 2(2 − γ ) k x ∗ − x k 2 + 1 γ ∇ f ( x ) > ( x ∗ − x ) + µ 2 − γ k x ∗ − x k 2 = f ( x ) + 1 γ ∇ f ( x ) > ( x ∗ − x ) + µ 2 k x ∗ − x k 2  − γ 2 − γ + 2 2 − γ  = f ( x ) + 1 γ ∇ f ( x ) > ( x ∗ − x ) + µ 2 k x ∗ − x k 2 ≤ f ( x ∗ ) = h ( x ∗ ) . As we earlier show ed that (16) implies (15) in the µ = 0 case, we hav e that h ( tx ∗ + (1 − t ) x ) ≤ γ th ( x ∗ ) + (1 − γ t ) h ( x ) . Substituting in the definition of h : f ( tx ∗ + (1 − t ) x ) − γ µ 2(2 − γ ) k x ∗ − tx ∗ − (1 − t ) x k 2 ≤ γ tf ( x ∗ ) + (1 − γ t ) f ( x ) − (1 − γ t ) γ µ 2(2 − γ ) k x ∗ − x k 2 . Rearranging terms and simplifying yields f ( tx ∗ + (1 − t ) x ) + γ µ 2(2 − γ )  (1 − γ t ) k x ∗ − x k 2 − (1 − t ) 2 k x ∗ − x k 2  ≤ γ tf ( x ∗ ) + (1 − γ t ) f ( x ) . Finally , (1 − γ t ) − (1 − t ) 2 = t ((2 − γ ) − t ) , which giv es the desired result. No w, we pro ve that (15) implies (16). This time, define g ( t ) , f ( tx ∗ + (1 − t ) x ) . F or t ∈ [0 , 1) , g 0 ( t ) = ∇ f ( tx ∗ + (1 − t ) x ) > ( x ∗ − x ) . By assumption, g ( t ) + t  1 − t 2 − γ  γ µ 2 k x ∗ − x k 2 ≤ γ tg (1) + (1 − γ t ) g (0) for all t ∈ [0 , 1] , so g (1) ≥ g (0) + g ( t ) − g (0) γ t +  1 − t 2 − γ  µ 2 k x ∗ − x k 2 for all t ∈ (0 , 1] . T aking the limit as t ↓ 0 yields f ( x ∗ ) = g (1) ≥ g (0) + 1 γ g 0 (0) + µ 2 k x ∗ − x k 2 = f ( x ) + 1 γ ∇ f ( x ) > ( x ∗ − x ) + µ 2 k x ∗ − x k 2 . Remark 1 A mo difie d version of L emma 11 holds if x ∗ is r eplac e d with any p oint ˆ x ∈ X , wher e either γ = 1 or (15) and (16) hold for al l x ∈ X with f ( x ) ≥ f ( ˆ x ) . If f satisfies either of these e quivalent pr op erties, we then say that f is “ ( γ , µ ) -str ongly quasar-c onvex with r esp e ct to ˆ x .” Remark 2 Using R emark 1, we c an show that even if ˆ x is not a minimizer of the function f , Algorithms 3 and 4 c an stil l b e applie d to efficiently finding a p oint that has an obje ctive value of at most f ( ˆ x ) +  ; the r esp e ctive runtime b ounds ar e the same, and the pr o ofs r emain essential ly unchange d. 38 Note that when γ = 1 , µ = 0 , and (15) is required to hold for al l minimizers of f , it b ecomes the standard definition of star-conv exity [47]. Corollary 1 If f is ( γ , µ ) -str ongly quasar-c onvex with minimizer x ∗ , then f ( x ) ≥ f ( x ∗ ) + γ µ 2(2 − γ ) k x ∗ − x k 2 , ∀ x Pro of Plug in t = 1 to (15) to get f ( x ∗ ) +  1 − 1 2 − γ  γ µ 2 k x ∗ − x k 2 ≤ γ f ( x ∗ ) + (1 − γ ) f ( x ) . Simplifying yields f ( x ) ≥ f ( x ∗ ) +  1 − 1 2 − γ  γ µ 2(1 − γ ) k x ∗ − x k 2 = f ( x ∗ ) + γ µ 2(2 − γ ) k x ∗ − x k 2 . 
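To make the two equivalent characterizations concrete, the following Python sketch (our own illustration; the test function and grids are arbitrary) numerically estimates the largest γ for which the gradient inequality (16) holds with μ = 0 on a grid, and then spot-checks the interpolation inequality (15) with that γ.

```python
import numpy as np

# An arbitrary nonconvex test function with minimizer x* = 0; the numerical check
# below suggests it is quasar-convex with gamma roughly 0.5.  (Illustration only.)
f = lambda x: x ** 2 + 3.0 * np.sin(x) ** 2
df = lambda x: 2.0 * x + 3.0 * np.sin(2.0 * x)
x_star = 0.0

xs = np.linspace(-10.0, 10.0, 4001)
xs = xs[f(xs) - f(x_star) > 1e-9]                     # skip (near-)minimizers

# (16) with mu = 0:  f(x*) >= f(x) + (1/gamma) f'(x)(x* - x)
#                <=> gamma <= f'(x)(x - x*) / (f(x) - f(x*)) at every x.
gamma = min(1.0, float(np.min(df(xs) * (xs - x_star) / (f(xs) - f(x_star)))))
gamma *= 0.999                                        # small margin for grid discretization
print("estimated gamma:", gamma)

# Spot-check (15) with mu = 0:
#   f(t x* + (1 - t) x) <= gamma t f(x*) + (1 - gamma t) f(x)  for t in [0, 1].
ts = np.linspace(0.0, 1.0, 101)
lhs = f(np.outer(1.0 - ts, xs))                       # t x* + (1 - t) x = (1 - t) x since x* = 0
rhs = gamma * ts[:, None] * f(x_star) + (1.0 - gamma * ts)[:, None] * f(xs)[None, :]
print("(15) holds on the grid:", bool(np.all(lhs <= rhs + 1e-9)))
```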
F act 3 If f : X → R is L -smo oth, x ∗ is a minimizer of f , and the domain X ⊆ R n is op en and star-c onvex with star c enter x ∗ , then f ( y ) ≤ f ( x ∗ ) + L 2 k y − x ∗ k 2 for al l y ∈ X . Pro of This is a simple and w ell-known fact that is true of an y L -smo oth function (whether or not it is quasar-conv ex); for completeness, we pro vide the pro of. Define g ( t ) , f ((1 − t ) x ∗ + ty ) , for t ∈ [0 , 1] . So, g 0 ( t ) = ∇ f ((1 − t ) x ∗ + ty ) > ( y − x ∗ ) , g (0) = f ( x ∗ ) , and g (1) = f ( y ) . Since g 0 (0) = 0 and f is L -smo oth, k∇ f ((1 − t ) x ∗ + ty ) k ≤ L k (1 − t ) x ∗ + ty − x ∗ k = Lt k y − x ∗ k . So, g 0 ( t ) ≤ | g 0 ( t ) | ≤ Lt k y − x ∗ k 2 , and th us f ( y ) = g (1) = 1 Z 0 g 0 ( t ) dt + g (0) ≤ 1 Z 0 Lt k y − x ∗ k 2 dt + g (0) = L 2 k y − x ∗ k 2 + f ( x ∗ ) . Observ ation 2 If f is ( γ , µ ) -str ongly quasar-c onvex, then f is not L -smo oth for any L < γ µ 2 − γ . Pro of If f is ( γ , µ ) -strongly quasar-conv ex, Corollary 1 sa ys that f ( x ) ≥ f ( x ∗ ) + γ µ 2(2 − γ ) k x ∗ − x k 2 for all x . If f is L -smo oth, F act 3 says that f ( x ) ≤ f ( x ∗ ) + L 2 k x ∗ − x k 2 for all x . Th us, if f is ( γ , µ ) -strongly quasar-conv ex and L -smo oth, we ha ve γ µ 2(2 − γ ) k x ∗ − x k 2 ≤ L 2 k x ∗ − x k 2 for all x , which means that w e must ha ve L ≥ γ µ 2 − γ . 39 Observ ation 3 If f is γ -quasar c onvex, the set of its minimizers is star-c onvex. Pro of Recall that a set S is termed star-c onvex (with star center x 0 ) if there exists an x 0 ∈ S suc h that for all x ∈ S and t ∈ [0 , 1] , it is the case that tx 0 + (1 − t ) x ∈ S [40]. Supp ose f : X → R is γ -quasar-con vex with respect to a minimizer x ∗ ∈ X , where X is con vex. Supp ose y ∈ X also minimizes f . Then for any t ∈ [0 , 1] , equation (15) implies that f ( tx ∗ + (1 − t ) y ) ≤ γ tf ( x ∗ ) + (1 − γ t ) f ( y ) = γ tf ( x ∗ ) + (1 − γ t ) f ( x ∗ ) = f ( x ∗ ) . So, tx ∗ + (1 − t ) y is in X and also minimizes f . Thus, the set of minimizers of f is star-con vex, with star cen ter x ∗ . Observ ation 4 If f is ( γ , µ ) -str ongly quasar-c onvex with µ > 0 , f has a unique minimizer. Pro of By Corollary 1, f ( x ) > f ( x ∗ ) if µ > 0 and x 6 = x ∗ , implying that x minimizes f iff x = x ∗ . Observ ation 5 Supp ose f is differ entiable and ( γ , µ ) -str ongly quasar-c onvex. Then f is also ( θ γ , µ/θ ) -str ongly quasar-c onvex for any θ ∈ (0 , 1] . Pro of ( γ , µ ) -strong quasar-con v exity states that 0 ≥ f ( x ∗ ) − f ( x ) ≥ 1 γ ∇ f ( x ) > ( x ∗ − x ) + µ 2 k x ∗ − x k 2 for some x ∗ and all x in the domain of f . Multiplying by 1 θ − 1 ≥ 0 , it follo ws that f ( x ∗ ) ≥ f ( x ) + 1 γ ∇ f ( x ) > ( x ∗ − x ) + µ 2 k x − x ∗ k 2 ≥ f ( x ) + 1 γ θ ∇ f ( x ) > ( x ∗ − x ) + µ 2 θ k x ∗ − x k 2 . Note that any ( γ , µ ) -strongly quasar-con vex function is also ( γ , ˜ µ ) -strongly quasar-con vex for an y ˜ µ ∈ [0 , µ ] . Thus, the restriction γ ∈ (0 , 1] in the definition of quasar-conv exity may b e made without an y loss of generalit y compared to the restriction γ > 0 . Observ ation 6 The p ar ameter γ is a dimensionless quantity, in the sense that if f is γ -quasar- c onvex on R n , the function g ( x ) , a · f ( bx ) is also γ -quasar-c onvex on R n , for any a ≥ 0 , b ∈ R . Pro of If a or b is 0, then g is constant so the claim is trivial. No w supp ose a, b 6 = 0 . Let x ∗ denote the quasar-con vex point of f . Observe that as x ∗ minimizes f , x ∗ /b minimizes g . 
By (15) , for all x ∈ R n w e hav e 1 a g (( tx ∗ + (1 − t ) x ) /b ) = f ( tx ∗ + (1 − t ) x ) ≤ γ tf ( x ∗ ) + (1 − γ t ) f ( x ) = γ t · 1 a g ( x ∗ /b ) + (1 − γ t ) · 1 a g ( x/b ) . Multiplying b y a , we hav e g ( t ( x ∗ /b ) + (1 − t )( x/b )) ≤ γ tg ( x ∗ /b ) + (1 − γ t ) g ( x/b ) for all x ∈ R n . Since x/b can tak e on any v alue in R n , this means that g is γ -quasar-con vex with resp ect to x ∗ /b . 40 D.3 Construction of Quasar-Con v ex F unctions W e now briefly describ e some basic “building blocks” and closure properties of the family of quasar- con vex functions, inspired b y the analogous discussion for star-conv ex functions in App endix A of [ 35 ]. (Recall also that star-conv ex functions, such as the examples in [ 35 ], are quasar-conv ex with γ = 1 .) 1. Supp ose f : R n → R is ( γ , µ ) -quasar-con vex with respect to x ∗ = 0 . Let a ≥ 0 , c ∈ R b e scalars, M ∈ R m × n , and b ∈ R m . Then g ( x ) = a · f ( M ( x + b )) + c is ( γ , µ · σ 2 min ( M )) -quasar-con vex, where σ min ( M ) denotes the smallest singular v alue of M . • Pr o of : It is easy to see that adding the constan t c do es not affect the quasar-con vexit y prop erties, and that g ( x − b ) has the same quasar-conv exity prop erties as g ( x ) ; so, b y Observ ation 6, it suffices to prov e the claim for a = 1 , b = 0 , c = 0 . W e ha ve f ( 0 ) ≥ f ( x ) − 1 γ ∇ f ( x ) > x + µ 2 k x k 2 for all x ∈ R n , b y ( γ , µ ) -quasar-con vexit y of f with respect to x ∗ = 0 . So g ( 0 ) = f ( 0 ) ≥ f ( M y ) − 1 γ ∇ f ( M y ) > ( M y ) + µ 2 k M y k 2 = f ( M y ) − 1 γ M ∇ f ( M y ) > y + µ 2 k M y k 2 ≥ f ( M y ) − 1 γ M ∇ f ( M y ) > y + µσ 2 min ( M ) 2 k y k 2 = g ( y ) + 1 γ ∇ g ( y ) > ( 0 − y ) + µσ 2 min ( M ) 2 k 0 − y k 2 for all y ∈ R m , which prov es the claim. 2. If f , g are ( γ 1 , µ 1 ) and ( γ 2 , µ 2 ) quasar-conv ex resp ectiv ely with resp ect to the same minimizer x ∗ , then h ( x ) = f ( x ) + g ( x ) is ( min { γ 1 , γ 2 } , µ 1 + µ 2 ) quasar-conv ex with resp ect to the same minimizer x ∗ . • Pr o of : h ( x ∗ ) = f ( x ∗ ) + g ( x ∗ ) ≥ f ( x ) + 1 γ 1 ∇ f ( x ) > ( x ∗ − x ) + µ 1 2 k x ∗ − x k 2 + g ( x ) + 1 γ 2 ∇ g ( x ) > ( x ∗ − x ) + µ 1 2 k x ∗ − x k 2 = h ( x ) + µ 1 + µ 2 2 k x ∗ − x k 2 + 1 γ 1 ∇ f ( x ) > ( x ∗ − x ) + 1 γ 2 ∇ g ( x ) > ( x ∗ − x ) ≥ h ( x ) + µ 1 + µ 2 2 k x ∗ − x k 2 + 1 min { γ 1 ,γ 2 } ( ∇ f ( x ) + ∇ g ( x )) > ( x ∗ − x ) as desired, since ∇ f ( x ) > ( x ∗ − x ) ≤ f ( x ∗ ) − f ( x ) ≤ 0 and similarly ∇ g ( x ) > ( x ∗ − x ) ≤ 0 . 3. Supp ose f , g are γ -quasar-con vex with respect to the same p oin t x ∗ , and f ( x ∗ ) = g ( x ∗ ) = 0 . Then h ( x ) = f ( x ) g ( x ) is also γ -quasar-conv ex with resp ect to x ∗ . • Pr o of : Using Lemma 11 and the fact that f , g are nonnegativ e, h ( tx ∗ + (1 − t ) x ) = f ( tx ∗ + (1 − t ) x ) g ( tx ∗ + (1 − t ) x ) ≤ (1 − γ t ) f ( x ) · (1 − γ t ) g ( x ) = (1 + ( γ t ) 2 − 2 γ t ) h ( x ) for all t ∈ [0 , 1] . As γ t ∈ [0 , 1] , ( γ t ) 2 ≤ γ t , so (1 + ( γ t ) 2 − 2 γ t ) h ( x ) ≤ (1 − γ t ) h ( x ) b y nonnegativit y of h . Applying Lemma 11 yields the result. 4. Let X b e a b ounded star-con vex set with C 1 b oundary and star center x ∗ = 0 . Let f : R → R b e γ -quasar-con vex with resp ect to the p oin t 0, with f (0) = 0 , and let g ( x ) be an arbitrary C 1 nonnegativ e function defined on the b oundary of X . F or each p oint x 6 = x ∗ ∈ X , let P ( x ) b e the (unique) intersection of the boundary of X with the ray from x ∗ to x , and let P ( x ∗ ) = x ∗ . 
Then the function h ( x ) = f ( k x k ) · g ( P ( x )) is nonnegative, C 1 , and γ -quasar-con vex on X with resp ect to x ∗ = 0 . (Note: The rightmost function plotted in Figure 1 was constructed in this manner, where X is the unit circle [so P ( x ) = x k x k ], f ( x ) = x 2 1+ x 2 , and g w as a randomly generated linear combination of exponentiated high-frequency trigonometric functions.) 41 • Pr o of : Nonnegativity of f and g implies that of h . The prop erties of X imply that lim x → 0 h ( x ) = 0 = h ( 0 ) , so the fact that f , g ∈ C 1 implies that h is also C 1 . Also, for any x 6 = x ∗ and an y t ∈ [0 , 1) , h ( tx ∗ + (1 − t ) x ) = h ((1 − t ) x ) = f ( k (1 − t ) x k ) · g ( P ((1 − t ) x )) = f ( t · 0 + (1 − t ) · k x k ) · g ( P ( x )) ≤ (1 − γ t ) f ( k x k ) · g ( P ( x )) = (1 − γ t ) h ( x ) = γ th ( x ∗ ) + (1 − γ t ) h ( x ) . T o obtain the preceding inequalities, we used Lemma 11 for f , and the fact that P ( tx ) = P ( x ) for an y t ∈ (0 , 1] , since the ray from x ∗ to tx also passes through x . Finally , it is trivially true that h ( tx ∗ + (1 − t ) x ) ≤ γ th ( x ∗ ) + (1 − γ t ) h ( x ) when x = x ∗ or when t = 1 , so applying Lemma 11 yields the result. E Lo w er Bound Pro ofs In this section, we use 0 to denote a vector with all en tries equal to 0, and 1 to denote a vector with all entries equal to 1. E.1 Pro of of Lemma 8 Before we prov e Lemma 8, we pro ve t wo useful results related to the prop erties of q and Υ . F or con venience, these functions are restated b elo w: Υ( θ ) , 120 Z θ 1 t 2 ( t − 1) 1 + t 2 dt q ( x ) , 1 4 ( x 1 − 1) 2 + 1 4 T − 1 X i =1 ( x i − x i +1 ) 2 . Observ ation 7 q is c onvex and 2 -smo oth with minimizer x ∗ = 1 . A lso, for any 1 ≤ j 1 < j 2 ≤ T , q ( x ) = 1 2 ∇ q ( x ) > ( x − x ∗ ) ≥ max  1 4 ( x 1 − 1) 2 , ( x j 1 − x j 2 ) 2 4( j 2 − j 1 )  . Pro of Con vexit y and 2 -smo othness of q follo w from definitions. It is easy to see that q is alw ays nonnegativ e and q ( 1 ) = 0 , so 1 minimizes q . In fact 1 is the unique minimizer, since q is strictly p ositiv e for all nonconstant v ectors and all vectors with x 1 6 = 1 . Notice that as q is a con vex quadratic, q ( x ) = 1 2 ( x − x ∗ ) > ∇ 2 q ( x )( x − x ∗ ) where ∇ 2 q ( x ) is a constant matrix. Therefore ∇ q ( x ) = ∇ 2 q ( x )( x − x ∗ ) . It follows that q ( x ) = 1 2 ∇ q ( x ) > ( x − x ∗ ) . By definition q ( x ) ≥ 1 4 ( x 1 − 1) 2 . F urthermore, 1 j 2 − j 1 P j 2 i = j 1 ( x i − x i +1 ) 2 ≥  1 j 2 − j 1 P j 2 i = j 1 ( x i − x i +1 )  2 = ( x j 1 − x j 2 ) 2 ( j 2 − j 1 ) 2 , where the inequalit y uses that the exp ectation of the square of a random v ariable is greater than the square of its expectation. The result follo ws. Prop erties of Υ that w e will use are listed b elo w. 42 Lemma 12 The function Υ satisfies the fol lowing. 1. Υ 0 (0) = Υ 0 (1) = 0 . 2. F or al l θ ≤ 1 , Υ 0 ( θ ) ≤ 0 , and for al l θ ≥ 1 , Υ 0 ( θ ) ≥ 0 . 3. F or al l θ ∈ R we have Υ( θ ) ≥ Υ(1) = 0 , and Υ(0) ≤ 10 . 4. Υ 0 ( θ ) < − 1 for al l θ ∈ ( −∞ , − 0 . 1] ∪ [0 . 1 , 0 . 9] . 5. Υ is 180 -smo oth. 6. F or al l θ ∈ R we have Υ( θ ) ≤ min { 30 θ 4 − 40 θ 3 + 10 , 60( θ − 1) 2 } , and Υ(0) ≥ 5 . 7. F or al l θ 6∈ ( − 0 . 1 , 0 . 1) we have 40( θ − 1)Υ 0 ( θ ) ≥ Υ( θ ) . Pro of Prop erties 1-4 w ere prov ed in [14, Lemma 2]. Pr op erty 5. | Υ 00 ( θ ) | = 120    θ ( θ 3 +3 θ − 2) (1+ θ 2 ) 2    ≤ 120 · 3 2 = 180 for all θ ∈ R . Th us, for an y θ 1 , θ 2 ∈ R , | Υ 0 ( θ 1 ) − Υ 0 ( θ 2 ) | ≤ max θ ∈ [ θ 1 ,θ 2 ] | Υ 00 ( θ ) | · | θ 1 − θ 2 | ≤ 180 | θ 1 − θ 2 | . 
Properties of $\Upsilon$ that we will use are listed below.

Lemma 12. The function $\Upsilon$ satisfies the following.
1. $\Upsilon'(0) = \Upsilon'(1) = 0$.
2. For all $\theta \le 1$, $\Upsilon'(\theta) \le 0$, and for all $\theta \ge 1$, $\Upsilon'(\theta) \ge 0$.
3. For all $\theta \in \mathbb{R}$ we have $\Upsilon(\theta) \ge \Upsilon(1) = 0$, and $\Upsilon(0) \le 10$.
4. $\Upsilon'(\theta) < -1$ for all $\theta \in (-\infty, -0.1] \cup [0.1, 0.9]$.
5. $\Upsilon$ is $180$-smooth.
6. For all $\theta \in \mathbb{R}$ we have $\Upsilon(\theta) \le \min\{30\theta^4 - 40\theta^3 + 10,\ 60(\theta-1)^2\}$, and $\Upsilon(0) \ge 5$.
7. For all $\theta \notin (-0.1, 0.1)$ we have $40(\theta - 1)\Upsilon'(\theta) \ge \Upsilon(\theta)$.

Proof: Properties 1-4 were proved in [14, Lemma 2].

Property 5. $|\Upsilon''(\theta)| = 120\left|\frac{\theta(\theta^3 + 3\theta - 2)}{(1+\theta^2)^2}\right| \le 120 \cdot \frac{3}{2} = 180$ for all $\theta \in \mathbb{R}$. Thus, for any $\theta_1, \theta_2 \in \mathbb{R}$, $|\Upsilon'(\theta_1) - \Upsilon'(\theta_2)| \le \max_{\theta \in [\theta_1, \theta_2]}|\Upsilon''(\theta)| \cdot |\theta_1 - \theta_2| \le 180\,|\theta_1 - \theta_2|$.

Property 6. We have $\Upsilon(0) = 120\int_0^1 \frac{t^2(1-t)}{1+t^2}\,dt \ge 120\int_0^1 \frac{t^2(1-t)}{2}\,dt = \frac{120}{2 \cdot 12} = 5$. For all $\theta \in \mathbb{R}$ we have $\Upsilon(\theta) = 120\int_1^\theta \frac{t^2(t-1)}{1+t^2}\,dt \le 120\int_1^\theta t^2(t-1)\,dt = 120\left(\left(\tfrac{\theta^4}{4} - \tfrac{\theta^3}{3}\right) - \left(\tfrac14 - \tfrac13\right)\right) = 30\theta^4 - 40\theta^3 + 10$. In addition, since $\frac{t^2}{1+t^2} \le 1$ for all $t$, we have for all $\theta \in \mathbb{R}$ that $\Upsilon(\theta) \le 120\int_1^\theta (t-1)\,dt = 120\cdot\frac{(\theta-1)^2}{2} = 60(\theta-1)^2$.

Property 7. If $\theta \in (-\infty, -1.0] \cup [1.0, \infty)$ then $\frac{\theta^2}{1+\theta^2} \ge \frac12$, so by Property 6 we have
\[
\Upsilon(\theta) + 40(1-\theta)\Upsilon'(\theta) \le 60(\theta-1)^2 - 40 \cdot 120\,\frac{\theta^2(\theta-1)^2}{1+\theta^2} \le 60(\theta-1)^2 - 40\cdot 60(\theta-1)^2 = -60\cdot 39\,(\theta-1)^2 \le 0.
\]
Alternatively, if $\theta \in [-1.0, -0.1] \cup [0.1, 1.0]$ then $\frac{1}{1+\theta^2} \ge \frac12$, so by Property 6 we have
\[
\Upsilon(\theta) + 40(1-\theta)\Upsilon'(\theta) \le 10 + 30\theta^4 - 40\theta^3 - 40\cdot 120\,\frac{\theta^2(\theta-1)^2}{1+\theta^2} \le 10\left(1 + \theta^2\big(3\theta^2 - 4\theta - 240(\theta-1)^2\big)\right) = 10\left(1 - 237\theta^4 + 476\theta^3 - 240\theta^2\right) = 10\,P(\theta),
\]
where we define $P(\theta) \triangleq 1 - 237\theta^4 + 476\theta^3 - 240\theta^2$. Observe that $P'(\theta) = -12\theta(40 - 119\theta + 79\theta^2)$ has exactly three roots: at $\theta = 0$, $\theta = 1$ and $\theta = 40/79$. Furthermore, at $\theta = 1$, $\theta = 40/79$ and $\theta = 0.1$ we have $P(\theta) \le 0$, which implies $P(\theta) \le 0$ for $\theta \in [0.1, 1]$. We conclude that $\Upsilon(\theta) + 40(1-\theta)\Upsilon'(\theta) \le 0$ for $\theta \in [0.1, 1]$. In addition, $P(\theta)$ is negative while $P'(\theta)$ is positive at $\theta = -0.1$, which means that $P(\theta)$, and thus $\Upsilon(\theta) + 40(1-\theta)\Upsilon'(\theta)$, are also negative on $[-1.0, -0.1]$.
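Since Lemma 12 concerns a single explicit scalar function, its properties can also be spot-checked numerically. Below is a minimal sketch of ours: it evaluates $\Upsilon$ by numerical quadrature (`scipy.integrate.quad`) and $\Upsilon'(\theta) = \frac{120\,\theta^2(\theta-1)}{1+\theta^2}$ in closed form, then checks properties 3, 4, 6, and 7 on a grid; the grid and the tolerances are arbitrary choices.

```python
import numpy as np
from scipy.integrate import quad

def Upsilon(theta):
    # Upsilon(theta) = 120 * integral_1^theta t^2 (t - 1) / (1 + t^2) dt
    val, _ = quad(lambda t: t * t * (t - 1.0) / (1.0 + t * t), 1.0, theta)
    return 120.0 * val

def dUpsilon(theta):
    return 120.0 * theta * theta * (theta - 1.0) / (1.0 + theta * theta)

for th in np.linspace(-3.0, 3.0, 1201):
    u, du = Upsilon(th), dUpsilon(th)
    assert u >= -1e-9                                                  # property 3
    assert u <= min(30*th**4 - 40*th**3 + 10, 60*(th - 1)**2) + 1e-7   # property 6
    if th <= -0.1 or 0.1 <= th <= 0.9:
        assert du < -1.0 + 1e-9                                        # property 4
    if abs(th) >= 0.1:
        assert 40.0 * (th - 1.0) * du >= u - 1e-7                      # property 7
assert 5.0 - 1e-7 <= Upsilon(0.0) <= 10.0 + 1e-7                       # properties 6 and 3
print("Lemma 12 checks passed")
```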
In order to prove Lemma 8, we first prove an "unscaled version" in Lemma 7. This is the critical and most difficult part of the proof of the result; some intuition is provided in Section 4.

Lemma 7. Let $\sigma \in (0, 10^{-4}]$ and $T \in [\sigma^{-1/2}, \infty) \cap \mathbb{Z}$. The function $\bar f_{T,\sigma}$ is $\frac{1}{100T\sqrt{\sigma}}$-quasar-convex and $3$-smooth, with unique minimizer $x^* = \mathbf{1}$. Furthermore, if $x_t = 0$ for all $t = \lceil T/2 \rceil, \ldots, T$, then $\bar f_{T,\sigma}(x) - \bar f_{T,\sigma}(\mathbf{1}) \ge 2T\sigma$.

Proof: Since $\sigma \in (0, 10^{-4}]$, $\Upsilon$ is $180$-smooth, and $q$ is $2$-smooth, we deduce that $\bar f_{T,\sigma}$ is $3$-smooth. By Observation 7 and Lemma 12.3 we deduce $\bar f_{T,\sigma}(\mathbf{1}) = 0 < \bar f_{T,\sigma}(x)$ for all $x \ne \mathbf{1}$. Therefore, $x^* = \mathbf{1}$ is the unique minimizer of $\bar f_{T,\sigma}$. Now, we will show $\bar f_{T,\sigma}$ is $\frac{1}{100T\sqrt{\sigma}}$-quasar-convex, i.e. that
\[
\nabla \bar f_{T,\sigma}(x)^\top (x - \mathbf{1}) \ge \frac{\bar f_{T,\sigma}(x) - \bar f_{T,\sigma}(\mathbf{1})}{100T\sqrt{\sigma}} \quad \text{for all } x \in \mathbb{R}^T.
\]
Define
\[
\mathcal{A} \triangleq \{i : x_i \in (-\infty, -0.1] \cup (0.9, \infty)\}, \qquad
\mathcal{B} \triangleq \{i : x_i \in (-0.1, 0.1)\}, \qquad
\mathcal{C} \triangleq \{i : x_i \in [0.1, 0.9]\}.
\]
First, we derive two useful inequalities. By Observation 7 and the fact that $\Upsilon'(x_i) \le 0$ for $i \in \mathcal{B}$,
\[
\nabla \bar f_{T,\sigma}(x)^\top (x - \mathbf{1}) = \nabla q(x)^\top (x - \mathbf{1}) + \sigma \sum_{i \in \mathcal{A}\cup\mathcal{B}\cup\mathcal{C}} (x_i - 1)\Upsilon'(x_i) \ge 2q(x) + \sigma \sum_{i \in \mathcal{A}\cup\mathcal{C}} (x_i - 1)\Upsilon'(x_i). \tag{19}
\]
By Lemma 12.2 and 12.6 we deduce $\sum_{i \in \mathcal{B}\cup\mathcal{C}} \Upsilon(x_i) \le |\mathcal{B}\cup\mathcal{C}|\,\Upsilon(-0.1) \le 11T$, so it follows that $\bar f_{T,\sigma}(x) \le q(x) + 11T\sigma + \sigma\sum_{i\in\mathcal{A}}\Upsilon(x_i)$. Therefore, using $T \ge \sigma^{-1/2}$ and nonnegativity of $\Upsilon$ and $q$, we have
\[
\frac{\bar f_{T,\sigma}(x) - \bar f_{T,\sigma}(\mathbf{1})}{100T\sqrt{\sigma}} = \frac{\bar f_{T,\sigma}(x)}{100T\sqrt{\sigma}} \le \frac{11T\sigma}{100T\sqrt{\sigma}} + \frac{\sigma}{100T\sqrt{\sigma}}\sum_{i\in\mathcal{A}}\Upsilon(x_i) + \frac{q(x)}{100T\sqrt{\sigma}} \le \frac{11}{100}\sigma^{1/2} + \frac{\sigma}{100}\sum_{i\in\mathcal{A}}\Upsilon(x_i) + \frac{1}{100}q(x) \le \frac{11}{100}\sigma^{1/2} + \frac{\sigma}{40}\sum_{i\in\mathcal{A}}\Upsilon(x_i) + q(x). \tag{20}
\]
We now consider three possible cases for the values of $x$.

1. Consider the case that $x_1 \notin [0.9, 1.1]$. We have
\[
\nabla \bar f_{T,\sigma}(x)^\top(x - \mathbf{1}) \ge 2q(x) + \frac{\sigma}{40}\sum_{i\in\mathcal{A}\cup\mathcal{C}}\Upsilon(x_i) \ge \frac{0.1^2}{4} + q(x) + \frac{\sigma}{40}\sum_{i\in\mathcal{A}\cup\mathcal{C}}\Upsilon(x_i) = \frac{1}{\sqrt{10^4\sigma}}\cdot\frac{\sqrt{\sigma}}{4} + \frac{\sigma}{40}\sum_{i\in\mathcal{A}\cup\mathcal{C}}\Upsilon(x_i) + q(x) \ge \frac{\sqrt{\sigma}}{4} + \frac{\sigma}{40}\sum_{i\in\mathcal{A}\cup\mathcal{C}}\Upsilon(x_i) + q(x) \ge \frac{\bar f_{T,\sigma}(x) - \bar f_{T,\sigma}(\mathbf{1})}{100T\sqrt{\sigma}},
\]
where the first inequality uses (19) and Lemma 12.7, the second inequality uses Observation 7 and $x_1 \notin [0.9, 1.1]$, the penultimate inequality uses $\sigma \in (0, 10^{-4}]$, and the final inequality uses (20) and nonnegativity of $\Upsilon$.

2. Consider the case that $\mathcal{B} = \emptyset$. By Lemma 12.7 and convexity of $q$,
\[
\nabla \bar f_{T,\sigma}(x)^\top(x - \mathbf{1}) = \nabla q(x)^\top(x - \mathbf{1}) + \sigma\sum_{i\in\mathcal{A}\cup\mathcal{C}}(x_i - 1)\Upsilon'(x_i) \ge q(x) - q(\mathbf{1}) + \frac{\sigma}{40}\sum_{i\in\mathcal{A}\cup\mathcal{C}}\Upsilon(x_i) = \frac{1}{40}\left(q(x) + \sigma\sum_{i=1}^T \Upsilon(x_i)\right) - \bar f_{T,\sigma}(\mathbf{1}) + \frac{39}{40}q(x) \ge \frac{\bar f_{T,\sigma}(x) - \bar f_{T,\sigma}(\mathbf{1})}{40} \ge \frac{\bar f_{T,\sigma}(x) - \bar f_{T,\sigma}(\mathbf{1})}{100T\sqrt{\sigma}}.
\]

3. Suppose cases 1-2 do not hold, i.e., $x_1 \in [0.9, 1.1]$ and $\mathcal{B} \ne \emptyset$. Then there exist some $m \ge 1$ and $j \in \{1, \ldots, T - m\}$ such that $x_j \ge 0.9$, $x_{j+m} \le 0.1$, and $x_i \in \mathcal{C}$ for all $i \in \{j+1, \ldots, j+m-1\}$. Then,
\[
\nabla \bar f_{T,\sigma}(x)^\top(x - \mathbf{1}) \ge q(x) + \sigma\sum_{i\in\mathcal{A}\cup\mathcal{C}}(x_i - 1)\Upsilon'(x_i) + q(x) \ge \frac{0.8^2}{4m} + \sigma\sum_{i\in\mathcal{C}}(x_i - 1)\Upsilon'(x_i) + \sigma\sum_{i\in\mathcal{A}}(x_i - 1)\Upsilon'(x_i) + q(x) \ge \frac{0.8^2}{4m} + 0.1\,\sigma(m-2) + \frac{\sigma}{40}\sum_{i\in\mathcal{A}}\Upsilon(x_i) + q(x) \ge \frac{0.16}{\sqrt{1.6}}\,\sigma^{1/2} + \frac{\sigma}{40}\sum_{i\in\mathcal{A}}\Upsilon(x_i) + q(x) \ge \frac{\bar f_{T,\sigma}(x) - \bar f_{T,\sigma}(\mathbf{1})}{100T\sqrt{\sigma}},
\]
where the first inequality holds by (19), the second inequality uses Observation 7, the third inequality uses Lemma 12.4 and 12.7, the fourth inequality uses that $m = \sqrt{1.6}\,\sigma^{-1/2} \ge 2$ minimizes the previous expression, and the final inequality uses (20) [and the fact that $0.16/\sqrt{1.6} > 0.11$].

Finally, suppose $x_t = 0$ for all $t = \lceil T/2 \rceil, \ldots, T$. Then we have $\bar f_{T,\sigma}(x) - \bar f_{T,\sigma}(\mathbf{1}) = \bar f_{T,\sigma}(x) \ge \sigma\lceil T/2 \rceil\,\Upsilon(0) \ge 2T\sigma$, where the first inequality uses that $\Upsilon \ge 0$ and $q \ge 0$, and the last inequality uses that $T \ge 1$ and $\Upsilon(0) \ge 5$.
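The proof above works with the decomposition $\bar f_{T,\sigma}(x) = q(x) + \sigma\sum_{i=1}^T \Upsilon(x_i)$ (this is the form underlying equations (19)-(20)). Assuming that form, the sketch below (ours, not the paper's) instantiates the hard function for one admissible pair $(T, \sigma)$ and checks the quasar-convexity inequality of Lemma 7 at random points, as well as the optimality-gap bound when the tail coordinates are zero; the parameters and the sampling are arbitrary choices.

```python
import numpy as np
from scipy.integrate import quad

def Upsilon(theta):
    val, _ = quad(lambda t: t * t * (t - 1.0) / (1.0 + t * t), 1.0, theta)
    return 120.0 * val

def dUpsilon(theta):
    return 120.0 * theta * theta * (theta - 1.0) / (1.0 + theta * theta)

T, sigma = 100, 1e-4                          # satisfies T >= sigma^{-1/2}
gamma = 1.0 / (100.0 * T * np.sqrt(sigma))    # quasar-convexity parameter from Lemma 7

def q(x):
    return 0.25 * (x[0] - 1.0) ** 2 + 0.25 * np.sum((x[:-1] - x[1:]) ** 2)

def grad_q(x):
    g = np.zeros_like(x)
    g[0] = 0.5 * (x[0] - 1.0)
    d = x[:-1] - x[1:]
    g[:-1] += 0.5 * d
    g[1:] -= 0.5 * d
    return g

def f_bar(x):
    return q(x) + sigma * sum(Upsilon(xi) for xi in x)

def grad_f_bar(x):
    return grad_q(x) + sigma * np.array([dUpsilon(xi) for xi in x])

rng = np.random.default_rng(2)
one = np.ones(T)
f_star = f_bar(one)                           # equals 0
for _ in range(50):
    x = rng.uniform(-2.0, 2.0, size=T)
    lhs = grad_f_bar(x) @ (x - one)
    rhs = gamma * (f_bar(x) - f_star)
    assert lhs >= rhs - 1e-8

# The "resisting" property behind the lower bound: zeroing the tail keeps the gap large.
assert f_bar(np.zeros(T)) - f_star >= 2 * T * sigma
print("Lemma 7 checks passed; gamma =", gamma)
```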
With Lemma 7 in hand, we are able to establish Lemma 8, which is a scaled version of Lemma 7.

Lemma 8. Let $\epsilon \in (0, \infty)$, $\gamma \in (0, 10^{-2}]$, $T = \lceil 10^{-3}\gamma^{-1}L^{1/2}R\,\epsilon^{-1/2} \rceil$, and $\sigma = \frac{1}{10^4 T^2 \gamma^2}$, and assume $L^{1/2}R\,\epsilon^{-1/2} \ge 10^3$. Consider the function
\[
\hat f(x) \triangleq \frac{1}{3}LR^2T^{-1} \cdot \bar f_{T,\sigma}\big(xT^{1/2}R^{-1}\big). \tag{11}
\]
This function is $L$-smooth and $\gamma$-quasar-convex, and its minimizer $x^*$ is unique and has $\|x^*\| = R$. Furthermore, if $x_t = 0$ for all $t \in \mathbb{Z} \cap [T/2, T]$, then $\hat f(x) - \inf_z \hat f(z) > \epsilon$.

Proof: We have $\sigma^{-1/2} = 10^2 T\gamma \le T$ and $\sigma = \frac{1}{10^4T^2\gamma^2} \le \frac{10^2}{(L^{1/2}R\,\epsilon^{-1/2})^2} \le 10^{-4}$, so $\bar f_{T,\sigma}$ satisfies the conditions of Lemma 7. Let us verify the properties of $\hat f$. The optimum of $\bar f_{T,\sigma}$ is $\mathbf{1}$, but after this rescaling it becomes $x^* = \frac{R}{\sqrt{T}}\mathbf{1}$, for which $\|x^*\| = R$. For all $x, y \in \mathbb{R}^T$, by $3$-smoothness of $\bar f_{T,\sigma}$ we have
\[
\big\|\nabla \hat f(x) - \nabla \hat f(y)\big\| = \frac{1}{3}\big(LR^2T^{-1}\big)\big(T^{1/2}R^{-1}\big)\,\big\|\nabla \bar f_{T,\sigma}(xT^{1/2}R^{-1}) - \nabla \bar f_{T,\sigma}(yT^{1/2}R^{-1})\big\| \le \big(LR^2T^{-1}\big)\big(T^{1/2}R^{-1}\big)^2\|x - y\| = L\|x - y\|.
\]
Therefore $\hat f$ is $L$-smooth. By the definition of $\sigma$ we have $\frac{1}{100T\sqrt{\sigma}} = \gamma$, so $\bar f_{T,\sigma}$ is $\gamma$-quasar-convex. As quasar-convexity is invariant to scaling (Observation 6), we deduce that $\hat f$ is $\gamma$-quasar-convex as well. Finally, given $x^{(k)}_t = 0$ for $t = \lceil T/2 \rceil, \ldots, T$, we have
\[
\hat f(x^{(k)}) - \inf_z \hat f(z) \ge 2T\sigma \cdot \frac{LR^2}{3T} = \frac{2}{3}LR^2\sigma = \frac{2}{3}\big(10^{-2}\gamma^{-1}L^{1/2}R\,T^{-1}\big)^2 \ge \frac{50\epsilon}{3} > \epsilon,
\]
where the first transition uses Lemma 7, the third transition uses that $\sigma = \frac{1}{10^4T^2\gamma^2}$, and the last transition uses that $T = \lceil 10^{-3}\gamma^{-1}L^{1/2}R\,\epsilon^{-1/2}\rceil \le 2\cdot 10^{-3}\gamma^{-1}L^{1/2}R\,\epsilon^{-1/2}$, since $10^{-3}\gamma^{-1}L^{1/2}R\,\epsilon^{-1/2} \ge 1$.

E.2 Proof of Theorem 3

Before proving Theorem 3 we recap definitions that were originally provided in [13].

Definition 4. A function $f$ is a first-order zero-chain if for every $x \in \mathbb{R}^n$,
\[
x_i = 0 \ \forall i \ge t \ \Rightarrow\ \nabla_i f(x) = 0 \ \forall i > t.
\]

Definition 5. An algorithm is a first-order zero-respecting algorithm (FOZRA) if its iterates $x^{(0)}, x^{(1)}, \ldots \in \mathbb{R}^n$ satisfy, for all $i \in \{1, \ldots, n\}$ and all $t$,
\[
\nabla_i f(x^{(k)}) = 0 \ \forall k \le t \ \Rightarrow\ x^{(t+1)}_i = 0.
\]

Definition 6. An algorithm $\mathsf{A}$ is a first-order deterministic algorithm (FODA) if there exists a sequence of functions $\mathsf{A}_k$ such that the algorithm's iterates satisfy
\[
x^{(k+1)} = \mathsf{A}_k\big(x^{(0)}, \ldots, x^{(k)}, \nabla f(x^{(0)}), \ldots, \nabla f(x^{(k)})\big)
\]
for all $k \in \mathbb{N}$, input functions $f$, and starting points $x^{(0)}$.

Observation 8. Consider $\epsilon > 0$, a function class $\mathcal{F}$, and $K \in \mathbb{N}$. If $f : \mathbb{R}^n \to \mathbb{R}$ satisfies
1. $f$ is a first-order zero-chain,
2. $f$ belongs to the function class $\mathcal{F}$, i.e. $f \in \mathcal{F}$, and
3. $f(x) - \inf_z f(z) \ge \epsilon$ for every $x$ such that $x_t = 0$ for all $t \in \{K, K+1, \ldots, n\}$;
then it takes at least $K$ iterations for any FOZRA to find an $\epsilon$-optimal solution of $f$.

Proof: Cosmetic modification of the proof of Observation 2 in [13].

Theorem 3. Let $\epsilon, R, L \in (0, \infty)$, $\gamma \in (0, 1]$, and assume $L^{1/2}R\,\epsilon^{-1/2} \ge 1$. Let $\mathcal{F}$ denote the set of $L$-smooth functions that are $\gamma$-quasar-convex with respect to some point with Euclidean norm less than or equal to $R$. Then, given any deterministic first-order method, there exists a function $f \in \mathcal{F}$ such that the method requires at least $\Omega(\gamma^{-1}L^{1/2}R\,\epsilon^{-1/2})$ gradient evaluations to find an $\epsilon$-optimal point of $f$.

Proof: Applying Lemma 8 and Observation 8 implies this result for any first-order zero-respecting method. Applying Proposition 1 from [13], which states that lower bounds for first-order zero-respecting methods also apply to deterministic first-order methods, gives the result.

E.3 Lower Bounds via Reduction

Remark 3. If we have an algorithm that can approximately minimize a strongly quasar-convex function, we can use it to approximately minimize a quasar-convex function.

Proof: This follows from the fact that if $f$ is $\gamma$-quasar-convex with respect to a minimizer $x^*$, then the function $g_\epsilon(x) = f(x) + \frac{\epsilon}{2}\big\|x - x^{(0)}\big\|^2$ is $(\gamma, \epsilon)$-strongly quasar-convex with respect to $x^*$ (recall this terminology from Remark 1). Note that $x^*$ is not necessarily a minimizer of $g_\epsilon$, but $g_\epsilon(x^*) \le f(x^*) + \epsilon R^2/2$, where $R = \|x^{(0)} - x^*\|$. Therefore, if we obtain a point $\tilde x$ with $g_\epsilon(\tilde x) \le \inf_x g_\epsilon(x) + \epsilon R^2/2$, then $f(\tilde x) \le g_\epsilon(\tilde x) \le g_\epsilon(x^*) + \epsilon R^2/2 \le f(x^*) + \epsilon R^2$.
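The reduction in Remark 3 is mechanical enough to state as code. The sketch below is ours: it wraps a quasar-convex objective with the quadratic regularizer $\frac{\lambda}{2}\|x - x^{(0)}\|^2$, $\lambda = \epsilon/R^2$, and records the target suboptimality $\epsilon/2$ to request from a solver for strongly quasar-convex problems. The function name `regularize` is hypothetical, and plain gradient descent stands in for that hypothetical inner solver in the toy usage.

```python
import numpy as np

def regularize(f, grad_f, x0, R, eps):
    """Remark 3 reduction (sketch): for gamma-quasar-convex f, the objective
    g(x) = f(x) + (lam/2)||x - x0||^2 with lam = eps / R^2 is (gamma, lam)-strongly
    quasar-convex, and any point within eps/2 of inf g is an eps-approximate
    minimizer of f.  Here R is (an upper bound on) ||x0 - x*||."""
    lam = eps / R ** 2
    g = lambda x: f(x) + 0.5 * lam * np.sum((x - x0) ** 2)
    grad_g = lambda x: grad_f(x) + lam * (x - x0)
    return g, grad_g, eps / 2.0   # objective, gradient, target accuracy for the inner solver

# Toy usage: f(x) = ||x||^2 is convex, hence 1-quasar-convex about its minimizer 0.
f = lambda x: np.sum(x ** 2)
grad_f = lambda x: 2.0 * x
x0 = np.full(4, 3.0)
g, grad_g, tol = regularize(f, grad_f, x0, R=np.linalg.norm(x0), eps=1e-2)

x = x0.copy()
for _ in range(2000):             # plain gradient descent stands in for the inner solver
    x = x - 0.25 * grad_g(x)
assert f(x) <= f(np.zeros(4)) + 1e-2   # x is an eps-approximate minimizer of f
print("f(x) =", f(x))
```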
Remark 4. Given any deterministic first-order method, there exists an $L$-smooth, $(\gamma, \mu)$-strongly quasar-convex function $f$ such that the method requires at least $\Omega\big(\max\{\gamma^{-1}L^{1/2}\mu^{-1/2},\ \gamma^{-1/2}L^{1/2}\mu^{-1/2}\log_+(\epsilon^{-1})\}\big)$ gradient evaluations to find an $\epsilon$-optimal point of $f$.

Proof: Suppose there were a deterministic first-order method for minimizing $L$-smooth $(\gamma, \mu)$-strongly quasar-convex functions which required $o(\gamma^{-1}\kappa^{1/2})$ gradient evaluations to find an $\epsilon$-minimizer, where $\kappa = \frac{L}{\mu}$. Let $f$ be an $L$-smooth function that is $\gamma$-quasar-convex with respect to a minimizer $x^*$, let $\epsilon > 0$, and let $R = \|x^{(0)} - x^*\|$. Then the function $g_{\epsilon/R^2}$ is $(L + \epsilon/R^2)$-smooth and $(\gamma, \epsilon/R^2)$-strongly quasar-convex with respect to $x^*$, as shown in Remark 3, so the condition number of $g_{\epsilon/R^2}$ is $\kappa = 1 + \frac{LR^2}{\epsilon}$. Thus, we could apply the method to find an $\frac{\epsilon}{2}$-minimizer of $g_{\epsilon/R^2}$, and it would do so using $o\big(\gamma^{-1}L^{1/2}R\,\epsilon^{-1/2}\big)$ gradient evaluations. But an $\frac{\epsilon}{2}$-minimizer of $g_{\epsilon/R^2}$ is an $\epsilon$-minimizer of $f$, as argued in Remark 3; thus, this violates the lower bound on the complexity of minimizing quasar-convex functions shown in Theorem 3.

To prove the second part of the lower bound, we first note that any $(\gamma, \mu)$-quasar-convex quadratic is also $(1, (2\gamma^{-1} - 1)^{-1}\mu)$-quasar-convex and thus $(1, \frac{\gamma\mu}{2})$-quasar-convex, and in fact $\frac{\gamma\mu}{2}$-strongly convex; this follows from the definitions. Thus, direct application of the $\Omega\big((L/\mu)^{1/2}\log_+(\epsilon^{-1})\big)$ lower bound on the complexity of finding an $\epsilon$-minimizer of an $L$-smooth $\mu$-strongly convex quadratic with a deterministic first-order method [43, Chapter 7] yields a lower bound of $\Omega\big(\gamma^{-1/2}(L/\mu)^{1/2}\log_+(\epsilon^{-1})\big)$ on the complexity of first-order minimization of $L$-smooth $(\gamma, \mu)$-quasar-convex functions.
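Finally, the mechanism behind Observation 8 and Theorem 3 can be seen directly by running a concrete zero-respecting method on the chain-structured hard instance. The sketch below is ours: it applies plain gradient descent, started at $\mathbf{0}$, to $q(x) + \sigma\sum_i \Upsilon(x_i)$ (the decomposition used in the proofs above; only the gradient is needed, and $\Upsilon'(0) = 0$), and verifies that after $k$ steps at most the first $k$ coordinates are nonzero. This is exactly the "one new coordinate per iteration" behavior that Observation 8 converts into an iteration lower bound. The dimension and step size are arbitrary choices.

```python
import numpy as np

T, sigma = 20, 1e-4

def grad_q(x):
    g = np.zeros_like(x)
    g[0] = 0.5 * (x[0] - 1.0)
    d = x[:-1] - x[1:]
    g[:-1] += 0.5 * d
    g[1:] -= 0.5 * d
    return g

def dUpsilon(theta):
    return 120.0 * theta * theta * (theta - 1.0) / (1.0 + theta * theta)

def grad_f_bar(x):
    return grad_q(x) + sigma * dUpsilon(x)    # dUpsilon broadcasts elementwise

x = np.zeros(T)
for k in range(1, T):
    x = x - 0.25 * grad_f_bar(x)              # gradient descent is zero-respecting
    nz = np.flatnonzero(np.abs(x) > 0.0)
    assert nz.size > 0 and nz.max() <= k - 1  # only coordinates 1..k can be nonzero yet
print("last coordinate after", T - 1, "steps:", x[-1])  # still exactly 0
```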
