Optimization of Convex Functions with Random Pursuit
Authors: Sebastian U. Stich, Christian L. Müller, Bernd Gärtner
Abstract. We consider unconstrained randomized optimization of convex objective functions. We analyze the Random Pursuit algorithm, which iteratively computes an approximate solution to the optimization problem by repeated optimization over a randomly chosen one-dimensional subspace. This randomized method only uses zeroth-order information about the objective function and does not need any problem-specific parametrization. We prove convergence and give convergence rates for smooth objectives, assuming that the one-dimensional optimization can be solved exactly or approximately by an oracle. A convenient property of Random Pursuit is its invariance under strictly monotone transformations of the objective function. It thus enjoys identical convergence behavior on a wider function class. To support the theoretical results we present extensive numerical performance results for Random Pursuit, two gradient-free algorithms recently proposed by Nesterov, and a classical adaptive step-size random search scheme. We also present an accelerated heuristic version of the Random Pursuit algorithm which significantly improves on standard Random Pursuit on all numerical benchmark problems. A general comparison of the experimental results reveals that (i) standard Random Pursuit is effective on strongly convex functions with moderate condition number, and (ii) the accelerated scheme is comparable to Nesterov's fast gradient method and outperforms adaptive step-size strategies.

Key words. continuous optimization, convex optimization, randomized algorithm, line search

AMS subject classifications. 90C25, 90C56, 68W20, 62L10

1. Introduction. Randomized zeroth-order optimization schemes were among the first algorithms proposed to numerically solve unconstrained optimization problems [1, 6, 34].
These methods are usually easy to implement, do not require gradient or Hessian information about the objective function, and comprise a randomized mechanism to iteratively generate new candidate solutions. In many areas of modern science and engineering, such methods are indispensable in the simulation (or black-box) optimization context, where higher-order information about the simulation output is not available or does not exist. Compared to deterministic zeroth-order algorithms such as direct search methods [21] or interpolation methods [8], randomized schemes often show faster and more robust performance on ill-conditioned benchmark problems [2] and certain real-world applications such as quantum control [5] and parameter estimation in systems biology networks [39]. While probabilistic convergence guarantees, even for non-convex objectives, are readily available for many randomized algorithms [41], provable convergence rates are often not known or unrealistically slow. Notable exceptions can be found in the literature on adaptive step-size random search (also known as Evolution Strategies) [4, 14], on Markov chain methods for volume estimation, rounding, and optimization [40], and in Nesterov's recent work on complexity bounds for gradient-free convex optimization [29].
Although Nesterov's algorithms are termed "gradient-free," their working mechanism does, in fact, rely on approximate directional derivatives that have to be available via a suitable oracle. We here relax this requirement and investigate a truly randomized gradient- and derivative-free optimization algorithm: Random Pursuit ($\mathrm{RP}_\mu$). The method comprises two very elementary primitives: a random direction generator and an (approximate) line search routine. We establish theoretical performance bounds for this algorithm on the unconstrained convex minimization problem

$\min f(x)$ subject to $x \in \mathbb{R}^n$,   (1.1)

where $f$ is a smooth convex function. We assume that there is a global minimum and that the curvature of the function $f$ can be bounded by a constant. Each iteration of Random Pursuit consists of two steps: a random direction is sampled uniformly at random from the unit sphere, and the next iterate is chosen so as to (approximately) minimize the objective function along this direction. This method ranges among the simplest possible optimization schemes, as it solely relies on two easy-to-implement primitives: a random direction generator and an (approximate) one-dimensional line search.

∗ The project CG Learning acknowledges the financial support of the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission, under FET-Open grant number 255827.
† Institute of Theoretical Computer Science, ETH Zürich, and Swiss Institute of Bioinformatics, sstich@inf.ethz.ch
‡ Institute of Theoretical Computer Science, ETH Zürich, and Swiss Institute of Bioinformatics, christian.mueller@inf.ethz.ch
§ Institute of Theoretical Computer Science, ETH Zürich, gaertner@inf.ethz.ch
A convenient feature of the algorithm is that it inherits invariance under strictly monotone transformations of the objective function from the line search oracle. The algorithm thus enjoys convergence guarantees even for non-convex objective functions that can be transformed into convex objectives via a suitable strictly monotone transformation.

Although Random Pursuit is fully gradient- and derivative-free, it can still be understood from the perspective of the classical gradient method. The gradient method (GM) is an iterative algorithm in which the current approximate solution $x_k \in \mathbb{R}^n$ is improved along the direction of the negative gradient with some step size $\lambda_k$:

$x_{k+1} = x_k + \lambda_k (-\nabla f(x_k)).$   (1.2)

When the descent direction is replaced by a random vector, the generic scheme reads

$x_{k+1} = x_k + \lambda_k u,$   (1.3)

where $u$ is a random vector distributed uniformly over the unit sphere. A crucial aspect of the performance of this randomized scheme is the determination of the step size. Rastrigin [34] studied the convergence of this scheme on quadratic functions for fixed step sizes $\lambda_k$, where only improving steps are accepted. Many authors observed that variable step-size methods yield faster convergence [24, 18]. Schumer and Steiglitz [36] were among the first to develop an effective step-size adaptation rule, based on maximizing the expected one-step progress on the sphere function. A similar analysis was independently obtained by Rechenberg for the (1+1)-Evolution Strategy (ES) [35]. Mutseniyeks and Rastrigin proposed to choose the step size so as to minimize the function value along the random direction [26]. This algorithm is identical to Random Pursuit with an exact line search. Convergence analyses on strongly convex functions have been provided by Krutikov [22] and Rappl [33].
Rappl proved linear convergence of $\mathrm{RP}_\mu$ without giving exact convergence rates. Krutikov showed linear convergence in the special case where the search directions are given by $n$ linearly independent vectors used in cyclic order. Karmanov [16, 17, 42] already conducted an analysis of Random Pursuit on general convex functions. Thus far, Karmanov's work has not been recognized by the optimization community, but his results are very close to the work presented here. We enhance Karmanov's results in a number of ways: (i) we prove expected convergence rates also under approximate line search; (ii) we show that continuous sampling from the unit sphere can be replaced with discrete sampling from the set $\{\pm e_i : i = 1, \ldots, n\}$ of signed unit vectors, without changing the expected convergence rates; (iii) we provide a large number of experimental results, showing that Random Pursuit is a competitive algorithm in practice; (iv) we introduce a heuristic improvement of Random Pursuit that is even faster on all our benchmark functions; (v) we point out that Random Pursuit can also be applied to a number of relevant non-convex functions, without sacrificing any theoretical or practical performance guarantees. On the other hand, while we prove fast convergence only in expectation, Karmanov's more intricate analysis also yields fast convergence with high probability.

Polyak [31] describes step-size rules for the closely related randomized gradient descent scheme

$x_{k+1} = x_k + \lambda_k \, \frac{f(x_k + \mu_k u) - f(x_k)}{\mu_k} \, u,$   (1.4)

where convergence is proved for $\mu_k \to 0$ but no convergence rates are established. Nesterov [29] studied different variants of method (1.4) and its accelerated versions for smooth and non-smooth optimization problems. He showed that scheme (1.4) is at most $O(n^2)$ times slower than the standard (sub-)gradient method.
The use of exact directional derivatives reduces the gap further to $O(n)$. For smooth problems the method is only $O(n)$ slower than the standard gradient method, and accelerated versions are $O(n^2)$ slower than fast gradient methods.

Kleiner et al. [20] studied a variant of algorithm (1.3) for unconstrained semidefinite programming: Random Conic Pursuit. There, each iteration comprises two steps: (i) the algorithm samples a rank-one matrix (not necessarily uniformly) at random; (ii) a two-dimensional optimization problem is solved that consists of finding the optimal linear combination of the rank-one matrix and the current semidefinite matrix. The solution determines the next iterate of the algorithm. In the case of trace-constrained semidefinite problems, only a one-dimensional line search is necessary. Kleiner and co-workers proved convergence of this algorithm when directions are chosen uniformly at random. The dependency between convergence rate and dimension is, however, not known. Nonetheless, their work greatly inspired our own efforts, which is also reflected in the name "Random Pursuit" for the algorithm under study.

The present article is structured as follows. In Section 2 we present the Random Pursuit algorithm with approximate line search. We introduce the necessary notation and formulate the assumptions on the objective function. In Section 3 we derive a number of useful results on the expectation of scaled random vectors. In Section 4 we calculate the expected one-step progress of Random Pursuit with approximate line search ($\mathrm{RP}_\mu$). We show that (besides some additive error term) this progress is by a factor of $O(n)$ worse than the one-step progress of the gradient method. These results allow us to derive the final convergence results in Section 5.
We show that $\mathrm{RP}_\mu$ matches the convergence rates of the standard gradient method up to a factor of $O(n)$: linear convergence on strongly convex functions and convergence rate $1/k$ for general convex functions. The linear convergence on strongly convex functions is best possible: for the sphere function our method meets the lower bound [15]. For strongly convex objective functions, the method is robust against small absolute or relative errors in the line search. In Section 6 we present numerical experiments on selected test problems. We compare $\mathrm{RP}_\mu$ with a fixed step-size gradient method, a gradient scheme with line search, Nesterov's random gradient scheme and its accelerated version [29], an adaptive step-size random search, and an accelerated heuristic version of $\mathrm{RP}_\mu$. In Section 7 we discuss the theoretical and numerical results as well as the present limitations of the scheme, which may be alleviated by more elaborate randomization primitives. We also provide a number of promising future research directions.

2. The Random Pursuit (RP) Algorithm. We consider problem (1.1), where $f$ is a differentiable convex function with bounded curvature (to be defined below). The algorithm $\mathrm{RP}_\mu$ is a variant of scheme (1.3) where the step sizes are determined by a line search. Formally, we define the following oracles.

Definition 2.1 (Line search oracle). For $x \in \mathbb{R}^n$, a convex function $f$, and a direction $u \in S^{n-1} = \{y \in \mathbb{R}^n : \|y\|_2 = 1\}$, a function $\mathrm{LS} : \mathbb{R}^n \times S^{n-1} \to \mathbb{R}$ with

$\mathrm{LS}(x, u) \in \arg\min_{h \in \mathbb{R}} f(x + hu)$   (2.1)

is called an exact line search oracle. (Here, the $\arg\min$ is not assumed to be unique, so we consider it as a set from which $\mathrm{LS}(x, u)$ selects a well-defined element.)
For accuracy $\mu \ge 0$, the functions $\mathrm{LSapprox}^{\mathrm{rel}}_\mu$ and $\mathrm{LSapprox}^{\mathrm{abs}}_\mu$ with

$\mathrm{LS}(x, u) - \mu \;\le\; \mathrm{LSapprox}^{\mathrm{abs}}_\mu(x, u) \;\le\; \mathrm{LS}(x, u) + \mu,$   (2.2)
$s \cdot \max\{0,\, 1 - \mu\} \cdot \mathrm{LS}(x, u) \;\le\; s \cdot \mathrm{LSapprox}^{\mathrm{rel}}_\mu(x, u) \;\le\; s \cdot \mathrm{LS}(x, u),$   (2.3)

where $s = \mathrm{sign}(\mathrm{LS}(x, u))$, are, respectively, absolute and relative approximate line search oracles. By $\mathrm{LSapprox}_\mu$ we denote either of the two.

This means that we allow an inexact line search to return a value $\tilde{h}$ close to an optimal value $h^* = \mathrm{LS}(x, u)$. To simplify subsequent calculations, we also require that $\tilde{h} \le h^*$ in the case of relative approximation (cf. (2.3)), but this requirement is not essential. As the optimization problem (2.1) cannot be solved exactly in most cases, we will describe and analyze our algorithm by means of the two latter approximation routines.

The formal definition of algorithm $\mathrm{RP}_\mu$ is shown in Algorithm 1. At iteration $k$ of the algorithm, a direction $u \in S^{n-1}$ is chosen uniformly at random and the next iterate $x_{k+1}$ is calculated from the current iterate $x_k$ as

$x_{k+1} := x_k + \mathrm{LSapprox}_\mu(x_k, u) \cdot u.$   (2.4)

Algorithm 1: Random Pursuit ($\mathrm{RP}_\mu$)
Input: a problem of the form (1.1); $N \in \mathbb{N}$: number of iterations; $x_0$: an initial iterate; $\mu > 0$: line search accuracy.
Output: approximate solution $x_N$ to (1.1).
1: for $k \leftarrow 0$ to $N - 1$ do
2:   choose $u_k$ uniformly at random from $S^{n-1}$
3:   set $x_{k+1} \leftarrow x_k + \mathrm{LSapprox}_\mu(x_k, u_k) \cdot u_k$
4: end for
5: return $x_N$

This algorithm only requires function evaluations if the line search $\mathrm{LSapprox}_\mu$ is implemented appropriately (see [9] and references therein). No additional first- or second-order information about the objective is needed. Note also that, besides the starting point, no further input parameters describing function properties (e.g., the curvature constant) are necessary. The actual running time will, however, depend on the specific properties of the objective function.
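As an illustration, Algorithm 1 can be sketched in a few lines of Python. The ternary-search routine below is just one possible stand-in for the oracle $\mathrm{LSapprox}_\mu$ (valid because the restriction of a convex $f$ to a line is convex); the search interval $[-10, 10]$ is an arbitrary assumption of this sketch, not part of the algorithm.

```python
import numpy as np

def line_search(phi, tol, lo=-10.0, hi=10.0):
    """Approximate 1-D minimization of a convex function phi on [lo, hi]
    by ternary search; plays the role of LSapprox_mu in Algorithm 1."""
    while hi - lo > tol:
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if phi(m1) < phi(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

def random_pursuit(f, x0, num_iters, mu=1e-5, seed=0):
    """Random Pursuit: sample a uniform direction on the unit sphere,
    then (approximately) minimize f along that direction."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        u = rng.standard_normal(x.size)
        u /= np.linalg.norm(u)                    # uniform on S^{n-1}
        h = line_search(lambda t: f(x + t * u), mu)
        x = x + h * u                             # step (2.4)
    return x

# usage: a strongly convex quadratic with minimizer x* = (-1, 2, -0.5)
c = np.array([1.0, -2.0, 0.5])
f = lambda x: 0.5 * np.dot(x, x) + np.dot(c, x)
x_star = random_pursuit(f, np.zeros(3), num_iters=200)
```

Only function values of $f$ are used, matching the zeroth-order setting of the paper.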
2.1. Discrete Sampling. As our analysis below reveals, the random vector $u_k$ enters the analysis only through expectations of the form $E[\langle x, u_k\rangle u_k]$ and $E[\|\langle x, u_k\rangle u_k\|^2]$. In Lemmas 3.3 and 3.4 we show that these expectations are the same for $u_k \sim S^{n-1}$ and $u_k \sim \{\pm e_i : i = 1, \ldots, n\}$, the set of signed unit vectors (here and in the following, the notation $x \sim A$ for a set $A$ denotes that $x$ is distributed according to the uniform distribution on $A$). It follows that continuous sampling from $S^{n-1}$ can be replaced with discrete sampling from $\{\pm e_i : i = 1, \ldots, n\}$ without affecting our guarantees on the expected running time. Under this modification, fast convergence still holds with high probability, but the bounds get worse [17].

2.2. Quasiconvex Functions. If $f$ and $g$ are functions, $g$ is called a strictly monotone transformation of $f$ if

$f(x) < f(y) \;\Leftrightarrow\; g(f(x)) < g(f(y)), \quad x, y \in \mathbb{R}^n.$

This implies that the distribution of $x_k$ in $\mathrm{RP}_\mu$ is the same for the function $f$ and the function $g \circ f$ whenever $g$ is a strictly monotone transformation of $f$. This follows from the fact that the result of the line search given in Definition 2.1 is invariant under strictly monotone transformations. This observation allows us to run $\mathrm{RP}_\mu$ on any strictly monotone transformation of any convex function $f$, with the same theoretical performance as on $f$ itself. The functions obtainable in this way form a subclass of the class of quasiconvex functions, and they include some non-convex functions as well. In Section 6.2.3 we will experimentally verify the invariance of $\mathrm{RP}_\mu$ under strictly monotone transformations on one instance of a quasiconvex function.

2.3. Function Basics. We now introduce some important inequalities that are useful for the subsequent analysis. We always assume that the objective function is differentiable and convex.
The latter property is equivalent to

$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle, \quad x, y \in \mathbb{R}^n.$   (2.5)

We also require that the curvature of $f$ is bounded. By this we mean that, for some constant $L_1$,

$f(y) - f(x) - \langle \nabla f(x), y - x \rangle \le \tfrac{1}{2} L_1 \|x - y\|^2, \quad x, y \in \mathbb{R}^n.$   (2.6)

We will also refer to this inequality as the quadratic upper bound. It means that the deviation of $f$ from any of its linear approximations can be bounded by a quadratic function. We denote by $\mathcal{C}^1_{L_1}$ the class of differentiable convex functions for which the quadratic upper bound holds with constant $L_1$. A differentiable function is strongly convex with parameter $m$ if the quadratic lower bound

$f(y) - f(x) - \langle \nabla f(x), y - x \rangle \ge \tfrac{m}{2} \|y - x\|^2, \quad x, y \in \mathbb{R}^n,$   (2.7)

holds. Let $x^*$ be the unique minimizer of a strongly convex function $f$ with parameter $m$. Then (2.7) implies the useful relation

$\tfrac{m}{2} \|x - x^*\|^2 \;\le\; f(x) - f(x^*) \;\le\; \tfrac{1}{2m} \|\nabla f(x)\|^2, \quad \forall x \in \mathbb{R}^n.$   (2.8)

The former inequality uses $\nabla f(x^*) = 0$, and the latter follows from (2.7) via

$f(x^*) \;\ge\; f(x) + \langle \nabla f(x), x^* - x \rangle + \tfrac{m}{2} \|x^* - x\|^2 \;\ge\; f(x) + \min_{y \in \mathbb{R}^n} \left( \langle \nabla f(x), y - x \rangle + \tfrac{m}{2} \|y - x\|^2 \right) \;=\; f(x) - \tfrac{1}{2m} \|\nabla f(x)\|^2$

by standard calculus.

3. Expectation of Scaled Random Vectors. We now study the projection of a fixed vector $x$ onto a random vector $u$. This will help analyze the expected progress of Algorithm 1. We start with the case $u \sim N(0, I_n)$ and then extend it to $u \sim S^{n-1}$. Throughout this section, let $x \in \mathbb{R}^n$ be a fixed vector and $u \in \mathbb{R}^n$ a random vector drawn according to some distribution. We will need the following facts about the moments of the standard normal distribution.

Lemma 3.1. (i) Let $\nu \sim N(0, 1)$ be drawn from the standard normal distribution over the reals. Then

$E[\nu] = E[\nu^3] = 0, \quad E[\nu^2] = 1, \quad E[\nu^4] = 3.$
(ii) Let $u \sim N(0, I_n)$ be drawn from the standard normal distribution over $\mathbb{R}^n$. Then

$E_u[uu^T] = I_n, \quad E_u[(u^T u)\, uu^T] = (n + 2)\, I_n.$

Proof. Part (i) is standard, and the two matrix equations easily follow from (i) via

$(uu^T)_{ij} = u_i u_j, \quad \big((u^T u)\, uu^T\big)_{ij} = u_i u_j \sum_k u_k^2.$

Lemma 3.2 (Normal distribution). Let $u \sim N(0, I_n)$. Then $E_u[\langle x, u \rangle u] = x$ and $E_u\big[\|\langle x, u \rangle u\|^2\big] = (n + 2)\|x\|^2$.

Proof. We calculate

$E_u[\langle x, u \rangle u] = E_u[uu^T x] = E_u[uu^T]\, x = x$

by Lemma 3.1(ii). For the second moment we get

$E_u\big[\|\langle x, u \rangle u\|^2\big] = E_u[x^T (u^T u)\, uu^T x] = x^T E_u[(u^T u)\, uu^T]\, x = (n + 2)\|x\|^2,$

again using Lemma 3.1(ii).

Lemma 3.3 (Spherical distribution). Let $u \sim S^{n-1}$. Then $E_u[\langle x, u \rangle u] = \tfrac{1}{n} x$ and $E_u\big[\|\langle x, u \rangle u\|^2\big] = E_u\big[\langle x, u \rangle^2\big] = \tfrac{1}{n}\|x\|^2$.

Proof. Let $v \sim N(0, I_n)$. We observe that the random vector $w = v / \|v\|$ has the same distribution as $u$. In particular,

$E_u[uu^T] = E_v\!\left[\frac{vv^T}{\|v\|^2}\right] = \frac{E_v[vv^T]}{E_v[\|v\|^2]} = \frac{I_n}{n},$   (3.1)

where we have used that the two random variables $vv^T / \|v\|^2$ and $\|v\|^2$ are independent (see [11]), along with $E_v[vv^T] = I_n$ and $E_v[\|v\|^2] = n$, a consequence of Lemma 3.1. Now we use (3.1) to compute

$E_u[\langle x, u \rangle u] = E_u[uu^T]\, x = \frac{I_n}{n}\, x = \frac{1}{n}\, x$

and

$E_u\big[\langle x, u \rangle^2\big] = E_u[x^T uu^T x] = x^T E_u[uu^T]\, x = x^T \frac{I_n}{n}\, x = \frac{1}{n}\|x\|^2.$

The same result can be derived when the vector $u$ is chosen to be a random signed unit vector.

Lemma 3.4. Let $u \sim U := \{\pm e_i : i = 1, \ldots, n\}$, where $e_i$ denotes the $i$-th standard unit vector in $\mathbb{R}^n$. Then $E_u[\langle x, u \rangle u] = \tfrac{1}{n} x$ and $E_u\big[\|\langle x, u \rangle u\|^2\big] = E_u\big[\langle x, u \rangle^2\big] = \tfrac{1}{n}\|x\|^2$.

Proof. We calculate

$E_u[\langle x, u \rangle u] = \frac{1}{2n} \sum_{u \in U} \langle x, u \rangle u = \frac{1}{n} \sum_{i=1}^n x_i e_i = \frac{1}{n}\, x,$

and similarly

$E_u\big[\langle x, u \rangle^2\big] = \frac{1}{2n} \sum_{u \in U} \langle x, u \rangle^2 = \frac{1}{n} \sum_{i=1}^n x_i^2 = \frac{1}{n}\|x\|^2.$

4. Single Step Progress.
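The expectation identities from Section 3 (Lemmas 3.3 and 3.4), which drive the analysis below, can also be checked numerically. The following sketch verifies $E[\langle x, u\rangle u] = x/n$ and $E[\langle x, u\rangle^2] = \|x\|^2/n$ exactly for the discrete set $\{\pm e_i\}$ and by Monte Carlo for the sphere; dimension, seed, and sample count are arbitrary choices of the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
x = rng.standard_normal(n)

# Discrete set U = {±e_i}: the expectations are finite averages (Lemma 3.4).
U = np.vstack([np.eye(n), -np.eye(n)])
m1 = np.mean([np.dot(x, u) * u for u in U], axis=0)
m2 = np.mean([np.dot(x, u) ** 2 for u in U])
assert np.allclose(m1, x / n)               # E[<x,u>u]  = x/n   (exact)
assert np.isclose(m2, np.dot(x, x) / n)     # E[<x,u>^2] = ||x||^2/n

# Spherical distribution (Lemma 3.3): normalizing standard normal vectors
# gives the uniform distribution on S^{n-1}; check by Monte Carlo.
v = rng.standard_normal((200_000, n))
u = v / np.linalg.norm(v, axis=1, keepdims=True)
s1 = np.mean((u @ x)[:, None] * u, axis=0)
s2 = np.mean((u @ x) ** 2)
assert np.allclose(s1, x / n, atol=2e-2)    # Monte Carlo tolerance
assert np.isclose(s2, np.dot(x, x) / n, atol=2e-2)
```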
To prepare the convergence proof of algorithm $\mathrm{RP}_\mu$ in the next section, we study the expected progress in a single step, which is the quantity $E[f(x_{k+1}) \mid x_k]$. It turns out that we need to proceed differently depending on whether the function under consideration is strongly convex (the easier case) or not. We start with a preparatory lemma for both cases, and first analyze the case where an approximate line search with absolute error is applied. The approximate line search with relative error will be reduced to the case of an exact line search.

4.1. Line Search with Absolute Error.

Lemma 4.1 (Absolute Error). Let $f \in \mathcal{C}^1_{L_1}$, and let $x_k \in \mathbb{R}^n$ be the current iterate and $x_{k+1} \in \mathbb{R}^n$ the next iterate generated by algorithm $\mathrm{RP}_\mu$ with absolute line search accuracy $\mu$. For every positive $h \in \mathbb{R}$ and every point $z \in \mathbb{R}^n$ we have

$E[f(x_{k+1}) \mid x_k] \le f(x_k) + \frac{h}{n} \langle \nabla f(x_k), z - x_k \rangle + \frac{L_1 h^2}{2n} \|z - x_k\|^2 + \frac{L_1 \mu^2}{2}.$

Proof. Let $x'_{k+1} := x_k + \mathrm{LS}(x_k, u_k)\, u_k$ be the exact line search optimum, where $u_k \in S^{n-1}$ is the chosen search direction. By definition of the approximate line search (2.2), we have

$f(x_{k+1}) \le \max_{|\nu| \le \mu} f(x'_{k+1} + \nu u_k) \overset{(2.6)}{\le} f(x'_{k+1}) + \max_{|\nu| \le \mu} \Big( \underbrace{\langle \nabla f(x'_{k+1}), \nu u_k \rangle}_{=\,0} + \frac{L_1}{2} \nu^2 \Big) = f(x'_{k+1}) + \frac{L_1 \mu^2}{2},$   (4.1)

where we used the quadratic upper bound (2.6) in the second inequality with $x = x'_{k+1}$ and $y = x'_{k+1} + \nu u_k$. Since $x'_{k+1}$ is the exact line search optimum, we in particular have

$f(x'_{k+1}) \le f(x_k + t_k u_k) \le f(x_k) + \langle \nabla f(x_k), t_k u_k \rangle + \frac{L_1 t_k^2}{2}, \quad \forall t_k \in \mathbb{R},$   (4.2)

where we have applied (2.6) a second time. Putting together (4.1) and (4.2) and taking expectations, we get

$E_{u_k}[f(x_{k+1}) \mid x_k] \le f(x_k) + E_{u_k}\!\left[ \langle \nabla f(x_k), t_k u_k \rangle + \frac{L_1 t_k^2}{2} \,\Big|\, x_k \right] + \frac{L_1 \mu^2}{2}.$
(4.3)

Now we choose $t_k$ such that we can control the expectations on the right-hand side. We set $t_k := h \langle z - x_k, u_k \rangle$, where $h > 0$ and $z \in \mathbb{R}^n$ are the "free parameters" of the lemma. Via Lemma 3.3, this entails

$E_{u_k}[t_k u_k] = \frac{h}{n} (z - x_k), \quad E_{u_k}[t_k^2] = \frac{h^2}{n} \|z - x_k\|^2,$

and the lemma follows.

4.2. Line Search with Relative Error. In the case of relative line search error, we can prove a variant of Lemma 4.1 with a denominator $n'$ slightly larger than $n$. As a result, the analysis under relative line search error reduces to the analysis of exact line search (approximate line search error $0$) in a slightly higher dimension; in the sequel, we will therefore only deal with absolute line search error.

Lemma 4.2 (Relative Error). Let $f \in \mathcal{C}^1_{L_1}$, and let $x_k \in \mathbb{R}^n$ be the current iterate and $x_{k+1} \in \mathbb{R}^n$ the next iterate generated by algorithm $\mathrm{RP}_\mu$ with relative line search accuracy $\mu$. For every positive $h \in \mathbb{R}$ and every point $z \in \mathbb{R}^n$ we have

$E[f(x_{k+1}) \mid x_k] \le f(x_k) + \frac{h}{n'} \langle \nabla f(x_k), z - x_k \rangle + \frac{L_1 h^2}{2n'} \|z - x_k\|^2,$

where $n' = n / (1 - \mu)$.

Proof. By the definition (2.3) of relative line search error, $x_{k+1}$ is a convex combination of $x_k$ and $x'_{k+1}$, the exact line search optimum. More precisely, we can compute that

$x_{k+1} = (1 - \gamma)\, x_k + \gamma\, x'_{k+1},$

where $\gamma \ge 1 - \mu$. By convexity of $f$, we thus have

$f(x_{k+1}) \le (1 - \gamma) f(x_k) + \gamma f(x'_{k+1}) \le \mu f(x_k) + (1 - \mu) f(x'_{k+1}),$

since $f(x'_{k+1}) \le f(x_k)$. Hence

$E[f(x_{k+1}) \mid x_k] \le \mu f(x_k) + (1 - \mu)\, E[f(x'_{k+1}) \mid x_k].$   (4.4)

Using Lemma 4.1 with absolute line search error $0$ yields a bound for the latter term:

$E[f(x'_{k+1}) \mid x_k] \le f(x_k) + \frac{h}{n} \langle \nabla f(x_k), z - x_k \rangle + \frac{L_1 h^2}{2n} \|z - x_k\|^2.$
Putting this together with (4.4) yields

$E[f(x_{k+1}) \mid x_k] \le f(x_k) + (1 - \mu) \left( \frac{h}{n} \langle \nabla f(x_k), z - x_k \rangle + \frac{L_1 h^2}{2n} \|z - x_k\|^2 \right),$

and with $n' = n / (1 - \mu)$, the lemma follows.

4.3. Towards the Strongly Convex Case. Here we use $z = x_k - \nabla f(x_k)$ in Lemma 4.1.

Corollary 4.3. Let $f \in \mathcal{C}^1_{L_1}$, and let $x_k \in \mathbb{R}^n$ be the current iterate and $x_{k+1} \in \mathbb{R}^n$ the next iterate generated by algorithm $\mathrm{RP}_\mu$ with absolute line search accuracy $\mu$. For any positive $h_k \le \frac{1}{L_1}$ it holds that

$E[f(x_{k+1}) \mid x_k] \le f(x_k) - \frac{h_k}{2n} \|\nabla f(x_k)\|^2 + \frac{L_1 \mu^2}{2}.$

Proof. Lemma 4.1 with $z = x_k - \nabla f(x_k)$ and $h = h_k$ yields

$E[f(x_{k+1}) \mid x_k] \le f(x_k) - \frac{h_k}{n} \langle \nabla f(x_k), \nabla f(x_k) \rangle + \frac{L_1 h_k^2}{2n} \|\nabla f(x_k)\|^2 + \frac{L_1 \mu^2}{2}.$

We conclude

$E[f(x_{k+1}) \mid x_k] \le f(x_k) - \underbrace{\frac{h_k}{n} \left( 1 - \frac{L_1 h_k}{2} \right)}_{\ge\, h_k / (2n)} \|\nabla f(x_k)\|^2 + \frac{L_1 \mu^2}{2}.$

4.4. Towards the General Convex Case. For this case, we apply Lemma 4.1 with $z = x^*$.

Corollary 4.4. Let $f \in \mathcal{C}^1_{L_1}$, and let $x_k \in \mathbb{R}^n$ be the current iterate and $x_{k+1} \in \mathbb{R}^n$ the next iterate generated by algorithm $\mathrm{RP}_\mu$ with absolute line search accuracy $\mu$. Let $x^* \in \mathbb{R}^n$ be one of the minimizers of the function $f$. For any $h_k \ge 0$ it holds that

$E[f(x_{k+1}) - f(x^*) \mid x_k] \le \left( 1 - \frac{h_k}{n} \right) \big( f(x_k) - f(x^*) \big) + \frac{L_1 h_k^2}{2n} \|x^* - x_k\|^2 + \frac{L_1 \mu^2}{2}.$

Proof. We use Lemma 4.1 with $z = x^*$ and apply convexity (2.5) to bound the term $\langle \nabla f(x_k), x^* - x_k \rangle$ from above by $f(x^*) - f(x_k)$. Subtracting $f(x^*)$ from both sides yields the inequality of the corollary.

5. Convergence Results. Here we use the previously derived bounds on the expected single-step progress (Corollaries 4.3 and 4.4) to show convergence of the algorithm.

5.1. Convergence Analysis for Strongly Convex Functions.
We first prove that algorithm $\mathrm{RP}_\mu$ converges linearly in expectation on strongly convex functions. Although strong convexity is a global property, it suffices that the function is strongly convex in a neighborhood of its minimizer (see Theorem 5.2).

Theorem 5.1. Let $f \in \mathcal{C}^1_{L_1}$ be strongly convex with parameter $m$, and consider the sequence $\{x_k\}_{k \ge 0}$ generated by $\mathrm{RP}_\mu$ with absolute line search accuracy $\mu$. Then for any $N \ge 0$ we have

$E[f(x_N) - f(x^*)] \le \left( 1 - \frac{m}{L_1 n} \right)^N \big( f(x_0) - f(x^*) \big) + \frac{L_1^2 n \mu^2}{2m}.$

Proof. We use Corollary 4.3 with $h = \frac{1}{L_1}$ and the quadratic lower bound to estimate the progress in one step as

$E[f(x_{k+1}) - f(x^*) \mid x_k] \le f(x_k) - f(x^*) - \frac{1}{2nL_1} \|\nabla f(x_k)\|^2 + \frac{L_1 \mu^2}{2} \overset{(2.8)}{\le} \left( 1 - \frac{m}{nL_1} \right) \big( f(x_k) - f(x^*) \big) + \frac{L_1 \mu^2}{2}.$

After taking expectations (over $x_k$), the tower property of conditional expectations yields the recurrence

$E[f(x_{k+1}) - f(x^*)] \le \left( 1 - \frac{m}{nL_1} \right) E[f(x_k) - f(x^*)] + \frac{L_1 \mu^2}{2}.$

This implies

$E[f(x_N) - f(x^*)] \le \left( 1 - \frac{m}{nL_1} \right)^N \big( f(x_0) - f(x^*) \big) + \omega(N)\, \frac{L_1 \mu^2}{2}, \quad \text{with} \quad \omega(N) := \sum_{i=0}^{N-1} \left( 1 - \frac{m}{L_1 n} \right)^i \le \frac{L_1 n}{m}.$

The bound of the theorem follows.

We remark that by strong convexity the error $\|x_N - x^*\|$ can also be bounded using this theorem; thus the algorithm converges not only in terms of the function value, but also in terms of the iterates themselves. Every strongly convex function has a unique minimizer $x^*$. Using the quadratic lower bound (2.8) we recall that

$f(x) - f(x^*) \ge \frac{m}{2} \|x - x^*\|^2, \quad \forall x \in \mathbb{R}^n.$   (5.1)

It turns out that, instead of strong convexity (2.7), the weaker condition (5.1) is sufficient for linear convergence.

Theorem 5.2. Let $f \in \mathcal{C}^1_{L_1}$ and suppose $f$ has a unique minimizer $x^*$ satisfying (5.1) with parameter $m$.
Consider the sequence $\{x_k\}_{k \ge 0}$ generated by $\mathrm{RP}_\mu$ with absolute line search accuracy $\mu$. Then for any $N \ge 0$ we have

$E[f(x_N) - f(x^*)] \le \left( 1 - \frac{m}{4 L_1 n} \right)^N \big( f(x_0) - f(x^*) \big) + \frac{2 L_1^2 n \mu^2}{m}.$

Proof. We use Corollary 4.4 together with property (5.1) to get

$E[f(x_{k+1}) - f(x^*) \mid x_k] \le \left( 1 - \frac{h_k}{n} \right) \big( f(x_k) - f(x^*) \big) + \frac{L_1 h_k^2}{2n} \|x_k - x^*\|^2 + \frac{L_1 \mu^2}{2} \le \left( 1 - \frac{h_k}{n} + \frac{L_1 h_k^2}{mn} \right) \big( f(x_k) - f(x^*) \big) + \frac{L_1 \mu^2}{2}.$

Setting $h_k = \frac{m}{2L_1}$, the factor in front of $f(x_k) - f(x^*)$ becomes $\left( 1 - \frac{m}{4 L_1 n} \right)$. The proof now continues as the proof of Theorem 5.1.

5.2. Convergence Analysis for Convex Functions. We now prove that algorithm $\mathrm{RP}_\mu$ converges in expectation on smooth (but not necessarily strongly) convex functions. The rate is, however, no longer linear.

Theorem 5.3. Let $f \in \mathcal{C}^1_{L_1}$, let $x^*$ be a minimizer of $f$, and let the sequence $\{x_k\}_{k \ge 0}$ be generated by $\mathrm{RP}_\mu$ with absolute line search accuracy $\mu$. Assume there exists $R$ such that $\|y - x^*\| < R$ for all $y$ with $f(y) \le f(x_0)$. Then for any $N \ge 0$ we have

$E[f(x_N) - f(x^*)] \le \frac{Q}{N + 1} + \frac{N L_1 \mu^2}{2}, \quad \text{where} \quad Q = \max\{ 2 n L_1 R^2,\ f(x_0) - f(x^*) \}.$

Proof. By assumption, $\|x_k - x^*\| \le R$ for all $k = 0, 1, \ldots, N$. With Corollary 4.4 it follows for any step size $h_k \ge 0$ that

$E[f(x_{k+1}) - f(x^*) \mid x_k] \le \left( 1 - \frac{h_k}{n} \right) \big( f(x_k) - f(x^*) \big) + \frac{L_1 h_k^2}{2n} R^2 + \frac{L_1 \mu^2}{2}.$   (5.2)

Taking expectations we obtain

$E[f(x_{k+1}) - f(x^*)] \le \left( 1 - \frac{h_k}{n} \right) E[f(x_k) - f(x^*)] + \left( \frac{h_k}{n} \right)^2 \frac{n L_1 R^2}{2} + \frac{L_1 \mu^2}{2}.$

Setting $h_k := \frac{2n}{k+1}$ for $k = 0, \ldots, N - 1$, we obtain a recurrence exactly of the form treated in Lemma A.1, and the result follows.

We note that for $\varepsilon > 0$ the exact algorithm $\mathrm{RP}_0$ needs $O(n/\varepsilon)$ steps to guarantee an approximation error of $\varepsilon$.
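For the sphere function $f(x) = \|x\|^2$, the exact line search has the closed form $x_{k+1} = x_k - \langle x_k, u_k \rangle u_k$, and Theorem 5.1 (with $m = L_1$ and $\mu = 0$) predicts that $E[f(x_k)]$ decays by exactly the factor $1 - 1/n$ per step. A quick sanity check of this rate, averaged over repeated runs (dimension, run count, and seed are arbitrary choices of this sketch):

```python
import numpy as np

# Sphere function f(x) = ||x||^2: exact line search along u maps x to
# x - <x,u>u, so f decreases by <x,u>^2 and, by Lemma 3.3,
# E[f(x_{k+1}) | x_k] = (1 - 1/n) f(x_k).
rng = np.random.default_rng(0)
n, steps, runs = 10, 50, 1000
vals = np.empty(runs)
for r in range(runs):
    x = np.ones(n)                       # f(x_0) = n
    for _ in range(steps):
        u = rng.standard_normal(n)
        u /= np.linalg.norm(u)           # uniform on S^{n-1}
        x = x - np.dot(x, u) * u         # closed-form exact line search
    vals[r] = np.dot(x, x)

# empirical per-step decay factor of E[f(x_k)]
mean_decay = (np.mean(vals) / n) ** (1 / steps)
```

The measured `mean_decay` should be close to $1 - 1/n = 0.9$, illustrating that the linear rate of Theorem 5.1 is tight on the sphere.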
According to the discussion preceding Lemma 4.2, this still holds under an approximate line search with fixed relative error. In the absolute error model, however, the error bound of Theorem 5.3 becomes meaningless as $N \to \infty$. Nevertheless, for $N_{\mathrm{opt}} = \sqrt{2Q / (L_1 \mu^2)}$ the bound yields

$E\big[ f(x_{N_{\mathrm{opt}}}) - f(x^*) \big] \le \frac{Q}{N_{\mathrm{opt}}} + \frac{N_{\mathrm{opt}} L_1 \mu^2}{2} \le \mu \sqrt{2 Q L_1}.$

5.3. Remarks. We emphasize that the constant $L_1$ and the strong convexity parameter $m$, which describe the behavior of the function, are only needed for the theoretical analysis of $\mathrm{RP}_\mu$; they are not input parameters of the algorithm. No pre-calculation or estimation of these parameters is thus needed in order to use the algorithm on convex functions. Moreover, the presented analysis does not need parameters that describe the properties of the function on the whole domain: it is sufficient to restrict our view to the sub-level set determined by the initial iterate. Consequently, if the function parameters improve in a neighborhood of the optimum, the performance of the algorithm may be better than predicted by the worst-case analysis.

6. Computational Experiments. We complement the presented theoretical analysis with extensive numerical optimization experiments on selected benchmark functions. We compare the performance of the $\mathrm{RP}_\mu$ algorithm with a number of gradient-free algorithms that share the simplicity of Random Pursuit in terms of the computational search primitives used. We also introduce a heuristic acceleration scheme for Random Pursuit, the accelerated $\mathrm{RP}_\mu$ method ($\mathrm{ARP}_\mu$). As a generic reference we also consider two steepest descent schemes that use analytic gradient information. The test function set comprises two quadratic functions with different condition numbers, two variants of Nesterov's smooth function [28], and a non-convex funnel-shaped function.
We first detail the algorithms, their input requirements, and the necessary parameter choices. We then present the definitions of the test functions, describe the experimental performance-evaluation protocol, and present the numerical results. Further experimental data are available in the supporting online material [38].

6.1. Algorithms. We now introduce the set of tested algorithms. All methods have been implemented in MATLAB. The source code is also publicly available in the supporting online material [38].

6.1.1. Random Gradient Methods. We consider two randomized methods that are discussed in detail in [29]. The first algorithm, the Random Gradient Method (RG), implements the iterative scheme described in (1.4). A necessary ingredient for the algorithm is an oracle that provides directional derivatives. The accuracy of the directional derivatives is controlled by the finite-difference step size $\mu$. A pseudo-code representation of the approximate Random Gradient method (RG$_\mu$) along with a convergence proof is given in [29, Section 5]. We implemented RG$_\mu$ and used the parameter setting $\mu = 10^{-5}$. A necessary input to the RG$_\mu$ algorithm is the function-dependent Lipschitz constant $L_1$, which is used to determine the step size $\lambda_k = 1/(4(n+4)L_1)$. We also consider Nesterov's fast Random Gradient Method (FG) [29]. This algorithm simultaneously evolves two iterates in the search space where, in each iteration, a directional derivative is approximately computed at specific linear combinations of these points. In [29, Section 6] Nesterov provides pseudo-code for the approximate scheme FG$_\mu$ and proves convergence on strongly convex functions. We implemented the FG$_\mu$ scheme and used the parameter setting $\mu = 10^{-5}$. Further necessary input parameters are both the constant $L_1$ and the strong convexity parameter $m$ of the respective test function.

6.1.2. Random Pursuit Methods.
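For concreteness, the basic Random Pursuit iteration can be sketched in a few lines. This is a Python sketch, not the paper's MATLAB implementation: the line search oracle is supplied by the caller, and an exact closed-form line search for a convex quadratic stands in for the fminunc-based oracle used in the experiments.

```python
import numpy as np

def random_pursuit(x0, n_iters, ls, seed=0):
    """Basic Random Pursuit: repeatedly optimize over a random
    one-dimensional subspace. `ls(x, u)` is the line search oracle
    returning a (near-)optimal step size along direction u."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(n_iters):
        u = rng.standard_normal(x.size)
        u /= np.linalg.norm(u)          # direction uniform on the sphere
        x = x + ls(x, u) * u
    return x

# Exact line search is available in closed form for a convex quadratic
# f(x) = 0.5 (x-1)^T Q (x-1): minimize the 1-D parabola t -> f(x + t u).
Q = np.diag([10.0] * 3 + [1.0] * 3)     # small ellipsoid, L1 = 10, m = 1
f = lambda x: 0.5 * (x - 1.0) @ Q @ (x - 1.0)
ls = lambda x, u: -(u @ (Q @ (x - 1.0))) / (u @ Q @ u)
x = random_pursuit(np.zeros(6), 5000, ls)
```

Note that the sketch uses only zeroth-order information through the line search oracle, and no parameter of $f$ enters the algorithm itself, mirroring the remarks of Section 5.3.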
In the implementation of the RP$_\mu$ algorithm we choose the sampling directions uniformly at random from the hypersphere. We use the built-in MATLAB routine fminunc.m from the optimization toolbox [32] with optimset('TolX', $\mu$) as approximate line search oracle LSapprox$_\mu$, with $\mu = 10^{-5}$. In the present gradient-free setting, fminunc.m uses a mixed cubic/quadratic polynomial line search where the first three points bracketing the minimum are found by bisection [32].

Inspired by the FG scheme we also designed an accelerated Random Pursuit algorithm (ARP$_\mu$), which is summarized in Algorithm 2.

Algorithm 2 Accelerated Random Pursuit (ARP$_\mu$)
Input: $N$, $x_0$, $\mu$, $m$, $L_1$
1: $\theta = \frac{1}{L_1 n^2}$, $\gamma_0 \ge m$, $v_0 = x_0$
2: for $k = 0$ to $N$ do
3:   Compute $\beta_k > 0$ satisfying $\theta^{-1}\beta_k^2 = (1 - \beta_k)\gamma_k + \beta_k m =: \gamma_{k+1}$.
4:   Set $\lambda_k = \frac{\beta_k m}{\gamma_{k+1}}$, $\delta_k = \frac{\beta_k \gamma_k}{\gamma_k + \beta_k m}$, and $y_k = (1 - \delta_k)x_k + \delta_k v_k$.
5:   $u_k \sim \mathcal{S}^{n-1}$
6:   Set $x_{k+1} = y_k + \mathrm{LSapprox}_\mu(y_k, u_k) \cdot u_k$.
7:   Set $v_{k+1} = (1 - \lambda_k)v_k + \lambda_k y_k + \frac{\mathrm{LSapprox}_\mu(y_k, u_k)}{\beta_k n} u_k$.
8: end for

The structure of this algorithm is similar to Nesterov's FG$_\mu$ scheme. In ARP$_\mu$ the step size calculation is, however, provided by the line search oracle. Although we currently lack theoretical guarantees for this scheme, we report its experimental performance here. Analogously to the FG$_\mu$ algorithm, the accelerated RP$_\mu$ algorithm needs the function-dependent parameters $L_1$ and $m$ as necessary input. The line search oracle is identical to the one in standard Random Pursuit.

6.1.3. Adaptive Step Size Random Search Methods. The previous randomized schemes proceed along random directions either by using pre-calculated step sizes or by using line search oracles. In adaptive step size random search methods the step size is dynamically controlled so as to approximately guarantee a certain probability $p$ of finding an improving iterate.
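The control rule just described, as instantiated by the (1+1)-ES considered in this section (spelled out below as Algorithm 3), can be sketched as follows. This is a Python sketch using the constants of Algorithm 3, not the authors' MATLAB code; the test function and iteration budget are chosen arbitrarily for illustration.

```python
import numpy as np

def one_plus_one_es(f, x0, sigma0, n_iters, p=0.27, seed=0):
    """(1+1)-ES: accept a Gaussian mutation iff it does not increase f;
    grow the step size on success, shrink it on failure, steering the
    empirical success rate toward the target probability p."""
    c_s = np.exp(1.0 / 3.0)               # expansion factor on success
    c_f = c_s * np.exp(-p / (1.0 - p))    # contraction factor on failure
    rng = np.random.default_rng(seed)
    x, sigma = x0.copy(), sigma0
    for _ in range(n_iters):
        y = x + sigma * rng.standard_normal(x.size)
        if f(y) <= f(x):                  # comparison only: invariant under
            x, sigma = y, c_s * sigma     # monotone transformations of f
        else:
            sigma = c_f * sigma
    return x

# Example: shifted sphere f1 in n = 5 dimensions.
f1 = lambda x: 0.5 * (x - 1.0) @ (x - 1.0)
x = one_plus_one_es(f1, np.zeros(5), sigma0=1.0, n_iters=4000)
```

Because only comparisons $f(y) \le f(x)$ enter the update, replacing $f$ by any strictly monotone transformation of $f$ leaves the iterate sequence unchanged for a fixed random seed.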
Schumer and Steiglitz [36] were among the first to propose such a scheme. In the bio-inspired optimization literature, the method is known as the (1+1)-Evolution Strategy (ES) [35]. Jägersküpper [14] provides a convergence proof of the ES on convex quadratic functions. We here consider the generic ES algorithm summarized in Algorithm 3.

Algorithm 3 (1+1)-Evolution Strategy (ES) with adaptive step size control
Input: $N$, $x_0$, $\sigma_0$, probability of improvement $p = 0.27$
1: Set $c_s = e^{1/3}$, $c_f = c_s \cdot e^{-\frac{p}{1-p}}$.
2: for $k = 0$ to $N$ do
3:   $u_k \sim \mathcal{N}(0, I_n)$
4:   if $f(x_k + \sigma_k u_k) \le f(x_k)$ then
5:     Set $x_{k+1} = x_k + \sigma_k u_k$ and $\sigma_{k+1} = c_s \cdot \sigma_k$.
6:   else
7:     Set $x_{k+1} = x_k$ and $\sigma_{k+1} = c_f \cdot \sigma_k$.
8:   end if
9: end for

Depending on the specific random direction generator and the underlying test function, different optimality conditions can be formulated for the probability $p$. Schumer and Steiglitz [36] suggest the setting $p = 0.27$, which is also considered in this work. For all of the considered test functions the initial step size $\sigma_0$ has been determined experimentally so as to guarantee the targeted $p$ at the start (see Table B.1 for the respective values). The ES algorithm shares RP$_\mu$'s invariance under strictly monotone transformations of the objective function.

6.1.4. First-order Gradient Methods. In order to illustrate the numerical efficiency of the randomized zeroth-order schemes relative to that of first-order methods, we also consider two Gradient Methods as outlined in (1.2). The first method (GM) uses a fixed step size $\lambda_k = \frac{1}{L_1}$ [28]. The function-dependent constant $L_1$ is thus part of the input to the GM algorithm. The second method (GM$_{LS}$) determines the step size $\lambda_k$ in each iteration using RP$_\mu$'s line search oracle LSapprox$_\mu$ with $\mu = 10^{-5}$.

6.2. Benchmark Functions.
We now present the set of test functions used for the numerical performance evaluation of the different optimization schemes. We present the three function classes and detail the specific function instances and their properties.

6.2.1. Quadratic Functions. We consider quadratic test functions of the form

$$f(x) = \frac{1}{2}(x - \mathbf{1})^T Q (x - \mathbf{1}), \quad (6.1)$$

where $x \in \mathbb{R}^n$ and $Q \in \mathbb{R}^{n \times n}$ is a diagonal matrix. For given $L_1$ the diagonal entries are chosen in the interval $[1, L_1]$. The minimizer of this function class is $x^* = \mathbf{1}$ with $f(x^*) = 0$. The derivative is $\nabla f(x) = Q(x - \mathbf{1})$. We consider two different matrix instances. Setting $Q = I_n$, the $n$-dimensional identity matrix, the function reduces to the shifted sphere function, denoted here by $f_1$. In order to get a quadratic function with anisotropic axis lengths, we use a matrix $Q$ whose first $n/2$ diagonal entries are equal to $L_1$ and whose remaining entries are set to 1. This ellipsoidal function is denoted by $f_2$.

6.2.2. Nesterov's Smooth Functions. We consider Nesterov's smooth function as introduced in Nesterov's textbook [28]. The generic version of this function reads

$$f_3(x) = \frac{L_1}{4}\left(\frac{1}{2}\left[x_1^2 + \sum_{i=1}^{n-1}(x_{i+1} - x_i)^2 + x_n^2\right] - x_1\right). \quad (6.2)$$

This function has derivative $\nabla f_3(x) = \frac{L_1}{4}(Ax - e_1)$, where $A$ is the tridiagonal matrix

$$A = \begin{pmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 2 \end{pmatrix}$$

and $e_1 = (1, 0, \ldots, 0)^T$. The optimal solution is located at

$$x^*_i = 1 - \frac{i}{n+1}, \quad \text{for } i = 1, \ldots, n.$$

For fixed dimension $n$, this function is strongly convex with parameter $m \approx \frac{L_1}{4(n+1)^2}$. Thus, the condition number $L_1/m$ grows quadratically with the dimension. Adding a regularization term leads, however, to a strongly convex function with parameter $m$ independent of the dimension. Given $L_1 \ge m > 0$, the regularized function reads

$$f_4(x) = \frac{L_1 - m}{4}\left(\frac{1}{2}\left[x_1^2 + \sum_{i=1}^{n-1}(x_{i+1} - x_i)^2 + x_n^2\right] - x_1\right) + \frac{m}{2}\|x\|^2.$$
(6.3)

This function is strongly convex with parameter $m$. Its derivative is $\nabla f_4(x) = \left(\frac{L_1 - m}{4}A + mI\right)x - \frac{L_1 - m}{4}e_1$, and the optimal solution $x^*$ satisfies $\left(A + \frac{4m}{L_1 - m}I\right)x^* = e_1$.

6.2.3. Funnel Function. We finally consider the following funnel-shaped function:

$$f_5(x) = \log\left(1 + 10\sqrt{(x - \mathbf{1})^T(x - \mathbf{1})}\right), \quad (6.4)$$

where $x \in \mathbb{R}^n$. The minimizer of this function is $x^* = \mathbf{1}$ with $f_5(x^*) = 0$. Its derivative for $x \ne \mathbf{1}$ is $\nabla f_5(x) = \frac{10}{1 + 10\sqrt{(x-\mathbf{1})^T(x-\mathbf{1})}} \cdot \frac{x - \mathbf{1}}{\|x - \mathbf{1}\|}$. A one-dimensional graph of $f_5$ is shown in the left panel of Figure 7.1. The function $f_5$ arises from a strictly monotone transformation of $f_1$ and thus belongs to the class of strictly quasiconvex functions.

6.3. Numerical Optimization Results. To illustrate the performance of Random Pursuit in comparison with the other randomized methods, we here present and discuss the key numerical results. All numerical tests follow the identical protocol. All algorithms use the starting point $x_0 = 0$ for all test functions. In order to compare the performance of the different algorithms across different test functions, we follow Nesterov's approach [29] and report relative solution accuracies with respect to the scale $S \approx \frac{1}{2}L_1 R^2$, where $R := \|x_0 - x^*\|$ is the Euclidean distance between the starting point and the optimal solution of the respective function. The properties of the four convex and continuously differentiable test functions and the quasiconvex funnel function, along with the upper bounds on $R^2$ and the corresponding scales $S$, are summarized in Table 6.1.

Name                 | function class   | $L_1$ | $m$                          | $R^2$               | $S$
$f_1$ Sphere         | strongly convex  | 1     | 1                            | $n$                 | $\frac{1}{2}n$
$f_2$ Ellipsoid      | strongly convex  | 1000  | 1                            | $n$                 | $50n$
$f_3$ Nesterov smooth| convex           | 1000  | $\approx \frac{L_1}{4(n+1)^2}$ | $\frac{n+1}{3}$   | $500 \cdot \frac{n+1}{3}$
$f_4$ Nesterov strong| strongly convex  | 1000  | 1                            | $\frac{\sqrt{1000}}{4}$ | $1000$
$f_5$ Funnel         | not convex       | -     | -                            | $n$                 | $\frac{1}{2}n$

Table 6.1: Test functions with parameters $L_1$, $m$, $R$, and the used scale $S$.
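Two claims about this benchmark set are easy to verify numerically: the closed-form minimizer of $f_3$ (which must satisfy $Ax^* = e_1$) and the fact that $f_5$ is a strictly monotone transformation of $f_1$, which is what makes comparison-based methods behave identically on the two. A Python sketch (the dimension is chosen arbitrarily):

```python
import numpy as np

n = 64

# f3's gradient is (L1/4)(A x - e1), so its minimizer solves A x* = e1.
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # tridiag(-1, 2, -1)
e1 = np.zeros(n)
e1[0] = 1.0
x_star = 1.0 - np.arange(1, n + 1) / (n + 1)             # claimed closed form
assert np.allclose(A @ x_star, e1)                       # holds exactly

# f5(x) = log(1 + 10 ||x - 1||) = log(1 + 10 sqrt(2 f1(x))) is a strictly
# increasing transformation of f1, so it orders any two points identically.
f1 = lambda x: 0.5 * (x - 1.0) @ (x - 1.0)
f5 = lambda x: np.log(1.0 + 10.0 * np.linalg.norm(x - 1.0))
rng = np.random.default_rng(0)
for _ in range(100):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    assert (f1(x) <= f1(y)) == (f5(x) <= f5(y))
```

The second check is the mechanism behind the invariance results reported below for RP$_\mu$ and the ES: any algorithm that queries $f$ only through comparisons or through one-dimensional argmins cannot distinguish $f_1$ from $f_5$.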
Due to the inherent randomness of a single search run, we perform 25 runs for each problem-instance/algorithm pair with different random number seeds. We compare the different methods based on two record values: (i) the minimal, mean, and maximum number of iterations (ITS) and (ii) the minimal, mean, and maximum number of function evaluations (FES) needed to reach a certain solution accuracy. While the former records serve as a means to compare the number of oracle calls of the different methods, the latter only considers evaluations of the objective function as relevant performance cost. It is evident that measuring the performance of the algorithms in terms of oracle calls favors Random Pursuit, because the line search oracle "does more work" than an oracle that, for instance, provides a directional derivative. For Random Gradient methods the number of FES is just twice the number of ITS when a first-order finite difference scheme is used for directional derivatives. For the ES algorithm the numbers of ITS and FES are identical. For Random Pursuit methods the relation between ITS and FES depends on the specific test function, the line search parameter $\mu$, and the actual implementation of the line search. Our theoretical investigation suggests that the randomized schemes are a factor of $O(n)$ slower than the first-order GM algorithms. This is due to the reduced available (direction) information in the randomized methods compared to the $n$-dimensional gradient vector. For better comparison with GM and GM$_{LS}$, we thus scale the number of ITS of the randomized schemes by a factor of $1/n$.

6.3.1. Performance on the Quadratic Test Functions for $n \le 1024$. We first consider the two quadratic test functions in $n = 2^2, \ldots, 2^{10}$ dimensions.
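The roughly constant ITS/$n$ counts that Table 6.2 reports for Random Pursuit on $f_1$ (about 13 for large $n$) are easy to reproduce for the idealized exact-line-search variant. The sketch below uses a single run per dimension rather than the paper's 25 repetitions, and an exact line search instead of the fminunc oracle:

```python
import numpy as np

def rp_its_per_n(n, target_rel=1.91e-6, seed=0):
    """Iterations of Random Pursuit with exact line search on the shifted
    sphere f1 until f <= target_rel * S (with S = n/2), in blocks of n."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)                     # x0 = 0, so f(x0) = n/2 = S
    f = lambda x: 0.5 * (x - 1.0) @ (x - 1.0)
    its = 0
    while f(x) > target_rel * (n / 2.0):
        u = rng.standard_normal(n)
        u /= np.linalg.norm(u)
        x += -((x - 1.0) @ u) * u       # exact line search step for f1
        its += 1
    return its / n

r32, r128 = rp_its_per_n(32), rp_its_per_n(128)
print(r32, r128)                        # both close to 13
```

The expected one-step contraction on the sphere is $(1 - 1/n)$, so reaching a relative accuracy of $1.91 \cdot 10^{-6}$ takes about $\ln(1/1.91\cdot 10^{-6}) \approx 13.2$ blocks of $n$ iterations, independent of the dimension.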
Table 6.2 summarizes the minimum, maximum, and mean number of ITS (in blocks of size $n$) needed for each algorithm to reach the absolute accuracy $1.91 \cdot 10^{-6}\, S$ on the sphere function $f_1$. For the first-order GM algorithms the absolute number of ITS is reported.

            RP            RG            FG            ARP           ES        GM  GM_LS
   n   min max mean  min max mean  min max mean  min max mean  min max mean   -    -
   4     5  17  10    40  65  53    31  49  39     5  17  10    28  46  38    1    1
   8     8  16  12    39  53  44    30  40  35     5  13  11    28  43  35    1    1
  16    10  14  12    33  41  37    30  37  33    10  14  12    30  42  36    1    1
  32    11  14  12    31  36  33    28  35  31    11  16  12    33  41  37    1    1
  64    12  14  13    30  34  32    28  33  31    12  14  13    33  41  37    1    1
 128    12  14  13    30  32  31    29  32  31    12  14  13    35  40  37    1    1
 256    13  14  13    30  31  30    29  31  30    13  14  13    35  40  37    1    1
 512    13  13  13    30  31  30    30  31  30    13  14  13    36  38  37    1    1
1024    13  14  13    30  31  30    30  31  30    13  13  13    36  38  37    1    1

Table 6.2: Recorded minimum, maximum, and mean #ITS/$n$ on the sphere function $f_1$ to reach a relative accuracy of $1.91 \cdot 10^{-6}$. For GM and GM$_{LS}$ the absolute numbers of ITS are recorded.

Three key observations can be made from these data. First, all zeroth-order algorithms approach the theoretically expected linear scaling of the run time with dimension for strongly convex functions for sufficiently large $n$ (for $n \ge 64$, e.g., the average number of ITS/$n$ becomes constant for the RP algorithms). Second, no significant performance difference can be found between RP and its accelerated version across all dimensions. The performance of the algorithm pair RG/FG becomes similar for $n \ge 128$. Third, the Random Pursuit algorithms outperform all other zeroth-order methods in terms of number of ITS. Only the last observation changes when the number of FES is considered. Table 6.3 summarizes the number of FES (in blocks of size $n$) for all algorithms on $f_1$. We see that the RP$_\mu$ algorithms outperform the Random Gradient methods for low dimensions and perform equally well for $n = 256$.
For $n \ge 512$ the Random Gradient schemes become increasingly superior to the Random Pursuit schemes. Remarkably, the adaptive step size ES algorithm outperforms all other methods across all dimensions. The data also reveal that the line search oracle in the RP$_\mu$ algorithms consumes on average four FES per iteration for $n \le 128$, with a slight increase to seven FES per iteration for $n = 1024$.

            RP            RG            FG            ARP           ES
   n   min max mean  min max mean  min max mean  min max mean  min max mean
   4    20  69  39    80 131 106    62  99  78    20  69  39    28  46  38
   8    34  65  47    78 105  87    59  81  70    22  53  43    28  43  35
  16    38  54  48    65  81  73    61  74  66    40  56  47    30  42  36
  32    45  57  50    62  71  66    57  69  62    43  62  50    33  41  37
  64    47  57  52    60  68  64    57  66  61    50  56  52    33  41  37
 128    50  56  53    59  64  62    58  64  61    51  56  54    35  40  37
 256    56  63  59    59  62  61    58  62  60    56  63  59    35  40  37
 512    64  69  67    59  62  60    59  62  60    64  70  67    36  38  37
1024    84  92  89    59  61  60    60  61  60    85  91  88    36  38  37

Table 6.3: Recorded minimum, maximum, and mean #FES/$n$ on the sphere function $f_1$ to reach a relative accuracy of $1.91 \cdot 10^{-6}$.

We also observe that the gap between the minimum and maximum number of FES narrows with increasing dimension for all methods. Finally, the first-order schemes reach the minimum, as expected, in a single iteration across all dimensions. For the high-conditioned ellipsoidal function $f_2$ we observe a genuinely different behavior of the algorithms. Figure 6.1 shows for each algorithm the mean number of FES (left panel) and ITS (right panel) in blocks of size $n$ needed to reach the absolute accuracy $1.91 \cdot 10^{-6}\, S$ on $f_2$. The minimum, maximum, and mean numbers of ITS and FES are reported in the Appendix in Tables B.2 and B.3, respectively.

Fig.
6.1: Average #FES/$n$ (left panel) and #ITS/$n$ (right panel) vs. dimension $n$ on the ellipsoidal function $f_2$ to reach a relative accuracy of $1.91 \cdot 10^{-6}$ (#ITS for GM). Further data are available in Tables B.2 and B.3.

We again observe the theoretically expected linear scaling of the number of ITS with dimension for sufficiently large $n$. The mean number of ITS now spans two orders of magnitude across the different algorithms. Standard Random Pursuit outperforms the RG and the ES algorithm. Moreover, the accelerated RP$_\mu$ scheme outperforms the FG scheme by a factor of 4. All methods show, however, an increased overall run time due to the high condition number of the quadratic form. This is also reflected in the increased number of FES needed by the line search oracle in the RP$_\mu$ algorithms. The line search oracle now consumes on average 12-14 FES per iteration. In terms of the consumed FES budget, we observe that Random Pursuit still outperforms Random Gradient for small dimensions but needs a comparable number of FES for $n \ge 64$ (around 30,000 FES in blocks of $n$). The ES, the ARP$_\mu$, and the FG algorithm need an order of magnitude fewer FES. The accelerated RP$_\mu$ is only outperformed by the FG algorithm. The performance of the ES algorithm is again remarkable given the fact that it does not need information about the parameters $L_1$ and $m$, which are of fundamental importance for the accelerated schemes.

6.3.2. Performance on the Full Benchmark Set for $n = 64$. We now illustrate the behavior of the different algorithms on the full benchmark set for fixed dimension $n = 64$. We observed qualitatively similar behavior for all other dimensions. Table 6.4 contains the number of ITS needed to reach the scale-dependent accuracy $1.91 \cdot 10^{-6}\, S$ for all algorithms.
            RP                  RG                FG              ARP               ES           GM   GM_LS
fun.  min   max  mean    min    max   mean  min  max  mean  min  max  mean   min   max  mean
f1     12    14    13      30     34     32   30   34    32   12   14    13    33    41    37     1      1
f2   1899  2096  2001   16601  17333  16868  990 1079  1038  233  250   242  5451  5954  5729  3934      3
f3   2068  2191  2136   18922  19075  19004  892  970   942  192  678   473  5766  6050  5916  4474   2237
f4    954  1023   995    8727   8995   8854  441  534   458  137  188   159  2651  2854  2751  2086   1044
f5     26    30    28       -      -      -    -    -     -   26   30    28    73    85    78     -      1

Table 6.4: Average #ITS/$n$ to reach the relative accuracy $1.91 \cdot 10^{-6}$ in dimension $n = 64$. For GM and GM$_{LS}$ the exact number of ITS is reported. The observed minimum ITS across all (gradient-free) algorithms is marked in bold face for each function.

We observe that Random Pursuit outperforms the RG and the ES algorithm, and that the ARP$_\mu$ algorithm outperforms all gradient-free schemes in terms of number of ITS on all functions (with performance equal to Random Pursuit on $f_1$ and $f_5$). We consistently observe an improved performance of all algorithms on the regularized strongly convex function $f_4$ compared to its convex counterpart $f_3$. This expected behavior is most pronounced for the ARP$_\mu$ scheme where, on average, the number of ITS is reduced to $159/473 \approx 1/3$. A comparison between the two gradient schemes reveals that GM$_{LS}$ outperforms the fixed step size gradient scheme on all test functions. The remarkable performance of GM$_{LS}$ on $f_2$ is due to the fact that the spectrum of the Hessian contains, in equal parts, two different values (1 and $L_1$, respectively). A single line search along a gradient direction is thus simultaneously optimal for $n/2$ directions of this function. The GM$_{LS}$ scheme thus reaches the target accuracy in as few as three steps. This efficiency is lost as soon as the spectrum becomes sufficiently diverse (as indicated by its performance on $f_4$).
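This finite-step behavior is easy to reproduce. The sketch below runs steepest descent with an exact line search (standing in for the fminunc oracle used by GM$_{LS}$) on a quadratic whose Hessian has only the two eigenvalues 1 and $L_1$ in equal multiplicity, starting from $x_0 = 0$:

```python
import numpy as np

n = 64
q = np.array([1000.0] * (n // 2) + [1.0] * (n // 2))  # two-valued spectrum
f = lambda x: 0.5 * np.sum(q * (x - 1.0) ** 2)        # f2-like quadratic
S = 50.0 * n                                          # scale from Table 6.6

x = np.zeros(n)
for k in range(3):
    g = q * (x - 1.0)                 # gradient of f
    t = (g @ g) / (g @ (q * g))       # exact line search along -g
    x = x - t * g
print(f(x) / S)                       # below 1.91e-6 after only 3 steps
```

The symmetric start makes the error components along both eigenspaces equal, so the first exact step nearly annihilates the stiff ($L_1$) component, the second the soft one, and the third mops up the small residual reintroduced in between.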
We also remark that the performance of the pairs RP/GM$_{LS}$ and FG/GM is in full agreement with theory. We see on functions $f_3/f_4$ that RP is about a factor of $n$ slower than GM$_{LS}$ due to the unavailable gradient information. The same is true for FG/GM, where FG is about $4n$ times slower than GM due to the theoretically required reduction of the optimal step length by a factor of $1/4$ [29]. For function $f_4$ we illustrate the convergence behavior of the different algorithms in Figure 6.2. After a short algorithm-dependent initial phase, we observe linear convergence of all algorithms for fixed dimension, i.e., a constant reduction of the logarithm of the distance to the minimum per iteration. We also observe that the accelerated Random Pursuit consistently outperforms standard Random Pursuit for all measured accuracies on $f_4$ (see Table S-4 in [38] for the corresponding numerical data). This behavior is less pronounced for the function pair $f_1/f_2$, as shown in Figure 6.3. On $f_1$ both Random Pursuit schemes have identical progress rates that are also consistent with the theoretically predicted one. On $f_2$ Random Pursuit outperforms the accelerated scheme for low accuracies (see also Table S-2 in [38] for the numerical data) but is quickly overtaken due to the faster progress rate of the accelerated scheme. We also observe that the theoretically predicted worst-case progress rate (dotted line in the right panel of Figure 6.3) does not reflect the true progress on this test function. Comparison of the numerical results on the function pair $f_1/f_5$ (see Figure 6.4) demonstrates the expected invariance under strictly monotone trans-

Fig. 6.2: Average accuracy (in log scale) vs.
#ITS/$n$ for all algorithms on $f_4$ in dimension $n = 64$. Further data are available in Table S-4 in [38].

Fig. 6.3: Numerical convergence rate of standard and accelerated Random Pursuit on $f_1$ (left panel) and $f_2$ (right panel) in dimension $n = 64$. For both instances the theoretically predicted worst-case progress rate (dotted line) is shown for comparison. The rate of the ARP scheme is compared to the theoretically predicted convergence rate of FG (dash-dotted line) [29].

formations of the Random Pursuit algorithms and the ES scheme. These algorithms enjoy the same convergence behavior (up to small random variations) toward the target solution, while the Random Gradient schemes fail to converge to the target accuracy. Note, however, that the numbers reported in, e.g., Table 6.4 are not identical for $f_1$ and $f_5$, because the stopping criteria used depend on the scale of the function values. The convergence rates are the same, but more iterations are needed for $f_5$ because the required accuracy is considerably smaller. We also report the performance of the different algorithms in terms of the number of FES needed to reach the target accuracy of $1.91 \cdot 10^{-6}\, S$ for the different test functions. For all algorithms the minimum, maximum, and average numbers of FES are recorded in Table 6.5. We observe that the RP$_\mu$ algorithm outperforms the

Fig. 6.4: Numerical convergence rate of the RP$_\mu$, the ES, and the RG scheme on $f_1$ and $f_5$ in $n = 64$ dimensions.
The accuracy is measured in terms of the logarithmic distance to the optimum, $\log(\|x_k - x^*\|^2)$.

            RP                   RG                 FG              ARP               ES
fun.  min    max   mean    min    max   mean  min   max  mean   min   max  mean   min   max  mean
f1      47     57     52     60     68     64    57    66    61    50    56    52    33    41    37
f2   27723  30272  29071  33202  34667  33736  1980  2159  2077  3035  3247  3159  5451  5954  5729
f3   25520  27034  26351  37844  38150  38008  1785  1939  1885  2199  8149  5609  5766  6050  5916
f4   11629  12482  12122  17455  17990  17708   883  1069   916  1557  2134  1825  2651  2854  2751
f5     338    384    360      -      -      -     -     -     -   342   399   361    73    85    78

Table 6.5: Average #FES/$n$ to reach the relative accuracy $1.91 \cdot 10^{-6}$ in dimension $n = 64$. The observed minimum #FES/$n$ across all algorithms is marked in bold face for each function.

standard Random Gradient method on all tested functions. However, Random Pursuit is not competitive with the accelerated schemes and the ES algorithm. The accelerated RP$_\mu$ scheme is only outperformed by the FG algorithm. The latter scheme shows particularly good performance on the convex function $f_3$, with considerably lower variance. For functions $f_2$-$f_4$ the RP$_\mu$ algorithms need around 12-15 FES per line search oracle call. We emphasize again that the performance of the adaptive step size ES scheme is remarkable given the fact that it does not need any function-specific parametrization. A comparison with the parameter-free Random Pursuit scheme shows that the ES needs around four times fewer FES on functions $f_2$-$f_4$. We remark that Random Pursuit with discrete sampling, i.e., using the set of signed unit vectors for sample generation (see Section 2.1), yields numerical results on the present benchmark that are consistent with our theoretical predictions. We observed improved performance of Random Pursuit with discrete sampling on the function triplet $f_1/f_2/f_5$. This is evident, as the coordinate systems of these functions coincide with the standard basis.
Thus, algorithms that move along the standard coordinate system are favored. On the function pair $f_3/f_4$ we do not see any significant deviation from the presented performance results.

We also exemplify the influence of the parameter $\mu$ on RP$_\mu$'s performance. We choose the function $f_2$ as test instance because RP$_\mu$ consumes the most FES on this function. We vary $\mu$ between $10^{-1}$ and $10^{-10}$ and run the RP$_\mu$ scheme 25 times to reach a relative accuracy of $1.91 \cdot 10^{-6}$. The mean numbers of ITS and FES are reported in Table 6.6.

param. $\mu$:  1e-1   1e-2   1e-3   1e-4   1e-5   1e-6   1e-7   1e-8   1e-9   1e-10
#ITS/$n$:      1986   1982   2013   2017   2001   2001   2001   2001   2020   2020
#FES/$n$:     19824  19781  21435  26120  29071  29495  29537  29542  38993  39070

Table 6.6: Performance of RP$_\mu$ on the ellipsoid function $f_2$ ($m = 1$, $L = 1000$, $S = 3200$, $n = 64$) for different line search parameters $\mu$. The mean (over 25 repetitions) numbers of ITS/$n$ and FES/$n$ to reach a relative accuracy of $1.91 \cdot 10^{-6}$ are reported.

We see that the choice of $\mu$ has almost no influence on the number of ITS needed to reach the target accuracy, thus justifying the use of ITS as a meaningful performance measure. The numbers of FES span the same order of magnitude, ranging from 19824 for $\mu = 10^{-1}$ to 39070 for $\mu = 10^{-10}$. The number of FES for the standard setting $\mu = 10^{-5}$ is approximately in the middle of these extremes (29071 FES). This implies that the qualitative picture of the reported performance comparison remains valid, although individual results for RP$_\mu$ and ARP$_\mu$ could be improved by choosing $\mu$ optimally. An in-depth analysis of the optimal function-dependent choice of the $\mu$ parameter is the subject of future studies. As a final remark we highlight that the present numerical results for the Random Gradient methods are fully consistent with the ones presented in Nesterov's paper [29].

7. Discussion and Conclusion.
We have derived a convergence proof and convergence rates for Random Pursuit on convex functions. We have used a quadratic upper bound technique to bound the expected single-step progress of the algorithm. Assuming exact line search, this yields global linear convergence for strongly convex functions and convergence of order $1/k$ for general convex functions. For line search oracles with relative error $\mu$, the same results have been obtained with convergence rates reduced by a factor of $\frac{1}{1-\mu}$. For inexact line search with absolute error $\mu$, convergence can be established only up to an additive error term depending on $\mu$, the properties of the function, and the dimensionality.

The convergence rate of Random Pursuit exceeds the rate of the standard (first-order) Gradient Method by a factor of $n$. Jägersküpper showed that no better performance can be expected for strongly convex functions [15]. He derived a lower bound for algorithms of the form (1.3) where at each iteration the step size along the random direction is chosen so as to minimize the distance to the minimizer $x^*$. On sphere functions $f(x) = (x - x^*)^T(x - x^*)$ Random Pursuit coincides with the described scheme, thus achieving the lower bound.

The numerical experiments showed that (i) standard Random Pursuit is effective on strongly convex functions with moderate condition number, and (ii) the accelerated scheme is comparable to Nesterov's fast gradient method and outperforms the ES algorithm. The experimental results also revealed that (i) RP$_\mu$'s empirical convergence is (as predicted by theory) $n$ times slower than that of the corresponding gradient scheme with line search (GM$_{LS}$), and (ii) both continuous and discrete sampling can be employed in Random Pursuit.
We confirmed the invariance of the RP$_\mu$ algorithms and the ES under monotone transformations of the objective function on the quasiconvex funnel-shaped function $f_5$, where Random Gradient algorithms fail. We also highlighted the remarkable performance of the ES scheme given the fact that it does not need any function-specific input parameters.

The present theoretical and experimental results hint at a number of potential enhancements of standard Random Pursuit in future work. First, RP$_\mu$'s convergence rate depends on the function-specific parameter $L_1$ that bounds the curvature of the objective function. Any reduction of this dependency would imply faster convergence on a larger class of functions. The empirical results on the function pair $f_1/f_2$ (see Tables S-1 and S-2 in [38]) also suggest that complicated accelerated schemes do not present any significant advantage on functions with small constant $L_1$. It is conceivable that Random Pursuit can incorporate a mechanism to learn second-order information about the function "on the fly", thus improving the conditioning of the original optimization problem and potentially reducing it to the $L_1 \approx 1$ case. This may be possible using techniques from randomized Quasi-Newton approaches [3, 23, 37] or differential geometry [7]. It is noteworthy that heuristic versions of such an adaptation mechanism have proved extremely useful in practice for adaptive step size algorithms [19, 13, 25].

Second, we have not considered Random Pursuit for constrained optimization problems of the form

$$\min f(x) \quad \text{subject to } x \in K, \quad (7.1)$$

where $K \subset \mathbb{R}^n$ is a convex set. The key challenge is how to treat iterates $x_{k+1} = x_k + \mathrm{LSapprox}(x_k, u) \cdot u$ generated by the line search oracle that lie outside the domain $K$. A classic idea is to apply a projection operator $\pi_K$ and use the resulting $x'_{k+1} := \pi_K(x_{k+1})$ as the next iterate.
However, finding a projection onto a convex set (except for simple bodies such as hyper-parallelepipeds) can be as difficult as the original optimization problem. Moreover, it is an open question whether general convergence can be ensured, and what convergence rates can be achieved. Another possibility is to constrain the line search to the intersection of the line with the convex body $K$. In this case, it is evident that one can only expect exponentially slow convergence rates for this method. Consider the linear function $f(x) = \mathbf{1}^T x$ and $K = \mathbb{R}^n_+$. Once an iterate $x_k$ lies on the boundary $\partial K$ of the domain, say the first coordinate of $x_k$ is zero, then only directions $u$ with nonnegative first coordinate can lead to an improvement. As soon as a constant fraction of the coordinates is zero, the probability of finding an improving direction is exponentially small. Karmanov [17] proposed the following combination of projection and line search constraining: first, a random point $y$ at some fixed distance from the current iterate is drawn uniformly at random and then projected onto the set $K$; a constrained line search is then performed along the line through the current iterate $x_k$ and $\pi_K(y)$. It remains open to study the convergence rate of this method.

Finally, we envision convergence guarantees and provable convergence rates for Random Pursuit on more general function classes. The invariance of the line search oracle under strictly monotone transformations of the objective function already implied that Random Pursuit converges on certain strictly quasiconvex functions. It also seems within reach to derive convergence guarantees for Random Pursuit on the class of globally convex (or $\delta$-convex) functions [12] or on convex functions with bounded perturbations [30] (see the right panel of Figure 7.1 for the graph of such an instance).
This may be achieved by appropriately adapting line search methods to these function classes.

In summary, we believe that the theoretical and experimental results on Random Pursuit represent a promising first step toward the design of competitive derivative-free optimization methods that are easy to implement, possess theoretical convergence guarantees, and are useful in practice.

Fig. 7.1: Left panel: graph of the function f_5 in 1D. Right panel: graph of a globally convex function f_GC.

Acknowledgments. We sincerely thank Dr. Martin Jaggi for several helpful discussions. We would also like to thank the referees for their careful reading of the manuscript and their constructive comments that truly helped improve its quality.

REFERENCES

[1] R. L. Anderson, Recent advances in finding best operating conditions, Journal of the American Statistical Association, 48 (1953), pp. 789-798.
[2] A. Auger, N. Hansen, J. M. Perez Zerpa, R. Ros, and M. Schoenauer, Experimental comparisons of derivative-free optimization algorithms, in Proceedings of the 8th International Symposium on Experimental Algorithms, SEA '09, Berlin, Heidelberg, 2009, Springer-Verlag, pp. 3-15.
[3] B. Betro and L. De Biase, A Newton-like method for stochastic optimization, in Towards Global Optimization, vol. 2, North-Holland, 1978, pp. 269-289.
[4] H. G. Beyer, The Theory of Evolution Strategies, Natural Computing, Springer-Verlag New York, Inc., New York, NY, USA, 2001.
[5] C. Brif, R. Chakrabarti, and H. Rabitz, Control of quantum phenomena: past, present and future, New Journal of Physics, 12 (2010), p. 075008.
[6] S. H. Brooks, A discussion of random methods for seeking maxima, Operations Research, 6 (1958), pp. 244-251.
[7] H. B. Cheng, L. T. Cheng, and S. T. Yau, Minimization with the affine normal direction, Comm. Math. Sci., 3 (2005), pp. 561-574.
[8] A. R. Conn, K. Scheinberg, and L. N. Vicente, Introduction to Derivative-Free Optimization, MPS-SIAM Book Series on Optimization, SIAM, 2009.
[9] E. den Boef and D. den Hertog, Efficient line search methods for convex functions, SIAM Journal on Optimization, 18 (2007), pp. 338-363.
[10] E. Hazan, Sparse approximate solutions to semidefinite programs, in Proceedings of the 8th Latin American Conference on Theoretical Informatics, LATIN '08, Berlin, Heidelberg, 2008, Springer-Verlag, pp. 306-316.
[11] R. Heijmans, When does the expectation of a ratio equal the ratio of expectations?, Statistical Papers, 40 (1999), pp. 107-115. doi:10.1007/BF02927114.
[12] T. C. Hu, V. Klee, and D. Larman, Optimization of globally convex functions, SIAM Journal on Control and Optimization, 27 (1989), pp. 1026-1047.
[13] C. Igel, T. Suttorp, and N. Hansen, A computational efficient covariance matrix update and a (1+1)-CMA for evolution strategies, in GECCO '06: Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, New York, NY, USA, 2006, ACM, pp. 453-460.
[14] J. Jägersküpper, Rigorous runtime analysis of the (1+1) ES: 1/5-rule and ellipsoidal fitness landscapes, in Foundations of Genetic Algorithms, Alden Wright, Michael Vose, Kenneth De Jong, and Lothar Schmitt, eds., vol. 3469 of Lecture Notes in Computer Science, Springer Berlin/Heidelberg, 2005, pp. 356-361. doi:10.1007/11513575.14.
[15] J. Jägersküpper, Lower bounds for hit-and-run direct search, in Stochastic Algorithms: Foundations and Applications, Juraj Hromkovic, Richard Kralovic, Marc Nunkesser, and Peter Widmayer, eds., vol. 4665 of Lecture Notes in Computer Science, Springer Berlin/Heidelberg, 2007, pp. 118-129.
[16] V. G. Karmanov, Convergence estimates for iterative minimization methods, USSR Computational Mathematics and Mathematical Physics, 14 (1974), pp. 1-13.
[17] V. G. Karmanov, On convergence of a random search method in convex minimization problems, Theory of Probability and its Applications, 19 (1974), pp. 788-794. In Russian.
[18] D. C. Karnopp, Random search techniques for optimization problems, Automatica, 1 (1963), pp. 111-121.
[19] G. Kjellström and L. Taxen, Stochastic optimization in system design, IEEE Trans. Circ. and Syst., 28 (1981).
[20] A. Kleiner, A. Rahimi, and M. I. Jordan, Random conic pursuit for semidefinite programming, in Neural Information Processing Systems, 2010.
[21] T. G. Kolda, R. M. Lewis, and V. Torczon, Optimization by direct search: new perspectives on some classical and modern methods, SIAM Review, 45 (2004), pp. 385-482.
[22] V. N. Krutikov, On the rate of convergence of the minimization method along vectors in given directional systems, USSR Comput. Maths. Phys., 23 (1983), pp. 154-155. In Russian.
[23] D. Leventhal and A. S. Lewis, Randomized Hessian estimation and directional search, Optimization, 60 (2011), pp. 329-345.
[24] R. L. Maybach, Solution of optimal control problems on a high-speed hybrid computer, Simulation, 7 (1966), pp. 238-245.
[25] C. L. Müller and I. F. Sbalzarini, Gaussian Adaptation revisited - an entropic view on covariance matrix adaptation, in EvoApplications, C. Di Chio et al., eds., no. 6024 in Lecture Notes in Computer Science, Springer, 2010, pp. 432-441.
[26] V. A. Mutseniyeks and L. A. Rastrigin, Extremal control of continuous multi-parameter systems by the method of random search, Eng. Cybernetics, 1 (1964), pp. 82-90.
[27] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM Journal on Optimization, 19 (2009), pp. 1574-1609.
[28] Y. Nesterov, Introductory Lectures on Convex Optimization, Kluwer, Boston, 2004.
[29] Y. Nesterov, Random gradient-free minimization of convex functions, tech. report, ECORE, 2011.
[30] H. X. Phu, Minimizing convex functions with bounded perturbation, SIAM Journal on Optimization, 20 (2010), pp. 2709-2729.
[31] B. Polyak, Introduction to Optimization, Optimization Software, Inc., Publications Division, New York, 1987.
[32] MATLAB R2012a, http://www.mathworks.ch/help/toolbox/optim/ug/fminunc.html.
[33] G. Rappl, On linear convergence of a class of random search algorithms, ZAMM - Journal of Applied Mathematics and Mechanics / Zeitschrift für Angewandte Mathematik und Mechanik, 69 (1989), pp. 37-45.
[34] L. A. Rastrigin, The convergence of the random search method in the extremal control of a many parameter system, Automation and Remote Control, 24 (1963), pp. 1337-1342.
[35] I. Rechenberg, Evolutionsstrategie; Optimierung technischer Systeme nach Prinzipien der biologischen Evolution, Frommann-Holzboog, Stuttgart-Bad Cannstatt, 1973.
[36] M. Schumer and K. Steiglitz, Adaptive step size random search, IEEE Transactions on Automatic Control, 13 (1968), pp. 270-276.
[37] S. U. Stich and C. L. Müller, On spectral invariance of randomized Hessian and covariance matrix adaptation schemes, under review at PPSN 2012, (2012).
[38] S. U. Stich, C. L. Müller, and B. Gärtner, Supporting online material for Optimization of convex functions with random pursuit, arXiv:1111.0194v2, (2012).
[39] J. Sun, J. M. Garibaldi, and C. Hodgman, Parameter estimation using metaheuristics in systems biology: a comprehensive review, IEEE/ACM Trans. Comput. Biol. Bioinformatics, 9 (2012), pp. 185-202.
[40] S. Vempala, Recent progress and open problems in algorithmic convex geometry, in IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2010), Kamal Lodaya and Meena Mahajan, eds., vol. 8 of Leibniz International Proceedings in Informatics (LIPIcs), Dagstuhl, Germany, 2010, Schloss Dagstuhl - Leibniz-Zentrum für Informatik, pp. 42-64.
[41] A. A. Zhigljavsky and A. G. Zilinskas, Stochastic Global Optimization, Springer-Verlag, Berlin, Germany, 2008.
[42] R. Zielinski and P. Neumann, Stochastische Verfahren zur Suche nach dem Minimum einer Funktion, Akademie-Verlag, Berlin, Germany, 1983.

Appendix A. Lemma.

Lemma A.1. Let $\{f_t\}_{t\in\mathbb{N}}$ be a sequence with $f_t \in \mathbb{R}_+$. Suppose
$$ f_{t+1} \le \left(1 - \frac{\theta}{t}\right) f_t + \frac{C\theta^2}{t^2} + D, \quad \text{for } t \ge 1, $$
for some constants $\theta > 1$, $C > 0$ and $D \ge 0$. Then it follows by induction that
$$ f_t \le \frac{Q(\theta)}{t} + (t-1)D, \quad \text{where } Q(\theta) = \max\left\{ \frac{\theta^2 C}{\theta - 1},\, f_1 \right\}. $$
A very similar result was stated without proof in [27], and Hazan [10] uses the same argument.

Proof. For $t = 1$ it holds that $f_1 \le Q(\theta)$ by definition of $Q(\theta)$. Assume that the result holds for some $t \ge 1$. If $Q(\theta) = \theta^2 C/(\theta-1)$, then we deduce
$$ f_{t+1} \le \frac{\theta^2 C (t-\theta)}{(\theta-1)t^2} + \frac{C\theta^2}{t^2} + \frac{(t-\theta)(t-1)D}{t} + D = \frac{\theta^2 C (t-1)}{(\theta-1)t^2} + \frac{D\bigl(t^2 - \theta(t-1)\bigr)}{t} \le \frac{\theta^2 C}{(\theta-1)(t+1)} + tD. $$
If on the other hand $Q(\theta) = f_1$, then $f_1 \ge \theta^2 C/(\theta-1) \Leftrightarrow (\theta-1) f_1 \ge \theta^2 C$, and it follows that
$$ f_{t+1} \le \frac{(t-\theta) f_1}{t^2} + \frac{C\theta^2}{t^2} + \frac{(t-\theta)(t-1)D}{t} + D = \frac{(t-1) f_1}{t^2} + \frac{\theta^2 C - (\theta-1) f_1}{t^2} + \frac{D\bigl(t^2 - \theta(t-1)\bigr)}{t} \le \frac{f_1}{t+1} + tD. $$

Appendix B. Tables.

B.1. Initial σ_0 of the ES algorithm for all test functions. Table B.1 reports the empirically determined optimal initial step sizes σ_0 used as input to the ES algorithm.
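The induction in Lemma A.1 above also admits a quick numerical sanity check. The following Python snippet is ours, not part of the paper, and is not a proof: it iterates the recursion with equality (the worst admissible case) and verifies the claimed bound $f_t \le Q(\theta)/t + (t-1)D$ for a few moderate parameter choices with $1 < \theta \le 1.5$ (chosen so the sequence remains nonnegative).

```python
# Numerical sanity check (ours, not a proof) of the bound in Lemma A.1:
# iterate the recursion with equality, the worst admissible case, and
# verify f_t <= Q(theta)/t + (t-1)*D for a few moderate values of theta.

def check_lemma(theta, C, D, f1, T=5000):
    Q = max(theta**2 * C / (theta - 1.0), f1)
    f = f1
    for t in range(1, T):
        assert f <= Q / t + (t - 1) * D + 1e-9, (t, f)
        # worst case: take the recursion with equality
        f = (1.0 - theta / t) * f + C * theta**2 / t**2 + D
    return True

for theta, C, D, f1 in [(1.2, 1.0, 0.0, 5.0),
                        (1.5, 3.0, 1e-3, 0.1),
                        (1.4, 0.5, 0.5, 2.0)]:
    assert check_lemma(theta, C, D, f1)
```

For larger θ the first step can make the equality iterate negative, violating the lemma's positivity hypothesis, which is why this check restricts itself to moderate θ.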
dim    f_1        f_2        f_3          f_4
4      0.79158    1.3897     0.2054       0.20395
8      0.49167    0.78761    0.08922      0.088145
16     0.32692    0.49500    0.04134      0.041273
32     0.22292    0.32547    0.019911     0.019905
64     0.15542    0.22243    0.0097212    0.0097127
128    0.10925    0.15638    0.0048305    0.0048335
256    0.076658   0.10902    0.0024171    0.0024114
512    0.054339   0.076568   0.0012012    0.0012006
1024   0.038367   0.054173   0.00060284   0.00060223

Table B.1: The initial values of the step size σ for (1+1)-ES on the test functions for various dimensions.

B.2. Data for the ellipsoid test function for n ≤ 1024. Tables B.2 and B.3 report the numerical data used to produce Figure 6.1.

n    | RP min max mean  | RG min max mean     | FG min max mean  | ARP min max mean | ES min max mean  | GM   | GMLS
4    | 236 472 364      | 28322 33608 31549   | 966 1575 1282    | 124 682 322      | 2491 4557 3784   | 3934 | 3
8    | 787 1241 1088    | 22461 24610 23666   | 981 1262 1155    | 174 285 226      | 3786 5799 4906   | 3934 | 3
16   | 1326 1763 1624   | 18981 20403 19805   | 975 1164 1076    | 218 256 232      | 4967 6034 5400   | 3934 | 3
32   | 1769 2026 1880   | 17381 18393 17858   | 968 1102 1048    | 221 256 237      | 5183 6145 5625   | 3934 | 3
64   | 1899 2096 2001   | 16601 17333 16868   | 990 1079 1038    | 233 250 242      | 5451 5954 5729   | 3934 | 3
128  | 1987 2145 2076   | 16183 16721 16376   | 978 1061 1030    | 237 252 245      | 5512 5964 5753   | 3934 | 3
256  | 2063 2173 2117   | 15960 16276 16115   | 1007 1053 1026   | 238 251 245      | 5603 6065 5805   | 3934 | 3
512  | 2081 2159 2119   | 15937 16103 16011   | 1015 1037 1026   | 239 249 244      | 5706 5909 5818   | 3934 | 3
1024 | 2109 2152 2132   | 15846 16065 15955   | 1020 1035 1027   | 242 247 244      | 5759 5932 5833   | 3934 | 3

Table B.2: Ellipsoid function f_2 to accuracy 1.91e-6, S = 50n, L = 1000. #ITS/n (GM and GMLS: #ITS).
n    | RP min max mean     | RG min max mean     | FG min max mean  | ARP min max mean | ES min max mean
4    | 3155 6236 4775      | 56643 67217 63097   | 1933 3150 2564   | 1408 6983 3631   | 2491 4557 3784
8    | 11043 17124 15216   | 44923 49221 47331   | 1963 2525 2310   | 2113 3483 2774   | 3786 5799 4906
16   | 19182 25225 23320   | 37962 40806 39610   | 1951 2329 2152   | 2730 3239 2930   | 4967 6034 5400
32   | 25768 29258 27302   | 34762 36785 35715   | 1937 2203 2097   | 2870 3243 3043   | 5183 6145 5625
64   | 27723 30272 29071   | 33202 34667 33736   | 1980 2159 2077   | 3035 3247 3159   | 5451 5954 5729
128  | 28791 30757 29894   | 32365 33441 32753   | 1956 2121 2059   | 3139 3354 3259   | 5512 5964 5753
256  | 29363 30691 30016   | 31920 32552 32230   | 2013 2106 2053   | 3210 3411 3322   | 5603 6065 5805
512  | 29173 30167 29644   | 31874 32207 32022   | 2031 2075 2053   | 3311 3453 3386   | 5706 5909 5818
1024 | 29316 29914 29639   | 31692 32131 31910   | 2041 2069 2054   | 3455 3541 3494   | 5759 5932 5833

Table B.3: Ellipsoid function f_2 to accuracy 1.91e-6, S = 50n, L = 1000. #FES/n.

SUPPORTING ONLINE MATERIAL FOR OPTIMIZATION OF CONVEX FUNCTIONS WITH RANDOM PURSUIT

S. U. Stich, C. L. Müller, and B. Gärtner

The project CG Learning acknowledges the financial support of the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission, under FET-Open grant number 255827. S. U. Stich and C. L. Müller: Institute of Theoretical Computer Science, ETH Zürich, and Swiss Institute of Bioinformatics (sstich@inf.ethz.ch, christian.mueller@inf.ethz.ch). B. Gärtner: Institute of Theoretical Computer Science, ETH Zürich (gaertner@inf.ethz.ch).

Appendix C. Exemplary Matlab Codes.

C.1. Accelerated Random Pursuit.
function [fval, x, funeval] = rp_acc(fitfun, xstart, N, mu, m, L)
% Algorithm Accelerated Random Pursuit
%   fitfun   name or function handle
%   N        number of iterations
%   mu       line search accuracy
%   funeval  #function evaluations
%   m, L     parameters for quadratic upper and lower bounds

% line search parameters
opts = optimset('Display', 'off', 'LargeScale', 'off', 'TolX', mu);
funeval = 0;
x = xstart;
n = length(xstart);
th = 1/(L*n^2);
ga = m;
v = x;
for i = 1:N
    p = [1/th, ga - m, -ga];
    be = max(roots(p));
    de = (be*ga) / (ga + be*m);
    y = (1 - de)*x + de*v;
    ga = (1 - be)*ga + be*m;
    la = be / ga * m;
    d = randn(size(xstart)); d = d/norm(d);
    [sigma, fval, ~, infos] = ...
        fminunc(@(sigma) feval(fitfun, y + sigma*d), 0, opts);
    funeval = funeval + infos.funcCount;
    x = y + sigma*d;
    v = (1 - la)*v + la*y + sigma/(be*n)*d;
end

Fig. S-1: Matlab code for algorithm ARP_µ.

C.2. Random Pursuit.

function [fval, x, funeval] = rp(fitfun, xstart, N, mu)
% Algorithm Random Pursuit
%   fitfun   name or function handle
%   N        number of iterations
%   mu       line search accuracy
%   funeval  #function evaluations

% line search parameters
opts = optimset('Display', 'off', 'LargeScale', 'off', 'TolX', mu);
funeval = 0;
x = xstart;
for i = 1:N
    d = randn(size(xstart)); d = d/norm(d);
    [sigma, fval, ~, infos] = ...
        fminunc(@(sigma) feval(fitfun, x + sigma*d), 0, opts);
    funeval = funeval + infos.funcCount;
    x = x + sigma*d;
end

Fig. S-2: Matlab code for algorithm RP_µ.

C.3. Random Gradient.

function [fval, x] = rg(fitfun, xstart, N, eps, L)
% Random Gradient [Nesterov 2011]
%   fitfun   name or function handle
%   N        number of iterations
%   eps      finite difference parameter
%   L        quadratic upper bound
x = xstart;
n = length(xstart);
h = 1/(4*L*(n+4));
for i = 1:N
    d = randn(size(xstart));
    g = (feval(fitfun, x + eps*d) - feval(fitfun, x)) / eps;
    x = x - h*g*d;
end
fval = feval(fitfun, x);

Fig. S-3: Matlab code for algorithm RG.
C.4. Accelerated Random Gradient.

function [fval, x] = rg_acc(fitfun, xstart, N, eps, m, L)
% Accelerated Random Gradient [Nesterov 2011]
%   fitfun   name or function handle
%   N        number of iterations
%   eps      finite difference parameter
%   m, L     parameters for quadratic upper and lower bounds
x = xstart;
n = length(xstart);
h = 1/(4*L*(n+4));
th = 1/(16*L*(n+1)^2);
ga = m;
v = x;
for i = 1:N
    p = [1/th, ga - m, -ga];
    be = max(roots(p));
    de = (be*ga) / (ga + be*m);
    y = (1 - de)*x + de*v;
    ga = (1 - be)*ga + be*m;
    la = be / ga * m;
    d = randn(size(xstart));
    g = (feval(fitfun, x + eps*d) - feval(fitfun, x)) / eps;
    x = y - h*g*d;
    v = (1 - la)*v + la*y - th/be*g*d;
end
fval = feval(fitfun, x);

Fig. S-4: Matlab code for algorithm FG.

C.5. (1+1)-ES.

function [fval, x] = es(fitfun, xstart, N, sigma)
% (1+1)-ES
%   fitfun   name or function handle
%   N        number of iterations
%   sigma    initial stepsize
fval = feval(fitfun, xstart);
x = xstart;
ss = exp(1/3);
ff = exp(1/3 * (-0.27)/(1 - 0.27));
for i = 1:N
    d = randn(size(xstart));
    f = feval(fitfun, x + sigma*d);
    if f <= fval
        x = x + sigma*d;
        fval = f;
        sigma = sigma*ss;
    else
        sigma = sigma*ff;
    end
end

Fig. S-5: Matlab code for algorithm ES.

Appendix D. Tables.

D.1. Number of iterations for increasing accuracy for n = 64. Tables S-1 to S-5 summarize the number of iterations needed to achieve a corresponding relative accuracy (acc) for fixed dimension n = 64.

acc     | RP min max mean | RG min max mean | FG min max mean | ARP min max mean | ES min max mean | GM | GMLS
6.25e-2 | 2 3 3      | 6 8 7      | 6 8 7      | 2 3 3      | 6 9 8      | 1 | 1
3.12e-2 | 3 4 3      | 8 10 8     | 8 10 8     | 3 4 3      | 8 12 10    | 1 | 1
1.56e-2 | 4 5 4      | 9 12 10    | 9 12 10    | 4 5 4      | 10 14 12   | 1 | 1
7.81e-3 | 4 6 5      | 11 13 12   | 11 13 12   | 4 5 5      | 12 16 14   | 1 | 1
3.91e-3 | 5 6 5      | 13 15 14   | 13 15 14   | 5 6 5      | 14 18 16   | 1 | 1
1.95e-3 | 5 7 6      | 14 17 15   | 14 17 15   | 5 7 6      | 15 20 18   | 1 | 1
9.77e-4 | 6 8 7      | 16 18 17   | 16 18 17   | 6 8 7      | 17 22 20   | 1 | 1
4.88e-4 | 7 9 7      | 18 20 19   | 18 20 19   | 7 9 7      | 18 24 21   | 1 | 1
2.44e-4 | 7 9 8      | 19 22 20   | 19 22 20   | 7 9 8      | 20 26 23   | 1 | 1
1.22e-4 | 8 10 9     | 21 24 22   | 21 24 22   | 8 10 9     | 22 28 25   | 1 | 1
6.10e-5 | 9 10 10    | 22 26 24   | 22 26 24   | 9 11 9     | 23 31 27   | 1 | 1
3.05e-5 | 9 11 10    | 24 28 25   | 24 28 25   | 10 11 10   | 26 32 29   | 1 | 1
1.53e-5 | 10 12 11   | 25 29 27   | 25 29 27   | 10 12 11   | 28 35 31   | 1 | 1
7.63e-6 | 11 13 12   | 27 31 29   | 27 31 29   | 11 13 12   | 29 36 33   | 1 | 1
3.81e-6 | 11 13 12   | 28 32 30   | 28 32 30   | 12 13 12   | 31 38 35   | 1 | 1
1.91e-6 | 12 14 13   | 30 34 32   | 30 34 32   | 12 14 13   | 33 41 37   | 1 | 1

Table S-1: Sphere function f_1, m = 1, L = 1, S = 32, n = 64. #ITS/n (GM, GMLS: #ITS).

acc     | RP min max mean | RG min max mean | FG min max mean | ARP min max mean | ES min max mean | GM | GMLS
6.25e-2 | 2 3 2            | 9 11 10            | 5 18 16          | 64 79 71      | 5 9 6            | 1    | 1
3.12e-2 | 2 3 3            | 11 13 12           | 17 21 18         | 70 86 77      | 6 10 8           | 1    | 1
1.56e-2 | 3 4 3            | 13 15 14           | 31 359 261       | 77 93 84      | 7 19 10          | 1    | 1
7.81e-3 | 4 132 53         | 15 20 17           | 340 423 380      | 84 101 91     | 18 465 268       | 1    | 1
3.91e-3 | 104 298 210      | 381 1103 677       | 404 484 444      | 92 125 107    | 481 928 723      | 124  | 2
1.95e-3 | 273 460 373      | 1863 2578 2150     | 464 544 504      | 101 140 129   | 936 1410 1177    | 470  | 2
9.77e-4 | 433 624 536      | 3340 4062 3624     | 521 601 562      | 129 153 143   | 1376 1869 1631   | 817  | 2
4.88e-4 | 598 787 700      | 4815 5536 5094     | 576 657 618      | 142 164 153   | 1826 2325 2085   | 1163 | 2
2.44e-4 | 761 954 862      | 6280 7018 6564     | 630 712 673      | 153 174 162   | 2264 2774 2538   | 1509 | 2
1.22e-4 | 921 1118 1024    | 7754 8488 8036     | 683 767 727      | 161 183 170   | 2732 3239 2994   | 1856 | 2
6.10e-5 | 1080 1284 1187   | 9232 9961 9508     | 736 820 781      | 168 191 178   | 3172 3690 3447   | 2202 | 2
3.05e-5 | 1243 1446 1350   | 10712 11439 10980  | 787 873 833      | 176 199 187   | 3635 4138 3905   | 2549 | 3
1.53e-5 | 1406 1607 1512   | 12179 12906 12453  | 839 925 885      | 189 214 201   | 4083 4593 4361   | 2895 | 3
7.63e-6 | 1570 1766 1675   | 13654 14388 13923  | 890 977 936      | 203 227 217   | 4537 5042 4819   | 3241 | 3
3.81e-6 | 1732 1928 1837   | 15130 15854 15395  | 940 1029 988     | 219 238 230   | 4989 5492 5273   | 3588 | 3
1.91e-6 | 1899 2096 2001   | 16601 17333 16868  | 990 1079 1038    | 233 250 242   | 5451 5954 5729   | 3934 | 3

Table S-2: Ellipsoid function f_2, m = 1, L = 1000, S = 3200, n = 64. #ITS/n (GM, GMLS: #ITS).

acc     | RP min max mean | RG min max mean | FG min max mean | ARP min max mean | ES min max mean | GM | GMLS
6.25e-2 | 0 0 0            | 0 0 0              | 0 0 0          | 0 0 0         | 0 0 0            | 1    | 1
3.12e-2 | 0 0 0            | 0 0 0              | 0 0 0          | 0 0 0         | 0 0 0            | 1    | 1
1.56e-2 | 0 0 0            | 0 0 0              | 0 0 0          | 0 0 0         | 0 0 0            | 1    | 1
7.81e-3 | 1 1 1            | 3 5 4              | 2 4 3          | 0 1 1         | 2 4 3            | 1    | 1
3.91e-3 | 3 4 3            | 18 23 21           | 8 13 10        | 3 19 8        | 6 10 9           | 5    | 3
1.95e-3 | 8 12 10          | 74 84 79           | 29 40 34       | 12 66 25      | 22 31 26         | 19   | 10
9.77e-4 | 26 37 31         | 256 283 269        | 55 84 70       | 22 75 41      | 73 102 86        | 64   | 32
4.88e-4 | 79 109 92        | 790 837 811        | 104 152 130    | 33 132 61     | 233 292 257      | 191  | 96
2.44e-4 | 200 264 228      | 1993 2065 2022     | 164 258 201    | 35 173 87     | 577 700 633      | 477  | 239
1.22e-4 | 405 501 453      | 3955 4053 4004     | 224 328 279    | 58 199 127    | 1142 1344 1249   | 945  | 473
6.10e-5 | 665 778 723      | 6349 6465 6412     | 293 382 348    | 74 255 151    | 1867 2101 1998   | 1512 | 756
3.05e-5 | 948 1060 1005    | 8849 8964 8917     | 369 427 402    | 88 391 213    | 2641 2891 2780   | 2101 | 1051
1.53e-5 | 1229 1345 1288   | 11375 11482 11435  | 397 613 442    | 96 449 265    | 3406 3692 3563   | 2694 | 1347
7.63e-6 | 1509 1619 1570   | 13886 14016 13958  | 450 894 677    | 129 533 342   | 4223 4461 4348   | 3288 | 1644
3.81e-6 | 1792 1902 1853   | 16401 16545 16482  | 463 935 887    | 188 632 400   | 5008 5254 5133   | 3881 | 1940
1.91e-6 | 2068 2191 2136   | 18922 19075 19004  | 892 970 942    | 192 678 473   | 5766 6050 5916   | 4474 | 2237

Table S-3: Nesterov smooth function f_3, L = 1000, S = 10833, n = 64. #ITS/n (GM, GMLS: #ITS).

acc     | RP min max mean | RG min max mean | FG min max mean | ARP min max mean | ES min max mean | GM | GMLS
6.25e-2 | 1 2 1          | 7 9 8            | 4 5 5         | 1 6 2       | 3 5 4            | 2    | 1
3.12e-2 | 3 5 4          | 25 31 28         | 10 16 13      | 5 19 11     | 8 13 11          | 7    | 4
1.56e-2 | 8 12 10        | 79 90 82         | 27 37 33      | 9 45 22     | 19 35 27         | 19   | 10
7.81e-3 | 19 31 24       | 199 219 204      | 46 70 59      | 21 53 32    | 52 76 65         | 48   | 24
3.91e-3 | 43 62 50       | 415 458 432      | 74 107 88     | 25 58 41    | 109 152 135      | 102  | 51
1.95e-3 | 82 106 90      | 772 824 791      | 104 140 118   | 39 88 53    | 214 272 247      | 187  | 94
9.77e-4 | 131 166 146    | 1248 1341 1284   | 138 177 157   | 49 96 65    | 352 447 399      | 303  | 152
4.88e-4 | 195 236 214    | 1841 1965 1900   | 169 216 191   | 52 109 73   | 533 638 587      | 449  | 225
2.44e-4 | 267 317 295    | 2540 2694 2621   | 195 259 228   | 64 114 82   | 740 875 810      | 619  | 310
1.22e-4 | 352 409 384    | 3326 3513 3420   | 244 293 266   | 68 127 95   | 976 1132 1059    | 807  | 404
6.10e-5 | 441 506 479    | 4174 4387 4277   | 282 335 306   | 89 132 107  | 1233 1403 1325   | 1009 | 505
3.05e-5 | 539 605 579    | 5057 5288 5168   | 321 376 342   | 96 160 119  | 1508 1681 1603   | 1219 | 610
1.53e-5 | 642 715 682    | 5969 6206 6079   | 351 402 376   | 108 168 130 | 1788 1974 1885   | 1434 | 717
7.63e-6 | 740 816 786    | 6893 7138 7001   | 383 430 409   | 117 177 138 | 2071 2269 2173   | 1650 | 826
3.81e-6 | 845 920 890    | 7816 8064 7926   | 423 453 435   | 127 184 148 | 2357 2568 2462   | 1868 | 935
1.91e-6 | 954 1023 995   | 8727 8995 8854   | 441 534 458   | 137 188 159 | 2651 2854 2751   | 2086 | 1044

Table S-4: Nesterov strongly convex function f_4, m = 1, L = 1000, S = 1000, n = 64. #ITS/n (GM, GMLS: #ITS).

acc     | RP min max mean | RG  | FG  | ARP min max mean | ES min max mean | GM | GMLS
6.25e-2 | 4 6 5      | - - - | - - - | 4 6 5      | 13 17 14   | - | 1
3.12e-2 | 7 9 7      | - - - | - - - | 7 8 7      | 19 25 22   | - | 1
1.56e-2 | 8 11 9     | - - - | - - - | 8 10 9     | 24 32 27   | - | 1
7.81e-3 | 10 13 11   | - - - | - - - | 10 12 11   | 29 37 32   | - | 1
3.91e-3 | 11 15 12   | - - - | - - - | 12 14 12   | 32 42 36   | - | 1
1.95e-3 | 13 16 14   | - - - | - - - | 13 15 14   | 35 46 40   | - | 1
9.77e-4 | 14 17 15   | - - - | - - - | 14 17 15   | 39 50 43   | - | 1
4.88e-4 | 16 19 17   | - - - | - - - | 15 18 17   | 43 54 47   | - | 1
2.44e-4 | 17 20 18   | - - - | - - - | 17 20 18   | 47 58 51   | - | 1
1.22e-4 | 18 22 19   | - - - | - - - | 18 22 19   | 51 62 55   | - | 1
6.10e-5 | 19 23 21   | - - - | - - - | 19 23 21   | 55 66 59   | - | 1
3.05e-5 | 21 24 22   | - - - | - - - | 21 25 22   | 59 70 63   | - | 1
1.53e-5 | 22 25 23   | - - - | - - - | 22 26 23   | 62 74 67   | - | 1
7.63e-6 | 23 27 25   | - - - | - - - | 23 28 25   | 67 77 70   | - | 1
3.81e-6 | 25 28 26   | - - - | - - - | 24 29 26   | 71 81 74   | - | 1
1.91e-6 | 26 30 28   | - - - | - - - | 26 30 28   | 73 85 78   | - | 1

Table S-5: Funnel function f_5, S = 32, n = 64. #ITS/n (GM, GMLS: #ITS).

D.2. Number of function evaluations for increasing accuracy for fixed dimension n = 64. Tables S-6 to S-10 summarize the number of function evaluations needed to achieve a corresponding relative accuracy (acc) for fixed dimension n = 64.

acc     | RP min max mean | RG min max mean | FG min max mean | ARP min max mean | ES min max mean
6.25e-2 | 10 14 12   | 11 15 14   | 12 15 13   | 9 13 11    | 6 9 8
3.12e-2 | 12 17 14   | 15 19 17   | 15 18 17   | 12 16 14   | 8 12 10
1.56e-2 | 15 20 17   | 18 23 20   | 18 21 20   | 15 19 17   | 10 14 12
7.81e-3 | 17 23 20   | 22 27 24   | 21 25 23   | 17 22 20   | 12 16 14
3.91e-3 | 20 27 22   | 25 30 27   | 24 28 26   | 20 25 22   | 14 18 16
1.95e-3 | 22 29 25   | 28 34 30   | 27 32 29   | 22 29 25   | 15 20 18
9.77e-4 | 25 33 28   | 32 37 34   | 29 35 33   | 24 32 27   | 17 22 20
4.88e-4 | 27 35 31   | 35 40 37   | 32 38 36   | 27 35 30   | 18 24 21
2.44e-4 | 30 38 33   | 38 44 41   | 35 42 39   | 29 38 33   | 20 26 23
1.22e-4 | 32 40 36   | 41 47 44   | 38 45 42   | 32 40 36   | 22 28 25
6.10e-5 | 35 42 39   | 44 52 47   | 41 48 45   | 36 43 38   | 23 31 27
3.05e-5 | 38 44 42   | 47 56 51   | 44 51 49   | 39 46 41   | 26 32 29
1.53e-5 | 41 48 44   | 51 58 54   | 47 55 52   | 41 49 44   | 28 35 31
7.63e-6 | 43 52 47   | 53 61 57   | 51 58 55   | 44 51 47   | 29 36 33
3.81e-6 | 45 54 50   | 57 65 60   | 54 63 58   | 47 53 49   | 31 38 35
1.91e-6 | 47 57 52   | 60 68 64   | 57 66 61   | 50 56 52   | 33 41 37

Table S-6: Sphere function f_1, m = 1, L = 1, S = 32, n = 64. #FES/n.

acc     | RP min max mean | RG min max mean | FG min max mean | ARP min max mean | ES min max mean
6.25e-2 | 12 21 16            | 18 22 20            | 10 36 33         | 637 809 733      | 5 9 6
3.12e-2 | 15 25 19            | 21 25 23            | 34 41 37         | 705 883 802      | 6 10 8
1.56e-2 | 20 32 25            | 25 30 28            | 62 717 522       | 777 966 872      | 7 19 10
7.81e-3 | 29 1671 662         | 30 39 34            | 680 847 760      | 863 1092 963     | 18 465 268
3.91e-3 | 1416 3925 2803      | 761 2206 1354       | 808 968 887      | 978 1418 1185    | 481 928 723
1.95e-3 | 3809 6240 5125      | 3725 5155 4300      | 928 1087 1008    | 1054 1613 1489   | 936 1410 1177
9.77e-4 | 6096 8577 7458      | 6681 8124 7248      | 1042 1203 1124   | 1526 1790 1688   | 1376 1869 1631
4.88e-4 | 8495 10961 9848     | 9629 11072 10189    | 1152 1315 1236   | 1702 1955 1835   | 1826 2325 2085
2.44e-4 | 10936 13455 12278   | 12560 14036 13129   | 1259 1425 1347   | 1844 2108 1966   | 2264 2774 2538
1.22e-4 | 13340 15896 14705   | 15507 16977 16072   | 1365 1533 1455   | 1960 2242 2088   | 2732 3239 2994
6.10e-5 | 15726 18390 17144   | 18463 19921 19015   | 1471 1641 1562   | 2068 2362 2211   | 3172 3690 3447
3.05e-5 | 18147 20800 19566   | 21423 22877 21960   | 1575 1747 1666   | 2188 2508 2353   | 3635 4138 3905
1.53e-5 | 20573 23183 21970   | 24357 25812 24906   | 1678 1851 1770   | 2389 2743 2552   | 4083 4593 4361
7.63e-6 | 22981 25521 24375   | 27308 28776 27846   | 1780 1954 1873   | 2627 2918 2789   | 4537 5042 4819
3.81e-6 | 25337 27890 26733   | 30259 31707 30790   | 1880 2057 1975   | 2857 3083 2993   | 4989 5492 5273
1.91e-6 | 27723 30272 29071   | 33202 34667 33736   | 1980 2159 2077   | 3035 3247 3159   | 5451 5954 5729

Table S-7: Ellipsoid function f_2, m = 1, L = 1000, S = 3200, n = 64. #FES/n.

acc     | RP min max mean | RG min max mean | FG min max mean | ARP min max mean | ES min max mean
6.25e-2 | 0 0 0               | 0 0 0               | 0 0 0            | 0 0 0            | 0 0 0
3.12e-2 | 0 0 0               | 0 0 0               | 0 0 0            | 0 0 0            | 0 0 0
1.56e-2 | 0 0 0               | 0 0 0               | 0 0 0            | 0 0 0            | 0 0 0
7.81e-3 | 5 8 7               | 7 10 8              | 5 7 6            | 4 12 7           | 2 4 3
3.91e-3 | 22 33 27            | 36 45 42            | 17 25 21         | 29 163 66        | 6 10 9
1.95e-3 | 73 123 96           | 147 169 158         | 58 80 67         | 102 604 221      | 22 31 26
9.77e-4 | 282 410 344         | 511 566 538         | 110 167 140      | 198 690 375      | 73 102 86
4.88e-4 | 930 1304 1094       | 1579 1675 1621      | 208 305 261      | 301 1264 584     | 233 292 257
2.44e-4 | 2440 3232 2788      | 3987 4130 4045      | 327 517 401      | 328 1716 866     | 577 700 633
1.22e-4 | 4996 6186 5588      | 7909 8106 8009      | 448 656 557      | 574 2132 1332    | 1142 1344 1249
6.10e-5 | 8225 9610 8942      | 12697 12930 12824   | 586 765 696      | 763 2864 1621    | 1867 2101 1998
3.05e-5 | 11739 13117 12445   | 17699 17928 17833   | 739 855 803      | 903 4485 2372    | 2641 2891 2780
1.53e-5 | 15220 16645 15948   | 22750 22963 22870   | 795 1225 885     | 1004 5201 3010   | 3406 3692 3563
7.63e-6 | 18678 20032 19423   | 27772 28033 27917   | 900 1788 1355    | 1410 6242 3979   | 4223 4461 4348
3.81e-6 | 22154 23511 22902   | 32801 33091 32963   | 926 1870 1775    | 2145 7381 4691   | 5008 5254 5133
1.91e-6 | 25520 27034 26351   | 37844 38150 38008   | 1785 1939 1885   | 2199 8149 5609   | 5766 6050 5916

Table S-8: Nesterov smooth function f_3, L = 1000, S = 10833, n = 64. #FES/n.

acc     | RP min max mean | RG min max mean | FG min max mean | ARP min max mean | ES min max mean
6.25e-2 | 8 15 12             | 13 18 15            | 8 11 9           | 8 52 15          | 3 5 4
3.12e-2 | 29 45 35            | 50 61 56            | 21 32 26         | 43 173 93        | 8 13 11
1.56e-2 | 74 120 97           | 157 179 164         | 54 75 66         | 83 421 193       | 19 35 27
7.81e-3 | 198 342 255         | 397 438 408         | 93 139 119       | 184 497 290      | 52 76 65
3.91e-3 | 490 718 570         | 831 915 864         | 148 214 175      | 223 555 391      | 109 152 135
1.95e-3 | 969 1258 1071       | 1543 1648 1583      | 208 281 237      | 373 896 516      | 214 272 247
9.77e-4 | 1578 2012 1757      | 2495 2682 2568      | 276 354 314      | 477 989 662      | 352 447 399
4.88e-4 | 2374 2877 2605      | 3681 3929 3800      | 338 433 383      | 530 1144 763     | 533 638 587
2.44e-4 | 3260 3883 3607      | 5079 5387 5242      | 389 519 457      | 675 1209 876     | 740 875 810
1.22e-4 | 4315 5011 4710      | 6653 7026 6840      | 487 587 532      | 727 1370 1037    | 976 1132 1059
6.10e-5 | 5417 6220 5882      | 8347 8774 8555      | 563 671 612      | 978 1437 1179    | 1233 1403 1325
3.05e-5 | 6621 7435 7116      | 10113 10577 10337   | 642 752 685      | 1074 1782 1323   | 1508 1681 1603
1.53e-5 | 7872 8772 8371      | 11937 12411 12159   | 702 804 753      | 1212 1882 1468   | 1788 1974 1885
7.63e-6 | 9069 10005 9628     | 13786 14275 14002   | 766 861 818      | 1286 1986 1559   | 2071 2269 2173
3.81e-6 | 10331 11264 10885   | 15633 16129 15852   | 845 906 870      | 1448 2078 1687   | 2357 2568 2462
1.91e-6 | 11629 12482 12122   | 17455 17990 17708   | 883 1069 916     | 1557 2134 1825   | 2651 2854 2751

Table S-9: Nesterov strongly convex function f_4, m = 1, L = 1000, S = 1000, n = 64. #FES/n.

acc     | RP min max mean | RG  | FG  | ARP min max mean | ES min max mean
6.25e-2 | 37 58 45      | - - - | - - - | 40 51 46      | 13 17 14
3.12e-2 | 63 86 71      | - - - | - - - | 62 76 70      | 19 25 22
1.56e-2 | 83 110 92     | - - - | - - - | 84 102 92     | 24 32 27
7.81e-3 | 103 129 112   | - - - | - - - | 103 123 112   | 29 37 32
3.91e-3 | 118 150 132   | - - - | - - - | 123 144 132   | 32 42 36
1.95e-3 | 140 168 151   | - - - | - - - | 140 167 151   | 35 46 40
9.77e-4 | 159 187 171   | - - - | - - - | 159 191 171   | 39 50 43
4.88e-4 | 178 207 190   | - - - | - - - | 177 212 192   | 43 54 47
2.44e-4 | 196 230 210   | - - - | - - - | 197 231 211   | 47 58 51
1.22e-4 | 215 252 232   | - - - | - - - | 218 265 234   | 51 62 55
6.10e-5 | 236 270 252   | - - - | - - - | 239 289 255   | 55 66 59
3.05e-5 | 257 293 273   | - - - | - - - | 257 313 275   | 59 70 63
1.53e-5 | 279 316 294   | - - - | - - - | 277 330 295   | 62 74 67
7.63e-6 | 297 340 316   | - - - | - - - | 295 355 317   | 67 77 70
3.81e-6 | 320 365 339   | - - - | - - - | 318 378 338   | 71 81 74
1.91e-6 | 338 384 360   | - - - | - - - | 342 399 361   | 73 85 78

Table S-10: Funnel function f_5, S = 32, n = 64. #FES/n.

D.3. Different line search parameters for fixed dimension n = 64. Tables S-11 and S-12 summarize the number of iterations and function evaluations needed by RP_µ on the ellipsoid function f_2 to achieve a relative accuracy of 1.91e-6 for different parameters µ that were passed to the Matlab line search used (cf. Figure S-2).

acc     | 1e-1  1e-2  1e-3  1e-4  1e-5  1e-6  1e-7  1e-8  1e-9  1e-10
6.25e-2 | 2     2     2     2     2     2     2     2     2     2
3.12e-2 | 3     3     3     3     3     3     3     3     3     3
1.56e-2 | 3     3     3     3     3     3     3     3     3     3
7.81e-3 | 57    50    56    65    53    53    53    53    73    73
3.91e-3 | 201   194   219   225   210   210   210   210   232   232
1.95e-3 | 360   353   382   389   373   373   373   373   395   395
9.77e-4 | 523   516   546   552   536   536   536   536   558   558
4.88e-4 | 685   678   709   715   700   700   700   700   721   721
2.44e-4 | 849   842   871   878   862   862   862   862   883   883
1.22e-4 | 1011  1004  1035  1041  1024  1024  1024  1024  1045  1045
6.10e-5 | 1173  1166  1198  1203  1187  1187  1187  1187  1208  1208
3.05e-5 | 1335  1330  1360  1366  1350  1350  1350  1350  1370  1370
1.53e-5 | 1497  1494  1523  1528  1512  1512  1512  1512  1533  1533
7.63e-6 | 1661  1657  1686  1691  1675  1675  1675  1675  1696  1696
3.81e-6 | 1824  1819  1849  1853  1837  1837  1837  1837  1858  1858
1.91e-6 | 1986  1982  2013  2017  2001  2001  2001  2001  2020  2020

Table S-11: Different line search parameters µ for RP_µ on the ellipsoid function f_2, m = 1, L = 1000, S = 3200, n = 64; mean of 25 runs of #ITS/n to reach a relative accuracy of 1.91e-6.

acc     | 1e-1   1e-2   1e-3   1e-4   1e-5   1e-6   1e-7   1e-8   1e-9   1e-10
6.25e-2 | 15     16     16     17     16     16     16     16     16     16
3.12e-2 | 18     20     20     21     19     19     19     19     20     20
1.56e-2 | 24     25     25     26     25     25     25     25     26     26
7.81e-3 | 542    482    657    815    662    662    662    662    969    970
3.91e-3 | 1973   1908   2664   2983   2803   2805   2805   2806   3378   3390
1.95e-3 | 3565   3495   4651   5273   5125   5131   5131   5131   6140   6173
9.77e-4 | 5187   5120   6544   7558   7458   7469   7469   7470   9078   9126
4.88e-4 | 6813   6750   8319   9867   9848   9866   9867   9869   12224  12277
2.44e-4 | 8447   8380   10005  12191  12278  12307  12310  12311  15528  15582
1.22e-4 | 10069  10008  11653  14484  14705  14750  14754  14756  18872  18926
6.10e-5 | 11687  11627  13285  16693  17144  17212  17218  17220  22238  22293
3.05e-5 | 13309  13264  14908  18817  19566  19664  19674  19676  25576  25633
1.53e-5 | 14935  14902  16537  20820  21970  22114  22128  22131  28935  28996
7.63e-6 | 16572  16531  18167  22698  24375  24582  24602  24606  32308  32373
3.81e-6 | 18198  18155  19798  24449  26733  27030  27059  27064  35655  35726
1.91e-6 | 19824  19781  21435  26120  29071  29495  29537  29542  38993  39070

Table S-12: Different line search parameters µ for RP_µ on the ellipsoid function f_2, m = 1, L = 1000, S = 3200, n = 64; mean of 25 runs of #FES/n to reach a relative accuracy of 1.91e-6.