Stochastic First- and Zeroth-order Methods for Nonconvex Stochastic Programming
Authors: Saeed Ghadimi, Guanghui Lan
STOCHASTIC FIRST- AND ZEROTH-ORDER METHODS FOR NONCONVEX STOCHASTIC PROGRAMMING∗

SAEED GHADIMI† AND GUANGHUI LAN‡

∗ The first author was partially supported by NSF grant CMMI-1000347 and the second author was partially supported by NSF grant CMMI-1000347, ONR grant N00014-13-1-0036 and NSF CAREER Award CMMI-1254446.
† Department of Industrial and Systems Engineering, University of Florida, Gainesville, FL 32611 (email: sghadimi@ufl.edu).
‡ Department of Industrial and Systems Engineering, University of Florida, Gainesville, FL 32611 (email: glan@ise.ufl.edu).

Abstract. In this paper, we introduce a new stochastic approximation (SA) type algorithm, namely the randomized stochastic gradient (RSG) method, for solving an important class of nonlinear (possibly nonconvex) stochastic programming (SP) problems. We establish the complexity of this method for computing an approximate stationary point of a nonlinear programming problem. We also show that this method possesses a nearly optimal rate of convergence if the problem is convex. We discuss a variant of the algorithm which consists of applying a post-optimization phase to evaluate a short list of solutions generated by several independent runs of the RSG method, and show that such a modification allows us to improve significantly the large-deviation properties of the algorithm. These methods are then specialized for solving a class of simulation-based optimization problems in which only stochastic zeroth-order information is available.

Keywords: stochastic approximation, nonconvex optimization, stochastic programming, simulation-based optimization

1. Introduction. In 1951, Robbins and Monro in their seminal work [33] proposed a classical stochastic approximation (SA) algorithm for solving stochastic programming (SP) problems. This approach mimics the simplest gradient descent method by using noisy gradient information in place of the exact gradients, and possesses the "asymptotically optimal" rate of convergence for solving a class of strongly convex SP problems [4, 37]. However, it is usually difficult to implement the "asymptotically optimal" stepsize policy, especially in the beginning, so that the algorithms often perform poorly in practice (e.g., [39, Section 4.5.3]). An important improvement of the classical SA was developed by Polyak [31] and Polyak and Juditsky [32], where longer stepsizes were suggested together with the averaging of the obtained iterates. Their methods were shown to be more robust with respect to the selection of stepsizes than the classical SA and also exhibit the "asymptotically optimal" rate of convergence for solving strongly convex SP problems. We refer to [23] for an account of the earlier history of SA methods.

The last few years have seen significant progress in the development of SA methods for SP. On one hand, new SA type methods are being introduced to solve SP problems which are not necessarily strongly convex. On the other hand, these developments, motivated by complexity theory in convex optimization [24], concerned the convergence properties of SA methods during a finite number of iterations. For example, Nemirovski et al. [23] presented a properly modified SA approach, namely, mirror descent SA, for solving general non-smooth convex SP problems. They demonstrated that the mirror descent SA exhibits an optimal $O(1/\epsilon^2)$ iteration complexity for solving these problems. This method has been shown in [19, 23] to be competitive with the widely-accepted sample average approximation approach (see, e.g., [17, 38]) and even to significantly outperform it for solving a class of convex SP problems. Similar techniques, based on subgradient averaging, have been proposed in [14, 16, 27].
While these techniques dealt with non-smooth convex programming problems, Lan [18] presented a unified optimal method for smooth, non-smooth and stochastic optimization, which explicitly takes into account the smoothness of the objective function (see also [11, 10] for discussions about strong convexity). However, note that convexity has played an important role in establishing the convergence of all these SA algorithms. To the best of our knowledge, none of the existing SA algorithms can handle more general SP problems whose objective function is possibly nonconvex.

This paper focuses on the theoretical development of SA type methods for solving an important class of nonconvex SP problems. More specifically, we study the classical unconstrained nonlinear programming (NLP) problem given in the form of (e.g., [26, 30])
\[
f^* := \inf_{x \in \mathbb{R}^n} f(x), \tag{1.1}
\]
where f : R^n → R is differentiable (not necessarily convex) and bounded from below, and its gradient ∇f(·) satisfies
\[
\|\nabla f(y) - \nabla f(x)\| \le L \|y - x\|, \quad \forall\, x, y \in \mathbb{R}^n.
\]
However, different from the standard NLP, we assume throughout the paper that we only have access to noisy function values or gradients of the objective function f in (1.1). In particular, in the basic setting, we assume that problem (1.1) is to be solved by iterative algorithms which acquire the gradients of f via subsequent calls to a stochastic first-order oracle (SFO). At iteration k of the algorithm, x_k being the input, the SFO outputs a stochastic gradient G(x_k, ξ_k), where ξ_k, k ≥ 1, are random variables whose distributions P_k are supported on Ξ_k ⊆ R^d. The following assumptions are made for the Borel functions G(x_k, ξ_k).

A1: For any k ≥ 1, we have
\[
\text{a)}\quad \mathbb{E}[G(x_k, \xi_k)] = \nabla f(x_k), \tag{1.2}
\]
\[
\text{b)}\quad \mathbb{E}\big[\|G(x_k, \xi_k) - \nabla f(x_k)\|^2\big] \le \sigma^2, \tag{1.3}
\]
for some parameter σ ≥ 0.

Observe that, by (1.2), G(x_k, ξ_k) is an unbiased estimator of ∇f(x_k) and, by (1.3), the variance of the random variable ‖G(x_k, ξ_k) − ∇f(x_k)‖ is bounded. It is worth noting that in the standard setting for SP, the random vectors ξ_k, k = 1, 2, ..., are independent of each other (and also of x_k) (see, e.g., [24, 23]). Our assumption here is slightly weaker since we do not need to assume ξ_k, k = 1, 2, ..., to be independent.

Our study on the aforementioned SP problems has been motivated by a few interesting applications which are briefly outlined as follows.
• In many machine learning problems, we intend to minimize a regularized loss function f(·) given by
\[
f(x) = \int_\Xi L(x,\xi)\, dP(\xi) + r(x), \tag{1.4}
\]
where either the loss function L(x, ξ) or the regularization r(x) is nonconvex (see, e.g., [21, 22]).
• Another important class of problems originates from the so-called endogenous uncertainty in SP.
More specifically, the objective functions for these SP problems are given in the form of
\[
f(x) = \int_{\Xi(x)} F(x,\xi)\, dP_x(\xi), \tag{1.5}
\]
where the support Ξ(x) and the distribution function P_x of the random vector ξ depend on x. The function f in (1.5) is usually nonconvex even if F(x, ξ) is convex with respect to x. For example, if the support Ξ does not depend on x, it is often possible to represent dP_x = H(x) dP for some fixed distribution P. Typically this transformation results in a nonconvex integrand function. Other techniques have also been developed to compute unbiased estimators for the gradient of f(·) in (1.5) (see, e.g., [8, 13, 20, 35]).
• Finally, in simulation-based optimization, the objective function is given by f(x) = E_ξ[F(x, ξ)], where F(·, ξ) is not given explicitly, but through a black-box simulation procedure (e.g., [1, 7]). Therefore, we do not know whether the function f is convex or not. Moreover, in these cases, we usually only have access to stochastic zeroth-order information about the function values of f(·) rather than its gradients.

The complexity of the gradient descent method for solving problem (1.1) has been well understood under the deterministic setting (i.e., σ = 0 in (1.3)). In particular, Nesterov [26] shows that after running the method for at most N = O(1/ǫ) steps, we have min_{k=1,...,N} ‖∇f(x_k)‖² ≤ ǫ (see Gratton et al. [36] for a similar bound for trust-region methods). Cartis et al. [2] show that this bound is actually tight for the gradient descent method. Note, however, that the analysis in [26] is not applicable to the stochastic setting (i.e., σ > 0 in (1.3)). Moreover, even if we have min_{k=1,...,N} ‖∇f(x_k)‖² ≤ ǫ, finding the best solution from {x_1, ..., x_N} is still difficult since ‖∇f(x_k)‖ is not known exactly.

Our major contributions in this paper are summarized as follows. Firstly, to solve the aforementioned nonconvex SP problem, we present a randomized stochastic gradient (RSG) method by introducing the following modification to the classical SA. Instead of taking the average of the iterates as in the mirror descent SA for convex SP, we randomly select a solution x̄ from {x_1, ..., x_N} according to a certain probability distribution as the output. We show that such a solution satisfies E[‖∇f(x̄)‖²] ≤ ǫ after running the method for at most N = O(1/ǫ²) iterations¹. Moreover, if f(·) is convex, we show that the relation E[f(x̄) − f^*] ≤ ǫ always holds. We demonstrate that such a complexity result is nearly optimal for solving convex SP problems (see the discussions after Corollary 2.2).

Secondly, in order to improve the large-deviation properties and hence the reliability of the RSG method, we present a two-phase randomized stochastic gradient (2-RSG) method by introducing a post-optimization phase to evaluate a short list of solutions generated by several independent runs of the RSG method. We show that the complexity of the 2-RSG method for computing an (ǫ, Λ)-solution of problem (1.1), i.e., a point x̄ such that Prob{‖∇f(x̄)‖² ≤ ǫ} ≥ 1 − Λ for some ǫ > 0 and Λ ∈ (0, 1), can be bounded by
\[
\mathcal{O}\left\{\frac{\log(1/\Lambda)\,\sigma^2}{\epsilon}\left[\frac{1}{\epsilon} + \frac{\log(1/\Lambda)}{\Lambda}\right]\right\}.
\]
¹ It should not be too surprising to see that the complexity for the stochastic case is much worse than that for the deterministic case. For example, in the convex case, it is known [26, 18] that the complexity for finding a solution x̄ satisfying f(x̄) − f^* ≤ ǫ will be substantially increased from O(1/√ǫ) to O(1/ǫ²) as one moves from the deterministic to the stochastic setting.

We further show that, under a certain light-tail assumption about the SFO, the above complexity bound can be reduced to
\[
\mathcal{O}\left\{\frac{\log(1/\Lambda)\,\sigma^2}{\epsilon}\left[\frac{1}{\epsilon} + \log\frac{1}{\Lambda}\right]\right\}.
\]

Thirdly, we specialize the RSG method for the case where only stochastic zeroth-order information is available. There exists a somewhat long history for the development of zeroth-order (or derivative-free) methods in nonlinear programming (see the monograph by Conn et al. [5] and references therein). However, only a few complexity results are available for these types of methods, mostly for convex programming (e.g., [24, 28]) and deterministic nonconvex programming problems (e.g., [3, 9, 28, 40]). The stochastic zeroth-order methods studied in this paper are directly motivated by a recent important work due to Nesterov [28]. More specifically, Nesterov proved in [28] some tight bounds for approximating first-order information by zeroth-order information using the Gaussian smoothing technique (see Theorem 3.1). Based on this technique, he presented a series of new complexity results for zeroth-order methods. For example, he established the O(n/ǫ) complexity, in terms of E[f(x̄) − f^*] ≤ ǫ, for a zeroth-order method applied to smooth convex programming problems (see p. 19 of [28]) along with some possible acceleration schemes. Here the expectation is taken with respect to the Gaussian random variables used in the algorithms. He also proved the O(n/ǫ) complexity, in terms of E[‖∇f(x̄)‖²] ≤ ǫ, for solving smooth nonconvex problems (see p. 24 of [28]). While these bounds were obtained for solving deterministic optimization problems, Nesterov established the O(n²/ǫ²) complexity, in terms of E[f(x̄) − f^*] ≤ ǫ, for solving general nonsmooth convex SP problems (see p. 17 of [28]). By incorporating the Gaussian smoothing technique [28] into the RSG method, we present a randomized stochastic gradient free (RSGF) method for solving a class of simulation-based optimization problems and demonstrate that its iteration complexity for finding the aforementioned ǫ-solution (i.e., E[‖∇f(x̄)‖²] ≤ ǫ) can be bounded by O(n/ǫ²). To the best of our knowledge, this appears to be the first complexity result for nonconvex stochastic zeroth-order methods in the literature. Moreover, the same RSGF algorithm possesses an O(n/ǫ²) complexity bound, in terms of E[f(x̄) − f^*] ≤ ǫ, for solving smooth convex SP problems. It is interesting to observe that this bound has a much weaker dependence on n than the one previously established by Nesterov for solving general nonsmooth convex SP problems (see p. 17 of [28]). Such an improvement is obtained by explicitly making use of the smoothness properties of the objective function and carefully choosing the stepsizes and smoothing parameter used in the RSGF method.

This paper is organized as follows.
We introduce two stochastic first-order methods, i.e., the RSG and 2-RSG methods, for nonconvex SP, and establish their convergence properties in Section 2. We then specialize these methods for solving a class of simulation-based optimization problems in Section 3. Some brief concluding remarks are presented in Section 4.

1.1. Notation and terminology. As stated in [26], we say that f ∈ C^{1,1}_L(R^n) if it is differentiable and
\[
\|\nabla f(y) - \nabla f(x)\| \le L\|y - x\|, \quad \forall\, x, y \in \mathbb{R}^n.
\]
Clearly, we have
\[
|f(y) - f(x) - \langle \nabla f(x), y - x\rangle| \le \frac{L}{2}\|y - x\|^2, \quad \forall\, x, y \in \mathbb{R}^n. \tag{1.6}
\]
If, in addition, f(·) is convex, then
\[
f(y) - f(x) - \langle \nabla f(x), y - x\rangle \ge \frac{1}{2L}\|\nabla f(y) - \nabla f(x)\|^2, \tag{1.7}
\]
and
\[
\langle \nabla f(y) - \nabla f(x), y - x\rangle \ge \frac{1}{L}\|\nabla f(y) - \nabla f(x)\|^2, \quad \forall\, x, y \in \mathbb{R}^n. \tag{1.8}
\]

2. Stochastic first-order methods. Our goal in this section is to present and analyze a new class of SA algorithms for solving general smooth nonlinear (possibly nonconvex) SP problems. More specifically, we present the RSG method and establish its convergence properties in Subsection 2.1, and then introduce the 2-RSG method, which can significantly improve the large-deviation properties of the RSG method, in Subsection 2.2. We assume throughout this section that Assumption A1 holds. In some cases, Assumption A1 is augmented by the following "light-tail" assumption.

A2: For any x ∈ R^n and k ≥ 1, we have
\[
\mathbb{E}\big[\exp\{\|G(x,\xi_k) - \nabla f(x)\|^2/\sigma^2\}\big] \le \exp\{1\}. \tag{2.1}
\]
It can be easily seen that Assumption A2 implies Assumption A1.b) by Jensen's inequality.

2.1. The randomized stochastic gradient method. The convergence of existing SA methods requires f(·) to be convex [23, 19, 18, 11, 10]. Moreover, in order to guarantee the convexity of f(·), one often needs to assume that the random variables ξ_k, k ≥ 1, are independent of the search sequence {x_k}. Below we present a new SA-type algorithm that can deal with both convex and nonconvex SP problems, and allows the random noises to be dependent on the search sequence. This algorithm is obtained by incorporating a certain randomization scheme into the classical SA method.

A randomized stochastic gradient (RSG) method
Input: Initial point x_1, iteration limit N, stepsizes {γ_k}_{k≥1} and probability mass function P_R(·) supported on {1, ..., N}.
Step 0. Let R be a random variable with probability mass function P_R.
Step k = 1, ..., R. Call the stochastic first-order oracle for computing G(x_k, ξ_k) and set
\[
x_{k+1} = x_k - \gamma_k G(x_k, \xi_k). \tag{2.2}
\]
Output x_R.
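To make the procedure above concrete, the following sketch (Python, with hypothetical names; it is only an illustration, not the authors' implementation) performs one run of the RSG method for a user-supplied stochastic first-order oracle sfo(x) that returns an unbiased gradient estimate G(x, ξ). The stepsizes and the probability mass function P_R are taken as inputs, exactly as in the algorithm statement; concrete choices are discussed in Theorem 2.1 and Corollary 2.2 below.

import numpy as np

def rsg(x1, N, gammas, P_R, sfo, rng=None):
    """One run of the randomized stochastic gradient (RSG) method (a sketch).

    x1     : initial point (numpy array)
    N      : iteration limit
    gammas : stepsizes gamma_1, ..., gamma_N
    P_R    : probability mass function over {1, ..., N}
    sfo    : stochastic first-order oracle, sfo(x) ~ G(x, xi)
    """
    rng = rng or np.random.default_rng()
    # Step 0: draw the random iteration count R according to P_R.
    R = rng.choice(np.arange(1, N + 1), p=P_R)
    x = np.array(x1, dtype=float)
    # Steps k = 1, ..., R: stochastic gradient updates (2.2).
    for k in range(1, R + 1):
        G = sfo(x)
        x = x - gammas[k - 1] * G
    return x  # x_R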
A few remarks about the above RSG method are in order. Firstly, in comparison with the classical SA, we have used a random iteration count, R, to terminate the execution of the RSG algorithm. Equivalently, one can view such a randomization scheme from a slightly different perspective described as follows. Instead of terminating the algorithm at the R-th step, one can also run the RSG algorithm for N iterations but randomly choose a search point x_R (according to P_R) from its trajectory as the output of the algorithm. Clearly, using the latter scheme, we just need to run the algorithm for the first R iterations and the remaining N − R iterations are surpluses. Note, however, that the primary goal of introducing the random iteration count R is to derive new complexity results for nonconvex SP, rather than to save the computational effort in the last N − R iterations of the algorithm. Indeed, if R is uniformly distributed, the computational gain from such a randomization scheme is simply a factor of 2. Secondly, the RSG algorithm described above is conceptual only, because we have not yet specified the selection of the stepsizes {γ_k} and the probability mass function P_R. We will address this issue after establishing some basic convergence properties of the RSG method.

The following result describes some convergence properties of the RSG method.

Theorem 2.1. Suppose that the stepsizes {γ_k} and the probability mass function P_R(·) in the RSG method are chosen such that γ_k < 2/L and
\[
P_R(k) := \mathrm{Prob}\{R = k\} = \frac{2\gamma_k - L\gamma_k^2}{\sum_{k=1}^N (2\gamma_k - L\gamma_k^2)}, \quad k = 1, \ldots, N. \tag{2.3}
\]
Then, under Assumption A1,
a) for any N ≥ 1, we have
\[
\frac{1}{L}\,\mathbb{E}\big[\|\nabla f(x_R)\|^2\big] \le \frac{D_f^2 + \sigma^2 \sum_{k=1}^N \gamma_k^2}{\sum_{k=1}^N (2\gamma_k - L\gamma_k^2)}, \tag{2.4}
\]
where the expectation is taken with respect to R and ξ_{[N]} := (ξ_1, ..., ξ_N),
\[
D_f := \left[\frac{2\big(f(x_1) - f^*\big)}{L}\right]^{1/2}, \tag{2.5}
\]
and f^* denotes the optimal value of problem (1.1);
b) if, in addition, problem (1.1) is convex with an optimal solution x^*, then, for any N ≥ 1,
\[
\mathbb{E}\big[f(x_R) - f^*\big] \le \frac{D_X^2 + \sigma^2 \sum_{k=1}^N \gamma_k^2}{\sum_{k=1}^N (2\gamma_k - L\gamma_k^2)}, \tag{2.6}
\]
where the expectation is taken with respect to R and ξ_{[N]}, and
\[
D_X := \|x_1 - x^*\|. \tag{2.7}
\]

Proof. Denote δ_k ≡ G(x_k, ξ_k) − ∇f(x_k), k ≥ 1. We first show part a). Using the assumption that f ∈ C^{1,1}_L(R^n), (1.6) and (2.2), we have, for any k = 1, ..., N,
\[
\begin{aligned}
f(x_{k+1}) &\le f(x_k) + \langle \nabla f(x_k), x_{k+1} - x_k\rangle + \frac{L}{2}\gamma_k^2 \|G(x_k,\xi_k)\|^2 \\
&= f(x_k) - \gamma_k \langle \nabla f(x_k), G(x_k,\xi_k)\rangle + \frac{L}{2}\gamma_k^2 \|G(x_k,\xi_k)\|^2 \\
&= f(x_k) - \gamma_k \|\nabla f(x_k)\|^2 - \gamma_k \langle \nabla f(x_k), \delta_k\rangle
 + \frac{L}{2}\gamma_k^2 \big[\|\nabla f(x_k)\|^2 + 2\langle \nabla f(x_k), \delta_k\rangle + \|\delta_k\|^2\big] \\
&= f(x_k) - \Big(\gamma_k - \frac{L}{2}\gamma_k^2\Big)\|\nabla f(x_k)\|^2 - \big(\gamma_k - L\gamma_k^2\big)\langle \nabla f(x_k), \delta_k\rangle + \frac{L}{2}\gamma_k^2 \|\delta_k\|^2.
\end{aligned} \tag{2.8}
\]
Summing up the above inequalities and re-arranging the terms, we obtain
\[
\begin{aligned}
\sum_{k=1}^N \Big(\gamma_k - \frac{L}{2}\gamma_k^2\Big)\|\nabla f(x_k)\|^2
&\le f(x_1) - f(x_{N+1}) - \sum_{k=1}^N \big(\gamma_k - L\gamma_k^2\big)\langle \nabla f(x_k), \delta_k\rangle + \frac{L}{2}\sum_{k=1}^N \gamma_k^2 \|\delta_k\|^2 \\
&\le f(x_1) - f^* - \sum_{k=1}^N \big(\gamma_k - L\gamma_k^2\big)\langle \nabla f(x_k), \delta_k\rangle + \frac{L}{2}\sum_{k=1}^N \gamma_k^2 \|\delta_k\|^2,
\end{aligned} \tag{2.9}
\]
where the last inequality follows from the fact that f(x_{N+1}) ≥ f^*. Note that the search point x_k is a function of the history ξ_{[k−1]} of the generated random process and hence is random. Taking expectations (with respect to ξ_{[N]}) on both sides of (2.9) and noting that, under Assumption A1, E[‖δ_k‖²] ≤ σ² and
\[
\mathbb{E}\big[\langle \nabla f(x_k), \delta_k\rangle \,\big|\, \xi_{[k-1]}\big] = 0, \tag{2.10}
\]
we obtain
\[
\sum_{k=1}^N \Big(\gamma_k - \frac{L}{2}\gamma_k^2\Big)\,\mathbb{E}_{\xi_{[N]}}\big[\|\nabla f(x_k)\|^2\big] \le f(x_1) - f^* + \frac{L\sigma^2}{2}\sum_{k=1}^N \gamma_k^2. \tag{2.11}
\]
Dividing both sides of the above inequality by $L\sum_{k=1}^N (\gamma_k - L\gamma_k^2/2)$ and noting that
\[
\mathbb{E}\big[\|\nabla f(x_R)\|^2\big] = \mathbb{E}_{R,\xi_{[N]}}\big[\|\nabla f(x_R)\|^2\big]
 = \frac{\sum_{k=1}^N (2\gamma_k - L\gamma_k^2)\,\mathbb{E}_{\xi_{[N]}}[\|\nabla f(x_k)\|^2]}{\sum_{k=1}^N (2\gamma_k - L\gamma_k^2)},
\]
we conclude
\[
\frac{1}{L}\,\mathbb{E}\big[\|\nabla f(x_R)\|^2\big] \le \frac{1}{\sum_{k=1}^N (2\gamma_k - L\gamma_k^2)}\left[\frac{2\big(f(x_1) - f^*\big)}{L} + \sigma^2 \sum_{k=1}^N \gamma_k^2\right],
\]
which, in view of (2.5), clearly implies (2.4).

We now show that part b) holds. Denote ω_k ≡ ‖x_k − x^*‖.
First observe that, for any k = 1, ..., N,
\[
\begin{aligned}
\omega_{k+1}^2 &= \|x_k - \gamma_k G(x_k,\xi_k) - x^*\|^2 = \omega_k^2 - 2\gamma_k \langle G(x_k,\xi_k), x_k - x^*\rangle + \gamma_k^2 \|G(x_k,\xi_k)\|^2 \\
&= \omega_k^2 - 2\gamma_k \langle \nabla f(x_k) + \delta_k, x_k - x^*\rangle
 + \gamma_k^2 \big[\|\nabla f(x_k)\|^2 + 2\langle \nabla f(x_k), \delta_k\rangle + \|\delta_k\|^2\big].
\end{aligned}
\]
Moreover, in view of (1.8) and the fact that ∇f(x^*) = 0, we have
\[
\frac{1}{L}\|\nabla f(x_k)\|^2 \le \langle \nabla f(x_k), x_k - x^*\rangle. \tag{2.12}
\]
Combining the above two relations, we obtain, for any k = 1, ..., N,
\[
\begin{aligned}
\omega_{k+1}^2 &\le \omega_k^2 - (2\gamma_k - L\gamma_k^2)\langle \nabla f(x_k), x_k - x^*\rangle - 2\gamma_k \langle x_k - \gamma_k \nabla f(x_k) - x^*, \delta_k\rangle + \gamma_k^2 \|\delta_k\|^2 \\
&\le \omega_k^2 - (2\gamma_k - L\gamma_k^2)\big[f(x_k) - f^*\big] - 2\gamma_k \langle x_k - \gamma_k \nabla f(x_k) - x^*, \delta_k\rangle + \gamma_k^2 \|\delta_k\|^2,
\end{aligned}
\]
where the last inequality follows from the convexity of f(·) and the fact that γ_k ≤ 2/L. Summing up the above inequalities and re-arranging the terms, we have
\[
\begin{aligned}
\sum_{k=1}^N (2\gamma_k - L\gamma_k^2)\big[f(x_k) - f^*\big]
&\le \omega_1^2 - \omega_{N+1}^2 - 2\sum_{k=1}^N \gamma_k \langle x_k - \gamma_k \nabla f(x_k) - x^*, \delta_k\rangle + \sum_{k=1}^N \gamma_k^2 \|\delta_k\|^2 \\
&\le D_X^2 - 2\sum_{k=1}^N \gamma_k \langle x_k - \gamma_k \nabla f(x_k) - x^*, \delta_k\rangle + \sum_{k=1}^N \gamma_k^2 \|\delta_k\|^2,
\end{aligned}
\]
where the last inequality follows from (2.7) and the fact that ω_{N+1} ≥ 0. The rest of the proof is similar to that of part a) and hence the details are skipped.

We now describe a possible strategy for the selection of the stepsizes {γ_k} in the RSG method. For the sake of simplicity, let us assume that a constant stepsize policy is used, i.e., γ_k = γ, k = 1, ..., N, for some γ ∈ (0, 2/L). Note that the assumption of constant stepsizes does not hurt the efficiency estimate of the RSG method. The following corollary of Theorem 2.1 is obtained by appropriately choosing the parameter γ.

Corollary 2.2. Suppose that the stepsizes {γ_k} are set to
\[
\gamma_k = \min\left\{\frac{1}{L}, \frac{\tilde D}{\sigma\sqrt{N}}\right\}, \quad k = 1, \ldots, N, \tag{2.13}
\]
for some D̃ > 0. Also assume that the probability mass function P_R(·) is set to (2.3). Then, under Assumption A1, we have
\[
\frac{1}{L}\,\mathbb{E}\big[\|\nabla f(x_R)\|^2\big] \le B_N := \frac{L D_f^2}{N} + \left(\tilde D + \frac{D_f^2}{\tilde D}\right)\frac{\sigma}{\sqrt{N}}, \tag{2.14}
\]
where D_f is defined in (2.5). If, in addition, problem (1.1) is convex with an optimal solution x^*, then
\[
\mathbb{E}\big[f(x_R) - f^*\big] \le \frac{L D_X^2}{N} + \left(\tilde D + \frac{D_X^2}{\tilde D}\right)\frac{\sigma}{\sqrt{N}}, \tag{2.15}
\]
where D_X is defined in (2.7).

Proof. Noting that by (2.13), we have
\[
\frac{D_f^2 + \sigma^2 \sum_{k=1}^N \gamma_k^2}{\sum_{k=1}^N (2\gamma_k - L\gamma_k^2)}
 = \frac{D_f^2 + N\sigma^2\gamma_1^2}{N\gamma_1(2 - L\gamma_1)}
 \le \frac{D_f^2 + N\sigma^2\gamma_1^2}{N\gamma_1}
 = \frac{D_f^2}{N\gamma_1} + \sigma^2\gamma_1
 \le \frac{D_f^2}{N}\max\left\{L, \frac{\sigma\sqrt{N}}{\tilde D}\right\} + \frac{\sigma^2 \tilde D}{\sigma\sqrt{N}}
 \le \frac{L D_f^2}{N} + \left(\tilde D + \frac{D_f^2}{\tilde D}\right)\frac{\sigma}{\sqrt{N}},
\]
which together with (2.4) then implies (2.14). Relation (2.15) follows similarly from the above inequality (with D_f replaced by D_X) and (2.6).
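As a small illustration of the policy analyzed in Corollary 2.2 (a sketch with hypothetical names, not code from the paper), the constant stepsize (2.13) and the induced probability mass function (2.3) can be computed as follows; note that with a constant stepsize the distribution P_R reduces to the uniform distribution on {1, ..., N}.

import numpy as np

def rsg_constant_stepsize(L, sigma, N, D_tilde):
    """Constant stepsize policy (2.13) and the induced P_R from (2.3) (a sketch)."""
    gamma = min(1.0 / L, D_tilde / (sigma * np.sqrt(N))) if sigma > 0 else 1.0 / L
    gammas = np.full(N, gamma)
    weights = 2.0 * gammas - L * gammas**2   # numerators in (2.3), all positive since gamma < 2/L
    P_R = weights / weights.sum()            # uniform when the stepsize is constant
    return gammas, P_R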
We now add a few remarks about the results obtained in Theorem 2.1 and Corollary 2.2. Firstly, as can be seen from (2.11), instead of randomly selecting a solution x_R from {x_1, ..., x_N}, another possibility would be to output the solution x̂_N such that
\[
\|\nabla f(\hat x_N)\| = \min_{k=1,\ldots,N} \|\nabla f(x_k)\|. \tag{2.16}
\]
We can show that E[‖∇f(x̂_N)‖] goes to zero with similar rates of convergence as in (2.4) and (2.14). However, using this strategy would require some extra computational effort to compute ‖∇f(x_k)‖ for all k = 1, ..., N. Since ‖∇f(x_k)‖ cannot be computed exactly, estimating it by Monte-Carlo simulation would incur additional approximation errors and raise some reliability issues. On the other hand, the above RSG method does not require any extra computational effort for estimating the gradient norms ‖∇f(x_k)‖, k = 1, ..., N.

Secondly, observe that in the stepsize policy (2.13), we need to specify a parameter D̃. While the RSG method converges for any arbitrary D̃ > 0, it can be easily seen from (2.14) and (2.15) that an optimal selection of D̃ would be D_f and D_X, respectively, for solving nonconvex and convex SP problems. With such selections, the bounds in (2.14) and (2.15), respectively, reduce to
\[
\frac{1}{L}\,\mathbb{E}\big[\|\nabla f(x_R)\|^2\big] \le \frac{L D_f^2}{N} + \frac{2 D_f \sigma}{\sqrt{N}} \tag{2.17}
\]
and
\[
\mathbb{E}\big[f(x_R) - f^*\big] \le \frac{L D_X^2}{N} + \frac{2 D_X \sigma}{\sqrt{N}}. \tag{2.18}
\]
Note, however, that the exact values of D_f or D_X are rarely known and one often needs to set D̃ to a suboptimal value, e.g., certain upper bounds on D_f or D_X.

Thirdly, one possible drawback of the above RSG method is that one needs to estimate L to obtain an upper bound on γ_k (see, e.g., (2.13)), which will also possibly affect the selection of P_R (see (2.3)). Note that similar requirements also exist for some deterministic first-order methods (e.g., gradient descent and Nesterov's accelerated gradient methods). While under the deterministic setting one can somewhat relax such requirements by using certain line-search procedures to enhance the practical performance of these methods, it is more difficult to devise similar line-search procedures for the stochastic setting, since the exact values of f(x_k) and ∇f(x_k) are not available. It should be noted, however, that we do not need a very accurate estimate for L in the RSG method. Indeed, it can be easily checked that the RSG method exhibits an O(1/√N) rate of convergence if the stepsizes {γ_k} are set to
\[
\min\left\{\frac{1}{qL}, \frac{\tilde D}{\sigma\sqrt{N}}\right\}, \quad k = 1, \ldots, N,
\]
for any q ∈ [1, √N]. In other words, we can overestimate the value of L by a factor of up to √N and the resulting RSG method still exhibits a similar rate of convergence. A common practice in stochastic optimization is to estimate L by using the stochastic gradients computed at a small number of trial points (see, e.g., [23, 19, 11, 10]). We have adopted such a strategy in our implementation of the RSG method, as described in more detail in the technical report associated with this paper [12]. It is also worth noting that, although in general the selection of P_R will depend on γ_k and hence on L, such a dependence is not necessary in some special cases. In particular, if the stepsizes {γ_k} are chosen according to a constant stepsize policy (e.g., (2.13)), then R is uniformly distributed on {1, ..., N}.

Fourthly, it is interesting to note that the RSG method allows us to have a unified treatment for both nonconvex and convex SP problems in view of the specification of {γ_k} and P_R(·) (c.f. (2.3) and (2.13)). Recall that the optimal rate of convergence for solving smooth convex SP problems is given by
\[
\mathcal{O}\left(\frac{L D_X^2}{N^2} + \frac{D_X \sigma}{\sqrt{N}}\right).
\]
This bound has been obtained by Lan [18] based on a stochastic counterpart of Nesterov's method [25, 26].
Comparing (2.18) with the above bound, the RSG method possesses a nearly optimal rate of convergence, since the second term in (2.18) is unimprovable while the first term in (2.18) can be much improved. Moreover, as shown by Cartis et al. [3], the first term in (2.17) for nonconvex problems is also unimprovable for gradient descent methods. It should be noted, however, that the analysis in [3] applies only to gradient descent methods and does not show that the O(1/N) term is tight for all first-order methods.

Finally, observe that we can use stepsize policies other than the constant one in (2.13). In particular, it can be shown that the RSG method with the following two stepsize policies will exhibit similar rates of convergence as those in Corollary 2.2.
• Increasing stepsize policy:
\[
\gamma_k = \min\left\{\frac{1}{L}, \frac{\tilde D \sqrt{k}}{\sigma N}\right\}, \quad k = 1, \ldots, N.
\]
• Decreasing stepsize policy:
\[
\gamma_k = \min\left\{\frac{1}{L}, \frac{\tilde D}{\sigma (kN)^{1/4}}\right\}, \quad k = 1, \ldots, N.
\]
Intuitively speaking, one may want to choose decreasing stepsizes which, according to the definition of P_R(·) in (2.3), can stop the algorithm earlier. On the other hand, as the algorithm moves forward and local information about the gradient gets better, choosing increasing stepsizes might be a better option. We expect that the practical performance of these stepsize policies will depend on each problem instance to be solved.

While Theorem 2.1 and Corollary 2.2 establish the expected convergence performance over many runs of the RSG method, we are also interested in the large-deviation properties for a single run of this method. In particular, we are interested in establishing its complexity for computing an (ǫ, Λ)-solution of problem (1.1), i.e., a point x̄ satisfying Prob{‖∇f(x̄)‖² ≤ ǫ} ≥ 1 − Λ for some ǫ > 0 and Λ ∈ (0, 1). By using (2.14) and Markov's inequality, we have
\[
\mathrm{Prob}\big\{\|\nabla f(x_R)\|^2 \ge \lambda L B_N\big\} \le \frac{1}{\lambda}, \quad \forall\, \lambda > 0. \tag{2.19}
\]
It then follows that the number of calls to the SFO performed by the RSG method for finding an (ǫ, Λ)-solution, after disregarding a few constant factors, can be bounded by
\[
\mathcal{O}\left\{\frac{1}{\Lambda\epsilon} + \frac{\sigma^2}{\Lambda^2\epsilon^2}\right\}. \tag{2.20}
\]
The above complexity bound is rather pessimistic in terms of its dependence on Λ. We will investigate one possible way to significantly improve it in the next subsection.

2.2. A two-phase randomized stochastic gradient method. In this section, we describe a variant of the RSG method which can considerably improve the complexity bound in (2.20). This procedure consists of two phases: an optimization phase used to generate a list of candidate solutions via a few independent runs of the RSG method, and a post-optimization phase in which a solution is selected from this candidate list.

A two-phase RSG (2-RSG) method
Input: Initial point x_1, number of runs S, iteration limit N, and sample size T.
Optimization phase:
For s = 1, ..., S: call the RSG method with input x_1, iteration limit N, stepsizes {γ_k} in (2.13) and probability mass function P_R in (2.3). Let x̄_s be the output of this procedure.
Post-optimization phase:
Choose a solution x̄^* from the candidate list {x̄_1, ..., x̄_S} such that
\[
\|g(\bar x^*)\| = \min_{s=1,\ldots,S} \|g(\bar x_s)\|, \quad g(\bar x_s) := \frac{1}{T}\sum_{k=1}^T G(\bar x_s, \xi_k), \tag{2.21}
\]
where G(x, ξ_k), k = 1, ..., T, are the stochastic gradients returned by the SFO.
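The following sketch (Python, hypothetical names; an illustration of the two-phase scheme only, not the authors' implementation) runs the optimization phase by calling the rsg and rsg_constant_stepsize routines sketched earlier, and then performs the post-optimization selection (2.21) using T additional stochastic gradients per candidate.

import numpy as np

def two_rsg(x1, S, N, T, L, sigma, D_tilde, sfo, rng=None):
    """Two-phase RSG (2-RSG) method (a sketch, assuming rsg() and
    rsg_constant_stepsize() as sketched above)."""
    rng = rng or np.random.default_rng()
    gammas, P_R = rsg_constant_stepsize(L, sigma, N, D_tilde)  # policy (2.13) and (2.3)

    # Optimization phase: S independent runs of the RSG method.
    candidates = [rsg(x1, N, gammas, P_R, sfo, rng) for _ in range(S)]

    # Post-optimization phase: pick the candidate with the smallest averaged
    # stochastic gradient norm, as in (2.21).
    def avg_grad_norm(x):
        g = np.mean([sfo(x) for _ in range(T)], axis=0)
        return np.linalg.norm(g)

    return min(candidates, key=avg_grad_norm)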
Observe that in (2.21), we define the best solution x̄^* as the one with the smallest value of ‖g(x̄_s)‖, s = 1, ..., S. Alternatively, one can choose x̄^* from {x̄_1, ..., x̄_S} such that
\[
\tilde f(\bar x^*) = \min_{s=1,\ldots,S} \tilde f(\bar x_s), \quad \tilde f(\bar x_s) = \frac{1}{T}\sum_{k=1}^T F(\bar x_s, \xi_k). \tag{2.22}
\]
It should be noted that the 2-RSG method is different from a two-phase procedure for convex stochastic programming by Nesterov and Vial [29], where the average of x̄_1, ..., x̄_S is chosen as the output solution.

In the 2-RSG method described above, the number of calls to the SFO is given by S × N and S × T, respectively, for the optimization phase and the post-optimization phase. Also note that we can possibly recycle the same sequence {ξ_k} across all gradient estimations in the post-optimization phase of the 2-RSG method. We will provide in Theorem 2.4 below certain bounds on S, N and T to compute an (ǫ, Λ)-solution of problem (1.1).

We need the following result regarding the large deviations of vector-valued martingales (see, e.g., Theorem 2.1 of [15]).

Lemma 2.3. Assume that we are given a Polish space with Borel probability measure µ and a sequence F_0 = {∅, Ω} ⊆ F_1 ⊆ F_2 ⊆ ... of σ-sub-algebras of the Borel σ-algebra of Ω. Let ζ_i ∈ R^n, i = 1, ..., ∞, be a martingale-difference sequence of Borel functions on Ω such that ζ_i is F_i measurable and E[ζ_i | i − 1] = 0, where E[· | i], i = 1, 2, ..., denotes the conditional expectation w.r.t. F_i and E ≡ E[· | 0] is the expectation w.r.t. µ.
a) If E[‖ζ_i‖²] ≤ σ_i² for any i ≥ 1, then E[‖Σ_{i=1}^N ζ_i‖²] ≤ Σ_{i=1}^N σ_i². As a consequence, we have
\[
\forall N \ge 1, \lambda \ge 0: \quad \mathrm{Prob}\left\{\Big\|\sum_{i=1}^N \zeta_i\Big\|^2 \ge \lambda \sum_{i=1}^N \sigma_i^2\right\} \le \frac{1}{\lambda};
\]
b) If E[exp(‖ζ_i‖²/σ_i²) | i − 1] ≤ exp(1) almost surely for any i ≥ 1, then
\[
\forall N \ge 1, \lambda \ge 0: \quad \mathrm{Prob}\left\{\Big\|\sum_{i=1}^N \zeta_i\Big\| \ge \sqrt{2}\,(1+\lambda)\sqrt{\sum_{i=1}^N \sigma_i^2}\right\} \le \exp(-\lambda^2/3).
\]

We are now ready to describe the main convergence properties of the 2-RSG method. More specifically, Theorem 2.4.a) below shows the convergence rate of this algorithm for a given set of parameters (S, N, T), while Theorem 2.4.b) establishes the complexity of the 2-RSG method for computing an (ǫ, Λ)-solution of problem (1.1).

Theorem 2.4. Under Assumption A1, the following statements hold for the 2-RSG method applied to problem (1.1).
a) Let B_N be defined in (2.14). We have
\[
\mathrm{Prob}\left\{\|\nabla f(\bar x^*)\|^2 \ge 2\left(4 L B_N + \frac{3\lambda\sigma^2}{T}\right)\right\} \le \frac{S+1}{\lambda} + 2^{-S}, \quad \forall\, \lambda > 0; \tag{2.23}
\]
b) Let ǫ > 0 and Λ ∈ (0, 1) be given. If the parameters (S, N, T) are set to
\[
S = S(\Lambda) := \lceil \log(2/\Lambda) \rceil, \tag{2.24}
\]
\[
N = N(\epsilon) := \max\left\{\frac{32 L^2 D_f^2}{\epsilon}, \left[\frac{32 L \big(\tilde D + \frac{D_f^2}{\tilde D}\big)\sigma}{\epsilon}\right]^2\right\}, \tag{2.25}
\]
\[
T = T(\epsilon, \Lambda) := \frac{24(S+1)\sigma^2}{\Lambda\epsilon}, \tag{2.26}
\]
then the 2-RSG method can compute an (ǫ, Λ)-solution of problem (1.1) after taking at most
\[
S(\Lambda)\,[N(\epsilon) + T(\epsilon, \Lambda)] \tag{2.27}
\]
calls to the stochastic first-order oracle.
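As a quick illustration of part b) (a sketch with hypothetical names, not from the paper), the parameters (2.24)-(2.26) for a target accuracy ǫ and confidence level Λ can be computed as follows and then passed to the two_rsg routine sketched above.

import math

def two_rsg_parameters(eps, Lam, L, sigma, D_f, D_tilde):
    """Parameter choices (2.24)-(2.26) for an (eps, Lam)-solution (a sketch).

    Base-2 logarithm is used in (2.24) so that 2**(-S) <= Lam/2."""
    S = math.ceil(math.log2(2.0 / Lam))                                              # (2.24)
    N = math.ceil(max(32.0 * L**2 * D_f**2 / eps,
                      (32.0 * L * (D_tilde + D_f**2 / D_tilde) * sigma / eps) ** 2))  # (2.25)
    T = math.ceil(24.0 * (S + 1) * sigma**2 / (Lam * eps))                            # (2.26)
    return S, N, T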
Proof. We first show part a). Observe that by the definition of x̄^* in (2.21), we have
\[
\|g(\bar x^*)\|^2 = \min_{s=1,\ldots,S} \|g(\bar x_s)\|^2 = \min_{s=1,\ldots,S} \|\nabla f(\bar x_s) + g(\bar x_s) - \nabla f(\bar x_s)\|^2
\le \min_{s=1,\ldots,S}\big\{ 2\|\nabla f(\bar x_s)\|^2 + 2\|g(\bar x_s) - \nabla f(\bar x_s)\|^2 \big\}
\le 2 \min_{s=1,\ldots,S} \|\nabla f(\bar x_s)\|^2 + 2 \max_{s=1,\ldots,S} \|g(\bar x_s) - \nabla f(\bar x_s)\|^2,
\]
which implies that
\[
\|\nabla f(\bar x^*)\|^2 \le 2\|g(\bar x^*)\|^2 + 2\|\nabla f(\bar x^*) - g(\bar x^*)\|^2
\le 4 \min_{s=1,\ldots,S} \|\nabla f(\bar x_s)\|^2 + 4 \max_{s=1,\ldots,S} \|g(\bar x_s) - \nabla f(\bar x_s)\|^2 + 2\|\nabla f(\bar x^*) - g(\bar x^*)\|^2. \tag{2.28}
\]
We now provide certain probabilistic upper bounds on the three terms on the right-hand side of the above inequality. Firstly, using the fact that x̄_s, 1 ≤ s ≤ S, are independent and relation (2.19) (with λ = 2), we have
\[
\mathrm{Prob}\left\{\min_{s=1,\ldots,S} \|\nabla f(\bar x_s)\|^2 \ge 2 L B_N\right\}
= \prod_{s=1}^S \mathrm{Prob}\left\{\|\nabla f(\bar x_s)\|^2 \ge 2 L B_N\right\} \le 2^{-S}. \tag{2.29}
\]
Moreover, denoting δ_{s,k} = G(x̄_s, ξ_k) − ∇f(x̄_s), k = 1, ..., T, we have g(x̄_s) − ∇f(x̄_s) = Σ_{k=1}^T δ_{s,k}/T. Using this observation, Assumption A1 and Lemma 2.3.a), we conclude that, for any s = 1, ..., S,
\[
\mathrm{Prob}\left\{\|g(\bar x_s) - \nabla f(\bar x_s)\|^2 \ge \frac{\lambda\sigma^2}{T}\right\}
= \mathrm{Prob}\left\{\Big\|\sum_{k=1}^T \delta_{s,k}\Big\|^2 \ge \lambda T \sigma^2\right\} \le \frac{1}{\lambda}, \quad \forall\, \lambda > 0,
\]
which implies that
\[
\mathrm{Prob}\left\{\max_{s=1,\ldots,S} \|g(\bar x_s) - \nabla f(\bar x_s)\|^2 \ge \frac{\lambda\sigma^2}{T}\right\} \le \frac{S}{\lambda}, \quad \forall\, \lambda > 0, \tag{2.30}
\]
and that
\[
\mathrm{Prob}\left\{\|g(\bar x^*) - \nabla f(\bar x^*)\|^2 \ge \frac{\lambda\sigma^2}{T}\right\} \le \frac{1}{\lambda}, \quad \forall\, \lambda > 0. \tag{2.31}
\]
The result then follows by combining relations (2.28), (2.29), (2.30) and (2.31).

We now show that part b) holds. Since the 2-RSG method needs to call the RSG method S times with iteration limit N(ǫ) in the optimization phase, and to estimate the gradients g(x̄_s), s = 1, ..., S, with sample size T(ǫ, Λ) in the post-optimization phase, the total number of calls to the stochastic first-order oracle is bounded by S[N(ǫ) + T(ǫ, Λ)]. It remains to show that x̄^* is an (ǫ, Λ)-solution of problem (1.1). Noting that by the definitions of B_N and N(ǫ), respectively, in (2.14) and (2.25), we have
\[
B_{N(\epsilon)} = \frac{L D_f^2}{N(\epsilon)} + \left(\tilde D + \frac{D_f^2}{\tilde D}\right)\frac{\sigma}{\sqrt{N(\epsilon)}} \le \frac{\epsilon}{32 L} + \frac{\epsilon}{32 L} = \frac{\epsilon}{16 L}.
\]
Using the above observation, (2.26) and setting λ = [2(S+1)]/Λ in (2.23), we have
\[
4 L B_{N(\epsilon)} + \frac{3\lambda\sigma^2}{T(\epsilon,\Lambda)} \le \frac{\epsilon}{4} + \frac{\lambda\Lambda\epsilon}{8(S+1)} = \frac{\epsilon}{2},
\]
which, together with relations (2.23) and (2.24) and the selection of λ, then implies that
\[
\mathrm{Prob}\big\{\|\nabla f(\bar x^*)\|^2 \ge \epsilon\big\} \le \frac{\Lambda}{2} + 2^{-S} \le \Lambda.
\]

It is interesting to compare the complexity bound in (2.27) with the one in (2.20). In view of (2.24), (2.25) and (2.26), the complexity bound in (2.27), after disregarding a few constant factors, is equivalent to
\[
\mathcal{O}\left\{\frac{\log(1/\Lambda)}{\epsilon} + \frac{\sigma^2}{\epsilon^2}\log\frac{1}{\Lambda} + \frac{\log^2(1/\Lambda)\,\sigma^2}{\Lambda\epsilon}\right\}. \tag{2.32}
\]
The above bound can be considerably smaller than the one in (2.20), up to a factor of 1/[Λ² log(1/Λ)], when the second terms are the dominating ones in both bounds.

The following result shows that the bound (2.27) obtained in Theorem 2.4 can be further improved under a certain light-tail assumption on the SFO.

Corollary 2.5. Under Assumptions A1 and A2, the following statements hold for the 2-RSG method applied to problem (1.1).
a) Let B_N be defined in (2.14). We have, ∀λ > 0,
\[
\mathrm{Prob}\left\{\|\nabla f(\bar x^*)\|^2 \ge 4\left(2 L B_N + \frac{3(1+\lambda)^2\sigma^2}{T}\right)\right\} \le (S+1)\exp(-\lambda^2/3) + 2^{-S}; \tag{2.33}
\]
b) Let ǫ > 0 and Λ ∈ (0, 1) be given.
If S and N are set to S(Λ) and N(ǫ) as in (2.24) and (2.25), respectively, and the sample size T is set to
\[
T = T'(\epsilon, \Lambda) := \frac{24\sigma^2}{\epsilon}\left[1 + \left(3\ln\frac{2(S+1)}{\Lambda}\right)^{1/2}\right]^2, \tag{2.34}
\]
then the 2-RSG method can compute an (ǫ, Λ)-solution of problem (1.1) in at most
\[
S(\Lambda)\,[N(\epsilon) + T'(\epsilon, \Lambda)] \tag{2.35}
\]
calls to the stochastic first-order oracle.

Proof. We provide the proof of part a) only, since part b) follows immediately from part a) and an argument similar to the one used in the proof of Theorem 2.4.b). Denoting δ_{s,k} = G(x̄_s, ξ_k) − ∇f(x̄_s), k = 1, ..., T, we have g(x̄_s) − ∇f(x̄_s) = Σ_{k=1}^T δ_{s,k}/T. Using this observation, Assumption A2 and Lemma 2.3.b), we conclude that, for any s = 1, ..., S and λ > 0,
\[
\mathrm{Prob}\left\{\|g(\bar x_s) - \nabla f(\bar x_s)\|^2 \ge \frac{2(1+\lambda)^2\sigma^2}{T}\right\}
= \mathrm{Prob}\left\{\Big\|\sum_{k=1}^T \delta_{s,k}\Big\| \ge \sqrt{2T}\,(1+\lambda)\sigma\right\} \le \exp(-\lambda^2/3),
\]
which implies that
\[
\mathrm{Prob}\left\{\max_{s=1,\ldots,S}\|g(\bar x_s) - \nabla f(\bar x_s)\|^2 \ge \frac{2(1+\lambda)^2\sigma^2}{T}\right\} \le S \exp(-\lambda^2/3), \quad \forall\, \lambda > 0, \tag{2.36}
\]
and that
\[
\mathrm{Prob}\left\{\|g(\bar x^*) - \nabla f(\bar x^*)\|^2 \ge \frac{2(1+\lambda)^2\sigma^2}{T}\right\} \le \exp(-\lambda^2/3), \quad \forall\, \lambda > 0. \tag{2.37}
\]
The result in part a) then follows by combining relations (2.28), (2.29), (2.36) and (2.37).

In view of (2.24), (2.25) and (2.34), the bound in (2.35), after disregarding a few constant factors, is equivalent to
\[
\mathcal{O}\left\{\frac{\log(1/\Lambda)}{\epsilon} + \frac{\sigma^2}{\epsilon^2}\log\frac{1}{\Lambda} + \frac{\log^2(1/\Lambda)\,\sigma^2}{\epsilon}\right\}. \tag{2.38}
\]
Clearly, the third term of the above bound is significantly smaller than the corresponding one in (2.32) by a factor of 1/Λ.

3. Stochastic zeroth-order methods. Our problem of interest in this section is problem (1.1) with f given in (1.4), i.e.,
\[
f^* := \inf_{x\in\mathbb{R}^n} \left\{ f(x) := \int_\Xi F(x,\xi)\, dP(\xi) \right\}. \tag{3.1}
\]
Moreover, we assume that F(x, ξ) ∈ C^{1,1}_L(R^n) almost surely, which clearly implies f(x) ∈ C^{1,1}_L(R^n). Our goal in this section is to specialize the RSG and 2-RSG methods, respectively in Subsections 3.1 and 3.2, to deal with the situation when only stochastic zeroth-order information of f is available.

3.1. The randomized stochastic gradient free method. Throughout this section, we assume that f is represented by a stochastic zeroth-order oracle (SZO). More specifically, at the k-th iteration, x_k and ξ_k being the input, the SZO outputs the quantity F(x_k, ξ_k) such that the following assumption holds:

A3: For any k ≥ 1, we have
\[
\mathbb{E}[F(x_k, \xi_k)] = f(x_k). \tag{3.2}
\]

To exploit zeroth-order information, we consider a smooth approximation of the objective function f. It is well known (see, e.g., [34], [6] and [41]) that the convolution of f with any nonnegative, measurable and bounded function ψ : R^n → R satisfying ∫_{R^n} ψ(u) du = 1 is an approximation of f which is at least as smooth as f. One of the most important examples of the function ψ is the probability density function. Here, we use the Gaussian distribution in the convolution. Let u be an n-dimensional standard Gaussian random vector and µ > 0 be the smoothing parameter. Then, a smooth approximation of f is defined as
\[
f_\mu(x) = \frac{1}{(2\pi)^{n/2}} \int f(x + \mu u)\, e^{-\frac{1}{2}\|u\|^2}\, du = \mathbb{E}_u[f(x + \mu u)]. \tag{3.3}
\]
The following result due to Nesterov [28] describes some properties of f_µ(·).

Theorem 3.1. The following statements hold for any f ∈ C^{1,1}_L.
a) The gradient of f_µ, given by
\[
\nabla f_\mu(x) = \frac{1}{(2\pi)^{n/2}} \int \frac{f(x + \mu u) - f(x)}{\mu}\, u\, e^{-\frac{1}{2}\|u\|^2}\, du, \tag{3.4}
\]
is Lipschitz continuous with constant L_µ such that L_µ ≤ L;
b) for any x ∈ R^n,
\[
|f_\mu(x) - f(x)| \le \frac{\mu^2}{2}\, L n, \tag{3.5}
\]
\[
\|\nabla f_\mu(x) - \nabla f(x)\| \le \frac{\mu}{2}\, L (n+3)^{3/2}; \tag{3.6}
\]
c) for any x ∈ R^n,
\[
\frac{1}{\mu^2}\,\mathbb{E}_u\big[\{f(x + \mu u) - f(x)\}^2 \|u\|^2\big] \le \frac{\mu^2}{2}\, L^2 (n+6)^3 + 2(n+4)\|\nabla f(x)\|^2. \tag{3.7}
\]

It immediately follows from (3.6) that
\[
\|\nabla f_\mu(x)\|^2 \le 2\|\nabla f(x)\|^2 + \frac{\mu^2}{2}\, L^2 (n+3)^3, \tag{3.8}
\]
\[
\|\nabla f(x)\|^2 \le 2\|\nabla f_\mu(x)\|^2 + \frac{\mu^2}{2}\, L^2 (n+3)^3. \tag{3.9}
\]
Moreover, denoting
\[
f_\mu^* := \min_{x\in\mathbb{R}^n} f_\mu(x), \tag{3.10}
\]
we conclude from (3.5) that |f_µ^* − f^*| ≤ µ²Ln/2 and hence that
\[
-\mu^2 L n \le [f_\mu(x) - f_\mu^*] - [f(x) - f^*] \le \mu^2 L n. \tag{3.11}
\]

Below we modify the RSG method in Subsection 2.1 to use stochastic zeroth-order rather than first-order information for solving problem (3.1).

A randomized stochastic gradient free (RSGF) method
Input: Initial point x_1, iteration limit N, stepsizes {γ_k}_{k≥1}, probability mass function P_R(·) supported on {1, ..., N}.
Step 0. Let R be a random variable with probability mass function P_R.
Step k = 1, ..., R. Generate u_k by a Gaussian random vector generator and call the stochastic zeroth-order oracle for computing G_µ(x_k, ξ_k, u_k) given by
\[
G_\mu(x_k, \xi_k, u_k) = \frac{F(x_k + \mu u_k, \xi_k) - F(x_k, \xi_k)}{\mu}\, u_k. \tag{3.12}
\]
Set
\[
x_{k+1} = x_k - \gamma_k G_\mu(x_k, \xi_k, u_k). \tag{3.13}
\]
Output x_R.
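A minimal sketch of the RSGF step is given below (Python, hypothetical interface names; not the authors' code), assuming a noisy function evaluator F_noisy(x, xi) and a sampler sample_xi() for the random variable ξ. Note that the two function evaluations in (3.12) must use the same realization of ξ_k, which is modeled here by drawing ξ once per iteration and passing it to both calls.

import numpy as np

def gaussian_zo_gradient(x, mu, F_noisy, xi, u):
    """Gaussian-smoothing gradient estimator (3.12); both evaluations share xi."""
    return (F_noisy(x + mu * u, xi) - F_noisy(x, xi)) / mu * u

def rsgf(x1, N, gammas, P_R, mu, F_noisy, sample_xi, rng=None):
    """One run of the RSGF method (a sketch with hypothetical oracle interfaces)."""
    rng = rng or np.random.default_rng()
    n = np.size(x1)
    R = rng.choice(np.arange(1, N + 1), p=P_R)   # random iteration count
    x = np.array(x1, dtype=float)
    for k in range(1, R + 1):
        u = rng.standard_normal(n)               # u_k ~ N(0, I_n)
        xi = sample_xi()                         # one realization xi_k per iteration
        x = x - gammas[k - 1] * gaussian_zo_gradient(x, mu, F_noisy, xi, u)  # (3.13)
    return x  # x_R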
Note that the estimator G_µ(x_k, ξ_k, u_k) of ∇f_µ(x_k) in (3.12) was suggested by Nesterov in [28]. Indeed, by (3.4) and Assumption A3, we have
\[
\mathbb{E}_{\xi,u}[G_\mu(x, \xi, u)] = \mathbb{E}_u\big[\mathbb{E}_\xi[G_\mu(x, \xi, u)\,|\,u]\big] = \nabla f_\mu(x), \tag{3.14}
\]
which implies that G_µ(x, ξ, u) is an unbiased estimator of ∇f_µ(x). Hence, if the variance σ̃² ≡ E_{ξ,u}[‖G_µ(x, ξ, u) − ∇f_µ(x)‖²] is bounded, we can directly apply the convergence results in Theorem 2.1 to the above RSGF method. However, there still exist a few problems with this approach. Firstly, we do not know an explicit expression for the bound σ̃². Secondly, this approach does not provide any information regarding how to appropriately specify the smoothing parameter µ. The latter issue is critical for the implementation of the RSGF method. By applying the approximation results in Theorem 3.1 to the functions F(·, ξ_k), k = 1, ..., N, and using a slightly different convergence analysis than the one in Theorem 2.1, we are able to obtain much refined convergence results for the above RSGF method.

Theorem 3.2. Suppose that the stepsizes {γ_k} and the probability mass function P_R(·) in the RSGF method are chosen such that γ_k < 1/[2(n+4)L] and
\[
P_R(k) := \mathrm{Prob}\{R = k\} = \frac{\gamma_k - 2L(n+4)\gamma_k^2}{\sum_{k=1}^N \big[\gamma_k - 2L(n+4)\gamma_k^2\big]}, \quad k = 1, \ldots, N. \tag{3.15}
\]
Then, under Assumptions A1 and A3,
a) for any N ≥ 1, we have
\[
\frac{1}{L}\,\mathbb{E}\big[\|\nabla f(x_R)\|^2\big] \le \frac{1}{\sum_{k=1}^N \big[\gamma_k - 2L(n+4)\gamma_k^2\big]} \left[ D_f^2 + 2\mu^2(n+4)\Big(1 + L(n+4)^2 \sum_{k=1}^N \big(\tfrac{\gamma_k}{4} + L\gamma_k^2\big)\Big) + 2(n+4)\sigma^2 \sum_{k=1}^N \gamma_k^2 \right], \tag{3.16}
\]
where the expectation is taken with respect to R, ξ_{[N]} and u_{[N]}, and D_f is defined in (2.5);
b) if, in addition, problem (3.1) is convex with an optimal solution x^*, then, for any N ≥ 1,
\[
\mathbb{E}\big[f(x_R) - f^*\big] \le \frac{1}{2\sum_{k=1}^N \big[\gamma_k - 2(n+4)L\gamma_k^2\big]} \left[ D_X^2 + 2\mu^2 L(n+4) \sum_{k=1}^N \big(\gamma_k + L(n+4)^2 \gamma_k^2\big) + 2(n+4)\sigma^2 \sum_{k=1}^N \gamma_k^2 \right], \tag{3.17}
\]
where the expectation is taken with respect to R, ξ_{[N]} and u_{[N]}, and D_X is defined in (2.7).

Proof. Let ζ_k ≡ (ξ_k, u_k), k ≥ 1, ζ_{[N]} := (ζ_1, ..., ζ_N), and let E_{ζ_{[N]}} denote the expectation w.r.t. ζ_{[N]}. Also denote Δ_k ≡ G_µ(x_k, ξ_k, u_k) − ∇f_µ(x_k) ≡ G_µ(x_k, ζ_k) − ∇f_µ(x_k), k ≥ 1. Using the fact that f ∈ C^{1,1}_L(R^n), Theorem 3.1.a), (1.6) and (3.13), we have, for any k = 1, ..., N,
\[
f_\mu(x_{k+1}) \le f_\mu(x_k) - \gamma_k \langle \nabla f_\mu(x_k), G_\mu(x_k, \zeta_k)\rangle + \frac{L}{2}\gamma_k^2 \|G_\mu(x_k, \zeta_k)\|^2
= f_\mu(x_k) - \gamma_k \|\nabla f_\mu(x_k)\|^2 - \gamma_k \langle \nabla f_\mu(x_k), \Delta_k\rangle + \frac{L}{2}\gamma_k^2 \|G_\mu(x_k, \zeta_k)\|^2. \tag{3.18}
\]
Summing up these inequalities, re-arranging the terms and noting that f_µ^* ≤ f_µ(x_{N+1}), we obtain
\[
\sum_{k=1}^N \gamma_k \|\nabla f_\mu(x_k)\|^2 \le f_\mu(x_1) - f_\mu^* - \sum_{k=1}^N \gamma_k \langle \nabla f_\mu(x_k), \Delta_k\rangle + \frac{L}{2}\sum_{k=1}^N \gamma_k^2 \|G_\mu(x_k, \zeta_k)\|^2. \tag{3.19}
\]
Now, observe that by (3.14),
\[
\mathbb{E}\big[\langle \nabla f_\mu(x_k), \Delta_k\rangle \,\big|\, \zeta_{[k-1]}\big] = 0, \tag{3.20}
\]
and that, by the assumption F(·, ξ_k) ∈ C^{1,1}_L(R^n), (3.7) (with f = F(·, ξ_k)) and (3.12),
\[
\mathbb{E}\big[\|G_\mu(x_k, \zeta_k)\|^2 \,\big|\, \zeta_{[k-1]}\big]
\le 2(n+4)\,\mathbb{E}\big[\|G(x_k, \xi_k)\|^2 \,\big|\, \zeta_{[k-1]}\big] + \frac{\mu^2}{2}\, L^2 (n+6)^3
\le 2(n+4)\big[\mathbb{E}[\|\nabla f(x_k)\|^2 \,|\, \zeta_{[k-1]}] + \sigma^2\big] + \frac{\mu^2}{2}\, L^2 (n+6)^3, \tag{3.21}
\]
where the second inequality follows from Assumption A1. Taking expectations with respect to ζ_{[N]} on both sides of (3.19) and using the above two observations, we obtain
\[
\sum_{k=1}^N \gamma_k\, \mathbb{E}_{\zeta_{[N]}}\big[\|\nabla f_\mu(x_k)\|^2\big] \le f_\mu(x_1) - f_\mu^* + \frac{L}{2}\sum_{k=1}^N \gamma_k^2 \left\{ 2(n+4)\big[\mathbb{E}_{\zeta_{[N]}}[\|\nabla f(x_k)\|^2] + \sigma^2\big] + \frac{\mu^2}{2}\, L^2 (n+6)^3 \right\}.
\]
The above conclusion together with (3.8) and (3.11) then implies that
\[
\sum_{k=1}^N \gamma_k \left[ \mathbb{E}_{\zeta_{[N]}}[\|\nabla f(x_k)\|^2] - \frac{\mu^2}{2}\, L^2 (n+3)^3 \right]
\le 2\big[f(x_1) - f^*\big] + 2\mu^2 L n + 2L(n+4)\sum_{k=1}^N \gamma_k^2\, \mathbb{E}_{\zeta_{[N]}}[\|\nabla f(x_k)\|^2]
+ \left[2L(n+4)\sigma^2 + \frac{\mu^2}{2}\, L^3 (n+6)^3\right]\sum_{k=1}^N \gamma_k^2. \tag{3.22}
\]
By re-arranging the terms and simplifying the constants, we have
\[
\begin{aligned}
\sum_{k=1}^N \big[\gamma_k - 2L(n+4)\gamma_k^2\big]\, \mathbb{E}_{\zeta_{[N]}}[\|\nabla f(x_k)\|^2]
&\le 2\big[f(x_1) - f^*\big] + 2L(n+4)\sigma^2 \sum_{k=1}^N \gamma_k^2 + 2\mu^2 L n + \frac{\mu^2}{2}\, L^2 \sum_{k=1}^N \big[(n+3)^3 \gamma_k + L(n+6)^3 \gamma_k^2\big] \\
&\le 2\big[f(x_1) - f^*\big] + 2L(n+4)\sigma^2 \sum_{k=1}^N \gamma_k^2 + 2\mu^2 L(n+4)\Big[1 + L(n+4)^2 \sum_{k=1}^N \big(\tfrac{\gamma_k}{4} + L\gamma_k^2\big)\Big].
\end{aligned} \tag{3.23}
\]
Dividing both sides of the above inequality by Σ_{k=1}^N [γ_k − 2L(n+4)γ_k²] and noting that
\[
\mathbb{E}\big[\|\nabla f(x_R)\|^2\big] = \mathbb{E}_{R,\zeta_{[N]}}\big[\|\nabla f(x_R)\|^2\big]
= \frac{\sum_{k=1}^N \big[\gamma_k - 2L(n+4)\gamma_k^2\big]\, \mathbb{E}_{\zeta_{[N]}}[\|\nabla f(x_k)\|^2]}{\sum_{k=1}^N \big[\gamma_k - 2L(n+4)\gamma_k^2\big]},
\]
we obtain (3.16).

We now show part b). Denote ω_k ≡ ‖x_k − x^*‖. First observe that, for any k = 1, ..., N,
\[
\omega_{k+1}^2 = \|x_k - \gamma_k G_\mu(x_k, \zeta_k) - x^*\|^2 = \omega_k^2 - 2\gamma_k \langle \nabla f_\mu(x_k) + \Delta_k, x_k - x^*\rangle + \gamma_k^2 \|G_\mu(x_k, \zeta_k)\|^2,
\]
and hence that
\[
\omega_{N+1}^2 = \omega_1^2 - 2\sum_{k=1}^N \gamma_k \langle \nabla f_\mu(x_k), x_k - x^*\rangle - 2\sum_{k=1}^N \gamma_k \langle \Delta_k, x_k - x^*\rangle + \sum_{k=1}^N \gamma_k^2 \|G_\mu(x_k, \zeta_k)\|^2.
\]
Taking expectation w.r.t. ζ_{[N]} on both sides of the above equality, using relation (3.21) and noting that, by (3.14), E[⟨Δ_k, x_k − x^*⟩ | ζ_{[k−1]}] = 0, we obtain
\[
\begin{aligned}
\mathbb{E}_{\zeta_{[N]}}[\omega_{N+1}^2]
&\le \omega_1^2 - 2\sum_{k=1}^N \gamma_k\, \mathbb{E}_{\zeta_{[N]}}\big[\langle \nabla f_\mu(x_k), x_k - x^*\rangle\big] + 2(n+4)\sum_{k=1}^N \gamma_k^2\, \mathbb{E}_{\zeta_{[N]}}[\|\nabla f(x_k)\|^2] + \Big[2(n+4)\sigma^2 + \frac{\mu^2}{2}\, L^2 (n+6)^3\Big]\sum_{k=1}^N \gamma_k^2 \\
&\le \omega_1^2 - 2\sum_{k=1}^N \gamma_k\, \mathbb{E}_{\zeta_{[N]}}\big[f_\mu(x_k) - f_\mu(x^*)\big] + 2(n+4)L\sum_{k=1}^N \gamma_k^2\, \mathbb{E}_{\zeta_{[N]}}[f(x_k) - f^*] + \Big[2(n+4)\sigma^2 + \frac{\mu^2}{2}\, L^2 (n+6)^3\Big]\sum_{k=1}^N \gamma_k^2 \\
&\le \omega_1^2 - 2\sum_{k=1}^N \gamma_k\, \mathbb{E}_{\zeta_{[N]}}\big[f(x_k) - f^* - \mu^2 L n\big] + 2(n+4)L\sum_{k=1}^N \gamma_k^2\, \mathbb{E}_{\zeta_{[N]}}[f(x_k) - f^*] + \Big[2(n+4)\sigma^2 + \frac{\mu^2}{2}\, L^2 (n+6)^3\Big]\sum_{k=1}^N \gamma_k^2,
\end{aligned}
\]
where the second inequality follows from (2.12) and the convexity of f_µ, and the last inequality follows from (3.5). Re-arranging the terms in the above inequality, using the facts that ω²_{N+1} ≥ 0 and f(x_k) ≥ f^*, and simplifying the constants, we have
\[
2\sum_{k=1}^N \big[\gamma_k - 2(n+4)L\gamma_k^2\big]\, \mathbb{E}_{\zeta_{[N]}}[f(x_k) - f^*]
\le 2\sum_{k=1}^N \big[\gamma_k - (n+4)L\gamma_k^2\big]\, \mathbb{E}_{\zeta_{[N]}}[f(x_k) - f^*]
\le \omega_1^2 + 2\mu^2 L(n+4)\sum_{k=1}^N \gamma_k + 2(n+4)\big[L^2\mu^2(n+4)^2 + \sigma^2\big]\sum_{k=1}^N \gamma_k^2.
\]
The rest of the proof is similar to part a) and hence the details are skipped.

Similarly to the RSG method, we can specialize the convergence results in Theorem 3.2 for the RSGF method with a constant stepsize policy.

Corollary 3.3. Suppose that the stepsizes {γ_k} are set to
\[
\gamma_k = \frac{1}{\sqrt{n+4}} \min\left\{\frac{1}{4L\sqrt{n+4}}, \frac{\tilde D}{\sigma\sqrt{N}}\right\}, \quad k = 1, \ldots, N, \tag{3.24}
\]
for some D̃ > 0. Also assume that the probability mass function P_R(·) is set to (3.15) and µ is chosen such that
\[
\mu \le \frac{D_f}{(n+4)\sqrt{2N}}, \tag{3.25}
\]
where D_f and D_X are defined in (2.5) and (2.7), respectively. Then, under Assumptions A1 and A3, we have
\[
\frac{1}{L}\,\mathbb{E}\big[\|\nabla f(x_R)\|^2\big] \le \bar B_N := \frac{12(n+4) L D_f^2}{N} + \frac{4\sigma\sqrt{n+4}}{\sqrt{N}}\left(\tilde D + \frac{D_f^2}{\tilde D}\right). \tag{3.26}
\]
If, in addition, problem (3.1) is convex with an optimal solution x^* and µ is chosen such that µ ≤ D_X/√(n+4), then
\[
\mathbb{E}\big[f(x_R) - f^*\big] \le \frac{5 L (n+4) D_X^2}{N} + \frac{2\sigma\sqrt{n+4}}{\sqrt{N}}\left(\tilde D + \frac{D_X^2}{\tilde D}\right). \tag{3.27}
\]

Proof. We prove (3.26) only, since relation (3.27) can be shown using similar arguments. First note that by (3.24), we have
\[
\gamma_k \le \frac{1}{4(n+4)L}, \quad k = 1, \ldots, N, \tag{3.28}
\]
\[
\sum_{k=1}^N \big[\gamma_k - 2L(n+4)\gamma_k^2\big] = N\gamma_1\big[1 - 2L(n+4)\gamma_1\big] \ge \frac{N\gamma_1}{2}. \tag{3.29}
\]
Therefore, using the above inequalities and (3.16), we obtain
\[
\begin{aligned}
\frac{1}{L}\,\mathbb{E}\big[\|\nabla f(x_R)\|^2\big]
&\le \frac{2 D_f^2 + 4\mu^2(n+4)}{N\gamma_1} + \mu^2 L(n+4)^3 + 4(n+4)\big[\mu^2 L^2(n+4)^2 + \sigma^2\big]\gamma_1 \\
&\le \frac{2 D_f^2 + 4\mu^2(n+4)}{N}\max\left\{4L(n+4), \frac{\sigma\sqrt{(n+4)N}}{\tilde D}\right\} + \mu^2 L(n+4)^2\big[(n+4) + 1\big] + \frac{4\sqrt{n+4}\,\tilde D\sigma}{\sqrt{N}},
\end{aligned}
\]
which, in view of (3.25), then implies that
\[
\begin{aligned}
\frac{1}{L}\,\mathbb{E}\big[\|\nabla f(x_R)\|^2\big]
&\le \frac{2 D_f^2}{N}\left[1 + \frac{1}{(n+4)N}\right]\left[4L(n+4) + \frac{\sigma\sqrt{(n+4)N}}{\tilde D}\right] + \frac{L D_f^2}{2N}\big[(n+4) + 1\big] + \frac{4\sqrt{n+4}\,\tilde D\sigma}{\sqrt{N}} \\
&= \frac{L D_f^2}{N}\left[\frac{17(n+4)}{2} + \frac{8}{N} + \frac{1}{2}\right] + \frac{2\sigma\sqrt{n+4}}{\sqrt{N}}\left[\frac{D_f^2}{\tilde D}\Big(1 + \frac{1}{(n+4)N}\Big) + 2\tilde D\right] \\
&\le \frac{12 L (n+4) D_f^2}{N} + \frac{4\sigma\sqrt{n+4}}{\sqrt{N}}\left(\tilde D + \frac{D_f^2}{\tilde D}\right).
\end{aligned}
\]

A few remarks about the results obtained in Corollary 3.3 are in order.
Firstly, similarly to the RSG method, we use the same selection of stepsizes {γ_k} and probability mass function P_R(·) in the RSGF method for both convex and nonconvex SP problems. In particular, in view of (3.26), the iteration complexity of the RSGF method for finding an ǫ-solution of problem (3.1) can be bounded by O(n/ǫ²). Moreover, in view of (3.27), if the problem is convex, a solution x̄ satisfying E[f(x̄) − f^*] ≤ ǫ can also be found in O(n/ǫ²) iterations. This result has a weaker dependence on n (by a factor of n) than the one established by Nesterov for solving general nonsmooth convex SP problems (see page 17 of [28]). This improvement is obtained because we are dealing with a more special class of SP problems. Also, note that in the case of σ = 0, the iteration complexity of the RSGF method reduces to O(n/ǫ), which is similar to the one obtained by Nesterov [28] for the derivative-free random search method when applied to both smooth convex and nonconvex deterministic problems.

Secondly, we need to specify D̃ for the stepsize policy in (3.24). According to (3.26) and (3.27), an optimal selection of D̃ would be D_f and D_X, respectively, for the nonconvex and convex case. With such selections, the bounds in (3.26) and (3.27), respectively, reduce to
\[
\frac{1}{L}\,\mathbb{E}\big[\|\nabla f(x_R)\|^2\big] \le \frac{12(n+4) L D_f^2}{N} + \frac{8\sqrt{n+4}\, D_f \sigma}{\sqrt{N}}, \tag{3.30}
\]
\[
\mathbb{E}\big[f(x_R) - f^*\big] \le \frac{5 L (n+4) D_X^2}{N} + \frac{4\sqrt{n+4}\, D_X \sigma}{\sqrt{N}}. \tag{3.31}
\]

Similarly to the RSG method, we can establish the complexity of the RSGF method for finding an (ǫ, Λ)-solution of problem (3.1) for some ǫ > 0 and Λ ∈ (0, 1). More specifically, by using (3.26) and Markov's inequality, we have
\[
\mathrm{Prob}\big\{\|\nabla f(x_R)\|^2 \ge \lambda L \bar B_N\big\} \le \frac{1}{\lambda}, \quad \forall\, \lambda > 0, \tag{3.32}
\]
which implies that the total number of calls to the SZO performed by the RSGF method for finding an (ǫ, Λ)-solution of (3.1) can be bounded by
\[
\mathcal{O}\left\{\frac{n L^2 D_f^2}{\Lambda\epsilon} + \frac{n L^2}{\Lambda^2}\left(\tilde D + \frac{D_f^2}{\tilde D}\right)^2 \frac{\sigma^2}{\epsilon^2}\right\}. \tag{3.33}
\]
We will investigate a possible approach to improve the above complexity bound in the next subsection.

3.2. A two-phase randomized stochastic gradient free method. In this section, we modify the 2-RSG method to improve the complexity bound in (3.33) for finding an (ǫ, Λ)-solution of problem (3.1).

A two-phase RSGF (2-RSGF) method
Input: Initial point x_1, number of runs S, iteration limit N, and sample size T.
Optimization phase:
For s = 1, ..., S: call the RSGF method with input x_1, iteration limit N, stepsizes {γ_k} in (3.24), probability mass function P_R in (3.15), and the smoothing parameter µ satisfying (3.25). Let x̄_s be the output of this procedure.
Post-optimization phase:
Choose a solution x̄^* from the candidate list {x̄_1, ..., x̄_S} such that
\[
\|g_\mu(\bar x^*)\| = \min_{s=1,\ldots,S} \|g_\mu(\bar x_s)\|, \quad g_\mu(\bar x_s) := \frac{1}{T}\sum_{k=1}^T G_\mu(\bar x_s, \xi_k, u_k), \tag{3.34}
\]
where G_µ(x, ξ, u) is defined in (3.12).
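For completeness, a sketch of the 2-RSGF post-optimization phase is given below (Python, hypothetical names; an illustration only, reusing the gaussian_zo_gradient routine sketched earlier, with the candidate list produced by S runs of the rsgf sketch). Each candidate's gradient norm is estimated with T zeroth-order estimates (3.12), drawing a fresh pair (ξ_k, u_k) for each estimate.

import numpy as np

def two_rsgf_post_optimization(candidates, T, mu, F_noisy, sample_xi, rng=None):
    """Post-optimization phase (3.34) of the 2-RSGF method (a sketch)."""
    rng = rng or np.random.default_rng()
    n = np.size(candidates[0])

    def avg_zo_grad_norm(x):
        # Average T zeroth-order gradient estimates G_mu(x, xi_k, u_k).
        grads = [gaussian_zo_gradient(x, mu, F_noisy, sample_xi(), rng.standard_normal(n))
                 for _ in range(T)]
        return np.linalg.norm(np.mean(grads, axis=0))

    # Return the candidate with the smallest estimated gradient norm.
    return min(candidates, key=avg_zo_grad_norm)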
The main convergence properties of the 2-RSGF method are summarized in Theorem 3.4. More specifically, Theorem 3.4.a) establishes the rate of convergence of the 2-RSGF method with a given set of parameters (S, N, T), while Theorem 3.4.b) shows the complexity of this method for finding an (ǫ, Λ)-solution of problem (3.1).

Theorem 3.4. Under Assumptions A1 and A3, the following statements hold for the 2-RSGF method applied to problem (3.1).
a) Let B̄_N be defined in (3.26). We have
\[
\mathrm{Prob}\left\{\|\nabla f(\bar x^*)\|^2 \ge 8 L \bar B_N + \frac{3(n+4) L^2 D_f^2}{2N} + \frac{24(n+4)\lambda}{T}\left[L \bar B_N + \frac{(n+4) L^2 D_f^2}{N} + \sigma^2\right]\right\} \le \frac{S+1}{\lambda} + 2^{-S}, \quad \forall\, \lambda > 0; \tag{3.35}
\]
b) Let ǫ > 0 and Λ ∈ (0, 1) be given. If S is set to S(Λ) as in (2.24), and the iteration limit N and sample size T, respectively, are set to
\[
N = \hat N(\epsilon) := \max\left\{\frac{12(n+4)(6 L D_f)^2}{\epsilon}, \left[\frac{72 L \sqrt{n+4}\,\big(\tilde D + \frac{D_f^2}{\tilde D}\big)\sigma}{\epsilon}\right]^2\right\}, \tag{3.36}
\]
\[
T = \hat T(\epsilon, \Lambda) := \frac{24(n+4)(S+1)}{\Lambda}\max\left\{1, \frac{6\sigma^2}{\epsilon}\right\}, \tag{3.37}
\]
then the 2-RSGF method can compute an (ǫ, Λ)-solution of problem (3.1) after taking at most
\[
2 S(\Lambda)\big[\hat N(\epsilon) + \hat T(\epsilon, \Lambda)\big] \tag{3.38}
\]
calls to the SZO.

Proof. First, observe that by (3.6), (3.25) and (3.26), we have
\[
\|\nabla f_\mu(x) - \nabla f(x)\|^2 \le \frac{\mu^2}{4}\, L^2 (n+3)^3 \le \frac{(n+4) L^2 D_f^2}{8N}. \tag{3.39}
\]
Using this observation and the definition of x̄^* in (3.34), we obtain
\[
\begin{aligned}
\|g_\mu(\bar x^*)\|^2 &= \min_{s=1,\ldots,S} \|g_\mu(\bar x_s)\|^2 = \min_{s=1,\ldots,S} \|\nabla f(\bar x_s) + g_\mu(\bar x_s) - \nabla f(\bar x_s)\|^2 \\
&\le \min_{s=1,\ldots,S}\big\{ 2\|\nabla f(\bar x_s)\|^2 + 2\|g_\mu(\bar x_s) - \nabla f(\bar x_s)\|^2 \big\} \\
&\le \min_{s=1,\ldots,S}\big\{ 2\|\nabla f(\bar x_s)\|^2 + 4\|g_\mu(\bar x_s) - \nabla f_\mu(\bar x_s)\|^2 + 4\|\nabla f_\mu(\bar x_s) - \nabla f(\bar x_s)\|^2 \big\} \\
&\le 2\min_{s=1,\ldots,S} \|\nabla f(\bar x_s)\|^2 + 4\max_{s=1,\ldots,S} \|g_\mu(\bar x_s) - \nabla f_\mu(\bar x_s)\|^2 + \frac{(n+4) L^2 D_f^2}{2N},
\end{aligned}
\]
which implies that
\[
\begin{aligned}
\|\nabla f(\bar x^*)\|^2 &\le 2\|g_\mu(\bar x^*)\|^2 + 2\|\nabla f(\bar x^*) - g_\mu(\bar x^*)\|^2
\le 2\|g_\mu(\bar x^*)\|^2 + 4\|\nabla f_\mu(\bar x^*) - g_\mu(\bar x^*)\|^2 + 4\|\nabla f(\bar x^*) - \nabla f_\mu(\bar x^*)\|^2 \\
&\le 4\min_{s=1,\ldots,S} \|\nabla f(\bar x_s)\|^2 + 8\max_{s=1,\ldots,S} \|g_\mu(\bar x_s) - \nabla f_\mu(\bar x_s)\|^2 + \frac{(n+4) L^2 D_f^2}{N}
 + 4\|\nabla f_\mu(\bar x^*) - g_\mu(\bar x^*)\|^2 + 4\|\nabla f(\bar x^*) - \nabla f_\mu(\bar x^*)\|^2 \\
&\le 4\min_{s=1,\ldots,S} \|\nabla f(\bar x_s)\|^2 + 8\max_{s=1,\ldots,S} \|g_\mu(\bar x_s) - \nabla f_\mu(\bar x_s)\|^2 + 4\|\nabla f_\mu(\bar x^*) - g_\mu(\bar x^*)\|^2 + \frac{3(n+4) L^2 D_f^2}{2N},
\end{aligned} \tag{3.40}
\]
where the last inequality also follows from (3.39). We now provide certain probabilistic bounds on the individual terms on the right-hand side of the above inequality. Using (3.32) (with λ = 2), we obtain
\[
\mathrm{Prob}\left\{\min_{s=1,\ldots,S} \|\nabla f(\bar x_s)\|^2 \ge 2 L \bar B_N\right\}
= \prod_{s=1}^S \mathrm{Prob}\left\{\|\nabla f(\bar x_s)\|^2 \ge 2 L \bar B_N\right\} \le 2^{-S}. \tag{3.41}
\]
Moreover, denote Δ_{s,k} = G_µ(x̄_s, ξ_k, u_k) − ∇f_µ(x̄_s), k = 1, ..., T. Note that, similarly to (3.21), we have
\[
\mathbb{E}\big[\|G_\mu(\bar x_s, \xi_k, u_k)\|^2\big] \le 2(n+4)\,\mathbb{E}\big[\|G(\bar x_s, \xi)\|^2\big] + \frac{\mu^2}{2}\, L^2 (n+6)^3
\le 2(n+4)\big[\mathbb{E}[\|\nabla f(\bar x_s)\|^2] + \sigma^2\big] + 2\mu^2 L^2 (n+4)^3.
\]
It then follows from the previous inequality, (3.25) and (3.26) that
\[
\mathbb{E}\big[\|\Delta_{s,k}\|^2\big] = \mathbb{E}\big[\|G_\mu(\bar x_s, \xi_k, u_k) - \nabla f_\mu(\bar x_s)\|^2\big]
\le \mathbb{E}\big[\|G_\mu(\bar x_s, \xi_k, u_k)\|^2\big]
\le 2(n+4)\big[L \bar B_N + \sigma^2\big] + 2\mu^2 L^2 (n+4)^3
\le 2(n+4)\left[L \bar B_N + \sigma^2 + \frac{L^2 D_f^2}{2N}\right] =: \mathcal{D}_N. \tag{3.42}
\]
Noting that g_µ(x̄_s) − ∇f_µ(x̄_s) = Σ_{k=1}^T Δ_{s,k}/T, we conclude from (3.42), Assumption A1 and Lemma 2.3.a) that, for any s = 1, ..., S,
\[
\mathrm{Prob}\left\{\|g_\mu(\bar x_s) - \nabla f_\mu(\bar x_s)\|^2 \ge \frac{\lambda \mathcal{D}_N}{T}\right\}
= \mathrm{Prob}\left\{\Big\|\sum_{k=1}^T \Delta_{s,k}\Big\|^2 \ge \lambda T \mathcal{D}_N\right\} \le \frac{1}{\lambda}, \quad \forall\, \lambda > 0,
\]
which implies that
\[
\mathrm{Prob}\left\{\max_{s=1,\ldots,S} \|g_\mu(\bar x_s) - \nabla f_\mu(\bar x_s)\|^2 \ge \frac{\lambda \mathcal{D}_N}{T}\right\} \le \frac{S}{\lambda}, \quad \forall\, \lambda > 0, \tag{3.43}
\]
and that
\[
\mathrm{Prob}\left\{\|g_\mu(\bar x^*) - \nabla f_\mu(\bar x^*)\|^2 \ge \frac{\lambda \mathcal{D}_N}{T}\right\} \le \frac{1}{\lambda}, \quad \forall\, \lambda > 0. \tag{3.44}
\]
The result then follows by combining relations (3.40), (3.41), (3.42), (3.43) and (3.44).

We now show that part b) holds.
We now show that part b) holds. Clearly, the total number of calls to the SZO in the 2-RSGF method is bounded by $2S(\Lambda)[\hat{N}(\epsilon) + \hat{T}(\epsilon,\Lambda)]$. It then suffices to show that $\bar{x}^*$ is an $(\epsilon,\Lambda)$-solution of problem (3.1). By the definitions of $\bar{B}_N$ and $\hat{N}(\epsilon)$, respectively, in (3.26) and (3.36), we have
\[
\bar{B}_{\hat{N}(\epsilon)} = \frac{12(n+4)L D_f^2}{\hat{N}(\epsilon)} + \frac{4\sigma\sqrt{n+4}}{\sqrt{\hat{N}(\epsilon)}}\left(\tilde D + \frac{D_f^2}{\tilde D}\right) \le \frac{\epsilon}{36L} + \frac{\epsilon}{18L} = \frac{\epsilon}{12L}.
\]
Hence, we have
\[
8 L \bar{B}_{\hat{N}(\epsilon)} + \frac{3(n+4)L^2 D_f^2}{2\hat{N}(\epsilon)} \le \frac{2\epsilon}{3} + \frac{\epsilon}{288} \le \frac{17\epsilon}{24}.
\]
Moreover, by setting $\lambda = [2(S+1)]/\Lambda$ and using (3.36) and (3.37), we obtain
\[
\frac{24(n+4)\lambda}{T}\left[L\bar{B}_{\hat{N}(\epsilon)} + \frac{(n+4)L^2 D_f^2}{\hat{N}(\epsilon)} + \sigma^2\right] \le \frac{24(n+4)\lambda}{T}\left[\frac{\epsilon}{12} + \frac{\epsilon}{432} + \sigma^2\right] \le \frac{\epsilon}{12} + \frac{\epsilon}{432} + \frac{\epsilon}{6} \le \frac{7\epsilon}{24}.
\]
Using these two observations and relation (3.35) with $\lambda = [2(S+1)]/\Lambda$, we conclude that
\[
\mathrm{Prob}\big\{\|\nabla f(\bar{x}^*)\|^2 \ge \epsilon\big\} \le \mathrm{Prob}\left\{\|\nabla f(\bar{x}^*)\|^2 \ge 8 L \bar{B}_{\hat{N}(\epsilon)} + \frac{3(n+4)L^2 D_f^2}{2\hat{N}(\epsilon)} + \frac{24(n+4)\lambda}{T}\left[L\bar{B}_{\hat{N}(\epsilon)} + \frac{(n+4)L^2 D_f^2}{\hat{N}(\epsilon)} + \sigma^2\right]\right\} \le \frac{S+1}{\lambda} + 2^{-S} = \Lambda.
\]

Observe that, in view of (2.24), (3.36) and (3.37), the total number of calls to the SZO performed by the 2-RSGF method can be bounded by
\[
O\left\{\frac{n L^2 D_f^2 \log(1/\Lambda)}{\epsilon} + n L^2\left(\tilde D + \frac{D_f^2}{\tilde D}\right)^2 \frac{\sigma^2}{\epsilon^2}\,\log\frac{1}{\Lambda} + \frac{n \log^2(1/\Lambda)}{\Lambda}\left(1 + \frac{\sigma^2}{\epsilon}\right)\right\}. \tag{3.45}
\]
The above bound is considerably smaller than the one in (3.33), up to a factor of $O\big(1/[\Lambda^2 \log(1/\Lambda)]\big)$, when the second terms are the dominating ones in both bounds.

4. Concluding remarks. In this paper, we present a class of new SA methods for solving the classical unconstrained NLP problem with noisy first-order information. We establish a few new complexity results regarding the computation of an $\epsilon$-solution for this class of problems and show that they are nearly optimal whenever the problem is convex. Moreover, we introduce a post-optimization phase in order to improve the large-deviation properties of the RSG method. These procedures, along with their complexity results, are then specialized for simulation-based optimization problems in which only stochastic zeroth-order information is available. In addition, we show that the complexity of gradient-free methods for smooth convex SP can have a much weaker dependence on the dimension n than that for more general nonsmooth convex SP.

REFERENCES

[1] S. Andradóttir. A review of simulation optimization techniques. Proceedings of the 1998 Winter Simulation Conference, pages 151–158.
[2] C. Cartis, N. I. M. Gould, and Ph. L. Toint. On the complexity of steepest descent, Newton's and regularized Newton's methods for nonconvex unconstrained optimization. SIAM Journal on Optimization, 20(6):2833–2852, 2010.
[3] C. Cartis, N. I. M. Gould, and Ph. L. Toint. On the oracle complexity of first-order and derivative-free algorithms for smooth nonconvex minimization. SIAM Journal on Optimization, 22:66–86, 2012.
[4] K. L. Chung. On a stochastic approximation method. Annals of Mathematical Statistics, pages 463–483, 1954.
[5] A. R. Conn, K. Scheinberg, and L. N. Vicente. Introduction to Derivative-Free Optimization. SIAM, Philadelphia, 2009.
[6] J. C. Duchi, P. L. Bartlett, and M. J. Wainwright. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22:674–701, 2012.
[7] M. Fu. Optimization for simulation: Theory vs. practice. INFORMS Journal on Computing, 14:192–215, 2002.
[8] M. C. Fu. Gradient estimation. In S. G. Henderson and B. L. Nelson, editors, Handbooks in Operations Research and Management Science: Simulation, pages 575–616. Elsevier.
[9] R. Garmanjani and L. N. Vicente. Smoothing and worst-case complexity for direct-search methods in nonsmooth optimization. IMA Journal of Numerical Analysis, 2012. To appear.
[10] S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: shrinking procedures and optimal algorithms. Technical report, 2010. SIAM Journal on Optimization (under second-round review).
[11] S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, I: a generic algorithmic framework. SIAM Journal on Optimization, 22:1469–1492, 2012.
[12] S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. Technical report, Department of Industrial and Systems Engineering, University of Florida, Gainesville, FL 32611, USA, June 2012. SIAM Journal on Optimization (under second-round review).
[13] P. Glasserman. Gradient Estimation via Perturbation Analysis. Kluwer Academic Publishers, Boston, Massachusetts, 2003.
[14] A. Juditsky, A. Nazin, A. B. Tsybakov, and N. Vayatis. Recursive aggregation of estimators via the mirror descent algorithm with averaging. Problems of Information Transmission, 41(4), 2005.
[15] A. Juditsky and A. Nemirovski. Large deviations of vector-valued martingales in 2-smooth normed spaces. Manuscript, Georgia Institute of Technology, Atlanta, GA, 2008. E-print: www2.isye.gatech.edu/~nemirovs/LargeDevSubmitted.pdf.
[16] A. Juditsky, P. Rigollet, and A. B. Tsybakov. Learning by mirror averaging. Annals of Statistics, 36:2183–2206, 2008.
[17] A. J. Kleywegt, A. Shapiro, and T. Homem de Mello. The sample average approximation method for stochastic discrete optimization. SIAM Journal on Optimization, 12:479–502, 2001.
[18] G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1):365–397, 2012.
[19] G. Lan, A. Nemirovski, and A. Shapiro. Validation analysis of mirror descent stochastic approximation method. Mathematical Programming, 134:425–458, 2012.
[20] P. L'Ecuyer. A unified view of the IPA, SF, and LR gradient estimation techniques. Management Science, 36(11):1364–1383, 1990.
[21] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In ICML, pages 689–696, 2009.
[22] L. Mason, J. Baxter, P. Bartlett, and M. Frean. Boosting algorithms as gradient descent in function space. Proc. NIPS, 12:512–518, 1999.
[23] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19:1574–1609, 2009.
[24] A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience Series in Discrete Mathematics. John Wiley, XV, 1983.
[25] Y. E. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN SSSR, 269:543–547, 1983.
[26] Y. E. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Massachusetts, 2004.
[27] Y. E. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120:221–259, 2006.
[28] Y. E. Nesterov. Random gradient-free minimization of convex functions. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain, January 2010.
[29] Y. E. Nesterov and J. P. Vial. Confidence level solutions for stochastic programming. 2000.
[30] J. Nocedal and S. J. Wright. Numerical Optimization. Springer-Verlag, New York, USA, 1999.
[31] B. T. Polyak. New stochastic approximation type procedures. Automat. i Telemekh., 7:98–107, 1990.
[32] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control and Optimization, 30:838–855, 1992.
[33] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
[34] R. T. Rockafellar and R. J.-B. Wets. Variational Analysis, ser. Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer-Verlag, Berlin, 1998.
[35] R. Y. Rubinstein and A. Shapiro. Discrete Event Systems: Sensitivity Analysis and Stochastic Optimization by the Score Function Method. John Wiley & Sons, 1993.
[36] A. Sartenaer, S. Gratton, and Ph. L. Toint. Recursive trust-region methods for multiscale nonlinear optimization. SIAM Journal on Optimization, 19:414–444, 2008.
[37] J. Sacks. Asymptotic distribution of stochastic approximation. Annals of Mathematical Statistics, 29:373–409, 1958.
[38] A. Shapiro. Monte Carlo sampling methods. In A. Ruszczyński and A. Shapiro, editors, Stochastic Programming. North-Holland Publishing Company, Amsterdam, 2003.
[39] J. C. Spall. Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control. John Wiley, Hoboken, NJ, 2003.
[40] L. N. Vicente. Worst case complexity of direct search. EURO Journal on Computational Optimization, 2012. To appear.
[41] F. Yousefian, A. Nedic, and U. V. Shanbhag. On stochastic gradient and subgradient methods with adaptive steplength sequences. Automatica, 48:56–67, 2012.