Stochastic Smoothing for Nonsmooth Minimizations: Accelerating SGD by Exploiting Structure

In this work we consider the stochastic minimization of nonsmooth convex loss functions, a central problem in machine learning. We propose a novel algorithm called Accelerated Nonsmooth Stochastic Gradient Descent (ANSGD), which exploits the structure of common nonsmooth loss functions to achieve optimal convergence rates for a class of problems including SVMs. It is the first stochastic algorithm that can achieve the optimal $O(1/t)$ rate for minimizing nonsmooth loss functions (with strong convexity). The fast rates are confirmed by empirical comparisons, in which ANSGD significantly outperforms previous subgradient descent algorithms including SGD.

Authors: Hua Ouyang, Alexander Gray

Stochastic Smoothing for Nonsmooth Minimizations: Accelerating SGD by Exploiting Structure

Hua Ouyang, Alexander Gray
{houyang, agray}@cc.gatech.edu
College of Computing, Georgia Institute of Technology

Abstract

In this work we consider the stochastic minimization of nonsmooth convex loss functions, a central problem in machine learning. We propose a novel algorithm called Accelerated Nonsmooth Stochastic Gradient Descent (ANSGD), which exploits the structure of common nonsmooth loss functions to achieve optimal convergence rates for a class of problems including SVMs. It is the first stochastic algorithm that can achieve the optimal $O(1/t)$ rate for minimizing nonsmooth loss functions (with strong convexity). The fast rates are confirmed by empirical comparisons, in which ANSGD significantly outperforms previous subgradient descent algorithms including SGD.

1. Introduction

Nonsmoothness is a central issue in machine learning computation, as many important methods minimize nonsmooth convex functions. For example, using the nonsmooth hinge loss yields sparse support vector machines; regressors can be made robust to outliers by using the nonsmooth absolute loss rather than the squared loss; and the $\ell_1$-norm is widely used in sparse reconstruction. Despite these attractive properties, nonsmooth functions are theoretically more difficult to optimize than smooth functions (Nemirovski and Yudin, 1983). In this paper we focus on minimizing nonsmooth functions that are either stochastic (stochastic optimization) or whose learning samples are provided incrementally (online learning). Smoothness and strong convexity are typically certificates of the existence of fast global solvers.
Nesterov's deterministic smoothing method (Nesterov, 2005b) deals with the difficulty of nonsmooth functions by approximating them with smooth functions, to which optimal methods (Nesterov, 2004) can be applied. It converges as $f(x_t) - \min_x f(x) \le O(1/t)$ after $t$ iterations. If a nonsmooth function is strongly convex, this rate can be improved to $O(1/t^2)$ using the excessive gap technique (Nesterov, 2005a).

In this paper we extend Nesterov's smoothing method to the stochastic setting by proposing a stochastic smoothing method for nonsmooth functions. Combining this with a stochastic version of the optimal gradient descent method, we introduce and analyze a new algorithm named Accelerated Nonsmooth Stochastic Gradient Descent (ANSGD), for a class of functions that includes popular ML methods of interest.[1] To our knowledge, ANSGD is the first stochastic first-order algorithm that can achieve the optimal $O(1/t)$ rate for minimizing nonsmooth loss functions without Polyak's averaging (Polyak and Juditsky, 1992). In comparison, the classic SGD converges in $O(\ln t/t)$ for nonsmooth strongly convex functions (Shalev-Shwartz et al., 2007), and is usually not robust (Nemirovski et al., 2009). Even with Polyak's averaging (Bach and Moulines, 2011; Xu, 2011), there are cases where SGD's convergence rate still cannot be faster than $O(\ln t/t)$ (Shamir, 2011). Numerical experiments on real-world datasets also indicate that ANSGD converges much faster than these state-of-the-art algorithms.

A perturbation-based smoothing method was recently proposed for stochastic nonsmooth minimization (Duchi et al., 2011). That work achieves iteration complexities similar to ours, in a parallel computation scenario.

[1] A short version of this paper appears in the International Conference on Machine Learning (ICML), 2012.
In serial settings, ANSGD enjoys better and optimal bounds.

In machine learning, many problems can be cast as minimizing a composition of a loss function and a regularization term. Before proceeding to the algorithm, we first describe a different setting of "composite minimizations" that we will pursue in this paper, along with our notations and assumptions.

1.1 A Different "Composite Setting"

In the classic black-box setting of first-order stochastic algorithms (Nemirovski et al., 2009), the structure of the objective function $\min_x \{ f(x) = E_\xi f(x,\xi) : \xi \sim P \}$ is unknown. In each iteration $t$, an algorithm can only access the first-order stochastic oracle and obtain a subgradient $f'(x, \xi_t)$. The basic assumption is that $f'(x) = E_\xi f'(x,\xi)$ for any $x$, where the random vector $\xi$ is drawn from a fixed distribution $P$.

The composite setting (also known as splitting; Lions and Mercier, 1979) is an extension of the black-box model. It was proposed to exploit the structure of objective functions. Driven by applications of sparse signal reconstruction, it has gained significant interest from different communities (Daubechies et al., 2004; Beck and Teboulle, 2009; Nesterov, 2007a). Stochastic variants have also been proposed recently (Lan, 2010; Lan and Ghadimi, 2011; Duchi and Singer, 2009; Hu et al., 2009; Xiao, 2010). A stochastic composite function $\Phi(x) := f(x) + g(x)$ is the sum of a smooth stochastic convex function $f(x) = E_\xi f(x,\xi)$ and a nonsmooth (but simple and deterministic) function $g(\cdot)$. To minimize $\Phi$, previous work iteratively constructs the following model:

$$\langle \nabla f(x_t, \xi_t), x - x_t \rangle + \frac{1}{\eta_t} D(x, x_t) + g(x), \qquad (1)$$

where $\nabla f(x_t, \xi_t)$ is a gradient, $D(\cdot,\cdot)$ is a proximal function (typically a Bregman divergence), and $\eta_t$ is a stepsize.
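As a concrete illustration of why model (1) is attractive when $g(\cdot)$ is simple (this sketch is ours, not part of the paper's algorithm): with $D(x, x_t) = \frac{1}{2}\|x - x_t\|^2$ and $g(x) = \lambda\|x\|_1$, the minimizer of (1) is a plain gradient step followed by coordinate-wise soft-thresholding.

```python
import numpy as np

def composite_step(grad_f, x_t, eta, lam):
    """Minimize <grad_f, x - x_t> + (1/(2*eta))*||x - x_t||^2 + lam*||x||_1.

    With the squared Euclidean proximal function, the closed-form
    minimizer is soft-thresholding applied to the plain gradient step.
    """
    z = x_t - eta * grad_f                                 # gradient step
    return np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)
```

With $\lambda = 0$ this reduces to a plain stochastic gradient step; the nonsmooth losses treated in this paper do not admit such a separable closed form, which is what motivates swapping the roles of $f$ and $g$ below.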
A successful application of the composite idea typically relies on the assumption that model (1) is easy to minimize. If $g(\cdot)$ is very simple, e.g. $\|x\|_1$ or the nuclear norm, it is straightforward to obtain the minimum in analytic form. However, this assumption does not hold for many other applications in machine learning, where many loss functions are nonsmooth and do not enjoy separability properties (Wright et al., 2009); in our setting it is the nonsmooth loss function, not the regularization term, that plays the role of the nonsmooth part $g(\cdot)$. Important examples include the hinge loss, the absolute loss, and the $\epsilon$-insensitive loss.

In this paper we tackle this problem by studying a new stochastic composite setting: $\min_x \Phi(x) = f(x) + g(x)$, where the loss function $f(\cdot)$ is convex and nonsmooth, while $g(\cdot)$ is convex and $L_g$-Lipschitz smooth:

$$g(x) \le g(y) + \langle \nabla g(y), x - y \rangle + \frac{L_g}{2}\|x - y\|^2. \qquad (2)$$

For clarity, in this paper we focus on unconstrained minimizations. Without loss of generality, we assume that both $f(\cdot)$ and $g(\cdot)$ are stochastic: $f(x) = E_\xi f(x,\xi)$ and $g(x) = E_\xi g(x,\xi)$, where $\xi$ has distribution $P$. If either one is deterministic, its $\xi$ is dropped. To make our algorithm and analysis more general, we assume that $g(\cdot)$ is $\mu$-strongly convex: for all $x, y$,

$$g(x) \ge g(y) + \langle \nabla g(y), x - y \rangle + \frac{\mu}{2}\|x - y\|^2. \qquad (3)$$

If it is not strongly convex, one can simply take $\mu = 0$.

The main idea of our algorithm again stems from exploiting the structures of $f(\cdot)$ and $g(\cdot)$. In Section 2 we propose to form a smooth stochastic approximation of $f(\cdot)$, such that the optimal methods (Nesterov, 2004) can be applied to attain optimal convergence rates. The convergence of our proposed algorithm is analyzed in Section 3, where a batch-to-online conversion is also proposed.
Two popular machine learning problems are chosen as our examples in Section 4, and numerical evaluations are presented in Section 5. All proofs in this paper are provided in the appendix.

2. Approach

2.1 Stochastic Smoothing Method

An important breakthrough in nonsmooth minimization was made by Nesterov in a series of works (Nesterov, 2005b,a, 2007b). By exploiting function structures, Nesterov shows that in many applications, minimizing a well-structured nonsmooth function $f(x)$ can be formulated as an equivalent saddle-point problem

$$\min_{x \in \mathcal{X}} f(x) = \min_{x \in \mathcal{X}} \max_{u \in \mathcal{U}} \big\{ \langle Ax, u \rangle - Q(u) \big\}, \qquad (4)$$

where $u \in R^m$, $\mathcal{U} \subseteq R^m$ is a convex set, $A$ is a linear operator mapping $R^D \to R^m$, and $Q(u)$ is a continuous convex function. Inserting a non-negative $\zeta$-strongly convex function $\omega(u)$ into (4), one obtains a smooth approximation of the original nonsmooth function:

$$\hat f(x, \gamma) := \max_{u \in \mathcal{U}} \big\{ \langle Ax, u \rangle - Q(u) - \gamma\, \omega(u) \big\}, \qquad (5)$$

where $\gamma > 0$ is a fixed smoothness parameter that is crucial in the convergence analysis. The key property of this approximation is:

Lemma 1 (Nesterov, 2005b, Theorem 1) Function $\hat f(x, \gamma)$ is convex and continuously differentiable, and its gradient is Lipschitz continuous with constant $L_{\hat f} := \frac{\|A\|^2}{\gamma \zeta}$, where

$$\|A\| := \max_{x, u} \{ \langle Ax, u \rangle : \|x\| = 1, \|u\| = 1 \}. \qquad (6)$$

Nesterov's smoothing method was originally proposed for deterministic optimization. A major drawback of this method is that the number of iterations $N$ must be known beforehand, so that the algorithm can set a proper smoothness parameter $\gamma = O\big(\frac{2\|A\|}{N+1}\big)$ to ensure convergence. This makes it unsuitable for algorithms that run forever, or whose number of iterations is not known. Following his work, we propose to extend this smoothing method to stochastic optimization.
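A classical concrete instance of (5), due to Nesterov (2005b) and not one of this paper's later examples: take $f(z) = \max_i z_i$, i.e. $\mathcal{U}$ the probability simplex in $R^m$, $Q = 0$, and the entropy prox-function $\omega(u) = \sum_i u_i \ln u_i + \ln m$ (non-negative and $1$-strongly convex over the simplex, with $D_{\mathcal U} = \ln m$). The maximization in (5) then has the closed form $\hat f(z, \gamma) = \gamma \ln \sum_i e^{z_i/\gamma} - \gamma \ln m$, which sits within $\gamma \ln m$ of the true max — the uniform-approximation picture behind Lemma 1. A minimal sketch:

```python
import numpy as np

def smoothed_max(z, gamma):
    """Entropy smoothing of f(z) = max_i z_i:
    fhat(z, gamma) = gamma * logsumexp(z / gamma) - gamma * ln(m).
    Satisfies max(z) - gamma*ln(m) <= fhat <= max(z)."""
    z = np.asarray(z, dtype=float)
    zmax = z.max()
    # subtract zmax before exponentiating for numerical stability
    lse = zmax / gamma + np.log(np.exp((z - zmax) / gamma).sum())
    return gamma * lse - gamma * np.log(z.size)
```

As $\gamma \to 0$ the approximation tightens, at the price of a growing gradient-Lipschitz constant $\|A\|^2/(\gamma\zeta)$, exactly the trade-off Lemma 1 quantifies.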
Our stochastic smoothing differs from the deterministic one in the operator $A$ and the smoothness parameter $\gamma$, both of which will be time-varying. We assume that the nonsmooth part $f(x, \xi)$ of the stochastic composite function $\Phi(\cdot)$ is well structured, i.e. for a specific realization $\xi_t$ it has an equivalent max form as in (4):

$$f(x, \xi_t) = \max_{u \in \mathcal{U}} \big\{ \langle A_{\xi_t} x, u \rangle - Q(u) \big\}, \qquad (7)$$

where $A_{\xi_t}$ is a stochastic linear operator associated with $\xi_t$. We construct a smooth approximation of this function as

$$\hat f(x, \xi_t, \gamma_t) := \max_{u \in \mathcal{U}} \big\{ \langle A_{\xi_t} x, u \rangle - Q(u) - \gamma_t\, \omega(u) \big\}, \qquad (8)$$

where $\gamma_t$ is a time-varying smoothness parameter associated only with the iteration index $t$, and is independent of $\xi_t$. The function $\omega(\cdot)$ is non-negative and $\zeta$-strongly convex. By Lemma 1, $\hat f(x, \xi_t, \gamma_t)$ is $\frac{\|A_{\xi_t}\|^2}{\gamma_t \zeta}$-Lipschitz smooth. It follows that:

Lemma 2 For all $x, y, t$,

$$E_\xi \hat f(x, \xi, \gamma_t) \le E_\xi \hat f(y, \xi, \gamma_t) + E_\xi \langle \nabla \hat f(y, \xi, \gamma_t), x - y \rangle + \frac{E_\xi \|A_\xi\|^2}{\gamma_t \zeta} \|x - y\|^2.$$

We have the following observation about our composite objective $\Phi(\cdot)$, which relates the reduction of the original and the approximated function values.

Lemma 3 For any $x, x_t, t$,

$$\Phi(x_t) - \Phi(x) \le E_\xi\big[\hat f(x_t, \xi, \gamma_t) + g(x_t, \xi)\big] - E_\xi\big[\hat f(x, \xi, \gamma_t) + g(x, \xi)\big] + \gamma_t D_{\mathcal{U}}, \qquad (9)$$

where $D_{\mathcal{U}} := \max_{u \in \mathcal{U}} \omega(u)$.

2.2 Accelerated Nonsmooth SGD (ANSGD)

We are now ready to present our algorithm ANSGD (Algorithm 1). This stochastic algorithm is obtained by applying Nesterov's optimal method to our smooth surrogate function, and thus has a form similar to that of his original deterministic method (Nesterov, 2004, p. 78). However, our convergence analysis is more straightforward and does not rely on the concept of estimate sequences. Hence it is easier to identify proper series $\gamma_t, \eta_t, \alpha_t$ and $\theta_t$, which are crucial in achieving fast rates of convergence.
These series will be determined in our main results (Theorems 6 and 7).

3. Convergence Analysis

To clarify our presentation, Table 1 lists some notations that will be used throughout the paper.

Algorithm 1: Accelerated Nonsmooth Stochastic Gradient Descent (ANSGD)

INPUT: series $\gamma_t, \eta_t, \theta_t \ge 0$ and $0 \le \alpha_t \le 1$; OUTPUT: $x_{t+1}$.
[0.] Initialize $x_0$ and $v_0$;
for $t = 0, 1, 2, \dots$ do
  [1.] $y_t \leftarrow \dfrac{(1-\alpha_t)(\mu+\theta_t)\, x_t + \alpha_t \theta_t\, v_t}{\mu(1-\alpha_t) + \theta_t}$
  [2.] $\hat f_{t+1}(x) \leftarrow \max_{u \in \mathcal{U}} \big\{ \langle A_{\xi_{t+1}} x, u \rangle - Q(u) - \gamma_{t+1}\,\omega(u) \big\}$
  [3.] $x_{t+1} \leftarrow y_t - \eta_t \big( \nabla \hat f_{t+1}(y_t) + \nabla g_{t+1}(y_t) \big)$
  [4.] $v_{t+1} \leftarrow \dfrac{\theta_t v_t + \mu y_t - \big[ \nabla \hat f_{t+1}(y_t) + \nabla g_{t+1}(y_t) \big]}{\mu + \theta_t}$
end for

Table 1: Some notations.

Symbol | Meaning
$\hat f_t(x),\ g_t(x)$ | $\hat f(x, \xi_t, \gamma_t),\ g(x, \xi_t)$
$\nabla \hat f_t(x),\ \nabla g_t(x)$ | $\nabla \hat f(x, \xi_t, \gamma_t),\ \nabla g(x, \xi_t)$
$L_t$ | $L_g + \|A_{\xi_t}\|^2 / (\gamma_t \zeta)$
$\sigma_t(x)$ | $[\nabla \hat f_t(x) + \nabla g_t(x)] - E_{\xi_t}[\nabla \hat f_t(x) + \nabla g_t(x)]$
$\sigma^2$ | $E \max_t \|\sigma_{t+1}(y_t)\|^2$
$\Delta_t$ | $E_{\xi_t}\big[\hat f_t(x_t) + g_t(x_t)\big] - E_{\xi_t}\big[\hat f_t(x) + g_t(x)\big]$
$\Gamma_{t+1}$ | $\langle \sigma_{t+1}(y_t),\ \alpha_t x + (1-\alpha_t) x_t - y_t \rangle$
$D_t^2$ | $\frac{1}{2} E\|x - v_t\|^2$

Our convergence rates are based on the following main lemma, which bounds the progressive reduction $\Delta_t$ of the smoothed function value. Lines 1, 3 and 4 of Algorithm 1 are in fact derived from the proof of this lemma.

Lemma 4 Let $\gamma_t$ be monotonically decreasing. Applying algorithm ANSGD to the nonsmooth composite function $\Phi(\cdot)$, we have for all $x$ and all $t \ge 0$,

$$\Delta_{t+1} \le (1-\alpha_t)\Delta_t + (1-\alpha_t)(\gamma_t - \gamma_{t+1}) D_{\mathcal{U}} + \Gamma_{t+1} + \frac{\alpha_t}{2}\big( \theta_t \|x - v_t\|^2 - (\mu+\theta_t)\|x - v_{t+1}\|^2 \big) + \eta_t\, p q + \Big( \frac{\alpha_t}{2(\mu+\theta_t)} + \frac{L_{t+1}}{2}\eta_t^2 - \eta_t \Big) q^2, \qquad (10)$$

where $p := \|\sigma_{t+1}(y_t)\|$ and $q := \|\nabla \hat f_{t+1}(y_t) + \nabla g_{t+1}(y_t)\|$.

3.1 How to Choose Stepsizes $\eta_t$

On the RHS of (10), the nonnegative scalars $p, q \ge 0$ are data-dependent, and could be arbitrarily large.
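A direct transcription of Algorithm 1 into code may be clarifying. The sketch below is ours and specializes to $\mu = 0$; the combined gradient $\nabla \hat f_{t+1}(y_t) + \nabla g_{t+1}(y_t)$ is abstracted as a single oracle, and for a deterministic sanity check we take $\theta_t = L_g \alpha_t$ (i.e. the $\Omega\sqrt{\alpha_t}$ and $E\|A_\xi\|^2/\zeta$ terms of Theorem 6 set to zero, as when there is no nonsmooth part and no noise). This is our reading of the garbled listing, not verified against the authors' code.

```python
import numpy as np

def ansgd_mu0(grad_oracle, x0, n_iter, L_g=1.0):
    """Sketch of Algorithm 1 (ANSGD) with mu = 0 and theta_t = L_g * alpha_t."""
    x = np.asarray(x0, dtype=float).copy()
    v = x.copy()
    for t in range(n_iter):
        alpha = 2.0 / (t + 2)                 # alpha_t series of Theorem 6
        theta = L_g * alpha
        eta = alpha / theta                   # here eta_t = 1/L_g
        y = (1.0 - alpha) * x + alpha * v     # Line 1 with mu = 0
        grad = grad_oracle(y, t)              # gradient of fhat_{t+1} + g_{t+1} at y_t
        x = y - eta * grad                    # Line 3
        v = v - grad / theta                  # Line 4 with mu = 0
    return x
```

On the smooth quadratic $g(x) = \frac{1}{2}\|x - x^*\|^2$ (so $L_g = 1$, no nonsmooth part) the iterates converge to $x^*$, a useful smoke test; the stochastic, smoothed case additionally needs the full $\theta_t$ series of Theorem 6.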
Hence we need to set proper stepsizes $\eta_t$ such that the last two terms in (10) are non-positive. One might conjecture that there exists a series $c_t \ge 0$ such that

$$\eta_t\, p q + \Big( \frac{\alpha_t}{2(\mu+\theta_t)} + \frac{L_{t+1}}{2}\eta_t^2 - \eta_t \Big) q^2 \le c_t\, p^2. \qquad (11)$$

It is easy to verify that if we take $\eta_t = \frac{\alpha_t}{\mu + \theta_t}$ and any series $c_t \ge \frac{\alpha_t}{2(\mu + \theta_t - \alpha_t L_{t+1})} \ge 0$, then (11) is satisfied. To retain a tight bound, we take

$$c_t = \frac{\alpha_t}{2(\mu + \theta_t - \alpha_t L_{t+1})}. \qquad (12)$$

Taking expectations on both sides of (10) and noticing that $E_{\xi_{t+1}|\xi_{[t]}} \Gamma_{t+1} = 0$ and, by Jensen's inequality, $E_{\xi_{t+1}} c_t \le \frac{\alpha_t}{2(\mu + \theta_t - \alpha_t E_{\xi_{t+1}} L_{t+1})}$, we have:

Lemma 5 For all $x$ and all $t \ge 0$,

$$E\Delta_{t+1} \le (1-\alpha_t) E\Delta_t + \alpha_t \theta_t D_t^2 - \alpha_t(\mu+\theta_t) D_{t+1}^2 + \frac{\alpha_t}{2(\mu + \theta_t - \alpha_t E L_{t+1})}\sigma^2 + (1-\alpha_t)(\gamma_t - \gamma_{t+1}) D_{\mathcal{U}}. \qquad (13)$$

The optimal convergence rates of our algorithm differ according to whether $\mu$ is positive. They are presented separately in the following two subsections, where the choices of $\gamma_t, \theta_t, \alpha_t$ are also determined.

3.2 Optimal Rates for Composite Minimizations when $\mu = 0$

When $\mu = 0$, $g(\cdot)$ is only convex and $L_g$-Lipschitz smooth, but not assumed to be strongly convex.

Theorem 6 Take $\alpha_t = \frac{2}{t+2}$, $\gamma_{t+1} = \alpha_t$, $\theta_t = L_g \alpha_t + \Omega\sqrt{\alpha_t} + \frac{E\|A_\xi\|^2}{\zeta}$ and $\eta_t = \frac{\alpha_t}{\theta_t}$ in Algorithm 1, where $\Omega$ is a constant. Then for all $x$ and all $t \ge 0$,

$$E[\Phi(x_{t+1}) - \Phi(x)] \le \frac{4 L_g D^2}{(t+2)^2} + \frac{2 E\|A_\xi\|^2 D^2/\zeta + 4 D_{\mathcal{U}}}{t+2} + \frac{\sqrt 2\,(\Omega D^2 + \sigma^2/\Omega)}{\sqrt{t+2}}, \qquad (14)$$

where $D^2 := \max_i D_i^2$.

In this result, the variance bound is optimal up to a constant factor (Agarwal et al., 2012). The dominating factor is still due to the stochasticity, and is not affected by the nonsmoothness of $f(\cdot)$. Taking the parameter $\Omega = \sigma/D$, the last term becomes $\frac{2\sqrt 2 D\sigma}{\sqrt{t+2}}$. This bound is better than that of stochastic gradient descent or stochastic dual averaging (Dekel et al., 2010) for minimizing $L$-Lipschitz smooth functions, whose rate is $O\big( \frac{L D_0^2}{t} + \frac{D_0^2 + \sigma^2}{\sqrt t} \big)$; without the smooth function $g(\cdot)$, our bound is of the same order, keeping in mind that our rate is for nonsmooth minimizations. This fact underscores the potential of using stochastic optimal methods for nonsmooth functions.

The diminishing smoothness parameter $\gamma_t = \frac{2}{t+2}$ indicates that initially a smoother approximation is preferred, so that the solution does not change wildly due to the nonsmoothness and stochasticity. Eventually the approximated function should become closer and closer to the original nonsmooth function, so that optimality can be reached. Some concrete examples are given in Fig. 1. The $E\|A_\xi\|^2$ in our bound is a theoretical constant. In Sec. 4 we demonstrate a sampling method, which turns out to work quite well in estimating $E\|A_\xi\|^2$.

3.3 Nearly Optimal Rates for Strongly Convex Minimizations

When $\mu > 0$, $g(\cdot)$ is strongly convex, and the convergence rate of ANSGD can be improved to $O(1/t)$.

Theorem 7 Take $\alpha_t = \frac{2}{t+1}$, $\gamma_{t+1} = \alpha_t$, $\theta_t = L_g \alpha_t + \frac{\mu}{2}\alpha_t + \frac{E\|A_\xi\|^2}{\zeta} - \mu$ and $\eta_t = \frac{\alpha_t}{\mu + \theta_t}$ in Algorithm 1. Denote

$$C := \max\Big\{ \frac{4 E\|A_\xi\|^2}{\zeta\mu},\ 2\Big(\frac{L_g}{\mu}\Big)^{1/3} \Big\}. \qquad (15)$$

Then for all $x$ and all $t \ge 0$,

$$E[\Phi(x_{t+1}) - \Phi(x)] \le \frac{6.58\, L_g \tilde D^2}{t(t+1)} + B + \frac{4 D_{\mathcal{U}}}{t+1} + \frac{\sigma^2}{\mu(t+1)}, \qquad (16)$$

where

$$B := \begin{cases} \dfrac{2 E\|A_\xi\|^2 \tilde D^2/\zeta}{t+1} & \text{if } 0 \le t < C, \\[2ex] \dfrac{2(C-2)\, E\|A_\xi\|^2 \tilde D^2/\zeta}{t(t+1)} & \text{if } t \ge C, \end{cases} \qquad (17)$$

and $\tilde D^2 := \max_{0 \le i \le \min\{t, C\}} D_i^2$.

Note that $C$ is the smallest iteration index from which one can retain $1/t^2$ rates for the $E\|A_\xi\|^2$ part ($B$). Without any knowledge of $L_g$, $\mu$ and $E\|A_\xi\|^2$, one can set a parameter $\Omega$ and take $\theta_t = L_g \alpha_t + \frac{\mu}{2}\alpha_t + \frac{E\|A_\xi\|^2}{\Omega\zeta} - \mu$ in the algorithm.
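The theorem series are easy to tabulate. The helper below encodes our reading of the (garbled) statements of Theorems 6 and 7; the identity it lets one check — $\mu + \theta_t - \alpha_t E L_{t+1} = \frac{\mu}{2}\alpha_t$ with $E L_{t+1} = L_g + E\|A_\xi\|^2/(\gamma_{t+1}\zeta)$ in the strongly convex case — is exactly what makes $c_t$ in (12) collapse to $1/\mu$ and yields the $\sigma^2/(\mu(t+1))$ variance term of (16).

```python
def ansgd_series(t, L_g, EA2, zeta=1.0, mu=0.0, Omega=1.0):
    """Parameter series for ANSGD (our reading of Theorems 6 and 7).
    Returns (alpha_t, gamma_{t+1}, theta_t, eta_t)."""
    if mu == 0.0:                       # Theorem 6
        alpha = 2.0 / (t + 2)
        theta = L_g * alpha + Omega * alpha ** 0.5 + EA2 / zeta
        eta = alpha / theta
    else:                               # Theorem 7
        alpha = 2.0 / (t + 1)
        theta = L_g * alpha + 0.5 * mu * alpha + EA2 / zeta - mu
        eta = alpha / (mu + theta)
    return alpha, alpha, theta, eta
```

Note that $\gamma_{t+1} = \alpha_t$ in both regimes, so the smoothing tightens at exactly the rate the averaging coefficients decay.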
In our experiments, we observe that one can take $\Omega$ fairly large (of $O(E\|A_\xi\|^2)$), meaning that $C$ can be very small ($O(1)$) and $B$ is $O(\frac{1}{t^2})$ for all $t$. In this sense, strongly convex ANSGD is almost parameter-free.

Were it not for the $O(1/t)$ rate of the $D_{\mathcal{U}}$ term, all terms in our bound would be optimal; this is why our rate is called "nearly" optimal. In practice, $D_{\mathcal{U}}$ is usually small, and that term is dominated by the last term $\frac{\sigma^2}{\mu(t+1)}$.

3.4 Batch-to-Online Conversion

The performance of an online learning (online convex minimization) algorithm is typically measured by regret, which can be expressed as

$$R(t) := \sum_{i=0}^{t-1} \big[ \Phi(x_i, \xi_{i+1}) - \Phi(x_t^*, \xi_{i+1}) \big], \qquad (18)$$

where $x_t^* := \arg\min_x \sum_{i=0}^{t-1} \Phi(x, \xi_{i+1})$. In the learning theory literature, many approaches have been proposed that use online learning algorithms for batch learning (stochastic optimization), called "online-to-batch" (O-to-B) conversions. For convex functions, many of these approaches employ an "averaged" solution as the final solution. In the opposite direction, we show that stochastic optimization algorithms can also be used directly for online learning. This "batch-to-online" (B-to-O) conversion is almost free of any additional effort: under i.i.d. assumptions on the data, one can use any stochastic optimization algorithm for online learning.

Proposition 8 For any $t \ge 0$,

$$E_{\xi_{[t]}} R(t) \le \sum_{i=0}^{t-1} E_{\xi_{[i]}}\big[\Phi(x_i) - \Phi(x^*)\big] + E_{\xi_{[t]}} \sum_{i=0}^{t-1} \big[ \Phi(x_t^*) - \Phi(x_t^*, \xi_{i+1}) \big], \qquad (19)$$

where $x^* := \arg\min_x \Phi(x)$ and $x_t^* := \arg\min_x \sum_{i=0}^{t-1} \Phi(x, \xi_{i+1})$.

When $\Phi(\cdot)$ is convex, the second term in (19) can be bounded by applying standard results in uniform convergence (e.g. Boucheron et al., 2005): $\sum_{i=0}^{t-1} \Phi(x_t^*) - \Phi(x_t^*, \xi_{i+1}) = O(\sqrt t)$. Together with summing up the RHS of (14), we obtain an $O(\sqrt t)$ regret bound.
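The B-to-O conversion needs nothing beyond measuring (18) on the stream. A minimal empirical-regret helper (ours; the brute-force grid search stands in for the hindsight $\arg\min$, for illustration only):

```python
import numpy as np

def empirical_regret(iterates, losses, grid):
    """Empirical regret (18): sum_i [losses[i](x_i) - losses[i](x_star)],
    where losses[i](x) plays the role of Phi(x, xi_{i+1}) and x_star
    minimizes the cumulative loss over a candidate grid."""
    cumulative = [sum(l(g) for l in losses) for g in grid]
    x_star = grid[int(np.argmin(cumulative))]
    online = sum(l(x) for l, x in zip(losses, iterates))
    return online - sum(l(x_star) for l in losses)
```

Running any stochastic optimizer on the stream and feeding its iterates to this helper is the whole conversion; Proposition 8 is what turns the optimizer's convergence bound into a bound on this quantity.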
When $\Phi(\cdot)$ is strongly convex, the second term in (19) can be bounded using Shalev-Shwartz et al. (2009): $\sum_{i=0}^{t-1} \Phi(x_t^*) - \Phi(x_t^*, \xi_{i+1}) = O(\ln t)$. Together with summing up the RHS of (16), an $O(\ln t)$ regret bound is achieved.

The $O(\sqrt t)$ and $O(\ln t)$ regret bounds are known to be optimal. Using our proposed ANSGD for online learning via B-to-O thus achieves the same (optimal) regret bounds as state-of-the-art algorithms designed for online learning. However, using O-to-B, one can only retain an $O(\ln t/t)$ rate of convergence for stochastic strongly convex optimization. From this perspective, O-to-B is inferior to B-to-O. The sub-optimality of O-to-B is also discussed in Hazan and Kale (2011).

4. Examples

In this section, two nonsmooth functions are given as examples. We show how these functions can be stochastically approximated, and how to calculate the parameters used in our algorithm.

4.1 Hinge Loss: SVM Classification

The hinge loss is a convex surrogate of the 0-1 loss. Denote a sample-label pair as $\xi := \{s, l\} \sim P$, where $s \in R^D$ and $l \in R$. The hinge loss can be expressed as $f_{\text{hinge}}(x) := \max\{0, 1 - l\, s^T x\}$. It has been widely used for SVM classifiers, where the objective is $\min \Phi(x) = \min E_\xi f_{\text{hinge}}(x) + \frac{\lambda}{2}\|x\|^2$. Note that the regularization term $g(x) = \frac{\lambda}{2}\|x\|^2$ is $\lambda$-strongly convex; hence, according to Theorem 7, ANSGD enjoys $O(1/(\lambda t))$ rates.

Taking $\omega(u) = \frac{1}{2} u^2$ in (8), it is easy to check that the smooth stochastic approximation of the hinge loss is

$$\hat f_{\text{hinge}}(x, \xi_t, \gamma_t) = \max_{0 \le u \le 1} \Big\{ u\big(1 - l_t s_t^T x\big) - \frac{\gamma_t u^2}{2} \Big\}. \qquad (20)$$

This maximization is simple enough that we can obtain an equivalent smooth representation:

$$\hat f_{\text{hinge}}(x, \xi_t, \gamma_t) = \begin{cases} 0 & \text{if } l_t s_t^T x \ge 1, \\[1ex] \dfrac{(1 - l_t s_t^T x)^2}{2\gamma_t} & \text{if } 1 - \gamma_t \le l_t s_t^T x < 1, \\[1ex] 1 - l_t s_t^T x - \dfrac{\gamma_t}{2} & \text{if } l_t s_t^T x < 1 - \gamma_t. \end{cases} \qquad (21)$$

Several examples of $\hat f_{\text{hinge}}$ with varying $\gamma_t$ are plotted in Fig. 1 (left), in comparison with the hinge loss.

Figure 1: Left: hinge loss and its smooth approximations ($\gamma_t = 1, 0.5, 0.2$). Right: absolute loss and its smooth approximations ($\gamma_t = 1, 0.5, 0.2$).

Here $u$ is a scalar, so it is straightforward to calculate $\frac{E\|A_\xi\|^2}{\zeta}$, which is used to generate the sequence $\theta_t$. In binary classification, suppose $l \in \{1, -1\}$. Using definition (6), one only needs to calculate $E(\max_{\|x\|=1} s_t^T x)^2$. Practically, one can take a small subset of $k$ random samples $s_i$ (e.g. $k = 100$) and calculate the sample average of the squared norms, $\frac{1}{k}\sum_{i=1}^k \|s_i\|^2$. This equals $\frac{1}{k}\sum_{i=1}^k (\max_{\|x\|=1} s_i^T x)^2$, an estimate of $E\|A_\xi\|^2$.

4.2 Absolute Loss: Robust Regression

The absolute loss is an alternative to the popular squared loss for robust regression (Hastie et al., 2009). Using the same notations as in Sec. 4.1, it can be expressed as $f_{\text{abs}}(x) := |l - s^T x|$. Taking $\omega(u) = \frac{1}{2} u^2$ in (8), its smooth stochastic approximation can be expressed as

$$\hat f_{\text{abs}}(x, \xi_t, \gamma_t) = \max_{-1 \le u \le 1} \Big\{ u\big(l_t - s_t^T x\big) - \frac{\gamma_t u^2}{2} \Big\}. \qquad (22)$$

Solving this maximization with respect to $u$, we obtain an equivalent form:

$$\hat f_{\text{abs}}(x, \xi_t, \gamma_t) = \begin{cases} l_t - s_t^T x - \dfrac{\gamma_t}{2} & \text{if } l_t - s_t^T x \ge \gamma_t, \\[1ex] \dfrac{(l_t - s_t^T x)^2}{2\gamma_t} & \text{if } -\gamma_t \le l_t - s_t^T x < \gamma_t, \\[1ex] -(l_t - s_t^T x) - \dfrac{\gamma_t}{2} & \text{if } l_t - s_t^T x < -\gamma_t. \end{cases} \qquad (23)$$

This approximation looks similar to the well-studied Huber loss (Huber, 1964), though they are different: they share the same form only when $\gamma_t = 0.5$ (green curve in Fig. 1, right). The parameter $E\|A_\xi\|^2$ can be estimated in a similar way as discussed in Sec. 4.1.
5. Experimental Results

In this section, five publicly available datasets from various application domains are used to evaluate the efficiency of ANSGD. Datasets "svmguide1", "real-sim", "rcv1" and "alpha" are for binary classification, and "abalone" is for robust regression. Following our examples in Sec. 4, we evaluate our algorithm using the approximated hinge loss for classification and the approximated absolute loss for regression. The exact hinge and absolute losses are used for the subgradient descent algorithms that we compare with, as described in the following section. All losses are squared-$\ell_2$-norm-regularized. The regularization parameter $\lambda$ is shown on each figure. When assuming strong convexity, we take $\mu = \lambda$.

5.1 Algorithms for Comparison and Parameters

We compare ANSGD with three state-of-the-art algorithms. Each algorithm has a data-dependent tuning parameter, denoted by $\Omega$ (although they have different physical meanings). The best values of $\Omega$ are found on a tuning subset of samples. Note that when assuming strong convexity, our ANSGD is almost parameter-free. As discussed after Theorem 7, our experiments indicate that the optimal $\Omega$ is such that $\frac{E\|A_\xi\|^2}{\Omega\zeta} \approx 1$, meaning that one can simply take $\theta_t = L_g\alpha_t + \frac{\mu}{2}\alpha_t + 1 - \mu$.

SGD. The classic stochastic approximation (Robbins and Monro, 1951) is adopted: $x_{t+1} \leftarrow x_t - \eta_t f'(x_t)$, where $f'(x_t)$ is the subgradient. When only assuming convexity ($\mu = 0$), we use the stepsize $\eta_t = \frac{\Omega}{\sqrt t}$. When assuming strong convexity, we follow the stepsize used in Bottou's SGD2: $\eta_t = \frac{1}{\mu(t+\Omega)}$.

Averaged SGD. This is algorithmically the same as SGD, except that the averaged result $\bar x := \frac{1}{t}\sum_{i=1}^t x_i$ is used for testing.
We follow the stepsizes suggested by recent work on the non-asymptotic analysis of SGD (Bach and Moulines, 2011; Xu, 2011), where it is argued that Polyak's averaging combined with proper stepsizes yields optimal rates. When only assuming convexity, we use stepsizes $\eta_t = \frac{\Omega}{\sqrt t}$ (Bach and Moulines, 2011). When assuming strong convexity, the stepsize is taken as $\eta_t = \frac{1}{\Omega(1 + \mu t/\Omega)^{3/4}}$ (Xu, 2011).

AC-SA. This approach (Lan, 2010; Lan and Ghadimi, 2011) is interesting to compare because, like ANSGD, it is another way of obtaining a stochastic algorithm based on Nesterov's optimal method, begging the question of whether it has similar behavior. Theoretically, according to Propositions 8 and 9 in Lan and Ghadimi (2011), its bound for the nonsmooth part is of $O(1/\sqrt t)$ for $\mu = 0$ and $O(1/t)$ for $\mu > 0$. In comparison, our nonsmooth part converges in $O(1/t)$ for $\mu = 0$ and $O(1/t^2)$ for $\mu > 0$. Numerically, we observe that directly applying AC-SA to nonsmooth functions results in inferior performance.

Dataset details: "alpha" is obtained from ftp://largescale.ml.tu-berlin.de/largescale/, and the other four datasets can be accessed via http://www.csie.ntu.edu.tw/~cjlin/libsvmtools. Dataset "rcv1" comes with 20,242 training samples and 677,399 testing samples. For "svmguide1" and "real-sim", we randomly take 60% of the samples for training and 40% for testing. For "alpha" and "abalone", 80% are used for training and the remaining 20% for testing.

5.2 Results

Due to the stochasticity of all the algorithms, for each experimental setting we run the program 10 times, and plot the mean and standard deviation of the results using error bars. In the first set of experiments, we compare ANSGD with two subgradient-based algorithms, SGD and Averaged SGD. Classification results are shown in Figs. 2, 3, 4 and 5, and regression results are shown in Fig. 6.
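The two approximated losses driving these experiments, (21) and (23), are cheap piecewise closed forms, and the sampling estimate of $E\|A_\xi\|^2$ from Sec. 4.1 is equally short. A sketch (function names and the fixed seed are ours):

```python
import numpy as np

def smoothed_hinge(margin, gamma):
    """Closed form (21); margin = l_t * <s_t, x>."""
    if margin >= 1.0:
        return 0.0
    if margin >= 1.0 - gamma:
        return (1.0 - margin) ** 2 / (2.0 * gamma)
    return 1.0 - margin - gamma / 2.0

def smoothed_abs(residual, gamma):
    """Closed form (23); residual = l_t - <s_t, x>."""
    if residual >= gamma:
        return residual - gamma / 2.0
    if residual >= -gamma:
        return residual ** 2 / (2.0 * gamma)
    return -residual - gamma / 2.0

def estimate_EA2(samples, k=100, seed=0):
    """Sec. 4.1 estimate of E||A_xi||^2: average squared norm of k random samples."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(samples), size=min(k, len(samples)), replace=False)
    return float(np.mean([np.dot(samples[i], samples[i]) for i in idx]))
```

As $\gamma_t \to 0$ both approximations recover the exact losses, matching the diminishing $\gamma_t$ schedules of Theorems 6 and 7.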
In each figure, the left column is for algorithms without strongly convex assumptions, while in the right column the algorithms assume strong convexity and take $\mu = \lambda$. For the classification results, we plot function values over the testing set in the first row, and testing accuracies in the second row.

Figure 2: Classification with "svmguide1" ($\lambda = 10^{-2}$; left column: $\mu = 0$, right column: $\mu = 10^{-2}$).

It is clear that in all these experiments, ANSGD's function values converge consistently faster than the other two SGD algorithms. In the non-strongly convex experiments, it converges significantly faster than SGD and its averaged version. In the strongly convex experiments, it still outperforms, and is more robust than, strongly convex SGD. Averaged SGD performs well in strongly convex settings in terms of prediction accuracies, although its errors are still higher than ANSGD's on the first three datasets. The only exception is "alpha" (Fig. 5), where Averaged SGD retains higher function values than ANSGD, yet its accuracies are, contradictorily, higher in the early stages. The reason might be that the inexact solution serves as an additional regularization factor, which cannot be predicted by the analysis of convergence rates.

In the second set of experiments, we compare ANSGD with AC-SA and its strongly convex version. Results are shown in Figs. 7, 8, 9 and 10. In all experiments, our ANSGD significantly outperforms AC-SA and is much more stable. These experiments confirm the theoretically better rates discussed in Sec. 5.1.
Figure 3: Classification with "real-sim" ($\lambda = 10^{-5}$).
Figure 4: Classification with "rcv1" ($\lambda = 10^{-5}$).
Figure 5: Classification with "alpha" ($\lambda = 10^{-5}$).
Figure 6: Regression with "abalone" ($\lambda = 10^{-2}$).

6. Conclusions and Future Work

We introduce a different composite setting for nonsmooth functions. Under this setting we propose a stochastic smoothing method and a novel stochastic algorithm, ANSGD. Convergence analysis shows that it achieves (nearly) optimal rates under both convex and strongly convex assumptions. We also propose a "batch-to-online" conversion for online learning, and show that optimal regrets can be obtained.
We will extend our method to constrained minimizations, as well as to cases where the approximated function $\hat f(\cdot)$ is not easily obtained by maximizing over $u$. Nesterov's excessive gap technique has the "true" optimal $1/t^2$ bound, and we will investigate the possibility of integrating it into our algorithm. Exploiting links with statistical learning theory may also be promising.

Figure 7: Classification with "svmguide1" ($\lambda = 10^{-2}$), ANSGD vs. AC-SA.
Figure 8: Classification with "real-sim" ($\lambda = 10^{-5}$), ANSGD vs. AC-SA.
Figure 9: Classification with "rcv1" ($\lambda = 10^{-5}$), ANSGD vs. AC-SA.
[Figure 10: Classification with "alpha". AC-SA vs. ANSGD, with $\lambda = 10^{-5}$ and $\mu \in \{0, 10^{-5}\}$.]

Appendix A. Proof of Lemma 3

Proof

$$\begin{aligned}
\Phi(x_t) - \Phi(x) &= [f(x_t) - f(x)] + [g(x_t) - g(x)] \\
&= \mathbb{E}_{\xi}[f(x_t, \xi)] + \mathbb{E}_{\xi}[-f(x, \xi) + g(x_t, \xi) - g(x, \xi)] \\
&= \mathbb{E}_{\xi} \max_{u \in \mathcal{U}} \Big\{ \big[\langle A_{\xi} x_t, u \rangle - Q(u) - \gamma_t \omega(u)\big] + \gamma_t \omega(u) \Big\} + \mathbb{E}_{\xi}[-f(x, \xi) + g(x_t, \xi) - g(x, \xi)] \\
&\le \mathbb{E}_{\xi} \max_{u \in \mathcal{U}} \big[\langle A_{\xi} x_t, u \rangle - Q(u) - \gamma_t \omega(u)\big] + \max_{u \in \mathcal{U}} \big[\gamma_t \omega(u)\big] + \mathbb{E}_{\xi}[-f(x, \xi) + g(x_t, \xi) - g(x, \xi)] \\
&= \mathbb{E}_{\xi} \big[\hat f(x_t, \xi, \gamma_t)\big] + \gamma_t D_{\mathcal{U}} + \mathbb{E}_{\xi}[-f(x, \xi) + g(x_t, \xi) - g(x, \xi)] \\
&\le \mathbb{E}_{\xi} \big[\hat f(x_t, \xi, \gamma_t) - \hat f(x, \xi, \gamma_t)\big] + \mathbb{E}_{\xi}[g(x_t, \xi) - g(x, \xi)] + \gamma_t D_{\mathcal{U}}. \qquad (24)
\end{aligned}$$

The last inequality is due to the non-negativity of $\omega(\cdot)$ and the definitions of $f$ (7) and $\hat f$ (8).

Appendix B. Proof of Lemma 4

Before proceeding to the proof of this lemma, we present two auxiliary results. For clarity, in the following lemmas and proofs we use the following notation for the smoothly approximated composite function and its expectation:

$$F_t(x, \gamma_t) := \hat f_t(x) + g_t(x) = \hat f(x, \xi_t, \gamma_t) + g(x, \xi_t) \qquad (25)$$

and

$$F(x, \gamma_t) := \mathbb{E}_{\xi_t} F_t(x, \gamma_t). \qquad (26)$$

The first lemma concerns the smoothly approximated function and the smoothness parameter $\gamma_t$.

Lemma 9 If $\gamma_t$ is monotonically decreasing with $t$, then for any $x$ and $t \ge 0$,

$$F(x, \gamma_t) \le F(x, \gamma_{t+1}) \le F(x, \gamma_t) + (\gamma_t - \gamma_{t+1}) D_{\mathcal{U}}, \qquad (27)$$

where $D_{\mathcal{U}} := \max_{u \in \mathcal{U}} \omega(u)$.

Proof The left inequality is obvious, since $\gamma_t \ge \gamma_{t+1}$ and $\omega(u)$ is nonnegative.
For the right inequality,

$$\begin{aligned}
F(x, \gamma_{t+1}) - F(x, \gamma_t) &= \mathbb{E}_{\xi} \hat f(x, \xi, \gamma_{t+1}) - \mathbb{E}_{\xi} \hat f(x, \xi, \gamma_t) \\
&= \max_{u \in \mathcal{U}} \big[\langle \mathbb{E}_{\xi} A_{\xi} x, u \rangle - Q(u) - \gamma_{t+1} \omega(u)\big] - \max_{u \in \mathcal{U}} \big[\langle \mathbb{E}_{\xi} A_{\xi} x, u \rangle - Q(u) - \gamma_t \omega(u)\big] \\
&\le \max_{u \in \mathcal{U}} \Big\{ \big[\langle \mathbb{E}_{\xi} A_{\xi} x, u \rangle - Q(u) - \gamma_{t+1} \omega(u)\big] - \big[\langle \mathbb{E}_{\xi} A_{\xi} x, u \rangle - Q(u) - \gamma_t \omega(u)\big] \Big\} \\
&= \max_{u \in \mathcal{U}} \big[(\gamma_t - \gamma_{t+1}) \omega(u)\big]. \qquad (28)
\end{aligned}$$

The second lemma concerns proximal methods that use Bregman divergences as prox-functions, and is a direct consequence of optimality conditions. It appeared as Lemma 2 of Lan and Ghadimi (2011), and is an extension of the "3-point identity" (Chen and Teboulle (1993), Lemma 3.1).

Lemma 10 (Lan and Ghadimi (2011)) Let $l(x)$ be a convex function and let scalars $s_1, s_2 \ge 0$. For any vectors $u$ and $v$, denote their Bregman divergence by $D(u, v)$. If, for all $x, u, v$,

$$x^* = \arg\min_x\; l(x) + s_1 D(u, x) + s_2 D(v, x), \qquad (29)$$

then

$$l(x) + s_1 D(u, x) + s_2 D(v, x) \ge l(x^*) + s_1 D(u, x^*) + s_2 D(v, x^*) + (s_1 + s_2) D(x^*, x). \qquad (30)$$

We are now ready to prove Lemma 4.

Proof [Proof of Lemma 4] Due to Lemma 2 and the Lipschitz-smoothness of $g(x)$, $F(x, \gamma_{t+1})$ has Lipschitz smoothness constant $L_{F_{t+1}} := \frac{\mathbb{E}_{\xi}\|A_{\xi}\|^2}{\gamma_{t+1}\zeta} + L_g$. It follows that

$$\begin{aligned}
F(x_{t+1}, \gamma_{t+1}) &\le F(y_t, \gamma_{t+1}) + \langle \nabla F(y_t, \gamma_{t+1}), x_{t+1} - y_t \rangle + \tfrac{L_{F_{t+1}}}{2}\|x_{t+1} - y_t\|^2 \\
&= (1-\alpha_t) F(y_t, \gamma_{t+1}) + \langle \nabla F(y_t, \gamma_{t+1}), (1-\alpha_t)(x_t - y_t) \rangle \\
&\quad + \alpha_t F(y_t, \gamma_{t+1}) + \langle \nabla F(y_t, \gamma_{t+1}), x_{t+1} - y_t - (1-\alpha_t)(x_t - y_t) \rangle + \tfrac{L_{F_{t+1}}}{2}\|x_{t+1} - y_t\|^2 \\
&\le (1-\alpha_t) F(x_t, \gamma_{t+1}) + \alpha_t F(y_t, \gamma_{t+1}) + \langle \nabla F(y_t, \gamma_{t+1}), x_{t+1} - y_t - (1-\alpha_t)(x_t - y_t) \rangle + \tfrac{L_{F_{t+1}}}{2}\|x_{t+1} - y_t\|^2, \qquad (31)
\end{aligned}$$

where the last inequality is due to the convexity of $F(\cdot)$.
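As a quick numerical sanity check, Lemma 10 can be verified for the Euclidean prox-function $D(u,v) = \frac{1}{2}\|u-v\|^2$, the instance used in the proof of Lemma 4. The linear objective $l(x) = \langle g, x\rangle$ and all numerical values below are illustrative assumptions, not part of the paper; for this choice the prox objective is quadratic, so (30) in fact holds with equality.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
g = rng.normal(size=d)            # gradient of the linear objective l(x) = <g, x>
u, v, x = rng.normal(size=(3, d))
s1, s2 = 0.7, 1.3                 # nonnegative scalars, as in Lemma 10

def D(a, b):
    """Bregman divergence induced by omega(x) = 1/2 ||x||^2."""
    return 0.5 * float(np.sum((a - b) ** 2))

def prox_obj(z):
    return float(g @ z) + s1 * D(u, z) + s2 * D(v, z)

# First-order optimality: g + s1 (z - u) + s2 (z - v) = 0.
x_star = (s1 * u + s2 * v - g) / (s1 + s2)

lhs = prox_obj(x)
rhs = prox_obj(x_star) + (s1 + s2) * D(x_star, x)
assert lhs >= rhs - 1e-9          # inequality (30); equality in this quadratic case
```

For a general convex $l$ the right-hand side only lower-bounds the left, which is exactly how the lemma is invoked in (33) below.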
Subtracting $F(x, \gamma_{t+1})$ from both sides of the above inequality we have:

$$\begin{aligned}
F(x_{t+1}, \gamma_{t+1}) - F(x, \gamma_{t+1}) &\le (1-\alpha_t) F(x_t, \gamma_{t+1}) - F(x, \gamma_{t+1}) + \alpha_t F(y_t, \gamma_{t+1}) \\
&\quad + \langle \nabla F(y_t, \gamma_{t+1}), x_{t+1} - y_t - (1-\alpha_t)(x_t - y_t) \rangle + \tfrac{L_{F_{t+1}}}{2}\|x_{t+1} - y_t\|^2 \\
&\le (1-\alpha_t) \big[F(x_t, \gamma_t) + (\gamma_t - \gamma_{t+1}) D_{\mathcal{U}}\big] - F(x, \gamma_{t+1}) + \alpha_t F(y_t, \gamma_{t+1}) \\
&\quad + \langle \nabla F(y_t, \gamma_{t+1}), x_{t+1} - y_t - (1-\alpha_t)(x_t - y_t) \rangle + \tfrac{L_{F_{t+1}}}{2}\|x_{t+1} - y_t\|^2 \\
&\le (1-\alpha_t) \big[F(x_t, \gamma_t) - F(x, \gamma_t)\big] - \alpha_t F(x, \gamma_{t+1}) + (1-\alpha_t)(\gamma_t - \gamma_{t+1}) D_{\mathcal{U}} + \alpha_t F(y_t, \gamma_{t+1}) \\
&\quad + \langle \nabla F(y_t, \gamma_{t+1}), x_{t+1} - y_t - (1-\alpha_t)(x_t - y_t) \rangle + \tfrac{L_{F_{t+1}}}{2}\|x_{t+1} - y_t\|^2, \qquad (32)
\end{aligned}$$

where the last two inequalities are due to Lemma 9. Denoting $\Delta_t := F(x_t, \gamma_t) - F(x, \gamma_t)$ and $\sigma_t(x) := \nabla F_t(x, \gamma_t) - \nabla F(x, \gamma_t)$, we can rewrite (32) as:

$$\begin{aligned}
&\Delta_{t+1} - (1-\alpha_t)\Delta_t - (1-\alpha_t)(\gamma_t - \gamma_{t+1}) D_{\mathcal{U}} \\
&\le \alpha_t F(y_t, \gamma_{t+1}) - \alpha_t F(x, \gamma_{t+1}) + \langle \nabla F(y_t, \gamma_{t+1}), x_{t+1} - y_t - (1-\alpha_t)(x_t - y_t) \rangle + \tfrac{L_{F_{t+1}}}{2}\|x_{t+1} - y_t\|^2 \\
&\overset{(3)}{\le} \alpha_t F(y_t, \gamma_{t+1}) - \alpha_t \Big[F(y_t, \gamma_{t+1}) + \langle \nabla F(y_t, \gamma_{t+1}), x - y_t \rangle + \tfrac{\mu}{2}\|x - y_t\|^2\Big] \\
&\quad + \langle \nabla F(y_t, \gamma_{t+1}), x_{t+1} - y_t - (1-\alpha_t)(x_t - y_t) \rangle + \tfrac{L_{F_{t+1}}}{2}\|x_{t+1} - y_t\|^2 \\
&= -\alpha_t \Big[\langle \nabla F_{t+1}(y_t, \gamma_{t+1}) - \sigma_{t+1}(y_t), x - y_t \rangle + \tfrac{\mu}{2}\|x - y_t\|^2\Big] \\
&\quad + \langle \nabla F(y_t, \gamma_{t+1}), x_{t+1} - y_t - (1-\alpha_t)(x_t - y_t) \rangle + \tfrac{L_{F_{t+1}}}{2}\|x_{t+1} - y_t\|^2 \\
&= -\alpha_t \Big[\langle \nabla F_{t+1}(y_t, \gamma_{t+1}), x - y_t \rangle + \tfrac{\mu}{2}\|x - y_t\|^2 + \tfrac{\theta_t}{2}\|x - v_t\|^2\Big] + \tfrac{\alpha_t \theta_t}{2}\|x - v_t\|^2 \\
&\quad + \langle \nabla F(y_t, \gamma_{t+1}), x_{t+1} - y_t - (1-\alpha_t)(x_t - y_t) \rangle + \tfrac{L_{F_{t+1}}}{2}\|x_{t+1} - y_t\|^2 + \langle \sigma_{t+1}(y_t), \alpha_t(x - y_t) \rangle \\
&\le -\alpha_t \Big[\langle \nabla F_{t+1}(y_t, \gamma_{t+1}), v_{t+1} - y_t \rangle + \tfrac{\mu}{2}\|v_{t+1} - y_t\|^2 + \tfrac{\theta_t}{2}\|v_{t+1} - v_t\|^2 + \tfrac{\mu + \theta_t}{2}\|x - v_{t+1}\|^2\Big] + \tfrac{\alpha_t \theta_t}{2}\|x - v_t\|^2 \\
&\quad + \langle \nabla F(y_t, \gamma_{t+1}), x_{t+1} - y_t - (1-\alpha_t)(x_t - y_t) \rangle + \tfrac{L_{F_{t+1}}}{2}\|x_{t+1} - y_t\|^2 + \langle \sigma_{t+1}(y_t), \alpha_t(x - y_t) \rangle, \qquad (33)
\end{aligned}$$

where the last inequality is due to Lemma 10 (taking $D(u, v) = \frac{1}{2}\|u - v\|^2$) and the definition of $v_{t+1}$:

$$v_{t+1} := \arg\min_x\; \langle \nabla F_{t+1}(y_t, \gamma_{t+1}), x - y_t \rangle + \tfrac{\mu}{2}\|x - y_t\|^2 + \tfrac{\theta_t}{2}\|x - v_t\|^2. \qquad (34)$$

Minimizing the above directly leads to Line 4 of Alg. 1:

$$v_{t+1} = \frac{\theta_t v_t + \mu y_t - \nabla F_{t+1}(y_t, \gamma_{t+1})}{\mu + \theta_t}. \qquad (35)$$

Based on this updating rule, it is easy to verify the following inequality:

$$-\alpha_t \Big[\tfrac{\mu}{2}\|v_{t+1} - y_t\|^2 + \tfrac{\theta_t}{2}\|v_{t+1} - v_t\|^2\Big] \le -\frac{\alpha_t}{2}\Big[\frac{\mu \theta_t}{\mu + \theta_t}\|v_t - y_t\|^2 + \frac{1}{\mu + \theta_t}\|\nabla F_{t+1}(y_t, \gamma_{t+1})\|^2\Big] \le -\frac{\alpha_t}{2(\mu + \theta_t)}\|\nabla F_{t+1}(y_t, \gamma_{t+1})\|^2. \qquad (36)$$

To set $x_{t+1}$ (Line 3 of Alg. 1), we follow classic stochastic gradient descent, so that $\|x_{t+1} - y_t\|^2$ can be bounded in terms of $\|\nabla F_{t+1}(y_t, \gamma_{t+1})\|^2$: $x_{t+1} = y_t - \eta_t \nabla F_{t+1}(y_t, \gamma_{t+1})$. Hence

$$\|x_{t+1} - y_t\|^2 = \eta_t^2 \|\nabla F_{t+1}(y_t, \gamma_{t+1})\|^2, \qquad (37)$$

and

$$\langle \nabla F(y_t, \gamma_{t+1}), x_{t+1} - y_t \rangle = \langle \nabla F_{t+1}(y_t, \gamma_{t+1}) - \sigma_{t+1}(y_t), x_{t+1} - y_t \rangle \le -\eta_t \|\nabla F_{t+1}(y_t, \gamma_{t+1})\|^2 + \eta_t \|\sigma_{t+1}(y_t)\| \cdot \|\nabla F_{t+1}(y_t, \gamma_{t+1})\|. \qquad (38)$$

Inserting (35), (36), (37) and (38) into (33) we have

$$\begin{aligned}
\Delta_{t+1} &\le (1-\alpha_t)\Delta_t + (1-\alpha_t)(\gamma_t - \gamma_{t+1}) D_{\mathcal{U}} + \frac{\alpha_t}{2}\big[\theta_t \|x - v_t\|^2 - (\mu + \theta_t)\|x - v_{t+1}\|^2\big] \\
&\quad + \langle \sigma_{t+1}(y_t), \alpha_t(x - y_t) + (1-\alpha_t)(x_t - y_t) \rangle + \eta_t \|\sigma_{t+1}(y_t)\| \cdot \|\nabla F_{t+1}(y_t, \gamma_{t+1})\| \\
&\quad + \Big[\frac{\alpha_t}{2(\mu + \theta_t)} + \frac{L_{t+1}}{2}\eta_t^2 - \eta_t\Big] \|\nabla F_{t+1}(y_t, \gamma_{t+1})\|^2 \\
&\quad + \Big\langle \nabla F_{t+1}(y_t, \gamma_{t+1}),\; -\frac{\alpha_t \theta_t (v_t - y_t)}{\mu + \theta_t} - (1-\alpha_t)(x_t - y_t) \Big\rangle. \qquad (39)
\end{aligned}$$

Setting the last term $-\frac{\alpha_t \theta_t (v_t - y_t)}{\mu + \theta_t} - (1-\alpha_t)(x_t - y_t) = 0$ recovers the updating rule for $y_t$ (Line 1 of Alg. 1). Hence our result follows.
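Taken together, the three update rules recovered in this proof (Lines 1, 3 and 4 of Alg. 1) form one iteration. The sketch below is a minimal illustration, not the paper's implementation: the function name `ansgd_step`, the quadratic test objective in the usage snippet, and all parameter values are assumptions made for demonstration.

```python
import numpy as np

def ansgd_step(x, v, grad_fn, mu, theta, alpha, eta):
    """One ANSGD-style iteration, following the updates derived above.

    grad_fn(y) returns a stochastic gradient of the smoothed composite
    objective F_{t+1}(., gamma_{t+1}) at y; mu, theta, alpha, eta are the
    round-t parameters (illustrative here, not the paper's schedule).
    """
    # Line 1: choose y so that alpha*theta/(mu+theta)*(v - y) + (1-alpha)*(x - y) = 0.
    a = alpha * theta / (mu + theta)
    b = 1.0 - alpha
    y = (a * v + b * x) / (a + b)
    g = grad_fn(y)
    x_next = y - eta * g                              # Line 3: gradient step (37)
    v_next = (theta * v + mu * y - g) / (mu + theta)  # Line 4: prox update (35)
    return x_next, v_next

# Usage on a toy smooth objective f(x) = 1/2 ||x||^2, whose gradient is x itself:
x = v = np.ones(3)
for _ in range(5):
    x, v = ansgd_step(x, v, grad_fn=lambda y: y, mu=1.0, theta=1.0, alpha=0.5, eta=0.5)
```

On this toy problem the iterates contract toward the minimizer at zero; the paper's rates of course depend on the schedule for $\alpha_t$, $\theta_t$, $\eta_t$, $\gamma_t$ analyzed in the theorems.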
Appendix C. Proof of Theorem 6

Proof It is easy to verify that by taking $\alpha_t = \frac{2}{t+2}$, $\gamma_{t+1} = \alpha_t$ and $\theta_t = L_g \alpha_t + \frac{\mathbb{E}\|A_{\xi}\|^2}{\zeta} + \frac{\Omega}{\sqrt{\alpha_t}}$, we have for all $t > 1$:

$$(1 - \alpha_{t-1})(\gamma_{t-1} - \gamma_t) \le \gamma_t - \gamma_{t+1}, \qquad (40)$$

and

$$\frac{(1 - \alpha_t)\alpha_{t-1}}{2(\theta_{t-1} - \alpha_{t-1}\mathbb{E}L_t)} \le \frac{\alpha_t}{2(\theta_t - \alpha_t \mathbb{E}L_{t+1})}. \qquad (41)$$

Next we define and bound weighted sums of $D_t^2$ that will be used later:

$$\begin{aligned}
\Psi(t) &:= [\alpha_t \theta_t - (1-\alpha_t)\alpha_{t-1}\theta_{t-1}] D_t^2 + (1-\alpha_t)[\alpha_{t-1}\theta_{t-1} - (1-\alpha_{t-1})\alpha_{t-2}\theta_{t-2}] D_{t-1}^2 \\
&\quad + (1-\alpha_t)(1-\alpha_{t-1})[\alpha_{t-2}\theta_{t-2} - (1-\alpha_{t-2})\alpha_{t-3}\theta_{t-3}] D_{t-2}^2 + \cdots, \qquad (42)
\end{aligned}$$

where, replacing $\alpha_t$ and $\theta_t$ by their definitions, we have for all $t$:

$$\alpha_t \theta_t - (1-\alpha_t)\alpha_{t-1}\theta_{t-1} = \frac{4L_g}{(t+1)^2(t+2)^2} + \frac{2\mathbb{E}\|A_{\xi}\|^2/\zeta}{(t+1)(t+2)} + \frac{\sqrt{2}\big[(t+1)\sqrt{t+2} - t\sqrt{t+1}\big]\Omega}{(t+1)(t+2)}. \qquad (43)$$

Substituting (43) into (42) and invoking the definition of $D^2$ we have for all $t$:

$$\begin{aligned}
\Psi(t) &\le 4L_g D^2 \Big[\tfrac{1}{(t+1)^2(t+2)^2} + \tfrac{t(t+1)}{(t+1)(t+2)} \cdot \tfrac{1}{t^2(t+1)^2} + \tfrac{(t-1)t}{(t+1)(t+2)} \cdot \tfrac{1}{(t-1)^2 t^2} + \cdots\Big] \\
&\quad + \frac{2\mathbb{E}\|A_{\xi}\|^2 D^2}{\zeta} \Big[\tfrac{1}{(t+1)(t+2)} + \tfrac{t(t+1)}{(t+1)(t+2)} \cdot \tfrac{1}{t(t+1)} + \tfrac{(t-1)t}{(t+1)(t+2)} \cdot \tfrac{1}{(t-1)t} + \cdots\Big] \\
&\quad + \sqrt{2}\,\Omega D^2 \Big[\tfrac{(t+1)\sqrt{t+2} - t\sqrt{t+1}}{(t+1)(t+2)} + \tfrac{t(t+1)}{(t+1)(t+2)} \cdot \tfrac{t\sqrt{t+1} - (t-1)\sqrt{t}}{t(t+1)} + \tfrac{(t-1)t}{(t+1)(t+2)} \cdot \tfrac{(t-1)\sqrt{t} - (t-2)\sqrt{t-1}}{(t-1)t} + \cdots\Big] \\
&= \frac{4L_g D^2}{(t+1)(t+2)} \Big[\big(\tfrac{1}{t+1} - \tfrac{1}{t+2}\big) + \big(\tfrac{1}{t} - \tfrac{1}{t+1}\big) + \big(\tfrac{1}{t-1} - \tfrac{1}{t}\big) + \cdots\Big] \\
&\quad + \frac{2\mathbb{E}\|A_{\xi}\|^2 D^2}{\zeta} \Big[\tfrac{1}{(t+1)(t+2)} + \tfrac{1}{(t+1)(t+2)} + \tfrac{1}{(t+1)(t+2)} + \cdots\Big] \\
&\quad + \frac{\sqrt{2}\,\Omega D^2}{(t+1)(t+2)} \Big[(t+1)\sqrt{t+2} - t\sqrt{t+1} + t\sqrt{t+1} - (t-1)\sqrt{t} + (t-1)\sqrt{t} - (t-2)\sqrt{t-1} + \cdots\Big] \\
&\le \alpha_t \theta_t D^2. \qquad (44)
\end{aligned}$$

Since $\mu = 0$, by recursively applying (13) and using $1 - \alpha_0 = 0$ we have

$$\begin{aligned}
\mathbb{E}\Delta_{t+1} &\le (1-\alpha_t)\mathbb{E}\Delta_t + \alpha_t \theta_t \big(D_t^2 - D_{t+1}^2\big) + \frac{\alpha_t}{2(\theta_t - \alpha_t \mathbb{E}L_{t+1})}\sigma^2 + (1-\alpha_t)(\gamma_t - \gamma_{t+1}) D_{\mathcal{U}} \\
&\le (1-\alpha_t)(1-\alpha_{t-1})\mathbb{E}\Delta_{t-1} + \alpha_t \theta_t \big(D_t^2 - D_{t+1}^2\big) + (1-\alpha_t)\alpha_{t-1}\theta_{t-1}\big(D_{t-1}^2 - D_t^2\big) \\
&\quad + \frac{2\alpha_t}{2(\theta_t - \alpha_t \mathbb{E}L_{t+1})}\sigma^2 + 2(1-\alpha_t)(\gamma_t - \gamma_{t+1}) D_{\mathcal{U}} \\
&\le \cdots \\
&\overset{(42)}{\le} \prod_{i=0}^{t}(1-\alpha_i)\Delta_0 + \Psi(t) + \frac{(t+1)\alpha_t}{2(\theta_t - \alpha_t \mathbb{E}L_{t+1})}\sigma^2 + (t+1)(1-\alpha_t)(\gamma_t - \gamma_{t+1}) D_{\mathcal{U}} \\
&\overset{(44)}{\le} \alpha_t \theta_t D^2 + \frac{\sigma^2}{\theta_t - \alpha_t \mathbb{E}L_{t+1}} + \frac{2 D_{\mathcal{U}}}{t+2} \\
&= \big(\alpha_t^2 \mathbb{E}L_{t+1} + \Omega\sqrt{\alpha_t}\big) D^2 + \frac{\sqrt{\alpha_t}\,\sigma^2}{\Omega} + \frac{2 D_{\mathcal{U}}}{t+2}. \qquad (45)
\end{aligned}$$

Combining with Lemma 3 we have, for all $x$,

$$\begin{aligned}
\mathbb{E}[\Phi(x_{t+1}) - \Phi(x)] &\le \big(\alpha_t^2 \mathbb{E}L_{t+1} + \Omega\sqrt{\alpha_t}\big) D^2 + \frac{\sqrt{\alpha_t}\,\sigma^2}{\Omega} + \frac{2 D_{\mathcal{U}}}{t+2} + \gamma_{t+1} D_{\mathcal{U}} \\
&\le \alpha_t^2 L_g D^2 + \Big(\gamma_{t+1} + \frac{2}{t+2}\Big) D_{\mathcal{U}} + \frac{\alpha_t^2 \mathbb{E}\|A_{\xi}\|^2}{\gamma_{t+1}\zeta} D^2 + \sqrt{\alpha_t}\Big(\Omega D^2 + \frac{\sigma^2}{\Omega}\Big). \qquad (46)
\end{aligned}$$

Taking $\gamma_{t+1} = \alpha_t = \frac{2}{t+2}$, our result follows.

Appendix D. Proof of Theorem 7

Proof It is easy to verify that by taking $\alpha_t = \frac{2}{t+1}$, we have for all $t \ge 1$

$$(1 - \alpha_{t-1})(\gamma_{t-1} - \gamma_t) \le \gamma_t - \gamma_{t+1}, \qquad (47)$$

and

$$(1 - \alpha_t)\alpha_{t-1}^2 \le \alpha_t^2. \qquad (48)$$

Denote

$$S_t := \alpha_t \theta_t - (1-\alpha_t)(\alpha_{t-1}\theta_{t-1} + \mu \alpha_{t-1}). \qquad (49)$$

Taking $\theta_t = L_g \alpha_t + \frac{\mu}{2}\alpha_t + \frac{\mathbb{E}\|A_{\xi}\|^2}{\zeta} - \mu$, it is easy to verify that for all $t \ge 1$:

$$S_t = \frac{4 L_g}{(t+1)^2 t^2} + \frac{2\mathbb{E}\|A_{\xi}\|^2}{\zeta}\Big(\frac{1}{t} - \frac{1}{t+1}\Big) - \frac{\mu}{t+1}. \qquad (50)$$

We want to find the smallest iteration index $C$ such that $S_t \le 0$ whenever $t \ge C$. Without any knowledge about $L_g$ and $\mathbb{E}\|A_{\xi}\|^2$, minimizing $S_t$ with respect to $t$ does not yield an analytic form for $C$. Hence we simply require

$$\frac{4 L_g}{(t+1)^2 t^2} \le \frac{\mu}{2(t+1)}, \qquad (51)$$

and

$$\frac{2\mathbb{E}\|A_{\xi}\|^2}{\zeta}\Big(\frac{1}{t} - \frac{1}{t+1}\Big) \le \frac{\mu}{2(t+1)}. \qquad (52)$$

Inequality (51) is satisfied when

$$t \ge 2\Big(\frac{L_g}{\mu}\Big)^{1/3}, \qquad (53)$$

and (52) is satisfied when

$$t \ge \frac{4\mathbb{E}\|A_{\xi}\|^2}{\zeta \mu}. \qquad (54)$$

Combining these two, we reach the definition of $C$ in (15). Next we proceed to prove the bound.
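As a side note, the parameter choices in this proof can be encoded as a short routine: the burn-in index follows (53)-(54) and the step sizes follow the stated $\alpha_t$ and $\theta_t$. All constants passed in below ($L_g$, $\mathbb{E}\|A_\xi\|^2$, $\zeta$, $\mu$ values) are hypothetical placeholders for illustration.

```python
import math

def burn_in_C(L_g, E_A2, zeta, mu):
    """Smallest integer t satisfying both (53) and (54), so S_t <= 0 for t >= C."""
    return math.ceil(max(2.0 * (L_g / mu) ** (1.0 / 3.0),
                         4.0 * E_A2 / (zeta * mu)))

def schedule(t, L_g, E_A2, zeta, mu):
    """alpha_t = 2/(t+1); theta_t = L_g alpha_t + (mu/2) alpha_t + E||A||^2/zeta - mu."""
    alpha = 2.0 / (t + 1)
    theta = L_g * alpha + 0.5 * mu * alpha + E_A2 / zeta - mu
    return alpha, theta
```

Note how the bound on $C$ grows like $\mu^{-1}$ through (54): with the illustrative values $L_g = \mathbb{E}\|A_\xi\|^2 = \zeta = 1$ and $\mu = 10^{-2}$, the second condition dominates and gives $C = 400$.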
As defined in the theorem, we denote $\tilde D^2 = \max_{0 \le i \le \min(t, C)} D_i^2$. By recursively applying (13) for $0 \le i \le t$, and noticing that $S_t \le 0$ for all $t \ge C$ and that $1 - \alpha_1 = 0$, we have

$$\begin{aligned}
\mathbb{E}\Delta_{t+1} &\overset{(47)}{\le} \prod_{i=0}^{t}(1-\alpha_i)\Delta_0 + (t+1)(1-\alpha_t)(\gamma_t - \gamma_{t+1}) D_{\mathcal{U}} \\
&\quad + \big[\alpha_t \theta_t D_t^2 - (\alpha_t \theta_t + \mu \alpha_t) D_{t+1}^2\big] + (1-\alpha_t)\big[\alpha_{t-1}\theta_{t-1} D_{t-1}^2 - (\alpha_{t-1}\theta_{t-1} + \mu \alpha_{t-1}) D_t^2\big] \\
&\quad + (1-\alpha_t)(1-\alpha_{t-1})\big[\alpha_{t-2}\theta_{t-2} D_{t-2}^2 - (\alpha_{t-2}\theta_{t-2} + \mu \alpha_{t-2}) D_{t-1}^2\big] + \cdots \\
&\quad + \prod_{i=1}^{t}(1-\alpha_i)\big[\alpha_0 \theta_0 D_0^2 - (\alpha_0 \theta_0 + \mu \alpha_0) D_1^2\big] + \frac{\sigma^2}{\mu}\Big[\alpha_t^2 + (1-\alpha_t)\alpha_{t-1}^2 + \cdots + \prod_{i=1}^{t}(1-\alpha_i)\alpha_0^2\Big] \\
&\overset{(48)}{\le} \frac{2 D_{\mathcal{U}}}{t+1} + \tilde D^2 \prod_{i=C-1}^{t}(1-\alpha_i)\big[\alpha_{C-2}\theta_{C-2} - (1-\alpha_{C-2})(\alpha_{C-3}\theta_{C-3} + \mu \alpha_{C-3})\big] \\
&\quad + \tilde D^2 \prod_{i=C-2}^{t}(1-\alpha_i)\big[\alpha_{C-3}\theta_{C-3} - (1-\alpha_{C-3})(\alpha_{C-4}\theta_{C-4} + \mu \alpha_{C-4})\big] \\
&\quad + \cdots + \tilde D^2 \prod_{i=2}^{t}(1-\alpha_i)\big[\alpha_1 \theta_1 - (1-\alpha_1)(\alpha_0 \theta_0 + \mu \alpha_0)\big] + \frac{t \alpha_t^2 \sigma^2}{\mu}. \qquad (55)
\end{aligned}$$

Applying (50) to the above inequality (ignoring the $-\frac{\mu}{t+1}$ term), we can bound the coefficients of the $L_g$ and $\frac{\mathbb{E}\|A_{\xi}\|^2}{\zeta}$ parts separately as follows. When $t \ge C$, for the $L_g$ part:

$$\begin{aligned}
&\frac{\prod_{i=C-1}^{t}(1-\alpha_i)}{(C-1)^2(C-2)^2} + \frac{\prod_{i=C-2}^{t}(1-\alpha_i)}{(C-2)^2(C-3)^2} + \frac{\prod_{i=C-3}^{t}(1-\alpha_i)}{(C-3)^2(C-4)^2} + \cdots + \frac{\prod_{i=2}^{t}(1-\alpha_i)}{2^2 \cdot 1^2} \\
&= \frac{1}{(t+1)t}\Big[\frac{1}{(C+2)(C+1)} + \frac{1}{(C+1)C} + \frac{1}{C(C-1)} + \cdots + \frac{1}{2 \cdot 1}\Big] \le \frac{1}{(t+1)t}\sum_{i=1}^{C+1}\frac{1}{i^2} \le \frac{\pi^2}{6 t(t+1)}. \qquad (56)
\end{aligned}$$

For the $\frac{\mathbb{E}\|A_{\xi}\|^2}{\zeta}$ part:

$$\begin{aligned}
&\prod_{i=C-1}^{t}(1-\alpha_i)\Big(\frac{1}{C-2} - \frac{1}{C-1}\Big) + \prod_{i=C-2}^{t}(1-\alpha_i)\Big(\frac{1}{C-3} - \frac{1}{C-2}\Big) + \cdots + \prod_{i=2}^{t}(1-\alpha_i)\Big(1 - \frac{1}{2}\Big) \\
&= \frac{C-1}{(t+1)t} - \frac{C-2}{(t+1)t} + \frac{C-2}{(t+1)t} - \frac{C-3}{(t+1)t} + \cdots + \frac{2}{(t+1)t} - \frac{1}{(t+1)t} \\
&= \frac{C-1}{(t+1)t} - \frac{1}{(t+1)t} = \frac{C-2}{t(t+1)}. \qquad (57)
\end{aligned}$$

Combining with Lemma 3 and taking $\gamma_{t+1} = \alpha_t = \frac{2}{t+1}$, we have for all $x$:

$$\begin{aligned}
\mathbb{E}[\Phi(x_{t+1}) - \Phi(x)] &\le \frac{2 D_{\mathcal{U}}}{t+1} + \frac{2\pi^2 L_g \tilde D^2}{3 t(t+1)} + \frac{2(C-2)\mathbb{E}\|A_{\xi}\|^2 \tilde D^2/\zeta}{t(t+1)} + \frac{\sigma^2}{\mu(t+1)} + \gamma_{t+1} D_{\mathcal{U}} \\
&= \frac{2\pi^2 L_g \tilde D^2}{3 t(t+1)} + \frac{2(C-2)\mathbb{E}\|A_{\xi}\|^2 \tilde D^2/\zeta}{t(t+1)} + \frac{4 D_{\mathcal{U}}}{t+1} + \frac{\sigma^2}{\mu(t+1)}. \qquad (58)
\end{aligned}$$

When $0 \le t \le C$, one can simply put $C = t$ in the above, and this completes our proof.

Appendix E. Proof of Proposition 8

Proof

$$\begin{aligned}
\mathbb{E}_{\xi[t]} R(t) &= \mathbb{E}_{\xi[t]} \sum_{i=0}^{t-1} \big[\Phi(x_i, \xi_{i+1}) - \Phi(x^*_t, \xi_{i+1})\big] \\
&= \mathbb{E}_{\xi[t]} \sum_{i=0}^{t-1} \Big\{\big[\Phi(x_i, \xi_{i+1}) - \Phi(x^*)\big] + \big[\Phi(x^*) - \Phi(x^*_t, \xi_{i+1})\big]\Big\} \\
&= \sum_{i=0}^{t-1} \mathbb{E}_{\xi[i+1]} \big[\Phi(x_i, \xi_{i+1}) - \Phi(x^*)\big] + \mathbb{E}_{\xi[t]} \sum_{i=0}^{t-1} \big[\Phi(x^*) - \Phi(x^*_t)\big] + \mathbb{E}_{\xi[t]} \sum_{i=0}^{t-1} \big[\Phi(x^*_t) - \Phi(x^*_t, \xi_{i+1})\big] \\
&\le \sum_{i=0}^{t-1} \mathbb{E}_{\xi[i+1]} \big[\Phi(x_i, \xi_{i+1}) - \Phi(x^*)\big] + \mathbb{E}_{\xi[t]} \sum_{i=0}^{t-1} \big[\Phi(x^*_t) - \Phi(x^*_t, \xi_{i+1})\big] \\
&= \sum_{i=0}^{t-1} \mathbb{E}_{\xi[i]} \big[\Phi(x_i) - \Phi(x^*)\big] + \mathbb{E}_{\xi[t]} \sum_{i=0}^{t-1} \big[\Phi(x^*_t) - \Phi(x^*_t, \xi_{i+1})\big].
\end{aligned}$$

References

Alekh Agarwal, Peter L. Bartlett, P. Ravikumar, and Martin J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 2012.

Francis Bach and Eric Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In NIPS, 2011.

Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):193-202, 2009.

Léon Bottou. Stochastic gradient descent 2.0. URL http://leon.bottou.org/projects/sgd.

Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323-375, 2005.

Gong Chen and Marc Teboulle. Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM J. on Optimization, 3(3), 1993.

I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413-1457, 2004.

Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. arXiv, 2010. URL http://arxiv.org/abs/1012.1367.

John Duchi and Yoram Singer. Efficient online and batch learning using forward backward splitting. JMLR, (10):2899-2934, 2009.

John Duchi, Peter L. Bartlett, and Martin J. Wainwright. Randomized smoothing for stochastic optimization. arXiv, 2011. URL http://arxiv.org/abs/1103.4296.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition, 2009.

Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In COLT, 2011.

Chonghai Hu, James T. Kwok, and Weike Pan. Accelerated gradient methods for stochastic optimization and online learning. In NIPS 22, 2009.

Peter J. Huber. Robust estimation of a location parameter. Annals of Mathematical Statistics, 35(1):73-101, 1964.

G. Lan and S. Ghadimi. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, I: a generic algorithmic framework. SIAM J. on Optimization, 2011.

Guanghui Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 2010. doi: 10.1007/s10107-010-0434-y.

P. L. Lions and B. Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM J. on Numerical Analysis, 16(6):964-979, 1979.

A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley and Sons, 1983.

A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. on Optimization, 19(4):1574-1609, 2009.

Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.

Yurii Nesterov. Excessive gap technique in nonsmooth convex minimization. SIAM J. Optim., 16(1):235-249, 2005a.

Yurii Nesterov. Smooth minimization of non-smooth functions. Math. Program., Ser. A, 103:127-152, 2005b.

Yurii Nesterov. Gradient methods for minimizing composite objective function. Technical Report CORE Discussion Paper 2007/76, Catholic University of Louvain, 2007a.

Yurii Nesterov. Smoothing technique and its applications in semidefinite optimization. Mathematical Programming, 110(2):245-259, 2007b.

Boris T. Polyak and Anatoli B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. on Control and Optimization, 30(4):838-855, 1992.

Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400-407, 1951.

Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML, 2007.

Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Stochastic convex optimization. In COLT, 2009.

Ohad Shamir. Making gradient descent optimal for strongly convex stochastic optimization. In OPT 2011, 2011. URL http://arxiv.org/abs/1109.5647.

S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479-2493, 2009.

Lin Xiao. Dual averaging methods for regularized stochastic learning and online optimization. JMLR, 11:2543-2596, 2010.

Wei Xu. Towards optimal one pass large scale learning with averaged stochastic gradient descent. arXiv, 2011. URL http://arxiv.org/abs/1107.2490.
