Step size adaptation in first-order method for stochastic strongly convex programming

Peng Cheng (pc175@uow.edu.au)

Abstract: We propose a first-order method for stochastic strongly convex optimization that attains an $O(1/n)$ rate of convergence. Our analysis shows that the proposed method is simple, easy to implement, and in the worst case asymptotically four times faster than its peers. We derive this method from several intuitive observations that are generalized from existing first-order optimization methods.
I. PROBLEM SETTING

In this article we seek a numerical algorithm that iteratively approximates the solution $w^*$ of the following strongly convex optimization problem:

$$w^* = \arg\min_{\Gamma_f} f(\cdot) \quad (1)$$

where $f(\cdot): \Gamma_f \to \mathbb{R}$ is an unknown, not necessarily smooth, multivariate, $\lambda$-strongly convex function, and $\Gamma_f$ is its convex domain of definition. The algorithm is not allowed to sample $f(\cdot)$ exactly by any means, since $f(\cdot)$ itself is unknown. Instead, the algorithm can call stochastic oracles $\tilde\omega(\cdot)$ at chosen points $x_1, \dots, x_n$, which are unbiased and independent probabilistic estimators of the first-order local information of $f(\cdot)$ in the vicinity of each $x_i$:

$$\tilde\omega(x_i) = \{\tilde f_i(x_i), \nabla \tilde f_i(x_i)\} \quad (2)$$

where $\nabla$ denotes the random subgradient operator, and the $\tilde f_i(\cdot): \Gamma_f \to \mathbb{R}$ are independent and identically distributed (i.i.d.) functions that satisfy:

$$\text{(unbiased)} \quad \mathbb{E}[\tilde f_i(\cdot)] = f(\cdot) \quad \forall i \quad (3a)$$
$$\text{(i.i.d.)} \quad \mathrm{Cov}(\tilde f_i(\cdot), \tilde f_j(\cdot)) = 0 \quad \forall i \neq j \quad (3b)$$

Solvers for this kind of problem are in high demand among scientists in large-scale computational learning, where the first-order stochastic oracle is the only measurable information about $f(\cdot)$ that scales well with both the dimensionality and the size of the learning problem. For example, a stochastic first-order oracle in structural risk minimization (a.k.a. training a support vector machine) can be readily obtained in $O(1)$ time [1].

II. ALGORITHM

The proposed algorithm itself is quite simple, but comes with a deep proof of convergence. The only improvement over SGD is the selection of the step size in each iteration, which nevertheless results in a substantial boost in performance, as will be shown in the next section.

Algorithm 1
    Receive $x_1$, $\Gamma_f$, $\lambda$
    $u_1 \leftarrow 1$, $y_1 \leftarrow x_1$
    Receive $\tilde f_1(x_1)$, $\nabla \tilde f_1(x_1)$
    $\tilde P_1(\cdot) \leftarrow \tilde f_1(x_1) + \langle \nabla \tilde f_1(x_1), \cdot - x_1 \rangle + \frac{\lambda}{2}\|\cdot - x_1\|^2$ on $\Gamma_f$
    for $i = 2, \dots, n$ do
        $x_i \leftarrow \arg\min \tilde P_{i-1}(\cdot)$
        Receive $\tilde f_i(x_i)$, $\nabla \tilde f_i(x_i)$
        $\tilde p_i(\cdot) \leftarrow \tilde f_i(x_i) + \langle \nabla \tilde f_i(x_i), \cdot - x_i \rangle + \frac{\lambda}{2}\|\cdot - x_i\|^2$ on $\Gamma_f$
        $\tilde P_i(\cdot) \leftarrow (1 - \frac{u_{i-1}}{2})\,\tilde P_{i-1}(\cdot) + \frac{u_{i-1}}{2}\,\tilde p_i(\cdot)$
        $y_i \leftarrow (1 - \frac{u_{i-1}}{2})\,y_{i-1} + \frac{u_{i-1}}{2}\,x_i$
        $u_i \leftarrow u_{i-1} - \frac{u_{i-1}^2}{4}$
    end for
    Output $y_n$

III. ANALYSIS

The proposed algorithm is designed to generate an output $y$ that reduces the suboptimality

$$S(y) = f(y) - \min f(\cdot) \quad (4)$$

as fast as possible after a given number of operations. We derive the algorithm from several intuitive observations that are generalized from existing first-order methods. First, we start from worst-case upper bounds of $S(y)$ in deterministic programming.

Lemma 1 (Cutting-plane bound [4]). Given $n$ deterministic oracles $\Omega_n = \{\omega(x_1), \dots, \omega(x_n)\}$ defined by:

$$\omega(x_i) = \{f(x_i), \nabla f(x_i)\} \quad (5)$$

if $f(\cdot)$ is a $\lambda$-strongly convex function, then $\min f(\cdot)$ is unimprovably lower-bounded by:

$$\min f(\cdot) \ge \max_{i=1,\dots,n} p_i(w^*) \ge \min \max_{i=1,\dots,n} p_i(\cdot) \quad (6)$$

where $w^*$ is the unknown minimizer defined by (1), and the $p_i(\cdot): \Gamma_f \to \mathbb{R}$ are proximity control functions (or simply prox-functions) defined by $p_i(\cdot) = f(x_i) + \langle \nabla f(x_i), \cdot - x_i \rangle + \frac{\lambda}{2}\|\cdot - x_i\|^2$.

Proof: By strong convexity of $f(\cdot)$ we have $B_f(\cdot \,\|\, x_i) \ge \frac{\lambda}{2}\|\cdot - x_i\|^2$, which implies:

$$f(\cdot) \ge p_i(\cdot) \implies f(\cdot) \ge \max_{i=1,\dots,n} p_i(\cdot) \quad (7)$$
$$\implies \min f(\cdot) = f(w^*) \ge \max_{i=1,\dots,n} p_i(w^*) \ge \min \max_{i=1,\dots,n} p_i(\cdot) \quad (8)$$

where $B_f(x_1 \,\|\, x_2) = f(x_1) - f(x_2) - \langle \nabla f(x_2), x_1 - x_2 \rangle$ denotes the Bregman divergence between two points $x_1, x_2 \in \Gamma_f$. Both sides of (7) and (8) become equal if $f(\cdot) = \max_{i=1,\dots,n} p_i(\cdot)$, so this bound cannot be improved without extra conditions.

Lemma 2 (Jensen's inequality for strongly convex functions). Given $n$ deterministic oracles $\Omega_n = \{\omega(x_1), \dots, \omega(x_n)\}$ defined by (5), if $f(\cdot)$ is a $\lambda$-strongly convex function, then for all $\alpha_1, \dots, \alpha_n$ satisfying $\sum_{i=1}^n \alpha_i = 1$, $\alpha_i \ge 0 \ \forall i$, $f(y)$ is unimprovably upper-bounded by:

$$f(y) \le \sum_{i=1}^n \alpha_i f(x_i) - \frac{\lambda}{2}\sum_{i=1}^n \alpha_i \|x_i - y\|^2 \quad (9)$$

where $y = \sum_{i=1}^n \alpha_i x_i$.

Proof: By strong convexity of $f(\cdot)$ we have $B_f(x_i \,\|\, y) \ge \frac{\lambda}{2}\|x_i - y\|^2$, hence:

$$f(y) \le f(x_i) - \langle \nabla f(y), x_i - y \rangle - \frac{\lambda}{2}\|x_i - y\|^2$$
$$\implies f(y) \le \sum_{i=1}^n \alpha_i f(x_i) - \langle \nabla f(y), \sum_{i=1}^n \alpha_i x_i - y \rangle - \frac{\lambda}{2}\sum_{i=1}^n \alpha_i \|x_i - y\|^2 \le \sum_{i=1}^n \alpha_i f(x_i) - \frac{\lambda}{2}\sum_{i=1}^n \alpha_i \|x_i - y\|^2$$

Both sides of all the above inequalities become equal if $f(\cdot) = \frac{\lambda}{2}\|\cdot - y\|^2 + \langle c_1, \cdot \rangle + c_2$, where $c_1$ and $c_2$ are constants, so this bound cannot be improved without extra conditions.

Immediately, the optimal $A$ that yields the lowest upper bound of $f(y)$ is given by:

$$A = \arg\min_{\sum_i \alpha_i = 1,\ \alpha_i \ge 0} \left\{ \sum_{i=1}^n \alpha_i f(x_i) - \frac{\lambda}{2}\sum_{i=1}^n \alpha_i \Big\|x_i - \sum_{j=1}^n \alpha_j x_j\Big\|^2 \right\} \quad (10)$$

Combining with (4) and (6), we have a deterministic upper bound of $S(y)$:

$$S(y) \le \min_{\sum_i \alpha_i = 1,\ \alpha_i \ge 0} \left\{ \sum_{i=1}^n \alpha_i f(x_i) - \frac{\lambda}{2}\sum_{i=1}^n \alpha_i \Big\|x_i - \sum_{j=1}^n \alpha_j x_j\Big\|^2 \right\} - \max_{i=1,\dots,n} p_i(w^*) \quad (11)$$

This bound is of little use by itself, as we are only interested in bounds in stochastic programming.
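As a concrete illustration, Algorithm 1 can be sketched in Python for the unconstrained case $\Gamma_f = \mathbb{R}^d$, where $\arg\min \tilde P_i$ has a closed form because $\tilde P_i$ is a convex combination of quadratics with Hessian $\lambda I$. This is a minimal sketch under stated assumptions: the function names, the synthetic quadratic test objective, and the Gaussian subgradient noise are ours, not part of the paper.

```python
import numpy as np

def algorithm1(oracle, x1, lam, n):
    """Sketch of Algorithm 1 with Gamma_f = R^d (no projection step).

    `oracle(x)` returns (f_tilde(x), subgradient): an unbiased stochastic
    first-order oracle as in eq. (2).  Since each prox-function has Hessian
    lam*I, argmin P_tilde = -v/lam, where v accumulates the convex
    combination of (grad_i - lam * x_i) with the same weights as P_tilde.
    """
    x = np.asarray(x1, dtype=float)
    y = x.copy()
    u = 1.0
    _, g = oracle(x)
    v = g - lam * x                       # P_tilde_1 tracked via v
    for _ in range(2, n + 1):
        x = -v / lam                      # x_i <- argmin P_tilde_{i-1}
        _, g = oracle(x)                  # call a fresh stochastic oracle
        v = (1 - u / 2) * v + (u / 2) * (g - lam * x)   # P_tilde update
        y = (1 - u / 2) * y + (u / 2) * x               # running weighted average
        u = u - u ** 2 / 4                # step-size recursion
    return y

# Illustrative demo: f(x) = 0.5 * ||x - w_star||^2 (lambda = 1) with
# additive Gaussian noise on the subgradient.
rng = np.random.default_rng(0)
w_star = np.array([1.0, 2.0])

def noisy_oracle(x):
    value = 0.5 * float(np.dot(x - w_star, x - w_star))
    grad = (x - w_star) + 0.1 * rng.standard_normal(2)
    return value, grad

y = algorithm1(noisy_oracle, np.zeros(2), lam=1.0, n=2000)
```

On this toy problem the output $y$ lands close to $w^*$, consistent with the $O(1/n)$ suboptimality bound derived below; a projection onto $\Gamma_f$ would have to be added for constrained domains.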
The next lemma shows how it can be generalized to the latter case.

Lemma 3. Given $n$ stochastic oracles $\tilde\Omega_n = \{\tilde\omega(x_1), \dots, \tilde\omega(x_n)\}$ defined by (2), if $y(\cdot,\dots,\cdot): H^n \times \Gamma_f^n \to \Gamma_f$ and $U(\cdot,\dots,\cdot): H^n \times \Gamma_f^n \to \mathbb{R}$ are functionals of the $\tilde f_i(\cdot)$ and $x_i$ that satisfy:

$$U(f,\dots,f,x_1,\dots,x_n) \ge S(y(f,\dots,f,x_1,\dots,x_n)) \quad (12a)$$
$$U(\tilde f_1,\dots,\tilde f_n,x_1,\dots,x_n) \text{ is convex w.r.t. } \tilde f_1,\dots,\tilde f_n \quad (12b)$$
$$\mathbb{E}\big[\langle \nabla_{f,\dots,f} U(f,\dots,f,x_1,\dots,x_n), [\tilde f_1 - f, \dots, \tilde f_n - f]^T \rangle\big] \le 0 \quad (12c)$$

then $\mathbb{E}[S(y(f,\dots,f,x_1,\dots,x_n))]$ is upper-bounded by $U(\tilde f_1,\dots,\tilde f_n,x_1,\dots,x_n)$.

Proof: Let $\delta_i(\cdot): \Gamma_f \to \mathbb{R}$ be perturbation functions defined by

$$\delta_i(\cdot) = \tilde f_i(\cdot) - f(\cdot) \quad (13)$$

Then we have:

$$U(\tilde f_{1,\dots,n}, x_{1,\dots,n}) = U(f+\delta_1, \dots, f+\delta_n, x_{1,\dots,n})$$
$$\ge U(f,\dots,f,x_{1,\dots,n}) + \langle \nabla_{f,\dots,f} U(f,\dots,f,x_{1,\dots,n}), [\delta_{1,\dots,n}]^T \rangle \quad \text{(by (12b))}$$
$$\ge S(y(f,\dots,f,x_{1,\dots,n})) + \langle \nabla_{f,\dots,f} U(f,\dots,f,x_{1,\dots,n}), [\delta_{1,\dots,n}]^T \rangle \quad \text{(by (12a))} \quad (14)$$

Moving the $\delta_i$ term to the left side and taking expectations:

$$\mathbb{E}[S(y(f,\dots,f,x_{1,\dots,n}))] \le U(\tilde f_{1,\dots,n}, x_{1,\dots,n}) + \mathbb{E}\big[\langle \nabla_{f,\dots,f} U(f,\dots,f,x_{1,\dots,n}), [\delta_{1,\dots,n}]^T \rangle\big] \le U(\tilde f_{1,\dots,n}, x_{1,\dots,n}) \quad \text{(by (12c))}$$

Clearly, according to (12b), setting:

$$U(\tilde f_{1,\dots,n}, x_{1,\dots,n}) = \min_{\sum_i \alpha_i = 1,\ \alpha_i \ge 0} \left\{ \sum_{i=1}^n \alpha_i \tilde f_i(x_i) - \frac{\lambda}{2}\sum_{i=1}^n \alpha_i \Big\|x_i - \sum_{j=1}^n \alpha_j x_j\Big\|^2 \right\} - \max_{i=1,\dots,n} \tilde p_i(w^*) \quad (15)$$

by substituting $f(\cdot)$ and $p_i(\cdot)$ in (11) respectively with $\tilde f_i(\cdot)$ defined by (3) and $\tilde p_i(\cdot): \Gamma_f \to \mathbb{R}$ defined by:

$$\tilde p_i(\cdot) = \tilde f_i(x_i) + \langle \nabla \tilde f_i(x_i), \cdot - x_i \rangle + \frac{\lambda}{2}\|\cdot - x_i\|^2 \quad (16)$$

is not an option, because $\min_{\sum \alpha_i = 1,\ \alpha_i \ge 0}\{\cdot\}$ and $-\max_{i=1,\dots,n}\{\cdot\}$ are both concave, $\sum_{i=1}^n \alpha_i \tilde f_i(x_i)$ and $\tilde p_i(w^*)$ are both linear in $\tilde f_i(\cdot)$, and $\frac{\lambda}{2}\sum_{i=1}^n \alpha_i \|x_i - \sum_j \alpha_j x_j\|^2$ does not depend on $\tilde f_i(\cdot)$. This prevents asymptotically fast cutting-plane/bundle methods [4], [8], [2] from being directly applied to stochastic oracles without any loss of performance. As a result, to decrease (4) and satisfy (12b), our options boil down to replacing $\min_{\sum \alpha_i = 1,\ \alpha_i \ge 0}\{\cdot\}$ and $-\max_{i=1,\dots,n}\{\cdot\}$ in (15) with their respective lowest convex upper bounds:

$$U_{(A,B)}(\tilde f_{1,\dots,n}, x_{1,\dots,n}) = \sum_{i=1}^n \alpha_i \tilde f_i(x_i) - \frac{\lambda}{2}\sum_{i=1}^n \alpha_i \Big\|x_i - \sum_{j=1}^n \alpha_j x_j\Big\|^2 - \sum_{i=1}^n \beta_i \tilde p_i(w^*)$$
$$= \sum_{i=1}^n \alpha_i \tilde f_i(x_i) - \frac{\lambda}{2}\sum_{i=1}^n \alpha_i \Big\|x_i - \sum_{j=1}^n \alpha_j x_j\Big\|^2 - \tilde P_n(w^*) \quad (17)$$

where $\tilde P_n(\cdot): \Gamma_f \to \mathbb{R}$ is defined by:

$$\tilde P_n(\cdot) = \sum_{i=1}^n \beta_i \tilde p_i(\cdot) \quad (18)$$

and $A = [\alpha_1,\dots,\alpha_n]^T$, $B = [\beta_1,\dots,\beta_n]^T$ are constant $n$-dimensional vectors, with each $\alpha_i$, $\beta_i$ satisfying:

$$\sum_{i=1}^n \alpha_i = 1, \quad \alpha_i \ge 0 \ \forall i \quad (19a)$$
$$\sum_{i=1}^n \beta_i = 1, \quad \beta_i \ge 0 \ \forall i \quad (19b)$$

Accordingly, $y(\tilde f_1,\dots,\tilde f_n,x_1,\dots,x_n)$ can be set to:

$$y_{(A,B)}(x_1,\dots,x_n) = \sum_{i=1}^n \alpha_i x_i \quad (20)$$

such that (12a) is guaranteed by Lemma 2. It should be noted that $A$ and $B$ must both be constant vectors, independent of all stochastic variables; otherwise the convexity condition (12b) may be lost. For example, if we always set the $\beta_i$ as the solution of the following problem:

$$B = \arg\max_{\sum_i \beta_i = 1,\ \beta_i \ge 0} \left\{ \sum_{i=1}^n \beta_i \tilde p_i(w^*) \right\}$$

then $\tilde P_n(w^*)$ will be no different from the cutting-plane bound (6). Finally, (12c) can be validated directly by substituting (17) back into (12c):

$$\langle \nabla_{f,\dots,f} U_{(A,B)}(f,\dots,f,x_{1,\dots,n}), [\delta_{1,\dots,n}]^T \rangle = \sum_{i=1}^n \big[ (\alpha_i - \beta_i)\,\delta_i(x_i) - \langle \nabla \delta_i(x_i), \beta_i(w^* - x_i) \rangle \big] \quad (21)$$

Clearly $\mathbb{E}[(\alpha_i - \beta_i)\delta_i(x_i)] = 0$ and $\mathbb{E}[\langle \nabla \delta_i(x_i), w^* \rangle] = 0$ are easily satisfied, because $\alpha_i$ and $\beta_i$ are already set to constants to enforce (12b), by definition $w^* = \arg\min f(\cdot)$ is a deterministic (yet unknown) variable in our problem setting, and both $\delta_i(x_i)$ and $\nabla \delta_i(x_i)$ are unbiased according to (3a). Bounding $\mathbb{E}[\langle \nabla \delta_i(x_i), x_i \rangle]$ is a bit harder but still possible: in any optimization algorithm, each $x_i$ is either a constant or chosen from $\Gamma_f$ based on the previous $\tilde f_1(\cdot), \dots, \tilde f_{i-1}(\cdot)$ ($x_i$ cannot be based on $\tilde f_i(\cdot)$, which is still unknown by the time $x_i$ is chosen). By the i.i.d. condition (3b), these are all independent of $\tilde f_i(\cdot)$, which implies that $x_i$ is also independent of $\tilde f_i(x_i)$:

$$\mathbb{E}[\langle \nabla \delta_i(x_i), x_i \rangle] = 0 \quad (22)$$

As a result, we conclude that (21) satisfies $\mathbb{E}[\langle \nabla_{f,\dots,f} U_{(A,B)}(f,\dots,f,x_{1,\dots,n}), [\delta_{1,\dots,n}]^T \rangle] = 0$, and subsequently $U_{(A,B)}$ defined by (17) satisfies all three conditions of Lemma 3. At this point we may construct an algorithm that uniformly reduces $\max_{w^*} U_{(A,B)}$ by iteratively calling new stochastic oracles and updating $A$ and $B$. Our main result is summarized in the following theorem:

Theorem 1. For any $\lambda$-strongly convex function $f(\cdot)$, assume that at some stage of an algorithm, $n$ stochastic oracles $\tilde\omega(x_1), \dots, \tilde\omega(x_n)$ have been called to yield a point $y_{(A^n,B^n)}$ defined by (20) and an upper bound $\hat U_{(A^n,B^n)}$ defined by:

$$\hat U_{(A^n,B^n)}(\tilde f_{1,\dots,n}, x_{1,\dots,n}) = U_{(A^n,B^n)}(\tilde f_{1,\dots,n}, x_{1,\dots,n}) + \frac{\lambda}{2}\sum_{i=1}^n \alpha_i \Big\|x_i - \sum_{j=1}^n \alpha_j x_j\Big\|^2 + \tilde P_n(w^*) - \min \tilde P_n(\cdot)$$
$$= \sum_{i=1}^n \alpha_i \tilde f_i(x_i) - \min \tilde P_n(\cdot) \quad (23)$$

where $A^n$ and $B^n$ are constant vectors satisfying (19) (here $n$ in $A^n$ and $B^n$ denotes a superscript index, not an exponent). If the algorithm calls another stochastic oracle $\tilde\omega(x_{n+1})$ at a new point $x_{n+1}$ given by:

$$x_{n+1} = \arg\min \tilde P_n(\cdot)$$

and updates $A$ and $B$ by:

$$A^{n+1} = \Big[ \big(1 - \tfrac{\lambda}{G^2}\hat U_{(A^n,B^n)}\big)(A^n)^T,\ \tfrac{\lambda}{G^2}\hat U_{(A^n,B^n)} \Big]^T \quad (24a)$$
$$B^{n+1} = \Big[ \big(1 - \tfrac{\lambda}{G^2}\hat U_{(A^n,B^n)}\big)(B^n)^T,\ \tfrac{\lambda}{G^2}\hat U_{(A^n,B^n)} \Big]^T \quad (24b)$$

where $G = \max \|\nabla \tilde f_i(\cdot)\|$, then $\hat U_{(A^{n+1},B^{n+1})}$ is bounded by:

$$\hat U_{(A^{n+1},B^{n+1})} \le \hat U_{(A^n,B^n)} - \frac{\lambda}{2G^2}\hat U^2_{(A^n,B^n)} \quad (25)$$

Proof: First, optimizing and caching all elements of $A^n$ or $B^n$ takes at least $O(n)$ time and space, which is not feasible in large-scale problems. So we confine our choices of $A^{n+1}$ and $B^{n+1}$ to:

$$[\alpha^{n+1}_1, \dots, \alpha^{n+1}_n]^T = (1 - \alpha^{n+1}_{n+1})[\alpha^n_1, \dots, \alpha^n_n]^T \quad (26a)$$
$$[\beta^{n+1}_1, \dots, \beta^{n+1}_n]^T = (1 - \beta^{n+1}_{n+1})[\beta^n_1, \dots, \beta^n_n]^T \quad (26b)$$

such that the previous $\sum_{i=1}^n \alpha_i \tilde f_i(x_i)$ and $\sum_{i=1}^n \beta_i \tilde p_i(\cdot)$ can be accumulated across iterations, producing a 1-memory algorithm instead of an $\infty$-memory one, without violating (19). Consequently $\hat U_{(A^{n+1},B^{n+1})}$ can be decomposed into:

$$\hat U_{(A^{n+1},B^{n+1})} = \sum_{i=1}^{n+1} \alpha^{n+1}_i \tilde f_i(x_i) - \min \tilde P_{n+1}(\cdot) \quad \text{(by (16), (18), (19))}$$
$$\le \sum_{i=1}^{n+1} \alpha^{n+1}_i \tilde f_i(x_i) - \left[ \sum_{i=1}^{n+1} \beta^{n+1}_i \tilde p_i(\tilde x^*_n) - \frac{1}{2\lambda}\Big\| \nabla \sum_{i=1}^{n+1} \beta^{n+1}_i \tilde p_i(\tilde x^*_n) \Big\|^2 \right] \quad \text{(by (26))}$$
$$= \left[ (1 - \alpha^{n+1}_{n+1}) \sum_{i=1}^n \alpha^n_i \tilde f_i(x_i) - (1 - \beta^{n+1}_{n+1}) \sum_{i=1}^n \beta^n_i \tilde p_i(\tilde x^*_n) \right] + \left[ \alpha^{n+1}_{n+1} \tilde f_{n+1}(x_{n+1}) - \beta^{n+1}_{n+1} \tilde p_{n+1}(\tilde x^*_n) \right] + \frac{(\beta^{n+1}_{n+1})^2}{2\lambda} \|\nabla \tilde p_{n+1}(\tilde x^*_n)\|^2$$

where $\tilde x^*_n = \arg\min \tilde P_n(\cdot)$. Setting $\alpha^{n+1}_{n+1} = \beta^{n+1}_{n+1}$ and $x_{n+1} = \tilde x^*_n$ eliminates the second term:

$$\hat U_{(A^{n+1},B^{n+1})} = (1 - \alpha^{n+1}_{n+1}) \left( \sum_{i=1}^n \alpha^n_i \tilde f_i(x_i) - \tilde P_n(\tilde x^*_n) \right) + \frac{(\alpha^{n+1}_{n+1})^2}{2\lambda} \|\nabla \tilde p_{n+1}(x_{n+1})\|^2$$
$$= (1 - \alpha^{n+1}_{n+1})\, \hat U_{(A^n,B^n)} + \frac{(\alpha^{n+1}_{n+1})^2}{2\lambda} \|\nabla \tilde f_{n+1}(x_{n+1})\|^2 \quad \text{(by (16))}$$
$$\le (1 - \alpha^{n+1}_{n+1})\, \hat U_{(A^n,B^n)} + \frac{(\alpha^{n+1}_{n+1})^2 G^2}{2\lambda} \quad (G \ge \|\nabla \tilde f_i(\cdot)\|) \quad (27)$$

Let $u_i = \frac{2\lambda}{G^2}\hat U_{(A^i,B^i)}$. Minimizing the right side of (27) over $\alpha^{n+1}_{n+1}$ yields:

$$\alpha^{n+1}_{n+1} = \arg\min_\alpha \{(1-\alpha)u_n + \alpha^2\} = \frac{u_n}{2} = \frac{\lambda}{G^2}\hat U_{(A^n,B^n)} \quad (28)$$

In this case:

$$u_{n+1} \le u_n - \frac{u_n^2}{4} \quad (29)$$
$$\implies \hat U_{(A^{n+1},B^{n+1})} \le \hat U_{(A^n,B^n)} - \frac{\lambda}{2G^2}\hat U^2_{(A^n,B^n)}$$

Given an arbitrary initial oracle $\tilde\omega(x_1)$, applying the updating rule of Theorem 1 recursively results in Algorithm 1; accordingly, we can prove its asymptotic behavior by induction:

Corollary 1. The final point $y_n$ obtained by applying Algorithm 1 to an arbitrary $\lambda$-strongly convex function $f(\cdot)$ has the following worst-case rate of convergence:

$$\mathbb{E}[f(y_n)] - \min f(\cdot) \le \frac{2G^2}{\lambda(n+3)}$$

Proof: First, by (29) we have:

$$\frac{1}{u_{n+1}} \ge \frac{1}{u_n\big(1 - \frac{u_n}{4}\big)} = \frac{1}{u_n} + \frac{1}{4 - u_n} \ge \frac{1}{u_n} + \frac{1}{4} \quad (30)$$

On the other hand, by strong convexity, for all $x_1 \in \Gamma_f$ we have:

$$f(x_1) - \min f(\cdot) \le \frac{1}{2\lambda}\|\nabla f(x_1)\|^2 \le \frac{G^2}{2\lambda} \quad (31)$$

Setting $\hat U_{(1,1)} = \frac{G^2}{2\lambda}$ (i.e. $u_1 = 1$) as the initial condition and applying (30) recursively induces the following bound:

$$\frac{1}{u_n} \ge 1 + \frac{n-1}{4} = \frac{n+3}{4} \implies u_n \le \frac{4}{n+3} \implies \hat U_{(A^n,B^n)} \le \frac{2G^2}{\lambda(n+3)}$$
$$\implies \mathbb{E}[f(y_n)] - \min f(\cdot) \le \frac{2G^2}{\lambda(n+3)} - \frac{\lambda}{2}\sum_{i=1}^n \alpha^n_i \|x_i - y_n\|^2 - \big( \tilde P_n(w^*) - \min \tilde P_n(\cdot) \big)$$

This worst-case rate of convergence is four times faster than Epoch-GD ($\frac{8G^2}{\lambda n}$) [3], [5] or cutting-plane/bundle methods ($\frac{8G^2}{\lambda}\big[n + 2 - \log_2 \frac{\lambda f(x_1)}{4G^2}\big]^{-1}$) [4], [8], [2], and is faster than SGD ($\frac{\ln(n)\, G^2}{2\lambda n}$) [1], [6] by an unbounded factor.

IV. HIGH PROBABILITY BOUND

An immediate consequence of Corollary 1 is the following high-probability bound, obtained from Markov's inequality:

$$\Pr\left( S(y_n) \ge \frac{2G^2}{\lambda(n+3)\,\eta} \right) \le \eta \quad (32)$$

where $1 - \eta \in [0,1]$ denotes the confidence that the result $y_n$ reaches the desired suboptimality. In most cases (particularly when $\eta \approx 0$, as demanded by most applications) this bound is very loose and cannot demonstrate the true performance of the proposed algorithm. In this section we derive several high-probability bounds that are much less sensitive to small $\eta$ than (32).

Corollary 2. The final point $y_n$ obtained by applying Algorithm 1 to an arbitrary $\lambda$-strongly convex function $f(\cdot)$ has the following high-probability bounds:

$$\Pr\left( S(y_n) \ge t + \frac{2G^2}{\lambda(n+3)} \right) \le \exp\left( -\frac{t^2(n+2)}{16 D^2 \sigma^2} \right) \quad (33a)$$
$$\Pr\left( S(y_n) \ge t + \frac{2G^2}{\lambda(n+3)} \right) \le \frac{1}{2}\exp\left( -\frac{t^2(n+2)}{8 \tilde G^2 D^2} \right) \quad (33b)$$
$$\Pr\left( S(y_n) \ge t + \frac{2G^2}{\lambda(n+3)} \right) \le \exp\left\{ -\frac{t(n+2)}{4\tilde G D} \ln\left( 1 + \frac{t\tilde G}{2 D \sigma^2} \right) \right\} \quad (33c)$$

where the constants $\tilde G = \max \|\nabla \delta_i(\cdot)\|$ and $\sigma^2 = \max \mathrm{Var}(\nabla \tilde f_i(\cdot))$ are the maximal range and variance of each stochastic subgradient respectively, and $D = \max_{x_1,x_2 \in \Gamma_f} \|x_1 - x_2\|$ is the largest distance between two points in $\Gamma_f$.
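The step-size recursion $u_{i+1} = u_i - u_i^2/4$ that drives these rates can be sanity-checked numerically against the closed-form bound $u_n \le 4/(n+3)$ from Corollary 1. The short standalone sketch below (the function name is ours) iterates the recursion from $u_1 = 1$ and verifies the bound:

```python
def step_sizes(n):
    """Step sizes u_1, ..., u_n of Algorithm 1: u_1 = 1, u_{i+1} = u_i - u_i^2/4."""
    u = [1.0]
    for _ in range(n - 1):
        u.append(u[-1] - u[-1] ** 2 / 4)
    return u

# Corollary 1 rests on the closed-form bound u_n <= 4/(n+3);
# verify it for the first 1000 iterations (index i corresponds to u_{i+1}).
us = step_sizes(1000)
assert all(u <= 4.0 / (i + 4) for i, u in enumerate(us))
```

The recursion tracks the closed form closely from below, which is what makes the $O(1/n)$ rate of Corollary 1 tight up to lower-order terms.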
Proof: We start by expanding the right side of (14). Setting $A^n = B^n$ and substituting (21) back into (14) yields:

$$S(y_n) \le U_{(A^n,A^n)}(\tilde f_{1,\dots,n}, x_{1,\dots,n}) - \sum_{i=1}^n \alpha^n_i \langle \nabla \delta_i(x_i), x_i - w^* \rangle$$
$$\le \frac{2G^2}{\lambda(n+3)} + \sum_{i=1}^n \alpha^n_i r_i \quad \text{(by Corollary 1)} \quad (34)$$

with each $r_i = -\langle \nabla \delta_i(x_i), x_i - w^* \rangle$ satisfying:

$$\text{(Cauchy--Schwarz inequality)} \quad -\|\nabla \delta_i(x_i)\|\,\|x_i - w^*\| \le r_i \le \|\nabla \delta_i(x_i)\|\,\|x_i - w^*\| \implies -\tilde G D \le r_i \le \tilde G D \quad (35)$$

$$\mathrm{Var}(r_i) = \mathbb{E}\big[(\langle \nabla \delta_i(x_i), x_i - w^* \rangle - \mathbb{E}[\langle \nabla \delta_i(x_i), x_i - w^* \rangle])^2\big]$$
$$= \mathbb{E}\big[(\langle \nabla \delta_i(x_i), x_i - w^* \rangle)^2\big] \quad \text{(by (22))}$$
$$\le \mathbb{E}\big[\|\nabla \delta_i(x_i)\|^2 \|x_i - w^*\|^2\big] \le D^2\, \mathbb{E}\big[\|\nabla \delta_i(x_i)\|^2\big] = D^2\, \mathrm{Var}\big(\nabla \tilde f_i(x_i)\big) \le D^2 \sigma^2 \quad (36)$$

This immediately exposes $S(y_{(A^n,A^n)}(x_{1,\dots,n}))$ to several inequalities in non-parametric statistics that bound the probability of sums of independent random variables:

$$\text{(generalized Chernoff bound)} \quad \Pr\left( \sum_{i=1}^n \alpha^n_i r_i \ge t \right) \le \exp\left( -\frac{t^2}{4\, \mathrm{Var}(\sum_{i=1}^n \alpha^n_i r_i)} \right) \le \exp\left( -\frac{t^2}{4 D^2 \sigma^2 \sum_{i=1}^n (\alpha^n_i)^2} \right) \quad \text{(by (36))} \quad (37a)$$

$$\text{(Azuma--Hoeffding inequality)} \quad \Pr\left( \sum_{i=1}^n \alpha^n_i r_i \ge t \right) \le \frac{1}{2}\exp\left( -\frac{2t^2}{\sum_{i=1}^n (\alpha^n_i)^2 (\max r_i - \min r_i)^2} \right) \le \frac{1}{2}\exp\left\{ -\frac{2t^2}{4 \tilde G^2 D^2 \sum_{i=1}^n (\alpha^n_i)^2} \right\} \quad \text{(by (35))} \quad (37b)$$

$$\text{(Bennett's inequality)} \quad \Pr\left( \sum_{i=1}^n \alpha^n_i r_i \ge t \right) \le \exp\left\{ -\frac{t}{2 \tilde G D \max_i \alpha^n_i} \ln\left( 1 + \frac{t \tilde G \max_i \alpha^n_i}{D \sigma^2 \sum_{i=1}^n (\alpha^n_i)^2} \right) \right\} \quad \text{(by (35), (36))} \quad (37c)$$

In the case of Algorithm 1, if $A^n$ is recursively updated by (24), then any two consecutive $\alpha^n_i$ have the following property:

$$\frac{\alpha^n_{i+1}}{\alpha^n_i} = \frac{\alpha^{i+1}_{i+1}}{\alpha^{i+1}_i} = \frac{\alpha^{i+1}_{i+1}}{\alpha^i_i (1 - \alpha^{i+1}_{i+1})} \quad \text{(by (24))}$$
$$= \frac{u_{i-1} - \frac{u_{i-1}^2}{4}}{u_{i-1}\big(1 - \frac{u_{i-1}}{2} + \frac{u_{i-1}^2}{8}\big)} > 1 \quad (u_{i-1} \le 1) \quad \text{(by (28), (29))}$$
$$\implies \alpha^n_{i+1} > \alpha^n_i \implies \max_i \alpha^n_i = \alpha^n_n = \frac{u_{n-1}}{2} \le \frac{2}{n+2} \quad (38)$$

Accordingly, $\sum_{i=1}^n (\alpha^n_i)^2$ can be bounded by:

$$\sum_{i=1}^n (\alpha^n_i)^2 \le n (\alpha^n_n)^2 \le \frac{4n}{(n+2)^2} \le \frac{4}{n+2} \quad (39)$$

Finally, combining (34), (37), (38) and (39) yields the proposed high-probability bounds (33).

By definition, $\tilde G$ and $\sigma$ are both upper-bounded by $G$. And if $\Gamma_f$ is undefined, then by combining the strong convexity condition $B_f(x_1 \,\|\, \arg\min f(\cdot)) = f(x_1) - \min f(\cdot) \ge \frac{\lambda}{2}\|x_1 - \arg\min f(\cdot)\|^2$ with (31), we can still set $\Gamma_f = \{\, \cdot : \|\cdot - x_1\|^2 \le \frac{G^2}{\lambda^2} \,\}$ such that $D = \frac{2G}{\lambda}$, while $\arg\min f(\cdot)$ is always included in $\Gamma_f$. Consequently, even in the worst case, (32) can easily be superseded by any of (33), in which $\eta$ decreases exponentially in $t$ instead of inversely proportionally. In most applications both $\tilde G$ and $\sigma$ can be much smaller than $G$, and $\sigma$ can be further reduced if each $\tilde\omega(x_i)$ is estimated by averaging over several stochastic oracles provided simultaneously by a parallel/distributed system.

V. DISCUSSION

In this article we proposed Algorithm 1, a first-order algorithm for stochastic strongly convex optimization that asymptotically outperforms all state-of-the-art algorithms by a factor of four, achieving suboptimality below $S$ using only $\frac{2G^2}{\lambda S} - 3$ iterations and stochastic oracles on average. Theoretically, Algorithm 1 can be generalized to functions that are strongly convex w.r.t. arbitrary norms using the technique proposed in [5], and a slightly different analysis can be used to find optimal methods for strongly smooth (a.k.a. gradient-Lipschitz-continuous) functions, but we leave these to further investigation. We do not know whether this algorithm is optimal and unimprovable, nor whether higher-order algorithms can be discovered using similar analysis.

There are several loose ends we may have failed to scrutinize. The most likely one is the assumption:

$$\max_f S(y) = \max_f \{f(y) - \min f(\cdot)\} \le \max_f f(y) - \min_f \min f(\cdot)$$

However, in fact there is no case in which $\arg\max_f f(y) = \arg\min_f \min f(\cdot)$ for all $y \in \Gamma_f$, so this bound is still far from unimprovable. Another possible one is that we do not know how to bound $\frac{\lambda}{2}\sum_{i=1}^n \alpha^n_i \|x_i - y_n\|^2$ by optimizing $x_n$ and $\alpha^n_n$, so it is isolated from (23) and never participates in the parameter optimization of (27). It can, however, be decomposed into:

$$\sum_{i=1}^{n+1} \alpha^{n+1}_i \|x_i - y_{n+1}\|^2 = \min \sum_{i=1}^{n+1} \alpha^{n+1}_i \|x_i - \cdot\|^2 = \sum_{i=1}^{n+1} \alpha^{n+1}_i \|x_i - y_n\|^2 - \frac{1}{2\lambda}\Big\| \nabla_{y_n}\Big( \sum_{i=1}^{n+1} \alpha^{n+1}_i \|x_i - y_n\|^2 \Big) \Big\|^2$$
$$= \sum_{i=1}^n \alpha^{n+1}_i \|x_i - y_n\|^2 + \alpha^{n+1}_{n+1} \|x_{n+1} - y_n\|^2 - \frac{(\alpha^{n+1}_{n+1})^2}{2\lambda}\|y_n - x_{n+1}\|^2$$
$$= (1 - \alpha^{n+1}_{n+1}) \left[ \sum_{i=1}^n \alpha^n_i \|x_i - y_n\|^2 \right] + \left[ \alpha^{n+1}_{n+1} - \frac{(\alpha^{n+1}_{n+1})^2}{2\lambda} \right] \|y_n - x_{n+1}\|^2$$

such that $\big[\alpha^{n+1}_{n+1} - \frac{(\alpha^{n+1}_{n+1})^2}{2\lambda}\big]\|y_n - x_{n+1}\|^2$ can be added to the right side of (27). Unfortunately, we still do not know how to bound it, but it may prove useful in some alternative problem settings (e.g. in optimization of strongly smooth functions).

Most importantly, if $f(\cdot)$ is $\lambda$-strongly convex and each $\tilde f_i(\cdot)$ can be revealed completely by each oracle (instead of only its first-order information), then the principle of empirical risk minimization (ERM):

$$y_n = \arg\min \sum_{i=1}^n \tilde f_i(\cdot)$$

easily outperforms all state-of-the-art stochastic methods by yielding the best known rate of convergence $\frac{\sigma^2}{2\lambda n}$ [7], and is still more than four times faster than Algorithm 1 (though this is already very close for a first-order method). This immediately raises the question: how do we close this gap? And if first-order methods are not able to do so, how much extra information about each $\tilde f_i(\cdot)$ is required to reduce it? We believe that solutions to these long-term problems are vital to the construction of very large-scale predictors in computational learning, but we are still far from obtaining any of them.

ACKNOWLEDGEMENTS

We want to thank Dr Francesco Orabona and Professor Nathan Srebro for their critical comments on this manuscript. We also want to thank Maren Sievers, Stefan Rothschuh, Raouf Rusmaully and Li He for their proofreading and/or technical comments on a C++ implementation of this method.

REFERENCES

[1] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20, pages 161–168. NIPS Foundation, 2008.
[2] V. Franc and S. Sonnenburg. Optimized cutting plane algorithm for large-scale risk minimization. Journal of Machine Learning Research, 10:2157–2232, 2009.
[3] E. Hazan and S. Kale. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the 24th Annual Conference on Learning Theory, 2011.
[4] T. Joachims. Training linear SVMs in linear time. In Proceedings of the 12th International Conference on Knowledge Discovery and Data Mining, pages 217–226. ACM, 2006.
[5] A. Juditsky and Y. Nesterov. Primal-dual subgradient methods for minimizing uniformly convex functions. Manuscript, August 2010.
[6] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning, pages 807–814. ACM, 2007.
[7] K. Sridharan, N. Srebro, and S. Shalev-Shwartz. Fast rates for regularized objectives. In Advances in Neural Information Processing Systems, volume 22, pages 1545–1552. NIPS Foundation, 2008.
[8] C.H. Teo, S.V.N. Vishwanathan, A. Smola, and Q.V. Le. Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 1:1–55, 2009.