Analysis and Implementation of an Asynchronous Optimization Algorithm for the Parameter Server


Authors: Arda Aytekin, Hamid Reza Feyzmahdavian, Mikael Johansson

Abstract—This paper presents an asynchronous incremental aggregated gradient algorithm and its implementation in a parameter server framework for solving regularized optimization problems. The algorithm can handle both general convex (possibly non-smooth) regularizers and general convex constraints. When the empirical data loss is strongly convex, we establish a linear convergence rate, give explicit expressions for step-size choices that guarantee convergence to the optimum, and bound the associated convergence factors. The expressions have an explicit dependence on the degree of asynchrony and recover classical results under synchronous operation. Simulations and implementations on commercial compute clouds validate our findings.

Index Terms—asynchronous, proximal, incremental, aggregated gradient, linear convergence.

I. INTRODUCTION

Machine learning and optimization theory have enjoyed a fruitful symbiosis over the last decade. On the one hand, since many machine learning tasks can be posed as optimization problems, advances in large-scale optimization (e.g. [1]) have had an immediate and profound impact on machine learning research. On the other hand, the challenges of dealing with huge data sets, often spread over multiple sites, have inspired the machine learning community to develop novel optimization algorithms [2], improve the theory for asynchronous computations [3], and introduce new programming models for parallel and distributed optimization [4]. In this paper, we consider machine learning in the parameter server framework [4].
This is a master-worker architecture, where a central server maintains the current parameter iterates and queries worker nodes for gradients of the loss evaluated on their data. In this setting, we focus on problems of the form

$$\begin{array}{ll} \underset{x}{\text{minimize}} & \displaystyle\sum_{n=1}^{N} f_n(x) + h(x) \\[2mm] \text{subject to} & x \in \mathbb{R}^d. \end{array}$$

Here, the first part of the objective function typically models the empirical data loss, and the second term is a regularizer (for example, an $\ell_1$ penalty to promote sparsity of the solution). Regularized optimization problems arise in many applications in machine learning, signal processing, and statistical estimation. Examples include Tikhonov and elastic net regularization, Lasso, sparse logistic regression, and support vector machines.

In the parameter server framework, Li et al. [4] analyzed a parallel and asynchronous proximal gradient method for non-convex problems and established conditions for convergence to a critical point. Agarwal and Duchi [5], and more recently Feyzmahdavian et al. [6], developed parallel mini-batch optimization algorithms based on asynchronous incremental gradient methods. When the loss functions are strongly convex, which is often the case, it has recently been observed that incremental aggregated methods outperform incremental gradient descent and are, in addition, able to converge to the true optimum even with a constant step-size. Gurbuzbalaban et al. [7] established linear convergence for an incremental aggregated gradient method suitable for implementation in the parameter server framework. However, their analysis does not allow for any regularization term, nor any additional convex constraints.

This paper presents an asynchronous proximal incremental aggregated gradient algorithm and its implementation in the parameter server framework. Our algorithm can handle both general convex regularizers and convex constraints.
We establish linear convergence when the empirical data loss is strongly convex, give explicit expressions for step-size choices that guarantee convergence to the global optimum, and bound the associated convergence factors. These expressions have an explicit dependence on the degree of asynchrony and recover classical results under synchronous operation. We believe that this is a practically and theoretically important addition to existing optimization algorithms for the parameter server architecture.

A. Aytekin, H. R. Feyzmahdavian and M. Johansson are with the Department of Automatic Control, School of Electrical Engineering and ACCESS Linnaeus Center, KTH Royal Institute of Technology, SE-100 44 Stockholm, Sweden. Emails: {aytekin, hamidrez, mikaelj}@kth.se

A. Prior work

Incremental gradient methods for smooth optimization problems have a long tradition, most notably in the training of neural networks via back-propagation. In contrast to gradient methods, which compute the full gradient of the loss function before updating the iterate, incremental gradient methods evaluate the gradients of a single, or possibly a few, component functions in each iteration. Incremental gradient methods can be computationally more efficient than traditional gradient methods, since each step is cheaper yet makes comparable progress on average. However, for global convergence, the step-size needs to diminish to zero, which can lead to slow convergence [8]. If a constant step-size is used, only convergence to an approximate solution can be guaranteed in general [9]. Recently, Blatt, Hero, and Gauchman [10] proposed a method, the incremental aggregated gradient (IAG), that also computes the gradient of a single component function at each iteration.
But rather than updating the iterate based on this information alone, it uses the sum of the most recently evaluated gradients of all component functions. Compared to basic incremental gradient methods, IAG has the advantage that global convergence can be achieved using a constant step-size when each component function is convex quadratic. Later, Gurbuzbalaban, Ozdaglar, and Parrilo [7] proved linear convergence of IAG in a more general setting where the component functions are strongly convex. In a more recent work, Vanli, Gurbuzbalaban and Ozdaglar [11] analyzed the global convergence rate of proximal incremental aggregated gradient methods, where they can provide the linear convergence rate only after sufficiently many iterations. Our result differs from theirs in that we provide the linear convergence rate of the algorithm without any constraints on the iteration count, and we extend the result to general distance functions.

There has been some recent work on the stochastic version of the IAG method (called stochastic average gradient, or SAG), where the component function to update is sampled instead of being chosen in a cyclic order [12]–[14]. Unlike the IAG method, where the linear convergence rate depends on the number of passes through the data, the SAG method achieves a linear convergence rate that depends on the number of iterations. Further, when the number of training examples is sufficiently large, the SAG method allows the use of a very large step-size, which leads to improved theoretical and empirical performance.

II. NOTATION

We let $\mathbb{N}$ and $\mathbb{N}_0$ denote the set of natural numbers and the set of natural numbers including zero, respectively. The inner product of two vectors $x, y \in \mathbb{R}^d$ is denoted by $\langle x, y \rangle$. We assume that $\mathbb{R}^d$ is endowed with a norm $\|\cdot\|$ and use $\|\cdot\|_*$ to represent the corresponding dual norm, defined by $\|y\|_* = \sup_{\|x\| \leq 1} \langle x, y \rangle$.
III. PROBLEM DEFINITION

We consider optimization problems of the form

$$\begin{array}{ll} \underset{x}{\text{minimize}} & \displaystyle\sum_{n=1}^{N} f_n(x) + h(x) \\[2mm] \text{subject to} & x \in \mathbb{R}^d, \end{array} \tag{1}$$

where $x$ is the decision variable, $f_n(x)$ is convex and differentiable for each $n \in \mathcal{N} := \{1, \ldots, N\}$, and $h(x)$ is a proper convex function that may be non-smooth and extended real-valued. The role of the regularization term $h(x)$ is to favor solutions with a certain preferred structure. For example, $h(x) = \lambda_1 \|x\|_1$ with $\lambda_1 > 0$ is often used to promote sparsity in solutions, and

$$h(x) = I_{\mathcal{X}}(x) := \begin{cases} 0 & \text{if } x \in \mathcal{X} \subseteq \mathbb{R}^d, \\ +\infty & \text{otherwise} \end{cases}$$

is used to force the possible solutions to lie in the closed convex set $\mathcal{X}$.

In order to solve (1), we use the proximal incremental aggregated gradient method. In this method, at iteration $k \in \mathbb{N}$, the gradients of all component functions $f_n(x)$, possibly evaluated at stale information $x_{k-\tau_k^n}$, are aggregated:

$$g_k = \sum_{n=1}^{N} \nabla f_n\left(x_{k-\tau_k^n}\right).$$

Then, a proximal step is taken based on the current vector $x_k$, the aggregated gradient $g_k$, and the non-smooth term $h(x)$:

$$x_{k+1} = \underset{x}{\arg\min} \left\{ \langle g_k, x - x_k \rangle + \frac{1}{2\alpha} \|x - x_k\|^2 + h(x) \right\}. \tag{2}$$

The algorithm has a natural implementation in the parameter server framework. The master node maintains the iterate $x$ and performs the proximal steps. Whenever a worker node reports new gradients, the master updates the iterate and informs the worker about the new iterate. Pseudocode for a basic parameter server implementation is given in Algorithms 1 and 2.

Algorithm 1 Master procedure
1: Data: $g_w$ for each worker $w \in \mathcal{W} := \{1, 2, \ldots, W\}$
2: Input: $\alpha(L, \mu, \bar{\tau})$, $K > 0$ and $h(x)$
3: Output: $x_K$
4: Initialize: $k = 0$
5: while $k < K$ do
6:   Wait until a set $\mathcal{R}$ of workers return their gradients
7:   for all $w \in \mathcal{W}$ do
8:     if $w \in \mathcal{R}$ then
9:       Update $g_w \leftarrow \sum_{n \in \mathcal{N}_w} \nabla f_n(x_{k-\tau_k^w})$
10:    else
11:      Keep old $g_w$
12:    end if
13:  end for
14:  Aggregate the incremental gradients: $g_k = \sum_{w \in \mathcal{W}} g_w$
15:  Solve (2) with $g_k$
16:  for all $w \in \mathcal{R}$ do
17:    Send $x_{k+1}$ to worker $w$
18:  end for
19:  Increment $k$
20: end while
21: Signal EXIT
22: Return $x_K$

Algorithm 2 Procedure for each worker $w$
1: Data: $x$, and loss functions $\{f_n(x) : n \in \mathcal{N}_w\}$ with $\bigcup_{w \in \mathcal{W}} \mathcal{N}_w = \mathcal{N}$ and $\mathcal{N}_{w_1} \cap \mathcal{N}_{w_2} = \emptyset$ for all $w_1 \neq w_2 \in \mathcal{W}$
2: repeat
3:   Receive $x \leftarrow x_{k+1}$ from the master
4:   Calculate the incremental gradient (IG) $\sum_{n \in \mathcal{N}_w} \nabla f_n(x)$
5:   Send the IG to the master with a delay of $\tau_k^w$
6: until EXIT received

To establish convergence of the iterates to the global optimum, we impose the following assumptions on Problem (1):

A1) The function $F(x) := \sum_{n=1}^{N} f_n(x)$ is $\mu$-strongly convex, i.e.,
$$F(x) \geq F(y) + \langle \nabla F(y), x - y \rangle + \frac{\mu}{2} \|x - y\|^2 \tag{3}$$
holds for all $x, y \in \mathbb{R}^d$.

A2) Each $f_n$ is convex with an $L_n$-continuous gradient, that is,
$$\|\nabla f_n(x) - \nabla f_n(y)\|_* \leq L_n \|x - y\| \quad \forall x, y \in \mathbb{R}^d.$$
Note that under this assumption, $\nabla F$ is also Lipschitz continuous with $L \leq \sum_{n=1}^{N} L_n$.

A3) $h(x)$ is sub-differentiable everywhere in its effective domain, that is, for all $x, y \in \{z \in \mathbb{R}^d : h(z) < +\infty\}$,
$$h(x) \geq h(y) + \langle s(y), x - y \rangle \quad \forall s(y) \in \partial h(y). \tag{4}$$

A4) The time-varying delays $\tau_k^n$ are bounded, i.e., there is a non-negative integer $\bar{\tau}$ such that $\tau_k^n \in \{0, 1, \ldots, \bar{\tau}\}$ holds for all $k \in \mathbb{N}_0$ and $n \in \mathcal{N}$.

IV. MAIN RESULT

First, we provide a lemma which is key to proving our main result.

Lemma 1.
Assume that the non-negative sequences $\{V_k\}$ and $\{w_k\}$ satisfy the following inequality:

$$V_{k+1} \leq a V_k - b w_k + c \sum_{j=k-k_0}^{k} w_j, \tag{5}$$

for some real numbers $a \in (0, 1)$ and $b, c \geq 0$, and some integer $k_0 \in \mathbb{N}_0$. Assume also that $w_k = 0$ for $k < 0$, and that the following holds:

$$\frac{c}{1-a} \cdot \frac{1 - a^{k_0+1}}{a^{k_0}} \leq b.$$

Then, $V_k \leq a^k V_0$ for all $k \geq 0$.

Proof: To prove the linear convergence of the sequence, we divide both sides of (5) by $a^{k+1}$ and take the sum:

$$\begin{aligned}
\sum_{k=0}^{K} \frac{V_{k+1}}{a^{k+1}} &\leq \sum_{k=0}^{K} \frac{V_k}{a^k} - b \sum_{k=0}^{K} \frac{w_k}{a^{k+1}} + c \sum_{k=0}^{K} \frac{1}{a^{k+1}} \sum_{j=k-k_0}^{k} w_j \\
&= \sum_{k=0}^{K} \frac{V_k}{a^k} - b \sum_{k=0}^{K} \frac{w_k}{a^{k+1}} + \frac{c}{a}\left(w_{-k_0} + w_{-k_0+1} + \cdots + w_0\right) + \frac{c}{a^2}\left(w_{-k_0+1} + w_{-k_0+2} + \cdots + w_1\right) \\
&\quad + \cdots + \frac{c}{a^{K+1}}\left(w_{K-k_0} + w_{K-k_0+1} + \cdots + w_K\right) \\
&\leq \left( c \left(1 + \frac{1}{a} + \cdots + \frac{1}{a^{k_0}}\right) - b \right) \sum_{k=0}^{K} \frac{w_k}{a^{k+1}} + \sum_{k=0}^{K} \frac{V_k}{a^k}, \tag{6}
\end{aligned}$$

where we have used the non-negativity of $w_k$ to obtain (6). If the coefficient of the first sum on the right-hand side of (6) is non-positive, i.e., if

$$c + \frac{c}{a} + \cdots + \frac{c}{a^{k_0}} = \frac{c}{1-a} \cdot \frac{1 - a^{k_0+1}}{a^{k_0}} \leq b,$$

then inequality (6) implies that

$$\frac{V_{K+1}}{a^{K+1}} + \frac{V_K}{a^K} + \cdots + \frac{V_1}{a^1} \leq \frac{V_K}{a^K} + \frac{V_{K-1}}{a^{K-1}} + \cdots + \frac{V_0}{a^0}.$$

Hence, $V_{K+1} \leq a^{K+1} V_0$ for any $K \geq 0$, and the desired result follows.

We are now ready to state and prove our main result.

Theorem 1. Assume that Problem (1) satisfies Assumptions A1–A4, and that the step-size $\alpha$ satisfies

$$\alpha \leq \frac{\left(1 + \frac{\mu}{L(\bar{\tau}+1)}\right)^{\frac{1}{\bar{\tau}+1}} - 1}{\mu},$$

where $L = \sum_{n=1}^{N} L_n$. Then, the iterates generated by Algorithms 1 and 2 satisfy

$$\|x_k - x^\star\|^2 \leq \left(\frac{1}{\mu\alpha + 1}\right)^k \|x_0 - x^\star\|^2$$

for all $k \geq 0$.
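As a quick numerical sanity check on these results, the sketch below (helper names are ours, not the paper's) evaluates the step-size bound of Theorem 1, verifies that synchronous operation ($\bar{\tau} = 0$) recovers the classical proximal-gradient step-size $1/L$, and spot-checks Lemma 1 on a concrete sequence that satisfies (5) with equality:

```python
def stepsize_bound(L, mu, tau_bar):
    """Largest step-size admitted by Theorem 1 (as reconstructed above)."""
    return ((1.0 + mu / (L * (tau_bar + 1))) ** (1.0 / (tau_bar + 1)) - 1.0) / mu

def lemma1_bound_holds(a, b, c, k0, ws, V0=1.0, K=100):
    """Run recursion (5) with equality for the given non-negative w_k and
    check V_k <= a^k * V0, provided the condition of Lemma 1 holds."""
    # Condition of Lemma 1: c * (1 - a^(k0+1)) / ((1 - a) * a^k0) <= b
    assert c * (1 - a ** (k0 + 1)) / ((1 - a) * a ** k0) <= b
    V = V0
    for k in range(K):
        tail = sum(ws[j] for j in range(max(0, k - k0), k + 1))  # w_j = 0 for j < 0
        V = a * V - b * ws[k] + c * tail
        if V > a ** (k + 1) * V0 + 1e-12:
            return False
    return True

# Synchronous operation (tau_bar = 0) recovers the classical 1/L step-size.
# L = 101, mu = 2 match the toy problem of Section VI (L = N + 1 with N = 100).
L, mu = 101.0, 2.0
assert abs(stepsize_bound(L, mu, 0) - 1.0 / L) < 1e-12
# The admissible step-size shrinks as the maximum delay grows:
assert stepsize_bound(L, mu, 6) < stepsize_bound(L, mu, 1) < stepsize_bound(L, mu, 0)
# Spot-check Lemma 1 with a = 0.9, b = 0.05, c = 0.01, k0 = 3, w_k = 0.9^k:
assert lemma1_bound_holds(0.9, 0.05, 0.01, 3, [0.9 ** k for k in range(100)])
```

The second assertion illustrates the price of asynchrony: the guaranteed step-size, and hence the guaranteed contraction per iteration, degrades as $\bar{\tau}$ grows.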
Proof: We start by analyzing each component function $f_n(x)$ to find upper bounds on the function values:

$$\begin{aligned}
f_n(x_{k+1}) &\leq f_n(x_{k-\tau_k^n}) + \left\langle \nabla f_n(x_{k-\tau_k^n}),\, x_{k+1} - x_{k-\tau_k^n} \right\rangle + \frac{L_n}{2} \left\| x_{k+1} - x_{k-\tau_k^n} \right\|^2 \\
&\leq f_n(x) + \left\langle \nabla f_n(x_{k-\tau_k^n}),\, x_{k+1} - x \right\rangle + \frac{L_n}{2} \left\| x_{k+1} - x_{k-\tau_k^n} \right\|^2 \quad \forall x, \tag{7}
\end{aligned}$$

where the first and second inequalities use the $L_n$-Lipschitz continuity of $\nabla f_n$ and the convexity of $f_n(x)$, respectively. Summing (7) over all component functions, we obtain:

$$F(x_{k+1}) \leq F(x) + \langle g_k, x_{k+1} - x \rangle + \sum_{n=1}^{N} \frac{L_n}{2} \left\| x_{k+1} - x_{k-\tau_k^n} \right\|^2 \quad \forall x. \tag{8}$$

Observe that the optimality condition of (2) implies:

$$\langle g_k, x_{k+1} - x \rangle \leq \frac{1}{\alpha} \langle x_{k+1} - x_k,\, x - x_{k+1} \rangle + \langle s(x_{k+1}),\, x - x_{k+1} \rangle \quad \forall x \in \mathcal{X}. \tag{9}$$

To find an upper bound on the second term of the right-hand side of (8), we use the three-point equality on (9) to obtain:

$$\langle g_k, x_{k+1} - x \rangle \leq \frac{1}{2\alpha} \|x_k - x\|^2 - \frac{1}{2\alpha} \|x_{k+1} - x_k\|^2 - \frac{1}{2\alpha} \|x_{k+1} - x\|^2 + \langle s(x_{k+1}),\, x - x_{k+1} \rangle \quad \forall x \in \mathcal{X}. \tag{10}$$

Plugging $y = x_{k+1}$ into (4), and using (4) together with (10) in (8), we obtain the following relation:

$$F(x_{k+1}) + h(x_{k+1}) + \frac{1}{2\alpha} \|x_{k+1} - x\|^2 \leq F(x) + h(x) + \frac{1}{2\alpha} \|x_k - x\|^2 - \frac{1}{2\alpha} \|x_{k+1} - x_k\|^2 + \sum_{n=1}^{N} \frac{L_n}{2} \left\| x_{k+1} - x_{k-\tau_k^n} \right\|^2 \quad \forall x \in \mathcal{X}.$$

Using the strong convexity property (3) on $F(x_{k+1}) + h(x_{k+1})$ above and choosing $x = x^\star$ gives:

$$\left\langle \nabla F(x^\star) + s(x^\star),\, x_{k+1} - x^\star \right\rangle + \frac{\mu}{2} \|x_{k+1} - x^\star\|^2 + \frac{1}{2\alpha} \|x_{k+1} - x^\star\|^2 \leq \frac{1}{2\alpha} \|x_k - x^\star\|^2 - \frac{1}{2\alpha} \|x_{k+1} - x_k\|^2 + \sum_{n=1}^{N} \frac{L_n}{2} \left\| x_{k+1} - x_{k-\tau_k^n} \right\|^2. \tag{11}$$

Due to the optimality condition of (1), there exists a subgradient $s(x^\star)$ such that the first term on the left-hand side is non-negative. Using this particular subgradient, we drop the first term.
The last term on the right-hand side of the inequality can be further upper-bounded using Jensen's inequality as follows:

$$\sum_{n=1}^{N} \frac{L_n}{2} \left\| x_{k+1} - x_{k-\tau_k^n} \right\|^2 = \sum_{n=1}^{N} \frac{L_n}{2} \left\| \sum_{j=k-\tau_k^n}^{k} x_{j+1} - x_j \right\|^2 \leq \frac{L(\bar{\tau}+1)}{2} \sum_{j=k-\bar{\tau}}^{k} \|x_{j+1} - x_j\|^2,$$

where $L = \sum_{n=1}^{N} L_n$. As a result, rearranging the terms in (11), we obtain:

$$\|x_{k+1} - x^\star\|^2 \leq \frac{1}{\mu\alpha+1} \|x_k - x^\star\|^2 - \frac{1}{\mu\alpha+1} \|x_{k+1} - x_k\|^2 + \frac{\alpha(\bar{\tau}+1)L}{\mu\alpha+1} \sum_{j=k-\bar{\tau}}^{k} \|x_{j+1} - x_j\|^2.$$

We note that $\|x_{j+1} - x_j\|^2 = 0$ for all $j < 0$. Using Lemma 1 with $V_k = \|x_k - x^\star\|^2$, $w_k = \|x_{k+1} - x_k\|^2$, $a = b = \frac{1}{\mu\alpha+1}$, $c = \frac{\alpha(\bar{\tau}+1)L}{\mu\alpha+1}$ and $k_0 = \bar{\tau}$ completes the proof.

Remark 1. For the special case of Algorithms 1 and 2 where $\tau_k^n = 0$ for all $k, n$, Xiao and Zhang [15] have shown that the convergence rate of the serial proximal gradient method with a constant step-size $\alpha = \frac{1}{L}$ is

$$O\left(\left(\frac{L - \mu_F}{L + \mu_h}\right)^k\right),$$

where $\mu_F$ and $\mu_h$ are the strong convexity parameters of $F(x)$ and $h(x)$, respectively. It is clear that in the case $\bar{\tau} = 0$, the guaranteed bound in Theorem 1 reduces to the one obtained in [15].

V. PROXIMAL INCREMENTAL AGGREGATED DESCENT WITH GENERAL DISTANCE FUNCTIONS

The update rule of our algorithm can easily be extended to a non-Euclidean setting by replacing the Euclidean squared distance in (2) with a general Bregman distance function. We first define a Bregman distance function, also referred to as a prox-function.

Definition 1. A function $\omega : \mathbb{R}^d \to \mathbb{R}$ is called a distance generating function with modulus $\mu_\omega > 0$ if $\omega$ is continuously differentiable and $\mu_\omega$-strongly convex with respect to $\|\cdot\|$. Every distance generating function introduces a corresponding Bregman distance function given by

$$D_\omega(x, x') := \omega(x') - \omega(x) - \langle \nabla\omega(x),\, x' - x \rangle.$$
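The two standard distance generating functions discussed in this section, the squared Euclidean norm and the negative entropy, can be checked against Definition 1 numerically. A minimal sketch (helper names are ours) evaluates both Bregman distances and also spot-checks the 1-strong convexity of the entropy with respect to the $\ell_1$-norm on the simplex:

```python
import math

def bregman(omega, grad_omega, x, xp):
    """D_omega(x, x') = omega(x') - omega(x) - <grad omega(x), x' - x>."""
    return (omega(xp) - omega(x)
            - sum(g * (b - a) for g, a, b in zip(grad_omega(x), x, xp)))

# omega(x) = (1/2)||x||_2^2 generates the squared Euclidean distance.
sq = lambda v: 0.5 * sum(t * t for t in v)
sq_grad = lambda v: list(v)

# omega(x) = sum_i x_i log x_i (negative entropy) generates the KL
# divergence on the standard simplex.
ent = lambda v: sum(t * math.log(t) for t in v)
ent_grad = lambda v: [math.log(t) + 1.0 for t in v]

x  = [0.2, 0.3, 0.5]     # both points lie on the simplex
xp = [0.4, 0.4, 0.2]

d_euc = bregman(sq, sq_grad, x, xp)
assert abs(d_euc - 0.5 * sum((b - a) ** 2 for a, b in zip(x, xp))) < 1e-12

d_kl = bregman(ent, ent_grad, x, xp)
kl = sum(b * math.log(b / a) for a, b in zip(x, xp))
assert abs(d_kl - kl) < 1e-12

# 1-strong convexity of the entropy w.r.t. the l1-norm on the simplex
# (Pinsker's inequality): D_omega(x, x') >= (1/2) ||x' - x||_1^2.
assert d_kl >= 0.5 * sum(abs(b - a) for a, b in zip(x, xp)) ** 2
```

Note that on the simplex the constant term from $\nabla\omega(x) = (\log x_i + 1)_i$ cancels, which is why the entropy's Bregman distance reduces exactly to the KL divergence.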
For example, if we choose $\omega(x) = \frac{1}{2}\|x\|_2^2$, which is $1$-strongly convex with respect to the $\ell_2$-norm, we obtain $D_\omega(x, x') = \frac{1}{2}\|x' - x\|_2^2$. Another common example of a distance generating function is the entropy function

$$\omega(x) = \sum_{i=1}^{d} x_i \log(x_i),$$

which is $1$-strongly convex with respect to the $\ell_1$-norm over the standard simplex

$$\Delta := \left\{ x \in \mathbb{R}^d : \sum_{i=1}^{d} x_i = 1,\; x \geq 0 \right\},$$

and its associated Bregman distance function is

$$D_\omega(x, x') = \sum_{i=1}^{d} x'_i \log\left(\frac{x'_i}{x_i}\right).$$

The main motivation for using a generalized distance generating function rather than the usual Euclidean distance is to design an optimization algorithm that can take advantage of the geometry of the feasible set. The associated convergence result now reads as follows.

Corollary 1. Consider using the following proximal gradient method to solve (1):

$$x_{k+1} = \underset{x \in \mathcal{X}}{\arg\min} \left\{ \langle g_k, x - x_k \rangle + \frac{1}{\alpha} D_\omega(x, x_k) + h(x) \right\}, \qquad g_k = \sum_{n=1}^{N} \nabla f_n\left(x_{k-\tau_k^n}\right). \tag{12}$$

Assume that $D_\omega(\cdot, \cdot)$ satisfies

$$\frac{\mu_\omega}{2} \|x - y\|^2 \leq D_\omega(x, y) \leq \frac{L_\omega}{2} \|x - y\|^2. \tag{13}$$

Assume also that the problem satisfies Assumptions A1–A4, and that the step-size $\alpha$ satisfies

$$\alpha \leq L_\omega \, \frac{\left(1 + \frac{\mu}{L(\bar{\tau}+1)} \cdot \frac{\mu_\omega}{L_\omega}\right)^{\frac{1}{\bar{\tau}+1}} - 1}{\mu},$$

where $L = \sum_{n=1}^{N} L_n$. Then, the iterates generated by the method satisfy

$$D_\omega(x^\star, x_k) \leq \left(\frac{L_\omega}{\mu\alpha + L_\omega}\right)^k D_\omega(x^\star, x_0).$$

Proof: The analysis is similar to that of Theorem 1. This time, the optimality condition of (12) implies:

$$\langle g_k, x_{k+1} - x \rangle \leq \frac{1}{\alpha} \langle \nabla\omega(x_{k+1}) - \nabla\omega(x_k),\, x - x_{k+1} \rangle + \langle s(x_{k+1}),\, x - x_{k+1} \rangle \quad \forall x \in \mathcal{X}. \tag{14}$$
Using the following four-point equality

$$D_\omega(a, d) - D_\omega(c, d) - D_\omega(a, b) + D_\omega(c, b) = \langle \nabla\omega(b) - \nabla\omega(d),\, a - c \rangle$$

in (14) with $a = x$, $b = c = x_{k+1}$ and $d = x_k$, and following the steps of the proof of Theorem 1, we obtain:

$$\frac{\mu}{2} \|x_{k+1} - x^\star\|^2 + \frac{1}{\alpha} D_\omega(x^\star, x_{k+1}) \leq \frac{1}{\alpha} D_\omega(x^\star, x_k) - \frac{1}{\alpha} D_\omega(x_{k+1}, x_k) + \frac{L(\bar{\tau}+1)}{2} \sum_{j=k-\bar{\tau}}^{k} \|x_{j+1} - x_j\|^2.$$

This time, using the upper and lower bounds of (13) on the left- and right-hand side of the above inequality, respectively, and rearranging the terms, we arrive at:

$$D_\omega(x^\star, x_{k+1}) \leq \frac{L_\omega}{\mu\alpha + L_\omega} D_\omega(x^\star, x_k) - \frac{L_\omega}{\mu\alpha + L_\omega} D_\omega(x_{k+1}, x_k) + \frac{\alpha L (\bar{\tau}+1)}{\mu\alpha + L_\omega} \cdot \frac{L_\omega}{\mu_\omega} \sum_{j=k-\bar{\tau}}^{k} D_\omega(x_{j+1}, x_j).$$

Applying Lemma 1 with $V_k = D_\omega(x^\star, x_k)$, $w_k = D_\omega(x_{k+1}, x_k)$, $a = b = \frac{L_\omega}{\mu\alpha + L_\omega}$, $c = \frac{\alpha L(\bar{\tau}+1)}{\mu\alpha + L_\omega} \cdot \frac{L_\omega}{\mu_\omega}$ and $k_0 = \bar{\tau}$ completes the proof.

VI. NUMERICAL EXAMPLES

In this section, we present numerical examples which verify our theoretical bound in different settings. First, we simulate the implementation of Algorithms 1 and 2 on a parameter server architecture to solve a small toy problem. Then, we implement the framework on Amazon EC2 and solve a binary classification problem on three different real-world datasets.

A. Toy problem

To verify the theoretical bounds provided in Theorem 1 and Corollary 1, we consider solving (1) with

$$f_n(x) = \begin{cases} (x_n - c)^2 + \frac{1}{2}(x_{n+1} + c)^2, & n = 1, \\ \frac{1}{2}(x_{n-1} + c)^2 + \frac{1}{2}(x_n - c)^2, & n = N, \\ \frac{1}{2}(x_{n-1} + c)^2 + \frac{1}{2}(x_n - c)^2 + \frac{1}{2}(x_{n+1} + c)^2, & \text{otherwise}, \end{cases}$$

$$h(x) = \lambda_1 \|x\|_1 + I_{\mathcal{X}}(x), \qquad \mathcal{X} = \{x \geq 0\},$$

for some $c \geq 0$. We use $D_\omega(x, x_k) = \frac{1}{2}\|x - x_k\|_p^2$ in the proximal step (12) and consider both $p = 1.5$ and $p = 2$.
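A scaled-down simulation of this toy problem can be sketched as follows (a sketch, not the paper's implementation: we use $N = 20$ instead of $100$, the Euclidean case $p = 2$, and the same uniformly random worker sampling as in the paper's simulation). The proximal step for $h(x) = \lambda_1\|x\|_1 + I_{\{x \geq 0\}}(x)$ reduces to a one-sided soft-threshold, and the assertions check convergence to the closed-form optimizer $x^\star = \frac{\max(0,\, c-\lambda_1)}{3} e_1$:

```python
import random

N, W = 20, 4                 # illustrative sizes (the paper uses N = 100, W = 4)
c, lam1 = 3.0, 1.0
L, mu, tau_bar = N + 1.0, 2.0, float(W)   # L = sum L_n = N + 1, mu = 2

def grad_fn(n, x):
    """Gradient of the n-th component function (n is 1-indexed), returned as
    a dict mapping coordinate index -> partial derivative."""
    if n == 1:
        return {0: 2.0 * (x[0] - c), 1: x[1] + c}
    if n == N:
        return {N - 2: x[N - 2] + c, N - 1: x[N - 1] - c}
    i = n - 1                # 0-based index of x_n
    return {i - 1: x[i - 1] + c, i: x[i] - c, i + 1: x[i + 1] + c}

# Step-size from Theorem 1, tuned with tau_bar = W as in the paper.
alpha = ((1.0 + mu / (L * (tau_bar + 1))) ** (1.0 / (tau_bar + 1)) - 1.0) / mu

random.seed(0)
x = [0.0] * N
blocks = [range(w * (N // W) + 1, (w + 1) * (N // W) + 1) for w in range(W)]
x_at_worker = [list(x) for _ in range(W)]   # iterate each worker last received
g_w = [[0.0] * N for _ in range(W)]         # most recent partial gradients

for k in range(20000):
    w = random.randrange(W)                 # worker reporting at iteration k
    g_w[w] = [0.0] * N                      # gradient at the worker's stale copy
    for n in blocks[w]:
        for i, v in grad_fn(n, x_at_worker[w]).items():
            g_w[w][i] += v
    G = [sum(g[i] for g in g_w) for i in range(N)]   # aggregate g_k
    # Proximal step (2) for h(x) = lam1*||x||_1 + indicator of {x >= 0}:
    x = [max(0.0, xi - alpha * (Gi + lam1)) for xi, Gi in zip(x, G)]
    x_at_worker[w] = list(x)                # master sends x_{k+1} back to w

x_star_1 = max(0.0, c - lam1) / 3.0         # optimizer: x* = max(0, c-lam1)/3 * e_1
assert abs(x[0] - x_star_1) < 1e-4
assert max(abs(v) for v in x[1:]) < 1e-4
```

Here a worker's delay $\tau_k^w$ is implicitly the number of iterations since it was last selected, exactly as in the simulation described below; the stored partial gradients $g_w$ play the role of the master's state in Algorithm 1.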
It can be verified that $\nabla F(x)$ is $(N+1)$-Lipschitz continuous and $F(x)$ is $2$-strongly convex, both with respect to $\|\cdot\|_2$, and that the optimizer of the problem is $x^\star = \frac{\max(0,\, c - \lambda_1)}{3} e_1$, where $e_n$ denotes the $n$-th basis vector. Moreover, it can be shown that if $p \in (1, 2]$, then $\mu_\omega = 1$ and $L_\omega = N^{2/p - 1}$ satisfy (13) with respect to $\|\cdot\|_2$.

We select the problem parameters $N = 100$, $c = 3$ and $\lambda_1 = 1$. We simulate solving the problem with $W = 4$ workers, where at each iteration $k$, a worker $w$ is selected uniformly at random to return its gradient, evaluated on stale information $x_{k-\tau_k^w}$, to the master. Here, at time $k$, $\tau_k^w$ is simply the number of iterations since the last time worker $w$ was selected. Each worker holds $N/W = 25$ component functions, and we tune the step-size based on the assumption that $\bar{\tau} = W$. Figure 1 shows the results of a representative simulation. As can be observed, the iterates converge to the optimizer and the derived theoretical bound is valid.

Fig. 1. Convergence of the iterates in the toy problem ($\|x_k - x^\star\|_2^2$ versus $k$, for $p = 1.5$ and $p = 2.0$). Solid lines represent our theoretical upper bound, whereas dash-dotted lines represent simulation results.

B. Binary classification on real datasets

Next, we consider solving a regularized, sparse binary classification problem on three different datasets: rcv1 (sparse) [16], url (sparse) [17] and epsilon (dense) [18]. To this end, we implement the parameter server framework in the Julia language, and instantiate it with Problem (1):

$$f_n(x) = \frac{1}{N}\left(\log\left(1 + \exp\left(-b_n \langle a_n, x \rangle\right)\right) + \frac{1}{2}\lambda_2 \|x\|_2^2\right), \qquad h(x) = \lambda_1 \|x\|_1.$$

Here, $a_n \in \mathbb{R}^d$ is the feature vector for sample $n$, and $b_n \in \{-1, 1\}$ is the corresponding binary label.
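For the regularized logistic loss above, each per-sample gradient has the simple closed form $\nabla f_n(x) = \frac{1}{N}\left(-b_n\, \sigma(-b_n \langle a_n, x \rangle)\, a_n + \lambda_2 x\right)$, where $\sigma$ is the logistic function. A brief sketch (with illustrative values of our own choosing, not the paper's datasets) checks this form against a central finite-difference approximation:

```python
import math

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)

def f_n(a, b, x, lam2, N):
    """f_n(x) = (1/N) * ( log(1 + exp(-b <a, x>)) + (lam2/2) ||x||_2^2 )."""
    z = -b * sum(ai * xi for ai, xi in zip(a, x))
    val = z + math.log1p(math.exp(-z)) if z > 0 else math.log1p(math.exp(z))
    return (val + 0.5 * lam2 * sum(v * v for v in x)) / N

def grad_f_n(a, b, x, lam2, N):
    """Gradient: (1/N) * ( -b * sigmoid(-b <a, x>) * a + lam2 * x )."""
    s = sigmoid(-b * sum(ai * xi for ai, xi in zip(a, x)))
    return [(-b * s * ai + lam2 * xi) / N for ai, xi in zip(a, x)]

# Central finite-difference check on illustrative values.
a, b, x, lam2, N = [0.5, -1.2, 2.0], -1, [0.3, -0.1, 0.2], 1e-4, 10
g = grad_f_n(a, b, x, lam2, N)
eps = 1e-6
num_grads = []
for i in range(len(x)):
    xp = list(x); xp[i] += eps
    xm = list(x); xm[i] -= eps
    num_grads.append((f_n(a, b, xp, lam2, N) - f_n(a, b, xm, lam2, N)) / (2 * eps))
fd_error = max(abs(n - gi) for n, gi in zip(num_grads, g))
assert fd_error < 1e-8
```

The stable branches in `sigmoid` and `f_n` avoid overflow for large margins, which matters in practice for normalized but high-dimensional samples such as those used here.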
We pick $\lambda_1 = 10^{-5}$ and $\lambda_2 = 10^{-4}$ for the rcv1 and epsilon datasets, and $\lambda_1 = 10^{-3}$ and $\lambda_2 = 10^{-4}$ for url. rcv1 is already normalized to have unit norm in its samples; hence, we normalize the url and epsilon datasets to obtain comparable problem instances. rcv1 is a text categorization test collection from Reuters, with $N = 804414$ documents and $d = 47236$ features (density: $0.16\%$) per document. We choose to classify sports, disaster and government related articles from the corpus. url is a collection of data for identification of malicious URLs. It has $N = 2396130$ URL samples, each having $d = 64$ real-valued features out of a total of $3231961$ attributes (density: $18.08\%$). Finally, epsilon is a synthetic, dense dataset with $N = 500000$ samples and $d = 2000$ features.

It can be verified that $\nabla F(x)$ is $\left(\frac{1}{4}\|A\|_2^2 + \lambda_2\right)$-Lipschitz continuous with $\|A\|_2^2 = 1$ in all the examples, and $F(x)$ is $\lambda_2$-strongly convex with respect to $\|\cdot\|_2$.

We create three c4.2xlarge compute nodes in Amazon's Elastic Compute Cloud. The compute nodes are physically located in Ireland (eu), North Virginia (us) and Tokyo (ap), respectively. Then, we assign one CPU from each node as a worker, resulting in a total of 3 workers, and we place the master node at KTH in Sweden. We run a small number of iterations of the algorithms to obtain an a priori delay distribution of the workers in this setting, and we observe that $\bar{\tau} = 6$. In Figures 2 and 3, we present the convergence results of our experiments and the delay distributions of the workers, respectively. As in the previous example, the iterates converge to the optimizer and the derived theoretical bound is valid. Another observation worth noting is that the denser the dataset, the smaller the gap between the actual iterates and the theoretical upper bound.
VII. DISCUSSION AND CONCLUSION

In this paper, we have studied the use of the parameter server framework for solving regularized machine learning problems. One class of methods applicable in this framework is the proximal incremental aggregated gradient method. We have shown that when the objective function is strongly convex, the iterates generated by the method converge linearly to the global optimum. We have also given a constant step-size rule for the case when the degree of asynchrony in the architecture is known. Moreover, we have validated our theoretical bound by simulating the parameter server architecture on two different problems.

Fig. 2. Convergence of the iterates in the Amazon EC2 experiments ($\|x_k - x^\star\|_2^2$ versus $k$, for rcv1, url and epsilon). Solid lines represent our theoretical upper bound, whereas dash-dotted lines represent experiment results.

Fig. 3. Worker delays in the Amazon EC2 experiments. Bars represent the mean delays, whereas vertical stacked lines represent the standard deviation. For each worker, from left to right, we present the delays obtained in the rcv1, url and epsilon experiments, respectively.

REFERENCES

[1] Y. Nesterov, "Efficiency of coordinate descent methods on huge-scale optimization problems," SIAM Journal on Optimization, vol. 22, no. 2, pp. 341–362, 2012.
[2] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao, "Optimal distributed online prediction using mini-batches," Journal of Machine Learning Research, vol. 13, pp. 165–202, 2012.
[3] B. Recht, C. Re, S. Wright, and F. Niu, "Hogwild: A lock-free approach to parallelizing stochastic gradient descent," in Advances in Neural Information Processing Systems 24, 2011, pp. 693–701.
[4] M. Li, L. Zhou, Z. Yang, A. Li, F. Xia, D. G. Andersen, and A.
Smola, "Parameter server for distributed machine learning," in Big Learning NIPS Workshop, vol. 1, 2013.
[5] A. Agarwal and J. C. Duchi, "Distributed delayed stochastic optimization," in Advances in Neural Information Processing Systems, 2011, pp. 873–881.
[6] H. R. Feyzmahdavian, A. Aytekin, and M. Johansson, "An asynchronous mini-batch algorithm for regularized stochastic optimization," IEEE Transactions on Automatic Control, vol. PP, no. 99, pp. 1–1, 2016.
[7] M. Gurbuzbalaban, A. Ozdaglar, and P. Parrilo, "On the convergence rate of incremental aggregated gradient algorithms," Jun. 2015, arXiv:1506.02081v1 [math.OC].
[8] D. P. Bertsekas, "Incremental gradient, subgradient, and proximal methods for convex optimization: A survey," Optimization for Machine Learning, vol. 2010, pp. 1–38, 2011.
[9] M. V. Solodov, "Incremental gradient algorithms with stepsizes bounded away from zero," Computational Optimization and Applications, vol. 11, no. 1, pp. 23–35, 1998.
[10] D. Blatt, A. O. Hero, and H. Gauchman, "A convergent incremental gradient method with a constant step size," SIAM J. Optim., vol. 18, no. 1, pp. 29–51, Jan. 2007. [Online]. Available: http://dx.doi.org/10.1137/040615961
[11] N. D. Vanli, M. Gurbuzbalaban, and A. Ozdaglar, "Global convergence rate of proximal incremental aggregated gradient methods," Aug. 2016, arXiv:1608.01713v1 [math.OC].
[12] M. Schmidt, N. Le Roux, and F. Bach, "Minimizing finite sums with the stochastic average gradient," May 2016, arXiv:1309.2388v2 [math.OC]. [Online]. Available: https://hal.inria.fr/hal-00860051
[13] A. Defazio, F. Bach, and S. Lacoste-Julien, "SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives," in Advances in Neural Information Processing Systems 27. Curran Associates, Inc., 2014, pp. 1646–1654. [Online].
Available: http://papers.nips.cc/paper/5258-saga-a-fast-incremental-gradient-method-with-support-for-non-strongly-convex-composite-objectives.pdf
[14] J. Mairal, "Incremental majorization-minimization optimization with application to large-scale machine learning," SIAM Journal on Optimization, vol. 25, no. 2, pp. 829–855, 2015.
[15] L. Xiao and T. Zhang, "A proximal stochastic gradient method with progressive variance reduction," SIAM Journal on Optimization, vol. 24, no. 4, pp. 2057–2075, 2014. [Online]. Available: http://dx.doi.org/10.1137/140961791
[16] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, "RCV1: A new benchmark collection for text categorization research," J. Mach. Learn. Res., vol. 5, pp. 361–397, Dec. 2004. [Online]. Available: http://dl.acm.org/citation.cfm?id=1005332.1005345
[17] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, "Identifying suspicious URLs: An application of large-scale online learning," in Proceedings of the 26th Annual International Conference on Machine Learning (ICML 2009), ser. ICML '09. Montreal, Quebec: ACM, June 2009, pp. 681–688. [Online]. Available: http://doi.acm.org/10.1145/1553374.1553462
[18] "Pascal large scale learning challenge," Accessed: 2016-10-07. [Online]. Available: http://largescale.ml.tu-berlin.de/
