On Optimal Probabilities in Stochastic Coordinate Descent Methods

Peter Richtárik and Martin Takáč
University of Edinburgh, United Kingdom

October 11, 2013

Abstract

We propose and analyze a new parallel coordinate descent method, ‘NSync, in which at each iteration a random subset of coordinates is updated, in parallel, allowing for the subsets to be chosen non-uniformly. We derive convergence rates under a strong convexity assumption, and comment on how to assign probabilities to the sets to optimize the bound. The complexity and practical performance of the method can outperform its uniform variant by an order of magnitude. Surprisingly, the strategy of updating a single randomly selected coordinate per iteration, with optimal probabilities, may require fewer iterations, both in theory and practice, than the strategy of updating all coordinates at every iteration.

1 Introduction

In this work we consider the optimization problem

    $\min_{x \in \mathbb{R}^n} \phi(x)$,    (1)

where $\phi$ is strongly convex and smooth. We propose a new algorithm, which we call ‘NSync (Nonuniform SYNchronous coordinate descent).

Algorithm 1 (‘NSync)
    Input: initial point $x_0 \in \mathbb{R}^n$, subset probabilities $\{p_S\}$, and stepsize parameters $w_1, \dots, w_n > 0$
    for $k = 0, 1, 2, \dots$ do
        Select a random set of coordinates $\hat S \subseteq \{1, \dots, n\}$ such that $\mathrm{Prob}(\hat S = S) = p_S$
        Update the selected coordinates: $x_{k+1} = x_k - \sum_{i \in \hat S} \frac{1}{w_i} \nabla_i \phi(x_k) e_i$
    end for

In ‘NSync, we first assign a probability $p_S \geq 0$ to every subset $S$ of $[n] := \{1, \dots, n\}$, with $\sum_S p_S = 1$, and pick stepsize parameters $w_i > 0$, $i = 1, \dots, n$. At every iteration, a random set $\hat S$ is generated, independently of previous iterations, following the law $\mathrm{Prob}(\hat S = S) = p_S$, and the coordinates $i \in \hat S$ are then updated in parallel by moving in the direction of the negative partial derivative with stepsize $1/w_i$.
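To make the scheme concrete, here is a minimal Python sketch of one ‘NSync run on a toy strongly convex quadratic. The problem data, the subsets, the distribution over them, and the conservative stepsize choice $w_i = \lambda_{\max}(Q)$ are all illustrative assumptions of this sketch, not values prescribed by the analysis (the ESO machinery of Section 3 yields larger stepsizes).

```python
import numpy as np

# Toy instance: phi(x) = 0.5 x^T Q x - b^T x with Q positive definite,
# so phi is strongly convex and smooth.  All concrete choices below
# (Q, b, the subsets, p_S, w) are illustrative assumptions.
rng = np.random.default_rng(0)
n = 10
M = rng.standard_normal((n, n))
Q = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)

phi = lambda x: 0.5 * x @ Q @ x - b @ x
grad = lambda x: Q @ x - b                    # grad(x)[i] = nabla_i phi(x)

# Support of the sampling S_hat: three subsets with nonuniform probabilities.
subsets = [list(range(0, n, 2)), list(range(1, n, 2)), list(range(n))]
p_S = [0.4, 0.4, 0.2]

# Conservative stepsizes: w_i = lambda_max(Q) makes every update a descent
# step no matter which subset is drawn (safe but pessimistic).
w = np.full(n, np.linalg.eigvalsh(Q)[-1])

x = np.zeros(n)
for k in range(2000):
    S = subsets[rng.choice(len(subsets), p=p_S)]  # Prob(S_hat = S) = p_S
    g = grad(x)                                   # gradient at x_k (synchronized)
    for i in S:                                   # these updates run in parallel in 'NSync
        x[i] -= g[i] / w[i]

gap = phi(x) - phi(np.linalg.solve(Q, b))         # phi(x_K) - phi*
print(f"final suboptimality: {gap:.2e}")
```

Since every coordinate has positive probability of being selected, the iterates drive the suboptimality $\phi(x_k) - \phi^*$ to machine precision.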
The updates are synchronized: no processor/thread is allowed to proceed before all updates are applied, generating the new iterate $x_{k+1}$. We specifically study samplings $\hat S$ that are non-uniform in the sense that $p_i := \mathrm{Prob}(i \in \hat S) = \sum_{S : i \in S} p_S$ is allowed to vary with $i$. By $\nabla_i \phi(x)$ we mean $\langle \nabla \phi(x), e_i \rangle$, where $e_i \in \mathbb{R}^n$ is the $i$-th unit coordinate vector.

Literature. Serial stochastic coordinate descent methods were proposed and analyzed in [6, 13, 15, 18], and more recently in various settings in [12, 7, 8, 9, 21, 19, 24, 3]. Parallel methods were considered in [2, 16, 14], and more recently in [22, 5, 23, 4, 11, 20, 10, 1]. A memory-distributed method scaling to big data problems was recently developed in [17]. A nonuniform coordinate descent method updating a single coordinate at a time was proposed in [15], and one updating two coordinates at a time in [12]. To the best of our knowledge, ‘NSync is the first nonuniform parallel coordinate descent method.

2 Analysis

Our analysis of ‘NSync is based on two assumptions. The first generalizes the ESO concept introduced in [16], and later used in [22, 23, 5, 4, 17], to nonuniform samplings. The second requires that $\phi$ be strongly convex.

Notation: For $x, y, u \in \mathbb{R}^n$ we write $\|x\|_u^2 := \sum_i u_i x_i^2$, $\langle x, y \rangle_u := \sum_{i=1}^n u_i x_i y_i$, $x \bullet y := (x_1 y_1, \dots, x_n y_n)$ and $u^{-1} := (1/u_1, \dots, 1/u_n)$. For $S \subseteq [n]$ and $h \in \mathbb{R}^n$, let $h_{[S]} := \sum_{i \in S} h_i e_i$.

Assumption 1 (Nonuniform ESO: Expected Separable Overapproximation). Assume $p = (p_1, \dots, p_n)^T > 0$ and that for some positive vector $w \in \mathbb{R}^n$ and all $x, h \in \mathbb{R}^n$,

    $E[\phi(x + h_{[\hat S]})] \leq \phi(x) + \langle \nabla \phi(x), h \rangle_p + \tfrac{1}{2} \|h\|^2_{p \bullet w}$.    (2)

Inequalities of type (2), in the uniform case ($p_i = p_j$ for all $i, j$), were studied in [16, 22, 5, 17].

Assumption 2 (Strong convexity). We assume that $\phi$ is $\gamma$-strongly convex with respect to the norm $\|\cdot\|_v$, where $v = (v_1, \dots, v_n)^T > 0$ and $\gamma > 0$. That is, we require that for all $x, h \in \mathbb{R}^n$,

    $\phi(x + h) \geq \phi(x) + \langle \nabla \phi(x), h \rangle + \tfrac{\gamma}{2} \|h\|_v^2$.    (3)

We can now establish a bound on the number of iterations sufficient for ‘NSync to approximately solve (1) with high probability.

Theorem 3. Let Assumptions 1 and 2 be satisfied. Choose $x_0 \in \mathbb{R}^n$, $0 < \epsilon < \phi(x_0) - \phi^*$ and $0 < \rho < 1$, where $\phi^* := \min_x \phi(x)$. Let

    $\Lambda := \max_i \frac{w_i}{p_i v_i}$.    (4)

If $\{x_k\}$ are the random iterates generated by ‘NSync, then

    $K \geq \frac{\Lambda}{\gamma} \log\left(\frac{\phi(x_0) - \phi^*}{\epsilon \rho}\right) \;\Rightarrow\; \mathrm{Prob}(\phi(x_K) - \phi^* \leq \epsilon) \geq 1 - \rho$.    (5)

Moreover, we have the lower bound $\Lambda \geq \left(\sum_i w_i / v_i\right) / E[|\hat S|]$.

Proof. We first claim that $\phi$ is $\mu$-strongly convex with respect to the norm $\|\cdot\|_{w \bullet p^{-1}}$, i.e.,

    $\phi(x + h) \geq \phi(x) + \langle \nabla \phi(x), h \rangle + \tfrac{\mu}{2} \|h\|^2_{w \bullet p^{-1}}$,    (6)

where $\mu := \gamma / \Lambda$. Indeed, this follows by comparing (3) and (6) in light of (4). Let $x^*$ be such that $\phi(x^*) = \phi^*$. Using (6) with $h = x^* - x$,

    $\phi^* - \phi(x) \geq \min_{h' \in \mathbb{R}^n} \left\{ \langle \nabla \phi(x), h' \rangle + \tfrac{\mu}{2} \|h'\|^2_{w \bullet p^{-1}} \right\} = -\tfrac{1}{2\mu} \|\nabla \phi(x)\|^2_{p \bullet w^{-1}}$.    (7)

Let $h_k := -(\mathrm{Diag}(w))^{-1} \nabla \phi(x_k)$. Then $x_{k+1} = x_k + (h_k)_{[\hat S]}$, and utilizing Assumption 1, we get

    $E[\phi(x_{k+1}) \mid x_k] = E[\phi(x_k + (h_k)_{[\hat S]})] \leq \phi(x_k) + \langle \nabla \phi(x_k), h_k \rangle_p + \tfrac{1}{2} \|h_k\|^2_{p \bullet w}$    (8)
    $= \phi(x_k) - \tfrac{1}{2} \|\nabla \phi(x_k)\|^2_{p \bullet w^{-1}} \leq \phi(x_k) - \mu (\phi(x_k) - \phi^*)$,    (9)

where the first inequality follows from (2) and the last from (7). Taking expectations in the last inequality and rearranging the terms, we obtain $E[\phi(x_{k+1}) - \phi^*] \leq (1 - \mu)\, E[\phi(x_k) - \phi^*] \leq (1 - \mu)^{k+1} (\phi(x_0) - \phi^*)$. Using this, the Markov inequality, and the definition of $K$, we finally get $\mathrm{Prob}(\phi(x_K) - \phi^* \geq \epsilon) \leq E[\phi(x_K) - \phi^*]/\epsilon \leq (1 - \mu)^K (\phi(x_0) - \phi^*)/\epsilon \leq \rho$.

Let us now establish the last claim. First, note that (see [16, Sec 3.2] for more results of this type)

    $\sum_i p_i = \sum_i \sum_{S : i \in S} p_S = \sum_S \sum_{i : i \in S} p_S = \sum_S p_S |S| = E[|\hat S|]$.    (10)

Letting $\Delta := \{p' \in \mathbb{R}^n : p' \geq 0, \ \sum_i p'_i = E[|\hat S|]\}$, we have, in view of (4) and (10),

    $\Lambda \geq \min_{p' \in \Delta} \max_i \frac{w_i}{p'_i v_i} = \frac{1}{E[|\hat S|]} \sum_i \frac{w_i}{v_i}$,

where the last equality follows since the optimal $p'_i$ is proportional to $w_i / v_i$.

Theorem 3 is generic in the sense that we do not say when Assumptions 1 and 2 are satisfied, nor how one should go about choosing the stepsizes $w$ and the probabilities $\{p_S\}$; we address these issues in the next section. On the other hand, this abstract setting allows us to give a brief complexity proof.

Change of variables. Consider the change of variables $y = \mathrm{Diag}(d) x$, where $d > 0$. Defining $\phi_d(y) := \phi(x)$, we get $\nabla \phi_d(y) = (\mathrm{Diag}(d))^{-1} \nabla \phi(x)$. It can be seen that (2) and (3) can equivalently be written in terms of $\phi_d$, with $w$ replaced by $w^d := w \bullet d^{-2}$ and $v$ replaced by $v^d := v \bullet d^{-2}$. By choosing $d_i = \sqrt{v_i}$, we obtain $v^d_i = 1$ for all $i$, recovering standard strong convexity.

3 Nonuniform samplings and ESO

Consider now problem (1) with $\phi$ of the form

    $\phi(x) := f(x) + \tfrac{\gamma}{2} \|x\|_v^2$,    (11)

where $v > 0$. Note that Assumption 2 is satisfied. We further make the following two assumptions.

Assumption 4 (Smoothness). $f$ has Lipschitz gradient with respect to the coordinates, with positive constants $L_1, \dots, L_n$. That is, $|\nabla_i f(x) - \nabla_i f(x + t e_i)| \leq L_i |t|$ for all $x \in \mathbb{R}^n$ and $t \in \mathbb{R}$.

Assumption 5 (Partial separability). $f(x) = \sum_{J \in \mathcal{J}} f_J(x)$, where $\mathcal{J}$ is a finite collection of nonempty subsets of $[n]$ and the $f_J$ are differentiable convex functions such that $f_J$ depends on coordinates $i \in J$ only. Let $\omega := \max_J |J|$. We say that $f$ is separable of degree $\omega$.

Uniform parallel coordinate descent methods for regularized problems with $f$ of the above structure were analyzed in [16].

Example 6. Let $f(x) = \tfrac{1}{2} \|Ax - b\|_2^2$, where $A \in \mathbb{R}^{m \times n}$.
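For a concrete instance, these quantities can be read off directly from $A$. The NumPy sketch below, with an arbitrary illustrative matrix, computes the squared column norms (which are the constants $L_i$) and the maximum number of nonzeros per row (which is $\omega$), as derived next.

```python
import numpy as np

# Illustrative matrix A for f(x) = 0.5 * ||Ax - b||^2; the entries are
# arbitrary and chosen only so the quantities below are easy to check.
A = np.array([[1.0, 0.0, 2.0, 0.0],
              [0.0, 3.0, 0.0, 0.0],
              [1.0, 1.0, 0.0, 4.0]])

L = (A ** 2).sum(axis=0)                 # L_i = ||A_{:i}||_2^2 (squared column norms)
omega = int((A != 0).sum(axis=1).max())  # omega = max number of nonzeros in a row

print(L)      # squared column norms
print(omega)  # separability degree
```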
Then $L_i = \|A_{:i}\|_2^2$ and $f(x) = \tfrac{1}{2} \sum_{j=1}^m (A_{j:} x - b_j)^2$, whence $\omega$ is the maximum number of nonzeros in a row of $A$.

Nonuniform sampling. Instead of considering the general case of arbitrary probabilities $p_S$ assigned to all subsets of $[n]$, here we consider a special kind of sampling with two advantages: (i) the sets can be generated easily, and (ii) it leads to larger stepsizes $1/w_i$ and hence an improved convergence rate. Fix $\tau \in [n]$ and $c \geq 1$, and let $S_1, \dots, S_c$ be a collection of (possibly overlapping) subsets of $[n]$ such that $|S_j| \geq \tau$ for all $j$ and $\cup_{j=1}^c S_j = [n]$. Moreover, let $q = (q_1, \dots, q_c) > 0$ be a probability vector. Let $\hat S_j$ be a $\tau$-nice sampling from $S_j$; that is, $\hat S_j$ picks subsets of $S_j$ of cardinality $\tau$, uniformly at random. We assume these samplings are independent. Now, $\hat S$ is generated as follows: we first pick $j \in \{1, \dots, c\}$ with probability $q_j$, and then draw $\hat S_j$. Note that we do not need to compute the quantities $p_S$, $S \subseteq [n]$, to execute ‘NSync; it is much easier to implement the sampling via the two-tier procedure explained above. The sampling $\hat S$ is a nonuniform variant of the $\tau$-nice sampling studied in [16], which arises here as the special case $c = 1$. Note that

    $p_i = \sum_{j=1}^c q_j \frac{\tau}{|S_j|} \delta_{ij} > 0$, $i \in [n]$,    (12)

where $\delta_{ij} = 1$ if $i \in S_j$, and $\delta_{ij} = 0$ otherwise.

Theorem 7. Let Assumptions 4 and 5 be satisfied, and let $\hat S$ be the sampling described above. Then Assumption 1 is satisfied with $p$ given by (12) and any $w = (w_1, \dots, w_n)^T$ for which

    $w_i \geq w_i^* := \frac{L_i + v_i}{p_i} \sum_{j=1}^c q_j \frac{\tau}{|S_j|} \delta_{ij} \left(1 + \frac{(\tau - 1)(\omega_j - 1)}{\max\{1, |S_j| - 1\}}\right)$, $i \in [n]$,    (13)

where $\omega_j := \max_{J \in \mathcal{J}} |J \cap S_j| \leq \omega$.

Proof. Since $f$ is separable of degree $\omega$, so is $\phi$ (because $\tfrac{1}{2}\|x\|_v^2$ is separable).
Now,

    $E[\phi(x + h_{[\hat S]})] = E[E[\phi(x + h_{[\hat S_j]}) \mid j]] = \sum_{j=1}^c q_j\, E[\phi(x + h_{[\hat S_j]})]$    (14)
    $\leq \sum_{j=1}^c q_j \left\{ \phi(x) + \frac{\tau}{|S_j|} \left[ \langle \nabla \phi(x), h_{[S_j]} \rangle + \frac{1}{2}\left(1 + \frac{(\tau - 1)(\omega_j - 1)}{\max\{1, |S_j| - 1\}}\right) \|h_{[S_j]}\|^2_{L+v} \right] \right\}$,    (15)

where the last inequality follows from the ESO for $\tau$-nice samplings established in [16, Theorem 15]. The claim now follows by comparing the above expression with (2).

4 Optimal probabilities

Observe that formula (13) can be used to design a sampling (characterized by the sets $S_j$ and the probabilities $q_j$) that minimizes $\Lambda$, which in view of Theorem 3 optimizes the convergence rate of the method.

Serial setting. Consider the serial version of ‘NSync ($\mathrm{Prob}(|\hat S| = 1) = 1$). We can model this via $c = n$, with $S_i = \{i\}$ and $p_i = q_i$ for all $i \in [n]$. In this case, using (12) and (13), we get $w_i = w_i^* = L_i + v_i$. Minimizing $\Lambda$ in (4) over the probability vector $p$ gives the optimal probabilities (we refer to this as the optimal serial method) and optimal complexity

    $p_i^* = \frac{(L_i + v_i)/v_i}{\sum_j (L_j + v_j)/v_j}$, $i \in [n]$, and $\Lambda_{OS} = \sum_i \frac{L_i + v_i}{v_i} = n + \sum_i \frac{L_i}{v_i}$,    (16)

respectively. Note that the uniform sampling, $p_i = 1/n$ for all $i$, leads to $\Lambda_{US} := n + n \max_j L_j / v_j$ (we call this the uniform serial method), which can be much larger than $\Lambda_{OS}$. Moreover, under the change of variables $y = \mathrm{Diag}(d) x$, the gradient of $f_d(y) := f(\mathrm{Diag}(d^{-1}) y)$ has coordinate Lipschitz constants $L_i^d = L_i / d_i^2$, while the weights in (11) change to $v_i^d = v_i / d_i^2$. Hence, the condition numbers $L_i / v_i$ cannot be improved by such a change of variables.

The optimal serial method can be faster than the fully parallel method. To model the fully parallel setting (i.e., the variant of ‘NSync updating all coordinates at every iteration), we can set $c = 1$ and $\tau = n$, which yields $\Lambda_{FP} = \omega + \omega \max_j L_j / v_j$. Since $\omega \leq n$, it is clear that $\Lambda_{US} \geq \Lambda_{FP}$.
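The three complexity constants $\Lambda_{OS}$, $\Lambda_{US}$ and $\Lambda_{FP}$ are easy to compare numerically. The sketch below uses made-up values of $L_i$, $v_i$ and $\omega$ (they are not taken from the paper's experiments):

```python
import numpy as np

# Illustrative comparison of the three complexity constants from this
# section; L, v and omega below are made-up values.
n = 10
L = np.full(n, 1.0)
v = np.ones(n)
v[0] = 0.05                                # one badly conditioned coordinate
omega = 8                                  # degree of partial separability

Lam_OS = n + (L / v).sum()                 # optimal serial:  n + sum_i L_i/v_i
Lam_US = n + n * (L / v).max()             # uniform serial:  n + n * max_j L_j/v_j
Lam_FP = omega + omega * (L / v).max()     # fully parallel:  omega * (1 + max_j L_j/v_j)

print(Lam_OS, Lam_US, Lam_FP)
```

With these values one gets $\Lambda_{US} \geq \Lambda_{FP} \geq \Lambda_{OS}$, illustrating both inequalities discussed in this section.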
However, for large enough $\omega$ it will be the case that $\Lambda_{FP} \geq \Lambda_{OS}$, implying, surprisingly, that the optimal serial method can be faster than the fully parallel method.

Parallel setting. Fix $\tau$ and the sets $S_j$, $j = 1, \dots, c$, and define

    $\theta := \max_j \left(1 + \frac{(\tau - 1)(\omega_j - 1)}{\max\{1, |S_j| - 1\}}\right)$.

Consider running ‘NSync with stepsizes $w_i = \theta (L_i + v_i)$ (note that $w_i \geq w_i^*$, so we are fine). From (4), (12) and (13) we see that the complexity of ‘NSync is determined by

    $\Lambda = \max_i \frac{w_i}{p_i v_i} = \frac{\theta}{\tau} \max_i \left(1 + \frac{L_i}{v_i}\right) \left( \sum_{j=1}^c \frac{q_j \delta_{ij}}{|S_j|} \right)^{-1}$.

The probability vector $q$ minimizing this quantity can be computed by solving a linear program with $c + 1$ variables ($q_1, \dots, q_c, \alpha$), $n + c$ linear inequality constraints and a single linear equality constraint:

    $\max_{\alpha, q} \left\{ \alpha \;:\; \alpha \leq (b^i)^T q \text{ for all } i \in [n], \; q \geq 0, \; \sum_j q_j = 1 \right\}$,

where the vectors $b^i \in \mathbb{R}^c$, $i \in [n]$, are given by $b^i_j = \frac{v_i}{L_i + v_i} \cdot \frac{\delta_{ij}}{|S_j|}$.

5 Experiments

We now conduct two preliminary small-scale experiments to illustrate the theory; the results are depicted below. All experiments are with problems of the form (11) with $f$ chosen as in Example 6.

[Figure: evolution of $\phi(x_k) - \phi^*$ (log scale). Left: Uniform Serial vs. Optimal Serial, against iteration $k$. Right: Fully Parallel vs. Serial Nonuniform, against epochs, for $\omega = 4, 6, 8$.]

In the left plot we chose $A \in \mathbb{R}^{2 \times 30}$, $\gamma = 1$, $v_1 = 0.05$, $v_i = 1$ for $i \neq 1$, and $L_i = 1$ for all $i$. We compare the US method ($p_i = 1/n$, blue) with the OS method ($p_i$ given by (16), red). The dashed lines show 95% confidence intervals (we ran the methods 100 times; the middle line is the average behavior). While OS can be faster, it is sensitive to over/under-estimation of the constants $L_i$, $v_i$. In the right plot we show that a nonuniform serial (NS) method can be faster than the fully parallel (FP) variant (we have chosen $m = 8$, $n = 10$ and three values of $\omega$).
On the horizontal axis we display the number of epochs, where one epoch corresponds to updating $n$ coordinates (for FP this is a single iteration, whereas for NS it corresponds to $n$ iterations).

References

[1] Y. Bian, X. Li, and Y. Liu. Parallel coordinate descent Newton for large-scale l1-regularized minimization. arXiv:1306.4080, 2013.
[2] J. Bradley, A. Kyrola, D. Bickson, and C. Guestrin. Parallel coordinate descent for l1-regularized loss minimization. In ICML, 2011.
[3] C. D. Dang and G. Lan. Stochastic block mirror descent methods for nonsmooth and stochastic optimization. Technical report, Georgia Institute of Technology, 2013.
[4] O. Fercoq. Parallel coordinate descent for the AdaBoost problem. In ICMLA, 2013.
[5] O. Fercoq and P. Richtárik. Smooth minimization of nonsmooth functions with parallel coordinate descent methods. 2013.
[6] C-J. Hsieh, K-W. Chang, C-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In ICML, 2008.
[7] S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In ICML, 2013.
[8] Z. Lu and L. Xiao. On the complexity analysis of randomized block-coordinate descent methods. arXiv:1305.4723, 2013.
[9] Z. Lu and L. Xiao. Randomized block coordinate non-monotone gradient methods for a class of nonlinear programming. 2013.
[10] I. Mukherjee, Y. Singer, R. Frongillo, and K. Canini. Parallel boosting with momentum. In ECML, 2013.
[11] I. Necoara and D. Clipici. Efficient parallel coordinate descent algorithm for convex optimization problems with separable constraints: application to distributed MPC. Journal of Process Control, 23:243–253, 2013.
[12] I. Necoara, Yu. Nesterov, and F. Glineur. Efficiency of randomized coordinate descent methods on optimization problems with linearly coupled constraints. Technical report, 2012.
[13] Yu. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
[14] P. Richtárik and M. Takáč. Efficient serial and parallel coordinate descent methods for huge-scale truss topology design. In Operations Research Proceedings, pages 27–32. Springer, 2012.
[15] P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 2012.
[16] P. Richtárik and M. Takáč. Parallel coordinate descent methods for big data optimization. arXiv:1212.0873, 2012.
[17] P. Richtárik and M. Takáč. Distributed coordinate descent method for learning with big data. arXiv:1310.2059, 2013.
[18] S. Shalev-Shwartz and A. Tewari. Stochastic methods for l1-regularized loss minimization. JMLR, 12:1865–1892, 2011.
[19] S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. arXiv:1211.2717, 2012.
[20] S. Shalev-Shwartz and T. Zhang. Accelerated mini-batch stochastic dual coordinate ascent. arXiv:1305.2581, 2013.
[21] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. JMLR, 14:567–599, 2013.
[22] M. Takáč, A. Bijral, P. Richtárik, and N. Srebro. Mini-batch primal and dual methods for SVMs. In ICML, 2013.
[23] R. Tappenden, P. Richtárik, and B. Büke. Separable approximations and decomposition methods for the augmented Lagrangian. 2013.
[24] R. Tappenden, P. Richtárik, and J. Gondzio. Inexact coordinate descent: complexity and preconditioning. arXiv:1304.5530, 2013.