On the Complexity Analysis of Randomized Block-Coordinate Descent Methods

Zhaosong Lu∗    Lin Xiao†

May 20, 2013

Abstract

In this paper we analyze the randomized block-coordinate descent (RBCD) methods proposed in [8, 11] for minimizing the sum of a smooth convex function and a block-separable convex function. In particular, we extend Nesterov's technique developed in [8] for analyzing the RBCD method for minimizing a smooth convex function over a block-separable closed convex set to the aforementioned more general problem, and obtain a sharper expected-value type of convergence rate than the one implied in [11]. Also, we obtain a better high-probability type of iteration complexity, which improves upon the one in [11] by at least the amount $O(n/\epsilon)$, where $\epsilon$ is the target solution accuracy and $n$ is the number of problem blocks. In addition, for unconstrained smooth convex minimization, we develop a new technique called randomized estimate sequence to analyze the accelerated RBCD method proposed by Nesterov [8] and establish a sharper expected-value type of convergence rate than the one given in [8].

Keywords: Randomized block-coordinate descent, accelerated coordinate descent, iteration complexity, convergence rate, composite minimization.

1 Introduction

Block-coordinate descent (BCD) methods and their variants have been successfully applied to solve various large-scale optimization problems (see, for example, [22, 4, 18, 19, 20, 21, 9, 23]). At each iteration, these methods choose one block of coordinates to sufficiently reduce the objective value while keeping the other blocks fixed. One common and simple approach for choosing such a block is by means of a cyclic strategy.
The global and local convergence of the cyclic BCD method have been well studied in the literature (see, for example, [17, 5]), though its global convergence rate still remains unknown except for some special cases [13].

∗ Department of Mathematics, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada (email: zhaosong@sfu.ca). This author was supported in part by an NSERC Discovery Grant.
† Machine Learning Groups, Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA (email: lin.xiao@microsoft.com).

Instead of using a deterministic cyclic order, many researchers recently proposed randomized strategies for choosing a block to update at each iteration of the BCD methods [1, 2, 14, 3, 8, 10, 11, 15, 12, 16]. The resulting methods are called randomized BCD (RBCD) methods. Numerous experiments have demonstrated that the RBCD methods are very powerful for solving large- and even huge-scale optimization problems arising in machine learning [1, 2, 14, 15]. In particular, Chang et al. [1] proposed a RBCD method for minimizing several smooth functions appearing in machine learning and derived its iteration complexity. Shalev-Shwartz and Tewari [14] studied a RBCD method for minimizing $\ell_1$-regularized smooth convex problems. They first transformed the problem into a box-constrained smooth problem by doubling the dimension, and then applied a block-coordinate gradient descent method in which each block was chosen with equal probability. Leventhal and Lewis [3] proposed a RBCD method for minimizing a convex quadratic function and established its iteration complexity. Nesterov [8] analyzed some RBCD methods for minimizing a smooth convex function over a closed block-separable convex set and established their iteration complexity, which in effect extends and improves upon some of the results in [1, 3, 14] in several aspects.
Richtárik and Takáč [11] generalized the RBCD methods proposed in [8] to the problem of minimizing a composite objective (i.e., the sum of a smooth convex function and a block-separable convex function) and derived improved complexity results compared with those given in [8]. More recently, Shalev-Shwartz and Zhang [15] studied a randomized proximal coordinate ascent method for solving the dual of a class of large-scale convex minimization problems arising in machine learning, and established an iteration complexity for obtaining a pair of approximate primal-dual solutions.

Inspired by the recent work [8, 11], we consider the problem of minimizing the sum of two convex functions:
$$\min_{x\in\Re^N}\ \Big\{ F(x) \stackrel{\mathrm{def}}{=} f(x) + \Psi(x) \Big\}, \qquad (1)$$
where $f$ is differentiable on $\Re^N$, and $\Psi$ has a block-separable structure. More specifically,
$$\Psi(x) = \sum_{i=1}^n \Psi_i(x_i),$$
where each $x_i$ denotes a subvector of $x$ with cardinality $N_i$, the collection $\{x_i : i = 1,\dots,n\}$ forms a partition of the components of $x$, and each $\Psi_i : \Re^{N_i} \to \Re\cup\{+\infty\}$ is a closed convex function.

Given the current iterate $x^k$, the RBCD method [11] picks a block $i \in \{1,\dots,n\}$ uniformly at random, solves a block-wise proximal subproblem of the form
$$d_i(x^k) := \arg\min_{d_i\in\Re^{N_i}}\ \langle \nabla_i f(x^k), d_i\rangle + \frac{L_i}{2}\|d_i\|^2 + \Psi_i(x^k_i + d_i),$$
and then sets the next iterate as $x^{k+1}_i = x^k_i + d_i(x^k)$ and $x^{k+1}_j = x^k_j$ for all $j \neq i$. Here $\nabla_i f(x)$ denotes the partial gradient of $f$ with respect to $x_i$, and $L_i$ is the Lipschitz constant of the partial gradient (which will be defined precisely later).

Under the assumption that the partial gradients of $f$ with respect to each block of coordinates are Lipschitz continuous, Nesterov [8] studied RBCD methods for solving some special cases of problem (1).
In particular, for $\Psi \equiv 0$, he proposed a RBCD method in which a random block is chosen per iteration according to a uniform or certain non-uniform probability distributions, and established an expected-value type of convergence rate. In addition, he proposed a RBCD method for solving (1) with each $\Psi_i$ being the indicator function of a closed convex set, in which a random block is chosen uniformly at each iteration. He also derived an expected-value type of convergence rate for this method. It can be observed that the techniques used by Nesterov to derive these two convergence rates substantially differ from each other, and moreover, for $\Psi \equiv 0$ the second rate is much better than the first one. (However, the second technique can only work with the uniform distribution.) Recently, Richtárik and Takáč [11] extended Nesterov's RBCD methods to the general form of problem (1) and established a high-probability type of iteration complexity. Although the expected-value type of convergence rate is not presented explicitly in [11], it can be readily obtained from some intermediate results developed in [11] (see Section 3 for a detailed discussion). Their results can be considered as a generalization of Nesterov's first technique mentioned above. Given that for $\Psi \equiv 0$ Nesterov's second technique can produce a better convergence rate than his first one, a natural question is whether his second technique can be extended to work with the general setting of problem (1) and obtain a sharper convergence rate than the one implied in [11].

In addition, Nesterov [8] proposed an accelerated RBCD (ARCD) method for solving problem (1) with $\Psi \equiv 0$ and established an expected-value type of convergence rate for this method. When $n = 1$, this method becomes a deterministic accelerated full gradient method for minimizing smooth convex functions.
When $f$ is a strongly convex function, the convergence rate given in [8] for $n = 1$ is, however, worse than the well-known optimal rate shown in [6, Theorem 2.2.2]. The question is then whether a sharper convergence rate for the ARCD method than the one given in [8] can be established (which would match the optimal rate for $n = 1$).

In this paper, we successfully address the above two questions by obtaining some sharper convergence rates for the RBCD method for solving problem (1) and for the ARCD method in the case $\Psi \equiv 0$. First, we extend Nesterov's second technique [8], developed for a special case of (1), to analyze the RBCD method in the general setting, and obtain a sharper expected-value type of convergence rate than the one implied in [11]. We also obtain a better high-probability type of iteration complexity, which improves upon the one in [11] at least by the amount $O(n/\epsilon)$, where $\epsilon$ is the target solution accuracy. For unconstrained smooth convex minimization (i.e., $\Psi \equiv 0$), we develop a new technique called randomized estimate sequence to analyze Nesterov's ARCD method and establish a sharper expected-value type of convergence rate than the one given in [8]. Especially, for $n = 1$, our rate becomes the same as the well-known optimal rate achieved by the accelerated full gradient method [6, Section 2.2].

This paper is organized as follows. In Section 2, we develop some technical results that are used to analyze the RBCD methods. In Section 3, we analyze the RBCD method for problem (1) by extending Nesterov's second technique [8], and establish a sharper expected-value type of convergence rate as well as an improved high-probability iteration complexity.
In Section 4, we develop the randomized estimate sequence technique and use it to derive a sharper expected-value type of convergence rate for the ARCD method for solving unconstrained smooth convex minimization.

2 Technical preliminaries

In this section we develop some technical results that will be used to analyze the RBCD and ARCD methods subsequently. Throughout this paper we assume that problem (1) has a minimum ($F^\star > -\infty$) and that its set of optimal solutions, denoted by $X^*$, is nonempty.

For any partition of $x \in \Re^N$ into $\{x_i \in \Re^{N_i} : i = 1,\dots,n\}$, there is an $N\times N$ permutation matrix $U$, partitioned as $U = [U_1 \cdots U_n]$ with $U_i \in \Re^{N\times N_i}$, such that
$$x = \sum_{i=1}^n U_i x_i, \quad\text{and}\quad x_i = U_i^T x, \quad i = 1,\dots,n.$$
For any $x \in \Re^N$, the partial gradient of $f$ with respect to $x_i$ is defined as
$$\nabla_i f(x) = U_i^T \nabla f(x), \quad i = 1,\dots,n.$$
For simplicity of presentation, we associate each subspace $\Re^{N_i}$, for $i = 1,\dots,n$, with the standard Euclidean norm, denoted by $\|\cdot\|$. We make the following assumption, which is used in [8, 11] as well.

Assumption 1. The gradient of the function $f$ is block-wise Lipschitz continuous with constants $L_i$, i.e.,
$$\|\nabla_i f(x + U_i h_i) - \nabla_i f(x)\| \le L_i \|h_i\|, \quad \forall h_i\in\Re^{N_i},\ i = 1,\dots,n,\ x\in\Re^N.$$
Following [8], we define the following pair of norms in the whole space $\Re^N$:
$$\|x\|_L = \Big(\sum_{i=1}^n L_i\|x_i\|^2\Big)^{1/2},\ \forall x\in\Re^N, \qquad \|g\|_L^* = \Big(\sum_{i=1}^n \frac{1}{L_i}\|g_i\|^2\Big)^{1/2},\ \forall g\in\Re^N.$$
Clearly, they satisfy the Cauchy–Schwarz inequality:
$$\langle g, x\rangle \le \|x\|_L\cdot\|g\|_L^*, \quad \forall x, g\in\Re^N.$$
The convexity parameter of a convex function $\phi : \Re^N \to \Re\cup\{+\infty\}$ with respect to the norm $\|\cdot\|_L$, denoted by $\mu_\phi$, is the largest $\mu \ge 0$ such that for all $x, y \in \operatorname{dom}\phi$,
$$\phi(y) \ge \phi(x) + \langle s, y - x\rangle + \frac{\mu}{2}\|y - x\|_L^2, \quad \forall s\in\partial\phi(x).$$
Clearly, $\phi$ is strongly convex if and only if $\mu_\phi > 0$.
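As a quick numerical sanity check (our own illustrative sketch with synthetic data, not part of the paper's analysis), the pair of norms above and their Cauchy–Schwarz inequality can be verified directly; the check also confirms that $\|\cdot\|_L^*$ is attained at $x_i = g_i/L_i$, as a dual norm should be:

```python
import numpy as np

# Synthetic data; the block sizes N_i = 3 are arbitrary.
rng = np.random.default_rng(0)
n, Ni = 5, 3
L = rng.uniform(0.5, 4.0, size=n)        # block Lipschitz constants L_i > 0
x = rng.standard_normal((n, Ni))         # row i holds the subvector x_i
g = rng.standard_normal((n, Ni))

norm_L = np.sqrt(sum(L[i] * x[i] @ x[i] for i in range(n)))       # ||x||_L
dual_norm_L = np.sqrt(sum(g[i] @ g[i] / L[i] for i in range(n)))  # ||g||_L^*
inner = sum(g[i] @ x[i] for i in range(n))                        # <g, x>

# Cauchy-Schwarz: <g, x> <= ||x||_L * ||g||_L^*
assert inner <= norm_L * dual_norm_L + 1e-12

# The dual norm is attained at x_i = g_i / L_i:
x_star = np.array([g[i] / L[i] for i in range(n)])
attained = sum(g[i] @ x_star[i] for i in range(n))
norm_x_star = np.sqrt(sum(L[i] * x_star[i] @ x_star[i] for i in range(n)))
assert np.isclose(attained, dual_norm_L * norm_x_star)
```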
Assume that $f$ and $\Psi$ have convexity parameters $\mu_f \ge 0$ and $\mu_\Psi \ge 0$ with respect to the norm $\|\cdot\|_L$, respectively. Then the convexity parameter of $F = f + \Psi$ is at least $\mu_f + \mu_\Psi$. Moreover, by Assumption 1, we have
$$f(x + U_i h_i) \le f(x) + \langle\nabla_i f(x), h_i\rangle + \frac{L_i}{2}\|h_i\|^2, \quad \forall h_i\in\Re^{N_i},\ i = 1,\dots,n,\ x\in\Re^N, \qquad (2)$$
which immediately implies that $\mu_f \le 1$.

The following lemma concerns the expected value of a block-separable function when a random block of coordinates is updated.

Lemma 1. Suppose that $\Phi(x) = \sum_{i=1}^n \Phi_i(x_i)$. For any $x, d \in \Re^N$, if we pick $i \in \{1,\dots,n\}$ uniformly at random, then
$$\mathbb{E}_i\,\Phi(x + U_i d_i) = \frac{1}{n}\Phi(x + d) + \frac{n-1}{n}\Phi(x).$$
Proof. Since each $i$ is picked randomly with probability $1/n$, we have
$$\mathbb{E}_i\,\Phi(x + U_i d_i) = \frac{1}{n}\sum_{i=1}^n\Big(\Phi_i(x_i + d_i) + \sum_{j\neq i}\Phi_j(x_j)\Big) = \frac{1}{n}\sum_{i=1}^n\Phi_i(x_i + d_i) + \frac{1}{n}\sum_{i=1}^n\sum_{j\neq i}\Phi_j(x_j) = \frac{1}{n}\Phi(x + d) + \frac{n-1}{n}\Phi(x).$$
For notational convenience, we define
$$H(x, d) := f(x) + \langle\nabla f(x), d\rangle + \frac{1}{2}\|d\|_L^2 + \Psi(x + d). \qquad (3)$$
The following result is equivalent to [11, Lemma 2].

Lemma 2. Suppose $x, d \in \Re^N$. If we pick $i \in \{1,\dots,n\}$ uniformly at random, then
$$\mathbb{E}_i\,F(x + U_i d_i) - F(x) \le \frac{1}{n}\big(H(x, d) - F(x)\big).$$
We next develop some results regarding the block-wise composite gradient mapping. The composite gradient mapping was introduced by Nesterov [7] for the analysis of full gradient methods for solving problem (1). Here we extend the concept and several associated properties to the block-coordinate case.

As mentioned in the introduction, the RBCD method studied in [11] solves in each iteration a block-wise proximal subproblem of the form
$$d_i(x) := \arg\min_{d_i\in\Re^{N_i}}\ \langle\nabla_i f(x), d_i\rangle + \frac{L_i}{2}\|d_i\|^2 + \Psi_i(x_i + d_i),$$
for some $i \in \{1,\dots,n\}$.
By the first-order optimality condition, there exists a subgradient $s_i \in \partial\Psi_i(x_i + d_i(x))$ such that
$$\nabla_i f(x) + L_i d_i(x) + s_i = 0. \qquad (4)$$
Let $d(x) = \sum_{i=1}^n U_i d_i(x)$. By (3), the definition of $\|\cdot\|_L$ and the separability of $\Psi$, we then have
$$d(x) = \arg\min_{d\in\Re^N} H(x, d).$$
We define the block-wise composite gradient mappings as
$$g_i(x) \stackrel{\mathrm{def}}{=} -L_i d_i(x), \quad i = 1,\dots,n.$$
From the optimality conditions (4), we conclude
$$-\nabla_i f(x) + g_i(x) \in \partial\Psi_i(x_i + d_i(x)), \quad i = 1,\dots,n.$$
Let
$$g(x) = \sum_{i=1}^n U_i g_i(x).$$
Then we have
$$-\nabla f(x) + g(x) \in \partial\Psi(x + d(x)). \qquad (5)$$
Moreover,
$$\|d(x)\|_L^2 = \sum_{i=1}^n L_i\|d_i(x)\|^2 = \sum_{i=1}^n \frac{1}{L_i}\|g_i(x)\|^2 = \big(\|g(x)\|_L^*\big)^2,$$
and
$$\langle g(x), d(x)\rangle = -\|d(x)\|_L^2 = -\big(\|g(x)\|_L^*\big)^2. \qquad (6)$$
The following result establishes a lower bound on the function value $F(y)$, where $y$ is arbitrary in $\Re^N$, based on the composite gradient mapping at another point $x$.

Lemma 3. For any fixed $x, y \in \Re^N$, if we pick $i \in \{1,\dots,n\}$ uniformly at random, then
$$\frac{1}{n}F(y) + \frac{n-1}{n}F(x) \ \ge\ \mathbb{E}_i\,F(x + U_i d_i(x)) + \frac{1}{n}\Big(\langle g(x), y - x\rangle + \frac{1}{2}\big(\|g(x)\|_L^*\big)^2\Big) + \frac{1}{n}\Big(\frac{\mu_f}{2}\|x - y\|_L^2 + \frac{\mu_\Psi}{2}\|x + d(x) - y\|_L^2\Big).$$
Proof. By (5) and the convexity of $f$ and $\Psi$, we have
$$\begin{aligned}
H(x, d(x)) &= f(x) + \langle\nabla f(x), d(x)\rangle + \frac{1}{2}\|d(x)\|_L^2 + \Psi(x + d(x))\\
&\le f(y) + \langle\nabla f(x), x - y\rangle - \frac{\mu_f}{2}\|x - y\|_L^2 + \langle\nabla f(x), d(x)\rangle + \frac{1}{2}\|d(x)\|_L^2\\
&\quad + \Psi(y) + \langle -\nabla f(x) + g(x),\ x + d(x) - y\rangle - \frac{\mu_\Psi}{2}\|x + d(x) - y\|_L^2\\
&= F(y) + \langle g(x), x - y\rangle + \langle g(x), d(x)\rangle + \frac{1}{2}\|d(x)\|_L^2 - \frac{\mu_f}{2}\|x - y\|_L^2 - \frac{\mu_\Psi}{2}\|x + d(x) - y\|_L^2\\
&= F(y) + \langle g(x), x - y\rangle - \frac{1}{2}\big(\|g(x)\|_L^*\big)^2 - \frac{\mu_f}{2}\|x - y\|_L^2 - \frac{\mu_\Psi}{2}\|x + d(x) - y\|_L^2,
\end{aligned}$$
where the last equality holds due to (6). This together with Lemma 2 yields the desired result.
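Lemma 1 and the identities in (6) are purely mechanical, so they can be checked numerically; the following sketch (our own illustration, synthetic data, scalar blocks $N_i = 1$ for brevity) verifies both:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
L = rng.uniform(0.5, 3.0, size=n)   # L_i
x = rng.standard_normal(n)          # one coordinate per block (N_i = 1)
d = rng.standard_normal(n)          # stand-ins for the directions d_i(x)

# Lemma 1 with the separable choice Phi(z) = sum_i z_i^2:
Phi = lambda z: np.sum(z**2)
# E_i Phi(x + U_i d_i): average over the n equally likely single-block updates
lhs = np.mean([Phi(x + d * (np.arange(n) == i)) for i in range(n)])
rhs = Phi(x + d) / n + (n - 1) / n * Phi(x)
assert np.isclose(lhs, rhs)

# Identities (6) with g_i(x) = -L_i d_i(x):
g = -L * d
d_norm_sq = np.sum(L * d**2)        # ||d(x)||_L^2
g_dual_sq = np.sum(g**2 / L)        # (||g(x)||_L^*)^2
assert np.isclose(d_norm_sq, g_dual_sq)
assert np.isclose(g @ d, -d_norm_sq)
```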
Applying Lemma 1 to the block-separable function $\Phi(\cdot) = \frac{\mu_\Psi}{2}\|\cdot - y\|_L^2$, we can rewrite the conclusion of Lemma 3 in an equivalent form:
$$\frac{1}{n}F(y) + \frac{n-1}{n}F(x) + \frac{\mu_\Psi}{2}\|x - y\|_L^2 \ \ge\ \mathbb{E}_i\Big[F(x + U_i d_i(x)) + \frac{\mu_\Psi}{2}\|x + U_i d_i(x) - y\|_L^2\Big] + \frac{1}{n}\Big(\langle g(x), y - x\rangle + \frac{1}{2}\big(\|g(x)\|_L^*\big)^2 + \frac{\mu_f + \mu_\Psi}{2}\|x - y\|_L^2\Big). \qquad (7)$$
This is the form we will actually use in our subsequent convergence analysis. Letting $y = x$ in Lemma 3, we obtain the following corollary.

Corollary 1. Given $x \in \Re^N$, if we pick $i \in \{1,\dots,n\}$ uniformly at random, then
$$F(x) - \mathbb{E}_i\,F(x + U_i d_i(x)) \ge \frac{1 + \mu_\Psi}{2n}\big(\|g(x)\|_L^*\big)^2 = \frac{1 + \mu_\Psi}{2n}\|d(x)\|_L^2.$$
By similar arguments as in the proof of Lemma 3, it can be shown that a result similar to Lemma 3 also holds block-wise, without taking expectation:
$$F(x) - F(x + U_i d_i(x)) \ge \frac{1 + \mu_\Psi}{2}L_i\|d_i(x)\|^2.$$
The following (trivial) corollary is useful when we do not have knowledge of $\mu_f$ or $\mu_\Psi$.

Corollary 2. For any fixed $x, y \in \Re^N$, if we pick $i \in \{1,\dots,n\}$ uniformly at random, then
$$\frac{1}{n}F(y) + \frac{n-1}{n}F(x) \ge \mathbb{E}_i\,F(x + U_i d_i(x)) + \frac{1}{n}\Big(\langle g(x), y - x\rangle + \frac{1}{2}\big(\|g(x)\|_L^*\big)^2\Big).$$

3 Randomized block-coordinate descent

In this section we analyze the following randomized block-coordinate descent (RBCD) method for solving problem (1), which was proposed in [11]. In particular, we extend Nesterov's technique [8], developed for a special case of problem (1), to work with the general setting, and establish a sharper expected-value type of convergence rate, as well as an improved high-probability iteration complexity, compared with those given or implied in [11].

Algorithm: RBCD($x^0$)

Repeat for $k = 0, 1, 2, \dots$
1. Choose $i_k \in \{1,\dots,n\}$ randomly with a uniform distribution.
2. Update $x^{k+1} = x^k + U_{i_k} d_{i_k}(x^k)$.
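As an illustration only (our own minimal sketch, not code from [11]), the RBCD method instantiated for the lasso objective $f(x) = \frac{1}{2}\|Ax - b\|^2$, $\Psi(x) = \lambda\|x\|_1$ with single-coordinate blocks; here the subproblem defining $d_i(x^k)$ has a closed-form soft-thresholding solution:

```python
import numpy as np

def rbcd_lasso(A, b, lam, x0, iters, seed=0):
    """RBCD for min 0.5*||Ax - b||^2 + lam*||x||_1 with N_i = 1 blocks."""
    rng = np.random.default_rng(seed)
    x = x0.astype(float).copy()
    L = np.sum(A**2, axis=0)           # L_i = ||A e_i||^2: coordinate Lipschitz constants
    res = A @ x - b                    # maintained residual Ax - b
    for _ in range(iters):
        i = rng.integers(len(x))       # step 1: pick a block uniformly at random
        grad_i = A[:, i] @ res         # partial gradient of f at x
        # step 2: d_i minimizes <grad_i, d> + (L_i/2)*d^2 + lam*|x_i + d|;
        # equivalently, x_i + d_i is a soft-thresholded gradient step:
        z = x[i] - grad_i / L[i]
        x_new_i = np.sign(z) * max(abs(z) - lam / L[i], 0.0)
        res += A[:, i] * (x_new_i - x[i])
        x[i] = x_new_i
    return x

# Tiny usage example (synthetic data):
rng = np.random.default_rng(42)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)
x = rbcd_lasso(A, b, lam=0.1, x0=np.zeros(10), iters=500)
```

Since each block subproblem is solved exactly, every step is non-increasing in $F$, and Corollary 1 quantifies the expected decrease, which is easy to monitor in practice.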
After $k$ iterations, the RBCD method generates a random output $x^k$, which depends on the observed realization of the random variable
$$\xi_{k-1} \stackrel{\mathrm{def}}{=} \{i_0, i_1, \dots, i_{k-1}\}.$$
The following quantity, which measures the distance between $x^0$ and the optimal solution set of problem (1), will appear in our complexity results:
$$R_0 \stackrel{\mathrm{def}}{=} \min_{x^\star\in X^*}\|x^0 - x^\star\|_L, \qquad (8)$$
where $X^*$ is the set of optimal solutions of problem (1).

3.1 Convergence rate of expected values

The following theorem is a generalization of [8, Theorem 5], where the function $\Psi$ in (1) is restricted to be the indicator function of a block-separable closed convex set. Here we extend it to the general case of $\Psi$ being a block-separable convex function, by employing the machinery of block-wise composite gradient mappings developed in Section 2.

Theorem 1. Let $R_0$ be defined in (8), $F^\star$ be the optimal value of problem (1), and $\{x^k\}$ be the sequence generated by the RBCD method. Then for any $k \ge 0$, the iterate $x^k$ satisfies
$$\mathbb{E}_{\xi_{k-1}}F(x^k) - F^\star \le \frac{n}{n+k}\Big(\frac{1}{2}R_0^2 + F(x^0) - F^\star\Big). \qquad (9)$$
Furthermore, if at least one of $f$ and $\Psi$ is strongly convex, i.e., $\mu_f + \mu_\Psi > 0$, then
$$\mathbb{E}_{\xi_{k-1}}F(x^k) - F^\star \le \Big(1 - \frac{2(\mu_f + \mu_\Psi)}{n(1 + \mu_f + 2\mu_\Psi)}\Big)^k\Big(\frac{1 + \mu_\Psi}{2}R_0^2 + F(x^0) - F^\star\Big). \qquad (10)$$
Proof. Let $x^\star$ be an arbitrary optimal solution of (1). Denote
$$r_k^2 = \|x^k - x^\star\|_L^2 = \sum_{i=1}^n L_i\langle x_i^k - x_i^\star,\ x_i^k - x_i^\star\rangle.$$
Notice that $x^{k+1} = x^k + U_{i_k} d_{i_k}(x^k)$. Thus we have
$$r_{k+1}^2 = r_k^2 + 2L_{i_k}\langle d_{i_k}(x^k), x_{i_k}^k - x_{i_k}^\star\rangle + L_{i_k}\|d_{i_k}(x^k)\|^2.$$
Multiplying both sides by $1/2$ and taking expectation with respect to $i_k$ yield
$$\mathbb{E}_{i_k}\Big[\frac{1}{2}r_{k+1}^2\Big] = \frac{1}{2}r_k^2 + \frac{1}{n}\Big(\sum_{i=1}^n L_i\langle d_i(x^k), x_i^k - x_i^\star\rangle + \frac{1}{2}\sum_{i=1}^n\frac{1}{L_i}\|g_i(x^k)\|^2\Big) = \frac{1}{2}r_k^2 + \frac{1}{n}\Big(\langle g(x^k), x^\star - x^k\rangle + \frac{1}{2}\big(\|g(x^k)\|_L^*\big)^2\Big). \qquad (11)$$
Using Corollary 2 (with $x = x^k$ and $y = x^\star$), we obtain
$$\mathbb{E}_{i_k}\Big[\frac{1}{2}r_{k+1}^2\Big] \le \frac{1}{2}r_k^2 + \frac{1}{n}F^\star + \frac{n-1}{n}F(x^k) - \mathbb{E}_{i_k}F(x^{k+1}).$$
By rearranging terms, we obtain that for each $k \ge 0$,
$$\mathbb{E}_{i_k}\Big[\frac{1}{2}r_{k+1}^2 + F(x^{k+1}) - F^\star\Big] \le \frac{1}{2}r_k^2 + F(x^k) - F^\star - \frac{1}{n}\big(F(x^k) - F^\star\big).$$
Taking expectation with respect to $\xi_{k-1}$ on both sides of the above inequality, we have
$$\mathbb{E}_{\xi_k}\Big[\frac{1}{2}r_{k+1}^2 + F(x^{k+1}) - F^\star\Big] \le \mathbb{E}_{\xi_{k-1}}\Big[\frac{1}{2}r_k^2 + F(x^k) - F^\star\Big] - \frac{1}{n}\Big(\mathbb{E}_{\xi_{k-1}}F(x^k) - F^\star\Big).$$
Applying this inequality recursively and using the fact that $\mathbb{E}_{\xi_k}F(x^j)$ is monotonically decreasing for $j = 0,\dots,k+1$ (see Corollary 1), we further obtain
$$\mathbb{E}_{\xi_k}F(x^{k+1}) - F^\star \le \mathbb{E}_{\xi_k}\Big[\frac{1}{2}r_{k+1}^2 + F(x^{k+1}) - F^\star\Big] \le \frac{1}{2}r_0^2 + F(x^0) - F^\star - \frac{1}{n}\sum_{j=0}^k\Big(\mathbb{E}_{\xi_k}F(x^j) - F^\star\Big) \le \frac{1}{2}r_0^2 + F(x^0) - F^\star - \frac{k+1}{n}\Big(\mathbb{E}_{\xi_k}F(x^{k+1}) - F^\star\Big).$$
This leads to
$$\mathbb{E}_{\xi_k}F(x^{k+1}) - F^\star \le \frac{n}{n+k+1}\Big(\frac{1}{2}\|x^0 - x^\star\|_L^2 + F(x^0) - F^\star\Big),$$
which together with the arbitrariness of $x^\star$ and the definition of $R_0$ yields (9).

Next we prove (10) under the strong convexity assumption $\mu_f + \mu_\Psi > 0$. Using (7) and (11), we obtain
$$\mathbb{E}_{i_k}\Big[\frac{1 + \mu_\Psi}{2}r_{k+1}^2 + F(x^{k+1}) - F^\star\Big] \le \frac{1 + \mu_\Psi}{2}r_k^2 + F(x^k) - F^\star - \frac{1}{n}\Big(\frac{\mu_f + \mu_\Psi}{2}r_k^2 + F(x^k) - F^\star\Big). \qquad (12)$$
By the strong convexity of $F$, we have
$$\frac{\mu_f + \mu_\Psi}{2}r_k^2 + F(x^k) - F^\star \ge \frac{\mu_f + \mu_\Psi}{2}r_k^2 + \frac{\mu_f + \mu_\Psi}{2}r_k^2 = (\mu_f + \mu_\Psi)r_k^2.$$
Define
$$\beta = \frac{2(\mu_f + \mu_\Psi)}{1 + \mu_f + 2\mu_\Psi}.$$
We have $0 < \beta \le 1$ due to $\mu_f + \mu_\Psi > 0$ and $\mu_f \le 1$. Then
$$\frac{\mu_f + \mu_\Psi}{2}r_k^2 + F(x^k) - F^\star \ge \beta\Big(\frac{\mu_f + \mu_\Psi}{2}r_k^2 + F(x^k) - F^\star\Big) + (1 - \beta)(\mu_f + \mu_\Psi)r_k^2 = \beta\Big(\frac{1 + \mu_\Psi}{2}r_k^2 + F(x^k) - F^\star\Big).$$
Combining the above inequality with (12) gives
$$\mathbb{E}_{i_k}\Big[\frac{1 + \mu_\Psi}{2}r_{k+1}^2 + F(x^{k+1}) - F^\star\Big] \le \Big(1 - \frac{\beta}{n}\Big)\Big(\frac{1 + \mu_\Psi}{2}r_k^2 + F(x^k) - F^\star\Big).$$
Taking expectation with respect to $\xi_{k-1}$ on both sides of the above relation, we have
$$\mathbb{E}_{\xi_k}\Big[\frac{1 + \mu_\Psi}{2}r_{k+1}^2 + F(x^{k+1}) - F^\star\Big] \le \Big(1 - \frac{\beta}{n}\Big)^{k+1}\Big(\frac{1 + \mu_\Psi}{2}r_0^2 + F(x^0) - F^\star\Big),$$
which together with the arbitrariness of $x^\star$ and the definition of $R_0$ leads to (10).

We have the following remarks on comparing the results in Theorem 1 with those in [11].

• For the general setting of problem (1), an expected-value type of convergence rate is not presented explicitly in [11]. Nevertheless, it can be derived straightforwardly from the following relation, which was proved in [11, Theorem 5]:
$$\mathbb{E}_{i_k}[\Delta_{k+1}] \le \Delta_k - \frac{\Delta_k^2}{2nc}, \quad \forall k \ge 0, \qquad (13)$$
where $\Delta_k := F(x^k) - F^\star$, and
$$c := \max\{\bar{R}_0^2,\ F(x^0) - F^\star\}, \qquad (14)$$
$$\bar{R}_0 := \max_x\Big\{\max_{x^\star\in X^*}\|x - x^\star\|_L : F(x) \le F(x^0)\Big\}. \qquad (15)$$
Taking expectation with respect to $\xi_{k-1}$ on both sides of (13), one can have
$$\mathbb{E}_{\xi_k}[\Delta_{k+1}] \le \mathbb{E}_{\xi_{k-1}}[\Delta_k] - \frac{1}{2nc}\big(\mathbb{E}_{\xi_{k-1}}[\Delta_k]\big)^2, \quad \forall k \ge 0.$$
By this relation and a similar argument as used in the proof of [8, Theorem 1], one can obtain that
$$\mathbb{E}_{\xi_{k-1}}[F(x^k)] - F^\star \le \frac{2nc\,(F(x^0) - F^\star)}{k(F(x^0) - F^\star) + 2nc}, \quad \forall k \ge 0. \qquad (16)$$
Let $a$ and $b$ denote the right-hand sides of (9) and (16), respectively. By the definition of $c$ and the relation $\bar{R}_0 \ge R_0$, we can see that when $k$ is sufficiently large,
$$\frac{b}{a} \approx \frac{2c}{\frac{1}{2}R_0^2 + F(x^0) - F^\star} \ge \frac{4}{3}.$$
Therefore, our expected-value type of convergence rate is better by at least a factor of $4/3$ asymptotically, and the improvement can be much larger if $\bar{R}_0$ is much larger than $R_0$.
• For the special case of (1) where at least one of $f$ and $\Psi$ is strongly convex, i.e., $\mu_f + \mu_\Psi > 0$, Richtárik and Takáč [11, Theorem 7] showed that for all $k \ge 0$,
$$\mathbb{E}_{\xi_{k-1}}F(x^k) - F^\star \le \Big(1 - \frac{\mu_f + \mu_\Psi}{n(1 + \mu_\Psi)}\Big)^k\big(F(x^0) - F^\star\big).$$
It is not hard to observe that
$$\frac{2(\mu_f + \mu_\Psi)}{n(1 + \mu_f + 2\mu_\Psi)} > \frac{\mu_f + \mu_\Psi}{n(1 + \mu_\Psi)}. \qquad (17)$$
It then follows that for sufficiently large $k$, one has (using $R_0^2 \le 2(F(x^0) - F^\star)/(\mu_f + \mu_\Psi)$, which holds by the strong convexity of $F$)
$$\Big(1 - \frac{2(\mu_f + \mu_\Psi)}{n(1 + \mu_f + 2\mu_\Psi)}\Big)^k\Big(\frac{1 + \mu_\Psi}{2}R_0^2 + F(x^0) - F^\star\Big) \le \Big(1 - \frac{2(\mu_f + \mu_\Psi)}{n(1 + \mu_f + 2\mu_\Psi)}\Big)^k\Big(1 + \frac{1 + \mu_\Psi}{\mu_f + \mu_\Psi}\Big)\big(F(x^0) - F^\star\big) \ll \Big(1 - \frac{\mu_f + \mu_\Psi}{n(1 + \mu_\Psi)}\Big)^k\big(F(x^0) - F^\star\big).$$
Therefore, our convergence rate (10) is much sharper than their rate for sufficiently large $k$.

3.2 High-probability complexity bound

By virtue of Theorem 1, we can also derive a sharper iteration complexity for a single run of the RBCD method for obtaining an $\epsilon$-optimal solution with high probability than the one given in [11, Theorems 5 and 7].

Theorem 2. Let $R_0$ be defined in (8) and $\{x^k\}$ be the sequence generated by the RBCD method. Let $0 < \epsilon < F(x^0) - F^\star$ and $\rho \in (0, 1)$ be chosen arbitrarily.

(i) For all $k \ge K$, there holds
$$P\big(F(x^k) - F^\star \le \epsilon\big) \ge 1 - \rho, \qquad (18)$$
where
$$K := \frac{2nc}{\epsilon}\left(1 + \log\frac{R_0^2 + 2[F(x^0) - F^\star]}{4c\rho}\right) + 2 - n. \qquad (19)$$
(ii) Furthermore, if at least one of $f$ and $\Psi$ is strongly convex, i.e., $\mu_f + \mu_\Psi > 0$, then (18) holds when $k \ge \tilde{K}$, where
$$\tilde{K} := \frac{n(1 + \mu_f + 2\mu_\Psi)}{2(\mu_f + \mu_\Psi)}\log\left(\frac{\frac{1 + \mu_\Psi}{2}R_0^2 + F(x^0) - F^\star}{\rho\epsilon}\right).$$
Proof. (i) For convenience, let $\Delta_k = F(x^k) - F^\star$ for all $k$. Define the truncated sequence $\{\Delta_k^\epsilon\}$ as follows:
$$\Delta_k^\epsilon = \begin{cases}\Delta_k & \text{if } \Delta_k \ge \epsilon,\\ 0 & \text{otherwise}.\end{cases}$$
Using (13) and the same argument as used in the proof of [11, Theorem 1], one can have
$$\mathbb{E}_{i_k}[\Delta_{k+1}^\epsilon] \le \Big(1 - \frac{\epsilon}{2nc}\Big)\Delta_k^\epsilon, \quad \forall k \ge 0.$$
Taking expectation with respect to $\xi_{k-1}$ on both sides of the above relation, we obtain
$$\mathbb{E}_{\xi_k}[\Delta_{k+1}^\epsilon] \le \Big(1 - \frac{\epsilon}{2nc}\Big)\mathbb{E}_{\xi_{k-1}}[\Delta_k^\epsilon], \quad \forall k \ge 0. \qquad (20)$$
In addition, using (9) and the relation $\Delta_k^\epsilon \le \Delta_k$, we have
$$\mathbb{E}_{\xi_{k-1}}[\Delta_k^\epsilon] \le \frac{n}{n+k}\Big(\frac{1}{2}R_0^2 + F(x^0) - F^\star\Big), \quad \forall k \ge 0. \qquad (21)$$
For any $t > 0$, let
$$K_1 = \frac{n}{t\epsilon}\Big(\frac{1}{2}R_0^2 + F(x^0) - F^\star\Big) - n, \qquad K_2 = \frac{2nc}{\epsilon}\log\frac{t}{\rho}.$$
It follows from (21) that $\mathbb{E}_{\xi_{K_1-1}}[\Delta_{K_1}^\epsilon] \le t\epsilon$, which together with (20) implies that
$$\mathbb{E}_{\xi_{K_1+K_2-1}}[\Delta_{K_1+K_2}^\epsilon] \le \Big(1 - \frac{\epsilon}{2nc}\Big)^{K_2}\mathbb{E}_{\xi_{K_1-1}}[\Delta_{K_1}^\epsilon] \le \Big(1 - \frac{\epsilon}{2nc}\Big)^{K_2}t\epsilon \le \rho\epsilon.$$
Notice from (20) that $\{\mathbb{E}_{\xi_{k-1}}[\Delta_k^\epsilon]\}$ is decreasing. Hence, we have
$$\mathbb{E}_{\xi_{k-1}}[\Delta_k^\epsilon] \le \rho\epsilon, \quad \forall k \ge K(t), \qquad (22)$$
where
$$K(t) := \frac{n}{t\epsilon}\Big(\frac{1}{2}R_0^2 + F(x^0) - F^\star\Big) + \frac{2nc}{\epsilon}\log\frac{t}{\rho} + 2 - n.$$
It is not hard to verify that
$$t^* := \frac{\frac{1}{2}R_0^2 + F(x^0) - F^\star}{2c} = \arg\min_{t>0}K(t).$$
Also, one can observe from (19) that $K \ge K(t^*)$, which together with (22) implies that $\mathbb{E}_{\xi_{k-1}}[\Delta_k^\epsilon] \le \rho\epsilon$ for all $k \ge K$. Using this relation and the Markov inequality, we obtain
$$P\big(F(x^k) - F^\star > \epsilon\big) = P(\Delta_k > \epsilon) = P(\Delta_k^\epsilon > \epsilon) \le \frac{\mathbb{E}_{\xi_{k-1}}[\Delta_k^\epsilon]}{\epsilon} \le \rho, \quad \forall k \ge K,$$
which immediately implies that statement (i) holds.

(ii) Using the Markov inequality, the inequality (10) and the definition of $\tilde{K}$, we obtain that for any $k \ge \tilde{K}$,
$$P\big(F(x^k) - F^\star > \epsilon\big) \le \frac{\mathbb{E}_{\xi_{k-1}}[F(x^k) - F^\star]}{\epsilon} \le \frac{1}{\epsilon}\Big(1 - \frac{2(\mu_f + \mu_\Psi)}{n(1 + \mu_f + 2\mu_\Psi)}\Big)^{\tilde{K}}\Big(\frac{1 + \mu_\Psi}{2}R_0^2 + F(x^0) - F^\star\Big) \le \frac{1}{\epsilon}\exp\Big(-\frac{2(\mu_f + \mu_\Psi)\tilde{K}}{n(1 + \mu_f + 2\mu_\Psi)}\Big)\Big(\frac{1 + \mu_\Psi}{2}R_0^2 + F(x^0) - F^\star\Big) \le \rho,$$
and hence statement (ii) holds.

We make the following remarks in comparing our results in Theorem 2 with those in [11].

• For any $0 < \epsilon < F(x^0) - F^\star$ and $\rho \in (0, 1)$, Richtárik and Takáč [11, Theorem 5] showed that (18) holds for all $k \ge \bar{K}$, where
$$\bar{K} = \frac{2nc}{\epsilon}\Big(1 + \log\frac{1}{\rho}\Big) + 2 - \frac{2nc}{F(x^0) - F^\star}$$
and $c$ is given in (14).
Using the definitions of $c$ and $R_0$ and the fact $R_0 \le \bar{R}_0$, one can observe that
$$\tau := \frac{R_0^2 + 2[F(x^0) - F^\star]}{4c} \le \frac{3}{4}.$$
By the definitions of $K$ and $\bar{K}$, we have that for sufficiently small $\epsilon > 0$,
$$K - \bar{K} \approx \frac{2nc\log\tau}{\epsilon} \le -\frac{2nc\log(4/3)}{\epsilon}.$$
In addition, by the definitions of $R_0$ and $\bar{R}_0$, one can see that $R_0$ can be much smaller than $\bar{R}_0$ and thus $\tau$ can be very small. It follows from the above relation that $K$ can be substantially smaller than $\bar{K}$.

• For the special case of (1) where at least one of $f$ and $\Psi$ is strongly convex, i.e., $\mu_f + \mu_\Psi > 0$, Richtárik and Takáč [11, Theorem 8] showed that (18) holds for all $k \ge \hat{K}$, where
$$\hat{K} := \frac{n(1 + \mu_\Psi)}{\mu_f + \mu_\Psi}\log\frac{F(x^0) - F^\star}{\rho\epsilon}.$$
We then see that when $\rho$ or $\epsilon$ is sufficiently small,
$$\frac{\tilde{K}}{\hat{K}} \approx \frac{1 + \mu_f + 2\mu_\Psi}{2(1 + \mu_\Psi)} \le 1$$
due to $0 \le \mu_f \le 1$. When $\mu_f < 1$, we have $\tilde{K} \le \tilde{\tau}\hat{K}$ for some $\tilde{\tau} \in (0, 1)$, and thus our complexity bound is tighter when $\rho$ or $\epsilon$ is sufficiently small.

As discussed in [11, Section 2], the number of iterations required by the RBCD method for obtaining an $\epsilon$-optimal solution with high probability can also be estimated by using a multiple-run strategy, each run with an independently generated random sequence $\{i_0, i_1, \dots\}$. We next derive such an iteration complexity.

Theorem 3. Let $0 < \epsilon < F(x^0) - F^\star$ and $\rho \in (0, 1)$ be arbitrarily chosen, and let $r = \lceil\log(1/\rho)\rceil$. Suppose that we run the RBCD method starting with $x^0$ for $r$ times independently, each time for the same number of iterations $k$. Let $x^k_{(j)}$ denote the output of the RBCD method at the $k$th iteration of the $j$th run. Then there holds
$$P\Big(\min_{1\le j\le r}F(x^k_{(j)}) - F^\star \le \epsilon\Big) \ge 1 - \rho$$
for any $k \ge K$, where
$$K := \frac{en}{\epsilon}\Big(\frac{1}{2}R_0^2 + F(x^0) - F^\star\Big) - n.$$
Proof. Let $\xi^{(j)}_{k-1} = \big\{i^{(j)}_0, i^{(j)}_1, \dots, i^{(j)}_{k-1}\big\}$ denote the random sequence used in the $j$th run.
Using the Markov inequality, (9) and the definition of $K$, we obtain that for any $k \ge K$,
$$P\big(F(x^k_{(j)}) - F^\star > \epsilon\big) \le \frac{\mathbb{E}_{\xi^{(j)}_{k-1}}[F(x^k_{(j)}) - F^\star]}{\epsilon} \le \frac{n}{(n+k)\epsilon}\Big(\frac{1}{2}R_0^2 + F(x^0) - F^\star\Big) \le \frac{1}{e}.$$
This together with the definition of $r$ implies that
$$P\Big(\min_{1\le j\le r}F(x^k_{(j)}) - F^\star > \epsilon\Big) = \prod_{j=1}^r P\big(F(x^k_{(j)}) - F^\star > \epsilon\big) \le \Big(\frac{1}{e}\Big)^r \le \rho,$$
and hence the conclusion holds.

Remark. From Theorem 3, one can see that the total number of iterations needed by RBCD with a multiple-run strategy for obtaining an $\epsilon$-optimal solution is at most
$$K_M := \Big(\frac{2en}{\epsilon}\big[R_0^2 + 2(F(x^0) - F^\star)\big] - n\Big)\log\frac{1}{\rho}.$$
It was implicitly established in [11] that an $\epsilon$-optimal solution can be found by RBCD with a multiple-run strategy in at most
$$\bar{K}_M := \Big(\frac{2enc}{\epsilon} - \frac{2nc}{F(x^0) - F^\star}\Big)\log\frac{1}{\rho}$$
iterations. When $\rho$ or $\epsilon$ is sufficiently small, we have
$$\frac{K_M}{\bar{K}_M} \approx \frac{R_0^2 + 2(F(x^0) - F^\star)}{c}.$$
Recall that $\bar{R}_0$ can be much larger than $R_0$, which together with (15) implies that $c$ can be much larger than $R_0^2 + 2(F(x^0) - F^\star)$. It follows from the above relation that when $\rho$ or $\epsilon$ is sufficiently small, $K_M$ can be substantially smaller than $\bar{K}_M$.

4 Accelerated randomized coordinate descent

In this section, we restrict ourselves to the unconstrained smooth minimization problem
$$\min_{x\in\Re^N} f(x), \qquad (23)$$
where $f$ is convex on $\Re^N$ with convexity parameter $\mu = \mu_f \ge 0$ with respect to the norm $\|\cdot\|_L$ and satisfies Assumption 1. It then follows from (2) that $\mu \le 1$. Our aim is to analyze the convergence rate of the following accelerated randomized coordinate descent (ARCD) method.

Algorithm: ARCD($x^0$)

Set $v^0 = x^0$, choose $\gamma_0 > 0$ arbitrarily, and repeat for $k = 0, 1, 2, \dots$
1. Compute $\alpha_k \in (0, n]$ from the equation
$$\alpha_k^2 = \Big(1 - \frac{\alpha_k}{n}\Big)\gamma_k + \frac{\alpha_k}{n}\mu,$$
and set $\gamma_{k+1} = \big(1 - \frac{\alpha_k}{n}\big)\gamma_k + \frac{\alpha_k}{n}\mu$.
2. Compute $y^k$ as
$$y^k = \frac{1}{\frac{\alpha_k}{n}\gamma_k + \gamma_{k+1}}\Big(\frac{\alpha_k}{n}\gamma_k v^k + \gamma_{k+1}x^k\Big).$$
3. Choose $i_k \in \{1,\dots,n\}$ uniformly at random, and update
$$x^{k+1} = y^k - \frac{1}{L_{i_k}}U_{i_k}\nabla_{i_k}f(y^k).$$
4. Set
$$v^{k+1} = \frac{1}{\gamma_{k+1}}\Big[\Big(1 - \frac{\alpha_k}{n}\Big)\gamma_k v^k + \frac{\alpha_k}{n}\mu y^k - \frac{\alpha_k}{L_{i_k}}U_{i_k}\nabla_{i_k}f(y^k)\Big].$$

Remark. For the above algorithm, we claim that $\gamma_k > 0$ and $\alpha_k$ is well-defined for all $k$. Indeed, let $\gamma > 0$ be arbitrarily given and define
$$h(\alpha) := \alpha^2 - \Big(1 - \frac{\alpha}{n}\Big)\gamma - \frac{\alpha}{n}\mu, \quad \forall\alpha \ge 0.$$
We observe that
$$h(0) = -\gamma < 0, \qquad h(n) = n^2 - \mu \ge 0,$$
where the last inequality is due to $\mu \le 1$. Therefore, by the continuity of $h$, there exists some $\alpha^* \in (0, n]$ such that $h(\alpha^*) = 0$. Moreover, if $\mu = 0$, we have $0 < \alpha^* < n$. Using these observations and the definitions of $\alpha_k$ and $\gamma_k$, it is not hard to see by induction that $\gamma_k > 0$ and $\alpha_k$ is well-defined for all $k$.

The above description of the ARCD method comes directly from the derivation using the randomized estimate sequence we develop in Section 4.1, and is very convenient for the purpose of our convergence analysis. For implementation in practice, one can simplify the notation and use the equivalent algorithm described below. From the simplified description, it is also clear that the ARCD method is equivalent to the method (5.1) in [8, Section 5], with the following correspondences between the symbols used:

This paper:   $\alpha_k$      $\alpha_{k-1}$   $\theta_k$   $\beta_k$   $\mu$
[8, (5.1)]:   $1/\gamma_k$    $b_k/a_k$        $\alpha_k$   $\beta_k$   $\sigma$

Algorithm: ARCD($x^0$)

Set $v^0 = x^0$, choose $\alpha_{-1} \in (0, n]$, and repeat for $k = 0, 1, 2, \dots$
1. Compute $\alpha_k \in (0, n]$ from the equation
$$\alpha_k^2 = \Big(1 - \frac{\alpha_k}{n}\Big)\alpha_{k-1}^2 + \frac{\alpha_k}{n}\mu,$$
and set
$$\theta_k = \frac{n\alpha_k - \mu}{n^2 - \mu}, \qquad \beta_k = 1 - \frac{\mu}{n\alpha_k}.$$
2. Compute $y^k$ as
$$y^k = \theta_k v^k + (1 - \theta_k)x^k.$$
3. Choose $i_k \in \{1,\dots,n\}$ uniformly at random, and update
$$x^{k+1} = y^k - \frac{1}{L_{i_k}}U_{i_k}\nabla_{i_k}f(y^k).$$
4. Set
$$v^{k+1} = \beta_k v^k + (1 - \beta_k)y^k - \frac{1}{\alpha_k L_{i_k}}U_{i_k}\nabla_{i_k}f(y^k).$$

At each iteration $k$, the ARCD method generates $y^k$, $x^{k+1}$ and $v^{k+1}$.
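To make the simplified description concrete, here is our own minimal sketch of ARCD for a smooth quadratic $f(x) = \frac{1}{2}x^TQx - b^Tx$ with single-coordinate blocks (illustrative only; the default $\mu = 0$ corresponds to not exploiting strong convexity, and $\alpha_{-1}$ is fixed to an arbitrary value in $(0, n]$):

```python
import numpy as np

def arcd_quadratic(Q, b, x0, iters, mu=0.0, seed=0):
    """Simplified ARCD for f(x) = 0.5*x'Qx - b'x with N_i = 1 blocks."""
    rng = np.random.default_rng(seed)
    n = len(x0)
    L = np.diag(Q).copy()              # L_i = Q_ii for a quadratic
    x, v = x0.astype(float).copy(), x0.astype(float).copy()
    alpha_prev = n / 2.0               # alpha_{-1} in (0, n], arbitrary
    for _ in range(iters):
        # step 1: alpha_k solves a^2 = (1 - a/n)*alpha_prev^2 + (a/n)*mu,
        # i.e. the positive root of a^2 + c1*a - alpha_prev^2 = 0:
        c1 = (alpha_prev**2 - mu) / n
        alpha = (-c1 + np.sqrt(c1**2 + 4 * alpha_prev**2)) / 2.0
        theta = (n * alpha - mu) / (n**2 - mu)
        beta = 1.0 - mu / (n * alpha)
        # step 2
        y = theta * v + (1.0 - theta) * x
        # step 3: random coordinate gradient step
        i = rng.integers(n)
        g_i = Q[i] @ y - b[i]          # partial gradient of f at y
        x = y.copy()
        x[i] -= g_i / L[i]
        # step 4
        v = beta * v + (1.0 - beta) * y
        v[i] -= g_i / (alpha * L[i])
        alpha_prev = alpha
    return x
```

Unlike RBCD, the iterates are not monotone in $f$, so in practice one typically tracks the best iterate seen so far.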
One can observe that $x^{k+1}$ and $v^{k+1}$ depend on the realization of the random variable $\xi_k = \{i_0, i_1, \ldots, i_k\}$, while $y^k$ depends on the realization of $\xi_{k-1}$.

We now state a sharper expected-value type of convergence rate for the ARCD method than the one given in [8]. Its proof relies on a new technique called randomized estimate sequence that will be developed in Subsection 4.1; we therefore postpone the proof to Subsection 4.2.

Theorem 4. Let $f^\star$ be the optimal value of problem (23), $R_0$ be defined in (8), and $\{x^k\}$ be the sequence generated by the ARCD method. Then, for any $k \ge 0$, there holds
\[
\mathbb{E}_{\xi_{k-1}}[f(x^k)] - f^\star \;\le\; \lambda_k \left( f(x^0) - f^\star + \frac{\gamma_0 R_0^2}{2} \right),
\]
where $\lambda_0 = 1$ and $\lambda_k = \prod_{i=0}^{k-1} \left(1 - \frac{\alpha_i}{n}\right)$. In particular, if $\gamma_0 \ge \mu$, then
\[
\lambda_k \;\le\; \min\left\{ \left(1 - \frac{\sqrt{\mu}}{n}\right)^{k}, \; \left( \frac{n}{n + k\sqrt{\gamma_0}/2} \right)^{2} \right\}.
\]

Remark. We note that for $n = 1$ the ARCD method reduces to a deterministic accelerated full-gradient method described in [6, (2.2.8)]; our iteration complexity result above also becomes the same as the one given there.

Nesterov [8, Theorem 6] established the following convergence rate for the above ARCD method:
\[
\mathbb{E}_{\xi_{k-1}}[f(x^k)] - f^\star \;\le\;
\begin{cases}
\displaystyle \underbrace{\mu\left( 2R_0^2 + \frac{1}{n^2}\left(f(x^0) - f^\star\right) \right) \left[ \left(1 + \frac{\sqrt{\mu}}{2n}\right)^{k+1} - \left(1 - \frac{\sqrt{\mu}}{2n}\right)^{k+1} \right]^{-2}}_{a_\mu} & \text{if } \mu > 0, \\[3ex]
\displaystyle \underbrace{\left( \frac{n}{k+1} \right)^{2} \left( 2R_0^2 + \frac{1}{n^2}\left(f(x^0) - f^\star\right) \right)}_{a_0} & \text{otherwise.}
\end{cases}
\]
In view of Theorem 4, our convergence rate is given by
\[
\mathbb{E}_{\xi_{k-1}}[f(x^k)] - f^\star \;\le\; \underbrace{\min\left\{ \left(1 - \frac{\sqrt{\mu}}{n}\right)^{k}, \left( \frac{n}{n + k\sqrt{\gamma_0}/2} \right)^{2} \right\} \left( f(x^0) - f^\star + \frac{\gamma_0 R_0^2}{2} \right)}_{b_\mu}.
\]
We now compare the above two rates by considering two cases: $\mu > 0$ and $\mu = 0$.

• Case (1): $\mu > 0$. We can observe that for sufficiently large $k$,
\[
a_\mu = O\left( \left(1 + \frac{\sqrt{\mu}}{2n}\right)^{-2k} \right), \qquad b_\mu = O\left( \left(1 - \frac{\sqrt{\mu}}{n}\right)^{k} \right).
\]
It is easy to verify that $\left(1 + \frac{\sqrt{\mu}}{2n}\right)^{-2} > 1 - \frac{\sqrt{\mu}}{n}$, and hence $a_\mu \gg b_\mu$ when $k$ is sufficiently large, which implies that our rate is much tighter.

• Case (2): $\mu = 0$. For sufficiently large $k$, we have
\[
a_0 \approx \left( 2n^2 R_0^2 + f(x^0) - f^\star \right)/k^2, \qquad b_0 \approx \left( 2n^2 R_0^2 + \frac{4n^2}{\gamma_0}\left( f(x^0) - f^\star \right) \right)/k^2.
\]
Therefore, when $\gamma_0 > 4n^2$, we obtain $b_0 < a_0$ for sufficiently large $k$, which again implies that our rate is sharper.

4.1 Randomized estimate sequence

In [6], Nesterov introduced a powerful framework of estimate sequence for the development and analysis of accelerated full-gradient methods. Here we extend it to a randomized block-coordinate descent setup, and subsequently use it to analyze the convergence rate of the ARCD method.

Definition 1. Let $\phi_0(x)$ be a deterministic function, let $\phi_k(x)$ be a random function depending on $\xi_{k-1}$ for all $k \ge 1$, and let $\lambda_k \ge 0$ for all $k \ge 0$. The sequence $\{(\phi_k(x), \lambda_k)\}_{k=0}^{\infty}$ is called a randomized estimate sequence of the function $f(x)$ if
\[
\lambda_k \to 0 \tag{24}
\]
and for any $x \in \Re^N$ and all $k \ge 0$ we have
\[
\mathbb{E}_{\xi_{k-1}}[\phi_k(x)] \;\le\; (1 - \lambda_k) f(x) + \lambda_k \phi_0(x), \tag{25}
\]
where $\mathbb{E}_{\xi_{-1}}[\phi_0(x)] \stackrel{\mathrm{def}}{=} \phi_0(x)$. Here we assume $\{\lambda_k\}_{k \ge 0}$ is a deterministic sequence that is independent of $\xi_k$.

Lemma 4. Let $x^\star$ be an optimal solution to (23) and $f^\star$ be the optimal value. Suppose that $\{(\phi_k(x), \lambda_k)\}_{k=0}^{\infty}$ is a randomized estimate sequence of the function $f(x)$. Assume that $\{x^k\}$ is a sequence such that for each $k \ge 0$,
\[
\mathbb{E}_{\xi_{k-1}}[f(x^k)] \;\le\; \min_x \mathbb{E}_{\xi_{k-1}}[\phi_k(x)], \tag{26}
\]
where $\mathbb{E}_{\xi_{-1}}[f(x^0)] \stackrel{\mathrm{def}}{=} f(x^0)$. Then we have
\[
\mathbb{E}_{\xi_{k-1}}[f(x^k)] - f^\star \;\le\; \lambda_k \left( \phi_0(x^\star) - f^\star \right) \to 0.
\]

Proof. Since $\{(\phi_k(x), \lambda_k)\}_{k=0}^{\infty}$ is a randomized estimate sequence of $f(x)$, it follows from (25) and (26) that
\[
\mathbb{E}_{\xi_{k-1}}[f(x^k)] \;\le\; \min_x \mathbb{E}_{\xi_{k-1}}[\phi_k(x)] \;\le\; \min_x \left\{ (1 - \lambda_k) f(x) + \lambda_k \phi_0(x) \right\} \;\le\; (1 - \lambda_k) f(x^\star) + \lambda_k \phi_0(x^\star) = f^\star + \lambda_k \left( \phi_0(x^\star) - f^\star \right),
\]
which together with (24) implies that the conclusion holds.
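The bound on $\lambda_k$ in Theorem 4 can be checked numerically by iterating the scalar recursions that define it ($\alpha_k$ solves $\alpha_k^2 = (1-\alpha_k/n)\gamma_k + (\alpha_k/n)\mu$, with $\gamma_{k+1} = \alpha_k^2$). A minimal sketch; the function name `lambda_bound_holds` and the tolerance are our own choices:

```python
import numpy as np

def lambda_bound_holds(n, mu, gamma0, K):
    """Check Theorem 4's bound
        lambda_k <= min{(1 - sqrt(mu)/n)^k, (n / (n + k*sqrt(gamma0)/2))^2}
    for k = 1..K, assuming gamma0 >= mu and mu <= 1.

    alpha_k is the positive root of
        alpha^2 = (1 - alpha/n) * gamma_k + (alpha/n) * mu,
    gamma_{k+1} = alpha_k^2, and lambda_k = prod_{i<k} (1 - alpha_i/n).
    """
    gamma, lam = float(gamma0), 1.0
    for k in range(1, K + 1):
        # Positive root of alpha^2 + ((gamma - mu)/n) * alpha - gamma = 0.
        b = (gamma - mu) / n
        alpha = (-b + np.sqrt(b**2 + 4.0 * gamma)) / 2.0   # lies in (0, n]
        lam *= 1.0 - alpha / n
        gamma = alpha**2                                    # = gamma_{k+1}
        bound = min((1.0 - np.sqrt(mu) / n) ** k,
                    (n / (n + k * np.sqrt(gamma0) / 2.0)) ** 2)
        if lam > bound + 1e-12:
            return False
    return True
```

Running this for several choices of $n$, $\mu$ and $\gamma_0 \ge \mu$ is consistent with the stated decay of $\lambda_k$.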
As we will see next, our construction of the randomized estimate sequence satisfies a stronger condition, namely
\[
\mathbb{E}_{\xi_{k-1}}[f(x^k)] \;\le\; \mathbb{E}_{\xi_{k-1}}\left[ \min_x \phi_k(x) \right].
\]
This implies that the assumption (26) in Lemma 4 holds, due to
\[
\mathbb{E}_{\xi_{k-1}}\left[ \min_x \phi_k(x) \right] \;\le\; \min_x \mathbb{E}_{\xi_{k-1}}[\phi_k(x)].
\]

Lemma 5. Assume that $f$ satisfies Assumption 1 with convexity parameter $\mu \ge 0$. In addition, suppose that

• $\phi_0(x)$ is an arbitrary deterministic function on $\Re^N$;
• $\{y^k\}$ is a sequence in $\Re^N$ such that $y^k$ depends on $\xi_{k-1}$;
• $\{\alpha_k\}$ is independent of $\xi_k$ and satisfies $\alpha_k \in (0, n)$ for all $k \ge 0$ and $\sum_{k=0}^{\infty} \alpha_k = \infty$.

Then the pair of sequences $\{\phi_k(x)\}_{k=0}^{\infty}$ and $\{\lambda_k\}_{k=0}^{\infty}$ constructed by setting $\lambda_0 = 1$ and
\[
\lambda_{k+1} = \left(1 - \frac{\alpha_k}{n}\right)\lambda_k, \tag{27}
\]
\[
\phi_{k+1}(x) = \left(1 - \frac{\alpha_k}{n}\right)\phi_k(x) + \alpha_k \left( \frac{1}{n} f(y^k) + \left\langle \nabla_{i_k} f(y^k), x_{i_k} - y^k_{i_k} \right\rangle + \frac{\mu}{2n} \|x - y^k\|_L^2 \right), \tag{28}
\]
is a randomized estimate sequence of $f(x)$.

Proof. It follows from (27) and $\lambda_0 = 1$ that $\lambda_k = \prod_{i=0}^{k-1}(1 - \alpha_i/n)$ for $k \ge 1$. Then we have
\[
\log \lambda_k = \sum_{i=0}^{k-1} \log\left(1 - \frac{\alpha_i}{n}\right) \;\le\; -\frac{1}{n} \sum_{i=0}^{k-1} \alpha_i \;\to\; -\infty
\]
due to $\sum_{i=0}^{\infty} \alpha_i = \infty$. Hence, $\lambda_k \to 0$.

We next prove by induction that (25) holds for all $k \ge 0$. Indeed, for $k = 0$, we know that $\lambda_0 = 1$ and hence
\[
\mathbb{E}_{\xi_{-1}}[\phi_0(x)] = \phi_0(x) = (1 - \lambda_0) f(x) + \lambda_0 \phi_0(x),
\]
that is, (25) holds for $k = 0$. Now suppose it holds for some $k \ge 0$. Using (28), we obtain that
\begin{align*}
\mathbb{E}_{\xi_k}[\phi_{k+1}(x)] &= \mathbb{E}_{\xi_{k-1}}\left[ \mathbb{E}_{i_k}[\phi_{k+1}(x)] \right] \\
&= \mathbb{E}_{\xi_{k-1}}\left[ \left(1 - \frac{\alpha_k}{n}\right)\phi_k(x) + \alpha_k \left( \frac{1}{n} f(y^k) + \mathbb{E}_{i_k}\left\langle \nabla_{i_k} f(y^k), x_{i_k} - y^k_{i_k} \right\rangle + \frac{\mu}{2n}\|x - y^k\|_L^2 \right) \right] \\
&= \mathbb{E}_{\xi_{k-1}}\left[ \left(1 - \frac{\alpha_k}{n}\right)\phi_k(x) + \frac{\alpha_k}{n}\left( f(y^k) + \left\langle \nabla f(y^k), x - y^k \right\rangle + \frac{\mu}{2}\|x - y^k\|_L^2 \right) \right] \\
&\le \mathbb{E}_{\xi_{k-1}}\left[ \left(1 - \frac{\alpha_k}{n}\right)\phi_k(x) + \frac{\alpha_k}{n} f(x) \right],
\end{align*}
where the last inequality is due to the convexity of $f$ (with parameter $\mu$).
Using the induction hypothesis, we have
\begin{align*}
\mathbb{E}_{\xi_k}[\phi_{k+1}(x)] &\le \left(1 - \frac{\alpha_k}{n}\right)\left( (1 - \lambda_k) f(x) + \lambda_k \phi_0(x) \right) + \frac{\alpha_k}{n} f(x) \\
&= \left( 1 - \left(1 - \frac{\alpha_k}{n}\right)\lambda_k \right) f(x) + \left(1 - \frac{\alpha_k}{n}\right)\lambda_k \phi_0(x) \\
&= (1 - \lambda_{k+1}) f(x) + \lambda_{k+1} \phi_0(x),
\end{align*}
and hence (25) also holds for $k + 1$. This completes the proof.

Lemma 6. Let $\phi_0(x) = \phi_0^\star + \frac{\gamma_0}{2}\|x - v^0\|_L^2$. Then the randomized estimate sequence constructed in Lemma 5 preserves the canonical form of the functions, i.e., for all $k \ge 0$,
\[
\phi_k(x) = \phi_k^\star + \frac{\gamma_k}{2}\|x - v^k\|_L^2, \tag{29}
\]
where the sequences $\{\gamma_k\}$, $\{v^k\}$ and $\{\phi_k^\star\}$ are defined as follows:
\[
\gamma_{k+1} = \left(1 - \frac{\alpha_k}{n}\right)\gamma_k + \frac{\alpha_k}{n}\mu, \tag{30}
\]
\[
v^{k+1} = \frac{1}{\gamma_{k+1}}\left( \left(1 - \frac{\alpha_k}{n}\right)\gamma_k v^k + \frac{\alpha_k}{n}\mu y^k - \frac{\alpha_k}{L_{i_k}} U_{i_k}\nabla_{i_k} f(y^k) \right), \tag{31}
\]
\[
\phi_{k+1}^\star = \left(1 - \frac{\alpha_k}{n}\right)\phi_k^\star + \frac{\alpha_k}{n} f(y^k) - \frac{\alpha_k^2}{2\gamma_{k+1} L_{i_k}}\|\nabla_{i_k} f(y^k)\|^2 + \frac{\alpha_k \left(1 - \frac{\alpha_k}{n}\right)\gamma_k}{\gamma_{k+1}}\left( \frac{\mu}{2n}\|y^k - v^k\|_L^2 + \left\langle \nabla_{i_k} f(y^k), v^k_{i_k} - y^k_{i_k} \right\rangle \right). \tag{32}
\]

Proof. First we observe that $\phi_k(x)$ is a convex quadratic function, due to (28) and the definition of $\phi_0(x)$. We now prove by induction that $\phi_k$ is given by (29) for all $k \ge 0$. Clearly, (29) holds for $k = 0$. Suppose now that it holds for some $k \ge 0$. It follows that the Hessian of $\phi_k(x)$ is the block-diagonal matrix
\[
\nabla^2 \phi_k(x) = \gamma_k \,\mathrm{diag}(L_1 I_{N_1}, \ldots, L_n I_{N_n}).
\]
Using this relation, (28) and (30), we have
\[
\nabla^2 \phi_{k+1}(x) = \left(1 - \frac{\alpha_k}{n}\right)\nabla^2 \phi_k(x) + \frac{\alpha_k}{n}\mu \,\mathrm{diag}(L_1 I_{N_1}, \ldots, L_n I_{N_n}) = \gamma_{k+1}\,\mathrm{diag}(L_1 I_{N_1}, \ldots, L_n I_{N_n}). \tag{33}
\]
Using the induction hypothesis, by substituting (29) into (28) we can write $\phi_{k+1}(x)$ as
\[
\phi_{k+1}(x) = \left(1 - \frac{\alpha_k}{n}\right)\left( \phi_k^\star + \frac{\gamma_k}{2}\|x - v^k\|_L^2 \right) + \alpha_k \left( \frac{1}{n} f(y^k) + \left\langle \nabla_{i_k} f(y^k), x_{i_k} - y^k_{i_k} \right\rangle + \frac{\mu}{2n}\|x - y^k\|_L^2 \right), \tag{34}
\]
which together with (31) implies
\[
\nabla \phi_{k+1}(v^{k+1}) = \left(1 - \frac{\alpha_k}{n}\right)\gamma_k \sum_{i=1}^n U_i L_i \left( v^{k+1}_i - v^k_i \right) + \alpha_k U_{i_k}\nabla_{i_k} f(y^k) + \frac{\alpha_k}{n}\mu \sum_{i=1}^n U_i L_i \left( v^{k+1}_i - y^k_i \right) = 0. \tag{35}
\]
Letting $x = y^k$ in (34), one has
\[
\phi_{k+1}(y^k) = \left(1 - \frac{\alpha_k}{n}\right)\left( \phi_k^\star + \frac{\gamma_k}{2}\|y^k - v^k\|_L^2 \right) + \frac{\alpha_k}{n} f(y^k).
\]
In view of (31), we have
\[
v^{k+1} - y^k = \frac{1}{\gamma_{k+1}}\left( \left(1 - \frac{\alpha_k}{n}\right)\gamma_k (v^k - y^k) - \frac{\alpha_k}{L_{i_k}} U_{i_k}\nabla_{i_k} f(y^k) \right),
\]
and hence
\[
\frac{\gamma_{k+1}}{2}\|y^k - v^{k+1}\|_L^2 = \frac{1}{2\gamma_{k+1}}\left( \left(1 - \frac{\alpha_k}{n}\right)^2 \gamma_k^2 \|y^k - v^k\|_L^2 + \frac{\alpha_k^2}{L_{i_k}}\|\nabla_{i_k} f(y^k)\|^2 - 2\alpha_k \left(1 - \frac{\alpha_k}{n}\right)\gamma_k \left\langle \nabla_{i_k} f(y^k), v^k_{i_k} - y^k_{i_k} \right\rangle \right).
\]
In addition, using (30) we obtain that
\[
\left(1 - \frac{\alpha_k}{n}\right)\frac{\gamma_k}{2} - \frac{1}{2\gamma_{k+1}}\left(1 - \frac{\alpha_k}{n}\right)^2 \gamma_k^2 = \frac{1}{2\gamma_{k+1}}\left(1 - \frac{\alpha_k}{n}\right)\gamma_k \frac{\alpha_k}{n}\mu.
\]
By virtue of the above relations and (32), it is not hard to conclude that
\[
\phi_{k+1}(y^k) = \phi_{k+1}^\star + \frac{\gamma_{k+1}}{2}\|y^k - v^{k+1}\|_L^2,
\]
which, together with (33), (35) and the fact that $\phi_{k+1}$ is quadratic, implies that
\[
\phi_{k+1}(x) = \phi_{k+1}^\star + \frac{\gamma_{k+1}}{2}\|x - v^{k+1}\|_L^2.
\]
Therefore, the conclusion holds.

4.2 Proof of Theorem 4

Let $\phi_0(x) = f(v^0) + \frac{\gamma_0}{2}\|x - v^0\|_L^2$, and let $\{y^k\}$ and $\{\alpha_k\}$ be generated by the ARCD method. In addition, let $\{(\phi_k(x), \lambda_k)\}$ be the randomized estimate sequence of $f(x)$ generated as in Lemma 5 using these $\{y^k\}$ and $\{\alpha_k\}$. First we prove by induction that for all $k \ge 0$,
\[
\mathbb{E}_{\xi_{k-1}}[f(x^k)] \;\le\; \mathbb{E}_{\xi_{k-1}}\left[ \phi_k^\star \right], \quad \text{where } \phi_k^\star = \min_x \phi_k(x). \tag{36}
\]
For $k = 0$, using $v^0 = x^0$, the definition of $\phi_0(x)$ and $\mathbb{E}_{\xi_{-1}}[f(x^0)] = f(x^0)$, we have
\[
\mathbb{E}_{\xi_{-1}}[f(x^0)] = f(x^0) = f(v^0) = \phi_0^\star,
\]
and hence (36) holds for $k = 0$. Now suppose it holds for some $k \ge 0$. It follows from (32) that
\begin{align*}
\mathbb{E}_{\xi_k}[\phi_{k+1}^\star] = \mathbb{E}_{\xi_{k-1}}\left[ \mathbb{E}_{i_k}[\phi_{k+1}^\star] \right]
= \mathbb{E}_{\xi_{k-1}}\Bigg[ &\left(1 - \frac{\alpha_k}{n}\right)\phi_k^\star + \frac{\alpha_k}{n} f(y^k) - \frac{\alpha_k^2}{2\gamma_{k+1}} \mathbb{E}_{i_k}\left[ \frac{1}{L_{i_k}}\|\nabla_{i_k} f(y^k)\|^2 \right] \\
&+ \frac{\alpha_k \left(1 - \frac{\alpha_k}{n}\right)\gamma_k}{\gamma_{k+1}}\left( \frac{\mu}{2n}\|y^k - v^k\|_L^2 + \mathbb{E}_{i_k}\left\langle \nabla_{i_k} f(y^k), v^k_{i_k} - y^k_{i_k} \right\rangle \right) \Bigg]. \tag{37}
\end{align*}
Let $d_i(y^k) = -\frac{1}{L_i}\nabla_i f(y^k)$ for $i = 1, \ldots, n$, and let $d(y^k) = \sum_{i=1}^n U_i d_i(y^k)$. Then we have
\[
\mathbb{E}_{i_k}\left[ \frac{1}{L_{i_k}}\|\nabla_{i_k} f(y^k)\|^2 \right] = \frac{1}{n}\|d(y^k)\|_L^2.
\]
Moreover,
\[
\mathbb{E}_{i_k}\left\langle \nabla_{i_k} f(y^k), v^k_{i_k} - y^k_{i_k} \right\rangle = \frac{1}{n}\left\langle \nabla f(y^k), v^k - y^k \right\rangle.
\]
Using these two equalities and dropping the (nonnegative) term involving $\|y^k - v^k\|_L^2$ in (37), we arrive at
\[
\mathbb{E}_{\xi_k}[\phi_{k+1}^\star] \;\ge\; \mathbb{E}_{\xi_{k-1}}\left[ \left(1 - \frac{\alpha_k}{n}\right)\phi_k^\star + \frac{\alpha_k}{n} f(y^k) - \frac{\alpha_k^2}{2n\gamma_{k+1}}\|d(y^k)\|_L^2 + \frac{\alpha_k}{n}\frac{\left(1 - \frac{\alpha_k}{n}\right)\gamma_k}{\gamma_{k+1}}\left\langle \nabla f(y^k), v^k - y^k \right\rangle \right].
\]
By the induction hypothesis and the convexity of $f$, we obtain that
\[
\mathbb{E}_{\xi_{k-1}}[\phi_k^\star] \;\ge\; \mathbb{E}_{\xi_{k-1}}[f(x^k)] \;\ge\; \mathbb{E}_{\xi_{k-1}}\left[ f(y^k) + \left\langle \nabla f(y^k), x^k - y^k \right\rangle \right].
\]
Combining the above two inequalities gives
\[
\mathbb{E}_{\xi_k}[\phi_{k+1}^\star] \;\ge\; \mathbb{E}_{\xi_{k-1}}\left[ f(y^k) - \frac{\alpha_k^2}{2n\gamma_{k+1}}\|d(y^k)\|_L^2 + \left(1 - \frac{\alpha_k}{n}\right)\left\langle \nabla f(y^k), \frac{\alpha_k \gamma_k}{n\gamma_{k+1}}(v^k - y^k) + (x^k - y^k) \right\rangle \right].
\]
Recall that
\[
y^k = \frac{1}{\frac{\alpha_k}{n}\gamma_k + \gamma_{k+1}}\left( \frac{\alpha_k}{n}\gamma_k v^k + \gamma_{k+1} x^k \right);
\]
this choice of $y^k$ makes the inner-product term above vanish, so the above inequality yields
\[
\mathbb{E}_{\xi_k}[\phi_{k+1}^\star] \;\ge\; \mathbb{E}_{\xi_{k-1}}\left[ f(y^k) - \frac{\alpha_k^2}{2n\gamma_{k+1}}\|d(y^k)\|_L^2 \right].
\]
Also, we observe that $\alpha_k^2 = \gamma_{k+1}$. Substituting this into the above inequality gives
\[
\mathbb{E}_{\xi_k}[\phi_{k+1}^\star] \;\ge\; \mathbb{E}_{\xi_{k-1}}\left[ f(y^k) - \frac{1}{2n}\|d(y^k)\|_L^2 \right].
\]
In addition, notice that
\[
x^{k+1} = y^k - \frac{1}{L_{i_k}} U_{i_k}\nabla_{i_k} f(y^k) = y^k + U_{i_k} d_{i_k}(y^k),
\]
which together with Corollary 1 yields
\[
\mathbb{E}_{\xi_k}[\phi_{k+1}^\star] \;\ge\; \mathbb{E}_{\xi_{k-1}}\left[ \mathbb{E}_{i_k}[f(x^{k+1})] \right] = \mathbb{E}_{\xi_k}[f(x^{k+1})].
\]
Therefore, (36) holds for $k + 1$, and hence, by induction, for all $k \ge 0$. Further, by Lemma 4, we have
\[
\mathbb{E}_{\xi_{k-1}}[f(x^k)] - f^\star \;\le\; \lambda_k \left( f(x^0) - f^\star + \frac{\gamma_0}{2}\|x^0 - x^\star\|_L^2 \right).
\]
Finally, we estimate the decay of $\lambda_k$, using the same arguments as in the proof of [6, Lemma 2.2.4]. Here we assume $\gamma_0 \ge \mu$ (it suffices to set $\gamma_0 = 1$ because $\mu \le 1$). Indeed, if $\gamma_k \ge \mu$, then
\[
\gamma_{k+1} = \left(1 - \frac{\alpha_k}{n}\right)\gamma_k + \frac{\alpha_k}{n}\mu \;\ge\; \mu.
\]
So we have $\gamma_k \ge \mu$ for all $k \ge 0$. Since $\alpha_k^2 = \gamma_{k+1}$, we have $\alpha_k \ge \sqrt{\mu}$ for all $k \ge 0$. Therefore,
\[
\lambda_k = \prod_{i=0}^{k-1}\left(1 - \frac{\alpha_i}{n}\right) \;\le\; \left(1 - \frac{\sqrt{\mu}}{n}\right)^{k}.
\]
In addition, we have $\gamma_k \ge \gamma_0 \lambda_k$.
To see this, we note $\gamma_0 = \gamma_0 \lambda_0$ and use the induction
\[
\gamma_{k+1} \;\ge\; \left(1 - \frac{\alpha_k}{n}\right)\gamma_k \;\ge\; \left(1 - \frac{\alpha_k}{n}\right)\gamma_0 \lambda_k = \gamma_0 \lambda_{k+1}.
\]
This implies
\[
\alpha_k = \sqrt{\gamma_{k+1}} \;\ge\; \sqrt{\gamma_0 \lambda_{k+1}}. \tag{38}
\]
Since $\{\lambda_k\}$ is a decreasing sequence, we have
\[
\frac{1}{\sqrt{\lambda_{k+1}}} - \frac{1}{\sqrt{\lambda_k}} = \frac{\sqrt{\lambda_k} - \sqrt{\lambda_{k+1}}}{\sqrt{\lambda_k}\sqrt{\lambda_{k+1}}} = \frac{\lambda_k - \lambda_{k+1}}{\sqrt{\lambda_k}\sqrt{\lambda_{k+1}}\left(\sqrt{\lambda_k} + \sqrt{\lambda_{k+1}}\right)} \;\ge\; \frac{\lambda_k - \lambda_{k+1}}{2\lambda_k \sqrt{\lambda_{k+1}}} = \frac{\lambda_k - \left(1 - \frac{\alpha_k}{n}\right)\lambda_k}{2\lambda_k \sqrt{\lambda_{k+1}}} = \frac{\alpha_k}{2n\sqrt{\lambda_{k+1}}}.
\]
Combining with (38) gives
\[
\frac{1}{\sqrt{\lambda_{k+1}}} - \frac{1}{\sqrt{\lambda_k}} \;\ge\; \frac{\sqrt{\gamma_0}}{2n}.
\]
By further noting $\lambda_0 = 1$, we obtain
\[
\frac{1}{\sqrt{\lambda_k}} \;\ge\; 1 + \frac{k}{n}\frac{\sqrt{\gamma_0}}{2},
\]
and therefore
\[
\lambda_k \;\le\; \left( \frac{n}{n + k\sqrt{\gamma_0}/2} \right)^{2}.
\]
This completes the proof of Theorem 4.

References

[1] K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. Coordinate descent method for large-scale l2-loss linear support vector machines. Journal of Machine Learning Research, 9:1369–1398, 2008.
[2] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In ICML 2008, pages 408–415, 2008.
[3] D. Leventhal and A. S. Lewis. Randomized methods for linear constraints: convergence rates and conditioning. Mathematics of Operations Research, 35(3):641–654, 2010.
[4] Y. Li and S. Osher. Coordinate descent optimization for l1 minimization with application to compressed sensing; a greedy algorithm. Inverse Problems and Imaging, 3:487–503, 2009.
[5] Z. Q. Luo and P. Tseng. On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7–35, 1992.
[6] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Boston, 2004.
[7] Y. Nesterov. Gradient methods for minimizing composite objective function. CORE Discussion Paper 2007/76, Catholic University of Louvain, Belgium, 2007.
[8] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems.
SIAM Journal on Optimization, 22(2):341–362, 2012.
[9] Z. Qin, K. Scheinberg, and D. Goldfarb. Efficient block-coordinate descent algorithms for the group lasso. To appear in Mathematical Programming Computation, 2010.
[10] P. Richtárik and M. Takáč. Efficient serial and parallel coordinate descent method for huge-scale truss topology design. Operations Research Proceedings, 27–32, 2012.
[11] P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. To appear in Mathematical Programming, 2011.
[12] P. Richtárik and M. Takáč. Parallel coordinate descent methods for big data optimization. Technical report, November 2012.
[13] A. Saha and A. Tewari. On the non-asymptotic convergence of cyclic coordinate descent methods. SIAM Journal on Optimization, 23(1):576–601, 2013.
[14] S. Shalev-Shwartz and A. Tewari. Stochastic methods for l1-regularized loss minimization. In Proceedings of the 26th International Conference on Machine Learning, 2009.
[15] S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. Technical report, 2012.
[16] R. Tappenden, P. Richtárik, and J. Gondzio. Inexact coordinate descent: complexity and preconditioning. Technical report, April 2013.
[17] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109:475–494, 2001.
[18] P. Tseng and S. Yun. Block-coordinate gradient descent method for linearly constrained nonsmooth separable optimization. Journal of Optimization Theory and Applications, 140:513–535, 2009.
[19] P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117:387–423, 2009.
[20] Z. Wen, D. Goldfarb, and K. Scheinberg. Block coordinate descent methods for semidefinite programming. In Miguel F. Anjos and Jean B. Lasserre, editors, Handbook on Semidefinite, Cone and Polynomial Optimization: Theory, Algorithms, Software and Applications. Springer, Volume 166:533–564, 2012.
[21] S. J. Wright. Accelerated block-coordinate relaxation for regularized optimization. SIAM Journal on Optimization, 22:159–186, 2012.
[22] T. Wu and K. Lange. Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, 2(1):224–244, 2008.
[23] S. Yun and K.-C. Toh. A coordinate gradient descent method for l1-regularized convex minimization. Computational Optimization and Applications, 48:273–307, 2011.