Fast Alternating Linearization Methods for Minimizing the Sum of Two Convex Functions
We present in this paper first-order alternating linearization algorithms based on an alternating direction augmented Lagrangian approach for minimizing the sum of two convex functions. Our basic methods require at most $O(1/\epsilon)$ iterations to …
Authors: Donald Goldfarb, Shiqian Ma, Katya Scheinberg
FAST ALTERNATING LINEARIZATION METHODS FOR MINIMIZING THE SUM OF TWO CONVEX FUNCTIONS

DONALD GOLDFARB*, SHIQIAN MA*, AND KATYA SCHEINBERG†

October 11, 2010

Abstract. We present in this paper first-order alternating linearization algorithms based on an alternating direction augmented Lagrangian approach for minimizing the sum of two convex functions. Our basic methods require at most O(1/ε) iterations to obtain an ε-optimal solution, while our accelerated (i.e., fast) versions of them require at most O(1/√ε) iterations, with little change in the computational effort required at each iteration. For both types of methods, we present one algorithm that requires both functions to be smooth with Lipschitz continuous gradients and one algorithm that needs only one of the functions to be so. The algorithms in this paper are Gauss-Seidel type methods, in contrast to the ones proposed by Goldfarb and Ma in [21], which are Jacobi type methods. Numerical results are reported to support our theoretical conclusions and demonstrate the practical potential of our algorithms.

Key words. Convex Optimization, Variable Splitting, Alternating Linearization Method, Alternating Direction Method, Augmented Lagrangian Method, Proximal Point Algorithm, Optimal Gradient Method, Gauss-Seidel Method, Peaceman-Rachford Method, Robust Principal Component Analysis

AMS subject classifications. Primary, 65K05; Secondary, 68Q25, 90C25

1. Introduction. In this paper, we are interested in the following convex optimization problem:

  min F(x) ≡ f(x) + g(x),    (1.1)

where f, g : R^n → R are both convex functions such that the following two problems are easy to solve for any τ > 0 and z ∈ R^n, relative to minimizing F(x):

  min τ f(x) + ½‖x − z‖²    (1.2)

and

  min τ g(x) + ½‖x − z‖².    (1.3)

In particular, we are especially interested in cases where solving (1.2) (or (1.3)) takes roughly the same effort as computing the gradient (or a subgradient) of f(x) (or g(x), respectively). Problems of this type arise in many applications of practical interest. The following are some representative examples.

Example 1. ℓ1 minimization in compressed sensing (CS). Signal recovery problems in compressed sensing [8, 13] use the ℓ1 norm ‖x‖₁ := Σ_{i=1}^n |x_i| as a regularization term to enforce sparsity in the solution x ∈ R^n of a linear system Ax = b, where A ∈ R^{m×n} and b ∈ R^m. This results in the unconstrained problem

  min ½‖Ax − b‖₂² + ρ‖x‖₁,    (1.4)

where ρ > 0, which is of the form (1.1) with f(x) = ½‖Ax − b‖₂² and g(x) := ρ‖x‖₁. In this case, the two problems (1.2) and (1.3) are easy to solve. Specifically, (1.2) reduces to solving a linear system, and (1.3) reduces to a vector shrinkage operation that requires O(n) operations (see, e.g., [23]).

* Department of Industrial Engineering and Operations Research, Columbia University, New York, NY 10027, USA. Email: {goldfarb, sm2756}@columbia.edu. Research supported in part by NSF Grants DMS 06-06712 and DMS 10-16571, ONR Grant N00014-08-1-1118 and DOE Grant DE-FG02-08ER25856.
† Department of Industrial and Systems Engineering, Lehigh University, Bethlehem, PA 18015-1582, USA. Email: katyas@lehigh.edu
Depending on the size and structure of A, solving the system of linear equations required by (1.2) may be more expensive, less expensive, or comparable to computing the gradient A⊤(Ax − b) of f(x). In the application we consider in Section 4 these computations are comparable due to the special structure of A.

Example 2. Nuclear norm minimization (NNM). The nuclear norm minimization problem, which seeks a low-rank solution of a linear system, can be cast as

  min ½‖A(X) − b‖₂² + ρ‖X‖∗,    (1.5)

where ρ > 0, X ∈ R^{m×n}, A : R^{m×n} → R^p is a linear operator, b ∈ R^p, and the nuclear norm ‖X‖∗ is defined as the sum of the singular values of the matrix X. Problem (1.5) and a special case of it, the so-called matrix completion problem, have many applications in optimal control, online recommendation systems, computer vision, etc. (see, e.g., [7, 9, 27, 42]). In Problem (1.5), if we let f(X) = ½‖A(X) − b‖₂² and g(X) = ρ‖X‖∗, then problem (1.2) reduces to solving a linear system, and problem (1.3) has a closed-form solution given by a matrix shrinkage operation (see, e.g., [34]).

Example 3. Robust principal component analysis (RPCA). The RPCA problem seeks to recover a low-rank matrix X from a corrupted matrix M. This problem has many applications in computer vision, image processing and web data ranking (see, e.g., [6]), and can be formulated as

  min {‖X‖∗ + ρ‖Y‖₁ : X + Y = M},    (1.6)

where ρ > 0, M ∈ R^{m×n} and the ℓ1 norm ‖Y‖₁ := Σ_{i,j} |Y_{ij}|. Note that (1.6) can be rewritten as min {‖X‖∗ + ρ‖M − X‖₁}, which is of the form (1.1). Moreover, the two problems (1.2) and (1.3) corresponding to (1.6) have closed-form solutions given, respectively, by a matrix shrinkage operation and a vector shrinkage operation. The matrix shrinkage operation requires a singular value decomposition (SVD) and is comparable in cost to computing a subgradient of ‖X‖∗ or the gradient of the smoothed version of this function (see Section 5).

Example 4. Sparse inverse covariance selection (SICS). Gaussian graphical models are of great interest in statistical learning. Because conditional independence between different nodes corresponds to zero entries in the inverse covariance matrix of the Gaussian distribution, one can learn the structure of the graph by estimating a sparse inverse covariance matrix from sample data by solving the following maximum likelihood problem with an ℓ1-regularization term (see, e.g., [3, 18, 49, 52]):

  max {log det(X) − ⟨Σ, X⟩ − ρ‖X‖₁}, or equivalently, min {−log det(X) + ⟨Σ, X⟩ + ρ‖X‖₁},    (1.7)

where ρ > 0 and Σ ∈ S^n_+ (the set of symmetric positive semidefinite matrices) is the sample covariance matrix. Note that by defining f(X) := −log det(X) + ⟨Σ, X⟩ and g(X) := ρ‖X‖₁, (1.7) is of the form (1.1). Moreover, it can be proved that problem (1.2) has a closed-form solution given by a spectral decomposition (a comparable effort to computing the gradient of f(X)), while the solution of problem (1.3) corresponds to a vector shrinkage operation.
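To make the shrinkage operations in Examples 1-3 concrete, the following is a minimal Python/NumPy sketch (our own illustration, not part of the original paper) of the proximal subproblem (1.3) for g(x) = ρ‖x‖₁ and for g(X) = ρ‖X‖∗; the function names soft_threshold and svd_shrink are ours.

```python
import numpy as np

def soft_threshold(z, t):
    """Solve min_x t*||x||_1 + 0.5*||x - z||^2 (vector shrinkage), O(n) work."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def svd_shrink(Z, t):
    """Solve min_X t*||X||_* + 0.5*||X - Z||_F^2 (matrix shrinkage) via one SVD."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt

# example: prox of rho*||.||_1 with step tau = 0.5
print(soft_threshold(np.array([1.5, -0.2, 0.7]), 0.5))   # -> [ 1.  -0.   0.2]
```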
Algorithms for solving problem (1.1) have been studied extensively in the literature. For large-scale problems, for which problems (1.2) and (1.3) are relatively easy to solve, the class of alternating direction methods based on variable splitting combined with the augmented Lagrangian method is particularly important. In these methods, one splits the variable x into two variables, i.e., one introduces a new variable y and rewrites Problem (1.1) as

  min {f(x) + g(y) : x − y = 0}.    (1.8)

Since Problem (1.8) is an equality constrained problem, the augmented Lagrangian method can be used to solve it. Given a penalty parameter 1/µ, at the k-th iteration the augmented Lagrangian method minimizes the augmented Lagrangian function

  L_µ(x, y; λ) := f(x) + g(y) − ⟨λ, x − y⟩ + (1/2µ)‖x − y‖²,    (1.9)

with respect to x and y, i.e., it solves the subproblem

  (x^k, y^k) := arg min_{x,y} L_µ(x, y; λ^k),    (1.10)

and then updates the Lagrange multiplier λ^k via

  λ^{k+1} := λ^k − (x^k − y^k)/µ.    (1.11)

Minimizing L_µ(x, y; λ) with respect to x and y jointly is often not easy. In fact, it is certainly no easier than solving the original problem (1.1). However, if one minimizes L_µ(x, y; λ) with respect to x and y alternatingly, one needs to solve problems of the form (1.2) and (1.3), which, as we have already discussed, is often easy to do. Such an alternating direction augmented Lagrangian method (ADAL) for solving (1.8) is given below as Algorithm 1.

Algorithm 1: Alternating Direction Augmented Lagrangian Method (ADAL)
1: Choose µ, λ^0 and x^0 = y^0.
2: for k = 0, 1, ... do
3:   x^{k+1} := arg min_x L_µ(x, y^k; λ^k)
4:   y^{k+1} := arg min_y L_µ(x^{k+1}, y; λ^k)
5:   λ^{k+1} := λ^k − (x^{k+1} − y^{k+1})/µ
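As a concrete illustration (ours, not the paper's code), here is a minimal Python sketch of the ADAL iteration of Algorithm 1 written in terms of the proximal maps of f and g; the callables prox_f and prox_g are placeholders for solvers of (1.2) and (1.3).

```python
import numpy as np

def adal(prox_f, prox_g, x0, mu=1.0, iters=100):
    """Algorithm 1 (ADAL) sketch for min f(x) + g(y) s.t. x = y.

    prox_f(z, t) should return argmin_x t*f(x) + 0.5*||x - z||^2,
    and similarly for prox_g; these are problems (1.2)-(1.3).
    """
    x = y = x0.copy()
    lam = np.zeros_like(x0)
    for _ in range(iters):
        x = prox_f(y + mu * lam, mu)      # argmin_x L_mu(x, y; lam)
        y = prox_g(x - mu * lam, mu)      # argmin_y L_mu(x, y; lam)
        lam = lam - (x - y) / mu          # multiplier update (1.11)
    return x, y
```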
The history of alternating direction methods (ADMs) goes back to the 1950s for solving PDEs [14, 41] and to the 1970s for solving variational problems associated with PDEs [19, 20]. ADMs have also been applied to variational inequality problems by Tseng [46, 47] and He et al. [24, 26]. Recently, with the emergence of compressive sensing and subsequent great interest in ℓ1 minimization [8, 13], ADMs have been applied to ℓ1 and total variation regularized problems arising from signal processing and image processing. The papers of Goldstein and Osher [22], Afonso et al. [2] and Yang and Zhang [51] are based on the alternating direction augmented Lagrangian framework (Algorithm 1) and demonstrate that ADMs are very efficient for solving ℓ1 and TV regularized problems. The work of Yuan [53] and Yuan and Yang [54] showed that ADMs can also efficiently solve ℓ1-regularized problems arising from statistics and data analysis. More recently, Wen, Goldfarb and Yin [50] and Malick et al. [35] applied alternating direction augmented Lagrangian methods to solve semidefinite programming (SDP) problems. The results in [50] show that these methods greatly outperform interior point methods on several classes of well-structured SDP problems. Furthermore, He et al. proposed an alternating direction based contraction method for solving separable linearly constrained convex problems [25].

Another important and related class of algorithms for solving (1.1) is based on operator splitting. The aim of these algorithms is to find an x such that

  0 ∈ T_1(x) + T_2(x),    (1.12)

where T_1 and T_2 are maximal monotone operators. This is a more general problem than (1.1), and ADMs for it have been the focus of a substantial amount of research; see, e.g., [10, 11, 12, 15, 16, 32, 44]. Since the first-order optimality conditions for (1.1) are

  0 ∈ ∂f(x) + ∂g(x),    (1.13)

where ∂f(x) denotes the subdifferential of f(x) at the point x, a solution to Problem (1.1) can be obtained by solving Problem (1.12). See, for example, [15, 32, 44] and references therein for more information on this class of algorithms.

While global convergence results for various splitting and alternating direction algorithms have been established under appropriate conditions, our interest here is in iteration complexity bounds for such algorithms. By an iteration complexity bound we mean a bound on the number of iterations needed to obtain an ε-optimal solution, which is defined as follows.

Definition 1.1. x_ε ∈ R^n is called an ε-optimal solution to (1.1) if F(x_ε) − F(x*) ≤ ε, where x* is an optimal solution to (1.1).

Complexity bounds for first-order methods for solving convex optimization problems have been given by Nesterov and many others. In [37, 38], Nesterov gave first-order algorithms for solving smooth unconstrained convex minimization problems with an iteration complexity of O(√(L/ε)), where L is the Lipschitz constant of the gradient of the objective function, and showed that this is the best complexity obtainable when only first-order information is used. These methods can be viewed as accelerated gradient methods in which a combination of past iterates is used to compute the next iterate. Similar techniques were then applied to nonsmooth problems [4, 39, 40, 48] and corresponding optimal complexity results were obtained. The ISTA (Iterative Shrinkage/Thresholding Algorithm) and FISTA (Fast Iterative Shrinkage/Thresholding Algorithm) algorithms proposed by Beck and Teboulle in [4] are designed for solving (1.1) when one of the functions (say f(x)) is smooth and the other is not. It is proved in [4] that the numbers of iterations required by ISTA and FISTA to get an ε-optimal solution to problem (1.1) are, respectively, O(L(f)/ε) and O(√(L(f)/ε)), under the assumption that ∇f(x) is Lipschitz continuous with Lipschitz constant L(f), i.e.,

  ‖∇f(x) − ∇f(y)‖₂ ≤ L(f)‖x − y‖₂, ∀x, y ∈ R^n.

ISTA computes a sequence {x^k} via the iteration

  x^{k+1} := arg min_x Q_f(x, x^k),    (1.14)

where

  Q_f(u, v) := g(u) + f(v) + ⟨∇f(v), u − v⟩ + (1/2µ)‖u − v‖²,    (1.15)

while FISTA computes {x^k} via the iteration

  x^k := arg min_x Q_f(x, y^k)
  t_{k+1} := (1 + √(1 + 4t_k²))/2
  y^{k+1} := x^k + ((t_k − 1)/t_{k+1})(x^k − x^{k−1}),    (1.16)

starting with t_1 = 1, y^1 = x^0 ∈ R^n and k = 1. Note that ISTA and FISTA treat the functions f(x) and g(x) very differently. At each iteration they both linearize the function f(x) but never directly minimize it, while they do minimize the function g(x) in conjunction with the linearization of f(x) and a proximal (penalty) term.
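For concreteness, the following is a minimal Python sketch (ours, not from [4] or this paper) of the ISTA iteration (1.14) and the FISTA iteration (1.16); grad_f and prox_g are placeholder callables for ∇f and for the solver of the g-subproblem (1.3).

```python
import numpy as np

def ista(grad_f, prox_g, x0, mu, iters=100):
    """ISTA sketch: x^{k+1} = argmin_x Q_f(x, x^k) = prox_{mu g}(x^k - mu*grad_f(x^k))."""
    x = x0.copy()
    for _ in range(iters):
        x = prox_g(x - mu * grad_f(x), mu)
    return x

def fista(grad_f, prox_g, x0, mu, iters=100):
    """FISTA sketch (1.16): an ISTA step at the extrapolated point y^k plus momentum."""
    x_prev, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(iters):
        x = prox_g(y - mu * grad_f(y), mu)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)
        x_prev, t = x, t_next
    return x
```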
These two methods have proved to be efficient for solving the CS problem (1.4) (see, e.g., [4, 23]) and the NNM problem (1.5) (see, e.g., [34, 45]). ISTA and FISTA work well in these settings because f(x) is quadratic and is well approximated by its linearization. However, for the RPCA problem (1.6), where two complicated functions are involved, ISTA and FISTA do not work well. As we shall show in Sections 2 and 3, our ADMs are very effective in solving RPCA problems. For the SICS problem (1.7), intermediate iterates X^k may not be positive definite, and hence the gradient of f(X) = −log det(X) + ⟨Σ, X⟩ may not be well defined at X^k. Therefore, ISTA and FISTA cannot be used to solve the SICS problem (1.7). In [43], it is shown that SICS problems can be solved very efficiently by our ADM approach.

Our contribution. In this paper, we propose both basic and accelerated (i.e., fast) versions of first-order alternating linearization methods (ALMs) based on an alternating direction augmented Lagrangian approach for solving (1.1) and analyze their iteration complexities. Our basic methods require at most O(L/ε) iterations to obtain an ε-optimal solution, while our fast methods require at most O(√(L/ε)) iterations with only a very small increase in the computational effort required at each iteration. Thus, our fast methods are optimal first-order methods in terms of iteration complexity. For both types of methods, we present an algorithm that requires both functions to be continuously differentiable with Lipschitz constants for the gradients denoted by L(f) and L(g); in this case L = max{L(f), L(g)}. We also present, for each type of method, an algorithm that only needs one of the functions, say f(x), to be smooth, in which case L = L(f). These algorithms are related to the multiple splitting algorithms in a recent paper by Goldfarb and Ma [21]. The algorithms in [21] are Jacobi type methods since they do not use information from the current iteration to solve succeeding subproblems in that iteration, while the algorithms proposed in this paper are Gauss-Seidel type methods since information from the current iteration is used later in the same iteration. These algorithms can also be viewed as extensions of the ISTA and FISTA algorithms in [4]. The complexity bounds we obtain for our algorithms are similar to (and as much as a factor of two better than) those in [4]. At each iteration, our algorithms alternately minimize two different approximations to the original objective function, obtained by keeping one function unchanged and linearizing the other one. Our basic algorithm is similar in many ways to the alternating linearization method proposed by Kiwiel et al. [28]. In particular, the approximate functions minimized at each step of Algorithm 3.1 in [28] have the same form as those minimized in our algorithm. However, our basic algorithm differs from the one in [28] in the way the proximal terms are chosen, and our accelerated algorithms are very different. Moreover, no complexity bounds have been given for the algorithm in [28]. To the best of our knowledge, the complexity results in this paper are the first ones that have been given for a Gauss-Seidel type alternating direction method.¹ Complexity results for related Jacobi type alternating direction methods are given in [21].

Organization. The rest of this paper is organized as follows.
In Sections 2 and 3 we propose our alternating linearization methods based on alternating direction augmented Lagrangian methods and give convergence/complexity bounds for them. We compare the performance of our ALMs to other competing first-order algorithms on an image deblurring problem in Section 4. In Section 5, we apply our ALMs to solve very large RPCA problems arising from background extraction in surveillance video and matrix completion, and report the numerical results. Finally, we give some conclusions in Section 6.

2. Alternating Linearization Methods. In each iteration of the ADAL method, Algorithm 1, the Lagrange multiplier λ is updated just once, immediately after the augmented Lagrangian is minimized with respect to y. Since the alternating direction approach is meant to be symmetric with respect to x and y, it is natural to also update λ after solving the subproblem with respect to x. By doing this, we get a symmetric version of the ADAL method, given below as Algorithm 2.

Algorithm 2: Symmetric Alternating Direction Augmented Lagrangian Method (SADAL)
1: Choose µ, λ^0 and x^0 = y^0.
2: for k = 0, 1, ... do
3:   x^{k+1} := arg min_x L_µ(x, y^k; λ^k)
4:   λ^{k+1/2} := λ^k − (x^{k+1} − y^k)/µ
5:   y^{k+1} := arg min_y L_µ(x^{k+1}, y; λ^{k+1/2})
6:   λ^{k+1} := λ^{k+1/2} − (x^{k+1} − y^{k+1})/µ

This ADAL variant is described and analyzed in [20]. Moreover, it is shown in [20] that Algorithms 1 and 2 are equivalent to the Douglas-Rachford [14] and Peaceman-Rachford [41] methods, respectively, applied to the optimality condition (1.13) for problem (1.1). If we assume that both f(x) and g(x) are differentiable, it follows from the first-order optimality conditions for the two subproblems in lines 3 and 5 of Algorithm 2 that

  λ^{k+1/2} = ∇f(x^{k+1}) and λ^{k+1} = −∇g(y^{k+1}).    (2.1)

Substituting (2.1) into Algorithm 2, we get the following alternating linearization method (ALM), which is equivalent to the SADAL method (Algorithm 2) when both f and g are differentiable. In Algorithm 3, Q_f(u, v) is defined by (1.15) and

  Q_g(u, v) := f(u) + g(v) + ⟨∇g(v), u − v⟩ + (1/2µ)‖u − v‖₂².    (2.2)

Algorithm 3: Alternating Linearization Method (ALM)
1: Choose µ and x^0 = y^0.
2: for k = 0, 1, ... do
3:   x^{k+1} := arg min_x Q_g(x, y^k)
4:   y^{k+1} := arg min_y Q_f(y, x^{k+1})

In Algorithm 3, we alternately replace the functions g and f by their linearizations plus a proximal regularization term to obtain an approximation to the original function F. Thus, our ALM algorithm can also be viewed as a proximal point algorithm.

¹ After completion of an earlier version of the present paper, which is available at http://arxiv.org/abs/0912.4571, Monteiro and Svaiter [36] gave an iteration complexity bound, for ADMs applied to the more general problem (1.12), for achieving a desired closeness of the current iterate to the solution.
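For concreteness, a minimal Python sketch (ours, not the paper's code) of the ALM iteration of Algorithm 3, assuming both gradients and both proximal maps are available; grad_f, grad_g, prox_f, prox_g are placeholder callables.

```python
def alm(grad_f, grad_g, prox_f, prox_g, x0, mu, iters=100):
    """Algorithm 3 (ALM) sketch.

    x-step: argmin_x Q_g(x, y^k)     = prox_{mu f}(y^k - mu*grad_g(y^k))
    y-step: argmin_y Q_f(y, x^{k+1}) = prox_{mu g}(x^{k+1} - mu*grad_f(x^{k+1}))
    """
    y = x0.copy()
    for _ in range(iters):
        x = prox_f(y - mu * grad_g(y), mu)   # keep f, linearize g
        y = prox_g(x - mu * grad_f(x), mu)   # keep g, linearize f
    return y
```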
A drawback of Algorithm 3 is that it requires both f and g to be continuously differentiable. In many applications, however, one of these functions is nonsmooth, as in the examples given in Section 1. Although Algorithm 2 can be applied when f(x) and g(x) are nonsmooth, we are unable to provide a comparable complexity bound in this case. However, when only one of the functions f and g is nonsmooth (say g), the following variant of Algorithm 3 applies, and for this algorithm we have a complexity result.

Algorithm 4: Alternating Linearization Method with Skipping Steps (ALM-S)
1: Choose µ, λ^0 and x^0 = y^0.
2: for k = 0, 1, ... do
3:   x^{k+1} := arg min_x L_µ(x, y^k; λ^k)
4:   if F(x^{k+1}) > L_µ(x^{k+1}, y^k; λ^k), then x^{k+1} := y^k
5:   y^{k+1} := arg min_y Q_f(y, x^{k+1})
6:   λ^{k+1} := ∇f(x^{k+1}) − (x^{k+1} − y^{k+1})/µ

We call Algorithm 4 ALM with skipping steps (ALM-S) because, in line 4 of Algorithm 4, if

  F(x^{k+1}) > L_µ(x^{k+1}, y^k; λ^k)    (2.3)

holds, we set x^{k+1} := y^k, i.e., we skip (discard) the computation of x^{k+1} in line 3. An alternative version of Algorithm 4 that has smaller average work per iteration is the following Algorithm 5.

Algorithm 5: Alternating Linearization Method with Skipping Steps (equivalent version)
1: Choose µ, λ^0 and x^0 = y^0.
2: for k = 0, 1, ... do
3:   x^{k+1} := arg min_x L_µ(x, y^k; λ^k)
4:   if F(x^{k+1}) > L_µ(x^{k+1}, y^k; λ^k) then
5:     x^{k+1} := y^k
6:     y^{k+1} := arg min_y Q_f(y, x^{k+1})
7:     λ^{k+1} := ∇f(x^{k+1}) − (x^{k+1} − y^{k+1})/µ
8:   else
9:     λ^{k+1/2} := λ^k − (x^{k+1} − y^k)/µ
10:    y^{k+1} := arg min_y L_µ(x^{k+1}, y; λ^{k+1/2})
11:    λ^{k+1} := λ^{k+1/2} − (x^{k+1} − y^{k+1})/µ

Note that in Algorithm 5, when (2.3) does not hold, we switch to the SADAL algorithm, which updates λ instead of computing ∇f(x^{k+1}). Algorithm 5 is usually faster than Algorithm 4 when ∇f(x^{k+1}) is costly to compute in addition to performing Step 3. Note also that when (2.3) holds, the steps of the algorithm reduce to those of ISTA.

The following theorem gives conditions under which Algorithms 2, 3, 4 and 5 are equivalent.

Theorem 2.1. (i) If both f and g are differentiable, and λ^0 is set to −∇g(y^0), then Algorithms 2 and 3 are equivalent. (ii) If, in addition, ∇g is Lipschitz continuous with Lipschitz constant L(g), and µ ≤ 1/L(g), then Algorithms 3 and 4 are equivalent. (iii) If f is differentiable, then Algorithms 4 and 5 are equivalent.

Proof. When both f and g are differentiable and λ^0 = −∇g(y^0), (2.1) holds for all k ≥ 0, and it follows that L_µ(x, y^k; λ^k) ≡ Q_g(x, y^k) and L_µ(x^{k+1}, y; λ^{k+1/2}) ≡ Q_f(y, x^{k+1}). This proves part (i). If ∇g(x) is Lipschitz continuous and µ ≤ 1/L(g), then

  g(x^{k+1}) ≤ g(y^k) + ⟨∇g(y^k), x^{k+1} − y^k⟩ + (1/2µ)‖x^{k+1} − y^k‖₂²

holds (see, e.g., [5]). This implies that (2.3) does not hold and hence x^{k+1} := arg min_x L_µ(x, y^k; λ^k), and the equivalence of Algorithms 3 and 4 follows. This proves part (ii). The optimality of x^{k+1} in line 3 of Algorithm 5 implies that λ^{k+1/2} = ∇f(x^{k+1}) when (2.3) does not hold, and hence that L_µ(x^{k+1}, y; λ^{k+1/2}) ≡ Q_f(y, x^{k+1}). This proves part (iii).
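The following is a minimal Python sketch (ours) of the skipping logic of Algorithm 5; f, g, grad_f, prox_f and prox_g are placeholder callables, and only f is assumed smooth.

```python
import numpy as np

def alm_s(f, g, grad_f, prox_f, prox_g, x0, mu, iters=100):
    """Algorithm 5 (ALM-S, equivalent version) sketch."""
    def aug_lag(x, y, lam):
        # L_mu(x, y; lam) = f(x) + g(y) - <lam, x - y> + ||x - y||^2 / (2*mu)
        return f(x) + g(y) - np.dot(lam, x - y) + np.dot(x - y, x - y) / (2 * mu)

    y = x0.copy()
    lam = np.zeros_like(x0)
    for _ in range(iters):
        x = prox_f(y + mu * lam, mu)              # argmin_x L_mu(x, y^k; lam^k)
        if f(x) + g(x) > aug_lag(x, y, lam):      # test (2.3): skip the x-step
            x = y
            y = prox_g(x - mu * grad_f(x), mu)    # argmin_y Q_f(y, x^{k+1})
            lam = grad_f(x) - (x - y) / mu
        else:                                     # SADAL-style updates
            lam_half = lam - (x - y) / mu
            y = prox_g(x - mu * lam_half, mu)     # argmin_y L_mu(x^{k+1}, y; lam^{k+1/2})
            lam = lam_half - (x - y) / mu
    return y
```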
We now show that the iteration complexity of Algorithm 4 is O(1/ε) for obtaining an ε-optimal solution to (1.1). First, we need the following generalization of Lemma 2.3 in [4].

Lemma 2.2. Let ψ : R^n → R and φ : R^n → R be convex functions and define

  Q_ψ(u, v) := φ(u) + ψ(v) + ⟨γ_ψ(v), u − v⟩ + (1/2µ)‖u − v‖₂², and p_ψ(v) := arg min_u Q_ψ(u, v),    (2.4)

where γ_ψ(v) is any subgradient in the subdifferential ∂ψ(v) of ψ at the point v. Let Φ(·) = φ(·) + ψ(·). For any v, if

  Φ(p_ψ(v)) ≤ Q_ψ(p_ψ(v), v),    (2.5)

then for any u,

  2µ(Φ(u) − Φ(p_ψ(v))) ≥ ‖p_ψ(v) − u‖² − ‖v − u‖².    (2.6)

Proof. From (2.5), we have

  Φ(u) − Φ(p_ψ(v)) ≥ Φ(u) − Q_ψ(p_ψ(v), v) = Φ(u) − [φ(p_ψ(v)) + ψ(v) + ⟨γ_ψ(v), p_ψ(v) − v⟩ + (1/2µ)‖p_ψ(v) − v‖₂²].    (2.7)

Since φ and ψ are convex, we have

  φ(u) ≥ φ(p_ψ(v)) + ⟨u − p_ψ(v), γ_φ(p_ψ(v))⟩,    (2.8)

and

  ψ(u) ≥ ψ(v) + ⟨u − v, γ_ψ(v)⟩,    (2.9)

where γ_φ(·) is a subgradient of φ(·) and γ_φ(p_ψ(v)) satisfies the first-order optimality conditions for (2.4), i.e.,

  γ_φ(p_ψ(v)) + γ_ψ(v) + (1/µ)(p_ψ(v) − v) = 0.    (2.10)

Summing (2.8) and (2.9) yields

  Φ(u) ≥ φ(p_ψ(v)) + ⟨u − p_ψ(v), γ_φ(p_ψ(v))⟩ + ψ(v) + ⟨u − v, γ_ψ(v)⟩.    (2.11)

Therefore, from (2.7), (2.10) and (2.11) it follows that

  Φ(u) − Φ(p_ψ(v)) ≥ ⟨γ_ψ(v) + γ_φ(p_ψ(v)), u − p_ψ(v)⟩ − (1/2µ)‖p_ψ(v) − v‖₂²
    = ⟨−(1/µ)(p_ψ(v) − v), u − p_ψ(v)⟩ − (1/2µ)‖p_ψ(v) − v‖₂²
    = (1/2µ)(‖p_ψ(v) − u‖² − ‖v − u‖²).    (2.12)

Theorem 2.3. Assume ∇f(·) is Lipschitz continuous with Lipschitz constant L(f). For µ ≤ 1/L(f), the iterates y^k in Algorithm 4 satisfy

  F(y^k) − F(x*) ≤ ‖x^0 − x*‖² / (2µ(k + k_n)), ∀k,    (2.13)

where x* is an optimal solution of (1.1) and k_n is the number of iterations up to the k-th for which F(x^{k+1}) ≤ L_µ(x^{k+1}, y^k; λ^k), i.e., the number of iterations in which no skipping step occurs. Thus, the sequence {F(y^k)} produced by Algorithm 4 converges to F(x*). Moreover, if 1/(βL(f)) ≤ µ ≤ 1/L(f), where β ≥ 1, the number of iterations needed to obtain an ε-optimal solution is at most ⌈C/ε⌉, where C = βL(f)‖x^0 − x*‖²/2.

Proof. Let I be the set of all iteration indices up to the (k−1)-st for which no skipping occurs, and let I^c be its complement. Let I = {n_i}, i = 0, ..., k_n − 1. It follows that for all n ∈ I^c, x^{n+1} = y^n. For n ∈ I we can apply Lemma 2.2 to obtain the following inequalities. In (2.6), by letting ψ = f, φ = g, u = x* and v = x^{n+1}, we get p_ψ(v) = y^{n+1}, Φ = F and

  2µ(F(x*) − F(y^{n+1})) ≥ ‖y^{n+1} − x*‖² − ‖x^{n+1} − x*‖².    (2.14)

Similarly, by letting ψ = g, φ = f, u = x* and v = y^n in (2.6), we get p_g(v) = x^{n+1}, Φ = F and

  2µ(F(x*) − F(x^{n+1})) ≥ ‖x^{n+1} − x*‖² − ‖y^n − x*‖².    (2.15)

Summing (2.14) and (2.15) we get

  2µ(2F(x*) − F(x^{n+1}) − F(y^{n+1})) ≥ ‖y^{n+1} − x*‖² − ‖y^n − x*‖².    (2.16)

For n ∈ I^c, (2.14) holds. Then, since x^{n+1} = y^n, we get

  2µ(F(x*) − F(y^{n+1})) ≥ ‖y^{n+1} − x*‖² − ‖y^n − x*‖².    (2.17)

Summing (2.16) and (2.17) over n = 0, 1, ..., k − 1, we get

  2µ((2|I| + |I^c|)F(x*) − Σ_{n∈I} F(x^{n+1}) − Σ_{n=0}^{k−1} F(y^{n+1})) ≥ Σ_{n=0}^{k−1} (‖y^{n+1} − x*‖² − ‖y^n − x*‖²) = ‖y^k − x*‖² − ‖y^0 − x*‖² ≥ −‖x^0 − x*‖².    (2.18)

For any n, since Lemma 2.2 holds for any u, letting u = x^{n+1} instead of x* we get from (2.14) that

  2µ(F(x^{n+1}) − F(y^{n+1})) ≥ ‖y^{n+1} − x^{n+1}‖² ≥ 0,    (2.19)

or, equivalently,

  2µ(F(x^n) − F(y^n)) ≥ ‖y^n − x^n‖² ≥ 0.    (2.20)

Thus we get F(y^n) ≤ F(x^n), ∀n. Similarly, for n ∈ I, by letting u = y^n instead of x* we get from (2.15) that

  2µ(F(y^n) − F(x^{n+1})) ≥ ‖x^{n+1} − y^n‖² ≥ 0.    (2.21)
On the other hand, for n ∈ I^c, (2.21) holds trivially because x^{n+1} = y^n; thus (2.21) holds for all n. Adding (2.19) and (2.21), and adding (2.20) and (2.21), respectively, yields

  2µ(F(y^n) − F(y^{n+1})) ≥ 0 and 2µ(F(x^n) − F(x^{n+1})) ≥ 0, for all n.    (2.22)

The inequalities (2.22) show that the sequences of function values F(y^n) and F(x^n) are non-increasing. Thus we have

  Σ_{n=0}^{k−1} F(y^{n+1}) ≥ kF(y^k) and Σ_{n∈I} F(x^{n+1}) ≥ k_n F(x^k).    (2.23)

Combining (2.18) and (2.23) yields

  2µ((k + k_n)F(x*) − k_n F(x^k) − kF(y^k)) ≥ −‖x^0 − x*‖².    (2.24)

Hence, since F(y^k) ≤ F(x^k),

  2µ(k + k_n)(F(y^k) − F(x*)) ≤ ‖x^0 − x*‖²,

which gives us the desired result (2.13).

Corollary 2.4. Assume ∇f and ∇g are both Lipschitz continuous with Lipschitz constants L(f) and L(g), respectively. For µ ≤ min{1/L(f), 1/L(g)}, Algorithm 3 satisfies

  F(y^k) − F(x*) ≤ ‖x^0 − x*‖²/(4µk), ∀k,    (2.25)

where x* is an optimal solution of (1.1). Thus, the sequence {F(y^k)} produced by Algorithm 3 converges to F(x*). Moreover, if 1/(β max{L(f), L(g)}) ≤ µ ≤ 1/max{L(f), L(g)}, where β ≥ 1, the number of iterations needed to get an ε-optimal solution is at most ⌈C/ε⌉, where C = β max{L(f), L(g)}‖x^0 − x*‖²/4.

Proof. The conclusion follows from Theorems 2.1 and 2.3 and the fact that k_n = k.

Remark 2.5. The complexity bound in Corollary 2.4 is smaller than the analogous bound for ISTA in [4] by a factor of two. It is easy to see that the bound in Theorem 2.3 is also an improvement over the bound in [4] as long as F(x^{k+1}) ≤ L_µ(x^{k+1}, y^k; λ^k) holds for at least one value of k. It is reasonable then to ask whether the per-iteration cost of Algorithms 3 and 4 is comparable to that of ISTA. This is indeed the case under the assumption that minimizing L_µ(x, y^k; λ^k) has comparable cost to (and often involves the same computations as) computing the gradient ∇f(y^k).

Remark 2.6. More general problems of the form

  min f(x) + g(y) s.t. Ax + y = b    (2.26)

are easily handled by our approach, since one can express (2.26) as min f(x) + g(b − Ax).

Remark 2.7. If a convex constraint x ∈ C, where C is a convex set, is added to problem (1.1), and we impose this constraint in the two subproblems in Algorithms 3 and 4, i.e., we impose x ∈ C in the subproblems with respect to x and y ∈ C in the subproblems with respect to y, the complexity results in Theorem 2.3 and Corollary 2.4 continue to hold. The only changes in the proofs are in Lemma 2.2. If there is a constraint x ∈ C, then (2.6) holds for any u ∈ C and v ∈ C. Also, in the proof of Lemma 2.2, the first equality in (2.12) becomes a "≥" inequality due to the fact that the optimality conditions (2.10) become

  ⟨γ_φ(p_ψ(v)) + γ_ψ(v) + (1/µ)(p_ψ(v) − v), u − p_ψ(v)⟩ ≥ 0, ∀u ∈ C.

Remark 2.8. Although Algorithms 3 and 4 assume that the Lipschitz constants are known, and hence that an upper bound for µ is known, this assumption can be relaxed by using the backtracking technique in [4] to estimate µ at each iteration.
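Remark 2.8 refers to the backtracking strategy of [4]; the sketch below is our own illustration of one way such an estimate of µ could be maintained, and is not the paper's procedure. F, Q and prox_step are placeholder callables.

```python
def backtrack_mu(F, Q, prox_step, v, mu, shrink=0.5):
    """Shrink mu until F(p) <= Q(p, v, mu), in the spirit of the
    backtracking rule of Beck and Teboulle [4] cited in Remark 2.8.

    prox_step(v, mu) returns the minimizer p of the surrogate Q(., v, mu);
    F(p) is the true objective value at p.
    """
    p = prox_step(v, mu)
    while F(p) > Q(p, v, mu):
        mu *= shrink          # smaller mu means a larger proximal weight 1/mu
        p = prox_step(v, mu)
    return p, mu
```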
3. Fast Alternating Linearization Methods. In this section, we propose a fast alternating linearization method (FALM) which computes an ε-optimal solution to problem (1.1) in O(√(L/ε)) iterations, while keeping the work at each iteration almost the same as that required by ALM. FALM is an accelerated version of ALM for solving (1.1), or equivalently (1.8), when f(x) and g(x) are both differentiable, and is given below as Algorithm 6. Clearly, FALM is also a Gauss-Seidel type algorithm. In fact, it is a successive over-relaxation type algorithm, since (t_k − 1)/t_{k+1} > 0 for all k ≥ 2.

Algorithm 6: Fast Alternating Linearization Method (FALM)
1: Choose µ and x^0 = y^0 = z^1, set t_1 = 1.
2: for k = 1, 2, ... do
3:   x^k := arg min_x Q_g(x, z^k)
4:   y^k := arg min_y Q_f(y, x^k)
5:   t_{k+1} := (1 + √(1 + 4t_k²))/2
6:   z^{k+1} := y^k + ((t_k − 1)/t_{k+1})(y^k − y^{k−1})

Algorithm 6 requires both f and g to be continuously differentiable. To develop an algorithm that can be applied to problems where one of the functions is non-differentiable, we use a skipping technique as in Algorithm 4. FALM with skipping steps (FALM-S), which does not require g(x) to be smooth, is given below as Algorithm 7.

Algorithm 7: FALM with Skipping Steps (FALM-S)
1: Choose x^0 = y^0 = z^1 and λ^1 ∈ −∂g(z^1), set t_1 = 1.
2: for k = 1, 2, ... do
3:   x^k := arg min_x L_µ(x, z^k; λ^k)
4:   if F(x^k) > L_µ(x^k, z^k; λ^k) then
5:     if the x-step was not skipped at iteration k−1 then
6:       t_k := (1 + √(1 + 8t_{k−1}²))/2
7:     else
8:       t_k := (1 + √(1 + 4t_{k−1}²))/2
9:     x^k := z^k := y^{k−1} + ((t_{k−1} − 1)/t_k)(y^{k−1} − y^{k−2})
10:  y^k := arg min_y Q_f(y, x^k)
11:  if x^k = z^k then
12:    t_{k+1} := (1 + √(1 + 2t_k²))/2
13:  else
14:    t_{k+1} := (1 + √(1 + 4t_k²))/2
15:  z^{k+1} := y^k + ((t_k − 1)/t_{k+1})(y^k − y^{k−1})
16:  Choose λ^{k+1} ∈ −∂g(z^{k+1})

The following theorem gives conditions under which Algorithms 6 and 7 are equivalent.

Theorem 3.1. If both f(x) and g(x) are differentiable and ∇g(x) is Lipschitz continuous with Lipschitz constant L(g), and µ ≤ 1/L(g), then Algorithms 6 and 7 are equivalent.

Proof. As in Theorem 2.1, if f and g are differentiable, L_µ(x, z^k; λ^k) ≡ Q_g(x, z^k). The conclusion then follows from the fact that F(x^k) ≤ L_µ(x^k, z^k; λ^k) always holds when µ ≤ 1/L(g), and thus there are no skipping steps.
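A minimal Python sketch (ours, not the paper's code) of the FALM iteration of Algorithm 6; as before, grad_f, grad_g, prox_f and prox_g are placeholder callables for the gradients and for solvers of (1.2)-(1.3).

```python
import numpy as np

def falm(grad_f, grad_g, prox_f, prox_g, x0, mu, iters=100):
    """Algorithm 6 (FALM) sketch: ALM steps taken at an extrapolated point z^k,
    combined with the FISTA-style momentum sequence t_k."""
    y_prev = x0.copy()
    z = x0.copy()
    t = 1.0
    for _ in range(iters):
        x = prox_f(z - mu * grad_g(z), mu)    # argmin_x Q_g(x, z^k)
        y = prox_g(x - mu * grad_f(x), mu)    # argmin_y Q_f(y, x^k)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = y + ((t - 1.0) / t_next) * (y - y_prev)
        y_prev, t = y, t_next
    return y_prev
```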
To prove that Algorithm 7 requires O(√(L(f)/ε)) iterations to obtain an ε-optimal solution, we need the following lemmas. We call the k-th iteration a skipping step if x^k = z^k, and a regular step if x^k ≠ z^k.

Lemma 3.2. The sequence {x^k, y^k} generated by Algorithm 7 satisfies

  2µ(t_k² v_k − t_{k+1}² v_{k+1}) ≥ ‖u^{k+1}‖² − ‖u^k‖²,    (3.1)

where u^k := t_k y^k − (t_k − 1)y^{k−1} − x*, and v_k := 2F(y^k) − 2F(x*) if iteration k is a regular step and v_k := F(y^k) − F(x*) if iteration k is a skipping step.

Proof. There are four cases to consider: (i) both the k-th and the (k+1)-st iterations are regular steps; (ii) the k-th iteration is a regular step and the (k+1)-st iteration is a skipping step; (iii) both the k-th and the (k+1)-st iterations are skipping steps; (iv) the k-th iteration is a skipping step and the (k+1)-st iteration is a regular step.

We will prove that the following inequality holds in all four cases:

  2µ(t_k² v_k − t_{k+1}² v_{k+1}) ≥ t_{k+1}(t_{k+1} − 1)(‖y^{k+1} − y^k‖² − ‖z^{k+1} − y^k‖²) + t_{k+1}(‖y^{k+1} − x*‖² − ‖z^{k+1} − x*‖²).    (3.2)

The proof of (3.1), and hence the lemma, then follows from the fact that the right-hand side of inequality (3.2) equals

  ‖t_{k+1}y^{k+1} − (t_{k+1} − 1)y^k − x*‖² − ‖t_{k+1}z^{k+1} − (t_{k+1} − 1)y^k − x*‖² = ‖u^{k+1}‖² − ‖u^k‖²,

where we have used the fact that t_{k+1}z^{k+1} := t_{k+1}y^k + t_k(y^k − y^{k−1}) − (y^k − y^{k−1}).

Case (i): Let us first consider the case in which both the k-th and (k+1)-st iterations are regular steps. In (2.6), by letting ψ = f, φ = g, u = y^k and v = x^{k+1}, we get p_ψ(v) = y^{k+1}, Φ = F and

  2µ(F(y^k) − F(y^{k+1})) ≥ ‖y^{k+1} − y^k‖² − ‖x^{k+1} − y^k‖².    (3.3)

In (2.6), by letting ψ = g, φ = f, u = y^k, v = z^{k+1}, we get p_ψ(v) = x^{k+1}, Φ = F and

  2µ(F(y^k) − F(x^{k+1})) ≥ ‖x^{k+1} − y^k‖² − ‖z^{k+1} − y^k‖².    (3.4)

Summing (3.3) and (3.4), and using the fact that F(y^{k+1}) ≤ F(x^{k+1}), we obtain

  2µ(v_k − v_{k+1}) = 2µ(2F(y^k) − 2F(y^{k+1})) ≥ ‖y^{k+1} − y^k‖² − ‖z^{k+1} − y^k‖².    (3.5)

Again, in (2.6), by letting ψ = g, φ = f, u = x*, v = z^{k+1}, we get p_ψ(v) = x^{k+1}, Φ = F and

  2µ(F(x*) − F(x^{k+1})) ≥ ‖x^{k+1} − x*‖² − ‖z^{k+1} − x*‖².    (3.6)

In (2.6), by letting ψ = f, φ = g, u = x*, v = x^{k+1}, we get p_ψ(v) = y^{k+1}, Φ = F and

  2µ(F(x*) − F(y^{k+1})) ≥ ‖y^{k+1} − x*‖² − ‖x^{k+1} − x*‖².    (3.7)

Summing (3.6) and (3.7), and again using the fact that F(y^{k+1}) ≤ F(x^{k+1}), we obtain

  −2µv_{k+1} = 2µ(2F(x*) − 2F(y^{k+1})) ≥ ‖y^{k+1} − x*‖² − ‖z^{k+1} − x*‖².    (3.8)

If we multiply (3.5) by t_k², and (3.8) by t_{k+1}, and take the sum of the resulting two inequalities, we get (3.2) by using the fact that t_k² = t_{k+1}(t_{k+1} − 1).

Case (ii): By letting ψ = f, φ = g, u = y^k and v = z^{k+1} in (2.6), we get p_ψ(v) = y^{k+1}, Φ = F and

  2µ(F(y^k) − F(y^{k+1})) ≥ ‖y^{k+1} − y^k‖² − ‖z^{k+1} − y^k‖².    (3.9)

Since the steps taken in the k-th and (k+1)-st iterations are regular and skipping steps, respectively, we have

  2µ(v_k/2 − v_{k+1}) = 2µ(F(y^k) − F(y^{k+1})) ≥ ‖y^{k+1} − y^k‖² − ‖z^{k+1} − y^k‖².    (3.10)

Also, by letting ψ = f, φ = g, u = x* and v = z^{k+1} in (2.6), we get p_ψ(v) = y^{k+1}, Φ = F and

  −2µv_{k+1} = 2µ(F(x*) − F(y^{k+1})) ≥ ‖y^{k+1} − x*‖² − ‖z^{k+1} − x*‖².    (3.11)

Then, multiplying (3.10) by 2t_k², (3.11) by t_{k+1}, summing the resulting two inequalities and using the fact that in this case 2t_k² = t_{k+1}(t_{k+1} − 1), we obtain (3.2).

Case (iii): This case reduces to two consecutive FISTA steps, and the proof above applies with t_k² = t_{k+1}(t_{k+1} − 1) and inequality (3.10) replaced by

  2µ(v_k − v_{k+1}) = 2µ(F(y^k) − F(y^{k+1})) ≥ ‖y^{k+1} − y^k‖² − ‖z^{k+1} − y^k‖²,    (3.12)

which gets multiplied by t_k².

Case (iv): In this case, (3.5) in the proof of case (i) is replaced by

  2µ(2v_k − v_{k+1}) = 2µ(2F(y^k) − 2F(y^{k+1})) ≥ ‖y^{k+1} − y^k‖² − ‖z^{k+1} − y^k‖²,    (3.13)

which, when multiplied by t_k²/2 and combined with (3.8) multiplied by t_{k+1}, and the fact that in this case t_k²/2 = t_{k+1}(t_{k+1} − 1), yields (3.2).
The following lemma gives lower bounds for the sequence of scalars {t_k} generated by Algorithm 7.

Lemma 3.3. For all k ≥ 1, the sequence {t_k} generated by Algorithm 7 satisfies:
if the first step is a skipping step,

  t_k ≥ (1/2)(k + 1 + αr(k)) if k is a skipping step, and t_k ≥ (1/(2√2))(k + 1 + αr(k)) if k is a regular step;

if the first step is a regular step,

  t_k ≥ (1/√2)(k + 1 + α̂s(k)) if k is a skipping step, and t_k ≥ (1/2)(k + 1 + α̂s(k)) if k is a regular step,

where r(k) and s(k) are the numbers of steps among the first k steps that are regular and skipping steps, respectively, and α ≡ √2 − 1 and α̂ ≡ 1/√2 − 1.

Proof. Consider the case where the first iteration of Algorithm 7 is a skipping step. Clearly, the sequence of iterations follows a pattern of alternating blocks of one or more skipping steps and one or more regular steps. Let the index of the first iteration in the i-th block be denoted by n_i. Since it is assumed that the first iteration is a skipping step, iterations n_1, n_3, n_5, ... are skipping steps (n_1 = 1) and n_2, n_4, n_6, ... are regular steps. Note that the statement of the lemma in this case corresponds to

  t_k ≥ (1/2)(k + 1 + αr(k)), for n_j ≤ k ≤ n_{j+1} − 1, if j is odd,
  t_k ≥ (1/(2√2))(k + 1 + αr(k)), for n_j ≤ k ≤ n_{j+1} − 1, if j is even,    (3.14)

which we will prove by induction on j.

We first note that it follows from the updating rules and formulas for t_k that

  t_k ≥ 1/2 + √2 t_{k−1}, if k is a skipping step and k − 1 is a regular step,
  t_k ≥ 1/2 + (1/√2) t_{k−1}, if k is a regular step and k − 1 is a skipping step,
  t_k ≥ 1/2 + t_{k−1}, otherwise.    (3.15)

Consider j = 1. Clearly (3.14) holds for all iterations n_1 = 1 ≤ k ≤ n_2 − 1, since t_k ≥ (k+1)/2 holds trivially for t_1 = 1, and for 1 < k ≤ n_2 − 1, t_k ≥ (k−1)/2 + t_1 = (k+1)/2. Now assume that (3.14) holds for all j < j̄. If j̄ is even, iterations k = n_{j̄} and k − 1 are, respectively, regular and skipping iterations. Hence, from (3.15), we have that

  t_k ≥ 1/2 + (1/√2)t_{k−1} ≥ 1/2 + (1/(2√2))(k + αr(k − 1)) = (1/(2√2))(k + 1 + αr(k)).

Since the remaining p ≡ n_{j̄+1} − 1 − n_{j̄} iterations before iteration n_{j̄+1} are all regular iterations (p may be zero), we have from (3.15) that, for n_{j̄} < k ≤ n_{j̄+1} − 1,

  t_k ≥ (k − n_{j̄})/2 + t_{n_{j̄}} ≥ (k − n_{j̄})/2 + (1/(2√2))(n_{j̄} + 1 + αr(n_{j̄})) = (k − n_{j̄})/2 + (1/(2√2))(n_{j̄} + 1 + α(r(k) − k + n_{j̄})) = (1/(2√2))(k + 1 + αr(k)).

If j̄ is odd, iteration k = n_{j̄} is a skipping iteration. Hence, from (3.15), we have that

  t_k ≥ 1/2 + √2 t_{k−1} ≥ 1/2 + √2·(1/(2√2))(k + αr(k − 1)) = (1/2)(k + 1 + αr(k)).

Since the remaining p ≡ n_{j̄+1} − 1 − n_{j̄} iterations before iteration n_{j̄+1} are all skipping iterations (again, p may be zero), we have from (3.15) that, for n_{j̄} < k ≤ n_{j̄+1} − 1,

  t_k ≥ (k − n_{j̄})/2 + t_{n_{j̄}} ≥ (k − n_{j̄})/2 + (1/2)(n_{j̄} + 1 + αr(n_{j̄})) = (1/2)(k + 1 + αr(k)).

This completes the induction. Since the proof for the case in which the first step is a regular step is entirely analogous, we leave it to the reader.

Now we are ready to give the complexity of Algorithm 7.

Theorem 3.4. Let α = √2 − 1 and let r(k) be the number of steps among the first k steps that are regular steps.
Assume ∇f(·) is Lipschitz continuous with Lipschitz constant L(f). If µ ≤ 1/L(f), the sequence {y^k} generated by Algorithm 7 satisfies

  F(y^k) − F(x*) ≤ 2‖x^0 − x*‖² / (µ(k + 1 + αr̂(k))²),    (3.16)

where r̂(k) = r(k) if the first step is a skipping step, and r̂(k) = r(k) + 1 if the first step is a regular step. Hence, the sequence {F(y^k)} produced by Algorithm 7 converges to F(x*). Moreover, if 1/(βL(f)) ≤ µ ≤ 1/L(f), where β ≥ 1, the number of iterations required by Algorithm 7 to get an ε-optimal solution to (1.1) is at most ⌊√(C/ε)⌋, where C = 2βL(f)‖x^0 − x*‖².

Proof. Using the same notation as in Lemmas 3.2 and 3.3, (3.8) and (3.11) imply that

  −2µv_1 ≥ ‖y^1 − x*‖² − ‖z^1 − x*‖²

holds whether the first iteration is a skipping step or not. Thus we have

  2µv_1 + ‖y^1 − x*‖² ≤ ‖z^1 − x*‖² = ‖x^0 − x*‖².    (3.17)

From Lemma 3.2 we know that the sequence {2µt_k²v_k + ‖u^k‖²} is non-increasing. Therefore, we have

  2µt_k²v_k ≤ 2µt_k²v_k + ‖u^k‖² ≤ 2µt_1²v_1 + ‖u^1‖² = 2µv_1 + ‖y^1 − x*‖² ≤ ‖x^0 − x*‖²,    (3.18)

where the equality follows from the facts that t_1 = 1 and u^1 = y^1 − x*, and the last inequality is from (3.17). Recalling the definition of v_k in Lemma 3.2 and the bounds on t_k in Lemma 3.3, and keeping in mind that v_k has a different expression depending on whether the k-th step is a skipping or a regular step, we obtain from (3.18) that the sequence {y^k} generated by Algorithm 7 satisfies:
(i) if the first step is a skipping step, then F(y^k) − F(x*) ≤ 2‖x^0 − x*‖²/(µ(k + 1 + αr(k))²);
(ii) if the first step is a regular step, then F(y^k) − F(x*) ≤ ‖x^0 − x*‖²/(µ(k + 1 + α̂s(k))²).
It is easy to check that these bounds are equivalent to (3.16), and that the worst-case bound on the number of iterations follows from (3.16).

Corollary 3.5. Assume ∇f and ∇g are both Lipschitz continuous with Lipschitz constants L(f) and L(g), respectively. For µ ≤ min{1/L(f), 1/L(g)}, Algorithm 6 satisfies

  F(y^k) − F(x*) ≤ ‖x^0 − x*‖² / (µ(k + 1)²), ∀k,    (3.19)

where x* is an optimal solution of (1.1). Hence, the sequence {F(y^k)} produced by Algorithm 6 converges to F(x*), and if 1/(β max{L(f), L(g)}) ≤ µ ≤ 1/max{L(f), L(g)}, where β ≥ 1, the number of iterations needed to get an ε-optimal solution is at most ⌈√(C/ε) − 1⌉, where C = β max{L(f), L(g)}‖x^0 − x*‖².

Proof. Note that since µ ≤ min{1/L(f), 1/L(g)}, from Theorem 3.1 we know that Algorithms 6 and 7 are equivalent. That is, every step in Algorithm 7 is a regular step. Therefore, case (ii) in Theorem 3.4 holds with s(k) = 0, which leads to (3.19).

Remark 3.6. The complexity bound in Corollary 3.5 is smaller than the analogous bound for FISTA in [4] by a factor of √2. It is easy to see that the bound in Theorem 3.4 is also an improvement over the bound in [4] as long as F(x^{k+1}) ≤ L_µ(x^{k+1}, y^k; λ^k) holds for at least one value of k. As in the case of Algorithms 3 and 4, the per-iteration costs of Algorithms 6 and 7 are comparable to that of FISTA.
Remark 3.7. Line 6 in Algorithm 6 and Line 15 in Algorithm 7 can be changed to

  z^{k+1} := w^k + (1/t_{k+1})[t_k(y^k − w^{k−1}) − (w^k − w^{k−1})],

where w^k := αx^k + (1 − α)y^k with α ∈ (0, 1), and Theorem 3.4 and Corollary 3.5 still hold.

Remark 3.8. Although Algorithms 6 and 7 assume that the Lipschitz constants are known, and hence that an upper bound for µ is known, this assumption can be relaxed by using the backtracking technique in [4] to estimate µ at each iteration.

4. Comparison of ALM, FALM, ISTA, FISTA, SADAL and SALSA. In this section we compare the performance of our basic and fast ALMs, with and without skipping steps, against ISTA, FISTA, SADAL (Algorithm 2) and the alternating direction augmented Lagrangian method SALSA described in [2] on a benchmark wavelet-based image deblurring problem from [17]. In this problem, the original image is the well-known Cameraman image of size 256 × 256, and the observed image is obtained after imposing a uniform blur of size 9 × 9 (denoted by the operator R) and Gaussian noise (generated by the function randn in MATLAB with a seed of 0 and a standard deviation of 0.56). Since the coefficient vector of the wavelet transform of the image is sparse in this problem, one can try to reconstruct the image u from the observed image b by solving the problem

  x̄ := arg min_x ½‖Ax − b‖₂² + ρ‖x‖₁,    (4.1)

and setting u := Wx̄, where A := RW and W is the inverse discrete Haar wavelet transform with four levels. By defining f(x) := ½‖Ax − b‖₂² and g(x) := ρ‖x‖₁, it is clear that (4.1) can be expressed in the form (1.1) and can be solved by ALM-S (Algorithm 4), FALM-S (Algorithm 7), ISTA, FISTA, SALSA and SADAL (Algorithm 2). However, in order to use ALM (Algorithm 3) and FALM (Algorithm 6), we need to smooth g(x) first, since these two algorithms require both f and g to be smooth. Here we apply the smoothing technique introduced by Nesterov [39], since this technique guarantees that the gradient of the smoothed function is Lipschitz continuous. A smoothed approximation to the ℓ1 function g(x) := ρ‖x‖₁ with smoothness parameter σ > 0 is

  g_σ(x) := max {⟨x, z⟩ − (σ/2)‖z‖₂² : ‖z‖_∞ ≤ ρ}.    (4.2)

It is easy to show that the optimal solution z_σ(x) of (4.2) is

  z_σ(x) = min{ρ, max{x/σ, −ρ}}.    (4.3)

According to Theorem 1 in [39], the gradient of g_σ is given by ∇g_σ(x) = z_σ(x) and is Lipschitz continuous with Lipschitz constant L(g_σ) = 1/σ. After smoothing g, we can apply Algorithms 3 and 6 to solve the smoothed problem

  min_x f(x) + g_σ(x).    (4.4)
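A minimal Python sketch (ours) of the smoothed ℓ1 penalty (4.2) and its gradient (4.3); the function name smoothed_l1 is our own.

```python
import numpy as np

def smoothed_l1(x, rho, sigma):
    """Nesterov-smoothed l1 penalty g_sigma in (4.2) and its gradient z_sigma(x) in (4.3).

    Returns (value, gradient); the gradient is the maximizer clipped to the box
    ||z||_inf <= rho and is Lipschitz continuous with constant 1/sigma.
    """
    z = np.clip(x / sigma, -rho, rho)              # z_sigma(x)
    val = np.dot(x, z) - 0.5 * sigma * np.dot(z, z)
    return val, z
```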
We have the following theorem about ε-optimal solutions of problems (4.1) and (4.4).

Theorem 4.1. Let σ = ε/(nρ²) and ε > 0. If x(σ) is an ε/2-optimal solution to (4.4), then x(σ) is an ε-optimal solution to (4.1).

Proof. Let D_g := max{½‖z‖₂² : ‖z‖_∞ ≤ ρ} = ½nρ², and let x* and x*(σ) be optimal solutions to problems (4.1) and (4.4), respectively. Note that

  g_σ(x) ≤ g(x) ≤ g_σ(x) + σD_g, ∀x ∈ R^n.    (4.5)

Using the inequalities in (4.5) and the facts that x(σ) is an ε/2-optimal solution to (4.4) and σD_g = ε/2, we have

  f(x(σ)) + g(x(σ)) − f(x*) − g(x*) ≤ f(x(σ)) + g_σ(x(σ)) + σD_g − f(x*) − g_σ(x*)
    ≤ f(x(σ)) + g_σ(x(σ)) + σD_g − f(x*(σ)) − g_σ(x*(σ)) ≤ ε/2 + σD_g = ε.

Thus, to find an ε-optimal solution to (4.1), we can apply Algorithms 3 and 6 to find an ε/2-optimal solution to (4.4) with σ = ε/(nρ²). The iteration complexity results in Corollaries 2.4 and 3.5 hold since the gradient of g_σ is Lipschitz continuous. However, the numbers of iterations needed by ALM and FALM to obtain an ε-optimal solution to (4.1) become O(1/ε²) and O(1/ε), respectively, because the Lipschitz constant L(g_σ) = 1/σ = nρ²/ε = O(1/ε).

When Algorithms 4 and 7 are applied to solve (4.1), the subproblems (1.2) and (1.3) are easy to solve. Specifically, (1.2) corresponds to solving a linear system, which is particularly easy to do because of the special structures of R and W (see [1, 2]), and (1.3) corresponds to a vector shrinkage operation. When Algorithms 3 and 6 are applied to solve (4.4), (1.3) with g replaced by g_σ is also easy to solve; its optimal solution is

  x := z − τ min{ρ, max{−ρ, z/(τ + σ)}}.

Since ALM is equivalent to SADAL when both functions are smooth, we implemented ALM as SADAL when solving (4.4). We also applied SADAL to the nonsmooth problem (4.1), and we implemented ALM-S as Algorithm 5, since the latter was usually faster. In all algorithms, we set the initial points x^0 = y^0 = 0, and in FALM and FALM-S we set z^1 = 0. MATLAB codes for SALSA, FISTA and ISTA (modified from FISTA) were downloaded from http://cascais.lx.it.pt/~mafonso/salsa.html and their default inputs were used. Moreover, λ^0 was set to 0 in Algorithms 2, 4, 5 and 7, since ∇g_σ(x^0) = 0 and 0 ∈ −∂g(x^0) when x^0 = 0. Also, whenever g(x) was smoothed, we set σ = 10^{-6}. µ was set to 1 in all the algorithms, since the Lipschitz constant of the gradient of the function ½‖RW(·) − b‖₂² was known to be 1. We set µ to 1 even for the smoothed problems. Although this violates the requirement µ ≤ 1/L(g_σ) in Corollaries 2.4 and 3.5, we see from the numerical results reported below that ALM and FALM still work very well. All of the algorithms tested were terminated after 1000 iterations. The (nonsmoothed) objective function values in (4.1) produced by these algorithms at iterations 10, 50, 100, 200, 500, 800 and 1000 for different choices of ρ are presented in Tables 4.1 and 4.2. The CPU times (in seconds) and the numbers of iterations required to reduce the objective function value below 1.04e+5 and 8.60e+5 are reported, respectively, in the last columns of Tables 4.1 and 4.2. All of our codes were written in MATLAB and run in MATLAB 7.3.0 on a Dell Precision 670 workstation with an Intel Xeon(TM) 3.4GHz CPU and 6 GB of RAM.

Table 4.1. Comparison of the algorithms for solving (4.1) with ρ = 0.01 (objective value at iteration k; cpu in seconds (iterations) to reach objective below 1.04e+5).

solver  | k=10        | k=50        | k=100       | k=200       | k=500       | k=800       | k=1000      | cpu (iter)
FALM-S  | 1.767239e+5 | 1.040919e+5 | 1.004322e+5 | 9.726599e+4 | 9.341282e+4 | 9.182962e+4 | 9.121742e+4 | 24.3 (51)
FALM    | 1.767249e+5 | 1.040955e+5 | 9.899843e+4 | 9.516208e+4 | 9.186355e+4 | 9.073086e+4 | 9.028790e+4 | 23.1 (51)
FISTA   | 1.723109e+5 | 1.061116e+5 | 1.016385e+5 | 9.752858e+4 | 9.372093e+4 | 9.233719e+4 | 9.178455e+4 | 26.0 (69)
ALM-S   | 4.218082e+5 | 1.439742e+5 | 1.212865e+5 | 1.107103e+5 | 1.042869e+5 | 1.021905e+5 | 1.013128e+5 | 208.9 (531)
ALM     | 4.585705e+5 | 1.481379e+5 | 1.233182e+5 | 1.116683e+5 | 1.047410e+5 | 1.025611e+5 | 1.016589e+5 | 208.1 (581)
ISTA    | 2.345290e+5 | 1.267048e+5 | 1.137827e+5 | 1.079721e+5 | 1.040666e+5 | 1.025107e+5 | 1.018068e+5 | 196.8 (510)
SALSA   | 8.772957e+5 | 1.549462e+5 | 1.267379e+5 | 1.132676e+5 | 1.054600e+5 | 1.031346e+5 | 1.021898e+5 | 223.9 (663)
SADAL   | 2.524912e+5 | 1.271591e+5 | 1.133542e+5 | 1.068386e+5 | 1.021905e+5 | 1.004005e+5 | 9.961905e+4 | 113.5 (332)

Table 4.2. Comparison of the algorithms for solving (4.1) with ρ = 0.1 (objective value at iteration k; cpu in seconds (iterations) to reach objective below 8.60e+5).

solver  | k=10        | k=50        | k=100       | k=200       | k=500       | k=800       | k=1000      | cpu (iter)
FALM-S  | 9.868574e+5 | 8.771604e+5 | 8.487372e+5 | 8.271496e+5 | 8.110211e+5 | 8.065750e+5 | 8.050973e+5 | 37.7 (76)
FALM    | 9.876315e+5 | 8.629257e+5 | 8.369244e+5 | 8.210375e+5 | 8.097621e+5 | 8.067903e+5 | 8.058290e+5 | 25.7 (54)
FISTA   | 9.924884e+5 | 8.830263e+5 | 8.501727e+5 | 8.288459e+5 | 8.126598e+5 | 8.081259e+5 | 8.066060e+5 | 30.1 (79)
ALM-S   | 1.227588e+6 | 9.468694e+5 | 9.134766e+5 | 8.880703e+5 | 8.617264e+5 | 8.509737e+5 | 8.465260e+5 | 214.5 (537)
ALM     | 1.263787e+6 | 9.521381e+5 | 9.172737e+5 | 8.910902e+5 | 8.639917e+5 | 8.528932e+5 | 8.482666e+5 | 211.8 (588)
ISTA    | 1.048956e+6 | 9.396822e+5 | 9.161787e+5 | 8.951970e+5 | 8.700864e+5 | 8.589587e+5 | 8.541664e+5 | 293.8 (764)
SALSA   | 1.680608e+6 | 9.601661e+5 | 9.230268e+5 | 8.956607e+5 | 8.674579e+5 | 8.558580e+5 | 8.509770e+5 | 230.2 (671)
SADAL   | 1.060130e+6 | 9.231803e+5 | 8.956150e+5 | 8.735746e+5 | 8.509601e+5 | 8.420295e+5 | 8.383270e+5 | 112.5 (335)

From Tables 4.1 and 4.2 we see that, in terms of the value of the objective function achieved after a specified number of iterations, the performance of FALM-S and FALM is always slightly better than that of FISTA and much better than the performance of the other algorithms. On the two test problems, since FALM-S and FALM are always better than ALM-S and ALM, and FISTA is always better than ISTA, we can conclude that the Nesterov-type acceleration technique greatly speeds up the basic algorithms on these problems. Moreover, although sometimes in the early iterations FISTA (ISTA) is better than FALM-S and FALM (ALM-S and ALM), it is always worse than the latter two algorithms when the iteration number is large. We also illustrate our comparisons graphically by plotting in Figure 4.1 the objective function value versus the number of iterations taken by these algorithms for solving (4.1) with ρ = 0.1. From Figure 4.1 we see clearly that, for this problem, ALM outperforms ISTA, FALM outperforms FISTA, FALM outperforms ALM, and FALM-S outperforms ALM-S. From the CPU times and the iteration numbers in the last columns of Tables 4.1 and 4.2 we see that the fast versions are always much better than the basic versions of the algorithms. Since iterations of FISTA cost less than those of FALM-S (and of FALM as well), we see that although FISTA takes 35% (4%) more iterations than FALM-S in the last column of Table 4.1 (4.2), it takes only 7% more time (20% less time). We note that for the problems with ρ = 0.01 and ρ = 0.1 (i.e., for the results given in Tables 4.1 and 4.2), 891 and 981, respectively, of the first 1000 iterations performed by FALM-S were skipping steps. In contrast, none of the steps performed by ALM-S on either of these problems were skipping steps. While the latter result is somewhat surprising, the fact that FALM-S performs many skipping steps is not, since the Nesterov-like acceleration approach is an over-relaxation approach that generates points that extrapolate beyond the previous point and the one produced by the ALM algorithm.
[Figure 4.1: Comparison of the algorithms; objective function value versus iteration for ALM vs. ISTA, FALM vs. FISTA, FALM vs. ALM, and FALM-S vs. ALM-S on problem (4.1) with ρ = 0.1.]

5. Applications. In this section, we describe how ALM and FALM can be applied to problems that can be formulated as RPCA problems, to illustrate the use of Nesterov-type smoothing when the functions f and g do not satisfy the smoothness conditions required by the theorems in Sections 2 and 3. Our numerical results show that our methods are able to solve huge problems that arise in practice; e.g., one problem involving roughly 40 million variables and 20 million linear constraints is solved in about three-quarters of an hour. We also describe the application of our methods to the SICS problem.

5.1. Applications in Robust Principal Component Analysis. In order to apply Algorithms 3 and 6 to (1.6), we need to smooth both the nuclear norm f(X) := ‖X‖∗ and the ℓ1 norm g(Y) := ρ‖Y‖₁. We again apply Nesterov's smoothing technique as in Section 4. g(Y) can be smoothed in the same way as the vector ℓ1 norm in Section 4; we use g_σ(Y) to denote the smoothed function with smoothness parameter σ > 0. A smoothed approximation to f(X) with smoothness parameter σ > 0 is

  f_σ(X) := max {⟨X, W⟩ − (σ/2)‖W‖_F² : ‖W‖ ≤ 1}.    (5.1)

It is easy to show that the optimal solution W_σ(X) of (5.1) is

  W_σ(X) = U Diag(min{γ, 1}) V⊤,    (5.2)

where U Diag(γ) V⊤ is the singular value decomposition (SVD) of X/σ. According to Theorem 1 in [39], the gradient of f_σ is given by ∇f_σ(X) = W_σ(X) and is Lipschitz continuous with Lipschitz constant L(f_σ) = 1/σ. After smoothing f and g, we can apply Algorithms 3 and 6 to solve the following smoothed problem:

  min {f_σ(X) + g_σ(Y) : X + Y = M}.    (5.3)

We have the following theorem about ε-optimal solutions of problems (1.6) and (5.3).

Theorem 5.1. Let σ = ε/(2 max{min{m, n}, mnρ²}) and ε > 0. If (X(σ), Y(σ)) is an ε/2-optimal solution to (5.3), then (X(σ), Y(σ)) is an ε-optimal solution to (1.6).

Proof. Let D_f := max{½‖W‖_F² : ‖W‖ ≤ 1} = ½min{m, n}, D_g := max{½‖Z‖_F² : ‖Z‖_∞ ≤ ρ} = ½mnρ², and let (X*, Y*) and (X*(σ), Y*(σ)) be optimal solutions to problems (1.6) and (5.3), respectively. Note that

  f_σ(X) ≤ f(X) ≤ f_σ(X) + σD_f, ∀X ∈ R^{m×n},    (5.4)

and

  g_σ(Y) ≤ g(Y) ≤ g_σ(Y) + σD_g, ∀Y ∈ R^{m×n}.    (5.5)

Using the inequalities in (5.4) and (5.5) and the facts that (X(σ), Y(σ)) is an ε/2-optimal solution to (5.3) and σ max{D_f, D_g} = ε/4, we have

  f(X(σ)) + g(Y(σ)) − f(X*) − g(Y*)
    ≤ f_σ(X(σ)) + g_σ(Y(σ)) + σD_f + σD_g − f_σ(X*) − g_σ(Y*)
    ≤ f_σ(X(σ)) + g_σ(Y(σ)) + σD_f + σD_g − f_σ(X*(σ)) − g_σ(Y*(σ))
    ≤ ε/2 + σD_f + σD_g ≤ ε/2 + ε/4 + ε/4 = ε.

Thus, according to Theorem 5.1, to find an ε-optimal solution to (1.6), we need to find an ε/2-optimal solution to (5.3) with σ = ε/(2 max{min{m, n}, mnρ²}).
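A minimal Python sketch (ours) of the smoothed nuclear norm (5.1) and its gradient (5.2); the function name smoothed_nuclear_norm is our own.

```python
import numpy as np

def smoothed_nuclear_norm(X, sigma):
    """Nesterov-smoothed nuclear norm f_sigma in (5.1) and its gradient W_sigma(X) in (5.2).

    W_sigma(X) is obtained from the SVD of X/sigma by capping the singular values at 1;
    it is Lipschitz continuous with constant 1/sigma.
    """
    U, gamma, Vt = np.linalg.svd(X / sigma, full_matrices=False)
    W = (U * np.minimum(gamma, 1.0)) @ Vt          # U diag(min(gamma, 1)) V^T
    val = np.sum(X * W) - 0.5 * sigma * np.sum(W * W)
    return val, W
```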
We can either apply Algorithms 3 and 6 to solve (5.3), or apply Algorithms 4 and 7 to solve (5.3) with only one of the functions (say f) smoothed. The iteration complexity results in Theorems 2.3 and 3.4 hold since the gradient of $f_\sigma$ is Lipschitz continuous. However, the numbers of iterations needed by ALM and FALM to obtain an $\epsilon$-optimal solution to (1.6) become $O(1/\epsilon^2)$ and $O(1/\epsilon)$, respectively, because the Lipschitz constant $L(f_\sigma) = 1/\sigma = \frac{2\max\{\min\{m,n\},\, mn\rho^2\}}{\epsilon} = O(1/\epsilon)$.

The two subproblems at iteration k of Algorithm 3, when applied to (5.3), reduce to

$X^{k+1} := \arg\min_X\; f_\sigma(X) + g_\sigma(Y^k) + \langle \nabla g_\sigma(Y^k), M - X - Y^k\rangle + \tfrac{1}{2\mu}\|X + Y^k - M\|_F^2$   (5.6)

and

$Y^{k+1} := \arg\min_Y\; f_\sigma(X^{k+1}) + \langle \nabla f_\sigma(X^{k+1}), M - X^{k+1} - Y\rangle + \tfrac{1}{2\mu}\|X^{k+1} + Y - M\|_F^2 + g_\sigma(Y)$.   (5.7)

The first-order optimality conditions for (5.6) are

$W_\sigma(X) - Z_\sigma(Y^k) + \tfrac{1}{\mu}(X + Y^k - M) = 0$,   (5.8)

where $W_\sigma(X)$ and $Z_\sigma(Y)$ are defined in (5.2) and (4.3). It is easy to check that

$X := U\,\mathrm{Diag}\!\left(\gamma - \tfrac{\mu\gamma}{\max\{\gamma,\, \mu + \sigma\}}\right) V^\top$   (5.9)

satisfies (5.8), where $U\,\mathrm{Diag}(\gamma)\,V^\top$ is the SVD of the matrix $\mu Z_\sigma(Y^k) - Y^k + M$. Thus, solving subproblem (5.6) corresponds to an SVD. If we define $B := \mu W_\sigma(X^{k+1}) - X^{k+1} + M$, it is easy to verify that

$Y_{ij} = B_{ij} - \mu\,\min\!\big\{\rho, \max\{-\rho,\; \tfrac{B_{ij}}{\sigma + \mu}\}\big\}$ for $i = 1, \dots, m$ and $j = 1, \dots, n$   (5.10)

satisfies the first-order optimality conditions for (5.7):

$-W_\sigma(X^{k+1}) + \tfrac{1}{\mu}(X^{k+1} + Y - M) + Z_\sigma(Y) = 0$.

Thus, solving subproblem (5.7) can be done very cheaply. The two subproblems at the k-th iteration of Algorithm 6 can be handled in the same way, and the main computational effort in each iteration of both ALM and FALM corresponds to an SVD (a brief sketch of these two updates is given after Theorem 5.2 below).

5.2. RPCA with Missing Data.
In some applications of RPCA, some of the entries of M in (1.6) may be missing (e.g., in low-rank matrix completion problems where the matrix is corrupted by noise). Let Ω be the index set of the entries of M that are observable and define the projection operator $\mathcal{P}_\Omega$ by $(\mathcal{P}_\Omega(X))_{ij} = X_{ij}$ if $(i,j)\in\Omega$ and $(\mathcal{P}_\Omega(X))_{ij} = 0$ otherwise. It has been shown under some randomness hypotheses that the low-rank $\bar X$ and sparse $\bar Y$ can be recovered with high probability by solving (see Theorem 1.2 in [6])

$(\bar X, \bar Y) := \arg\min_{X,Y}\{\|X\|_* + \rho\|Y\|_1 : \mathcal{P}_\Omega(X + Y) = \mathcal{P}_\Omega(M)\}$.   (5.11)

To solve (5.11) by ALM or FALM, we need to transform it into the form of (1.6). For this we have

Theorem 5.2. $(\bar X, \mathcal{P}_\Omega(\bar Y))$ is an optimal solution to (5.11) if

$(\bar X, \bar Y) = \arg\min_{X,Y}\{\|X\|_* + \rho\|\mathcal{P}_\Omega(Y)\|_1 : X + Y = \mathcal{P}_\Omega(M)\}$.   (5.12)

Proof. Suppose $(X^*, Y^*)$ is an optimal solution to (5.11). We claim that $Y^*_{ij} = 0$ for all $(i,j)\notin\Omega$; otherwise, $(X^*, \mathcal{P}_\Omega(Y^*))$ is feasible for (5.11) and has a strictly smaller objective function value than $(X^*, Y^*)$, which contradicts the optimality of $(X^*, Y^*)$. Thus $\|\mathcal{P}_\Omega(Y^*)\|_1 = \|Y^*\|_1$. Now suppose that $(\bar X, \mathcal{P}_\Omega(\bar Y))$ is not optimal for (5.11); then we have

$\|X^*\|_* + \rho\|\mathcal{P}_\Omega(Y^*)\|_1 = \|X^*\|_* + \rho\|Y^*\|_1 < \|\bar X\|_* + \rho\|\mathcal{P}_\Omega(\bar Y)\|_1$.   (5.13)

Defining a new matrix $\tilde Y$ by

$\tilde Y_{ij} = \begin{cases} Y^*_{ij}, & (i,j)\in\Omega \\ -X^*_{ij}, & (i,j)\notin\Omega, \end{cases}$

we have that $(X^*, \tilde Y)$ is feasible for (5.12) and $\|\mathcal{P}_\Omega(\tilde Y)\|_1 = \|\mathcal{P}_\Omega(Y^*)\|_1$. Combining this with (5.13), we obtain $\|X^*\|_* + \rho\|\mathcal{P}_\Omega(\tilde Y)\|_1 < \|\bar X\|_* + \rho\|\mathcal{P}_\Omega(\bar Y)\|_1$, which contradicts the optimality of $(\bar X, \bar Y)$ for (5.12). Therefore, $(\bar X, \mathcal{P}_\Omega(\bar Y))$ is optimal for (5.11).
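Before specializing to the missing-data problem (5.12), the following is a minimal NumPy sketch (our own illustration, not the authors' code) of the closed-form updates (5.9) and (5.10) for the complete-data smoothed problem (5.3). The smoothed gradients $W_\sigma$ and $Z_\sigma$ are inlined via clipping, and a dense SVD stands in for the partial SVD used in practice; as described next, the missing-data variant changes only how M and B enter.

```python
import numpy as np

def x_update(Y, M, mu, sigma, rho):
    # X-update (5.9): SVD of mu * Z_sigma(Y) - Y + M, then shrink each
    # singular value gamma by mu * gamma / max(gamma, mu + sigma).
    C = mu * np.clip(Y / sigma, -rho, rho) - Y + M
    U, gamma, Vt = np.linalg.svd(C, full_matrices=False)
    shrunk = gamma - mu * gamma / np.maximum(gamma, mu + sigma)
    return U @ np.diag(shrunk) @ Vt

def y_update(X, M, mu, sigma, rho):
    # Y-update (5.10): with B = mu * W_sigma(X) - X + M, shrink B elementwise.
    U, g, Vt = np.linalg.svd(X / sigma, full_matrices=False)
    W = U @ np.diag(np.minimum(g, 1.0)) @ Vt   # W_sigma(X), cf. (5.2)
    B = mu * W - X + M
    return B - mu * np.clip(B / (sigma + mu), -rho, rho)
```

In an actual implementation the SVD computed in the X-update would be reused to form $W_\sigma(X^{k+1})$ for the Y-update rather than recomputed as in this sketch.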
The only differences between (1.6) and (5.12) are that the matrix M is replaced by $\mathcal{P}_\Omega(M)$ and $g(Y) = \rho\|Y\|_1$ is replaced by $\rho\|\mathcal{P}_\Omega(Y)\|_1$. A smoothed approximation $g_\sigma(Y)$ to $g(Y) := \rho\|\mathcal{P}_\Omega(Y)\|_1$ is given by

$g_\sigma(Y) := \max\{\langle \mathcal{P}_\Omega(Y), Z\rangle - \tfrac{\sigma}{2}\|Z\|_F^2 : \|Z\|_\infty \le \rho\}$,   (5.14)

and

$(\nabla g_\sigma(Y))_{ij} = \min\{\rho, \max\{(\mathcal{P}_\Omega(Y))_{ij}/\sigma, -\rho\}\}$, for $1 \le i \le m$ and $1 \le j \le n$.   (5.15)

According to Theorem 1 in [39], $\nabla g_\sigma(Y)$ is Lipschitz continuous with $L(g_\sigma) = 1/\sigma$. Thus the convergence and iteration complexity results in Theorems 2.3 and 3.4 apply. The only changes in Algorithms 3 and 4 and Algorithms 6 and 7 are that M is replaced by $\mathcal{P}_\Omega(M)$ and $Y^{k+1}$ is computed using (5.10) with B replaced by $\mathcal{P}_\Omega(B)$.

5.3. Numerical Results on RPCA Problems.
In this section, we report numerical results obtained using the ALM method to solve RPCA problems with both complete and incomplete data matrices M. We compare the performance of ALM with the exact ADM (EADM) and the inexact ADM (IADM) methods in [31]. The MATLAB codes of EADM and IADM were downloaded from http://watt.csl.illinois.edu/~perceive/matrix-rank/sample_code.html and their default settings were used. To further accelerate ALM, we adopted the continuation strategy used in EADM and IADM: specifically, we set $\mu_{k+1} := \max\{\bar\mu, \eta\mu_k\}$, where $\mu_0 = \|M\|/1.25$, $\bar\mu = 10^{-6}$ and $\eta = 2/3$ in our numerical experiments. Although in some iterations this violates the requirement $\mu \le \min\{1/L(f_\sigma), 1/L(g_\sigma)\}$ in Corollaries 2.4 and 3.5, we see from the numerical results reported below that ALM and FALM still work very well. We also found that with this updating rule for µ there was not much difference between the performance of ALM and that of FALM, so we only compare ALM with EADM and IADM. As in Section 4, since we applied ALM to a smoothed problem, we implemented ALM as SADAL. The initial point in ALM was set to $(X^0, Y^0) = (M, 0)$ and the initial Lagrange multiplier was set to $\Lambda^0 = -\nabla g_\sigma(Y^0)$. We set the smoothness parameter $\sigma = 10^{-6}$. Solving subproblem (5.6) requires computing an SVD (see (5.9)). However, we do not have to compute the whole SVD, as only the singular values that are larger than the threshold $\tau = \mu\gamma/\max\{\gamma, \mu + \sigma\}$ and the corresponding singular vectors are needed. We therefore use PROPACK [29], which is also used in EADM and IADM, to compute these singular values and corresponding singular vectors. To use PROPACK, one has to specify the number of leading singular values (denoted by $sv_k$) to be computed at iteration k. We adopt the strategy suggested in [31] for EADM and IADM: it starts with $sv_0 = 100$ and updates $sv_k$ via

$sv_{k+1} = \begin{cases} svp_k + 1, & \text{if } svp_k < sv_k \\ \min\{svp_k + \mathrm{round}(0.05\,d),\; d\}, & \text{if } svp_k = sv_k, \end{cases}$

where $d = \min\{m, n\}$ and $svp_k$ is the number of singular values that are larger than the threshold $\tau$. In all our experiments ρ was chosen equal to $1/\sqrt{m}$. We stopped ALM, EADM and IADM when the relative infeasibility was less than $10^{-7}$, i.e., $\|X + Y - M\|_F < 10^{-7}\|M\|_F$.
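For concreteness, here is a minimal Python sketch (our own illustration, not the released code) of the two heuristics described above: the continuation update for µ and the rule from [31] for choosing the number $sv_k$ of leading singular values requested from PROPACK. The quantity `svp` (the number of singular values above the threshold) is assumed to be returned by the partial SVD step.

```python
def update_mu(mu, mu_bar=1e-6, eta=2.0 / 3.0):
    # Continuation strategy: mu_{k+1} = max(mu_bar, eta * mu_k).
    return max(mu_bar, eta * mu)

def update_sv(sv, svp, d):
    # Rule from [31]: svp is the number of singular values above the threshold
    # at the current iteration, d = min(m, n).
    if svp < sv:
        return svp + 1
    return min(svp + int(round(0.05 * d)), d)
```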
5.3.1. Background Extraction from Surveillance Video.
Extracting the nearly still background from a sequence of video frames is a basic task in video surveillance. This problem is difficult because of the presence of moving foreground objects in the video. Interestingly, as shown in [6], this problem can be formulated as an RPCA problem (1.6). By stacking the columns of each frame into a long vector, we get a matrix M whose columns correspond to the sequence of frames of the video. This matrix M can be decomposed into the sum of two matrices, $M := \bar X + \bar Y$. The matrix $\bar X$, which represents the background in the frames, should be of low rank owing to the correlation between frames. The matrix $\bar Y$, which represents the moving objects in the foreground, should be sparse since these objects usually occupy a small portion of each frame.

We apply ALM to solve (1.6) for two videos introduced in [30]. Our first example is a sequence of 200 grayscale frames of size 144 × 176 from a video of a hall at an airport; thus the matrix M is in $\mathbb{R}^{25344\times 200}$. The second example is a sequence of 320 color frames from a video taken at a campus. Since the video is colored, each frame is an image stored in RGB format, i.e., a 128 × 160 × 3 cube. The video is then reshaped into a (128 × 160)-by-(3 × 320) matrix, i.e., $M \in \mathbb{R}^{20480\times 960}$. Some frames of the videos and the recovered backgrounds and foregrounds are shown in Figure 5.1. We only show the frames produced by ALM, because EADM and IADM produce visually identical results.

Fig. 5.1. In the first 3 columns: (a) video sequence; (b) static background recovered by our ALM (note that the man who kept still throughout the 200 frames stays in the background); (c) moving foreground recovered by our ALM. In the last 3 columns: (a) video sequence; (b) static background recovered by our ALM; (c) moving foreground recovered by our ALM.

From these figures we can see that ALM effectively separates the nearly still background from the moving foreground. Table 5.1 summarizes the numerical results on these problems; the CPU times are reported in the form hh:mm:ss. From Table 5.1 we see that although ALM is slightly worse than IADM, it is much faster than EADM in terms of both the number of SVDs and CPU time. We note that the numerical results in [6] show that the model (1.6) produces much better results than other competing models for background extraction in surveillance video.

Table 5.1. Comparison of ALM and EADM on surveillance video problems

                                 Exact ADM         Inexact ADM      ALM
Problem           m      n       SVDs   CPU        SVDs   CPU       SVDs   CPU
Hall (gray)       25344  200     550    40:15      38     03:47     43     04:03
Campus (color)    20480  960     651    13:54:38   40     43:35     46     46:49

5.3.2. Random Matrix Completion Problems with Grossly Corrupted Data.
For the matrix completion problem (5.11), we set $M := A + E$, where the rank-r matrix $A \in \mathbb{R}^{n\times n}$ was created as the product $A_L A_R^\top$ of random matrices $A_L \in \mathbb{R}^{n\times r}$ and $A_R \in \mathbb{R}^{n\times r}$ with i.i.d. Gaussian entries $\mathcal{N}(0,1)$, and the sparse matrix E was generated by choosing its support uniformly at random and its nonzero entries i.i.d. uniformly in the interval $[-500, 500]$. In Table 5.2, $rr := \mathrm{rank}(A)/n$, $spr := \|E\|_0/n^2$, the relative errors are $relX := \|X - A\|_F/\|A\|_F$ and $relY := \|Y - E\|_F/\|E\|_F$, and the sampling ratio of Ω is $SR = m/n^2$. The m indices in Ω were generated uniformly at random. We set $\rho = 1/\sqrt{n}$ and stopped ALM when the relative infeasibility satisfied $\|X + Y - \mathcal{P}_\Omega(M)\|_F/\|\mathcal{P}_\Omega(M)\|_F < 10^{-5}$; for our continuation strategy we set $\mu_0 = \|\mathcal{P}_\Omega(M)\|_F/1.25$.
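As an illustration of this experimental setup, the following sketch (our own, with hypothetical helper names) generates one test instance: a rank-r matrix $A = A_L A_R^\top$ with i.i.d. standard Gaussian factors, a sparse corruption E with uniformly random support and entries uniform on $[-500, 500]$, and a uniformly sampled observation mask Ω. For simplicity it draws the support and Ω by Bernoulli sampling, which matches the densities spr and SR only in expectation.

```python
import numpy as np

def make_instance(n, r, spr, sr, rng=None):
    # Test problem of Section 5.3.2: M = A + E with rank-r A and sparse E,
    # plus an observation mask Omega with sampling ratio roughly sr.
    rng = np.random.default_rng() if rng is None else rng
    A_L = rng.standard_normal((n, r))
    A_R = rng.standard_normal((n, r))
    A = A_L @ A_R.T                                  # rank-r component
    E = np.zeros((n, n))
    support = rng.random((n, n)) < spr               # support density ~ spr
    E[support] = rng.uniform(-500.0, 500.0, size=int(support.sum()))
    Omega = rng.random((n, n)) < sr                  # roughly sr * n^2 observed entries
    return A, E, A + E, Omega
```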
The test results obtained using ALM to solve (5.12), with the nonsmooth functions replaced by their smoothed approximations, are given in Table 5.2. From Table 5.2 we see that ALM recovered the test matrices from a limited number of observations. Note that a fairly high number of samples was needed to obtain small relative errors, due to the presence of noise. The number of iterations needed was almost constant (around 36), regardless of the size of the problem. The CPU times (in seconds) are also reported.

Table 5.2. Numerical results for noisy matrix completion problems

               SR = 90%                           SR = 80%
rr     spr     iter  relX     relY     cpu        iter  relX     relY     cpu
n = 500
0.05   0.05    36    4.60e-5  4.25e-6  137        36    3.24e-5  4.31e-6  153
0.05   0.1     36    4.68e-5  5.29e-6  156        36    4.40e-5  4.91e-6  161
0.1    0.05    36    4.04e-5  3.74e-6  128        36    1.28e-3  1.33e-4  129
0.1    0.1     36    6.00e-4  4.50e-5  129        35    1.06e-2  7.59e-4  124
n = 1000
0.05   0.05    37    3.10e-5  3.96e-6  1089       37    2.27e-5  4.14e-6  1191
0.05   0.1     37    3.20e-5  4.93e-6  1213       37    3.00e-5  4.66e-6  1271
0.1    0.05    37    2.68e-5  3.34e-6  982        37    1.75e-4  2.49e-5  994
0.1    0.1     37    3.64e-5  4.51e-6  1004       36    4.62e-3  4.63e-4  965

5.4. Sparse Inverse Covariance Selection.
In [43] the ALM method was successfully applied to the sparse inverse covariance selection (SICS) problem

$\min_{X \in S^n_{++}} F(X) \equiv f(X) + g(X)$,   (5.16)

where $f(X) = -\log\det(X) + \langle S, X\rangle$ and $g(X) = \rho\|X\|_1$. Note that in this case f(X) does not have a Lipschitz continuous gradient in general. Moreover, f(X) is only defined for positive definite matrices, while g(X) is defined everywhere. These properties of the objective function make the SICS problem especially challenging for optimization methods. Nevertheless, we can still apply Algorithm 4 and obtain the complexity bound in Theorem 2.3, as follows. As proved in [33], the optimal solution $X^*$ of (5.16) satisfies $X^* \succeq \alpha I$, where $\alpha = \frac{1}{\|S\| + n\rho}$ (see Proposition 3.1 in [33]). Therefore, the SICS problem (5.16) can be formulated as

$\min_{X,Y}\{f(X) + g(Y) : X - Y = 0,\; X \in \mathcal{C},\; Y \in \mathcal{C}\}$,   (5.17)

where $\mathcal{C} := \{X \in S^n : X \succeq \tfrac{\alpha}{2} I\}$. We can apply Algorithm 4 and Theorem 2.3 as per Remark 2.7. The difficulty arises, however, when performing the minimization in Y (Step 5 of Algorithm 4) with the constraint $Y \in \mathcal{C}$. Without this constraint, the minimization reduces to a matrix shrinkage operation; with this additional constraint the problem becomes harder to solve. Minimization in X (Step 3 of Algorithm 4), with or without the constraint $X \in \mathcal{C}$, is accomplished by performing an SVD of the current iterate $Y^k$, hence the constraint can easily be imposed (a sketch of this eigenvalue-based update is given at the end of this subsection). Also note that once this decomposition has been computed, both $\nabla f(X^{k+1})$ and $\nabla f(Y^k)$ are readily available (see [43] for details). This implies that either skipping or non-skipping iterations of Algorithm 4 can be performed at the same cost as one ISTA iteration. Instead of imposing the constraint $Y \in \mathcal{C}$ in Step 5 of Algorithm 4, we can obtain feasible solutions by a line search on µ. We know that the constraint $X \succeq \tfrac{\alpha}{2} I$ is not tight at the solution. Hence if we start the algorithm with $X \succeq \alpha I$ and restrict the step size µ to be sufficiently small, then the iterates of the method will remain in $\mathcal{C}$. Similarly, one can apply ISTA with small steps to remain in $\mathcal{C}$. Note, however, that the bound on the Lipschitz constant of the gradient of f(X) is $1/\alpha^2$ and hence can be very large. It is not practical to restrict µ in the algorithm to be smaller than $\alpha^2$, since µ determines the step size at each iteration. The advantage of ALM methods over ISTA in this case is that as soon as the constraint $Y \in \mathcal{C}$ is relaxed, ISTA can no longer be applied, whereas ALM/SADAL can be applied and indeed works very well. The theory in this case only applies once a certain proximity to the optimal solution has been reached. But as shown in [43], the SADAL method is computationally superior to other state-of-the-art methods for SICS. We have also applied the FALM method to the SICS problem, but we have not observed any advantage over ALM for this particular application.
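To make the eigenvalue computation mentioned above concrete, the following is a hedged sketch (ours, not the authors' code) of the standard closed-form solution of a proximal subproblem of the form $\min_X -\log\det X + \langle S, X\rangle + \tfrac{1}{2\mu}\|X - B\|_F^2$, which is the type of X-update that arises when $f(X) = -\log\det X + \langle S, X\rangle$ is handled by a proximal step, as in [43]. Whether the exact Step 3 of Algorithm 4 takes this form, and how the matrix B is assembled from $Y^k$ and the multiplier, should be checked against the algorithm statement earlier in the paper; the function name is our own.

```python
import numpy as np

def prox_logdet(B, S, mu):
    # Solve  min_X  -log det(X) + <S, X> + (1/(2*mu)) * ||X - B||_F^2  over X > 0.
    # Stationarity gives X - mu * X^{-1} = B - mu * S; if V diag(d) V^T is the
    # eigendecomposition of B - mu * S, then X = V diag(x) V^T with
    # x_i = (d_i + sqrt(d_i^2 + 4 * mu)) / 2 > 0.
    d, V = np.linalg.eigh(B - mu * S)
    x = (d + np.sqrt(d ** 2 + 4.0 * mu)) / 2.0
    return (V * x) @ V.T
```

In this separable spectral form, imposing $X \succeq \tfrac{\alpha}{2} I$ amounts to flooring each $x_i$ at $\alpha/2$, which is one way to see why the constraint is easy to handle in the X-update.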
6. Conclusion.
In this paper, we proposed both basic and accelerated versions of alternating linearization methods for minimizing the sum of two convex functions. Our basic methods require at most $O(1/\epsilon)$ iterations to obtain an $\epsilon$-optimal solution, while our accelerated methods require at most $O(1/\sqrt{\epsilon})$ iterations with only a small additional amount of computational effort at each iteration. Numerical results on image deblurring, background extraction from surveillance video and matrix completion with grossly corrupted data are reported. These results demonstrate the efficiency and the practical potential of our algorithms.

Acknowledgement. We would like to thank Dr. Zaiwen Wen for insightful discussions on the topic of this paper.

References.
[1] M. Afonso, J. Bioucas-Dias, and M. Figueiredo, An augmented Lagrangian approach to the constrained optimization formulation of imaging inverse problems, accepted in IEEE Transactions on Image Processing, (2009).
[2] M. Afonso, J. Bioucas-Dias, and M. Figueiredo, Fast image recovery using variable splitting and constrained optimization, preprint available at http://arxiv.org/abs/0910.4887, (2009).
[3] O. Banerjee, L. El Ghaoui, and A. d'Aspremont, Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data, Journal of Machine Learning Research, 9 (2008), pp. 485–516.
[4] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sciences, 2 (2009), pp. 183–202.
[5] D. P. Bertsekas, Nonlinear Programming, 2nd Ed., Athena Scientific, Belmont, Massachusetts, 1999.
[6] E. J. Candès, X. Li, Y. Ma, and J. Wright, Robust principal component analysis?, submitted, (2009).
[7] E. J. Candès and B. Recht, Exact matrix completion via convex optimization, Foundations of Computational Mathematics, 9 (2009), pp. 717–772.
[8] E. J. Candès, J. Romberg, and T. Tao, Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information, IEEE Transactions on Information Theory, 52 (2006), pp. 489–509.
[9] E. J. Candès and T. Tao, The power of convex relaxation: near-optimal matrix completion, IEEE Trans. Inform. Theory, 56 (2009), pp. 2053–2080.
[10] P. L. Combettes, Solving monotone inclusions via compositions of nonexpansive averaged operators, Optimization, 53 (2004), pp. 475–504.
[11] P. L. Combettes and J.-C. Pesquet, A Douglas-Rachford splitting approach to nonsmooth convex variational signal recovery, IEEE Journal of Selected Topics in Signal Processing, 1 (2007), pp. 564–574.
[12] P. L. Combettes and V. R. Wajs, Signal recovery by proximal forward-backward splitting, SIAM Journal on Multiscale Modeling and Simulation, 4 (2005), pp. 1168–1200.
[13] D. Donoho, Compressed sensing, IEEE Transactions on Information Theory, 52 (2006), pp. 1289–1306.
[14] J. Douglas and H. H. Rachford, On the numerical solution of the heat conduction problem in 2 and 3 space variables, Transactions of the American Mathematical Society, 82 (1956), pp. 421–439.
[15] J. Eckstein and D. P. Bertsekas, On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators, Math. Program., 55 (1992), pp. 293–318.
[16] J. Eckstein and B. F. Svaiter, A family of projective splitting methods for the sum of two maximal monotone operators, Math. Program. Ser. B, 111 (2008), pp. 173–199.
[17] M. Figueiredo and R. Nowak, An EM algorithm for wavelet-based image restoration, IEEE Transactions on Image Processing, 12 (2003), pp. 906–916.
[18] J. Friedman, T. Hastie, and R. Tibshirani, Sparse inverse covariance estimation with the graphical lasso, Biostatistics, (2007).
[19] D. Gabay and B. Mercier, A dual algorithm for the solution of nonlinear variational problems via finite-element approximations, Comp. Math. Appl., 2 (1976), pp. 17–40.
[20] R. Glowinski and P. Le Tallec, Augmented Lagrangian and Operator-Splitting Methods in Nonlinear Mechanics, SIAM, Philadelphia, Pennsylvania, 1989.
[21] D. Goldfarb and S. Ma, Fast multiple splitting algorithms for convex optimization, tech. report, Department of IEOR, Columbia University; preprint available at http://arxiv.org/abs/0912.4570, 2009.
[22] T. Goldstein and S. Osher, The split Bregman algorithm for L1 regularized problems, UCLA CAM Report 08-29, (2008).
[23] E. T. Hale, W. Yin, and Y. Zhang, Fixed-point continuation for ℓ1-minimization: Methodology and convergence, SIAM Journal on Optimization, 19 (2008), pp. 1107–1130.
[24] B. S. He, L.-Z. Liao, D. Han, and H. Yang, A new inexact alternating direction method for monotone variational inequalities, Math. Program., 92 (2002), pp. 103–118.
[25] B. S. He, M. Tao, M. Xu, and X. Yuan, Alternating direction based contraction method for generally separable linearly constrained convex programming problems, preprint, (2009).
[26] B. S. He, H. Yang, and S. L. Wang, Alternating direction method with self-adaptive penalty parameters for monotone variational inequalities, Journal of Optimization Theory and Applications, 106 (2000), pp. 337–356.
[27] R. H. Keshavan, A. Montanari, and S. Oh, Matrix completion from a few entries, IEEE Trans. on Info. Theory, 56 (2010), pp. 2980–2998.
[28] K. C. Kiwiel, C. H. Rosa, and A. Ruszczynski, Proximal decomposition via alternating linearization, SIAM J. Optimization, 9 (1999), pp. 668–689.
[29] R. M. Larsen, PROPACK - software for large and sparse SVD calculations, available from http://sun.stanford.edu/~rmunk/PROPACK.
[30] L. Li, W. Huang, I. Gu, and Q. Tian, Statistical modeling of complex backgrounds for foreground object detection, IEEE Trans. on Image Processing, 13 (2004), pp. 1459–1472.
[31] Z. Lin, M. Chen, L. Wu, and Y. Ma, The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices, preprint, (2009).
[32] P. L. Lions and B. Mercier, Splitting algorithms for the sum of two nonlinear operators, SIAM Journal on Numerical Analysis, 16 (1979), pp. 964–979.
[33] Z. Lu, Smooth optimization approach for sparse covariance selection, SIAM J. Optim., 19 (2009), pp. 1807–1827.
[34] S. Ma, D. Goldfarb, and L. Chen, Fixed point and Bregman iterative methods for matrix rank minimization, to appear in Mathematical Programming Series A, (2009); published online 23 September 2009.
[35] J. Malick, J. Povh, F. Rendl, and A. Wiegele, Regularization methods for semidefinite programming, SIAM Journal on Optimization, 20 (2009), pp. 336–356.
[36] R. D. C. Monteiro and B. F. Svaiter, Iteration-complexity of block-decomposition algorithms and the alternating minimization augmented Lagrangian method, preprint, (2010).
[37] Y. E. Nesterov, A method for unconstrained convex minimization problem with the rate of convergence O(1/k²), Dokl. Akad. Nauk SSSR, 269 (1983), pp. 543–547.
[38] Y. E. Nesterov, Introductory lectures on convex optimization: A basic course, 87 (2004), pp. xviii+236.
[39] Y. E. Nesterov, Smooth minimization of non-smooth functions, Math. Program. Ser. A, 103 (2005), pp. 127–152.
[40] Y. E. Nesterov, Gradient methods for minimizing composite objective function, CORE Discussion Paper 2007/76, (2007).
[41] D. H. Peaceman and H. H. Rachford, The numerical solution of parabolic and elliptic differential equations, SIAM Journal on Applied Mathematics, 3 (1955), pp. 28–41.
[42] B. Recht, M. Fazel, and P. Parrilo, Guaranteed minimum rank solutions of matrix equations via nuclear norm minimization, to appear in SIAM Review, (2007).
[43] K. Scheinberg, S. Ma, and D. Goldfarb, Sparse inverse covariance selection via alternating linearization methods, in Proceedings of the Neural Information Processing Systems (NIPS), 2010.
[44] J. E. Spingarn, Partial inverse of a monotone operator, Appl. Math. Optim., 10 (1983), pp. 247–265.
[45] K.-C. Toh and S. Yun, An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems, preprint, National University of Singapore, (2009).
[46] P. Tseng, Further applications of a splitting algorithm to decomposition in variational inequalities and convex programming, Mathematical Programming, 48 (1990), pp. 249–263.
[47] P. Tseng, Applications of a splitting algorithm to decomposition in convex programming and variational inequalities, SIAM J. Control and Optimization, 29 (1991), pp. 119–138.
[48] P. Tseng, On accelerated proximal gradient methods for convex-concave optimization, submitted to SIAM J. Optim., (2008).
[49] M. Wainwright, P. Ravikumar, and J. Lafferty, High-dimensional graphical model selection using ℓ1-regularized logistic regression, NIPS, 19 (2007), pp. 1465–1472.
[50] Z. Wen, D. Goldfarb, and W. Yin, Alternating direction augmented Lagrangian methods for semidefinite programming, tech. report, Columbia University, 2009.
[51] J. Yang and Y. Zhang, Alternating direction algorithms for ℓ1 problems in compressive sensing, preprint, (2009).
[52] M. Yuan and Y. Lin, Model selection and estimation in the Gaussian graphical model, Biometrika, 94 (2007), pp. 19–35.
[53] X. Yuan, Alternating direction methods for sparse covariance selection, (2009); preprint available at http://www.optimization-online.org/DB_HTML/2009/09/2390.html.
[54] X. Yuan and J. Yang, Sparse and low rank matrix decomposition via alternating direction methods, preprint, (2009).