Scalable Robust Matrix Recovery: Frank-Wolfe Meets Proximal Methods
CUN MU∗, YUQIAN ZHANG†, JOHN WRIGHT†, AND DONALD GOLDFARB∗

Abstract. Recovering matrices from compressive and grossly corrupted observations is a fundamental problem in robust statistics, with rich applications in computer vision and machine learning. In theory, under certain conditions, this problem can be solved in polynomial time via a natural convex relaxation, known as Compressive Principal Component Pursuit (CPCP). However, many existing provably convergent algorithms for CPCP suffer from superlinear per-iteration cost, which severely limits their applicability to large-scale problems. In this paper, we propose provably convergent, scalable and efficient methods to solve CPCP with (essentially) linear per-iteration cost. Our method combines classical ideas from Frank-Wolfe and proximal methods. In each iteration, we mainly exploit Frank-Wolfe to update the low-rank component with a rank-one SVD, and exploit the proximal step for the sparse term. Convergence results and implementation details are discussed. We demonstrate the practicability and scalability of our approach with numerical experiments on visual data.

Key words. robust matrix recovery, compressive principal component pursuit, Frank-Wolfe, conditional gradient, proximal methods, scalability

AMS subject classifications. 90C06, 90C25, 90C52

1. Introduction. Suppose that a matrix $M_0 \in \mathbb{R}^{m \times n}$ is of the form $M_0 = L_0 + S_0 + N_0$, where $L_0$ is a low-rank matrix, $S_0$ is a sparse error matrix, and $N_0$ is a dense noise matrix. Linear measurements

(1.1)    $b = \mathcal{A}[M_0] = \big( \langle A_1, M_0 \rangle, \langle A_2, M_0 \rangle, \dots, \langle A_p, M_0 \rangle \big)^\top \in \mathbb{R}^p$

are collected, where $\mathcal{A} : \mathbb{R}^{m \times n} \to \mathbb{R}^p$ is the sensing operator, $A_k$ is the sensing matrix for the $k$-th measurement, and $\langle A_k, M_0 \rangle \doteq \mathrm{Tr}(M_0^\top A_k)$. Can we, in a tractable way, recover $L_0$ and $S_0$ from $b$, given $\mathcal{A}$?
One natural approach is to solve the optimization problem combining the fidelity term and the structural terms:

(1.2)    $\min_{L,S} \; \frac{1}{2} \| b - \mathcal{A}[L+S] \|_2^2 + \lambda_L \, \mathrm{rank}(L) + \lambda_S \| S \|_0 .$

Here, $\lambda_L$ and $\lambda_S$ are regularization parameters, and $\|S\|_0$ denotes the number of nonzero entries in $S$. Unfortunately, problem (1.2) is nonconvex, and hence is not directly tractable. However, by replacing the $\ell^0$ penalty $\|S\|_0$ with the $\ell^1$ norm $\|S\|_1 \doteq \sum_{i=1}^m \sum_{j=1}^n |S_{ij}|$, and replacing $\mathrm{rank}(L)$ with the nuclear norm $\|L\|_*$ (defined as the sum of the singular values of $L$), we obtain a natural, tractable, convex relaxation of (1.2),

(1.3)    $\min_{L,S} \; \frac{1}{2} \| b - \mathcal{A}[L+S] \|_2^2 + \lambda_L \| L \|_* + \lambda_S \| S \|_1 .$

This convex surrogate is sometimes referred to as compressive principal component pursuit (CPCP) [1]. Equivalently, since

$\{ M \in \mathbb{R}^{m \times n} \mid b = \mathcal{A}[M] \} = \{ M \in \mathbb{R}^{m \times n} \mid \mathcal{P}_Q[M] = \mathcal{P}_Q[M_0] \} ,$

where $Q \subseteq \mathbb{R}^{m \times n}$ is the linear subspace spanned by the set of sensing matrices $\{ A_i \}_{i=1}^p$ and $\mathcal{P}_Q$ denotes the projection operator onto that subspace, we can rewrite problem (1.3) in the (possibly) more compact form,∗

(1.4)    $\min_{L,S} \; f(L, S) \doteq \frac{1}{2} \| \mathcal{P}_Q[L + S - M_0] \|_F^2 + \lambda_L \| L \|_* + \lambda_S \| S \|_1 .$

Recently, CPCP and its close variants have been studied for different sensing operators $\mathcal{A}$ (or, equivalently, different subspaces $Q$). Specifically, [2, 3, 4, 5, 6] consider the case where a subset $\Omega \subseteq \{1, 2, \dots, m\} \times \{1, 2, \dots, n\}$ of the entries of $M_0$ is observed.

∗Department of Industrial Engineering and Operations Research, Columbia University (cm3052@columbia.edu, goldfarb@columbia.edu). DG was funded by NSF Grants DMS-1016571 and CCF-1527809.
†Department of Electrical Engineering, Columbia University (yq2409@cs.columbia.edu, johnwright@ee.columbia.edu). JW was funded by ONR-N00014-13-0492.
Then CPCP can be reduced to

(1.5)    $\min_{L,S} \; \frac{1}{2} \| \mathcal{P}_\Omega[L + S - M_0] \|_F^2 + \lambda_L \| L \|_* + \lambda_S \| S \|_1 ,$

where $\mathcal{P}_\Omega[\cdot]$ denotes the orthogonal projection onto the linear space of matrices supported on $\Omega$, i.e., $\mathcal{P}_\Omega[M_0](i,j) = (M_0)_{ij}$ if $(i,j) \in \Omega$ and $\mathcal{P}_\Omega[M_0](i,j) = 0$ otherwise. [1] studies the case where each $A_k$ is an i.i.d. $\mathcal{N}(0,1)$ matrix, which is equivalent (in distribution) to saying that we choose a linear subspace $Q$ uniformly at random from the set of all $p$-dimensional subspaces of $\mathbb{R}^{m \times n}$ and observe $\mathcal{P}_Q[M_0]$. Accordingly, all the above works provide theoretical guarantees for CPCP, under fairly mild conditions, to produce accurate estimates of $L_0$ and $\mathcal{P}_\Omega[S_0]$ (or $S_0$), even when the number of measurements $p$ is substantially less than $mn$.

Inspired by these theoretical results, researchers from different fields have leveraged CPCP to solve many practical problems, including video background modeling [3], batch image alignment [7], face verification [8], photometric stereo [9], dynamic MRI [10], topic modeling [11], latent variable graphical model learning [12], and outlier detection and robust Principal Component Analysis [3], to name just a few. Living in the era of big data, most of these applications involve large datasets and high-dimensional data spaces. Therefore, to fully realize the benefit of the theory, we need provably convergent and scalable algorithms for CPCP. This has motivated much research into the development of first-order methods for problem (1.4) and its variants; see, e.g., [13, 14, 15, 16, 17, 18]. These methods, in essence, all exploit a closed-form expression for the proximal operator of the nuclear norm, which involves the singular value decomposition (SVD). Hence, the dominant cost in each iteration is computing an SVD of the same size as the input data.
This is substantially more scalable than off-the-shelf interior point solvers such as SDPT3 [19]. Nevertheless, the superlinear cost of each iteration has limited the practical applicability of these first-order methods to problems involving several thousands of data points and several thousands of dimensions. The need to compute a sequence of full or partial SVDs is a serious bottleneck for truly large-scale applications.

As a remedy, in this paper we design more scalable algorithms to solve CPCP that compute only a rank-one SVD in each iteration. Our approach leverages two classical and widely studied ideas – Frank-Wolfe iterations to handle the nuclear norm, and proximal steps to handle the $\ell^1$ norm. This turns out to be an ideal combination of techniques for solving large-scale CPCP problems. In particular, it yields algorithms that are substantially more scalable than prox-based first-order methods such as ISTA and FISTA [20], and converge much faster in practice than a straightforward application of Frank-Wolfe.

The remainder of this paper is organized as follows. Section 2 reviews the general properties of the Frank-Wolfe algorithm, and describes several basic building blocks that we will use in our algorithms. Section 3 and Section 4 respectively describe how to modify the Frank-Wolfe algorithm to solve CPCP's norm constrained version

(1.6)    $\min_{L,S} \; l(L, S) \doteq \frac{1}{2} \| \mathcal{P}_Q[L + S - M_0] \|_F^2 \quad \text{s.t.} \quad \| L \|_* \le \tau_L, \; \| S \|_1 \le \tau_S ,$

and the penalized version, i.e. problem (1.4), by incorporating proximal regularization to more effectively handle the $\ell^1$ norm. Convergence results and implementation details are also discussed. Section 5 presents numerical experiments on large datasets that demonstrate the scalability of our proposed algorithms. In Section 6, we summarize our contributions and discuss potential future work.

∗To transform problem (1.3) into problem (1.4), simple procedures like Gram–Schmidt might be invoked. Despite being equivalent, one formulation might be preferred over the other in practice, depending on the specifics of the sensing operator $\mathcal{A}[\cdot]$. In this paper, we will mainly focus on solving problem (1.4) and its variants. Our methods, however, are not restricted to (1.4) and can be easily extended to problem (1.3).

2. Preliminaries.

2.1. Frank-Wolfe method. The Frank-Wolfe (FW) method [21], also known as the conditional gradient method [22], applies to the general problem of minimizing a differentiable convex function $h$ over a compact, convex domain $\mathcal{D} \subseteq \mathbb{R}^n$:

(2.1)    minimize $h(x)$ subject to $x \in \mathcal{D} \subseteq \mathbb{R}^n$.

Here, $\nabla h$ is assumed to be $L$-Lipschitz:

(2.2)    $\forall \, x, y \in \mathcal{D}, \quad \| \nabla h(x) - \nabla h(y) \| \le L \| x - y \| .$

Throughout, we let $D = \max_{x, y \in \mathcal{D}} \| x - y \|$ denote the diameter of the feasible set $\mathcal{D}$. In its simplest form, the Frank-Wolfe algorithm proceeds as follows. At each iteration $k$, we linearize the objective function $h$ about the current point $x_k$:

(2.3)    $h(v) \approx h(x_k) + \langle \nabla h(x_k), v - x_k \rangle .$

We minimize the linearization over the feasible set $\mathcal{D}$ to obtain

(2.4)    $v_k \in \arg\min_{v \in \mathcal{D}} \langle \nabla h(x_k), v \rangle ,$

and then take a step in the feasible descent direction $v_k - x_k$:

(2.5)    $x_{k+1} = x_k + \tfrac{2}{k+2} ( v_k - x_k ) .$

This yields a very simple procedure, which we summarize as Algorithm 1. The particular step size, $\tfrac{2}{k+2}$, comes from the convergence analysis of the algorithm, which we discuss in more detail below.

First proposed in [21], FW-type methods have been frequently revisited in different fields. Recently, they have experienced a resurgence in statistics, machine learning and signal processing, due to their ability to yield highly scalable algorithms for optimization with structure-encouraging norms such as the $\ell^1$ norm and nuclear norm.
In particular, if $x$ is a matrix and $\mathcal{D} = \{ x \mid \| x \|_* \le \beta \}$ is a nuclear norm ball, the subproblem

(2.6)    $\min_{v \in \mathcal{D}} \langle v, \nabla h(x) \rangle$

can be solved using only the singular vector pair corresponding to the single leading singular value of the matrix $\nabla h(x)$. Thus, at each iteration, we only have to compute a rank-one partial SVD. This is substantially cheaper than the full/partial SVD exploited in proximal methods [23, 24]. We recommend [25] as a comprehensive survey of the latest developments in FW-type methods.

Algorithm 1 Frank-Wolfe method for problem (2.1)
1: Initialization: $x_0 \in \mathcal{D}$;
2: for $k = 0, 1, 2, \dots$ do
3:     $v_k \in \arg\min_{v \in \mathcal{D}} \langle v, \nabla h(x_k) \rangle$;
4:     $\gamma = \frac{2}{k+2}$;
5:     $x_{k+1} = x_k + \gamma ( v_k - x_k )$;
6: end for

In the past five decades, numerous variants of Algorithm 1 have been proposed and implemented. Many modify Algorithm 1 by replacing the simple updating rule (2.5) with more sophisticated schemes, e.g.,

(2.7)    $x_{k+1} \in \arg\min_x \; h(x) \quad \text{s.t.} \quad x \in \mathrm{conv}\{ x_k, v_k \}$

or

(2.8)    $x_{k+1} \in \arg\min_x \; h(x) \quad \text{s.t.} \quad x \in \mathrm{conv}\{ x_k, v_k, v_{k-1}, \dots, v_{k-j} \} .$

The convergence of these schemes can be analyzed simultaneously, using the fact that they produce iterates $x_{k+1}$ whose objective is no greater than that produced by the original Frank-Wolfe update scheme: $h(x_{k+1}) \le h(x_k + \gamma (v_k - x_k))$. Algorithm 2 states a general version of Frank-Wolfe, whose update is only required to satisfy this relationship. It includes as special cases the updating rules (2.5), (2.7) and (2.8).

Algorithm 2 Frank-Wolfe method for problem (2.1) with general updating scheme
1: Initialization: $x_0 \in \mathcal{D}$;
2: for $k = 0, 1, 2, \dots$ do
3:     $v_k \in \arg\min_{v \in \mathcal{D}} \langle v, \nabla h(x_k) \rangle$;
4:     $\gamma = \frac{2}{k+2}$;
5:     Update $x_{k+1}$ to some point in $\mathcal{D}$ such that $h(x_{k+1}) \le h(x_k + \gamma (v_k - x_k))$;
6: end for
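For concreteness, Algorithm 1 can be sketched in a few lines of NumPy. The toy objective below (a quadratic over an $\ell^1$ ball) and all identifiers are illustrative choices, not part of the paper's experiments; the linear minimization oracle for the $\ell^1$ ball returns a signed, scaled standard basis vector.

```python
import numpy as np

def frank_wolfe(grad_h, lmo, x0, iters):
    """Algorithm 1: linearize h at x_k, call the linear minimization
    oracle over the feasible set, and step with gamma = 2/(k+2)."""
    x = x0.copy()
    for k in range(iters):
        v = lmo(grad_h(x))           # v_k in argmin_{v in D} <v, grad h(x_k)>
        gamma = 2.0 / (k + 2)
        x = x + gamma * (v - x)      # convex combination: x stays feasible
    return x

# Toy instance: minimize 0.5*||x - c||_2^2 over {x : ||x||_1 <= tau}.
tau, c = 0.5, np.array([0.6, -0.2, 0.1])

def lmo_l1(g):
    # Best vertex of the l1 ball: -tau * sgn(g_i*) e_i* at the largest |g_i|.
    i = np.argmax(np.abs(g))
    v = np.zeros_like(g)
    v[i] = -tau * np.sign(g[i])
    return v

x_hat = frank_wolfe(lambda x: x - c, lmo_l1, np.zeros(3), iters=500)
# x_hat approaches the projection of c onto the l1 ball, (0.45, -0.05, 0).
```

By Theorem 2.1 (with $L = 1$ and $D = 2\tau = 1$ here), the function-value gap after 500 iterations is below $2/502 \approx 0.004$.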
This flexibility will be crucial for effectively handling the sparse structure in the CPCP problems (1.4) and (1.6). The convergence of Algorithm 2 can be proved using well-established techniques [24, 25, 26, 27, 28, 29, 30, 31]. Using these ideas, one can show that it converges at a rate of $O(1/k)$ in function value:

Theorem 2.1. Let $x^\star$ be an optimal solution to (2.1). For $\{x_k\}$ generated by Algorithm 2, we have for $k = 0, 1, 2, \dots$,

(2.9)    $h(x_k) - h(x^\star) \le \dfrac{2 L D^2}{k+2} .$

Proof. For $k = 0, 1, 2, \dots$, we have

$h(x_{k+1}) \le h(x_k + \gamma (v_k - x_k))$
$\le h(x_k) + \gamma \langle \nabla h(x_k), v_k - x_k \rangle + \frac{L \gamma^2}{2} \| v_k - x_k \|^2$
(2.10)    $\le h(x_k) + \gamma \langle \nabla h(x_k), v_k - x_k \rangle + \frac{\gamma^2 L D^2}{2}$
$\le h(x_k) + \gamma \langle \nabla h(x_k), x^\star - x_k \rangle + \frac{\gamma^2 L D^2}{2}$
(2.11)    $\le h(x_k) + \gamma \big( h(x^\star) - h(x_k) \big) + \frac{\gamma^2 L D^2}{2} ,$

where the second inequality holds since $\nabla h(\cdot)$ is $L$-Lipschitz continuous; the third line follows because $D$ is the diameter of the feasible set $\mathcal{D}$; the fourth inequality follows from $v_k \in \arg\min_{v \in \mathcal{D}} \langle v, \nabla h(x_k) \rangle$ and $x^\star \in \mathcal{D}$; the last one holds since $h(\cdot)$ is convex. Rearranging terms in (2.11), one obtains that for $k = 0, 1, 2, \dots$,

(2.12)    $h(x_{k+1}) - h(x^\star) \le (1 - \gamma) \big( h(x_k) - h(x^\star) \big) + \frac{\gamma^2 L D^2}{2} .$

Therefore, by mathematical induction, it can be verified that $h(x_k) - h(x^\star) \le \frac{2 L D^2}{k+2}$ for $k = 1, 2, 3, \dots$.

Remark 1. Note that the constant in the rate of convergence depends on the Lipschitz constant $L$ of $\nabla h$ and the diameter $D$.

While Theorem 2.1 guarantees that Algorithm 2 converges at a rate of $O(1/k)$, in practice it is useful to have a more precise bound on the suboptimality at iterate $k$. The surrogate duality gap

(2.13)    $d(x_k) = \langle x_k - v_k, \nabla h(x_k) \rangle$

provides a useful upper bound on the suboptimality $h(x_k) - h(x^\star)$:

(2.14)    $h(x_k) - h(x^\star) \le - \langle x^\star - x_k, \nabla h(x_k) \rangle \le - \min_{v \in \mathcal{D}} \langle v - x_k, \nabla h(x_k) \rangle = \langle x_k - v_k, \nabla h(x_k) \rangle = d(x_k) .$

This was first proposed in [21], and later [25] showed that $d(x_k) = O(1/k)$. Next, we provide a refinement of this result, using ideas from [25, 30]:

Theorem 2.2. Let $\{x_k\}$ be the sequence generated by Algorithm 2. Then for any $K \ge 1$, there exists $1 \le \tilde{k} \le K$ such that

(2.15)    $d(x_{\tilde{k}}) \le \dfrac{6 L D^2}{K+2} .$

Proof. For notational convenience, we denote $h_k \doteq h(x_k)$, $\Delta_k \doteq h(x_k) - h(x^\star)$, $d_k \doteq d(x_k)$, $C \doteq 2 L D^2$, $B \doteq K + 2$, $\hat{k} \doteq \lceil \tfrac{1}{2} B \rceil - 1$, $\mu \doteq \lceil \tfrac{1}{2} B \rceil / B$. Suppose on the contrary that

(2.16)    $d_k > \dfrac{3C}{B} \quad \text{for all } k \in \big\{ \lceil \tfrac{1}{2} B \rceil - 1, \, \lceil \tfrac{1}{2} B \rceil, \, \dots, \, K \big\} .$

From (2.10), we know that for any $k \ge 1$,

(2.17)    $\Delta_{k+1} \le \Delta_k + \gamma \langle \nabla h(x_k), v_k - x_k \rangle + \frac{\gamma^2 L D^2}{2} = \Delta_k - \frac{2 d_k}{k+2} + \frac{C}{(k+2)^2} .$

Therefore, by using (2.17) repeatedly, one has

(2.18)
$\Delta_{K+1} \le \Delta_{\hat{k}} - \sum_{k=\hat{k}}^{K} \frac{2 d_k}{k+2} + \sum_{k=\hat{k}}^{K} \frac{C}{(k+2)^2}$
$< \Delta_{\hat{k}} - \frac{6C}{B} \sum_{k=\hat{k}}^{K} \frac{1}{k+2} + C \sum_{k=\hat{k}}^{K} \frac{1}{(k+2)^2}$
$= \Delta_{\hat{k}} - \frac{6C}{B} \sum_{k=\hat{k}+2}^{B} \frac{1}{k} + C \sum_{k=\hat{k}+2}^{B} \frac{1}{k^2}$
$\le \frac{C}{\mu B} - \frac{6C}{B} \cdot \frac{B - \hat{k} - 1}{B} + C \cdot \frac{B - \hat{k} - 1}{B (\hat{k} + 1)}$
$= \frac{C}{\mu B} - \frac{6C}{B} (1 - \mu) + \frac{C}{B} \cdot \frac{1 - \mu}{\mu}$
$= \frac{C}{\mu B} \big( 2 - 6 \mu (1 - \mu) - \mu \big) ,$

where the second line is due to our assumption (2.16); the fourth line holds since $\Delta_{\hat{k}} \le \frac{C}{\hat{k}+2}$ by Theorem 2.1, and $\sum_{k=a}^{b} \frac{1}{k^2} \le \frac{b-a+1}{b(a-1)}$ for any $b \ge a > 1$. Now define $\varphi(x) = 2 - 6x(1-x) - x$. Clearly $\varphi(\cdot)$ is convex. Since $\varphi(\tfrac{1}{2}) = \varphi(\tfrac{2}{3}) = 0$, we have $\varphi(x) \le 0$ for any $x \in [\tfrac{1}{2}, \tfrac{2}{3}]$. As $\mu = \lceil \tfrac{1}{2} B \rceil / B \in [\tfrac{1}{2}, \tfrac{2}{3}]$, from (2.18) we have

$\Delta_{K+1} = h(x_{K+1}) - h(x^\star) < \frac{C}{\mu B} \varphi(\mu) \le 0 ,$

which is a contradiction.

Remark 2. The convergence rate for the duality gap matches the one for $h(x_k) - h(x^\star)$ (see (2.9)), which suggests that the upper bound $d(x_k)$ can serve as a practical stopping criterion.
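The gap $d(x_k)$ in (2.13) is a byproduct of the linear minimization step, so it costs nothing extra to evaluate. The following NumPy fragment illustrates it on a toy instance (a quadratic over an $\ell^1$ ball; the instance and identifiers are illustrative, not from the paper):

```python
import numpy as np

# Toy instance: h(x) = 0.5*||x - c||_2^2 over the l1 ball of radius tau.
tau, c = 0.5, np.array([0.6, -0.2, 0.1])

def duality_gap(x):
    g = x - c                              # grad h(x)
    i = np.argmax(np.abs(g))               # linear minimization oracle output
    v = np.zeros_like(x)
    v[i] = -tau * np.sign(g[i])
    return float(np.dot(x - v, g))         # d(x) = <x - v, grad h(x)>

x_star = np.array([0.45, -0.05, 0.0])      # projection of c onto the ball
gap_at_zero = duality_gap(np.zeros(3))     # = 0.3, an upper bound on the
                                           # suboptimality h(0) - h(x_star) = 0.1775
gap_at_opt = duality_gap(x_star)           # vanishes at the optimum
```

Per Remark 2, terminating once `duality_gap(x)` falls below a tolerance yields a certified suboptimality bound via (2.14).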
For our problem, the main computational burden in Algorithms 1 and 2 will be solving the linear subproblem $\min_{v \in \mathcal{D}} \langle v, \nabla h(x_k) \rangle$,† i.e. minimizing linear functions over the unit balls for $\|\cdot\|_*$ and $\|\cdot\|_1$. Fortunately, both of these operations have simple closed-form solutions, which we describe in the next section.

2.2. Optimization oracles. We now describe several optimization oracles involving the $\ell^1$ norm and the nuclear norm, which serve as the main building blocks for our methods. These oracles have computational costs that are (essentially) linear in the size of the input.

Minimizing a linear function over the nuclear norm ball. Since the dual norm of the nuclear norm is the operator norm, i.e., $\| Y \| = \max_{\| X \|_* \le 1} \langle Y, X \rangle$, the optimization problem

(2.19)    minimize$_X$ $\langle Y, X \rangle$ subject to $\| X \|_* \le 1$

has optimal value $- \| Y \|$. One minimizer is the rank-one matrix $X^\star = - u v^\top$, where $u$ and $v$ are the left- and right-singular vectors corresponding to the leading singular value of $Y$; these can be computed efficiently (e.g. using the power method).

Minimizing a linear function over the $\ell^1$ ball. Since the dual norm of the $\ell^1$ norm is the $\ell^\infty$ norm, i.e., $\| Y \|_\infty := \max_{(i,j)} | Y_{ij} | = \max_{\| X \|_1 \le 1} \langle Y, X \rangle$, the optimization problem

(2.20)    minimize$_X$ $\langle Y, X \rangle$ subject to $\| X \|_1 \le 1$

has optimal value $- \| Y \|_\infty$. One minimizer is the one-sparse matrix $X^\star = - \mathrm{sgn}(Y_{i^\star j^\star}) \, e_{i^\star} e_{j^\star}^\top$, where $(i^\star, j^\star) \in \arg\max_{(i,j)} | Y_{ij} |$; i.e. $X^\star$ has exactly one nonzero element.

†In some situations, we can significantly reduce this cost by solving this problem inexactly [27, 25]. Our algorithms and results can also tolerate inexact step calculations; we omit the discussion here for simplicity.

Projection onto the $\ell^1$ ball.
To effectively handle the sparse term in the norm constrained problem (1.6), we will need to modify the Frank-Wolfe algorithm by incorporating additional projection steps. For any $Y \in \mathbb{R}^{m \times n}$ and $\beta > 0$, the projection onto the $\ell^1$ ball,

(2.21)    $\mathcal{P}_{\|\cdot\|_1 \le \beta}[Y] = \arg\min_{\| X \|_1 \le \beta} \frac{1}{2} \| X - Y \|_F^2 ,$

can be easily computed with $O(mn(\log m + \log n))$ cost [32]. Moreover, a divide-and-conquer algorithm that solves (2.21) with linear cost in expectation has also been proposed in [32].

Proximal mapping of the $\ell^1$ norm. To effectively handle the sparse term arising in problem (1.4), we will need to modify the Frank-Wolfe algorithm by incorporating additional proximal steps. For any $Y \in \mathbb{R}^{m \times n}$ and $\lambda > 0$, the proximal mapping of the $\ell^1$ norm has the following closed-form expression:

(2.22)    $\mathcal{T}_\lambda[Y] = \arg\min_{X \in \mathbb{R}^{m \times n}} \frac{1}{2} \| X - Y \|_F^2 + \lambda \| X \|_1 ,$

where $\mathcal{T}_\lambda : \mathbb{R} \to \mathbb{R}$ denotes the soft-thresholding operator $\mathcal{T}_\lambda(x) = \mathrm{sgn}(x) \max\{ |x| - \lambda, 0 \}$, and the extension to matrices is obtained by applying the scalar operator $\mathcal{T}_\lambda(\cdot)$ to each element.

3. FW-P Method for the Norm Constrained Problem. In this section, we develop scalable algorithms for the norm-constrained compressive principal component pursuit problem,

(3.1)    $\min_{L,S} \; l(L, S) = \frac{1}{2} \| \mathcal{P}_Q[L + S - M] \|_F^2 \quad \text{s.t.} \quad \| L \|_* \le \tau_L, \; \| S \|_1 \le \tau_S .$

We first describe a straightforward application of the Frank-Wolfe method to this problem. We will see that although it has relatively cheap iterations, it converges very slowly on typical numerical examples, because it only makes a one-sparse update to the sparse term $S$ at a time. We will show how to remedy this problem by augmenting the FW iteration with an additional proximal step (essentially a projected gradient step) in each iteration, yielding a new algorithm which updates $S$ much more efficiently.
Because it combines Frank-Wolfe and projection steps, we will call this new algorithm Frank-Wolfe-Projection (FW-P).

Properties of the objective and constraints. To apply Frank-Wolfe to (3.1), we first note that the objective $l(L, S)$ in (3.1) is differentiable, with

(3.2)    $\nabla_L l(L, S) = \mathcal{P}_Q[L + S - M] ,$
(3.3)    $\nabla_S l(L, S) = \mathcal{P}_Q[L + S - M] .$

Moreover, the following lemma shows that the gradient map $\nabla l(L, S) = (\nabla_L l, \nabla_S l)$ is 2-Lipschitz:

Lemma 3.1. For all $(L, S)$ and $(L', S')$, we have $\| \nabla l(L, S) - \nabla l(L', S') \|_F \le 2 \| (L, S) - (L', S') \|_F$.

Proof. From (3.2) and (3.3), we have

$\| \nabla l(L, S) - \nabla l(L', S') \|_F^2 = 2 \| \mathcal{P}_Q[L + S - M] - \mathcal{P}_Q[L' + S' - M] \|_F^2 = 2 \| \mathcal{P}_Q[L + S] - \mathcal{P}_Q[L' + S'] \|_F^2 \le 2 \| L + S - L' - S' \|_F^2 \le 4 \| L - L' \|_F^2 + 4 \| S - S' \|_F^2 = 4 \| (L, S) - (L', S') \|_F^2 ,$

which implies the result.

The feasible set in (3.1) is compact. The following lemma bounds its diameter $D$:

Lemma 3.2. The feasible set $\mathcal{D} = \{ (L, S) \mid \| L \|_* \le \tau_L, \; \| S \|_1 \le \tau_S \}$ has diameter $D \le 2 \sqrt{\tau_L^2 + \tau_S^2}$.

Proof. For any $Z = (L, S)$ and $Z' = (L', S') \in \mathcal{D}$,

(3.4)    $\| Z - Z' \|_F^2 = \| L - L' \|_F^2 + \| S - S' \|_F^2 \le ( \| L \|_F + \| L' \|_F )^2 + ( \| S \|_F + \| S' \|_F )^2 \le ( \| L \|_* + \| L' \|_* )^2 + ( \| S \|_1 + \| S' \|_1 )^2 \le 4 \tau_L^2 + 4 \tau_S^2 .$

3.1. Frank-Wolfe for problem (3.1). Since (3.1) asks us to minimize a convex, differentiable function with Lipschitz gradient over a compact convex domain, the Frank-Wolfe method in Algorithm 1 applies. It generates a sequence of iterates $x^k = (L^k, S^k)$. Using the expression for the gradient in (3.2)-(3.3), at each iteration, the step direction $v^k = (V_L^k, V_S^k)$ is generated by solving the linearized subproblem

(3.5)    $(V_L^k, V_S^k) \in \arg\min \; \big\langle \big( \mathcal{P}_Q[L^k + S^k - M], \, \mathcal{P}_Q[L^k + S^k - M] \big), \, (V_L, V_S) \big\rangle \quad \text{s.t.} \quad \| V_L \|_* \le \tau_L, \; \| V_S \|_1 \le \tau_S ,$

which decouples into two independent subproblems:

$V_L^k \in \arg\min_{\| V_L \|_* \le \tau_L} \langle \mathcal{P}_Q[L^k + S^k - M], V_L \rangle , \qquad V_S^k \in \arg\min_{\| V_S \|_1 \le \tau_S} \langle \mathcal{P}_Q[L^k + S^k - M], V_S \rangle .$

These subproblems can be easily solved by exploiting the linear optimization oracles introduced in Section 2.2. In particular,

(3.6)    $V_L^k = - \tau_L \, u^k (v^k)^\top ,$
(3.7)    $V_S^k = - \tau_S \cdot \delta_{i^\star j^\star}^k \cdot e_{i^\star}^k (e_{j^\star}^k)^\top ,$

where $u^k$ and $v^k$ are the leading left- and right-singular vectors of $\mathcal{P}_Q[L^k + S^k - M]$, $(i^\star, j^\star)$ is the index of the largest element of $\mathcal{P}_Q[L^k + S^k - M]$ in magnitude, and $\delta_{ij}^k := \mathrm{sgn}\big( \big( \mathcal{P}_Q[L^k + S^k - M] \big)_{ij} \big)$. Algorithm 3 gives the Frank-Wolfe method specialized to problem (3.1).

Algorithm 3 Frank-Wolfe method for problem (3.1)
1: Initialization: $L^0 = S^0 = \mathbf{0}$;
2: for $k = 0, 1, 2, \dots$ do
3:     $D_L^k \in \arg\min_{\| D_L \|_* \le 1} \langle \mathcal{P}_Q[L^k + S^k - M], D_L \rangle$; $V_L^k = \tau_L D_L^k$;
4:     $D_S^k \in \arg\min_{\| D_S \|_1 \le 1} \langle \mathcal{P}_Q[L^k + S^k - M], D_S \rangle$; $V_S^k = \tau_S D_S^k$;
5:     $\gamma = \frac{2}{k+2}$;
6:     $L^{k+1} = L^k + \gamma (V_L^k - L^k)$;
7:     $S^{k+1} = S^k + \gamma (V_S^k - S^k)$;
8: end for

The major advantage of Algorithm 3 lies in the simplicity of the update rules (3.6)-(3.7). Both have closed form, and both can be computed in time (essentially) linear in the size of the input. Because $V_L^k$ is rank-one, the algorithm can be viewed as performing a sequence of rank-one updates. The major disadvantage of Algorithm 3 is that $S$ receives only a one-sparse update at each iteration, since $V_S^k = - \tau_S \, \delta_{i^\star j^\star}^k \, e_{i^\star}^k (e_{j^\star}^k)^\top$ has only one nonzero entry. This is a significant disadvantage in practice, as the optimal $S^\star$ may have a relatively large number of nonzero entries. Indeed, in theory, the CPCP relaxation works even when a constant fraction of the entries in $S_0$ are nonzero.
In applications such as foreground-background separation, the number of nonzero entries in the target sparse term can be quite large. The dashed curves in Figure 1 show the effect of this on the practical convergence of the algorithm, on a simulated example of size $1{,}000 \times 1{,}000$, in which about 1% of the entries in the target sparse matrix $S_0$ are nonzero. As shown, the progress is quite slow.

3.2. FW-P algorithm: combining Frank-Wolfe and projected gradient. To overcome the drawback of the naive Frank-Wolfe algorithm described above, we propose incorporating an additional gradient projection step after each Frank-Wolfe update. This additional step updates the sparse term $S$ only, with the goal of accelerating convergence in these variables. At iteration $k$, let $(L^{k+1/2}, S^{k+1/2})$ be the result produced by Frank-Wolfe. To produce the next iterate, we retain the low-rank term $L^{k+1/2}$, but set

(3.8)    $S^{k+1} = \mathcal{P}_{\|\cdot\|_1 \le \tau_S} \big[ S^{k+1/2} - \nabla_S l(L^{k+1/2}, S^{k+1/2}) \big]$
(3.9)    $\phantom{S^{k+1}} = \mathcal{P}_{\|\cdot\|_1 \le \tau_S} \big[ S^{k+1/2} - \mathcal{P}_Q[L^{k+1/2} + S^{k+1/2} - M] \big] ;$

i.e. we simply take an additional projected gradient step in the sparse term $S$. The resulting algorithm is presented as Algorithm 4 below. We call this method the FW-P algorithm, as it combines Frank-Wolfe steps and projections. In Figure 1, we compare Algorithms 3 and 4 on synthetic data. In this example, the FW-P method is clearly more efficient in recovering $L_0$ and $S_0$.

Fig. 1. Comparisons between Algorithms 3 and 4 for problem (3.1) on synthetic data. The data are generated in Matlab as

m = 1000; n = 1000; r = 5;
L_0 = randn(m, r) * randn(r, n);
Omega = ones(m, n);
S_0 = 100 * randn(m, n) .* (rand(m, n) < 0.01);
M = L_0 + S_0 + randn(m, n);
tau_L = norm_nuc(L_0); tau_S = norm(vec(S_0), 1);

The left figure plots $\log_{10}( \| L^k - L_0 \|_F / \| L_0 \|_F )$ versus the iteration number $k$; the right figure plots $\log_{10}( \| S^k - S_0 \|_F / \| S_0 \|_F )$ versus $k$. The FW-P method is clearly more efficient than the straightforward FW method in recovering $L_0$ and $S_0$.

Algorithm 4 FW-P method for problem (3.1)
1: Initialization: $L^0 = S^0 = \mathbf{0}$;
2: for $k = 0, 1, 2, \dots$ do
3:     $D_L^k \in \arg\min_{\| D_L \|_* \le 1} \langle \mathcal{P}_Q[L^k + S^k - M], D_L \rangle$; $V_L^k = \tau_L D_L^k$;
4:     $D_S^k \in \arg\min_{\| D_S \|_1 \le 1} \langle \mathcal{P}_Q[L^k + S^k - M], D_S \rangle$; $V_S^k = \tau_S D_S^k$;
5:     $\gamma = \frac{2}{k+2}$;
6:     $L^{k+1/2} = L^k + \gamma (V_L^k - L^k)$;
7:     $S^{k+1/2} = S^k + \gamma (V_S^k - S^k)$;
8:     $S^{k+1} = \mathcal{P}_{\|\cdot\|_1 \le \tau_S} \big[ S^{k+1/2} - \mathcal{P}_Q[L^{k+1/2} + S^{k+1/2} - M] \big]$;
9:     $L^{k+1} = L^{k+1/2}$;
10: end for

The convergence of Algorithm 4 can be analyzed by recognizing it as a specific instance of the generalized Frank-Wolfe iteration in Algorithm 2. The projection step (3.9) can be regarded as a proximal step that sets $S^{k+1}$ as

$\arg\min_{\| S \|_1 \le \tau_S} \; \hat{l}_{k+1/2}(S) := l(L^{k+1/2}, S^{k+1/2}) + \big\langle \nabla_S l(L^{k+1/2}, S^{k+1/2}), \, S - S^{k+1/2} \big\rangle + \frac{1}{2} \| S - S^{k+1/2} \|_F^2 .$

It can then be easily verified that

(3.10)    $\hat{l}_{k+1/2}(S^{k+1/2}) = l(L^{k+1/2}, S^{k+1/2}) , \quad \text{and} \quad \hat{l}_{k+1/2}(S) \ge l(L^{k+1/2}, S) \ \text{for any } S ,$

since $\nabla_S l(L, S)$ is 1-Lipschitz in $S$. This implies that the FW-P algorithm chooses a next iterate whose objective is no worse than that produced by the Frank-Wolfe step:

$l(L^{k+1}, S^{k+1}) = l(L^{k+1/2}, S^{k+1}) \le \hat{l}_{k+1/2}(S^{k+1}) \le \hat{l}_{k+1/2}(S^{k+1/2}) = l(L^{k+1/2}, S^{k+1/2}) .$

This is precisely the property that is required to invoke Algorithm 2 and Theorems 2.1 and 2.2.
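To make the FW-P iteration concrete, here is a self-contained NumPy sketch on a small synthetic instance in the spirit of Figure 1, restricted to the fully observed, noiseless case ($\mathcal{P}_Q$ = identity). The sizes, the instance, and all identifiers are illustrative choices, not the paper's experimental setup; a full SVD stands in for the rank-one power method one would use at scale.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 60, 50, 2

# Small synthetic instance (illustrative; the paper's Figure 1 uses 1000x1000):
L0 = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
S0 = rng.standard_normal((m, n)) * (rng.random((m, n)) < 0.02)  # ~2% nonzero
M = L0 + S0                        # noiseless, fully observed: P_Q = identity
tau_L = np.linalg.norm(L0, 'nuc')  # oracle constraint levels, as in Figure 1
tau_S = np.abs(S0).sum()

def project_l1(Y, beta):
    """Euclidean projection onto {||X||_1 <= beta}, by sorting (cf. (2.21))."""
    if np.abs(Y).sum() <= beta:
        return Y.copy()
    a = np.sort(np.abs(Y).ravel())[::-1]
    cssv = np.cumsum(a)
    idx = np.arange(1, a.size + 1)
    rho = np.max(idx[a - (cssv - beta) / idx > 0])
    theta = (cssv[rho - 1] - beta) / rho
    return np.sign(Y) * np.maximum(np.abs(Y) - theta, 0.0)

L, S = np.zeros((m, n)), np.zeros((m, n))
for k in range(300):
    G = L + S - M                  # = grad_L l = grad_S l, eqs. (3.2)-(3.3)
    U, _, Vt = np.linalg.svd(G)    # only the leading pair is needed
    VL = -tau_L * np.outer(U[:, 0], Vt[0, :])                   # (3.6)
    i, j = np.unravel_index(np.argmax(np.abs(G)), G.shape)
    VS = np.zeros((m, n)); VS[i, j] = -tau_S * np.sign(G[i, j])  # (3.7)
    gamma = 2.0 / (k + 2)
    L = L + gamma * (VL - L)       # L^{k+1/2} = L^{k+1}
    S = S + gamma * (VS - S)       # S^{k+1/2}
    S = project_l1(S - (L + S - M), tau_S)   # projected gradient step (3.9)
```

The iterates remain feasible throughout: $L$ is a convex combination of matrices with nuclear norm at most $\tau_L$, and the final projection enforces $\| S \|_1 \le \tau_S$.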
Using Lemmas 3.1 and 3.2 to estimate the Lipschitz constant of $\nabla l$ and the diameter of $\mathcal{D}$, we obtain the following result, which shows that FW-P retains the $O(1/k)$ convergence rate of the original FW method:

Theorem 3.3. Let $l^\star$ be the optimal value of problem (3.1), and let $x^k = (L^k, S^k)$ and $v^k = (V_L^k, V_S^k)$ be the sequences produced by Algorithm 4. Then we have

(3.11)    $l(L^k, S^k) - l^\star \le \dfrac{16 ( \tau_L^2 + \tau_S^2 )}{k+2} .$

Moreover, for any $K \ge 1$, there exists $1 \le \tilde{k} \le K$ such that the surrogate duality gap (defined in (2.13)) satisfies

(3.12)    $d(x^{\tilde{k}}) = \big\langle x^{\tilde{k}} - v^{\tilde{k}}, \nabla l(x^{\tilde{k}}) \big\rangle \le \dfrac{48 ( \tau_L^2 + \tau_S^2 )}{K+2} .$

Proof. Substituting $L = 2$ (Lemma 3.1) and $D \le 2 \sqrt{\tau_L^2 + \tau_S^2}$ (Lemma 3.2) into Theorems 2.1 and 2.2, we easily obtain the above result.

4. FW-T Method for the Penalized Problem. In this section, we develop a scalable algorithm for the penalized version of the CPCP problem,

(4.1)    $\min_{L,S} \; f(L, S) \doteq \frac{1}{2} \| \mathcal{P}_Q[L + S - M] \|_F^2 + \lambda_L \| L \|_* + \lambda_S \| S \|_1 .$

In Section 4.1, we reformulate problem (4.1) into the form of (2.1) so that the Frank-Wolfe method can be applied. In Section 4.2, we apply the Frank-Wolfe method directly to the reformulated problem, achieving linear per-iteration cost and $O(1/k)$ convergence in function value. However, because it updates the sparse term one element at a time, it converges very slowly on typical numerical examples. In Section 4.3, we introduce our FW-T method, which resolves this issue. Our FW-T method essentially exploits the Frank-Wolfe step to handle the nuclear norm and a proximal gradient step to handle the $\ell^1$ norm, while keeping the iteration cost low and retaining convergence guarantees.

4.1. Reformulation as smooth, constrained optimization. Note that problem (4.1) has a non-differentiable objective function and an unbounded feasible set.
To apply the Frank-Wolfe method, we exploit a two-step reformulation to transform (4.1) into the form of (2.1). First, we borrow ideas from [24] and work with the epigraph reformulation of (4.1),

(4.2)    $\min \; g(L, S, t_L, t_S) \doteq \frac{1}{2} \| \mathcal{P}_Q[L + S - M] \|_F^2 + \lambda_L t_L + \lambda_S t_S \quad \text{s.t.} \quad \| L \|_* \le t_L, \; \| S \|_1 \le t_S ,$

obtained by introducing auxiliary variables $t_L$ and $t_S$. Now the objective function $g(L, S, t_L, t_S)$ is differentiable, with

(4.3)    $\nabla_L g(L, S, t_L, t_S) = \nabla_S g(L, S, t_L, t_S) = \mathcal{P}_Q[L + S - M] ,$
(4.4)    $\nabla_{t_L} g(L, S, t_L, t_S) = \lambda_L , \quad \nabla_{t_S} g(L, S, t_L, t_S) = \lambda_S .$

A calculation, which we summarize in the following lemma, shows that the gradient $\nabla g(L, S, t_L, t_S) = (\nabla_L g, \nabla_S g, \nabla_{t_L} g, \nabla_{t_S} g)$ is 2-Lipschitz:

Lemma 4.1. For all $(L, S, t_L, t_S)$ and $(L', S', t_L', t_S')$ feasible for (4.2),

(4.5)    $\| \nabla g(L, S, t_L, t_S) - \nabla g(L', S', t_L', t_S') \|_F \le 2 \| (L, S, t_L, t_S) - (L', S', t_L', t_S') \|_F .$

Proof. Based on (4.3) and (4.4), it follows directly that

$\| \nabla g(L, S, t_L, t_S) - \nabla g(L', S', t_L', t_S') \|_F^2 \le 4 \| L - L' \|_F^2 + 4 \| S - S' \|_F^2 \le 4 \| (L, S, t_L, t_S) - (L', S', t_L', t_S') \|_F^2 ,$

which implies the result.

However, the Frank-Wolfe method still cannot deal with (4.2), since its feasible region is unbounded. If we could somehow obtain upper bounds on the optimal values of $t_L$ and $t_S$, say $U_L \ge t_L^\star$ and $U_S \ge t_S^\star$, then we could solve the equivalent problem

(4.6)    $\min \; \frac{1}{2} \| \mathcal{P}_Q[L + S - M] \|_F^2 + \lambda_L t_L + \lambda_S t_S \quad \text{s.t.} \quad \| L \|_* \le t_L \le U_L, \; \| S \|_1 \le t_S \le U_S ,$

which now has a compact and convex feasible set. One simple way to obtain such $U_L, U_S$ is as follows. One trivial feasible solution to problem (4.2) is $L = \mathbf{0}$, $S = \mathbf{0}$, $t_L = 0$, $t_S = 0$. This solution has objective value $\frac{1}{2} \| \mathcal{P}_Q[M] \|_F^2$.
Hence, the optimal objective value is no larger than this. This implies that for any optimal $t_L^\star, t_S^\star$,

(4.7)    $t_L^\star \le \frac{1}{2 \lambda_L} \| \mathcal{P}_Q[M] \|_F^2 , \qquad t_S^\star \le \frac{1}{2 \lambda_S} \| \mathcal{P}_Q[M] \|_F^2 .$

Hence, we can always choose

(4.8)    $U_L = \frac{1}{2 \lambda_L} \| \mathcal{P}_Q[M] \|_F^2 , \qquad U_S = \frac{1}{2 \lambda_S} \| \mathcal{P}_Q[M] \|_F^2$

to produce a valid, bounded feasible region. The following lemma bounds its diameter $D$:

Lemma 4.2. The feasible set $\mathcal{D} = \{ (L, S, t_L, t_S) \mid \| L \|_* \le t_L \le U_L, \; \| S \|_1 \le t_S \le U_S \}$ has diameter $D \le \sqrt{5} \cdot \sqrt{U_L^2 + U_S^2}$.

Proof. For any $Z = (L, S, t_L, t_S)$, $Z' = (L', S', t_L', t_S') \in \mathcal{D}$, we have

$\| Z - Z' \|_F^2 = \| L - L' \|_F^2 + \| S - S' \|_F^2 + (t_L - t_L')^2 + (t_S - t_S')^2 \le ( \| L \|_F + \| L' \|_F )^2 + ( \| S \|_F + \| S' \|_F )^2 + (t_L - t_L')^2 + (t_S - t_S')^2 \le ( \| L \|_* + \| L' \|_* )^2 + ( \| S \|_1 + \| S' \|_1 )^2 + (t_L - t_L')^2 + (t_S - t_S')^2 \le ( U_L + U_L )^2 + ( U_S + U_S )^2 + U_L^2 + U_S^2 = 5 ( U_L^2 + U_S^2 ) ,$

which implies the result.

With these modifications, we can apply Frank-Wolfe directly to obtain a solution $(\hat{L}, \hat{S}, \hat{t}_L, \hat{t}_S)$ to (4.6), and hence to produce a solution $(\hat{L}, \hat{S})$ to the original problem (4.1). In subsection 4.2, we describe how to do this. Unfortunately, this straightforward solution has two main disadvantages. First, as in the norm constrained case, it produces only one-sparse updates to $S$, which results in slow convergence. Second, the exact primal convergence rate in Theorem 2.1 depends on the diameter of the feasible set, which in turn depends on the accuracy of our (crude) upper bounds $U_L$ and $U_S$. In subsection 4.3, we show how to remedy both issues, yielding a Frank-Wolfe-Thresholding method that performs significantly better in practice.

4.2. Frank-Wolfe for problem (4.6).
Applying the Frank-Wolfe method in Algorithm 1 generates a sequence of iterates x^k = (L^k, S^k, t^k_L, t^k_S). Using the expressions for the gradient in (4.3) and (4.4), at each iteration, v^k = (V^k_L, V^k_S, V^k_{t_L}, V^k_{t_S}) is generated by solving the linearized subproblem

(4.9)    v^k ∈ argmin_{v ∈ D} ⟨P_Q[L^k + S^k − M], V_L + V_S⟩ + λ_L V_{t_L} + λ_S V_{t_S},

which can be decoupled into two independent subproblems,

(4.10)   (V^k_L, V^k_{t_L}) ∈ argmin_{||V_L||_* ≤ V_{t_L} ≤ U_L} g_L(V_L, V_{t_L}) := ⟨P_Q[L^k + S^k − M], V_L⟩ + λ_L V_{t_L},
(4.11)   (V^k_S, V^k_{t_S}) ∈ argmin_{||V_S||_1 ≤ V_{t_S} ≤ U_S} g_S(V_S, V_{t_S}) := ⟨P_Q[L^k + S^k − M], V_S⟩ + λ_S V_{t_S}.

Let us consider problem (4.10) first. Set

(4.12)   D^k_L ∈ argmin_{||D_L||_* ≤ 1} ĝ_L(D_L) := ⟨P_Q[L^k + S^k − M], D_L⟩ + λ_L.

Because g_L(V_L, V_{t_L}) is a homogeneous function, i.e., g_L(αV_L, αV_{t_L}) = α g_L(V_L, V_{t_L}) for any α ∈ R, its optimal value satisfies g_L(V^k_L, V^k_{t_L}) = V^k_{t_L} ĝ_L(D^k_L). Hence V^k_{t_L} = U_L if ĝ_L(D^k_L) < 0, and V^k_{t_L} = 0 if ĝ_L(D^k_L) > 0. From this observation, it can be easily verified (see also [24, Lemma 1] for a more general result) that

(4.13)   (V^k_L, V^k_{t_L}) ∈  { (0, 0) }                       if ĝ_L(D^k_L) > 0,
                               conv{ (0, 0), U_L (D^k_L, 1) }    if ĝ_L(D^k_L) = 0,
                               { U_L (D^k_L, 1) }                if ĝ_L(D^k_L) < 0.

In a similar manner, we can update (V^k_S, V^k_{t_S}). This leads fairly directly to the implementation of the Frank-Wolfe method for problem (4.6), described in Algorithm 5.

Algorithm 5 Frank-Wolfe method for problem (4.6)
1: Initialization: L^0 = S^0 = 0; t^0_L = t^0_S = 0;
2: for k = 0, 1, 2, ... do
3:   D^k_L ∈ argmin_{||D_L||_* ≤ 1} ⟨P_Q[L^k + S^k − M], D_L⟩;
4:   D^k_S ∈ argmin_{||D_S||_1 ≤ 1} ⟨P_Q[L^k + S^k − M], D_S⟩;
5:   if λ_L ≥ −⟨P_Q[L^k + S^k − M], D^k_L⟩ then
6:     V^k_L = 0; V^k_{t_L} = 0;
7:   else
8:     V^k_L = U_L D^k_L; V^k_{t_L} = U_L;
9:   end if
10:  if λ_S ≥ −⟨P_Q[L^k + S^k − M], D^k_S⟩ then
11:    V^k_S = 0; V^k_{t_S} = 0;
12:  else
13:    V^k_S = U_S D^k_S; V^k_{t_S} = U_S;
14:  end if
15:  γ = 2/(k + 2);
16:  L^{k+1} = (1 − γ)L^k + γV^k_L; t^{k+1}_L = (1 − γ)t^k_L + γV^k_{t_L};
17:  S^{k+1} = (1 − γ)S^k + γV^k_S; t^{k+1}_S = (1 − γ)t^k_S + γV^k_{t_S};
18: end for

As a direct corollary of Theorem 2.1, using the parameters calculated in Lemmas 4.1 and 4.2, we have

Corollary 4.3. Let x* = (L*, S*, t*_L, t*_S) be an optimal solution to (4.6). For {x^k} generated by Algorithm 5, we have for k = 0, 1, 2, ...,

(4.14)   g(x^k) − g(x*) ≤ 20(U_L^2 + U_S^2)/(k + 2).

Proof. Applying Theorem 2.1 with the parameters calculated in Lemmas 4.1 and 4.2, we directly have

(4.15)   g(x^k) − g(x*) ≤ 2 · 2 · ( √(5(U_L^2 + U_S^2)) )^2 / (k + 2) = 20(U_L^2 + U_S^2)/(k + 2).

A more careful calculation below slightly improves the constant in (4.15):

(4.16)   g(x^{k+1}) = g(x^k + γ(v^k − x^k))
           ≤ g(x^k) + γ⟨∇g(x^k), v^k − x^k⟩ + γ^2 ||V^k_L − L^k||_F^2 + γ^2 ||V^k_S − S^k||_F^2
           ≤ g(x^k) + γ⟨∇g(x^k), v^k − x^k⟩ + 4γ^2 (U_L^2 + U_S^2),

where the second line holds by noting that g is linear in t_L and t_S, and the last line holds since

||V^k_L − L^k||_F^2 ≤ (||V^k_L||_F + ||L^k||_F)^2 ≤ (U_L + U_L)^2 = 4U_L^2,
||V^k_S − S^k||_F^2 ≤ (||V^k_S||_F + ||S^k||_F)^2 ≤ (U_S + U_S)^2 = 4U_S^2.

Following the arguments in the proof of Theorem 1 with (2.10) replaced by (4.16), we can easily obtain that

g(x^k) − g(x*) ≤ 16(U_L^2 + U_S^2)/(k + 2).

In addition to the above convergence result, another major advantage of Algorithm 5 is the simplicity of its update rules (lines 3-4 in Algorithm 5). Both have closed-form solutions that can be computed in time (essentially) linearly dependent on the size of the input.
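The two closed-form updates can be sketched as follows (a minimal numpy illustration, not the authors' MATLAB code): the minimizer of ⟨G, D⟩ over the nuclear-norm ball {||D||_* ≤ 1} is the rank-one matrix −u₁v₁ᵀ built from the top singular pair of G, and the minimizer over the ℓ₁ ball {||D||_1 ≤ 1} places −sign(G_ij) at a largest-magnitude entry.

```python
import numpy as np

def lmo_nuclear(G):
    """argmin_{||D||_* <= 1} <G, D>: the rank-one matrix -u1 v1^T from the
    top singular pair of G, with optimal value -sigma_1(G)."""
    # A full SVD is used here only for brevity; the algorithm needs just the
    # leading singular pair, so a rank-one partial SVD (e.g. a power method)
    # keeps the per-iteration cost essentially linear in the input size.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return -np.outer(U[:, 0], Vt[0, :])

def lmo_l1(G):
    """argmin_{||D||_1 <= 1} <G, D>: -sign(G_ij) at one entry (i, j) of
    largest magnitude, with optimal value -max_ij |G_ij|."""
    i, j = np.unravel_index(np.argmax(np.abs(G)), G.shape)
    D = np.zeros_like(G)
    D[i, j] = -np.sign(G[i, j])
    return D
```

In the notation of Algorithm 5, G = P_Q[L^k + S^k − M]; the first oracle costs one leading singular pair, the second a single pass over the entries.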
However, two clear limitations substantially hinder Algorithm 5's efficiency. First, as in the norm-constrained case, V^k_S has only one nonzero entry, so S receives a one-sparse update in each iteration. Second, the exact rate of convergence relies on our (crude) guesses of U_L and U_S (Corollary 4.3). In the next subsection, we present remedies for both issues.

4.3. FW-T algorithm: combining Frank-Wolfe and proximal methods. To alleviate the difficulties faced by Algorithm 5, we propose a new algorithm, called Frank-Wolfe-Thresholding (FW-T) (Algorithm 6), that combines a modified Frank-Wolfe step with a proximal gradient step. Below we highlight the key features of FW-T.

Proximal gradient step for S. To update S more efficiently, we incorporate an additional proximal gradient step for S. At iteration k, let (L^{k+1/2}, S^{k+1/2}) be the result produced by the Frank-Wolfe step. To produce the next iterate, we retain the low-rank term L^{k+1/2}, but execute a proximal gradient step for the function f(L^{k+1/2}, S) at the point S^{k+1/2}, i.e.,

(4.17)   S^{k+1} ∈ argmin_S ⟨∇_S f(L^{k+1/2}, S^{k+1/2}), S − S^{k+1/2}⟩ + (1/2)||S − S^{k+1/2}||_F^2 + λ_S ||S||_1
                 = argmin_S ⟨P_Q[L^{k+1/2} + S^{k+1/2} − M], S − S^{k+1/2}⟩ + (1/2)||S − S^{k+1/2}||_F^2 + λ_S ||S||_1,

which can be easily computed using the soft-thresholding operator:

(4.18)   S^{k+1} = T_{λ_S}[ S^{k+1/2} − P_Q[L^{k+1/2} + S^{k+1/2} − M] ].

Exact line search. For the Frank-Wolfe step, instead of choosing the fixed step length 2/(k+2), we implement an exact line search by solving a two-dimensional quadratic problem (4.20), as in [24]. This modification turns out to be crucial for achieving a primal convergence result that only weakly depends on the tightness of our guesses U_L and U_S.

Adaptive updates of U_L and U_S. We initialize U_L and U_S using the crude bound (4.8).
Then, at the end of the k-th iteration, we respectively update

(4.19)   U^{k+1}_L = g(L^{k+1}, S^{k+1}, t^{k+1}_L, t^{k+1}_S)/λ_L,  U^{k+1}_S = g(L^{k+1}, S^{k+1}, t^{k+1}_L, t^{k+1}_S)/λ_S.

This scheme maintains the property that U^{k+1}_L ≥ t*_L and U^{k+1}_S ≥ t*_S. Moreover, we prove (Lemma 4.4) that g is non-increasing through our algorithm, so this scheme produces a sequence of tighter upper bounds for t*_L and t*_S. Although this dynamic scheme does not improve the theoretical convergence result, it exhibits some empirical acceleration.

Algorithm 6 FW-T method for problem (4.1)
1: Input: data matrix M ∈ R^{m×n}; weights λ_L, λ_S > 0; maximum iteration number T;
2: Initialization: L^0 = S^0 = 0; t^0_L = t^0_S = 0; U^0_L = g(L^0, S^0, t^0_L, t^0_S)/λ_L; U^0_S = g(L^0, S^0, t^0_L, t^0_S)/λ_S;
3: for k = 0, 1, 2, ..., T do
4:   same as lines 3-14 in Algorithm 5;
5:   (L^{k+1/2}, S^{k+1/2}, t^{k+1/2}_L, t^{k+1/2}_S) is computed as an optimizer to

     (4.20)   min  (1/2)||P_Q[L + S − M]||_F^2 + λ_L t_L + λ_S t_S
              s.t. (L, t_L) ∈ conv{ (L^k, t^k_L), (V^k_L, V^k_{t_L}) },
                   (S, t_S) ∈ conv{ (S^k, t^k_S), (V^k_S, V^k_{t_S}) };

6:   S^{k+1} = T_{λ_S}[ S^{k+1/2} − P_Q[L^{k+1/2} + S^{k+1/2} − M] ];
7:   L^{k+1} = L^{k+1/2}; t^{k+1}_L = t^{k+1/2}_L; t^{k+1}_S = ||S^{k+1}||_1;
8:   U^{k+1}_L = g(L^{k+1}, S^{k+1}, t^{k+1}_L, t^{k+1}_S)/λ_L;
9:   U^{k+1}_S = g(L^{k+1}, S^{k+1}, t^{k+1}_L, t^{k+1}_S)/λ_S;
10: end for

Convergence analysis. Since neither the Frank-Wolfe step nor the proximal gradient step increases the objective value, we can easily recognize the FW-T method as a descent algorithm:

Lemma 4.4. Let {(L^k, S^k, t^k_L, t^k_S)} be the sequence of iterates produced by the FW-T algorithm. For each k = 0, 1, 2, ...,

(4.21)   g(L^{k+1}, S^{k+1}, t^{k+1}_L, t^{k+1}_S) ≤ g(L^{k+1/2}, S^{k+1/2}, t^{k+1/2}_L, t^{k+1/2}_S) ≤ g(L^k, S^k, t^k_L, t^k_S).

Proof. Since (L^k, S^k, t^k_L, t^k_S) is always feasible to the quadratic program (4.20),

(4.22)   g(L^{k+1/2}, S^{k+1/2}, t^{k+1/2}_L, t^{k+1/2}_S) ≤ g(L^k, S^k, t^k_L, t^k_S).

Based on (4.17), the thresholding step (line 6 in Algorithm 6) can be written as

S^{k+1} = argmin_S ĝ^{k+1/2}(S) := (1/2)||P_Q[L^{k+1/2} + S^{k+1/2} − M]||_F^2 + λ_L t^{k+1/2}_L + λ_S ||S||_1 + ⟨P_Q[L^{k+1/2} + S^{k+1/2} − M], S − S^{k+1/2}⟩ + (1/2)||S − S^{k+1/2}||_F^2.

The following properties of ĝ^{k+1/2}(·) can be easily verified:

ĝ^{k+1/2}(S^{k+1/2}) = g(L^{k+1/2}, S^{k+1/2}, t^{k+1/2}_L, ||S^{k+1/2}||_1) ≤ g(L^{k+1/2}, S^{k+1/2}, t^{k+1/2}_L, t^{k+1/2}_S);
ĝ^{k+1/2}(S) ≥ g(L^{k+1/2}, S, t^{k+1/2}_L, ||S||_1) for any S.

Therefore, we have

(4.23)   g(L^{k+1}, S^{k+1}, t^{k+1}_L, t^{k+1}_S) = g(L^{k+1/2}, S^{k+1}, t^{k+1/2}_L, t^{k+1}_S) ≤ ĝ^{k+1/2}(S^{k+1}) ≤ ĝ^{k+1/2}(S^{k+1/2}) ≤ g(L^{k+1/2}, S^{k+1/2}, t^{k+1/2}_L, t^{k+1/2}_S).

Combining (4.22) and (4.23), we obtain

g(L^{k+1}, S^{k+1}, t^{k+1}_L, t^{k+1}_S) ≤ g(L^{k+1/2}, S^{k+1/2}, t^{k+1/2}_L, t^{k+1/2}_S) ≤ g(L^k, S^k, t^k_L, t^k_S).

Moreover, we can establish primal convergence (almost) independent of U^0_L and U^0_S:

Theorem 4.5. Let r*_L and r*_S be the smallest radii such that

(4.24)   { (L, S) | f(L, S) ≤ g(L^0, S^0, t^0_L, t^0_S) = (1/2)||P_Q[M]||_F^2 } ⊆ B(r*_L) × B(r*_S),

where B(r) := { X ∈ R^{m×n} | ||X||_F ≤ r } for any r ≥ 0.‡ Then for the sequence {(L^k, S^k, t^k_L, t^k_S)} generated by Algorithm 6, we have

(4.25)   g(L^k, S^k, t^k_L, t^k_S) − g(L*, S*, t*_L, t*_S) ≤ min{ 4(t*_L + r*_L)^2 + 4(t*_S + r*_S)^2, 16(U^0_L)^2 + 16(U^0_S)^2 }/(k + 2).
‡ Since the objective function in problem (4.1) is coercive, i.e., lim_{k→+∞} f(L^k, S^k) = +∞ for any sequence (L^k, S^k) with lim_{k→+∞} ||(L^k, S^k)||_F = +∞, such r*_L ≥ 0 and r*_S ≥ 0 clearly exist.

Proof. For notational convenience, we denote x^k = (L^k, S^k, t^k_L, t^k_S), x* = (L*, S*, t*_L, t*_S) and v^k = (V^k_L, V^k_S, V^k_{t_L}, V^k_{t_S}). For any point x = (L, S, t_L, t_S) ∈ R^{m×n} × R^{m×n} × R × R, we adopt the notation L[x] = L, S[x] = S, t_L[x] = t_L and t_S[x] = t_S.

Since g(x^k) − g(x*) ≤ (16(U^0_L)^2 + 16(U^0_S)^2)/(k + 2) can be easily established following the proof of Corollary 4.3, below we focus on the other part, namely g(x^k) − g(x*) ≤ (4(t*_L + r*_L)^2 + 4(t*_S + r*_S)^2)/(k + 2).

Let us first make two simple observations. Since f(L*, S*) ≤ g(L^k, S^k, t^k_L, t^k_S), we have

(4.26)   U^k_L = g(L^k, S^k, t^k_L, t^k_S)/λ_L ≥ t*_L  and  U^k_S = g(L^k, S^k, t^k_L, t^k_S)/λ_S ≥ t*_S.

Therefore, U^k_L and U^k_S always bound t*_L and t*_S from above. From Lemma 4.4, g(L^k, S^k, t^k_L, t^k_S) is non-increasing, so

f(L^k, S^k) ≤ g(L^k, S^k, t^k_L, t^k_S) ≤ g(L^0, S^0, t^0_L, t^0_S),

which implies that (L^k, S^k) ∈ B(r*_L) × B(r*_S), i.e., ||L^k||_F ≤ r*_L and ||S^k||_F ≤ r*_S.

Let us now consider the k-th iteration. Similar to the proof in [24], we introduce the auxiliary point v^k_+ = ( (t*_L/U^k_L)V^k_L, (t*_S/U^k_S)V^k_S, (t*_L/U^k_L)V^k_{t_L}, (t*_S/U^k_S)V^k_{t_S} ). Then, based on our argument for (4.13), it can be easily verified that

(4.27)   (L[v^k_+], t_L[v^k_+]) ∈ argmin_{||V_L||_* ≤ V_{t_L} ≤ t*_L} g_L(V_L, V_{t_L}),
(4.28)   (S[v^k_+], t_S[v^k_+]) ∈ argmin_{||V_S||_1 ≤ V_{t_S} ≤ t*_S} g_S(V_S, V_{t_S}).

Recall γ = 2/(k + 2).
We have

g(x^{k+1/2}) ≤ g(x^k + γ(v^k_+ − x^k))
  ≤ g(x^k) + γ⟨∇g(x^k), v^k_+ − x^k⟩ + γ^2 ( ||L[v^k_+] − L[x^k]||_F^2 + ||S[v^k_+] − S[x^k]||_F^2 )
  ≤ g(x^k) + γ( g_L(L[v^k_+ − x^k], t_L[v^k_+ − x^k]) + g_S(S[v^k_+ − x^k], t_S[v^k_+ − x^k]) ) + γ^2 ( (t*_L + r*_L)^2 + (t*_S + r*_S)^2 )
  ≤ g(x^k) + γ( g_L(L[x* − x^k], t_L[x* − x^k]) + g_S(S[x* − x^k], t_S[x* − x^k]) ) + γ^2 ( (t*_L + r*_L)^2 + (t*_S + r*_S)^2 )
  = g(x^k) + γ⟨∇g(x^k), x* − x^k⟩ + γ^2 ( (t*_L + r*_L)^2 + (t*_S + r*_S)^2 )
  ≤ g(x^k) + γ( g(x*) − g(x^k) ) + γ^2 ( (t*_L + r*_L)^2 + (t*_S + r*_S)^2 ),

where the first inequality holds since x^k + γ(v^k_+ − x^k) is feasible to the quadratic program (4.20) while x^{k+1/2} minimizes it; the third inequality is due to the facts that

||L[v^k_+] − L[x^k]||_F ≤ ||L[v^k_+]||_F + ||L[x^k]||_F ≤ ||L[v^k_+]||_* + ||L[x^k]||_F ≤ t*_L + r*_L,
||S[v^k_+] − S[x^k]||_F ≤ ||S[v^k_+]||_F + ||S[x^k]||_F ≤ ||S[v^k_+]||_1 + ||S[x^k]||_F ≤ t*_S + r*_S;

and the fourth inequality holds since (L[x*], t_L[x*]) and (S[x*], t_S[x*]) are respectively feasible to (4.27) and (4.28), while (L[v^k_+], t_L[v^k_+]) and (S[v^k_+], t_S[v^k_+]) respectively minimize them. Therefore, we obtain

g(x^{k+1/2}) − g(x*) ≤ (1 − γ)( g(x^k) − g(x*) ) + γ^2 ( (t*_L + r*_L)^2 + (t*_S + r*_S)^2 ).

Moreover, by Lemma 4.4, we have g(x^{k+1}) ≤ g(x^{k+1/2}). Thus, we obtain the recurrence

g(x^{k+1}) − g(x*) ≤ (1 − γ)( g(x^k) − g(x*) ) + γ^2 ( (t*_L + r*_L)^2 + (t*_S + r*_S)^2 ).

By mathematical induction, one easily obtains

g(L^k, S^k, t^k_L, t^k_S) − g(L*, S*, t*_L, t*_S) ≤ 4( (t*_L + r*_L)^2 + (t*_S + r*_S)^2 )/(k + 2).

Since U^0_L and U^0_S are quite crude upper bounds for t*_L and t*_S, 16(U^0_L)^2 + 16(U^0_S)^2 could be much larger than 4(t*_L + r*_L)^2 + 4(t*_S + r*_S)^2. Therefore, this primal convergence result depends on U^0_L and U^0_S only in a very weak manner. However, the convergence result for the surrogate duality gap d(x^k) still hinges upon the upper bounds:

Theorem 4.6. Let x^k denote (L^k, S^k, t^k_L, t^k_S) generated by Algorithm 6. Then for any K ≥ 1, there exists 1 ≤ k̃ ≤ K such that

(4.29)   g(x^{k̃}) − g(x*) ≤ d(x^{k̃}) ≤ 48( (U^0_L)^2 + (U^0_S)^2 )/(K + 2).

Proof. Define Δ^k = g(x^k) − g(x*). Following (4.16), we have

(4.30)   Δ^{k+1} ≤ Δ^k + γ⟨∇g(x^k), v^k − x^k⟩ + 4γ^2 ( (U^0_L)^2 + (U^0_S)^2 ).

Then, following the arguments in the proof of Theorem 2 with (2.17) replaced by (4.30), we easily obtain the result.

Stopping criterion. Compared to the convergence of g(x^k) (Theorem 4.5), the convergence of d(x^k) can be much slower (Theorem 4.6). Therefore, the surrogate duality gap d(·) is not well suited to serve as a stopping criterion here. Consequently, in our implementation, we terminate Algorithm 6 when

(4.31)   |g(x^{k+1}) − g(x^k)| / g(x^k) ≤ ε

holds for five consecutive iterations.

5. Numerical Experiments. In this section, we report numerical results obtained by applying our FW-T method (Algorithm 6) to problem (1.5) with real data arising from applications considered in [3]: foreground/background separation in surveillance videos, and shadow and specularity removal from face images.

Given observations { M_0(i, j) | (i, j) ∈ Ω }, where Ω ⊆ {1, ..., m} × {1, ..., n} is the index set of the observable entries in M_0 ∈ R^{m×n}, we assigned the weights λ_L = δρ||P_Ω[M_0]||_F and λ_S = δ√ρ ||P_Ω[M_0]||_F / √(max(m, n)) to problem (1.5),§ where ρ = |Ω|/(mn) and δ is chosen as 0.001 for the surveillance problem and 0.01 for the face problem.

We compared our FW-T method with the popular first-order methods iterative soft-thresholding algorithm (ISTA) and fast iterative soft-thresholding algorithm (FISTA) [20], both of whose implementations used partial singular value decomposition (SVD). In subsection 5.1, we provide detailed descriptions of our implementations of ISTA and FISTA. We set ε = 10^{-3} in FW-T's stopping criterion (4.31),¶ and terminated ISTA and FISTA whenever they reached the objective value returned by the FW-T method.‖ All the experiments were conducted on a computer with an Intel Xeon E5-2630 processor (12 cores at 2.4 GHz) and 64GB RAM, running MATLAB R2012b (64-bit).

5.1. ISTA & FISTA for problem (1.5). The iterative soft-thresholding algorithm (ISTA) is an efficient way to tackle unconstrained nonsmooth optimization problems, especially at large scale. ISTA follows the general idea of iteratively minimizing an upper bound of the original objective. In particular, when applied to problem (1.5), ISTA updates (L, S) in the k-th iteration by solving

(5.1)   (L^{k+1}, S^{k+1}) = argmin_{L,S} ⟨∇_L l(L^k, S^k), L − L^k⟩ + ⟨∇_S l(L^k, S^k), S − S^k⟩ + (L_f/2)||(L, S) − (L^k, S^k)||_F^2 + λ_L ||L||_* + λ_S ||S||_1.

Here L_f = 2 denotes the Lipschitz constant of ∇l(L, S) with respect to (L, S), and ∇_L l(L^k, S^k) = ∇_S l(L^k, S^k) = P_Ω[L^k + S^k − M]. Since L and S are decoupled in (5.1), we equivalently have

(5.2)   L^{k+1} = argmin_L ||L − ( L^k − (1/2)P_Ω[L^k + S^k − M] )||_F^2 + λ_L ||L||_*,
(5.3)   S^{k+1} = argmin_S ||S − ( S^k − (1/2)P_Ω[L^k + S^k − M] )||_F^2 + λ_S ||S||_1.

The solution to problem (5.3) can be given explicitly in terms of the proximal mapping of ||·||_1 introduced in Section 2.2, i.e.,

S^{k+1} = T_{λ_S/2}[ S^k − (1/2)P_Ω[L^k + S^k − M] ].
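The soft-thresholding update above can be sketched in a few lines (a minimal numpy illustration, not the authors' MATLAB code, under the assumption that P_Ω acts as an entrywise 0/1 mask; T_τ shrinks each entry toward zero by τ):

```python
import numpy as np

def soft_threshold(X, tau):
    """Proximal mapping of tau*||.||_1: shrink each entry of X toward zero by tau."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def ista_sparse_step(L, S, M, mask, lam_S):
    """One ISTA update of the sparse term:
    S^{k+1} = T_{lam_S/2}[ S^k - (1/2) P_Omega[L^k + S^k - M] ]."""
    residual = mask * (L + S - M)  # P_Omega[L + S - M], with mask in {0,1}^{m x n}
    return soft_threshold(S - 0.5 * residual, lam_S / 2.0)
```

The same operator, applied to singular values instead of entries, yields the singular value thresholding step for the low-rank term described next.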
§ The ratio λ_L/λ_S = √(ρ max(m, n)) follows the suggestion in [3]. For applications in computer vision at least, our choices of λ_L and λ_S seem to be quite robust, although it is possible to improve performance by making slight adjustments to our current settings of λ_L and λ_S.
¶ As discussed in [33, 34], with noisy data, solving optimization problems to high accuracy does not necessarily improve the recovery quality. Consequently, we set ε to a modest value.
‖ All codes are available at: https://sites.google.com/site/mucun1988/publi

For a matrix X and any τ ≥ 0, let D_τ(X) denote the singular value thresholding operator D_τ(X) = U T_τ(Σ) V^T, where X = UΣV^T is the singular value decomposition of X. It is not difficult to show [35, 36] that the solution to problem (5.2) is given explicitly by

L^{k+1} = D_{λ_L/2}[ L^k − (1/2)P_Ω[L^k + S^k − M] ].

Algorithm 7 summarizes our ISTA implementation for problem (1.5).

Algorithm 7 ISTA for problem (1.5)
1: Initialization: L^0 = 0, S^0 = 0;
2: for k = 0, 1, 2, ... do
3:   L^{k+1} = D_{λ_L/2}[ L^k − (1/2)P_Ω[L^k + S^k − M] ];
4:   S^{k+1} = T_{λ_S/2}[ S^k − (1/2)P_Ω[L^k + S^k − M] ];
5: end for

Regarding ISTA's speed of convergence, it can be proved that f(L^k, S^k) − f* = O(1/k), where f* denotes the optimal value of problem (1.5).

The fast iterative soft-thresholding algorithm (FISTA), introduced in [20], is an accelerated version of ISTA that incorporates a momentum step borrowed from Nesterov's optimal gradient scheme [37]. For FISTA, a better convergence result, f(L^k, S^k) − f* = O(1/k^2), can be achieved with a per-iteration cost comparable to ISTA's. Algorithm 8 summarizes our FISTA implementation for problem (1.5).
Algorithm 8 FISTA for problem (1.5)
1: Initialization: L̂^0 = L^0 = 0, Ŝ^0 = S^0 = 0, t^0 = 1;
2: for k = 0, 1, 2, ... do
3:   L^{k+1} = D_{λ_L/2}[ L̂^k − (1/2)P_Ω[L̂^k + Ŝ^k − M] ];
4:   S^{k+1} = T_{λ_S/2}[ Ŝ^k − (1/2)P_Ω[L̂^k + Ŝ^k − M] ];
5:   t^{k+1} = ( 1 + √(1 + 4(t^k)^2) )/2;
6:   L̂^{k+1} = L^{k+1} + ((t^k − 1)/t^{k+1})(L^{k+1} − L^k);
7:   Ŝ^{k+1} = S^{k+1} + ((t^k − 1)/t^{k+1})(S^{k+1} − S^k);
8: end for

Partial SVD. In each iteration of either ISTA or FISTA, we only need the singular values that are larger than λ_L/2 and their corresponding singular vectors. Therefore, a partial SVD can be utilized to reduce the computational burden of a full SVD. Since most partial SVD software packages (e.g., PROPACK [38]) require specifying in advance the number of top singular values and singular vectors to compute, we determine this number (denoted sv_k at iteration k) heuristically. Specifically, let d = min{m, n}, and let svp_k denote the number of computed singular values larger than λ_L/2 in the k-th iteration. Similar to [17], in our implementation we start with sv_0 = d/10 and adjust sv_k dynamically as follows:

sv_{k+1} = min{ svp_k + 1, d }               if svp_k < sv_k,
           min{ svp_k + round(0.05 d), d }   otherwise.

[Figure 2: two panels (Airport, Square); x-axis: number of frames (×1000); y-axis: per-iteration cost (s); curves: FW-T, ISTA, FISTA.]
Fig. 2. Per-iteration cost vs. the number of frames in the Airport and Square videos with full observation. The per-iteration cost of our FW-T method grows linearly with the size of the data, in contrast with the superlinear per-iteration cost of ISTA and FISTA. This makes the FW-T method more advantageous, or even the only feasible choice, for large problems.

5.2. Foreground-background separation in surveillance video. In surveillance videos, due to the strong correlation between frames, it is natural to model the background as low-rank, while foreground objects, such as cars or pedestrians, which normally occupy only a fraction of the video, can be treated as sparse. So, if we stack each frame as a column of the data matrix M_0, it is reasonable to assume that M_0 ≈ L_0 + S_0, where L_0 captures the background and S_0 represents the foreground movements. Here, we solved problem (1.5) for the videos introduced in [39] and [40]. The observed entries were sampled uniformly with ratio ρ chosen respectively as 1, 0.8 and 0.6.

Table 1 summarizes the numerical performance of FW-T, ISTA and FISTA in terms of the number of iterations and running time (in seconds). As can be observed, our FW-T method is more efficient than ISTA and FISTA, and the advantage becomes more prominent as the size of the data grows and the observations become more compressed (i.e., smaller sampling ratio ρ). Even though the FW-T method took more iterations than FISTA, and in many cases more than ISTA, it took less time in all cases but one due to its low per-iteration cost. To illustrate this more clearly, in Figure 2 we plot the per-iteration cost of the three methods on the Airport and Square videos as a function of the number of frames. The computational cost of FW-T scales linearly with the size of the data, whereas the cost of the other methods increases superlinearly. Another observation is that, as the number of measurements decreases, the iteration counts of both ISTA and FISTA grow substantially, while those of the FW-T method remain quite stable. This explains the more favorable behavior of the FW-T method when ρ is small.
In Figure 3, frames of the original videos, together with the backgrounds and foregrounds produced by the FW-T method, are presented; the separation achieved is quite satisfactory.

Table 1
Comparisons of FW-T, ISTA and FISTA on surveillance video data. The advantage of our FW-T method becomes prominent when the data are large-scale and compressed (i.e., the small-ρ scenarios).

                          FW-T             ISTA             FISTA
Data            ρ     iter.  time      iter.  time      iter.  time
Lobby           1.0    96   1.94e+02   144   3.64e+02    41   1.60e+02
(20480×1000)    0.8   104   2.33e+02   216   1.03e+03    52   3.55e+02
                0.6   133   3.12e+02   380   1.67e+03    74   5.10e+02
Campus          1.0    45   1.56e+02    78   1.49e+03    23   4.63e+02
(20480×1439)    0.8    44   1.57e+02   122   2.34e+03    30   6.45e+02
                0.6    41   1.39e+02   218   4.27e+03    43   1.08e+03
Escalator       1.0    81   7.40e+02    58   4.19e+03    25   2.18e+03
(20800×3417)    0.8    80   7.35e+02    90   8.18e+03    32   3.46e+03
                0.6    82   7.68e+02   162   1.83e+04    43   5.73e+03
Mall            1.0    38   4.70e+02   110   5.03e+03    35   1.73e+03
(81920×1286)    0.8    35   4.58e+02   171   7.32e+03    44   2.34e+03
                0.6    44   5.09e+02   308   1.31e+04    62   3.42e+03
Restaurant      1.0    70   5.44e+02    52   3.01e+03    20   1.63e+03
(19200×3055)    0.8    74   5.51e+02    81   4.84e+03    26   1.82e+03
                0.6    76   5.73e+02   144   9.93e+03    38   3.31e+03
Hall            1.0    60   6.33e+02    52   2.98e+03    21   1.39e+03
(25344×3584)    0.8    62   6.52e+02    81   6.45e+03    28   2.90e+03
                0.6    70   7.43e+02   144   1.42e+04    39   4.94e+03
Airport         1.0   130   6.42e+03    29   2.37e+04    14   1.37e+04
(25344×15730)   0.8   136   6.65e+03    45   6.92e+04    18   4.27e+04
                0.6   154   7.72e+03    77   1.78e+05    24   7.32e+04
Square          1.0   179   1.24e+04    29   3.15e+04    13   1.51e+04
(19200×28181)   0.8   181   1.26e+04    44   1.04e+05    17   6.03e+04
                0.6   191   1.31e+04    78   2.63e+05    22   9.88e+05

Fig. 3. Surveillance videos. The videos from top to bottom are respectively Lobby, Campus, Escalator, Mall, Restaurant, Hall, Airport and Square. The left panel presents videos with full observation (ρ = 1) and the right one presents videos with partial observation (ρ = 0.6). Visually, the low-rank component successfully recovers the background and the sparse one captures the moving objects (e.g., vehicles, pedestrians) in the foreground.

5.3. Shadow and specularity removal from face images. Images taken under varying illumination can also be modeled as the superposition of low-rank and sparse components. Here, the data matrix M_0 is again formed by stacking each image as a column. The low-rank term L_0 captures the smooth variations [41], while the sparse term S_0 represents cast shadows and specularities [42, 8]. CPCP can be used to remove the shadows and specularities [3, 8]. Here, we solved problem (1.4) for YaleB face images [43].

Table 2 summarizes the numerical performance of FW-T, ISTA and FISTA. Similar to the observation made for the surveillance video experiment above, the number of iterations required by ISTA and FISTA grows much faster than for the FW-T method as ρ decreases. However, unlike in those tests, where each dataset contained at least several thousand frames, the number of frames here is just 65. This prevents the FW-T method from benefiting significantly from its linear per-iteration cost; consequently, while FW-T still outperforms ISTA for values of ρ ≤ 0.7, the FISTA method is always the fastest.

Table 2
Comparisons of FW-T, ISTA and FISTA on YaleB face data. The number of frames, 65, is relatively small for this application, which prevents the FW-T method from benefiting significantly from its linear per-iteration cost; consequently, the FISTA method consistently performs better.

                          FW-T          ISTA          FISTA
Data            ρ     iter.  time   iter.  time   iter.  time
YaleB01         1.0    65   34.0     49   21.4     17   8.69
(32256×65)      0.9    68   35.6     59   23.9     19   8.62
                0.8    79   42.2     76   35.3     22   10.9
                0.7    76   39.9     97   44.0     25   11.1
                0.6    71   37.5    127   50.2     29   12.9
                0.5    80   40.5    182   77.9     35   15.2
YaleB02         1.0    64   34.6     51   19.2     18   7.31
(32256×65)      0.9    64   26.8     61   22.6     20   7.32
                0.8    71   33.9     78   27.7     22   8.61
                0.7    71   31.3     99   36.6     26   11.0
                0.6    73   36.6    132   53.7     30   12.4
                0.5    63   28.0    177   64.6     35   13.4
YaleB03         1.0    62   26.0     49   16.6     18   6.00
(32256×65)      0.9    71   27.5     62   20.3     20   6.43
                0.8    69   30.0     78   26.0     22   8.32
                0.7    78   31.5    101   32.9     26   9.00
                0.6    73   28.7    132   40.4     30   10.6
                0.5    70   28.0    181   60.3     36   12.8
YaleB04         1.0    63   28.5     47   16.6     17   6.35
(32256×65)      0.9    67   28.7     58   23.1     19   7.98
                0.8    68   31.7     72   26.3     23   9.39
                0.7    69   30.7     92   35.9     26   9.84
                0.6    71   29.4    124   40.0     29   10.1
                0.5    74   29.4    174   67.3     36   14.3

In Figure 4, the original images and the low-rank and sparse parts produced by the FW-T method are presented. Visually, the recovered low-rank component is smoother and better conditioned for face recognition than the original image, while the sparse component corresponds to shadows and specularities.

Fig. 4. Face images. The pictures from top to bottom are respectively YaleB01, YaleB02, YaleB03 and YaleB04 face images. The left panel presents the case with full observation (ρ = 1), while the right panel presents the case with partial observation (ρ = 0.6). Visually, the recovered low-rank component is smoother and better conditioned for face recognition than the original image, while the sparse component corresponds to shadows and specularities.

6. Discussion. In this paper, we have proposed scalable algorithms, called Frank-Wolfe-Projection (FW-P) and Frank-Wolfe-Thresholding (FW-T), for the norm-constrained and penalized versions of CPCP. Essentially, these methods combine classical ideas from Frank-Wolfe and proximal methods to achieve linear per-iteration cost, O(1/k) convergence in function value, and practical efficiency in updating the sparse component. Extensive numerical experiments were conducted on computer-vision-related applications of CPCP, which demonstrated the great potential of our methods for dealing with problems of very large scale. Moreover, the general idea of leveraging different methods to deal with different functions may be valuable for other demixing problems.

We are also aware that although our algorithms are extremely efficient in the early iterations and quickly arrive at an approximate solution of practical significance, they become less competitive when solutions of very high accuracy are required, due to the nature of Frank-Wolfe. This suggests that further hybridization under our framework (e.g., using nonconvex approaches to handle the nuclear norm) might be worthwhile in certain applications (see [44] for research in that direction).

Acknowledgements. We are grateful to the associate editor Chen Greif and three anonymous reviewers for their helpful suggestions and comments, which substantially improved the paper.

REFERENCES

[1] J. Wright, A. Ganesh, K. Min, and Y. Ma, "Compressive principal component pursuit," Information and Inference, vol. 2, no. 1, pp. 32–68, 2013.
[2] V. Chandrasekaran, S. Sanghavi, P. Parrilo, and A. Willsky, "Rank-sparsity incoherence for matrix decomposition," SIAM Journal on Optimization, vol. 21, no. 2, pp. 572–596, 2011.
[3] E. Candès, X. Li, Y. Ma, and J. Wright, "Robust principal component analysis?," Journal of the ACM (JACM), vol. 58, no. 3, pp. 11:1–11:37, 2011.
[4] Z. Zhou, X. Li, J. Wright, E. Candès, and Y. Ma, "Stable principal component pursuit," in ISIT, 2010.
[5] D. Hsu, S. Kakade, and T. Zhang, "Robust matrix decomposition with sparse corruptions," IEEE Transactions on Information Theory, vol. 57, no. 11, pp.
7221–7234, 2011.
[6] A. Agarwal, S. Negahban, and M. Wainwright, "Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions," The Annals of Statistics, vol. 40, no. 2, pp. 1171–1197, 2012.
[7] Y. Peng, A. Ganesh, J. Wright, W. Xu, and Y. Ma, "RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 11, pp. 2233–2246, 2012.
[8] Y. Zhang, C. Mu, H. Kuo, and J. Wright, "Toward guaranteed illumination models for non-convex objects," in ICCV, 2013.
[9] L. Wu, A. Ganesh, B. Shi, Y. Matsushita, Y. Wang, and Y. Ma, "Robust photometric stereo via low-rank matrix completion and recovery," in ACCV, 2011.
[10] R. Otazo, E. Candès, and D. K. Sodickson, "Low-rank plus sparse matrix decomposition for accelerated dynamic MRI with separation of background and dynamic components," Magnetic Resonance in Medicine, 2014.
[11] K. Min, Z. Zhang, J. Wright, and Y. Ma, "Decomposing background topics from keywords by principal component pursuit," in CIKM, 2010.
[12] V. Chandrasekaran, P. Parrilo, and A. Willsky, "Latent variable graphical model selection via convex optimization," Annals of Statistics, vol. 40, no. 4, pp. 1935–1967, 2012.
[13] Z. Lin, A. Ganesh, J. Wright, L. Wu, M. Chen, and Y. Ma, "Fast convex optimization algorithms for exact recovery of a corrupted low-rank matrix," in CAMSAP, 2009.
[14] Z. Lin, M. Chen, and Y. Ma, "The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices," arXiv preprint arXiv:1009.5055, 2010.
[15] X. Yuan and J. Yang, "Sparse and low-rank matrix decomposition via alternating direction methods," preprint, 2009.
[16] N. S. Aybat, D. Goldfarb, and G. Iyengar, "Fast first-order methods for stable principal component pursuit," arXiv preprint arXiv:1105.2126, 2011.
[17] M. Tao and X. Yuan, "Recovering low-rank and sparse components of matrices from incomplete and noisy observations," SIAM Journal on Optimization, vol. 21, no. 1, pp. 57–81, 2011.
[18] N. S. Aybat, D. Goldfarb, and S. Ma, "Efficient algorithms for robust and stable principal component pursuit problems," Computational Optimization and Applications, pp. 1–29, 2012.
[19] R. Tütüncü, K. Toh, and M. Todd, "Solving semidefinite-quadratic-linear programs using SDPT3," Mathematical Programming, vol. 95, no. 2, pp. 189–217, 2003.
[20] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.
[21] M. Frank and P. Wolfe, "An algorithm for quadratic programming," Naval Research Logistics Quarterly, vol. 3, no. 1-2, pp. 95–110, 1956.
[22] E. Levitin and B. Polyak, "Constrained minimization methods," USSR Computational Mathematics and Mathematical Physics, vol. 6, no. 5, pp. 1–50, 1966.
[23] M. Jaggi and M. Sulovský, "A simple algorithm for nuclear norm regularized problems," in ICML, 2010.
[24] Z. Harchaoui, A. Juditsky, and A. Nemirovski, "Conditional gradient algorithms for norm-regularized smooth convex optimization," Mathematical Programming, pp. 1–38, 2014.
[25] M. Jaggi, "Revisiting Frank-Wolfe: Projection-free sparse convex optimization," in ICML, 2013.
[26] V. F. Demianov and A. M. Rubinov, Approximate Methods in Optimization Problems. Modern Analytic and Computational Methods in Science and Mathematics, American Elsevier Pub. Co., 1970.
[27] J. C. Dunn and S. Harshbarger, "Conditional gradient algorithms with open loop step size rules," Journal of Mathematical Analysis and Applications, vol. 62, no. 2, pp. 432–444, 1978.
[28] M. Patriksson, "Partial linearization methods in nonlinear programming," Journal of Optimization Theory and Applications, vol. 78, no. 2, pp. 227–246, 1993.
[29] T. Zhang, "Sequential greedy approximation for certain convex optimization problems," IEEE Transactions on Information Theory, vol. 49, no. 3, pp. 682–691, 2003.
[30] K. Clarkson, "Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm," ACM Transactions on Algorithms, vol. 6, no. 4, pp. 63:1–63:30, 2010.
[31] R. M. Freund and P. Grigas, "New analysis and results for the Frank-Wolfe method," Mathematical Programming, vol. 155, no. 1-2, pp. 199–230, 2016.
[32] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra, "Efficient projections onto the ℓ1-ball for learning in high dimensions," in ICML, 2008.
[33] J. Yang and Y. Zhang, "Alternating direction algorithms for ℓ1-problems in compressive sensing," SIAM Journal on Scientific Computing, vol. 33, no. 1, pp. 250–278, 2011.
[34] J. Yang and X. Yuan, "Linearized augmented Lagrangian and alternating direction methods for nuclear norm minimization," Mathematics of Computation, vol. 82, no. 281, pp. 301–329, 2013.
[35] J. Cai, E. Candès, and Z. Shen, "A singular value thresholding algorithm for matrix completion," SIAM Journal on Optimization, vol. 20, no. 4, pp. 1956–1982, 2010.
[36] S. Ma, D. Goldfarb, and L. Chen, "Fixed point and Bregman iterative methods for matrix rank minimization," Mathematical Programming, vol. 128, no. 1-2, pp. 321–353, 2011.
[37] Y. Nesterov, "A method of solving a convex programming problem with convergence rate O(1/k^2)," Soviet Mathematics Doklady, vol. 27, pp. 372–376, 1983.
[38] R. M. Larsen, "PROPACK: software for large and sparse SVD calculations," available online: http://sun.stanford.edu/rmunk/PROPACK, 2004.
[39] L. Li, W. Huang, I. Y. Gu, and Q. Tian, "Statistical modeling of complex backgrounds for foreground object detection," IEEE Transactions on Image Processing, vol. 13, no. 11, pp. 1459–1472, 2004.
[40] N. Jacobs, N. Roman, and R.
Pless, “Consistent temp oral v ariations in many outdo or scenes,” in CVPR , 2007. [41] R. Basri and D. Jacobs, “Lambertian reflectance and linear subspaces,” IEEE T r ansactions on Pattern Analysis and Machine Intel ligenc e , vol. 25, no. 2, pp. 218–233, 2003. [42] J. W righ t, A. Y ang, A. Ganesh, S. Sastry , and Y. Ma, “Robust face recognition via sparse representation,” IEEE T r ansactions on Pattern Analysis and Machine Intel ligenc e , v ol. 31, no. 2, pp. 210–227, 2009. [43] A. Georghiades, P . Belhumeur, and D. Kriegman, “F rom few to many: Illumination cone mo dels for face recognition under v ariable lighting and p ose,” IEEE T r ansactions on Pattern Analysis and Machine Intel ligenc e , vol. 23, no. 6, pp. 643–660, 2001. [44] S. Laue, “A h ybrid algorithm for con vex semi definite optimization,” in ICML , 2012.