Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions


Authors: Alekh Agarwal, Sahand Negahban, Martin J. Wainwright

Alekh Agarwal†, Sahand Negahban†, Martin J. Wainwright†,⋆
alekh@eecs.berkeley.edu, sahand@eecs.berkeley.edu, wainwrig@stat.berkeley.edu
†Department of EECS, UC Berkeley; ⋆Department of Statistics, UC Berkeley

February 2012 (revision of February 2011 version)
Technical Report, Department of Statistics, UC Berkeley

Abstract

We analyze a class of estimators based on convex relaxation for solving high-dimensional matrix decomposition problems. The observations are noisy realizations of a linear transformation $\mathfrak{X}$ of the sum of an (approximately) low rank matrix $\Theta^\star$ with a second matrix $\Gamma^\star$ endowed with a complementary form of low-dimensional structure; this set-up includes many statistical models of interest, including forms of factor analysis, multi-task regression with shared structure, and robust covariance estimation. We derive a general theorem that gives upper bounds on the Frobenius norm error for an estimate of the pair $(\Theta^\star, \Gamma^\star)$ obtained by solving a convex optimization problem that combines the nuclear norm with a general decomposable regularizer. Our results are based on imposing a "spikiness" condition that is related to but milder than singular vector incoherence. We specialize our general result to two cases that have been studied in past work: low rank plus an entrywise sparse matrix, and low rank plus a columnwise sparse matrix. For both models, our theory yields non-asymptotic Frobenius error bounds for both deterministic and stochastic noise matrices, and applies to matrices $\Theta^\star$ that can be exactly or approximately low rank, and matrices $\Gamma^\star$ that can be exactly or approximately sparse. Moreover, for the case of stochastic noise matrices and the identity observation operator, we establish matching lower bounds on the minimax error, showing that our results cannot be improved beyond constant factors. The sharpness of our theoretical predictions is confirmed by numerical simulations.

1 Introduction

The focus of this paper is a class of high-dimensional matrix decomposition problems of the following variety. Suppose that we observe a matrix $Y \in \mathbb{R}^{d_1 \times d_2}$ that is (approximately) equal to the sum of two unknown matrices: how can one recover good estimates of the pair? Of course, this problem is ill-posed in general, so it is necessary to impose some kind of low-dimensional structure on the matrix components, one example being rank constraints. The framework of this paper supposes that one matrix component (denoted $\Theta^\star$) is low-rank, either exactly or in an approximate sense, and allows for general forms of low-dimensional structure for the second component $\Gamma^\star$. Two particular cases of structure for $\Gamma^\star$ that have been considered in past work are elementwise sparsity [9, 8, 7] and columnwise sparsity [18, 29].

Problems of matrix decomposition are motivated by a variety of applications. Many classical methods for dimensionality reduction, among them factor analysis and principal components analysis (PCA), are based on estimating a low-rank matrix from data. Different forms of robust PCA can be formulated in terms of matrix decomposition, using the matrix $\Gamma^\star$ to model the gross errors [9, 7, 29].
Similarly, certain problems of robust covariance estimation can be described using matrix decompositions with a column/row-sparse structure, as we describe in this paper. The problem of low rank plus sparse matrix decomposition also arises in Gaussian covariance selection with hidden variables [8], in which case the inverse covariance of the observed vector can be decomposed as the sum of a sparse matrix with a low rank matrix. Matrix decompositions also arise in multi-task regression [32, 21, 27], which involves solving a collection of regression problems, referred to as tasks, over a common set of features. For some features, one expects their weighting to be preserved across tasks, which can be modeled by a low-rank constraint, whereas other features are expected to vary across tasks, which can be modeled by a sparse component [5, 2]. See Section 2.1 for further discussion of these motivating applications.

In this paper, we study a noisy linear observation model that can be used to describe a number of applications in a unified way. Let $\mathfrak{X}$ be a linear operator that maps matrices in $\mathbb{R}^{d_1 \times d_2}$ to matrices in $\mathbb{R}^{n_1 \times n_2}$. In the simplest of cases, this observation operator is simply the identity mapping, so that we necessarily have $n_1 = d_1$ and $n_2 = d_2$. However, as we discuss in the sequel, it is useful for certain applications, such as multi-task regression, to consider more general linear operators of this form. Hence, we study the problem of matrix decomposition for the general linear observation model

$$Y = \mathfrak{X}(\Theta^\star + \Gamma^\star) + W, \tag{1}$$

where $\Theta^\star$ and $\Gamma^\star$ are unknown $d_1 \times d_2$ matrices, and $W \in \mathbb{R}^{n_1 \times n_2}$ is some type of observation noise; it is potentially dense, and can be either deterministic or stochastic. The matrix $\Theta^\star$ is assumed to be either exactly low-rank, or well-approximated by a low-rank matrix, whereas the matrix $\Gamma^\star$ is assumed to have a complementary type of low-dimensional structure, such as sparsity.

As we discuss in Section 2.1, a variety of interesting statistical models can be formulated as instances of the observation model (1). Such models include versions of factor analysis involving non-identity noise matrices, robust forms of covariance estimation, and multi-task regression with some features shared across tasks, and a sparse subset differing across tasks. Given this observation model, our goal is to recover accurate estimates of the decomposition $(\Theta^\star, \Gamma^\star)$ based on the noisy observations $Y$. In this paper, we analyze simple estimators based on convex relaxations involving the nuclear norm, and a second general norm $\mathcal{R}$. (A small simulation of the model (1) with the identity operator is sketched below.)
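To make the set-up concrete, the following sketch (not from the paper; all sizes, scalings, and the random seed are illustrative assumptions) simulates the observation model (1) with the identity operator, a low-rank $\Theta^\star$, and an elementwise sparse $\Gamma^\star$:

```python
import numpy as np

# A minimal sketch of model (1) with the identity operator:
# Y = Theta_star + Gamma_star + W, where Theta_star is low-rank and
# Gamma_star is elementwise sparse. All sizes here are illustrative.
rng = np.random.default_rng(0)
d1, d2, r, s = 50, 40, 3, 20           # dimensions, rank, sparsity (assumed)

# Low-rank component, normalized so that ||Theta_star||_F ~ 1
A, B = rng.standard_normal((d1, r)), rng.standard_normal((d2, r))
Theta_star = A @ B.T
Theta_star /= np.linalg.norm(Theta_star, 'fro')

# Sparse component with s non-zero entries placed uniformly at random
Gamma_star = np.zeros((d1, d2))
idx = rng.choice(d1 * d2, size=s, replace=False)
Gamma_star.flat[idx] = rng.standard_normal(s) / np.sqrt(s)

# Dense stochastic noise with variance nu^2/(d1 d2), the scaling used
# later in Corollary 2
nu = 0.5
W = rng.standard_normal((d1, d2)) * nu / np.sqrt(d1 * d2)

Y = Theta_star + Gamma_star + W        # observations under model (1)
```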
Most past work on the model (1) has focused on the noiseless setting ($W = 0$), and on the identity observation operator (so that $\mathfrak{X}(\Theta^\star + \Gamma^\star) = \Theta^\star + \Gamma^\star$). Chandrasekaran et al. [9] studied the case where $\Gamma^\star$ is assumed to be sparse, with a relatively small number $s \ll d_1 d_2$ of non-zero entries. In the noiseless setting, they gave sufficient conditions for exact recovery under an adversarial sparsity model, meaning that the non-zero positions of $\Gamma^\star$ can be arbitrary. Subsequent work by Candès et al. [7] analyzed the same model but under an assumption of random sparsity, meaning that the non-zero positions are chosen uniformly at random. In very recent work, Xu et al. [29] have analyzed a different model, in which the matrix $\Gamma^\star$ is assumed to be columnwise sparse, with a relatively small number $s \ll d_2$ of non-zero columns. Their analysis guaranteed approximate recovery for the low-rank matrix, in particular for the uncorrupted columns. After the initial posting of this work, we became aware of recent work by Hsu et al. [14], who derived Frobenius norm error bounds for the case of exact elementwise sparsity. As we discuss in more detail in Section 3.4, in this special case our bounds are based on milder conditions, and yield sharper rates for problems where the rank and sparsity scale with the dimension.

Our main contribution is to provide a general oracle-type result (Theorem 1) on approximate recovery of the unknown decomposition from noisy observations, valid for structural constraints on $\Gamma^\star$ imposed via a decomposable regularizer. The class of decomposable regularizers, introduced in past work by Negahban et al. [19], includes the elementwise $\ell_1$-norm and columnwise $(2,1)$-norm as special cases, as well as various other regularizers used in practice. Our main result, stated in Theorem 1, provides finite-sample guarantees for estimates obtained by solving a class of convex programs formed using a composite regularizer. The resulting Frobenius norm error bounds consist of multiple terms, each of which has a natural interpretation in terms of the estimation and approximation errors associated with the sub-problems of recovering $\Theta^\star$ and $\Gamma^\star$. We then specialize Theorem 1 to the case of elementwise or columnwise sparsity models for $\Gamma^\star$, thereby obtaining recovery guarantees for matrices $\Theta^\star$ that may be either exactly or approximately low-rank, as well as matrices $\Gamma^\star$ that may be either exactly or approximately sparse. We provide non-asymptotic error bounds for general noise matrices $W$, both for elementwise and columnwise sparse models (see Corollaries 1 through 6).

To the best of our knowledge, these are the first results that apply to this broad class of models, allowing for noise ($W \neq 0$) that is either stochastic or deterministic, matrix components that are only approximately low-rank and/or sparse, and general forms of the observation operator $\mathfrak{X}$. In addition, the error bounds obtained by our analysis are sharp, and cannot be improved in general. More precisely, for the case of stochastic noise matrices and the identity observation operator, we prove that the squared Frobenius errors achieved by our estimators are minimax-optimal (see Theorem 2). An interesting feature of our analysis is that, in contrast to previous work [9, 29, 7], we do not impose incoherence conditions on the singular vectors of $\Theta^\star$; rather, we control the interaction with a milder condition involving the dual norm of the regularizer. In the special case of elementwise sparsity, this dual norm enforces an upper bound on the "spikiness" of the low-rank component, and has proven useful in the related setting of noisy matrix completion [20]. This constraint is not strong enough to guarantee identifiability of the models (and hence exact recovery in the noiseless setting), but it does provide a bound on the degree of non-identifiability.
We show that this same term arises in both the upper and lower bounds on the problem of approximate recovery that is of interest in the noisy setting.

The remainder of the paper is organized as follows. In Section 2, we set up the problem in a precise way, and describe the estimators. Section 3 is devoted to the statement of our main result on achievability, as well as its various corollaries for special cases of the matrix decomposition problem. We also state a matching lower bound on the minimax error for matrix decomposition with stochastic noise. In Section 4, we provide numerical simulations that illustrate the sharpness of our theoretical predictions. Section 5 is devoted to the proofs of our results, with certain more technical aspects of the argument deferred to the appendices, and we conclude with a discussion in Section 6.

Notation: For the reader's convenience, we summarize here some of the standard notation used throughout this paper. For any matrix $M \in \mathbb{R}^{d_1 \times d_2}$, we define the Frobenius norm $\|M\|_F := \sqrt{\sum_{j=1}^{d_1}\sum_{k=1}^{d_2} M_{jk}^2}$, corresponding to the ordinary Euclidean norm of its entries. We denote its singular values by $\sigma_1(M) \ge \sigma_2(M) \ge \cdots \ge \sigma_d(M) \ge 0$, where $d = \min\{d_1, d_2\}$. Its nuclear norm is given by $\|M\|_N = \sum_{j=1}^{d} \sigma_j(M)$.

2 Convex relaxations and matrix decomposition

In this paper, we consider a family of regularizers formed by a combination of the nuclear norm $\|\Theta\|_N := \sum_{j=1}^{\min\{d_1,d_2\}} \sigma_j(\Theta)$, which acts as a convex surrogate to a rank constraint for $\Theta^\star$ (e.g., see Recht et al. [25] and references therein), with a norm-based regularizer $\mathcal{R}: \mathbb{R}^{d_1\times d_2} \to \mathbb{R}_+$ used to constrain the structure of $\Gamma^\star$. We provide a general theorem applicable to a class of regularizers $\mathcal{R}$ that satisfy a certain decomposability property [19], and then consider in detail a few particular choices of $\mathcal{R}$ that have been studied in past work, including the elementwise $\ell_1$-norm and the columnwise $(2,1)$-norm (see Examples 4 and 5 below).

2.1 Some motivating applications

We begin with some motivating applications for the general linear observation model with noise (1).

Example 1 (Factor analysis with sparse noise). In a factor analysis model, random vectors $Z_i \in \mathbb{R}^d$ are assumed to be generated in an i.i.d. manner from the model

$$Z_i = L U_i + \varepsilon_i, \quad \text{for } i = 1, 2, \ldots, n, \tag{2}$$

where $L \in \mathbb{R}^{d\times r}$ is a loading matrix, and the vectors $U_i \sim N(0, I_{r\times r})$ and $\varepsilon_i \sim N(0, \Gamma^\star)$ are independent. Given $n$ i.i.d. samples from the model (2), the goal is to estimate the loading matrix $L$, or the matrix $LL^T$ that projects onto the column span of $L$. A simple calculation shows that the covariance matrix of $Z_i$ has the form $\Sigma = LL^T + \Gamma^\star$. Consequently, in the special case when $\Gamma^\star = \sigma^2 I_{d\times d}$, the range of $L$ is spanned by the top $r$ eigenvectors of $\Sigma$, and so we can recover it via standard principal components analysis. In other applications, we might no longer be guaranteed that $\Gamma^\star$ is the identity, in which case the top $r$ eigenvectors of $\Sigma$ need not be close to the column span of $L$. Nonetheless, when $\Gamma^\star$ is a sparse matrix, the problem of estimating $LL^T$ can be understood as an instance of our general observation model (1) with $d_1 = d_2 = d$, and the identity observation operator $\mathfrak{X}$ (so that $n_1 = n_2 = d$).
In particular, if we let the observation matrix $Y \in \mathbb{R}^{d\times d}$ be the sample covariance matrix $\frac{1}{n}\sum_{i=1}^n Z_i Z_i^T$, then some algebra shows that $Y = \Theta^\star + \Gamma^\star + W$, where $\Theta^\star = LL^T$ is of rank $r$, and the random matrix $W$ is a re-centered form of Wishart noise [1]; in particular, the zero-mean matrix

$$W := \frac{1}{n}\sum_{i=1}^n Z_i Z_i^T - \left( LL^T + \Gamma^\star \right). \tag{3}$$

When $\Gamma^\star$ is assumed to be elementwise sparse (i.e., with relatively few non-zero entries), this constraint can be enforced via the elementwise $\ell_1$-norm (see Example 4 to follow). A simulation of this set-up is sketched below. ♣
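The following sketch simulates the factor analysis set-up of Example 1; the loading matrix, the sparse noise covariance, and the sample sizes are hypothetical choices, not values from the paper:

```python
import numpy as np

# A minimal sketch of Example 1: samples Z_i = L U_i + eps_i, with a sparse
# noise covariance Gamma_star, and the re-centered Wishart noise (3).
rng = np.random.default_rng(1)
d, r, n = 30, 2, 500

L = rng.standard_normal((d, r)) / np.sqrt(d)      # loading matrix
Theta_star = L @ L.T                              # rank-r component LL^T

# Sparse noise covariance: diagonal plus a few symmetric off-diagonal entries
Gamma_star = 0.1 * np.eye(d)
Gamma_star[0, 1] = Gamma_star[1, 0] = 0.05

U = rng.standard_normal((n, r))
eps = rng.multivariate_normal(np.zeros(d), Gamma_star, size=n)
Z = U @ L.T + eps                                 # samples from model (2)

Y = Z.T @ Z / n                                   # sample covariance matrix
W = Y - (Theta_star + Gamma_star)                 # re-centered Wishart noise (3)
```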
Example 2 (Multi-task regression). Suppose that we are given a collection of $d_2$ regression problems in $\mathbb{R}^{d_1}$, each of the form $y_j = X\beta_j^* + w_j$ for $j = 1, 2, \ldots, d_2$. Here each $\beta_j^* \in \mathbb{R}^{d_1}$ is an unknown regression vector, $w_j \in \mathbb{R}^n$ is observation noise, and $X \in \mathbb{R}^{n\times d_1}$ is the design matrix. This family of models can be written in a convenient matrix form as $Y = XB^* + W$, where $Y = [y_1 \cdots y_{d_2}]$ and $W = [w_1 \cdots w_{d_2}]$ are both matrices in $\mathbb{R}^{n\times d_2}$, and $B^* := [\beta_1^* \cdots \beta_{d_2}^*] \in \mathbb{R}^{d_1\times d_2}$ is a matrix of regression vectors. Following standard terminology in multi-task learning, we refer to each column of $B^*$ as a task, and each row of $B^*$ as a feature.

In many applications, it is natural to assume that the feature weightings, that is, the vectors $\beta_j^* \in \mathbb{R}^{d_1}$, exhibit some degree of shared structure across tasks [2, 32, 21, 27]. This type of shared structure can be modeled by imposing a low-rank structure; for instance, in the extreme case of rank one, it would enforce that each $\beta_j^*$ is a multiple of some common underlying vector. However, many multi-task learning problems exhibit more complicated structure, in which some subset of features is shared across tasks, and some other subset of features varies substantially across tasks [2, 4]. For instance, in the Amazon recommendation system, tasks correspond to different classes of products, such as books, electronics and so on, and features include ratings by users. Some ratings (such as numerical scores) should have a meaning that is preserved across tasks, whereas other features (e.g., the label "boring") are very meaningful in applications to some categories (e.g., books) but less so in others (e.g., electronics). This kind of structure can be captured by assuming that the unknown regression matrix $B^*$ has a low-rank plus sparse decomposition, namely $B^* = \Theta^\star + \Gamma^\star$, where $\Theta^\star$ is low-rank and $\Gamma^\star$ is sparse, with a relatively small number of non-zero entries, corresponding to feature/task pairs that differ significantly from the baseline. A variant of this model is based on instead assuming that $\Gamma^\star$ is row-sparse, with a small number of non-zero rows. (In Example 5 to follow, we discuss an appropriate regularizer for enforcing such row or column sparsity.) With this model structure, we then define the observation operator $\mathfrak{X}: \mathbb{R}^{d_1\times d_2} \to \mathbb{R}^{n\times d_2}$ via $A \mapsto XA$, so that $n_1 = n$ and $n_2 = d_2$ in our general notation. In this way, we obtain another instance of the linear observation model (1). ♣

Example 3 (Robust covariance estimation). For $i = 1, 2, \ldots, n$, let $U_i \in \mathbb{R}^d$ be samples from a zero-mean distribution with unknown covariance matrix $\Theta^\star$. When the vectors $U_i$ are observed without any form of corruption, it is straightforward to estimate $\Theta^\star$ by performing PCA on the sample covariance matrix. Imagining that $j \in \{1, 2, \ldots, d\}$ indexes different individuals in the population, now suppose that the data associated with some subset $S$ of individuals is arbitrarily corrupted. This adversarial corruption can be modeled by assuming that we observe the vectors $Z_i = U_i + v_i$ for $i = 1, \ldots, n$, where each $v_i \in \mathbb{R}^d$ is a vector supported on the subset $S$. Letting $Y = \frac{1}{n}\sum_{i=1}^n Z_i Z_i^T$ be the sample covariance matrix of the corrupted samples, some algebra shows that it can be decomposed as $Y = \Theta^\star + \Delta + W$, where $W := \frac{1}{n}\sum_{i=1}^n U_i U_i^T - \Theta^\star$ is again a type of re-centered Wishart noise, and the remaining term can be written as

$$\Delta := \frac{1}{n}\sum_{i=1}^n v_i v_i^T + \frac{1}{n}\sum_{i=1}^n \left( U_i v_i^T + v_i U_i^T \right). \tag{4}$$

Note that $\Delta$ itself is not a column-sparse or row-sparse matrix; however, since each vector $v_i \in \mathbb{R}^d$ is supported only on some subset $S \subset \{1, 2, \ldots, d\}$, we can write $\Delta = \Gamma^\star + (\Gamma^\star)^T$, where $\Gamma^\star$ is a column-sparse matrix with entries only in columns indexed by $S$. This structure can be enforced by the use of the column-sparse regularizer (12), as described in Example 5 to follow; a small simulation of this corruption model is sketched below. ♣
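To illustrate Example 3, the sketch below simulates the corruption model and verifies numerically that the term (4) splits as $\Delta = \Gamma + \Gamma^T$ with $\Gamma$ column-sparse; the particular splitting used is one valid choice made here for illustration, not one prescribed by the paper:

```python
import numpy as np

# A minimal sketch of the corruption model of Example 3 (all sizes are
# hypothetical): Z_i = U_i + v_i, with v_i supported on a small subset S.
rng = np.random.default_rng(2)
d, n = 20, 1000
S = [3, 7]                                        # corrupted coordinates

Theta_star = np.eye(d)                            # true covariance (assumed)
U = rng.standard_normal((n, d))                   # clean samples, cov = I

V = np.zeros((n, d))
V[:, S] = 5.0 * rng.standard_normal((n, len(S)))  # corruption on S only
Z = U + V                                         # corrupted observations

Y = Z.T @ Z / n                                   # corrupted sample covariance
W = U.T @ U / n - Theta_star                      # re-centered Wishart noise
Delta = Y - Theta_star - W                        # the term (4)

# One valid splitting Delta = Gamma + Gamma^T with Gamma supported on
# columns indexed by S (an illustrative choice):
Gamma = (U + 0.5 * V).T @ V / n
assert np.allclose(Delta, Gamma + Gamma.T)
```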
2.2 Convex relaxation for noisy matrix decomposition

Given the observation model $Y = \mathfrak{X}(\Theta^\star + \Gamma^\star) + W$, it is natural to consider an estimator based on solving the regularized least-squares program

$$\min_{(\Theta, \Gamma)} \left\{ \frac{1}{2}\|Y - \mathfrak{X}(\Theta + \Gamma)\|_F^2 + \lambda_d \|\Theta\|_N + \mu_d \mathcal{R}(\Gamma) \right\}.$$

Here $(\lambda_d, \mu_d)$ are non-negative regularization parameters, to be chosen by the user. Our theory also provides choices of these parameters that guarantee good properties of the associated estimator. Although this estimator is reasonable, it turns out that an additional constraint yields an equally simple estimator with attractive properties, both in theory and in practice. In order to understand the need for an additional constraint, note that without further constraints, the model (1) is unidentifiable, even in the noiseless setting ($W = 0$). Indeed, as has been discussed in past work [9, 7, 29], no method can recover the components $(\Theta^\star, \Gamma^\star)$ unless the low-rank component is "incoherent" with the matrix $\Gamma^\star$. For instance, supposing for the moment that $\Gamma^\star$ is a sparse matrix, consider a rank one matrix with $\Theta^\star_{11} \neq 0$ and zeros in all other positions. In this case, it is clearly impossible to disentangle $\Theta^\star$ from a sparse matrix. Past work on both matrix completion and decomposition [9, 7, 29] has ruled out these types of troublesome cases via conditions on the singular vectors of the low-rank component $\Theta^\star$, and used them to derive sufficient conditions for exact recovery in the noiseless setting (see the discussion following Example 4 for more details). In this paper, we impose a related but milder condition, previously introduced in our past work on matrix completion [20], with the goal of performing approximate recovery. To be clear, this condition does not guarantee identifiability, but rather provides a bound on the radius of non-identifiability. It should be noted that non-identifiability is a feature common to many high-dimensional statistical models; for instance, see the paper [23] for discussion of non-identifiability in high-dimensional sparse regression. Moreover, in the more realistic setting of noisy observations and/or matrices that are not exactly low-rank, such approximate recovery is the best that can be expected. Indeed, one of our main contributions is to establish minimax-optimality of our rates, meaning that no algorithm can be substantially better over the matrix classes that we consider.

For a given regularizer $\mathcal{R}$, we define the quantity $\kappa_d(\mathcal{R}) := \sup_{V \neq 0} \|V\|_F / \mathcal{R}(V)$, which measures the relation between the regularizer and the Frobenius norm. Moreover, we define the associated dual norm

$$\mathcal{R}^*(U) := \sup_{\mathcal{R}(V) \le 1} \langle\langle V, U \rangle\rangle, \tag{5}$$

where $\langle\langle V, U \rangle\rangle := \mathrm{trace}(V^T U)$ is the trace inner product on the space $\mathbb{R}^{d_1\times d_2}$. Our estimators are based on constraining the interaction between the low-rank component and $\Gamma^\star$ via the quantity

$$\varphi_{\mathcal{R}}(\Theta) := \kappa_d(\mathcal{R}^*)\, \mathcal{R}^*(\Theta). \tag{6}$$

More specifically, we analyze the family of estimators

$$\min_{(\Theta, \Gamma)} \left\{ \frac{1}{2}\|Y - \mathfrak{X}(\Theta + \Gamma)\|_F^2 + \lambda_d \|\Theta\|_N + \mu_d \mathcal{R}(\Gamma) \right\} \quad \text{subject to } \varphi_{\mathcal{R}}(\Theta) \le \alpha, \tag{7}$$

for some fixed parameter $\alpha$.

2.3 Some examples

Let us consider some examples to provide intuition for specific forms of the estimator (7), and the role of the additional constraint.

Example 4 (Sparsity and elementwise $\ell_1$-norm). Suppose that $\Gamma^\star$ is assumed to be sparse, with $s \ll d_1 d_2$ non-zero entries. In this case, the sum $\Theta^\star + \Gamma^\star$ corresponds to the sum of a low rank matrix with a sparse matrix. Motivating applications include the problem of factor analysis with a non-identity but sparse noise covariance, as discussed in Example 1, as well as certain formulations of robust PCA [7], and model selection in Gauss-Markov random fields with hidden variables [8]. Given the sparsity of $\Gamma^\star$, an appropriate choice of regularizer is the elementwise $\ell_1$-norm

$$\mathcal{R}(\Gamma) = \|\Gamma\|_1 := \sum_{j=1}^{d_1}\sum_{k=1}^{d_2} |\Gamma_{jk}|. \tag{8}$$

With this choice, it is straightforward to verify that

$$\mathcal{R}^*(Z) = \|Z\|_\infty := \max_{j=1,\ldots,d_1}\; \max_{k=1,\ldots,d_2} |Z_{jk}|, \tag{9}$$

and moreover, that $\kappa_d(\mathcal{R}^*) = \sqrt{d_1 d_2}$. Consequently, in this specific case, the general convex program (7) takes the form

$$\min_{(\Theta, \Gamma)} \left\{ \frac{1}{2}\|Y - \mathfrak{X}(\Theta + \Gamma)\|_F^2 + \lambda_d \|\Theta\|_N + \mu_d \|\Gamma\|_1 \right\} \quad \text{such that } \|\Theta\|_\infty \le \frac{\alpha}{\sqrt{d_1 d_2}}. \tag{10}$$

The constraint involving $\|\Theta\|_\infty$ serves to control the "spikiness" of the low rank component, with larger settings of $\alpha$ allowing for more spiky matrices. Indeed, this type of spikiness control has proven useful in the analysis of nuclear norm relaxations for noisy matrix completion [20]. To gain intuition for the parameter $\alpha$, if we consider matrices with $\|\Theta\|_F \approx 1$, as is appropriate to keep a constant signal-to-noise ratio in the noisy model (1), then setting $\alpha \approx 1$ allows only for matrices for which $|\Theta_{jk}| \approx 1/\sqrt{d_1 d_2}$ in all entries. If we want to permit the maximally spiky matrix with all its mass in a single position, then the parameter $\alpha$ must be of the order $\sqrt{d_1 d_2}$. In practice, we are interested in settings of $\alpha$ lying between these two extremes. (A first-order sketch for solving the program (10) is given below.) ♣
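The paper does not prescribe an algorithm at this point, but for intuition, here is a minimal proximal-gradient sketch for the program (10) with the identity operator: the steps alternate soft-thresholding for the $\ell_1$ term with singular value soft-thresholding for the nuclear norm, and the spikiness constraint is imposed by entrywise clipping, a heuristic approximation (of our own choosing) to the exact constrained prox:

```python
import numpy as np

def soft(M, t):
    """Entrywise soft-thresholding: the prox of t*||.||_1."""
    return np.sign(M) * np.maximum(np.abs(M) - t, 0.0)

def svd_soft(M, t):
    """Singular value soft-thresholding: the prox of t*||.||_N."""
    U, sig, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(sig - t, 0.0)) @ Vt

def decompose(Y, lam, mu, alpha, iters=500, step=0.5):
    """Heuristic proximal-gradient sketch for (10) with the identity operator."""
    d1, d2 = Y.shape
    bound = alpha / np.sqrt(d1 * d2)            # spikiness radius
    Theta, Gamma = np.zeros_like(Y), np.zeros_like(Y)
    for _ in range(iters):
        R = Y - Theta - Gamma                   # shared residual / negative gradient
        Theta = svd_soft(Theta + step * R, step * lam)
        Theta = np.clip(Theta, -bound, bound)   # heuristic projection onto the
                                                # elementwise l_infty ball
        Gamma = soft(Gamma + step * R, step * mu)
    return Theta, Gamma
```

The step size 0.5 reflects that the joint gradient of the quadratic loss in $(\Theta, \Gamma)$ is 2-Lipschitz.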
Past work on $\ell_1$-forms of matrix decomposition has imposed singular vector incoherence conditions that are related to, but different from, our spikiness condition. More concretely, write the SVD of the low-rank component as $\Theta^\star = UDV^T$, where $D$ is diagonal, and $U \in \mathbb{R}^{d_1\times r}$ and $V \in \mathbb{R}^{d_2\times r}$ are matrices of the left and right singular vectors. Singular vector incoherence bounds quantities such as

$$\left\|UU^T - \tfrac{r}{d_1} I_{d_1\times d_1}\right\|_\infty, \quad \left\|VV^T - \tfrac{r}{d_2} I_{d_2\times d_2}\right\|_\infty, \quad \text{and} \quad \|UV^T\|_\infty, \tag{11}$$

all of which measure the degree of "coherence" between the singular vectors and the canonical basis. A remarkable feature of such conditions is that they have no dependence on the singular values of $\Theta^\star$. This lack of dependence makes sense in the noiseless setting, where exact recovery is the goal. For noisy models, in contrast, one should only be concerned with recovering components with "large" singular values. In this context, our bound on the maximum element $\|\Theta^\star\|_\infty$, or equivalently on the quantity $\|UDV^T\|_\infty$, is natural. Note that it imposes no constraint on the matrices $UU^T$ or $VV^T$, and moreover it uses the diagonal matrix of singular values as a weight in the $\ell_\infty$ bound. Moreover, we note that there are many matrices for which $\|\Theta^\star\|_\infty$ satisfies a reasonable bound, whereas the incoherence measures are poorly behaved (e.g., see Section 3.4.2 in the paper [20] for one example).

Example 5 (Column-sparsity and blockwise columnwise regularization). Other applications involve models in which $\Gamma^\star$ has a relatively small number $s \ll d_2$ of non-zero columns (or a relatively small number $s \ll d_1$ of non-zero rows). Such applications include the multi-task regression problem from Example 2, the robust covariance problem from Example 3, as well as a form of robust PCA considered by Xu et al. [29]. In this case, it is natural to constrain $\Gamma$ via the $(2,1)$-norm regularizer

$$\mathcal{R}(\Gamma) = \|\Gamma\|_{2,1} := \sum_{k=1}^{d_2} \|\Gamma_k\|_2, \tag{12}$$

where $\Gamma_k$ is the $k$-th column of $\Gamma$ (or the $(1,2)$-norm regularizer that enforces the analogous constraint on the rows of $\Gamma$). For this choice, it can be verified that

$$\mathcal{R}^*(U) = \|U\|_{2,\infty} := \max_{k=1,2,\ldots,d_2} \|U_k\|_2, \tag{13}$$

where $U_k$ denotes the $k$-th column of $U$, and that $\kappa_d(\mathcal{R}^*) = \sqrt{d_2}$. Consequently, in this specific case, the general convex program (7) takes the form

$$\min_{(\Theta, \Gamma)} \left\{ \frac{1}{2}\|Y - \mathfrak{X}(\Theta + \Gamma)\|_F^2 + \lambda_d \|\Theta\|_N + \mu_d \|\Gamma\|_{2,1} \right\} \quad \text{such that } \|\Theta\|_{2,\infty} \le \frac{\alpha}{\sqrt{d_2}}. \tag{14}$$

As before, the constraint on $\|\Theta\|_{2,\infty}$ serves to limit the "spikiness" of the low rank component, where in this case, spikiness is measured in a columnwise manner. Again, it is natural to consider matrices such that $\|\Theta^\star\|_F \approx 1$, so that the signal-to-noise ratio in the observation model (1) stays fixed. Thus, if $\alpha \approx 1$, then we are restricted to matrices for which $\|\Theta^\star_k\|_2 \approx \frac{1}{\sqrt{d_2}}$ for all columns $k = 1, 2, \ldots, d_2$. At the other extreme, in order to permit a maximally "column-spiky" matrix (i.e., with a single non-zero column of $\ell_2$-norm roughly 1), we need to set $\alpha \approx \sqrt{d_2}$. As before, settings of $\alpha$ lying between these two extremes are of practical interest. (The columnwise prox operations needed to adapt the earlier sketch to the program (14) are given below.) ♣
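For the program (14), the entrywise prox operations in the earlier sketch would be replaced by their columnwise analogues; minimal versions are given below (again illustrative, not code from the paper):

```python
import numpy as np

def col_soft(M, t):
    """Columnwise soft-thresholding: the prox of t*||.||_{2,1}."""
    norms = np.linalg.norm(M, axis=0, keepdims=True)
    scale = np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)
    return M * scale

def col_clip(M, bound):
    """Heuristic projection: rescale columns so that ||M||_{2,inf} <= bound."""
    norms = np.linalg.norm(M, axis=0, keepdims=True)
    return M * np.minimum(1.0, bound / np.maximum(norms, 1e-12))
```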
3 Main results and their consequences

In this section, we state our main results and discuss some of their consequences. Our first result applies to the family of convex programs (7) whenever $\mathcal{R}$ belongs to the class of decomposable regularizers, and the least-squares loss associated with the observation model satisfies a specific form of restricted strong convexity [19]. Accordingly, we begin in Section 3.1 by defining the notion of decomposability, and then illustrating how the elementwise $\ell_1$- and columnwise $(2,1)$-norms, as discussed in Examples 4 and 5 respectively, are both instances of decomposable regularizers. In Section 3.2, we define the form of restricted strong convexity appropriate to our setting. Section 3.3 contains the statement of our main result about the $M$-estimator (7), while Sections 3.4 and 3.6 are devoted to its consequences for the cases of elementwise sparsity and columnwise sparsity, respectively. In Section 3.5, we complement our analysis of the convex program (7) by showing that, in the special case of the identity operator, a simple two-step method can achieve similar rates (up to constant factors). We also provide an example showing that the two-step method can fail for more general observation operators. In Section 3.7, we state matching lower bounds on the minimax errors in the case of the identity operator and Gaussian noise.

3.1 Decomposable regularizers

The notion of decomposability is defined in terms of a pair of subspaces, which (in general) need not be orthogonal complements. Here we consider a special case of decomposability that is sufficient to cover the examples of interest in this paper:

Definition 1. Given a subspace $\mathcal{M} \subseteq \mathbb{R}^{d_1\times d_2}$ and its orthogonal complement $\mathcal{M}^\perp$, a norm-based regularizer $\mathcal{R}$ is decomposable with respect to $(\mathcal{M}, \mathcal{M}^\perp)$ if

$$\mathcal{R}(U + V) = \mathcal{R}(U) + \mathcal{R}(V) \quad \text{for all } U \in \mathcal{M} \text{ and } V \in \mathcal{M}^\perp. \tag{15}$$

To provide some intuition, the subspace $\mathcal{M}$ should be thought of as the nominal model subspace; in our results, it will be chosen such that the matrix $\Gamma^\star$ lies within, or close to, $\mathcal{M}$. The orthogonal complement $\mathcal{M}^\perp$ represents deviations away from the model subspace, and the equality (15) guarantees that such deviations are penalized as much as possible. As discussed at more length in Negahban et al. [19], a large class of norms are decomposable with respect to interesting subspace pairs. (Note that any norm is trivially decomposable with respect to the pair $(\mathcal{M}, \mathcal{M}^\perp) = (\mathbb{R}^{d_1\times d_2}, \{0\})$.) Of particular relevance to us is the decomposability of the elementwise $\ell_1$-norm $\|\Gamma\|_1$ and the columnwise $(2,1)$-norm $\|\Gamma\|_{2,1}$, as previously discussed in Examples 4 and 5, respectively.

Decomposability of $\mathcal{R}(\cdot) = \|\cdot\|_1$: Beginning with the elementwise $\ell_1$-norm, given an arbitrary subset $S \subseteq \{1,\ldots,d_1\}\times\{1,\ldots,d_2\}$ of matrix indices, consider the subspace pair

$$\mathcal{M}(S) := \left\{ U \in \mathbb{R}^{d_1\times d_2} \mid U_{jk} = 0 \text{ for all } (j,k) \notin S \right\}, \quad \text{and} \quad \mathcal{M}^\perp(S) := (\mathcal{M}(S))^\perp. \tag{16}$$

It is then easy to see that for any pair $U \in \mathcal{M}(S)$, $U' \in \mathcal{M}^\perp(S)$, we have the splitting $\|U + U'\|_1 = \|U\|_1 + \|U'\|_1$, showing that the elementwise $\ell_1$-norm is decomposable with respect to the pair $(\mathcal{M}(S), \mathcal{M}^\perp(S))$.

Decomposability of $\mathcal{R}(\cdot) = \|\cdot\|_{2,1}$: Similarly, the columnwise $(2,1)$-norm is also decomposable with respect to appropriately defined subspaces, indexed by subsets $C \subseteq \{1, 2, \ldots, d_2\}$ of column indices. Indeed, using $V_k$ to denote the $k$-th column of the matrix $V$, define

$$\mathcal{M}(C) := \left\{ V \in \mathbb{R}^{d_1\times d_2} \mid V_k = 0 \text{ for all } k \notin C \right\}, \tag{17}$$

and $\mathcal{M}^\perp(C) := (\mathcal{M}(C))^\perp$. Again, it is easy to verify that for any pair $V \in \mathcal{M}(C)$, $V' \in \mathcal{M}^\perp(C)$, we have $\|V + V'\|_{2,1} = \|V\|_{2,1} + \|V'\|_{2,1}$, thus verifying the decomposability property. (A quick numerical check of both splittings appears below.)
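Both splittings are easy to confirm numerically; the following check uses random supports and is purely illustrative:

```python
import numpy as np

# Numerical check of the splittings above: the l1-norm over a random index
# set S, and the (2,1)-norm over a random column set C.
rng = np.random.default_rng(3)
d1, d2 = 6, 8

mask = rng.random((d1, d2)) < 0.3                 # indicator of S
U = rng.standard_normal((d1, d2)) * mask          # U in M(S)
Up = rng.standard_normal((d1, d2)) * ~mask        # U' in M_perp(S)
assert np.isclose(np.abs(U + Up).sum(),
                  np.abs(U).sum() + np.abs(Up).sum())

cols = rng.random(d2) < 0.3                       # indicator of C
V = rng.standard_normal((d1, d2)) * cols          # V in M(C)
Vp = rng.standard_normal((d1, d2)) * ~cols        # V' in M_perp(C)
norm21 = lambda M: np.linalg.norm(M, axis=0).sum()
assert np.isclose(norm21(V + Vp), norm21(V) + norm21(Vp))
```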
For any decomposable regularizer and subspace $\mathcal{M} \neq \{0\}$, we define the compatibility constant

$$\Psi(\mathcal{M}, \mathcal{R}) := \sup_{U \in \mathcal{M},\, U \neq 0} \frac{\mathcal{R}(U)}{\|U\|_F}. \tag{18}$$

This quantity measures the compatibility between the Frobenius norm and the regularizer over the subspace $\mathcal{M}$. For example, for the $\ell_1$-norm and the set $\mathcal{M}(S)$ previously defined in (16), an elementary calculation yields $\Psi(\mathcal{M}(S); \|\cdot\|_1) = \sqrt{s}$.

3.2 Restricted strong convexity

Given a loss function, the general notion of strong convexity involves establishing a quadratic lower bound on the error in the first-order Taylor approximation [6]. In our setting, the loss is the quadratic function $\mathcal{L}(\Omega) = \frac{1}{2}\|Y - \mathfrak{X}(\Omega)\|_F^2$ (where we use $\Omega = \Theta + \Gamma$), so that the first-order Taylor series error at $\Omega$ in the direction of the matrix $\Delta$ is given by

$$\mathcal{L}(\Omega + \Delta) - \mathcal{L}(\Omega) - \langle\langle \nabla\mathcal{L}(\Omega), \Delta \rangle\rangle = \frac{1}{2}\|\mathfrak{X}(\Delta)\|_F^2. \tag{19}$$

Consequently, strong convexity is equivalent to a lower bound of the form $\frac{1}{2}\|\mathfrak{X}(\Delta)\|_F^2 \ge \frac{\gamma}{2}\|\Delta\|_F^2$, where $\gamma > 0$ is the strong convexity constant. Restricted strong convexity is a weaker condition that also involves a norm defined by the regularizers. In our case, for any pair $(\mu_d, \lambda_d)$ of positive numbers, we first define the weighted combination of the two regularizers, namely

$$\mathcal{Q}(\Theta, \Gamma) := \|\Theta\|_N + \frac{\mu_d}{\lambda_d}\,\mathcal{R}(\Gamma). \tag{20}$$

For a given matrix $\Delta$, we can use this weighted combination to define an associated norm

$$\Phi(\Delta) := \inf_{\Theta + \Gamma = \Delta} \mathcal{Q}(\Theta, \Gamma), \tag{21}$$

corresponding to the minimum value of $\mathcal{Q}(\Theta, \Gamma)$ over all decompositions of $\Delta$. (Defined this way, $\Phi(\Delta)$ is the infimal convolution of the two norms $\|\cdot\|_N$ and $\mathcal{R}$, a very well studied object in convex analysis; see, e.g., [26].)

Definition 2 (RSC). The quadratic loss with linear operator $\mathfrak{X}: \mathbb{R}^{d_1\times d_2} \to \mathbb{R}^{n_1\times n_2}$ satisfies restricted strong convexity with respect to the norm $\Phi$ and with parameters $(\gamma, \tau_n)$ if

$$\frac{1}{2}\|\mathfrak{X}(\Delta)\|_F^2 \ge \frac{\gamma}{2}\|\Delta\|_F^2 - \tau_n \Phi^2(\Delta) \quad \text{for all } \Delta \in \mathbb{R}^{d_1\times d_2}. \tag{22}$$

Note that if condition (22) holds with $\tau_n = 0$ and any $\gamma > 0$, then we recover the usual definition of strong convexity (with respect to the Frobenius norm). In the special case of the identity operator (i.e., $\mathfrak{X}(\Theta) = \Theta$), such strong convexity does hold with $\gamma = 1$. More general observation operators require different choices of the parameter $\gamma$, and also non-zero choices of the tolerance parameter $\tau_n$. While RSC establishes a form of (approximate) identifiability in general, here the error $\Delta$ is a combination of the error $\Delta_\Theta$ in estimating $\Theta^\star$ and the error $\Delta_\Gamma$ in estimating $\Gamma^\star$. Consequently, in the proof of our main results we will need a further lower bound on $\|\Delta\|_F$ in terms of $\|\Delta_\Theta\|_F$ and $\|\Delta_\Gamma\|_F$ in order to demonstrate the (approximate) identifiability of our model under the RSC condition (22).

3.3 Results for general regularizers and noise

We begin by stating a result for a general observation operator $\mathfrak{X}$, a general decomposable regularizer $\mathcal{R}$, and a general noise matrix $W$. In later subsections, we specialize this result to particular choices of observation operator, regularizers, and stochastic noise matrices.
In all our results, we measure error using the squared Frobenius norm summed across both matrices,

$$e^2(\widehat\Theta, \widehat\Gamma) := \|\widehat\Theta - \Theta^\star\|_F^2 + \|\widehat\Gamma - \Gamma^\star\|_F^2. \tag{23}$$

With this notation, the following result applies to the observation model $Y = \mathfrak{X}(\Gamma^\star + \Theta^\star) + W$, where the low-rank matrix satisfies the constraint $\varphi_{\mathcal{R}}(\Theta^\star) \le \alpha$. Our upper bound on the squared Frobenius error consists of three terms:

$$K_{\Theta^\star} := \frac{\lambda_d^2}{\gamma^2}\left( r + \frac{\gamma}{\lambda_d}\sum_{j=r+1}^{d} \sigma_j(\Theta^\star) \right) \tag{24a}$$

$$K_{\Gamma^\star} := \frac{\mu_d^2}{\gamma^2}\left( \Psi^2(\mathcal{M}; \mathcal{R}) + \frac{\gamma}{\mu_d}\,\mathcal{R}(\Pi_{\mathcal{M}^\perp}(\Gamma^\star)) \right) \tag{24b}$$

$$K_{\tau_n} := \frac{\tau_n}{\gamma}\left( \sum_{j=r+1}^{d} \sigma_j(\Theta^\star) + \frac{\mu_d}{\lambda_d}\,\mathcal{R}(\Pi_{\mathcal{M}^\perp}(\Gamma^\star)) \right)^2. \tag{24c}$$

As will be clarified shortly, these three terms correspond to the errors associated with the low-rank term ($K_{\Theta^\star}$), the sparse term ($K_{\Gamma^\star}$), and an additional error ($K_{\tau_n}$) associated with a non-zero tolerance $\tau_n \neq 0$ in the RSC condition (22).

Theorem 1. Suppose that the observation operator $\mathfrak{X}$ satisfies the RSC condition (22) with curvature $\gamma > 0$, and a tolerance $\tau_n$ such that there exist integers $r = 1, 2, \ldots, \min\{d_1, d_2\}$ for which

$$128\,\tau_n r < \frac{\gamma}{4}, \quad \text{and} \quad 64\,\tau_n \left( \Psi(\mathcal{M}; \mathcal{R})\,\frac{\mu_d}{\lambda_d} \right)^2 < \frac{\gamma}{4}. \tag{25}$$

Then if we solve the convex program (7) with regularization parameters $(\lambda_d, \mu_d)$ satisfying

$$\lambda_d \ge 4\,\|\mathfrak{X}^*(W)\|_{op}, \quad \text{and} \quad \mu_d \ge 4\,\mathcal{R}^*(\mathfrak{X}^*(W)) + \frac{4\gamma\alpha}{\kappa_d}, \tag{26}$$

there are universal constants $c_j$, $j = 1, 2, 3$, such that for any matrix pair $(\Theta^\star, \Gamma^\star)$ satisfying $\varphi_{\mathcal{R}}(\Theta^\star) \le \alpha$ and any $\mathcal{R}$-decomposable pair $(\mathcal{M}, \mathcal{M}^\perp)$, any optimal solution $(\widehat\Theta, \widehat\Gamma)$ satisfies

$$e^2(\widehat\Theta, \widehat\Gamma) \le c_1 K_{\Theta^\star} + c_2 K_{\Gamma^\star} + c_3 K_{\tau_n}. \tag{27}$$

Let us make a few remarks in order to interpret the meaning of this claim.

Deterministic guarantee: To be clear, Theorem 1 is a deterministic statement that applies to any optimum of the convex program (7). Moreover, it actually provides a whole family of upper bounds, one for each choice of the rank parameter $r$ and each choice of the subspace pair $(\mathcal{M}, \mathcal{M}^\perp)$. In practice, these choices are optimized so as to obtain the tightest possible upper bound. As for the condition (25), it will be satisfied for a sufficiently large sample size $n$ as long as $\gamma > 0$ and the tolerance $\tau_n$ decreases to zero with the sample size. In many cases of interest, including the identity observation operator and multi-task cases, the RSC condition holds with $\tau_n = 0$, so that condition (25) holds as long as $\gamma > 0$.

Interpretation of different terms: Let us focus first on the term $K_{\Theta^\star}$, which corresponds to the complexity of estimating the low-rank component. It is further sub-divided into two terms, with the term $\lambda_d^2 r$ corresponding to the estimation error associated with a rank $r$ matrix, whereas the term $\lambda_d \sum_{j=r+1}^{d}\sigma_j(\Theta^\star)$ corresponds to the approximation error associated with representing $\Theta^\star$ (which might be full rank) by a matrix of rank $r$. A similar interpretation applies to the two components associated with $\Gamma^\star$, the first of which corresponds to a form of estimation error, whereas the second corresponds to a form of approximation error.

A family of upper bounds: Since the inequality (27) corresponds to a family of upper bounds indexed by $r$ and the subspace $\mathcal{M}$, these quantities can be chosen adaptively, depending on the structure of the matrices $(\Theta^\star, \Gamma^\star)$, so as to obtain the tightest possible upper bound. (A small helper evaluating the terms (24a)-(24c) in the elementwise sparse case is sketched below.)
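As an illustration of how these quantities might be evaluated in the elementwise sparse case, where $\Psi^2(\mathcal{M}(S); \|\cdot\|_1) = s$ and the projection onto $\mathcal{M}^\perp(S)$ keeps the entries outside the chosen support, here is a small helper; the function and its arguments are hypothetical conveniences, not part of the paper:

```python
import numpy as np

def bound_terms(Theta_star, Gamma_star, r, S_mask, lam, mu, gamma=1.0, tau=0.0):
    """Evaluate the terms (24a)-(24c) for the elementwise l1-norm (illustrative)."""
    sig = np.linalg.svd(Theta_star, compute_uv=False)
    tail = sig[r:].sum()                          # sum_{j>r} sigma_j(Theta*)
    resid = np.abs(Gamma_star[~S_mask]).sum()     # R(Pi_{M_perp}(Gamma*))
    s = S_mask.sum()
    K_theta = (lam**2 / gamma**2) * (r + (gamma / lam) * tail)
    K_gamma = (mu**2 / gamma**2) * (s + (gamma / mu) * resid)
    K_tau = (tau / gamma) * (tail + (mu / lam) * resid) ** 2
    return K_theta, K_gamma, K_tau

# For an exactly rank-r, exactly s-sparse pair, both tail terms vanish and the
# bound reduces to the order of lam^2 r + mu^2 s, as in (28) and (31) below.
```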
In the simplest case, the RSC condition holds with tolerance $\tau_n = 0$, the matrix $\Theta^\star$ is exactly low rank (say rank $r$), and $\Gamma^\star$ lies within an $\mathcal{R}$-decomposable subspace $\mathcal{M}$. In this case, the approximation errors vanish, and Theorem 1 guarantees that the squared Frobenius error is at most

$$e^2(\widehat\Theta, \widehat\Gamma) \lesssim \lambda_d^2 r + \mu_d^2 \Psi^2(\mathcal{M}; \mathcal{R}), \tag{28}$$

where the $\lesssim$ notation indicates that we ignore constant factors.

3.4 Results for $\ell_1$-norm regularization

Theorem 1 holds for any regularizer that is decomposable with respect to some subspace pair. As previously noted, an important example of a decomposable regularizer is the elementwise $\ell_1$-norm, which is decomposable with respect to subspaces of the form (16).

Corollary 1. Consider an observation operator $\mathfrak{X}$ that satisfies the RSC condition (22) with $\gamma > 0$ and $\tau_n = 0$. Suppose that we solve the convex program (10) with regularization parameters $(\lambda_d, \mu_d)$ such that

$$\lambda_d \ge 4\,\|\mathfrak{X}^*(W)\|_{op}, \quad \text{and} \quad \mu_d \ge 4\,\|\mathfrak{X}^*(W)\|_\infty + \frac{4\gamma\alpha}{\sqrt{d_1 d_2}}. \tag{29}$$

Then there are universal constants $c_j$ such that for any matrix pair $(\Theta^\star, \Gamma^\star)$ with $\|\Theta^\star\|_\infty \le \frac{\alpha}{\sqrt{d_1 d_2}}$, and for all integers $r = 1, 2, \ldots, \min\{d_1, d_2\}$ and $s = 1, 2, \ldots, d_1 d_2$, we have

$$e^2(\widehat\Theta, \widehat\Gamma) \le c_1 \frac{\lambda_d^2}{\gamma^2}\left( r + \frac{1}{\lambda_d}\sum_{j=r+1}^{d}\sigma_j(\Theta^\star) \right) + c_2 \frac{\mu_d^2}{\gamma^2}\left( s + \frac{1}{\mu_d}\sum_{(j,k)\notin S} |\Gamma^\star_{jk}| \right), \tag{30}$$

where $S$ is an arbitrary subset of matrix indices of cardinality at most $s$.

Remarks: This result follows directly by specializing Theorem 1 to the elementwise $\ell_1$-norm. As noted in Example 4, for this norm we have $\kappa_d = \sqrt{d_1 d_2}$, so that the choice (29) satisfies the conditions of Theorem 1. The dual norm is given by the elementwise $\ell_\infty$-norm $\mathcal{R}^*(\cdot) = \|\cdot\|_\infty$. As observed in Section 3.1, the $\ell_1$-norm is decomposable with respect to subspace pairs of the form $(\mathcal{M}(S), \mathcal{M}^\perp(S))$, for an arbitrary subset $S$ of matrix indices. Moreover, for any subset $S$ of cardinality $s$, we have $\Psi^2(\mathcal{M}(S)) = s$. It is easy to verify that with this choice, we have $\mathcal{R}(\Pi_{\mathcal{M}^\perp}(\Gamma^\star)) = \sum_{(j,k)\notin S} |\Gamma^\star_{jk}|$, from which the claim follows.

It is worth noting that the inequality (30) corresponds to a family of upper bounds indexed by $r$ and the subset $S$. For any fixed integer $s \in \{1, 2, \ldots, d_1 d_2\}$, it is natural to let $S$ index the largest $s$ values (in absolute value) of $\Gamma^\star$. Moreover, the choice of the pair $(r, s)$ can be further adapted to the structure of the matrix. For instance, when $\Theta^\star$ is exactly low rank and $\Gamma^\star$ is exactly sparse, one natural choice is $r = \mathrm{rank}(\Theta^\star)$ and $s = |\mathrm{supp}(\Gamma^\star)|$. With this choice, both approximation terms vanish, and Corollary 1 guarantees that any solution $(\widehat\Theta, \widehat\Gamma)$ of the convex program (10) satisfies

$$\|\widehat\Theta - \Theta^\star\|_F^2 + \|\widehat\Gamma - \Gamma^\star\|_F^2 \lesssim \lambda_d^2 r + \mu_d^2 s. \tag{31}$$

Further specializing to the case of noiseless observations ($W = 0$) yields a form of approximate recovery, namely

$$\|\widehat\Theta - \Theta^\star\|_F^2 + \|\widehat\Gamma - \Gamma^\star\|_F^2 \lesssim \frac{\alpha^2 s}{d_1 d_2}. \tag{32}$$

This guarantee is weaker than the exact recovery results obtained in past work on the noiseless observation model with identity operator [9, 7]; however, these papers imposed incoherence requirements on the singular vectors of the low-rank component $\Theta^\star$ that are more restrictive than the conditions of Theorem 1.
Our elementwise $\ell_\infty$ bound is a weaker condition than incoherence, since it allows the singular vectors to be coherent, as long as the associated singular value is not too large. Moreover, the bound (32) is unimprovable up to constant factors, due to the non-identifiability of the observation model (1), as shown by the following example for the identity observation operator $\mathfrak{X} = I$.

Example 6 (Unimprovability for the elementwise sparse model). Consider a given sparsity index $s \in \{1, 2, \ldots, d_1 d_2\}$, where we may assume without loss of generality that $s \le d_2$. We then form the matrix

$$\Theta^\star := \frac{\alpha}{\sqrt{d_1 d_2}}\; e_1 f^T, \quad \text{where } e_1 = (1, 0, \ldots, 0)^T \in \mathbb{R}^{d_1}, \tag{33}$$

and where the vector $f \in \mathbb{R}^{d_2}$ has exactly $s$ ones (and zeros elsewhere). Note that $\|\Theta^\star\|_\infty = \frac{\alpha}{\sqrt{d_1 d_2}}$ by construction, and moreover $\Theta^\star$ is rank one and has $s$ non-zero entries. Since up to $s$ entries of the noise matrix $\Gamma^\star$ can be chosen arbitrarily, "nature" can always set $\Gamma^\star = -\Theta^\star$, meaning that we would observe $Y = \Theta^\star + \Gamma^\star = 0$. Consequently, based on observing only $Y$, the pair $(\Theta^\star, \Gamma^\star)$ is indistinguishable from the all-zero matrices $(0_{d_1\times d_2}, 0_{d_1\times d_2})$. This fact can be used to show that no method can have squared Frobenius error lower than $\approx \frac{\alpha^2 s}{d_1 d_2}$; see Section 3.7 for a precise statement. Therefore, the bound (32) cannot be improved unless one is willing to impose further restrictions on the pair $(\Theta^\star, \Gamma^\star)$. We note that the singular vector incoherence conditions, as imposed in past work [9, 7, 14] and used to guarantee exact recovery, would exclude the matrix (33), since its left singular vector is the unit vector $e_1 \in \mathbb{R}^{d_1}$. (A numerical rendering of this construction is sketched below.) ♣
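The construction (33) is easy to render numerically; the check below (with illustrative sizes) confirms the rank, sparsity, and spikiness claims, and that the adversarial choice $\Gamma^\star = -\Theta^\star$ yields $Y = 0$:

```python
import numpy as np

# Numerical rendering of (33): a rank-one, s-sparse, maximally non-spiky
# Theta_star that nature can cancel exactly with Gamma_star = -Theta_star.
d1, d2, s, alpha = 40, 30, 5, 1.0

e1 = np.zeros(d1); e1[0] = 1.0
f = np.zeros(d2); f[:s] = 1.0
Theta_star = (alpha / np.sqrt(d1 * d2)) * np.outer(e1, f)

assert np.linalg.matrix_rank(Theta_star) == 1
assert (Theta_star != 0).sum() == s
assert np.isclose(np.abs(Theta_star).max(), alpha / np.sqrt(d1 * d2))

Gamma_star = -Theta_star                  # s-sparse adversarial choice
Y = Theta_star + Gamma_star               # identically zero: the pair is
assert not Y.any()                        # indistinguishable from (0, 0)
```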
3.4.1 Results for stochastic noise matrices

Our discussion thus far has applied to general observation operators $\mathfrak{X}$ and general noise matrices $W$. More concrete results can be obtained by assuming particular forms of $\mathfrak{X}$, and that the noise matrix $W$ is stochastic. Our first stochastic result applies to the identity operator $\mathfrak{X} = I$ and a noise matrix $W$ generated with i.i.d. $N(0, \nu^2/(d_1 d_2))$ entries. (To be clear, we state our results in terms of the noise scaling $\nu^2/(d_1 d_2)$ since it corresponds to a model with constant signal-to-noise ratio when the Frobenius norms of $\Theta^\star$ and $\Gamma^\star$ remain bounded, independently of the dimension. The same results would hold if the noise were not rescaled, modulo the appropriate rescalings of the various terms.)

Corollary 2. Suppose $\mathfrak{X} = I$, the matrix $\Theta^\star$ has rank at most $r$ and satisfies $\|\Theta^\star\|_\infty \le \frac{\alpha}{\sqrt{d_1 d_2}}$, and $\Gamma^\star$ has at most $s$ non-zero entries. If the noise matrix $W$ has i.i.d. $N(0, \nu^2/(d_1 d_2))$ entries, and we solve the convex program (10) with regularization parameters

$$\lambda_d = \frac{8\nu}{\sqrt{d_1}} + \frac{8\nu}{\sqrt{d_2}}, \quad \text{and} \quad \mu_d = 16\nu\sqrt{\frac{\log(d_1 d_2)}{d_1 d_2}} + \frac{4\alpha}{\sqrt{d_1 d_2}}, \tag{34}$$

then with probability greater than $1 - \exp(-2\log(d_1 d_2))$, any optimal solution $(\widehat\Theta, \widehat\Gamma)$ satisfies

$$e^2(\widehat\Theta, \widehat\Gamma) \le \underbrace{c_1 \nu^2 \left( \frac{r(d_1 + d_2)}{d_1 d_2} \right)}_{K_{\Theta^\star}} + \underbrace{c_1 \nu^2 \left( \frac{s\log(d_1 d_2)}{d_1 d_2} \right) + c_1\, \frac{\alpha^2 s}{d_1 d_2}}_{K_{\Gamma^\star}}. \tag{35}$$

Remarks: In the statement of this corollary, the settings of $\lambda_d$ and $\mu_d$ are based on upper bounding $\|W\|_\infty$ and $\|W\|_{op}$, using large deviation bounds and some non-asymptotic random matrix theory. With a slightly modified argument, the bound (35) can be sharpened slightly by reducing the logarithmic term to $\log\left(\frac{d_1 d_2}{s}\right)$. As shown in Theorem 2 to follow in Section 3.7, this sharpened bound is minimax-optimal, meaning that no estimator (regardless of its computational complexity) can achieve much better estimates for the matrix classes and noise model given here.

It is also worth observing that both terms in the bound (35) have intuitive interpretations. Considering first the term $K_{\Theta^\star}$, we note that the numerator term $r(d_1 + d_2)$ is of the order of the number of free parameters in a rank $r$ matrix of dimensions $d_1 \times d_2$. The multiplicative factor $\frac{\nu^2}{d_1 d_2}$ corresponds to the noise variance in the problem. On the other hand, the term $K_{\Gamma^\star}$ measures the complexity of estimating $s$ non-zero entries in a $d_1 \times d_2$ matrix. Note that there are $\binom{d_1 d_2}{s}$ possible subsets of size $s$, and consequently, the numerator includes a term that scales as $\log\binom{d_1 d_2}{s} \approx s\log(d_1 d_2)$. As before, the multiplicative pre-factor $\frac{\nu^2}{d_1 d_2}$ corresponds to the noise variance. Finally, the second term within $K_{\Gamma^\star}$, namely the quantity $\frac{\alpha^2 s}{d_1 d_2}$, arises from the non-identifiability of the model; as discussed in Example 6, it cannot be avoided without imposing further restrictions on the pair $(\Gamma^\star, \Theta^\star)$. (A snippet computing the choices (34) is given below.)
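Written out directly, the choices (34) are straightforward to compute; the helper below is an illustrative convenience, not code from the paper:

```python
import numpy as np

def reg_params(d1, d2, nu, alpha):
    """The regularization choices (34) for the identity operator with i.i.d.
    N(0, nu^2/(d1 d2)) noise entries."""
    lam = 8 * nu / np.sqrt(d1) + 8 * nu / np.sqrt(d2)
    mu = (16 * nu * np.sqrt(np.log(d1 * d2) / (d1 * d2))
          + 4 * alpha / np.sqrt(d1 * d2))
    return lam, mu

# Example: these values could be passed to any solver of (10), e.g. the
# proximal-gradient sketch after Example 4, as decompose(Y, lam, mu, alpha).
lam, mu = reg_params(50, 40, nu=0.5, alpha=1.0)
```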
We now turn to an analysis of the sparse factor analysis problem: as previously introduced in Example 1, this involves estimation of a covariance matrix that has a low-rank plus elementwise sparse decomposition. In this case, given $n$ i.i.d. samples from the unknown covariance matrix $\Sigma = \Theta^\star + \Gamma^\star$, the noise matrix $W \in \mathbb{R}^{d\times d}$ is re-centered Wishart noise (see equation (3)). We can use tail bounds for its entries and its operator norm in order to specify appropriate choices of the regularization parameters $\lambda_d$ and $\mu_d$. We summarize our conclusions in the following corollary:

Corollary 3. Consider the factor analysis model with $n \ge d$ samples, and regularization parameters

$$\lambda_d = 16\,\|\sqrt{\Sigma}\|_2\,\sqrt{\frac{d}{n}}, \quad \text{and} \quad \mu_d = 32\,\rho(\Sigma)\sqrt{\frac{\log d}{n}} + \frac{4\alpha}{d}, \quad \text{where } \rho(\Sigma) = \max_j \Sigma_{jj}. \tag{36}$$

Then with probability greater than $1 - c_2\exp(-c_3\log d)$, any optimal solution $(\widehat\Theta, \widehat\Gamma)$ satisfies

$$e^2(\widehat\Theta, \widehat\Gamma) \le c_1\left( \|\Sigma\|_2\,\frac{rd}{n} + \rho(\Sigma)\,\frac{s\log d}{n} \right) + c_1\,\frac{\alpha^2 s}{d^2}.$$

We note that the condition $n \ge d$ is necessary to obtain consistent estimates in factor analysis models, even in the case with $\Gamma^\star = I_{d\times d}$ where PCA is possible (e.g., see Johnstone [15]). Again, the terms in the bound have a natural interpretation: since a matrix of rank $r$ in $d$ dimensions has roughly $rd$ degrees of freedom, we expect to see a term of the order $\frac{rd}{n}$. Similarly, since there are $\binom{d^2}{s}$ subsets of size $s$ in a $d\times d$ matrix, whose logarithm scales as $s\log d$, we also expect to see a term of the order $\frac{s\log d}{n}$. Moreover, although we have stated our choices of regularization parameters in terms of $\|\Sigma\|_2$ and $\rho(\Sigma)$, these can be replaced by the analogous versions using the sample covariance matrix $\widehat\Sigma$. (By the concentration results that we establish, the population and empirical versions do not differ significantly when $n \ge d$.)

3.4.2 Comparison to Hsu et al. [14]

This recent work focuses on the problem of matrix decomposition with the $\|\cdot\|_1$-norm, and provides results both for the noiseless and noisy settings. All of their work focuses on the case of exactly low rank and exactly sparse matrices, and deals only with the identity observation operator; in contrast, Theorem 1 in this paper provides an upper bound for general matrix pairs and observation operators. Most relevant is a comparison of our $\ell_1$-results with exact rank-sparsity constraints to their Theorem 3, which provides various error bounds (in nuclear and Frobenius norm) for such models with additive noise. These bounds are obtained using an estimator similar to our program (10), and in parts of their analysis, they enforce bounds on the $\ell_\infty$-norm of the solution. However, this is not done directly with a constraint on $\Theta$ as in our estimator, but rather by penalizing the difference $\|Y - \Gamma\|_\infty$, or by thresholding the solution. Apart from these minor differences, there are two major differences between our results and those of Hsu et al. First of all, their analysis involves three quantities $(\alpha, \beta, \gamma)$ that measure singular vector incoherence, and must satisfy a number of inequalities. In contrast, our analysis is based only on a single condition: the "spikiness" condition on the low-rank component $\Theta^\star$. As we have seen, this constraint is weaker than singular vector incoherence, and consequently, unlike the result of Hsu et al., we do not provide exact recovery guarantees for the noiseless setting. However, it is interesting to see (as shown by our analysis) that a very simple spikiness condition suffices for the approximate recovery guarantees that are of interest for noisy observation models. Given these differing assumptions, the underlying proof techniques are quite distinct, with our methods leveraging the notion of restricted strong convexity introduced by Negahban et al. [19].

The second (and perhaps most significant) difference is in the sharpness of the results for the noisy setting, and the permissible scalings of the rank-sparsity pair $(r, s)$. As will be clarified in Section 3.7, the rates that we establish for low-rank plus elementwise sparsity for the noisy Gaussian model (Corollary 2) are minimax-optimal up to constant factors. In contrast, the upper bounds in Theorem 3 of Hsu et al. involve the product $rs$, and hence are sub-optimal as the rank and sparsity scale. These terms appear only additively in both our upper and minimax lower bounds, showing that an upper bound involving the product $rs$ is sub-optimal. Moreover, the bounds of Hsu et al. (see Section IV.D) are limited to matrix decompositions for which the rank-sparsity pair $(r, s)$ is bounded as

$$rs \lesssim \frac{d_1 d_2}{\log(d_1)\log(d_2)}. \tag{37}$$

This bound precludes many scalings that are of interest. For instance, if the sparse component $\Gamma^\star$ has a nearly constant fraction of non-zeros (say $s \asymp \frac{d_1 d_2}{\log(d_1)\log(d_2)}$ for concreteness), then the bound (37) restricts $\Theta^\star$ to have constant rank. In contrast, our analysis allows for high-dimensional scaling of both the rank $r$ and sparsity $s$ simultaneously; as can be seen by inspection of Corollary 2, our Frobenius norm error goes to zero under the scalings $s \asymp \frac{d_1 d_2}{\log(d_1)\log(d_2)}$ and $r \asymp \frac{d_2}{\log(d_2)}$.
3.4.3 Results for multi-task regression

Let us now extend our results to the setting of multi-task regression, as introduced in Example 2. The observation model is of the form $Y = XB^* + W$, where $X \in \mathbb{R}^{n\times d_1}$ is a known design matrix, and we observe the matrix $Y \in \mathbb{R}^{n\times d_2}$. Our goal is to estimate the regression matrix $B^* \in \mathbb{R}^{d_1\times d_2}$, which is assumed to have a decomposition of the form $B^* = \Theta^\star + \Gamma^\star$, where $\Theta^\star$ models the characteristics shared among the tasks, and the matrix $\Gamma^\star$ models perturbations away from the shared structure. If we take $\Gamma^\star$ to be a sparse matrix, an appropriate choice of regularizer $\mathcal{R}$ is the elementwise $\ell_1$-norm, as in Corollary 2. We use $\sigma_{\min}$ and $\sigma_{\max}$ to denote the minimum and maximum singular values (respectively) of the rescaled design matrix $X/\sqrt{n}$; we assume that $X$ is invertible so that $\sigma_{\min} > 0$, and moreover, that its columns are uniformly bounded in $\ell_2$-norm, meaning that $\max_{j=1,\ldots,d_1} \|X_j\|_2 \le \kappa_{\max}\sqrt{n}$. We note that these assumptions are satisfied for many common examples of random design.

Corollary 4. Suppose that the matrix $\Theta^\star$ has rank at most $r$ and satisfies $\|\Theta^\star\|_\infty \le \frac{\alpha}{\sqrt{d_1 d_2}}$, and the matrix $\Gamma^\star$ has at most $s$ non-zero entries. If the entries of $W$ are i.i.d. $N(0, \nu^2)$, and we solve the convex program (10) with regularization parameters

$$\lambda_d = 8\nu\,\sigma_{\max}\sqrt{n}\left(\sqrt{d_1} + \sqrt{d_2}\right), \quad \text{and} \quad \mu_d = 16\nu\,\kappa_{\max}\sqrt{n\log(d_1 d_2)} + \frac{4\alpha\,\sigma_{\min}^2\, n}{\sqrt{d_1 d_2}}, \tag{38}$$

then with probability greater than $1 - \exp(-2\log(d_1 d_2))$, any optimal solution $(\widehat\Theta, \widehat\Gamma)$ satisfies

$$e^2(\widehat\Theta, \widehat\Gamma) \le \underbrace{c_1\,\frac{\nu^2\sigma_{\max}^2}{\sigma_{\min}^4}\left( \frac{r(d_1+d_2)}{n} \right)}_{K_{\Theta^\star}} + \underbrace{c_2\left( \frac{\nu^2\kappa_{\max}^2}{\sigma_{\min}^4}\left( \frac{s\log(d_1 d_2)}{n} \right) + \frac{\alpha^2 s}{d_1 d_2} \right)}_{K_{\Gamma^\star}}. \tag{39}$$

Remarks: The results presented above are analogous to those presented in Corollary 2. However, in this setting, we leverage large deviation results in order to find bounds on $\|\mathfrak{X}^*(W)\|_\infty$ and $\|\mathfrak{X}^*(W)\|_{op}$ that hold with high probability given our observation model.

3.5 An alternative two-step method

As suggested by one reviewer, it is possible that a simpler two-step method, based on first thresholding the entries of the observation matrix $Y$, and then performing a low-rank approximation, might achieve similar rates to the more complex convex relaxation (10). In this section, we provide a detailed analysis of one version of such a procedure in the case of the nuclear norm combined with $\ell_1$-regularization. We prove that in the special case of $\mathfrak{X} = I$, this procedure can attain the same form of error bounds, with possibly different constants. However, there is also a cautionary message here: we also give an example to show that the two-step method will not necessarily perform well for general observation operators $\mathfrak{X}$.

In detail, let us consider the following two-step estimator:

(a) Estimate the sparse component $\Gamma^\star$ by solving

$$\widehat\Gamma \in \arg\min_{\Gamma \in \mathbb{R}^{d_1\times d_2}} \left\{ \frac{1}{2}\|Y - \Gamma\|_F^2 + \mu_d\|\Gamma\|_1 \right\}. \tag{40}$$

As is well-known, this convex program has an explicit solution based on soft-thresholding the entries of $Y$.

(b) Given the estimate $\widehat\Gamma$, estimate the low-rank component $\Theta^\star$ by solving the convex program

$$\widehat\Theta \in \arg\min_{\Theta \in \mathbb{R}^{d_1\times d_2}} \left\{ \frac{1}{2}\|Y - \Theta - \widehat\Gamma\|_F^2 + \lambda_d\|\Theta\|_N \right\}. \tag{41}$$

Interestingly, note that this method can be understood as the first two steps of a blockwise coordinate descent method for solving the convex program (10). (A direct rendering is sketched below.)
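A direct rendering of the two-step procedure is given below; both steps have closed-form prox solutions (entrywise soft-thresholding for (40), and, by a standard fact, singular value soft-thresholding for (41)), and the helper names are ours:

```python
import numpy as np

def soft(M, t):
    """Entrywise soft-thresholding: the closed-form solution of (40)."""
    return np.sign(M) * np.maximum(np.abs(M) - t, 0.0)

def svd_soft(M, t):
    """Singular value soft-thresholding: the closed-form solution of (41)."""
    U, sig, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(sig - t, 0.0)) @ Vt

def two_step(Y, lam, mu):
    Gamma_hat = soft(Y, mu)                     # step (a)
    Theta_hat = svd_soft(Y - Gamma_hat, lam)    # step (b)
    return Theta_hat, Gamma_hat
```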
In step (a), we fix the low-rank component and minimize as a function of the sparse component. In step (b), we fix the sparse component and then minimize as a function of the low-rank component. The following result shows that these two steps of coordinate descent achieve the same rates (up to constant factors) as solving the full convex program (10):

Proposition 1. Given observations $Y$ from the model $Y = \Theta^\star + \Gamma^\star + W$ with $\|\Theta^\star\|_\infty \le \frac{\alpha}{\sqrt{d_1 d_2}}$, consider the two-step procedure (40) and (41) with regularization parameters $(\lambda_d, \mu_d)$ such that

$$\lambda_d \ge 4\,\|W\|_{op}, \quad \text{and} \quad \mu_d \ge 4\,\|W\|_\infty + \frac{4\alpha}{\sqrt{d_1 d_2}}. \tag{42}$$

Then the error bound (30) from Corollary 1 holds with $\gamma = 1$.

Consequently, in the special case that $\mathfrak{X} = I$, there is no need to solve the convex program (10) to optimality; rather, two steps of coordinate descent are sufficient. On the other hand, the simple two-stage method will not work for general observation operators $\mathfrak{X}$. As shown in the proof of Proposition 1, the two-step method relies critically on having the quantity $\|\mathfrak{X}(\Theta^\star + W)\|_\infty$ be upper bounded (up to constant factors) by $\max\{\|\Theta^\star\|_\infty, \|W\|_\infty\}$. By the triangle inequality, this condition holds trivially when $\mathfrak{X} = I$, but it can be violated by other choices of the observation operator, as illustrated by the following example.

Example 7 (Failure of the two-step method). Recall the multi-task observation model first introduced in Example 2. In Corollary 4, we showed that the general estimator (10) will recover good estimates under certain assumptions on the observation matrix. In this example, we provide an instance for which the assumptions of Corollary 4 are satisfied, but the two-step method will not return a good estimate. More specifically, let us consider the observation model $Y = \mathfrak{X}(\Theta^\star + \Gamma^\star) + W$, in which $Y \in \mathbb{R}^{d\times d}$, and the observation matrix $X \in \mathbb{R}^{d\times d}$ takes the form

$$X := I_{d\times d} + \frac{1}{\sqrt{d}}\, e_1 \vec{1}^{\,T},$$

where $e_1 \in \mathbb{R}^d$ is the standard basis vector with a 1 in the first component, and $\vec{1}$ denotes the vector of all ones. Suppose that the unknown low-rank matrix is given by $\Theta^\star = \frac{1}{d}\vec{1}\vec{1}^{\,T}$. Note that this matrix has rank one, and satisfies $\|\Theta^\star\|_\infty = \frac{1}{d}$.

We now verify that the conditions of Corollary 4 are satisfied. Letting $\sigma_{\min}$ and $\sigma_{\max}$ denote (respectively) the smallest and largest singular values of $X$, we have $\sigma_{\min} = 1$ and $\sigma_{\max} \le 2$. Moreover, letting $X_j$ denote the $j$-th column of $X$, we have $\max_{j=1,\ldots,d}\|X_j\|_2 \le 2$. Consequently, if we consider rescaled observations with noise variance $\nu^2/d$, the conditions of Corollary 4 are all satisfied with constants (independent of dimension), so that the $M$-estimator (10) will have good performance. On the other hand, letting $\mathbb{E}$ denote expectation over any zero-mean noise matrix $W$, we have

$$\mathbb{E}\left[\|\mathfrak{X}(\Theta^\star + W)\|_\infty\right] \overset{(i)}{\ge} \|\mathfrak{X}(\Theta^\star + \mathbb{E}[W])\|_\infty = \|\mathfrak{X}(\Theta^\star)\|_\infty \overset{(ii)}{\ge} \sqrt{d}\,\|\Theta^\star\|_\infty,$$

where step (i) exploits Jensen's inequality, and step (ii) uses the fact that $\|\mathfrak{X}(\Theta^\star)\|_\infty = \frac{1}{d} + \frac{1}{\sqrt{d}} = (1 + \sqrt{d})\,\|\Theta^\star\|_\infty$. For any noise matrix $W$ with reasonable tail behavior, the variable $\|\mathfrak{X}(\Theta^\star + W)\|_\infty$ will concentrate around its expectation, showing that $\|\mathfrak{X}(\Theta^\star + W)\|_\infty$ will be larger than $\|\Theta^\star\|_\infty$ by an order of magnitude (a factor of $\sqrt{d}$). Consequently, the two-step method will have much larger estimation error in this case. (A short numerical check of this blow-up appears below.) ♣
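The $\sqrt{d}$ blow-up is easy to confirm numerically (the dimension below is an arbitrary illustration):

```python
import numpy as np

# Numerical check of Example 7: with X = I + e1 1^T / sqrt(d), the elementwise
# max of X @ Theta_star exceeds that of Theta_star by a factor of 1 + sqrt(d).
d = 400
X = np.eye(d)
X[0, :] += 1.0 / np.sqrt(d)                    # X = I + e1 1^T / sqrt(d)
Theta_star = np.full((d, d), 1.0 / d)          # rank one, ||Theta*||_inf = 1/d

ratio = np.abs(X @ Theta_star).max() / np.abs(Theta_star).max()
assert np.isclose(ratio, 1 + np.sqrt(d))       # a factor of ~21 for d = 400
```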
♣ 3.6 Results for k · k 2 , 1 regularization Let us return again to the general Theorem 1, and illustrate some more of its consequences in application to the column wise (2 , 1)-norm previously defin ed in Example 5, and metho ds b ased on solving the con v ex program (14). As b efore, sp ecia lizing Theorem 1 to this decomp osable regularizer yields a num b er of guaran tees. In order to kee p ou r pr esen tation relativ ely brief, w e f o cus h ere on the case of the iden tit y observ ation op erator X = I . Corollary 5. Supp ose that we solve the c onvex pr o gr am (14) with r e gularization p ar ameters ( λ d , µ d ) such that λ d ≥ 4 | | | W | | | op , and µ d ≥ 4 k W k 2 , ∞ + 4 α √ d 2 . (43) Then ther e is a u niversal c onstant c 1 such that for any matrix p air (Θ ⋆ , Γ ⋆ ) with k Θ ⋆ k 2 , ∞ ≤ α √ d 2 and for al l inte gers r = 1 , 2 , . . . , d and s = 1 , 2 , . . . , d 2 , we have | | | b Θ − Θ ⋆ | | | 2 F + | | | b Γ − Γ ⋆ | | | 2 F ≤ c 1 λ 2 d  r + 1 λ d d X j = r +1 σ j (Θ ⋆ )  + c 1 µ 2 d  s + 1 µ d X k / ∈ C k Γ ⋆ k k 2  , (44) 18 wher e C ⊆ { 1 , 2 , . . . , d 2 } is an arbitr ary subset of c olumn indic es of c ar dinality at most s . Remarks: This result follo ws directly b y sp ecializing Theorem 1 to the column wise (2 , 1)- norm and iden tit y observ ation mo del, previously discussed in Example 5. I ts d ual n orm is th e column wise (2 , ∞ )-norm, and we ha v e κ d = √ d 2 . As discussed in Section 3.1, the (2 , 1)-norm is decomposable with resp ect to subspaces of the t yp e M ( C ), as defined in equation (17), where C ⊆ { 1 , 2 , . . . , d 2 } is an arbitrary subset of column indices. F or an y suc h sub set C of cardinalit y s , it can b e calculated that Ψ 2 ( M ( C )) = s , and moreo ver, that k Π M ⊥ (Γ ⋆ ) k 2 , 1 = P k / ∈ C k Γ ⋆ k k 2 . Consequen tly , the b ound (44) follo ws from Theorem 1. As b efore, if we assume that Θ ⋆ has exactly r ank r and Γ ⋆ has at most s non-zero columns , then b oth appr oximati on error terms in the b ou n d (44) v anish, and w e reco v er an u p p er b oun d of the form | | | b Θ − Θ ⋆ | | | 2 F + | | | b Γ − Γ ⋆ | | | 2 F . λ 2 d r + µ 2 d s . If w e fur ther sp ecialize to the case of exact observ ations ( W = 0), then Corollary 5 guarantees that | | | b Θ − Θ ⋆ | | | 2 F + | | | b Γ − Γ ⋆ | | | 2 F . α 2 s d 2 . The follo wing example sho ws, that giv en our conditions, ev en in the noiseless setting, no metho d can reco v er the matrices to p recision more accurate th an α 2 s/d 2 . Example 8 (Unimpr ov abilit y for columnwise sparse mo del) . In ord er to demonstrate that the term α 2 s/d 2 is una voidable, it suffices to consider a sligh t mo dification of Example 6. I n particular, let us define the matrix Θ ⋆ : = α √ d 1 d 2      1 1 . . . 1       1 1 1 . . . 0 . . . 0  | {z } f T , (45) where aga in the v ector f ∈ R d 2 has s non-zeros. Note that the matrix Θ ⋆ is r ank on e, has s n on-zero columns, and moreo v er k Θ ⋆ k 2 , ∞ = α √ d 2 . Consequen tly , the matrix Θ ⋆ is co vered by Corollary 5. Sin ce s columns of the matrix Γ ⋆ can b e c hosen in an arbitrary manner, it is p ossible that Γ ⋆ = − Θ ⋆ , in whic h case th e obs er v ation matrix Y = 0. This fact can b e exploited to sho w that no metho d can ac hieve squared F rob en ius error muc h smaller than ≈ α 2 s d 2 ; see S ection 3.7 for the precise statemen t. Finally , we n ote th at it is difficult to compare d irectly to the results of Xu et al. [29], since their results do not guaran tee exact reco very of the pair (Θ ⋆ , Γ ⋆ ). 
♣ As with the case of elemen twise ℓ 1 -norm, more concrete results can b e obtained when the noise m atrix W is s to chastic. Corollary 6. Supp ose Θ ⋆ has r ank at mo st r and satisfies k Θ ⋆ k 2 , ∞ ≤ α √ d 2 , and Γ ⋆ has at most s non-zer o c olumns. If the noise matrix W ha s i.i.d. N (0 , ν 2 / ( d 1 d 2 )) entries, and we solve the c onvex pr o gr am (14) with r e gularization p ar ameters λ d = 8 ν √ d 1 + 8 ν √ d 2 and µ d = 8 ν r 1 d 2 + r log d 2 d 1 d 2 + 4 α √ d 2 , 19 then with pr ob ability gr e ater than 1 − exp  − 2 log ( d 2 )  , any optimal solution ( b Θ , b Γ) satisfies e 2 ( b Θ , b Γ) ≤ c 1 ν 2 r ( d 1 + d 2 ) d 1 d 2 | {z } K Θ ⋆ + ν 2  sd 1 d 1 d 2 + s log d 2 d 1 d 2  + c 2 α 2 s d 2 | {z } K Γ ⋆ . (46) Remarks: Note that th e setting of λ d is the same as in Corollary 2, wher eas the param- eter µ d is chosen b ased on upp er b ound ing k W k 2 , ∞ , corresp ondin g to the d ual n orm of the column wise (2 , 1)-norm. With a sligh tly modified argument, the b ound (46 ) can b e s h arp- ened sligh tly by red u cing the logarithmic term to log( d 2 s ). As shown in T h eorem 2 to f ollo w in S ection 3.7, this sh arp ened b ound is minimax-optimal. As with C orollary 2, b oth terms in the b ound (46) are readily int erpreted. The term K Θ ⋆ has the same inte rpretation, as a com b ination of the num b er of degrees of freedom in a r ank r matrix (that is, of the order r ( d 1 + d 2 )) scaled by the noise v ariance ν 2 d 1 d 2 . The second term K Γ ⋆ has a somewhat more su btle interpretatio n. The problem of estimating s non-zero columns em b edded within a d 1 × d 2 matrix can be split in to t wo sub-problems: fi r st, the problem of estimating the s d 1 non-zero p arameters (in F rob enius norm), and second, the problem of column subset selectio n—i.e., determining the lo cation of the s non-zero p arameters. The es- timation sub-pr ob lem yields the term ν 2 sd 1 d 1 d 2 , whereas the column su bset selection sub-problem incurs a p enalt y inv olving log  d 2 s  ≈ s log d 2 , multiplie d b y the usual noise v ariance. The fi n al term α 2 s/d 2 arises from the non-identi fiabilit y of the mo del. As discuss ed in Example 8, it is una v oidable without fur ther restrictions. W e now turn to some consequen ces for the problem of robu st co v ariance estimation for- m ulated in Examp le 3. As seen from equation (4), the disturb an ce matrix in this setting can b e w ritten as a sum (Γ ⋆ ) T + Γ ⋆ , wh ere Γ ⋆ is a column-wise sparse matrix. Conse- quen tly , we can u se a v ariant of the esti mator (14), in whic h the loss fun ction is giv en b y | | | Y − { Θ ⋆ + (Γ ⋆ ) T + Γ ⋆ }| | | 2 F . The follo w in g r esult sum marizes the consequences of Theorem 1 in this setting: Corollary 7. Consider the pr oblem of r obust c ovarianc e estimation with n ≥ d samples, b ase d on a mat rix Θ ⋆ with r ank at most r that satisfies k Θ ⋆ k 2 , ∞ ≤ α √ d , and a c orrupting matrix Γ ⋆ with at most s r ows and c olumns c orrupte d. If we solve SDP (14) with r e gularization p ar ameters λ 2 d = 8 | | | Θ ⋆ | | | 2 op r n , and µ 2 d = 8 | | | Θ ⋆ | | | 2 op r n + 16 α 2 d , (47) then with pr ob ability gr e ater than 1 − c 2 exp  − c 3 log( d )  , any optimal solution ( b Θ , b Γ) satisfies e 2 ( b Θ , b Γ) ≤ c 1 | | | Θ ⋆ | | | 2 op  r 2 n + sr n  + c 2 α 2 s d . 
Some commen ts ab out th is result: with the motiv ation of b eing concrete, we ha v e giv en an explicit c h oice (47) of the regularization parameters, in volving th e op erator norm | | | Θ ⋆ | | | op , but an y upp er b ound would su ffice. As with th e noise v ariance in Corollary 6, a t ypical strategy w ould choose this pre-factor by cross-v alidation. 20 3.7 Lo wer b ounds F or the case of i.i.d Gaussian n oise matrices, Corollaries 2 and 6 provide results of an ac h iev able nature, namely in guaran teeing that our estimato rs achiev e certain F rob enius errors. In this section, w e turn to the complemen tary qu estion: what are the fundamental (algorithmic- indep end en t) limits of accuracy in noisy matrix decomp osition? One wa y in whic h to address suc h a question is by analyzing stati stical minimax r ates. More formally , giv en some family F of matrices, the asso ciated minimax error is giv en by M ( F ) : = inf ( e Θ , e Γ) sup (Θ ⋆ , Γ ⋆ ) E  | | | e Θ − Θ ⋆ | | | 2 F + | | | e Γ − Γ ⋆ | | | 2 F  , (48) where th e infimum ranges o ver all estimators ( e Θ , e Γ) that are (measurable) fu nctions of the data Y , and the supremum ranges o v er all p airs (Θ ⋆ , Γ ⋆ ) ∈ F . Here the exp ectation is tak en o ver the Gaussian noise matrix W , u nder the linear observ ation mo del (1 ). Giv en a matrix Γ ⋆ , we define its su pp ort set supp(Γ ⋆ ) : = { ( j, k ) | Γ ⋆ j k 6 = 0 } , as w ell as its column su p p ort colsupp (Γ ⋆ ) : = { k | Γ ⋆ k 6 = 0  , where Γ ⋆ k denotes the k th column. Usin g this n otation, our in terest cen ters on the follo wing t w o matrix families: F sp ( r , s, α ) : =  (Θ ⋆ , Γ ⋆ ) | rank(Θ ⋆ ) ≤ r , | su pp(Γ ⋆ ) | ≤ s, k Θ ⋆ k ∞ ≤ α √ d 1 d 2  , and (4 9a) F col ( r , s, α ) : =  (Θ ⋆ , Γ ⋆ ) | rank(Θ ⋆ ) ≤ r , | colsup p(Γ ⋆ ) | ≤ s, k Θ ⋆ k 2 , ∞ ≤ α √ d 2  . (49b) By constr u ction, Corollaries 2 and 6 apply to the families F sp and F col resp ectiv ely . The follo wing theorem establishes lo wer b ound s on the minim ax risks (in squared F r ob enius norm) ov er these t wo families for the iden tit y observ ation op erator: Theorem 2. Consider the line ar observation mo del (1) with identity observation op er ator: X (Θ + Γ) = Θ + Γ . Ther e is a universal c onstant c 0 > 0 such that for al l α ≥ 32 p log( d 1 d 2 ) , we have M ( F sp ( r , s, α )) ≥ c 0 ν 2  r ( d 1 + d 2 ) d 1 d 2 + s log ( d 1 d 2 − s s/ 2 ) d 1 d 2  + c 0 α 2 s d 1 d 2 , (50) and M ( F col ( r , s, α )) ≥ c 0 ν 2  r ( d 1 + d 2 ) d 1 d 2 + s d 2 + s log ( d 2 − s s/ 2 ) d 1 d 2  + c 0 α 2 s d 2 . (51) Note th e agreemen t with the ac hiev able rates guarant eed in C orollaries 2 and 6 resp ectiv ely . (As discussed in the remarks follo wing these corollaries, the sharp ened forms of the logarithmic factors follo w by a more carefu l analysis.) Th eorem 2 sho ws that in terms of squared F rob enius error, the con v ex relaxatio ns (10) and (14) are minimax optimal u p to constan t factors. In addition, it is w orth observing that although T heorem 2 is s tated in the con text of additiv e Gaussian noise, it also shows that the radius of non-iden tifiabilit y (in volving the parameter α ) is a fundamen tal limit. In particular, b y setting the noise v ariance to zero, w e see that und er our milder conditions, even in the n oiseless setting, no algorithm can estimate to greater accuracy th an c 0 α 2 s d 1 d 2 , or the analogous quan tit y f or column-sp ars e matrices. 
21 4 Sim ulation results W e ha v e imp lemented the M -estimators based on the con v ex programs (10) and (14), in particular by adapting first-order optimization metho ds du e to Nestero v [22]. In this section, w e rep ort simulat ion results th at d emonstrate th e excellen t agreement b et ween our theoretical predictions and the b eha vior in practice. In all cases, w e used squ are matrices ( d = d 1 = d 2 ), and a sto c h astic noise matrix W with i.i.d. N (0 , ν 2 d 2 ) en tries, with ν 2 = 1. F or an y giv en rank r , we generated Θ ⋆ b y randomly choosing the spaces of left and righ t sin gu lar v ectors. W e formed random s parse (elemen t wise or columnwise) matrices b y choosing the p ositions of the non-zeros (en tries or column s) un iformly at random. Recall th e estimator (10) from Example 4. It is based on a com bin ation of th e nuclea r norm with the element wise ℓ 1 -norm, and is motiv ated p roblem of reco v ering a low-rank matrix Θ ⋆ corrupted by an arbitrary sparse matrix Γ ⋆ . In our first set of exp eriments, w e fixed the matrix dimension d = 100, and then studied a range of r anks r for Θ ⋆ , as w ell as a range of sparsit y indices s for Γ ⋆ . More sp ecifically , we studied linear scalings of the form r = γ d for a constant γ ∈ (0 , 1), and s = β d 2 for a second constan t β ∈ (0 , 1). Note that under this s caling, Corollary 2 predicts that th e squared F rob en iu s error s h ould b e upp er b ounded as c 1 γ + c 2 β log (1 /β ), for some u n iv ersal constants c 1 , c 2 . Figure 1(a) pro vides exp erimental confirmation of the accuracy of these th eoretical predictions: v arying γ (with β fixed) pr o duces linear gro w th of th e squared error as a f u nction of γ . In Figure 1(b), w e study the complemen tary scaling, with the rank ratio γ fixed and the sparsit y ratio β v arying in the in terv al [ . 01 , . 1]. S ince β log(1 /β ) ≈ Θ( β ) o ver this in terv al, we should exp ect to see roughly linear scaling. Again, the p lot sho ws go o d agreemen t with the theoreti cal predictions. 10 20 30 40 50 10 15 20 25 30 35 40 E r r o r v e r s u s r a n k R a n k Fro b e n i u s n o r m e r r o r s q u a r e d 0 200 400 600 800 1000 0.5 1 1.5 2 2.5 3 3.5 E r r o r v e r s u s s p a r s i t y S p a r s i t y in d e x k Fro b e n i u s n o r m e r r o r s q u a r e d (a) (b) Figure 1. Behavior o f the estimato r (10 ). (a) P lot of the squared F rob enius error e 2 ( b Θ , b Γ) versus the rank ratio γ ∈ { 0 . 0 5 : 0 . 0 5 : 0 . 5 0 } , for matrices of size 1 0 0 × 10 0 and s = 2 171 corrupted en tries. The growth o f the squared er ror is linear in γ , as predicted b y the theory . (b) Plot o f the squared F r ob enius er ror e 2 ( b Θ , b Γ) versus the spars it y para meter β ∈ [0 . 01 , 0 . 1 ] for matrices o f s ize 100 × 10 0 and rank r = 10. Consistent with the theory , the square d err o r scales approximately linearly in β in a neighbo rho o d a round zero . 22 100 200 300 400 500 3 3.5 4 4.5 5 5.5 d Fro b e n i u s n o r m e r r o r s q u a r e d S q u a r e d e r r o r v e r s u s d r = 10 r = 15 100 200 300 400 500 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 d Fro b e n i u s n o r m e r r o r s q u a r e d I nv e r s e s q u a r e d e r r o r v e r s u s d r = 10 r = 15 (a) (b) Figure 2. Be havior o f the es timator (14). (a) Plot of the sq uared F ro be nius err or e 2 ( b Θ , b Γ) versus the dimension d ∈ { 100 : 25 : 300 } , for tw o differen t choices of the ra nk ( r = 10 a nd r = 1 5). 
(b) Plot of the inverse squa r ed F ro b e nius error versus the dimension, confirming the linear scaling in d predicted by theor y . In a dditio n, the curve for r = 15 r equires a ma trix dimension that is 3 / 2 times la r ger to reach the same erro r a s the curve for r = 10 , consistent with theory . No w recall th e estimator (14) from Example 5, designed for estimating a lo w -rank matrix plus a columnwise sp arse m atrix. W e h a ve observed similar linear dep endence on the analogs of the p arameters γ and β , as predicted by our theory . In the inte rests of exhibiting a different phenomenon, h er e we rep ort its p erformance for m atrices of v arying dimension, in all cases with Γ ⋆ ha ving s = 3 r n on-zero co lumns. Figure 2(a) shows plots of squ ared F rob enius error versus the dimension for t w o c hoices of the rank ( r = 10 and r = 15), and the matrix dimension v arying in the range d ∈ { 100 : 25 : 300 } . As predicted b y our theory , these plots decrease at the rate 1 /d . Indeed, this scal ing is rev ealed by replotting the inv erse squared error v ersus d , whic h prod uces the roughly linear plots shown in p anel (b). Moreo ver, by comparing the relati v e slop es of these t w o curv es, w e see that th e problem with r ank r = 15 requires roughly a dimension that is roughly 3 2 larger th an the p r oblem with r = 10 to achiev e the same error. Again, this linear s caling in rank is consisten t with Corollary 6. 5 Pro ofs In this section, we provide the p ro ofs of our main results, with th e pro ofs of some m ore tec hn ical lemmas deferred to the app endices. 5.1 Pro of of Theorem 1 F or the reader’s con ve nience, let us recall here the t w o assumptions on the regularization parameters: µ d ≥ 4 R ∗ ( X ∗ ( W )) + 4 γ α κ d > 0 , and λ d ≥ 4 | | | X ∗ ( W ) | | | op . (52) 23 F u r thermore, so as to s im p lify notat ion, let us define the error matrices b ∆ Θ : = b Θ − Θ ⋆ and b ∆ Γ : = b Γ − Γ ⋆ . Let ( M , M ⊥ ) denote an arbitrary subspace pair for wh ic h the regularizer R is decomp osable. Throughout these p ro ofs, w e adopt the con v enien t shorthand notation b ∆ Γ M : = Π M ( b ∆ Γ ) and b ∆ Γ M ⊥ = Π M ⊥ ( b ∆ Γ ), w ith similar definitions for Γ ⋆ M and Γ ⋆ M ⊥ . W e no w tu r n to a lemma that deals with the b eha vior of the err or matrices ( b ∆ Θ , b ∆ Γ ) wh en measured together using a weigh ted s um of th e nucle ar norm and regularizer R . I n order to state the follo wing lemma, let us recall th at for any p ositiv e ( µ d , λ d ), th e w eigh ted norm is defined as Q (Θ , Γ) : = | | | Θ | | | N + µ d λ d R (Γ). With th is notation, w e ha ve the follo wing: Lemma 1. F or any r = 1 , 2 , . . . , d , ther e is a de c omp osition b ∆ Θ = b ∆ Θ A + b ∆ Θ B such that: (a) The de c omp osition satisfies rank( b ∆ Θ A ) ≤ 2 r , and ( b ∆ Θ A ) T b ∆ Θ B = ( b ∆ Θ B ) T b ∆ Θ A = 0 . (b) The differ enc e Q (Θ ⋆ , Γ ⋆ ) − Q (Θ ⋆ + b ∆ Θ , Γ ⋆ + b ∆ Γ ) is upp er b ounde d by Q ( b ∆ Θ A , b ∆ Γ M ) − Q ( b ∆ Θ B , b ∆ Γ M ⊥ ) + 2 d X j = r +1 σ j (Θ ⋆ ) + 2 µ d λ d R (Γ ⋆ M ⊥ ) . (53) (c) Under c onditions (52) on µ d and λ d , the err or matric es b ∆ Θ and b ∆ Γ satisfy Q  b ∆ Θ B , b ∆ Γ M ⊥  ≤ 3 Q  b ∆ Θ A , b ∆ Γ M  + 4  d X j = r +1 σ j (Θ ⋆ ) + µ d λ d R (Γ ⋆ M ⊥ )  . (54) for any R - de c omp osable p air ( M , M ⊥ ) . See App endix A for the pro of of this resu lt. Our second lemma guaran tees that the cost f unction L (Θ , Γ) = 1 2 | | | Y − X (Θ + Γ) | | | 2 F is strongly con v ex in a r estricted set of directions. 
In p articular, if w e let δ L ( b ∆ Θ , b ∆ Γ ) denote the error in the first-order T aylo r series exp ansion around (Θ ⋆ , Γ ⋆ ), th en some alg ebra sho w s that δ L ( b ∆ Θ , b ∆ Γ ) = 1 2 | | | X ( b ∆ Θ + b ∆ Γ ) | | | 2 F . (55) The follo wing lemma sho ws that (up to a s lack term) this T a ylor error is lo w er b ound ed b y the squared F r ob enius norm. Lemma 2 (Restricted strong con ve xit y) . Under the c onditions of The or em 1, the first-or der T aylor series err or (55) is lower b ounde d by γ 4  | | | b ∆ Θ | | | 2 F + | | | b ∆ Γ | | | 2 F  − λ d 2 Q ( b ∆ Θ , b ∆ Γ ) − 16 τ n n d X j = r +1 σ j (Θ ⋆ ) + µ d λ d R (Γ ⋆ M ⊥ ) o 2 . (56) 24 W e pr ov e this result in App endix B. Using Lemmas 1 and 2, we can n o w complete the pro of of Theorem 1. By the optimalit y of ( b Θ , b Γ) and the feasibilit y of (Θ ⋆ , Γ ⋆ ), we ha v e 1 2 | | | Y − X ( b Θ + b Γ) | | | 2 F + λ d | | | b Θ | | | N + µ d R ( b Γ) ≤ 1 2 | | | Y − X (Θ ⋆ + Γ ⋆ ) | | | 2 F + λ d | | | Θ ⋆ | | | N + µ d R (Γ ⋆ ) . Recalling that Y = X (Θ ⋆ + Γ ⋆ ) + W , and re-arranging in terms of the errors b ∆ Θ = b Θ − Θ ⋆ and b ∆ Γ = b Γ − Γ ⋆ , w e obtain 1 2 | | | X ( b ∆ Θ + b ∆ Γ ) | | | 2 F ≤ h h b ∆ Θ + b ∆ Γ , X ∗ ( W ) i i + λ d Q (Θ ⋆ , Γ ⋆ ) − λ d Q (Θ ⋆ + b ∆ Θ , Γ ⋆ + b ∆ Γ ) , where th e weigh ted norm Q wa s p reviously d efined (20). W e n o w su bstitute inequ ality (53) from Lemm a 1 in to the righ t-hand-side of the ab o ve equation to obtain 1 2 | | | X ( b ∆ Θ + b ∆ Γ ) | | | 2 F ≤ h h b ∆ Θ + b ∆ Γ , X ∗ ( W ) i i + λ d Q ( b ∆ Θ A , b ∆ Γ M ) − λ d Q ( b ∆ Θ B , b ∆ Γ M ⊥ ) + 2 λ d d X j = r +1 σ j (Θ ⋆ ) + 2 µ d R (Γ ⋆ M ⊥ ) Some algebra and an application of H¨ older’s inequalit y and the triangle inequ alit y allo ws u s to obtain the upp er b ound  | | | b ∆ Θ A | | | N + | | | b ∆ Θ B | | | N  | | | X ∗ ( W ) | | | op +  R ( b ∆ Γ M ) + R ( b ∆ Γ M ⊥ )  R ∗ ( X ∗ ( W )) − λ d Q ( b ∆ Θ B , b ∆ Γ M ⊥ ) + λ d Q ( b ∆ Θ A , b ∆ Γ M ) + 2 λ d d X j = r +1 σ j (Θ ⋆ ) + 2 µ d R (Γ ⋆ M ⊥ ) . Recalling conditions (52) for µ d and λ d , w e obtain the inequalit y 1 2 | | | X ( b ∆ Θ + b ∆ Γ ) | | | 2 F ≤ 3 λ d 2 Q ( b ∆ Θ A , b ∆ Γ M ) + 2 λ d d X j = r +1 σ j (Θ ⋆ ) + 2 µ d R (Γ ⋆ M ⊥ ) . Using inequ alit y (56) from Lemma 2 to lo w er b oun d the righ t-hand side, and then rearranging terms yields γ 4  | | | b ∆ Θ | | | 2 F + | | | b ∆ Γ | | | 2 F  ≤ 3 λ d 2 Q ( b ∆ Θ A , b ∆ Γ M ) + λ d 2 Q ( b ∆ Θ , b ∆ Γ ) + 16 τ n  d X j = r +1 σ j (Θ ⋆ ) + µ d λ d R (Γ ⋆ M ⊥ )  2 + 2 λ d d X j = r +1 σ j (Θ ⋆ ) + 2 µ d R (Γ ⋆ M ⊥ ) . (57) No w note that by the triangle in equalit y Q ( b ∆ Θ , b ∆ Γ ) ≤ Q ( b ∆ Θ A , b ∆ Γ M ) + Q ( b ∆ Θ B , b ∆ Γ M ⊥ ), so that com b ined with the b ound (53) from Lemma 1, we obtain Q ( b ∆ Θ , b ∆ Γ ) ≤ 4 Q ( b ∆ Θ A , b ∆ Γ M ) + 4 { d X j = r +1 σ j (Θ ⋆ ) + µ d λ d R (Γ ⋆ M ⊥ ) } . 25 Substituting this upp er b ound in to equation (57 ) yields γ 4  | | | b ∆ Θ | | | 2 F + | | | b ∆ Γ | | | 2 F  ≤ 4 Q ( b ∆ Θ A , b ∆ Γ M ) + 16 τ n  d X j = r +1 σ j (Θ ⋆ ) + µ d λ d R (Γ ⋆ M ⊥ )  2 + 4  λ d d X j = r +1 σ j (Θ ⋆ ) + µ d R (Γ ⋆ M ⊥ )  . (58) Noting that b ∆ Θ A has ran k at most 2 r and that b ∆ Γ M lies in the mod el space M , we fin d th at λ d Q ( b ∆ Θ A , b ∆ Γ M ) ≤ √ 2 r λ d | | | b ∆ Θ A | | | F + Ψ( M ) µ d | | | b ∆ Γ M | | | F ≤ √ 2 r λ d | | | b ∆ Θ | | | F + Ψ( M ) µ d | | | b ∆ Γ | | | F . Substituting the ab ov e inequalit y int o equation (58) and r earr an ging the terms in v olving e 2 ( b ∆ Θ , b ∆ Γ ) yields the claim. 
5.2 Pro of of Corollaries 2 and 4 Note that Corollary 2 can b e view ed as a sp ecial case of Corollary 4, in wh ic h n = d 1 and X = I d 1 × d 1 . Consequen tly , w e may pro v e th e latter result, and then obtain the form er resu lt with this sp ecializati on. Recall that we let σ min and σ max denote (resp ectiv ely) the minimum and maxim um eigen v alues of X , and that κ max = max j =1 ,...,d 1 k X j k 2 denotes the maxim um ℓ 2 -norm o v er the columns . (In the sp ecial case X = I d 1 × d 2 , w e hav e σ min = σ max = κ max = 1.) Both corollaries are based on the regularizer, R ( · ) = k · k 1 , and the asso ciated dual n orm R ∗ ( · ) = k · k ∞ . W e need to v erify that the stated c hoices of ( λ d , µ d ) satisfy the require- men ts (29) of Corollary 1. Give n our assump tions on the p air ( X , W ), a little ca lculation sho ws that the matrix Z : = X T W ∈ R d 1 × d 2 has indep enden t columns, with eac h column Z j ∼ N (0 , ν 2 X T X n ). Since | | | X T X | | | op ≤ σ 2 max , kno wn results on the singu lar v alues of Gaus- sian rand om matrices [10] imp ly that P  | | | X T W | | | op ≥ 4 ν σ max ( √ d 1 + √ d 2 ) √ n  ≤ 2 exp  − c ( d 1 + d 2 )  . Consequent ly , s etting λ d ≥ 16 ν σ max ( √ d 1 + √ d 2 ) √ n ensures that the requir emen t (26) is satisfied. As for the associated requirement for µ d , it suffices to upp er b ound the element w ise ℓ ∞ norm of X T W . Since the ℓ 2 norm of th e columns of X are b ound ed by κ max , the en tries of X T W are i.i.d. and Gaussian with v ariance at most ( ν κ max ) 2 /n . Consequently , the standard Gaussian tail b ound com bined with u n ion b oun d yields P  k X T W k ∞ ≥ 4 ν κ max √ n log( d 1 d 2 )  ≤ exp ( − log d 1 d 2 ) , from whic h we conclude that the stated choice s of ( λ d , µ d ) are v alid with high probabilit y . T urning now to th e RS C condition, w e n ote that in the case of multiv ariate regression, we ha v e 1 2 | | | X (∆) | | | 2 F = 1 2 | | | X ∆ | | | 2 F ≥ σ 2 min 2 | | | ∆ | | | 2 F , sho wing that the RSC condition holds with γ = σ 2 min . In order to obtain the sharp er result for X = I d 1 × d 1 in Corollary 2—in which log ( d 1 d 2 ) is replaced b y the sm aller quantit y log( d 1 d 2 s )— we n eed to b e more careful in up p er b ounding the noise term h h W , b ∆ Γ i i . W e refer the reader to App endix C.1 for details of this argumen t. 26 5.3 Pro of of Corollary 3 F or this mo del, the n oise matrix is recen tered Wishart noise—namely , W = 1 n P n i =1 Z i Z T i − Σ , where eac h Z i ∼ N (0 , Σ). Letting U i ∼ N (0 , I d × d ) b e i.i.d. Gaussian ran d om v ectors, we ha v e | | | W | | | op = | | | √ Σ  1 n n X i =1 U i U i − I d × d  √ Σ | | | op ≤ | | | Σ | | | op | | | 1 n n X i =1 U i U T i − I d × d | | | op ≤ 4 | | | Σ | | | op r d n , where the final b ound holds w ith probabilit y greater than 1 − 2 exp( − c 1 d ), u sing standard tail b ounds on Gaussian random matrices [10]. Thus, w e see that the sp ecified c h oice (36) of λ d is v alid for T heorem 1 with h igh pr obabilit y . W e no w turn to the c h oice of µ d . The ent ries of W are p ro ducts of Gauss ian v ariables, and hence ha v e sub -exp onen tial tails (e.g., [3]). Th erefore, for an y en try ( i, j ), we h a ve the tail b oun d P [ | W ij | > ρ (Σ) t ] ≤ 2 exp( − n t 2 / 20), v alid f or all t ∈ (0 , 1]. 
B y union b ound o v er all d 2 en tries, w e conclude th at P  k W k ∞ ≥ 8 ρ (Σ) r log d n  ≤ 2 exp( − c 2 log d ) , whic h sh o w s that the sp ecified choic e of µ d is also v alid w ith high p robabilit y . 5.4 Pro of of Pr op osition 1 T o b egin, let us recal l co ndition (52) on the regularization parameters, and that, for this pro of, the matrices ( b Θ , b Γ) denote an y optimal solutions to the optimization problems (40) and (41) defining the t wo-step p r o cedure. W e again define the error m atrices b ∆ Θ = b Θ − Θ ⋆ and b ∆ Γ = b Γ − Γ ⋆ , the m atrices b ∆ Γ M : = Π M ( b ∆ Γ ) and b ∆ Γ M ⊥ = Π M ⊥ ( b ∆ Γ ), and the matrices Γ ⋆ M and Γ ⋆ M ⊥ as p reviously d efined in the p ro of of Theorem 1. Our pro of of Prop osition 1 is b ased on t w o lemmas, of whic h the firs t pro vides con trol on the error b ∆ Γ in estimating the sparse comp onent. Lemma 3. Under the assumptions of Pr op osition 1, for any subset S of matr ix indic es of c ar dinality at most s , the sp arse err or b ∆ Γ in any solution of the c onvex pr o gr am (40) satisfies the b ound | | | b ∆ Γ | | | 2 F ≤ c 1 µ 2 d  s + 1 µ d X ( j,k ) / ∈ S | Γ ⋆ j k |  . (59) Pr o of. Since b Γ and Γ ⋆ are op timal and f easible (resp ectiv ely) for the conv ex program (40), w e h a ve 1 2 | | | b Γ − Y | | | 2 F + µ d k b Γ k 1 ≤ 1 2 | | | Θ ⋆ + W | | | 2 F + µ d k Γ ⋆ k 1 . Re-writing th is inequalit y in terms of the err or b ∆ Γ and r e-arranging terms yields 1 2 | | | b ∆ Γ | | | 2 F ≤ |h h b ∆ Γ , W + Θ ⋆ i i| + µ d k Γ ⋆ k 1 − µ d k Γ ⋆ + b ∆ Γ k 1 . By d ecomp osabilit y of the ℓ 1 -norm, we obtain 1 2 | | | b ∆ Γ | | | 2 F ≤ |h h b ∆ Γ , W + Θ ⋆ i i| + µ d  k Γ ∗ S k 1 + k Γ ∗ S c k 1 − k Γ ∗ S + b ∆ Γ S k 1 − k Γ ∗ S c + b ∆ Γ S c k 1  ≤ |h h b ∆ Γ , W + Θ ⋆ i i| + µ d  2 k Γ ∗ S c k 1 + k b ∆ Γ S k 1 − k b ∆ Γ S c k 1  , 27 where the second step is b ased on tw o applications of the triangle inequalit y . No w b y applyin g H¨ older’s inequalit y and the triangle inequalit y to the first term on the right-hand side, we obtain 1 2 | | | b ∆ Γ | | | 2 F ≤ k b ∆ Γ k 1 [ k W k ∞ + k Θ ⋆ k ∞ ] + µ d  2 k Γ ∗ S c k 1 + k b ∆ Γ S k 1 − k b ∆ Γ S c k 1  = k b ∆ Γ S k 1  k W k ∞ + k Θ ⋆ k ∞ + µ d } + k b ∆ Γ S c k 1  k W k ∞ + k Θ ⋆ k ∞ − µ d } + 2 µ d k Γ ∗ S c k 1 ≤ 2 µ d k b ∆ Γ S k 1 + 2 µ d k Γ ∗ S c k 1 , where the final inequalit y follo ws from our s tated c hoice (42) of the regularizat ion parameter µ d . Since k b ∆ Γ S k 1 ≤ √ s | | | b ∆ S | | | F ≤ √ s | | | b ∆ Γ | | | F , the claim (59) follo ws with some algebra. Our second lemma pr o vid es a b ou n d on the lo w-rank error b ∆ Θ in terms o f the sparse matrix error b ∆ Γ . Lemma 4. If in addition to the c onditions of P r op osition 1, the sp arse er orr matrix is b ounde d as | | | b ∆ Γ | | | F ≤ δ , then the low-r ank err or matrix is b ounde d as | | | b ∆ Θ | | | 2 F ≤ c 1 λ 2 d  r + 1 λ d d X j = r +1 σ j (Θ ⋆ )  + c 2 δ 2 . ( 60) As th e pr o of of this lemma is somewhat more inv olve d, w e d efer it to App endix D. Finally , com b ining the lo w-rank b ound (60) with the spars e boun d (59 ) from Lemm a 3 yields the claim of Prop osition 1. 5.5 Pro of of Corollary 6 F or this corollary , we ha v e R ( · ) = k · k 2 , 1 and R ∗ ( · ) = k · k 2 , ∞ . In ord er to establish the claim, w e need to sh o w that the conditions of Corollary 5 on the regularization pair ( λ d , µ d ) hold with h igh pr obabilit y . The setting of λ d is the same as Corollary 2, and is v alid by our earlier argumen t. 
Hence, in order to complete the pr o of, it remains to establish an upp er b ound on k W k 2 , ∞ . Let W k b e the k th column of the matrix. N oting that the fun ction W k 7→ k W k k 2 is Lipsc hitz, by concentrat ion of measure for Gaussian Lips c h itz fu nctions [16], w e h a ve P  k W k k 2 ≥ E k W k k 2 + t  ≤ exp  − t 2 d 1 d 2 2 ν 2  for all t > 0. Using the Gaussianit y of W k , we ha ve E k W k k 2 ≤ ν √ d 1 d 2 √ d 1 = ν √ d 2 . Applying un ion b ound o ver all d 2 columns, we conclude that with probab ility greater than 1 − exp  − t 2 d 1 d 2 2 ν 2 + log d 2  , w e h a ve m ax k k W k k 2 ≤ ν √ d 2 + t . Setting t = 4 ν q log d 2 d 1 d 2 yields P  k W k 2 , ∞ ≥ ν √ d 2 + 4 ν r log d 2 d 1 d 2  ≤ exp  − 3 log d 2  , from w hic h the claim follo ws. As b efore, a sharp er b ound (with log d 2 replaced b y log ( d 2 /s )) can b e ob tained by a r efined argumen t; we refer the reader to Ap p endix C.2 for the details. 28 5.6 Pro of of Corollary 7 F or this mo del, the noise matrix tak es the form W : = 1 n P n i =1 U i U T i − Θ ⋆ , where U i ∼ N (0 , Θ ⋆ ). Since Θ ⋆ is p ositiv e semidefinite with rank at most r , we can write W = Q  1 n Z i Z T i − I r × r  Q T , where the m atrix Q ∈ R d × r satisfies the rela tionship Θ ⋆ = QQ T , and Z i ∼ N (0 , I r × r ) is standard Gaussian in dimension r . Consequen tly , b y known results on singular v alues of Wishart m atrices [10], we hav e | | | W | | | op ≤ √ 8 | | | Θ ⋆ | | | op p r n with high pr obabilit y , s ho wing th at the sp ecified choic e of λ d is v alid. It remains to b ound the quantit y k W k 2 , ∞ . By kno wn matrix norm b ou n ds [13], w e ha v e k W k 2 , ∞ ≤ | | | W | | | op , so that the claim follo ws by the previous argumen t. 5.7 Pro of of Theorem 2 Our lo wer b ound pro ofs are based on a standard red uction [12, 31, 30] from estimation to a m ultiw a y hyp othesis testing problem ov er a pac kin g set of matrix pairs. In particular, giv en a collect ion { (Θ j , Γ j ) , j = 1 , 2 , . . . , M } of matrix pairs con tained in some family F , w e say that it f orm s a δ -pac k in g in F rob enius norm if, for all distinct pairs i, j ∈ { 1 , 2 , . . . , M } , we ha ve | | | Θ i − Θ j | | | 2 F + | | | Γ i − Γ j | | | 2 F ≥ δ 2 . Giv en suc h a pac king set, it is a straigh tforward consequence of F ano’s inequalit y that the minimax err or o v er F satisfies the lo w er b oun d P  M ( F ) ≥ δ 2 8  ≥ 1 − I ( Y ; J ) + log 2 log M , (61) where I ( Y ; J ) is the m utual inform ation b etw een the observ ation matrix Y ∈ R d 1 × d 2 , and J is an index un iformly distrib uted ov er { 1 , 2 , . . . , M } . In order to obtain different comp onents of our b ound , w e mak e different choic es of th e pac king set, and use different b oun ding tec hniques for the m utual in formation. 5.7.1 Low er b ounds f or elemen twise sparsity W e b egin by pr o vin g the lo w er b ou n d (50) for matrix d ecomp ositions o v er the family F sp ( r , s, α ). P acking for radius of non-iden tifiability Let u s first establish the lo w er b ound inv olving the radius of non-identifiabilit y , namely th e term scaling as α 2 s d 1 d 2 in the case of s -sparsity for Θ ⋆ . Recall fr om Example 6 the “b ad ” matrix (33 ), w hic h we denote here by B ∗ . By construction, w e ha ve | | | B ∗ | | | 2 F = α 2 s d 1 d 2 . 
Using this matrix, we construct a v ery simple pac king set w ith M = 4 matrix pairs (Θ , Γ):  ( B ∗ , − B ∗ ) , ( − B ∗ , B ∗ ) , ( 1 √ 2 B ∗ , − 1 √ 2 B ∗ ) , (0 , 0)  (62) Eac h one of th ese mat rix pairs (Θ , Γ) b elongs to the set F sp (1 , s, α ), so it can b e used to establish a lo w er b ound o ver this set. (Moreo ver, it also yields a lo wer b ound o v er the sets F sp ( r , s, α ) for r > 1, since they are sup ersets.) It can also b e v erified that for an y tw o distinct pairs of matrices in the set (62) , they d iffer in squared F rob enius norm by at least 29 δ 2 = 1 2 | | | B ∗ | | | 2 F = 1 2 α 2 s d 1 d 2 . Let J b e a r andom index u niformly distrib uted o v er the four p ossible mo dels in our pac king set (62). B y constru ction, for an y m atrix p air (Θ , Γ) in the pac k in g set, we ha ve Θ + Γ = 0. Consequen tly , f or any one of th ese mo d els, the observ ation matrix Y is simply equal to pure noise W , and h ence I ( Y ; J ) = 0. Putting together the pieces, the F ano b ound (61) imp lies that P  M ( F sp (1 , s, α )) ≥ 1 16 α 2 s d 1 d 2  ≥ 1 − I ( Y ; J ) + log 2 log 4 = 1 2 . P acking for estimation error: W e no w describ e the construction of a pac king set for lo wer b ounding the estimation error. In this case, our construction is m ore subtle, based on the th e Cartesian pro du ct of tw o comp onents, one for the lo w rank matrices, and the other for the sparse matrices. F or the low r ank comp onen t, w e r e-state a sligh tly mo dified form (adapted to the setting of non-square matrices) of Lemm a 2 from the pap er [20]: Lemma 5. F or d 1 , d 2 ≥ 10 , a toler anc e δ > 0 , and for e ach r = 1 , 2 , . . . , d , ther e exists a set of d 1 × d 2 -dimensional matr ic es { Θ 1 , . . . , Θ M } with c ar dinality M ≥ 1 4 exp  r d 1 256 + r d 2 256  such that e ach matrix has r ank r , and mor e over | | | Θ ℓ | | | 2 F = δ 2 for al l ℓ = 1 , 2 , . . . , M , (63a) | | | Θ ℓ − Θ k | | | 2 F ≥ δ 2 for al l ℓ 6 = k , (63b) k Θ ℓ k ∞ ≤ δ s 32 log( d 1 d 2 ) d 1 d 2 for al l ℓ = 1 , 2 , . . . , M . (63c) Consequent ly , as long as δ ≤ 1, w e are guaran teed that th e matrice s Θ ℓ b elong to the set F sp ( r , s, α ) for all α ≥ 32 p log( d 1 d 2 ). As for the spars e matrices, the follo wing result is a mo dification, so as to apply to the matrix setting of in terest here, of Lemma 5 from the pap er [23]: Lemma 6 (Sparse mat rix pac king) . F or any δ > 0 , and for e ach inte ger s < d 1 d 2 , ther e exists a set of matric es { Γ 1 , . . . , Γ N } with c ar dinality N ≥ exp  s 2 log d 1 d 2 − s s/ 2  such that | | | Γ j − Γ k | | | 2 F ≥ δ 2 , and (64a) | | | Γ j | | | 2 F ≤ 8 δ 2 , (64b) and such that e ach Γ j has at most s non-zer o entries. W e n o w ha ve the nece ssary ingredients to pro ve the lo wer b oun d (5 0). By com b ining Lemmas 5 and 6, w e conclude that there exists a set of matrices with cardinalit y M N ≥ 1 4 exp  s 2 log d 1 d 2 − s s/ 2 + r d 1 256 + r d 2 256  (65) suc h th at | | | (Θ ℓ , Γ k ) − (Θ ℓ ′ , Γ k ′ ) | | | 2 F ≥ δ 2 for all pairs suc h that ℓ 6 = ℓ ′ or k 6 = k ′ , and (66a) | | | (Θ ℓ , Γ k ) | | | 2 F ≤ 9 δ 2 for all ( ℓ, k ). (66b) 30 Let P ℓ,k denote the distribution of the observ ation matrix Y when Θ ℓ and Γ k are the underlying p arameters. 
W e apply the F ano construction o v er the class of M N suc h distribu- tions, th ereb y obtaining that in order to sho w that the m inimax error is lo wer b ounded by c 0 δ 2 (for some universal constan t c 0 > 0), it su ffices to sho w that 1 ( M N 2 ) P ( ℓ,k ) 6 =( ℓ ′ ,k ′ ) D ( P ℓ,k k P ℓ ′ ,k ′ ) + log 2 log( M N ) ≤ 1 2 , (67) where D ( P ℓ,k k P ℓ ′ ,k ′ ) denotes the K ullbac k-Leibler d ivergence b etw een the distrib utions P ℓ,k and P ℓ ′ ,k ′ . Giv en the assumption of Gaussian n oise with v ariance ν 2 / ( d 1 d 2 ), we hav e D ( P j k P k ) = d 1 d 2 2 ν 2 | | | (Θ ℓ , Γ k ) − (Θ ℓ ′ , Γ k ′ ) | | | 2 F ( i ) ≤ 18 d 1 d 2 δ 2 ν 2 , (68) where the b ound (i) follo ws from the condition (66b). Combined with lo wer b ound (65), we see that it suffices to c ho ose δ suc h that 18 d 1 d 2 δ 2 ν 2 + log 2 log 1 4 +  s 2 log d 1 d 2 − s s/ 2 + r d 1 256 + r d 2 256  ≤ 1 2 . F or d 1 , d 2 larger than a fi nite constan t (to exclud e degenerate cases), we see that the c hoice δ 2 = c 0 ν 2  r d 1 + r d 2 + s log d 1 d 2 − s s/ 2 d 1 d 2  , for a suitably sm all constant c 0 > 0 is su fficien t, thereby establishing the low er b oun d (50). 5.7.2 Low er b ounds f or column wise sparsit y The lo wer b ound (51) f or columnwise follo ws from a similar argum en t. The only mo difications are in the pac king sets. P acking for radius of non-iden tifiability In order to establish a lo wer b ound of order α 2 s d 2 , recall the “bad” matrix (45) from Example 8 , wh ic h w e denote b y B ∗ . By construction, it h as squared F r ob enius n orm | | | B ∗ | | | 2 F = α 2 s d 2 . W e use it to form the pac king set  ( B ∗ , − B ∗ ) , ( − B ∗ , B ∗ ) , ( 1 √ 2 B ∗ , − 1 √ 2 B ∗ ) , (0 , 0)  (69) Eac h one of these matrix pairs (Θ , Γ) b elongs to the set F col (1 , s, α ), so it can be used to establish a lo w er b ound o ver this set. (Moreo ver, it also yields a lo wer b ound o v er the sets F col ( r , s, α ) for r > 1, since they are sup ersets.) It can also b e v erified that for an y tw o distinct pairs of matrices in the set (69) , they d iffer in squared F rob enius norm by at least δ 2 = 1 2 | | | B ∗ | | | 2 F = 1 2 α 2 s d 2 . Consequen tly , the same argument as b efore shows that P  M ( F col (1 , s, α )) ≥ 1 16 α 2 s d 2  ≥ 1 − I ( Y ; J ) + log 2 log 4 = 1 2 . 31 P acking for estimation error: W e no w describ e pac kings for th e estimation er r or terms. F or th e low-rank pac king set, w e need to ensu re that the (2 , ∞ )-norm is con trolled. F rom the b ound (63c ), we hav e the guarante e k Θ ℓ k 2 , ∞ ≤ δ s 32 log ( d 1 d 2 ) d 2 for all ℓ = 1 , 2 , . . . , M , (70) so that, as long as δ ≤ 1, the matrices Θ ℓ b elong to the set F col ( r , s, α ) for all α ≥ 32 p log( d 1 d 2 ). The follo wing lemma c h aracterizes a suitable pac kin g s et for the columnwise sparse comp o- nen t: Lemma 7 (Column wise sparse matrix p ac king) . F or al l d 2 ≥ 10 and inte gers s in the set { 1 , 2 , . . . , d 2 − 1 } , ther e exists a family d 1 × d 2 matric es { Γ k , k = 1 , 2 , . . . N } with c ar dinality N ≥ exp  s 8 log d 2 − s s/ 2 + sd 1 8  , satisfying the ine qu alities | | | Γ j − Γ k | | | 2 F ≥ δ 2 , for al l j 6 = k , and (71a) | | | Γ j | | | 2 F ≤ 64 δ 2 , (71b) and such that e ach Γ j has at most s non-zer o c olumns. This claim follo ws by suitably adapting Lemma 5(b) in the pap er by Raskutti et al. [24] on minimax rates for kernel classes. 
In particular, we view column j of a matrix Γ as definin g a linear function in d imension R d 1 ; for eac h j = 1 , 2 , . . . , d 1 , this defi n es a Hilb ert space H j of fun ctions. By kno wn results on metric entrop y of Euclidean balls [17], this fun ction class has logarithmic metric entrop y , so that p art (b ) of the ab o v e lemma applies, and yields the stated r esu lt. Using this lemma and the pac king set for the lo w-rank compon ent and follo wing through the F ano constr u ction yields th e claimed lo w er b oun d (50) on the minimax err or for th e class F col ( r , s, α ), wh ic h completes the pro of of Theorem 2. 6 Discussion In this pap er, w e analyzed a class of con v ex relaxations for solving a general class of matrix decomp osition p roblems, in whic h the goal is reco ver a p air of matrices, based on observing a noisy con taminated v ersion of their sum. Since the problem is ill-p osed in general, it is essen tial to imp ose stru cture, and th is pap er focu ses on the setting in whic h one matrix is app ro ximately low-rank, and the second has a complementary form of lo w-dimensional structure enforced b y a decomp osable regularizer. P articular cases include matrices that are elemen twise sp arse, or column wise sparse, and the asso ciated matrix decomp osition problems ha v e v arious applications, including robust PC A, robustness in collab orativ e filtering, and mo del selection in Gauss-Marko v random fi elds . W e p ro vided a general non-asymptotic b ound on the F rob en iu s error of a con vex relaxation based on a regularizing the least-squares loss with a combination of the n uclear norm with a decomp osable regularizer. When sp ecialized 32 to the case of elemen t wise and columnwise spars ity , these estimato rs yield rates that are minimax-optimal up to constan t factors. V arious extensions of this w ork are p ossible. W e ha v e not d iscussed here ho w our estimator w ould b eha v e under a p artial obser v ation m o del, in which only a fr action of the entries are observ ed. This problem is v ery closely r elated to matrix completion, a problem for whic h recen t work b y Negah ban and W ain wr igh t [20] sho ws that a f orm of restricted stron g con v exit y holds with high probabilit y . This p rop erty could b e adapted to the cur ren t setting, and would allo w for proving F rob enius norm error b ounds on the lo w rank comp onen t. Finally , although this p ap er has f o cused on the case in whic h th e first matrix component is appro ximately lo w rank, muc h of our theory could b e applied to a more general class of matrix decomp ositio n problems, in wh ic h the firs t co mp onent is pen alized b y a decomp osable regularizer that is “complemen tary” to the second matrix comp onen t. It remains to explore the prop er ties and applications of these differen t forms of matrix decomp osition. Ac kn o wledgemen ts All thr ee authors we re partially s upp orted b y gran t AF O SR-09NL184. In add ition, SN and MJW were partially supp orted by gran t NSF-CDI-094174 2, and AA was partially supp orted a Microsoft Researc h Graduate F ello wship. All three authors would lik e to ac kn o w ledge the Banff Int ernational Researc h S tation (BIRS) in Banff, Canada for hospitalit y and w ork facilities that stimulat ed and su pp orted this collaboration. A Pro of of L emma 1 The decomp ositio n describ ed in p art (a) was established by Rec h t et al. [25], so that it remains to pro v e p art (b). 
With the app ropriate defin itions, part (b) can b e reco ve red by exploiting Lemma 1 fr om Negah ban et al. [19]. Their lemma applies to optimization problems of the general form min θ ∈ R p  L ( θ ) + γ n r ( θ )  , where L is a loss f u nction on the parameter space, and r is norm-based regularizer that satisfies a prop erty kno wn as decomp osabilit y . The elemen twise ℓ 1 -norm as w ell as the nucle ar norm are b oth in stances of decomp osable regularizers. Their lemma requires that the regularization parameter γ n b e c hosen suc h that γ n ≥ 2 r ∗  ∇L ( θ ∗ )  , w here r ∗ is the dual norm, and ∇L ( θ ∗ ) is the gradien t of the loss ev aluated at the true parameter. W e no w d iscuss ho w this lemma can b e a pplied in our sp ecial case. He re the relev ant parameters are of the form θ = (Θ , Γ), and th e loss function is giv en b y L (Θ , Γ) = 1 2 | | | Y − (Θ + Γ) | | | 2 F . The samp le size n = d 2 , since we mak e one observ ation for eac h entry of the matrix. On the other hand, the regularizer is giv en by the f unction r ( θ ) = Q (Θ , Γ) : = | | | Θ | | | N + µ d λ d R (Γ) , coupled with the regularization parameter γ n = λ d . By assump tion, the regularizer R is decomp osable, and as shown in the pap er [19], the nucle ar norm is also decomp osable. Since 33 Q is sim p ly a sum of these d ecomp osable regularizers o ver separate matrices, it is also de- comp osable. It remains to compute the gradien t ∇L (Θ ⋆ , Γ ⋆ ), and ev aluate the du al norm. A straight- forw ard calculation yields that ∇L (Θ ⋆ , Γ ⋆ ) =  W W  T . In addition, it can b e ve rified b y standard prop erties of dual norms Q ∗ ( U, V ) = | | | U | | | op + λ d µ d R ∗ ( V ) . Th us, it suffices to c ho ose the regularization parameter suc h that λ d ≥ 2 Q ∗ ( W , W ) = 2 | | | W | | | op + 2 λ d µ d R ∗ ( W ) . Giv en our condition (52) , w e h a ve 2 | | | W | | | op + 2 λ d µ d R ∗ ( W ) ≤ 2 | | | W | | | op + λ d 2 , meaning that it suffices to h a ve λ d ≥ 4 | | | W | | | op , as state d in the second part of condition (52). B Pro of of L emma 2 By th e RSC condition (22), w e ha ve 1 2 | | | X ( b ∆ Θ + b ∆ Γ ) | | | 2 F − γ 2 | | | b ∆ Θ + b ∆ Γ | | | 2 F ≥ − τ n Φ 2 ( b ∆ Θ + b ∆ Γ ) ≥ − τ n Q 2 ( b ∆ Θ , b ∆ Γ ) , (72) where th e second inequalit y follo ws by the definitions (20) and (21) of Q and Φ resp ectiv ely . W e no w der ive a lo w er b ound on | | | b ∆ Θ + b ∆ Γ | | | F , and an up p er b oun d on Q 2 ( b ∆ Θ , b ∆ Γ ). Beginning with the former term, observe that γ 2  | | | b ∆ Θ | | | 2 F + | | | b ∆ Γ | | | 2 F  − γ 2 | | | b ∆ Θ + b ∆ Γ | | | 2 F = − γ h h b ∆ Θ , b ∆ Γ i i , so that it suffices to upp er b oun d γ |h h b ∆ Θ , b ∆ Γ i i| . By the dualit y of the p air ( R , R ∗ ), we h a ve γ   h h b ∆ Θ , b ∆ Γ i i   ≤ γ R ∗ ( b ∆ Θ ) R ( b ∆ Γ ) . No w since b Θ and Θ ⋆ are b oth feasible for the program (7) and recalling that b ∆ Θ = b Θ − Θ ⋆ , an application of triangle inequalit y yields γ R ∗ ( b ∆ Θ ) ≤ γ  R ∗ ( b Θ) + R ∗ (Θ ⋆ )  ≤ 2 α γ κ d ( i ) ≤ µ d 2 , where inequalit y (i) follo ws from our c hoice of µ d . Pu tting together the pieces, w e ha ve sho wn that γ 2 | | | b ∆ Θ + b ∆ Γ | | | 2 F ≥ γ 2  | | | b ∆ Θ | | | 2 F + | | | b ∆ Γ | | | 2 F  − µ d 2 R ( b ∆ Γ ) . 
Since the quant it y λ d | | | b ∆ Θ | | | N ≥ 0, w e can write γ 2 | | | b ∆ Θ + b ∆ Γ | | | 2 F ≥ γ 2  | | | b ∆ Θ | | | 2 F + | | | b ∆ Γ | | | 2 F  − µ d 2 R ( b ∆ Γ ) − λ d 2 | | | b ∆ Θ | | | N = γ 2  | | | b ∆ Θ | | | 2 F + | | | b ∆ Γ | | | 2 F  − λ d 2 Q ( b ∆ Θ , b ∆ Γ ) , 34 where th e latter equalit y follo ws b y th e d efinition (20) of Q . Next we turn to th e upp er b ound on Q ( b ∆ Θ , b ∆ Γ ). By the triangle in equ alit y , w e ha ve Q ( b ∆ Θ , b ∆ Γ ) ≤ Q ( b ∆ Θ A , b ∆ Γ M ) + Q ( b ∆ Θ B , b ∆ Γ M ⊥ ) . F u r thermore, substituting in equation (53) in to the ab o v e equatio n yields Q ( b ∆ Θ , b ∆ Γ ) ≤ 4 Q ( b ∆ Θ A , b ∆ Γ M ) + 4 { d X j = r +1 σ j (Θ ⋆ ) + µ d λ d R (Γ ⋆ M ⊥ ) } . (73) Since b ∆ Θ A has ran k at most 2 r and b ∆ Γ M b elongs to the mod el sp ace M , w e ha v e λ d Q ( b ∆ Θ A , b ∆ Γ M ) ≤ √ 2 r λ d | | | b ∆ Θ A | | | F + Ψ( M ) µ d | | | b ∆ Γ M | | | F ≤ √ 2 r λ d | | | b ∆ Θ | | | F + Ψ( M ) µ d | | | b ∆ Γ | | | F . The claim then follo ws b y sub stituting th e ab o ve equation into equation (73), and then sub- stituting th e r esult into the earlier inequalit y (72). C Refinemen t of ac hiev abilit y results In this app endix, w e pro vide r efined arguments that yield sharp ened forms of Corollaries 2 and 6. These refin emen ts yield ac hiev able b ounds that m atch the minimax lo wer b ounds in Theorem 2 up to constan t factors. W e note th at these r efinemen ts are significan tly different only when the sparsit y ind ex s scales as Θ ( d 1 d 2 ) for Corollary 2, or as Θ( d 2 ) for Corollary 6. C.1 Refinemen t of Cor ollary 2 In the p ro of of Theorem 1, when sp ecial ized to the ℓ 1 -norm, the noise term |h h W, b ∆ Γ i i| is simply upp er b oun ded by k W k ∞ k b ∆ Γ k 1 . Here we use a more careful argument to con trol this noise term. Throughout the pro of, w e assume that the regularization p arameter λ d is set in the usual wa y , wh er eas w e c ho ose µ d = 16 ν s log d 1 d 2 s d 1 d 2 + 4 α √ d 1 d 2 . (74) W e split our analysis in to tw o cases. Case 1: First, supp ose that k b ∆ Γ k 1 ≤ √ s | | | b ∆ Γ | | | F . In this case, w e ha v e the upp er b ound   h h W , b ∆ Γ i i   ≤ sup k ∆ k 1 ≤ √ s | | | b ∆ Γ | | | F | | | ∆ | | | F ≤| | | b ∆ Γ | | | F |h h W , ∆ i i| = | | | b ∆ Γ | | | F sup k ∆ k 1 ≤ √ s | | | ∆ | | | F ≤ 1 |h h W , ∆ i i| | {z } Z ( s ) It remains to u pp er b ound the random v ariable Z ( s ). Vi ew ed as a function of W , it is a Lipsc hitz f unction with parameter ν √ d 1 d 2 , so that P  Z ( s ) ≥ E [ Z ( s )] + δ ] ≤ exp  − d 1 d 2 δ 2 2 ν 2  . 35 Setting δ 2 = 4 sν 2 d 1 d 2 log( d 1 d 2 s ), we hav e Z ( s ) ≤ E [ Z ( s )] + 2 sν d 1 d 2  log  d 1 d 2 s  with p robabilit y greater than 1 − exp  − 2 s log( d 1 d 2 s )  . It remains to u pp er b ound the exp ected v alue. I n order to do so, w e apply Theorem 5.1(ii) from Gord on et al. [11] w ith ( q 0 , q 1 ) = (1 , 2), n = d 1 d 2 and t = √ s , thereb y obtaining E [ Z ( t )] ≤ c ′ ν √ d 1 d 2 √ s s 2 + log  2 d 1 d 2 s  ≤ c ν √ d 1 d 2 s s log  d 1 d 2 s  . With th is b ound, pro ceeding through the remainder of the pro of yields the claimed rate. Case 2: Alternativ ely , we m ust ha v e k b ∆ Γ k 1 > √ s | | | b ∆ Γ | | | F . In this case, we need to sho w that the stated c h oice (74) of µ d satisfies µ d k b ∆ Γ k 1 ≥ 2 |h h W , b ∆ Γ i i| with high p robabilit y . As can b e seen from examining the pro ofs, this condition is sufficient to ensu re th at Lemma 1 and Lemma 2 all hold, as required for our analysis. 
W e hav e the up p er b oun d   h h W , b ∆ Γ i i   ≤ sup k ∆ k 1 ≤k b ∆ Γ k 1 | | | ∆ | | | F ≤| | | b ∆ Γ | | | F |h h W , ∆ i i| = | | | b ∆ Γ | | | F Z  k b ∆ Γ k 1 | | | b ∆ Γ | | | F  , where f or any radius t > 0, we defin e the random v ariable Z ( t ) : = sup k ∆ k 1 ≤ t | | | ∆ | | | F ≤ 1 |h h W , ∆ i i| . F or eac h fi xed t , th e same argumen t as b efore sho ws that Z ( t ) is concen trated around its exp ectation, and Theorem 5.1(ii) f r om Gordon et al. [11] with ( q 0 , q 1 ) = (1 , 2), n = d 1 d 2 yields E  Z ( t )  ≤ c ν √ d 1 d 2 t s log  d 1 d 2 t 2  . Setting δ 2 = 4 t 2 ν 2 d 1 d 2 log( d 1 d 2 s ) in the concen tration b oun d, we conclude that Z ( t ) ≤ c ′ t ν √ d 1 d 2  s log  d 1 d 2 s  + s log  d 1 d 2 t 2  . with h igh probabilit y . A standard p eeling argumen t (e.g., [28]) can b e used to extend this b ound to a unif orm one o v er the c h oice of r adii t , so that it applies to the random one t = k b ∆ Γ k 1 | | | b ∆ Γ | | | F of in terest. (The only c h anges in doing such a p eeling are in constan t terms.) W e th us conclud e th at Z k b ∆ Γ k 1 | | | b ∆ Γ | | | F ! ≤ c ′ k b ∆ Γ k 1 | | | b ∆ Γ | | | F ν √ d 1 d 2  s log  d 1 d 2 s  + s log  d 1 d 2 k b ∆ Γ k 2 1 / | | | b ∆ Γ | | | 2 F  36 with h igh pr obabilit y . S in ce k b ∆ Γ k 1 > √ s | | | b ∆ Γ | | | F , w e ha v e 1 k b ∆ Γ k 2 1 / | | | b ∆ Γ | | | 2 F ≤ 1 s , and hence   h h W , b ∆ Γ i i   ≤ | | | b ∆ Γ | | | F Z  k b ∆ Γ k 1 | | | b ∆ Γ | | | F  ≤ c ′′ k b ∆ Γ k 1 ν √ d 1 d 2 s log  d 1 d 2 s  with h igh pr obabilit y . With this b ound, the remainder of the pro of p ro ceeds as b efore. In particular, the refin ed c hoice (74) of µ d is adequate. C.2 Refinemen t of Cor ollary 6 As in the refinement of Corollary 2 from App endix C.1, w e n eed to b e more careful in con- trolling the noise term h h W , b ∆ Γ i i . F or th is corollary , w e make the refin ed c h oice of regularizer µ d = 16 ν r 1 d 2 + 16 ν s log( d 2 /s ) d 1 d 2 + 4 α √ d 2 (75) As in App endix C .1, we split our analysis in to t wo cases. Case 1: First, supp ose that k b ∆ Γ k 2 , 1 ≤ √ s | | | b ∆ Γ | | | F . In this case, w e ha ve   h h W , b ∆ Γ i i   ≤ sup k ∆ k 2 , 1 ≤ √ s | | | b ∆ Γ | | | F | | | ∆ | | | F ≤| | | b ∆ Γ | | | F |h h W , ∆ i i| = | | | b ∆ Γ | | | F sup k ∆ k 2 , 1 ≤ √ s | | | ∆ | | | F ≤ 1 |h h W , ∆ i i| | {z } e Z ( s ) The function W 7→ e Z ( s ) is a Lipsc hitz fu n ction w ith parameter ν √ d 1 d 2 , so th at by con- cen tr ation of measure for Gaussian Lipsc hitz fun ctions [16], it satisfies the up p er tail b ound P  e Z ( s ) ≥ E [ e Z ( s )] + δ ] ≤ exp  − d 1 d 2 δ 2 2 ν 2  . Setting δ 2 = 4 sν 2 d 1 d 2 log( d 2 s ) y ields e Z ( s ) ≤ E [ e Z ( s )] + 2 ν s s log ( d 2 s ) d 1 d 2 (76) with p robabilit y greater than 1 − exp  − 2 s log( d 2 s )  . It r emains to upp er b ound the exp ectation. Applying the Cauc hy-Sc hw arz in equalit y to eac h column, w e h a ve E [ e Z ( s )] ≤ E  sup k ∆ k 2 , 1 ≤ √ s | | | ∆ | | | F ≤ 1 d 2 X k =1 k W k k 2 k ∆ k k 2  = E  sup k ∆ k 2 , 1 ≤ √ s | | | ∆ | | | F ≤ 1 d 2 X k =1  k W k k 2 − E [ k W k k 2 ]  k ∆ k k 2  + sup k ∆ k 2 , 1 ≤ √ s  d X k =1 k ∆ k k 2  E [ k W 1 k 2 ] ≤ E  sup k ∆ k 2 , 1 ≤ √ s | | | ∆ | | | F ≤ 1 d 2 X k =1  k W k k 2 − E [ k W k k 2 ]  | {z } V k k ∆ k k 2  + 4 ν r s d 2 , using th e f act that E [ k W 1 k 2 ] ≤ ν q d 1 d 2 d 2 = ν √ d 2 . 
37 No w the v ariable V k is zero-mean, and sub-Gaussian with parameter ν √ d 1 d 2 , again u s ing concen tration of measure for Lip sc h itz functions of Gaussians [16]. Consequently , by setting δ k = k ∆ k k 2 , w e can write E [ e Z ( s )] ≤ E  sup k δ k 1 ≤ 4 √ s k δ k 2 ≤ 1 d 2 X k =1 V k δ k  + 4 ν r s d 2 , Applying Th eorem 5.1(ii) from Gordon et al. [11] with ( q 0 , q 1 ) = (1 , 2), n = d 2 and t = 4 √ s then yields E [ e Z ( s )] ≤ c ν √ d 1 d 2 √ s s 2 + log  2 d 2 16 s  + 4 ν r s d 2 , whic h combined w ith the concen tration b ound (76) yields the r efined claim. Case 2: Alternative ly , we ma y assume that k b ∆ Γ k 2 , 1 > √ s | | | b ∆ Γ | | | F . In th is case, we n eed to v erify that the c h oice (75) µ d satisfies µ d k b ∆ Γ k 2 , 1 ≥ 2 |h h W , b ∆ Γ i i| with h igh probabilit y . W e ha v e the upp er b ound   h h W , b ∆ Γ i i   ≤ sup k ∆ k 2 , 1 ≤k b ∆ Γ k 2 , 1 | | | ∆ | | | F ≤| | | b ∆ Γ | | | F |h h W , ∆ i i| = | | | b ∆ Γ | | | F e Z  k b ∆ Γ k 2 , 1 | | | b ∆ Γ | | | F  , where f or any radius t > 0, we defin e the random v ariable e Z ( t ) : = sup k ∆ k 2 , 1 ≤ t | | | ∆ | | | F ≤ 1 |h h W , ∆ i i| . F ollo wing through the same argumen t as in Case 2 of App endix C .1 yields that for an y fix ed t > 0, w e hav e e Z ( t ) ≤ c ν √ d 1 d 2 t s 2 + log  2 d 2 t 2  + 4 ν t √ d 2 + 2 ν t s log( d 2 s ) d 1 d 2 with h igh p robabilit y . As b efore, this can b e extended to a uniform b ound o v er t b y a p eeling argumen t, an d we conclude that   h h W , b ∆ Γ i i   ≤ = | | | b ∆ Γ | | | F e Z  k b ∆ Γ k 2 , 1 | | | b ∆ Γ | | | F  ≤ c k b ∆ Γ k 2 , 1  ν √ d 1 d 2 s 2 + log  2 d 2 k b ∆ Γ k 2 2 , 1 / | | | b ∆ Γ | | | 2 F  + 4 ν 1 √ d 2 + 2 ν s log( d 2 s ) d 1 d 2  with h igh pr obabilit y . S in ce 1 k b ∆ Γ k 2 2 , 1 / | | | b ∆ Γ | | | 2 F ≤ 1 s b y assum ption, the claim follo ws. 38 D Pro of of Lemma 4 Since b Θ and Θ ⋆ are optimal and feasible (resp ectiv ely) for the conv ex program (41), we ha v e 1 2 | | | Y − b Θ − b Γ | | | 2 F + λ d | | | b Θ | | | N ≤ 1 2 | | | Y − Θ ⋆ − b Γ | | | 2 F + λ d | | | Θ ⋆ | | | N . Recalling that Y = Θ ⋆ + Γ ⋆ + W and re-writing in term s of the error matrices b ∆ Γ = b Γ − Γ ⋆ and b ∆ Θ = b Θ − Θ ⋆ , we find that 1 2 | | | b ∆ Θ + b ∆ Γ − W | | | 2 F + λ d | | | Θ ⋆ + b ∆ Θ | | | N ≤ 1 2 | | | b ∆ Γ − W | | | 2 F + λ d | | | Θ ⋆ | | | N . Expanding the F rob enius norm and reorganizing terms yields 1 2 | | | b ∆ Θ | | | 2 F ≤ |h h b ∆ Θ , b ∆ Γ + W i i| + λ d  | | | Θ ⋆ | | | N − λ d | | | Θ ⋆ + b ∆ Θ | | | N  . 
F rom Lemma 1 in the pap er [21], there exists a decomp osition b ∆ Θ = b ∆ Θ A + b ∆ Θ B suc h that the rank of b ∆ Θ A upp er-b ounded by 2 r and | | | Θ ⋆ | | | N − | | | Θ ⋆ + b ∆ Θ A + b ∆ Θ B | | | N ≤ 2 d X j = r +1 σ j (Θ ⋆ ) + | | | b ∆ Θ A | | | N − | | | b ∆ Θ B | | | N , whic h imp lies th at 1 2 | | | b ∆ Θ | | | 2 F ≤ |h h b ∆ Θ , b ∆ Γ + W i i| + λ d  | | | b ∆ Θ A | | | N − | | | b ∆ Θ B | | | N  + 2 λ d d X j = r +1 σ j (Θ ⋆ ) ( i ) ≤ |h h b ∆ Θ , b ∆ Γ i i| + |h h b ∆ Θ , W i i| + λ d | | | b ∆ Θ A | | | N − λ d | | | b ∆ Θ B | | | N + 2 λ d d X j = r +1 σ j (Θ ⋆ ) ( ii ) ≤ | | | b ∆ Θ | | | F δ + | | | b ∆ Θ | | | N | | | W | | | op + λ d | | | b ∆ Θ A | | | N − λ d | | | b ∆ Θ B | | | N + 2 λ d d X j = r +1 σ j (Θ ⋆ ) ( iii ) ≤ | | | b ∆ Θ | | | F δ + | | | W | | | op  | | | b ∆ Θ A | | | N + | | | b ∆ Θ A | | | N  + λ d | | | b ∆ Θ A | | | N − λ d | | | b ∆ Θ B | | | N + 2 λ d d X j = r +1 σ j (Θ ⋆ ) = | | | b ∆ Θ | | | F δ + | | | b ∆ Θ A | | | N  | | | W | | | op + λ d } + | | | b ∆ Θ B | | | N  | | | W | | | op − λ d  + 2 λ d d X j = r +1 σ j (Θ ⋆ ) , where step (i) follo ws by triangle inequ ality; step (ii) by the Cauc h y-Sc h w arz and H¨ older in- equalit y , and our assum ed b ound | | | b ∆ Γ | | | F ≤ δ ; and step (iii) follo ws by substituting b ∆ Θ = b ∆ Θ A + b ∆ Θ B and app lying triangle inequalit y . Since w e ha ve chosen λ d ≥ | | | W | | | op , we conclude that 1 2 | | | b ∆ Θ | | | 2 F ≤ | | | b ∆ Θ | | | F δ + 2 λ d | | | b ∆ Θ A | | | N + 2 λ d d X j = r +1 σ j (Θ ⋆ ) ≤ | | | b ∆ Θ | | | F δ + 2 λ d √ 2 r | | | b ∆ Θ | | | F + 2 λ d d X j = r +1 σ j (Θ ⋆ ) 39 where the second in equalit y follo ws since | | | b ∆ Θ A | | | N ≤ √ 2 r | | | b ∆ Θ A | | | F ≤ √ 2 r | | | b ∆ Θ | | | F . W e ha v e th us obtained a quadratic inequalit y in | | | b ∆ Θ | | | F , and applying th e quadratic formula yields the claim. References [1] T. W. Anderson. A n Intr o duction to Multivariate Statistic al A nalysis . Wiley Series in Probabilit y an d Mathematical Statistic s. Wiley , New Y ork, 2003 . [2] R. K. Ando and T . Z hang. A fr amew ork for learning predictiv e stru ctures fr om multiple tasks an d unlab eled d ata. J. Mach. L e arn. R e s. , 6:1817–185 3, Decem b er 2005. [3] P . J. Bic k el and E. Levina. Co v ariance regularization b y thresholding. Anna ls of Statis- tics , 36( 6):2577 –2604, 2008. [4] J. Blitzer, D. P . F oster, and Sh am M. Kak ade. Zero-shot domain ad ap tation: A m ulti- view approac h. T ec hn ical rep ort, T oy ota T ec hn ologica l Institute at C h icago, 2009. [5] J. Blitzer, R. Mcdonald, and F. Pereira. Domain adaptation with structural corresp on- dence learning. In EMNLP Confer enc e , 2006. [6] S. Bo yd and L . V andenb erghe. Convex optimization . Cam bridge Unive rsit y Press, Cam- bridge, UK, 2004. [7] E. J. Candes, X. Li, Y. Ma, and J. W right. Robust Prin cipal Comp onen t Analysis? T ec hn ical r ep ort, S tanford, 2009. a v ailable at arXiv:0912.359 9. [8] V. Ch andrasek aran, P . A. Pa rillo, and A. S. Willsky . L atent v ariable grap h ical mo d el se- lection via con v ex optimization. T ec hnical rep ort, Massac husetts Ins titute of T ec hnology , 2010. [9] V. Chandrasek aran, S . Sanghavi , P . A. P arrilo, and A. S . Willsky . Rank-sparsit y in- coherence f or matrix decomp osition. T ec h nical rep ort, MIT, J une 2009. Av ailable at arXiv:09 06.2220v1 . [10] K. R. Da vidson and S. J. Szarek. Lo cal op erator theory , random matrices, and Banac h spaces. 
References

[1] T. W. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley Series in Probability and Mathematical Statistics. Wiley, New York, 2003.

[2] R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. J. Mach. Learn. Res., 6:1817–1853, December 2005.

[3] P. J. Bickel and E. Levina. Covariance regularization by thresholding. Annals of Statistics, 36(6):2577–2604, 2008.

[4] J. Blitzer, D. P. Foster, and S. M. Kakade. Zero-shot domain adaptation: A multi-view approach. Technical report, Toyota Technological Institute at Chicago, 2009.

[5] J. Blitzer, R. McDonald, and F. Pereira. Domain adaptation with structural correspondence learning. In EMNLP Conference, 2006.

[6] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004.

[7] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Technical report, Stanford, 2009. Available at arXiv:0912.3599.

[8] V. Chandrasekaran, P. A. Parrilo, and A. S. Willsky. Latent variable graphical model selection via convex optimization. Technical report, Massachusetts Institute of Technology, 2010.

[9] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky. Rank-sparsity incoherence for matrix decomposition. Technical report, MIT, June 2009. Available at arXiv:0906.2220v1.

[10] K. R. Davidson and S. J. Szarek. Local operator theory, random matrices, and Banach spaces. In Handbook of Banach Spaces, volume 1, pages 317–336. Elsevier, Amsterdam, NL, 2001.

[11] Y. Gordon, A. E. Litvak, S. Mendelson, and A. Pajor. Gaussian averages of interpolated bodies and applications to approximate reconstruction. Journal of Approximation Theory, 149:59–73, 2007.

[12] R. Z. Has'minskii. A lower bound on the risks of nonparametric estimates of densities in the uniform metric. Theory Prob. Appl., 23:794–798, 1978.

[13] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, Cambridge, 1985.

[14] D. Hsu, S. M. Kakade, and T. Zhang. Robust matrix decomposition with sparse corruptions. Technical report, Univ. Pennsylvania, November 2010.

[15] I. M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, 29(2):295–327, April 2001.

[16] M. Ledoux. The Concentration of Measure Phenomenon. Mathematical Surveys and Monographs. American Mathematical Society, Providence, RI, 2001.

[17] J. Matousek. Lectures on Discrete Geometry. Springer-Verlag, New York, 2002.

[18] M. McCoy and J. Tropp. Two proposals for robust PCA using semidefinite programming. Technical report, California Institute of Technology, 2010.

[19] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In NIPS Conference, Vancouver, Canada, December 2009. Full-length version at arXiv:1010.2731v1.

[20] S. Negahban and M. J. Wainwright. Restricted strong convexity and (weighted) matrix completion: Optimal bounds with noise. Technical report, UC Berkeley, August 2010.

[21] S. Negahban and M. J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Annals of Statistics, 39(2):1069–1097, 2011.

[22] Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report 76, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 2007.

[23] G. Raskutti, M. J. Wainwright, and B. Yu. Minimax rates of estimation for high-dimensional linear regression over $\ell_q$-balls. IEEE Trans. Information Theory, 57(10):6976–6994, October 2011.

[24] G. Raskutti, M. J. Wainwright, and B. Yu. Minimax-optimal rates for sparse additive models over kernel classes via convex programming. Journal of Machine Learning Research, 2012. Available at http://arxiv.org/abs/1008.3654.

[25] B. Recht, M. Fazel, and P. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.

[26] R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, 1970.

[27] A. Rohde and A. Tsybakov. Estimation of high-dimensional low-rank matrices. Technical Report arXiv:0912.5338v2, Université de Paris, January 2010.

[28] S. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000.

[29] H. Xu, C. Caramanis, and S. Sanghavi. Robust PCA via outlier pursuit. Technical report, University of Texas, Austin, 2010. Available at arXiv:1010.4237.

[30] Y. Yang and A. Barron. Information-theoretic determination of minimax rates of convergence. Annals of Statistics, 27(5):1564–1599, 1999.

[31] B. Yu. Assouad, Fano and Le Cam. In Festschrift for Lucien Le Cam, pages 423–435. Springer-Verlag, Berlin, 1997.

[32] M. Yuan, A. Ekici, Z. Lu, and R. Monteiro. Dimension reduction and coefficient estimation in multivariate linear regression. Journal of the Royal Statistical Society Series B, 69(3):329–346, 2007.
