Matrix completion with column manipulation: Near-optimal sample-robustness-rank tradeoffs
Yudong Chen, Huan Xu, Constantine Caramanis, Sujay Sanghavi*

*Y. Chen is with the School of Operations Research and Information Engineering, Cornell University (yudong.chen@cornell.edu). H. Xu is with the Department of Mechanical Engineering, National University of Singapore (mpexuh@nus.edu.sg). C. Caramanis and S. Sanghavi are with the Department of Electrical and Computer Engineering, The University of Texas at Austin (constantine@utexas.edu, sanghavi@mail.utexas.edu). This paper was presented in part at the International Conference on Machine Learning, 2011.

Abstract

This paper considers the problem of matrix completion when some number of the columns are completely and arbitrarily corrupted, potentially by a malicious adversary. It is well-known that standard algorithms for matrix completion can return arbitrarily poor results if even a single column is corrupted. One direct application comes from robust collaborative filtering, where some number of users are so-called manipulators who try to skew the predictions of the algorithm by calibrating their inputs to the system. In this paper, we develop an efficient algorithm for this problem based on a combination of a trimming procedure and a convex program that minimizes the nuclear norm and the $\ell_{1,2}$ norm. Our theoretical results show that, given a vanishing fraction of observed entries, it is nevertheless possible to complete the underlying matrix even when the number of corrupted columns grows. Significantly, our results hold without any assumptions on the locations or values of the observed entries of the manipulated columns. Moreover, we show by an information-theoretic argument that our guarantees are nearly optimal in terms of the fraction of sampled entries on the authentic columns, the fraction of corrupted columns, and the rank of the underlying matrix. Our results therefore sharply characterize the tradeoffs between sample, robustness and rank in matrix completion.

1 Introduction

Previous work in low-rank matrix completion [10, 11, 19, 15] has demonstrated the following remarkable fact: given an $m \times n$ matrix of rank $r$ whose entries are sampled uniformly at random, with high probability the solution to a convex, and in particular tractable, optimization problem yields exact reconstruction of the matrix when only $O((m+n) r \log(m+n))$ entries are sampled. Yet as our simulations demonstrate, if even a few columns of this matrix are corrupted, the output of these algorithms can be arbitrarily skewed from the true matrix. This problem is particularly relevant in so-called collaborative filtering, or recommender systems. Here, based on only partial observation of users' preferences, one tries to obtain accurate predictions for their unrevealed preferences. It is well known and well documented [29, 47] that such recommender systems are susceptible to manipulation by malicious users who can calibrate all their inputs adversarially. It is of great interest to develop efficiently scalable algorithms that can successfully predict the preferences of the honest users based on the corrupted and partially observed data, while identifying the manipulators.

The presence of partial observation and potentially adversarial input makes a priori identification of corrupted versus authentic columns a challenging task.
For example, a simple method that works fairly well under full observation and purely random corruption is to use the correlation between the columns. Since the authentic columns of a low-rank matrix are linearly correlated, under suitable conditions they can be identified as those which have a high correlation with many other columns. However, under partial observation this method fails: it is not immediately clear even how to compute the correlation—most pairs of columns do not share an observed coordinate—let alone how to find the corrupted columns, which can disguise themselves to look like partially observed authentic columns. At first sight, it is unclear how to accomplish the two tasks simultaneously: completing unobserved entries, and identifying corrupted columns.

This paper studies this precise problem. We do so by exploiting the algebraic structure of the problem: the non-corrupted columns form a low-rank matrix, while the corrupted columns can be seen as a column-sparse matrix. Thus, the mathematical problem we address is to separate a low-rank matrix from a column-sparse matrix, based on only partial observation. Specifically, suppose we are given partial observation of a matrix $M$, which can be written as
$$ M = L_0 + C_0, \qquad (1) $$
where $L_0$ is low-rank and $C_0$ has only a few non-zero columns. Here the entries of $C_0$ may have arbitrary magnitude and can even be adversarially built; the column/row space of $L_0$ as well as the positions of the non-zero columns of $C_0$ are unknown. With a subset of the entries of $M$ observed, can we efficiently recover $L_0$ on the non-corrupted columns, and also identify the non-zero columns of $C_0$? And how do the rank and the number of corrupted columns impact the number of observations needed?

We provide an affirmative answer to the first question, and a quantitative solution to the second. In particular, we develop an efficient algorithm based on a trimming procedure followed by a convex program that minimizes the nuclear norm and the $\ell_{1,2}$ norm. We provide sufficient conditions under which this algorithm provably recovers $L_0$ and identifies the corrupted columns. Our algorithm succeeds even when a vanishing fraction of randomly located entries are observed and a significant fraction of the columns are corrupted; moreover, the number of observations we need depends near-optimally, in an information-theoretic sense, on the rank of $L_0$ and the number of corrupted columns. Significantly, we assume nothing about the values or the locations of the observations on the corrupted columns.

We note that our corruption model is very general. By making no assumption on the corrupted columns, our results cover, but are not limited to, adversarial manipulation. For example, the corrupted columns can also represent persistent noise and abnormal sub-populations that are not well modeled by a (known) probabilistic model. We discuss several such examples in Section 1.1.

Conceptually, our results establish the relation and tradeoffs between three aspects of the problem: sample complexity (the number of observed entries), model complexity (the rank of the matrix) and adversary robustness (the number of arbitrarily corrupted columns).
While the interplay between sample and model complexities is a recurring theme in modern work in statistics and machine learning, their relation with robustness (particularly to arbitrary and adversarial corruption, as opposed to neutral, stochastic noise) seems much less well understood. Our results show that with more samples, one can not only estimate matrices of higher rank, but also be robust to more adversarial columns. Importantly, we provide both (and nearly matching) upper and lower bounds, thus establishing a complete and sharp characterization of this phenomenon. To establish lower bounds under arbitrary corruption, we use techniques that are quite different from existing ones for stochastic corruption, which largely rely on Fano's inequality and the like.

Paper Organization. We postpone the discussion of related work to Section 3, after we state our main theorems. In Section 1.1 we describe several applications motivating our study, followed by a summary of our main technical contributions in Section 1.2. In Section 2 we give the mathematical setup of the robust matrix completion problem with corrupted columns. In Section 3 we provide the main results of the paper: a robust matrix completion algorithm, a sufficient condition for the success of the algorithm, and a matching converse theorem showing the optimality of the algorithm. We also survey relevant work in the literature and discuss its connection to our results. In Section 4 we discuss implementation issues and provide empirical results. We prove the two main theorems in Sections 5 and 6, respectively, with some of the technical details deferred to the appendix. The paper concludes with a discussion in Section 7.

1.1 Motivating Applications

Our investigation is motivated by several important problems in machine learning and statistics, which we discuss below.

Manipulation-Robust Collaborative Filtering. In online commerce and advertisement, companies collect user ratings for products, and would like to predict user preferences based on these incomplete ratings—a problem known as collaborative filtering (CF). There is a large and growing literature on CF; most well-known is the work on the Netflix prize [4], but also see [1, 45] and the references therein. Various CF algorithms have been developed [20, 35, 40, 39, 44]. A typical approach is to cast CF as a matrix completion problem: the preferences across users are known to be correlated and are thus modeled as a low-rank matrix $L_0$, and the goal is to estimate $L_0$ from its partially observed entries. However, the quality of prediction may be seriously hampered by (even a small number of) manipulators—potentially malicious users who calibrate (possibly in a coordinated way) their ratings and the entries they choose to rate in an attempt to skew predictions [47, 29]. In the matrix completion framework, this corresponds to the setting where some of the columns of the observed matrix are provided by manipulative users. As the ratings of the authentic users correspond to a low-rank matrix $L_0$, the corrupted ratings correspond to a column-sparse matrix $C_0$. Therefore, in order to perform collaborative filtering with robustness to manipulation, we need to identify the non-zero columns of $C_0$ and at the same time recover $L_0$, given only a set of incomplete entries. This falls precisely into the scope of our problem.
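To make this setup concrete, the following is a minimal synthetic sketch of the model (1) in the CF setting (in Python/NumPy; we use Python for all illustrative snippets in this version, and the sizes and values below are our own example, not ones from the paper's experiments): a low-rank ratings matrix, a few arbitrary manipulator columns, and a partial-observation mask.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, n_c, r = 100, 80, 5, 3   # items, authentic users, manipulators, rank
p = 0.3                        # observation probability on authentic columns

# Low-rank authentic part L0 = A B^T, padded with n_c zero columns.
A = rng.standard_normal((m, r))
B = rng.standard_normal((n, r))
L0 = np.hstack([A @ B.T, np.zeros((m, n_c))])

# Column-sparse corruption C0: arbitrary values on the last n_c columns.
C0 = np.zeros((m, n + n_c))
C0[:, n:] = rng.standard_normal((m, n_c)) * 10.0

M = L0 + C0                    # the decomposition M = L0 + C0 from (1)

# Bernoulli observation mask: probability p on authentic columns; the
# adversary may choose to reveal every entry of the corrupted columns.
mask = np.zeros((m, n + n_c), dtype=bool)
mask[:, :n] = rng.random((m, n)) < p
mask[:, n:] = True

M_obs = np.where(mask, M, 0.0)  # P_Omega(M), with unobserved entries zeroed
```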
Robust PCA. In the robust Principal Component Analysis (PCA) problem [48, 50, 32, 38], one is given a data matrix $M$, of which most of the columns correspond to authentic data points that lie in a low-dimensional space—the space of principal components. The remaining columns are outliers, which are not (known to be) captured by a low-dimensional linear model. The goal is to negate the effect of the outliers and recover the true principal components. In many situations, such as problems in medical research (see, e.g., [12]), there are unobserved variables/attributes for each data point. The problem of robust PCA with partial observation—recovering the principal components in the face of partially observed samples and corrupted points—falls directly into our framework.

Crowdsourcing. Crowdsourcing has emerged as a popular approach for using human power to solve learning problems. Here multiple-choice questions are distributed to several workers, whose answers are then collected and aggregated in an attempt to obtain an accurate answer to each question. In a simplified setting called the Hammer-Spammer model [24, 25], a worker is either a hammer, who gives correct answers, or a spammer, who answers completely randomly. A more general setting is considered in [26], where the spammers need not follow a probabilistic model and may submit any answers they want, for instance with an unknown bias, or even adversarially. This problem can be mapped to our matrix framework, where rows correspond to questions and columns to workers, with $L_0$ representing the matrix of true answers from the hammers and $C_0$ the answers from the spammers. Each worker typically answers only a subset of the questions, leading to partial observation.

Model Mismatch. More generally, the corrupted columns can encompass any observations that are not captured by the assumed low-rank model. These observations may be generated from an unknown population or affected by factors beyond the knowledge of the modeler, but are not necessarily adversarial. Such mismatch between models and data is ubiquitous. For instance, in collaborative filtering there may exist a small set of atypical users whose preferences are very weakly correlated with the majority and are thus difficult to infer using data from the majority. In PCA, some data points may simply not conform to the low-dimensional linear model. The answers from some workers in crowdsourcing systems may be erroneous because the data collecting process is non-ideal and not fully controllable. It is difficult to accurately model or recover these columns, but our results guarantee that they do not hinder the recovery of the other columns.

1.2 Main Contributions

In this paper, we propose a new algorithm for matrix completion in the presence of corrupted columns and provide performance guarantees. Specifically, we have the following results:

1. We develop a two-step matrix completion algorithm, which first trims the over-sampled columns of the matrix, and then solves a convex optimization problem involving the nuclear norm and the $\ell_{1,2}$ norm. Our algorithm extends the standard nuclear norm minimization approach for matrix completion, and the use of trimming and the $\ell_{1,2}$ norm plays a crucial role in achieving robustness to arbitrary column-wise corruption.
2. For an $n \times n$ incoherent matrix with rank $r$ and a subset of its columns arbitrarily corrupted, we show that if a fraction $p$ of randomly located entries are observed in the uncorrupted columns, then our algorithm provably identifies the corrupted columns and completes the uncorrupted ones as long as $p$ obeys the usual condition $p \gtrsim \frac{r \log^2 n}{n}$ for matrix completion, and in addition the fraction $\gamma$ of corrupted columns satisfies $\gamma \lesssim \frac{p}{r\sqrt{r}\, \log^3 n}$.

3. We further show that these two conditions are near-optimal, in the sense that if $p \lesssim \frac{r \log n}{n}$ or $\gamma \gtrsim \frac{p}{r}$, then an adversary can corrupt the columns in such a way that all algorithms fail with probability bounded away from zero. Therefore, our results establish tight bounds for the sample-robustness-rank tradeoffs in matrix completion.

4. We develop a variant of the Augmented Lagrangian Multipliers (ALM) method for solving the convex optimization problem in our algorithm. Empirical results on synthetic data are provided, which corroborate our theoretical findings and show that our algorithm is more robust than standard matrix completion algorithms.

2 Problem Setup

Suppose $M$ is a ground-truth matrix in $\mathbb{R}^{m \times (n + n_c)}$. Among the $n + n_c$ columns of $M$, $n$ of them (we will call them authentic or non-corrupted) span an $r$-dimensional subspace of $\mathbb{R}^m$, and the remaining $n_c$ columns are arbitrary (we will call them corrupted). We only observe a subset of the entries of the matrix $M$, and the goal is to infer the true subspace of the authentic columns and the identities of the corrupted ones.

Under the above setup, it is clear that the matrix $M$ can be decomposed as $M = L_0 + C_0$. Here $L_0 \in \mathbb{R}^{m \times (n+n_c)}$ is the matrix containing the authentic columns, and therefore $\mathrm{rank}(L_0) = r$. The matrix $C_0 \in \mathbb{R}^{m \times (n+n_c)}$ contains the corrupted columns, so at most $n_c$ of the columns of $C_0$ are non-zero. Let $I_0 \subset [n+n_c]$ be the set of indices of the corrupted columns; that is, $I_0 := \mathrm{column\text{-}support}(C_0)$, where $|I_0| = n_c$. Let $\Omega \subseteq [m] \times [n+n_c]$ be the set of indices of the observed entries of $M$, and $\mathcal{P}_\Omega$ the projection onto the matrices supported on $\Omega$, which is given by
$$ (\mathcal{P}_\Omega X)_{ij} = \begin{cases} X_{ij}, & (i,j) \in \Omega, \\ 0, & (i,j) \notin \Omega. \end{cases} $$
With this notation, our goal is to exactly recover from $\mathcal{P}_\Omega M$ the authentic columns in $L_0$ and the corresponding column space, as well as the locations $I_0$ of the non-zero columns of $C_0$.

2.1 Assumptions

In general, it is not always possible to complete a low-rank matrix in the presence of corrupted columns. For example, if $L_0$ has only one non-zero column, it is impossible to distinguish $L_0$ from $C_0$ even when $M$ is fully observed. It is also well-known in the matrix completion literature [10, 19, 27] that if $L_0$ has only one non-zero row, or if one row or column of $L_0$ is completely unobserved, then recovering $L_0$ from partial observations is problematic. To avoid these pathological situations, we will assume that $L_0$ satisfies the now-standard incoherence condition [10] and that the observed entries on the authentic columns of $L_0$ are sampled at random. We note that we make no assumptions on the values or locations of the observed entries of the corrupted columns in $C_0$.

2.1.1 Incoherence Condition

Suppose $L_0$ has the Singular Value Decomposition (SVD) $L_0 = U_0 \Sigma_0 V_0^\top$, where $U_0 \in \mathbb{R}^{m \times r}$, $V_0 \in \mathbb{R}^{(n+n_c) \times r}$ and $\Sigma_0 \in \mathbb{R}^{r \times r}$.
We use $\|\cdot\|_2$ to denote the vector $\ell_2$ norm, and $e_i$ to denote the $i$-th standard basis vector, whose dimension will be clear from context.

Assumption 1 (Incoherence). The matrix $L_0$ is zero on the columns in $I_0$. Moreover, $L_0$ satisfies the following two incoherence conditions with parameter $\mu$:
$$ \max_{1 \le i \le m} \|U_0^\top e_i\|_2^2 \le \frac{\mu r}{m}, \qquad \max_{1 \le j \le n+n_c} \|V_0^\top e_j\|_2^2 \le \frac{\mu r}{n}. $$

Since the columns of $L_0$ in $I_0$ are superposed with the arbitrary $C_0$, there is no hope of recovering these columns. Therefore, there is no loss of generality in assuming $L_0$ is zero on $I_0$. Consequently, the matrix $V_0^\top$ has at most $n$ non-zero columns (all in $I_0^c$), and accordingly the denominator on the right-hand side of the second inequality above is $n$ instead of the full dimension $n + n_c$. The two incoherence conditions are needed for completion of $L_0$ from partial observations even with no corrupted columns. The incoherence parameter $\mu$ is known to be small in various natural models and applications [10, 11]. The second inequality in Assumption 1 is necessary in the presence of corrupted columns, even when the matrix is fully observed. This inequality essentially enforces that the information about the column space of $L_0$ is spread out among the columns. If, for instance, an authentic column of $L_0$ were not in the span of all the other columns, one could not hope to distinguish it from a corrupted column (cf. [50]).

Finally, we note that previous work on matrix completion often imposes a strong incoherence condition $\max_{i,j} |(U_0 V_0^\top)_{ij}| \le \sqrt{\frac{\mu r}{mn}}$ [10, 11, 19, 42]. We do not need this assumption, thus improving over these previous results. Further discussion of this point is provided after our main theorems.

2.1.2 Sampling Model

Recall that $I_0$ is the set of indices of the corrupted columns. Let $\tilde\Omega := \Omega \cap ([m] \times I_0^c)$ be the set of indices of observed entries on the non-corrupted columns. We use the following definition.

Definition 1 (Bernoulli model). Suppose $\Theta_0 \subseteq [m] \times [n+n_c]$. A set $\Theta$ is said to be sampled from the Bernoulli model with probabilities $\{p_j\}_{j=1}^{n+n_c}$ on $\Theta_0$ if each element $(i,j)$ of $\Theta_0$ is contained in $\Theta$ with probability $p_j$, independently of all others. If $p_j = p$ for all $j$, then $\Theta$ is said to be sampled from the Bernoulli model on $\Theta_0$ with uniform probability $p$.

We can now specify our assumption on how the observed entries are sampled.

Assumption 2 (Sampling). The set $\tilde\Omega$ is sampled from the Bernoulli model with probabilities $\{p_j\}$ on $[m] \times I_0^c$, where $p_j \ge p$ for all $j \in I_0^c$. Moreover, $\tilde\Omega$ is independent of $\mathcal{P}_\Omega C_0$, the observed entries on the corrupted columns.

Note that our model is more general than the uniform sampling model assumed in some previous work—we only require a lower bound on the observation probabilities of the non-corrupted columns, so some columns may have an observation probability higher than $p$. Importantly, the Bernoulli model is not imposed on the corrupted columns. The adversary may choose to reveal all entries on the columns in $I_0$ or just a fraction of them, and the locations of these observed entries may be chosen randomly or adversarially depending on $L_0$. The assumption that $\tilde\Omega$ is independent of the corrupted columns is needed for technical reasons. We conjecture that it is only an artifact of our analysis and not actually necessary, as indicated by our empirical results.
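For intuition, the incoherence parameter of Assumption 1 is easy to compute numerically. The sketch below (our own illustration, not from the paper) returns the smallest $\mu$ for which a given $L_0$ satisfies both conditions; note that the second condition divides by the number of authentic columns $n$ rather than $n + n_c$.

```python
import numpy as np

def incoherence(L0, r, n_authentic):
    """Smallest mu such that L0 satisfies both conditions in Assumption 1.

    L0 is m x (n + n_c), zero on the corrupted columns, with rank r;
    n_authentic = n is the number of authentic columns.
    """
    m = L0.shape[0]
    U, _, Vt = np.linalg.svd(L0, full_matrices=False)
    U0, V0 = U[:, :r], Vt[:r, :].T                 # top-r singular subspaces
    row_leverage = np.sum(U0 ** 2, axis=1)         # ||U0^T e_i||_2^2
    col_leverage = np.sum(V0 ** 2, axis=1)         # ||V0^T e_j||_2^2
    mu_row = row_leverage.max() * m / r
    mu_col = col_leverage.max() * n_authentic / r  # denominator n, not n + n_c
    return max(mu_row, mu_col)
```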
2.1.3 Corrupted Columns

Let $\gamma := \frac{n_c}{n}$ be the ratio of the number of corrupted columns to the number of authentic columns. Other than the independence requirement above, we make no assumption whatsoever on the corrupted columns in $C_0$. The incoherence assumption is imposed on the authentic $L_0$, not on $M$ or $C_0$, as is the sampling assumption, and therefore the corrupted columns are not restricted in any way by these. These columns need not follow any probabilistic distribution, and they may be chosen by an adversary who aims to skew one's inference of the non-corrupted columns. One consequence of this is that we will not be able to recover the values of the completely corrupted columns of $C_0$, but we are able to reveal their identities.

3 Main Results: Algorithms, Guarantees and Limits

The main result of this paper says that despite the corrupted columns and partial observation, we can simultaneously recover $L_0$, the non-corrupted columns, and identify $I_0$, the positions of the corrupted columns, as long as the numbers of corrupted columns and unobserved entries are controlled. Moreover, this can be achieved efficiently via a tractable procedure, given as Algorithm 1.

The algorithm has two steps. In the first, trimming step, we find columns with a large number of observed entries, and throw away some of these entries randomly. This step is important, both in theory and empirically, to achieve good performance: an adversary may choose to reveal (and corrupt) a large number of entries on certain columns, which may skew the next step of the algorithm; the trimming step protects against this effect. Note that we cannot directly identify these over-sampled corrupted columns by counting the number of observations—under the (non-uniform) sampling model in Assumption 2, some authentic columns are also allowed to have many observed entries. In the next step of the algorithm, we solve a convex program with the trimmed observations as the input. The convex program, in fact a Semidefinite Program (SDP), finds a pair $(L^*, C^*)$ that is consistent with the observations and minimizes the weighted sum of the nuclear norm $\|L\|_*$ and the matrix $\ell_{1,2}$ norm $\|C\|_{1,2}$; here $\|L\|_*$ is the sum of the singular values of $L$ and a convex surrogate of its rank, and $\|C\|_{1,2}$ is the sum of the column $\ell_2$ norms of $C$ and a convex surrogate of its column sparsity. The algorithm has two parameters: the threshold $0 < \rho < 1$ for trimming and the coefficient $\lambda > 0$ for the weighted sum in the convex program. Our theoretical results specify how to choose their values.

Algorithm 1 Manipulator Pursuit
Input: $\mathcal{P}_\Omega(M)$, $\Omega$, $\lambda$, $\rho$.
Trimming: For $j = 1, \ldots, n + n_c$: if the number of observed entries $h_j$ on the $j$-th column satisfies $h_j > \rho m$, then randomly select $\rho m$ entries (by sampling without replacement) from these $h_j$ entries and set the rest as unobserved. Let $\hat\Omega$ be the set of remaining observed indices.
Solve for optimum $(L^*, C^*)$:
$$ \begin{aligned} \underset{L,\,C}{\text{minimize}} \quad & \|L\|_* + \lambda \|C\|_{1,2} \qquad (2) \\ \text{subject to} \quad & \mathcal{P}_{\hat\Omega}(L + C) = \mathcal{P}_{\hat\Omega}(M). \end{aligned} $$
Set $I^* = \mathrm{column\text{-}support}(C^*) := \{ j : C^*_{ij} \ne 0 \text{ for some } i \}$.
Output: $L^*$, $C^*$ and $I^*$.

We say Algorithm 1 succeeds if we have $\mathcal{P}_{U_0}(L^*) = L^*$, $\mathcal{P}_{I_0^c}(L^*) = L_0$ and $I^* \subseteq I_0$ for any optimal solution $(L^*, C^*)$ of (2). Here $\mathcal{P}_{U_0}(L^*) := U_0 U_0^\top L^*$ is the projection of the columns of $L^*$ onto the column space of $L_0$, and $\mathcal{P}_{I_0^c}(L^*)$ is the projection of $L^*$ onto the matrices supported on the column indices in $I_0^c$, given by
$$ (\mathcal{P}_{I_0^c}(L^*))_{ij} = \begin{cases} L^*_{ij}, & j \notin I_0, \\ 0, & j \in I_0. \end{cases} $$
That is, the algorithm succeeds if we recover the true column space of the original $L_0$ and complete its uncorrupted columns, and at the same time identify the locations of the corrupted columns. Note that the definition of success allows for $I^* \subsetneq I_0$. In this case it may appear that some corrupted columns are unidentified and included in $L^*$, but this is actually not a problem: the requirement $\mathcal{P}_{U_0}(L^*) = L^*$ means that these unidentified "corrupted" columns can be completed to lie in the true column space of $L_0$, so they are essentially not corrupted, as they are indistinguishable from partially observed authentic columns and do not affect the completion.

3.1 Sufficient Conditions for Recovery

Our first main theorem guarantees that under some natural conditions, our algorithm exactly recovers the non-corrupted columns and the identities of the corrupted columns with high probability. Recall that $p$ is a lower bound on the observation probability on the non-corrupted columns, $\gamma := \frac{n_c}{n}$ is the ratio between the numbers of corrupted and uncorrupted columns, and $\rho$ is the trimming threshold.

Theorem 1. Let $\alpha := \frac{\rho}{p}$. There exist universal positive constants $c_1$ and $c_2$ for which the following holds. Suppose Assumptions 1 and 2 hold. If in Algorithm 1 we take
$$ \lambda \in \left[ \sqrt{\frac{\left(1 + \frac{1}{\alpha}\right)\mu r \log(m+n)}{p\,n}},\ \ \frac{1}{48}\sqrt{\frac{p}{(1+\alpha)\,\mu r\,\gamma n \log(m+n)}}\ \right], $$
and $(p, \gamma)$ satisfies
$$ p \ge c_1 \frac{\left(1 + \frac{1}{\alpha}\right)\mu r \log^2(m+n)}{\min(m,n)}, \qquad (3) $$
$$ \gamma \le c_2 \frac{\alpha\sqrt{\alpha}}{1+\alpha} \cdot \frac{p}{\mu r \sqrt{\mu r}\, \log^3(m+n)}, \qquad (4) $$
then Algorithm 1 succeeds with probability at least $1 - 20(m+n)^{-5}$.

Note that the interval for $\lambda$ is non-empty under condition (4). We prove this theorem in Section 5. The two conditions (3) and (4) have the natural interpretation that the algorithm succeeds as long as there are sufficiently many observed entries (in particular, more than the degrees of freedom of a rank-$r$ matrix), and the number of corrupted columns is not too large relative to the number of observed entries. We discuss these two conditions in more detail in the next sub-section. The theorem also shows that the parameter $\lambda$ in the convex program (2) can take any value in a certain range.

3.1.1 Consequences

We explore several consequences of Theorem 1. The conditions (3) and (4) above involve the value of the parameter $\rho$ from the trimming step in Algorithm 1. The conditions become the least restrictive if $\alpha := \frac{\rho}{p} = \Theta(1)$, i.e., when $\rho$ is of the same order as $p$. Choosing $\rho$ optimally in this way gives the following corollary.

Corollary 1 (Optimal Bound). There exist universal constants $c_1$ and $c_2$ such that the following holds. Suppose Assumptions 1 and 2 hold, and we take $\rho = p$ and $\lambda = \sqrt{\frac{2\mu r \log(m+n)}{pn}}$ in Algorithm 1. Then Algorithm 1 succeeds with probability at least $1 - 20(m+n)^{-5}$ as long as $(p, \gamma)$ satisfy (3) and (4) with $\alpha = 1$.

For a more concrete example, suppose the observation probability satisfies $p \gtrsim \frac{\sqrt{\mu^3 r^3}\,\log^3 n}{n^{1-\kappa}}$; then Corollary 1 guarantees success of our algorithm when the number $\gamma n$ of corrupted columns is less than $n^\kappa$.
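Returning to the algorithm itself, the trimming step is simple to implement. A sketch follows (our own illustration, not the authors' code; `mask` is a boolean matrix of observed entries):

```python
import numpy as np

def trim(mask, rho, rng=np.random.default_rng(0)):
    """Trimming step of Algorithm 1: for each column with more than
    rho*m observed entries, keep a uniform random subset of exactly
    rho*m of them (sampling without replacement) and drop the rest."""
    m = mask.shape[0]
    cap = int(rho * m)
    trimmed = mask.copy()
    for j in range(mask.shape[1]):
        obs = np.flatnonzero(mask[:, j])
        if obs.size > cap:
            keep = rng.choice(obs, size=cap, replace=False)
            trimmed[:, j] = False
            trimmed[keep, j] = True
    return trimmed
```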
In the conference version [18] of this paper, we analyzed the second step of Algorithm 1 (i.e., without trimming, or equivalently $\rho = 1$) and showed that it succeeds if $(p, \gamma)$ satisfy (among other things) the condition
$$ \gamma \lesssim \frac{p^2}{\left(1 + \frac{\mu r}{p\sqrt{n}}\right)^2 \mu^3 r^3 \log^6(m+n)}. $$
This result is significantly improved by Corollary 1 (in particular, by the condition (4) with $\alpha = 1$), which allows for an order-wise larger number of corrupted columns. Our analysis reveals that the trimming step in Algorithm 1 is crucial to this improvement.

Remark 1. In practice, we may estimate the value of $p$ using a robust mean estimator (e.g., the median or trimmed mean) of the fraction of observed entries over the columns. Given such an estimate $\hat p$, we can set $\rho = \hat p$ and $\lambda = \sqrt{\frac{c \log(m+n)}{\hat p n}}$ for some constant $c$ (say $50$), and the algorithm's success is guaranteed by Theorem 1 and Corollary 1 for $\mu r = O(1)$. (Note that while we may not know $n$, we do know the value of $n + n_c$, which differs from $n$ by at most a factor of $2$ whenever $n_c \le n$.) This approach is taken in our empirical studies in Section 4.

Setting $p = 1$ in Corollary 1 immediately yields a guarantee for the full observation setting.

Corollary 2 (Full Observation). Suppose Assumptions 1 and 2 hold with $p = 1$. If we take $\rho = 1$ and $\lambda = \sqrt{\frac{2\mu r \log(m+n)}{n}}$ in Algorithm 1, and $\gamma := \frac{n_c}{n}$ satisfies
$$ \gamma \le c_1' \frac{1}{\mu r \sqrt{\mu r}\, \log^3(m+n)} $$
for some universal constant $c_1'$, then Algorithm 1 succeeds with probability at least $1 - 20(m+n)^{-5}$.

The full observation setting of our model corresponds to the Robust PCA problem with sample-wise corruption (cf. Section 1.1), previously considered in [49, 50]. There, an algorithm called Outlier Pursuit is proposed, which is similar to the second step of our Algorithm 1 and is shown to succeed in the full observation setting if $\gamma \lesssim \frac{1}{\mu r}$. Our result in Corollary 2 is off by a small factor of $\sqrt{\mu r}\,\log^3(m+n)$. This sub-optimality can be removed by a more careful analysis in the setting where $p$ is close to $1$, but we choose not to delve into it.

On the other hand, setting $\gamma = 0$ gives a guarantee for the standard exact matrix completion setting with clean observations. Our result is powerful enough that it in fact improves upon some previous results in this setting.

Corollary 3 (Matrix Completion). Suppose $\gamma = 0$ and Assumptions 1 and 2 hold. If we take $\rho = 1$ and $\lambda \ge \sqrt{\frac{2\mu r \log(m+n)}{pn}}$ in Algorithm 1, and $p$ satisfies
$$ p \ge c_1'' \frac{\mu r \log^2(m+n)}{\min(m,n)} $$
for some universal constant $c_1''$, then Algorithm 1 succeeds with probability at least $1 - 20(m+n)^{-5}$.

Exact matrix completion was considered in the seminal work [10] and subsequently in [11, 19, 42], in which the low-rank matrix $L_0$ is assumed to satisfy two incoherence conditions: the standard incoherence condition with parameter $\mu$ as in Assumption 1, and an additional strong incoherence condition $\|U_0 V_0^\top\|_\infty \le \sqrt{\frac{\mu_{\mathrm{str}} r}{mn}}$. They show that $L_0$ can be exactly recovered via nuclear norm minimization if $p \gtrsim \frac{\max\{\mu, \mu_{\mathrm{str}}\} r \log^2(m+n)}{\min\{m,n\}}$. Corollary 3 improves upon this result by removing the dependence on the strong incoherence parameter $\mu_{\mathrm{str}}$, which can be as large as $\mu r$. This improvement was also observed in the recent work [15, 16].
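The parameter choices of Remark 1 take only a few lines; the sketch below (our own illustration) estimates $\hat p$ by the median of the per-column observed fractions and sets $\rho$ and $\lambda$ accordingly, using $n + n_c$ in place of the unknown $n$ as the remark permits.

```python
import numpy as np

def choose_parameters(mask, c=50.0):
    """Parameter selection as in Remark 1: estimate the observation
    probability by the median of the per-column observed fractions, then
    take rho = p_hat and lambda = sqrt(c * log(m+n) / (p_hat * n))."""
    m, n_total = mask.shape            # n_total = n + n_c, within a factor 2 of n
    p_col = mask.mean(axis=0)          # fraction of observed entries per column
    p_hat = np.median(p_col)           # robust to over-sampled corrupted columns
    lam = np.sqrt(c * np.log(m + n_total) / (p_hat * n_total))
    return p_hat, lam                  # use p_hat as the trimming threshold rho
```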
We have seen that Theorem 1 and Corollary 1 give, as immediate corollaries, strong bounds for the special cases of full observation and standard matrix completion, which is a testament to the sharpness of our results. In fact, we show in the next sub-section that the conditions in Theorem 1 are near-optimal.

3.2 Information-Theoretic Limits for Recovery

Corollary 1 says that the conditions (3) and (4) with $\alpha = 1$ are sufficient for our algorithm to succeed. Theorem 2 below shows that these conditions are in fact close to being information-theoretically (minimax) optimal. That is, they cannot be significantly improved by any algorithm, regardless of its computational complexity. Note that the theorem tracks the values of $\mu$, $r$, $p$ and $\gamma$, so all of them can scale in a non-trivial way with respect to $n$.

Theorem 2. Suppose $m = n \ge 4$, $\mu r \le \frac{n}{\log(2n)}$, and $(p, \gamma)$ satisfy
$$ p \le \frac{\mu r \log(2n)}{2n} \qquad (5) $$
or
$$ \gamma := \frac{n_c}{n} \ge \frac{2p}{\mu r}. \qquad (6) $$
Then any algorithm will fail to output the correct column space with probability at least $\frac{1}{16}$; more precisely, for all measurable functions $\hat L$ of $M$ and $\Omega$,
$$ \max_{L_0,\, C_0,\, \Omega \setminus \tilde\Omega} \ \mathbb{P}\left[ \mathcal{P}_{U_0}(\hat L) \ne \hat L \right] \ge \frac{1}{16}, $$
where the maximization ranges over all matrix pairs $(L_0, C_0)$ and observed indices on the corrupted columns $\Omega \setminus \tilde\Omega$ that satisfy Assumptions 1 and 2, and the probability is with respect to the distribution of the observed indices on the non-corrupted columns $\tilde\Omega$.

We prove this theorem in Section 6.

Comparing with Theorem 2, we see that the conditions in Corollary 1 are close to the achievable limits. In particular, with $\alpha = 1$, the condition (3) on $p$ matches (5) up to one logarithmic factor, and the condition (4) on $\gamma$ is worse than (6) by a factor of $c\sqrt{\mu r}\,\log^3 n$. In particular, both conditions are optimal up to logarithmic factors in the case of constant rank and incoherence, $\mu r = O(1)$. It is of interest to study whether this small gap can be closed, potentially by tightening the sufficient conditions in Theorem 1 and Corollary 1.

The failure condition (5) is an extension of a standard result for matrix completion in [11, Theorem 1.7]. To gain some intuition about the second condition (6), consider the case $\mu r = 1$, for which the condition becomes $n_c \gtrsim pn$. This means that with probability bounded away from zero, the number of observed corrupted entries in the first row exceeds that of observed authentic entries in the same row. In this case, if the corrupted entries in the other rows are chosen to be consistent with the true column space (on all but the first coordinates), then no algorithm can tell which of the two sets of entries in the first row is actually authentic, and therefore recovery of this row is impossible. Theorem 2 is proved using an extension of the above argument—by demonstrating a particular way of corrupting $n_c \gtrsim \frac{pn}{\mu r}$ columns that provably confuses any algorithm.

Implications for Robust PCA. Recall the Robust PCA setting with full observations ($p = 1$) and the Outlier Pursuit algorithm discussed after Corollary 2 in Section 3.1. Theorem 2 shows that $\gamma \lesssim \frac{1}{\mu r}$ is necessary, so the guarantee for Outlier Pursuit given in [50] is order-wise optimal.

3.3 Sample-Robustness-Rank Tradeoffs

The results in the last two sub-sections highlight the tradeoffs between sample complexity, outlier robustness and model complexity (matrix rank).
In particular, given a higher observation probability $p$, one can handle a higher fraction $\gamma$ of corrupted columns and a higher rank $r$ of the underlying matrix. The other direction is also true: with a smaller $p$, the fraction of allowable corrupted columns and the allowable rank necessarily become smaller, regardless of the algorithm and the amount of computation. Theorems 1 and 2 provide the precise conditions that $p$, $\gamma$ and $r$ need to obey.

We emphasize that here we consider robustness to arbitrary and possibly adversarial corruption. Our results characterize, in terms of both upper and lower bounds, the tradeoffs between adversary robustness and sample/model complexities. This can be put into the context of the study of modern high-dimensional statistics [41, 7], where the relationship between sample and model complexities is a central topic of interest. More recently, a line of work has focused on the tradeoffs between computational complexity and various statistical quantities [13, 5, 36, 53]. Our results can be viewed as adding a new dimension to these recent lines of work: we consider another axis of the problem—robustness (to adversarial corruption)—and its relation to other statistical quantities. Therefore, while we investigate a specific problem (matrix completion), we expect sample-robustness tradeoffs to be relevant in a broader context.

Finally, we note that our empirical study in Section 4 demonstrates the following phenomenon: if we further assume that the corrupted columns are randomly generated and independent of each other, then our algorithm can recover $L_0$ with a much higher number $n_c$ of corrupted columns than is predicted by Theorems 1 and 2 (which require, among other things, $\gamma \le 1$, i.e., $n_c \le n$). In particular, the corrupted columns can significantly out-number the authentic columns. This means our algorithm is useful well beyond the adversarial corruption setting considered in the theorems above, and its actual performance can become better when the corruption is more restricted and "benign". A similar phenomenon is observed in [32, 52] for the special case of full observation. Here we therefore see another level of the sample-robustness-rank tradeoffs: if we only ask for a weaker sense of robustness, namely, robustness against randomly corrupted columns as opposed to arbitrary ones, then we have more relaxed requirements on the observation probability, the rank and the number of corrupted columns. It is an interesting open problem to rigorously quantify the interplay between the nature of the corruption and the recovery performance.

3.4 Connections to Prior Work and Innovation

Recent work in matrix completion shows that by using convex optimization [10, 11, 19, 42] or other algorithms [27, 23, 8], one can exactly recover an $n \times n$ rank-$r$ matrix with high probability from as few as $O(nr\,\mathrm{poly}\log n)$ (clean) entries. Our paper extends this line of work and shows that even if all the observed entries on some columns are completely corrupted (by possibly adversarial noise), one can still recover the non-corrupted columns as well as the identities of the corrupted ones. As discussed before, our work also extends the work in [49, 50], which only considers the full observation setting; see also [2] for results on the full observation setting with noise.
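Before discussing the analysis, we note for readers who wish to experiment that the convex program (2) can be prototyped directly with a generic modeling tool before turning to the scalable solver of Section 4.1. Below is a sketch using cvxpy (our own illustration, not the implementation used in our experiments; a generic SDP-capable solver such as SCS only handles small instances):

```python
import cvxpy as cp
import numpy as np

def manipulator_pursuit_cvx(M_obs, mask, lam):
    """Prototype of the convex program (2): minimize ||L||_* + lam*||C||_{1,2}
    subject to agreeing with the (trimmed) observations. `mask` is a 0/1
    float array of observed entries; M_obs is zero off the mask."""
    m, n = M_obs.shape
    L = cp.Variable((m, n))
    C = cp.Variable((m, n))
    objective = cp.Minimize(cp.normNuc(L) + lam * cp.sum(cp.norm(C, 2, axis=0)))
    constraints = [cp.multiply(mask, L + C) == M_obs]
    cp.Problem(objective, constraints).solve(solver=cp.SCS)
    return L.value, C.value
```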
The centerpiece of our algorithm is a convex optimization problem that is a convex proxy for a very natural but intractable algorithm for our task, namely, finding a low-rank matrix $L$ and a column-sparse matrix $C$ consistent with the observed data. Such convex surrogates for rank and support functions have been used (often separately) in problems involving low-rank matrices [43, 10] and in problems with group sparsity [51, 21]. While this manuscript was under preparation, we learned about the very recent work [28], which also studies robust matrix completion under column-wise sparse corruption, albeit in a somewhat different setting. Their results focus on the noisy setting with general sampling distributions, but do not guarantee exact recovery in the noiseless case.

Our work is also related to the problem of separating a low-rank matrix and an overall (element-wise) sparse matrix from their sum [9, 14] (sometimes called the low-rank-plus-sparse problem, or L + S for short). This problem has also been studied under the partial observation setting [9, 17, 33]. Compared to this line of work, our results indicate that separation is possible even if the low-rank matrix is added to a column-sparse matrix instead of an overall sparse matrix. In particular, we allow all the observations from some columns to be completely corrupted. In contrast, existing guarantees for the L + S problem require that at least some observations from each row and column are clean, and are thus not suitable for our setting; this is also demonstrated in our experiments. Moreover, although we do not pursue this in the present paper, our techniques allow us to establish results on separating three components—a low-rank matrix, an element-wise sparse matrix, and a column-sparse matrix.

Besides the obvious difference in the problem setup, our paper also departs from previous work in terms of the mathematical analysis. In particular, in previous work on exact matrix completion and decomposition, the intended outcome is known a priori—the goal is to output a matrix, or a pair of matrices, exactly equal to the original one(s). In our setting, however, the optimal solution of the convex problem is in general neither the original low-rank matrix $L_0$ nor the matrix $C_0$ which consists of only the corrupted columns. This critical difference requires a novel analysis that builds on a variant of the primal-dual witness (or oracle problem) method. This method has been applied to study support recovery in problems involving sparsity [3, 31]. Here we use the method for the recovery of the eigenspace and column support. A related problem is considered in [49, 50], which, however, only studies the full observation setting. The presence of (many) missing entries makes the problem much more complicated, as we need to deal with three matrix structures simultaneously: low-rankness, column sparsity, and overall/element-wise sparsity. This requires the introduction of new ingredients in the analysis; in particular, one important technical innovation requires the development of new concentration results that involve these three structures, including bounds on the $\|\cdot\|_{\infty,2}$ norms of certain randomly sampled low-rank matrices (see Lemmas 9 and 11).

4 Implementation and Empirical Results

In this section, we discuss implementation issues of our algorithm and provide empirical results.
4.1 An ADMM Solver for the Convex Program

The optimization problem (2) is a semidefinite program (SDP), and can be solved by off-the-shelf SDP solvers. However, these general-purpose solvers can only handle small problems (e.g., 400-by-400 matrices) and do not scale well to large datasets. Here we use a family of first-order algorithms called the Alternating Direction Method of Multipliers (ADMM) [6, 34], which have been shown to be effective on problems involving non-smooth objective functions. We adapt this method to our partially observed, $\|\cdot\|_* + \lambda\|\cdot\|_{1,2}$-type problem; see Algorithm 2. Here $\mathcal{L}_\epsilon(S)$ is the entry-wise soft-thresholding operator: if $|S_{ij}| \le \epsilon$, set it to zero, and otherwise let $S_{ij} := S_{ij} - \epsilon\, S_{ij}/|S_{ij}|$. Similarly, $\mathcal{C}_\epsilon(C)$ is the column-wise soft-thresholding operator: if $\|C_i\|_2 \le \epsilon$, set the column to zero, and otherwise let $C_i := C_i - \epsilon\, C_i/\|C_i\|_2$. Note that the matrix $E^{(k)}$ accounts for the unobserved entries. In our experiments, the parameters are set to $u_0 = \|M\|_{1,2}^{-1}$ and $\alpha = 1.1$, and the criterion for convergence is $\|M - E^{(k)} - L^{(k)} - C^{(k)}\|_F / \|M\|_F \le 10^{-6}$.

The main cost of Algorithm 2 is computing the SVD of the matrix $Z := M - E^{(k)} - C^{(k)} + u_k^{-1} Y^{(k)}$ in each iteration. We can speed up the computation by taking advantage of the specific structure of our problem, namely partial observation and low-rankness. Observe that the iterate $Z$ can be written as the sum of two matrices
$$ Z = \left(M - E^{(k)} - L^{(k)} - C^{(k)} + u_k^{-1} Y^{(k)}\right) + L^{(k)}. $$
A careful examination of Algorithm 2 reveals that the first matrix is non-zero only on the observed indices $\Omega$, while the second matrix has rank equal to the number of singular values that remained non-zero after the soft-thresholding in the last iteration. We can therefore employ a celebrated SVD routine called PROPACK [30], which can make use of such sparse and low-rank structures. Using this strategy, we are able to apply the algorithm to moderately large instances in our experiments, especially in the setting we care about most, i.e., when only a small number of entries are observed.

4.2 Simulations

We test the performance of our method on synthetic data. For a given rank $r$, we generate two matrices $A \in \mathbb{R}^{m \times r}$ and $B \in \mathbb{R}^{n \times r}$ with i.i.d. standard Gaussian entries, and then build the rank-$r$ matrix $L_0 \in \mathbb{R}^{m \times (n+n_c)}$ as $L_0 = AB^\top$ padded with $n_c$ zero columns. The set of observed entries on the authentic columns is generated according to the Bernoulli model in Assumption 2. The observation probabilities $\{p_j\}$ on the authentic columns, as well as the $n_c$ corrupted columns in $C_0$ and their observed entries, are specified later. The observed matrix $\mathcal{P}_\Omega M = \mathcal{P}_\Omega(L_0 + C_0)$ and the set of observed entries $\Omega$ are then given as input to Algorithm 1, with the convex program solved using the ADMM solver described above. We set the parameters $\rho$ and $\lambda$ in Algorithm 1 according to Corollary 1, estimating $p_j$ by $1.1 \times \mathrm{median}(\{\tilde p_j\})$, where $\tilde p_j$ is the empirical observation probability of the $j$-th column.

Algorithm 2 The ALM Algorithm for Robust Matrix Completion
input: $\mathcal{P}_\Omega M \in \mathbb{R}^{m \times (n+n_c)}$ (assuming $\mathcal{P}_{\Omega^c} M = 0$), $\Omega$, $\lambda$
initialize: $Y^{(0)} = 0$; $L^{(0)} = 0$; $C^{(0)} = 0$; $E^{(0)} = 0$; $u_0 > 0$; $\alpha > 1$; $k = 0$.
while not converged do
  $(U, S, V) = \mathrm{SVD}\left(M - E^{(k)} - C^{(k)} + u_k^{-1} Y^{(k)}\right)$;
  $L^{(k+1)} = U\, \mathcal{L}_{u_k^{-1}}(S)\, V^\top$;
  $C^{(k+1)} = \mathcal{C}_{\lambda u_k^{-1}}\left(M - E^{(k)} - L^{(k+1)} + u_k^{-1} Y^{(k)}\right)$;
  $E^{(k+1)} = \mathcal{P}_{\Omega^c}\left(M - L^{(k+1)} - C^{(k+1)} + u_k^{-1} Y^{(k)}\right)$;
  $Y^{(k+1)} = Y^{(k)} + u_k\left(M - E^{(k+1)} - L^{(k+1)} - C^{(k+1)}\right)$;
  $u_{k+1} = \alpha u_k$; $k \leftarrow k + 1$;
end while
return $L^{(k)}$, $C^{(k)}$

4.2.1 Effect of Trimming

In the first set of experiments, we study the performance of Algorithm 1 with and without the trimming step. We consider recovering a matrix $L_0$ with rank $r = 2$ and dimensions $m \times (n + n_c) = 400 \times 400$. The $n_c$ corrupted columns in $C_0$ are identical and equal to a random column vector in $\mathbb{R}^m$, with all of them fully observed. The observation probability $p_j$ of the $j$-th authentic column is equal to $1$ if $j$ is a multiple of $3$, and equal to $p$ otherwise, where we consider different values of $p$. Note that many authentic columns are fully observed, so one cannot distinguish them from the corrupted columns based only on the number of observations. Figure 1 shows the relative errors of the output $L^*$ on the uncorrupted columns, i.e., $\|\mathcal{P}_{I_0^c}(L^* - L_0)\|_F / \|\mathcal{P}_{I_0^c} L_0\|_F$, for different values of the observation probability $p$ and the number of corrupted columns $n_c$. Compared to no trimming, the trimming step often leads to much lower errors and allows for more corrupted columns. This agrees with our theoretical findings and shows that trimming is indeed crucial to good performance.

Figure 1: Comparison of the performance of Algorithm 1 with and without trimming. The plots show the relative Frobenius norm errors for recovering the uncorrupted columns of a $400 \times 400$ rank-$2$ matrix with observation probabilities (a) $p = 0.2$ and (b) $p = 0.3$. Each point in the plots is the average of 10 trials.

Having demonstrated the benefit of trimming when the $p_j$'s are non-uniform, in the remaining experiments we set $p_j \equiv p$ for simplicity.

4.2.2 Comparison with standard matrix completion and L + S

While our theory and algorithm allow the corrupted columns of $C_0$ to have entries with arbitrarily large magnitude, we perform the comparison in a more realistic setting with bounded corruption. In the second set of experiments, the $n_c$ non-zero columns of $C_0$ are identical: each equals the first column of $L_0$ at the locations of its observed entries, and has i.i.d. standard Gaussian entries at the other locations. These columns are normalized to have the same norm as the first column of $L_0$. The locations of the observed entries are also identical across the columns of $C_0$, and are randomly selected according to the Bernoulli model with probability $p$. Note that the columns of $C_0$ have the same norm and observation probabilities as the authentic columns. If we think of each column of $L_0$ as the movie ratings of an authentic user, then the above construction of $C_0$ mimics a rating manipulation scheme that is reported to be effective in the literature [47]. In particular, the columns of $C_0$ are meant to be similar to the ratings of an authentic user in $L_0$ on the observed locations, while trying to skew the unobserved ones in a coordinated fashion.
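For concreteness, this construction can be sketched in a few lines (our own illustration; the function name and interface are ours):

```python
import numpy as np

def manipulator_columns(first_col, n_c, p, rng=np.random.default_rng(0)):
    """Sketch of the rating-manipulation scheme of Section 4.2.2: all n_c
    corrupted columns share one observation set S (Bernoulli with
    probability p), copy the first authentic column on S, are i.i.d.
    Gaussian off S, and are rescaled to the authentic column's norm."""
    m = first_col.shape[0]
    S = rng.random(m) < p                      # common observed locations
    col = np.where(S, first_col, rng.standard_normal(m))
    col *= np.linalg.norm(first_col) / np.linalg.norm(col)
    C_block = np.tile(col[:, None], (1, n_c))  # identical corrupted columns
    return C_block, S
```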
When only a small fraction of the entries are observed, the corrupted columns $\mathcal{P}_\Omega(C_0)$ can be viewed as a sparse matrix. Therefore, to separate $L_0$ from $\mathcal{P}_\Omega(C_0)$, one might think it possible to apply the techniques in [9, 14], dubbed the L + S approach, which decompose a low-rank matrix and a sparse matrix from their sum. In particular, one tries to decompose the input matrix $\mathcal{P}_\Omega(M)$ by solving the following convex program:
$$ \begin{aligned} (L^*, S^*) = \arg\min_{L,S}\ & \|L\|_* + \lambda \|S\|_1 \qquad (7) \\ \text{s.t.}\ & \mathcal{P}_\Omega(L + S) = \mathcal{P}_\Omega(M). \end{aligned} $$
However, a central assumption of the L + S approach—namely, that the support of the sparse matrix is spread out over the columns and rows—is violated in the setup considered in this paper. It is therefore no surprise that the L + S approach is unsuccessful here, as is illustrated numerically in our experiments. In particular, we compare our algorithm with the L + S approach (with $\lambda$ set to $1/\sqrt{\max(m,n)}$ according to [9]), as well as with standard matrix completion (which is equivalent to solving (7) with the additional constraint $S = 0$). The convex program (7) is solved using the ADMM methods in [34]. The results are shown in Figure 2 for various values of $p$ and $n_c$. We see that the L + S and standard matrix completion approaches are not robust under our setting, and our algorithm has consistently better performance under both metrics considered. Moreover, with a higher observation probability $p$, we can handle a larger number $n_c$ of corrupted columns and achieve essentially exact recovery, which is consistent with our theory.

Figure 2: Comparison of robust matrix completion in Algorithm 1 (RMC), standard matrix completion (MC) and the L + S approach. The relative Frobenius norm errors are shown for recovering a $400 \times 400$ rank-$4$ matrix with observation probabilities (a) $p = 0.2$ and (b) $p = 0.4$. Each point in the plots is the average of 10 trials.

4.2.3 Random Corruption

In this third set of experiments, we consider a more benign setting for the corrupted columns, where these columns are generated randomly and independently with i.i.d. Gaussian entries. The experiments are done under the setting with rank $r = 4$, $m = 200$ rows and $n + n_c = 1000$ columns. Figure 3 shows the performance of the three algorithms for various $p$ and $n_c$. Our algorithm again outperforms standard matrix completion and the L + S approach. Perhaps more importantly, we see that our algorithm succeeds with a much higher value of $n_c$ than in the adversarial setting above. In particular, we recover the authentic columns even when they are significantly out-numbered
by the corrupted columns, e.g., with $n = 200$ and $n_c = 800$. This result shows that with such "less adversarial" corruption, the performance of our algorithm is better than is guaranteed by our theory for worst-case corruption. Rigorously characterizing this phenomenon is an interesting future direction.

Figure 3: Comparison of robust matrix completion in Algorithm 1 (RMC), standard matrix completion (MC) and the L + S approach, with randomly corrupted columns. The relative Frobenius norm errors are shown for recovering a $200 \times 1000$ rank-$4$ matrix with observation probabilities (a) $p = 0.3$ and (b) $p = 0.6$. Each point in the plots is the average of 10 trials.

Finally, we demonstrate the applicability of our algorithm to larger matrices with sparse observations. We consider a setting with rank $r = 8$, $m = 1000$ rows and $n + n_c = 5000$ columns, with observation probability $p = 0.05$ or $p = 0.1$. The performance of our algorithm is shown in Figure 4. Again we see that our algorithm is able to recover the true matrix even when there are many corrupted columns. The average running time of each trial is less than 2 minutes, indicating scalability to large problems.

Figure 4: Performance of Algorithm 1 with random corruption. The relative Frobenius norm errors are shown for recovering a $1000 \times 5000$ rank-$8$ matrix with randomly corrupted columns and observation probabilities $p = 0.05$ and $p = 0.1$. Each point is the average of 10 trials.

5 Proof of Theorem 1

In this section we prove the main result, Theorem 1. The proof requires a number of intermediate steps; here we provide a brief overview of the proof roadmap. By the definition of the success of Algorithm 1, we need to show that any optimal solution $(L^*, C^*)$ of the program (2) has the properties (i) $\mathcal{P}_{I_0^c} L^* = L_0$, (ii) $\mathcal{P}_{U_0} L^* = L^*$ and (iii) $I^* = \mathrm{column\text{-}support}(C^*) \subseteq I_0$. A central roadblock to this goal is that unless the adversary's corrupted columns happen to be perfectly perpendicular to the column space of the true low-rank matrix, $(L^*, C^*)$ will not be precisely equal to the ground truth $(L_0, C_0)$. The reason is simple: if the corrupted columns have a non-perpendicular component, then some part of it will be put into the $L^*$ matrix recovered by the optimization. Algorithmically, this matter is irrelevant: as long as the corrupted columns are identified, and the recovered $L^*$ matches the desired $L_0$ on the non-corrupted columns, our objective is met and the problem is solved. The analysis, however, is significantly complicated: because $L^* \ne L_0$ in general and we do not know exactly what $L^*$ is, we can no longer use the standard approach in the matrix completion literature of proving that the ground truth is the unique optimal solution of the convex program.
To prove the theorem, we use the idea of a primal-dual witness: we construct a primal solution $(\bar L, \bar C)$ and a dual certificate $\bar Q$ such that:

• $(\bar L, \bar C)$ has the desired properties (i)–(iii);
• $\bar Q$ certifies that any optimal solution to (2) is either equal to $(\bar L, \bar C)$, or is in a subspace defined by $(\bar L, \bar C)$ and still has the properties (i)–(iii).

Beyond the above obstacle, challenges arise because of the simultaneous presence of three matrix structures: low rank, entry-wise sparse, and column sparse. This requires a number of additional innovations, including concentration bounds involving these structures.

In the rest of the section we present the details of the proof, which is divided into several steps. In Section 5.1, we provide the notation and preliminaries of the proof, and show that it suffices to consider a simpler setting. In Section 5.2 we construct the primal solution $(\bar L, \bar C)$ and study its properties. In Section 5.3, we describe the conditions that a dual certificate $\bar Q$ needs to satisfy. We construct the dual certificate $\bar Q$ in Section 5.4, and then prove that it indeed satisfies the desired conditions with high probability in Section 5.5. The proofs of the technical lemmas are deferred to the appendix.

5.1 Notation and Preliminaries

For a vector $x$, $x_i$ is its $i$-th entry. For a matrix $A$, $A_{\cdot j}$ is its $j$-th column and $A_{ij}$ is its $(i,j)$-th entry. Several standard matrix norms are used: $\|A\|_*$ is the nuclear norm (the sum of the singular values), $\|A\|$ is the spectral/operator norm (the largest singular value), $\|A\|_\infty$ is the matrix infinity norm (the largest absolute value of the entries), $\|A\|_{1,2}$ is the sum of the $\ell_2$ norms of the columns of $A$, $\|A\|_{\infty,2}$ is the largest $\ell_2$ norm of the columns of $A$, and finally $\|A\|_F$ is the Frobenius norm. We also define the $\ell_{(\infty,2)^2}$ norm of a matrix by
$$ \|A\|_{(\infty,2)^2} := \max\left\{ \|A\|_{\infty,2},\ \|A^\top\|_{\infty,2} \right\}, $$
which is the largest $\ell_2$ norm of the columns and rows of $A$. For any positive integer $k$, $[k] := \{1, 2, \ldots, k\}$. We also use the notation $a \wedge b := \min\{a,b\}$ and $a \vee b := \max\{a,b\}$. The letter $c$ and its derivatives ($c_2$, etc.) denote unspecified constants that are, however, universal in that they are independent of $p$, $\gamma$, $\beta$, $\rho$, $n$, $n_c$, $m$ and $r$. By with high probability (w.h.p.), we mean with probability at least $1 - c(m+n)^{-10}$ for some numerical constant $c > 0$.

Recall that $\tilde\Omega := \Omega \cap ([m] \times I_0^c)$ is the set of observed entries on the non-corrupted columns in $I_0^c$. We use $\Omega_c := \Omega \cap ([m] \times I_0)$ to denote the set of observed entries on the corrupted columns in $I_0$. We abuse notation by using $\Omega$ (and similarly $\Omega_c$, $\tilde\Omega$, $\tilde\Omega^c$, etc.) to denote both a set of matrix entries and the linear subspace of matrices supported on these entries. Similarly, $I_0$ and $I_0^c$ denote both sets of column indices and the linear subspaces of matrices supported on these columns. The operators $\mathcal{P}_{\tilde\Omega}$, $\mathcal{P}_{\Omega_c}$, $\mathcal{P}_{I_0}$, $\mathcal{P}_{I_0^c}$, etc. are the corresponding projections onto the sets of matrices supported on $\tilde\Omega$, $\Omega_c$, $I_0$, $I_0^c$, etc.

Denote the SVD of $L_0$ by $U_0 \Sigma_0 V_0^\top$, where $U_0 \in \mathbb{R}^{m \times r}$ and $V_0 \in \mathbb{R}^{(n+n_c) \times r}$. Let $\mathcal{P}_{U_0}$ be the projection given by $\mathcal{P}_{U_0} A = U_0 U_0^\top A$, i.e., projecting each column of $A$ onto the column space of $L_0$, where $A$ is any matrix with $m$ rows. The complementary operation $\mathcal{P}_{U_0^\perp} A := A - \mathcal{P}_{U_0} A$ projects the columns of $A$ onto the subspace orthogonal to the column space of $L_0$.
Denote the SVD of $L_0$ by $U_0 \Sigma_0 V_0^\top$, where $U_0 \in \mathbb{R}^{m \times r}$ and $V_0 \in \mathbb{R}^{(n+n_c) \times r}$. Let $P_{U_0}$ be the projection given by $P_{U_0} A = U_0 U_0^\top A$, i.e., projecting each column of $A$ onto the column space of $L_0$, where $A$ is any matrix with $m$ rows. The complementary operation $P_{U_0^\perp} A := A - P_{U_0} A$ projects the columns of $A$ onto the subspace orthogonal to the column space of $L_0$. Similarly, for the row space we define the projection $P_{V_0} A := A V_0 V_0^\top$. We define the subspace

$T_0 := \{ U_0 X^\top + Y V_0^\top : X \in \mathbb{R}^{(n+n_c) \times r} \text{ with } P_{I_0} X^\top = 0,\ Y \in \mathbb{R}^{m \times r} \} \subset \mathbb{R}^{m \times (n+n_c)}$,

i.e., the set of matrices that share the column or row space of $L_0$ and are supported on the columns in $I_0^c$; note that $T_0 \subset I_0^c$. The projection $P_{T_0}$ is given by

$P_{T_0} A := P_{U_0} A + P_{V_0} A - P_{U_0} P_{V_0} A = P_{U_0} A + P_{U_0^\perp} P_{V_0} A$ for $A \in \mathbb{R}^{m \times (n+n_c)}$,

and the complementary projection is $P_{T_0^\perp} A := A - P_{T_0} A = (I_d - U_0 U_0^\top) A (I_d - V_0 V_0^\top)$, where $I_d$ is the identity matrix of the appropriate dimension. We note that the range of $P_{T_0}$ is larger than $T_0$, since the matrix $P_{T_0} A$ may have non-zero columns in $I_0$. Nevertheless, when restricted to the subspace $I_0^c$, $P_{T_0}$ is indeed the Euclidean projection onto $T_0$. Also note that the column-wise projection $P_{U_0}$ commutes with the row-wise projections $P_{V_0}$, $P_{I_0}$ and $P_{I_0^c}$, since row-wise projections are right multiplications of a matrix, whereas column-wise projections are left multiplications. We use $\mathcal{I}$ to denote the identity mapping on $\mathbb{R}^{m \times (n+n_c)}$. We provide a summary of the notation used in the proof in Table 1.

Table 1: Summary of Notation

$M$: input data matrix
$\Omega$: set of observed indices
$\tilde{\Omega}$: set of observed indices on the non-corrupted columns
$\Omega_c$: set of observed indices on the corrupted columns
$\hat{\Omega}$: trimmed set of observed indices
$L^*, C^*$: an optimal solution to the program (2)
$L_0, C_0$: true low-rank matrix and outlier matrix
$U_0, V_0, T_0$: the left and right singular vectors of $L_0$ and the corresponding tangent space
$I_0$: set of the indices of the corrupted columns (i.e., the non-zero columns of $C_0$)
$\bar{L}, \bar{C}$: a solution to the oracle problem (11)
$\bar{U}, \bar{V}, \bar{T}$: the left and right singular vectors of $\bar{L}$ and the corresponding tangent space
$\bar{I}$: the set of the indices of the non-zero columns of $\bar{C}$
$\bar{H}$: the column-wise normalized version of $\bar{C}$
$\bar{Q}$: the dual certificate corresponding to $(\bar{L}, \bar{C})$
$P_{T_0}, P_{\bar{U}}, P_{\bar{I}^c}, P_{\tilde{\Omega}}$, etc.: projection operators on $\mathbb{R}^{m \times (n+n_c)}$
$\mathcal{I}$: the identity mapping on $\mathbb{R}^{m \times (n+n_c)}$
$I_d$: the identity matrix

5.1.1 Equivalent Models and Trimming

It turns out that we may simplify the proof by passing to an equivalent setting with a simpler observation model and no trimming. Let $\hat{p} := \min\{p, \rho\}$ and $\beta := \rho / \hat{p}$. The conditions on $p$, $\gamma$ and $\lambda$ in Theorem 1 can then be written equivalently as (with possibly different constants $c_1$ and $c_2$)

$\hat{p} \ge c_1 \dfrac{\mu r \log^2 (m+n)}{\min(m, n)}$,   (8)

$\gamma \le c_2 \dfrac{\hat{p}}{\mu r \sqrt{\beta \mu r}\, \log^3 (m+n)}$,   (9)

$\lambda \in \left[ \sqrt{\dfrac{\mu r \log (m+n)}{\hat{p}\, n}},\ \dfrac{1}{48 \sqrt{\beta \mu r}\, \gamma n \log (m+n)} \right]$.   (10)

We first note that the only randomness in the problem is the distribution of $\tilde{\Omega}$, the set of observed indices on the non-corrupted columns in $I_0^c$. We claim that it suffices to establish the theorem assuming a uniform observation probability on $I_0^c$. To establish this claim, we need some notation. Without loss of generality we assume $I_0^c = [n]$. Let $\vec{p}$ be the vector in $\mathbb{R}_+^n$ with elements $p_1, \ldots, p_n$, where we recall that $p_j \ge p \ge \hat{p}$ for all $j \in [n]$ by Assumption 2. Denote by $\mathbb{P}_{\mathrm{Ber}(\vec{p})}$ and $\mathbb{P}_{\mathrm{UBer}(\hat{p}/4)}$ the probabilities calculated, respectively, when $\tilde{\Omega}$ follows the Bernoulli model with probabilities $\vec{p} = (p_j)$, and when $\tilde{\Omega}$ follows the Bernoulli model with uniform probability $\hat{p}/4$.
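Before stating the comparison between these two models, here is a small NumPy simulation (ours, not the paper's; the trimming rule below is a sketch that keeps a uniformly random subset of at most $\lfloor \rho m \rfloor$ observations in each over-sampled column) of the Bernoulli observation model and the trimming step, illustrating why authentic columns are essentially never affected when $\hat{p} \ll \rho$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n_total = 200, 150
p_hat, rho = 0.10, 0.25

# Bernoulli model: each entry is observed independently with probability p_hat.
omega = rng.random((m, n_total)) < p_hat

def trim(mask, rho, rng):
    """Sketch of trimming: cap each column at floor(rho*m) observations,
    keeping a uniformly random subset of the observed entries."""
    out = mask.copy()
    cap = int(np.floor(rho * mask.shape[0]))
    for j in range(mask.shape[1]):
        idx = np.flatnonzero(out[:, j])
        if idx.size > cap:
            keep = rng.choice(idx, size=cap, replace=False)
            out[:, j] = False
            out[keep, j] = True
    return out

omega_hat = trim(omega, rho, rng)
# By Bernstein's inequality, a column with about p_hat*m << rho*m observations
# is almost never trimmed; empirically the fraction of affected columns is ~0:
print((omega.sum(axis=0) > np.floor(rho * m)).mean())
```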
The following lemma, proved in the appendix, connects the success probabilities of Algorithm 1 under these two models.

Lemma 1. Recall that $p_j \ge \hat{p}$ for all $j$, and suppose that condition (8) holds with a sufficiently large constant $c_1$. If $\mathbb{P}_{\mathrm{UBer}(\hat{p}/4)}[\text{success}] \ge 1 - 17(m+n)^{-5}$, then $\mathbb{P}_{\mathrm{Ber}(\vec{p})}[\text{success}] \ge 1 - 20(m+n)^{-5}$.

The lemma implies that it suffices to prove Theorem 1 assuming $\tilde{\Omega}$ follows the Bernoulli model with uniform probability $\hat{p}/4$.

Now define the set $\Omega_0 := \tilde{\Omega} \cup (\hat{\Omega} \cap ([m] \times I_0))$, which is the set of observed indices with only the columns in $I_0$ trimmed. If condition (8) holds with a sufficiently large constant $c_1$, then w.h.p. with respect to $\mathbb{P}_{\mathrm{UBer}(\hat{p}/4)}$, $\Omega_0$ is equal to $\hat{\Omega}$, the fully trimmed set. (This is because, by Bernstein's inequality, each uncorrupted column in $I_0^c$ has no more than $2 \cdot \frac{\hat{p}}{4} m \le \rho m$ observed entries w.h.p., and therefore is not changed by trimming.) In other words, the convex program (2) with $\Omega_0$ as the input is identical to the one with input $\hat{\Omega}$ w.h.p., so it suffices to prove that Algorithm 1 succeeds w.h.p. assuming the columns in $I_0^c$ are not trimmed. Finally, note that after trimming, the number of remaining observations on each corrupted column in $I_0$ is at most $\rho m$. Combining these observations, we conclude that we may replace the sampling Assumption 2 with the following new Assumption 3, and study Algorithm 1 without trimming (i.e., only the convex program). Note that in Assumption 3 we have changed the probability from $\hat{p}/4$ to $\hat{p}$, which only affects the constant $c_1$ in condition (8).

Assumption 3 (Sampling 2). The set $\tilde{\Omega}$ is sampled from the Bernoulli model with uniform probability $\hat{p}$ on $[m] \times I_0^c$, and is independent of the locations of the observed entries on the corrupted columns. For each $j \in I_0$, we have $|\Omega \cap ([m] \times \{j\})| \le 2\rho m$.

Summarizing the arguments above, we have established that in order to prove Theorem 1, it suffices to prove the following: under Assumptions 1 and 3, if conditions (8)–(10) hold, then with probability at least $1 - 16(m+n)^{-5}$, the program (2) with $\Omega$ as the input succeeds, i.e., any optimal solution to the program satisfies the properties (i)–(iii) stated at the beginning of this section.

5.2 Primal Construction

We now construct the primal solution $(\bar{L}, \bar{C})$. Recall that $\Omega_c$ is the set of observed indices on the corrupted columns $I_0$. Let $(\bar{L}, \bar{C})$ be an optimal solution to the following oracle problem:

$\min_{L,C}\ \|L\|_* + \lambda \|C\|_{1,2}$
s.t. $P_{\Omega_c}(L + C) = P_{\Omega_c}(M)$,
  $P_{I_0^c}(L) = L_0$,
  $P_{U_0}(L) = L$,
  $P_{I_0}(C) = C$.   (11)

Note that we have imposed the desired properties of $(L^*, C^*)$ as constraints in the oracle problem. Let $\bar{U} \bar{\Sigma} \bar{V}^\top$ be the rank-$r$ SVD of $\bar{L}$ (the lemma below shows that $\bar{L}$ has rank $r$) and $\bar{I} := \text{column-support}(\bar{C})$. We define several subspaces and projections analogously to those for $L_0$: $P_{\bar{U}} A := \bar{U} \bar{U}^\top A$, $P_{\bar{U}^\perp} A := A - P_{\bar{U}} A$, $P_{\bar{V}} A := A \bar{V} \bar{V}^\top$, $\bar{T} := \{ \bar{U} X^\top + Y \bar{V}^\top : X \in \mathbb{R}^{(n+n_c) \times r},\ Y \in \mathbb{R}^{m \times r} \}$, $P_{\bar{T}} A := P_{\bar{U}} A + P_{\bar{V}} A - P_{\bar{U}} P_{\bar{V}} A$, and $P_{\bar{T}^\perp} A := A - P_{\bar{T}} A$.
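Before analyzing $(\bar{L}, \bar{C})$, it may help to see the oracle problem (11) written out concretely. The following sketch is ours, not part of the paper: it prototypes (11) on synthetic data using the cvxpy modeling package, with all dimensions, names, and the value of $\lambda$ chosen purely for illustration.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
m, n, n_c, r = 30, 40, 5, 2              # toy sizes; I_0^c = first n columns
U0 = np.linalg.qr(rng.standard_normal((m, r)))[0]
V0 = np.zeros((n + n_c, r))
V0[:n, :] = np.linalg.qr(rng.standard_normal((n, r)))[0]
L0 = U0 @ V0.T                           # true low-rank matrix, supported on I_0^c
C0 = np.zeros((m, n + n_c))
C0[:, n:] = rng.standard_normal((m, n_c))   # corrupted columns I_0
M = L0 + C0
obs_c = np.zeros((m, n + n_c))
obs_c[:, n:] = rng.random((m, n_c)) < 0.3   # Omega_c: observed entries on I_0

lam = 1.0 / np.sqrt(n)                   # illustrative choice within (10)
L, C = cp.Variable((m, n + n_c)), cp.Variable((m, n + n_c))
constraints = [
    cp.multiply(obs_c, L + C) == obs_c * M,   # P_{Omega_c}(L + C) = P_{Omega_c}(M)
    L[:, :n] == L0[:, :n],                    # P_{I_0^c}(L) = L_0
    U0 @ (U0.T @ L) == L,                     # P_{U_0}(L) = L
    C[:, :n] == 0,                            # P_{I_0}(C) = C
]
objective = cp.Minimize(cp.normNuc(L) + lam * cp.sum(cp.norm(C, 2, axis=0)))
# Requires a solver that handles semidefinite cones (e.g., SCS, shipped with cvxpy).
cp.Problem(objective, constraints).solve()
```

The four constraints encode exactly the four conditions of (11); the objective combines the nuclear norm with the $\ell_{1,2}$ column norm exactly as in the program (2).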
The following lemma, whose proof is given in the appendix, relates some basic properties of the oracle solution $(\bar{L}, \bar{C})$ to the ground truth $(L_0, C_0)$.

Lemma 2. We have the following:
(a) $P_{\bar{U}} = P_{U_0}$ and $\bar{I} \subseteq I_0$;
(b) $\max_{1 \le j \le n+n_c} \| P_{I_0^c} \bar{V}^\top e_j \|_2 \le \sqrt{\mu r / n}$;
(c) $P_{I_0^c} P_{\bar{T}} = P_{T_0} P_{I_0^c} P_{\bar{T}}$;
(d) $P_{\bar{T}} P_{I_0^c} = P_{\bar{T}} P_{T_0} P_{I_0^c}$;
(e) $P_{\bar{T}^\perp} P_{T_0^\perp} P_{I_0^c} = P_{T_0^\perp} P_{I_0^c}$.

Since all the constraints in (11) are linear, by standard convex analysis the optimal solution $(\bar{L}, \bar{C})$ must satisfy the KKT conditions. That is, there exist Lagrange multipliers $A_1, A_2, A_3$ and $A_4$ (corresponding to the four constraints in the oracle problem) and matrices $F$ and $G$ such that $P_{\bar{T}} F = 0$, $\|F\| \le 1$, $P_{\bar{I}^c} \bar{H} = 0$, $\bar{H}_{\cdot j} = \bar{C}_{\cdot j} / \|\bar{C}_{\cdot j}\|_2$ for all $j \in \bar{I}$, $G \in \bar{I}^c$, $\|G\|_{\infty,2} \le 1$, and

$\bar{U}\bar{V}^\top + F + P_{I_0^c} A_2 + (\mathcal{I} - P_{U_0}) A_3 = \lambda(\bar{H} + G) + P_{I_0^c} A_4 = P_{\Omega_c} A_1$;   (12)

here $\bar{U}\bar{V}^\top + F$ is a subgradient of $\|L\|_*$ at $\bar{L}$, and $\bar{H} + G$ is a subgradient of $\|C\|_{1,2}$ at $\bar{C}$. Also note that $\bar{H}$ is the column-wise normalized version of $\bar{C}$, with unit-norm non-zero columns. Define the matrix $\bar{H}^0 := \bar{H} + P_{I_0} G$. The following lemma characterizes $\bar{H}^0$ and is proved in the appendix.

Lemma 3. We have the following:
(a) $\bar{H}^0 \in \Omega_c$;
(b) $P_{\bar{I}} \bar{H}^0 = \bar{H}$;
(c) $\| P_{\bar{I}^c} \bar{H}^0 \|_{\infty,2} \le 1$;
(d) $\bar{U} P_{I_0} \bar{V}^\top = P_{U_0} (\lambda \bar{H}^0)$;
(e) $\bar{H}$ and $\bar{H}^0$ are independent of $\tilde{\Omega}$.

5.3 Success Condition

Recall that in Section 5.1.1 we showed that it suffices to prove that the convex program (2) succeeds without trimming. The following proposition, proved in the appendix, provides a deterministic sufficient condition for such success. The condition involves the quantities $\bar{T}$, $\bar{U}$, $\bar{V}$, $\bar{I}$ and $\bar{H}$ of the oracle solution $(\bar{L}, \bar{C})$ constructed in the last subsection.

Proposition 1. If the following conditions hold:
1. $\| (\hat{p}^{-1} P_{T_0} P_{\tilde{\Omega}} P_{T_0} - P_{T_0}) Z \|_F \le \frac{1}{2} \|Z\|_F$ for all $Z \in I_0^c$.
2. $I_0 \cap \mathrm{range}(P_{\bar{V}}) = \{0\}$.
3. There exists a matrix $\bar{Q} \in \mathbb{R}^{m \times (n+n_c)}$ (called an approximate dual certificate) which satisfies
(a) $\bar{Q} \in \Omega$;
(b) $\bar{U}\bar{V}^\top - P_{\bar{T}} \bar{Q} = P_{\bar{T}} D$ for some $D \in \mathbb{R}^{m \times (n+n_c)}$ with $D \in I_0^c$ and $\|D\|_F \le \sqrt{\frac{\hat{p}}{2}} \min\left\{\frac{1}{4}, \frac{\lambda}{4}\right\}$;
(c) $\| P_{\bar{T}^\perp} \bar{Q} \| \le \frac{1}{2}$;
(d) $P_{\bar{I}} \bar{Q} = \lambda \bar{H}$;
(e) $\| P_{\bar{I}^c \cap I_0} \bar{Q} \|_{\infty,2} \le \lambda$;
(f) $\| P_{I_0^c} \bar{Q} \|_{\infty,2} \le \frac{\lambda}{2}$.
Then any optimal solution $(L^*, C^*)$ to the program (2) must satisfy $P_{I_0^c} L^* = L_0$, $P_{U_0} L^* = L^*$ and $P_{I_0} C^* = C^*$, which means that Algorithm 1 succeeds.

5.3.1 Approximate Isometry and Contraction

We now show that conditions 1 and 2 in Proposition 1 are satisfied w.h.p. under our model assumptions and the conditions (8)–(10). Recall that by Assumption 3 the set $\tilde{\Omega}$ follows the Bernoulli model with uniform probability $\hat{p}$. The following lemma establishes the approximate isometry property in condition 1.

Lemma 4. Suppose $\hat{p} \ge \frac{\mu r \log(m+n)}{m \wedge n}$. Then w.h.p. we have, for all $Z \in I_0^c$,

$\left\| \left(\hat{p}^{-1} P_{T_0} P_{\tilde{\Omega}} P_{T_0} - P_{T_0}\right) Z \right\|_F \le \frac{1}{2} \|Z\|_F$.   (13)

The lemma is a variant of the standard approximate isometry inequality in the matrix completion/decomposition literature [9, 17, 33]. In particular, we note that the operator $\hat{p}^{-1} P_{T_0} P_{\tilde{\Omega}} P_{T_0} - P_{T_0}$ maps the subspace $T_0 \subset I_0^c$ to itself, so Lemma 4 is an immediate consequence of part 1) of Lemma 11 in [17].
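As a sanity check on Lemma 4, the following short simulation (ours; toy dimensions, and it tests the inequality only on one random element of $T_0$ rather than uniformly over the subspace) evaluates the two sides of (13).

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, r, p_hat = 120, 150, 2, 0.4
U = np.linalg.qr(rng.standard_normal((m, r)))[0]
V = np.linalg.qr(rng.standard_normal((n, r)))[0]

def P_T(Z):
    """Projection onto T0: P_U Z + P_{U^perp} P_V Z."""
    PU_Z = U @ (U.T @ Z)
    return PU_Z + (Z - PU_Z) @ (V @ V.T)

omega = rng.random((m, n)) < p_hat            # Bernoulli(p_hat) mask for Omega~
Z = P_T(rng.standard_normal((m, n)))          # random test matrix in T0
lhs = np.linalg.norm(P_T(omega * Z) / p_hat - Z, "fro")
print(lhs, "<=", 0.5 * np.linalg.norm(Z, "fro"))
```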
The next lemma, proved in the appendix, shows that the operator $P_{\bar{V}} P_{I_0} P_{\bar{V}}$ is a contraction, which in particular implies condition 2 in Proposition 1.

Lemma 5. If $\lambda^2 \le \frac{1}{2\gamma n}$, then $\|P_{\bar{V}} P_{I_0} P_{\bar{V}}(Z)\|_F \le \frac{1}{2}\|Z\|_F$ and $\|P_{\bar{V}} P_{I_0} P_{\bar{V}}(Z)\| \le \frac{1}{2}\|Z\|$ for any matrix $Z$.

Note that the requirements on $\hat{p}$ and $\lambda$ in the above lemmas are satisfied under conditions (8) and (10). We have therefore established conditions 1 and 2 in Proposition 1. To prove the theorem, it remains to construct a dual certificate $\bar{Q}$ obeying conditions 3(a)–(f) in Proposition 1 w.h.p., which is done in the next subsection.

5.4 Dual Construction

We build $\bar{Q}$ in two steps. In the first step we construct a matrix $Q$ that satisfies all the requirements except 3(a). By Lemma 5, we know the operator $P_{\bar{V}} P_{I_0^c} P_{\bar{V}} = P_{\bar{V}} - P_{\bar{V}} P_{I_0} P_{\bar{V}}$ is invertible on $\mathrm{range}(P_{\bar{V}})$ (as a subspace of $\mathbb{R}^{m \times (n+n_c)}$), with its inverse given by

$\mathcal{B} := \left(P_{\bar{V}} P_{I_0^c} P_{\bar{V}}\right)^{-1} = P_{\bar{V}} + \sum_{i=1}^{\infty} (P_{\bar{V}} P_{I_0} P_{\bar{V}})^i$.   (14)

We define a matrix $Q$ by

$Q := \bar{U}\bar{V}^\top + \lambda\bar{H}^0 - \lambda P_{U_0}\bar{H}^0 - P_{I_0^c} P_{\bar{V}} \mathcal{B} P_{\bar{V}} P_{\bar{U}^\perp} (\lambda\bar{H}^0)$.

It is straightforward to check that $Q$ has the following properties (proof in the appendix):

Lemma 6. We have $P_{I_0} Q = \lambda\bar{H}^0$, $P_{\bar{T}} Q = \bar{U}\bar{V}^\top$, and

$\|P_{\bar{V}} P_{\bar{U}^\perp}(\lambda\bar{H}^0)\| \le \|\lambda\bar{H}^0\| \le \|\lambda\bar{H}^0\|_F \le \lambda\sqrt{\gamma n}$.

While not needed in the sequel, it is a simple exercise to check that Lemmas 6 and 5 together imply $\|P_{\bar{T}^\perp} Q\| \le 3\lambda\sqrt{\gamma n} \le \frac{1}{2}$ and $\|P_{I_0^c} Q\|_{\infty,2} \le (1 + 2\lambda\sqrt{\gamma n})\sqrt{\frac{\mu r}{n}} \le \frac{\lambda}{2}$ under condition (10). Therefore, $Q$ satisfies condition 3 in Proposition 1 except for the requirement of being an element of $\Omega$. Note that this requirement can only potentially fail on the columns in $I_0^c$, since $P_{I_0} Q = \lambda\bar{H}^0 \in \Omega_c$.

As the second step of building the dual certificate, we use a variant of the golfing scheme in [19] to convert $Q$ into a matrix $\bar{Q}$ that obeys this requirement. Set $k_0 = 20\log(m+n)$ and $p_0 = 1 - (1-\hat{p})^{1/k_0}$. Let $\tilde{\Omega}_k$, $k = 1, \ldots, k_0$, be sets of entries sampled independently from the Bernoulli model on $[m] \times I_0^c$ with uniform probability $p_0$; that is, $\mathbb{P}[(i,j) \in \tilde{\Omega}_k] = p_0$ independently of all others, for all $(i,j) \in [m] \times I_0^c$ and $k \in [k_0]$. We may assume $\tilde{\Omega} = \bigcup_{k=1}^{k_0} \tilde{\Omega}_k$, which does not change the distribution of $\tilde{\Omega}$. Note that $p_0 \ge \hat{p}/k_0 \ge \frac{c_1 \mu r \log(m+n)}{20(m \wedge n)}$ under condition (8). We set $Y_0 := 0$ and define the matrices $\{Y_k\}$ recursively by

$Y_k := Y_{k-1} + \frac{1}{p_0} P_{\tilde{\Omega}_k} P_{T_0}\left( P_{I_0^c} Q - Y_{k-1} \right), \quad k = 1, \ldots, k_0$.

The final dual certificate is given by $\bar{Q} = P_{I_0} Q + Y_{k_0}$.
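The golfing recursion is easy to simulate. The following sketch (ours; it uses a random stand-in for $P_{I_0^c} Q$ and toy batch parameters chosen so the contraction is visible at small scale, whereas the proof prescribes $k_0 = 20\log(m+n)$ and $p_0 = 1 - (1-\hat{p})^{1/k_0}$) shows the geometric decay of the residual $D_k = P_{T_0}(P_{I_0^c} Q - Y_k)$ that the analysis below exploits.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, r = 400, 500, 2
k0, p0 = 15, 0.25                      # toy batch count and per-batch rate

U = np.linalg.qr(rng.standard_normal((m, r)))[0]
V = np.linalg.qr(rng.standard_normal((n, r)))[0]

def P_T(Z):
    # Projection onto the tangent space T0: P_U Z + P_{U^perp} Z P_V.
    PU_Z = U @ (U.T @ Z)
    return PU_Z + (Z - PU_Z) @ (V @ V.T)

Qc = rng.standard_normal((m, n))       # stand-in for P_{I0^c} Q
Y = np.zeros((m, n))
for k in range(k0):
    D = P_T(Qc - Y)                    # D_{k-1} = P_{T0}(P_{I0^c}Q - Y_{k-1})
    mask = rng.random((m, n)) < p0     # fresh Bernoulli batch Omega~_k
    Y += mask * D / p0                 # Y_k = Y_{k-1} + p0^{-1} P_{Omega~_k}(D_{k-1})
# The residual D_k contracts geometrically in Frobenius norm, w.h.p.:
print(np.linalg.norm(P_T(Qc - Y), "fro") / np.linalg.norm(P_T(Qc), "fro"))
```

Note that each update is supported on $\tilde{\Omega}_k$, so $Y_{k_0}$ automatically lies in $\tilde{\Omega}$, which is what condition 3(a) requires on the columns in $I_0^c$.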
5.5 Verification of the Dual Certificate

We now verify that the dual certificate $\bar{Q}$ constructed above satisfies all the requirements 3(a)–3(f) in Proposition 1 under the conditions (8)–(10). We have $P_{I_0^c}\bar{Q} = Y_{k_0} \in \tilde{\Omega}$ by construction, and $P_{I_0}\bar{Q} = P_{I_0}Q = \lambda\bar{H}^0 \in \Omega_c$ by part (a) of Lemma 3, so condition 3(a) holds. Moreover, by parts (b) and (c) of Lemma 3 we have $P_{\bar{I}}\bar{Q} = \lambda P_{\bar{I}}\bar{H}^0 = \lambda\bar{H}$ and $\|P_{\bar{I}^c \cap I_0}\bar{Q}\|_{\infty,2} = \lambda\|P_{\bar{I}^c \cap I_0}\bar{H}^0\|_{\infty,2} \le \lambda$, so conditions 3(d) and 3(e) are also satisfied. It remains to verify 3(b), 3(c) and 3(f).

5.5.1 Condition 3(b)

Define the linear operators $\mathcal{A}_k := P_{T_0} - \frac{1}{p_0} P_{T_0} P_{\tilde{\Omega}_k} P_{T_0}$ for $k = 1, \ldots, k_0$, and the matrices $D_k := P_{T_0}(P_{I_0^c} Q - Y_k)$ for $k = 0, \ldots, k_0$. With this notation, we have $Y_k = Y_{k-1} + \frac{1}{p_0} P_{\tilde{\Omega}_k} D_{k-1}$ by definition, which implies

$D_k = \left( P_{T_0} - \frac{1}{p_0} P_{T_0} P_{\tilde{\Omega}_k} P_{T_0} \right) D_{k-1} = \mathcal{A}_k(D_{k-1}), \quad k = 1, \ldots, k_0$.   (15)

It follows that with high probability,

$\|D_{k_0}\|_F = \|\mathcal{A}_{k_0}\mathcal{A}_{k_0-1}\cdots\mathcal{A}_1(D_0)\|_F \overset{(a)}{\le} \frac{1}{2^{k_0}}\|D_0\|_F \overset{(b)}{\le} \frac{1}{(m+n)^{10}}\|D_0\|_F$,

where (a) follows from Lemma 4 with $\tilde{\Omega}$ replaced by $\tilde{\Omega}_k$, and (b) follows from our choice of $k_0$, since $(1/2)^{20\log(m+n)} = (m+n)^{-20\log 2} \le (m+n)^{-10}$. To bound $\|D_0\|_F$, we observe that by the definition of $Q$,

$D_0 = P_{T_0} P_{I_0^c} Q = \bar{U} P_{I_0^c} \bar{V}^\top + P_{V_0} P_{I_0^c} P_{\bar{V}} \mathcal{B} P_{\bar{V}} P_{\bar{U}^\perp} (\lambda\bar{H}^0)$.   (16)

By (14) and Lemma 5, we know that for any matrix $Z$,

$\|\mathcal{B}(Z)\|_F \le \sum_{i=0}^{\infty} \frac{1}{2^i} \|Z\|_F \le 2\|Z\|_F$.   (17)

Combining the last two equations (16) and (17) gives

$\|D_0\|_F \le \|\bar{U}\bar{V}^\top\|_F + 2\|\lambda\bar{H}^0\|_F \le \sqrt{r} + 2\lambda\sqrt{\gamma n}$,

where the last inequality follows from Lemma 6. It follows that

$\|D_{k_0}\|_F \le \frac{1}{(m+n)^{10}} \left( \sqrt{r} + 2\lambda\sqrt{\gamma n} \right) \le \sqrt{\frac{\hat{p}}{2}} \min\left\{\frac{1}{4}, \frac{\lambda}{4}\right\}$,   (18)

where the last inequality follows from conditions (8) and (10). On the other hand, since $\bar{Q} = P_{I_0^c} Y_{k_0} + P_{I_0} Q$ and $P_{\bar{T}} Q = \bar{U}\bar{V}^\top$ by Lemma 6, we have

$\bar{U}\bar{V}^\top - P_{\bar{T}}\bar{Q} = P_{\bar{T}} Q - P_{\bar{T}} \left( P_{I_0^c} Y_{k_0} + P_{I_0} Q \right) = P_{\bar{T}} P_{I_0^c} (Q - Y_{k_0}) = P_{\bar{T}} D_{k_0}$,   (19)

where the last equality follows from part (d) of Lemma 2. We conclude that condition 3(b) in Proposition 1 holds by combining (19), (18) and the fact that $D_{k_0} \in T_0 \subseteq I_0^c$.

5.5.2 Condition 3(c)

We may write

$P_{\bar{T}^\perp}\bar{Q} = P_{\bar{T}^\perp}(\lambda\bar{H}^0) + P_{\bar{T}^\perp} P_{T_0} Y_{k_0} + P_{\bar{T}^\perp} P_{T_0^\perp} Y_{k_0} = P_{\bar{T}^\perp}(\lambda\bar{H}^0) + P_{\bar{T}^\perp} P_{T_0} P_{I_0^c} Q - P_{\bar{T}^\perp} D_{k_0} + P_{T_0^\perp} Y_{k_0}$,

where the first equality follows from the definition of $\bar{Q}$, and the second follows from $Y_{k_0} \in I_0^c$ and part (e) of Lemma 2. Hence we have

$\|P_{\bar{T}^\perp}\bar{Q}\| \le \|\lambda\bar{H}^0\| + \|D_{k_0}\| + \|P_{\bar{T}^\perp} P_{T_0} P_{I_0^c} Q\| + \|P_{T_0^\perp} Y_{k_0}\|$.

Condition 3(c) holds if each of the terms above is upper bounded by $\frac{1}{8}$. By Lemma 6, we have $\|\lambda\bar{H}^0\| \le \lambda\sqrt{\gamma n} \le \frac{1}{16}$, where the last inequality holds under condition (10). In (18) we already showed that $\|D_{k_0}\| \le \|D_{k_0}\|_F \le \frac{\sqrt{\hat{p}}}{8} \le \frac{1}{8}$. Moreover, using (16), we have

$\|P_{\bar{T}^\perp} P_{T_0} P_{I_0^c} Q\| \le \|P_{\bar{V}^\perp} P_{V_0} P_{I_0^c} P_{\bar{V}} \mathcal{B} P_{\bar{V}} P_{\bar{U}^\perp} (\lambda\bar{H}^0)\|_F \overset{(a)}{\le} 2\|\lambda\bar{H}^0\|_F \le \frac{1}{8}$,

where (a) follows from (17) and the fact that projections do not increase the Frobenius norm. It remains to bound $\|P_{T_0^\perp} Y_{k_0}\|$ by $\frac{1}{8}$. For brevity we introduce some additional notation. Let

$D_0^U := \bar{U} P_{I_0^c} \bar{V}^\top$ and $D_0^V := P_{\bar{U}^\perp} P_{V_0} P_{I_0^c} \mathcal{B} P_{\bar{V}} (\lambda\bar{H}^0)$,

and set $D_k^U := \mathcal{A}_k \mathcal{A}_{k-1} \cdots \mathcal{A}_1 (D_0^U)$ and $D_k^V := \mathcal{A}_k \mathcal{A}_{k-1} \cdots \mathcal{A}_1 (D_0^V)$ for $k = 1, \ldots, k_0$. Note that for each $k \ge 2$, $D_{k-1}^U$ and $D_{k-1}^V$ are independent of $\tilde{\Omega}_k$, by construction of the $\tilde{\Omega}_k$'s and part (e) of Lemma 3. With these definitions, we have $D_k = D_k^U + D_k^V$ for $k = 0, \ldots, k_0$ by (16) and (15), and hence

$Y_{k_0} = \sum_{k=1}^{k_0} \frac{1}{p_0} P_{\tilde{\Omega}_k} D_{k-1} = \sum_{k=1}^{k_0} \frac{1}{p_0} P_{\tilde{\Omega}_k} D_{k-1}^U + \sum_{k=1}^{k_0} \frac{1}{p_0} P_{\tilde{\Omega}_k} D_{k-1}^V$.   (20)

Let $t$ be either $U$ or $V$. Since $D_{k-1}^t \in T_0$ for each $k$, we have

$\left\| \sum_{k=1}^{k_0} P_{T_0^\perp} \left( \frac{1}{p_0} P_{\tilde{\Omega}_k} D_{k-1}^t \right) \right\| = \left\| \sum_{k=1}^{k_0} P_{T_0^\perp} \left( \frac{1}{p_0} P_{\tilde{\Omega}_k} D_{k-1}^t - D_{k-1}^t \right) \right\| \le \sum_{k=1}^{k_0} \left\| \left( \frac{1}{p_0} P_{\tilde{\Omega}_k} - \mathcal{I} \right) D_{k-1}^t \right\|$.   (21)

To proceed, we need three lemmas involving the norms of a matrix after certain random projections. Recall that $\tilde{\Omega}$ and $\tilde{\Omega}_k$ are sampled from the Bernoulli model with uniform probability $\hat{p}$ and $p_0$, respectively. The first lemma bounds the spectral norm in terms of the $\ell_\infty$ and $\ell_{(\infty,2)^2}$ norms. This lemma is proved in a recent report by the author [15], but we provide a proof in the appendix for completeness.
Lemma 7. Let $Z$ be a fixed $m \times (n+n_c)$ matrix in $I_0^c$. We have w.h.p.

$\left\| \frac{1}{\hat{p}} P_{\tilde{\Omega}} Z - Z \right\| \le 15 \left( \frac{\log(m+n)}{\hat{p}} \|Z\|_\infty + \sqrt{\frac{60\log(m+n)}{\hat{p}}}\, \|Z\|_{(\infty,2)^2} \right)$.

The next lemma, standard in the matrix completion literature, further controls the $\ell_\infty$ norm.

Lemma 8 ([17, Lemma 13, part 1]). Let $Z$ be a fixed $m \times (n+n_c)$ matrix in $T_0$. If $\hat{p} > \frac{66\log(m+n)}{m \wedge n}$, then w.h.p. we have

$\left\| \frac{1}{\hat{p}} P_{T_0} P_{\tilde{\Omega}} P_{T_0} Z - P_{T_0} Z \right\|_\infty \le \frac{1}{2} \|Z\|_\infty$.

The third lemma is new; it controls the $\ell_{(\infty,2)^2}$ norm. See the appendix for a proof.

Lemma 9. The following holds for some constant $c_0 > 0$ and any fixed matrix $Z \in T_0$. If $\hat{p} \ge \frac{c_0 \mu r \log(m+n)}{m \wedge n}$, then we have w.h.p.

$\left\| \frac{1}{\hat{p}} P_{T_0} P_{\tilde{\Omega}} P_{T_0} Z - P_{T_0} Z \right\|_{\infty,2} \le \frac{40\log(m+n)}{\hat{p}} \sqrt{\frac{\mu r}{n \wedge m}} \|Z\|_\infty + \sqrt{\frac{250\,\mu r \log(m+n)}{\hat{p}(n \wedge m)}} \|Z\|_{\infty,2} \le \frac{1}{2} \sqrt{\frac{\log(m+n)}{\hat{p}}} \|Z\|_\infty + \frac{1}{2} \|Z\|_{\infty,2}$.

The same bound holds with the $\|\cdot\|_{\infty,2}$ norm replaced by the $\|\cdot\|_{(\infty,2)^2}$ norm.

Applying Lemma 7 with $\tilde{\Omega}$ replaced by $\tilde{\Omega}_k$ to the R.H.S. of (21), and using $p_0 \ge \frac{\hat{p}}{20\log(m+n)} \gtrsim \frac{\mu r \log(m+n)}{m \wedge n}$ under condition (8), we have for $t = U$ or $V$,

$\sum_{k=1}^{k_0} \left\| \left( \frac{1}{p_0} P_{\tilde{\Omega}_k} - \mathcal{I} \right) D_{k-1}^t \right\| \le \sum_{k=1}^{k_0} \frac{15\log^2(m+n)}{\hat{p}} \|D_{k-1}^t\|_\infty + \sum_{k=1}^{k_0} \sqrt{\frac{60\log^2(m+n)}{\hat{p}}} \|D_{k-1}^t\|_{(\infty,2)^2}$.   (22)

We then apply Lemmas 8 and 9, with $\tilde{\Omega}$ replaced by $\tilde{\Omega}_k$, to the two norms on the last R.H.S., which gives

$\|D_{k-1}^t\|_\infty = \|\mathcal{A}_{k-1}\mathcal{A}_{k-2}\cdots\mathcal{A}_1(D_0^t)\|_\infty \le \frac{1}{2^{k-1}} \|D_0^t\|_\infty$,
$\|D_{k-1}^t\|_{(\infty,2)^2} = \|\mathcal{A}_{k-1}\mathcal{A}_{k-2}\cdots\mathcal{A}_1(D_0^t)\|_{(\infty,2)^2} \le \frac{1}{2^{k-1}} \|D_0^t\|_{(\infty,2)^2} + \frac{k-1}{2^{k-1}} \sqrt{\frac{\log^2(m+n)}{\hat{p}}} \|D_0^t\|_\infty$.

It follows that

$\sum_{k=1}^{k_0} \frac{\log(m+n)}{p_0} \|D_{k-1}^t\|_\infty + \sum_{k=1}^{k_0} \sqrt{\frac{\log(m+n)}{p_0}} \|D_{k-1}^t\|_{(\infty,2)^2} \le \frac{6\log^2(m+n)}{\hat{p}} \|D_0^t\|_\infty + 2\sqrt{\frac{\log^2(m+n)}{\hat{p}}} \|D_0^t\|_{(\infty,2)^2}$.   (23)

Combining (20)–(23), we obtain

$\|P_{T_0^\perp} Y_{k_0}\| \le \frac{90\log^2(m+n)}{\hat{p}} \left( \|D_0^U\|_\infty + \|D_0^V\|_\infty \right) + 16\sqrt{\frac{\log^2(m+n)}{\hat{p}}} \left( \|D_0^U\|_{(\infty,2)^2} + \|D_0^V\|_{(\infty,2)^2} \right)$.

The following lemma, proved in the appendix, bounds the norms of $D_0^U$ and $D_0^V$ above. The lemma relies on the second part of Assumption 3, which is a consequence of the trimming procedure in Algorithm 1.

Lemma 10. Recall that $\beta := \rho / \hat{p}$. Under Assumptions 1 and 3, we have

$\|D_0^U\|_\infty \le \sqrt{\frac{\mu^2 r^2}{mn}}, \quad \|D_0^U\|_{\infty,2} \le \sqrt{\frac{\mu r}{n}}, \quad \|D_0^U\|_{(\infty,2)^2} \le \sqrt{\frac{\mu r}{m \wedge n}}$,
$\|D_0^V\|_\infty \le \|D_0^V\|_{\infty,2} \le 4\lambda^2\gamma\mu r\sqrt{\beta\hat{p}\, n}, \quad \|D_0^V\|_{(\infty,2)^2} \le \|D_0^V\|_F \le 4\lambda^2\gamma n\sqrt{\mu r\beta\hat{p}}$.

Using this lemma, we conclude that

$\|P_{T_0^\perp} Y_{k_0}\| \le \frac{90\,\mu r\log^2(m+n)}{\hat{p}\sqrt{mn}} + 90 \cdot 4\,\lambda^2\gamma\mu r\sqrt{\frac{\beta n}{\hat{p}}}\log^2(m+n) + 16\sqrt{\frac{\mu r\log^2(m+n)}{\hat{p}(m \wedge n)}} + 48\,\lambda^2\gamma n\sqrt{\mu r\beta}\,\log(m+n)$.

One checks that each term above is bounded by $\frac{1}{32}$ under conditions (8) and (10). This means that $\|P_{T_0^\perp} Y_{k_0}\| \le \frac{1}{8}$, proving condition 3(c) in Proposition 1.

5.5.3 Condition 3(f)

We need to show $\|P_{I_0^c}\bar{Q}\|_{\infty,2} = \|Y_{k_0}\|_{\infty,2} \le \frac{\lambda}{2}$. By (20), we have

$\|Y_{k_0}\|_{\infty,2} \le \underbrace{\sum_{k=1}^{k_0} \left\| \left( \frac{1}{p_0} P_{\tilde{\Omega}_k} - \mathcal{I} \right) D_{k-1}^U \right\|_{\infty,2} + \sum_{k=1}^{k_0} \|D_{k-1}^U\|_{\infty,2}}_{S_1} + \underbrace{\sum_{k=1}^{k_0} \left\| \left( \frac{1}{p_0} P_{\tilde{\Omega}_k} - \mathcal{I} \right) D_{k-1}^V \right\|_{\infty,2} + \sum_{k=1}^{k_0} \|D_{k-1}^V\|_{\infty,2}}_{S_2}$.   (24)

It suffices to bound each of $S_1$ and $S_2$ by $\frac{\lambda}{4}$. We need the following lemma, which is proved in the appendix.

Lemma 11. For any fixed matrix $Z \in T_0$, we have w.h.p.

$\left\| \frac{1}{\hat{p}} P_{\tilde{\Omega}} Z - Z \right\|_{\infty,2} \le \frac{20\log(m+n)}{\hat{p}} \|Z\|_\infty + \sqrt{\frac{50\log(m+n)}{\hat{p}}} \|Z\|_{\infty,2}$.

Using the lemma with $\tilde{\Omega}$ replaced by $\tilde{\Omega}_k$, we have w.h.p.
$S_1 \le \sum_{k=1}^{k_0} \frac{20\log(m+n)}{p_0} \|D_{k-1}^U\|_\infty + 2\sum_{k=1}^{k_0} \sqrt{\frac{50\log(m+n)}{p_0}} \|D_{k-1}^U\|_{\infty,2}$.

Thanks to the second part of Lemma 9, we know that (23) holds with $\|\cdot\|_{(\infty,2)^2}$ replaced by $\|\cdot\|_{\infty,2}$. Using this, we obtain that w.h.p.

$S_1 \le \frac{120\log^2(m+n)}{\hat{p}} \|D_0^U\|_\infty + 4\sqrt{\frac{50\log^2(m+n)}{\hat{p}}} \|D_0^U\|_{\infty,2} \le \frac{120\,\mu r\log^2(m+n)}{\hat{p}\sqrt{mn}} + 4\sqrt{\frac{50\,\mu r\log^2(m+n)}{\hat{p}\, n}}$,

where the last inequality follows from Lemma 10. The last R.H.S. is no more than $\frac{\lambda}{4}$ under conditions (8) and (10). Turning to the term $S_2$ in (24), we apply Lemma 11 with $\tilde{\Omega}$ replaced by $\tilde{\Omega}_k$ to obtain that w.h.p.

$S_2 \le \sum_{k=1}^{k_0} \frac{20\log(m+n)}{p_0} \|D_{k-1}^V\|_\infty + 2\sum_{k=1}^{k_0} \sqrt{\frac{50\log(m+n)}{p_0}} \|D_{k-1}^V\|_{\infty,2}$.

Since (23) holds with $\|\cdot\|_{(\infty,2)^2}$ replaced by $\|\cdot\|_{\infty,2}$, we obtain that w.h.p.

$S_2 \le \frac{120\log^2(m+n)}{\hat{p}} \|D_0^V\|_\infty + 4\sqrt{\frac{50\log^2(m+n)}{\hat{p}}} \|D_0^V\|_{\infty,2}$.

It then follows from Lemma 10 that w.h.p.

$S_2 \le 480\,\lambda^2\gamma\mu r\sqrt{\frac{\beta n}{\hat{p}}}\log^2(m+n) + 16\,\lambda^2\gamma\mu r\sqrt{50\,\beta n}\,\log^2(m+n) \le 600\,\lambda^2\gamma\mu r\sqrt{\frac{\beta n}{\hat{p}}}\log^2(m+n)$.

The last R.H.S. is bounded by $\frac{\lambda}{4}$ w.h.p. under conditions (8) and (10). This establishes condition 3(f) in Proposition 1.

Finally, note that each random event above holds w.h.p., so by the union bound they hold simultaneously with probability at least $1 - 20(m+n)^{-5}$. This completes the proof of Theorem 1.

6 Proof of Theorem 2

We consider the two conditions (5) and (6) separately.

6.1 Condition (5): $p \le \frac{\mu r \log(2n)}{2n}$

In this case we use a modified argument from [11, Theorem 1.7] to establish the impossibility of determining the column space (i.e., the left singular vectors). We may assume $n_c = 0$. Without loss of generality, assume that $s := \frac{n}{\mu r}$ is an integer. We use $e_i$ to denote the $i$-th standard basis vector, whose dimension will be clear from the context. For $k \in [r]$, define the set

$B_k = \{ (k-1)s + 1, (k-1)s + 2, \ldots, ks \}$.   (25)

Consider the matrix $L = \sum_{k=1}^r u_k v_k^\top \in \mathbb{R}^{n \times n}$, where the (unnormalized) singular vectors $u_k \in \mathbb{R}^n$ and $v_k \in \mathbb{R}^n$ are given by

$u_k = \sum_{i \in B_k} \omega_i e_i, \qquad v_k = \sum_{i \in B_k} e_i$,

where the $\omega_i$'s take values in $\{-1, 1\}$. Clearly, $L$ has rank $r$ and incoherence parameter $\mu$, and is a block-diagonal matrix with $r$ blocks of size $s \times s$. In particular, each row of a block is either all $1$'s or all $-1$'s, with the sign determined by $\omega_i$. An illustration of $L$ is given in Figure 5. Therefore, in order to uniquely determine the left singular vectors $u_k$ from the observed entries of $L$, we must be on the event that there is at least one observed entry in every row $i$ of each diagonal block, since otherwise there would be no information about $\omega_i$.

[Figure 5: An illustration of L constructed in Section 6.1 with rank r = 2.]

Under the Bernoulli sampling model in Assumption 2, the probability of this event is $\pi = [1 - (1-p)^s]^n$. Using the premise $2p \le \frac{\log(2n)}{s} \le 1$ of the theorem and the inequality $1 - x + x^2/2 > e^{-x}$ for all $x \ge 0$, we have

$1 - p \ge 1 - \frac{\log(2n)}{2s} \ge 1 - \frac{\log(2n)}{s} + \frac{\log^2(2n)}{2s^2} > e^{-\log(2n)/s}$.

It follows that

$\pi \le \exp\left[ -n(1-p)^s \right] \le \exp\left[ -n e^{-\log 2n} \right] = \exp(-1/2) \le \frac{3}{4}$,

where the first inequality follows from $1 - x \le e^{-x}$. Therefore, with probability $1 - \pi \ge \frac{1}{4}$, there exists a row of a diagonal block that is entirely unobserved, in which case the $u_k$'s cannot be determined. It is easy to see that this implies the conclusion of the theorem.
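For intuition, the following toy simulation (ours; small dimensions, and the sign-unidentifiability event is estimated by Monte Carlo rather than the exact formula for $\pi$) builds the block-sign instance of Section 6.1 and checks how often some block row is entirely unobserved at the critical sampling rate.

```python
import numpy as np

rng = np.random.default_rng(4)
n, mu, r = 64, 4, 2
s = n // (mu * r)                    # block size s = n/(mu*r)
p = np.log(2 * n) / (2 * s)          # sampling rate at the boundary of condition (5)

# Block-diagonal hard instance: row i of block B_k is all +1 or all -1.
signs = rng.choice([-1.0, 1.0], size=n)
L = np.zeros((n, n))
for k in range(r):
    blk = np.arange(k * s, (k + 1) * s)
    L[np.ix_(blk, blk)] = signs[blk, None]

def some_block_row_unseen():
    """True when a block row has no observed entry, so its sign omega_i is lost."""
    mask = rng.random((n, n)) < p
    return any(
        (mask[np.ix_(np.arange(k * s, (k + 1) * s),
                     np.arange(k * s, (k + 1) * s))].sum(axis=1) == 0).any()
        for k in range(r)
    )

trials = 2000
print(np.mean([some_block_row_unseen() for _ in range(trials)]))
```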
6.2 Condition (6): $\gamma \ge \frac{2p}{\mu r}$

Without loss of generality, we assume $ps$ is a positive integer. Under the above condition, we have $n_c = \gamma n \ge 2ps > ps$, where $s := \frac{n}{\mu r}$ as before. We prove the theorem by constructing a family of candidate solutions $(L_i, C_i)$, $i = 1, 2, \ldots, 2M$, and showing that it is difficult to accurately distinguish them based on the observed data $\Omega$ and $P_\Omega(L_i + C_i)$. In this subsection, we use capital letters ($B_1$, $I_2$, $J$, etc.) to denote sets of column indices (i.e., subsets of $[n + n_c]$), and Greek letters ($\Omega$, $\Theta$, $\xi$, etc.) to denote sets of entry indices (i.e., subsets of $[n] \times [n + n_c]$).

Let $J := \{ (r-1)s + 1, \ldots, rs + n_c \}$. Recall the definition of the $B_k$'s in (25), which satisfy $B_r \subseteq J$. We further let $I := J \setminus B_r$ and

$u_k = \sum_{i \in B_k} e_i,\ k \in [r]; \quad \bar{u}_r = -e_{rs} + \sum_{i \in B_r,\, i \ne rs} e_i; \quad v_k = \sum_{i \in B_k} e_i,\ k \in [r]; \quad w = \sum_{i \in I} e_i$.

We build two candidate solutions $(L_1, C_1)$ and $(L_2, C_2)$ as follows:

$L_1 = \sum_{k=1}^r u_k v_k^\top, \quad C_1 = \bar{u}_r w^\top; \qquad L_2 = \sum_{k=1}^{r-1} u_k v_k^\top + \bar{u}_r v_r^\top, \quad C_2 = u_r w^\top$.

We illustrate them in Figure 6.

[Figure 6: An illustration of (L_1, C_1) and (L_2, C_2) constructed in Section 6.2 with rank r = 2.]

Let $M := \binom{s + n_c}{s}$. In the definition of $(L_1, C_1)$, if we let the set $B_r$ vary over all $M$ possible subsets of $J$ of size $s$ (i.e., we permute the columns in $J$), then we get $M$ different candidates $(L_i, C_i)$, $i = 1, 3, \ldots, 2M-1$. Similarly, by varying $B_r$ in $(L_2, C_2)$ we get another $M$ candidates $(L_i, C_i)$, $i = 2, 4, \ldots, 2M$. We have thus defined a family of $2M$ pairs. Let $I_i := \text{column-support}(C_i)$.

Note that for the $L_i$'s, only the locations of the last $s$ authentic columns vary within $J$, and the sign of these columns' $rs$-th row changes. The corrupted columns in $C_i$ are identical to the last $s$ authentic columns of $L_i$, except with the sign of the $rs$-th row flipped. Therefore, to recover the column space of $L_i$, one needs to determine the sign of the $rs$-th row. The idea of the proof is simple: under the Bernoulli model and with $n_c > 2ps$ columns in $I_i$, with positive probability the $rs$-th row has roughly as many observed $1$'s as $-1$'s, so there is no way to determine which sign is authentic.

We make this precise by specifying the set of observed entries $\Omega = \tilde{\Omega}_i \cup \Omega_{c,i}$ for each candidate $i \in [2M]$. According to our assumption, the observations $\tilde{\Omega}_i$ on the authentic columns follow the Bernoulli model with uniform probability $p$. It remains to specify the observations $\Omega_{c,i}$ on the corrupted columns. Recall Definition 1 of the Bernoulli model, and let $\Omega^+_{c,i}$ be drawn from the Bernoulli model on $[rs-1] \times I_i$ with uniform probability $p$; this will be the set of observed entries on the first $rs-1$ rows of the corrupted columns. Let $\Gamma_i$ be independent of $\tilde{\Omega}_i$ and drawn according to the Bernoulli model on $[s]$ with uniform probability $p$. If $|\Gamma_i| \ge n_c$, then $\Omega^-_{c,i}$, the set of observed entries on the $rs$-th row of the corrupted columns, is set to $\Omega^-_{c,i} = \{rs\} \times I_i$. If $|\Gamma_i| = t < n_c$, then we set $\Omega^-_{c,i} = \{rs\} \times I_i(t)$, where $I_i(t)$ denotes the $t$ smallest indices in $I_i$. The set of observed entries on the corrupted columns $I_i$ is then given by $\Omega_{c,i} = \Omega^+_{c,i} \cup \Omega^-_{c,i}$.
We see that the authentic observations $\tilde{\Omega}_i$ are independent of $C_i$ and $\Omega_{c,i}$, so Assumption 2 is satisfied. In the sequel, we use $\mathbb{P}_{L_i, C_i}$ to denote the probability computed under the $i$-th candidate solution $(L_i, C_i)$.

Now suppose the true solution is the first candidate $(L_1, C_1)$. Let $\Theta_1 := \tilde{\Omega}_1 \cap (\{rs\} \times J)$ be the set of observations on the $rs$-th row of the authentic columns in $J$. If we define the event $E := \{ |\Gamma_1| \le |\Theta_1| \le ps \}$, then we have

$\mathbb{P}_{L_1,C_1}[E] \overset{(i)}{\ge} \frac{1}{2}\,\mathbb{P}_{L_1,C_1}\left[ |\Gamma_1| < ps \text{ and } |\Theta_1| < ps \right] \overset{(ii)}{\ge} \frac{1}{2} \cdot \mathbb{P}_{L_1,C_1}[|\Gamma_1| < ps] \cdot \mathbb{P}_{L_1,C_1}[|\Theta_1| < ps] \overset{(iii)}{\ge} \frac{1}{8}$,

where (i) follows from symmetry, and (ii)–(iii) hold because $|\Theta_1|$ and $|\Gamma_1|$ are independent and both follow the binomial distribution with $s$ trials and success probability $p$, whose median is $ps$. On the event $E$, we can always find another candidate solution $i' \in \{2, 4, \ldots, 2M\}$ (which means the last row of $L_{i'}$ has a negative sign, so the column space is different) such that $\Theta_1 = \{rs\} \times I_{i'}(|\Theta_1|)$; this is because $\Theta_1 \subseteq \{rs\} \times J$ and the $I_i$'s enumerate the subsets of $J$ of size $n_c > ps \ge |\Theta_1|$. See Figure 7 for an illustration.

Let $\omega \subseteq [n] \times [n+n_c]$ be a realization of $\Omega$ that is consistent with $E$, i.e., it satisfies $\mathbb{P}_{L_1,C_1}[\Omega = \omega \text{ and } E] > 0$. We claim (and prove below) that for any such $\omega$, we have

$\mathbb{P}_{L_1,C_1}[\Omega = \omega] \le \mathbb{P}_{L_{i'},C_{i'}}[\Omega = \omega]$, and $P_\Omega(L_1 + C_1) = Z := P_\Omega(L_{i'} + C_{i'})$ for $\Omega = \omega$.

This means the observed data is identical under both candidate solutions, but the $i'$-th candidate has a higher likelihood. In this case, the maximum likelihood estimator (MLE), which is given by

$f(\omega, Z) := \arg\max_{(L_i, C_i)} \mathbb{P}_{L_i,C_i}\left[ \Omega = \omega,\ P_\Omega(L_i + C_i) = Z \right]$,

will incorrectly output a solution other than $(L_1, C_1)$ with probability at least $\frac{1}{2}$. The above argument in fact holds if any one of the $(L_i, C_i)$'s is the true solution. Therefore, the average probability of error for the MLE is at least $\frac{1}{2} \cdot \mathbb{P}_{L_1,C_1}[E] \ge \frac{1}{16}$. Since the MLE minimizes the average probability of error, which in turn lower bounds the worst-case error probability, we conclude that any estimator makes an error with worst-case probability at least $\frac{1}{16}$. This proves the theorem.

Proof of the claim: When $\Omega = \omega$, the equality $P_\Omega(L_1 + C_1) = P_\Omega(L_{i'} + C_{i'})$ holds by construction of the $(L_i, C_i)$'s and the assumption on $\omega$ (cf. Figure 7). To prove the inequality, we note that the distributions of $\Omega$ under $(L_1, C_1)$ and $(L_{i'}, C_{i'})$ differ only on the entries in $\Upsilon := \{rs\} \times J$. Let $\xi := \omega \cap (\{rs\} \times I_{i'})$ and $\zeta := \omega \cap (\{rs\} \times I_1)$. Because $\omega$ is consistent with $E$, we have $|\zeta| \le |\xi| \le ps < n_c$; moreover, the observed entries in $\Upsilon$ are either on the columns $I_1$ or $I_{i'}$, so $\omega \cap \Upsilon = \xi \cup \zeta$. Let $g(\cdot)$ denote the probability mass function of the binomial distribution with $s$ trials and success probability $p$. Then, according to our specification of $\Omega$ under each candidate solution, we have

$\frac{\mathbb{P}_{L_1,C_1}[\Omega = \omega]}{\mathbb{P}_{L_{i'},C_{i'}}[\Omega = \omega]} = \frac{\mathbb{P}_{L_1,C_1}[\Omega \cap \Upsilon = \omega \cap \Upsilon]}{\mathbb{P}_{L_{i'},C_{i'}}[\Omega \cap \Upsilon = \omega \cap \Upsilon]} = \frac{p^{|\xi|}(1-p)^{s-|\xi|}\, g(|\zeta|)}{g(|\xi|)\, p^{|\zeta|}(1-p)^{s-|\zeta|}} = \frac{g(|\zeta|)}{g(|\xi|)} \cdot \left( \frac{p}{1-p} \right)^{|\xi|-|\zeta|}$.

Observe that $g(\cdot)$ is unimodal with mode $ps$, and $|\zeta| \le |\xi| \le ps$, so $g(|\zeta|) \le g(|\xi|)$. Moreover, we have $\left( \frac{p}{1-p} \right)^{|\xi|-|\zeta|} \le 1$ by the assumption $p \le \frac{1}{2}$.
This means $\frac{\mathbb{P}_{L_1,C_1}[\Omega = \omega]}{\mathbb{P}_{L_{i'},C_{i'}}[\Omega = \omega]} \le 1$, proving the claim.

[Figure 7: An illustration of the two solutions (L_1, C_1), (L_{i'}, C_{i'}) and the locations of the observed entries Omega in Section 6.2, where |Theta_1| <= n_c. In this case the two solutions generate the same observed data P_Omega(L_1 + C_1) = P_Omega(L_{i'} + C_{i'}), and it is impossible to distinguish between them.]

7 Conclusion

In this paper, we study the problem of completing a low-rank matrix from sparsely observed entries when the observations from some columns are completely and arbitrarily corrupted. We propose a new algorithm based on trimming and convex optimization, and provide performance guarantees showing its robustness to column-wise corruption. We further show that the performance of our algorithm is close to the information-theoretic limit under adversarial corruption, thus achieving near-optimal tradeoffs between sample complexity, robustness and rank. Immediate future directions include removing the sub-optimality in the bounds and allowing for noise and sparse corruption. It may be possible to further improve the robustness of matrix completion by combining our approach with other outlier detection techniques [22, 37]. As our work is motivated by practical applications in collaborative filtering and crowdsourcing, it is important to study the computational aspects in more depth and to develop fast online/parallel algorithms. A more systematic exploration of the relations between sample complexity, model complexity, computational complexity and robustness will also be of much theoretical and practical interest.

Acknowledgment

We are grateful to the anonymous reviewers for their helpful suggestions on improving the quality of the manuscript. Y. Chen was supported by NSF grant CIF-31712-23800, ONR MURI grant N00014-11-1-0688, and a start-up fund from the School of Operations Research and Information Engineering at Cornell University. The work of H. Xu was partially supported by the Ministry of Education of Singapore through AcRF Tier Two grants R-265-000-443-112 and R-265-000-519-112, and A*STAR SERC PSF grant R-265-000-540-305. C. Caramanis acknowledges NSF grants 1056028, 1302435 and 1116955. His research was also partially supported by the U.S. DoT through the Data-Supported Transportation Operations and Planning (D-STOP) Tier 1 University Transportation Center. S. Sanghavi would like to acknowledge NSF grants 0954059 and 1302435.

Appendices

A Proof of Lemma 1

We first need a simple observation: the convex program (2) has a monotonicity property; that is, having more observed entries on the uncorrupted columns only makes the program more likely to succeed.

Lemma 12 (Monotonicity). Suppose the index sets $\Omega_1$ and $\Omega_2$ are such that $\Omega_1 \cap ([m] \times I_0^c) \subseteq \Omega_2 \cap ([m] \times I_0^c)$ and $\Omega_1 \cap ([m] \times I_0) = \Omega_2 \cap ([m] \times I_0)$. If the program (2) with $\hat{\Omega} = \Omega_1$ as the input succeeds, then using $\hat{\Omega} = \Omega_2$ as the input also succeeds.

Proof. Define the set

$\mathcal{X} := \left\{ (L, C) : P_{I_0^c}(L) = L_0,\ P_{U_0}(L) = L,\ P_{I_0}(C) = C,\ P_{\Omega_1 \cap ([m] \times I_0)}(L + C) = P_{\Omega_1 \cap ([m] \times I_0)}(M) \right\}$,

which consists of the solutions that correspond to the success of the algorithm and are consistent on the entries in $\Omega_1 \cap ([m] \times I_0) = \Omega_2 \cap ([m] \times I_0)$.
Observe that any solution in $\mathcal{X}$ is feasible for the program with $\hat{\Omega}$ equal to $\Omega_1$ or $\Omega_2$. Suppose $(L_0^*, C_0^*)$ is any optimal solution to the program (2) with $\Omega_2$. By optimality we must have

$\|L_0^*\|_* + \lambda\|C_0^*\|_{1,2} \le \|L\|_* + \lambda\|C\|_{1,2}, \quad \forall (L, C) \in \mathcal{X}$.

On the other hand, the program with $\Omega_1$ succeeds by assumption, meaning that any optimal solution $(L^*, C^*)$ of it must be in the set $\mathcal{X}$. It follows that $(L_0^*, C_0^*)$ has an objective value lower than or equal to that of $(L^*, C^*)$. But $(L_0^*, C_0^*)$ is also feasible for the program with $\Omega_1$ since $\Omega_1 \subseteq \Omega_2$, so $(L_0^*, C_0^*)$ is optimal for the program with $\Omega_1$ and hence in the set $\mathcal{X}$. This means that the program with $\Omega_2$ succeeds. ∎

We turn to the proof of Lemma 1. Given a vector $\vec{k} \in \mathbb{R}^n$ with elements $k_j$, let $\mathbb{P}_{\mathrm{Unif}(\vec{k})}$ denote the probability when $\tilde{\Omega}$ follows the uniform model with parameter $\vec{k}$, meaning that the observed entries on the $j$-th column are sampled uniformly at random without replacement from all size-$k_j$ subsets of the entries in that column. Recall that $h_j$ is the number of observed entries on the $j$-th column before trimming. We use $\lfloor x \rfloor$ to denote the largest integer no more than $x$. We have the following chain of inequalities:

$\mathbb{P}_{\mathrm{Ber}(\vec{p})}[\text{success}] = \sum_{k_1=1}^m \cdots \sum_{k_n=1}^m \mathbb{P}_{\mathrm{Ber}(\vec{p})}[\text{success} \mid h_j = k_j, j \in [n]]\ \mathbb{P}_{\mathrm{Ber}(\vec{p})}[h_j = k_j, j \in [n]]$
$\ge \sum_{k_1=\lfloor \hat{p}m/2 \rfloor}^m \cdots \sum_{k_n=\lfloor \hat{p}m/2 \rfloor}^m \mathbb{P}_{\mathrm{Ber}(\vec{p})}[\text{success} \mid h_j = k_j, j \in [n]]\ \mathbb{P}_{\mathrm{Ber}(\vec{p})}[h_j = k_j, j \in [n]]$
$\overset{(a)}{=} \sum_{k_1=\lfloor \hat{p}m/2 \rfloor}^m \cdots \sum_{k_n=\lfloor \hat{p}m/2 \rfloor}^m \mathbb{P}_{\mathrm{Unif}(\vec{k})}[\text{success}]\ \mathbb{P}_{\mathrm{Ber}(\vec{p})}[h_j = k_j, j \in [n]]$
$\overset{(b)}{=} \sum_{k_1=\lfloor \hat{p}m/2 \rfloor}^m \cdots \sum_{k_n=\lfloor \hat{p}m/2 \rfloor}^m \mathbb{P}_{\mathrm{Unif}(\vec{k} \wedge \lfloor \rho m \rfloor)}[\text{success}]\ \mathbb{P}_{\mathrm{Ber}(\vec{p})}[h_j = k_j, j \in [n]]$
$\overset{(c)}{\ge} \sum_{k_1=\lfloor \hat{p}m/2 \rfloor}^m \cdots \sum_{k_n=\lfloor \hat{p}m/2 \rfloor}^m \mathbb{P}_{\mathrm{Unif}(\lfloor \hat{p}m/2 \rfloor)}[\text{success}]\ \mathbb{P}_{\mathrm{Ber}(\vec{p})}[h_j = k_j, j \in [n]]$
$= \mathbb{P}_{\mathrm{Unif}(\lfloor \hat{p}m/2 \rfloor)}[\text{success}]\ \mathbb{P}_{\mathrm{Ber}(\vec{p})}[h_j \ge \lfloor \hat{p}m/2 \rfloor, j \in [n]]$
$\overset{(d)}{\ge} \mathbb{P}_{\mathrm{Unif}(\lfloor \hat{p}m/2 \rfloor)}[\text{success}]\ \left( 1 - (m+n)^{-10} \right)$,

where (a) follows from the fact that the conditional distribution of a set following the Bernoulli model given its cardinality is the same as sampling uniformly without replacement, (b) is a consequence of the trimming step in Algorithm 1, as a uniform subset of a uniformly sampled set is still uniform, (c) follows from $\rho m \ge \hat{p}m/2$ and the monotonicity in Lemma 12, and finally (d) follows from the Bernstein inequality under condition (8) with $c_1$ large enough.
The probability in (d) can be bounded by similar reasoning as follows:

$\mathbb{P}_{\mathrm{Unif}(\lfloor \hat{p}m/2 \rfloor)}[\text{success}] \ge \mathbb{P}_{\mathrm{Unif}(\lfloor \hat{p}m/2 \rfloor)}[\text{success}] \sum_{k_1=1}^{\lfloor \hat{p}m/2 \rfloor} \cdots \sum_{k_n=1}^{\lfloor \hat{p}m/2 \rfloor} \mathbb{P}_{\mathrm{UBer}(\hat{p}/4)}[h_j = k_j, j \in [n]]$
$\overset{(a)}{\ge} \sum_{k_1=1}^{\lfloor \hat{p}m/2 \rfloor} \cdots \sum_{k_n=1}^{\lfloor \hat{p}m/2 \rfloor} \mathbb{P}_{\mathrm{Unif}(\vec{k})}[\text{success}]\ \mathbb{P}_{\mathrm{UBer}(\hat{p}/4)}[h_j = k_j, j \in [n]]$
$\overset{(b)}{=} \sum_{k_1=1}^{\lfloor \hat{p}m/2 \rfloor} \cdots \sum_{k_n=1}^{\lfloor \hat{p}m/2 \rfloor} \mathbb{P}_{\mathrm{UBer}(\hat{p}/4)}[\text{success} \mid h_j = k_j, j \in [n]]\ \mathbb{P}_{\mathrm{UBer}(\hat{p}/4)}[h_j = k_j, j \in [n]]$
$= \sum_{k_1=1}^{\lfloor \hat{p}m/2 \rfloor} \cdots \sum_{k_n=1}^{\lfloor \hat{p}m/2 \rfloor} \mathbb{P}_{\mathrm{UBer}(\hat{p}/4)}[\text{success},\ h_j = k_j, j \in [n]]$
$= \mathbb{P}_{\mathrm{UBer}(\hat{p}/4)}[\text{success}] - \sum_{\vec{k}:\, \exists j \in [n],\, k_j > \lfloor \hat{p}m/2 \rfloor} \mathbb{P}_{\mathrm{UBer}(\hat{p}/4)}[\text{success},\ h_j = k_j, j \in [n]]$
$\ge \mathbb{P}_{\mathrm{UBer}(\hat{p}/4)}[\text{success}] - \mathbb{P}_{\mathrm{UBer}(\hat{p}/4)}\left[ \exists j \in [n],\ h_j > \hat{p}m/2 \right]$
$\overset{(c)}{\ge} \mathbb{P}_{\mathrm{UBer}(\hat{p}/4)}[\text{success}] - (m+n)^{-10}$,

where (a) follows from the monotonicity Lemma 12, (b) follows from the fact that the conditional Bernoulli distribution is uniform, and (c) follows from the Bernstein inequality under condition (8). Combining the pieces, we obtain

$\mathbb{P}_{\mathrm{Ber}(\vec{p})}[\text{success}] \ge \left( 1 - (m+n)^{-10} \right) \left( \mathbb{P}_{\mathrm{UBer}(\hat{p}/4)}[\text{success}] - (m+n)^{-10} \right)$.

The lemma follows. ∎

B Proof of Lemmas in Section 5.2

In this section, we prove the lemmas used in Section 5.2.

B.1 Proof of Lemma 2

Let $\mathrm{col}(Z)$ denote the column space of a matrix $Z$. Observe that $P_{U_0}\bar{L} = \bar{L}$ implies $\mathrm{col}(\bar{L}) \subseteq \mathrm{col}(U_0)$, and $P_{I_0^c}(\bar{L}) = L_0$ implies $\mathrm{col}(\bar{L}) \supseteq \mathrm{col}(U_0)$. It follows that $\mathrm{col}(\bar{U}) = \mathrm{col}(\bar{L}) = \mathrm{col}(U_0)$. Because $\bar{C}$ satisfies the last constraint in the oracle problem (11), we have $\bar{I} \subseteq I_0$. This proves part (a) of the lemma. A consequence is that $\mathrm{rank}(\bar{L}) = \mathrm{rank}(L_0) = r$.

Since $P_{I_0^c}\bar{L} = L_0$, we conclude that the matrix $\bar{V}_c^\top := P_{I_0^c}\bar{V}^\top$ has the same rank-$r$ row space as $V_0^\top$. Therefore, $\bar{V}_c^\top \bar{V}_c \in \mathbb{R}^{r \times r}$ is positive definite, and there exists a symmetric and invertible matrix $K_1 \in \mathbb{R}^{r \times r}$ with $K_1^2 = \bar{V}_c^\top \bar{V}_c$ and $\|K_1\| \le \|\bar{V}_c\| \le \|\bar{V}\| \le 1$. This implies that $K_1^{-1}\bar{V}_c^\top$ has orthonormal rows spanning the same row space as $V_0^\top$. Because $V_0^\top$ also has orthonormal rows, there must exist an orthonormal matrix $K_2 \in \mathbb{R}^{r \times r}$ such that $K_2 K_1^{-1} \bar{V}_c^\top = V_0^\top$. Hence we have $\bar{V}_c^\top = N V_0^\top$, where the matrix $N := K_1 K_2^{-1} \in \mathbb{R}^{r \times r}$ is invertible. It follows that

$\max_{1 \le j \le n+n_c} \|P_{I_0^c}\bar{V}^\top e_j\|_2^2 = \max_j \|K_1 K_2^{-1} V_0^\top e_j\|_2^2 \le \|K_1\|^2 \|K_2^{-1}\|^2 \max_j \|V_0^\top e_j\|_2^2 \le \frac{\mu r}{n}$,

where in the last inequality we use the incoherence of $L_0$ in Assumption 1. This proves part (b).

Now consider part (c). Let $Z$ be an arbitrary matrix in $\mathbb{R}^{m \times (n+n_c)}$. By part (a) of the lemma, we have $P_{U_0} P_{\bar{U}} P_{I_0^c} Z = P_{\bar{U}} P_{I_0^c} Z$. We also have

$P_{I_0^c} P_{\bar{U}^\perp} P_{\bar{V}} Z = (P_{\bar{U}^\perp} Z)\bar{V}\, P_{I_0^c}\bar{V}^\top = (P_{\bar{U}^\perp} Z)\bar{V}\bar{V}_c^\top$,

where the R.H.S. spans the same row space as $V_0^\top$ by the discussion in the last paragraph. It follows that

$P_{T_0} P_{I_0^c} P_{\bar{T}} Z = P_{T_0} P_{\bar{U}} P_{I_0^c} Z + P_{T_0} P_{I_0^c} P_{\bar{U}^\perp} P_{\bar{V}} Z = P_{\bar{U}} P_{I_0^c} Z + P_{I_0^c} P_{\bar{U}^\perp} P_{\bar{V}} Z = P_{I_0^c} P_{\bar{T}} Z$.

For part (d), the previous discussion shows that $\bar{V}_c = V_0 N^\top$. Therefore, for any $Y \in \mathbb{R}^{m \times (n+n_c)}$, we have

$P_{I_0^c}\left( Y V_0 V_0^\top \right)\bar{V}\bar{V}^\top = P_{I_0^c} Y V_0 V_0^\top \bar{V}_c \bar{V}^\top = P_{I_0^c} Y V_0 V_0^\top V_0 N^\top \bar{V}^\top = P_{I_0^c} Y \bar{V}_c \bar{V}^\top = P_{I_0^c} Y \bar{V}\bar{V}^\top$.
Applying this equality with $Y = P_{\bar{U}^\perp} Z$, we obtain

$P_{\bar{T}} P_{T_0} P_{I_0^c} Z = P_{\bar{U}} P_{I_0^c} Z + P_{I_0^c}(\mathcal{I} - P_{U_0}) Z V_0 V_0^\top \bar{V}\bar{V}^\top = P_{\bar{U}} P_{I_0^c} Z + P_{I_0^c} P_{\bar{U}^\perp} Z \bar{V}\bar{V}^\top = P_{\bar{T}} P_{I_0^c} Z$.

Finally, to prove part (e), we note that $P_{\bar{T}^\perp} P_{T_0^\perp} P_{I_0^c} Z = (\mathcal{I} - P_{\bar{T}})(\mathcal{I} - P_{T_0}) P_{I_0^c} Z$. Expanding the last R.H.S. and applying part (d) of the lemma gives the desired result. ∎

B.2 Proof of Lemma 3

Applying $P_{I_0}$ to both sides of the last equality in (12) proves part (a) of the lemma. Part (b) follows from $G \in \bar{I}^c$, and part (c) follows from $P_{\bar{I}^c}\bar{H}^0 = P_{\bar{I}^c} P_{I_0} G$. Applying the projection $P_{\bar{U}} P_{I_0} = P_{U_0} P_{I_0}$ to both sides of the first equality in (12), we obtain part (d). Finally, note that $\bar{H}$ and $\bar{H}^0$ are determined by the oracle program (11), which only depends on $P_{\Omega_c} M = P_{\Omega_c} C_0$ and does not involve $\tilde{\Omega}$. Therefore, the independence between $\tilde{\Omega}$ and $P_{\Omega_c} C_0$ imposed in Assumption 3 implies part (e). ∎

C Proof of Proposition 1

To prove the proposition, we need a technical lemma.

Lemma 13. Suppose (13) holds. Then for any $\Delta_l, \Delta_c \in \mathbb{R}^{m \times (n+n_c)}$ with $P_\Omega \Delta_l + P_\Omega \Delta_c = 0$, we have

$\|P_{I_0^c} P_{\bar{T}} \Delta_l\|_F \le \sqrt{\frac{2}{\hat{p}}} \left( \|P_{\bar{T}^\perp} \Delta_l\|_* + \|P_{I_0^c} \Delta_c\|_{1,2} \right)$.

Proof. Since $P_\Omega \Delta_c = -P_\Omega \Delta_l$, we have $\|P_{I_0^c}\Delta_c\|_{1,2} \ge \|P_{\tilde{\Omega}}\Delta_c\|_F = \|P_{\tilde{\Omega}}\Delta_l\|_F$. By the triangle inequality, we get

$\|P_{\tilde{\Omega}}\Delta_l\|_F \ge \|P_{\tilde{\Omega}} P_{\bar{T}} \Delta_l\|_F - \|P_{\tilde{\Omega}} P_{\bar{T}^\perp} \Delta_l\|_F \ge \|P_{\tilde{\Omega}} P_{I_0^c} P_{\bar{T}} \Delta_l\|_F - \|P_{\bar{T}^\perp}\Delta_l\|_*$.

We bound the first term on the last R.H.S.:

$\|P_{\tilde{\Omega}} P_{I_0^c} P_{\bar{T}} \Delta_l\|_F^2 = \langle P_{\tilde{\Omega}} P_{I_0^c} P_{\bar{T}} \Delta_l,\ P_{\tilde{\Omega}} P_{I_0^c} P_{\bar{T}} \Delta_l \rangle \overset{(a)}{=} \langle P_{I_0^c} P_{\bar{T}} \Delta_l,\ P_{T_0} P_{\tilde{\Omega}} P_{T_0} P_{I_0^c} P_{\bar{T}} \Delta_l \rangle$
$\overset{(b)}{=} \langle P_{I_0^c} P_{\bar{T}} \Delta_l,\ \left( P_{T_0} P_{\tilde{\Omega}} P_{T_0} - \hat{p} P_{T_0} \right) P_{I_0^c} P_{\bar{T}} \Delta_l \rangle + \hat{p}\, \|P_{I_0^c} P_{\bar{T}} \Delta_l\|_F^2 \overset{(c)}{\ge} \frac{\hat{p}}{2} \|P_{I_0^c} P_{\bar{T}} \Delta_l\|_F^2$,

where (a) follows from part (c) of Lemma 2 and the fact that $P_{T_0}$ is a projection when restricted to $I_0^c$, (b) uses part (c) of Lemma 2 again, and (c) uses (13). Combining the last three displays proves the lemma. ∎

Back to the proof of Proposition 1. Suppose $(L^*, C^*) = (\bar{L} + \Delta_l, \bar{C} + \Delta_c)$ is an optimal solution to (2), with $P_\Omega \Delta_l + P_\Omega \Delta_c = 0$. Take any matrix $F \in \bar{T}^\perp$ such that $\|F\| = 1$ and $\langle F, P_{\bar{T}^\perp}\Delta_l \rangle = \|P_{\bar{T}^\perp}\Delta_l\|_*$, and another matrix $G \in \bar{I}^c$ such that $\|G\|_{\infty,2} = 1$ and $\langle G, P_{\bar{I}^c}\Delta_c \rangle = \|P_{\bar{I}^c}\Delta_c\|_{1,2} = \|P_{\bar{I}^c \cap I_0}\Delta_c\|_{1,2} + \|P_{I_0^c}\Delta_c\|_{1,2}$. Then $\bar{U}\bar{V}^\top + F$ is a subgradient of $\|\cdot\|_*$ at $\bar{L}$, and $P_{\bar{I}}\bar{Q} + \lambda G$ is a subgradient of $\lambda\|\cdot\|_{1,2}$ at $\bar{C}$. By the optimality of $(L^*, C^*)$, we have

$0 \ge \|\bar{L} + \Delta_l\|_* + \lambda\|\bar{C} + \Delta_c\|_{1,2} - \|\bar{L}\|_* - \lambda\|\bar{C}\|_{1,2}$
$\overset{(i)}{\ge} \langle \bar{U}\bar{V}^\top + F,\ \Delta_l \rangle + \langle P_{\bar{I}}\bar{Q} + \lambda G,\ \Delta_c \rangle$
$\overset{(ii)}{=} \|P_{\bar{T}^\perp}\Delta_l\|_* + \lambda\left( \|P_{\bar{I}^c \cap I_0}\Delta_c\|_{1,2} + \|P_{I_0^c}\Delta_c\|_{1,2} \right) + \langle \bar{U}\bar{V}^\top - \bar{Q},\ \Delta_l \rangle + \langle P_{\bar{I}}\bar{Q} - \bar{Q},\ \Delta_c \rangle$,

where (i) follows from the definition of a subgradient, and (ii) is due to condition 3(a) and $P_\Omega \Delta_l + P_\Omega \Delta_c = 0$.
Now observe that conditions 3(b) and 3(c) imply

$\langle \bar{U}\bar{V}^\top - \bar{Q},\ \Delta_l \rangle = \langle P_{\bar{T}} D - P_{\bar{T}^\perp}\bar{Q},\ \Delta_l \rangle \ge -\sqrt{\frac{\hat{p}}{2}} \min\left\{\frac{1}{4}, \frac{\lambda}{4}\right\} \|P_{I_0^c} P_{\bar{T}}\Delta_l\|_F - \frac{1}{2}\|P_{\bar{T}^\perp}\Delta_l\|_*$,

and conditions 3(e) and 3(f) imply

$\langle P_{\bar{I}}\bar{Q} - \bar{Q},\ \Delta_c \rangle \ge -\lambda\|P_{\bar{I}^c \cap I_0}\Delta_c\|_{1,2} - \frac{\lambda}{2}\|P_{I_0^c}\Delta_c\|_{1,2}$.

Putting the pieces together, we obtain

$0 \ge \frac{1}{2}\|P_{\bar{T}^\perp}\Delta_l\|_* + \frac{\lambda}{2}\|P_{I_0^c}\Delta_c\|_{1,2} - \sqrt{\frac{\hat{p}}{2}} \min\left\{\frac{1}{4}, \frac{\lambda}{4}\right\} \|P_{I_0^c} P_{\bar{T}}\Delta_l\|_F$
$\overset{(iii)}{\ge} \frac{1}{2}\|P_{\bar{T}^\perp}\Delta_l\|_* + \frac{\lambda}{2}\|P_{I_0^c}\Delta_c\|_{1,2} - \min\left\{\frac{1}{4}, \frac{\lambda}{4}\right\} \left( \|P_{\bar{T}^\perp}\Delta_l\|_* + \|P_{I_0^c}\Delta_c\|_{1,2} \right) \ge \frac{1}{4}\|P_{\bar{T}^\perp}\Delta_l\|_* + \frac{\lambda}{4}\|P_{I_0^c}\Delta_c\|_{1,2} \ge 0$,

where (iii) follows from Lemma 13. Therefore, we must have $\|P_{\bar{T}^\perp}\Delta_l\|_F = \|P_{I_0^c}\Delta_c\|_{1,2} = 0$, which means $\Delta_l \in \bar{T}$, $P_{I_0^c}\Delta_c = 0$ and $P_{I_0} C^* = C^*$. It follows that $P_{T_0} P_{I_0^c} P_{\bar{T}} \Delta_l = P_{I_0^c} P_{\bar{T}} \Delta_l = P_{I_0^c}\Delta_l$ by part (c) of Lemma 2, and $P_{\tilde{\Omega}}\Delta_l = -P_{\tilde{\Omega}}\Delta_c = 0$, so $P_{I_0^c}\Delta_l \in T_0 \cap \tilde{\Omega}^\perp$. But this intersection is trivial by condition 1 in the proposition, so $P_{I_0^c}\Delta_l = 0$ and thus $P_{I_0^c} L^* = L_0$. Furthermore, we have $P_{\bar{U}^\perp}\Delta_l = P_{\bar{U}^\perp} P_{\bar{T}}\Delta_l = P_{\bar{V}} P_{\bar{U}^\perp}\Delta_l$, and thus $P_{\bar{U}^\perp}\Delta_l \in \mathrm{range}(P_{\bar{V}})$. But we also have $P_{\bar{U}^\perp}\Delta_l = P_{\bar{U}^\perp}\left( P_{I_0^c} + P_{I_0} \right)\Delta_l = P_{\bar{U}^\perp} P_{I_0}\Delta_l \in I_0$. This implies $P_{\bar{U}^\perp}\Delta_l = 0$ by condition 2 in the proposition. This shows that $P_{U_0}\Delta_l = P_{\bar{U}}\Delta_l = \Delta_l$, where the first equality follows from part (a) of Lemma 2. This completes the proof of the proposition. ∎

D Proof of Lemma 5

For any matrices $A$ and $B$, we have

$\|AB\|_F \le \|A\|\,\|B\|_F$,   (26)

which follows from $\|AB\|_F^2 = \sum_j \|AB e_j\|_2^2 \le \sum_j \|A\|^2 \|B e_j\|_2^2 = \|A\|^2 \|B\|_F^2$. Using part (d) of Lemma 3, we know $P_{I_0}\bar{V}^\top = \lambda\bar{U}^\top\bar{H}^0$. It follows that for any matrix $Z$,

$P_{\bar{V}} P_{I_0} P_{\bar{V}}(Z) = P_{I_0}\left( Z\bar{V}\bar{V}^\top \right)\bar{V}\bar{V}^\top = Z\bar{V}\left( P_{I_0}\bar{V}^\top \right)\left( P_{I_0}\bar{V}^\top \right)^\top \bar{V}^\top = \lambda^2 Z\bar{V}\bar{U}^\top\bar{H}^0 (\bar{H}^0)^\top \bar{U}\bar{V}^\top$.

Using (26), we obtain

$\|P_{\bar{V}} P_{I_0} P_{\bar{V}}(Z)\|_F \le \lambda^2 \|Z\|_F \|\bar{V}\|^2 \|\bar{U}\|^2 \|\bar{H}^0\|^2 \overset{(i)}{\le} \lambda^2 \gamma n \|Z\|_F \le \frac{1}{2}\|Z\|_F$,

where inequality (i) follows because $\|\bar{H}^0\|_{\infty,2} \le 1$ and $\bar{H}^0 \in I_0$ has at most $\gamma n$ non-zero columns. The second part of the lemma is proved in a similar manner using the sub-multiplicativity of the matrix spectral norm. ∎

E Proof of Lemma 6

By part (d) of Lemma 3, we have

$P_{I_0} Q = \bar{U} P_{I_0}\bar{V}^\top + \lambda\bar{H}^0 - \lambda P_{U_0}\bar{H}^0 = \lambda\bar{H}^0$.

Using (14) and $P_{U_0} = P_{\bar{U}}$, we have

$P_{\bar{T}} Q = \bar{U}\bar{V}^\top + \lambda P_{\bar{U}}\bar{H}^0 + \lambda P_{\bar{U}^\perp} P_{\bar{V}}\bar{H}^0 - \lambda P_{U_0}\bar{H}^0 - \lambda P_{\bar{V}} P_{I_0^c} P_{\bar{V}} \mathcal{B} P_{\bar{V}} P_{\bar{U}^\perp}\bar{H}^0 = \bar{U}\bar{V}^\top$.

This proves the two equalities in the lemma. Observe that $\bar{H}^0 \in I_0$ has at most $n_c = \gamma n$ non-zero columns, each of which has norm at most one by parts (b) and (c) of Lemma 3. It follows that $\|\bar{H}^0\| \le \|\bar{H}^0\|_F \le \sqrt{\gamma n}$. We also have $\|P_{\bar{V}} P_{\bar{U}^\perp}\bar{H}^0\| \le \|(I_d - \bar{U}\bar{U}^\top)\bar{H}^0\bar{V}\bar{V}^\top\| \le \|\bar{H}^0\|$ by sub-multiplicativity of the spectral norm. This proves the first set of inequalities in the lemma. ∎

F Proof of Lemmas in Section 5.5

In this section, we prove the lemmas used in Section 5.5.

F.1 Proof of Lemma 10

Recall that $D_0^U := \bar{U} P_{I_0^c}\bar{V}^\top$ and $D_0^V := P_{\bar{U}^\perp} P_{V_0} P_{I_0^c} \mathcal{B} P_{\bar{V}} (\lambda\bar{H}^0)$. The first three inequalities follow directly from the incoherence Assumption 1 and part (b) of Lemma 2. Now, by Assumption 3 and part (a) of Lemma 3, we know each column of $\bar{H}^0$ has at most $2\rho m$ non-zero entries. Because $\bar{U}$ has the same column space as $U_0$ by Lemma 2, $\bar{U}$ satisfies the same incoherence property as $U_0$ given in Assumption 1.
Therefore, we have

$\|e_a^\top (\bar{H}^0)^\top \bar{U}\|_2 \le \|\bar{H}^0 e_a\|_1 \|\bar{U}^\top\|_{\infty,2} \le \sqrt{2\rho m}\, \|\bar{H}^0 e_a\|_2 \cdot \sqrt{\frac{\mu r}{m}} = \sqrt{2\rho\mu r}$.

It follows that

$\|(\bar{H}^0)^\top \bar{U}\| \le \|(\bar{H}^0)^\top \bar{U}\|_F \le \sqrt{\gamma n}\sqrt{2\rho\mu r} = \sqrt{\gamma n}\sqrt{2\beta\hat{p}\mu r}$,

where we use the definition $\beta := \rho/\hat{p}$. Using Lemma 5 and the fact that $\|\bar{H}^0\| \le \sqrt{\gamma n}$, we get

$\left\| \mathcal{B}\left( \bar{H}^0 (\bar{H}^0)^\top \bar{U}\bar{V}^\top \right) \right\| = \left\| P_{I_0^c} P_{\bar{V}} \left( \sum_{i=0}^\infty (P_{\bar{V}} P_{I_0} P_{\bar{V}})^i \right) \left( \bar{H}^0 (\bar{H}^0)^\top \bar{U}\bar{V}^\top \right) \right\| \le \left( \sum_{i=0}^\infty \frac{1}{2^i} \right) \left\| \bar{H}^0 (\bar{H}^0)^\top \bar{U}\bar{V}^\top \right\| \le 4\gamma n\sqrt{\beta\hat{p}\mu r}$.   (27)

On the other hand, note that by part (d) of Lemma 3 we have $P_{I_0}\bar{V}^\top = \lambda\bar{U}^\top\bar{H}^0$. Since $\bar{H}^0 \in I_0$, we have

$\lambda \left\| P_{\bar{U}^\perp} P_{V_0} P_{I_0^c} \mathcal{B} P_{\bar{V}}\bar{H}^0 e_j \right\|_2 = \lambda \left\| \left( I_d - \bar{U}\bar{U}^\top \right) \mathcal{B}\left( \bar{H}^0\bar{V} \right)\bar{V}^\top V_0 V_0^\top e_j \right\|_2 = \lambda \left\| \left( I_d - \bar{U}\bar{U}^\top \right) \mathcal{B}\left( \bar{H}^0 (P_{I_0}\bar{V}^\top)^\top \right)\bar{V}^\top V_0 V_0^\top e_j \right\|_2 = \lambda^2 \left\| \left( I_d - \bar{U}\bar{U}^\top \right) \mathcal{B}\left( \bar{H}^0 (\bar{H}^0)^\top \bar{U} \right)\bar{V}^\top V_0 V_0^\top e_j \right\|_2$.   (28)

Combining (27) and (28), we obtain

$\|D_0^V\|_{\infty,2} = \max_j \lambda \left\| P_{\bar{U}^\perp} P_{V_0} P_{I_0^c} \mathcal{B} P_{\bar{V}}\bar{H}^0 e_j \right\|_2 \le \lambda^2 \left\| I_d - \bar{U}\bar{U}^\top \right\| \left\| \mathcal{B}\left( \bar{H}^0 (\bar{H}^0)^\top \bar{U}\bar{V}^\top \right) \right\| \max_j \|V_0 V_0^\top e_j\|_2 \le \lambda^2 \cdot 1 \cdot 4\gamma n\sqrt{\beta\hat{p}\mu r} \cdot \sqrt{\frac{\mu r}{n}} = 4\lambda^2\gamma\mu r\sqrt{\beta\hat{p}\, n}$,

which proves the fourth inequality in the lemma. The last inequality in the lemma can be established in a similar manner using Lemma 5:

$\|D_0^V\|_F = \left\| P_{\bar{U}^\perp} P_{V_0} \mathcal{B} P_{\bar{V}}(\lambda\bar{H}^0) \right\|_F \le \lambda^2 \left\| I_d - \bar{U}\bar{U}^\top \right\| \left\| \mathcal{B}\left( \bar{H}^0 (\bar{H}^0)^\top \bar{U}\bar{V}^\top \right) \right\|_F \left\| V_0 V_0^\top \right\| \le 2\lambda^2 \left\| \bar{H}^0 (\bar{H}^0)^\top \bar{U}\bar{V}^\top \right\|_F \le 2\lambda^2 \|\bar{H}^0\| \left\| (\bar{H}^0)^\top \bar{U} \right\|_F \|\bar{V}^\top\| \le 2\lambda^2 \cdot \sqrt{\gamma n} \cdot \sqrt{\gamma n}\sqrt{2\beta\hat{p}\mu r} \cdot 1$. ∎

F.2 Proof of Lemma 9

Let $e_i$ be the $i$-th standard basis vector, whose dimension will be clear from the context. The following inequality is used repeatedly: from the incoherence Assumption 1, we have

$\|P_{T_0}(e_i e_j^\top)\|_F^2 = \|P_{U_0} e_i\|_2^2 + \|P_{V_0} e_j\|_2^2 - \|P_{U_0} e_i\|_2^2 \|P_{V_0} e_j\|_2^2 \le \frac{2\mu r}{n \wedge m}, \quad \forall i \in [m],\ j \in [n+n_c]$.   (29)

We also need the matrix Bernstein inequality, restated below.

Theorem 3 (Matrix Bernstein [46]). Let $X_1, \ldots, X_N \in \mathbb{R}^{m \times n}$ be independent zero-mean random matrices. Suppose there exist two numbers $B$ and $\sigma^2$ such that

$\max\left\{ \left\| \mathbb{E}\sum_{k=1}^N X_k X_k^\top \right\|,\ \left\| \mathbb{E}\sum_{k=1}^N X_k^\top X_k \right\| \right\} \le \sigma^2$

and $\|X_k\| \le B$ almost surely for all $k$. Then with probability at least $1 - 2(m+n)^{-12}$, we have

$\left\| \sum_{k=1}^N X_k \right\| \le 20 B \log(m+n) + \sqrt{50\sigma^2\log(m+n)}$.

We now turn to the proof of the lemma.

Proof (of Lemma 9). Observe that $\frac{1}{\hat{p}} P_{T_0} P_{\tilde{\Omega}} P_{T_0} Z - P_{T_0} Z \in I_0^c$ for any matrix $Z \in T_0 \subseteq I_0^c$. Fix an index $b \in I_0^c$. For each $(i,j) \in [m] \times I_0^c$, let $\delta_{(ij)}$ be the indicator variable that equals one if and only if $(i,j) \in \tilde{\Omega}$. We have $\mathbb{P}[\delta_{(ij)} = 1] = \hat{p}$ by Assumption 3. Define

$S_{(ij)} := \left( \frac{\delta_{(ij)}}{\hat{p}} - 1 \right) Z_{ij}\, P_{T_0}(e_i e_j^\top) e_b$,

which is a column vector in $\mathbb{R}^m$. Since $P_{T_0} Z = Z$ for $Z \in T_0$, the $b$-th column of the matrix $\left( \frac{1}{\hat{p}} P_{T_0} P_{\tilde{\Omega}} - P_{T_0} \right) Z$ can be written as

$\left( \frac{1}{\hat{p}} P_{T_0} P_{\tilde{\Omega}} - \mathcal{I} \right) Z e_b = \sum_{(i,j) \in [m] \times I_0^c} S_{(ij)}$,

which is a sum of independent vectors in $\mathbb{R}^m$. Note that $\mathbb{E} S_{(ij)} = 0$ and

$\|S_{(ij)}\|_2 \le \left| \frac{\delta_{(ij)}}{\hat{p}} - 1 \right| |Z_{ij}| \|P_{T_0}(e_i e_j^\top)\|_F \le \frac{1}{\hat{p}} \sqrt{\frac{2\mu r}{n \wedge m}} \|Z\|_\infty \quad \text{a.s.},$

where the second inequality follows from (29). We also have

$\mathbb{E}\sum_{(i,j) \in [m] \times I_0^c} S_{(ij)}^\top S_{(ij)} = \sum_{i,j} \mathbb{E}\left[ \left( \frac{\delta_{(ij)}}{\hat{p}} - 1 \right)^2 \right] Z_{ij}^2 \|P_{T_0}(e_i e_j^\top) e_b\|_2^2 = \frac{1-\hat{p}}{\hat{p}} \sum_{i,j} Z_{ij}^2 \|P_{T_0}(e_i e_j^\top) e_b\|_2^2$.

We bound the summand on the last R.H.S. Recall that $I_d$ denotes the identity matrix.
For each $(i,j) \in [m] \times I_0^c$, we have

$P_{T_0}(e_i e_j^\top) e_b = U_0 U_0^\top e_i e_j^\top e_b + (I_d - U_0 U_0^\top) e_i e_j^\top V_0 V_0^\top e_b$,

and hence

$\|P_{T_0}(e_i e_j^\top) e_b\|_2 = \begin{cases} \|U_0 U_0^\top e_i + (I_d - U_0 U_0^\top) e_i \|V_0^\top e_b\|_2^2\|_2, & \text{if } j = b, \\ \|(I_d - U_0 U_0^\top) e_i\| \cdot |e_j^\top V_0 V_0^\top e_b|, & \text{if } j \ne b, \end{cases} \le \begin{cases} \|U_0^\top e_i\|_2 + \|V_0^\top e_b\|_2^2, & \text{if } j = b, \\ |e_j^\top V_0 V_0^\top e_b|, & \text{if } j \ne b, \end{cases} \le \begin{cases} 2\sqrt{\frac{\mu r}{m \wedge n}}, & \text{if } j = b, \\ |e_j^\top V_0 V_0^\top e_b|, & \text{if } j \ne b, \end{cases}$

where in the last inequality we use $\|V_0^\top e_b\|_2 \le 1$ and the incoherence Assumption 1. It follows that

$\left\| \mathbb{E}\sum_{i,j} S_{(ij)} S_{(ij)}^\top \right\| = \left\| \mathbb{E}\sum_{i,j} S_{(ij)}^\top S_{(ij)} \right\| \le \frac{1}{\hat{p}} \sum_{i \in [m],\, j = b} Z_{ij}^2 \frac{4\mu r}{n \wedge m} + \frac{1}{\hat{p}} \sum_{i \in [m],\, j \ne b} Z_{ij}^2 \left( e_j^\top V_0 V_0^\top e_b \right)^2$
$= \frac{4\mu r}{\hat{p}(n \wedge m)} \sum_i Z_{ib}^2 + \frac{1}{\hat{p}} \sum_{j \ne b} \left( e_j^\top V_0 V_0^\top e_b \right)^2 \sum_i Z_{ij}^2 \le \frac{4\mu r}{\hat{p}(n \wedge m)} \|Z\|_{\infty,2}^2 + \frac{1}{\hat{p}} \|V_0 V_0^\top e_b\|_2^2 \|Z\|_{\infty,2}^2 \le \frac{4\mu r}{\hat{p}(n \wedge m)} \|Z\|_{\infty,2}^2 + \frac{\mu r}{\hat{p}\, n} \|Z\|_{\infty,2}^2 \le \frac{5\mu r}{\hat{p}(n \wedge m)} \|Z\|_{\infty,2}^2$.

Treating $\{S_{(ij)}\}$ as zero-padded $m \times n$ matrices and applying the matrix Bernstein inequality in Theorem 3, we obtain that with probability at least $1 - 2(m+n)^{-12}$,

$\left\| \left( \frac{1}{\hat{p}} P_{T_0} P_{\tilde{\Omega}} - \mathcal{I} \right) Z e_b \right\|_2 \le \frac{20}{\hat{p}} \sqrt{\frac{2\mu r}{n \wedge m}} \|Z\|_\infty \log(m+n) + \sqrt{\frac{50 \cdot 5\,\mu r\log(m+n)}{\hat{p}(n \wedge m)}} \|Z\|_{\infty,2} \le \frac{1}{2}\sqrt{\frac{\log(m+n)}{\hat{p}}} \|Z\|_\infty + \frac{1}{2}\|Z\|_{\infty,2}$,

where the second inequality holds provided the constant $c_0$ in the condition of the lemma is sufficiently large. In a similar fashion we can prove that for each $a \in [m]$, with probability at least $1 - 2(m+n)^{-12}$,

$\left\| e_a^\top \left( \frac{1}{\hat{p}} P_{T_0} P_{\tilde{\Omega}} - \mathcal{I} \right) Z \right\|_2 \le \frac{40}{\hat{p}} \sqrt{\frac{\mu r}{n \wedge m}} \|Z\|_\infty \log(m+n) + \sqrt{\frac{250\,\mu r\log(m+n)}{\hat{p}(n \wedge m)}} \|Z\|_{\infty,2} \le \frac{1}{2}\sqrt{\frac{\log(m+n)}{\hat{p}}} \|Z\|_\infty + \frac{1}{2}\|Z^\top\|_{\infty,2}$.

The lemma follows from a union bound over all indices $a \in [m]$ and $b \in I_0^c$. ∎

F.3 Proof of Lemma 11

Observe that $\frac{1}{\hat{p}} P_{\tilde{\Omega}} Z - Z \in I_0^c$ for any matrix $Z \in T_0 \subseteq I_0^c$. Fix an index $b \in I_0^c$. For each $i \in [m]$, recall the indicator variable $\delta_{(ib)}$ defined in the last section, and define the vector

$\xi_{(i)} := Z_{ib} \left( \frac{\delta_{(ib)}}{\hat{p}} - 1 \right) e_i \in \mathbb{R}^m$.

Then the $b$-th column of the matrix $\frac{1}{\hat{p}} P_{\tilde{\Omega}} Z - Z$ can be written as

$\left( \frac{1}{\hat{p}} P_{\tilde{\Omega}} Z - Z \right) e_b = \sum_{i \in [m]} \xi_{(i)}$,

which is a sum of independent vectors. Note that each $\xi_{(i)}$ has mean zero and satisfies

$\|\xi_{(i)}\|_2 \le \left( \frac{1}{\hat{p}} - 1 \right) |Z_{ib}| \le \frac{1}{\hat{p}}\|Z\|_\infty$ a.s.

Moreover, we have

$\max\left\{ \left\| \mathbb{E}\sum_{i \in [m]} \xi_{(i)}^\top \xi_{(i)} \right\|,\ \left\| \mathbb{E}\sum_{i \in [m]} \xi_{(i)} \xi_{(i)}^\top \right\| \right\} = \max\left\{ \frac{1-\hat{p}}{\hat{p}} \sum_i Z_{ib}^2,\ \left\| \frac{1-\hat{p}}{\hat{p}} \sum_i Z_{ib}^2 e_i e_i^\top \right\| \right\} \le \frac{1}{\hat{p}}\|Z\|_{\infty,2}^2$.

Treating $\{\xi_{(i)}\}$ as zero-padded $m \times n$ matrices and applying the matrix Bernstein inequality in Theorem 3, we obtain that with probability at least $1 - 2(m+n)^{-12}$,

$\left\| \left( \frac{1}{\hat{p}} P_{\tilde{\Omega}} Z - Z \right) e_b \right\|_2 \le \frac{20\log(m+n)}{\hat{p}}\|Z\|_\infty + \sqrt{\frac{50\log(m+n)}{\hat{p}}}\|Z\|_{\infty,2}$.

The lemma follows from a union bound over all indices $b \in I_0^c$. ∎

F.4 Proof of Lemma 7

Recall the indicator variables $\delta_{(ij)}$ defined in the last section. Since $Z \in I_0^c$, we may write

$\frac{1}{\hat{p}} P_{\tilde{\Omega}} Z - Z = \sum_{(i,j) \in [m] \times I_0^c} S_{(ij)} := \sum_{(i,j) \in [m] \times I_0^c} \left( \frac{\delta_{(ij)}}{\hat{p}} - 1 \right) Z_{ij}\, e_i e_j^\top$,

where the $S_{(ij)}$ are independent matrices satisfying $\mathbb{E}[S_{(ij)}] = 0$ and $\|S_{(ij)}\| \le \frac{1}{\hat{p}}\|Z\|_\infty$. Moreover, we have

$\mathbb{E}\sum_{(i,j) \in [m] \times I_0^c} S_{(ij)} S_{(ij)}^\top = \sum_{(i,j) \in [m] \times I_0^c} Z_{ij}^2\, e_i e_j^\top e_j e_i^\top\, \mathbb{E}\left[ \left( \frac{\delta_{(ij)}}{\hat{p}} - 1 \right)^2 \right] = \frac{1-\hat{p}}{\hat{p}} \sum_{(i,j) \in [m] \times I_0^c} Z_{ij}^2\, e_i e_i^\top$

and thus

$\left\| \mathbb{E}\sum_{(i,j) \in [m] \times I_0^c} S_{(ij)} S_{(ij)}^\top \right\| \le \frac{1}{\hat{p}} \max_{i \in [m]} \sum_{j \in I_0^c} Z_{ij}^2 \le \frac{1}{\hat{p}}\|Z\|_{(\infty,2)^2}^2$.
We can bound $\left\| \mathbb{E}\sum_{(i,j) \in [m] \times I_0^c} S_{(ij)}^\top S_{(ij)} \right\|$ in a similar way. Applying the matrix Bernstein inequality in Theorem 3 proves the lemma. ∎

References

[1] Gediminas Adomavicius and Alexander Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, pages 734–749, 2005.
[2] Alekh Agarwal, Sahand Negahban, and Martin J. Wainwright. Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. The Annals of Statistics, 40(2):1171–1197, 2012.
[3] Arash A. Amini and Martin J. Wainwright. High-dimensional analysis of semidefinite relaxations for sparse principal components. The Annals of Statistics, 37(5):2877–2921, 2009.
[4] James Bennett and Stan Lanning. The Netflix prize. In Proceedings of KDD Cup and Workshop, page 35, 2007.
[5] Quentin Berthet and Philippe Rigollet. Complexity theoretic lower bounds for sparse principal component detection. Journal of Machine Learning Research: Workshop and Conference Proceedings, 30:1046–1066, 2013.
[6] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[7] Peter Bühlmann and Sara van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, 2011.
[8] Jian-Feng Cai, Emmanuel J. Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
[9] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM, 58(3):11, 2011.
[10] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
[11] Emmanuel J. Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.
[12] Nicolò Cesa-Bianchi, Shai Shalev-Shwartz, and Ohad Shamir. Efficient learning with partially observed attributes. In Proceedings of the 27th International Conference on Machine Learning, pages 216–223, 2010.
[13] Venkat Chandrasekaran and Michael I. Jordan. Computational and statistical tradeoffs via convex relaxation. Proceedings of the National Academy of Sciences, 110(13):E1181–E1190, 2013.
[14] Venkat Chandrasekaran, Sujay Sanghavi, Pablo Parrilo, and Alan Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572–596, 2011.
[15] Yudong Chen. Incoherence-optimal matrix completion. IEEE Transactions on Information Theory, 61(5):2909–2923, 2015.
[16] Yudong Chen, Srinadh Bhojanapalli, Sujay Sanghavi, and Rachel Ward. Coherent matrix completion. In Proceedings of the International Conference on Machine Learning, 2014.
[17] Yudong Chen, Ali Jalali, Sujay Sanghavi, and Constantine Caramanis. Low-rank matrix recovery from errors and erasures. IEEE Transactions on Information Theory, 59(7):4324–4337, 2013.
[18] Yudong Chen, Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust matrix completion and corrupted columns.
[18] Yudong Chen, Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust matrix completion and corrupted columns. In Proceedings of the 28th International Conference on Machine Learning, pages 873–880, 2011.
[19] David Gross. Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3):1548–1566, 2011.
[20] Jonathan L. Herlocker, Joseph A. Konstan, Al Borchers, and John Riedl. An algorithmic framework for performing collaborative filtering. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 230–237. ACM, 1999.
[21] Junzhou Huang and Tong Zhang. The benefit of group sparsity. The Annals of Statistics, 38(4):1978–2004, 2010.
[22] Peter Huber. Robust Statistics. Wiley, New York, 1981.
[23] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing, pages 665–674. ACM, 2013.
[24] David R. Karger, Sewoong Oh, and Devavrat Shah. Budget-optimal crowdsourcing using low-rank matrix approximations. In 49th Annual Allerton Conference on Communication, Control, and Computing, pages 284–291. IEEE, 2011.
[25] David R. Karger, Sewoong Oh, and Devavrat Shah. Iterative learning for reliable crowdsourcing systems. In Advances in Neural Information Processing Systems, pages 1953–1961, 2011.
[26] David R. Karger, Sewoong Oh, and Devavrat Shah. Efficient crowdsourcing for multi-class labeling. In Proceedings of the ACM SIGMETRICS/International Conference on Measurement and Modeling of Computer Systems, pages 81–92. ACM, 2013.
[27] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980–2998, 2010.
[28] Olga Klopp, Karim Lounici, and Alexandre B. Tsybakov. Robust matrix completion. arXiv preprint arXiv:1412.8132, 2014.
[29] Shyong K. Lam and John Riedl. Shilling recommender systems for fun and profit. In Proceedings of the 13th International Conference on World Wide Web, 2004.
[30] Rasmus Larsen. PROPACK: software for large and sparse SVD calculations. Available at http://sun.stanford.edu/~rmunk/PROPACK/.
[31] Jason D. Lee, Yuekai Sun, and Jonathan E. Taylor. On model selection consistency of penalized M-estimators: a geometric theory. In Advances in Neural Information Processing Systems, pages 342–350, 2013.
[32] Gilad Lerman, Michael B. McCoy, Joel A. Tropp, and Teng Zhang. Robust computation of linear models by convex relaxation. Foundations of Computational Mathematics, 15(2):363–410, 2015.
[33] Xiaodong Li. Compressed sensing and matrix completion with constant proportion of corruptions. Constructive Approximation, 37(1):73–99, 2013.
[34] Zhouchen Lin, Minming Chen, Leqin Wu, and Yi Ma. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. UIUC Technical Report UILU-ENG-09-2215, 2009.
[35] Greg Linden, Brent Smith, and Jeremy York. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 7(1), 2003.
[36] Zongming Ma and Yihong Wu. Computational barriers in minimax submatrix detection. The Annals of Statistics, 43(3):1089–1116, 2015.
[37] Ricardo A. Maronna, R. Douglas Martin, and Víctor J. Yohai. Robust Statistics. Wiley, 2006.
[38] Michael B. McCoy and Joel A. Tropp. Two proposals for robust PCA using semidefinite programming. Electronic Journal of Statistics, 5:1123–1160, 2011.
[39] Sangkil Moon and Gary J. Russell. Predicting product purchase from inferred customer similarity: An autologistic model approach. Management Science, 54(1):71, 2008.
[40] Rajeev Motwani and Sergei Vassilvitskii. Tracing the path: new model and algorithms for collaborative filtering. In IEEE 23rd International Conference on Data Engineering Workshop, pages 853–862. IEEE, 2007.
[41] Sahand Negahban, Pradeep Ravikumar, Martin J. Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.
[42] Benjamin Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12:3413–3430, 2011.
[43] Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.
[44] J. J. Sandvig, Bamshad Mobasher, and Robin Burke. Robustness of collaborative recommendation based on association rule mining. In Proceedings of the 2007 ACM Conference on Recommender Systems, page 112. ACM, 2007.
[45] J. Ben Schafer, Joseph A. Konstan, and John Riedl. E-commerce recommendation applications. Data Mining and Knowledge Discovery, 5(1):115–153, 2001.
[46] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.
[47] Benjamin Van Roy and Xiang Yan. Manipulation robustness of collaborative filtering. Management Science, 56(11):1911–1929, 2010.
[48] Huan Xu, Constantine Caramanis, and Shie Mannor. Outlier-robust PCA: The high-dimensional case. IEEE Transactions on Information Theory, 59(1), 2013.
[49] Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust PCA via outlier pursuit. In Advances in Neural Information Processing Systems 23, pages 2496–2504, 2010.
[50] Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust PCA via outlier pursuit. IEEE Transactions on Information Theory, 58(5):3047–3064, 2012.
[51] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.
[52] Hongyang Zhang and Zhouchen Lin. Personal communication.
[53] Yuchen Zhang, Martin J. Wainwright, and Michael I. Jordan. Lower bounds on the performance of polynomial-time algorithms for sparse linear regression. In Proceedings of The 27th Conference on Learning Theory, pages 921–948, 2014.