Exact Matrix Completion via Convex Optimization
Emmanuel J. Candès† and Benjamin Recht‡

† Applied and Computational Mathematics, Caltech, Pasadena, CA 91125
‡ Center for the Mathematics of Information, Caltech, Pasadena, CA 91125

May 2008

Abstract

We consider a problem of considerable practical interest: the recovery of a data matrix from a sampling of its entries. Suppose that we observe $m$ entries selected uniformly at random from a matrix $M$. Can we complete the matrix and recover the entries that we have not seen? We show that one can perfectly recover most low-rank matrices from what appears to be an incomplete set of entries. We prove that if the number $m$ of sampled entries obeys

$$m \ge C\, n^{1.2}\, r \log n$$

for some positive numerical constant $C$, then with very high probability, most $n \times n$ matrices of rank $r$ can be perfectly recovered by solving a simple convex optimization program. This program finds the matrix with minimum nuclear norm that fits the data. The condition above assumes that the rank is not too large. However, if one replaces the 1.2 exponent with 1.25, then the result holds for all values of the rank. Similar results hold for arbitrary rectangular matrices as well. Our results are connected with the recent literature on compressed sensing, and show that objects other than signals and images can be perfectly reconstructed from very limited information.

Keywords. Matrix completion, low-rank matrices, convex optimization, duality in optimization, nuclear norm minimization, random matrices, noncommutative Khintchine inequality, decoupling, compressed sensing.

1 Introduction

In many practical problems of interest, one would like to recover a matrix from a sampling of its entries. As a motivating example, consider the task of inferring answers in a partially filled out survey. That is, suppose that questions are being asked of a collection of individuals. Then we can form a matrix where the rows index the individuals and the columns index the questions. We collect data to fill out this table but, unfortunately, many questions are left unanswered. Is it possible to make an educated guess about what the missing answers should be? How can one make such a guess?

Formally, we may view this problem as follows. We are interested in recovering a data matrix $M$ with $n_1$ rows and $n_2$ columns, but we only get to observe a number $m$ of its entries which is much smaller than $n_1 n_2$, the total number of entries. Can one recover the matrix $M$ from $m$ of its entries? In general, everyone would agree that this is impossible without some additional information.

In many instances, however, the matrix we wish to recover is known to be structured in the sense that it is low-rank or approximately low-rank. (We recall for completeness that a matrix with $n_1$ rows and $n_2$ columns has rank $r$ if its rows or columns span an $r$-dimensional space.) Below are two examples of practical scenarios where one would like to be able to recover a low-rank matrix from a sampling of its entries.

• The Netflix problem. In the area of recommender systems, users submit ratings on a subset of entries in a database, and the vendor provides recommendations based on the users' preferences [28, 32]. Because users only rate a few items, one would like to infer their preferences for unrated items. A special instance of this problem is the now famous Netflix problem [2].
Users (rows of the data matrix) are given the opportunity to rate movies (columns of the data matrix), but users typically rate only very few movies, so that there are very few scattered observed entries of this data matrix. Yet one would like to complete this matrix so that the vendor (here, Netflix) might recommend titles that any particular user is likely to be willing to order. In this case, the data matrix of all user-ratings may be approximately low-rank because it is commonly believed that only a few factors contribute to an individual's tastes or preferences.

• Triangulation from incomplete data. Suppose we are given partial information about the distances between objects and would like to reconstruct the low-dimensional geometry describing their locations. For example, we may have a network of low-power wirelessly networked sensors scattered randomly across a region. Suppose each sensor only has the ability to construct distance estimates based on signal strength readings from its nearest fellow sensors. From these noisy distance estimates, we can form a partially observed distance matrix. We can then estimate the true distance matrix, whose rank will be equal to two if the sensors are located in a plane or three if they are located in three-dimensional space [24, 31]. In this case, we only need to observe a few distances per node to have enough information to reconstruct the positions of the objects.

These examples are of course far from exhaustive, and there are many other problems which fall in this general category. For instance, we may have some very limited information about a covariance matrix of interest. Yet this covariance matrix may be low-rank or approximately low-rank because the variables only depend upon a comparably smaller number of factors.

1.1 Impediments and solutions

Suppose for simplicity that we wish to recover a square $n \times n$ matrix $M$ of rank $r$.¹ Such a matrix $M$ can be represented by $n^2$ numbers, but it only has $(2n - r)r$ degrees of freedom. This fact can be revealed by counting parameters in the singular value decomposition (the number of degrees of freedom associated with the description of the singular values and of the left and right singular vectors). When the rank is small, this is considerably smaller than $n^2$. For instance, when $M$ encodes a 10-dimensional phenomenon, then the number of degrees of freedom is about $20n$, offering a reduction in dimensionality by a factor about equal to $n/20$. When $n$ is large (e.g., in the thousands or millions), the data matrix carries much less information than its ambient dimension suggests. The problem is now whether it is possible to recover this matrix from a sampling of its entries without having to probe all the $n^2$ entries, or more generally collect $n^2$ or more measurements about $M$.

¹We emphasize that there is nothing special about $M$ being square, and all of our discussion would apply to arbitrary rectangular matrices as well. The advantage of focusing on square matrices is a simplified exposition and a reduction in the number of parameters we need to keep track of.

1.1.1 Which matrices?

In general, one cannot hope to be able to recover a low-rank matrix from a sample of its entries. Consider the rank-1 matrix $M$ equal to

$$M = e_1 e_n^* = \begin{bmatrix} 0 & 0 & \cdots & 0 & 1 \\ 0 & 0 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 0 & 0 \end{bmatrix}, \tag{1.1}$$
where here and throughout, $e_i$ is the $i$th canonical basis vector in Euclidean space (the vector with all entries equal to 0 but the $i$th equal to 1). This matrix has a 1 in the top-right corner, and all the other entries are 0. Clearly, this matrix cannot be recovered from a sampling of its entries unless we see essentially all of the entries. The reason is that for most sampling sets, we would only get to see zeros, so that we would have no way of guessing that the matrix is not zero. For instance, if we were to see 90% of the entries selected at random, then 10% of the time we would only get to see zeros.

It is therefore impossible to recover all low-rank matrices from a set of sampled entries, but can one recover most of them? To investigate this issue, we introduce a simple model of low-rank matrices. Consider the singular value decomposition (SVD) of a matrix $M$,

$$M = \sum_{k=1}^{r} \sigma_k u_k v_k^*, \tag{1.2}$$

where the $u_k$'s and $v_k$'s are the left and right singular vectors, and the $\sigma_k$'s are the singular values (the square roots of the eigenvalues of $M^* M$). Then we could think of a generic low-rank matrix as follows: the family $\{u_k\}_{1 \le k \le r}$ is selected uniformly at random among all families of $r$ orthonormal vectors, and similarly for the family $\{v_k\}_{1 \le k \le r}$. The two families may or may not be independent of each other. We make no assumptions about the singular values $\sigma_k$. In the sequel, we will refer to this model as the random orthogonal model. This model is convenient in the sense that it is both very concrete and simple, and useful in the sense that it will help us fix the main ideas. In the sequel, however, we will consider far more general models. The question for now is whether or not one can recover such a generic matrix from a sampling of its entries.

1.1.2 Which sampling sets?

Clearly, one cannot hope to reconstruct any low-rank matrix $M$—even of rank 1—if the sampling set avoids any column or row of $M$. Suppose that $M$ is of rank 1 and of the form $xy^*$, $x, y \in \mathbb{R}^n$, so that the $(i,j)$th entry is given by $M_{ij} = x_i y_j$. Then if we do not have samples from the first row, for example, one could never guess the value of the first component $x_1$ by any method whatsoever; no information about $x_1$ is observed. There is of course nothing special about the first row, and this argument extends to any row or column. To have any hope of recovering an unknown matrix, one needs at least one observation per row and one observation per column.

We have just seen that if the sampling is adversarial, e.g., one observes all of the entries of $M$ but those in the first row, then one would not even be able to recover matrices of rank 1. But what happens for most sampling sets? Can one recover a low-rank matrix from almost all sampling sets of cardinality $m$? Formally, suppose that the set $\Omega$ of locations corresponding to the observed entries ($(i,j) \in \Omega$ if $M_{ij}$ is observed) is a set of cardinality $m$ sampled uniformly at random. Then can one recover a generic low-rank matrix $M$, perhaps with very large probability, from the knowledge of the value of its entries on the set $\Omega$?

1.1.3 Which algorithm?

If the number of measurements is sufficiently large, and if the entries are sufficiently uniformly distributed as above, one might hope that there is only one low-rank matrix with these entries.
If this were true, one would want to recover the data matrix by solving the optimization problem

$$\begin{array}{ll} \text{minimize} & \operatorname{rank}(X) \\ \text{subject to} & X_{ij} = M_{ij}, \quad (i,j) \in \Omega, \end{array} \tag{1.3}$$

where $X$ is the decision variable and $\operatorname{rank}(X)$ is equal to the rank of the matrix $X$. The program (1.3) is a common-sense approach which simply seeks the simplest explanation fitting the observed data. If there were only one low-rank object fitting the data, this would recover $M$. This is unfortunately of little practical use, because this optimization problem is not only NP-hard, but all known algorithms which provide exact solutions require time doubly exponential in the dimension $n$ of the matrix in both theory and practice [14].

If a matrix has rank $r$, then it has exactly $r$ nonzero singular values, so that the rank function in (1.3) is simply the number of nonvanishing singular values. In this paper, we consider an alternative which minimizes the sum of the singular values over the constraint set. This sum is called the nuclear norm,

$$\|X\|_* = \sum_{k=1}^{n} \sigma_k(X), \tag{1.4}$$

where, here and below, $\sigma_k(X)$ denotes the $k$th largest singular value of $X$. The heuristic optimization is then given by

$$\begin{array}{ll} \text{minimize} & \|X\|_* \\ \text{subject to} & X_{ij} = M_{ij}, \quad (i,j) \in \Omega. \end{array} \tag{1.5}$$

Whereas the rank function counts the number of nonvanishing singular values, the nuclear norm sums their amplitude; in some sense, it is to the rank functional what the convex $\ell_1$ norm is to the counting $\ell_0$ norm in the area of sparse signal recovery. The main point here is that the nuclear norm is a convex function and, as we will discuss in Section 1.4, can be optimized efficiently via semidefinite programming.

1.1.4 A first typical result

Our first result shows that, perhaps unexpectedly, this heuristic optimization recovers a generic $M$ when the number of randomly sampled entries is large enough. We will prove the following:

Theorem 1.1 Let $M$ be an $n_1 \times n_2$ matrix of rank $r$ sampled from the random orthogonal model, and put $n = \max(n_1, n_2)$. Suppose we observe $m$ entries of $M$ with locations sampled uniformly at random. Then there are numerical constants $C$ and $c$ such that if

$$m \ge C\, n^{5/4}\, r \log n, \tag{1.6}$$

the minimizer to the problem (1.5) is unique and equal to $M$ with probability at least $1 - cn^{-3}$; that is to say, the semidefinite program (1.5) recovers all the entries of $M$ with no error. In addition, if $r \le n^{1/5}$, then the recovery is exact with probability at least $1 - cn^{-3}$ provided that

$$m \ge C\, n^{6/5}\, r \log n. \tag{1.7}$$

The theorem states that a surprisingly small number of entries suffice to complete a generic low-rank matrix. For small values of the rank, e.g., when $r = O(1)$ or $r = O(\log n)$, one only needs to see on the order of $n^{6/5}$ entries (ignoring logarithmic factors), which is considerably smaller than $n^2$—the total number of entries of a square matrix. The real feat, however, is that the recovery algorithm is tractable and very concrete. Hence the contribution is twofold:

• Under the hypotheses of Theorem 1.1, there is a unique low-rank matrix which is consistent with the observed entries.

• Further, this matrix can be recovered by the convex optimization (1.5). In other words, for most problems, the nuclear norm relaxation is formally equivalent to the combinatorially hard rank minimization problem (1.3).

Theorem 1.1 is in fact a special instance of a far more general theorem that covers a much larger set of matrices $M$.
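As an aside (not part of the original exposition), the convex program (1.5) is easy to prototype with off-the-shelf software. The sketch below is ours and uses the cvxpy modeling library as one possible choice of tooling; the paper does not prescribe a solver, and the oversampling rate here is picked generously so the recovery typically succeeds at this small scale.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, r = 40, 2

# Synthetic rank-r matrix and a uniformly sampled set of observed entries.
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
m = int(0.4 * n * n)  # observe 40% of the entries, well above (2n - r) r
obs = rng.choice(n * n, size=m, replace=False)
rows, cols = np.unravel_index(obs, (n, n))

# Nuclear norm minimization (1.5): minimize ||X||_* subject to agreement on Omega.
X = cp.Variable((n, n))
constraints = [X[i, j] == M[i, j] for i, j in zip(rows, cols)]
cp.Problem(cp.Minimize(cp.normNuc(X)), constraints).solve()

print("relative error:", np.linalg.norm(X.value - M) / np.linalg.norm(M))
```

General-purpose SDP solvers scale poorly beyond small instances; large problems call for specialized first-order methods, but the small example above already exhibits exact recovery up to solver tolerance.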
We describe this general class of matrices and precise recovery conditions in the next section.

1.2 Main results

As seen in our first example (1.1), it is impossible to recover a matrix which is equal to zero in nearly all of its entries unless we see all the entries of the matrix. To recover a low-rank matrix, this matrix cannot be in the null space of the sampling operator giving the values of a subset of the entries. Now it is easy to see that if the singular vectors of a matrix $M$ are highly concentrated, then $M$ could very well be in the null space of the sampling operator. For instance, consider the rank-2 symmetric matrix $M$ given by

$$M = \sum_{k=1}^{2} \sigma_k u_k u_k^*, \qquad u_1 = (e_1 + e_2)/\sqrt{2}, \quad u_2 = (e_1 - e_2)/\sqrt{2},$$

where the singular values are arbitrary. This matrix vanishes everywhere except in the top-left $2 \times 2$ corner, and one would basically need to see all the entries of $M$ to be able to recover it exactly by any method whatsoever. There is an endless list of examples of this sort. Hence, we arrive at the notion that, somehow, the singular vectors need to be sufficiently spread—that is, uncorrelated with the standard basis—in order to minimize the number of observations needed to recover a low-rank matrix.² This motivates the following definition.

²Both the left and right singular vectors need to be uncorrelated with the standard basis. Indeed, the matrix $e_1 v^*$ has its first row equal to $v$ and all the others equal to zero. Clearly, this rank-1 matrix cannot be recovered unless we basically see all of its entries.

Definition 1.2 Let $U$ be a subspace of $\mathbb{R}^n$ of dimension $r$ and $P_U$ be the orthogonal projection onto $U$. Then the coherence of $U$ (vis-à-vis the standard basis $(e_i)$) is defined to be

$$\mu(U) \equiv \frac{n}{r} \max_{1 \le i \le n} \|P_U e_i\|^2. \tag{1.8}$$

Note that for any subspace, the smallest $\mu(U)$ can be is 1, achieved, for example, if $U$ is spanned by vectors whose entries all have magnitude $1/\sqrt{n}$. The largest possible value for $\mu(U)$ is $n/r$, which corresponds to any subspace that contains a standard basis element. We shall be primarily interested in subspaces with low coherence, as matrices whose column and row spaces have low coherence cannot really be in the null space of the sampling operator. For instance, we will see that the random subspaces discussed above have nearly minimal coherence.

To state our main result, we introduce two assumptions about an $n_1 \times n_2$ matrix $M$ whose SVD is given by $M = \sum_{1 \le k \le r} \sigma_k u_k v_k^*$ and whose column and row spaces are denoted by $U$ and $V$, respectively.

A0 The coherences obey $\max(\mu(U), \mu(V)) \le \mu_0$ for some positive $\mu_0$.

A1 The $n_1 \times n_2$ matrix $\sum_{1 \le k \le r} u_k v_k^*$ has a maximum entry bounded by $\mu_1 \sqrt{r/(n_1 n_2)}$ in absolute value, for some positive $\mu_1$.

The $\mu$'s above may depend on $r$ and $n_1, n_2$. Moreover, note that A1 always holds with $\mu_1 = \mu_0 \sqrt{r}$, since the $(i,j)$th entry of the matrix $\sum_{1 \le k \le r} u_k v_k^*$ is given by $\sum_{1 \le k \le r} u_{ik} v_{jk}$ and, by the Cauchy-Schwarz inequality,

$$\Big| \sum_{1 \le k \le r} u_{ik} v_{jk} \Big| \le \sqrt{\sum_{1 \le k \le r} |u_{ik}|^2}\ \sqrt{\sum_{1 \le k \le r} |v_{jk}|^2} \le \frac{\mu_0 r}{\sqrt{n_1 n_2}}.$$

Hence, for sufficiently small ranks, $\mu_1$ is comparable to $\mu_0$.
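For intuition, the coherence (1.8) is straightforward to evaluate numerically. The following is a small illustrative sketch of ours (not from the paper), computing $\mu(U)$ for a random subspace and for one spanned by standard basis vectors:

```python
import numpy as np

def coherence(U_basis: np.ndarray) -> float:
    """Coherence (1.8) of the span of the columns of U_basis (n x r, orthonormal)."""
    n, r = U_basis.shape
    # ||P_U e_i||^2 is the squared norm of the i-th row of an orthonormal basis.
    row_norms_sq = np.sum(U_basis**2, axis=1)
    return (n / r) * np.max(row_norms_sq)

rng = np.random.default_rng(0)
n, r = 1000, 10
Q, _ = np.linalg.qr(rng.standard_normal((n, r)))  # random orthogonal model
print(coherence(Q))                  # near-minimal with high probability
print(coherence(np.eye(n)[:, :r]))   # maximal value n/r = 100.0
```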
As we will see in Section 2, for larger ranks, both subspaces selected from the uniform distribution and spaces constructed as the span of singular vectors with bounded entries are not only incoherent with the standard basis, but also obey A1 with high probability for values of $\mu_1$ at most logarithmic in $n_1$ and/or $n_2$. Below, we will assume that $\mu_1$ is greater than or equal to 1.

We are now in a position to state our main result: if a matrix has row and column spaces that are incoherent with the standard basis, then nuclear norm minimization can recover this matrix from a random sampling of a small number of entries.

Theorem 1.3 Let $M$ be an $n_1 \times n_2$ matrix of rank $r$ obeying A0 and A1, and put $n = \max(n_1, n_2)$. Suppose we observe $m$ entries of $M$ with locations sampled uniformly at random. Then there exist constants $C$, $c$ such that if

$$m \ge C \max(\mu_1^2,\ \mu_0^{1/2}\mu_1,\ \mu_0 n^{1/4})\, nr\, (\beta \log n) \tag{1.9}$$

for some $\beta > 2$, then the minimizer to the problem (1.5) is unique and equal to $M$ with probability at least $1 - cn^{-\beta}$. For $r \le \mu_0^{-1} n^{1/5}$, this estimate can be improved to

$$m \ge C\, \mu_0\, n^{6/5} r\, (\beta \log n) \tag{1.10}$$

with the same probability of success.

Theorem 1.3 asserts that if the coherence is low, few samples are required to recover $M$. For example, if $\mu_0 = O(1)$ and the rank is not too large, then the recovery is exact with large probability provided that

$$m \ge C\, n^{6/5} r \log n. \tag{1.11}$$

We give two illustrative examples of matrices with incoherent column and row spaces. This list is by no means exhaustive.

1. The first example is the random orthogonal model. For values of the rank $r$ greater than $\log n$, $\mu(U)$ and $\mu(V)$ are $O(1)$ and $\mu_1 = O(\log n)$, both with very large probability. Hence, the recovery is exact provided that $m$ obeys (1.6) or (1.7). Specializing Theorem 1.3 to these values of the parameters gives Theorem 1.1. Hence, Theorem 1.1 is a special case of our general recovery result.

2. The second example is more general and, in a nutshell, simply requires that the components of the singular vectors of $M$ are small. Assume that the $u_j$'s and $v_j$'s obey

$$\max_{ij} |\langle e_i, u_j \rangle|^2 \le \mu_B/n, \qquad \max_{ij} |\langle e_i, v_j \rangle|^2 \le \mu_B/n \tag{1.12}$$

for some value of $\mu_B = O(1)$. Then the maximum coherence is at most $\mu_B$ since $\mu(U) \le \mu_B$ and $\mu(V) \le \mu_B$. Further, we will see in Section 2 that A1 holds most of the time with $\mu_1 = O(\sqrt{\log n})$. Thus, for matrices with singular vectors obeying (1.12), the recovery is exact provided that $m$ obeys (1.11) for values of the rank not exceeding $\mu_B^{-1} n^{1/5}$.

1.3 Extensions

Our main result (Theorem 1.3) extends to a variety of other low-rank matrix completion problems beyond the sampling of entries. Indeed, suppose we have two orthonormal bases $f_1, \ldots, f_n$ and $g_1, \ldots, g_n$ of $\mathbb{R}^n$, and that we are interested in solving the rank minimization problem

$$\begin{array}{ll} \text{minimize} & \operatorname{rank}(X) \\ \text{subject to} & f_i^* X g_j = f_i^* M g_j, \quad (i,j) \in \Omega. \end{array} \tag{1.13}$$

This comes up in a number of applications. As a motivating example, there has been a great deal of interest in the machine learning community in developing specialized algorithms for the multiclass and multitask learning problems (see, e.g., [1, 3, 5]). In multiclass learning, the goal is to build multiple classifiers with the same training data to distinguish between more than two categories.
For example, in face recognition, one might want to classify whether an image patch corresponds to an eye, nose, or mouth. In multitask learning, we have a large set of data, but have a variety of different classification tasks, and, for each task, only partial subsets of the data are relevant. For instance, in activity recognition, we may have acquired sets of observations of multiple subjects and want to determine if each observed person is walking or running. However, a different classifier is to be learned for each individual, and it is not clear how having access to the full collection of observations can improve classification performance. Multitask learning aims precisely to take advantage of the access to the full database to improve performance on the individual tasks.

In the abstract formulation of this problem for linear classifiers, we have $K$ classes to distinguish and are given training examples $f_1, \ldots, f_n$. For each example, we are given partial labeling information about which classes it belongs or does not belong to. That is, for each example $f_j$ and class $k$, we may either be told that $f_j$ belongs to class $k$, be told that $f_j$ does not belong to class $k$, or be provided no information about the membership of $f_j$ to class $k$. For each class $1 \le k \le K$, we would like to produce a linear function $w_k$ such that $w_k^* f_i > 0$ if $f_i$ belongs to class $k$ and $w_k^* f_i < 0$ otherwise. Formally, we can search for the vector $w_k$ that satisfies the equality constraints $w_k^* f_i = y_{ik}$, where $y_{ik} = 1$ if we are told that $f_i$ belongs to class $k$, $y_{ik} = -1$ if we are told that $f_i$ does not belong to class $k$, and $y_{ik}$ is unconstrained if we are not provided information. A common hypothesis in the multitask setting is that the $w_k$ corresponding to each of the classes together span a very low-dimensional subspace with dimension significantly smaller than $K$ [1, 3, 5]. That is, the basic assumption is that $W = [w_1, \ldots, w_K]$ is low-rank. Hence, the multiclass learning problem can be cast as (1.13) with observations of the form $f_i^* W e_j$.

To see that our theorem provides conditions under which (1.13) can be solved via nuclear norm minimization, note that there exist unitary transformations $F$ and $G$ such that $e_j = F f_j$ and $e_j = G g_j$ for each $j = 1, \ldots, n$. Hence, $f_i^* X g_j = e_i^* (F X G^*) e_j$. Then, if the conditions of Theorem 1.3 hold for the matrix $F M G^*$, it is immediate that nuclear norm minimization finds the unique optimal solution of (1.13) when we are provided a large enough random collection of the inner products $f_i^* M g_j$. In other words, all that is needed is that the column and row spaces of $M$ be respectively incoherent with the bases $(f_i)$ and $(g_i)$.

From this perspective, we additionally remark that our results likely extend to the case where one observes a small number of arbitrary linear functionals of a hidden matrix $M$. Set $N = n^2$ and let $A_1, \ldots, A_N$ be an orthonormal basis for the linear space of $n \times n$ matrices with the usual inner product $\langle X, Y \rangle = \operatorname{trace}(X^* Y)$. Then we expect our results should also apply to the rank minimization problem

$$\begin{array}{ll} \text{minimize} & \operatorname{rank}(X) \\ \text{subject to} & \langle A_k, X \rangle = \langle A_k, M \rangle, \quad k \in \Omega, \end{array} \tag{1.14}$$

where $\Omega \subset \{1, \ldots, N\}$ is selected uniformly at random. In fact, (1.14) is (1.3) when the orthobasis is the canonical basis $(e_i e_j^*)_{1 \le i,j \le n}$.
Here, those low-rank matrices which have small inner products with all the basis elements $A_k$ may be recoverable by nuclear norm minimization. To avoid unnecessary confusion and notational clutter, we leave this general low-rank recovery problem for future work.

1.4 Connections, alternatives and prior art

Nuclear norm minimization is a recent heuristic introduced by Fazel in [18], and is an extension of the trace heuristic often used by the control community; see, e.g., [6, 26]. Indeed, when the matrix variable is symmetric and positive semidefinite, the nuclear norm of $X$ is the sum of the (nonnegative) eigenvalues and is thus equal to the trace of $X$. Hence, for positive semidefinite unknowns, (1.5) would simply minimize the trace over the constraint set:

$$\begin{array}{ll} \text{minimize} & \operatorname{trace}(X) \\ \text{subject to} & X_{ij} = M_{ij}, \quad (i,j) \in \Omega, \\ & X \succeq 0. \end{array}$$

This is a semidefinite program. Even for a general matrix $M$ which may not be positive semidefinite or even symmetric, the nuclear norm heuristic can be formulated in terms of semidefinite programming: for instance, the program (1.5) is equivalent to

$$\begin{array}{ll} \text{minimize} & \operatorname{trace}(W_1) + \operatorname{trace}(W_2) \\ \text{subject to} & X_{ij} = M_{ij}, \quad (i,j) \in \Omega, \\ & \begin{bmatrix} W_1 & X \\ X^* & W_2 \end{bmatrix} \succeq 0, \end{array}$$

with optimization variables $X$, $W_1$ and $W_2$ (see, e.g., [18, 35]). There are many efficient algorithms and high-quality software available for solving these types of problems.

Our work is inspired by results in the emerging field of compressive sampling or compressed sensing, a new paradigm for acquiring information about objects of interest from what appears to be a highly incomplete set of measurements [11, 13, 17]. In practice, this means, for example, that high-resolution imaging is possible with fewer sensors, or that one can speed up signal acquisition time in biomedical applications by orders of magnitude, simply by taking far fewer specially coded samples. Mathematically speaking, we wish to reconstruct a signal $x \in \mathbb{R}^n$ from a small number of measurements $y = \Phi x$, $y \in \mathbb{R}^m$, where $m$ is much smaller than $n$; i.e., we have far fewer equations than unknowns. In general, one cannot hope to reconstruct $x$, but assume now that the object we wish to recover is known to be structured in the sense that it is sparse (or approximately sparse). This means that the unknown object depends upon a smaller number of unknown parameters. Then it has been shown that $\ell_1$ minimization allows recovery of sparse signals from remarkably few measurements: supposing $\Phi$ is chosen randomly from a suitable distribution, then with very high probability, all sparse signals with about $k$ nonzero entries can be recovered from on the order of $k \log n$ measurements. For instance, if $x$ is $k$-sparse in the Fourier domain, i.e., $x$ is a superposition of $k$ sinusoids, then it can be perfectly recovered with high probability—by $\ell_1$ minimization—from the knowledge of about $k \log n$ of its entries sampled uniformly at random [11].

From this viewpoint, the results in this paper greatly extend the theory of compressed sensing by showing that other types of interesting objects or structures, beyond sparse signals and images, can be recovered from a limited set of measurements. Moreover, the techniques for proving our main results build upon ideas from the compressed sensing literature together with probabilistic tools such as the powerful techniques of Bourgain and of Rudelson for bounding norms of operators between Banach spaces.
Our notion of incoherence generalizes the concept of the same name in compressive sampling. Notably, in [10], the authors introduce the notion of the incoherence of a unitary transformation. Letting $U$ be an $n \times n$ unitary matrix, the coherence of $U$ is given by

$$\mu(U) = n \max_{j,k} |U_{jk}|^2.$$

This quantity ranges in value from 1, for a unitary transformation whose entries all have the same magnitude, to $n$, for the identity matrix. Using this notion, [10] showed that with high probability, a $k$-sparse signal could be recovered via linear programming from the observation of the inner products of the signal with $m = \Omega(\mu(U)\, k \log n)$ randomly selected columns of the matrix $U$. This result provided a generalization of the celebrated results about partial Fourier observations described in [11], a special case where $\mu(U) = 1$. This paper generalizes the notion of incoherence to problems beyond the setting of sparse signal recovery.

In [27], the authors studied the nuclear norm heuristic applied to a related problem where partial information about a matrix $M$ is available from $m$ equations of the form

$$\langle A^{(k)}, M \rangle = \sum_{ij} A^{(k)}_{ij} M_{ij} = b_k, \qquad k = 1, \ldots, m, \tag{1.15}$$

where for each $k$, $\{A^{(k)}_{ij}\}_{ij}$ is an i.i.d. sequence of Gaussian or Bernoulli random variables, and the sequences $\{A^{(k)}\}$ are also independent of each other (the sequences $\{A^{(k)}\}$ and $\{b_k\}$ are available to the analyst). Building on the concept of restricted isometry introduced in [12] in the context of sparse signal recovery, [27] establishes the first sufficient conditions for which the nuclear norm heuristic returns the minimum-rank element in the constraint set. They prove that the heuristic succeeds with large probability whenever the number $m$ of available measurements is greater than a constant times $2nr \log n$ for $n \times n$ matrices. Although this is an interesting result, a serious impediment to this approach is that one needs to essentially measure random projections of the unknown data matrix—a situation which unfortunately does not commonly arise in practice. Further, the measurements in (1.15) give some information about all the entries of $M$, whereas in our problem, information about most of the entries is simply not available. In particular, the results and techniques introduced in [27] do not begin to address the matrix completion problem of interest to us in this paper. As a consequence, our methods are completely different; for example, they do not rely on any notions of restricted isometry. Instead, as we discuss below, we prove the existence of a Lagrange multiplier for the optimization (1.5) that certifies that the unique optimal solution is precisely the matrix we wish to recover.

Finally, we would like to briefly discuss the possibility of other recovery algorithms when the sampling happens to be chosen in a very special fashion. For example, suppose that $M$ is generic and that we precisely observe every entry in the first $r$ rows and columns of the matrix. Write $M$ in block form as

$$M = \begin{bmatrix} M_{11} & M_{12} \\ M_{21} & M_{22} \end{bmatrix},$$

with $M_{11}$ an $r \times r$ matrix. In the special case that $M_{11}$ is invertible and $M$ has rank $r$, it is easy to verify that $M_{22} = M_{21} M_{11}^{-1} M_{12}$. One can prove this identity by forming the SVD of $M$, for example. That is, if $M$ is generic, the upper $r \times r$ block is invertible, and we observe every entry in the first $r$ rows and columns, we can recover $M$.
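As a quick numerical sanity check of this identity (a sketch of ours, not part of the paper), one can generate a random rank-$r$ matrix and complete the unobserved block:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 8, 3

# Random rank-r matrix; for generic factors the leading r x r block is invertible.
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
M11, M12 = M[:r, :r], M[:r, r:]
M21, M22 = M[r:, :r], M[r:, r:]

# Complete the missing block from the first r rows and columns.
M22_hat = M21 @ np.linalg.solve(M11, M12)  # M21 M11^{-1} M12
print(np.allclose(M22_hat, M22))           # True, up to floating-point error
```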
This result immediately generalizes to the case where one observes precisely $r$ rows and $r$ columns and the $r \times r$ matrix at the intersection of the observed rows and columns is invertible. However, this scheme has many practical drawbacks that stand in the way of a generalization to a completion algorithm working from a general set of entries. First, if we miss any entry in these rows or columns, we cannot recover $M$, nor can we leverage any information provided by entries of $M_{22}$. Second, if the matrix has rank less than $r$ and we observe $r$ rows and columns, a combinatorial search to find the collection that has an invertible square sub-block is required. Moreover, because of the matrix inversion, the algorithm is rather fragile to noise in the entries.

1.5 Notation and organization of the paper

The paper is organized as follows. We first argue in Section 2 that the random orthogonal model and, more generally, matrices with incoherent column and row spaces obey the assumptions of the general Theorem 1.3. To prove Theorem 1.3, we first establish in Section 3 sufficient conditions which guarantee that the true low-rank matrix $M$ is the unique solution to (1.5). One of these conditions is the existence of a dual vector obeying two crucial properties. Section 4 constructs such a dual vector and provides the overall architecture of the proof, which shows that, indeed, this vector obeys the desired properties provided that the number of measurements is sufficiently large. Surprisingly, as explored in Section 5, the existence of a dual vector certifying that $M$ is unique is related to some problems in random graph theory, including "the coupon collector's problem." Following this discussion, we prove our main result via several intermediate results, which are all proven in Section 6. Section 7 introduces numerical experiments showing that matrix completion based on nuclear norm minimization works well in practice. Section 8 closes the paper with a short summary of our findings and a discussion of important extensions and improvements. In particular, we will discuss possible ways of improving the 1.2 exponent in (1.10) so that it gets closer to 1. Finally, the Appendix provides proofs of auxiliary lemmas supporting our main argument.

Before continuing, we provide here a brief summary of the notation used throughout the paper. Matrices are bold capital, vectors are bold lowercase, and scalars or entries are not bold. For instance, $X$ is a matrix and $X_{ij}$ its $(i,j)$th entry. Likewise, $x$ is a vector and $x_i$ its $i$th component. When we have a collection of vectors $u_k \in \mathbb{R}^n$ for $1 \le k \le d$, we will denote by $u_{ik}$ the $i$th component of the vector $u_k$, and $[u_1, \ldots, u_d]$ will denote the $n \times d$ matrix whose $k$th column is $u_k$.

A variety of norms on matrices will be discussed. The spectral norm of a matrix is denoted by $\|X\|$. The Euclidean inner product between two matrices is $\langle X, Y \rangle = \operatorname{trace}(X^* Y)$, and the corresponding Euclidean norm, called the Frobenius or Hilbert-Schmidt norm, is denoted $\|X\|_F$; that is, $\|X\|_F = \langle X, X \rangle^{1/2}$. The nuclear norm of a matrix $X$ is denoted $\|X\|_*$. The maximum entry of $X$ (in absolute value) is denoted by $\|X\|_\infty \equiv \max_{ij} |X_{ij}|$. For vectors, we will only consider the usual Euclidean $\ell_2$ norm, which we simply write as $\|x\|$.
Further, we will also manipulate linear transformations which act on matrices, and we will use calligraphic letters for these operators, as in $\mathcal{A}(X)$. In particular, the identity operator will be denoted by $\mathcal{I}$. The only norm we will consider for these operators is their spectral norm (the top singular value), denoted by

$$\|\mathcal{A}\| = \sup_{X : \|X\|_F \le 1} \|\mathcal{A}(X)\|_F.$$

Finally, we adopt the convention that $C$ denotes a numerical constant independent of the matrix dimensions, rank, and number of measurements, whose value may change from line to line. Certain special constants with precise numerical values will be ornamented with subscripts (e.g., $C_R$). Any exceptions to this notational scheme will be noted in the text.

2 Which matrices are incoherent?

In this section we restrict our attention to square $n \times n$ matrices, but the extension to rectangular $n_1 \times n_2$ matrices immediately follows by setting $n = \max(n_1, n_2)$.

2.1 Incoherent bases span incoherent subspaces

Almost all $n \times n$ matrices $M$ with singular vectors $\{u_k\}_{1 \le k \le r}$ and $\{v_k\}_{1 \le k \le r}$ obeying the size property (1.12) also satisfy the assumptions A0 and A1 with $\mu_0 = \mu_B$ and $\mu_1 = C \mu_B \sqrt{\log n}$ for some positive constant $C$. As mentioned above, A0 holds automatically, but observe that A1 would not hold with a small value of $\mu_1$ if two rows of the matrices $[u_1, \ldots, u_r]$ and $[v_1, \ldots, v_r]$ were identical with all entries of magnitude $\sqrt{\mu_B/n}$, since it is not hard to see that in this case

$$\Big\| \sum_k u_k v_k^* \Big\|_\infty = \mu_B\, r/n.$$

Certainly, this example is constructed in a very special way and should occur infrequently. We now show that it is generically unlikely. Consider the matrix

$$\sum_{k=1}^{r} \epsilon_k u_k v_k^*, \tag{2.1}$$

where $\{\epsilon_k\}_{1 \le k \le r}$ is an arbitrary sign sequence. For almost all choices of sign sequences, A1 is satisfied with $\mu_1 = O(\mu_B \sqrt{\log n})$. Indeed, if one selects the signs uniformly at random, then for each $\beta > 0$,

$$\mathbb{P}\Big( \Big\| \sum_{k=1}^{r} \epsilon_k u_k v_k^* \Big\|_\infty \ge \mu_B \sqrt{8 \beta r \log n}\,/n \Big) \le (2n^2)\, n^{-\beta}. \tag{2.2}$$

This is of interest because suppose the low-rank matrix we wish to recover is of the form

$$M = \sum_{k=1}^{r} \lambda_k u_k v_k^* \tag{2.3}$$

with scalars $\lambda_k$. Since the vectors $\{u_k\}$ and $\{v_k\}$ are orthogonal, the singular values of $M$ are given by $|\lambda_k|$, and the singular vectors are given by $\operatorname{sgn}(\lambda_k)\, u_k$ and $v_k$ for $k = 1, \ldots, r$. Hence, in this model, A1 concerns the maximum entry of the matrix given by (2.1) with $\epsilon_k = \operatorname{sgn}(\lambda_k)$. That is to say, for most sign patterns, the matrix of interest obeys an appropriate size condition. We emphasize here that the only thing we assumed about the $u_k$'s and $v_k$'s was that they have small entries. In particular, they could be equal to each other, as would be the case for a symmetric matrix.

The claim (2.2) is a simple application of Hoeffding's inequality. The $(i,j)$th entry of (2.1) is given by

$$Z_{ij} = \sum_{1 \le k \le r} \epsilon_k u_{ik} v_{jk},$$

and is a sum of $r$ zero-mean independent random variables, each bounded in magnitude by $\mu_B/n$. Therefore,

$$\mathbb{P}(|Z_{ij}| \ge \lambda \mu_B \sqrt{r}/n) \le 2 e^{-\lambda^2/8}.$$

Setting $\lambda$ proportional to $\sqrt{\log n}$ and applying the union bound gives the claim.

To summarize, we say that $M$ is sampled from the incoherent basis model if it is of the form

$$M = \sum_{k=1}^{r} \epsilon_k \sigma_k u_k v_k^*, \tag{2.4}$$

where $\{\epsilon_k\}_{1 \le k \le r}$ is a random sign sequence, and $\{u_k\}_{1 \le k \le r}$ and $\{v_k\}_{1 \le k \le r}$ have maximum entries of size at most $\sqrt{\mu_B/n}$.
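The scaling behind (2.2), and behind Lemma 2.1 below, is easy to observe empirically. The following small simulation is ours, purely for illustration; it uses SciPy's Hadamard construction to get orthonormal vectors whose entries all have magnitude $1/\sqrt{n}$, so that $\mu_B = 1$:

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(0)
n, r, trials = 256, 64, 200  # n must be a power of 2 for hadamard()

# Orthonormal columns with entries +/- 1/sqrt(n), i.e. mu_B = 1.
H = hadamard(n) / np.sqrt(n)
U, V = H[:, :r], H[:, r:2 * r]

max_entries = []
for _ in range(trials):
    eps = rng.choice([-1.0, 1.0], size=r)
    E = (U * eps) @ V.T  # sum_k eps_k u_k v_k^*
    max_entries.append(np.abs(E).max())

# Observed max entry is a small multiple of sqrt(r log n)/n, well below the
# adversarial worst case mu_B * r / n attained by aligned sign patterns.
print(max(max_entries), np.sqrt(r * np.log(n)) / n, r / n)
```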
Lemma 2.1 There exist numerical constants $c$ and $C$ such that for any $\beta > 0$, matrices from the incoherent basis model obey the assumption A1 with

$$\mu_1 \le C \mu_B \sqrt{(\beta + 2) \log n}$$

with probability at least $1 - cn^{-\beta}$.

2.2 Random subspaces span incoherent subspaces

In this section, we prove that the random orthogonal model obeys the two assumptions A0 and A1 (with appropriate values for the $\mu$'s) with large probability.

Lemma 2.2 Set $\bar{r} = \max(r, \log n)$. Then there exist constants $C$ and $c$ such that the random orthogonal model obeys:³

1. $\max_i \|P_U e_i\|^2 \le C \bar{r}/n$,

2. $\big\| \sum_{1 \le k \le r} u_k v_k^* \big\|_\infty \le C (\log n) \sqrt{\bar{r}}/n$,

with probability $1 - cn^{-3} \log n$.

³When $r \ge C_0 (\log n)^3$ for some positive constant $C_0$, a better estimate is possible, namely, $\|\sum_{1 \le k \le r} u_k v_k^*\|_\infty \le C \sqrt{r \log n}/n$.

We note that an argument similar to the following proof would give that if $C$ is of the form $K\beta$, where $K$ is a fixed numerical constant, we can achieve a probability of at least $1 - cn^{-\beta}$ provided that $n$ is sufficiently large. To establish these facts, we make use of the standard result below [21].

Lemma 2.3 Let $Y_d$ be distributed as a chi-squared random variable with $d$ degrees of freedom. Then for each $t > 0$,

$$\mathbb{P}(Y_d - d \ge t\sqrt{2d} + t^2) \le e^{-t^2/2} \quad \text{and} \quad \mathbb{P}(Y_d - d \le -t\sqrt{2d}) \le e^{-t^2/2}. \tag{2.5}$$

We will use (2.5) as follows: for each $\epsilon \in (0,1)$, we have

$$\mathbb{P}(Y_d \ge d(1-\epsilon)^{-1}) \le e^{-\epsilon^2 d/4} \quad \text{and} \quad \mathbb{P}(Y_d \le d(1-\epsilon)) \le e^{-\epsilon^2 d/4}. \tag{2.6}$$

We begin with the second assertion of Lemma 2.2, since it will imply the first as well. Observe that it follows from

$$\|P_U e_i\|^2 = \sum_{1 \le k \le r} u_{ik}^2 \tag{2.7}$$

that $Z_r \equiv \|P_U e_i\|^2$ ($i$ is fixed) is the squared Euclidean length of the first $r$ components of a unit vector uniformly distributed on the unit sphere in $n$ dimensions. Now suppose that $x_1, x_2, \ldots, x_n$ are i.i.d. $N(0,1)$. Then the distribution of a unit vector uniformly distributed on the sphere is that of $x/\|x\|$ and, therefore, the law of $Z_r$ is that of $Y_r/Y_n$, where $Y_r = \sum_{k \le r} x_k^2$.

Fix $\epsilon > 0$ and consider the event $A_{n,\epsilon} = \{Y_n/n \ge 1 - \epsilon\}$. For each $\lambda > 0$, it follows from (2.6) that

$$\begin{aligned} \mathbb{P}(Z_r - r/n \ge \lambda \sqrt{2r}/n) &= \mathbb{P}\big(Y_r \ge [r + \lambda\sqrt{2r}]\, Y_n/n\big) \\ &\le \mathbb{P}\big(Y_r \ge [r + \lambda\sqrt{2r}]\, Y_n/n \ \text{and} \ A_{n,\epsilon}\big) + \mathbb{P}(A_{n,\epsilon}^c) \\ &\le \mathbb{P}\big(Y_r \ge [r + \lambda\sqrt{2r}]\,[1 - \epsilon]\big) + e^{-\epsilon^2 n/4} \\ &= \mathbb{P}\big(Y_r - r \ge \lambda\sqrt{2r}\,[1 - \epsilon - \epsilon\sqrt{r/2\lambda^2}]\big) + e^{-\epsilon^2 n/4}. \end{aligned}$$

Now pick $\epsilon = 4(n^{-1}\log n)^{1/2}$, $\lambda = 8\sqrt{2 \log n}$, and assume that $n$ is sufficiently large so that

$$\epsilon\big(1 + \sqrt{r/2\lambda^2}\big) \le 1/2.$$

Then

$$\mathbb{P}(Z_r - r/n \ge \lambda\sqrt{2r}/n) \le \mathbb{P}\big(Y_r - r \ge (\lambda/2)\sqrt{2r}\big) + n^{-4}.$$

Assume now that $r \ge 4 \log n$ (which means that $\lambda \le 4\sqrt{2r}$). Then it follows from (2.5) that

$$\mathbb{P}\big(Y_r - r \ge (\lambda/2)\sqrt{2r}\big) \le \mathbb{P}\big(Y_r - r \ge (\lambda/4)\sqrt{2r} + (\lambda/4)^2\big) \le e^{-\lambda^2/32} = n^{-4}.$$

Hence

$$\mathbb{P}\big(Z_r - r/n \ge 16\sqrt{r \log n}/n\big) \le 2n^{-4}$$

and, therefore,

$$\mathbb{P}\big(\max_i \|P_U e_i\|^2 - r/n \ge 16\sqrt{r\log n}/n\big) \le 2n^{-3} \tag{2.8}$$

by the union bound. Note that (2.8) establishes the first claim of the lemma (even for $r < 4\log n$, since in this case $Z_r \le Z_{\lceil 4\log n \rceil}$).

It remains to establish the second claim. Notice that by symmetry, $E = \sum_{1\le k\le r} u_k v_k^*$ has the same distribution as

$$F = \sum_{k=1}^{r} \epsilon_k u_k v_k^*,$$

where $\{\epsilon_k\}$ is an independent Rademacher sequence. It then follows from Hoeffding's inequality that, conditional on $\{u_k\}$ and $\{v_k\}$, we have

$$\mathbb{P}(|F_{ij}| > t) \le 2 e^{-t^2/2\sigma_{ij}^2}, \qquad \sigma_{ij}^2 = \sum_{1 \le k \le r} u_{ik}^2 v_{jk}^2.$$
Our previous results indicate that $\max_{ij} |v_{ij}|^2 \le (10 \log n)/n$ with large probability, and thus

$$\sigma_{ij}^2 \le \frac{10 \log n}{n}\, \|P_U e_i\|^2.$$

Set $\bar{r} = \max(r, \log n)$. Since $\|P_U e_i\|^2 \le C\bar{r}/n$ with large probability, we have $\sigma_{ij}^2 \le C(\log n)\bar{r}/n^2$ with large probability. Hence, the marginal distribution of $F_{ij}$ obeys

$$\mathbb{P}\big(|F_{ij}| > \lambda \sqrt{\bar r}/n\big) \le 2 e^{-\gamma \lambda^2/\log n} + \mathbb{P}\big(\sigma_{ij}^2 \ge C (\log n)\bar r / n^2\big)$$

for some numerical constant $\gamma$. Picking $\lambda = \gamma' \log n$, where $\gamma'$ is a sufficiently large numerical constant, gives $\|F\|_\infty \le C (\log n)\sqrt{\bar r}/n$ with large probability. Since $E$ and $F$ have the same distribution, the second claim follows.

The claim about the size of $\max_{ij} |v_{ij}|^2$ is straightforward, since our techniques show that for each $\lambda > 0$,

$$\mathbb{P}\big(Z_1 \ge \lambda (\log n)/n\big) \le \mathbb{P}\big(Y_1 \ge \lambda(1 - \epsilon)\log n\big) + e^{-\epsilon^2 n/4}.$$

Moreover,

$$\mathbb{P}\big(Y_1 \ge \lambda(1-\epsilon)\log n\big) = \mathbb{P}\big(|x_1| \ge \sqrt{\lambda(1-\epsilon)\log n}\big) \le 2 e^{-\frac{1}{2}\lambda(1-\epsilon)\log n}.$$

If $n$ is sufficiently large so that $\epsilon \le 1/5$, this gives $\mathbb{P}(Z_1 \ge 10(\log n)/n) \le 3n^{-4}$ and, therefore,

$$\mathbb{P}\big(\max_{ij} |v_{ij}|^2 \ge 10(\log n)/n\big) \le 12\, n^{-3}\log n,$$

since the maximum is taken over at most $4n \log n$ pairs.

3 Duality

Let $R_\Omega : \mathbb{R}^{n_1 \times n_2} \to \mathbb{R}^{|\Omega|}$ be the sampling operator which extracts the observed entries, $R_\Omega(X) = (X_{ij})_{ij \in \Omega}$, so that the constraint in (1.5) becomes $R_\Omega(X) = R_\Omega(M)$. Standard convex optimization theory asserts that $X$ is a solution to (1.5) if there exists a dual vector (or Lagrange multiplier) $\lambda \in \mathbb{R}^{|\Omega|}$ such that $R_\Omega^* \lambda$ is a subgradient of the nuclear norm at the point $X$, which we denote by

$$R_\Omega^* \lambda \in \partial \|X\|_* \tag{3.1}$$

(see, e.g., [7]). Recall the definition of a subgradient of a convex function $f : \mathbb{R}^{n_1 \times n_2} \to \mathbb{R}$. We say that $Y$ is a subgradient of $f$ at $X_0$, denoted $Y \in \partial f(X_0)$, if

$$f(X) \ge f(X_0) + \langle Y, X - X_0 \rangle \tag{3.2}$$

for all $X$. Suppose $X_0 \in \mathbb{R}^{n_1 \times n_2}$ has rank $r$ with a singular value decomposition given by

$$X_0 = \sum_{1 \le k \le r} \sigma_k u_k v_k^*. \tag{3.3}$$

With these notations, $Y$ is a subgradient of the nuclear norm at $X_0$ if and only if it is of the form

$$Y = \sum_{1 \le k \le r} u_k v_k^* + W, \tag{3.4}$$

where $W$ obeys the following two properties:

(i) the column space of $W$ is orthogonal to $U \equiv \operatorname{span}(u_1, \ldots, u_r)$, and the row space of $W$ is orthogonal to $V \equiv \operatorname{span}(v_1, \ldots, v_r)$;

(ii) the spectral norm of $W$ is less than or equal to 1

(see, e.g., [23, 36]). To express these properties concisely, it is convenient to introduce the orthogonal decomposition $\mathbb{R}^{n_1 \times n_2} = T \oplus T^\perp$, where $T$ is the linear space spanned by elements of the form $u_k x^*$ and $y v_k^*$, $1 \le k \le r$, where $x$ and $y$ are arbitrary, and $T^\perp$ is its orthogonal complement. Note that $\dim(T) = r(n_1 + n_2 - r)$, precisely the number of degrees of freedom in the set of $n_1 \times n_2$ matrices of rank $r$. $T^\perp$ is the subspace of matrices spanned by the family $(xy^*)$, where $x$ (respectively $y$) is any vector orthogonal to $U$ (respectively $V$).

The orthogonal projection $\mathcal{P}_T$ onto $T$ is given by

$$\mathcal{P}_T(X) = P_U X + X P_V - P_U X P_V, \tag{3.5}$$

where $P_U$ and $P_V$ are the orthogonal projections onto $U$ and $V$. Note here that while $P_U$ and $P_V$ are matrices, $\mathcal{P}_T$ is a linear operator mapping matrices to matrices. We also have

$$\mathcal{P}_{T^\perp}(X) = (\mathcal{I} - \mathcal{P}_T)(X) = (I_{n_1} - P_U)\, X\, (I_{n_2} - P_V),$$

where $I_d$ denotes the $d \times d$ identity matrix. With these notations, $Y \in \partial\|X_0\|_*$ if

(i') $\mathcal{P}_T(Y) = \sum_{1 \le k \le r} u_k v_k^*$, and

(ii') $\|\mathcal{P}_{T^\perp}(Y)\| \le 1$.
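Although the paper proceeds analytically, the projections (3.5) take only a few lines to code. The sketch below is ours, with a hypothetical helper name `projections`; it is convenient for experimenting with these operators:

```python
import numpy as np

def projections(U: np.ndarray, V: np.ndarray):
    """Return P_T and P_T_perp from (3.5), for orthonormal U (n1 x r), V (n2 x r)."""
    PU = U @ U.T  # orthogonal projection onto the column space
    PV = V @ V.T  # orthogonal projection onto the row space

    def P_T(X):
        return PU @ X + X @ PV - PU @ X @ PV

    def P_T_perp(X):
        return X - P_T(X)  # equivalently (I - PU) X (I - PV)

    return P_T, P_T_perp

# Quick checks on random data:
rng = np.random.default_rng(0)
n1, n2, r = 20, 15, 3
U, _ = np.linalg.qr(rng.standard_normal((n1, r)))
V, _ = np.linalg.qr(rng.standard_normal((n2, r)))
P_T, P_T_perp = projections(U, V)
X = rng.standard_normal((n1, n2))
print(np.allclose(P_T(P_T(X)), P_T(X)))   # P_T is idempotent
print(np.allclose(U.T @ P_T_perp(X), 0))  # P_T_perp(X) is orthogonal to U
```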
Now that we have characterized the subgradient of the nuclear norm, the lemma below gives sufficient conditions for the uniqueness of the minimizer to (1.5).

Lemma 3.1 Consider a matrix $X_0 = \sum_{k=1}^{r} \sigma_k u_k v_k^*$ of rank $r$ which is feasible for the problem (1.5), and suppose that the following two conditions hold:

1. there exists a dual point $\lambda$ such that $Y = R_\Omega^* \lambda$ obeys

$$\mathcal{P}_T(Y) = \sum_{k=1}^{r} u_k v_k^*, \qquad \|\mathcal{P}_{T^\perp}(Y)\| < 1; \tag{3.6}$$

2. the sampling operator $R_\Omega$ restricted to elements in $T$ is injective.

Then $X_0$ is the unique minimizer.

Before proving this result, we would like to emphasize that this lemma provides a clear strategy for proving our main result, namely, Theorem 1.3. Letting $M = \sum_{k=1}^{r} \sigma_k u_k v_k^*$, $M$ is the unique solution to (1.5) if the injectivity condition holds and if one can find a dual point $\lambda$ such that $Y = R_\Omega^* \lambda$ obeys (3.6).

The proof of Lemma 3.1 uses a standard fact which states that the nuclear norm and the spectral norm are dual to one another.

Lemma 3.2 For each pair $W$ and $H$, we have $\langle W, H \rangle \le \|W\| \|H\|_*$. In addition, for each $H$, there is a $W$ obeying $\|W\| = 1$ which achieves the equality.

A variety of proofs are available for this lemma, and an elementary argument is sketched in [27]. We now turn to the proof of Lemma 3.1.

Proof [of Lemma 3.1] Consider any perturbation $X_0 + H$ where $R_\Omega(H) = 0$. Then, for any $W_0$ obeying (i)-(ii), $\sum_{k=1}^{r} u_k v_k^* + W_0$ is a subgradient of the nuclear norm at $X_0$ and, therefore,

$$\|X_0 + H\|_* \ge \|X_0\|_* + \Big\langle \sum_{k=1}^{r} u_k v_k^* + W_0,\ H \Big\rangle.$$

Letting $W = \mathcal{P}_{T^\perp}(Y)$, we may write $\sum_{k=1}^{r} u_k v_k^* = R_\Omega^* \lambda - W$. Since $\|W\| < 1$ and $R_\Omega(H) = 0$, it then follows that

$$\|X_0 + H\|_* \ge \|X_0\|_* + \langle W_0 - W, H \rangle.$$

Now, by construction,

$$\langle W_0 - W, H \rangle = \langle \mathcal{P}_{T^\perp}(W_0 - W), H \rangle = \langle W_0 - W, \mathcal{P}_{T^\perp}(H) \rangle.$$

We use Lemma 3.2 and set $W_0 = \mathcal{P}_{T^\perp}(Z)$, where $Z$ is any matrix obeying $\|Z\| \le 1$ and $\langle Z, \mathcal{P}_{T^\perp}(H) \rangle = \|\mathcal{P}_{T^\perp}(H)\|_*$. Then $W_0 \in T^\perp$, $\|W_0\| \le 1$, and

$$\langle W_0 - W, H \rangle \ge (1 - \|W\|)\, \|\mathcal{P}_{T^\perp}(H)\|_*,$$

which by assumption is strictly positive unless $\mathcal{P}_{T^\perp}(H) = 0$. In other words, $\|X_0 + H\|_* > \|X_0\|_*$ unless $\mathcal{P}_{T^\perp}(H) = 0$. Assume then that $\mathcal{P}_{T^\perp}(H) = 0$, or equivalently, that $H \in T$. Then $R_\Omega(H) = 0$ implies that $H = 0$ by the injectivity assumption. In conclusion, $\|X_0 + H\|_* > \|X_0\|_*$ unless $H = 0$.

4 Architecture of the proof

Our strategy to prove that $M = \sum_{1 \le k \le r} \sigma_k u_k v_k^*$ is the unique minimizer to (1.5) is to construct a matrix $Y$ which vanishes on $\Omega^c$ and obeys the conditions of Lemma 3.1 (and to show the injectivity of the sampling operator restricted to matrices in $T$ along the way). Set $\mathcal{P}_\Omega$ to be the orthogonal projector onto the indices in $\Omega$, so that the $(i,j)$th component of $\mathcal{P}_\Omega(X)$ is equal to $X_{ij}$ if $(i,j) \in \Omega$ and zero otherwise. Our candidate $Y$ will be the solution to

$$\begin{array}{ll} \text{minimize} & \|X\|_F \\ \text{subject to} & (\mathcal{P}_T \mathcal{P}_\Omega)(X) = \sum_{k=1}^{r} u_k v_k^*. \end{array} \tag{4.1}$$

The matrix $Y$ vanishes on $\Omega^c$, as otherwise it would not be an optimal solution, since $\mathcal{P}_\Omega(Y)$ would obey the constraint and have a smaller Frobenius norm. Hence, $Y = \mathcal{P}_\Omega(Y)$ and $\mathcal{P}_T(Y) = \sum_{k=1}^{r} u_k v_k^*$. Since the Pythagoras formula gives

$$\|Y\|_F^2 = \|\mathcal{P}_T(Y)\|_F^2 + \|\mathcal{P}_{T^\perp}(Y)\|_F^2 = \Big\| \sum_{k=1}^{r} u_k v_k^* \Big\|_F^2 + \|\mathcal{P}_{T^\perp}(Y)\|_F^2 = r + \|\mathcal{P}_{T^\perp}(Y)\|_F^2,$$

minimizing the Frobenius norm of $X$ amounts to minimizing the Frobenius norm of $\mathcal{P}_{T^\perp}(X)$ under the constraint $\mathcal{P}_T(X) = \sum_{k=1}^{r} u_k v_k^*$. Our motivation is twofold.
First, the solution to the least-squares problem (4.1) has a closed form that is amenable to analysis. Second, by forcing $\mathcal{P}_{T^\perp}(Y)$ to be small in the Frobenius norm, we hope that it will be small in the spectral norm as well, and establishing that $\|\mathcal{P}_{T^\perp}(Y)\| < 1$ would prove that $M$ is the unique solution to (1.5).

To compute the solution to (4.1), we introduce the operator $\mathcal{A}_{\Omega T}$ defined by

$$\mathcal{A}_{\Omega T}(M) = \mathcal{P}_\Omega \mathcal{P}_T(M).$$

Then, if $\mathcal{A}_{\Omega T}^* \mathcal{A}_{\Omega T} = \mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T$ has full rank when restricted to $T$, the minimizer to (4.1) is given by

$$Y = \mathcal{A}_{\Omega T} (\mathcal{A}_{\Omega T}^* \mathcal{A}_{\Omega T})^{-1}(E), \qquad E \equiv \sum_{k=1}^{r} u_k v_k^*. \tag{4.2}$$

We clarify the meaning of (4.2) to avoid any confusion: $(\mathcal{A}_{\Omega T}^* \mathcal{A}_{\Omega T})^{-1}(E)$ is meant to be the element $F$ in $T$ obeying $(\mathcal{A}_{\Omega T}^* \mathcal{A}_{\Omega T})(F) = E$. To summarize the aims of our proof strategy:

• We must first show that $\mathcal{A}_{\Omega T}^* \mathcal{A}_{\Omega T} = \mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T$ is a one-to-one linear mapping from $T$ onto itself. In this case, $\mathcal{A}_{\Omega T} = \mathcal{P}_\Omega \mathcal{P}_T$—as a mapping from $T$ to $\mathbb{R}^{n_1 \times n_2}$—is injective. This is the second sufficient condition of Lemma 3.1. Moreover, our ansatz for $Y$ given by (4.2) is then well-defined.

• Having established that $Y$ is well-defined, we will show that $\|\mathcal{P}_{T^\perp}(Y)\| < 1$, thus proving the first sufficient condition.

4.1 The Bernoulli model

Instead of showing that the theorem holds when $\Omega$ is a set of size $m$ sampled uniformly at random, we prove the theorem for a subset $\Omega'$ sampled according to the Bernoulli model. Here and below, $\{\delta_{ij}\}_{1 \le i \le n_1,\, 1 \le j \le n_2}$ is a sequence of independent identically distributed 0/1 Bernoulli random variables with

$$\mathbb{P}(\delta_{ij} = 1) = p \equiv \frac{m}{n_1 n_2}, \tag{4.3}$$

and we define

$$\Omega' = \{(i,j) : \delta_{ij} = 1\}. \tag{4.4}$$

Note that $\mathbb{E}|\Omega'| = m$, so that the average cardinality of $\Omega'$ is that of $\Omega$. Then, following the same reasoning as the argument developed in Section II.C of [11], the probability of "failure" under the uniform model is bounded by 2 times the probability of failure under the Bernoulli model; the failure event is the event on which the solution to (1.5) is not exact. Hence, we can restrict our attention to the Bernoulli model and, from now on, we will assume that $\Omega$ is given by (4.4). This is advantageous because the Bernoulli model admits a simpler analysis than uniform sampling thanks to the independence between the $\delta_{ij}$'s.

4.2 The injectivity property

We study the injectivity of $\mathcal{A}_{\Omega T}$, which also shows that $Y$ is well-defined. To prove this, we will show that the linear operator $p^{-1} \mathcal{P}_T (\mathcal{P}_\Omega - p\mathcal{I}) \mathcal{P}_T$ has small operator norm, which we recall is

$$\sup_{\|X\|_F \le 1} p^{-1} \|\mathcal{P}_T (\mathcal{P}_\Omega - p\mathcal{I}) \mathcal{P}_T (X)\|_F.$$

Theorem 4.1 Suppose $\Omega$ is sampled according to the Bernoulli model (4.3)-(4.4) and put $n = \max(n_1, n_2)$. Suppose that the coherences obey $\max(\mu(U), \mu(V)) \le \mu_0$. Then there is a numerical constant $C_R$ such that for all $\beta > 1$,

$$p^{-1} \|\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T - p\,\mathcal{P}_T\| \le C_R \sqrt{\frac{\mu_0\, nr\,(\beta \log n)}{m}} \tag{4.5}$$

with probability at least $1 - 3n^{-\beta}$, provided that $C_R \sqrt{\mu_0\, nr\,(\beta\log n)/m} < 1$.

Proof Decompose any matrix $X$ as $X = \sum_{ab} \langle X, e_a e_b^* \rangle\, e_a e_b^*$, so that

$$\mathcal{P}_T(X) = \sum_{ab} \langle \mathcal{P}_T(X), e_a e_b^* \rangle\, e_a e_b^* = \sum_{ab} \langle X, \mathcal{P}_T(e_a e_b^*) \rangle\, e_a e_b^*.$$

Hence, $\mathcal{P}_\Omega \mathcal{P}_T(X) = \sum_{ab} \delta_{ab} \langle X, \mathcal{P}_T(e_a e_b^*) \rangle\, e_a e_b^*$, which gives

$$(\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T)(X) = \sum_{ab} \delta_{ab} \langle X, \mathcal{P}_T(e_a e_b^*) \rangle\, \mathcal{P}_T(e_a e_b^*).$$

In other words,

$$\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T = \sum_{ab} \delta_{ab}\, \mathcal{P}_T(e_a e_b^*) \otimes \mathcal{P}_T(e_a e_b^*).$$
It follows from the definition (3.5) of $\mathcal{P}_T$ that

$$\mathcal{P}_T(e_a e_b^*) = (P_U e_a)\, e_b^* + e_a\, (P_V e_b)^* - (P_U e_a)(P_V e_b)^*. \tag{4.6}$$

This gives

$$\|\mathcal{P}_T(e_a e_b^*)\|_F^2 = \langle \mathcal{P}_T(e_a e_b^*), e_a e_b^* \rangle = \|P_U e_a\|^2 + \|P_V e_b\|^2 - \|P_U e_a\|^2 \|P_V e_b\|^2, \tag{4.7}$$

and since $\|P_U e_a\|^2 \le \mu(U)\, r/n_1$ and $\|P_V e_b\|^2 \le \mu(V)\, r/n_2$,

$$\|\mathcal{P}_T(e_a e_b^*)\|_F^2 \le 2\mu_0\, r / \min(n_1, n_2). \tag{4.8}$$

Now, the fact that the operator $\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T$ does not deviate from its expected value

$$\mathbb{E}(\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T) = \mathcal{P}_T\, (\mathbb{E}\mathcal{P}_\Omega)\, \mathcal{P}_T = \mathcal{P}_T\, (p\,\mathcal{I})\, \mathcal{P}_T = p\,\mathcal{P}_T$$

in the spectral norm is related to Rudelson's selection theorem [29]. The first part of the theorem below may be found in [10], for example; see also [30] for a very similar statement.

Theorem 4.2 [10] Let $\{\delta_{ab}\}$ be independent 0/1 Bernoulli variables with $\mathbb{P}(\delta_{ab} = 1) = p = \frac{m}{n_1 n_2}$ and put $n = \max(n_1, n_2)$. Suppose that $\|\mathcal{P}_T(e_a e_b^*)\|_F^2 \le 2\mu_0 r/n$. Set

$$Z \equiv p^{-1} \Big\| \sum_{ab} (\delta_{ab} - p)\, \mathcal{P}_T(e_a e_b^*) \otimes \mathcal{P}_T(e_a e_b^*) \Big\| = p^{-1} \|\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T - p\,\mathcal{P}_T\|.$$

1. There exists a constant $C'_R$ such that

$$\mathbb{E} Z \le C'_R \sqrt{\frac{\mu_0\, nr \log n}{m}} \tag{4.9}$$

provided that the right-hand side is smaller than 1.

2. Suppose $\mathbb{E} Z \le 1$. Then for each $\lambda > 0$, we have

$$\mathbb{P}\bigg( |Z - \mathbb{E} Z| > \lambda \sqrt{\frac{\mu_0\, nr\log n}{m}} \bigg) \le 3 \exp\bigg( -\gamma'_0 \min\bigg\{ \lambda^2 \log n,\ \lambda \sqrt{\frac{m \log n}{\mu_0\, nr}} \bigg\} \bigg) \tag{4.10}$$

for some positive constant $\gamma'_0$.

As mentioned above, the first part, namely (4.9), is an application of an established result which states that if $\{y_i\}$ is a family of vectors in $\mathbb{R}^d$ and $\{\delta_i\}$ is a 0/1 Bernoulli sequence with $\mathbb{P}(\delta_i = 1) = p$, then

$$p^{-1} \Big\| \sum_i (\delta_i - p)\, y_i \otimes y_i \Big\| \le C \sqrt{\frac{\log d}{p}}\ \max_i \|y_i\|$$

for some $C > 0$, provided that the right-hand side is less than 1. The proof may be found in the cited literature, e.g., in [10]. Hence, the first part follows from applying this result to vectors of the form $\mathcal{P}_T(e_a e_b^*)$ and using the available bound on $\|\mathcal{P}_T(e_a e_b^*)\|_F$. The second part follows from Talagrand's concentration inequality and may be found in the Appendix.

Set $\lambda = \sqrt{\beta/\gamma'_0}$ and assume that $m > (\beta/\gamma'_0)\, \mu_0\, nr \log n$. Then the left-hand side of (4.10) is bounded by $3n^{-\beta}$, and thus we have established that

$$Z \le C'_R \sqrt{\frac{\mu_0\, nr\log n}{m}} + \frac{1}{\sqrt{\gamma'_0}} \sqrt{\frac{\mu_0\, nr\,\beta\log n}{m}}$$

with probability at least $1 - 3n^{-\beta}$. Setting $C_R = C'_R + 1/\sqrt{\gamma'_0}$ finishes the proof.

Take $m$ large enough so that $C_R \sqrt{\mu_0 (nr/m)\log n} \le 1/2$. Then it follows from (4.5) that

$$\frac{p}{2}\, \|\mathcal{P}_T(X)\|_F \le \|(\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T)(X)\|_F \le \frac{3p}{2}\, \|\mathcal{P}_T(X)\|_F \tag{4.11}$$

for all $X$, with large probability. In particular, the operator $\mathcal{A}_{\Omega T}^* \mathcal{A}_{\Omega T} = \mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T$, mapping $T$ onto itself, is well-conditioned and hence invertible. An immediate consequence is the following:

Corollary 4.3 Assume that $C_R \sqrt{\mu_0\, nr(\log n)/m} \le 1/2$. With the same probability as in Theorem 4.1, we have

$$\|\mathcal{P}_\Omega \mathcal{P}_T(X)\|_F \le \sqrt{3p/2}\; \|\mathcal{P}_T(X)\|_F. \tag{4.12}$$

Proof We have $\|\mathcal{P}_\Omega \mathcal{P}_T(X)\|_F^2 = \langle X, (\mathcal{P}_\Omega \mathcal{P}_T)^* (\mathcal{P}_\Omega \mathcal{P}_T) X \rangle = \langle X, (\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T) X \rangle$, and thus

$$\|\mathcal{P}_\Omega \mathcal{P}_T(X)\|_F^2 = \langle \mathcal{P}_T X, (\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T) X \rangle \le \|\mathcal{P}_T(X)\|_F\, \|(\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T)(X)\|_F,$$

where the inequality is due to Cauchy-Schwarz. The conclusion (4.12) follows from (4.11).
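The concentration in Theorem 4.1 is easy to observe numerically. The sketch below is ours (reusing the hypothetical `projections` helper from the Section 3 sketch); it probes the operator $p^{-1}\mathcal{P}_T\mathcal{P}_\Omega\mathcal{P}_T - \mathcal{P}_T$ with random elements of $T$, which gives only a crude lower estimate of the operator norm, yet already shows the deviation sitting well below 1 in the sampling regime of the theorem:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 100, 2
m = int(10 * n * r * np.log(n))  # sample budget in the regime of Theorem 4.1
p = m / n**2

U, _ = np.linalg.qr(rng.standard_normal((n, r)))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))
P_T, _ = projections(U, V)       # helper from the earlier sketch
delta = rng.random((n, n)) < p   # Bernoulli sampling (4.3)-(4.4)

def D(X):
    """Apply the operator p^{-1} P_T P_Omega P_T - P_T to the matrix X."""
    Y = P_T(X)
    return P_T(delta * Y) / p - Y

ratios = []
for _ in range(20):
    X = P_T(rng.standard_normal((n, n)))  # probe with elements of T
    ratios.append(np.linalg.norm(D(X)) / np.linalg.norm(X))
print(max(ratios))  # typically well below 1, consistent with (4.5)
```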
4.3 The size property

In this section, we explain how we will show that $\|\mathcal{P}_{T^\perp}(Y)\| < 1$. This result will follow from five lemmas that we will prove in Section 6. Introduce

$$\mathcal{H} \equiv \mathcal{P}_T - p^{-1} \mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T,$$

which obeys $\|\mathcal{H}(X)\|_F \le C_R \sqrt{\mu_0 (nr/m)\,\beta\log n}\; \|\mathcal{P}_T(X)\|_F$ with large probability because of Theorem 4.1. For any matrix $X \in T$, $(\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T)^{-1}(X)$ can be expressed in terms of the power series

$$(\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T)^{-1}(X) = p^{-1}\big(X + \mathcal{H}(X) + \mathcal{H}^2(X) + \cdots\big),$$

for $\mathcal{H}$ is a contraction when $m$ is sufficiently large. Since $Y = \mathcal{P}_\Omega \mathcal{P}_T (\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T)^{-1}\big(\sum_{1\le k\le r} u_k v_k^*\big)$, $\mathcal{P}_{T^\perp}(Y)$ may be decomposed as

$$\mathcal{P}_{T^\perp}(Y) = p^{-1} (\mathcal{P}_{T^\perp} \mathcal{P}_\Omega \mathcal{P}_T)\big(E + \mathcal{H}(E) + \mathcal{H}^2(E) + \cdots\big), \qquad E = \sum_{1 \le k \le r} u_k v_k^*. \tag{4.13}$$

To bound the norm of the left-hand side, it is of course sufficient to bound the norms of the summands on the right-hand side. Taking the following five lemmas together establishes Theorem 1.3.

Lemma 4.4 Fix $\beta \ge 2$ and $\lambda \ge 1$. There is a numerical constant $C_0$ such that if $m \ge \lambda\, \mu_1^2\, nr\,\beta \log n$, then

$$p^{-1} \|(\mathcal{P}_{T^\perp} \mathcal{P}_\Omega \mathcal{P}_T)\, E\| \le C_0\, \lambda^{-1/2} \tag{4.14}$$

with probability at least $1 - n^{-\beta}$.

Lemma 4.5 Fix $\beta \ge 2$ and $\lambda \ge 1$. There are numerical constants $C_1$ and $c_1$ such that if $m \ge \lambda\, \mu_1 \max(\sqrt{\mu_0}, \mu_1)\, nr\,\beta\log n$, then

$$p^{-1} \|(\mathcal{P}_{T^\perp} \mathcal{P}_\Omega \mathcal{P}_T)\, \mathcal{H}(E)\| \le C_1\, \lambda^{-1} \tag{4.15}$$

with probability at least $1 - c_1 n^{-\beta}$.

Lemma 4.6 Fix $\beta \ge 2$ and $\lambda \ge 1$. There are numerical constants $C_2$ and $c_2$ such that if $m \ge \lambda\, \mu_0^{4/3}\, n r^{4/3}\,\beta\log n$, then

$$p^{-1} \|(\mathcal{P}_{T^\perp} \mathcal{P}_\Omega \mathcal{P}_T)\, \mathcal{H}^2(E)\| \le C_2\, \lambda^{-3/2} \tag{4.16}$$

with probability at least $1 - c_2 n^{-\beta}$.

Lemma 4.7 Fix $\beta \ge 2$ and $\lambda \ge 1$. There are numerical constants $C_3$ and $c_3$ such that if $m \ge \lambda\, \mu_0^2\, n r^2\,\beta\log n$, then

$$p^{-1} \|(\mathcal{P}_{T^\perp} \mathcal{P}_\Omega \mathcal{P}_T)\, \mathcal{H}^3(E)\| \le C_3\, \lambda^{-1/2} \tag{4.17}$$

with probability at least $1 - c_3 n^{-\beta}$.

Lemma 4.8 Under the assumptions of Theorem 4.1, there is a numerical constant $C_{k_0}$ such that if $m \ge (2C_R)^2 \mu_0\, nr\,\beta\log n$, then

$$p^{-1} \Big\| (\mathcal{P}_{T^\perp} \mathcal{P}_\Omega \mathcal{P}_T) \sum_{k \ge k_0} \mathcal{H}^k(E) \Big\| \le C_{k_0} \Big(\frac{n^2 r}{m}\Big)^{1/2} \Big(\frac{\mu_0\, nr\,\beta\log n}{m}\Big)^{k_0/2} \tag{4.18}$$

with probability at least $1 - n^{-\beta}$.

Let us now show how we may combine these lemmas to prove our main results. Under all of the assumptions of Theorem 1.3, consider the four Lemmas 4.4, 4.5, 4.6 and 4.8, the latter applied with $k_0 = 3$. Together they imply that there are numerical constants $c$ and $C$ such that $\|\mathcal{P}_{T^\perp}(Y)\| < 1$ with probability at least $1 - cn^{-\beta}$, provided that the number of samples obeys

$$m \ge C \max(\mu_1^2,\ \mu_0^{1/2}\mu_1,\ \mu_0^{4/3} r^{1/3},\ \mu_0 n^{1/4})\, nr\,\beta\log n \tag{4.19}$$

for some constant $C$. The four expressions in the maximum come from Lemmas 4.4, 4.5, 4.6 and 4.8, in this order. Now, the bound (4.19) is only interesting in the range where $\mu_0 n^{1/4} r$ is smaller than a constant times $n$, as otherwise the right-hand side is greater than $n^2$ (this would say that one sees all the entries, in which case our claim is trivial). When $\mu_0 r \le n^{3/4}$, we have $(\mu_0 r)^{4/3}\, n \le \mu_0\, n^{5/4} r$, and thus the recovery is exact provided that $m$ obeys (1.9).

For the case concerning small values of the rank, we consider all five lemmas, applying Lemma 4.8 with $k_0 = 4$. Together they imply that $\|\mathcal{P}_{T^\perp}(Y)\| < 1$ with probability at least $1 - cn^{-\beta}$, provided that the number of samples obeys

$$m \ge C \max(\mu_0^2 r,\ \mu_0 n^{1/5})\, nr\,\beta\log n \tag{4.20}$$

for some constant $C$. The two expressions in the maximum come from Lemmas 4.7 and 4.8, in this order.
5 Connections with Random Graph Theory

5.1 The injectivity property and the coupon collector's problem

We argued in the Introduction that to have any hope of recovering an unknown matrix of rank 1 by any method whatsoever, one needs at least one observation per row and one observation per column. Sample $m$ entries uniformly at random. Viewing the row indices as bins, assign the $k$th sampled entry to the bin corresponding to its row index. Then to have any hope of recovering our matrix, all the bins need to be occupied. Quantifying how many samples are required to fill all of the bins is the famous coupon collector's problem.

Coupon collection is also connected to the injectivity of the sampling operator $\mathcal{P}_\Omega$ restricted to elements in $T$. Suppose we sample the entries of a rank-1 matrix equal to $xy^*$ with left and right singular vectors $u = x/\|x\|$ and $v = y/\|y\|$ respectively, and have not seen anything in the $i$th row. Then we claim that $\mathcal{P}_\Omega$ (restricted to $T$) has a nontrivial null space and thus $\mathcal{P}_T\mathcal{P}_\Omega\mathcal{P}_T$ is not invertible. Indeed, consider the matrix $e_i v^*$. This matrix is in $T$, and $\mathcal{P}_\Omega(e_i v^*) = 0$ since $e_i v^*$ vanishes outside of the $i$th row. The same applies to the columns as well. If we have not seen anything in column $j$, then the rank-1 matrix $ue_j^* \in T$ and $\mathcal{P}_\Omega(ue_j^*) = 0$. In conclusion, the invertibility of $\mathcal{P}_T\mathcal{P}_\Omega\mathcal{P}_T$ implies a complete collection.

When the entries are sampled uniformly at random, it is well known that one needs on the order of $n\log n$ samples to sample all the rows. What is interesting is that Theorem 4.1 implies that $\mathcal{P}_T\mathcal{P}_\Omega\mathcal{P}_T$ is invertible (a stronger property) when the number of samples is also on the order of $n\log n$. A particular implication of this discussion is that the logarithmic factors in Theorem 4.1 are unavoidable.
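The coupon-collector threshold is easy to observe empirically. The short simulation below is our illustration (parameters are arbitrary): it draws entries of an $n \times n$ matrix uniformly at random and counts how many draws are needed before every row index has been seen.

```python
# Coupon collector simulation (illustrative): number of uniform entry
# samples needed before every row of an n x n matrix is hit at least once.
import numpy as np

rng = np.random.default_rng(1)
n, trials = 500, 100
counts = []
for _ in range(trials):
    seen = np.zeros(n, dtype=bool)
    draws = 0
    while not seen.all():
        seen[rng.integers(n)] = True  # row index of a uniform random entry
        draws += 1
    counts.append(draws)
# The expected count is n * H_n ~ n (log n + 0.577), i.e. on the order
# of n log n, so this ratio is close to 1.
print(np.mean(counts) / (n * np.log(n)))
```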
5.2 The injectivity property and the connectivity problem

To recover a matrix of rank 1, one needs much more than one observation per row and column. Let $R$ be the set of row indices, $1 \le i \le n$, and $C$ be the set of column indices, $1 \le j \le n$, and consider the bipartite graph connecting vertices $i \in R$ to vertices $j \in C$ if and only if $(i,j) \in \Omega$, i.e. the $(i,j)$th entry is observed. We claim that if this graph is not fully connected, then one cannot hope to recover a matrix of rank 1. To see this, let $I$ be the set of row indices and $J$ be the set of column indices in any connected component. We will assume that $I$ and $J$ are nonempty, as otherwise one is in the previously discussed situation where some rows or columns are not sampled. Consider a rank-1 matrix equal to $xy^*$ as before, with singular vectors $u = x/\|x\|$ and $v = y/\|y\|$. Then all the information about the values of the $x_i$'s with $i \in I$ and of the $y_j$'s with $j \in J$ is given by the sampled entries connecting $I$ to $J$, since all the other observed entries connect vertices in $I^c$ to those in $J^c$. Now even if one observes all the entries $x_i y_j$ with $i \in I$ and $j \in J$, at least the signs of $x_i$, $i \in I$, and of $y_j$, $j \in J$, would remain undetermined. Indeed, if the values $(x_i)_{i\in I}$, $(y_j)_{j\in J}$ are consistent with the observed entries, so are the values $(-x_i)_{i\in I}$, $(-y_j)_{j\in J}$. However, since the same analysis holds for the sets $I^c$ and $J^c$, there are at least two matrices consistent with the observed entries, and exact matrix completion is impossible.

The connectivity of the graph is also related to the injectivity of the sampling operator $\mathcal{P}_\Omega$ restricted to elements in $T$. If the graph is not fully connected, then we claim that $\mathcal{P}_\Omega$ (restricted to $T$) has a nontrivial null space and thus $\mathcal{P}_T\mathcal{P}_\Omega\mathcal{P}_T$ is not invertible. Indeed, consider the matrix $M = av^* + ub^*$, where $a_i = -u_i$ if $i \in I$ and $a_i = u_i$ otherwise, and $b_j = v_j$ if $j \in J$ and $b_j = -v_j$ otherwise. Then this matrix is in $T$ and obeys $M_{ij} = 0$ if $(i,j) \in I \times J$ or $(i,j) \in I^c \times J^c$. Note that on the complement, i.e. $(i,j) \in I \times J^c$ or $(i,j) \in I^c \times J$, one has $M_{ij} = \pm 2u_i v_j$, and one can show that $M \ne 0$ unless $uv^* = 0$. Since $\Omega$ is included in the union of $I \times J$ and $I^c \times J^c$, we have $\mathcal{P}_\Omega(M) = 0$. In conclusion, the invertibility of $\mathcal{P}_T\mathcal{P}_\Omega\mathcal{P}_T$ implies a fully connected graph.

When the entries are sampled uniformly at random, it is well known that one needs on the order of $n\log n$ samples to obtain a fully connected graph with large probability (see, e.g., [8]). Remarkably, Theorem 4.1 implies that $\mathcal{P}_T\mathcal{P}_\Omega\mathcal{P}_T$ is invertible (a stronger property) when the number of samples is also on the order of $n\log n$.
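The connectivity condition is easy to test for a given sample set $\Omega$. The sketch below is our illustration (scipy's graph routines are standard): it builds the bipartite graph on row and column vertices described above and counts its connected components.

```python
# Connectivity check for the bipartite sampling graph of Section 5.2
# (illustrative).  Rank-1 completion is hopeless unless there is a
# single connected component.
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(2)
n, p = 200, 0.04                      # ~ p*n^2 = 1600 samples, above n log n
mask = rng.random((n, n)) < p         # Omega: observed entries
rows, cols = np.nonzero(mask)

# Vertices 0..n-1 are row indices, n..2n-1 are column indices.
graph = coo_matrix((np.ones(rows.size), (rows, cols + n)),
                   shape=(2 * n, 2 * n))
ncomp, _ = connected_components(graph, directed=False)
print("connected components:", ncomp)
```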
6 Proofs of the Critical Lemmas

In this section, we prove the five lemmas of Section 4.3. Before we begin, however, we develop a simple estimate which we will use throughout. For each pair $(a,b)$ and $(a',b')$, it follows from the expression (4.6) of $\mathcal{P}_T(e_a e_b^*)$ that

$$\langle\mathcal{P}_T(e_{a'}e_{b'}^*), e_a e_b^*\rangle = \langle e_a, P_U e_{a'}\rangle\,\mathbb{1}_{\{b=b'\}} + \langle e_b, P_V e_{b'}\rangle\,\mathbb{1}_{\{a=a'\}} - \langle e_a, P_U e_{a'}\rangle\langle e_b, P_V e_{b'}\rangle. \qquad (6.1)$$

Fix $\mu_0$ obeying $\mu(U) \le \mu_0$ and $\mu(V) \le \mu_0$, and note that

$$|\langle e_a, P_U e_{a'}\rangle| = |\langle P_U e_a, P_U e_{a'}\rangle| \le \|P_U e_a\|\,\|P_U e_{a'}\| \le \mu_0 r/n_1,$$

and similarly for $\langle e_b, P_V e_{b'}\rangle$. Suppose that $b = b'$ and $a \ne a'$; then

$$|\langle\mathcal{P}_T(e_{a'}e_{b'}^*), e_a e_b^*\rangle| = |\langle e_a, P_U e_{a'}\rangle|\,(1 - \|P_V e_b\|^2) \le \mu_0 r/n_1.$$

We have a similar bound when $a = a'$ and $b \ne b'$, whereas when $a \ne a'$ and $b \ne b'$,

$$|\langle\mathcal{P}_T(e_{a'}e_{b'}^*), e_a e_b^*\rangle| \le (\mu_0 r)^2/(n_1 n_2).$$

In short, it follows from this analysis (and from (4.8) for the case where $(a,b) = (a',b')$) that

$$\max_{ab,a'b'}|\langle\mathcal{P}_T(e_{a'}e_{b'}^*), e_a e_b^*\rangle| \le 2\mu_0 r/\min(n_1, n_2). \qquad (6.2)$$

A consequence of (4.8) is the estimate

$$\sum_{a'b'}|\langle\mathcal{P}_T(e_{a'}e_{b'}^*), e_a e_b^*\rangle|^2 = \sum_{a'b'}|\langle\mathcal{P}_T(e_a e_b^*), e_{a'}e_{b'}^*\rangle|^2 = \|\mathcal{P}_T(e_a e_b^*)\|_F^2 \le 2\mu_0 r/\min(n_1, n_2), \qquad (6.3)$$

which we will apply several times. A related estimate is

$$\max_a\sum_b|E_{ab}|^2 \le \mu_0 r/\min(n_1, n_2), \qquad (6.4)$$

and the same is true by exchanging the roles of $a$ and $b$. To see this, write

$$\sum_b|E_{ab}|^2 = \|e_a^* E\|^2 = \Big\|\sum_{j\le r} v_j\langle u_j, e_a\rangle\Big\|^2 = \sum_{j\le r}|\langle u_j, e_a\rangle|^2 = \|P_U e_a\|^2,$$

and the conclusion follows from the coherence property. We will prove the lemmas in the case where $n_1 = n_2 = n$ for simplicity, i.e. in the case of square matrices of dimension $n$. The general case is treated in exactly the same way. In fact, the argument only makes use of the bounds (6.2), (6.3) (and sometimes (6.4)), and the general case is obtained by replacing $n$ with $\min(n_1, n_2)$.

Each of the following subsections computes the operator norm of some random variable. In each section, we denote by $S$ the quantity whose norm we wish to analyze. We will also frequently use the notation $H$ for some auxiliary matrix variable whose norm we will need to bound. Hence, we will reuse the same notation many times rather than introducing a dozen new names, just as in computer programming, where one uses the same variable name in distinct routines.

6.1 Proof of Lemma 4.4

In this section, we develop a bound on

$$p^{-1}\|\mathcal{P}_{T^\perp}\mathcal{P}_\Omega\mathcal{P}_T(E)\| = p^{-1}\|\mathcal{P}_{T^\perp}(\mathcal{P}_\Omega - p\,\mathcal{I})\mathcal{P}_T(E)\| \le p^{-1}\|(\mathcal{P}_\Omega - p\,\mathcal{I})(E)\|,$$

where the equality follows from $\mathcal{P}_{T^\perp}\mathcal{P}_T = 0$, and the inequality from $\mathcal{P}_T(E) = E$ together with $\|\mathcal{P}_{T^\perp}(X)\| \le \|X\|$, which is valid for any matrix $X$. Set

$$S \equiv p^{-1}(\mathcal{P}_\Omega - p\,\mathcal{I})(E) = p^{-1}\sum_{ab}(\delta_{ab} - p)\,E_{ab}\, e_a e_b^*. \qquad (6.5)$$

We think of $S$ as a random variable since it depends on the random $\delta_{ab}$'s, and note that $\mathbb{E} S = 0$. The proof of Lemma 4.4 operates by developing an estimate of the size of $(\mathbb{E}\|S\|^q)^{1/q}$ for some $q \ge 1$ and by applying Markov's inequality to bound the tail of the random variable $\|S\|$. To do this, we shall use a symmetrization argument and the noncommutative Khintchine inequality.

Since the function $f(S) = \|S\|^q$ is convex, Jensen's inequality gives $\mathbb{E}\|S\|^q \le \mathbb{E}\|S - S'\|^q$, where $S' = p^{-1}\sum_{ab}(\delta'_{ab} - p)E_{ab}\,e_a e_b^*$ is an independent copy of $S$. Since $\delta_{ab} - \delta'_{ab}$ is symmetric, $S - S'$ has the same distribution as

$$p^{-1}\sum_{ab}\epsilon_{ab}(\delta_{ab} - \delta'_{ab})E_{ab}\, e_a e_b^* \equiv S_\epsilon - S'_\epsilon,$$

where $\{\epsilon_{ab}\}$ is an independent Rademacher sequence and $S_\epsilon = p^{-1}\sum_{ab}\epsilon_{ab}\delta_{ab}E_{ab}\,e_a e_b^*$. Further, the triangle inequality gives

$$(\mathbb{E}\|S_\epsilon - S'_\epsilon\|^q)^{1/q} \le (\mathbb{E}\|S_\epsilon\|^q)^{1/q} + (\mathbb{E}\|S'_\epsilon\|^q)^{1/q} = 2(\mathbb{E}\|S_\epsilon\|^q)^{1/q},$$

since $S_\epsilon$ and $S'_\epsilon$ have the same distribution, and therefore

$$(\mathbb{E}\|S\|^q)^{1/q} \le 2p^{-1}\left(\mathbb{E}_\delta\mathbb{E}_\epsilon\Big\|\sum_{ab}\epsilon_{ab}\delta_{ab}E_{ab}\, e_a e_b^*\Big\|^q\right)^{1/q}.$$

We are now in position to apply the noncommutative Khintchine inequality, which bounds the Schatten norm of a Rademacher series. For $q \ge 1$, the Schatten $q$-norm of a matrix is denoted by

$$\|X\|_{S_q} = \left(\sum_{i=1}^n \sigma_i(X)^q\right)^{1/q}.$$

Note that the nuclear norm is equal to the Schatten 1-norm and the Frobenius norm is equal to the Schatten 2-norm. The following theorem was originally proven by Lust-Picquard [25], and was later sharpened by Buchholz [9].

Lemma 6.1 (Noncommutative Khintchine inequality) Let $(X_i)_{1\le i\le r}$ be a finite sequence of matrices of the same dimension and let $\{\epsilon_i\}$ be a Rademacher sequence. For each $q \ge 2$,

$$\left(\mathbb{E}\Big\|\sum_i\epsilon_i X_i\Big\|_{S_q}^q\right)^{1/q} \le C_K\sqrt{q}\,\max\left[\Big\|\Big(\sum_i X_i^* X_i\Big)^{1/2}\Big\|_{S_q},\ \Big\|\Big(\sum_i X_i X_i^*\Big)^{1/2}\Big\|_{S_q}\right],$$

where $C_K = 2^{-1/4}\sqrt{\pi/e}$.

For reference, if $X$ is an $n \times n$ matrix and $q \ge \log n$, we have $\|X\| \le \|X\|_{S_q} \le e\|X\|$, so that the Schatten $q$-norm is within a multiplicative constant of the operator norm. Observe now that with $q' \ge q$,

$$(\mathbb{E}_\delta\mathbb{E}_\epsilon\|S_\epsilon\|^q)^{1/q} \le \big(\mathbb{E}_\delta\mathbb{E}_\epsilon\|S_\epsilon\|_{S_{q'}}^q\big)^{1/q} \le \big(\mathbb{E}_\delta\mathbb{E}_\epsilon\|S_\epsilon\|_{S_{q'}}^{q'}\big)^{1/q'}.$$

We apply the noncommutative Khintchine inequality with $q' \ge \log n$ and, after a little algebra, obtain

$$\big(\mathbb{E}_\delta\mathbb{E}_\epsilon\|S_\epsilon\|_{S_{q'}}^{q'}\big)^{1/q'} \le C_K\, e\,\frac{\sqrt{q'}}{p}\left(\mathbb{E}_\delta\max\left[\Big\|\sum_{ab}\delta_{ab}E_{ab}^2\, e_a e_a^*\Big\|^{q'/2},\ \Big\|\sum_{ab}\delta_{ab}E_{ab}^2\, e_b e_b^*\Big\|^{q'/2}\right]\right)^{1/q'}.$$
The two terms on the right-hand side are essentially the same, and a bound on either one applies to the other by the same technique. We consider the first; since $\sum_{ab}\delta_{ab}E_{ab}^2\, e_a e_a^*$ is a diagonal matrix,

$$\Big\|\sum_{ab}\delta_{ab}E_{ab}^2\, e_a e_a^*\Big\| = \max_a\sum_b\delta_{ab}E_{ab}^2.$$

The following lemma bounds the $q$th moment of this quantity.

Lemma 6.2 Suppose that $q$ is an integer obeying $1 \le q \le np$ and assume $np \ge 2\log n$. Then

$$\mathbb{E}_\delta\Big(\max_a\sum_b\delta_{ab}E_{ab}^2\Big)^q \le 2\,\big(2np\,\|E\|_\infty^2\big)^q. \qquad (6.6)$$

The proof of this lemma is in the Appendix. The same estimate applies to $\mathbb{E}\big(\max_b\sum_a\delta_{ab}E_{ab}^2\big)^q$, and thus for each $q \ge 1$,

$$\mathbb{E}_\delta\max\left[\Big\|\sum_{ab}\delta_{ab}E_{ab}^2\, e_a e_a^*\Big\|^q,\ \Big\|\sum_{ab}\delta_{ab}E_{ab}^2\, e_b e_b^*\Big\|^q\right] \le 4\,\big(2np\,\|E\|_\infty^2\big)^q.$$

(In the rectangular case, the same estimate holds with $n = \max(n_1, n_2)$.)

Take $q = \beta\log n$ for some $\beta \ge 1$, and set $q' = q$. Then since $\|E\|_\infty \le \mu_1\sqrt{r}/n$, we have established that

$$(\mathbb{E}\|S\|^q)^{1/q} \le C\,\frac{1}{p}\sqrt{\beta\log n}\,\sqrt{np}\,\|E\|_\infty = C\,\mu_1\sqrt{\frac{nr\,\beta\log n}{m}} \equiv K_0.$$

Then by Markov's inequality, for each $t > 0$, $\mathbb{P}(\|S\| > tK_0) \le t^{-q}$, and for $t = e$ we conclude that

$$\mathbb{P}\left(\|S\| > Ce\,\mu_1\sqrt{\frac{nr\,\beta\log n}{m}}\right) \le n^{-\beta},$$

with the proviso that $m \ge \max(\beta, 2)\, n\log n$ so that Lemma 6.2 holds. We have not made any assumption in this section about the matrix $E$ (except that we have a bound on its maximum entry) and, therefore, have proved the theorem below, which shall be used many times in the sequel.

Theorem 6.3 Let $X$ be a fixed $n \times n$ matrix. There is a constant $C_0$ such that for each $\beta > 2$,

$$p^{-1}\|(\mathcal{P}_\Omega - p\,\mathcal{I})(X)\| \le C_0\left(\frac{\beta\, n\log n}{p}\right)^{1/2}\|X\|_\infty \qquad (6.7)$$

with probability at least $1 - n^{-\beta}$, provided that $np \ge \beta\log n$. Note that this is the same $C_0$ described in Lemma 4.4.

6.2 Proof of Lemma 4.5

We now need to bound the spectral norm of $\mathcal{P}_{T^\perp}\mathcal{P}_\Omega\mathcal{P}_T\mathcal{H}(E)$ and will use some of the ideas developed in the previous section. Just as before,

$$p^{-1}\|\mathcal{P}_{T^\perp}\mathcal{P}_\Omega\mathcal{P}_T\mathcal{H}(E)\| \le p^{-1}\|(\mathcal{P}_\Omega - p\,\mathcal{I})\mathcal{H}(E)\|,$$

and put

$$S \equiv p^{-1}(\mathcal{P}_\Omega - p\,\mathcal{I})\mathcal{H}(E) = p^{-2}\sum_{ab,a'b'}\xi_{ab}\xi_{a'b'}E_{a'b'}\langle\mathcal{P}_T e_{a'}e_{b'}^*, e_a e_b^*\rangle\, e_a e_b^*,$$

where here and below, $\xi_{ab} \equiv \delta_{ab} - p$. Decompose $S$ as

$$S = p^{-2}\sum_{(a,b)=(a',b')} + \ p^{-2}\sum_{(a,b)\ne(a',b')} \equiv S_0 + S_1. \qquad (6.8)$$

We bound the spectral norms of the diagonal and off-diagonal contributions separately. We begin with $S_0$ and decompose $\xi_{ab}^2$ as

$$\xi_{ab}^2 = (\delta_{ab} - p)^2 = (1 - 2p)(\delta_{ab} - p) + p(1 - p) = (1 - 2p)\,\xi_{ab} + p(1 - p),$$

which allows us to express $S_0$ as

$$S_0 = \frac{1-2p}{p}\sum_{ab}\xi_{ab}H_{ab}\, e_a e_b^* + (1-p)\sum_{ab}H_{ab}\, e_a e_b^*, \qquad H_{ab} \equiv p^{-1}E_{ab}\langle\mathcal{P}_T e_a e_b^*, e_a e_b^*\rangle. \qquad (6.9)$$

Theorem 6.3 bounds the spectral norm of the first term on the right-hand side, and we have

$$p^{-1}\Big\|\sum_{ab}\xi_{ab}H_{ab}\, e_a e_b^*\Big\| \le C_0\sqrt{\frac{n^3\beta\log n}{m}}\,\|H\|_\infty$$

with probability at least $1 - n^{-\beta}$. Now since $\|E\|_\infty \le \mu_1\sqrt{r}/n$ and $|\langle\mathcal{P}_T e_a e_b^*, e_a e_b^*\rangle| \le 2\mu_0 r/n$ by (6.2), $\|H\|_\infty \le \mu_0\mu_1(2r/np)\sqrt{r}/n$, and

$$p^{-1}\Big\|\sum_{ab}\xi_{ab}H_{ab}\, e_a e_b^*\Big\| \le C\,\mu_0\mu_1\,\frac{nr}{m}\sqrt{\frac{nr\,\beta\log n}{m}}$$

with the same probability. The second term on the right-hand side of (6.9) is deterministic, and we develop an argument that we will reuse several times. We record a useful lemma.

Lemma 6.4 Let $X$ be a fixed matrix and set $Z \equiv \sum_{ab}X_{ab}\langle\mathcal{P}_T(e_a e_b^*), e_a e_b^*\rangle\, e_a e_b^*$. Then

$$\|Z\| \le \frac{2\mu_0 r}{n}\,\|X\|.$$

Proof. Let $\Lambda_U$ and $\Lambda_V$ be the diagonal matrices with entries $\|P_U e_a\|^2$ and $\|P_V e_b\|^2$ respectively,

$$\Lambda_U = \mathrm{diag}(\|P_U e_a\|^2), \qquad \Lambda_V = \mathrm{diag}(\|P_V e_b\|^2). \qquad (6.10)$$
To bound the spectral norm of $Z$, observe that it follows from (4.7) that

$$Z = \Lambda_U X + X\Lambda_V - \Lambda_U X\Lambda_V = \Lambda_U X(I - \Lambda_V) + X\Lambda_V. \qquad (6.11)$$

Hence, since $\|\Lambda_U\|$ and $\|\Lambda_V\|$ are bounded by $\min(\mu_0 r/n, 1)$ and $\|I - \Lambda_V\| \le 1$, we have

$$\|Z\| \le \|\Lambda_U\|\,\|X\|\,\|I - \Lambda_V\| + \|X\|\,\|\Lambda_V\| \le (2\mu_0 r/n)\,\|X\|.$$

Clearly, this lemma and $\|E\| = 1$ give that $H$ defined in (6.9) obeys $\|H\| \le 2\mu_0 r/np$. In summary,

$$\|S_0\| \le C\,\frac{nr}{m}\left(\mu_0\mu_1\sqrt{\frac{\beta\, nr\log n}{m}} + \mu_0\right)$$

for some $C > 0$, with the same probability as in Lemma 4.4.

It remains to bound the off-diagonal term. To this end, we use a useful decoupling lemma:

Lemma 6.5 [16] Let $\{\eta_i\}_{1\le i\le n}$ be a sequence of independent random variables, and $\{x_{ij}\}_{i\ne j}$ be elements taken from a Banach space. Then

$$\mathbb{P}\Big(\Big\|\sum_{i\ne j}\eta_i\eta_j x_{ij}\Big\| \ge t\Big) \le C_D\,\mathbb{P}\Big(\Big\|\sum_{i\ne j}\eta_i\eta'_j x_{ij}\Big\| > t/C_D\Big), \qquad (6.12)$$

where $\{\eta'_i\}$ is an independent copy of $\{\eta_i\}$.

This lemma asserts that it is sufficient to estimate $\mathbb{P}(\|S'_1\| \ge t)$, where $S'_1$ is given by

$$S'_1 \equiv p^{-2}\sum_{(a,b)\ne(a',b')}\xi_{ab}\xi'_{a'b'}E_{a'b'}\langle\mathcal{P}_T e_{a'}e_{b'}^*, e_a e_b^*\rangle\, e_a e_b^*, \qquad (6.13)$$

in which $\{\xi'_{ab}\}$ is an independent copy of $\{\xi_{ab}\}$. We write $S'_1$ as

$$S'_1 = p^{-1}\sum_{ab}\xi_{ab}H_{ab}\, e_a e_b^*, \qquad H_{ab} \equiv p^{-1}\sum_{(a',b')\ne(a,b)}\xi'_{a'b'}E_{a'b'}\langle\mathcal{P}_T e_{a'}e_{b'}^*, e_a e_b^*\rangle. \qquad (6.14)$$

To bound the tail of $\|S'_1\|$, observe that

$$\mathbb{P}(\|S'_1\| \ge t) \le \mathbb{P}(\|S'_1\| \ge t \mid \|H\|_\infty \le K) + \mathbb{P}(\|H\|_\infty > K).$$

By independence, the first term on the right-hand side is bounded via Theorem 6.3: on the event $\{\|H\|_\infty \le K\}$, we have

$$p^{-1}\Big\|\sum_{ab}\xi_{ab}H_{ab}\, e_a e_b^*\Big\| \le C\sqrt{\frac{n^3\beta\log n}{m}}\, K$$

with probability at least $1 - n^{-\beta}$. To bound $\|H\|_\infty$, we use Bernstein's inequality.

Lemma 6.6 Let $X$ be a fixed matrix and define $Q(X)$ as the matrix whose $(a,b)$th entry is

$$[Q(X)]_{ab} = p^{-1}\sum_{(a',b')\ne(a,b)}(\delta_{a'b'} - p)\,X_{a'b'}\langle\mathcal{P}_T e_{a'}e_{b'}^*, e_a e_b^*\rangle,$$

where $\{\delta_{ab}\}$ is an independent Bernoulli sequence obeying $\mathbb{P}(\delta_{ab} = 1) = p$. Then

$$\mathbb{P}\left(\|Q(X)\|_\infty > \lambda\sqrt{\frac{\mu_0 r}{np}}\,\|X\|_\infty\right) \le 2n^2\exp\left(-\frac{\lambda^2}{2 + \frac{2}{3}\sqrt{\frac{\mu_0 r}{np}}\,\lambda}\right). \qquad (6.15)$$

With $\lambda = \sqrt{3\beta\log n}$, the right-hand side is bounded by $2n^{2-\beta}$ provided that $np \ge \frac{4\beta}{3}\mu_0 r\log n$. In particular, for $\lambda = \sqrt{6\beta\log n}$ with $\beta > 2$, the bound is less than $2n^{-\beta}$ provided that $np \ge \frac{8\beta}{3}\mu_0 r\log n$.

Proof. The inequality (6.15) is an application of Bernstein's inequality, which states that for a sum of uniformly bounded independent zero-mean random variables obeying $|Y_k| \le c$,

$$\mathbb{P}\left(\Big|\sum_{k=1}^n Y_k\Big| > t\right) \le 2e^{-t^2/(2\sigma^2 + 2ct/3)}, \qquad (6.16)$$

where $\sigma^2$ is the sum of the variances, $\sigma^2 \equiv \sum_{k=1}^n\mathrm{Var}(Y_k)$. We have

$$\mathrm{Var}([Q(X)]_{ab}) = \frac{1-p}{p}\sum_{(a',b')\ne(a,b)}|X_{a'b'}|^2|\langle\mathcal{P}_T e_{a'}e_{b'}^*, e_a e_b^*\rangle|^2 \le \frac{1-p}{p}\,\|X\|_\infty^2\sum_{(a',b')\ne(a,b)}|\langle\mathcal{P}_T e_a e_b^*, e_{a'}e_{b'}^*\rangle|^2 \le \frac{1-p}{p}\,\|X\|_\infty^2\,\frac{2\mu_0 r}{n}$$

by (6.3). Also, $p^{-1}|(\delta_{a'b'} - p)X_{a'b'}\langle\mathcal{P}_T e_{a'}e_{b'}^*, e_a e_b^*\rangle| \le p^{-1}\|X\|_\infty\, 2\mu_0 r/n$, and hence, for each $t > 0$, (6.16) gives

$$\mathbb{P}(|[Q(X)]_{ab}| > t) \le 2\exp\left(-\frac{t^2}{\frac{2\mu_0 r}{np}\|X\|_\infty^2 + \frac{2}{3}\frac{\mu_0 r}{np}\|X\|_\infty\, t}\right). \qquad (6.17)$$

Putting $t = \lambda\sqrt{\mu_0 r/np}\,\|X\|_\infty$ for some $\lambda > 0$ and applying the union bound gives (6.15).
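Since Bernstein's inequality (6.16) is the workhorse of this and the following subsections, a quick Monte Carlo comparison may be reassuring. The snippet below is our illustration with arbitrary parameters; it checks the bound for a centered binomial sum.

```python
# Monte Carlo check of Bernstein's inequality (6.16) for centered
# Bernoulli sums (illustrative parameters).
import numpy as np

rng = np.random.default_rng(3)
N, p, t, trials = 2000, 0.05, 30.0, 200_000
S = rng.binomial(N, p, size=trials) - N * p   # sum of N centered Bernoullis
empirical = np.mean(np.abs(S) > t)

sigma2 = N * p * (1 - p)   # sum of the variances
c = 1.0                    # |delta_k - p| <= 1
bound = 2 * np.exp(-t**2 / (2 * sigma2 + 2 * c * t / 3))
print(f"empirical tail {empirical:.1e} <= Bernstein bound {bound:.1e}")
```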
Since $\|E\|_\infty \le \mu_1\sqrt{r}/n$, it follows that $H = Q(E)$ introduced in (6.14) obeys

$$\|H\|_\infty \le C\,\frac{\mu_1\sqrt{r}}{n}\sqrt{\frac{\mu_0\, nr\,\beta\log n}{m}}$$

with probability at least $1 - 2n^{-\beta}$ for each $\beta > 2$, and therefore

$$\|S'_1\| \le C\,\sqrt{\mu_0}\,\mu_1\,\frac{nr\,\beta\log n}{m}$$

with probability at least $1 - 3n^{-\beta}$. In conclusion, we have

$$p^{-1}\|(\mathcal{P}_\Omega - p\,\mathcal{I})\mathcal{H}(E)\| \le C\,\frac{nr}{m}\left(\sqrt{\mu_0}\,\mu_1\left(\sqrt{\frac{\mu_0\, nr\,\beta\log n}{m}} + \beta\log n\right) + \mu_0\right) \qquad (6.18)$$

with probability at least $1 - (1 + 3C_D)n^{-\beta}$. A simple algebraic manipulation concludes the proof of Lemma 4.5. Note that we have not made any assumption about the matrix $E$ and, therefore, have established the following:

Lemma 6.7 Let $X$ be a fixed $n \times n$ matrix. There is a constant $C'_0$ such that

$$p^{-2}\Big\|\sum_{(a,b)\ne(a',b')}\xi_{ab}\xi_{a'b'}X_{ab}\langle\mathcal{P}_T(e_{a'}e_{b'}^*), e_a e_b^*\rangle\, e_a e_b^*\Big\| \le C'_0\,\sqrt{\mu_0 r}\,\frac{\beta\log n}{p}\,\|X\|_\infty \qquad (6.19)$$

with probability at least $1 - O(n^{-\beta})$ for all $\beta > 2$, provided that $np \ge 3\mu_0 r\,\beta\log n$.

6.3 Proof of Lemma 4.6

To prove Lemma 4.6, we need to bound the spectral norm of $p^{-1}(\mathcal{P}_\Omega - p\,\mathcal{I})\mathcal{H}^2(E)$, a matrix given by

$$p^{-3}\sum_{a_1 b_1, a_2 b_2, a_3 b_3}\xi_{a_1 b_1}\xi_{a_2 b_2}\xi_{a_3 b_3}E_{a_3 b_3}\langle\mathcal{P}_T e_{a_3}e_{b_3}^*, e_{a_2}e_{b_2}^*\rangle\langle\mathcal{P}_T e_{a_2}e_{b_2}^*, e_{a_1}e_{b_1}^*\rangle\, e_{a_1}e_{b_1}^*,$$

where $\xi_{ab} = \delta_{ab} - p$ as before. It is convenient to introduce notation to compress this expression. Set $\omega = (a,b)$ (and $\omega_i = (a_i, b_i)$ for $i = 1, 2, 3$), $F_\omega = e_a e_b^*$, and $P_{\omega'\omega} = \langle\mathcal{P}_T e_{a'}e_{b'}^*, e_a e_b^*\rangle$, so that

$$p^{-1}(\mathcal{P}_\Omega - p\,\mathcal{I})\mathcal{H}^2(E) = p^{-3}\sum_{\omega_1,\omega_2,\omega_3}\xi_{\omega_1}\xi_{\omega_2}\xi_{\omega_3}E_{\omega_3}P_{\omega_3\omega_2}P_{\omega_2\omega_1}F_{\omega_1}.$$

Partition the sum depending on whether some of the $\omega_i$'s are the same or not:

$$\frac{1}{p}(\mathcal{P}_\Omega - p\,\mathcal{I})\mathcal{H}^2(E) = \frac{1}{p^3}\left[\sum_{\omega_1=\omega_2=\omega_3} + \sum_{\omega_1\ne\omega_2=\omega_3} + \sum_{\omega_1=\omega_3\ne\omega_2} + \sum_{\omega_1=\omega_2\ne\omega_3} + \sum_{\omega_1\ne\omega_2\ne\omega_3}\right]. \qquad (6.20)$$

The meaning should be clear; for instance, $\sum_{\omega_1\ne\omega_2=\omega_3}$ is the sum over the $\omega$'s such that $\omega_2 = \omega_3$ and $\omega_1 \ne \omega_2$. Similarly, $\sum_{\omega_1\ne\omega_2\ne\omega_3}$ is the sum over the $\omega$'s that are all distinct. The idea is now to use a decoupling argument to bound each sum on the right-hand side of (6.20) (except for the first, which does not need to be decoupled) and show that all terms are appropriately small in the spectral norm.

We begin with the first term, which is equal to

$$\frac{1}{p^3}\sum_\omega\xi_\omega^3 E_\omega P_{\omega\omega}^2 F_\omega = \frac{1-3p+3p^2}{p^3}\sum_\omega\xi_\omega E_\omega P_{\omega\omega}^2 F_\omega + \frac{1-3p+2p^2}{p^2}\sum_\omega E_\omega P_{\omega\omega}^2 F_\omega, \qquad (6.21)$$

where we have used the identity $\xi_\omega^3 = (1-3p+3p^2)\,\xi_\omega + p(1-3p+2p^2)$. Set $H_\omega = E_\omega(p^{-1}P_{\omega\omega})^2$. For the first term on the right-hand side of (6.21), we need to control $\|\sum_\omega\xi_\omega H_\omega F_\omega\|$. This is easily bounded by Theorem 6.3. Indeed, it follows from $|H_\omega| \le (2\mu_0 r/np)^2\|E\|_\infty$ that for each $\beta > 0$,

$$p^{-1}\Big\|\sum_\omega\xi_\omega H_\omega F_\omega\Big\| \le C\left(\frac{\mu_0\, nr}{m}\right)^2\mu_1\sqrt{\frac{nr\,\beta\log n}{m}} = C\,\mu_0^2\mu_1\sqrt{\beta\log n}\left(\frac{nr}{m}\right)^{5/2}$$

with probability at least $1 - n^{-\beta}$. For the second term on the right-hand side of (6.21), we apply Lemma 6.4, which gives

$$\Big\|\sum_\omega E_\omega P_{\omega\omega}^2 F_\omega\Big\| \le (2\mu_0 r/n)^2,$$

so that $\|H\| \le (2\mu_0 r/np)^2$. In conclusion, the first term in (6.20) has a spectral norm which is bounded by

$$C\left(\frac{nr}{m}\right)^2\left(\mu_0^2\mu_1\left(\frac{nr\,\beta\log n}{m}\right)^{1/2} + \mu_0^2\right)$$

with probability at least $1 - n^{-\beta}$.
We now turn our attention to the second term, which can be written as

$$p^{-3}\sum_{\omega_1\ne\omega_2}\xi_{\omega_1}\xi_{\omega_2}^2 E_{\omega_2}P_{\omega_2\omega_2}P_{\omega_2\omega_1}F_{\omega_1} = \frac{1-2p}{p^3}\sum_{\omega_1\ne\omega_2}\xi_{\omega_1}\xi_{\omega_2}E_{\omega_2}P_{\omega_2\omega_2}P_{\omega_2\omega_1}F_{\omega_1} + \frac{1-p}{p^2}\sum_{\omega_1\ne\omega_2}\xi_{\omega_1}E_{\omega_2}P_{\omega_2\omega_2}P_{\omega_2\omega_1}F_{\omega_1}.$$

Put $S_1$ for the first term; bounding $\|S_1\|$ is a simple application of Lemma 6.7 with $X_\omega = p^{-1}E_\omega P_{\omega\omega}$, which gives

$$\|S_1\| \le C\,\mu_0^{3/2}\mu_1(\beta\log n)\left(\frac{nr}{m}\right)^2$$

since $\|E\|_\infty \le \mu_1\sqrt{r}/n$. For the second term, we need to bound the spectral norm of $S_2$, where

$$S_2 \equiv p^{-1}\sum_{\omega_1}\xi_{\omega_1}H_{\omega_1}F_{\omega_1}, \qquad H_{\omega_1} = p^{-1}\sum_{\omega_2:\,\omega_2\ne\omega_1}E_{\omega_2}P_{\omega_2\omega_2}P_{\omega_2\omega_1}.$$

Note that $H$ is deterministic. The lemma below provides an estimate of $\|H\|_\infty$.

Lemma 6.8 The matrix $H$ obeys

$$\|H\|_\infty \le \frac{\mu_0 r}{np}\left(3\|E\|_\infty + \frac{2\mu_0 r}{n}\right). \qquad (6.22)$$

Proof. We begin by rewriting $H$ as

$$pH_\omega = \sum_{\omega'}E_{\omega'}P_{\omega'\omega'}P_{\omega'\omega} - E_\omega P_{\omega\omega}^2.$$

Clearly, $|E_\omega P_{\omega\omega}^2| \le (\mu_0 r/n)^2\|E\|_\infty$, so that it suffices to bound the first term, which is the $\omega$th entry of the matrix

$$\sum_{\omega,\omega'}E_{\omega'}P_{\omega'\omega'}P_{\omega'\omega}F_\omega = \mathcal{P}_T(\Lambda_U E + E\Lambda_V - \Lambda_U E\Lambda_V).$$

Now it is immediate to see that $\Lambda_U E \in T$, and likewise for $E\Lambda_V$. Hence,

$$\|\mathcal{P}_T(\Lambda_U E + E\Lambda_V - \Lambda_U E\Lambda_V)\|_\infty \le \|\Lambda_U E\|_\infty + \|E\Lambda_V\|_\infty + \|\mathcal{P}_T(\Lambda_U E\Lambda_V)\|_\infty \le 2\|E\|_\infty\,\mu_0 r/n + \|\mathcal{P}_T(\Lambda_U E\Lambda_V)\|_\infty.$$

We finally use the crude estimate $\|\mathcal{P}_T(\Lambda_U E\Lambda_V)\|_\infty \le \|\mathcal{P}_T(\Lambda_U E\Lambda_V)\| \le 2\|\Lambda_U E\Lambda_V\| \le 2(\mu_0 r/n)^2$ to complete the proof of the lemma.

As a consequence of this lemma, Theorem 6.3 gives

$$\|S_2\| \le C\sqrt{\beta\log n}\left(\frac{nr}{m}\right)^{3/2}\big(\mu_0\mu_1 + \mu_0^2\sqrt{r}\big)$$

with probability at least $1 - n^{-\beta}$. In conclusion, the second term in (6.20) has spectral norm bounded by

$$C\sqrt{\beta\log n}\left(\frac{nr}{m}\right)^{3/2}\left(\mu_0\mu_1\sqrt{\frac{\mu_0\, nr\,\beta\log n}{m}} + \mu_0\mu_1 + \mu_0^2\sqrt{r}\right)$$

with probability at least $1 - O(n^{-\beta})$.

We now examine the third term, which can be written as

$$p^{-3}\sum_{\omega_1\ne\omega_2}\xi_{\omega_1}^2\xi_{\omega_2}E_{\omega_1}P_{\omega_1\omega_2}P_{\omega_2\omega_1}F_{\omega_1} = \frac{1-2p}{p^3}\sum_{\omega_1\ne\omega_2}\xi_{\omega_1}\xi_{\omega_2}E_{\omega_1}P_{\omega_2\omega_1}^2 F_{\omega_1} + \frac{1-p}{p^2}\sum_{\omega_1\ne\omega_2}\xi_{\omega_2}E_{\omega_1}P_{\omega_2\omega_1}^2 F_{\omega_1}.$$

We use the decoupling argument once more, so that for the first term of the right-hand side it suffices to estimate the tail of the norm of

$$S_1 \equiv p^{-1}\sum_{\omega_1}\xi_{\omega_1}^{(1)}E_{\omega_1}H_{\omega_1}F_{\omega_1}, \qquad H_{\omega_1} \equiv p^{-2}\sum_{\omega_2:\,\omega_2\ne\omega_1}\xi_{\omega_2}^{(2)}P_{\omega_2\omega_1}^2,$$

where $\{\xi_\omega^{(1)}\}$ and $\{\xi_\omega^{(2)}\}$ are independent copies of $\{\xi_\omega\}$. It follows from Bernstein's inequality and the estimates $|P_{\omega_2\omega_1}| \le 2\mu_0 r/n$ and

$$\sum_{\omega_2:\,\omega_2\ne\omega_1}|P_{\omega_2\omega_1}|^4 \le \max_{\omega_2:\,\omega_2\ne\omega_1}|P_{\omega_2\omega_1}|^2\sum_{\omega_2:\,\omega_2\ne\omega_1}|P_{\omega_2\omega_1}|^2 \le \left(\frac{2\mu_0 r}{n}\right)^2\frac{2\mu_0 r}{n}$$

that for each $\lambda > 0$,

$$\mathbb{P}\left(|H_{\omega_1}| > \lambda\left(\frac{2\mu_0 r}{np}\right)^{3/2}\right) \le 2\exp\left(-\frac{\lambda^2}{2 + \frac{2}{3}\lambda\left(\frac{2\mu_0 r}{np}\right)^{1/2}}\right).$$

(We remark in passing that one can often get better estimates: when $\omega_1 \ne \omega_2$, the bound $|P_{\omega_2\omega_1}| \le 2\mu_0 r/n$ may be rather crude; indeed, one can derive better estimates for the random orthogonal model, for example.) It is now not hard to see that this inequality implies that

$$\mathbb{P}\left(\|H\|_\infty > \sqrt{8\beta\log n}\left(\frac{2\mu_0\, nr}{m}\right)^{3/2}\right) \le 2n^{-2\beta+2},$$

provided that $m \ge \frac{16}{9}\mu_0\, nr\,\beta\log n$. As a consequence, for each $\beta > 2$, Theorem 6.3 gives

$$\|S_1\| \le C\,\mu_0^{3/2}\mu_1(\beta\log n)\left(\frac{nr}{m}\right)^2$$

with probability at least $1 - 3n^{-\beta}$. The other term is equal to $(1-p)$ times $\sum_{\omega_1}E_{\omega_1}H_{\omega_1}F_{\omega_1}$, and

$$\Big\|\sum_{\omega_1}E_{\omega_1}H_{\omega_1}F_{\omega_1}\Big\| \le \Big\|\sum_{\omega_1}E_{\omega_1}H_{\omega_1}F_{\omega_1}\Big\|_F \le \|H\|_\infty\|E\|_F \le C\sqrt{\beta\log n}\left(\frac{\mu_0\, nr}{m}\right)^{3/2}\sqrt{r}.$$
In conclusion, the third term in (6.20) has spectral norm bounded by

$$C\,\mu_0\sqrt{\beta\log n}\left(\frac{nr}{m}\right)^{3/2}\left(\mu_1\sqrt{\frac{\mu_0\, nr\,\beta\log n}{m}} + \sqrt{\mu_0 r}\right)$$

with probability at least $1 - O(n^{-\beta})$.

We proceed to the fourth term, which can be written as

$$p^{-3}\sum_{\omega_1\ne\omega_3}\xi_{\omega_1}^2\xi_{\omega_3}E_{\omega_3}P_{\omega_3\omega_1}P_{\omega_1\omega_1}F_{\omega_1} = \frac{1-2p}{p^3}\sum_{\omega_1\ne\omega_3}\xi_{\omega_1}\xi_{\omega_3}E_{\omega_3}P_{\omega_3\omega_1}P_{\omega_1\omega_1}F_{\omega_1} + \frac{1-p}{p^2}\sum_{\omega_1\ne\omega_3}\xi_{\omega_3}E_{\omega_3}P_{\omega_3\omega_1}P_{\omega_1\omega_1}F_{\omega_1}.$$

Let $S_1$ be the first term and set $H_{\omega_1} = p^{-2}\sum_{\omega_3:\,\omega_3\ne\omega_1}\xi_{\omega_1}\xi_{\omega_3}E_{\omega_3}P_{\omega_3\omega_1}$. Then Lemma 6.4 gives

$$\|S_1\| \le \frac{2\mu_0 r}{np}\,\|H\| \le C\,\mu_0^{3/2}\mu_1(\beta\log n)\left(\frac{nr}{m}\right)^2,$$

where the last inequality is given by Lemma 6.7. For the other term (call it $S_2$), set $H_{\omega_1} = p^{-1}\sum_{\omega_3:\,\omega_3\ne\omega_1}\xi_{\omega_3}E_{\omega_3}P_{\omega_3\omega_1}$. Then Lemma 6.4 gives $\|S_2\| \le (2\mu_0 r/np)\|H\|$. Notice that

$$H_{\omega_1} = p^{-1}\sum_{\omega_3}\xi_{\omega_3}E_{\omega_3}P_{\omega_3\omega_1} - p^{-1}\xi_{\omega_1}E_{\omega_1}P_{\omega_1\omega_1},$$

so that with $G_{\omega_1} = E_{\omega_1}P_{\omega_1\omega_1}$,

$$H = p^{-1}\big[\mathcal{P}_T(\mathcal{P}_\Omega - p\,\mathcal{I})(E) - (\mathcal{P}_\Omega - p\,\mathcal{I})(G)\big].$$

Now for any matrix $X$, $\|\mathcal{P}_T(X)\| = \|X - \mathcal{P}_{T^\perp}(X)\| \le 2\|X\|$, and therefore

$$\|H\| \le 2p^{-1}\|(\mathcal{P}_\Omega - p\,\mathcal{I})(E)\| + p^{-1}\|(\mathcal{P}_\Omega - p\,\mathcal{I})(G)\|.$$

As a consequence, and since $\|G\|_\infty \le \|E\|_\infty$, Theorem 6.3 gives for each $\beta > 2$

$$\|H\| \le C\,\mu_1\sqrt{\frac{nr\,\beta\log n}{m}}$$

with probability at least $1 - n^{-\beta}$. In conclusion, the fourth term in (6.20) has spectral norm bounded by

$$C\,\mu_0\mu_1\sqrt{\beta\log n}\left(\frac{nr}{m}\right)^{3/2}\left(\sqrt{\frac{\mu_0\, nr\,\beta\log n}{m}} + 1\right)$$

with probability at least $1 - O(n^{-\beta})$.

We finally examine the last term,

$$p^{-3}\sum_{\omega_1\ne\omega_2\ne\omega_3}\xi_{\omega_1}\xi_{\omega_2}\xi_{\omega_3}E_{\omega_3}P_{\omega_3\omega_2}P_{\omega_2\omega_1}F_{\omega_1}.$$

Now just as one has a decoupling inequality for pairs of variables, we have a decoupling inequality for triples as well, and we thus simply need to bound the tail of

$$S_1 \equiv p^{-3}\sum_{\omega_1\ne\omega_2\ne\omega_3}\xi_{\omega_1}^{(1)}\xi_{\omega_2}^{(2)}\xi_{\omega_3}^{(3)}E_{\omega_3}P_{\omega_3\omega_2}P_{\omega_2\omega_1}F_{\omega_1},$$

in which the sequences $\{\xi_\omega^{(1)}\}$, $\{\xi_\omega^{(2)}\}$ and $\{\xi_\omega^{(3)}\}$ are independent copies of $\{\xi_\omega\}$. We refer to [16] for details. We now argue as in Section 6.2 and write $S_1$ as

$$S_1 = p^{-1}\sum_{\omega_1}\xi_{\omega_1}^{(1)}H_{\omega_1}F_{\omega_1},$$

where

$$H_{\omega_1} \equiv p^{-1}\sum_{\omega_2:\,\omega_2\ne\omega_1}\xi_{\omega_2}^{(2)}G_{\omega_2}P_{\omega_2\omega_1}, \qquad G_{\omega_2} \equiv p^{-1}\sum_{\omega_3:\,\omega_3\ne\omega_1,\,\omega_3\ne\omega_2}\xi_{\omega_3}^{(3)}E_{\omega_3}P_{\omega_3\omega_2}. \qquad (6.23)$$

By Lemma 6.6, we have for each $\beta > 2$

$$\|G\|_\infty \le C\sqrt{\frac{\mu_0\, nr\,\beta\log n}{m}}\,\|E\|_\infty$$

with large probability, and the same argument then gives

$$\|H\|_\infty \le C\sqrt{\frac{\mu_0\, nr\,\beta\log n}{m}}\,\|G\|_\infty \le C\,\frac{\mu_0\, nr\,\beta\log n}{m}\,\|E\|_\infty$$

with probability at least $1 - 4n^{-\beta}$. As a consequence, Theorem 6.3 gives

$$\|S\| \le C\,\mu_0\mu_1\left(\frac{nr\,\beta\log n}{m}\right)^{3/2}$$

with probability at least $1 - O(n^{-\beta})$.

To summarize the calculations of this section, and using the fact that $\mu_0 \ge 1$ and $\mu_1 \le \mu_0\sqrt{r}$, we have established that if $m \ge \mu_0\, nr(\beta\log n)$, then

$$p^{-1}\|(\mathcal{P}_\Omega - p\,\mathcal{I})\mathcal{H}^2(E)\| \le C\left(\frac{nr}{m}\right)^2\left(\mu_0^2\mu_1\sqrt{\frac{nr\,\beta\log n}{m}} + \mu_0^2\right) + C\sqrt{\beta\log n}\left(\frac{nr}{m}\right)^{3/2}\mu_0^2\sqrt{r} + C\left(\frac{nr\,\beta\log n}{m}\right)^{3/2}\mu_0\mu_1$$

with probability at least $1 - O(n^{-\beta})$. One can check that if $m = \lambda\,\mu_0^{4/3} nr^{4/3}\,\beta\log n$ for a fixed $\beta \ge 2$ and $\lambda \ge 1$, then there is a constant $C$ such that

$$\|p^{-1}(\mathcal{P}_\Omega - p\,\mathcal{I})\mathcal{H}^2(E)\| \le C\,\lambda^{-3/2}$$

with probability at least $1 - O(n^{-\beta})$. This is the content of Lemma 4.6.

6.4 Proof of Lemma 4.7

Clearly, one could continue along the same path and estimate the spectral norm of $p^{-1}(\mathcal{P}_\Omega - p\,\mathcal{I})\mathcal{H}^3(E)$ by the same technique as in the previous sections.
That is to say, we would write

$$p^{-1}(\mathcal{P}_\Omega - p\,\mathcal{I})\mathcal{H}^3(E) = p^{-4}\sum_{\omega_1,\omega_2,\omega_3,\omega_4}\left[\prod_{i=1}^4\xi_{\omega_i}\right]E_{\omega_4}\left[\prod_{i=1}^3 P_{\omega_{i+1}\omega_i}\right]F_{\omega_1}$$

with the same notation as before, and partition the sum depending on whether some of the $\omega_i$'s are the same or not. Then we would use the decoupling argument to bound each term in the sum. Although this is a clear possibility, one would need to consider 18 cases, and the calculations would become a little laborious. In this section, we propose to bound the term $p^{-1}(\mathcal{P}_\Omega - p\,\mathcal{I})\mathcal{H}^3(E)$ with a different argument which has two main advantages: first, it is much shorter, and second, it uses much of what we have already established. The downside is that it is not as sharp.

The starting point is to note that

$$p^{-1}(\mathcal{P}_\Omega - p\,\mathcal{I})\mathcal{H}^3(E) = p^{-1}\big(\Xi \circ \mathcal{H}^3(E)\big),$$

where $\Xi$ is the matrix with i.i.d. entries equal to $\xi_{ab} = \delta_{ab} - p$, and $\circ$ denotes the Hadamard product (componentwise multiplication). To bound the spectral norm of this Hadamard product, we apply an inequality due to Ando, Horn, and Johnson [4]. An elementary proof can be found in Section 5.6 of [19].

Lemma 6.9 [19] Let $A$ and $B$ be two $n_1 \times n_2$ matrices. Then

$$\|A \circ B\| \le \|A\|\,\nu(B), \qquad (6.24)$$

where $\nu$ is the function $\nu(B) = \inf\{c(X)c(Y) : XY^* = B\}$, and $c(X)$ is the maximum Euclidean norm of the rows,

$$c(X)^2 = \max_{1\le i\le n}\sum_j X_{ij}^2.$$

To apply (6.24), we first notice that one can estimate the norm of $\Xi$ via Theorem 6.3. Indeed, let $Z = \mathbf{1}\mathbf{1}^*$ be the matrix with all entries equal to one. Then $p^{-1}\Xi = p^{-1}(\mathcal{P}_\Omega - p\,\mathcal{I})(Z)$, and thus

$$p^{-1}\|\Xi\| \le C\left(\frac{n^3\beta\log n}{m}\right)^{1/2} \qquad (6.25)$$

with probability at least $1 - n^{-\beta}$. One could obtain a similar result by appealing to the recent literature on random matrix theory and on concentration of measure. Potentially this could allow one to derive an upper bound without the logarithmic term, but we will not consider these refinements here. (It is interesting to note in passing, however, that the two-page proof of Theorem 6.3 gives a large deviation result about the largest singular value of a matrix with i.i.d. entries which is sharp up to a multiplicative factor proportional to at most $\sqrt{\log n}$.)

Second, we bound the second factor in (6.24) via the following estimate:

Lemma 6.10 There are numerical constants $C$ and $c$ so that for each $\beta > 2$, $\mathcal{H}^3(E)$ obeys

$$\nu(\mathcal{H}^3(E)) \le C\,\mu_0 r/n \qquad (6.26)$$

with probability at least $1 - O(n^{-\beta})$, provided that $m \ge c\,\mu_0^{4/3} nr^{5/3}(\beta\log n)$.

The two inequalities (6.25) and (6.26) give

$$p^{-1}\|\Xi \circ \mathcal{H}^3(E)\| \le C\sqrt{\frac{\mu_0^2\, nr^2\,\beta\log n}{m}}$$

with large probability. Hence, when $m$ is substantially larger than a constant times $\mu_0^2\, nr^2(\beta\log n)$, the spectral norm of $p^{-1}(\mathcal{P}_\Omega - p\,\mathcal{I})\mathcal{H}^3(E)$ is much less than 1. This is the content of Lemma 4.7.

The remainder of this section proves Lemma 6.10. Set $S \equiv \mathcal{H}^3(E)$ for short. Because $S$ is in $T$,

$$S = \mathcal{P}_T(S) = P_U S + S P_V - P_U S P_V.$$

Writing $P_U = \sum_{j=1}^r u_j u_j^*$, and similarly for $P_V$, gives

$$S = \sum_{j=1}^r u_j(u_j^* S) + \sum_{j=1}^r\big((I - P_U)S v_j\big)v_j^*.$$

For each $1 \le j \le r$, let $\alpha_j \equiv S v_j$ and $\beta_j^* \equiv u_j^* S$. Then the decomposition

$$S = \sum_{j=1}^r u_j\beta_j^* + \sum_{j=1}^r(P_{U^\perp}\alpha_j)v_j^*,$$

where $P_{U^\perp} = I - P_U$, provides a factorization of the form $S = XY^*$ with

$$X = [u_1, \ldots, u_r, P_{U^\perp}\alpha_1, \ldots, P_{U^\perp}\alpha_r], \qquad Y = [v_1, \ldots, v_r, \beta_1, \ldots, \beta_r].$$
It follows from our assumption that

$$c^2([u_1, \ldots, u_r]) = \max_{1\le i\le n}\sum_{1\le j\le r}u_{ij}^2 = \max_{1\le i\le n}\|P_U e_i\|^2 \le \mu_0 r/n,$$

and similarly for $[v_1, \ldots, v_r]$. Hence, to prove Lemma 6.10, it suffices to prove that the maximum row norm obeys $c([\beta_1, \ldots, \beta_r]) \le C\sqrt{\mu_0 r/n}$ for some constant $C > 0$, and similarly for the matrix $[P_{U^\perp}\alpha_1, \ldots, P_{U^\perp}\alpha_r]$.

Lemma 6.11 There is a numerical constant $C$ such that for each $\beta > 2$,

$$c([\alpha_1, \ldots, \alpha_r]) \le C\sqrt{\mu_0 r/n} \qquad (6.27)$$

with probability at least $1 - O(n^{-\beta})$, provided that $m$ obeys the condition of Lemma 6.10.

A similar estimate for $[\beta_1, \ldots, \beta_r]$ is obtained in the same way by exchanging the roles of $u$ and $v$. Moreover, a minor modification of the argument gives

$$c([P_{U^\perp}\alpha_1, \ldots, P_{U^\perp}\alpha_r]) \le C\sqrt{\mu_0 r/n} \qquad (6.28)$$

as well, and we will omit the details. In short, the estimate (6.27) implies Lemma 6.10.

Proof [of Lemma 6.11] To prove (6.27), we use the notation of the previous section and write

$$\alpha_j = p^{-3}\sum_{a_1 b_1, a_2 b_2, a_3 b_3}\xi_{a_1 b_1}\xi_{a_2 b_2}\xi_{a_3 b_3}E_{a_3 b_3}\langle\mathcal{P}_T e_{a_3}e_{b_3}^*, e_{a_2}e_{b_2}^*\rangle\langle\mathcal{P}_T e_{a_2}e_{b_2}^*, e_{a_1}e_{b_1}^*\rangle\,\mathcal{P}_T(e_{a_1}e_{b_1}^*)v_j$$
$$= p^{-3}\sum_{\omega_1,\omega_2,\omega_3}\xi_{\omega_1}\xi_{\omega_2}\xi_{\omega_3}E_{\omega_3}P_{\omega_3\omega_2}P_{\omega_2\omega_1}\,\mathcal{P}_T(F_{\omega_1})v_j = p^{-3}\sum_{\omega_1,\omega_2,\omega_3}\xi_{\omega_1}\xi_{\omega_2}\xi_{\omega_3}E_{\omega_3}P_{\omega_3\omega_2}P_{\omega_2\omega_1}(F_{\omega_1}v_j),$$

since for any matrix $X$, $\mathcal{P}_T(X)v_j = Xv_j$ for each $1 \le j \le r$. We then follow the same steps as in Section 6.3 and partition the sum depending on whether some of the $\omega_i$'s are the same or not:

$$\alpha_j = p^{-3}\left[\sum_{\omega_1=\omega_2=\omega_3} + \sum_{\omega_1\ne\omega_2=\omega_3} + \sum_{\omega_1=\omega_3\ne\omega_2} + \sum_{\omega_1=\omega_2\ne\omega_3} + \sum_{\omega_1\ne\omega_2\ne\omega_3}\right]. \qquad (6.29)$$

The idea is this: to establish (6.27), it is sufficient to show that if $\gamma_j$ is any of the five terms above, it obeys

$$\sqrt{\sum_{1\le j\le r}|\gamma_{ij}|^2} \le C\sqrt{\mu_0 r/n} \qquad (6.30)$$

($\gamma_{ij}$ is the $i$th component of $\gamma_j$, as usual) with large probability. The strategy for getting such estimates is to use decoupling whenever applicable. Just as Theorem 6.3 proved useful for bounding the norm of $p^{-1}(\mathcal{P}_\Omega - p\,\mathcal{I})\mathcal{H}^2(E)$ in Section 6.3, the lemma below will help bound the magnitudes of the components of $\alpha_j$.

Lemma 6.12 Define $S \equiv p^{-1}\sum_{ij}\sum_\omega\xi_\omega H_\omega\langle e_i, F_\omega v_j\rangle\, e_i e_j^*$. Then for each $\lambda > 0$,

$$\mathbb{P}\big(\|S\|_\infty \ge \sqrt{\mu_0/n}\big) \le 2n^2\exp\left(-\frac{1}{\frac{2n}{\mu_0 p}\|H\|_\infty^2 + \frac{2}{3p}\sqrt{r}\,\|H\|_\infty}\right). \qquad (6.31)$$

Proof. The proof is an application of Bernstein's inequality (6.16). Note that $\langle e_i, F_\omega v_j\rangle = \mathbb{1}_{\{a=i\}}v_{bj}$, and hence

$$\mathrm{Var}(S_{ij}) \le p^{-1}\|H\|_\infty^2\sum_\omega|\langle e_i, F_\omega v_j\rangle|^2 = p^{-1}\|H\|_\infty^2,$$

since $\sum_\omega|\langle e_i, F_\omega v_j\rangle|^2 = 1$, and

$$|p^{-1}H_\omega\langle e_i, F_\omega v_j\rangle| \le p^{-1}\|H\|_\infty\sqrt{\mu_0 r/n},$$

since $|\langle e_i, F_\omega v_j\rangle| \le |v_{bj}|$ and $|v_{bj}| \le \|P_V e_b\| \le \sqrt{\mu_0 r/n}$.

Each term in (6.29) is given by the corresponding term in (6.20) after formally substituting $F_\omega$ with $F_\omega v_j$. We begin with the first term, whose $i$th component is equal to

$$\gamma_{ij} \equiv p^{-3}(1-3p+3p^2)\sum_\omega\xi_\omega E_\omega P_{\omega\omega}^2\langle e_i, F_\omega v_j\rangle + p^{-2}(1-3p+2p^2)\sum_\omega E_\omega P_{\omega\omega}^2\langle e_i, F_\omega v_j\rangle. \qquad (6.32)$$

Ignoring the constant factor $(1-3p+3p^2)$, which is bounded by 1, we write the first of these two terms as

$$(S_0)_{ij} \equiv p^{-1}\sum_\omega\xi_\omega H_\omega\langle e_i, F_\omega v_j\rangle, \qquad H_\omega = E_\omega(p^{-1}P_{\omega\omega})^2.$$

Since $\|H\|_\infty \le (\mu_0\, nr/m)^2\mu_1\sqrt{r}/n$, it follows from Lemma 6.12 that

$$\mathbb{P}\big(\|S_0\|_\infty \ge \sqrt{\mu_0/n}\big) \le 2n^2 e^{-1/D}, \qquad D \le C\left(\mu_0^3\mu_1^2\left(\frac{nr}{m}\right)^5 + \mu_0^2\mu_1\left(\frac{nr}{m}\right)^3\right)$$

for some numerical $C > 0$.
Since $\mu_1 \le \mu_0\sqrt{r}$, we have that when $m \ge \lambda\mu_0\, nr^{6/5}(\beta\log n)$ for some numerical constant $\lambda > 0$, the event $\|S_0\|_\infty \ge \sqrt{\mu_0/n}$ has probability at most $2n^2 e^{-(\beta\log n)^3}$; this probability is inversely proportional to a superpolynomial in $n$.

For the second term, the matrix with entries $E_\omega P_{\omega\omega}^2$ is given by

$$\Lambda_U^2 E + E\Lambda_V^2 + 2\Lambda_U E\Lambda_V + \Lambda_U^2 E\Lambda_V^2 - 2\Lambda_U^2 E\Lambda_V - 2\Lambda_U E\Lambda_V^2,$$

and thus

$$\sum_\omega E_\omega P_{\omega\omega}^2\langle e_i, F_\omega v_j\rangle = \big\langle e_i, \big(\Lambda_U^2 E + E\Lambda_V^2 + 2\Lambda_U E\Lambda_V + \Lambda_U^2 E\Lambda_V^2 - 2\Lambda_U^2 E\Lambda_V - 2\Lambda_U E\Lambda_V^2\big)v_j\big\rangle.$$

This is a sum of six terms, and we will show how to bound the first three; the last three are dealt with in exactly the same way and obey better estimates. For the first, we have

$$\langle e_i, \Lambda_U^2 E v_j\rangle = \langle\Lambda_U^2 e_i, E v_j\rangle = \|P_U e_i\|^4\langle e_i, u_j\rangle.$$

Hence

$$p^{-2}\sqrt{\sum_{1\le j\le r}|\langle e_i, \Lambda_U^2 E v_j\rangle|^2} = p^{-2}\|P_U e_i\|^4\sqrt{\sum_{1\le j\le r}|\langle e_i, u_j\rangle|^2} = p^{-2}\|P_U e_i\|^5 \le \left(\frac{\mu_0 r}{np}\right)^2\sqrt{\frac{\mu_0 r}{n}}.$$

In other words, when $m \ge \mu_0\, nr$, the right-hand side is bounded by $\sqrt{\mu_0 r/n}$, as desired. For the second term, we have

$$\langle e_i, E\Lambda_V^2 v_j\rangle = \sum_b\|P_V e_b\|^4 v_{bj}\langle e_i, E e_b\rangle = \sum_b\|P_V e_b\|^4 v_{bj}E_{ib}.$$

Hence it follows from the Cauchy-Schwarz inequality and (6.4) that

$$p^{-2}|\langle e_i, E\Lambda_V^2 v_j\rangle| \le \left(\frac{\mu_0 r}{np}\right)^2\sqrt{\frac{\mu_0 r}{n}}.$$

In other words, when $m \ge \mu_0\, nr^{5/4}$,

$$p^{-2}\sqrt{\sum_{1\le j\le r}|\langle e_i, E\Lambda_V^2 v_j\rangle|^2} \le \sqrt{\frac{\mu_0 r}{n}}, \qquad (6.33)$$

as desired. For the third term, we have

$$\langle e_i, \Lambda_U E\Lambda_V v_j\rangle = \|P_U e_i\|^2\sum_b\|P_V e_b\|^2 v_{bj}E_{ib}.$$

The Cauchy-Schwarz inequality gives

$$2p^{-2}|\langle e_i, \Lambda_U E\Lambda_V v_j\rangle| \le 2\left(\frac{\mu_0 r}{np}\right)^2\sqrt{\frac{\mu_0 r}{n}},$$

just as before. In other words, when $m \ge \mu_0\, nr^{5/4}$, $2p^{-2}\sqrt{\sum_{1\le j\le r}|\langle e_i, \Lambda_U E\Lambda_V v_j\rangle|^2}$ is bounded by $2\sqrt{\mu_0 r/n}$. The other terms obey (6.33) as well when $m \ge \mu_0\, nr^{5/4}$. In conclusion, the first term (6.32) in (6.29) obeys (6.30) with probability at least $1 - O(n^{-\beta})$, provided that $m \ge \mu_0\, nr^{5/4}(\beta\log n)$.

We now turn our attention to the second term, which can be written as

$$\gamma_{ij} \equiv p^{-3}(1-2p)\sum_{\omega_1\ne\omega_2}\xi_{\omega_1}\xi_{\omega_2}E_{\omega_2}P_{\omega_2\omega_2}P_{\omega_2\omega_1}\langle e_i, F_{\omega_1}v_j\rangle + p^{-2}(1-p)\sum_{\omega_1\ne\omega_2}\xi_{\omega_1}E_{\omega_2}P_{\omega_2\omega_2}P_{\omega_2\omega_1}\langle e_i, F_{\omega_1}v_j\rangle.$$

We decouple the first term, so that it suffices to bound

$$(S_0)_{ij} \equiv p^{-1}\sum_{\omega_1}\xi_{\omega_1}^{(1)}H_{\omega_1}\langle e_i, F_{\omega_1}v_j\rangle, \qquad H_{\omega_1} \equiv p^{-2}\sum_{\omega_2:\,\omega_2\ne\omega_1}\xi_{\omega_2}^{(2)}E_{\omega_2}P_{\omega_2\omega_2}P_{\omega_2\omega_1},$$

where the sequences $\{\xi_\omega^{(1)}\}$ and $\{\xi_\omega^{(2)}\}$ are independent. The method of Section 6.2 shows that

$$\|H\|_\infty \le C\sqrt{\frac{\mu_0\, nr\,\beta\log n}{m}}\,\sup_\omega|E_\omega(p^{-1}P_{\omega\omega})| \le C\sqrt{\beta\log n}\left(\frac{\mu_0\, nr}{m}\right)^{3/2}\|E\|_\infty$$

with probability at least $1 - 2n^{-\beta}$ for each $\beta > 2$. Therefore, Lemma 6.12 gives

$$\mathbb{P}\big(\|S_0\|_\infty \ge \sqrt{\mu_0/n}\big) \le 2n^2 e^{-1/D}, \qquad (6.34)$$

where $D$ obeys

$$D \le C\left(\mu_0^2\mu_1^2(\beta\log n)\left(\frac{nr}{m}\right)^4 + \mu_0^{3/2}\mu_1\sqrt{\beta\log n}\left(\frac{nr}{m}\right)^{5/2}\right) \qquad (6.35)$$

for some positive constant $C$. Hence, when $m \ge \lambda\mu_0\, nr^{5/4}(\beta\log n)$ for some sufficiently large numerical constant $\lambda > 0$, the event $\|S_0\|_\infty \ge \sqrt{\mu_0/n}$ has probability at most $2n^2 e^{-(\beta\log n)^2}$. This is inversely proportional to a superpolynomial in $n$.

We write the second term as

$$(S_1)_{ij} \equiv p^{-1}\sum_{\omega_1}\xi_{\omega_1}H_{\omega_1}\langle e_i, F_{\omega_1}v_j\rangle, \qquad H_{\omega_1} = p^{-1}\sum_{\omega_2:\,\omega_2\ne\omega_1}E_{\omega_2}P_{\omega_2\omega_2}P_{\omega_2\omega_1}.$$

We know from Section 6.3 that $H$ obeys $\|H\|_\infty \le C\mu_0^2 r^2/m$, since $\mu_1 \le \mu_0\sqrt{r}$, so that Lemma 6.12 gives

$$\mathbb{P}\big(\|S_1\|_\infty \ge \sqrt{\mu_0/n}\big) \le 2n^2 e^{-1/D}, \qquad D \le C\left(\frac{\mu_0^3 n^3 r^4}{m^3} + \frac{\mu_0^2 n^2 r^{5/2}}{m^2}\right)$$

for some $C > 0$.
Hence, when $m \ge \lambda\mu_0\, nr^{4/3}(\beta\log n)$ for some numerical constant $\lambda > 0$, the event $\|S_1\|_\infty \ge \sqrt{\mu_0/n}$ has probability at most $2n^2 e^{-(\beta\log n)^2}$. This is inversely proportional to a superpolynomial in $n$. In conclusion, and taking into account the decoupling constants in (6.12), the second term in (6.29) obeys (6.30) with probability at least $1 - O(n^{-\beta})$, provided that $m$ is sufficiently large as above.

We now examine the third term, which can be written as

$$p^{-3}(1-2p)\sum_{\omega_1\ne\omega_2}\xi_{\omega_1}\xi_{\omega_2}E_{\omega_1}P_{\omega_2\omega_1}^2\langle e_i, F_{\omega_1}v_j\rangle + p^{-2}(1-p)\sum_{\omega_1\ne\omega_2}\xi_{\omega_2}E_{\omega_1}P_{\omega_2\omega_1}^2\langle e_i, F_{\omega_1}v_j\rangle.$$

For the first term of the right-hand side, it suffices to estimate the tail of

$$(S_0)_{ij} \equiv p^{-1}\sum_{\omega_1}\xi_{\omega_1}^{(1)}E_{\omega_1}H_{\omega_1}\langle e_i, F_{\omega_1}v_j\rangle, \qquad H_{\omega_1} \equiv p^{-2}\sum_{\omega_2:\,\omega_2\ne\omega_1}\xi_{\omega_2}^{(2)}P_{\omega_2\omega_1}^2,$$

where $\{\xi_\omega^{(1)}\}$ and $\{\xi_\omega^{(2)}\}$ are independent. We know from Section 6.3 that $\|H\|_\infty$ obeys $\|H\|_\infty \le C\sqrt{\beta\log n}\,(\mu_0\, nr/m)^{3/2}$ with probability at least $1 - 2n^{-\beta}$ for each $\beta > 2$. Thus Lemma 6.12 shows that $S_0$ obeys (6.34)-(6.35) just as before. The other term is equal to $(1-p)$ times $\sum_{\omega_1}E_{\omega_1}H_{\omega_1}\langle e_i, F_{\omega_1}v_j\rangle$, and by the Cauchy-Schwarz inequality and (6.4),

$$\Big|\sum_{\omega_1}E_{\omega_1}H_{\omega_1}\langle e_i, F_{\omega_1}v_j\rangle\Big| \le \|H\|_\infty\,\|e_i^* E\|\left(\sum_b v_{bj}^2\right)^{1/2} \le C\sqrt{\frac{\mu_0}{n}}\,\sqrt{\beta\log n}\left(\frac{\mu_0\, nr^{4/3}}{m}\right)^{3/2}$$

on the event where $\|H\|_\infty \le C\sqrt{\beta\log n}\,(\mu_0\, nr/m)^{3/2}$. Hence, when $m \ge \lambda\mu_0\, nr^{4/3}(\beta\log n)$ for some numerical constant $\lambda > 0$, we have $|\sum_{\omega_1}E_{\omega_1}H_{\omega_1}\langle e_i, F_{\omega_1}v_j\rangle| \le \sqrt{\mu_0/n}$ on this event. In conclusion, the third term in (6.29) obeys (6.30) with probability at least $1 - O(n^{-\beta})$, provided that $m$ is sufficiently large as above.

We proceed to the fourth term, which can be written as

$$p^{-3}(1-2p)\sum_{\omega_1\ne\omega_3}\xi_{\omega_1}\xi_{\omega_3}E_{\omega_3}P_{\omega_3\omega_1}P_{\omega_1\omega_1}\langle e_i, F_{\omega_1}v_j\rangle + p^{-2}(1-p)\sum_{\omega_1\ne\omega_3}\xi_{\omega_3}E_{\omega_3}P_{\omega_3\omega_1}P_{\omega_1\omega_1}\langle e_i, F_{\omega_1}v_j\rangle.$$

We use the decoupling trick for the first term and bound the tail of

$$(S_0)_{ij} \equiv p^{-1}\sum_{\omega_1}\xi_{\omega_1}^{(1)}H_{\omega_1}(p^{-1}P_{\omega_1\omega_1})\langle e_i, F_{\omega_1}v_j\rangle, \qquad H_{\omega_1} \equiv p^{-1}\sum_{\omega_3:\,\omega_3\ne\omega_1}\xi_{\omega_3}^{(3)}E_{\omega_3}P_{\omega_3\omega_1},$$

where $\{\xi_\omega^{(1)}\}$ and $\{\xi_\omega^{(3)}\}$ are independent. We know from Section 6.2 that

$$\|H\|_\infty \le C\sqrt{\frac{\mu_0\, nr\,\beta\log n}{m}}\,\|E\|_\infty$$

with probability at least $1 - 2n^{-\beta}$ for each $\beta > 2$. Therefore, Lemma 6.12 shows that $S_0$ obeys (6.34)-(6.35) just as before. The other term is equal to $(1-p)$ times $\sum_{\omega_1}H_{\omega_1}(p^{-1}P_{\omega_1\omega_1})\langle e_i, F_{\omega_1}v_j\rangle$, and the Cauchy-Schwarz inequality gives

$$\Big|\sum_{\omega_1}H_{\omega_1}(p^{-1}P_{\omega_1\omega_1})\langle e_i, F_{\omega_1}v_j\rangle\Big| \le \sqrt{n}\,\|H\|_\infty\,\frac{\mu_0\, nr}{m} \le C\,\frac{\mu_1\sqrt{r\,\beta\log n}}{\sqrt{n}}\left(\frac{\mu_0\, nr}{m}\right)^{3/2}$$

on the event $\|H\|_\infty \le C\sqrt{\mu_0\, nr(\beta\log n)/m}\,\|E\|_\infty$. Because $\mu_1 \le \mu_0\sqrt{r}$, we have that whenever $m \ge \lambda\,\mu_0^{4/3} nr^{5/3}(\beta\log n)$ for some numerical constant $\lambda > 0$, $p^{-1}|\sum_{\omega_1}H_{\omega_1}P_{\omega_1\omega_1}\langle e_i, F_{\omega_1}v_j\rangle| \le \sqrt{\mu_0/n}$, just as before. In conclusion, the fourth term in (6.29) obeys (6.30) with probability at least $1 - O(n^{-\beta})$, provided that $m$ is sufficiently large as above.

We finally examine the last term,

$$p^{-3}\sum_{\omega_1\ne\omega_2\ne\omega_3}\xi_{\omega_1}\xi_{\omega_2}\xi_{\omega_3}E_{\omega_3}P_{\omega_3\omega_2}P_{\omega_2\omega_1}\langle e_i, F_{\omega_1}v_j\rangle.$$

Just as before, we need to bound the tail of

$$(S_0)_{ij} \equiv p^{-1}\sum_{\omega_1}\xi_{\omega_1}^{(1)}H_{\omega_1}\langle e_i, F_{\omega_1}v_j\rangle,$$

where $H$ is given by (6.23). We know from Section 6.3 that $H$ obeys

$$\|H\|_\infty \le C(\beta\log n)\,\frac{\mu_0\, nr}{m}\,\frac{\mu_1\sqrt{r}}{n}$$

with probability at least $1 - 4n^{-\beta}$ for each $\beta > 2$.
Therefore, Lemma 6.12 gives

$$\mathbb{P}\Big(\|S_0\|_\infty \ge \tfrac{1}{5}\sqrt{\mu_0/n}\Big) \le 2n^2 e^{-1/D}, \qquad D \le C\left(\mu_0\mu_1^2(\beta\log n)^2\left(\frac{nr}{m}\right)^3 + \mu_0\mu_1(\beta\log n)\left(\frac{nr}{m}\right)^2\right)$$

for some $C > 0$. Hence, when $m \ge \lambda\mu_0\, nr^{4/3}(\beta\log n)$ for some numerical constant $\lambda > 0$, the event $\|S_0\|_\infty \ge \frac{1}{5}\sqrt{\mu_0/n}$ has probability at most $2n^2 e^{-\beta\log n}$. In conclusion, the fifth term in (6.29) obeys (6.30) with probability at least $1 - O(n^{-\beta})$, provided that $m$ is sufficiently large as above.

To summarize the calculations of this section: if $m = \lambda\,\mu_0^{4/3} nr^{5/3}(\beta\log n)$, where $\beta \ge 2$ is fixed and $\lambda$ is some sufficiently large numerical constant, then

$$\sum_{1\le j\le r}|\alpha_{ij}|^2 \le \mu_0 r/n$$

with probability at least $1 - O(n^{-\beta})$. This concludes the proof.

6.5 Proof of Lemma 4.8

It remains to study the spectral norm of $p^{-1}(\mathcal{P}_{T^\perp}\mathcal{P}_\Omega\mathcal{P}_T)\sum_{k\ge k_0}\mathcal{H}^k(E)$ for some positive integer $k_0$, which we bound by the Frobenius norm:

$$p^{-1}\Big\|(\mathcal{P}_{T^\perp}\mathcal{P}_\Omega\mathcal{P}_T)\sum_{k\ge k_0}\mathcal{H}^k(E)\Big\| \le p^{-1}\Big\|(\mathcal{P}_\Omega\mathcal{P}_T)\sum_{k\ge k_0}\mathcal{H}^k(E)\Big\|_F \le \sqrt{\frac{3}{2p}}\,\Big\|\sum_{k\ge k_0}\mathcal{H}^k(E)\Big\|_F,$$

where the inequality follows from Corollary 4.3. To bound the Frobenius norm of the series, write

$$\Big\|\sum_{k\ge k_0}\mathcal{H}^k(E)\Big\|_F \le \big(\|\mathcal{H}\|^{k_0} + \|\mathcal{H}\|^{k_0+1} + \ldots\big)\|E\|_F \le \frac{\|\mathcal{H}\|^{k_0}}{1 - \|\mathcal{H}\|}\,\|E\|_F.$$

Theorem 4.1 gives an upper bound on $\|\mathcal{H}\|$, since $\|\mathcal{H}\| \le C_R\sqrt{\mu_0\, nr\,\beta\log n/m} < 1/2$ on an event with probability at least $1 - 3n^{-\beta}$. Since $\|E\|_F = \sqrt{r}$, we conclude that

$$p^{-1}\Big\|(\mathcal{P}_\Omega\mathcal{P}_T)\sum_{k\ge k_0}\mathcal{H}^k(E)\Big\|_F \le C\,\frac{1}{\sqrt{p}}\left(\frac{\mu_0\, nr\,\beta\log n}{m}\right)^{k_0/2}\sqrt{r} = C\left(\frac{n^2 r}{m}\right)^{1/2}\left(\frac{\mu_0\, nr\,\beta\log n}{m}\right)^{k_0/2}$$

with large probability. This is the content of Lemma 4.8.

7 Numerical Experiments

To demonstrate the practical applicability of the nuclear norm heuristic for recovering low-rank matrices from their entries, we conducted a series of numerical experiments for a variety of matrix sizes $n$, ranks $r$, and numbers of entries $m$. For each $(n, m, r)$ triple, we repeated the following procedure 50 times. We generated $M$, an $n \times n$ matrix of rank $r$, by sampling two $n \times r$ factors $M_L$ and $M_R$ with i.i.d. Gaussian entries and setting $M = M_L M_R^*$. We sampled a subset $\Omega$ of $m$ entries uniformly at random. Then the nuclear norm minimization

$$\text{minimize } \|X\|_* \quad\text{subject to } X_{ij} = M_{ij}, \ (i,j) \in \Omega$$

was solved using the SDP solver SDPT3 [34]. We declared $M$ to be recovered if the solution returned by the SDP, $X_{\mathrm{opt}}$, satisfied $\|X_{\mathrm{opt}} - M\|_F/\|M\|_F < 10^{-3}$. Figure 1 shows the results of these experiments for $n = 40$ and $50$. The $x$-axis corresponds to the fraction of the entries of the matrix that are revealed to the SDP solver. The $y$-axis corresponds to the ratio between the dimension of the set of rank-$r$ matrices, $d_r = r(2n - r)$, and the number of measurements $m$. Note that both of these axes range from zero to one, as a value greater than one on the $x$-axis corresponds to an overdetermined linear system where the semidefinite program always succeeds, and a value greater than one on the $y$-axis corresponds to a situation where there is always an infinite number of matrices of rank $r$ with the given entries. The color of each cell in the figures reflects the empirical recovery rate of the 50 runs (scaled between 0 and 1). White denotes perfect recovery in all experiments, and black denotes failure for all experiments. Interestingly, the experiments reveal very similar plots for different $n$, suggesting that our asymptotic conditions for recovery may be rather conservative.

[Figure 1 (two heat-map panels, axes $m/n^2$ versus $d_r/m$): Recovery of full matrices from their entries. For each $(n, m, r)$ triple, we repeated the following procedure 50 times. A matrix $M$ of rank $r$ and a subset of $m$ entries were selected at random. Then we solved the nuclear norm minimization for $X$ subject to $X_{ij} = M_{ij}$ on the selected entries. We declared $M$ to be recovered if $\|X_{\mathrm{opt}} - M\|_F/\|M\|_F < 10^{-3}$. The results are shown for (a) $n = 40$ and (b) $n = 50$. The color of each cell reflects the empirical recovery rate (scaled between 0 and 1). White denotes perfect recovery in all experiments, and black denotes failure for all experiments.]
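For readers who wish to reproduce one trial of this experiment, the problem is a few lines in a modern modeling language. The sketch below is ours and uses cvxpy rather than the SDPT3/MATLAB setup of the paper; parameter choices are arbitrary, and solver tolerances affect whether the $10^{-3}$ recovery criterion is met.

```python
# One trial of the Figure 1 experiment, written with cvxpy instead of
# the SDPT3 solver used in the paper (illustrative sketch).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(4)
n, r, m = 40, 2, 800

# Rank-r matrix M = M_L M_R^* and a uniform random subset Omega of m entries.
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
mask = np.zeros((n, n))
mask.flat[rng.choice(n * n, size=m, replace=False)] = 1.0

# minimize ||X||_*  subject to  X_ij = M_ij on Omega.
X = cp.Variable((n, n))
problem = cp.Problem(cp.Minimize(cp.normNuc(X)),
                     [cp.multiply(mask, X) == mask * M])
problem.solve()

err = np.linalg.norm(X.value - M, "fro") / np.linalg.norm(M, "fro")
print(f"relative error {err:.1e}")
```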
For a second experiment, we generated random positive semidefinite matrices and tried to recover them from their entries using the nuclear norm heuristic. As above, we repeated the same procedure 50 times for each $(n, m, r)$ triple. We generated $M$, an $n \times n$ positive semidefinite matrix of rank $r$, by sampling an $n \times r$ factor $M_F$ with i.i.d. Gaussian entries and setting $M = M_F M_F^*$. We sampled a subset $\Omega$ of $m$ entries uniformly at random. Then we solved the nuclear norm minimization problem

$$\text{minimize trace}(X) \quad\text{subject to } X_{ij} = M_{ij}, \ (i,j) \in \Omega, \quad X \succeq 0.$$

As above, we declared $M$ to be recovered if $\|X_{\mathrm{opt}} - M\|_F/\|M\|_F < 10^{-3}$. Figure 2 shows the results of these experiments for $n = 40$ and $50$. The $x$-axis again corresponds to the fraction of the entries of the matrix that are revealed to the SDP solver, but, in this case, the number of measurements is divided by $D_n = n(n+1)/2$, the number of unique entries in a positive semidefinite matrix, and the dimension of the set of rank-$r$ matrices is $d_r = nr - r(r-1)/2$. The color of each cell is chosen in the same fashion as in the experiment with full matrices. Interestingly, the recovery region is much larger for positive semidefinite matrices, and future work is needed to investigate whether the theoretical scaling is also more favorable in this scenario of low-rank matrix completion.
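The positive semidefinite variant replaces the nuclear norm by the trace over the PSD cone (for $X \succeq 0$ the two coincide). A matching sketch of ours, again with arbitrary parameter choices:

```python
# One trial of the positive semidefinite experiment of Figure 2
# (illustrative cvxpy sketch).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(5)
n, r, m = 40, 2, 600
MF = rng.standard_normal((n, r))
M = MF @ MF.T                                   # PSD, rank r
mask = np.zeros((n, n))
mask.flat[rng.choice(n * n, size=m, replace=False)] = 1.0

X = cp.Variable((n, n), PSD=True)               # constrains X to be PSD
problem = cp.Problem(cp.Minimize(cp.trace(X)),
                     [cp.multiply(mask, X) == mask * M])
problem.solve()
print(np.linalg.norm(X.value - M, "fro") / np.linalg.norm(M, "fro"))
```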
[Figure 2 (two heat-map panels, axes $m/D_n$ versus $d_r/m$): Recovery of positive semidefinite matrices from their entries. For each $(n, m, r)$ triple, we repeated the following procedure 50 times. A positive semidefinite matrix $M$ of rank $r$ and a set of $m$ entries were selected at random. Then we solved the nuclear norm minimization subject to $X_{ij} = M_{ij}$ on the selected entries with the constraint that $X \succeq 0$. The color scheme for each cell denotes the empirical recovery probability and is the same as in Figure 1. The results are shown for (a) $n = 40$ and (b) $n = 50$.]

Finally, in Figure 3, we plot the performance of the nuclear norm heuristic when recovering low-rank matrices from Gaussian projections of these matrices. In these cases, $M$ was generated in the same fashion as above, but, in place of sampling entries, we generated $m$ random Gaussian projections of the data (see the discussion in Section 1.4). Then we solved the optimization

$$\text{minimize } \|X\|_* \quad\text{subject to } \mathcal{A}(X) = \mathcal{A}(M),$$

with the additional constraint that $X \succeq 0$ in the positive semidefinite case. Here $\mathcal{A}(X)$ denotes a linear map of the form (1.15) whose entries are sampled i.i.d. from a zero-mean unit-variance Gaussian distribution. In these experiments, the recovery regime is far larger than in the case of sampling entries, but this is not particularly surprising, as each Gaussian observation measures a contribution from every entry of the matrix $M$. These Gaussian models were studied extensively in [27].

[Figure 3 (panel (a): $m/n^2$ versus $d_r/m$; panel (b): $m/D_n$ versus $d_r/m$): Recovery of matrices from Gaussian observations. For each $(n, m, r)$ triple, we repeated the following procedure 10 times. In (a), a matrix of rank $r$ was generated as in Figure 1; in (b), a positive semidefinite matrix of rank $r$ was generated as in Figure 2. In both plots, we select a matrix $A$ from the Gaussian ensemble with $m$ rows and $n^2$ (in (a)) or $D_n = n(n+1)/2$ (in (b)) columns. Then we solve the nuclear norm minimization subject to $\mathcal{A}(X) = \mathcal{A}(M)$. The color scheme for each cell denotes the empirical recovery probability and is the same as in Figures 1 and 2.]

8 Discussion

8.1 Improvements

In this paper, we have shown that under suitable conditions, one can reconstruct an $n \times n$ matrix of rank $r$ from a small number of its sampled entries, provided that this number is on the order of $n^{1.2} r\log n$, at least for moderate values of the rank. One would like to know whether better results hold, in the sense that exact matrix recovery would be guaranteed with a reduced number of measurements. In particular, recall that an $n \times n$ matrix of rank $r$ depends on $(2n - r)r$ degrees of freedom; is it true then that it is possible to recover most low-rank matrices from on the order of $nr$ (up to logarithmic multiplicative factors) randomly selected entries? Can the sample size be merely proportional to the true complexity of the low-rank object we wish to recover?

In this direction, we would like to emphasize that there is nothing in our approach that apparently prevents us from getting stronger results. Indeed, we developed a bound on the spectral norm of each of the first four terms $(\mathcal{P}_{T^\perp}\mathcal{P}_\Omega\mathcal{P}_T)\mathcal{H}^k(E)$ in the series (4.13) (corresponding to values of $k$ equal to $0, 1, 2, 3$) and used a general argument to bound the remainder of the series. Presumably, one could bound higher-order terms by the same techniques. Getting an appropriate bound on $\|(\mathcal{P}_{T^\perp}\mathcal{P}_\Omega\mathcal{P}_T)\mathcal{H}^4(E)\|$ would lower the exponent of $n$ from $6/5$ to $7/6$. The appropriate bound on $\|(\mathcal{P}_{T^\perp}\mathcal{P}_\Omega\mathcal{P}_T)\mathcal{H}^5(E)\|$ would further lower the exponent to $8/7$, and so on. To obtain an optimal result, one would need to reach $k$ of size about $\log n$. In doing so, however, one would have to pay special attention to the size of the decoupling constants (the constant $C_D$ for two variables in Lemma 6.5), which depend on $k$, the number of decoupled variables. These constants grow with $k$, and upper bounds are known [15, 16].

8.2 Further directions

It would be of interest to extend our results to the case where the unknown matrix is approximately low-rank. Suppose we write the SVD of a matrix $M$ as

$$M = \sum_{1\le k\le n}\sigma_k u_k v_k^*,$$

where $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_n \ge 0$, and assume for simplicity that none of the $\sigma_k$'s vanish. In general, it is impossible to complete such a matrix exactly from a partial subset of its entries.
However, one might hope to be able to recover a good approximation if, for example, most of the singular values are small or negligible. For instance, consider the truncated SVD of the matrix $M$,

$$M_r = \sum_{1\le k\le r}\sigma_k u_k v_k^*,$$

where the sum extends over the $r$ largest singular values, and let $M^\star$ be the solution to (1.5). Then one would not expect to have $M^\star = M$, but it would be of great interest to determine whether the size of $M^\star - M$ is comparable to that of $M - M_r$ provided that the number of sampled entries is sufficiently large. For example, one would like to know whether it is reasonable to expect that $\|M^\star - M\|_*$ is on the same order as $\|M - M_r\|_*$ (one could ask for a similar comparison with a different norm). If the answer is positive, then this would say that approximately low-rank matrices can be accurately recovered from a small set of sampled entries.

Another important direction is to determine whether the reconstruction is robust to noise, as in some applications one would presumably observe

$$Y_{ij} = M_{ij} + z_{ij}, \qquad (i,j) \in \Omega,$$

where $z$ is a deterministic or stochastic perturbation. In this setup, one would perhaps want to minimize the nuclear norm subject to $\|\mathcal{P}_\Omega(X - Y)\|_F \le \epsilon$, where $\epsilon$ is an upper bound on the noise level, instead of enforcing the equality constraint $\mathcal{P}_\Omega(X) = \mathcal{P}_\Omega(Y)$. Can one expect that this algorithm or a variation thereof provides accurate answers? That is, can one expect that the error between the recovered and the true data matrix be proportional to the noise level?
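Though the analysis of this noisy formulation is left open here, it is straightforward to prototype. The sketch below is a hypothetical setup of ours (the choice of $\epsilon$ is a heuristic, not a result of the paper): it replaces the equality constraints by a Frobenius-norm ball.

```python
# A prototype of the noise-aware relaxation discussed above
# (hypothetical sketch; the paper does not analyze this formulation).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(6)
n, r, m, sigma = 40, 2, 800, 0.1
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
mask = np.zeros((n, n))
mask.flat[rng.choice(n * n, size=m, replace=False)] = 1.0
Y = M + sigma * rng.standard_normal((n, n))     # noisy observations

eps = sigma * np.sqrt(m)    # heuristic bound on ||P_Omega(z)||_F
X = cp.Variable((n, n))
problem = cp.Problem(cp.Minimize(cp.normNuc(X)),
                     [cp.norm(cp.multiply(mask, X - Y), "fro") <= eps])
problem.solve()
print(np.linalg.norm(X.value - M, "fro") / np.linalg.norm(M, "fro"))
```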
9 Appendix

9.1 Proof of Theorem 4.2

The proof of (4.10) follows that in [10], but we shall use slightly more precise estimates. Let $Y_1, \ldots, Y_n$ be a sequence of independent random variables taking values in a Banach space and let $Y^\star$ be the supremum defined as

$$Y^\star = \sup_{f\in\mathcal{F}}\sum_{i=1}^n f(Y_i), \qquad (9.1)$$

where $\mathcal{F}$ is a countable family of real-valued functions such that if $f \in \mathcal{F}$, then $-f \in \mathcal{F}$. Talagrand [33] proved a concentration inequality about $Y^\star$; see also [22, Corollary 7.8].

Theorem 9.1 Assume that $|f| \le B$ and $\mathbb{E} f(Y_i) = 0$ for every $f$ in $\mathcal{F}$ and $i = 1, \ldots, n$. Then for all $t \ge 0$,

$$\mathbb{P}(|Y^\star - \mathbb{E} Y^\star| > t) \le 3\exp\left(-\frac{t}{KB}\log\Big(1 + \frac{Bt}{\sigma^2 + B\,\mathbb{E} Y^\star}\Big)\right), \qquad (9.2)$$

where $\sigma^2 = \sup_{f\in\mathcal{F}}\sum_{i=1}^n\mathbb{E} f^2(Y_i)$, and $K$ is a numerical constant.

We note that very precise values of the numerical constant $K$ are known and are small; see [20]. We will apply this theorem to the random variable $Z$ defined in the statement of Theorem 4.2. Put $Y_{ab} = p^{-1}(\delta_{ab} - p)\,\mathcal{P}_T(e_a e_b^*) \otimes \mathcal{P}_T(e_a e_b^*)$ and $Y = \sum_{ab}Y_{ab}$. By definition,

$$Z = \sup\langle X_1, Y(X_2)\rangle = \sup\sum_{ab}\langle X_1, Y_{ab}(X_2)\rangle = \sup\, p^{-1}\sum_{ab}(\delta_{ab} - p)\langle X_1, \mathcal{P}_T(e_a e_b^*)\rangle\langle\mathcal{P}_T(e_a e_b^*), X_2\rangle,$$

where the supremum is over a countable collection of matrices $X_1$ and $X_2$ obeying $\|X_1\|_F \le 1$ and $\|X_2\|_F \le 1$. Note that it follows from (4.8) that

$$|\langle X_1, Y_{ab}(X_2)\rangle| = p^{-1}|\delta_{ab} - p|\,|\langle X_1, \mathcal{P}_T(e_a e_b^*)\rangle|\,|\langle\mathcal{P}_T(e_a e_b^*), X_2\rangle| \le p^{-1}\|\mathcal{P}_T(e_a e_b^*)\|_F^2 \le \frac{2\mu_0 r}{\min(n_1, n_2)\, p} = \frac{2\mu_0\, nr}{m}$$

(recall that $n = \max(n_1, n_2)$). Hence, we can apply Theorem 9.1 with $B = 2\mu_0(nr/m)$. Also,

$$\mathbb{E}|\langle X_1, Y_{ab}(X_2)\rangle|^2 = p^{-1}(1-p)\,|\langle X_1, \mathcal{P}_T(e_a e_b^*)\rangle|^2\,|\langle X_2, \mathcal{P}_T(e_a e_b^*)\rangle|^2 \le p^{-1}\|\mathcal{P}_T(e_a e_b^*)\|_F^2\,|\langle\mathcal{P}_T(X_2), e_a e_b^*\rangle|^2,$$

so that

$$\sum_{ab}\mathbb{E}|\langle X_1, Y_{ab}(X_2)\rangle|^2 \le \frac{2\mu_0\, nr}{m}\sum_{ab}|\langle\mathcal{P}_T(X_2), e_a e_b^*\rangle|^2 = \frac{2\mu_0\, nr}{m}\,\|\mathcal{P}_T(X_2)\|_F^2 \le \frac{2\mu_0\, nr}{m}.$$

Since $\mathbb{E} Z \le 1$, Theorem 9.1 gives

$$\mathbb{P}(|Z - \mathbb{E} Z| > t) \le 3\exp\left(-\frac{t}{KB}\log(1 + t/2)\right) \le 3\exp\left(-\frac{t\log 2}{KB}\min(1, t/2)\right),$$

where we have used the fact that $\log(1 + u) \ge (\log 2)\min(1, u)$ for $u \ge 0$. Plugging in $t = \lambda\sqrt{\frac{\mu_0\, nr\log n}{m}}$ and $B = 2\mu_0\, nr/m$ establishes the claim.

9.2 Proof of Lemma 6.2

We shall make use of the following lemma, which is an application of well-known deviation bounds for binomial variables.

Lemma 9.2 Let $\{\delta_i\}_{1\le i\le n}$ be a sequence of i.i.d. Bernoulli variables with $\mathbb{P}(\delta_i = 1) = p$ and $Y = \sum_{i=1}^n\delta_i$. Then for each $\lambda > 0$,

$$\mathbb{P}(Y > \lambda\,\mathbb{E} Y) \le \exp\left(-\frac{\lambda^2}{2 + 2\lambda/3}\,\mathbb{E} Y\right). \qquad (9.3)$$

The random variable $\sum_b\delta_{ab}E_{ab}^2$ is bounded by $\|E\|_\infty^2\sum_b\delta_{ab}$, and it thus suffices to estimate the $q$th moment of $Y_\star = \max_a Y_a$, where $Y_a = \sum_b\delta_{ab}$. The inequality (9.3) implies that

$$\mathbb{P}(Y_\star > \lambda np) \le n\exp\left(-\frac{\lambda^2}{2 + 2\lambda/3}\, np\right),$$

and for $\lambda \ge 2$, this gives $\mathbb{P}(Y_\star > \lambda np) \le n\, e^{-\lambda np/2}$. Hence,

$$\mathbb{E} Y_\star^q = \int_0^\infty\mathbb{P}(Y_\star > t)\, q\, t^{q-1}\, dt \le (2np)^q + \int_{2np}^\infty n\, e^{-t/2}\, q\, t^{q-1}\, dt.$$

By integrating by parts, one can check that when $q \le np$, we have

$$\int_{2np}^\infty n\, e^{-t/2}\, q\, t^{q-1}\, dt \le nq\,(2np)^q e^{-np}.$$

Under the assumptions of the lemma, we have $nq\, e^{-np} \le 1$ and, therefore, $\mathbb{E} Y_\star^q \le 2(2np)^q$. The conclusion follows.

Acknowledgments

E. C. was partially supported by a National Science Foundation grant CCF-515362, by the 2006 Waterman Award (NSF), and by an ONR grant. The authors would like to thank Ali Jadbabaie, Pablo Parrilo, Ali Rahimi, Terence Tao, and Joel Tropp for fruitful discussions about parts of this paper. E. C. would like to thank Arnaud Durand for his careful proof-reading and comments.

References

[1] J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.

[2] ACM SIGKDD and Netflix. Proceedings of KDD Cup and Workshop, 2007. Proceedings available online at http://www.cs.uic.edu/~liub/KDD-cup-2007/proceedings.html.

[3] Y. Amit, M. Fink, N. Srebro, and S. Ullman. Uncovering shared structures in multiclass classification. In Proceedings of the Twenty-fourth International Conference on Machine Learning, 2007.

[4] T. Ando, R. A. Horn, and C. R. Johnson. The singular values of a Hadamard product: a basic inequality. Linear and Multilinear Algebra, 21:345-365, 1987.

[5] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Neural Information Processing Systems, 2007.

[6] C. Beck and R. D'Andrea. Computational study and comparisons of LFT reducibility methods. In Proceedings of the American Control Conference, 1998.

[7] D. P. Bertsekas, A. Nedic, and A. E. Ozdaglar. Convex Analysis and Optimization. Athena Scientific, Belmont, MA, 2003.

[8] B. Bollobás. Random Graphs. Cambridge University Press, Cambridge, 2nd edition, 2001.

[9] A. Buchholz. Operator Khintchine inequality in non-commutative probability. Math. Annalen, 319:1-16, 2001.

[10] E. J. Candès and J. Romberg. Sparsity and incoherence in compressive sampling. Inverse Problems, 23(3):969-985, 2007.
Acknowledgments

E. C. was partially supported by a National Science Foundation grant CCF-515362, by the 2006 Waterman Award (NSF), and by an ONR grant. The authors would like to thank Ali Jadbabaie, Pablo Parrilo, Ali Rahimi, Terence Tao, and Joel Tropp for fruitful discussions about parts of this paper. E. C. would like to thank Arnaud Durand for his careful proof-reading and comments.

References

[1] J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.
[2] ACM SIGKDD and Netflix. Proceedings of KDD Cup and Workshop, 2007. Proceedings available online at http://www.cs.uic.edu/~liub/KDD-cup-2007/proceedings.html.
[3] Y. Amit, M. Fink, N. Srebro, and S. Ullman. Uncovering shared structures in multiclass classification. In Proceedings of the Twenty-fourth International Conference on Machine Learning, 2007.
[4] T. Ando, R. A. Horn, and C. R. Johnson. The singular values of a Hadamard product: a basic inequality. Linear and Multilinear Algebra, 21:345–365, 1987.
[5] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Neural Information Processing Systems, 2007.
[6] C. Beck and R. D'Andrea. Computational study and comparisons of LFT reducibility methods. In Proceedings of the American Control Conference, 1998.
[7] D. P. Bertsekas, A. Nedic, and A. E. Ozdaglar. Convex Analysis and Optimization. Athena Scientific, Belmont, MA, 2003.
[8] B. Bollobás. Random Graphs. Cambridge University Press, Cambridge, 2nd edition, 2001.
[9] A. Buchholz. Operator Khintchine inequality in non-commutative probability. Math. Annalen, 319:1–16, 2001.
[10] E. J. Candès and J. Romberg. Sparsity and incoherence in compressive sampling. Inverse Problems, 23(3):969–985, 2007.
[11] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inform. Theory, 52(2):489–509, 2006.
[12] E. J. Candès and T. Tao. Decoding by linear programming. IEEE Trans. Inform. Theory, 51(12):4203–4215, 2005.
[13] E. J. Candès and T. Tao. Near optimal signal recovery from random projections: universal encoding strategies? IEEE Trans. Inform. Theory, 52(12):5406–5425, 2006.
[14] A. L. Chistov and D. Yu. Grigoriev. Complexity of quantifier elimination in the theory of algebraically closed fields. In Proceedings of the 11th Symposium on Mathematical Foundations of Computer Science, volume 176 of Lecture Notes in Computer Science, pages 17–31. Springer Verlag, 1984.
[15] V. H. de la Peña. Decoupling and Khintchine's inequalities for U-statistics. Ann. Probab., 20(4):1877–1892, 1992.
[16] V. H. de la Peña and S. J. Montgomery-Smith. Decoupling inequalities for the tail probabilities of multivariate U-statistics. Ann. Probab., 23(2):806–816, 1995.
[17] D. L. Donoho. Compressed sensing. IEEE Trans. Inform. Theory, 52(4):1289–1306, 2006.
[18] M. Fazel. Matrix Rank Minimization with Applications. PhD thesis, Stanford University, 2002.
[19] R. A. Horn and C. R. Johnson. Topics in Matrix Analysis. Cambridge University Press, Cambridge, 1994. Corrected reprint of the 1991 original.
[20] T. Klein and E. Rio. Concentration around the mean for maxima of empirical processes. Ann. Probab., 33(3):1060–1077, 2005.
[21] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. Ann. Statist., 28(5):1302–1338, 2000.
[22] M. Ledoux. The Concentration of Measure Phenomenon. American Mathematical Society, 2001.
[23] A. S. Lewis. The mathematics of eigenvalue optimization. Mathematical Programming, 97(1–2):155–176, 2003.
[24] N. Linial, E. London, and Y. Rabinovich. The geometry of graphs and some of its algorithmic applications. Combinatorica, 15:215–245, 1995.
[25] F. Lust-Piquard. Inégalités de Khintchine dans C_p (1 < p < ∞). Comptes Rendus Acad. Sci. Paris, Série I, 303(7):289–292, 1986.
[26] M. Mesbahi and G. P. Papavassilopoulos. On the rank minimization problem over a positive semidefinite linear matrix inequality. IEEE Trans. Automat. Control, 42(2):239–243, 1997.
[27] B. Recht, M. Fazel, and P. Parrilo. Guaranteed minimum rank solutions of matrix equations via nuclear norm minimization. 2007. Submitted to SIAM Review.
[28] J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the International Conference on Machine Learning, 2005.
[29] M. Rudelson. Random vectors in the isotropic position. J. Funct. Anal., 164(1):60–72, 1999.
[30] M. Rudelson and R. Vershynin. Sampling from large matrices: an approach through geometric functional analysis. J. ACM, 54(4):Art. 21, 19 pp. (electronic), 2007.
[31] A. M.-C. So and Y. Ye. Theory of semidefinite programming for sensor network localization. Mathematical Programming, Series B, 109, 2007.
[32] N. Srebro. Learning with Matrix Factorizations. PhD thesis, Massachusetts Institute of Technology, 2004.
[33] M. Talagrand. New concentration inequalities in product spaces. Invent. Math., 126(3):505–563, 1996.
[34] K. C. Toh, M. J. Todd, and R. H. Tütüncü. SDPT3 - a MATLAB software package for semidefinite-quadratic-linear programming. Available from http://www.math.nus.edu.sg/~mattohkc/sdpt3.html.
[35] L. Vandenberghe and S. P. Boyd. Semidefinite programming. SIAM Review, 38(1):49–95, 1996.
[36] G. A. Watson. Characterization of the subdifferential of some matrix norms. Linear Algebra and its Applications, 170:1039–1053, 1992.