The Power of Convex Relaxation: Near-Optimal Matrix Completion
Emmanuel J. Candès† and Terence Tao♯

† Applied and Computational Mathematics, Caltech, Pasadena, CA 91125
♯ Department of Mathematics, University of California, Los Angeles, CA 90095

November 26, 2024

Abstract

This paper is concerned with the problem of recovering an unknown matrix from a small fraction of its entries. This is known as the matrix completion problem, and comes up in a great number of applications, including the famous Netflix Prize and other similar questions in collaborative filtering. In general, accurate recovery of a matrix from a small number of entries is impossible; but the knowledge that the unknown matrix has low rank radically changes this premise, making the search for solutions meaningful. This paper presents optimality results quantifying the minimum number of entries needed to recover a matrix of rank $r$ exactly by any method whatsoever (the information theoretic limit). More importantly, the paper shows that, under certain incoherence assumptions on the singular vectors of the matrix, recovery is possible by solving a convenient convex program as soon as the number of entries is on the order of the information theoretic limit (up to logarithmic factors). This convex program simply finds, among all matrices consistent with the observed entries, that with minimum nuclear norm. As an example, we show that on the order of $nr \log n$ samples are needed to recover a random $n \times n$ matrix of rank $r$ by any method, and to be sure, nuclear norm minimization succeeds as soon as the number of entries is of the form $nr\,\mathrm{polylog}(n)$.

Keywords. Matrix completion, low-rank matrices, semidefinite programming, duality in optimization, nuclear norm minimization, random matrices and techniques from random matrix theory, free probability.

1 Introduction

1.1 Motivation

Imagine we have an $n_1 \times n_2$ array of real numbers and that we are interested in knowing the value of each of the $n_1 n_2$ entries in this array.[1] Suppose, however, that we only get to see a small number of the entries so that most of the elements about which we wish information are simply missing. Is it possible from the available entries to guess the many entries that we have not seen? This problem is now known as the matrix completion problem [7], and comes up in a great number of applications, including the famous Netflix Prize and other similar questions in collaborative filtering [12]. In a nutshell, collaborative filtering is the task of making automatic predictions about the interests of a user by collecting taste information from many users. Netflix is a commercial company implementing collaborative filtering, and seeks to predict users' movie preferences from just a few ratings per user. There are many other such recommendation systems proposed by Amazon, Barnes and Noble, and Apple Inc. to name just a few. In each instance, we have a partial list about a user's preferences for a few rated items, and would like to predict his/her preferences for all items from this and other information gleaned from many other users.

[1] Much of the discussion below, as well as our main results, apply also to the case of complex matrix completion, with some minor adjustments in the absolute constants; but for simplicity we restrict attention to the real case.
In mathematical terms, the problem may be posed as follows: we have a data matrix $M \in \mathbb{R}^{n_1 \times n_2}$ which we would like to know as precisely as possible. Unfortunately, the only information available about $M$ is a sampled set of entries $M_{ij}$, $(i,j) \in \Omega$, where $\Omega$ is a subset of the complete set of entries $[n_1] \times [n_2]$. (Here and in the sequel, $[n]$ denotes the list $\{1, \ldots, n\}$.) Clearly, this problem is ill-posed, for there is no way to guess the missing entries without making any assumption about the matrix $M$.

An increasingly common assumption in the field is to suppose that the unknown matrix $M$ has low rank or has approximately low rank. In a recommendation system, this makes sense because oftentimes only a few factors contribute to an individual's taste. In [7], the authors showed that this premise radically changes the problem, making the search for solutions meaningful. Before reviewing these results, we would like to emphasize that the problem of recovering a low-rank matrix from a sample of its entries, and by extension from a small number of linear functionals of the matrix, comes up in many application areas other than collaborative filtering. For instance, the completion problem also arises in computer vision. There, many pixels may be missing in digital images because of occlusion or tracking failures in a video sequence. Recovering a scene and inferring camera motion from a sequence of images is a matrix completion problem known as the structure-from-motion problem [9, 23]. Other examples include system identification in control [19], multi-class learning in data analysis [1-3], global positioning (e.g. of sensors in a network) from partial distance information [5, 21, 22], remote sensing applications in signal processing where we would like to infer a full covariance matrix from partially observed correlations [25], and many statistical problems involving succinct factor models.

1.2 Minimal sampling

This paper is concerned with the theoretical underpinnings of matrix completion and more specifically with quantifying the minimum number of entries needed to recover a matrix of rank $r$ exactly. This number generally depends on the matrix we wish to recover. For simplicity, assume that the unknown rank-$r$ matrix $M$ is $n \times n$. Then it is not hard to see that matrix completion is impossible unless the number of samples $m$ is at least $2nr - r^2$, as a matrix of rank $r$ depends on this many degrees of freedom. The singular value decomposition (SVD)

$$ M = \sum_{k \in [r]} \sigma_k u_k v_k^*, \qquad (1.1) $$

where $\sigma_1, \ldots, \sigma_r \ge 0$ are the singular values, and the singular vectors $u_1, \ldots, u_r \in \mathbb{R}^{n_1} = \mathbb{R}^n$ and $v_1, \ldots, v_r \in \mathbb{R}^{n_2} = \mathbb{R}^n$ are two sets of orthonormal vectors, is useful to reveal these degrees of freedom. Informally, the singular values $\sigma_1 \ge \ldots \ge \sigma_r$ depend on $r$ degrees of freedom, the left singular vectors $u_k$ on $(n-1) + (n-2) + \ldots + (n-r) = nr - r(r+1)/2$ degrees of freedom, and similarly for the right singular vectors $v_k$. If $m < 2nr - r^2$, no matter which entries are available, there can be an infinite number of matrices of rank at most $r$ with exactly the same entries, and so exact matrix completion is impossible. In fact, if the observed locations are sampled at random, we will see later that the minimum number of samples is better thought of as being on the order of $nr \log n$ rather than $nr$ because of a coupon collector's effect.
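To make the coupon-collector effect concrete, the following small numerical experiment (a sketch for illustration, not part of the paper; it assumes only NumPy, and all parameter choices are arbitrary) samples $m$ entry locations of an $n \times n$ matrix uniformly at random and records how often some row or column receives no sample at all; exact completion is clearly impossible on those trials, whichever method is used.

import numpy as np

rng = np.random.default_rng(0)
n, r = 200, 2
print("degrees of freedom 2nr - r^2 =", 2 * n * r - r * r)

def has_empty_line(mask):
    # True if some row or column of the sampling mask contains no observation.
    return (mask.sum(axis=0) == 0).any() or (mask.sum(axis=1) == 0).any()

for m in [n * r, 2 * n * r, int(n * r * np.log(n))]:
    trials, failures = 200, 0
    for _ in range(trials):
        idx = rng.choice(n * n, size=m, replace=False)   # Omega: m entries uniformly at random
        mask = np.zeros((n, n), dtype=bool)
        mask.flat[idx] = True
        failures += has_empty_line(mask)
    print(f"m = {m:6d}: fraction of trials with an empty row/column = {failures / trials:.2f}")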
In this paper, we are interested in identifying large classes of matrices which can provably be recovered by a tractable algorithm from a number of samples approaching the above limit, i.e. from about $nr \log n$ samples. Before continuing, it is convenient to introduce some notation that will be used throughout: let $P_\Omega : \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}$ be the orthogonal projection onto the subspace of matrices which vanish outside of $\Omega$ ($(i,j) \in \Omega$ if and only if $M_{ij}$ is observed); that is, $Y = P_\Omega(X)$ is defined as

$$ Y_{ij} = \begin{cases} X_{ij}, & (i,j) \in \Omega, \\ 0, & \text{otherwise}, \end{cases} $$

so that the information about $M$ is given by $P_\Omega(M)$. The matrix $M$ can be, in principle, recovered from $P_\Omega(M)$ if it is the unique matrix of rank less than or equal to $r$ consistent with the data. In other words, if $M$ is the unique solution to

$$ \text{minimize } \operatorname{rank}(X) \quad \text{subject to } P_\Omega(X) = P_\Omega(M). \qquad (1.2) $$

Knowing when this happens is a delicate question which shall be addressed later. For the moment, note that attempting recovery via (1.2) is not practical as rank minimization is in general an NP-hard problem for which there are no known algorithms capable of solving problems in practical time once, say, $n \ge 10$.

In [7], it was proved 1) that matrix completion is not as ill-posed as previously thought and 2) that exact matrix completion is possible by convex programming. The authors of [7] proposed recovering the unknown matrix by solving the nuclear norm minimization problem

$$ \text{minimize } \|X\|_* \quad \text{subject to } P_\Omega(X) = P_\Omega(M), \qquad (1.3) $$

where the nuclear norm $\|X\|_*$ of a matrix $X$ is defined as the sum of its singular values,

$$ \|X\|_* := \sum_i \sigma_i(X). \qquad (1.4) $$

(The problem (1.3) is a semidefinite program [11].) They proved that if $\Omega$ is sampled uniformly at random among all subsets of cardinality $m$ and $M$ obeys a low coherence condition which we will review later, then with large probability, the unique solution to (1.3) is exactly $M$, provided that the number of samples obeys

$$ m \ge C\, n^{6/5} r \log n \qquad (1.5) $$

(to be completely exact, there is a restriction on the range of values that $r$ can take on). In (1.5), the number of samples per degree of freedom is not logarithmic or polylogarithmic in the dimension, and one would like to know whether better results approaching the $nr \log n$ limit are possible. This paper provides a positive answer. In detail, this work develops many useful matrix models for which nuclear norm minimization is guaranteed to succeed as soon as the number of entries is of the form $nr\,\mathrm{polylog}(n)$.

1.3 Main results

A contribution of this paper is to develop simple hypotheses about the matrix $M$ which make it recoverable by semidefinite programming from nearly minimally sampled entries. To state our assumptions, we recall the SVD of $M$ (1.1) and denote by $P_U$ (resp. $P_V$) the orthogonal projection onto the column (resp. row) space of $M$, i.e. the span of the left (resp. right) singular vectors. Note that

$$ P_U = \sum_{i \in [r]} u_i u_i^*, \qquad P_V = \sum_{i \in [r]} v_i v_i^*. \qquad (1.6) $$

Next, define the matrix $E$ as

$$ E := \sum_{i \in [r]} u_i v_i^*. \qquad (1.7) $$

We observe that $E$ interacts well with $P_U$ and $P_V$, in particular obeying the identities

$$ P_U E = E = E P_V, \qquad E^* E = P_V, \qquad E E^* = P_U. $$

One can view $E$ as a sort of matrix-valued "sign pattern" for $M$ (compare (1.7) with (1.1)); it is also closely related to the subgradient $\partial \|M\|_*$ of the nuclear norm at $M$ (see (3.2)).
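As a small numerical illustration of the convex program (1.3), here is a minimal sketch (not from the paper) that assumes NumPy and CVXPY are available; CVXPY's "nuc" norm and an elementwise mask play the roles of the nuclear norm and of $P_\Omega$. The instance and the sampling probability are arbitrary choices.

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
n, r = 30, 2
# Low-rank ground truth M and a Bernoulli(p) sampling mask standing in for Omega.
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
mask = (rng.random((n, n)) < 0.5).astype(float)

# Nuclear norm minimization (1.3): among matrices agreeing with M on Omega,
# return the one of minimum nuclear norm.
X = cp.Variable((n, n))
constraints = [cp.multiply(mask, X) == mask * M]
prob = cp.Problem(cp.Minimize(cp.norm(X, "nuc")), constraints)
prob.solve()

print("relative recovery error:",
      np.linalg.norm(X.value - M, "fro") / np.linalg.norm(M, "fro"))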
It is clear that some assumptions on the singular vectors $u_i$, $v_i$ (or on the spaces $U$, $V$) are needed in order to have a hope of efficient matrix completion. For instance, if $u_1$ and $v_1$ are Kronecker delta functions at positions $i$, $j$ respectively, then the singular value $\sigma_1$ can only be recovered if one actually samples the $(i,j)$ coordinate, which is only likely if one is sampling a significant fraction of the entire matrix. Thus we need the vectors $u_i$, $v_i$ to be "spread out" or "incoherent" in some sense. In our arguments, it will be convenient to phrase these incoherence assumptions using the projection matrices $P_U$, $P_V$ and the sign pattern matrix $E$. More precisely, our assumptions are as follows.

A1 There exists $\mu_1 > 0$ such that for all pairs $(a, a') \in [n_1] \times [n_1]$ and $(b, b') \in [n_2] \times [n_2]$,

$$ \Big| \langle e_a, P_U e_{a'} \rangle - \frac{r}{n_1} 1_{a = a'} \Big| \le \mu_1 \frac{\sqrt{r}}{n_1}, \qquad (1.8a) $$
$$ \Big| \langle e_b, P_V e_{b'} \rangle - \frac{r}{n_2} 1_{b = b'} \Big| \le \mu_1 \frac{\sqrt{r}}{n_2}. \qquad (1.8b) $$

A2 There exists $\mu_2 > 0$ such that for all $(a, b) \in [n_1] \times [n_2]$,

$$ |E_{ab}| \le \mu_2 \frac{\sqrt{r}}{\sqrt{n_1 n_2}}. \qquad (1.9) $$

We will say that the matrix $M$ obeys the strong incoherence property with parameter $\mu$ if one can take $\mu_1$ and $\mu_2$ both less than or equal to $\mu$. (This property is related to, but slightly different from, the incoherence property, which will be discussed in Section 1.6.1.)

Remark. Our assumptions only involve the singular vectors $u_1, \ldots, u_r$, $v_1, \ldots, v_r$ of $M$; the singular values $\sigma_1, \ldots, \sigma_r$ are completely unconstrained. This lack of dependence on the singular values is a consequence of the geometry of the nuclear norm (and in particular, the fact that the subgradient $\partial \|X\|_*$ of this norm is independent of the singular values, see (3.2)).

It is not hard to see that $\mu$ must be at least 1. For instance, (1.9) implies

$$ r = \sum_{(a,b) \in [n_1] \times [n_2]} |E_{ab}|^2 \le \mu_2^2\, r, $$

which forces $\mu_2 \ge 1$. The Frobenius norm identities

$$ r = \|P_U\|_F^2 = \sum_{a, a' \in [n_1]} |\langle e_a, P_U e_{a'} \rangle|^2 $$

and (1.8a), (1.8b) also place a similar lower bound on $\mu_1$.

We will show that 1) matrices obeying the strong incoherence property with a small value of the parameter $\mu$ can be recovered from fewer entries and that 2) many matrices of interest obey the strong incoherence property with a small $\mu$. We will shortly develop three models, the uniformly bounded orthogonal model, the low-rank low-coherence model, and the random orthogonal model, which all illustrate the point that if the singular vectors of $M$ are "spread out" in the sense that their amplitudes all have about the same size, then the parameter $\mu$ is low. In some sense, "most" low-rank matrices obey the strong incoherence property with $\mu = O(\sqrt{\log n})$, where $n = \max(n_1, n_2)$. Here, $O(\cdot)$ is the standard asymptotic notation, which is reviewed in Section 1.8.

Our first matrix completion result is as follows.

Theorem 1.1 (Matrix completion I) Let $M \in \mathbb{R}^{n_1 \times n_2}$ be a fixed matrix of rank $r = O(1)$ obeying the strong incoherence property with parameter $\mu$. Write $n := \max(n_1, n_2)$. Suppose we observe $m$ entries of $M$ with locations sampled uniformly at random. Then there is a positive numerical constant $C$ such that if

$$ m \ge C\, \mu^4 n (\log n)^2, \qquad (1.10) $$

then $M$ is the unique solution to (1.3) with probability at least $1 - n^{-3}$. In other words: with high probability, nuclear-norm minimization recovers all the entries of $M$ with no error.
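(As an aside before discussing this result: the quantities appearing in A1 and A2 are easy to evaluate for any concrete matrix. The sketch below, an illustration in NumPy and not code from the paper, computes the smallest admissible $\mu_1$ and $\mu_2$ for a square matrix from its rank-$r$ SVD; the test matrix is an arbitrary random example.)

import numpy as np

def strong_incoherence(M, r):
    """Return (mu1, mu2) for an n x n matrix M of rank r, following (1.8)-(1.9)."""
    n = M.shape[0]
    U, s, Vt = np.linalg.svd(M)
    U, V = U[:, :r], Vt[:r, :].T
    PU, PV = U @ U.T, V @ V.T           # projections onto column / row space, cf. (1.6)
    E = U @ V.T                          # "sign pattern" matrix, cf. (1.7)
    I = np.eye(n)
    mu1 = max(np.abs(PU - (r / n) * I).max(),
              np.abs(PV - (r / n) * I).max()) * n / np.sqrt(r)
    mu2 = np.abs(E).max() * n / np.sqrt(r)
    return mu1, mu2

rng = np.random.default_rng(2)
n, r = 100, 3
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
print("mu1, mu2 =", strong_incoherence(M, r))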
This result is noteworthy for two reasons. The first is that the matrix model is deterministic and only needs the strong incoherence assumption. The second is more substantial. Consider the class of bounded rank matrices obeying $\mu = O(1)$. We shall see that no method whatsoever can recover those matrices unless the number of entries obeys $m \ge c_0 n \log n$ for some positive numerical constant $c_0$; this is the information theoretic limit. Thus Theorem 1.1 asserts that exact recovery by nuclear-norm minimization occurs nearly as soon as it is information theoretically possible. Indeed, if the number of samples is slightly larger, by a logarithmic factor, than the information theoretic limit, then (1.3) fills in the missing entries with no error.

We stated Theorem 1.1 for bounded ranks, but our proof gives a result for all values of $r$. Indeed, the argument will establish that the recovery is exact with high probability provided that

$$ m \ge C\, \mu^4 n r^2 (\log n)^2. \qquad (1.11) $$

When $r = O(1)$, this is Theorem 1.1. We will prove a stronger and near-optimal result below (Theorem 1.2) in which we replace the quadratic dependence on $r$ with linear dependence. The reason why we state Theorem 1.1 first is that its proof is somewhat simpler than that of Theorem 1.2, and we hope that it will provide the reader with a useful lead-in to the claims and proof of our main result.

Theorem 1.2 (Matrix completion II) Under the same hypotheses as in Theorem 1.1, there is a numerical constant $C$ such that if

$$ m \ge C\, \mu^2 n r \log^6 n, \qquad (1.12) $$

$M$ is the unique solution to (1.3) with probability at least $1 - n^{-3}$.

This result is general and nonasymptotic. The proof of Theorems 1.1 and 1.2 will occupy the bulk of the paper, starting at Section 3.

1.4 A surprise

We find it unexpected that nuclear-norm minimization works so well, for reasons we now pause to discuss. For simplicity, consider matrices with a strong incoherence parameter $\mu$ polylogarithmic in the dimension. We know that for the rank minimization program (1.2) to succeed, or equivalently for the problem to be well posed, the number of samples must exceed a constant times $nr \log n$. However, Theorem 1.2 proves that the convex relaxation is rigorously exact nearly as soon as our problem has a unique low-rank solution. The surprise here is that, admittedly, there is a priori no good reason to suspect that convex relaxation might work so well; no good reason to suspect that the gap between what combinatorial and convex optimization can do is this small. In this sense, we find these findings a little unexpected. The reader will note an analogy with the recent literature on compressed sensing, which shows that under some conditions, the sparsest solution to an underdetermined system of linear equations is that with minimum $\ell_1$ norm.

1.5 Model matrices

We now discuss model matrices which obey the conditions (1.8) and (1.9) for small values of the strong incoherence parameter $\mu$. For simplicity we restrict attention to the square matrix case $n_1 = n_2 = n$.

1.5.1 Uniformly bounded model

In this section we shall show, roughly speaking, that almost all $n \times n$ matrices $M$ with singular vectors obeying the size property

$$ \|u_k\|_{\ell_\infty},\ \|v_k\|_{\ell_\infty} \le \sqrt{\mu_B / n}, \qquad (1.13) $$

with $\mu_B = O(1)$ also satisfy the assumptions A1 and A2 with $\mu_1, \mu_2 = O(\sqrt{\log n})$.
This justifies our earlier claim that when the singular vectors are spread out, the strong incoherence property holds for a small value of $\mu$. We define a random model obeying (1.13) as follows: take two arbitrary families of $n$ orthonormal vectors $[u_1, \ldots, u_n]$ and $[v_1, \ldots, v_n]$ obeying (1.13). We allow the $u_i$ and $v_i$ to be deterministic; for instance one could have $u_i = v_i$ for all $i \in [n]$.

1. Select $r$ left singular vectors $u_{\alpha(1)}, \ldots, u_{\alpha(r)}$ at random without replacement from the first family, and $r$ right singular vectors $v_{\beta(1)}, \ldots, v_{\beta(r)}$ from the second family, also at random. We do not require that the $\beta$ are chosen independently from the $\alpha$; for instance one could have $\beta(k) = \alpha(k)$ for all $k \in [r]$.

2. Set $M := \sum_{k \in [r]} \epsilon_k \sigma_k u_{\alpha(k)} v_{\beta(k)}^*$, where the signs $\epsilon_1, \ldots, \epsilon_r \in \{-1, +1\}$ are chosen independently at random (with probability 1/2 of each choice of sign), and $\sigma_1, \ldots, \sigma_r > 0$ are arbitrary distinct positive numbers (which are allowed to depend on the previous random choices).

We emphasize that the only assumptions about the families $[u_1, \ldots, u_n]$ and $[v_1, \ldots, v_n]$ are that they have small components. For example, they may be the same. Also note that this model allows for any kind of dependence between the selected left and right singular vectors. For instance, we may select the same columns so as to obtain a symmetric matrix, as in the case where the two families are the same. Thus, one can think of our model as producing a generic matrix with uniformly bounded singular vectors.

We now show that $P_U$, $P_V$ and $E$ obey (1.8) and (1.9), with $\mu_1, \mu_2 = O(\mu_B \sqrt{\log n})$, with large probability. For (1.9), observe that

$$ E = \sum_{k \in [r]} \epsilon_k u_{\alpha(k)} v_{\beta(k)}^*, $$

and $\{\epsilon_k\}$ is a sequence of i.i.d. $\pm 1$ symmetric random variables. Then Hoeffding's inequality shows that $\mu_2 = O(\mu_B \sqrt{\log n})$; see [7] for details. For (1.8), we will use a beautiful concentration-of-measure result of McDiarmid.

Theorem 1.3 [18] Let $\{a_1, \ldots, a_n\}$ be a sequence of scalars obeying $|a_i| \le \alpha$. Choose a random set $S$ of size $s$ without replacement from $\{1, \ldots, n\}$ and let $Y = \sum_{i \in S} a_i$. Then for each $t \ge 0$,

$$ \mathbb{P}(|Y - \mathbb{E} Y| \ge t) \le 2 e^{-\frac{t^2}{2 s \alpha^2}}. \qquad (1.14) $$

From (1.6) we have

$$ P_U = \sum_{k \in S} u_k u_k^*, \qquad S := \{\alpha(1), \ldots, \alpha(r)\}. $$

For any fixed $a, a' \in [n]$, set

$$ Y := \langle P_U e_a, P_U e_{a'} \rangle = \sum_{k \in S} \langle e_a, u_k \rangle \langle u_k, e_{a'} \rangle $$

and note that $\mathbb{E} Y = \frac{r}{n} 1_{a = a'}$. Since $|\langle e_a, u_k \rangle \langle u_k, e_{a'} \rangle| \le \mu_B / n$, we apply (1.14) and obtain

$$ \mathbb{P}\Big( \Big| \langle P_U e_a, P_U e_{a'} \rangle - 1_{a = a'}\, r/n \Big| \ge \lambda \mu_B \frac{\sqrt{r}}{n} \Big) \le 2 e^{-\lambda^2 / 2}. $$

Taking $\lambda$ proportional to $\sqrt{\log n}$ and applying the union bound over $a, a' \in [n]$ proves (1.8) with probability at least $1 - n^{-3}$ (say) with $\mu_1 = O(\mu_B \sqrt{\log n})$.

Combining this computation with Theorems 1.1 and 1.2, we have established the following corollary:

Corollary 1.4 (Matrix completion, uniformly bounded model) Let $M$ be a matrix sampled from a uniformly bounded model. Under the hypotheses of Theorem 1.1, if

$$ m \ge C\, \mu_B^2 n r \log^7 n, $$

$M$ is the unique solution to (1.3) with probability at least $1 - n^{-3}$. As we shall see below, when $r = O(1)$, it suffices to have $m \ge C \mu_B^2 n \log^2 n$.

Remark. For large values of the rank, the assumption that the $\ell_\infty$ norm of the singular vectors is $O(1/\sqrt{n})$ is not sufficient to conclude that (1.8) holds with $\mu_1 = O(\sqrt{\log n})$.
Thus, the extra randomization step (in which we select the $r$ singular vectors from a list of $n$ possible vectors) is in some sense necessary. As an example, take $[u_1, \ldots, u_r]$ to be the first $r$ columns of the Hadamard transform, where each row corresponds to a frequency. Then $\|u_k\|_{\ell_\infty} \le 1/\sqrt{n}$, but if $r \le n/2$, the first two rows of $[u_1, \ldots, u_r]$ are identical. Hence

$$ \langle P_U e_1, P_U e_2 \rangle = r/n. $$

Obviously, this does not scale like $\sqrt{r}/n$. Similarly, the sign flip (step 2) is also necessary as otherwise we could have $E = P_U$, as in the case where $[u_1, \ldots, u_n] = [v_1, \ldots, v_n]$ and the same columns are selected. Here,

$$ \max_a E_{aa} = \max_a \|P_U e_a\|^2 \ge \frac{1}{n} \sum_a \|P_U e_a\|^2 = \frac{r}{n}, $$

which does not scale like $\sqrt{r}/n$ either.

1.5.2 Low-rank low-coherence model

When the rank is small, the assumption that the singular vectors are spread out is sufficient to show that the parameter $\mu$ is small. To see this, suppose that the singular vectors obey (1.13). Then

$$ \Big| \langle P_U e_a, P_U e_{a'} \rangle - 1_{a = a'} \frac{r}{n} \Big| \le \max_{a \in [n]} \|P_U e_a\|^2 \le \mu_B \frac{r}{n}. \qquad (1.15) $$

The first inequality follows from the Cauchy-Schwarz inequality

$$ |\langle P_U e_a, P_U e_{a'} \rangle| \le \|P_U e_a\| \|P_U e_{a'}\| $$

for $a \ne a'$ and from the Frobenius norm bound

$$ \max_{a \in [n]} \|P_U e_a\|^2 \ge \frac{1}{n} \|P_U\|_F^2 = \frac{r}{n}. $$

This gives $\mu_1 \le \mu_B \sqrt{r}$. Also, by another application of Cauchy-Schwarz we have

$$ |E_{ab}| \le \max_{a \in [n]} \|P_U e_a\| \max_{b \in [n]} \|P_V e_b\| \le \mu_B \frac{r}{n}, \qquad (1.16) $$

so that we also have $\mu_2 \le \mu_B \sqrt{r}$. In short, $\mu \le \mu_B \sqrt{r}$.

Our low-rank low-coherence model assumes that $r = O(1)$ and that the singular vectors obey (1.13). When $\mu_B = O(1)$, this model obeys the strong incoherence property with $\mu = O(1)$. In this case, Theorem 1.1 specializes as follows:

Corollary 1.5 (Matrix completion, low-rank low-coherence model) Let $M$ be a matrix of bounded rank ($r = O(1)$) whose singular vectors obey (1.13). Under the hypotheses of Theorem 1.1, if

$$ m \ge C\, \mu_B^2 n \log^2 n, $$

then $M$ is the unique solution to (1.3) with probability at least $1 - n^{-3}$.

1.5.3 Random orthogonal model

Our last model is borrowed from [7] and assumes that the column matrices $[u_1, \ldots, u_r]$ and $[v_1, \ldots, v_r]$ are independent random orthogonal matrices, with no assumptions whatsoever on the singular values $\sigma_1, \ldots, \sigma_r$. Note that this is a special case of the uniformly bounded model since this is equivalent to selecting two $n \times n$ random orthonormal bases, and then selecting the singular vectors as in Section 1.5.1. Since the maximum entry of an $n \times n$ random orthogonal matrix is bounded by a constant times $\sqrt{\log n / n}$ with large probability, Section 1.5.1 shows that this model obeys the strong incoherence property with $\mu = O(\log n)$. Theorems 1.1 and 1.2 then give

Corollary 1.6 (Matrix completion, random orthogonal model) Let $M$ be a matrix sampled from the random orthogonal model. Under the hypotheses of Theorem 1.1, if

$$ m \ge C\, n r \log^8 n, $$

then $M$ is the unique solution to (1.3) with probability at least $1 - n^{-3}$. The exponent 8 can be lowered to 7 when $r \ge \log n$ and to 6 when $r = O(1)$.

As mentioned earlier, we have a lower bound $m \ge 2nr - r^2$ for matrix completion, which can be improved to $m \ge C n r \log n$ under reasonable hypotheses on the matrix $M$. Thus, the hypothesis on $m$ in Corollary 1.6 cannot be substantially improved.
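As a numerical illustration of the last two models (a sketch only, not part of the paper; generating approximately Haar-distributed orthogonal matrices via QR factorizations of Gaussian matrices is our own choice, and all parameters are arbitrary), one can run the two-step recipe of Section 1.5.1 with random orthogonal families and inspect the resulting strong incoherence parameters.

import numpy as np

rng = np.random.default_rng(3)
n, r = 512, 8

# Orthonormal families obeying (1.13) up to a log factor: QR of Gaussian matrices
# (Haar up to column signs, which do not matter for the statistics below).
u_family = np.linalg.qr(rng.standard_normal((n, n)))[0]
v_family = np.linalg.qr(rng.standard_normal((n, n)))[0]
print("max |entry| / sqrt(log n / n):", np.abs(u_family).max() / np.sqrt(np.log(n) / n))

# Step 1: select r columns at random (without replacement) from each family.
alpha = rng.choice(n, size=r, replace=False)
beta = rng.choice(n, size=r, replace=False)
U, V = u_family[:, alpha], v_family[:, beta]

# Step 2: independent random signs and arbitrary distinct positive singular values.
eps = rng.choice([-1.0, 1.0], size=r)
sigma = np.arange(r, 0, -1).astype(float)
M = (U * (eps * sigma)) @ V.T

# Strong incoherence parameters (1.8)-(1.9) of the resulting M.
PU, PV = U @ U.T, V @ V.T
E = (U * eps) @ V.T
mu1 = max(np.abs(PU - (r / n) * np.eye(n)).max(),
          np.abs(PV - (r / n) * np.eye(n)).max()) * n / np.sqrt(r)
mu2 = np.abs(E).max() * n / np.sqrt(r)
print(f"mu1 = {mu1:.2f}, mu2 = {mu2:.2f}, sqrt(log n) = {np.sqrt(np.log(n)):.2f}, log n = {np.log(n):.2f}")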
However, it is likely that by specializing the proofs of our general results (Theorems 1.1 and 1.2) to this special case, one may be able to improve the power of the logarithm here, though it seems that a substantial effort would be needed to reach the optimal level of $nr \log n$ even in the bounded rank case.

Speaking of logarithmic improvements, we have shown that $\mu = O(\log n)$, which is sharp since for $r = 1$ one cannot hope for better estimates. For $r$ much larger than $\log n$, however, one can improve this to $\mu = O(\sqrt{\log n})$. As far as $\mu_1$ is concerned, this is essentially a consequence of the Johnson-Lindenstrauss lemma. For $a \ne a'$, write

$$ \langle P_U e_a, P_U e_{a'} \rangle = \frac{1}{4} \Big[ \|P_U e_a + P_U e_{a'}\|^2 - \|P_U e_a - P_U e_{a'}\|^2 \Big]. $$

We claim that for each $a \ne a'$,

$$ \Big| \|P_U (e_a \pm e_{a'})\|^2 - \frac{2r}{n} \Big| \le C \frac{\sqrt{r \log n}}{n} \qquad (1.17) $$

with probability at least $1 - n^{-5}$, say. This inequality is indeed well known. Observe that $\|P_U x\|$ has the same distribution as the Euclidean norm of the first $r$ components of a vector uniformly distributed on the $(n-1)$-dimensional sphere of radius $\|x\|$. Then we have [4]: the event

$$ \sqrt{\tfrac{r}{n}}\, (1 - \epsilon) \|x\| \le \|P_U x\| \le \sqrt{\tfrac{r}{n}}\, (1 - \epsilon)^{-1} \|x\| $$

fails to hold with probability at most $2 e^{-\epsilon^2 r/4} + 2 e^{-\epsilon^2 n/4}$. Choosing $x = e_a \pm e_{a'}$, $\epsilon = C' \sqrt{\log n / r}$, and applying the union bound proves the claim as long as $r$ is sufficiently larger than $\log n$. Finally, since a bound on the diagonal term $\|P_U e_a\|^2 - r/n$ in (1.8) follows from the same inequality by simply choosing $x = e_a$, we have $\mu_1 = O(\sqrt{\log n})$. Similar arguments for $\mu_2$ exist but we forgo the details.

1.6 Comparison with other works

1.6.1 Nuclear norm minimization

The mathematical study of matrix completion began with [7], which made slightly different incoherence assumptions than in this paper. Namely, let us say that the matrix $M$ obeys the incoherence property with a parameter $\mu_0 > 0$ if

$$ \|P_U e_a\|^2 \le \mu_0 \frac{r}{n_1}, \qquad \|P_V e_b\|^2 \le \mu_0 \frac{r}{n_2} \qquad (1.18) $$

for all $a \in [n_1]$, $b \in [n_2]$. Again, this implies $\mu_0 \ge 1$. In [7] it was shown that if a fixed matrix $M$ obeys the incoherence property with parameter $\mu_0$, then nuclear norm minimization succeeds with large probability if

$$ m \ge C\, \mu_0 n^{6/5} r \log n, \qquad (1.19) $$

provided that $\mu_0 r \le n^{1/5}$. Now consider a matrix $M$ obeying the strong incoherence property with $\mu = O(1)$. Then since $\mu_0 \ge 1$, (1.19) guarantees exact reconstruction only if $m \ge C n^{6/5} r \log n$ (and $r = O(n^{1/5})$), while our results only need $nr\,\mathrm{polylog}(n)$ samples. Hence, our results provide a substantial improvement over (1.19), at least in the regime which permits minimal sampling.

We would like to note that there are obvious relationships between the best incoherence parameter $\mu_0$ and the best strong incoherence parameters $\mu_1$, $\mu_2$ for a given matrix $M$, which we take to be square for simplicity. On the one hand, (1.8) implies that

$$ \|P_U e_a\|^2 \le \frac{r}{n} + \mu_1 \frac{\sqrt{r}}{n}, $$

so that one can take $\mu_0 \le 1 + \mu_1 / \sqrt{r}$. This shows that one can apply results from the incoherence model (in which we only know (1.18)) to our model (in which we assume strong incoherence). On the other hand,

$$ |\langle P_U e_a, P_U e_{a'} \rangle| \le \|P_U e_a\| \|P_U e_{a'}\| \le \mu_0 \frac{r}{n}, $$

so that $\mu_1 \le \mu_0 \sqrt{r}$. Similarly, $\mu_2 \le \mu_0 \sqrt{r}$, so that one can transfer results in the other direction as well.

We would like to mention another important paper [20], inspired by compressed sensing, which also recovers low-rank matrices from partial information.
The model in [20], however, assumes some sort of Gaussian measurements and is completely different from the completion problem discussed in this paper.

1.6.2 Spectral methods

An interesting new approach to the matrix completion problem has been recently introduced in [13]. This algorithm starts by trimming each row and column with too few entries; i.e. one replaces the entries in those rows and columns by zero. Then one computes the SVD of the trimmed matrix and truncates it so as to keep only the top $r$ singular values (note that one would need to know $r$ a priori). Then under some conditions (including the incoherence property (1.18) with $\mu = O(1)$), this work shows that accurate recovery is possible from a minimal number of samples, namely, on the order of $nr \log n$ samples. Having said this, this work is not directly comparable to ours because it operates in a different regime. Firstly, the results are asymptotic and are valid in a regime where the dimensions of the matrix tend to infinity in a fixed ratio, while ours are not. Secondly, there is a strong assumption about the range of singular values the unknown matrix can take on, while we make no such assumption; they must be clustered so that no singular value can be too large or too small compared to the others. Finally, this work only shows approximate recovery, not exact recovery as we do here, although exact recovery results have been announced. This work is of course very interesting because it may show that methods other than convex optimization can also achieve minimal sampling bounds.

1.7 Lower bounds

We would like to conclude the tour of the results introduced in this paper with a simple lower bound, which highlights the fundamental role played by the coherence in controlling what is information-theoretically possible.

Theorem 1.7 (Lower bound, Bernoulli model) Fix $1 \le m, r \le n$ and $\mu_0 \ge 1$, let $0 < \delta < 1/2$, and suppose that we do not have the condition

$$ -\log\Big(1 - \frac{m}{n^2}\Big) \ge \frac{\mu_0 r}{n} \log\Big(\frac{n}{2\delta}\Big). \qquad (1.20) $$

Then there exist infinitely many pairs of distinct $n \times n$ matrices $M \ne M'$ of rank at most $r$ and obeying the incoherence property (1.18) with parameter $\mu_0$ such that $P_\Omega(M) = P_\Omega(M')$ with probability at least $\delta$. Here, each entry is observed with probability $p = m/n^2$ independently from the others.

Clearly, even if one knows the rank and the coherence of a matrix ahead of time, no algorithm can be guaranteed to succeed based on the knowledge of $P_\Omega(M)$ only, since there are many candidates which are consistent with these data. We prove this theorem in Section 2. Informally, Theorem 1.7 asserts that (1.20) is a necessary condition for matrix completion to work with high probability if all we know about the matrix $M$ is that it has rank at most $r$ and the incoherence property with parameter $\mu_0$. When the right-hand side of (1.20) is less than $\epsilon < 1$, this implies

$$ m \ge \Big(1 - \frac{\epsilon}{2}\Big) \mu_0 n r \log\Big(\frac{n}{2\delta}\Big). \qquad (1.21) $$

Recall that the number of degrees of freedom of a rank-$r$ matrix is $2nr(1 - r/2n)$. Hence, to recover an arbitrary rank-$r$ matrix with the incoherence property with parameter $\mu_0$ with any decent probability by any method whatsoever, the minimum number of samples must be about the number of degrees of freedom times $\mu_0 \log n$; in other words, the oversampling factor is directly proportional to the coherence.
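Condition (1.20) is easy to evaluate numerically. The following sketch (illustrative values only, not from the paper) compares the number of samples it forces with the parameter count $2nr - r^2$ and with $nr \log n$ for a few arbitrary choices of $n$, $r$, and $\mu_0$.

import numpy as np

# Smallest m satisfying the necessary condition (1.20), i.e.
# m >= n^2 * (1 - exp(-(mu0 * r / n) * log(n / (2 * delta)))).
def lower_bound_m(n, r, mu0, delta=0.1):
    return n ** 2 * (1.0 - np.exp(-(mu0 * r / n) * np.log(n / (2 * delta))))

for n, r, mu0 in [(1000, 5, 1.0), (1000, 5, 5.0), (10000, 10, 1.0)]:
    print(f"n={n}, r={r}, mu0={mu0}: "
          f"2nr-r^2 = {2 * n * r - r * r}, "
          f"nr log n = {int(n * r * np.log(n))}, "
          f"(1.20) needs m >= {int(lower_bound_m(n, r, mu0))}")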
Since $\mu_0 \ge 1$, this justifies our earlier assertions that $nr \log n$ samples are really needed.

In the Bernoulli model used in Theorem 1.7, the number of entries is a binomial random variable sharply concentrating around its mean $m$. There is very little difference between this model and the uniform model, which assumes that $\Omega$ is sampled uniformly at random among all subsets of cardinality $m$. Results holding for one hold for the other with only very minor adjustments. Because we are concerned with essential difficulties, not technical ones, we will often prove our results using the Bernoulli model, and indicate how the results may easily be adapted to the uniform model.

1.8 Notation

Before continuing, we provide here a brief summary of the notation used throughout the paper. To simplify the notation, we shall work exclusively with square matrices, thus

$$ n_1 = n_2 = n. $$

The results for non-square matrices (with $n = \max(n_1, n_2)$) are proven in exactly the same fashion, but would add more subscripts to a notational system which is already quite complicated, and we will leave the details to the interested reader. We will also assume that $n \ge C$ for some sufficiently large absolute constant $C$, as our results are vacuous in the regime $n = O(1)$.

Throughout, we will always assume that $m$ is at least as large as $2nr$, thus

$$ 2r \le np, \qquad p := m/n^2. \qquad (1.22) $$

A variety of norms on matrices $X \in \mathbb{R}^{n \times n}$ will be discussed. The spectral norm (or operator norm) of a matrix is denoted by

$$ \|X\| := \sup_{x \in \mathbb{R}^n : \|x\| = 1} \|Xx\| = \sup_{1 \le j \le n} \sigma_j(X). $$

The Euclidean inner product between two matrices is defined by the formula $\langle X, Y \rangle := \operatorname{trace}(X^* Y)$, and the corresponding Euclidean norm, called the Frobenius norm or Hilbert-Schmidt norm, is denoted

$$ \|X\|_F := \langle X, X \rangle^{1/2} = \Big( \sum_{j=1}^n \sigma_j(X)^2 \Big)^{1/2}. $$

The nuclear norm of a matrix $X$ is denoted

$$ \|X\|_* := \sum_{j=1}^n \sigma_j(X). $$

For vectors, we will only consider the usual Euclidean $\ell_2$ norm, which we simply write as $\|x\|$.

Further, we will also manipulate linear transformations which act on the space $\mathbb{R}^{n \times n}$ of matrices, such as $P_\Omega$, and we will use calligraphic letters for these operators, as in $\mathcal{A}(X)$. In particular, the identity operator on this space will be denoted by $\mathcal{I} : \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}$, and should not be confused with the identity matrix $I \in \mathbb{R}^{n \times n}$. The only norm we will consider for these operators is their spectral norm (the top singular value)

$$ \|\mathcal{A}\| := \sup_{X : \|X\|_F \le 1} \|\mathcal{A}(X)\|_F. $$

Thus for instance $\|P_\Omega\| = 1$.

We use the usual asymptotic notation, for instance writing $O(M)$ to denote a quantity bounded in magnitude by $CM$ for some absolute constant $C > 0$. We will sometimes raise such notation to some power, for instance $O(M)^M$ would denote a quantity bounded in magnitude by $(CM)^M$ for some absolute constant $C > 0$. We also write $X \lesssim Y$ for $X = O(Y)$, and $\mathrm{poly}(X)$ for $O(1 + |X|)^{O(1)}$.

We use $1_E$ to denote the indicator function of an event $E$; e.g. $1_{a = a'}$ equals 1 when $a = a'$ and 0 when $a \ne a'$. If $A$ is a finite set, we use $|A|$ to denote its cardinality.

We record some (standard) conventions involving empty sets. The set $[n] := \{1, \ldots, n\}$ is understood to be the empty set when $n = 0$. We also make the usual conventions that an empty sum $\sum_{x \in \emptyset} f(x)$ is zero, and an empty product $\prod_{x \in \emptyset} f(x)$ is one. Note however that a $k$-fold sum such as
$\sum_{a_1, \ldots, a_k \in [n]} f(a_1, \ldots, a_k)$ does not vanish when $k = 0$, but is instead equal to a single summand $f()$ with the empty tuple $() \in [n]^0$ as the input; thus for instance the identity

$$ \sum_{a_1, \ldots, a_k \in [n]} \prod_{i=1}^k f(a_i) = \Big( \sum_{a \in [n]} f(a) \Big)^k $$

is valid both for positive integers $k$ and for $k = 0$ (and both for non-zero $f$ and for zero $f$, recalling of course that $0^0 = 1$). We will refer to sums over the empty tuple as trivial sums to distinguish them from empty sums.

2 Lower bounds

This section proves Theorem 1.7, which asserts that no method can recover an arbitrary $n \times n$ matrix of rank $r$ and coherence at most $\mu_0$ unless the number of random samples obeys (1.20). As stated in the theorem, we establish lower bounds for the Bernoulli model, which then apply to the model where exactly $m$ entries are selected uniformly at random; see the Appendix for details.

It may be best to consider a simple example first to understand the main idea behind the proof of Theorem 1.7. Suppose that $r = 1$ and $\mu_0 > 1$, in which case $M = xy^*$. For simplicity, suppose that $y$ is fixed, say $y = (1, \ldots, 1)$, and $x$ is chosen arbitrarily from the cube $[1, \sqrt{\mu_0}]^n$ of $\mathbb{R}^n$. One easily verifies that $M$ obeys the coherence property with parameter $\mu_0$ (and in fact also obeys the strong incoherence property with a comparable parameter). Then to recover $M$, we need to see at least one entry per row. For instance, if the first row is unsampled, one has no information about the first coordinate $x_1$ of $x$ other than that it lies in $[1, \sqrt{\mu_0}]$, and so the claim follows in this case by varying $x_1$ along the infinite set $[1, \sqrt{\mu_0}]$.

Now under the Bernoulli model, the number of observed entries in the first row (and in any fixed row or column) is a binomial random variable with a number of trials equal to $n$ and a probability of success equal to $p$. Therefore, the probability $\pi_0$ that any given row is unsampled is equal to $\pi_0 = (1 - p)^n$. By independence, the probability that all rows are sampled at least once is $(1 - \pi_0)^n$, and any method succeeding with probability greater than $1 - \delta$ would need

$$ (1 - \pi_0)^n \ge 1 - \delta, $$

or

$$ -n \pi_0 \ge n \log(1 - \pi_0) \ge \log(1 - \delta). $$

When $\delta < 1/2$, $\log(1 - \delta) \ge -2\delta$ and thus, any method would need

$$ \pi_0 \le \frac{2\delta}{n}. $$

This is the desired conclusion when $\mu_0 > 1$, $r = 1$.

This type of simple analysis easily extends to general values of the rank $r$ and of the coherence. Without loss of generality, assume that $\ell := \frac{n}{\mu_0 r}$ is an integer, and consider a (self-adjoint) $n \times n$ matrix $M$ of rank $r$ of the form

$$ M := \sum_{k=1}^r \sigma_k u_k u_k^*, $$

where the $\sigma_k$ are drawn arbitrarily from $[0, 1]$ (say), and the singular vectors $u_1, \ldots, u_r$ are defined as follows:

$$ u_k := \sqrt{\tfrac{1}{\ell}} \sum_{i \in B_k} e_i, \qquad B_k = \{(k-1)\ell + 1, (k-1)\ell + 2, \ldots, k\ell\}; $$

that is to say, $u_k$ vanishes everywhere except on a support of $\ell$ consecutive indices. Clearly, this matrix is incoherent with parameter $\mu_0$. Because the supports of the singular vectors are disjoint, $M$ is a block-diagonal matrix with diagonal blocks of size $\ell \times \ell$. We now argue as before. Recovery with positive probability is impossible unless we have sampled at least one entry per row of each diagonal block, since otherwise we would be forced to guess at least one of the $\sigma_k$ based on no information (other than that $\sigma_k$ lies in $[0, 1]$), and the theorem will follow by varying this singular value.
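(Before completing the argument, here is a quick simulation of the rank-one case above; it is an illustration only, with arbitrary parameter choices. Under the Bernoulli model the rows are sampled independently, so the empirical frequency of an empty row should agree, up to sampling error, with $1 - (1 - \pi_0)^n$.)

import numpy as np

rng = np.random.default_rng(5)
n, trials = 100, 2000
for c in [0.5, 1.0, 2.0]:
    p = c * np.log(n) / n                  # sampling probability around the log(n)/n threshold
    pi0 = (1 - p) ** n                     # probability that a fixed row is unsampled
    miss = 0
    for _ in range(trials):
        mask = rng.random((n, n)) < p
        miss += (mask.sum(axis=1) == 0).any()
    print(f"p = {c:.1f} log(n)/n: empirical P(some empty row) = {miss / trials:.3f}, "
          f"1 - (1 - pi0)^n = {1 - (1 - pi0) ** n:.3f}")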
Now the probability $\pi_1$ that the first row of the first block (and any fixed row of any fixed block) is unsampled is equal to $(1 - p)^\ell$. Therefore, any method succeeding with probability greater than $1 - \delta$ would need

$$ (1 - \pi_1)^n \ge 1 - \delta, $$

which implies $\pi_1 \le 2\delta/n$ just as before. With $\pi_1 = (1 - p)^\ell$, this gives (1.20) under the Bernoulli model. The second part of the theorem, namely (1.21), follows from the equivalent characterization

$$ m \ge n^2 \Big( 1 - e^{-\frac{\mu_0 r}{n} \log(n/2\delta)} \Big), $$

together with $1 - e^{-x} > x - x^2/2$ whenever $x \ge 0$.

3 Strategy and Novelty

This section outlines the strategy for proving our main results, Theorems 1.1 and 1.2. The proofs of these theorems are the same up to a point where the arguments to estimate the moments of a certain random matrix differ. In this section, we present the common part of the proof, leading to two key moment estimates, while the proofs of these crucial estimates are the object of later sections.

One can of course prove our claims for the Bernoulli model with $p = m/n^2$ and transfer the results to the uniform model, by using the arguments in the Appendix. For example, the probability that the recovery via (1.3) is not exact is at most twice that under the Bernoulli model.

3.1 Duality

We begin by recalling some calculations from [7, Section 3]. From standard duality theory, we know that the correct matrix $M \in \mathbb{R}^{n \times n}$ is a solution to (1.3) if and only if there exists a dual certificate $Y \in \mathbb{R}^{n \times n}$ with the property that $P_\Omega(Y)$ is a subgradient of the nuclear norm at $M$, which we write as

$$ P_\Omega(Y) \in \partial \|M\|_*. \qquad (3.1) $$

We recall the projection matrices $P_U$, $P_V$ and the companion matrix $E$ defined by (1.6), (1.7). It is known [15, 24] that

$$ \partial \|M\|_* = \big\{ E + W : W \in \mathbb{R}^{n \times n},\ P_U W = 0,\ W P_V = 0,\ \|W\| \le 1 \big\}. \qquad (3.2) $$

There is a more compact way to write (3.2). Let $T \subset \mathbb{R}^{n \times n}$ be the span of matrices of the form $u_k y^*$ and $x v_k^*$, and let $T^\perp$ be its orthogonal complement. Let $P_T : \mathbb{R}^{n \times n} \to T$ be the orthogonal projection onto $T$; one easily verifies the explicit formula

$$ P_T(X) = P_U X + X P_V - P_U X P_V, \qquad (3.3) $$

and notes that the complementary projection $P_{T^\perp} := \mathcal{I} - P_T$ is given by the formula

$$ P_{T^\perp}(X) = (I - P_U) X (I - P_V). \qquad (3.4) $$

In particular, $P_{T^\perp}$ is a contraction:

$$ \|P_{T^\perp}\| \le 1. \qquad (3.5) $$

Then $Z \in \partial \|M\|_*$ if and only if

$$ P_T(Z) = E \quad \text{and} \quad \|P_{T^\perp}(Z)\| \le 1. $$

With these preliminaries in place, [7] establishes the following result.

Lemma 3.1 (Dual certificate implies matrix completion) Let the notation be as above. Suppose that the following two conditions hold:

1. There exists $Y \in \mathbb{R}^{n \times n}$ obeying
(a) $P_\Omega(Y) = Y$,
(b) $P_T(Y) = E$, and
(c) $\|P_{T^\perp}(Y)\| < 1$.

2. The restriction $P_\Omega\big|_T : T \to P_\Omega(\mathbb{R}^{n \times n})$ of the (sampling) operator $P_\Omega$ to $T$ is injective.

Then $M$ is the unique solution to the convex program (1.3).

Proof See [7, Lemma 3.1].

The second sufficient condition, namely the injectivity of the restriction of $P_\Omega$ to $T$, has been studied in [7]. We recall a useful result.

Theorem 3.2 (Rudelson selection estimate) [7, Theorem 4.1] Suppose $\Omega$ is sampled according to the Bernoulli model and put $n := \max(n_1, n_2)$. Assume that $M$ obeys (1.18).
Then there is a numerical constant $C_R$ such that for all $\beta > 1$, we have the bound

$$ p^{-1} \| P_T P_\Omega P_T - p P_T \| \le a \qquad (3.6) $$

with probability at least $1 - 3 n^{-\beta}$ provided that $a < 1$, where $a$ is the quantity

$$ a := C_R \sqrt{\frac{\mu_0 n r (\beta \log n)}{m}}. \qquad (3.7) $$

We will apply this theorem with $\beta := 4$ (say). The statement (3.6) is stronger than the injectivity of the restriction of $P_\Omega$ to $T$. Indeed, take $m$ sufficiently large so that $a < 1$. Then if $X \in T$, we have

$$ \| P_T P_\Omega(X) - p X \|_F < a p \|X\|_F, $$

and obviously, $P_\Omega(X)$ cannot vanish unless $X = 0$.

In order for the condition $a < 1$ to hold, we must have

$$ m \ge C_0' \mu_0 n r \log n \qquad (3.8) $$

for a suitably large constant $C_0'$. But this follows from the hypotheses in either Theorem 1.1 or Theorem 1.2, for reasons that we now pause to explain. In either of these theorems we have

$$ m \ge C_1 \mu n r \log n \qquad (3.9) $$

for some large constant $C_1$. Recall from Section 1.6.1 that $\mu_0 \le 1 + \mu_1/\sqrt{r} \le 1 + \mu/\sqrt{r}$, and so (3.9) implies (3.8) whenever $\mu_0 \ge 2$ (say). When $\mu_0 < 2$, we can also deduce (3.8) from (3.9) by applying the trivial bound $\mu \ge 1$ noted in the introduction.

In summary, to prove Theorem 1.1 or Theorem 1.2, it suffices (under the hypotheses of these theorems) to exhibit a dual matrix $Y$ obeying the first sufficient condition of Lemma 3.1, with probability at least $1 - n^{-3}/2$ (say). This is the objective of the remaining sections of the paper.

3.2 The dual certificate

Whenever the map $P_\Omega\big|_T : T \to P_\Omega(\mathbb{R}^{n \times n})$ restricted to $T$ is injective, the linear map

$$ T \to T, \qquad X \mapsto P_T P_\Omega P_T(X) $$

is invertible, and we denote its inverse by $(P_T P_\Omega P_T)^{-1} : T \to T$. Introduce the dual matrix $Y \in P_\Omega(\mathbb{R}^{n \times n}) \subset \mathbb{R}^{n \times n}$ defined via

$$ Y = P_\Omega P_T (P_T P_\Omega P_T)^{-1} E. \qquad (3.10) $$

By construction, $P_\Omega(Y) = Y$, $P_T(Y) = E$ and, therefore, we will establish that $M$ is the unique minimizer if one can show that

$$ \| P_{T^\perp}(Y) \| < 1. \qquad (3.11) $$

The dual matrix $Y$ would then certify that $M$ is the unique solution, and this is the reason why we will refer to $Y$ as a candidate certificate. This certificate was also used in [7].

Before continuing, we would like to offer a little motivation for the choice of the dual matrix $Y$. It is not difficult to check that (3.10) is actually the solution to the following problem:

$$ \text{minimize } \|Z\|_F \quad \text{subject to } P_T P_\Omega(Z) = E. $$

Note that by the Pythagorean identity, $Y$ obeys

$$ \|Y\|_F^2 = \|P_T(Y)\|_F^2 + \|P_{T^\perp}(Y)\|_F^2 = r + \|P_{T^\perp}(Y)\|_F^2. $$

The interpretation is now clear: among all matrices obeying $P_\Omega(Z) = Z$ and $P_T(Z) = E$, $Y$ is that element which minimizes $\|P_{T^\perp}(Z)\|_F$. By forcing the Frobenius norm of $P_{T^\perp}(Y)$ to be small, it is reasonable to expect that its spectral norm will be sufficiently small as well. In that sense, $Y$ defined via (3.10) is a very suitable candidate. Even though this is a different problem, our candidate certificate resembles, and is inspired by, that constructed in [8] to show that $\ell_1$ minimization recovers sparse vectors from minimally sampled data.

3.3 The Neumann series

We now develop a useful formula for the candidate certificate, and begin by introducing a normalized version $Q_\Omega : \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}$ of $P_\Omega$, defined by the formula

$$ Q_\Omega := \frac{1}{p} P_\Omega - \mathcal{I}, \qquad (3.12) $$

where $\mathcal{I} : \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}$ is the identity operator on matrices (not the identity matrix $I \in \mathbb{R}^{n \times n}$!). Note that with the Bernoulli model for selecting $\Omega$, $Q_\Omega$ has expectation zero.
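On a small toy instance, all of the operators above can be written out explicitly as $n^2 \times n^2$ matrices acting on vec(X), which makes it easy to evaluate the quantity bounded in (3.6) and the candidate certificate (3.10) directly. The sketch below is an illustration only: the instance, the sampling probability, and the use of a pseudoinverse in place of the inverse of $P_T P_\Omega P_T$ on $T$ are our own choices, and whether (3.11) holds on a given draw of $\Omega$ will of course vary.

import numpy as np

rng = np.random.default_rng(7)
n, r, p = 25, 2, 0.5
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
U, s, Vt = np.linalg.svd(M)
U, V = U[:, :r], Vt[:r, :].T
PU, PV, E = U @ U.T, V @ V.T, U @ V.T
mask = rng.random((n, n)) < p
I_n = np.eye(n)

# Operators on R^{n x n} as n^2 x n^2 matrices acting on vec(X) (column-major),
# using vec(A X B) = kron(B^T, A) vec(X).
T_op = np.kron(I_n, PU) + np.kron(PV, I_n) - np.kron(PV, PU)   # P_T, cf. (3.3)
Tp_op = np.eye(n * n) - T_op                                    # P_{T perp}
Om_op = np.diag(mask.flatten(order="F").astype(float))          # P_Omega

# Theorem 3.2: the quantity p^{-1} ||P_T P_Omega P_T - p P_T|| should be small.
print("empirical value of a in (3.6):",
      np.linalg.norm(T_op @ Om_op @ T_op / p - T_op, 2))

# Candidate certificate (3.10); the pseudoinverse inverts P_T P_Omega P_T on T.
e_vec = E.flatten(order="F")
Y_vec = Om_op @ T_op @ np.linalg.pinv(T_op @ Om_op @ T_op) @ e_vec
print("||P_T(Y) - E||_F     :", np.linalg.norm(T_op @ Y_vec - e_vec))
print("||P_Omega(Y) - Y||_F :", np.linalg.norm(Om_op @ Y_vec - Y_vec))
print("||P_Tperp(Y)||       :",
      np.linalg.norm((Tp_op @ Y_vec).reshape((n, n), order="F"), 2))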
From (3.12) we have $P_T P_\Omega P_T = p\, P_T (\mathcal{I} + Q_\Omega) P_T$, and owing to Theorem 3.2, one can write $(P_T P_\Omega P_T)^{-1}$ as the convergent Neumann series

$$ p (P_T P_\Omega P_T)^{-1} = \sum_{k \ge 0} (-1)^k (P_T Q_\Omega P_T)^k. $$

From the identity $P_{T^\perp} P_T = 0$ we conclude that $P_{T^\perp} P_\Omega P_T = p\, P_{T^\perp} Q_\Omega P_T$. One can therefore express the candidate certificate $Y$ (3.10) as

$$ P_{T^\perp}(Y) = \sum_{k \ge 0} (-1)^k P_{T^\perp} Q_\Omega (P_T Q_\Omega P_T)^k (E) = \sum_{k \ge 0} (-1)^k P_{T^\perp} (Q_\Omega P_T)^k Q_\Omega(E), $$

where we have used $P_T^2 = P_T$ and $P_T(E) = E$. By the triangle inequality and (3.5), it thus suffices to show that

$$ \sum_{k \ge 0} \| (Q_\Omega P_T)^k Q_\Omega(E) \| < 1 $$

with probability at least $1 - n^{-3}/2$.

It is not hard to bound the tail of the series thanks to Theorem 3.2. First, this theorem bounds the spectral norm of $P_T Q_\Omega P_T$ by the quantity $a$ in (3.7). This gives that for each $k \ge 1$,

$$ \| (P_T Q_\Omega P_T)^k (E) \|_F < a^k \|E\|_F = a^k \sqrt{r} $$

and, therefore,

$$ \| (Q_\Omega P_T)^k Q_\Omega(E) \|_F = \| Q_\Omega P_T (P_T Q_\Omega P_T)^k (E) \|_F \le \| Q_\Omega P_T \|\, a^k \sqrt{r}. $$

Second, this theorem also bounds $\|Q_\Omega P_T\|$ (recall that this is the spectral norm) since

$$ \|Q_\Omega P_T\|^2 = \max_{\|X\|_F \le 1} \langle Q_\Omega P_T(X), Q_\Omega P_T(X) \rangle = \max_{\|X\|_F \le 1} \langle X, P_T Q_\Omega^2 P_T(X) \rangle. $$

Expanding the identity $P_\Omega^2 = P_\Omega$ in terms of $Q_\Omega$, we obtain

$$ Q_\Omega^2 = \frac{1}{p} \big[ (1 - 2p) Q_\Omega + (1 - p) \mathcal{I} \big], \qquad (3.13) $$

and thus, for all $\|X\|_F \le 1$,

$$ p\, \langle X, P_T Q_\Omega^2 P_T(X) \rangle = (1 - 2p) \langle X, P_T Q_\Omega P_T(X) \rangle + (1 - p) \|P_T(X)\|_F^2 \le a + 1. $$

Hence $\|Q_\Omega P_T\| \le \sqrt{(a + 1)/p}$. For each $k_0 \ge 0$, this gives

$$ \sum_{k \ge k_0} \| (Q_\Omega P_T)^k Q_\Omega(E) \|_F \le \sqrt{\frac{3r}{2p}} \sum_{k \ge k_0} a^k \le \sqrt{\frac{6r}{p}}\, a^{k_0}, $$

provided that $a < 1/2$. With $p = m/n^2$ and $a$ defined by (3.7) with $\beta = 4$, we have

$$ \sum_{k \ge k_0} \| (Q_\Omega P_T)^k Q_\Omega(E) \|_F \le \sqrt{n} \times O\Big( \frac{\mu_0 n r \log n}{m} \Big)^{\frac{k_0 + 1}{2}} $$

with probability at least $1 - n^{-4}$. When $k_0 + 1 \ge \log n$, $n^{\frac{1}{k_0+1}} \le n^{\frac{1}{\log n}} = e$ and thus for each such $k_0$,

$$ \sum_{k \ge k_0} \| (Q_\Omega P_T)^k Q_\Omega(E) \|_F \le O\Big( \frac{\mu_0 n r \log n}{m} \Big)^{\frac{k_0 + 1}{2}} \qquad (3.14) $$

with the same probability.

To summarize this section, we conclude that since both our results assume that $m \ge c_0 \mu_0 n r \log n$ for some sufficiently large numerical constant $c_0$ (see the discussion at the end of Section 3.1), it now suffices to show that

$$ \sum_{k=0}^{\lfloor \log n \rfloor} \| (Q_\Omega P_T)^k Q_\Omega(E) \| \le \frac{1}{2} \qquad (3.15) $$

(say) with probability at least $1 - n^{-3}/4$ (say).

3.4 Centering

We have already normalized $P_\Omega$ to have "mean zero" in some sense by replacing it with $Q_\Omega$. Now we perform a similar operation for the projection $P_T : X \mapsto P_U X + X P_V - P_U X P_V$. The eigenvalues of $P_T$ are centered around

$$ \rho_0 := \operatorname{trace}(P_T)/n^2 = 2\rho - \rho^2, \qquad \rho := r/n, \qquad (3.16) $$

as follows from the fact that $P_T$ is an orthogonal projection onto a space of dimension $2nr - r^2$. Therefore, we simply split $P_T$ as

$$ P_T = Q_T + \rho_0 \mathcal{I}, \qquad (3.17) $$

so that the eigenvalues of $Q_T$ are centered around zero. From now on, $\rho$ and $\rho_0$ will always be the numbers defined above.

Lemma 3.3 (Replacing $P_T$ with $Q_T$) Let $0 < \sigma < 1$. Consider the event such that

$$ \| (Q_\Omega Q_T)^k Q_\Omega(E) \| \le \sigma^{\frac{k+1}{2}} \qquad \text{for all } 0 \le k < k_0. \qquad (3.18) $$

Then on this event, we have that for all $0 \le k < k_0$,

$$ \| (Q_\Omega P_T)^k Q_\Omega(E) \| \le (1 + 4^{k+1})\, \sigma^{\frac{k+1}{2}}, \qquad (3.19) $$

provided that $8nr/m < \sigma^{3/2}$.

From (3.19) and the geometric series formula we obtain the corollary

$$ \sum_{k=0}^{k_0 - 1} \| (Q_\Omega P_T)^k Q_\Omega(E) \| \le \frac{5 \sqrt{\sigma}}{1 - 4\sqrt{\sigma}}. \qquad (3.20) $$

Let $\sigma_0$ be such that the right-hand side is less than 1/4, say.
Applying this with $\sigma = \sigma_0$, we conclude that to prove (3.15) with probability at least $1 - n^{-3}/4$, it suffices by the union bound to show that (3.18) holds for this value of $\sigma$. (Note that the hypothesis $8nr/m < \sigma^{3/2}$ follows from the hypotheses in either Theorem 1.1 or Theorem 1.2.)

Lemma 3.3, which is proven in the Appendix, is useful because the operator $Q_T$ is easier to work with than $P_T$ in the sense that it is more homogeneous and obeys better estimates. If we split the projections $P_U$, $P_V$ as

$$ P_U = \rho I + Q_U, \qquad P_V = \rho I + Q_V, \qquad (3.21) $$

then $Q_T$ obeys

$$ Q_T(X) = (1 - \rho) Q_U X + (1 - \rho) X Q_V - Q_U X Q_V. $$

Let $U_{a,a'}$, $V_{b,b'}$ denote the matrix elements of $Q_U$, $Q_V$:

$$ U_{a,a'} := \langle e_a, Q_U e_{a'} \rangle = \langle e_a, P_U e_{a'} \rangle - \rho\, 1_{a = a'}, \qquad (3.22) $$

and similarly for $V_{b,b'}$. The coefficients $c_{ab,a'b'}$ of $Q_T$ obey

$$ c_{ab,a'b'} := \langle e_a e_b^*, Q_T(e_{a'} e_{b'}^*) \rangle = (1 - \rho) 1_{b = b'} U_{a,a'} + (1 - \rho) 1_{a = a'} V_{b,b'} - U_{a,a'} V_{b,b'}. \qquad (3.23) $$

An immediate consequence of this, under the assumptions (1.8), is the estimate

$$ |c_{ab,a'b'}| \lesssim (1_{a = a'} + 1_{b = b'}) \frac{\mu \sqrt{r}}{n} + \frac{\mu^2 r}{n^2}. \qquad (3.24) $$

When $\mu = O(1)$, these coefficients are bounded by $O(\sqrt{r}/n)$ when $a = a'$ or $b = b'$, while in contrast, if we stayed with $P_T$ rather than $Q_T$, the diagonal coefficients would be as large as $r/n$. However, our lemma states that bounding $\|(Q_\Omega Q_T)^k Q_\Omega(E)\|$ automatically bounds $\|(Q_\Omega P_T)^k Q_\Omega(E)\|$ by nearly the same quantity. This is the main advantage of replacing $P_T$ by $Q_T$ in our analysis.

3.5 Key estimates

To summarize the previous discussion, and in particular the bounds (3.20) and (3.14), we see everything reduces to bounding the spectral norm of $(Q_\Omega Q_T)^k Q_\Omega(E)$ for $k = 0, 1, \ldots, \lfloor \log n \rfloor$. Providing good upper bounds on these quantities is the crux of the argument. We use the moment method, controlling the spectral norm of a matrix by the trace of a high power of that matrix. We will prove two moment estimates which ultimately imply our two main results (Theorems 1.1 and 1.2) respectively. The first such estimate is as follows:

Theorem 3.4 (Moment bound I) Set $A = (Q_\Omega Q_T)^k Q_\Omega(E)$ for a fixed $k \ge 0$. Under the assumptions of Theorem 1.1, we have that for each $j > 0$,

$$ \mathbb{E}\operatorname{trace}(A^* A)^j = O\big(j(k+1)\big)^{2j(k+1)}\, n\, \Big( \frac{n r_\mu^2}{m} \Big)^{j(k+1)}, \qquad r_\mu := \mu^2 r, \qquad (3.25) $$

provided that $m \ge n r_\mu^2$ and $n \ge c_0\, j(k+1)$ for some numerical constant $c_0$.

By Markov's inequality, this result automatically estimates the norm of $(Q_\Omega Q_T)^k Q_\Omega(E)$ and immediately gives the following corollary.

Corollary 3.5 (Existence of dual certificate I) Under the assumptions of Theorem 1.1, the matrix $Y$ (3.10) is a dual certificate, and obeys $\|P_{T^\perp}(Y)\| \le 1/2$ with probability at least $1 - n^{-3}$, provided that $m$ obeys (1.10).

Proof Set $A = (Q_\Omega Q_T)^k Q_\Omega(E)$ with $k \le \log n$, and set $\sigma \le \sigma_0$. By Markov's inequality,

$$ \mathbb{P}\big( \|A\| \ge \sigma^{\frac{k+1}{2}} \big) \le \frac{\mathbb{E}\|A\|^{2j}}{\sigma^{j(k+1)}}. $$

Now choose $j > 0$ to be the smallest integer such that $j(k+1) \ge \log n$. Since $\|A\|^{2j} \le \operatorname{trace}(A^* A)^j$, Theorem 3.4 gives

$$ \mathbb{P}\big( \|A\| \ge \sigma^{\frac{k+1}{2}} \big) \le \gamma^{j(k+1)} \qquad \text{for some } \gamma = O\Big( (j(k+1))^2\, \frac{n r_\mu^2}{\sigma m} \Big), $$

where we have used the fact that $n^{\frac{1}{j(k+1)}} \le n^{\frac{1}{\log n}} = e$. Hence, if

$$ m \ge C_0 \frac{n r_\mu^2 (\log n)^2}{\sigma} \qquad (3.26) $$

for some numerical constant $C_0$, we have $\gamma < 1/4$ and

$$ \mathbb{P}\big( \| (Q_\Omega Q_T)^k Q_\Omega(E) \| \ge \sigma^{\frac{k+1}{2}} \big) \le n^{-4}. $$
Therefore,

$$ \bigcup_{0 \le k < \log n} \Big\{ \| (Q_\Omega Q_T)^k Q_\Omega(E) \| \ge \sigma^{\frac{k+1}{2}} \Big\} $$

has probability less than or equal to $n^{-4} \log n \le n^{-3}/2$ for $n \ge 2$. Since the corollary assumes $r = O(1)$, (3.26) together with (3.20) and (3.14) proves the claim thanks to our choice of $\sigma$.

Of course, Theorem 1.1 follows immediately from Corollary 3.5 and Lemma 3.1. In the same way, our second result (Theorem 1.2) follows from a more refined estimate stated below.

Theorem 3.6 (Moment bound II) Set $A = (Q_\Omega Q_T)^k Q_\Omega(E)$ for a fixed $k \ge 0$. Under the assumptions of Theorem 1.2, we have that for each $j > 0$ ($r_\mu$ is given in (3.25)),

$$ \mathbb{E}\operatorname{trace}(A^* A)^j \le \Big( \frac{(j(k+1))^6\, n r_\mu}{m} \Big)^{j(k+1)}, \qquad (3.27) $$

provided that $n \ge c_0\, j(k+1)$ for some numerical constant $c_0$.

Just as before, this theorem immediately implies the following corollary.

Corollary 3.7 (Existence of dual certificate II) Under the assumptions of Theorem 1.2, the matrix $Y$ (3.10) is a dual certificate, and obeys $\|P_{T^\perp}(Y)\| \le 1/2$ with probability at least $1 - n^{-3}$, provided that $m$ obeys (1.12).

The proof is identical to that of Corollary 3.5 and is omitted. Again, Corollary 3.7 and Lemma 3.1 immediately imply Theorem 1.2.

We have learned that verifying that $Y$ is a valid dual certificate reduces to (3.25) and (3.27), and we conclude this section by giving a road map to the proofs. In Section 4, we will develop a formula for $\mathbb{E}\operatorname{trace}(A^* A)^j$, which is our starting point for bounding this quantity. Then Section 5 develops the first and perhaps easier bound (3.25), while Section 6 refines the argument by exploiting clever cancellations and establishes the nearly optimal bound (3.27).

3.6 Novelty

As explained earlier, this paper derives near-optimal sampling results which are stronger than those in [7]. One of the reasons underlying this improvement is that we use completely different techniques. In detail, [7] constructs the dual certificate (3.10) and proceeds by showing that $\|P_{T^\perp}(Y)\| < 1$ by bounding each term in the series $\sum_{k \ge 0} \|(Q_\Omega P_T)^k Q_\Omega(E)\|$. Further, to prove that the early terms (small values of $k$) are appropriately small, the authors employ a sophisticated array of tools from asymptotic geometric analysis, including noncommutative Khintchine inequalities [16], decoupling techniques of Bourgain and Tzafriri and of de la Peña [10], and large deviations inequalities [14]. They bound each term individually up to $k = 4$ and use the same argument as that in Section 3.3 to bound the rest of the series. Since the tail starts at $k_0 = 5$, this gives that a sufficient condition is that the number of samples exceeds a constant times $\mu_0 n^{6/5} r \log n$. Bounding each term $\|(Q_\Omega P_T)^k Q_\Omega(E)\|$ with the tools put forth in [7] for larger values of $k$ becomes increasingly delicate because of the coupling between the indicator variables defining the random set $\Omega$. In addition, the noncommutative Khintchine inequality seems less effective in higher dimensions, that is, for large values of $k$. Informally speaking, the reason for this seems to be that the types of random sums that appear in the moments $(Q_\Omega P_T)^k Q_\Omega(E)$ for large $k$ involve complicated combinations of the coefficients of $P_T$ that are not simply components of some product matrix, and which do not simplify substantially after a direct application of the Khintchine inequality.
In this paper, we use a very different strategy to estimate the spectral norm of $(Q_\Omega Q_T)^k Q_\Omega(E)$, and employ moment methods, which have a long history in random matrix theory, dating back at least to the classical work of Wigner [26]. We raise the matrix $A := (Q_\Omega Q_T)^k Q_\Omega(E)$ to a large power $j$ so that

$$ \sigma_1^{2j}(A) = \|A\|^{2j} \approx \operatorname{trace}(A^* A)^j = \sum_{i \in [n]} \sigma_i^{2j}(A) $$

(the largest element dominates the sum). We then need to compute the expectation of the right-hand side, and reduce matters to a purely combinatorial question involving the statistics of various types of paths in a plane. It is rather remarkable that carrying out these combinatorial calculations nearly gives the quantitatively correct answer; the moment method seems to come close to giving the ultimate limit of performance one can expect from nuclear-norm minimization.

As we shall shortly see, the expression $\operatorname{trace}(A^* A)^j$ expands as a sum over "paths" of products of various coefficients of the operators $Q_\Omega$, $Q_T$ and the matrix $E$. These paths can be viewed as complicated variants of Dyck paths. However, it does not seem that one can simply invoke standard moment method calculations in the literature to compute this sum, as in order to obtain efficient bounds, we will need to take full advantage of identities such as $P_T P_T = P_T$ (which capture certain cancellation properties of the coefficients of $P_T$ or $Q_T$) to simplify various components of this sum. It is only after performing such simplifications that one can afford to estimate all the coefficients by absolute values and count paths to conclude the argument.

4 Moments

Let $j \ge 0$ be a fixed integer. The goal of this section is to develop a formula for

$$ X := \mathbb{E}\operatorname{trace}(A^* A)^j. \qquad (4.1) $$

This will clearly be of use in the proofs of the moment bounds (Theorems 3.4, 3.6).

4.1 First step: expansion

We first write the matrix $A$ in components as

$$ A = \sum_{a, b \in [n]} A_{ab}\, e_{ab} $$

for some scalars $A_{ab}$, where $e_{ab}$ is the standard basis for the $n \times n$ matrices and $A_{ab}$ is the $(a,b)$-th entry of $A$. Then

$$ \operatorname{trace}(A^* A)^j = \sum_{\substack{a_1, \ldots, a_j \in [n] \\ b_1, \ldots, b_j \in [n]}} \prod_{i \in [j]} A_{a_i b_i} A_{a_{i+1} b_i}, $$

where we adopt the cyclic convention $a_{j+1} = a_1$. Equivalently, we can write

$$ \operatorname{trace}(A^* A)^j = \sum \prod_{i \in [j]} \prod_{\mu = 0}^1 A_{a_{i,\mu} b_{i,\mu}}, \qquad (4.2) $$

where the sum is over all $a_{i,\mu}, b_{i,\mu} \in [n]$ for $i \in [j]$, $\mu \in \{0, 1\}$ obeying the compatibility conditions

$$ a_{i,1} = a_{i+1,0}, \qquad b_{i,1} = b_{i,0} \qquad \text{for all } i \in [j], $$

with the cyclic convention $a_{j+1,0} = a_{1,0}$.

Example. If $j = 2$, then we can write $\operatorname{trace}(A^* A)^j$ as

$$ \sum_{a_1, a_2, b_1, b_2 \in [n]} A_{a_1 b_1} A_{a_2 b_1} A_{a_2 b_2} A_{a_1 b_2}, $$

or equivalently as

$$ \sum \prod_{i=1}^2 \prod_{\mu=0}^1 A_{a_{i,\mu} b_{i,\mu}}, $$

where the sum is over all $a_{1,0}, a_{1,1}, a_{2,0}, a_{2,1}, b_{1,0}, b_{1,1}, b_{2,0}, b_{2,1} \in [n]$ obeying the compatibility conditions

$$ a_{1,1} = a_{2,0}; \quad a_{2,1} = a_{1,0}; \quad b_{1,1} = b_{1,0}; \quad b_{2,1} = b_{2,0}. $$

Remark. The sum in (4.2) can be viewed as over all closed paths of length $2j$ in $[n] \times [n]$, where the edges of the paths alternate between "horizontal rook moves" and "vertical rook moves" respectively; see Figure 1.

Figure 1: A typical path in $[n] \times [n]$ that appears in the expansion of $\operatorname{trace}(A^* A)^j$, here with $j = 3$.
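The expansion (4.2), and the use of $\operatorname{trace}((A^* A)^j)^{1/2j}$ as a proxy for $\|A\|$, are both easy to check numerically on a tiny example. The sketch below is an illustration only, with a random matrix standing in for $(Q_\Omega Q_T)^k Q_\Omega(E)$; it compares the combinatorial sum over closed paths with a direct trace computation, and shows how the $2j$-th root of the trace approaches the spectral norm.

import numpy as np
from itertools import product

rng = np.random.default_rng(10)
n, j = 3, 2
A = rng.standard_normal((n, n))

# Direct evaluation of trace((A^* A)^j).
direct = np.trace(np.linalg.matrix_power(A.T @ A, j))

# Combinatorial sum (4.2): prod_i A[a_i, b_i] * A[a_{i+1}, b_i], with a_{j+1} = a_1.
combinatorial = 0.0
for idx in product(range(n), repeat=2 * j):
    a, b = idx[:j], idx[j:]
    term = 1.0
    for i in range(j):
        term *= A[a[i], b[i]] * A[a[(i + 1) % j], b[i]]
    combinatorial += term
print("trace((A^T A)^j):", direct, "  path sum:", combinatorial)

# trace((A^* A)^j)^{1/(2j)} lies between ||A|| and n^{1/(2j)} ||A||.
for jj in [1, 2, 5, 10]:
    est = np.trace(np.linalg.matrix_power(A.T @ A, jj)) ** (1.0 / (2 * jj))
    print(f"j = {jj:2d}: trace-based estimate = {est:.4f}, ||A|| = {np.linalg.norm(A, 2):.4f}")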
Second, write Q T and Q Ω in co efficien ts as Q T ( e a 0 b 0 ) = X ab c ab,a 0 b 0 e ab where c ab,a 0 b 0 is giv en by (3.23), and Q Ω ( e a 0 b 0 ) = ξ a 0 b 0 e a 0 b 0 , 23 where ξ ab are the iid, zero-expectation random v ariables ξ ab := 1 p 1 ( a,b ) ∈ Ω − 1 . With this, w e hav e A a 0 ,b 0 := X a 1 ,b 1 ,...,a k ,b k ∈ [ n ] Y l ∈ [ k ] c a l − 1 b l − 1 ,a l b l k Y l =0 ξ a l b l E a k b k (4.3) for an y a 0 , b 0 ∈ [ n ]. Note that this formula is ev en v alid in the base case k = 0, where it simplifies to just A a 0 b 0 = ξ a 0 b 0 E a 0 b 0 due to our con ven tions on trivial sums and empty pro ducts. Example. If k = 2, then A a 0 ,b 0 = X a 1 ,a 2 ,b 1 ,b 2 ∈ [ n ] ξ a 0 b 0 c a 0 b 0 ,a 1 ,b 1 ξ a 1 b 1 c a 1 b 1 ,a 2 b 2 ξ a 2 b 2 E a 2 b 2 . R emark. One can view the right-hand side of (4.3) as the sum ov er paths of length k + 1 in [ n ] × [ n ] starting at the designated p oint ( a 0 , b 0 ) and ending at some arbitrary p oint ( a k , b k ). Each edge (from ( a i , b i ) to ( a i +1 , b i +1 )) ma y b e a horizon tal or vertical “ro ok mo ve” (in that at least one of the a or b co ordinates do es not change 2 ), or a “non-ro ok mo ve” in which b oth the a and b co ordinates change. It will be imp ortan t later on to k eep trac k of whic h edges are ro ok mo v es and whic h ones are not, basically b ecause of the presence of the delta functions 1 a = a 0 , 1 b = b 0 in (3.23). Eac h edge in this path is w eighted b y a c factor, and eac h vertex in the path is w eighted b y a ξ factor, with the final v ertex also weigh ted by an additional E factor. It is imp ortant to note that the path is allo wed to cross itself, in which case w eigh ts suc h as ξ 2 , ξ 3 , etc. ma y app ear, see Figure 2. Inserting (4.3) into (4.2), we see that X can th us b e expanded as E X ∗ Y i ∈ [ j ] 1 Y µ =0 h Y l ∈ [ k ] c a i,µ,l − 1 b i,µ,l − 1 ,a i,µ,l b i,µ,l k Y l =0 ξ a i,µ,l b i,µ,l E a i,µ,k b i,µ,k i , (4.4) where the sum P ∗ is ov er all com binations of a i,µ,l , b i,µ,l ∈ [ n ] for i ∈ [ j ], µ ∈ { 0 , 1 } and 0 ≤ l ≤ k ob eying the compatibilit y conditions a i, 1 , 0 = a i +1 , 0 , 0 ; b i, 1 , 0 = b i, 0 , 0 for all i ∈ [ j ] (4.5) with the cyclic con ven tion a j +1 , 0 , 0 = a 1 , 0 , 0 . Example. Con tinuing our running example j = k = 2, we hav e X = E X ∗ 2 Y i =1 1 Y µ =0 ξ a i,µ, 0 b i,µ, 0 c a i,µ, 0 b i,µ, 0 ,a i,µ, 1 b i,µ, 1 ξ a i,µ, 1 b i,µ, 1 c a i,µ, 1 b i,µ, 1 ,a i,µ, 2 b i,µ, 2 ξ a i,µ, 2 b i,µ, 2 E a i,µ, 2 b i,µ, 2 where a i,µ,l for i = 1 , 2, µ = 0 , 1, l = 0 , 1 , 2 ob ey the compatibility conditions a 1 , 1 , 0 = a 2 , 0 , 0 ; a 2 , 1 , 0 = a 1 , 0 , 0 ; b 1 , 1 , 0 = b 1 , 0 , 0 ; b 2 , 1 , 0 = b 2 , 0 , 0 . 2 Unlik e the ordinary rules of chess, w e will consider the trivial mo ve when a i +1 = a i and b i +1 = b i to also qualify as a “ro ok mo ve”, which is sim ultaneously a horizontal and a vertical ro ok mov e. 24 Figure 2: A typical path app earing in the expansion (4.3) of A a 0 b 0 , here with k = 5. Each v ertex of the path giv es rise to a ξ factor (with the final vertex, coloured in red, providing an additional E factor), while each edge of the path pro vides a c factor. Note that the path is certainly allow ed to cross itself (leading to the ξ factors b eing raised to pow ers greater than 1, as is for instance the case here at ( a 1 , b 1 ) = ( a 4 , b 4 )), and that the edges of the path may b e horizon tal, vertical, or neither. 
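The component formula (4.3) can likewise be checked on a toy example. In the sketch below (again an illustration only; the tensor c is an arbitrary stand-in for the coefficients (3.23), and E is an arbitrary matrix rather than the actual sign matrix), we apply the operators Q_Ω and Q_T directly and compare one entry of A = (Q_Ω Q_T)^k Q_Ω(E) with the sum over paths of length k + 1.

import itertools
import numpy as np

rng = np.random.default_rng(1)
n, k, p = 3, 2, 0.5
c = rng.standard_normal((n, n, n, n))      # stand-in for the coefficients c_{ab,a'b'}
E = rng.standard_normal((n, n))            # stand-in for the matrix E
xi = (rng.random((n, n)) < p) / p - 1.0    # xi_{ab} = (1/p) 1_{(a,b) in Omega} - 1

def QT(M):
    # Q_T in coefficients: (Q_T M)_{ab} = sum_{a',b'} c_{ab,a'b'} M_{a'b'}
    return np.einsum('abcd,cd->ab', c, M)

def QO(M):
    # Q_Omega acts entrywise: (Q_Omega M)_{ab} = xi_{ab} M_{ab}
    return xi * M

# Operator side: A = (Q_Omega Q_T)^k Q_Omega(E).
A = QO(E)
for _ in range(k):
    A = QO(QT(A))

# Path-sum side: formula (4.3) for a single entry (a0, b0).
a0, b0 = 0, 1
total = 0.0
for path in itertools.product(range(n), repeat=2 * k):
    a = [a0] + list(path[0::2])
    b = [b0] + list(path[1::2])
    weight = 1.0
    for l in range(1, k + 1):                 # one c factor per edge of the path
        weight *= c[a[l - 1], b[l - 1], a[l], b[l]]
    for l in range(k + 1):                    # one xi factor per vertex
        weight *= xi[a[l], b[l]]
    total += weight * E[a[k], b[k]]           # E factor at the final vertex

assert np.isclose(A[a0, b0], total)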
Note that despite the small v alues of j and k , this is already a rather complicated sum, ranging o ver n 2 j (2 k +1) = n 20 summands, eac h of which is the pro duct of 4 j ( k + 1) = 24 terms. R emark. The expansion (4.4) is the sum ov er a sort of combinatorial “spider”, whose “bo dy” is a closed path of length 2 j in [ n ] × [ n ] of alternating horizontal and vertical ro ok mov es, and whose 2 j “legs” are paths of length k , emanating out of each v ertex of the b o dy . The v arious “segmen ts” of the legs (which can b e either ro ok or non-ro ok mov es) acquire a weigh t of c , and the “joints” of the legs acquire a weigh t of ξ , with an additional weigh t of E at the tip of eac h leg. T o complicate things further, it is certainly p ossible for a v ertex of one leg to o verlap with another v ertex from either the same leg or a differen t leg, introducing weigh ts suc h as ξ 2 , ξ 3 , etc.; see Figure 3. As one can see, the s et of possible configurations that this “spider” can be in is rather large and complicated. 4.2 Second step: collecting ro ws and columns W e no w group the terms in the expansion (4.4) in to a b ounded n umber of comp onents, dep ending on ho w the v arious horizon tal co ordinates a i,µ,l and v ertical co ordinates b i,µ,l o verlap. It is conv enien t to order the 2 j ( k + 1) tuples ( i, µ, l ) ∈ [ j ] × { 0 , 1 } × { 0 , . . . , k } lexicographically b y declaring ( i, µ, l ) < ( i 0 , µ 0 , l 0 ) if i < i 0 , or if i = i 0 and µ < µ 0 , or if i = i 0 and µ = µ 0 and l < l 0 . W e then define the indices s i,µ,l , t i,µ,l ∈ { 1 , 2 , 3 , . . . } recursiv ely for all ( i, µ, l ) ∈ [ j ] × { 0 , 1 } × [ k ] b y setting s 1 , 0 , 0 = 1 and declaring s i,µ,l := s i 0 ,µ 0 ,l 0 if there exists ( i 0 , µ 0 , l 0 ) < ( i, µ, l ) with a i 0 ,µ 0 ,l 0 = a i,µ,l , or equal to the first p ositive in teger not equal to any of the s i 0 ,µ 0 ,l 0 for ( i 0 , µ 0 , l 0 ) < ( i, µ, l ) otherwise. 25 Figure 3: A “spider” with j = 3 and k = 2, with the “bo dy” in b oldface lines and the “legs” as directed paths from the bo dy to the tips (marked in red). Define t i,µ,l using b i,µ,l similarly . W e observ e the cyclic c ondition s i, 1 , 0 = s i +1 , 0 , 0 ; t i, 1 , 0 = t i, 0 , 0 for all i ∈ [ j ] (4.6) with the cyclic con ven tion s j +1 , 0 , 0 = s 1 , 0 , 0 . Example. Supp ose that j = 2, k = 1, and n ≥ 30, with the ( a i,µ,l , b i,µ,l ) given in lexicographical ordering as ( a 0 , 0 , 0 , b 0 , 0 , 0 ) = (17 , 30) ( a 0 , 0 , 1 , b 0 , 0 , 1 ) = (13 , 27) ( a 0 , 1 , 0 , b 0 , 1 , 0 ) = (28 , 30) ( a 0 , 1 , 1 , b 0 , 1 , 1 ) = (13 , 25) ( a 1 , 0 , 0 , b 1 , 0 , 0 ) = (28 , 11) ( a 1 , 0 , 1 , b 1 , 0 , 1 ) = (17 , 27) ( a 1 , 1 , 0 , b 1 , 1 , 0 ) = (17 , 11) ( a 1 , 1 , 1 , b 1 , 1 , 1 ) = (13 , 27) 26 Then w e would hav e ( s 0 , 0 , 0 , t 0 , 0 , 0 ) = (1 , 1) ( s 0 , 0 , 1 , t 0 , 0 , 1 ) = (2 , 2) ( s 0 , 1 , 0 , t 0 , 1 , 0 ) = (3 , 1) ( s 0 , 1 , 1 , t 0 , 1 , 1 ) = (2 , 3) ( s 1 , 0 , 0 , t 1 , 0 , 0 ) = (3 , 4) ( s 1 , 0 , 1 , t 1 , 0 , 1 ) = (1 , 2) ( s 1 , 1 , 0 , t 1 , 1 , 0 ) = (1 , 4) ( s 1 , 1 , 1 , t 1 , 1 , 1 ) = (2 , 2) . Observ e that the conditions (4.5) hold for this example, whic h then forces (4.6) to hold also. In addition to the prop erty (4.6), we see from construction of ( s, t ) that for any ( i, µ, l ) ∈ [ j ] × { 0 , 1 } × { 0 , . . . , k } , the sets { s ( i 0 , µ 0 , l 0 ) : ( i 0 , µ 0 , l 0 ) ≤ ( i, µ, l ) } , { t ( i 0 , µ 0 , l 0 ) : ( i 0 , µ 0 , l 0 ) ≤ ( i, µ, l ) } (4.7) are initial segmen ts, i.e. 
of the form [ m ] for some integer m . Let us call pairs ( s, t ) of sequences with this prop erty , as well as the prop erty (4.6), admissible ; th us for instance the sequences in the ab o ve example are admissible. Given an admissible pair ( s, t ), if w e define the sets J , K by J := { s i,µ,l : ( i, µ, l ) ∈ [ j ] × { 0 , 1 } × { 0 , . . . , k }} K := { t i,µ,l : ( i, µ, l ) ∈ [ j ] × { 0 , 1 } × { 0 , . . . , k }} (4.8) then we observe that J = [ | J | ] , K = [ | K | ]. Also, if ( s, t ) arose from a i,µ,l , b i,µ,l in the ab ov e manner, there exist unique injections α : J → [ n ] , β : K → [ n ] suc h that a i,µ,l = α ( s i,µ,l ) and b i,µ,l = β ( t i,µ,l ). Example. Con tinuing the previous example, we hav e J = [3], K = [4], with the injections α : [3] → [ n ] and β : [4] → [ n ] defined by α (1) := 17; α (2) := 13; α (3) := 28 and β (1) := 30; β (2) := 27; β (3) := 25; β (4) := 11 . Con versely , an y admissible pair ( s, t ) and injections α, β determine a i,µ,l and b i,µ,l . Because of this, w e can thus expand X as X = X ( s,t ) E X α,β Y i ∈ [ j ] 1 Y µ =0 h Y l ∈ [ k ] c α ( s i,µ,l − 1 ) β ( t i,µ,l − 1 ) ,α ( s i,µ,l ) β ( t i,µ,l ) L Y l =0 ξ α ( s i,µ,l ) β ( t i,µ,l ) E α ( s i,µ,k ) β ( t i,µ,k ) i , where the outer sum is o v er all admissible pairs ( s, t ), and the inner sum is ov er all injections. R emark. As with preceding iden tities, the ab ov e formula is also v alid when k = 0 (with our con ven tions on trivial sums and empty products), in which case it simplifies to X = X ( s,t ) E X α,β Y i ∈ [ j ] 1 Y µ =0 ξ α ( s i,µ, 0 ) β ( t i,µ, 0 ) E α ( s i,µ, 0 ) β ( t i,µ, 0 ) . 27 R emark. One can think of ( s, t ) as describing the combinatorial “configuration” of the “spider” (( a i,µ,l , b i,µ,l )) ( i,µ,l ) ∈ [ j ] ×{ 0 , 1 }×{ 0 ,...,k } - it determines which vertices of the spider are equal to, or on the same row or column as, other vertices of the spider. The injections α, β then enumerate the w ays in which suc h a configuration can be “represented” inside the grid [ n ] × [ n ]. 4.3 Third step: computing the exp ectation The expansion w e ha ve for X lo oks quite complicated. Ho wev er, the fact that the ξ ab are inde- p enden t and hav e mean zero allows us to simplify this expansion to a significant degree. Indeed, observ e that the random v ariable Ξ := Q i ∈ [ j ] Q 1 µ =0 Q L l =0 ξ α ( s i,µ,l ) β ( t i,µ,l ) has zero exp ectation if there is an y pair in J × K whic h can b e expressed exactly once in the form ( s i,µ,l , t i,µ,l ). Th us w e ma y assume that no pair can be expressed exactly once in this manner. If δ is a Bernoulli v ariable with P ( δ = 1) = p = 1 − P ( δ = 0), then for eac h s ≥ 0, one easily computes E ( δ − p ) s = p (1 − p ) (1 − p ) s − 1 + ( − 1) s p s − 1 and hence | E ( 1 p δ − 1) s | ≤ p 1 − s . The v alue of the exp ectation of E Ξ do es not dep end on the choice of α or β , and the calculation ab o ve shows that Ξ ob eys | E Ξ | ≤ 1 p 2 j ( k +1) −| Ω | , where Ω := { ( s i,µ,l , t i,µ,l ) : ( i, µ, l ) ∈ [ j ] × { 0 , 1 } × { 0 , . . . , k }} ⊂ J × K . 
(4.9) Applying this estimate and the triangle inequality , we can thus b ound X b y X ≤ X ( s,t ) strongly admissible (1 /p ) 2 j ( k +1) −| Ω | X α,β Y i ∈ [ j ] 1 Y µ =0 h Y l ∈ [ k ] c α ( s i,µ,l − 1 ) β ( t i,µ,l − 1 ) ,α ( s i,µ,l ) β ( t i,µ,l ) E α ( s i,µ,k ) β ( t i,µ,k ) i , (4.10) where the sum is o ver those admissible ( s, t ) such that each elemen t of Ω is visited at least twice b y the sequence ( s i,µ,l , t i,µ,l ); we shall call such ( s, t ) str ongly admissible . W e will use the b ound (4.10) as a starting point for proving the momen t estimates (3.25) and (3.27). Example. The pair ( s, t ) in the Example in Section 4.2 is admissible but not strongly admissible, b ecause not every element of the set Ω (whic h, in this example, is { (1 , 1) , (2 , 2) , (3 , 1) , (2 , 3) , (3 , 4) , (1 , 2) , (1 , 4) } ) is visited t wice b y the ( s, t ). R emark. Once again, the formula (4.10) is v alid when k = 0, with the usual conv en tions on empt y pro ducts (in particular, the factor in volving the c co efficients can b e deleted in this case). 5 Quadratic b ound in the rank This section establishes (3.25) under the assumptions of Theorem 1.1, which is the easier of the t wo moment estimates. Here w e shall just tak e the absolute v alues in (4.10) inside the summation 28 and use the estimates on the co efficients given to us by hypothesis. Indeed, starting with (4.10) and the triangle inequalit y and applying (1.9) together with (3.23) giv es X ≤ O (1) j ( k +1) X ( s,t ) strongly admissible (1 /p ) 2 j ( k +1) −| Ω | X α,β ( √ r µ /n ) 2 j k + | Q | +2 j , where we recall that r µ = µ 2 r , and Q is the set of all ( i, µ, l ) ∈ [ j ] × { 0 , 1 } × [ k ] such that s i,µ,l − 1 6 = s i,µ,l and t i,µ,l − 1 6 = t i,µ,l . Thinking of the sequence { ( s i,µ,l , t i,µ,l ) } as a path in J × K , w e hav e that ( i, µ, l ) ∈ Q if and only if the mov e from ( s i,µ,l − 1 , t i,µ,l − 1 ) to ( s i,µ,l , t i,µ,l ) is neither horizon tal nor vertical; p er our earlier discussion, this is a “non-rook” mov e. Example. The example in Section 4.2 is admissible, but not strongly admissible. Nev ertheless, the ab ov e definitions can still be applied, and we see that Q = { (0 , 0 , 1) , (0 , 1 , 1) , (1 , 0 , 1) , (1 , 1 , 1) } in this case, because all of the four asso ciated mo ves are non-ro ok mov es. As the n umber of injections α, β is at most n | J | , n | K | resp ectiv ely , w e thus hav e X ≤ O (1) j ( k +1) X ( s,t ) str. admiss. (1 /p ) 2 j ( k +1) −| Ω | n | J | + | K | ( √ r µ /n ) 2 j k + | Q | +2 j , whic h we rearrange slightly as X ≤ O (1) j ( k +1) X ( s,t ) str. admiss. r 2 µ np 2 j ( k +1) −| Ω | r | Q | 2 +2 | Ω |− 3 j ( k +1) µ n | J | + | K |−| Q |−| Ω | . Since ( s, t ) is strongly admissible and every p oint in Ω needs to b e visited at least twice, we see that | Ω | ≤ j ( k + 1) . Also, since Q ⊂ [ j ] × { 0 , 1 } × [ k ], we hav e the trivial bound | Q | ≤ 2 j k . This ensures that | Q | 2 + 2 | Ω | − 3 j ( k + 1) ≤ 0 and 2 j ( k + 1) − | Ω | ≥ j ( k + 1) . F rom the hypotheses of Theorem 1.1 w e hav e np ≥ r 2 µ , and th us X ≤ O r 2 µ np j ( k +1) X ( s,t ) str. admiss. n | J | + | K |−| Q |−| Ω | . R emark. In the case where k = 0 in which Q = ∅ , one can easily obtain a b etter estimate, namely , (if np ≥ r µ ) X ≤ O r µ np j X ( s,t ) str. admiss. n | J | + | K |−| Ω | . 
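The combinatorial bookkeeping used above — the canonical labels (s, t) of Section 4.2, the support Ω, the non-rook set Q, and strong admissibility — can be made concrete with a short helper. The functions below are a hypothetical illustration, not code from the paper; run on the example of Section 4.2 (written here with 1-based leg indices), they reproduce |J| = 3, |K| = 4, |Ω| = 7, |Q| = 4 and confirm that the pair is not strongly admissible.

from collections import Counter

def canonical_labels(coords):
    # coords maps (i, mu, l) -> (a, b); keys are scanned in lexicographic order,
    # assigning to each new row value a (resp. column value b) the next free label.
    s, t, row_id, col_id = {}, {}, {}, {}
    for key in sorted(coords):
        a, b = coords[key]
        s[key] = row_id.setdefault(a, len(row_id) + 1)
        t[key] = col_id.setdefault(b, len(col_id) + 1)
    return s, t

def spider_stats(coords):
    s, t = canonical_labels(coords)
    omega = Counter((s[key], t[key]) for key in coords)          # the support Omega
    Q = [key for key in coords if key[2] > 0
         and s[(key[0], key[1], key[2] - 1)] != s[key]
         and t[(key[0], key[1], key[2] - 1)] != t[key]]          # non-rook moves
    return dict(J=len(set(s.values())), K=len(set(t.values())),
                Omega=len(omega), Q=len(Q),
                strongly_admissible=all(v >= 2 for v in omega.values()))

# The example of Section 4.2 (j = 2, k = 1), with 1-based leg indices.
coords = {(1, 0, 0): (17, 30), (1, 0, 1): (13, 27),
          (1, 1, 0): (28, 30), (1, 1, 1): (13, 25),
          (2, 0, 0): (28, 11), (2, 0, 1): (17, 27),
          (2, 1, 0): (17, 11), (2, 1, 1): (13, 27)}
print(spider_stats(coords))   # J=3, K=4, Omega=7, Q=4, strongly_admissible=False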
29 Call a triple ( i, µ, l ) r e cycle d if we hav e s i 0 ,µ 0 ,l 0 = s i,µ,l or t i 0 ,µ 0 ,l 0 = t i,µ,l for some ( i 0 , µ 0 , l 0 ) < ( i, µ, l ), and total ly r e cycle d if ( s i 0 ,µ 0 ,l 0 , t i 0 ,µ 0 ,l 0 ) = ( s i,µ,l , t i,µ,l ) for some ( i 0 , µ 0 , l 0 ) < ( i, µ, l ). Let Q 0 denote the set of all ( i, µ, l ) ∈ Q which are recycled. Example. The example in Section 4.2 is admissible, but not strongly admissible. Nev ertheless, the ab o ve definitions can still b e applied, and w e see that the triples (0 , 1 , 0) , (0 , 1 , 1) , (1 , 0 , 0) , (1 , 0 , 1) , (1 , 1 , 0) , (1 , 1 , 1) are all recycled (b ecause they either reuse an existing v alue of s or t or both), while the triple (1 , 1 , 1) is totally recycled (it visits the same lo cation as the earlier triple (0 , 0 , 1)). Th us in this case, w e hav e Q 0 = { (0 , 1 , 1) , (1 , 0 , 1) , (1 , 1 , 1) } . W e observ e that if ( i, µ, l ) ∈ [ j ] × { 0 , 1 } × [ k ] is not recycled, then it m ust ha ve been reac hed from ( i, µ, l − 1) by a non-ro ok mov e, and th us ( i, µ, l ) lies in Q . Lemma 5.1 (Exp onen t b ound) F or any admissible tuple, we have | J | + | K | − | Q | − | Ω | ≤ −| Q 0 | + 1 . Pro of W e let ( i, µ, l ) increase from (1 , 0 , 0) to ( j, 1 , k ) and see ho w each ( i, µ, l ) influences the quan tity | J | + | K | − | Q \ Q 0 | − | Ω | . Firstly , we see that the triple (1 , 0 , 0) initialises | J | , | K | , | Ω | = 1 and | Q \ Q 0 | = 0, so | J | + | K | − | Q \ Q 0 | − | Ω | = 1 at this initial stage. Now w e see ho w eac h subsequent ( i, µ, l ) adjusts this quan tity . If ( i, µ, l ) is totally recycled, then J, K, Ω , Q \ Q 0 are unchanged by the addition of ( i, µ, l ), and so | J | + | K | − | Q \ Q 0 | − | Ω | do es not change. If ( i, µ, l ) is recycled but not totally recycled, then one of J, K increases in size by at most one, as does Ω, but the other set of J , K remains unchanged, as do es Q \ Q 0 , and so | J | + | K | − | Q \ Q 0 | − | Ω | do es not increase. If ( i, µ, l ) is not recycled at all, then (by (4.6)) w e m ust ha ve l > 0, and then (b y definition of Q, Q 0 ) we hav e ( i, µ, l ) ∈ Q \ Q 0 , and so | Q \ Q 0 | and | Ω | b oth increase by one. Meanwhile, | J | and | K | increase by 1, and so | J | + | K | − | Q \ Q 0 | − | Ω | do es not change. Putting all this together we obtain the claim. This lemma giv es X ≤ O r 2 µ np j ( k +1) X str. admiss. n −| Q 0 | +1 . R emark. When k = 0, we ha v e the b etter b ound X ≤ O r µ np j X str. admiss. n. T o estimate the ab ov e sum, we need to coun t strongly admissible pairs. This is achiev ed by the follo wing lemma. Lemma 5.2 (P air coun ting) F or fixe d q ≥ 0 , the numb er of str ongly admissible p airs ( s, t ) with | Q 0 | = q is at most O ( j ( k + 1)) 2 j ( k +1)+ q . 30 Pro of Firstly observ e that once one fixes q , the n umber of possible c hoices for Q 0 is 2 j k q , which w e can b ound crudely by 2 2 j ( k +1) = O (1) 2 j ( k +1)+ q . So we ma y without loss of generality assume that Q 0 is fixed. F or similar reasons w e may assume Q is fixed. As with the pro of of Lemma 5.1, we incremen t ( i, µ, l ) from (1 , 0 , 0) to ( j, 1 , k ) and upp er b ound ho w many choices w e ha ve av ailable for s i,µ,l , t i,µ,l at eac h stage. There are no choices av ailable for s 1 , 0 , 0 , t 1 , 0 , 0 , which m ust b oth b e one. Now supp ose that ( i, µ, l ) > (1 , 0 , 0). There are sev eral cases. 
If l = 0, then by (4.6) one of s i,µ,l , t i,µ,l has no choices av ailable to it, while the other has at most O ( j ( k + 1)) choices. If l > 0 and ( i, µ, l ) 6∈ Q , then at least one of s i,µ,l , t i,µ,l is necessarily equal to its predecessor; there are at most tw o choices av ailable for which index is equal in this fashion, and then there are O ( j ( k + 1)) choices for the other index. If l > 0 and ( i, µ, l ) ∈ Q \ Q 0 , then both s i,µ,l and t i,µ,l are new, and are thus equal to the first p ositiv e integer not already o ccupied by s i 0 ,µ 0 ,l 0 or t i 0 ,µ 0 ,l 0 resp ectiv ely for ( i 0 , µ 0 , l 0 ) < ( i, µ, l ). So there is only one c hoice a v ailable in this case. Finally , if ( i, µ, l ) ∈ Q 0 , then there can be O ( j ( k + 1)) choices for b oth s i,µ,l and t i,µ,l . Multiplying together all these b ounds, we obtain that the n umber of strongly admissible pairs is b ounded b y O ( j ( k + 1)) 2 j +2 j k −| Q | +2 | Q 0 | = O ( j ( k + 1)) 2 j ( k +1) −| Q \ Q 0 | + | Q 0 | , whic h prov es the claim (here we discard the | Q \ Q 0 | factor). Using the abov e lemma w e obtain X ≤ O (1) j ( k +1) n r 2 µ np ! j ( k +1) 2 j k X q =0 O ( j ( k + 1)) 2 j ( k +1)+ q n − q . Under the assumption n ≥ c 0 j ( k + 1) for some numerical constant c 0 , we can sum the series and obtain Theorem 3.4. R emark. When k = 0, we ha v e the b etter b ound X ≤ O ( j ) 2 j n r µ np j . 6 Linear b ound in the rank W e now prov e the more sophisticated moment estimate (3.27) under the hypotheses of Theorem 1.2. Here, we cannot afford to take absolute v alues immediately , as in the pro of of (3.25), but first must exploit some algebraic cancellation prop erties in the co efficients c ab,a 0 b 0 , E ab app earing in (4.10) to simplify the sum. 6.1 Cancellation identities Recall from (3.23) that the co efficients c ab,a 0 b 0 are defined in terms of the co efficients U a,a 0 , V b,b 0 in tro duced in (3.22). W e recall the symmetries U a,a 0 = U a 0 ,a , V b,b 0 = V b 0 ,b and the pro jection 31 iden tities X a 0 U a,a 0 U a 0 ,a 00 = (1 − 2 ρ ) U a,a 00 − ρ (1 − ρ ) 1 a = a 00 , (6.1) X b 0 V b,b 0 V b 0 ,b 00 = (1 − 2 ρ ) V b,b 00 − ρ (1 − ρ ) 1 b = b 00 ; (6.2) the first iden tity follows from the matrix identit y X a 0 U a,a 0 U a 0 ,a 00 = h e a , Q 2 U e a 0 i after one writes the pro jection iden tity P 2 U = P U in terms of Q U using (3.21), and similarly for the second iden tity . In a similar v ein, w e also hav e the iden tities X a 0 U a,a 0 E a 0 ,b = (1 − ρ ) E a,b = X b 0 E a,b 0 V b 0 ,b , (6.3) whic h simply come from Q U E = P U E − ρE = (1 − ρ ) E together with E Q V = E P V − ρE = (1 − ρ ) E . Finally , we observe the t w o equalities X b E a,b E a 0 ,b = U a,a 0 + ρ 1 a = a 0 , X a E a,b E a,b 0 = V b,b 0 + ρ 1 b = b 0 . (6.4) The first identit y follows from the fact that P b E a,b E a 0 ,b is the ( a, a 0 ) th elemen t of E E ∗ = P U = Q U + ρI , and the second one similarly follo ws from the identit y E ∗ E = P V = Q V + ρI . 6.2 Reduction to a summand b ound Just as before, our goal is to estimate X := E trace( A ∗ A ) j , A = ( Q Ω Q T ) k Q Ω E . W e recall the b ound (4.10), and expand out eac h of the c co efficients using (3.23) into three terms. T o describ e the resulting expansion of the sum we need more notation. 
Define an admissible quadruplet ( s, t, L U , L V ) to b e an admiss ible pair ( s, t ), together with t wo sets L U , L V with L U ∪ L V = [ j ] × { 0 , 1 } × [ k ], such that s i,µ,l − 1 = s i,µ,l whenev er ( i, µ, l ) ∈ ([ j ] × { 0 , 1 } × [ k ]) \L U , and t i,µ,l − 1 = t i,µ,l whenev er ( i, µ, l ) ∈ ([ j ] × { 0 , 1 } × [ k ]) \L V . If ( s, t ) is also strongly admissible, we say that ( s, t, L U , L V ) is a str ongly admissible quadruplet . The sets L U \L V , L V \L U , L U ∩ L V will corresp ond to the three terms 1 b = b 0 U a,a 0 , 1 a = a 0 V b,b 0 , U a,a 0 V b,b 0 app earing in (3.23). With this notation, we expand the pro duct Y i ∈ [ j ] 1 Y µ =0 Y l ∈ [ k ] c α ( s i,µ,l − 1 ) β ( t i,µ,l − 1 ) ,α ( s i,µ,l ) β ( t i,µ,l ) as X L U , L V (1 − ρ ) |L U \L V | + |L U \L V | ( − 1) |L U ∩L V | h Y ( i,µ,l ) ∈L U \L V 1 β ( t i,µ,l − 1 )= β ( t i,µ,l ) U α ( s i,µ,l − 1 ) ,α ( s i,µ,l ) i h Y ( i,µ,l ) ∈L V \L U 1 α ( s i,µ,l − 1 ) ,α ( s i,µ,l ) V β ( t i,µ,l − 1 ) ,β ( t i,µ,l ) ih Y ( i,µ,l ) ∈L U ∩L V U α ( s i,µ,l − 1 ) ,α ( s i,µ,l ) V β ( t i,µ,l − 1 ) ,β ( t i,µ,l ) i , 32 where the sum is o v er all partitions as ab o ve, and which w e can rearrange as X L U , L V [ − (1 − ρ )] 2 j ( k +1) −|L U ∩L V | h Y ( i,µ,l ) ∈L U U α ( s i,µ,l − 1 ) ,α ( s i,µ,l ) i h Y ( i,µ,l ) ∈L V V β ( t i,µ,l − 1 ) ,β ( t i,µ,l ) i . F rom this and the triangle inequalit y , w e observe the b ound X ≤ (1 − ρ ) 2 j ( k +1) −|L U ∩L V | X ( s,t, L U , L V ) (1 /p ) 2 j ( k +1) −| Ω | | X s,t, L U , L V | , where the sum ranges o v er all strongly admissible quadruplets, and X s,t, L U , L V := X α,β h Y i ∈ [ j ] 1 Y µ =0 E α ( s i,µ,k ) β ( t i,µ,k ) i h Y ( i,µ,l ) ∈L U U α ( s i,µ,l − 1 ) ,α ( s i,µ,l ) i h Y ( i,µ,l ) ∈L V V β ( t i,µ,l − 1 ) ,β ( t i,µ,l ) i . R emark. A strongly admissible quadruplet can be view ed as the configuration of a “spider” with sev eral additional constrain ts. Firstly , the spider must visit each of its v ertices at least t wice (strong admissibilit y). When ( i, µ, l ) ∈ [ j ] × { 0 , 1 } × [ k ] lies out of L U , then only horizon tal ro ok mov es are allo wed when reaching ( i, µ, l ) from ( i, µ, l − 1); similarly , when ( i, µ, l ) lies out of L V , then only v ertical ro ok mov es are allo wed from ( i, µ, l − 1) to ( i, µ, l ). In particular, non-rook mo ves are only allo wed inside L U ∩ L V ; in the notation of the previous section, w e hav e Q ⊂ L U ∩ L V . Note though that while one has the right to execute a non-rook mov e to L U ∩ L V , it is not mandatory; it could still b e that ( s i,µ,l − 1 , t i,µ,l − 1 ) shares a common ro w or column (or ev en b oth) with ( s i,µ,l , t i,µ,l ). W e claim the following fundamental b ound on the summand | X s,t, L U , L V | : Prop osition 6.1 (Summand b ound) L et ( s, t, L U , L V ) b e a str ongly admissible quadruplet. Then we have | X s,t, L U , L V | ≤ O ( j ( k + 1)) 2 j ( k +1) ( r /n ) 2 j ( k +1) −| Ω | n. Assuming this proposition, we ha v e X ≤ O ( j ( k + 1)) 2 j ( k +1) X ( s,t, L U , L V ) ( r /np ) 2 j ( k +1) −| Ω | n and since | Ω | ≤ j ( k + 1) (by strong admissibility) and r ≤ np , and the num b er of ( s, t, L U , L V ) can b e crudely bounded by O ( j ( k + 1)) 4 j ( k +1) , X ≤ O ( j ( k + 1)) 6 j ( k +1) ( r /np ) j ( k +1) n. This giv es (3.27) as desired. The b ound on the num ber of quadruplets follows from the fact that there are at most j ( k + 1) 4 j ( k +1) strongly admissible pairs and that the num b er of ( L U , L V ) p er pair is at most O (1) j ( k +1) . R emark. 
It seems clear that the exp onent 6 can b e low ered by a finer analysis, for instance b y using counting b ounds such as Lemma 5.2. Ho wev er, substan tial effort seems to b e required in order to obtain the optimal exp onent of 1 here. 33 Figure 4: A generalized spider (note the v ariable leg lengths). A v ertex lab eled just b y L U m ust ha ve been reached from its predecessor b y a v ertical ro ok mov e, while a vertex labeled just b y L V m ust hav e b een reac hed by a horizon tal ro ok mov e. V ertices lab eled by b oth L U and L V may b e reached from their predecessor b y a non-ro ok mov e, but they are still allo wed to lie on the same row or column as their predecessor, as is the case in the leg on the b ottom left of this figure. The sets L U , L V indicate whic h U and V terms will sho w up in the expansion (6.5). 6.3 Pro of of Prop osition 6.1 T o prov e the prop osition, it is conv enien t to generalise it by allo wing k to dep end on i, µ . More precisely , define a c onfigur ation C = ( j, k , J, K, s, t, L U , L V ) to be the following set of data: • An integer j ≥ 1, and a map k : [ j ] × { 0 , 1 } → { 0 , 1 , 2 , . . . } , generating a set Γ := { ( i, µ, l ) : i ∈ [ j ] , µ ∈ { 0 , 1 } , 0 ≤ l ≤ k ( i, µ ) } ; • Finite sets J, K , and surjective maps s : Γ → J and t : Γ → K ob eying (4.6); • Sets L U , L V suc h that L U ∪ L V := Γ + := { ( i, µ, l ) ∈ Γ : l > 0 } and such that s i,µ,l − 1 = s i,µ,l whenev er ( i, µ, l ) ∈ Γ + \L U , and t i,µ,l − 1 = t i,µ,l whenev er ( i, µ, l ) ∈ Γ + \L V . R emark. Note w e do not require configurations to b e strongly admissible, although for our application to Proposition 6.1 strong admissibilit y is required. Similarly , w e no longer require that the segmen ts (4.7) be initial segmen ts. This remo v al of h yp otheses will giv e us a conv enient amount of flexibility in a certain induction argument that we shall p erform shortly . One can think of a configuration as describing a “generalized spider” whose legs are allo wed to b e of unequal length, but for which certain of the segments (indicated b y the sets L U , L V ) are required to be horizontal or v ertical. The freedom to extend or shorten the legs of the spider separately will be of imp ortance when w e use the iden tities (6.1), (6.3), (6.4) to simplify the expression X s,t, L U , L V , see Figure 4. 34 Giv en a configuration C , define the quan tity X C b y the formula X C := X α,β h Y i ∈ [ j ] 1 Y µ =0 E α ( s ( i,µ,k ( i,µ ))) β ( t ( i,µ,k ( i,µ ))) i h Y ( i,µ,l ) ∈L U U α ( s ( i,µ,l − 1)) ,α ( s ( i,µ,l )) ih Y ( i,µ,l ) ∈L V V β ( t ( i,µ,l − 1)) ,β ( t ( i,µ,l )) i , (6.5) where α : J → [ n ] , β : K → [ n ] range ov er all injections. T o prov e Prop osition 6.1, it then suffices to sho w that | X C | ≤ ( C 0 (1 + | J | + | K | )) | J | + | K | ( r µ /n ) | Γ |−| Ω | n (6.6) for some absolute constan t C 0 > 0, where Ω := { ( s ( i, µ, l ) , t ( i, µ, l )) : ( i, µ, l ) ∈ Γ } , since Prop osition 6.1 then follows from the sp ecial case in which k ( i, µ ) = k is constant and ( s, t ) is strongly admissible, in whic h case w e hav e | J | + | K | ≤ 2 | Ω | ≤ | Γ | = 2 j ( k + 1) (b y strong admissibility). T o pro ve the claim (6.6) we will p erform strong induction on the quantit y | J | + | K | ; th us w e assume that the claim has already been prov en for all configurations with a strictly smaller v alue of | J | + | K | . (This inductiv e hypothesis can b e v acuous for very small v alues of | J | + | K | .) 
Then, for fixed | J | + | K | , w e p erform strong induction on |L U ∩ L V | , assuming that the claim has already b een prov en for all configurations with the same v alue of | J | + | K | and a strictly smaller v alue of |L U ∩ L V | . R emark. Roughly speaking, the inductive h yp othesis is asserting that the target estimate (6.6) has already b een prov en for all generalized spider configurations which are “simpler” than the curren t configuration, either by using fewer rows and columns, or by using the same n umber of ro ws and columns but b y having fewer opportunities for non-ro ok mov es. As we shall shortly see, whenev er we inv oke the inner induction h yp othesis (decreasing |L U ∩L V | , k eeping | J | + | K | fixed) w e are replacing the expression X C with another expression X C 0 co vered b y this hypothesis; this causes no degradation in the constant. But when w e inv ok e the outer induction hypothesis (decreasing | J | + | K | ), w e will be splitting up X C in to about O (1 + | J | + | K | ) terms X C 0 , each of whic h is cov ered by this h yp othesis; this causes a degradation of O (1 + | J | + | K | ) in the constan ts and is th us resp onsible for the loss of ( C 0 (1 + | J | + | K | )) | J | + | K | in (6.6). F or future reference w e observ e that we may take r µ ≤ n , as the hypotheses of Theorem 1.1 are v acuous otherwise ( m cannot exceed n 2 ). T o prov e (6.6) w e divide in to several cases. 6.3.1 First case: an unguarded non-ro ok mov e Supp ose first that L U ∩ L V con tains an element ( i 0 , µ 0 , l 0 ) with the property that ( s i 0 ,µ 0 ,l 0 − 1 , t i 0 ,µ 0 ,l 0 ) 6∈ Ω . (6.7) 35 Note that this forces the edge from ( s i 0 ,µ 0 ,l 0 − 1 , t i 0 ,µ 0 ,l 0 − 1 ) to ( s i 0 ,µ 0 ,l 0 , t i 0 ,µ 0 ,l 0 ) to b e partially “un- guarded” in the sense that one of the opp osite vertices of the rectangle that this edge is inscrib ed in is not visited b y the ( s, t ) pair. When we ha v e suc h an unguarded non-ro ok mov e, we can “erase” the element ( i 0 , µ 0 , l 0 ) from L U ∩ L V b y replacing C = ( j, k , J, K, s, t, L U , L V ) by the “stretched” v ariant C 0 = ( j 0 , k 0 , J 0 , K 0 , s 0 , t 0 , L 0 U , L 0 V ), defined as follo ws: • j 0 := j , J 0 := J , and K 0 := K . • k 0 ( i, µ ) := k ( i, µ ) for ( i, µ ) 6 = ( i 0 , µ 0 ), and k 0 ( i 0 , µ 0 ) := k ( i 0 , µ 0 ) + 1. • ( s 0 i,µ,l , t 0 i,µ,l ) := ( s i,µ,l , t i,µ,l ) whenev er ( i, µ ) 6 = ( i 0 , µ 0 ), or when ( i, µ ) = ( i 0 , µ 0 ) and l < l 0 . • ( s 0 i,µ,l , t 0 i,µ,l ) := ( s i,µ,l − 1 , t i,µ,l − 1 ) whenev er ( i, µ ) = ( i 0 , µ 0 ) and l > l 0 . • ( s 0 i 0 ,µ 0 ,l 0 , t 0 i 0 ,µ 0 ,l 0 ) := ( s i 0 ,µ 0 ,l 0 − 1 , t i 0 ,µ 0 ,l 0 ). • W e ha ve L 0 U := { ( i, µ, l ) ∈ L U : ( i, µ ) 6 = ( i 0 , µ 0 ) } ∪ { ( i 0 , µ 0 , l ) ∈ L U : l < l 0 } ∪ { ( i 0 , µ 0 , l + 1) : ( i 0 , µ 0 , l ) ∈ L U ; l > l 0 + 1 } ∪ { ( i 0 , µ 0 , l 0 + 1) } and L 0 V := { ( i, µ, l ) ∈ L V : ( i, µ ) 6 = ( i 0 , µ 0 ) } ∪ { ( i 0 , µ 0 , l ) ∈ L V : l < l 0 } ∪ { ( i 0 , µ 0 , l + 1) : ( i 0 , µ 0 , l ) ∈ L V ; l > l 0 + 1 } ∪ { ( i 0 , µ 0 , l 0 ) } . All of this is illustrated in Figure 5. One can chec k that C 0 is still a configuration, and X C 0 is exactly equal to X C ; informally what has happ ened here is that a single “non-ro ok” mo v e (whic h con tributed b oth a U a,a 0 factor and a V b,b 0 factor to the summand in X C ) has b een replaced with an equiv alen t pair of tw o ro ok mov es (one of whic h contributes the U a,a 0 factor, and the other con tributes the V b,b 0 factor). 
Observ e that, | Γ 0 | = | Γ | + 1 and | Ω 0 | = | Ω | + 1 (here we use the non-guarded hypothesis (6.7)), while | J 0 | + | K 0 | = | J | + | K | and |L 0 U ∩ L 0 V | = |L U ∩ L V | − 1. Th us in this case we see that the claim follo ws from the (second) induction h yp othesis. W e ma y th us eliminate this case and assume that ( s i 0 ,µ 0 ,l 0 − 1 , t i 0 ,µ 0 ,l 0 ) ∈ Ω whenev er ( i 0 , µ 0 , l 0 ) ∈ L U ∩ L V . (6.8) F or similar reasons we may assume ( s i 0 ,µ 0 ,l 0 , t i 0 ,µ 0 ,l 0 − 1 ) ∈ Ω whenev er ( i 0 , µ 0 , l 0 ) ∈ L U ∩ L V . (6.9) 36 Figure 5: A fragmen t of a leg showing an unguarded non-ro ok mo v e from ( s i 0 ,µ 0 ,l 0 − 1 , t i 0 ,µ 0 ,l 0 − 1 ) to ( s i 0 ,µ 0 ,l 0 , t i 0 ,µ 0 ,l 0 ) is conv erted into tw o ro ok mov es, thus decreas- ing |L U ∩ L V | by one. Note that the lab els further do wn the leg hav e to b e incremented by one. 6.3.2 Second case: a low m ultiplicit y ro w or column, no unguarded non-ro ok mo ves Next, giv en any x ∈ J , define the r ow multiplicity τ x to b e τ x := |{ ( i, µ, l ) ∈ L U : s ( i, µ, l ) = x }| + |{ ( i, µ, l ) ∈ L U : s ( i, µ, l − 1) = x }| + |{ ( i, µ ) ∈ [ j ] × { 0 , 1 } : s ( i, µ, k ( i, µ )) = x }| and similarly for an y y ∈ K , define the c olumn multiplicity τ y to b e τ y := |{ ( i, µ, l ) ∈ L V : t ( i, µ, l ) = y }| + |{ ( i, µ, l ) ∈ L V : t ( i, µ, l − 1) = y }| + |{ ( i, µ ) ∈ [ j ] × { 0 , 1 } : t ( i, µ, k ( i, µ )) = y }| . R emark. Informally , τ x measures the num ber of times α ( x ) app ears in (6.5), and similarly for τ y and β ( y ). Alternativ ely , one can think of τ x as counting the num b er of times the spider has the opp ortunity to “enter” and “exit” the row s = x , and similarly τ y measures the num b er of opp ortunities to en ter or exit the column t = y . By surjectivity we know that τ x , τ y are strictly p ositive for each x ∈ J , y ∈ K . W e also observ e that τ x , τ y m ust b e even. T o see this, write τ x = X ( i,µ,l ) ∈L U 1 s ( i,µ,l )= x + 1 s ( i,µ,l − 1)= x + X ( i,µ ) ∈ [ j ] ×{ 0 , 1 } 1 s ( i,µ,k ( i,µ ))= x . No w observe that if ( i, µ, l ) ∈ Γ + \L U , then 1 s ( i,µ,l )= x = 1 s ( i,µ,l − 1)= x . Th us we hav e τ x mo d 2 = X ( i,µ,l ) ∈ Γ + 1 s ( i,µ,l )= x + 1 s ( i,µ,l − 1)= x + X i,µ ∈ [ j ] ×{ 0 , 1 } 1 s ( i,µ,k ( i,µ ))= x mo d 2 . 37 (a) (b) Figure 6: In (a), a m ultiplicity 2 row is sho wn. After using the identit y (6.1), the contribution of this configuration is replaced with a num b er of terms one of whic h is sho wn in (b), in which the x row is deleted and replaced with another existing row ˜ x . But w e can telescop e this to τ x mo d 2 = X i,µ ∈ [ j ] ×{ 0 , 1 } 1 s ( i,µ, 0)= x mo d 2 , and the righ t-hand side v anishes by (4.6), sho wing that τ x is ev en, and similarly τ y is ev en. In this subsection, w e disp ose of the case of a lo w-multiplicit y row, or more precisely when τ x = 2 for some x ∈ J . By symmetry , the argumen t will also dispose of the case of a lo w-m ultiplicity column, when τ y = 2 for some y ∈ K . Supp ose that τ x = 2 for some x ∈ J . W e first remark that this implies that there do es not exist ( i, µ, l ) ∈ L U with s ( i, µ, l ) = s ( i, µ, l − 1) = x . W e argue by con tradiction and define l ? to b e the first integer larger than l for which ( i, µ, l ? ) ∈ L U . First, supp ose that l ? do es not exist (whic h, for instance, happ ens when l = k ( i, µ )). Then in this case it is not hard to see that s ( i, µ, k ( i, µ )) = x since for ( i, µ, l 0 ) / ∈ L U , w e ha ve s ( i, µ, l 0 ) = s ( i, µ, l 0 − 1). 
In this case, τ x exceeds 2. Else, l ? do es exist but then s ( i, µ, l ? − 1) = x since s ( i, µ, l 0 ) = s ( i, µ, l 0 − 1) for l < l 0 < l ? . Again, τ x exceeds 2 and this is a con tradiction. Th us, if ( i, µ, l ) ∈ L U and s ( i, µ, l ) = x , then s ( i, µ, l − 1) 6 = x , and similarly if ( i, µ, l ) ∈ L U and s ( i, µ, l − 1) = x , then s ( i, µ, l ) 6 = x . No w let us lo ok at the terms in (6.5) which inv olve α ( x ). Since τ x = 2, there are only t wo suc h terms, and eac h of the terms are either of the form U α ( x ) ,α ( x 0 ) or E α ( x ) ,β ( y ) for some y ∈ K or x 0 ∈ J \{ x } . W e now hav e to divide in to three sub cases. Sub case 1: (6.5) con tains t wo terms U α ( x ) ,α ( x 0 ) , U α ( x ) ,α ( x 00 ) . Figure 6(a) for a typical configuration in whic h this is the case. The idea is to use the iden tity (6.1) to “delete” the row x , th us reducing | J | + | K | and allo wing us to use an induction h yp othesis. Accordingly , let us define ˜ J := J \{ j } , and let ˜ α : ˜ J → [ n ] b e the restriction of α to ˜ J . W e also write a := α ( x ) for the deleted ro w a . W e now isolate the tw o terms U α ( x ) ,α ( x 0 ) , U α ( x ) ,α ( x 00 ) from the rest of (6.5), expressing this sum as X ˜ α,β . . . h X a ∈ [ n ] \ ˜ α ( ˜ J ) U a, ˜ α ( x 0 ) U a, ˜ α ( x 00 ) i 38 Figure 7: Another term arising from the configuration in Figure 6(a), in whic h tw o U factors ha ve been collapsed into one. Note the reduction in length of the configuration by one. where the . . . denotes the pro duct of all the terms in (6.5) other than U α ( x ) ,α ( x 0 ) and U α ( x ) ,α ( x 00 ) , but with α replaced b y ˜ α , and ˜ α, β ranging o ver injections from ˜ J and K to [ n ] resp ectively . F rom (6.1) we hav e X a ∈ [ n ] U a, ˜ α ( x 0 ) U a, ˜ α ( x 00 ) = (1 − 2 ρ ) U ˜ α ( x 0 ) , ˜ α ( x 00 ) − ρ (1 − ρ ) 1 x 0 = x 00 and th us X a ∈ [ n ] \ ˜ α ( ˜ J ) U a, ˜ α ( x 0 ) U a, ˜ α ( x 00 ) = (1 − 2 ρ ) U ˜ α ( x 0 ) , ˜ α ( x 00 ) − ρ (1 − ρ ) 1 x 0 = x 00 − X ˜ x ∈ ˜ J U ˜ α ( ˜ x ) , ˜ α ( x 0 ) U ˜ α ( ˜ x ) , ˜ α ( x 00 ) . (6.10) Consider the con tribution of one of the final terms U ˜ α ( ˜ x ) , ˜ α ( x 0 ) U ˜ α ( ˜ x ) , ˜ α ( x 00 ) of (6.10). This contribution is equal to X C 0 , where C 0 is formed from C by replacing J with ˜ J , and replacing every o ccurrence of x in the range of α with ˜ x , but leaving all other comp onents of C unchanged (see Figure 6(b)). Observ e that | Γ 0 | = | Γ | , | Ω 0 | ≤ | Ω | , | J 0 | + | K 0 | < | J | + | K | , so the con tribution of these terms is acceptable b y the (first) induction h yp othesis (for C 0 large enough). Next, we consider the con tribution of the term U ˜ α ( x 0 ) , ˜ α ( x 00 ) of (6.10). This con tribution is equal to X C 00 , where C 00 is formed from C b y replacing J with ˜ J , replacing every o ccurrence of x in the range of α with x 0 , and also deleting the one elemen t ( i 0 , µ 0 , l 0 ) in L U from Γ + (relab eling the remaining triples ( i 0 , µ 0 , l ) for l 0 < l ≤ k ( i 0 , µ 0 ) b y decremen ting l b y 1) that ga v e rise to U α ( x ) ,α ( x 0 ) , unless this elemen t ( i 0 , µ 0 , l 0 ) also lies in L V , in which case one remo ves ( i 0 , µ 0 , l 0 ) from L U but lea ves it in L V (and do es not relab el an y further triples) (see Figure 7 for an example of the former case, and 8 for the latter case). 
One observ es that | Γ 00 | ≥ | Γ | − 1, | Ω 00 | ≤ | Ω | − 1 (here we use (6.8), (6.9)), | J 00 | + | K 00 | < | J | + | K | , and so this term also is controlled b y the (first) induction hypothesis (for C 0 large enough). Finally , w e consider the contribution of the term ρ 1 x 0 = x 00 of (6.10), whic h of course is only non- trivial when x 0 = x 00 . This contribution is equal to ρX C 000 , where C 000 is formed from C by deleting x 39 Figure 8: Another collapse of tw o U factors into one. This time, the presence of the L V lab el means that the length of the configuration remains unchanged; but the guarded nature of the collapsed non-rook mov e (evidenced here b y the point (a)) ensures that the supp ort Ω of the configuration shrinks by at least one instead. from J , replacing ev ery occurrence of x in the range of α with x 0 = x 00 , and also deleting the tw o elemen ts ( i 0 , µ 0 , l 0 ), ( i 1 , µ 1 , l 1 ) of L U from Γ + that ga ve rise to the factors U α ( x ) ,α ( x 0 ) , U α ( x ) ,α ( x 00 ) in (6.5), unless these elemen ts also lie in L V , in whic h case one deletes them just from L U but lea ves them in L V and Γ + ; one also decrements the lab els of any subsequen t ( i 0 , µ 0 , l ), ( i 1 , µ 1 , l ) accordingly (see Figure 9). One observ es that | Γ 000 | − | Ω 000 | ≥ | Γ | − | Ω | − 1, | J 000 | + | K 000 | < | J | + | K | , and | J 000 | + | K 000 | + |L 000 U ∩ L 000 V | < | J | + | K | + |L U ∩ L V | , and so this term also is controlled by the induction hypothesis. (Note w e need to use the additional ρ factor (whic h is less than r µ /n ) in order to mak e up for a p ossible decrease in | Γ | − | Ω | b y 1.) This deals with the case when there are tw o U terms inv olving α ( x ). Sub case 2: (6.5) con tains a term U α ( x ) ,α ( x 0 ) and a term E α ( x ) ,β ( y ) . A t ypical case here is depicted in Figure 10. The strategy here is similar to Sub case 1, except that one uses (6.3) instead of (6.1). Letting ˜ J , ˜ α, a b e as b efore, w e can express (6.5) as X ˜ α,β . . . h X a ∈ [ n ] \ ˜ α ( ˜ J ) U a, ˜ α ( x 0 ) E a,β ( y ) i where the . . . denotes the pro duct of all the terms in (6.5) other than U α ( x ) ,α ( x 0 ) and E α ( x ) ,β ( y ) , but with α replaced b y ˜ α , and ˜ α , β ranging o ver injections from ˜ J and K to [ n ] resp ectively . F rom (6.3) we hav e X a ∈ [ n ] U a, ˜ α ( x 0 ) E a,β ( y ) = (1 − ρ ) E ˜ α ( x 0 ) ,β ( y ) and hence X a ∈ [ n ] \ ˜ α ( ˜ J ) U a, ˜ α ( x 0 ) E a,β ( y ) = (1 − ρ ) E ˜ α ( x 0 ) ,β ( y ) − X ˜ x ∈ ˜ J U ˜ α ( ˜ j ) , ˜ α ( x 0 ) E ˜ α ( ˜ j ) ,β ( y ) (6.11) 40 Figure 9: A collapse of t wo U factors (with identical indices) to a ρ 1 x 0 = x 00 factor. The p oint mark ed (a) indicates the guarded nature of the non-ro ok mov e on the right. Note that | Γ | − | Ω | can decrease by at most 1 (and will often sta y constant or ev en increase). Figure 10: A configuration inv olving a U and E factor on the left. After applying (6.3), one gets some terms associated to configuations such as those in the upper righ t, in whic h the x ro w has b een deleted and replaced with another existing row ˜ x , plus a term coming from a configuration in the lo wer right, in whic h the U E terms hav e b een collapsed to a single E term. 41 Figure 11: A m ultiplicit y 2 ro w with t w o Es, which are necessarily at the ends of t wo adjacen t legs of the spider. Here we use ( i, µ, l ) as shorthand for ( s i,µ,l , t i,µ,l ). 
The contribution of the final terms in (6.11) are treated in exactly the same wa y as the final terms in (6.10), and the main term E ˜ α ( x 0 ) ,β ( y ) is treated in exactly the same w ay as the term U ˜ α ( x 0 ) , ˜ α ( x 00 ) in (6.10). This concludes the treatmen t of the case when there is one U term and one E term in volving α ( x ). Sub case 3: (6.5) con tains t wo terms E α ( x ) ,β ( y ) , E α ( x ) ,β ( y 0 ) . A t ypical case here is depicted in 11. The strategy here is similar to that in the previous tw o sub cases, but no w one uses (6.4) rather than (6.1). The com binatorics of the situation are, ho w ever, sligh tly different. By considering the path from E α ( x ) ,β ( y ) to E α ( x ) ,β ( y 0 ) along the spider, we see (from the hypoth- esis τ x = 2) that this path must be completely horizontal (with no elem en ts of L U presen t), and the tw o legs of the spider that give rise to E α ( x ) ,β ( y ) , E α ( x ) ,β ( y 0 ) at their tips must b e adjacen t, with their bases connected by a horizon tal line segment. In other words, up to in terchange of y and y 0 , and cyclic permutation of the [ j ] indices, we may assume that ( x, y ) = ( s (1 , 1 , k ( i, 1)) , t (1 , 1 , k ( i, 1))); ( x, y 0 ) = ( s (2 , 0 , k (2 , 0)) , t (2 , 0 , k (2 , 0))) with s (1 , 1 , l ) = s (2 , 0 , l 0 ) = x for all 0 ≤ l ≤ k (1 , 1) and 0 ≤ l 0 ≤ k (2 , 0), where the index 2 is understo o d to b e identified with 1 in the degenerate case j = 1. Also, L U cannot contain an y triple of the form (1 , 1 , l ) for l ∈ [ k (1 , 1)] or (2 , 0 , l 0 ) for l 0 ∈ [ k (2 , 0)] (and so all these triples lie in L V instead). F or tec hnical reasons w e need to deal with the degenerate case j = 1 separately . In this case, s is iden tically equal to x , and so (6.5) simplifies to X β h X a ∈ [ n ] E a,β ( y ) E a,β ( y 0 ) i 1 Y µ =0 k (1 ,µ ) Y l =0 V β ( t ( i,µ,l − 1)) ,β ( t ( i,µ,l )) . In the extreme degenerate case when k (1 , 0) = k (1 , 1) = 0, the sum is just P a,b ∈ [ n ] E 2 ab = r , which is acceptable, so w e ma y assume that k (1 , 0) + k (1 , 1) > 0. W e ma y assume that the column m ultiplicity τ ˜ y ≥ 4 for ev ery ˜ y ∈ K , since otherwise w e could use (the reflected form of ) one of the previous tw o sub cases to conclude (6.6) from the induction hypothesis. (Note when y = y 0 , it is not p ossible for τ y to equal 2 since k (1 , 0) + k (1 , 1) > 0.) 42 Using (6.4) follow ed b y (1.8a) we hav e X a ∈ [ n ] E a,β ( y ) E a,β ( y 0 ) . √ r µ /n + 1 y = y 0 r /n . r µ /n and so b y (1.8b) we can b ound | X C | . X β ( r µ /n )( √ r µ /n ) k (1 , 0)+ k (1 , 1) . The n umber of p ossible β is at most n | K | , so to establish (6.6) in this case it suffices to show that n | K | ( r µ /n )( √ r µ /n ) k (1 , 0)+ k (1 , 1) . ( r µ /n ) | Γ |−| Ω | n. Observ e that in this degenerate case j = 1, w e hav e | Ω | = | K | and | Γ | = k (1 , 0) + k (1 , 1) + 2. One then chec ks that the claim is true when r µ = 1, so it suffices to chec k that the other extreme case r µ = n , i.e. | K | − 1 2 ( k (1 , 0) + k (1 , 1)) ≤ 1 . But as τ y ≥ 4 for all k , ev ery element in K m ust b e visited at least t wice, and the claim follo ws. No w we deal with the non-degenerate case j > 1. Letting ˜ J , ˜ α, a b e as in previous sub cases, we can express (6.5) as X ˜ α,β . . . h X a ∈ [ n ] \ ˜ α ( ˜ J ) E a,β ( y ) E a,β ( y 0 ) i (6.12) where the . . . 
denotes the pro duct of all the terms in (6.5) other than E α ( x ) ,β ( y ) and E α ( x ) ,β ( y 0 ) , but with α replaced b y ˜ α , and ˜ α , β ranging o ver injections from ˜ J and K to [ n ] resp ectively . F rom (6.4), we hav e X a ∈ [ n ] E a,β ( y ) E a,β ( y 0 ) = V β ( y ) ,β ( y 0 ) + ρ 1 y = y 0 and hence X a ∈ [ n ] \ ˜ α ( ˜ J ) E a,β ( y ) E a,β ( y 0 ) = V β ( y ) ,β ( y 0 ) + ρ 1 y = y 0 − X ˜ x ∈ ˜ J E ˜ α ( ˜ j ) ,β ( y ) E ˜ α ( ˜ j ) ,β ( y 0 ) . (6.13) The final terms are treated here in exactly the same wa y as the final terms in (6.10) or (6.11). No w we consider the main term V β ( y ) ,β ( y 0 ) . The contribution of this term will b e of the form X C 0 , where the configuration C 0 is formed from C by “detac hing” the tw o legs ( i, µ ) = (1 , 1) , (2 , 0) from the spider, “gluing them together” at the tips using the V β ( y ) ,β ( y 0 ) term, and then “inserting” those t wo legs into the base of the ( i, µ ) = (1 , 0) leg. T o explain this pro cedure more formally , observ e that the . . . term in (6.12) can be expanded further (isolating out the terms coming from ( i, µ ) = (1 , 1) , (2 , 0)) as h k (2 , 0) Y l =1 V β ( t (2 , 0 ,l − 1)) ,β ( t (2 , 0 ,l )) i h 1 Y l = k (1 , 1) V β ( s (1 , 1 ,l − 1)) ,β ( s (1 , 1 ,l )) i . . . where the . . . now denote all the terms that do not come from ( i, µ ) = (1 , 1) or ( i, µ ) = (2 , 0), and w e hav e rev ersed the order of the second pro duct for reasons that will be clearer later. Recalling 43 Figure 12: The configuation from Figure 11 after collapsing the tw o E ’s to a V , which is represen ted by a long curved line rather than a straight line for clarity . Note the substan tial relab eling of vertices. that y = t (1 , 1 , k (1 , 1)) and y 0 = t (2 , 0 , k (2 , 0)), we see that the contribution of the first term of (6.13) to (6.12) is no w of the form X ˜ α,β h k (2 , 0) Y l =1 V β ( t (2 , 0 ,l − 1)) ,β ( t (2 , 0 ,l )) i V β ( t (2 , 0 ,k (2 , 0))) ,β ( t (1 , 1 ,k (1 , 1))) h 1 Y l = k (1 , 1) V β ( s (1 , 1 ,l − 1)) ,β ( s (1 , 1 ,l )) i . . . . But this expression is simply X C 0 , where the configuration of C 0 is formed from C in the following fashion: • j 0 is equal to j − 1, J 0 is equal to ˜ J , and K 0 is equal to K . • k 0 (1 , 0) := k (2 , 0) + 1 + k (1 , 1) + k (1 , 0), and k 0 ( i, µ ) := k ( i + 1 , µ ) for ( i, µ ) 6 = (1 , 0). • The path { ( s 0 (1 , 0 , l ) , t 0 (1 , 0 , l )) : l = 0 , . . . , k 0 (1 , 0) } is formed by concatenating the path { ( s (1 , 0 , 0) , t (2 , 0 , l )) : l = 0 , . . . , k (2 , 0) } , with an edge from ( s (1 , 0 , 0) , t (2 , 0 , k (2 , 0))) to ( s (1 , 0 , 0) , t (1 , 1 , k (1 , 1))), with the path { ( s (1 , 0 , 0) , t (1 , 1 , l )) : l = k (1 , 1) , . . . , 0 } , with the path { ( s (1 , 0 , l ) , t (1 , 0 , l )) : l = 0 , . . . , k (1 , 0) } . • F or an y ( i, µ ) 6 = ( i, 0), the path { ( s 0 ( i, µ, l ) , t 0 ( i, µ, l )) : l = 0 , . . . k 0 ( i, µ ) } is equal to the path { ( s ( i, µ, l ) , t ( i + 1 , µ, l )) : l = 0 , . . . , k ( i + 1 , µ ) } . • W e ha ve L 0 U := { (1 , 0 , k (2 , 0) + 1 + k (1 , 1) + l ) : (1 , 0 , l ) ∈ L U } ∪ { ( i, µ, l ) : ( i + 1 , µ, l ) ∈ L U } and L 0 V := { (1 , 0 , k (2 , 0) + 1 + k (1 , 1) + l ) : (1 , 0 , l ) ∈ L V } ∪ { ( i, µ, l ) : ( i + 1 , µ, l ) ∈ L V } ∪ { (1 , 0 , 1) , . . . , (1 , 0 , k (2 , 0) + 1 + k (1 , 1)) } . This construction is represen ted in Figure 12. 44 One can c heck that this is indeed a configuration. 
One has |J′| + |K′| < |J| + |K|, |Γ′| = |Γ| − 1, and |Ω′| ≤ |Ω| − 1, and so this contribution to (6.6) is acceptable by the (first) induction hypothesis. This handles the contribution of the V_{β(y),β(y′)} term.

The ρ1_{y=y′} term is treated similarly, except that there is no edge between the points (s(1,0,0), t(2,0,k(2,0))) and (s(1,0,0), t(1,1,k(1,1))) (which are now equal, since y = y′). This reduces the analogue of |Γ′| to |Γ| − 2, but the additional factor of ρ (which is at most r_µ/n) compensates for this. We omit the details. This concludes the treatment of the third subcase.

6.3.3 Third case: high multiplicity rows and columns

After eliminating all of the previous cases, we may now assume (since τ_x is even) that

τ_x ≥ 4 for all x ∈ J (6.14)

and similarly we may assume that

τ_y ≥ 4 for all y ∈ K. (6.15)

We have now made the maximum use we can of the cancellation identities (6.1), (6.3), (6.4), and have no further use for them. Instead, we shall now place absolute values everywhere and estimate X_C using (1.9), (1.8a), (1.8b), obtaining the bound

|X_C| ≤ n^{|J|+|K|} O(√(r_µ)/n)^{|Γ|+|L_U∩L_V|}.

Comparing this with (6.6), we see that it will suffice (by taking C_0 large enough) to show that

n^{|J|+|K|} (√(r_µ)/n)^{|Γ|+|L_U∩L_V|} ≤ (r_µ/n)^{|Γ|−|Ω|} n.

Using the extreme cases r_µ = 1 and r_µ = n as test cases, we see that our task is to show that

|J| + |K| ≤ |L_U∩L_V| + |Ω| + 1 (6.16)

and

|J| + |K| ≤ (|Γ| + |L_U∩L_V|)/2 + 1. (6.17)

The first inequality (6.16) is proven by Lemma 5.1. The second is a consequence of the double counting identity

4(|J| + |K|) ≤ Σ_{x∈J} τ_x + Σ_{y∈K} τ_y = 2|Γ| + 2|L_U∩L_V|,

where the inequality follows from (6.14)–(6.15) (and we do not even need the +1 in this case).

7 Discussion

Interestingly, there is an emerging literature on the development of efficient algorithms for solving the nuclear-norm minimization problem (1.3) [6, 17]. For instance, in [6], the authors show that the singular value thresholding algorithm can solve certain problem instances, in which the matrix has close to a billion unknown entries, in a matter of minutes on a personal computer. Hence, the near-optimal sampling results introduced in this paper are practical and should therefore be of consequence to practitioners interested in recovering low-rank matrices from just a few entries.

To be broadly applicable, however, the matrix completion problem needs to be robust vis-à-vis noise. That is, if one is given a few entries of a low-rank matrix contaminated with a small amount of noise, one would like to be able to guess the missing entries, perhaps not exactly, but accurately. We actually believe that the methods and results developed in this paper are amenable to the study of the noisy matrix completion problem and hope to report on our progress in a later paper.

8 Appendix

8.1 Equivalence between the uniform and Bernoulli models

8.1.1 Lower bounds

For the sake of completeness, we explain how Theorem 1.7 implies nearly identical results for the uniform model. We have established the lower bound by showing that there are two fixed matrices M ≠ M′ for which P_Ω(M) = P_Ω(M′) with probability greater than δ unless m obeys the bound (1.20).
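Before carrying out the transfer to the Bernoulli model below, the following small simulation (an illustrative sketch, not part of the argument; the values of n and m are arbitrary) shows the concentration fact that drives it: under the Bernoulli model with p_0 = 2m/n², the size |Ω| is Binomial(n², p_0) with mean 2m, so the event {|Ω| < m} has negligible probability.

import numpy as np

rng = np.random.default_rng(2)
n, m, trials = 50, 500, 20000
p0 = 2 * m / n ** 2                              # Bernoulli parameter used in this appendix

sizes = rng.binomial(n ** 2, p0, size=trials)    # |Omega| under the Bernoulli model
print("mean |Omega| =", sizes.mean())            # about 2m
print("P(|Omega| < m) ~", np.mean(sizes < m))    # essentially 0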
Supp ose that Ω is sampled according to the Bernoulli mo del with p 0 ≥ m/n 2 and let F b e the ev ent {P Ω ( M ) = P Ω ( M 0 ) } . Then P ( F ) = n 2 X k =0 P ( F | | Ω | = k ) P ( | Ω | = k ) ≤ m − 1 X k =0 P ( | Ω | = k ) + n 2 X k = m P ( F | | Ω | = k ) P ( | Ω | = k ) ≤ P ( | Ω | < m ) + P ( F | | Ω | = m ) , where we ha ve used the fact that for k ≥ m , P ( F | | Ω | = m ) ≥ P ( F | | Ω | = k ). The conditional distribution of Ω giv en its cardinalit y is uniform and, therefore, P Unif( m ) ( F ) ≥ P Ber( p 0 ) ( F ) − P Ber( p 0 ) ( | Ω | < m ) , in which P Unif( m ) and P Ber( p 0 ) are probabilities calculated under the uniform and Bernoulli mo dels. If we choose p 0 = 2 m/n 2 , we ha ve that P Ber( p 0 ) ( | Ω | < m ) ≤ δ / 2 pro vided δ is not ridiculously small. Th us if P Ber( p 0 ) ( F ) ≥ δ , w e hav e P Unif( m ) ( F ) ≥ δ / 2 . In short, we get a lo wer b ound for the uniform mo del b y applying the b ound for the Bernoulli mo del with a v alue of p = 2 m 2 /n and a probabilit y of failure equal to 2 δ . 8.1.2 Upper b ounds W e prov e the claim stated at the onset of Section 3 which states that the probabilit y of failure under the uniform mo del is at most t wice that under the Bernoulli model. Let F b e the ev en t that 46 the reco very via (1.3) is not exact. With our earlier notations, P Ber( p ) ( F ) = n 2 X k =0 P Ber( p ) ( F | | Ω | = k ) P Ber( p ) ( | Ω | = k ) ≥ m X k =0 P Ber( p ) ( F | | Ω | = k ) P Ber( p ) ( | Ω | = k ) ≥ P Ber( p ) ( F | | Ω | = m ) m X k =0 P Ber( p ) ( | Ω | = k ) ≥ 1 2 P Unif( m ) ( F ) , where we hav e used P Ber( p ) ( F | | Ω | = k ) ≥ P Ber( p ) ( F | | Ω | = m ) for k ≤ m (the probability of failure is nonincreasing in the size of the observed set), and P Ber( p ) ( | Ω | ≤ m ) ≥ 1 / 2. 8.2 Pro of of Lemma 3.3 In this section, w e will mak e frequent use of (3.13) and of the similar iden tit y Q 2 T = (1 − 2 ρ 0 ) Q T + ρ 0 (1 − ρ 0 ) I , (8.1) whic h is obtained b y squaring b oth sides of (3.17) together with P 2 T = P T . W e b egin with tw o lemmas. Lemma 8.1 F or e ach k ≥ 0 , we have ( Q Ω P T ) k Q Ω = k X j =0 α ( k ) j ( Q Ω Q T ) j Q Ω + k − 1 X j =0 β ( k ) j ( Q Ω Q T ) j + k − 2 X j =0 γ ( k ) j Q T ( Q Ω Q T ) j Q Ω + k − 3 X j =0 δ ( k ) j Q T ( Q Ω Q T ) j , (8.2) wher e starting fr om α (0) 0 = 1 , the se quenc es { α ( k ) } , { β ( k ) } , { γ ( k ) } and { δ ( k ) } ar e inductively define d via α ( k +1) j = [ α ( k ) j − 1 + (1 − ρ 0 ) γ ( k ) j − 1 ] + ρ 0 (1 − 2 p ) p [ α ( k ) j + (1 − ρ 0 ) γ ( k ) j ] + 1 j =0 ρ 0 [ β ( k ) 0 + (1 − ρ 0 ) δ ( k ) 0 ] β ( k +1) j = [ β ( k ) j − 1 + (1 − ρ 0 ) δ ( k ) j − 1 ] + ρ 0 (1 − 2 p ) p [ β ( k ) j + (1 − ρ 0 ) δ ( k ) j ]1 j > 0 + 1 j =0 ρ 0 1 − p p [ α ( k ) 0 + (1 − ρ 0 ) γ ( k ) 0 ] and γ ( k +1) j = ρ 0 (1 − p ) p [ α ( k ) j +1 + (1 − ρ 0 ) γ ( k ) j +1 ] δ ( k +1) j = ρ 0 (1 − p ) p [ β ( k ) j +1 + (1 − ρ 0 ) δ ( k ) j +1 ] . In the ab ove r e curr enc e r elations, we adopt the c onvention that α ( k ) j = 0 whenever j is not in the r ange sp e cifie d by (8.2) , and similarly for β ( k ) j , γ ( k ) j and δ ( k ) j . 47 Pro of The pro of operates b y induction. The claim for k = 0 is straigh tforward. T o compute the co efficien t sequences of ( Q Ω P T ) k +1 Q Ω from those of ( Q Ω P T ) k Q Ω , use the identit y P T = Q T + ρ 0 I to decomp ose ( Q Ω P T ) k +1 Q Ω as follo ws: ( Q Ω P T ) k +1 Q Ω = Q Ω Q T ( Q Ω P T ) k Q Ω + ρ 0 Q Ω ( Q Ω P T ) k Q Ω . 
Then expanding ( Q Ω P T ) k Q Ω as in (8.2), and using the tw o iden tities Q Ω ( Q Ω Q T ) j Q Ω = ( 1 − 2 p p Q Ω + (1 − p ) p I , j = 0 , 1 − 2 p p ( Q Ω Q T ) j Q Ω + (1 − p ) p Q T ( Q Ω Q T ) j − 1 Q Ω , j > 0 , and Q Ω ( Q Ω Q T ) j = ( Q Ω , j = 0 , 1 − 2 p p ( Q Ω Q T ) j + (1 − p ) p Q T ( Q Ω Q T ) j − 1 , j > 0 , whic h b oth follow from (3.13), giv es the desired recurrence relation. The calculation is rather straigh tforward and omitted. W e note that the recurrence relations giv e α ( k ) k = 1 for all k ≥ 0, β ( k ) k − 1 = β ( k − 1) k − 2 = . . . = β (1) 0 = ρ 0 (1 − p ) p for all k ≥ 1, and γ ( k ) k − 2 = ρ 0 (1 − p ) p α ( k − 1) k − 1 = ρ 0 (1 − p ) p , δ ( k ) k − 3 = ρ 0 (1 − p ) p β ( k − 1) k − 2 = ρ 0 (1 − p ) p 2 , for all k ≥ 2 and k ≥ 3 resp ectively . Lemma 8.2 Put λ = ρ 0 /p and observe that by assumption (1.22) , λ < 1 . Then for al l j, k ≥ 0 , we have max | α ( k ) j | , | β ( k ) j | , | γ ( k ) j | , | δ ( k ) j | ≤ λ d k − j 2 e 4 k . (8.3) Pro of W e pro ve the lemma b y induction on k . The claim is true for k = 0. Supp ose it is true up to k , w e then use the recurrence relations given b y Lemma 8.1 to establish the claim up to k + 1. In details, since | 1 − ρ 0 | < 1, ρ 0 < λ and | 1 − 2 p | < 1, the recurrence relation for α ( k +1) giv es | α ( k +1) j | ≤ | α ( k ) j − 1 | + | γ ( k ) j − 1 | + λ [ | α ( k ) j | + | γ ( k ) j | ] + 1 j =0 λ [ | β ( k ) 0 | + | δ ( k ) 0 | ] ≤ 2 λ d k +1 − j 2 e 4 k 1 j > 0 + 2 λ d k − j 2 e +1 4 k + 2 λ d k 2 e +1 4 k 1 j =0 ≤ 2 λ d k +1 − j 2 e 4 k 1 j > 0 + 2 λ d k +1 − j 2 e 4 k + 2 λ d k +1 2 e 4 k 1 j =0 ≤ λ d k +1 − j 2 e 4 k +1 , 48 whic h pro ves the claim for the sequence { α ( k ) } . W e b ound | β ( k +1) j | in exactly the same wa y and omit the details. Now the recurrence relation for γ ( k +1) giv es | γ ( k +1) j | ≤ λ [ | α ( k ) j +1 | + | γ ( k ) j +1 | ] ≤ 2 λ d k − j − 1 2 e +1 4 k ≤ 4 k +1 λ d k +1 − j 2 e , whic h pro ves the claim for the sequence { γ ( k ) } . The quantit y | δ ( k +1) j | is bounded in exactly the same w ay , whic h concludes the pro of of the lemma. W e are no w w ell p ositioned to pro ve Lemma 3.3 and b egin b y recording a useful fact. Since for an y X , kP T ⊥ ( X ) k ≤ k X k , and Q T = P T − ρ 0 I = ( I − P T ⊥ ) − ρ 0 I = (1 − ρ 0 ) I − P T ⊥ , the triangular inequalit y gives that for all X , kQ T ( X ) k ≤ 2 k X k . (8.4) No w k ( Q Ω P T ) k Q Ω ( E ) k ≤ k X j =0 | α ( k ) j |k ( Q Ω Q T ) j Q Ω ( E ) k + k − 1 X j =0 | β ( k ) j |k ( Q Ω Q T ) j ( E ) k + k − 2 X j =0 | γ ( k ) j |kQ T ( Q Ω Q T ) j Q Ω ( E ) k + k − 3 X j =0 | δ ( k ) j |kQ T ( Q Ω Q T ) j ( E ) k , and it follo ws from (8.4) that k ( Q Ω P T ) k Q Ω ( E ) k ≤ k X j =0 ( | α ( k ) j | + 2 | γ ( k ) j | ) k ( Q Ω Q T ) j Q Ω ( E ) k + k − 1 X j =0 ( | β ( k ) j | + 2 | δ ( k ) j | ) k ( Q Ω Q T ) j ( E ) k . F or j = 0, we hav e k ( Q Ω Q T ) j ( E ) k = k E k = 1 while for j > 0 k ( Q Ω Q T ) j ( E ) k = k ( Q Ω Q T ) j − 1 Q Ω Q T ( E ) k = (1 − ρ 0 ) k ( Q Ω Q T ) j − 1 Q Ω ( E ) k since Q T ( E ) = (1 − ρ 0 )( E ). By using the size estimates given by Lemma 8.2 on the co efficients, we ha ve 1 3 k ( Q Ω P T ) k Q Ω ( E ) k ≤ 1 3 σ k +1 2 + 4 k k − 1 X j =0 λ d k − j 2 e σ j +1 2 + 4 k k − 1 X j =0 λ d k − j 2 e σ j 2 ≤ 1 3 σ k +1 2 + 4 k σ k +1 2 k − 1 X j =0 λ d k − j 2 e σ − k − j 2 + 4 k σ k 2 k − 1 X j =0 λ d k − j 2 e σ − k − j 2 ≤ 1 3 σ k +1 2 + 4 k σ k +1 2 + σ k 2 k − 1 X j =0 λ d k − j 2 e σ − k − j 2 . 
We are now well positioned to prove Lemma 3.3 and begin by recording a useful fact. Since for any $X$, $\|\mathcal{P}_{T^\perp}(X)\| \le \|X\|$, and

$$
\mathcal{Q}_T = \mathcal{P}_T - \rho_0 \mathcal{I} = (\mathcal{I} - \mathcal{P}_{T^\perp}) - \rho_0 \mathcal{I} = (1-\rho_0)\mathcal{I} - \mathcal{P}_{T^\perp},
$$

the triangle inequality gives that for all $X$,

$$
\|\mathcal{Q}_T(X)\| \le 2\|X\|. \qquad (8.4)
$$

Now

$$
\|(\mathcal{Q}_\Omega \mathcal{P}_T)^k \mathcal{Q}_\Omega(E)\|
\le \sum_{j=0}^{k} |\alpha_j^{(k)}|\,\|(\mathcal{Q}_\Omega \mathcal{Q}_T)^j \mathcal{Q}_\Omega(E)\|
+ \sum_{j=0}^{k-1} |\beta_j^{(k)}|\,\|(\mathcal{Q}_\Omega \mathcal{Q}_T)^j(E)\|
+ \sum_{j=0}^{k-2} |\gamma_j^{(k)}|\,\|\mathcal{Q}_T(\mathcal{Q}_\Omega \mathcal{Q}_T)^j \mathcal{Q}_\Omega(E)\|
+ \sum_{j=0}^{k-3} |\delta_j^{(k)}|\,\|\mathcal{Q}_T(\mathcal{Q}_\Omega \mathcal{Q}_T)^j(E)\|,
$$

and it follows from (8.4) that

$$
\|(\mathcal{Q}_\Omega \mathcal{P}_T)^k \mathcal{Q}_\Omega(E)\|
\le \sum_{j=0}^{k} \bigl(|\alpha_j^{(k)}| + 2|\gamma_j^{(k)}|\bigr)\,\|(\mathcal{Q}_\Omega \mathcal{Q}_T)^j \mathcal{Q}_\Omega(E)\|
+ \sum_{j=0}^{k-1} \bigl(|\beta_j^{(k)}| + 2|\delta_j^{(k)}|\bigr)\,\|(\mathcal{Q}_\Omega \mathcal{Q}_T)^j(E)\|.
$$

For $j = 0$, we have $\|(\mathcal{Q}_\Omega \mathcal{Q}_T)^j(E)\| = \|E\| = 1$ while for $j > 0$,

$$
\|(\mathcal{Q}_\Omega \mathcal{Q}_T)^j(E)\| = \|(\mathcal{Q}_\Omega \mathcal{Q}_T)^{j-1}\mathcal{Q}_\Omega \mathcal{Q}_T(E)\| = (1-\rho_0)\,\|(\mathcal{Q}_\Omega \mathcal{Q}_T)^{j-1}\mathcal{Q}_\Omega(E)\|
$$

since $\mathcal{Q}_T(E) = (1-\rho_0)E$. By using the size estimates given by Lemma 8.2 on the coefficients, we have

$$
\frac{1}{3}\,\|(\mathcal{Q}_\Omega \mathcal{P}_T)^k \mathcal{Q}_\Omega(E)\|
\le \frac{1}{3}\sigma^{\frac{k+1}{2}} + 4^k \sum_{j=0}^{k-1} \lambda^{\lceil \frac{k-j}{2} \rceil}\sigma^{\frac{j+1}{2}} + 4^k \sum_{j=0}^{k-1} \lambda^{\lceil \frac{k-j}{2} \rceil}\sigma^{\frac{j}{2}}
\le \frac{1}{3}\sigma^{\frac{k+1}{2}} + 4^k \sigma^{\frac{k+1}{2}} \sum_{j=0}^{k-1} \lambda^{\lceil \frac{k-j}{2} \rceil}\sigma^{-\frac{k-j}{2}} + 4^k \sigma^{\frac{k}{2}} \sum_{j=0}^{k-1} \lambda^{\lceil \frac{k-j}{2} \rceil}\sigma^{-\frac{k-j}{2}}
\le \frac{1}{3}\sigma^{\frac{k+1}{2}} + 4^k\bigl(\sigma^{\frac{k+1}{2}} + \sigma^{\frac{k}{2}}\bigr) \sum_{j=0}^{k-1} \lambda^{\lceil \frac{k-j}{2} \rceil}\sigma^{-\frac{k-j}{2}}.
$$

Now,

$$
\sum_{j=0}^{k-1} \lambda^{\lceil \frac{k-j}{2} \rceil}\sigma^{-\frac{k-j}{2}}
\le \Bigl(\frac{\lambda}{\sqrt{\sigma}} + \frac{\lambda}{\sigma}\Bigr)\frac{1}{1 - \lambda/\sigma}
\le \frac{2}{3}\sqrt{\sigma},
$$

where the last inequality holds provided that $4\lambda \le \sigma^{3/2}$. The conclusion is

$$
\|(\mathcal{Q}_\Omega \mathcal{P}_T)^k \mathcal{Q}_\Omega(E)\| \le (1 + 4^{k+1})\,\sigma^{\frac{k+1}{2}},
$$

which is what we needed to establish.
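The elementary series estimate above is easily confirmed numerically. The sketch below scans a grid of pairs $(\sigma, \lambda)$ with $4\lambda \le \sigma^{3/2}$, restricting attention to $0 < \sigma \le 1$ (an assumption made only for this illustration), and checks both inequalities in the display; it is an aid to the reader, not part of the proof.

import numpy as np
from math import ceil

# Grid check of the series bound; sigma <= 1 and 4*lam <= sigma**1.5 are
# assumptions made only for this illustration.
k = 30
for sigma in np.linspace(0.05, 1.0, 20):
    for lam in np.linspace(1e-4, sigma ** 1.5 / 4, 20):
        series = sum(lam ** ceil((k - j) / 2) * sigma ** (-(k - j) / 2)
                     for j in range(k))
        closed = (lam / np.sqrt(sigma) + lam / sigma) / (1 - lam / sigma)
        assert series <= closed + 1e-12
        assert closed <= (2 / 3) * np.sqrt(sigma) + 1e-12
print("series bound confirmed on the grid")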
Acknowledgements

E. C. is supported by ONR grants N00014-09-1-0469 and N00014-08-1-0749 and by the Waterman Award from NSF. E. C. would like to thank Xiaodong Li and Chiara Sabatti for helpful conversations related to this project. T. T. is supported by a grant from the MacArthur Foundation, by NSF grant DMS-0649473, and by the NSF Waterman award.