Regularization for Matrix Completion

Raghunandan H. Keshavan* and Andrea Montanari*†
Departments of Electrical Engineering* and Statistics†, Stanford University

Abstract: We consider the problem of reconstructing a low rank matrix from noisy observations of a subset of its entries. This task has applications in statistical learning, computer vision, and signal processing. In these contexts, 'noise' generically refers to any contribution to the data that is not captured by the low-rank model. In most applications, the noise level is large compared to the underlying signal and it is important to avoid overfitting. In order to tackle this problem, we define a regularized cost function well suited for spectral reconstruction methods. Within a random noise model, and in the large system limit, we prove that the resulting accuracy undergoes a phase transition depending on the noise level and on the fraction of observed entries. The cost function can be minimized using OptSpace (a manifold gradient descent algorithm). Numerical simulations show that this approach is competitive with state-of-the-art alternatives.

I. INTRODUCTION

Let $N$ be an $m \times n$ matrix which is 'approximately' low rank, that is

$$N = M + W = U \Sigma V^T + W, \qquad (1)$$

where $U$ has dimensions $m \times r$, $V$ has dimensions $n \times r$, and $\Sigma$ is a diagonal $r \times r$ matrix. Thus $M$ has rank $r$ and $W$ can be thought of as noise, or 'unexplained contributions' to $N$. Throughout the paper we assume the normalization $U^T U = m I_{r \times r}$ and $V^T V = n I_{r \times r}$ ($I_{d \times d}$ being the $d \times d$ identity). Out of the $m \times n$ entries of $N$, a subset $E \subseteq [m] \times [n]$ is observed. We let $P_E(N)$ be the $m \times n$ matrix that contains the observed entries of $N$, and is filled with $0$'s in the other positions:

$$P_E(N)_{ij} = \begin{cases} N_{ij} & \text{if } (i, j) \in E, \\ 0 & \text{otherwise.} \end{cases} \qquad (2)$$
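As a concrete illustration of this observation model, the projector $P_E$ and the random sampling of entries can be sketched in a few lines of NumPy. This is our illustration, not code from the paper; the dimensions, rank, noise scale, and sampling probability are arbitrary choices.

```python
import numpy as np

def sample_E(m, n, p, rng):
    """Observe each entry independently with probability p."""
    return rng.random((m, n)) < p

def P_E(N, mask):
    """Keep the observed entries of N, fill the rest with zeros (Eq. (2))."""
    return np.where(mask, N, 0.0)

rng = np.random.default_rng(0)
m, n, r = 40, 30, 2
U = rng.standard_normal((m, r))
V = rng.standard_normal((n, r))
M = U @ V.T                            # rank-r signal
W = 0.1 * rng.standard_normal((m, n))  # 'unexplained contributions'
N = M + W                              # approximately low-rank matrix
mask = sample_E(m, n, p=0.5, rng=rng)
NE = P_E(N, mask)                      # the sparsified matrix N^E
```

The reconstruction task is then to recover $M$ from `NE` and `mask` alone.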
The noisy matrix completion problem requires reconstructing the low rank matrix $M$ from the observations $P_E(N)$. In the following we will also write $N^E = P_E(N)$ for the sparsified matrix.

Over the last year, matrix completion has attracted significant attention because of its relevance, among other applications, to collaborative filtering. In this case, the matrix $N$ contains evaluations of a group of customers on a group of products, and one is interested in exploiting a sparsely filled matrix to provide personalized recommendations [1]. In such applications, the noise $W$ is not a small perturbation and it is crucial to avoid overfitting. For instance, in the limit $M \to 0$, the estimate $\widehat{M}$ risks being a low-rank approximation of the noise $W$, which would be grossly incorrect. In order to overcome this problem, we propose in this paper an algorithm based on minimizing the following cost function:

$$F_E(X, Y; S) \equiv \frac{1}{2} \|P_E(N - X S Y^T)\|_F^2 + \frac{1}{2} \lambda \|S\|_F^2. \qquad (3)$$

Here the minimization variables are $S \in \mathbb{R}^{r \times r}$, and $X \in \mathbb{R}^{m \times r}$, $Y \in \mathbb{R}^{n \times r}$ with $X^T X = Y^T Y = I_{r \times r}$. Finally, $\lambda > 0$ is a regularization parameter.

A. Algorithm and main results

The algorithm is an adaptation of the OptSpace algorithm developed in [2]. A key observation is that the following modified cost function can be minimized by singular value decomposition (see Proposition III.1):

$$\widehat{F}_E(X, Y; S) \equiv \frac{1}{2} \|P_E(N) - X S Y^T\|_F^2 + \frac{1}{2} \lambda \|S\|_F^2. \qquad (4)$$

As emphasized in [2], [3], which analyzed the case $\lambda = 0$, this minimization can yield poor results unless the set of observations $E$ is 'well balanced'. This problem can be bypassed by 'trimming' the set $E$, and constructing a balanced set $\widetilde{E}$. The OptSpace algorithm is given as follows.
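To see why (4) is minimized by an SVD, note that for fixed orthonormal $X$, $Y$, setting the gradient with respect to $S$ to zero gives $S = X^T N^E Y/(1+\lambda)$, and the optimal $X$, $Y$ are then the leading singular vectors of $N^E$. A minimal NumPy sketch of this step (our illustration; the test matrix and $\lambda$ are arbitrary):

```python
import numpy as np

def minimize_F_hat(NE, r, lam):
    """Minimize (1/2)||N^E - X S Y^T||_F^2 + (lam/2)||S||_F^2 over
    orthonormal X, Y and unconstrained S.  For fixed (X, Y) the stationarity
    condition gives S = X^T N^E Y / (1 + lam); the optimal (X, Y) are the
    top-r singular vectors of N^E."""
    Uf, s, Vft = np.linalg.svd(NE, full_matrices=False)
    X, Y = Uf[:, :r], Vft[:r, :].T
    S = np.diag(s[:r] / (1.0 + lam))  # regularization shrinks singular values
    return X, S, Y

rng = np.random.default_rng(1)
NE = rng.standard_normal((30, 20))
X, S, Y = minimize_F_hat(NE, r=3, lam=0.5)
```

The effect of $\lambda$ is thus a uniform shrinkage of the retained singular values by the factor $1/(1+\lambda)$.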
OptSpace(set $E$, matrix $N^E$)
1: Trim $E$, and let $\widetilde{E}$ be the output;
2: Minimize $\widehat{F}_{\widetilde{E}}(X, Y; S)$ via SVD; let $X_0, Y_0, S_0$ be the output;
3: Minimize $F_E(X, Y; S)$ by gradient descent using $X_0, Y_0, S_0$ as initial condition.

In this paper we will study this algorithm under a model for which step 1 (trimming) is never called, i.e. $\widetilde{E} = E$ with high probability. We will therefore not discuss it any further. Section II compares the behavior of the present approach with alternative schemes. Our main analytical result is a sharp characterization of the mean square error after step 2. Here and below the limit $n \to \infty$ is understood to be taken with $m/n \to \alpha \in (0, \infty)$.

Theorem I.1. Assume $|M_{ij}| \le M_{\max}$, the $W_{ij}$ to be i.i.d. random variables with mean $0$, variance $\sqrt{mn}\,\sigma^2$ and $E\{W_{ij}^4\} \le C n^2$, and that each entry $(i, j)$ is observed (i.e. $(i, j) \in E$) independently with probability $p$. Finally, let $\widehat{M} = X_0 S_0 Y_0^T$ be the rank-$r$ matrix reconstructed by step 2 of OptSpace, for the optimal choice of $\lambda$. Then, almost surely for $n \to \infty$,

$$\frac{1}{\|M\|_F^2} \|\widehat{M} - M\|_F^2 = 1 - \frac{\Big\{ \sum_{k=1}^r \Sigma_k^2 \Big( 1 - \frac{\sigma^4}{p^2 \Sigma_k^4} \Big)_+ \Big\}^2}{\|\Sigma\|_F^2 \, \Big\{ \sum_{k=1}^r \Sigma_k^2 \Big( 1 + \frac{\sqrt{\alpha}\,\sigma^2}{p \Sigma_k^2} \Big) \Big( 1 + \frac{\sigma^2}{p \Sigma_k^2 \sqrt{\alpha}} \Big) \Big\}} + o_n(1),$$

where $(x)_+ \equiv \max(x, 0)$.

This theorem focuses on a high-noise regime, and predicts a sharp phase transition: if $\sigma^2/p < \Sigma_1^2$, we can successfully extract information on $M$ from the observations $N^E$. If on the other hand $\sigma^2/p \ge \Sigma_1^2$, the observations are essentially useless in reconstructing $M$. It is possible to prove [4] that the resulting tradeoff between noise and observed entries is tight: no algorithm can obtain relative mean square error smaller than one for $\sigma^2/p \ge \Sigma_1^2$, under a simple random model for $M$. To the best of our knowledge, this is the first sharp phase transition result for low rank matrix completion. For the proof of Theorem I.1, we refer to Section III.
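The asymptotic error formula of Theorem I.1 is straightforward to evaluate numerically. The following sketch (our transcription of the formula; the example spectra, noise levels, and aspect ratio are illustrative) also exhibits the phase transition: above the critical noise level the predicted relative error saturates at one.

```python
import math

def predicted_rel_mse(Sigma, sigma2, p, alpha):
    """Evaluate the asymptotic relative MSE of Theorem I.1 for a diagonal
    spectrum Sigma = (Sigma_1, ..., Sigma_r); (x)_+ is the positive part."""
    num = 0.0    # sum of Sigma_k^2 (1 - sigma^4 / (p^2 Sigma_k^4))_+
    den = 0.0    # sum with the two (1 + ...) noise factors
    norm2 = sum(s**2 for s in Sigma)   # ||Sigma||_F^2
    for s in Sigma:
        t = sigma2 / (p * s**2)        # sigma^2 / (p Sigma_k^2)
        num += s**2 * max(0.0, 1.0 - t**2)
        den += s**2 * (1.0 + math.sqrt(alpha) * t) * (1.0 + t / math.sqrt(alpha))
    return 1.0 - num**2 / (norm2 * den)

# Below the transition (sigma^2/p < Sigma_1^2) part of the signal is recovered;
low_noise = predicted_rel_mse([1.0, 0.5], sigma2=0.05, p=0.5, alpha=1.0)
# above it (sigma^2/p >= Sigma_1^2) the relative error equals one.
high_noise = predicted_rel_mse([1.0, 0.5], sigma2=0.6, p=0.5, alpha=1.0)
```

In the noiseless limit $\sigma = 0$ the formula returns zero error, as it should.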
An important byproduct of the proof is that it provides a rule for choosing the regularization parameter $\lambda$ in the large system limit.

B. Related work

The importance of regularization in matrix completion is well known to practitioners. For instance, one important component of many algorithms competing for the Netflix challenge [1] consisted in minimizing the cost function

$$H_E(\widetilde{X}, \widetilde{Y}) \equiv \frac{1}{2} \|P_E(N - \widetilde{X} \widetilde{Y}^T)\|_F^2 + \frac{1}{2} \lambda \|\widetilde{X}\|_F^2 + \frac{1}{2} \lambda \|\widetilde{Y}\|_F^2$$

(this is also known as maximum margin matrix factorization [5], [6]). Here the minimization variables are $\widetilde{X} \in \mathbb{R}^{m \times r}$, $\widetilde{Y} \in \mathbb{R}^{n \times r}$. Unlike in OptSpace, these matrices are not constrained to be orthogonal, and as a consequence the problem becomes significantly more degenerate. Notice that, in our approach, the orthogonality constraint fixes the norms $\|X\|_F$, $\|Y\|_F$. This motivates the use of $\|S\|_F^2$ as a regularization term.

Convex relaxations of the matrix completion problem were recently studied in [7], [8]. As emphasized by Mazumder, Hastie and Tibshirani [9], such nuclear norm relaxations can be viewed as spectral regularizations of a least squares problem. Finally, the phase transition phenomenon in Theorem I.1 generalizes a result of Johnstone and Lu on principal component analysis [10], and similar random matrix models were studied in [11].

II. NUMERICAL SIMULATIONS

In this section, we present the results of numerical simulations on synthetically generated matrices. The data are generated following the recipe of [9]: sample $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$ by choosing $U_{ij}$ and $V_{ij}$ independently and identically as $N(0, 1)$. Sample independently $W \in \mathbb{R}^{m \times n}$ by choosing $W_{ij}$ i.i.d. with distribution $N(0, \sigma^2 \sqrt{mn})$. Set $N = U V^T + W$.
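The degeneracy mentioned above can be made concrete: the fit term of $H_E$ depends only on the product $\widetilde{X}\widetilde{Y}^T$, so rescaling $\widetilde{X} \to c\widetilde{X}$, $\widetilde{Y} \to \widetilde{Y}/c$ leaves it unchanged while moving the regularization term. A small sketch (our illustration; matrix sizes and $\lambda$ are arbitrary):

```python
import numpy as np

def H_E(Xt, Yt, N, mask, lam):
    """Maximum margin matrix factorization cost: squared fit on the observed
    entries plus (lam/2) times the squared Frobenius norms of both factors."""
    R = np.where(mask, N - Xt @ Yt.T, 0.0)
    return 0.5 * np.sum(R**2) + 0.5 * lam * (np.sum(Xt**2) + np.sum(Yt**2))

rng = np.random.default_rng(2)
m, n, r = 12, 10, 2
N = rng.standard_normal((m, n))
mask = rng.random((m, n)) < 0.5
Xt = rng.standard_normal((m, r))
Yt = rng.standard_normal((n, r))

# Xt @ Yt.T, hence the fit term, is unchanged by Xt -> 2 Xt, Yt -> Yt / 2,
# but the regularizer is not: the factorization itself is not identifiable.
h1 = H_E(Xt, Yt, N, mask, lam=1.0)
h2 = H_E(2.0 * Xt, 0.5 * Yt, N, mask, lam=1.0)
```

With the orthogonality constraint of (3), this rescaling freedom disappears, which is exactly why $\|S\|_F^2$ alone suffices as a regularizer.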
We also use the parameters chosen in [9] and define

$$\mathrm{SNR} = \sqrt{\frac{\mathrm{Var}((U V^T)_{ij})}{\mathrm{Var}(W_{ij})}}, \qquad \mathrm{TestError} = \frac{\|P_E^{\perp}(U V^T - \widehat{N})\|_F^2}{\|P_E^{\perp}(U V^T)\|_F^2}, \qquad \mathrm{TrainError} = \frac{\|P_E(N - \widehat{N})\|_F^2}{\|P_E(N)\|_F^2},$$

where $P_E^{\perp}(A) \equiv A - P_E(A)$.

In Figure 1, we plot the train error and test error for the OptSpace algorithm on matrices generated as above with $n = 100$, $r = 10$, $\mathrm{SNR} = 1$ and $p = 0.5$. For comparison, we also plot the corresponding curves for Soft-Impute, Hard-Impute and SVT taken from [9].

[Figure omitted. Fig. 1. Test (top) and train (bottom) error vs. rank for OptSpace($\lambda^*$), OptSpace($0$), Soft-Impute, Hard-Impute and SVT. Here $m = n = 100$, $r = 10$, $p = 0.5$, $\mathrm{SNR} = 1$.]

In Figures 2 and 3, we plot the same curves for different values of $r$, $p$, $\mathrm{SNR}$. In these plots, OptSpace($\lambda$) corresponds to the algorithm that minimizes the cost (3). In particular, OptSpace($0$) corresponds to the algorithm described in [2]. Further, $\lambda^* = \lambda^*(\rho)$ is the value of the regularization parameter that minimizes the test error while using rank $\rho$ (this can be estimated on a subset of the data not used for training).

It is clear that regularization greatly improves the performance of OptSpace and makes it competitive with the best alternative methods.

III. PROOF OF THEOREM I.1

The proof of Theorem I.1 is based on the following three steps: (i) Obtain an explicit expression for the root mean square error in terms of right and left singular vectors of $N$; (ii) Estimate the effect of the noise $W$ on the right and left singular vectors; (iii) Estimate the effect of missing entries. Step (ii) builds on recent estimates on the eigenvectors of large covariance matrices [12]. In step (iii) we use the results of [2].
Step (i) is based on the following linear algebra calculation, whose proof we omit due to space constraints (here and below $\langle A, B \rangle \equiv \mathrm{Tr}(A B^T)$).

[Figure omitted. Fig. 2. Test (top) and train (bottom) error vs. rank for OptSpace($\lambda^*$), OptSpace($0$), Soft-Impute, Hard-Impute and SVT. Here $m = n = 100$, $r = 6$, $p = 0.5$, $\mathrm{SNR} = 1$.]

Proposition III.1. Let $X_0 \in \mathbb{R}^{m \times r}$ and $Y_0 \in \mathbb{R}^{n \times r}$ be the matrices whose columns are the first $r$ left and right singular vectors of $N^E$. Then the rank-$r$ matrix reconstructed by step 2 of OptSpace, with regularization parameter $\lambda$, has the form

$$\widehat{M}(\lambda) = X_0 S_0(\lambda) Y_0^T.$$

Further, there exists $\lambda^* > 0$ such that

$$\frac{1}{mn} \|M - \widehat{M}(\lambda^*)\|_F^2 = \|\Sigma\|_F^2 - \left( \frac{\langle X_0^T M Y_0,\, X_0^T N^E Y_0 \rangle}{\sqrt{mn}\, \|X_0^T N^E Y_0\|_F} \right)^2. \qquad (5)$$

A. The effect of noise

In order to isolate the effect of noise, we consider the matrix $\widehat{N} = p\, U \Sigma V^T + W^E$. Throughout this section we assume that the hypotheses of Theorem I.1 hold.

Lemma III.2. Let $(n z_{1,n}, \dots, n z_{r,n})$ be the $r$ largest singular values of $\widehat{N}$. Then, as $n \to \infty$, $z_{i,n} \to z_i$ almost surely, where, for $\Sigma_i^2 > \sigma^2/p$,

$$z_i = p \Sigma_i \left[ \alpha \left( \frac{\sigma^2}{p \Sigma_i^2} + \frac{1}{\sqrt{\alpha}} \right) \left( \frac{\sigma^2}{p \Sigma_i^2} + \sqrt{\alpha} \right) \right]^{1/2}, \qquad (6)$$

and $z_i = \sigma \sqrt{p}\, \alpha^{1/4} (1 + \sqrt{\alpha})$ for $\Sigma_i^2 \le \sigma^2/p$.

Further, let $X \in \mathbb{R}^{m \times r}$ and $Y \in \mathbb{R}^{n \times r}$ be the matrices whose columns are the first $r$ left and right singular vectors of $\widehat{N}$.

[Figure omitted. Fig. 3. Test (top) and train (bottom) error vs. rank for OptSpace($\lambda^*$), OptSpace($0$), Soft-Impute, Hard-Impute and SVT. Here $m = n = 100$, $r = 5$, $p = 0.2$, $\mathrm{SNR} = 10$.]
Then there exists a sequence of $r \times r$ orthogonal matrices $Q_n$ such that, almost surely,

$$\Big\| \frac{1}{\sqrt{m}} U^T X - A Q_n \Big\|_F \to 0, \qquad \Big\| \frac{1}{\sqrt{n}} V^T Y - B Q_n \Big\|_F \to 0,$$

with $A = \mathrm{diag}(a_1, \dots, a_r)$, $B = \mathrm{diag}(b_1, \dots, b_r)$ and

$$a_i^2 = \left( 1 - \frac{\sigma^4}{p^2 \Sigma_i^4} \right) \left( 1 + \frac{\sqrt{\alpha}\, \sigma^2}{p \Sigma_i^2} \right)^{-1}, \qquad b_i^2 = \left( 1 - \frac{\sigma^4}{p^2 \Sigma_i^4} \right) \left( 1 + \frac{\sigma^2}{p \sqrt{\alpha}\, \Sigma_i^2} \right)^{-1}, \qquad (7)$$

for $\Sigma_i^2 > \sigma^2/p$, while $a_i = b_i = 0$ otherwise.

Proof: Due to space limitations, we will focus here on the case $\Sigma_1^2, \dots, \Sigma_r^2 > \sigma^2/p$. The general proof proceeds along the same lines, and we defer it to [4]. Notice that $W^E$ is an $m \times n$ matrix with i.i.d. entries with variance $\sqrt{mn}\, \sigma^2 p$ and fourth moment bounded by $C n^2$. It is therefore sufficient to prove our claim for $p = 1$ and then rescale $\Sigma$ by $p$ and $\sigma$ by $\sqrt{p}$. We will also assume, without loss of generality, that $m \ge n$.

Let $\widehat{Z}$ be an $r \times r$ diagonal matrix containing the singular values $(n z_{1,n}, \dots, n z_{r,n})$. The singular vector equations read

$$U \hat{\beta}_y + W Y - X \widehat{Z} = 0, \qquad (8)$$

$$V \hat{\beta}_x + W^T X - Y \widehat{Z} = 0, \qquad (9)$$

where we defined $\hat{\beta}_x \equiv \Sigma U^T X$, $\hat{\beta}_y \equiv \Sigma V^T Y \in \mathbb{R}^{r \times r}$. By singular value decomposition we can write $W = L\, \mathrm{diag}(w_1, w_2, \dots, w_n)\, R^T$, with $L^T L = I_{m \times m}$, $R^T R = I_{n \times n}$. Let $u_i^T, x_i^T, v_i^T, y_i^T \in \mathbb{R}^r$ be the $i$-th rows of, respectively, $L^T U$, $L^T X$, $R^T V$, $R^T Y$. In this basis, equations (8) and (9) read

$$u_i^T \hat{\beta}_y + w_i y_i^T - x_i^T \widehat{Z} = 0, \quad i \in [n],$$
$$u_i^T \hat{\beta}_y - x_i^T \widehat{Z} = 0, \quad i \in [m] \setminus [n],$$
$$v_i^T \hat{\beta}_x + w_i x_i^T - y_i^T \widehat{Z} = 0, \quad i \in [n].$$

These can be solved to get

$$x_i^T = (u_i^T \hat{\beta}_y \widehat{Z} + w_i v_i^T \hat{\beta}_x)(\widehat{Z}^2 - w_i^2)^{-1}, \quad i \in [n],$$
$$x_i^T = u_i^T \hat{\beta}_y \widehat{Z}^{-1}, \quad i \in [m] \setminus [n],$$
$$y_i^T = (v_i^T \hat{\beta}_x \widehat{Z} + w_i u_i^T \hat{\beta}_y)(\widehat{Z}^2 - w_i^2)^{-1}, \quad i \in [n]. \qquad (10)$$
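The limiting formulas (6) and (7) are explicit and easy to check numerically. The sketch below (our transcription of the formulas; the test values of $\Sigma_i$, $\sigma^2$, $p$, $\alpha$ are arbitrary) verifies, for instance, that (6) is continuous at the threshold $\Sigma_i^2 = \sigma^2/p$ and that the overlaps reduce to one in the noiseless limit.

```python
import math

def z_limit(Sig, sigma2, p, alpha):
    """Asymptotic rescaled singular value: Eq. (6) above the threshold,
    the bulk-edge value sigma sqrt(p) alpha^(1/4) (1 + sqrt(alpha)) below."""
    if Sig**2 > sigma2 / p:
        t = sigma2 / (p * Sig**2)
        return p * Sig * math.sqrt(alpha * (t + 1 / math.sqrt(alpha)) * (t + math.sqrt(alpha)))
    return math.sqrt(sigma2 * p) * alpha**0.25 * (1 + math.sqrt(alpha))

def overlaps(Sig, sigma2, p, alpha):
    """Limiting overlaps (a_i, b_i) from Eq. (7); zero below the threshold."""
    if Sig**2 <= sigma2 / p:
        return 0.0, 0.0
    t = sigma2 / (p * Sig**2)
    common = 1.0 - t**2
    a2 = common / (1.0 + math.sqrt(alpha) * t)
    b2 = common / (1.0 + t / math.sqrt(alpha))
    return math.sqrt(a2), math.sqrt(b2)
```

For $\sigma = 0$, `z_limit` returns $p \Sigma_i \sqrt{\alpha}$ and `overlaps` returns $(1, 1)$, as expected from the exact SVD of $p\, U \Sigma V^T$ under the normalization $U^T U = m I$, $V^T V = n I$.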
By definition, $\Sigma^{-1} \hat{\beta}_x = \sum_{i=1}^m u_i x_i^T$ and $\Sigma^{-1} \hat{\beta}_y = \sum_{i=1}^n v_i y_i^T$, whence

$$\Sigma^{-1} \hat{\beta}_x = \sum_{i=1}^n u_i (u_i^T \hat{\beta}_y \widehat{Z} + w_i v_i^T \hat{\beta}_x)(\widehat{Z}^2 - w_i^2)^{-1} + \sum_{i=n+1}^m u_i u_i^T \hat{\beta}_y \widehat{Z}^{-1}, \qquad (11)$$

$$\Sigma^{-1} \hat{\beta}_y = \sum_{i=1}^n v_i (v_i^T \hat{\beta}_x \widehat{Z} + w_i u_i^T \hat{\beta}_y)(\widehat{Z}^2 - w_i^2)^{-1}. \qquad (12)$$

Let $\lambda_i = w_i^2 \alpha^{1/2}/(m^2 \sigma^2)$. Then it is a well known fact [13] that, as $n \to \infty$, the empirical law of the $\lambda_i$'s converges weakly almost surely to the Marchenko-Pastur law, with density

$$\rho(\lambda) = \frac{\alpha}{2 \pi \lambda} \sqrt{(\lambda - c_-^2)(c_+^2 - \lambda)}, \qquad c_\pm = 1 \pm \alpha^{-1/2}.$$

Let $\beta_x = \hat{\beta}_x/\sqrt{m}$, $\beta_y = \hat{\beta}_y/\sqrt{n}$, $Z = \widehat{Z}/n$. A priori, it is not clear that the sequence $(\beta_x, \beta_y, Z)$, which depends on $n$, converges. However, it is immediate to show that the sequence is tight, and hence we can restrict ourselves to a subsequence $\Xi \equiv \{n_i\}_{i \in \mathbb{N}}$ along which a limit exists. Eventually we will show that the limit does not depend on the subsequence, apart, possibly, from the rotation $Q_n$. Hence we shall denote the subsequential limit, by an abuse of notation, as $(\beta_x, \beta_y, Z)$.

Consider now such a convergent subsequence. It is possible to show that $\Sigma_i^2 > \sigma^2/p$ implies $Z_{ii}^2 > \alpha^{3/2} \sigma^2 c_+(\alpha)^2 + \delta$ for some positive $\delta$. Since, almost surely as $n \to \infty$, $(w_i/n)^2 < \alpha^{3/2} \sigma^2 c_+(\alpha)^2 + \delta/2$ for all $i$, for all purposes the summands on the right hand side of Eqs. (11), (12) can be replaced by uniformly continuous, bounded functions of the limiting eigenvalues $\lambda_i$. Further, each entry of $u_i$ (resp. $v_i$) is just a single coordinate of the left (right) singular vectors of the random matrix $W$. Using Theorem 1 in [12], it follows that any subsequential limit satisfies the equations

$$\beta_x = \Sigma \beta_y \Big\{ \int Z (Z^2 - \alpha^{3/2} \sigma^2 \lambda)^{-1} \rho(\lambda)\, d\lambda + (\alpha - 1) Z^{-1} \Big\}, \qquad (13)$$

$$\beta_y = \Sigma \beta_x \Big\{ \int Z (Z^2 - \alpha^{3/2} \sigma^2 \lambda)^{-1} \rho(\lambda)\, d\lambda \Big\}. \qquad (14)$$
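The Marchenko-Pastur density above integrates to one on its support $[c_-^2, c_+^2]$, which is a useful sanity check on the normalization. A self-contained sketch (our illustration; the aspect ratio $\alpha = 2$ and the plain midpoint quadrature are arbitrary choices):

```python
import math

def mp_density(lam, alpha):
    """Marchenko-Pastur density in the normalization used above, supported
    on [c_-^2, c_+^2] with c_pm = 1 +/- alpha^(-1/2) (alpha >= 1)."""
    cm2 = (1 - alpha**-0.5)**2
    cp2 = (1 + alpha**-0.5)**2
    if lam <= cm2 or lam >= cp2:
        return 0.0
    return alpha * math.sqrt((lam - cm2) * (cp2 - lam)) / (2 * math.pi * lam)

def integrate(f, a, b, steps=100000):
    """Plain midpoint rule; accurate enough to check the total mass."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

alpha = 2.0
mass = integrate(lambda x: mp_density(x, alpha),
                 (1 - alpha**-0.5)**2, (1 + alpha**-0.5)**2)
```

The same quadrature can be used to evaluate the integrals appearing in Eqs. (13), (14) for given $Z$.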
Solving for $\beta_y$, we get an equation of the form

$$\Sigma^{-2} \beta_y = \beta_y f(Z), \qquad (15)$$

where $f(\cdot)$ is a function that can be given explicitly using the Stieltjes transform of the measure $\rho(\lambda)\, d\lambda$. Equation (15) implies that $\beta_y$ is block diagonal according to the degeneracy pattern of $\Sigma$. Considering each block, either $\beta_y$ vanishes in the block (a case that can be excluded using $\Sigma_i^2 > \sigma^2/p$) or $\Sigma_i^{-2} = f(Z_{ii})$ in the block. Solving for $Z_{ii}$ shows that the limiting singular values are uniquely determined (independent of the subsequence) and given by Eq. (6).

In order to determine $\beta_x$ and $\beta_y$, first observe that, since $I_{r \times r} = Y^T Y = \sum_{i=1}^n y_i y_i^T$, we have, using Eq. (10),

$$I_{r \times r} = \sum_{i=1}^n (\widehat{Z}^2 - w_i^2)^{-1} (\widehat{Z} \hat{\beta}_x^T v_i + w_i \hat{\beta}_y^T u_i)(v_i^T \hat{\beta}_x \widehat{Z} + w_i u_i^T \hat{\beta}_y)(\widehat{Z}^2 - w_i^2)^{-1}.$$

In the limit $n \to \infty$, and assuming a convergent subsequence for $(Z, \beta_x, \beta_y)$, this sum can be computed as above, yielding

$$I_{r \times r} = \Big\{ \int Z^2 (Z^2 - \alpha^{3/2} \sigma^2 \lambda)^{-2} \rho(\lambda)\, d\lambda \Big\} C_x + \Big\{ \int \alpha^{3/2} \sigma^2 \lambda\, (Z^2 - \alpha^{3/2} \sigma^2 \lambda)^{-2} \rho(\lambda)\, d\lambda \Big\} C_y,$$

where $C_x = \beta_x^T \beta_x$, $C_y = \beta_y^T \beta_y$, and the functions of $Z$ on the right hand side are defined as standard analytic functions of matrices. Using Eqs. (13), (14) and solving the above, we get $C_x = \mathrm{diag}(\Sigma_1^2 a_1^2, \dots, \Sigma_r^2 a_r^2)$ and $C_y = \mathrm{diag}(\Sigma_1^2 b_1^2, \dots, \Sigma_r^2 b_r^2)$.

We already concluded that $\beta_x$ and $\beta_y$ are block diagonal with blocks in correspondence with the degeneracy pattern of $\Sigma$. Since $\beta_x^T \beta_x = C_x$ and $\beta_y^T \beta_y = C_y$ are diagonal, with the same degeneracy pattern, it follows that, inside each block of size $d$, each of $\beta_x$ and $\beta_y$ is proportional to a $d \times d$ orthogonal matrix. Therefore $\beta_x = \Sigma A Q_s$, $\beta_y = \Sigma B Q_s'$, for some orthogonal matrices $Q_s$, $Q_s'$. Also, using equation (13), one can prove that $Q_s = Q_s'$. Notice that, by the above argument, $A$, $B$ are uniquely fixed by our construction.
On the other hand, $Q_s$ might depend on the subsequence $\Xi$. Since our statement allows for a sequence of rotations $Q_n$ that depend on $n$, the eventual subsequence dependence of $Q_s$ can be factored out.

It is useful to point out a straightforward consequence of the above.

Corollary III.3. There exists a sequence of orthogonal matrices $Q_n \in \mathbb{R}^{r \times r}$ such that, almost surely,

$$\lim_{n \to \infty} \Big\| \frac{1}{\sqrt{mn}} X^T U \Sigma V^T Y - Q_n D Q_n^T \Big\|_F = 0, \qquad (16)$$

with $D = \mathrm{diag}(\Sigma_1 a_1 b_1, \dots, \Sigma_r a_r b_r)$.

B. The effect of missing entries

The proof of Theorem I.1 is completed by establishing a relation between the singular vectors $X_0$, $Y_0$ of $N^E$ and the singular vectors $X$ and $Y$ of $\widehat{N}$.

Lemma III.4. Let $k \le r$ be the largest integer such that $\Sigma_1^2 \ge \cdots \ge \Sigma_k^2 > \sigma^2/p$, and denote by $X_0^{(k)}$, $Y_0^{(k)}$, $X^{(k)}$, and $Y^{(k)}$ the matrices containing the first $k$ columns of $X_0$, $Y_0$, $X$, and $Y$, respectively. Let

$$X_0^{(k)} = X^{(k)} S_x + X_\perp^{(k)}, \qquad Y_0^{(k)} = Y^{(k)} S_y + Y_\perp^{(k)},$$

where $(X_\perp^{(k)})^T X^{(k)} = 0$, $(Y_\perp^{(k)})^T Y^{(k)} = 0$ and $S_x, S_y \in \mathbb{R}^{k \times k}$. Then there exists a numerical constant $C = C(\Sigma_i, \sigma^2, \alpha, M_{\max})$ such that

$$\|X_\perp^{(k)}\|_F^2,\ \|Y_\perp^{(k)}\|_F^2 \le C r \sqrt{1/n}, \qquad (17)$$

with probability approaching $1$ as $n \to \infty$.

Proof: We will prove our claim for the right singular vectors $Y$, since the left case is completely analogous. Further, we will drop the superscript $k$ to lighten the notation. We start by noticing that $\|N^E Y_0\|_F^2 = \sum_{a=1}^k (n \tilde{z}_{a,n})^2$, where the $n \tilde{z}_{a,n}$ are the singular values of $N^E$. Using Lemma 3.2 in [2], which bounds $\|M^E - p M\|_2 = \|N^E - \widehat{N}\|_2$, we get

$$\|N^E Y_0\|_F^2 \ge \sum_{a=1}^k (n z_{a,n} - C M_{\max} \sqrt{pn})^2. \qquad (18)$$

On the other hand,

$$\|N^E Y_0\|_F \le \|\widehat{N} Y_0\|_F + \|N^E - \widehat{N}\|_2 \|Y_0\|_F.$$
Further, by writing $S_y = L_y \Theta_y R_y^T$ for $L_y$, $R_y$ orthogonal matrices, we get

$$\|\widehat{N} Y_0\|_F^2 = \|\widehat{N} Y L_y \Theta_y\|_F^2 + \|\widehat{N} Y_\perp\|_F^2.$$

Since $Y_0^T Y_0 = I_{k \times k}$, we have $I_{k \times k} = R_y \Theta_y^T \Theta_y R_y^T + Y_\perp^T Y_\perp$, and therefore

$$\|\widehat{N} Y_0\|_F^2 = \|\widehat{N} Y L_y\|_F^2 - \|\widehat{N} Y L_y R_y^T Y_\perp^T\|_F^2 + \|\widehat{N} Y_\perp\|_F^2$$
$$\le n^2 \sum_{a=1}^k z_{a,n}^2 - n^2 z_{k,n}^2 \|Y_\perp\|_F^2 + n^2 p \sigma^2 \alpha^{3/2} (c_+(\alpha)^2 + \delta) \|Y_\perp\|_F^2 = n^2 \sum_{a=1}^k z_{a,n}^2 - n^2 e_y \|Y_\perp\|_F^2,$$

where $e_y \equiv z_{k,n}^2 - p \sigma^2 \alpha^{3/2} (c_+(\alpha)^2 + \delta)$, and we used the inequality $\|\widehat{N} Y_\perp\|_F^2 \le n^2 p \sigma^2 \alpha^{3/2} (c_+(\alpha)^2 + \delta) \|Y_\perp\|_F^2$, which holds for all $\delta > 0$ asymptotically almost surely as $n \to \infty$ (by an immediate generalization of Lemma III.2). It is simple to check that $\Sigma_k^2 > \sigma^2/p$ implies $e_y > 0$. Using the triangle inequality and Lemma 3.2 in [2], we get

$$\|N^E Y_0\|_F^2 \le n^2 \sum_{a=1}^r z_{a,n}^2 - n^2 e_y \|Y_\perp\|_F^2 + C n p \alpha^{3/2} M_{\max}^2 r + 2 C n \sqrt{np}\, \alpha^{3/4} M_{\max} \sqrt{r}\, \|z\|,$$

which, combined with Eq. (18), implies the thesis.

Proof of Theorem I.1: We now turn to upper bounding the right hand side of Eq. (5). Let $k$ be defined as in the last lemma. Notice that, by Lemma III.2, $X^T (U \Sigma V^T) Y$ is well approximated by $(X^{(k)})^T (U \Sigma V^T) Y^{(k)}$. Analogously, it can be proved that $X_0^T (U \Sigma V^T) Y_0$ is well approximated by $(X_0^{(k)})^T (U \Sigma V^T) Y_0^{(k)}$. Due to space limitations, we will omit this technical step and thus focus here on the case $k = r$ (equivalently, neglect the error incurred by this approximation). Using Lemma III.4 to bound the contribution of $X_\perp$, $Y_\perp$, we have

$$\langle X_0^T (U \Sigma V^T) Y_0,\, X_0^T N^E Y_0 \rangle = \langle S_x^T X^T (U \Sigma V^T) Y S_y,\, X_0^T N^E Y_0 \rangle\, (1 + o_n(1))$$
$$= \langle X^T (U \Sigma V^T) Y,\, S_x X_0^T N^E Y_0 S_y^T \rangle\, (1 + o_n(1)). \qquad (19)$$
Further, $X_0^T N^E Y_0 = X_0^T \widehat{N} Y_0 + X_0^T (N^E - \widehat{N}) Y_0$ and, using once more the bound in Lemma 3.2 of [2], which implies $\|X_0^T (N^E - \widehat{N}) Y_0\|_F \le C \sqrt{nrp}$, we get

$$S_x X_0^T N^E Y_0 S_y^T = L_x \Theta_x^2 L_x^T\, X^T \widehat{N} Y\, L_y \Theta_y^2 L_y^T + E_1 = \widehat{Z} + E_2,$$

where we recall that $\widehat{Z}$ is the diagonal matrix whose entries are the singular values of $\widehat{N}$, and $\|E_1\|_F^2, \|E_2\|_F^2 \le C(p, r) \sqrt{n}$. Using this estimate in Eq. (19), together with the result in Lemma III.2, we finally get

$$\frac{\langle X_0^T (U \Sigma V^T) Y_0,\, X_0^T N^E Y_0 \rangle}{\sqrt{mn}\, \|X_0^T N^E Y_0\|_F} \ge \frac{\sum_{k=1}^r \Sigma_k a_k b_k z_k}{\|z\|} - o_n(1),$$

which implies the thesis after simple algebraic manipulations.

ACKNOWLEDGEMENTS

We are grateful to T. Hastie, R. Mazumder and R. Tibshirani for stimulating discussions, and for making available their data. This work was supported by a Terman fellowship, and the NSF grants CCF-0743978 and DMS-0806211.

REFERENCES

[1] "Netflix prize," http://www.netflixprize.com/.
[2] R. H. Keshavan, A. Montanari, and S. Oh, "Matrix completion from a few entries," January 2009, arXiv:0901.3150.
[3] R. H. Keshavan, A. Montanari, and S. Oh, "Matrix completion from noisy entries," June 2009, arXiv:0906.2027.
[4] R. H. Keshavan and A. Montanari, "Regularization for matrix completion," 2010, journal version, in preparation.
[5] N. Srebro, J. Rennie, and T. Jaakkola, "Maximum margin matrix factorization," in Advances in Neural Information Processing Systems 17, 2005.
[6] J. Rennie and N. Srebro, "Fast maximum margin matrix factorization for collaborative prediction," in 22nd International Conference on Machine Learning, 2005.
[7] E. J. Candès and B. Recht, "Exact matrix completion via convex optimization," Found. of Comput. Math., vol. 9, no. 6, pp. 717-772, 2009.
[8] E. J. Candès and Y. Plan, "Matrix completion with noise," 2009, arXiv:0903.3131.
[9] R. Mazumder, T. Hastie, and R.
Tibshirani, "Spectral regularization algorithms for learning large incomplete matrices," 2009, submitted.
[10] I. M. Johnstone and A. Y. Lu, "On consistency and sparsity for principal component analysis in high dimensions," J. Amer. Stat. Assoc., vol. 104, pp. 682-693, 2009.
[11] M. Capitaine, C. Donati-Martin, and D. Féral, "The largest eigenvalue of finite rank deformation of large Wigner matrices: convergence and non-universality of the fluctuations," Ann. Probab., vol. 37, pp. 1-47, 2009.
[12] Z. D. Bai, B. Q. Miao, and G. M. Pan, "On asymptotics of eigenvectors of large sample covariance matrices," Ann. Probab., vol. 35, pp. 1532-1572, 2007.
[13] J. Silverstein and Z. Bai, "On the empirical distribution of eigenvalues of a class of large-dimensional random matrices," J. Multivariate Anal., vol. 54, pp. 175-192, 1995.
