Tighter Low-rank Approximation via Sampling the Leveraged Element
Authors: Srinadh Bhojanapalli, Prateek Jain, Sujay Sanghavi
Srinadh Bhojanapalli (The University of Texas at Austin, bsrinadh@utexas.edu), Prateek Jain (Microsoft Research, India, prajain@microsoft.com), Sujay Sanghavi (The University of Texas at Austin, sanghavi@mail.utexas.edu)

June 21, 2021

Abstract

In this work, we propose a new randomized algorithm for computing a low-rank approximation to a given matrix. Taking an approach different from existing literature, our method first involves a specific biased sampling, with an element being chosen based on the leverage scores of its row and column, and then involves weighted alternating minimization over the factored form of the intended low-rank matrix, to minimize error only on these samples. Our method can leverage input sparsity, yet produce approximations in spectral (as opposed to the weaker Frobenius) norm; this combines the best aspects of otherwise disparate current results, but with a dependence on the condition number $\kappa = \sigma_1/\sigma_r$. In particular, we require $O(nnz(M) + \frac{n\kappa^2 r^5}{\epsilon^2})$ computations to generate a rank-$r$ approximation to $M$ in spectral norm. In contrast, the best existing method requires $O(nnz(M) + \frac{nr^2}{\epsilon^4})$ time to compute an approximation in Frobenius norm. Besides the tightness in spectral norm, we have a better dependence on the error $\epsilon$. Our method is naturally and highly parallelizable.

Our new approach enables two extensions that are interesting on their own. The first is a new method to directly compute a low-rank approximation (in efficient factored form) to the product of two given matrices; it computes a small random set of entries of the product, and then executes weighted alternating minimization (as before) on these. The sampling strategy is different because now we cannot access leverage scores of the product matrix (but instead have to work with the input matrices).
The second extension is an improved algorithm with smaller communication complexity for the distributed PCA setting (where each server has a small set of rows of the matrix, and wants to compute a low-rank approximation with a small amount of communication with the other servers).

1 Introduction

Finding a low-rank approximation to a matrix is fundamental to a wide array of machine learning techniques. The large sizes of modern data matrices have driven much recent work into efficient (typically randomized) methods to find low-rank approximations that do not exactly minimize the residual, but run much faster, in parallel, with fewer passes over the data. Existing approaches involve either intelligent sampling of a few rows / columns of the matrix, projections onto lower-dimensional spaces, or sampling of entries followed by a top-$r$ SVD of the resulting matrix (with unsampled entries set to 0).

We pursue a different approach: we first sample entries in a specific biased random way, and then minimize the error on these samples over a search space that is the factored form of the low-rank matrix we are trying to find. We note that this is different from approximating a 0-filled matrix; it is instead reminiscent of matrix completion in the sense that it only looks at errors on the sampled entries. Another crucial ingredient is how the sampling is done: we use a combination of $\ell_1$ sampling, and of a distribution where the probability of an element is proportional to the sum of the leverage scores of its row and its column.

Both the sampling and the subsequent alternating minimization are naturally fast, parallelizable, and able to utilize sparsity in the input matrix. Existing literature has either focused on running in input sparsity time but with approximation in (the weaker) Frobenius norm, or on running in $O(n^2)$ time with approximation in spectral norm.
Our method provides the best of both worlds: it runs in input sparsity time, with just two passes over the data matrix, and yields an approximation in spectral norm. It does however have a dependence on the ratio of the first to the $r$-th singular value of the matrix. Our alternative approach also yields new methods for two related problems: directly finding the low-rank approximation of the product of two given matrices, and distributed PCA. Our contributions are thus three new methods in this space:

• Low-rank approximation of a general matrix: Our first (and main) contribution is a new method (LELA, Algorithm 1) for low-rank approximation of any given matrix: first draw a random subset of entries in a specific biased way, and then execute a weighted alternating minimization algorithm that minimizes the error on these samples over a factored form of the intended low-rank matrix. The sampling is done with only two passes over the matrix (each in input sparsity time), and both the sampling and the alternating minimization steps are highly parallelizable and compactly stored/manipulated. For a matrix $M$, let $M_r$ be the best rank-$r$ approximation (i.e. the matrix corresponding to the top $r$ components of the SVD). Our algorithm finds a rank-$r$ matrix $\widehat{M}_r$ in time $O(nnz(M) + \frac{n\kappa^2 r^5}{\epsilon^2})$, while providing an approximation in spectral norm:

$\|M - \widehat{M}_r\| \le \|M - M_r\| + \epsilon \|M - M_r\|_F$,

where $\kappa = \sigma_1(M)/\sigma_r(M)$ is the condition number of $M_r$. Existing methods either can run in input sparsity time, but provide approximations in (the weaker) Frobenius norm (i.e. with $\|\cdot\|$ replaced by $\|\cdot\|_F$ in the above expression), or run in $O(n^2)$ time to approximate in spectral norm, but even then with leading constants larger than 1. Our method however does have a dependence on $\kappa$, which these do not.
See Table 1 for a detailed comparison to existing results for low-rank approximation.

• Direct approximation of a matrix product: We provide a new method to directly find a low-rank approximation to the product of two matrices, without having to first compute the product itself. To do so, we first choose a small set of entries (in a biased random way) of the product that we will compute, and then again run weighted alternating minimization on these samples. The choice of the biased random distribution is now different from the above, as the sampling step does not have access to the product matrix. However, again both the sampling and alternating minimization are highly parallelizable. For $A \in \mathbb{R}^{n_1 \times d}$, $B \in \mathbb{R}^{d \times n_2}$, and $n = \max(n_1, n_2)$, our algorithm first chooses $O(nr^3 \log n/\epsilon^2)$ entries of the product $A \cdot B$ that it needs to sample; each sample takes $O(d)$ time individually, since it is a product of two length-$d$ vectors (though these can be parallelized). The weighted alternating minimization then runs in $O(\frac{nr^5\kappa^2}{\epsilon^2})$ time (where $\kappa = \sigma_1(A \cdot B)/\sigma_r(A \cdot B)$). This results in a rank-$r$ approximation $\widehat{AB}_r$ of $A \cdot B$ in spectral norm, as given above.

• Distributed PCA: Motivated by applications with really large matrices, recent work has looked at low-rank approximation in a distributed setting where there are $s$ servers (each holding a small set of rows of the matrix), each of which can communicate with a central processor charged with coordinating the algorithm. In this model, one is interested in finding good approximations while minimizing both computation and the communication burden on the center. We show that our LELA algorithm can be extended to the distributed setting while guaranteeing small communication complexity.
In particular, our algorithm guarantees the same error bounds as our non-distributed version, but with communication complexity of $O(ds + \frac{nr^5\kappa^2}{\epsilon^2}\log n)$ real numbers for computing a rank-$r$ approximation to $M \in \mathbb{R}^{n \times d}$. For $n \approx d$ and large $s$, our analysis guarantees significantly smaller communication complexity than the state-of-the-art method [22], while providing tighter spectral norm bounds.

Notation: A capital letter such as $M$ typically denotes a matrix. $M^i$ denotes the $i$-th row of $M$, $M_j$ denotes the $j$-th column of $M$, and $M_{ij}$ denotes the $(i,j)$-th element of $M$. Unless specified otherwise, $M \in \mathbb{R}^{n \times d}$ and $M_r$ is the best rank-$r$ approximation of $M$. Also, $M_r = U^* \Sigma^* (V^*)^T$ denotes the SVD of $M_r$. $\kappa = \sigma^*_1/\sigma^*_r$ denotes the condition number of $M_r$, where $\sigma^*_i$ is the $i$-th singular value of $M$. $\|u\|$ denotes the $L_2$ norm of a vector $u$. $\|M\| = \max_{\|x\|=1} \|Mx\|$ denotes the spectral or operator norm of $M$. $\|M\|_F = \sqrt{\sum_{ij} M_{ij}^2}$ denotes the Frobenius norm of $M$. Also, $\|M\|_{1,1} = \sum_{ij} |M_{ij}|$. $dist(X, Y) = \|X_\perp^T Y\|$ denotes the principal-angle-based distance between the subspaces spanned by orthonormal matrices $X$ and $Y$. Typically, $C$ denotes a global constant independent of problem parameters that can change from step to step. Given a set $\Omega \subseteq [n] \times [d]$, $P_\Omega(M)$ is given by: $P_\Omega(M)(i,j) = M_{ij}$ if $(i,j) \in \Omega$ and 0 otherwise. $R_\Omega(M) = w .\!* P_\Omega(M)$ denotes the Hadamard product of $w$ and $P_\Omega(M)$; that is, $R_\Omega(M)(i,j) = w_{ij} M_{ij}$ if $(i,j) \in \Omega$ and 0 otherwise. Similarly, $R^{1/2}_\Omega(M)(i,j) = \sqrt{w_{ij}}\, M_{ij}$ if $(i,j) \in \Omega$ and 0 otherwise.

2 Related results

Low-rank approximation: We now briefly review some of the existing work on algorithms for low-rank approximation. [14] introduced the problem of computing a low-rank approximation of a matrix $M$ with few passes over $M$.
They presented an algorithm that samples a few rows and columns and does an SVD to compute a low-rank approximation, and gave additive error guarantees.

Reference | Frobenius norm | Spectral norm | Computation time
BJS14 (Our Algorithm) | $(1+\epsilon)\|\Delta\|_F$ | $\|\Delta\| + \epsilon\|\Delta\|_F$ | $O(nnz(M) + \frac{nr^5\kappa^2\log(n)}{\epsilon^2})$
CW13 [7] | $(1+\epsilon)\|\Delta\|_F$ | $(1+\epsilon)\|\Delta\|_F$ | $O(nnz(M) + \frac{nr^2}{\epsilon^4} + \frac{r^3}{\epsilon^5})$
BG13 [3] | $(1+\epsilon)\|\Delta\|_F$ | $c\|\Delta\| + \epsilon\|\Delta\|_F$ | $O(n^2\frac{r+\log(n)}{\epsilon^2} + \frac{nr^2\log(n)^2}{\epsilon^4})$
NDT09 [28] | $(1+\epsilon)\|\Delta\|_F$ | $c\|\Delta\| + \epsilon\sqrt{n}\|\Delta\|$ | $O(n^2\log(\frac{r\log(n)}{\epsilon}) + \frac{nr^2\log(n)^2}{\epsilon^4})$
WLRT08 [32] | $(1+\epsilon)\|\Delta\|_F$ | $\|\Delta\| + \epsilon\sqrt{n}\|\Delta\|$ | $O(n^2\log(\frac{r}{\epsilon}) + \frac{nr^4}{\epsilon^4})$
Sar06 [30] | $(1+\epsilon)\|\Delta\|_F$ | $(1+\epsilon)\|\Delta\|_F$ | $O(\frac{nnz(M)r}{\epsilon} + \frac{nr^2}{\epsilon^2})$

Table 1: Comparison of error rates and computation time of some low-rank approximation algorithms. $\Delta = M - M_r$.

[9, 10] have extended these results. [2] considered a different approach based on entrywise sampling and quantization for low-rank approximation, and gave additive error bounds. [18, 30, 11, 8] have given low-rank approximation algorithms with relative error guarantees in Frobenius norm. [32, 28] have provided guarantees on error in spectral norm, which were later improved in [17, 3]. The main technique of these algorithms is to use a random Gaussian or Hadamard transform matrix for projecting the matrix onto a low-dimensional subspace and compute the rank-$r$ subspace. [3] have given an algorithm based on the random Hadamard transform that computes a rank-$r$ approximation in time $O(\frac{n^2 r}{\epsilon^2})$ and gives the spectral norm bound $\|M - \widehat{M}_r\| \le c\|M - M_r\| + \epsilon\|M - M_r\|_F$. One drawback of the Hadamard transform is that it cannot take advantage of sparsity of the input matrix. Recently [7] gave an algorithm using sparse subspace embeddings that runs in input sparsity time with relative Frobenius norm error guarantees.
We present some results in this area as a comparison with our results in Table 1. This is a heavily subsampled set of existing results on low-rank approximation. There is a lot of interesting work on the closely related problems of computing column/row based (CUR) decompositions, matrix sketching, and low-rank approximation with streaming data; see [27, 17] for more detailed discussion and comparison.

Matrix sparsification: In the matrix sparsification problem, the goal is to create a sparse sketch of a given matrix by sampling and reweighing the entries of the matrix. Various techniques for sampling have been proposed and analyzed which guarantee $\epsilon$ approximation error in Frobenius norm with $O(\frac{n}{\epsilon^2}\log n)$ samples [12, 1]. As we will see in the next section, the first step of Algorithm 1 involves sampling according to a very specific distribution (similar to matrix sparsification), which has been designed to guarantee good error bounds for computing a low-rank approximation. For a comparison of various sampling distributions for the problem of low-rank matrix recovery, see [6].

Matrix completion: The matrix completion problem is to recover an $n \times n$ rank-$r$ matrix from observation of a small number ($O(nr\log(n))$) of random entries. Nuclear norm minimization is shown to recover the matrix from uniform random samples if the matrix is incoherent¹ [4, 5, 29, 16]. Similar results are shown for other algorithms like OptSpace [23] and alternating minimization [21, 19, 20]. Recently [6] has given guarantees for recovery of any matrix under leverage score sampling from $O(nr\log^2(n))$ entries.

¹ An $n \times d$ matrix $A$ of rank $r$ with SVD $U^*\Sigma^*(V^*)^T$ is incoherent if $\|(U^*)^i\|^2 \le \frac{\mu_0 r}{n}, \forall i$, and $\|(V^*)^j\|^2 \le \frac{\mu_0 r}{d}, \forall j$, for some constant $\mu_0$.
Distributed PCA: In distributed PCA, one wants to compute a rank-$r$ approximation of an $n \times d$ matrix that is stored across $s$ servers, with small communication between servers. One popular model is the row partition model, where a subset of rows is stored at each server. Algorithms in [13, 25, 15, 24] achieve $O(\frac{dsr}{\epsilon})$ communication complexity with relative error guarantees in Frobenius norm under this model. Recently [22] have considered the scenario of arbitrary splitting of an $n \times d$ matrix and given an algorithm that has $O(\frac{dsr}{\epsilon})$ communication complexity with relative error guarantees in Frobenius norm.

3 Low-rank Approximation of Matrices

In this section we present our main contribution: a new randomized algorithm for computing a low-rank approximation of any given matrix. Our algorithm first samples a few elements from the given matrix $M \in \mathbb{R}^{n \times d}$, and then a rank-$r$ approximation is computed using only those samples. Algorithm 1 provides detailed pseudo-code of our algorithm; we now comment on each of the two stages:

Sampling: A crucial ingredient of our approach is using the correct sampling distribution. Recent results in matrix completion [6] indicate that a small number ($O(nr\log^2(n))$) of samples drawn in a way biased by leverage scores² can capture all the information in any exactly low-rank matrix. While this is indicative, here we have neither access to the leverage scores, nor is our matrix exactly low-rank. We approximate the leverage scores with the row and column norms ($\|M^i\|^2$ and $\|M_j\|^2$), and account for the arbitrary high-rank nature of the input by including an $L_1$ term in the sampling; the distribution is given in eq. (2). Computationally, our sampling procedure can be done in two passes and $O(nnz(M) + m\log n)$ time.
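To make the sampling stage concrete, the biased distribution of eq. (2) can be sketched in a few lines of NumPy. This is an illustrative dense-matrix sketch (the function name is ours, and the dense representation is for clarity only; the actual algorithm computes the same distribution in two input-sparsity-time passes):

```python
import numpy as np

def lela_sample(M, m, rng=None):
    """Sample entries of M per eq. (2):
    q_ij = m * [ (||M^i||^2 + ||M_j||^2) / (2(n+d)||M||_F^2)
                 + |M_ij| / (2||M||_{1,1}) ],
    including each (i, j) independently with probability min(q_ij, 1)."""
    rng = np.random.default_rng(rng)
    n, d = M.shape
    row_sq = np.sum(M**2, axis=1)        # ||M^i||^2 for every row
    col_sq = np.sum(M**2, axis=0)        # ||M_j||^2 for every column
    fro_sq = row_sq.sum()                # ||M||_F^2
    l1 = np.abs(M).sum()                 # ||M||_{1,1}
    q = m * ((row_sq[:, None] + col_sq[None, :]) / (2 * (n + d) * fro_sq)
             + np.abs(M) / (2 * l1))     # q sums to exactly m before clipping
    q_hat = np.minimum(q, 1.0)
    mask = rng.random((n, d)) < q_hat    # Omega as a boolean mask
    return mask, q_hat
```

Note that the two terms of $q$ each contribute total mass $m/2$, so $E[|\Omega|] \le m$.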
Weighted alternating minimization: In our second step, we directly optimize over the factored form of the intended low-rank matrix, by minimizing a weighted squared error over the sampled elements from stage 1. That is, we first express the low-rank approximation $\widehat{M}_r$ as $UV^T$ and then iterate over $U$ and $V$ alternately to minimize the weighted $L_2$ error over the sampled entries (see Sub-routine 2). Note that this is different from taking principal components of a 0-filled version of the sampled matrix. The weights give higher emphasis to elements with smaller sampling probabilities. In particular, the goal is to minimize the following objective function:

$Err(\widehat{M}_r) = \sum_{(i,j)\in\Omega} w_{ij} \left( M_{ij} - (\widehat{M}_r)_{ij} \right)^2$,    (1)

where $w_{ij} = 1/\hat{q}_{ij}$ when $\hat{q}_{ij} > 0$, and 0 otherwise. For initialization of the WAltMin procedure, we compute the SVD of $R_\Omega(M)$ (the reweighed sampled matrix), followed by a trimming step (see Steps 4, 5 of Sub-routine 2). The trimming step sets $(\tilde{U}^{(0)})^i = 0$ if $\|(U^{(0)})^i\| \ge 4\|M^i\|/\|M\|_F$ and $(\tilde{U}^{(0)})^i = (U^{(0)})^i$ otherwise; $\widehat{U}^{(0)}$ is then the orthonormal matrix spanning the column space of $\tilde{U}^{(0)}$. This step prevents heavy rows/columns from having undue influence.

² If the SVD of $M_r$ is $U^*\Sigma^*(V^*)^T$, then the leverage scores of $M_r$ are $\|(U^*)^i\|^2$ and $\|(V^*)^j\|^2$ for all $i, j$.

Algorithm 1 LELA: Leveraged Element Low-rank Approximation
input: matrix $M \in \mathbb{R}^{n \times d}$, rank $r$, number of samples $m$, number of iterations $T$
1: Sample $\Omega \subseteq [n] \times [d]$ where each element is sampled independently with probability $\hat{q}_{ij} = \min\{q_{ij}, 1\}$,

$q_{ij} = m \cdot \left( \frac{\|M^i\|^2 + \|M_j\|^2}{2(n+d)\|M\|_F^2} + \frac{|M_{ij}|}{2\|M\|_{1,1}} \right).$    (2)

/* See Section 3.1 for details about efficient implementation of this step */
2: Obtain $P_\Omega(M)$ using one pass over $M$
3: $\widehat{M}_r = \mathrm{WAltMin}(P_\Omega(M), \Omega, r, \hat{q}, T)$
output: $\widehat{M}_r$

We now provide our main result for low-rank approximation and show that Algorithm 1 can provide a tight approximation to $M_r$ while using a small number of samples $m = E[|\Omega|]$.

Theorem 3.1. Let $M \in \mathbb{R}^{n \times d}$ be any given matrix ($n \ge d$) and let $M_r$ be the best rank-$r$ approximation to $M$. Set the number of samples

$m = \frac{C}{\gamma}\, \frac{nr^3}{\epsilon^2}\, \kappa^2 \log(n) \log^2\!\left(\frac{\|M\|}{\zeta}\right),$

where $C > 0$ is a global constant and $\kappa = \sigma_1/\sigma_r$, $\sigma_i$ being the $i$-th singular value of $M$. Also, set the number of iterations of the WAltMin procedure to $T = \log(\frac{\|M\|}{\zeta})$. Then, with probability greater than $1 - \gamma$ for any constant $\gamma > 0$, the output $\widehat{M}_r$ of Algorithm 1 with the above specified parameters $m, T$ satisfies:

$\|M - \widehat{M}_r\| \le \|M - M_r\| + \epsilon\|M - M_r\|_F + \zeta.$

That is, if $T = \log\left(\frac{\|M\|}{\epsilon\|M - M_r\|_F}\right)$, we have: $\|M - \widehat{M}_r\| \le \|M - M_r\| + 2\epsilon\|M - M_r\|_F.$

Note that our time and sample complexity depend quadratically on $\kappa$. Recent results in the matrix completion literature show that such a dependence can be improved to $\log(\kappa)$ by using a slightly more involved analysis [20]. We believe a similar analysis can be combined with our techniques to obtain tighter bounds; we leave such a tighter analysis for future research, as the proof would be significantly more tedious and would take away from the key message of this paper.

3.1 Computation complexity

In the first step we take 2 passes over the matrix to compute the sampling distribution (2) and to sample the entries based on this distribution. It is easy to show that this step requires $O(nnz(M) + m\log(n))$ time.
Next, the initialization step of the WAltMin procedure requires computing the rank-$r$ SVD of $R_{\Omega_0}(M)$, which has at most $m$ non-zero entries. Hence, this step can be completed in $O(mr)$ time using standard techniques like the power method; note that by Lemma 3.2 we need the top-$r$ singular vectors of $R_{\Omega_0}(M)$ only up to constant approximation. Further, the $t$-th iteration of alternating minimization takes $O(2|\Omega_{2t+1}|r^2)$ time. So, the total time complexity of our method is $O(nnz(M) + mr^2)$. As shown in Theorem 3.1, our method requires $m = O(\frac{nr^3}{\epsilon^2}\kappa^2\log(n)\log^2(\frac{\|M\|}{\epsilon\|M-M_r\|_F}))$ samples. Hence, the total run-time of our algorithm is $O(nnz(M) + \frac{nr^5}{\epsilon^2}\kappa^2\log(n)\log^2(\frac{\|M\|}{\epsilon\|M-M_r\|_F}))$.

Sub-routine 2 WAltMin: Weighted Alternating Minimization
input: $P_\Omega(M)$, $\Omega$, $r$, $\hat{q}$, $T$
1: $w_{ij} = 1/\hat{q}_{ij}$ when $\hat{q}_{ij} > 0$, and 0 otherwise, $\forall i, j$
2: Divide $\Omega$ into $2T+1$ equal uniformly random subsets, i.e., $\Omega = \{\Omega_0, \ldots, \Omega_{2T}\}$
3: $R_{\Omega_0}(M) \leftarrow w .\!* P_{\Omega_0}(M)$
4: $U^{(0)}\Sigma^{(0)}(V^{(0)})^T = \mathrm{SVD}(R_{\Omega_0}(M), r)$  // best rank-$r$ approximation of $R_{\Omega_0}(M)$
5: Trim $U^{(0)}$ and let $\widehat{U}^{(0)}$ be the output (see Section 3)
6: for $t = 0$ to $T-1$ do
7:   $\widehat{V}^{(t+1)} = \arg\min_{V \in \mathbb{R}^{d \times r}} \|R^{1/2}_{\Omega_{2t+1}}(M - \widehat{U}^{(t)}V^T)\|_F^2$
8:   $\widehat{U}^{(t+1)} = \arg\min_{U \in \mathbb{R}^{n \times r}} \|R^{1/2}_{\Omega_{2t+2}}(M - U(\widehat{V}^{(t+1)})^T)\|_F^2$
9: end for
output: Completed matrix $\widehat{M}_r = \widehat{U}^{(T)}(\widehat{V}^{(T)})^T$.

Remarks: We now discuss how to sample entries of $M$ according to distribution (2) in $O(nnz(M) + m\log(n))$ time. Consider the following multinomial-based sampling model: sample the number of elements per row (say $m_i$) by doing $m$ draws from a multinomial distribution over the rows, given by $\{0.5(\frac{d\|M^i\|^2}{(n+d)\|M\|_F^2} + \frac{1}{n+d}) + 0.5\frac{\|M^i\|_1}{\|M\|_{1,1}}\}$. Then, sample $m_i$ elements of row $i$ using $\{0.5\frac{\|M_j\|^2}{\|M\|_F^2} + 0.5\frac{|M_{ij}|}{\|M\|_{1,1}}\}$ over $j \in [d]$, with replacement. The failure probability under this model is bounded by 2 times the failure probability when the elements are sampled according to (2) [4]. Hence, we can instead use the above multinomial model for sampling. Moreover, $\|M^i\|$, $\|M^i\|_1$ and $\|M_j\|$ can be computed in time $O(nnz(M) + n)$, so the $m_i$'s can be sampled efficiently. Moreover, the multinomial distributions for all the rows can be computed in time $O(d + nnz(M))$: $O(d)$ work for setting up the first $\|M_j\|$ term, and $O(nnz(M))$ work for changing the base distribution wherever $M_{ij}$ is non-zero. Hence, the total time complexity is $O(nnz(M) + m\log n)$.

3.2 Proof Overview

We now present the key steps in our proof of Theorem 3.1. As mentioned in the previous section, our algorithm proceeds in two steps: entry-wise sampling of the given matrix $M$, and then weighted alternating minimization (WAltMin) to obtain a low-rank approximation of $M$. Hence, the goal is to analyze the WAltMin procedure, with input samples obtained using (2), to obtain the bounds in Theorem 3.1.

Now, WAltMin is an iterative procedure solving an inherently non-convex problem, $\min_{U,V} \sum_{(i,j)\in\Omega} w_{ij}(e_i^T UV^T e_j - M_{ij})^2$. Hence, it is prone to local minima or, worse, might not even converge. However, recent results for low-rank matrix completion have shown that alternating minimization (with appropriate initialization) can indeed be analyzed to obtain exact matrix completion guarantees. Our proof also follows along similar lines: we show that the initialization procedure (Step 4 of Sub-routine 2) provides an accurate enough estimate of $M_r$, and then at each step we show a geometric decrease in the distance to $M_r$.
However, our proof differs from the previous works in two key aspects: a) existing proof techniques for alternating minimization assume that each element is sampled uniformly at random, while we allow biased and approximate sampling; b) existing techniques crucially use the assumption that $M_r$ is incoherent, while our proof avoids this assumption using the weighted version of AltMin.

We now present our bounds for initialization as well as for each step of the WAltMin procedure. Theorem 3.1 follows easily from the two bounds.

Lemma 3.2 (Initialization). Let the set of entries $\Omega$ be generated according to (2). Also, let $m \ge C\frac{n}{\delta^2}\log(n)$. Then, the following holds (w.p. $\ge 1 - \frac{2}{n^{10}}$):

$\|R_\Omega(M) - M\| \le \delta\|M\|_F.$    (3)

Also, if $\|M - M_r\|_F \le \frac{1}{576\kappa r^{1.5}}\|M_r\|_F$, then the following holds (w.p. $\ge 1 - \frac{2}{n^{10}}$):

$\|(\widehat{U}^{(0)})^i\| \le 8\sqrt{r}\sqrt{\|M^i\|^2/\|M\|_F^2}$ and $dist(\widehat{U}^{(0)}, U^*) \le \frac{1}{2}$,

where $\widehat{U}^{(0)}$ is the initial iterate obtained using Steps 4, 5 of Sub-routine 2, $\kappa = \sigma^*_1/\sigma^*_r$, $\sigma^*_i$ is the $i$-th singular value of $M$, and $M_r = U^*\Sigma^*(V^*)^T$.

Let $P_r(A)$ be the best rank-$r$ approximation of $A$. Then, Lemma 3.2 and Weyl's inequality imply that:

$\|M - P_r(R_\Omega(M))\| \le \|M - R_\Omega(M)\| + \|R_\Omega(M) - P_r(R_\Omega(M))\| \le \|M - M_r\| + 2\delta\|M\|_F.$    (4)

Now, we can have two cases:

1) $\|M - M_r\|_F \ge \frac{1}{576\kappa r^{1.5}}\|M_r\|_F$: In this case, setting $\delta = \epsilon/(\kappa r^{1.5})$ in (4) already implies the required error bounds of Theorem 3.1³.

2) $\|M - M_r\|_F \le \frac{1}{576\kappa r^{1.5}}\|M_r\|_F$: In this regime, we will now show that alternating minimization reduces the error from the initial $\delta\|M\|_F$ to $\delta\|M - M_r\|_F$.

Lemma 3.3 (WAltMin Descent). Let the hypotheses of Theorem 3.1 hold. Also, let $\|M - M_r\|_F \le \frac{1}{576\kappa r\sqrt{r}}\|M_r\|_F$.
Let $\widehat{U}^{(t)}$ be the $t$-th step iterate of Sub-routine 2 (called from Algorithm 1), and let $\widehat{V}^{(t+1)}$ be the $(t+1)$-th iterate (for $V$). Also, let $\|(U^{(t)})^i\| \le 8\sqrt{r}\kappa\sqrt{\frac{\|M^i\|^2}{\|M\|_F^2} + \frac{|M_{ij}|}{\|M\|_F}}$ and $dist(U^{(t)}, U^*) \le \frac{1}{2}$, where $U^{(t)}$ is a set of orthonormal vectors spanning $\widehat{U}^{(t)}$. Then, the following holds (w.p. $\ge 1 - \gamma/T$):

$dist(V^{(t+1)}, V^*) \le \frac{1}{2} dist(U^{(t)}, U^*) + \epsilon\|M - M_r\|_F/\sigma^*_r$,

and $\|(V^{(t+1)})^j\| \le 8\sqrt{r}\kappa\sqrt{\frac{\|M_j\|^2}{\|M\|_F^2} + \frac{|M_{ij}|}{\|M\|_F}}$, where $V^{(t+1)}$ is a set of orthonormal vectors spanning $\widehat{V}^{(t+1)}$.

The above lemma shows that the distance between $\widehat{V}^{(t+1)}$ and $V^*$ (and similarly, between $\widehat{U}^{(t+1)}$ and $U^*$) decreases geometrically up to $\epsilon\|M - M_r\|_F/\sigma^*_r$. Hence, after $\log(\|M\|_F/\zeta)$ steps, the first error term in the bounds above vanishes and the error bound given in Theorem 3.1 is obtained.

Note that the sampling distribution used for our result is a "hybrid" distribution combining leverage scores and $L_1$-style sampling. However, if $M$ is indeed a rank-$r$ matrix, then our analysis can be extended to handle leverage score based sampling itself ($q_{ij} = m \cdot \frac{\|M^i\|^2 + \|M_j\|^2}{2n\|M\|_F^2}$). Hence, our results also show that weighted alternating minimization can be used to solve the coherent matrix completion problem introduced in [6].

³ There is a small technicality here: alternating minimization can potentially worsen this bound. But the error after each step of alternating minimization can be effectively checked using a small cross-validation set, and we can stop if the error increases.
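To make the analyzed procedure concrete, here is a simplified NumPy sketch of WAltMin (Sub-routine 2). For brevity it reuses one sample set in every iteration (instead of the $2T+1$ fresh splits of $\Omega$) and omits the trimming step, so it illustrates the weighted least-squares updates rather than reproducing the exact analyzed procedure:

```python
import numpy as np

def waltmin(M, mask, q_hat, r, T):
    """Simplified weighted alternating minimization (cf. Sub-routine 2).
    mask is the boolean indicator of Omega; q_hat the sampling probabilities.
    Simplifications vs. the paper: a single sample set is reused each
    iteration, and the trimming step is omitted."""
    n, d = M.shape
    w = np.where(mask, 1.0 / np.maximum(q_hat, 1e-12), 0.0)  # w_ij = 1/q_ij on Omega
    R = w * np.where(mask, M, 0.0)       # reweighed sampled matrix R_Omega(M)
    # Initialization: top-r left singular vectors of R_Omega(M)
    U = np.linalg.svd(R, full_matrices=False)[0][:, :r]
    V = np.zeros((d, r))
    for _ in range(T):
        # Each row of V (resp. U) solves a small weighted least-squares
        # problem over the observed entries of its column (resp. row).
        for j in range(d):
            obs = mask[:, j]
            Uo, wo = U[obs], w[obs, j]
            G = (Uo * wo[:, None]).T @ Uo
            V[j] = np.linalg.solve(G + 1e-10 * np.eye(r),
                                   (Uo * wo[:, None]).T @ M[obs, j])
        for i in range(n):
            obs = mask[i, :]
            Vo, wo = V[obs], w[i, obs]
            G = (Vo * wo[:, None]).T @ Vo
            U[i] = np.linalg.solve(G + 1e-10 * np.eye(r),
                                   (Vo * wo[:, None]).T @ M[i, obs])
    return U, V  # factored form: M_hat_r = U @ V.T
```

With full observation and unit weights this reduces to plain alternating least squares, which recovers an exactly rank-$r$ matrix.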
3.3 Direct Low-rank Approximation of Matrix Product

In this section we present a new pass-efficient algorithm for the following problem: suppose we are given two matrices, and desire a low-rank approximation of their product $AB$; in particular, we are not interested in the actual full matrix product itself (as this may be unwieldy to store and use, and thus wasteful to produce in its entirety). One example setting where this arises is when one wants to calculate the joint counts between two very large sets of entities; for example, web companies routinely come across settings where they need to understand (for example) how many users both searched for a particular query and clicked on a particular advertisement. The number of possible queries and ads is huge, and finding this co-occurrence matrix from user logs involves multiplying two matrices (query-by-user and user-by-ad, respectively), each of which is itself large.

We give a method that directly produces a low-rank approximation of the final product, and involves storage and manipulation of only the efficient factored form (i.e. one tall and one fat matrix) of the final intended low-rank matrix. Note that, as opposed to the previous section, the matrix does not already exist, and hence we do not have access to its row and column norms; so we need a new sampling scheme (and a different proof of correctness).

Algorithm: Suppose we are given an $n_1 \times d$ matrix $A$ and another $d \times n_2$ matrix $B$, and we wish to calculate a rank-$r$ approximation of the product $A \cdot B$. Our algorithm proceeds in two stages:
1. Choose a biased random set $\Omega \subset [n_1] \times [n_2]$ of elements as follows: choose an intended number $m$ (according to Theorem 3.4 below) of sampled elements, and then independently include each $(i,j) \in [n_1] \times [n_2]$ in $\Omega$ with probability given by $\hat{q}_{ij} = \min\{1, q_{ij}\}$, where

$q_{ij} := m \cdot \left( \frac{\|A^i\|^2}{n_2\|A\|_F^2} + \frac{\|B_j\|^2}{n_1\|B\|_F^2} \right).$    (5)

Then, find $P_\Omega(A \cdot B)$, i.e. only the elements of the product $AB$ that are in this set $\Omega$.

2. Run the alternating minimization procedure WAltMin($P_\Omega(A \cdot B)$, $\Omega$, $r$, $\hat{q}$, $T$), where $T$ is the number of iterations (again chosen according to Theorem 3.4 below). This produces the low-rank approximation in factored form.

Remarks: Note that the sampling distribution now depends only on the row norms $\|A^i\|^2$ of $A$ and the column norms $\|B_j\|^2$ of $B$; each of these can be found completely in parallel, with one pass over each row/column of the matrices $A$/$B$. A second pass, again parallelizable, calculates the elements $(A \cdot B)_{ij}$ of the product for $(i,j) \in \Omega$. Once this is done, we are again in the setting of doing weighted alternating minimization over a small set of samples (the setting we had before), and as already mentioned this too is highly parallelizable and very fast overall. In particular, the computation complexity of the algorithm is $O(|\Omega| \cdot (d + r^2)) = O(m(d + r^2)) = O(\frac{nr^3\kappa^2}{\epsilon^2} \cdot (d + r^2))$ (suppressing terms dependent on the norms of $A$ and $B$), where $n = \max\{n_1, n_2\}$.

We now present our theorem on the number of samples and iterations needed to make this procedure work with at least a constant probability.

Theorem 3.4. Consider matrices $A \in \mathbb{R}^{n_1 \times d}$ and $B \in \mathbb{R}^{d \times n_2}$, and let

$m = \frac{C}{\gamma} \cdot \frac{(\|A\|_F^2 + \|B\|_F^2)^2}{\|AB\|_F^2} \cdot \frac{nr^3}{\epsilon^2}\,\kappa^2\log(n)\log^2\!\left(\frac{\|A\|_F + \|B\|_F}{\zeta}\right),$

where $\kappa = \sigma^*_1/\sigma^*_r$, $\sigma^*_i$ is the $i$-th singular value of $A \cdot B$, and $T = \log(\frac{\|A\|_F + \|B\|_F}{\zeta})$.
Let $\Omega$ be sampled using probability distribution (5). Then, the output $\widehat{AB}_r = \mathrm{WAltMin}(P_\Omega(A \cdot B), \Omega, r, \hat{q}, T)$ of Sub-routine 2 satisfies (w.p. $\ge 1 - \gamma$):

$\|A \cdot B - \widehat{AB}_r\| \le \|A \cdot B - (A \cdot B)_r\| + \epsilon\|A \cdot B - (A \cdot B)_r\|_F + \zeta.$

Next, we show an application of our matrix-multiplication approach to approximating covariance matrices $M = YY^T$, where $Y$ is an $n \times d$ sample matrix. Note that a rank-$r$ approximation to $M$ can be computed by first computing a low-rank approximation $\widehat{Y}_r$ of $Y$, i.e., $\tilde{M}_r = \widehat{Y}_r\widehat{Y}_r^T$. However, as we show below, such an approach leads to weaker bounds compared to computing a low-rank approximation of $YY^T$ directly:

Corollary 3.5. Let $M = YY^T \in \mathbb{R}^{n \times n}$ and let $\Omega$ be sampled using probability distribution (5) with $m \ge \frac{C}{\gamma}\frac{nr^3}{\epsilon^2}\kappa^2\log(n)\log^2(\frac{\|Y\|}{\zeta})$. Then the output $\widehat{M}_r$ of WAltMin($P_\Omega(M)$, $\Omega$, $r$, $\hat{q}$, $\log(\frac{\|Y\|}{\zeta})$) satisfies (w.p. $\ge 1 - \gamma$):

$\|\widehat{M}_r - (YY^T)_r\| \le \epsilon\|Y - Y_r\|_F^2 + \zeta.$

Further, when $\|Y - Y_r\|_F \le \|Y_r\|_F$, we get: $\|\widehat{M}_r - (YY^T)_r\| \le \epsilon\|YY^T - (YY^T)_r\|_F + \zeta.$

Now, one can compute $\tilde{M}_r = \widehat{Y}_r\widehat{Y}_r^T$ in time $O(n^2 r + \frac{nr^5}{\epsilon^2})$ with error $\frac{\|(YY^T)_r - \tilde{M}_r\|}{\|Y\|^2} \le \epsilon\frac{\|Y - Y_r\|_F}{\|Y\|}$, whereas computing a low-rank approximation of $YY^T$ directly gives (from Corollary 3.5) $\frac{\|(YY^T)_r - \widehat{M}_r\|}{\|Y\|^2} \le \epsilon\frac{\|YY^T - Y_rY_r^T\|_F}{\|Y\|^2}$, which can be much smaller than $\epsilon\frac{\|Y - Y_r\|_F}{\|Y\|}$. The difference in error is a consequence of the larger gap between the singular values of $YY^T$ compared to those of $Y$. For related discussion and applications, see Section 4.5 of [17].

4 Distributed Principal Component Analysis

Modern large-scale systems have to routinely compute the PCA of data matrices with millions of data points embedded in a similarly large number of dimensions.
Now, even storing such matrices on a single machine is not possible, and hence most industrial-scale systems use a distributed computing environment to handle such problems. However, the performance of such systems depends not only on computation and storage complexity, but also on the required amount of communication between different servers.

In particular, we consider the following distributed PCA setting: let $M \in \mathbb{R}^{n\times d}$ be a given matrix (assume $n \ge d$ but $n \approx d$). Also, let $M$ be row-partitioned among $s$ servers, and let $M_{r_k} \in \mathbb{R}^{n\times d}$ be the matrix stored on the $k$-th server, containing the rows $\{r_k\} \subseteq [n]$ of $M$ with the rest filled with zeros. Moreover, we assume that one of the servers acts as a Central Processor (CP); in each round all servers communicate with the CP and the CP communicates back with all the servers. The goal is to compute $\widehat M_r$, an estimate of $M_r$, such that the total communication (i.e., number of bits transferred) between the CP and the other servers is minimized. Such a model is now standard for this problem and was most recently studied by [22].

Recently, several interesting results [13, 15, 24, 22] have given algorithms to compute a rank-$r$ approximation $\tilde M_r$ of $M$ in the above-mentioned distributed setting. In particular, [22] proposed a method that, for the row-partitioned model, requires $O(\frac{dsr}{\epsilon} + \frac{sr^2}{\epsilon^4})$ communication to obtain a relative Frobenius-norm guarantee, $\|M - \tilde M_r\|_F \le (1+\epsilon)\|M - M_r\|_F$.

Algorithm 3 Distributed low-rank approximation algorithm
input Matrix $M_{r_k}$ at server $k$, rank $r$, number of samples $m$, and number of iterations $T$.
1: Sampling: Each server $k$ computes the column norms of $M_{r_k}$ and $\|M_{r_k}\|_{1,1}$ and communicates them to the CP. The CP computes the column norms of $M$, $\|M\|_{1,1}$, and $\|M\|_F$, and communicates them to all servers.
2: Each server $k$ samples the $(i,j)$-th entry with probability $\min\big\{ m\big(\frac{\|M^i_{r_k}\|^2 + \|M^j\|^2}{2n\|M\|_F^2} + \frac{|(M_{r_k})_{ij}|}{\|M\|_{1,1}}\big),\, 1\big\}$ for its rows $\{r_k\}$, and generates $\Omega_k$.
3: Each server $k$ sends to the CP the list of columns $\{c_k\} \subset [d]$ in which $R_{\Omega_k}(M_{r_k})$ has sampled entries.
4: Initialization: The CP generates a random matrix $Y^{(0)}$ and communicates $Y^{(0)}_{c_k}$ to server $k$.
5: for $t = 0$ to $\log(1/c)$ do
6: Each server $k$ computes $\widehat Y^{(t+1)}_{c_k} = R_{\Omega_k}(M_{r_k})^T R_{\Omega_k}(M_{r_k})\, Y^{(t)}_{c_k}$ and communicates it to the CP.
7: The CP computes $Y^{(t+1)} = \sum_k \widehat Y^{(t+1)}_{c_k}$, normalizes it, and communicates $Y^{(t+1)}_{c_k}$ to server $k$.
8: end for
9: WAltMin: Each server $k$ sets $\widehat V^{(0)}_{c_k} = Y^{(t+1)}_{c_k}$.
10: for $t = 0$ to $T - 1$ do
11: Each server $k$ computes $(\widehat U^{(t+1)})^i = \arg\min_{x\in\mathbb{R}^r} \sum_{j:(i,j)\in\Omega_k} w_{ij}\big(M_{ij} - x^T(\widehat V^t)^j\big)^2$, for all $i \in \{r_k\}$.
12: Each server $k$ sends to the CP: $z^k_j = (\widehat U^{(t+1)}_{r_k})^T R_{\Omega_k}(M_{r_k}) e_j$ and $B^k_j = \sum_{i:(i,j)\in\Omega_k} w_{ij}\, u u^T$, where $u = (\widehat U^{(t+1)})^i$, for all $j \in \{c_k\}$.
13: The CP computes $B_j = \sum_k B^k_j$ and $(\widehat V^{(t+1)})^j = B_j^{-1}\sum_k z^k_j$ for $j = 1, \ldots, d$, and communicates $(\widehat V^{(t+1)})_{c_k}$ to server $k$.
14: end for
output Server $k$ has $\widehat U^{(t+1)}_{r_k}$ and the CP has $\widehat V^{(t+1)}$.

In contrast, a distributed-setting extension of our LELA Algorithm 1 has linear communication complexity $O(ds + \frac{nr^5}{\epsilon^2})$ and computes a rank-$r$ approximation $\widehat M_r$ with $\|M - \widehat M_r\| \le \|M - M_r\| + \epsilon\|M - M_r\|_F$. Note that if $n \approx d$ and $s$ scales with $n$ (which is a typical requirement), then our communication complexity can be significantly better than that of [22]. Moreover, our method provides spectral-norm bounds, as compared to the relatively weak Frobenius bounds mentioned above.

Algorithm: The distributed version of our LELA algorithm depends crucially on the following observation: given $V$, each row of $U$ can be updated independently.
Hence, the servers need to communicate rows of $V$ only. Here too, we can use the fact that each server requires only $O(\frac{nr}{s}\log n)$ rows of $V$ to update its corresponding $U_{r_k}$. Let $U_{r_k}$ denote the restriction of $U$ to the rows in $\{r_k\}$ (and 0 outside); similarly, $\widehat V_{c_k}$ and $\widehat Y_{c_k}$ denote the restrictions of $\widehat V$ and $\widehat Y$ to the rows $\{c_k\}$. We now describe the distributed version of each critical step of the LELA algorithm; see Algorithm 3 for detailed pseudo-code. For simplicity, we dropped the use of a different set of samples in each iteration; correspondingly, the algorithm would be modified to distribute the samples $\Omega_k$ into $2T+1$ buckets and use one per iteration. This simplification doesn't change the communication complexity.

Sampling: For sampling, we first compute the column norms $\|M^j\|$, $\forall 1 \le j \le d$, and communicate them to each server. This operation requires $O(ds)$ communication. Next, each server $k$ samples elements from its rows $\{r_k\}$ and stores $R_{\Omega_k}(M_{r_k})$ locally. Because of independence over rows, the servers don't need to transmit their samples to other servers.

Initialization: In the initialization step, our algorithm computes the top-$r$ right singular vectors of $R_\Omega(M)$ by the iterations $\widehat Y^{(t+1)} = R_\Omega(M)^T R_\Omega(M) Y^t = \sum_k R_{\Omega_k}(M)^T R_{\Omega_k}(M) Y^t$. Note that computing $R_{\Omega_k}(M) Y^t$ requires server $k$ to access at most $|\Omega_k|$ columns of $Y^t$. Hence, the total communication from the CP to all the servers in this round is $O(|\Omega| r)$. Similarly, each column of $R_{\Omega_k}(M)^T R_{\Omega_k}(M) Y^t$ is only $|\Omega_k|$-sparse, so the total communication from all the servers to the CP in this round is also $O(|\Omega| r)$. Finally, we need only constantly many rounds to get a constant-factor approximation to the SVD of $R_\Omega(M)$, which is enough for a good initialization of the WAltMin procedure.
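For intuition, the initialization iterations can be sketched on a single machine (a simplified sketch under our own naming; in the distributed algorithm the two matrix products below are split across servers exactly as described, each touching only $|\Omega_k|$ entries):

```python
import numpy as np

def init_right_subspace(R_omega, r, n_rounds=10, seed=None):
    """Approximate the top-r right singular subspace of R_Omega(M) by the
    iteration Y <- R_Omega(M)^T R_Omega(M) Y; QR plays the role of the
    "normalize" step. R_omega holds the reweighted samples (zeros off Omega)."""
    rng = np.random.default_rng(seed)
    d = R_omega.shape[1]
    Y = rng.standard_normal((d, r))
    for _ in range(n_rounds):
        Y = R_omega.T @ (R_omega @ Y)   # the only step needing communication
        Y, _ = np.linalg.qr(Y)          # keep the iterate well-conditioned
    return Y
```

In the distributed version, each server contributes its summand $R_{\Omega_k}(M)^T R_{\Omega_k}(M) Y$ and the CP accumulates and orthonormalizes.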
Hence, the total communication complexity of the initialization step is $O(|\Omega| r)$.

Alternating Minimization Step: For alternating minimization, the updates to the rows of $U$ are computed at the corresponding servers, and the update to $V$ is computed at the CP. For updating $\widehat U^{(t+1)}_{r_k}$ at server $k$, we use the following observation: updating $\widehat U^{(t+1)}_{r_k}$ requires at most $|\Omega_k|$ rows of $\widehat V^{(t)}$. Hence, the total communication from the CP to all the servers in the $t$-th iteration is $O(|\Omega| r)$. Next, we make the critical observation that the update $\widehat V^{(t+1)}$ can be computed by adding certain messages from each server (see Algorithm 3 for more details). The message from server $k$ to the CP is of size $O(|\Omega_k| r^2)$. Hence, the total communication complexity in each round is $O(|\Omega| r^2)$, and the total number of rounds is $O(\log(\|M\|_F/\zeta))$.

We now combine the above observations to provide the error bounds and communication complexity of our distributed PCA algorithm:

Theorem 4.1. Let the $n\times d$ matrix $M$ be distributed over $s$ servers according to the row-partition model. Let $m \ge \frac{C}{\gamma}\frac{nr^3}{\epsilon^2}\kappa^2\log(n)\log^2\big(\frac{\|M_r\|}{\zeta}\big)$. Then, on completion, Algorithm 3 leaves matrices $\widehat U^{(t+1)}_{r_k}$ at server $k$ and $\widehat V^{(t+1)}$ at the CP such that the following holds (w.p. $\ge 1 - \gamma$):
$$\|M - \widehat U^{(t+1)}(\widehat V^{(t+1)})^T\| \le \|M - M_r\| + \epsilon\|M - M_r\|_F + \zeta,$$
where $\widehat U^{(t+1)} = \sum_k \widehat U^{(t+1)}_{r_k}$. This algorithm has a communication complexity of $O(ds + |\Omega| r^2) = O\big(ds + \frac{nr^5\kappa^2}{\epsilon^2}\log^2\big(\frac{\|M_r\|}{\zeta}\big)\big)$ real numbers.

As discussed above, each update to $\widehat V^{(t)}$ and $\widehat U^{(t)}$ is computed exactly as in the WAltMin procedure (Sub-routine 2). Hence, the error bounds for the algorithm follow directly from Theorem 3.1. The communication complexity bound follows by observing that $|\Omega| \le 2m$ w.h.p.
Remark: The sampling step given above suggests another simple algorithm: compute $P_\Omega(M)$ in a distributed fashion and communicate the samples to the CP, with all further computation performed at the CP. The total communication complexity would be $O(ds + |\Omega|) = O\big(ds + \frac{nr^3\kappa^2}{\epsilon^2}\log\big(\frac{\|M_r\|}{\zeta}\big)\big)$, which is less than the communication complexity of Algorithm 3. However, such an algorithm is not desirable in practice, because it relies entirely on a single server to perform all the computation; it is therefore slower and fault-prone. In contrast, our algorithm can be implemented in a peer-to-peer scenario as well and is more fault-tolerant.

Also, the communication complexity bound of Theorem 4.1 only bounds the total number of real numbers transferred. If each number requires several bits to communicate, the real communication can still be very large. Below, we bound each of the real numbers that we transfer, hence providing a bound on the number of bits transferred.

Bit complexity: First we bound $w_{ij} M_{ij}$. Note that we need to bound this only for $(i,j) \in \Omega$. Now, $|w_{ij} M_{ij}| \le \|R_\Omega(M)\|_\infty \le \|R_\Omega(M)\| \le \|M\| + \epsilon\|M\|_F \le 2\,nd\,M_{\max}$, where the third inequality follows from Lemma 3.2. Hence, if the entries of the matrix $M$ are initially represented using $b$ bits, then the algorithm needs to use $O(b + \log(nd))$ bits. By the same argument we get a bound of $O(b + \log(nd))$ bits for computing $\|M^i\|^2, \forall i$; $\|M^j\|^2, \forall j$; $\|M\|_F^2$; and $\|M\|_{1,1}$. Further, at any stage of the WAltMin iterations, $\|\widehat U^{(t)}(\widehat V^{(t+1)})^T\|_\infty \le \|\widehat U^{(t)}(\widehat V^{(t+1)})^T\| \le 2\|M\|_F$, so this stage also needs only $O(b + \log(n))$ bits of computation.
Hence, overall, the bit complexity of each of the real numbers of Algorithm 3 is $O(b + \log(nd))$, if $b$ bits are needed for representing the matrix entries. That is, the overall communication complexity of the algorithm is $O\big((b + \log(nd))\cdot\big(ds + \frac{nr^3\kappa^2}{\epsilon^2}\log\big(\frac{\|M_r\|}{\zeta}\big)\big)\big)$.

5 Simulations

Figure 1: The plots show how the error $\|M_r - \widehat M_r\|$ decreases with increasing number of samples $m$, for different values of the noise $\|M - M_r\| \in \{0.01, 0.05, 0.1\}$, for (a) incoherent and (b) coherent matrices. Algorithm LELA 1 is run with $m$ samples, and the Gaussian projection algorithm is run with the corresponding projection dimension $l = m/n$. Computationally, the LELA algorithm takes $O(nnz(M) + m\log(n))$ time for computing the samples, and the Gaussian projection algorithm takes $O(nm)$ time. (a): For the same number of samples, both algorithms have almost the same error for incoherent matrices. (b): For coherent matrices, the error of the LELA algorithm (solid lines) is clearly much smaller than that of random projection (dotted lines).

In this section we present simulation results on synthetic data to show the error performance of Algorithm 1.
First we consider the setting of finding a low-rank approximation of a given matrix $M$. Later, we consider the setting of computing a low-rank approximation of $A\cdot B$, given $A$ and $B$, without computing the product.

For the simulations we consider random matrices of size 1000 by 1000 and rank 5. $M_r$ is a rank-5 matrix with all singular values 1. We consider two cases, one in which $M_r$ is incoherent and one in which $M_r$ is coherent. Recall that an $n\times d$ rank-$r$ matrix $M_r$ is incoherent if $\|(U^*)^i\|^2 \le \frac{\mu_0 r}{n}, \forall i$ and $\|(V^*)^j\|^2 \le \frac{\mu_0 r}{d}, \forall j$, where the SVD of $M_r$ is $U^*\Sigma^*(V^*)^T$. Intuitively, incoherent matrices have mass spread over almost all entries, whereas coherent matrices have mass concentrated on only a few entries.

To generate matrices with varying incoherence parameter $\mu_0$, we use the power-law matrices model [6]: $M_r = DUV^TD$, where $U$ and $V$ are random $n\times r$ orthonormal matrices and $D$ is a diagonal matrix with $D_{ii} \propto \frac{1}{i^\alpha}$. For $\alpha = 0$, $M_r$ is an incoherent matrix with $\mu_0 = O(1)$, and for $\alpha = 1$, $M_r$ is a coherent matrix with $\mu_0 = O(n)$.

Figure 2: The plots show the error $\|(YY^T)_r - \widehat{(YY^T)}_r\|$ for LELA direct (Section 3.3) and the stagewise algorithm, for (a) incoherent and (b) coherent matrices. The stagewise algorithm first computes a rank-$r$ approximation $\widehat Y_r$ of $Y$ using Algorithm 1 and sets the low-rank approximation $\widehat{(YY^T)}_r = \widehat Y_r \widehat Y_r^T$. Clearly, directly computing the low-rank approximation of $YY^T$ has smaller error.

The input to the algorithms is the matrix $M = M_r + Z$, where $Z$ is a Gaussian noise matrix with $\|Z\| = 0.01, 0.05$, and $0.1$. Correspondingly, the Frobenius norm of $Z$ is $\|Z\|\cdot\sqrt{1000}/2$, which is $0.16$, $0.79$, and $1.6$ respectively. Each simulation is averaged over 20 different runs. We run the WAltMin step of the algorithm for 15 iterations. Note that using a different set of samples in each iteration of the WAltMin subroutine 2 is generally observed to be unnecessary in practice; hence we use the same set of samples for all iterations.

In the first plot we compare the error $\|M_r - \widehat M_r\|$ of our algorithm LELA 1 with the random projection based algorithm [17, 3]. We use a matrix with each entry an independent Gaussian random variable as the sketching matrix for the random projection algorithm. Other choices are the Walsh-Hadamard based transform [32] and sparse embedding matrices [7]. We compare the error of both algorithms as we vary the number of samples $m$ for Algorithm 1, equivalently varying the dimension of the random projection $l = m/n$ for the random projection algorithm. In Figure 1 we plot the error $\|M_r - \widehat M_r\|$ with a varying number of samples $m$ for both algorithms. For incoherent matrices, we see that the LELA algorithm has almost the same error as the random projection algorithm (Figure 1(a)). But for coherent matrices, we notice in Figure 1(b) that LELA has significantly smaller error.
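The power-law test family $M_r = DUV^TD$ described above can be generated directly; a minimal sketch (the function name is ours), with $\alpha = 0$ giving an incoherent and $\alpha = 1$ a coherent test matrix:

```python
import numpy as np

def power_law_matrix(n, r, alpha, seed=None):
    """Power-law model of [6]: M_r = D U V^T D with U, V random n x r
    orthonormal matrices and D_ii proportional to 1 / i^alpha."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((n, r)))
    V, _ = np.linalg.qr(rng.standard_normal((n, r)))
    d = 1.0 / np.arange(1, n + 1) ** alpha       # diagonal of D
    return (d[:, None] * U) @ (V.T * d[None, :])  # D U V^T D without forming D
```

Larger $\alpha$ concentrates the mass of the singular vectors on the first rows and columns, which is what raises the incoherence parameter $\mu_0$.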
Now we consider the setting of computing a low-rank approximation of $YY^T$ given $Y$, using the algorithm LELA direct discussed in Section 3.3 with sampling (5). In Figure 2 we compare this algorithm with a stagewise algorithm, which first computes a low-rank approximation $\widehat Y_r$ from $Y$ and then sets the rank-$r$ approximation of $YY^T$ to $\widehat Y_r \widehat Y_r^T$. As discussed in Section 3.3, direct approximation of $YY^T$ has smaller error than computing $\widehat Y_r \widehat Y_r^T$. Again, plot 2(a) is for incoherent matrices and plot 2(b) is for coherent matrices.

Finally, in Figure 3 we consider the case where $A$ and $B$ are two rank-$2r$ matrices with $AB$ being a rank-$r$ matrix. Here the top $r$-dimensional row space of $A$ is orthogonal to the top $r$-dimensional column space of $B$. Hence, simple algorithms that first compute rank-$r$ approximations of $A$ and $B$ and then multiply will have high error compared to LELA direct.

Figure 3: The plots show the error $\|(AB)_r - \widehat{(AB)}_r\|$ for LELA direct (Section 3.3) and the stagewise algorithm, for (a) incoherent and (b) coherent matrices. The stagewise algorithm first computes rank-$r$ approximations $\widehat A_r, \widehat B_r$ of $A, B$ respectively using Algorithm 1 and sets the low-rank approximation $\widehat{(AB)}_r = \widehat A_r \widehat B_r$. Clearly, directly computing the low-rank approximation of $AB$ has smaller error.

References

[1] D. Achlioptas, Z. Karnin, and E. Liberty. Near-optimal distributions for data matrix sampling. Advances in Neural Information Processing Systems, 73, 2013.

[2] D. Achlioptas and F. McSherry. Fast computation of low rank matrix approximations. In Proceedings of the thirty-third annual ACM symposium on Theory of computing, pages 611-618. ACM, 2001.

[3] C. Boutsidis and A. Gittens. Improved matrix algorithms via the subsampled randomized Hadamard transform. SIAM Journal on Matrix Analysis and Applications, 34(3):1301-1340, 2013.

[4] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717-772, 2009.

[5] E. J. Candès and T. Tao. The power of convex relaxation: Near-optimal matrix completion. Information Theory, IEEE Transactions on, 56(5):2053-2080, 2010.

[6] Y. Chen, S. Bhojanapalli, S. Sanghavi, and R. Ward. Coherent matrix completion. In Proceedings of The 31st International Conference on Machine Learning, pages 674-682, 2014.

[7] K. L. Clarkson and D. P. Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the 45th annual ACM symposium on Symposium on theory of computing, pages 81-90. ACM, 2013.

[8] A. Deshpande and S. Vempala. Adaptive sampling and fast low-rank matrix approximation. In Approximation, Randomization, and Combinatorial Optimization.
Algorithms and Techniques, pages 292-303. Springer, 2006.

[9] P. Drineas and R. Kannan. Pass efficient algorithms for approximating large matrices. In SODA, volume 3, pages 223-232, 2003.

[10] P. Drineas, R. Kannan, and M. W. Mahoney. Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix. SIAM Journal on Computing, 36(1):158-183, 2006.

[11] P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Subspace sampling and relative-error matrix approximation: Column-based methods. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 316-326. Springer, 2006.

[12] P. Drineas and A. Zouzias. A note on element-wise matrix sparsification via a matrix-valued Bernstein inequality. Information Processing Letters, 111(8):385-389, 2011.

[13] D. Feldman, M. Schmidt, and C. Sohler. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1434-1453. SIAM, 2013.

[14] A. Frieze, R. Kannan, and S. Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. In Foundations of Computer Science, 1998. Proceedings. 39th Annual Symposium on, pages 370-378, Nov 1998.

[15] M. Ghashami and J. M. Phillips. Relative errors for deterministic low-rank matrix approximations. In SODA, pages 707-717. SIAM, 2014.

[16] D. Gross. Recovering low-rank matrices from few coefficients in any basis. Information Theory, IEEE Transactions on, 57(3):1548-1566, 2011.

[17] N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217-288, 2011.

[18] S. Har-Peled. Low rank matrix approximation in linear time. Manuscript.
http://valis.cs.uiuc.edu/sariel/papers/05/lrank/lrank.pdf, 2006.

[19] M. Hardt. Understanding alternating minimization for matrix completion. arXiv preprint arXiv:1312.0925, 2013.

[20] M. Hardt and M. Wootters. Fast matrix completion without the condition number. In Proceedings of The 27th Conference on Learning Theory, pages 638-678, 2014.

[21] P. Jain, P. Netrapalli, and S. Sanghavi. Low-rank matrix completion using alternating minimization. In Proceedings of the 45th annual ACM symposium on Symposium on theory of computing, pages 665-674. ACM, 2013.

[22] R. Kannan, S. S. Vempala, and D. P. Woodruff. Principal component analysis and higher correlations for distributed data. In Proceedings of The 27th Conference on Learning Theory, pages 1040-1057, 2014.

[23] R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from a few entries. Information Theory, IEEE Transactions on, 56(6):2980-2998, 2010.

[24] Y. Liang, M.-F. Balcan, and V. Kanchanapally. Distributed PCA and k-means clustering. In The Big Learning Workshop at NIPS, 2013.

[25] E. Liberty. Simple and deterministic matrix sketching. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 581-588. ACM, 2013.

[26] L. Mackey, M. I. Jordan, R. Y. Chen, B. Farrell, and J. A. Tropp. Matrix concentration inequalities via the method of exchangeable pairs. arXiv preprint arXiv:1201.6002, 2012.

[27] M. W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123-224, 2011.

[28] N. H. Nguyen, T. T. Do, and T. D. Tran. A fast and efficient algorithm for low-rank approximation of a matrix. In Proceedings of the 41st annual ACM symposium on Theory of computing, pages 215-224. ACM, 2009.

[29] B. Recht. A simpler approach to matrix completion.
arXiv preprint arXiv:0910.0651, 2009.

[30] T. Sarlos. Improved approximation algorithms for large matrices via random projections. In Foundations of Computer Science, 2006. FOCS'06. 47th Annual IEEE Symposium on, pages 143-152. IEEE, 2006.

[31] J. A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389-434, 2012.

[32] F. Woolfe, E. Liberty, V. Rokhlin, and M. Tygert. A fast randomized algorithm for the approximation of matrices. Applied and Computational Harmonic Analysis, 25(3):335-366, 2008.

A Concentration Inequalities

In this section we review a couple of concentration inequalities used in the proofs.

Lemma A.1 (Bernstein's Inequality). Let $X_1, \ldots, X_n$ be independent scalar random variables with $|X_i| \le L, \forall i$ w.p. 1. Then,
$$P\left[\Big|\sum_{i=1}^n X_i - \sum_{i=1}^n E[X_i]\Big| \ge t\right] \le 2\exp\left(\frac{-t^2/2}{\sum_{i=1}^n \mathrm{Var}(X_i) + Lt/3}\right). \qquad (6)$$

Lemma A.2 (Matrix Bernstein's Inequality [31]). Let $X_1, \ldots, X_p$ be independent random matrices in $\mathbb{R}^{n\times n}$. Assume each matrix has bounded deviation from its mean: $\|X_i - E[X_i]\| \le L, \forall i$ w.p. 1. Also let the variance be
$$\sigma^2 = \max\left\{\Big\|E\Big[\sum_{i=1}^p (X_i - E[X_i])(X_i - E[X_i])^T\Big]\Big\|,\ \Big\|E\Big[\sum_{i=1}^p (X_i - E[X_i])^T(X_i - E[X_i])\Big]\Big\|\right\}.$$
Then,
$$P\left[\Big\|\sum_{i=1}^p (X_i - E[X_i])\Big\| \ge t\right] \le 2n\exp\left(\frac{-t^2/2}{\sigma^2 + Lt/3}\right). \qquad (7)$$

Recall that the Schatten-$p$ norm of a matrix $X$ is $\|X\|_p = \big(\sum_{i=1}^n \sigma_i(X)^p\big)^{1/p}$, where $\sigma_i(X)$ is the $i$-th singular value of $X$. In particular, for $p = 2$ the Schatten-2 norm is the Frobenius norm of the matrix.

Lemma A.3 (Matrix Chebyshev Inequality [26]). Let $X$ be a random matrix. For all $t > 0$,
$$P[\|X\| \ge t] \le \inf_{p \ge 1} t^{-p}\, E\big[\|X\|_p^p\big]. \qquad (8)$$

B Proofs of Section 3

In this section we present the proof of Theorem 3.1. For simplicity we only discuss the proofs for the case when the matrix is square; the rectangular case is a simple extension.
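As an illustrative numeric sanity check of Lemma A.1 (our own, not from the paper), one can compare the empirical tail of a sum of bounded i.i.d. variables against the right-hand side of (6):

```python
import numpy as np

def bernstein_tail(var_sum, L, t):
    """Right-hand side of Bernstein's inequality (6)."""
    return 2.0 * np.exp(-(t * t / 2.0) / (var_sum + L * t / 3.0))

rng = np.random.default_rng(0)
n, trials = 400, 2000
X = rng.uniform(-1.0, 1.0, size=(trials, n))   # |X_i| <= L = 1, Var(X_i) = 1/3
t = 60.0
empirical = np.mean(np.abs(X.sum(axis=1)) >= t)
bound = bernstein_tail(n / 3.0, 1.0, t)
```

Here the sum has standard deviation $\sqrt{n/3} \approx 11.5$, so at $t = 60$ the empirical tail is essentially zero while the bound evaluates to roughly $2e^{-11.7}$.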
We will provide proofs of the supporting lemmas first, and begin by recalling some notation. Let
$$q_{ij} = m\left(\frac{1}{2}\cdot\frac{\|M_i\|^2 + \|M^j\|^2}{2n\|M\|_F^2} + \frac{1}{2}\cdot\frac{|M_{ij}|}{\|M\|_{1,1}}\right),$$
and let $\hat q_{ij} = \min\{q_{ij}, 1\}$; this makes sure the probabilities are all at most 1. Recall the definition of the weights: $w_{ij} = 1/\hat q_{ij}$ when $\hat q_{ij} > 0$, and 0 otherwise. Note that $\sum_{ij} \hat q_{ij} \le m$. Also let $m \ge \beta n r\log(n)$. Let $\{\delta_{ij}\}$ be indicator random variables with $\delta_{ij} = 1$ with probability $\hat q_{ij}$. Define $\Omega$ to be the sampling operator with $\Omega_{ij} = \delta_{ij}$, and define the weighted sampling operator $R_\Omega$ by $R_\Omega(M)_{ij} = \delta_{ij} w_{ij} M_{ij}$. Throughout the proofs we drop the subscript of $\Omega$ that denotes the different sampling sets at each iteration of WAltMin.

First we abstract out the properties of the sampling distribution (2) that we use in the rest of the proof.

Lemma B.1. For $\Omega$ generated according to (2) and under the assumptions of Lemma 3.2, the following hold for all $(i,j)$ such that $q_{ij} \le 1$:
$$\frac{|M_{ij}|}{\hat q_{ij}} \le \frac{2n}{m}\|M\|_F, \qquad (9)$$
$$\sum_{\{j : \hat q_{ij} = q_{ij}\}} \frac{M_{ij}^2}{\hat q_{ij}} \le \frac{4n}{m}\|M\|_F^2, \qquad (10)$$
$$\frac{\|(U^*)^i\|^2}{\hat q_{ij}} \le \frac{8nr\kappa^2}{m}, \qquad (11)$$
and
$$\frac{\|(U^*)^i\|\,\|(V^*)^j\|}{\hat q_{ij}} \le \frac{8nr\kappa^2}{m}. \qquad (12)$$

The proof of Lemma B.1 is straightforward from the definition of $q_{ij}$.

B.1 Initialization

We now provide the proof of the initialization lemma, Lemma 3.2.

Proof of Lemma 3.2. The proof has two parts: 1) we show that $\|R_\Omega(M) - M\| \le \delta\|M\|_F$; 2) we show that the trimming step of Algorithm 2 gives the required row-norm bounds on $\widehat U^{(0)}$, namely $\|(\widehat U^{(0)})^i\| \le 8\sqrt{r}\sqrt{\|M_i\|^2/\|M\|_F^2}$ and $\mathrm{dist}(\widehat U^{(0)}, U^*) \le \frac{1}{2}$.

Proof of the first step: We prove the first part using the matrix Bernstein inequality.
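The notation above can be made concrete with a short sketch (the helper name is ours): it builds $\hat q$, the weights $w_{ij}$, and the reweighted operator $R_\Omega(M)_{ij} = \delta_{ij} w_{ij} M_{ij}$, which is unbiased since $\hat q_{ij} w_{ij} M_{ij} = M_{ij}$ wherever $\hat q_{ij} > 0$.

```python
import numpy as np

def weighted_sampling_operator(M, m, seed=None):
    """Sampling distribution (2) and the operator R_Omega(M): q_ij mixes
    leverage-style row/column norms with an L1 term in |M_ij|."""
    rng = np.random.default_rng(seed)
    n = max(M.shape)
    row2 = np.sum(M * M, axis=1)[:, None]    # ||M_i||^2
    col2 = np.sum(M * M, axis=0)[None, :]    # ||M^j||^2
    fro2 = np.sum(M * M)                     # ||M||_F^2
    l1 = np.abs(M).sum()                     # ||M||_{1,1}
    q = m * (0.5 * (row2 + col2) / (2.0 * n * fro2) + 0.5 * np.abs(M) / l1)
    q_hat = np.minimum(q, 1.0)
    delta = rng.random(M.shape) < q_hat      # indicators delta_ij
    w = np.where(q_hat > 0, 1.0 / np.maximum(q_hat, 1e-300), 0.0)
    return delta * w * M, delta, q_hat       # R_Omega(M), Omega, hat q
```

The $L_1$ term guarantees $\hat q_{ij} > 0$ whenever $M_{ij} \ne 0$, which is what bounds $|w_{ij} M_{ij}|$ in the proofs below.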
Note that the $L_1$ term in the sampling distribution helps in getting good bounds on the absolute magnitude of the random variables $X_{ij}$ in this proof. Let $X_{ij} = (\delta_{ij} - \hat q_{ij}) w_{ij} M_{ij}\, e_i e_j^T$. Note that $\{X_{ij}\}_{i,j=1}^n$ are independent zero-mean random matrices, and $R_\Omega(M) - E[R_\Omega(M)] = \sum_{ij} X_{ij}$.

First we bound $\|X_{ij}\|$. When $q_{ij} \ge 1$, we have $\hat q_{ij} = 1$ and $\delta_{ij} = 1$, so $X_{ij} = 0$ with probability 1. Hence we only need to consider the cases when $\hat q_{ij} = q_{ij} \le 1$; we assume this in all the proofs without explicitly mentioning it any more.
$$\|X_{ij}\| = \max\{|(1 - \hat q_{ij}) w_{ij} M_{ij}|,\ |\hat q_{ij} w_{ij} M_{ij}|\}. \qquad (13)$$
Recall $w_{ij} = 1/\hat q_{ij}$. Hence
$$|(1 - \hat q_{ij}) w_{ij} M_{ij}| = \Big(\frac{1}{\hat q_{ij}} - 1\Big)|M_{ij}| \le \frac{|M_{ij}|}{\hat q_{ij}} \overset{\zeta_1}{\le} \frac{2n}{m}\|M\|_F,$$
where $\zeta_1$ follows from (9), and
$$|\hat q_{ij} w_{ij} M_{ij}| = |M_{ij}| \overset{\zeta_1}{\le} \frac{|M_{ij}|}{\hat q_{ij}} \le \frac{2n}{m}\|M\|_F,$$
where $\zeta_1$ follows from $\hat q_{ij} \le 1$. Hence $\|X_{ij}\|$ is bounded by $L = \frac{2n}{m}\|M\|_F$.

Now we bound the variance:
$$\Big\|E\Big[\sum_{ij} X_{ij} X_{ij}^T\Big]\Big\| = \Big\|\sum_{ij} \hat q_{ij}(1 - \hat q_{ij})\, w_{ij}^2 M_{ij}^2\, e_i e_i^T\Big\| = \max_i \sum_j \hat q_{ij}(1 - \hat q_{ij})\, w_{ij}^2 M_{ij}^2.$$
Now,
$$\sum_j \hat q_{ij}(1 - \hat q_{ij})\, w_{ij}^2 M_{ij}^2 = \sum_j \Big(\frac{1}{\hat q_{ij}} - 1\Big) M_{ij}^2 \le \sum_j \frac{M_{ij}^2}{\hat q_{ij}} \overset{\zeta_1}{\le} \frac{4n}{m}\|M\|_F^2,$$
where $\zeta_1$ follows from (10). Hence $\|E[\sum_{ij} X_{ij} X_{ij}^T]\| \le \frac{4n}{m}\|M\|_F^2$, and the same bound can be proved for $\|E[\sum_{ij} X_{ij}^T X_{ij}]\|$. Hence $\sigma^2 = \frac{4n}{m}\|M\|_F^2$.

Now using the matrix Bernstein inequality with $t = \delta\|M\|_F$ gives, with probability $\ge 1 - \frac{2}{n^2}$,
$$\|R_\Omega(M) - E[R_\Omega(M)]\| = \|R_\Omega(M) - M\| \le \delta\|M\|_F.$$
Hence we get
$$\|M - P_r(R_\Omega(M))\| \le \|M - R_\Omega(M)\| + \|R_\Omega(M) - P_r(R_\Omega(M))\| \le \|M - M_r\| + 2\delta\|M\|_F,$$
which implies $\|M_r - P_r(R_\Omega(M))\| \le 2\|M - M_r\| + 2\delta\|M\|_F$. Let the SVD of $P_r(R_\Omega(M))$ be $U^{(0)}\Sigma^{(0)}(V^{(0)})^T$.
Hence,
$$\begin{aligned}
\|P_r(R_\Omega(M)) - M_r\|^2 &= \|U^{(0)}\Sigma^{(0)}(V^{(0)})^T - U^*\Sigma^*(V^*)^T\|^2 \\
&= \|U^{(0)}\Sigma^{(0)}(V^{(0)})^T - U^{(0)}(U^{(0)})^T U^*\Sigma^*(V^*)^T - (I - U^{(0)}(U^{(0)})^T) U^*\Sigma^*(V^*)^T\|^2 \\
&\ge \|(I - U^{(0)}(U^{(0)})^T) U^*\Sigma^*(V^*)^T\|^2 \ge (\sigma_r^*)^2\,\|(U^{(0)}_\perp)^T U^*\|^2.
\end{aligned}$$
This implies
$$\mathrm{dist}(U^{(0)}, U^*) \le \frac{2\|M - M_r\| + 2\delta\|M\|_F}{\sigma_r^*} \le \frac{1}{144 r},$$
which follows from the assumption $\|M - M_r\|_F \le \frac{1}{576\kappa r^{1.5}}\|M_r\|_F$ and $\delta \le \frac{1}{576\kappa r^{1.5}}$. Here $\kappa = \frac{\sigma_1^*}{\sigma_r^*}$ is the condition number of $M_r$.

Proof of the trimming step: From the previous step we know that $\|R_\Omega(M) - M\| \le \delta\|M\|_F$ and consequently $\mathrm{dist}(U^{(0)}, U^*) \le \delta_2$. Let
$$l_i = \sqrt{4\|M_i\|^2/\|M\|_F^2}$$
be the estimates of the left leverage scores of the matrix $M$. Since $\|M - M_r\|_F \le \|M_r\|_F$,
$$l_i^2 \ge \frac{\sum_{k=1}^r (\sigma_k^*)^2 (U_{ik}^*)^2}{\sum_{k=1}^r (\sigma_k^*)^2}.$$
Set the elements bigger than $2 l_i$ in the $i$-th row of $U^{(0)}$ to 0, and let $\tilde U$ be the new initialization matrix obtained. Since $\mathrm{dist}(U^{(0)}, U^*) \le \delta_2$, for every $j = 1, \ldots, r$ there exists a vector $\bar u_j \in \mathbb{R}^n$ such that $\langle U_j^{(0)}, \bar u_j\rangle \ge \sqrt{1 - \delta_2^2}$, $\|\bar u_j\| = 1$, and $|(\bar u_j)_i|^2 \le \frac{\sum_{k=1}^r (\sigma_k^*)^2 (U_{ik}^*)^2}{\sum_{k=1}^r (\sigma_k^*)^2}$ for all $i$. Now $\tilde U_j$, the $j$-th column of $\tilde U$, is obtained by setting the entries of the $j$-th column of $U^{(0)}$ to 0 whenever the $i$-th entry of $U_j^{(0)}$ is bigger than $2 l_i$. For such $i$,
$$\big|(\tilde U_j)_i - (\bar u_j)_i\big| \le \sqrt{\frac{\sum_{k=1}^r (\sigma_k^*)^2 (U_{ik}^*)^2}{\sum_{k=1}^r (\sigma_k^*)^2}} \le \big|(U_j^{(0)})_i - (\bar u_j)_i\big|, \qquad (14)$$
since
$$(U_j^{(0)})_i - (\bar u_j)_i \ge 2 l_i - l_i = l_i \ge \sqrt{\frac{\sum_{k=1}^r (\sigma_k^*)^2 (U_{ik}^*)^2}{\sum_{k=1}^r (\sigma_k^*)^2}}.$$
For the rest of the coordinates, $(\tilde U_j)_i = (U_j^{(0)})_i$. Hence,
$$\|\tilde U_j - \bar u_j\| \le \|U_j^{(0)} - \bar u_j\| = \sqrt{1 + 1 - 2\langle U_j^{(0)}, \bar u_j\rangle} \le \sqrt{2}\,\delta_2.$$
Hence $\|\tilde U_j\| \ge 1 - \sqrt{2}\,\delta_2$, and
$$\|U_j^{(0)} - \tilde U_j\| \le \sqrt{1 - \|\tilde U_j\|^2} \le 2\sqrt{\delta_2},$$
for $\delta_2 \le \frac{1}{\sqrt{2}}$. Also, $\|U^{(0)} - \tilde U\|_F \le 2\sqrt{r\delta_2}$.
This gives a bound on the smallest singular value of $\tilde U$:
$$\sigma_{\min}(\tilde U) \ge \sigma_{\min}(U^{(0)}) - \sigma_{\max}(\tilde U - U^{(0)}) \ge 1 - 2\sqrt{r\delta_2}.$$
Now let the reduced QR decomposition of $\tilde U$ be $\tilde U = \widehat U^{(0)}\Lambda^{-1}$, where $\widehat U^{(0)}$ is the matrix with orthonormal columns. From the bounds above we get
$$\|\Lambda\|^2 = \frac{1}{\sigma_{\min}(\Lambda^{-1})^2} = \frac{1}{\sigma_{\min}(\widehat U^{(0)}\Lambda^{-1})^2} = \frac{1}{\sigma_{\min}(\tilde U)^2} \le 4,$$
when $\delta_2 \le \frac{1}{16 r}$.

First we show that this trimming step does not increase the distance to $U^*$ by much. To bound $\mathrm{dist}(\widehat U^{(0)}, U^*)$, consider $\|(u_\perp^*)^T \widehat U^{(0)}\|$, where $u_\perp^*$ is some vector perpendicular to $U^*$:
$$\|(u_\perp^*)^T \widehat U^{(0)}\| = \|(u_\perp^*)^T \tilde U \Lambda\| \le \big(\|(u_\perp^*)^T U^{(0)}\| + \|(u_\perp^*)^T (\tilde U - U^{(0)})\|\big)\|\Lambda\| \le (\delta_2 + 2\sqrt{r\delta_2})\cdot 2 \le 6\sqrt{r\delta_2} \le \frac{1}{2},$$
for $\delta_2 \le \frac{1}{144 r}$. Second, we bound $\|(\widehat U^{(0)})^i\|$:
$$\|(\widehat U^{(0)})^i\| = \|e_i^T \widehat U^{(0)}\| = \|e_i^T \tilde U \Lambda\| \le \|e_i^T \tilde U\|\,\|\Lambda\| \le 2 l_i\sqrt{r}\cdot 2 \le 8\sqrt{r}\sqrt{\|M_i\|^2/\|M\|_F^2}.$$
This finishes the proof of the second part of the lemma.

B.2 Weighted AltMin analysis

We first provide the proof of Lemma 3.3 for the rank-1 case to explain the main ideas; in the next section we discuss the rank-$r$ case. Hence $M_1 = \sigma^* u^* (v^*)^T$. Before the proof of the lemma we prove a couple of supporting lemmas. Let $u^t$ and $v^{t+1}$ be the normalized versions of the iterates $\widehat u^t$ and $\widehat v^{t+1}$ of WAltMin. We assume that the samples for each iteration are generated independently; for simplicity, we drop the subscripts on $\Omega$ that denote the different set of samples in each iteration in the rest of the proof. The weighted alternating minimization updates at the $(t+1)$-th iteration are
$$\|\widehat u^t\|\,\widehat v_j^{t+1} = \sigma^* v_j^*\,\frac{\sum_i \delta_{ij} w_{ij}\, u_i^t u_i^*}{\sum_i \delta_{ij} w_{ij}\, (u_i^t)^2} + \frac{\sum_i \delta_{ij} w_{ij}\, u_i^t (M - M_1)_{ij}}{\sum_i \delta_{ij} w_{ij}\, (u_i^t)^2}. \qquad (15)$$
$\qquad (15)$

Writing this in terms of power method updates we get
$$\|\hat{u}^t\|\,\hat{v}^{t+1} = \sigma^*\langle u^*, u^t\rangle v^* - \sigma^* B^{-1}\big(\langle u^t, u^*\rangle B - C\big)v^* + B^{-1}y, \qquad (16)$$
where $B$ and $C$ are diagonal matrices with $B_{jj} = \sum_i \delta_{ij} w_{ij}(u^t_i)^2$ and $C_{jj} = \sum_i \delta_{ij} w_{ij} u^t_i u^*_i$, and $y$ is the vector $R_\Omega(M - M_1)^T u^t$ with entries $y_j = \sum_i \delta_{ij} w_{ij} u^t_i (M - M_1)_{ij}$. Now we will bound the error caused by the $M - M_1$ component in each iteration.

Lemma B.2. For $\Omega$ generated according to (2) and under the assumptions of Lemma 3.3 the following holds:
$$\big\|(u^t)^T R_\Omega(M - M_1) - (u^t)^T(M - M_1)\big\| \leq \delta\|M - M_1\|_F, \qquad (17)$$
with probability greater than $1 - \frac{\gamma}{T\log(n)}$, for $m \geq \beta n\log(n)$, $\beta \geq \frac{4c_1^2 T}{\gamma\delta^2}$. Hence
$$\|(u^t)^T R_\Omega(M - M_1)\| \leq \mathrm{dist}(u^t, u^*)\|M - M_1\| + \delta\|M - M_1\|_F,$$
for constant $\delta$.

Proof of Lemma B.2. Let the random matrices $X_{ij} = (\delta_{ij} - \hat{q}_{ij})w_{ij}(M - M_1)_{ij}(u^t)_i e_j^T$. Then $\sum_{ij} X_{ij} = (u^t)^T R_\Omega(M - M_1) - (u^t)^T(M - M_1)$. Also $\mathbb{E}[X_{ij}] = 0$. We will use the matrix Chebyshev inequality for $p = 2$. Now we bound $\mathbb{E}\big\|\sum_{ij} X_{ij}\big\|_2^2$:
$$\mathbb{E}\Big\|\sum_{ij} X_{ij}\Big\|_2^2 = \mathbb{E}\sum_j\Big(\sum_i(\delta_{ij} - \hat{q}_{ij})w_{ij}(M - M_1)_{ij}(u^t)_i\Big)^2 \stackrel{\zeta_1}{=} \sum_j\sum_i \hat{q}_{ij}(1 - \hat{q}_{ij})(w_{ij})^2(M - M_1)^2_{ij}(u^t_i)^2 \leq \sum_{ij} w_{ij}(u^t_i)^2(M - M_1)^2_{ij} \stackrel{\zeta_2}{\leq} \frac{4nc_1^2}{m}\|M - M_1\|_F^2.$$
$\zeta_1$ follows from the fact that the $X_{ij}$ are zero-mean independent random variables. $\zeta_2$ follows from (20). Hence applying the matrix Chebyshev inequality for $p = 2$ and $t = \delta\|M - M_1\|_F$ gives the result.

Lemma B.3. For $\Omega$ sampled according to (2) and under the assumptions of Lemma 3.3 the following holds:
$$\Big|\sum_j \delta_{ij} w_{ij}(u^*_j)^2 - \sum_j (u^*_j)^2\Big| \leq \delta_1, \qquad (18)$$
with probability greater than $1 - \frac{2}{n^2}$, for $m \geq \beta n\log(n)$, $\beta \geq \frac{16}{\delta_1^2}$ and $\delta_1 \leq 3$.

Proof of Lemma B.3. Note that $\sum_j \delta_{ij} w_{ij}(u^*_j)^2 - \sum_j (u^*_j)^2 = \sum_j (\delta_{ij} - \hat{q}_{ij})w_{ij}(u^*_j)^2$.
Let the random variable $X_j = (\delta_{ij} - \hat{q}_{ij})w_{ij}(u^*_j)^2$. Then $\mathbb{E}[X_j] = 0$ and $\mathrm{Var}(X_j) = \hat{q}_{ij}(1 - \hat{q}_{ij})(w_{ij}(u^*_j)^2)^2$. Hence
$$\sum_j \mathrm{Var}(X_j) = \sum_j \hat{q}_{ij}(1 - \hat{q}_{ij})(w_{ij}(u^*_j)^2)^2 = \sum_j\Big(\frac{1}{\hat{q}_{ij}} - 1\Big)(u^*_j)^4 \leq \sum_j \frac{(u^*_j)^2}{\hat{q}_{ij}}(u^*_j)^2 \stackrel{\zeta_1}{\leq} \frac{16n}{m}\sum_j (u^*_j)^2 = \frac{16n}{m}.$$
$\zeta_1$ follows from (11). Also it is easy to check that $|X_j| \leq \frac{16n}{m}$. Now applying the Bernstein inequality gives the result.

Lemma B.4. For $\Omega$ sampled according to (2) and under the assumptions of Lemma 3.3 the following holds:
$$\big\|(\langle u^t, u^*\rangle B - C)v^*\big\| \leq \delta_1\sqrt{1 - \langle u^*, u^t\rangle^2}, \qquad (19)$$
with probability greater than $1 - \frac{2}{n^2}$, for $m \geq \beta n\log(n)$, $\beta \geq \frac{48c_1^2}{\delta_1^2}$ and $\delta_1 \leq 3$.

Proof. Let $\alpha_i = u^t_i(\langle u^*, u^t\rangle u^t_i - u^*_i)$. Hence the $j$th coordinate of the error term in equation (16) is $\frac{\sum_i \delta_{ij} w_{ij}\alpha_i v^*_j}{\sum_i \delta_{ij} w_{ij}(u^t_i)^2}$. Let $X_{ij} = \delta_{ij} w_{ij}\alpha_i v^*_j e_j e_1^T$, for $i, j \in [1, \ldots, n]$. Note that the $X_{ij}$ are independent random matrices. Then $(\langle u^t, u^*\rangle B - C)v^*$ is the first and only column of the matrix $\sum_{i,j=1}^n X_{ij}$. We will bound $\|\sum_{ij} X_{ij}\|$ using the matrix Bernstein inequality. First,
$$\sum_{ij}\mathbb{E}[X_{ij}] = \sum_j\sum_i \hat{q}_{ij} w_{ij}\alpha_i v^*_j e_j e_1^T = \sum_j\sum_i \alpha_i v^*_j e_j e_1^T = 0,$$
because $\sum_i \alpha_i = 0$. Now we give a bound on $\|X_{ij}\|$:
$$\|X_{ij}\| = |\delta_{ij} w_{ij}\alpha_i v^*_j| \leq \Big|\frac{v^*_j u^t_i}{\hat{q}_{ij}}\Big|\,\big|\langle u^*, u^t\rangle u^t_i - u^*_i\big| \stackrel{\zeta_1}{\leq} \frac{16nc_1}{m}\sqrt{\sum_i(\langle u^*, u^t\rangle u^t_i - u^*_i)^2} = \frac{16nc_1}{m}\sqrt{1 - \langle u^*, u^t\rangle^2}.$$
$\zeta_1$ follows from (12) and (20). Now we bound the variance:
$$\mathbb{E}\sum_{ij}(X_{ij} - \mathbb{E}[X_{ij}])^T(X_{ij} - \mathbb{E}[X_{ij}]) = \mathbb{E}\sum_{ij}\big(X_{ij}^T X_{ij} - \mathbb{E}[X_{ij}]^T\mathbb{E}[X_{ij}]\big) = \sum_j\sum_i \hat{q}_{ij}(1 - \hat{q}_{ij})(w_{ij}\alpha_i v^*_j)^2\, e_1 e_1^T.$$
Then,
$$\sum_i \hat{q}_{ij}(1 - \hat{q}_{ij})(w_{ij}\alpha_i v^*_j)^2 \leq \sum_i w_{ij}(u^t_i)^2(\langle u^*, u^t\rangle u^t_i - u^*_i)^2(v^*_j)^2 \leq c_1^2\frac{4n}{m}(v^*_j)^2\sum_i(\langle u^*, u^t\rangle u^t_i - u^*_i)^2 \leq c_1^2\frac{4n}{m}(v^*_j)^2\big(1 - \langle u^*, u^t\rangle^2\big).$$
Hence,
$$\Big\|\mathbb{E}\sum_{ij}(X_{ij} - \mathbb{E}[X_{ij}])^T(X_{ij} - \mathbb{E}[X_{ij}])\Big\| \leq \frac{4nc_1^2}{m}\big(1 - \langle u^*, u^t\rangle^2\big).$$
The lemma follows from applying the matrix Bernstein inequality.

Now we provide the proof of Lemma 3.3.

Proof of Lemma 3.3 [Rank-1 case]:

Proof. We will prove that the distance between $u^t, u^*$ and $v^{t+1}, v^*$ decreases with each iteration. Recall that from the assumptions of the lemma we have the following row norm bounds for $u^t$:
$$|u^t_i| \leq c_1\sqrt{\|M_i\|^2/\|M\|_F^2 + |M_{ij}|/\|M\|_F}. \qquad (20)$$
First we prove that $\mathrm{dist}(u^t, u^*)$ decreases in each iteration, and second we prove that $v^{t+1}$ satisfies a similar bound on its row norms.

Bounding $\langle v^{t+1}, v^*\rangle$: Using Lemma B.3, Lemma B.4 and equation (16) we get
$$\|\hat{u}^t\|\langle\hat{v}^{t+1}, v^*\rangle \geq \sigma^*\langle u^t, u^*\rangle - \frac{\sigma^*\delta_1}{1 - \delta_1}\sqrt{1 - \langle u^*, u^t\rangle^2} - \frac{1}{1 - \delta_1}\|y^T v^*\| \qquad (21)$$
and
$$\|\hat{u}^t\|\langle\hat{v}^{t+1}, v^*_\perp\rangle \leq \frac{\sigma^*\delta_1}{1 - \delta_1}\sqrt{1 - \langle u^*, u^t\rangle^2} + \frac{1}{1 - \delta_1}\|y\|. \qquad (22)$$
Hence by applying the noise bounds of Lemma B.2 we get
$$\mathrm{dist}(v^{t+1}, v^*)^2 = 1 - \langle v^{t+1}, v^*\rangle^2 = \frac{\langle\hat{v}^{t+1}, v^*_\perp\rangle^2}{\langle\hat{v}^{t+1}, v^*_\perp\rangle^2 + \langle\hat{v}^{t+1}, v^*\rangle^2} \leq \frac{\langle\hat{v}^{t+1}, v^*_\perp\rangle^2}{\langle\hat{v}^{t+1}, v^*\rangle^2} \stackrel{\zeta_1}{\leq} \frac{4\big(\delta_1\,\mathrm{dist}(u^t, u^*) + \mathrm{dist}(u^t, u^*)\|M - M_1\|/\sigma^* + \delta\|M - M_1\|_F/\sigma^*\big)^2}{\big(\langle u^t, u^*\rangle - 2\delta_1\sqrt{1 - \langle u^*, u^t\rangle^2} - 2\delta\|M - M_1\|/\sigma^*\big)^2} \stackrel{\zeta_2}{\leq} \frac{4\big(\cdots\big)^2}{\big(\langle u^*, u^0\rangle - 2\delta_1\sqrt{1 - \langle u^*, u^0\rangle^2} - 2\delta\|M - M_1\|/\sigma^*\big)^2} \stackrel{\zeta_3}{\leq} 25\big(\delta_1\,\mathrm{dist}(u^t, u^*) + \mathrm{dist}(u^t, u^*)\|M - M_1\|/\sigma^* + \delta\|M - M_1\|_F/\sigma^*\big)^2,$$
where the numerator in the middle term is the same as in the first. $\zeta_1$ follows from $\delta_1 \leq \frac{1}{2}$. $\zeta_2$ follows from using $\langle u^t, u^*\rangle \geq \langle u^0, u^*\rangle$.
$\zeta_3$ follows from $\langle u^*, u^0\rangle - 2\delta_1\sqrt{1 - \langle u^*, u^0\rangle^2} \geq \frac{1}{2}$, $\delta \leq \frac{1}{20}$ and $\delta_1 \leq \frac{1}{20}$. Hence
$$\mathrm{dist}(v^{t+1}, v^*) \leq \frac{1}{4}\mathrm{dist}(u^t, u^*) + 5\,\mathrm{dist}(u^t, u^*)\|M - M_1\|/\sigma^* + 5\delta\|M - M_1\|_F/\sigma^* \leq \frac{1}{2}\mathrm{dist}(u^t, u^*) + 5\delta\|M - M_1\|_F/\sigma^*.$$

Bounding $|v^{t+1}_j|$: From Lemma B.3 and (20) we get that $\big|\sum_i \delta_{ij} w_{ij}(u^t_i)^2 - 1\big| \leq \delta_1$ and $\big|\sum_i \delta_{ij} w_{ij} u^*_i u^t_i - \langle u^*, u^t\rangle\big| \leq \delta_1$, when $\beta \geq \frac{16c_1^2}{\delta_1^2}$. Hence,
$$1 - \delta_1 \leq B_{jj} = \sum_i \delta_{ij} w_{ij}(u^t_i)^2 \leq 1 + \delta_1, \qquad (23)$$
and
$$C_{jj} = \sum_i \delta_{ij} w_{ij} u^t_i u^*_i \leq \langle u^t, u^*\rangle + \delta_1. \qquad (24)$$
Recall that
$$\|\hat{u}^t\|\,\hat{v}^{t+1}_j = \frac{\sum_i \delta_{ij} w_{ij} u^t_i M_{ij}}{\sum_i \delta_{ij} w_{ij}(u^t_i)^2} \leq \frac{1}{1 - \delta_1}\sum_i \delta_{ij} w_{ij} u^t_i M_{ij}.$$
We will bound $\sum_i \delta_{ij} w_{ij} u^t_i M_{ij}$ using the Bernstein inequality. Let $X_i = (\delta_{ij} - \hat{q}_{ij})w_{ij} u^t_i M_{ij}$. Then $\sum_i \mathbb{E}[X_i] = 0$ and $\big|\sum_i u^t_i M_{ij}\big| \leq \|M_j\|$ by the Cauchy-Schwarz inequality. Further, $\sum_i \mathrm{Var}(X_i) = \sum_i \hat{q}_{ij}(1 - \hat{q}_{ij})(w_{ij})^2(u^t_i)^2 M_{ij}^2 \leq \sum_i w_{ij}(u^t_i)^2 M_{ij}^2 \leq \frac{4nc_1^2}{m}\|M_j\|^2$. Finally, $|X_i| \leq w_{ij} u^t_i M_{ij} \leq \frac{4nc_1}{m}|M_{ij}|/\sqrt{|M_{ij}|/\|M\|_F} = \frac{4nc_1}{m}\sqrt{|M_{ij}|\|M\|_F}$. Hence applying the Bernstein inequality with $t = \delta\sqrt{\|M_j\|^2 + |M_{ij}|\|M\|_F}$ gives
$$\sum_i \delta_{ij} w_{ij} u^t_i M_{ij} \leq (1 + \delta_1)\sqrt{\|M_j\|^2 + |M_{ij}|\|M\|_F}$$
with probability greater than $1 - \frac{2}{n^3}$ when $m \geq \frac{24c_1^2}{\delta_1^2} n\log(n)$. For $\delta_1 \leq \frac{1}{20}$ we get
$$\|\hat{u}^t\|\,\hat{v}^{t+1}_j \leq \frac{21}{19}\sqrt{\|M_j\|^2 + |M_{ij}|\|M\|_F}.$$
Now we will bound $\|\hat{v}^{t+1}\|$:
$$\|\hat{u}^t\|\,\|\hat{v}^{t+1}\| \geq \|\hat{u}^t\|\langle\hat{v}^{t+1}, v^*\rangle \stackrel{\zeta_1}{\geq} \sigma^*\langle u^t, u^*\rangle - \frac{\sigma^*\delta_1}{1 - \delta_1}\sqrt{1 - \langle u^*, u^t\rangle^2} - \frac{1}{1 - \delta_1}\|y^T v^*\| \qquad (25)$$
$$\stackrel{\zeta_2}{\geq} \sigma^*\langle\hat{u}^0, u^*\rangle - 2\sigma^*\delta_1\sqrt{1 - \langle u^*, \hat{u}^0\rangle^2} - 2\delta\|M - M_1\| \stackrel{\zeta_3}{\geq} \frac{2}{5}\sigma^*. \qquad (26)$$
$\zeta_1$ follows from Lemma B.4 and equations (16) and (23). $\zeta_2$ follows from using $\langle u^*, \hat{u}^0\rangle \leq \langle u^*, u^t\rangle$ and $\delta_1 \leq \frac{1}{20}$.
$\zeta_3$ follows from the argument: for $\delta_1 \leq \frac{1}{16}$, $\langle\hat{u}^0, u^*\rangle - 2\delta_1\sqrt{1 - \langle u^*, \hat{u}^0\rangle^2}$ is greater than $\frac{1}{2}$ if $\langle\hat{u}^0, u^*\rangle \geq \frac{3}{5}$; this holds because $\mathrm{dist}(u^*, \hat{u}^0) \leq \frac{4}{5}$ from Lemma 3.2. Hence we get
$$v^{t+1}_j = \frac{\hat{v}^{t+1}_j}{\|\hat{v}^{t+1}\|} \leq \frac{3\sqrt{\|M_j\|^2 + |M_{ij}|\|M\|_F}}{\sigma^*} \leq c_1\sqrt{\|M_j\|^2/\|M\|_F^2 + |M_{ij}|/\|M\|_F}, \quad \text{for } c_1 = 6.$$
Hence we have shown that $v^{t+1}$ satisfies the row norm bounds. From Lemma B.2 we have, in each iteration with probability greater than $1 - \frac{\gamma}{T\log(n)}$,
$$\|(u^t)^T R_\Omega(M - M_1)\| \leq \mathrm{dist}(u^t, u^*)\|M - M_1\| + \delta\|M - M_1\|_F.$$
Hence the probability of failure in $T$ iterations is less than $\gamma$. The lemma now follows from the assumption on $m$.

Now we have all the elements needed for the proof of Theorem 3.1.

Proof of Theorem 3.1 [Rank-1 case]:

Proof. Lemma 3.2 has shown that $\hat{u}^0$ satisfies the row norm bounds condition. From Lemma 3.3 we get $\mathrm{dist}(v^{t+1}, v^*) \leq \frac{1}{2}\mathrm{dist}(u^t, u^*) + 5\delta\|M - M_1\|_F/\sigma^*$. Hence
$$\mathrm{dist}(v^{t+1}, v^*) \leq \frac{1}{4^t}\mathrm{dist}(\hat{u}^0, u^*) + 10\delta\|M - M_1\|_F/\sigma^*.$$
After $t = O(\log(\frac{1}{\zeta}))$ iterations we get $\mathrm{dist}(v^{t+1}, v^*) \leq \zeta + 10\delta\|M - M_1\|_F/\sigma^*$ and $\mathrm{dist}(u^t, u^*) \leq \zeta + 10\delta\|M - M_1\|_F/\sigma^*$. Hence,
$$\|M_1 - \hat{u}^t(\hat{v}^{t+1})^T\| \leq \|(I - u^t(u^t)^T)M_1\| + \|u^t(u^t)^T M_1 - \hat{u}^t(\hat{v}^{t+1})^T\| \stackrel{\zeta_1}{\leq} \sigma^*_1\,\mathrm{dist}(u^t, u^*) + \|\sigma^* B^{-1}(\langle u^t, u^*\rangle B - C)v^*\| + \|B^{-1}y\| \stackrel{\zeta_2}{\leq} \sigma^*_1\,\mathrm{dist}(u^t, u^*) + 2\delta_1\|M\|\,\mathrm{dist}(u^t, u^*) + 2\,\mathrm{dist}(u^t, u^*)\|M - M_1\| + 2\delta\|M - M_1\|_F \leq c\sigma^*_1\zeta + \epsilon\|M - M_1\|_F.$$
$\zeta_1$ follows from equation (16) and $\zeta_2$ from $\|B^{-1}\| \leq \frac{1}{1 - \delta_1} \leq 2$, which follows from Lemma B.3. From Lemma B.2 we have, in each iteration with probability greater than $1 - \frac{\gamma}{T\log(n)}$, $\|(u^t)^T R_\Omega(M - M_1)\| \leq \mathrm{dist}(u^t, u^*)\|M - M_1\| + \delta\|M - M_1\|_F$. Hence the probability of failure in $T$ iterations is less than $\gamma$.
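To make the rank-1 update (15) concrete, here is a minimal NumPy sketch of one weighted-AltMin step for $v$. The function name is ours, not from the paper; the usage example assumes a fully observed matrix with unit weights, in which case the update recovers $\sigma^* v^*$ exactly when $u^t = u^*$.

```python
import numpy as np

def waltmin_rank1_v_update(M, mask, w, u):
    """One rank-1 weighted-AltMin update for v (a sketch of update (15)):
    mask plays the role of delta_ij, w the role of w_ij = 1/qhat_ij.
    Solves min_v ||sqrt(w) * mask * (M - u v^T)||_F^2 coordinate-wise in j."""
    num = (mask * w * M).T @ u          # sum_i delta_ij w_ij u_i M_ij
    den = (mask * w).T @ (u ** 2)       # sum_i delta_ij w_ij u_i^2
    return num / den                    # = ||u_hat|| * v_hat^{t+1}

rng = np.random.default_rng(1)
n = 40
u_star = rng.standard_normal(n); u_star /= np.linalg.norm(u_star)
v_star = rng.standard_normal(n); v_star /= np.linalg.norm(v_star)
M = 5.0 * np.outer(u_star, v_star)      # exactly rank-1, sigma* = 5
mask = np.ones((n, n)); w = np.ones((n, n))  # fully observed, unit weights
v_new = waltmin_rank1_v_update(M, mask, w, u_star)
print(np.allclose(v_new, 5.0 * v_star))  # True: recovers sigma* v* exactly
```

With partial observations the denominator concentrates around 1 (this is exactly the content of Lemma B.3 and the bound (23)), which is why the update stays close to the power-method step (16).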
B.3 Rank-$r$ proofs

Let the SVD of $M_r$ be $U^*\Sigma^*(V^*)^T$, where $U^*, V^*$ are $n\times r$ orthonormal matrices and $\Sigma^*$ is an $r\times r$ diagonal matrix with $\Sigma^*_{ii} = \sigma^*_i$. We have seen in Lemma 3.2 that the initialization and trimming steps give
$$\|(\hat{U}^{(0)})^i\| \leq 8\sqrt{r}\sqrt{\|M_i\|^2/\|M\|_F^2} \quad\text{and}\quad \mathrm{dist}(\hat{U}^{(0)}, U^*) \leq \frac{1}{2},$$
for $m \geq nr^3\kappa^2\log(n)$. In this section we present the rank-$r$ proof of Lemma 3.3; before that we present rank-$r$ versions of the supporting lemmas.

Now, as shown in [21], we analyze an algorithm equivalent to Algorithm 2 in which the iterates are orthogonalized at each step; this makes the analysis significantly simpler to present. Let $\hat{U}^{(t)} = U^{(t)}R^{(t)}$ and $\hat{V}^{(t+1)} = V^{(t+1)}R^{(t+1)}$ be the respective QR factorizations. Then we replace step 7 of Algorithm 2 with
$$\hat{V}^{(t+1)} = \arg\min_{V\in\mathbb{R}^{n\times r}}\big\|R_{\Omega_{2t+1}}\big(M - U^{(t)}V^T\big)\big\|_F^2.$$
We similarly change step 8. We also assume that the samples for each iteration are generated independently; for simplicity we drop the subscripts on $\Omega$ that denote the different set of samples in each iteration in the rest of the proof. The weighted alternating minimization updates at the $(t+1)$th iteration are
$$(\hat{V}^{(t+1)})^j = (B^j)^{-1}\big(C^j\Sigma^*(V^*)^j + (U^{(t)})^T R_\Omega(M - M_r)_j\big), \qquad (27)$$
where $B^j$ and $C^j$ are $r\times r$ matrices,
$$B^j = \sum_i \delta_{ij} w_{ij}(U^{(t)})^i\big((U^{(t)})^i\big)^T, \qquad C^j = \sum_i \delta_{ij} w_{ij}(U^{(t)})^i\big((U^*)^i\big)^T.$$
Writing this in terms of power method updates we get
$$(\hat{V}^{(t+1)})^j = \Big((U^{(t)})^T U^* - (B^j)^{-1}\big(B^j(U^{(t)})^T U^* - C^j\big)\Big)\Sigma^*(V^*)^j + (B^j)^{-1}(U^{(t)})^T R_\Omega(M - M_r)_j. \qquad (28)$$
Hence,
$$(\hat{V}^{(t+1)})^T = (U^{(t)})^T U^*\Sigma^*(V^*)^T - F + \sum_{j=1}^n (B^j)^{-1}(U^{(t)})^T R_\Omega(M - M_r)_j\, e_j^T,$$
where the $j$th column of $F$ is $F_j = (B^j)^{-1}\big(B^j(U^{(t)})^T U^* - C^j\big)\Sigma^*(V^*)^j$. First we bound $\|B^j\|$ using the matrix Bernstein inequality.
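The orthogonalized rank-$r$ step analyzed above (solve the weighted least squares per column $j$, then re-orthonormalize by QR) can be sketched as follows. This is an illustrative NumPy sketch under our own naming, not the paper's implementation; the usage example checks it on a fully observed, exactly rank-$r$ matrix with unit weights, where the step is exact.

```python
import numpy as np

def waltmin_rankr_v_update(M, mask, w, U):
    """One rank-r weighted-AltMin step (a sketch): for each column j solve
    B_j v_j = sum_i delta_ij w_ij M_ij (U^(t))_i, with
    B_j = sum_i delta_ij w_ij (U^(t))_i (U^(t))_i^T, then orthonormalize
    the stacked solution by a reduced QR (the equivalent analyzed variant)."""
    n2, r = M.shape[1], U.shape[1]
    V_hat = np.zeros((n2, r))
    for j in range(n2):
        s = mask[:, j] * w[:, j]            # delta_ij * w_ij for this column
        Bj = (U * s[:, None]).T @ U         # r x r weighted Gram matrix B_j
        rhs = U.T @ (s * M[:, j])           # sum_i s_i M_ij (U^(t))_i
        V_hat[j] = np.linalg.solve(Bj, rhs)
    V_new, _ = np.linalg.qr(V_hat)          # V^(t+1) with orthonormal columns
    return V_hat, V_new

rng = np.random.default_rng(2)
n, r = 30, 3
U_star, _ = np.linalg.qr(rng.standard_normal((n, r)))
V_star, _ = np.linalg.qr(rng.standard_normal((n, r)))
M = U_star @ np.diag([5.0, 3.0, 2.0]) @ V_star.T     # exactly rank-r
mask = np.ones((n, n)); w = np.ones((n, n))          # fully observed
V_hat, V_new = waltmin_rankr_v_update(M, mask, w, U_star)
print(np.allclose(U_star @ V_hat.T, M))              # True: exact recovery
```

Under partial observations, Lemma B.5 below is what guarantees that each $B^j$ stays close to the identity, so the per-column solves remain well conditioned.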
Lemma B.5. For $\Omega$ generated according to (2) the following holds:
$$\|B^j - I\| \leq \delta_2 \quad\text{and}\quad \|C^j - (U^{(t)})^T U^*\| \leq \delta_2, \qquad (29)$$
with probability greater than $1 - \frac{2}{n^2}$, for $m \geq \beta nr\kappa\log(n)$, $\beta \geq \frac{4\cdot 48\,c_1^2}{\delta_2^2}$ and $\delta_2 \leq 3r$.

Proof. Let the matrices $X_i = \delta_{ij} w_{ij}(U^{(t)})^i((U^{(t)})^i)^T$; then $B^j = \sum_{i=1}^n X_i$, $\mathbb{E}[X_i] = (U^{(t)})^i((U^{(t)})^i)^T$ and $\mathbb{E}[B^j] = (U^{(t)})^T U^{(t)} = I$. Also, since $U^{(t)}$ satisfies (33), it is easy to see that $\|X_i - \mathbb{E}[X_i]\| \leq \frac{16c_1^2 n}{m}$ and $\big\|\mathbb{E}\sum_i(X_i - \mathbb{E}[X_i])(X_i - \mathbb{E}[X_i])^T\big\| \leq \frac{16c_1^2 rn}{m}$. Applying the matrix Bernstein inequality gives the first result. Let the matrices $Y_i = \delta_{ij} w_{ij}(U^{(t)})^i((U^*)^i)^T$; then $C^j = \sum_{i=1}^n Y_i$, $\mathbb{E}[Y_i] = (U^{(t)})^i((U^*)^i)^T$ and $\mathbb{E}[C^j] = (U^{(t)})^T U^*$. Also, since $U^{(t)}$ satisfies (33), it is easy to see that $\|Y_i - \mathbb{E}[Y_i]\| \leq \frac{8c_1\kappa r^{0.5} n}{m}$ and $\big\|\mathbb{E}\sum_i(Y_i - \mathbb{E}[Y_i])(Y_i - \mathbb{E}[Y_i])^T\big\| \leq \frac{8c_1^2 rn}{m}$. Applying the matrix Bernstein inequality gives the second result.

Now we will bound the error caused by the $M - M_r$ component in each iteration.

Lemma B.6. For $\Omega$ generated according to (2) the following holds:
$$\big\|(U^{(t)})^T R_\Omega(M - M_r) - (U^{(t)})^T(M - M_r)\big\| \leq \delta\|M - M_r\|_F, \qquad (30)$$
with probability greater than $1 - \frac{1}{c_2\log(n)}$, for $m \geq \beta nr\log(n)$, $\beta \geq \frac{4c_1^2 c_2}{\delta^2}$. Hence
$$\|(U^{(t)})^T R_\Omega(M - M_r)\| \leq \mathrm{dist}(U^{(t)}, U^*)\|M - M_r\| + \delta\|M - M_r\|_F,$$
for constant $\delta$.

Proof of Lemma B.6. Let the random matrices $X_{ij} = (\delta_{ij} - \hat{q}_{ij})w_{ij}(M - M_r)_{ij}(U^{(t)})^i e_j^T$. Then $\sum_{ij} X_{ij} = (U^{(t)})^T R_\Omega(M - M_r) - (U^{(t)})^T(M - M_r)$, and $\mathbb{E}[X_{ij}] = 0$. We will use the matrix Chebyshev inequality for $p = 2$. Now we bound $\mathbb{E}\big\|\sum_{ij} X_{ij}\big\|_2^2$:
$$\mathbb{E}\Big\|\sum_{ij} X_{ij}\Big\|_2^2 = \mathbb{E}\sum_j\Big\|\sum_i(\delta_{ij} - \hat{q}_{ij})w_{ij}(M - M_r)_{ij}(U^{(t)})^i\Big\|^2 \stackrel{\zeta_1}{=} \sum_j\sum_i \hat{q}_{ij}(1 - \hat{q}_{ij})(w_{ij})^2(M - M_r)^2_{ij}\|(U^{(t)})^i\|^2 \leq \sum_{ij} w_{ij}\|(U^{(t)})^i\|^2(M - M_r)^2_{ij} \stackrel{\zeta_2}{\leq} \frac{c_1^2 n}{m}\|M - M_r\|_F^2.$$
$\zeta_1$ follows from the fact that the $X_{ij}$ are zero-mean independent random variables; $\zeta_2$ follows from (33). Hence applying the matrix Chebyshev inequality for $p = 2$ and $t = \delta\|M - M_r\|_F$ gives the result.

Since $\|B^j\|$ is bounded by the previous lemma, to get a bound on the norm of the error term $\|F\|$ in equation (28) we need to bound $\|\tilde{F}\|$, where $\tilde{F}_j = \big(B^j(U^{(t)})^T U^* - C^j\big)\Sigma^*(V^*)^j$.

Lemma B.7. For $\Omega$ generated according to (2) the following holds:
$$\|\tilde{F}\| \leq \delta_2\sigma^*_1\,\mathrm{dist}(U^{(t)}, U^*), \qquad (31)$$
with probability greater than $1 - \frac{2}{n^2}$, for $m \geq \beta n\log(n)$, $\beta \geq \frac{32r^3\kappa^2}{\delta_2^2}$, $c_1 \leq 8\kappa\sqrt{r}$ and $\delta_2 \leq \frac{1}{2}$.

Proof. Recall that the $j$th column of $F$ is $F_j = (B^j)^{-1}\big(B^j(U^{(t)})^T U^* - C^j\big)\Sigma^*(V^*)^j$, where $B^j = \sum_i \delta_{ij} w_{ij}(U^{(t)})^i((U^{(t)})^i)^T$ and $C^j = \sum_i \delta_{ij} w_{ij}(U^{(t)})^i((U^*)^i)^T$. We will bound the spectral norm of $F$ using the matrix Bernstein inequality. Let $u_i = (U^{(t)})^i$ and $A^j_i = u_i(u_i)^T(U^{(t)})^T U^* - u_i((U^*)^i)^T$. Let $X_{ij} = (B^j)^{-1}(\delta_{ij} w_{ij} A^j_i)\Sigma^*(V^*)^j e_j^T$; then $\sum_i X_{ij} = (B^j)^{-1}\big(B^j(U^{(t)})^T U^* - C^j\big)\Sigma^*(V^*)^j e_j^T$. Now we bound $\|X_{ij}\|$:
$$\|X_{ij}\| \leq \frac{\sigma^*_1}{1 - \delta_2}\|(V^*)^j\|\,w_{ij}\|A^j_i\| \leq \frac{\sigma^*_1}{1 - \delta_2}\|(V^*)^j\|\,w_{ij}\|(U^{(t)})^i\|\,\big\|(u_i)^T(U^{(t)})^T U^* - ((U^*)^i)^T\big\| \stackrel{\zeta_1}{\leq} \frac{\sigma^*_1}{1 - \delta_2}\frac{8c_1 n\sqrt{r}\kappa}{m}\,\mathrm{dist}(U^{(t)}, U^*).$$
$\zeta_1$ follows from (33), (12) and
$$\big\|(u_i)^T(U^{(t)})^T U^* - ((U^*)^i)^T\big\| = \big\|e_i^T\big(U^{(t)}(U^{(t)})^T U^* - U^*\big)\big\| \leq \big\|U^{(t)}(U^{(t)})^T U^* - U^*\big\| = \mathrm{dist}(U^{(t)}, U^*).$$
Similarly let us bound the variance $\mathbb{E}\big\|\sum_{ij} X_{ij} X_{ij}^T\big\|$.
$$\Big\|\mathbb{E}\sum_{ij} X_{ij} X_{ij}^T\Big\| = \Big\|\mathbb{E}\sum_{ij}\delta_{ij} w_{ij}^2(B^j)^{-1} A^j_i\Sigma^*(V^*)^j((V^*)^j)^T\Sigma^*(A^j_i)^T((B^j)^{-1})^T e_j e_j^T\Big\| \leq \sum_{ij}\frac{1}{(1 - \delta_2)^2} w_{ij}\big\|A^j_i\Sigma^*(V^*)^j((V^*)^j)^T\Sigma^*(A^j_i)^T\big\| \leq \sum_{ij}\frac{(\sigma^*_1)^2}{(1 - \delta_2)^2} w_{ij}\|(V^*)^j\|^2\|(U^{(t)})^i\|^2\big\|(u_i)^T(U^{(t)})^T U^* - ((U^*)^i)^T\big\|^2 \stackrel{\zeta_1}{\leq} \frac{(\sigma^*_1)^2}{(1 - \delta_2)^2}\frac{8nc_1^2}{m}\sum_{ij}\|(V^*)^j\|^2\big\|(u_i)^T(U^{(t)})^T U^* - ((U^*)^i)^T\big\|^2 \stackrel{\zeta_2}{\leq} \frac{(\sigma^*_1)^2}{(1 - \delta_2)^2}\frac{8nc_1^2}{m}\big\|U^{(t)}(U^{(t)})^T U^* - U^*\big\|_F^2\sum_j\|(V^*)^j\|^2 \leq \frac{(\sigma^*_1)^2}{(1 - \delta_2)^2}\frac{8nc_1^2}{m}\,r^2\,\mathrm{dist}(U^{(t)}, U^*)^2.$$
$\zeta_1$ follows from (33) and (12). $\zeta_2$ follows from $\sum_i\|(u_i)^T(U^{(t)})^T U^* - ((U^*)^i)^T\|^2 = \|U^{(t)}(U^{(t)})^T U^* - U^*\|_F^2$; the last step uses $\sum_j\|(V^*)^j\|^2 = r$ and $\|U^{(t)}(U^{(t)})^T U^* - U^*\|_F^2 \leq r\,\mathrm{dist}(U^{(t)}, U^*)^2$. Similarly $\mathbb{E}\big\|\sum_{ij} X_{ij}^T X_{ij}\big\|$ can be bounded. Now applying the matrix Bernstein inequality with $t = \delta_2\sigma^*_1\,\mathrm{dist}(U^{(t)}, U^*)$ gives the result.

Now, since $\hat{V}^{(t+1)} = V^{(t+1)}R^{(t+1)}$,
$$\sigma_{\min}(R^{(t+1)}) = \sigma_{\min}(\hat{V}^{(t+1)}) \stackrel{\zeta_1}{\geq} \sigma_{\min}\big((U^{(t)})^T U^*\Sigma^*(V^*)^T\big) - \|F\| - \Big\|\sum_{j=1}^n(B^j)^{-1}(U^{(t)})^T R_\Omega(M - M_r)_j e_j^T\Big\|. \qquad (32)$$
$\zeta_1$ follows from (28). Now $\sigma_{\min}\big((U^{(t)})^T U^*\Sigma^*(V^*)^T\big) \geq \sigma^*_r\sqrt{1 - \mathrm{dist}(U^{(t)}, U^*)^2}$, and $\|F\| \leq \frac{1}{1 - \delta_2}\|\tilde{F}\| \leq \frac{\delta_2}{1 - \delta_2}\sigma^*_1\,\mathrm{dist}(U^{(t)}, U^*)$. Further, from Lemma B.6,
$$\Big\|\sum_{j=1}^n(B^j)^{-1}(U^{(t)})^T R_\Omega(M - M_r)_j e_j^T\Big\| \leq \frac{1}{1 - \delta_2}\|(U^{(t)})^T R_\Omega(M - M_r)\|_F \leq \frac{1}{1 - \delta_2}\mathrm{dist}(U^{(t)}, U^*)\|M - M_r\| + \frac{\delta}{1 - \delta_2}\|M - M_r\|_F.$$
Hence
$$\sigma_{\min}(R^{(t+1)}) \geq \sigma^*_r\Big(\sqrt{1 - \mathrm{dist}(U^{(t)}, U^*)^2} - \kappa\frac{\delta_2}{1 - \delta_2}\mathrm{dist}(U^{(t)}, U^*) - \frac{2}{1 - \delta_2}\|M - M_r\|_F/\sigma^*_r\Big) \geq \frac{\sigma^*_r}{2},$$
for a large enough number of samples $m$. Now we are ready to present the proof of Lemma 3.3 for the rank-$r$ case.

Proof of Lemma 3.3:

Proof. The proof, as in the rank-1 case, has two steps. In the first step we show that $\mathrm{dist}(V^{(t+1)}, V^*)$ decreases in each iteration. In the second step we show the row norm bounds for $V^{(t+1)}$.
Recall that from the assumptions of the lemma we have the following row norm bound for $U^{(t)}$:
$$\|(U^{(t)})^i\| \leq c_1\sqrt{\|M_i\|^2/\|M\|_F^2 + |M_{ij}|/\|M\|_F}, \quad\text{for all } i. \qquad (33)$$

Bounding $\mathrm{dist}(V^{(t+1)}, V^*)$:
$$\mathrm{dist}(V^{(t+1)}, V^*) = \|(V^{(t+1)})^T V^*_\perp\| \stackrel{\zeta_1}{\leq} \big\|((R^{(t+1)})^{-1})^T F^T V^*_\perp\big\| + \frac{1}{1 - \delta_2}\big\|((R^{(t+1)})^{-1})^T(U^{(t)})^T R_\Omega(M - M_r)V^*_\perp\big\|_F \stackrel{\zeta_2}{\leq} \frac{1}{\sigma_{\min}(R^{(t+1)})}\Big(\|F\| + \frac{1}{1 - \delta_2}\mathrm{dist}(U^{(t)}, U^*)\|M - M_r\| + \frac{\delta}{1 - \delta_2}\|M - M_r\|_F\Big) \leq \frac{2}{\sigma^*_r}\Big(\frac{\delta_2}{1 - \delta_2}\|M\|\,\mathrm{dist}(U^{(t)}, U^*) + \frac{1}{1 - \delta_2}\mathrm{dist}(U^{(t)}, U^*)\|M - M_r\| + \frac{\delta}{1 - \delta_2}\|M - M_r\|_F\Big) \leq \frac{1}{2}\mathrm{dist}(U^{(t)}, U^*) + 5\delta\|M - M_r\|_F/\sigma^*_r,$$
for $\delta_2 \leq \frac{1}{16\kappa}$. $\zeta_1$ follows from (28); $\zeta_2$ follows from Lemma B.6.

Bounding $\|(V^{(t+1)})^j\|$: From Lemma B.5 and (33) we get $\sigma_{\min}(B^j) \geq 1 - \delta_2$ and $\sigma_{\max}(C^j) \leq 1 + \delta_2$. Recall that
$$(V^{(t+1)})^j = ((R^{(t+1)})^{-1})^T(B^j)^{-1}(U^{(t)})^T R_\Omega(M)_j. \qquad (34)$$
Hence $\|(V^{(t+1)})^j\| \leq \frac{1}{\sigma_{\min}(R^{(t+1)})}\frac{1}{1 - \delta_2}\|(U^{(t)})^T R_\Omega(M)_j\|$. We will bound $\|(U^{(t)})^T R_\Omega(M)_j\|$ using the matrix Bernstein inequality. Let $X_i = (\delta_{ij} - \hat{q}_{ij})w_{ij} M_{ij}(U^{(t)})^i$. Then $\mathbb{E}[X_i] = 0$ and $\sum_i X_i = (U^{(t)})^T R_\Omega(M)_j - (U^{(t)})^T M_j$. Now $\|X_i\| \leq \frac{2c_1 n}{m}\sqrt{|M_{ij}|\|M\|_F}$ and $\big\|\mathbb{E}\sum_i X_i X_i^T\big\| \leq \frac{8c_1^2 n}{m}\|M_j\|^2$. Hence applying the matrix Bernstein inequality with $t = \delta_2\sqrt{\|M_j\|^2 + |M_{ij}|\|M\|_F}$ implies
$$\|(U^{(t)})^T R_\Omega(M)_j\| \leq \|M_j\| + \delta_2\sqrt{\|M_j\|^2 + |M_{ij}|\|M\|_F} \leq (1 + \delta_2)\sqrt{\|M_j\|^2 + |M_{ij}|\|M\|_F}$$
with probability greater than $1 - \frac{2}{n^2}$ for $m \geq \frac{24c_1^2}{\delta_2^2} n\log(n)$. Hence
$$\|(V^{(t+1)})^j\| \leq 8\kappa\sqrt{r}\sqrt{\frac{\|M_j\|^2}{\|M\|_F^2} + \frac{|M_{ij}|}{\|M\|_F}}.$$
Hence we have shown that $(V^{(t+1)})^j$ satisfies the corresponding row norm bound. This completes the proof of the lemma.

Now we have all the elements needed for the proof of Theorem 3.1.

Proof of Theorem 3.1:

Proof.
From Lemma 3.3 we get $\mathrm{dist}(V^{(t+1)}, V^*) \leq \frac{1}{2}\mathrm{dist}(U^{(t)}, U^*) + 5\delta\|M - M_r\|_F/\sigma^*_r$. Hence
$$\mathrm{dist}(V^{(t+1)}, V^*) \leq \frac{1}{4^t}\mathrm{dist}(\hat{U}^{(0)}, U^*) + 10\delta\|M - M_r\|_F/\sigma^*_r.$$
After $t = O(\log(\frac{1}{\zeta}))$ iterations we get $\mathrm{dist}(V^{(t+1)}, V^*) \leq \zeta + 10\delta\|M - M_r\|_F/\sigma^*_r$ and $\mathrm{dist}(U^{(t)}, U^*) \leq \zeta + 10\delta\|M - M_r\|_F/\sigma^*_r$. Hence,
$$\|M_r - U^{(t)}(\hat{V}^{(t+1)})^T\| \leq \|(I - U^{(t)}(U^{(t)})^T)M_r\| + \|U^{(t)}(U^{(t)})^T M_r - U^{(t)}(\hat{V}^{(t+1)})^T\| \stackrel{\zeta_1}{\leq} \sigma^*_1\,\mathrm{dist}(U^{(t)}, U^*) + \|F\| + \Big\|\sum_{j=1}^n(B^j)^{-1}(U^{(t)})^T R_\Omega(M - M_r)_j e_j^T\Big\| \stackrel{\zeta_2}{\leq} \sigma^*_1\,\mathrm{dist}(U^{(t)}, U^*) + 2\delta_2\sigma^*_1\,\mathrm{dist}(U^{(t)}, U^*) + 2\,\mathrm{dist}(U^{(t)}, U^*)\|M - M_r\| + 2\delta\|M - M_r\|_F \leq c\sigma^*_1\zeta + \epsilon\|M - M_r\|_F.$$
$\zeta_1$ follows from equation (28) and $\zeta_2$ from $\|(B^j)^{-1}\| \leq \frac{1}{1 - \delta_2} \leq 2$, which follows from Lemma B.5. From Lemma B.6 we have, in each iteration with probability greater than $1 - \frac{\gamma}{T\log(n)}$, $\|(U^{(t)})^T R_\Omega(M - M_r)\| \leq \mathrm{dist}(U^{(t)}, U^*)\|M - M_r\| + \delta\|M - M_r\|_F$. Hence the probability of failure in $T$ iterations is less than $\gamma$.

C Proofs of Section 3.3

We will now discuss the proof of Theorem 3.4. The proof follows the same structure as the proof of Theorem 3.1, with a few key changes because of the absence of the $L_1$ term in the sampling and the special structure $M = AB$ of the matrix. Again, for simplicity we present proofs only for the case $n_1 = n_2 = n$. Recall that $\hat{q}_{ij} = \min(1, q_{ij})$, where
$$q_{ij} = m\cdot\Big(\frac{\|A_i\|^2}{n\|A\|_F^2} + \frac{\|B_j\|^2}{n\|B\|_F^2}\Big).$$
Also, let $w_{ij} = 1/\hat{q}_{ij}$. First we abstract out the properties of the sampling distribution (5) that we use in the rest of the proof. Also let
$$C_{AB} = \frac{(\|A\|_F^2 + \|B\|_F^2)^2}{\|AB\|_F^2}.$$

Lemma C.1. For $\Omega$ generated according to (5) and under the assumptions of Lemma C.2, the following holds for all $(i, j)$ such that $q_{ij} \leq 1$:
$$\frac{|M_{ij}|}{\hat{q}_{ij}} \leq \frac{n}{2m}\big(\|A\|_F^2 + \|B\|_F^2\big), \qquad (35)$$
$$\sum_{\{j:\,\hat{q}_{ij} = q_{ij}\}}\frac{M_{ij}^2}{\hat{q}_{ij}} \leq \frac{n}{m}\big(\|A\|_F^2 + \|B\|_F^2\big)^2, \qquad (36)$$
$$\frac{\|(U^*)^i\|^2}{\hat{q}_{ij}} \leq \frac{n}{m}\,\frac{(\|A\|_F^2 + \|B\|_F^2)^2}{\|A\cdot B\|_F^2}, \qquad (37)$$
and
$$\frac{\|(U^*)^i\|\,\|(V^*)^j\|}{\hat{q}_{ij}} \leq \frac{n}{m}\,\frac{(\|A\|_F^2 + \|B\|_F^2)^2}{\|A\cdot B\|_F^2}. \qquad (38)$$

The proof of Lemma C.1 is straightforward from the definition of $q_{ij}$. Now, similar to the proof of Theorem 3.1, we divide our analysis into two parts: initialization analysis and weighted alternating minimization analysis.

C.1 Initialization

Lemma C.2 (Initialization). Let the set of entries $\Omega$ be generated according to $\hat{q}_{ij}$ (5). Also, let $m \geq C\,C_{AB}\frac{n}{\delta^2}\log(n)$. Then the following holds (w.p. $\geq 1 - \frac{2}{n^{10}}$):
$$\|R_\Omega(AB) - AB\| \leq \delta\|AB\|_F. \qquad (39)$$
Also, if $\|AB - (AB)_r\|_F \leq \frac{1}{576\kappa r^{1.5}}\|(AB)_r\|_F$, then the following holds (w.p. $\geq 1 - \frac{2}{n^{10}}$):
$$\|(\hat{U}^{(0)})^i\| \leq 8\sqrt{r}\sqrt{\|A_i\|^2/\|A\|_F^2} \quad\text{and}\quad \mathrm{dist}(\hat{U}^{(0)}, U^*) \leq \frac{1}{2},$$
where $\hat{U}^{(0)}$ is the initial iterate obtained using Steps 4, 5 of Sub-Procedure 2, $\kappa = \sigma^*_1/\sigma^*_r$, $\sigma^*_i$ is the $i$-th singular value of $AB$, and $(AB)_r = U^*\Sigma^*(V^*)^T$.

Proof. First we show that $R_\Omega(AB)$ is a good approximation of $AB$. Let $M = AB$. We prove this part of the lemma using the matrix Bernstein inequality. Let $X_{ij} = (\delta_{ij} - \hat{q}_{ij})w_{ij} M_{ij} e_i e_j^T$. Note that $\{X_{ij}\}_{i,j=1}^n$ are independent zero-mean random matrices, and $R_\Omega(AB) - \mathbb{E}[R_\Omega(AB)] = \sum_{ij} X_{ij}$. First we bound $\|X_{ij}\|$. When $q_{ij} \geq 1$, we have $\hat{q}_{ij} = 1$ and $\delta_{ij} = 1$, so $X_{ij} = 0$ with probability 1. Hence we only need to consider the cases where $\hat{q}_{ij} = q_{ij} \leq 1$; we assume this in all the proofs without explicitly mentioning it any more. Now
$$\|X_{ij}\| = \max\big\{|(1 - \hat{q}_{ij})w_{ij} M_{ij}|,\ |\hat{q}_{ij} w_{ij} M_{ij}|\big\}.$$
Recall $w_{ij} = 1/\hat{q}_{ij}$. Hence
$$|(1 - \hat{q}_{ij})w_{ij} M_{ij}| = \Big(\frac{1}{\hat{q}_{ij}} - 1\Big)|M_{ij}| \leq \frac{|M_{ij}|}{\hat{q}_{ij}} \stackrel{\zeta_1}{\leq} \frac{n}{2m}\big(\|A\|_F^2 + \|B\|_F^2\big).$$
$\zeta_1$ follows from (35). Similarly,
$$|\hat{q}_{ij} w_{ij} M_{ij}| = |M_{ij}| \stackrel{\zeta_1}{\leq} \frac{|M_{ij}|}{\hat{q}_{ij}} \leq \frac{n}{2m}\big(\|A\|_F^2 + \|B\|_F^2\big),$$
where $\zeta_1$ follows from $\hat{q}_{ij} \leq 1$. Hence $\|X_{ij}\|$ is bounded by $L = \frac{n}{2m}(\|A\|_F^2 + \|B\|_F^2)$. Recall that this is the step in the proof of Lemma 3.2 that required the $L_1$ term in the sampling, which we do not need now because of the structure $AB$ of the matrix. Now we bound the variance:
$$\Big\|\mathbb{E}\sum_{ij} X_{ij} X_{ij}^T\Big\| = \Big\|\mathbb{E}\sum_{ij}(\delta_{ij} - \hat{q}_{ij})^2 w_{ij}^2 M_{ij}^2\, e_i e_i^T\Big\| = \Big\|\sum_{ij}\hat{q}_{ij}(1 - \hat{q}_{ij})w_{ij}^2 M_{ij}^2\, e_i e_i^T\Big\| = \max_i\sum_j \hat{q}_{ij}(1 - \hat{q}_{ij})w_{ij}^2 M_{ij}^2.$$
Now,
$$\sum_j \hat{q}_{ij}(1 - \hat{q}_{ij})w_{ij}^2 M_{ij}^2 = \sum_j\Big(\frac{1}{\hat{q}_{ij}} - 1\Big)M_{ij}^2 \leq \sum_j\frac{M_{ij}^2}{\hat{q}_{ij}} \stackrel{\zeta_1}{\leq} \frac{n}{m}\big(\|A\|_F^2 + \|B\|_F^2\big)^2.$$
$\zeta_1$ follows from (36). Hence
$$\Big\|\mathbb{E}\sum_{ij} X_{ij} X_{ij}^T\Big\| = \max_i\sum_j \hat{q}_{ij}(1 - \hat{q}_{ij})w_{ij}^2 M_{ij}^2 \leq \frac{n}{m}\big(\|A\|_F^2 + \|B\|_F^2\big)^2.$$
We can prove the same bound for $\big\|\mathbb{E}\sum_{ij} X_{ij}^T X_{ij}\big\|$. Hence $\sigma^2 = \frac{n}{m}(\|A\|_F^2 + \|B\|_F^2)^2$. Now using the matrix Bernstein inequality with $t = \delta\|AB\|_F$ gives, with probability $\geq 1 - \frac{2}{n^2}$,
$$\|R_\Omega(AB) - \mathbb{E}[R_\Omega(AB)]\| = \|R_\Omega(AB) - AB\| \leq \delta\|AB\|_F.$$
Once we have $\|R_\Omega(M) - M\| \leq \delta\|M\|_F$, the proof of the trimming step, which guarantees $\|(\hat{U}^{(0)})^i\| \leq 8\sqrt{r}\sqrt{\|A_i\|^2/\|A\|_F^2}$ and $\mathrm{dist}(\hat{U}^{(0)}, U^*) \leq \frac{1}{2}$, follows from the same argument as in Lemma 3.2.

C.2 Weighted AltMin Analysis

Lemma C.3 (WAltMin Descent). Let the hypotheses of Theorem 3.4 hold. Also, let $\|AB - (AB)_r\|_F \leq \frac{1}{576\kappa r\sqrt{r}}\|(AB)_r\|_F$. Let $\hat{U}^{(t)}$ be the $t$-th step iterate of Sub-Procedure 2 (called from WAltMin($P_\Omega(A\cdot B), \Omega, \hat{q}, T$)), and let $\hat{V}^{(t+1)}$ be the $(t+1)$-th iterate (for $V$). Also, let $\|(U^{(t)})^i\| \leq 8\sqrt{r}\kappa\sqrt{\|A_i\|^2/\|A\|_F^2}$ and $\mathrm{dist}(U^{(t)}, U^*) \leq \frac{1}{2}$, where $U^{(t)}$ is a set of orthonormal vectors spanning $\hat{U}^{(t)}$. Then the following holds (w.p. $\geq 1 - \gamma/T$):
$$\mathrm{dist}(V^{(t+1)}, V^*) \leq \frac{1}{2}\mathrm{dist}(U^{(t)}, U^*) + \epsilon\|AB - (AB)_r\|_F/\sigma^*_r,$$
and $\|(V^{(t+1)})^j\| \leq 8\sqrt{r}\kappa\sqrt{\|B_j\|^2/\|B\|_F^2}$, where $V^{(t+1)}$ is a set of orthonormal vectors spanning $\hat{V}^{(t+1)}$.

For the sake of simplicity we discuss the proof of the rank-1 ($r = 1$) case for this part of the algorithm. The rank-$r$ proof follows by combining the analysis below with the rank-$r$ analysis of Lemma 3.3 (see Section B.3). Before presenting the proof of this lemma, we state a couple of supporting lemmas; their proofs follow very closely those in Section B.2.

Lemma C.4. For $\Omega$ sampled according to (5) and under the assumptions of Lemma C.3, the following holds:
$$\Big|\sum_j \delta_{ij} w_{ij}(u^*_j)^2 - \sum_j (u^*_j)^2\Big| \leq \delta_1, \qquad (40)$$
with probability greater than $1 - \frac{2}{n^2}$, for $m \geq \beta C_{AB}\, n\log(n)$, $\beta \geq \frac{16}{\delta_1^2}$ and $\delta_1 \leq 3$.

We assume that the samples for each iteration are generated independently; for simplicity we drop the subscripts on $\Omega$ that denote the different set of samples in each iteration in the rest of the proof. The weighted alternating minimization updates at the $(t+1)$th iteration are
$$\|\hat{u}^t\|\,\hat{v}^{t+1}_j = \sigma^* v^*_j\,\frac{\sum_i \delta_{ij} w_{ij} u^t_i u^*_i}{\sum_i \delta_{ij} w_{ij}(u^t_i)^2} + \frac{\sum_i \delta_{ij} w_{ij} u^t_i(M - M_1)_{ij}}{\sum_i \delta_{ij} w_{ij}(u^t_i)^2}. \qquad (41)$$
Writing this in terms of power method updates we get
$$\|\hat{u}^t\|\,\hat{v}^{t+1} = \sigma^*\langle u^*, u^t\rangle v^* - \sigma^* P^{-1}\big(\langle u^t, u^*\rangle P - Q\big)v^* + P^{-1}y, \qquad (42)$$
where $P$ and $Q$ are diagonal matrices with $P_{jj} = \sum_i \delta_{ij} w_{ij}(u^t_i)^2$ and $Q_{jj} = \sum_i \delta_{ij} w_{ij} u^t_i u^*_i$, and $y$ is the vector $R_\Omega(M - M_1)^T u^t$ with entries $y_j = \sum_i \delta_{ij} w_{ij} u^t_i(M - M_1)_{ij}$. Now we will bound the error caused by the $M - M_r$ component in each iteration.

Lemma C.5.
For $\Omega$ generated according to (5) and under the assumptions of Lemma C.3, the following holds:
$$\big\|(U^{(t)})^T R_\Omega(M - M_r) - (U^{(t)})^T(M - M_r)\big\| \leq \delta\|M - M_r\|_F, \qquad (43)$$
with probability greater than $1 - \frac{1}{c_2\log(n)}$, for $m \geq \beta nr\log(n)$, $\beta \geq \frac{4c_1^2 c_2}{\delta^2}$. Hence
$$\|(U^{(t)})^T R_\Omega(M - M_r)\| \leq \mathrm{dist}(U^{(t)}, U^*)\|M - M_r\| + \delta\|M - M_r\|_F,$$
for constant $\delta$.

Lemma C.6. For $\Omega$ sampled according to (5) and under the assumptions of Lemma C.3, the following holds:
$$\big\|(\langle u^t, u^*\rangle P - Q)v^*\big\| \leq \delta_1\sqrt{1 - \langle u^*, u^t\rangle^2}, \qquad (44)$$
with probability greater than $1 - \frac{2}{n^2}$, for $m \geq \beta C_{AB}\, n\log(n)$, $\beta \geq \frac{48c_1^2}{\delta_1^2}$ and $\delta_1 \leq 3$.

Now we will provide the proof of Lemma C.3.

Proof of Lemma C.3 [Rank-1 case]:

Proof. Let $u^t$ and $v^{t+1}$ be the normalized vectors of the iterates $\hat{u}^t$ and $\hat{v}^{t+1}$. In the first step we prove that the distance between $u^t, u^*$ and $v^{t+1}, v^*$ decreases with each iteration. In the second step we prove that $v^{t+1}$ satisfies $|v^{t+1}_j| \leq c_1\sqrt{\|B_j\|^2/\|B\|_F^2}$. From the assumptions of the lemma we have
$$|u^t_i| \leq c_1\sqrt{\|A_i\|^2/\|A\|_F^2}. \qquad (45)$$

Bounding $\langle v^{t+1}, v^*\rangle$: Using Lemma C.4, Lemma C.6 and equation (42) we get
$$\|\hat{u}^t\|\langle\hat{v}^{t+1}, v^*\rangle \geq \sigma^*\langle u^t, u^*\rangle - \frac{\sigma^*\delta_1}{1 - \delta_1}\sqrt{1 - \langle u^*, u^t\rangle^2} - \frac{1}{1 - \delta_1}\|y^T v^*\| \qquad (46)$$
and
$$\|\hat{u}^t\|\langle\hat{v}^{t+1}, v^*_\perp\rangle \leq \frac{\sigma^*\delta_1}{1 - \delta_1}\sqrt{1 - \langle u^*, u^t\rangle^2} + \frac{1}{1 - \delta_1}\|y\|.$$
$\qquad (47)$

Hence by applying the noise bounds of Lemma C.5 we get
$$\mathrm{dist}(v^{t+1}, v^*)^2 = 1 - \langle v^{t+1}, v^*\rangle^2 = \frac{\langle\hat{v}^{t+1}, v^*_\perp\rangle^2}{\langle\hat{v}^{t+1}, v^*_\perp\rangle^2 + \langle\hat{v}^{t+1}, v^*\rangle^2} \leq \frac{\langle\hat{v}^{t+1}, v^*_\perp\rangle^2}{\langle\hat{v}^{t+1}, v^*\rangle^2} \stackrel{\zeta_1}{\leq} \frac{4\big(\delta_1\,\mathrm{dist}(u^t, u^*) + \mathrm{dist}(u^t, u^*)\|M - M_1\|/\sigma^* + \delta\|M - M_1\|_F/\sigma^*\big)^2}{\big(\langle u^t, u^*\rangle - 2\delta_1\sqrt{1 - \langle u^*, u^t\rangle^2} - 2\delta\|M - M_1\|/\sigma^*\big)^2} \stackrel{\zeta_2}{\leq} \frac{4\big(\cdots\big)^2}{\big(\langle u^*, u^0\rangle - 2\delta_1\sqrt{1 - \langle u^*, u^0\rangle^2} - 2\delta\|M - M_1\|/\sigma^*\big)^2} \stackrel{\zeta_3}{\leq} 25\big(\delta_1\,\mathrm{dist}(u^t, u^*) + \mathrm{dist}(u^t, u^*)\|M - M_1\|/\sigma^* + \delta\|M - M_1\|_F/\sigma^*\big)^2,$$
where the numerator in the middle term is the same as in the first. $\zeta_1$ follows from $\delta_1 \leq \frac{1}{2}$; $\zeta_2$ follows from using $\langle u^t, u^*\rangle \geq \langle u^0, u^*\rangle$; $\zeta_3$ follows from $\langle u^*, u^0\rangle - 2\delta_1\sqrt{1 - \langle u^*, u^0\rangle^2} \geq \frac{1}{2}$, $\delta \leq \frac{1}{20}$ and $\delta_1 \leq \frac{1}{20}$. Hence
$$\mathrm{dist}(v^{t+1}, v^*) \leq \frac{1}{4}\mathrm{dist}(u^t, u^*) + 5\,\mathrm{dist}(u^t, u^*)\|M - M_1\|/\sigma^* + 5\delta\|M - M_1\|_F/\sigma^* \leq \frac{1}{2}\mathrm{dist}(u^t, u^*) + 5\delta\|M - M_1\|_F/\sigma^*. \qquad (48)$$
Now, by selecting $m \geq \frac{C}{\gamma}\cdot\frac{(\|A\|_F^2 + \|B\|_F^2)^2}{\|AB\|_F^2}\cdot\frac{nr^3}{\epsilon^2}\kappa^2\log(n)\log^2\big(\frac{\|A\|_F + \|B\|_F}{\zeta}\big)$, the above bound reduces to (w.p. $\geq 1 - \gamma/\log\big(\frac{\|A\|_F + \|B\|_F}{\zeta}\big)$):
$$\mathrm{dist}(v^{t+1}, v^*) \leq \frac{1}{2}\mathrm{dist}(u^t, u^*) + \epsilon\|M - M_1\|_F. \qquad (49)$$
Hence, using induction, after $T = \log\big(\frac{\|A\|_F + \|B\|_F}{\zeta}\big)$ rounds we obtain (w.p. $\geq 1 - \gamma$): $\mathrm{dist}(v^{t+1}, v^*) \leq \epsilon\|M - M_1\|_F + \zeta$. However, the above induction step requires $v^{t+1}$ to satisfy the $L_\infty$ condition as well, which we prove below.

Bounding $|v^{t+1}_j|$: From Lemma C.4 and (45) we get that $\big|\sum_i \delta_{ij} w_{ij}(u^t_i)^2 - 1\big| \leq \delta_1$ and $\big|\sum_i \delta_{ij} w_{ij} u^*_i u^t_i - \langle u^*, u^t\rangle\big| \leq \delta_1$, when $\beta \geq \frac{16c_1^2}{\delta_1^2}$. Hence,
$$1 - \delta_1 \leq P_{jj} = \sum_i \delta_{ij} w_{ij}(u^t_i)^2 \leq 1 + \delta_1, \qquad (50)$$
and
$$Q_{jj} = \sum_i \delta_{ij} w_{ij} u^t_i u^*_i \leq \langle u^t, u^*\rangle + \delta_1.$$
$\qquad (51)$

Recall that
$$\|\hat{u}^t\|\,\hat{v}^{t+1}_j = \frac{\sum_i \delta_{ij} w_{ij} u^t_i M_{ij}}{\sum_i \delta_{ij} w_{ij}(u^t_i)^2} \leq \frac{1}{1 - \delta_1}\sum_i \delta_{ij} w_{ij} u^t_i M_{ij}.$$
We will bound $\sum_i \delta_{ij} w_{ij} u^t_i M_{ij}$ using the Bernstein inequality. Let $X_i = (\delta_{ij} - \hat{q}_{ij})w_{ij} u^t_i M_{ij}$. Then $\sum_i \mathbb{E}[X_i] = 0$ and $\big|\sum_i u^t_i M_{ij}\big| \leq \|M_j\|$ by the Cauchy-Schwarz inequality. Further, $\sum_i \mathrm{Var}(X_i) = \sum_i \hat{q}_{ij}(1 - \hat{q}_{ij})(w_{ij})^2(u^t_i)^2 M_{ij}^2 \leq \sum_i w_{ij}(u^t_i)^2 M_{ij}^2 \leq \frac{nc_1^2}{m}\|M_j\|^2 \leq \frac{nc_1^2}{m}\|B_j\|^2\|A\|_F^2$. Finally, $|X_i| \leq w_{ij} u^t_i M_{ij} \leq \frac{nc_1}{m}\|A\|_F\|B_j\|$. Hence applying the Bernstein inequality with $t = \delta\frac{\|B_j\|}{\|B\|_F}\|AB\|_F$ gives
$$\sum_i \delta_{ij} w_{ij} u^t_i M_{ij} \leq (1 + \delta_1)\frac{\|B_j\|}{\|B\|_F}\|AB\|_F$$
with probability greater than $1 - \frac{2}{n^3}$ when $m \geq \frac{24c_1^2}{\delta_1^2} C_{AB}\, n\log(n)$. For $\delta_1 \leq \frac{1}{20}$ we get
$$\|\hat{u}^t\|\,\hat{v}^{t+1}_j \leq \frac{21}{19}\,\frac{\|B_j\|}{\|B\|_F}\|AB\|_F.$$
Now we will bound $\|\hat{v}^{t+1}\|$:
$$\|\hat{u}^t\|\,\|\hat{v}^{t+1}\| \geq \|\hat{u}^t\|\langle\hat{v}^{t+1}, v^*\rangle \stackrel{\zeta_1}{\geq} \sigma^*\langle u^t, u^*\rangle - \frac{\sigma^*\delta_1}{1 - \delta_1}\sqrt{1 - \langle u^*, u^t\rangle^2} - \frac{1}{1 - \delta_1}\|y^T v^*\| \stackrel{\zeta_2}{\geq} \sigma^*\langle u^0, u^*\rangle - 2\sigma^*\delta_1\sqrt{1 - \langle u^*, u^0\rangle^2} - 2\delta\|M - M_1\| \stackrel{\zeta_3}{\geq} \frac{2}{5}\sigma^*.$$
$\zeta_1$ follows from Lemma C.6 and equations (42) and (50). $\zeta_2$ follows from using $\langle u^*, u^0\rangle \leq \langle u^*, u^t\rangle$ and $\delta_1 \leq \frac{1}{20}$. $\zeta_3$ follows from the initialization and from using the assumption on $m$ with a large enough $C > 0$. Hence we get
$$v^{t+1}_j = \frac{\hat{v}^{t+1}_j}{\|\hat{v}^{t+1}\|} \leq \frac{3\,\frac{\|B_j\|}{\|B\|_F}\|AB\|_F}{\sigma^*} \leq c_1\frac{\|B_j\|}{\|B\|_F}, \quad \text{for } c_1 = 6.$$
Hence we have shown that $v^{t+1}$ satisfies the row norm bounds. This completes the proof. The proof of Theorem 3.4 now follows from Lemma C.2 and Lemma C.3.
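The biased sampling distribution (5) for the product $M = A\cdot B$ can be sketched in a few lines of NumPy. This is an illustrative sketch under our own naming, not the paper's implementation: each entry $(i,j)$ is kept independently with probability $\hat{q}_{ij} = \min(1, q_{ij})$, where $q_{ij}$ depends only on the row norms of $A$ and column norms of $B$, and kept entries are reweighted by $w_{ij} = 1/\hat{q}_{ij}$ so that $R_\Omega(AB)$ is an unbiased estimate of $AB$.

```python
import numpy as np

def sample_product_entries(A, B, m, rng):
    """Sketch of the sampling distribution (5) for M = A @ B:
    q_ij = m * (||A_i||^2 / (n ||A||_F^2) + ||B_j||^2 / (n ||B||_F^2)),
    entry (i, j) kept independently w.p. qhat_ij = min(1, q_ij), then
    reweighted by 1/qhat_ij so E[R] = A @ B (unbiased sparse sketch)."""
    n = A.shape[0]
    p_row = np.linalg.norm(A, axis=1) ** 2 / np.linalg.norm(A, 'fro') ** 2
    p_col = np.linalg.norm(B, axis=0) ** 2 / np.linalg.norm(B, 'fro') ** 2
    q = np.minimum(1.0, m * (p_row[:, None] + p_col[None, :]) / n)
    delta = rng.random((n, B.shape[1])) < q       # delta_ij indicators
    # Only the sampled entries of A @ B are needed; we form the full
    # product here purely for brevity of the sketch.
    R = np.where(delta, (A @ B) / q, 0.0)         # R_Omega(AB), weighted
    return R, delta

rng = np.random.default_rng(3)
A = rng.standard_normal((60, 5)); B = rng.standard_normal((5, 60))
R, delta = sample_product_entries(A, B, m=60 * 40, rng=rng)
print(R.shape == (60, 60))                        # True
```

Lemma C.2 above is exactly the statement that, for $m \gtrsim C_{AB}\, n\log(n)/\delta^2$, this sparse reweighted sketch satisfies $\|R_\Omega(AB) - AB\| \leq \delta\|AB\|_F$ with high probability.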