Understanding Alternating Minimization for Matrix Completion

Moritz Hardt*

May 15, 2014

Abstract

Alternating minimization is a widely used and empirically successful heuristic for matrix completion and related low-rank optimization problems. Theoretical guarantees for alternating minimization have been hard to come by and are still poorly understood. This is in part because the heuristic is iterative and non-convex in nature. We give a new algorithm based on alternating minimization that provably recovers an unknown low-rank matrix from a random subsample of its entries under a standard incoherence assumption. Our results reduce the sample size requirements of the alternating minimization approach by at least a quartic factor in the rank and the condition number of the unknown matrix. These improvements apply even if the matrix is only close to low-rank in the Frobenius norm. Our algorithm runs in nearly linear time in the dimension of the matrix and, in a broad range of parameters, gives the strongest sample bounds among all subquadratic time algorithms that we are aware of.

Underlying our work is a new robust convergence analysis of the well-known Power Method for computing the dominant singular vectors of a matrix. This viewpoint leads to a conceptually simple understanding of alternating minimization. In addition, we contribute a new technique for controlling the coherence of intermediate solutions arising in iterative algorithms based on a smoothed analysis of the QR factorization. These techniques may be of interest beyond their application here.

*IBM Research Almaden. Email: mhardt@us.ibm.com

1 Introduction

Alternating minimization is an empirically successful heuristic for the matrix completion problem, in which the goal is to recover an unknown low-rank matrix from a subsample of its entries.
Matrix completion has received a tremendous amount of attention over the past few years due to its fundamental role as an optimization problem and its applicability in a number of areas including collaborative filtering and quantum tomography. Alternating minimization was used early on in the context of matrix completion [BK, HH] and continues to play an important role in practical approaches to the problem. The approach also formed an important component in the winning submission for the Netflix Prize [KBV].

Given a subset Ω of entries drawn from an unknown matrix A, alternating minimization starts from a poor approximation X₀Y₀ᵀ to the target matrix and gradually improves the approximation quality by fixing one of the factors and minimizing a certain objective over the other factor. Here, X₀ and Y₀ each have k columns, where k is the target rank of the factorization. The least squares objective is the typical choice. In this case, at step ℓ we solve the optimization problem

  X_ℓ = argmin_X Σ_{(i,j)∈Ω} [A_ij − (X Y_{ℓ−1}ᵀ)_ij]².

This optimization step is then repeated with X_ℓ fixed in order to determine Y_ℓ as

  Y_ℓ = argmin_Y Σ_{(i,j)∈Ω} [A_ij − (X_ℓ Yᵀ)_ij]².

Separating the factors X_ℓ and Y_ℓ is what makes the optimization step tractable. This basic update step is usually combined with an initialization procedure for finding X₀, Y₀, as well as methods for modifying intermediate solutions, e.g., truncating large entries. More than a specific algorithm, we think of alternating minimization as a framework for solving a non-convex low-rank optimization problem.

A major advantage of alternating minimization over alternatives is that each update is computationally cheap and has a small memory footprint, as we only need to keep track of 2k vectors. In contrast, the nuclear norm approach to matrix completion [CR, Rec, CT] requires solving a semidefinite program.
The advantage of the nuclear norm approach is that it comes with strong theoretical guarantees under certain assumptions on the unknown matrix and the subsample of its entries. There are two (by now standard) assumptions which together imply that nuclear norm minimization succeeds. The first is that the subsample Ω includes each entry of A uniformly at random with probability p. The second assumption is that the first k singular vectors of A span an incoherent subspace. Informally, coherence measures the correlation of the subspace with any standard basis vector. More formally, the coherence of a k-dimensional subspace of ℝⁿ is at most µ if the projection of each standard basis vector onto the subspace has norm at most √(µk/n). The space spanned by the top k singular vectors of various random matrix models typically satisfies this property with small µ, but real-world matrices also tend to exhibit incoherence when k is reasonably small.

Theoretical results on matrix completion primarily apply to the nuclear norm semidefinite program, which is prohibitive to execute on realistic instance sizes. There certainly has been progress on practical algorithms for solving related convex programs [JY, MHT, JS, AKKS, HO]. Unfortunately, these algorithms are not known to achieve the same type of recovery guarantees attained by exact nuclear norm minimization. This raises the important question of whether there are fast algorithms for matrix completion that come with guarantees on the required sample size comparable to those achieved by nuclear norm minimization. In this work we make progress on this problem by proving strong sample complexity bounds for alternating minimization. Along the way, our work helps to give a theoretical justification and understanding of why alternating minimization works.
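To make the informal notion of coherence concrete, here is a small numpy sketch (the dimensions and subspaces are illustrative) that computes µ for a subspace given an orthonormal basis, contrasting a random subspace with a maximally coherent, coordinate-aligned one:

```python
import numpy as np

def coherence(U):
    """mu(U) = max_i (n/k) * ||e_i^T U||^2 for an orthonormal basis U
    of a k-dimensional subspace of R^n."""
    n, k = U.shape
    return (n / k) * (np.linalg.norm(U, axis=1) ** 2).max()

rng = np.random.default_rng(0)
n, k = 1000, 5
# random subspace: every standard basis vector has a small projection
U_rand = np.linalg.qr(rng.standard_normal((n, k)))[0]
# coordinate-aligned subspace spanned by e_1, ..., e_k: mu = n/k
U_coord = np.eye(n)[:, :k]
print(coherence(U_rand), coherence(U_coord))
```

The coordinate-aligned subspace attains the largest possible value n/k, while a random subspace has coherence polylogarithmic in n; µ(U) always lies between 1 and n/k.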
1.1 Our results

We begin with our result on the exact matrix completion problem, where the goal is to recover an unknown rank-k matrix M from a subsample Ω of its entries in which each entry is included independently with probability p. Here and in the following we will always assume that M = UΛUᵀ is a symmetric n × n matrix with singular values σ_1 ≥ … ≥ σ_k. Our result generalizes straightforwardly to rectangular matrices, as we will see. Our algorithm outputs a pair of matrices (X, Y) where X is an orthonormal n × k matrix that approximates U in the strong sense that ‖(I − UUᵀ)X‖ ≤ ε. Moreover, the matrix XYᵀ is close to M in Frobenius norm. To state the theorem we formally define the coherence of U as

  µ(U) := max_{i∈[n]} (n/k)‖e_iᵀU‖²,

where e_i is the i-th standard basis vector.

Theorem 1.1. Given a sample of size Õ(pn²) drawn from an unknown n × n matrix M = UΛUᵀ of rank k by including each entry with probability p, our algorithm outputs with high probability a pair of matrices (X, Y) such that ‖(I − UUᵀ)X‖ ≤ ε and ‖M − XYᵀ‖_F ≤ ε‖M‖_F, provided that

  pn ≳ k(k + log(n/ε)) µ(U) (‖M‖_F/σ_k)².  (1)

Our result should be compared with two remarkable recent works by Jain, Netrapalli and Sanghavi [JNS] and Keshavan [Kes], who gave rigorous sample complexity bounds for alternating minimization. [JNS] obtained the bound pn ≳ k⁷(σ_1/σ_k)⁶µ(U)², and Keshavan obtained the incomparable bound pn ≳ k(σ_1/σ_k)⁸µ(U), which is superior when the matrix has small condition number σ_1/σ_k. Since ‖M‖_F ≤ √k·σ_1, our result improves upon [JNS] by at least a factor of k⁴(σ_1/σ_k)⁴µ(U), and improves on [Kes] as soon as σ_1/σ_k ≳ k^{1/3}. The improvement is larger when ‖M‖_F = O(σ_1), which we expect if the singular values decay rapidly. Theorem 1.1 is a special case of Theorem 6.1.
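The sampling model in Theorem 1.1 is easy to simulate. The following sketch (illustrative sizes, not from the paper) draws Ω by including each entry independently with probability p and forms the matrix that keeps the observed entries rescaled by 1/p, which is an entrywise unbiased estimate of M:

```python
import numpy as np

rng = np.random.default_rng(5)
n, k, p = 300, 2, 0.3

# a rank-2 incoherent symmetric test matrix M = U diag(5, 3) U^T
U = np.linalg.qr(rng.standard_normal((n, k)))[0]
M = U @ np.diag([5.0, 3.0]) @ U.T

# include each entry independently with probability p; unobserved
# entries become 0 and observed ones are rescaled by 1/p, so the
# expectation of the sampled matrix equals M
mask = rng.random((n, n)) < p
P_Omega_M = np.where(mask, M, 0.0) / p
spec_err = np.linalg.norm(P_Omega_M - M, 2)
```

For incoherent M and a sufficient sampling rate, the spectral error of this rescaled subsample stays well below the top singular values, which is the property the initialization discussion later in the introduction revolves around.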
We remark that the number of least squares update steps is bounded by O(log(n/ε) log n). The cost of performing these update steps is, up to a logarithmic factor, what dominates the worst-case running time of our algorithm. It can be seen that the least squares problem can be solved in time O(nk³ + |Ω|·k), which is linear in n + |Ω| and polynomial in k. The number of update steps enters the sample complexity since we assume (as in previous work) that fresh samples are used in each step. However, the logarithmic dependence on 1/ε guarantees exponentially fast convergence and allows us to obtain any inverse polynomial error with only a constant factor overhead in sample complexity.

Noisy matrix completion. In noisy matrix completion the unknown matrix is only close to low-rank, typically in Frobenius norm. Our results apply to any matrix of the form A = M + N, where M = UΛUᵀ is a matrix of rank k as before and N = (I − UUᵀ)A is the part of A not captured by the dominant singular vectors. Here, N can be an arbitrary deterministic matrix that satisfies the following constraints:

  max_{i∈[n]} ‖e_iᵀN‖² ≤ (µ_N/n)·σ_k²  and  max_{i,j∈[n]} |N_ij| ≤ (µ_N/n)·‖A‖_F.  (2)

Here, e_i denotes the i-th standard basis vector, so that ‖e_iᵀN‖ is the Euclidean norm of the i-th row of N. The conditions state that no row and no entry of N should be too large compared to σ_k and the Frobenius norm of A, respectively. We can think of the parameter µ_N as an analog of the coherence parameter µ(U) that we saw earlier. Since N could be close to full rank, µ(N) is not a meaningful parameter in general. If the rank of N is k, then our assumptions roughly reduce to what is implied by requiring µ(N) ≤ µ_N. From here on we let µ* = max{µ(U), µ_N, log n}. We have the following theorem.

Theorem 1.2.
Given a sample of size Õ(pn²) drawn from an unknown n × n matrix A = M + N, where M = UΛUᵀ has rank k and N = (I − UUᵀ)A satisfies (2), our algorithm outputs with high probability (X, Y) such that ‖(I − UUᵀ)X‖ ≤ ε and ‖M − XYᵀ‖_F ≤ ε‖A‖_F, provided that

  pn ≳ k(k + log(n/ε)) µ* ((‖M‖_F + ‖N‖_F/ε)/σ_k)² (1 − σ_{k+1}/σ_k)^{−5}.  (3)

The theorem is a strict generalization of the noise-free case, which we recover by setting N = 0, in which case the separation parameter γ_k := 1 − σ_{k+1}/σ_k is equal to 1. The result follows from Theorem 6.1, which gives a somewhat stronger sample complexity bound. Compared to our noise-free bound, there are two new parameters that enter the sample complexity. The first is the separation parameter γ_k. The second is the quantity ‖N‖_F/ε. To interpret this quantity, suppose that A has a good rank-k approximation in Frobenius norm, formally ‖N‖_F ≤ ε‖A‖_F for ε ≤ 1/2; then it must also be the case that ‖N‖_F/ε ≤ 2‖M‖_F. Our algorithm then finds a good rank-k approximation with at most Õ(k³(σ_1/σ_k)²µ*n) samples, assuming γ_k = Ω(1). Hence, assuming that A has a good rank-k approximation in Frobenius norm and that σ_k and σ_{k+1} are well separated, our bound recovers the noise-free bound up to a constant factor.

Note that if we are only interested in the second error bound, ‖M − XYᵀ‖_F ≤ ε‖M‖_F + ‖N‖_F, we can eliminate the dependence on the condition number in the sample complexity entirely. The reason is that any singular value smaller than εσ_1/k can be treated as part of the noise matrix. Assuming the condition number is at least k to begin with, we can always find two consecutive singular values whose separation is at least Ω(1/k). This ensures that the sample requirement is polynomial in k without any dependence on the condition number, and it gives us the following corollary.

Corollary 1.3.
Under the assumptions of Theorem 1.2, if σ_1 ≥ kσ_k/ε, then we can find X, Y such that ‖M − XYᵀ‖_F ≤ ε‖A‖_F, provided that pn ≳ poly(k)µ*.

The previous corollary is remarkable, because small error in Frobenius norm is the most common error measure in the literature on matrix completion. The result shows that in this error measure there is no dependence on the condition number. The result is tight for k = O(1) up to constant factors, even information-theoretically, as we will discuss below.

The approach of Jain et al. was adapted to the noisy setting by Gunasekar et al. [GAGG], showing roughly the same sample complexity in the noisy setting under some assumptions on the noise matrix. We achieve the same improvements over [GAGG] as we did compared to [JNS] in the noise-free case. Moreover, our assumptions in (2) are substantially weaker than the assumption of [GAGG]. The latter work required the largest entry of N in absolute value to be bounded by O(σ_k/(n√k)). This directly implies that each row of N has norm at most O(σ_k/√(kn)) and that ‖N‖_F ≤ O(σ_k/√k). Moreover, under this assumption we would have γ_k ≥ 1 − o_k(1). Keshavan's result [Kes] also applies to the noisy setting, but it requires ‖N‖ ≤ (σ_k/σ_1)³ and max_i ‖e_iᵀN‖ ≤ √(µ(U)k/n)·‖N‖. In particular, this bound does not allow ‖N‖_F to grow with ‖M‖_F. Since neither result allows arbitrarily small singular value separation, we cannot use these results to eliminate the dependence on the condition number as is possible using our technique.

Remark on required sample complexity and assumptions. It is known that information-theoretically Ω(kµ(U)n) measurements are necessary to recover the unknown matrix [CT], and this bound is achieved (up to log factors) by the nuclear norm semidefinite program.
Compared with the information-theoretic optimum, our bound suffers a factor O(k(‖M‖_F/σ_k)²) loss. While we do not know if this loss is necessary, there is a natural barrier. If we denote by P_Ω(A) the matrix in which all unobserved entries are 0 and the others are scaled by 1/p, then Ω(kµ(U)(‖M‖_F/σ_k)²n) samples are necessary to ensure that P_Ω(A) preserves the k-th singular value to within constant relative error; formally, ‖P_Ω(A) − A‖₂ ≤ 0.1σ_k. While this is not a necessary requirement for alternating least squares, it represents the current bottleneck for finding a good initial matrix.

It is also known that without an incoherence assumption the matrix completion problem can be ill-posed, and recovery becomes infeasible even information-theoretically [CT]. Moreover, even on incoherent matrices it was recently shown that already the exact matrix completion problem remains computationally hard to approximate in a strong sense [HMRW]. This shows that additional assumptions are needed beyond incoherence to make the problem tractable.

2 Proof overview and techniques

Robust convergence of subspace iteration. An important observation of [JNS] is that the update rule in alternating minimization can be analyzed as a noisy update step of the well-known power method for computing eigenvectors, also called subspace iteration when applied to multiple vectors simultaneously. The noise term that arises depends on the sampling error induced by the subsample of the entries. We further develop this point of view by giving a new robust convergence analysis of the power method. To illustrate the technique, consider a model of numerical linear algebra in which an input matrix A can only be accessed through noisy matrix-vector products of the form Ax + g, where x is a chosen vector and g is a possibly adversarial noise term. Our goal is to compute the dominant singular vectors u_1, …
, u_k of the matrix A. Subspace iteration starts with an initial guess, an orthonormal matrix X₀ ∈ ℝ^{n×k}, typically chosen at random. The algorithm then repeatedly computes Y_ℓ = AX_{ℓ−1} + G_ℓ, followed by an orthonormalization step in order to obtain X_ℓ from Y_ℓ. Here, G_ℓ is the noise variable added to the computation. Theorem 3.8 characterizes the convergence behavior of this general algorithm. An important component of our analysis is the choice of a suitable potential function that decreases at each step. Here we make use of the tangent of the largest principal angle between the subspace U spanned by the first k singular vectors of the input matrix and the k-dimensional space spanned by the columns of the iterate X_ℓ. Principal angles are a very useful tool in numerical analysis that we briefly recap in Section 3. Our analysis shows that the algorithm essentially converges at the rate of (σ_{k+1} + ∆)/(σ_k − ∆) for some ∆ ≪ σ_k, under suitable conditions on the noise matrix G_ℓ.

Alternating least squares. We recall the well-known least squares update:

  Y_ℓ = argmin_Y ‖P_Ω(A − X_{ℓ−1}Yᵀ)‖²_F.  (4)

Since we can focus on symmetric matrices without loss of generality, there is no need for an alternating update in which the left and right factors are flipped. We therefore drop the term "alternating". We can express the optimal Y_ℓ as Y_ℓ = AX_{ℓ−1} + G_ℓ using gradient information about the least squares objective. The error term G_ℓ has an intriguing property: its norm ‖G_ℓ‖ depends on the quantity ‖VᵀX_{ℓ−1}‖, which coincides with the sine of the largest principal angle between U and X_{ℓ−1}. This property ensures that as the algorithm begins to converge, the norm of the error term starts to diminish. Near exact recovery is now possible (assuming the matrix has rank at most k).
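The noisy subspace iteration just described is straightforward to simulate. The following numpy sketch uses a synthetic Gaussian perturbation (the G_ℓ in the paper arises from sampling error and can be adversarial; the dimensions, spectrum, and noise scale below are illustrative only):

```python
import numpy as np

def noisy_subspace_iteration(A, k, L, noise_scale, rng):
    """Subspace iteration in which each product A @ X is perturbed by a
    noise matrix G_l, followed by QR orthonormalization."""
    n = A.shape[0]
    X, _ = np.linalg.qr(rng.standard_normal((n, k)))  # random start
    for _ in range(L):
        G = noise_scale * rng.standard_normal((n, k))
        Y = A @ X + G                  # noisy product
        X, _ = np.linalg.qr(Y)         # orthonormalization step
    return X

rng = np.random.default_rng(1)
n, k = 200, 3
U0 = np.linalg.qr(rng.standard_normal((n, k)))[0]
A = U0 @ np.diag([10.0, 9.0, 8.0]) @ U0.T   # rank 3, sigma_k = 8
X = noisy_subspace_iteration(A, k, L=50, noise_scale=1e-3, rng=rng)
# sine of the largest principal angle between range(U0) and range(X)
sin_theta = np.linalg.norm((np.eye(n) - U0 @ U0.T) @ X, 2)
```

With σ_{k+1} = 0 in this toy instance, the iterate converges down to an error floor set by the noise level, in line with the (σ_{k+1} + ∆)/(σ_k − ∆) rate discussed above.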
A novelty in our approach is that we obtain strong bounds on ‖G_ℓ‖ by computing O(log n) independent copies of Y_ℓ (using fresh samples) and taking the componentwise median of the resulting matrices. The resulting procedure, called MedianLS, is analyzed in Section 4.

A difficulty with iterating the least squares update in general is that it is unclear how well it converges from a random initial matrix X₀. In our analysis we therefore use an initialization procedure that finds a matrix X₀ satisfying ‖VᵀX₀‖ ≤ 1/4. Our initialization procedure is based on (approximately) computing the first k singular vectors of P_Ω(A). To rule out large entries in the vectors, we truncate the resulting vectors. While this general approach is standard, our truncation procedure first applies a random rotation to the vectors, which leads to a tighter analysis than the naive approach.

Smooth orthonormalization. A key novelty in our approach is the way we argue about the coherence of each iterate X_ℓ. Ideally, we would like to argue that µ(X_ℓ) = O(µ*). A direct approach would be to argue that X_ℓ was obtained from Y_ℓ using the QR factorization, so that X_ℓ = Y_ℓR^{−1} for some invertible R. This gives the bound ‖e_iᵀX_ℓ‖ ≤ ‖e_iᵀY_ℓ‖·‖R^{−1}‖, which unfortunately is quite lossy and leads to a dependence on the condition number. We avoid this problem using an idea that is closely related to the smoothed analysis of the QR factorization. Sankar, Spielman and Teng [SST] showed that while the perturbation stability of QR can be quadratic, it is constant after adding a sufficiently large amount of Gaussian noise. In the context of smoothed analysis this is usually interpreted as saying that there are "few bad inputs" for the QR factorization.
In our context, the matrix Y_ℓ is already the outcome of a noisy operation Y_ℓ = AX_{ℓ−1} + G_ℓ, so there is no harm in actually adding a Gaussian noise matrix H_ℓ to Y_ℓ, provided that the norm of that matrix is no larger than that of G_ℓ. Roughly speaking, this will allow us to argue that there is no dependence on the condition number when applying the QR factorization to Y_ℓ.

There are some important complications. The magnitude of Y_ℓ may be too large to apply the smoothed analysis argument directly to Y_ℓ. Instead, we observe that the columns of X_ℓ are contained in the range S of the matrix [U | (NX_{ℓ−1} + G_ℓ + H_ℓ)]. Since S has dimension at most 2k, it suffices to argue that this space has small coherence. Moreover, we can choose H_ℓ to be roughly of the same order as NX_{ℓ−1} and G_ℓ, so that the smoothed analysis argument leads to an excellent bound on the smallest singular value of NX_{ℓ−1} + G_ℓ + H_ℓ. To prove that the coherence is small we need to exhibit a basis for S. This requires us to argue about the related matrix (I − UUᵀ)(NX_{ℓ−1} + G_ℓ + H_ℓ), since we need to orthonormalize the last k vectors against the first when constructing a basis. Another minor complication is that we do not know the magnitude of G_ℓ, so we need to find the right scaling of H_ℓ on the fly. We call the resulting procedure SmoothQR and analyze its guarantees in Section 5.

Putting things together. The final algorithm that we analyze is quite simple to describe, as shown in Figure 1. The algorithm makes use of an initialization procedure Initialize that we defer to Section 7. In Section 6 we prove our main theorem. The generalization of our result to rectangular matrices follows from a standard "dilation" argument that we describe in Section D.
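To convey the flavor of the "find the right scaling on the fly" idea, here is an illustrative numpy sketch: orthonormalize after adding Gaussian noise, doubling the noise scale until the resulting basis is incoherent. This hypothetical helper is a simplification, not the SmoothQR procedure of Section 5, which adds noise only to the appropriate component and uses carefully justified scales.

```python
import numpy as np

def smooth_qr_sketch(Y, mu_target, rng, max_doublings=60):
    """Orthonormalize Y after adding Gaussian noise, doubling the noise
    scale until the resulting basis has coherence at most mu_target.
    Illustrative sketch only."""
    n, k = Y.shape
    scale = 1e-9 * np.linalg.norm(Y)
    for _ in range(max_doublings):
        X, _ = np.linalg.qr(Y + scale * rng.standard_normal((n, k)))
        # coherence of the computed basis: max_i (n/k) ||e_i^T X||^2
        if (n / k) * (np.linalg.norm(X, axis=1) ** 2).max() <= mu_target:
            break
        scale *= 2
    return X

rng = np.random.default_rng(4)
Y = rng.standard_normal((500, 4))
X = smooth_qr_sketch(Y, mu_target=40.0, rng=rng)
```

For an already incoherent input the loop exits immediately with a negligible perturbation; the point of the doubling schedule is that the procedure finds a sufficient noise level without knowing the magnitude of the error term in advance.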
The description of the algorithm also uses a helper function called Split that is used to split the subsample into independent pieces of roughly equal size while preserving the distributional assumption that our theorems use. We discuss Split in Section C.

Input: Observed set of indices Ω ⊆ [n] × [n] of an unknown symmetric matrix A ∈ ℝ^{n×n} with entries P_Ω(A), number of iterations L ∈ ℕ, error parameter ε > 0, target dimension k, coherence parameter µ.

Algorithm SAltLS(P_Ω(A), Ω, L, k, ε, µ):
1. (Ω₀, Ω′) ← Split(Ω, 2), (Ω₁, …, Ω_L) ← Split(Ω′, L)
2. X₀ ← Initialize(P_{Ω₀}(A), Ω₀, k, µ)
3. For ℓ = 1 to L:
   (a) Y_ℓ ← MedianLS(P_{Ω_ℓ}(A), Ω_ℓ, X_{ℓ−1}, L, k)
   (b) X_ℓ ← SmoothQR(Y_ℓ, ε, µ)
Output: Pair of matrices (X_{L−1}, Y_L)

Figure 1: Smoothed alternating least squares (SAltLS)

2.1 Further discussion of related work

There is a vast literature on the topic that we cannot completely survey here. Most closely related is the work of Jain et al. [JNS], which suggested the idea of thinking of alternating least squares as a noisy update step in the power method. Our approach takes inspiration from this work by analyzing least squares using the noisy power method. However, our analysis is substantially different in both how convergence and low coherence are argued. The approach of Keshavan [Kes] uses a rather different argument.

As an alternative to the nuclear norm approach, Keshavan, Montanari and Oh [KMO1, KMO2] present two approaches: a spectral approach and an algorithm called OptSpace. The spectral approach roughly corresponds to our initialization procedure and gives similar guarantees. OptSpace requires a stronger incoherence assumption, has larger sample complexity in terms of the condition number, namely (σ_1/σ_k)⁶, and requires optimizing over the Grassmannian manifold.
However, the requirement on N achieved by OptSpace can be weaker than ours in the noisy setting. In the exact case, our algorithm has a much faster convergence rate (logarithmic dependence on 1/ε rather than polynomial).

There are a number of fast algorithms for matrix completion based on either (stochastic) gradient descent [RR] or (online) Frank-Wolfe [JS, HK]. These algorithms generally minimize squared loss on the observed entries subject to a nuclear norm constraint and in general do not produce a matrix that is close to the true unknown matrix on all entries. In contrast, our algorithm guarantees convergence in domain, that is, to the unknown matrix itself. Moreover, our dependence on the error is logarithmic, whereas in these algorithms it is polynomial.

Privacy-preserving spectral analysis. Our work is also closely related to a line of work on differentially private singular vector computation [HR1, HR2, Har]. These papers each consider algorithms based on the power method where noise is injected to achieve the privacy guarantee known as Differential Privacy [DMNS]. Hardt and Roth [HR1, HR2, Har] observed that incoherence could be used to obtain improved guarantees. This requires controlling the coherence of the iterates produced by the noisy power method, which leads to similar problems as the ones faced here. What is simpler in the privacy setting is that the noise term is typically Gaussian, leading to a cleaner analysis. Our work uses a similar convergence analysis for noisy subspace iteration as was used in a concurrent work by the author [HR2].

2.2 Preliminaries and notation

We denote by Aᵀ the transpose of a matrix (or vector) A. We use the notation x ≳ y to denote that the relation x ≥ Cy holds for a sufficiently large absolute constant C > 0 independent of x and y. We let R(A) denote the range of the matrix A.
The coherence of a subspace plays an important role in our analysis.

Definition 2.1 (Coherence). The µ-coherence of a k-dimensional subspace U of ℝⁿ is defined as

  µ(U) := max_{i∈[n]} (n/k)‖P_U e_i‖²,

where e_i denotes the i-th standard basis vector.

3 Robust local convergence of subspace iteration

Figure 2 presents our basic template algorithm. The algorithm is identical to the standard subspace iteration algorithm except that in each iteration ℓ, the computation is perturbed by a matrix G_ℓ. The matrix G_ℓ can be adversarially and adaptively chosen in each round. We will analyze under which conditions on the perturbation we can expect the algorithm to converge rapidly.

Input: Matrix A ∈ ℝ^{n×n}, number of iterations L ∈ ℕ, target dimension k
1. Let X₀ ∈ ℝ^{n×k} be an orthonormal matrix.
2. For ℓ = 1 to L:
   (a) Let G_ℓ ∈ ℝ^{n×k} be an arbitrary perturbation.
   (b) Y_ℓ ← AX_{ℓ−1} + G_ℓ
   (c) X_ℓ ← GS(Y_ℓ)
Output: Matrix X_L with k orthonormal columns

Figure 2: Noisy Subspace Iteration (NSI)

Principal angles are a useful tool in analyzing the convergence behavior of numerical eigenvalue methods. We will use the largest principal angle between two subspaces as a potential function in our convergence analysis.

Definition 3.1. Let X, Y ∈ ℝ^{n×k} be orthonormal bases for subspaces 𝒳, 𝒴, respectively. Then the sine of the largest principal angle between 𝒳 and 𝒴 is defined as sin θ(𝒳, 𝒴) := ‖(I − XXᵀ)Y‖.

We use some standard properties of the largest principal angle.

Proposition 3.2 ([ZK]). Let X, Y, 𝒳, 𝒴 be as in Definition 3.1, and let X⊥ be an orthonormal basis for the orthogonal complement of 𝒳. Then we have cos θ(𝒳, 𝒴) = σ_k(XᵀY),
and, assuming XᵀY is invertible, tan θ(𝒳, 𝒴) = ‖X⊥ᵀY(XᵀY)^{−1}‖.

From here on we will always assume that A has the spectral decomposition

  A = UΛ_U Uᵀ + VΛ_V Vᵀ,  (5)

where U ∈ ℝ^{n×k} and V ∈ ℝ^{n×(n−k)} correspond to the first k and last n − k eigenvectors, respectively. We let σ_1 ≥ … ≥ σ_n denote the singular values of A, which coincide with the absolute eigenvalues of A sorted in non-increasing order.

Our convergence analysis tracks the tangent of the largest principal angle between the subspaces R(U) and R(X_ℓ). The next lemma shows a natural condition under which the potential decreases multiplicatively in step ℓ. We think of this lemma as a local convergence guarantee, since it assumes that the cosine of the largest principal angle between R(U) and R(X_{ℓ−1}) is already lower bounded by a constant.

Lemma 3.3 (One Step Local Convergence). Let ℓ ∈ {1, …, L}. Assume that cos θ_k(U, X_{ℓ−1}) ≥ 1/2 ≥ ‖UᵀG_ℓ‖/σ_k. Then,

  tan θ(U, X_ℓ) ≤ tan θ(U, X_{ℓ−1}) · (σ_{k+1} + 2‖VᵀG_ℓ‖/tan θ(U, X_{ℓ−1})) / (σ_k − 2‖UᵀG_ℓ‖).  (6)

Proof. We first need to verify that X_ℓ has rank k. This follows if we can show that σ_k(Y_ℓ) > 0. Indeed,

  σ_k(Y_ℓ) ≥ σ_k(UᵀY_ℓ) = σ_k(Λ_U UᵀX_{ℓ−1} + UᵀG_ℓ) ≥ σ_k·σ_k(UᵀX_{ℓ−1}) − ‖UᵀG_ℓ‖.

The right-hand side is strictly greater than zero by our assumption, because σ_k(UᵀX_{ℓ−1}) = cos θ_k(U, X_{ℓ−1}). Further, we have X_ℓ = Y_ℓR for some invertible transformation R. Therefore, UᵀX_ℓ is invertible and we can invoke Proposition 3.2 to express tan θ(U, X_ℓ) as

  ‖VᵀX_ℓ(UᵀX_ℓ)^{−1}‖ = ‖VᵀY_ℓRR^{−1}(UᵀY_ℓ)^{−1}‖ = ‖VᵀY_ℓ(UᵀY_ℓ)^{−1}‖.
Using the fact that Y_ℓ = AX_{ℓ−1} + G_ℓ,

  VᵀY_ℓ(UᵀY_ℓ)^{−1} = VᵀY_ℓ(Λ_U UᵀX_{ℓ−1} + UᵀG_ℓ)^{−1}
    = VᵀY_ℓ[(Λ_U + UᵀG_ℓ(UᵀX_{ℓ−1})^{−1})(UᵀX_{ℓ−1})]^{−1}
    = VᵀY_ℓ(UᵀX_{ℓ−1})^{−1}(Λ_U + UᵀG_ℓ(UᵀX_{ℓ−1})^{−1})^{−1}.

Putting S = Λ_U + UᵀG_ℓ(UᵀX_{ℓ−1})^{−1}, we therefore get

  ‖VᵀY_ℓ(UᵀY_ℓ)^{−1}‖ = ‖VᵀY_ℓ(UᵀX_{ℓ−1})^{−1}S^{−1}‖ ≤ ‖VᵀY_ℓ(UᵀX_{ℓ−1})^{−1}‖·‖S^{−1}‖ = ‖VᵀY_ℓ(UᵀX_{ℓ−1})^{−1}‖/σ_k(S).

In the inequality we used the fact that for any two matrices P, Q we have ‖PQ‖ ≤ ‖P‖·‖Q‖. Let us bound the numerator of the right-hand side as follows:

  ‖VᵀY_ℓ(UᵀX_{ℓ−1})^{−1}‖ = ‖Λ_V VᵀX_{ℓ−1}(UᵀX_{ℓ−1})^{−1} + VᵀG_ℓ(UᵀX_{ℓ−1})^{−1}‖
    ≤ ‖Λ_V‖·‖VᵀX_{ℓ−1}(UᵀX_{ℓ−1})^{−1}‖ + ‖VᵀG_ℓ(UᵀX_{ℓ−1})^{−1}‖
    = σ_{k+1}·tan θ(U, X_{ℓ−1}) + ‖VᵀG_ℓ(UᵀX_{ℓ−1})^{−1}‖
    ≤ σ_{k+1}·tan θ(U, X_{ℓ−1}) + ‖VᵀG_ℓ‖·‖(UᵀX_{ℓ−1})^{−1}‖
    = σ_{k+1}·tan θ(U, X_{ℓ−1}) + ‖VᵀG_ℓ‖/cos θ_k(U, X_{ℓ−1})
    ≤ σ_{k+1}·tan θ(U, X_{ℓ−1}) + 2‖VᵀG_ℓ‖.

Here we used the fact that ‖(UᵀX_{ℓ−1})^{−1}‖ = 1/σ_k(UᵀX_{ℓ−1}) = 1/cos θ_k(U, X_{ℓ−1}) ≤ 2. We also need a lower bound on σ_k(S). Indeed,

  σ_k(S) ≥ σ_k(Λ_U) − ‖UᵀG_ℓ(UᵀX_{ℓ−1})^{−1}‖ ≥ σ_k − ‖UᵀG_ℓ‖·‖(UᵀX_{ℓ−1})^{−1}‖ ≥ σ_k − 2‖UᵀG_ℓ‖.

Note that the right-hand side is strictly positive due to the assumption of the lemma. Summarizing what we have,

  tan θ(U, X_ℓ) ≤ (σ_{k+1}·tan θ(U, X_{ℓ−1}) + 2‖VᵀG_ℓ‖)/(σ_k − 2‖UᵀG_ℓ‖).

This is equivalent to the statement of the lemma, as a simple rearrangement shows. ∎

The next lemma essentially follows by iterating the previous lemma.

Lemma 3.4 (Local Convergence). Let 0 ≤ ε ≤ 1/4. Let ∆ = max_{1≤ℓ≤L} ‖G_ℓ‖ and γ_k = 1 − σ_{k+1}/σ_k. Assume that ‖VᵀX₀‖ ≤ 1/4 and σ_k ≥ 8∆/(γ_k ε). Then,

  ‖VᵀX_L‖ ≤ max{ε, 2·‖VᵀX₀‖·exp(−γ_k L/2)}.

Proof.
Our first claim shows that once the potential function drops below $\varepsilon$ at step $\ell - 1$, it cannot increase beyond $\varepsilon$ again.

Claim 3.5. Let $\ell \ge 1$. Suppose that $\tan\theta(U, X_{\ell-1}) \le \varepsilon$. Then $\tan\theta(U, X_\ell) \le \varepsilon$.

Proof. By our assumption, $\cos\theta_k(U, X_{\ell-1}) \ge \sqrt{1 - \varepsilon^2} \ge 15/16$. Together with the lower bound on $\sigma_k$, the assumptions of Lemma 3.3 are met. Hence, using our assumptions,
\[
\tan\theta(U, X_\ell) \le \frac{(1-\gamma_k)\sigma_k\varepsilon + 2\Delta}{\sigma_k - 2\Delta} \le \varepsilon.
\]

Our second claim shows that if the potential is at least $\varepsilon$ at step $\ell - 1$, it decreases by a factor $1 - \gamma_k/2$.

Claim 3.6. Let $\ell \ge 1$. Suppose that $\tan\theta(U, X_{\ell-1}) \in [\varepsilon, 1/2]$. Then $\tan\theta(U, X_\ell) \le (1 - \gamma_k/2)\tan\theta(U, X_{\ell-1})$.

Proof. Using the assumption of the claim, we have $\cos\theta_k(U, X_{\ell-1}) \ge 1 - \tan\theta(U, X_{\ell-1}) \ge 1/2 \ge \Delta/\sigma_k$. We can therefore apply Lemma 3.3 to conclude
\[
\tan\theta(U, X_\ell) \le \frac{(1-\gamma_k)\sigma_k \tan\theta(U, X_{\ell-1}) + 2\Delta}{\sigma_k - 2\Delta}
\le \tan\theta(U, X_{\ell-1}) \cdot \frac{1 - 3\gamma_k/4}{1 - \gamma_k/4}
\le \tan\theta(U, X_{\ell-1})\,(1 - \gamma_k/2).
\]
In the second inequality we used that $2\Delta \le \gamma_k\sigma_k\varepsilon/4 \le \gamma_k\sigma_k\tan\theta(U, X_{\ell-1})/4$.

The two previous claims together imply that
\[
\tan\theta(U, X_L) \le \max\big\{\tan\theta(U, X_0)(1 - \gamma_k/2)^L,\; \varepsilon\big\},
\]
provided that $\tan\theta(U, X_0) \le 1/2$. This is the case since we assumed that $\sin\theta(U, X_0) = \|V^\top X_0\| \le 1/4$. Note that $(1 - \gamma_k/2)^L \le \exp(-\gamma_k L/2)$. It remains to observe that $\|V^\top X_L\| \le \tan\theta(U, X_L)$ and, further, $\tan\theta(U, X_0) \le 2\|V^\top X_0\|$ by our assumption on $X_0$.

In our application later on, the error terms $\|G_\ell\|$ decrease as $\ell$ increases and the algorithm starts to converge. We need a convergence bound for this type of shrinking error. The next definition expresses a condition on $G_\ell$ that allows for a useful convergence bound.

Definition 3.7 (Admissible). Let $\gamma_k = 1 - \sigma_{k+1}/\sigma_k$. We say that the pair of matrices $(X_{\ell-1}, G_\ell)$ is $\varepsilon$-admissible for NSI if
\[
\|G_\ell\| \le \tfrac{1}{32}\gamma_k\sigma_k\|V^\top X_{\ell-1}\| + \tfrac{\varepsilon}{32}\gamma_k\sigma_k. \tag{7}
\]
We say that a family of matrices $\{(X_{\ell-1}, G_\ell)\}_{\ell=1}^L$ is $\varepsilon$-admissible for NSI if each member of the family is $\varepsilon$-admissible. We will use the notation $\{G_\ell\}$ as a shorthand for $\{(X_{\ell-1}, G_\ell)\}_{\ell=1}^L$.

We have the following convergence guarantee for admissible noise matrices.

Theorem 3.8. Let $\gamma_k = 1 - \sigma_{k+1}/\sigma_k$ and let $\varepsilon \le 1/2$. Assume that the family of noise matrices $\{G_\ell\}$ is $(\varepsilon/2)$-admissible for NSI and that $\|V^\top X_0\| \le 1/4$. Then $\|V^\top X_L\| \le \varepsilon$ for any $L \ge 4\gamma_k^{-1}\log(1/\varepsilon)$.

Proof. We prove by induction that for every $t \ge 0$, after $L_t = 4t\gamma_k^{-1}$ steps we have
\[
\|V^\top X_{L_t}\| \le \max\big\{2^{-(t+1)},\, \varepsilon\big\}.
\]
The base case ($t = 0$) follows directly from the assumption that $\|V^\top X_0\| \le 1/4$. We turn to the inductive step. By the induction hypothesis, we have $\|V^\top X_{L_t}\| \le \max\{2^{-(t+1)}, \varepsilon\}$. We apply Lemma 3.4 with "$X_0 = X_{L_t}$", error parameter $\max\{2^{-(t+2)}, \varepsilon\}$, and $L = L_{t+1} - L_t$. The conditions of the lemma are satisfied, as is easily checked using the assumption that $\{G_\ell\}$ is $(\varepsilon/2)$-admissible. Using the fact that $L_{t+1} - L_t = 4/\gamma_k$, the conclusion of the lemma gives
\[
\|V^\top X_{L_{t+1}}\| \le \max\Big\{\varepsilon,\; 2\max\big\{\varepsilon, 2^{-(t+1)}\big\}\exp\Big(-\frac{\gamma_k(L_{t+1} - L_t)}{2}\Big)\Big\} \le \max\big\{\varepsilon,\, 2^{-(t+2)}\big\}.
\]

4 Least squares update rule

Input: Target dimension $k$, observed set of indices $\Omega \subseteq [n] \times [n]$ of an unknown symmetric matrix $A \in \mathbb{R}^{n \times n}$ with entries $P_\Omega(A)$, orthonormal matrix $X \in \mathbb{R}^{n \times k}$.
Algorithm LS($P_\Omega(A), \Omega, X, L, k$):
  $Y \leftarrow \arg\min_{Y \in \mathbb{R}^{n \times k}} \|P_\Omega(A - XY^\top)\|_F^2$
Output: Pair of matrices $(X, Y)$

Figure 3: Least squares update

Figure 3 describes the least squares update step specialized to the case of a symmetric matrix. Our goal is to express this update step as an update step of the form $Y = AX + G$ so that we may apply our analysis of noisy subspace iteration.
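As a concrete illustration, the objective in Figure 3 decouples over the rows of $Y$: row $i$ only involves the observed entries in row $i$ of $A$. A minimal numpy sketch of this row-wise structure follows; it is an illustrative sketch rather than the paper's implementation, and the helper name `ls_update` is ours (note that the $p^{-1}$ scaling inside $P_i$ cancels in the minimizer, so it is omitted):

```python
import numpy as np

def ls_update(A_obs, mask, X):
    """One alternating least squares step for a symmetric matrix:
    Y = argmin || P_Omega(A - X Y^T) ||_F^2.
    Row i of Y is an ordinary least squares fit of the observed entries
    A[i, Omega_i] against the corresponding rows of X."""
    n, k = X.shape
    Y = np.zeros((n, k))
    for i in range(n):
        obs = np.flatnonzero(mask[i])  # Omega_i = {j : (i, j) in Omega}
        # solve min_y || X[obs] @ y - A[i, obs] ||_2
        Y[i] = np.linalg.lstsq(X[obs], A_obs[i, obs], rcond=None)[0]
    return Y
```

With all entries observed and $X$ orthonormal, the solution collapses to $Y = AX$, which is exactly the exact subspace iteration step that the analysis compares against.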
This syntactic transformation is carried out in Section 4.1, followed by a bound on the norm of the error term $G$ in Section 4.2.

4.1 From alternating least squares to noisy subspace iteration

The optimizer $Y$ satisfies a set of linear equations that we derive from the gradient of the objective function.

Lemma 4.1 (Optimality Condition). Let $P_i \colon \mathbb{R}^n \to \mathbb{R}^n$ be the linear projection onto the coordinates in $\Omega_i = \{j : (i, j) \in \Omega\}$ scaled by $p^{-1} = n^2/\mathbb{E}|\Omega|$, i.e., $P_i = p^{-1}\sum_{j \in \Omega_i} e_j e_j^\top$. Further, define the matrix $B_i \in \mathbb{R}^{k \times k}$ as $B_i = X^\top P_i X$ and assume that $B_i$ is invertible. Then, for every $i \in [n]$, the $i$-th row of $Y$ satisfies
\[
e_i^\top Y = e_i^\top A P_i X B_i^{-1}.
\]

Proof. Call the objective function $f(Y) = \|P_\Omega(A - XY^\top)\|_F^2$. We note that for every $i \in [n]$ and $j \in [k]$, we have
\[
\frac{\partial f}{\partial Y_{ij}} = -2\sum_{s \in \Omega_i} A_{is}X_{sj} + 2\sum_{r=1}^{k} Y_{ir}\sum_{s \in \Omega_i} X_{sj}X_{sr}.
\]
From this we conclude that the optimal $Y$ must satisfy $e_i^\top A P_i X = e_i^\top Y X^\top P_i X = e_i^\top Y B_i$. Hence, $e_i^\top Y = e_i^\top A P_i X B_i^{-1}$.

The assumption that $B_i$ is invertible is essentially without loss of generality. Indeed, we will later see that $B_i$ is invertible (and in fact close to the identity matrix) with very high probability.

We can now express the least squares update as $Y = AX + G$, where we derive a useful expression for $G$.

Lemma 4.2. Let $E = (I - XX^\top)U$. We have $Y = AX + G$, where $G = G_M + G_N$ and, for each row $i \in [n]$ for which $B_i$ is invertible, the matrices $G_M$ and $G_N$ satisfy
\[
e_i^\top G_M = e_i^\top U\Lambda_U E^\top P_i X B_i^{-1}, \qquad
e_i^\top G_N = e_i^\top\big(N P_i X B_i^{-1} - NX\big).
\]

Proof. By Lemma 4.1,
\[
e_i^\top Y = e_i^\top A P_i X B_i^{-1} = e_i^\top M P_i X B_i^{-1} + e_i^\top N P_i X B_i^{-1}.
\]
Let $C_i = U^\top P_i X$ and put $D = U^\top X$.
On the one hand,
\[
e_i^\top M P_i X B_i^{-1} = e_i^\top U\Lambda_U C_i B_i^{-1}
= e_i^\top\big(U\Lambda_U D - U\Lambda_U(DB_i - C_i)B_i^{-1}\big)
= e_i^\top M X - e_i^\top U\Lambda_U(DB_i - C_i)B_i^{-1}.
\]
On the other hand,
\[
C_i = U^\top P_i X = (XX^\top U + E)^\top P_i X = (U^\top X)X^\top P_i X + E^\top P_i X = DB_i + E^\top P_i X.
\]
Hence, as desired,
\[
e_i^\top M P_i X B_i^{-1} = e_i^\top M X + e_i^\top U\Lambda_U E^\top P_i X B_i^{-1}.
\]
Finally, it follows directly from the definition that $e_i^\top N P_i X B_i^{-1} = e_i^\top NX + e_i^\top G_N$. Putting the previous two equations together, we conclude that $Y = MX + G_M + NX + G_N = AX + G_M + G_N$.

4.2 Deviation bounds for the least squares update

In this section we analyze the norm of the error term $G$ from the previous section. More specifically, we prove a bound on the norm of each row of $G$. Our bound uses the fact that the matrix $E$ appearing in the expression for the error term satisfies $\|E\| = \|V^\top X\|$. This gives us a bound in terms of the quantity $\|V^\top X\|$.

Lemma 4.3. Let $\delta \in (0, 1)$. Assume that each entry is included in $\Omega$ independently with probability
\[
p \gtrsim \frac{k\,\mu(X)\log n}{\delta^2 n}. \tag{8}
\]
Then, for every $i \in [n]$,
\[
\Pr\Big\{\|e_i^\top G\| > \delta\big(\|e_i^\top M\| \cdot \|V^\top X\| + \|e_i^\top N\|\big)\Big\} \le \frac15.
\]

Proof. Fix $i \in [n]$. Lemma A.5 shows that with probability $9/10$ we have $\|e_i^\top G_M\| \le \delta\|e_i^\top M\| \cdot \|V^\top X\|$. Similarly, Lemma A.6 shows that with probability $9/10$ we have $\|e_i^\top G_N\| \le \delta\|e_i^\top N\|$. Both events occur simultaneously with probability $4/5$, and in this case we have
\[
\|e_i^\top G\| \le \|e_i^\top G_M\| + \|e_i^\top G_N\| \le \delta\big(\|e_i^\top M\| \cdot \|V^\top X\| + \|e_i^\top N\|\big).
\]

4.3 Median least squares update

Given the previous error bound, we can achieve a strong concentration bound by taking the component-wise median of multiple independent samples of the error term.

Lemma 4.4. Let $G_1, \dots, G_t$ be i.i.d. copies of $G$. Let $G = \mathrm{median}(G_1, \dots, G_t)$ be the component-wise median of $G_1, \dots, G_t$, and assume $p$ satisfies (8).
Then, for every $i \in [n]$,
\[
\Pr\Big\{\|e_i^\top G\| > \delta\big(\|e_i^\top M\| \cdot \|V^\top X\| + \|e_i^\top N\|\big)\Big\} \le \exp(-\Omega(t)).
\]

Proof. Fix $i \in [n]$ and let $g_1, \dots, g_t \in \mathbb{R}^k$ denote the $i$-th rows of $G_1, \dots, G_t$. Let $S = \{j \in [t] : \|g_j\| \le B\}$, where $B = (\delta/4)\big(\|e_i^\top M\| \cdot \|V^\top X\| + \|e_i^\top N\|\big)$. Applying Lemma 4.3 with error parameter $\delta/4$, it follows that $\mathbb{E}|S| \ge 4t/5$. Moreover, the draws of the $g_j$ are independent, so we can apply a Chernoff bound to argue that $|S| \ge 2t/3$ with probability $1 - \exp(-\Omega(t))$. Assuming that this event occurs, we claim that $g = \mathrm{median}(g_1, \dots, g_t)$ satisfies $\|g\| \le 4B$; this claim establishes the lemma.

To prove the claim, fix any coordinate $r \in [k]$. By the median property, $|\{j : (g_j)_r^2 \ge g_r^2\}| \ge t/2$. Since $|S| \ge 2t/3$, at least $t/3$ of the indices $j \in S$ satisfy $(g_j)_r^2 \ge g_r^2$. In particular, the average value of $(g_j)_r^2$ over all $j \in S$ must be at least $t g_r^2/3|S| \ge g_r^2/3$. Summing over the coordinates, the average of $\|g_j\|^2$ over all $j \in S$ must be at least $\|g\|^2/3$. On the other hand, we also know that the average squared norm in $S$ is at most $B^2$ by the definition of the set $S$. It follows that $\|g\|^2 \le 3B^2$, which implies what we needed to show.

We can now conclude a strong concentration bound for the median of multiple independent solutions to the least squares minimization step. This way we obtain the desired error bound for all rows simultaneously, which leads to the following extension of the least squares update rule.

Input: Target dimension $k$, observed set of indices $\Omega \subseteq [n] \times [n]$ of an unknown symmetric matrix $A \in \mathbb{R}^{n \times n}$ with entries $P_\Omega(A)$, orthonormal matrix $X \in \mathbb{R}^{n \times k}$.
Algorithm MedianLS($P_\Omega(A), \Omega, X, L, k$):
1. $(\Omega_1, \dots, \Omega_t) \leftarrow \mathrm{Split}(\Omega, t)$ for $t = O(\log n)$.
2. $Y_i \leftarrow \mathrm{LS}(P_{\Omega_i}(A), \Omega_i, X, L, k)$
Output: Pair of matrices $(X, \mathrm{median}(Y_1, \dots, Y_t))$

Figure 4: Median least squares update

Lemma 4.5. Let $\Omega$ be a sample in which each entry is included independently with probability $p \gtrsim \frac{k\mu(X)\log^2 n}{\delta^2 n}$. Let $Y \leftarrow \mathrm{MedianLS}(P_\Omega(A), \Omega, X, L, k)$. Then, with probability $1 - 1/n^3$, we have $Y = AX + G$, where $G$ satisfies, for every $i \in [n]$, the bound
\[
\|e_i^\top G\| \le \delta\|e_i^\top M\| \cdot \|V^\top X\| + \delta\|e_i^\top N\|.
\]

Proof. By Lemma C.1, the samples $\Omega_1, \dots, \Omega_t$ are independent and each set $\Omega_j$ includes each entry with probability at least $p/t$. The output satisfies $Y = \mathrm{median}(Y_1, \dots, Y_t)$, where each $Y_j$ is of the form $Y_j = AX + G_j$. It follows that $\mathrm{median}(Y_1, \dots, Y_t) = AX + G$, where $G = \mathrm{median}(G_1, \dots, G_t)$. We can therefore apply Lemma 4.4 to conclude the lemma, using the fact that $t = O(\log n)$ allows us to take a union bound over all $n$ rows.

5 Incoherence via smooth QR factorization

As part of our analysis of alternating minimization, we need to show that the intermediate solutions $X_\ell$ have small coherence. For this purpose we propose an idea inspired by the smoothed analysis of the QR factorization [SST]. The problem with applying the QR factorization directly to $Y_\ell$ is that $Y_\ell$ might be ill-conditioned. This can lead to a matrix $X_\ell$ (via QR factorization) that has large coordinates and whose coherence is therefore no longer as small as we desire. A naive bound on the condition number of $Y_\ell$ would lead to a large loss in sample complexity. What we show instead is that a small Gaussian perturbation of $Y_\ell$ leads to a sufficiently well-conditioned matrix $\widetilde{Y}_\ell = Y_\ell + H_\ell$. Orthonormalizing $\widetilde{Y}_\ell$ now yields a matrix of small coherence. Intuitively, since the computation of $Y_\ell$ is already noisy, the additional noise term has little effect so long as its norm is bounded by that of $G_\ell$. Since we don't know the norm of $G_\ell$, we have to search for the right noise parameter using a simple binary search.
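The perturb-and-orthonormalize search just described (formalized as SmoothQR in Figure 5) can be sketched in numpy as follows. This is an illustrative sketch, not the paper's implementation: the helper names `coherence` and `smooth_qr` are ours, and we assume the coherence of an orthonormal $X$ is computed as $\mu(X) = \frac{n}{k}\max_i \|e_i^\top X\|^2$:

```python
import numpy as np

def coherence(X):
    """mu(X) = (n/k) * max_i ||e_i^T X||^2 for an orthonormal X."""
    n, k = X.shape
    return (n / k) * np.max(np.sum(X**2, axis=1))

def smooth_qr(Y, eps, mu, rng):
    """Orthonormalize Y, doubling the level sigma of a Gaussian perturbation
    H with i.i.d. N(0, sigma^2/n) entries until QR(Y + H) has coherence at
    most mu (or sigma exceeds ||Y||).  Returns the pair (X, H)."""
    n, k = Y.shape
    X = np.linalg.qr(Y)[0]
    H = np.zeros_like(Y)
    sigma = eps * np.linalg.norm(Y, 2) / n
    while coherence(X) > mu and sigma <= np.linalg.norm(Y, 2):
        H = rng.normal(scale=sigma / np.sqrt(n), size=(n, k))
        X = np.linalg.qr(Y + H)[0]
        sigma *= 2.0
    return X, H
```

On a well-conditioned, already incoherent input the loop typically never runs and $H$ stays zero; the perturbation only kicks in when $Y$ is nearly rank-deficient and plain QR would concentrate mass on a few coordinates.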
We call the resulting procedure SmoothQR and describe it in Figure 5.

Input: Matrix $Y \in \mathbb{R}^{n \times k}$, parameters $\mu, \varepsilon > 0$.
Algorithm SmoothQR($Y, \varepsilon, \mu$):
1. $X \leftarrow QR(Y)$, $H \leftarrow 0$, $\sigma \leftarrow \varepsilon\|Y\|/n$.
2. While $\mu(X) > \mu$ and $\sigma \le \|Y\|$:
   (a) $X \leftarrow QR(Y + H)$, where $H \sim N(0, \sigma^2/n)^{n \times k}$
   (b) $\sigma \leftarrow 2\sigma$
Output: Pair of matrices $(X, H)$

Figure 5: Smooth orthonormalization (SmoothQR)

To analyze the algorithm, we begin with a lemma that controls the smallest singular value under a Gaussian perturbation. What makes the analysis easier is the fact that the matrices we're interested in are rectangular; the square case was considered in [SST] and requires more involved arguments.

Lemma 5.1. Let $G \in \mathbb{R}^{n \times k}$ be any matrix with $\|G\| \le 1$ and let $V$ be an $(n-k)$-dimensional subspace with orthogonal projection $P_V$. Let $H \sim N(0, \tau^2/n)^{n \times k}$ be a random Gaussian matrix. Assume $k = o(n/\log n)$. Then, with probability $1 - \exp(-\Omega(n))$, we have $\sigma_k(P_V(G + H)) \ge \Omega(\tau)$.

The proof follows from standard concentration arguments and is contained in Section B. To use this lemma in our context, we introduce a variant of $\mu$-coherence that applies to matrices rather than subspaces.

Definition 5.2 ($\rho$-coherence). Given a matrix $G \in \mathbb{R}^{n \times k}$, we let
\[
\rho(G) \stackrel{\mathrm{def}}{=} \frac{n}{k}\max_{i \in [n]}\|e_i^\top G\|^2.
\]

The next lemma is our main technical tool in this section. It shows that adding a Gaussian noise term leads to a bound on the coherence after applying the QR factorization.

Lemma 5.3. Let $k = o(n/\log n)$ and $\tau \in (0, 1)$. Let $U \in \mathbb{R}^{n \times k}$ be an orthonormal matrix and let $G \in \mathbb{R}^{n \times k}$ be a matrix such that $\|G\| \le 1$. Let $H \sim N(0, \tau^2/n)^{n \times k}$ be a random Gaussian matrix. Then, with probability $1 - \exp(-\Omega(n)) - n^{-5}$, there is an orthonormal matrix $Q \in \mathbb{R}^{n \times 2k}$ such that:
1. $R(Q) = R([U \mid G + H])$.
2. $\mu(Q) \le O\big(\tau^{-2}(\rho(G) + \mu(U) + \log n)\big)$.

Proof. First note that $R([U \mid G + H]) = R([U \mid (I - UU^\top)(G + H)])$.
Let $B = (I - UU^\top)(G + H)$. Applying the QR factorization to $[U \mid B]$, we can find two orthonormal matrices $Q_1, Q_2 \in \mathbb{R}^{n \times k}$ such that $[Q_1 \mid Q_2] = [U \mid BR^{-1}]$, where $R \in \mathbb{R}^{k \times k}$. That is, $Q_1 = U$, since $U$ is already orthonormal. Moreover, the columns of $B$ are orthogonal to $U$, and therefore we can apply the QR factorization to $U$ and $B$ independently. We can now apply Lemma 5.1 to the $(n-k)$-dimensional subspace $U^\perp$ and the matrix $G + H$. It follows that with probability $1 - \exp(-\Omega(n))$ we have $\sigma_k(B) \ge \Omega(\tau)$. Assume that this event occurs, and observe that $\sigma_k(B) = \sigma_k(R)$. The second condition is now easy to verify:
\[
\frac{n}{k}\|e_i^\top Q\|^2 = \frac{n}{k}\|e_i^\top U\|^2 + \frac{n}{k}\|e_i^\top BR^{-1}\|^2 \le \mu(U) + \frac{n}{k}\|e_i^\top BR^{-1}\|^2.
\]
On the other hand,
\[
\frac{n}{k}\|e_i^\top BR^{-1}\|^2 \le \frac{n}{k}\|e_i^\top B\|^2\|R^{-1}\|^2 \le O\Big(\frac{n}{k\tau^2}\|e_i^\top B\|^2\Big),
\]
where we used the fact that $\|R^{-1}\| = 1/\sigma_k(R) = O(1/\tau)$. Moreover,
\[
\frac{n}{k}\|e_i^\top B\|^2 \le \frac{2n}{k}\|e_i^\top(I - UU^\top)G\|^2 + 2\rho\big((I - UU^\top)H\big)
\le 4\rho(G) + 4\rho(UU^\top G) + 2\rho\big((I - UU^\top)H\big).
\]
Finally, $\rho(UU^\top G) \le \mu(U)\|U^\top G\|^2 \le \mu(U)$ and, by Lemma B.1, we have $\rho((I - UU^\top)H) \le O(\log n)$ with probability $1 - 1/n^5$. The lemma follows with a union bound over the failure probabilities.

The next lemma states that when SmoothQR is invoked on an input of the form $AX + G$ with suitable parameters, the algorithm outputs a matrix of the form $X' = QR(AX + G + H)$ whose coherence is bounded in terms of $\mu(U)$ and $\rho(G)$, and moreover $H$ satisfies a bound on its norm. The lemma also permits trading off the amount of additional noise introduced against the resulting coherence parameter.

Lemma 5.4. Let $\tau > 0$ and assume $k = o(n/\log n)$ and $\mu(U)k \le n$. There is an absolute constant $C_{5.4} > 0$ such that the following claim holds. Let $G \in \mathbb{R}^{n \times k}$, let $X \in \mathbb{R}^{n \times k}$ be an orthonormal matrix, and let $\nu \ge \max\{\|G\|, \|NX\|\}$. Assume that
\[
\mu \ge \frac{C_{5.4}}{\tau^2}\Big(\mu(U) + \frac{\rho(G) + \rho(NX)}{\nu^2} + \log n\Big).
\]
Then, for every $\varepsilon \le \tau\nu$ satisfying $\log(n/\varepsilon) \le n$ and every $\mu \le n$, we have with probability $1 - O(n^{-4})$: the algorithm SmoothQR($AX + G, \varepsilon, \mu$) terminates in $O(\log(n/\varepsilon))$ steps and outputs $(X', H)$ such that $\mu(X') \le \mu$ and $\|H\| \le \tau\nu$.

Proof. Suppose that SmoothQR terminates in an iteration where $\sigma^2 \le \tau^2\nu^2/4$. We claim that in this case, with probability $1 - \exp(-\Omega(n))$, we must have $\|H\| \le \tau\nu$. Indeed, assuming the algorithm terminates while $\sigma^2 \le \tau^2\nu^2/4$, it took at most $t = O(\log(n/\varepsilon)) \le O(n)$ steps. Let $H_1, \dots, H_t$ denote the random Gaussian matrices generated in each step. We claim that each of them satisfies $\|H_j\| \le \tau\nu$. Note that for all $j$ we have $\mathbb{E}\|H_j\|^2 \le \tau^2\nu^2/4$. The claim therefore follows directly from tail bounds for the norm of Gaussian random matrices and holds with probability $1 - \exp(-\Omega(n))$. The next claim now finishes the proof.

Claim 5.5. With probability $1 - O(1/n^4)$, the algorithm terminates in an iteration where $\sigma^2 \le \tau^2\nu^2/4$.

To prove the claim, consider the first iteration in which $\sigma^2 \ge \tau^2\nu^2/8$. Let us define $G' = (NX + G)/2\nu$. We can now apply Lemma 5.3 to the matrix $G'$, which satisfies the assumption $\|G'\| \le 1$ of the lemma. The lemma then entails that, with the stated probability bound, there is an orthonormal $n \times 2k$ matrix $Q$ such that
\[
R(Q) = R([U \mid G' + H/2\nu]) = R([U \mid NX + G + H]),
\]
and moreover $\mu(Q) \le O\big(\tau^{-2}(\rho(G') + \mu(U) + \log n)\big)$. On the one hand,
\[
R(X') = R(AX + G + H) = R(MX + NX + G + H) \subseteq R([U \mid NX + G + H]) = R(Q).
\]
The inclusion follows from the fact that $U$ is an orthonormal basis for the range of $MX = U\Lambda_U U^\top X$. On the other hand, $\rho(G') = O\big(\rho(G/\nu) + \rho(NX/\nu)\big)$. Hence, by Lemma B.2 and the fact that $\dim(Q) \le 2\dim(X')$, we have $\mu(X') \le 2\mu(Q) \le \mu$.
This shows that the termination criterion of the algorithm is satisfied provided we pick $C_{5.4}$ large enough.

6 Convergence bounds for alternating minimization

The total sample complexity we achieve is the sum of two terms. The first is used by the initialization step that we discuss in Section 7. The second specifies the sample requirements for iterating the least squares algorithm. It therefore makes sense to define the following two quantities:
\[
p_{\mathrm{init}} = \frac{k^2\mu^*\|A\|_F^2\log n}{\gamma_k^2\sigma_k^2 n}
\qquad\text{and}\qquad
p_{\mathrm{LS}} = \frac{k\mu^*\big(\|M\|_F^2 + \|N\|_F^2/\varepsilon^2\big)\log(n/\varepsilon)\log^2 n}{\gamma_k^5\sigma_k^2 n}.
\]
While the first term has a quadratic dependence on $k$, it does not depend on $\varepsilon$ at all and has a single logarithmic factor. The second term features a linear dependence on $k$. Our main theorem shows that if the sampling probability exceeds the sum of these two terms, the algorithm converges rapidly to the true unknown matrix.

Theorem 6.1 (Main). Let $k, \varepsilon > 0$. Let $A = M + N$ be a symmetric $n \times n$ matrix, where $M$ is a matrix of rank $k$ with the spectral decomposition $M = U\Lambda_U U^\top$, and $N = (I - UU^\top)A = V\Lambda_V V^\top$ satisfies (2). Let $\gamma_k = 1 - \sigma_{k+1}/\sigma_k$, where $\sigma_k$ is the smallest singular value of $M$ and $\sigma_{k+1}$ is the largest singular value of $N$. Then there are parameters $\mu = \Theta(\gamma_k^{-2}k(\mu^* + \log n))$ and $L = \Theta(\gamma_k^{-1}\log(n/\varepsilon))$ such that the output $(X, Y)$ of SAltLS($P_\Omega(A), \Omega, k, L, \varepsilon, \mu$) satisfies $\|(I - UU^\top)X_L\| \le \varepsilon$ with probability $9/10$.

Before proving the theorem in Section 6.1, we state an immediate corollary that bounds the reconstruction error in the Frobenius norm.

Corollary 6.2 (Reconstruction error). Under the assumptions of Theorem 6.1, the output $(X, Y)$ of SAltLS satisfies $\|M - XY^\top\|_F \le \varepsilon\|A\|_F$ with probability $9/10$.

Proof. Let $(X, Y)$ be the matrices given by our algorithm when invoked with error parameter $\varepsilon/2$.
By Theorem 6.1 we have $\|UU^\top - XX^\top\| = \|(I - UU^\top)X\| \le \varepsilon/2$. From the proof of Theorem 6.1 we also know that $Y = AX + G$, where $G$ is $(\varepsilon/4)$-admissible, so that $\|G\|_F \le \varepsilon\sigma_k/2$. Consequently,
\[
\|M - XY^\top\|_F = \|M - XX^\top A - XG^\top\|_F
\le \|UU^\top A - XX^\top A\|_F + \|XG^\top\|_F
\le \|UU^\top - XX^\top\|\,\|A\|_F + \|G\|_F
\le (\varepsilon/2)\|A\|_F + (\varepsilon/2)\sigma_k \le \varepsilon\|A\|_F.
\]
In the second inequality we used that for all matrices $P, Q$ we have $\|PQ\|_F \le \|P\| \cdot \|Q\|_F$.

6.1 Proof of Theorem 6.1

Proof. We first apply Theorem 7.1 (shown below) to conclude that with probability $19/20$ the initial matrix $X_0$ satisfies $\|V^\top X_0\| \le 1/4$ and $\mu(X_0) \le 32\mu(U)\log n$. Assume that this event occurs. Our goal is now to apply Theorem 3.8. Consider the sequence of matrices $\{(X_{\ell-1}, \widetilde{G}_\ell)\}_{\ell=1}^L$ obtained by the execution of SAltLS starting from $X_0$, where $\widetilde{G}_\ell = G_\ell + H_\ell$, $G_\ell$ is the error term corresponding to the $\ell$-th step of MedianLS, and $H_\ell$ is the error term introduced by the application of SmoothQR at step $\ell$. To apply Theorem 3.8, we need to show that this sequence of matrices is $(\varepsilon/2)$-admissible for NSI with probability $19/20$. The theorem then directly gives $\|V^\top X_L\| \le \varepsilon$, and this concludes our proof by summing up the error probabilities.

Let $\tau = \gamma_k/128$ and $\widehat{\mu} = \frac{C_{5.4}}{\tau^2}(20\mu^* + \log n)$, and let $\mu$ be any number satisfying $\mu \ge \widehat{\mu}$. Since $\widehat{\mu} = \Theta(\gamma_k^{-2}k(\mu^* + \log n))$, this satisfies the requirement in the theorem. We prove that with probability $19/20$ the following three claims hold:
1. $\{(X_{\ell-1}, G_\ell)\}_{\ell=1}^L$ is $(\varepsilon/4)$-admissible,
2. $\{(X_{\ell-1}, H_\ell)\}_{\ell=1}^L$ is $(\varepsilon/4)$-admissible,
3. for all $\ell \in \{0, \dots, L-1\}$, we have $\mu(X_\ell) \le \mu$.
This implies the claim that we want, using a triangle inequality, since $\widetilde{G}_\ell = G_\ell + H_\ell$. The proof of these three claims is by mutual induction. For $\ell = 0$, we only need to check the third claim, which follows from the fact that $X_0$ satisfies the coherence bound.
Now assume that all three claims hold at step $\ell - 1$; we will argue that with probability $1 - o(1/n)$ all three claims hold at step $\ell$. Since $L \le n$, this is sufficient.

The first claim follows from Lemma 4.4, using the induction hypothesis that $\mu(X_{\ell-1}) \le \widehat{\mu}$. Specifically, we apply the lemma with $\delta = c\min\{\gamma_k\sigma_k/\|M\|_F,\; \varepsilon\gamma_k\sigma_k/\|N\|_F\}$ for a sufficiently small constant $c > 0$. The lemma requires the lower bound $p \gtrsim \frac{k\mu^*\log^2 n}{\delta^2 n}$. We can easily verify that the right-hand side is a factor $L = \Theta(\gamma_k^{-1}\log(n/\varepsilon))$ smaller than what is provided by the assumption of the theorem; this is because fresh samples are used in each of the $L$ steps, so we need to divide the given bound by $L$. Lemma 4.4 now gives, with probability $1 - 1/n^3$, the upper bound
\[
\|G_\ell\| \le \frac14\Big(\frac{1}{32}\gamma_k\sigma_k\|V^\top X_{\ell-1}\| + \frac{\varepsilon}{32}\gamma_k\sigma_k\Big).
\]
In particular, this satisfies the definition of $(\varepsilon/4)$-admissibility. We proceed under the assumption that this event occurs, as the error probability is small enough to ignore.

The remaining two claims follow from Lemma 5.4. We apply the lemma to $AX_{\ell-1} + G_\ell$ with $\nu = \sigma_k(\|V^\top X_{\ell-1}\| + \varepsilon)$ and $\tau$ as above. Note that $\|NX_{\ell-1}\| \le \sigma_k\|V^\top X_{\ell-1}\|$; hence $\nu \ge \max\{\|G_\ell\|, \|NX_{\ell-1}\|\}$, as required by the lemma. The lemma also requires a lower bound on $\mu$. To satisfy it, we invoke Lemma B.3, showing that with probability $1 - 1/n^2$ we have
\[
\frac{1}{\nu^2}\big(\rho(G_\ell) + \rho(NX_{\ell-1})\big) \le 10\mu^*.
\]
We remark that this is the lemma that uses the assumption on $N$ provided by (2). Again we assume this event occurs. In this case we have $\mu \ge \widehat{\mu} = \frac{C_{5.4}}{\tau^2}(20\mu^* + \log n)$, and so $\mu$ satisfies the requirement of Lemma 5.4. It follows that SmoothQR produces, with probability $1 - 1/n^4$, a matrix $H_\ell$ such that
\[
\|H_\ell\| \le \tau\nu \le \frac{\gamma_k\nu}{128} \le \frac14\Big(\frac{1}{32}\gamma_k\sigma_k\|V^\top X_{\ell-1}\| + \frac{\varepsilon}{32}\gamma_k\sigma_k\Big).
\]
In particular, $H_\ell$ satisfies the requirement of $(\varepsilon/4)$-admissibility. Moreover, the lemma gives that $\mu(X_\ell) \le \mu$.
This shows that the second and third claims of our inductive hypothesis also continue to hold. All error probabilities we incurred were $o(1/n)$, so we can sum them over all $L \le n$ steps to conclude the proof of the theorem.

7 Finding a good starting point

Figure 6 describes an algorithm that computes the top $k$ singular vectors of $P_\Omega(A)$ and truncates them in order to ensure incoherence. The algorithm serves as a fast initialization procedure for our main algorithm. This general approach is relatively standard in the literature; however, our truncation argument differs from previous approaches. Specifically, we use a random orthonormal transformation to spread out the entries of the singular vectors before truncation. This leads to a tighter bound on the coherence.

Theorem 7.1 (Initialization). Let $A = M + N$ be a symmetric $n \times n$ matrix, where $M$ is a matrix of rank $k$ with the spectral decomposition $M = U\Lambda_U U^\top$ and $N = (I - UU^\top)A$ satisfies (2). Assume that each entry is included in $\Omega$ independently with probability
\[
p \ge \frac{Ck\big(k\mu(U) + \mu_N\big)\big(\|A\|_F/\gamma_k\sigma_k\big)^2\log n}{n} \tag{9}
\]
for a sufficiently large constant $C > 0$. Then the algorithm Initialize returns an orthonormal matrix $X \in \mathbb{R}^{n \times k}$ such that, with probability $9/10$, $\|V^\top X\|_F \le 1/4$ and $\mu(X) \le 32\mu(U)\log n$.

Proof. The proof follows directly from Lemma 7.3 and Lemma 7.4 below.

Input: Target dimension $k$, observed set of indices $\Omega \subseteq [n] \times [n]$ of an unknown symmetric matrix $A \in \mathbb{R}^{n \times n}$ with entries $P_\Omega(A)$, coherence parameter $\mu \in \mathbb{R}$.
Algorithm Initialize($P_\Omega(A), \Omega, k, \mu$):
1. Compute the first $k$ singular vectors $W \in \mathbb{R}^{n \times k}$ of $P_\Omega(A)$.
2. $\widetilde{W} \leftarrow WO$, where $O \in \mathbb{R}^{k \times k}$ is a random orthonormal matrix.
3. $T \leftarrow T_{\mu'}(\widetilde{W})$ with $\mu' = \sqrt{8\mu\log(n)/n}$, where $T_c$ replaces each entry of its input with the nearest number in the interval $[-c, c]$.
4. $X \leftarrow QR(T)$
Output: Orthonormal matrix $X \in \mathbb{R}^{n \times k}$.
Figure 6: Initialization procedure (Initialize)

Remark 7.2. To implement Initialize it suffices to compute an approximate singular value decomposition of $P_\Omega(A)$. From our analysis it is easy to see that it suffices to compute the $k$-th singular value to accuracy, say, $\gamma_k\sigma_k/100k$. This can be done efficiently using, for example, the Power Method (Subspace Iteration) with $O(\gamma_k^{-1}\log n)$ iterations; see [Hig, Ste] for details on the Power Method. In particular, the running time of this step is dominated by the running time of LS.

Lemma 7.3. Assume that $\Omega$ satisfies Equation (9). Then,
\[
\Pr\Big\{\|V^\top W\| \le \frac{1}{16\sqrt{k}}\Big\} \ge 1 - \frac{1}{n^2}.
\]

Proof. By our assumption on $A$, we have $\max_{i \in [n]}\|e_i^\top N\|^2 \le (\mu_N/n)\|A\|_F^2$ and $\max_{i,j \in [n]}|N_{ij}| \le (\mu_N/n)\|A\|_F$. Moreover, $\max_i\|e_i^\top M\|^2 \le (\mu(U)k/n)\|M\|_F^2$ and $\max_{i,j}|M_{ij}| \le (\mu(U)k/n)\|M\|_F$. This shows that
\[
\max_i\|e_i^\top A\|^2 \le \frac{\mu(U)k + \mu_N}{n}\|A\|_F^2
\qquad\text{and}\qquad
\max_{i,j}|A_{ij}| \le \frac{\mu(U)k + \mu_N}{n}\|A\|_F.
\]
Plugging these upper bounds into Lemma A.3, together with our sample bound in Equation (9), we get that
\[
\Pr\Big\{\|A - P_\Omega(A)\| \ge \frac{\gamma_k\sigma_k}{32\sqrt{k}}\Big\} \le \frac{1}{n^2}.
\]
Put $\varepsilon = \gamma_k\sigma_k/32\sqrt{k}$. Now assume that $\|A - P_\Omega(A)\| \le \varepsilon$ and let $W$ be the top $k$ singular vectors of $P_\Omega(A)$. On the one hand, $\sigma_k(P_\Omega(A)) \ge \sigma_k(A) - \varepsilon \ge \sigma_k - \gamma_k\sigma_k/2$. On the other hand, by definition, $\sigma_{k+1}(A) = \sigma_k - \gamma_k\sigma_k$. Hence, by the Davis–Kahan $\sin\theta$-theorem [DK, Ste], we have
\[
\|V^\top W\| = \sin\theta_k(U, W) \le \frac{\varepsilon}{\sigma_k(P_\Omega(A)) - \sigma_{k+1}(A)} \le \frac{2\varepsilon}{\gamma_k\sigma_k} = \frac{1}{16\sqrt{k}}.
\]

Lemma 7.4. Assume that $\|V^\top W\| \le \frac{1}{16\sqrt{k}}$. Then, with probability $99/100$, we have $\|V^\top X\|_F \le 1/4$ and $\mu(X) \le 32\mu(U)\log n$.

Proof. By our assumption on $W$, there exists an orthonormal transformation $Q \in \mathbb{R}^{k \times k}$ such that $\|UQ - W\|_F \le 1/16$. Moreover, $\mu(UQ) = \mu(U) \le \mu$.
In other words, $W$ is close in Frobenius norm to an orthonormal basis of small coherence. A priori, some entries of $UQ$ could be as large as $\sqrt{\mu k/n}$. However, after rotating $UQ$ by a random rotation, all entries will be as small as $\sqrt{\mu\log(n)/n}$. This is formalized in the next claim.

Claim 7.5. Let $Y \in \mathbb{R}^{n \times k}$ be any orthonormal basis with $\mu(Y) \le \mu$. Then, for a random orthonormal matrix $O \in \mathbb{R}^{k \times k}$, we have
\[
\Pr\Big\{\max_{i,j}|(YO)_{ij}| \ge \sqrt{8\mu\log(n)/n}\Big\} \le \frac{1}{n^2}.
\]

Proof. Consider a single entry $Z = (YO)_{ij}$. Observe that $Z$ is distributed like a coordinate of a random vector in $\mathbb{R}^k$ of norm at most $\sqrt{\mu k/n}$. By measure concentration, we have
\[
\Pr\Big\{|Z| \ge \varepsilon\sqrt{\mu k/n}\Big\} \le 4\exp(-\varepsilon^2 k/2).
\]
This follows from Lévy's Lemma (see [Mat]), using the fact that projection onto a single coordinate in $\mathbb{R}^k$ is a Lipschitz function on the $(k-1)$-dimensional sphere, whose median is $0$ by spherical symmetry. Putting $\varepsilon = \sqrt{8\log(n)/k}$, we have
\[
\Pr\Big\{|Z| \ge \sqrt{8\mu\log(n)/n}\Big\} \le 4\exp(-4\log n) = 4n^{-4}.
\]
Taking a union bound over all $kn \le n^2/4$ entries, we have with probability $1 - 1/n^2$ that $\max_{i,j}|(YO)_{ij}| \le \sqrt{8\mu\log(n)/n}$.

Applying the previous claim to $UQ$, we have with probability $1 - 1/n^2$ that $|(UQO)_{ij}| \le \mu'$ for all $i, j$. Furthermore, because a rotation does not change the Frobenius norm, we also have $\|UQO - WO\|_F \le 1/16$. Truncating the entries of $WO$ to $\mu'$ can therefore only decrease the distance in Frobenius norm to $UQO$. Hence, $\|UQO - T\|_F \le 1/16$. Also, since truncation is the projection onto the set $\{B : |B_{ij}| \le \mu'\}$ with respect to the Frobenius norm, and $UQO$ lies in this set, we have
\[
\|WO - T\|_F \le \|WO - UQO\|_F \le \frac{1}{16}.
\]
We can write $X = TR^{-1}$, where $R$ is an invertible linear transformation with the same singular values as $T$, and thus
\[
\|R^{-1}\| = \frac{1}{\sigma_k(T)} \le \frac{1}{\sigma_k(WO) - \sigma_1(WO - T)} \le \frac{1}{1 - 1/16} \le 2.
\]
Therefore,
\[
\|e_i^\top X\| = \|e_i^\top TR^{-1}\| \le \|e_i^\top T\|\,\|R^{-1}\| \le 2\|e_i^\top T\| \le 2\sqrt{8k\mu(U)\log(n)/n}.
\]
Hence,
\[
\mu(X) \le \frac{n}{k}\cdot\frac{32k\mu(U)\log(n)}{n} \le 32\mu(U)\log(n).
\]
Finally,
\[
\|V^\top X\|_F = \|V^\top TR^{-1}\|_F \le \|V^\top T\|_F\,\|R^{-1}\| \le 2\|V^\top T\|_F
\le 2\|V^\top WO\|_F + 2\|WO - T\|_F \le 2\|V^\top W\|_F + \frac18 \le \frac14.
\]

Acknowledgments

Thanks to David Gleich, Prateek Jain, Jonathan Kelner, Raghu Meka, Ankur Moitra, Nikhil Srivastava, and Mary Wootters for many helpful discussions. We thank the Simons Institute for the Theory of Computing at Berkeley, where some of this research was done.

References

[AKKS] Haim Avron, Satyen Kale, Shiva Prasad Kasiviswanathan, and Vikas Sindhwani. Efficient and practical stochastic subgradient descent for nuclear norm regularization. In Proc. 29th ICML. ACM, 2012.
[BK] Robert M. Bell and Yehuda Koren. Scalable collaborative filtering with jointly derived neighborhood interpolation weights. In ICDM, pages 43–52. IEEE Computer Society, 2007.
[CR] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9:717–772, December 2009.
[CT] Emmanuel J. Candès and Terence Tao. The power of convex relaxation: near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.
[DK] Chandler Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation. III. SIAM J. Numer. Anal., 7:1–46, 1970.
[DMNS] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Proc. 3rd TCC, pages 265–284. Springer, 2006.
[GAGG] Suriya Gunasekar, Ayan Acharya, Neeraj Gaur, and Joydeep Ghosh. Noisy matrix completion using alternating minimization. In Proc. ECML PKDD, pages 194–209. Springer, 2013.
[Har] Moritz Hardt. Robust subspace iteration and privacy-preserving spectral analysis. arXiv, 1311.2495, 2013.
[HH] Justin P. Haldar and Diego Hernando. Rank-constrained solutions to linear matrix equations using PowerFactorization. IEEE Signal Process. Lett., 16(7):584–587, 2009.
[Hig] Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms. Society for Industrial and Applied Mathematics, 2002.
[HK] Elad Hazan and Satyen Kale. Projection-free online learning. In ICML. ACM, 2012.
[HMRW] Moritz Hardt, Raghu Meka, Prasad Raghavendra, and Benjamin Weitz. Computational limits for matrix completion. CoRR, abs/1402.2331, 2014.
[HO] Cho-Jui Hsieh and Peder A. Olsen. Nuclear norm minimization via active subspace selection. In Proc. 31st ICML. ACM, 2014.
[HR1] Moritz Hardt and Aaron Roth. Beating randomized response on incoherent matrices. In Proc. 44th Symposium on Theory of Computing (STOC), pages 1255–1268. ACM, 2012.
[HR2] Moritz Hardt and Aaron Roth. Beyond worst-case analysis in private singular vector computation. In Proc. 45th Symposium on Theory of Computing (STOC). ACM, 2013.
[JNS] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In Proc. 45th Symposium on Theory of Computing (STOC), pages 665–674. ACM, 2013.
[JS] Martin Jaggi and Marek Sulovský. A simple algorithm for nuclear norm regularized problems. In Proc. 27th ICML, pages 471–478. ACM, 2010.
[JY] Shuiwang Ji and Jieping Ye. An accelerated gradient method for trace norm minimization. In Proc. 26th ICML, page 58. ACM, 2009.
[KBV] Yehuda Koren, Robert M. Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, 42(8):30–37, 2009.
[Kes] Raghunandan H. Keshavan. Efficient algorithms for collaborative filtering. PhD thesis, Stanford University, 2012.
[KMO1] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980–2998, 2010.
[KMO2] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from noisy entries. Journal of Machine Learning Research, 11:2057–2078, 2010.
[Mat] Jiri Matousek. Lectures on Discrete Geometry. Springer-Verlag New York, Inc., 2002.
[MHT] Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11:2287–2322, 2010.
[Rec] Benjamin Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12:3413–3430, 2011.
[RR] Benjamin Recht and Christopher Ré. Parallel stochastic gradient algorithms for large-scale matrix completion. Math. Program. Comput., 5(2):201–226, 2013.
[SST] Arvind Sankar, Daniel A. Spielman, and Shang-Hua Teng. Smoothed analysis of the condition numbers and growth factors of matrices. SIAM J. Matrix Analysis Applications, 28(2):446–476, 2006.
[Ste] G. W. Stewart. Matrix Algorithms. Volume II: Eigensystems. Society for Industrial and Applied Mathematics, 2001.
[Tro] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.
[ZK] Peizhen Zhu and Andrew V. Knyazev. Angles between subspaces and their tangents. arXiv preprint arXiv:1209.0523, 2012.

A Large deviation bounds

We need some matrix concentration inequalities; see [Tro] for background.

Theorem A.1 (Matrix Bernstein). Consider a finite sequence $\{Z_k\}$ of independent random matrices with dimensions $d_1 \times d_2$. Assume that each random matrix satisfies $\mathbb{E} Z_k = 0$ and $\|Z_k\| \le R$ almost surely. Define
\[
\sigma^2 \stackrel{\mathrm{def}}{=} \max\Big\{\Big\|\sum_k \mathbb{E} Z_k Z_k^\top\Big\|,\; \Big\|\sum_k \mathbb{E} Z_k^\top Z_k\Big\|\Big\}.
\]
Then, for all $t \ge 0$,
\[
\Pr\Big\{\Big\|\sum_k Z_k\Big\| \ge t\Big\} \le (d_1 + d_2)\cdot\exp\Big(\frac{-t^2/2}{\sigma^2 + Rt/3}\Big).
\]

Theorem A.2 (Matrix Chernoff). Consider a finite sequence $\{X_k\}$ of independent self-adjoint random matrices of dimension $d$. Assume that each random matrix satisfies $X_k \succeq 0$ and $\lambda_{\max}(X_k) \le R$ almost surely.
Define $\mu_{\min} \stackrel{\mathrm{def}}{=} \lambda_{\min}\big(\sum_k \mathbb{E}\, X_k\big)$. Then,
\[
\mathbb{P}\left\{ \lambda_{\min}\Big(\sum_k X_k\Big) \le (1-\delta)\,\mu_{\min} \right\} \le d \cdot \exp\left( \frac{-\delta^2 \mu_{\min}}{2R} \right).
\]

A.1 Error bounds for initialization

Lemma A.3. Suppose that $A \in \mathbb{R}^{m \times n}$ and let $\Omega \subseteq [m] \times [n]$ be a random subset where each entry is included independently with probability $p$. Then
\[
\mathbb{P}\left\{ \|P_\Omega(A) - A\| \ge u \right\} \le n \exp\left( \frac{-u^2/2}{\sigma^2 + \tfrac{u}{3}(1/p - 1)\max_{ij} |A_{ij}|} \right),
\]
where $\sigma^2 = (1/p - 1)\max\big\{ \max_i \|e_i^\top A\|^2,\ \max_j \|A e_j\|^2 \big\}$.

Proof. Let $\xi_{ij}$ be independent Bernoulli-$p$ random variables, which are $1$ if $(i,j) \in \Omega$ and $0$ otherwise. Consider the sum of independent random matrices
\[
P_\Omega(A) - A = \sum_{i,j} \left( \frac{\xi_{ij}}{p} - 1 \right) A_{ij}\, e_i e_j^\top.
\]
Applying Theorem A.1, we conclude that
\[
\mathbb{P}\left\{ \|P_\Omega(A) - A\| \ge u \right\} \le n \exp\left( \frac{-u^2/2}{\sigma^2 + Ru/3} \right).
\]
Here we use that
\[
\Big\| \sum_{i,j} \mathbb{E}\left( \frac{\xi_{ij}}{p} - 1 \right)^2 A_{ij}^2\, e_i e_j^\top e_j e_i^\top \Big\| = \left( \frac{1}{p} - 1 \right) \max_i \|e_i^\top A\|^2
\]
and similarly
\[
\Big\| \sum_{i,j} \mathbb{E}\left( \frac{\xi_{ij}}{p} - 1 \right)^2 A_{ij}^2\, e_j e_i^\top e_i e_j^\top \Big\| = \left( \frac{1}{p} - 1 \right) \max_j \|A e_j\|^2.
\]
Further,
\[
\Big\| \left( \frac{\xi_{ij}}{p} - 1 \right) A_{ij}\, e_i e_j^\top \Big\| \le R = \left( \frac{1}{p} - 1 \right) \max_{ij} |A_{ij}|.
\]
This concludes the proof.

A.2 Error bounds for least squares

Lemma A.4. Let $0 < \delta < 1$ and let $i \in [n]$. Assume that $p \gtrsim \frac{k \mu(X) \log n}{\delta^2 n}$. Then,
\[
\mathbb{P}\left\{ \|B_i^{-1}\| \ge \frac{1}{1-\delta} \right\} \le \frac{1}{n^5}.
\]

Proof. Let $B = B_i$ and $p = p_\ell$. Clearly, $\|B^{-1}\| = 1/\lambda_{\min}(B)$. We will use Theorem A.2 to lower bound the smallest eigenvalue of $B$ by $1 - \delta$. Denoting the rows of $X$ by $x_1, \dots, x_n \in \mathbb{R}^k$, we have $B = \sum_{i=1}^n \frac{1}{p} Z_i x_i x_i^\top$, where $\{Z_i\}$ are independent Bernoulli($p$) random variables. Moreover, $\mathbb{E} B = X^\top X = \mathrm{Id}_{k \times k}$. Therefore, in the notation of Theorem A.2, $\mu_{\min} = 1$. Moreover, using our lower bound on $p$,
\[
\Big\| \frac{1}{p} Z_i x_i x_i^\top \Big\| \le \frac{1}{p} \|x_i\|^2 \le \frac{\mu(X_{\ell-1})\, k}{pn} \le \frac{\delta^2}{20 \log n}.
\]
Hence, by Theorem A.2, $\mathbb{P}\{\lambda_{\min}(B) \le 1 - \delta\} \le k \exp(-10 \log n)$. The claim follows.

Lemma A.5. Let $0 < \delta < 1$. Assume that $p \gtrsim \frac{k \mu(X) \log n}{\delta^2 n}$. Then, for every $i \in [n]$, we have
\[
\mathbb{P}\left\{ \|e_i^\top U \Lambda_U E^\top P_i X\| \ge \delta\, \|e_i^\top M\| \cdot \|V^\top X\| \right\} \le \frac{1}{10}.
\]

Proof.
Let $\Omega \in \mathbb{R}^{k \times k}$ be an orthonormal transformation such that all columns of $X' = X\Omega$ have $\ell_\infty$-norm at most $\sqrt{8\mu(X)\log(n)/n}$. Such a transformation exists, as shown in Claim 7.5. Then,
\[
\mathbb{E}\, \|e_i^\top U \Lambda_U E^\top P_i X\|^2 = \mathbb{E}\, \|e_i^\top U \Lambda_U E^\top P_i X'\|^2 = \sum_{r=1}^k \mathbb{E}\, (e_i^\top U \Lambda_U E^\top P_i x'_r)^2,
\]
where $x'_r$ is the $r$-th column of $X'$. Let $w^\top = e_i^\top U \Lambda_U E^\top$. We have
\[
\|w\| \le \|e_i^\top U \Lambda_U\| \cdot \|E\| = \|e_i^\top M\| \cdot \|V^\top X\|.
\]
Finally,
\[
\mathbb{E}\, (w^\top P_i x'_r)^2 = \sum_{j=1}^n \mathbb{E}\, \big(w_j (P_i)_{jj} (x'_r)_j\big)^2 \le \frac{1}{p} \cdot \|w\|_2^2 \cdot \|x'_r\|_\infty^2 \le \frac{8\mu(X)\log n}{pn} \cdot \|w\|^2.
\]
Hence,
\[
\mathbb{E}\, \|e_i^\top U \Lambda_U E^\top P_i X\|^2 \le \frac{8 k \mu(X) \log n}{pn} \|e_i^\top M\|^2 \cdot \|V^\top X\|^2 \le \frac{\delta^2}{10} \|e_i^\top M\|^2 \cdot \|V^\top X\|^2.
\]
The claim now follows from Markov's inequality.

Lemma A.6. Let $0 < \delta < 1$. Assume that $p \gtrsim \frac{k\mu(X)\log n}{\delta^2 n}$. Then, for every $i \in [n]$,
\[
\mathbb{P}\left\{ \|e_i^\top G_N\| \ge \delta \|e_i^\top N\| \right\} \le \frac{1}{10}.
\]

Proof. Recall that, by Lemma 4.2, we have
\[
e_i^\top G_N = e_i^\top N P_i X B_i^{-1} - e_i^\top N X = (e_i^\top N P_i X - e_i^\top N X B_i)\, B_i^{-1}.
\]
Plugging the lower bound on $p$ into Lemma A.4, we get that $\|B_i - I\| \le \delta/4$ for all $i \in [n]$ with probability $19/20$. We will show that for every fixed $i \in [n]$, we have with probability $19/20$ that
\[
\|e_i^\top N P_i X - e_i^\top N X\| \le (\delta/4)\,\|e_i^\top N\|. \tag{10}
\]
Both events simultaneously occur with probability $9/10$, and in this case we have
\[
\|e_i^\top G_N\| \le \|e_i^\top N P_i X - e_i^\top N X\| \cdot \|B_i^{-1}\| + \|e_i^\top N X\| \cdot \|B_i - I\| \cdot \|B_i^{-1}\| \le (\delta/2)\,\|e_i^\top N\| + (\delta/2)\,\|e_i^\top N\| = \delta\,\|e_i^\top N\|.
\]
So, it remains to show Equation 10. Fix $i \in [n]$. Let $\Omega \in \mathbb{R}^{k\times k}$ be an orthonormal transformation such that all columns of $X' = X\Omega$ have $\ell_\infty$-norm at most $\sqrt{8\mu(X)\log(n)/n}$. Let $w^\top = e_i^\top N$ be the $i$-th row of $N$, and denote by $x'_1, \dots, x'_k$ the $k$ columns of $X'$. We have
\[
\mathbb{E}\, \|w^\top P_i X - w^\top X\|^2 = \mathbb{E}\, \|w^\top (P_i - I) X'\|^2 = \sum_{r=1}^k \mathbb{E}\, \langle w, (P_i - I) x'_r \rangle^2 = \sum_{r=1}^k \sum_{j=1}^n \mathbb{E}\, \big((P_i)_{jj} - 1\big)^2 w_j^2 (x'_r)_j^2 \le \|w\|^2 \cdot \frac{8 k \mu(X) \log n}{pn} \le \frac{\delta^2}{80} \|e_i^\top N\|^2.
\]
The bound in Equation 10 now follows from Markov's inequality.

B Additional lemmas and proofs for smooth QR factorization

Proof of Lemma 5.1. Fix a unit vector $x \in \mathbb{R}^k$. We have
\[
\|P_V (G + H) x\|^2 \ge \|P_V H x\|^2 - |\langle P_V G x, P_V H x \rangle|.
\]
Note that $g = Hx$ is distributed like $N(0, \tau^2/n)^n$ and $y = P_V G x$ has norm at most $1$. Due to the rotational invariance of the Gaussian measure, we may assume without loss of generality that $V$ is the subspace spanned by the first $n-k$ standard basis vectors in $\mathbb{R}^n$. Hence, denoting $h \sim N(0, \tau^2/n)^{n-k}$, our goal is to lower bound $\|h\|^2 - |\langle y, h \rangle|$. Note that $\mathbb{E}\|h\|^2 \ge \tau^2/2$, and by standard concentration bounds for the norm of a Gaussian variable we have
\[
\mathbb{P}\left\{ \|h\|^2 \le \tau^2/4 \right\} \le \exp(-\Omega(n)).
\]
On the other hand, $\langle y, h \rangle$ is distributed like a one-dimensional Gaussian variable of variance at most $\tau^2/n$. Hence, by Gaussian tail bounds,
\[
\mathbb{P}\left\{ \langle y, h \rangle^2 \ge \tau^2/8 \right\} \le \exp(-\Omega(n)).
\]
Hence, with probability $1 - \exp(-\Omega(n))$, we have $\|P_V(G+H)x\| \ge \Omega(\tau)$. We can now take a union bound over a net of the unit sphere in $\mathbb{R}^k$ of size $\exp(O(k \log k))$ to conclude that with probability $1 - \exp(O(k \log k))\exp(-\Omega(n))$, we have for all unit vectors $x \in \mathbb{R}^k$ that $\|P_V(G+H)x\| \ge \Omega(\tau)$. Therefore $\sigma_k(P_V(G+H)) \ge \Omega(\tau)$. By our assumption, $\exp(O(k\log k)) = \exp(o(n))$, and hence this event occurs with probability $1 - \exp(-\Omega(n))$.

Lemma B.1. Let $P$ be the projection onto an $(n-k)$-dimensional subspace. Let $H \sim N(0, 1/n)^{n \times k}$. Then, $\rho(PH) \le O(\log n)$ with probability $1 - 1/n^5$.

Proof. We have that $P = I - UU^\top$ for some $k$-dimensional basis $U$. Hence, $\rho(PH) \le O(\rho(H)) + O(\rho(UU^\top H))$. Using concentration bounds for the norm of each row of $H$ and a union bound over all rows, it follows straightforwardly that $\rho(H) \le O(\log n)$ with probability $1 - 1/2n^5$.
The second term satisfies $\rho(UU^\top H) \le \rho(U)\,\|U^\top H\|^2 = \mu(U)\,\|U^\top H\|^2$. But $U^\top H$ is a Gaussian matrix $N(0, 1/n)^{k \times k}$ and hence its largest singular value satisfies $\|U^\top H\|^2 \le O(k\log(n)/n)$ with probability $1 - 1/2n^5$.

Lemma B.2. Let $X, Y$ be $k$- and $k'$-dimensional subspaces, respectively, such that $R(X) \subseteq R(Y)$. Then, $\mu(X) \le \frac{k'}{k}\mu(Y)$.

Proof. We know that $\mu(Y)$ is rotationally invariant. Therefore, without loss of generality we may assume that $Y = [X \mid X']$ for some orthonormal matrix $X'$. Here, we identify $X$ and $Y$ with orthonormal bases. Hence,
\[
\mu(X) = \frac{n}{k} \max_{i \in [n]} \|e_i^\top X\|^2 \le \frac{n}{k} \max_{i \in [n]} \left( \|e_i^\top X\|^2 + \|e_i^\top X'\|^2 \right) = \frac{n}{k} \max_{i \in [n]} \|e_i^\top Y\|^2 = \frac{k'}{k}\, \mu(Y).
\]

The following technical lemma was needed in the proof of Theorem 6.1.

Lemma B.3. Under the assumptions of Theorem 6.1, we have for every $\ell \in [L]$ and $\nu = \frac{\sigma_k}{32}\left( \|V^\top X_{\ell-1}\| + \varepsilon \right)$, with probability $1 - 1/n^2$,
\[
\frac{1}{\nu^2}\left( \rho(G) + \rho(N X_{\ell-1}) \right) \le 3\mu^*.
\]

Proof. Given the lower bound on $p$ in Theorem 6.1, we can apply Lemma 4.4 to conclude that $\|e_i^\top G_{M_\ell}\| \le \sqrt{k\mu(U)/n} \cdot \nu$ and $\|e_i^\top G_{N_\ell}\| \le \sqrt{\mu^*/n} \cdot \nu$. Hence, $\rho(G_\ell)/\nu^2 \le \mu^*$. Further, we claim that $\|e_i^\top N X\|^2 \le (\mu^*/n)\, \sigma_k^2\, \|V^\top X_{\ell-1}\|^2$ for all $i \in [n]$, because
\[
\|e_i^\top N X\| \le \|e_i^\top V \Sigma_V\| \cdot \|V^\top X_{\ell-1}\| = \|e_i^\top N\| \cdot \|V^\top X_{\ell-1}\|.
\]
Here we used the fact that $\|e_i^\top N\|^2 = \|e_i^\top N V\|^2 + \|e_i^\top N U\|^2 = \|e_i^\top N V\|^2 = \|e_i^\top V \Sigma_V\|^2$. Using Equation 2, this shows that $\rho(N X_{\ell-1})/\nu^2 \le \mu^*$ and finishes the proof.

C Splitting up the subsample

We needed a procedure Split($\Omega$, $t$) that takes a sample $\Omega$ and splits it into $t$ independent samples that preserve the distributional assumption that we need. The next lemma is standard.

Lemma C.1. There is a procedure Split($\Omega$, $t$) such that if $\Omega$ is sampled by including each element independently with probability $p$, then Split($\Omega$, $t$) outputs independent random variables $\Omega_1, \dots$
, $\Omega_t$ such that each set $\Omega_i$ includes each element independently with probability $p_i \ge p/t$.

Proof sketch. Consider independent random samples $\Omega'_1, \dots, \Omega'_t$ where each set contains every element independently with probability $p/t$. Consider the multi-set $\Omega'$ obtained by taking the union of these sets (counting multiplicities). Each element occurs in $\Omega'$ at least once with probability $p' = 1 - (1 - p/t)^t \le p$. The multiplicity is distributed according to a binomial random variable. Hence, we can simulate the distribution of $\Omega'$ given the random sample $\Omega$ by subsampling so that each entry is included with probability $p'$ and then introducing multiplicities randomly according to the correct binomial distribution. On the other hand, given the random variable $\Omega'$ we can easily simulate $\Omega'_1, \dots, \Omega'_t$ by assigning each element present in $\Omega'$ with multiplicity $k$ to a uniformly random subset of $k$ out of the $t$ sets.

D Generalization to rectangular matrices

For our purposes it will suffice to consider symmetric square matrices. This follows from a simple transformation that preserves the coherence and the singular vectors of the matrix. Indeed, given a matrix $B \in \mathbb{R}^{m \times n}$ with $m \le n$ and singular value decomposition $B = \sum_{i=1}^r \sigma_i u_i v_i^\top$, we may consider the symmetric $(m+n) \times (m+n)$ matrix
\[
A = \begin{bmatrix} 0 & B \\ B^\top & 0 \end{bmatrix}.
\]
The matrix $A$ has the following properties: $A$ has rank $2 \cdot \mathrm{rank}(B)$ and singular values $\sigma_1(B), \dots, \sigma_r(B)$, each occurring with multiplicity two. The singular vectors corresponding to a singular value $\sigma_i$ are spanned by the vectors $\{(u_i, 0), (0, v_i)\}$. In particular, an algorithm that finds a rank-$2k$ approximation to $A$ also finds a rank-$k$ approximation to $B$ up to the same error. Moreover, let $\tilde{U}$ denote the space spanned by the top $2k$ singular vectors of $A$, and let $U$, respectively $V$, denote the space spanned by the top $k$ left, respectively right, singular vectors of $B$.
Then
\[
\mu(\tilde{U}) \le \frac{n+m}{2k} \left( \frac{\mu(U)\, k}{m} + \frac{\mu(V)\, k}{n} \right) \le \frac{n+m}{m} \max\left\{ \mu(U), \mu(V) \right\}.
\]
Note that we can assume that $(n+m)/m$ is constant by splitting $B$ into a sequence of $m \times O(m)$ matrices and recovering each matrix separately. It will also be important for us that we can turn a uniformly random subsample of $B$ into a uniformly random subsample of $A$. This is easily accomplished by splitting the sample into two equally sized halves, using one for $B$ and one for $B^\top$. The remaining quadrants of $A$ are $0$ and can be subsampled trivially at any given density.
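To illustrate the dilation of Appendix D numerically, here is a minimal sketch (in Python with NumPy; the matrix sizes, rank, and seed are illustrative choices, not from the paper) confirming that each singular value of $B$ appears twice among the singular values of $A$ and that $\mathrm{rank}(A) = 2 \cdot \mathrm{rank}(B)$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 3, 5, 2

# A random rank-r matrix B of shape m x n (m <= n, as in Appendix D).
B = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

# Symmetric (m+n) x (m+n) dilation A = [[0, B], [B^T, 0]].
A = np.block([[np.zeros((m, m)), B],
              [B.T, np.zeros((n, n))]])

sv_B = np.linalg.svd(B, compute_uv=False)
sv_A = np.linalg.svd(A, compute_uv=False)

# Each singular value of B occurs twice among the singular values of A.
doubled = np.sort(np.concatenate([sv_B, sv_B]))[::-1]
assert np.allclose(sv_A[:len(doubled)], doubled)

# A has rank 2 * rank(B).
assert np.linalg.matrix_rank(A) == 2 * np.linalg.matrix_rank(B)
```

A rank-$2k$ truncation of $A$ computed this way contains exactly the top-$k$ left and right singular spaces of $B$, which is what the reduction exploits.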
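Similarly, the Split procedure of Appendix C can be sketched in code. The following is a hypothetical implementation of the coupling in the proof of Lemma C.1 (the function name, the rejection-sampling step for the conditioned binomial, and the seeding are illustrative assumptions, not part of the paper):

```python
import random

def split(omega, p, t, seed=0):
    """Split a Bernoulli(p) sample `omega` into t parts, each containing
    every element (marginally) with probability q = p/t, following the
    simulation argument sketched in the proof of Lemma C.1."""
    rng = random.Random(seed)
    q = p / t
    # Probability that an element appears at least once in the union of
    # t independent Bernoulli(q) samples; note p_union <= t*q = p.
    p_union = 1.0 - (1.0 - q) ** t
    parts = [set() for _ in range(t)]
    for x in omega:
        # Thin omega from density p down to density p_union.
        if rng.random() >= p_union / p:
            continue
        # Draw the multiplicity k ~ Binomial(t, q) conditioned on k >= 1
        # (rejection sampling; fine for a sketch).
        k = 0
        while k == 0:
            k = sum(rng.random() < q for _ in range(t))
        # Assign the k copies to a uniformly random subset of k of the t parts.
        for i in rng.sample(range(t), k):
            parts[i].add(x)
    return parts

# Example: splitting a full sample (p = 1) into t = 4 parts; each part
# then holds roughly a 1/4 fraction of the elements.
parts = split(range(20000), p=1.0, t=4)
```

Each part is marginally a Bernoulli($p/t$) subsample of the ground set; the joint independence of the parts is exactly what the coupling argument of Lemma C.1 establishes.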