Learning Mixtures of Discrete Product Distributions using Spectral Decompositions
Authors: Prateek Jain, Sewoong Oh
Prateek Jain, Microsoft Research India, Bangalore (prajain@microsoft.com)
Sewoong Oh, Department of Industrial and Enterprise Systems Engineering, University of Illinois at Urbana-Champaign (swoh@illinois.edu)

Abstract

We study the problem of learning a distribution from samples, when the underlying distribution is a mixture of product distributions over discrete domains. This problem is motivated by several practical applications such as crowdsourcing, recommendation systems, and learning Boolean functions. The existing solutions either heavily rely on the fact that the number of mixtures is finite or have sample/time complexity that is exponential in the number of mixtures. In this paper, we introduce a polynomial time/sample complexity method for learning a mixture of r discrete product distributions over {1, 2, ..., ℓ}^n, for general ℓ and r. We show that our approach is consistent and further provide finite sample guarantees. We use recently developed techniques from tensor decompositions for moment matching. A crucial step in these approaches is to construct certain tensors with low-rank spectral decompositions. These tensors are typically estimated from the sample moments. The main challenge in learning mixtures of discrete product distributions is that the corresponding low-rank tensors cannot be obtained directly from the sample moments. Instead, we need to estimate a low-rank matrix using only off-diagonal entries, and estimate a tensor using a few linear measurements. We give an alternating minimization based method to estimate the low-rank matrix, and formulate the tensor estimation problem as a least-squares problem.

1 Introduction

Consider the following generative model for sampling from a mixture of product distributions over discrete domains.
We use r to denote the number of components in the mixture, ℓ to denote the size of the discrete output alphabet in each coordinate, and n to denote the total number of coordinates. Each sample belongs to one of the r components, and conditioned on its component q ∈ {1, ..., r}, the n-dimensional discrete sample y ∈ {1, ..., ℓ}^n is drawn from some distribution π_q. Precisely, the model is represented by the non-negative weights of the components w = [w_1 ... w_r] ∈ R^r that sum to one, and the r distributions Π = [π_1 ... π_r] ∈ R^{ℓn×r}. We use an ℓn-dimensional binary random vector x to represent a sample y. For x = [x_1 ... x_n] ∈ {0,1}^{ℓn}, the i-th coordinate x_i ∈ {0,1}^ℓ is an ℓ-dimensional binary random vector such that x_i = e_j if and only if y_i = j, where e_j for j ∈ {1, ..., ℓ} is the standard coordinate basis vector. When a sample is drawn, the type of the sample is drawn from w = [w_1 ... w_r] such that it has type q with probability w_q. Conditioned on this type, the sample is distributed according to π_q ∈ R^{ℓn}, such that the y_i's are independent (hence it is a product distribution), with (π_q)_{(i,j)} = P(y_i = j | y belongs to component q), where (π_q)_{(i,j)} is the ((i−1)ℓ + j)-th entry of the vector π_q. Note that using the binary encoding, E[x | its type is q] = π_q, and E[x] = Σ_q w_q π_q. Also, we let π^{(i)} ∈ R^{ℓ×r} represent the distribution in the i-th coordinate, such that π^{(i)}_{j,q} = (π_q)_{(i,j)} = P(y_i = j | y belongs to component q). Then, the discrete distribution can be represented by the matrix Π = [π^{(1)}; π^{(2)}; ...; π^{(n)}] ∈ R^{ℓn×r} and the weights w = [w_1, ..., w_r].
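The generative model above is straightforward to simulate. The following is a minimal sketch; the helper name sample_mixture and the toy parameters (r = 2, ℓ = 3, n = 4) are ours, not part of the paper.

```python
import numpy as np

def sample_mixture(w, Pi, n, ell, size, seed=None):
    """Draw `size` samples in the binary encoding x described above.

    w   : (r,) mixture weights summing to one.
    Pi  : (ell*n, r) matrix; rows i*ell:(i+1)*ell of column q hold pi^(i)_q.
    """
    rng = np.random.default_rng(seed)
    r = len(w)
    X = np.zeros((size, ell * n))
    for t in range(size):
        q = rng.choice(r, p=w)                    # draw the type from w
        for i in range(n):
            block = Pi[i * ell:(i + 1) * ell, q]  # pi^(i)_q, an ell-vector
            j = rng.choice(ell, p=block)          # draw y_i
            X[t, i * ell + j] = 1.0               # set x_i = e_j
    return X

# Toy instance: r = 2 components, ell = 3 labels, n = 4 coordinates.
rng = np.random.default_rng(0)
w = np.array([0.6, 0.4])
Pi = rng.dirichlet(np.ones(3), size=(4, 2)).transpose(0, 2, 1).reshape(12, 2)
X = sample_mixture(w, Pi, n=4, ell=3, size=1000, seed=1)
print(X.mean(axis=0))   # should approach E[x] = sum_q w_q pi_q
```

Each row of X contains exactly n ones, one per coordinate block, matching the one-hot encoding x_i = e_j.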
This mixture distribution (of ℓ-wise discrete distributions over product spaces) captures as special cases the models used in several problems in domains such as crowdsourcing [DS79], genetics [SRH07], and recommendation systems [TM10]. For example, in the crowdsourcing application, this model is the same as the popular Dawid and Skene [DS79] model: x_i represents the answer of the i-th worker to a multiple choice question (or task) of type q ∈ [r]. Given the ground truth label q, each of the workers is assumed to answer independently. The goal is to find out the "quality" of the workers (i.e., learn Π) and/or to learn the type of each question (clustering). We are interested in the following two closely related problems:
• Learn the mixture parameters {π_q}_{q∈{1,...,r}} and {w_q}_{q∈{1,...,r}} accurately and efficiently.
• Cluster the samples accurately and efficiently.
Historically, however, different algorithms have been proposed depending on which question is addressed. Also, for each of the problems, distinct measures of performance have been used to evaluate the proposed solution. In this paper, we propose an efficient method to address both questions. The first question, estimating the underlying parameters of the mixture components, has been addressed in [KMR+94, FM99, FOS08], where the error of a given algorithm is measured as the KL-divergence between the true distribution and the estimated distribution.
More precisely, a mixture learning algorithm is said to be an accurate learning algorithm if it outputs a mixture of product distributions such that the following holds with probability at least 1 − δ:

D_KL(X || X̂) ≡ Σ_x P(X = x) log( P(X = x) / P(X̂ = x) ) ≤ ε,

where ε, δ ∈ (0, 1) are any given constants, and X, X̂ ∈ {0,1}^{nℓ} denote the random vectors distributed according to the true and the estimated mixture distribution, respectively. Furthermore, the algorithm is said to be efficient if its time complexity is polynomial in n, r, ℓ, 1/ε, and log(1/δ). This Probably Approximately Correct (PAC) style framework was first introduced by Kearns et al. [KMR+94], where they provided the first analytical result for a simpler problem of learning mixtures of Hamming balls, which is a special case of our model with ℓ = 2. However, the running time of the proposed algorithm is super-polynomial, O((n/δ)^{log r}), and it also assumes that one can obtain the exact probability of a sample y. Freund and Mansour [FM99] were the first to address the sample complexity, but for the restrictive case of r = 2 and ℓ = 2. For this case, their method has running time O(n^{3.5} log^3(1/δ)/ε^5) and sample complexity O(n^2 log(1/δ)/ε^2). Feldman, O'Donnell, and Servedio in [FOS08] generalized the approach of [FM99] to an arbitrary number of types r and an arbitrary number of output labels ℓ. For general ℓ, their algorithm requires running time scaling as O((nℓ^ℓ/ε)^{r^3}). Hence, the proposed algorithm is an efficient learning algorithm only for finite values of r = O(1) and ℓ = O(1).
A breakthrough in Feldman et al.'s result is that it holds for all problem instances, with no dependence on the minimum weight w_min or the condition number σ_1(ΠW^{1/2})/σ_r(ΠW^{1/2}), where σ_i(ΠW^{1/2}) is the i-th singular value of ΠW^{1/2}, and W is the r×r diagonal matrix with the weights w on the diagonal. However, this comes at the cost of running time scaling exponentially in both r^3 and ℓ, which is unacceptable in practice for any value of r beyond two. Further, the running time is exponential for all problem instances, even when the problem parameters are well-behaved, with finite condition number. In this paper, we alleviate this issue by proposing an efficient algorithm for well-behaved mixture distributions. In particular, we give an algorithm with polynomial running time, and prove that it gives an ε-accurate estimate for any problem instance that satisfies the following two conditions: (a) the weight w_q is strictly positive for all q; and (b) the condition number σ_1(ΠW^{1/2})/σ_r(ΠW^{1/2}) is bounded as per the hypotheses of Theorem 3.3. The existence of an efficient learning algorithm for all problem instances and parameters remains an open problem, and it is conjectured in [FOS08] that "solving the mixture learning problem for any r = ω(1) would require a major breakthrough in learning theory".

                                              r, ℓ = O(1)                General r and ℓ
σ_1(ΠW^{1/2})/σ_r(ΠW^{1/2}) = poly(ℓ, r, n)   WAM [FOS08], Algorithm 1   Algorithm 1
General condition number                      WAM [FOS08]                Open

Table 1: Landscape of efficient learning algorithms

The second question, finding the clusters, has been addressed in [CHRZ07, CR08]. Chaudhuri et al. in [CHRZ07] introduced an iterative clustering algorithm, but their method is restricted to the case of a mixture of two product distributions with binary outputs, i.e., r = 2 and ℓ = 2.
Chaudhuri and Rao in [CR08] proposed a spectral method for general r, ℓ. However, for the algorithm to correctly recover the cluster of each sample w.h.p., the underlying mixture distribution should satisfy a certain 'spreading' condition. Moreover, the algorithm needs to know the parameters characterizing the 'spread' of the distribution, which typically are not available a priori. Although it is possible to estimate the mixture distribution once the samples are clustered, Chaudhuri et al. provide no guarantees for estimating the distribution. As is the case for the first problem, for clustering also we provide an efficient algorithm for general ℓ, r, under the assumption that the condition number of ΠW^{1/2} is bounded. This condition is not directly comparable with the spreading condition assumed in previous work. Our algorithm first estimates the mixture parameters and then uses the distance-based clustering method of [AK01].

Our method for estimating the mixture parameters is based on the moment matching technique from [AHK12, AGMS12]. Typically, second and third (and sometimes fourth) moments of the true distribution are estimated using the given samples. Then, using the spectral decomposition of the second moment, one develops certain whitening operators that reduce the higher-order moment tensors to orthogonal tensors. Such higher-order tensors are then decomposed using a power-method based technique [AGH+12] to obtain the required distribution parameters. While such a technique is generic and applies to several popular models [HK13, AGH+12], for many of the models the moments themselves constitute the "correct" intermediate quantity that can be used for whitening and tensor decomposition.
However, because there are dependencies in the ℓ-wise model (for example, x_1 to x_ℓ are correlated), the higher-order moments are "incomplete" versions of the intermediate quantities that we require (see (1), (2)). Hence, we need to complete these moments so as to use them for estimating the distribution parameters Π, W. Completion of the "incomplete" second moment can be posed as a low-rank matrix completion problem where the block-diagonal elements are missing. For this problem, we propose an alternating minimization based method and, borrowing techniques from the recent work of [JNS13], we prove that alternating minimization is able to complete the second moment exactly. We would like to note that our alternating minimization result also solves a generalization of the low-rank + diagonal decomposition problem of [SCPW12]. Moreover, unlike the trace-norm based method of [SCPW12], which in practice is computationally expensive, our method is efficient, requires only one Singular Value Decomposition (SVD) step, and is robust to noise as well. We reduce the completion of the "incomplete" third moment to a simple least squares problem that is also robust. Using techniques from our second moment completion method, we could analyze an alternating minimization method for the third moment as well. However, for the mixture problem we can exploit the structure to reduce the problem to an efficient least squares problem with a closed form solution. Next, we present our method (see Algorithm 1) that combines the estimates from the above mentioned steps to estimate the distribution parameters Π, W (see Theorem 3.2, Theorem 3.3). After estimating the model parameters Π and W, we also show that the KL-divergence measure and the clustering error measure are small.
In fact, the excess error vanishes as the number of samples grows (see Corollary 3.4, Corollary 3.5).

2 Related Work

Learning mixtures of distributions is an important problem with several applications such as clustering, crowdsourcing, community detection, etc. One of the most well studied problems in this domain is that of learning a mixture of Gaussians. There is a long list of interesting recent results, and discussing the literature in detail is outside the scope of this paper. Our approach is inspired by both spectral and moment-matching based techniques that have been successfully applied in learning a mixture of Gaussians [VW04, AK01, MV10, HK13]. Another popular mixture distribution arises in topic models, where each word x_i is selected from an ℓ-sized dictionary. Several recent results show that such a model can also be learned efficiently using spectral as well as moment based methods [RSS12, AHK12, AGM12]. However, there is a crucial difference between the general mixture of product distributions that we consider and the topic model distribution. Given a topic (or question) q, each of the words x_i in the topic model has exactly the same probability. That is, π^{(i)} = π for all i ∈ {1, ..., n}. In contrast, for our problem, π^{(i)} ≠ π^{(j)} for i ≠ j, in general. Learning mixtures of discrete distributions over product spaces has several practical applications such as crowdsourcing, recommendation systems, etc. However, as discussed in the previous section, most of the existing results for this problem are designed for the case of small alphabet size ℓ or small number of mixture components r. For several practical problems [KOS13], ℓ can be large, and hence existing methods either do not apply or are very inefficient.
In this work, we propose the first provably efficient method for learning a mixture of discrete distributions for general ℓ and r. Our method is based on tensor decomposition methods for moment matching that have recently been made popular for learning mixture distributions. For example, [HK13] provided a method to learn a mixture of Gaussians without any separation assumption. Similarly, [AHK12] introduced a method for learning mixtures of HMMs, and also for topic models. Using similar techniques, another interesting result has been obtained for the problem of independent component analysis (ICA) [AGMS12, GR12, HK13]. Typically, tensor decomposition methods proceed in two steps. First, obtain a whitening operator using the second moment estimates. Then, use this whitening operator to construct a tensor with an orthogonal decomposition, which reveals the true parameters of the distribution. However, in the mixture of ℓ-way distributions that we consider, the second and third moments do not reveal all the "required" entries, making it difficult to find the standard whitening operator. We handle this problem by posing it as a matrix completion problem and using an alternating minimization method to complete the second moment. Our proof for the alternating minimization method closely follows the analysis of [JNS13]. However, [JNS13] handled a matrix completion problem where the entries are missing uniformly at random, while in our case the block-diagonal elements are missing.

2.1 Notation

Typically, we denote a matrix or a tensor by an upper-case letter (e.g., M) while a vector is denoted by a lower-case letter (e.g., v). M_i denotes the i-th column of matrix M. M_{ij} denotes the (i, j)-th entry of matrix M, and M_{ijk} denotes the (i, j, k)-th entry of the third order tensor M. A^T denotes the transpose of matrix A, i.e., (A^T)_{ij} = A_{ji}. [k] = {1, ...
, k} denotes the set of the first k integers. e_i denotes the i-th standard basis vector. If M ∈ R^{ℓn×d}, then M^{(m)} (1 ≤ m ≤ n) denotes the m-th block of M, i.e., the ((m−1)ℓ+1)-th to (mℓ)-th rows of M. The operator ⊗ denotes the outer product. For example, H = v_1 ⊗ v_2 ⊗ v_3 denotes a rank-one tensor such that H_{abc} = (v_1)_a · (v_2)_b · (v_3)_c. For a symmetric third-order tensor T ∈ R^{d×d×d}, define an r×r×r dimensional operation with respect to a matrix R ∈ R^{d×r} as

T[R, R, R] ≡ Σ_{j_1,j_2,j_3 ∈ [r]} Σ_{i_1,i_2,i_3 ∈ [d]} T_{i_1,i_2,i_3} R_{i_1,j_1} R_{i_2,j_2} R_{i_3,j_3} (e_{j_1} ⊗ e_{j_2} ⊗ e_{j_3}).

‖A‖ = ‖A‖_2 denotes the spectral norm of a tensor A, that is, ‖A‖_2 = max_{x: ‖x‖=1} A[x, ..., x]. ‖A‖_F denotes the Frobenius norm of A, i.e., ‖A‖_F = √(Σ_{i_1,i_2,...,i_p} A^2_{i_1 i_2 ... i_p}). We use M = U Σ V^T to denote the singular value decomposition (SVD) of M, where σ_r(M) denotes the r-th singular value of M. Also, w.l.o.g., we assume that σ_1 ≥ σ_2 ≥ ··· ≥ σ_r.

3 Main results

In this section, we present our main results for estimating the mixture weights w_q, 1 ≤ q ≤ r, and the probability matrix Π of the mixture distribution. Our estimation method is based on the moment-matching technique that has been popularized by several recent results [AHK12, HKZ12, HK13, AGH+12]. However, our method differs from the existing methods in the following crucial aspects: we propose (a) a matrix completion approach to estimate the second moments from samples (Algorithm 2); and (b) a least squares approach with an appropriate change of basis to estimate the third moments from samples (Algorithm 3). These approaches provide robust algorithms for estimating the moments and might be of independent interest to a broad range of applications in the domain of learning mixture distributions.
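The multilinear operation T[R, R, R] defined in the notation above is a single three-mode contraction. A minimal numpy sketch (the helper name multilinear is ours), checked on a rank-one tensor where the identity (v ⊗ v ⊗ v)[R, R, R] = (R^T v) ⊗ (R^T v) ⊗ (R^T v) holds by definition:

```python
import numpy as np

def multilinear(T, R):
    """Compute T[R, R, R]: contract each mode of the d x d x d tensor T
    with the d x r matrix R, yielding an r x r x r tensor."""
    return np.einsum('abc,ai,bj,ck->ijk', T, R, R, R)

# Sanity check on a rank-one tensor.
rng = np.random.default_rng(0)
d, r = 6, 3
v = rng.standard_normal(d)
R = rng.standard_normal((d, r))
T = np.einsum('a,b,c->abc', v, v, v)       # v ⊗ v ⊗ v
u = R.T @ v
expected = np.einsum('a,b,c->abc', u, u, u)
assert np.allclose(multilinear(T, R), expected)
```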
The key step in our method is estimation of the following two quantities:

M_2 ≡ Σ_{q∈[r]} w_q π_q ⊗ π_q = Π W Π^T ∈ R^{ℓn×ℓn},    (1)
M_3 ≡ Σ_{q∈[r]} w_q π_q ⊗ π_q ⊗ π_q ∈ R^{ℓn×ℓn×ℓn},    (2)

where W is a diagonal matrix such that W_{qq} = w_q. Now, as is standard in moment based methods, we exploit the spectral structure of M_2, M_3 to recover the latent parameters Π and W. The following theorem presents a method for estimating Π, W, assuming M_2, M_3 are estimated exactly:

Theorem 3.1. Let M_2, M_3 be as defined in (1), (2). Also, let M_2 = U_{M_2} Σ_{M_2} U_{M_2}^T be the eigenvalue decomposition of M_2. Now, define G = M_3[U_{M_2} Σ_{M_2}^{-1/2}, U_{M_2} Σ_{M_2}^{-1/2}, U_{M_2} Σ_{M_2}^{-1/2}]. Let V_G = [v_1^G v_2^G ... v_r^G] ∈ R^{r×r} and λ_q^G, 1 ≤ q ≤ r, be the eigenvectors and eigenvalues obtained by the orthogonal tensor decomposition of G (see [AGH+12]), i.e., G = Σ_{q=1}^r λ_q^G (v_q^G ⊗ v_q^G ⊗ v_q^G). Then,

Π = U_{M_2} Σ_{M_2}^{1/2} V_G Λ_G, and W = (Λ_G)^{-2},

where Λ_G ∈ R^{r×r} is a diagonal matrix with (Λ_G)_{qq} = λ_q^G.

The above theorem reduces the problem of estimating the mixture parameters Π, W to that of estimating M_2 and M_3. Typically, in moment based methods, the tensors corresponding to M_2 and M_3 can be obtained directly from the second and third moments of the distribution, which can be estimated efficiently using the provided data samples. In our problem, however, the block-diagonal entries of M_2 and M_3 cannot be directly computed from these sample moments. For example, the expected value of a diagonal entry at the j-th coordinate is E[xx^T]_{j,j} = E[x_j] = Σ_{q∈[r]} w_q Π_{j,q}, whereas the corresponding entry of M_2 is (M_2)_{j,j} = Σ_{q∈[r]} w_q (Π_{j,q})^2. To recover these unknown ℓ×ℓ block-diagonal entries of M_2, we use an alternating minimization algorithm.
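To see why the whitening in Theorem 3.1 works: with R = U_{M_2} Σ_{M_2}^{-1/2}, the vectors √(w_q) R^T π_q are orthonormal, so G = Σ_q (1/√w_q) (√w_q R^T π_q)^{⊗3} is an orthogonal tensor with eigenvalues λ_q = 1/√w_q. A small numerical check of this fact (the toy Π and w are ours; Π here is a generic full-rank matrix, not necessarily a probability matrix):

```python
import numpy as np

rng = np.random.default_rng(3)
ln, r = 12, 3                        # ambient dimension ln, rank r
Pi = rng.standard_normal((ln, r))    # stand-in for the parameter matrix
w = np.array([0.5, 0.3, 0.2])

M2 = Pi @ np.diag(w) @ Pi.T          # (1): rank-r, symmetric PSD
eigval, eigvec = np.linalg.eigh(M2)
U, S = eigvec[:, -r:], eigval[-r:]   # top-r eigenpairs
R = U / np.sqrt(S)                   # whitening operator U * Sigma^{-1/2}

V = np.sqrt(w) * (R.T @ Pi)          # columns v_q = sqrt(w_q) R^T pi_q
assert np.allclose(V.T @ V, np.eye(r), atol=1e-8)   # orthonormal after whitening

# Whitened third moment G = sum_q w_q (R^T pi_q)^{⊗3} = sum_q w_q^{-1/2} v_q^{⊗3}:
M3 = np.einsum('q,aq,bq,cq->abc', w, Pi, Pi, Pi)
G = np.einsum('abc,ai,bj,ck->ijk', M3, R, R, R)
G_check = np.einsum('q,aq,bq,cq->abc', 1 / np.sqrt(w), V, V, V)
assert np.allclose(G, G_check, atol=1e-8)
```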
Our algorithm writes M_2 in a bi-linear form and solves for each factor of the bi-linear form using the computed off-diagonal blocks of M_2. We then prove that this algorithm exactly recovers the missing entries when we are given the exact second moment. For estimating M_3, we reduce the problem of estimating the unknown block-diagonal entries of M_3 to a least squares problem that can be solved efficiently. Concretely, to get a consistent estimate of M_2, we pose it as a matrix completion problem, where we use the off-block-diagonal entries of the second moment, which we know are consistent, to estimate the missing entries. Precisely, let

Ω_2 ≡ { (i, j) ∈ [ℓn] × [ℓn] | ⌈i/ℓ⌉ ≠ ⌈j/ℓ⌉ }

be the indices of the off-block-diagonal entries, and define a masking operator as:

P_{Ω_2}(A)_{i,j} ≡ A_{i,j} if (i, j) ∈ Ω_2, and 0 otherwise.    (3)

Now, using the fact that M_2 has rank at most r, we find a rank-r estimate that explains the off-block-diagonal entries using the alternating minimization algorithm defined in Section 4:

M̂_2 ≡ MatrixAltMin( (2/|S|) Σ_{t∈[|S|/2]} x_t x_t^T, Ω_2, r, T ),    (4)

Algorithm 1 Spectral-Dist: Moment method for Mixture of Discrete Distributions
1: Input: Samples {x_t}_{t∈S}
2: M̂_2 ← MatrixAltMin( (2/|S|) Σ_{t∈[|S|/2]} x_t x_t^T, Ω_2, r, T ) (see Algorithm 2)
3: Compute the eigenvalue decomposition M̂_2 = Û_{M_2} Σ̂_{M_2} Û_{M_2}^T
4: Ĝ ← TensorLS( (2/|S|) Σ_{t=|S|/2+1}^{|S|} x_t ⊗ x_t ⊗ x_t, Ω_3, Û_{M_2}, Σ̂_{M_2} ) (see Algorithm 3)
5: Compute a rank-r orthogonal tensor decomposition Σ_{q∈[r]} λ̂_q^G (v̂_q^G ⊗ v̂_q^G ⊗ v̂_q^G) of Ĝ, using the robust power method of [AGH+12]
6: Output: Π̂ = Û_{M_2} Σ̂_{M_2}^{1/2} V̂_G Λ̂_G, Ŵ = (Λ̂_G)^{-2}, where (V̂_G)^T = [v̂_1^G ... v̂_r^G]

where {x_1, ..., x_{|S|}} is the set of observed samples, and T is the number of iterations.
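The masking operator P_{Ω_2} in (3) simply zeroes the ℓ×ℓ diagonal blocks. A minimal sketch (the helper name mask_offblock is ours):

```python
import numpy as np

def mask_offblock(A, ell):
    """P_Omega2: zero out the ell x ell diagonal blocks of the ln x ln matrix A."""
    ln = A.shape[0]
    out = A.copy()
    for m in range(ln // ell):
        out[m*ell:(m+1)*ell, m*ell:(m+1)*ell] = 0.0
    return out

A = np.arange(16.0).reshape(4, 4)
B = mask_offblock(A, ell=2)        # the two 2x2 diagonal blocks are zeroed
assert B[0, 1] == 0 and B[2, 3] == 0 and B[0, 2] == A[0, 2]
```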
We use the first half of the samples to estimate M_2 and the rest to estimate the third-order tensor. Similarly, for the tensor M_3, the sample third moment does not converge to M_3. However, the off-block-diagonal entries do converge to the corresponding entries of M_3. That is, let

Ω_3 ≡ { (i, j, k) ∈ [ℓn] × [ℓn] × [ℓn] | ⌈i/ℓ⌉, ⌈j/ℓ⌉, ⌈k/ℓ⌉ pairwise distinct }

be the indices of the off-block-diagonal entries, and define the following masking operator:

P_{Ω_3}(A)_{i,j,k} ≡ A_{i,j,k} if (i, j, k) ∈ Ω_3, and 0 otherwise.    (5)

Then, we have consistent estimates for P_{Ω_3}(M_3) from the sample third moment. Now, in the case of M_3, we do not explicitly compute M_3. Instead, we estimate the r×r×r dimensional tensor G̃ ≡ M_3[Û_{M_2} Σ̂_{M_2}^{-1/2}, Û_{M_2} Σ̂_{M_2}^{-1/2}, Û_{M_2} Σ̂_{M_2}^{-1/2}] (cf. Theorem 3.1), using a least squares formulation that uses only the off-diagonal blocks P_{Ω_3}(M_3). That is,

Ĝ ≡ TensorLS( (2/|S|) Σ_{t=|S|/2+1}^{|S|} x_t ⊗ x_t ⊗ x_t, Ω_3, Û_{M_2}, Σ̂_{M_2} ),

where M̂_2 = Û_{M_2} Σ̂_{M_2} Û_{M_2}^T is the singular value decomposition of the rank-r matrix M̂_2. After estimating Ĝ, similarly to Theorem 3.1, we use whitening and tensor decomposition to estimate Π, W. See Algorithm 1 for a pseudo-code of our approach.

Remark: Note that we use a new set of |S|/2 samples to estimate the third moment. This sub-sampling helps us in our analysis, as it ensures independence of the samples x_{|S|/2+1}, ..., x_{|S|} from the output of the alternating minimization step (4).

The next theorem shows that the moment matching approach (Algorithm 1) is consistent. Let Ŵ = diag([ŵ_1, ..., ŵ_r]) and Π̂ = [π̂_1, ..., π̂_r] denote the estimates obtained using Algorithm 1. Also, let µ denote the block-incoherence of M_2 = Π W Π^T as defined in (7). Theorem 3.2.
Assume that the sample second and third moments are exact, i.e., P_{Ω_2}((2/|S|) Σ_{t∈[|S|/2]} x_t x_t^T) = P_{Ω_2}(M_2) and P_{Ω_3}((2/|S|) Σ_{t=|S|/2+1}^{|S|} x_t ⊗ x_t ⊗ x_t) = P_{Ω_3}(M_3). Also, let T = ∞ for the MatrixAltMin procedure and let n ≥ C σ_1(M_2)^5 µ^5 r^{3.5} / σ_r(M_2)^5, for a global constant C > 0. Then, there exists a permutation P over [r] such that, for all q ∈ [r], π_q = π̂_{P(q)} and w_q = ŵ_{P(q)}.

We now provide a finite sample version of the above theorem.

Theorem 3.3 (Finite sample bound). There exist positive constants C_0, C_1, C_2, C_3 and a permutation P on [r] such that if n ≥ C_0 σ_1(M_2)^{4.5} µ^4 r^{3.5} / σ_r(M_2)^{4.5}, then for any ε_M ≤ C_1 √(r + ℓ) and for a large enough sample size,

|S| ≥ C_2 µ^6 r^6 σ_1(M_2)^6 n^3 log(n/δ) / ( w_min σ_r(M_2)^9 ε_M^2 ),

the following holds for all q ∈ [r], with probability at least 1 − δ:

|ŵ_{P(q)} − w_q| ≤ ε_M,   ‖π̂_{P(q)} − π_q‖ ≤ ε_M √( r w_max σ_1(M_2) / w_min ).

Further, Algorithm 1 runs in time poly(n, ℓ, r, 1/ε, log(1/δ), 1/w_min, σ_1(M_2)/σ_r(M_2)).

Note that the π̂_i's and ŵ_i's estimated using Algorithm 1 do not necessarily define a valid probability measure: they can take negative values and might not sum to one. We can process the estimates further to get a valid probability distribution, and show that the estimated mixture distribution is close in Kullback-Leibler divergence to the original one. Let ε_w = C_3 ε_M / √w_min. We first set w̃'_q = ŵ_q if ŵ_q ≥ ε_w, and w̃'_q = ε_w if ŵ_q < ε_w, and then set the mixture weights w̃_q = w̃'_q / Σ_{q'} w̃'_{q'}.
Similarly, let ε_π = C_3 ε_M √( σ_1(M_2) r (1 + ε_M σ_r(M_2)) / w_min ) and set π̃'^{(j)}_{q,p} = π̂^{(j)}_{q,p} if π̂^{(j)}_{q,p} ≥ ε_π, and π̃'^{(j)}_{q,p} = ε_π if π̂^{(j)}_{q,p} < ε_π, for all q ∈ [r], p ∈ [ℓ], and j ∈ [n], and normalize to get valid distributions π̃^{(j)}_{q,p} = π̃'^{(j)}_{q,p} / Σ_{p'} π̃'^{(j)}_{q,p'}. Let X̂ denote a random vector in {0,1}^{ℓn} obtained by first selecting a random type q with probability w̃_q and then drawing a random vector according to π̃_q.

Corollary 3.4 (KL-divergence bound). Under the hypotheses of Theorem 3.3, there exists a positive constant C such that if |S| ≥ C n^7 r^7 µ^6 σ_1(M_2)^7 ℓ^{12} w_max log(n/δ) / ( σ_r(M_2)^9 η^6 w_min^2 ), then Algorithm 1 with the above post-processing produces an r-mixture distribution X̂ that, with probability at least 1 − δ, satisfies D_KL(X || X̂) ≤ η.

Moreover, we can show that the "type" of each data point can also be recovered accurately.

Corollary 3.5 (Clustering bound). Define

ε̃ ≡ max_{i,j∈[r]} { ‖π_i − π_j‖^2 − (2‖Π‖_F / r^{1/2}) √(2 log(r/δ)) ( ‖π_i − π_j‖ + 2√(2 log(r/δ)) ) }.

Under the hypotheses of Theorem 3.3, there exists a positive numerical constant C such that if ε̃ > 0 and |S| ≥ C µ^6 r^7 n^3 σ_1(M_2)^7 w_max log(n/δ) / ( w_min^2 σ_r(M_2)^9 ε̃^2 ), then with probability at least 1 − δ, the distance based clustering algorithm of [AK01] computes a correct clustering of the samples.
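The post-processing described above (clip the raw estimates from below at a threshold, then renormalize) is mechanical; a minimal sketch with illustrative thresholds and raw estimates of our own choosing:

```python
import numpy as np

def project_weights(w_hat, eps_w):
    """Clip the estimated weights from below at eps_w, then renormalize."""
    w = np.maximum(w_hat, eps_w)
    return w / w.sum()

def project_coordinates(Pi_hat, ell, eps_pi):
    """Clip every entry of each coordinate distribution pi^(j)_q from below
    at eps_pi, then renormalize each ell-block of each column to sum to one."""
    ln = Pi_hat.shape[0]
    P = np.maximum(Pi_hat, eps_pi)
    for j in range(ln // ell):
        blk = P[j*ell:(j+1)*ell, :]
        P[j*ell:(j+1)*ell, :] = blk / blk.sum(axis=0, keepdims=True)
    return P

w_hat = np.array([0.55, 0.42, -0.03])              # raw estimates may be negative
w = project_weights(w_hat, eps_w=0.01)
Pi_hat = np.array([[0.72, -0.05], [0.35, 1.08]])   # one coordinate (n = 1), ell = r = 2
P = project_coordinates(Pi_hat, ell=2, eps_pi=0.02)
print(w, P)
```

After projection, w is a valid weight vector and each ℓ-block of each column of P is a valid distribution.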
Algorithm 2 MatrixAltMin: Alternating Minimization for Matrix Completion
1: Input: S_2 = (2/|S|) Σ_{t∈{1,...,|S|/2}} x_t x_t^T, Ω_2, r, T
2: Initialize the ℓn×r dimensional matrix U_0 ← top-r eigenvectors of P_{Ω_2}(S_2)
3: for τ = 0 to T − 1 do
4:   Û_{τ+1} = argmin_U ‖P_{Ω_2}(S_2) − P_{Ω_2}(U U_τ^T)‖_F^2
5:   [U_{τ+1}, R_{τ+1}] = QR(Û_{τ+1}) (standard QR decomposition)
6: end for
7: Output: M̂_2 = Û_T (U_{T−1})^T

Algorithm 3 TensorLS: Least Squares method for Tensor Estimation
1: Input: S_3 = (2/|S|) Σ_{t∈{|S|/2+1,...,|S|}} (x_t ⊗ x_t ⊗ x_t), Ω_3, Û_{M_2}, Σ̂_{M_2}
2: Define the operator ν̂ : R^{r×r×r} → R^{ℓn×ℓn×ℓn} as follows:

ν̂_{ijk}(Z) = Σ_{abc} Z_{abc} (Û_{M_2} Σ̂_{M_2}^{1/2})_{ia} (Û_{M_2} Σ̂_{M_2}^{1/2})_{jb} (Û_{M_2} Σ̂_{M_2}^{1/2})_{kc} if (i, j, k) ∈ Ω_3, and 0 otherwise.    (6)

3: Define Â : R^{r×r×r} → R^{r×r×r} such that Â(Z) = ν̂(Z)[Û_{M_2} Σ̂_{M_2}^{-1/2}, Û_{M_2} Σ̂_{M_2}^{-1/2}, Û_{M_2} Σ̂_{M_2}^{-1/2}]
4: Output: Ĝ = argmin_Z ‖Â(Z) − P_{Ω_3}(S_3)[Û_{M_2} Σ̂_{M_2}^{-1/2}, Û_{M_2} Σ̂_{M_2}^{-1/2}, Û_{M_2} Σ̂_{M_2}^{-1/2}]‖_F^2

4 Algorithm

In this section, we describe the proposed approach in detail and provide finite sample performance guarantees for each of its components: MatrixAltMin and TensorLS. These results are crucial in proving the finite sample bound in Theorem 3.3. As mentioned in the previous section, the algorithm first estimates M_2 using the alternating minimization procedure. Recall that the second moment of the data, given by S_2, cannot estimate the block-diagonal entries of M_2. That is, even in the case of infinite samples, we only have consistency in the off-block-diagonal entries: P_{Ω_2}(S_2) = P_{Ω_2}(M_2). However, to apply the "whitening" operator to the third order tensor (see Theorem 3.1), we need to estimate M_2.
In general it is not possible to estimate M_2 from P_{Ω_2}(M_2), as one can fill in arbitrary values for the block-diagonal entries. Fortunately, we can avoid such a case since M_2 is guaranteed to be of rank r ≪ ℓn. However, even a low-rank assumption is not enough to recover M_2. For example, if M_2 = e_1 e_1^T, then P_{Ω_2}(M_2) = 0 and one cannot recover M_2. Hence, we make an additional standard assumption that M_2 is µ-block-incoherent, where a symmetric rank-r matrix A with singular value decomposition A = U S V^T is µ-block-incoherent if the operator norm of every ℓ×r block of U is bounded as

‖U^{(i)}‖_2 ≤ µ √(r/n), for all i ∈ [n],    (7)

where U^{(i)} is the ℓ×r submatrix of U formed by the ((i−1)ℓ+1)-th through (iℓ)-th rows. For a given matrix M, the smallest value of µ that satisfies the above condition is referred to as the block-incoherence of M. Now, assuming that M_2 satisfies these two assumptions, rank r ≪ ℓn and µ-block-incoherence, we provide an alternating minimization method that provably recovers M_2. In particular, we model M_2 explicitly using a bi-linear form M_2 = Û^{(t+1)} (U^{(t)})^T with variables Û^{(t+1)} ∈ R^{ℓn×r} and U^{(t)} ∈ R^{ℓn×r}. We iteratively solve for Û^{(t+1)} for fixed U^{(t)}, and use a QR decomposition to orthonormalize Û^{(t+1)} to get U^{(t+1)}. Note that the QR decomposition is not required for our method; we use it only for ease of analysis. Below, we give the precise recovery guarantee for the alternating minimization method (Algorithm 2).

Theorem 4.1 (Matrix completion using alternating minimization). For an ℓn×ℓn symmetric rank-r matrix M with block-incoherence µ, we observe the off-block-diagonal entries corrupted by noise:

M̂_{ij} = M_{ij} + E_{ij} if ⌈i/ℓ⌉ ≠ ⌈j/ℓ⌉, and 0 otherwise.
Let M̂^{(τ)} denote the output after τ iterations of MatrixAltMin. If µ ≤ (σ_r(M)/σ_1(M)) √( n/(32 r^{1.5}) ), the noise is bounded by ‖P_{Ω_2}(E)‖_2 ≤ σ_r(M)/(32√r), and each column of the noise is bounded by ‖P_{Ω_2}(E)_i‖ ≤ σ_1(M) µ √( 3r/(8nℓ) ) for all i ∈ [nℓ], then after τ ≥ (1/2) log_2( ‖M‖_F/ε ) iterations of MatrixAltMin, the estimate M̂^{(τ)} satisfies

‖M − M̂^{(τ)}‖_2 ≤ ε + ( 9‖M‖_F √r / σ_r(M) ) ‖P_{Ω_2}(E)‖_2,

for any ε ∈ (0, 1). Further, M̂^{(τ)} is µ_1-incoherent with µ_1 = 6µ σ_1(M_2)/σ_r(M_2).

For estimating M_2, the noise E in the off-block-diagonal entries is due to insufficient sample size. We can precisely bound the sampling noise in the following lemma.

Lemma 4.2. Let S_2 = (2/|S|) Σ_{t∈{1,...,|S|/2}} x_t x_t^T be the sample covariance matrix. Also, let E = P_{Ω_2}(S_2) − P_{Ω_2}(M_2). Then,

‖E‖_2 ≤ 8 √( n^2 log(nℓ/δ) / |S| ).

Moreover, ‖E_i‖_2 ≤ 8 √( n log(1/δ) / |S| ), for all i ∈ [nℓ].

The above theorem shows that M_2 can be recovered exactly from infinitely many samples, if n ≥ µ^2 σ_1(M)^2 r^{1.5} / σ_r(M)^2. Furthermore, using Lemma 4.2, M_2 can be recovered approximately, with sample size |S| = O( n^2 (ℓ + r) / σ_r(M)^2 ). Now, recovering M_2 = Π W Π^T recovers the left singular space of Π, i.e., range(U). However, we still need to recover W and the right singular space of Π, i.e., range(V). To this end, we could estimate the tensor M_3, "whiten" it using Û_{M_2} Σ̂_{M_2}^{-1/2} (recall that M̂_2 = Û_{M_2} Σ̂_{M_2} Û_{M_2}^T), and then use tensor decomposition techniques to solve for V, W. However, we show that estimating M_3 is not necessary; we can directly estimate the "whitened" tensor by solving a system of linear equations.
In particular, we design an operator $\widehat{A}: \mathbb{R}^{r\times r\times r} \to \mathbb{R}^{r\times r\times r}$ such that $\widehat{A}(\widetilde{G}) \approx P_{\Omega_3}(S_3)[\widehat{U}_{M_2}\widehat{\Sigma}_{M_2}^{-1/2}, \widehat{U}_{M_2}\widehat{\Sigma}_{M_2}^{-1/2}, \widehat{U}_{M_2}\widehat{\Sigma}_{M_2}^{-1/2}]$, where
$$ \widetilde{G} \equiv \sum_{q\in[r]} \frac{1}{\sqrt{w_q}}\,(R_3 e_q \otimes R_3 e_q \otimes R_3 e_q), \quad\text{and}\quad R_3 \equiv \widehat{\Sigma}_{M_2}^{-1/2}\widehat{U}_{M_2}^T\,\Pi W^{1/2}. \qquad (8) $$
Moreover, we show that $\widehat{A}$ is nearly isometric. Hence, we can efficiently estimate $\widetilde{G}$ by solving the following least-squares problem:
$$ \widehat{G} = \arg\min_Z \big\|\widehat{A}(Z) - P_{\Omega_3}(S_3)[\widehat{U}_{M_2}\widehat{\Sigma}_{M_2}^{-1/2}, \widehat{U}_{M_2}\widehat{\Sigma}_{M_2}^{-1/2}, \widehat{U}_{M_2}\widehat{\Sigma}_{M_2}^{-1/2}]\big\|_F^2. \qquad (9) $$
Let $\mu$ and $\mu_1$ denote the block-incoherence of $M_2$ and $\widehat{M}_2$, respectively, as defined in (7).

Theorem 4.3. Let $\widetilde{G}$ and $\widehat{G}$ be as defined in (8) and (9), respectively. If $n \ge 144\, r^3\sigma_1(M_2)^2/\sigma_r(M_2)^2$, then the following holds with probability at least $1-\delta$:
$$ \|\widehat{G} - \widetilde{G}\|_F \le \frac{24\,\mu_1^3\mu\, r^{3.5}\sigma_1(M_2)^{3/2}}{n\sqrt{w_{\min}}\,\sigma_r(M_2)^{3/2}}\,\varepsilon_{M_2} + 2\big\|P_{\Omega_3}(M_3 - S_3)[\widehat{U}_{M_2}\widehat{\Sigma}_{M_2}^{-1/2}, \widehat{U}_{M_2}\widehat{\Sigma}_{M_2}^{-1/2}, \widehat{U}_{M_2}\widehat{\Sigma}_{M_2}^{-1/2}]\big\|_F, $$
where $\varepsilon_{M_2} \equiv \|\widehat{M}_2 - M_2\|_2/\sigma_r(M_2)$.

We can also bound the sampling noise of the third-order tensor in the following lemma.

Lemma 4.4. Let $S_3 = \frac{2}{|S|}\sum_{t\in\{|S|/2+1,\dots,|S|\}} (x_t\otimes x_t\otimes x_t)$. Then, there exists a positive numerical constant $C$ such that, with probability at least $1-\delta$,
$$ \big\|P_{\Omega_3}(M_3 - S_3)[\widehat{U}_{M_2}\widehat{\Sigma}_{M_2}^{-1/2}, \widehat{U}_{M_2}\widehat{\Sigma}_{M_2}^{-1/2}, \widehat{U}_{M_2}\widehat{\Sigma}_{M_2}^{-1/2}]\big\|_F \le \frac{C\, r^3\mu_1^3}{n^{3/2}\sigma_r(M_2)^{3/2}}\sqrt{\frac{\log(1/\delta)}{|S|}}. $$

Next, we apply the tensor decomposition method of [AGH+12] to decompose the obtained tensor $\widehat{G}$ and obtain $\widehat{R}_3$ and $\widehat{W}$, which approximate $R_3$ and $W$. We then use the estimates $\widehat{R}_3, \widehat{W}$ to estimate $\Pi$; see Algorithm 1 for the details.
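Since $\widehat{A}$ is linear and nearly isometric, the least-squares problem (9) reduces to an ordinary linear system in the $r^3$ vectorized entries of $Z$. A minimal sketch, assuming NumPy; representing $\widehat{A}$ explicitly as an $(r^3, r^3)$ matrix on vectorized tensors and the helper name are our choices for illustration:

```python
import numpy as np

def solve_whitened_tensor(A_mat, T):
    """Solve G_hat = argmin_Z ||A(Z) - T||_F^2, where the linear operator A
    acts on vectorized r x r x r tensors via the (r**3, r**3) matrix A_mat,
    and T is the whitened, off-block-diagonal third moment. Because A is
    nearly isometric, the system is well conditioned."""
    r = T.shape[0]
    z = np.linalg.lstsq(A_mat, T.reshape(-1), rcond=None)[0]
    return z.reshape(r, r, r)

# Demo: with a near-isometric A, the minimizer recovers the planted tensor.
rng = np.random.default_rng(1)
r = 3
Q = np.linalg.qr(rng.standard_normal((r**3, r**3)))[0]    # exact isometry
A_mat = Q + 0.01 * rng.standard_normal((r**3, r**3))      # near-isometry
G_true = rng.standard_normal((r, r, r))
T = (A_mat @ G_true.reshape(-1)).reshape(r, r, r)
G_hat = solve_whitened_tensor(A_mat, T)
err = np.linalg.norm(G_hat - G_true)
```

The near-isometry is what makes this step stable: a well-conditioned `A_mat` means small perturbations of the right-hand side produce comparably small perturbations of the recovered tensor.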
In particular, using Theorems 4.1 and 4.3, Algorithm 1 provides the following estimate for $\Pi$:
$$ \widehat{\Pi} = \widehat{U}_{M_2}\widehat{\Sigma}_{M_2}^{1/2}\widehat{R}_3\widehat{W}^{-1/2} \approx \widehat{U}_{M_2}\widehat{U}_{M_2}^T\,\Pi. $$
Now, $\|\widehat{\Pi} - \Pi\|_2$ can be bounded using the above equation along with the fact that $\mathrm{range}(\widehat{U}_{M_2}) \approx \mathrm{range}(\Pi)$. See Section A.6 for a detailed proof.

5 Applications in Crowdsourcing

Crowdsourcing has emerged as an effective paradigm for solving large-scale data-processing tasks in domains where humans have an advantage over computers. Examples include image classification, video annotation, data entry, optical character recognition, and translation. For tasks with discrete-choice outputs, one of the most widely used models is the Dawid-Skene model introduced in [DS79]: each expert $j$ is modeled through an $r\times r$ confusion matrix $\pi^{(j)}$, where $\pi^{(j)}_{pq}$ is the probability that the expert answers $q$ when the true label is $p$. This model was developed to study how different clinicians give different diagnoses, even when they are presented with the same medical chart. It is a special case, with $\ell = r$, of the mixture model studied in this paper.

Historically, a greedy algorithm based on Expectation-Maximization has been widely used for inference [DS79, SFB+95, HZ98, SPI08], but with no understanding of how the performance changes with the problem parameters and sample size. Recently, spectral approaches were proposed and analyzed with provable guarantees. For the simple case when there are only two labels, i.e., $r = \ell = 2$, Ghosh et al. [GKM11] and Karger et al. [KOS11a] analyzed a spectral approach of using the top singular vector for clustering under the Dawid-Skene model. The model studied in these works is a special case of our model with $r = \ell = 2$, $w = [1/2, 1/2]$, and
$$ \pi^{(j)} = \begin{pmatrix} p_j & 1-p_j \\ 1-p_j & p_j \end{pmatrix}. $$
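This binary special case is easy to simulate, and the spectral clustering idea analyzed in these works can be checked directly: one-hot encode each answer, form the empirical second moment, and classify each task by the sign of its projection onto a singular vector of that moment. A simulation sketch, assuming NumPy; the parameter values and function names are ours:

```python
import numpy as np

def simulate_binary_dawid_skene(n_workers, n_tasks, p, rng):
    """Draw one-hot-encoded answers from the symmetric binary Dawid-Skene
    model: worker j answers correctly with probability p[j]."""
    labels = rng.integers(0, 2, size=n_tasks)          # true labels in {0, 1}
    correct = rng.random((n_tasks, n_workers)) < p     # per-answer coin flips
    answers = np.where(correct, labels[:, None], 1 - labels[:, None])
    # One-hot encoding: coordinate (j, a) is 1 iff worker j answered a.
    X = np.zeros((n_tasks, 2 * n_workers))
    X[np.arange(n_tasks)[:, None], 2 * np.arange(n_workers) + answers] = 1.0
    return X, labels

def spectral_labels(X):
    """Cluster tasks by the sign of the projection onto the second
    singular vector of the empirical second moment S2 (the top one
    captures the near-constant mean direction)."""
    S2 = X.T @ X / X.shape[0]
    u2 = np.linalg.svd(S2)[0][:, 1]
    return (X @ u2 > 0).astype(int)

rng = np.random.default_rng(2)
X, labels = simulate_binary_dawid_skene(
    n_workers=40, n_tasks=400, p=np.full(40, 0.9), rng=rng)
pred = spectral_labels(X)
# The singular vector's sign is arbitrary, so align before scoring.
accuracy = max(np.mean(pred == labels), np.mean(pred != labels))
```

With reliable workers ($p_j = 0.9$) the projection separates the two clusters by a wide margin, so the sign rule labels nearly every task correctly.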
Let $q = (1/n)\sum_{j\in[n]} (2p_j - 1)^2$; then it follows that $\sigma_1(M_2) = n/2$ and $\sigma_2(M_2) = nq/2$. It was proved in [GKM11, KOS11a] that if we project each data point $x_i$ onto the second singular vector of the empirical second moment $S_2$ and make a decision based on the sign of this projection, we get good estimates, with the probability of misclassification scaling as $O(1/\sigma_r(M_2))$. More recently, Karger et al. [KOS11b] proposed a new approach based on a message-passing algorithm for computing the top singular vectors, and improved this misclassification bound to an exponentially decaying $O(e^{-C\sigma_r(M_2)})$ for some positive numerical constant $C$. However, these approaches rely heavily on the fact that there are only two ground-truth labels, and the algorithm and analysis cannot be generalized. These spectral approaches have been extended to general $r$ in [KOS13], with misclassification probability scaling as $O(r/\sigma_r(M_2))$, but this approach still uses the existing binary classification algorithms as a black box and tries to solve a series of binary classification tasks.

Furthermore, existing spectral approaches use $S_2$ directly for inference. This is not consistent: even if an infinite number of samples is provided, the empirical second moment does not converge to $M_2$. Instead, we use recent developments in matrix completion to recover $M_2$ from samples, thus providing a consistent estimator. Hence, we provide a robust clustering algorithm for crowdsourcing, together with estimates of the mixture distribution with provable guarantees. Corollary 3.5 shows that, with large enough samples, the misclassification probability of our approach scales as $O(re^{-C r\sigma_r(M_2)^2/n})$ for some positive constant $C$.
This is an exponential decay and a significant improvement over the known error bound of $O(r/\sigma_r(M_2))$.

6 Conclusion

We presented a method for learning a mixture of $\ell$-wise discrete distributions with distribution parameters $\Pi, W$. Our method shows that, assuming $n \ge C r^3\kappa^{4.5}$ and the number of samples $|S| \ge C_1 (n r^7\kappa^9\log(n/\delta))/(w_{\min}^2\varepsilon_\Pi^2)$, we have $\|\widehat{\Pi} - \Pi\|_2 \le \varepsilon_\Pi$, where $\kappa = \sigma_1(M_2)/\sigma_r(M_2)$ and $M_2 = \Pi W\Pi^T$. Note that our algorithm does not require any separability condition on the distribution, is consistent in the infinite-sample limit, and is robust to noise as well: our analysis extends easily to the noisy case, where there is a small amount of noise in each sample.

Our sample complexity bounds include the condition number $\kappa$ of the distribution, which implies that our method requires $\kappa$ to be at most $\mathrm{poly}(\ell, r)$. This makes our method unsuitable for the problem of learning Boolean functions [FOS08]. However, it is not clear whether it is possible to design an efficient algorithm with sample complexity independent of the condition number. We leave further study of the dependence of the sample complexity on the condition number as a topic for future research. Another drawback of our method is that $n$ is required to be $\Omega(r^3)$. We believe that such a condition is natural, as one cannot recover the distribution for $n = 1$. However, establishing a tight information-theoretic lower bound on $n$ (with respect to $\ell, r$) is still an open problem.

For the crowdsourcing application, the current error bound for clustering translates into $O(e^{-Cnq^2})$ when $r = 2$. This is not as strong as the best known error bound of $O(e^{-Cnq})$, since $q$ is always less than one.
The current analysis and algorithm for clustering need to be improved to get an error bound of $O(re^{-Cr\sigma_r(M_2)})$ for general $r$, so that it gives the optimal error rate in the special case $r = 2$. The sample complexity also depends on $1/w_{\min}$, which we believe is unnecessary. If there is a component with a small mixing weight, we should be able to ignore any component smaller than the sample noise level and still guarantee the same level of accuracy. To this end, we need an adaptive algorithm that detects the number of non-trivial components; this is a subject of future research. More fundamentally, all moment-matching methods based on spectral decompositions suffer from the same restrictions: the underlying tensors must have rank equal to the number of components, and the condition number needs to be small. However, the problem itself is not necessarily more difficult when the condition number is larger. Finally, we believe that our technique of completing the second- and higher-order moments should have applications to several other mixture models that involve $\ell$-wise distributions, e.g., the mixed-membership stochastic block model with $\ell$-wise connections between nodes.

References

[AGH+12] Anima Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky, Tensor decompositions for learning latent variable models, CoRR abs/1210.7559 (2012).

[AGM12] Sanjeev Arora, Rong Ge, and Ankur Moitra, Learning topic models - going beyond SVD, FOCS, 2012, pp. 1–10.

[AGMS12] Sanjeev Arora, Rong Ge, Ankur Moitra, and Sushant Sachdeva, Provable ICA with unknown Gaussian noise, with implications for Gaussian mixtures and autoencoders, NIPS, 2012, pp. 2384–2392.

[AHK12] A. Anandkumar, D. Hsu, and S. M.
Kakade, A method of moments for mixture models and hidden Markov models, arXiv preprint arXiv:1203.0683 (2012).

[AK01] S. Arora and R. Kannan, Learning mixtures of arbitrary Gaussians, STOC, 2001, pp. 247–257.

[AM05] Dimitris Achlioptas and Frank McSherry, On spectral learning of mixtures of distributions, Learning Theory, Springer, 2005, pp. 458–469.

[CHRZ07] Kamalika Chaudhuri, Eran Halperin, Satish Rao, and Shuheng Zhou, A rigorous analysis of population stratification with limited data, SODA, 2007, pp. 1046–1055.

[CR08] K. Chaudhuri and S. Rao, Learning mixtures of product distributions using correlations and independence, COLT, 2008, pp. 9–20.

[DS79] A. P. Dawid and A. M. Skene, Maximum likelihood estimation of observer error-rates using the EM algorithm, Journal of the Royal Statistical Society, Series C (Applied Statistics) 28 (1979), no. 1, 20–28.

[FM99] Y. Freund and Y. Mansour, Estimating a mixture of two product distributions, COLT, 1999, pp. 53–62.

[FOS08] J. Feldman, R. O'Donnell, and R. A. Servedio, Learning mixtures of product distributions over discrete domains, SIAM Journal on Computing 37 (2008), no. 5, 1536–1564.

[GAGG13] Suriya Gunasekar, Ayan Acharya, Neeraj Gaur, and Joydeep Ghosh, Noisy matrix completion using alternating minimization, Machine Learning and Knowledge Discovery in Databases, Springer, 2013, pp. 194–209.

[GKM11] A. Ghosh, S. Kale, and P. McAfee, Who moderates the moderators?: crowdsourcing abuse detection in user-generated content, EC, 2011, pp. 167–176.

[GR12] Navin Goyal and Luis Rademacher, Efficient learning of simplices, CoRR abs/1211.2227 (2012).

[HK13] D. Hsu and S. M. Kakade, Learning mixtures of spherical Gaussians: moment methods and spectral decompositions, ITCS, 2013, pp. 11–20.

[HKZ12] D. Hsu, S. M. Kakade, and T.
Zhang, A spectral algorithm for learning hidden Markov models, Journal of Computer and System Sciences 78 (2012), no. 5, 1460–1480.

[HZ98] Siu L. Hui and Xiao H. Zhou, Evaluation of diagnostic tests without gold standards, Statistical Methods in Medical Research 7 (1998), no. 4, 354–370.

[JNS13] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi, Low-rank matrix completion using alternating minimization, STOC, 2013, pp. 665–674.

[KMR+94] M. Kearns, Y. Mansour, D. Ron, R. Rubinfeld, R. E. Schapire, and L. Sellie, On the learnability of discrete distributions, STOC, 1994, pp. 273–282.

[KOS11a] D. R. Karger, S. Oh, and D. Shah, Budget-optimal crowdsourcing using low-rank matrix approximations, Allerton, 2011.

[KOS11b] D. R. Karger, S. Oh, and D. Shah, Budget-optimal task allocation for reliable crowdsourcing systems, arXiv preprint arXiv:1110.3564 (2011).

[KOS13] D. R. Karger, S. Oh, and D. Shah, Efficient crowdsourcing for multi-class labeling, Proceedings of the ACM SIGMETRICS/international conference on Measurement and modeling of computer systems, 2013, pp. 81–92.

[McS01] Frank McSherry, Spectral partitioning of random graphs, FOCS, 2001, pp. 529–537.

[MV10] Ankur Moitra and Gregory Valiant, Settling the polynomial learnability of mixtures of Gaussians, FOCS, 2010, pp. 93–102.

[RSS12] Yuval Rabani, Leonard J. Schulman, and Chaitanya Swamy, Learning mixtures of arbitrary distributions over large discrete domains, CoRR abs/1212.1527 (2012).

[SCPW12] James Saunderson, Venkat Chandrasekaran, Pablo A. Parrilo, and Alan S. Willsky, Diagonal and low-rank matrix decompositions, correlation matrices, and ellipsoid fitting, SIAM J. Matrix Analysis Applications 33 (2012), no. 4, 1395–1416.

[SFB+95] P. Smyth, U. Fayyad, M. Burl, P. Perona, and P. Baldi, Inferring ground truth from subjective labelling of Venus images, NIPS, 1995, pp. 1085–1092.

[SPI08] V. S.
Sheng, F. Provost, and P. G. Ipeirotis, Get another label? Improving data quality and data mining using multiple, noisy labelers, KDD, 2008, pp. 614–622.

[SRH07] S. Sridhar, S. Rao, and E. Halperin, An efficient and accurate graph-based approach to detect population substructure, Research in Computational Molecular Biology, 2007, pp. 503–517.

[TM10] Dan-Cristian Tomozei and Laurent Massoulié, Distributed user profiling via spectral methods, ACM SIGMETRICS Performance Evaluation Review, vol. 38, 2010, pp. 383–384.

[Tro12] Joel A. Tropp, User-friendly tail bounds for sums of random matrices, Foundations of Computational Mathematics 12 (2012), no. 4, 389–434.

[VW04] Santosh Vempala and Grant Wang, A spectral algorithm for learning mixture models, J. Comput. Syst. Sci. 68 (2004), no. 4, 841–860.

Appendix

A Proofs

In this section, we give detailed proofs of all the key theorems and lemmata required to prove our main results (Theorem 4.1 and Theorem 4.3).

A.1 Proof of Theorem 4.1

We analyze each iteration and show that we get closer to the optimal solution, up to a certain noise level, at each step. To make the block structure explicit, we use the index $(i,a)$, for $i \in [n]$ and $a \in [\ell]$, to denote $(i-1)\ell + a \in [\ell n]$. The least-squares update gives
$$ U^{(t+1)} = \arg\min_{V\in\mathbb{R}^{\ell n\times r}} \sum_{i,j\in[n],\,a,b\in[\ell],\,i\ne j} \Big( \widehat{M}_{(i,a),(j,b)} - \big(V(\widehat{U}^{(t)})^T\big)_{(i,a),(j,b)} \Big)^2. $$
Setting the gradient to zero, we get
$$ -2\sum_{j\ne i,\,b\in[\ell]} \Big( M_{(i,a),(j,b)} + E_{(i,a),(j,b)} - \big\langle U^{(t+1)}_{(i,a)}, \widehat{U}^{(t)}_{(j,b)}\big\rangle \Big)\,\widehat{U}^{(t)}_{(j,b)} = 0, $$
for all $i\in[n]$ and $a\in[\ell]$. Here, $\widehat{U}^{(t)}_{(j,b)}$ is the $r$-dimensional column vector representing the $((j-1)\ell+b)$-th row of $\widehat{U}^{(t)}$. Let $M = USU^T$ be the singular value decomposition of $M$.
The $r$-dimensional column vector $U^{(t+1)}_{(i,a)}$ can be written as
$$ U^{(t+1)}_{(i,a)} = (B^{(i,a)})^{-1}C^{(i,a)}SU_{(i,a)} + (B^{(i,a)})^{-1}N^{(i,a)} = \underbrace{DSU_{(i,a)}}_{\text{power iteration}} - \underbrace{(B^{(i,a)})^{-1}\big(B^{(i,a)}D - C^{(i,a)}\big)SU_{(i,a)}}_{\text{error due to missing entries}} + \underbrace{(B^{(i,a)})^{-1}N^{(i,a)}}_{\text{error due to noise}}, \qquad (10) $$
where
$$ B^{(i,a)} = \sum_{j\ne i,\,j\in[n],\,b\in[\ell]} \widehat{U}^{(t)}_{(j,b)}(\widehat{U}^{(t)}_{(j,b)})^T \in \mathbb{R}^{r\times r}, \qquad C^{(i,a)} = \sum_{j\ne i,\,j\in[n],\,b\in[\ell]} \widehat{U}^{(t)}_{(j,b)}U_{(j,b)}^T \in \mathbb{R}^{r\times r}, $$
$$ D = \sum_{j\in[n],\,b\in[\ell]} \widehat{U}^{(t)}_{(j,b)}U_{(j,b)}^T \in \mathbb{R}^{r\times r}, \qquad N^{(i,a)} = \sum_{j\ne i,\,j\in[n],\,b\in[\ell]} E_{(i,a),(j,b)}\widehat{U}^{(t)}_{(j,b)} \in \mathbb{R}^{r\times 1}. $$
Note that these quantities are independent of the index $a$, but we carry the index for uniformity of notation. In matrix form of dimension $\ell n\times r$, we use $F_{\mathrm{miss}}\in\mathbb{R}^{\ell n\times r}$ to denote the error due to missing entries and $F_{\mathrm{noise}}\in\mathbb{R}^{\ell n\times r}$ to denote the error due to the noise, so that
$$ U^{(t+1)} = M\widehat{U}^{(t)} - F^{(t+1)}_{\mathrm{miss}} + F^{(t+1)}_{\mathrm{noise}}, \quad\text{and}\quad \widehat{U}^{(t+1)} = \big(M\widehat{U}^{(t)} - F^{(t+1)}_{\mathrm{miss}} + F^{(t+1)}_{\mathrm{noise}}\big)\big(R^{(t+1)}_U\big)^{-1}, \qquad (11) $$
where $R^{(t+1)}_U$ is the upper triangular matrix obtained from the QR decomposition $U^{(t+1)} = \widehat{U}^{(t+1)}R^{(t+1)}_U$. The explicit formulas for $F_{\mathrm{miss}}$ and $F_{\mathrm{noise}}$ are given in (14) and (18). Then, the error after $t$ iterations of the alternating minimization is bounded by
$$ \big\|M - \widehat{U}^{(t)}(U^{(t+1)})^T\big\|_F \le \big\|(I - \widehat{U}^{(t)}(\widehat{U}^{(t)})^T)US\big\|_F + \big\|F^{(t+1)}_{\mathrm{miss}}\big\|_F + \big\|F^{(t+1)}_{\mathrm{noise}}\big\|_F. \qquad (12) $$
Let $U_\perp\in\mathbb{R}^{\ell n\times(\ell n - r)}$ be an orthogonal matrix spanning the subspace orthogonal to $U$. We use the following definition of distance between two $r$-dimensional subspaces of $\mathbb{R}^{\ell n}$:
$$ d(\widehat{U}, U) = \big\|U_\perp^T\widehat{U}\big\|_2. $$
The following key technical lemma provides upper bounds on each of the error terms in (12).
Lemma A.1. For any $\mu_1$-incoherent orthogonal matrix $U^{(t)}\in\mathbb{R}^{\ell n\times r}$ and $\mu$-incoherent matrix $M\in\mathbb{R}^{\ell n\times\ell n}$, the error after one step of alternating minimization is upper bounded by
$$ \|F^{(t+1)}_{\mathrm{miss}}\|_F \le \frac{\sigma_1(M)\,r^{1.5}\mu\mu_1}{n(1-\mu_1^2 r/n)}\,d(\widehat{U}^{(t)}, U), \qquad \|F^{(t+1)}_{\mathrm{noise}}\|_F \le \frac{\sqrt{r}}{1-\mu_1^2 r/n}\,\|P_\Omega(E)\|_2, $$
where $\sigma_i(M)$ is the $i$-th singular value of $M$.

We show in Lemma A.3 that the incoherence assumption is satisfied for all $t$ with $\mu_1 = 6(\sigma_1(M)/\sigma_r(M))\mu$. For $\mu_1 \le \sqrt{n/2r}$, as per our assumption, substituting these bounds into (12) gives
$$ \big\|M - \widehat{U}^{(t)}(U^{(t+1)})^T\big\|_F \le \|M\|_F\,d(\widehat{U}^{(t)}, U) + \frac{12\,\sigma_1(M)^2 r^{1.5}\mu^2}{n\,\sigma_r(M)}\,d(\widehat{U}^{(t)}, U) + 2\sqrt{r}\,\|P_\Omega(E)\|_2, $$
where the first term follows from the fact that $\|(I-\widehat{U}^{(t)}(\widehat{U}^{(t)})^T)U\|_2 = \|\widehat{U}^{(t)}_\perp(\widehat{U}^{(t)}_\perp)^TU\|_2 = d(\widehat{U}^{(t)}, U)$. To further bound the distance $d(\widehat{U}^{(t)}, U)$, we first claim that, after $t$ iterations of the alternating minimization algorithm, the estimates satisfy
$$ d(\widehat{U}^{(t)}, U) \le \frac{\varepsilon}{2\|M\|_F} + \frac{2\sqrt{3r}}{\sigma_r(M)}\,\|P_\Omega(E)\|_2, \qquad (13) $$
for $t \ge (1/2)\log_2(\|M\|_F/\varepsilon)$. For $\mu \le \sqrt{n}\,\sigma_r(M)/(12\,r\,\sigma_1(M))$, as per our assumption, this gives
$$ \big\|M - \widehat{U}^{(t)}(U^{(t+1)})^T\big\|_F \le \varepsilon + \frac{9\|M\|_F\sqrt{r}}{\sigma_r(M)}\,\|P_\Omega(E)\|_2. $$
This proves the desired error bound of Theorem 4.1.

We are now left to prove (13) for $t \ge (1/2)\log_2(\|M\|_F/\varepsilon)$. This follows from the analysis of each step of the algorithm, which shows that we improve at each step, up to a certain noise level. Define $R^{(t+1)}_U$ to be the upper triangular matrix obtained from the QR decomposition $U^{(t+1)} = \widehat{U}^{(t+1)}R^{(t+1)}_U$. Then we can represent the distance, using (11) and the fact that $U_\perp^TU = 0$, as
$$ d(\widehat{U}^{(t+1)}, U) = \Big\|U_\perp^T\big(USU^T\widehat{U}^{(t)} - F^{(t+1)}_{\mathrm{miss}} + F^{(t+1)}_{\mathrm{noise}}\big)\big(R^{(t+1)}_U\big)^{-1}\Big\|_2 \le \Big(\|F^{(t+1)}_{\mathrm{miss}}\|_2 + \|F^{(t+1)}_{\mathrm{noise}}\|_2\Big)\Big\|\big(R^{(t+1)}_U\big)^{-1}\Big\|_2 \le \frac{12\sqrt{3}\,\sigma_1(M)^2 r^{1.5}\mu^2}{\sigma_r(M)^2\,n}\,d(\widehat{U}^{(t)}, U) + \frac{2\sqrt{3r}}{\sigma_r(M)}\,\|P_\Omega(E)\|_2, $$
where we used Lemma A.2 to bound $\|(R^{(t+1)}_U)^{-1}\|_2$, Lemma A.1 to bound $\|F^{(t+1)}_{\mathrm{miss}}\|_2$ and $\|F^{(t+1)}_{\mathrm{noise}}\|_2$, and Lemma A.3 to bound $\mu_1$. For $\mu \le \sqrt{n}\,\sigma_r(M)/(10\,r^{1.5}\sigma_1(M))$, as per our assumption, it follows that
$$ d(\widehat{U}^{(t)}, U) \le \frac{1}{4^t}\,d(\widehat{U}^{(0)}, U) + \frac{2\sqrt{3r}}{\sigma_r(M)}\,\|P_\Omega(E)\|_2. $$
Taking $t \ge (1/2)\log_2(\|M\|_F/\varepsilon)$ finishes the proof of the desired bound (13).

We are now left to prove that, starting from the good initial guess obtained using a simple singular value decomposition (SVD), the estimate at every iterate $t$ is incoherent with bounded $\|(R^{(t+1)}_U)^{-1}\|_2$. We first state the following two lemmas upper bounding $\mu_1$ and $\|(R^{(t+1)}_U)^{-1}\|_2$, and then prove that the hypotheses of the lemmas are satisfied if we start from a good initialization.

Lemma A.2. Assume that $U$ is $\mu$-incoherent with $\mu \le (\sigma_r(M)/\sigma_1(M))\sqrt{n/(32 r^{1.5})}$, that $d(\widehat{U}^{(t)}, U) \le 1/2$, and that $\|P_\Omega(E)\|_2 \le \sigma_r(M)/(16\sqrt{r})$. Then, $\|(R^{(t+1)}_U)^{-1}\|_2 \le \sqrt{3}/\sigma_r(M)$.

Lemma A.3 (Incoherence of the estimates). Assume that $\widehat{U}^{(t)}$ is $\tilde\mu$-incoherent with $\tilde\mu \le \sqrt{n/(2r)}$, that $U$ is $\mu$-incoherent with $\mu \le (\sigma_r(M)/\sigma_1(M))\sqrt{n/(32r)}$, and that the noise $E$ satisfies $\|P_\Omega(E)_{(i,a)}\| \le \sigma_1(M)\mu\sqrt{3r/(8n\ell)}$ for all $i\in[n]$ and $a\in[\ell]$. Then, $\widehat{U}^{(t+1)}$ is $\mu_1$-incoherent with $\mu_1 = 6\mu\,\sigma_1(M)/\sigma_r(M)$.

For the above two lemmas to hold, we need a good initial guess $\widehat{U}^{(0)}$ with incoherence at most $4\mu$ and error bounded by $d(\widehat{U}^{(0)}, U) \le 1/2$. The next lemma shows that we can get such a good initial guess by singular value decomposition and truncation, and this finishes the proof of Theorem 4.1.

Lemma A.4 (Bound on the initial guess).
Let $\widehat{U}^{(0)}$ be the output of step 3 of the alternating minimization algorithm, and let $\mu_0$ be the incoherence of $\widehat{U}^{(0)}$. Assuming $\mu \le \sqrt{\sigma_r(M)\,n/(32\,\sigma_1(M)\,r^{1.5})}$ and $\|P_\Omega(E)\|_2 \le \sigma_r(M)/(32\sqrt{r})$, we have the following upper bounds on the error and the incoherence:
$$ d(\widehat{U}^{(0)}, U) \le \frac12, \qquad \mu_0 \le 4\mu. $$

A.1.1 Proofs of Lemmas A.1, A.2, A.3, and A.4

Proof of Lemma A.1. First, we prove the following upper bound for $\mu_1$-incoherent $\widehat{U}^{(t+1)}$:
$$ \|F^{(t+1)}_{\mathrm{miss}}\|_F \le \frac{\sigma_1(M)\,r^{1.5}\mu\mu_1}{n(1-\mu_1^2 r/n)}\,d(\widehat{U}^{(t)}, U). $$
We drop the time index $(t+1)$ whenever it is clear from the context, to simplify notation. Let $F_{(i,a)}\in\mathbb{R}^r$ be the column vector representing the $(\ell(i-1)+a)$-th row of $F_{\mathrm{miss}}\in\mathbb{R}^{\ell n\times r}$. We know from (10) that
$$ F_{(i,a)} = (B^{(i,a)})^{-1}\underbrace{\big(B^{(i,a)}D - C^{(i,a)}\big)}_{\equiv H^{(i)}}\,SU_{(i,a)}, \qquad (14) $$
where we define $H^{(i)} \equiv B^{(i,a)}D - C^{(i,a)}$; we drop $a$ from the index to emphasize that $B^{(i,a)}$ and $C^{(i,a)}$ do not depend on $a$. Then,
$$ \|F_{\mathrm{miss}}\|_F \le \sqrt{\sum_{i,a}\|(B^{(i,a)})^{-1}\|_2^2\,\|H^{(i)}SU_{(i,a)}\|^2} \le \max_{j,b}\|(B^{(j,b)})^{-1}\|_2 \max_{x\in\mathbb{R}^{\ell n\times r},\,\|x\|_F=1} \sum_{i\in[n],\,a\in[\ell],\,q\in[r]} x_{(i,a),q}\,e_q^T H^{(i)} SU_{(i,a)}. $$
To upper bound the first term, note that $\|(B^{(j,b)})^{-1}\|_2 \le 1/\sigma_r(B^{(j,b)})$. Since $B^{(j,b)} = I_{r\times r} - \sum_{a\in[\ell]}\widehat{U}_{(j,a)}(\widehat{U}_{(j,a)})^T$, the incoherence property from Lemma A.3 gives
$$ \|(B^{(j,b)})^{-1}\|_2 \le \frac{1}{1-\mu_1^2 r/n}, \qquad (15) $$
for all $(j,b)$. The second term can be bounded using the Cauchy-Schwarz inequality:
$$ \sum_{i\in[n],\,a\in[\ell],\,q\in[r]} x_{(i,a),q}\,e_q^T H^{(i)} SU_{(i,a)} = \sum_{i\in[n],\,q,p\in[r]}\Big(\sum_{a\in[\ell]} S_p U_{(i,a),p}\,x_{(i,a),q}\Big)\,e_q^T H^{(i)} e_p \le \sqrt{\sum_{i,p,q}\Big(\sum_{a\in[\ell]} S_p U_{(i,a),p}\,x_{(i,a),q}\Big)^2}\sqrt{\sum_{i,p,q}\big(e_q^T H^{(i)} e_p\big)^2}, $$
where $S_p$ is the $p$-th eigenvalue of $M$.
Applying Cauchy-Schwarz again, and using the incoherence of $U$ and $\|x\|_F = 1$,
$$ \sum_{i,p,q}\Big(\sum_{a\in[\ell]} S_p U_{(i,a),p}\,x_{(i,a),q}\Big)^2 \le \sum_{i,p,q} S_p^2\Big(\sum_{a\in[\ell]} U_{(i,a),p}^2\Big)\Big(\sum_{b\in[\ell]} x_{(i,b),q}^2\Big) \le \frac{\sigma_1(M)^2\mu^2 r}{n}. \qquad (16) $$
Similarly,
$$ \sum_{i,p,q}\big(e_q^T H^{(i)} e_p\big)^2 = \sum_{i,p,q}\Big(\sum_a \widehat{U}_{(i,a),q}\big(U_{(i,a),p} - \widehat{U}_{(i,a)}^T\widehat{U}^T U_p\big)\Big)^2 \le \sum_i\Big\{n\sum_{a,q}\widehat{U}_{(i,a),q}^2\sum_{b,p}\big(U_{(i,b),p} - \widehat{U}_{(i,b)}^T\widehat{U}^T U_p\big)^2\Big\} \le \frac{\mu_1^2 r}{n}\sum_{i,b,p}\big(U_{(i,b),p} - \widehat{U}_{(i,b)}^T\widehat{U}^T U_p\big)^2 \le \frac{\mu_1^2 r}{n}\Big(r - \mathrm{Tr}(U^T\widehat{U}\widehat{U}^T U)\Big) \le \frac{\mu_1^2 r^2\,d(\widehat{U}, U)^2}{n}, \qquad (17) $$
where the last inequality follows from the fact that
$$ d(\widehat{U}, U)^2 = \|\widehat{U}_\perp^T U\|_2^2 = \|U^T\widehat{U}_\perp\widehat{U}_\perp^T U\|_2 = \|I_{r\times r} - U^T\widehat{U}\widehat{U}^T U\|_2 = 1 - \sigma_r(\widehat{U}^T U)^2 \ge 1 - \frac{1}{r}\sum_p\sigma_p(\widehat{U}^T U)^2. $$
Now, we prove an upper bound on $\|F^{(t+1)}_{\mathrm{noise}}\|_F$. Again, we drop the time index $(t+1)$ or $(t)$ whenever it is clear from the context. Let $\widetilde{F}_{(i,a)}\in\mathbb{R}^r$ denote the column vector representing the $(\ell(i-1)+a)$-th row of $F_{\mathrm{noise}}$. We know from (10) that
$$ \widetilde{F}_{(i,a)} = (B^{(i,a)})^{-1}\Big(\widehat{U}^T E_{(i,a)} - \sum_{b\in[\ell]} E_{(i,a),(i,b)}\widehat{U}_{(i,b)}\Big), \qquad (18) $$
where $E_{(i,a)}\in\mathbb{R}^{\ell n}$ is the column vector representing the $(\ell(i-1)+a)$-th row of $E$. Then,
$$ \|F_{\mathrm{noise}}\|_F \le \sqrt{\sum_{i\in[n],\,a\in[\ell]}\|(B^{(i,a)})^{-1}\|_2^2\,\Big\|\widehat{U}^T E_{(i,a)} - \sum_{b\in[\ell]} E_{(i,a),(i,b)}\widehat{U}_{(i,b)}\Big\|^2} \le \max_{i,a}\|(B^{(i,a)})^{-1}\|_2\,\|P_\Omega(E)\widehat{U}\|_F \le \frac{\sqrt{r}}{1-\mu_1^2 r/n}\,\|P_\Omega(E)\|_2, $$
where $P_\Omega$ is the projection onto the sampled entries defined in (3), and we used (15) to bound $\|(B^{(i,a)})^{-1}\|_2$.

Proof of Lemma A.2. From Lemma 7 in [GAGG13], we know that
$$ \big\|(R^{(t+1)}_U)^{-1}\big\|_2 \le \frac{1}{\sigma_r(M)\sqrt{1-d^2(U^{(t)}, U)} - \|F^{(t+1)}_{\mathrm{miss}}\|_2 - \|F^{(t+1)}_{\mathrm{noise}}\|_2}. $$
From Lemma A.1, with $\mu \le (\sigma_r(M)/(6\sigma_1(M)))\sqrt{n/(2r^{1.5})}$ and $\|P_\Omega(E)\|_2 \le \sigma_r(M)/(16\sqrt{r})$, we have $\|F^{(t+1)}_{\mathrm{noise}}\|_2 \le \sigma_r(M)/8$ and $\|F^{(t+1)}_{\mathrm{miss}}\|_2 \le (1/6)\sigma_r(M)\,d(\widehat{U}^{(t)}, U)$. Assuming $d(\widehat{U}^{(t)}, U) \le 1/2$, this proves the desired claim.

Proof of Lemma A.3. Assuming that $\widehat{U}^{(t)}$ is $\tilde\mu$-incoherent, we make use of the following set of inequalities:
$$ \sigma_r(B^{(i,a)}) \ge 1 - \tilde\mu^2 r/n, \qquad \|B^{(i,a)}\|_2 = \|I_{r\times r} - \widehat{U}_{(i)}\widehat{U}_{(i)}^T\|_2 \le 1, $$
$$ \|D\|_2 = \|\widehat{U}^T U\|_2 \le 1, \qquad \|C^{(i,a)}\|_2 = \|\widehat{U}^T U - \widehat{U}_{(i)}U_{(i)}^T\|_2 \le 1 + \mu\tilde\mu r/n. $$
Also, from Lemma A.2, if $\tilde\mu \le \sqrt{n/2r}$, as per our assumption, then $\|(R^{(t+1)}_U)^{-1}\|_2 \le \sqrt{3}/\sigma_r(M)$. Then, by (10) and the triangle inequality,
$$ \sum_{a\in[\ell]}\|\widehat{U}^{(t+1)}_{(i,a)}\|^2 \le \sum_{a\in[\ell]}\big\|(B^{(i,a)})^{-1}C^{(i,a)}SU_{(i,a)} + (B^{(i,a)})^{-1}N^{(i,a)}\big\|^2\,\big\|(R^{(t+1)}_U)^{-1}\big\|_2^2 \le \sum_{a\in[\ell]} 2\big\|(R^{(t+1)}_U)^{-1}\big\|_2^2\,\big\|(B^{(i,a)})^{-1}\big\|_2^2\Big\{\|C^{(i,a)}\|_2^2\|S\|_2^2\|U_{(i,a)}\|^2 + \|N^{(i,a)}\|^2\Big\} $$
$$ \le \frac{6}{\sigma_r(M)^2(1-\tilde\mu^2 r/n)^2}\sum_{a\in[\ell]}\Big\{\sigma_1(M)^2\big(1+\mu\tilde\mu r/n\big)\|U_{(i,a)}\|^2 + \big\|\widehat{U}^T P_\Omega(E)_{(i,a)}\big\|^2\Big\} \le \frac{6}{\sigma_r(M)^2(1-\tilde\mu^2 r/n)^2}\Big\{\sigma_1(M)^2\big(1+\mu\tilde\mu r/n\big)\frac{\mu^2 r}{n} + \|P_\Omega(E)_{(i,a)}\|^2\Big\} \le \frac{36\,\sigma_1(M)^2}{\sigma_r(M)^2}\,\frac{\mu^2 r}{n}, $$
where the last inequality follows from our assumptions that $\tilde\mu \le \sqrt{n/(2r)}$, $\mu \le (\sigma_r(M)/\sigma_1(M))\sqrt{n/(32r)}$, and $\|P_\Omega(E)_{(i,a)}\| \le \sigma_1(M)\mu\sqrt{3r/(8n\ell)}$. This proves that $\widehat{U}^{(t+1)}$ is $\mu_1$-incoherent for $\mu_1 = 6\mu(\sigma_1(M)/\sigma_r(M))$.

Proof of Lemma A.4. Let $P_r(\widehat{M}) = \widetilde{U}\widetilde{S}\widetilde{U}^T$ denote the best rank-$r$ approximation of the observed matrix $\widehat{M}$, and let $P_\Omega$ be the sampling mask operator defined in (3), so that $M - \widehat{M} = P_\Omega(E) + M - P_\Omega(M)$.
Then,
$$ \|M - P_r(\widehat{M})\|_2 \le \|M - \widehat{M}\|_2 + \|\widehat{M} - P_r(\widehat{M})\|_2 \le 2\|M - \widehat{M}\|_2 \le 2\big(\|P_\Omega(E)\|_2 + \|M - P_\Omega(M)\|_2\big) \le 2\big(\|P_\Omega(E)\|_2 + \sigma_1(M)\mu^2 r/n\big), \qquad (19) $$
where we used the fact that $P_r(\widehat{M})$ is the best rank-$r$ approximation, so that $\|\widehat{M} - P_r(\widehat{M})\|_2 \le \|\widehat{M} - A\|_2$ for any rank-$r$ matrix $A$, and $\|M - P_\Omega(M)\|_2 = \max_i\|U_{(i)}SU_{(i)}^T\|_2 \le (\mu^2 r/n)\,\sigma_1(M)$. The next series of inequalities provides an upper bound on $d(\widetilde{U}, U)$ in terms of the spectral norm:
$$ \|M - P_r(\widehat{M})\|_2 = \big\|(\widetilde{U}\widetilde{U}^T)(USU^T - \widetilde{U}\widetilde{S}\widetilde{U}^T) + (\widetilde{U}_\perp\widetilde{U}_\perp^T)(USU^T - \widetilde{U}\widetilde{S}\widetilde{U}^T)\big\|_2 \ge \big\|(\widetilde{U}_\perp\widetilde{U}_\perp^T)(USU^T - \widetilde{U}\widetilde{S}\widetilde{U}^T)\big\|_2 = \|\widetilde{U}_\perp^T USU^T\|_2 \ge \sigma_r(S)\,\|\widetilde{U}_\perp^T U\|_2 \ge \sigma_r(S)\,d(\widetilde{U}, U). $$
Together with (19), this implies that
$$ d(\widetilde{U}, U) \le \frac{2}{\sigma_r(M)}\big(\|P_\Omega(E)\|_2 + \sigma_1(M)\mu^2 r/n\big). $$
For $\|P_\Omega(E)\|_2 \le \sigma_r(M)/(32\sqrt{r})$ and $\mu \le \sqrt{\sigma_r(M)\,n/(32\,\sigma_1(M)\,r^{1.5})}$, as per our assumptions, we have
$$ d(\widetilde{U}, U) \le \frac{1}{8\sqrt{r}}. $$
Next, we show that, by truncating large components of $\widetilde{U}$, we can get an incoherent matrix $\widehat{U}^{(0)}$ which is also close to $U$. Consider the submatrix of $U$ consisting of rows $\ell(i-1)+1$ through $\ell i$; we denote this block by $U_{(i)}\in\mathbb{R}^{\ell\times r}$. Let $\overline{U}$ denote the $\ell n\times r$ matrix obtained from $\widetilde{U}$ by setting to zero all blocks with Frobenius norm greater than $2\mu\sqrt{r/n}$, and let $\widehat{U}^{(0)}$ be an orthonormal basis of $\overline{U}$. We use the following lemma to bound the error and incoherence of the resulting $\widehat{U}^{(0)}$. A similar lemma has been proven in [JNS13, Lemma C.2]; we provide a tighter bound here. For $\delta \le 1/(8\sqrt{r})$, this lemma proves the desired bounds $d(\widehat{U}^{(0)}, U) \le 1/2$ and $\mu_0 \le 4\mu$.

Lemma A.5. Let $\mu_0$ denote the incoherence of $\widehat{U}^{(0)}$, and define $\delta \equiv d(\widetilde{U}, U)$. Then,
$$ d(\widehat{U}^{(0)}, U) \le \frac{3\sqrt{r}\,\delta}{1 - 2\sqrt{r}\,\delta}, \qquad \mu_0 \le \frac{2\mu}{1 - 2\sqrt{r}\,\delta}. $$
Proof.
Denote the QR decomposition of $\overline{U}$ by $\overline{U} = \widehat{U}^{(0)}R$, and let $\delta \equiv d(\widetilde{U}, U)$. Then,
$$ d(\widehat{U}^{(0)}, U) = \|U_\perp^T\widehat{U}^{(0)}\|_2 \le \|U_\perp^T\overline{U}\|_2\,\|R^{-1}\|_2 \le \big(\|U_\perp^T(\overline{U} - \widetilde{U})\|_2 + \|U_\perp^T\widetilde{U}\|_2\big)\|R^{-1}\|_2 \le \big(\|\overline{U} - \widetilde{U}\|_2 + \delta\big)\|R^{-1}\|_2. \qquad (20) $$
First, we upper bound $\|\overline{U} - \widetilde{U}\|_F$ as follows. Let $P(\cdot)$ denote the projection operator that sets to zero those blocks whose Frobenius norm is smaller than $2\mu\sqrt{r/n}$, so that $P(\widetilde{U}) = \widetilde{U} - \overline{U}$. Then,
$$ \|P(\widetilde{U})\|_F \le \|P(\widetilde{U} - U(U^T\widetilde{U}))\|_F + \|P(U(U^T\widetilde{U}))\|_F. \qquad (21) $$
The first term can be bounded by $\|P(\widetilde{U} - U(U^T\widetilde{U}))\|_F \le \|\widetilde{U} - U(U^T\widetilde{U})\|_F \le \sqrt{r}\,\|\widetilde{U} - U(U^T\widetilde{U})\|_2 = \sqrt{r}\,\delta$. The second term can be bounded by $\|P(U(U^T\widetilde{U}))\|_F = \|P(U)(U^T\widetilde{U})\|_F \le \|P(U)\|_F$. By the incoherence of $U$, we have $\|P(U)\|_F \le \sqrt{N}\,\mu\sqrt{r/n}$, where $N$ is the number of $\ell\times r$ blocks that are not set to zero by $P(\cdot)$. To provide an upper bound on $N$, notice that the incoherence of the $\ell n\times r$ matrix $U(U^T\widetilde{U})$ is at most $\mu$; this follows from the fact that $\|U^T\widetilde{U}\|_2 \le 1$. Then,
$$ \|U(U^T\widetilde{U}) - \widetilde{U}\|_F \ge \|P(U(U^T\widetilde{U}) - \widetilde{U})\|_F \ge \sqrt{N}\,\mu\sqrt{\frac{r}{n}}, $$
where the last inequality follows from the fact that there are $N$ blocks in which the Frobenius norm of $U(U^T\widetilde{U})$ is at most $\mu\sqrt{r/n}$ and the Frobenius norm of $\widetilde{U}$ is at least $2\mu\sqrt{r/n}$. On the other hand, we have $\|U(U^T\widetilde{U}) - \widetilde{U}\|_F \le \sqrt{r}\,\delta$. Putting these inequalities together, we get $\sqrt{N} \le \delta\sqrt{n}/\mu$, and hence $\|P(U(U^T\widetilde{U}))\|_F \le \sqrt{r}\,\delta$. Substituting these bounds into (21) gives
$$ \|\widetilde{U} - \overline{U}\|_F \le 2\delta\sqrt{r}. \qquad (22) $$
Next, we show that
$$ \|R^{-1}\|_2 \le \frac{1}{1 - 2\delta\sqrt{r}}. \qquad (23) $$
By the definition of $R$, we know that $\|R^{-1}\|_2 = 1/\sigma_r(R) = 1/\sigma_r(\overline{U})$. Using Weyl's inequality, we can lower bound $\sigma_r(\overline{U}) = \sigma_r(\overline{U} - \widetilde{U} + \widetilde{U}) \ge \sigma_r(\widetilde{U}) - \sigma_1(\overline{U} - \widetilde{U})$. Since $\widetilde{U}$ is an orthogonal matrix, using (22) proves (23).
Substituting (22) and (23) into (20), we get
$$ d(\widehat{U}^{(0)}, U) \le \frac{(2\sqrt{r}+1)\,\delta}{1 - 2\delta\sqrt{r}}. $$
Since $2\sqrt{r}+1 \le 3\sqrt{r}$ for $r \ge 1$, this gives the desired bound. To provide an upper bound on the incoherence $\mu_0$ of $\widehat{U}^{(0)}$, recall that the incoherence is defined by $\mu_0\sqrt{r/n} = \max_i\|\widehat{U}^{(0)}_{(i)}\|_F = \max_i\|\overline{U}_{(i)}R^{-1}\|_F$. By construction, $\|\overline{U}_{(i)}\|_F \le 2\mu\sqrt{r/n}$, and from (23) we know that $\|R^{-1}\|_2 \le 1/(1-2\delta\sqrt{r})$. Together, this gives
$$ \mu_0 \le \frac{2\mu}{1 - 2\delta\sqrt{r}}. $$
This finishes the proof of the desired bounds.

A.2 Proof of Theorem 4.3

In this section, we provide a detailed proof of Theorem 4.3. To this end, we first provide an infinite-sample version of the proof, i.e., for $P_{\Omega_3}(S_3) = P_{\Omega_3}(M_3)$. Then, in the next subsection, we bound each element of $P_{\Omega_3}(S_3) - P_{\Omega_3}(M_3)$ and extend the infinite-sample version of the proof to the finite-sample case. Recall that $\widehat{M}_2 = \widehat{U}_{M_2}\widehat{\Sigma}_{M_2}\widehat{U}_{M_2}^T$, $\varepsilon = \|\widehat{M}_2 - M_2\|_2/\sigma_r(M_2)$, $M_2$ is $\mu$-incoherent, and $\widehat{M}_2$ is $\mu_1$-incoherent, where the incoherence of a matrix is defined as in (7). Then, the following two remarks can be easily proved using standard matrix perturbation results (see, for example, [AHK12]).

Remark A.6. Suppose $\|\widehat{M}_2 - M_2\|_2 \le \varepsilon\sigma_r(M_2)$. Then,
$$ 1 - \frac{4\varepsilon^2}{(1-\varepsilon)^2} \le \sigma_r(U^T\widehat{U}_{M_2}) \le \sigma_1(\widehat{U}_{M_2}^T U) \le 1. $$
That is, $\|(I - \widehat{U}_{M_2}\widehat{U}_{M_2}^T)U\|_2 \le \varepsilon$, and $\|(U^T\widehat{U}_{M_2})^T(U^T\widehat{U}_{M_2}) - I\| \le \frac{8\varepsilon^2}{(1-\varepsilon)^2}$.

Remark A.7. Suppose $\|\widehat{M}_2 - M_2\|_2 \le \varepsilon\sigma_r(M_2)$. Then, $\|I - \widehat{\Sigma}_{M_2}^{-1/2}\widehat{U}_{M_2}^T M_2\widehat{U}_{M_2}\widehat{\Sigma}_{M_2}^{-1/2}\|_2 \le 2\varepsilon$.

Proof.
$$ \|I - \widehat{\Sigma}_{M_2}^{-1/2}\widehat{U}_{M_2}^T M_2\widehat{U}_{M_2}\widehat{\Sigma}_{M_2}^{-1/2}\|_2 = \|\widehat{\Sigma}_{M_2}^{-1/2}\widehat{U}_{M_2}^T(\widehat{M}_2 - M_2)\widehat{U}_{M_2}\widehat{\Sigma}_{M_2}^{-1/2}\|_2 \le \|\widehat{\Sigma}_{M_2}^{-1/2}\widehat{U}_{M_2}^T\|_2^2\,\|\widehat{M}_2 - M_2\|_2 \le \frac{\varepsilon\,\sigma_r(M_2)}{\sigma_r(M_2)(1-\varepsilon)}, $$
where we used the facts that $\|\widehat{\Sigma}_{M_2}^{-1/2}\|_2^2 \le 1/\sigma_r(\widehat{M}_2)$ and $\sigma_r(\widehat{M}_2) \ge \sigma_r(M_2)(1-\varepsilon)$ by Weyl's inequality.
For $\varepsilon < 1/2$, this gives the desired bound.

We now define the following two operators, $\hat{\nu}$ and $\hat{A}$. Define $\hat{\nu} : \mathbb{R}^{r\times r\times r} \to \mathbb{R}^{\ell n\times \ell n\times \ell n}$ as

$$ \hat{\nu}_{ijk}(Z) = \begin{cases} \sum_{abc} Z_{abc}\,(\hat{U}_{M_2}\hat{\Sigma}_{M_2}^{1/2})_{ia}(\hat{U}_{M_2}\hat{\Sigma}_{M_2}^{1/2})_{jb}(\hat{U}_{M_2}\hat{\Sigma}_{M_2}^{1/2})_{kc}, & \text{if } \lceil i/\ell\rceil \ne \lceil j/\ell\rceil \ne \lceil k/\ell\rceil \ne \lceil i/\ell\rceil, \\ 0, & \text{otherwise}. \end{cases} \quad (24) $$

Define $\hat{A} : \mathbb{R}^{r\times r\times r} \to \mathbb{R}^{r\times r\times r}$ as

$$ \hat{A}(Z) = \hat{\nu}(Z)\big[\hat{U}_{M_2}\hat{\Sigma}_{M_2}^{-1/2},\ \hat{U}_{M_2}\hat{\Sigma}_{M_2}^{-1/2},\ \hat{U}_{M_2}\hat{\Sigma}_{M_2}^{-1/2}\big]. \quad (25) $$

Now, let $R_3$ be defined as $R_3 = \hat{\Sigma}_{M_2}^{-1/2}\hat{U}_{M_2}^T U \Sigma V^T W^{1/2}$. Note that, using Remark A.7, $\|R_3 R_3^T - I\| \le 2\varepsilon$. Also, define the following tensor:

$$ \tilde{G} = \sum_{q\in[r]} \frac{1}{\sqrt{w_q}}\,(R_3 e_q \otimes R_3 e_q \otimes R_3 e_q). \quad (26) $$

Note that, as $R_3$ is nearly orthonormal, $\tilde{G}$ is a nearly orthogonally decomposable tensor. We now present a lemma showing that $P_{\Omega_3}(M_3)[\hat{U}_{M_2}\hat{\Sigma}_{M_2}^{-1/2}, \hat{U}_{M_2}\hat{\Sigma}_{M_2}^{-1/2}, \hat{U}_{M_2}\hat{\Sigma}_{M_2}^{-1/2}]$ and $\hat{A}(\tilde{G})$ are "close".

Lemma A.8. $P_{\Omega_3}(M_3)[\hat{U}_{M_2}\hat{\Sigma}_{M_2}^{-1/2}, \hat{U}_{M_2}\hat{\Sigma}_{M_2}^{-1/2}, \hat{U}_{M_2}\hat{\Sigma}_{M_2}^{-1/2}] = \hat{A}(\tilde{G}) + E$, where

$$ \|E\|_F \le \frac{12\,\mu_1^3\,\mu\, r^{3.5}\,\sigma_1(M_2)^{3/2}\,\varepsilon}{n\,\sqrt{w_{\min}}\,\sigma_r(M_2)^{3/2}}, $$

and we denote the Frobenius norm of a tensor by $\|E\|_F = \{\sum_{i,j,k} E_{i,j,k}^2\}^{1/2}$.

Proof. Define $H = \hat{A}(\tilde{G})$ and $F = P_{\Omega_3}(M_3)[\hat{U}_{M_2}\hat{\Sigma}_{M_2}^{-1/2}, \hat{U}_{M_2}\hat{\Sigma}_{M_2}^{-1/2}, \hat{U}_{M_2}\hat{\Sigma}_{M_2}^{-1/2}]$. Also, let $Q = U\Sigma V^T W^{1/2}$ and $\hat{Q} = \hat{U}_{M_2}\hat{\Sigma}_{M_2}^{-1/2}$. Note that $F_{abc} = \sum_{ijk} \delta_{ijk}\, M_3(i,j,k)\, \hat{Q}_{ia}\hat{Q}_{jb}\hat{Q}_{kc}$, where $\delta_{ijk} = 1$ if $(i,j,k) \in \Omega_3$ and $0$ otherwise. Also, $M_3(i,j,k) = \sum_{q\in[r]} \frac{1}{\sqrt{w_q}}\, Q_{iq} Q_{jq} Q_{kq}$. Hence,

$$ F_{abc} = \sum_{q\in[r]} \frac{1}{\sqrt{w_q}} \sum_{ijk} \delta_{ijk}\, Q_{iq} Q_{jq} Q_{kq}\, \hat{Q}_{ia}\hat{Q}_{jb}\hat{Q}_{kc}. \quad (27) $$

Note that $\sum_i \hat{Q}_{ia} Q_{iq} = \langle \hat{Q}_a, Q_q \rangle = e_a^T \hat{\Sigma}_{M_2}^{-1/2}\hat{U}_{M_2}^T U \Sigma V^T W^{1/2} e_q = e_a^T R_3 e_q$.
That is,

$$ F_{abc} = \tilde{G}_{abc} - \sum_{q\in[r]}\frac{1}{\sqrt{w_q}}\sum_{m\in[n]} \langle\hat{Q}^{(m)}_a, Q^{(m)}_q\rangle \langle\hat{Q}^{(m)}_b, Q^{(m)}_q\rangle \langle\hat{Q}^{(m)}_c, Q^{(m)}_q\rangle - \sum_{q\in[r]}\frac{1}{\sqrt{w_q}}\, e_a^T R_3 e_q \sum_{m\in[n]} \langle\hat{Q}^{(m)}_b, Q^{(m)}_q\rangle \langle\hat{Q}^{(m)}_c, Q^{(m)}_q\rangle - \sum_{q\in[r]}\frac{1}{\sqrt{w_q}}\, e_b^T R_3 e_q \sum_{m\in[n]} \langle\hat{Q}^{(m)}_a, Q^{(m)}_q\rangle \langle\hat{Q}^{(m)}_c, Q^{(m)}_q\rangle - \sum_{q\in[r]}\frac{1}{\sqrt{w_q}}\, e_c^T R_3 e_q \sum_{m\in[n]} \langle\hat{Q}^{(m)}_a, Q^{(m)}_q\rangle \langle\hat{Q}^{(m)}_b, Q^{(m)}_q\rangle. \quad (28) $$

On the other hand,

$$ \hat{\nu}(\tilde{G})_{ijk} = \begin{cases} \sum_{q\in[r]}\frac{1}{\sqrt{w_q}}\, e_i^T(\hat{U}_{M_2}\hat{\Sigma}^{1/2}_{M_2}R_3)e_q \cdot e_j^T(\hat{U}_{M_2}\hat{\Sigma}^{1/2}_{M_2}R_3)e_q \cdot e_k^T(\hat{U}_{M_2}\hat{\Sigma}^{1/2}_{M_2}R_3)e_q, & \text{if } \lceil i/\ell\rceil \ne \lceil j/\ell\rceil \ne \lceil k/\ell\rceil \ne \lceil i/\ell\rceil,\\ 0, & \text{otherwise}. \end{cases} \quad (29) $$

That is,

$$ H_{abc} = \sum_{q\in[r]}\frac{1}{\sqrt{w_q}}\sum_{ijk}\delta_{ijk}\, e_i^T(\hat{U}_{M_2}\hat{\Sigma}^{1/2}_{M_2}R_3)e_q \cdot e_j^T(\hat{U}_{M_2}\hat{\Sigma}^{1/2}_{M_2}R_3)e_q \cdot e_k^T(\hat{U}_{M_2}\hat{\Sigma}^{1/2}_{M_2}R_3)e_q \cdot \hat{Q}_{ia}\hat{Q}_{jb}\hat{Q}_{kc}. \quad (30) $$

Now, note that $\sum_i \hat{Q}_{ia}\, e_i^T(\hat{U}_{M_2}\hat{\Sigma}^{1/2}_{M_2}R_3)e_q = \langle \hat{Q}_a, \hat{U}_{M_2}\hat{\Sigma}^{1/2}_{M_2}R_3 e_q \rangle = e_a^T \hat{\Sigma}^{-1/2}_{M_2}\hat{U}^T_{M_2}\hat{U}_{M_2}\hat{\Sigma}^{1/2}_{M_2}R_3 e_q = e_a^T R_3 e_q$. Also, let $\tilde{Q} = \hat{U}_{M_2}\hat{U}^T_{M_2} Q$. That is,

$$ H_{abc} = \tilde{G}_{abc} - \sum_{q\in[r]}\frac{1}{\sqrt{w_q}}\sum_{m\in[n]} \langle\hat{Q}^{(m)}_a, \tilde{Q}^{(m)}_q\rangle \langle\hat{Q}^{(m)}_b, \tilde{Q}^{(m)}_q\rangle \langle\hat{Q}^{(m)}_c, \tilde{Q}^{(m)}_q\rangle - \sum_{q\in[r]}\frac{1}{\sqrt{w_q}}\, e_a^T R_3 e_q \sum_{m\in[n]} \langle\hat{Q}^{(m)}_b, \tilde{Q}^{(m)}_q\rangle \langle\hat{Q}^{(m)}_c, \tilde{Q}^{(m)}_q\rangle - \sum_{q\in[r]}\frac{1}{\sqrt{w_q}}\, e_b^T R_3 e_q \sum_{m\in[n]} \langle\hat{Q}^{(m)}_a, \tilde{Q}^{(m)}_q\rangle \langle\hat{Q}^{(m)}_c, \tilde{Q}^{(m)}_q\rangle - \sum_{q\in[r]}\frac{1}{\sqrt{w_q}}\, e_c^T R_3 e_q \sum_{m\in[n]} \langle\hat{Q}^{(m)}_a, \tilde{Q}^{(m)}_q\rangle \langle\hat{Q}^{(m)}_b, \tilde{Q}^{(m)}_q\rangle. \quad (31) $$
Now,

$$ \big|\langle\hat{Q}^{(m)}_c, \tilde{Q}^{(m)}_q\rangle - \langle\hat{Q}^{(m)}_c, Q^{(m)}_q\rangle\big| \le \|\hat{Q}^{(m)}_c\|\,\|\tilde{Q}^{(m)}_q - Q^{(m)}_q\| \le \|\hat{Q}^{(m)}_c\|\,\|(I - \hat{U}_{M_2}\hat{U}^T_{M_2})U\|_2\,\|\Sigma V^T W^{1/2}\|_2 \le \frac{\mu_1\sqrt{r}}{\sqrt{n(1-\varepsilon)\sigma_r(M_2)}}\,\varepsilon\,\sqrt{\sigma_1(M_2)}, $$

where we used $\|(I - \hat{U}_{M_2}\hat{U}^T_{M_2})U\|_2 \le \varepsilon$ from Remark A.6, and Remark A.9 below to bound $\|\hat{Q}^{(m)}_c\|$. Then, from Remark A.9,

$$ \big|\langle\hat{Q}^{(m)}_a, \tilde{Q}^{(m)}_q\rangle\langle\hat{Q}^{(m)}_c, \tilde{Q}^{(m)}_q\rangle - \langle\hat{Q}^{(m)}_a, Q^{(m)}_q\rangle\langle\hat{Q}^{(m)}_c, Q^{(m)}_q\rangle\big| \le \big|(\langle\hat{Q}^{(m)}_a, \tilde{Q}^{(m)}_q\rangle - \langle\hat{Q}^{(m)}_a, Q^{(m)}_q\rangle)\,\langle\hat{Q}^{(m)}_c, \tilde{Q}^{(m)}_q\rangle\big| + \big|\langle\hat{Q}^{(m)}_a, Q^{(m)}_q\rangle\,(\langle\hat{Q}^{(m)}_c, \tilde{Q}^{(m)}_q\rangle - \langle\hat{Q}^{(m)}_c, Q^{(m)}_q\rangle)\big| \le \frac{\mu_1\sqrt{r\,\sigma_1(M_2)}}{\sqrt{n(1-\varepsilon)\sigma_r(M_2)}}\,\varepsilon\,\frac{\mu_1(\mu+\mu_1)\, r}{n(1-\varepsilon)\sigma_r(M_2)} . $$

Further, $|e_a^T R_3 e_q| \le \mu_1\sqrt{r\,\sigma_1(M_2)/(n(1-\varepsilon)\sigma_r(M_2))}$. The desired bound now follows by using the above inequalities to bound $\|E\|_F = \|H - F\|_F$.

Remark A.9. For $\tilde{Q} = \hat{U}_{M_2}\hat{U}^T_{M_2}Q$, $Q = U\Sigma V^T W^{1/2}$, and $\hat{Q} = \hat{U}_{M_2}\hat{\Sigma}^{-1/2}_{M_2}$, suppose $M_2$ is $\mu$-incoherent and $\hat{M}_2$ is $\mu_1$-incoherent. Then

$$ \|\hat{Q}^{(m)}_c\| \le \frac{\mu_1\, r^{1/2}}{\sqrt{(1-\varepsilon)\, n\, \sigma_r(M_2)}}, \qquad \|\tilde{Q}^{(m)}_c\| \le \mu_1\sqrt{\frac{r\,\sigma_1(M_2)}{n}}, \qquad \|Q^{(m)}_c\| \le \mu\sqrt{\frac{r\,\sigma_1(M_2)}{n}}. $$

Proof.

$$ \|\hat{Q}^{(m)}_c\| = \frac{1}{\sqrt{\hat{\Sigma}_{cc}}}\Big\{\sum_{a\in[\ell]} (\hat{U}_{M_2})^2_{\ell(m-1)+a,\,c}\Big\}^{1/2} \le \frac{\mu_1\sqrt{r/n}}{\sqrt{\sigma_r(M_2)(1-\varepsilon)}}. $$

The rest of the remark follows similarly.

Next, we show that $\|\hat{A}^{-1}\|_2$ is small.

Lemma A.10. $\sigma_{\min}(\hat{A}) \ge 1 - 8 r^3 \sigma_1(M_2)^2 (1+\varepsilon)^2/(n\,\sigma_r(M_2)^2 (1-\varepsilon)^2)$, and hence

$$ \|\hat{A}^{-1}\|_2 \le \frac{1}{1 - 72\, r^3 \sigma_1(M_2)^2/(n\,\sigma_r(M_2)^2)} . $$

Proof. Let $\hat{Q} = \hat{U}_{M_2}\hat{\Sigma}^{-1/2}_{M_2}$, $\tilde{Q} = \hat{U}_{M_2}\hat{\Sigma}^{1/2}_{M_2}$, and $H = \hat{A}(Z)$. Then

$$ H_{abc} = \sum_{ijk}\delta_{ijk}\sum_{a'b'c'} Z_{a'b'c'}\,\tilde{Q}_{ia'}\tilde{Q}_{jb'}\tilde{Q}_{kc'}\,\hat{Q}_{ia}\hat{Q}_{jb}\hat{Q}_{kc}, $$

where $\delta_{ijk} = 1$ if $(i,j,k)\in\Omega_3$ and $0$ otherwise.
That is,

$$ H_{abc} = Z_{abc} - \sum_{a'b'c'} Z_{a'b'c'}\sum_{m\in[n]} \langle\hat{Q}^{(m)}_a,\tilde{Q}^{(m)}_{a'}\rangle \langle\hat{Q}^{(m)}_b,\tilde{Q}^{(m)}_{b'}\rangle \langle\hat{Q}^{(m)}_c,\tilde{Q}^{(m)}_{c'}\rangle - \sum_{b'c'} Z_{ab'c'}\sum_{m\in[n]} \langle\hat{Q}^{(m)}_b,\tilde{Q}^{(m)}_{b'}\rangle \langle\hat{Q}^{(m)}_c,\tilde{Q}^{(m)}_{c'}\rangle - \sum_{a'c'} Z_{a'bc'}\sum_{m\in[n]} \langle\hat{Q}^{(m)}_a,\tilde{Q}^{(m)}_{a'}\rangle \langle\hat{Q}^{(m)}_c,\tilde{Q}^{(m)}_{c'}\rangle - \sum_{a'b'} Z_{a'b'c}\sum_{m\in[n]} \langle\hat{Q}^{(m)}_a,\tilde{Q}^{(m)}_{a'}\rangle \langle\hat{Q}^{(m)}_b,\tilde{Q}^{(m)}_{b'}\rangle. \quad (32) $$

Let $\mathrm{vec}(H) = B\cdot\mathrm{vec}(Z)$. We know that $|\langle\tilde{Q}^{(m)}_a, \hat{Q}^{(m)}_a\rangle| \le \mu_1^2 r/n$ and $|\langle\tilde{Q}^{(m)}_a, \tilde{Q}^{(m)}_{a'}\rangle| \le \mu_1^2 r\,\sigma_1(M_2)(1+\varepsilon)/(n\,\sigma_r(M_2)(1-\varepsilon))$ for $a \ne a'$. Now, using the above equation and incoherence,

$$ 1 - \frac{4 r^2 \mu_1^4}{n} \le B_{pp} \le 1 + \frac{4 r^2 \mu_1^4}{n} \quad \text{for all } p. $$

Similarly, $|B_{pq}| \le 4 r^2 \mu_1^4\, \sigma_1(M_2)^2 (1+\varepsilon)^2/(n\,\sigma_r(M_2)^2(1-\varepsilon)^2)$ for all $p \ne q$. The lemma now follows from Gershgorin's circle theorem.

Finally, we combine the above two lemmas to show that the least squares procedure approximately recovers $\tilde{G}$.

Lemma A.11. Let $\tilde{G}$ be as defined in (26). Also, let $\hat{G}$ be obtained by solving the following least squares problem:

$$ \hat{G} = \arg\min_Z \big\| \hat{A}(Z) - P_{\Omega_3}(M_3)\big[\hat{U}_{M_2}\hat{\Sigma}^{-1/2}_{M_2}, \hat{U}_{M_2}\hat{\Sigma}^{-1/2}_{M_2}, \hat{U}_{M_2}\hat{\Sigma}^{-1/2}_{M_2}\big] \big\|_F^2 . $$

Then, for $n \ge 144\, r^3 \sigma_1(M_2)^2/\sigma_r(M_2)^2$ such that $\|\hat{A}^{-1}\|_2 \le 2$,

$$ \|\hat{G} - \tilde{G}\|_F \le \frac{24\,\mu_1^3\,\mu\, r^{3.5}\,\sigma_1(M_2)^{3/2}\,\varepsilon}{n\,\sqrt{w_{\min}}\,\sigma_r(M_2)^{3/2}} . $$

Proof. Note that $\hat{A} : \mathbb{R}^{r\times r\times r} \to \mathbb{R}^{r\times r\times r}$ is a square operator. Moreover, using Lemma A.8, $P_{\Omega_3}(M_3)[\hat{U}_{M_2}\hat{\Sigma}^{-1/2}_{M_2}, \hat{U}_{M_2}\hat{\Sigma}^{-1/2}_{M_2}, \hat{U}_{M_2}\hat{\Sigma}^{-1/2}_{M_2}] = \hat{A}(\tilde{G}) + E$. Since $\hat{A}$ is invertible, the least squares solution attains this target exactly, so $\|\hat{G} - \tilde{G}\|_F = \|\hat{A}^{-1}(\hat{A}(\hat{G}) - \hat{A}(\tilde{G}))\|_F \le \|\hat{A}^{-1}\|_2\,\|E\|_F$. Together with Lemmas A.8 and A.10, we get the desired bound.

Proof of Theorem 4.3. Note that $\hat{A} : \mathbb{R}^{r\times r\times r} \to \mathbb{R}^{r\times r\times r}$ is a square operator.
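The least squares step in Lemma A.11 is an ordinary linear solve once the operator is materialized: $\mathrm{vec}(\hat{A}(Z)) = B\,\mathrm{vec}(Z)$ for an $r^3\times r^3$ matrix $B$, as used in the proof of Lemma A.10. The following toy sketch takes that view (the function name is ours, and at these small sizes $B$ can simply be built column by column from basis tensors):

```python
import numpy as np

def solve_tensor_lstsq(A_op, F, r):
    """Solve min_Z ||A_op(Z) - F||_F^2 over Z in R^{r x r x r} by
    materializing the matrix B with vec(A_op(Z)) = B vec(Z), one
    basis tensor at a time, then calling a dense least-squares solver."""
    B = np.zeros((r ** 3, r ** 3))
    for p in range(r ** 3):
        E_p = np.zeros(r ** 3)
        E_p[p] = 1.0
        B[:, p] = A_op(E_p.reshape(r, r, r)).ravel()
    z, *_ = np.linalg.lstsq(B, F.ravel(), rcond=None)
    return z.reshape(r, r, r)
```

When $\hat{A}$ is well conditioned, as guaranteed by Lemma A.10, the error in the recovered tensor is controlled by $\|\hat{A}^{-1}\|_2\,\|E\|_F$, exactly as in the lemma.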
Moreover, using Lemma A.8, $P_{\Omega_3}(M_3)[\hat{U}_{M_2}\hat{\Sigma}^{-1/2}_{M_2}, \hat{U}_{M_2}\hat{\Sigma}^{-1/2}_{M_2}, \hat{U}_{M_2}\hat{\Sigma}^{-1/2}_{M_2}] = \hat{A}(\tilde{G}) + E$. In the case of finitely many samples, we use

$$ S_3 = \frac{2}{|S|}\sum_{t = |S|/2 + 1}^{|S|} x_t \otimes x_t \otimes x_t $$

to estimate the low-dimensional tensor $\tilde{G}$. For brevity, write $\hat{B} \equiv \hat{U}_{M_2}\hat{\Sigma}^{-1/2}_{M_2}$. In particular, we compute the quantity

$$ \hat{H} = P_{\Omega_3}(S_3)\big[\hat{B}, \hat{B}, \hat{B}\big]. \quad (33) $$

We then use this quantity to solve the least squares problem; that is, we find $\hat{G}$ as

$$ \hat{G} = \arg\min_Z \|\hat{A}(Z) - \hat{H}\|_F^2 . $$

Now, we show that this procedure gives a $\hat{G}$ close to $\tilde{G}$ (see (26)):

$$ \|\hat{G} - \tilde{G}\|_F = \|\hat{A}^{-1}(\hat{A}(\hat{G})) - \hat{A}^{-1}(\hat{A}(\tilde{G}))\|_F = \|\hat{A}^{-1}(P_{\Omega_3}(S_3)[\hat{B}, \hat{B}, \hat{B}]) - \hat{A}^{-1}(\hat{A}(\tilde{G}))\|_F = \|\hat{A}^{-1}(P_{\Omega_3}(S_3 - M_3 + M_3)[\hat{B}, \hat{B}, \hat{B}]) - \hat{A}^{-1}(\hat{A}(\tilde{G}))\|_F \le \|\hat{A}^{-1}\|_2\Big(\|E\|_F + \|P_{\Omega_3}(S_3 - M_3)[\hat{B}, \hat{B}, \hat{B}]\|_F\Big) \le \|\hat{A}^{-1}\|_2\Big(\frac{12\,\mu_1^3\,\mu\, r^{3.5}\,\sigma_1(M_2)^{3/2}\,\varepsilon}{n\,\sqrt{w_{\min}}\,\sigma_r(M_2)^{3/2}} + \|P_{\Omega_3}(S_3 - M_3)[\hat{B}, \hat{B}, \hat{B}]\|_F\Big). $$

This finishes the proof of the desired claim.

A.3 Proof of Lemma 4.2

Let $E = E^{(1)} - E^{(2)}$, where $E^{(1)} \equiv S_2 - \mathbb{E}[S_2]$, $E^{(2)} \equiv P_{\Omega_2^c}(S_2 - \mathbb{E}[S_2])$, and $\Omega_2^c$ is the complement of $\Omega_2$. We first note that $\|x_t\|^2 = n$. Hence, applying the matrix Hoeffding bound (see Theorem 1.3 of [Tro12]), we get, with probability at least $1 - \delta$,

$$ \|E^{(1)}\|_2 = \Big\| \frac{2}{|S|}\sum_{t\in\{1,\dots,|S|/2\}} x_t x_t^T - \mathbb{E}\Big[\frac{2}{|S|}\sum_{t\in\{1,\dots,|S|/2\}} x_t x_t^T\Big] \Big\|_2 \le \sqrt{\frac{32\, n^2 \log(n\ell/\delta)}{|S|}} . $$

The second term $E^{(2)}$ is a diagonal matrix, with each diagonal entry $E^{(2)}_{ii}$ distributed as a (centered) binomial.
Applying the standard Hoeffding bound, we get that, with probability at least $1 - \delta$,

$$ \|E^{(2)}\|_2 = \max_{i\in[\ell n]} |E^{(2)}_{ii}| \le \sqrt{\frac{2\log(2/\delta)}{|S|}} . $$

This gives the desired bound, since $\|E\|_2 \le \|E^{(1)}\|_2 + \|E^{(2)}\|_2$. Similarly, $x_{t,i}\|x_t\|_2 \le \sqrt{n}$ for all $i$. Hence, using the standard Hoeffding bound, we get, with probability at least $1 - \delta$,

$$ \Big\| \frac{2}{|S|}\sum_{t\in[|S|/2]} (x_t x_t^T)_i - \mathbb{E}[S_2]_i \Big\|_2 \le \sqrt{\frac{16\, n \log(2/\delta)}{|S|}} . $$

A.4 Proof of Lemma 4.4

The claim follows from the following lemma.

Lemma A.12. Let $H = P_{\Omega_3}(M_3)[\hat{U}_{M_2}\hat{\Sigma}^{-1/2}_{M_2}, \hat{U}_{M_2}\hat{\Sigma}^{-1/2}_{M_2}, \hat{U}_{M_2}\hat{\Sigma}^{-1/2}_{M_2}]$ and let $\hat{H}$ be as defined above. Then, with probability larger than $1 - \delta$, we have

$$ |H_{abc} - \hat{H}_{abc}| \le 2\sqrt{2}\,\Big(\frac{r\, n}{\sigma_r(M_2)}\Big)^{3/2}\mu_1^3\,\sqrt{\frac{\log(1/\delta)}{|S|}} . $$

Proof. Let $\hat{H}_{abc} = \frac{1}{|S|}\sum_{t\in S} Y^t_{a,b,c}$, where $Y^t_{a,b,c} = \sum_{(i,j,k)\in\Omega_3} x_{t,i} x_{t,j} x_{t,k}\, \hat{Q}_{ia}\hat{Q}_{jb}\hat{Q}_{kc}$ and $\hat{Q} = \hat{U}_{M_2}\hat{\Sigma}^{-1/2}_{M_2}$. Then $\mathbb{E}[Y^t] = H$. That is,

$$ Y^t_{a,b,c} = \langle\hat{Q}_a, x_t\rangle\langle\hat{Q}_b, x_t\rangle\langle\hat{Q}_c, x_t\rangle - \sum_{m\in[n]} \langle\hat{Q}^{(m)}_a, x^{(m)}_t\rangle\langle\hat{Q}^{(m)}_b, x^{(m)}_t\rangle\langle\hat{Q}^{(m)}_c, x^{(m)}_t\rangle - \langle\hat{Q}_a, x_t\rangle\sum_{m\in[n]} \langle\hat{Q}^{(m)}_b, x^{(m)}_t\rangle\langle\hat{Q}^{(m)}_c, x^{(m)}_t\rangle - \langle\hat{Q}_b, x_t\rangle\sum_{m\in[n]} \langle\hat{Q}^{(m)}_a, x^{(m)}_t\rangle\langle\hat{Q}^{(m)}_c, x^{(m)}_t\rangle - \langle\hat{Q}_c, x_t\rangle\sum_{m\in[n]} \langle\hat{Q}^{(m)}_a, x^{(m)}_t\rangle\langle\hat{Q}^{(m)}_b, x^{(m)}_t\rangle. \quad (34) $$

Note that $|\langle\hat{Q}^{(m)}_b, x^{(m)}_t\rangle| \le \mu_1\sqrt{r}/\sqrt{n(1-\varepsilon)\sigma_r(M_2)}$. Hence, for all $a\in[r]$, $|\langle\hat{Q}_a, x_t\rangle| \le \mu_1\sqrt{r n}/\sqrt{(1-\varepsilon)\sigma_r(M_2)}$. Using the above inequality with (34), we get $|Y^t_{a,b,c}| \le \big(r n/((1-\varepsilon)\sigma_r(M_2))\big)^{3/2}\mu_1^3$. The lemma now follows by using Hoeffding's inequality.

A.5 Proof of Theorem 3.1

We first observe that $U_{M_2} = U R_1$, where $R_1 \in \mathbb{R}^{r\times r}$ is an orthonormal matrix. Also, $\Sigma_{M_2} = R_1^T \Sigma V^T W V \Sigma R_1$.
Hence, $\Sigma^{1/2}_{M_2} = R_1^T \Sigma V^T W^{1/2} R_3$, where $R_3$ is an orthonormal matrix. Moreover, $\Sigma^{-1/2}_{M_2} = R_3^T W^{-1/2} V \Sigma^{-1} R_1$. Hence,

$$ G = M_3\big[U_{M_2}\Sigma^{-1/2}_{M_2}, U_{M_2}\Sigma^{-1/2}_{M_2}, U_{M_2}\Sigma^{-1/2}_{M_2}\big] = \sum_{q=1}^{r} w_q\,(R_3^T W^{-1/2} e_q) \otimes (R_3^T W^{-1/2} e_q) \otimes (R_3^T W^{-1/2} e_q) = \sum_{q=1}^{r} \frac{1}{\sqrt{w_q}}\,(R_3^T e_q) \otimes (R_3^T e_q) \otimes (R_3^T e_q). \quad (35) $$

Now, using the orthogonal tensor decomposition method of [AGH+12], we get $\Lambda_G = W^{-1/2}$ as the eigenvalues and $V_G = R_3^T$ as the eigenvectors. The theorem now follows by observing

$$ U_{M_2}\cdot\Sigma^{1/2}_{M_2}\cdot V_G\cdot\Lambda_G = U_{M_2}\cdot R_1^T \Sigma V^T W^{1/2} R_3 \cdot R_3^T \cdot W^{-1/2} = U\Sigma V^T = \Pi. $$

A.6 Proof of Theorem 3.2 and Theorem 3.3

Proof of Theorem 3.2. Recall that in this case the number of samples is infinite, i.e., $|S| = \infty$. Hence, $P_{\Omega_2}(S_2) = P_{\Omega_2}(M_2)$; that is, $E = 0$. Furthermore, $T = \infty$. Hence, using Theorem 4.1, Algorithm 2 exactly recovers $M_2$, i.e., $\hat{M}^{(T)}_2 = M_2$. Furthermore, using Theorem 4.3, we have $\hat{G} = \tilde{G}$, as $\varepsilon = \|M_2 - \hat{M}_2\|_2 = 0$ and $|S| = \infty$. Now, consider

$$ R_3 R_3^T = \hat{\Sigma}^{-1/2}_{M_2}\hat{U}^T_{M_2}\Pi W^{1/2}\cdot W^{1/2}\Pi^T\hat{U}_{M_2}\hat{\Sigma}^{-1/2}_{M_2} = \hat{\Sigma}^{-1/2}_{M_2}\hat{U}^T_{M_2} M_2 \hat{U}_{M_2}\hat{\Sigma}^{-1/2}_{M_2} = I . $$

That is, $R_3$ is orthonormal. Hence, using the orthogonal decomposition method of [AGH+12] (see Theorem A.13), we get $V_G = R_3$ and $\Lambda_G = W^{-1/2}$. Now, using step 6 of Algorithm 1, $\hat{\Pi} = \hat{U}_{M_2}\hat{U}^T_{M_2}\Pi$. The theorem now follows since $\hat{U}_{M_2}\hat{U}^T_{M_2}U = U$ by Remark A.6. Also note that, from Theorem 4.1, $\hat{M}_2$ is $\mu_1$-incoherent with $\mu_1 = 6\mu\,\sigma_1(M_2)/\sigma_r(M_2)$.

Proof of Theorem 3.3. To simplify notation, we assume that the permutation matching the output of our algorithm to the true types is the identity permutation.
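The whitening identity (35) — contracting the third moment with $U_{M_2}\Sigma^{-1/2}_{M_2}$ to obtain an orthogonally decomposable tensor — can be checked numerically. The following is a minimal sketch (the function name `whiten` and the toy dimensions are ours); since the resulting factors $R_3^T e_q$ are orthonormal, the squared Frobenius norm of $G$ equals $\sum_q 1/w_q$, which gives a simple sanity check:

```python
import numpy as np

def whiten(M2, M3, r):
    """Whiten the third moment M3 by the top-r inverse square root of M2:
    returns G = M3[Wh, Wh, Wh] with Wh = U_r diag(s_r)^{-1/2}."""
    vals, vecs = np.linalg.eigh(M2)          # ascending eigenvalues
    Wh = vecs[:, -r:] / np.sqrt(vals[-r:])   # top-r whitening map
    return np.einsum('ijk,ia,jb,kc->abc', M3, Wh, Wh, Wh)
```

With exact moments $M_2 = \Pi W \Pi^T$ and $M_3 = \sum_q w_q\, \pi_q^{\otimes 3}$, the output is $\sum_q w_q^{-1/2} (R_3^T e_q)^{\otimes 3}$ for an orthonormal $R_3$, exactly as in (35).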
Define

$$ \varepsilon_M \equiv \frac{\|\hat{M}_2 - M_2\|_2}{\sigma_r(M_2)} \quad \text{and} \quad \varepsilon_G \equiv \|\hat{G} - \tilde{G}\|_2, \quad (36) $$

where $\hat{G}$ is the output of TensorLS and $\tilde{G} = M_3[\hat{U}_{M_2}\hat{\Sigma}^{-1/2}_{M_2}, \hat{U}_{M_2}\hat{\Sigma}^{-1/2}_{M_2}, \hat{U}_{M_2}\hat{\Sigma}^{-1/2}_{M_2}]$. The spectral algorithm outputs $\hat{\Pi} = \hat{U}_{M_2}\hat{\Sigma}^{1/2}_{M_2}\hat{V}_G\hat{\Lambda}_G$, and we know that $\Pi = U_{M_2}\Sigma^{1/2}_{M_2}V_G W^{-1/2}$. In order to show that these two matrices are close, one might hope to prove that each of the factors is close to its counterpart; for example, we would want $\|U_{M_2} - \hat{U}_{M_2}\|_2$ to be small. However, even if $U_{M_2}$ and $\hat{U}_{M_2}$ span the same subspace, this distance can be quite large. Hence, we project $\Pi$ onto the subspace spanned by $\hat{U}_{M_2}$ to prove the bound we want. Define

$$ R_3 \equiv \hat{\Sigma}^{-1/2}_{M_2}\hat{U}^T_{M_2}\Pi W^{1/2}, \quad (37) $$

such that

$$ \tilde{G} = \sum_{i=1}^{r} \frac{1}{\sqrt{w_i}}\,(\tilde{v}_i \otimes \tilde{v}_i \otimes \tilde{v}_i), \quad (38) $$

where $R_3 = [\tilde{v}_1, \dots, \tilde{v}_r]$. Then we have $\hat{U}^T_{M_2}\Pi = \hat{\Sigma}^{1/2}_{M_2} R_3 W^{-1/2}$. Then,

$$ \|\Pi - \hat{\Pi}\|_2 \le \|\hat{U}_{M_2}\hat{U}^T_{M_2}\Pi - \Pi\|_2 + \|\hat{\Pi} - \hat{U}_{M_2}\hat{U}^T_{M_2}\Pi\|_2 = \|(\hat{U}_{M_2}\hat{U}^T_{M_2} - I)\Pi\|_2 + \|\hat{U}_{M_2}\hat{\Sigma}^{1/2}_{M_2}\hat{V}_G\hat{\Lambda}_G - \hat{U}_{M_2}\hat{\Sigma}^{1/2}_{M_2} R_3 W^{-1/2}\|_2 \le \|(\hat{U}_{M_2}\hat{U}^T_{M_2} - I)\Pi\|_2 + \|\hat{U}_{M_2}\hat{\Sigma}^{1/2}_{M_2}(\hat{V}_G - R_3)W^{-1/2}\|_2 + \|\hat{U}_{M_2}\hat{\Sigma}^{1/2}_{M_2}\hat{V}_G(\hat{\Lambda}_G - W^{-1/2})\|_2. \quad (39) $$

To bound the first term, denote the SVD of $\Pi$ by $\Pi = U\Sigma V^T$. Using Remark A.6, $\|\hat{U}_{M_2}\hat{U}^T_{M_2}\Pi - \Pi\|_2 \le \|\hat{U}_{M_2}\hat{U}^T_{M_2}U - U\|_2\,\|\Sigma\|_2 \le \varepsilon_M\,\sigma_1(\Pi)$. Note that $\|\hat{\Sigma}_{M_2}\|_2 \le \|\hat{M}_2 - M_2\|_2 + \|M_2\|_2 \le \varepsilon_M\sigma_r(M_2) + \|M_2\|_2 \le 2\|M_2\|_2$ when $\varepsilon_M \le 1/2$.

To prove that the second term is bounded by $C\sqrt{\|M_2\|_2\, r\, w_{\max}/w_{\min}}\,(\varepsilon_G + (1/\sqrt{w_{\min}})\varepsilon_M)$, we claim that

$$ \|R_3 - \hat{V}_G\|_2 \le C\sqrt{r\, w_{\max}}\Big(\varepsilon_G + \frac{1}{\sqrt{w_{\min}}}\varepsilon_M\Big), \quad \text{and} \quad \|W^{-1/2} - \hat{\Lambda}_G\|_2 \le C\Big(\varepsilon_G + \frac{1}{\sqrt{w_{\min}}}\varepsilon_M\Big). $$

Now recall that $R_3 = \hat{\Sigma}^{-1/2}_{M_2}\hat{U}^T_{M_2}\Pi W^{1/2}$. Let the SVD of $R_3$ be $R_3 = U_1\Sigma_1 V_1^T$, and define the orthogonal matrix $R = U_1 V_1^T$, so that $R R^T = R^T R = I$.
Using Remark A.7, we have $\|R_3 - R\|_2 \le 2\varepsilon_M$. Moreover,

$$ \tilde{G} = \sum_{q\in[r]} \frac{1}{\sqrt{w_q}}\,(Re_q \otimes Re_q \otimes Re_q) + E_G, \quad \text{where} \quad \|E_G\|_2 \le \frac{2\varepsilon_M(1+\varepsilon_M)^2}{\sqrt{w_{\min}}} \le \frac{8\varepsilon_M}{\sqrt{w_{\min}}}, \quad (40) $$

where the last inequality follows from $\varepsilon_M \le 1$. Hence, using (36) and (40), we have (w.p. $\ge 1 - 2\delta$)

$$ \Big\|\hat{G} - \sum_{q\in[r]} \frac{1}{\sqrt{w_q}}\,(Re_q \otimes Re_q \otimes Re_q)\Big\|_2 \le \varepsilon_G + \|E_G\|_2 \le \varepsilon_G + \frac{8}{\sqrt{w_{\min}}}\varepsilon_M. \quad (41) $$

Since $R$ is orthogonal by construction, we can apply Theorem A.13 to bound the distance between $\hat{V}_G$ and $R$, i.e., $\|\hat{V}_G - R\|_2 \le 8\sqrt{r\, w_{\max}}\,(\varepsilon_G + (8/\sqrt{w_{\min}})\varepsilon_M)$. By the triangle inequality, we get

$$ \|\hat{V}_G - R_3\|_2 \le \|\hat{V}_G - R\|_2 + \|R - R_3\|_2 \le 8\sqrt{r\, w_{\max}}\Big(\varepsilon_G + \frac{8}{\sqrt{w_{\min}}}\varepsilon_M\Big) + 2\varepsilon_M \le C\sqrt{r\, w_{\max}}\Big(\varepsilon_G + \frac{1}{\sqrt{w_{\min}}}\varepsilon_M\Big). $$

Similarly, $\|W^{-1/2} - \hat{\Lambda}_G\|_2 \le 5\varepsilon_G + (8/\sqrt{w_{\min}})\varepsilon_M$. This implies that the third term in (39) is bounded by $\|\hat{U}_{M_2}\hat{\Sigma}^{1/2}_{M_2}\hat{V}_G(\hat{\Lambda}_G - W^{-1/2})\|_2 \le C\sqrt{\|M_2\|_2}\,(\varepsilon_G + \varepsilon_M/\sqrt{w_{\min}})$, using the assumption on $|S|$ such that $(\sqrt{r\, w_{\max}})\,\varepsilon_G \le C$ and $(\sqrt{r\, w_{\max}/w_{\min}})\,\varepsilon_M \le C$. Putting these bounds together, we get

$$ \|\hat{\Pi} - \Pi\|_2 \le C\sqrt{\frac{r\, w_{\max}\,\|M_2\|_2}{w_{\min}}}\Big(\varepsilon_G + \frac{1}{\sqrt{w_{\min}}}\varepsilon_M\Big), $$

where we used the fact that $\|\Pi\|_2 \le (1/\sqrt{w_{\min}})\|M_2\|_2^{1/2}$. From Theorems 4.1 and 4.3 and Lemmas 4.2 and 4.4, we get

$$ \varepsilon_M \le C\,\frac{n\,\|M_2\|_F\, r^{1/2}}{\sigma_r(M_2)^2}\sqrt{\frac{\log(n/\delta)}{|S|}}, \quad \text{and} \quad \varepsilon_G \le C\,\frac{\mu^4 r^{3.5}}{\sqrt{w_{\min}}}\Big(\frac{\sigma_1(M_2)}{\sigma_r(M_2)}\Big)^{4.5}\frac{\varepsilon_M}{n} + C\,\frac{r^3\mu^3\,\sigma_1(M_2)^3}{n^{1.5}\,\sigma_r(M_2)^{4.5}}\sqrt{\frac{\log(n/\delta)}{|S|}}, $$

when $|S| \ge C'(\ell + r)(n^2/\sigma_r(M_2)^2)\log(n/\delta)$ and $n \ge C'(r^3 + r^{1.5}\mu^2)(\sigma_1(M_2)/\sigma_r(M_2))^2$. Further, if $n \ge C'\mu^4 r^{3.5}(\sigma_1(M_2)/\sigma_r(M_2))^{4.5}$, then

$$ \varepsilon_G \le C\,\frac{1}{\sqrt{w_{\min}}}\,\varepsilon_M + C\,\frac{r^3\mu^3\,\sigma_1(M_2)^3}{n^{1.5}\,\sigma_r(M_2)^{4.5}}\sqrt{\frac{\log(n/\delta)}{|S|}}. $$

Theorem A.13 (Restatement of Theorem 5.1 of [AGH+12]). Let $G = \sum_{i\in[r]} \lambda_i\,(v_i \otimes v_i \otimes v_i) + E$, where $\|E\|_2 \le C_1\lambda_{\min}/r$.
Then, after $N \ge C_2\big(\log r + \log\log(\lambda_{\max}/\|E\|_2)\big)$ iterations, the tensor power method generates vectors $\hat{v}_i$, $1 \le i \le r$, and values $\hat{\lambda}_i$, $1 \le i \le r$, such that

$$ \|v_i - \hat{v}_{P(i)}\|_2 \le 8\|E\|_2/\lambda_{P(i)}, \qquad |\lambda_i - \hat{\lambda}_{P(i)}| \le 5\|E\|_2, \quad (42) $$

where $P$ is some permutation on $[r]$.

A.7 Proof of Corollary 3.4

Feldman et al. proved that if we have a good estimate of the $w_i$'s and $\pi_i$'s in absolute difference, then the thresholding and normalization defined in Section 3 give a good estimate in KL divergence.

Theorem A.14 ([FOS08, Theorem 12]). Assume $Z$ is a mixture of $r$ product distributions on $\{1,\dots,\ell\}^n$ with mixing weights $w_1,\dots,w_r$ and probabilities $\pi^{(j)}_{i,a}$, and the following are satisfied:

• for all $i\in[r]$ we have $|w_i - \hat{w}_i| \le \varepsilon_w$, and
• for all $i\in[r]$ such that $w_i \ge \varepsilon_{\min}$, we have $|\pi^{(j)}_{i,a} - \hat{\pi}^{(j)}_{i,a}| \le \varepsilon_\pi$ for all $j\in[n]$ and $a\in[\ell]$.

Then, for sufficiently small $\varepsilon_w$ and $\varepsilon_\pi$, the mixture $\hat{Z}$ satisfies

$$ D_{KL}(Z\,\|\,\hat{Z}) \le 12\, n\,\ell^3\,\varepsilon_\pi^{1/2} + n\, r\,\varepsilon_{\min}\log(\ell/\varepsilon_\pi) + \varepsilon_w^{1/3}. \quad (43) $$

For the right-hand side of (43) to be less than $\eta$, it suffices to have $\varepsilon_w = O(\eta^3)$, $\varepsilon_\pi = O(\eta^2/(n^2\ell^6))$, and $\varepsilon_{\min} = O(\eta/(n r \log(\ell/\varepsilon_\pi)))$. From Theorem 3.3, $|\hat{w}_i - w_i| = O(\varepsilon_M)$; then $\varepsilon_M \le C\eta^3$ for some positive constant $C$ ensures that the condition is satisfied with $\varepsilon_w = O(\eta^3)$. From Theorem 3.3, we know that $|\hat{\pi}^{(j)}_{i,a} - \pi^{(j)}_{i,a}| = O(\varepsilon_M\sqrt{\sigma_1(M_2)\, w_{\max}\, r/w_{\min}})$; then $\varepsilon_M \le C\,\eta^2 w_{\min}^{1/2}/(n^2\ell^6(\sigma_1(M_2)\, w_{\max}\, r)^{1/2})$ for some positive constant $C$ ensures that the condition is satisfied with $\varepsilon_\pi = O(\eta^2/(n^2\ell^6))$. These results hold for any value of $w_{\min}$, as long as it is positive; hence, we can take $\varepsilon_{\min} = 0$. It follows that for a choice of

$$ \varepsilon_M \le C\,\eta^2 \min\Big\{ \frac{w_{\min}^{1/2}}{n^2\ell^6\,(\sigma_1(M_2)\, w_{\max}\, r)^{1/2}},\ \eta \Big\}, $$

we have the desired bound on the KL divergence.
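The procedure guaranteed by Theorem A.13 is the tensor power method with deflation. The following is a minimal numpy sketch of that procedure for an exactly orthogonally decomposable tensor; it is our own simplified illustration, without the random restarts and robustness modifications of [AGH+12]:

```python
import numpy as np

def tensor_power(T, n_iter=100, seed=0):
    """One eigenpair of a (nearly) orthogonally decomposable symmetric
    tensor via the power iteration v <- T(I, v, v) / ||T(I, v, v)||."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = np.einsum('ijk,j,k->i', T, v, v)
        v /= np.linalg.norm(v)
    lam = np.einsum('ijk,i,j,k->', T, v, v, v)
    return lam, v

def tensor_decompose(T, r):
    """Recover r eigenpairs by repeated power iteration plus deflation."""
    pairs = []
    for _ in range(r):
        lam, v = tensor_power(T)
        pairs.append((lam, v))
        T = T - lam * np.einsum('i,j,k->ijk', v, v, v)  # deflate
    return pairs
```

On a perturbed input $G = \sum_i \lambda_i v_i^{\otimes 3} + E$, the recovered pairs are only accurate up to the $O(\|E\|_2)$ terms in (42); the sketch above is the noiseless skeleton of that method.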
A.8 Proof of Corollary 3.5

We use a technique similar to those used to analyze distance-based clustering algorithms in [AK01, AM05, McS01]. Our clustering algorithm uses the $\hat{\Pi}$ obtained in Algorithm 1 to reduce the dimension of the samples and then applies the distance-based clustering algorithm of [AK01]. Following the analysis of [AK01], we want to identify conditions under which two samples from the same type are closer to each other than two samples from different types. In order to get a large enough gap, we apply $\hat{\Pi}$ and show that $\|\hat{\Pi}^T(x_i - x_j)\| < \|\hat{\Pi}^T(x_i - x_k)\|$ for all $x_i$ and $x_j$ that belong to the same type and for all $x_k$ of a different type. For this, it is sufficient to show that $\|\hat{\Pi}^T(\pi_a - \pi_b)\| \ge 4\max_{i\in S}\|\hat{\Pi}^T(x_i - \mathbb{E}[x_i])\|$ for all $a \ne b \in [r]$.

From Theorem 3.3, we know that for $|S| \ge C\,\mu^6 r^7 n^3 \sigma_1(M_2)^7 w_{\max}\log(n/\delta)/(w_{\min}^2\,\sigma_r(M_2)^9\,\tilde{\varepsilon}^2)$, we have $\|\pi_a - \hat{\pi}_a\| \le \varepsilon_M\sqrt{r\, w_{\max}\,\sigma_1(M_2)/w_{\min}} \le \tilde{\varepsilon}$ for all $a\in[r]$. Then,

$$ \|\hat{\Pi}^T(\pi_a - \pi_b)\| \ge \|\Pi^T(\pi_a - \pi_b)\| - \|(\Pi - \hat{\Pi})^T(\pi_a - \pi_b)\| \ge \sqrt{(\pi_a^T(\pi_a - \pi_b))^2 + (\pi_b^T(\pi_a - \pi_b))^2} - \|\Pi - \hat{\Pi}\|_2\,\|\pi_a - \pi_b\| \ge \|\pi_a - \pi_b\|^2 - \sqrt{r}\,\tilde{\varepsilon}\,\|\pi_a - \pi_b\|. $$

On the other hand, applying a concentration-of-measure inequality gives

$$ P\Big( |\hat{\pi}_a^T(x_i - \mathbb{E}[x_i])| \ge \|\hat{\pi}_a\|\sqrt{2\log(r/\delta)} \Big) \le \frac{\delta}{r}. $$

Applying the union bound, $\|\hat{\Pi}^T(x_i - \mathbb{E}[x_i])\| \le \|\hat{\Pi}\|_F\sqrt{2\log(r/\delta)} \le (\sqrt{2}\,\|\Pi\|_F + \sqrt{2r}\,\tilde{\varepsilon})\sqrt{4\log(r/\delta)}$ with probability at least $1 - \delta$, where we used the fact that $\|\hat{\Pi}\|_F^2 \le \sum_a (\|\pi_a\| + \tilde{\varepsilon})^2 \le 2\sum_a (\|\pi_a\|^2 + \tilde{\varepsilon}^2) \le 2(\|\Pi\|_F + \sqrt{r}\,\tilde{\varepsilon})^2$.
For

$$ \tilde{\varepsilon} \le \frac{\|\pi_a - \pi_b\|^2 - \|\Pi\|_F\sqrt{8\log(r/\delta)}}{\sqrt{r}\,\|\pi_a - \pi_b\| + \sqrt{8 r \log(r/\delta)}}, $$

it follows that $\|\hat{\Pi}^T(\pi_a - \pi_b)\| \ge 4\max_{i\in S}\|\hat{\Pi}^T(x_i - \mathbb{E}[x_i])\|$, and this proves that the distance-based algorithm of [AK01] succeeds in finding the right clusters for all samples.
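The argument above (project the samples by $\hat{\Pi}$, then cluster by pairwise distances) can be sketched as follows. This is a deliberately simplified stand-in for the distance-based algorithm of [AK01], not the paper's procedure: the hypothetical threshold `tau` plays the role of the separation guarantee $4\max_i\|\hat{\Pi}^T(x_i - \mathbb{E}[x_i])\|$, and a greedy representative-based assignment replaces the full clustering analysis.

```python
import numpy as np

def cluster_by_projection(X, Pi_hat, tau):
    """Greedy distance-based clustering of the rows of X after projecting
    onto the columns of Pi_hat. A sample within tau of an existing
    representative joins that cluster; otherwise it opens a new one."""
    Z = X @ Pi_hat                     # dimension reduction by Pi_hat
    reps, labels = [], []
    for z in Z:
        for c, rep in enumerate(reps):
            if np.linalg.norm(z - rep) < tau:
                labels.append(c)
                break
        else:
            reps.append(z)
            labels.append(len(reps) - 1)
    return labels
```

When the projected within-type distances are below `tau` and the between-type distances are above it — exactly the gap established in the proof — this greedy pass recovers the correct partition.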