Isotropic PCA and Affine-Invariant Clustering



S. Charles Brubaker and Santosh S. Vempala
College of Computing, Georgia Tech. Email: {brubaker,vempala}@cc.gatech.edu

Abstract

We present an extension of Principal Component Analysis (PCA) and a new algorithm for clustering points in R^n based on it. The key property of the algorithm is that it is affine-invariant. When the input is a sample from a mixture of two arbitrary Gaussians, the algorithm correctly classifies the sample assuming only that the two components are separable by a hyperplane, i.e., there exists a halfspace that contains most of one Gaussian and almost none of the other in probability mass. This is nearly the best possible, improving known results substantially [14, 9, 1]. For k > 2 components, the algorithm requires only that there be some (k-1)-dimensional subspace in which the overlap in every direction is small. Here we define overlap to be the ratio of the following two quantities: 1) the average squared distance between a point and the mean of its component, and 2) the average squared distance between a point and the mean of the mixture. The main result may also be stated in the language of linear discriminant analysis: if the standard Fisher discriminant [8] is small enough, labels are not needed to estimate the optimal subspace for projection. Our main tools are isotropic transformation, spectral projection and a simple reweighting technique. We call this combination isotropic PCA.

1 Introduction

We present an extension to Principal Component Analysis (PCA), which is able to go beyond standard PCA in identifying "important" directions. When the covariance matrix of the input (distribution or point set in R^n) is a multiple of the identity, then PCA reveals no information; the second moment along any direction is the same. Such inputs are called isotropic. Our extension, which we call isotropic PCA, can reveal interesting information in such settings. We use this technique to give an affine-invariant clustering algorithm for points in R^n. When applied to the problem of unraveling mixtures of arbitrary Gaussians from unlabeled samples, the algorithm yields substantial improvements over known results.

To illustrate the technique, consider the uniform distribution on the set X = {(x, y) ∈ R^2 : x ∈ {-1, 1}, y ∈ [-√3, √3]}, which is isotropic. Suppose this distribution is rotated in an unknown way and that we would like to recover the original x and y axes. For each point in a sample, we may project it to the unit circle and compute the covariance matrix of the resulting point set. The x direction will correspond to the greater eigenvector, the y direction to the other. See Figure 1 for an illustration. Instead of projection onto the unit circle, this process may also be thought of as importance weighting, a technique which allows one to simulate one distribution with another. In this case, we are simulating a distribution over the set X where the density function is proportional to (1 + y^2)^{-1}, so that points near (1, 0) or (-1, 0) are more probable.

Figure 1: Mapping points to the unit circle and then finding the direction of maximum variance reveals the orientation of this isotropic distribution.
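To make the illustration above concrete, here is a minimal numpy sketch (our own, not code from the paper; the sample size and seed are arbitrary). It samples from a rotated copy of X, notes that ordinary PCA sees only the identity covariance, then projects every point to the unit circle and reads the hidden axes off the eigenvectors of the projected points' covariance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample from X = {-1, 1} x [-sqrt(3), sqrt(3)] (isotropic), then rotate by an
# "unknown" angle theta.
m = 5000
pts = np.column_stack([
    rng.choice([-1.0, 1.0], size=m),
    rng.uniform(-np.sqrt(3), np.sqrt(3), size=m),
])
theta = rng.uniform(0, np.pi)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
rotated = pts @ R.T

# Standard PCA reveals nothing: the sample covariance is close to the identity.
print(np.cov(rotated, rowvar=False))

# Isotropic-PCA-style step: project to the unit circle, then diagonalize.
projected = rotated / np.linalg.norm(rotated, axis=1, keepdims=True)
evals, evecs = np.linalg.eigh(np.cov(projected, rowvar=False))
print("recovered x-direction:", evecs[:, -1])   # top eigenvector
print("true rotated x-axis:  ", R[:, 0])        # agrees up to sign
```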
In this paper, we describe how to apply this method to mixtures of arbitrary Gaussians in R^n in order to find a set of directions along which the Gaussians are well-separated. These directions span the Fisher subspace of the mixture, a classical concept in Pattern Recognition. Once these directions are identified, points can be classified according to which component of the distribution generated them, and hence all parameters of the mixture can be learned.

What separates this paper from previous work on learning mixtures is that our algorithm is affine-invariant. Indeed, for every mixture distribution that can be learned using a previously known algorithm, there is a linear transformation of bounded condition number that causes the algorithm to fail. For k = 2 components our algorithm has nearly the best possible guarantees (and subsumes all previous results) for clustering Gaussian mixtures. For k > 2, it requires that there be a (k-1)-dimensional subspace where the overlap of the components is small in every direction (see Section 1.2). This condition can be stated in terms of the Fisher discriminant, a quantity commonly used in the field of Pattern Recognition with labeled data. Because our algorithm is affine-invariant, it makes it possible to unravel a much larger set of Gaussian mixtures than had been possible previously.

The first step of our algorithm is to place the mixture in isotropic position (see Section 1.2) via an affine transformation. This has the effect of making the (k-1)-dimensional Fisher subspace, i.e., the one that minimizes the Fisher discriminant, the same as the subspace spanned by the means of the components (in general they coincide only in isotropic position), for any mixture. The rest of the algorithm identifies directions close to this subspace and uses them to cluster, without access to labels. Intuitively this is hard since, after isotropy, standard PCA reveals no additional information. Before presenting the ideas and guarantees in more detail, we describe relevant related work.

1.1 Previous Work

A mixture model is a convex combination of distributions of known type. In the most commonly studied version, a distribution F in R^n is composed of k unknown Gaussians. That is,

$$F = w_1 N(\mu_1, \Sigma_1) + \ldots + w_k N(\mu_k, \Sigma_k),$$

where the mixing weights w_i, means µ_i, and covariance matrices Σ_i are all unknown. Typically, k ≪ n, so that a concise model explains a high-dimensional phenomenon. A random sample is generated from F by first choosing a component with probability equal to its mixing weight and then picking a random point from that component distribution. In this paper, we study the classical problem of unraveling a sample from a mixture, i.e., labeling each point in the sample according to its component of origin.

Heuristics for classifying samples include "expectation maximization" [5] and "k-means clustering" [11]. These methods can take a long time and can get stuck with suboptimal classifications. Over the past decade, there has been much progress on finding polynomial-time algorithms with rigorous guarantees for classifying mixtures, especially mixtures of Gaussians [4, 15, 14, 17, 9, 1].
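For reference, the two-stage sampling process described above is easy to write down. The sketch below (our own illustration, not from the paper; all parameter values are made up) draws labeled points from F = Σ_i w_i N(µ_i, Σ_i); the unraveling problem is to recover the labels from the points alone.

```python
import numpy as np

def sample_mixture(weights, means, covs, m, rng):
    """Draw m points from F = sum_i w_i N(mu_i, Sigma_i); also return the
    (normally hidden) component labels."""
    weights = np.asarray(weights, dtype=float)
    labels = rng.choice(len(weights), size=m, p=weights)      # pick a component
    points = np.array([rng.multivariate_normal(means[i], covs[i]) for i in labels])
    return points, labels

rng = np.random.default_rng(1)
w = [0.3, 0.7]
mu = [np.array([-2.0, 0.0]), np.array([2.0, 0.0])]
Sigma = [np.eye(2), np.diag([1.0, 4.0])]
X, z = sample_mixture(w, mu, Sigma, 1000, rng)
print(X.shape, np.bincount(z) / len(z))    # label frequencies ~ mixing weights
```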
Starting with Dasgupta's paper [4], one line of work uses the concentration of pairwise distances and assumes that the components' means are so far apart that distances between points from the same component are likely to be smaller than distances between points from different components. Arora and Kannan [14] establish nearly optimal results for such distance-based algorithms. Unfortunately, their results inherently require a separation that grows with the dimension of the ambient space and the largest variance of each component Gaussian. To see why this is unnatural, consider k well-separated Gaussians in R^k with means e_1, ..., e_k, i.e., each mean is one unit away from the origin along a unique coordinate axis. Adding extra dimensions with arbitrary variance does not affect the separability of these Gaussians, but these algorithms are no longer guaranteed to work. For example, suppose that each Gaussian has a maximum variance of ε ≪ 1. Then adding O*(kε^{-2}) extra dimensions with variance ε will violate the necessary separation conditions.

To improve on this, a subsequent line of work uses spectral projection (PCA). Vempala and Wang [17] showed that for a mixture of spherical Gaussians, the subspace spanned by the top k principal components of the mixture contains the means of the components. Thus, projecting to this subspace has the effect of shrinking the components while maintaining the separation between their means. This leads to a nearly optimal separation requirement of

$$\|\mu_i - \mu_j\| \ \geq\ \tilde\Omega(k^{1/4}) \max\{\sigma_i, \sigma_j\},$$

where µ_i is the mean of component i and σ_i² is the variance of component i along any direction. Note that there is no dependence on the dimension of the distribution. Kannan et al. [9] applied the spectral approach to arbitrary mixtures of Gaussians (and more generally, logconcave distributions) and obtained a separation that grows with a polynomial in k and the largest variance of each component:

$$\|\mu_i - \mu_j\| \ \geq\ \mathrm{poly}(k) \max\{\sigma_{i,\max}, \sigma_{j,\max}\},$$

where σ_{i,max}² is the maximum variance of the i-th component in any direction. The polynomial in k was improved in [1] along with matching lower bounds for this approach, suggesting this to be the limit of spectral methods. Going beyond this "spectral threshold" for arbitrary Gaussians has been a major open problem. The representative hard case is the special case of two parallel "pancakes", i.e., two Gaussians that are spherical in n-1 directions and narrow in the last direction, so that a hyperplane orthogonal to the last direction separates the two. The spectral approach requires a separation that grows with their largest standard deviation, which is unrelated to the distance between the pancakes (their means).

Figure 2: (a) Distance concentration separability. (b) Hyperplane separability. (c) Intermean hyperplane and Fisher hyperplane. Previous work requires distance concentration separability, which depends on the maximum directional variance (a). Our results require only hyperplane separability, which depends only on the variance in the separating direction (b). For non-isotropic mixtures the best separating direction may not be between the means of the components (c).
Other examples can be generated by starting with Gaussians in k dimensions that are separable and then adding other dimensions, one of which has large variance. Because there is a subspace where the Gaussians are separable, the separation requirement should depend only on the dimension of this subspace and the components' variances in it.

A related line of work considers learning symmetric product distributions, where the coordinates are independent. Feldman et al. [6] have shown that mixtures of axis-aligned Gaussians can be approximated without any separation assumption at all in time exponential in k. A. Dasgupta et al. [3] consider heavy-tailed distributions as opposed to Gaussian or logconcave ones and give conditions under which they can be clustered using an algorithm that is exponential in the number of samples. Chaudhuri and Rao [2] have recently given a polynomial-time algorithm for clustering such heavy-tailed product distributions.

1.2 Results

We assume we are given a lower bound w on the minimum mixing weight and k, the number of components. With high probability, our algorithm Unravel returns a partition of space by hyperplanes so that each part (a polyhedron) encloses almost all of the probability mass of a single component and almost none of the other components. The error of such a set of polyhedra is the total probability mass that falls outside the correct polyhedron.

We first state our result for two Gaussians in a way that makes clear the relationship to previous work that relies on separation.

Theorem 1. Let w_1, µ_1, Σ_1 and w_2, µ_2, Σ_2 define a mixture of two Gaussians. There is an absolute constant C such that, if there exists a direction v such that

$$|\mathrm{proj}_v(\mu_1 - \mu_2)| \ \geq\ C \left( \sqrt{v^T\Sigma_1 v} + \sqrt{v^T\Sigma_2 v} \right) w^{-2} \log^{1/2}\!\left( \frac{1}{w\delta} + \frac{1}{\eta} \right),$$

then with probability 1 - δ algorithm Unravel returns two complementary halfspaces that have error at most η, using time and a number of samples that is polynomial in n, w^{-1}, log(1/δ).

So the separation required between the means is comparable to the standard deviation in some direction. This separation condition of Theorem 1 is affine-invariant and much weaker than conditions of the form ‖µ_1 - µ_2‖ ≳ max{σ_{1,max}, σ_{2,max}} used in previous work; see Figure 2, where the dotted line shows how previous work effectively treats every component as spherical. We also note that the separating direction does not need to be the intermean direction, as illustrated in Figure 2(c): there the dotted line is the hyperplane induced by the intermean direction, which may be far from the optimal separating hyperplane shown by the solid line.

It will be insightful to state this result in terms of the Fisher discriminant, a standard notion from Pattern Recognition [8, 7] that is used with labeled data. In words, the Fisher discriminant along direction p is

J(p) = (the intra-component variance in direction p) / (the total variance in direction p).

Mathematically, this is expressed as

$$J(p) = \frac{\mathrm{E}\left[\|\mathrm{proj}_p(x - \mu_{\ell(x)})\|^2\right]}{\mathrm{E}\left[\|\mathrm{proj}_p(x)\|^2\right]} = \frac{p^T\left(w_1\Sigma_1 + w_2\Sigma_2\right)p}{p^T\left(w_1(\Sigma_1 + \mu_1\mu_1^T) + w_2(\Sigma_2 + \mu_2\mu_2^T)\right)p}$$

for x distributed according to a mixture distribution with means µ_i and covariance matrices Σ_i. We use ℓ(x) to indicate the component from which x was drawn.
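The Fisher discriminant is straightforward to evaluate from the mixture parameters. A small sketch for k = 2 (our own illustration; the "parallel pancake" parameters are made up) shows that the intermean direction has small J while an orthogonal direction does not:

```python
import numpy as np

def fisher_discriminant(p, w, mu, Sigma):
    """J(p) for a mixture whose overall mean is at the origin: intra-component
    variance along p divided by total variance along p."""
    p = np.asarray(p, dtype=float)
    intra = sum(wi * p @ Si @ p for wi, Si in zip(w, Sigma))
    total = sum(wi * (p @ Si @ p + (p @ mi) ** 2) for wi, mi, Si in zip(w, mu, Sigma))
    return intra / total

# Two "parallel pancakes": thin along the first coordinate, separated along it.
w = [0.5, 0.5]
mu = [np.array([-1.0, 0.0]), np.array([1.0, 0.0])]       # mixture mean is 0
Sigma = [np.diag([0.01, 25.0]), np.diag([0.01, 25.0])]
print(fisher_discriminant([1.0, 0.0], w, mu, Sigma))     # ~0.01: good direction
print(fisher_discriminant([0.0, 1.0], w, mu, Sigma))     # 1.0: no separation
```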
Theorem 2. There is an absolute constant C for which the following holds. Suppose that F is a mixture of two Gaussians such that there exists a direction p for which

$$J(p) \ \leq\ C\, w^3 \log^{-1}\!\left( \frac{1}{\delta w} + \frac{1}{\eta} \right).$$

With probability 1 - δ, algorithm Unravel returns a halfspace with error at most η using time and sample complexity polynomial in n, w^{-1}, log(1/δ).

There are several ways of generalizing the Fisher discriminant for k = 2 components to larger k [7]. These generalizations are most easily understood when the distribution is isotropic. An isotropic distribution has the identity matrix as its covariance and the origin as its mean. An isotropic mixture therefore has

$$\sum_{i=1}^k w_i\mu_i = 0 \qquad\text{and}\qquad \sum_{i=1}^k w_i\left(\Sigma_i + \mu_i\mu_i^T\right) = I.$$

It is well known that any distribution with bounded covariance matrix (and therefore any mixture) can be made isotropic by an affine transformation. As we will see shortly, for k = 2, for an isotropic mixture, the line joining the means is the direction that minimizes the Fisher discriminant.

Under isotropy, the denominator of the Fisher discriminant is always 1. Thus, the discriminant is just the expected squared distance between the projection of a point and the projection of its mean, where projection is onto some direction p. The generalization to k > 2 is natural, as we may simply replace projection onto direction p with projection onto a (k-1)-dimensional subspace S. For convenience, let

$$\Sigma = \sum_{i=1}^k w_i\Sigma_i.$$

Let the vectors p_1, ..., p_{k-1} be an orthonormal basis of S and let ℓ(x) be the component from which x was drawn. We then have under isotropy

$$J(S) = \mathrm{E}\left[\|\mathrm{proj}_S(x - \mu_{\ell(x)})\|^2\right] = \sum_{j=1}^{k-1} p_j^T\Sigma p_j$$

for x distributed according to a mixture distribution with means µ_i and covariance matrices Σ_i. As Σ is symmetric positive definite, it follows that the smallest k - 1 eigenvectors of the matrix are optimal choices of p_j and S is the span of these eigenvectors. This motivates our definition of the Fisher subspace for any mixture with bounded second moments (not necessarily Gaussians).

Definition 1. Let {w_i, µ_i, Σ_i} be the weights, means, and covariance matrices for an isotropic mixture distribution with mean at the origin and where dim(span{µ_1, ..., µ_k}) = k - 1. Let ℓ(x) be the component from which x was drawn. The Fisher subspace F is defined as the (k-1)-dimensional subspace that minimizes

$$J(S) = \mathrm{E}\left[\|\mathrm{proj}_S(x - \mu_{\ell(x)})\|^2\right]$$

over subspaces S of dimension k - 1.

Note that dim(span{µ_1, ..., µ_k}) is only k - 1 because isotropy implies Σ_{i=1}^k w_iµ_i = 0.

The next lemma provides a simple alternative characterization of the Fisher subspace as the span of the means of the components (after transforming to isotropic position). The proof is given in Section 3.2.

Lemma 1. Suppose {w_i, µ_i, Σ_i}_{i=1}^k defines an isotropic mixture in R^n. Let λ_1 ≥ ... ≥ λ_n be the eigenvalues of the matrix Σ = Σ_{i=1}^k w_iΣ_i and let v_1, ..., v_n be the corresponding eigenvectors. If the dimension of the span of the means of the components is k - 1, then the Fisher subspace

$$F = \mathrm{span}\{v_{n-k+2}, \ldots, v_n\} = \mathrm{span}\{\mu_1, \ldots, \mu_k\}.$$

Our algorithm attempts to find the Fisher subspace (or one close to it) and succeeds in doing so, provided the discriminant is small enough.
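Lemma 1 suggests a direct numerical check. The sketch below (ours; the mixture parameters are random and the final print is only indicative) builds an arbitrary mixture, puts it in isotropic position analytically, and verifies that the k - 1 smallest eigenvectors of Σ = Σ_i w_iΣ_i span the (transformed) means.

```python
import numpy as np

rng = np.random.default_rng(2)
k, n = 3, 6
w = np.array([0.2, 0.3, 0.5])
mu = rng.normal(size=(k, n))
covs = []
for _ in range(k):
    A = rng.normal(size=(n, n))
    covs.append(A @ A.T + np.eye(n))
covs = np.array(covs)

# Put the mixture in isotropic position analytically: shift by the mixture
# mean, then apply T = (second moment matrix)^(-1/2).
mean = w @ mu
second = sum(wi * (Si + np.outer(mi - mean, mi - mean))
             for wi, mi, Si in zip(w, mu, covs))
vals, vecs = np.linalg.eigh(second)
T = vecs @ np.diag(vals ** -0.5) @ vecs.T
mu_iso = (mu - mean) @ T.T
covs_iso = np.array([T @ Si @ T.T for Si in covs])

# Lemma 1: the k-1 smallest eigenvectors of Sigma = sum_i w_i Sigma_i span the means.
Sigma = np.einsum('i,ijk->jk', w, covs_iso)
fisher_basis = np.linalg.eigh(Sigma)[1][:, :k - 1]     # ascending eigenvalues
residual = mu_iso - (mu_iso @ fisher_basis) @ fisher_basis.T
print(np.max(np.abs(residual)))                        # numerically ~0
```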
The next definition will be useful in stating our main theorem precisely.

Definition 2. The overlap of a mixture given as in Definition 1 is

$$\phi = \min_{S:\dim(S)=k-1}\ \max_{p\in S}\ p^T\Sigma p. \qquad (1)$$

For non-isotropic mixtures, the Fisher discriminant generalizes to Σ_{j=1}^{k-1} p_j^T (Σ_{i=1}^k w_i(Σ_i + µ_iµ_i^T))^{-1} Σ p_j and the overlap to p^T (Σ_{i=1}^k w_i(Σ_i + µ_iµ_i^T))^{-1} Σ p.

It is a direct consequence of the Courant–Fischer min-max theorem that φ is the (k-1)-th smallest eigenvalue of the matrix Σ and the subspace achieving φ is the Fisher subspace, i.e.,

$$\phi = \left\|\mathrm{E}\left[\mathrm{proj}_F(x - \mu_{\ell(x)})\,\mathrm{proj}_F(x - \mu_{\ell(x)})^T\right]\right\|_2.$$

We can now state our main theorem for k > 2.

Theorem 3. There is an absolute constant C for which the following holds. Suppose that F is a mixture of k Gaussian components where the overlap satisfies

$$\phi \ \leq\ C\,w^3 k^{-3} \log^{-1}\!\left( \frac{nk}{\delta w} + \frac{1}{\eta} \right).$$

With probability 1 - δ, algorithm Unravel returns a set of k polyhedra that have error at most η using time and a number of samples that is polynomial in n, w^{-1}, log(1/δ).

In words, the algorithm successfully unravels arbitrary Gaussians provided there exists a (k-1)-dimensional subspace in which, along every direction, the expected squared distance of a point to its component mean is smaller than the expected squared distance to the overall mean by roughly a poly(k, 1/w) factor. There is no dependence on the largest variances of the individual components, and the dependence on the ambient dimension is logarithmic. This means that the addition of extra dimensions (even where the distribution has large variance), as discussed in Section 1.1, has little impact on the success of our algorithm.

2 Algorithm

The algorithm has three major components: an initial affine transformation, a reweighting step, and identification of a direction close to the Fisher subspace together with a hyperplane orthogonal to this direction which leaves each component's probability mass almost entirely in one of the two induced halfspaces. The key insight is that the reweighting technique will either cause the mean of the mixture to shift in the intermean subspace, or cause the top k - 1 principal components of the second moment matrix to approximate the intermean subspace. In either case, we obtain a direction along which we can partition the components.

We first find an affine transformation W which, when applied to F, results in an isotropic distribution. That is, we move the mean to the origin and apply a linear transformation to make the covariance matrix the identity. We apply this transformation to a new set of m_1 points {x_i} from F and then reweight according to a spherically symmetric Gaussian exp(-‖x‖²/(2α)) for α = Θ(n/w). We then compute the mean û and second moment matrix M̂ of the resulting set.

After the reweighting, the algorithm chooses either the new mean or the direction of maximum second moment and projects the data onto this direction h. By bisecting the largest gap between points, we obtain a threshold t, which along with h defines a hyperplane that separates the components. Using the notation H_{h,t} = {x ∈ R^n : h^Tx ≥ t} to indicate a halfspace, we then recurse on each half of the mixture. Thus, every node in the recursion tree represents an intersection of half-spaces.
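A minimal sketch of the reweighting step just described (our own code, not the authors'): given points already in isotropic position, it computes the weights exp(-‖x‖²/(2α)) and the reweighted mean and second-moment matrix. The particular choices n = 10, w = 0.3, and α = 2n/w are made-up values, with α = Θ(n/w) as in the text.

```python
import numpy as np

def reweighted_moments(X, alpha):
    """Reweighted mean u_hat and second-moment matrix M_hat of isotropic points X,
    using spherical Gaussian weights exp(-||x||^2 / (2*alpha))."""
    wts = np.exp(-np.sum(X ** 2, axis=1) / (2.0 * alpha))
    u_hat = (wts[:, None] * X).mean(axis=0)
    M_hat = (wts[:, None] * X).T @ X / len(X)
    return u_hat, M_hat

rng = np.random.default_rng(3)
n, w = 10, 0.3
alpha = 2.0 * n / w                          # alpha = Theta(n / w)
X = rng.normal(size=(5000, n))               # stand-in for an isotropic sample
u_hat, M_hat = reweighted_moments(X, alpha)
print(np.linalg.norm(u_hat), np.linalg.eigvalsh(M_hat)[-1])
```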
To make our analysis easier, we assume that we use different samples for each step of the algorithm. The reader might find it useful to read Section 2.1, which gives an intuitive explanation of how the algorithm works on parallel pancakes, before reviewing the details of the algorithm. (This practice of transforming the points and then looking at the second moment matrix can be viewed as a form of kernel PCA; however, the connection between our algorithm and kernel PCA is superficial. Our transformation does not result in any standard kernel. Moreover, it is dimension-preserving, being just a reweighting, and hence the "kernel trick" has no computational advantage.)

Algorithm 1: Unravel
Input: integer k, scalar w. Initialization: P = R^n.
1. (Isotropy) Use samples lying in P to compute an affine transformation W that makes the distribution nearly isotropic (mean zero, identity covariance matrix).
2. (Reweighting) Use m_1 samples in P and for each compute a weight e^{-‖x‖²/(2α)} (where α > n/w).
3. (Separating Direction) Find the mean û of the reweighted data. If ‖û‖ > √w/(32α), let h = û. Otherwise, find the covariance matrix M̂ of the reweighted points and let h be its top principal component.
4. (Recursion) Project m_2 sample points to h and find the largest gap between points in the interval [-1/2, 1/2]. If this gap is less than 1/(4(k-1)), then return P. Otherwise, set t to be the midpoint of the largest gap, recurse on P ∩ H_{h,t} and P ∩ H_{-h,-t}, and return the union of the polyhedra produced by these recursive calls.

2.1 Parallel Pancakes

The following special case, which represents the open problem in previous work, will illuminate the intuition behind the new algorithm. Suppose F is a mixture of two spherical Gaussians that are well-separated, i.e., the intermean distance is large compared to the standard deviation along any direction. We consider two cases, one where the mixing weights are equal and another where they are imbalanced.

After isotropy is enforced, each component will become thin in the intermean direction, giving the density the appearance of two parallel pancakes. When the mixing weights are equal, the means of the components will be equally spaced at a distance of 1 - φ on opposite sides of the origin. For imbalanced weights, the origin will still lie on the intermean direction but will be much closer to the heavier component, while the lighter component will be much further away. In both cases, this transformation makes the variance of the mixture 1 in every direction, so the principal components give us no insight into the intermean direction.

Consider next the effect of the reweighting on the mean of the mixture. For the case of equal mixing weights, symmetry assures that the mean does not shift at all. For imbalanced weights, however, the heavier component, which lies closer to the origin, will become heavier still. Thus, the reweighted mean shifts toward the mean of the heavier component, allowing us to detect the intermean direction.

Finally, consider the effect of reweighting on the second moments of the mixture with equal mixing weights. Because points closer to the origin are weighted more, the second moment in every direction is reduced. However, in the intermean direction, where part of the moment is due to the displacement of the component means from the origin, it shrinks less. Thus, the direction of maximum second moment is the intermean direction.
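Before turning to the analysis, here is a compact numpy rendering of Algorithm 1 (a sketch of our own, not the authors' implementation). It cuts corners that the analysis does not: the same sample is reused in every step, the isotropizing transform is estimated empirically, and k is decremented on each recursive call purely as a termination guard.

```python
import numpy as np

def isotropize(X):
    """Empirical affine map making the sample mean 0 and covariance ~I (step 1)."""
    X = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
    return X @ (vecs @ np.diag(vals ** -0.5) @ vecs.T)

def unravel(X, k, w):
    """Recursively split the points with hyperplanes (steps 2-4, simplified).
    Returns an integer label for every row of X."""
    n = X.shape[1]
    if k <= 1 or len(X) < 2 * n:
        return np.zeros(len(X), dtype=int)
    Y = isotropize(X)
    alpha = 2.0 * n / w                                    # alpha > n / w
    wts = np.exp(-np.sum(Y ** 2, axis=1) / (2 * alpha))    # step 2: reweight
    u_hat = (wts[:, None] * Y).mean(axis=0)
    if np.linalg.norm(u_hat) > np.sqrt(w) / (32 * alpha):  # step 3: mean shift
        h = u_hat / np.linalg.norm(u_hat)
    else:                                                  # step 3: spectral
        M_hat = (wts[:, None] * Y).T @ Y / len(Y)
        h = np.linalg.eigh(M_hat)[1][:, -1]
    proj = Y @ h                                           # step 4: largest gap
    vals = np.sort(proj)
    inside = vals[(vals >= -0.5) & (vals <= 0.5)]
    gaps = np.diff(inside)
    if gaps.size == 0 or gaps.max() < 1.0 / (4 * (k - 1)):
        return np.zeros(len(X), dtype=int)                 # treat P as one piece
    i = int(gaps.argmax())
    t = 0.5 * (inside[i] + inside[i + 1])                  # midpoint of the gap
    side = proj >= t
    left, right = unravel(X[~side], k - 1, w), unravel(X[side], k - 1, w)
    labels = np.empty(len(X), dtype=int)
    labels[~side], labels[side] = left, right + left.max() + 1
    return labels
```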
2.2 Overview of Analysis

To analyze the algorithm in the general case, we proceed as follows. Section 3 shows that under isotropy the Fisher subspace coincides with the intermean subspace (Lemma 1), gives the necessary sampling-convergence and perturbation lemmas, and relates overlap to a more conventional notion of separation (Prop. 5). Section 3.3 gives approximations to the first and second moments. Section 4 then combines these approximations with the perturbation lemmas to show that the vector h (either the mean shift or the largest principal component) lies close to the intermean subspace. Finally, Section 5 shows the correctness of the recursive aspects of the algorithm.

3 Preliminaries

3.1 Matrix Properties

For a matrix Z, we will denote the i-th largest eigenvalue of Z by λ_i(Z), or just λ_i if the matrix is clear from context. Unless specified otherwise, all norms are the 2-norm; for symmetric matrices this is ‖Z‖_2 = λ_1(Z) = max_{x∈R^n} ‖Zx‖_2/‖x‖_2. The following two facts from linear algebra will be useful in our analysis.

Fact 2. Let λ_1 ≥ ... ≥ λ_n be the eigenvalues of an n-by-n symmetric positive definite matrix Z and let v_1, ..., v_n be the corresponding eigenvectors. Then

$$\lambda_n + \ldots + \lambda_{n-k+1} = \min_{S:\dim(S)=k}\ \sum_{j=1}^k p_j^T Z p_j,$$

where {p_j} is any orthonormal basis for S. If λ_{n-k} > λ_{n-k+1}, then span{v_n, ..., v_{n-k+1}} is the unique minimizing subspace.

Recall that a matrix Z is positive semi-definite if x^TZx ≥ 0 for all non-zero x.

Fact 3. Suppose that the matrix

$$Z = \begin{bmatrix} A & B^T \\ B & D \end{bmatrix}$$

is symmetric positive semi-definite and that A and D are square submatrices. Then ‖B‖ ≤ √(‖A‖‖D‖).

Proof. Let y and x be the top left and right singular vectors of B, so that y^TBx = ‖B‖. Because Z is positive semi-definite, we have for any real γ

$$0 \leq \begin{bmatrix} \gamma x \\ y \end{bmatrix}^T Z \begin{bmatrix} \gamma x \\ y \end{bmatrix} = \gamma^2\,x^TAx + 2\gamma\,y^TBx + y^TDy.$$

This is a quadratic polynomial in γ that can have at most one real root. Therefore the discriminant must be non-positive:

$$0 \geq 4(y^TBx)^2 - 4(x^TAx)(y^TDy).$$

We conclude that ‖B‖ = y^TBx ≤ √((x^TAx)(y^TDy)) ≤ √(‖A‖‖D‖).

3.2 The Fisher Criterion and Isotropy

We begin with the proof of the lemma stating that for an isotropic mixture the Fisher subspace is the same as the intermean subspace.

Proof of Lemma 1. By definition, for an isotropic distribution the Fisher subspace minimizes

$$J(S) = \mathrm{E}\left[\|\mathrm{proj}_S(x - \mu_{\ell(x)})\|^2\right] = \sum_{j=1}^{k-1} p_j^T\Sigma p_j,$$

where {p_j} is an orthonormal basis for S. By Fact 2, one minimizing subspace is the span of the smallest k - 1 eigenvectors of the matrix Σ, i.e., v_{n-k+2}, ..., v_n. Because the distribution is isotropic,

$$\Sigma = I - \sum_{i=1}^k w_i\mu_i\mu_i^T,$$

and these vectors become the largest eigenvectors of Σ_{i=1}^k w_iµ_iµ_i^T. Clearly, span{v_{n-k+2}, ..., v_n} ⊆ span{µ_1, ..., µ_k}, but both spans have dimension k - 1, making them equal. Since v_{n-k+1} must be orthogonal to the other eigenvectors, it follows that λ_{n-k+1} = 1 > λ_{n-k+2}. Therefore, span{v_{n-k+2}, ..., v_n} = span{µ_1, ..., µ_k} is the unique minimizing subspace.
It follows directly that under the conditions of Lemma 1 the overlap may be characterized as

$$\phi = \lambda_{n-k+2}(\Sigma) = 1 - \lambda_{k-1}\!\left( \sum_{i=1}^k w_i\mu_i\mu_i^T \right).$$

For clarity of the analysis, we will assume that Step 1 of the algorithm produces a perfectly isotropic mixture. Theorem 4 gives a bound on the number of samples required to make the distribution nearly isotropic, and as our analysis shows, our algorithm is robust to small estimation errors.

We will also assume, for convenience of notation, that the unit vectors along the first k - 1 coordinate axes e_1, ..., e_{k-1} span the intermean (i.e., Fisher) subspace; that is, F = span{e_1, ..., e_{k-1}}. When considering this subspace it will be convenient to refer to the projection of the mean vectors onto it. Thus, we define µ̃_i ∈ R^{k-1} to be the first k - 1 coordinates of µ_i; the remaining coordinates are all zero. In other terms, µ̃_i = [I_{k-1} 0] µ_i.

In this coordinate system the covariance matrix of each component has a particular structure, which will be useful for our analysis. For the rest of this paper we fix the following notation: an isotropic mixture is defined by {w_i, µ_i, Σ_i}. We assume that span{e_1, ..., e_{k-1}} is the intermean subspace, and A_i, B_i, and D_i are defined such that

$$w_i\Sigma_i = \begin{bmatrix} A_i & B_i^T \\ B_i & D_i \end{bmatrix}, \qquad (2)$$

where A_i is a (k-1) × (k-1) submatrix and D_i is an (n-k+1) × (n-k+1) submatrix.

Lemma 4 (Covariance Structure). Using the above notation, ‖A_i‖ ≤ φ, ‖D_i‖ ≤ 1, and ‖B_i‖ ≤ √φ for all components i.

Proof of Lemma 4. Because span{e_1, ..., e_{k-1}} is the Fisher subspace,

$$\phi = \max_{v\in\mathbb{R}^{k-1}}\frac{1}{\|v\|^2}\sum_{i=1}^k v^TA_iv = \left\|\sum_{i=1}^k A_i\right\|_2.$$

Also Σ_{i=1}^k D_i = I, so ‖Σ_{i=1}^k D_i‖ = 1. Each matrix w_iΣ_i is positive definite, so the principal minors A_i, D_i must be positive definite as well. Therefore, ‖A_i‖ ≤ φ, ‖D_i‖ ≤ 1, and ‖B_i‖ ≤ √(‖A_i‖‖D_i‖) ≤ √φ using Fact 3.

For small φ, the covariance between intermean and non-intermean directions, i.e., B_i, is small. For k = 2, this means that all densities will have a "nearly parallel pancake" shape. In general, it means that k - 1 of the principal axes of the Gaussians will lie close to the intermean subspace.

We conclude this section with a proposition connecting, for k = 2, the overlap to a standard notion of separation between two distributions, so that Theorem 1 becomes an immediate corollary of Theorem 2.

Proposition 5. If there exists a unit vector p such that

$$|p^T(\mu_1 - \mu_2)| \ >\ t\left( \sqrt{p^Tw_1\Sigma_1p} + \sqrt{p^Tw_2\Sigma_2p} \right),$$

then the overlap φ ≤ J(p) ≤ (1 + w_1w_2t²)^{-1}.

Proof of Proposition 5. Since the mean of the distribution is at the origin, we have w_1p^Tµ_1 = -w_2p^Tµ_2. Thus,

$$|p^T\mu_1 - p^T\mu_2|^2 = (p^T\mu_1)^2 + (p^T\mu_2)^2 + 2|p^T\mu_1||p^T\mu_2| = (w_1p^T\mu_1)^2\left(\frac{1}{w_1^2} + \frac{1}{w_2^2} + \frac{2}{w_1w_2}\right),$$

using w_1 + w_2 = 1. We rewrite the last factor as

$$\frac{1}{w_1^2} + \frac{1}{w_2^2} + \frac{2}{w_1w_2} = \frac{w_1^2 + w_2^2 + 2w_1w_2}{w_1^2w_2^2} = \frac{1}{w_1^2w_2^2} = \frac{1}{w_1w_2}\left(\frac{1}{w_1} + \frac{1}{w_2}\right).$$

Again using the fact that w_1p^Tµ_1 = -w_2p^Tµ_2, we have that

$$|p^T\mu_1 - p^T\mu_2|^2 = \frac{(w_1p^T\mu_1)^2}{w_1w_2}\left(\frac{1}{w_1} + \frac{1}{w_2}\right) = \frac{w_1(p^T\mu_1)^2 + w_2(p^T\mu_2)^2}{w_1w_2}.$$
Thus, by the separation condition,

$$w_1(p^T\mu_1)^2 + w_2(p^T\mu_2)^2 = w_1w_2\,|p^T\mu_1 - p^T\mu_2|^2 \ \geq\ w_1w_2t^2\left(p^Tw_1\Sigma_1p + p^Tw_2\Sigma_2p\right).$$

To bound J(p), we then argue

$$J(p) = \frac{p^Tw_1\Sigma_1p + p^Tw_2\Sigma_2p}{w_1\left(p^T\Sigma_1p + (p^T\mu_1)^2\right) + w_2\left(p^T\Sigma_2p + (p^T\mu_2)^2\right)} = 1 - \frac{w_1(p^T\mu_1)^2 + w_2(p^T\mu_2)^2}{w_1\left(p^T\Sigma_1p + (p^T\mu_1)^2\right) + w_2\left(p^T\Sigma_2p + (p^T\mu_2)^2\right)}$$
$$\leq\ 1 - \frac{w_1w_2t^2\left(w_1p^T\Sigma_1p + w_2p^T\Sigma_2p\right)}{w_1\left(p^T\Sigma_1p + (p^T\mu_1)^2\right) + w_2\left(p^T\Sigma_2p + (p^T\mu_2)^2\right)} \ \leq\ 1 - w_1w_2t^2\,J(p),$$

and therefore J(p) ≤ 1/(1 + w_1w_2t²).

3.3 Approximation of the Reweighted Moments

Our algorithm works by computing the first and second reweighted moments of a point set from F. In this section, we examine how the reweighting affects the moments of a single component and then give approximations for the first and second moments of the entire mixture.

3.3.1 Single Component

The first step is to characterize how the reweighting affects the moments of a single component. Specifically, we will show for any function f (and therefore x and xx^T in particular) that for α > 0,

$$\mathrm{E}\left[f(x)\exp\left(-\frac{\|x\|^2}{2\alpha}\right)\right] = \sum_i w_i\rho_i\,\mathrm{E}_i[f(y_i)].$$

Here E_i[·] denotes expectation taken with respect to component i, the quantity ρ_i = E_i[exp(-‖x‖²/(2α))], and y_i is a Gaussian variable with parameters slightly perturbed from the original i-th component.

Claim 6. If α ≥ n/w, the quantity ρ_i = E_i[exp(-‖x‖²/(2α))] is at least 1/2.

Proof. Because the distribution is isotropic, for any component i, w_i E_i[‖x‖²] ≤ n. Therefore,

$$\rho_i = \mathrm{E}_i\left[\exp\left(-\frac{\|x\|^2}{2\alpha}\right)\right] \geq \mathrm{E}_i\left[1 - \frac{\|x\|^2}{2\alpha}\right] \geq 1 - \frac{1}{2\alpha}\cdot\frac{n}{w_i} \geq \frac{1}{2}.$$

Lemma 7 (Reweighted Moments of a Single Component). For any α > 0, with respect to a single component i of the mixture,

$$\mathrm{E}_i\left[x\exp\left(-\frac{\|x\|^2}{2\alpha}\right)\right] = \rho_i\left(\mu_i - \frac{1}{\alpha}\Sigma_i\mu_i + f\right)$$

and

$$\mathrm{E}_i\left[xx^T\exp\left(-\frac{\|x\|^2}{2\alpha}\right)\right] = \rho_i\left(\Sigma_i + \mu_i\mu_i^T - \frac{1}{\alpha}\left(\Sigma_i\Sigma_i + \mu_i\mu_i^T\Sigma_i + \Sigma_i\mu_i\mu_i^T\right) + F\right),$$

where ‖f‖, ‖F‖ = O(α^{-2}).

We first establish the following claim.

Claim 8. Let x be a random variable distributed according to the normal distribution N(µ, Σ) and let Σ = QΛQ^T be the singular value decomposition of Σ, with λ_1, ..., λ_n being the diagonal elements of Λ. Let W = diag(α/(α+λ_1), ..., α/(α+λ_n)). Finally, let y be a random variable distributed according to N(QWQ^Tµ, QWΛQ^T). Then for any function f(x),

$$\mathrm{E}\left[f(x)\exp\left(-\frac{\|x\|^2}{2\alpha}\right)\right] = \det(W)^{1/2}\exp\left(-\frac{\mu^TQWQ^T\mu}{2\alpha}\right)\mathrm{E}[f(y)].$$

Proof of Claim 8. We assume that Q = I for the initial part of the proof. From the definition of a Gaussian distribution, we have

$$\mathrm{E}\left[f(x)\exp\left(-\frac{\|x\|^2}{2\alpha}\right)\right] = \det(\Lambda)^{-1/2}(2\pi)^{-n/2}\int_{\mathbb{R}^n} f(x)\exp\left(-\frac{x^Tx}{2\alpha} - \frac{(x-\mu)^T\Lambda^{-1}(x-\mu)}{2}\right)dx.$$

Because Λ is diagonal, we may write (twice) the exponent on the right-hand side as

$$\sum_{i=1}^n x_i^2\alpha^{-1} + (x_i-\mu_i)^2\lambda_i^{-1} = \sum_{i=1}^n x_i^2\left(\lambda_i^{-1} + \alpha^{-1}\right) - 2x_i\mu_i\lambda_i^{-1} + \mu_i^2\lambda_i^{-1}.$$

Completing the square gives the expression

$$\sum_{i=1}^n\left(x_i - \mu_i\frac{\alpha}{\alpha+\lambda_i}\right)^2\left(\lambda_i\frac{\alpha}{\alpha+\lambda_i}\right)^{-1} + \mu_i^2\lambda_i^{-1} - \mu_i^2\lambda_i^{-1}\frac{\alpha}{\alpha+\lambda_i}.$$

The last two terms can be simplified to µ_i²/(α+λ_i). In matrix form the exponent becomes

$$(x - W\mu)^T(W\Lambda)^{-1}(x - W\mu) + \mu^TW\mu\,\alpha^{-1}.$$

For general Q, this becomes

$$\left(x - QWQ^T\mu\right)^TQ(W\Lambda)^{-1}Q^T\left(x - QWQ^T\mu\right) + \mu^TQWQ^T\mu\,\alpha^{-1}.$$

Now, recalling the definition of the random variable y, we see

$$\mathrm{E}\left[f(x)\exp\left(-\frac{\|x\|^2}{2\alpha}\right)\right] = \det(\Lambda)^{-1/2}(2\pi)^{-n/2}\exp\left(-\frac{\mu^TQWQ^T\mu}{2\alpha}\right)\int_{\mathbb{R}^n}f(x)\exp\left(-\frac{1}{2}\left(x - QWQ^T\mu\right)^TQ(W\Lambda)^{-1}Q^T\left(x - QWQ^T\mu\right)\right)dx$$
$$= \det(W)^{1/2}\exp\left(-\frac{\mu^TQWQ^T\mu}{2\alpha}\right)\mathrm{E}[f(y)].$$
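Claim 8 can be sanity-checked by Monte Carlo. The sketch below (our own; the dimension, α, and the covariance are arbitrary) compares both sides of the identity for f(x) = x: the left side is estimated from samples of N(µ, Σ), the right side from samples of the perturbed Gaussian y.

```python
import numpy as np

rng = np.random.default_rng(4)
n, alpha, m = 3, 50.0, 400000
mu = np.array([1.0, -2.0, 0.5])
A = rng.normal(size=(n, n))
Sigma = A @ A.T + np.eye(n)

# Spectral pieces used in Claim 8: Sigma = Q Lambda Q^T, W_ii = alpha/(alpha+lambda_i).
lam, Q = np.linalg.eigh(Sigma)
W = np.diag(alpha / (alpha + lam))

# Left side: E[ x * exp(-||x||^2 / (2 alpha)) ] under N(mu, Sigma).
x = rng.multivariate_normal(mu, Sigma, size=m)
lhs = np.mean(x * np.exp(-np.sum(x ** 2, axis=1) / (2 * alpha))[:, None], axis=0)

# Right side: det(W)^(1/2) * exp(-mu^T Q W Q^T mu / (2 alpha)) * E[y], where
# y ~ N(Q W Q^T mu, Q W Lambda Q^T).
scale = np.sqrt(np.linalg.det(W)) * np.exp(-(mu @ Q @ W @ Q.T @ mu) / (2 * alpha))
y = rng.multivariate_normal(Q @ W @ Q.T @ mu, Q @ W @ np.diag(lam) @ Q.T, size=m)
rhs = scale * y.mean(axis=0)

print(lhs)
print(rhs)   # the two estimates agree up to Monte Carlo error
```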
The proof of Lemma 7 is now straightforward.

Proof of Lemma 7. For simplicity of notation, we drop the subscript i from ρ_i, µ_i, Σ_i, with the understanding that all statements of expectation apply to a single component. Using the notation of Claim 8, we have

$$\rho = \mathrm{E}\left[\exp\left(-\frac{\|x\|^2}{2\alpha}\right)\right] = \det(W)^{1/2}\exp\left(-\frac{\mu^TQWQ^T\mu}{2\alpha}\right).$$

A diagonal entry of the matrix W can be expanded as

$$\frac{\alpha}{\alpha+\lambda_i} = 1 - \frac{\lambda_i}{\alpha+\lambda_i} = 1 - \frac{\lambda_i}{\alpha} + \frac{\lambda_i^2}{\alpha(\alpha+\lambda_i)},$$

so that W = I - (1/α)Λ + (1/α²)WΛ². Thus,

$$\mathrm{E}\left[x\exp\left(-\frac{\|x\|^2}{2\alpha}\right)\right] = \rho\,QWQ^T\mu = \rho\left(QQ^T\mu - \frac{1}{\alpha}Q\Lambda Q^T\mu + \frac{1}{\alpha^2}QW\Lambda^2Q^T\mu\right) = \rho\left(\mu - \frac{1}{\alpha}\Sigma\mu + f\right),$$

where ‖f‖ = O(α^{-2}). We analyze the perturbed covariance in a similar fashion:

$$\mathrm{E}\left[xx^T\exp\left(-\frac{\|x\|^2}{2\alpha}\right)\right] = \rho\left(Q(W\Lambda)Q^T + QWQ^T\mu\mu^TQWQ^T\right)$$
$$= \rho\left(Q\Lambda Q^T - \frac{1}{\alpha}Q\Lambda^2Q^T + \frac{1}{\alpha^2}QW\Lambda^3Q^T + \left(\mu - \frac{1}{\alpha}\Sigma\mu + f\right)\left(\mu - \frac{1}{\alpha}\Sigma\mu + f\right)^T\right) = \rho\left(\Sigma + \mu\mu^T - \frac{1}{\alpha}\left(\Sigma\Sigma + \mu\mu^T\Sigma + \Sigma\mu\mu^T\right) + F\right),$$

where ‖F‖ = O(α^{-2}).

3.3.2 Mixture Moments

The second step is to approximate the first and second moments of the entire mixture distribution. Let ρ be the vector with entries ρ_i = E_i[exp(-‖x‖²/(2α))] and let ρ̄ be the average of the ρ_i. We also define

$$u \equiv \mathrm{E}\left[x\exp\left(-\frac{\|x\|^2}{2\alpha}\right)\right] = \sum_{i=1}^k w_i\rho_i\mu_i - \frac{1}{\alpha}\sum_{i=1}^k w_i\rho_i\Sigma_i\mu_i + f \qquad (3)$$

$$M \equiv \mathrm{E}\left[xx^T\exp\left(-\frac{\|x\|^2}{2\alpha}\right)\right] = \sum_{i=1}^k w_i\rho_i\left(\Sigma_i + \mu_i\mu_i^T - \frac{1}{\alpha}\left(\Sigma_i\Sigma_i + \mu_i\mu_i^T\Sigma_i + \Sigma_i\mu_i\mu_i^T\right)\right) + F \qquad (4)$$

with ‖f‖ = O(α^{-2}) and ‖F‖ = O(α^{-2}). We denote the estimates of these quantities computed from samples by û and M̂, respectively.

Lemma 9. Let v = Σ_{i=1}^k ρ_iw_iµ_i. Then

$$\|u - v\|^2 \ \leq\ \frac{4k^2}{\alpha^2w}\,\phi.$$

Proof of Lemma 9. We argue from Eqn. 2 and Eqn. 3 that

$$\|u - v\| = \frac{1}{\alpha}\left\|\sum_{i=1}^k w_i\rho_i\Sigma_i\mu_i\right\| + O(\alpha^{-2}) \leq \frac{1}{\alpha\sqrt{w}}\sum_{i=1}^k\rho_i\left\|(w_i\Sigma_i)(\sqrt{w_i}\,\mu_i)\right\| + O(\alpha^{-2}) \leq \frac{1}{\alpha\sqrt{w}}\sum_{i=1}^k\rho_i\left\|\left[A_i,\,B_i^T\right]^T\right\|\,\left\|\sqrt{w_i}\,\mu_i\right\| + O(\alpha^{-2}).$$

From isotropy, it follows that ‖√w_i µ_i‖ ≤ 1. To bound the other factor, we argue

$$\left\|\left[A_i,\,B_i^T\right]^T\right\| \leq \sqrt{2}\max\{\|A_i\|, \|B_i\|\} \leq \sqrt{2\phi}.$$

Therefore,

$$\|u - v\|^2 \leq \frac{2k^2}{\alpha^2w}\phi + O(\alpha^{-3}) \leq \frac{4k^2}{\alpha^2w}\phi$$

for sufficiently large n, as α ≥ n/w.

Lemma 10. Let

$$\Gamma = \begin{bmatrix} \sum_{i=1}^k\rho_i\left(w_i\tilde\mu_i\tilde\mu_i^T + A_i\right) & 0 \\ 0 & \sum_{i=1}^k\rho_iD_i - \frac{\rho_i}{w_i\alpha}D_i^2 \end{bmatrix}.$$

If ‖ρ - 1ρ̄‖_∞ < 1/(2α), then

$$\|M - \Gamma\|_2^2 \ \leq\ \frac{16^2k^2}{w^2\alpha^2}\,\phi.$$

Before giving the proof, we summarize the necessary calculations in the following claim.

Claim 11. The matrix of second moments

$$M = \mathrm{E}\left[xx^T\exp\left(-\frac{\|x\|^2}{2\alpha}\right)\right] = \begin{bmatrix}\Gamma_{11} & 0\\ 0 & \Gamma_{22}\end{bmatrix} + \begin{bmatrix}\Delta_{11} & \Delta_{21}^T\\ \Delta_{21} & \Delta_{22}\end{bmatrix} + F,$$

where

$$\Gamma_{11} = \sum_{i=1}^k\rho_i\left(w_i\tilde\mu_i\tilde\mu_i^T + A_i\right), \qquad \Gamma_{22} = \sum_{i=1}^k\rho_iD_i - \frac{\rho_i}{w_i\alpha}D_i^2,$$

$$\Delta_{11} = -\sum_{i=1}^k\frac{\rho_i}{w_i\alpha}\left(B_i^TB_i + w_i\tilde\mu_i\tilde\mu_i^TA_i + w_iA_i\tilde\mu_i\tilde\mu_i^T + A_i^2\right),$$

$$\Delta_{21} = \sum_{i=1}^k\rho_iB_i - \frac{\rho_i}{w_i\alpha}\left(B_i(w_i\tilde\mu_i\tilde\mu_i^T) + B_iA_i + D_iB_i\right), \qquad \Delta_{22} = -\sum_{i=1}^k\frac{\rho_i}{w_i\alpha}B_iB_i^T,$$

and ‖F‖ = O(α^{-2}).

Proof. The calculation is straightforward.

Proof of Lemma 10. We begin by bounding the 2-norm of each of the blocks.
Since ‖w_iµ̃_iµ̃_i^T‖ < 1, ‖A_i‖ ≤ φ, and ‖B_i‖ ≤ √φ, we can bound

$$\|\Delta_{11}\| \leq \sum_{i=1}^k\frac{\rho_i}{w_i\alpha}\|B_i\|^2 + \frac{\rho_i}{w_i\alpha}\left(2\|A_i\| + \|A_i\|^2\right) + O(\alpha^{-2}) \ \leq\ \frac{4k}{w\alpha}\phi + O(\alpha^{-2}).$$

By a similar argument, ‖Δ_22‖ ≤ kφ/(wα) + O(α^{-2}). For Δ_21, we observe that Σ_{i=1}^k B_i = 0. Therefore,

$$\|\Delta_{21}\| \leq \left\|\sum_{i=1}^k(\rho_i - \bar\rho)B_i\right\| + \left\|\sum_{i=1}^k\frac{\rho_i}{w_i\alpha}\left(B_i(w_i\tilde\mu_i\tilde\mu_i^T) + B_iA_i + D_iB_i\right)\right\| + O(\alpha^{-2})$$
$$\leq \sum_{i=1}^k|\rho_i - \bar\rho|\,\|B_i\| + \sum_{i=1}^k\frac{\rho_i}{w_i\alpha}\left(\sqrt{\phi} + \phi\sqrt{\phi} + \sqrt{\phi}\right) + O(\alpha^{-2}) \ \leq\ k\|\rho - \mathbf{1}\bar\rho\|_\infty\sqrt{\phi} + \frac{3k\bar\rho}{w\alpha}\sqrt{\phi} + O(\alpha^{-2}) \ \leq\ \frac{7k}{2w\alpha}\sqrt{\phi} + O(\alpha^{-2}).$$

Thus, max{‖Δ_11‖, ‖Δ_22‖, ‖Δ_21‖} ≤ 4k√φ/(wα) + O(α^{-2}), so that

$$\|M - \Gamma\| \leq \|\Delta\| + O(\alpha^{-2}) \leq 2\max\{\|\Delta_{11}\|, \|\Delta_{22}\|, \|\Delta_{21}\|\} \leq \frac{8k}{w\alpha}\sqrt{\phi} + O(\alpha^{-2}) \leq \frac{16k}{w\alpha}\sqrt{\phi}$$

for sufficiently large n, as α ≥ n/w.

3.4 Sample Convergence

We now give some bounds on the convergence of the transformation to isotropy (µ̂ → 0 and Σ̂ → I) and on the convergence of the reweighted sample mean û and sample matrix of second moments M̂ to their expectations u and M. For the convergence of second moment matrices, we use the following lemma due to Rudelson [12], which was presented in this form in [13].

Lemma 12. Let y be a random vector from a distribution D in R^n with sup_D ‖y‖ = M and ‖E(yy^T)‖ ≤ 1. Let y_1, ..., y_m be independent samples from D. Let

$$\eta = CM\sqrt{\frac{\log m}{m}},$$

where C is an absolute constant. Then,

(i) if η < 1, then

$$\mathrm{E}\left(\left\|\frac{1}{m}\sum_{i=1}^m y_iy_i^T - \mathrm{E}(yy^T)\right\|\right) \leq \eta;$$

(ii) for every t ∈ (0, 1),

$$\mathrm{P}\left(\left\|\frac{1}{m}\sum_{i=1}^m y_iy_i^T - \mathrm{E}(yy^T)\right\| > t\right) \leq 2e^{-ct^2/\eta^2}.$$

This lemma is used to show that a distribution can be made nearly isotropic using only O*(kn) samples [12, 10]. The isotropic transformation is computed simply by estimating the mean and covariance matrix of a sample and computing the affine transformation that puts the sample in isotropic position.

Theorem 4. There is an absolute constant C such that for an isotropic mixture of k logconcave distributions, with probability at least 1 - δ, a sample of size

$$m > C\,\frac{kn\log^2(n/\delta)}{\epsilon^2}$$

gives a sample mean µ̂ and sample covariance Σ̂ such that ‖µ̂‖ ≤ ε and ‖Σ̂ - I‖ ≤ ε.

We now consider the reweighted moments.

Lemma 13. Let ε, δ > 0 and let û be the reweighted sample mean of a set of m points drawn from an isotropic mixture of k Gaussians in n dimensions, where

$$m \geq \frac{2n\alpha}{\epsilon^2}\log\frac{2n}{\delta}.$$

Then P[‖û - u‖ > ε] ≤ δ.

Proof. We first consider only a single coordinate of the vector û. Let y = x_1 exp(-‖x‖²/(2α)) - u_1. We observe that

$$\left|x_1\exp\left(-\frac{\|x\|^2}{2\alpha}\right)\right| \leq |x_1|\exp\left(-\frac{x_1^2}{2\alpha}\right) \leq \sqrt{\frac{\alpha}{e}} < \sqrt{\alpha}.$$

Thus, each term in the sum Σ_{j=1}^m y_j = m(û_1 - u_1) falls in the range [-√α - u_1, √α - u_1]. We may therefore apply Hoeffding's inequality to show that

$$\mathrm{P}\left[|\hat u_1 - u_1| \geq \epsilon/\sqrt{n}\right] \leq 2\exp\left(-\frac{2m^2(\epsilon/\sqrt n)^2}{m(2\sqrt\alpha)^2}\right) \leq 2\exp\left(-\frac{m\epsilon^2}{2\alpha n}\right) \leq \frac{\delta}{n}.$$

Taking the union bound over the n coordinates, we conclude that with probability 1 - δ the error in each coordinate is at most ε/√n, which implies that ‖û - u‖ ≤ ε.
Lemma 14. Let ε, δ > 0 and let M̂ be the reweighted sample matrix of second moments for a set of m points drawn from an isotropic mixture of k Gaussians in n dimensions, where

$$m \geq C_1\,\frac{n\alpha}{\epsilon^2}\log\frac{n\alpha}{\delta}$$

and C_1 is an absolute constant. Then P[‖M̂ - M‖ > ε] < δ.

Proof. We will apply Lemma 12. Define y = x exp(-‖x‖²/(2α)). Then

$$y_i^2 \leq x_i^2\exp\left(-\frac{\|x\|^2}{\alpha}\right) \leq x_i^2\exp\left(-\frac{x_i^2}{\alpha}\right) \leq \frac{\alpha}{e} < \alpha,$$

so ‖y‖ ≤ √(αn). Next, since M is in isotropic position (we can assume this w.l.o.g.), we have for any unit vector v

$$\mathrm{E}\left((v^Ty)^2\right) \leq \mathrm{E}\left((v^Tx)^2\right) \leq 1,$$

and so ‖E(yy^T)‖ ≤ 1. We now apply the second part of Lemma 12 with t = ε. For the failure probability to be at most δ, it suffices that η ≤ ε√(c/ln(2/δ)), i.e., that

$$C\sqrt{\alpha n}\sqrt{\frac{\log m}{m}} \ \leq\ \epsilon\sqrt{\frac{c}{\ln(2/\delta)}},$$

which is satisfied for our choice of m.

Lemma 15. Let X be a collection of m points drawn from a Gaussian with mean µ and variance σ². With probability 1 - δ,

$$|x - \mu| \leq \sigma\sqrt{2\log(m/\delta)}$$

for every x ∈ X.

3.5 Perturbation Lemma

We will use the following key lemma due to Stewart [16] to show that when we apply the spectral step, the top (k-1)-dimensional invariant subspace will be close to the Fisher subspace.

Lemma 16 (Stewart's Theorem). Suppose A and A + E are n-by-n symmetric matrices and that

$$A = \begin{bmatrix} D_1 & 0 \\ 0 & D_2 \end{bmatrix}, \qquad E = \begin{bmatrix} E_{11} & E_{21}^T \\ E_{21} & E_{22} \end{bmatrix},$$

where D_1 and E_11 are r × r blocks. Let the columns of V be the top r eigenvectors of the matrix A + E and let P_2 be the matrix with columns e_{r+1}, ..., e_n. If d = λ_r(D_1) - λ_1(D_2) > 0 and ‖E‖ ≤ d/5, then

$$\|V^TP_2\| \leq \frac{4}{d}\|E_{21}\|_2.$$

4 Finding a Vector near the Fisher Subspace

In this section, we combine the approximations of Section 3.3 and the perturbation lemma of Section 3.5 to show that the direction h chosen by step 3 of the algorithm is close to the intermean subspace. Section 5 argues that this direction can be used to partition the components. Finding the separating direction is the most challenging part of the classification task and represents the main contribution of this work.

We first assume zero overlap and that the sample reweighted moments behave exactly according to expectation. In this case, the mean shift û becomes

$$v \equiv \sum_{i=1}^k w_i\rho_i\mu_i.$$

We can intuitively think of the components that have greater ρ_i as gaining mixing weight and those with smaller ρ_i as losing mixing weight. As long as the ρ_i are not all equal, we will observe some shift of the mean in the intermean subspace, i.e., the Fisher subspace, and therefore we may use this direction to partition the components. On the other hand, if all of the ρ_i are equal, then M̂ becomes

$$\Gamma \equiv \begin{bmatrix} \sum_{i=1}^k\rho_i\left(w_i\tilde\mu_i\tilde\mu_i^T + A_i\right) & 0 \\ 0 & \sum_{i=1}^k\rho_iD_i - \frac{\rho_i}{w_i\alpha}D_i^2 \end{bmatrix} = \bar\rho\begin{bmatrix} I & 0 \\ 0 & I - \frac{1}{\alpha}\sum_{i=1}^k\frac{1}{w_i}D_i^2 \end{bmatrix}.$$

Notice that the second moments in the subspace span{e_1, ..., e_{k-1}} are maintained while those in the complementary subspace are reduced by poly(1/α). Therefore, the top eigenvector will be in the intermean subspace, which is the Fisher subspace.

We now argue that this same strategy can be adapted to work in general, i.e., with nonzero overlap and sampling errors, with high probability. A critical aspect of this argument is that the norm of the error term M̂ - Γ depends only on φ and k and not on the dimension of the data. See Lemma 10 and the supporting Lemma 4 and Fact 3.
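The dichotomy just described is easy to observe numerically. The sketch below (our own illustration; the separation, dimension, sample size, and α are arbitrary) builds two well-separated spherical Gaussians, puts the sample in empirical isotropic position, and prints the reweighted mean shift and the top eigenvector of the reweighted second-moment matrix for equal and for imbalanced mixing weights. With equal weights the population mean shift is zero (only sampling noise remains) while the top eigenvector lines up with the intermean direction e_1; with imbalanced weights the mean shift itself points along e_1.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, w_min = 10, 500000, 0.1
alpha = n / w_min

def pancake_demo(w1):
    sizes = rng.multinomial(m, [w1, 1 - w1])
    X1 = rng.normal(size=(sizes[0], n)); X1[:, 0] -= 10.0   # component 1
    X2 = rng.normal(size=(sizes[1], n)); X2[:, 0] += 10.0   # component 2
    X = np.vstack([X1, X2])
    # Empirical isotropic position.
    X = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
    Y = X @ (vecs @ np.diag(vals ** -0.5) @ vecs.T)
    # Reweighted mean and second-moment matrix.
    wts = np.exp(-np.sum(Y ** 2, axis=1) / (2 * alpha))
    u_hat = (wts[:, None] * Y).mean(axis=0)
    top = np.linalg.eigh((wts[:, None] * Y).T @ Y / m)[1][:, -1]
    print(f"w1={w1}: ||u_hat|| = {np.linalg.norm(u_hat):.5f}, "
          f"|u_hat . e1|/||u_hat|| = {abs(u_hat[0]) / np.linalg.norm(u_hat):.2f}, "
          f"|top_eigvec . e1| = {abs(top[0]):.2f}")

pancake_demo(0.5)   # balanced: the spectral direction recovers e_1
pancake_demo(0.1)   # imbalanced: the mean shift points along e_1
```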
Since we cannot know directly how imbalanced the ρ_i are, we choose the method of finding a separating direction according to the norm of the vector û. Recall that when ‖û‖ > √w/(32α) the algorithm uses û to determine the separating direction h; Lemma 17 guarantees that this vector is close to the Fisher subspace. When ‖û‖ ≤ √w/(32α), the algorithm uses the top eigenvector of the covariance matrix M̂; Lemma 18 guarantees that this vector is close to the Fisher subspace.

Lemma 17 (Mean Shift Method). Let ε > 0. There exists a constant C such that if m_1 ≥ Cn⁴ poly(k, w^{-1}, log(n/δ)), then the following holds with probability 1 - δ. If ‖û‖ > √w/(32α) and

$$\phi \leq \frac{w^2\epsilon^2}{2^{14}k^2},$$

then

$$\frac{|\hat u^Tv|}{\|\hat u\|\,\|v\|} \geq 1 - \epsilon.$$

Lemma 18 (Spectral Method). Let ε > 0. There exists a constant C such that if m_1 ≥ Cn⁴ poly(k, w^{-1}, log(n/δ)), then the following holds with probability 1 - δ. Let v_1, ..., v_{k-1} be the top k - 1 eigenvectors of M̂. If ‖û‖ ≤ √w/(32α) and

$$\phi \leq \frac{w^2\epsilon}{640^2k^2},$$

then

$$\min_{v\in\mathrm{span}\{v_1,\ldots,v_{k-1}\},\ \|v\|=1}\|\mathrm{proj}_F(v)\| \geq 1 - \epsilon.$$

4.1 Mean Shift

Proof of Lemma 17. We will make use of the following claim.

Claim 19. For any vectors a, b ≠ 0,

$$\frac{|a^Tb|}{\|a\|\,\|b\|} \geq \left(1 - \frac{\|a - b\|^2}{\max\{\|a\|^2, \|b\|^2\}}\right)^{1/2}.$$

By the triangle inequality, ‖û - v‖ ≤ ‖û - u‖ + ‖u - v‖. By Lemma 9,

$$\|u - v\| \leq \sqrt{\frac{4k^2}{\alpha^2w}\phi} = \sqrt{\frac{4k^2}{\alpha^2w}\cdot\frac{w^2\epsilon^2}{2^{14}k^2}} \leq \sqrt{\frac{w\epsilon^2}{2^{12}\alpha^2}}.$$

By Lemma 13, for large m_1 we obtain the same bound on ‖û - u‖ with probability 1 - δ. Thus,

$$\|\hat u - v\| \leq \sqrt{\frac{w\epsilon^2}{2^{10}\alpha^2}}.$$

Applying the claim gives

$$\frac{|\hat u^Tv|}{\|\hat u\|\,\|v\|} \geq 1 - \frac{\|\hat u - v\|^2}{\|\hat u\|^2} \geq 1 - \frac{w\epsilon^2}{2^{10}\alpha^2}\cdot\frac{32^2\alpha^2}{w} = 1 - \epsilon^2 \geq 1 - \epsilon.$$

Proof of Claim 19. Without loss of generality, assume ‖a‖ ≥ ‖b‖ and fix the distance ‖a - b‖. In order to maximize the angle between a and b, the vector b should be chosen so that it is tangent to the sphere centered at a with radius ‖a - b‖. Hence, the vectors a, b, (a - b) form a right triangle where ‖a‖² = ‖b‖² + ‖a - b‖². For this choice of b, let θ be the angle between a and b, so that

$$\frac{a^Tb}{\|a\|\,\|b\|} = \cos\theta = (1 - \sin^2\theta)^{1/2} = \left(1 - \frac{\|a - b\|^2}{\|a\|^2}\right)^{1/2}.$$

4.2 Spectral Method

We first show that the smallness of the mean shift û implies that the coefficients ρ_i are sufficiently uniform to allow us to apply the spectral method.

Claim 20 (Small Mean Shift Implies Balanced Second Moments). If ‖û‖ ≤ √w/(32α) and √φ ≤ w/(64k), then

$$\|\rho - \mathbf{1}\bar\rho\|_\infty \leq \frac{1}{8\alpha}.$$

Proof. Let q_1, ..., q_k be the right singular vectors of the matrix U = [w_1µ_1, ..., w_kµ_k] and let σ_i(U) be the i-th largest singular value. Because Σ_{i=1}^k w_iµ_i = 0, we have that σ_k(U) = 0 and q_k = (1/√k)𝟏. Recall that ρ is the k-vector of scalars ρ_1, ..., ρ_k and that v = Uρ. Then

$$\|v\|^2 = \|U\rho\|^2 = \sum_{i=1}^{k-1}\sigma_i(U)^2(q_i^T\rho)^2 \geq \sigma_{k-1}(U)^2\|\rho - q_k(q_k^T\rho)\|_2^2 = \sigma_{k-1}(U)^2\|\rho - \mathbf{1}\bar\rho\|_2^2.$$

Because q_{k-1} ∈ span{µ_1, ..., µ_k}, we have that Σ_{i=1}^k w_iq_{k-1}^Tµ_iµ_i^Tq_{k-1} ≥ 1 - φ. Therefore,

$$\sigma_{k-1}(U)^2 = \|Uq_{k-1}\|^2 = q_{k-1}^T\left(\sum_{i=1}^k w_i^2\mu_i\mu_i^T\right)q_{k-1} \geq w\,q_{k-1}^T\left(\sum_{i=1}^k w_i\mu_i\mu_i^T\right)q_{k-1} \geq w(1-\phi).$$

Thus, we have the bound

$$\|\rho - \mathbf{1}\bar\rho\|_\infty \leq \frac{1}{\sqrt{(1-\phi)w}}\|v\| \leq \frac{2}{\sqrt{w}}\|v\|.$$

By the triangle inequality, ‖v‖ ≤ ‖û‖ + ‖û - v‖.
As argued in Lemma 9,

$$\|\hat u - v\| \leq \sqrt{\frac{4k^2}{\alpha^2w}\phi} = \sqrt{\frac{4k^2}{\alpha^2w}\cdot\frac{w^2}{64^2k^2}} \leq \frac{\sqrt{w}}{32\alpha}.$$

Thus,

$$\|\rho - \mathbf{1}\bar\rho\|_\infty \leq \frac{2}{\sqrt{w}}\|v\| \leq \frac{2}{\sqrt{w}}\left(\frac{\sqrt{w}}{32\alpha} + \frac{\sqrt{w}}{32\alpha}\right) \leq \frac{1}{8\alpha}.$$

We next show that the top k - 1 principal components of Γ span the intermean subspace and put a lower bound on the spectral gap between the intermean and non-intermean components.

Lemma 21 (Ideal Case). If ‖ρ - 1ρ̄‖_∞ ≤ 1/(8α), then

$$\lambda_{k-1}(\Gamma) - \lambda_k(\Gamma) \geq \frac{1}{4\alpha},$$

and the top k - 1 eigenvectors of Γ span the means of the components.

Proof of Lemma 21. We first bound λ_{k-1}(Γ_11). Recall that Γ_11 = Σ_{i=1}^k ρ_i(w_iµ̃_iµ̃_i^T + A_i). Thus,

$$\lambda_{k-1}(\Gamma_{11}) = \min_{\|y\|=1}\sum_{i=1}^k\rho_iy^T\left(w_i\tilde\mu_i\tilde\mu_i^T + A_i\right)y \geq \bar\rho - \max_{\|y\|=1}\sum_{i=1}^k(\bar\rho - \rho_i)y^T\left(w_i\tilde\mu_i\tilde\mu_i^T + A_i\right)y.$$

We observe that Σ_{i=1}^k y^T(w_iµ̃_iµ̃_i^T + A_i)y = 1 and each term is non-negative. Hence the sum is bounded by

$$\sum_{i=1}^k(\bar\rho - \rho_i)y^T\left(w_i\tilde\mu_i\tilde\mu_i^T + A_i\right)y \leq \|\rho - \mathbf{1}\bar\rho\|_\infty,$$

so λ_{k-1}(Γ_11) ≥ ρ̄ - ‖ρ - 1ρ̄‖_∞. Next, we bound λ_1(Γ_22). Recall that Γ_22 = Σ_{i=1}^k ρ_iD_i - (ρ_i/(w_iα))D_i² and that for any (n-k+1)-vector y such that ‖y‖ = 1 we have Σ_{i=1}^k y^TD_iy = 1. Using the same arguments as above,

$$\lambda_1(\Gamma_{22}) = \max_{\|y\|=1}\ \bar\rho + \sum_{i=1}^k(\rho_i - \bar\rho)y^TD_iy - \frac{\rho_i}{w_i\alpha}y^TD_i^2y \leq \bar\rho + \|\rho - \mathbf{1}\bar\rho\|_\infty - \min_{\|y\|=1}\sum_{i=1}^k\frac{\rho_i}{w_i\alpha}y^TD_i^2y.$$

To bound the last sum, we observe that ρ_i - ρ̄ = O(α^{-1}). Therefore,

$$\sum_{i=1}^k\frac{\rho_i}{w_i\alpha}y^TD_i^2y \geq \frac{\bar\rho}{\alpha}\sum_{i=1}^k\frac{1}{w_i}y^TD_i^2y + O(\alpha^{-2}).$$

Without loss of generality, we may assume that y = e_1 by an appropriate rotation of the D_i. Let D_i(ℓ, j) be the element in the ℓ-th row and j-th column of the matrix D_i. Then the sum becomes

$$\sum_{i=1}^k\frac{1}{w_i}y^TD_i^2y = \sum_{i=1}^k\frac{1}{w_i}\sum_j D_i(1, j)^2 \geq \sum_{i=1}^k\frac{1}{w_i}D_i(1, 1)^2.$$

Because Σ_{i=1}^k D_i = I, we have Σ_{i=1}^k D_i(1, 1) = 1. From the Cauchy–Schwarz inequality, it follows that

$$\left(\sum_{i=1}^k w_i\right)^{1/2}\left(\sum_{i=1}^k\frac{1}{w_i}D_i(1, 1)^2\right)^{1/2} \geq \sum_{i=1}^k\sqrt{w_i}\,\frac{D_i(1, 1)}{\sqrt{w_i}} = 1.$$

Since Σ_{i=1}^k w_i = 1, we conclude that Σ_{i=1}^k (1/w_i)D_i(1,1)² ≥ 1. Thus, using the fact that ρ̄ ≥ 1/2, we have

$$\sum_{i=1}^k\frac{\rho_i}{w_i\alpha}y^TD_i^2y \geq \frac{1}{2\alpha}.$$

Putting the bounds together,

$$\lambda_{k-1}(\Gamma_{11}) - \lambda_1(\Gamma_{22}) \geq \frac{1}{2\alpha} - 2\|\rho - \mathbf{1}\bar\rho\|_\infty \geq \frac{1}{4\alpha}.$$

Proof of Lemma 18. To bound the effect of overlap and sample errors on the eigenvectors, we apply Stewart's Lemma (Lemma 16). Define d = λ_{k-1}(Γ) - λ_k(Γ) and E = M̂ - Γ. We assume that the mean shift satisfies ‖û‖ ≤ √w/(32α) and that φ is small. By Lemma 21, this implies that

$$d = \lambda_{k-1}(\Gamma) - \lambda_k(\Gamma) \geq \frac{1}{4\alpha}. \qquad (5)$$

To bound ‖E‖, we use the triangle inequality ‖E‖ ≤ ‖Γ - M‖ + ‖M - M̂‖. Lemma 10 bounds the first term by

$$\|M - \Gamma\| \leq \sqrt{\frac{16^2k^2}{w^2\alpha^2}\phi} = \sqrt{\frac{16^2k^2}{w^2\alpha^2}\cdot\frac{w^2\epsilon}{640^2k^2}} \leq \frac{\sqrt{\epsilon}}{40\alpha}.$$

By Lemma 14, we obtain the same bound on ‖M - M̂‖ with probability 1 - δ for large enough m_1. Thus,

$$\|E\| \leq \frac{\sqrt{\epsilon}}{20\alpha}. \qquad (6)$$

Combining the bounds (5) and (6), we have

$$\sqrt{1 - (1-\epsilon)^2}\,d - 5\|E\| \geq \frac{\sqrt{1 - (1-\epsilon)^2}}{4\alpha} - \frac{5\sqrt{\epsilon}}{20\alpha} \geq 0,$$

as √(1 - (1-ε)²) ≥ √ε. This implies both that ‖E‖ ≤ d/5 and that 4‖E_21‖/d < √(1 - (1-ε)²), enabling us to apply Stewart's Lemma to the matrix pair Γ and M̂. By Lemma 21, the top k - 1 eigenvectors of Γ, i.e., e_1, ..., e_{k-1}, span the means of the components.
Let the columns of P_1 be these eigenvectors. Let the columns of P_2 be defined such that [P_1, P_2] is an orthonormal matrix, and let v_1, ..., v_{k-1} be the top k - 1 eigenvectors of M̂. By Stewart's Lemma, letting the columns of V be v_1, ..., v_{k-1}, we have

$$\|V^TP_2\|_2 \leq \sqrt{1 - (1-\epsilon)^2},$$

or equivalently,

$$\min_{v\in\mathrm{span}\{v_1,\ldots,v_{k-1}\},\ \|v\|=1}\|\mathrm{proj}_Fv\| = \sigma_{k-1}(V^TP_1) \geq 1 - \epsilon.$$

5 Recursion

In this section, we show that for every direction h that is close to the intermean subspace, the "largest gap clustering" step produces a pair of complementary halfspaces that partitions R^n while leaving only a small part of the probability mass on the wrong side of the partition, small enough that with high probability it does not affect the samples used by the algorithm.

Lemma 22. Let δ, δ′ > 0, where δ′ ≤ δ/(2m_2), and let m_2 satisfy m_2 ≥ (k/w) log(2k/δ). Suppose that h is a unit vector such that

$$\|\mathrm{proj}_F(h)\| \geq 1 - \frac{w^2}{2^{10}(k-1)^2}\log^{-1}\frac{1}{\delta'}.$$

Let F be a mixture of k > 1 Gaussians with overlap

$$\phi \leq \frac{w^2}{2^9(k-1)^2}\log^{-1}\frac{1}{\delta'}.$$

Let X be a collection of m_2 points from F and let t be the midpoint of the largest gap in the set {h^Tx : x ∈ X}. With probability 1 - δ, the halfspace H_{h,t} has the following property: for a random sample y from F, either y, µ_{ℓ(y)} ∈ H_{h,t} or y, µ_{ℓ(y)} ∉ H_{h,t} with probability 1 - δ′.

Proof of Lemma 22. The idea behind the proof is simple. We first show that two of the means are at least a constant distance apart. We then bound the width of a component along the direction h, i.e., the maximum distance between two points belonging to the same component. If the width of each component is small, then clearly the largest gap must fall between components. Setting t to be the midpoint of the gap, we avoid cutting any components.

We first show that at least one mean must be far from the origin in the direction h. Let the columns of P_1 be the vectors e_1, ..., e_{k-1}. The span of these vectors is also the span of the means, so we have

$$\max_i(h^T\mu_i)^2 = \max_i(h^TP_1P_1^T\mu_i)^2 = \|P_1^Th\|^2\max_i\left(\frac{(P_1^Th)^T}{\|P_1^Th\|}\tilde\mu_i\right)^2 \geq \|P_1^Th\|^2\sum_{i=1}^k w_i\left(\frac{(P_1^Th)^T}{\|P_1^Th\|}\tilde\mu_i\right)^2 \geq \|P_1^Th\|^2(1-\phi) > \frac{1}{2}.$$

Since the origin is the mean of the means, we conclude that the maximum distance between two means in the direction h is at least 1/2. Without loss of generality, we assume that the interval [0, 1/2] is contained between two means projected to h.

We now show that every point x drawn from component i falls in a narrow interval when projected to h; that is, x satisfies h^Tx ∈ b_i, where b_i = [h^Tµ_i - (8(k-1))^{-1}, h^Tµ_i + (8(k-1))^{-1}]. We begin by examining the variance along h. Let e_k, ..., e_n be the columns of the n-by-(n-k+1) matrix P_2. Recall from Eqn. 2 that P_1^Tw_iΣ_iP_1 = A_i, that P_2^Tw_iΣ_iP_1 = B_i, and that P_2^Tw_iΣ_iP_2 = D_i. The norms of these matrices are bounded according to Lemma 4. Also, the vector h = P_1P_1^Th + P_2P_2^Th. For convenience of notation, we define ε such that ‖P_1^Th‖ = 1 - ε. Then ‖P_2^Th‖² = 1 - (1-ε)² ≤ 2ε.
We now argue

$$h^Tw_i\Sigma_ih = h^TP_1A_iP_1^Th + 2\,h^TP_2B_iP_1^Th + h^TP_2D_iP_2^Th \leq 2\left(h^TP_1A_iP_1^Th + h^TP_2D_iP_2^Th\right) \leq 2\left(\|P_1^Th\|^2\|A_i\| + \|P_2^Th\|^2\|D_i\|\right) \leq 2(\phi + 2\epsilon).$$

Using the assumptions about φ and ε, we conclude that the maximum variance along h is at most

$$\max_i h^T\Sigma_ih \leq \frac{2}{w}\left(\frac{w^2}{2^9(k-1)^2}\log^{-1}\frac{1}{\delta'} + 2\cdot\frac{w^2}{2^{10}(k-1)^2}\log^{-1}\frac{1}{\delta'}\right) \leq \left(2^7(k-1)^2\log\frac{1}{\delta'}\right)^{-1}.$$

We now translate these bounds on the variance into a bound on the difference between the minimum and maximum points along the direction h. By Lemma 15, with probability 1 - δ/2,

$$|h^T(x - \mu_{\ell(x)})| \leq \sqrt{2\,h^T\Sigma_{\ell(x)}h\,\log(2m_2/\delta)} \leq \frac{1}{8(k-1)}\sqrt{\frac{\log(2m_2/\delta)}{\log(1/\delta')}} \leq \frac{1}{8(k-1)}$$

for every x ∈ X. Thus, with probability 1 - δ/2, every point from X falls into the union of intervals b_1 ∪ ... ∪ b_k, where b_i = [h^Tµ_i - (8(k-1))^{-1}, h^Tµ_i + (8(k-1))^{-1}]. Because these intervals are centered about the means, at least the equivalent of one interval must fall outside the range [0, 1/2], which we assumed was contained between two projected means. Thus, the measure of the subset of [0, 1/2] that does not fall into one of the intervals is at least

$$\frac{1}{2} - (k-1)\cdot\frac{1}{4(k-1)} = \frac{1}{4}.$$

This set can be cut into at most k - 1 intervals, so at least one of them has width (4(k-1))^{-1}, which is exactly the width of an interval b_i. Because m_2 ≥ (k/w) log(2k/δ), the set X contains at least one sample from every component with probability 1 - δ/2. Overall, with probability 1 - δ, every component has at least one sample and all samples from component i fall in b_i. Thus, the largest gap between the sampled points will not contain any of the intervals b_1, ..., b_k. Moreover, the midpoint t of this gap must also fall outside of b_1 ∪ ... ∪ b_k, ensuring that no b_i is cut by t. By the same argument given above, any single point y from F is contained in b_1 ∪ ... ∪ b_k with probability 1 - δ′, proving the lemma.

In the proof of the main theorem for large k, we will need every point sampled from F in the recursion subtree to be classified correctly by the halfspace, so we will assume δ′ considerably smaller than δ/m_2. The second lemma shows that all submixtures have smaller overlap, ensuring that all the relevant lemmas apply in the recursive steps.

Lemma 23. The removal of any subset of components cannot induce a mixture with greater overlap than the original.

Proof of Lemma 23. Suppose that the components j+1, ..., k are removed from the mixture. Let ω = Σ_{i=1}^j w_i be a normalizing factor for the weights. Then if c = Σ_{i=1}^j w_iµ_i = -Σ_{i=j+1}^k w_iµ_i, the induced mean is ω^{-1}c. Let T be the subspace that minimizes the maximum overlap for the full k-component mixture. We then argue that the overlap φ̃ of the induced mixture is bounded by

$$\tilde\phi = \min_{\dim(S)=j-1}\max_{v\in S}\frac{\omega^{-1}\sum_{i=1}^j w_iv^T\Sigma_iv}{\omega^{-1}\sum_{i=1}^j w_iv^T\left(\mu_i\mu_i^T - cc^T + \Sigma_i\right)v} \leq \max_{v\in\mathrm{span}\{e_1,\ldots,e_{k-1}\}\setminus\mathrm{span}\{\mu_{j+1},\ldots,\mu_k\}}\frac{\sum_{i=1}^j w_iv^T\Sigma_iv}{\sum_{i=1}^j w_iv^T\left(\mu_i\mu_i^T - cc^T + \Sigma_i\right)v}.$$

Every v ∈ span{e_1, ..., e_{k-1}} \ span{µ_{j+1}, ..., µ_k} must be orthogonal to every µ_ℓ for j+1 ≤ ℓ ≤ k. Therefore, v must be orthogonal to c as well. This also enables us to add the terms for i = j+1, ..., k in both the numerator and denominator, because they are all zero.
The proofs of the main theorems are now apparent. Consider the case of $k = 2$ Gaussians first. As argued in Section 3.4, using $m_1 = \omega(k n^4 w^{-3}\log(n/(\delta w)))$ samples to estimate $\hat{u}$ and $\hat{M}$ is sufficient to guarantee that the estimates are accurate. For a well-chosen constant $C$, the condition
$$\phi \le J(p) \le C w^3 \log^{-1}\left(\frac{1}{\delta w} + \frac{1}{\eta}\right)$$
of Theorem 2 implies that
$$\sqrt{\phi} \le \frac{w\sqrt{\epsilon}}{640\cdot 2}, \qquad\text{where}\qquad \epsilon = \frac{w}{2^9}\log^{-1}\left(\frac{2m_2}{\delta} + \frac{1}{\eta}\right).$$
The arguments of Section 4 then show that the direction $h$ selected in step 3 satisfies
$$\|P_1^T h\| \ge 1 - \epsilon = 1 - \frac{w}{2^9}\log^{-1}\left(\frac{m_2}{\delta} + \frac{1}{\eta}\right).$$
Already, for the overlap we have
$$\sqrt{\phi} \le \frac{w\sqrt{\epsilon}}{640\cdot 2} \le \sqrt{\frac{w}{2^9(k-1)^2}}\,\log^{-1/2}\frac{1}{\delta'},$$
so we may apply Lemma 22 with $\delta' = (m_2/\delta + 1/\eta)^{-1}$. Thus, with probability $1-\delta$, the classifier $H_{h,t}$ is correct with probability $1-\delta' \ge 1-\eta$.

We follow the same outline for $k > 2$, with the quantity $1/\delta' = m_2/\delta + 1/\eta$ replaced by $1/\delta' = m/\delta + 1/\eta$, where $m$ is the total number of samples used. This is necessary because the halfspace $H_{h,t}$ must correctly classify every sample point taken below it in the recursion subtree. This adds the $n$ and $k$ factors, so that the required overlap becomes
$$\phi \le C w^3 k^{-3}\log^{-1}\left(\frac{nk}{\delta w} + \frac{1}{\eta}\right)$$
for an appropriate constant $C$. The correctness of the recursive steps is guaranteed by Lemma 23. Assuming that all previous steps are correct, the termination condition of step 4 is clearly correct when a single component is isolated.
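The recursive structure just described can be summarized in a short sketch. This is our illustration, not the paper's algorithm verbatim: the isotropic transformation, reweighting, and spectral projection that produce the direction $h$ are abstracted into a placeholder `choose_direction`, and the termination test of step 4 into a placeholder `is_single_component`.

```python
import numpy as np

def largest_gap_threshold(proj):
    """Midpoint of the largest gap among the sorted projections."""
    s = np.sort(proj)
    i = int(np.argmax(np.diff(s)))
    return (s[i] + s[i + 1]) / 2.0

def recursive_cluster(X, choose_direction, is_single_component):
    """Recursive halfspace clustering (illustrative sketch only).

    `choose_direction(X)` stands in for the isotropic-PCA step that returns a
    unit vector h close to the intermean subspace; `is_single_component(X)`
    stands in for the termination test.  Returns a list of index arrays,
    one per recovered cluster.
    """
    idx = np.arange(len(X))
    if len(X) < 2 or is_single_component(X):
        return [idx]
    h = choose_direction(X)
    proj = X @ h
    t = largest_gap_threshold(proj)
    left, right = idx[proj < t], idx[proj >= t]
    if len(left) == 0 or len(right) == 0:   # degenerate split: stop here
        return [idx]
    clusters = []
    for part in (left, right):
        # Recurse on each side of H_{h,t}; by Lemma 23 the induced submixture
        # has no larger overlap, so the same guarantees apply below.
        for sub in recursive_cluster(X[part], choose_direction, is_single_component):
            clusters.append(part[sub])
    return clusters
```

With probability as in Lemma 22, each split leaves every component's mass, and every sample used deeper in the recursion, on a single side, so the leaves of this recursion correspond to the individual components.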
6 Conclusion

We have presented an affine-invariant extension of principal components. We expect that this technique should be applicable to a broader class of problems. For example, mixtures of distributions with some mild properties, such as center symmetry and some bounds on the first few moments, might be solvable using isotropic PCA. It would be nice to characterize the full scope of the technique for clustering and also to find other applications, given that standard PCA is widely used.

References

[1] D. Achlioptas and F. McSherry. On spectral learning of mixtures of distributions. In Proc. of COLT, 2005.

[2] K. Chaudhuri and S. Rao. Learning mixtures of product distributions using correlations and independence. In Proc. of COLT, 2008.

[3] A. Dasgupta, J. Hopcroft, J. Kleinberg, and M. Sandler. On learning mixtures of heavy-tailed distributions. In Proc. of FOCS, 2005.

[4] S. DasGupta. Learning mixtures of Gaussians. In Proc. of FOCS, 1999.

[5] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1-38, 1977.

[6] J. Feldman, R. A. Servedio, and R. O'Donnell. PAC learning axis-aligned mixtures of Gaussians with no separation assumption. In Proc. of COLT, pages 20-34, 2006.

[7] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, 1990.

[8] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, 2001.

[9] R. Kannan, H. Salmasian, and S. Vempala. The spectral method for general mixture models. In Proceedings of the 18th Conference on Learning Theory, 2005.

[10] L. Lovász and S. Vempala. The geometry of logconcave functions and sampling algorithms. Random Structures and Algorithms, 30(3):307-358, 2007.

[11] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281-297. University of California Press, 1967.

[12] M. Rudelson. Random vectors in the isotropic position. Journal of Functional Analysis, 164:60-72, 1999.

[13] M. Rudelson and R. Vershynin. Sampling from large matrices: An approach through geometric functional analysis. J. ACM, 54(4), 2007.

[14] S. Arora and R. Kannan. Learning mixtures of arbitrary Gaussians. Ann. Appl. Probab., 15(1A):69-92, 2005.

[15] S. DasGupta and L. Schulman. A two-round variant of EM for Gaussian mixtures. In Sixteenth Conference on Uncertainty in Artificial Intelligence, 2000.

[16] G. W. Stewart and J. Sun. Matrix Perturbation Theory. Academic Press, 1990.

[17] S. Vempala and G. Wang. A spectral algorithm for learning mixtures of distributions. Proc. of FOCS 2002; JCSS, 68(4):841-860, 2004.
