Learning Mixtures of Gaussians using the k-means Algorithm


Authors: Kamalika Chaudhuri (CSE Department, UC San Diego, kamalika@soe.ucsd.edu), Sanjoy Dasgupta (CSE Department, UC San Diego, dasgupta@cs.ucsd.edu), Andrea Vattani (CSE Department, UC San Diego, avattani@cs.ucsd.edu)
Date: October 23, 2018

Abstract

One of the most popular algorithms for clustering in Euclidean space is the $k$-means algorithm; $k$-means is difficult to analyze mathematically, and few theoretical guarantees are known about it, particularly when the data is well-clustered. In this paper, we attempt to fill this gap in the literature by analyzing the behavior of $k$-means on well-clustered data. In particular, we study the case when each cluster is distributed as a different Gaussian, or, in other words, when the input comes from a mixture of Gaussians. We analyze three aspects of the $k$-means algorithm under this assumption. First, we show that when the input comes from a mixture of two spherical Gaussians, a variant of the 2-means algorithm successfully isolates the subspace containing the means of the mixture components. Second, we show an exact expression for the convergence of our variant of the 2-means algorithm, when the input is a very large number of samples from a mixture of spherical Gaussians. Our analysis does not require any lower bound on the separation between the mixture components. Finally, we study the sample requirement of $k$-means; for a mixture of two spherical Gaussians, we show an upper bound on the number of samples required by a variant of 2-means to get close to the true solution. The sample requirement grows with increasing dimensionality of the data, and decreasing separation between the means of the Gaussians. To match our upper bound, we show an information-theoretic lower bound on any algorithm that learns mixtures of two spherical Gaussians; our lower bound indicates that in the case when the overlap between the probability masses of the two distributions is small, the sample requirement of $k$-means is near-optimal.

1 Introduction

One of the most popular algorithms for clustering in Euclidean space is the $k$-means algorithm [Llo82, For65, Mac67]; this is a simple, local-search algorithm that iteratively refines a partition of the input points until convergence. Like many local-search algorithms, $k$-means is notoriously difficult to analyze, and few theoretical guarantees are known about it.

There have been three lines of work on the $k$-means algorithm. A first line of questioning addresses the quality of the solution produced by $k$-means, in comparison to the globally optimal solution. While it is well known that for general inputs the quality of this solution can be arbitrarily bad, the conditions under which $k$-means yields a globally optimal solution on well-clustered data are not well understood. A second line of work [AV06, Vat09] examines the number of iterations required by $k$-means to converge. [Vat09] shows that there exists a set of $n$ points in the plane such that $k$-means takes as many as $\Omega(2^n)$ iterations to converge on these points. A smoothed-analysis upper bound of $\mathrm{poly}(n)$ iterations has been established by [AMR09], but this bound is still much higher than what is observed in practice, where the number of iterations is frequently sublinear in $n$.
Moreover, the smoothed-analysis bound applies to small perturbations of arbitrary inputs, and the question of whether one can get faster convergence on well-clustered inputs is still unresolved. A third question, considered in the statistics literature, is the statistical efficiency of $k$-means. Suppose the input is drawn from some simple distribution for which $k$-means is statistically consistent; then, how many samples are required for $k$-means to converge? Are there other consistent procedures with a better sample requirement?

In this paper, we study all three aspects of $k$-means, by studying the behavior of $k$-means on Gaussian clusters. Such data is frequently modelled as a mixture of Gaussians; a mixture is a collection of Gaussians $\mathcal{D} = \{D_1, \ldots, D_k\}$ and weights $w_1, \ldots, w_k$ such that $\sum_i w_i = 1$. To sample from the mixture, we first pick $i$ with probability $w_i$ and then draw a random sample from $D_i$. Clustering such data then reduces to the problem of learning a mixture; here, we are given only the ability to sample from the mixture, and our goal is to learn the parameters of each Gaussian $D_i$, as well as to determine which Gaussian each sample came from.

Our results are as follows. First, we show that when the input comes from a mixture of two spherical Gaussians, a variant of the 2-means algorithm successfully isolates the subspace containing the means of the Gaussians. Second, we show an exact expression for the convergence of a variant of the 2-means algorithm, when the input is a large number of samples from a mixture of two spherical Gaussians. Our analysis shows that the convergence time is logarithmic in the dimension, and decreases with increasing separation between the mixture components. Finally, we address the sample requirement of $k$-means; for a mixture of two spherical Gaussians, we show an upper bound on the number of samples required by a variant of 2-means to get close to the true solution. The sample requirement grows with increasing dimensionality of the data, and decreasing separation between the means of the distributions. To match our upper bound, we show an information-theoretic lower bound on any algorithm that learns mixtures of two spherical Gaussians; our lower bound indicates that in the case when the overlap between the probability masses of the two distributions is small, the sample requirement of 2-means is near-optimal.

Additionally, we make some partial progress towards analyzing $k$-means in the more general case: we show that if our variant of 2-means is run on a mixture of $k$ spherical Gaussians, then it converges to a vector in the subspace containing the means of the $D_i$. The key insight in our analysis is a novel potential function $\theta_t$, which is the minimum angle between the subspace of the means of the $D_i$ and the normal to the hyperplane separator in 2-means. We show that this angle decreases with iterations of our variant of 2-means, and we can characterize convergence rates and sample requirements by characterizing the rate of decrease of the potential.

Our Results. More specifically, our results are as follows.
We perform a probabilistic analysis of a variant of 2-means; our variant is essentially a symmetrized version of 2-means, and it reduces to 2-means when we have a very large number of samples from a mixture of two identical spherical Gaussians with equal weights. In the 2-means algorithm, the separator between the two clusters is always a hyperplane, and we use the angle $\theta_t$ between the normal to this hyperplane and the mean of a mixture component in round $t$ as a measure of the potential in each round. Note that when $\theta_t = 0$, we have arrived at the correct solution.

First, in Section 3, we consider the case when we have at our disposal a very large number of samples from a mixture of $\mathcal{N}(\mu_1, \sigma_1^2 I_d)$ and $\mathcal{N}(\mu_2, \sigma_2^2 I_d)$ with mixing weights $\rho_1, \rho_2$ respectively. We show an exact relationship between $\theta_t$ and $\theta_{t+1}$, for any value of $\mu_j$, $\sigma_j$, $\rho_j$ and $t$. Using this relationship, we can approximate the rate of convergence of 2-means for different values of the separation, as well as for different initialization procedures. Our guarantees illustrate that the progress of $k$-means is very fast: namely, the square of the cosine of $\theta_t$ grows by at least a constant factor (for high separation) each round when one is far from the actual solution, and slowly when the actual solution is very close.

Next, in Section 4, we characterize the sample requirement for our variant of 2-means to succeed, when the input is a mixture of two spherical Gaussians. For the case of two identical spherical Gaussians with equal mixing weights, our results imply that when the separation $\mu < 1$, and when $\tilde{\Omega}(\frac{d}{\mu^4})$ samples are used in each round, the 2-means algorithm makes progress at roughly the same rate as in Section 3. This agrees with the $\Omega(\frac{1}{\mu^4})$ sample-complexity lower bound [Lin96] for learning a mixture of Gaussians on the line, as well as with the experimental results of [SSR06]. When $\mu > 1$, our variant of 2-means makes progress in each round when the number of samples is at least $\tilde{\Omega}(\frac{d}{\mu^2})$.

Then, in Section 5, we provide an information-theoretic lower bound on the sample requirement of any algorithm for learning a mixture of two spherical Gaussians with standard deviation 1 and equal weights. We show that when the separation $\mu > 1$, any algorithm requires $\Omega(\frac{d}{\mu^2})$ samples to converge to a vector within angle $\theta = \cos^{-1}(c)$ of the true solution, where $c$ is a constant. This indicates that $k$-means has a near-optimal sample requirement when $\mu > 1$.

Finally, in Section 6, we examine the performance of 2-means when the input comes from a mixture of $k$ spherical Gaussians. We show that, in this case, the normal to the hyperplane separating the two clusters converges to a vector in the subspace containing the means of the mixture components. Again, we characterize exactly the rate of convergence, which looks very similar to the bounds in Section 3.

Related Work. The convergence time of the $k$-means algorithm has been analyzed in the worst-case [AV06, Vat09] and smoothed-analysis [MR09, AMR09] settings; [Vat09] shows that the convergence time of $k$-means may be $\Omega(2^n)$ even in the plane. [AMR09] establishes an $O(n^{30})$ smoothed complexity bound.
[ORSS06] analyzes the performance of $k$-means when the data obeys a clusterability condition; however, their clusterability condition is very different, and moreover, they examine conditions under which constant-factor approximations can be found. In the statistics literature, the $k$-means algorithm has been shown to be consistent [Mac67]. [Pol81] shows that minimizing the $k$-means objective function (namely, the sum of the squares of the distances between each point and the center it is assigned to) is consistent, given sufficiently many samples. As optimizing the $k$-means objective is NP-hard, one cannot hope to always get an exact solution. Neither of these two works quantifies the convergence rate or the exact sample requirement of $k$-means.

There have been two lines of previous work on the theoretical analysis of the EM algorithm [DLR77], which is closely related to $k$-means. Essentially, for learning mixtures of identical Gaussians, the only difference between EM and $k$-means is that EM uses partial assignments, or soft clusterings, whereas $k$-means does not. First, [RW84, XJ96] view learning mixtures as an optimization problem, and EM as an optimization procedure over the likelihood surface. They analyze the structure of the likelihood surface around the optimum to conclude that EM has first-order convergence. An optimization procedure on a parameter $m$ is said to have first-order convergence if
$$\|m_{t+1} - m^*\| \leq R \cdot \|m_t - m^*\|,$$
where $m_t$ is the estimate of $m$ at time step $t$ using $n$ samples, $m^*$ is the maximum-likelihood estimator for $m$ using $n$ samples, and $R$ is some fixed constant between 0 and 1. In contrast, our analysis also applies when one is far from the optimum. The second line of work is a probabilistic analysis of EM due to [DS00]; they show a two-round variant of EM which converges to the correct partitioning of the samples, when the input is generated by a mixture of $k$ well-separated, spherical Gaussians. For their analysis to work, they require the mixture components to be separated such that two samples from the same Gaussian are a little closer in space than two samples from different Gaussians. In contrast, our analysis applies when the separation is much smaller.

The sample requirement of learning mixtures has been studied previously in the literature, but not in the context of $k$-means. [CHRZ07, Cha07] provide an algorithm that learns a mixture of two binary product distributions with uniform weights, when the separation $\mu$ between the mixture components is at least a constant, so long as $\tilde{\Omega}(\frac{d}{\mu^4})$ samples are available. (Notice that for such distributions, the directional standard deviation is at most 1.) Their algorithm is similar to $k$-means in some respects, but different in that they use different sets of coordinates in each round, and this is crucial in their analysis. Additionally, [BCOFZ07] show a spectral algorithm which learns a mixture of $k$ binary product distributions, when the distributions have small overlap in probability mass and the sample size is at least $\tilde{\Omega}(d/\mu^2)$. [Lin96] shows that at least $\tilde{\Omega}(\frac{1}{\mu^4})$ samples are required to learn a mixture of two Gaussians in one dimension. We note that although our lower bound of $\Omega(d/\mu^2)$ for $\mu > 1$ seems to contradict the upper bound of [CHRZ07, Cha07], this is not actually the case.
Our lower bound characterizes the number of samples required to find a vector at an angle $\theta = \cos^{-1}(1/10)$ with the vector joining the means. However, in order to classify a constant fraction of the points correctly, we only need to find a vector at an angle $\theta' = \cos^{-1}(1/\mu)$ with the vector joining the means. Since the goal of [CHRZ07] is simply to classify a constant fraction of the samples, their upper bound is less than $O(d/\mu^2)$.

In addition to theoretical analysis, there has been very interesting experimental work due to [SSR06], which studies the sample requirement for EM on a mixture of $k$ spherical Gaussians. They conjecture that the problem of learning mixtures has three phases, depending on the number of samples: with fewer than about $\frac{d}{\mu^4}$ samples, learning mixtures is information-theoretically hard; with more than about $\frac{d}{\mu^2}$ samples, it is computationally easy; and in between, it is computationally hard, but easy in an information-theoretic sense.

Finally, there has been a line of work which provides algorithms (different from EM or $k$-means) that are guaranteed to learn mixtures of Gaussians under certain separation conditions; see, for example, [Das99, VW02, AK05, AM05, KSV05, CR08, BV08]. For mixtures of two Gaussians, our result is comparable to the best results for spherical Gaussians [VW02] in terms of the separation requirement, and we have a smaller sample requirement.

2 The Setting

The $k$-means algorithm iteratively refines a partitioning of the input data. At each iteration, $k$ points are maintained as centers; each input point is assigned to its closest center. The center of each cluster is then recomputed as the empirical mean of the points assigned to the cluster. This procedure is continued until convergence.

Our variant of $k$-means is described below. There are two main differences between the actual 2-means algorithm and our variant. First, we use a separate set of samples in each iteration. Second, we always fix the cluster boundary to be a hyperplane through the origin. When the input is a very large number of samples from a mixture of two identical Gaussians with equal mixing weights, and with center of mass at the origin, this is exactly 2-means initialized with symmetric centers (with respect to the origin). We analyze this symmetrized version of 2-means even when the mixing weights and the variances of the Gaussians in the mixture are not equal. The input to our algorithm is a set of samples $S$, a number of iterations $N$, and a starting vector $u_0$, and the output is a vector $u_N$ obtained after $N$ iterations of the 2-means algorithm.

2-means-iterate($S$, $N$, $u_0$):
1. Partition $S$ randomly into sets of equal size $S_1, \ldots, S_N$.
2. For iteration $t = 0, \ldots, N-1$, compute
$$C_{t+1} = \{x \in S_{t+1} : \langle x, u_t \rangle > 0\}, \qquad \bar{C}_{t+1} = \{x \in S_{t+1} : \langle x, u_t \rangle < 0\},$$
and set $u_{t+1}$ to be the empirical average of $C_{t+1}$.
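For readers who prefer code, here is a minimal sketch of this variant in Python with NumPy; the function name, the batching details, and the synthetic two-Gaussian example at the bottom are illustrative choices of ours, not part of the paper.

```python
import numpy as np

def two_means_iterate(S, N, u0, rng=None):
    """Sketch of the symmetrized 2-means variant: a fresh batch per round, and a
    cluster boundary fixed to the hyperplane through the origin with normal u_t."""
    rng = np.random.default_rng() if rng is None else rng
    S = rng.permutation(S)                    # step 1: random partition into equal-size batches
    batches = np.array_split(S, N)
    u = np.asarray(u0, dtype=float)
    for t in range(N):
        X = batches[t]
        C = X[X @ u > 0]                      # C_{t+1}: points on the positive side of u_t
        if len(C):
            u = C.mean(axis=0)                # u_{t+1}: empirical average of C_{t+1}
    return u

# Illustrative use: a mixture of two identical spherical Gaussians in R^d,
# centered at the origin, with a random unit vector as the starting direction.
d, mu, n = 50, 1.0, 20000
rng = np.random.default_rng(0)
mean = np.zeros(d); mean[0] = mu
labels = rng.integers(0, 2, size=n)
S = rng.standard_normal((n, d)) + np.where(labels[:, None] == 0, mean, -mean)
u0 = rng.standard_normal(d); u0 /= np.linalg.norm(u0)
u_hat = two_means_iterate(S, N=20, u0=u0)
```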
Notation. In Sections 3 and 4, we analyze Algorithm 2-means-iterate when the input is generated by a mixture $\mathcal{D} = \{D_1, D_2\}$ of two Gaussians. We let $D_1 = \mathcal{N}(\mu_1, \sigma_1^2 I_d)$ and $D_2 = \mathcal{N}(\mu_2, \sigma_2^2 I_d)$, with mixing weights $\rho_1$ and $\rho_2$. We also assume without loss of generality that $\sigma_j \geq 1$ for all $j$. As the center of mass of the mixture lies at the origin, $\rho_1 \mu_1 + \rho_2 \mu_2 = 0$. In Section 6, we study a somewhat more general case.

We define $b$ as the unit vector along $\mu_1$, i.e. $b = \frac{\mu_1}{\|\mu_1\|}$. Henceforth, for any vector $v$, we use the notation $\breve{v}$ to denote the unit vector along $v$, i.e. $\breve{v} = \frac{v}{\|v\|}$; therefore, $\breve{u}_t$ is the unit vector along $u_t$. We assume without loss of generality that $\mu_1$ lies in the cluster $C_{t+1}$. In addition, for each $t$, we define $\theta_t$ as the angle between $\mu_1$ and $u_t$. We use the cosine of $\theta_t$ as a measure of progress of the algorithm at round $t$, and our goal is to show that this quantity increases as $t$ increases. Observe that $0 \leq \cos(\theta_t) \leq 1$, and $\cos(\theta_t) = 1$ when $u_t$ and $\mu_1$ are aligned along the same direction. For each $t$, we define $\tau_t^j = \langle \mu_j, \breve{u}_t \rangle = \langle \mu_j, b \rangle \cos(\theta_t)$. Moreover, from our notation, $\cos(\theta_t) = \frac{\tau_t^1}{\|\mu_1\|}$. In addition, we define $\rho_{\min} = \min_j \rho_j$, $\mu_{\min} = \min_j \|\mu_j\|$, and $\sigma_{\max} = \max_j \sigma_j$. For the special case of two identical spherical Gaussians with equal weights, we write $\mu = \|\mu_1\| = \|\mu_2\|$. Finally, for $a \leq b$, we use the notation $\Phi(a, b)$ to denote the probability that a standard normal variable takes values between $a$ and $b$.

3 Exact Estimation

In this section, we examine the performance of Algorithm 2-means-iterate when one can estimate the vectors $u_t$ exactly, that is, when a very large number of samples from the mixture is available. Our main result of this section is Lemma 1, which exactly characterizes the behavior of 2-means-iterate at a specific iteration $t$. For any $t$, we define the quantities $\xi_t$ and $m_t$ as follows:
$$\xi_t = \sum_j \frac{\rho_j \sigma_j\, e^{-(\tau_t^j)^2/2(\sigma_j)^2}}{\sqrt{2\pi}}, \qquad m_t = \sum_j \rho_j \langle \mu_j, b \rangle\, \Phi\!\left(-\frac{\tau_t^j}{\sigma_j}, \infty\right).$$

Figure 1: The plane defined by the vectors $\mu_1$ and $\breve{u}_t$. The vector $\breve{v}_t$ is the unit vector along $\mu_1 - \langle \mu_1, \breve{u}_t \rangle \breve{u}_t$; therefore, $\tau_t^1 = \|\mu_1\| \cos(\theta_t)$ and $\sqrt{\|\mu_1\|^2 - (\tau_t^1)^2} = \|\mu_1\| \sin(\theta_t)$.

Now, our main lemma can be stated as follows.

Lemma 1.
$$\cos^2(\theta_{t+1}) = \cos^2(\theta_t) \left( 1 + \tan^2(\theta_t)\, \frac{2\cos(\theta_t)\,\xi_t m_t + m_t^2}{\xi_t^2 + 2\cos(\theta_t)\,\xi_t m_t + m_t^2} \right).$$

The proof is in the Appendix. Using Lemma 1, we can characterize the convergence rates and times of 2-means-iterate for different values of $\mu_j$, $\rho_j$ and $\sigma_j$, as well as for different initializations of $u_0$. The convergence rates can be characterized in terms of two natural parameters of the problem: $M = \sum_j \frac{\rho_j \|\mu_j\|^2}{\sigma_j}$, which measures how much the distributions are separated, and $V = \sum_j \rho_j \sigma_j$, which measures the average standard deviation of the distributions. We observe that as $\sigma_j \geq 1$ for all $j$, we always have $V \geq 1$. To characterize these rates, it is also convenient to look at two different cases, according to the value of $\|\mu_j\|$, the separation between the mixture components.
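As a quick numerical illustration (a sketch of ours, not taken from the paper), the recurrence of Lemma 1 can be iterated directly from the definitions of $\xi_t$ and $m_t$ above; the helper names and the example parameters below are assumptions, and $\Phi(-a, \infty)$ is computed as the standard normal CDF at $a$.

```python
import math

def phi_tail(a):
    # Phi(-a, infinity) = P(N(0,1) > -a), i.e. the standard normal CDF at a
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

def lemma1_step(cos_t, mu_dot_b, sigmas, rhos):
    """One step of the Lemma 1 recurrence. mu_dot_b[j] = <mu_j, b>, where b is
    the unit vector along mu_1; returns cos(theta_{t+1}) given cos(theta_t)."""
    taus = [m * cos_t for m in mu_dot_b]                  # tau_t^j = <mu_j, b> cos(theta_t)
    xi = sum(r * s * math.exp(-tau * tau / (2 * s * s)) / math.sqrt(2 * math.pi)
             for r, s, tau in zip(rhos, sigmas, taus))
    m = sum(r * mb * phi_tail(tau / s)
            for r, s, mb, tau in zip(rhos, sigmas, mu_dot_b, taus))
    cos2 = cos_t * cos_t
    tan2 = (1.0 - cos2) / cos2
    gain = (2 * cos_t * xi * m + m * m) / (xi * xi + 2 * cos_t * xi * m + m * m)
    return math.sqrt(min(1.0, cos2 * (1.0 + tan2 * gain)))

# Example: two identical unit-variance Gaussians, equal weights, separation mu = 0.5,
# starting from the cos(theta_0) ~ 1/sqrt(d) of a random direction in d = 100.
cos_t, mu = 1 / math.sqrt(100), 0.5
for t in range(30):
    cos_t = lemma1_step(cos_t, mu_dot_b=[mu, -mu], sigmas=[1.0, 1.0], rhos=[0.5, 0.5])
```

In this run, $\cos^2(\theta_t)$ grows by roughly a constant factor per round while $\theta_t$ is far from 0, and the growth slows as $\cos(\theta_t)$ approaches 1, matching the qualitative behavior described below.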
Small $\mu_j$. First, we consider the case when each $\|\mu_j\|/\sigma_j$ is less than a fixed constant, $\sqrt{\ln\frac{9}{2\pi}}$, including the case when $\|\mu_j\|$ is much less than 1. In this case, the Gaussians are not even separated in terms of probability mass; in fact, as $\|\mu_j\|/\sigma_j$ decreases, the overlap in probability mass between the Gaussians tends to 1. However, we show that 2-means-iterate can still do something interesting, in terms of recovering the subspace containing the means of the distributions. Theorem 2 summarizes the convergence rate in this case.

Theorem 2 (Small $\mu_j$). Let $\|\mu_j\|/\sigma_j < \sqrt{\ln\frac{9}{2\pi}}$ for $j = 1, 2$. Then, there exist fixed constants $a_1$ and $a_2$ such that:
$$\cos^2(\theta_t)\left(1 + a_1 (M/V) \sin^2(\theta_t)\right) \leq \cos^2(\theta_{t+1}) \leq \cos^2(\theta_t)\left(1 + a_2 (M/V) \sin^2(\theta_t)\right).$$

For a mixture of two identical Gaussians with equal mixing weights, we can conclude:

Corollary 3. For a mixture of two identical spherical Gaussians with equal mixing weights and standard deviation 1, if $\mu = \|\mu_1\| = \|\mu_2\| < \sqrt{\ln\frac{9}{2\pi}}$, then
$$\cos^2(\theta_t)\left(1 + a'_1 \mu^2 \sin^2(\theta_t)\right) \leq \cos^2(\theta_{t+1}) \leq \cos^2(\theta_t)\left(1 + a'_2 \mu^2 \sin^2(\theta_t)\right).$$

The proof follows by a combination of Lemma 1 and Lemma 25. From Corollary 3, we observe that $\cos^2(\theta_t)$ grows by a factor of $(1 + \Theta(\mu^2))$ in each iteration, except when $\theta_t$ is very close to 0. This means that when 2-means-iterate is far from the actual solution, it approaches the solution at a consistently high rate; the convergence rate only slows down once $k$-means is very close to the actual solution.

Large $\mu_j$. In this case, there exists a $j$ such that $\|\mu_j\|/\sigma_j \geq \sqrt{\ln\frac{9}{2\pi}}$. In this regime, the Gaussians have small overlap in probability mass, yet the distance between two samples from the same distribution is much greater than the separation between the distributions. Our guarantees for this case are summarized by Theorem 4. We see from Theorem 4 that there are two regimes of behavior of the convergence rate, depending on the value of $\max_j |\tau_t^j|/\sigma_j$. These regimes have a natural interpretation. The first regime corresponds to the case when $\theta_t$ is large enough that, when projected onto $u_t$, at most a constant fraction of the samples from the two distributions can be classified with high confidence. The second regime corresponds to the case when $\theta_t$ is close enough to 0 that, when projected along $u_t$, most of the samples from the distributions can be classified with high confidence. As expected, in the second regime the convergence rate is much slower than in the first.

Theorem 4 (Large $\mu_j$). Suppose there exists $j$ such that $\|\mu_j\|/\sigma_j \geq \sqrt{\ln\frac{9}{2\pi}}$. If $|\tau_t^j|/\sigma_j < \sqrt{\ln\frac{9}{2\pi}}$ for all $j$, then there exist fixed constants $a_3$, $a_4$, $a_5$ and $a_6$ such that:
$$\cos^2(\theta_t)\left(1 + \frac{a_3 (M/V)^2 \sin^2(\theta_t)}{a_4 + (M/V)^2 \cos^2(\theta_t)}\right) \leq \cos^2(\theta_{t+1}) \leq \cos^2(\theta_t)\left(1 + \frac{a_5 \left((M/V) + (M/V)^2\right) \sin^2(\theta_t)}{a_6 + (M/V)^2 \cos^2(\theta_t)}\right).$$
On the other hand, if there exists $j$ such that $|\tau_t^j|/\sigma_j \geq \sqrt{\ln\frac{9}{2\pi}}$, then there exist fixed constants $a_7$ and $a_8$ such that:
$$\cos^2(\theta_t)\left(1 + \frac{a_7\, \rho_{\min}^2 \mu_{\min}^2 \tan^2(\theta_t)}{a_8 V^2 + \rho_{\min}^2 \mu_{\min}^2}\right) \leq \cos^2(\theta_{t+1}) \leq \cos^2(\theta_t)\left(1 + \tan^2(\theta_t)\right).$$

For two identical Gaussians with standard deviation 1, we can conclude the following.
Corollary 5. For a mixture of two identical Gaussians with equal mixing weights and standard deviation 1, if $\mu = \|\mu_1\| = \|\mu_2\| > \sqrt{\ln\frac{9}{2\pi}}$, and if $|\tau_t^1| = |\tau_t^2| \leq \sqrt{\ln\frac{9}{2\pi}}$, then there exist fixed constants $a'_3$, $a'_4$, $a'_5$, $a'_6$ such that:
$$\cos^2(\theta_t)\left(1 + \frac{a'_3 \mu^4 \sin^2(\theta_t)}{a'_4 + \mu^4 \cos^2(\theta_t)}\right) \leq \cos^2(\theta_{t+1}) \leq \cos^2(\theta_t)\left(1 + \frac{a'_5 \mu^4 \sin^2(\theta_t)}{a'_6 + \mu^4 \cos^2(\theta_t)}\right).$$
On the other hand, if $|\tau_t^1| = |\tau_t^2| \geq \sqrt{\ln\frac{9}{2\pi}}$, then there exists a fixed constant $a'_7$ such that:
$$\cos^2(\theta_t)\left(1 + a'_7 \tan^2(\theta_t)\right) \leq \cos^2(\theta_{t+1}) \leq \cos^2(\theta_t)\left(1 + \tan^2(\theta_t)\right).$$

In this case as well, we observe the same phenomenon: the convergence rate is high when we are far away from the solution, and slow when we are close. Using Theorems 2 and 4, we can characterize the convergence times of 2-means-iterate; for the sake of simplicity, we present the convergence-time bounds for a mixture of two spherical Gaussians with equal mixing weights and standard deviation 1. We recall that in this case 2-means-iterate is exactly 2-means.

Corollary 6 (Convergence Time). If $\theta_0$ is the initial angle between $\mu_1$ and $u_0$, then $\cos^2(\theta_N) \geq 1 - \epsilon$ after
$$N = C_0 \cdot \left( \frac{\ln\!\left(\frac{1}{\cos^2(\theta_0)}\right)}{\ln(1 + \mu^2)} + \frac{1}{\ln(1 + \epsilon)} \right)$$
iterations, where $C_0$ is a fixed constant.

Effect of Initialization. As is apparent from Corollary 6, the effect of initialization is only to ensure a lower bound on the value of $\cos(\theta_0)$. We illustrate below two natural ways by which one can select $u_0$, and their effect on the convergence rate. For the sake of simplicity, we state these bounds for the case in which we have two identical Gaussians with equal mixing weights and standard deviation 1.

• First, one can choose $u_0$ uniformly at random from the surface of a unit sphere in $\mathbb{R}^d$; in this case, $\cos^2(\theta_0) = \Theta(\frac{1}{d})$ with constant probability, and as a result, the convergence time to reach $\cos^{-1}(1/\sqrt{2})$ is $O\!\left(\frac{\ln d}{\ln(1+\mu^2)}\right)$.
• A second way to choose $u_0$ is to set it to be a random sample from the mixture; in this case, $\cos^2(\theta_0) = \Theta\!\left(\frac{(1+\mu)^2}{d}\right)$ with constant probability, and the time to reach $\cos^{-1}(1/\sqrt{2})$ is $O\!\left(\frac{\ln d}{\ln(1+\mu^2)}\right)$.
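To make the convergence-time bound concrete, the small helper below evaluates the expression in Corollary 6 for the two initializations above (a sketch of ours; the constant $C_0$ is unspecified in the paper, so it is left as a parameter, and the $\Theta(\cdot)$ values for $\cos^2(\theta_0)$ are used without their hidden constants).

```python
import math

def rounds_bound(cos2_theta0, mu, eps, C0=1.0):
    # Corollary 6: N = C0 * ( ln(1/cos^2(theta_0)) / ln(1 + mu^2) + 1 / ln(1 + eps) )
    return C0 * (math.log(1.0 / cos2_theta0) / math.log(1.0 + mu * mu)
                 + 1.0 / math.log(1.0 + eps))

d, mu, eps = 1000, 0.5, 0.1
n_uniform = rounds_bound(1.0 / d, mu, eps)               # u_0 uniform on the unit sphere
n_sample  = rounds_bound((1.0 + mu) ** 2 / d, mu, eps)   # u_0 a random sample from the mixture
```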
4 Finite Samples

In this section, we analyze Algorithm 2-means-iterate when we are required to estimate the statistics at each round with a finite number of samples. We characterize the number of samples needed to ensure that 2-means-iterate makes progress in each round, and we also characterize the rate of progress when the required number of samples is available. The main result of this section is the following lemma, which characterizes $\theta_{t+1}$, the angle between $\mu_1$ and the hyperplane separator in 2-means-iterate, given $\theta_t$. Notice that now $\theta_t$ is a random variable, which depends on the samples drawn in rounds $1, \ldots, t-1$, and given $\theta_t$, $\theta_{t+1}$ is a random variable whose value depends on the samples in round $t$. Also, we use $u_{t+1}$ as the center of partition $C_{t+1}$ in iteration $t+1$, and $E[u_{t+1}]$ is the expected center. Note that all the expectations in round $t$ are conditioned on $\theta_t$. In addition, we use $S_{t+1}$ to denote the quantity $E[X \cdot \mathbf{1}_{X \in C_{t+1}}]$, where $\mathbf{1}_{X \in C_{t+1}}$ is the indicator function for the event $X \in C_{t+1}$, and the expectation is taken over the entire mixture. Note that $S_{t+1} = E[u_{t+1}] \Pr[X \in C_{t+1}] = Z_{t+1} E[u_{t+1}]$. We use $\hat{S}_{t+1}$ to denote the empirical value of $S_{t+1}$.

Lemma 7. If we use $n$ samples in iteration $t$, then, given $\theta_t$, with probability $1 - 2\delta$,
$$\cos^2(\theta_{t+1}) \geq \cos^2(\theta_t)\left( 1 + \tan^2(\theta_t)\, \frac{2\cos(\theta_t)\,\xi_t m_t + m_t^2}{\xi_t^2 + 2\cos(\theta_t)\,\xi_t m_t + m_t^2 + \Delta_2} \right) - \frac{\Delta_2 \cos^2(\theta_t) + 2\Delta_1 (m_t + \xi_t\cos(\theta_t))}{m_t^2 + \xi_t^2 + 2\xi_t m_t \cos(\theta_t) + \Delta_2},$$
where
$$\Delta_1 = \frac{8\log(4n/\delta)\,(\sigma_{\max} + \max_j \|\mu_j\|)}{\sqrt{n}}, \qquad \Delta_2 = \frac{128\log^2(8n/\delta)\,(\sigma_{\max}^2 d + \sum_j \|\mu_j\|^2)}{n} + \frac{8\log(n/\delta)}{\sqrt{n}}\left( \sigma_{\max}\|S_{t+1}\| + \max_j |\langle S_{t+1}, \mu_j \rangle| \right).$$

The main idea behind the proof of Lemma 7 is that we can write $\cos^2(\theta_{t+1}) = \frac{\langle \hat{S}_{t+1}, \mu_1 \rangle^2}{\|\mu_1\|^2 \|\hat{S}_{t+1}\|^2}$. Next, we can use Lemma 1 and the definition of $S_{t+1}$ to get an expression for $\frac{\langle S_{t+1}, \mu_1 \rangle^2}{\|S_{t+1}\|^2 \|\mu_1\|^2}$, and Lemmas 8 and 9 to bound $\langle \hat{S}_{t+1} - S_{t+1}, \mu_1 \rangle$ and $\|\hat{S}_{t+1}\|^2 - \|S_{t+1}\|^2$. Plugging in all these values gives a proof of Lemma 7. We also assume for the rest of the section that the number of samples $n$ is at most some polynomial in $d$, so that $\log(n) = \Theta(\log(d))$. The two main lemmas used in the proof of Lemma 7 are Lemmas 8 and 9, stated in terms of the quantities $S_{t+1}$ and $\hat{S}_{t+1}$ defined above.

Lemma 8. For any $t$, and for any vector $v$ with norm $\|v\|$, with probability at least $1 - \delta$,
$$|\langle \hat{S}_{t+1} - S_{t+1}, v \rangle| \leq \frac{8\log(4n/\delta)\,(\sigma_{\max}\|v\| + \max_j |\langle \mu_j, v \rangle|)}{\sqrt{n}}.$$

Lemma 9. For any $t$, with probability at least $1 - \delta$,
$$\|\hat{S}_{t+1}\|^2 \leq \|S_{t+1}\|^2 + \frac{128\log^2(8n/\delta)\,(\sigma_{\max}^2 d + \sum_j \|\mu_j\|^2)}{n} + \frac{16\log(8n/\delta)}{\sqrt{n}}\left( \sigma_{\max}\|S_{t+1}\| + \max_j |\langle S_{t+1}, \mu_j \rangle| \right).$$

The proofs of Lemmas 8 and 9 are in the Appendix. Applying Lemma 7, we can characterize the number of samples required for 2-means-iterate to make progress in each round for different values of $\|\mu_j\|$. Again, it is convenient to look at two separate cases, based on $\|\mu_j\|$.

Theorem 10 (Small $\mu_j$). Let $\|\mu_j\|/\sigma_j < \sqrt{\ln\frac{9}{2\pi}}$ for all $j$. If the number of samples drawn in round $t$ is at least
$$a_9\, \sigma_{\max}^2 \log^2(d/\delta)\left( \frac{d}{M V \sin^4(\theta_t)} + \frac{1}{M^2 \sin^4(\theta_t)\cos^2(\theta_t)} \right)$$
for some fixed constant $a_9$, then, with probability at least $1 - \delta$,
$$\cos^2(\theta_{t+1}) \geq \cos^2(\theta_t)\left(1 + a_{10}\,(M/V)\sin^2(\theta_t)\right),$$
where $a_{10}$ is some fixed constant.

In particular, for the case of two identical Gaussians with equal mixing weights and standard deviation 1, our result implies the following.

Corollary 11. Let $\mu = \|\mu_1\| = \|\mu_2\| < \sqrt{\ln\frac{9}{2\pi}}$. If the number of samples drawn in round $t$ is at least $a_9 \log^2(d/\delta)\left( \frac{d}{\mu^2\sin^4(\theta_t)} + \frac{1}{\mu^4\cos^2(\theta_t)\sin^4(\theta_t)} \right)$, for some fixed constant $a_9$, then, with probability at least $1 - \delta$, $\cos^2(\theta_{t+1}) \geq \cos^2(\theta_t)(1 + a_{10}\,\mu^2\sin^2(\theta_t))$, where $a_{10}$ is some fixed constant.

In particular, when we initialize $u_0$ with a vector picked uniformly at random from a $d$-dimensional sphere, $\cos^2(\theta_0) \geq \frac{1}{d}$ with constant probability, and thus the number of samples required for success in the first round is $\tilde{\Theta}(\frac{d}{\mu^4})$. This bound matches the lower bounds for learning mixtures of Gaussians in one dimension [Lin96], as well as the conjectured lower bounds in the experimental work of [SSR06].
The following corollary summarizes the total number of samples required to learn the mixture to some fixed precision, for two identical spherical Gaussians with variance 1 and equal mixing weights.

Corollary 12. Let $\mu = \|\mu_1\| = \|\mu_2\| \leq \sqrt{\ln\frac{9}{2\pi}}$. Suppose $u_0$ is chosen uniformly at random, and the number of rounds is $N \geq C_0 \cdot \left( \frac{\ln d}{\ln(1+\mu^2)} + \frac{1}{\ln(1+\epsilon)} \right)$, where $C_0$ is the fixed constant in Corollary 6. If the number of samples $|S|$ is at least $N \cdot \frac{a_9\, d \log^2(d)}{\mu^4 \epsilon^2}$, then, with constant probability, after $N$ rounds, $\cos^2(\theta_N) \geq 1 - \epsilon$.

One can show a very similar corollary when $u_0$ is initialized as a random sample from the mixture. We note that the total number of samples is a factor of $N \approx \frac{\ln d}{\mu^2}$ greater than the bound in Theorem 10. This is due to the fact that we use a fresh set of samples in every round, in order to simplify our analysis. In practice, successive iterations of $k$-means or EM are run on the same data set.

Theorem 13 (Large $\mu_j$). Suppose that there exists some $j$ such that $\|\mu_j\|/\sigma_j \geq \sqrt{\ln\frac{9}{2\pi}}$, and suppose that the number of samples drawn in round $t$ is at least
$$a_{11} \log^2(d/\delta) \left( \frac{d\,\sigma_{\max}^2}{\rho_{\min}^2 \mu_{\min}^2 \sin^4(\theta_t)} + \frac{\sigma_{\max}^2 + \max_j \|\mu_j\|^2}{M^2 \cos^2(\theta_t) \sin^4(\theta_t)} + \frac{\sigma_{\max}^2 \max_j \|\mu_j\|^2 + \max_j \|\mu_j\|^4}{\rho_{\min}^4 \mu_{\min}^4 \sin^4(\theta_t)} \right)$$
for some constant $a_{11}$. If $|\tau_t^j| \leq \sqrt{\ln\frac{9}{2\pi}}$ for all $j$, then, with probability at least $1 - \delta$,
$$\cos^2(\theta_{t+1}) \geq \cos^2(\theta_t)\left(1 + a_{12} \min\!\left(1,\, M^2 + \frac{M}{V}\right) \sin^2(\theta_t)\right);$$
otherwise, with probability at least $1 - \delta$,
$$\cos^2(\theta_{t+1}) \geq \cos^2(\theta_t)\left(1 + \frac{a_{13}\, \rho_{\min}^2 \mu_{\min}^2 \tan^2(\theta_t)}{V^2 + \rho_{\min}^2 \mu_{\min}^2}\right),$$
where $a_{12}$ and $a_{13}$ are fixed constants.

For a mixture of two identical Gaussians with equal mixing weights and standard deviation 1, our result implies:

Corollary 14. Suppose that $\mu = \|\mu_1\| = \|\mu_2\| \geq \sqrt{\ln\frac{9}{2\pi}}$, and suppose that the number of samples in round $t$ is at least $a_{11} \log^2(d/\delta)\left( \frac{d}{\mu^2 \sin^4(\theta_t)} + \frac{1}{\mu^2 \cos^2(\theta_t) \sin^4(\theta_t)} \right)$, for some constant $a_{11}$. If $|\tau_t^j| \leq \sqrt{\ln\frac{9}{2\pi}}$, then, with probability at least $1 - \delta$, $\cos^2(\theta_{t+1}) \geq \cos^2(\theta_t)(1 + a_{12} \sin^2(\theta_t))$; otherwise, with probability $1 - \delta$, $\cos^2(\theta_{t+1}) \geq \cos^2(\theta_t)(1 + a_{13} \tan^2(\theta_t))$, where $a_{12}$ and $a_{13}$ are fixed constants.

Again, if we pick $u_0$ uniformly at random, we require about $\tilde{\Omega}(\frac{d}{\mu^2})$ samples for the first round to succeed. When $\mu > 1$, this bound is worse than $\frac{d}{\mu^4}$, but matches the upper bounds of [BCOFZ07]. The following corollary shows the total number of samples required for 2-means-iterate to converge.

Corollary 15. Let $\mu \geq \sqrt{\ln\frac{9}{2\pi}}$. Suppose $u_0$ is chosen uniformly at random and the number of rounds is $N \geq C_0 \cdot \left(\ln d + \frac{1}{\ln(1+\epsilon)}\right)$, where $C_0$ is the constant in Corollary 6. If $|S|$ is at least $\frac{2 N C_0\, d \log^2(d)}{\mu^2 \epsilon^2}$, then, with constant probability, after $N$ rounds, $\cos^2(\theta_N) \geq 1 - \epsilon$.
5 Lower Bounds

In this section, we prove a lower bound on the sample complexity of learning mixtures of Gaussians, using Fano's Inequality [Yu97, CT05], stated in Theorem 19. Our main theorem in this section can be summarized as follows.

Theorem 16. Suppose we are given samples from the mixture $D(\mu) = \frac{1}{2}\mathcal{N}(\mu, I_d) + \frac{1}{2}\mathcal{N}(-\mu, I_d)$, for some $\mu$, and let $\hat{\mu}$ be the estimate of $\mu$ computed from $n$ samples. If $n < \frac{Cd}{\|\mu\|^2}$ for some constant $C$, and $\|\mu\| > 1$, then there exists $\mu$ such that $E_{D(\mu)} \|\mu - \hat{\mu}\| \geq C' \|\mu\|$, where $C'$ is a constant.

The main tools in the proof of Theorem 16 are the following lemmas, and a generalized version of Fano's Inequality [CT05, Yu97].

Lemma 17. Let $\mu_1, \mu_2 \in \mathbb{R}^d$, and let $D_1$ and $D_2$ be the following mixture distributions: $D_1 = \frac{1}{2}\mathcal{N}(\mu_1, I_d) + \frac{1}{2}\mathcal{N}(-\mu_1, I_d)$, and $D_2 = \frac{1}{2}\mathcal{N}(\mu_2, I_d) + \frac{1}{2}\mathcal{N}(-\mu_2, I_d)$. Then,
$$KL(D_1, D_2) \leq \frac{1}{\sqrt{2\pi}} \left( \|\mu_2\|^2 - \|\mu_1\|^2 + \frac{3\sqrt{2\pi}}{2}\ln 2 + 2\|\mu_1\|\left(e^{-\|\mu_1\|^2/2} + \sqrt{2\pi}\,\|\mu_1\|\,\Phi(0, \|\mu_1\|)\right) \right).$$

Lemma 18. There exists a set of vectors $V = \{v_1, \ldots, v_K\}$ in $\mathbb{R}^d$ with the following properties: (1) for each $i$ and $j$, $d(v_i, v_j) \geq \frac{1}{5}$ and $d(v_i, -v_j) \geq \frac{1}{5}$; (2) $K = e^{d/10}$; (3) for all $i$, $\|v_i\| \leq \sqrt{\frac{7}{5}}$.

Theorem 19 (Fano's Inequality). Consider a class of densities $\mathcal{F}$, which contains $r$ densities $f_1, \ldots, f_r$, corresponding to parameter values $\theta_1, \ldots, \theta_r$. Let $d(\cdot)$ be any metric on $\theta$, and let $\hat{\theta}$ be an estimate of $\theta$ from $n$ samples from a density $f$ in $\mathcal{F}$. If, for all $i$ and $j$, $d(\theta_i, \theta_j) \geq \alpha$ and $KL(f_i, f_j) \leq \beta$, then
$$\max_j E_j\, d(\hat{\theta}, \theta_j) \geq \frac{\alpha}{2}\left(1 - \frac{n\beta + \log 2}{\log(r-1)}\right),$$
where $E_j$ denotes the expectation with respect to distribution $j$.

Proof (of Theorem 16). We apply Fano's Inequality. Our class of densities $\mathcal{F}$ is the class of all mixtures of the form $\frac{1}{2}\mathcal{N}(\mu', I_d) + \frac{1}{2}\mathcal{N}(-\mu', I_d)$. We set the parameter $\theta = \mu'$, and $d(\mu_1, \mu_2) = \|\mu_1 - \mu_2\|$. We construct a subclass $F = \{f_1, \ldots, f_r\}$ of $\mathcal{F}$ as follows. We set each $f_i = \frac{1}{2}\mathcal{N}(\|\mu\| v_i, I_d) + \frac{1}{2}\mathcal{N}(-\|\mu\| v_i, I_d)$, for each vector $v_i$ in the set $V$ of Lemma 18. Notice that now $r = e^{d/10}$. Moreover, for each pair $i$ and $j$, from Lemmas 17 and 18, $KL(f_i, f_j) \leq C_1 \|\mu\|^2 + C_2$, for constants $C_1$ and $C_2$. Finally, from Lemma 18, for each pair $i$ and $j$, $d(\mu_i, \mu_j) \geq \frac{\|\mu\|}{5}$. The theorem now follows by an application of Fano's Inequality (Theorem 19).
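To spell out that last application (our own sketch of the arithmetic, with the constants $C$, $C'$, $C_1$, $C_2$ left implicit as in the statements above): taking $r = e^{d/10}$, $\alpha = \|\mu\|/5$ and $\beta = C_1\|\mu\|^2 + C_2$ in Theorem 19 gives
$$\max_j E_j \|\hat{\mu} - \mu_j\| \;\geq\; \frac{\|\mu\|}{10}\left(1 - \frac{n(C_1\|\mu\|^2 + C_2) + \log 2}{\log(e^{d/10} - 1)}\right).$$
Since $\log(e^{d/10} - 1) = \Omega(d)$, and since $\|\mu\| > 1$ lets one absorb $C_2$ into an $O(\|\mu\|^2)$ term, the fraction on the right stays bounded away from 1 whenever $n \leq \frac{Cd}{\|\mu\|^2}$ for a suitably small constant $C$, which yields $E_{D(\mu)}\|\mu - \hat{\mu}\| \geq C'\|\mu\|$ as claimed.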
6 More General k-means

In this section, we show that when we apply 2-means on an input generated by a mixture of $k$ spherical Gaussians, the normal to the hyperplane which partitions the two clusters in the 2-means algorithm converges to a vector in the subspace $\mathcal{M}$ containing the means of the mixture components. We assume that our input is generated by a mixture of $k$ spherical Gaussians, with means $\mu_j$, variances $(\sigma_j)^2$, $j = 1, \ldots, k$, and mixing weights $\rho_1, \ldots, \rho_k$. The mixture is centered at the origin, so that $\sum_j \rho_j \mu_j = 0$. We use $\mathcal{M}$ to denote the subspace containing the means $\mu_1, \ldots, \mu_k$. We run Algorithm 2-means-iterate on this input, and our goal is to show that it still converges to a vector in $\mathcal{M}$.

Notation. In the sequel, given a vector $x$ and a subspace $W$, we define the angle between $x$ and $W$ as the angle between $x$ and the projection of $x$ onto $W$. We examine the angle $\theta_t$ between $u_t$ and $\mathcal{M}$, and our goal is to show that the cosine of this angle grows as $t$ increases. Our main result of this section is Lemma 20, which exactly defines the behavior of 2-means-iterate on a mixture of $k$ spherical Gaussians. Recall that at time $t$ we use $\breve{u}_t$ to partition the input data, and the length of the projection of $\breve{u}_t$ onto $\mathcal{M}$ is $\cos(\theta_t)$ by definition. Let $b_t^1$ be a unit vector lying in the subspace $\mathcal{M}$ such that
$$\breve{u}_t = \cos(\theta_t)\, b_t^1 + \sin(\theta_t)\, v_t,$$
where $v_t$ lies in the orthogonal complement of $\mathcal{M}$ and has norm 1. We define a second vector $\breve{u}_t^{\perp}$ as follows: $\breve{u}_t^{\perp} = \sin(\theta_t)\, b_t^1 - \cos(\theta_t)\, v_t$. We observe that $\langle \breve{u}_t, \breve{u}_t^{\perp} \rangle = 0$, $\|\breve{u}_t^{\perp}\| = 1$, and the projection of $\breve{u}_t^{\perp}$ on $\mathcal{M}$ is $\sin(\theta_t)\, b_t^1$. We now extend the set $\{b_t^1\}$ to complete an orthonormal basis $B = \{b_t^1, \ldots, b_t^{k-1}\}$ of $\mathcal{M}$. We also observe that $\{b_t^2, \ldots, b_t^{k-1}, \breve{u}_t, \breve{u}_t^{\perp}\}$ is an orthonormal basis of the subspace spanned by any basis of $\mathcal{M}$ together with $v_t$, and can be extended to a basis of $\mathbb{R}^d$. For $j = 1, \ldots, k$, we define $\tau_t^j = \langle \mu_j, \breve{u}_t \rangle = \cos(\theta_t) \langle \mu_j, b_t^1 \rangle$. Finally, we (re)define the quantity $\xi_t$, and define $m_t^l$ for $l = 1, \ldots, k-1$, as
$$\xi_t = \sum_j \frac{\rho_j \sigma_j\, e^{-(\tau_t^j)^2/2(\sigma_j)^2}}{\sqrt{2\pi}}, \qquad m_t^l = \sum_j \rho_j\, \Phi\!\left(-\frac{\tau_t^j}{\sigma_j}, \infty\right) \langle \mu_j, b_t^l \rangle.$$
Our main lemma is stated below; the proof is in the Appendix.

Lemma 20. At any iteration $t$ of Algorithm 2-means-iterate,
$$\cos^2(\theta_{t+1}) = \cos^2(\theta_t) \left( 1 + \tan^2(\theta_t)\, \frac{2\cos(\theta_t)\,\xi_t m_t^1 + \sum_l (m_t^l)^2}{\xi_t^2 + 2\cos(\theta_t)\,\xi_t m_t^1 + \sum_l (m_t^l)^2} \right).$$
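A sketch of this more general recurrence in code (ours, not the paper's): given the component parameters and the current direction, the helper below evaluates the expression in Lemma 20. The function name is an assumption, the basis of $\mathcal{M}$ is obtained via an SVD of the stacked means, and we use the fact that the vector $w = \sum_j \rho_j \Phi(-\tau_t^j/\sigma_j, \infty)\, \mu_j$ lies in $\mathcal{M}$, so that $m_t^1 = \langle w, b_t^1 \rangle$ and $\sum_l (m_t^l)^2 = \|w\|^2$.

```python
import math
import numpy as np

def phi_tail(a):
    # Phi(-a, infinity) for a standard normal, i.e. its CDF at a
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

def lemma20_cos2_next(u, mus, sigmas, rhos):
    """Evaluate Lemma 20 for cos^2(theta_{t+1}), given the current direction u
    (assumed not orthogonal to the mean subspace M), means mus (k x d),
    standard deviations sigmas and mixing weights rhos."""
    u = np.asarray(u, float); u = u / np.linalg.norm(u)        # \breve{u}_t
    mus = np.asarray(mus, float)
    U, s, _ = np.linalg.svd(mus.T, full_matrices=False)        # orthonormal basis of M
    B = U[:, s > 1e-10]
    proj = B @ (B.T @ u)                                       # projection of u onto M
    cos_t = np.linalg.norm(proj)                               # cos(theta_t)
    b1 = proj / cos_t                                          # b_t^1
    taus = mus @ u                                             # tau_t^j = <mu_j, u_t>
    xi = sum(r * sg * math.exp(-t * t / (2 * sg * sg)) / math.sqrt(2 * math.pi)
             for r, sg, t in zip(rhos, sigmas, taus))
    w = sum(r * phi_tail(t / sg) * mu                          # w in M, coordinates m_t^l
            for r, sg, t, mu in zip(rhos, sigmas, taus, mus))
    m1, msq = float(w @ b1), float(w @ w)                      # m_t^1 and sum_l (m_t^l)^2
    cos2 = cos_t * cos_t
    tan2 = (1.0 - cos2) / cos2
    gain = (2 * cos_t * xi * m1 + msq) / (xi * xi + 2 * cos_t * xi * m1 + msq)
    return cos2 * (1.0 + tan2 * gain)

# Example: three components in R^6 whose means sum to zero (equal weights).
rng = np.random.default_rng(0)
mus = rng.standard_normal((3, 6)); mus -= mus.mean(axis=0)
u = rng.standard_normal(6)
print(lemma20_cos2_next(u, mus, sigmas=[1.0, 1.0, 1.0], rhos=[1/3, 1/3, 1/3]))
```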
References

[AK05] S. Arora and R. Kannan. Learning mixtures of separated nonspherical Gaussians. Annals of Applied Probability, 15(1A):69–92, 2005.
[AM05] D. Achlioptas and F. McSherry. On spectral learning of mixtures of distributions. In COLT, 2005.
[AMR09] D. Arthur, B. Manthey, and H. Röglin. k-means has polynomial smoothed complexity. In FOCS, 2009. To appear.
[AV06] D. Arthur and S. Vassilvitskii. How slow is the k-means method? In SoCG, 2006.
[BCOFZ07] A. Blum, A. Coja-Oghlan, A. M. Frieze, and S. Zhou. Separating populations with wide data: A spectral analysis. In ISAAC, 2007.
[BV08] S. C. Brubaker and S. Vempala. Isotropic PCA and affine-invariant clustering. In FOCS, 2008.
[Cha07] K. Chaudhuri. Learning Mixtures of Distributions. PhD thesis, University of California, Berkeley, 2007. UCB/EECS-2007-124.
[CHRZ07] K. Chaudhuri, E. Halperin, S. Rao, and S. Zhou. A rigorous analysis of population stratification with limited data. In SODA, 2007.
[CR08] K. Chaudhuri and S. Rao. Learning mixtures of distributions using correlations and independence. In COLT, 2008.
[CT05] T. Cover and J. Thomas. Elements of Information Theory: Second Edition. Wiley, 2005.
[Das99] S. Dasgupta. Learning mixtures of Gaussians. In FOCS, 1999.
[DLR77] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society B, 39, 1977.
[DS00] S. Dasgupta and L. Schulman. A two-round variant of EM for Gaussian mixtures. In UAI, 2000.
[For65] E. Forgey. Cluster analysis of multivariate data: Efficiency vs. interpretability of classification. Biometrics, 1965.
[KSV05] R. Kannan, H. Salmasian, and S. Vempala. The spectral method for general mixture models. In COLT, 2005.
[Lin96] B. G. Lindsey. Mixture Models: Theory, Geometry and Applications. IMS, 1996.
[Llo82] S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 1982.
[Mac67] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Berkeley Symposium on Mathematical Statistics and Probability, 1967.
[MR09] B. Manthey and H. Röglin. Improved smoothed analysis of the k-means method. In SODA, 2009.
[ORSS06] R. Ostrovsky, Y. Rabani, L. J. Schulman, and C. Swamy. The effectiveness of Lloyd-type methods for the k-means problem. In FOCS, pages 165–176, 2006.
[Pol81] D. Pollard. Strong consistency of k-means clustering. Annals of Statistics, 1981.
[RW84] R. Redner and H. Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 1984.
[SSR06] N. Srebro, G. Shakhnarovich, and S. T. Roweis. An investigation of computational and informational limits in Gaussian mixture clustering. In ICML, 2006.
[Vat09] A. Vattani. k-means takes exponentially many iterations even in the plane. In SoCG, 2009.
[VW02] S. Vempala and G. Wang. A spectral algorithm for learning mixtures of distributions. In FOCS, 2002.
[XJ96] L. Xu and M. I. Jordan. On convergence properties of the EM algorithm for Gaussian mixtures. Neural Computation, 1996.
[Yu97] B. Yu. Assouad, Fano and Le Cam. In Festschrift for Lucien Le Cam, D. Pollard, E. Torgersen, and G. Yang (eds), pages 423–435, 1997.

Appendix

6.1 Proof of Lemma 1

In this section, we prove Lemma 1. First, we need some additional notation.

Notation. We define, for $j = 1, 2$:
$$w_{t+1}^j = \Pr[x \sim D_j \mid x \in C_{t+1}], \qquad u_{t+1}^j = E[x \mid x \sim D_j, x \in C_{t+1}].$$
We observe that $u_{t+1}$ can now be written as $u_{t+1} = w_{t+1}^1 u_{t+1}^1 + w_{t+1}^2 u_{t+1}^2$. Moreover, we define $Z_{t+1} = \Pr[x \in C_{t+1}]$.

Proof of Lemma 1. We start by providing exact expressions for $w_{t+1}^1$ and $w_{t+1}^2$ with respect to the partition computed in the previous round $t$. These are used to compute the projections of $u_{t+1}$ along the vectors $\breve{u}_t$ and $\mu_1 - \langle \mu_1, \breve{u}_t \rangle \breve{u}_t$, which finally leads to a proof of Lemma 1.

Lemma 21. In round $t$, for $j = 1, 2$, $w_{t+1}^j = \frac{\rho_j\, \Phi(-\tau_t^j/\sigma_j, \infty)}{Z_{t+1}}$.

Proof. We can write:
$$w_{t+1}^j = \frac{\Pr[x \in C_{t+1} \mid x \sim D_j]\, \Pr[x \sim D_j]}{\Pr[x \in C_{t+1}]}.$$
We note that $\Pr[x \sim D_j] = \rho_j$, and $\Pr[x \in C_{t+1}] = Z_{t+1}$. As $D_j$ is a spherical Gaussian, for any $x$ generated from $D_j$, and for any vector $y$ orthogonal to $u_t$, $\langle y, x \rangle$ is distributed independently of $\langle \breve{u}_t, x \rangle$. Moreover, we observe that $\langle \breve{u}_t, x \rangle$ is distributed as a Gaussian with mean $\langle \mu_j, \breve{u}_t \rangle = \tau_t^j$ and standard deviation $\sigma_j$. Therefore,
$$\Pr[x \in C_{t+1} \mid x \sim D_j] = \Pr_{x \sim D_j}[\langle \breve{u}_t, x \rangle > 0] = \Pr[\mathcal{N}(\tau_t^j, \sigma_j) \geq 0] = \Phi\!\left(-\frac{\tau_t^j}{\sigma_j}, \infty\right),$$
from which the lemma follows.

Lemma 22. For any $t$, $\langle u_{t+1}, \breve{u}_t \rangle = \frac{\xi_t + m_t \cos(\theta_t)}{Z_{t+1}}$.

Proof. Consider a sample $x$ drawn from $D_j$. Then, $\langle x, \breve{u}_t \rangle$ is distributed as a Gaussian with mean $\langle \mu_j, \breve{u}_t \rangle = \tau_t^j$ and standard deviation $\sigma_j$. We recall that $\Pr[x \in C_{t+1}] = Z_{t+1}$. Therefore, $\langle u_{t+1}^j, \breve{u}_t \rangle$ is equal to
$$\frac{E[\langle x, \breve{u}_t\rangle \cdot \mathbf{1}_{x \in C_{t+1}} \mid x \sim D_j]}{\Pr[x \in C_{t+1} \mid x \sim D_j]} = \frac{1}{\Pr[\mathcal{N}(\tau_t^j, \sigma_j) > 0]} \int_{y=0}^{\infty} y\, \frac{e^{-(y - \tau_t^j)^2/2(\sigma_j)^2}}{\sigma_j \sqrt{2\pi}}\, dy,$$
which is, again, equal to
$$\frac{1}{\Phi(-\frac{\tau_t^j}{\sigma_j}, \infty)} \left( \tau_t^j \int_{y=0}^{\infty} \frac{e^{-(y-\tau_t^j)^2/2(\sigma_j)^2}}{\sigma_j \sqrt{2\pi}}\, dy + \int_{y=0}^{\infty} (y - \tau_t^j)\, \frac{e^{-(y-\tau_t^j)^2/2(\sigma_j)^2}}{\sigma_j \sqrt{2\pi}}\, dy \right) = \frac{1}{\Phi(-\frac{\tau_t^j}{\sigma_j}, \infty)} \left( \tau_t^j\, \Phi\!\left(-\frac{\tau_t^j}{\sigma_j}, \infty\right) + \int_{y=0}^{\infty} (y - \tau_t^j)\, \frac{e^{-(y-\tau_t^j)^2/2(\sigma_j)^2}}{\sigma_j \sqrt{2\pi}}\, dy \right).$$
We can compute the integral in the equation above as follows.
$$\int_{y=0}^{\infty} (y - \tau_t^j)\, e^{-(y-\tau_t^j)^2/2(\sigma_j)^2}\, dy = (\sigma_j)^2 \int_{z = (\tau_t^j)^2/2(\sigma_j)^2}^{\infty} e^{-z}\, dz = (\sigma_j)^2\, e^{-(\tau_t^j)^2/2(\sigma_j)^2}.$$
We can now compute $\langle u_{t+1}, \breve{u}_t \rangle$ as follows:
$$\langle u_{t+1}, \breve{u}_t \rangle = w_{t+1}^1 \langle u_{t+1}^1, \breve{u}_t \rangle + w_{t+1}^2 \langle u_{t+1}^2, \breve{u}_t \rangle = \frac{1}{Z_{t+1}} \sum_j \left( \rho_j \tau_t^j\, \Phi\!\left(-\frac{\tau_t^j}{\sigma_j}, \infty\right) + \frac{\rho_j (\sigma_j)^2\, e^{-(\tau_t^j)^2/2(\sigma_j)^2}}{\sigma_j \sqrt{2\pi}} \right).$$
The lemma follows by recalling that $\tau_t^j = \langle \mu_j, b \rangle \cos(\theta_t)$ and plugging in the values of $m_t$ and $\xi_t$.

Lemma 23. Let $\breve{v}_t$ be a unit vector along $\mu_1 - \langle \mu_1, \breve{u}_t \rangle \breve{u}_t$. Then, $\langle u_{t+1}, \breve{v}_t \rangle = \frac{m_t \sin(\theta_t)}{Z_{t+1}}$. In addition, for any vector $z$ orthogonal to $\breve{u}_t$ and $\breve{v}_t$, $\langle u_{t+1}, z \rangle = 0$.

Proof. We observe that for a sample $x$ drawn from distribution $D_1$ (respectively, $D_2$) and any unit vector $v_1$ orthogonal to $\breve{u}_t$, $\langle x, v_1 \rangle$ is distributed as a Gaussian with mean $\langle \mu_1, v_1 \rangle$ (respectively, $\langle \mu_2, v_1 \rangle$) and standard deviation $\sigma_1$ (respectively, $\sigma_2$). Therefore, the projection of $u_{t+1}$ on $\breve{v}_t$ can be written as:
$$\langle u_{t+1}, \breve{v}_t \rangle = \sum_j w_{t+1}^j \langle \mu_j, \breve{v}_t \rangle = \frac{1}{Z_{t+1}} \sum_j \rho_j\, \Phi\!\left(-\frac{\tau_t^j}{\sigma_j}, \infty\right) \langle \mu_j, \breve{v}_t \rangle,$$
from which the first part of the lemma follows. The second part of the lemma follows from the observation that for any vector $z$ orthogonal to $\breve{u}_t$ and $\breve{v}_t$, $\langle \mu_j, z \rangle = 0$ for $j = 1, 2$.

Lemma 24. For any $t$,
$$\langle u_{t+1}, \mu_1 \rangle = \frac{\|\mu_1\| \left(\xi_t \cos(\theta_t) + m_t\right)}{Z_{t+1}}, \qquad \|u_{t+1}\|^2 = \frac{\xi_t^2 + m_t^2 + 2\xi_t m_t \cos(\theta_t)}{(Z_{t+1})^2}.$$

Proof. As we have an infinite number of samples, $u_{t+1}$ lies in the plane spanned by $\breve{u}_t$ and $\mu_1$. Therefore, we can write $\langle u_{t+1}, \mu_1 \rangle = \langle u_{t+1}, \breve{u}_t \rangle \langle \mu_1, \breve{u}_t \rangle + \langle u_{t+1}, \breve{v}_t \rangle \langle \mu_1, \breve{v}_t \rangle$. Moreover, we can write $\|u_{t+1}\|^2 = \langle u_{t+1}, \breve{u}_t \rangle^2 + \langle u_{t+1}, \breve{v}_t \rangle^2$. Thus, both equations follow by using Lemmas 22 and 23, and recalling that $\langle \mu_1, \breve{u}_t \rangle = \tau_t^1 = \|\mu_1\| \cos(\theta_t)$ and $\langle \mu_1, \breve{v}_t \rangle = \|\mu_1\| \sin(\theta_t)$.

We are now ready to complete the proof of Lemma 1.

Proof (of Lemma 1). By definition of $\theta_{t+1}$, $\cos^2(\theta_{t+1}) = \frac{\langle u_{t+1}, \mu_1 \rangle^2}{\|u_{t+1}\|^2 \|\mu_1\|^2}$. Therefore,
$$\|\mu_1\|^2 \cos^2(\theta_{t+1}) = \frac{\langle u_{t+1}, \mu_1 \rangle^2}{\|u_{t+1}\|^2} = (\tau_t^1)^2 \left( 1 + \frac{\langle u_{t+1}, \mu_1 \rangle^2 - \|\mu_1\|^2 \cos^2(\theta_t)\, \|u_{t+1}\|^2}{\|\mu_1\|^2 \cos^2(\theta_t)\, \|u_{t+1}\|^2} \right) = (\tau_t^1)^2 \left( 1 + \frac{\|\mu_1\|^2 \sin^2(\theta_t)\left(m_t^2 + 2\xi_t m_t \cos(\theta_t)\right)}{\|\mu_1\|^2 \cos^2(\theta_t)\, Z_{t+1}^2 \|u_{t+1}\|^2} \right) = \|\mu_1\|^2 \cos^2(\theta_t) \left( 1 + \tan^2(\theta_t)\, \frac{m_t^2 + 2\xi_t m_t \cos(\theta_t)}{Z_{t+1}^2 \|u_{t+1}\|^2} \right),$$
where we used Lemma 24 and the observation that $\cos(\theta_t) = \frac{\tau_t^1}{\|\mu_1\|}$. The lemma follows by replacing $Z_{t+1}^2 \|u_{t+1}\|^2$ using the expression in Lemma 24.

The next lemma helps us derive Theorem 2 from Lemma 1. It shows how to approximate $\Phi(-\tau, \tau)$ when $\tau$ is small.

Lemma 25. Let $\tau \leq \sqrt{\ln\frac{9}{2\pi}}$. Then, $\frac{5}{3\sqrt{2\pi}}\,\tau \leq \Phi(-\tau, \tau) \leq \frac{2}{\sqrt{2\pi}}\,\tau$. In addition, $\frac{2 e^{-\tau^2/2}}{\sqrt{2\pi}} \geq \frac{2}{3}$.
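As a sanity check on Lemma 1 (our own sketch, not part of the paper), one can compare the closed-form one-step update against a brute-force estimate of $u_{t+1}$ from a large sample; the parameter choices below are arbitrary.

```python
import math
import numpy as np

# Empirically verify the Lemma 1 one-step update for a mixture of two identical
# spherical Gaussians with means +/- mu*b and equal weights.
d, mu, sigma, n = 10, 1.5, 1.0, 500_000
rng = np.random.default_rng(1)
b = np.zeros(d); b[0] = 1.0                                    # unit vector along mu_1
theta_t = 1.0                                                   # current angle theta_t
u_t = math.cos(theta_t) * b; u_t[1] = math.sin(theta_t)         # direction at angle theta_t to mu_1

# Brute force: u_{t+1} is the mean of the points on the positive side of u_t.
labels = rng.integers(0, 2, size=n)
X = sigma * rng.standard_normal((n, d)) + np.where(labels[:, None] == 0, mu * b, -mu * b)
C = X[X @ u_t > 0]
u_next = C.mean(axis=0)
cos2_empirical = (u_next @ b) ** 2 / (u_next @ u_next)

# Closed form: Lemma 1 with xi_t and m_t as defined in Section 3.
phi_tail = lambda a: 0.5 * (1 + math.erf(a / math.sqrt(2)))     # Phi(-a, infinity)
taus = [mu * math.cos(theta_t), -mu * math.cos(theta_t)]
xi = sum(0.5 * sigma * math.exp(-t * t / (2 * sigma**2)) / math.sqrt(2 * math.pi) for t in taus)
m = sum(0.5 * s * mu * phi_tail(t / sigma) for s, t in zip([1, -1], taus))
c = math.cos(theta_t)
cos2_lemma = c * c * (1 + (1 - c * c) / (c * c)
                      * (2 * c * xi * m + m * m) / (xi * xi + 2 * c * xi * m + m * m))
print(cos2_empirical, cos2_lemma)   # the two values should agree up to sampling noise
```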
6.2 Proofs of Sample Requirement Bounds

For the rest of the section, we prove Lemmas 8 and 9, which lead to a proof of Lemma 7. First, we need to define some notation.

Notation. At time $t$, we use the notation $S_{t+1}$ to denote the quantity $E[X \cdot \mathbf{1}_{X \in C_{t+1}}]$, where $\mathbf{1}_{X \in C_{t+1}}$ is the indicator function for the event $X \in C_{t+1}$, and the expectation is taken over the entire mixture. In the sequel, we also use the notation $\hat{S}_{t+1}$ to denote the empirical value of $S_{t+1}$. Our goal is to bound the concentration of certain functions of $\hat{S}_{t+1}$ around their expected values, when we are given only $n$ samples from the mixture. Recall that we define $\theta_{t+1}$ as the angle between $\mu_1$ and the hyperplane separator in 2-means-iterate, given $\theta_t$. Notice that now $\theta_t$ is a random variable, which depends on the samples drawn in rounds $1, \ldots, t-1$, and given $\theta_t$, $\theta_{t+1}$ is a random variable whose value depends on the samples in round $t$. Also, we use $u_{t+1}$ as the center of partition $C_{t+1}$ in iteration $t+1$, and $E[u_{t+1}]$ is the expected center. Note that all the expectations in round $t$ are conditioned on $\theta_t$.

Proofs. We are now ready to prove Lemmas 8 and 9.

Proof (of Lemma 8). Let $X_1, \ldots, X_n$ be the $n$ i.i.d. samples from the mixture; for each $i$, we can write the projection of $X_i$ along $v$ as $\langle X_i, v \rangle = Y_i + Z_i$, where, if $X_i$ is generated from distribution $D_j$, $Y_i = \langle \mu_j, v \rangle$ and $Z_i$ is a Gaussian with mean 0 and standard deviation $\sigma_j$, scaled by $\|v\|$. Therefore, we can write:
$$\langle \hat{S}_{t+1}, v \rangle = \frac{1}{n} \left( \sum_i Y_i \cdot \mathbf{1}_{X_i \in C_{t+1}} + \sum_i Z_i \cdot \mathbf{1}_{X_i \in C_{t+1}} \right).$$
To determine the concentration of $\langle \hat{S}_{t+1}, v \rangle$ around its expected value, we address the two terms separately. The first term is a sum of $n$ independently distributed random variables, such that changing one variable changes the sum by at most $\max_j \frac{2|\langle \mu_j, v \rangle|}{n}$; therefore, to calculate its concentration, one can apply Hoeffding's Inequality. It follows that with probability at most $\frac{\delta}{2}$,
$$\left| \frac{1}{n} \sum_i Y_i \cdot \mathbf{1}_{X_i \in C_{t+1}} - E\!\left[\frac{1}{n} \sum_i Y_i \cdot \mathbf{1}_{X_i \in C_{t+1}}\right] \right| > \max_j \frac{4 |\langle \mu_j, v \rangle| \sqrt{\log(4n/\delta)}}{\sqrt{n}}.$$
For the second term, for some $0 \leq \delta' \leq 1$, let $E_i(\delta')$ denote the event
$$-\sigma_{\max} \|v\| \sqrt{2\log(1/\delta')} \;\leq\; Z_i \cdot \mathbf{1}_{X_i \in C_{t+1}} \;\leq\; \sigma_{\max} \|v\| \sqrt{2\log(1/\delta')}.$$
As $Z_i$ is a centered Gaussian with standard deviation at most $\sigma_{\max}\|v\|$, and $\mathbf{1}_{X_i \in C_{t+1}}$ takes values 0 and 1, for any $i$ and for $\delta'$ small enough, $\Pr[E_i(\delta')] \geq 1 - \delta'$. We use $\delta' = \frac{\delta}{4n}$, and condition on the fact that all the events $\{E_i(\delta'),\, i = 1, \ldots, n\}$ happen; using a union bound over the events $\bar{E}_i(\delta')$, the probability that this holds is at least $1 - \frac{\delta}{4}$. We also observe that, as the Gaussians $Z_i$ are independently distributed, conditioned on the intersection of the events $E_i$, the Gaussians $Z_i$ are still independent. Therefore, conditioned on the event $\cap_i E_i(\delta')$, $\frac{1}{n}\sum_i Z_i \cdot \mathbf{1}_{X_i \in C_{t+1}}$ is the sum of $n$ independent random variables, such that changing one variable changes the sum by at most $\frac{2\sigma_{\max}\|v\|\sqrt{2\log(1/\delta')}}{n}$. We can now apply Hoeffding's bound to conclude that with probability at least $1 - \frac{\delta}{2}$,
$$\left| \frac{1}{n}\sum_i Z_i \cdot \mathbf{1}_{X_i \in C_{t+1}} - E\!\left[\frac{1}{n}\sum_i Z_i \cdot \mathbf{1}_{X_i \in C_{t+1}}\right] \right| \leq \frac{4\sigma_{\max}\|v\| \sqrt{2\log(1/\delta')} \sqrt{2\log(1/\delta)}}{\sqrt{n}} \leq \frac{8\sigma_{\max}\|v\| \log(4n/\delta)}{\sqrt{n}}.$$
The lemma now follows by applying a union bound.

Proof (of Lemma 9). We can write:
$$\|\hat{S}_{t+1}\|^2 \leq \|S_{t+1}\|^2 + \|\hat{S}_{t+1} - S_{t+1}\|^2 + 2\,|\langle \hat{S}_{t+1} - S_{t+1}, S_{t+1}\rangle|.$$
If $v_1, \ldots, v_d$ is any orthonormal basis of $\mathbb{R}^d$, then we can bound the second term as follows.
With probability at least $1 - \frac{\delta}{2}$,
$$\|\hat{S}_{t+1} - S_{t+1}\|^2 = \sum_{i=1}^d \langle \hat{S}_{t+1} - S_{t+1}, v_i \rangle^2 \leq \frac{128 \log^2(8n/\delta)}{n}\left( \sum_i \sigma_{\max}^2 \|v_i\|^2 + \sum_{i,j} \langle \mu_j, v_i \rangle^2 \right) \leq \frac{128 \log^2(8n/\delta)}{n}\left( \sigma_{\max}^2 d + \sum_j \|\mu_j\|^2 \right).$$
The second step follows by the application of Lemma 8, and the fact that for any $a$ and $b$, $(a+b)^2 \leq 2(a^2 + b^2)$. Using Lemma 8, with probability at least $1 - \frac{\delta}{2}$,
$$\langle \hat{S}_{t+1} - S_{t+1}, S_{t+1} \rangle \leq \frac{8\log(8n/\delta)}{\sqrt{n}}\left( \sigma_{\max}\|S_{t+1}\| + \max_j |\langle S_{t+1}, \mu_j \rangle| \right).$$
The lemma follows by a union bound over these two events.

6.3 Proofs of Lower Bounds

Proof (of Lemma 17). Let $P$ be the plane containing the origin $O$ and the vectors $\mu_1$ and $\mu_2$. If $v$ is a vector orthogonal to $P$, then the projection of $D_1$ along $v$ is a Gaussian $\mathcal{N}(0, 1)$, which is distributed independently of the projection of $D_1$ onto $P$ (and the same is the case for $D_2$). Therefore, to compute the KL-divergence of $D_1$ and $D_2$, it is sufficient to compute the KL-divergence of the projections of $D_1$ and $D_2$ onto the plane $P$. Let $x$ be a vector in $P$. Then,
$$KL(D_1, D_2) = \frac{1}{\sqrt{2\pi}}\int_{x \in P} \left(\tfrac{1}{2}e^{-\|x-\mu_1\|^2/2} + \tfrac{1}{2}e^{-\|x+\mu_1\|^2/2}\right) \ln\!\left( \frac{\tfrac{1}{2}e^{-\|x-\mu_1\|^2/2} + \tfrac{1}{2}e^{-\|x+\mu_1\|^2/2}}{\tfrac{1}{2}e^{-\|x-\mu_2\|^2/2} + \tfrac{1}{2}e^{-\|x+\mu_2\|^2/2}} \right) dx$$
$$= \frac{1}{\sqrt{2\pi}}\int_{x \in P} \left(\tfrac{1}{2}e^{-\|x-\mu_1\|^2/2} + \tfrac{1}{2}e^{-\|x+\mu_1\|^2/2}\right) \ln\!\left( \frac{e^{-\|x+\mu_1\|^2/2}\,(1 + e^{2\langle x, \mu_1\rangle})}{e^{-\|x+\mu_2\|^2/2}\,(1 + e^{2\langle x, \mu_2\rangle})} \right) dx$$
$$= \frac{1}{\sqrt{2\pi}}\int_{x \in P} \left(\tfrac{1}{2}e^{-\|x-\mu_1\|^2/2} + \tfrac{1}{2}e^{-\|x+\mu_1\|^2/2}\right) \left( \left(\|x+\mu_2\|^2 - \|x+\mu_1\|^2\right) + \ln\frac{1 + e^{2\langle x, \mu_1\rangle}}{1 + e^{2\langle x, \mu_2\rangle}} \right) dx.$$
We observe that for any $x$, $\|x+\mu_2\|^2 - \|x+\mu_1\|^2 = \|\mu_2\|^2 - \|\mu_1\|^2 + 2\langle x, \mu_2 - \mu_1 \rangle$. As the expected value of $D_1$ is 0, we can write:
$$\int_{x \in P} \left(\tfrac{1}{2}e^{-\|x-\mu_1\|^2/2} + \tfrac{1}{2}e^{-\|x+\mu_1\|^2/2}\right) \langle x, \mu_2 - \mu_1 \rangle\, dx = E_{x \sim D_1} \langle x, \mu_2 - \mu_1 \rangle = 0. \qquad (1)$$
We now focus on the case where $\|\mu_1\| \gg 1$. We observe that for any $\mu_2$ and any $x$, $1 + e^{2\langle x, \mu_2 \rangle} > 1$. Therefore, combining the previous two equations,
$$KL(D_1, D_2) \leq \frac{1}{\sqrt{2\pi}}\left( \|\mu_2\|^2 - \|\mu_1\|^2 + \int_{x \in P} \left(\tfrac{1}{2}e^{-\|x-\mu_1\|^2/2} + \tfrac{1}{2}e^{-\|x+\mu_1\|^2/2}\right) \ln(1 + e^{2\langle x, \mu_1\rangle})\, dx \right).$$
Again, since the projection of $D_1$ perpendicular to $\mu_1$ is distributed independently of the projection of $D_1$ along $\mu_1$, the above integral can be taken over a one-dimensional $x$ which varies along the vector $\mu_1$. For the rest of the proof, we abuse notation and use $\mu_1$ to denote both the vector $\mu_1$ and the scalar $\|\mu_1\|$. We can write:
$$\int_{-\infty}^{\infty} \left(\tfrac{1}{2}e^{-(x-\mu_1)^2/2} + \tfrac{1}{2}e^{-(x+\mu_1)^2/2}\right)\ln(1 + e^{2\mu_1 x})\, dx \leq \sqrt{2\pi}\ln 2 + \int_{0}^{\infty} \left(\tfrac{1}{2}e^{-(x-\mu_1)^2/2} + \tfrac{1}{2}e^{-(x+\mu_1)^2/2}\right)\ln(1 + e^{2\mu_1 x})\, dx$$
$$\leq \sqrt{2\pi}\ln 2 + \int_{0}^{\infty} \left(\tfrac{1}{2}e^{-(x-\mu_1)^2/2} + \tfrac{1}{2}e^{-(x+\mu_1)^2/2}\right)(\ln 2 + 2x\mu_1)\, dx \leq \frac{3\sqrt{2\pi}}{2}\ln 2 + 2\mu_1 \int_{0}^{\infty} \left(\tfrac{1}{2}e^{-(x-\mu_1)^2/2} + \tfrac{1}{2}e^{-(x+\mu_1)^2/2}\right) x\, dx.$$
The first inequality holds because for $x < 0$, $\ln(1 + e^{2x\mu_1}) \leq \ln 2$. The second holds because for $x > 0$, $\ln(1 + e^{2x\mu_1}) \leq \ln(2e^{2x\mu_1})$. The third follows from the symmetry of $D_1$ around the origin.
Now, for any $a$, we can write:
$$\frac{1}{\sqrt{2\pi}}\int_{0}^{\infty} x\, e^{-(x+a)^2/2}\, dx = \frac{1}{\sqrt{2\pi}}\, e^{-a^2/2} - a\,\Phi(a, \infty).$$
Plugging this in, we can show that
$$KL(D_1, D_2) \leq \frac{1}{\sqrt{2\pi}}\left( \|\mu_2\|^2 - \|\mu_1\|^2 + \frac{3\sqrt{2\pi}}{2}\ln 2 + 2\|\mu_1\|\left(e^{-\|\mu_1\|^2/2} + \sqrt{2\pi}\,\|\mu_1\|\,\Phi(0, \|\mu_1\|)\right) \right),$$
from which the lemma follows.

Proof (of Lemma 18). For each $i$, let $v_i$ be drawn independently from the distribution $\frac{1}{\sqrt{d}}\mathcal{N}(0, I_d)$. For each $i, j$, let $P_{ij} = \frac{d}{2}\cdot d(v_i, v_j)$ and $N_{ij} = \frac{d}{2}\cdot d(v_i, -v_j)$. Then, for each $i$ and $j$, $P_{ij}$ and $N_{ij}$ are distributed according to the chi-squared distribution with parameter $d$. From Lemma 26, it follows that $\Pr[P_{ij} < \frac{d}{10}] \leq e^{-3d/10}$, and a similar bound can be shown to hold for the random variables $N_{ij}$. Applying a union bound, the probability that $P_{ij} < \frac{d}{10}$ or $N_{ij} < \frac{d}{10}$ for some pair $(i, j)$, $i \in V$, $j \in V$, is at most $2K^2 e^{-3d/10}$; this probability is at most $\frac{1}{2}$ when $K = e^{d/10}$. In addition, we observe that for each vector $v_i$, $d \cdot \|v_i\|^2$ is also distributed as a chi-squared distribution with parameter $d$. From Lemma 26, for each $i$, $\Pr[\|v_i\|^2 > 7/5] \leq e^{-2d/15}$. The second part of the lemma now follows by a union bound over all $K$ vectors in the set $V$.

Lemma 26. Let $X$ be a random variable drawn from the chi-squared distribution with parameter $d$. Then, $\Pr[X < \frac{d}{10}] \leq e^{-3d/10}$. Moreover, $\Pr[X > \frac{7d}{5}] \leq e^{-2d/15}$.

Proof. Let $Y$ be the random variable defined as $Y = d - X$. Then,
$$\Pr\left[X < \frac{d}{10}\right] = \Pr\left[Y > \frac{9d}{10}\right] = \Pr\left[e^{tY} > e^{9dt/10}\right] \leq \frac{E[e^{tY}]}{e^{9dt/10}},$$
where the last step uses Markov's Inequality. We observe that $E[e^{tY}] = e^{td}\, E[e^{-tX}] = e^{td}(1-2t)^{d/2}$, for $t < \frac{1}{2}$. The first part of the lemma follows from the observation that $(1-2t)^{d/2} \leq e^{-td}$, and by plugging in $t = \frac{1}{3}$. For the second part, we again observe that
$$\Pr\left[X > \frac{7d}{5}\right] \leq (1-2t)^{-d/2}\, e^{-7dt/5} \leq e^{-2dt/5}.$$
The lemma now follows by plugging in $t = \frac{1}{3}$.

6.4 More General k-means: Results and Proofs

In this section, we show that when we apply 2-means on an input generated by a mixture of $k$ spherical Gaussians, the normal to the hyperplane which partitions the two clusters in the 2-means algorithm converges to a vector in the subspace $\mathcal{M}$ containing the means of the mixture components. This subspace is interesting because, in this subspace, the distance between the means is as high as in the original space; however, if the number of clusters is small compared to the dimension, the distance between two samples from the same cluster is much smaller. In fact, several algorithms for learning mixture models [VW02, AM05, CR08] attempt to isolate this subspace first, and then use some simple clustering methods in this subspace.

6.4.1 The Setting

We assume that our input is generated by a mixture of $k$ spherical Gaussians, with means $\mu_j$, variances $(\sigma_j)^2$, $j = 1, \ldots, k$, and mixing weights $\rho_1, \ldots, \rho_k$. The mixture is centered at the origin, so that $\sum_j \rho_j \mu_j = 0$. We use $\mathcal{M}$ to denote the subspace containing the means $\mu_1, \ldots, \mu_k$. We run Algorithm 2-means-iterate on this input, and our goal is to show that it still converges to a vector in $\mathcal{M}$. In the sequel, given a vector $x$ and a subspace $W$, we define the angle between $x$ and $W$ as the angle between $x$ and the projection of $x$ onto $W$.
As in Sections 2 and 3, we examine the angle $\theta_t$ between $u_t$ and $M$, and our goal is to show that the cosine of this angle grows as $t$ increases. Our main result of this section is Lemma 20, which, analogous to Lemma 1 in Section 3, exactly characterizes the behavior of 2-means on a mixture of $k$ spherical Gaussians. Before we can prove the lemma, we need some additional notation.

6.4.2 Notation

Recall that at time $t$, we use $\breve{u}_t$ to partition the input data, and the projection of $\breve{u}_t$ along $M$ has length $\cos(\theta_t)$ by definition. Let $b_t^1$ be a unit vector lying in the subspace $M$ such that:

$$\breve{u}_t = \cos(\theta_t) b_t^1 + \sin(\theta_t) v_t$$

where $v_t$ lies in the orthogonal complement of $M$ and has norm $1$. We define a second vector $\breve{u}_t^{\perp}$ as follows:

$$\breve{u}_t^{\perp} = \sin(\theta_t) b_t^1 - \cos(\theta_t) v_t$$

We observe that $\langle \breve{u}_t, \breve{u}_t^{\perp} \rangle = 0$, $\|\breve{u}_t^{\perp}\| = 1$, and the projection of $\breve{u}_t^{\perp}$ on $M$ is $\sin(\theta_t) b_t^1$. We now extend the set $\{b_t^1\}$ to an orthonormal basis $B = \{b_t^1, \ldots, b_t^{k-1}\}$ of $M$. We also observe that $\{b_t^2, \ldots, b_t^{k-1}, \breve{u}_t, \breve{u}_t^{\perp}\}$ is an orthonormal basis of the subspace spanned by $M$ together with $v_t$, and can be extended to a basis of $\mathbb{R}^d$.

For $j = 1, \ldots, k$, we define $\tau_t^j$ as follows:

$$\tau_t^j = \langle \mu_j, \breve{u}_t \rangle = \cos(\theta_t) \langle \mu_j, b_t^1 \rangle$$

We (re)define the quantity $\xi_t$ as

$$\xi_t = \sum_j \frac{\rho_j \sigma_j e^{-(\tau_t^j)^2 / 2(\sigma_j)^2}}{\sqrt{2\pi}}$$

and, finally, for any $l = 1, \ldots, k-1$, we define:

$$m_t^l = \sum_j \rho_j \Phi\Big( -\frac{\tau_t^j}{\sigma_j}, \infty \Big) \langle \mu_j, b_t^l \rangle$$

6.4.3 Proof of Lemma 20

The main idea behind the proof of Lemma 20 is to estimate the norm and the projections of $u_{t+1}$; we do this in three steps. First, we estimate the projection of $u_{t+1}$ along $\breve{u}_t$; next, we estimate its projection on $\breve{u}_t^{\perp}$; and finally, we estimate its projections along $b_t^2, \ldots, b_t^{k-1}$. Combining these projections, and observing that the projection of $u_{t+1}$ on any direction perpendicular to these is $0$, we can prove the lemma.

As before, we define $Z_{t+1} = \Pr[x \in C_{t+1}]$. Now we make the following claim.

Lemma 27. For any $t$ and any $j$,

$$\Pr[x \sim D_j \mid x \in C_{t+1}] = \frac{\rho_j}{Z_{t+1}} \Phi\Big( -\frac{\tau_t^j}{\sigma_j}, \infty \Big)$$

Proof. Same as the proof of Lemma 21.

Next, we estimate the projection of $u_{t+1}$ along $\breve{u}_t$.

Lemma 28.

$$\langle u_{t+1}, \breve{u}_t \rangle = \frac{\xi_t + \cos(\theta_t) m_t^1}{Z_{t+1}}$$

Proof. Consider a sample $x$ drawn from distribution $D_j$. The projection of $x$ on $\breve{u}_t$ is distributed as a Gaussian with mean $\tau_t^j$ and standard deviation $\sigma_j$. The probability that $x$ lies in $C_{t+1}$ is $\Pr[N(\tau_t^j, \sigma_j) > 0] = \Phi(-\frac{\tau_t^j}{\sigma_j}, \infty)$. Given that $x$ lies in $C_{t+1}$, the projection of $x$ on $\breve{u}_t$ is distributed as a truncated Gaussian with mean $\tau_t^j$ and standard deviation $\sigma_j$, truncated at $0$. Therefore,

$$\mathbf{E}[\langle x, \breve{u}_t \rangle \mid x \in C_{t+1}, x \sim D_j] = \frac{1}{\Phi(-\frac{\tau_t^j}{\sigma_j}, \infty)} \left( \int_{0}^{\infty} y \, \frac{e^{-(y - \tau_t^j)^2 / 2(\sigma_j)^2}}{\sigma_j \sqrt{2\pi}} \, dy \right)$$

which is in turn equal to

$$\frac{1}{\Phi(-\frac{\tau_t^j}{\sigma_j}, \infty)} \left( \tau_t^j \int_{0}^{\infty} \frac{e^{-(y - \tau_t^j)^2 / 2(\sigma_j)^2}}{\sigma_j \sqrt{2\pi}} \, dy + \int_{0}^{\infty} (y - \tau_t^j) \frac{e^{-(y - \tau_t^j)^2 / 2(\sigma_j)^2}}{\sigma_j \sqrt{2\pi}} \, dy \right) = \frac{1}{\Phi(-\frac{\tau_t^j}{\sigma_j}, \infty)} \left( \tau_t^j \, \Phi\Big(-\frac{\tau_t^j}{\sigma_j}, \infty\Big) + \int_{0}^{\infty} (y - \tau_t^j) \frac{e^{-(y - \tau_t^j)^2 / 2(\sigma_j)^2}}{\sigma_j \sqrt{2\pi}} \, dy \right)$$

We can evaluate the integral in the equation above as follows.
$$\int_{0}^{\infty} (y - \tau_t^j) e^{-(y - \tau_t^j)^2 / 2(\sigma_j)^2} \, dy = (\sigma_j)^2 \int_{z = (\tau_t^j)^2 / 2(\sigma_j)^2}^{\infty} e^{-z} \, dz = (\sigma_j)^2 e^{-(\tau_t^j)^2 / 2(\sigma_j)^2}$$

Therefore, we can conclude that

$$\mathbf{E}[\langle x, \breve{u}_t \rangle \mid x \in C_{t+1}, x \sim D_j] = \tau_t^j + \frac{1}{\Phi(-\frac{\tau_t^j}{\sigma_j}, \infty)} \cdot \frac{\sigma_j e^{-(\tau_t^j)^2 / 2(\sigma_j)^2}}{\sqrt{2\pi}}$$

Now we can write

$$\langle u_{t+1}, \breve{u}_t \rangle = \sum_j \mathbf{E}[\langle x, \breve{u}_t \rangle \mid x \sim D_j, x \in C_{t+1}] \Pr[x \sim D_j \mid x \in C_{t+1}] = \frac{1}{Z_{t+1}} \sum_j \rho_j \Phi\Big(-\frac{\tau_t^j}{\sigma_j}, \infty\Big) \mathbf{E}[\langle x, \breve{u}_t \rangle \mid x \sim D_j, x \in C_{t+1}]$$

where we used Lemma 27. The lemma follows by recalling that $\tau_t^j = \cos(\theta_t) \langle \mu_j, b_t^1 \rangle$.

Lemma 29. For any $t$,

$$\langle u_{t+1}, \breve{u}_t^{\perp} \rangle = \frac{\sin(\theta_t) m_t^1}{Z_{t+1}}$$

Proof. Let $x$ be a sample drawn from distribution $D_j$. Since $\breve{u}_t^{\perp}$ is perpendicular to $\breve{u}_t$, and $D_j$ is a spherical Gaussian, given that $x \in C_{t+1}$ (that is, the projection of $x$ on $\breve{u}_t$ is greater than $0$), the projection of $x$ on $\breve{u}_t^{\perp}$ is still distributed as a Gaussian with mean $\langle \mu_j, \breve{u}_t^{\perp} \rangle$ and standard deviation $\sigma_j$. That is,

$$\mathbf{E}[\langle x, \breve{u}_t^{\perp} \rangle \mid x \sim D_j, x \in C_{t+1}] = \langle \mu_j, \breve{u}_t^{\perp} \rangle$$

Also recall that, by the definition of $\breve{u}_t^{\perp}$, $\langle \mu_j, \breve{u}_t^{\perp} \rangle = \sin(\theta_t) \langle \mu_j, b_t^1 \rangle$. To prove the lemma, we observe that $\langle u_{t+1}, \breve{u}_t^{\perp} \rangle$ is equal to

$$\sum_j \mathbf{E}[\langle x, \breve{u}_t^{\perp} \rangle \mid x \sim D_j, x \in C_{t+1}] \Pr[x \sim D_j \mid x \in C_{t+1}]$$

The lemma follows by using Lemma 27.

Lemma 30. For $l \geq 2$,

$$\langle u_{t+1}, b_t^l \rangle = \frac{m_t^l}{Z_{t+1}}$$

Proof. Let $x$ be a sample drawn from distribution $D_j$. Since $b_t^l$ is perpendicular to $\breve{u}_t$, and $D_j$ is a spherical Gaussian, given that $x \in C_{t+1}$ (that is, the projection of $x$ on $\breve{u}_t$ is greater than $0$), the projection of $x$ on $b_t^l$ is still distributed as a Gaussian with mean $\langle \mu_j, b_t^l \rangle$ and standard deviation $\sigma_j$. That is,

$$\mathbf{E}[\langle x, b_t^l \rangle \mid x \sim D_j, x \in C_{t+1}] = \langle \mu_j, b_t^l \rangle$$

To prove the lemma, we observe that $\langle b_t^l, u_{t+1} \rangle$ is equal to

$$\sum_j \mathbf{E}[\langle x, b_t^l \rangle \mid x \sim D_j, x \in C_{t+1}] \Pr[x \sim D_j \mid x \in C_{t+1}]$$

The lemma follows by using Lemma 27.

Finally, we show a lemma which estimates the norm of the vector $u_{t+1}$.

Lemma 31.

$$\|u_{t+1}\|^2 = \frac{1}{Z_{t+1}^2} \left( \xi_t^2 + 2 \xi_t \cos(\theta_t) m_t^1 + \sum_{l=1}^{k-1} (m_t^l)^2 \right)$$

Proof. Combining Lemmas 28, 29, and 30, we can write:

$$\|u_{t+1}\|^2 = \langle \breve{u}_t, u_{t+1} \rangle^2 + \langle \breve{u}_t^{\perp}, u_{t+1} \rangle^2 + \sum_{l \geq 2} \langle b_t^l, u_{t+1} \rangle^2 = \frac{1}{Z_{t+1}^2} \left( \xi_t^2 + 2 \xi_t \cos(\theta_t) m_t^1 + \cos^2(\theta_t)(m_t^1)^2 + \sin^2(\theta_t)(m_t^1)^2 + \sum_{l=2}^{k-1} (m_t^l)^2 \right)$$

The lemma follows by plugging in the fact that $\cos^2(\theta_t) + \sin^2(\theta_t) = 1$.

Now we are ready to prove Lemma 20.

Proof. (Of Lemma 20) Since $b_t^1, \ldots, b_t^{k-1}$ form a basis of $M$, we can write:

$$\cos^2(\theta_{t+1}) = \frac{\sum_{l=1}^{k-1} \langle u_{t+1}, b_t^l \rangle^2}{\|u_{t+1}\|^2} \quad (2)$$

Here $\|u_{t+1}\|^2$ is estimated in Lemma 31, and $\langle u_{t+1}, b_t^l \rangle$, for $l \geq 2$, is estimated in Lemma 30. Using Lemmas 28 and 29, and since $b_t^1$ lies in the subspace spanned by the orthogonal vectors $\breve{u}_t$ and $\breve{u}_t^{\perp}$, we can write:

$$\langle u_{t+1}, b_t^1 \rangle = \langle \breve{u}_t, u_{t+1} \rangle \langle \breve{u}_t, b_t^1 \rangle + \langle \breve{u}_t^{\perp}, u_{t+1} \rangle \langle \breve{u}_t^{\perp}, b_t^1 \rangle = \frac{\cos(\theta_t) \xi_t + m_t^1}{Z_{t+1}}$$

Plugging this into Equation 2, we get:

$$\cos^2(\theta_{t+1}) = \frac{\xi_t^2 \cos^2(\theta_t) + 2 \xi_t \cos(\theta_t) m_t^1 + \sum_l (m_t^l)^2}{\xi_t^2 + 2 \xi_t \cos(\theta_t) m_t^1 + \sum_l (m_t^l)^2}$$

The lemma follows by rearranging the above equation, as in the proof of Lemma 1.
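To see the dynamics analyzed in this section numerically, the following simulation sketch (not the paper's experiment; all parameter values are arbitrary, and empirical cluster means replace the population expectations) draws samples from a mixture of $k$ spherical Gaussians, repeatedly splits the data by the sign of $\langle x, \breve{u}_t \rangle$ and takes the mean of the positive side as the next direction, and prints $\cos(\theta_t)$, the length of the projection of $\breve{u}_t$ onto the mean subspace $M$. Consistent with Lemma 20, this cosine is expected to approach $1$.

```python
import numpy as np

# A simulation sketch of the update analyzed in Section 6.4 (illustrative
# parameters, empirical means instead of population expectations).
rng = np.random.default_rng(0)
d, k, n, T = 30, 3, 100_000, 15
rhos = np.array([0.5, 0.3, 0.2])              # mixing weights rho_j
sigmas = np.array([1.0, 1.5, 0.8])            # standard deviations sigma_j
mus = 3.0 * rng.normal(size=(k, d))           # component means mu_j
mus -= rhos @ mus                             # center the mixture: sum_j rho_j mu_j = 0

# Orthonormal basis of the mean subspace M (used only to measure theta_t).
U, S, _ = np.linalg.svd(mus.T, full_matrices=False)
basis_M = U[:, S > 1e-8]

# Draw n samples from the mixture.
labels = rng.choice(k, size=n, p=rhos)
X = mus[labels] + sigmas[labels][:, None] * rng.normal(size=(n, d))

u = rng.normal(size=d)
u /= np.linalg.norm(u)                        # breve{u}_0: a random unit direction
for t in range(1, T + 1):
    in_C = X @ u > 0                          # C_{t+1}: points with <x, breve{u}_t> > 0
    u_next = X[in_C].mean(axis=0)             # empirical u_{t+1}: mean of that cluster
    u = u_next / np.linalg.norm(u_next)       # breve{u}_{t+1}
    cos_theta = np.linalg.norm(basis_M.T @ u) # cos(theta_t): projection length onto M
    print(f"t={t:2d}  cos(theta_t) = {cos_theta:.4f}")
```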
