Near-optimal-sample estimators for spherical Gaussian mixtures
Authors: Jayadev Acharya, Ashkan Jafarpour, Alon Orlitsky
Near-optimal-sample estimators for spherical Gaussian mixtures

Jayadev Acharya*, Ashkan Jafarpour†, Alon Orlitsky‡, and Ananda Theertha Suresh§
University of California, San Diego

February 20, 2014

Abstract

Statistical and machine-learning algorithms are frequently applied to high-dimensional data. In many of these applications data is scarce, and often much more costly than computation time. We provide the first sample-efficient polynomial-time estimator for high-dimensional spherical Gaussian mixtures. For mixtures of any k d-dimensional spherical Gaussians, we derive an intuitive spectral estimator that uses O_k(d log^2(d)/ε^4) samples and runs in time O_{k,ε}(d^3 log^5 d), both significantly lower than previously known. The constant factor O_k is polynomial for the sample complexity and exponential for the time complexity, again much smaller than what was previously known. We also show that Ω(dk/ε^2) samples are needed for any algorithm; hence the sample complexity is near-optimal in the number of dimensions. We also derive a simple estimator for k-component one-dimensional mixtures that uses O(k log(k/ε)/ε^2) samples and runs in time O_k(ε^{-(3k+1)}). Our other technical contributions include a faster algorithm for choosing, from a set of distributions, a density estimate that minimizes the ℓ1 distance to an unknown underlying distribution.

* jacharya@ucsd.edu  † ashkan@ucsd.edu  ‡ alon@ucsd.edu  § asuresh@ucsd.edu

1 Introduction

1.1 Background

Meaningful information often resides in high-dimensional spaces: voice signals are expressed in many frequency bands, credit ratings are influenced by multiple parameters, and document topics are manifested in the prevalence of numerous words. Some applications, such as topic modeling and genomic analysis, consider data in over 1000 dimensions [17, 44].
Typically, information can be generated by different types of sources: voice is spoken by men or women, credit parameters correspond to wealthy or poor individuals, and documents address topics such as sports or politics. In such cases the overall data follow a mixture distribution [26, 36, 38]. Mixtures of high-dimensional distributions are therefore central to the understanding and processing of many natural phenomena. Methods for recovering the mixture components from the data have consequently been extensively studied by statisticians, engineers, and computer scientists. Initially, heuristic methods such as expectation-maximization (EM) were developed [27, 35]. Over the past decade, more rigorous algorithms were derived to recover mixtures of d-dimensional spherical Gaussians [5, 7, 11, 12, 21, 42], general Gaussians [2, 4, 6, 10, 22, 29], and other log-concave distributions [23]. Many of these algorithms consider mixtures where the ℓ1 distance between the mixture components is 2 − o_d(1), namely approaches the maximum of 2 as d increases. They identify the distribution components in time and samples that grow polynomially in the dimension d. Recently, [22, 29] showed that any d-dimensional Gaussian mixture can be recovered in polynomial time; however, their algorithms use > d^100 time and samples. A different approach, which avoids both the large component-distance requirement and the high time and sample complexity, considers a slightly more relaxed notion of approximation, sometimes called PAC learning. PAC learning [24] does not approximate each mixture component, but instead derives a mixture distribution that is close to the original one.
Specifically, given a distance bound ε > 0, an error probability δ > 0, and samples from the underlying mixture f (we use boldface letters for d-dimensional objects), PAC learning seeks a mixture estimate f̂ with at most k components such that D(f, f̂) ≤ ε with probability ≥ 1 − δ, where D(·,·) is some given distance measure, for example ℓ1 distance or KL divergence. This notion of estimation is also known as proper learning in the literature. An important and extensively studied special case of mixture distributions are spherical Gaussians [5, 7, 11, 12, 21, 42], where different coordinates have the same variance, though potentially different means. Due to their simple structure, they are easier to analyze, and under a minimum-separation assumption they admit provably practical algorithms for clustering and parameter estimation [7, 11, 12, 42].

1.2 Sample complexity

Reducing the number of samples is of great practical significance. For example, in topic modeling every sample is a whole document, in credit analysis every sample is a person's credit history, and in genetics every sample is a human DNA sequence. Hence samples can be very scarce and obtaining them can be very costly. By contrast, current CPUs run at several gigahertz; hence samples are typically a much scarcer resource than time. For one-dimensional statistical problems, the need for sample-efficient algorithms has been broadly recognized, and the sample complexity of many problems is known quite accurately, often to within a constant factor. For example, for discrete distributions over {1, ..., s}, an approach proposed in [32] and its modifications were used in [40, 41] to estimate the probability multiset using Θ(s/log s) samples. Learning one-dimensional m-modal distributions over {1, ..., s} requires Θ(m log(s/m)/ε^3) samples [14].
Similarly, one-dimensional mixtures of k structured distributions (log-concave, monotone hazard rate, and unimodal) over {1, ..., s} can be learned with O(k/ε^4), O(k log(s/ε)/ε^4), and O(k log(s)/ε^4) samples, respectively, and these bounds are tight up to a factor of ε [31]. Compared to one-dimensional problems, in high dimensions there is a polynomial gap in the sample complexity. For example, for learning spherical Gaussian mixtures, the number of samples required by previous algorithms is O(d^12) for k = 2 components, and increases exponentially with k [19]. In this paper we bridge this gap by constructing estimators with near-linear sample complexity.

1.3 Previous and new results

Our main contribution is PAC learning d-dimensional Gaussian mixtures with a near-linear number of samples. We also show a few auxiliary results for one-dimensional Gaussian mixtures.

1.3.1 d-dimensional Gaussian mixtures

Several papers considered PAC learning of discrete- and Gaussian-product mixtures. [20] considered mixtures of two d-dimensional Bernoulli products where all probabilities are bounded away from 0. They showed that this class is PAC learnable in O(d^2/ε^4) time and samples, where the O notation hides logarithmic factors. [18] eliminated the probability constraints and generalized the results from binary to arbitrary discrete alphabets, and from 2 to k mixture components. They showed that mixtures of k discrete product distributions are PAC learnable in O((d/ε)^{2k^2(k+1)}) time, and although they did not explicitly mention sample complexity, their algorithm uses O((d/ε)^{4(k+1)}) samples.
[19] generalized these results to Gaussian products, showing in particular that mixtures of k Gaussians, where the difference between the means normalized by the ratio of standard deviations is bounded by B, are PAC learnable in O((dB/ε)^{2k^2(k+1)}) time, and can be shown to use O((dB/ε)^{4(k+1)}) samples. These algorithms consider the KL divergence between the distribution and its estimate, but it can be shown that the ℓ1 distance would result in similar complexities. It can also be shown that these algorithms, or simple modifications of them, have similar time and sample complexities for spherical Gaussians as well. Our main contribution shows that mixtures of spherical Gaussians are PAC learnable in ℓ1 distance with sample complexity that is nearly linear in the dimension. Specifically, Theorem 8 shows that mixtures of k spherical-Gaussian distributions can be learned with

n = O((dk^9/ε^4) log^2(d/δ)) = O_{k,ε}(d log^2 d)

samples and in time

O(n^2 d log n + d^2 (k^7 ε^{-3} log(d/δ))^{k^2}) = O_{k,ε}(d^3 log^5 d).

Observe that recent algorithms typically construct the covariance matrix [19, 42], hence require ≥ nd^2 time. In that sense, for small values of k, the time complexity we derive is comparable to the best such algorithms can hope for. Observe also that the exponential dependence on k is of the form d^2 (k^7 ε^{-3} log(d/δ))^{k^2}, which is significantly lower than the d^{O(k^3)} dependence in previous results. By contrast, Theorem 2 shows that PAC learning k-component spherical Gaussian mixtures requires Ω(dk/ε^2) samples for any algorithm; hence our distribution-learning algorithms are nearly sample-optimal. In addition, their time complexity significantly improves on previously known ones.
1.3.2 One-dimensional Gaussian mixtures

Independently and around the same time as this work, [15] showed that mixtures of two one-dimensional Gaussians can be learned with O(ε^{-2}) samples and in time O(ε^{-7.01}). We provide a natural estimator for learning mixtures of k one-dimensional Gaussians using some basic properties of Gaussian distributions, and show that a mixture of any k one-dimensional Gaussians can be learned with O(k ε^{-2}) samples and in time O_k(ε^{-(3k+1)}).

1.4 The approach and technical contributions

The popular SCHEFFE estimator takes a collection F of distributions and uses O(log |F|) independent samples from an underlying distribution f to find a distribution in F whose distance from f is at most a constant factor larger than that of the distribution in F that is closest to f [16]. In Lemma 1, we lower the time complexity of the Scheffe algorithm from O(|F|^2) to O(|F|), helping us reduce the time complexity of our algorithms. Our goal is therefore to construct a small class of distributions such that one of them is ε-close to any possible underlying distribution. For simplicity, consider spherical Gaussians with the same variance and means bounded by B. Take the collection of all distributions derived by quantizing the means of all components in all coordinates to ε_m accuracy, and quantizing the weights to ε_w accuracy. It can be shown that to get distance ε from the underlying distribution, it suffices to take ε_m, ε_w ≤ 1/poly_ε(dk). There are at most (B/ε_m)^{dk} · (1/ε_w)^k = 2^{O_ε(dk)} possible combinations of the k mean vectors and weights. Hence SCHEFFE implies an exponential-time algorithm with sample complexity O(dk). To reduce the dependence on d, one can approximate the span of the k mean vectors.
This reduces the problem from d to k dimensions, allowing us to consider a distribution collection of size 2^{O(k^2)}, with SCHEFFE sample complexity of just O(k^2). [18, 19] construct the sample correlation matrix and use k of its columns to approximate the span of the mean vectors. This approach requires the k columns of the sample correlation matrix to be very close to those of the actual correlation matrix, and thus requires many more samples. We derive a spectral algorithm that uses the top k eigenvectors of the sample covariance matrix to approximate the span of the k mean vectors. Since we use the entire covariance matrix instead of just k columns, a weaker concentration suffices and we gain on the sample complexity. Using recent tools from non-asymptotic random matrix theory [3, 39, 43], we show that the approximation of the span of the means converges in O(d) samples. This result allows us to address most "reasonable" distributions, but some "corner cases" remain that need to be analyzed separately. To address them, we modify some known clustering algorithms, such as single-linkage and spectral projections. While the basic algorithms were known before, our contribution here, which takes a fair bit of effort and space, is to show that judicious modifications of these algorithms and rigorous statistical analysis yield polynomial-time algorithms with near-optimal sample complexity. Our approach applies most directly to mixtures of spherical Gaussians. We provide a simple and practical recursive clustering and spectral algorithm that estimates all such distributions in O_k(d log^2 d) samples. The paper is organized as follows. In Section 2, we introduce notation, describe results on the Scheffe estimator, and state a lower bound. In Section 3, we present the algorithm for k-spherical Gaussians.
In Section 4, we show a simple learning algorithm for one-dimensional Gaussian mixtures. To preserve readability, most of the technical details and proofs are given in the appendix.

2 Preliminaries

2.1 Notation

For arbitrary product distributions p_1, ..., p_k over a d-dimensional space, let p_{j,i} be the distribution of p_j over coordinate i, and let µ_{j,i} and σ_{j,i} be the mean and variance of p_{j,i}, respectively. Let f = (w_1, ..., w_k, p_1, ..., p_k) be the mixture of these distributions with mixing weights w_1, ..., w_k. We denote an estimate of a quantity x by x̂; it can be an empirical mean or a more complex estimate. ‖·‖ denotes the spectral norm of a matrix and ‖·‖_2 denotes the ℓ2 norm of a vector.

2.2 Selection from a pool of distributions

Many algorithms for learning mixtures over a domain X first obtain a small collection F of mixture distributions and then perform a maximum-likelihood test using the samples to output a distribution [14, 18, 20]. Our algorithm also obtains a set of distributions containing at least one that is close to the underlying one in ℓ1 distance. The estimation problem then reduces to the following: given a class F of distributions and samples from an unknown distribution f, find a distribution in F that is close to f. Let D(f, F) := min_{f_i ∈ F} D(f, f_i). The well-known Scheffe method [16] uses O(ε^{-2} log |F|) samples from the underlying distribution f, and in time O(ε^{-2} |F|^2 T log |F|) outputs a distribution in F with ℓ1 distance at most 9.1 max(D(f, F), ε) from f, where T is the time required to compute the probability of an x ∈ X under a distribution in F. A naive application of this algorithm requires time quadratic in the number of distributions in F. We propose a variant that works in near-linear time, albeit requiring slightly more samples. More precisely,

Lemma 1 (Appendix B).
Let ε > 0. For some constant c, given (c/ε^2) log(|F|/δ) independent samples from a distribution f, with probability ≥ 1 − δ, the output f̂ of MODIFIED SCHEFFE satisfies D(f̂, f) ≤ 1000 max(ε, D(f, F)). Furthermore, the algorithm runs in time O(|F| T log(|F|/δ) / ε^2).

We therefore find a small class F with at least one distribution close to the underlying mixture. For our problem of estimating k-component mixtures in d dimensions, T = O(dk) and |F| = O_{k,ε}(d^2). Note that we have not optimized the constant 1000 in the above lemma.

2.3 Lower bound

Using Fano's inequality, we show an information-theoretic lower bound of Ω(dk/ε^2) samples for learning k-component d-dimensional mixtures of spherical Gaussians by any algorithm. More precisely,

Theorem 2 (Appendix C). Any algorithm that learns all k-component d-dimensional spherical Gaussian mixtures up to ℓ1 distance ε with probability ≥ 1/2 requires at least Ω(dk/ε^2) samples.

3 Mixtures in d dimensions

3.1 Description of LEARN k-SPHERE

Algorithm LEARN K-SPHERE learns mixtures of k spherical Gaussians using a near-linear number of samples. For clarity, we assume that all components have the same variance σ^2, i.e., p_i = N(µ_i, σ^2 I_d) for 1 ≤ i ≤ k. A modification of this algorithm works for components with different variances; the core ideas are the same, and we include it in the final version of the paper.

The easy part of the algorithm is estimating σ^2. If X^(1) and X^(2) are two samples from the same component, then X^(1) − X^(2) is distributed N(0, 2σ^2 I_d). Hence for large d, ‖X^(1) − X^(2)‖_2^2 concentrates around 2dσ^2. By the pigeonhole principle, given k + 1 samples, two of them are from the same component. Therefore, the minimum pairwise squared distance among k + 1 samples is close to 2dσ^2. This constitutes the first step of our algorithm.
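The variance-estimation step can be sketched in a few lines of Python/NumPy. This is a minimal illustration, not the paper's code; the helper name `estimate_variance` and the use of NumPy are our own assumptions.

```python
import numpy as np

def estimate_variance(samples, k):
    """Sketch of step 1: estimate sigma^2 of a spherical Gaussian mixture.

    Among k+1 samples two must come from the same component (pigeonhole),
    and for that pair ||X1 - X2||^2 concentrates around 2*d*sigma^2 in
    high dimension, so min pairwise squared distance / (2d) estimates sigma^2.
    """
    x = np.asarray(samples[: k + 1], dtype=float)  # first k+1 samples, each d-dim
    d = x.shape[1]
    diffs = x[:, None, :] - x[None, :, :]          # all pairwise differences
    sq = (diffs ** 2).sum(axis=2)                  # pairwise squared distances
    np.fill_diagonal(sq, np.inf)                   # ignore self-distances
    return sq.min() / (2 * d)
```

Pairs drawn from different components have squared distance about 2dσ^2 + ‖µ_i − µ_j‖^2, so in high dimension the minimum is attained by a same-component pair and the estimate concentrates around σ^2.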
We now concentrate on estimating the means. As stated in the introduction, given the span of the mean vectors µ_i, we can grid the k-dimensional span to the required accuracy ε_g and use SCHEFFE to obtain a polynomial-time algorithm. One natural and well-used method for estimating the span of the mean vectors uses the correlation matrix [42]. Consider the correlation-type matrix

S = (1/n) Σ_{i=1}^n X^(i) X^(i)t − σ^2 I_d.

In expectation, the fraction of terms from p_i is w_i. Furthermore, for a sample X from a particular component j, E[X X^t] = σ^2 I_d + µ_j µ_j^t. It follows that E[S] = Σ_{j=1}^k w_j µ_j µ_j^t. Therefore, as n → ∞, the matrix S converges to Σ_{j=1}^k w_j µ_j µ_j^t, and its top k eigenvectors span the space of the means. While the above intuition is well understood, the number of samples necessary for convergence is not well studied. Ideally, irrespective of the values of the means, we would like O(d) samples to suffice for convergence. However, this is not true, as we demonstrate by a simple example.

Example 3. Consider the special case d = 1, k = 2, σ^2 = 1, w_1 = w_2 = 1/2, and difference of means µ_1 − µ_2 = L for a large L ≫ 1. Given this prior information, one can estimate the average of the mixture, which yields (µ_1 + µ_2)/2. Solving the equations obtained from µ_1 + µ_2 and µ_1 − µ_2 = L yields µ_1 and µ_2. The variance of the mixture is 1 + L^2/4 > L^2/4. With additional Chernoff-type bounds, one can show that given n samples, the error in estimating the average is |µ_1 + µ_2 − µ̂_1 − µ̂_2| ≈ Θ(L/√n). Therefore, to estimate the means to a small accuracy we need n ≥ L^2; i.e., the larger the separation, the more samples are necessary. A similar phenomenon occurs in the convergence of the correlation matrices, where the variances of the quantities of interest increase with the separation.
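The span-estimation idea above can be sketched as follows. This is a minimal Python/NumPy illustration under a known-σ^2 assumption; `span_of_means` is a hypothetical helper, and it makes no attempt at the paper's clustering steps or sample-complexity guarantees.

```python
import numpy as np

def span_of_means(samples, k, sigma2):
    """Approximate the span of the mixture means via the top-k eigenvectors
    of S = (1/n) * sum_i X_i X_i^T - sigma2 * I, whose expectation is
    sum_j w_j mu_j mu_j^T."""
    x = np.asarray(samples, dtype=float)
    n, d = x.shape
    s = x.T @ x / n - sigma2 * np.eye(d)       # correlation-type matrix S
    eigvals, eigvecs = np.linalg.eigh(s)       # S is symmetric
    order = np.argsort(eigvals)[::-1][:k]      # indices of the k largest eigenvalues
    return eigvecs[:, order]                   # d x k orthonormal basis
```

Projecting candidate means onto this basis reduces the subsequent grid search from d dimensions to k dimensions.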
In other words, for the span to be accurate, the number of samples necessary increases with the separation. To overcome this phenomenon, a natural idea is to cluster the Gaussians so that the means of components within the same cluster are close, and then apply SCHEFFE to the span within each cluster. Even though spectral clustering algorithms are studied in [2, 42], they assume that the weights are strictly bounded away from 0, which does not hold here. We use a simple recursive clustering algorithm that takes a cluster C with average µ(C). If there is a component in the cluster such that √w_i ‖µ_i − µ(C)‖_2 is Ω(σ√(log(n/δ))), then the algorithm divides the cluster into two nonempty clusters without any mis-clustering. For technical reasons similar to the above example, we also use a coarse clustering algorithm that ensures that the mean separation is O(d^{1/4}) within each cluster. The algorithm can be summarized as:

1. Variance estimation: Use the first k + 1 samples and estimate the minimum distance among sample pairs to estimate σ^2.

2. Coarse clustering: Using a single-linkage algorithm, group the samples such that within each cluster formed, the mean separation is smaller than O(d^{1/4}).

3. Recursive clustering: As long as there is a cluster that has samples from more than one component with means far apart (described by a condition on the norm of its covariance matrix in the algorithm), estimate its largest eigenvector, project the samples of this cluster onto this eigenvector, and cluster them. This hierarchical method is continued until all remaining clusters contain only close-by components.

4. Search in the span: The resulting clusters contain components that are close by, i.e., ‖µ_i − µ_j‖_2 < O(k^{3/2} σ̂ √(log(n/δ))). We approximate the span of the means by the top k − 1 eigenvectors and the mean vector, and perform an exhaustive search using MODIFIED SCHEFFE.
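Steps 2 and 3 of the summary rely on single-linkage clustering with a distance threshold. Below is a minimal sketch in plain Python/NumPy, written as a Kruskal-style union-find, which is equivalent to repeatedly merging the two closest clusters until every inter-cluster distance exceeds the threshold; the helper name and interface are our own, and the paper's specific thresholds are not reproduced here.

```python
import numpy as np

def single_linkage(points, threshold):
    """Single-linkage sketch: merge clusters while the minimum distance
    between their members is <= threshold."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    parent = list(range(n))               # union-find forest over samples

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    sq = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2)
    pairs = sorted((sq[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    for d2, i, j in pairs:
        if d2 > threshold ** 2:
            break                          # all remaining pairs are even farther
        parent[find(i)] = find(j)          # merge the two clusters
    roots = [find(i) for i in range(n)]
    remap = {r: c for c, r in enumerate(dict.fromkeys(roots))}
    return [remap[r] for r in roots]       # cluster label per sample
```

For example, `single_linkage([[0.0], [0.1], [0.2], [5.0], [5.1]], 1.0)` groups the first three points into one cluster and the last two into another.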
We now describe these steps, stating the performance of each.

Algorithm LEARN K-SPHERE

Input: n samples x^(1), x^(2), ..., x^(n) from f, and ε.

1. Sample variance: σ̂^2 = min_{a≠b: a,b ∈ [k+1]} ‖x^(a) − x^(b)‖_2^2 / (2d).

2. Coarse single-linkage clustering: Start with each sample as a cluster.
   • While there exist two clusters with squared distance ≤ 2dσ̂^2 + 23σ̂^2 √(d log(n^2/δ)), merge them.

3. Recursive spectral clustering: While there is a new cluster C with |C| ≥ nε/(5k) and the spectral norm of its sample covariance matrix ≥ 12k^2 σ̂^2 log(n^3/δ):
   • Use nε/(8k^2) of the samples to find the largest eigenvector and discard these samples.
   • Project the remaining samples onto the largest eigenvector.
   • Perform single-linkage in the projected space (as before) until the distance between clusters is > 3σ̂ √(log(n^2 k/δ)), creating new clusters.

4. Exhaustive search: Let ε_g = ε/(16k^{3/2}), L = 200k^4 ε^{-1} √(log(n^2/δ)), and G = {−L, ..., −ε_g, 0, ε_g, 2ε_g, ..., L}. Let W = {0, ε/(4k), 2ε/(4k), ..., 1} and Σ := {σ^2 : σ^2 = σ̂^2 (1 + i/d), −d < i ≤ d}.
   • For each cluster C, find its top k − 1 eigenvectors u_1, u_2, ..., u_{k−1} and let Span(C) = {µ̂(C) + Σ_{i=1}^{k−1} g_i σ̂ u_i : g_1, g_2, ..., g_{k−1} ∈ G}.
   • Let Span = ∪ {Span(C) : |C| ≥ nε/(5k)}.
   • For all w′_i ∈ W, σ′^2 ∈ Σ, and µ̂_i ∈ Span, add (w′_1, ..., w′_{k−1}, 1 − Σ_{i=1}^{k−1} w′_i, N(µ̂_1, σ′^2), ..., N(µ̂_k, σ′^2)) to F.

5. Run MODIFIED SCHEFFE on F and output the resulting distribution.

3.2 Sketch of correctness

To simplify the bounds and expressions, we assume that d > 1000 and δ ≥ min(2n^2 e^{−d/10}, 1/3). For smaller values of δ, we run the algorithm with error 1/3 and repeat it O(log(1/δ)) times to choose a set of candidate mixtures F_δ. By a Chernoff bound, with error ≤ δ, F_δ contains a mixture ε-close to f.
Finally, we run MODIFIED SCHEFFE on F_δ to obtain a mixture that is close to f. By the union bound and Lemma 1, the error is ≤ 2δ.

Variance estimation: Let σ̂^2 be the variance estimate from step 1. In high dimensions, the difference between two random samples from a Gaussian concentrates. This is made precise in the next lemma, which states that σ̂^2 is a good estimate of the variance and follows from a simple application of Gaussian tail bounds.

Lemma 4 (Appendix D.1). Given n samples from the k-component mixture, with probability ≥ 1 − 2δ,

|σ̂^2 − σ^2| ≤ 2.5σ^2 √(log(n^2/δ)/d).

Coarse single-linkage clustering: The second step is a single-linkage routine that separates mixture components with far means. Single-linkage is a simple clustering scheme that starts out with each data point as a cluster, and at each step merges the two closest clusters to form larger clusters. The algorithm stops when the distance between clusters is larger than a pre-specified threshold. Suppose the samples are generated by a one-dimensional mixture of k components that are far apart. Then, with high probability, the algorithm generates k clusters such that all the samples within a cluster are generated by a single component. More precisely, if ∀i, j ∈ [k], |µ_i − µ_j| = Ω(σ√(log n)), then all the n samples concentrate around their respective means, and the separation between any two samples from different components is larger than the largest separation between any two samples from the same component. Hence, for a suitable threshold, single-linkage correctly identifies the clusters. For d-dimensional Gaussian mixtures a similar result holds, with minimum separation Ω(σ (d log(n/δ))^{1/4}). More precisely,

Lemma 5 (Appendix D.2).
After Step 2 of LEARN K-SPHERE, with probability ≥ 1 − 2δ, all samples from each component will be in the same cluster, and the maximum distance between two components within each cluster is ≤ 10kσ (d log(n^2/δ))^{1/4}.

Recursive spectral clustering: The clusters formed at this step consist of components with mean separation O(σ (d log(n/δ))^{1/4}). We now recursively zoom into the clusters formed and show that it is possible to cluster components with much smaller mean separation. We first find the largest eigenvector of

S(C) := (1/|C|) Σ_{x ∈ C} (x − µ̂(C))(x − µ̂(C))^t − σ̂^2 I_d,

which is the sample covariance matrix with its diagonal reduced by σ̂^2. Note that since the matrix is symmetric, the largest eigenvalue magnitude equals the spectral norm. If there are two components with means far apart, then using single-linkage we divide the cluster into two. The following lemma shows that this step performs accurate clustering of components with well-separated means.

Lemma 6 (Appendix D.3). Let n ≥ c · (dk^4/ε) log(n^3/δ). After recursive clustering, with probability ≥ 1 − 4δ, the samples are divided into clusters such that for each component i within any cluster C,

√w_i ‖µ_i − µ(C)‖_2 ≤ 25σ √(k^3 log(n^3/δ)).

Furthermore, all the samples from one component remain in a single cluster.

Exhaustive search and Scheffe: After step 3, all clusters have a small weighted radius √w_i ‖µ_i − µ(C)‖_2 ≤ 25σ √(k^3 log(n^3/δ)), and the eigenvectors give an accurate estimate of the span of µ_i − µ(C) within each cluster. More precisely,

Lemma 7 (Appendix D.4). Let n ≥ c · (dk^9/ε^4) log^2(d/δ) for some constant c.
After step 3, with probability ≥ 1 − 7δ the following holds: if |C| ≥ nε/(5k), then the projection of (µ_i − µ(C)) / ‖µ_i − µ(C)‖_2 on the space orthogonal to the span of the top k − 1 eigenvectors has magnitude ≤ εσ / (8√(2k) √w_i ‖µ_i − µ(C)‖_2).

We now have accurate estimates of the spans of the clusters, and each cluster has components with close means. It is now possible to grid the set of possibilities in each cluster to obtain a set of distributions such that one of them is close to the underlying one. There is a trade-off between a dense grid, which yields a good estimate, and the computation time required. The final step takes the sparsest grid possible that still ensures an error ≤ ε. This is quantified below.

Theorem 8 (Appendix D.5). Let n ≥ c · (dk^9/ε^4) log^2(d/δ) for some constant c. Then Algorithm LEARN K-SPHERE, with probability ≥ 1 − 9δ, outputs a distribution f̂ such that D(f̂, f) ≤ 1000ε. Furthermore, the algorithm runs in time O(n^2 d log n + d^2 (k^7 ε^{-3} log(d/δ))^{k^2}).

Note that the running time is calculated based on an efficient implementation of single-linkage [37], and the exponential term is not optimized. We now study mixtures in one dimension and provide an estimator using MODIFIED SCHEFFE.

4 Mixtures in one dimension

Over the past decade, estimating one-dimensional distributions has gained significant attention [1, 13–15, 30, 31, 33, 41]. We now provide a simple estimator for learning one-dimensional mixtures using the MODIFIED SCHEFFE estimator proposed earlier. The d-dimensional estimator uses spectral projections to find the span of the means, whereas in the one-dimensional case we use a simple observation on properties of samples from Gaussians. Formally, given samples from f, a mixture of Gaussian distributions p_i := N(µ_i, σ_i^2) with weights w_1, w_2, ...,
w_k, our goal is to find a mixture f̂ = (ŵ_1, ŵ_2, ..., ŵ_k, p̂_1, p̂_2, ..., p̂_k) such that D(f, f̂) ≤ ε. Note that we make no assumptions on the weights, means, or variances of the components. We provide an algorithm that, using O(k ε^{-2}) samples and in time O_k(ε^{-(3k+1)}), outputs an estimate that is at most ε from the underlying mixture in ℓ1 distance with probability ≥ 1 − δ. Our algorithm is an immediate consequence of the following observation about samples from a Gaussian distribution.

Lemma 9. Given n independent samples x_1, ..., x_n from N(µ, σ^2), with probability ≥ 1 − δ there are two samples x_j, x_k such that

|x_j − µ| ≤ σ · 7 log(2/δ)/(2n)   and   ||x_j − x_k| − σ| ≤ 2σ · 7 log(2/δ)/(2n).

Proof. The density of N(µ, σ^2) is ≥ (7σ)^{-1} in the interval [µ − √2 σ, µ + √2 σ]. Therefore, the probability that a sample falls in the interval [µ − εσ, µ + εσ] is ≥ 2ε/7. Hence, the probability that none of the n samples falls in [µ − εσ, µ + εσ] is ≤ (1 − 2ε/7)^n ≤ e^{−2nε/7}. If ε ≥ 7 log(2/δ)/(2n), then the probability that none of the samples falls in the interval is ≤ δ/2. A similar argument shows that there is a sample within the interval [µ + σ − εσ, µ + σ + εσ], proving the lemma.

The above observation can be translated into a pool of candidate distributions such that one of them is close to the underlying distribution.

Lemma 10. Let n ≥ 120k log(4k/δ)/ε and suppose we are given n samples from a mixture f of k Gaussians. Let S = {N(x_j, (x_j − x_k)^2) : 1 ≤ j, k ≤ n} be a set of Gaussians and W = {0, ε/(2k), 2ε/(2k), ..., 1} be a set of weights. Let

F := {(ŵ_1, ŵ_2, ..., ŵ_{k−1}, 1 − Σ_{i=1}^{k−1} ŵ_i, p̂_1, p̂_2, ..., p̂_k) : ŵ_i ∈ W, p̂_i ∈ S}

be a set of n^{2k} (2k/ε)^{k−1} ≤ n^{3k−1} candidate mixture distributions. Then there exists an f̂ ∈ F such that D(f, f̂) ≤ ε.

Proof.
Let f = (w_1, w_2, ..., w_k, p_1, p_2, ..., p_k). For f̂ = (ŵ_1, ŵ_2, ..., ŵ_{k−1}, 1 − Σ_{i=1}^{k−1} ŵ_i, p̂_1, p̂_2, ..., p̂_k), by the triangle inequality,

D(f, f̂) ≤ Σ_{i=1}^{k−1} 2|ŵ_i − w_i| + Σ_{i=1}^k w_i D(p_i, p̂_i).

We show that there is a distribution f̂ ∈ F such that the sum above is bounded by ε. Since the weight grid is in multiples of ε/(2k), we can consider distributions in F such that each |ŵ_i − w_i| ≤ ε/(4k), and therefore Σ_i 2|ŵ_i − w_i| ≤ ε/2. We now show that for each p_i there is a p̂_i such that w_i D(p_i, p̂_i) ≤ ε/(2k), thus proving that D(f, f̂) ≤ ε. If w_i ≤ ε/(4k), then w_i D(p_i, p̂_i) ≤ ε/(2k). Otherwise, let w′_i > ε/(4k) be the fraction of samples from p_i. By Lemmas 9 and 14, with probability ≥ 1 − δ/(2k),

D(p_i, p̂_i)^2 ≤ 2(µ_i − µ′_i)^2/σ_i^2 + 16(σ_i − σ′_i)^2/σ_i^2 ≤ 25 log^2(4k/δ)/(nw′_i)^2 + 800 log^2(4k/δ)/(nw′_i)^2 ≤ 825 log^2(4k/δ)/(nw′_i)^2.

Therefore, w_i D(p_i, p̂_i) ≤ 30 w_i log(4k/δ)/(nw′_i). Since w_i > ε/(4k), with probability ≥ 1 − δ/(2k) we have w_i ≤ 2w′_i. By the union bound, with probability ≥ 1 − δ/k,

w_i D(p_i, p̂_i) ≤ 60 log(4k/δ)/n.

Hence if n ≥ 120k log(4k/δ)/ε, the above quantity is at most ε/(2k). The total error probability is ≤ δ by the union bound.

Running the MODIFIED SCHEFFE algorithm on the above set of candidates F yields a mixture that is close to the underlying one. By Lemma 1 and the above lemma, we get

Corollary 11. Let n ≥ c · k log(k/(εδ))/ε^2 for some constant c. There is an algorithm that runs in time

O( (k log(k/(εδ))/ε)^{3k−1} · (k^2/ε^2) log(k/(εδ)) ),

and returns a mixture f̂ such that D(f, f̂) ≤ 1000ε with error probability ≤ 2δ.

Proof. Use n′ := 120k log(4k/δ)/ε samples to generate a set of at most n′^{3k−1} candidate distributions as stated in Lemma 10.
With probability $\ge 1-\delta$, one of the candidate distributions is $\epsilon$-close to the underlying one. Run MODIFIED SCHEFFE on this set of candidate distributions to obtain a $1000\epsilon$-close estimate of $f$ with probability $\ge 1-\delta$ (Lemma 1). The run time is dominated by that of MODIFIED SCHEFFE, which is $O\big(|F|\, T\, \frac{\log(|F|/\delta)}{\epsilon^2}\big)$, where $|F| = n'^{3k-1}$ and $T = k$. The total error probability is $\le 2\delta$ by the union bound.

Remark 12. The above bound matches the independent and contemporary result of [15] for $k = 2$. While the process of identifying the candidate means is the same in both papers, the processes of identifying the variances and the proof techniques are different.

5 Acknowledgements

We thank Sanjoy Dasgupta, Todd Kemp, and Krishnamurthy Viswanathan for helpful discussions.

References

[1] Jayadev Acharya, Ashkan Jafarpour, Alon Orlitsky, and Ananda Theertha Suresh. Optimal probability estimation with applications to prediction and classification. In Proceedings of the 26th Annual Conference on Learning Theory (COLT), pages 764–796, 2013.
[2] Dimitris Achlioptas and Frank McSherry. On spectral learning of mixtures of distributions. In Proceedings of the 18th Annual Conference on Learning Theory (COLT), pages 458–469, 2005.
[3] Rudolf Ahlswede and Andreas Winter. Strong converse for identification via quantum channels. IEEE Transactions on Information Theory, 48(3):569–579, 2002.
[4] Joseph Anderson, Mikhail Belkin, Navin Goyal, Luis Rademacher, and James R. Voss. The more, the merrier: the blessing of dimensionality for learning large gaussian mixtures. CoRR, abs/1311.2891, 2013.
[5] Martin Azizyan, Aarti Singh, and Larry A. Wasserman. Minimax theory for high-dimensional gaussian mixtures with sparse mean separation. CoRR, abs/1306.2035, 2013.
[6] Mikhail Belkin and Kaushik Sinha. Polynomial learning of distribution families.
In Proceedings of the 51st Annual Symposium on Foundations of Computer Science (FOCS), pages 103–112, 2010.
[7] Kamalika Chaudhuri, Sanjoy Dasgupta, and Andrea Vattani. Learning mixtures of gaussians using the k-means algorithm. CoRR, abs/0912.0086, 2009.
[8] G.B. Coleman and Harry C. Andrews. Image segmentation by clustering. Proceedings of the IEEE, 67(5):773–785, 1979.
[9] Thomas M. Cover and Joy A. Thomas. Elements of information theory (2. ed.). Wiley, 2006.
[10] Sanjoy Dasgupta. Learning mixtures of gaussians. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science (FOCS), pages 634–644, 1999.
[11] Sanjoy Dasgupta and Leonard J. Schulman. A two-round variant of EM for gaussian mixtures. In Proceedings of the 16th Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 152–159, 2000.
[12] Sanjoy Dasgupta and Leonard J. Schulman. A probabilistic analysis of EM for mixtures of separated, spherical gaussians. Journal of Machine Learning Research (JMLR), 8:203–226, 2007.
[13] Constantinos Daskalakis, Ilias Diakonikolas, and Rocco A. Servedio. Learning k-modal distributions via testing. In SODA, pages 1371–1385, 2012.
[14] Constantinos Daskalakis, Ilias Diakonikolas, and Rocco A. Servedio. Learning poisson binomial distributions. In Proceedings of the 44th Annual ACM Symposium on Theory of Computing (STOC), pages 709–728, 2012.
[15] Constantinos Daskalakis and Gautam Kamath. Faster and sample near-optimal algorithms for proper learning mixtures of gaussians. CoRR, abs/1312.1054, 2013.
[16] Luc Devroye and Gábor Lugosi. Combinatorial methods in density estimation. Springer, 2001.
[17] Inderjit S. Dhillon, Yuqiang Guan, and Jacob Kogan. Iterative clustering of high dimensional text data augmented by local search.
In Proceedings of the 2nd Industrial Conference on Data Mining (ICDM), pages 131–138, 2002.
[18] Jon Feldman, Ryan O'Donnell, and Rocco A. Servedio. Learning mixtures of product distributions over discrete domains. In Proceedings of the 46th Annual Symposium on Foundations of Computer Science (FOCS), pages 501–510, 2005.
[19] Jon Feldman, Rocco A. Servedio, and Ryan O'Donnell. PAC learning axis-aligned mixtures of gaussians with no separation assumption. In Proceedings of the 19th Annual Conference on Learning Theory (COLT), pages 20–34, 2006.
[20] Yoav Freund and Yishay Mansour. Estimating a mixture of two product distributions. In Proceedings of the 13th Annual Conference on Learning Theory (COLT), pages 53–62, 1999.
[21] Daniel Hsu and Sham M. Kakade. Learning mixtures of spherical gaussians: moment methods and spectral decompositions. In Proceedings of the 4th Innovations in Theoretical Computer Science Conference (ITCS), pages 11–20, 2013.
[22] Adam Tauman Kalai, Ankur Moitra, and Gregory Valiant. Efficiently learning mixtures of two gaussians. In Proceedings of the 42nd Annual ACM Symposium on Theory of Computing (STOC), pages 553–562, 2010.
[23] Ravindran Kannan, Hadi Salmasian, and Santosh Vempala. The spectral method for general mixture models. SIAM Journal on Computing, 38(3):1141–1156, 2008.
[24] Michael J. Kearns, Yishay Mansour, Dana Ron, Ronitt Rubinfeld, Robert E. Schapire, and Linda Sellie. On the learnability of discrete distributions. In Proceedings of the 26th Annual ACM Symposium on Theory of Computing (STOC), pages 273–282, 1994.
[25] B. Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302–1338, 2000.
[26] Bruce G. Lindsay. Mixture Models: Theory, Geometry and Applications.
NSF-CBMS Conference Series in Probability and Statistics, Penn. State University, 1995.
[27] Jinwen Ma, Lei Xu, and Michael I. Jordan. Asymptotic convergence rate of the em algorithm for gaussian mixtures. Neural Computation, 12(12):2881–2907, 2001.
[28] Satyaki Mahalanabis and Daniel Stefankovic. In Proceedings of the 21st Annual Conference on Learning Theory (COLT), pages 503–512. Omnipress, 2008.
[29] Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures of gaussians. In Proceedings of the 51st Annual Symposium on Foundations of Computer Science (FOCS), pages 93–102, 2010.
[30] Siu On Chan, Ilias Diakonikolas, Rocco A. Servedio, and Xiaorui Sun. Efficient density estimation via piecewise polynomial approximation. CoRR, abs/1305.3207, 2013.
[31] Siu On Chan, Ilias Diakonikolas, Rocco A. Servedio, and Xiaorui Sun. Learning mixtures of structured distributions over discrete domains. In Proceedings of the 24th Annual Symposium on Discrete Algorithms (SODA), pages 1380–1394, 2013.
[32] Alon Orlitsky, Narayana P. Santhanam, Krishnamurthy Viswanathan, and Junan Zhang. On modeling profiles instead of values. In Proceedings of the 20th Annual Conference on Uncertainty in Artificial Intelligence (UAI), 2004.
[33] Liam Paninski. Variational minimax estimation of discrete distributions under kl loss. In Proceedings of the 18th Annual Conference on Neural Information Processing Systems (NIPS), 2004.
[34] David Pollard. Asymptopia. 1997.
[35] Richard A. Redner and Homer F. Walker. Mixture densities, maximum likelihood and the em algorithm. SIAM Review, 26(2):195–239, 1984.
[36] Douglas A. Reynolds and Richard C. Rose. Robust text-independent speaker identification using gaussian mixture speaker models. IEEE Transactions on Speech and Audio Processing, 3(1):72–83, 1995.
[37] Robin Sibson.
Slink: An optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1):30–34, 1973.
[38] D. Michael Titterington, Adrian F.M. Smith, and Udi E. Makov. Statistical analysis of finite mixture distributions, volume 7. Wiley, New York, 1985.
[39] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.
[40] Gregory Valiant and Paul Valiant. Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new clts. In Proceedings of the 43rd Annual ACM Symposium on Theory of Computing (STOC), 2011.
[41] Gregory Valiant and Paul Valiant. Estimating the unseen: A sublinear-sample canonical estimator of distributions. Electronic Colloquium on Computational Complexity (ECCC), 17:180, 2010.
[42] Santosh Vempala and Grant Wang. A spectral algorithm for learning mixtures of distributions. In Proceedings of the 43rd Annual Symposium on Foundations of Computer Science (FOCS), pages 113–, 2002.
[43] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. CoRR, abs/1011.3027, 2010.
[44] Eric P. Xing, Michael I. Jordan, and Richard M. Karp. Feature selection for high-dimensional genomic microarray data. In Proceedings of the 18th Annual International Conference on Machine Learning (ICML), pages 601–608, 2001.
[45] Bin Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423–435. Springer New York, 1997.

A Useful tools

A.1 Bounds on $\ell_1$ distance

For two $d$-dimensional product distributions $p_1$ and $p_2$, if we bound the $\ell_1$ distance on each coordinate by $\epsilon$, then by the triangle inequality $D(p_1, p_2) \le d\epsilon$. However, this bound is often weak.
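To see how loose the coordinate-wise triangle inequality can be, the short sketch below (our own illustration; `gaussian_l1` and `naive_vs_exact` are hypothetical names, not from the paper) compares the naive bound $d\epsilon$ with the exact $\ell_1$ distance for two spherical unit Gaussians whose means differ by a small shift in every coordinate. Since the $\ell_1$ distance between spherical Gaussians with equal covariance depends only on the length of the mean difference, the exact value reduces to a one-dimensional computation.

```python
import math

def gaussian_l1(m1, m2, sigma=1.0):
    """Exact L1 distance between N(m1, sigma^2) and N(m2, sigma^2):
    the densities cross at the midpoint, giving 2*(2*Phi(s/2) - 1), s = |m1-m2|/sigma."""
    s = abs(m1 - m2) / sigma
    phi = 0.5 * (1 + math.erf(s / (2 * math.sqrt(2))))  # standard normal CDF at s/2
    return 2 * (2 * phi - 1)

def naive_vs_exact(d=100, shift=0.05):
    """Two spherical unit Gaussians in R^d whose means differ by `shift` in every
    coordinate. Naive per-coordinate triangle-inequality bound vs the exact L1
    distance, which only depends on ||mu1 - mu2||_2 = shift * sqrt(d)."""
    naive = d * gaussian_l1(0.0, shift)                 # sum of coordinate-wise distances
    exact = gaussian_l1(0.0, shift * math.sqrt(d))      # 1-d problem along the mean gap
    return naive, exact

naive, exact = naive_vs_exact()
```

Here the naive bound exceeds the trivial bound of 2, while the exact distance remains well below it, which is the weakness the Bhattacharyya-based bound addresses.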
One way to obtain a stronger bound is to relate the $\ell_1$ distance to the Bhattacharyya parameter, defined as follows: the Bhattacharyya parameter between two distributions $p_1$ and $p_2$ is
$$B(p_1, p_2) = \int_{x\in\mathcal X} \sqrt{p_1(x)\, p_2(x)}\; dx.$$
We use the fact that for two product distributions $p_1$ and $p_2$, $B(p_1, p_2) = \prod_{i=1}^d B(p_{1,i}, p_{2,i})$ to obtain stronger bounds on the $\ell_1$ distance. We first bound the Bhattacharyya parameter for two one-dimensional Gaussian distributions.

Lemma 13. The Bhattacharyya parameter for two one-dimensional Gaussian distributions $p_1 = N(\mu_1, \sigma_1^2)$ and $p_2 = N(\mu_2, \sigma_2^2)$ satisfies
$$B(p_1, p_2) \ge 1 - \frac{(\mu_1 - \mu_2)^2}{4(\sigma_1^2 + \sigma_2^2)} - \frac{(\sigma_1^2 - \sigma_2^2)^2}{(\sigma_1^2 + \sigma_2^2)^2}.$$

Proof. For Gaussian distributions the Bhattacharyya parameter is (see [8]) $B(p_1, p_2) = y\, e^{-x}$, where
$$x = \frac{(\mu_1 - \mu_2)^2}{4(\sigma_1^2 + \sigma_2^2)} \quad\text{and}\quad y = \sqrt{\frac{2\sigma_1\sigma_2}{\sigma_1^2 + \sigma_2^2}}.$$
Observe that
$$y = \sqrt{\frac{2\sigma_1\sigma_2}{\sigma_1^2 + \sigma_2^2}} = \sqrt{1 - \frac{(\sigma_1 - \sigma_2)^2}{\sigma_1^2 + \sigma_2^2}} \ge 1 - \frac{(\sigma_1 - \sigma_2)^2}{\sigma_1^2 + \sigma_2^2} \ge 1 - \frac{(\sigma_1^2 - \sigma_2^2)^2}{(\sigma_1^2 + \sigma_2^2)^2}.$$
Hence,
$$B(p_1, p_2) = y\, e^{-x} \ge y(1 - x) \ge (1 - x)\Big(1 - \frac{(\sigma_1^2 - \sigma_2^2)^2}{(\sigma_1^2 + \sigma_2^2)^2}\Big) \ge 1 - x - \frac{(\sigma_1^2 - \sigma_2^2)^2}{(\sigma_1^2 + \sigma_2^2)^2}.$$
Substituting the value of $x$ proves the lemma.

The next lemma follows from the relationship between the Bhattacharyya parameter and the $\ell_1$ distance (see [34]) and the previous lemma.

Lemma 14. For any two Gaussian product distributions $p_1$ and $p_2$,
$$D(p_1, p_2)^2 \le 8\sum_{i=1}^d \big(1 - B(p_{1,i}, p_{2,i})\big) \le \sum_{i=1}^d \bigg(\frac{2(\mu_{1,i} - \mu_{2,i})^2}{\sigma_{1,i}^2 + \sigma_{2,i}^2} + \frac{8(\sigma_{1,i}^2 - \sigma_{2,i}^2)^2}{(\sigma_{1,i}^2 + \sigma_{2,i}^2)^2}\bigg).$$

A.2 Concentration inequalities

We use the following concentration inequalities for Gaussian, chi-squared, and sums of Bernoulli random variables in the rest of the paper.

Lemma 15. For a Gaussian random variable $X$ with mean $\mu$ and variance $\sigma^2$, $\Pr(X - \mu \ge t\sigma) \le e^{-t^2/2}$.
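Returning to Lemma 13, its closed form and lower bound are easy to check numerically. The sketch below (our own code; the function names are hypothetical) evaluates the closed-form Bhattacharyya parameter, compares it against a direct numerical integration of $\sqrt{p_1 p_2}$, and verifies the lemma's lower bound.

```python
import math

def bhattacharyya_gaussian(mu1, s1, mu2, s2):
    """Closed-form Bhattacharyya parameter for N(mu1, s1^2) and N(mu2, s2^2)."""
    v = s1 * s1 + s2 * s2
    x = (mu1 - mu2) ** 2 / (4 * v)     # mean-separation term
    y = math.sqrt(2 * s1 * s2 / v)     # variance-mismatch term
    return y * math.exp(-x)

def lemma13_lower_bound(mu1, s1, mu2, s2):
    """The lower bound of Lemma 13."""
    v = s1 * s1 + s2 * s2
    return 1 - (mu1 - mu2) ** 2 / (4 * v) - (s1 * s1 - s2 * s2) ** 2 / (v * v)

def numeric_bhattacharyya(mu1, s1, mu2, s2, lo=-20.0, hi=20.0, steps=200000):
    """Trapezoidal integration of sqrt(p1 * p2) as an independent check."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        t = lo + i * h
        p1 = math.exp(-(t - mu1) ** 2 / (2 * s1 * s1)) / (s1 * math.sqrt(2 * math.pi))
        p2 = math.exp(-(t - mu2) ** 2 / (2 * s2 * s2)) / (s2 * math.sqrt(2 * math.pi))
        w = 0.5 if i in (0, steps) else 1.0
        total += w * math.sqrt(p1 * p2)
    return total * h
```

For instance, for $p_1 = N(0, 1)$ and $p_2 = N(0.5, 1.2^2)$ the closed form agrees with the integral and sits above the lemma's bound.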
Lemma 16 ([25]). If $Y_1, \ldots, Y_n$ are i.i.d. Gaussian random variables with mean 0 and variance $\sigma^2$, then
$$\Pr\Big(\sum_{i=1}^n Y_i^2 - n\sigma^2 \ge 2(\sqrt{nt} + t)\sigma^2\Big) \le e^{-t} \quad\text{and}\quad \Pr\Big(\sum_{i=1}^n Y_i^2 - n\sigma^2 \le -2\sqrt{nt}\,\sigma^2\Big) \le e^{-t}.$$
Furthermore, for a fixed vector $a$,
$$\Pr\Big(\Big|\sum_{i=1}^n a_i(Y_i^2 - 1)\Big| \ge 2\big(\|a\|_2\sqrt t + \|a\|_\infty t\big)\sigma^2\Big) \le 2e^{-t}.$$

Lemma 17 (Chernoff bound). If $X_1, \ldots, X_n$ are distributed according to Bernoulli($p$), then with probability $\ge 1-\delta$,
$$\Big|\frac{\sum_{i=1}^n X_i}{n} - p\Big| \le \sqrt{\frac{2p(1-p)\log\frac2\delta}{n}} + \frac{\frac23\log\frac2\delta}{n}.$$

We now state a non-asymptotic concentration inequality for random matrices that helps us bound errors in spectral algorithms.

Lemma 18 ([43], Remark 5.51). Let $y^{(1)}, \ldots, y^{(n)}$ be generated according to $N(0, \Sigma)$. For every $\epsilon \in (0,1)$ and $t \ge 1$, if $n \ge c'(t/\epsilon)^2 d$ for some constant $c'$, then with probability $\ge 1 - 2e^{-t^2 n}$,
$$\Big\|\frac1n\sum_{i=1}^n y^{(i)} (y^{(i)})^t - \Sigma\Big\| \le \epsilon\|\Sigma\|.$$

A.3 Matrix eigenvalues

We now state a few simple lemmas on the eigenvalues of perturbed matrices.

Lemma 19. Let $\lambda^A_1 \ge \lambda^A_2 \ge \cdots \ge \lambda^A_d \ge 0$ and $\lambda^B_1 \ge \lambda^B_2 \ge \cdots \ge \lambda^B_d \ge 0$ be the eigenvalues of two symmetric matrices $A$ and $B$ respectively. If $\|A - B\| \le \epsilon$, then $|\lambda^A_i - \lambda^B_i| \le \epsilon$ for all $i$.

Proof. Let $u_1, \ldots, u_d$ be eigenvectors of $A$ corresponding to $\lambda^A_1, \ldots, \lambda^A_d$, and similarly let $v_1, \ldots, v_d$ be eigenvectors of $B$. Consider the first eigenvalue of $B$:
$$\lambda^B_1 = \|B\| = \|A + (B - A)\| \ge \|A\| - \|B - A\| \ge \lambda^A_1 - \epsilon.$$
Now consider an $i > 1$. If $\lambda^B_i < \lambda^A_i - \epsilon$, then by the variational characterization of eigenvalues,
$$\max_{v:\ \forall j \le i-1,\ v\cdot v_j = 0} \|Bv\|_2 < \lambda^A_i - \epsilon.$$
Now consider a unit vector $\sum_{j=1}^i \alpha_j u_j$ in the span of $u_1, \ldots, u_i$ that is orthogonal to $v_1, \ldots, v_{i-1}$. For this vector,
$$\Big\|B\sum_{j=1}^i \alpha_j u_j\Big\|_2 \ge \Big\|A\sum_{j=1}^i \alpha_j u_j\Big\|_2 - \Big\|(A - B)\sum_{j=1}^i \alpha_j u_j\Big\|_2 \ge \sqrt{\sum_{j=1}^i \alpha_j^2 (\lambda^A_j)^2} - \epsilon \ge \lambda^A_i - \epsilon,$$
a contradiction. Hence $\lambda^B_i \ge \lambda^A_i - \epsilon$ for all $i \le d$.
The proof in the other direction is similar and omitted.

Lemma 20. Let $A = \sum_{i=1}^k \eta_i^2 u_i u_i^t$ be a positive semidefinite symmetric matrix for $k \le d$, where $u_1, \ldots, u_k$ span a $(k-1)$-dimensional space. Let $B = A + R$ with $\|R\| \le \epsilon$, and let $v_1, \ldots, v_{k-1}$ be the top $k-1$ eigenvectors of $B$. Then the projection of $u_i$ onto the space orthogonal to $v_1, \ldots, v_{k-1}$ is $\le \frac{2\sqrt\epsilon}{\eta_i}$.

Proof. Let $\lambda^B_i$ be the $i$-th largest eigenvalue of $B$. Observe that $B + \epsilon I_d$ is positive semidefinite, since for any vector $v$, $v^t(A + R + \epsilon I_d)v \ge 0$. Furthermore, $\|A + R + \epsilon I_d - A\| \le 2\epsilon$. Since the eigenvalues of $B + \epsilon I_d$ are $\lambda^B_i + \epsilon$, Lemma 19 gives $|\lambda^A_i - \lambda^B_i - \epsilon| \le 2\epsilon$ for all $i \le d$. Since $u_1, \ldots, u_k$ span a $(k-1)$-dimensional space, $\lambda^A_i = 0$ for $i \ge k$, and therefore $\lambda^B_i \le 3\epsilon$ for $i \ge k$. Write
$$u_i = \sum_{j=1}^{k-1}\alpha_{i,j} v_j + \sqrt{1 - \sum_{j=1}^{k-1}\alpha_{i,j}^2}\; u'$$
for a unit vector $u'$ orthogonal to $v_1, \ldots, v_{k-1}$. We compute $u'^t A u'$ in two ways. Since $A = B - R$,
$$u'^t(B - R)u' \le u'^t B u' + |u'^t R u'| \le \|Bu'\|_2 + \|R\|.$$
Since $u'$ is orthogonal to the top $k-1$ eigenvectors of $B$, we have $\|Bu'\|_2 \le 3\epsilon$, and hence $u'^t(B - R)u' \le 4\epsilon$. On the other hand,
$$u'^t A u' \ge \eta_i^2\Big(1 - \sum_{j=1}^{k-1}\alpha_{i,j}^2\Big).$$
Since this quantity is $\le 4\epsilon$, we get $\big(1 - \sum_{j=1}^{k-1}\alpha_{i,j}^2\big)^{1/2} \le \frac{2\sqrt\epsilon}{\eta_i}$.

B Selection from a set of candidate distributions

Given samples from an unknown distribution $f$, the objective is to output a distribution from a known collection $F$ of distributions with $\ell_1$ distance close to $D(f, F)$. The Scheffe estimate [16] outputs a distribution from $F$ whose $\ell_1$ distance from $f$ is at most $9.1\max(D(f, F), \epsilon)$. The algorithm requires $O(\epsilon^{-2}\log|F|)$ samples and runs in time $O(|F|^2 T n)$, where $T$ is the time to compute the probability $f_j(x)$ of a point $x$ for any $f_j \in F$. An approach that reduces the time complexity, albeit using exponential pre-processing, was proposed in [28].
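The pairwise contest at the heart of this selection procedure can be sketched in a few lines. The code below is our own illustration for one-dimensional Gaussian candidates (the names `gauss_pdf` and `scheffe_duel` are hypothetical): it estimates each candidate's mass on the set $\{x : p(x) > q(x)\}$ by sampling, as in the SCHEFFE* steps presented next, and declares the candidate whose mass is closer to the empirical mass of the data.

```python
import math
import random

def gauss_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def scheffe_duel(samples, p, q, n=2000, seed=3):
    """One pairwise contest: p and q are (mu, sigma) candidate Gaussians.
    Compare empirical masses on A = {x : p(x) > q(x)} and pick the closer one."""
    rng = random.Random(seed)
    in_A = lambda x: gauss_pdf(x, *p) > gauss_pdf(x, *q)
    mu_f = sum(in_A(x) for x in samples) / len(samples)          # data mass on A
    mu_p = sum(in_A(rng.gauss(*p)) for _ in range(n)) / n        # p's mass on A
    mu_q = sum(in_A(rng.gauss(*q)) for _ in range(n)) / n        # q's mass on A
    return p if abs(mu_p - mu_f) < abs(mu_q - mu_f) else q

rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(2000)]   # true distribution N(0, 1)
winner = scheffe_duel(data, (0.0, 1.0), (5.0, 1.0))
```

With the true distribution among the candidates, the duel picks it out with overwhelming probability, which is exactly the guarantee the proof sketch of Lemma 1 quantifies.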
We present the modified Scheffe algorithm with near-linear time complexity and then prove Lemma 1. We first present the algorithm SCHEFFE*, which runs in time $O(|F|^2 T n)$.

Algorithm SCHEFFE*
Input: a set $F$ of candidate distributions, $\epsilon$: an upper bound on $D(f, F)$, and $n$ independent samples $x_1, \ldots, x_n$ from $f$.
For each pair $(p, q)$ in $F$:
1. $\mu_f = \frac1n\sum_{i=1}^n \mathbb I\{p(x_i) > q(x_i)\}$.
2. Generate independent samples $y_1, \ldots, y_n$ and $z_1, \ldots, z_n$ from $p$ and $q$ respectively.
3. $\mu_p = \frac1n\sum_{i=1}^n \mathbb I\{p(y_i) > q(y_i)\}$, $\mu_q = \frac1n\sum_{i=1}^n \mathbb I\{p(z_i) > q(z_i)\}$.
4. If $|\mu_p - \mu_f| < |\mu_q - \mu_f|$, declare $p$ the winner; else declare $q$.
Output the distribution with the most wins, breaking ties arbitrarily.

We modify the algorithm so that the set of potential distributions is halved in every iteration.

Algorithm MODIFIED SCHEFFE
Input: a set $F$ of candidate distributions, $\epsilon$: an upper bound on $\min_{f_i\in F} D(f, f_i)$, and $n$ independent samples $x_1, \ldots, x_n$ from $f$.
1. Let $G = F$, $C \leftarrow \emptyset$.
2. Repeat while $|G| > 1$:
   (a) Randomly form $|G|/2$ pairs of distributions in $G$ and run SCHEFFE* on each pair using the $n$ samples.
   (b) Replace $G$ with the $|G|/2$ winners.
   (c) Randomly select a set $A$ of $\min\{|G|, |F|^{1/3}\}$ elements from $G$.
   (d) Run SCHEFFE* on each pair in $A$ and add the distribution with the most wins to $C$.
3. Run SCHEFFE* on $C$ and output the winner.

Remark 21. For ease of proof, we assume that $\delta \ge \frac{10\log|F|}{|F|^{1/3}}$. If $\delta < \frac{10\log|F|}{|F|^{1/3}}$, we run the algorithm with error probability $\frac13$ and repeat it $O(\log\frac1\delta)$ times to choose a set of candidate mixtures $F_\delta$. By the Chernoff bound, with error probability $\le \delta$, $F_\delta$ contains a mixture close to $f$. Finally, we run SCHEFFE* on $F_\delta$ to obtain a mixture that is close to $f$.

Proof sketch of Lemma 1.
For any set $A$ and distribution $p$, given $n$ independent samples from $p$, the empirical probability $\mu_n(A)$ concentrates around $p(A)$ with standard deviation $\sim \frac1{\sqrt n}$. Together with an observation about Scheffe estimation in [16], one can show that if the number of samples is $n = O\big(\frac{\log(|F|/\delta)}{\epsilon^2}\big)$, then SCHEFFE* has guarantee $10\max(\epsilon, D(f, F))$ with probability $\ge 1-\delta$.

Since we run SCHEFFE* at most $|F|(2\log|F| + 1)$ times, replacing $\delta$ by $\frac{\delta}{4|F|\log|F| + 2|F|}$ results in a sample complexity of
$$O\bigg(\frac{\log\frac{|F|^2(4\log|F| + 2)}{\delta}}{\epsilon^2}\bigg) = O\bigg(\frac{\log\frac{|F|}{\delta}}{\epsilon^2}\bigg)$$
and a total error probability of $\frac\delta2$ over all runs of SCHEFFE* during the algorithm. This value of $n$ dictates our sample complexity. We now consider the following two cases:

• If at some stage a $\ge \frac{\log(2/\delta)}{|F|^{1/3}}$ fraction of the elements in $G$ are at $\ell_1$ distance $\le 10\epsilon$ from $f$, then at that stage, with probability $\ge 1 - \frac\delta2$, an element at distance $\le 10\epsilon$ from $f$ is included in $A$, and therefore a distribution at distance $\le 100\epsilon$ is added to $C$.
• If this never happens, consider the element closest to $f$, i.e., at $\ell_1$ distance at most $\epsilon$. With probability $\ge 1 - \frac{\log(2/\delta)}{|F|^{1/3}}\log|F|$ it always competes against elements at distance at least $10\epsilon$ from $f$, and it wins all of these games with probability $\ge 1 - \frac\delta2$.

Therefore, with probability $\ge 1 - \frac\delta2$, there is an element in $C$ at $\ell_1$ distance at most $100\epsilon$ from $f$. Running SCHEFFE* on this set yields a distribution at distance $\le 100\cdot 10\epsilon = 1000\epsilon$. The error probability is $\le \delta$ by the union bound.

C Lower bound

We first show a lower bound for a single Gaussian distribution and then generalize it to mixtures.

C.1 Single Gaussian distribution

The proof is an application of the following version of Fano's inequality [9, 45].
It states that one cannot simultaneously estimate all distributions in a class using $n$ samples if they satisfy certain conditions.

Lemma 22 (Fano's inequality). Let $f_1, \ldots, f_{r+1}$ be a collection of distributions such that for any $i \ne j$, $D(f_i, f_j) \ge \alpha$ and $KL(f_i, f_j) \le \beta$. Let $\hat f$ be an estimate of the underlying distribution from $n$ i.i.d. samples from one of the $f_i$'s. Then
$$\sup_i \mathbb E[D(f_i, \hat f)] \ge \frac\alpha2\Big(1 - \frac{n\beta + \log 2}{\log r}\Big).$$

We consider $d$-dimensional spherical Gaussians with identity covariance matrix whose mean along each coordinate is restricted to $\pm\frac{c\epsilon}{\sqrt d}$. The KL divergence between two spherical Gaussians with identity covariance matrix is at most the squared $\ell_2$ distance between their means. Therefore, any two distributions we consider have KL divergence at most
$$\beta = \sum_{i=1}^d \Big(\frac{2c\epsilon}{\sqrt d}\Big)^2 = 4c^2\epsilon^2.$$
We now consider a subset of these $2^d$ distributions to obtain a lower bound on $\alpha$. By the Gilbert–Varshamov bound, there exists a binary code with $\ge 2^{d/8}$ codewords of length $d$ and minimum distance $\frac d8$. Consider one such code. For each codeword, map $1 \to \frac{c\epsilon}{\sqrt d}$ and $0 \to -\frac{c\epsilon}{\sqrt d}$ to obtain a distribution in our class. We take this subset of $\ge 2^{d/8}$ distributions as our $f_i$'s.

Consider any two of the $f_i$'s. Their means differ in at least $\frac d8$ coordinates. We show that the $\ell_1$ distance between them is $\ge \frac{c\epsilon}{4}$. Without loss of generality, let the means differ in the first $\frac d8$ coordinates and, furthermore, let one of the distributions have means $\frac{c\epsilon}{\sqrt d}$ and the other $-\frac{c\epsilon}{\sqrt d}$ in these coordinates. The sum of the first $\frac d8$ coordinates is then distributed as $N\big(\frac{c\epsilon\sqrt d}{8}, \frac d8\big)$ under one distribution and $N\big(-\frac{c\epsilon\sqrt d}{8}, \frac d8\big)$ under the other. The $\ell_1$ distance between these normal random variables is a lower bound on the $\ell_1$ distance between the original distributions. For small values of $c\epsilon$, the distance between the two Gaussians is at least $\frac{c\epsilon}{4}$. This serves as our $\alpha$.
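The projection step above is easy to verify numerically. The sketch below (our own check; function names are hypothetical) computes the exact $\ell_1$ distance between $N\big(\pm\frac{c\epsilon\sqrt d}{8}, \frac d8\big)$ using the closed form for equal-variance Gaussians; note that the separation-to-deviation ratio, and hence the distance, is independent of $d$.

```python
import math

def gaussian_l1(m1, m2, sigma):
    """Exact L1 distance between N(m1, sigma^2) and N(m2, sigma^2)."""
    s = abs(m1 - m2) / sigma
    phi = 0.5 * (1 + math.erf(s / (2 * math.sqrt(2))))  # standard normal CDF at s/2
    return 2 * (2 * phi - 1)

def codeword_l1_lower(c_eps, d=1024):
    """L1 distance between N(+c*eps*sqrt(d)/8, d/8) and N(-c*eps*sqrt(d)/8, d/8),
    the one-dimensional projections used in the lower-bound argument."""
    mean = c_eps * math.sqrt(d) / 8
    return gaussian_l1(-mean, mean, math.sqrt(d / 8))
```

For small $c\epsilon$ the computed distance is roughly $0.56\, c\epsilon$, comfortably above the $\frac{c\epsilon}{4}$ claimed in the argument.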
Applying Fano's inequality, the $\ell_1$ error on the worst distribution is at least
$$\frac{c\epsilon}{8}\Big(1 - \frac{4c^2\epsilon^2 n + \log 2}{d/8}\Big),$$
which for $c = 16$ and $n < \frac{d}{2^{14}\epsilon^2}$ is at least $\epsilon$. In other words, the smallest $n$ that suffices to approximate all spherical normal distributions to $\ell_1$ distance at most $\epsilon$ is $> \frac{d}{2^{14}\epsilon^2}$.

C.2 Mixtures of $k$ Gaussians

We now provide a lower bound on the sample complexity of learning mixtures of $k$ Gaussians in $d$ dimensions. We extend the construction for a single spherical Gaussian to mixtures of $k$ Gaussians and show a lower bound of $\Omega\big(\frac{kd}{\epsilon^2}\big)$ samples. We will again use Fano's inequality, over a class of $2^{kd/64}$ distributions described next.

To prove the lower bound for a single spherical Gaussian, we designed a class of $2^{d/8}$ distributions around the origin. Let $\mathcal P \stackrel{\rm def}{=} \{P_1, \ldots, P_T\}$, where $T = 2^{d/8}$, be this class. Recall that each $P_i$ is a spherical Gaussian with unit variance. For a distribution $P$ over $\mathbb R^d$ and $\mu \in \mathbb R^d$, let $P + \mu$ denote $P$ shifted by $\mu$. We now choose shifts $\mu_1, \ldots, \mu_k$ that are extremely well-separated. The class of distributions we consider consists of mixtures of $k$ components, where the $j$-th component is a distribution from $\mathcal P$ shifted by $\mu_j$. Since the $\mu_j$'s are well separated, we can apply the results of the previous section to each component.

For $i \in [T]$ and $j \in [k]$, let $P_{ij} \stackrel{\rm def}{=} P_i + \mu_j$. Each tuple $(i_1, \ldots, i_k) \in [T]^k$ corresponds to the mixture $\frac1k(P_{i_1 1} + P_{i_2 2} + \cdots + P_{i_k k})$ of $k$ spherical Gaussians. We consider this class of $T^k = 2^{kd/8}$ distributions. By the Gilbert–Varshamov bound, for any $T \ge 2$ there is a $T$-ary code of length $k$ with minimum distance $\ge \frac k8$ and at least $2^{kd/64}$ codewords. This implies that among the $T^k = 2^{dk/8}$ distributions, there are $2^{kd/64}$ distributions such that any two tuples $(i_1, \ldots, i_k)$ and $(i'_1, \ldots$
, i'_k)$ corresponding to different distributions differ in at least $\frac k8$ locations.

If we choose the $\mu_j$'s well separated, the components of any mixture distribution have very little overlap. For simplicity, we choose $\mu_j$'s satisfying
$$\min_{j_1 \ne j_2} \|\mu_{j_1} - \mu_{j_2}\|_2^2 \ge 100\log\frac{2kd}{\epsilon}.$$
This implies that for $j \ne l$, the overlap between $P_{ij}$ and $P_{i'l}$ is at most $\big(\frac{\epsilon}{2dk}\big)^{10}$. Therefore, for two different mixture distributions,
$$\Big\|\frac1k(P_{i_1 1} + \cdots + P_{i_k k}) - \frac1k(P_{i'_1 1} + \cdots + P_{i'_k k})\Big\|_1 \stackrel{(a)}{\ge} \frac1k\sum_{j\in[k]} \|P_{i_j j} - P_{i'_j j}\|_1 - k^2\Big(\frac{\epsilon}{2dk}\Big)^{10} \stackrel{(b)}{\ge} \frac18\cdot\frac{c\epsilon}{4} - k^2\Big(\frac{\epsilon}{2dk}\Big)^{10},$$
where $(a)$ follows from the fact that two mixtures overlap only in the corresponding components, and $(b)$ uses the fact that $i_j \ne i'_j$ in at least $\frac k8$ components, together with the lower bound from the previous section. Therefore, the $\ell_1$ distance between any two of the $2^{kd/64}$ distributions is $\ge \frac{c_1\epsilon}{32}$ for $c_1$ slightly smaller than $c$. We take this as $\alpha$.

To upper bound the KL divergence, we simply use convexity: for any distributions $P_1, \ldots, P_k$ and $Q_1, \ldots, Q_k$ with mean distributions $\bar P$ and $\bar Q$,
$$D(\bar P \,\|\, \bar Q) \le \frac1k\sum_{i=1}^k D(P_i \,\|\, Q_i).$$
By the construction and the previous section, for any $j$, $D(P_{i_j j}\,\|\,P_{i'_j j}) = D(P_i\,\|\,P_{i'}) \le 4c^2\epsilon^2$. Therefore we can take $\beta = 4c^2\epsilon^2$. By Fano's inequality, the $\ell_1$ error on the worst distribution is at least
$$\frac{c_1\epsilon}{64}\Big(1 - \frac{4c^2\epsilon^2 n + \log 2}{dk/64}\Big),$$
which for $c_1 = 128$, $c = 128.1$, and $n < \frac{dk}{8^8\epsilon^2}$ is at least $\epsilon$.

D Proofs for $k$ spherical Gaussians

We first state a simple concentration result that helps in the other proofs.

Lemma 23.
Given $n$ samples from a set of Gaussian distributions, with probability $\ge 1-2\delta$, every pair of samples $X \sim N(\mu_1, \sigma^2 I_d)$ and $Y \sim N(\mu_2, \sigma^2 I_d)$ satisfies
$$\|X - Y\|_2^2 \le 2d\sigma^2 + 4\sigma^2\sqrt{d\log\tfrac{n^2}{\delta}} + \|\mu_1 - \mu_2\|_2^2 + 4\sigma\|\mu_1 - \mu_2\|_2\sqrt{\log\tfrac{n^2}{\delta}} + 4\sigma^2\log\tfrac{n^2}{\delta} \quad (1)$$
and
$$\|X - Y\|_2^2 \ge 2d\sigma^2 - 4\sigma^2\sqrt{d\log\tfrac{n^2}{\delta}} + \|\mu_1 - \mu_2\|_2^2 - 4\sigma\|\mu_1 - \mu_2\|_2\sqrt{\log\tfrac{n^2}{\delta}}. \quad (2)$$

Proof. We prove the lower bound; the proof of the upper bound is similar and omitted. Since $X$ and $Y$ are Gaussian, $X - Y$ is distributed as $N(\mu_1 - \mu_2, 2\sigma^2 I_d)$. Rewriting $\|X - Y\|_2^2$,
$$\|X - Y\|_2^2 = \|X - Y - (\mu_1 - \mu_2)\|_2^2 + \|\mu_1 - \mu_2\|_2^2 + 2(\mu_1 - \mu_2)\cdot\big(X - Y - (\mu_1 - \mu_2)\big).$$
Let $Z = X - Y - (\mu_1 - \mu_2)$; then $Z \sim N(0, 2\sigma^2 I_d)$. Therefore, by Lemma 16, with probability $\ge 1 - \frac{\delta}{n^2}$,
$$\|Z\|_2^2 \ge 2d\sigma^2 - 4\sigma^2\sqrt{d\log\tfrac{n^2}{\delta}}.$$
Furthermore, $(\mu_1 - \mu_2)\cdot Z$ is a sum of Gaussians and hence Gaussian, with mean 0 and variance $2\sigma^2\|\mu_1 - \mu_2\|_2^2$. Therefore, by Lemma 15, with probability $\ge 1 - \frac{\delta}{n^2}$,
$$(\mu_1 - \mu_2)\cdot Z \ge -2\sigma\|\mu_1 - \mu_2\|_2\sqrt{\log\tfrac{n^2}{\delta}}.$$
By the union bound, with probability $\ge 1 - \frac{2\delta}{n^2}$,
$$\|X - Y\|_2^2 \ge 2d\sigma^2 - 4\sigma^2\sqrt{d\log\tfrac{n^2}{\delta}} + \|\mu_1 - \mu_2\|_2^2 - 4\sigma\|\mu_1 - \mu_2\|_2\sqrt{\log\tfrac{n^2}{\delta}}.$$
There are at most $n^2$ pairs, and the lemma follows by the union bound.

D.1 Proof of Lemma 4

We show that if Equations (1) and (2) hold, then the lemma holds. The error probability is that of Lemma 23, i.e., $\le 2\delta$. Since the minimum is over $k+1$ indices, at least two of the samples are from the same component. Applying Equations (1) and (2) to these two samples,
$$2d\hat\sigma^2 \le 2d\sigma^2 + 4\sigma^2\sqrt{d\log\tfrac{n^2}{\delta}} + 4\sigma^2\log\tfrac{n^2}{\delta}.$$
Similarly, by Equations (1) and (2), for any two samples $X(a), X(b)$ with $a, b \in [k+1]$,
$$\|X(a) - X(b)\|_2^2 \ge 2d\sigma^2 - 4\sigma^2\sqrt{d\log\tfrac{n^2}{\delta}} + \|\mu_i - \mu_j\|_2^2 - 4\sigma\|\mu_i - \mu_j\|_2\sqrt{\log\tfrac{n^2}{\delta}} \ge 2d\sigma^2 - 4\sigma^2\sqrt{d\log\tfrac{n^2}{\delta}} - 4\sigma^2\log\tfrac{n^2}{\delta},$$
where the last inequality follows from the fact that $\alpha^2 - 4\alpha\beta \ge -4\beta^2$.
The result follows from the assumption that $d > 20\log\frac{n^2}{\delta}$.

D.2 Proof of Lemma 5

We show that if Equations (1) and (2) hold, then the lemma holds. The error probability is that of Lemma 23, i.e., $\le 2\delta$. Since Equations (1) and (2) hold, by the proof of Lemma 4,
$$|\hat\sigma^2 - \sigma^2| \le 2.5\,\sigma^2\sqrt{\frac{\log(n^2/\delta)}{d}}.$$
If two samples $X(a)$ and $X(b)$ are from the same component, then by Lemma 23,
$$\|X(a) - X(b)\|_2^2 \le 2d\sigma^2 + 4\sigma^2\sqrt{d\log\tfrac{n^2}{\delta}} + 4\sigma^2\log\tfrac{n^2}{\delta} \le 2d\sigma^2 + 5\sigma^2\sqrt{d\log\tfrac{n^2}{\delta}}.$$
By Lemma 4, this quantity is less than $2d\hat\sigma^2 + 23\hat\sigma^2\sqrt{d\log\frac{n^2}{\delta}}$. Hence all samples from the same component lie in a single cluster. Suppose two samples from different components lie in the same cluster. Then by Equations (1) and (2),
$$2d\hat\sigma^2 + 23\hat\sigma^2\sqrt{d\log\tfrac{n^2}{\delta}} \ge 2d\sigma^2 - 4\sigma^2\sqrt{d\log\tfrac{n^2}{\delta}} + \|\mu_i - \mu_j\|_2^2 - 4\sigma\|\mu_i - \mu_j\|_2\sqrt{\log\tfrac{n^2}{\delta}}.$$
Relating $\hat\sigma^2$ and $\sigma^2$ using Lemma 4,
$$2d\sigma^2 + 40\sigma^2\sqrt{d\log\tfrac{n^2}{\delta}} \ge 2d\sigma^2 - 4\sigma^2\sqrt{d\log\tfrac{n^2}{\delta}} + \|\mu_i - \mu_j\|_2^2 - 4\sigma\|\mu_i - \mu_j\|_2\sqrt{\log\tfrac{n^2}{\delta}}.$$
Hence $\|\mu_i - \mu_j\|_2 \le 10\sigma\big(d\log\frac{n^2}{\delta}\big)^{1/4}$. There are at most $k$ components; therefore, any two components within the same cluster are at distance $\le 10k\sigma\big(d\log\frac{n^2}{\delta}\big)^{1/4}$.

D.3 Proof of Lemma 6

The proof is involved, and we present it in steps. We first show a few concentration bounds, which we later use to argue that the samples are clusterable when the sample covariance matrix has a large eigenvalue. Let $\hat w_i$ be the fraction of samples from component $i$, let $\hat\mu_i$ be the empirical average of the samples from $p_i$, and let $\hat\mu(C)$ be the empirical average of the samples in a cluster $C$. If $C$ is the entire set of samples, we write $\hat\mu$ instead of $\hat\mu(C)$. We first show a concentration inequality that we use in the remaining calculations.

Lemma 24.
Given $n$ samples from a $k$-component Gaussian mixture, with probability $\ge 1-2\delta$, for every component $i$,
$$\|\hat\mu_i - \mu_i\|_2^2 \le \Big(d + 3\sqrt{d\log\tfrac{2k}{\delta}}\Big)\frac{\sigma^2}{n\hat w_i} \quad\text{and}\quad |\hat w_i - w_i| \le \sqrt{\frac{2 w_i\log\frac{2k}{\delta}}{n}} + \frac{\frac23\log\frac{2k}{\delta}}{n}. \quad (3)$$

Proof. Since $\hat\mu_i - \mu_i$ is distributed as $N\big(0, \frac{\sigma^2 I_d}{n\hat w_i}\big)$, by Lemma 16, with probability $\ge 1 - \frac\delta k$,
$$\|\hat\mu_i - \mu_i\|_2^2 \le \Big(d + 2\sqrt{d\log\tfrac{2k}{\delta}} + 2\log\tfrac{2k}{\delta}\Big)\frac{\sigma^2}{n\hat w_i} \le \Big(d + 3\sqrt{d\log\tfrac{2k}{\delta}}\Big)\frac{\sigma^2}{n\hat w_i}.$$
The second inequality uses the fact that $d \ge 20\log\frac{n^2}{\delta}$. For the weights, observe that by Lemma 17, with probability $\ge 1 - \frac\delta k$,
$$|\hat w_i - w_i| \le \sqrt{\frac{2 w_i\log\frac{2k}{\delta}}{n}} + \frac{\frac23\log\frac{2k}{\delta}}{n}.$$
By the union bound over these $2k$ events, the total error probability is $\le 2\delta$.

A simple application of the triangle inequality yields the following lemma.

Lemma 25. Given $n$ samples from a $k$-component Gaussian mixture, if Equation (3) holds, then
$$\Big\|\sum_{i=1}^k \hat w_i(\hat\mu_i - \mu_i)(\hat\mu_i - \mu_i)^t\Big\| \le \Big(d + 3\sqrt{d\log\tfrac{2k}{\delta}}\Big)\frac{k\sigma^2}{n}.$$

Lemma 26. Given $n$ samples from a $k$-component Gaussian mixture, if Equation (3) holds and the maximum distance between two components is $\le 10k\sigma\big(d\log\frac{n^2}{\delta}\big)^{1/4}$, then
$$\|\hat\mu - \mu\|_2 \le c\sigma\sqrt{\frac{dk\log\frac{n^2}{\delta}}{n}}$$
for a constant $c$.

Proof. Observe that
$$\hat\mu - \mu = \sum_{i=1}^k \big(\hat w_i\hat\mu_i - w_i\mu_i\big) = \sum_{i=1}^k \big(\hat w_i(\hat\mu_i - \mu_i) + (\hat w_i - w_i)\mu_i\big) = \sum_{i=1}^k \big(\hat w_i(\hat\mu_i - \mu_i) + (\hat w_i - w_i)(\mu_i - \mu)\big). \quad (4)$$
Hence, by Equation (3) and the assumption that the maximum distance between two components is $\le 10k\sigma\big(d\log\frac{n^2}{\delta}\big)^{1/4}$,
$$\|\hat\mu - \mu\|_2 \le \sum_{i=1}^k \hat w_i\sqrt{d + 3\sqrt{d\log\tfrac{2k}{\delta}}}\,\frac{\sigma}{\sqrt{n\hat w_i}} + \sum_{i=1}^k\Bigg(\sqrt{\frac{2 w_i\log\frac{2k}{\delta}}{n}} + \frac{\frac23\log\frac{2k}{\delta}}{n}\Bigg)\cdot 10k\Big(d\log\tfrac{n^2}{\delta}\Big)^{1/4}\sigma.$$
For $n \ge d \ge \max\big(k^4, 20\log\frac{n^2}{\delta}, 1000\big)$, the above term is $\le c\sigma\sqrt{\frac{kd\log(n^2/\delta)}{n}}$ for some constant $c$.

We now make a simple observation on covariance matrices.

Lemma 27.
Given $n$ samples from a $k$-component mixture,
$$\Big\|\sum_{i=1}^k \hat w_i(\hat\mu_i - \hat\mu)(\hat\mu_i - \hat\mu)^t - \sum_{i=1}^k \hat w_i(\mu_i - \mu)(\mu_i - \mu)^t\Big\| \le 2\|\hat\mu - \mu\|_2^2 + \sum_{i=1}^k 2\hat w_i\|\hat\mu_i - \mu_i\|_2^2 + 2\Big(\sqrt k\,\|\hat\mu - \mu\|_2 + \sum_{i=1}^k \sqrt{\hat w_i}\,\|\hat\mu_i - \mu_i\|_2\Big)\max_j \sqrt{\hat w_j}\,\|\mu_j - \mu\|_2.$$

Proof. Observe that for any two vectors $u$ and $v$,
$$uu^t - vv^t = u(u^t - v^t) + (u - v)v^t = (u - v)(u - v)^t + v(u - v)^t + (u - v)v^t.$$
Hence, by the triangle inequality, $\|uu^t - vv^t\| \le \|u - v\|_2^2 + 2\|v\|_2\|u - v\|_2$. Applying this observation to $u = \hat\mu_i - \hat\mu$ and $v = \mu_i - \mu$, we get
$$\Big\|\sum_{i=1}^k \hat w_i\big((\hat\mu_i - \hat\mu)(\hat\mu_i - \hat\mu)^t - (\mu_i - \mu)(\mu_i - \mu)^t\big)\Big\| \le \sum_{i=1}^k \Big(\hat w_i\|(\hat\mu_i - \hat\mu) - (\mu_i - \mu)\|_2^2 + 2\hat w_i\|\mu_i - \mu\|_2\,\|(\hat\mu_i - \hat\mu) - (\mu_i - \mu)\|_2\Big)$$
$$\le \sum_{i=1}^k\Big(2\hat w_i\|\hat\mu_i - \mu_i\|_2^2 + 2\hat w_i\|\hat\mu - \mu\|_2^2\Big) + 2\max_j \sqrt{\hat w_j}\,\|\mu_j - \mu\|_2\sum_{i=1}^k\sqrt{\hat w_i}\,\big(\|\hat\mu_i - \mu_i\|_2 + \|\hat\mu - \mu\|_2\big)$$
$$\le 2\|\hat\mu - \mu\|_2^2 + \sum_{i=1}^k 2\hat w_i\|\hat\mu_i - \mu_i\|_2^2 + 2\Big(\sqrt k\,\|\hat\mu - \mu\|_2 + \sum_{i=1}^k\sqrt{\hat w_i}\,\|\hat\mu_i - \mu_i\|_2\Big)\max_j \sqrt{\hat w_j}\,\|\mu_j - \mu\|_2.$$
The lemma follows from the triangle inequality.

The following lemma is immediate from Lemmas 26 and 27.

Lemma 28. Given $n$ samples from a $k$-component Gaussian mixture, if Equation (3) holds and the maximum distance between two components is $\le 10k\sigma\big(d\log\frac{n^2}{\delta}\big)^{1/4}$, then
$$\Big\|\sum_{i=1}^k \hat w_i(\hat\mu_i - \hat\mu)(\hat\mu_i - \hat\mu)^t - \sum_{i=1}^k \hat w_i(\mu_i - \mu)(\mu_i - \mu)^t\Big\| \le c\sigma^2\sqrt{\frac{dk^2\log\frac{n^2}{\delta}}{n}} + c\sigma\sqrt{\frac{dk^2\log\frac{n^2}{\delta}}{n}}\,\max_i\sqrt{\hat w_i}\,\|\mu_i - \mu\|_2$$
for a constant $c$.

Lemma 29. For a set of samples $X(1), \ldots, X(n)$ from a $k$-component mixture,
$$\frac1n\sum_{i=1}^n (X(i) - \hat\mu)(X(i) - \hat\mu)^t = \sum_{i=1}^k\Bigg(\hat w_i(\hat\mu_i - \hat\mu)(\hat\mu_i - \hat\mu)^t - \hat w_i(\hat\mu_i - \mu_i)(\hat\mu_i - \mu_i)^t + \sum_{j:\,X(j)\sim p_i}\frac{(X(j) - \mu_i)(X(j) - \mu_i)^t}{n}\Bigg),$$
where $\hat w_i$ and $\hat\mu_i$ are the empirical weight and average of component $i$, and $\hat\mu = \frac1n\sum_{i=1}^n X(i)$.

Proof.
The given expression can be rewritten as
$$\frac{1}{n}\sum_{i=1}^n (X(i)-\hat\mu)(X(i)-\hat\mu)^t = \sum_{i=1}^k \hat w_i\sum_{j:X(j)\sim p_i}\frac{1}{n\hat w_i}(X(j)-\hat\mu)(X(j)-\hat\mu)^t.$$
First observe that for any set of $m$ points $x_1,\ldots,x_m$ with average $\bar x$, and any value $a$,
$$\sum_{i=1}^m (x_i-a)^2 = \sum_{i=1}^m (x_i-\bar x)^2 + m(\bar x-a)^2.$$
Hence, applying the matrix analogue of this identity to the samples from component $i$ (whose empirical average is $\hat\mu_i$),
$$\sum_{j:X(j)\sim p_i}\frac{1}{n\hat w_i}(X(j)-\hat\mu)(X(j)-\hat\mu)^t = (\hat\mu_i-\hat\mu)(\hat\mu_i-\hat\mu)^t + \sum_{j:X(j)\sim p_i}\frac{1}{n\hat w_i}(X(j)-\hat\mu_i)(X(j)-\hat\mu_i)^t$$
$$= (\hat\mu_i-\hat\mu)(\hat\mu_i-\hat\mu)^t - (\hat\mu_i-\mu_i)(\hat\mu_i-\mu_i)^t + \sum_{j:X(j)\sim p_i}\frac{1}{n\hat w_i}(X(j)-\mu_i)(X(j)-\mu_i)^t.$$
Multiplying by $\hat w_i$ and summing over all components results in the lemma.

We now bound the error in estimating the eigenvalues of the covariance matrix.

Lemma 30. Given $X(1),\ldots,X(n)$, $n$ samples from a $k$-component Gaussian mixture, if Equations (1), (2), and (3) hold, then with probability $\ge 1-2\delta$,
$$\left\|\frac{1}{n}\sum_{i=1}^n (X(i)-\hat\mu)(X(i)-\hat\mu)^t - \hat\sigma^2 I_d - \sum_{i=1}^k \hat w_i(\mu_i-\mu)(\mu_i-\mu)^t\right\| \le c(n) \stackrel{\mathrm{def}}{=} c\sigma^2\sqrt{\frac{d\log\frac{n^2}{\delta}}{n}} + \frac{c\sigma^2 dk^2\log\frac{n^2}{\delta}}{n} + c\sigma\sqrt{\frac{dk^2\log\frac{n^2}{\delta}}{n}}\,\max_i\sqrt{\hat w_i}\,\|\mu_i-\mu\|_2, \tag{5}$$
for a constant $c$.

Proof. Since Equations (1), (2), and (3) hold, the conditions in Lemmas 26 and 28 are satisfied. By Lemma 28,
$$\left\|\sum_{i=1}^k \hat w_i(\hat\mu_i-\hat\mu)(\hat\mu_i-\hat\mu)^t - \sum_{i=1}^k \hat w_i(\mu_i-\mu)(\mu_i-\mu)^t\right\| = O\!\left(\frac{\sigma^2 dk^2\log\frac{n^2}{\delta}}{n} + \sigma\sqrt{\frac{dk^2\log\frac{n^2}{\delta}}{n}}\,\max_i\sqrt{\hat w_i}\,\|\mu_i-\mu\|_2\right).$$
Hence it remains to show that
$$\left\|\frac{1}{n}\sum_{i=1}^n (X(i)-\hat\mu)(X(i)-\hat\mu)^t - \sum_{i=1}^k \hat w_i(\hat\mu_i-\hat\mu)(\hat\mu_i-\hat\mu)^t - \hat\sigma^2 I_d\right\| = O\!\left(\sigma^2\sqrt{\frac{kd\log\frac{n^2}{\delta}}{n}}\right).$$
By Lemma 29, the matrix $\frac{1}{n}\sum_{i=1}^n (X(i)-\hat\mu)(X(i)-\hat\mu)^t - \hat\sigma^2 I_d$ can be rewritten as
$$\sum_{i=1}^k \left(\hat w_i(\hat\mu_i-\hat\mu)(\hat\mu_i-\hat\mu)^t - \hat w_i(\hat\mu_i-\mu_i)(\hat\mu_i-\mu_i)^t\right) + \sum_{i=1}^k\sum_{j:X(j)\sim p_i}\frac{1}{n}(X(j)-\mu_i)(X(j)-\mu_i)^t - \hat\sigma^2 I_d. \tag{6}$$
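Lemma 29's decomposition of the sample covariance is an exact algebraic identity, so it can be verified on synthetic labeled data; the sketch below (hypothetical mixture parameters, $\sigma = 1$) checks it to machine precision.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 300, 5, 3
labels = rng.integers(0, k, size=n)
mu = 3 * rng.normal(size=(k, d))                  # component means mu_i
X = mu[labels] + rng.normal(size=(n, d))          # samples with sigma = 1

mu_hat_all = X.mean(axis=0)                       # overall empirical mean
lhs = (X - mu_hat_all).T @ (X - mu_hat_all) / n   # sample covariance matrix

rhs = np.zeros((d, d))
for i in range(k):
    Xi = X[labels == i]
    w_hat = len(Xi) / n                           # empirical weight of component i
    mu_hat_i = Xi.mean(axis=0)                    # empirical mean of component i
    rhs += w_hat * (np.outer(mu_hat_i - mu_hat_all, mu_hat_i - mu_hat_all)
                    - np.outer(mu_hat_i - mu[i], mu_hat_i - mu[i]))
    rhs += (Xi - mu[i]).T @ (Xi - mu[i]) / n      # within-component term around true mean
assert np.allclose(lhs, rhs)
```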
We now bound the norms of the second and third terms in Equation (6). Consider the third term, $\sum_{i=1}^k\sum_{j:X(j)\sim p_i}\frac{1}{n}(X(j)-\mu_i)(X(j)-\mu_i)^t - \hat\sigma^2 I_d$. Conditioned on the fact that $X(j)\sim p_i$, $X(j)-\mu_i$ is distributed $N(0,\sigma^2 I_d)$; therefore by Lemmas 18 and 4, with probability $\ge 1-2\delta$,
$$\left\|\sum_{i=1}^k\sum_{j:X(j)\sim p_i}\frac{1}{n}(X(j)-\mu_i)(X(j)-\mu_i)^t - \hat\sigma^2 I_d\right\| \le c'\sqrt{\frac{d\log^2\frac{d}{\delta}}{n}}\,\sigma^2 + 2.5\sigma^2\sqrt{\frac{\log\frac{n^2}{\delta}}{d}}.$$
The second term in Equation (6) is bounded by Lemma 25. Hence, together with the fact that $d \ge 20\log\frac{n^2}{\delta}$, we get that with probability $\ge 1-2\delta$ the second and third terms are bounded by $O\big(\sigma^2\sqrt{\frac{dk}{n}\log\frac{n^2}{\delta}}\big)$.

Lemma 31. Let $u$ be the top eigenvector of the sample covariance matrix and $n \ge c\cdot dk^2\log\frac{n^2}{\delta}$. If $\max_i\sqrt{\hat w_i}\,\|\mu_i-\mu\|_2 = \alpha\sigma$ and Equation (5) holds, then there exists an $i$ such that
$$u\cdot(\mu_i-\mu) \ge \frac{\sigma\big(\alpha-1-\frac{1}{\alpha}\big)}{\sqrt{k}}.$$
Proof. Observe that for the positive semidefinite matrix $\sum_j w_j v_j v_j^t$,
$$\left\|\sum_j w_j v_j v_j^t\right\| \ge \frac{v_i^t}{\|v_i\|_2}\left(\sum_j w_j v_j v_j^t\right)\frac{v_i}{\|v_i\|_2} \ge w_i\|v_i\|_2^2.$$
Therefore, taking $v_j = \mu_j-\mu$ and $i$ attaining the maximum,
$$\left\|\sum_{i=1}^k \hat w_i(\mu_i-\mu)(\mu_i-\mu)^t\right\| \ge \alpha^2\sigma^2.$$
Hence by Lemma 30 and the triangle inequality, the largest eigenvalue of the sample-covariance matrix is $\ge \alpha^2\sigma^2 - c(n)$. Similarly, by applying Lemma 30 again, we get
$$\left\|\sum_{i=1}^k \hat w_i(\mu_i-\mu)(\mu_i-\mu)^t\,u\right\|_2 \ge \alpha^2\sigma^2 - 2c(n).$$
By the triangle and Cauchy–Schwarz inequalities,
$$\left\|\sum_{i=1}^k \hat w_i(\mu_i-\mu)(\mu_i-\mu)^t\,u\right\|_2 \le \sum_{i=1}^k \hat w_i\left\|(\mu_i-\mu)(\mu_i-\mu)^t\,u\right\|_2 \le \sum_{i=1}^k \hat w_i\|\mu_i-\mu\|_2\,\max_j\,(\mu_j-\mu)\cdot u \le \sqrt{\sum_{i=1}^k \hat w_i\|\mu_i-\mu\|_2^2}\;\max_j\,(\mu_j-\mu)\cdot u \le \sqrt{k}\,\alpha\sigma\,\max_j\,(\mu_j-\mu)\cdot u.$$
Hence $\sqrt{k}\,\alpha\sigma\,\max_i\,(\mu_i-\mu)\cdot u \ge \alpha^2\sigma^2 - 2c(n)$. The lemma follows by substituting the bound on $n$ into $c(n)$.

We now make a simple observation on Gaussian mixtures.

Fact 32.
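A small numerical sketch of the phenomenon behind Lemma 31: when one component sits far from the overall mean, the top eigenvector of the centered sample covariance aligns with its direction. The parameters below are illustrative only, not the ones required by the lemma.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, sigma = 20000, 50, 1.0
# two equal-weight components: one at the origin, one far along a hidden direction
direction = np.zeros(d)
direction[0] = 1.0
mu_far = 20 * sigma * direction
labels = rng.random(n) < 0.5
X = np.where(labels[:, None], mu_far, 0.0) + sigma * rng.normal(size=(n, d))

mean = X.mean(axis=0)
# centered sample covariance with the known sigma^2 I_d subtracted off
S = (X - mean).T @ (X - mean) / n - sigma**2 * np.eye(d)
eigvals, eigvecs = np.linalg.eigh(S)
u = eigvecs[:, -1]                    # top eigenvector

# u is aligned (up to sign) with the far component's direction mu_far - mean
overlap = abs(u @ (mu_far - mean)) / np.linalg.norm(mu_far - mean)
assert overlap > 0.9
```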
The samples from a subset $A$ of the components of the Gaussian mixture are distributed according to a Gaussian mixture over the components in $A$, with weights $w_i' = w_i/\big(\sum_{j\in A} w_j\big)$.

We now prove Lemma 6.

Proof of Lemma 6. Observe that we run the recursive clustering at most $n$ times. At every step, the underlying distribution within a cluster is a Gaussian mixture. Let Equations (1) and (2) hold with probability $1-2\delta$. Let Equations (3)–(5) all hold with probability $\ge 1-\delta'$, where $\delta' = \frac{\delta}{2n}$, at each of the $n$ steps. By the union bound the total error is $\le 2\delta + \delta'\cdot 2n \le 3\delta$. Since Equations (1) and (2) hold, the conditions of Lemmas 4 and 5 hold. Furthermore, it can be shown that discarding at most $\frac{n\epsilon}{4k}$ samples at each step does not affect the calculations.

We first show that if $\sqrt{w_i}\,\|\mu_i-\mu(C)\|_2 \ge 25\sqrt{k^3\log\frac{n^3}{\delta}}\,\sigma$, then the algorithm enters the loop. Let $w_i'$ be the weight of the component within the cluster and let $n' \ge \frac{n\epsilon}{5k}$ be the number of samples in the cluster. Let $\alpha = 25\sqrt{k^3\log\frac{n^3}{\delta}}$. By Fact 32, the components in cluster $C$ have weight $w_i' \ge w_i$. Hence $\sqrt{w_i'}\,\|\mu_i-\mu(C)\|_2 \ge \alpha\sigma$. Since $\sqrt{w_i'}\,\|\mu_i-\mu(C)\|_2 \ge \alpha\sigma$ and, by Lemma 5, $\|\mu_i-\mu(C)\|_2 \le 10k\sigma\big(d\log\frac{n^2}{\delta}\big)^{1/4}$, we have
$$w_i' \ge \frac{\alpha^2}{100k^2\sqrt{d\log\frac{n^2}{\delta}}}.$$
Hence by Lemma 24, $\hat w_i' \ge \frac{w_i'}{2}$ and $\sqrt{\hat w_i'}\,\|\mu_i-\mu(C)\|_2 \ge \frac{\alpha\sigma}{\sqrt{2}}$. Hence by Lemma 30 and the triangle inequality, the largest eigenvalue of $S(C)$ is
$$\ge \frac{\alpha^2\sigma^2}{2} - c(n') \ge \frac{\alpha^2\sigma^2}{4} \ge \frac{\alpha^2\hat\sigma^2}{8} \ge 12\hat\sigma^2 k^3\log\frac{n^2}{\delta'} \ge 12\hat\sigma^2 k^3\log\frac{n^3}{\delta}.$$
Therefore the algorithm enters the loop. If $n' \ge \frac{n\epsilon}{8k^2} \ge c\cdot dk^2\log\frac{n^3}{\delta}$, then by Lemma 31 there exists a component $i$ such that
$$u\cdot(\mu_i-\mu(C)) \ge \frac{\sigma\big(\frac{\alpha}{\sqrt2}-1-\frac{\sqrt2}{\alpha}\big)}{\sqrt{k}},$$
where $u$ is the top eigenvector of the sample covariance of the first $\frac{n\epsilon}{4k^2}$ samples. Observe that $\sum_{i\in C} w_i'\,u\cdot(\mu_i-\mu(C)) = 0$ and
$$\max_i\,u\cdot(\mu_i-\mu(C)) \ge \frac{\sigma\big(\frac{\alpha}{\sqrt2}-1-\frac{\sqrt2}{\alpha}\big)}{\sqrt{k}}.$$
Let the $\mu_i$ be sorted according to their values of $u\cdot(\mu_i-\mu(C))$. Then
$$\max_i\,u\cdot(\mu_i-\mu_{i+1}) \ge \frac{\sigma\big(\frac{\alpha}{\sqrt2}-1-\frac{\sqrt2}{\alpha}\big)}{k^{3/2}} \ge 12\sigma\sqrt{\log\frac{n^3}{\delta}} \ge 9\hat\sigma\sqrt{\log\frac{n^3}{\delta}},$$
where the last inequality follows from Lemma 4 and the fact that $d \ge 20\log\frac{n^2}{\delta}$. For a sample from component $p_i$, similar to the proof of Lemma 5, by Lemma 15, with probability $\ge 1-\frac{\delta}{n^2 k}$,
$$u\cdot(X(i)-\mu_i) \le \sigma\sqrt{2\log\frac{n^2k}{\delta}} \le 2\hat\sigma\sqrt{\log\frac{n^2k}{\delta}},$$
where the second inequality follows from Lemma 4. Since there are two components whose projections are $\ge 9\hat\sigma\sqrt{\log\frac{n^3}{\delta}}$ apart, and the maximum distance between the projection of a sample and that of its mean is $\le 2\hat\sigma\sqrt{\log\frac{n^2k}{\delta}}$, the algorithm divides the cluster into at least two non-empty clusters such that no two samples from the same component are placed in different clusters.

For the second part, observe that by the above concentration along $u$, no two samples from the same component are clustered differently, irrespective of the mean separation. Note that we use the fact that each sample is clustered at most $2k$ times to bound the error probability. The total error probability, by the union bound, is $\le 4\delta$.

D.4 Proof of Lemma 7

We show that if the conclusions of Lemmas 6 and 24 hold, then the lemma is satisfied. We also assume that the conclusion of Lemma 30 holds for all the clusters, with error probability $\delta' = \frac{\delta}{k}$ each. By the union bound the total error probability is $\le 7\delta$. By Lemma 6, all the components within each cluster satisfy $\sqrt{w_i}\,\|\mu_i-\mu(C)\|_2 \le 25\sigma\sqrt{k^3\log\frac{n^3}{\delta}}$. Let $n \ge c\cdot dk^9\epsilon^{-4}\log^2\frac{d}{\delta}$. For notational convenience let
$$S(C) = \frac{1}{|C|}\sum_{i=1}^{|C|}(X(i)-\mu(C))(X(i)-\mu(C))^t - \hat\sigma^2 I_d.$$
Therefore by Lemma 30, for large enough $c$,
$$\left\|S(C) - \frac{n}{|C|}\sum_{i\in C}\hat w_i(\mu_i-\mu(C))(\mu_i-\mu(C))^t\right\| \le \frac{\epsilon^2\sigma^2}{1000k^2}\cdot\frac{n}{|C|}.$$
Let $v_1,v_2,\ldots,v_{k-1}$ be the top eigenvectors of $\frac{n}{|C|}\sum_{i\in C}\hat w_i(\mu_i-\mu(C))(\mu_i-\mu(C))^t$.
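The clustering step argued above — project onto $u$, sort, and split at a gap larger than the within-component spread — can be sketched in one dimension as follows. The separation and threshold constants here are illustrative stand-ins, not the exact ones from the proof.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma, n = 1.0, 2000
# projections of a well-separated two-component mixture onto the top eigenvector u
proj = np.where(rng.random(n) < 0.4, 0.0, 30.0) + sigma * rng.normal(size=n)

order = np.argsort(proj)
gaps = np.diff(proj[order])
cut = np.argmax(gaps)                        # split at the largest empirical gap
threshold = 9 * sigma * np.sqrt(np.log(n))   # separation scale, taking sigma-hat ~ sigma
assert gaps[cut] > threshold / 2             # a clear gap exists between the components

midpoint = proj[order][cut] + gaps[cut] / 2
cluster = proj > midpoint
# no sample lands on the wrong side: the cut separates the two components cleanly
assert np.all(proj[cluster] > 15) and np.all(proj[~cluster] < 15)
```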
Let $\eta_i = \sqrt{\hat w_i'}\,\|\mu_i-\mu(C)\|_2 = \sqrt{\frac{\hat w_i n}{|C|}}\,\|\mu_i-\mu(C)\|_2$ and let $\Delta_i = \frac{\mu_i-\mu(C)}{\|\mu_i-\mu(C)\|_2}$. Therefore,
$$\frac{n}{|C|}\sum_{i\in C}\hat w_i(\mu_i-\mu(C))(\mu_i-\mu(C))^t = \sum_{i\in C}\eta_i^2\,\Delta_i\Delta_i^t.$$
Hence by Lemma 20, the projection of $\Delta_i$ on the space orthogonal to the top $k-1$ eigenvectors of $S(C)$ is
$$\le \sqrt{\frac{\epsilon^2\sigma^2}{1000k^2}\cdot\frac{n}{|C|}}\cdot\frac{1}{\eta_i} \le \frac{\epsilon\sigma}{16\sqrt{\hat w_i}\,\|\mu_i-\mu(C)\|_2\,k} \le \frac{\epsilon\sigma}{8\sqrt{2}\,\sqrt{w_i}\,\|\mu_i-\mu(C)\|_2\,k}.$$
The last inequality follows from the bound on $\hat w_i$ in Lemma 24.

D.5 Proof of Theorem 8

We show that the theorem holds if the conclusions of Lemmas 7 and 26 hold with error probability $\delta' = \frac{\delta}{k}$ each. Since the proof of Lemma 7 already includes the probability that Lemma 6 holds, Lemma 6 also holds with the same probability. Since there are at most $k$ clusters, by the union bound the total error probability is $\le 9\delta$.

For every component $i$, we show that there is a choice of mean vector and weight in the search step such that $w_i D(p_i,\hat p_i) \le \frac{\epsilon}{2k}$ and $|w_i-\hat w_i| \le \frac{\epsilon}{4k}$. That would imply that there is an $\hat f$ during the search such that
$$D(f,\hat f) \le \sum_{i=1}^k w_i D(p_i,\hat p_i) + 2\sum_{i=1}^{k-1}|w_i-\hat w_i| \le \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon.$$
Since the weights are gridded with spacing $\frac{\epsilon}{4k}$, there exists a $\hat w_i$ such that $|w_i-\hat w_i| \le \frac{\epsilon}{4k}$. We now show that there exists a choice of mean vector such that $w_i D(p_i,\hat p_i) \le \frac{\epsilon}{2k}$. Note that if a component has weight $\le \frac{\epsilon}{4k}$, the above inequality follows immediately. Therefore we only consider components with $w_i \ge \frac{\epsilon}{4k}$; by Lemma 24, for such components $\hat w_i \ge \frac{\epsilon}{5k}$, and therefore we only consider clusters with $|C| \ge \frac{n\epsilon}{5k}$. By Lemma 14, for any $i$,
$$D(p_i,\hat p_i)^2 \le \frac{2\sum_{j=1}^d(\mu_{i,j}-\hat\mu_{i,j})^2}{\sigma^2} + \frac{8d(\sigma^2-\hat\sigma^2)^2}{\sigma^4}.$$
Note that we discard at most $\frac{n\epsilon}{8k^2}$ random samples at each step, so a total of $\le \frac{n\epsilon}{8k}$ random samples are discarded.
It can be shown that this does not affect our calculations, and we ignore it in this proof. By Lemma 4, the first estimate of $\sigma^2$ satisfies $|\hat\sigma^2-\sigma^2| \le 2.5\sigma^2\sqrt{\frac{\log\frac{n^2}{\delta}}{d}}$. Hence, while searching over values of $\hat\sigma^2$, there exists one, $\sigma'^2$, such that $|\sigma'^2-\sigma^2| \le \frac{\epsilon\sigma^2}{\sqrt{64dk^2}}$. Hence,
$$D(p_i,\hat p_i)^2 \le \frac{2\|\mu_i-\hat\mu_i\|_2^2}{\sigma^2} + \frac{\epsilon^2}{8k^2}.$$
Therefore, if we show that there is a mean vector $\hat\mu_i$ during the search such that $\|\mu_i-\hat\mu_i\|_2 \le \frac{\epsilon\sigma}{16k\sqrt{2\hat w_i}}$, that would prove the lemma. By the triangle inequality,
$$\|\mu_i-\hat\mu_i\|_2 \le \|\mu(C)-\hat\mu(C)\|_2 + \|\mu_i-\mu(C)-(\hat\mu_i-\hat\mu(C))\|_2.$$
By Lemma 26, for large enough $n$,
$$\|\mu(C)-\hat\mu(C)\|_2 \le c\sigma\sqrt{\frac{dk\log\frac{n^2}{\delta}}{|C|}} \le \frac{\epsilon\sigma}{8k\sqrt{w_i}}.$$
The second inequality follows from the bound on $n$ and the fact that $|C| \ge n\hat w_i$. Since $w_i \ge \frac{\epsilon}{4k}$ and, by Lemma 24, $\hat w_i \ge \frac{w_i}{2}$, we have
$$\|\mu_i-\hat\mu_i\|_2 \le \|\mu_i-\mu(C)-(\hat\mu_i-\hat\mu(C))\|_2 + \frac{\epsilon\sigma}{8k\sqrt{w_i}}.$$
Let $u_1,\ldots,u_{k-1}$ be the top eigenvectors of the sample covariance matrix of cluster $C$. We now show that during the search there is a vector of the form $\sum_{j=1}^{k-1} g_j\epsilon_g\hat\sigma u_j$ such that
$$\left\|\mu_i-\mu(C)-\sum_{j=1}^{k-1}g_j\epsilon_g\hat\sigma u_j\right\|_2 \le \frac{\epsilon\sigma}{8k\sqrt{w_i}},$$
thus proving the lemma. Let $\eta_i = \sqrt{w_i}\,\|\mu_i-\mu(C)\|_2$. By Lemma 7, there is a set of coefficients $\alpha_j$ such that
$$\frac{\mu_i-\mu(C)}{\|\mu_i-\mu(C)\|_2} = \sum_{j=1}^{k-1}\alpha_j u_j + \sqrt{1-\alpha^2}\,u',$$
where $u'$ is perpendicular to $u_1,\ldots,u_{k-1}$, $\alpha^2 = \sum_{j=1}^{k-1}\alpha_j^2$, and $\sqrt{1-\alpha^2} \le \frac{\epsilon\sigma}{8\sqrt{2}\,\eta_i k}$. Hence, we have
$$\mu_i-\mu(C) = \sum_{j=1}^{k-1}\|\mu_i-\mu(C)\|_2\,\alpha_j u_j + \|\mu_i-\mu(C)\|_2\sqrt{1-\alpha^2}\,u'.$$
Since $w_i \ge \frac{\epsilon}{4k}$, by Lemma 6, $\eta_i \le 25\sqrt{k^3\log\frac{n^3}{\delta}}\,\sigma$ and $\|\mu_i-\mu(C)\|_2 \le 100\sqrt{k^4\epsilon^{-1}\log\frac{n^3}{\delta}}\,\sigma$. Therefore, since the grid has spacing $\epsilon_g\hat\sigma$, there exist integers $g_j$ such that $\big|g_j\epsilon_g\hat\sigma - \|\mu_i-\mu(C)\|_2\,\alpha_j\big| \le \epsilon_g\hat\sigma$ along each eigenvector. Hence,
$$w_i\left\|\mu_i-\mu(C)-\sum_{j=1}^{k-1}g_j\epsilon_g\hat\sigma u_j\right\|_2^2 \le w_i k\epsilon_g^2\hat\sigma^2 + w_i\|\mu_i-\mu(C)\|_2^2(1-\alpha^2) \le k\epsilon_g^2\hat\sigma^2 + \eta_i^2\cdot\frac{\epsilon^2\sigma^2}{128\,\eta_i^2 k^2} \le \frac{\epsilon^2\sigma^2}{128k^2} + \frac{\epsilon^2\sigma^2}{128k^2} \le \frac{\epsilon^2\sigma^2}{64k^2}.$$
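The gridding argument — rounding the coefficients of $\mu_i-\mu(C)$ in the eigenbasis to multiples of $\epsilon_g\hat\sigma$ — can be sketched numerically. With nearest-multiple rounding the per-coordinate error is at most $\epsilon_g\hat\sigma/2$ (the proof uses the cruder bound $\epsilon_g\hat\sigma$). The directions and coefficients below are hypothetical stand-ins for the sample-covariance eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(5)
d, k = 10, 3
# k-1 orthonormal directions, stand-ins for the top eigenvectors u_1,...,u_{k-1}
U = np.linalg.qr(rng.normal(size=(d, k - 1)))[0]
sigma_hat = 1.0
eps_g = 0.05                               # grid spacing eps_g * sigma_hat per coordinate

target = U @ np.array([2.317, -1.682])     # a vector lying in the span of u_1,...,u_{k-1}
coeffs = U.T @ target                      # its coordinates in the eigenbasis
g = np.round(coeffs / (eps_g * sigma_hat)) # nearest grid point
candidate = U @ (g * eps_g * sigma_hat)

# per-coordinate rounding error <= eps_g * sigma_hat / 2, so the total error
# (the target has no component orthogonal to the span here) is at most
# sqrt(k-1) * eps_g * sigma_hat / 2
err = np.linalg.norm(candidate - target)
assert err <= np.sqrt(k - 1) * eps_g * sigma_hat / 2 + 1e-12
```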
The last inequality follows from Lemma 4 and the fact that $\epsilon_g \le \frac{\epsilon}{16k^{3/2}}$, and hence the theorem follows. The running time can be computed easily by retracing the steps of the algorithm and using an efficient implementation of single-linkage clustering.