Faster and Sample Near-Optimal Algorithms for Proper Learning Mixtures of Gaussians



Constantinos Daskalakis∗ (EECS, MIT; costis@mit.edu)    Gautam Kamath† (EECS, MIT; g@csail.mit.edu)

October 29, 2018

Abstract

We provide an algorithm for properly learning mixtures of two single-dimensional Gaussians without any separability assumptions. Given $\tilde{O}(1/\varepsilon^2)$ samples from an unknown mixture, our algorithm outputs a mixture that is $\varepsilon$-close in total variation distance, in time $\tilde{O}(1/\varepsilon^5)$. Our sample complexity is optimal up to logarithmic factors, and significantly improves upon both Kalai et al. [KMV10], whose algorithm has a prohibitive dependence on $1/\varepsilon$, and Feldman et al. [FOS06], whose algorithm requires bounds on the mixture parameters and depends pseudo-polynomially on these parameters.

One of our main contributions is an improved and generalized algorithm for selecting a good candidate distribution from among competing hypotheses. Namely, given a collection of $N$ hypotheses containing at least one candidate that is $\varepsilon$-close to an unknown distribution, our algorithm outputs a candidate which is $O(\varepsilon)$-close to the distribution. The algorithm requires $O(\log N/\varepsilon^2)$ samples from the unknown distribution and $O(N \log N/\varepsilon^2)$ time, which improves previous such results (such as the Scheffé estimator) from a quadratic dependence of the running time on $N$ to quasilinear. Given the wide use of such results for the purpose of hypothesis selection, our improved algorithm implies immediate improvements to any such use.

∗ Supported by a Sloan Foundation Fellowship, a Microsoft Research Faculty Fellowship, and NSF Awards CCF-0953960 (CAREER) and CCF-1101491.
† Part of this work was done while the author was supported by an Akamai Presidential Fellowship.

1 Introduction

Learning mixtures of Gaussian distributions is one of the most fundamental problems in statistics, with a multitude of applications in the natural and social sciences, and it has recently received considerable attention in the computer science literature. Given independent samples from an unknown mixture of Gaussians, the task is to 'learn' the underlying mixture. In one version of the problem, 'learning' means estimating the parameters of the mixture, that is, the mixing probabilities as well as the parameters of each constituent Gaussian. The most popular heuristic for doing so is running the EM algorithm on samples from the mixture [DLR77], albeit no rigorous guarantees are known for it in general.

A line of research initiated by Dasgupta [Das99, AK01, VW02, AM05, BV08] provides rigorous guarantees under separability conditions: roughly speaking, it is assumed that the constituent Gaussians have variation distance bounded away from 0 (indeed, in some cases, distance exponentially close to 1). This line of work was recently settled by a triplet of breakthrough results [KMV10, MV10, BS10], establishing the polynomial solvability of the problem under minimal separability conditions for the parameters to be recoverable in the first place: for any $\varepsilon > 0$, polynomially many (in $n$ and $1/\varepsilon$) samples from a mixture of $n$-dimensional Gaussians suffice to recover the parameters of the mixture in $\mathrm{poly}(n, 1/\varepsilon)$ time.
While these results settle the polynomial solvability of the problem, they serve more as a proof of concept, in that their dependence on $1/\varepsilon$ is quite expensive.¹ Indeed, even for mixtures of two single-dimensional Gaussians, a practically efficient algorithm for this problem is unknown.

A weaker goal for the learner of Gaussian mixtures is this: given samples from an unknown mixture, find any mixture that is close to the unknown one, for some notion of closeness. This PAC-style version of the problem [KMR+94] was pursued by Feldman et al. [FOS06], who obtained efficient learning algorithms for mixtures of $n$-dimensional, axis-aligned Gaussians. Given $\mathrm{poly}(n, 1/\varepsilon, L)$ samples from such a mixture, their algorithm constructs a mixture whose KL divergence to the sampled one is at most $\varepsilon$. Unfortunately, the sample and time complexity of their algorithm depends polynomially on a (priorly known) bound $L$, determining the range of the means and variances of the constituent Gaussians in every dimension.² In particular, the algorithm has pseudo-polynomial dependence on $L$, where there shouldn't be any dependence on $L$ at all [KMV10, MV10, BS10].

Finally, a yet weaker goal for the learner would be to construct any distribution that is close to the unknown mixture. In this non-proper version of the problem, the learner is not restricted to output a Gaussian mixture, but can output any (representation of a) distribution that is close to the unknown mixture. For this problem, recent results of Chan et al. [CDSS14] provide algorithms for single-dimensional mixtures, whose sample complexity has near-optimal dependence on $1/\varepsilon$. Namely, given $\tilde{O}(1/\varepsilon^2)$ samples from a single-dimensional mixture, they construct a piecewise polynomial distribution that is $\varepsilon$-close in total variation distance.

Inspired by this recent progress on non-properly learning single-dimensional mixtures, our goal in this paper is to provide sample-optimal algorithms that properly learn. We obtain such algorithms for mixtures of two single-dimensional Gaussians. Namely,

Theorem 1. For all $\varepsilon, \delta > 0$, given $\tilde{O}(\log(1/\delta)/\varepsilon^2)$ independent samples from an arbitrary mixture $F$ of two univariate Gaussians, we can compute in time $\tilde{O}(\log^3(1/\delta)/\varepsilon^5)$ a mixture $F'$ such that $d_{TV}(F, F') \le \varepsilon$ with probability at least $1 - \delta$. The expected running time of this algorithm is $\tilde{O}(\log^2(1/\delta)/\varepsilon^5)$.

We note that learning a univariate mixture often lies at the heart of learning multivariate mixtures [KMV10, MV10], so it is important to understand this fundamental case.

Discussion. Note that our algorithm makes no separability assumptions about the constituent Gaussians of the unknown mixture, nor does it require or depend on a bound on the mixture's parameters. Also, because the mixture is single-dimensional, it is not amenable to the techniques of [HK13].

¹ For example, the single-dimensional algorithm at the heart of [KMV10] has sample and time complexity of $\Theta(1/\varepsilon^{300})$ and $\Omega(1/\varepsilon^{1377})$ respectively (even though the authors most certainly did not intend to optimize their constants).

² In particular, it is assumed that every constituent Gaussian in every dimension has mean $\mu \in [-\mu_{\max}, \mu_{\max}]$ and variance $\sigma^2 \in [\sigma^2_{\min}, \sigma^2_{\max}]$, where $\mu_{\max}\sigma_{\max}/\sigma_{\min} \le L$.
Moreover, it is easy to see that our sample complexity is optimal up to logarithmic factors. Indeed, a Gaussian mixture can trivially simulate a Bernoulli distribution as follows. Let $Z$ be a Bernoulli random variable that is $0$ with probability $1-p$ and $1$ with probability $p$. Clearly, $Z$ can be viewed as a mixture of two Gaussian random variables of $0$ variance, which have means $0$ and $1$ and are mixed with probabilities $1-p$ and $p$ respectively. It is known that $1/\varepsilon^2$ samples are needed to properly learn a Bernoulli distribution, hence this lower bound immediately carries over to Gaussian mixtures.

Approach. Our algorithm is intuitively quite simple, although some care is required to make the ideas work. First, we can guess the mixing weight up to additive error $O(\varepsilon)$, and proceed with our algorithm pretending that our guess is correct. Every guess will result in a collection of candidate distributions, and the final step of our algorithm is a tournament that will select, from among all candidate distributions produced in the course of our algorithm, a distribution that is $\varepsilon$-close to the unknown mixture, if such a distribution exists. To do this we will make use of the following theorem, which is our second main contribution in this paper. (See Theorem 19 for a precise statement.)

Informal Theorem 19. There exists an algorithm FastTournament that takes as input sample access to an unknown distribution $X$ and a collection of candidate hypothesis distributions $H_1, \ldots, H_N$, as well as an accuracy parameter $\varepsilon > 0$, and has the following behavior: if there exists some distribution among $H_1, \ldots, H_N$ that is $\varepsilon$-close to $X$, then the distribution output by the algorithm is $O(\varepsilon)$-close to $X$. Moreover, the number of samples drawn by the algorithm from each of the distributions is $O(\log N/\varepsilon^2)$ and the running time is $O(N \log N/\varepsilon^2)$.

Devising a hypothesis selection algorithm with the performance guarantees of Theorem 19 requires strengthening the Scheffé-estimate-based approach described in Chapter 6 of [DL01] (see [Yat85, DL96, DL97], as well as the recent papers of Daskalakis et al. [DDS12] and Chan et al. [CDSS13]) to continuous distributions whose crossings are difficult to compute exactly, as well as to sample-only access to all involved distributions, but most importantly improving the running time to almost linear in the number $N$ of candidate distributions. Further comparison of our new hypothesis selection algorithm to related work is provided in Section 4. It is also worth noting that the tournament-based approach of [FOS06] cannot be used for our purposes in this paper, as it would require a priorly known bound on the mixture's parameters and would depend pseudopolynomially on this bound.

Tuning the number of samples according to the guessed mixing weight, we proceed to draw samples from the unknown mixture, expecting that some of these samples will fall sufficiently close to the means of the constituent Gaussians, where the closeness will depend on the number of samples drawn as well as the unknown variances. We guess which sample falls close to the mean of the constituent Gaussian that has the smaller value of $\sigma/w$ (standard deviation to mixing weight ratio), which gives us the second parameter of the mixture.
To pin down the variance of this Gaussian, we implement a natural idea. Intuitively, if we draw samples from the mixture, we expect that the constituent Gaussian with the smallest $\sigma/w$ will determine the smallest distance among the samples. Pursuing this idea we produce a collection of variance candidates, one of which truly corresponds to the variance of this Gaussian, giving us a third parameter.

At this point, we have a complete description of one of the component Gaussians. If we could remove this component from the mixture, we would be left with the remaining unknown Gaussian. Our approach is to generate an empirical distribution of the mixture and "subtract out" the component that we already know, giving us an approximation to the unknown Gaussian. For the purposes of estimating the two parameters of this unknown Gaussian, we observe that the most traditional estimates of location and scale are unreliable, since the error in our approximation may cause probability mass to be shifted to arbitrary locations. Instead, we use robust statistics to obtain approximations to these two parameters.

The empirical distribution of the mixture is generated using the Dvoretzky–Kiefer–Wolfowitz (DKW) inequality [DKW56]. With $O(1/\varepsilon^2)$ samples from an arbitrary distribution, this algorithm generates an $\varepsilon$-approximation to the distribution (with respect to the Kolmogorov metric). Since this result applies to arbitrary distributions, it generates a hypothesis that is weak in some senses, including the choice of distance metric. In particular, the hypothesis distribution output by the DKW inequality is discrete, resulting in a total variation distance of $1$ from a mixture of Gaussians (or any other continuous distribution), regardless of the accuracy parameter $\varepsilon$. Thus, we consider it interesting that such a weak hypothesis can be used as a tool to generate a stronger, proper hypothesis. We note that the Kolmogorov distance metric is not special here: an approximation with respect to other reasonable distance metrics may be substituted in, as long as the description of the hypothesis is efficiently manipulable in the appropriate ways.

We show that, for any target total variation distance $\varepsilon$, the number of samples required to execute the steps outlined above, both to produce a collection of candidate hypotheses one of which is $\varepsilon$-close to the unknown mixture and to run the tournament that selects from among the candidate distributions, is $\tilde{O}(1/\varepsilon^2)$. The running time is $\tilde{O}(1/\varepsilon^5)$.

Comparison to Prior Work on Learning Gaussian Mixtures. In comparison to the recent breakthrough results [KMV10, MV10, BS10], our algorithm has near-optimal sample complexity and much milder running time, whereas these results have quite expensive dependence of both their sample and time complexity on the accuracy $\varepsilon$, even for single-dimensional mixtures.³ On the other hand, our algorithm has weaker guarantees in that we properly learn but don't do parameter estimation. In comparison to [FOS06], our algorithm requires no bounds on the parameters of the constituent Gaussians and exhibits no pseudo-polynomial dependence of the sample and time complexity on such bounds. On the other hand, we learn with respect to the total variation distance rather than the KL divergence.
Finally, compared to [CDSS13, CDSS14], we properly learn while they non-properly learn, and we both have near-optimal sample complexity.

Recently and independently, Acharya et al. [AJOS14a] have also provided algorithms for properly learning spherical Gaussian mixtures. Their primary focus is on the high-dimensional case, aiming at a near-linear sample dependence on the dimension. Our focus is instead on optimizing the dependence of the sample and time complexity on $\varepsilon$ in the one-dimensional case. In fact, [AJOS14a] also study mixtures of $k$ Gaussians in one dimension, providing a proper learning algorithm with near-optimal sample complexity of $\tilde{O}(k/\varepsilon^2)$ and running time $\tilde{O}_k(1/\varepsilon^{3k+1})$. Specializing to two single-dimensional Gaussians ($k = 2$), their algorithm has near-optimal sample complexity, like ours, but is slower by a factor of $O(1/\varepsilon^2)$ than ours. We also remark that, through a combination of techniques from our paper and theirs, a proper learning algorithm for mixtures of $k$ Gaussians can be obtained, with near-optimal sample complexity of $\tilde{O}(k/\varepsilon^2)$ and running time $\tilde{O}_k(1/\varepsilon^{3k-1})$, improving by a factor of $O(1/\varepsilon^2)$ the running time of their $k$-Gaussian algorithm. Roughly, this algorithm creates candidate distributions in which the parameters of the first $k-1$ components are generated using methods from [AJOS14a], and the parameters of the final component are determined using our robust statistical techniques, in which we "subtract out" the first $k-1$ components and robustly estimate the mean and variance of the remainder.

³ For example, compared to [KMV10] we improve by a factor of at least 150 the exponent of both the sample and time complexity.

2 Preliminaries

Let $\mathcal{N}(\mu, \sigma^2)$ represent the univariate normal distribution, with mean $\mu \in \mathbb{R}$ and variance $\sigma^2 \in \mathbb{R}$, with density function

$$\mathcal{N}(\mu, \sigma^2, x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$

The univariate half-normal distribution with parameter $\sigma^2$ is the distribution of $|Y|$ where $Y$ is distributed according to $\mathcal{N}(0, \sigma^2)$. The CDF of the half-normal distribution is

$$F(\sigma, x) = \operatorname{erf}\left(\frac{x}{\sigma\sqrt{2}}\right),$$

where $\operatorname{erf}(x)$ is the error function, defined as

$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\,dt.$$

We also make use of the complement of the error function, $\operatorname{erfc}(x)$, defined as $\operatorname{erfc}(x) = 1 - \operatorname{erf}(x)$.

A Gaussian mixture model (GMM) of distributions $\mathcal{N}_1(\mu_1, \sigma_1^2), \ldots, \mathcal{N}_n(\mu_n, \sigma_n^2)$ has PDF

$$f(x) = \sum_{i=1}^n w_i \mathcal{N}(\mu_i, \sigma_i^2, x),$$

where $\sum_i w_i = 1$. These $w_i$ are referred to as the mixing weights. Drawing a sample from a GMM can be visualized as the following process: select a single Gaussian, where the probability of selecting a Gaussian is equal to its mixing weight, and draw a sample from that Gaussian. In this paper, we consider mixtures of two Gaussians, so $w_2 = 1 - w_1$. We will interchangeably use $w$ and $1-w$ in place of $w_1$ and $w_2$.

The total variation distance between two probability measures $P$ and $Q$ on a $\sigma$-algebra $\mathcal{F}$ is defined by

$$d_{TV}(P, Q) = \sup_{A \in \mathcal{F}} |P(A) - Q(A)| = \frac{1}{2}\|P - Q\|_1.$$
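As a concrete companion to these definitions, the following minimal Python sketch (ours, not part of the paper) evaluates the PDF of a two-component univariate GMM and implements the two-step sampling process just described.

```python
import math
import random

def gaussian_pdf(mu, sigma2, x):
    """Density N(mu, sigma2, x) of a univariate Gaussian."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def gmm_pdf(w, mu1, s1sq, mu2, s2sq, x):
    """PDF of the mixture w * N(mu1, s1sq) + (1 - w) * N(mu2, s2sq)."""
    return w * gaussian_pdf(mu1, s1sq, x) + (1 - w) * gaussian_pdf(mu2, s2sq, x)

def gmm_sample(w, mu1, s1sq, mu2, s2sq):
    """Two-step sampling: pick a component by its mixing weight, then sample it."""
    if random.random() < w:
        return random.gauss(mu1, math.sqrt(s1sq))
    return random.gauss(mu2, math.sqrt(s2sq))
```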
For simplicity in the exposition of our algorithm, we make the standard assumption (see, e.g., [FOS06, KMV10]) of infinite precision real arithmetic. In particular, the samples we draw from a mixture of Gaussians are real numbers, and we can do exact computations on real numbers; e.g., we can exactly evaluate the PDF of a Gaussian distribution at a real number.

2.1 Bounds on Total Variation Distance for GMMs

We recall a result from [DDO+13]:

Proposition 2 (Proposition B.4 of [DDO+13]). Let $\mu_1, \mu_2 \in \mathbb{R}$ and $0 \le \sigma_1 \le \sigma_2$. Then

$$d_{TV}\left(\mathcal{N}(\mu_1, \sigma_1^2), \mathcal{N}(\mu_2, \sigma_2^2)\right) \le \frac{1}{2}\left(\frac{|\mu_1 - \mu_2|}{\sigma_1} + \frac{\sigma_2^2 - \sigma_1^2}{\sigma_1^2}\right).$$

The following proposition, whose proof is deferred to the appendix, provides a bound on the total variation distance between two GMMs in terms of the distance between the constituent Gaussians.

Proposition 3. Suppose we have two GMMs $X$ and $Y$, with PDFs $w N_1 + (1-w) N_2$ and $\hat{w}\hat{N}_1 + (1-\hat{w})\hat{N}_2$ respectively. Then

$$d_{TV}(X, Y) \le |w - \hat{w}| + w\, d_{TV}(N_1, \hat{N}_1) + (1-w)\, d_{TV}(N_2, \hat{N}_2).$$

Combining these propositions, we obtain the following lemma:

Lemma 4. Let $X$ and $Y$ be two GMMs with PDFs $w_1 N_1 + w_2 N_2$ and $\hat{w}_1\hat{N}_1 + \hat{w}_2\hat{N}_2$ respectively, where

$$|w_i - \hat{w}_i| \le O(\varepsilon), \quad |\mu_i - \hat{\mu}_i| \le O(\varepsilon w_i)\sigma_i \le O(\varepsilon)\sigma_i, \quad |\sigma_i - \hat{\sigma}_i| \le O(\varepsilon w_i)\sigma_i \le O(\varepsilon)\sigma_i,$$

for all $i$ such that $w_i \ge \frac{\varepsilon}{25}$. Then $d_{TV}(X, Y) \le \varepsilon$.

2.2 Kolmogorov Distance

In addition to total variation distance, we will also use the Kolmogorov distance metric.

Definition 1. The Kolmogorov distance between two probability measures with CDFs $F_X$ and $F_Y$ is

$$d_K(F_X, F_Y) = \sup_{x \in \mathbb{R}} |F_X(x) - F_Y(x)|.$$

We will also use this metric to compare general functions, which may not necessarily be valid CDFs.

We have the following fact, stating that total variation distance upper bounds Kolmogorov distance [GS02].

Fact 5. $d_K(F_X, F_Y) \le d_{TV}(f_X, f_Y)$.

Fortunately, it is fairly easy to learn with respect to the Kolmogorov distance, due to the Dvoretzky–Kiefer–Wolfowitz (DKW) inequality [DKW56].

Theorem 6 ([DKW56, Mas90]). Suppose we have $n$ IID samples $X_1, \ldots, X_n$ from a probability distribution with CDF $F$. Let $F_n(x) = \frac{1}{n}\sum_{i=1}^n 1\{X_i \le x\}$ be the empirical CDF. Then

$$\Pr[d_K(F, F_n) \ge \varepsilon] \le 2e^{-2n\varepsilon^2}.$$

In particular, if $n = \Omega((1/\varepsilon^2)\cdot\log(1/\delta))$, then $\Pr[d_K(F, F_n) \ge \varepsilon] \le \delta$.

2.3 Representing and Manipulating CDFs

We will need to be able to efficiently represent and query the CDF of probability distributions we construct. This will be done using a data structure we denote the $n$-interval partition representation of a distribution. This allows us to represent a discrete random variable $X$ over a support of size $\le n$. Construction takes $\tilde{O}(n)$ time, and at the cost of $O(\log n)$ time per operation, we can compute $F_X^{-1}(x)$ for $x \in [0, 1]$. Full details are provided in Appendix D.

Using this construction and Theorem 6, we can derive the following proposition:

Proposition 7. Suppose we have $n = \Theta(\frac{1}{\varepsilon^2}\cdot\log\frac{1}{\delta})$ IID samples from a random variable $X$. In $\tilde{O}(\frac{1}{\varepsilon^2}\cdot\log\frac{1}{\delta})$ time, we can construct a data structure which will allow us to convert independent samples from the uniform distribution over $[0, 1]$ to independent samples from a random variable $\hat{X}$, such that $d_K(F_X, F_{\hat{X}}) \le \varepsilon$ with probability $1 - \delta$.
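The following Python sketch (ours; the paper's actual $n$-interval partition data structure is in Appendix D, which is not reproduced here) illustrates the idea behind Proposition 7: sort the samples once, then answer CDF and inverse-CDF queries cheaply, which also converts uniform variates into samples from the empirical distribution.

```python
import bisect
import math
import random

class EmpiricalCDF:
    """Empirical CDF of n samples; each query costs O(log n) or O(1)."""

    def __init__(self, samples):
        self.xs = sorted(samples)  # O(n log n) construction
        self.n = len(self.xs)

    def cdf(self, x):
        """F_n(x): fraction of samples <= x (binary search)."""
        return bisect.bisect_right(self.xs, x) / self.n

    def inverse(self, u):
        """Smallest sample x with F_n(x) >= u, for u in [0, 1]."""
        idx = min(self.n - 1, max(0, math.ceil(u * self.n) - 1))
        return self.xs[idx]

    def sample(self):
        """Convert a uniform [0, 1] variate into a sample from F_n."""
        return self.inverse(random.random())
```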
Over the course of our algorithm, it will be natural to subtract out a component of a distribution.

Lemma 8. Suppose we have access to the $n$-interval partition representation of a CDF $F$, and that there exist a weight $w$ and CDFs $G$ and $H$ such that

$$d_K\left(H, \frac{F - wG}{1-w}\right) \le \varepsilon.$$

Given $w$ and $G$, we can compute the $n$-interval partition representation of a distribution $\hat{H}$ such that $d_K(H, \hat{H}) \le \varepsilon$ in $O(n)$ time.

A proof and full details are provided in Appendix D.

2.4 Robust Statistics

We use two well known robust statistics, the median and the interquartile range. These are suited to our application for two purposes. First, they are easy to compute with the $n$-interval partition representation of a distribution. Each requires a constant number of queries of the CDF at particular values, and the cost of each query is $O(\log n)$. Second, they are robust to small modifications with respect to most metrics on probability distributions. In particular, we will demonstrate their robustness on Gaussians when considering distance with respect to the Kolmogorov metric.

Lemma 9. Let $\hat{F}$ be a distribution such that $d_K(\mathcal{N}(\mu, \sigma^2), \hat{F}) \le \varepsilon$, where $\varepsilon < \frac{1}{8}$. Then

$$\operatorname{med}(\hat{F}) = \hat{F}^{-1}\left(\tfrac{1}{2}\right) \in \left[\mu - 2\sqrt{2}\,\varepsilon\sigma,\ \mu + 2\sqrt{2}\,\varepsilon\sigma\right].$$

Lemma 10. Let $\hat{F}$ be a distribution such that $d_K(\mathcal{N}(\mu, \sigma^2), \hat{F}) \le \varepsilon$, where $\varepsilon < \frac{1}{8}$. Then

$$\frac{\operatorname{IQR}(\hat{F})}{2\sqrt{2}\operatorname{erf}^{-1}\left(\tfrac{1}{2}\right)} = \frac{\hat{F}^{-1}\left(\tfrac{3}{4}\right) - \hat{F}^{-1}\left(\tfrac{1}{4}\right)}{2\sqrt{2}\operatorname{erf}^{-1}\left(\tfrac{1}{2}\right)} \in \left[\sigma - \frac{5}{2\operatorname{erf}^{-1}\left(\tfrac{1}{2}\right)}\varepsilon\sigma,\ \sigma + \frac{7}{2\operatorname{erf}^{-1}\left(\tfrac{1}{2}\right)}\varepsilon\sigma\right].$$

The proofs are deferred to Appendix B.
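In Python, the way Lemmas 9 and 10 get used later is a pair of quantile reads (a sketch of ours; the normalizing constant $2\sqrt{2}\operatorname{erf}^{-1}(1/2)$ is exactly the IQR of a standard Gaussian, computed below via the standard-normal quantile function, since $\Phi^{-1}(3/4) = \sqrt{2}\operatorname{erf}^{-1}(1/2)$):

```python
from statistics import NormalDist

# IQR of the standard Gaussian: 2 * sqrt(2) * erfinv(1/2) = 2 * Phi^{-1}(3/4).
_STD = NormalDist()
STD_GAUSSIAN_IQR = _STD.inv_cdf(0.75) - _STD.inv_cdf(0.25)  # ~ 1.349

def robust_gaussian_params(inverse_cdf):
    """Estimate (mu, sigma) of a Gaussian from an approximate inverse CDF.

    inverse_cdf maps q in (0, 1) to an approximate quantile. Per Lemmas 9
    and 10, a Kolmogorov-eps perturbation of the CDF moves these estimates
    by only O(eps * sigma).
    """
    mu_hat = inverse_cdf(0.5)                    # median (Lemma 9)
    iqr = inverse_cdf(0.75) - inverse_cdf(0.25)  # interquartile range
    return mu_hat, iqr / STD_GAUSSIAN_IQR        # rescaled IQR (Lemma 10)
```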
2.5 Outline of the Algorithm

We can decompose our algorithm into two components: generating a collection of candidate distributions containing at least one candidate with low statistical distance to the unknown distribution (Theorem 11), and identifying such a candidate from this collection (Theorem 19).

Generation of Candidate Distributions: In Section 3, we deal with generation of candidate distributions. A candidate distribution is described by the parameter set $(\hat{w}, \hat{\mu}_1, \hat{\sigma}_1, \hat{\mu}_2, \hat{\sigma}_2)$, which corresponds to the GMM with PDF $f(x) = \hat{w}\mathcal{N}(\hat{\mu}_1, \hat{\sigma}_1^2, x) + (1-\hat{w})\mathcal{N}(\hat{\mu}_2, \hat{\sigma}_2^2, x)$. As suggested by Lemma 4, if we have a candidate distribution with sufficiently accurate parameters, it will have low statistical distance to the unknown distribution. Our first goal will be to generate a collection of candidates that contains at least one such candidate. Since the time complexity of our algorithm depends on the size of this collection, we wish to keep it to a minimum.

At a high level, we sequentially generate candidates for each parameter. In particular, we start by generating candidates for the mixing weight. While most of these will be inaccurate, we will guarantee to produce at least one appropriately accurate candidate $\hat{w}^*$. Then, for each candidate mixing weight, we will generate candidates for the mean of one of the Gaussians. We will guarantee that, out of the candidate means we generated for $\hat{w}^*$, it is likely that at least one candidate $\hat{\mu}_1^*$ will be sufficiently close to the true mean for this component. The candidate means that were generated for other mixing weights have no such guarantee. We use a similar sequential approach to generate candidates for the variance of this component. Once we have a description of the first component, we simulate the process of subtracting it from the mixture, thus giving us a single Gaussian, whose parameters we can learn. We cannot immediately identify which candidates have inaccurate parameters, and they serve only to inflate the size of our collection.

At a lower level, our algorithm starts by generating candidates for the mixing weight, followed by generating candidates for the mean of the component with the smaller value of $\frac{\sigma_i}{w_i}$. Note that we do not know which of the two Gaussians this is. The solution is to branch our algorithm, where each branch assumes a correspondence to a different Gaussian. One of the two branches is guaranteed to be correct, and it will only double the number of candidate distributions. We observe that if we take $n$ samples from a single Gaussian, it is likely that there will exist a sample at distance $O(\frac{\sigma}{n})$ from its mean. Thus, if we take $\Theta(\frac{1}{w_i\varepsilon})$ samples from the mixture, one of them will be sufficiently close to the mean of the corresponding Gaussian. Exploiting this observation we obtain candidates for the mixing weight and the first mean, as summarized by Lemma 14. Next, we generate candidates for the variance of this Gaussian. Our specific approach is based on the observation that, given $n$ samples from a single Gaussian, the minimum distance of a sample to the mean will likely be $\Theta(\frac{\sigma}{n})$. In the mixture, this property will still hold for the Gaussian with the smaller $\frac{\sigma_i}{w_i}$, so we extract this statistic and use a grid around it to generate sufficiently accurate candidates for $\sigma_i$. This is Lemma 16.

At this point, we have a complete description of one of the component Gaussians. Also, we can generate an empirical distribution of the mixture, which gives an adequate approximation to the true distribution. Given these two pieces, we update the empirical distribution by removing probability mass contributed by the known component. When done carefully, we end up with an approximate description of the distribution of the unknown component. At this point, we extract the median and the interquartile range (IQR) of the resulting distribution. These statistics are robust, so they can tolerate error in our approximation. Finally, the median and IQR allow us to derive the last mean and variance of our distribution. This is Lemma 18.

Putting everything together, we obtain the following result, whose proof is in Section 3.5.

Theorem 11. For all $\varepsilon, \delta > 0$, given $\log(1/\delta)\cdot O(1/\varepsilon^2)$ independent samples from an arbitrary mixture $F$ of two univariate Gaussians, we can generate a collection of $\log(1/\delta)\cdot\tilde{O}(1/\varepsilon^3)$ candidate mixtures of two univariate Gaussians, containing at least one candidate $F'$ such that $d_{TV}(F, F') < \varepsilon$ with probability at least $1 - \delta$.

Candidate Selection: In view of Theorem 11, to prove our main result it suffices to select from among the candidate mixtures some mixture that is close to the unknown mixture. In Section 4, we describe a tournament-based algorithm (Theorem 19) for identifying a candidate which has low statistical distance to the unknown mixture, concluding the proof of Theorem 1. See our discussion in Section 4 for the challenges arising in obtaining a tournament-based algorithm for continuous distributions whose crossings are difficult to compute exactly, as well as in speeding up the tournament's running time to almost linear in the number of candidates.
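Before diving into the details, a schematic Python driver may help fix the overall shape of the algorithm. This is our own sketch; every helper name is a hypothetical placeholder for a step of Sections 3 and 4 (several are fleshed out in the sketches accompanying those sections below), not a function defined in the paper.

```python
def learn_two_gmm(draw_sample, eps, delta):
    """Schematic outline of Theorem 1: generate candidates, then select.

    All helpers are hypothetical placeholders for the paper's steps.
    """
    candidates = []
    # Candidate (weight, first mean) pairs: Sections 3.1-3.2.
    for w_hat, mu1_hat in weight_mean_candidates(draw_sample, eps):
        # Candidate variances for the first component: Section 3.3.
        for s1_hat in variance_candidates_for(draw_sample, w_hat, mu1_hat, eps):
            # Subtract the guessed component and recover the other one
            # via the median and IQR: Section 3.4.
            mu2_hat, s2_hat = last_component(draw_sample, w_hat, mu1_hat, s1_hat, eps)
            candidates.append((w_hat, mu1_hat, s1_hat, mu2_hat, s2_hat))
    # Tournament selection of an O(eps)-close candidate: Section 4.
    return fast_tournament(draw_sample, candidates, eps, delta)
```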
3 Generating Candidate Distributions

By Proposition 3, if one of the Gaussians of a mixture has a negligible mixing weight, it has a negligible impact on the mixture's statistical distance to the unknown mixture. Hence, the candidate means and variances of this Gaussian are irrelevant. This is fortunate, since if $\min(w, 1-w) \ll \varepsilon$ and we only draw $O(1/\varepsilon^2)$ samples from the unknown mixture, as we are planning to do, we have no hope of seeing a sufficient number of samples from the low-weight Gaussian to perform accurate statistical tests for it. So for this section we will assume that $\min(w, 1-w) \ge \Omega(\varepsilon)$, and we will deal with the other case separately.

3.1 Generating Mixing Weight Candidates

The first step is to generate candidates for the mixing weight. We can obtain a collection of $O(\frac{1}{\varepsilon})$ candidates containing some $\hat{w}^* \in [w - \varepsilon, w + \varepsilon]$ by simply taking the set $\{\,t\varepsilon : t \in [\lceil 1/\varepsilon \rceil]\,\}$.

3.2 Generating Mean Candidates

The next step is to generate candidates for the mean corresponding to the Gaussian with the smaller value of $\frac{\sigma_i}{w_i}$. Note that, a priori, we do not know whether $i = 1$ or $i = 2$. We try both cases, first generating candidates assuming they correspond to $\mu_1$, and then repeating with $\mu_2$. This will multiply our total number of candidate distributions by a factor of $2$. Without loss of essential generality, assume for this section that $i = 1$. We want a collection of candidates containing $\hat{\mu}_1^*$ such that $\mu_1 - \varepsilon\sigma_1 \le \hat{\mu}_1^* \le \mu_1 + \varepsilon\sigma_1$. The following propositions are straightforward and proved in Appendix C.

Proposition 12. Fix $i \in \{1, 2\}$. Given $\frac{20\sqrt{2}}{3 w_i \varepsilon}$ samples from a GMM, there will exist a sample $\hat{\mu}_i^* \in \mu_i \pm \varepsilon\sigma_i$ with probability $\ge \frac{99}{100}$.

Proposition 13. Fix $i \in \{1, 2\}$. Suppose $w_i - \varepsilon \le \hat{w}_i \le w_i + \varepsilon$, and $w_i \ge \varepsilon$. Then $\frac{2}{\hat{w}_i} \ge \frac{1}{w_i}$.

We use these facts to design a simple algorithm: for each candidate $\hat{w}_1$ (from Section 3.1), take $\frac{40\sqrt{2}}{3 \hat{w}_1 \varepsilon}$ samples from the mixture and use each of them as a candidate for $\mu_1$. We now examine how many candidate pairs $(\hat{w}, \hat{\mu}_1)$ we generated. Naively, since $\hat{w}_i$ may be as small as $O(\varepsilon)$, the candidates for the mean will multiply the size of our collection by $O(\frac{1}{\varepsilon^2})$. However, we note that when $\hat{w}_i = \Omega(1)$, the number of candidates for $\mu_i$ is actually $O(\frac{1}{\varepsilon})$. We count the number of candidate pairs $(\hat{w}, \hat{\mu}_1)$, combining with previous results in the following:

Lemma 14. Suppose we have sample access to a GMM with (unknown) parameters $(w, \mu_1, \mu_2, \sigma_1, \sigma_2)$. Then for any $\varepsilon > 0$ and constants $c_w, c_m > 0$, using $O(\frac{1}{\varepsilon^2})$ samples from the GMM, we can generate a collection of $O\left(\frac{\log \varepsilon^{-1}}{\varepsilon^2}\right)$ candidate pairs for $(w, \mu_1)$. With probability $\ge \frac{99}{100}$, this will contain a pair $(\hat{w}^*, \hat{\mu}_1^*)$ such that $\hat{w}^* \in w \pm O(\varepsilon)$ and $\hat{\mu}_1^* \in \mu_1 \pm O(\varepsilon)\sigma_1$.

The proof of the lemma is deferred to Appendix C. It implies that we can generate $O(\frac{1}{\varepsilon^2})$ candidate pairs, such that at least one pair simultaneously describes $w$ and $\mu_1$ to the desired accuracy.
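Concretely, the candidate-pair generation of Sections 3.1 and 3.2 can be sketched in Python as follows (our rendering; the per-weight sample count uses the constant from the text, and the sampling oracle draw_sample is an assumed interface):

```python
import math

def weight_candidates(eps):
    """Additive eps-grid of mixing weight candidates (Section 3.1)."""
    return [t * eps for t in range(1, math.ceil(1 / eps) + 1)]

def weight_mean_candidates(draw_sample, eps):
    """Candidate (w_hat, mu1_hat) pairs (Section 3.2, schematic)."""
    pairs = []
    for w_hat in weight_candidates(eps):
        # With ~40*sqrt(2)/(3 * w_hat * eps) samples, one of them likely
        # lands within eps * sigma_1 of mu_1 for the component with the
        # smaller sigma/w (Propositions 12 and 13); keep every sample as
        # a candidate mean.
        m = math.ceil(40 * math.sqrt(2) / (3 * w_hat * eps))
        pairs.extend((w_hat, draw_sample()) for _ in range(m))
    return pairs
```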
3.3 Generating Candidates for a Single Variance

In this section, we generate candidates for the variance corresponding to the Gaussian with the smaller value of $\frac{\sigma_i}{w_i}$. We continue with our guess of whether $i = 1$ or $i = 2$ from the previous section. Again, assume for this section that $i = 1$. The basic idea is that we will find the closest point to $\hat{\mu}_1$. We use the following property (whose proof is deferred to Appendix E) to establish a range for this distance, which we can then grid over. We note that this lemma holds in scenarios more general than we consider here, including $k > 2$ and when samples are drawn from a distribution which is only close to a GMM, rather than exactly a GMM.

Lemma 15. Let $c_1$ and $c_2$ be constants as defined in Proposition 27, and $c_3 = \frac{c_1}{9\sqrt{2} c_2}$. Consider a mixture of $k$ Gaussians $f$, with components $\mathcal{N}(\mu_1, \sigma_1^2), \ldots, \mathcal{N}(\mu_k, \sigma_k^2)$ and weights $w_1, \ldots, w_k$, and let $j = \arg\min_i \frac{\sigma_i}{w_i}$. Suppose we have estimates for the weights and means for all $i \in [1, k]$:

• $\hat{w}_i$, such that $\frac{1}{2}\hat{w}_i \le w_i \le 2\hat{w}_i$;
• $\hat{\mu}_i$, such that $|\hat{\mu}_i - \mu_i| \le \frac{c_3}{2k}\sigma_j$.

Now suppose we draw $n = \frac{9\sqrt{\pi} c_2}{2\hat{w}_j}$ samples $X_1, \ldots, X_n$ from a distribution $\hat{f}$, where $d_K(f, \hat{f}) \le \delta = \frac{c_1}{2n} = \frac{c_1 \hat{w}_j}{9\sqrt{\pi} c_2}$. Then

$$\min_i |X_i - \hat{\mu}_j| \in \left[\frac{c_3}{2k}\sigma_j,\ \left(\sqrt{2} + \frac{c_3}{2k}\right)\sigma_j\right]$$

with probability $\ge \frac{9}{10}$.

Summarizing what we have so far:

Lemma 16. Suppose we have sample access to a GMM with parameters $(w, \mu_1, \mu_2, \sigma_1, \sigma_2)$, where $\frac{\sigma_1}{w} \le \frac{\sigma_2}{1-w}$. Furthermore, we have estimates $\hat{w}^* \in w \pm O(\varepsilon)$ and $\hat{\mu}_1^* \in \mu_1 \pm O(\varepsilon)\sigma_1$. Then for any $\varepsilon > 0$, using $O(\frac{1}{\varepsilon})$ samples from the GMM, we can generate a collection of $O(\frac{1}{\varepsilon})$ candidates for $\sigma_1$. With probability $\ge \frac{9}{10}$, this will contain a candidate $\hat{\sigma}_1^*$ such that $\hat{\sigma}_1^* \in (1 \pm O(\varepsilon))\sigma_1$.
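In code, the variance-candidate step of Lemma 16 looks roughly as follows (our sketch; the bracketing factors stand in for the constants of Lemma 15, which we do not reproduce):

```python
import math

def variance_candidates(samples, mu1_hat, eps, lo_factor=0.1, hi_factor=10.0):
    """Candidates for sigma_1 from the closest sample to mu1_hat (schematic).

    Lemma 15 pins min_i |X_i - mu1_hat| to within a constant factor of
    sigma_1; lo_factor and hi_factor are illustrative stand-ins for those
    constants.
    """
    anchor = max(min(abs(x - mu1_hat) for x in samples), 1e-12)  # guard vs. exact hit
    lo, hi = lo_factor * anchor, hi_factor * anchor
    # Multiplicative (1 + eps)-grid over [lo, hi] (Fact 23 in Appendix A):
    # one candidate falls within a (1 +/- O(eps)) factor of sigma_1.
    k = math.ceil(math.log(hi / lo) / math.log1p(eps))
    return [lo * (1 + eps) ** t for t in range(k + 1)]
```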
3.4 Learning the Last Component using Robust Statistics

At this point, our collection of candidates must contain a triple $(\hat{w}^*, \hat{\mu}_1^*, \hat{\sigma}_1^*)$ which is sufficiently close to the correct parameters. Intuitively, if we could remove this component from the mixture, we would be left with a distribution corresponding to a single Gaussian, which we could learn trivially. We will formalize the notion of "component subtraction," which will allow us to eliminate the known component and obtain a description of an approximation to the CDF of the remaining component. Using classic robust statistics (the median and the interquartile range), we can then obtain approximations to the unknown mean and variance. This has the advantage of a single additional candidate for these parameters, in comparison to $O(\frac{1}{\varepsilon})$ candidates for the previous mean and variance.

Our first step will be to generate an approximation of the overall distribution. We will do this only once, at the beginning of the entire algorithm. Our approximation is with respect to the Kolmogorov distance. Using the DKW inequality (Theorem 6) and Proposition 7, we obtain an $O(\frac{1}{\varepsilon^2})$-interval partition representation of $\hat{H}$ such that $d_K(\hat{H}, H) \le \varepsilon$, with probability $\ge 1 - \delta$, using $O(\frac{1}{\varepsilon^2}\cdot\log\frac{1}{\delta})$ time and samples (where $H$ is the CDF of the GMM).

Next, for each candidate $(\hat{w}, \hat{\mu}_1, \hat{\sigma}_1)$, we apply Lemma 8 to obtain the $O(\frac{1}{\varepsilon^2})$-interval partition representation of the distribution with the known component removed; i.e., using the notation of Lemma 8, $H$ is the CDF of the GMM, $F$ is our DKW-based approximation to $H$, $w$ is the weight $\hat{w}$, and $G$ is $\mathcal{N}(\hat{\mu}_1, \hat{\sigma}_1^2)$. We note that this costs $O(\frac{1}{\varepsilon^2})$ for each candidate triple, and since there are $\tilde{O}(\frac{1}{\varepsilon^3})$ such triples, the total cost of these operations will be $\tilde{O}(\frac{1}{\varepsilon^5})$. However, since the tournament we will use for selection of a candidate will require $\tilde{\Omega}(\frac{1}{\varepsilon^5})$ anyway, this does not affect the overall running time of our algorithm.

The following proposition shows that, when our candidate triple is $(\hat{w}^*, \hat{\mu}_1^*, \hat{\sigma}_1^*)$, the distribution that we obtain after subtracting the known component out and rescaling is close to the unknown component.

Proposition 17. Suppose there exists a mixture of two Gaussians $F = w\mathcal{N}(\mu_1, \sigma_1^2) + (1-w)\mathcal{N}(\mu_2, \sigma_2^2)$, where $O(\varepsilon) \le w \le 1 - O(\varepsilon)$, and we have $\hat{F}$ such that $d_K(\hat{F}, F) \le O(\varepsilon)$, $\hat{w}^*$ such that $|\hat{w}^* - w| \le O(\varepsilon)$, $\hat{\mu}_1^*$ such that $|\hat{\mu}_1^* - \mu_1| \le O(\varepsilon)\sigma_1$, and $\hat{\sigma}_1^*$ such that $|\hat{\sigma}_1^* - \sigma_1| \le O(\varepsilon)\sigma_1$. Then

$$d_K\left(\mathcal{N}(\mu_2, \sigma_2^2),\ \frac{\hat{F} - \hat{w}^*\mathcal{N}(\hat{\mu}_1^*, \hat{\sigma}_1^{*2})}{1 - \hat{w}^*}\right) \le \frac{O(\varepsilon)}{1-w}.$$

Since the resulting distribution is close to the correct one, we can use robust statistics (via Lemmas 9 and 10) to recover the missing parameters. We combine this with previous details into the following lemma, whose proof is deferred to Appendix C.

Lemma 18. Suppose we have sample access to a GMM with parameters $(w, \mu_1, \mu_2, \sigma_1, \sigma_2)$, where $\frac{\sigma_1}{w} \le \frac{\sigma_2}{1-w}$. Furthermore, we have estimates $\hat{w}^* \in w \pm O(\varepsilon)$, $\hat{\mu}_1^* \in \mu_1 \pm O(\varepsilon)\sigma_1$, and $\hat{\sigma}_1^* \in (1 \pm O(\varepsilon))\sigma_1$. Then for any $\varepsilon > 0$, using $O(\frac{1}{\varepsilon^2}\cdot\log\frac{1}{\delta})$ samples from the GMM, with probability $\ge 1 - \delta$ we can generate candidates $\hat{\mu}_2^* \in \mu_2 \pm O\left(\frac{\varepsilon}{1-w}\right)\sigma_2$ and $\hat{\sigma}_2^* \in \left(1 \pm O\left(\frac{\varepsilon}{1-w}\right)\right)\sigma_2$.
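The subtraction-and-rescaling operation of Proposition 17 is, pointwise on CDFs, just the map $x \mapsto (\hat{F}(x) - \hat{w}^*G(x))/(1-\hat{w}^*)$. Here is a Python sketch of ours (the paper instead manipulates the $n$-interval partition representation via Lemma 8; the clamping to $[0, 1]$ is our own choice):

```python
import math

def gaussian_cdf(mu, sigma, x):
    """CDF of N(mu, sigma^2), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def subtract_component(F_hat, w_hat, mu1_hat, sigma1_hat):
    """Approximate CDF of the remaining component (Proposition 17).

    F_hat: callable approximating the mixture's CDF. Returns a callable
    for (F_hat(x) - w_hat * G(x)) / (1 - w_hat), clamped to [0, 1].
    """
    def H_hat(x):
        g = gaussian_cdf(mu1_hat, sigma1_hat, x)
        return min(1.0, max(0.0, (F_hat(x) - w_hat * g) / (1.0 - w_hat)))
    return H_hat
```

Feeding the quantiles of such an H_hat into the robust median/IQR sketch from Section 2.4 yields the estimates $(\hat{\mu}_2^*, \hat{\sigma}_2^*)$ of Lemma 18.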
3.5 Putting It Together

Theorem 11 is obtained by combining the previous sections, and its proof is given in the appendix.

4 Quasilinear-Time Hypothesis Selection

The goal of this section is to present a hypothesis selection algorithm, FastTournament, which is given sample access to a target distribution $X$ and several hypothesis distributions $H_1, \ldots, H_N$, together with an accuracy parameter $\varepsilon > 0$, and is supposed to select a hypothesis distribution from $\{H_1, \ldots, H_N\}$. The desired behavior is this: if at least one distribution in $\{H_1, \ldots, H_N\}$ is $\varepsilon$-close to $X$ in total variation distance, we want the hypothesis distribution selected by the algorithm to be $O(\varepsilon)$-close to $X$. We provide such an algorithm whose sample complexity is $O(\frac{1}{\varepsilon^2}\log N)$ and whose running time is $O(\frac{1}{\varepsilon^2} N\log N)$, i.e., quasilinear in the number of hypotheses, improving the running time of the state of the art (predominantly the Scheffé-estimate based algorithm in [DL01]) quadratically.

We develop our algorithm in full generality, assuming that we have sample access to the distributions of interest, and without making any assumptions about whether they are continuous or discrete, or whether their support is single- or multi-dimensional. All our algorithm needs is sample access to the distributions at hand, together with a way to compare the probability density/mass functions of the distributions, encapsulated in the following definition. In our definition, $H_i(x)$ is the probability mass at $x$ if $H_i$ is a discrete distribution, and the probability density at $x$ if $H_i$ is a continuous distribution. We assume that $H_1$ and $H_2$ are either both discrete or both continuous, and that, if they are continuous, they have a density function.

Definition 2. Let $H_1$ and $H_2$ be probability distributions over some set $D$. A PDF comparator for $H_1, H_2$ is an oracle that takes as input some $x \in D$ and outputs $1$ if $H_1(x) > H_2(x)$, and $0$ otherwise.
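For parameterized candidate mixtures, a PDF comparator is simply a pointwise density comparison. A minimal Python sketch (ours), reusing gmm_pdf from the Section 2 sketch:

```python
def make_gmm_comparator(params1, params2):
    """PDF comparator (Definition 2) for two parameterized GMMs.

    params1, params2: tuples (w, mu1, s1sq, mu2, s2sq). The returned
    oracle maps x to 1 if H1(x) > H2(x), and to 0 otherwise.
    """
    def comparator(x):
        return 1 if gmm_pdf(*params1, x) > gmm_pdf(*params2, x) else 0
    return comparator
```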
Our hypothesis selection algorithm is summarized in the following statement:

Theorem 19. There is an algorithm FastTournament$(X, \mathcal{H}, \varepsilon, \delta)$, which is given sample access to some distribution $X$ and a collection of distributions $\mathcal{H} = \{H_1, \ldots, H_N\}$ over some set $D$, access to a PDF comparator for every pair of distributions $H_i, H_j \in \mathcal{H}$, an accuracy parameter $\varepsilon > 0$, and a confidence parameter $\delta > 0$. The algorithm makes $O\left(\frac{\log(1/\delta)}{\varepsilon^2}\cdot\log N\right)$ draws from each of $X, H_1, \ldots, H_N$ and returns some $H \in \mathcal{H}$ or declares "failure." If there is some $H^* \in \mathcal{H}$ such that $d_{TV}(H^*, X) \le \varepsilon$, then with probability at least $1 - \delta$ the distribution $H$ that FastTournament returns satisfies $d_{TV}(H, X) \le 512\varepsilon$. The total number of operations of the algorithm is $O\left(\frac{\log(1/\delta)}{\varepsilon^2}\left(N\log N + \log^2\frac{1}{\delta}\right)\right)$. Furthermore, the expected number of operations of the algorithm is $O\left(\frac{N\log(N/\delta)}{\varepsilon^2}\right)$.

The proof of Theorem 19 is given in Section 4.3, while the preceding sections build the required machinery for the construction.

Remark 1. A slight modification of our algorithm, provided in Appendix G, admits a worst-case running time of $O\left(\frac{\log(1/\delta)}{\varepsilon^2}\left(N\log N + \log^{1+\gamma}\frac{1}{\delta}\right)\right)$, for any desired constant $\gamma > 0$, though the approximation guarantee is weakened based on the value of $\gamma$. See Corollary 1 and its proof in Appendix G.

Comparison to Other Hypothesis Selection Methods: The skeleton of the hypothesis selection algorithm of Theorem 19, as well as the improved one of Corollary 1, is having candidate distributions compete against each other in a tournament-like fashion. This approach is quite natural and has been commonly used in the literature; see e.g. Devroye and Lugosi ([DL96, DL97] and Chapter 6 of [DL01]), Yatracos [Yat85], as well as the recent papers of Daskalakis et al. [DDS12] and Chan et al. [CDSS13]. The hypothesis selection algorithms in these works are significantly slower than ours, as their running times have quadratic dependence on the number $N$ of hypotheses, while our dependence is quasi-linear. Furthermore, our setting is more general than prior work, in that we only require sample access to the hypotheses and a PDF comparator. Previous algorithms required knowledge of (or the ability to compute) the probability assigned by every pair of hypotheses to their Scheffé set: this is the subset of the support where one hypothesis has larger PMF/PDF than the other, which is difficult to compute in general, even given explicit descriptions of the hypotheses.

Recent independent work by Acharya et al. [AJOS14a, AJOS14b] provides a hypothesis selection algorithm based on the Scheffé estimate in Chapter 6 of [DL01]. Their algorithm performs a number of operations that is comparable to ours. In particular, the expected running time of their algorithm is also $O\left(\frac{N\log(N/\delta)}{\varepsilon^2}\right)$, but our worst-case running time has better dependence on $\delta$. Our algorithm is not based on the Scheffé estimate, using instead a specialized estimator provided in Lemma 20. Their algorithm, described in terms of the Scheffé estimate, is not immediately applicable to sample-only access to the hypotheses, or to settings where the probabilities on Scheffé sets are difficult to compute.

4.1 Choosing Between Two Hypotheses

We start with an algorithm for choosing between two hypothesis distributions. This is an adaptation of a similar algorithm from [DDS12] to continuous distributions and sample-only access. The proof of the following is in Appendix F.

Lemma 20. There is an algorithm ChooseHypothesis$(X, H_1, H_2, \varepsilon, \delta)$, which is given sample access to distributions $X, H_1, H_2$ over some set $D$, access to a PDF comparator for $H_1, H_2$, an accuracy parameter $\varepsilon > 0$, and a confidence parameter $\delta > 0$. The algorithm draws $m = O(\log(1/\delta)/\varepsilon^2)$ samples from each of $X, H_1$ and $H_2$, and either returns some $H \in \{H_1, H_2\}$ as the winner or declares a "draw." The total number of operations of the algorithm is $O(\log(1/\delta)/\varepsilon^2)$. Additionally, the output satisfies the following properties:

1. If $d_{TV}(X, H_1) \le \varepsilon$ but $d_{TV}(X, H_2) > 8\varepsilon$, the probability that $H_1$ is not declared winner is $\le \delta$;
2. If $d_{TV}(X, H_1) \le \varepsilon$ but $d_{TV}(X, H_2) > 4\varepsilon$, the probability that $H_2$ is declared winner is $\le \delta$;
3. The analogous conclusions hold if we interchange $H_1$ and $H_2$ in Properties 1 and 2 above;
4. If $d_{TV}(H_1, H_2) \le 5\varepsilon$, the algorithm declares a "draw" with probability at least $1 - \delta$.
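The estimator behind Lemma 20 is specified in Appendix F, which is not reproduced here; the following Python sketch (ours, with illustrative thresholds that should not be read as the paper's) conveys the sample-only idea: estimate the mass that $X$, $H_1$, and $H_2$ each place on the set $\mathcal{W} = \{x : H_1(x) > H_2(x)\}$ using samples and the PDF comparator alone.

```python
def choose_hypothesis(sample_X, sample_H1, sample_H2, comparator, eps, m):
    """Sample-only two-hypothesis test (schematic; thresholds illustrative).

    sample_*: zero-argument samplers; comparator: PDF comparator for
    (H1, H2). Estimates the mass each distribution puts on
    W = {x : H1(x) > H2(x)}, then picks a winner or declares a draw.
    Note that d_TV(H1, H2) = H1(W) - H2(W).
    """
    p1 = sum(comparator(sample_H1()) for _ in range(m)) / m  # ~ H1(W)
    p2 = sum(comparator(sample_H2()) for _ in range(m)) / m  # ~ H2(W)
    q = sum(comparator(sample_X()) for _ in range(m)) / m    # ~ X(W)
    if p1 - p2 <= 5 * eps:   # hypotheses are close; either would do
        return "draw"
    if q >= p1 - 2 * eps:    # X's mass on W is consistent with H1
        return "H1"
    if q <= p2 + 2 * eps:    # X's mass on W is consistent with H2
        return "H2"
    return "draw"
```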
4.2 The Slow Tournament

We proceed with a hypothesis selection algorithm, SlowTournament, which has the correct behavior, but whose running time is suboptimal. Again we proceed similarly to [DDS12], making the approach robust to continuous distributions and sample-only access. SlowTournament performs pairwise comparisons between all hypotheses in $\mathcal{H}$, using the subroutine ChooseHypothesis of Lemma 20, and outputs a hypothesis that never lost to (but potentially tied with) other hypotheses. The running time of the algorithm is quadratic in $|\mathcal{H}|$, as all pairs of hypotheses are compared. FastTournament, described in Section 4.3, organizes the tournament in a more efficient manner, improving the running time to quasilinear. The proof of Lemma 21 can be found in Appendix F.

Lemma 21. There is an algorithm SlowTournament$(X, \mathcal{H}, \varepsilon, \delta)$, which is given sample access to some distribution $X$ and a collection of distributions $\mathcal{H} = \{H_1, \ldots, H_N\}$ over some set $D$, access to a PDF comparator for every pair of distributions $H_i, H_j \in \mathcal{H}$, an accuracy parameter $\varepsilon > 0$, and a confidence parameter $\delta > 0$. The algorithm makes $m = O(\log(N/\delta)/\varepsilon^2)$ draws from each of $X, H_1, \ldots, H_N$ and returns some $H \in \mathcal{H}$ or declares "failure." If there is some $H^* \in \mathcal{H}$ such that $d_{TV}(H^*, X) \le \varepsilon$, then with probability at least $1 - \delta$ the distribution $H$ that SlowTournament returns satisfies $d_{TV}(H, X) \le 8\varepsilon$. The total number of operations of the algorithm is $O(N^2\log(N/\delta)/\varepsilon^2)$.
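On top of such a two-hypothesis test, SlowTournament is a direct double loop; a Python sketch of ours, built on the choose_hypothesis sketch above:

```python
def slow_tournament(sample_X, samplers, comparators, eps, m):
    """Quadratic-time hypothesis selection (Lemma 21, schematic).

    samplers: list of zero-argument samplers for H_1..H_N;
    comparators[(i, j)]: PDF comparator for (H_i, H_j). Returns the index
    of a hypothesis that never lost, or None ("failure").
    """
    n = len(samplers)
    lost = [False] * n
    for i in range(n):
        for j in range(i + 1, n):
            outcome = choose_hypothesis(sample_X, samplers[i], samplers[j],
                                        comparators[(i, j)], eps, m)
            if outcome == "H1":
                lost[j] = True
            elif outcome == "H2":
                lost[i] = True
    return next((i for i in range(n) if not lost[i]), None)
```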
4.3 The Fast Tournament

We prove our main result of this section, providing a quasi-linear time algorithm for selecting from a collection of hypothesis distributions $\mathcal{H}$ one that is close to a target distribution $X$, improving the running time of SlowTournament from Lemma 21. Intuitively, there are two cases to consider. Collection $\mathcal{H}$ is either dense or sparse in distributions that are close to $X$. In the former case, we show that we can sub-sample $\mathcal{H}$ before running SlowTournament. In the latter case, we show how to set up a two-phase tournament, whose first phase eliminates all but a sublinear number of hypotheses, and whose second phase runs SlowTournament on the surviving hypotheses. Depending on the density of $\mathcal{H}$ in distributions that are close to the target distribution $X$, we show that one of the aforementioned strategies is guaranteed to output a distribution that is close to $X$. As we do not know a priori the density of $\mathcal{H}$ in distributions that are close to $X$, and hence which of our two strategies will succeed in finding a distribution that is close to $X$, we use both strategies and run a tournament among their outputs, using SlowTournament again.

Proof of Theorem 19: Let $p$ be the fraction of the elements of $\mathcal{H}$ that are $8\varepsilon$-close to $X$. The value of $p$ is unknown to our algorithm. Regardless, we propose two strategies for selecting a distribution from $\mathcal{H}$, one of which is guaranteed to succeed whatever the value of $p$ is. We assume throughout this proof that $N$ is larger than a sufficiently large constant; otherwise our claim follows directly from Lemma 21.

S1: Pick a random subset $\mathcal{H}' \subseteq \mathcal{H}$ of size $\lceil 3\sqrt{N} \rceil$, and run SlowTournament$(X, \mathcal{H}', 8\varepsilon, e^{-3})$ to select some distribution $\tilde{H} \in \mathcal{H}'$. We show the following in Appendix F.

Claim 1. The number of samples drawn by S1 from each of the distributions in $\mathcal{H} \cup \{X\}$ is $O(\frac{1}{\varepsilon^2}\log N)$, and the total number of operations is $O(\frac{1}{\varepsilon^2} N\log N)$. Moreover, if $p \in [\frac{1}{\sqrt{N}}, 1]$ and there is some distribution in $\mathcal{H}$ that is $\varepsilon$-close to $X$, then the distribution $\tilde{H}$ output by S1 is $64\varepsilon$-close to $X$, with probability at least $9/10$.

S2: There are two phases in this strategy:

• Phase 1: This phase proceeds in $T = \lfloor \log_2 \frac{\sqrt{N}}{2} \rfloor$ iterations, $i_1, \ldots, i_T$. Iteration $i_\ell$ takes as input a subset $\mathcal{H}_{i_{\ell-1}} \subseteq \mathcal{H}$ (where $\mathcal{H}_{i_0} \equiv \mathcal{H}$), and produces some $\mathcal{H}_{i_\ell} \subset \mathcal{H}_{i_{\ell-1}}$, such that $|\mathcal{H}_{i_\ell}| = \left\lceil \frac{|\mathcal{H}_{i_{\ell-1}}|}{2} \right\rceil$, as follows: randomly pair up the elements of $\mathcal{H}_{i_{\ell-1}}$ (possibly one element is left unpaired), and for every pair $(H_i, H_j)$ run ChooseHypothesis$(X, H_i, H_j, \varepsilon, 1/3N)$. We do this with a small caveat: instead of drawing $O(\log(3N)/\varepsilon^2)$ fresh samples (as required by Lemma 20) in every execution of ChooseHypothesis (from whichever distributions are involved in that execution), we draw $O(\log(3N)/\varepsilon^2)$ samples from each of $X, H_1, \ldots, H_N$ once and for all, and reuse the same samples in all executions of ChooseHypothesis.

• Phase 2: Given the collection $\mathcal{H}_{i_T}$ output by Phase 1, we run SlowTournament$(X, \mathcal{H}_{i_T}, \varepsilon, 1/4)$ to select some distribution $\hat{H} \in \mathcal{H}_{i_T}$. (We use fresh samples for the execution of SlowTournament.)

We show the following in Appendix F.

Claim 2. The number of samples drawn by S2 from each of the distributions in $\mathcal{H} \cup \{X\}$ is $O(\frac{1}{\varepsilon^2}\log N)$, and the total number of operations is $O(\frac{1}{\varepsilon^2} N\log N)$. Moreover, if $p \in (0, \frac{1}{\sqrt{N}}]$ and there is some distribution in $\mathcal{H}$ that is $\varepsilon$-close to $X$, then the distribution $\hat{H}$ output by S2 is $8\varepsilon$-close to $X$, with probability at least $1/4$.

Given strategies S1 and S2, we first design an algorithm which has the stated worst-case number of operations. The algorithm FastTournament$_A$ works as follows:

1. Execute strategy S1 $k_1 = \log_2 \frac{2}{\delta}$ times, with fresh samples each time. Let $\tilde{H}_1, \ldots, \tilde{H}_{k_1}$ be the distributions output by these executions.
2. Execute strategy S2 $k_2 = \log_{4/3} \frac{2}{\delta}$ times, with fresh samples each time. Let $\hat{H}_1, \ldots, \hat{H}_{k_2}$ be the distributions output by these executions.
3. Set $\mathcal{G} \equiv \{\tilde{H}_1, \ldots, \tilde{H}_{k_1}, \hat{H}_1, \ldots, \hat{H}_{k_2}\}$. Execute SlowTournament$(X, \mathcal{G}, 64\varepsilon, \delta/2)$.

Claim 3. FastTournament$_A$ satisfies the properties described in the statement of Theorem 19, except for the bound on the expected number of operations.

Proof of Claim 3: The bounds on the number of samples and operations follow immediately from our choice of $k_1, k_2$, Claims 1 and 2, and Lemma 21. Let us justify the correctness of the algorithm. Suppose that there is some distribution in $\mathcal{H}$ that is $\varepsilon$-close to $X$. We distinguish two cases, depending on the fraction $p$ of distributions in $\mathcal{H}$ that are $8\varepsilon$-close to $X$:

• $p \in [\frac{1}{\sqrt{N}}, 1]$: In this case, each execution of S1 has probability at least $9/10$ of outputting a distribution that is $64\varepsilon$-close to $X$. So the probability that none of $\tilde{H}_1, \ldots, \tilde{H}_{k_1}$ is $64\varepsilon$-close to $X$ is at most $(\frac{1}{10})^{k_1} \le \delta/2$. Hence, with probability at least $1 - \delta/2$, $\mathcal{G}$ contains a distribution that is $64\varepsilon$-close to $X$. Conditioning on this, SlowTournament$(X, \mathcal{G}, 64\varepsilon, \delta/2)$ will output a distribution that is $512\varepsilon$-close to $X$ with probability at least $1 - \delta/2$, by Lemma 21. Hence, with overall probability at least $1 - \delta$, the distribution output by FastTournament is $512\varepsilon$-close to $X$.

• $p \in (0, \frac{1}{\sqrt{N}}]$: This case is analyzed analogously. With probability at least $1 - \delta/2$, at least one of $\hat{H}_1, \ldots, \hat{H}_{k_2}$ is $8\varepsilon$-close to $X$ (by Claim 2). Conditioning on this, SlowTournament$(X, \mathcal{G}, 64\varepsilon, \delta/2)$ outputs a distribution that is $512\varepsilon$-close to $X$, with probability at least $1 - \delta/2$ (by Lemma 21). So, with overall probability at least $1 - \delta$, the distribution output by FastTournament is $512\varepsilon$-close to $X$.

We now describe an algorithm which has the stated expected number of operations. The algorithm FastTournament$_B$ works as follows:

1. Execute strategy S1; let $\tilde{H}_1$ be the distribution output by this execution.
2. Execute strategy S2; let $\tilde{H}_2$ be the distribution output by this execution.
3. Execute ChooseHypothesis$(X, \tilde{H}_i, H, 64\varepsilon, \delta/N^3)$ for $i \in \{1, 2\}$ and all $H \in \mathcal{H}$. If either $\tilde{H}_1$ or $\tilde{H}_2$ never loses, output that hypothesis. Otherwise, remove $\tilde{H}_1$ and $\tilde{H}_2$ from $\mathcal{H}$, and repeat the algorithm starting from step 1, unless $\mathcal{H}$ is empty.
Claim 4. FastTournament$_B$ satisfies the properties described in the statement of Theorem 19, except for the worst-case bound on the number of operations.

Proof of Claim 4: We note that we will first draw $O(\log(N^3/\delta)/\varepsilon^2)$ samples from each of $X, H_1, \ldots, H_N$ and use the same samples for every execution of ChooseHypothesis, to avoid blowing up the sample complexity. Using this fact, the sample complexity is as claimed.

We now justify the correctness of the algorithm. Since we run ChooseHypothesis on a given pair of hypotheses at most once, there are at most $N^2$ executions of this algorithm. Because each fails with probability at most $\frac{\delta}{N^3}$, by the union bound, the probability that any execution of ChooseHypothesis ever fails is at most $\frac{\delta}{N}$, so all executions succeed with probability at least $1 - \frac{\delta}{N}$. Condition on this happening for the remainder of the proof of correctness.

In Step 3 of our algorithm, we compare some $\tilde{H}$ with every other hypothesis. We analyze two cases:

• Suppose that $d_{TV}(X, \tilde{H}) \le 64\varepsilon$. By Lemma 20, $\tilde{H}$ will never lose, and will be output by FastTournament$_B$.

• Suppose that $d_{TV}(X, \tilde{H}) > 512\varepsilon$. Then by Lemma 20, $\tilde{H}$ will lose to any candidate $H'$ with $d_{TV}(X, H') \le 64\varepsilon$. We assumed there exists at least one hypothesis with this property at the beginning of the algorithm. Furthermore, by the previous case, if this hypothesis were selected by S1 or S2 at some prior step, the algorithm would have terminated; so in particular, if the algorithm is still running, this hypothesis could not have been removed from $\mathcal{H}$. Therefore, $\tilde{H}$ will lose at least once and will not be output by FastTournament$_B$.

The correctness of our algorithm follows from the second case above. Indeed, if the algorithm outputs a distribution $\tilde{H}$, it must be the case that $d_{TV}(X, \tilde{H}) \le 512\varepsilon$. Moreover, we will not run out of hypotheses before we output a distribution. Indeed, we only discard a hypothesis if it was selected by S1 or S2 and then lost at least once in Step 3. Furthermore, at the beginning of our algorithm there exists a distribution $H$ such that $d_{TV}(X, H) \le 64\varepsilon$. If ever selected by S1 or S2, $H$ will not lose to any distribution in Step 3, and the algorithm will output a distribution. If it is not selected by S1 or S2, $H$ won't be removed from $\mathcal{H}$.

We now reason about the expected running time of our algorithm. First, consider the case when all executions of ChooseHypothesis are successful, which happens with probability $\ge 1 - \frac{\delta}{N}$. If either S1 or S2 outputs a distribution $\tilde{H}$ such that $d_{TV}(X, \tilde{H}) \le 64\varepsilon$, then by the first case above it will be output by FastTournament$_B$. If this happened with probability at least $p$ independently in every iteration of our algorithm, then the number of iterations would be stochastically dominated by a geometric random variable with parameter $p$, so the expected number of rounds would be upper bounded by $\frac{1}{p}$. By Claims 1 and 2, $p \ge \frac{1}{4}$, so, when ChooseHypothesis never fails, the expected number of rounds is at most $4$. Next, consider when at least one execution of ChooseHypothesis fails, which happens with probability $\le \frac{\delta}{N}$. Since FastTournament$_B$ removes at least one hypothesis in every round, there are at most $N$ rounds. Combining these two cases, the expected number of rounds is at most $(1 - \frac{\delta}{N})\cdot 4 + \frac{\delta}{N}\cdot N \le 5$.
By Claims 1 and 2 and Lemma 20, each round requires $O(N\log N + N\log(N/\delta))$ operations. Since the expected number of rounds is $O(1)$, we obtain the desired bound on the expected number of operations.

In order to obtain all the guarantees of the theorem simultaneously, our algorithm FastTournament will alternate between steps of FastTournament$_A$ and FastTournament$_B$, where both algorithms are given an error parameter equal to $\frac{\delta}{2}$. If either algorithm outputs a hypothesis, FastTournament outputs it. By the union bound and Claims 3 and 4, both FastTournament$_A$ and FastTournament$_B$ will be correct with probability at least $1 - \delta$. The worst-case running time is as desired, by Claim 3 and since interleaving between steps of the two tournaments multiplies the number of steps by a factor of at most $2$; the expected running time follows similarly, by Claim 4.

5 Proof of Theorem 1

Theorem 1 is an immediate consequence of Theorems 11 and 19. Namely, we run the algorithm of Theorem 11 to produce a collection of Gaussian mixtures, one of which is within $\varepsilon$ of the unknown mixture $F$. Then we use FastTournament of Theorem 19 to select from among the candidates a mixture that is $O(\varepsilon)$-close to $F$. For the execution of FastTournament, we need a PDF comparator for all pairs of candidate mixtures in our collection. Given that these are described by their parameters, our PDF comparators evaluate the densities of two given mixtures at a challenge point $x$ and decide which one is larger. We also need sample access to our candidate mixtures. Given a parametric description $(w, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2)$ of a mixture, we can draw a sample from it as follows: first draw a uniform $[0, 1]$ variable whose value compared to $w$ determines whether to sample from $\mathcal{N}(\mu_1, \sigma_1^2)$ or $\mathcal{N}(\mu_2, \sigma_2^2)$ in the second step; for the second step, use the Box–Muller transform [BM58] to obtain a sample from either $\mathcal{N}(\mu_1, \sigma_1^2)$ or $\mathcal{N}(\mu_2, \sigma_2^2)$, as decided in the first step.

References

[AJOS14a] Jayadev Acharya, Ashkan Jafarpour, Alon Orlitsky, and Ananda Theertha Suresh. Near-optimal-sample estimators for spherical Gaussian mixtures. Online manuscript, 2014.

[AJOS14b] Jayadev Acharya, Ashkan Jafarpour, Alon Orlitsky, and Ananda Theertha Suresh. Sorting with adversarial comparators and application to density estimation. In Proceedings of the 2014 IEEE International Symposium on Information Theory, ISIT '14, Washington, DC, USA, 2014. IEEE Computer Society.

[AK01] Sanjeev Arora and Ravi Kannan. Learning mixtures of arbitrary Gaussians. In Proceedings of the 33rd Annual ACM Symposium on the Theory of Computing, STOC '01, pages 247–257, New York, NY, USA, 2001. ACM.

[AM05] Dimitris Achlioptas and Frank McSherry. On spectral learning of mixtures of distributions. In Proceedings of the 18th Annual Conference on Learning Theory, COLT '05, pages 458–469. Springer, 2005.

[BM58] G. E. P. Box and Mervin E. Muller. A note on the generation of random normal deviates. The Annals of Mathematical Statistics, 29(2):610–611, June 1958.

[BS10] Mikhail Belkin and Kaushik Sinha. Polynomial learning of distribution families. In Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science, FOCS '10, pages 103–112, Washington, DC, USA, 2010. IEEE Computer Society.
References

[AJOS14a] Jayadev Acharya, Ashkan Jafarpour, Alon Orlitsky, and Ananda Theertha Suresh. Near-optimal-sample estimators for spherical Gaussian mixtures. Online manuscript, 2014.

[AJOS14b] Jayadev Acharya, Ashkan Jafarpour, Alon Orlitsky, and Ananda Theertha Suresh. Sorting with adversarial comparators and application to density estimation. In Proceedings of the 2014 IEEE International Symposium on Information Theory, ISIT '14, Washington, DC, USA, 2014. IEEE Computer Society.

[AK01] Sanjeev Arora and Ravi Kannan. Learning mixtures of arbitrary Gaussians. In Proceedings of the 33rd Annual ACM Symposium on the Theory of Computing, STOC '01, pages 247–257, New York, NY, USA, 2001. ACM.

[AM05] Dimitris Achlioptas and Frank McSherry. On spectral learning of mixtures of distributions. In Proceedings of the 18th Annual Conference on Learning Theory, COLT '05, pages 458–469. Springer, 2005.

[BM58] G. E. P. Box and Mervin E. Muller. A note on the generation of random normal deviates. The Annals of Mathematical Statistics, 29(2):610–611, June 1958.

[BS10] Mikhail Belkin and Kaushik Sinha. Polynomial learning of distribution families. In Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science, FOCS '10, pages 103–112, Washington, DC, USA, 2010. IEEE Computer Society.

[BV08] Spencer Charles Brubaker and Santosh Vempala. Isotropic PCA and affine-invariant clustering. In Proceedings of the 49th Annual IEEE Symposium on Foundations of Computer Science, FOCS '08, pages 551–560, Washington, DC, USA, 2008. IEEE Computer Society.

[CDSS13] Siu On Chan, Ilias Diakonikolas, Rocco A. Servedio, and Xiaorui Sun. Learning mixtures of structured distributions over discrete domains. In Proceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '13, pages 1380–1394, Philadelphia, PA, USA, 2013. SIAM.

[CDSS14] Siu On Chan, Ilias Diakonikolas, Rocco A. Servedio, and Xiaorui Sun. Efficient density estimation via piecewise polynomial approximation. In Proceedings of the 46th Annual ACM Symposium on the Theory of Computing, STOC '14, New York, NY, USA, 2014. ACM.

[Das99] Sanjoy Dasgupta. Learning mixtures of Gaussians. In Proceedings of the 40th Annual IEEE Symposium on Foundations of Computer Science, FOCS '99, pages 634–644, Washington, DC, USA, 1999. IEEE Computer Society.

[DDO+13] Constantinos Daskalakis, Ilias Diakonikolas, Ryan O'Donnell, Rocco A. Servedio, and Li-Yang Tan. Learning sums of independent integer random variables. In Proceedings of the 54th Annual IEEE Symposium on Foundations of Computer Science, FOCS '13, pages 217–226, Washington, DC, USA, 2013. IEEE Computer Society.

[DDS12] Constantinos Daskalakis, Ilias Diakonikolas, and Rocco A. Servedio. Learning Poisson binomial distributions. In Proceedings of the 44th Annual ACM Symposium on the Theory of Computing, STOC '12, pages 709–728, New York, NY, USA, 2012. ACM.

[DKW56] A. Dvoretzky, J. Kiefer, and J. Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, 27(3):642–669, 1956.

[DL96] Luc Devroye and Gábor Lugosi. A universally acceptable smoothing factor for kernel density estimation. The Annals of Statistics, 24:2499–2512, 1996.

[DL97] Luc Devroye and Gábor Lugosi. Nonasymptotic universal smoothing factors, kernel complexity and Yatracos classes. The Annals of Statistics, 25:2626–2637, 1997.

[DL01] Luc Devroye and Gábor Lugosi. Combinatorial Methods in Density Estimation. Springer, 2001.

[DLR77] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977.

[FOS06] Jon Feldman, Ryan O'Donnell, and Rocco A. Servedio. PAC learning axis-aligned mixtures of Gaussians with no separation assumption. In Proceedings of the 19th Annual Conference on Learning Theory, COLT '06, pages 20–34, Berlin, Heidelberg, 2006. Springer-Verlag.

[GS02] Alison L. Gibbs and Francis E. Su. On choosing and bounding probability metrics. International Statistical Review, 70(3):419–435, December 2002.

[HK13] Daniel Hsu and Sham M. Kakade. Learning mixtures of spherical Gaussians: Moment methods and spectral decompositions. In Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, ITCS '13, pages 11–20, New York, NY, USA, 2013. ACM.

[KMR+94] Michael Kearns, Yishay Mansour, Dana Ron, Ronitt Rubinfeld, Robert E. Schapire, and Linda Sellie. On the learnability of discrete distributions. In Proceedings of the 26th Annual ACM Symposium on the Theory of Computing, STOC '94, pages 273–282, New York, NY, USA, 1994. ACM.
[KMV10] Adam Tauman Kalai, Ankur Moitra, and Gregory Valiant. Efficiently learning mixtures of two Gaussians. In Proceedings of the 42nd Annual ACM Symposium on the Theory of Computing, STOC '10, pages 553–562, New York, NY, USA, 2010. ACM.

[Mas90] P. Massart. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, 18(3):1269–1283, 1990.

[MV10] Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures of Gaussians. In Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science, FOCS '10, pages 93–102, Washington, DC, USA, 2010. IEEE Computer Society.

[VW02] Santosh Vempala and Grant Wang. A spectral algorithm for learning mixtures of distributions. In Proceedings of the 43rd Annual IEEE Symposium on Foundations of Computer Science, FOCS '02, pages 113–123, Washington, DC, USA, 2002. IEEE Computer Society.

[Yat85] Yannis G. Yatracos. Rates of convergence of minimum distance estimators and Kolmogorov's entropy. The Annals of Statistics, 13(2):768–774, 1985.

A Gridding

We will encounter settings where we have bounds $L$ and $R$ on an unknown measure $X$ such that $L \le X \le R$, and wish to obtain an estimate $\hat{X}$ such that $(1-\varepsilon)X \le \hat{X} \le (1+\varepsilon)X$. Gridding is a common technique to generate a list of candidates that is guaranteed to contain such an estimate.

Fact 22. Candidates of the form $L + k\varepsilon L$ define an additive grid with at most $\frac{1}{\varepsilon}\left(\frac{R-L}{L}\right)$ candidates.

Fact 23. Candidates of the form $L(1+\varepsilon)^k$ define a multiplicative grid with at most $\frac{1}{\log(1+\varepsilon)}\log\left(\frac{R}{L}\right)$ candidates.

We also encounter scenarios where we require an additive estimate $X - \varepsilon \le \hat{X} \le X + \varepsilon$.

Fact 24. Candidates of the form $L + k\varepsilon$ define an absolute additive grid with $\frac{1}{\varepsilon}(R-L)$ candidates.
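As an illustration of Facts 22–24, here is a small Python sketch (the function names are ours) that materializes the three grids, letting $k$ range over non-negative integers until the candidates pass $R$:

```python
def additive_grid(L, R, eps):
    """Fact 22: candidates L + k*eps*L; roughly (1/eps)*(R-L)/L of them."""
    candidates, k = [], 0
    while L + k * eps * L <= R:
        candidates.append(L + k * eps * L)
        k += 1
    return candidates

def multiplicative_grid(L, R, eps):
    """Fact 23: candidates L*(1+eps)^k; roughly log(R/L)/log(1+eps) of them."""
    candidates, k = [], 0
    while L * (1 + eps) ** k <= R:
        candidates.append(L * (1 + eps) ** k)
        k += 1
    return candidates

def absolute_additive_grid(L, R, eps):
    """Fact 24: candidates L + k*eps; roughly (R-L)/eps of them."""
    candidates, k = [], 0
    while L + k * eps <= R:
        candidates.append(L + k * eps)
        k += 1
    return candidates
```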
B Proofs Omitted from Section 2

Proof of Proposition 3: We use $d_{TV}(P, Q)$ and $\frac{1}{2}\|P - Q\|_1$ interchangeably in the cases where $P$ and $Q$ are not necessarily probability distributions. Let $N_i = \mathcal{N}(\mu_i, \sigma_i^2)$ and $\hat{N}_i = \mathcal{N}(\hat{\mu}_i, \hat{\sigma}_i^2)$. By the triangle inequality,
$$d_{TV}(\hat{w}\hat{N}_1 + (1-\hat{w})\hat{N}_2,\; w N_1 + (1-w) N_2) \le d_{TV}(\hat{w}\hat{N}_1, w N_1) + d_{TV}((1-\hat{w})\hat{N}_2, (1-w) N_2).$$
Inspecting the first term,
$$\frac{1}{2}\left\|w N_1 - \hat{w}\hat{N}_1\right\|_1 = \frac{1}{2}\left\|w N_1 - w\hat{N}_1 + w\hat{N}_1 - \hat{w}\hat{N}_1\right\|_1 \le w\, d_{TV}(N_1, \hat{N}_1) + \frac{1}{2}|w - \hat{w}|,$$
again using the triangle inequality. A symmetric statement holds for the other term, giving us the desired result.

Proof of Lemma 9: We use $x$ to denote the median of our distribution, where $\hat{F}(x) = \frac{1}{2}$. Since $d_K(F, \hat{F}) \le \varepsilon$, we have $F(x) \le \frac{1}{2} + \varepsilon$. Using the CDF of the normal distribution, we obtain
$$\frac{1}{2} + \frac{1}{2}\,\mathrm{erf}\left(\frac{x-\mu}{\sqrt{2\sigma^2}}\right) \le \frac{1}{2} + \varepsilon.$$
Rearranging, we get
$$x \le \mu + \sqrt{2}\,\mathrm{erf}^{-1}(2\varepsilon)\,\sigma \le \mu + 2\sqrt{2}\,\varepsilon\sigma,$$
where the second inequality uses the Taylor series of $\mathrm{erf}^{-1}$ around 0 and Taylor's Theorem. By symmetry of the Gaussian distribution, we can obtain the corresponding lower bound for $x$.

Proof of Lemma 10: First, we show that
$$F^{-1}\left(\tfrac{3}{4}\right) \in \left[\mu + \sqrt{2}\,\mathrm{erf}^{-1}\left(\tfrac{1}{2}\right)\sigma - \tfrac{5\sqrt{2}}{2}\sigma\varepsilon,\;\; \mu + \sqrt{2}\,\mathrm{erf}^{-1}\left(\tfrac{1}{2}\right)\sigma + \tfrac{7\sqrt{2}}{2}\sigma\varepsilon\right].$$
Let $x = F^{-1}\left(\tfrac{3}{4}\right)$. Since $d_K(F, \hat{F}) \le \varepsilon$, we have $F(x) \le \frac{3}{4} + \varepsilon$. Using the CDF of the normal distribution, we obtain
$$\frac{1}{2} + \frac{1}{2}\,\mathrm{erf}\left(\frac{x-\mu}{\sqrt{2\sigma^2}}\right) \le \frac{3}{4} + \varepsilon.$$
Rearranging, we get
$$x \le \mu + \sqrt{2}\,\mathrm{erf}^{-1}\left(\tfrac{1}{2} + 2\varepsilon\right)\sigma \le \mu + \sqrt{2}\,\mathrm{erf}^{-1}\left(\tfrac{1}{2}\right)\sigma + \tfrac{7\sqrt{2}}{2}\varepsilon\sigma,$$
where the second inequality uses the Taylor series of $\mathrm{erf}^{-1}$ around $\frac{1}{2}$ and Taylor's Theorem. A similar approach gives the desired lower bound. By symmetry, we can obtain the bounds
$$F^{-1}\left(\tfrac{1}{4}\right) \in \left[\mu - \sqrt{2}\,\mathrm{erf}^{-1}\left(\tfrac{1}{2}\right)\sigma - \tfrac{7\sqrt{2}}{2}\sigma\varepsilon,\;\; \mu - \sqrt{2}\,\mathrm{erf}^{-1}\left(\tfrac{1}{2}\right)\sigma + \tfrac{5\sqrt{2}}{2}\sigma\varepsilon\right].$$
Combining this with the previous bounds and rescaling, we obtain the lemma statement.

C Proofs Omitted from Section 3

Proof of Proposition 12: The probability that a sample is from $N_i$ is $w_i$. Using the CDF of the half-normal distribution, given that a sample is from $N_i$, the probability that it is at a distance $\le \varepsilon\sigma_i$ from $\mu_i$ is $\mathrm{erf}\left(\frac{\varepsilon}{\sqrt{2}}\right)$. If we take a single sample from the mixture, it will satisfy the desired conditions with probability at least $w_i\,\mathrm{erf}\left(\frac{\varepsilon}{\sqrt{2}}\right)$. If we take $\frac{20\sqrt{2}}{3 w_i \varepsilon}$ samples from the mixture, the probability that some sample satisfies the conditions is at least
$$1 - \left(1 - w_i\,\mathrm{erf}\left(\tfrac{\varepsilon}{\sqrt{2}}\right)\right)^{\frac{20\sqrt{2}}{3 w_i \varepsilon}} \ge 1 - \left(1 - w_i \cdot \tfrac{3}{4}\cdot\tfrac{\varepsilon}{\sqrt{2}}\right)^{\frac{20\sqrt{2}}{3 w_i \varepsilon}} \ge 1 - e^{-5} \ge \frac{99}{100},$$
where the first inequality is by noting that $\mathrm{erf}(x) \ge \frac{3}{4}x$ for $x \in [0,1]$.

Proof of Proposition 13: $w_i \ge \varepsilon$ implies $w_i \ge \frac{w_i + \varepsilon}{2}$, and thus $\frac{2}{\hat{w}_i} \ge \frac{2}{w_i + \varepsilon} \ge \frac{1}{w_i}$.

Proof of Lemma 14: Aside from the size of the collection, the rest of the conclusions follow from Propositions 12 and 13. For a given $\hat{w}$, the number of candidates $\hat{\mu}_1$ we consider is $\frac{40\sqrt{2}}{3\hat{w}\varepsilon}$. We sum this over all candidates for $\hat{w}$, namely $\varepsilon, 2\varepsilon, \ldots, 1-\varepsilon$, giving us
$$\sum_{k=1}^{\frac{1}{\varepsilon}-1} \frac{40\sqrt{2}}{3 k \varepsilon^2} = \frac{40\sqrt{2}}{3\varepsilon^2}\, H_{\frac{1}{\varepsilon}-1} = O\left(\frac{\log \varepsilon^{-1}}{\varepsilon^2}\right),$$
where $H_n$ is the $n$th harmonic number.

Proof of Lemma 16: Let $Y$ be the distance from $\hat{\mu}_1$ to the nearest sample. From Lemma 15, with probability $\ge \frac{9}{10}$, $Y \in \left[\frac{c_3}{4}\sigma_1, \left(\sqrt{2} + \frac{c_3}{4}\right)\sigma_1\right]$. We can generate candidates by rearranging the bounds to obtain
$$\frac{Y}{\sqrt{2} + \frac{c_3}{4}} \le \sigma_1 \le \frac{Y}{\frac{c_3}{4}}.$$
Applying Fact 23 and noting that $\frac{R}{L} = O(1)$ for this range, we conclude that we can grid over it with $O\left(\frac{1}{\varepsilon}\right)$ candidates.

Proof of Lemma 18: The proof follows the sketch outlined in Section 3.4. We first use Proposition 7 to construct an approximation $\hat{F}$ of the GMM $F$. Using Proposition 17, we see that
$$d_K\left(\mathcal{N}(\mu_2, \sigma_2^2),\; \frac{\hat{F} - \hat{w}^* \mathcal{N}(\hat{\mu}_1^*, \hat{\sigma}_1^{*2})}{1 - \hat{w}^*}\right) \le \frac{O(\varepsilon)}{1-w}.$$
By Lemma 8, we can compute a distribution $\hat{H}$ such that $d_K(\mathcal{N}(\mu_2, \sigma_2^2), \hat{H}) \le \frac{O(\varepsilon)}{1-w}$. Finally, using the median and interquartile range and the guarantees provided by Lemmas 9 and 10, we can compute candidates $\hat{\mu}_2^* \in \mu_2 \pm O\left(\frac{\varepsilon}{1-w}\right)\sigma_2$ and $\hat{\sigma}_2^* \in \left(1 \pm O\left(\frac{\varepsilon}{1-w}\right)\right)\sigma_2$ from $\hat{H}$, as desired.

Proof of Proposition 17:
$$\begin{aligned}
d_K\left(\mathcal{N}(\mu_2, \sigma_2^2),\; \frac{\hat{F} - \hat{w}^* \mathcal{N}(\hat{\mu}_1^*, \hat{\sigma}_1^{*2})}{1 - \hat{w}^*}\right) &= \frac{1}{1-\hat{w}^*}\, d_K\left(\hat{w}^* \mathcal{N}(\hat{\mu}_1^*, \hat{\sigma}_1^{*2}) + (1-\hat{w}^*)\mathcal{N}(\mu_2, \sigma_2^2),\; \hat{F}\right) \\
&\le \frac{1}{1-\hat{w}^*}\left(d_K\left(\hat{w}^* \mathcal{N}(\hat{\mu}_1^*, \hat{\sigma}_1^{*2}) + (1-\hat{w}^*)\mathcal{N}(\mu_2, \sigma_2^2),\; F\right) + d_K(F, \hat{F})\right) \\
&\le \frac{1}{1-\hat{w}^*}\left(d_{TV}\left(\hat{w}^* \mathcal{N}(\hat{\mu}_1^*, \hat{\sigma}_1^{*2}) + (1-\hat{w}^*)\mathcal{N}(\mu_2, \sigma_2^2),\; F\right) + O(\varepsilon)\right) \\
&\le \frac{1}{1-\hat{w}^*}\left(|w - \hat{w}| + d_{TV}\left(\mathcal{N}(\mu_1, \sigma_1^2),\; \mathcal{N}(\hat{\mu}_1^*, \hat{\sigma}_1^{*2})\right) + O(\varepsilon)\right) \\
&\le \frac{O(\varepsilon)}{1-\hat{w}^*} \le \frac{O(\varepsilon)}{1-w}.
\end{aligned}$$
The equality is a rearrangement of terms, the first inequality is the triangle inequality, the second inequality uses Fact 5, and the third and fourth inequalities use Propositions 3 and 2, respectively.
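To make the role of Lemmas 9 and 10 concrete, the following is a small Python sketch (our own illustration, not the paper's pseudocode) of recovering location and scale from samples: the empirical median estimates $\mu$ up to $O(\varepsilon)\sigma$, and the interquartile range, divided by $2\sqrt{2}\,\mathrm{erf}^{-1}(1/2)$, estimates $\sigma$ up to a $(1 \pm O(\varepsilon))$ factor.

```python
import math
import numpy as np
from scipy.special import erfinv

def estimate_from_quantiles(samples):
    """Estimate (mu, sigma) of a Gaussian from the empirical median and
    interquartile range, as suggested by Lemmas 9 and 10."""
    q25, q50, q75 = np.percentile(samples, [25, 50, 75])
    mu_hat = q50
    # For N(mu, sigma^2): F^{-1}(3/4) - F^{-1}(1/4) = 2*sqrt(2)*erfinv(1/2)*sigma.
    sigma_hat = (q75 - q25) / (2.0 * math.sqrt(2.0) * erfinv(0.5))
    return mu_hat, sigma_hat
```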
C.1 Proof of Theorem 11

Proof of Theorem 11: We produce two lists of candidates, corresponding to whether $\min(w, 1-w) = \Omega(\varepsilon)$ or not:

• In the first case, combining Lemmas 14, 16, and 18 and taking the Cartesian product of the resulting candidates for the mixture's parameters, we see that we can obtain a collection of $O\left(\frac{\log \varepsilon^{-1}}{\varepsilon^3}\right)$ candidate mixtures. With probability $\ge \frac{4}{5}$, this collection will contain a candidate $(\hat{w}^*, \hat{\mu}_1^*, \hat{\mu}_2^*, \hat{\sigma}_1^*, \hat{\sigma}_2^*)$ such that $\hat{w}^* \in w \pm O(\varepsilon)$, $\hat{\mu}_i^* \in \mu_i \pm O(\varepsilon)\sigma_i$ for $i = 1, 2$, and $\hat{\sigma}_i^* \in (1 \pm O(\varepsilon))\sigma_i$ for $i = 1, 2$. Note that we can choose the hidden constants to be as small as necessary for Lemma 4, and thus we can obtain the desired total variation distance. Finally, note that the number of samples we need for the above to hold is $O(1/\varepsilon^2)$. For this, it is crucial that we first draw a sufficient number, $O(1/\varepsilon^2)$, of samples from the mixture (specified by the worst requirement among Lemmas 14, 16, and 18), and then execute the candidate generation algorithm outlined in Lemmas 14, 16, and 18. In particular, we do not want to redraw samples for every branching of this algorithm. Finally, to boost the success probability, we repeat the entire process $\log_5 \delta^{-1}$ times and let our collection of candidate mixtures be the union of the collections from these repetitions. The probability that none of these collections contains a suitable candidate distribution is $\le \left(\frac{1}{5}\right)^{\log_5 \delta^{-1}} \le \delta$.

• In the second case, i.e., when one of the weights, w.l.o.g. $w_2$, is $O(\varepsilon)$, we set $\hat{w}_1 = 1$ and we only produce candidates for $(\mu_1, \sigma_1^2)$. Note that this scenario fits into the framework of Lemmas 9 and 10. Our mixture $F$ is such that $d_K(F, \mathcal{N}(\mu_1, \sigma_1^2)) \le d_{TV}(F, \mathcal{N}(\mu_1, \sigma_1^2)) \le O(\varepsilon)$. By the DKW inequality (Theorem 6), we can use $O\left(\frac{1}{\varepsilon^2}\log\frac{1}{\delta}\right)$ samples to generate the empirical distribution, which gives us a distribution $\hat{F}$ such that $d_K(\hat{F}, \mathcal{N}(\mu_1, \sigma_1^2)) \le O(\varepsilon)$ (by the triangle inequality), with probability $\ge 1 - \delta$. From this distribution, using the median and interquartile range and the guarantees of Lemmas 9 and 10, we can extract $\hat{\mu}_1^*$ and $\hat{\sigma}_1^*$ such that $|\hat{\mu}_1^* - \mu_1| \le O(\varepsilon)\sigma_1$ and $|\hat{\sigma}_1^* - \sigma_1| \le O(\varepsilon)\sigma_1$. Thus, by Lemma 4, we can achieve the desired total variation distance.
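Schematically, the first case amounts to forming the Cartesian product of the per-parameter candidate lists produced by Lemmas 14, 16, and 18. A toy Python sketch (the name is ours; the grids are passed in as lists):

```python
from itertools import product

def candidate_mixtures(w_grid, mu1_grid, s1_grid, mu2_grid, s2_grid):
    """Cartesian product of per-parameter candidate lists; with constant
    probability one tuple is component-wise close to the true parameters,
    and the tournament of Theorem 19 then selects a good candidate."""
    return list(product(w_grid, mu1_grid, s1_grid, mu2_grid, s2_grid))
```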
D Details about Representing and Manipulating CDFs

First, we remark that, given a discrete random variable $X$ over a support of size $n$, in $\tilde{O}(n)$ time we can construct a data structure which allows us to compute $F_X^{-1}(x)$ for $x \in [0,1]$ at the cost of $O(\log n)$ time per operation. This data structure will be a set of disjoint intervals which form a partition of $[0,1]$, each associated with a value. A value $x \in [0,1]$ will be mapped to the value associated with the interval which contains $x$. This data structure can be constructed by mapping each value to an interval of width equal to the probability of that value, and it can be queried in $O(\log n)$ time by performing binary search on the left endpoints of the intervals. We name this the $n$-interval partition representation of a distribution. To avoid confusion with intervals that represent elements of the $\sigma$-algebra of the distribution, we refer to the intervals stored in the data structure as probability intervals.

We note that, if we are only concerned with sampling, the order of the elements of the support is irrelevant. However, we will sort the elements of the support in order to perform efficient modifications later.

At one point in our learning algorithm, we will have a candidate which correctly describes one of the two components in our mixture of Gaussians. If we could "subtract out" this component from the mixture, we would be left with a single Gaussian; in this setting, we can efficiently perform parameter estimation to learn the other component. However, if we naively subtract the probability densities, we will obtain negative probability densities, or equivalently, non-monotonically increasing CDFs. To deal with this issue, we define a process we call monotonization. Intuitively, this shifts negative probability density to locations with positive probability density. We show that this preserves Kolmogorov distance and that it can be implemented efficiently.

Definition 3. Given a bounded function $f : \mathbb{R} \to \mathbb{R}$, the monotonization of $f$ is $\hat{f}$, where $\hat{f}(x) = \sup_{y \le x} f(y)$.

We argue that if a function is close in Kolmogorov distance to a monotone function, then so is its monotonization.

Proposition 25. Suppose we have two bounded functions $F$ and $G$ such that $d_K(F, G) \le \varepsilon$, where $F$ is monotone non-decreasing. Then $\hat{G}$, the monotonization of $G$, is such that $d_K(F, \hat{G}) \le \varepsilon$.

Proof. We show that $|F(x) - \hat{G}(x)| \le \varepsilon$ holds for an arbitrary point $x$, implying that $d_K(F, \hat{G}) \le \varepsilon$. There are two cases: $F(x) \ge \hat{G}(x)$ and $F(x) < \hat{G}(x)$. If $F(x) \ge \hat{G}(x)$, using the fact that $\hat{G}(x) \ge G(x)$ (due to monotonization), we can deduce $|F(x) - \hat{G}(x)| \le |F(x) - G(x)| \le \varepsilon$. If $F(x) < \hat{G}(x)$, consider an infinite sequence of points $\{y_i\}$ with $y_i \le x$ such that $G(y_i)$ becomes arbitrarily close to $\sup_{y \le x} G(y)$. By monotonicity of $F$, we have that $|F(x) - \hat{G}(x)| \le |F(y_i) - G(y_i)| + \delta_i \le \varepsilon + \delta_i$, where $\delta_i = |\hat{G}(x) - G(y_i)|$. Since $\delta_i$ can be taken arbitrarily small, we have $|F(x) - \hat{G}(x)| \le \varepsilon$.

We will need to efficiently compute the monotonization in certain settings, when subtracting one monotone function from another.

Proposition 26. Suppose we have access to the $n$-interval partition representation of a CDF $F$. Given a monotone non-decreasing function $G$, we can compute the $n$-interval partition representation of the monotonization of $F - G$ in $O(n)$ time.

Proof. Consider the values in the $n$-interval partition of $F$. Between any two consecutive values $v_1$ and $v_2$, $F$ will be flat, and since $G$ is monotone non-decreasing, $F - G$ will be monotone non-increasing. Therefore, the monotonization of $F - G$ at $x \in [v_1, v_2)$ will be the maximum of $F - G$ on $(-\infty, v_1]$. The resulting monotonization will also be flat on the same intervals as $F$, so we only need to update the probability intervals to reflect this monotonization. We will iterate over probability intervals in increasing order of their values, and describe how to update each interval. We will need to keep track of the maximum value of $F - G$ seen so far. Let $m$ be the maximum of $F - G$ over all $x \le v$, where $v$ is the value associated with the last probability interval we have processed. Initially, we have the value $m = 0$. Suppose we are inspecting a probability interval with endpoints $[l, r]$ and value $v$. The left endpoint of this probability interval will become $\hat{l} = m$, and the right endpoint will become $\hat{r} = r - G(v)$. If $\hat{r} \le \hat{l}$, the interval is degenerate, meaning that the monotonization flattens out the discontinuity at $v$; therefore, we simply delete the interval. Otherwise, we have a proper probability interval, and we update $m = \hat{r}$. This update takes constant time per interval, so the overall time required is $O(n)$.
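The following Python sketch (our own rendering, under the assumptions above) implements the $n$-interval partition representation with an $O(\log n)$ inverse-CDF query, together with the $O(n)$ subtract-and-monotonize pass of Proposition 26.

```python
import bisect

class IntervalPartition:
    """n-interval partition representation of a discrete distribution:
    disjoint probability intervals [l, r) of [0, 1], each carrying a value."""
    def __init__(self, values, probs):
        # Sort the support so later modifications (e.g. monotonization) work.
        pairs = sorted(zip(values, probs))
        self.lefts, self.rights, self.values = [], [], []
        acc = 0.0
        for v, p in pairs:
            self.lefts.append(acc)
            acc += p
            self.rights.append(acc)
            self.values.append(v)

    def inverse_cdf(self, x):
        """Return F^{-1}(x) by binary search on the left endpoints."""
        i = bisect.bisect_right(self.lefts, x) - 1
        return self.values[i]

def monotonize_minus(partition, G):
    """Proposition 26: given the partition of F and a non-decreasing G,
    rebuild the partition of the monotonization of F - G in O(n) time."""
    lefts, rights, values = [], [], []
    m = 0.0  # running maximum of F - G seen so far
    for r, v in zip(partition.rights, partition.values):
        r_hat = r - G(v)
        if r_hat > m:  # keep only non-degenerate intervals
            lefts.append(m)
            rights.append(r_hat)
            values.append(v)
            m = r_hat
    out = IntervalPartition([], [])
    out.lefts, out.rights, out.values = lefts, rights, values
    return out
```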
To conclude, we prove Lemma 8.

Proof of Lemma 8: First, by assumption, we know that $\frac{1}{1-w}\, d_K((1-w)H, F - wG) \le \varepsilon$. By Proposition 26, we can efficiently compute the monotonization of $F - wG$; name this $(1-w)\hat{H}$. By Proposition 25, we have that $\frac{1}{1-w}\, d_K((1-w)H, (1-w)\hat{H}) \le \varepsilon$. Renormalizing the distributions gives the desired approximation guarantee. To justify the running time of this procedure, we must also argue that the normalization can be done efficiently. To normalize the distribution $(1-w)\hat{H}$, we make another $O(n)$ pass over the probability intervals and multiply all the endpoints by $\frac{1}{r^*}$, where $r^*$ is the right endpoint of the rightmost probability interval. We note that $r^*$ will be exactly $1 - w$, because the value of $F - wG$ at $\infty$ is $1 - w$, so this process results in the distribution $\hat{H}$.

E Robust Estimation of Scale from a Mixture of Gaussians

In this section, we examine the following statistic: given some point $x \in \mathbb{R}$ and $n$ IID random variables $X_1, \ldots, X_n$, what is the minimum distance between $x$ and any $X_i$? We give an interval in which this statistic is likely to fall (Proposition 27), and examine its robustness when sampling from distributions which are statistically close to the distribution under consideration (Proposition 29). We then apply these results to mixtures of Gaussians (Proposition 30 and Lemma 15).

Proposition 27. Suppose we have $n$ IID random variables $X_1, \ldots, X_n \sim X$, some $x \in \mathbb{R}$, and $y = F_X(x)$. Let $I_N$ be the interval $\left[F_X^{-1}\left(y - \frac{c_1}{n}\right), F_X^{-1}\left(y + \frac{c_1}{n}\right)\right]$ and $I_F$ be the interval $\left[F_X^{-1}\left(y - \frac{c_2}{n}\right), F_X^{-1}\left(y + \frac{c_2}{n}\right)\right]$ for some constants $0 < c_1 < c_2 \le n$, and let $I = I_F \setminus I_N$. Let $j = \arg\min_i |X_i - x|$. Then $\Pr[X_j \in I] \ge \frac{9}{10}$ for all $n > 0$.

Proof. We show that $\Pr[X_j \notin I] \le \frac{1}{10}$ by showing that the following two bad events are unlikely:

1. We have a sample which is too close to $x$;
2. All our samples are too far from $x$.

Showing that these events occur with low probability and combining with the union bound gives the desired result.

Let $Y$ be the number of samples at distance $< \frac{c_1}{n}$ in the CDF, i.e., $Y = \left|\left\{i : |F_X(X_i) - y| < \frac{c_1}{n}\right\}\right|$. By linearity of expectation, $\mathbb{E}[Y] = 2c_1$. By Markov's inequality, $\Pr(Y > 0) < 2c_1$. This allows us to upper bound the probability that one of our samples is too close to $x$.

Let $Z$ be the number of samples at distance $< \frac{c_2}{n}$ in the CDF, i.e., $Z = \left|\left\{i : |F_X(X_i) - y| < \frac{c_2}{n}\right\}\right|$, and let $Z_i$ be an indicator random variable which indicates this property for $X_i$. We use the second moment principle,
$$\Pr(Z > 0) \ge \frac{\mathbb{E}[Z]^2}{\mathbb{E}[Z^2]}.$$
By linearity of expectation, $\mathbb{E}[Z]^2 = 4c_2^2$, and
$$\mathbb{E}[Z^2] = \sum_i \mathbb{E}[Z_i^2] + \sum_i \sum_{j \ne i} \mathbb{E}[Z_i Z_j] = 2c_2 + n(n-1)\left(\frac{4c_2^2}{n^2}\right) \le 2c_2 + 4c_2^2.$$
And thus, $\Pr(Z = 0) \le \frac{1}{2c_2 + 1}$. This allows us to upper bound the probability that all of our samples are too far from $x$.

Setting $c_1 = \frac{1}{40}$ and $c_2 = \frac{19}{2}$ gives probability $< \frac{1}{20}$ for each of the bad events, resulting in a probability $< \frac{1}{10}$ of either bad event by the union bound, and thus the desired result.
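As a quick illustration (our own, not from the paper), one can check Proposition 27 empirically: draw $n$ samples, find the one nearest to $x$, and record its CDF distance from $x$.

```python
import numpy as np
from scipy.stats import norm

def nearest_sample_cdf_distance(x, n, rng):
    """Draw n standard normal samples; return |F(X_j) - F(x)| for the
    sample X_j nearest to x -- the statistic bounded by Proposition 27."""
    samples = rng.standard_normal(n)
    j = np.argmin(np.abs(samples - x))
    return abs(norm.cdf(samples[j]) - norm.cdf(x))

# Proposition 27 (with c1 = 1/40, c2 = 19/2) predicts this statistic lands
# in [c1/n, c2/n] with probability at least 9/10.
rng = np.random.default_rng(0)
n = 500
hits = [(1 / 40) / n <= nearest_sample_cdf_distance(0.0, n, rng) <= (19 / 2) / n
        for _ in range(200)]
print(sum(hits) / len(hits))
```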
We will need the following property of Kolmogorov distance, which states that the probability mass within every interval is approximately preserved:

Proposition 28. If $d_K(f_X, f_Y) \le \varepsilon$, then for all intervals $I \subseteq \mathbb{R}$, $|f_X(I) - f_Y(I)| \le 2\varepsilon$.

Proof. For an interval $I = [a, b]$, we can rewrite the property as
$$|f_X(I) - f_Y(I)| = |(F_X(b) - F_X(a)) - (F_Y(b) - F_Y(a))| \le |F_X(b) - F_Y(b)| + |F_X(a) - F_Y(a)| \le 2\varepsilon,$$
as desired, where the first inequality is the triangle inequality and the second inequality is due to the bound on Kolmogorov distance.

The next proposition says that if we instead draw samples from a distribution which is close in Kolmogorov distance, the same property approximately holds with respect to the original distribution.

Proposition 29. Suppose we have $n$ IID random variables $\hat{X}_1, \ldots, \hat{X}_n \sim \hat{X}$ where $d_K(f_X, f_{\hat{X}}) \le \delta$, some $x \in \mathbb{R}$, and $y = F_X(x)$. Let $I_N$ be the interval $\left[F_X^{-1}\left(y - \frac{c_1}{n} + \delta\right), F_X^{-1}\left(y + \frac{c_1}{n} - \delta\right)\right]$ and $I_F$ be the interval $\left[F_X^{-1}\left(y - \frac{c_2}{n} - \delta\right), F_X^{-1}\left(y + \frac{c_2}{n} + \delta\right)\right]$ for some constants $0 < c_1 < c_2 \le n$, and let $I = I_F \setminus I_N$. Let $j = \arg\min_i |\hat{X}_i - x|$. Then $\Pr[\hat{X}_j \in I] \ge \frac{9}{10}$ for all $n > 0$.

Proof. First, examine the interval $I_N$. This interval contains $\frac{2c_1}{n} - 2\delta$ probability measure of the distribution $F_X$. By Proposition 28, $|F_X(I_N) - F_{\hat{X}}(I_N)| \le 2\delta$, so $F_{\hat{X}}(I_N) \le \frac{2c_1}{n}$. One can repeat this argument to show that the amount of measure contained by $F_{\hat{X}}$ in $\left[F_X^{-1}\left(y - \frac{c_2}{n} - \delta\right), F_X^{-1}\left(y + \frac{c_2}{n} + \delta\right)\right]$ is $\ge \frac{2c_2}{n}$. As established through the proof of Proposition 27, with probability $\ge \frac{9}{10}$, there will be no samples in a window containing probability measure $\frac{2c_1}{n}$, but there will be at least one sample in a window containing probability measure $\frac{2c_2}{n}$. Applying the same argument to these intervals, we arrive at the desired result.

We examine this statistic for some mixture of $k$ Gaussians with PDF $f$, around the point corresponding to the mean of the component with the minimum value of $\frac{\sigma_i}{w_i}$. Initially, we assume that we know this location exactly and that we are taking samples according to $f$ exactly.
Proposition 30. Consider a mixture of $k$ Gaussians with PDF $f$, components $\mathcal{N}(\mu_1, \sigma_1^2), \ldots, \mathcal{N}(\mu_k, \sigma_k^2)$, and weights $w_1, \ldots, w_k$. Let $j = \arg\min_i \frac{\sigma_i}{w_i}$. If we take $n$ samples $X_1, \ldots, X_n$ from the mixture (where $n > \frac{3\sqrt{\pi}\, c_2}{2 w_j}$), then
$$\min_i |X_i - \mu_j| \in \left[\frac{\sqrt{2\pi}\, c_1\, \sigma_j}{k\, w_j\, n},\; \frac{3\sqrt{2\pi}\, c_2\, \sigma_j}{2\, w_j\, n}\right]$$
with probability $\ge \frac{9}{10}$, where $c_1$ and $c_2$ are as defined in Proposition 27.

Proof. We examine the CDF of the mixture around $\mu_j$. Using Proposition 1 (and symmetry of a Gaussian about its mean), it is sufficient to show that
$$\left[\mu_j + \frac{\sqrt{2\pi}\, c_1\, \sigma_j}{k\, w_j\, n},\; \mu_j + \frac{3\sqrt{2\pi}\, c_2\, \sigma_j}{2\, w_j\, n}\right] \supseteq \left[F^{-1}\left(F(\mu_j) + \frac{c_1}{n}\right),\; F^{-1}\left(F(\mu_j) + \frac{c_2}{n}\right)\right],$$
where $F$ is the CDF of the mixture. We show that each endpoint of the latter interval bounds the corresponding endpoint of the former interval.

First, we show $\frac{c_1}{n} \ge F\left(\mu_j + \frac{\sqrt{2\pi}\, c_1\, \sigma_j}{k\, w_j\, n}\right) - F(\mu_j)$. Let $I = \left[\mu_j, \mu_j + \frac{\sqrt{2\pi}\, c_1\, \sigma_j}{k\, w_j\, n}\right]$, let $f$ be the PDF of the mixture, and let $f_i$ be the PDF of component $i$ of the mixture. The right-hand side of the inequality we wish to prove is equal to
$$\int_I f(x)\,dx = \int_I \sum_{i=1}^k w_i f_i(x)\,dx \le \int_I \sum_{i=1}^k w_i \frac{1}{\sigma_i \sqrt{2\pi}}\,dx \le \int_I \frac{k\, w_j}{\sigma_j \sqrt{2\pi}}\,dx = \frac{c_1}{n},$$
where the first inequality is since the maximum of the PDF of a Gaussian with standard deviation $\sigma$ is $\frac{1}{\sigma\sqrt{2\pi}}$, and the second is since $\frac{\sigma_j}{w_j} \le \frac{\sigma_i}{w_i}$ for all $i$.

Next, we show $\frac{c_2}{n} \le F\left(\mu_j + \frac{3\sqrt{2\pi}\, c_2\, \sigma_j}{2\, w_j\, n}\right) - F(\mu_j)$. We note that the right-hand side is the probability mass contained in the interval; a lower bound for this quantity is the probability mass contributed by the particular Gaussian we are examining, which is $\frac{w_j}{2}\,\mathrm{erf}\left(\frac{3\sqrt{\pi}\, c_2}{2\, w_j\, n}\right)$. Taking the Taylor expansion of the error function gives
$$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}}\left(x - \frac{x^3}{3} + O(x^5)\right) \ge \frac{2}{\sqrt{\pi}}\cdot\frac{2}{3}x \quad \text{for } x < 1.$$
Applying this here, we can lower bound the contributed probability mass by $\frac{w_j}{2}\cdot\frac{2}{\sqrt{\pi}}\cdot\frac{2}{3}\cdot\frac{3\sqrt{\pi}\, c_2}{2\, w_j\, n} = \frac{c_2}{n}$, as desired.

Finally, we deal with uncertainties in the parameters and apply the robustness properties of Proposition 29 in the following lemma:

Proof of Lemma 15: We analyze the effect of each uncertainty:

• First, we consider the effect of sampling from $\hat{f}$, which is $\delta$-close to $f$. By Proposition 29, we know that the nearest sample to $\mu_j$ will be at CDF distance between $\frac{c_1}{n} - \delta \ge \frac{c_1}{2n}$ and $\frac{c_2}{n} + \delta \le \frac{3c_2}{2n}$. We can then repeat the proof of Proposition 30 with $c_1$ replaced by $\frac{c_1}{2}$ and $c_2$ replaced by $\frac{3c_2}{2}$. This gives us that $\min_i |X_i - \mu_j| \in \left[\frac{\sqrt{\pi}\, c_1}{\sqrt{2}\, k\, w_j\, n}\sigma_j,\; \frac{9\sqrt{\pi}\, c_2}{2\sqrt{2}\, w_j\, n}\sigma_j\right]$ (where $n \ge \frac{9\sqrt{\pi}\, c_2}{4 w_j}$) with probability $\ge \frac{9}{10}$.

• Next, substituting in the bounds $\frac{1}{2}\hat{w}_j \le w_j \le 2\hat{w}_j$, we get $\min_i |X_i - \mu_j| \in \left[\frac{\sqrt{\pi}\, c_1}{2\sqrt{2}\, k\, \hat{w}_j\, n}\sigma_j,\; \frac{9\sqrt{\pi}\, c_2}{\sqrt{2}\, \hat{w}_j\, n}\sigma_j\right]$ (where $n \ge \frac{9\sqrt{\pi}\, c_2}{2\hat{w}_j}$) with probability $\ge \frac{9}{10}$.

• We use $n = \frac{9\sqrt{\pi}\, c_2}{2\hat{w}_j}$ samples to obtain: $\min_i |X_i - \mu_j| \in \left[\frac{c_3}{k}\sigma_j,\; \sqrt{2}\sigma_j\right]$ with probability $\ge \frac{9}{10}$.

• Finally, applying $|\hat{\mu}_j - \mu_j| \le \frac{c_3}{2k}\sigma_j$ gives the lemma statement.

F Omitted Proofs from Section 4

Proof of Lemma 20: We set up a competition between $H_1$ and $H_2$, in terms of the following subset of $\mathcal{D}$:
$$\mathcal{W}_1 \equiv \mathcal{W}_1(H_1, H_2) := \{w \in \mathcal{D} \mid H_1(w) > H_2(w)\}.$$
In terms of $\mathcal{W}_1$ we define $p_1 = H_1(\mathcal{W}_1)$ and $p_2 = H_2(\mathcal{W}_1)$. Clearly, $p_1 > p_2$ and $d_{TV}(H_1, H_2) = p_1 - p_2$. The competition between $H_1$ and $H_2$ is carried out as follows:
1a. Draw $m = O\left(\frac{\log(1/\delta)}{\varepsilon^2}\right)$ samples $s_1, \ldots, s_m$ from $X$, and let $\hat{\tau} = \frac{1}{m}|\{i \mid s_i \in \mathcal{W}_1\}|$ be the fraction of them that fall inside $\mathcal{W}_1$.

1b. Similarly, draw $m$ samples from $H_1$, and let $\hat{p}_1$ be the fraction of them that fall inside $\mathcal{W}_1$.

1c. Finally, draw $m$ samples from $H_2$, and let $\hat{p}_2$ be the fraction of them that fall inside $\mathcal{W}_1$.

2. If $\hat{p}_1 - \hat{p}_2 \le 6\varepsilon$, declare a draw. Otherwise:

3. If $\hat{\tau} > \hat{p}_1 - 2\varepsilon$, declare $H_1$ as winner and return $H_1$; otherwise,

4. if $\hat{\tau} < \hat{p}_2 + 2\varepsilon$, declare $H_2$ as winner and return $H_2$; otherwise,

5. declare a draw.

Notice that, in Steps 1a, 1b and 1c, the algorithm utilizes the PDF comparator for distributions $H_1$ and $H_2$. The correctness of the algorithm is a consequence of the following claim.

Claim 5. Suppose that $d_{TV}(X, H_1) \le \varepsilon$. Then:

1. If $d_{TV}(X, H_2) > 8\varepsilon$, then the probability that the competition between $H_1$ and $H_2$ does not declare $H_1$ as the winner is at most $6e^{-m\varepsilon^2/2}$;

2. If $d_{TV}(X, H_2) > 4\varepsilon$, then the probability that the competition between $H_1$ and $H_2$ returns $H_2$ as the winner is at most $6e^{-m\varepsilon^2/2}$.

The analogous conclusions hold if we interchange $H_1$ and $H_2$ in the above claims. Finally, if $d_{TV}(H_1, H_2) \le 5\varepsilon$, the algorithm will declare a draw with probability at least $1 - 6e^{-m\varepsilon^2/2}$.

Proof of Claim 5: Let $\tau = X(\mathcal{W}_1)$. The Chernoff bound (together with a union bound) implies that, with probability at least $1 - 6e^{-m\varepsilon^2/2}$, the following are simultaneously true: $|p_1 - \hat{p}_1| < \varepsilon/2$, $|p_2 - \hat{p}_2| < \varepsilon/2$, and $|\tau - \hat{\tau}| < \varepsilon/2$. Conditioning on these:

• If $d_{TV}(X, H_1) \le \varepsilon$ and $d_{TV}(X, H_2) > 8\varepsilon$, then from the triangle inequality we get that $p_1 - p_2 = d_{TV}(H_1, H_2) > 7\varepsilon$, hence $\hat{p}_1 - \hat{p}_2 > p_1 - p_2 - \varepsilon > 6\varepsilon$. Hence, the algorithm will go beyond Step 2. Moreover, $d_{TV}(X, H_1) \le \varepsilon$ implies that $|\tau - p_1| \le \varepsilon$, hence $|\hat{\tau} - \hat{p}_1| < 2\varepsilon$. So the algorithm will stop at Step 3, declaring $H_1$ as the winner of the competition between $H_1$ and $H_2$.

• If $d_{TV}(X, H_2) \le \varepsilon$ and $d_{TV}(X, H_1) > 8\varepsilon$, then as in the previous case we get from the triangle inequality that $p_1 - p_2 = d_{TV}(H_1, H_2) > 7\varepsilon$, hence $\hat{p}_1 - \hat{p}_2 > p_1 - p_2 - \varepsilon > 6\varepsilon$. Hence, the algorithm will go beyond Step 2. Moreover, $d_{TV}(X, H_2) \le \varepsilon$ implies that $|\tau - p_2| \le \varepsilon$, hence $|\hat{\tau} - \hat{p}_2| < 2\varepsilon$. So $\hat{p}_1 > \hat{\tau} + 4\varepsilon$. Hence, the algorithm will not stop at Step 3, and it will stop at Step 4, declaring $H_2$ as the winner of the competition between $H_1$ and $H_2$.

• If $d_{TV}(X, H_1) \le \varepsilon$ and $d_{TV}(X, H_2) > 4\varepsilon$, we distinguish two subcases. If $\hat{p}_1 - \hat{p}_2 \le 6\varepsilon$, then the algorithm will stop at Step 2, declaring a draw. If $\hat{p}_1 - \hat{p}_2 > 6\varepsilon$, the algorithm proceeds to Step 3. Notice that $d_{TV}(X, H_1) \le \varepsilon$ implies that $|\tau - p_1| \le \varepsilon$, hence $|\hat{\tau} - \hat{p}_1| < 2\varepsilon$. So the algorithm will stop at Step 3, declaring $H_1$ as the winner of the competition between $H_1$ and $H_2$.

• If $d_{TV}(X, H_2) \le \varepsilon$ and $d_{TV}(X, H_1) > 4\varepsilon$, we distinguish two subcases. If $\hat{p}_1 - \hat{p}_2 \le 6\varepsilon$, then the algorithm will stop at Step 2, declaring a draw. If $\hat{p}_1 - \hat{p}_2 > 6\varepsilon$, the algorithm proceeds to Step 3. Notice that $d_{TV}(X, H_2) \le \varepsilon$ implies that $|\tau - p_2| \le \varepsilon$, hence $|\hat{\tau} - \hat{p}_2| < 2\varepsilon$. Hence, $\hat{p}_1 > \hat{p}_2 + 6\varepsilon \ge \hat{\tau} + 4\varepsilon$, so the algorithm will not stop at Step 3 and will proceed to Step 4. Given that $|\hat{\tau} - \hat{p}_2| < 2\varepsilon$, the algorithm will stop at Step 4, declaring $H_2$ as the winner of the competition between $H_1$ and $H_2$.

• If $d_{TV}(H_1, H_2) \le 5\varepsilon$, then $p_1 - p_2 \le 5\varepsilon$, hence $\hat{p}_1 - \hat{p}_2 \le 6\varepsilon$. So the algorithm will stop at Step 2, declaring a draw.
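For concreteness, here is a compact Python sketch of this competition (a minimal rendering of Steps 1a–5; the sampler and comparator interfaces are our own assumptions): `sample_X`, `sample_H1`, `sample_H2` each draw one sample, and `h1_beats_h2(w)` is the PDF comparator indicating whether $H_1(w) > H_2(w)$, i.e., whether $w \in \mathcal{W}_1$.

```python
def choose_hypothesis(sample_X, sample_H1, sample_H2, h1_beats_h2, eps, m):
    """Run the competition between H1 and H2 over W1 = {w : H1(w) > H2(w)}.
    Returns 'H1', 'H2', or 'draw'."""
    frac_in_W1 = lambda sampler: sum(h1_beats_h2(sampler()) for _ in range(m)) / m
    tau_hat = frac_in_W1(sample_X)   # Step 1a: empirical mass of X on W1
    p1_hat = frac_in_W1(sample_H1)   # Step 1b: empirical mass of H1 on W1
    p2_hat = frac_in_W1(sample_H2)   # Step 1c: empirical mass of H2 on W1
    if p1_hat - p2_hat <= 6 * eps:   # Step 2
        return 'draw'
    if tau_hat > p1_hat - 2 * eps:   # Step 3
        return 'H1'
    if tau_hat < p2_hat + 2 * eps:   # Step 4
        return 'H2'
    return 'draw'                    # Step 5
```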
Proof of Lemma 21: Draw $m = O(\log(2N/\delta)/\varepsilon^2)$ samples from each of $X, H_1, \ldots, H_N$ and, using the same samples, run ChooseHypothesis$\left(X, H_i, H_j, \varepsilon, \frac{\delta}{2N}\right)$ for every pair of distributions $H_i, H_j \in \mathcal{H}$. If there is a distribution $H \in \mathcal{H}$ that was never a loser (but potentially tied with some distributions), output any such distribution. Otherwise, output "failure."

We analyze the correctness of our proposed algorithm in two steps. First, suppose there exists $H^* \in \mathcal{H}$ such that $d_{TV}(H^*, X) \le \varepsilon$. We argue that, with probability at least $1 - \frac{\delta}{2}$, $H^*$ never loses a competition against any other $H' \in \mathcal{H}$ (so the tournament does not output "failure"). Consider any $H' \in \mathcal{H}$. If $d_{TV}(X, H') > 4\varepsilon$, by Lemma 20 the probability that $H^*$ is not declared a winner or tie against $H'$ is at most $\frac{\delta}{2N}$. On the other hand, if $d_{TV}(X, H') \le 4\varepsilon$, the triangle inequality gives that $d_{TV}(H^*, H') \le 5\varepsilon$ and, by Lemma 20, the probability that $H^*$ does not draw against $H'$ is at most $\frac{\delta}{2N}$. A union bound over all $N$ distributions in $\mathcal{H}$ shows that with probability at least $1 - \frac{\delta}{2}$, the distribution $H^*$ never loses a competition.

We next argue that, with probability at least $1 - \frac{\delta}{2}$, every distribution $H \in \mathcal{H}$ that never loses must be $8\varepsilon$-close to $X$. Fix a distribution $H$ such that $d_{TV}(X, H) > 8\varepsilon$. Lemma 20 implies that $H$ loses to $H^*$ with probability at least $1 - \frac{\delta}{2N}$. A union bound gives that with probability at least $1 - \frac{\delta}{2}$, every distribution $H$ such that $d_{TV}(X, H) > 8\varepsilon$ loses some competition. Thus, with overall probability at least $1 - \delta$, the tournament does not output "failure" and outputs some distribution $H$ such that $d_{TV}(H, X) \le 8\varepsilon$.

Proof of Claim 1: The probability that $\mathcal{H}'$ contains no distribution that is $8\varepsilon$-close to $X$ is at most $(1-p)^{\lceil 3\sqrt{N} \rceil} \le e^{-3}$. If $\mathcal{H}'$ contains at least one distribution that is $8\varepsilon$-close to $X$, then by Lemma 21 the distribution output by SlowTournament$(X, \mathcal{H}', 8\varepsilon, e^{-3})$ is $64\varepsilon$-close to $X$ with probability at least $1 - e^{-3}$. From a union bound, it follows that the distribution output by S1 is $64\varepsilon$-close to $X$ with probability at least $1 - 2e^{-3} \ge 9/10$. The bounds on the number of samples and operations follow from Lemma 21.
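A minimal Python sketch of the SlowTournament of Lemma 21 (our own rendering; hypotheses are identified by index, and `run_competition` wraps the `choose_hypothesis` sketch above):

```python
def slow_tournament(run_competition, N):
    """Round-robin over all pairs; output any hypothesis that never lost.
    run_competition(i, j) returns the index of the loser, or None on a draw.
    O(N^2) competitions in total, which Section 4 improves to quasilinear."""
    lost = [False] * N
    for i in range(N):
        for j in range(i + 1, N):
            loser = run_competition(i, j)
            if loser is not None:
                lost[loser] = True
    for i in range(N):
        if not lost[i]:
            return i
    return None  # 'failure'
```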
Proof of Claim 2: Suppose that there is some distribution $H^* \in \mathcal{H}$ that is $\varepsilon$-close to $X$. We first argue that with probability at least $\frac{1}{3}$, $H^* \in \mathcal{H}_{i_T}$. We show this in two steps:

(a) Recall that we draw samples from $X, H_1, \ldots, H_N$ before Phase 1 begins, and reuse the same samples whenever required by some execution of ChooseHypothesis during Phase 1. Fix a realization of these samples. We can ask what would happen if we executed ChooseHypothesis$(X, H^*, H_j, \varepsilon, 1/3N)$ for some $H_j \in \mathcal{H} \setminus \{H^*\}$ using these samples. From Lemma 20, it follows that, if $H_j$ is farther than $8\varepsilon$ away from $X$, then $H^*$ would be declared the winner by ChooseHypothesis$(X, H^*, H_j, \varepsilon, 1/3N)$, with probability at least $1 - 1/3N$. By a union bound, our samples satisfy this property simultaneously for all $H_j \in \mathcal{H} \setminus \{H^*\}$ that are farther than $8\varepsilon$ away from $X$, with probability at least $1 - 1/3$. Henceforth, we condition on our samples having this property.

(b) Conditioning on our samples having the property discussed in (a), we argue that $H^* \in \mathcal{H}_{i_T}$ with probability at least $1/2$ (so that, with overall probability at least $1/3$, it holds that $H^* \in \mathcal{H}_{i_T}$). It suffices to argue that, with probability at least $1/2$, in all iterations of Phase 1, $H^*$ is not matched with a distribution that is $8\varepsilon$-close to $X$. This happens with probability at least
$$(1-p)(1-2p)\cdots(1-2^{T-1}p) \ge 2^{-2p\sum_{i=0}^{T-1} 2^i} = 2^{-2p(2^T - 1)} \ge 1/2.$$
Indeed, given the definition of $p$, the probability that $H^*$ is not matched to a distribution that is $8\varepsilon$-close to $X$ is at least $1-p$ in the first iteration. If this happens, then (because of our conditioning from (a)) $H^*$ will survive this iteration. In the next iteration, the fraction of surviving distributions that are $8\varepsilon$-close to $X$ and are different from $H^*$ itself is at most $2p$. Hence, the probability that $H^*$ is not matched to a distribution that is $8\varepsilon$-close to $X$ is at least $1 - 2p$ in the second iteration, etc.

Now, conditioning on $H^* \in \mathcal{H}_{i_T}$, it follows from Lemma 21 that the distribution $\hat{H}$ output by SlowTournament$(X, \mathcal{H}_{i_T}, \varepsilon, 1/4)$ is $8\varepsilon$-close to $X$ with probability at least $3/4$. Hence, with overall probability at least $1/4$, the distribution output by S2 is $8\varepsilon$-close to $X$.

The number of samples drawn from each distribution in $\mathcal{H} \cup \{X\}$ is clearly $O\left(\frac{1}{\varepsilon^2}\log N\right)$, as Phase 1 draws $O\left(\frac{1}{\varepsilon^2}\log N\right)$ samples from each distribution and, by Lemma 21, Phase 2 also draws $O\left(\frac{1}{\varepsilon^2}\log N\right)$ samples from each distribution. The total number of operations is bounded by $O\left(\frac{1}{\varepsilon^2} N \log N\right)$. Indeed, Phase 1 runs ChooseHypothesis $O(N)$ times and, by Lemma 20 and our choice of $1/3N$ for the confidence parameter of each execution, each execution takes $O(\log N/\varepsilon^2)$ operations. So the total number of operations of Phase 1 is $O\left(\frac{1}{\varepsilon^2} N \log N\right)$. On the other hand, the size of $\mathcal{H}_{i_T}$ is at most
$$\frac{2^{\lceil \log_2 N \rceil}}{2^{T}} = \frac{2^{\lceil \log_2 N \rceil}}{2^{\lfloor \log_2 \frac{\sqrt{N}}{2} \rfloor}} \le 8\sqrt{N}.$$
So by Lemma 21, Phase 2 takes $O\left(\frac{1}{\varepsilon^2} N \log N\right)$ operations.

G Faster Slow and Fast Tournaments

In this section, we describe another hypothesis selection algorithm. This algorithm is faster than SlowTournament, though at the cost of a larger constant in the approximation factor. In most reasonable parameter regimes, this algorithm is slower than FastTournament, and still has a larger constant in the approximation factor. Regardless, we go on to show how it can be used to improve upon the worst-case running time of FastTournament.

Theorem 31. For any constant $\gamma > 0$, there is an algorithm RecursiveSlowTournament$_\gamma(X, \mathcal{H}, \varepsilon, \delta)$, which is given sample access to some distribution $X$ and a collection of distributions $\mathcal{H} = \{H_1, \ldots, H_N\}$ over some set $\mathcal{D}$, access to a PDF comparator for every pair of distributions $H_i, H_j \in \mathcal{H}$, an accuracy parameter $\varepsilon > 0$, and a confidence parameter $\delta > 0$. The algorithm makes $m = O(\log(N/\delta)/\varepsilon^2)$ draws from each of $X, H_1, \ldots, H_N$ and returns some $H \in \mathcal{H}$ or declares "failure." If there is some $H^* \in \mathcal{H}$ such that $d_{TV}(H^*, X) \le \varepsilon$, then with probability at least $1 - \delta$ the distribution $H$ that RecursiveSlowTournament$_\gamma$ returns satisfies $d_{TV}(H, X) \le O(\varepsilon)$. The total number of operations of the algorithm is $O\left(N^{1+\gamma}\log(N/\delta)/\varepsilon^2\right)$.
Proof. For simplicity, assume that $\sqrt{N}$ is an integer. (If not, introduce into $\mathcal{H}$ multiple copies of an arbitrary $H \in \mathcal{H}$ so that $\sqrt{N}$ becomes an integer.) Partition $\mathcal{H}$ into $\sqrt{N}$ subsets, $\mathcal{H} = \mathcal{H}_1 \sqcup \mathcal{H}_2 \sqcup \cdots \sqcup \mathcal{H}_{\sqrt{N}}$, and do the following:

1. Set $\delta' = \delta/2$, draw $O(\log(\sqrt{N}/\delta')/\varepsilon^2)$ samples from $X$ and, using the same samples, run SlowTournament$(X, \mathcal{H}_i, \varepsilon, \delta')$ from Lemma 21 for each $i$;

2. Run SlowTournament$(X, \mathcal{W}, 8\varepsilon, \delta')$, where $\mathcal{W}$ are the distributions output by SlowTournament in the previous step. If $\mathcal{W} = \emptyset$, output "failure."

Let us call the above algorithm SlowTournament$^N_1(X, \mathcal{H}, \varepsilon, \delta)$, before proceeding to analyze its correctness, sample and time complexity. Suppose there exists a distribution $H \in \mathcal{H}$ such that $d_{TV}(H, X) \le \varepsilon$. Without loss of generality, assume that $H \in \mathcal{H}_1$. Then, from Lemma 21, with probability at least $1 - \delta'$, SlowTournament$(X, \mathcal{H}_1, \varepsilon, \delta')$ will output a distribution $H'$ such that $d_{TV}(H', X) \le 8\varepsilon$. Conditioning on this and applying Lemma 21 again, with conditional probability at least $1 - \delta'$, SlowTournament$(X, \mathcal{W}, 8\varepsilon, \delta')$ will output a distribution $H''$ such that $d_{TV}(H'', X) \le 64\varepsilon$. So with overall probability at least $1 - \delta$, SlowTournament$^N_1(X, \mathcal{H}, \varepsilon, \delta)$ will output a distribution that is $64\varepsilon$-close to $X$. The number of samples that the algorithm draws from $X$ is $O(\log(N/\delta)/\varepsilon^2)$, and the running time is
$$\sqrt{N} \times O\left(N\log(N/\delta')/\varepsilon^2\right) + O\left(N\log(N/\delta')/(8\varepsilon)^2\right) = O\left(N^{3/2}\log(N/\delta)/\varepsilon^2\right).$$
So, compared to SlowTournament, SlowTournament$^N_1$ has the same sample complexity asymptotics and the same asymptotic guarantee for the distance from $X$ of the output distribution, but the exponent of $N$ in the running time improved from 2 to $3/2$.

For $t = 2, 3, \ldots$, define SlowTournament$^N_t$ by replacing SlowTournament with SlowTournament$^N_{t-1}$ in the code of SlowTournament$^N_1$. It follows from the same analysis as above that, as $t$ increases, the exponent of $N$ in the running time gets arbitrarily close to 1. In particular, in one step an exponent of $1 + \alpha$ becomes an exponent of $1 + \alpha/2$. So for some constant $t$, SlowTournament$^N_t$ will satisfy the requirements of the theorem.
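A compact Python sketch of this recursion (our own rendering; `slow_tournament(hyps, eps, delta)` stands for the Lemma 21 subroutine, adapted to return a surviving hypothesis or `None`):

```python
import math

def recursive_tournament(slow_tournament, hyps, eps, delta, t):
    """SlowTournament^N_t from the proof above, as a sketch. Level t
    partitions the hypotheses into ~sqrt(N) groups, runs level t-1 inside
    each group, then runs level t-1 once more on the group winners with
    accuracy relaxed to 8*eps; each level roughly halves the excess
    exponent of N in the running time."""
    if t == 0 or len(hyps) <= 1:
        return slow_tournament(hyps, eps, delta)
    g = max(1, math.isqrt(len(hyps)))  # ~sqrt(N) groups
    winners = [recursive_tournament(slow_tournament, hyps[i::g], eps,
                                    delta / 2, t - 1) for i in range(g)]
    winners = [w for w in winners if w is not None]
    if not winners:
        return None  # 'failure'
    return recursive_tournament(slow_tournament, winners, 8 * eps,
                                delta / 2, t - 1)
```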
As a corollary, we can immediately improve the running time of FastTournament at the cost of the constant in the approximation factor. The construction and analysis are nearly identical to those of FastTournament; the sole difference is in Step 3 of FastTournament$_A$, where we replace SlowTournament with RecursiveSlowTournament$_\gamma$.

Corollary 1. For any constant $\gamma > 0$, there is an algorithm FastTournament$_\gamma(X, \mathcal{H}, \varepsilon, \delta)$, which is given sample access to some distribution $X$ and a collection of distributions $\mathcal{H} = \{H_1, \ldots, H_N\}$ over some set $\mathcal{D}$, access to a PDF comparator for every pair of distributions $H_i, H_j \in \mathcal{H}$, an accuracy parameter $\varepsilon > 0$, and a confidence parameter $\delta > 0$. The algorithm makes $O\left(\frac{\log(1/\delta)}{\varepsilon^2}\cdot\log N\right)$ draws from each of $X, H_1, \ldots, H_N$ and returns some $H \in \mathcal{H}$ or declares "failure." If there is some $H^* \in \mathcal{H}$ such that $d_{TV}(H^*, X) \le \varepsilon$, then with probability at least $1 - \delta$ the distribution $H$ that FastTournament$_\gamma$ returns satisfies $d_{TV}(H, X) \le O(\varepsilon)$. The total number of operations of the algorithm is $O\left(\frac{\log(1/\delta)}{\varepsilon^2}\left(N\log N + \log^{1+\gamma}\frac{1}{\delta}\right)\right)$. Furthermore, the expected number of operations of the algorithm is $O\left(\frac{N\log(N/\delta)}{\varepsilon^2}\right)$.
