Efficient Density Estimation via Piecewise Polynomial Approximation


Authors: Siu-On Chan, Ilias Diakonikolas, Rocco A. Servedio, Xiaorui Sun

Siu-On Chan∗ (UC Berkeley, siuon@cs.berkeley.edu), Ilias Diakonikolas† (University of Edinburgh, ilias.d@ed.ac.uk), Rocco A. Servedio‡ (Columbia University, rocco@cs.columbia.edu), Xiaorui Sun§ (Columbia University, xiaoruisun@cs.columbia.edu)

November 27, 2024

Abstract

We give a highly efficient "semi-agnostic" algorithm for learning univariate probability distributions that are well approximated by piecewise polynomial density functions. Let $p$ be an arbitrary distribution over an interval $I$ which is $\tau$-close (in total variation distance) to an unknown probability distribution $q$ that is defined by an unknown partition of $I$ into $t$ intervals and $t$ unknown degree-$d$ polynomials specifying $q$ over each of the intervals. We give an algorithm that draws $\tilde{O}(t(d+1)/\varepsilon^2)$ samples from $p$, runs in time $\mathrm{poly}(t, d, 1/\varepsilon)$, and with high probability outputs a piecewise polynomial hypothesis distribution $h$ that is $(O(\tau) + \varepsilon)$-close (in total variation distance) to $p$. This sample complexity is essentially optimal; we show that even for $\tau = 0$, any algorithm that learns an unknown $t$-piecewise degree-$d$ probability distribution over $I$ to accuracy $\varepsilon$ must use $\Omega\!\left(\frac{t(d+1)}{\mathrm{poly}(1+\log(d+1))} \cdot \frac{1}{\varepsilon^2}\right)$ samples from the distribution, regardless of its running time. Our algorithm combines tools from approximation theory, uniform convergence, linear programming, and dynamic programming. We apply this general algorithm to obtain a wide range of results for many natural problems in density estimation over both continuous and discrete domains.
These include state-of-the-art results for learning mixtures of log-concave distributions; mixtures of $t$-modal distributions; mixtures of Monotone Hazard Rate distributions; mixtures of Poisson Binomial Distributions; mixtures of Gaussians; and mixtures of $k$-monotone densities. Our general technique yields computationally efficient algorithms for all these problems, in many cases with provably optimal sample complexities (up to logarithmic factors) in all parameters.

∗ Supported by NSF award DMS-1106999, DOD ONR grant N000141110140 and NSF award CCF-1118083.
† Part of this work was done while the author was at UC Berkeley, supported by a Simons Postdoctoral Fellowship.
‡ Supported by NSF grants CCF-0915929 and CCF-1115703.
§ Supported by NSF grant CCF-1149257.

1 Introduction

Over the past several decades, many works in computational learning theory have addressed the general problem of learning an unknown Boolean function from labeled examples. A recurring theme that has emerged from this line of work is that state-of-the-art learning results can often be achieved by analyzing polynomials that compute or approximate the function to be learned; see e.g. [LMN93, KM93, Jac97, KS04, MOS04, KOS04]. In the current paper we show that this theme extends to the well-studied unsupervised learning problem of density estimation: learning an unknown probability distribution given i.i.d. samples drawn from the distribution. We propose a new approach to density estimation based on establishing the existence of piecewise polynomial density functions that approximate the distributions to be learned. The key tool that enables this approach is a new and highly efficient general algorithm that we provide for learning univariate probability distributions that are well approximated by piecewise polynomial density functions.
Combining our general algorithm with structural results showing that probability distributions of interest can be well approximated using piecewise polynomial density functions, we obtain learning algorithms for those distributions. We demonstrate the efficacy of this approach by showing that for many natural and well-studied types of distributions, there do indeed exist piecewise polynomial densities that approximate the distributions to high accuracy. For all of these types of distributions our general approach gives a state-of-the-art computationally efficient learning algorithm with the best known sample complexity (number of samples required from the distribution) to date; in many cases the sample complexity of our approach is provably optimal, up to logarithmic factors in the optimal sample complexity.

1.1 Related work. Density estimation is a well-studied topic in probability theory and statistics (see [DG85, Sil86, Sco92, DL01] for book-length introductions). There is a number of generic techniques for density estimation in the mathematical statistics literature, including histograms, kernels (and variants thereof), nearest neighbor estimators, orthogonal series estimators, maximum likelihood (and variants thereof) and others (see Chapter 2 of [Sil86] for a survey of existing methods). In recent years, theoretical computer science researchers have also studied density estimation problems, with an explicit focus on obtaining computationally efficient algorithms (see e.g. [KMR+94, FM99, FOS05, BS10, KMV10, MV10, DDS12a, DDS12b]).

We work in a PAC-type model similar to that of [KMR+94] and to well-studied statistical frameworks for density estimation. The learning algorithm has access to i.i.d. draws from an unknown probability distribution $p$.
It must output a hypothesis distribution $h$ such that with high probability the total variation distance $d_{\mathrm{TV}}(p, h)$ between $p$ and $h$ is at most $\varepsilon$. (Recall that the total variation distance between two distributions $p$ and $h$ is $\frac{1}{2}\int |p(x) - h(x)|\,dx$ for continuous distributions, and is $\frac{1}{2}\sum_x |p(x) - h(x)|$ for discrete distributions.) We shall be centrally concerned with obtaining learning algorithms that both use few samples and are computationally efficient.

The previous work that is most closely related to our current paper is the recent work [CDSS13]. (That paper dealt with distributions over the discrete domain $[n] = \{1, \ldots, n\}$, but since the current work focuses mostly on the continuous domain, in our description of the [CDSS13] results below we translate them to the continuous domain. This translation is straightforward.) To describe the main result of [CDSS13] we need to introduce the notions of mixture distributions and piecewise constant distributions. Given distributions $p_1, \ldots, p_k$ and non-negative values $\mu_1, \ldots, \mu_k$ that sum to 1, we say that $p = \sum_{i=1}^{k} \mu_i p_i$ is a $k$-mixture of components $p_1, \ldots, p_k$ with mixing weights $\mu_1, \ldots, \mu_k$. A draw from $p$ is obtained by choosing $i \in [k]$ with probability $\mu_i$ and then making a draw from $p_i$. A distribution $p$ over an interval $I$ is $(\varepsilon, t)$-piecewise constant if there is a partition of $I$ into $t$ disjoint intervals $I_1, \ldots, I_t$ such that $p$ is $\varepsilon$-close (in total variation distance) to a distribution $q$ with $q(x) = c_j$ for all $x \in I_j$, for some constants $c_j \ge 0$. The main result of [CDSS13] is an efficient algorithm for learning any $k$-mixture of $(\varepsilon, t)$-piecewise constant distributions:

Theorem 1. There is an algorithm that learns any $k$-mixture of $(\varepsilon, t)$-piecewise constant distributions over an interval $I$ to accuracy $O(\varepsilon)$, using $O(kt/\varepsilon^3)$ samples and running in $\tilde{O}(kt/\varepsilon^3)$ time.[1]

1.2 Our main result. As our main algorithmic contribution, we give a significant strengthening and generalization of Theorem 1 above. First, we improve the $\varepsilon$-dependence in the sample complexity of Theorem 1 from $1/\varepsilon^3$ to a near-optimal $\tilde{O}(1/\varepsilon^2)$.[2] Second, we extend Theorem 1 from piecewise constant distributions to piecewise polynomial distributions. More precisely, we say that a distribution $p$ over an interval $I$ is $(\varepsilon, t)$-piecewise degree-$d$ if there is a partition of $I$ into $t$ disjoint intervals $I_1, \ldots, I_t$ such that $p$ is $\varepsilon$-close (in total variation distance) to a distribution $q$ with $q(x) = q_j(x)$ for all $x \in I_j$, where each of $q_1, \ldots, q_t$ is a univariate degree-$d$ polynomial.[3] (Note that being $(\varepsilon, t)$-piecewise constant is the same as being $(\varepsilon, t)$-piecewise degree-0.) We say that such a distribution $q$ is a $t$-piecewise degree-$d$ distribution. Our main algorithmic result is the following (see Theorem 23 for a fully detailed statement of the result):

Theorem 2. [Informal statement] There is an algorithm that learns any $k$-mixture of $(\varepsilon, t)$-piecewise degree-$d$ distributions over an interval $I$ to accuracy $O(\varepsilon)$, using $\tilde{O}((d+1)kt/\varepsilon^2)$ samples and running in $\mathrm{poly}(d+1, k, t, 1/\varepsilon)$ time.

As we describe below, the applications that we give for Theorem 2 crucially use both aspects in which it strengthens Theorem 1 (degree $d$ rather than degree 0, and $\tilde{O}(1/\varepsilon^2)$ samples rather than $O(1/\varepsilon^3)$) to obtain near-optimal sample complexities. A different view on our main result, which may also be illuminating, is that it gives a "semi-agnostic" algorithm for learning piecewise polynomial densities.
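To make these objects concrete, here is a small illustrative sketch (our own, not code from the paper): a $t$-piecewise degree-$d$ density is just a list of $t+1$ breakpoints plus one coefficient vector per piece, and the total variation distance between two such densities can be estimated by numerically integrating $\frac{1}{2}|p(x)-h(x)|$ on a fine grid.

```python
import numpy as np

def piecewise_poly_pdf(breakpoints, coeff_lists):
    """A t-piecewise degree-d density on [-1, 1): `breakpoints` has t+1
    entries; coeff_lists[j] holds the coefficients (lowest order first)
    of the polynomial used on [breakpoints[j], breakpoints[j+1])."""
    def pdf(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        j = np.clip(np.searchsorted(breakpoints, x, side="right") - 1,
                    0, len(coeff_lists) - 1)
        out = np.empty_like(x)
        for piece, coeffs in enumerate(coeff_lists):
            mask = j == piece
            out[mask] = np.polynomial.polynomial.polyval(x[mask], coeffs)
        return out
    return pdf

def tv_distance(p, h, grid_size=20001):
    """Estimate d_TV(p, h) = (1/2) * integral of |p(x) - h(x)| over
    [-1, 1) by a Riemann sum on a uniform grid."""
    xs = np.linspace(-1.0, 1.0, grid_size, endpoint=False)
    return 0.5 * np.abs(p(xs) - h(xs)).sum() * (2.0 / grid_size)

# Uniform density vs. a 2-piecewise constant perturbation of it.
p = piecewise_poly_pdf([-1.0, 1.0], [[0.5]])
h = piecewise_poly_pdf([-1.0, 0.0, 1.0], [[0.4], [0.6]])
```

Here `tv_distance(p, h)` comes out to about 0.1, matching the exact value $\frac{1}{2}(0.1 + 0.1)$ for this pair.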
(Since any $k$-mixture of $t$-piecewise degree-$d$ distributions is easily seen to be a $kt$-piecewise degree-$d$ distribution, we phrase the discussion below only in terms of $t$-piecewise degree-$d$ distributions rather than mixtures.) Let $\mathcal{P}_{t,d}(I)$ denote the class of all $t$-piecewise degree-$d$ distributions over interval $I$. Let $p$ be any distribution over $I$. Our algorithm, given parameters $t, d, \varepsilon$ and $\tilde{O}(t(d+1)/\varepsilon^2)$ samples from $p$, outputs an $O(t)$-piecewise degree-$d$ hypothesis distribution $h$ such that $d_{\mathrm{TV}}(p, h) \le 4\,\mathrm{opt}_{t,d}(1+\varepsilon) + \varepsilon$, where $\mathrm{opt}_{t,d} := \inf_{r \in \mathcal{P}_{t,d}(I)} d_{\mathrm{TV}}(p, r)$. (See Theorem 25.) We prove the following lower bound (see Theorem 8 for a precise statement), which shows that the number of samples that our algorithm uses is optimal up to logarithmic factors:

Theorem 3. [Informal statement] Any algorithm that learns an unknown $t$-piecewise degree-$d$ distribution $q$ over an interval $I$ to accuracy $\varepsilon$ must use $\Omega\!\left(\frac{t(d+1)}{\mathrm{poly}(1+\log(d+1))} \cdot \frac{1}{\varepsilon^2}\right)$ samples.

[1] Here and throughout the paper we work in a standard unit-cost model of computation, in which a sample from distribution $p$ is obtained in one time step (and is assumed to fit into one register) and basic arithmetic operations are assumed to take unit time. Our algorithms, like the [CDSS13] algorithm, only perform basic arithmetic operations on "reasonable" inputs.
[2] Recall the well-known fact that $\Omega(1/\varepsilon^2)$ samples are required for essentially every nontrivial distribution learning problem. In particular, any algorithm that distinguishes the uniform distribution over $[-1, 1]$ from the piecewise constant distribution with pdf $p(x) = \frac{1}{2}(1-\varepsilon)$ for $-1 \le x \le 0$ and $p(x) = \frac{1}{2}(1+\varepsilon)$ for $0 < x \le 1$ must use $\Omega(1/\varepsilon^2)$ samples.
[3] Here and throughout the paper, whenever we refer to a "degree-$d$ polynomial," we mean a polynomial of degree at most $d$.
Note that the lower bound holds even when the unknown distribution is exactly a $t$-piecewise degree-$d$ distribution, i.e. $\mathrm{opt}_{t,d} = 0$ (in fact, the lower bound still applies even if the $t-1$ "breakpoints" defining the $t$ interval boundaries within $I$ are fixed to be evenly spaced across $I$).

1.3 Applications of Theorem 2. Using Theorem 2 we obtain highly efficient algorithms for a wide range of specific distribution learning problems over both continuous and discrete domains. These include learning mixtures of log-concave distributions; mixtures of $t$-modal distributions; mixtures of Monotone Hazard Rate distributions; mixtures of Poisson Binomial Distributions; mixtures of Gaussians; and mixtures of $k$-monotone densities. (See Table 1 for a concise summary of these results and a comparison with previous results.) All of our algorithms run in polynomial time in all of the relevant parameters, and for all of the mixture learning problems listed in Table 1, our results improve on previous state-of-the-art results by a polynomial factor. (In some cases, such as $t$-piecewise degree-$d$ polynomial distributions and mixtures of $t$ bounded $k$-monotone distributions, we believe that we give the first nontrivial learning results for the distribution classes in question.) In many cases the sample complexities of our algorithms are provably optimal, up to logarithmic factors in the optimal sample complexity. Detailed descriptions of all of the classes of distributions in the table, and of our results for learning those distributions, are given in Section 4.

We note that all the learning results indicated with theorem numbers in Table 1 (i.e. results proved in this paper) are in fact semi-agnostic learning results for the given classes as described in the previous subsection; hence all of these results are highly robust even if the target distribution does not exactly belong to the specified class of distributions. More precisely, if the target distribution is $\tau$-close to some member of the specified class of distributions, then the algorithm uses the stated number of samples and outputs a hypothesis that is $(O(\tau) + \varepsilon)$-close to the target distribution.

1.4 Our Approach and Techniques. As stated in [Sil86], "the oldest and most widely used density estimator is the histogram": given samples from a density $f$, the method partitions the domain into a number of intervals (bins) $I_1, \ldots, I_k$, and outputs the empirical density which is constant within each bin. Note that the number $k$ of bins and the width of each bin are parameters and may depend on the particular class of distributions being learned. Our proposed technique may naturally be viewed as a very broad generalization of the histogram method, where instead of approximating the distribution by a constant within each bin, we approximate it by a low-degree polynomial. We believe that such a generalization is very natural; the recent paper [PA13] also proposes using splines for density estimation. (However, this is not the main focus of that paper, and indeed [PA13] does not provide or analyze algorithms for density estimation.) Our generalization of the histogram method seems likely to be of wide applicability. Indeed, as we show in this paper, it can be used to obtain many computationally efficient learners for a wide class of concrete learning problems, yielding several new and nearly optimal results.

The general algorithm.
At a high level, our algorithm uses a rather subtle dynamic program (roughly, to discover the "correct" intervals in each of which the underlying distribution is close to a degree-$d$ polynomial) and linear programming (roughly, to learn a single degree-$d$ sub-distribution on a given interval). We note, however, that many challenges arise in going from this high-level intuition to a working algorithm. Consider first the special case in which there is only a single known interval (see Section 3.3). In this special case our problem is somewhat reminiscent of the problem of learning a "noisy polynomial" that was studied by Arora and Khot [AK03]. We stress, though, that our setting is considerably more challenging in the following sense: in the [AK03] framework, each data point is a pair $(x, y)$ where $y$ is assumed to be close to the value $p(x)$ of the target polynomial at $x$. In our setting the input data is unlabeled: we only get points $x$ drawn from a distribution that is $\tau$-close to some polynomial pdf. However, we are able to leverage some ingredients from [AK03] in our context.

Class of distributions | Number of samples | Reference

Continuous distributions over an interval $I$:
  $t$-piecewise constant | $O(t/\varepsilon^3)$ | [CDSS13]
  $t$-piecewise constant | $\tilde{O}(t/\varepsilon^2)$ (†) | Theorem 23
  $t$-piecewise degree-$d$ polynomial | $\tilde{O}(td/\varepsilon^2)$ (†) | Theorem 23, Theorem 8
  log-concave | $O(1/\varepsilon^{5/2})$ (†) | folklore [DL01]
  mixture of $k$ log-concave distributions | $\tilde{O}(k/\varepsilon^{5/2})$ (†) | Theorem 26
  mixture of $t$ bounded 1-monotone distributions | $\tilde{O}(t/\varepsilon^3)$ (†) | Theorem 33
  mixture of $t$ bounded 2-monotone distributions | $\tilde{O}(t/\varepsilon^{5/2})$ (†) | Theorem 33
  mixture of $t$ bounded $k$-monotone distributions | $\tilde{O}(tk/\varepsilon^{2+1/k})$ | Theorem 33
  mixture of $k$ Gaussians | $\tilde{O}(k/\varepsilon^2)$ (†) | Corollary 37

Discrete distributions over $\{1, 2, \ldots, N\}$:
  $t$-modal | $\tilde{O}(t\log(N)/\varepsilon^3) + \tilde{O}(t^3/\varepsilon^3)$ | [DDS12a]
  mixture of $k$ $t$-modal distributions | $O(kt\log(N)/\varepsilon^4)$ | [CDSS13]
  mixture of $k$ $t$-modal distributions | $\tilde{O}(kt\log(N)/\varepsilon^3)$ (†) | Theorem 39
  mixture of $k$ monotone hazard rate distributions | $\tilde{O}(k\log(N)/\varepsilon^4)$ | [CDSS13]
  mixture of $k$ monotone hazard rate distributions | $\tilde{O}(k\log(N)/\varepsilon^3)$ (†) | Theorem 40
  mixture of $k$ log-concave distributions | $\tilde{O}(k/\varepsilon^4)$ | [CDSS13]
  mixture of $k$ log-concave distributions | $\tilde{O}(k/\varepsilon^3)$ | Theorem 41
  Poisson Binomial Distribution | $\tilde{O}(1/\varepsilon^3)$ | [DDS12b, CDSS13]
  mixture of $k$ Poisson Binomial Distributions | $\tilde{O}(k/\varepsilon^4)$ | [CDSS13]
  mixture of $k$ Poisson Binomial Distributions | $\tilde{O}(k/\varepsilon^3)$ | Theorem 41

Table 1: Known algorithmic results for learning various classes of probability distributions. "Number of samples" indicates the number of samples that the algorithm uses to learn to total variation distance $\varepsilon$. Results given in this paper are indicated with a reference to the corresponding theorem. A (†) indicates that the given upper bound on sample complexity is known to be optimal up to at most logarithmic factors (i.e. "$\tilde{O}(m)$ (†)" means that there is a known lower bound of $\Omega(m)$).
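For intuition about the single-interval subproblem, the following toy sketch (our own illustration; the paper's actual routine uses a linear program over a Chebyshev basis and, unlike this sketch, enforces non-negativity) fits a degree-$d$ polynomial density to an unlabeled sample by least squares against a fine empirical histogram:

```python
import numpy as np

def fit_poly_density(samples, degree, bins=200):
    """Toy single-interval fit on [-1, 1): least-squares fit of a
    degree-`degree` polynomial to a fine histogram of the unlabeled
    sample, then renormalized to integrate to 1. A crude stand-in for
    the paper's LP-based routine (no non-negativity constraint here)."""
    counts, edges = np.histogram(samples, bins=bins, range=(-1.0, 1.0))
    centers = 0.5 * (edges[:-1] + edges[1:])
    dens = counts / len(samples) / np.diff(edges)   # empirical density
    coeffs = np.polynomial.polynomial.polyfit(centers, dens, degree)
    poly = np.polynomial.Polynomial(coeffs)
    mass = poly.integ()(1.0) - poly.integ()(-1.0)   # current total mass
    return np.polynomial.Polynomial(coeffs / mass)

rng = np.random.default_rng(0)
draws = np.clip(rng.normal(0.0, 0.4, size=50000), -0.999, 0.999)
q = fit_poly_density(draws, degree=4)
```

The returned polynomial integrates to 1 over $[-1, 1)$ by construction and tracks the sample's bump around 0; what it does not give is the agnostic guarantee or the non-negativity that the paper's linear program provides.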
We carry out a careful error analysis using probabilistic inequalities (the VC inequality and tail bounds) and ingredients from basic approximation theory to show that $\tilde{O}(d/\varepsilon^2)$ samples suffice for our linear program to achieve an $O(\mathrm{opt}_{1,d} + \varepsilon)$-accurate hypothesis with high probability. Additional challenges arise when we go from a single interval to the general case of $t$-piecewise polynomial densities (see Section 3.4). The "correct" intervals can of course only be approximated rather than exactly identified, introducing an additional source of error that needs to be carefully managed. We formulate a dynamic program that uses the algorithm from Section 3.3 as a "black box" to achieve our most general learning result.

The applications. Given our general algorithm, in order to obtain efficient learning algorithms for specific classes of distributions, it is sufficient to establish the existence of piecewise polynomial (or piecewise constant) approximations to the distributions that are to be learned. In some cases such existence results were already known; for example, Birgé [Bir87b] provides the necessary existence result that we require for discrete $t$-modal distributions, and classical results in approximation theory [Dud74, Nov88] give the necessary existence results for concave distributions over continuous domains. For log-concave densities over continuous domains, we prove a new structural result on approximation by piecewise linear densities (Lemma 27) which, combined with our general algorithm, leads to an optimal learning algorithm for (mixtures of) such densities. Finally, for $k$-monotone distributions we are able to leverage a recent (and quite sophisticated) result from the approximation theory literature [KL04, KL07] to obtain the required approximation result.

Structure of this paper: In Section 2 we include some basic preliminaries.
In Section 3 we present our main learning result, and in Section 4 we describe our applications.

2 Preliminaries

Throughout the paper, for simplicity, we consider distributions over the interval $[-1, 1)$. It is easy to see that the general results given in Section 3 go through for distributions over an arbitrary interval $I$. (In the applications given in Section 4 we explicitly discuss the different domains over which our distributions are defined.)

Given a value $\kappa > 0$, we say that a distribution $p$ over $[-1, 1)$ is $\kappa$-well-behaved if $\sup_{z \in [-1,1)} \Pr_{x \sim p}[x = z] \le \kappa$, i.e. no individual real value is assigned more than $\kappa$ probability under $p$. Any probability distribution with no atoms (and hence any piecewise polynomial distribution) is $\kappa$-well-behaved for all $\kappa > 0$, but, for example, the distribution which outputs the value 0.3 with probability 1/100 and otherwise outputs a uniform value in $[-1, 1)$ is only $\kappa$-well-behaved for $\kappa \ge 1/100$. Our results apply for general distributions over $[-1, 1)$ which may have an atomic part as well as a non-atomic part. Throughout the paper we assume that the density $p$ is measurable. Note that throughout the paper we only ever work with the probabilities $\Pr_{x \sim p}[x = z]$ of single points and probabilities $\Pr_{x \sim p}[x \in S]$ of sets $S$ that are finite unions of intervals and single points.

Given a function $p : I \to \mathbb{R}$ on an interval $I \subseteq [-1, 1)$ and a subinterval $J \subseteq I$, we write $p(J)$ to denote $\int_J p(x)\,dx$. Thus if $p$ is the pdf of a probability distribution over $[-1, 1)$, the value $p(J)$ is the probability that distribution $p$ assigns to the subinterval $J$. We sometimes refer to a function $p$ over an interval (which need not necessarily integrate to 1 over the interval) as a "subdistribution."

Given $m$ independent samples $s_1, \ldots, s_m$ drawn from a distribution $p$ over $[-1, 1)$, the empirical distribution $\widehat{p}_m$ over $[-1, 1)$ is the discrete distribution supported on $\{s_1, \ldots, s_m\}$ defined as follows: for all $z \in [-1, 1)$, $\Pr_{x \sim \widehat{p}_m}[x = z] = |\{j \in [m] \mid s_j = z\}| / m$.

Optimal piecewise polynomial approximators. Fix a distribution $p$ over $[-1, 1)$. We write $\mathrm{opt}_{t,d}$ to denote the value $\mathrm{opt}_{t,d} := \inf_{r \in \mathcal{P}_{t,d}([-1,1))} d_{\mathrm{TV}}(p, r)$. Standard closure arguments can be used to show that the above infimum is attained by some $r \in \mathcal{P}_{t,d}([-1,1))$; however, this is not actually required for our purposes. It is straightforward to verify that any distribution $\tilde{r} \in \mathcal{P}_{t,d}([-1,1))$ such that $d_{\mathrm{TV}}(p, \tilde{r})$ is at most (say) $\mathrm{opt}_{t,d} + \varepsilon/100$ is sufficient for all our arguments.

Refinements. Let $\mathcal{I} = \{I_1, \ldots, I_s\}$ be a partition of $[-1, 1)$ into $s$ disjoint intervals, and $\mathcal{J} = \{J_1, \ldots, J_t\}$ be a partition of $[-1, 1)$ into $t$ disjoint intervals. We say that $\mathcal{J}$ is a refinement of $\mathcal{I}$ if each interval in $\mathcal{I}$ is a union of intervals in $\mathcal{J}$, i.e. for every $a \in [s]$ there is a subset $S_a \subseteq [t]$ such that $I_a = \cup_{b \in S_a} J_b$. For $\mathcal{I} = \{I_i\}_{i=1}^{r}$ and $\mathcal{I}' = \{I'_i\}_{i=1}^{s}$ two partitions of $[-1, 1)$ into $r$ and $s$ intervals respectively, we say that the common refinement of $\mathcal{I}$ and $\mathcal{I}'$ is the partition $\mathcal{J}$ of $[-1, 1)$ into intervals obtained from $\mathcal{I}$ and $\mathcal{I}'$ in the obvious way, by taking all possible nonempty intervals of the form $I_i \cap I'_j$. It is clear that $\mathcal{J}$ is both a refinement of $\mathcal{I}$ and of $\mathcal{I}'$, and that $\mathcal{J}$ contains at most $r + s$ intervals.

Approximation theory. We will need some basic notation and results from approximation theory. We write $\|p\|_\infty$ to denote $\sup_{x \in [-1,1)} |p(x)|$. We recall the famous inequalities of Bernstein and Markov bounding the derivative of univariate polynomials:

Theorem 4.
For any real-valued degree-$d$ polynomial $p$ over $[-1, 1)$, we have

• (Markov's Inequality) $\|p'\|_\infty \le d^2 \cdot \|p\|_\infty$; and
• (Bernstein's Inequality) $|p'(x)| \le \frac{d}{\sqrt{1-x^2}} \cdot \|p\|_\infty$ for all $-1 < x < 1$.

The VC inequality. Given a family of subsets $\mathcal{A}$ over $[-1, 1)$, define $\|p\|_{\mathcal{A}} = \sup_{A \in \mathcal{A}} |p(A)|$. The VC dimension of $\mathcal{A}$ is the maximum size of a subset $X \subseteq [-1, 1)$ that is shattered by $\mathcal{A}$ (a set $X$ is shattered by $\mathcal{A}$ if for every $Y \subseteq X$ some $A \in \mathcal{A}$ satisfies $A \cap X = Y$). If there is a shattered subset of size $s$ for all $s$, then we say that the VC dimension of $\mathcal{A}$ is $\infty$. The well-known Vapnik-Chervonenkis (VC) inequality says the following:

Theorem 5 (VC inequality, [DL01, p. 31]). Let $\widehat{p}_m$ be an empirical distribution of $m$ samples from $p$. Let $\mathcal{A}$ be a family of subsets of VC dimension $d$. Then $\mathbb{E}[\|p - \widehat{p}_m\|_{\mathcal{A}}] \le O(\sqrt{d/m})$.

2.1 Partitioning into intervals of approximately equal mass. As a basic primitive, we will often need to decompose a $\kappa$-well-behaved distribution $p$ into $\Theta(1/\kappa)$ intervals, each of which has probability $\Theta(\kappa)$ under $p$. The following lemma lets us achieve this using $\tilde{O}(1/\kappa)$ samples; the simple proof is given in Appendix A.

Lemma 6. Given $0 < \kappa < 1$ and access to samples from a $\kappa/64$-well-behaved distribution $p$ over $[-1, 1)$, the procedure Approximately-Equal-Partition uses $\tilde{O}(1/\kappa)$ samples from $p$, runs in time $\tilde{O}(1/\kappa)$, and with probability at least 99/100 outputs a partition of $[-1, 1)$ into $\ell = \Theta(1/\kappa)$ intervals such that $p(I_j) \in [\frac{1}{2}\kappa, 3\kappa]$ for all $1 \le j \le \ell$.

3 Main result: Learning mixtures of piecewise polynomial distributions with near-optimal sample complexity

In this section we present and analyze our main algorithm for learning mixtures of $(\tau, t)$-piecewise degree-$d$ distributions over $[-1, 1)$.
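The equal-mass primitive of Lemma 6 can be realized in the natural way by cutting at empirical quantiles of a sample; the sketch below is our own illustration (the paper's actual Approximately-Equal-Partition procedure and its proof are deferred to Appendix A).

```python
import random

def approx_equal_partition(samples, kappa):
    """Partition [-1, 1) into about 1/kappa intervals, each holding
    roughly a kappa fraction of the sample, by cutting at empirical
    quantiles. An illustration of the equal-mass primitive of Lemma 6,
    not the paper's exact procedure."""
    srt = sorted(samples)
    m = len(srt)
    ell = max(1, round(1.0 / kappa))
    cuts = [-1.0] + [srt[(j * m) // ell] for j in range(1, ell)] + [1.0]
    cuts = sorted(set(cuts))   # drop duplicate cuts from repeated samples
    return list(zip(cuts[:-1], cuts[1:]))

random.seed(1)
pts = [random.uniform(-1.0, 1.0) for _ in range(100000)]
intervals = approx_equal_partition(pts, kappa=0.1)
masses = [sum(lo <= x < hi for x in pts) / len(pts) for lo, hi in intervals]
```

On this uniform sample each of the 10 intervals carries empirical mass close to 0.1, as the lemma's conclusion requires up to constant factors.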
We start by giving a simple information-theoretic argument (Proposition 7, Section 3.1) showing that there is a (computationally inefficient) algorithm to learn any distribution $p$ to accuracy $3\,\mathrm{opt}_{t,d} + \varepsilon$ using $O(t(d+1)/\varepsilon^2)$ samples, where $\mathrm{opt}_{t,d}$ is the smallest variation distance between $p$ and any $t$-piecewise degree-$d$ distribution. Next, we contrast this information-theoretic positive result with an information-theoretic lower bound (Theorem 8, Section 3.2) showing that any algorithm for learning a $t$-piecewise degree-$d$ distribution to accuracy $\varepsilon$, regardless of its running time, must use $\Omega\!\left(\frac{t(d+1)}{\mathrm{poly}(1+\log(d+1))} \cdot \frac{1}{\varepsilon^2}\right)$ samples.

We then build up to our main result in stages by giving efficient algorithms for successively more challenging learning problems. In Section 3.3 we give an efficient "semi-agnostic" algorithm for learning a single degree-$d$ pdf. More precisely, the algorithm draws $\tilde{O}((d+1)/\varepsilon^2)$ samples from any well-behaved distribution $p$, and with high probability outputs a degree-$d$ pdf $h$ such that $d_{\mathrm{TV}}(p, h) \le 3\,\mathrm{opt}_{1,d}(1+\varepsilon) + \varepsilon$. This algorithm uses ingredients from approximation theory and linear programming. In Section 3.4 we extend the approach using dynamic programming to obtain an efficient "semi-agnostic" algorithm for $t$-piecewise degree-$d$ pdfs. The extended algorithm draws $\tilde{O}(t(d+1)/\varepsilon^2)$ samples from any well-behaved distribution $p$, and with high probability outputs a $(2t-1)$-piecewise degree-$d$ pdf $h$ such that $d_{\mathrm{TV}}(p, h) \le 3\,\mathrm{opt}_{t,d}(1+\varepsilon) + \varepsilon$. In Section 3.5 we extend the result to $k$-mixtures of well-behaved distributions. Finally, in Section 3.6 we show how we may get rid of the "well-behaved" requirement, and thereby prove Theorem 2.

3.1 An information-theoretic sample complexity upper bound.

Proposition 7. There is a (computationally inefficient) algorithm that draws $O(t(d+1)/\varepsilon^2)$ samples from any distribution $p$ over $[-1, 1)$, and with probability 9/10 outputs a hypothesis distribution $h$ such that $d_{\mathrm{TV}}(p, h) \le 3\,\mathrm{opt}_{t,d} + \varepsilon$.

Proof. The main idea is to use Theorem 5, the VC inequality. Let $p$ be the target distribution and let $q$ be a $t$-piecewise degree-$d$ distribution such that $d_{\mathrm{TV}}(p, q) = \mathrm{opt}_{t,d}$. The algorithm draws $m = O(t(d+1)/\varepsilon^2)$ samples from $p$; let $\widehat{p}_m$ be the resulting empirical distribution of these $m$ samples. We define the family $\mathcal{A}$ of subsets of $[-1, 1)$ to consist of all unions of up to $2t(d+1)$ intervals. Since $d_{\mathrm{TV}}(p, q) \le \mathrm{opt}_{t,d}$, we have that $\|p - q\|_{\mathcal{A}} \le \mathrm{opt}_{t,d}$. Since the VC dimension of $\mathcal{A}$ is $4t(d+1)$, Theorem 5 implies that $\mathbb{E}[\|p - \widehat{p}_m\|_{\mathcal{A}}] \le \varepsilon/40$, and hence by Markov's inequality, with probability at least 19/20 we have that $\|p - \widehat{p}_m\|_{\mathcal{A}} \le \varepsilon/2$. By the triangle inequality for the $\|\cdot\|_{\mathcal{A}}$-distance, this means that $\|q - \widehat{p}_m\|_{\mathcal{A}} \le \mathrm{opt}_{t,d} + \varepsilon/2$.

The algorithm outputs a $t$-piecewise degree-$d$ distribution $h$ that minimizes $\|h - \widehat{p}_m\|_{\mathcal{A}}$. Since $q$ is a $t$-piecewise degree-$d$ distribution that satisfies $\|q - \widehat{p}_m\|_{\mathcal{A}} \le \mathrm{opt}_{t,d} + \varepsilon/2$, the distribution $h$ satisfies $\|h - \widehat{p}_m\|_{\mathcal{A}} \le \mathrm{opt}_{t,d} + \varepsilon/2$. Hence the triangle inequality gives $\|h - q\|_{\mathcal{A}} \le 2\,\mathrm{opt}_{t,d} + \varepsilon$. Now since $h$ and $q$ are both $t$-piecewise degree-$d$ distributions, they must have at most $2t(d+1)$ crossings. (Taking the common refinement of the intervals for $h$ and the intervals for $q$, we get at most $2t$ intervals. Within each such interval both $h$ and $q$ are degree-$d$ polynomials, so there are at most $2t(d+1)$ crossings in total, where the extra $+1$ comes from the endpoints of each of the $2t$ intervals.) Consequently we have that $d_{\mathrm{TV}}(h, q) = \|h - q\|_{\mathcal{A}} \le 2\,\mathrm{opt}_{t,d} + \varepsilon$.
The triangle inequalit y for v ariation distance giv es that d T V ( h, p ) ≤ 3opt t,d + ε , and the pro of is complete. It is not hard to see that the dep end ence on eac h of the parameters t, d, 1 /ε in the ab ov e upp er b ound is information-theoretically optimal. Note that th e alg orithm describ ed ab o v e is not efficien t b ecause it is b y no means clear how to construct a t -piecewise degree- d distribu tion h that minimizes k h − b p m k A in a computationally efficien t w a y . Ind eed, sev eral approac h es to solv e this p r oblem yield ru nning times that grow exp onenti ally in t, d . Starting in Section 3.3, we give an algorithm that ac hiev es almost the same sample complexit y but ru ns in time p oly( t, d, 1 /ε ) . T he m ain idea is that minimizing k ·k A (whic h in v olve s infi n itely many inequalities) can b e app ro ximately achiev ed by minimizing a small num b er of inequ alities (Definition 11 and Lemma 14), and this can b e achiev ed w ith a linear program. 3.2 An information-theoretic sample complexit y lo wer b ound. T o complemen t the inf orm ation- theoretic upp er b ound fr om the previous sub section, in this su bsection we pro v e an information- theoretic lo wer b ound sho win g that ev en if opt t,d = 0 (i.e. the target distribution p is exact ly a t -piecewise degree- d distrib ution), ˜ Ω( t ( d + 1) /ε 2 ) samples are requir ed f or any algorithm to learn to accuracy ε : Theorem 8. L et p b e an unknown t - pie c ewise de gr e e- d distribution over [ − 1 , 1) wher e t ≥ 1 , d ≥ 0 satisfy t + d > 1 . 4 L et L b e any algor ithm which, give n as input t, d, ε and ac c ess to indep endent 4 Note that t = 1 and d = 0 is a d egenerate case where the only p ossible d istribution p is the uniform distribution o ver [ − 1 , 1). 7 samples fr om p , outputs a hyp othesis distribution h such that E [ d T V ( p, h )] ≤ ε , wher e the exp e c tation is over the r andom samples dr awn fr om p and any internal r andomness of L . 
Then $L$ must use at least $\Omega\!\left(\frac{t(d+1)}{(1+\log(d+1))^2} \cdot \frac{1}{\varepsilon^2}\right)$ samples.

Theorem 8 is proved using a well-known lemma of Assouad [Ass83], together with carefully tailored constructions of polynomial probability density functions meeting the conditions of Assouad's lemma. The proof of Theorem 8 is deferred to Appendix A.2.

3.3 Semi-agnostically learning a degree-$d$ polynomial density with near-optimal sample complexity. In this section we prove the following:

Theorem 9. Let $p$ be an $\frac{\varepsilon}{64(d+1)}$-well-behaved pdf over $[-1,1)$. There is an algorithm Learn-WB-Single-Poly$(d, \varepsilon)$ which runs in $\mathrm{poly}(d+1, 1/\varepsilon)$ time, uses $\widetilde{O}((d+1)/\varepsilon^2)$ samples from $p$, and with probability at least $9/10$ outputs a degree-$d$ polynomial $q$ which defines a pdf over $[-1,1)$ such that $d_{TV}(p,q) \le 3\mathrm{opt}_{1,d}(1+\varepsilon) + O(\varepsilon)$.

Some preliminary definitions will be helpful:

Definition 10 (Uniform partition). Let $p$ be a subdistribution on an interval $I \subseteq [-1,1)$. A partition $\mathcal{P} = \{I_1, \dots, I_\ell\}$ of $I$ is $(p,\eta)$-uniform if $p(I_j) \le \eta$ for all $1 \le j \le \ell$.

Definition 11. Let $\mathcal{P} = \{[i_0, i_1), \dots, [i_{r-1}, i_r)\}$ be a partition of an interval $I \subseteq [-1,1)$. Let $p, q: I \to \mathbb{R}$ be two functions on $I$. We say that $p$ and $q$ satisfy the $(\mathcal{P}, \eta, \varepsilon)$-inequalities over $I$ if $|p([i_j, i_\ell)) - q([i_j, i_\ell))| \le \sqrt{\varepsilon(\ell - j)} \cdot \eta$ for all $0 \le j < \ell \le r$.

We will also use the following notation. For this subsection, let $I = [-1,1)$ ($I$ will denote a subinterval of $[-1,1)$ when the results are applied in the next subsection). We write $\|f\|_1^{(I)}$ to denote $\int_I |f(x)|\,dx$, and we write $d_{TV}^{(I)}(p,q)$ to denote $\|p - q\|_1^{(I)}/2$. We write $\mathrm{opt}_{1,d}^{(I)}$ to denote the infimum of the statistical distance $d_{TV}^{(I)}(p,g)$ between $p$ and any degree-$d$ subdistribution $g$ on $I$ that satisfies $g(I) = p(I)$.
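As a concrete reading of Definition 11, here is a small illustrative sketch (not from the paper; interval masses represented as plain lists, with `p_mass[j]` standing for $p([i_j, i_{j+1}))$) that checks whether two subdistributions satisfy the $(\mathcal{P}, \eta, \varepsilon)$-inequalities:

```python
import math

def satisfies_P_eta_eps(p_mass, q_mass, eta, eps):
    """Definition 11: for every pair 0 <= j < l <= r,
    |p([i_j, i_l)) - q([i_j, i_l))| <= sqrt(eps * (l - j)) * eta."""
    r = len(p_mass)
    # prefix sums, so that p([i_j, i_l)) = P[l] - P[j]
    P, Q = [0.0], [0.0]
    for a, b in zip(p_mass, q_mass):
        P.append(P[-1] + a)
        Q.append(Q[-1] + b)
    for j in range(r):
        for l in range(j + 1, r + 1):
            gap = abs((P[l] - P[j]) - (Q[l] - Q[j]))
            if gap > math.sqrt(eps * (l - j)) * eta:
                return False
    return True
```

Note the $\sqrt{\ell - j}$ scaling: longer blocks of a $(p,\eta)$-uniform partition are allowed proportionally larger (but only square-root larger) discrepancy, which is what makes these finitely many inequalities a good proxy for the $\mathcal{A}$-distance.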
The key step of Learn-WB-Single-Poly is Step 3, where it calls the Find-Single-Polynomial procedure. In this procedure $T_i(x)$ denotes the degree-$i$ Chebyshev polynomial of the first kind. The polynomial $F$ constructed by Find-Single-Polynomial should be thought of as the CDF of a "quasi-distribution" $f$; we say that $f = F'$ is a "quasi-distribution" and not a bona fide probability distribution because it is not guaranteed to be non-negative everywhere on $[-1,1)$. Step 2 of Find-Single-Polynomial processes $f$ slightly to obtain a polynomial $q$ which is an actual distribution over $[-1,1)$. We note that while the Find-Single-Polynomial procedure may appear to be more general than is needed for this section, we will exploit its full generality in the next subsection, where it is used as a key subroutine for semi-agnostically learning $t$-piecewise polynomial distributions.

Algorithm Learn-WB-Single-Poly:
Input: parameters $d, \varepsilon$
Output: with probability at least $9/10$, a degree-$d$ distribution $q$ such that $d_{TV}(p,q) \le 3\mathrm{opt}_{1,d} + O(\varepsilon)$

1. Run Algorithm Approximately-Equal-Partition on input parameter $\varepsilon/(d+1)$ to partition $[-1,1)$ into $z = \Theta((d+1)/\varepsilon)$ intervals $I_0 = [i_0, i_1), \dots, I_{z-1} = [i_{z-1}, i_z)$, where $i_0 = -1$ and $i_z = 1$, such that for each $j \in \{1, \dots, z\}$ we have $p([i_{j-1}, i_j)) = \Theta(\varepsilon/(d+1))$.
2. Draw $m = \widetilde{O}((d+1)/\varepsilon^2)$ samples and let $\widehat{p}_m$ be the empirical distribution defined by these samples.
3. Call Find-Single-Polynomial($d$, $\varepsilon$, $\eta := \Theta(\varepsilon/(d+1))$, $\{I_0, \dots, I_{z-1}\}$, $\widehat{p}_m$) and output the hypothesis $q$ that it returns.

Subroutine Find-Single-Polynomial:
Input: degree parameter $d$; error parameter $\varepsilon$; parameter $\eta$; $(p,\eta)$-uniform partition $\mathcal{P}_I = \{I_1, \dots$
$, I_z\}$ of interval $I = \cup_{i=1}^{z} I_i$ into $z$ intervals such that $\sqrt{\varepsilon z} \cdot \eta \le \varepsilon/2$; a subdistribution $\widehat{p}_m$ on $I$ such that $\widehat{p}_m$ and $p$ satisfy the $(\mathcal{P}, \eta, \varepsilon)$-inequalities over $I$
Output: a number $\tau$ and a degree-$d$ subdistribution $q$ on $I$ such that $q(I) = \widehat{p}_m(I)$, $d_{TV}^{(I)}(p,q) \le 3\mathrm{opt}_{1,d}^{(I)}(1+\varepsilon) + \sqrt{\varepsilon r(d+1)} \cdot \eta + \mathit{error}$, $0 \le \tau \le \mathrm{opt}_{1,d}^{(I)}(1+\varepsilon)$, and $\mathit{error} = O((d+1)\eta)$.

1. Let $\tau$ be the solution to the following LP: minimize $\tau$ subject to the following constraints. (Below $F(x) = \sum_{i=0}^{d+1} c_i T_i(x)$, where $T_i(x)$ is the degree-$i$ Chebyshev polynomial of the first kind, and $f(x) = F'(x) = \sum_{i=0}^{d+1} c_i T_i'(x)$.)
(a) $F(-1) = 0$ and $F(1) = \widehat{p}_m(I)$;
(b) For each $0 \le j < k \le z$, $\widehat{p}_m([i_j, i_k)) + \sum_{j \le \ell} \dots$ […] $\dots > T(i-1, \ell) + \tau$, then
i. Update $T(i,j)$ to $T(i-1,\ell) + \tau$
ii. Store the polynomial $h$ in a table $H(i,j)$.
6. Recover a piecewise degree-$d$ distribution $h$ from the table $H(\cdot,\cdot)$.

Let $X_1$ be the event that step (1) of Subroutine Find-Piecewise-Polynomial succeeds (i.e. the intervals $[i_j, i_{j+1})$ all have mass within a constant factor of $\varepsilon/(t(d+1))$). In step (2) of Learn-WB-Piecewise-Poly, the algorithm effectively constructs a coarsening $\mathcal{P}'$ of $\mathcal{P}$ by merging every $d+1$ consecutive intervals from $\mathcal{P}$. These super-intervals are used in the dynamic programming in step (5). The table entry $T(i,j)$ stores the minimum sum of errors $\tau$ (returned by the subroutine Find-Single-Polynomial) when the interval $[i'_0, i'_j)$ is partitioned into $i$ pieces. The dynamic program above only computes an estimate of $\mathrm{opt}_{t,d}$; one can use standard techniques to also recover a $t$-piecewise degree-$d$ polynomial $q$ close to $p$. For step (3), let $X_2$ be the event that $p$ and $\widehat{p}_m$ satisfy the $(\mathcal{P}, \varepsilon/(t(d+1)), \varepsilon/4)$-inequalities.
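The dynamic program of step (5) has the classic minimum-cost-segmentation shape. Here is a schematic sketch (ours, not the paper's code) in which `costs[l][j]` stands in for the error $\tau$ that Find-Single-Polynomial would return on the super-interval $[i'_\ell, i'_j)$:

```python
def piecewise_dp(costs, s, t):
    """T[i][j] = min over l < j of T[i-1][l] + costs[l][j]: the cheapest way
    to cover [i'_0, i'_j) with i pieces, where costs[l][j] is a stand-in for
    the single-interval fitting error on [i'_l, i'_j).  s = number of
    super-interval endpoints, t = number of pieces."""
    INF = float("inf")
    T = [[INF] * (s + 1) for _ in range(t + 1)]
    back = [[0] * (s + 1) for _ in range(t + 1)]
    T[0][0] = 0.0
    for i in range(1, t + 1):
        for j in range(1, s + 1):
            for l in range(j):
                if T[i - 1][l] + costs[l][j] < T[i][j]:
                    T[i][j] = T[i - 1][l] + costs[l][j]
                    back[i][j] = l
    # backtracking recovers the partition itself (cf. step (6))
    cuts, i, j = [], t, s
    while j > 0:
        l = back[i][j]
        cuts.append((l, j))
        i, j = i - 1, l
    return T[t][s], list(reversed(cuts))
```

With $s$ endpoints and $t$ pieces this runs in $O(ts^2)$ table updates, each of which in the real algorithm invokes one LP solve.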
In particular, when $X_2$ holds we have $\widehat{p}_m(I)/p(I) \in [1 - \varepsilon/2,\, 1 + \varepsilon/2]$ for all $I \in \mathcal{P}$. By a multiplicative Chernoff bound and a union bound (over the $m$ samples in step (3)), event $X_2$ holds with probability at least $19/20$.

Proposition 18. If $X_1$ and $X_2$ hold and $p$ is $\tau$-close to some $t$-piecewise degree-$d$ distribution, then there is a coarsening $\mathcal{P}^*$ of $\mathcal{P}'$ and degree-$d$ polynomials $g_i: I^*_i \to \mathbb{R}$ such that $\sum_i d_{TV}(p, g_i) \le \tau + O(\varepsilon)$. Further, the $g_i$ functions can be chosen to satisfy constraints 1a, 1d–1f in the subroutine Find-Piecewise-Polynomial.

Proof. Suppose $p$ is $\tau$-close to a $t$-piecewise degree-$d$ distribution. In other words, there exist a partition $\{J_1, \dots, J_t\}$ of $[-1,1)$ and degree-$d$ polynomials $h_i: J_i \to \mathbb{R}$ such that $\sum_{1 \le i \le t} d_{TV}(p, h_i) \le \tau$. Let $\{[i'_0, i'_1), \dots, [i'_{s-1}, i'_s)\}$ be $\mathcal{P}'$. Except in degenerate cases, the coarsening $\mathcal{P}^*$ contains $2t - 1$ intervals, corresponding to the $t$ intervals on which $p$ is a polynomial and $t - 1$ small intervals containing "breakpoints" between the polynomials. More precisely, if we denote by $\{\alpha_0, \dots, \alpha_t\}$ the breakpoints of $J_1, \dots, J_t$ (so that $J_j = [\alpha_{j-1}, \alpha_j)$), and define $J'_j := \cup\{[i'_a, i'_b) \mid [i'_a, i'_b) \subseteq J_j\}$ as the maximal subinterval of $J_j$ with endpoints from the $\{i'_a\}$, then $\mathcal{P}^*$ is the partition containing all the $J'_j$'s together with the intervals between consecutive $J'_j$'s. As a result, $\mathcal{P}^*$ is a partition of $[-1,1)$ into at most $2t - 1$ non-empty intervals. For an interval $I^*_i$ not containing any breakpoint, the corresponding polynomial $g_i: I^*_i \to \mathbb{R}$ is simply $h_i$ rescaled by the empirical mass on $I^*_i$, so $g_i(x) = h_i(x) \cdot \frac{\widehat{p}_m(I^*_i)}{h_i(I^*_i)}$ for $x \in I^*_i \neq \emptyset$. Then $g_i$ clearly satisfies constraints 1a and 1f.
Constraints 1d and 1e are also satisfied: $(h_i)_{\phi_i}$ is a degree-$d$ polynomial on $[-1,1)$ bounded by $1$ in absolute value (here $\phi_i = \phi_{I^*_i}$), and $\widehat{p}_m(I^*_i)/h_i(I^*_i) \le 1 + \varepsilon/2$ when $X_2$ holds. For an interval $I^*_i$ containing a breakpoint, we simply set $g_i$ to be the constant function with total mass $\widehat{p}_m(I^*_i)$ on $I^*_i$. As before, $g_i$ satisfies 1d–1f. The contribution of such $g_i$'s (there are at most $t - 1$ of them) to $\sum_i d_{TV}(p, g_i)$ is at most $(t-1) \cdot 2\varepsilon/t = O(\varepsilon)$, using the fact that $\mathcal{P}'$ is $(\widehat{p}_m, 4\varepsilon/t)$-uniform when $X_1$ and $X_2$ hold. When event $X_2$ holds, $p$ and $\widehat{p}_m$ satisfy the $(\mathcal{P}_{I^*_i}, \varepsilon/(t(d+1)), \varepsilon/4)$-inequalities. But this is the same as $g_i$ and $\widehat{p}_m + (g_i - p)$ satisfying the $(\mathcal{P}_{I^*_i}, \varepsilon/(t(d+1)), \varepsilon/4)$-inequalities, because $p - \widehat{p}_m = g_i - (\widehat{p}_m + g_i - p)$. Therefore Claim 13 tells us that constraint 1b is satisfied. Constraints 1c are satisfied for reasons similar to those in Section 3.3. Together with Proposition 18, the LP in the subroutine Find-Single-Polynomial will be feasible, provided the partition $\mathcal{P}^*$ is chosen correctly in the dynamic program.

We have the following restatement of Lemma 14, and a robust version as a corollary (which follows by combining Lemma 14 and the proof of Proposition 7).

Lemma 19 (Lemma 14 restated). Let $\mathcal{P}$ be a $(p,\eta)$-uniform partition of $I \subseteq [-1,1)$ into $r$ intervals. Let $\widehat{p}_m$ be a subdistribution on $I$ such that $\widehat{p}_m$ and $p$ satisfy the $(\mathcal{P}, \eta, \varepsilon)$-inequalities. If $f: I \to \mathbb{R}$ and $\widehat{p}_m$ also satisfy the $(\mathcal{P}, \eta, \varepsilon)$-inequalities, then $\|\widehat{p}_m - f\|_{\mathcal{A}_d}^{(I)} \le \sqrt{\varepsilon r(d+1)} \cdot \eta + \mathit{error}$, where the error is $O((d+1)\eta)$.

Corollary 20. Let $p$ be a degree-$d$ subdistribution on $I$. Let $\mathcal{P}$ be a $(p,\eta)$-uniform partition of $I \subseteq [-1,1)$ into $r$ intervals. Let $\widehat{p}_m$ be a subdistribution on $I$ such that $\widehat{p}_m$ and $p$ satisfy the $(\mathcal{P}, \eta, \varepsilon)$-inequalities.
If $h: I \to \mathbb{R}$ and $\widehat{p}_m + w$ also satisfy the $(\mathcal{P}, \eta, \varepsilon)$-inequalities, then $d_{TV}^{(I)}(p,h) \le 3\tau(1+\varepsilon) + \sqrt{\varepsilon r(d+1)} \cdot \eta + \mathit{error}$, where $2\tau = \|w\|_1$ and $\mathit{error} = O((d+1)\eta)$.

Proof of Theorem 16. Since $p$ is $\tau$-close to a $t$-piecewise degree-$d$ distribution, there are a partition $\{J_1, \dots, J_t\}$ of $[-1,1)$ and degree-$d$ polynomials $g_i: J_i \to \mathbb{R}$ such that $\sum_{1 \le i \le t} \tau_i \le \tau$, where $\tau_i = d_{TV}(p, g_i)$. Let $\mathcal{P}^* = \{I^*_1, \dots, I^*_{2t-1}\}$ be the coarsening of $\mathcal{P}'$ as in the proof of Proposition 18. When $X_1$ and $X_2$ hold, it follows by a simple induction on $i \in \{0, \dots, 2t-1\}$ that the algorithm will output a $(2t-1)$-piecewise degree-$d$ distribution $h$ satisfying

$$d_{TV}(p,h) \le \sum_{1 \le i \le t} \left[ 3\tau_i(1+\varepsilon) + \sqrt{\varepsilon r_i (d+1)} \cdot \frac{\varepsilon}{t(d+1)} + O\!\left((d+1) \cdot \frac{\varepsilon}{t(d+1)}\right) \right] + O(\varepsilon). \quad (11)$$

The first term comes from Corollary 20 (with $\eta = O(\varepsilon/(t(d+1)))$), and the second term comes from the $t - 1$ intervals containing the breakpoints (see the proof of Proposition 18). Here $r_i$ denotes the number of intervals from $\mathcal{P}$ contained in $I^*_i$. Therefore the RHS of Eq. (11) is at most

$$3\tau(1+\varepsilon) + \sum_{1 \le i \le t} \sqrt{\varepsilon r_i (d+1)} \cdot \frac{\varepsilon}{t(d+1)} + O(\varepsilon).$$

The second term of this expression is bounded by $\varepsilon$ using Cauchy–Schwarz and the fact that $\mathcal{P}$ contains $t(d+1)/\varepsilon$ intervals.

3.5 Learning $k$-mixtures of well-behaved $(\tau, t)$-piecewise degree-$d$ distributions. In this subsection we prove Theorem 2 under the additional restriction that the target distribution $p$ is well-behaved:

Theorem 21. Let $p$ be an $\frac{\varepsilon}{64kt(d+1)}$-well-behaved $k$-mixture of $(\tau, t)$-piecewise degree-$d$ distributions over $[-1,1)$.
There is an algorithm that runs in $\mathrm{poly}(k, t, d+1, 1/\varepsilon)$ time, uses $\widetilde{O}((d+1)kt/\varepsilon^2)$ samples from $p$, and with probability at least $9/10$ outputs a $(2kt-1)$-piecewise degree-$d$ hypothesis $h$ such that $d_{TV}(p,h) \le 3\mathrm{opt}_{t,d}(1+\varepsilon) + O(\varepsilon)$.

As we shall see, the algorithm of the previous subsection in fact suffices for this result. The key to extending Theorem 16 to yield Theorem 21 is the following structural result, which says that any $k$-mixture of $(\tau, t)$-piecewise degree-$d$ distributions must itself be a $(\tau, kt)$-piecewise degree-$d$ distribution.

Lemma 22. Let $p_1, \dots, p_k$ each be a $(\tau, t)$-piecewise degree-$d$ distribution over $[-1,1)$ and let $p = \sum_{j=1}^{k} \mu_j p_j$ be a $k$-mixture of components $p_1, \dots, p_k$. Then $p$ is a $(\tau, kt)$-piecewise degree-$d$ distribution.

The simple proof is essentially the same as the proof of Lemma 3.2 of [CDSS13] and is given in Appendix A. We may rephrase Theorem 16 as follows:

Alternate Phrasing of Theorem 16. Let $p$ be an $\frac{\varepsilon}{64t(d+1)}$-well-behaved $(\tau, t)$-piecewise degree-$d$ pdf over $[-1,1)$. Algorithm Learn-WB-Piecewise-Poly$(t, d, \varepsilon)$ runs in $\mathrm{poly}(t, d+1, 1/\varepsilon)$ time, uses $\widetilde{O}(t(d+1)/\varepsilon^2)$ samples from $p$, and with probability at least $9/10$ outputs a $(2t-1)$-piecewise degree-$d$ distribution $q$ such that $d_{TV}(p,q) \le 3\tau(1+\varepsilon) + O(\varepsilon)$.

Theorem 21 follows immediately from Theorem 16 and Lemma 22.

3.6 Proof of Theorem 2. In this subsection we show how to remove the well-behavedness assumption from Theorem 21 and thus prove Theorem 2. More precisely, we prove the following theorem, which is a more detailed version of Theorem 2:

Theorem 23. Let $p$ be any $k$-mixture of $(\tau, t)$-piecewise degree-$d$ distributions over $[-1,1)$.
There is an algorithm that runs in $\mathrm{poly}(k, t, d+1, 1/\varepsilon)$ time, uses $\widetilde{O}((d+1)kt/\varepsilon^2)$ samples from $p$, and with probability at least $9/10$ outputs a $(2kt-1)$-piecewise degree-$d$ hypothesis $h$ such that $d_{TV}(p,h) \le 4\mathrm{opt}_{t,d}(1+\varepsilon) + O(\varepsilon)$.

To prove Theorem 23 we will need the following simple procedure, which (approximately) outputs all the points in $[-1,1)$ that are $\gamma$-heavy under a distribution $p$:

Algorithm Find-Heavy:
Input: parameter $\gamma > 0$, sample access to distribution $p$ over $[-1,1)$
Output: with probability at least $99/100$, a set $S \subset [-1,1)$ such that for all $x \in [-1,1)$,
1. if $\Pr_{x \sim p}[x] \ge 2\gamma$ then $x \in S$;
2. if $\Pr_{x \sim p}[x] < \gamma/2$ then $x \notin S$.

Draw $m = \widetilde{O}(1/\gamma)$ samples from $p$. For each $x \in [-1,1)$ let $\widehat{p}(x)$ equal $1/m$ times the number of occurrences of $x$ in these $m$ draws. Return the set $S$ which contains all $x$ such that $\widehat{p}(x) \ge \gamma$.

It is clear that the set $S$ returned by Find-Heavy($\gamma$) has $|S| \le 1/\gamma$. We now prove that Find-Heavy performs as claimed:

Lemma 24. With probability at least $99/100$, Find-Heavy($\gamma$) returns a set $S$ satisfying conditions (1) and (2) in the "Output" description.

We give the straightforward proof in Appendix A. To prove Theorem 23 it suffices to prove the following result (which is an extension of Theorem 16 that does not require the well-behavedness condition on $p$):

Theorem 25. Let $p$ be a pdf over $[-1,1)$. There is an algorithm Learn-Piecewise-Poly$(t, d, \varepsilon)$ which runs in $\mathrm{poly}(t, d+1, 1/\varepsilon)$ time, uses $\widetilde{O}(t(d+1)/\varepsilon^2)$ samples from $p$, and with probability at least $9/10$ outputs a $(2t-1)$-piecewise degree-$d$ distribution $q$ such that $d_{TV}(p,q) \le 4\mathrm{opt}_{t,d}(1+\varepsilon) + O(\varepsilon)$, where $\mathrm{opt}_{t,d}$ is the smallest variation distance between $p$ and any $t$-piecewise degree-$d$ distribution.
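The two sampling primitives used to prove Theorem 25 can be sketched as follows (a minimal illustration, not the paper's code; `sample_fn` is an assumed helper returning one draw from $p$ per call): Find-Heavy thresholds empirical frequencies, and the conditional draws are simulated by rejection.

```python
import random
from collections import Counter

def find_heavy(sample_fn, gamma, m):
    """Find-Heavy: draw m = O~(1/gamma) samples and keep every point
    whose empirical frequency is at least gamma."""
    counts = Counter(sample_fn() for _ in range(m))
    return {x for x, c in counts.items() if c / m >= gamma}

def draw_outside(sample_fn, S, max_tries=1000):
    """Simulate one draw from p conditioned on the complement of S by
    rejection: redraw until the sample lands outside S.  When
    Pr_p[S] < 1/4, each attempt succeeds with probability > 3/4."""
    for _ in range(max_tries):
        x = sample_fn()
        if x not in S:
            return x
    raise RuntimeError("Pr[S] appears too large for rejection sampling")
```

For instance, for a distribution placing mass $1/2$ on the atom $0.5$ and spreading the rest continuously, `find_heavy` with $\gamma = 0.1$ recovers exactly $S = \{0.5\}$ (continuous draws essentially never repeat), and `draw_outside` then never returns the heavy atom.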
Using the arguments of Section 3.5, Theorem 23 follows from Theorem 25 exactly as Theorem 21 follows from Theorem 16.

Proof of Theorem 25. The algorithm Learn-Piecewise-Poly$(t, d, \varepsilon)$ works as follows: it first runs Find-Heavy($\gamma$) where $\gamma = O(\frac{\varepsilon}{t(d+1)})$ to obtain a set $S \subset [-1,1)$. It then runs Learn-WB-Piecewise-Poly$(t, d, \varepsilon)$ but using the distribution $p_{[-1,1) \setminus S}$ (i.e. $p$ conditioned on $[-1,1) \setminus S$) in place of $p$ throughout the algorithm. Each time a draw from $p_{[-1,1) \setminus S}$ is required, it simply draws repeatedly from $p$ until a point outside of $S$ is obtained.

Let $p$ be any distribution over $[-1,1)$. Since the conclusion of the theorem is trivial if $\mathrm{opt}_{t,d} \ge 1/4$, we may assume that $\mathrm{opt}_{t,d} < 1/4$. Consider an execution of Learn-Piecewise-Poly$(t, d, \varepsilon)$. We assume that conditions (1) and (2) of Find-Heavy indeed hold for the set $S$ that it constructs. Let $S' \supseteq S$ be defined as $S' = \{x \in [-1,1): \Pr_{x \sim p}[x] \ge \gamma/2\}$. Since every $t$-piecewise degree-$d$ distribution $q$ has $d_{TV}(p,q) \ge \Pr_{x \sim p}[x \in S']$ (because $p$ assigns probability $\Pr_{x \sim p}[x \in S']$ to $S'$ whereas $q$ assigns probability $0$ to this finite set of points), it must be the case that $\Pr_{x \sim p}[x \in S] \le \Pr_{x \sim p}[x \in S'] \le \mathrm{opt}_{t,d}$. Hence each draw from $p$ yields a valid draw from $p_{[-1,1) \setminus S}$ except with failure probability at most $\mathrm{opt}_{t,d} < 1/4$. It follows easily from this and the sample complexity bound of Theorem 16 that the sample complexity of algorithm Learn-Piecewise-Poly$(t, d, \varepsilon)$ is as claimed.

Verifying correctness is also straightforward. Recall that $\mathrm{opt}_{t,d}$ denotes the infimum of $d_{TV}(p,q)$ where $q$ is any $t$-piecewise degree-$d$ distribution. Fix a $q$ which achieves $d_{TV}(p,q) = \mathrm{opt}_{t,d}$; we claim that this $q$ also satisfies $d_{TV}(p_{[-1,1) \setminus S}, q) \le \mathrm{opt}_{t,d}$.
(To see this, note that we may write $d_{TV}(p,q)$ as $A + B$, where $A$ is the contribution from points in $[-1,1) \setminus S$ and $B$ is the contribution from $S$. Since $\Pr_{x \sim q}[x \in S]$ is zero, it must be the case that $B = \frac{1}{2}\Pr_{x \sim p}[S]$, where the "$\frac{1}{2}$" is the factor relating the $L_1$ norm and total variation distance. Now write $d_{TV}(p_{[-1,1) \setminus S}, q)$ as $A' + B'$, where $A'$ is the contribution from points in $[-1,1) \setminus S$ and $B'$ is the contribution from $S$. Clearly $B'$ is now $0$, and $A'$ can be at most $B = \frac{1}{2}\Pr_{x \sim p}[S]$ larger than $A$.)

By Lemma 24 we have that $p_{[-1,1) \setminus S}$ is $O(\frac{\varepsilon}{t(d+1)})$-well-behaved. Hence by Theorem 16, when Learn-WB-Piecewise-Poly$(t, d, \varepsilon)$ is run on $p_{[-1,1) \setminus S}$, with high probability it outputs a hypothesis $h$ such that $d_{TV}(h, p_{[-1,1) \setminus S}) \le 3\mathrm{opt}_{t,d}(1+\varepsilon) + O(\varepsilon)$. Since $d_{TV}(p, p_{[-1,1) \setminus S}) \le \mathrm{opt}_{t,d}$, using the triangle inequality we get that $d_{TV}(h,p) \le 4\mathrm{opt}_{t,d}(1+\varepsilon) + O(\varepsilon)$, and Theorem 25 is proved.

4 Applications

In this section we use Theorem 23 to obtain a wide range of concrete learning results for natural and well-studied classes of distributions over both continuous and discrete domains. Throughout this section we do not aim to exhaustively cover all possible applications of Theorem 23, but rather to give some selected applications that are indicative of the generality and power of our methods. We first (Section 4.1) give a range of applications of Theorem 23 to semi-agnostically learn various natural classes of continuous distributions. These include non-parametric classes such as concave, log-concave, and $k$-monotone densities, mixtures of these densities, and parametric classes such as mixtures of univariate Gaussians.
Next, turning to discrete distributions, we first show (Section 4.2) how the $d = 0$ case of Theorem 23 can be easily adapted to learn discrete distributions that are well-approximated by piecewise flat distributions. Using this general result, we improve prior results on learning mixtures of discrete $t$-modal distributions, mixtures of discrete monotone hazard rate (MHR) distributions, and mixtures of discrete log-concave distributions (including mixtures of Poisson Binomial Distributions), in most cases giving essentially optimal results in terms of sample complexity. While we have not pursued this direction in the current paper, which focuses chiefly on continuous distributions, we suspect that with additional work Theorem 23 can be adapted to discrete domains in its full generality (of polynomials of degree $d$ for arbitrary $d$). We conjecture that such an adaptation may give essentially optimal sample complexity bounds for all of the classes of discrete distributions that we discuss in this paper.

4.1 Applications to Distributions over Continuous Domains. In this section we apply our general approach to obtain efficient learning algorithms for mixtures of many different types of continuous probability distributions. We focus chiefly on distributions that are defined by various kinds of "shape restrictions" on the pdf. Nonparametric density estimation for shape restricted classes has been a subject of study in statistics since the 1950s (see [BBBB72] for an early book on the topic), and has applications to a range of areas including reliability theory (see [Reb05] and references therein). The shape restrictions that have been studied in this area include monotonicity and concavity of pdfs [Gre56, Bru58, Rao69, Weg70, HP76, Gro85, Bir87a, Bir87b]. More recently, motivated by statistical applications (see e.g.
Walther's recent survey [Wal09]), researchers in this area have considered other types of shape restrictions including log-concavity and $k$-monotonicity [BW07, DR09, BRW09, GW09, BW10, KM10]. As we will see, our general method provides a single unified approach that gives a highly efficient algorithm (both in terms of sample complexity and computational complexity) for all the aforementioned shape restricted densities (and mixtures thereof). In most cases the sample complexities of our efficient algorithms are optimal up to log factors.

4.1.1 Concave and Log-concave Densities. Let $I \subseteq \mathbb{R}$ be a (not necessarily finite) interval. Recall that a function $g: I \to \mathbb{R}$ is called concave if for any $x, y \in I$ and $\lambda \in [0,1]$ it holds that $g(\lambda x + (1-\lambda)y) \ge \lambda g(x) + (1-\lambda)g(y)$. A function $h: I \to \mathbb{R}_+$ is called log-concave if $h(x) = \exp(g(x))$, where $g: I \to \mathbb{R}$ is concave. In this section we show that our general technique yields nearly-optimal efficient algorithms to learn (mixtures of) concave and (more generally) log-concave densities. (Because of the concavity of the log function it is easy to see that every positive and concave function is log-concave.) In particular, we show the following:

Theorem 26. Let $f: I \to \mathbb{R}_+$ be any $k$-mixture of log-concave densities, where $I = [a,b]$ is an arbitrary (not necessarily finite) interval. There is an algorithm that runs in $\mathrm{poly}(k/\varepsilon)$ time, draws $\widetilde{O}(k/\varepsilon^{5/2})$ samples from $f$, and with probability at least $9/10$ outputs a hypothesis distribution $h$ such that $d_{TV}(f,h) \le \varepsilon$.

We note that the above sample complexity is information-theoretically optimal (up to logarithmic factors). In particular, it is known (see e.g. Chapter 15 of [DL01]) that learning a single concave density (recall that a concave density is necessarily log-concave) over $[0,1]$ requires $\Omega(\varepsilon^{-5/2})$ samples.
This lower bound can be easily generalized to show that learning a $k$-mixture of log-concave distributions over $[0,1]$ requires $\Omega(k/\varepsilon^{5/2})$ samples. As far as we know, ours is the first computationally efficient algorithm with essentially optimal sample complexity for this problem.

To prove our result we proceed as follows: We show that any log-concave density $f: I \to \mathbb{R}_+$ has an $(\varepsilon, t)$-piecewise linear (degree-$1$) decomposition for $t = \widetilde{O}(1/\sqrt{\varepsilon})$. A continuous version of the argument in Theorem 4.1 of [CDSS13] can be used to show the existence of an $(\varepsilon, t)$-piecewise constant (degree-$0$) decomposition with $t = \widetilde{O}(1/\varepsilon)$. Unfortunately, the latter bound is essentially tight, and hence cannot lead to an algorithm with sample complexity better than $\Omega(\varepsilon^{-3})$.

Classical approximation results (see e.g. [Dud74, Nov88]) provide optimal piecewise linear decompositions of concave functions. While these results have a dependence on the domain size of the function, they can rather easily be adapted to establish the existence of $(\varepsilon, t)$-piecewise linear decompositions for concave densities with $t = O(1/\sqrt{\varepsilon})$. However, we are not aware of prior work establishing the existence of piecewise linear decompositions for log-concave densities. We give such a result by proving the following structural lemma:

Lemma 27. Let $f: I \to \mathbb{R}_+$ be any log-concave density, where $I = [a,b]$ is an arbitrary (not necessarily finite) interval. There exists an $(\varepsilon, t)$-piecewise linear decomposition of $f$ for $t = \widetilde{O}(1/\sqrt{\varepsilon})$.

We note that our proof of Lemma 27 is significantly different from the aforementioned known arguments establishing the existence of piecewise linear approximations for concave functions.
In particular, those proofs critically exploit concavity, namely the fact that for a concave function $f$, the line segment from $(x, f(x))$ to $(y, f(y))$ lies below the graph of the function. Before giving the proof of our lemma, we note that the $\widetilde{O}(1/\sqrt{\varepsilon})$ bound is best possible (up to log factors) even for concave densities. This can be verified by considering the concave density over $[0,1]$ whose graph is given by the upper half of a circle. We further note that the $\Omega(1/\varepsilon^{5/2})$ lower bound of [DL01] implies that no significant strengthening can be achieved by using our general results for learning piecewise degree-$d$ polynomials with $d > 1$. Theorem 26 follows as a direct corollary of Lemma 27 and Theorem 2.

Proof of Lemma 27: We begin by recalling the following fact, which is a basic property (in fact an alternate characterization) of log-concave densities:

Fact 28 ([An95], Lemma 1). Let $f: \mathbb{R} \to \mathbb{R}_+$ be log-concave. Suppose that $\{x \mid f(x) > 0\} = (a,b)$. Then, for all $x_1, x_2 \in (a,b)$ with $x_1 < x_2$ and all $\delta \ge 0$ such that $x_1 + \delta, x_2 + \delta \in (a,b)$, we have $\frac{f(x_1+\delta)}{f(x_1)} \ge \frac{f(x_2+\delta)}{f(x_2)}$.

Let $f$ be an arbitrary log-concave density over $\mathbb{R}$. Well-known concentration bounds for log-concave densities (see [An95]) imply that a $1 - \varepsilon$ fraction of the total probability mass lies in a finite interval $[a,b]$. Let $m \in [a,b]$ be a mode of $f$, so that $f$ is non-decreasing on $[a,m]$ and non-increasing on $[m,b]$. (Recall the well-known fact [An95] that every log-concave density is unimodal, so such a mode must exist.) It suffices to analyze the second portion of the density, i.e., a non-increasing log-concave (sub)distribution over $[m,b]$. We may further assume without loss of generality that $[m,b] = [0,1]$.
(It will be clear that in what follows nothing changes in the calculations as a result of this assumption: the length of the interval is irrelevant.) So let $f: [0,1] \to \mathbb{R}_+$ be a non-increasing log-concave density and let $c = f(0) = \max_{x \in [0,1]} f(x)$. It follows from elementary calculus that $f$ is continuous on its support. We assume without loss of generality that $f$ is strictly decreasing on this domain. (It follows from Fact 28 that for any non-increasing log-concave density over $[0,1]$ there exists $x_0 \in [0,1]$ such that $f$ is constant on $[0, x_0]$ and strictly decreasing on $[x_0, 1]$.) We proceed to construct the desired piecewise-linear approximation in two stages:

(a) Let $r, s \in \mathbb{Z}_+$ with $r = \Theta((1/\varepsilon)\log(1/\varepsilon))$ and $s = \lceil \log_{1/(1-\varepsilon)} \frac{f(0)}{f(1)} \rceil = \lceil \log_{1-\varepsilon} \frac{f(1)}{f(0)} \rceil$. We divide the domain $[0,1]$ into $t' \stackrel{\mathrm{def}}{=} \min\{r, s\} = O((1/\varepsilon)\log(1/\varepsilon))$ intervals (disjoint except at the endpoints) $\mathcal{I} = \{I_i\}_{i=1}^{t'}$, where $I_i = [x_{i-1}, x_i]$, $i \in [t']$. The point $x_i \in [0,1]$ is the point that satisfies

$$f(x_i) = \max\{f(x_0)(1-\varepsilon)^i,\, f(1)\}. \quad (12)$$

Since the function is strictly decreasing and continuous, such a point exists and is unique. Note that the definition with the "max" above addresses the case that $s \le r$; in this case we will have $x_{t'} = x_s = 1$. If $s > r$, then we will have $f(x_i) = f(x_0)(1-\varepsilon)^i$ for $i \in [t']$ and $x_{t'} < 1$.

We now proceed to establish a couple of useful properties of this decomposition. The first property is that the length of the intervals $I_i$ is non-increasing as a function of $i$ for $i \in [t']$.

Claim 29. For all $i \in [t'-1]$ we have that $|I_i| \ge |I_{i+1}|$.

Proof. Consider two consecutive intervals $I_i = [x_{i-1}, x_i]$ and $I_{i+1} = [x_i, x_{i+1}]$, $i \in [t'-1]$.
It is easy to see that by the definition of the intervals we have $\frac{f(x_{i+1})}{f(x_i)} \ge \frac{f(x_i)}{f(x_{i-1})}$, or equivalently $\frac{f(x_i + |I_{i+1}|)}{f(x_i)} \ge \frac{f(x_{i-1} + |I_i|)}{f(x_{i-1})}$. Since $x_{i-1} < x_i$, by Fact 28 we have $\frac{f(x_{i-1} + |I_i|)}{f(x_{i-1})} \ge \frac{f(x_i + |I_i|)}{f(x_i)}$. Combining the above two inequalities yields $f(x_i + |I_{i+1}|) \ge f(x_i + |I_i|)$. Since $f$ is non-increasing we conclude that $x_i + |I_{i+1}| \le x_i + |I_i|$, and the proof is complete.

The second property is that the probability mass that $f$ puts on the interval $[x_{t'}, 1]$ is bounded by $\varepsilon$.

Claim 30. We have that $f([x_{t'}, 1]) \le \varepsilon$.

Proof. We consider two cases. If $t' = s$, then $x_{t'} = 1$ and the desired probability is zero. It thus suffices to analyze the case $t' = r$. In this case $x_{t'} < 1$ and for all $i \in [t']$ it holds that $f(x_i) = f(x_0)(1-\varepsilon)^i$. Note that $f(x_{t'}) = f(0)(1-\varepsilon)^{t'} \le f(0)\varepsilon/2 = c\varepsilon/2$. For the purposes of the analysis, suppose we decompose $[x_{t'}, 1]$ into a sequence of intervals $\{I_i\}_{i > t'}$, where $I_i = [x_{i-1}, x_i]$ and the point $x_i$ is defined by (12). That is, we have a total of $s$ intervals $I_1, \dots, I_s$ partitioning $[0,1]$, where by Claim 29 $|I_1| \ge |I_2| \ge \dots \ge |I_s|$. Clearly $\sum_{i=1}^{s} f(I_i) = 1$, and since $f$ is non-increasing

$$c(1-\varepsilon)^i |I_i| \le f(x_i)|I_i| \le f(I_i) \le f(x_{i-1})|I_i| = c(1-\varepsilon)^{i-1}|I_i|. \quad (13)$$

Combining the above yields

$$c \cdot \sum_{i=1}^{s} (1-\varepsilon)^i |I_i| \le 1. \quad (14)$$

We want to show that $f([x_{t'}, 1]) = \sum_{i=t'+1}^{s} f(I_i) \le \varepsilon$. Indeed, we have

$$\sum_{i=t'+1}^{s} f(I_i) \le \sum_{i=t'+1}^{s} c(1-\varepsilon)^{i-1}|I_i| \le \frac{c\varepsilon}{2(1-\varepsilon)} \cdot \sum_{i=1}^{s-t'} (1-\varepsilon)^i |I_{i+t'}| \quad (15)$$

where the first inequality uses (13) and the second uses the fact that $(1-\varepsilon)^{t'} \le \varepsilon/2$.
By Claim 29 it follows that $|I_{i+t'}| \le |I_i|$, which yields

$$\sum_{i=t'+1}^{s} f(I_i) \le \frac{c\varepsilon}{2(1-\varepsilon)} \cdot \sum_{i=1}^{s-t'} (1-\varepsilon)^i |I_i| \le \frac{c\varepsilon}{2(1-\varepsilon)} \cdot \sum_{i=1}^{s} (1-\varepsilon)^i |I_i| \le \varepsilon$$

where the last inequality follows from (14) for $\varepsilon \le 1/2$.

In fact, it is now easy to show that $\mathcal{I}$ is an $(O(\varepsilon), t')$-flat decomposition of $f$, but we will not make direct use of this in the subsequent analysis.

(b) In the second step, we group consecutive intervals of $\mathcal{I}$ (in increasing order of $i$) to obtain an $(O(\varepsilon), t)$-piecewise linear decomposition $\mathcal{J} = \{J_\ell\}_{\ell=1}^{t}$ for $f$, where $t = \widetilde{O}(\varepsilon^{-1/2})$. Suppose that we have constructed the super-intervals $J_1, \dots, J_{\ell-1}$ and that $\cup_{s=1}^{\ell-1} J_s = \cup_{k=1}^{i} I_k = [x_0, x_i]$. If $i = t'$ then $t$ is set to $\ell - 1$, and if $i < t'$ then the super-interval $J_\ell$ contains the intervals $I_{i+1}, \dots, I_j$, where $j \in \mathbb{Z}_+$ is the maximum value which is $\le t'$ and satisfies: (1) $f(x_j) \ge f(x_i)(1-\varepsilon)^{1/\sqrt{\varepsilon}}$, and (2) $|I_j| \ge (1 - \sqrt{\varepsilon})|I_{i+1}|$. Within each super-interval $J_\ell = \cup_{k=i+1}^{j} I_k = [x_i, x_j]$ we approximate $f$ by the linear function $\tilde{f}$ satisfying $\tilde{f}(x_i) = f(x_i)$ and $\tilde{f}(x_j) = f(x_j)$. This completes the description of the construction.

We proceed to show correctness. Our first claim is that it is sufficient, in the construction described in (b) above, to take only $t = \widetilde{O}(\varepsilon^{-1/2})$ super-intervals, because the probability mass under $f$ that lies to the right of the rightmost of these super-intervals is at most $\varepsilon$:

Claim 31. Suppose that $t = \Omega(\varepsilon^{-1/2}\log(1/\varepsilon))$ and $J_t = [x_u, x_v]$ is the rightmost super-interval. Then $f([x_v, 1]) \le \varepsilon$.

Proof. Consider a generic super-interval $J_\ell = \cup_{k=i+1}^{j} I_k$.
Since $j$ is the maximum value that satisfies both (1) and (2), we conclude that either
$$j + 1 - i > 1/\sqrt{\varepsilon} \qquad (16)$$
(this inequality follows from the negation of (1) and the definition of $f(x_i)$, $f(x_j)$) or
$$|I_{j+1}| < (1-\sqrt{\varepsilon})\,|I_{i+1}|. \qquad (17)$$
Suppose we have $t = \Omega(\varepsilon^{-1/2}\log(1/\varepsilon))$ super-intervals. Then either (16) is satisfied for at least $t/2$ super-intervals or (17) is satisfied for at least $t/2$ super-intervals. Denote the rightmost super-interval by $J_t = [x_u, x_v]$. In the first case, for an appropriate constant in the big-Omega, we have $v = t'$ and the desired result follows from Claim 30. In the second case, for an appropriate constant in the big-Omega, we will have $|I_v| \le \varepsilon^3 |I_1|$. To show that $f([x_v, 1]) \le \varepsilon$ in this case, we consider further partitioning the interval $[x_v, 1]$ into a sequence of intervals $\{I_i\}_{i > v}$, where $I_i = [x_{i-1}, x_i]$ and the point $x_i$ is defined by (12). By Claim 29 we will have $|I_i| \le |I_v|$ for $i > v$. We can therefore bound the desired quantity by
$$\sum_{i=v+1}^{s} f(I_i) \le \sum_{i=v+1}^{s} c(1-\varepsilon)^{i-1}|I_i| \le \sum_{i=v+1}^{s} c(1-\varepsilon)^{i-1}\varepsilon^3 |I_1| \le \varepsilon^3\, c|I_1| \sum_{i=1}^{\infty} (1-\varepsilon)^{i-1} \le \frac{\varepsilon^3}{1-\varepsilon}\cdot\frac{1}{\varepsilon} \le \varepsilon,$$
where the first inequality used the first inequality of (15), the penultimate inequality uses the fact that $c(1-\varepsilon)|I_1| \le f(I_1) \le 1$ (so that $c|I_1| \le 1/(1-\varepsilon)$), and the last inequality holds for $\varepsilon \le 1/2$. This completes the proof of the claim.

The main claim we are going to establish for the piecewise-linear approximation $\mathcal{J}$ is the following:

Claim 32. For any super-interval $J_\ell = \cup_{k=i+1}^{j} I_k$ and any $i \le m \le j$ we have that $|\tilde{f}(x_m) - f(x_m)| = O(\varepsilon) f(x_m)$.

Assuming the above claim, it is easy to argue that $\mathcal{J}$ is indeed an $(O(\varepsilon), t)$ piecewise linear approximation to $f$. Let $\tilde{f}$ be the piecewise linear function over $[0,1]$ which is linear over each $J_\ell$ (as described above) and identically zero on the interval $[x_v, 1]$.
Indeed, we have that
$$\|\tilde{f} - f\|_1 \le \sum_{\ell=1}^{t} \int_{J_\ell} |\tilde{f}(y) - f(y)|\,dy + f([x_v, 1]) \le \sum_{i=1}^{v} \int_{x_{i-1}}^{x_i} |\tilde{f}(y) - f(y)|\,dy + \varepsilon \le \sum_{i=1}^{v} O(\varepsilon)\, f(x_m)\,|I_i| + \varepsilon \le \sum_{i=1}^{v} O(\varepsilon)\, f(x_{i-1})\,|I_i| + \varepsilon = O(\varepsilon),$$
where the second inequality used Claim 31, the third inequality used Claim 32 (with $x_m$ an endpoint of $I_i$), the fourth inequality used the fact that $f$ is non-increasing, and the final step used the fact that
$$\sum_{i=1}^{v} f(x_{i-1})\,|I_i| \le \frac{1}{1-\varepsilon}\sum_{i=1}^{v} f(x_i)\,|I_i| \le \frac{1}{1-\varepsilon}\sum_{i=1}^{v} f(I_i) \le \frac{1}{1-\varepsilon},$$
which follows by the definition of the $f(x_i)$'s. We are now ready to give the proof of the claim.

Proof of Claim 32. If $\tilde{f}$ is the approximating line between $x_i$ and $x_j$, we can write
$$\tilde{f}(x_m) = f(x_i) + \big(f(x_j) - f(x_i)\big)\cdot\frac{\sum_{k=i+1}^{m} |I_k|}{\sum_{k=i+1}^{j} |I_k|}.$$
Note that $f(x_j) - f(x_i) = f(x_i)\big((1-\varepsilon)^{j-i} - 1\big)$. We also recall that
$$(1-\varepsilon)^{j-i} = 1 - \varepsilon(j-i) + \varepsilon^2(j-i)^2/2 + O\big(\varepsilon^3(j-i)^3\big).$$
Since $i, j$ are in the same super-interval, we have $j - i \le 1/\sqrt{\varepsilon}$, which implies that the above error term is $O(\varepsilon^{3/2})$. We will use this approximation henceforth; it is also valid for any $m \in [i, j]$. Also, by condition (2) defining the lengths of the intervals in the same super-interval and the monotonicity of the lengths themselves, we obtain
$$\frac{m-i}{j-i}\cdot(1-\sqrt{\varepsilon}) \le \frac{\sum_{k=i+1}^{m} |I_k|}{\sum_{k=i+1}^{j} |I_k|} \le \frac{m-i}{j-i}\cdot\frac{1}{1-\sqrt{\varepsilon}}.$$
By carefully combining the above inequalities we obtain the desired result. In particular, we have
$$\tilde{f}(x_m) \le f(x_i)\Big[1 - \varepsilon\big(1 + O(\sqrt{\varepsilon})\big)(m-i) + (\varepsilon^2/2)(j-i)(m-i)\big(1 + O(\sqrt{\varepsilon})\big) + O(\varepsilon^{3/2})\Big].$$
Also,
$$f(x_m) = f(x_i)\Big[1 - \varepsilon(m-i) + (\varepsilon^2/2)(m-i)^2 + O(\varepsilon^{3/2})\Big].$$
Therefore, using the fact that $j-i,\ m-i \le 1/\sqrt{\varepsilon}$, we get that $\tilde{f}(x_m) - f(x_m) \le O(\varepsilon) f(x_i)$.
In an analogous manner we obtain that $f(x_m) - \tilde{f}(x_m) \le O(\varepsilon) f(x_i)$. By the definition of a super-interval, the maximum and minimum values of $f$ within the super-interval are within a $1 + o(1)$ factor of each other. This completes the proof of Claim 32.

This completes the proof of Lemma 27.

4.1.2 $k$-monotone Densities. Let $I = [a, b] \subseteq \mathbb{R}$ be a (not necessarily finite) interval. A function $f : I \to \mathbb{R}_+$ is said to be 1-monotone if it is non-increasing. It is 2-monotone if it is non-increasing and convex, and $k$-monotone for $k \ge 3$ if $(-1)^j f^{(j)}$ is non-negative, non-increasing and convex for $j = 0, \ldots, k-2$. The problem of density estimation for $k$-monotone densities has been extensively investigated in the mathematical statistics community during the past few years (see [BW07, GW09, BW10, Ser10] and references therein) due to its significance in both theory and applications [BW07]. For example, as pointed out in [BW07], the problem of learning an unknown $k$-monotone density arises in a generalization of Hampel's bird-watching problem [Ham87].

The aforementioned papers from the statistics community focus on analyzing the rate of convergence of the Maximum Likelihood Estimator (MLE) under various metrics. In this section we show that our approach yields an efficient algorithm to learn bounded $k$-monotone densities over $[0,1]$ (i.e., $k$-monotone densities $p$ such that $\sup_{x \in [0,1]} p(x) = O(1)$), and mixtures thereof, with sample complexity $\tilde{O}(k/\varepsilon^{2+1/k})$. This bound is provably optimal (up to log factors) for $k = 1$ by [Bir87a] and for $k = 2$ (see e.g. Chapter 15 of [DL01]), and we conjecture that it is similarly tight for all values of $k$. Our main algorithmic result for $k$-monotone densities is the following:

Theorem 33. Let $k \in \mathbb{Z}_+$ and let $f : [0,1] \to \mathbb{R}_+$ be a $t$-mixture of bounded $k$-monotone densities.
There is an algorithm that runs in $\mathrm{poly}(k, t, 1/\varepsilon)$ time, uses $\tilde{O}(tk/\varepsilon^{2+1/k})$ samples, and outputs a hypothesis distribution $h$ such that $d_{TV}(h, f) \le \varepsilon$.

The above theorem follows as a corollary of Theorem 2 and the following structural result:

Lemma 34 (Implicit in [KL04, KL07]). Let $f : [0,1] \to \mathbb{R}_+$ be a $k$-monotone density such that $\sup_x |f(x)| = O(1)$. There exists an $(\varepsilon, t)$-piecewise degree-$(k-1)$ approximation of $f$ with $t = O(\varepsilon^{-1/k})$.

As we now explain, the above lemma can be deduced from recent work in approximation theory [KL04, KL07]. To state the relevant theorem we need some terminology. Let $s \in \mathbb{Z}_+$, and for a real function $f$ over an interval $I$, let
$$\Delta_\tau^s f(t) = \sum_{i=0}^{s} (-1)^{s-i}\binom{s}{i} f(t + i\tau)$$
be the $s$-th difference of $f$ with step $\tau > 0$, where $[t, t+s\tau] \subseteq I$. For $r \in \mathbb{Z}_+^*$, let $W_1^r(I)$ be the set of real functions $f$ over $I$ that are absolutely continuous in every compact subinterval of $I$ and satisfy $\|f^{(r)}\|_1 = O(1)$. We denote by $\Delta_+^s W_1^r(I)$ the subset of functions $f$ in $W_1^r(I)$ that satisfy $\Delta_\tau^s f(t) \ge 0$ for all $\tau > 0$ such that $[t, t+s\tau] \subseteq I$. (Note that if $f$ is $s$-times differentiable, the latter condition is tantamount to saying that $f^{(s)} \ge 0$.) We have the following:

Theorem 35 (Theorem 1 in [KL07]). Let $s \in \mathbb{Z}_+$ and $r, \nu, n \in \mathbb{Z}_+^*$ be such that $\nu \ge \max\{r, s\}$. For any $f \in \Delta_+^s W_1^r(I)$ there exists a piecewise degree-$(\nu-1)$ polynomial approximation $h$ to $f$ with $n$ pieces such that $\|h - f\|_1 = O(n^{-\max\{r,s\}})$.

(In fact, it is shown in [KL07] that the above bound is quantitatively optimal up to constant factors.) Let $f : [0,1] \to \mathbb{R}_+$ be a $k$-monotone density such that $\sup |f| = O(1)$. It is easy to see that Lemma 34 follows from Theorem 35 for the following setting of parameters: $s = k$, $r = 1$ and $\nu = \max\{r, s\} = k$.
Indeed, since $(-1)^{k-2} f^{(k-2)}$ is convex, it follows that $\Delta_\tau^k f(t)$ is nonnegative for even $k$ and nonpositive for odd $k$. Since $f$ is a non-increasing bounded density, it is clear that $\|f'\|_1 = \big|\int_0^1 f'(t)\,dt\big| = f(0) - f(1) = O(1)$. Hence, for even $k$, Theorem 35 is applicable to $f$ and yields Lemma 34. For odd $k$, Lemma 34 follows by applying Theorem 35 to the function $-f$.

4.1.3 Mixtures of Univariate Gaussians. As a final example illustrating the power and generality of Theorem 2, we now show how it very easily yields a computationally efficient algorithm with essentially optimal (up to logarithmic factors) sample complexity for learning mixtures of $k$ univariate Gaussians. As will be evident from the proof, similar results could be obtained via our techniques for a wide range of mixture distribution learning problems for different types of parametric univariate distributions beyond Gaussians.

Lemma 36. Let $p = N(\mu, \sigma^2)$ be a univariate Gaussian. Then $p$ is an $(\varepsilon, 3)$-piecewise degree-$d$ distribution for $d = O(\log(1/\varepsilon))$.

Since Theorem 23 is easily seen to extend to semi-agnostic learning of $k$-mixtures of $t$-piecewise degree-$d$ distributions, Lemma 36 immediately gives the following semi-agnostic learning result for mixtures of $k$ one-dimensional Gaussians:

Theorem 37. Let $p$ be any distribution with $d_{TV}(p, q) \le \varepsilon$, where $q$ is any one-dimensional mixture of $k$ Gaussians. There is a $\mathrm{poly}(k, 1/\varepsilon)$-time algorithm that uses $\tilde{O}(k/\varepsilon^2)$ samples and with high probability outputs a hypothesis $h$ such that $d_{TV}(h, p) \le O(\varepsilon)$.

It is straightforward to show that $\Omega(k/\varepsilon^2)$ samples are information-theoretically necessary for learning a mixture of $k$ Gaussians, and thus our sample complexity is optimal up to logarithmic factors.

Discussion.
Moitra and Valiant [MV10] recently gave an algorithm for parameter estimation (a stronger requirement than the density estimation guarantees that we provide) of any mixture of $k$ $n$-dimensional Gaussians. Their algorithm has sample complexity that is exponential in $k$, and indeed they prove that any algorithm that does parameter estimation, even for a mixture of $k$ one-dimensional Gaussians, must use $2^{\Omega(k)}$ samples. In contrast, our result shows that it is possible to perform density estimation for any mixture of $k$ one-dimensional Gaussians with a computationally efficient algorithm that uses exponentially fewer (linear in $k$) samples than are required for parameter estimation. Moreover, unlike the parameter estimation results of [MV10], our density estimation algorithm is semi-agnostic: it succeeds even if the target distribution is $\varepsilon$-far from a mixture of Gaussians.

Proof of Lemma 36: Without loss of generality we may take $p$ to be the standard Gaussian $N(0,1)$, which has pdf $p(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$. Let $I_1 = (-\infty, -C\sqrt{\log(1/\varepsilon)})$, $I_2 = [-C\sqrt{\log(1/\varepsilon)}, C\sqrt{\log(1/\varepsilon)})$ and $I_3 = [C\sqrt{\log(1/\varepsilon)}, \infty)$, where $C > 0$ is an absolute constant. We define the distribution $q$ as follows: $q(x) = 0$ for all $x \in I_1 \cup I_3$, and $q(x)$ is given by the degree-$d$ Taylor expansion of $p(x)$ about $0$ for $x \in I_2$, where $d = O(\log(1/\varepsilon))$. Clearly $q$ is a 3-piecewise degree-$d$ polynomial. To see that $d_{TV}(p, q) \le \varepsilon$, we first observe that by a standard Gaussian tail bound the regions $I_1$ and $I_3$ contribute at most $\varepsilon/2$ to $d_{TV}(p, q)$, so it suffices to argue that
$$\int_{I_2} |p(x) - q(x)|\,dx \le \varepsilon/2. \qquad (18)$$
Fix any $x \in I_2$. Taylor's theorem gives that $|p(x) - q(x)| \le |p^{(d+1)}(x')|\,|x|^{d+1}/(d+1)!$ for some $x' \in [0, x]$.
Recalling that the $(d+1)$-st derivative $p^{(d+1)}(x')$ of the pdf of the standard Gaussian equals $H_{d+1}(x')\,p(x')$ up to sign, where $H_{d+1}$ is the Hermite polynomial of order $d+1$, standard bounds on the Hermite polynomials together with the fact that $|x| \le C\sqrt{\log(1/\varepsilon)}$ give that for a suitable $d = O(\log\frac{1}{\varepsilon})$ we have $|p(x) - q(x)| \le \varepsilon^2$ for all $x \in I_2$, which in turn gives (18) for sufficiently small $\varepsilon$ since $|I_2| = 2C\sqrt{\log(1/\varepsilon)}$. This gives the lemma.

4.2 Learning discrete distributions. For convenience, in this subsection we consider discrete distributions over the $2N$-point finite domain
$$D := \Big\{-\tfrac{N}{N},\ -\tfrac{N-1}{N},\ \ldots,\ -\tfrac{1}{N},\ 0,\ \tfrac{1}{N},\ \ldots,\ \tfrac{N-1}{N}\Big\}.$$
We say that a discrete distribution $q$ over domain $D$ is $t$-flat if there exists a partition of $D$ into $t$ intervals $I_1, \ldots, I_t$ such that $q(i) = q(j)$ for all $i, j \in I_\ell$, for each $\ell = 1, \ldots, t$. We say that a distribution $p$ over $D$ is $(\varepsilon, t)$-flat if $d_{TV}(p, q) \le \varepsilon$ for some distribution $q$ over $D$ that is $t$-flat.

We begin by giving a simple reduction from learning $(\varepsilon, t)$-flat distributions over $D$ to learning $(\varepsilon, t)$-piecewise degree-0 distributions over $[-1, 1]$. Together with Theorem 23, this reduction gives us an essentially optimal algorithm for learning discrete $(\varepsilon, t)$-flat distributions (see Theorem 38). We then apply Theorem 38 to obtain highly efficient algorithms (in most cases with provably near-optimal sample complexity) for various specific classes of discrete distributions, essentially resolving a number of open problems from previous works.

4.2.1 A reduction from discrete to continuous. Given a discrete distribution $p$ over $D$, we define $\tilde{p}$ to be the distribution over $[-1, 1)$ defined as follows: a draw from $\tilde{p}$ is obtained by drawing a value $i/N$ from $p$ and then outputting $(i + x)/N$, where $x$ is distributed uniformly over $[0, 1)$.
It is easy to see that if the distribution $p$ (over domain $D$) is $t$-flat, then the distribution $\tilde{p}$ (over domain $[-1,1)$) is $t$-piecewise degree-0. Moreover, if $p$ is $\tau$-close to some $t$-flat distribution $q$ over $D$, then $\tilde{p}$ is $\tau$-close to $\tilde{q}$. In the opposite direction, for $p$ a distribution over $[-1,1)$ we define $p^*$ to be the following distribution supported on $D$: a draw from $p^*$ is obtained by sampling $x$ from $p$ and then outputting the value obtained by rounding $x$ down to the next integer multiple of $1/N$. It is easy to see that if $p, q$ are distributions over $[-1,1)$ then $d_{TV}(p^*, q^*) \le d_{TV}(p, q)$. It is also clear that for $p$ a distribution over $D$ we have $(\tilde{p})^* = p$.

With these relationships in hand, we may learn a $(\tau, t)$-flat distribution $p$ over $D$ as follows: run Algorithm Learn-Piecewise-Poly$(t, d = 0, \varepsilon)$ on the distribution $\tilde{p}$. Since $p$ is $(\tau, t)$-flat, $\tilde{p}$ is $\tau$-close to some $t$-piecewise degree-0 distribution $q$ over $[-1,1)$, so the algorithm with high probability constructs a hypothesis $h$ over $[-1,1)$ such that $d_{TV}(h, \tilde{p}) \le O(\tau + \varepsilon)$. The final hypothesis is $h^*$; for this hypothesis we have
$$d_{TV}(h^*, p) = d_{TV}(h^*, (\tilde{p})^*) \le d_{TV}(h, \tilde{p}) \le O(\tau + \varepsilon),$$
as desired. The above discussion and Theorem 23 together give the following:

Theorem 38. Let $p$ be a mixture of $k$ $(\tau, t)$-flat discrete distributions over $D$. There is an algorithm which uses $\tilde{O}(kt/\varepsilon^2)$ samples from $p$, runs in time $\mathrm{poly}(k, t, 1/\varepsilon)$, and with probability at least $9/10$ outputs a hypothesis distribution $h$ over $D$ such that $d_{TV}(p, h) \le O(\varepsilon + \tau)$.

We note that this is essentially a stronger version of Corollary 3.1 (the main technical result) of [CDSS13], which gave a similar guarantee but with an algorithm that required $O(kt/\varepsilon^3)$ samples.
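The two maps in this reduction are simple enough to state in code. The following is a minimal sketch of ours (the function names are not from the paper): it implements the continuizing map $p \mapsto \tilde{p}$ and the rounding map $p \mapsto p^*$ at the level of individual draws, and checks the identity $(\tilde{p})^* = p$ used above.

```python
import math
import random

def continuize_draw(i_over_N, N, rng=random.random):
    # A draw from p~: given a draw i/N from p, output (i + x)/N with x ~ U[0, 1).
    return i_over_N + rng() / N

def discretize(x, N):
    # The map p -> p*: round x in [-1, 1) down to the next multiple of 1/N.
    return math.floor(x * N) / N

# Rounding a continuized draw recovers the original discrete draw, which is
# the identity (p~)* = p.  (N is a power of two here, so the arithmetic
# below is exact in floating point.)
N = 16
for i in range(-N, N):
    assert discretize(continuize_draw(i / N, N), N) == i / N
```

Since both maps act draw by draw, a hypothesis $h$ over $[-1,1)$ produced by the continuous learner is converted into the final hypothesis $h^*$ over $D$ by pushing the mass of each grid cell $[i/N, (i+1)/N)$ onto the point $i/N$.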
We also remark that $\Omega(kt/\varepsilon^2)$ samples are information-theoretically required to learn an arbitrary $k$-mixture of $t$-flat distributions. Hence, our sample complexity is optimal up to logarithmic factors (even for the case $\tau = 0$).

We would also like to mention the relation of the above theorem to recent work by Indyk, Levi and Rubinfeld [ILR12]. Motivated by a database application, [ILR12] considers the problem of learning a $k$-flat distribution over $[n]$ under the $L_2$ norm and gives an efficient algorithm that uses $O(k^2 \log(n)/\varepsilon^4)$ samples. Since the total variation distance is a stronger metric, Theorem 38 immediately implies an improved sample bound of $\tilde{O}(k/\varepsilon^2)$ for their problem.

4.2.2 Learning specific classes of discrete distributions.

Mixtures of $t$-modal discrete distributions. Recall that a distribution over an interval $I = [a,b] \cap D$ is said to be unimodal if there is a value $y \in I$ such that its pdf is monotone non-decreasing on $I \cap [-1, y]$ and monotone non-increasing on $I \cap (y, 1)$. For $t > 1$, a distribution $p$ over $D$ is $t$-modal if there is a partition of $D$ into $t$ intervals $I_1, \ldots, I_t$ such that the conditional distributions $p_{I_1}, \ldots, p_{I_t}$ are each unimodal. In [CDSS13, DDS+13] (building on [Bir87b]) it is shown that every $t$-modal distribution over $D$ is $(\varepsilon, t\log(N)/\varepsilon)$-flat. Using this fact together with Theorem 38 in place of Corollary 3.1 of [CDSS13], we improve the sample complexity of the [CDSS13] algorithm for learning mixtures of $t$-modal distributions and obtain the following:

Theorem 39. For any $t \ge 1$, let $p$ be any $k$-mixture of $t$-modal distributions over $D$. There is an algorithm that runs in time $\mathrm{poly}(k, t, \log N, 1/\varepsilon)$, draws $\tilde{O}(kt\log(N)/\varepsilon^3)$ samples from $p$, and with probability at least $9/10$ outputs a hypothesis distribution $h$ such that $d_{TV}(p, h) \le \varepsilon$.
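To give a feel for the flatness fact quoted above, here is a small self-contained sketch (our own illustration, not code from [CDSS13] or [Bir87b]) of the oblivious Birgé-style decomposition in the monotone ($t = 1$) case: intervals whose lengths grow geometrically by a factor $(1+\varepsilon)$, so that $O(\log(n)/\varepsilon)$ of them cover the domain, and flattening a non-increasing pmf on them costs little in total variation.

```python
def birge_buckets(n, eps):
    # Oblivious intervals covering {0, ..., n-1} whose lengths grow
    # geometrically by a factor (1 + eps): O(log(n)/eps) intervals in total.
    buckets, start, length = [], 0, 1.0
    while start < n:
        end = min(n, start + max(1, int(length)))
        buckets.append(range(start, end))
        start, length = end, length * (1 + eps)
    return buckets

def flatten(pmf, buckets):
    # Replace the pmf by its average on each bucket; the result is
    # len(buckets)-flat and still sums to 1.
    flat = [0.0] * len(pmf)
    for b in buckets:
        avg = sum(pmf[i] for i in b) / len(b)
        for i in b:
            flat[i] = avg
    return flat

# Example: a truncated geometric pmf (monotone non-increasing) over n points.
n, eps = 1024, 0.1
raw = [0.99 ** i for i in range(n)]
Z = sum(raw)
p = [r / Z for r in raw]
buckets = birge_buckets(n, eps)
q = flatten(p, buckets)
tv = 0.5 * sum(abs(a - c) for a, c in zip(p, q))
```

On this example about fifty intervals suffice and the flattening cost `tv` stays below $\varepsilon$; applying such a decomposition separately on each piece of a $t$-modal distribution is what yields the $(\varepsilon, t\log(N)/\varepsilon)$-flatness used above.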
We note that an easy adaptation of Birgé's lower bound [Bir87a] for learning monotone distributions (see the discussion at the end of Section 5 of [CDSS13]) gives that any algorithm for learning a $k$-mixture of $t$-modal distributions over $D$ must use $\Omega(kt\log(N/(kt))/\varepsilon^3)$ samples, and hence the sample complexity bound of Theorem 39 is optimal up to logarithmic factors. We further note that even the $t = 1$ case of this result compares favorably with the main result of [DDS12a], which gave an algorithm for learning $t$-modal distributions over $D$ that uses $O(t\log(N)/\varepsilon^3) + \tilde{O}(t^3/\varepsilon^3)$ samples. The [DDS12a] result gave an optimal bound only for small settings of $t$, specifically $t = \tilde{O}((\log N)^{1/3})$, and gave a quite poor bound as $t$ grows large; for example, at $t = (\log N)^2$ the optimal bound would be $O((\log N)^3/\varepsilon^3)$ but the [DDS12a] result only gives $\tilde{O}((\log N)^9/\varepsilon^3)$. In contrast, our new result gives an essentially optimal bound (up to log factors in the optimal sample complexity) for all settings of $t$.

Mixtures of monotone hazard rate distributions. Let $p$ be a distribution supported on $D$. The hazard rate of $p$ is the function $H(i) \stackrel{\mathrm{def}}{=} \frac{p(i)}{\sum_{j \ge i} p(j)}$; if $\sum_{j \ge i} p(j) = 0$ then we say $H(i) = +\infty$. We say that $p$ has monotone hazard rate (MHR) if $H(i)$ is a non-decreasing function over $D$. [CDSS13] showed that every MHR distribution over $D$ is $(\varepsilon, O(\log(N/\varepsilon)/\varepsilon))$-flat. Theorem 38 thus gives us the following:

Theorem 40. Let $p$ be any $k$-mixture of MHR distributions over $D$. There is an algorithm that runs in time $\mathrm{poly}(k, \log N, 1/\varepsilon)$, draws $\tilde{O}(k\log(N)/\varepsilon^3)$ samples from $p$, and with probability at least $9/10$ outputs a hypothesis distribution $h$ such that $d_{TV}(p, h) \le \varepsilon$.
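The hazard rate in this definition is easy to compute directly. As a quick illustration of ours (not from the paper), the sketch below computes $H(i)$ for a pmf and checks that a truncated geometric distribution is MHR; its hazard rate works out to $q/(1-(1-q)^{n-i})$, which is non-decreasing in $i$ and equals 1 at the last support point.

```python
def hazard_rates(pmf):
    # H(i) = p(i) / sum_{j >= i} p(j); by convention +infinity once the
    # tail sum is zero.
    tail = sum(pmf)
    H = []
    for p_i in pmf:
        H.append(p_i / tail if tail > 0 else float("inf"))
        tail -= p_i
    return H

def is_mhr(pmf, tol=1e-12):
    # p is MHR iff H is non-decreasing.
    H = hazard_rates(pmf)
    return all(H[i] <= H[i + 1] + tol for i in range(len(H) - 1))

# Truncated geometric pmf: p(i) proportional to (1 - q)^i for i < n.
q, n = 0.3, 50
raw = [(1 - q) ** i for i in range(n)]
Z = sum(raw)
pmf = [r / Z for r in raw]
```

An untruncated geometric distribution has constant hazard rate $q$, the boundary case of the MHR condition.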
In [CDSS13] it is shown that any algorithm to learn $k$-mixtures of MHR distributions over $D$ must use $\Omega(k\log(N/k)/\varepsilon^3)$ samples, so Theorem 40 is essentially optimal in its sample complexity.

Mixtures of discrete log-concave distributions. A probability distribution $p$ over $D$ is said to be log-concave if it satisfies the following conditions: (i) if $i < j < k \in D$ are such that $p(i)p(k) > 0$, then $p(j) > 0$; and (ii) $p(k/N)^2 \ge p((k-1)/N)\,p((k+1)/N)$ for all $k \in \{-N+1, \ldots, -1, 0, 1, \ldots, N-2\}$. In [CDSS13] it is shown that every log-concave distribution over $D$ is $(\varepsilon, O(\log(1/\varepsilon)/\varepsilon))$-flat. Hence Theorem 38 gives:

Theorem 41. Let $p$ be any $k$-mixture of log-concave distributions over $D$. There is an algorithm that runs in time $\mathrm{poly}(k, 1/\varepsilon)$, draws $\tilde{O}(k/\varepsilon^3)$ samples from $p$, and with probability at least $9/10$ outputs a hypothesis distribution $h$ such that $d_{TV}(p, h) \le \varepsilon$.

As in the previous examples, this improves the [CDSS13] sample complexity by essentially a factor of $1/\varepsilon$. We note that as a special case of Theorem 41 we get an efficient $\tilde{O}(k/\varepsilon^3)$-sample algorithm for learning any mixture of $k$ Poisson Binomial Distributions. (A Poisson Binomial Distribution, or PBD, is a random variable of the form $X_1 + \cdots + X_N$ where the $X_i$'s are independent 0/1 random variables that may have arbitrary and non-identical means.) The main result of [DDS12b] gave an efficient $\tilde{O}(1/\varepsilon^3)$-sample algorithm for learning a single PBD; here we achieve the same sample complexity, with an efficient algorithm, for learning any mixture of any constant number of PBDs.

Acknowledgements. We would like to thank Dany Leviatan for useful correspondence regarding his recent works [KL04, KL07].

References

[AK03] Sanjeev Arora and Subhash Khot. Fitting algebraic curves to noisy data. J. Comput. Syst. Sci.
, 67(2):325–340, 2003.

[An95] M. Y. An. Log-concave probability distributions: Theory and statistical testing. Technical Report, Economics Working Paper Archive at WUSTL, Washington University at St. Louis, 1995.

[Ass83] P. Assouad. Deux remarques sur l'estimation. C. R. Acad. Sci. Paris Sér. I, 296:1021–1024, 1983.

[BBBB72] R. E. Barlow, D. J. Bartholomew, J. M. Bremner, and H. D. Brunk. Statistical Inference under Order Restrictions. Wiley, New York, 1972.

[Bir87a] L. Birgé. Estimating a density under order restrictions: Nonasymptotic minimax risk. Annals of Statistics, 15(3):995–1012, 1987.

[Bir87b] L. Birgé. On the risk of histograms for estimating decreasing densities. Annals of Statistics, 15(3):1013–1022, 1987.

[Bru58] H. D. Brunk. On the estimation of parameters restricted by inequalities. The Annals of Mathematical Statistics, 29(2):437–454, 1958.

[BRW09] F. Balabdaoui, K. Rufibach, and J. A. Wellner. Limit distribution theory for maximum likelihood estimation of a log-concave density. The Annals of Statistics, 37(3):1299–1331, 2009.

[BS10] M. Belkin and K. Sinha. Polynomial learning of distribution families. In FOCS, pages 103–112, 2010.

[BW07] F. Balabdaoui and J. A. Wellner. Estimation of a k-monotone density: Limit distribution theory and the spline connection. The Annals of Statistics, 35(6):2536–2564, 2007.

[BW10] F. Balabdaoui and J. A. Wellner. Estimation of a k-monotone density: characterizations, consistency and minimax lower bounds. Statistica Neerlandica, 64(1):45–70, 2010.

[CDSS13] S. Chan, I. Diakonikolas, R. Servedio, and X. Sun. Learning mixtures of structured distributions over discrete domains. In SODA, 2013.

[DDS12a] C. Daskalakis, I. Diakonikolas, and R. A. Servedio. Learning k-modal distributions via testing. In SODA, 2012.

[DDS12b] C. Daskalakis, I. Diakonikolas, and R. A. Servedio.
Learning Poisson Binomial Distributions. In STOC, pages 709–728, 2012.

[DDS+13] C. Daskalakis, I. Diakonikolas, R. Servedio, G. Valiant, and P. Valiant. Testing k-modal distributions: Optimal algorithms via reductions. In SODA, 2013.

[DG85] L. Devroye and L. Györfi. Nonparametric Density Estimation: The L1 View. John Wiley & Sons, 1985.

[DGJ+10] I. Diakonikolas, P. Gopalan, R. Jaiswal, R. Servedio, and E. Viola. Bounded independence fools halfspaces. SIAM Journal on Computing, 39(8):3441–3462, 2010.

[DL01] L. Devroye and G. Lugosi. Combinatorial Methods in Density Estimation. Springer Series in Statistics, Springer, 2001.

[DR09] L. Dümbgen and K. Rufibach. Maximum likelihood estimation of a log-concave density and its distribution function: Basic properties and uniform consistency. Bernoulli, 15(1):40–68, 2009.

[Dud74] R. M. Dudley. Metric entropy of some classes of sets with differentiable boundaries. Journal of Approximation Theory, 10(3):227–236, 1974.

[FM99] Y. Freund and Y. Mansour. Estimating a mixture of two product distributions. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pages 183–192, 1999.

[FOS05] J. Feldman, R. O'Donnell, and R. Servedio. Learning mixtures of product distributions over discrete domains. In Proc. 46th Symposium on Foundations of Computer Science (FOCS), pages 501–510, 2005.

[Gre56] U. Grenander. On the theory of mortality measurement. Skand. Aktuarietidskr., 39:125–153, 1956.

[Gro85] P. Groeneboom. Estimating a monotone density. In Proc. of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, pages 539–555, 1985.

[GW09] F. Gao and J. A. Wellner. On the rate of convergence of the maximum likelihood estimator of a k-monotone density. Science in China Series A: Mathematics, 52:1525–1538, 2009.

[Ham87] F. R. Hampel.
Design, modelling, and analysis of some biological data sets. In Design, Data & Analysis, pages 93–128. John Wiley & Sons, Inc., New York, NY, USA, 1987.

[HP76] D. L. Hanson and G. Pledger. Consistency in concave regression. The Annals of Statistics, 4(6):1038–1050, 1976.

[ILR12] P. Indyk, R. Levi, and R. Rubinfeld. Approximating and testing k-histogram distributions in sub-linear time. In PODS, pages 15–22, 2012.

[Jac97] J. Jackson. An efficient membership-query algorithm for learning DNF with respect to the uniform distribution. Journal of Computer and System Sciences, 55:414–440, 1997.

[KL04] V. N. Konovalov and D. Leviatan. Free-knot splines approximation of s-monotone functions. Adv. Comput. Math., 20(4):347–366, 2004.

[KL07] V. N. Konovalov and D. Leviatan. Free-knot splines approximation of Sobolev-type classes of s-monotone functions. Adv. Comput. Math., 27(2):211–236, 2007.

[KM93] E. Kushilevitz and Y. Mansour. Learning decision trees using the Fourier spectrum. SIAM J. on Computing, 22(6):1331–1348, 1993.

[KM10] R. Koenker and I. Mizera. Quasi-concave density estimation. Ann. Statist., 38(5):2998–3027, 2010.

[KMR+94] M. Kearns, Y. Mansour, D. Ron, R. Rubinfeld, R. Schapire, and L. Sellie. On the learnability of discrete distributions. In Proceedings of the 26th Symposium on Theory of Computing, pages 273–282, 1994.

[KMV10] A. T. Kalai, A. Moitra, and G. Valiant. Efficiently learning mixtures of two Gaussians. In STOC, pages 553–562, 2010.

[KOS04] A. Klivans, R. O'Donnell, and R. Servedio. Learning intersections and thresholds of halfspaces. Journal of Computer & System Sciences, 68(4):808–840, 2004.

[KS04] A. Klivans and R. Servedio. Learning DNF in time $2^{\tilde{O}(n^{1/3})}$. Journal of Computer & System Sciences, 68(2):303–318, 2004.

[LMN93] N. Linial, Y. Mansour, and N. Nisan.
Constant depth circuits, Fourier transform and learnability. Journal of the ACM, 40(3):607–620, 1993.

[MOS04] E. Mossel, R. O'Donnell, and R. Servedio. Learning functions of k relevant variables. Journal of Computer & System Sciences, 69(3):421–434, 2004. Preliminary version in Proc. STOC'03.

[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, New York, NY, 1995.

[MV10] A. Moitra and G. Valiant. Settling the polynomial learnability of mixtures of Gaussians. In FOCS, pages 93–102, 2010.

[Nov88] E. Novak. Deterministic and Stochastic Error Bounds in Numerical Analysis. Springer-Verlag, 1988.

[PA13] D. Papp and F. Alizadeh. Shape constrained estimation using nonnegative splines. Journal of Computational and Graphical Statistics, 2013.

[Rao69] B. L. S. Prakasa Rao. Estimation of a unimodal density. Sankhya Ser. A, 31:23–36, 1969.

[Reb05] L. Reboul. Estimation of a function under shape restrictions. Applications to reliability. Ann. Statist., 33(3):1330–1356, 2005.

[Sco92] D. W. Scott. Multivariate Density Estimation: Theory, Practice and Visualization. Wiley, New York, 1992.

[Ser10] A. Seregin. Uniqueness of the maximum likelihood estimator for k-monotone densities. Proceedings of the American Mathematical Society, 138:4511–4511, 2010.

[Sil86] B. W. Silverman. Density Estimation. Chapman and Hall, London, 1986.

[Wal09] G. Walther. Inference and modeling with log-concave distributions. Statistical Science, 24(3):319–327, 2009.

[Weg70] E. J. Wegman. Maximum likelihood estimation of a unimodal density, I and II. Ann. Math. Statist., 41:457–471 and 2169–2174, 1970.

A Omitted proofs

A.1 Proof of Lemma 6. Recall Lemma 6:

Lemma 6.
Given $0 < \kappa < 1$ and access to samples from a $\kappa/64$-well-behaved distribution $p$ over $[-1, 1)$, the procedure Approximately-Equal-Partition uses $\tilde{O}(1/\kappa)$ samples from $p$, runs in time $\tilde{O}(1/\kappa)$, and with probability at least $99/100$ outputs a partition of $[-1,1)$ into $\ell = \Theta(1/\kappa)$ intervals such that $p(I_j) \in [\frac{1}{2}\kappa, 3\kappa]$ for all $1 \le j \le \ell$.

Proof of Lemma 6: Let $n$ denote $1/\kappa$ (we assume w.l.o.g. that $n$ is an integer). Let $S$ be a sample of $m = \Theta(n\log n)$ i.i.d. draws from $p$, where $m$ is an integer multiple of $n$. For $1 \le i \le m$ let $U_{(i)}$ denote the $i$-th order statistic of $S$, i.e., the $i$-th smallest element of $S$. Let $U_{(0)} := -1$. Our goal is to show that with high probability, for each $j \in \{1, \ldots, n\}$ we have $p\big(\big[U_{(\frac{j-1}{n}\cdot m)}, U_{(\frac{j}{n}\cdot m)}\big)\big) \in [\frac{1}{2n}, \frac{2}{n}]$. This means that simply greedily taking the intervals $I_1, I_2, \ldots$ from left to right, where the left endpoint of $I_1$ is $-1$, the left (closed) endpoint of the $j$-th interval is the same as the right (open) endpoint of the $(j-1)$-st interval, and the $j$-th interval ends at $U_{(\frac{j}{n}\cdot m)}$, the resulting intervals have probability masses as desired. (These intervals cover $[-1, U_{(m)}]$; an easy argument shows that with probability at least $1 - 1/n$, the uncovered region $(U_{(m)}, 1)$ has mass at most $1/n$ under $p$, so we may add it to the final interval.)

Let $P$ denote the cumulative distribution function associated with $p$. For $0 \le \alpha < \beta \le 1$, let $\#S[\alpha, \beta)$ denote the number of elements $x \in S$ that have $P(x) \in [\alpha, \beta)$. A multiplicative Chernoff bound and a union bound together straightforwardly give that with probability at least $99/100$, for each $i \in \{1, \ldots, 8n\}$ we have $\#S\big[\frac{i-1}{8n}, \frac{i}{8n}\big) \in \big[\frac{1}{16}\cdot\frac{m}{n}, \frac{3}{16}\cdot\frac{m}{n}\big]$.
(Note that since $p$ is $\frac{1}{64n}$-well-behaved, the amount of mass that $p$ puts on $P^{-1}\big(\big[\frac{i-1}{8n}, \frac{i}{8n}\big)\big)$ lies in $\big[\frac{3}{32n}, \frac{5}{32n}\big]$.) As an immediate consequence of this we get that $p\big(\big[U_{(\frac{j-1}{n}\cdot m)}, U_{(\frac{j}{n}\cdot m)}\big]\big) \in [\frac{1}{2n}, \frac{2}{n}]$ for each $j \in \{1, \ldots, n\}$, which establishes the lemma.

A.2 Proof of Theorem 8. Recall Theorem 8:

Theorem 8. Let $p$ be an unknown $t$-piecewise degree-$d$ distribution over $[-1, 1)$, where $t \ge 1$, $d \ge 0$ satisfy $t + d > 1$. Let $L$ be any algorithm which, given as input $t, d, \varepsilon$ and access to independent samples from $p$, outputs a hypothesis distribution $h$ such that $\mathbb{E}[d_{TV}(p, h)] \le \varepsilon$, where the expectation is over the random samples drawn from $p$ and any internal randomness of $L$. Then $L$ must use at least
$$\Omega\Big(\frac{t(d+1)}{(1+\log(d+1))^2}\cdot\frac{1}{\varepsilon^2}\Big)$$
samples.

We first observe that if $d = 0$ then the claimed $\Omega(t/\varepsilon^2)$ lower bound follows easily from the standard fact that this many samples are required to learn an unknown distribution over the $t$-element set $\{1, \ldots, t\}$. (This fact follows easily from Assouad's lemma; we will essentially prove it using Assouad's lemma in Section A.2.1 below.) Thus we may assume below that $d > 0$; in fact, we can (and do) assume that $d \ge C$, where $C$ may be taken to be any fixed absolute constant.

In what follows we shall use Assouad's lemma to establish an $\Omega\big(\frac{d}{(\log d)^2}\cdot\frac{1}{\varepsilon^2}\big)$ lower bound for learning a single degree-$d$ distribution over $[-1,1)$ to accuracy $\varepsilon$. The same argument applied to a concatenation of $t$ equally weighted copies of this lower bound construction over the $t$ disjoint intervals $[-1, -1+\frac{2}{t}), \ldots, [1-\frac{2}{t}, 1)$ (again using Assouad's lemma) yields Theorem 8. Thus to prove Theorem 8 for general $t$ it is enough to prove the following lower bound, corresponding to $t = 1$.
(For ease of exposition in our later arguments, we take the domain of $p$ below to be the interval $[0, 2k)$ rather than $[-1, 1)$.)

Theorem 42. Fix an integer $d \ge C$. Let $p$ be an unknown degree-$d$ distribution over $[0, 2k)$. Let $L$ be any algorithm which, given as input $d, \varepsilon$ and access to independent samples from $p$, outputs a hypothesis distribution $h$ such that $E[d_{TV}(p, h)] \le \varepsilon$. Then $L$ must use at least $\Omega\big(\frac{d}{(\log d)^2} \cdot \frac{1}{\varepsilon^2}\big)$ samples.

Our main tool for proving Theorem 42 is Assouad's Lemma [Ass83]. We recall the statement of Assouad's Lemma from [DG85] below. (The statement below is slightly tailored to our context, in that we have taken the underlying domain to be $[0, 2k)$ and the partition of the domain to be $[0, 2), [2, 4), \dots, [2k - 2, 2k)$.)

Theorem 43. [Theorem 5, Chapter 4, [DG85]] Let $k \ge 1$ be an integer. For each $b = (b_1, \dots, b_k) \in \{-1, 1\}^k$, let $p_b$ be a probability distribution over $[0, 2k)$. Suppose that the distributions $p_b$ satisfy the following properties: Fix any $\ell \in [k]$ and any $b \in \{-1, 1\}^k$ with $b_\ell = 1$. Let $b' \in \{-1, 1\}^k$ be the same as $b$ but with $b'_\ell = -1$. The properties are that

1. $\int_{2\ell - 2}^{2\ell} |p_b(x) - p_{b'}(x)| \, dx \ge \alpha$, and

2. $\int_0^{2k} \sqrt{p_b(x) p_{b'}(x)} \, dx \ge 1 - \gamma > 0$.

Then for any algorithm $L$ that draws $n$ samples from an unknown $p \in \{p_b\}_{b \in \{-1,1\}^k}$ and outputs a hypothesis distribution $h$, there is some $b \in \{-1, 1\}^k$ such that if the target distribution $p$ is $p_b$, then
$$E[d_{TV}(p_b, h)] \ge (k\alpha/4)\big(1 - \sqrt{2n\gamma}\big). \quad (19)$$

We will use this lemma in the following way: Fix any $d \ge C$ and any $0 < \varepsilon < 1/2$. We will exhibit a family of $2^k$ distributions $p_b$, where each $p_b$ is a degree-$d$ polynomial distribution and $k = \Theta(d/(\log d)^2)$.
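Before constructing the family, it is worth checking the arithmetic by which Theorem 43 will be applied: with $\alpha = \varepsilon/k$, $\gamma = \varepsilon^2/k$, and $n = 1/(8\gamma)$ samples, we get $\sqrt{2n\gamma} = 1/2$, so the right-hand side of (19) is exactly $\varepsilon/8$. A minimal numeric sketch (the function name is ours):

```python
import math

def assouad_rhs(k, alpha, gamma, n):
    """Right-hand side of Assouad's bound (19): (k*alpha/4) * (1 - sqrt(2*n*gamma))."""
    return (k * alpha / 4) * (1 - math.sqrt(2 * n * gamma))
```

With $n = 1/(8\gamma)$ the factor $1 - \sqrt{2n\gamma}$ equals $1/2$ independently of $k$ and $\varepsilon$, which is why $\Omega(\varepsilon)$ expected error follows once conditions (1) and (2) are verified.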
We will show that all pairs $b, b' \in \{-1, 1\}^k$ as specified in Theorem 43 satisfy condition (1) with $\alpha = \Omega(\varepsilon/k)$, and satisfy condition (2) with $\gamma = O(\varepsilon^2/k)$. With these conditions, consider an algorithm $L$ that draws $n = 1/(8\gamma)$ samples from the unknown target distribution $p$. The right-hand side of (19) simplifies to $k\alpha/8 = \Omega(\varepsilon)$, and hence by Theorem 43, the expected variation distance error of algorithm $L$'s hypothesis $h$ is $\Omega(\varepsilon)$. This yields Theorem 42.

Thus, in the rest of this subsection, to prove Theorem 42 and thus establish Theorem 8, it suffices for us to describe the $2^k$ distributions $p_b$ and establish conditions (1) and (2) with the claimed bounds $\alpha = \Omega(\varepsilon/k)$ and $\gamma = O(\varepsilon^2/k)$. We do this below.

A.2.1 The idea behind the construction.

We provide some intuition before entering into the details of our construction. Intuitively, each polynomial $p_b$ (for a given $b \in \{-1, 1\}^k$) is an approximation, over the interval $[0, 2k)$ of interest, of a $2k$-piecewise constant distribution $S_b$ that we describe below.

To do this, first let us define the $2k$-piecewise constant distribution $R_b(x) = R_{b,1}(x) + \cdots + R_{b,k}(x)$ over $[0, 2k)$, where $R_{b,i}(x)$ is a function which is $0$ outside of the interval $[2i - 2, 2i)$. For $x \in [2i - 2, 2i - 1)$ we have $R_{b,i}(x) = (1 + b_i \cdot \varepsilon)/(2k)$, and for $x \in [2i - 1, 2i)$ we have $R_{b,i}(x) = (1 - b_i \cdot \varepsilon)/(2k)$. So note that regardless of whether $b_i$ is $1$ or $-1$, we have $\int_{2i-2}^{2i} R_{b,i}(x) \, dx = 1/k$ and hence $\int_0^{2k} R_b(x) \, dx = 1$, so $R_b$ is indeed a probability distribution over the domain $[0, 2k)$. The distribution $S_b$ over $[0, 2k)$ is defined as
$$S_b(x) = \frac{1}{10} \cdot \frac{1}{2k} + \frac{9}{10} \cdot R_b(x). \quad (20)$$
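Condition (2) for these piecewise constant distributions is verified below; since $S_b$ and $S_{b'}$ differ only on two unit-length pieces of $[2\ell - 2, 2\ell)$, the squared Hellinger distance collapses to a closed form that can also be spot-checked numerically. A standalone sketch (the function name is ours):

```python
import math

def hellinger_sq_Sb(k, eps):
    """h(S_b, S_b')^2 for b, b' differing in one coordinate: the two densities
    differ only on two unit-length pieces, where they take the constant values
    1/(20k) + 0.9*(1 +/- eps)/(2k); by symmetry the integral collapses to
    (sqrt(a_plus) - sqrt(a_minus))^2."""
    a_plus = 1 / (20 * k) + 0.9 * (1 + eps) / (2 * k)
    a_minus = 1 / (20 * k) + 0.9 * (1 - eps) / (2 * k)
    return (math.sqrt(a_plus) - math.sqrt(a_minus)) ** 2
```

Numerically the ratio $h(S_b, S_{b'})^2 / (\varepsilon^2/k)$ stays within constant bounds, consistent with the $\Theta(\varepsilon^2/k)$ claim established below.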
(The reason for "mixing" $R_b$ with the uniform distribution will become clear later; roughly, it is to control the adverse effect on condition (2) of having only a polynomial approximation $p_b$ instead of the actual piecewise constant distribution.)

To motivate the goal of constructing polynomials $p_b$ that approximate the piecewise constant distributions $S_b$, let us verify that the distributions $\{S_b\}_{b \in \{-1,1\}^k}$ satisfy conditions (1) and (2) of Theorem 43 with the desired parameters. So fix any $b \in \{-1, 1\}^k$ with $b_\ell = 1$ and let $b' \in \{-1, 1\}^k$ differ from $b$ precisely in the $\ell$-th coordinate. For (1), we immediately have that
$$\int_{2\ell-2}^{2\ell} |S_b(x) - S_{b'}(x)| \, dx = \frac{9}{10} \int_{2\ell-2}^{2\ell} |R_{b,\ell}(x) - R_{b',\ell}(x)| \, dx = \frac{9}{5} \cdot \frac{\varepsilon}{k}.$$
For (2), we have that for any two distributions $f, g$,
$$\int_0^{2k} \sqrt{f(x) g(x)} \, dx = 1 - h(f, g)^2,$$
where $h(f, g)^2$ is the squared Hellinger distance between $f$ and $g$,
$$h(f, g)^2 = \frac{1}{2} \int_0^{2k} \Big(\sqrt{f(x)} - \sqrt{g(x)}\Big)^2 dx.$$
Applying this to $S_b$ and $S_{b'}$, we get
$$h(S_b, S_{b'})^2 = \frac{1}{2} \int_0^{2k} \Big(\sqrt{S_b(x)} - \sqrt{S_{b'}(x)}\Big)^2 dx = \frac{1}{2} \int_{2\ell-2}^{2\ell} \Bigg(\sqrt{\frac{1}{20k} + \frac{9}{10} \cdot \frac{1+\varepsilon}{2k}} - \sqrt{\frac{1}{20k} + \frac{9}{10} \cdot \frac{1-\varepsilon}{2k}}\Bigg)^2 dx = \Theta(\varepsilon^2/k),$$
as desired. We now turn to the actual construction.

A.2.2 The construction.

Fix any $b \in \{-1, 1\}^k$. Our goal is to give a degree-$d$ polynomial $p_b$ that is a high-quality approximator of $S_b(x)$ over $[0, 2k)$. We shall do this by approximating each $R_{b,i}(x)$ and combining the approximators in the obvious way. We can write each $R_{b,i}(x)$ as $R_{b,i,1}(x) + R_{b,i,2}(x)$ where $R_{b,i,1}(x)$ is $0$ outside of $[2i - 2, 2i - 1)$ and $R_{b,i,2}(x)$ is $0$ outside of $[2i - 1, 2i)$. So $R_b(x)$ is the sum of $2k$ many functions each of which is of the form $\omega_{b,j} \cdot I_j(x)$, i.e.
$$R_b(x) = \sum_{j=1}^{2k} \omega_{b,j} \cdot I_j(x), \quad (21)$$
where each $\omega_{b,j}$ is either $(1 + \varepsilon)/2k$ or $(1 - \varepsilon)/2k$ and $I_j$ is the indicator function of the interval $[j - 1, j)$: i.e. $I_j(x) = 1$ if $x \in [j - 1, j)$ and is $0$ elsewhere.

We shall approximate each indicator function $I_j(x)$ over $[0, 2k)$ by a low-degree univariate polynomial which we shall denote $\tilde{I}_j(x)$; then we will multiply each $\tilde{I}_j(x)$ by $\omega_{b,j}$ and sum the results to obtain our polynomial approximator $\tilde{R}_b(x)$ to $R_b(x)$, i.e.
$$\tilde{R}_b(x) := \sum_{j=1}^{2k} \omega_{b,j} \tilde{I}_j(x). \quad (22)$$

The starting point of our construction is the polynomial whose existence is asserted in Lemma 3.7 of [DGJ+10]; this is essentially a low-degree univariate polynomial which is a high-accuracy approximator to the function $\mathrm{sign}(x)$ over $[-1, 1)$ except for values of $x$ that have small absolute value. Taking $k = M \log(1/\varepsilon)$ in Claim 3.8 of [DGJ+10] for $M$ a sufficiently large constant (rather than $M = 15$ as is done in [DGJ+10]), the construction employed in the proof of Lemma 3.7 gives the following:

Fact 44. For $0 \le \tau \le c$, where $c < 1$ is an absolute constant, there is a polynomial $A(x)$ of degree $O((\log(1/\tau))^2/\tau)$ such that

1. For all $x \in [-1, -\tau)$ we have $A(x) \in [-1, -1 + \tau^{10}]$;

2. For all $x \in (\tau, 1]$ we have $A(x) \in [1 - \tau^{10}, 1]$;

3. For all $x \in [-\tau, \tau]$ we have $A(x) \in [-1, 1]$.

For $-1/4 \le \theta \le 1/4$ let $B_\theta(x)$ denote the polynomial $B_\theta(x) = (A(x) - A(x - \theta))/2$. Given Fact 44, it is easy to see that $B_\theta(x)$ has degree $O((\log(1/\tau))^2/\tau)$ and, over the interval $[-1/2, 1/2]$, is a high-accuracy approximation to the indicator function of the interval $[0, \theta]$ except on "error regions" of width at most $\tau$ at each of the endpoints $0, \theta$. Next, recall that $k = \Theta(d/(\log d)^2)$ where $d$ is at least some universal constant $C$.
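The polynomial $A(x)$ above comes from the [DGJ+10] construction, which we do not reproduce here. Purely to illustrate the shape of $B_\theta$, one can substitute any smooth sign-like surrogate for $A$, e.g. a Chebyshev interpolant of $\tanh(x/\tau)$. This is an illustrative stand-in, not the actual construction, and the names are ours:

```python
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

# Stand-in for the sign-approximator A(x) of Fact 44: interpolate the smooth
# surrogate tanh(x / tau), which is sign(x) up to small error outside
# [-tau, tau]; degree 200 makes the interpolation error negligible here.
tau = 0.05
A = Chebyshev.interpolate(lambda x: np.tanh(x / tau), 200, domain=[-1, 1])

def B(x, theta=0.25):
    """B_theta(x) = (A(x) - A(x - theta)) / 2: approximately the indicator of
    [0, theta], up to small error near the endpoints 0 and theta."""
    return (A(x) - A(x - theta)) / 2
```

Evaluating `B` deep inside $[0, \theta]$ gives values near $1$, and well outside it values near $0$, mirroring the behavior Fact 44 guarantees for the true construction.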
Choosing $\tau = \delta/k$ for a suitably small positive absolute constant $\delta$, and performing a suitable linear scaling and shifting of the polynomial $B_\theta(x)$, we get the following:

Fact 45. Fix any integer $1 \le j \le 2k$. There is a polynomial $C_j(x)$ of degree at most $d$ which is such that

1. For $x \in [j - 0.999, j - 0.001)$ we have $C_j(x) \in [1 - 1/k^5, 1]$;

2. For $x \in [0, j - 1) \cup [j, 2k)$ we have $C_j(x) \in [0, 1/k^5]$;

3. For $x \in [j - 1, j - 0.999) \cup [j - 0.001, j)$ we have $0 \le C_j(x) \le 1$.

The desired polynomial $\tilde{I}_j(x)$, which is an approximator of the indicator function $I_j(x)$, is obtained by renormalizing $C_j$ so that it integrates to $1$ over the domain $[0, 2k)$; i.e. we define
$$\tilde{I}_j(x) = C_j(x) \Big/ \int_0^{2k} C_j(x) \, dx. \quad (23)$$
By Fact 45 we have that $\int_0^{2k} C_j(x) \, dx \in [0.997, 1.003]$, and thus we obtain the following:

Fact 46. Fix any integer $1 \le j \le 2k$. The polynomial $\tilde{I}_j(x)$ has degree at most $d$ and is such that

1. For $x \in [j - 0.999, j - 0.001)$ we have $\tilde{I}_j(x) \in [0.996, 1.004]$;

2. For $x \in [0, j - 1) \cup [j, 2k)$ we have $\tilde{I}_j(x) \in [0, 1/k^4]$;

3. For $x \in [j - 1, j - 0.999) \cup [j - 0.001, j)$ we have $0 \le \tilde{I}_j \le 1.004$; and

4. $\int_0^{2k} \tilde{I}_j(x) \, dx = 1$.

Recall that from (22) the polynomial approximator $\tilde{R}_b(x)$ for $R_b(x)$ is defined as $\tilde{R}_b(x) = \sum_{j=1}^{2k} \omega_{b,j} \tilde{I}_j(x)$. We define the final polynomial $p_b(x)$ as
$$p_b(x) = \frac{1}{10} \cdot \frac{1}{2k} + \frac{9}{10} \cdot \tilde{R}_b(x). \quad (24)$$
Since $\sum_{j=1}^{2k} \omega_{b,j} = 1$ for every $b \in \{-1, 1\}^k$, the polynomial $p_b$ does indeed define a legitimate probability distribution over $[0, 2k)$.

It will be useful for us to take the following alternate view on $p_b(x)$. Define
$$\tilde{J}_j(x) = \frac{1}{10} \cdot \frac{1}{2k} + \frac{9}{10} \cdot \tilde{I}_j(x). \quad (25)$$
Recalling that $\sum_{j=1}^{2k} \omega_{b,j} = 1$, we may alternately define $p_b$ as
$$p_b(x) = \sum_{j=1}^{2k} \omega_{b,j} \tilde{J}_j(x). \quad (26)$$
The following is an easy consequence of Fact 46:

Fact 47. Fix any $1 \le j \le 2k$. The polynomial $\tilde{J}_j(x)$ has degree at most $d$ and is such that

1. For $x \in [j - 0.999, j - 0.001)$ we have $\tilde{J}_j(x) \in [0.896 + \frac{0.1}{2k}, 0.904 + \frac{0.1}{2k}]$;

2. For $x \in [0, j - 1) \cup [j, 2k)$ we have $\tilde{J}_j(x) \in [\frac{0.1}{2k}, \frac{0.1}{2k} + \frac{1}{k^4}]$;

3. For $x \in [j - 1, j - 0.999) \cup [j - 0.001, j)$ we have $\tilde{J}_j(x) \in [\frac{0.1}{2k}, 0.904 + \frac{0.1}{2k}]$; and

4. $\int_0^{2k} \tilde{J}_j(x) \, dx = 1$.

We are now ready to prove that the distributions $\{p_b\}_{b \in \{-1,1\}^k}$ satisfy properties (1) and (2) of Assouad's lemma with $\alpha = \Omega(\varepsilon/k)$ and $\gamma = O(\varepsilon^2/k)$ as described in the discussion following Theorem 43. Fix $b \in \{-1, 1\}^k$ with $b_\ell = 1$ and $b' \in \{-1, 1\}^k$ which agrees with $b$ except in the $\ell$-th coordinate. We establish properties (1) and (2) in the following two claims:

Claim 48. We have $\int_{2\ell-2}^{2\ell} |p_b(x) - p_{b'}(x)| \, dx \ge \Omega(\varepsilon/k)$.

Proof. Recall from (26) that
$$p_b(x) = \sum_{j=1}^{2k} \omega_{b,j} \cdot \tilde{J}_j(x) \quad \text{and} \quad p_{b'}(x) = \sum_{j=1}^{2k} \omega_{b',j} \cdot \tilde{J}_j(x).$$
We have that $\omega_{b,j} = \omega_{b',j}$ for all but exactly two (adjacent) values of $j$, which are $j = 2\ell - 1$ and $j = 2\ell$. For those values we have
$$\omega_{b,2\ell-1} = (1 + \varepsilon)/(2k), \quad \omega_{b',2\ell-1} = (1 - \varepsilon)/(2k)$$
while
$$\omega_{b,2\ell} = (1 - \varepsilon)/(2k), \quad \omega_{b',2\ell} = (1 + \varepsilon)/(2k).$$
So we have
$$\int_{2\ell-2}^{2\ell} |p_b(x) - p_{b'}(x)| \, dx = \int_{2\ell-2}^{2\ell} \big|(\omega_{b,2\ell-1} \tilde{J}_{2\ell-1}(x) + \omega_{b,2\ell} \tilde{J}_{2\ell}(x)) - (\omega_{b',2\ell-1} \tilde{J}_{2\ell-1}(x) + \omega_{b',2\ell} \tilde{J}_{2\ell}(x))\big| \, dx = (\varepsilon/k) \cdot \int_{2\ell-2}^{2\ell} |\tilde{J}_{2\ell-1}(x) - \tilde{J}_{2\ell}(x)| \, dx.$$
Claim 48 now follows immediately from
$$\int_{2\ell-2}^{2\ell} |\tilde{J}_{2\ell-1}(x) - \tilde{J}_{2\ell}(x)| \, dx = \Omega(1),$$
which is an easy consequence of Fact 47.

Claim 49. We have $\int_0^{2k} \sqrt{p_b(x) p_{b'}(x)} \, dx \ge 1 - O(\varepsilon^2/k)$, i.e. $h(p_b, p_{b'})^2 \le O(\varepsilon^2/k)$.
Proof. As above, $\omega_{b,j} = \omega_{b',j}$ for all but exactly two (adjacent) values of $j$, which are $j = 2\ell - 1$ and $j = 2\ell$. For those values we have
$$\omega_{b,2\ell-1} = (1 + \varepsilon)/(2k), \quad \omega_{b',2\ell-1} = (1 - \varepsilon)/(2k), \quad \omega_{b,2\ell} = (1 - \varepsilon)/(2k), \quad \omega_{b',2\ell} = (1 + \varepsilon)/(2k).$$
We have
$$h(p_b, p_{b'})^2 = \frac{1}{2} \int_0^{2k} \Big(\sqrt{p_b(x)} - \sqrt{p_{b'}(x)}\Big)^2 dx = A/2 + B/2,$$
where
$$A = \int_{[0,2k) \setminus [2\ell-2, 2\ell)} \Big(\sqrt{p_b(x)} - \sqrt{p_{b'}(x)}\Big)^2 dx \quad \text{and} \quad B = \int_{[2\ell-2, 2\ell)} \Big(\sqrt{p_b(x)} - \sqrt{p_{b'}(x)}\Big)^2 dx.$$

We first bound $A$, by upper bounding the value of the integrand $\big(\sqrt{p_b(x)} - \sqrt{p_{b'}(x)}\big)^2$ at any fixed $x \in [0, 2k) \setminus [2\ell-2, 2\ell)$. Recall that $p_b(x)$ is a sum of the $2k$ values $\omega_{b,j} \cdot \tilde{J}_j(x)$. The $\frac{0.1}{2k}$ contribution to each $\tilde{J}_j(x)$ ensures that $p_b(x) \ge \frac{0.1}{2k}$ for all $x \in [0, 2k)$, and it is easy to see from the construction that $p_b(x) \le \frac{2}{2k}$ for all $x \in [0, 2k)$. The difference between the values $p_b(x)$ and $p_{b'}(x)$ comes entirely from $(\varepsilon/k)(\tilde{J}_{2\ell-1}(x) - \tilde{J}_{2\ell}(x))$, which for such $x$ has magnitude at most $(\varepsilon/k) \cdot (1/k^4) = \varepsilon/k^5$. So we have that $\big(\sqrt{p_b(x)} - \sqrt{p_{b'}(x)}\big)^2$ is at most the following (where $c_x \in [0.1, 2]$ for each $x \in [0, 2k) \setminus [2\ell-2, 2\ell)$):
$$\Bigg(\sqrt{\frac{c_x}{2k} + \frac{\varepsilon}{k^5}} - \sqrt{\frac{c_x}{2k}}\Bigg)^2 = \frac{c_x}{2k} \cdot \Bigg(\sqrt{1 + \frac{2\varepsilon}{c_x k^4}} - 1\Bigg)^2 = \frac{c_x}{2k} \cdot [\Theta(\varepsilon/k^4)]^2 = \Theta(\varepsilon^2/k^9).$$
Integrating over the region of width $2k - 2$, we get that $A = O(\varepsilon^2/k^8)$.

It remains to bound $B$. Fix any $x \in [2\ell-2, 2\ell)$. As above we have that $p_b(x)$ equals $\frac{c_x}{2k}$ for some $c_x \in [0.1, 2]$, and (26) implies that $p_b(x)$ and $p_{b'}(x)$ differ by at most $\Theta(\varepsilon/k)$. So we have
$$\Big(\sqrt{p_b(x)} - \sqrt{p_{b'}(x)}\Big)^2 \le \Bigg[\sqrt{\frac{c_x}{2k}} - \sqrt{\frac{c_x}{2k} - \frac{\Theta(\varepsilon)}{k}}\Bigg]^2 = \frac{c_x}{2k} \Big[1 - \sqrt{1 - \Theta(\varepsilon)}\Big]^2 = \frac{c_x}{2k} \Theta(\varepsilon^2) = \Theta(\varepsilon^2/k).$$
Integrating over the region of width $2$, we get that $B = O(\varepsilon^2/k)$. Combining the two bounds gives $h(p_b, p_{b'})^2 = A/2 + B/2 = O(\varepsilon^2/k)$, which proves Claim 49.
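The two integrand estimates in the proof above can be spot-checked numerically: a density of size $c_x/(2k)$ perturbed by $\varepsilon/k^5$ yields a $\Theta(\varepsilon^2/k^9)$ integrand, while a $\Theta(\varepsilon/k)$ perturbation yields $\Theta(\varepsilon^2/k)$. A minimal sketch (the function name is ours):

```python
import math

def sq_sqrt_diff(a, b):
    """(sqrt(a) - sqrt(b))^2, the pointwise Hellinger integrand."""
    return (math.sqrt(a) - math.sqrt(b)) ** 2
```

For example, with $c_x = 1$, the ratio of `sq_sqrt_diff(c/(2k) + eps/k**5, c/(2k))` to $\varepsilon^2/k^9$, and of `sq_sqrt_diff(c/(2k), c/(2k) - eps/k)` to $\varepsilon^2/k$, both stay within constant bounds as $k$ and $\varepsilon$ vary.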
This concludes the proof of Theorem 42 and with it the proof of Theorem 8.

A.3 Proof of Lemma 22.

Recall Lemma 22:

Lemma 22. Let $p_1, \dots, p_k$ each be a $(\tau, t)$-piecewise degree-$d$ distribution over $[-1, 1)$ and let $p = \sum_{j=1}^k \mu_j p_j$ be a $k$-mixture of components $p_1, \dots, p_k$. Then $p$ is a $(\tau, kt)$-piecewise degree-$d$ distribution.

Proof of Lemma 22: For $1 \le j \le k$, let $\mathcal{P}_j$ denote the intervals $I_{j,1}, \dots, I_{j,t}$ such that $p_j$ is $\tau$-close to a distribution $g_j$ whose pdf is given by polynomials $g_{j,1}, \dots, g_{j,t}$ over intervals $I_{j,1}, \dots, I_{j,t}$ respectively. Let $\mathcal{P}$ be the common refinement of $\mathcal{P}_1, \dots, \mathcal{P}_k$. It is clear that $\mathcal{P}$ is a partition of $[-1, 1)$ into at most $kt$ intervals. For each $I$ in $\mathcal{P}$ and for each $1 \le j \le k$, let $g_{j,I} \in \{g_{j,1}, \dots, g_{j,t}\}$ be the polynomial corresponding to $I$. We claim that $p = \sum_{j=1}^k \mu_j p_j$ is $\tau$-close to the $kt$-piecewise degree-$d$ distribution $g$ which has the polynomial $\sum_{j=1}^k \mu_j g_{j,I}$ as its pdf over interval $I$, for each $I \in \mathcal{P}$.

To see this, for each interval $I \in \mathcal{P}$ let us write $\tilde{p}_{j,I}$ to denote the function which equals $p_j$ on $I$ and equals $0$ elsewhere, and likewise for $\tilde{g}_{j,I}$. With this notation we may write the condition that $p_j$ is $\tau$-close to $g_j$ in total variation distance as
$$\Big\| \sum_{I \in \mathcal{P}} \tilde{p}_{j,I} - \tilde{g}_{j,I} \Big\|_1 \le 2\tau. \quad (27)$$
We then have
$$\|p - g\|_1 = \Big\| \sum_{I \in \mathcal{P}} \Big( \sum_{j=1}^k \mu_j \tilde{p}_{j,I} - \mu_j \tilde{g}_{j,I} \Big) \Big\|_1 \le \sum_{j=1}^k \mu_j \Big\| \sum_{I \in \mathcal{P}} (\tilde{p}_{j,I} - \tilde{g}_{j,I}) \Big\|_1 \le 2\tau,$$
and the proof is complete.

A.4 Proof of Lemma 24.

Recall Lemma 24:

Lemma 24. With probability at least $99/100$, Find-Heavy($\gamma$) returns a set $S$ satisfying conditions (1) and (2) in the "Output" description.

Proof of Lemma 24: Fix any $x \in [-1, 1)$ such that $\Pr_{x \sim p}[x] \ge 2\gamma$.
A standard multiplicative Chernoff bound implies that $x$ is placed in $S$ except with failure probability at most $\frac{1}{200} \cdot 2\gamma$. Since there are at most $\frac{1}{2\gamma}$ values $x \in [-1, 1)$ such that $\Pr_{x \sim p}[x] \ge 2\gamma$, we get that condition (1) holds except with failure probability at most $\frac{1}{200}$.

For the second bullet, first consider any $x$ such that $\Pr_{x \sim p}[x] \in [\frac{\gamma}{2^c}, \frac{\gamma}{2}]$ (here $c > 0$ is a universal constant). A standard multiplicative Chernoff bound gives that each such $x$ satisfies $\hat{p}(x) \ge 2 \Pr_{x \sim p}[x]$ with probability at most $\frac{1}{400} \cdot \frac{\gamma}{2^c}$, and hence each such $x$ satisfies $\hat{p}(x) \ge \gamma$ with probability at most $\frac{1}{400} \cdot \frac{\gamma}{2^c}$. Since there are at most $2^c/\gamma$ such $x$'s, we get that with probability at least $1 - \frac{1}{400}$ no such $x$ belongs to $S$.

To finish the analysis we recall the following version of the multiplicative Chernoff bound:

Fact 50. [[MR95], Theorem 4.1] Let $Y_1, \dots, Y_m$ be i.i.d. 0/1 random variables with $\Pr[Y_i = 1] = q$ and let $Q = mq = E[\sum_{i=1}^m Y_i]$. Then for all $\tau > 0$ we have
$$\Pr\Big[\sum_{i=1}^m Y_i \ge (1 + \tau) Q\Big] \le \Big(\frac{e^\tau}{(1 + \tau)^{1+\tau}}\Big)^Q \le \Big(\frac{e}{1 + \tau}\Big)^{(1+\tau) Q}.$$

Fix any integer $r \ge c$ and fix any $x$ such that $\Pr_{x \sim p}[x] \in [\frac{\gamma}{2^{r+1}}, \frac{\gamma}{2^r}]$. Taking $1 + \tau$ in Fact 50 to equal $2^r$, we get that
$$\Pr[x \in S] \le \Big(\frac{e}{2^r}\Big)^{\Theta(m\gamma)} = \Big(\frac{e}{2^r}\Big)^{\Theta(\log(1/\gamma))}.$$
Summing over all (at most $2^{r+1}/\gamma$ many) $x$ such that $\Pr_{x \sim p}[x] \in [\frac{\gamma}{2^{r+1}}, \frac{\gamma}{2^r}]$, we get that the probability that any such $x$ is placed in $S$ is at most
$$\frac{2^{r+1}}{\gamma} \cdot \Big(\frac{e}{2^r}\Big)^{\Theta(\log(1/\gamma))} \le \frac{1}{400} \cdot \frac{1}{2^r}.$$
Summing over all $r \ge c$, the total failure probability incurred by such $x$ is at most $1/400$. This proves the lemma.
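The weaker form of Fact 50, which is the one applied above, can be sanity-checked against an exact binomial tail. A standalone numeric sketch (the function names are ours):

```python
import math

def chernoff_upper(Q, tau):
    """Weaker upper-tail bound of Fact 50: (e / (1+tau))^((1+tau) * Q)."""
    return (math.e / (1 + tau)) ** ((1 + tau) * Q)

def binom_upper_tail(m, q, threshold):
    """Exact Pr[Binomial(m, q) >= threshold]."""
    return sum(math.comb(m, i) * q ** i * (1 - q) ** (m - i)
               for i in range(threshold, m + 1))
```

For instance, with $m = 1000$, $q = 0.01$ (so $Q = 10$) and $\tau = 2$, the exact tail $\Pr[\sum Y_i \ge 30]$ lies well below the bound $(e/3)^{30}$.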
