Learning $k$-Modal Distributions via Testing

A $k$-modal probability distribution over the discrete domain $\{1,...,n\}$ is one whose histogram has at most $k$ "peaks" and "valleys." Such distributions are natural generalizations of monotone ($k=0$) and unimodal ($k=1$) probability distribution…

Authors: Constantinos Daskalakis, Ilias Diakonikolas, Rocco A. Servedio

Learning $k$-Modal Distributions via Testing*

Constantinos Daskalakis† MIT costis@csail.mit.edu
Ilias Diakonikolas‡ University of Edinburgh ilias.d@ed.ac.uk
Rocco A. Servedio§ Columbia University rocco@cs.columbia.edu

September 16, 2014

Abstract

A $k$-modal probability distribution over the discrete domain $\{1, \ldots, n\}$ is one whose histogram has at most $k$ "peaks" and "valleys." Such distributions are natural generalizations of monotone ($k = 0$) and unimodal ($k = 1$) probability distributions, which have been intensively studied in probability theory and statistics.

In this paper we consider the problem of learning (i.e., performing density estimation of) an unknown $k$-modal distribution with respect to the $L_1$ distance. The learning algorithm is given access to independent samples drawn from an unknown $k$-modal distribution $p$, and it must output a hypothesis distribution $\hat{p}$ such that with high probability the total variation distance between $p$ and $\hat{p}$ is at most $\epsilon$. Our main goal is to obtain computationally efficient algorithms for this problem that use (close to) an information-theoretically optimal number of samples.

We give an efficient algorithm for this problem that runs in time $\mathrm{poly}(k, \log(n), 1/\epsilon)$. For $k \leq \tilde{O}(\log n)$, the number of samples used by our algorithm is very close (within an $\tilde{O}(\log(1/\epsilon))$ factor) to being information-theoretically optimal. Prior to this work, computationally efficient algorithms were known only for the cases $k = 0, 1$ [Bir87b, Bir97]. A novel feature of our approach is that our learning algorithm crucially uses a new algorithm for property testing of probability distributions as a key subroutine. The learning algorithm uses the property tester to efficiently decompose the $k$-modal distribution into $k$ (near-)monotone distributions, which are easier to learn.
1 Introduction

This paper considers a natural unsupervised learning problem involving $k$-modal distributions over the discrete domain $[n] = \{1, \ldots, n\}$. A distribution is $k$-modal if the plot of its probability density function (pdf) has at most $k$ "peaks" and "valleys" (see Section 2.1 for a precise definition). Such distributions arise both in theoretical (see e.g., [CKC83, Kem91, CT04]) and applied (see e.g., [Mur64, dTF90, FPP+98]) research; they naturally generalize the simpler classes of monotone ($k = 0$) and unimodal ($k = 1$) distributions that have been intensively studied in probability theory and statistics (see the discussion of related work below).

Our main aim in this paper is to give an efficient algorithm for learning an unknown $k$-modal distribution $p$ to total variation distance $\epsilon$, given access only to independent samples drawn from $p$. As described below, there is an information-theoretic lower bound of $\Omega(k \log(n/k)/\epsilon^3)$ samples for this learning problem, so an important goal for us is to obtain an algorithm whose sample complexity is as close as possible to this lower bound. An equally important goal is for our algorithm to be computationally efficient, i.e., to run in time polynomial in the size of its input sample.

* A preliminary version of this work appeared in the Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2012).
† Research supported by NSF CAREER award CCF-0953960 and by a Sloan Foundation Fellowship.
‡ Most of this research was done while the author was at UC Berkeley, supported by a Simons Postdoctoral Fellowship. Some of this work was done at Columbia University, supported by NSF grant CCF-0728736 and by an Alexander S. Onassis Foundation Fellowship.
§ Supported by NSF grants CNS-0716245, CCF-0915929, and CCF-1115703.
Our main contribution in this paper is a computationally efficient algorithm that has nearly optimal sample complexity for small (but super-constant) values of $k$.

1.1 Background and relation to previous work

There is a rich body of work in the statistics and probability literatures on estimating distributions under various kinds of "shape" or "order" restrictions. In particular, many researchers have studied the risk of different estimators for monotone ($k = 0$) and unimodal ($k = 1$) distributions; see for example the works of [Rao69, Weg70, Gro85, Bir87a, Bir87b, Bir97], among many others. These and related papers from the probability/statistics literature mostly deal with information-theoretic upper and lower bounds on the sample complexity of learning monotone and unimodal distributions. In contrast, a central goal of the current work is to obtain computationally efficient learning algorithms for larger values of $k$.

It should be noted that some of the works cited above do give efficient algorithms for the cases $k = 0$ and $k = 1$; in particular we mention the results of Birgé [Bir87b, Bir97], which give computationally efficient $O(\log(n)/\epsilon^3)$-sample algorithms for learning unknown monotone or unimodal distributions over $[n]$, respectively. (Birgé [Bir87a] also showed that this sample complexity is asymptotically optimal, as we discuss below; we describe the algorithm of [Bir87b] in more detail in Section 2.2, and indeed use it as an ingredient of our approach throughout this paper.) However, for these relatively simple $k = 0, 1$ classes of distributions the main challenge is in developing sample-efficient estimators, and the algorithmic aspects are typically rather straightforward (as is the case in [Bir87b]). In contrast, much more challenging and interesting algorithmic issues arise for the general values of $k$ which we consider here.
1.2 Our Results

Our main result is a highly efficient algorithm for learning an unknown $k$-modal distribution over $[n]$:

Theorem 1 Let $p$ be any unknown $k$-modal distribution over $[n]$. There is an algorithm that uses¹
$$\left( \frac{k \log(n/k)}{\epsilon^3} + \frac{k^2}{\epsilon^3} \cdot \log\frac{k}{\epsilon} \cdot \log\log\frac{k}{\epsilon} \right) \cdot \tilde{O}(\log(1/\delta))$$
samples from $p$, runs for $\mathrm{poly}(k, \log n, 1/\epsilon, \log(1/\delta))$ bit operations, and with probability $1 - \delta$ outputs a (succinct description of a) hypothesis distribution $\hat{p}$ over $[n]$ such that the total variation distance between $p$ and $\hat{p}$ is at most $\epsilon$.

As alluded to earlier, Birgé [Bir87a] gave a sample complexity lower bound for learning monotone distributions. The lower bound in [Bir87a] is stated for continuous distributions, but the arguments are easily adapted to the discrete case; [Bir87a] shows that (for $\epsilon \geq 1/n^{\Omega(1)}$)² any algorithm for learning an unknown monotone distribution over $[n]$ to total variation distance $\epsilon$ must use $\Omega(\log(n)/\epsilon^3)$ samples. By a simple construction which concatenates $k$ copies of the monotone lower bound construction over intervals of length $n/k$, using the monotone lower bound it is possible to show:

Proposition 1 Any algorithm for learning an unknown $k$-modal distribution over $[n]$ to variation distance $\epsilon$ (for $\epsilon \geq 1/n^{\Omega(1)}$) must use $\Omega(k \log(n/k)/\epsilon^3)$ samples.

¹ We write $\tilde{O}(\cdot)$ to hide factors which are poly-logarithmic in the argument to $\tilde{O}(\cdot)$; thus, for example, $\tilde{O}(a \log b)$ denotes a quantity which is $O((a \log b) \cdot (\log(a \log b))^c)$ for some absolute constant $c$.
² For $\epsilon$ sufficiently small, the generic upper bound of Fact 12, which says that any distribution over $[n]$ can be learned to variation distance $\epsilon$ using $O(n/\epsilon^2)$ samples, provides a better bound.
Thus our learning algorithm is nearly optimal in its sample complexity; more precisely, for $k \leq \tilde{O}(\log n)$ (and $\epsilon$ as bounded above), our sample complexity in Theorem 1 is asymptotically optimal up to a factor of $\tilde{O}(\log(1/\epsilon))$. Since each draw from a distribution over $[n]$ is a $\log(n)$-bit string, Proposition 1 implies that the running time of our algorithm is optimal up to polynomial factors. As far as we are aware, prior to this work no learning algorithm for $k$-modal distributions was known that simultaneously had $\mathrm{poly}(k, \log n)$ sample complexity and even running time $p(n)$ for a fixed polynomial $p(n)$ (where the exponent does not depend on $k$).

1.3 Our Approach

As mentioned in Section 1.1, Birgé gave a highly efficient algorithm for learning a monotone distribution in [Bir87b]. Since a $k$-modal distribution is simply a concatenation of $k + 1$ monotone distributions (first non-increasing, then non-decreasing, then non-increasing, etc.), it is natural to try to use Birgé's algorithm as a component of an algorithm for learning $k$-modal distributions, and indeed this is what we do. The most naive way to use Birgé's algorithm would be to guess all possible $\binom{n}{k}$ locations of the $k$ "modes" of $p$. While such an approach can be shown to have good sample complexity, the resulting $\Omega(n^k)$ running time is grossly inefficient. A "moderately naive" approach, which we analyze in Section 3.1, is to partition $[n]$ into roughly $k/\epsilon$ intervals each of weight roughly $\epsilon/k$, and run Birgé's algorithm separately on each such interval.
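The partitioning step of this "moderately naive" approach can be sketched in a few lines of Python. This is an illustrative simplification of our own: it partitions $[n]$ by empirical mass and ignores the complication of "heavy" points, which the full algorithm in Section 3.1 handles separately.

```python
from collections import Counter

def greedy_partition(samples, n, k, eps):
    """Greedily split [n] = {1, ..., n} into consecutive intervals,
    each of empirical mass roughly eps/k.  A simplified sketch that
    ignores "heavy" points, whose mass can exceed the target."""
    m = len(samples)
    counts = Counter(samples)        # missing keys count as 0
    target = eps / k
    intervals, start, mass = [], 1, 0.0
    for j in range(1, n + 1):
        mass += counts[j] / m
        if mass >= target or j == n:
            intervals.append((start, j))
            start, mass = j + 1, 0.0
    return intervals

# 100 equally heavy points, target mass 0.02: intervals of two points each.
samples = [i for i in range(1, 101) for _ in range(10)]
intervals = greedy_partition(samples, n=100, k=5, eps=0.1)
```

Birgé's monotone learner would then be run separately on each returned interval.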
Since the target distribution is $k$-modal, at most $k$ of the intervals can be non-monotone; Birgé's algorithm can be used to obtain an $\epsilon$-accurate hypothesis on each monotone interval, and even if it fails badly on the (at most) $k$ non-monotone intervals, the resulting total contribution towards the overall error from those failures is at most $O(\epsilon)$. This approach is much more efficient than the totally naive approach, giving running time polynomial in $k$, $\log n$, and $1/\epsilon$, but its sample complexity turns out to be polynomially worse than the $O(k \log(n)/\epsilon^3)$ that we are shooting for. (Roughly speaking, this is because the approach involves running Birgé's $O(\log(n)/\epsilon^3)$-sample algorithm $\Omega(k/\epsilon)$ times, so it uses at least $k \log(n)/\epsilon^4$ samples.)

Our main learning result is achieved by augmenting the "moderately naive" algorithm sketched above with a new property testing algorithm. Unlike a learning algorithm, a property testing algorithm for probability distributions need not output a high-accuracy hypothesis; instead, it has the more modest goal of successfully (with high probability) distinguishing between probability distributions that have a given property of interest, versus distributions that are far (in total variation distance) from every distribution that has the property. See [GGR98, Ron10, Gol10] for broad overviews of property testing. We give a property testing algorithm for the following problem: given samples from a distribution $p$ over $[n]$ which is promised to be $k$-modal, output "yes" (with high probability) if $p$ is monotone and "no" (with high probability) if $p$ is $\epsilon$-far in total variation distance from every monotone distribution. Crucially, our testing algorithm uses $O(k/\epsilon^2)$ samples, independent of $n$, for this problem.
Roughly speaking, by using this algorithm $O(k/\epsilon)$ times we are able to identify $k + 1$ intervals that (i) collectively contain almost all of $p$'s mass, and (ii) are each (close to) monotone and thus can be handled using Birgé's algorithm. Thus the overall sample complexity of our approach is (roughly) $O(k^2/\epsilon^3)$ (for the $O(k/\epsilon)$ runs of the tester) plus $O(k \log(n)/\epsilon^3)$ (for the $k$ runs of Birgé's algorithm), which gives Theorem 1 and is very close to optimal for $k$ not too large.

1.4 Discussion

Our learning algorithm highlights a novel way that property testing algorithms can be useful for learning. Much research has been done on understanding the relation between property testing algorithms and learning algorithms; see e.g., [GGR98, KR00] and the lengthy survey [Ron08]. As Goldreich has noted [Gol11], an often-invoked motivation for property testing is that (inexpensive) testing algorithms can be used as a "preliminary diagnostic" to determine whether it is appropriate to run a (more expensive) learning algorithm. In contrast, in this work we are using property testing rather differently, as an inexpensive way of decomposing a "complex" object (a $k$-modal distribution) which we do not a priori know how to learn, into a collection of "simpler" objects (monotone or near-monotone distributions) which can be learned using existing techniques. We are not aware of prior learning algorithms that successfully use property testers in this way; we believe that this high-level approach to designing learning algorithms, by using property testers to decompose "complex" objects into simpler objects that can be efficiently learned, may find future applications elsewhere.

2 Preliminaries

2.1 Notation and Problem Statement

For $n \in \mathbb{Z}^+$, denote by $[n]$ the set $\{1, \ldots, n\}$; for $i, j \in \mathbb{Z}^+$, $i \leq j$, denote by $[i, j]$ the set $\{i, i+1, \ldots, j\}$.
We write $v(i)$ to denote the $i$-th element of vector $v \in \mathbb{R}^n$. For $v = (v(1), \ldots, v(n)) \in \mathbb{R}^n$, denote by $\|v\|_1 = \sum_{i=1}^n |v(i)|$ its $L_1$-norm.

We consider discrete probability distributions over $[n]$, which are functions $p : [n] \to [0, 1]$ such that $\sum_{i=1}^n p(i) = 1$. For $S \subseteq [n]$ we write $p(S)$ to denote $\sum_{i \in S} p(i)$. For $S \subseteq [n]$, we write $p_S$ to denote the conditional distribution over $S$ that is induced by $p$. We use the notation $P$ for the cumulative distribution function (cdf) corresponding to $p$, i.e., $P : [n] \to [0, 1]$ is defined by $P(j) = \sum_{i=1}^j p(i)$.

A distribution $p$ over $[n]$ is non-increasing (resp. non-decreasing) if $p(i+1) \leq p(i)$ (resp. $p(i+1) \geq p(i)$) for all $i \in [n-1]$; $p$ is monotone if it is either non-increasing or non-decreasing. We call a nonempty interval $I = [a, b] \subseteq [2, n-1]$ a max-interval of $p$ if $p(i) = c$ for all $i \in I$ and $\max\{p(a-1), p(b+1)\} < c$; in this case, we say that the point $a$ is a left max point of $p$. Analogously, a min-interval of $p$ is an interval $I = [a, b] \subseteq [2, n-1]$ with $p(i) = c$ for all $i \in I$ and $\min\{p(a-1), p(b+1)\} > c$; the point $a$ is called a left min point of $p$. If $I = [a, b]$ is either a max-interval or a min-interval (it cannot be both), we say that $I$ is an extreme-interval of $p$, and $a$ is called a left extreme point of $p$. Note that any distribution uniquely defines a collection of extreme-intervals (hence, left extreme points). We say that $p$ is $k$-modal if it has at most $k$ extreme-intervals. We write $\mathcal{D}_n$ (resp. $\mathcal{M}^k_n$) to denote the set of all distributions (resp. $k$-modal distributions) over $[n]$.

Let $p, q$ be distributions over $[n]$ with corresponding cdfs $P, Q$. The total variation distance between $p$ and $q$ is $d_{TV}(p, q) := \max_{S \subseteq [n]} |p(S) - q(S)| = (1/2) \cdot \|p - q\|_1$.
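To make the definitions above concrete, here is a small Python sketch of our own (the helper names are not from the paper). It operates on an explicitly given probability vector, whereas the algorithms of this paper only receive samples; it counts a distribution's extreme-intervals, so that $p$ is $k$-modal iff the count is at most $k$, and computes total variation distance.

```python
def num_extreme_intervals(p):
    """Count the extreme-intervals (max- and min-intervals) of a
    distribution p over [n], given as a list p[0..n-1] of probabilities;
    p is k-modal iff this count is at most k."""
    # Collapse runs of equal consecutive values into one representative.
    runs = [p[0]]
    for x in p[1:]:
        if x != runs[-1]:
            runs.append(x)
    # An interior run is an extreme-interval iff it is a strict local max
    # (greater than both neighboring runs) or a strict local min.
    count = 0
    for i in range(1, len(runs) - 1):
        if (runs[i] > runs[i - 1]) == (runs[i] > runs[i + 1]):
            count += 1
    return count

def tv_distance(p, q):
    """Total variation distance d_TV(p, q) = (1/2) * ||p - q||_1."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# A monotone distribution is 0-modal; a single peak makes it 1-modal.
assert num_extreme_intervals([0.4, 0.3, 0.2, 0.1]) == 0
assert num_extreme_intervals([0.1, 0.2, 0.4, 0.2, 0.1]) == 1
```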
The Kolmogorov distance between $p$ and $q$ is defined as $d_K(p, q) := \max_{j \in [n]} |P(j) - Q(j)|$. Note that $d_K(p, q) \leq d_{TV}(p, q)$.

We will also need a more general distance measure that captures the above two metrics as special cases. Fix a family of subsets $\mathcal{A}$ over $[n]$. We define the $\mathcal{A}$-distance between $p$ and $q$ by $\|p - q\|_{\mathcal{A}} := \max_{A \in \mathcal{A}} |p(A) - q(A)|$. (Note that if $\mathcal{A} = 2^{[n]}$, the powerset of $[n]$, then the $\mathcal{A}$-distance is identified with the total variation distance, while when $\mathcal{A} = \{[1, j],\ j \in [n]\}$ it is identified with the Kolmogorov distance.) Also recall that the VC-dimension of $\mathcal{A}$ is the maximum size of a subset $X \subseteq [n]$ that is shattered by $\mathcal{A}$ (a set $X$ is shattered by $\mathcal{A}$ if for every $Y \subseteq X$ some $A \in \mathcal{A}$ satisfies $A \cap X = Y$).

Learning $k$-modal Distributions. Given independent samples from an unknown $k$-modal distribution $p \in \mathcal{M}^k_n$ and $\epsilon > 0$, the goal is to output a hypothesis distribution $h$ such that with probability $1 - \delta$ we have $d_{TV}(p, h) \leq \epsilon$. We say that such an algorithm $A$ learns $p$ to accuracy $\epsilon$ and confidence $\delta$. The parameters of interest are the number of samples and the running time required by the algorithm.

2.2 Basic Tools

We recall some useful tools from probability theory.

The VC inequality. Given $m$ independent samples $s_1, \ldots, s_m$ drawn from $p : [n] \to [0, 1]$, the empirical distribution $\hat{p}_m : [n] \to [0, 1]$ is defined as follows: for all $i \in [n]$, $\hat{p}_m(i) = |\{j \in [m] \mid s_j = i\}|/m$. Fix a family of subsets $\mathcal{A}$ over $[n]$ of VC-dimension $d$. The VC inequality states that for $m = \Omega(d/\epsilon^2)$, with probability $9/10$ the empirical distribution $\hat{p}_m$ will be $\epsilon$-close to $p$ in $\mathcal{A}$-distance. This sample bound is asymptotically optimal.

Theorem 2 (VC inequality, [DL01, p. 31]) Let $\hat{p}_m$ be an empirical distribution of $m$ samples from $p$. Let $\mathcal{A}$ be a family of subsets of VC-dimension $d$.
Then $\mathbb{E}[\|p - \hat{p}_m\|_{\mathcal{A}}] \leq O(\sqrt{d/m})$.

Uniform convergence. We will also use the following uniform convergence bound:

Theorem 3 ([DL01, p. 17]) Let $\mathcal{A}$ be a family of subsets over $[n]$, and $\hat{p}_m$ be an empirical distribution of $m$ samples from $p$. Let $X$ be the random variable $\|p - \hat{p}_m\|_{\mathcal{A}}$. Then we have $\Pr[X - \mathbb{E}[X] > \eta] \leq e^{-2m\eta^2}$.

Our second tool, due to Birgé [Bir87b], provides a sample-optimal and computationally efficient algorithm to learn monotone distributions to $\epsilon$-accuracy in total variation distance. Before we state the relevant theorem, we need a definition. We say that a distribution $p$ is $\delta$-close to being non-increasing (resp. non-decreasing) if there exists a non-increasing (resp. non-decreasing) distribution $q$ such that $d_{TV}(p, q) \leq \delta$. We are now ready to state Birgé's result:

Theorem 4 ([Bir87b], Theorem 1) (semi-agnostic learner) There is an algorithm $L^{\downarrow}$ with the following performance guarantee: Given $m$ independent samples from a distribution $p$ over $[n]$ which is $\mathrm{opt}$-close to being non-increasing, $L^{\downarrow}$ performs $\tilde{O}(m \cdot \log n + m^{1/3} \cdot (\log n)^{5/3})$ bit operations and outputs a (succinct description of a) hypothesis distribution $\tilde{p}$ over $[n]$ that satisfies
$$\mathbb{E}[d_{TV}(\tilde{p}, p)] \leq 2 \cdot \mathrm{opt} + O\left(\left(\log n/(m+1)\right)^{1/3}\right).$$

The aforementioned algorithm partitions the domain $[n]$ into $O(m^{1/3} \cdot (\log n)^{2/3})$ intervals and outputs a hypothesis distribution that is uniform within each of these intervals. By taking $m = \Omega(\log n/\epsilon^3)$, one obtains a hypothesis such that $\mathbb{E}[d_{TV}(\tilde{p}, p)] \leq 2 \cdot \mathrm{opt} + \epsilon$.
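The structure of Birgé's learner is simple to sketch in Python. The rendition below is illustrative only: the interval-growth parameter is our own choice for readability and does not match the exact decomposition of [Bir87b]; what it shares with the real algorithm is the oblivious partition into geometrically growing intervals, with the empirical distribution flattened to be uniform within each interval.

```python
def birge_intervals(n, eps):
    """Oblivious Birgé-style decomposition of [n]: interval lengths grow
    geometrically by a factor (1 + eps), giving O(log(n)/eps) intervals."""
    intervals, start, length = [], 1, 1.0
    while start <= n:
        end = min(n, start + int(length) - 1)
        intervals.append((start, end))
        start = end + 1
        length *= 1 + eps
    return intervals

def flattened_hypothesis(samples, n, eps):
    """Learn a (close to) non-increasing distribution over [n] by making
    the empirical distribution uniform within each Birgé interval."""
    m = len(samples)
    h = [0.0] * (n + 1)               # h[i] holds the mass of point i, 1 <= i <= n
    for a, b in birge_intervals(n, eps):
        mass = sum(1 for s in samples if a <= s <= b) / m
        for i in range(a, b + 1):
            h[i] = mass / (b - a + 1)
    return h[1:]
```

The point of the geometric growth is that a non-increasing density cannot vary much across a short interval near the left end or across a long interval in its low-mass tail, so the flattening costs little in total variation.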
We stress that Birgé's algorithm for learning non-increasing distributions [Bir87b] is in fact "semi-agnostic," in the sense that it also learns distributions that are close to being non-increasing; this robustness will be crucial for us later (since in our final algorithm we will use Birgé's algorithm on distributions identified by our tester, which are close to monotone but not necessarily perfectly monotone). This semi-agnostic property is not explicitly stated in [Bir87b], but it can be shown to follow easily from his results. We show how the semi-agnostic property follows from Birgé's results in Appendix A. Let $L^{\uparrow}$ denote the corresponding semi-agnostic algorithm for learning non-decreasing distributions.

Our final tool is a routine to do hypothesis testing, i.e., to select a high-accuracy hypothesis distribution from a collection of hypothesis distributions, one of which has high accuracy. The need for such a routine arises in several places; in some cases we know that a distribution is monotone, but do not know whether it is non-increasing or non-decreasing. In this case, we can run both algorithms $L^{\uparrow}$ and $L^{\downarrow}$ and then choose a good hypothesis using hypothesis testing. Another need for hypothesis testing is to "boost confidence" that a learning algorithm generates a high-accuracy hypothesis. Our initial version of the algorithm for Theorem 1 generates an $\epsilon$-accurate hypothesis with probability at least $9/10$; by running it $O(\log(1/\delta))$ times using a hypothesis testing routine, it is possible to identify an $O(\epsilon)$-accurate hypothesis with probability $1 - \delta$.
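A hypothesis-testing routine of this kind can be sketched as a Scheffé-style comparison: estimate the unknown $p$ only on the single set where the two candidates disagree most, and keep the candidate whose mass there better matches the estimate. The thresholds and constants below are illustrative choices of ours, not the tuned ones from [DDS12].

```python
import math

def choose_hypothesis(draw, h1, h2, eps, delta):
    """Scheffé-style selection between two hypothesis distributions
    h1, h2 (lists of probabilities over [n]).  `draw()` returns one
    sample from the unknown p as an element of {1, ..., n}.  Constants
    are illustrative, not those of [DDS12]."""
    n = len(h1)
    # The Scheffé set W = {x : h1(x) > h2(x)} witnesses d_TV(h1, h2).
    W = {i for i in range(n) if h1[i] > h2[i]}
    w1, w2 = sum(h1[i] for i in W), sum(h2[i] for i in W)
    if w1 - w2 <= 5 * eps:
        return h1           # h1 and h2 are already close; either will do
    # Estimate p(W) empirically with O(log(1/delta)/eps^2) draws.
    m = int(32 * math.log(2 / delta) / eps ** 2) + 1
    tau = sum(1 for _ in range(m) if (draw() - 1) in W) / m
    # Keep the candidate whose mass on W better matches p(W).
    return h1 if abs(w1 - tau) <= abs(w2 - tau) else h2
```

When neither candidate is $\epsilon$-accurate the routine may return either hypothesis; a guarantee is only required when one of the two candidates is good.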
Routines of the sort that we require have been given in e.g., [DL01] and [DDS12]; we use the following theorem from [DDS12]:

Theorem 5 There is an algorithm Choose-Hypothesis$^p(h_1, h_2, \epsilon', \delta')$ which is given sample access to $p$, two hypothesis distributions $h_1, h_2$ for $p$, an accuracy parameter $\epsilon'$, and a confidence parameter $\delta'$. It makes $m = O(\log(1/\delta')/\epsilon'^2)$ draws from $p$ and returns a hypothesis $h \in \{h_1, h_2\}$. If one of $h_1, h_2$ has $d_{TV}(h_i, p) \leq \epsilon'$, then with probability $1 - \delta'$ the hypothesis $h$ that Choose-Hypothesis returns has $d_{TV}(h, p) \leq 6\epsilon'$.

For the sake of completeness, we describe and analyze the Choose-Hypothesis algorithm in Appendix B.

3 Learning $k$-modal Distributions

In this section, we present our main result: a nearly sample-optimal and computationally efficient algorithm to learn an unknown $k$-modal distribution. In Section 3.1 we present a simple learning algorithm with a suboptimal sample complexity. In Section 3.2 we present our main result, which involves a property testing algorithm as a subroutine.

3.1 Warm-up: A simple learning algorithm

In this subsection, we give an algorithm that runs in time $\mathrm{poly}(k, \log n, 1/\epsilon, \log(1/\delta))$ and learns an unknown $k$-modal distribution to accuracy $\epsilon$ and confidence $\delta$. The sample complexity of the algorithm is essentially optimal as a function of $k$ (up to a logarithmic factor), but suboptimal as a function of $\epsilon$, by a polynomial factor.

In the following pseudocode we give a detailed description of the algorithm Learn-kmodal-simple; the algorithm outputs an $\epsilon$-accurate hypothesis with confidence $9/10$ (see Theorem 6). We explain how to boost the confidence to $1 - \delta$ after the proof of the theorem.

The algorithm Learn-kmodal-simple works as follows: We start by partitioning the domain $[n]$ into consecutive intervals of mass "approximately $\epsilon/k$."
To do this, we draw $\Theta(k/\epsilon^3)$ samples from $p$ and greedily partition the domain into disjoint intervals of empirical mass roughly $\epsilon/k$. (Some care is needed in this step, since there may be "heavy" points in the support of the distribution; however, we gloss over this technical issue for the sake of this intuitive explanation.) Note that we do not have a guarantee that each such interval will have true probability mass $\Theta(\epsilon/k)$. In fact, it may well be the case that the additive error $\delta$ between the true probability mass of an interval and its empirical mass (roughly $\epsilon/k$) is $\delta = \omega(\epsilon/k)$. The error guarantee of the partitioning is more "global," in that the sum of these errors across all such intervals is at most $\epsilon$. In particular, as a simple corollary of the VC inequality, we can deduce the following statement that will be used several times throughout the paper:

Fact 2 Let $p$ be any distribution over $[n]$ and $\hat{p}_m$ be the empirical distribution of $m$ samples from $p$. For $m = \Omega\left((d/\epsilon^2) \log(1/\delta)\right)$, with probability at least $1 - \delta$, for any collection $\mathcal{J}$ of (at most) $d$ disjoint intervals in $[n]$, we have that $\sum_{J \in \mathcal{J}} |p(J) - \hat{p}_m(J)| \leq \epsilon$.

Proof: Note that
$$\sum_{J \in \mathcal{J}} |p(J) - \hat{p}_m(J)| = 2\,|p(A) - \hat{p}_m(A)|, \qquad (1)$$
where $A = \{J \in \mathcal{J} : p(J) > \hat{p}_m(J)\}$. Since $\mathcal{J}$ is a collection of at most $d$ intervals, it is clear that $A$ is a union of at most $d$ intervals. If $\mathcal{A}_d$ is the family of all unions of at most $d$ intervals, then the right hand side of (1) is at most $2\|p - \hat{p}_m\|_{\mathcal{A}_d}$. Since the VC-dimension of $\mathcal{A}_d$ is $2d$, Theorem 2 implies that the quantity (1) has expected value at most $\epsilon/2$. The claim now follows by an application of Theorem 3 with $\eta = \epsilon/2$.

If this step is successful, we have partitioned the domain into a set of $O(k/\epsilon)$ consecutive intervals of probability mass "roughly $\epsilon/k$."
The next step is to apply Birgé's monotone learning algorithm to each interval. A caveat comes from the fact that not all such intervals are guaranteed to be monotone (or even close to being monotone). However, since our input distribution is assumed to be $k$-modal, all but (at most) $k$ of these intervals are monotone. Call a non-monotone interval "bad." Since all intervals have empirical probability mass at most $\epsilon/k$ and there are at most $k$ bad intervals, it follows from Fact 2 that these intervals contribute at most $O(\epsilon)$ to the total mass. So even though Birgé's algorithm gives no guarantees for bad intervals, these intervals do not affect the error by more than $O(\epsilon)$.

Let us now focus on the monotone intervals. For each such interval, we do not know if it is monotone increasing or monotone decreasing. To overcome this difficulty, we run both monotone algorithms $L^{\downarrow}$ and $L^{\uparrow}$ for each interval and then use hypothesis testing to choose the correct candidate distribution. Also, note that since we have $O(k/\epsilon)$ intervals, we need to run each instance of both the monotone learning algorithms and the hypothesis testing algorithm with confidence $1 - O(\epsilon/k)$, so that we can guarantee that the overall algorithm has confidence $9/10$. Note that Theorem 4 and Markov's inequality imply that if we draw $\Omega(\log n/\epsilon^3)$ samples from a non-increasing distribution $p$, the hypothesis $\tilde{p}$ output by $L^{\downarrow}$ satisfies $d_{TV}(\tilde{p}, p) \leq \epsilon$ with probability $9/10$. We can boost the confidence to $1 - \delta$ with an overhead of $O(\log(1/\delta) \log\log(1/\delta))$ in the sample complexity:

Fact 3 Let $p$ be a non-increasing distribution over $[n]$.
There is an algorithm $L^{\downarrow}_{\delta}$ with the following performance guarantee: Given $(\log n/\epsilon^3) \cdot \tilde{O}(\log(1/\delta))$ samples from $p$, $L^{\downarrow}_{\delta}$ performs $\tilde{O}\left((\log^2 n/\epsilon^3) \cdot \log^2(1/\delta)\right)$ bit operations and outputs a (succinct description of a) hypothesis distribution $\tilde{p}$ over $[n]$ that satisfies $d_{TV}(\tilde{p}, p) \leq \epsilon$ with probability at least $1 - \delta$.

The algorithm $L^{\downarrow}_{\delta}$ runs $L^{\downarrow}$ $O(\log(1/\delta))$ times and performs a tournament among the candidate hypotheses using Choose-Hypothesis. Let $L^{\uparrow}_{\delta}$ denote the corresponding algorithm for learning non-decreasing distributions with confidence $\delta$. We postpone further details on these algorithms to Appendix C.

Theorem 6 The algorithm Learn-kmodal-simple uses
$$\frac{k \log n}{\epsilon^4} \cdot \tilde{O}(\log(k/\epsilon))$$
samples, performs $\mathrm{poly}(k, \log n, 1/\epsilon)$ bit operations, and learns a $k$-modal distribution to accuracy $O(\epsilon)$ with probability $9/10$.

Learn-kmodal-simple
Inputs: $\epsilon > 0$; sample access to $k$-modal distribution $p$ over $[n]$

1. Fix $d := \lceil 20k/\epsilon \rceil$. Draw $r = \Theta(d/\epsilon^2)$ samples from $p$ and let $\hat{p}$ denote the resulting empirical distribution.
2. Greedily partition the domain $[n]$ into $\ell$ atomic intervals $\mathcal{I} := \{I_i\}_{i=1}^{\ell}$ as follows:
   (a) $I_1 := [1, j_1]$, where $j_1 := \min\{j \in [n] \mid \hat{p}([1, j]) \geq \epsilon/(10k)\}$.
   (b) For $i \geq 1$, if $\cup_{j=1}^{i} I_j = [1, j_i]$, then $I_{i+1} := [j_i + 1, j_{i+1}]$, where $j_{i+1}$ is defined as follows:
       - If $\hat{p}([j_i + 1, n]) \geq \epsilon/(10k)$, then $j_{i+1} := \min\{j \in [n] \mid \hat{p}([j_i + 1, j]) \geq \epsilon/(10k)\}$.
       - Otherwise, $j_{i+1} := n$.
3. Construct a set of $\ell$ light intervals $\mathcal{I}' := \{I'_i\}_{i=1}^{\ell}$ and a set $\{b_i\}_{i=1}^{t}$ of $t \leq \ell$ heavy points as follows:
   (a) For each interval $I_i = [a, b] \in \mathcal{I}$, if $\hat{p}(I_i) \geq \epsilon/(5k)$, define $I'_i := [a, b-1]$ and make $b$ a heavy point. (Note that it is possible to have $I'_i = \emptyset$.)
   (b) Otherwise, define $I'_i := I_i$.
   Fix $\delta' := \epsilon/(500k)$.
4. Draw $m = (k/\epsilon^4) \cdot \log(n) \cdot \tilde{\Theta}(\log(1/\delta'))$ samples $\mathbf{s} = \{s_i\}_{i=1}^{m}$ from $p$. For each light interval $I'_i$, $i \in [\ell]$, run both $L^{\downarrow}_{\delta'}$ and $L^{\uparrow}_{\delta'}$ on the conditional distribution $p_{I'_i}$ using the samples in $\mathbf{s} \cap I'_i$. Let $\tilde{p}^{\downarrow}_{I'_i}, \tilde{p}^{\uparrow}_{I'_i}$ be the corresponding conditional hypothesis distributions.
5. Draw $m' = \Theta((k/\epsilon^4) \cdot \log(1/\delta'))$ samples $\mathbf{s}' = \{s'_i\}_{i=1}^{m'}$ from $p$. For each light interval $I'_i$, $i \in [\ell]$, run Choose-Hypothesis$^p(\tilde{p}^{\uparrow}_{I'_i}, \tilde{p}^{\downarrow}_{I'_i}, \epsilon, \delta')$ using the samples in $\mathbf{s}' \cap I'_i$. Denote by $\tilde{p}_{I'_i}$ the returned conditional distribution on $I'_i$.
6. Output the hypothesis $h = \sum_{j=1}^{\ell} \hat{p}(I'_j) \cdot \tilde{p}_{I'_j} + \sum_{j=1}^{t} \hat{p}(b_j) \cdot \mathbf{1}_{b_j}$.

Proof: First, it is easy to see that the algorithm has the claimed sample complexity. Indeed, the algorithm draws a total of $r + m + m'$ samples in Steps 1, 4 and 5. The running time is also easy to analyze, as it is easy to see that every step can be performed in polynomial time (in fact, nearly linear time) in the sample size.

We need to show that with probability $9/10$ (over its random samples), algorithm Learn-kmodal-simple outputs a hypothesis $h$ such that $d_{TV}(h, p) \leq O(\epsilon)$.

Since $r = \Theta(d/\epsilon^2)$ samples are drawn in Step 1, Fact 2 implies that with probability of failure at most $1/100$, for each family $\mathcal{J}$ of at most $d$ disjoint intervals from $[n]$, we have
$$\sum_{J \in \mathcal{J}} |p(J) - \hat{p}_m(J)| \leq \epsilon. \qquad (2)$$
For the rest of the analysis of Learn-kmodal-simple we condition on this "good" event.

Since every atomic interval $I \in \mathcal{I}$ has $\hat{p}(I) \geq \epsilon/(10k)$ (except potentially the rightmost one), it follows that the number $\ell$ of atomic intervals constructed in Step 2 satisfies $\ell \leq 10 \cdot (k/\epsilon)$. By the construction in Steps 2 and 3, every light interval $I' \in \mathcal{I}'$ has $\hat{p}(I') \leq \epsilon/(5k)$. Note also that every heavy point $b$ has $\hat{p}(b) \geq \epsilon/(10k)$ and the number of heavy points $t$ is at most $\ell$.
Since the light intervals and heavy points form a partition of $[n]$, we can write
$$p = \sum_{j=1}^{\ell} p(I'_j) \cdot p_{I'_j} + \sum_{j=1}^{t} p(b_j) \cdot \mathbf{1}_{b_j}.$$
Therefore, we can bound the variation distance as follows:
$$d_{TV}(h, p) \leq \sum_{j=1}^{\ell} |\hat{p}(I'_j) - p(I'_j)| + \sum_{j=1}^{t} |\hat{p}(b_j) - p(b_j)| + \sum_{j=1}^{\ell} p(I'_j) \cdot d_{TV}(\tilde{p}_{I'_j}, p_{I'_j}). \qquad (3)$$
Since $\ell + t \leq d$, by Fact 2 and our conditioning, the contribution of the first two terms to the sum is upper bounded by $\epsilon$. We proceed to bound the contribution of the third term.

Since $p$ is $k$-modal, at most $k$ of the light intervals $I'_j$ are not monotone for $p$. Call these intervals "bad" and denote by $\mathcal{B}$ the set of bad intervals. Even though we have not identified the bad intervals, we know that all such intervals are light. Therefore, their total empirical probability mass (under $\hat{p}_m$) is at most $k \cdot \epsilon/(5k) = \epsilon/5$, i.e., $\sum_{I \in \mathcal{B}} \hat{p}(I) \leq \epsilon/5$. By our conditioning (see Equation (2)) and the triangle inequality it follows that
$$\left| \sum_{I \in \mathcal{B}} p(I) - \sum_{I \in \mathcal{B}} \hat{p}(I) \right| \leq \sum_{I \in \mathcal{B}} |p(I) - \hat{p}(I)| \leq \epsilon,$$
which implies that the true probability mass of the bad intervals is at most $\epsilon/5 + \epsilon = 6\epsilon/5$. Hence, the contribution of bad intervals to the third term of the right hand side of (3) is at most $O(\epsilon)$. (Note that this statement holds true independent of the samples $\mathbf{s}$ we draw in Step 4.)

It remains to bound the contribution of monotone intervals to the third term. Let $\ell' \leq \ell$ be the number of monotone light intervals and assume after renaming the indices that they are $\tilde{\mathcal{I}} := \{I'_j\}_{j=1}^{\ell'}$. To bound from above the right hand side of (3), it suffices to show that with probability at least $19/20$ (over the samples drawn in Steps 4-5) it holds that
$$\sum_{j=1}^{\ell'} p(I'_j) \cdot d_{TV}(\tilde{p}_{I'_j}, p_{I'_j}) = O(\epsilon). \qquad (4)$$
To prove (4) we partition the set $\tilde{\mathcal{I}}$ into three subsets based on their probability mass under $p$.
Note that we do not have a lower bound on the probability mass of intervals in $\widetilde{\mathcal{I}}$. Moreover, by our conditioning (see Equation (2)) and the fact that each interval in $\widetilde{\mathcal{I}}$ is light, it follows that any $I \in \widetilde{\mathcal{I}}$ has $p(I) \le \widehat{p}(I) + \epsilon \le 2\epsilon$. We define the partition of $\widetilde{\mathcal{I}}$ into the following three sets: $\widetilde{\mathcal{I}}_1 = \{I \in \widetilde{\mathcal{I}} : p(I) \le \epsilon^2/(20k)\}$, $\widetilde{\mathcal{I}}_2 = \{I \in \widetilde{\mathcal{I}} : \epsilon^2/(20k) < p(I) \le \epsilon/k\}$ and $\widetilde{\mathcal{I}}_3 = \{I \in \widetilde{\mathcal{I}} : \epsilon/k < p(I) \le 2\epsilon\}$. We bound the contribution of each subset in turn.

It is clear that the contribution of $\widetilde{\mathcal{I}}_1$ to (4) is at most
$$\sum_{I \in \widetilde{\mathcal{I}}_1} p(I) \le |\widetilde{\mathcal{I}}_1| \cdot \epsilon^2/(20k) \le \ell' \cdot \epsilon^2/(20k) \le \ell \cdot \epsilon^2/(20k) \le \epsilon/2.$$
To bound from above the contribution of $\widetilde{\mathcal{I}}_2$ to (4), we partition $\widetilde{\mathcal{I}}_2$ into $g_2 = \lceil \log_2(20/\epsilon) \rceil = \Theta(\log(1/\epsilon))$ groups. For $i \in [g_2]$, the set $(\widetilde{\mathcal{I}}_2)_i$ consists of those intervals in $\widetilde{\mathcal{I}}_2$ that have mass under $p$ in the range $\big( 2^{-i} \cdot (\epsilon/k),\, 2^{-i+1} \cdot (\epsilon/k) \big]$. The following statement establishes the variation-distance closeness between the conditional hypothesis for an interval in the $i$-th group $(\widetilde{\mathcal{I}}_2)_i$ and the corresponding conditional distribution.

Claim 4 With probability at least $19/20$ (over the samples $\mathbf{s}, \mathbf{s}'$), for each $i \in [g_2]$ and each monotone light interval $I'_j \in (\widetilde{\mathcal{I}}_2)_i$ we have $d_{TV}(\widetilde{p}_{I'_j}, p_{I'_j}) = O(2^{i/3} \cdot \epsilon)$.

Proof: Since in Step 4 we draw $m$ samples, and each interval $I'_j \in (\widetilde{\mathcal{I}}_2)_i$ has $p(I'_j) \in \big( 2^{-i} \cdot (\epsilon/k),\, 2^{-i+1} \cdot (\epsilon/k) \big]$, a standard coupon collector argument [NS60] tells us that with probability $99/100$, for each $(i,j)$ pair, the interval $I'_j$ will get at least $2^{-i} \cdot (\log(n)/\epsilon^3) \cdot \widetilde{\Omega}(\log(1/\delta'))$ many samples. Let us rewrite this as $(\log(n)/(2^{i/3} \cdot \epsilon)^3) \cdot \widetilde{\Omega}(\log(1/\delta'))$ samples. We condition on this event. Fix an interval $I'_j \in (\widetilde{\mathcal{I}}_2)_i$.
We first show that with failure probability at most $\epsilon/(500k)$ after Step 4, either $\widetilde{p}^{\downarrow}_{I'_j}$ or $\widetilde{p}^{\uparrow}_{I'_j}$ will be $(2^{i/3} \cdot \epsilon)$-accurate. Indeed, by Fact 3 and taking into account the number of samples that landed in $I'_j$, with probability $1 - \epsilon/(500k)$ over $\mathbf{s}$ we have $d_{TV}(\widetilde{p}^{\alpha_i}_{I'_j}, p_{I'_j}) \le 2^{i/3} \epsilon$, where $\alpha_i = \downarrow$ if $p_{I'_j}$ is non-increasing and $\alpha_i = \uparrow$ otherwise. By a union bound over all (at most $\ell$ many) $(i,j)$ pairs, it follows that with probability at least $49/50$, for each interval $I'_j \in (\widetilde{\mathcal{I}}_2)_i$ one of the two candidate hypothesis distributions is $(2^{i/3} \epsilon)$-accurate. We condition on this event.

Now consider Step 5. Since this step draws $m'$ samples, and each interval $I'_j \in (\widetilde{\mathcal{I}}_2)_i$ has $p(I'_j) \in \big( 2^{-i} \cdot (\epsilon/k),\, 2^{-i+1} \cdot (\epsilon/k) \big]$, as before a standard coupon collector argument [NS60] tells us that with probability $99/100$, for each $(i,j)$ pair, the interval $I'_j$ will get at least $(1/(2^{i/3} \cdot \epsilon)^3) \cdot \widetilde{\Omega}(\log(1/\delta'))$ many samples in this step; we henceforth assume that this is indeed the case for each $I'_j$. Thus, Theorem 5 applied to each fixed interval $I'_j$ implies that the algorithm Choose-Hypothesis will output a hypothesis that is $6 \cdot (2^{i/3} \epsilon)$-close to $p_{I'_j}$ with probability $1 - \epsilon/(500k)$. By a union bound, it follows that with probability at least $49/50$, the above condition holds for all monotone light intervals under consideration. Therefore, except with failure probability at most $1/20$, the statement of the claim holds.

Given the claim, we exploit the fact that for intervals $I'_j$ such that $p(I'_j)$ is small we can afford a larger error in total variation distance. More precisely, let $c_i = |(\widetilde{\mathcal{I}}_2)_i|$, the number of intervals in $(\widetilde{\mathcal{I}}_2)_i$, and note that $\sum_{i=1}^{g_2} c_i \le \ell$.
Hence, we can bound the contribution of $\widetilde{\mathcal{I}}_2$ to (4) by
$$\sum_{i=1}^{g_2} c_i \cdot (\epsilon/k) \cdot 2^{-i+1} \cdot O(2^{i/3} \cdot \epsilon) \le O(1) \cdot (2\epsilon^2/k) \cdot \sum_{i=1}^{g_2} c_i \cdot 2^{-2i/3}.$$
Since $\sum_{i=1}^{g_2} c_i = |\widetilde{\mathcal{I}}_2| \le \ell$, the above expression is maximized for $c_1 = |\widetilde{\mathcal{I}}_2| \le \ell$ and $c_i = 0$ for $i > 1$, and the maximum value is at most $O(1) \cdot (\epsilon^2/k) \cdot \ell = O(\epsilon)$.

Bounding the contribution of $\widetilde{\mathcal{I}}_3$ to (4) is very similar. We partition $\widetilde{\mathcal{I}}_3$ into $g_3 = \lceil \log_2 k \rceil + 1 = \Theta(\log(k))$ groups. For $i \in [g_3]$, the set $(\widetilde{\mathcal{I}}_3)_i$ consists of those intervals in $\widetilde{\mathcal{I}}_3$ that have mass under $p$ in the range $\big( 2^{-i+1} \cdot \epsilon,\, 2^{-i+2} \cdot \epsilon \big]$. The following statement is identical to Claim 4, albeit with different parameters:

Claim 5 With probability at least $19/20$ (over the samples $\mathbf{s}, \mathbf{s}'$), for each $i \in [g_3]$ and each monotone light interval $I'_j \in (\widetilde{\mathcal{I}}_3)_i$, we have $d_{TV}(\widetilde{p}_{I'_j}, p_{I'_j}) = O(2^{i/3} \cdot \epsilon \cdot k^{-1/3})$.

Let $f_i = |(\widetilde{\mathcal{I}}_3)_i|$, the number of intervals in $(\widetilde{\mathcal{I}}_3)_i$. Each interval $I \in (\widetilde{\mathcal{I}}_3)_i$ has $p(I) \in (d_i, 2d_i]$, where $d_i := 2^{-i+1} \cdot \epsilon$. We therefore have
$$\sum_{i=1}^{g_3} d_i f_i \le p(\widetilde{\mathcal{I}}_3) \le 1. \quad (5)$$
We can now bound from above the contribution of $\widetilde{\mathcal{I}}_3$ to (4) by
$$\sum_{i=1}^{g_3} 2 d_i f_i \cdot O(2^{i/3} \cdot \epsilon \cdot k^{-1/3}) \le O(1) \cdot (\epsilon/k^{1/3}) \cdot \sum_{i=1}^{g_3} d_i f_i \cdot 2^{i/3}.$$
By (5) it follows that the above expression is maximized for $d_{g_3} f_{g_3} = 1$ and $d_i f_i = 0$ for $i < g_3$. The maximum value is at most $O(1) \cdot (\epsilon/k^{1/3}) \cdot 2^{g_3/3} = O(\epsilon)$, where the final equality uses the fact that $2^{g_3} \le 4k$, which follows from our definition of $g_3$. This proves (4) and completes the proof of Theorem 6.

To get an $O(\epsilon)$-accurate hypothesis with probability $1 - \delta$, we can simply run Learn-kmodal-simple $O(\log(1/\delta))$ times and then perform a tournament using Theorem 5. This increases the sample complexity by an $\widetilde{O}(\log(1/\delta))$ factor. The running time increases by a factor of $O(\log^2(1/\delta))$.
We postpone the details to Appendix C.

3.2 Main Result: Learning k-modal distributions using testing

Here is some intuition to motivate our $k$-modal distribution learning algorithm and give a high-level idea of why the dominant term in its sample complexity is $O(k \log(n/k)/\epsilon^3)$. Let $p$ denote the target $k$-modal distribution to be learned. As discussed above, optimal (in terms of time and sample complexity) algorithms are known for learning a monotone distribution over $[n]$, so if the locations of the $k$ modes of $p$ were known, then it would be straightforward to learn $p$ very efficiently by running the monotone distribution learner over $k+1$ separate intervals. But it is clear that in general we cannot hope to efficiently identify the modes of $p$ exactly (for instance, it could be the case that $p(a) = p(a+2) = 1/n$ while $p(a+1) = 1/n + 1/2^n$). Still, it is natural to try to decompose the $k$-modal distribution into a collection of (nearly) monotone distributions and learn those. At a high level that is what our algorithm does, using a novel property testing algorithm.

More precisely, we give a distribution testing algorithm with the following performance guarantee: Let $q$ be a $k$-modal distribution over $[n]$. Given an accuracy parameter $\tau$, our tester takes $\mathrm{poly}(k/\tau)$ samples from $q$ and outputs "yes" with high probability if $q$ is monotone, and "no" with high probability if $q$ is $\tau$-far from every monotone distribution. (We stress that the assumption that $q$ is $k$-modal is essential here, since an easy argument given in [BKR04] shows that $\Omega(n^{1/2})$ samples are required to test whether a general distribution over $[n]$ is monotone versus $\Theta(1)$-far from monotone.)
With some care, by running the above-described tester $O(k/\epsilon)$ times with accuracy parameter $\tau$, we can decompose the domain $[n]$ into

• at most $k+1$ "superintervals," which have the property that the conditional distribution of $p$ over each superinterval is almost monotone ($\tau$-close to monotone);

• at most $k+1$ "negligible intervals," which have the property that each one has probability mass at most $O(\epsilon/k)$ under $p$ (so ignoring all of them incurs at most $O(\epsilon)$ total error); and

• at most $k+1$ "heavy" points, each of which has mass at least $\Omega(\epsilon/k)$ under $p$.

We can ignore the negligible intervals, and the heavy points are easy to handle; however, some care must be taken to learn the "almost monotone" restrictions of $p$ over each superinterval. A naive approach, using a generic $\log(n)/\epsilon^3$-sample monotone distribution learner that has no performance guarantees if the target distribution is not monotone, leads to an inefficient overall algorithm. Such an approach would require that $\tau$ (the closeness parameter used by the tester) be at most $1/(\text{the sample complexity of the monotone distribution learner})$, i.e., $\tau < \epsilon^3/\log(n)$. Since the sample complexity of the tester is $\mathrm{poly}(k/\tau)$ and the tester is run $\Omega(k/\epsilon)$ times, this approach would lead to an overall sample complexity that is unacceptably high.

Fortunately, instead of using a generic monotone distribution learner, we can use the semi-agnostic monotone distribution learner of Birgé (Theorem 4), which can handle deviations from monotonicity far more efficiently than the above naive approach. Recall that given draws from a distribution $q$ over $[n]$ that is $\tau$-close to monotone, this algorithm uses $O(\log(n)/\epsilon^3)$ samples and outputs a hypothesis distribution that is $(2\tau + \epsilon)$-close to monotone.
By using this algorithm we can take the accuracy parameter $\tau$ of our tester to be $\Theta(\epsilon)$ and learn the conditional distribution of $p$ over a given superinterval to accuracy $O(\epsilon)$ using $O(\log(n)/\epsilon^3)$ samples from that superinterval. Since there are $k+1$ superintervals overall, a careful analysis shows that $O(k \log(n)/\epsilon^3)$ samples suffice to handle all the superintervals. We note that the algorithm also requires an additional additive $\mathrm{poly}(k/\epsilon)$ samples (independent of $n$) besides this dominant term (for example, to run the tester and to estimate accurate weights with which to combine the various sub-hypotheses). The overall sample complexity we achieve is stated in Theorem 7 below.

Theorem 7 (Main) The algorithm Learn-kmodal uses
$$O\big( k \log(n/k)/\epsilon^3 + (k^2/\epsilon^3) \cdot \log(k/\epsilon) \cdot \log\log(k/\epsilon) \big)$$
samples, performs $\mathrm{poly}(k, \log n, 1/\epsilon)$ bit operations, and learns any $k$-modal distribution to accuracy $\epsilon$ and confidence $9/10$.

Theorem 1 follows from Theorem 7 by running Learn-kmodal $O(\log(1/\delta))$ times and using hypothesis testing to boost the confidence to $1 - \delta$. We give details in Appendix C.

Algorithm Learn-kmodal makes essential use of an algorithm $T^{\uparrow}$ for testing whether a $k$-modal distribution over $[n]$ is non-decreasing. Algorithm $T^{\uparrow}(\epsilon, \delta)$ uses $O(\log(1/\delta)) \cdot (k/\epsilon^2)$ samples from a $k$-modal distribution $p$ over $[n]$, and behaves as follows:

• (Completeness) If $p$ is non-decreasing, then $T^{\uparrow}$ outputs "yes" with probability at least $1 - \delta$;

• (Soundness) If $p$ is $\epsilon$-far from non-decreasing, then $T^{\uparrow}$ outputs "yes" with probability at most $\delta$.

Let $T^{\downarrow}$ denote the analogous algorithm for testing whether a $k$-modal distribution over $[n]$ is non-increasing (we will need both algorithms). The description and proof of correctness for $T^{\uparrow}$ is postponed to the following subsection (Section 3.4).
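The confidence boost from a constant-error tester to error $\delta$ is the standard repeat-and-take-majority trick. The sketch below (our own illustration; the constant $18$ and the name `amplify` are assumptions, not taken from the paper) shows the mechanics: repeat a base tester with error at most $1/3$ an odd $\Theta(\log(1/\delta))$ number of times and return the majority vote, which by a Chernoff bound drives the error below $\delta$.

```python
import math

def amplify(base_tester, delta):
    """Run `base_tester` (a zero-argument callable returning True/False,
    correct with probability >= 2/3) an odd Theta(log(1/delta)) number of
    times and return the majority vote."""
    reps = 18 * int(math.ceil(math.log(1.0 / delta))) + 1  # odd repetition count
    votes = sum(1 for _ in range(reps) if base_tester())
    return votes * 2 > reps  # strict majority says "yes"
```

A deterministic stub that answers correctly two calls out of every three already yields the right majority answer.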
3.3 Algorithm Learn-kmodal and its analysis

Algorithm Learn-kmodal is given below, with its analysis following.

Learn-kmodal
Inputs: $\epsilon > 0$; sample access to $k$-modal distribution $p$ over $[n]$

1. Fix $\tau := \epsilon/(100k)$. Draw $r = \Theta(1/\tau^2)$ samples from $p$ and let $\widehat{p}$ denote the empirical distribution.

2. Greedily partition the domain $[n]$ into $\ell$ atomic intervals $\mathcal{I} := \{I_i\}_{i=1}^{\ell}$ as follows:

(a) $I_1 := [1, j_1]$, where $j_1 := \min\{j \in [n] \mid \widehat{p}([1,j]) \ge \epsilon/(10k)\}$.

(b) For $i \ge 1$, if $\cup_{j=1}^{i} I_j = [1, j_i]$, then $I_{i+1} := [j_i + 1, j_{i+1}]$, where $j_{i+1}$ is defined as follows: if $\widehat{p}([j_i+1, n]) \ge \epsilon/(10k)$, then $j_{i+1} := \min\{j \in [n] \mid \widehat{p}([j_i+1, j]) \ge \epsilon/(10k)\}$; otherwise, $j_{i+1} := n$.

3. Set $\tau' := \epsilon/(2000k)$. Draw $r' = \Theta((k^2/\epsilon^3) \cdot \log(1/\tau') \log\log(1/\tau'))$ samples $\mathbf{s}$ from $p$ to use in Steps 4-5.

4. Run both $T^{\uparrow}(\epsilon, \tau')$ and $T^{\downarrow}(\epsilon, \tau')$ over $p_{\cup_{i=1}^{j} I_i}$ for $j = 1, 2, \ldots$, to find the leftmost atomic interval $I_{j_1}$ such that both $T^{\uparrow}$ and $T^{\downarrow}$ return "no" over $p_{\cup_{i=1}^{j_1} I_i}$. Let $I_{j_1} = [a_{j_1}, b_{j_1}]$. We consider two cases:

Case 1: If $\widehat{p}[a_{j_1}, b_{j_1}] \ge 2\epsilon/(10k)$, define $I'_{j_1} := [a_{j_1}, b_{j_1} - 1]$ and declare $b_{j_1}$ a heavy point.

Case 2: If $\widehat{p}[a_{j_1}, b_{j_1}] < 2\epsilon/(10k)$, then define $I'_{j_1} := I_{j_1}$. Call $I'_{j_1}$ a negligible interval.

If $j_1 > 1$, then define the first superinterval $S_1$ to be $\cup_{i=1}^{j_1 - 1} I_i$, and set $a_1 \in \{\uparrow, \downarrow\}$ to be $a_1 = \uparrow$ if $T^{\uparrow}$ returned "yes" on $p_{\cup_{i=1}^{j_1-1} I_i}$ and $a_1 = \downarrow$ if $T^{\downarrow}$ returned "yes" on $p_{\cup_{i=1}^{j_1-1} I_i}$.

5. Repeat Step 4 starting with the next interval $I_{j_1+1}$, i.e., find the leftmost atomic interval $I_{j_2}$ such that both $T^{\uparrow}$ and $T^{\downarrow}$ return "no" over $p_{\cup_{i=j_1+1}^{j_2} I_i}$. Continue doing this until all intervals through $I_{\ell}$ have been used. Let $S_1, \ldots, S_t$ be the superintervals obtained through the above process and $(a_1, \ldots$
$, a_t) \in \{\uparrow, \downarrow\}^t$ be the corresponding string of bits.

6. Draw $m = \Theta(k \cdot \log(n/k)/\epsilon^3)$ samples $\mathbf{s}'$ from $p$. For each superinterval $S_i$, $i \in [t]$, run $A^{a_i}$ on the conditional distribution $p_{S_i}$ of $p$ using the samples in $\mathbf{s}' \cap S_i$. Let $\widetilde{p}_{S_i}$ be the hypothesis thus obtained.

7. Output the hypothesis $h = \sum_{i=1}^{t} \widehat{p}(S_i) \cdot \widetilde{p}_{S_i} + \sum_{j} \widehat{p}(\{b_j\}) \cdot \mathbf{1}_{b_j}$.

We are now ready to prove Theorem 7.

Proof of Theorem 7: Before entering into the proof we record two observations; we state them explicitly here for the sake of the exposition.

Fact 6 Let $R \subseteq [n]$. If $p_R$ is neither non-increasing nor non-decreasing, then $R$ contains at least one left extreme point.

Fact 7 Suppose that $R \subseteq [n]$ does not contain a left extreme point. For any $\epsilon, \tau$, if $T^{\uparrow}(\epsilon, \tau)$ and $T^{\downarrow}(\epsilon, \tau)$ are both run on $p_R$, then the probability that both calls return "no" is at most $\tau$.

Proof: By Fact 6, $p_R$ is either non-decreasing or non-increasing. If $p_R$ is non-decreasing, then $T^{\uparrow}$ will output "no" with probability at most $\tau$; similarly, if $p_R$ is non-increasing, then $T^{\downarrow}$ will output "no" with probability at most $\tau$.

Since $r = \Theta(1/\tau^2)$ samples are drawn in the first step, Fact 2 (applied for $d = 1$) implies that with failure probability at most $1/100$, each interval $I \subseteq [n]$ has $|\widehat{p}(I) - p(I)| \le 2\tau$. For the rest of the proof we condition on this good event.

Since every atomic interval $I \in \mathcal{I}$ has $\widehat{p}(I) \ge \epsilon/(10k)$ (except potentially the rightmost one), it follows that the number $\ell$ of atomic intervals constructed in Step 2 satisfies $\ell \le 10 \cdot (k/\epsilon)$. Moreover, by our conditioning, each atomic interval $I_i$ has $p(I_i) \ge 8\epsilon/(100k)$. Note that in Case 1 of Step 4, if $\widehat{p}[a_{j_1}, b_{j_1}] \ge 2\epsilon/(10k)$, then it must be the case that $\widehat{p}(b_{j_1}) \ge \epsilon/(10k)$ (and thus $p(b_{j_1}) \ge 8\epsilon/(100k)$).
In this case, by definition of how the interval $I_{j_1}$ was formed, we must have that $I'_{j_1} = [a_{j_1}, b_{j_1} - 1]$ satisfies $\widehat{p}(I'_{j_1}) < \epsilon/(10k)$. So in both Case 1 and Case 2, we now have that $\widehat{p}(I'_{j_1}) \le 2\epsilon/(10k)$, and thus $p(I'_{j_1}) \le 22\epsilon/(100k)$. Entirely similar reasoning shows that every negligible interval constructed in Steps 4 and 5 has mass at most $22\epsilon/(100k)$ under $p$.

In Steps 4-5 we invoke the testers $T^{\downarrow}$ and $T^{\uparrow}$ on the conditional distributions of (unions of contiguous) atomic intervals. Note that we need enough samples in every atomic interval, since otherwise the testers provide no guarantees. We claim that with probability at least $99/100$ over the samples of Step 3, each atomic interval gets $b = \Omega\big( (k/\epsilon^2) \cdot \log(1/\tau') \big)$ samples. This follows by a standard coupon collector's argument, which we now provide. As argued above, each atomic interval has probability mass $\Omega(\epsilon/k)$ under $p$. So, we have $\ell = O(k/\epsilon)$ bins (atomic intervals), and we want each bin to contain $b$ balls (samples). It is well known [NS60] that after taking $\Theta(\ell \cdot \log \ell + \ell \cdot b \cdot \log\log \ell)$ samples from $p$, with probability $99/100$ each bin will contain the desired number of balls. The claim now follows by our choice of parameters. Conditioning on this event, any execution of the testers $T^{\uparrow}(\epsilon, \tau')$ and $T^{\downarrow}(\epsilon, \tau')$ in Steps 4 and 5 will have the guaranteed completeness and soundness properties.

In the execution of Steps 4 and 5, there are a total of at most $\ell$ occasions when $T^{\uparrow}(\epsilon, \tau')$ and $T^{\downarrow}(\epsilon, \tau')$ are both run over some union of contiguous atomic intervals. By Fact 7 and a union bound, the probability that (in any of these instances) the interval does not contain a left extreme point and yet both calls return "no" is at most $(10k/\epsilon)\tau' \le 1/200$.
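The coupon-collector bound invoked above is easy to sanity-check numerically. The toy simulation below (our own illustration, not part of the algorithm) throws $\Theta(\ell \log \ell + \ell b \log\log \ell)$ balls uniformly into $\ell$ equal-mass bins, standing in for samples landing in atomic intervals, and checks that every bin receives at least $b$ balls; the constant $4$ is an assumption chosen to make the toy run comfortably.

```python
import math
import random

def bin_counts(num_bins, num_samples, seed=0):
    """Throw `num_samples` balls uniformly into `num_bins` bins and
    return the per-bin counts."""
    rng = random.Random(seed)
    counts = [0] * num_bins
    for _ in range(num_samples):
        counts[rng.randrange(num_bins)] += 1
    return counts

# With m = Theta(l*log(l) + l*b*loglog(l)) balls, every bin typically
# holds at least b balls, as the text claims.
l, b = 20, 10
m = int(4 * (l * math.log(l) + l * b * math.log(math.log(l))))
counts = bin_counts(l, m)
```

With these parameters each bin receives about $m/\ell \approx 56$ balls on average, so the threshold $b = 10$ is cleared by a wide margin.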
So with failure probability at most $1/200$ for this step, each time Step 4 identifies a group of consecutive intervals $I_j, \ldots, I_{j+r}$ such that both $T^{\uparrow}$ and $T^{\downarrow}$ output "no", there is a left extreme point in $\cup_{i=j}^{j+r} I_i$. Since $p$ is $k$-modal, it follows that with failure probability at most $1/200$ there are at most $k+1$ total repetitions of Step 4, and hence the number $t$ of superintervals obtained is at most $k+1$.

We moreover claim that with very high probability each of the $t$ superintervals $S_i$ is very close to non-increasing or non-decreasing (with its correct orientation given by $a_i$):

Claim 8 With failure probability at most $1/100$, each $i \in [t]$ satisfies the following: if $a_i = \uparrow$ then $p_{S_i}$ is $\epsilon$-close to a non-decreasing distribution, and if $a_i = \downarrow$ then $p_{S_i}$ is $\epsilon$-close to a non-increasing distribution.

Proof: There are at most $2\ell \le 20k/\epsilon$ instances when either $T^{\downarrow}$ or $T^{\uparrow}$ is run on a union of contiguous intervals. For any fixed execution of $T^{\downarrow}$ over an interval $I$, the probability that $T^{\downarrow}$ outputs "yes" while $p_I$ is $\epsilon$-far from every non-increasing distribution over $I$ is at most $\tau'$, and similarly for $T^{\uparrow}$. A union bound and the choice of $\tau'$ conclude the proof of the claim.

Thus we have established that with overall failure probability at most $5/100$, after Step 5 the interval $[n]$ has been partitioned into:

1. A set $\{S_i\}_{i=1}^{t}$ of $t \le k+1$ superintervals, with $p(S_i) \ge 8\epsilon/(100k)$ and $p_{S_i}$ being $\epsilon$-close to either non-increasing or non-decreasing, according to the value of the bit $a_i$.

2. A set $\{I'_i\}_{i=1}^{t'}$ of $t' \le k+1$ negligible intervals, such that $p(I'_i) \le 22\epsilon/(100k)$.

3. A set $\{b_i\}_{i=1}^{t''}$ of $t'' \le k+1$ heavy points, each with $p(b_i) \ge 8\epsilon/(100k)$.

We condition on the above good events, and bound from above the expected total variation distance (over the sample $\mathbf{s}'$).
In particular, we have the following lemma:

Lemma 9 Conditioned on the above good events 1-3, we have that $\mathbb{E}_{\mathbf{s}'}[d_{TV}(h, p)] = O(\epsilon)$.

Proof of Lemma 9: By the discussion preceding the lemma statement, the domain $[n]$ has been partitioned into a set of superintervals, a set of negligible intervals and a set of heavy points. As a consequence, we can write
$$p = \sum_{j=1}^{t} p(S_j) \cdot p_{S_j} + \sum_{j=1}^{t''} p(\{b_j\}) \cdot \mathbf{1}_{b_j} + \sum_{j=1}^{t'} p(I'_j) \cdot p_{I'_j}.$$
Therefore, we can bound the total variation distance as follows:
$$d_{TV}(h, p) \le \sum_{j=1}^{t} |\widehat{p}(S_j) - p(S_j)| + \sum_{j=1}^{t''} |\widehat{p}(b_j) - p(b_j)| + \sum_{j=1}^{t'} p(I'_j) + \sum_{j=1}^{t} p(S_j) \cdot d_{TV}(\widetilde{p}_{S_j}, p_{S_j}).$$
Recall that each term in the first two sums is bounded from above by $2\tau$. Hence, the contribution of these terms to the RHS is at most $2\tau \cdot (2k+2) \le \epsilon/10$. Since each negligible interval $I'_j$ has $p(I'_j) \le 22\epsilon/(100k)$, the contribution of the third sum is at most $t' \cdot 22\epsilon/(100k) \le \epsilon/4$. It thus remains to bound the contribution of the last sum. We will show that
$$\mathbb{E}_{\mathbf{s}'}\Big[ \sum_{j=1}^{t} p(S_j) \cdot d_{TV}(\widetilde{p}_{S_j}, p_{S_j}) \Big] = O(\epsilon).$$
Denote $n_i = |S_i|$. Clearly, $\sum_{i=1}^{t} n_i \le n$. Since we are conditioning on the good events (1)-(3), each superinterval is $\epsilon$-close to monotone with a known orientation (non-increasing or non-decreasing) given by $a_i$. Hence we may apply Theorem 4 to each superinterval. Recall that in Step 6 we draw a total of $m$ samples. Let $m_i$, $i \in [t]$, be the number of samples that land in $S_i$; observe that $m_i$ is a binomially distributed random variable, $m_i \sim \mathrm{Bin}(m, p(S_i))$. We apply Theorem 4 for each $\epsilon$-monotone interval, conditioning on the value of $m_i$, and get
$$d_{TV}(\widetilde{p}_{S_i}, p_{S_i}) \le 2\epsilon + O\big( (\log n_i/(m_i + 1))^{1/3} \big).$$
Hence, we can bound from above the desired expectation as follows:
$$\sum_{j=1}^{t} p(S_j) \cdot \mathbb{E}_{\mathbf{s}'}\big[ d_{TV}(\widetilde{p}_{S_j}, p_{S_j}) \big] \le \Big( \sum_{j=1}^{t} 2\epsilon \cdot p(S_j) \Big) + O\Big( \sum_{j=1}^{t} p(S_j) \cdot (\log n_j)^{1/3} \cdot \mathbb{E}_{\mathbf{s}'}[(m_j+1)^{-1/3}] \Big).$$
Since $\sum_{j} p(S_j) \le 1$, to prove the lemma it suffices to show that the second term is bounded, i.e., that
$$\sum_{j=1}^{t} p(S_j) \cdot (\log n_j)^{1/3} \cdot \mathbb{E}_{\mathbf{s}'}[(m_j+1)^{-1/3}] = O(\epsilon).$$
To do this, we will first need the following claim:

Claim 10 For a binomial random variable $X \sim \mathrm{Bin}(m, q)$ it holds that $\mathbb{E}[(X+1)^{-1/3}] < (mq)^{-1/3}$.

Proof: Jensen's inequality implies that $\mathbb{E}[(X+1)^{-1/3}] \le (\mathbb{E}[1/(X+1)])^{1/3}$. We claim that $\mathbb{E}[1/(X+1)] < 1/\mathbb{E}[X]$. This can be shown as follows: We first recall that $\mathbb{E}[X] = m \cdot q$. For the expectation of the inverse, we can write
$$\mathbb{E}[1/(X+1)] = \sum_{j=0}^{m} \frac{1}{j+1} \binom{m}{j} q^j (1-q)^{m-j} = \frac{1}{m+1} \sum_{j=0}^{m} \binom{m+1}{j+1} q^j (1-q)^{m-j} = \frac{1}{q(m+1)} \sum_{i=1}^{m+1} \binom{m+1}{i} q^i (1-q)^{m+1-i} = \frac{1 - (1-q)^{m+1}}{q(m+1)} < \frac{1}{mq}.$$
The claim now follows by the monotonicity of the mapping $x \mapsto x^{1/3}$.

By Claim 10, applied to $m_i \sim \mathrm{Bin}(m, p(S_i))$, we have that $\mathbb{E}_{\mathbf{s}'}[(m_i+1)^{-1/3}] < m^{-1/3} \cdot (p(S_i))^{-1/3}$. Therefore, our desired quantity can be bounded from above by
$$\sum_{j=1}^{t} \frac{p(S_j) \cdot (\log n_j)^{1/3}}{m^{1/3} \cdot (p(S_j))^{1/3}} = O(\epsilon) \cdot \sum_{j=1}^{t} (p(S_j))^{2/3} \cdot \Big( \frac{\log n_j}{k \cdot \log(n/k)} \Big)^{1/3}.$$
We now claim that the second term in the RHS above is upper bounded by $2$. Indeed, this follows by an application of Hölder's inequality to the vectors $(p(S_j)^{2/3})_{j=1}^{t}$ and $((\frac{\log n_j}{k \cdot \log(n/k)})^{1/3})_{j=1}^{t}$, with Hölder conjugates $3/2$ and $3$. That is,
$$\sum_{j=1}^{t} (p(S_j))^{2/3} \cdot \Big( \frac{\log n_j}{k \cdot \log(n/k)} \Big)^{1/3} \le \Big( \sum_{j=1}^{t} p(S_j) \Big)^{2/3} \cdot \Big( \sum_{j=1}^{t} \frac{\log n_j}{k \cdot \log(n/k)} \Big)^{1/3} \le 2.$$
The first inequality is Hölder's and the second uses the facts that $\sum_{j=1}^{t} p(S_j) \le 1$ and $\sum_{j=1}^{t} \log(n_j) \le t \cdot \log(n/t) \le (k+1) \cdot \log(n/k)$. This last inequality is a consequence of the concavity of the logarithm and the fact that $\sum_{j} n_j \le n$. This completes the proof of the lemma.

By applying Markov's inequality and a union bound, we get that with probability $9/10$ the algorithm Learn-kmodal outputs a hypothesis $h$ with $d_{TV}(h, p) = O(\epsilon)$, as required. It is clear that the algorithm has the claimed sample complexity. The running time is also easy to analyze: every step can be performed in time polynomial in the sample size. This completes the proof of Theorem 7.

3.4 Testing whether a k-modal distribution is monotone

In this section we describe and analyze the testing algorithm $T^{\uparrow}$. Given sample access to a $k$-modal distribution $q$ over $[n]$ and $\tau > 0$, our tester $T^{\uparrow}$ uses $O(k/\tau^2)$ many samples from $q$ and has the following properties:

• If $q$ is non-decreasing, $T^{\uparrow}$ outputs "yes" with probability at least $2/3$.

• If $q$ is $\tau$-far from non-decreasing, $T^{\uparrow}$ outputs "no" with probability at least $2/3$.

(The algorithm $T^{\uparrow}(\tau, \delta)$ is obtained by repeating $T^{\uparrow}$ $O(\log(1/\delta))$ times and taking the majority vote.)

Before we describe the algorithm we need some notation. Let $q$ be a distribution over $[n]$. For $a \le b < c \in [n]$ define
$$E(q, a, b, c) := \frac{q([a,b])}{b-a+1} - \frac{q([b+1,c])}{c-b}.$$
We also denote
$$T(q, a, b, c) := \frac{E(q, a, b, c)}{\frac{1}{b-a+1} + \frac{1}{c-b}}.$$
Intuitively, the quantity $E(q, a, b, c)$ captures the difference between the average value of $q$ over $[a,b]$ versus over $[b+1,c]$; it is negative iff the average value of $q$ is higher over $[b+1,c]$ than it is over $[a,b]$. The quantity $T(q, a, b, c)$ is a scaled version of $E(q, a, b, c)$. The idea behind tester $T^{\uparrow}$ is simple.
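The two statistics just defined translate directly into code. The sketch below is a literal transcription of the definitions (not the full tester), assuming a 0-indexed pmf list in place of the paper's 1-indexed domain $[n]$, with intervals given by inclusive endpoints.

```python
def E_stat(q, a, b, c):
    """E(q,a,b,c): average of q over [a,b] minus average over [b+1,c]
    (0-indexed, endpoints inclusive, a <= b < c)."""
    return sum(q[a:b + 1]) / (b - a + 1) - sum(q[b + 1:c + 1]) / (c - b)

def T_stat(q, a, b, c):
    """T(q,a,b,c): the scaled version of E used by the tester."""
    return E_stat(q, a, b, c) / (1.0 / (b - a + 1) + 1.0 / (c - b))
```

For a non-decreasing pmf both statistics are non-positive on every pair of consecutive intervals, which is exactly the completeness property exploited below; a decreasing pmf produces a positive violation.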
It is based on the observation that if $q$ is a non-decreasing distribution, then for any two consecutive intervals $[a,b]$ and $[b+1,c]$ the average of $q$ over $[b+1,c]$ must be at least as large as the average of $q$ over $[a,b]$. Thus any non-decreasing distribution will pass a test that checks "all" pairs of consecutive intervals looking for a violation. Our tester $T^{\uparrow}$ checks "all" sums of (at most) $k$ consecutive intervals looking for a violation. Our analysis shows that in fact such a test is complete as well as sound if the distribution $q$ is guaranteed to be $k$-modal. The key ingredient is a structural result (Lemma 11 below), which is proved using a procedure reminiscent of "Myerson ironing" [Mye81] to convert a $k$-modal distribution into a non-decreasing distribution.

Tester $T^{\uparrow}(\tau)$
Inputs: $\tau > 0$; sample access to $k$-modal distribution $q$ over $[n]$

1. Draw $r = \Theta(k/\tau^2)$ samples $\mathbf{s}$ from $q$ and let $\widehat{q}$ be the resulting empirical distribution.

2. If there exist $\ell \in [k]$ and $\{a_i, b_i, c_i\}_{i=1}^{\ell} \in \mathbf{s} \cup \{n\}$ with $a_i \le b_i < c_i < a_{i+1}$, $i \in [\ell-1]$, such that
$$\sum_{i=1}^{\ell} T(\widehat{q}, a_i, b_i, c_i - 1) \ge \tau/4, \quad (6)$$
then output "no"; otherwise output "yes".

The following theorem establishes correctness of the tester.

Theorem 8 The algorithm $T^{\uparrow}$ uses $O(k/\tau^2)$ samples from $q$, performs $\mathrm{poly}(k/\tau) \cdot \log n$ bit operations and satisfies the desired completeness and soundness properties.

Proof: We start by showing that the algorithm has the claimed completeness and soundness properties. Let us say that the sample $\mathbf{s}$ is good if for every collection $\mathcal{I}$ of (at most) $3k$ intervals in $[n]$ it holds that $\sum_{I \in \mathcal{I}} |q(I) - \widehat{q}(I)| \le \tau/20$. By Fact 2, with probability at least $2/3$ the sample $\mathbf{s}$ is good. We henceforth condition on this event.

For $a \le b < c \in [n]$ let us denote $\gamma = |q([a,b]) - \widehat{q}([a,b])|$ and $\gamma' = |q([b+1,c]) - \widehat{q}([b+1,c])|$.
Then we can write
$$|E(q,a,b,c) - E(\widehat{q},a,b,c)| \le \frac{\gamma}{b-a+1} + \frac{\gamma'}{c-b} \le (\gamma + \gamma') \cdot \Big( \frac{1}{b-a+1} + \frac{1}{c-b} \Big),$$
which implies that
$$|T(q,a,b,c) - T(\widehat{q},a,b,c)| \le \gamma + \gamma'. \quad (7)$$
Now consider any $\{a_i,b_i,c_i\}_{i=1}^{\ell} \in [n]$, for some $\ell \le k$, with $a_i \le b_i < c_i < a_{i+1}$, $i \in [\ell-1]$. Similarly denote $\gamma_i = |q([a_i,b_i]) - \widehat{q}([a_i,b_i])|$ and $\gamma'_i = |q([b_i+1,c_i]) - \widehat{q}([b_i+1,c_i])|$. With this notation we have
$$\Big| \sum_{i=1}^{\ell} T(q,a_i,b_i,c_i) - \sum_{i=1}^{\ell} T(\widehat{q},a_i,b_i,c_i) \Big| \le \sum_{i=1}^{\ell} |T(q,a_i,b_i,c_i) - T(\widehat{q},a_i,b_i,c_i)| \le \sum_{i=1}^{\ell} (\gamma_i + \gamma'_i),$$
where we used the triangle inequality and (7). Note that the rightmost term is the sum of the "additive errors" for the collection $\{[a_i,b_i], [b_i+1,c_i]\}_{i=1}^{\ell}$ of $2\ell$ intervals. Hence, it follows from our conditioning that the last term is bounded from above by $\tau/20$, i.e.,
$$\Big| \sum_{i=1}^{\ell} T(q,a_i,b_i,c_i) - \sum_{i=1}^{\ell} T(\widehat{q},a_i,b_i,c_i) \Big| \le \tau/20. \quad (8)$$
We first establish completeness. Suppose that $q$ is non-decreasing. Then the average value of $q$ over an interval is a non-decreasing function of the interval's position; in particular, for all $a \le b < c \in [n]$ it holds that $E(q,a,b,c) \le 0$, hence $T(q,a,b,c) \le 0$. This implies that for any choice of $\{a_i,b_i,c_i\}_{i=1}^{\ell} \in [n]$ with $a_i \le b_i < c_i < a_{i+1}$, we will have $\sum_{i=1}^{\ell} T(q,a_i,b_i,c_i) \le 0$. By (8) we now get that $\sum_{i=1}^{\ell} T(\widehat{q},a_i,b_i,c_i) \le \tau/20$, i.e., the tester says "yes" with probability at least $2/3$.

To prove soundness, we will crucially need the following structural lemma:

Lemma 11 Let $q$ be a $k$-modal distribution over $[n]$ that is $\tau$-far from being non-decreasing.
Then there exist $\ell \in [k]$ and $\{a_i,b_i,c_i\}_{i=1}^{\ell} \in [n]^{3\ell}$ with $a_i \le b_i < c_i < a_{i+1}$, $i \in [\ell-1]$, such that
$$\sum_{i=1}^{\ell} T(q,a_i,b_i,c_i) \ge \tau/2. \quad (9)$$
We first show how the soundness follows from the lemma. Let $q$ be a $k$-modal distribution over $[n]$ that is $\tau$-far from non-decreasing. Denote $\mathbf{s}' := \mathbf{s} \cup \{n\} = \{s_1, s_2, \ldots, s_{r'}\}$ with $r' \le r+1$ and $s_j < s_{j+1}$. We want to show that there exist points in $\mathbf{s}'$ that satisfy (6); namely, that there exist $\ell \in [k]$ and $\{s_{a_i}, s_{b_i}, s_{c_i}\}_{i=1}^{\ell} \in \mathbf{s}'$ with $s_{a_i} \le s_{b_i} < s_{c_i} < s_{a_{i+1}}$, $i \in [\ell-1]$, such that
$$\sum_{i=1}^{\ell} T(\widehat{q}, s_{a_i}, s_{b_i}, s_{c_i} - 1) \ge \tau/4. \quad (10)$$
By Lemma 11, there exist $\ell \in [k]$ and $\{a_i,b_i,c_i\}_{i=1}^{\ell} \in [n]$ with $a_i \le b_i < c_i < a_{i+1}$, $i \in [\ell-1]$, such that $\sum_{i=1}^{\ell} T(q,a_i,b_i,c_i) \ge \tau/2$. Combined with (8), the latter inequality implies that
$$\sum_{i=1}^{\ell} T(\widehat{q}, a_i, b_i, c_i) \ge \tau/2 - \tau/20 > \tau/4. \quad (11)$$
First note that it is no loss of generality to assume that $\widehat{q}([a_i,b_i]) > 0$ for all $i \in [\ell]$. (If there is some $j \in [\ell]$ with $\widehat{q}([a_j,b_j]) = 0$, then by definition we have $T(\widehat{q},a_j,b_j,c_j) \le 0$; hence, we can remove this term from the above sum and the RHS does not decrease.) Given the domain points $\{a_i,b_i,c_i\}_{i=1}^{\ell}$ we define the sample points $s_{a_i}, s_{b_i}, s_{c_i}$ such that: (i) $[s_{a_i}, s_{b_i}] \subseteq [a_i, b_i]$, (ii) $[s_{b_i}+1, s_{c_i}-1] \supseteq [b_i+1, c_i]$, (iii) $\widehat{q}([s_{a_i}, s_{b_i}]) = \widehat{q}([a_i, b_i])$ and (iv) $\widehat{q}([s_{b_i}+1, s_{c_i}-1]) = \widehat{q}([b_i+1, c_i])$. To achieve these properties we select:

• $s_{a_i}$ to be the leftmost point of the sample in $[a_i, b_i]$, and $s_{b_i}$ to be the rightmost point of the sample in $[a_i, b_i]$. Note that by our assumption that $\widehat{q}([a_i, b_i]) > 0$, at least one sample falls in $[a_i, b_i]$.
• $s_{c_i}$ to be the leftmost point of the sample in $[c_i+1, n]$, or the point $n$ if $[c_i+1, n]$ has no samples or is empty.

We can rewrite (11) as follows:
$$\sum_{i=1}^{\ell} \frac{\widehat{q}([a_i,b_i])}{1 + \frac{b_i - a_i + 1}{c_i - b_i}} \ge \tau/4 + \sum_{i=1}^{\ell} \frac{\widehat{q}([b_i+1,c_i])}{1 + \frac{c_i - b_i}{b_i - a_i + 1}}. \quad (12)$$
Now note that by properties (i) and (ii) above it follows that $b_i - a_i + 1 \ge s_{b_i} - s_{a_i} + 1$ and $c_i - b_i \le s_{c_i} - s_{b_i} - 1$. Combining with properties (iii) and (iv) we get
$$\frac{\widehat{q}([a_i,b_i])}{1 + \frac{b_i - a_i + 1}{c_i - b_i}} = \frac{\widehat{q}([s_{a_i},s_{b_i}])}{1 + \frac{b_i - a_i + 1}{c_i - b_i}} \le \frac{\widehat{q}([s_{a_i},s_{b_i}])}{1 + \frac{s_{b_i} - s_{a_i} + 1}{s_{c_i} - s_{b_i} - 1}} \quad (13)$$
and similarly
$$\frac{\widehat{q}([b_i+1,c_i])}{1 + \frac{c_i - b_i}{b_i - a_i + 1}} = \frac{\widehat{q}([s_{b_i}+1, s_{c_i}-1])}{1 + \frac{c_i - b_i}{b_i - a_i + 1}} \ge \frac{\widehat{q}([s_{b_i}+1, s_{c_i}-1])}{1 + \frac{s_{c_i} - s_{b_i} - 1}{s_{b_i} - s_{a_i} + 1}}. \quad (14)$$
A combination of (12), (13), (14) yields the desired result (10). It thus remains to prove Lemma 11.

Proof of Lemma 11: We will prove the contrapositive. Let $q$ be a $k$-modal distribution over $[n]$ such that for any $\ell \le k$ and $\{a_i,b_i,c_i\}_{i=1}^{\ell} \in [n]^{3\ell}$ with $a_i \le b_i < c_i < a_{i+1}$, $i \in [\ell-1]$, we have
$$\sum_{i=1}^{\ell} T(q,a_i,b_i,c_i) \le \tau/2. \quad (15)$$
We will construct a non-decreasing distribution $\widetilde{q}$ that is $\tau$-close to $q$. The high-level idea of the argument is as follows: the construction of $\widetilde{q}$ proceeds in (at most) $k$ stages where in each stage we reduce the number of modes by at least one and incur small error in total variation distance. In particular, we iteratively construct a sequence of distributions $\{q^{(i)}\}_{i=0}^{\ell}$, $q^{(0)} = q$ and $q^{(\ell)} = \widetilde{q}$, for some $\ell \le k$, such that for all $i \in [\ell]$ we have that $q^{(i)}$ is $(k-i)$-modal and $d_{TV}(q^{(i-1)}, q^{(i)}) \le 2\tau_i$, where the quantities $\tau_i$ will be defined in the course of the analysis below. By appropriately using (15), we will show that
$$\sum_{i=1}^{\ell} \tau_i \le \tau/2. \quad (16)$$
(16)

Assuming this, it follows from the triangle inequality that
$$d_{TV}(\tilde q, q) \le \sum_{i=1}^{\ell} d_{TV}(q^{(i)}, q^{(i-1)}) \le 2 \sum_{i=1}^{\ell} \tau_i \le \tau,$$
as desired, where the last inequality uses (16).

Consider the graph (histogram) of the discrete density $q$. The $x$-axis represents the $n$ points of the domain and the $y$-axis the corresponding probabilities. We first informally describe how to obtain $q^{(1)}$ from $q$; the construction of $q^{(i)}$ from $q^{(i-1)}$, $i \in [\ell]$, is essentially identical. Let $j_1$ be the leftmost (i.e., having minimum $x$-coordinate) left-extreme point (mode) of $q$, and assume that it is a local maximum with height (probability mass) $q(j_1)$. (A symmetric argument works for the case that it is a local minimum.) The idea of the proof is based on the following simple process, reminiscent of Myerson's ironing process [Mye81]: we start with the horizontal line $y = q(j_1)$ and move it downwards until we reach a height $h_1 < q(j_1)$ at which the total mass "cut off" equals the mass "missing" to the right; then we make the distribution "flat" in the corresponding interval, thereby reducing the number of modes by at least one.

We now proceed with the formal argument, assuming as above that the leftmost left-extreme point $j_1$ of $q$ is a local maximum. We say that the line $y = h$ intersects a point $i \in [n]$ in the domain of $q$ if $q(i) \ge h$. The line $y = h$, $h \in [0, q(j_1)]$, intersects the graph of $q$ at a unique interval $I(h) \subseteq [n]$ that contains $j_1$. Suppose $I(h) = [a(h), b(h)]$, where $a(h), b(h) \in [n]$ depend on $h$. By definition this means that $q(a(h)) \ge h$ and $q(a(h) - 1) < h$ (since $q$ is supported on $[n]$, we adopt the convention that $q(0) = 0$). Recall that the distribution $q$ is non-decreasing on the interval $[1, j_1]$ and that $j_1 \ge a(h)$.
The term "the mass cut off by the line $y = h$" means the quantity
$$A(h) = q(I(h)) - h \cdot (b(h) - a(h) + 1),$$
i.e., the mass of the interval $I(h)$ above the line.

The height $h$ of the line $y = h$ defines the points $a(h), b(h) \in [n]$ as described above. We consider values of $h$ such that $q$ is unimodal (increasing then decreasing) over $I(h)$. In particular, let $j_1'$ be the leftmost mode of $q$ to the right of $j_1$, i.e., $j_1' > j_1$ and $j_1'$ is a local minimum. We consider values of $h \in (q(j_1'), q(j_1))$. For such values the interval $I(h)$ is indeed unimodal (as $b(h) < j_1'$). For $h \in (q(j_1'), q(j_1))$ we define the point $c(h) \ge j_1'$ as follows: it is the rightmost point of the largest interval containing $j_1'$ whose probability mass does not exceed $h$. That is, all points in $[j_1', c(h)]$ have probability mass at most $h$ and $q(c(h) + 1) > h$ (or $c(h) = n$).

Consider the interval $J(h) = [b(h) + 1, c(h)]$. This interval is non-empty, since $b(h) < j_1' \le c(h)$. (Note that $J(h)$ is not necessarily a unimodal interval; it contains at least one mode $j_1'$ of $q$, but it may also contain more modes.) The term "the mass missing to the right of the line $y = h$" means the quantity
$$B(h) = h \cdot (c(h) - b(h)) - q(J(h)).$$

Consider the function $C(h) = A(h) - B(h)$ over $[q(j_1'), q(j_1)]$. This function is continuous on its domain; moreover, we have $C(q(j_1)) = A(q(j_1)) - B(q(j_1)) < 0$, since $A(q(j_1)) = 0$, and $C(q(j_1')) = A(q(j_1')) - B(q(j_1')) > 0$, since $B(q(j_1')) = 0$. Therefore, by the intermediate value theorem, there exists a value $h_1 \in (q(j_1'), q(j_1))$ such that $A(h_1) = B(h_1)$.

The distribution $q^{(1)}$ is constructed as follows: we move the mass $\tau_1 = A(h_1)$ from $I(h_1)$ to $J(h_1)$.
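The quantities $A(h)$, $B(h)$ and the balancing height $h_1$ can be sketched in Python. This is our own minimal sketch, not the paper's implementation: arrays are 0-indexed, a distribution is a list of point masses, the function names are ours, and since $C(h) = A(h) - B(h)$ is continuous and changes sign on $(q(j_1'), q(j_1))$, bisection stands in for the intermediate value argument.

```python
def cutoff_balance(q, j1, j1p, h):
    """C(h) = A(h) - B(h) for a histogram q (list of probabilities),
    leftmost mode j1 (a local max) and next mode j1p (a local min)."""
    # I(h) = [a, b]: connected component of {i : q[i] >= h} containing j1.
    a = j1
    while a > 0 and q[a - 1] >= h:
        a -= 1
    b = j1
    while b + 1 < len(q) and q[b + 1] >= h:
        b += 1
    # c(h): rightmost point of the largest interval containing j1p
    # on which every point has probability mass at most h.
    c = j1p
    while c + 1 < len(q) and q[c + 1] <= h:
        c += 1
    A = sum(q[a:b + 1]) - h * (b - a + 1)   # mass cut off above the line
    B = h * (c - b) - sum(q[b + 1:c + 1])   # mass missing to the right
    return A - B

def find_h1(q, j1, j1p, iters=100):
    """Bisection for h1 with A(h1) = B(h1): C is decreasing in h,
    with C(q[j1p]) > 0 > C(q[j1])."""
    lo, hi = q[j1p], q[j1]
    for _ in range(iters):
        mid = (lo + hi) / 2
        if cutoff_balance(q, j1, j1p, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

For instance, for the 2-modal $q = (0.1, 0.3, 0.1, 0.2, 0.3)$ with $j_1 = 1$ and $j_1' = 2$ (0-indexed), the balancing height is $h_1 = 0.2$, giving $\tau_1 = A(h_1) = 0.1$.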
Note that the distribution $q^{(1)}$ is identical to $q$ outside the interval $[a(h_1), c(h_1)]$; hence the leftmost mode of $q^{(1)}$ is in $(c(h_1), n]$. It is also clear that $d_{TV}(q^{(1)}, q) \le 2\tau_1$.

Let us denote $a_1 = a(h_1)$, $b_1 = b(h_1)$, and $c_1 = c(h_1)$. We claim that $q^{(1)}$ has at least one mode fewer than $q$. Indeed, $q^{(1)}$ is non-decreasing on $[1, a_1 - 1]$ and constant on $[a_1, c_1]$. (By our "flattening" process, all the points in the latter interval have probability mass exactly $h_1$.) Recalling that $q^{(1)}(a_1) = h_1 \ge q^{(1)}(a_1 - 1) = q(a_1 - 1)$, we deduce that $q^{(1)}$ is non-decreasing on $[1, c_1]$.

We will now argue that
$$\tau_1 = T(q, a_1, b_1, c_1). \qquad (17)$$
Recall that we have $A(h_1) = B(h_1) = \tau_1$, which can be written as
$$q([a_1, b_1]) - h_1 \cdot (b_1 - a_1 + 1) = h_1 \cdot (c_1 - b_1) - q([b_1 + 1, c_1]) = \tau_1.$$
From this we get
$$\frac{q([a_1, b_1])}{b_1 - a_1 + 1} - \frac{q([b_1 + 1, c_1])}{c_1 - b_1} = \frac{\tau_1}{b_1 - a_1 + 1} + \frac{\tau_1}{c_1 - b_1},$$
or equivalently
$$E(q, a_1, b_1, c_1) = \frac{\tau_1}{b_1 - a_1 + 1} + \frac{\tau_1}{c_1 - b_1},$$
which gives (17).

We construct $q^{(2)}$ from $q^{(1)}$ using the same procedure. Recalling that the leftmost mode of $q^{(1)}$ lies in the interval $(c_1, n]$, an identical argument as above implies that $d_{TV}(q^{(2)}, q^{(1)}) \le 2\tau_2$, where $\tau_2 = T(q^{(1)}, a_2, b_2, c_2)$ for some $a_2, b_2, c_2 \in [n]$ satisfying $c_1 < a_2 \le b_2 < c_2$. Since $q^{(1)}$ is identical to $q$ on $(c_1, n]$, it follows that $\tau_2 = T(q, a_2, b_2, c_2)$. We continue this process iteratively for $\ell \le k$ stages until we obtain a non-decreasing distribution $q^{(\ell)}$. (Note that we remove at least one mode in each iteration, hence it may be the case that $\ell < k$.) It follows inductively that for all $i \in [\ell]$ we have $d_{TV}(q^{(i)}, q^{(i-1)}) \le 2\tau_i$, where $\tau_i = T(q, a_i, b_i, c_i)$ for $c_{i-1} < a_i \le b_i < c_i$.
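As a concrete sanity check of identity (17), the following sketch (our own, 0-indexed; the formula for $T$ is the one implied by (12), namely $T(q,a,b,c) = \frac{q([a,b])}{1 + \frac{b-a+1}{c-b}} - \frac{q([b+1,c])}{1 + \frac{c-b}{b-a+1}}$) flattens the first mode of a small 2-modal distribution and verifies $\tau_1 = T(q, a_1, b_1, c_1)$:

```python
def T(q, a, b, c):
    """T(q, a, b, c) as implied by (12) (0-indexed, intervals inclusive)."""
    u, v = b - a + 1, c - b
    return sum(q[a:b + 1]) / (1 + u / v) - sum(q[b + 1:c + 1]) / (1 + v / u)

# 2-modal example: j1 = 1 is a local max, j1' = 2 a local min.  One can
# check that A(h) = B(h) at h1 = 0.2, with I(h1) = [1, 1] and c(h1) = 3.
q = [0.1, 0.3, 0.1, 0.2, 0.3]
a1, b1, c1, h1 = 1, 1, 3, 0.2
tau1 = sum(q[a1:b1 + 1]) - h1 * (b1 - a1 + 1)     # tau1 = A(h1) = 0.1

assert abs(T(q, a1, b1, c1) - tau1) < 1e-12       # identity (17)

# Flattening [a1, c1] at height h1 yields q1 and removes the first mode:
q1 = q[:a1] + [h1] * (c1 - a1 + 1) + q[c1 + 1:]
dtv = 0.5 * sum(abs(x - y) for x, y in zip(q, q1))
assert dtv <= 2 * tau1 + 1e-12                    # d_TV(q1, q) <= 2*tau1
assert all(x <= y for x, y in zip(q1, q1[1:]))    # here q1 is non-decreasing
```

In this tiny example a single flattening stage already produces a non-decreasing distribution; in general up to $k$ stages are needed.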
We therefore conclude that $\sum_{i=1}^{\ell} \tau_i = \sum_{i=1}^{\ell} T(q, a_i, b_i, c_i)$, which is bounded from above by $\tau/2$ by (15). This establishes (16), completing the proof of Lemma 11.

The upper bound on the sample complexity of the algorithm is straightforward, since only Step 1 uses samples. It remains to analyze the running time. The only non-trivial computation is in Step 2, where we need to decide whether there exist $\ell \le k$ "ordered triples" $\{a_i, b_i, c_i\}_{i=1}^{\ell} \subseteq s'$ with $a_i \le b_i < c_i < a_{i+1}$, $i \in [\ell-1]$, such that $\sum_{i=1}^{\ell} T(\hat q, a_i, b_i, c_i - 1) \ge \tau/4$. Even though a naive brute-force implementation would need $\Omega(r^k) \cdot \log n$ time, there is a simple dynamic programming algorithm that runs in $\mathrm{poly}(r, k) \cdot \log n$ time. We now provide the details.

Consider the objective function
$$T(\ell) = \max\left\{ \sum_{i=1}^{\ell} T(\hat q, a_i, b_i, c_i - 1) \;\middle|\; \{a_i, b_i, c_i\}_{i=1}^{\ell} \subseteq s' \text{ with } a_i \le b_i < c_i < a_{i+1},\ i \in [\ell-1] \right\}$$
for $\ell \in [k]$. We want to decide whether $\max_{\ell \le k} T(\ell) \ge \tau/4$. For $\ell \in [k]$ and $j \in [r']$, we use dynamic programming to compute the quantities
$$T(\ell, j) = \max\left\{ \sum_{i=1}^{\ell} T(\hat q, a_i, b_i, c_i - 1) \;\middle|\; \{a_i, b_i, c_i\}_{i=1}^{\ell} \subseteq s' \text{ with } a_i \le b_i < c_i < a_{i+1},\ i \in [\ell-1], \text{ and } c_\ell = s_j \right\}.$$
(This clearly suffices, as $T(\ell) = \max_{j \in [r']} T(\ell, j)$.) The dynamic program is based on the recursive identity
$$T(\ell+1, j) = \max_{j' \in [r'],\, j' < j} \left\{ T(\ell, j') + \max\left\{ T(\hat q, a, b, s_j - 1) \;\middle|\; a, b \in s' \text{ with } s_{j'} < a \le b < s_j \right\} \right\}.$$

Let $W_1 = \{ w \in W \mid h_1(w) > h_2(w) \}$. (21) Let then $p_1 = h_1(W_1)$ and $q_1 = h_2(W_1)$. Clearly, $p_1 > q_1$ and $d_{TV}(h_1, h_2) = p_1 - q_1$. The competition between $h_1$ and $h_2$ is carried out as follows:

1. If $p_1 - q_1 \le 5\epsilon'$, declare a draw and return either $h_i$. Otherwise:

2. Draw $m = O\!\left(\frac{\log(1/\delta')}{\epsilon'^2}\right)$ samples $s_1, \ldots, s_m$ from $p$, and let $\tau = \frac{1}{m} |\{ i \mid s_i \in W_1 \}|$ be the fraction of samples that fall inside $W_1$.

3. If $\tau > p_1 - \frac{3}{2}\epsilon'$, declare $h_1$ as winner and return $h_1$; otherwise,
4. if $\tau < q_1 + \frac{3}{2}\epsilon'$, declare $h_2$ as winner and return $h_2$; otherwise,

5. declare a draw and return either $h_i$.

It is not hard to check that the outcome of the competition does not depend on the ordering of the pair of distributions provided in the input; that is, on inputs $(h_1, h_2)$ and $(h_2, h_1)$ the competition outputs the same result for a fixed sequence of samples $s_1, \ldots, s_m$ drawn from $p$. The correctness of Choose-Hypothesis is an immediate consequence of the following lemma.

Lemma 14 Suppose that $d_{TV}(p, h_1) \le \epsilon'$. Then:

(i) If $d_{TV}(p, h_2) > 6\epsilon'$, then the probability that the competition between $h_1$ and $h_2$ does not declare $h_1$ as the winner is at most $e^{-m\epsilon'^2/2}$. (Intuitively, if $h_2$ is very bad, then it is very likely that $h_1$ will be declared winner.)

(ii) If $d_{TV}(p, h_2) > 4\epsilon'$, the probability that the competition between $h_1$ and $h_2$ declares $h_2$ as the winner is at most $e^{-m\epsilon'^2/2}$. (Intuitively, if $h_2$ is only moderately bad, then a draw is possible but it is very unlikely that $h_2$ will be declared winner.)

Proof: Let $r = p(W_1)$. The definition of the total variation distance implies that $|r - p_1| \le \epsilon'$. Let us define the 0/1 (indicator) random variables $\{Z_j\}_{j=1}^{m}$ by $Z_j = 1$ iff $s_j \in W_1$. Clearly, $\tau = \frac{1}{m} \sum_{j=1}^{m} Z_j$ and $E[\tau] = E[Z_j] = r$. Since the $Z_j$'s are mutually independent, it follows from the Chernoff bound that $\Pr[\tau \le r - \epsilon'/2] \le e^{-m\epsilon'^2/2}$. Using $|r - p_1| \le \epsilon'$, we get that $\Pr[\tau \le p_1 - 3\epsilon'/2] \le e^{-m\epsilon'^2/2}$.

• For part (i): If $d_{TV}(p, h_2) > 6\epsilon'$, the triangle inequality gives $p_1 - q_1 = d_{TV}(h_1, h_2) > 5\epsilon'$. Hence, the algorithm will go beyond Step 1, and with probability at least $1 - e^{-m\epsilon'^2/2}$ it will stop at Step 3, declaring $h_1$ the winner of the competition between $h_1$ and $h_2$.
• For part (ii): If $p_1 - q_1 \le 5\epsilon'$, then the competition declares a draw, hence $h_2$ is not the winner. Otherwise we have $p_1 - q_1 > 5\epsilon'$, and the above arguments imply that the competition between $h_1$ and $h_2$ will declare $h_2$ as the winner with probability at most $e^{-m\epsilon'^2/2}$.

This concludes the proof of Lemma 14. The proof of the theorem is now complete.

C Using the Hypothesis Tester

In this section we explain in detail how we use the hypothesis testing algorithm Choose-Hypothesis throughout this paper. In particular, the algorithm Choose-Hypothesis is used in the following places:

• In Step 4 of algorithm Learn-kmodal-simple we need an algorithm $L^{\downarrow}_{\delta'}$ (resp. $L^{\uparrow}_{\delta'}$) that learns a non-increasing (resp. non-decreasing) distribution within total variation distance $\epsilon$ and confidence $\delta'$. Note that the corresponding algorithms $L^{\downarrow}$ and $L^{\uparrow}$ provided by Theorem 4 have confidence $9/10$. To boost the confidence of $L^{\downarrow}$ (resp. $L^{\uparrow}$) we run the algorithm $O(\log(1/\delta'))$ times and use Choose-Hypothesis in an appropriate tournament procedure to select among the candidate hypothesis distributions.

• In Step 5 of algorithm Learn-kmodal-simple we need to select among two candidate hypothesis distributions (with the promise that at least one of them is close to the true conditional distribution). In this case, we run Choose-Hypothesis once to select between the two candidates.

• Also note that both algorithms Learn-kmodal-simple and Learn-kmodal generate an $\epsilon$-accurate hypothesis with probability $9/10$. We would like to boost the probability of success to $1 - \delta$. To achieve this we again run the corresponding algorithm $O(\log(1/\delta))$ times and use Choose-Hypothesis in an appropriate tournament to select among the candidate hypothesis distributions.

We now formally describe the "tournament" algorithm that boosts the confidence to $1 - \delta$.
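For concreteness, the competition described above can be sketched as follows. This is our own minimal transcription of Steps 1–5: the dictionary representation of a hypothesis, the function name, and the convention of returning 1, 2, or 0 for "$h_1$ wins," "$h_2$ wins," or "draw" are ours, and the samples from $p$ are passed in rather than drawn inside.

```python
def choose_hypothesis(h1, h2, samples, eps):
    """One competition of Choose-Hypothesis.  h1, h2 map domain points to
    probabilities; samples are draws from the unknown p; eps plays eps'."""
    # W1 = {w : h1(w) > h2(w)}, so p1 - q1 = d_TV(h1, h2).
    support = set(h1) | set(h2)
    W1 = {w for w in support if h1.get(w, 0.0) > h2.get(w, 0.0)}
    p1 = sum(h1.get(w, 0.0) for w in W1)   # h1(W1)
    q1 = sum(h2.get(w, 0.0) for w in W1)   # h2(W1)
    if p1 - q1 <= 5 * eps:                 # Step 1: declare a draw
        return 0
    # Step 2: fraction of the samples that fall inside W1
    tau = sum(1 for s in samples if s in W1) / len(samples)
    if tau > p1 - 1.5 * eps:               # Step 3: h1 wins
        return 1
    if tau < q1 + 1.5 * eps:               # Step 4: h2 wins
        return 2
    return 0                               # Step 5: draw
```

For example, with $h_1 = (0.8, 0.2)$, $h_2 = (0.2, 0.8)$ over a two-point domain and $\epsilon' = 0.05$, a sample consisting entirely of the first point declares $h_1$ the winner.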
Lemma 15 Let $p$ be any distribution over a finite set $W$. Suppose that $D_\epsilon$ is a collection of $N$ distributions over $W$ such that there exists $q \in D_\epsilon$ with $d_{TV}(p, q) \le \epsilon$. Then there is an algorithm that uses $O(\epsilon^{-2} \log N \log(1/\delta))$ samples from $p$ and with probability $1 - \delta$ outputs a distribution $p' \in D_\epsilon$ that satisfies $d_{TV}(p, p') \le 6\epsilon$.

Devroye and Lugosi (Chapter 7 of [DL01]) prove a similar result by having all pairs of distributions in the cover compete against each other using their notion of a competition, but again there are some small differences: their approach chooses a distribution in the cover that wins the maximum number of competitions, whereas our algorithm chooses a distribution that is never defeated (i.e., that won or achieved a draw against all other distributions in the cover). Instead we follow the approach from [DDS12].

Proof: The algorithm performs a tournament by running the competition Choose-Hypothesis$^p(h_i, h_j, \epsilon, \delta/(2N))$ for every pair of distinct distributions $h_i, h_j$ in the collection $D_\epsilon$. It outputs a distribution $q^\star \in D_\epsilon$ that was never a loser (i.e., won or achieved a draw in all its competitions). If no such distribution exists in $D_\epsilon$, the algorithm outputs "failure."

By definition, there exists some $q \in D_\epsilon$ such that $d_{TV}(p, q) \le \epsilon$. We first argue that with high probability this distribution $q$ never loses a competition against any other $q' \in D_\epsilon$ (so the algorithm does not output "failure"). Consider any $q' \in D_\epsilon$. If $d_{TV}(p, q') > 4\epsilon$, by Lemma 14(ii) the probability that $q$ loses to $q'$ is at most $2e^{-m\epsilon^2/2} \le \delta/(2N)$. On the other hand, if $d_{TV}(p, q') \le 4\epsilon$, the triangle inequality gives that $d_{TV}(q, q') \le 5\epsilon$, and thus $q$ draws against $q'$. A union bound over all $N$ distributions in $D_\epsilon$ shows that with probability $1 - \delta/2$, the distribution $q$ never loses a competition.
We next argue that with probability at least $1 - \delta/2$, every distribution $q' \in D_\epsilon$ that never loses has small variation distance from $p$. Fix a distribution $q'$ such that $d_{TV}(q', p) > 6\epsilon$; Lemma 14(i) implies that $q'$ loses to $q$ with probability $1 - 2e^{-m\epsilon^2/2} \ge 1 - \delta/(2N)$. A union bound gives that with probability $1 - \delta/2$, every distribution $q'$ with $d_{TV}(q', p) > 6\epsilon$ loses some competition.

Thus, with overall probability at least $1 - \delta$, the tournament does not output "failure" and outputs some distribution $q^\star$ such that $d_{TV}(p, q^\star)$ is at most $6\epsilon$. This proves the lemma.

We now explain how the above lemma is used in our context. Suppose we perform $O(\log(1/\delta))$ runs of a learning algorithm that constructs an $\epsilon$-accurate hypothesis with probability at least $9/10$. Then, with failure probability at most $\delta/2$, at least one of the hypotheses generated is $\epsilon$-close to the true distribution in variation distance. Conditioning on this good event, we have a collection of distributions of cardinality $O(\log(1/\delta))$ that satisfies the assumption of the lemma. Hence, using $O\big((1/\epsilon^2) \cdot \log\log(1/\delta) \cdot \log(1/\delta)\big)$ samples we can learn to accuracy $6\epsilon$ and confidence $1 - \delta/2$. The overall sample complexity is $O(\log(1/\delta))$ times the sample complexity of the learning algorithm run with confidence $9/10$, plus this additional $O\big((1/\epsilon^2) \cdot \log\log(1/\delta) \cdot \log(1/\delta)\big)$ term.

In terms of running time, we make the following easily verifiable remarks: When the hypothesis testing algorithm Choose-Hypothesis is run on a pair of distributions that are produced by Birgé's algorithm, its running time is polynomial in the succinct description of these distributions, i.e., in $\log^2(n)/\epsilon$.
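The all-pairs tournament in the proof of Lemma 15 can be sketched generically as follows. This is a hedged sketch of ours, not the paper's code: `compete(i, j)` stands for a single run of Choose-Hypothesis on candidates $i$ and $j$ and is assumed to return the index of the winner, or `None` on a draw; the function name and interface are our own.

```python
def tournament(candidates, compete):
    """Run the competition for every pair of candidates and return one that
    is never a loser (won or drew all its competitions), or None if every
    candidate loses some competition (the algorithm's "failure" output)."""
    n = len(candidates)
    lost = [False] * n
    for i in range(n):
        for j in range(i + 1, n):
            w = compete(i, j)            # index of the winner, or None on a draw
            if w == i:
                lost[j] = True
            elif w == j:
                lost[i] = True
    for i in range(n):
        if not lost[i]:
            return candidates[i]
    return None                          # no never-loser: output "failure"
```

With the guarantees of Lemma 14, the candidate that is $\epsilon$-close to $p$ is, with high probability, never a loser, and any surviving candidate is $6\epsilon$-close to $p$.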
Similarly, when Choose-Hypothesis is run on a pair of outputs of Learn-kmodal-simple or Learn-kmodal, its running time is polynomial in the succinct description of these distributions. More specifically, in the former case the succinct description has bit complexity $O\big(k \cdot \log^2(n)/\epsilon^2\big)$ (since the output consists of $O(k/\epsilon)$ monotone intervals, and the conditional distribution on each interval is the output of Birgé's algorithm for that interval). In the latter case the succinct description has bit complexity $O\big(k \cdot \log^2(n)/\epsilon\big)$, since the algorithm Learn-kmodal constructs only $k$ monotone intervals. Hence, in both cases, each execution of the testing algorithm performs $\mathrm{poly}(k, \log n, 1/\epsilon)$ bit operations. Since the tournament invokes the algorithm Choose-Hypothesis $O(\log^2(1/\delta))$ times (once for every pair of distributions in our pool of $O(\log(1/\delta))$ candidates), the upper bound on the running time follows.

References

[Bir87a] L. Birgé. Estimating a density under order restrictions: Nonasymptotic minimax risk. Annals of Statistics, 15(3):995–1012, 1987.
[Bir87b] L. Birgé. On the risk of histograms for estimating decreasing densities. Annals of Statistics, 15(3):1013–1022, 1987.
[Bir97] L. Birgé. Estimation of unimodal densities without smoothness assumptions. Annals of Statistics, 25(3):970–981, 1997.
[BKR04] T. Batu, R. Kumar, and R. Rubinfeld. Sublinear algorithms for testing monotone and unimodal distributions. In Proceedings of the 36th Symposium on Theory of Computing, pages 381–390, 2004.
[CKC83] L. Cobb, P. Koppstein, and N. H. Chen. Estimation and moment recursion relations for multimodal distributions of the exponential family. J. American Statistical Association, 78(381):124–130, 1983.
[CT04] K. S. Chan and H. Tong. Testing for multimodality with dependent data. Biometrika, 91(1):113–123, 2004.
[DDS12] C. Daskalakis, I. Diakonikolas, and R. A. Servedio. Learning Poisson Binomial Distributions. In Proceedings of the 44th Symposium on Theory of Computing, pages 709–728, 2012.
[DL96a] L. Devroye and G. Lugosi. Nonasymptotic universal smoothing factors, kernel complexity and Yatracos classes. Annals of Statistics, 25:2626–2637, 1996.
[DL96b] L. Devroye and G. Lugosi. A universally acceptable smoothing factor for kernel density estimation. Annals of Statistics, 24:2499–2512, 1996.
[DL01] L. Devroye and G. Lugosi. Combinatorial Methods in Density Estimation. Springer Series in Statistics, Springer, 2001.
[dTF90] G. A. de Toledo and J. M. Fernandez. Patch-clamp measurements reveal multimodal distribution of granule sizes in rat mast cells. Journal of Cell Biology, 110(4):1033–1039, 1990.
[FPP+98] F. R. Ferraro, B. Paltrinieri, F. F. Pecci, R. T. Rood, and B. Dorman. Multimodal distributions along the horizontal branch. The Astrophysical Journal, 500:311–319, 1998.
[GGR98] O. Goldreich, S. Goldwasser, and D. Ron. Property testing and its connection to learning and approximation. Journal of the ACM, 45:653–750, 1998.
[Gol10] O. Goldreich, editor. Property Testing: Current Research and Surveys. Springer, 2010. LNCS 6390.
[Gol11] O. Goldreich. Highlights of the Bertinoro workshop on Sublinear Algorithms (unpublished comments). Posted at http://www.wisdom.weizmann.ac.il/~oded/MC/072.html, accessed June 17, 2011.
[Gro85] P. Groeneboom. Estimating a monotone density. In Proc. of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, pages 539–555, 1985.
[Kem91] J. H. B. Kemperman. Mixtures with a limited number of modal intervals. Annals of Statistics, 19(4):2120–2144, 1991.
[KR00] M. Kearns and D. Ron. Testing problems with sub-learning sample complexity. J. Comp. Sys. Sci., 61:428–456, 2000.
[Mur64] E. A. Murphy. One cause? Many causes?: The argument from the bimodal distribution. J. Chronic Diseases, 17(4):301–324, 1964.
[Mye81] R. B. Myerson. Optimal auction design. Mathematics of Operations Research, 6:58–73, 1981.
[NS60] D. J. Newman and L. Shepp. The double dixie cup problem. The American Mathematical Monthly, 67(1):58–61, 1960.
[Rao69] B. L. S. Prakasa Rao. Estimation of a unimodal density. Sankhya Ser. A, 31:23–36, 1969.
[Ron08] D. Ron. Property Testing: A Learning Theory Perspective. Foundations and Trends in Machine Learning, 1(3):307–402, 2008.
[Ron10] D. Ron. Algorithmic and analysis techniques in property testing. Foundations and Trends in Theoretical Computer Science, 5:73–205, 2010.
[Weg70] E. J. Wegman. Maximum likelihood estimation of a unimodal density. I. and II. Ann. Math. Statist., 41:457–471, 2169–2174, 1970.
[Yat85] Y. G. Yatracos. Rates of convergence of minimum distance estimators and Kolmogorov's entropy. Annals of Statistics, 13:768–774, 1985.
