Testing $k$-Modal Distributions: Optimal Algorithms via Reductions
We give highly efficient algorithms, and almost matching lower bounds, for a range of basic statistical problems that involve testing and estimating the $L_1$ distance between two $k$-modal distributions $p$ and $q$ over the discrete domain $\{1,\dots,n\}$.
Authors: Constantinos Daskalakis, Ilias Diakonikolas, Rocco A. Servedio, Gregory Valiant, Paul Valiant
Constantinos Daskalakis* (MIT), Ilias Diakonikolas† (UC Berkeley), Rocco A. Servedio‡ (Columbia University), Gregory Valiant§ (UC Berkeley), Paul Valiant¶ (UC Berkeley)

May 1, 2018

Abstract

We give highly efficient algorithms, and almost matching lower bounds, for a range of basic statistical problems that involve testing and estimating the $L_1$ (total variation) distance between two $k$-modal distributions $p$ and $q$ over the discrete domain $\{1,\dots,n\}$. More precisely, we consider the following four problems: given sample access to an unknown $k$-modal distribution $p$,

Testing identity to a known or unknown distribution:

1. Determine whether $p = q$ (for an explicitly given $k$-modal distribution $q$) versus $p$ is $\epsilon$-far from $q$;
2. Determine whether $p = q$ (where $q$ is available via sample access) versus $p$ is $\epsilon$-far from $q$;

Estimating $L_1$ distance ("tolerant testing") against a known or unknown distribution:

3. Approximate $d_{TV}(p,q)$ to within additive $\epsilon$, where $q$ is an explicitly given $k$-modal distribution;
4. Approximate $d_{TV}(p,q)$ to within additive $\epsilon$, where $q$ is available via sample access.

For each of these four problems we give sub-logarithmic-sample algorithms, which we show are tight up to additive $\mathrm{poly}(k)$ and multiplicative $\mathrm{polylog}\log n + \mathrm{polylog}\,k$ factors. Our bounds thus significantly improve the previous results of [BKR04], which were for testing identity of distributions (items (1) and (2) above) in the special cases $k = 0$ (monotone distributions) and $k = 1$ (unimodal distributions) and required $O((\log n)^3)$ samples. As our main conceptual contribution, we introduce a new reduction-based approach for distribution-testing problems that lets us obtain all the above results in a unified way.
Roughly speaking, this approach enables us to transform various distribution testing problems for $k$-modal distributions over $\{1,\dots,n\}$ into the corresponding distribution testing problems for unrestricted distributions over a much smaller domain $\{1,\dots,\ell\}$, where $\ell = O(k \log n)$.

* costis@csail.mit.edu. Research supported by NSF CAREER award CCF-0953960 and by a Sloan Foundation Fellowship.
† ilias@cs.berkeley.edu. Research supported by a Simons Foundation Postdoctoral Fellowship. Some of this work was done while at Columbia University, supported by NSF grant CCF-0728736, and by an Alexander S. Onassis Foundation Fellowship.
‡ rocco@cs.columbia.edu. Supported by NSF grants CCF-0347282 and CCF-0523664.
§ gregory.valiant@gmail.com. Supported by an NSF graduate research fellowship.
¶ pvaliant@gmail.com. Supported by an NSF postdoctoral research fellowship.

1 Introduction

Given samples from a pair of unknown distributions, the problem of "identity testing" (that is, distinguishing whether the two distributions are the same versus significantly different) and, more generally, the problem of estimating the $L_1$ distance between the distributions, is perhaps the most fundamental statistical task. Despite a long history of study, by both the statistics and computer science communities, the sample complexities of these basic tasks were only recently established. Identity testing, given samples from a pair of distributions of support $[n]$, can be done using $\tilde{O}(n^{2/3})$ samples [BFR+00], and this upper bound is optimal up to $\mathrm{polylog}(n)$ factors [Val08a]. Estimating the $L_1$ distance ("tolerant testing") between distributions of support $[n]$ requires $\Theta(n/\log n)$ samples, and this is tight up to constant factors [VV11a, VV11b].
The variants of these problems in which one of the two distributions is explicitly given require $\tilde{\Theta}(\sqrt{n})$ samples for identity testing [BFF+01] and $\Theta(n/\log n)$ samples for $L_1$ distance estimation [VV11a, VV11b], respectively.

While it is surprising that these tasks can be performed using a sublinear number of samples, for many real-world applications using $\sqrt{n}$, $n^{2/3}$, or $n/\log n$ samples is still impractical. As these bounds characterize worst-case instances, one might hope that drastically better performance may be possible for many settings typically encountered in practice. Thus, a natural research direction, which we pursue in this paper, is to understand how structural properties of the distributions in question may be leveraged to yield improved sample complexities. In this work we consider monotone, unimodal, and, more generally, $k$-modal distributions.

Monotone, unimodal, and bimodal distributions abound in the natural world. The distributions of many measurements (heights or weights of members of a population, concentrations of various chemicals in cells, parameters of many atmospheric phenomena) often belong to this class of distributions. Because of their ubiquity, much work in the natural sciences rests on the analysis of such distributions (for example, on November 1, 2011 a Google Scholar search for the exact phrase "bimodal distribution" in the bodies of papers returned more than 90,000 hits). Though perhaps not as pervasive, $k$-modal distributions for larger values of $k$ commonly arise as mixtures of unimodal distributions and are natural objects of study. On the theoretical side, motivated by the many applications, monotone, unimodal, and $k$-modal distributions have been intensively studied in the probability and statistics literatures for decades; see e.g. [Gre56, Rao69, BBBB72, CKC83, Gro85, Bir87a, Bir87b, Kem91, Fou97, CT04, JW09].
1.1 Our results. Our main results are algorithms, and nearly matching lower bounds, that give a complete picture of the sample complexities of identity testing and estimating $L_1$ distance for monotone and $k$-modal distributions. We obtain such results both in the setting where the two distributions are given via samples, and in the setting where one of the distributions is given via samples and the other is described explicitly. All our results have the nature of a reduction: performing these tasks on $k$-modal distributions over $[n]$ turns out to have almost exactly the same sample complexities as performing the corresponding tasks on arbitrary distributions over $[k \log n]$. For any small constant $k$ (or even $k = O((\log n)^{1/3})$) and arbitrarily small constant $\epsilon$, all our results are tight to within either $\mathrm{polylog}\log n$ or $\mathrm{polylog}\log\log n$ factors. See Table 1 for the new sample complexity upper and lower bounds for the monotone and $k$-modal tasks; see Section 2 for the (exponentially higher) sample complexities of the general-distribution tasks on which our results rely. While our main focus is on sample complexity rather than running time, we note that all of our algorithms run in $\mathrm{poly}(\log n, k, 1/\epsilon)$ bit operations (note that even reading a single sample from a distribution over $[n]$ takes $\log n$ bit operations).

We view the equivalence between the sample complexity of each of the above tasks on a monotone or unimodal distribution of domain $[n]$ and the sample complexity of the same task on an unrestricted distribution of domain $[\log n]$ as a surprising result, because such an equivalence fails to hold for related estimation tasks. For example, consider the task of distinguishing whether a distribution on $[n]$ is uniform versus far from uniform.
For general distributions this takes $\Theta(\sqrt{n})$ samples, so one might expect the corresponding problem for monotone distributions to need $\sqrt{\log n}$ samples; in fact, however, one can test this with a constant number of samples, by simply comparing the empirically observed probability masses of the left and right halves of the domain. An example in the other direction is the problem of finding a constant additive estimate of the entropy of a distribution. On domains of size $[n]$ this can be done in $n/\log n$ samples, and thus one might expect to be able to estimate entropy for monotone distributions on $[n]$ using $\log n / \log\log n$ samples. Nevertheless, it is not hard to see that $\Omega(\log^2 n)$ samples are required.

Testing problem | Our upper bound | Our lower bound

$p$, $q$ are both monotone:
  Testing identity, $q$ is known:    $O((\log n)^{1/2} (\log\log n) \cdot \epsilon^{-5/2})$ | $\Omega((\log n)^{1/2})$
  Testing identity, $q$ is unknown:  $O((\log n)^{2/3} (\log\log n) \cdot \epsilon^{-10/3})$ | $\Omega\big((\frac{\log n}{\log\log n})^{2/3}\big)$
  Estimating $L_1$ distance, $q$ is known:   $O\big(\frac{\log n}{\log\log n} \cdot \epsilon^{-3}\big)$ | $\Omega\big(\frac{\log n}{\log\log n \cdot \log\log\log n}\big)$
  Estimating $L_1$ distance, $q$ is unknown: $O\big(\frac{\log n}{\log\log n} \cdot \epsilon^{-3}\big)$ | $\Omega\big(\frac{\log n}{\log\log n \cdot \log\log\log n}\big)$

$p$, $q$ are both $k$-modal:
  Testing identity, $q$ is known:    $O\big(\frac{k^2}{\epsilon^4} + \frac{(k \log n)^{1/2}}{\epsilon^3} \cdot \log\frac{k \log n}{\epsilon}\big)$ | $\Omega((k \log n)^{1/2})$
  Testing identity, $q$ is unknown:  $O\big(\frac{k^2}{\epsilon^4} + \frac{(k \log n)^{2/3}}{\epsilon^{10/3}} \cdot \log\frac{k \log n}{\epsilon}\big)$ | $\Omega\big((\frac{k \log n}{\log(k \log n)})^{2/3}\big)$
  Estimating $L_1$ distance, $q$ is known:   $O\big(\frac{k^2}{\epsilon^4} + \frac{k \log n}{\log(k \log n)} \cdot \epsilon^{-4}\big)$ | $\Omega\big(\frac{k \log n}{\log(k \log n) \cdot \log\log(k \log n)}\big)$
  Estimating $L_1$ distance, $q$ is unknown: $O\big(\frac{k^2}{\epsilon^4} + \frac{k \log n}{\log(k \log n)} \cdot \epsilon^{-4}\big)$ | $\Omega\big(\frac{k \log n}{\log(k \log n) \cdot \log\log(k \log n)}\big)$

Table 1: Our upper and lower bounds for identity testing and $L_1$ estimation. In the table we omit a "$\log(1/\delta)$" term which is present in all our upper bounds for algorithms which give the correct answer with probability $1 - \delta$.
For the "testing identity" problems, our lower bounds are for distinguishing whether $p = q$ versus $d_{TV}(p,q) > 1/2$ with success probability $2/3$. For estimating $L_1$ distance, our bounds are for estimating $d_{TV}(p,q)$ to within $\pm\epsilon$, for any $k = O(n^{1/2})$, with the lower bounds corresponding to success probability $2/3$.

The reduction-like techniques which we use to establish both our algorithmic results and our lower bounds (discussed in more detail in Section 1.2 below) reveal an unexpected relationship between the class of $k$-modal distributions of support $[n]$ and the class of general distributions of support $[k \log n]$. We hope that this reduction-based approach may provide a framework for the discovery of other relationships that will be useful in future work in the extreme sublinear regime of statistical property estimation and property testing.

Comparison with prior work. Our results significantly extend and improve upon the previous algorithmic results of Batu et al. [BKR04] for identity testing of monotone or unimodal ($k = 1$) distributions, which required $O(\log^3 n)$ samples. More recently, [DDS11] established the sample complexity of learning $k$-modal distributions to be essentially $\Theta(k \log(n)\, \epsilon^{-3})$. Such a learning algorithm easily yields a testing algorithm with the same sample complexity for all four variants of the testing problem (one can simply run the learner twice to obtain hypotheses $\hat{p}$ and $\hat{q}$ that are sufficiently close to $p$ and $q$ respectively, and output accordingly). While the [DDS11] result can be applied to our testing problems (though giving suboptimal results), we stress that the ideas underlying [DDS11] and this paper are quite different.
The [DDS11] paper learns a $k$-modal distribution by using a known algorithm for learning monotone distributions [Bir87b] $k$ times in a black-box manner; the notion of reducing the domain size, which we view as central to the results and contributions of this paper, is nowhere present in [DDS11]. By contrast, the focus of this paper is on introducing the use of reductions as a powerful (but, surprisingly, seemingly previously unused) tool in the development of algorithms for basic statistical tasks on distributions, which, at least in this case, is capable of giving essentially optimal upper and lower bounds for natural restricted classes of distributions.

1.2 Techniques. Our main conceptual contribution is a new reduction-based approach that lets us obtain all our upper and lower bounds in a clean and unified way. The approach works by reducing the monotone and $k$-modal distribution testing problems to general distribution testing and estimation problems over a much smaller domain, and vice versa. For the monotone case this smaller domain is essentially of size $\log(n)/\epsilon$, and for the $k$-modal case the smaller domain is essentially of size $k \log(n)/\epsilon^2$. By solving the general distribution problems over the smaller domain using known results, we get a valid answer for the original (monotone or $k$-modal) problems over domain $[n]$. More details on our algorithmic reduction are given in Section A.

Conversely, our lower bound reduction lets us re-express arbitrary distributions over a small domain $[\ell]$ by monotone (or unimodal, or $k$-modal, as required) distributions over an exponentially larger domain, while preserving many of their features with respect to the $L_1$ distance.
Crucially, this reduction allows one to simulate drawing samples from the larger monotone distribution given access to samples from the smaller distribution, so that a known impossibility result for unrestricted distributions on $[\ell]$ may be leveraged to yield a corresponding impossibility result for monotone (or unimodal, or $k$-modal) distributions on the exponentially larger domain.

The inspiration for our results is an observation of Birgé [Bir87b]: given a monotone-decreasing probability distribution over $[n]$, if one subdivides $[n]$ into an exponentially increasing series of consecutive sub-intervals, the $i$th having size $(1+\epsilon)^i$, and then replaces the probability mass on each interval with a uniform distribution on that interval, the distribution changes by only $O(\epsilon)$ in total variation distance. Further, given such a subdivision of the support into $\log_{1+\epsilon}(n)$ intervals, one may essentially treat the original monotone distribution as a distribution over these intervals, namely a distribution of support $\log_{1+\epsilon}(n)$. In this way, one may hope to reduce monotone distribution testing or estimation on $[n]$ to general distribution testing or estimation on a domain of size $\log_{1+\epsilon}(n)$, and vice versa. See Section B for details.

For the monotone testing problems the partition into subintervals is constructed obliviously (without drawing any samples or making any reference to $p$ or $q$ of any sort): for a given value of $\epsilon$ the partition is the same for all non-increasing distributions. For the $k$-modal testing problems, constructing the desired partition is significantly more involved. This is done via a careful procedure which uses $k^2 \cdot \mathrm{poly}(1/\epsilon)$ samples¹ from $p$ and $q$ and uses the oblivious decomposition for monotone distributions in a delicate way. This construction is given in Section C.

2 Notation and Preliminaries

2.1 Notation.
We write $[n]$ to denote the set $\{1,\dots,n\}$, and for integers $i \le j$ we write $[i,j]$ to denote the set $\{i, i+1, \dots, j\}$. We consider discrete probability distributions over $[n]$, which are functions $p : [n] \to [0,1]$ such that $\sum_{i=1}^{n} p(i) = 1$. For $S \subseteq [n]$ we write $p(S)$ to denote $\sum_{i \in S} p(i)$. We use the notation $P$ for the cumulative distribution function (cdf) corresponding to $p$, i.e. $P : [n] \to [0,1]$ is defined by $P(j) = \sum_{i=1}^{j} p(i)$.

A distribution $p$ over $[n]$ is non-increasing (resp. non-decreasing) if $p(i+1) \le p(i)$ (resp. $p(i+1) \ge p(i)$) for all $i \in [n-1]$; $p$ is monotone if it is either non-increasing or non-decreasing. Thus the "orientation" of a monotone distribution is either non-decreasing (denoted ↑) or non-increasing (denoted ↓).

We call a nonempty interval $I = [a,b] \subseteq [2, n-1]$ a max-interval of $p$ if $p(i) = c$ for all $i \in I$ and $\max\{p(a-1), p(b+1)\} < c$. Analogously, a min-interval of $p$ is an interval $I = [a,b] \subseteq [2, n-1]$ with $p(i) = c$ for all $i \in I$ and $\min\{p(a-1), p(b+1)\} > c$. We say that $p$ is $k$-modal if it has at most $k$ max-intervals and min-intervals. We note that, according to our definition, what is usually referred to as a bimodal distribution is a 3-modal distribution.

Let $p, q$ be distributions over $[n]$ with corresponding cdfs $P$, $Q$. The total variation distance between $p$ and $q$ is $d_{TV}(p,q) := \max_{S \subseteq [n]} |p(S) - q(S)| = (1/2) \sum_{i \in [n]} |p(i) - q(i)|$. The Kolmogorov distance between $p$ and $q$ is defined as $d_K(p,q) := \max_{j \in [n]} |P(j) - Q(j)|$. Note that $d_K(p,q) \le d_{TV}(p,q)$. Finally, a sub-distribution is a function $q : [n] \to [0,1]$ which satisfies $\sum_{i=1}^{n} q(i) \le 1$.
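The two distances just defined can be computed directly from the probability vectors; a minimal sketch (the example vectors are ours):

```python
def d_tv(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def d_k(p, q):
    """Kolmogorov distance: the largest gap between the two CDFs."""
    gap = cp = cq = 0.0
    for pi, qi in zip(p, q):
        cp, cq = cp + pi, cq + qi
        gap = max(gap, abs(cp - cq))
    return gap

p = [0.5, 0.0, 0.5, 0.0]
q = [0.25, 0.25, 0.25, 0.25]
print(d_tv(p, q), d_k(p, q))  # 0.5 0.25
```

The example also exhibits the inequality noted above: $d_K(p,q) = 0.25$ is strictly smaller than $d_{TV}(p,q) = 0.5$.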
For $p$ a distribution over $[n]$ and $I \subseteq [n]$, the restriction of $p$ to $I$ is the sub-distribution $p_I$ defined by $p_I(i) = p(i)$ if $i \in I$ and $p_I(i) = 0$ otherwise. Likewise, we denote by $p^I$ the conditional distribution of $p$ on $I$, i.e. $p^I(i) = p(i)/p(I)$ if $i \in I$ and $p^I(i) = 0$ otherwise.

2.2 Basic tools from probability. We will require the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality [DKW56] from probability theory. This basic fact says that $O(1/\epsilon^2)$ samples suffice to learn any distribution within error $\epsilon$ with respect to the Kolmogorov distance. More precisely, let $p$ be any distribution over $[n]$. Given $m$ independent samples $s_1, \dots, s_m$ drawn from $p : [n] \to [0,1]$, the empirical distribution $\hat{p}_m : [n] \to [0,1]$ is defined as follows: for all $i \in [n]$, $\hat{p}_m(i) = |\{j \in [m] \mid s_j = i\}| / m$. The DKW inequality states that for $m = \Omega((1/\epsilon^2) \cdot \ln(1/\delta))$, with probability $1 - \delta$ the empirical distribution $\hat{p}_m$ will be $\epsilon$-close to $p$ in Kolmogorov distance. This sample bound is asymptotically optimal and independent of the support size.

Theorem 1 ([DKW56, Mas90]). For all $\epsilon > 0$, it holds: $\Pr[d_K(p, \hat{p}_m) > \epsilon] \le 2 e^{-2m\epsilon^2}$.

Another simple result that we will need is the following, which is easily verified from first principles:

Observation 1. Let $I = [a,b]$ be an interval and let $u_I$ denote the uniform distribution over $I$. Let $p^I$ denote a non-increasing distribution over $I$. Then for every initial interval $I' = [a, b']$ of $I$, we have $u_I(I') \le p^I(I')$.

¹ Intuitively, the partition must be finer in regions of higher probability density; for non-increasing distributions (for example) this region is at the left side of the domain, but for general $k$-modal distributions, one must draw samples to discover the high-probability regions.
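The DKW guarantee can be exercised numerically: with $m = \lceil \ln(2/\delta) / (2\epsilon^2) \rceil$ samples (read off from Theorem 1), the empirical distribution is $\epsilon$-close in Kolmogorov distance with probability $1 - \delta$. A small sketch; the uniform true distribution is an arbitrary illustrative choice.

```python
import math
import random

def empirical(samples, n):
    """Empirical distribution hat-p_m over [n] = {1,...,n}."""
    m = len(samples)
    counts = [0] * n
    for s in samples:
        counts[s - 1] += 1
    return [c / m for c in counts]

def d_k(p, q):
    """Kolmogorov distance between two distributions on [n]."""
    gap = cp = cq = 0.0
    for pi, qi in zip(p, q):
        cp, cq = cp + pi, cq + qi
        gap = max(gap, abs(cp - cq))
    return gap

random.seed(1)
n, eps, delta = 50, 0.1, 1e-6
m = math.ceil(math.log(2 / delta) / (2 * eps ** 2))  # from Theorem 1
p = [1 / n] * n                      # true distribution (illustrative)
samples = [random.randint(1, n) for _ in range(m)]
print(m, d_k(p, empirical(samples, n)))  # d_K comes out well below eps
```

Note that $m$ depends only on $\epsilon$ and $\delta$, not on the support size $n$, exactly as the theorem promises.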
2.3 Testing and estimation for arbitrary distributions. Our testing algorithms work by reducing to known algorithms for testing arbitrary distributions over an $\ell$-element domain. We will use the following well-known results:

Theorem 2 (testing identity, known distribution [BFF+01]). Let $q$ be an explicitly given distribution over $[\ell]$. Let $p$ be an unknown distribution over $[\ell]$ that is accessible via samples. There is a testing algorithm TEST-IDENTITY-KNOWN($p, q, \epsilon, \delta$) that uses $s_{IK}(\ell, \epsilon, \delta) := O(\ell^{1/2} \log(\ell)\, \epsilon^{-2} \log(1/\delta))$ samples from $p$ and has the following properties:

• If $p \equiv q$ then with probability at least $1 - \delta$ the algorithm outputs "accept"; and
• If $d_{TV}(p,q) \ge \epsilon$ then with probability at least $1 - \delta$ the algorithm outputs "reject."

Theorem 3 (testing identity, unknown distribution [BFR+10]). Let $p$ and $q$ both be unknown distributions over $[\ell]$ that are accessible via samples. There is a testing algorithm TEST-IDENTITY-UNKNOWN($p, q, \epsilon, \delta$) that uses $s_{IU}(\ell, \epsilon, \delta) := O(\ell^{2/3} \log(\ell/\delta)\, \epsilon^{-8/3})$ samples from $p$ and $q$ and has the following properties:

• If $p \equiv q$ then with probability at least $1 - \delta$ the algorithm outputs "accept"; and
• If $d_{TV}(p,q) \ge \epsilon$ then with probability at least $1 - \delta$ the algorithm outputs "reject."

Theorem 4 ($L_1$ estimation [VV11b]). Let $p$ be an unknown distribution over $[\ell]$ that is accessible via samples, and let $q$ be a distribution over $[\ell]$ that is either explicitly given, or accessible via samples. There is an estimator L1-ESTIMATE($p, q, \epsilon, \delta$) that, with probability at least $1 - \delta$, outputs a value in the interval $(d_{TV}(p,q) - \epsilon,\, d_{TV}(p,q) + \epsilon)$. The algorithm uses $s_E(\ell, \epsilon, \delta) := O\big(\frac{\ell}{\log \ell} \cdot \epsilon^{-2} \log(1/\delta)\big)$ samples.
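For intuition about why the reduction pays off, one can compare these sample-complexity expressions on a reduced domain of size $\ell \approx k \log n$ versus the original domain of size $n$. In the sketch below the hidden big-O constants are set to 1, so the absolute numbers are only indicative; the comparison of growth rates is the point.

```python
import math

def s_ik(l, eps, delta):
    """Theorem 2 bound with constants suppressed."""
    return math.sqrt(l) * math.log(l) / eps**2 * math.log(1 / delta)

def s_iu(l, eps, delta):
    """Theorem 3 bound with constants suppressed."""
    return l ** (2 / 3) * math.log(l / delta) / eps ** (8 / 3)

def s_e(l, eps, delta):
    """Theorem 4 bound with constants suppressed."""
    return l / math.log(l) / eps**2 * math.log(1 / delta)

n, k, eps, delta = 10**9, 3, 0.1, 0.1
l = k * math.ceil(math.log(n))   # reduced domain size, about k log n
for f in (s_ik, s_iu, s_e):
    print(f.__name__, round(f(l, eps, delta)), "vs", round(f(n, eps, delta)))
```

For each task the reduced-domain cost is exponentially smaller than the cost on the original domain, which is the quantitative content of the reduction.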
3 Testing and Estimating Monotone Distributions

3.1 Oblivious decomposition of monotone distributions. Our main tool for testing monotone distributions is an oblivious decomposition of monotone distributions that is a variant of a construction of Birgé [Bir87b]. As we will see, it enables us to reduce the problem of testing a monotone distribution to the problem of testing an arbitrary distribution over a much smaller domain.

Before stating the decomposition, some notation will be helpful. Fix a distribution $p$ over $[n]$ and a partition of $[n]$ into disjoint intervals $\mathcal{I} := \{I_i\}_{i=1}^{\ell}$. The flattened distribution $(p^f)_{\mathcal{I}}$ corresponding to $p$ and $\mathcal{I}$ is the distribution over $[n]$ defined as follows: for $j \in [\ell]$ and $i \in I_j$, $(p^f)_{\mathcal{I}}(i) = \sum_{t \in I_j} p(t) / |I_j|$. That is, $(p^f)_{\mathcal{I}}$ is obtained from $p$ by averaging the weight that $p$ assigns to each interval over the entire interval. The reduced distribution $(p^r)_{\mathcal{I}}$ corresponding to $p$ and $\mathcal{I}$ is the distribution over $[\ell]$ that assigns the $i$th point the weight $p$ assigns to the interval $I_i$; i.e., for $i \in [\ell]$, we have $(p^r)_{\mathcal{I}}(i) = p(I_i)$. Note that if $p$ is non-increasing then so is $(p^f)_{\mathcal{I}}$, but this is not necessarily the case for $(p^r)_{\mathcal{I}}$.

Definition 1. Let $p$ be a distribution over $[n]$ and let $\mathcal{I} = \{I_i\}_{i=1}^{\ell}$ be a partition of $[n]$ into disjoint intervals. We say that $\mathcal{I}$ is a $(p, \epsilon, \ell)$-flat decomposition of $[n]$ if $d_{TV}(p, (p^f)_{\mathcal{I}}) \le \epsilon$.

The following simple lemma, proved in Section A, shows why reduced distributions are useful for us:

Lemma 2. Let $\mathcal{I} = \{I_i\}_{i=1}^{\ell}$ be a partition of $[n]$ into disjoint intervals. Suppose that $p$ and $q$ are distributions over $[n]$ such that $\mathcal{I}$ is both a $(p, \epsilon, \ell)$-flat decomposition of $[n]$ and a $(q, \epsilon, \ell)$-flat decomposition of $[n]$. Then $\left| d_{TV}(p,q) - d_{TV}((p^r)_{\mathcal{I}}, (q^r)_{\mathcal{I}}) \right| \le 2\epsilon$.
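The flattened and reduced distributions, and the bound of Lemma 2, can be checked on a toy example. In this sketch intervals are half-open $[a, b)$ over a 0-indexed domain, and the probability vectors are arbitrary illustrative numbers; the check uses the slightly sharper triangle-inequality form $\epsilon_p + \epsilon_q$ in place of $2\epsilon$.

```python
def flattened(p, partition):
    """(p^f)_I: spread each interval's mass uniformly over the interval."""
    out = [0.0] * len(p)
    for a, b in partition:               # half-open interval [a, b)
        mass = sum(p[a:b])
        for i in range(a, b):
            out[i] = mass / (b - a)
    return out

def reduced(p, partition):
    """(p^r)_I: one point per interval, carrying that interval's mass."""
    return [sum(p[a:b]) for a, b in partition]

def d_tv(p, q):
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))

I = [(0, 2), (2, 4), (4, 8)]
p = [0.30, 0.20, 0.15, 0.15, 0.05, 0.05, 0.05, 0.05]
q = [0.20, 0.20, 0.20, 0.20, 0.05, 0.05, 0.05, 0.05]
eps_p = d_tv(p, flattened(p, I))   # how flat I is for p
eps_q = d_tv(q, flattened(q, I))   # how flat I is for q (0 here: q is flat)
gap = abs(d_tv(p, q) - d_tv(reduced(p, I), reduced(q, I)))
print(gap <= eps_p + eps_q)  # the Lemma 2 bound holds
```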
Moreover, if $p = q$ then $(p^r)_{\mathcal{I}} = (q^r)_{\mathcal{I}}$.

We now state our oblivious decomposition result for monotone distributions:

Theorem 5 (oblivious decomposition [Bir87b]). Fix any $n \in \mathbb{Z}^+$ and $\epsilon > 0$. The partition $\mathcal{I} := \{I_i\}_{i=1}^{\ell}$ of $[n]$, in which the $j$th interval has size $\lfloor (1+\epsilon)^j \rfloor$, has the following properties: $\ell = O((1/\epsilon) \cdot \log(\epsilon \cdot n + 1))$, and $\mathcal{I}$ is a $(p, O(\epsilon), \ell)$-flat decomposition of $[n]$ for any non-increasing distribution $p$ over $[n]$.

There is an analogous version of Theorem 5 asserting the existence of an "oblivious" partition for non-decreasing distributions (which is of course different from the "oblivious" partition $\mathcal{I}$ for non-increasing distributions of Theorem 5); this will be useful later. While our construction is essentially that of Birgé, we note that the version given in [Bir87b] is for non-increasing distributions over the continuous domain $[0, n]$, and it is phrased rather differently. Adapting the arguments of [Bir87b] to our discrete setting of distributions over $[n]$ is not conceptually difficult but requires some care. For the sake of being self-contained, we provide a proof of the discrete version stated above in Appendix E.

3.2 Efficiently testing monotone distributions. Now we are ready to establish our upper bounds on testing monotone distributions (given in the first four rows of Table 1). All of the algorithms are essentially the same: each works by reducing the given monotone distribution testing problem to the same testing problem for arbitrary distributions over support of size $\ell = O(\log n / \epsilon)$, using the oblivious decomposition from the previous subsection. For concreteness we explicitly describe the tester for the "testing identity, $q$ is known" case below, and then indicate the small changes that are necessary to get the testers for the other three cases.
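Theorem 5 can be exercised numerically. The sketch below (ours, with interval sizes clipped to at least 1 since the early sizes $\lfloor (1+\epsilon)^j \rfloor$ round down to 0 or 1) builds the oblivious partition over a 0-indexed domain and flattens a truncated geometric distribution, which is non-increasing; the resulting TV error is small while the number of intervals is far smaller than $n$.

```python
def birge_partition(n, eps):
    """Oblivious partition of {0,...,n-1} into consecutive intervals,
    the j-th of size max(1, floor((1+eps)**j))."""
    parts, start, j = [], 0, 0
    while start < n:
        size = max(1, int((1 + eps) ** j))
        parts.append((start, min(start + size, n)))   # half-open [a, b)
        start += size
        j += 1
    return parts

def flatten(p, parts):
    """Replace p's mass on each interval by its average over the interval."""
    out = [0.0] * len(p)
    for a, b in parts:
        mass = sum(p[a:b])
        for i in range(a, b):
            out[i] = mass / (b - a)
    return out

n, eps = 4096, 0.1
raw = [0.5 ** i for i in range(n)]       # non-increasing (geometric)
Z = sum(raw)
p = [x / Z for x in raw]
parts = birge_partition(n, eps)
tv = 0.5 * sum(abs(a - b) for a, b in zip(p, flatten(p, parts)))
print(len(parts), round(tv, 4))  # far fewer than n intervals; tv = O(eps)
```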
TEST-IDENTITY-KNOWN-MONOTONE
Inputs: $\epsilon, \delta > 0$; sample access to non-increasing distribution $p$ over $[n]$; explicit description of non-increasing distribution $q$ over $[n]$

1. Let $\mathcal{I} := \{I_i\}_{i=1}^{\ell}$, with $\ell = \Theta(\log(\epsilon n + 1)/\epsilon)$, be the partition of $[n]$ given by Theorem 5, which is a $(p', \epsilon/8, \ell)$-flat decomposition of $[n]$ for any non-increasing distribution $p'$.
2. Let $(q^r)_{\mathcal{I}}$ denote the reduced distribution over $[\ell]$ obtained from $q$ using $\mathcal{I}$, as defined in Section A.
3. Draw $m = s_{IK}(\ell, \epsilon/2, \delta)$ samples from $(p^r)_{\mathcal{I}}$, where $(p^r)_{\mathcal{I}}$ is the reduced distribution over $[\ell]$ obtained from $p$ using $\mathcal{I}$, as defined in Section A.
4. Output the result of TEST-IDENTITY-KNOWN($(p^r)_{\mathcal{I}}, (q^r)_{\mathcal{I}}, \epsilon/2, \delta$) on the samples from Step 3.

We now establish our claimed upper bound for the "testing identity, $q$ is known" case. We first observe that in Step 3, the desired $m = s_{IK}(\ell, \epsilon/2, \delta)$ samples from $(p^r)_{\mathcal{I}}$ can easily be obtained by drawing $m$ samples from $p$ and converting each one to the corresponding draw from $(p^r)_{\mathcal{I}}$ in the obvious way. If $p = q$ then $(p^r)_{\mathcal{I}} = (q^r)_{\mathcal{I}}$, and TEST-IDENTITY-KNOWN-MONOTONE outputs "accept" with probability at least $1 - \delta$ by Theorem 2. If $d_{TV}(p,q) \ge \epsilon$, then by Lemma 2, Theorem 5 and the triangle inequality we have that $d_{TV}((p^r)_{\mathcal{I}}, (q^r)_{\mathcal{I}}) \ge 3\epsilon/4$, so TEST-IDENTITY-KNOWN-MONOTONE outputs "reject" with probability at least $1 - \delta$ by Theorem 2.

For the "testing identity, $q$ is unknown" case, the algorithm TEST-IDENTITY-UNKNOWN-MONOTONE is very similar to TEST-IDENTITY-KNOWN-MONOTONE.
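The sample-conversion step used in the analysis (turning a draw from $p$ into a draw from $(p^r)_{\mathcal{I}}$) is just a lookup of which interval the sample falls in; a minimal sketch with a hypothetical partition of our choosing:

```python
import bisect

def reduce_sample(x, right_ends):
    """Map a draw x from p over [n] to the index of the partition interval
    containing x, i.e., to a draw from the reduced distribution (p^r)_I.
    right_ends[i] is the right endpoint of interval I_i."""
    return bisect.bisect_left(right_ends, x)

# Hypothetical partition of [1, 30]: [1,2], [3,6], [7,14], [15,30].
right_ends = [2, 6, 14, 30]
print([reduce_sample(x, right_ends) for x in [1, 2, 3, 14, 15, 30]])
# -> [0, 0, 1, 2, 3, 3]
```

Each conversion costs $O(\log \ell)$ time, so the reduction adds essentially no overhead to the tester it wraps.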
The differences are as follows: instead of Step 2, in Step 3 we draw $m = s_{IU}(\ell, \epsilon/2, \delta)$ samples from $(p^r)_{\mathcal{I}}$ and the same number of samples from $(q^r)_{\mathcal{I}}$; and in Step 4, we run TEST-IDENTITY-UNKNOWN($(p^r)_{\mathcal{I}}, (q^r)_{\mathcal{I}}, \epsilon/2, \delta$) using the samples from Step 3. The analysis is exactly the same as above (using Theorem 3 in place of Theorem 2).

We now describe the algorithm L1-ESTIMATE-KNOWN-MONOTONE for the "tolerant testing, $q$ is known" case. This algorithm takes values $\epsilon$ and $\delta$ as input, so the partition $\mathcal{I}$ defined in Step 1 is a $(p', \epsilon/4, \ell)$-flat decomposition of $[n]$ for any non-increasing $p'$. In Step 3 the algorithm draws $m = s_E(\ell, \epsilon/2, \delta)$ samples, and it runs L1-ESTIMATE($(p^r)_{\mathcal{I}}, (q^r)_{\mathcal{I}}, \epsilon/2, \delta$) in Step 4. If $d_{TV}(p,q) = c$ then by the triangle inequality we have that $d_{TV}((p^r)_{\mathcal{I}}, (q^r)_{\mathcal{I}}) \in [c - \epsilon/2, c + \epsilon/2]$, and L1-ESTIMATE-KNOWN-MONOTONE outputs a value within the prescribed range with probability at least $1 - \delta$, by Theorem 4. The algorithm L1-ESTIMATE-UNKNOWN-MONOTONE and its analysis are entirely similar.

4 From Monotone to k-modal

In this section we establish our main positive testing results for $k$-modal distributions, the upper bounds stated in the final four rows of Table 1. In the previous section, we were able to use the oblivious decomposition to yield a partition of $[n]$ into relatively few intervals, with the guarantee that the corresponding flattened distribution is close to the true distribution. The main challenge in extending these results to unimodal or $k$-modal distributions is that in order to make the analogous decomposition, one must first determine, by taking samples from the distribution, which regions are monotonically increasing versus decreasing.
Our algorithm CONSTRUCT-FLAT-DECOMPOSITION($p, \epsilon, \delta$) performs this task with the following guarantee:

Lemma 3. Let $p$ be a $k$-modal distribution over $[n]$. Algorithm CONSTRUCT-FLAT-DECOMPOSITION($p, \epsilon, \delta$) draws $O(k^2 \epsilon^{-4} \log(1/\delta))$ samples from $p$ and outputs a $(p, \epsilon, \ell)$-flat decomposition of $[n]$ with probability at least $1 - \delta$, where $\ell = O(k \log(n)/\epsilon^2)$.

The bulk of our work in Section C is to describe CONSTRUCT-FLAT-DECOMPOSITION($p, \epsilon, \delta$) and prove Lemma 3, but first we show how Lemma 3 yields our claimed testing results for $k$-modal distributions. As in the monotone case, all four algorithms are essentially the same: each works by reducing the given $k$-modal distribution testing problem to the same testing problem for arbitrary distributions over $[\ell]$. One slight complication is that the partition obtained for distribution $p$ will generally differ from that for $q$. In the monotone distribution setting, the partition was oblivious to the distributions, and thus this concern did not arise. Naively, one might hope that the flattened distribution corresponding to any refinement of a partition will be at least as good as the flattened distribution corresponding to the actual partition. This hope is easily seen to be strictly false, but we show that it is true up to a factor of 2, which suffices for our purposes.

The following terminology will be useful: Let $\mathcal{I} = \{I_i\}_{i=1}^{r}$ and $\mathcal{I}' = \{I'_i\}_{i=1}^{s}$ be two partitions of $[n]$ into $r$ and $s$ intervals respectively. The common refinement of $\mathcal{I}$ and $\mathcal{I}'$ is the partition $\mathcal{J}$ of $[n]$ into intervals obtained from $\mathcal{I}$ and $\mathcal{I}'$ in the obvious way, by taking all possible nonempty intervals of the form $I_i \cap I'_j$. It is clear that $\mathcal{J}$ is both a refinement of $\mathcal{I}$ and of $\mathcal{I}'$, and that the number of intervals $|\mathcal{J}|$ in $\mathcal{J}$ is at most $r + s$.
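With interval partitions represented by their sorted lists of right endpoints, the common refinement is just the merge of the two endpoint lists, which also makes the $|\mathcal{J}| \le r + s$ bound immediate:

```python
def common_refinement(ends1, ends2):
    """Common refinement of two interval partitions of [1..n], each given
    by its sorted list of right endpoints (both lists must end at n).
    Each merged endpoint closes one refined interval."""
    return sorted(set(ends1) | set(ends2))

I1 = [10, 50, 100]   # intervals [1,10], [11,50], [51,100]
I2 = [30, 100]       # intervals [1,30], [31,100]
J = common_refinement(I1, I2)
print(J)  # -> [10, 30, 50, 100]: at most r + s intervals
```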
We prove the following lemma in Section A:

Lemma 4. Let $p$ be any distribution over $[n]$, let $\mathcal{I} = \{I_i\}_{i=1}^{a}$ be a $(p, \epsilon, a)$-flat decomposition of $[n]$, and let $\mathcal{J} = \{J_i\}_{i=1}^{b}$ be a refinement of $\mathcal{I}$. Then $\mathcal{J}$ is a $(p, 2\epsilon, b)$-flat decomposition of $[n]$.

We describe the TEST-IDENTITY-KNOWN-KMODAL algorithm below.

TEST-IDENTITY-KNOWN-KMODAL
Inputs: $\epsilon, \delta > 0$; sample access to $k$-modal distributions $p$, $q$ over $[n]$

1. Run CONSTRUCT-FLAT-DECOMPOSITION($p, \epsilon/2, \delta/4$) and let $\mathcal{I} = \{I_i\}_{i=1}^{\ell}$, $\ell = O(k \log(n)/\epsilon^2)$, be the partition that it outputs. Run CONSTRUCT-FLAT-DECOMPOSITION($q, \epsilon/2, \delta/4$) and let $\mathcal{I}' = \{I'_i\}_{i=1}^{\ell'}$, $\ell' = O(k \log(n)/\epsilon^2)$, be the partition that it outputs. Let $\mathcal{J}$ be the common refinement of $\mathcal{I}$ and $\mathcal{I}'$, and let $\ell_{\mathcal{J}} = O(k \log(n)/\epsilon^2)$ be the number of intervals in $\mathcal{J}$.
2. Let $(q^r)_{\mathcal{J}}$ denote the reduced distribution over $[\ell_{\mathcal{J}}]$ obtained from $q$ using $\mathcal{J}$, as defined in Section A.
3. Draw $m = s_{IK}(\ell_{\mathcal{J}}, \epsilon/2, \delta/2)$ samples from $(p^r)_{\mathcal{J}}$, where $(p^r)_{\mathcal{J}}$ is the reduced distribution over $[\ell_{\mathcal{J}}]$ obtained from $p$ using $\mathcal{J}$, as defined in Section A.
4. Run TEST-IDENTITY-KNOWN($(p^r)_{\mathcal{J}}, (q^r)_{\mathcal{J}}, \epsilon/2, \delta/2$) using the samples from Step 3 and output what it outputs.

We note that Steps 2, 3 and 4 of TEST-IDENTITY-KNOWN-KMODAL are the same as the corresponding steps of TEST-IDENTITY-KNOWN-MONOTONE. For the analysis of TEST-IDENTITY-KNOWN-KMODAL, Lemmas 3 and 4 give us that with probability $1 - \delta/2$ the partition $\mathcal{J}$ obtained in Step 1 is both a $(p, \epsilon, \ell_{\mathcal{J}})$-flat and a $(q, \epsilon, \ell_{\mathcal{J}})$-flat decomposition of $[n]$; we condition on this going forward.
From this point on the analysis is essentially identical to the analysis for TEST-IDENTITY-KNOWN-MONOTONE and is omitted. The modifications required to obtain the algorithms TEST-IDENTITY-UNKNOWN-KMODAL, L1-ESTIMATE-KNOWN-KMODAL and L1-ESTIMATE-UNKNOWN-KMODAL, and the analyses of these algorithms, are completely analogous to the modifications and analyses of Section 3.2 and are omitted.

4.1 The CONSTRUCT-FLAT-DECOMPOSITION algorithm. We present CONSTRUCT-FLAT-DECOMPOSITION(p, ε, δ) followed by an intuitive explanation. Note that it employs a procedure ORIENTATION(p̂, I), which uses no samples and is presented and analyzed in Section 4.2.

CONSTRUCT-FLAT-DECOMPOSITION
Inputs: ε, δ > 0; sample access to k-modal distribution p over [n]

1. Initialize I := ∅.
2. Fix τ := ε²/(20000k). Draw r = Θ(log(1/δ)/τ²) samples from p and let p̂ denote the resulting empirical distribution (which by Theorem 1 satisfies d_K(p̂, p) ≤ τ with probability at least 1 − δ).
3. Greedily partition the domain [n] into α atomic intervals {I_i}_{i=1}^α as follows: I_1 := [1, j_1], where j_1 := min{j ∈ [n] | p̂([1, j]) ≥ ε/(100k)}. For i ≥ 1, if ∪_{j=1}^i I_j = [1, j_i], then I_{i+1} := [j_i + 1, j_{i+1}], where j_{i+1} is defined as follows: if p̂([j_i + 1, n]) ≥ ε/(100k), then j_{i+1} := min{j ∈ [n] | p̂([j_i + 1, j]) ≥ ε/(100k)}; otherwise, j_{i+1} := n.
4. Construct a set of n_m moderate intervals, a set of n_h heavy points, and a set of n_n negligible intervals as follows: for each atomic interval I_i = [a, b],
(a) if p̂([a, b]) ≤ 3ε/(100k) then I_i is declared to be a moderate interval;
(b) otherwise we have p̂([a, b]) > 3ε/(100k) and we declare b to be a heavy point.
If a < b then we declare [a, b − 1] to be a negligible interval. For each interval I which is a heavy point, add I to I. Add each negligible interval I to I.

5. For each moderate interval I, run procedure ORIENTATION(p̂, I) and let ◦ ∈ {↑, ↓, ⊥} be its output. If ◦ = ⊥ then add I to I. If ◦ = ↓ then let J_I be the partition of I given by Theorem 5, which is a (p′, ε/4, O(log(n)/ε))-flat decomposition of I for any non-increasing distribution p′ over I, and add all the elements of J_I to I. If ◦ = ↑ then let J_I be the partition of I given by the dual version of Theorem 5, which is a (p′, ε/4, O(log(n)/ε))-flat decomposition of I for any non-decreasing distribution p′ over I, and add all the elements of J_I to I.
6. Output the partition I of [n].

Roughly speaking, when CONSTRUCT-FLAT-DECOMPOSITION constructs a partition I, it initially breaks [n] up into two types of intervals. The first type are intervals that are "okay" to include in a flat decomposition, either because they have very little mass, or because they consist of a single point, or because they are close to uniform. The second type are intervals that are "not okay" to include in a flat decomposition (they have significant mass and are far from uniform), but the algorithm is able to ensure that almost all of these are monotone distributions with a known orientation. It then uses the oblivious decomposition of Theorem 5 to construct a flat decomposition of each such interval. (Note that it is crucial that the orientation is known in order to be able to use Theorem 5.)

In more detail, CONSTRUCT-FLAT-DECOMPOSITION(p, ε, δ) works as follows. The algorithm first draws a batch of samples from p and uses them to construct an estimate p̂ of the CDF of p (this is straightforward using the DKW inequality).
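Step 3's greedy construction of atomic intervals can be sketched as follows; p̂ is passed as an explicit list of point masses and the threshold ε/(100k) is passed in directly (the function name and interface are ours):

```python
def atomic_intervals(p_hat, threshold):
    """Greedily split [n] into intervals, each the shortest remaining
    prefix whose empirical mass reaches the threshold."""
    n = len(p_hat)
    intervals, start, mass = [], 1, 0.0
    for j in range(1, n + 1):
        mass += p_hat[j - 1]
        if mass >= threshold:
            intervals.append((start, j))
            start, mass = j + 1, 0.0
    # A light suffix (mass below the threshold) becomes the final atomic
    # interval, matching the "otherwise j_{i+1} := n" case above.
    if start <= n:
        intervals.append((start, n))
    return intervals

print(atomic_intervals([0.1] * 10, 0.3))   # [(1, 3), (4, 6), (7, 9), (10, 10)]
```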
Using p̂, the algorithm partitions [n] into a collection of O(k/ε) disjoint intervals in the following way:

• A small collection of the intervals are "negligible"; they collectively have total mass less than ε under p. Each negligible interval I will be an element of the partition I.
• Some of the intervals are "heavy points"; these are intervals consisting of a single point that has mass Ω(ε/k) under p. Each heavy point I will also be an element of the partition I.
• The remaining intervals are "moderate" intervals, each of which has mass Θ(ε/k) under p.

It remains to incorporate the moderate intervals into the partition I that is being constructed. This is done as follows: using p̂, the algorithm comes up with a "guess" of the correct orientation (non-increasing, non-decreasing, or close to uniform) for each moderate interval. Each moderate interval whose guessed orientation is "close to uniform" is included in the partition I. Finally, for each moderate interval I whose guessed orientation is "non-increasing" or "non-decreasing", the algorithm invokes Theorem 5 on I to perform the oblivious decomposition for monotone distributions, and the resulting subintervals are included in I. The analysis will show that the guesses are almost always correct, and intuitively this should imply that the I that is constructed is indeed a (p, ε, ℓ)-flat decomposition of [n].

4.2 The ORIENTATION algorithm. The ORIENTATION algorithm takes as input an explicit description of a distribution p̂ over [n] and an interval I ⊆ [n]. Intuitively, it assumes that p̂_I is close (in Kolmogorov distance) to a monotone distribution p_I, and its goal is to determine the orientation of p_I: it outputs either ↑, ↓ or ⊥ (the last of which means "close to uniform").
The algorithm is quite simple: it checks whether there exists an initial interval I′ of I on which p̂_I's weight differs significantly from u_I(I′) (the weight that the uniform distribution over I assigns to I′), and bases its output on this in the obvious way. A precise description of the algorithm (which uses no samples) is given below.

ORIENTATION
Inputs: explicit description of distribution p̂ over [n]; interval I = [a, b] ⊆ [n]

1. If |I| = 1 (i.e. I = {a} for some a ∈ [n]) then return "⊥"; otherwise continue.
2. If there is an initial interval I′ = [a, j] of I that satisfies u_I(I′) − (p̂)_I(I′) > ε/7, then halt and output "↑". Otherwise,
3. If there is an initial interval I′ = [a, j] of I that satisfies u_I(I′) − (p̂)_I(I′) < −ε/7, then halt and output "↓". Otherwise,
4. Output "⊥".

We proceed to analyze ORIENTATION. We show that if p_I is far from uniform then ORIENTATION outputs the correct orientation for it. We also show that whenever ORIENTATION does not output "⊥", whatever it outputs is the correct orientation of p_I. The proof is given in Section C.3.

Lemma 5. Let p be a distribution over [n] and let interval I = [a, b] ⊆ [n] be such that p_I is monotone. Suppose p(I) ≥ 99ε/(10000k), and suppose that for every interval I′ ⊆ I we have |p̂(I′) − p(I′)| ≤ ε²/(10000k). Then:
1. if p_I is non-decreasing and p_I is ε/6-far from the uniform distribution u_I over I, then ORIENTATION(p̂, I) outputs "↑";
2. if ORIENTATION(p̂, I) outputs "↑" then p_I is non-decreasing;
3. if p_I is non-increasing and p_I is ε/6-far from the uniform distribution u_I over I, then ORIENTATION(p̂, I) outputs "↓";
4. if ORIENTATION(p̂, I) outputs "↓" then p_I is non-increasing.
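For a single moderate interval, the prefix-weight comparison that ORIENTATION performs can be sketched as follows (the conditional distribution p̂_I is passed as an explicit list, and the returned strings stand in for ↑, ↓, ⊥; this interface is ours):

```python
def orientation(p_hat_I, eps):
    """Compare every proper prefix of the interval against the uniform
    prefix weight, with the eps/7 tolerance from the pseudocode."""
    m = len(p_hat_I)
    if m == 1:
        return "uniform"                       # the |I| = 1 case outputs ⊥

    def deviation(j):                          # u_I([a, j]) - p_hat_I([a, j])
        return j / m - sum(p_hat_I[:j])

    if any(deviation(j) > eps / 7 for j in range(1, m)):
        return "non-decreasing"                # ↑: some prefix is too light
    if any(deviation(j) < -eps / 7 for j in range(1, m)):
        return "non-increasing"                # ↓: some prefix is too heavy
    return "uniform"                           # ⊥

print(orientation([0.1, 0.2, 0.3, 0.4], 0.5))     # non-decreasing
print(orientation([0.25, 0.25, 0.25, 0.25], 0.5)) # uniform
```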
5 Lower Bounds

Our algorithmic results follow from a reduction which shows how one can reduce the problem of testing properties of monotone or k-modal distributions to the task of testing properties of general distributions over a much smaller support. Our approach to proving lower bounds is complementary: we give a canonical scheme for transforming "lower bound instances" of general distributions into related lower bound instances of monotone distributions with much larger supports.

A generic lower bound instance for distance estimation has the following form: there is a distribution D over pairs of distributions (p, p′), with the information-theoretic guarantee that, given s independent samples from distributions p and p′ with (p, p′) ← D, it is impossible to distinguish the case that d_TV(p, p′) ≤ ε_1 from the case that d_TV(p, p′) > ε_2 with any probability greater than 1 − δ, where the probability is taken over both the selection of (p, p′) ← D and the choice of samples. In general, such information-theoretic lower bounds are difficult to prove. Fortunately, as mentioned above, we will be able to prove lower bounds for monotone and k-modal distributions by leveraging the known lower bound constructions in a black-box fashion.

Definitions 2 and 3, given below, define a two-stage transformation of a generic distribution into a related k-modal distribution over a much larger support. This transformation preserves total variation distance: for any pair of distributions, the variation distance between their transformations is identical to the variation distance between the original distributions. Additionally, we ensure that given access to s independent samples from an original input distribution, one can simulate drawing s samples from the related k-modal distribution yielded by the transformation.
Given any lower-bound construction D for general distributions, the above transformation will yield a lower-bound instance D_k for (k − 1)-modal distributions (so monotone distributions correspond to k = 1), defined by selecting a pair of distributions (p, p′) ← D and then outputting the pair of transformed distributions. This transformed ensemble of distributions is a lower-bound instance, for if some algorithm could successfully test pairs of (k − 1)-modal distributions from D_k, then that algorithm could be used to test pairs from D, by simulating samples drawn from the transformed versions of the distributions. The following proposition, proved in Section D, summarizes the above discussion:

Proposition 6. Let D be a distribution over pairs of distributions supported on [n] such that, given s samples from distributions p, p′ with (p, p′) ← D, no algorithm can distinguish whether d_TV(p, p′) ≤ ε_1 versus d_TV(p, p′) > ε_2 with probability greater than 1 − δ (over both the draw of (p, p′) from D and the draw of samples from p, p′). Let p_max, p_min be the respective maximum and minimum probabilities with which any element arises in distributions that are supported in D. Then there exists a distribution D_k over pairs of (k − 1)-modal distributions supported on [N], where N = 4k · e^{8(n/k)(1 + log(p_max/p_min))}, such that no algorithm, when given s samples from distributions p_k, p′_k with (p_k, p′_k) ← D_k, can distinguish whether d_TV(p_k, p′_k) ≤ ε_1 versus d_TV(p_k, p′_k) > ε_2 with success probability greater than 1 − δ.

Before proving this proposition, we state various corollaries which result from applying the proposition to known lower-bound constructions for general distributions. The first is for the "testing identity, q is unknown" problem:

Corollary 7.
There exists a constant c such that for sufficiently large N and 1 ≤ k = O(log N), there is a distribution D_k over pairs of 2(k − 1)-modal distributions (p, p′) over [N], such that no algorithm, when given c · (k log N / log log N)^{2/3} samples from a pair of distributions (p, p′) ← D_k, can distinguish the case that d_TV(p, p′) = 0 from the case that d_TV(p, p′) > .5 with probability at least .6.

This corollary gives the lower bounds stated in lines 2 and 6 of Table 1. It follows from applying Proposition 6 to a (trivially modified) version of the lower bound construction given in [BFR+00, Val08b], summarized by the following theorem:

Theorem 6 ([BFR+00, Val08b]). There exists a constant c such that for sufficiently large n, there is a distribution D over pairs of distributions (p, p′) over [n], such that for any (p, p′) ← D, the maximum probability with which any element occurs in p or p′ is 1/n^{2/3}, and the minimum probability is 1/(2n). Additionally, no algorithm, when given c · n^{2/3} samples from (p, p′) ← D, can distinguish whether d_TV(p, p′) = 0 or d_TV(p, p′) > .5 with probability at least .6.

Our second corollary is for L1 estimation, in the case that one of the distributions is explicitly given. This trivially also yields an equivalent lower bound for the setting in which both distributions are given via samples.

Corollary 8. For any a, b with 0 < a < b < 1/2, there exists a constant c > 0 such that for any sufficiently large N and 1 ≤ k = O(log N), there exists a 2(k − 1)-modal distribution q with support [N], and a distribution D_k over 2(k − 1)-modal distributions over [N], such that no algorithm, when given c · k log N / (log log N · log log log N) samples from a distribution p ← D_k, can distinguish the case that d_TV(p, q) < a from the case that d_TV(p, q) > b with probability at least .6.
This corollary gives the lower bounds claimed in lines 3, 4, 7 and 8 of Table 1. It follows from applying Proposition 6 to the lower bound construction given in [VV11a], summarized by the following theorem:

Theorem 7 ([VV11a]). For any a, b with 0 < a < b < 1/2, there exists a constant c > 0 such that for any sufficiently large n, there is a distribution D over distributions with support [n], such that for any p ← D, the maximum probability with which any element occurs in p is O(log n / n), and the minimum probability is 1/(2n). Additionally, no algorithm, when given c · n / log n samples from p ← D, can distinguish whether d_TV(p, u_n) < a versus d_TV(p, u_n) > b with probability at least .6, where u_n denotes the uniform distribution over [n].

Note that the above theorem can be expressed in the language of Proposition 6 by defining the distribution D′ over pairs of distributions which chooses a distribution according to D for the first distribution of each pair and always selects u_n for the second distribution of each pair.

Our third corollary, which gives the lower bounds claimed in lines 1 and 5 of Table 1, is for the "testing identity, q is known" problem:

Corollary 9. For any ε ∈ (0, 1/2], there is a constant c such that for sufficiently large N and 1 ≤ k = O(log N), there is a k-modal distribution p with support [N], and a distribution D over 2(k − 1)-modal distributions with support [N], such that no algorithm, when given c · (k log N)^{1/2} samples from a distribution p′ ← D, can distinguish the case that d_TV(p, p′) = 0 from the case that d_TV(p, p′) > ε with probability at least .6.

The above corollary follows from applying Proposition 6 to the following trivially verified lower bound construction:

Fact 10.
Let D be the ensemble of distributions with support [n] defined as follows: with probability 1/2, p ← D is the uniform distribution over [n], and with probability 1/2, p ← D assigns probability 1/(2n) to a random half of the domain elements and probability 3/(2n) to the other half of the domain elements. No algorithm, when given fewer than n^{1/2}/100 samples from a distribution p ← D, can distinguish between d_TV(p, u_n) = 0 and d_TV(p, u_n) ≥ .5 with probability greater than .6.

As noted previously (after Theorem 7), this fact can also be expressed in the language of Proposition 6.

6 Conclusions

We have introduced a simple new approach for tackling distribution testing problems for restricted classes of distributions: reduce them to general-distribution testing problems over a smaller domain. We applied this approach to obtain new testing results for a range of distribution testing problems involving monotone and k-modal distributions, and established lower bounds showing that all our new algorithms are essentially optimal.

A general direction for future work is to apply our reduction method to obtain near-optimal testing algorithms for other interesting classes of distributions. This will involve constructing flat decompositions of various types of distributions using few samples, which seems to be a natural and interesting algorithmic problem. A specific goal is to develop a more efficient version of our CONSTRUCT-FLAT-DECOMPOSITION algorithm for k-modal distributions; is it possible to obtain an improved version of this algorithm that uses o(k) samples?

References

[BBBB72] R.E. Barlow, D.J. Bartholomew, J.M. Bremner, and H.D. Brunk. Statistical Inference under Order Restrictions. Wiley, New York, 1972.
[BFF+01] T. Batu, E. Fischer, L. Fortnow, R. Kumar, R. Rubinfeld, and P. White.
Testing random variables for independence and identity. In Proc. 42nd IEEE Conference on Foundations of Computer Science, pages 442–451, 2001.
[BFR+00] T. Batu, L. Fortnow, R. Rubinfeld, W.D. Smith, and P. White. Testing that distributions are close. In IEEE Symposium on Foundations of Computer Science, pages 259–269, 2000.
[BFR+10] T. Batu, L. Fortnow, R. Rubinfeld, W.D. Smith, and P. White. Testing closeness of discrete distributions, 2010.
[Bir87a] L. Birgé. Estimating a density under order restrictions: Nonasymptotic minimax risk. Annals of Statistics, 15(3):995–1012, 1987.
[Bir87b] L. Birgé. On the risk of histograms for estimating decreasing densities. Annals of Statistics, 15(3):1013–1022, 1987.
[BKR04] T. Batu, R. Kumar, and R. Rubinfeld. Sublinear algorithms for testing monotone and unimodal distributions. In ACM Symposium on Theory of Computing, pages 381–390, 2004.
[CKC83] L. Cobb, P. Koppstein, and N.H. Chen. Estimation and moment recursion relations for multimodal distributions of the exponential family. J. American Statistical Association, 78(381):124–130, 1983.
[CT04] K.S. Chan and H. Tong. Testing for multimodality with dependent data. Biometrika, 91(1):113–123, 2004.
[DDS11] C. Daskalakis, I. Diakonikolas, and R.A. Servedio. Learning k-modal distributions via testing. Available at http://arxiv.org/abs/1107.2700, 2011.
[DKW56] A. Dvoretzky, J. Kiefer, and J. Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. Ann. Mathematical Statistics, 27(3):642–669, 1956.
[Fou97] A.-L. Fougères. Estimation de densités unimodales. Canadian Journal of Statistics, 25:375–387, 1997.
[Gre56] U. Grenander. On the theory of mortality measurement. Skand. Aktuarietidskr., 39:125–153, 1956.
[Gro85] P. Groeneboom.
Estimating a monotone density. In Proc. of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, pages 539–555, 1985.
[JW09] H.K. Jankowski and J.A. Wellner. Estimation of a discrete monotone density. Electronic Journal of Statistics, 3:1567–1605, 2009.
[Kem91] J.H.B. Kemperman. Mixtures with a limited number of modal intervals. Annals of Statistics, 19(4):2120–2144, 1991.
[Mas90] P. Massart. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. Annals of Probability, 18(3):1269–1283, 1990.
[Rao69] B.L.S. Prakasa Rao. Estimation of a unimodal density. Sankhya Ser. A, 31:23–36, 1969.
[Val08a] P. Valiant. Testing Symmetric Properties of Distributions. PhD thesis, M.I.T., 2008.
[Val08b] P. Valiant. Testing symmetric properties of distributions. In STOC, pages 383–392, 2008.
[VV11a] G. Valiant and P. Valiant. Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In STOC, pages 685–694, 2011.
[VV11b] G. Valiant and P. Valiant. The power of linear estimators. In FOCS, 2011.

For simplicity, the appendix consists of a slightly expanded and self-contained version of the exposition in the body of the paper, following the "Notation and Preliminaries" section.

A Shrinking the domain size: Reductions for distribution-testing problems

In this section we present the general framework of our reduction-based approach and sketch how we instantiate this approach for monotone and k-modal distributions. We denote by |I| the cardinality of an interval I ⊆ [n], i.e. for I = [a, b] we have |I| = b − a + 1. Fix a distribution p over [n] and a partition of [n] into disjoint intervals I := {I_i}_{i=1}^ℓ.
The flattened distribution (p_f)_I corresponding to p and I is the distribution over [n] defined as follows: for j ∈ [ℓ] and i ∈ I_j, (p_f)_I(i) = Σ_{t∈I_j} p(t) / |I_j|. That is, (p_f)_I is obtained from p by averaging the weight that p assigns to each interval over the entire interval. The reduced distribution (p_r)_I corresponding to p and I is the distribution over [ℓ] that assigns to the i-th point the weight that p assigns to the interval I_i; i.e., for i ∈ [ℓ] we have (p_r)_I(i) = p(I_i). Note that if p is non-increasing then so is (p_f)_I, but this is not necessarily the case for (p_r)_I.

Definition 1. Let p be a distribution over [n] and let I = {I_i}_{i=1}^ℓ be a partition of [n] into disjoint intervals. We say that I is a (p, ε, ℓ)-flat decomposition of [n] if d_TV(p, (p_f)_I) ≤ ε.

The following useful lemma relates closeness of p and q to closeness of the reduced distributions:

Lemma 2. Let I = {I_i}_{i=1}^ℓ be a partition of [n] into disjoint intervals. Suppose that p and q are distributions over [n] such that I is both a (p, ε, ℓ)-flat decomposition of [n] and a (q, ε, ℓ)-flat decomposition of [n]. Then d_TV(p, q) − d_TV((p_r)_I, (q_r)_I) ≤ 2ε. Moreover, if p = q then (p_r)_I = (q_r)_I.

Proof. The second statement is immediate from the definition of a reduced distribution. To prove the first statement, we first observe that for any pair of distributions p, q and any partition I of [n] into disjoint intervals, we have d_TV((p_r)_I, (q_r)_I) = d_TV((p_f)_I, (q_f)_I).
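For explicit distributions the two transformations are immediate to compute; a minimal sketch (the list-based representation and function names are ours):

```python
def flattened(p, partition):
    """(p_f)_I over [n]: average p's weight over each interval."""
    out = [0.0] * len(p)
    for (a, b) in partition:
        avg = sum(p[a - 1:b]) / (b - a + 1)
        for i in range(a - 1, b):
            out[i] = avg
    return out

def reduced(p, partition):
    """(p_r)_I over [ell]: total weight of p on each interval."""
    return [sum(p[a - 1:b]) for (a, b) in partition]

p = [0.125, 0.375, 0.25, 0.25]
I = [(1, 2), (3, 4)]
print(flattened(p, I))   # [0.25, 0.25, 0.25, 0.25]
print(reduced(p, I))     # [0.5, 0.5]
```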
We thus have that d_TV(p, q) − d_TV((p_r)_I, (q_r)_I) is equal to

d_TV(p, q) − d_TV((p_f)_I, (q_f)_I) = |d_TV(p, q) − d_TV((p_f)_I, (q_f)_I)| ≤ d_TV(p, (p_f)_I) + d_TV(q, (q_f)_I),

where the equality above is equivalent to d_TV(p, q) ≥ d_TV((p_f)_I, (q_f)_I) (which is easily verified by considering each interval I_i ∈ I separately and applying the triangle inequality), and the inequality is the triangle inequality. Since I is both a (p, ε, ℓ)-flat decomposition of [n] and a (q, ε, ℓ)-flat decomposition of [n], we have d_TV(p, (p_f)_I) ≤ ε and d_TV(q, (q_f)_I) ≤ ε. The RHS above is thus bounded by 2ε and the lemma follows.

Lemma 2, while simple, is at the heart of our reduction-based approach; it lets us transform a distribution-testing problem over the large domain [n] into a distribution-testing problem over the much smaller "reduced" domain [ℓ]. At a high level, all our testing algorithms follow the same basic approach: first they run a procedure which, with high probability, constructs a partition I of [n] that is both a (p, ε, ℓ)-flat decomposition of [n] and a (q, ε, ℓ)-flat decomposition of [n]. Next they run the appropriate general-distribution tester on the ℓ-element distributions (p_r)_I, (q_r)_I and output what it outputs; Lemma 2 guarantees that the distance between (p_r)_I and (q_r)_I faithfully reflects the distance between p and q, so this output is correct. We now provide a few more details that are specific to the various different testing problems that we consider.
For the monotone distribution testing problems, the construction of I is done obliviously (without drawing any samples or making any reference to p or q) and there is no possibility of failure: the assumption that p and q are both (say) non-decreasing guarantees that the I that is constructed is both a (p, ε, ℓ)-flat decomposition of [n] and a (q, ε, ℓ)-flat decomposition of [n]. We describe this decomposition procedure in Section 3.1 and present our monotone distribution testing algorithms that are based on it in Section 3.2.

For the k-modal testing problems it is not so straightforward to construct the desired decomposition I. This is done via a careful procedure which uses k² · poly(1/ε) samples from p and q. This procedure has the property that, with probability 1 − δ/2, the I it outputs is both a (p, ε, ℓ)-flat decomposition of [n] and a (q, ε, ℓ)-flat decomposition of [n], where ℓ = O(k log(n)/ε²). Given this, by running a testing algorithm (which has success probability 1 − δ/2) on the pair (p_r)_I, (q_r)_I of distributions over [ℓ], we get an answer which, with probability 1 − δ, is a legitimate answer for the original testing problem. The details are given in Section C.

We close this section with a result about partitions and flat decompositions which will be useful later. Let I = {I_i}_{i=1}^a, I′ = {I′_j}_{j=1}^b be two partitions of [n]. We say that I′ is a refinement of I if for every i ∈ [a] there is a subset S_i of [b] such that ∪_{j∈S_i} I′_j = I_i (note that for this to hold we must have a ≤ b). Note that {S_i}_{i=1}^a forms a partition of [b]. We prove the following useful lemma:

Lemma 4. Let p be any distribution over [n], let I = {I_i}_{i=1}^a be a (p, ε, a)-flat decomposition of [n], and let J = {J_i}_{i=1}^b be a refinement of I. Then J is a (p, 2ε, b)-flat decomposition of [n].
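A quick numerical check of why the factor of 2 in Lemma 4 cannot be removed: for the near-uniform distribution analyzed in the proof that follows, refining the trivial partition {[1, n]} into two halves nearly doubles the L1 flattening error (the function name and test parameters are ours):

```python
def flatten_error(p, partition):
    """sum_t |p(t) - (p_f)(t)|, i.e. twice d_TV(p, (p_f)) restricted to the partition."""
    err = 0.0
    for (a, b) in partition:
        avg = sum(p[a - 1:b]) / (b - a + 1)
        err += sum(abs(x - avg) for x in p[a - 1:b])
    return err

n = 1000
p = [1.0 / n] * n
p[0], p[-1] = 0.5 / n, 1.5 / n                       # the example from the proof
coarse = flatten_error(p, [(1, n)])                  # about 1/n
fine = flatten_error(p, [(1, n // 2), (n // 2 + 1, n)])
print(fine / coarse)                                 # close to 2 for large n
```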
Proof. Fix any i ∈ [a] and let S_i ⊆ [b] be such that I_i = ∪_{j∈S_i} J_j. To prove the lemma it suffices to show that

2 Σ_{t∈I_i} |p(t) − (p_f)_I(t)| ≥ Σ_{j∈S_i} Σ_{t∈J_j} |p(t) − (p_f)_J(t)|,   (1)

since the sum on the LHS (without the factor of 2) is twice the contribution that I_i makes to d_TV(p, (p_f)_I), and the sum on the RHS is twice the corresponding contribution that I_i makes to d_TV(p, (p_f)_J). It may seem intuitively obvious that approximating the sub-distribution over I_i using a single "global average" (the LHS, without the factor of 2) must incur at least as much error as using separate "local averages" (the RHS). However, this intuition is not quite correct, and the factor of two is necessary. To see this, consider a distribution p over [n] such that p(1) = (1/2) · (1/n), p(i) = 1/n for i ∈ [2, n − 1], and p(n) = (3/2) · (1/n). Taking I_1 = [1, n/2] and I_2 = [n/2 + 1, n], it is easy to check that inequality (1) is essentially tight (up to an o(1) factor).

We now proceed to establish (1). Let T ⊆ [n] and consider a partition of T into k nonempty sets T_i, i ∈ [k]. Denote μ := p(T)/|T| and μ_i := p(T_i)/|T_i|. Then (1) can be re-expressed as follows:

2 Σ_{t∈T} |p(t) − μ| ≥ Σ_{i=1}^k Σ_{t∈T_i} |p(t) − μ_i|.   (2)

We shall prove the above statement for all sequences of numbers p(1), ..., p(n). Since adding or subtracting the same quantity from each number p(t) does not change the validity of (2), for the sake of convenience we may assume that the numbers average to 0, that is, μ = 0. Consider the i-th term on the right-hand side, Σ_{t∈T_i} |p(t) − μ_i|.
We can bound this quantity from above as follows:

Σ_{t∈T_i} |p(t) − μ_i| ≤ Σ_{t∈T_i} |p(t)| + |T_i| · |μ_i| = Σ_{t∈T_i} |p(t)| + |p(T_i)| ≤ 2 Σ_{t∈T_i} |p(t)| = 2 Σ_{t∈T_i} |p(t) − μ|,

where the first inequality follows from the triangle inequality (applied term by term), the equality is by the definition of μ_i, the second inequality is again the triangle inequality (|p(T_i)| = |Σ_{t∈T_i} p(t)| ≤ Σ_{t∈T_i} |p(t)|), and the final equality uses the assumption that μ = 0. The lemma follows by summing over i ∈ [k], using the fact that the T_i's form a partition of T.

B Efficiently Testing Monotone Distributions

B.1 Oblivious decomposition of monotone distributions. Our main tool for testing monotone distributions is an oblivious decomposition of monotone distributions that is a variant of a construction of Birgé [Bir87b]. As we will see, it enables us to reduce the problem of testing a monotone distribution to the problem of testing an arbitrary distribution over a much smaller domain. The decomposition result is given below:

Theorem 5 ([Bir87b]) (oblivious decomposition). Fix any n ∈ Z+ and ε > 0. The partition I := {I_i}_{i=1}^ℓ of [n] described below has the following properties: ℓ = O((1/ε) · log(ε · n + 1)), and for any non-increasing distribution p over [n], I is a (p, O(ε), ℓ)-flat decomposition of [n].

There is a dual version of Theorem 5, asserting the existence of an "oblivious" partition for non-decreasing distributions (which is of course different from the "oblivious" partition I for non-increasing distributions of Theorem 5); this will be useful later. While our construction is essentially that of Birgé, we note that the version given in [Bir87b] is for non-increasing distributions over the continuous domain [0, n], and it is phrased rather differently.
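One standard way to realize such an oblivious decomposition is to let interval lengths grow geometrically as roughly (1 + ε)^i. The sketch below uses our own constants, which need not match the paper's exact construction, but it shows why on the order of (1/ε) · log(εn + 1) intervals suffice to cover [n]:

```python
import math

def oblivious_partition(n, eps):
    """Partition [n] into intervals whose lengths grow like (1 + eps)^i."""
    intervals, start, i = [], 1, 0
    while start <= n:
        length = max(1, math.floor((1 + eps) ** i))
        end = min(start + length - 1, n)
        intervals.append((start, end))
        start, i = end + 1, i + 1
    return intervals

I = oblivious_partition(10**6, 0.5)
print(len(I))   # a few dozen intervals instead of a million points
```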
Adapting the arguments of [Bir87b] to our discrete setting of distributions over [n] is not conceptually difficult but requires some care. For the sake of completeness, we provide a self-contained proof of the discrete version stated above in Appendix E.

B.2 Efficiently testing monotone distributions. Now we are ready to establish our upper bounds on testing monotone distributions (given in the first four rows of Table 1). All of the algorithms are essentially the same: each works by reducing the given monotone distribution testing problem to the same testing problem for arbitrary distributions over a support of size ℓ = O(log(n)/ε) using the oblivious decomposition from the previous subsection. For concreteness we explicitly describe the tester for the "testing identity, q is known" case below, and then indicate the small changes that are necessary to get the testers for the other three cases.

TEST-IDENTITY-KNOWN-MONOTONE
Inputs: ε, δ > 0; sample access to non-increasing distribution p over [n]; explicit description of non-increasing distribution q over [n]

1. Let I := {I_i}_{i=1}^ℓ, with ℓ = Θ(log(εn + 1)/ε), be the partition of [n] given by Theorem 5, which is a (p′, ε/8, ℓ)-flat decomposition of [n] for any non-increasing distribution p′.
2. Let (q_r)_I denote the reduced distribution over [ℓ] obtained from q using I, as defined in Section A.
3. Draw m = s_IK(ℓ, ε/2, δ) samples from (p_r)_I, where (p_r)_I is the reduced distribution over [ℓ] obtained from p using I, as defined in Section A.
4. Run TEST-IDENTITY-KNOWN((p_r)_I, (q_r)_I, ε/2, δ) using the samples from Step 3 and output what it outputs.

We now establish our claimed upper bound for the "testing identity, q is known" case.
We first observe that in Step 3, the desired $m = s_{IK}(\ell, \epsilon/2, \delta)$ samples from $(p^r)_{\mathcal{I}}$ can easily be obtained by drawing $m$ samples from $p$ and converting each one to the corresponding draw from $(p^r)_{\mathcal{I}}$ in the obvious way. If $p = q$ then by Lemma 2 we have that $(p^r)_{\mathcal{I}} = (q^r)_{\mathcal{I}}$, and TEST-IDENTITY-KNOWN-MONOTONE outputs "accept" with probability at least $1 - \delta$ by Theorem 2. If $d_{TV}(p, q) \ge \epsilon$, then by Lemma 2 and Theorem 5 we have that $d_{TV}((p^r)_{\mathcal{I}}, (q^r)_{\mathcal{I}}) \ge 3\epsilon/4$, so TEST-IDENTITY-KNOWN-MONOTONE outputs "reject" with probability at least $1 - \delta$ by Theorem 2.

For the "testing identity, $q$ is unknown" case, the algorithm TEST-IDENTITY-UNKNOWN-MONOTONE is very similar to TEST-IDENTITY-KNOWN-MONOTONE. The differences are as follows: instead of Step 2, in Step 3 we draw $m = s_{IU}(\ell, \epsilon/2, \delta)$ samples from $(p^r)_{\mathcal{I}}$ and the same number of samples from $(q^r)_{\mathcal{I}}$; and in Step 4, we run TEST-IDENTITY-UNKNOWN$((p^r)_{\mathcal{I}}, (q^r)_{\mathcal{I}}, \epsilon/2, \delta)$ using the samples from Step 3. The analysis is exactly the same as above (using Theorem 3 in place of Theorem 2).

We now describe the algorithm L1-ESTIMATE-KNOWN-MONOTONE for the "tolerant testing, $q$ is known" case. This algorithm takes values $\epsilon$ and $\delta$ as input, so the partition $\mathcal{I}$ defined in Step 1 is a $(p', \epsilon/4, \ell)$-flat decomposition of $[n]$ for any non-increasing $p'$. In Step 3 the algorithm draws $m = s_{E}(\ell, \epsilon/2, \delta)$ samples and runs L1-ESTIMATE$((p^r)_{\mathcal{I}}, (q^r)_{\mathcal{I}}, \epsilon/2, \delta)$ in Step 4. If $d_{TV}(p, q) = c$ then by Lemma 2 we have that $d_{TV}((p^r)_{\mathcal{I}}, (q^r)_{\mathcal{I}}) \in [c - \epsilon/2, c + \epsilon/2]$, and L1-ESTIMATE-KNOWN-MONOTONE outputs a value within the prescribed range with probability at least $1 - \delta$, by Theorem 4.
The algorithm L1-ESTIMATE-UNKNOWN-MONOTONE and its analysis are entirely similar.

C Efficiently Testing $k$-modal Distributions

In this section we establish our main positive testing results for $k$-modal distributions, the upper bounds stated in the final four rows of Table 1. The key to all these results is an algorithm CONSTRUCT-FLAT-DECOMPOSITION$(p, \epsilon, \delta)$. We prove the following performance guarantee about this algorithm:

Lemma 3. Let $p$ be a $k$-modal distribution over $[n]$. Algorithm CONSTRUCT-FLAT-DECOMPOSITION$(p, \epsilon, \delta)$ draws $O(k^2 \epsilon^{-4} \log(1/\delta))$ samples from $p$ and outputs a $(p, \epsilon, \ell)$-flat decomposition of $[n]$ with probability at least $1 - \delta$, where $\ell = O(k \log(n)/\epsilon^2)$.

The bulk of our work in Section C is to describe CONSTRUCT-FLAT-DECOMPOSITION$(p, \epsilon, \delta)$ and prove Lemma 3, but first we show how Lemma 3 easily yields our claimed testing results for $k$-modal distributions. As in the monotone case all four algorithms are essentially the same: each works by reducing the given $k$-modal distribution testing problem to the same testing problem for arbitrary distributions over $[\ell]$. We describe the TEST-IDENTITY-KNOWN-KMODAL algorithm below, and then indicate the necessary changes to get the other three testers.

The following terminology will be useful: Let $\mathcal{I} = \{I_i\}_{i=1}^{r}$ and $\mathcal{I}' = \{I'_i\}_{i=1}^{s}$ be two partitions of $[n]$ into $r$ and $s$ intervals respectively. The common refinement of $\mathcal{I}$ and $\mathcal{I}'$ is the partition $\mathcal{J}$ of $[n]$ into intervals obtained from $\mathcal{I}$ and $\mathcal{I}'$ in the obvious way, by taking all possible nonempty intervals of the form $I_i \cap I'_j$. It is clear that $\mathcal{J}$ is a refinement of both $\mathcal{I}$ and $\mathcal{I}'$, and that the number of intervals $|\mathcal{J}|$ in $\mathcal{J}$ is at most $r + s$.
TEST-IDENTITY-KNOWN-KMODAL
Inputs: $\epsilon, \delta > 0$; sample access to $k$-modal distributions $p$, $q$ over $[n]$

1. Run CONSTRUCT-FLAT-DECOMPOSITION$(p, \epsilon/2, \delta/4)$ and let $\mathcal{I} = \{I_i\}_{i=1}^{\ell}$, $\ell = O(k \log(n)/\epsilon^2)$, be the partition that it outputs. Run CONSTRUCT-FLAT-DECOMPOSITION$(q, \epsilon/2, \delta/4)$ and let $\mathcal{I}' = \{I'_i\}_{i=1}^{\ell'}$, $\ell' = O(k \log(n)/\epsilon^2)$, be the partition that it outputs. Let $\mathcal{J}$ be the common refinement of $\mathcal{I}$ and $\mathcal{I}'$, and let $\ell_{\mathcal{J}} = O(k \log(n)/\epsilon^2)$ be the number of intervals in $\mathcal{J}$.
2. Let $(q^r)_{\mathcal{J}}$ denote the reduced distribution over $[\ell_{\mathcal{J}}]$ obtained from $q$ using $\mathcal{J}$ as defined in Section A.
3. Draw $m = s_{IK}(\ell_{\mathcal{J}}, \epsilon/2, \delta/2)$ samples from $(p^r)_{\mathcal{J}}$, where $(p^r)_{\mathcal{J}}$ is the reduced distribution over $[\ell_{\mathcal{J}}]$ obtained from $p$ using $\mathcal{J}$ as defined in Section A.
4. Run TEST-IDENTITY-KNOWN$((p^r)_{\mathcal{J}}, (q^r)_{\mathcal{J}}, \epsilon/2, \delta/2)$ using the samples from Step 3 and output what it outputs.

We note that Steps 2, 3 and 4 of TEST-IDENTITY-KNOWN-KMODAL are the same as the corresponding steps of TEST-IDENTITY-KNOWN-MONOTONE. For the analysis of TEST-IDENTITY-KNOWN-KMODAL, Lemmas 3 and 4 give us that with probability $1 - \delta/2$, the partition $\mathcal{J}$ obtained in Step 1 is both a $(p, \epsilon, \ell_{\mathcal{J}})$-flat and a $(q, \epsilon, \ell_{\mathcal{J}})$-flat decomposition of $[n]$; we condition on this going forward. From this point on the analysis is essentially identical to the analysis for TEST-IDENTITY-KNOWN-MONOTONE and is omitted. The modifications required to obtain the algorithms TEST-IDENTITY-UNKNOWN-KMODAL, L1-ESTIMATE-KNOWN-KMODAL and L1-ESTIMATE-UNKNOWN-KMODAL, and the analysis of these algorithms, are completely analogous to the modifications and analyses of Appendix B.2 and are omitted.
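Representing an interval partition of $[n]$ by its sorted list of right endpoints, the common refinement used in Step 1 is a one-line merge of breakpoint sets. A sketch of ours (not the paper's code; helper names are hypothetical):

```python
def common_refinement(ends1, ends2):
    """Common refinement of two interval partitions of [n], each given
    as a sorted list of right endpoints (both lists end with n).  The
    refinement's breakpoints are the union of the two sets, so it has
    at most len(ends1) + len(ends2) intervals."""
    return sorted(set(ends1) | set(ends2))

def to_intervals(ends):
    """Recover (left, right) interval pairs from a right-endpoint list."""
    return [(lo + 1, hi) for lo, hi in zip([0] + ends[:-1], ends)]
```

For example, refining $\{[1,3],[4,10]\}$ with $\{[1,5],[6,10]\}$ gives breakpoints $[3,5,10]$, i.e. the intervals $[1,3],[4,5],[6,10]$.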
C.1 The CONSTRUCT-FLAT-DECOMPOSITION algorithm.

We present CONSTRUCT-FLAT-DECOMPOSITION$(p, \epsilon, \delta)$ followed by an intuitive explanation. Note that it employs a procedure ORIENTATION$(\hat{p}, I)$, which uses no samples and is presented and analyzed in Section 4.2.

CONSTRUCT-FLAT-DECOMPOSITION
Inputs: $\epsilon, \delta > 0$; sample access to $k$-modal distribution $p$ over $[n]$

1. Initialize $\mathcal{I} := \emptyset$.
2. Fix $\tau := \epsilon^2/(20000k)$. Draw $r = \Theta(\log(1/\delta)/\tau^2)$ samples from $p$ and let $\hat{p}$ denote the resulting empirical distribution (which by Theorem 1 has $d_K(\hat{p}, p) \le \tau$ with probability at least $1 - \delta$).
3. Greedily partition the domain $[n]$ into $\alpha$ atomic intervals $\{I_i\}_{i=1}^{\alpha}$ as follows: $I_1 := [1, j_1]$, where $j_1 := \min\{j \in [n] \mid \hat{p}([1, j]) \ge \epsilon/(100k)\}$. For $i \ge 1$, if $\cup_{j=1}^{i} I_j = [1, j_i]$, then $I_{i+1} := [j_i + 1, j_{i+1}]$, where $j_{i+1}$ is defined as follows: if $\hat{p}([j_i + 1, n]) \ge \epsilon/(100k)$, then $j_{i+1} := \min\{j \in [n] \mid \hat{p}([j_i + 1, j]) \ge \epsilon/(100k)\}$; otherwise, $j_{i+1} := n$.
4. Construct a set of $n_m$ moderate intervals, a set of $n_h$ heavy points, and a set of $n_n$ negligible intervals as follows. For each atomic interval $I_i = [a, b]$:
(a) if $\hat{p}([a, b]) \le 3\epsilon/(100k)$ then $I_i$ is declared to be a moderate interval;
(b) otherwise we have $\hat{p}([a, b]) > 3\epsilon/(100k)$ and we declare $b$ to be a heavy point. If $a < b$ then we declare $[a, b - 1]$ to be a negligible interval.
For each interval $I$ which is a heavy point, add $I$ to $\mathcal{I}$. Add each negligible interval $I$ to $\mathcal{I}$.
5. For each moderate interval $I$, run procedure ORIENTATION$(\hat{p}, I)$; let $\circ \in \{\uparrow, \downarrow, \bot\}$ be its output. If $\circ = \bot$ then add $I$ to $\mathcal{I}$.
If $\circ = \downarrow$ then let $\mathcal{J}_I$ be the partition of $I$ given by Theorem 5, which is a $(p', \epsilon/4, O(\log(n)/\epsilon))$-flat decomposition of $I$ for any non-increasing distribution $p'$ over $I$. Add all the elements of $\mathcal{J}_I$ to $\mathcal{I}$.
If $\circ = \uparrow$ then let $\mathcal{J}_I$ be the partition of $I$ given by the dual version of Theorem 5, which is a $(p', \epsilon/4, O(\log(n)/\epsilon))$-flat decomposition of $I$ for any non-decreasing distribution $p'$ over $I$. Add all the elements of $\mathcal{J}_I$ to $\mathcal{I}$.
6. Output the partition $\mathcal{I}$ of $[n]$.

Roughly speaking, when CONSTRUCT-FLAT-DECOMPOSITION constructs a partition $\mathcal{I}$, it initially breaks $[n]$ up into two types of intervals. The first type are intervals that are "okay" to include in a flat decomposition, either because they have very little mass, or because they consist of a single point, or because they are close to uniform. The second type are intervals that are "not okay" to include in a flat decomposition (they have significant mass and are far from uniform), but the algorithm is able to ensure that almost all of these are monotone distributions with a known orientation. It then uses the oblivious decomposition of Theorem 5 to construct a flat decomposition of each such interval. (Note that it is crucial that the orientation is known in order to be able to use Theorem 5.)

In more detail, CONSTRUCT-FLAT-DECOMPOSITION$(p, \epsilon, \delta)$ works as follows. The algorithm first draws a batch of samples from $p$ and uses them to construct an estimate $\hat{p}$ of the CDF of $p$ (this is straightforward using the DKW inequality). Using $\hat{p}$ the algorithm partitions $[n]$ into a collection of $O(k/\epsilon)$ disjoint intervals in the following way:

• A small collection of the intervals are "negligible"; they collectively have total mass less than $\epsilon$ under $p$. Each negligible interval $I$ will be an element of the partition $\mathcal{I}$.
• Some of the intervals are "heavy points"; these are intervals consisting of a single point that has mass $\Omega(\epsilon/k)$ under $p$. Each heavy point $I$ will also be an element of the partition $\mathcal{I}$.
• The remaining intervals are "moderate" intervals, each of which has mass $\Theta(\epsilon/k)$ under $p$.

It remains to incorporate the moderate intervals into the partition $\mathcal{I}$ that is being constructed. This is done as follows: using $\hat{p}$, the algorithm comes up with a "guess" of the correct orientation (non-increasing, non-decreasing, or close to uniform) for each moderate interval. Each moderate interval where the "guessed" orientation is "close to uniform" is included in the partition $\mathcal{I}$. Finally, for each moderate interval $I$ where the guessed orientation is "non-increasing" or "non-decreasing", the algorithm invokes Theorem 5 on $I$ to perform the oblivious decomposition for monotone distributions, and the resulting sub-intervals are included in $\mathcal{I}$. The analysis will show that the guesses are almost always correct, and intuitively this should imply that the $\mathcal{I}$ that is constructed is indeed a $(p, \epsilon, \ell)$-flat decomposition of $[n]$.

C.2 Performance of CONSTRUCT-FLAT-DECOMPOSITION: Proof of Lemma 3.

The claimed sample bound is obvious from inspection of the algorithm, as the only step that draws any samples is Step 2. The bound on the number of intervals in the flat decomposition follows directly from the upper bounds on the number of heavy points, negligible intervals and moderate intervals shown below, using also Theorem 5. It remains to show that the output of the algorithm is a valid flat decomposition of $p$. First, by the DKW inequality (Theorem 1) we have that with probability at least $1 - \delta$ it is the case that
$$|\hat{p}(I) - p(I)| \le \frac{\epsilon^2}{10000k} \quad \text{for every interval } I \subseteq [n]. \quad (3)$$
We make some preliminary observations about the weight that $p$ has on the intervals constructed in Steps 3 and 4. Since every atomic interval $I_i$ constructed in Step 3 has $\hat{p}(I_i) \ge \epsilon/(100k)$ (except potentially the rightmost one), it follows that the number $\alpha$ of atomic intervals constructed in Step 3 satisfies $\alpha \le \lceil 100k/\epsilon \rceil$. We now establish bounds on the probability mass that $p$ assigns to the moderate intervals, heavy points, and negligible intervals that are constructed in Step 4. Using (3), each interval $I_i$ that is declared to be a moderate interval in Step 4(a) must satisfy
$$99\epsilon/(10000k) \le p([a, b]) \le 301\epsilon/(10000k) \quad \text{(for all moderate intervals } [a, b]\text{)}. \quad (4)$$
By virtue of the greedy process that is used to construct atomic intervals in Step 3, each point $b$ that is declared to be a heavy point in Step 4(b) must satisfy $\hat{p}(b) \ge 2\epsilon/(100k)$, and thus using (3) again
$$p(b) \ge 199\epsilon/(10000k) \quad \text{(for all heavy points } b\text{)}. \quad (5)$$
Moreover, each interval $[a, b - 1]$ that is declared to be a negligible interval must satisfy $\hat{p}([a, b - 1]) < \epsilon/(100k)$, and thus using (3) again
$$p([a, b - 1]) \le 101\epsilon/(10000k) \quad \text{(for all negligible intervals } [a, b - 1]\text{)}. \quad (6)$$
It is clear that $n_m$ (the number of moderate intervals) and $n_h$ (the number of heavy points) are each at most $\alpha$. Next we observe that the number of negligible intervals $n_n$ satisfies $n_n \le k$. This is because at the end of each negligible interval $[a, b - 1]$ we have (observing that each negligible interval must be nonempty) that $p(b - 1) \le p([a, b - 1]) \le 101\epsilon/(10000k)$ while $p(b) \ge 199\epsilon/(10000k)$. Since $p$ is $k$-modal, there can be at most $\lceil (k + 1)/2 \rceil \le k$ points $b \in [n]$ satisfying this condition. Since each negligible interval $I$ satisfies $p(I) \le 101\epsilon/(10000k)$, we have that the total probability mass under $p$ of all the negligible intervals is at most $101\epsilon/10000$.
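The greedy construction of atomic intervals (Step 3) analyzed above can be sketched as follows; this is our illustration (the function name is ours), with `phat` the empirical distribution given as a list of point masses:

```python
def atomic_intervals(phat, eps, k):
    """Greedy partition of [1, n] in the spirit of Step 3: extend each
    interval until its empirical mass reaches eps/(100*k); a final light
    tail (mass below the threshold) becomes one last atomic interval.
    Returns 1-based (left, right) endpoint pairs."""
    thr = eps / (100.0 * k)
    n = len(phat)
    intervals, start, mass = [], 0, 0.0
    for j in range(n):
        mass += phat[j]
        if mass >= thr:
            intervals.append((start + 1, j + 1))
            start, mass = j + 1, 0.0
    if start < n:  # leftover tail with mass < thr
        intervals.append((start + 1, n))
    return intervals
```

Since every interval except possibly the last has mass at least $\epsilon/(100k)$, the number of intervals is at most $\lceil 100k/\epsilon \rceil$, matching the bound on $\alpha$ above.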
Thus far we have built a partition of $[n]$ into a collection of $n_m \le \lceil 100k/\epsilon \rceil$ moderate intervals (which we denote $M_1, \dots, M_{n_m}$), a set of $n_h \le \lceil 100k/\epsilon \rceil$ heavy points (which we denote $h_1, \dots, h_{n_h}$) and a set of $n_n \le k$ negligible intervals (which we denote $N_1, \dots, N_{n_n}$). Let $A \subseteq \{1, \dots, n_m\}$ denote the set of those indices $i$ such that ORIENTATION$(\hat{p}, M_i)$ outputs $\bot$ in Step 5. The partition $\mathcal{I}$ that CONSTRUCT-FLAT-DECOMPOSITION constructs consists of $\{h_1\}, \dots, \{h_{n_h}\}$, $N_1, \dots, N_{n_n}$, $\{M_i\}_{i \in A}$, and $\bigcup_{i \in ([n_m] \setminus A)} \mathcal{J}_{M_i}$. We can thus write $p$ as
$$p = \sum_{j=1}^{n_h} p(h_j) \cdot \mathbf{1}_{h_j} + \sum_{j=1}^{n_n} p(N_j)\, p_{N_j} + \sum_{j \in A} p(M_j)\, p_{M_j} + \sum_{j \in ([n_m] \setminus A)} \sum_{I \in \mathcal{J}_{M_j}} p(I)\, p_I. \quad (7)$$
Using Lemma 15 (proved in Appendix F) we can bound the total variation distance between $p$ and $(p^f)_{\mathcal{I}}$ by
$$d_{TV}\big(p, (p^f)_{\mathcal{I}}\big) \le \frac{1}{2} \sum_{j=1}^{n_h} \big|p(h_j) - (p^f)_{\mathcal{I}}(h_j)\big| + \frac{1}{2} \sum_{j=1}^{n_n} \big|p(N_j) - (p^f)_{\mathcal{I}}(N_j)\big| + \sum_{j=1}^{n_n} p(N_j) \cdot d_{TV}\big(p_{N_j}, ((p^f)_{\mathcal{I}})_{N_j}\big) + \frac{1}{2} \sum_{j \in A} \big|p(M_j) - (p^f)_{\mathcal{I}}(M_j)\big| + \sum_{j \in A} p(M_j) \cdot d_{TV}\big(p_{M_j}, ((p^f)_{\mathcal{I}})_{M_j}\big) + \frac{1}{2} \sum_{j \in ([n_m] \setminus A)} \sum_{I \in \mathcal{J}_{M_j}} \big|p(I) - (p^f)_{\mathcal{I}}(I)\big| + \sum_{j \in ([n_m] \setminus A)} \sum_{I \in \mathcal{J}_{M_j}} p(I) \cdot d_{TV}\big(p_I, ((p^f)_{\mathcal{I}})_I\big). \quad (8)$$
Since $p(I) = (p^f)_{\mathcal{I}}(I)$ for every $I \in \mathcal{I}$, this simplifies to
$$d_{TV}\big(p, (p^f)_{\mathcal{I}}\big) \le \sum_{j=1}^{n_n} p(N_j) \cdot d_{TV}\big(p_{N_j}, ((p^f)_{\mathcal{I}})_{N_j}\big) + \sum_{j \in A} p(M_j) \cdot d_{TV}\big(p_{M_j}, ((p^f)_{\mathcal{I}})_{M_j}\big) + \sum_{j \in ([n_m] \setminus A)} \sum_{I \in \mathcal{J}_{M_j}} p(I) \cdot d_{TV}\big(p_I, ((p^f)_{\mathcal{I}})_I\big), \quad (9)$$
which we now proceed to bound. Recalling from (6) that $p(N_j) \le 101\epsilon/(10000k)$ for each negligible interval $N_j$, and recalling that $n_n \le k$, the first summand in (9) is at most $101\epsilon/10000$. To bound the second summand, fix any $j \in A$, so $M_j$ is a moderate interval such that ORIENTATION$(\hat{p}, M_j)$ returns $\bot$.
If $p_{M_j}$ is non-decreasing then by Lemma 5 it must be the case that $d_{TV}(p_{M_j}, ((p^f)_{\mathcal{I}})_{M_j}) \le \epsilon/6$ (note that $((p^f)_{\mathcal{I}})_{M_j}$ is just $u_{M_j}$, the uniform distribution over $M_j$). Lemma 5 gives the same bound if $p_{M_j}$ is non-increasing. If $p_{M_j}$ is neither non-increasing nor non-decreasing then we have no nontrivial bound on $d_{TV}(p_{M_j}, ((p^f)_{\mathcal{I}})_{M_j})$, but since $p$ is $k$-modal there can be at most $k$ such values of $j$ in $A$. Recalling (4), overall we have that
$$\sum_{j \in A} p(M_j) \cdot d_{TV}\big(p_{M_j}, ((p^f)_{\mathcal{I}})_{M_j}\big) \le \frac{301\epsilon k}{10000k} + \frac{\epsilon}{6} \le \frac{1968\epsilon}{10000},$$
and we have bounded the second summand.

It remains to bound the final summand of (9). For each $j \in ([n_m] \setminus A)$, we know that ORIENTATION$(\hat{p}, M_j)$ outputs either $\uparrow$ or $\downarrow$. If $p_{M_j}$ is monotone, then by Lemma 5 the output of ORIENTATION$(\hat{p}, M_j)$ gives the correct orientation of $p_{M_j}$. Consequently $\mathcal{J}_{M_j}$ is a $(p_{M_j}, \epsilon/4, O(\log(n)/\epsilon))$-flat decomposition of $M_j$, by Theorem 5. This means that $d_{TV}(p_{M_j}, ((p^f)_{\mathcal{I}})_{M_j}) \le \epsilon/4$, which is equivalent to
$$\frac{1}{p(M_j)} \sum_{I \in \mathcal{J}_{M_j}} p(I)\, d_{TV}\big(p_I, ((p^f)_{\mathcal{I}})_I\big) \le \frac{\epsilon}{4}, \quad \text{i.e.,} \quad \sum_{I \in \mathcal{J}_{M_j}} p(I)\, d_{TV}\big(p_I, ((p^f)_{\mathcal{I}})_I\big) \le p(M_j) \cdot \frac{\epsilon}{4}.$$
Let $B \subseteq [n_m] \setminus A$ be such that, for all $j \in B$, $p_{M_j}$ is monotone. Summing the above over all $j \in B$ gives:
$$\sum_{j \in B} \sum_{I \in \mathcal{J}_{M_j}} p(I)\, d_{TV}\big(p_I, ((p^f)_{\mathcal{I}})_I\big) \le \sum_{j \in B} p(M_j) \cdot \frac{\epsilon}{4} \le \frac{\epsilon}{4}.$$
Given that $p$ is $k$-modal, the cardinality of the set $[n_m] \setminus (A \cup B)$ is at most $k$. So we have the bound:
$$\sum_{j \in [n_m] \setminus (A \cup B)} \sum_{I \in \mathcal{J}_{M_j}} p(I)\, d_{TV}\big(p_I, ((p^f)_{\mathcal{I}})_I\big) \le \sum_{j \in [n_m] \setminus (A \cup B)} p(M_j) \le \frac{301\epsilon k}{10000k}.$$
So the third summand of (9) is at most $\epsilon/4 + 301\epsilon/10000$, and overall we have that $(9) \le \epsilon/2$. Hence, we have shown that $\mathcal{I}$ is a $(p, \epsilon, \ell)$-flat decomposition of $[n]$, and Lemma 3 is proved.

C.3 The ORIENTATION algorithm.
The ORIENTATION algorithm takes as input an explicit description of a distribution $\hat{p}$ over $[n]$ and an interval $I \subseteq [n]$. Intuitively, it assumes that $\hat{p}_I$ is close (in Kolmogorov distance) to a monotone distribution $p_I$, and its goal is to determine the orientation of $p_I$: it outputs either $\uparrow$, $\downarrow$ or $\bot$ (the last of which means "close to uniform"). The algorithm is quite simple; it checks whether there exists an initial interval $I'$ of $I$ on which $\hat{p}_I$'s weight is significantly different from $u_I(I')$ (the weight that the uniform distribution over $I$ assigns to $I'$), and bases its output on this in the obvious way. A precise description of the algorithm (which uses no samples) is given below.

ORIENTATION
Inputs: explicit description of distribution $\hat{p}$ over $[n]$; interval $I = [a, b] \subseteq [n]$

1. If $|I| = 1$ (i.e. $I = \{a\}$ for some $a \in [n]$) then return "$\bot$"; otherwise continue.
2. If there is an initial interval $I' = [a, j]$ of $I$ that satisfies $u_I(I') - (\hat{p})_I(I') > \frac{\epsilon}{7}$ then halt and output "$\uparrow$". Otherwise,
3. If there is an initial interval $I' = [a, j]$ of $I$ that satisfies $u_I(I') - (\hat{p})_I(I') < -\frac{\epsilon}{7}$ then halt and output "$\downarrow$". Otherwise,
4. Output "$\bot$".

We proceed to analyze ORIENTATION. We show that if $p_I$ is far from uniform then ORIENTATION outputs the correct orientation for it. We also show that whenever ORIENTATION does not output "$\bot$", whatever it outputs is the correct orientation of $p_I$. For ease of readability, for the rest of this subsection we use the following notation:
$$\Delta := \frac{\epsilon^2}{10000k}.$$

Lemma 5. Let $p$ be a distribution over $[n]$ and let interval $I = [a, b] \subseteq [n]$ be such that $p_I$ is monotone. Suppose $p(I) \ge 99\epsilon/(10000k)$, and suppose that for every interval $I' \subseteq I$ we have that
$$|\hat{p}(I') - p(I')| \le \Delta. \quad (10)$$
Then:
1. If $p_I$ is non-decreasing and $p_I$ is $\epsilon/6$-far from the uniform distribution $u_I$ over $I$, then ORIENTATION$(\hat{p}, I)$ outputs "$\uparrow$";
2. if ORIENTATION$(\hat{p}, I)$ outputs "$\uparrow$" then $p_I$ is non-decreasing;
3. if $p_I$ is non-increasing and $p_I$ is $\epsilon/6$-far from the uniform distribution $u_I$ over $I$, then ORIENTATION$(\hat{p}, I)$ outputs "$\downarrow$";
4. if ORIENTATION$(\hat{p}, I)$ outputs "$\downarrow$" then $p_I$ is non-increasing.

Proof. Let $I' = [a, j] \subseteq I$ be any initial interval of $I$. We first establish the bound
$$|p_I(I') - (\hat{p})_I(I')| \le \epsilon/49, \quad (11)$$
as this will be useful for the rest of the proof. Using (10) we have
$$p_I(I') - (\hat{p})_I(I') = \frac{p(I')}{p(I)} - \frac{\hat{p}(I')}{\hat{p}(I)} \ge \frac{p(I')}{p(I)} - \frac{p(I') + \Delta}{p(I) - \Delta} = -\Delta \cdot \frac{p(I') + p(I)}{p(I)(p(I) - \Delta)}. \quad (12)$$
Now using the fact that $p(I') \le p(I)$ and $p(I) \ge 99\epsilon/(10000k)$ (so that $\Delta \le (\epsilon/99) \cdot p(I)$, and hence $p(I) - \Delta \ge (98/99)\, p(I)$), we get that (12) is at least
$$-\Delta \cdot \frac{2p(I)}{(98/99)\, p(I)^2} = -\frac{2 \cdot 99\, \Delta}{98\, p(I)} \ge -\frac{2 \cdot 99\, \Delta \cdot 10000k}{98 \cdot 99\, \epsilon} = -\frac{\epsilon}{49}.$$
So we have established the lower bound $p_I(I') - (\hat{p})_I(I') \ge -\epsilon/49$. For the upper bound, similar reasoning gives
$$p_I(I') - (\hat{p})_I(I') \le \Delta \cdot \frac{p(I') + p(I)}{p(I)(p(I) + \Delta)} \le \Delta \cdot \frac{2p(I)}{p(I)^2} = \frac{2\Delta}{p(I)} \le \frac{2\Delta \cdot 10000k}{99\epsilon} = \frac{2\epsilon}{99} < \frac{\epsilon}{49},$$
and so we have shown that $|p_I(I') - (\hat{p})_I(I')| \le \epsilon/49$ as desired.

Now we proceed to prove the lemma. We first prove Part 1. Suppose that $p_I$ is non-decreasing and $d_{TV}(p_I, u_I) > \epsilon/6$. Since $p_I$ is monotone and $u_I$ is uniform and both are supported on $I$, the pdfs of $p_I$ and $u_I$ have exactly one crossing. An easy consequence of this is that $d_K(p_I, u_I) = d_{TV}(p_I, u_I) > \epsilon/6$.
By the definition of $d_K$ and the fact that $p_I$ is non-decreasing, we get that there exists a point $j \in I$ and an interval $I' = [a, j]$ such that
$$d_K(p_I, u_I) = u_I(I') - p_I(I') > \frac{\epsilon}{6}.$$
Using (11) we get from this that
$$u_I(I') - (\hat{p})_I(I') > \frac{\epsilon}{6} - \frac{\epsilon}{49} > \frac{\epsilon}{7},$$
and thus ORIENTATION outputs "$\uparrow$" in Step 2 as claimed.

Now we turn to Part 2 of the lemma. Suppose that ORIENTATION$(\hat{p}, I)$ outputs "$\uparrow$". Then it must be the case that there is an initial interval $I' = [a, j]$ of $I$ that satisfies $u_I(I') - (\hat{p})_I(I') > \frac{\epsilon}{7}$. By (11) we have that
$$u_I(I') - p_I(I') > \frac{\epsilon}{7} - \frac{\epsilon}{49} = \frac{6\epsilon}{49}.$$
But Observation 1 tells us that if $p_I$ were non-increasing then we would have $u_I(I') - p_I(I') \le 0$; so $p_I$ cannot be non-increasing, and therefore it must be non-decreasing.

For Part 3, suppose that $p_I$ is non-increasing and $d_{TV}(p_I, u_I) > \epsilon/6$. First we must show that ORIENTATION does not output "$\uparrow$" in Step 2. Since $p_I$ is non-increasing, Observation 1 gives us that $u_I(I') - p_I(I') \le 0$ for every initial interval $I'$ of $I$. Inequality (11) then gives $u_I(I') - (\hat{p})_I(I') \le \epsilon/49$, so ORIENTATION indeed does not output "$\uparrow$" in Step 2 (and it reaches Step 3 in its execution). Now arguments exactly analogous to the arguments for Part 1 (but now using the fact that $p_I$ is non-increasing rather than non-decreasing) give that there is an initial interval $I'$ such that
$$(\hat{p})_I(I') - u_I(I') > \frac{\epsilon}{6} - \frac{\epsilon}{49} > \frac{\epsilon}{7},$$
so ORIENTATION outputs "$\downarrow$" in Step 3, and Part 3 of the lemma follows. Finally, Part 4 of the lemma follows from arguments analogous to those for Part 2.
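The ORIENTATION procedure analyzed above admits a direct implementation; the following is our sketch (with `'up'`, `'down'`, `'flat'` standing for $\uparrow$, $\downarrow$, $\bot$, and `phat` a list of point masses over $[n]$):

```python
def orientation(phat, a, b, eps):
    """Sketch of ORIENTATION on interval I = [a, b] (1-based, inclusive).
    Compares the conditional mass of each initial interval [a, j] of I
    against the uniform distribution over I; a deviation of more than
    eps/7 determines the guessed orientation, with the up-arrow case
    taking precedence as in Steps 2-3 of the pseudocode."""
    if a == b:                      # Step 1: singleton interval
        return "flat"
    total = sum(phat[a - 1:b])
    cum = 0.0
    up = down = False
    for j in range(a, b + 1):
        cum += phat[j - 1]
        u = (j - a + 1) / (b - a + 1)          # uniform mass of [a, j]
        cond = cum / total if total > 0 else u  # (phat)_I([a, j])
        if u - cond > eps / 7:
            up = True
        if u - cond < -eps / 7:
            down = True
    if up:                          # Step 2
        return "up"
    if down:                        # Step 3
        return "down"
    return "flat"                   # Step 4
```

On a clearly increasing (resp. decreasing, uniform) conditional distribution this returns `'up'` (resp. `'down'`, `'flat'`).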
D Proof of Proposition 6

We start by defining the transformation, and then prove the necessary lemmas showing that the transformation yields $k$-modal distributions with the specified increase in support size, preserves $L_1$ distance between pairs, and has the property that samples from the transformed distributions can be simulated given access to samples from the original distributions.

The transformation proceeds in two phases. In the first phase, the input distribution $p$ is transformed into a related distribution $f$ with larger support; $f$ has the additional property that the ratio of the probabilities of consecutive domain elements is bounded. Intuitively the distribution $f$ corresponds to a "reduced distribution" from Section A. In the second phase, the distribution $f$ is transformed into the final $2(k-1)$-modal distribution $g$. Both phases of the transformation consist of subdividing each element of the domain of the input distribution into a set of elements of the output distribution; in the first phase, the probabilities of the elements of each set are chosen according to a geometric sequence, while in the second phase, all elements of each set are given equal probabilities. We now define this two-phase transformation and prove Proposition 6.

Definition 2. Fix $\epsilon > 0$ and a distribution $p$ over $[n]$ such that $p_{min} \le p(i) \le p_{max}$ for all $i \in [n]$. We define the distribution $f_{p,\epsilon,p_{max},p_{min}}$ in two steps. Let $q$ be the distribution with support $[c]$, where $c = 1 + \lceil \log_{1+\epsilon} p_{max} - \log_{1+\epsilon} p_{min} \rceil$, that is defined by
$$q(i) = \frac{(1+\epsilon)^{i-1}\, \epsilon}{(1+\epsilon)^c - 1}.$$
The distribution $f_{p,\epsilon,p_{max},p_{min}}$ has support $[cn]$, and for $i \in [n]$ and $j \in [c]$ it assigns probability $p(i)\, q(j)$ to domain element $c(i-1) + j$.

It is convenient for us to view the $\bmod\ r$ operator as giving an output in $[r]$, so that "$r \bmod r$" equals $r$.

Definition 3.
We define the distribution $g_{k,p,\epsilon,p_{max},p_{min}}$ from the distribution $f_{p,\epsilon,p_{max},p_{min}}$ of support $[m]$ via the following process. Let $r = \lceil m/k \rceil$, let $a_1 := 1$, and for all $i \in \{2, \dots, r\}$, let $a_i := \lceil (1+\epsilon)\, a_{i-1} \rceil$. For each $i \in [m]$, we assign probability $f_{p,\epsilon,p_{max},p_{min}}(i)/a_{i \bmod r}$ to each of the $a_{i \bmod r}$ support elements in the set $\{1 + t, 2 + t, \dots, a_{i \bmod r} + t\}$, where $t = \sum_{\ell=1}^{i-1} a_{\ell \bmod r}$.

Lemma 11. Given $\epsilon$, $p_{min}$, $p_{max}$, and access to independent samples from distribution $p$, one can generate independent samples from $f_{p,\epsilon,p_{max},p_{min}}$ and from $g_{k,p,\epsilon,p_{max},p_{min}}$.

Proof. To generate a sample according to $f_{p,\epsilon,p_{max},p_{min}}$, one simply takes a sample $i \leftarrow p$ and then draws $j \in [c]$ according to the distribution $q$ as defined in Definition 2 (note that this draw according to $q$ only involves $\epsilon$, $p_{min}$ and $p_{max}$). We then output the value $c(i-1) + j$. It follows immediately from the above definition that the distribution of the output value is $f_{p,\epsilon,p_{max},p_{min}}$. To generate a sample according to $g_{k,p,\epsilon,p_{max},p_{min}}$ given a sample $i \leftarrow f_{p,\epsilon,p_{max},p_{min}}$, one simply outputs a uniformly random one of the $a_{i \bmod r}$ support elements of $g_{k,p,\epsilon,p_{max},p_{min}}$ corresponding to the element $i$ of $f_{p,\epsilon,p_{max},p_{min}}$. Specifically, if the support of $f_{p,\epsilon,p_{max},p_{min}}$ is $[m]$, then we output a random element of the set $\{1 + t, 2 + t, \dots, a_{i \bmod r} + t\}$, where $t = \sum_{\ell=1}^{i-1} a_{\ell \bmod r}$, with the $a_j$ as defined in Definition 3, and $r = \lceil m/k \rceil$.

Lemma 12. If $p_{min} \le p(i) \le p_{max}$ for all $i \in [n]$, then the distribution $f_{p,\epsilon,p_{max},p_{min}}$ of Definition 2, with density $f : [cn] \to \mathbb{R}$, has the property that $f(i)/f(i-1) \le 1 + \epsilon$ for all $i > 1$, and the distribution $g_{k,p,\epsilon,p_{max},p_{min}}$ of Definition 3 is $2(k-1)$-modal.

Proof.
Note that the distribution $q$, with support $[c]$ as defined in Definition 2, has the property that $q(i)/q(i-1) = 1 + \epsilon$ for all $i \in \{2, \dots, c\}$, and thus $f(\ell)/f(\ell-1) = 1 + \epsilon$ for any $\ell$ satisfying $(\ell \bmod c) \ne 1$. For values $\ell$ that are $1 \bmod c$, we have
$$\frac{f(\ell)}{f(\ell-1)} = \frac{p(i+1)}{p(i)(1+\epsilon)^{c-1}} \le \frac{p(i+1)\, p_{min}}{p(i)\, p_{max}} \le 1.$$
Given this property of $f_{p,\epsilon,p_{max},p_{min}}$, we now establish that $g_{k,p,\epsilon,p_{max},p_{min}}$ is monotone decreasing on each of the $k$ equally sized contiguous regions of its domain. First consider the case $k = 1$; given a support element $j$, let $i$ be such that $j \in \{1 + \sum_{\ell=1}^{i-1} a_\ell, \dots, a_i + \sum_{\ell=1}^{i-1} a_\ell\}$. We thus have that
$$g_{1,p,\epsilon,p_{max},p_{min}}(j) = \frac{f_{p,\epsilon,p_{max},p_{min}}(i)}{a_i} \le \frac{(1+\epsilon)\, f_{p,\epsilon,p_{max},p_{min}}(i-1)}{a_i} \le \frac{f_{p,\epsilon,p_{max},p_{min}}(i-1)}{a_{i-1}} \le g_{1,p,\epsilon,p_{max},p_{min}}(j-1),$$
and thus $g_{1,p,\epsilon,p_{max},p_{min}}$ is indeed $0$-modal since it is monotone non-increasing. For $k > 1$ the above arguments apply to each of the $k$ equally-sized contiguous regions of the support, so there are $2(k-1)$ modes, namely the local maxima occurring at the right endpoint of each region, and the local minima occurring at the left endpoint of each region.

Lemma 13. For any distributions $p, p'$ with support $[n]$, and any $\epsilon, p_{max}, p_{min}$, we have that
$$d_{TV}(p, p') = d_{TV}\big(f_{p,\epsilon,p_{max},p_{min}}, f_{p',\epsilon,p_{max},p_{min}}\big) = d_{TV}\big(g_{k,p,\epsilon,p_{max},p_{min}}, g_{k,p',\epsilon,p_{max},p_{min}}\big).$$

Proof. Both equalities follow immediately from the fact that the transformations of Definitions 2 and 3 subdivide each element of the input distribution in a manner that is oblivious to the probabilities. To illustrate, letting $c = 1 + \lceil \log_{1+\epsilon} p_{max} - \log_{1+\epsilon} p_{min} \rceil$, and letting $q$ be as in Definition 2, we have the following:
$$d_{TV}\big(f_{p,\epsilon,p_{max},p_{min}}, f_{p',\epsilon,p_{max},p_{min}}\big) = \frac{1}{2} \sum_{i \in [n],\, j \in [c]} q(j)\, |p(i) - p'(i)| = \frac{1}{2} \sum_{i \in [n]} |p(i) - p'(i)| = d_{TV}(p, p').$$
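The first phase of the transformation (Definition 2) is easy to realize explicitly. The sketch below is our illustration (the function name is ours, and it assumes $p_{min}, p_{max}$ are correct pointwise bounds on `p`); it builds $f_{p,\epsilon,p_{max},p_{min}}$ as a list of probabilities:

```python
import math

def phase_one(p, eps, pmax, pmin):
    """Definition 2 (first phase): subdivide each domain element i of p
    into c elements, splitting the mass p(i) geometrically according to
    q(j) proportional to (1+eps)^(j-1), where
    c = 1 + ceil(log_{1+eps}(pmax/pmin)).  Consecutive probabilities
    within a block then grow by a factor of exactly (1+eps)."""
    c = 1 + math.ceil(math.log(pmax / pmin) / math.log(1 + eps))
    z = ((1 + eps) ** c - 1) / eps        # sum_{j=1}^{c} (1+eps)^(j-1)
    q = [(1 + eps) ** (j - 1) / z for j in range(1, c + 1)]
    return [pi * qj for pi in p for qj in q]
```

For instance, with $p = (1/4, 3/4)$, $\epsilon = 1/2$, $p_{min} = 1/4$ and $p_{max} = 3/4$ we get $c = 4$, so $f$ has support of size $8$ and still sums to $1$.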
Lemma 14. If $p$ has support $[n]$, then for any $\epsilon < 1/2$, the distribution $g_{k,p,\epsilon,p_{max},p_{min}}$ is supported on $[N]$, where $N$ is at most $\frac{k\, e^{\frac{8n}{k}(1 + \log(p_{max}/p_{min}))}}{\epsilon^2}$.

Proof. The size of the support of $f_{p,\epsilon,p_{max},p_{min}}$ is
$$n\big(1 + \lceil \log_{1+\epsilon} p_{max} - \log_{1+\epsilon} p_{min} \rceil\big) \le n\left(2 + \frac{\log(p_{max}/p_{min})}{\log(1+\epsilon)}\right).$$
Letting $a_1 := 1$ and $b_1 := \lceil 1/\epsilon \rceil$, and defining $a_i := \lceil a_{i-1}(1+\epsilon) \rceil$ and $b_i := \lceil b_{i-1}(1+\epsilon) \rceil$, we have that $a_i \le b_i$ for all $i$. Additionally, $b_{i+1}/b_i \le 1 + 2\epsilon$, since all $b_i \ge 1/\epsilon$, and thus the ceiling operation can increase the value of $(1+\epsilon)\, b_i$ by at most $\epsilon b_i$. Putting these two observations together, we have
$$\sum_{i=1}^{m} a_i \le \sum_{i=1}^{m} b_i \le \frac{(1+2\epsilon)^{m+1}}{2\epsilon^2}.$$
For any $\epsilon \le 1/2$, we have that the size of the support of $g_{k,p,\epsilon,p_{max},p_{min}}$ is at most
$$\frac{k\,(1+2\epsilon)^{\left\lceil \frac{n}{k}\left(2 + \frac{\log(p_{max}/p_{min})}{\log(1+\epsilon)}\right)\right\rceil}}{\epsilon^2} \le \frac{k\,(1+2\epsilon)^{\frac{2n}{k} \cdot \frac{2 + 4\log(p_{max}/p_{min})}{2\epsilon}}}{\epsilon^2} \le \frac{k\,(1+2\epsilon)^{\frac{1}{2\epsilon}\left(\frac{8n}{k}(1 + \log(p_{max}/p_{min}))\right)}}{\epsilon^2} \le \frac{k\, e^{\frac{8n}{k}(1 + \log(p_{max}/p_{min}))}}{\epsilon^2}.$$

Proof of Proposition 6. The proof is now a simple matter of assembling the above parts. Given a distribution $\mathcal{D}$ over pairs of distributions of support $[n]$, as specified in the proposition statement, the distribution $\mathcal{D}_k$ is defined via the process of taking $(p, p') \leftarrow \mathcal{D}$ and then applying the transformation of Definitions 2 and 3 with $\epsilon = 1/2$ to yield the pair $\big(g_{k,p,1/2,p_{max},p_{min}},\, g_{k,p',1/2,p_{max},p_{min}}\big)$. We claim that this $\mathcal{D}_k$ satisfies all the properties claimed in the proposition statement. Specifically, Lemmas 12 and 14, respectively, ensure that every distribution in the support of $\mathcal{D}_k$ has at most $2(k-1)$ modes and has support size at most $4k\, e^{\frac{8n}{k}(1 + \log(p_{max}/p_{min}))}$.
Additionally, Lemma 13 guarantees that the transformation preserves $L_1$ distance, namely, for two distributions $p, p'$ with support $[n]$, we have $L_1(p, p') = L_1\big(g_{k,p,1/2,p_{max},p_{min}},\, g_{k,p',1/2,p_{max},p_{min}}\big)$. Finally, Lemma 11 guarantees that, given $s$ independent samples from $p$, one can simulate drawing $s$ independent samples according to $g_{k,p,1/2,p_{max},p_{min}}$. Assuming for the sake of contradiction that one had an algorithm that could distinguish whether $L_1\big(g_{k,p,1/2,p_{max},p_{min}},\, g_{k,p',1/2,p_{max},p_{min}}\big)$ is less than $\epsilon_1$ versus greater than $\epsilon_2$ with the desired probability given $s$ samples, one could take $s$ samples from distributions $(p, p') \leftarrow \mathcal{D}$, simulate having drawn them from $g_{k,p,1/2,p_{max},p_{min}}$ and $g_{k,p',1/2,p_{max},p_{min}}$, and then run the hypothesized tester algorithm on those samples and output its answer, which will be the same for $(p, p')$ as for $\big(g_{k,p,1/2,p_{max},p_{min}},\, g_{k,p',1/2,p_{max},p_{min}}\big)$. This contradicts the assumption that no algorithm with these success parameters exists for $(p, p') \leftarrow \mathcal{D}$.

E Proof of Theorem 5

We first note that we can assume $\epsilon > 1/n$. Otherwise, the decomposition of $[n]$ into the singleton intervals $I_i = \{i\}$, $i \in [n]$, trivially satisfies the statement of the theorem: indeed, in this case we have $(1/\epsilon) \cdot \log n > n$ and $p^f \equiv p$. We first describe the oblivious decomposition and then show that it satisfies the statement of the theorem. The decomposition $\mathcal{I}$ will be a partition of $[n]$ into $\ell$ nonempty consecutive intervals $I_1, \dots, I_\ell$. In particular, for $j \in [\ell]$, we have $I_j = [n_{j-1} + 1, n_j]$ with $n_0 = 0$ and $n_\ell = n$. The length of interval $I_i$, denoted $l_i$, is defined to be the cardinality of $I_i$, i.e., $l_i = |I_i|$.
(Given that the intervals are disjoint and consecutive, to fully define them it suffices to specify their lengths.) We can assume wlog that $n$ and $1/\epsilon$ are each at least sufficiently large universal constants. The interval lengths are defined as follows. Let $\ell \in \mathbb{Z}^{+}$ be the smallest integer such that
$$\sum_{i=1}^{\ell} \lfloor (1+\epsilon)^{i} \rfloor \ge n.$$
For $i = 1, 2, \ldots, \ell - 1$ we define $l_i := \lfloor (1+\epsilon)^{i} \rfloor$, and for the $\ell$-th interval we set $l_\ell := n - \sum_{i=1}^{\ell-1} l_i$. It follows from this definition that the number $\ell$ of intervals in the decomposition is at most $O\big((1/\epsilon) \cdot \log(1 + \epsilon \cdot n)\big)$.

Let $p$ be any non-increasing distribution over $[n]$. We now show that the decomposition described above satisfies $d_{TV}(p_f, p) = O(\epsilon)$, where $p_f$ is the flattened distribution corresponding to $p$ and the partition $\mathcal{I} = \{I_i\}_{i=1}^{\ell}$. We can write
$$d_{TV}(p_f, p) = (1/2) \cdot \sum_{i=1}^{n} |p_f(i) - p(i)| = \sum_{j=1}^{\ell} d_{TV}\big((p_f)_{I_j}, p_{I_j}\big),$$
where $p_I$ denotes the (sub-distribution) restriction of $p$ to $I$. Let $I_j = [n_{j-1}+1, n_j]$ with $l_j = |I_j| = n_j - n_{j-1}$. Then we have
$$d_{TV}\big((p_f)_{I_j}, p_{I_j}\big) = (1/2) \cdot \sum_{i=n_{j-1}+1}^{n_j} |p_f(i) - p(i)|.$$
Recall that $p_f$ is by definition constant within each $I_j$, and in particular equal to $\bar{p}_f^{\,j} = \sum_{i=n_{j-1}+1}^{n_j} p(i)/l_j$. Also recall that $p$ is non-increasing, hence $p(n_{j-1}) \ge p(n_{j-1}+1) \ge \bar{p}_f^{\,j} \ge p(n_j)$. Therefore, we can bound from above the variation distance within $I_j$ as follows:
$$d_{TV}\big((p_f)_{I_j}, p_{I_j}\big) \le l_j \cdot \big(p(n_{j-1}+1) - p(n_j)\big) \le l_j \cdot \big(p(n_{j-1}) - p(n_j)\big).$$
So we have
$$d_{TV}(p_f, p) \le \sum_{j=1}^{\ell} l_j \cdot \big(p(n_{j-1}) - p(n_j)\big). \qquad (13)$$
To bound this quantity we analyze the summands with $l_j < 1/\epsilon$ and those with $l_j \ge 1/\epsilon$ separately. Formally, we partition the set of intervals $I_1, \ldots, I_\ell$ into "short" intervals and "long" intervals as follows: if some interval $I_j$ satisfies $l_j \ge 1/\epsilon$, then let $j_0 \in \mathbb{Z}^{+}$ be the largest integer such that $l_{j_0} < 1/\epsilon$; otherwise every interval $I_j$ satisfies $l_j < 1/\epsilon$, and in this case we let $j_0 = \ell$. If $j_0 < \ell$ then we have $j_0 = \Theta\big((1/\epsilon) \cdot \log(1/\epsilon)\big)$. Let $S = \{I_i\}_{i=1}^{j_0}$ denote the set of short intervals and let $L = \mathcal{I} \setminus S$ denote its complement.

Consider the short intervals and cluster them into groups according to their length; that is, a group contains all intervals in $S$ of the same length. We denote by $G_i$ the $i$-th group, which by definition contains all intervals in $S$ of length $i$; note that these intervals are consecutive. The cardinality of a group (denoted by $|\cdot|$) is the number of intervals it contains; the length of a group is the number of elements it contains (i.e., the sum of the lengths of the intervals it contains). Note that $G_1$ (the group containing all singleton intervals) has $|G_1| = \Omega(1/\epsilon)$ (this follows from the assumption that $1/\epsilon < n$); hence $G_1$ has length $\Omega(1/\epsilon)$. Let $j^* < 1/\epsilon$ be the maximum length of any short interval in $S$. It is easy to verify that each group $G_j$ for $j \le j^*$ is nonempty, and that for all $j \le j^* - 1$ we have $|G_j| = \Omega\big((1/\epsilon) \cdot (1/j)\big)$, which implies that the length of $G_j$ is $\Omega(1/\epsilon)$.

To bound the contribution to (13) from the short intervals, we consider the corresponding sum for each group, and use the fact that $G_1$ makes no contribution to the error. In particular, the contribution of the short intervals is
$$\sum_{l=2}^{j^*} l \cdot \big(p_l^- - p_l^+\big), \qquad (14)$$
where $p_l^-$ (resp. $p_l^+$) is the probability mass of the leftmost (resp. rightmost) point in $G_l$. Given that $p$ is non-increasing, we have that $p_l^+ \ge p_{l+1}^-$. Therefore, we can upper bound (14) by
$$2 \cdot p_1^+ + \sum_{l=2}^{j^*-1} p_l^+ - j^* \cdot p_{j^*}^+.$$
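To make the oblivious decomposition concrete, the following sketch (our own illustration; all function names are ours, and the factor 4 in the interval-count check is a loose empirical slack for this instance, not the theorem's constant) builds the lengths $l_i = \lfloor (1+\epsilon)^i \rfloor$, flattens a non-increasing distribution over each interval, and checks that the flattening error is small:

```python
import math

def oblivious_lengths(n: int, eps: float) -> list:
    """Interval lengths l_i = floor((1+eps)**i), truncating the last
    interval so that the lengths sum to exactly n."""
    lengths, total, i = [], 0, 1
    while total < n:
        l = min(math.floor((1 + eps) ** i), n - total)
        lengths.append(l)
        total += l
        i += 1
    return lengths

def flatten(p, lengths):
    """The flattened distribution p_f: constant on each interval,
    equal to the average of p over that interval."""
    pf, start = [], 0
    for l in lengths:
        avg = sum(p[start:start + l]) / l
        pf += [avg] * l
        start += l
    return pf

def tv(p, q):
    """Total variation distance between two distributions on [n]."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# A non-increasing (truncated geometric) distribution over [n].
n, eps = 1000, 0.1
raw = [0.99 ** i for i in range(n)]
s = sum(raw)
p = [x / s for x in raw]

lengths = oblivious_lengths(n, eps)
assert sum(lengths) == n
# ell = O((1/eps) * log(1 + eps*n)); factor 4 is loose slack here.
assert len(lengths) <= 4 * (1 / eps) * math.log(1 + eps * n)
# d_TV(p_f, p) = O(eps) for non-increasing p (Theorem 5).
assert tv(flatten(p, lengths), p) < eps
```

Note that the intervals grow geometrically, so a coarse interval is only used where a non-increasing $p$ is already nearly flat; this is exactly why the flattening error stays $O(\epsilon)$.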
Now note that $p_1^+ = O(\epsilon) \cdot p(G_1)$, since $G_1$ has length (total number of elements) $\Omega(1/\epsilon)$ and $p$ is non-increasing. Similarly, for $l < j^*$ we have that $p_l^+ = O(\epsilon) \cdot p(G_l)$, since $G_l$ has length $\Omega(1/\epsilon)$. Therefore, the above quantity can be upper bounded by
$$O(\epsilon) \cdot p(G_1) + O(\epsilon) \cdot \sum_{l=2}^{j^*-1} p(G_l) - j^* \cdot p_{j^*}^+ = O(\epsilon) \cdot p(S) - j^* \cdot p_{j^*}^+. \qquad (15)$$
We consider two cases. The first case is that $L = \emptyset$; in this case we are done, because the expression (15) is $O(\epsilon)$. The second case is that $L \neq \emptyset$ (we note in passing that in this case the total number of elements in all short intervals is $\Omega(1/\epsilon^2)$, which means that we must have $\epsilon = \Omega(1/\sqrt{n})$). In this case we bound the contribution of the long intervals using the same argument as Birgé. In particular, the contribution of the long intervals is
$$\sum_{j=j_0+1}^{\ell} l_j \cdot \big(p(n_{j-1}) - p(n_j)\big) \le (j^*+1) \cdot p_{j^*}^+ + \sum_{j=j_0+1}^{\ell-1} (l_{j+1} - l_j) \cdot p(n_j). \qquad (16)$$
Given that $l_{j+1} - l_j \le (2\epsilon) \cdot l_j$ and $\sum_j l_j \cdot p(n_j) \le p(L)$, it follows that the second summand in (16) is at most $O(\epsilon) \cdot p(L)$. Therefore, the total variation distance between $p$ and $p_f$ is at most (15) + (16), i.e.,
$$O(\epsilon) \cdot p(S) + O(\epsilon) \cdot p(L) + p_{j^*}^+. \qquad (17)$$
Finally, note that $p(L) + p(S) = 1$ and $p_{j^*}^+ = O(\epsilon)$. (The latter holds because $p_{j^*}^+$ is the probability mass of the rightmost point in $S$; recall that $S$ has length at least $1/\epsilon$ and $p$ is non-increasing.) This implies that (17) is at most $O(\epsilon)$, which completes the proof of Theorem 5.

F Bounding variation distance

As noted above, our tester works by decomposing the interval $[n]$ into sub-intervals. The following lemma will be useful for us; it bounds the variation distance between two distributions $p$ and $q$ in terms of how $p$ and $q$ behave on the sub-intervals of such a decomposition.

Lemma 15.
Let $[n]$ be partitioned into intervals $I_1, \ldots, I_r$, and let $p, q$ be two distributions over $[n]$. Then
$$d_{TV}(p, q) \le \frac{1}{2} \sum_{j=1}^{r} |p(I_j) - q(I_j)| + \sum_{j=1}^{r} p(I_j) \cdot d_{TV}\big(p_{I_j}, q_{I_j}\big). \qquad (18)$$
Proof. Recall that $d_{TV}(p, q) = \frac{1}{2} \sum_{i=1}^{n} |p(i) - q(i)|$. To prove the claim it suffices to show that
$$\frac{1}{2} \sum_{i \in I_1} |p(i) - q(i)| \le \frac{1}{2} |p(I_1) - q(I_1)| + p(I_1) \cdot d_{TV}\big(p_{I_1}, q_{I_1}\big) \qquad (19)$$
(the analogous bound for each $I_j$, summed over $j$, yields (18)). We assume that $p(I_1) \le q(I_1)$ and prove (19) under this assumption. This gives the bound in general, since if $p(I_1) > q(I_1)$ we have
$$\frac{1}{2} \sum_{i \in I_1} |p(i) - q(i)| \le \frac{1}{2} |p(I_1) - q(I_1)| + q(I_1) \cdot d_{TV}\big(p_{I_1}, q_{I_1}\big) < \frac{1}{2} |p(I_1) - q(I_1)| + p(I_1) \cdot d_{TV}\big(p_{I_1}, q_{I_1}\big),$$
where the first inequality follows by applying (19) with the roles of $p$ and $q$ exchanged.

The triangle inequality gives us
$$|p(i) - q(i)| \le \left| p(i) - q(i) \cdot \frac{p(I_1)}{q(I_1)} \right| + \left| q(i) \cdot \frac{p(I_1)}{q(I_1)} - q(i) \right|.$$
Summing this over all $i \in I_1$ we get
$$\frac{1}{2} \sum_{i \in I_1} |p(i) - q(i)| \le \frac{1}{2} \sum_{i \in I_1} \left| p(i) - q(i) \cdot \frac{p(I_1)}{q(I_1)} \right| + \frac{1}{2} \sum_{i \in I_1} \left| q(i) \cdot \frac{p(I_1)}{q(I_1)} - q(i) \right|.$$
We can rewrite the first term on the RHS as
$$\frac{1}{2} \sum_{i \in I_1} \left| p(i) - q(i) \cdot \frac{p(I_1)}{q(I_1)} \right| = p(I_1) \cdot \frac{1}{2} \sum_{i \in I_1} \left| \frac{p(i)}{p(I_1)} - \frac{q(i)}{q(I_1)} \right| = p(I_1) \cdot \frac{1}{2} \sum_{i \in I_1} |p_{I_1}(i) - q_{I_1}(i)| = p(I_1) \cdot d_{TV}\big(p_{I_1}, q_{I_1}\big),$$
so to prove the desired bound it suffices to show that
$$\frac{1}{2} \sum_{i \in I_1} \left| q(i) \cdot \frac{p(I_1)}{q(I_1)} - q(i) \right| \le \frac{1}{2} |p(I_1) - q(I_1)|. \qquad (20)$$
We have $\left| q(i) \cdot \frac{p(I_1)}{q(I_1)} - q(i) \right| = q(i) \cdot \left| \frac{p(I_1)}{q(I_1)} - 1 \right|$, and hence
$$\frac{1}{2} \sum_{i \in I_1} \left| q(i) \cdot \frac{p(I_1)}{q(I_1)} - q(i) \right| = \frac{1}{2} \sum_{i \in I_1} q(i) \cdot \left| \frac{p(I_1)}{q(I_1)} - 1 \right| = \frac{1}{2}\, q(I_1) \cdot \left| \frac{p(I_1)}{q(I_1)} - 1 \right| = \frac{1}{2} |p(I_1) - q(I_1)|.$$
So we indeed have (20) as required, and the lemma holds.
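Inequality (18) is straightforward to check numerically. The sketch below (our own illustration; the function names are ours) draws random distribution pairs, evaluates the right-hand side of (18) for a fixed partition with $p_{I_j}, q_{I_j}$ taken as the conditional (normalized) restrictions, and asserts the bound:

```python
import random

def tv(p, q):
    """Total variation distance between two distributions on [n]."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def rhs_of_18(p, q, cuts):
    """RHS of (18) for the partition [cuts[0],cuts[1]), [cuts[1],cuts[2]), ..."""
    rhs = 0.0
    for lo, hi in zip(cuts, cuts[1:]):
        pI, qI = sum(p[lo:hi]), sum(q[lo:hi])
        rhs += 0.5 * abs(pI - qI)
        if pI > 0 and qI > 0:
            # conditional restrictions p_I, q_I (renormalized on the interval)
            rhs += pI * tv([x / pI for x in p[lo:hi]],
                           [x / qI for x in q[lo:hi]])
    return rhs

def rand_dist(n, rng):
    """A random distribution on [n] from normalized uniform weights."""
    w = [rng.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(0)
cuts = [0, 7, 15, 30, 44, 60]  # an arbitrary partition of [60]
for _ in range(200):
    p, q = rand_dist(60, rng), rand_dist(60, rng)
    assert tv(p, q) <= rhs_of_18(p, q, cuts) + 1e-12
```

The small additive tolerance only guards against floating-point rounding; the inequality itself holds exactly, as the proof above shows.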