Testing Shape Restrictions of Discrete Distributions

Clément L. Canonne∗, Ilias Diakonikolas†, Themis Gouleakis‡, Ronitt Rubinfeld§

January 22, 2016

Abstract

We study the question of testing structured properties (classes) of discrete distributions. Specifically, given sample access to an arbitrary distribution D over [n] and a property P, the goal is to distinguish between D ∈ P and ℓ1(D, P) > ε. We develop a general algorithm for this question, which applies to a large range of "shape-constrained" properties, including monotone, log-concave, t-modal, piecewise-polynomial, and Poisson Binomial distributions. Moreover, for all cases considered, our algorithm has near-optimal sample complexity with regard to the domain size and is computationally efficient. For most of these classes, we provide the first non-trivial tester in the literature. In addition, we also describe a generic method to prove lower bounds for this problem, and use it to show that our upper bounds are nearly tight. Finally, we extend some of our techniques to tolerant testing, deriving nearly-tight upper and lower bounds for the corresponding questions.

1 Introduction

Inferring information about the probability distribution that underlies a data sample is an essential question in Statistics, and one that has ramifications in every field of the natural sciences and quantitative research. In many situations, it is natural to assume that this data exhibits some simple structure because of known properties of the origin of the data, and in fact these assumptions are crucial in making the problem tractable. Such assumptions translate as constraints on the probability distribution – e.g., it is supposed to be Gaussian, or to meet a smoothness or "fat tail" condition (see e.g., [Man63, Hou86, TLSM95]).
As a result, the problem of deciding whether a distribution possesses such a structural property has been widely investigated both in theory and practice, in the context of shape restricted inference [BDBB72, SS01] and model selection [MP07]. Here, it is guaranteed or thought that the unknown distribution satisfies a shape constraint, such as having a monotone or log-concave probability density function [SN99, BB05, Wal09, Dia16]. From a different perspective, a recent line of work in Theoretical Computer Science, originating from the papers of Batu et al. [BFR+00, BFF+01, GR00], has also been tackling similar questions in the setting of property testing (see [Ron08, Ron10, Rub12, Can15] for surveys on this field). This very active area has seen a spate of results and breakthroughs over the past decade, culminating in very efficient (both sample- and time-wise) algorithms for a wide range of distribution testing problems [BDKR05, GMV06, AAK+07, DDS+13, CDVV14, AD15, DKN15b]. In many cases, this led to a tight characterization of the number of samples required for these tasks, as well as the development of new tools and techniques, drawing connections to learning and information theory [VV10, VV11a, VV14].

∗ Columbia University. Email: ccanonne@cs.columbia.edu. Research supported by NSF CCF-1115703 and NSF CCF-1319788.
† University of Edinburgh. Email: ilias.d@ed.ac.uk. Research supported by EPSRC grant EP/L021749/1, a Marie Curie Career Integration Grant, and a SICSA grant. This work was performed in part while visiting CSAIL, MIT.
‡ CSAIL, MIT. Email: tgoule@mit.edu.
§ CSAIL, MIT and the Blavatnik School of Computer Science, Tel Aviv University. Email: ronitt@csail.mit.edu.
In this paper, we focus on the following general property testing problem: given a class (property) of distributions P and sample access to an arbitrary distribution D, one must distinguish between the case that (a) D ∈ P, versus (b) ‖D − D′‖1 > ε for all D′ ∈ P (i.e., D is either in the class, or far from it). While many of the previous works have focused on the testing of specific properties of distributions or obtained algorithms and lower bounds on a case-by-case basis, an emerging trend in distribution testing is to design general frameworks that can be applied to several property testing problems [Val11, VV11a, DKN15b, DKN15a]. This direction, the testing analog of a similar movement in distribution learning [CDSS13, CDSS14b, CDSS14a, ADLS15], aims at abstracting the minimal assumptions that are shared by a large variety of problems, and giving algorithms that can be used for any of these problems. In this work, we make significant progress in this direction by providing a unified framework for the question of testing various properties of probability distributions. More specifically, we describe a generic technique to obtain upper bounds on the sample complexity of this question, which applies to a broad range of structured classes. Our technique yields sample near-optimal and computationally efficient testers for a wide range of distribution families. Conversely, we also develop a general approach to prove lower bounds on these sample complexities, and use it to derive tight or nearly tight bounds for many of these classes.

Related work. Batu et al.
[BKR04] initiated the study of efficient property testers for monotonicity and obtained (nearly) matching upper and lower bounds for this problem; while [AD15] later considered testing the class of Poisson Binomial Distributions, and settled the sample complexity of this problem (up to the precise dependence on ε). Indyk, Levi, and Rubinfeld [ILR12], focusing on distributions that are piecewise constant on t intervals ("t-histograms"), described a Õ(√(tn)/ε^5)-sample algorithm for testing membership in this class. Another body of work by [BDKR05], [BKR04], and [DDS+13] shows how assumptions on the shape of the distributions can lead to significantly more efficient algorithms. They describe such improvements in the case of identity and closeness testing as well as for entropy estimation, under monotonicity or k-modality constraints. Specifically, Batu et al. show in [BKR04] how to obtain an O(log^3 n/ε^3)-sample tester for closeness in this setting, in stark contrast to the Ω(n^{2/3}) general lower bound. Daskalakis et al. [DDS+13] later gave O(√(log n))- and O(log^{2/3} n)-sample testing algorithms for testing respectively identity and closeness of monotone distributions, and obtained similar results for k-modal distributions. Finally, we briefly mention two related results, due respectively to [BDKR05] and [DDS12a]. The first states that for the task of getting a multiplicative estimate of the entropy of a distribution, assuming monotonicity enables exponential savings in sample complexity – O(log^6 n), instead of Ω(n^c) for the general case. The second describes how to test whether an unknown k-modal distribution is in fact monotone, using only O(k/ε^2) samples.
Note that the latter line of work differs from ours in that it presupposes that the distributions satisfy some structural property, and uses this knowledge to test something else about the distribution; while we are given a priori arbitrary distributions, and must check whether the structural property holds. Except for the properties of monotonicity and being a PBD, nothing was previously known about testing the shape restricted properties that we study. Independently and concurrently to this work, Acharya, Daskalakis, and Kamath obtained a sample near-optimal efficient algorithm for testing log-concavity.¹ Moreover, for the specific problems of identity and closeness testing,² recent results of [DKN15b, DKN15a] describe a general algorithm which applies to a large range of shape or structural constraints, and yields optimal identity testers for classes of distributions that satisfy them. We observe that while the question they answer can be cast as a specialized instance of membership testing, our results are incomparable to theirs, both because of the distinction above (testing with versus testing for structure) and because the structural assumptions they rely on are fundamentally different from ours.

1.1 Results and Techniques

Upper bounds. A natural way to tackle our membership testing problem would be to first learn the unknown distribution D as if it satisfied the property, before checking whether the hypothesis obtained is indeed both close to the original distribution and to the property. Taking advantage of the purported structure, the first step could presumably be conducted with a small number of samples; things break down, however, in the second step. Indeed, most approximation results leading to the improved learning algorithms one would apply in the first stage only provide very weak guarantees, in the ℓ1 sense.
For this reason, they lack the robustness that would be required for the second part, where it becomes necessary to perform tolerant testing between the hypothesis and D – a task that would then entail a number of samples almost linear in the domain size. To overcome this difficulty, we need to move away from these global ℓ1 closeness results and instead work with stronger requirements, this time in ℓ2 norm.

At the core of our approach is an idea of Batu et al. [BKR04], who show that monotone distributions can be well-approximated (in a certain technical sense) by piecewise constant densities on a suitable interval partition of the domain, and leverage this fact to reduce monotonicity testing to uniformity testing on each interval of this partition. While the argument of [BKR04] is tailored specifically to the setting of monotonicity testing, we are able to abstract the key ingredients, and obtain a generic membership tester that applies to a wide range of distribution families. In more detail, we provide a testing algorithm which applies to any class of distributions which admits succinct approximate decompositions – that is, each distribution in the class can be well-approximated (in a strong ℓ2 sense) by piecewise constant densities on a small number of intervals (we hereafter refer to this approximation property, formally defined in Definition 3.1, as (Succinctness); and extend the notation to apply to any class C of distributions for which all D ∈ C satisfy (Succinctness)). Crucially, the algorithm does not care about how these decompositions can be obtained: for the purpose of testing these structural properties we only need to establish their existence. Specific examples are given in the corollaries below.
Our main algorithmic result, informally stated (see Theorem 3.3 for a detailed formal statement), is as follows:

Theorem 1.1 (Main Theorem). There exists an algorithm TestSplittable which, given sampling access to an unknown distribution D over [n] and parameter ε ∈ (0, 1], can distinguish with probability 2/3 between (a) D ∈ P versus (b) ℓ1(D, P) > ε, for any property P that satisfies the above natural structural criterion (Succinctness). Moreover, for many such properties this algorithm is computationally efficient, and its sample complexity is optimal (up to logarithmic factors and the exact dependence on ε).

¹ Following the communication of a preliminary version of this paper (February 2015), we were informed that [ADK15] subsequently obtained near-optimal testers for some of the classes we consider. To the best of our knowledge, their work builds on ideas from [AD15] and their techniques are orthogonal to ours.
² Recall that the identity testing problem asks, given the explicit description of a distribution D∗ and sample access to an unknown distribution D, to decide whether D is equal to D∗ or far from it; while in closeness testing both distributions to compare are unknown.

We then instantiate this result to obtain "out-of-the-box" computationally efficient testers for several classes of distributions, by showing that they satisfy the premise of our theorem (the definition of these classes is given in Section 2.1):

Corollary 1.2. The algorithm TestSplittable can test the classes of monotone, unimodal, log-concave, concave, convex, and monotone hazard rate (MHR) distributions, with Õ(√n/ε^{7/2}) samples.

Corollary 1.3. The algorithm TestSplittable can test the class of t-modal distributions, with Õ(√(tn)/ε^{7/2}) samples.

Corollary 1.4.
The algorithm TestSplittable can test the classes of t-histograms and t-piecewise degree-d distributions, with Õ(√(tn)/ε^3) and Õ(√(t(d+1)n)/ε^{7/2} + t(d+1)/ε^3) samples respectively.

Corollary 1.5. The algorithm TestSplittable can test the classes of Binomial and Poisson Binomial Distributions, with Õ(n^{1/4}/ε^{7/2}) samples.

Class | Upper bound | Lower bound
Monotone | Õ(√n/ε^6) [BKR04], Õ(√n/ε^{7/2}) (Corollary 1.2) | Ω(√n/ε^2) [BKR04], Ω(√n/ε^2) (Corollary 1.6)
Unimodal | Õ(√n/ε^{7/2}) (Corollary 1.2) | Ω(√n/ε^2) (Corollary 1.6)
t-modal | Õ(√(tn)/ε^{7/2}) (Corollary 1.3) | Ω(√n/ε^2) (Corollary 1.6)
Log-concave, concave, convex | Õ(√n/ε^{7/2}) (Corollary 1.2) | Ω(√n/ε^2) (Corollary 1.6)
Monotone Hazard Rate (MHR) | Õ(√n/ε^{7/2}) (Corollary 1.2) | Ω(√n/ε^2) (Corollary 1.6)
Binomial, Poisson Binomial (PBD) | Õ(n^{1/4}/ε^2 + 1/ε^6) [AD15], Õ(n^{1/4}/ε^{7/2}) (Corollary 1.5) | Ω(n^{1/4}/ε^2) ([AD15], Corollary 1.7)
t-histograms | Õ(√(tn)/ε^5) [ILR12], Õ(√(tn)/ε^3) (Corollary 1.4) | Ω(√(tn)) for t ≤ 1/ε [ILR12], Ω(√n/ε^2) (Corollary 1.6)
t-piecewise degree-d | Õ(√(t(d+1)n)/ε^{7/2} + t(d+1)/ε^3) (Corollary 1.4) | Ω(√n/ε^2) (Corollary 1.6)
k-SIIRV | – | Ω(k^{1/2} n^{1/4}) (Corollary 1.8)

Table 1: Summary of results.

We remark that the aforementioned sample upper bounds are information-theoretically near-optimal in the domain size n (up to logarithmic factors). See Table 1 and the following subsection for the corresponding lower bounds. We did not attempt to optimize the dependence on the parameter ε, though a more careful analysis can lead to such improvements.
We stress that prior to our work, no non-trivial testing bound was known for most of these classes – specifically, our nearly-tight bounds for t-modal with t > 1, log-concave, concave, convex, MHR, and piecewise polynomial distributions are new. Moreover, although a few of our applications were known in the literature (the Õ(√n/ε^6) upper and Ω(√n/ε^2) lower bounds on testing monotonicity can be found in [BKR04], while the Θ(n^{1/4}) sample complexity of testing PBDs was recently given³ in [AD15], and the task of testing t-histograms is considered in [ILR12]), the crux here is that we are able to derive them in a unified way, by applying the same generic algorithm to all these different distribution families. We note that our upper bound for t-histograms (Corollary 1.4) also improves on the previous Õ(√(tn)/ε^5)-sample tester, as long as t = Õ(n^{1/3}/ε^2). In addition to its generality, our framework yields much cleaner and conceptually simpler proofs of the upper and lower bounds from [AD15].

Lower bounds. To complement our upper bounds, we give a generic framework for proving lower bounds against testing classes of distributions. In more detail, we describe how to reduce – under a mild assumption on the property C – the problem of testing membership in C ("is D ∈ C?") to testing identity to D∗ ("is D = D∗?"), for any explicit distribution D∗ in C. While these two problems need not in general be related,⁴ we show that our reduction-based approach applies to a large number of natural properties, and obtain lower bounds that nearly match our upper bounds for all of them. Moreover, this lets us derive a simple proof of the lower bound of [AD15] on testing the class of PBDs. The reader is referred to Theorem 6.1 for the formal statement of our reduction-based lower bound theorem.
In this section, we state the concrete corollaries we obtain for specific structured distribution families:

Corollary 1.6. Testing log-concavity, convexity, concavity, MHR, unimodality, t-modality, t-histograms, and t-piecewise degree-d distributions each requires Ω(√n/ε^2) samples (the last three for t = o(√n) and t(d+1) = o(√n), respectively), for any ε ≥ 1/n^{O(1)}.

Corollary 1.7. Testing the classes of Binomial and Poisson Binomial Distributions each requires Ω(n^{1/4}/ε^2) samples, for any ε ≥ 1/n^{O(1)}.

Corollary 1.8. There exist absolute constants c > 0 and ε0 > 0 such that testing the class of k-SIIRV distributions requires Ω(k^{1/2} n^{1/4}) samples, for any k = o(n^c) and ε ≤ ε0.

Tolerant testing. Tolerant testing of a property P is defined as follows: given 0 ≤ ε1 < ε2 ≤ 1, one must distinguish between (a) ℓ1(D, P) ≤ ε1 and (b) ℓ1(D, P) ≥ ε2; this turns out to be, in general, a much harder task than "regular" testing (where we take ε1 = 0). Using our techniques, we also establish nearly-tight upper and lower bounds on tolerant testing for shape restrictions. Similarly, our upper and lower bounds are matching as a function of the domain size. More specifically, we give a simple generic upper bound approach (namely, a learning followed by tolerant testing algorithm).

³ For the sample complexity of testing monotonicity, [BKR04] originally states an Õ(√n/ε^4) upper bound, but the proof seems to only result in an Õ(√n/ε^6) bound. Regarding the class of PBDs, [AD15] obtain an n^{1/4}·Õ(1/ε^2) + Õ(1/ε^6) sample complexity, to be compared with our Õ(n^{1/4}/ε^{7/2}) + O(log^4 n/ε^4) upper bound; as well as an Ω(n^{1/4}/ε^2) lower bound.
⁴ As a simple example, consider the class C of all distributions, for which testing membership is trivial.
Our tolerant testing lower bounds follow the same reduction-based approach as in the non-tolerant case. In more detail, our results are as follows (see Section 6 and Section 7):

Corollary 1.9. Tolerant testing of log-concavity, convexity, concavity, MHR, unimodality, and t-modality can be performed with O((1/(ε2 − ε1)^2) · n/log n) samples, for ε2 ≥ Cε1 (where C > 2 is an absolute constant).

Corollary 1.10. Tolerant testing of the classes of Binomial and Poisson Binomial Distributions can be performed with O((1/(ε2 − ε1)^2) · √(n log(1/ε1))/log n) samples, for ε2 ≥ Cε1 (where C > 2 is an absolute constant).

Corollary 1.11. Tolerant testing of log-concavity, convexity, concavity, MHR, unimodality, and t-modality each requires Ω((1/(ε2 − ε1)) · n/log n) samples (the latter for t = o(n)).

Corollary 1.12. Tolerant testing of the classes of Binomial and Poisson Binomial Distributions each requires Ω((1/(ε2 − ε1)) · √n/log n) samples.

On the scope of our results. We point out that our main theorem is likely to apply to many other classes of structured distributions, due to the mild structural assumptions it requires. However, we did not attempt here to be comprehensive, but rather to illustrate the generality of our approach. Moreover, for all properties considered in this paper the generic upper and lower bounds we derive through our methods turn out to be optimal up to at most polylogarithmic factors (with regard to the support size). The reader is referred to Table 1 for a summary of our results and related work.

1.2 Organization of the Paper

We start by giving the necessary background and definitions in Section 2, before turning to our main result, the proof of Theorem 1.1 (our general testing algorithm), in Section 3.
In Section 4, we establish the necessary structural theorems for each class of distributions considered, enabling us to derive the upper bounds of Table 1. Section 5 introduces a slight modification of our algorithm which yields stronger testing results for classes of distributions with small effective support, and uses it to derive Corollary 1.5, our upper bound for Poisson Binomial distributions. Section 6 then contains the details of our lower bound methodology, and of its applications to the classes of Table 1. Finally, Section 6.2 is concerned with the extension of this methodology to tolerant testing, of which Section 7 describes a generic upper bound counterpart.

2 Notation and Preliminaries

2.1 Definitions

We give here the formal descriptions of the classes of distributions involved in this work. Recall that a distribution D over [n] is monotone (non-increasing) if its probability mass function (pmf) satisfies D(1) ≥ D(2) ≥ · · · ≥ D(n). A natural generalization of the class M of monotone distributions is the set of t-modal distributions, i.e. distributions whose pmf can go "up and down" or "down and up" up to t times:⁵

Definition 2.1 (t-modal). Fix any distribution D over [n], and integer t. D is said to have t modes if there exists a sequence i0 < · · · < i_{t+1} such that either (−1)^j D(i_j) < (−1)^j D(i_{j+1}) for all 0 ≤ j ≤ t, or (−1)^j D(i_j) > (−1)^j D(i_{j+1}) for all 0 ≤ j ≤ t. We call D t-modal if it has at most t modes, and write M_t for the class of all t-modal distributions (omitting the dependence on n). The particular case of t = 1 corresponds to the set M_1 of unimodal distributions.

Definition 2.2 (Log-Concave).
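For concreteness, Definition 2.1 can be checked mechanically for an explicitly given pmf. The sketch below (function names are ours) counts strict direction changes in the pmf; this matches the definition because a longest alternating chain of strict comparisons can always be realized on consecutive direction changes of the sequence.

```python
def num_modes(pmf):
    """Number of modes of an explicit pmf, in the sense of Definition 2.1:
    the largest t such that some subsequence alternates strictly t+1 times."""
    signs = []
    for a, b in zip(pmf, pmf[1:]):
        if a != b:
            s = 1 if b > a else -1
            if not signs or signs[-1] != s:
                signs.append(s)  # record each change of direction (zeros skipped)
    # len(signs) is the length of the longest alternating chain of strict
    # comparisons; a chain of t+1 comparisons witnesses t modes.
    return max(len(signs) - 1, 0)

def is_t_modal(pmf, t):
    return num_modes(pmf) <= t
```

Note that a monotone pmf has 0 modes under this convention, and a "bimodal" pmf in the usual statistical sense (two peaks separated by a valley) is 3-modal, as the footnote to Definition 2.1 points out.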
A distribution D over [n] is said to be log-concave if it satisfies the following conditions: (i) for any 1 ≤ i < j < k ≤ n such that D(i)D(k) > 0, D(j) > 0; and (ii) for all 1 < k < n, D(k)^2 ≥ D(k−1)D(k+1). We write L for the class of all log-concave distributions (omitting the dependence on n).

Definition 2.3 (Concave and Convex). A distribution D over [n] is said to be concave if it satisfies the following conditions: (i) for any 1 ≤ i < j < k ≤ n such that D(i)D(k) > 0, D(j) > 0; and (ii) for all 1 < k < n such that D(k−1)D(k+1) > 0, 2D(k) ≥ D(k−1) + D(k+1); it is convex if the reverse inequality holds in (ii). We write K− (resp. K+) for the class of all concave (resp. convex) distributions (omitting the dependence on n).

It is not hard to see that convex and concave distributions are unimodal; moreover, every concave distribution is also log-concave, i.e. K− ⊆ L. Note that in both Definition 2.2 and Definition 2.3, condition (i) is equivalent to enforcing that the distribution be supported on an interval.

Definition 2.4 (Monotone Hazard Rate). A distribution D over [n] is said to have monotone hazard rate (MHR) if its hazard rate H(i) := D(i) / Σ_{j≥i} D(j) is a non-decreasing function. We write MHR for the class of all MHR distributions (omitting the dependence on n). It is known that every log-concave distribution is both unimodal and MHR (see e.g. [An96, Proposition 10]), and that monotone distributions are MHR.

Two other classes of distributions have elicited significant interest in the context of density estimation, that of histograms (piecewise constant) and piecewise polynomial densities:

Definition 2.5 (Piecewise Polynomials [CDSS14a]).
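Both Definition 2.2 and Definition 2.4 translate directly into checks on an explicit pmf. A minimal sketch (function names ours; a small floating-point tolerance is used in the hazard-rate comparison):

```python
def is_log_concave(pmf):
    """Definition 2.2: (i) support is an interval; (ii) D(k)^2 >= D(k-1)D(k+1)."""
    support = [i for i, p in enumerate(pmf) if p > 0]
    if support and support[-1] - support[0] + 1 != len(support):
        return False  # condition (i) fails: a zero inside the support
    return all(pmf[k] ** 2 >= pmf[k - 1] * pmf[k + 1]
               for k in range(1, len(pmf) - 1))

def is_mhr(pmf):
    """Definition 2.4: hazard rate H(i) = D(i) / sum_{j>=i} D(j) non-decreasing."""
    tail = sum(pmf)
    hazards = []
    for p in pmf:
        if tail <= 0:
            break  # past the support; H is undefined there
        hazards.append(p / tail)
        tail -= p
    return all(h1 <= h2 + 1e-12 for h1, h2 in zip(hazards, hazards[1:]))
```

As a sanity check on the text's claim that log-concave distributions are MHR, a strictly log-concave pmf such as [0.1, 0.3, 0.35, 0.2, 0.05] passes both checks.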
A distribution D over [n] is said to be a t-piecewise degree-d distribution if there is a partition of [n] into t disjoint intervals I_1, ..., I_t such that D(i) = p_j(i) for all i ∈ I_j, where each p_1, ..., p_t is a univariate polynomial of degree at most d. We write P_{t,d} for the class of all t-piecewise degree-d distributions (omitting the dependence on n). (We note that t-piecewise degree-0 distributions are also commonly referred to as t-histograms, and write H_t for P_{t,0}.)

Finally, we recall the definition of the two following classes, which both extend the family of Binomial distributions BIN_n: the first, by removing the need for each of the independent Bernoulli summands to share the same bias parameter.

Definition 2.6. A random variable X is said to follow a Poisson Binomial Distribution (with parameter n ∈ ℕ) if it can be written as X = Σ_{k=1}^n X_k, where X_1, ..., X_n are independent, non-necessarily identically distributed Bernoulli random variables. We denote by PBD_n the class of all such Poisson Binomial Distributions.

It is not hard to show that Poisson Binomial Distributions are in particular log-concave. One can generalize even further, by allowing each random variable of the summation to be integer-valued:

Definition 2.7. Fix any k ≥ 0. We say a random variable X is a k-Sum of Independent Integer Random Variables (k-SIIRV) with parameter n ∈ ℕ if it can be written as X = Σ_{j=1}^n X_j, where X_1, ..., X_n are independent, non-necessarily identically distributed random variables taking values in {0, 1, ..., k−1}. We denote by k-SIIRV_n the class of all such k-SIIRVs.

⁵ Note that this slightly deviates from the Statistics literature, where only the peaks are counted as modes (so that what is usually referred to as a bimodal distribution is, according to our definition, 3-modal).
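Definition 2.6 is easy to work with computationally: a PBD can be sampled summand by summand, and its exact pmf obtained by convolving one Bernoulli at a time. A minimal sketch (function names ours):

```python
import random

def sample_pbd(biases, rng=None):
    """One draw of X = X_1 + ... + X_n with X_k ~ Bernoulli(biases[k])."""
    rng = rng or random.Random()
    return sum(rng.random() < p for p in biases)

def pbd_pmf(biases):
    """Exact pmf of the PBD, by dynamic programming: convolve in one
    Bernoulli(p) at a time, so pmf has length len(biases) + 1 at the end."""
    pmf = [1.0]
    for p in biases:
        pmf = [(pmf[i] if i < len(pmf) else 0.0) * (1 - p)
               + (pmf[i - 1] * p if i > 0 else 0.0)
               for i in range(len(pmf) + 1)]
    return pmf
```

The resulting pmf also illustrates the remark above: checking D(k)^2 ≥ D(k−1)D(k+1) on the output confirms log-concavity on small examples.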
2.2 Tools from previous work

We first restate a result of Batu et al. relating closeness to uniformity in ℓ2 and ℓ1 norms to the "overall flatness" of the probability mass function, which will be one of the ingredients of the proof of Theorem 1.1:

Lemma 2.8 ([BFR+00, BFF+01]). Let D be a distribution on a domain S. (a) If max_{i∈S} D(i) ≤ (1+ε) min_{i∈S} D(i), then ‖D‖_2^2 ≤ (1+ε^2)/|S|. (b) If ‖D‖_2^2 ≤ (1+ε^2)/|S|, then ‖D − U_S‖_1 ≤ ε.

To check condition (b) above we shall rely on the following, which one can derive from the techniques in [DKN15b] and whose proof we defer to Appendix A:

Lemma 2.9 (Adapted from [DKN15b, Theorem 11]). There exists an algorithm Check-Small-ℓ2 which, given parameters ε, δ ∈ (0, 1) and c·(√|I|/ε^2)·log(1/δ) independent samples from a distribution D over I (for some absolute constant c > 0), outputs either yes or no, and satisfies the following.
• If ‖D − U_I‖_2 > ε/√|I|, then the algorithm outputs no with probability at least 1 − δ;
• If ‖D − U_I‖_2 ≤ ε/(2√|I|), then the algorithm outputs yes with probability at least 1 − δ.

Finally, we will also rely on a classical result from probability, the Dvoretzky–Kiefer–Wolfowitz (DKW) inequality, restated below:

Theorem 2.10 ([DKW56, Mas90]). Let D be a distribution over [n]. Given m independent samples x_1, ..., x_m from D, define the empirical distribution D̂ as follows:

  D̂(i) := |{ j ∈ [m] : x_j = i }| / m,  i ∈ [n].

Then, for all ε > 0, Pr[ ‖D − D̂‖_Kol > ε ] ≤ 2e^{−2mε^2}, where ‖·−·‖_Kol denotes the Kolmogorov distance (i.e., the ℓ∞ distance between cumulative distribution functions). In particular, this implies that O(1/ε^2) samples suffice to learn a distribution up to ε in Kolmogorov distance.
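Check-Small-ℓ2 is used as a black box here (its proof is deferred to Appendix A). For intuition only, one standard way to realize such a check, not necessarily the [DKN15b] construction, is via the collision statistic, whose expectation is exactly ‖D‖_2^2, combined with the identity ‖D − U_I‖_2^2 = ‖D‖_2^2 − 1/|I|. The sketch below (names and the midpoint threshold are ours) omits the constants and confidence amplification of Lemma 2.9:

```python
from itertools import combinations

def l2_norm_squared_estimate(samples):
    """Unbiased estimate of ||D||_2^2: the fraction of colliding sample pairs."""
    m = len(samples)
    collisions = sum(1 for x, y in combinations(samples, 2) if x == y)
    return collisions / (m * (m - 1) / 2)

def check_small_l2(samples, interval_size, eps):
    """Toy analogue of Check-Small-l2: accept when the estimated squared
    l2 distance to the uniform distribution on I is below a midpoint
    threshold between the lemma's two cases (eps^2/|I| vs eps^2/(4|I|))."""
    dist2 = l2_norm_squared_estimate(samples) - 1.0 / interval_size
    return "yes" if dist2 <= (eps ** 2) / (2 * interval_size) else "no"
```

With enough samples the estimator concentrates, which is what the √|I|/ε^2 sample bound of Lemma 2.9 quantifies.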
3 The General Algorithm

In this section, we obtain our main result, restated below:

Theorem 1.1 (Main Theorem). There exists an algorithm TestSplittable which, given sampling access to an unknown distribution D over [n] and parameter ε ∈ (0, 1], can distinguish with probability 2/3 between (a) D ∈ P versus (b) ℓ1(D, P) > ε, for any property P that satisfies the above natural structural criterion (Succinctness). Moreover, for many such properties this algorithm is computationally efficient, and its sample complexity is optimal (up to logarithmic factors and the exact dependence on ε).

Intuition. Before diving into the proof of this theorem, we first provide a high-level description of the argument. The algorithm proceeds in 3 stages: the first, the decomposition step, attempts to recursively construct a partition of the domain into a small number of intervals, with a very strong guarantee. If the decomposition succeeds, then the unknown distribution D will be close (in ℓ1 distance) to its "flattening" on the partition; while if it fails (too many intervals have to be created), this serves as evidence that D does not belong to the class and we can reject. The second stage, the approximation step, then learns this flattening of the distribution – which can be done with few samples since by construction we do not have many intervals. The last stage is purely computational, the projection step: we verify that the flattening we have learned is indeed close to the class C. If all three stages succeed, then by the triangle inequality it must be the case that D is close to C; and by the structural assumption on the class, if D ∈ C then it will admit succinct enough partitions, and all three stages will go through.
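The "flattening" referred to above is the distribution obtained by spreading the mass of each interval of the partition uniformly within that interval. A minimal sketch of this operation for explicit pmfs (function names ours):

```python
def flattening(pmf, partition):
    """Flattening of pmf on a partition: redistribute each interval's total
    mass uniformly over that interval. Intervals are 1-indexed, inclusive."""
    flat = [0.0] * len(pmf)
    for lo, hi in partition:
        mass = sum(pmf[lo - 1:hi])
        width = hi - lo + 1
        for i in range(lo - 1, hi):
            flat[i] = mass / width
    return flat

def l1(p, q):
    """l1 distance between two explicit pmfs on the same domain."""
    return sum(abs(a - b) for a, b in zip(p, q))
```

The decomposition step succeeds precisely when l1(pmf, flattening(pmf, partition)) is small for a partition with few intervals; the approximation step then only needs to learn one mass per interval.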
Turning to the proof, we start by formally defining the "structural criterion" we shall rely on, before describing the algorithm at the heart of our result in Section 3.1. (We note that a modification of this algorithm will be described in Section 5, and will allow us to derive Corollary 1.5.)

Definition 3.1 (Decompositions). Let γ > 0 and L = L(γ, n) ≥ 1. A class of distributions C on [n] is said to be (γ, L)-decomposable if for every D ∈ C there exist ℓ ≤ L and a partition I(γ, D) = (I_1, ..., I_ℓ) of the interval [1, n] such that, for all j ∈ [ℓ], one of the following holds: (i) D(I_j) ≤ γ/L; or (ii) max_{i∈I_j} D(i) ≤ (1+γ)·min_{i∈I_j} D(i). Further, if I(γ, D) is dyadic (i.e., each I_k is of the form [j·2^i + 1, (j+1)·2^i] for some integers i, j, corresponding to the leaves of a recursive bisection of [n]), then C is said to be (γ, L)-splittable.

Lemma 3.2. If C is (γ, L)-decomposable, then it is (γ, O(L log n))-splittable.

Proof. We begin by proving the claim that for every partition I = {I_1, I_2, ..., I_L} of the interval [1, n] into L intervals, there exists a refinement of that partition which consists of at most L·log n dyadic intervals. It suffices to prove that every interval [a, b] ⊆ [1, n] can be partitioned into at most O(log n) dyadic intervals. Indeed, let ℓ be the largest integer such that 2^ℓ ≤ (b−a)/2, and let m be the smallest integer such that m·2^ℓ ≥ a. It follows that m·2^ℓ ≤ a + (b−a)/2 = (a+b)/2 and (m+1)·2^ℓ ≤ b. So the interval I = [m·2^ℓ + 1, (m+1)·2^ℓ] is fully contained in [a, b] and has size at least (b−a)/4. We will also use the fact that, for every ℓ′ ≤ ℓ,

  m·2^ℓ = m·2^{ℓ−ℓ′}·2^{ℓ′} = m′·2^{ℓ′}.   (1)

Now consider the following procedure: starting from the right (resp.
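The left/right greedy covering in the proof can equivalently be realized as the standard recursive-bisection decomposition, which also produces O(log n) dyadic intervals per covered interval. A self-contained sketch for n a power of two (function name ours):

```python
def dyadic_cover(a, b, lo, hi):
    """Cover [a, b] (1-indexed, inclusive) by maximal dyadic intervals, where
    [lo, hi] is the current node of the recursive bisection of [1, n].
    Each level of the bisection contributes at most two output intervals,
    so the cover has O(log n) pieces, as in the claim of Lemma 3.2."""
    if a > hi or b < lo:
        return []                       # no overlap with this node
    if a <= lo and hi <= b:
        return [(lo, hi)]               # node fully inside [a, b]: keep it whole
    mid = (lo + hi) // 2
    return dyadic_cover(a, b, lo, mid) + dyadic_cover(a, b, mid + 1, hi)
```

For instance, over [1, 8] the interval [2, 7] is covered by the four dyadic intervals (2,2), (3,4), (5,6), (7,7), comfortably within the 2 log n + 2 bound of the proof.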
left) side of the in terv al I , w e add the largest in terv al whic h is adjacen t to it and fully con tained in [ a, b ] and recurse un til w e co v er the w hole int erv al [( m + 1) · 2 ℓ + 1 , b ] (resp. [ a, m · 2 ℓ ]). C learly , at the end of this pro cedure, the whole in terv al [ a, b ] is co v ered by dya dic in terv als. It remains to sho w that the pr ocedure tak es O (log n ) steps. Indeed, using Equation 1 , w e can see that at least half of the remaining left or righ t in terv al is co vered in ea c h step (except ma yb e for the first 2 steps where it is at least a quarter). Th us, the pro cedure will take at most 2 log n + 2 = O (log n ) steps in total. F rom the ab o v e, w e can see that eac h of t he L inte rv als o f the p artiti on I can b e co ve red with O (log n ) dy adic interv als, whic h completes the p roof of the claim. In order to c omplete the pr oof of the le mma, notic e that the t w o conditions in Definition 3.1 are closed under taking subsets. 3.1 The algorithm Theorem 1.1 , and with it Corollary 1.2 and Corollary 1.3 will follo w from the theorem b elo w , com bined with the str u ctural th eorems from Section 4 : Theorem 3.3. L et C b e a class of distributions over [ n ] for which the fol lowing hold s. 1. C is ( γ , L ( γ , n )) -splittable; 2. ther e exists a pr o c e dur e ProjectionDist C which, given as input a p ar ameter α ∈ (0 , 1) and the explicit description of a distribution D over [ n ] , r eturns ye s if the distanc e ℓ 1 ( D , C ) to C is at most α/ 10 , and no if ℓ 1 ( D , C ) ≥ 9 α/ 10 (a nd either yes or no otherwise). Then, the algorithm TestSplitt able ( A lgorithm 1 ) is a O  max  √ nL log n /ε 3 , L/ε 2  -sample tester for C , for L = L ( ε, n ) . (Mor e over, i f Pr ojectionDist C is c omputational ly efficient, then so i s TestSplitt able .) 
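The $O(\log n)$ dyadic cover of an interval $[a,b]$ can also be obtained by a simple left-to-right greedy scan (a sketch of ours, equivalent in spirit to the two-sided procedure in the proof above): repeatedly take the largest dyadic block starting at the current point that still fits inside $[a,b]$.

```python
def dyadic_cover(a, b):
    """Partition the integer interval [a, b] (1-indexed) into dyadic
    blocks [j*2^i + 1, (j+1)*2^i], greedily from the left. Block sizes
    first grow then shrink, so O(log(b - a)) blocks suffice."""
    cover = []
    while a <= b:
        # largest power of two 2^i such that a - 1 is a multiple of 2^i
        # and the block [a, a + 2^i - 1] still fits inside [a, b]
        size = 1
        while (a - 1) % (2 * size) == 0 and a + 2 * size - 1 <= b:
            size *= 2
        cover.append((a, a + size - 1))
        a += size
    return cover
```

For instance, `dyadic_cover(3, 10)` yields the three dyadic blocks $[3,4]$, $[5,8]$, $[9,10]$.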
3.2 Proof of Theorem 3.3

We now give the proof of our main result (Theorem 3.3), first analyzing the sample complexity of Algorithm 1 before arguing its correctness. For the latter, we will need the following simple lemma from [ILR12], restated below:

Fact 3.4 ([ILR12, Fact 1]). Let $D$ be a distribution over $[n]$, and $\delta \in (0,1]$. Given $m \ge C \cdot \frac{\log(n/\delta)}{\eta}$ independent samples from $D$ (for some absolute constant $C > 0$), with probability at least $1-\delta$ we have that, for every interval $I \subseteq [n]$:
(i) if $D(I) \ge \frac{\eta}{4}$, then $\frac{D(I)}{2} \le \frac{m_I}{m} \le \frac{3D(I)}{2}$;
(ii) if $\frac{m_I}{m} \ge \frac{\eta}{2}$, then $D(I) > \frac{\eta}{4}$;
(iii) if $\frac{m_I}{m} < \frac{\eta}{2}$, then $D(I) < \eta$;
where $m_I \stackrel{\rm def}{=} |\{ j \in [m] : x_j \in I \}|$ is the number of the samples falling into $I$.

3.3 Sample complexity. The sample complexity is immediate, and comes from Steps 4 and 20. The total number of samples is
$$m + O\!\left(\frac{\ell}{\varepsilon^2}\right) = O\!\left(\frac{\sqrt{|I| \cdot L}}{\varepsilon^3}\log|I| + \frac{L}{\varepsilon}\log|I| + \frac{L}{\varepsilon^2}\right) = O\!\left(\frac{\sqrt{|I| \cdot L}}{\varepsilon^3}\log|I| + \frac{L}{\varepsilon^2}\right).$$

Algorithm 1 TestSplittable
Require: Domain $I$ (interval), sample access to $D$ over $I$; subroutine ProjectionDist$_{\mathcal{C}}$
Input: Parameters $\varepsilon$ and function $L_{\mathcal{C}}(\cdot,\cdot)$.
1: Setting up
2: Define $\gamma \stackrel{\rm def}{=} \frac{\varepsilon}{80}$, $L \stackrel{\rm def}{=} L_{\mathcal{C}}(\gamma, |I|)$, $\kappa \stackrel{\rm def}{=} \frac{\varepsilon}{160L}$, $\delta \stackrel{\rm def}{=} \frac{1}{10L}$; and let $c > 0$ be as in Lemma 2.9.
3: Set $m \stackrel{\rm def}{=} C \cdot \max\!\left(\frac{1}{\kappa}, \frac{\sqrt{L|I|}}{\varepsilon^3}\right)\cdot \log|I| = \tilde{O}\!\left(\frac{\sqrt{L|I|}}{\varepsilon^3} + \frac{L}{\varepsilon}\right)$. ⊲ $C$ is an absolute constant.
4: Obtain a sequence $s$ of $m$ independent samples from $D$. ⊲ For any $J \subseteq I$, let $m_J$ be the number of samples falling in $J$.
5:
6: Decomposition
7: while $m_I \ge \max\!\left(c \cdot \frac{\sqrt{|I|}}{\varepsilon^2}\log\frac{1}{\delta}, \kappa m\right)$ and at most $L$ splits have been performed do
8: Run Check-Small-$\ell_2$ (from Lemma 2.9) with parameters $\frac{\varepsilon}{40}$ and $\delta$, using the samples of $s$ belonging to $I$.
9: if Check-Small-$\ell_2$ outputs no then
10: Bisect $I$, and recurse on both halves (using the same samples).
11: end if
12: end while
13: if more than $L$ splits have been performed then
14: return REJECT
15: else
16: Let $\mathcal{I} \stackrel{\rm def}{=} (I_1, \dots, I_\ell)$ be the partition of $[n]$ from the leaves of the recursion. ⊲ $\ell \le L$.
17: end if
18:
19: Approximation
20: Learn the flattening $\Phi(D, \mathcal{I})$ of $D$ to $\ell_1$ error $\frac{\varepsilon}{20}$ (with failure probability $1/10$), using $O(\ell/\varepsilon^2)$ new samples. Let $\tilde{D}$ be the resulting hypothesis. ⊲ $\tilde{D}$ is an $\ell$-histogram.
21:
22: Offline Check
23: return ACCEPT if and only if ProjectionDist$_{\mathcal{C}}$($\varepsilon$, $\tilde{D}$) returns yes. ⊲ No sample needed.

3.4 Correctness. Say an interval $I$ considered during the execution of the Decomposition step is heavy if $m_I$ is big enough on Step 7, and light otherwise; and let $H$ and $\mathcal{L}$ denote the sets of heavy and light intervals respectively. By choice of $m$ and a union bound over all $|I|^2$ possible intervals, we can assume on one hand that with probability at least $9/10$ the guarantees of Fact 3.4 hold simultaneously for all intervals considered. We hereafter condition on this event.

We first argue that if the algorithm does not reject in Step 13, then with probability at least $9/10$ we have $\|D - \Phi(D,\mathcal{I})\|_1 \le \varepsilon/20$. Indeed, we can write
$$\|D - \Phi(D,\mathcal{I})\|_1 = \sum_{k : I_k \in \mathcal{L}} D(I_k)\,\|D_{I_k} - U_{I_k}\|_1 + \sum_{k : I_k \in H} D(I_k)\,\|D_{I_k} - U_{I_k}\|_1 \le 2\sum_{k : I_k \in \mathcal{L}} D(I_k) + \sum_{k : I_k \in H} D(I_k)\,\|D_{I_k} - U_{I_k}\|_1.$$
Let us bound the two terms separately.
• If $I' \in H$, then by our choice of threshold we can apply Lemma 2.9 with $\delta = \frac{1}{10L}$; conditioning on all of the (at most $L$) events happening, which overall fails with probability at most $1/10$ by a union bound, we get
$$\|D_{I'}\|_2^2 = \|D_{I'} - U_{I'}\|_2^2 + \frac{1}{|I'|} \le \left(1 + \frac{\varepsilon^2}{1600}\right)\frac{1}{|I'|}$$
as Check-Small-$\ell_2$ returned yes; and by Lemma 2.8 this implies $\|D_{I'} - U_{I'}\|_1 \le \varepsilon/40$.
• If $I' \in \mathcal{L}$, then we claim that $D(I') \le \max\!\left(\kappa, \frac{2c\sqrt{|I'|}}{m\varepsilon^2}\log\frac{1}{\delta}\right)$.
Clearly, this is true if $D(I') \le \kappa$, so it only remains to show that $D(I') \le \frac{2c\sqrt{|I'|}}{m\varepsilon^2}\log\frac{1}{\delta}$. But this follows from Fact 3.4 (i): if we had $D(I') > \frac{2c\sqrt{|I'|}}{m\varepsilon^2}\log\frac{1}{\delta}$, then $m_{I'}$ would have been big enough, and $I' \notin \mathcal{L}$. Overall,
$$\sum_{I' \in \mathcal{L}} D(I') \le \sum_{I' \in \mathcal{L}} \left(\kappa + \frac{2c\sqrt{|I'|}}{m\varepsilon^2}\log\frac{1}{\delta}\right) \le L\kappa + 2\sum_{I' \in \mathcal{L}} \frac{c\sqrt{|I'|}}{m\varepsilon^2}\log\frac{1}{\delta} \le \frac{\varepsilon}{160}\left(1 + \sum_{I' \in \mathcal{L}}\sqrt{\frac{|I'|}{|I|\,L}}\right) \le \frac{\varepsilon}{80}$$
for a sufficiently big choice of constant $C > 0$ in the definition of $m$; where we first used that $|\mathcal{L}| \le L$, and then that $\sum_{I' \in \mathcal{L}}\sqrt{\frac{|I'|}{|I|}} \le \sqrt{L}$ by Jensen's inequality. Putting it together, this yields
$$\|D - \Phi(D,\mathcal{I})\|_1 \le 2\cdot\frac{\varepsilon}{80} + \frac{\varepsilon}{40}\sum_{k : I_k \in H} D(I_k) \le \frac{\varepsilon}{40} + \frac{\varepsilon}{40} = \frac{\varepsilon}{20}.$$

Soundness. By contrapositive, we argue that if the test returns ACCEPT, then (with probability at least $2/3$) $D$ is $\varepsilon$-close to $\mathcal{C}$. Indeed, conditioning on $\tilde{D}$ being $\varepsilon/20$-close to $\Phi(D,\mathcal{I})$, we get by the triangle inequality that
$$\ell_1(D, \mathcal{C}) \le \|D - \Phi(D,\mathcal{I})\|_1 + \|\Phi(D,\mathcal{I}) - \tilde{D}\|_1 + \ell_1(\tilde{D}, \mathcal{C}) \le \frac{\varepsilon}{20} + \frac{\varepsilon}{20} + \frac{9\varepsilon}{10} = \varepsilon.$$
Overall, this happens except with probability at most $1/10 + 1/10 + 1/10 < 1/3$.

Completeness. Assume $D \in \mathcal{C}$. Then the choice of $\gamma$ and $L$ ensures the existence of a good dyadic partition $\mathcal{I}(\gamma, D)$ in the sense of Definition 3.1. For any $I$ in this partition for which (i) holds ($D(I) \le \frac{\gamma}{L} < \frac{\kappa}{2}$), $I$ will have $\frac{m_I}{m} < \kappa$ and be kept as a "light leaf" (this by contrapositive of Fact 3.4 (ii)). For the other ones, (ii) holds: let $I$ be one of these (at most $L$) intervals.
• If $m_I$ is too small on Step 7, then $I$ is kept as a "light leaf."
• Otherwise, by our choice of constants we can use Lemma 2.8 and apply Lemma 2.9 with $\delta = \frac{1}{10L}$; conditioning on all of the (at most $L$) events happening, which overall fails with probability at most $1/10$ by a union bound, Check-Small-$\ell_2$ will output yes, as
$$\|D_I - U_I\|_2^2 = \|D_I\|_2^2 - \frac{1}{|I|} \le \left(1 + \frac{\varepsilon^2}{6400}\right)\frac{1}{|I|} - \frac{1}{|I|} = \frac{\varepsilon^2}{6400\,|I|},$$
and $I$ is kept as a "flat leaf." Therefore, as $\mathcal{I}(\gamma, D)$ is dyadic, the Decomposition stage is guaranteed to stop within at most $L$ splits (in the worst case, it goes on until $\mathcal{I}(\gamma, D)$ is considered, at which point it succeeds).⁶ Thus Step 13 passes, and the algorithm reaches the Approximation stage. By the foregoing discussion, this implies $\Phi(D,\mathcal{I})$ is $\varepsilon/20$-close to $D$ (and hence to $\mathcal{C}$); $\tilde{D}$ is then (except with probability at most $1/10$) $\left(\frac{\varepsilon}{20} + \frac{\varepsilon}{20} = \frac{\varepsilon}{10}\right)$-close to $\mathcal{C}$, and the algorithm returns ACCEPT.

⁶ In more detail, we want to argue that if $D$ is in the class, then a decomposition with at most $L$ pieces is found by the algorithm. Since there is a dyadic decomposition with at most $L$ pieces (namely, $\mathcal{I}(\gamma, D) = (I_1, \dots, I_t)$), it suffices to argue that the algorithm will never split one of the $I_j$'s (as every single $I_j$ will eventually be considered by the recursive binary splitting, unless the algorithm stopped recursing in this "path" before even considering $I_j$, which is even better). But this is the case by the above argument, which ensures each such $I_j$ will be recognized as satisfying one of the two conditions for "good decomposition" (being either close to uniform in $\ell_2$, or having very little mass).

4 Structural Theorems

In this section, we show that a wide range of natural distribution families are succinctly decomposable, and provide efficient projection algorithms for each class.

4.1 Existence of Structural Decompositions

Theorem 4.1 (Monotonicity). For all $\gamma > 0$, the class $\mathcal{M}$ of monotone distributions on $[n]$ is $(\gamma, L)$-splittable for $L \stackrel{\rm def}{=} O\!\left(\frac{\log^2 n}{\gamma}\right)$.

Note that this proof can already be found in [BKR04, Theorem 10], interwoven with the analysis of their algorithm. For the sake of being self-contained, we reproduce the structural part of their argument, removing its algorithmic aspects:

Proof of Theorem 4.1. We define the partition $\mathcal{I}$ recursively as follows: $\mathcal{I}^{(0)} = ([1,n])$, and for $j \ge 0$ the partition $\mathcal{I}^{(j+1)}$ is obtained from $\mathcal{I}^{(j)} = (I^{(j)}_1, \dots, I^{(j)}_{\ell_j})$ by going over the $I^{(j)}_i = [a^{(j)}_i, b^{(j)}_i]$ in order, and:
(a) if $D(I^{(j)}_i) \le \frac{\gamma}{L}$, then $I^{(j)}_i$ is added as an element of $\mathcal{I}^{(j+1)}$ ("marked as leaf");
(b) else, if $D(b^{(j)}_i) \le (1+\gamma)\,D(a^{(j)}_i)$, then $I^{(j)}_i$ is added as an element of $\mathcal{I}^{(j+1)}$ ("marked as leaf");
(c) otherwise, bisect $I^{(j)}_i$ into $I^{(j)}_L, I^{(j)}_R$ (with $|I^{(j)}_L| = \lceil |I^{(j)}_i|/2 \rceil$) and add both $I^{(j)}_L$ and $I^{(j)}_R$ as elements of $\mathcal{I}^{(j+1)}$;
and repeat until convergence (that is, whenever the last item is not applied for any of the intervals). Clearly, this process is well-defined, and will eventually terminate (as $(\ell_j)_j$ is a non-decreasing sequence of natural numbers, upper bounded by $n$). Let $\mathcal{I} = (I_1, \dots, I_\ell)$ (with $I_i = [a_i, a_{i+1})$) be its outcome, so that the $I_i$'s are consecutive intervals all satisfying either (a) or (b). As (b) clearly implies (ii), we only need to show that $\ell \le L$; for this purpose, we shall leverage as in [BKR04] the fact that $D$ is monotone to bound the number of recursion steps.
The recursion above defines a complete binary tree (with the leaves being the intervals satisfying (a) or (b), and the internal nodes the other ones). Let $t$ be the number of recursion steps the process goes through before converging to $\mathcal{I}$ (the height of the tree); as mentioned above, we have $t \le \log n$ (as we start with an interval of size $n$, and the length is halved at each step). Observe further that if at any point an interval $I^{(j)}_i = [a^{(j)}_i, b^{(j)}_i]$ has $D(a^{(j)}_i) \le \frac{\gamma}{nL}$, then it immediately (as well as all the $I^{(j)}_k$'s for $k \ge i$, by monotonicity) satisfies (a) and is no longer split ("becomes a leaf"). So at any $j \le t$, the number of intervals $i_j$ for which neither (a) nor (b) holds must satisfy
$$1 \ge D(a^{(j)}_1) > (1+\gamma)\,D(a^{(j)}_2) > (1+\gamma)^2\,D(a^{(j)}_3) > \cdots > (1+\gamma)^{i_j - 1}\,D(a^{(j)}_{i_j}) \ge (1+\gamma)^{i_j - 1}\,\frac{\gamma}{nL},$$
where $a^{(j)}_k$ denotes the beginning of the $k$-th interval (again we use monotonicity to argue that the extrema were reached at the ends of each interval), so that $i_j \le 1 + \frac{\log\frac{nL}{\gamma}}{\log(1+\gamma)}$. In particular, the total number of internal nodes is then
$$\sum_{j=1}^{t} i_j \le t\cdot\left(1 + \frac{\log\frac{nL}{\gamma}}{\log(1+\gamma)}\right) = (1+o(1))\,\frac{\log^2 n}{\log(1+\gamma)} \le L.$$
This implies the same bound on the number of leaves $\ell$.

Corollary 4.2 (Unimodality). For all $\gamma > 0$, the class $\mathcal{M}_1$ of unimodal distributions on $[n]$ is $(\gamma, L)$-decomposable for $L \stackrel{\rm def}{=} O\!\left(\frac{\log^2 n}{\gamma}\right)$.

Proof. For any $D \in \mathcal{M}_1$, $[n]$ can be partitioned into two intervals $I, J$ such that $D_I, D_J$ are either monotone non-increasing or non-decreasing. Applying Theorem 4.1 to $D_I$ and $D_J$ and taking the union of both partitions yields a (no longer necessarily dyadic) partition of $[n]$.

The same argument yields an analogous statement for $t$-modal distributions:

Corollary 4.3 ($t$-modality). For any $t \ge 1$ and all $\gamma > 0$, the class $\mathcal{M}_t$ of $t$-modal distributions on $[n]$ is $(\gamma, L)$-decomposable for $L \stackrel{\rm def}{=} O\!\left(\frac{t\log^2 n}{\gamma}\right)$.

Corollary 4.4 (Log-concavity, concavity and convexity). For all $\gamma > 0$, the classes $\mathcal{L}$, $\mathcal{K}^-$ and $\mathcal{K}^+$ of log-concave, concave and convex distributions on $[n]$ are $(\gamma, L)$-decomposable for $L \stackrel{\rm def}{=} O\!\left(\frac{\log^2 n}{\gamma}\right)$.

Proof. This is directly implied by Corollary 4.2, recalling that log-concave, concave and convex distributions are unimodal.

Theorem 4.5 (Monotone Hazard Rate). For all $\gamma > 0$, the class $\mathcal{MHR}$ of MHR distributions on $[n]$ is $(\gamma, L)$-decomposable for $L \stackrel{\rm def}{=} O\!\left(\frac{\log n}{\gamma}\right)$.

Proof. This follows from adapting the proof of [CDSS13], which establishes that every MHR distribution can be approximated in $\ell_1$ distance by a $O(\log(n/\varepsilon)/\varepsilon)$-histogram. For completeness, we reproduce their argument, suitably modified to our purposes, in Appendix B.

Theorem 4.6 (Piecewise Polynomials). For all $\gamma > 0$, $t, d \ge 0$, the class $\mathcal{P}_{t,d}$ of $t$-piecewise degree-$d$ distributions on $[n]$ is $(\gamma, L)$-decomposable for $L \stackrel{\rm def}{=} O\!\left(\frac{t(d+1)}{\gamma}\log^2 n\right)$. (Moreover, for the class of $t$-histograms $\mathcal{H}_t$ ($d = 0$) one can take $L = t$.)

Proof. The last part of the statement is obvious, so we focus on the first claim. Observing that each of the $t$ pieces of a distribution $D \in \mathcal{P}_{t,d}$ can be subdivided into at most $d+1$ intervals on which $D$ is monotone (being a degree-$d$ polynomial on each such piece), we obtain a partition of $[n]$ into at most $t(d+1)$ intervals. $D$ being monotone on each of them, we can apply an argument almost identical to that of Theorem 4.1 to argue that each interval can be further split into $O(\log^2 n/\gamma)$ subintervals, yielding a good decomposition with $O(t(d+1)\log^2 n/\gamma)$ pieces.
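To make the recursive construction in the proof of Theorem 4.1 concrete, here is an illustrative sketch for a monotone non-increasing $D$ (a simplification of ours: leaf conditions are checked against exact interval masses rather than from samples):

```python
def monotone_decomposition(D, gamma, L):
    """Recursive bisection from the proof of Theorem 4.1: an interval
    becomes a leaf once it is light (mass <= gamma/L) or nearly flat
    (endpoint ratio <= 1 + gamma); otherwise it is bisected."""
    leaves = []

    def split(a, b):                      # closed interval [a, b], 0-indexed
        mass = sum(D[a:b + 1])
        # (a) light leaf, or (b) flat leaf (D[a] is the max by monotonicity)
        if mass <= gamma / L or D[a] <= (1 + gamma) * D[b]:
            leaves.append((a, b))
        else:                             # (c) bisect and recurse
            mid = a + (b - a) // 2
            split(a, mid)
            split(mid + 1, b)

    split(0, len(D) - 1)
    return leaves
```

On a monotone input every leaf produced satisfies condition (i) or (ii) of Definition 3.1, and the counting argument above bounds their number by $L$.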
4.2 Projection Step: Computing the Distances

This section contains details of the distance estimation procedures for these classes, required in the last stage of Algorithm 1. (Note that some of these results are phrased in terms of distance approximation, as estimating the distance $\ell_1(D, \mathcal{C})$ to sufficient accuracy in particular yields an algorithm for this stage.) We focus in this section on achieving the sample complexities stated in Corollary 1.2, Corollary 1.3, and Corollary 1.4. While almost all the distance estimation procedures we give in this section are efficient, running in time polynomial in all the parameters or even with only a polylogarithmic dependence on $n$, there are two exceptions: namely, the procedures for monotone hazard rate (Lemma 4.9) and log-concave (Lemma 4.10) distributions. We do describe computationally efficient procedures for these two cases as well in Section 4.2.1, at a modest additive cost in the sample complexity.

Lemma 4.7 (Monotonicity [BKR04, Lemma 8]). There exists a procedure ProjectionDist$_{\mathcal{M}}$ that, on input $n$ as well as the full (succinct) specification of an $\ell$-histogram $D$ on $[n]$, computes the (exact) distance $\ell_1(D, \mathcal{M})$ in time $\mathrm{poly}(\ell)$.

A straightforward modification of the algorithm above (e.g., by adapting the underlying linear program to take as input the location $m \in [\ell]$ of the mode of the distribution; then trying all $\ell$ possibilities, running the subroutine $\ell$ times and picking the minimum value) results in a similar claim for unimodal distributions:

Lemma 4.8 (Unimodality). There exists a procedure ProjectionDist$_{\mathcal{M}_1}$ that, on input $n$ as well as the full (succinct) specification of an $\ell$-histogram $D$ on $[n]$, computes the (exact) distance $\ell_1(D, \mathcal{M}_1)$ in time $\mathrm{poly}(\ell)$.
A similar result can easily be obtained for the class of $t$-modal distributions as well, with a $\mathrm{poly}(\ell, t)$-time algorithm based on a combination of dynamic and linear programming. Analogous statements hold for the classes of concave and convex distributions $\mathcal{K}^+$, $\mathcal{K}^-$, also based on linear programming (specifically, on running $O(n^2)$ different linear programs, one for each possible support $[a,b] \subseteq [n]$, and taking the minimum over them).

Lemma 4.9 (MHR). There exists a (non-efficient) procedure ProjectionDist$_{\mathcal{MHR}}$ that, on input $n$, $\varepsilon$, as well as the full specification of a distribution $D$ on $[n]$, distinguishes between $\ell_1(D, \mathcal{MHR}) \le \varepsilon$ and $\ell_1(D, \mathcal{MHR}) > 2\varepsilon$ in time $2^{\tilde{O}_\varepsilon(n)}$.

Lemma 4.10 (Log-concavity). There exists a (non-efficient) procedure ProjectionDist$_{\mathcal{L}}$ that, on input $n$, $\varepsilon$, as well as the full specification of a distribution $D$ on $[n]$, distinguishes between $\ell_1(D, \mathcal{L}) \le \varepsilon$ and $\ell_1(D, \mathcal{L}) > 2\varepsilon$ in time $2^{\tilde{O}_\varepsilon(n)}$.

Proof of Lemma 4.9 and Lemma 4.10. We here give a naive algorithm for these two problems, based on an exhaustive search over a (huge) $\varepsilon$-cover $\mathcal{S}$ of distributions over $[n]$. Essentially, $\mathcal{S}$ contains all possible distributions whose probabilities $p_1, \dots, p_n$ are of the form $j\varepsilon/n$, for $j \in \{0, \dots, n/\varepsilon\}$ (so that $|\mathcal{S}| = O((n/\varepsilon)^n)$). It is not hard to see that this indeed defines an $\varepsilon$-cover of the set of all distributions, and moreover that it can be computed in time $\mathrm{poly}(|\mathcal{S}|)$. To approximate the distance from an explicit distribution $D$ to the class $\mathcal{C}$ (either $\mathcal{MHR}$ or $\mathcal{L}$), it is enough to go over every element $S$ of $\mathcal{S}$, checking (this time, efficiently) if $\|S - D\|_1 \le \varepsilon$ and if there is a distribution $P \in \mathcal{C}$ close to $S$ (this time, pointwise, that is $|P(i) - S(i)| \le \varepsilon/n$ for all $i$), which also implies $\|S - P\|_1 \le \varepsilon$ and thus $\|P - D\|_1 \le 2\varepsilon$.
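For intuition, the cover $\mathcal{S}$ can be enumerated explicitly for toy parameters (a brute-force sketch of ours, only to illustrate the construction; its size blows up as $(n/\varepsilon)^n$):

```python
from itertools import product

def build_cover(n, eps):
    """All probability vectors on [n] whose entries are multiples of
    eps/n: an l1 eps-cover of the simplex, of size O((n/eps)^n)."""
    step = eps / n
    k = round(1 / step)               # entries j*step with j = 0..k
    return [tuple(j * step for j in c)
            for c in product(range(k + 1), repeat=n) if sum(c) == k]
```

Rounding each probability of an arbitrary $D$ to the grid moves it by at most $\mathrm{step}/2$ per point, i.e., by at most $\varepsilon$ in $\ell_1$ overall, which is why this grid is a cover.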
The test for pointwise closeness can be done by checking feasibility of a linear program with variables corresponding to the logarithms of the probabilities, i.e., $x_i \equiv \ln P(i)$. Indeed, this formulation allows us to rephrase the log-concave and MHR constraints as linear constraints, and pointwise approximation is simply enforced by requiring $\ln(S(i) - \varepsilon/n) \le x_i \le \ln(S(i) + \varepsilon/n)$ for all $i$. At the end of this enumeration, the procedure accepts if and only if for some $S$ both $\|S - D\|_1 \le \varepsilon$ and the corresponding linear program was feasible.

Lemma 4.11 (Piecewise Polynomials). There exists a procedure ProjectionDist$_{\mathcal{P}_{t,d}}$ that, on input $n$ as well as the full specification of an $\ell$-histogram $D$ on $[n]$, computes an approximation $\Delta$ of the distance $\ell_1(D, \mathcal{P}_{t,d})$ such that $\ell_1(D, \mathcal{P}_{t,d}) \le \Delta \le 3\ell_1(D, \mathcal{P}_{t,d}) + \varepsilon$, and runs in time $O(n^3)\cdot\mathrm{poly}(\ell, t, d, \frac{1}{\varepsilon})$. Moreover, for the special case of $t$-histograms ($d = 0$) there exists a procedure ProjectionDist$_{\mathcal{H}_t}$ which, given inputs as above, computes an approximation $\Delta$ of the distance $\ell_1(D, \mathcal{H}_t)$ such that $\ell_1(D, \mathcal{H}_t) \le \Delta \le 4\ell_1(D, \mathcal{H}_t) + \varepsilon$, and runs in time $\mathrm{poly}(\ell, t, \frac{1}{\varepsilon})$, independent of $n$.

Proof. We begin with ProjectionDist$_{\mathcal{H}_t}$. Fix any distribution $D$ on $[n]$. Given any explicit partition of $[n]$ into intervals $\mathcal{I} = (I_1, \dots, I_t)$, one can easily show that $\|D - \Phi(D, \mathcal{I})\|_1 \le 2\,\mathrm{opt}_{\mathcal{I}}$, where $\mathrm{opt}_{\mathcal{I}}$ is the optimal distance of $D$ to any histogram on $\mathcal{I}$. To get a 2-approximation of $\ell_1(D, \mathcal{H}_t)$, it thus suffices to find the minimum, over all possible partitionings $\mathcal{I}$ of $[n]$ into $t$ intervals, of the quantity $\|D - \Phi(D, \mathcal{I})\|_1$ (which itself can be computed in time $T = O(\min(t\ell, n))$). By a simple dynamic programming approach, this can be performed in time $O(tn^2 \cdot T)$.
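This dynamic program can be written down directly over explicit probability vectors (an unoptimized illustration of ours; the paper's version works with succinct histograms and the coarsening of Claim 4.12 below):

```python
def best_histogram_error(D, t):
    """Dynamic program: minimum of ||D - Phi(D, I)||_1 over all
    partitions I of [n] into at most t intervals, where Phi flattens
    D on each interval. O(t * n^2) subproblems."""
    n = len(D)

    def flat_cost(a, b):                # l1 error of flattening D[a:b]
        avg = sum(D[a:b]) / (b - a)
        return sum(abs(x - avg) for x in D[a:b])

    INF = float("inf")
    # dp[j][i]: best error for the prefix D[:i] using at most j intervals
    dp = [[INF] * (n + 1) for _ in range(t + 1)]
    dp[0][0] = 0.0
    for j in range(1, t + 1):
        dp[j][0] = 0.0
        for i in range(1, n + 1):
            dp[j][i] = min(dp[j - 1][a] + flat_cost(a, i) for a in range(i))
    return dp[t][n]
```

By the display above, the returned quantity is a 2-approximation of $\ell_1(D, \mathcal{H}_t)$.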
The quadratic dependence on $n$, which follows from allowing the endpoints of the $t$ intervals to be at any point of the domain, is however far from optimal, and can be reduced to $(t/\varepsilon)^2$, as we show below. For $\eta > 0$, define an $\eta$-granular decomposition of a distribution $D$ over $[n]$ to be a partition of $[n]$ into $s = O(1/\eta)$ intervals $J_1, \dots, J_s$ such that each interval $J_i$ is either a singleton or satisfies $D(J_i) \le \eta$. (Note that if $D$ is a known $\ell$-histogram, one can compute an $\eta$-granular decomposition of $D$ in time $O(\ell/\eta)$ in a greedy fashion.)

Claim 4.12. Let $D$ be a distribution over $[n]$, and $\mathcal{J} = (J_1, \dots, J_s)$ be an $\eta$-granular decomposition of $D$ (with $s \ge t$). Then, there exists a partition of $[n]$ into $t$ intervals $\mathcal{I} = (I_1, \dots, I_t)$ and a $t$-histogram $H$ on $\mathcal{I}$ such that $\|D - H\|_1 \le 2\ell_1(D, \mathcal{H}_t) + 2t\eta$, and $\mathcal{I}$ is a coarsening of $\mathcal{J}$.

Before proving it, we describe how this will enable us to get the desired time complexity for ProjectionDist$_{\mathcal{H}_t}$. Phrased differently, the claim above allows us to run our dynamic program using only the $O(1/\eta)$ endpoints of the intervals of $\mathcal{J}$ instead of the $n$ points of the domain, paying only an additive error $O(t\eta)$. Setting $\eta = \frac{\varepsilon}{4t}$, the guarantee for ProjectionDist$_{\mathcal{H}_t}$ follows.

Proof of Claim 4.12. Let $\mathcal{J} = (J_1, \dots, J_s)$ be an $\eta$-granular decomposition of $D$, and $H^* \in \mathcal{H}_t$ be a histogram achieving $\mathrm{opt} = \ell_1(D, \mathcal{H}_t)$. Denote further by $\mathcal{I}^* = (I^*_1, \dots, I^*_t)$ the partition of $[n]$ corresponding to $H^*$. Consider now the $r \le t$ endpoints of the $I^*_i$'s that do not fall on one of the endpoints of the $J_i$'s: let $J_{i_1}, \dots, J_{i_r}$ be the respective intervals in which they fall (in particular, these cannot be singleton intervals), and $S = \cup_{j=1}^{r} J_{i_j}$ their union. By definition of $\eta$-granularity, $D(S) \le t\eta$, and it follows that $H^*(S) \le t\eta + \frac{1}{2}\mathrm{opt}$.
We define $H$ from $H^*$ in two stages: first, we obtain a (sub)distribution $H'$ by modifying $H^*$ on $S$, setting for each $x \in J_{i_j}$ the value of $H'$ to be the minimum value (among the two options) that $H^*$ takes on $J_{i_j}$. $H'$ is thus a $t$-histogram, and the endpoints of its intervals are endpoints of $\mathcal{J}$, as wished; but it may not sum to one. However, by construction we have that $H'([n]) \ge 1 - H^*(S) \ge 1 - t\eta - \frac{1}{2}\mathrm{opt}$. Using this, we can finally define our $t$-histogram distribution $H$ as the renormalization of $H'$. It is easy to check that $H$ is a valid $t$-histogram on a coarsening of $\mathcal{J}$, and
$$\|D - H\|_1 \le \|D - H'\|_1 + (1 - H'([n])) \le \|D - H^*\|_1 + \|H^* - H'\|_1 + t\eta + \frac{1}{2}\mathrm{opt} \le 2\,\mathrm{opt} + 2t\eta,$$
as stated.

Turning now to ProjectionDist$_{\mathcal{P}_{t,d}}$, we apply the same initial dynamic programming approach, which will result in a running time of $O(n^2 t \cdot T)$, where $T$ is the time required to estimate (to sufficient accuracy) the distance of a given (sub)distribution over an interval $I$ to the space $\mathcal{P}_d$ of degree-$d$ polynomials. Specifically, we will invoke the following result, adapted from [CDSS14a] to our setting:

Theorem 4.13. Let $p$ be an $\ell$-histogram over $[-1, 1)$. There is an algorithm ProjectSinglePoly($d$, $\eta$) which runs in time $\mathrm{poly}(\ell, d+1, 1/\eta)$, and outputs a degree-$d$ polynomial $q$ which defines a pdf over $[-1, 1)$ such that $\|p - q\|_1 \le 3\ell_1(p, \mathcal{P}_d) + O(\eta)$.

The proof of this modification of [CDSS14a, Theorem 9] is deferred to Appendix C. Applying it as a blackbox with $\eta$ set to $O(\varepsilon/t)$, and noting that computing the $\ell_1$ distance between our explicit distribution and the returned degree-$d$ polynomial on a given interval incurs an additional $O(n)$ factor, we obtain the claimed guarantee and running time.
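The greedy computation of an $\eta$-granular decomposition mentioned above can be sketched as follows (a simplification of ours: it takes an explicit probability vector rather than a succinct $\ell$-histogram):

```python
def granular_decomposition(D, eta):
    """Greedy eta-granular decomposition: scan left to right, extending
    the current interval while its mass stays <= eta; any point too
    heavy to fit becomes a singleton interval."""
    parts, start, mass = [], 0, 0.0
    for i, p in enumerate(D):
        if mass + p <= eta:
            mass += p                       # extend the current interval
        else:
            if start < i:
                parts.append((start, i))    # close [start, i)
            if p > eta:
                parts.append((i, i + 1))    # heavy point: singleton
                start, mass = i + 1, 0.0
            else:
                start, mass = i, p
    if start < len(D):
        parts.append((start, len(D)))
    return parts
```

Since consecutive non-singleton intervals together carry mass more than $\eta$, at most $O(1/\eta)$ intervals are produced, matching the definition.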
4.2.1 Computationally Efficient Procedures for Log-concave and MHR Distributions

We now describe how to obtain efficient testing for the classes $\mathcal{L}$ and $\mathcal{MHR}$; that is, how to obtain polynomial-time distance estimation procedures for these two classes, unlike the ones described in the previous section. At a very high level, the idea is in both cases to write down a linear program on variables related logarithmically to the probabilities we are searching for, as enforcing the log-concave and MHR constraints on these new variables can be done linearly. The catch now becomes the $\ell_1$ objective function (and, to a lesser extent, the fact that the probabilities must sum to one), now highly non-linear.

The first insight is to leverage the structure of log-concave (resp. monotone hazard rate) distributions to express this objective as slightly stronger constraints, specifically pointwise $(1\pm\varepsilon)$-multiplicative closeness, much easier to enforce in our "logarithmic formulation." Even so, doing this naively fails, essentially because of a too weak distance guarantee between our explicit histogram $\hat{D}$ and the unknown distribution we are trying to find: in the completeness case, we are only promised $\varepsilon$-closeness in $\ell_1$, while we would also require good additive pointwise closeness, of the order $\varepsilon^2$ or $\varepsilon^3$. The second insight is thus to observe that we "almost" have this for free: indeed, if we do not reject in the first stage of the testing algorithm, we do obtain an explicit $k$-histogram $\hat{D}$ with the guarantee that $D$ is $\varepsilon$-close to the distribution $P$ to test. However, we also implicitly have another distribution $\hat{D}'$ that is $\sqrt{\varepsilon/k}$-close to $P$ in Kolmogorov distance: as in the recursive descent we take enough samples to use the DKW inequality (Theorem 2.10) with this parameter, i.e., an additive overhead of $O(k/\varepsilon)$ samples (on top of the $\tilde{O}(\sqrt{kn}/\varepsilon^{7/2})$). If we are willing to increase this overhead by just a small amount, that is, to take $\tilde{O}(\max(k/\varepsilon, 1/\varepsilon^4))$ samples, we can guarantee that $\hat{D}'$ be also $\tilde{O}(\varepsilon^2)$-close to $P$ in Kolmogorov distance. Combining these ideas yields the following distance estimation lemmas:

Lemma 4.14 (Monotone Hazard Rate). There exists a procedure ProjectionDist*$_{\mathcal{MHR}}$ that, on input $n$ as well as the full specification of a $k$-histogram distribution $D$ on $[n]$ and of an $\ell$-histogram distribution $D'$ on $[n]$, runs in time $\mathrm{poly}(n, 1/\varepsilon)$, and satisfies the following.
• If there is $P \in \mathcal{MHR}$ such that $\|D - P\|_1 \le \varepsilon$ and $\|D' - P\|_{\rm Kol} \le \varepsilon^3$, then the procedure returns yes;
• If $\ell_1(D, \mathcal{MHR}) > 100\varepsilon$, then the procedure returns no.

Lemma 4.15 (Log-concavity). There exists a procedure ProjectionDist*$_{\mathcal{L}}$ that, on input $n$ as well as the full specifications of a $k$-histogram distribution $D$ on $[n]$ and an $\ell$-histogram distribution $D'$ on $[n]$, runs in time $\mathrm{poly}(n, k, \ell, 1/\varepsilon)$, and satisfies the following.
• If there is $P \in \mathcal{L}$ such that $\|D - P\|_1 \le \varepsilon$ and $\|D' - P\|_{\rm Kol} \le \frac{\varepsilon^2}{\log^2(1/\varepsilon)}$, then the procedure returns yes;
• If $\ell_1(D, \mathcal{L}) \ge 100\varepsilon$, then the procedure returns no.

The proofs of these two lemmas are quite technical and deferred to Appendix C. With these in hand, a simple modification of our main algorithm (specifically, setting $m = \tilde{O}(\max(\sqrt{|I|L}/\varepsilon^3, L^2/\varepsilon^2, 1/\varepsilon^c))$ for $c$ either 4 or 6, instead of $\tilde{O}(\max(\sqrt{|I|L}/\varepsilon^3, L^2/\varepsilon^2))$, to get the desired Kolmogorov distance guarantee; and providing the empirical histogram defined by these $m$ samples along to the distance estimation procedure) suffices to obtain the following counterpart to Corollary 1.2:

Corollary 4.16.
The algorithm TestSplittable, after this modification, can efficiently test the classes of log-concave and monotone hazard rate (MHR) distributions, with respectively $\tilde{O}(\sqrt{n}/\varepsilon^{7/2} + 1/\varepsilon^4)$ and $\tilde{O}(\sqrt{n}/\varepsilon^{7/2} + 1/\varepsilon^6)$ samples.

5 Going Further: Reducing the Support Size

The general approach we have been following so far gives, out of the box, an efficient testing algorithm with sample complexity $\tilde{O}(\sqrt{n})$ for a large range of properties. However, this sample complexity can for some classes $\mathcal{P}$ be brought down a lot more, by taking advantage, in a preprocessing step, of good concentration guarantees of distributions in $\mathcal{P}$. As a motivating example, consider the class of Poisson Binomial Distributions (PBD). It is well-known (see e.g. [KG71, Section 2]) that PBDs are unimodal, and more specifically that $\mathcal{PBD}_n \subseteq \mathcal{L} \subseteq \mathcal{M}_1$. Therefore, using our generic framework we can test Poisson Binomial Distributions with $\tilde{O}(\sqrt{n})$ samples. This is, however, far from optimal: as shown in [AD15], a sample complexity of $\Theta(n^{1/4})$ is both necessary and sufficient. The reason our general algorithm ends up making quadratically too many queries can be explained as follows. PBDs are tightly concentrated around their expectation, so that they "morally" live on a support of size $m = O(\sqrt{n})$. Yet, instead of testing them on this very small support, in the above we still consider the entire range $[n]$, and thus end up paying a dependence on $\sqrt{n}$ instead of $\sqrt{m}$. If we could use that observation to first reduce the domain to the effective support of the distribution, then we could call our testing algorithm on this reduced domain of size $O(\sqrt{n})$. In the rest of this section, we formalize and develop this idea, and in Section 5.2 will obtain as a direct application a $\tilde{O}(n^{1/4})$-sample testing algorithm for $\mathcal{PBD}_n$.

Definition 5.1.
Given ε > 0, the ε -effe ctive supp ort of a distribu tion D is the smallest int erv al I suc h that D ( I ) ≥ 1 − ε . The last definition w e s hall require is of the c onditione d distributions of a class C : Definition 5.2. F or any cl ass of distrib utions C o ve r [ n ], define the set of c onditione d dist ributions of C (with resp ect to ε > 0 and in terv al I ⊆ [ n ]) as C ε,I def = { D I : D ∈ C , D ( I ) ≥ 1 − ε } . Finally , w e w ill r equire the follo wing simp le r esult: Lemma 5.3. L et D b e a distribution over [ n ] , and I ⊆ [ n ] a n interval such th at D ( I ) ≥ 1 − ε 10 . Then, • If D ∈ C , then D I ∈ C ε 10 ,I ; • If ℓ 1 ( D , C ) > ε , then ℓ 1 ( D I , C ε 10 ,I ) > 7 ε 10 . Pr o of. The first item is obvio us. As for the second, let P ∈ C b e an y d istribution w ith P ( I ) ≥ 1 − ε 10 . By assumption, k D − P k 1 > ε : but we ha v e, writing α = 1 / 10, k D I − P I k 1 = X i ∈ I     D ( i ) D ( I ) − P ( i ) P ( I )     = 1 D ( I ) X i ∈ I     D ( i ) − P ( i ) + P ( i )  1 − D ( I ) P ( I )      ≥ 1 D ( I )  X i ∈ I | D ( i ) − P ( i ) | −     1 − D ( I ) P ( I )     X i ∈ I P ( i )  = 1 D ( I )  X i ∈ I | D ( i ) − P ( i ) | − | P ( I ) − D ( I ) |  ≥ 1 D ( I )  X i ∈ I | D ( i ) − P ( i ) | − αε  ≥ 1 D ( I )  k D − P k 1 − X i / ∈ I | D ( i ) − P ( i ) | − αε  ≥ 1 D ( I )  k D − P k 1 − 3 αε  > (1 − 3 α ) ε = 7 10 ε. 19 W e n o w proceed to state and pro ve our result – namely , efficien t testing of structur e d classe s of distributions with nice c onc e ntr ation pr op erties . Theorem 5.4. L et C b e a class of distributions over [ n ] for which the fol lowing hold s. 1. ther e is a function M ( · , · ) such that e ach D ∈ C has ε -effe ctive supp ort of size at most M ( n , ε ) ; 2. for every ε ∈ [0 , 1] and interval I ⊆ [ n ] , C ε,I is ( γ , L ) -splittable; 3. 
there exists an efficient procedure ProjectionDist$_{\mathcal{C}^{\varepsilon,I}}$ which, given as input the explicit description of a distribution $D$ over $[n]$ and an interval $I \subseteq [n]$, computes the distance $\ell_1(D_I, \mathcal{C}^{\varepsilon,I})$.

Then, the algorithm TestEffectiveSplittable (Algorithm 2) is a $O\left(\max\left(\frac{\sqrt{m\ell}\log m}{\varepsilon^3}, \frac{\ell^2}{\varepsilon^2}\right)\right)$-sample tester for $\mathcal{C}$, where $m = M(n, \frac{\varepsilon}{60})$ and $\ell = L(\frac{\varepsilon}{1200}, m)$.

Algorithm 2 TestEffectiveSplittable
Require: Domain $\Omega$ (interval of size $n$), sample access to $D$ over $\Omega$; subroutine ProjectionDist$_{\mathcal{C}^{\varepsilon,I}}$
Input: Parameters $\varepsilon \in (0,1]$, function $L(\cdot,\cdot)$, and upper bound function $M(\cdot,\cdot)$ for the effective support of the class $\mathcal{C}$.
1: Set $m \stackrel{\rm def}{=} O(1/\varepsilon^2)$, $\tau \stackrel{\rm def}{=} M(n, \frac{\varepsilon}{60})$.
2: Effective Support
3: Compute $\hat{D}$, an empirical estimate of $D$, by drawing $m$ independent samples from $D$.
4: Let $J$ be the largest interval of the form $\{1, \ldots, j\}$ such that $\hat{D}(J) \leq \frac{\varepsilon}{30}$.
5: Let $K$ be the largest interval of the form $\{k, \ldots, n\}$ such that $\hat{D}(K) \leq \frac{\varepsilon}{30}$.
6: Set $I \leftarrow [n] \setminus (J \cup K)$.
7: if $|I| > \tau$ then return REJECT
8: end if
9:
10: Testing
11: Call TestSplittable with $I$ (providing simulated access to $D_I$ by rejection sampling, returning FAIL if the number of samples $q$ from $D_I$ required by the subroutine is not obtained after $O(q)$ samples from $D$), ProjectionDist$_{\mathcal{C}^{\varepsilon,I}}$, parameters $\varepsilon' \stackrel{\rm def}{=} \frac{7\varepsilon}{10}$ and $L(\cdot,\cdot)$.
12: return ACCEPT if TestSplittable accepts, REJECT otherwise.

5.1 Proof of Theorem 5.4

By the choice of $m$ and the DKW inequality, with probability at least $23/24$ the estimate $\hat{D}$ satisfies $\|D - \hat{D}\|_{\rm Kol} \leq \frac{\varepsilon}{60}$. Conditioning on that from now on, we get that $D(I) \geq \hat{D}(I) - \frac{\varepsilon}{30} \geq 1 - \frac{\varepsilon}{10}$.
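To make the effective-support stage of Algorithm 2 (Steps 3 to 6) concrete, here is a small Python sketch that trims the largest low-mass prefix and suffix of the empirical distribution. The function name and interface are illustrative only, not part of the paper.

```python
from collections import Counter

def effective_support_interval(samples, n, eps):
    """Steps 3-6 of Algorithm 2 (sketch): from samples over [1, n], drop the
    largest prefix J and suffix K each of empirical mass at most eps/30,
    and return the remaining interval I = [n] \\ (J u K) as (left, right)."""
    m = len(samples)
    counts = Counter(samples)  # missing points have count 0
    # Largest prefix {1, ..., j} with empirical mass <= eps/30.
    mass, j = 0.0, 0
    for i in range(1, n + 1):
        if mass + counts[i] / m > eps / 30:
            break
        mass += counts[i] / m
        j = i
    # Largest suffix {k, ..., n} with empirical mass <= eps/30.
    mass, k = 0.0, n + 1
    for i in range(n, 0, -1):
        if mass + counts[i] / m > eps / 30:
            break
        mass += counts[i] / m
        k = i
    return (j + 1, k - 1)
```

The rest of the algorithm then compares $|I| = k - 1 - j$ against the threshold $\tau = M(n, \varepsilon/60)$ and, if the check passes, runs TestSplittable on $D_I$.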
Furthermore, denoting by $j$ and $k$ the two inner endpoints of $J$ and $K$ in Steps 4 and 5, we have $D(J \cup \{j+1\}) \geq \hat{D}(J \cup \{j+1\}) - \frac{\varepsilon}{60} > \frac{\varepsilon}{60}$ (and similarly for $D(K \cup \{k-1\})$), so that $I$ has size at most $\sigma + 1$, where $\sigma$ is the $\frac{\varepsilon}{60}$-effective support size of $D$. Finally, note that since $D(I) = \Omega(1)$ by our conditioning, the simulation of samples by rejection sampling will succeed with probability at least $23/24$, and the algorithm will not output FAIL.

Sample complexity. The sample complexity is the sum of the $O(1/\varepsilon^2)$ in Step 3 and the $O(q)$ in Step 11. From Theorem 1.1 and the choice of $I$, this latter quantity is $O\left(\max\left(\frac{\sqrt{M\ell}\log M}{\varepsilon^3}, \frac{\ell^2}{\varepsilon^2}\right)\right)$, where $M = M(n, \frac{\varepsilon}{60})$ and $\ell = L(\frac{\varepsilon}{1200}, M(n, \frac{\varepsilon}{60}))$.

Correctness. If $D \in \mathcal{C}$, then by the setting of $\tau$ (set to be an upper bound on the $\frac{\varepsilon}{60}$-effective support size of any distribution in $\mathcal{C}$) the algorithm will go beyond Step 6. The call to TestSplittable will then end up in the algorithm returning ACCEPT in Step 12, with probability at least $2/3$, by Lemma 5.3, Theorem 1.1, and our choice of parameters. Similarly, if $D$ is $\varepsilon$-far from $\mathcal{C}$, then either its effective support is too large (and then the test in Step 6 fails), or the main tester will detect that its conditional distribution on $I$ is $\frac{7\varepsilon}{10}$-far from $\mathcal{C}$ and output REJECT in Step 12. Overall, in either case the algorithm is correct except with probability at most $1/24 + 1/24 + 1/3 = 5/12$ (by a union bound). Repeating constantly many times and outputting the majority vote brings the probability of failure down to $1/3$.

5.2 Application: Testing Poisson Binomial Distributions

In this section, we illustrate the use of our generic two-stage approach to test the class of Poisson Binomial Distributions. Specifically, we prove the following result:

Corollary 5.5.
The class of Poisson Binomial Distributions can be tested with $\tilde{O}(n^{1/4}/\varepsilon^{7/2}) + O(\log^4 n/\varepsilon^4)$ samples, using Algorithm 2.

This is a direct consequence of Theorem 5.4 and the lemmas below. The first one states that, indeed, PBDs have small effective support:

Fact 5.6. For any $\varepsilon > 0$, a PBD has $\varepsilon$-effective support of size $O(\sqrt{n \log(1/\varepsilon)})$.

Proof. By an additive Chernoff bound, any random variable $X$ following a Poisson Binomial Distribution satisfies $\Pr[|X - \mathbb{E}X| > \gamma n] \leq 2e^{-2\gamma^2 n}$. Taking $\gamma \stackrel{\rm def}{=} \sqrt{\frac{1}{2n}\ln\frac{2}{\varepsilon}}$, we get that $\Pr[X \in I] \geq 1 - \varepsilon$, where $I \stackrel{\rm def}{=} \left[\mathbb{E}X - \sqrt{\frac{n}{2}\ln\frac{2}{\varepsilon}},\ \mathbb{E}X + \sqrt{\frac{n}{2}\ln\frac{2}{\varepsilon}}\right]$.

It is clear that if $D \in \mathcal{PBD}_n$ (and therefore is unimodal), then for any interval $I \subseteq [n]$ the conditional distribution $D_I$ is still unimodal, and thus the class of conditioned PBDs $\mathcal{PBD}_n^{\varepsilon,I} \stackrel{\rm def}{=} \{ D_I : D \in \mathcal{PBD}_n,\ D(I) \geq 1 - \varepsilon \}$ falls under Corollary 4.2. The last piece we need to apply our generic testing framework is the existence of an algorithm to compute the distance between an (explicit) distribution and the class of conditioned PBDs. This is provided by our next lemma:

Claim 5.7. There exists a procedure ProjectionDist$_{\mathcal{PBD}_n^{\varepsilon,I}}$ that, on input $n$, $\varepsilon \in [0,1]$, $I \subseteq [n]$, as well as the full specification of a distribution $D$ on $[n]$, computes a value $\tau$ such that $\tau \in [1 \pm 2\varepsilon] \cdot \ell_1(D, \mathcal{PBD}_n^{\varepsilon,I}) \pm \frac{\varepsilon}{100}$, in time $n^2 (1/\varepsilon)^{O(\log 1/\varepsilon)}$.

Proof. The goal is to find a $\gamma = \Theta(\varepsilon)$-approximation of the minimum value of $\sum_{i \in I} \left| \frac{P(i)}{P(I)} - \frac{D(i)}{D(I)} \right|$, subject to $P(I) = \sum_{i \in I} P(i) \geq 1 - \varepsilon$ and $P \in \mathcal{PBD}_n$. We first note that, given the parameters $n \in \mathbb{N}$ and $p_1, \ldots, p_n \in [0,1]$ of a PBD $P$, the vector of $(n+1)$ probabilities $P(0), \ldots, P(n)$ can be obtained in time $O(n^2)$ by dynamic programming.
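The $O(n^2)$ dynamic program mentioned above is the standard one: add the Bernoulli coordinates one at a time, updating the distribution of the partial sum. A minimal Python sketch (the function name is ours, not the paper's):

```python
def pbd_probabilities(ps):
    """Compute the probability vector (P(0), ..., P(n)) of the Poisson
    Binomial Distribution with parameters ps = (p_1, ..., p_n), in O(n^2):
    fold in one Bernoulli(p_i) coordinate per iteration."""
    probs = [1.0]  # distribution of the empty sum: P(0) = 1
    for p in ps:
        new = [0.0] * (len(probs) + 1)
        for k, q in enumerate(probs):
            new[k] += q * (1 - p)   # the new coordinate is 0
            new[k + 1] += q * p     # the new coordinate is 1
        probs = new
    return probs
```

For instance, `pbd_probabilities([0.5, 0.5])` recovers the Bin(2, 1/2) vector (1/4, 1/2, 1/4). With this subroutine, the $\ell_1$ distance between the (explicit) distribution $D$ and any PBD with known parameters is a single pass over the two vectors.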
Therefore, computing the $\ell_1$ distance between $D$ and any PBD with known parameters can be done efficiently. To conclude, we invoke a result of Diakonikolas, Kane, and Stewart, which guarantees the existence of a succinct (proper) cover of $\mathcal{PBD}_n$:

Theorem 5.8 ([DKS15, Theorem 14] (rephrased)). For all $n, \gamma > 0$, there exists a set $\mathcal{S}_\gamma \subseteq \mathcal{PBD}_n$ such that:
(i) $\mathcal{S}_\gamma$ is a $\gamma$-cover of $\mathcal{PBD}_n$; that is, for all $D \in \mathcal{PBD}_n$ there exists some $D' \in \mathcal{S}_\gamma$ such that $\|D - D'\|_1 \leq \gamma$;
(ii) $|\mathcal{S}_\gamma| \leq n (1/\gamma)^{O(\log 1/\gamma)}$;
(iii) $\mathcal{S}_\gamma$ can be computed in time $n (1/\gamma)^{O(\log 1/\gamma)}$, and each $D \in \mathcal{S}_\gamma$ is explicitly described by its set of parameters.

We further observe that the factor $n$ in both the size of the cover and the running time can be easily removed in our case, as we know a good approximation of the support size of the candidate PBDs. (That is, we only need to enumerate over a subset of the cover of [DKS15]: that of the PBDs with effective support compatible with our distribution $D$.)

Set $\gamma \stackrel{\rm def}{=} \frac{\varepsilon}{250}$. Fix $P \in \mathcal{PBD}_n$ such that $P(I) \geq 1 - \varepsilon$, and $Q \in \mathcal{S}_\gamma$ such that $\|P - Q\|_1 \leq \gamma$. In particular, it is easy to see, via the correspondence between $\ell_1$ and total variation distance, that $|P(I) - Q(I)| \leq \gamma/2$. By a calculation analogous to that of Lemma 5.3, we have
$$\|P_I - Q_I\|_1 = \sum_{i \in I} \left| \frac{P(i)}{P(I)} - \frac{Q(i)}{Q(I)} \right| = \sum_{i \in I} \left| \frac{P(i)}{P(I)} - \frac{Q(i)}{P(I)} + Q(i)\left(\frac{1}{P(I)} - \frac{1}{Q(I)}\right) \right|$$
$$= \sum_{i \in I} \left| \frac{P(i)}{P(I)} - \frac{Q(i)}{P(I)} \right| \pm \sum_{i \in I} Q(i) \left| \frac{1}{P(I)} - \frac{1}{Q(I)} \right| = \frac{1}{P(I)} \left( \sum_{i \in I} |P(i) - Q(i)| \pm |P(I) - Q(I)| \right)$$
$$= \frac{1}{P(I)} \left( \sum_{i \in I} |P(i) - Q(i)| \pm \frac{\gamma}{2} \right)$$
$$= \frac{1}{P(I)} \left( \|P - Q\|_1 \pm \frac{5\gamma}{2} \right) \in \left[ \|P - Q\|_1 - \frac{5\gamma}{2},\ (1 + 2\varepsilon)\left( \|P - Q\|_1 + \frac{5\gamma}{2} \right) \right],$$
where we used the fact that $\sum_{i \notin I} |P(i) - Q(i)| = 2\left( \sum_{i \notin I : P(i) > Q(i)} (P(i) - Q(i)) \right) + P(I) - Q(I) \in [-2\gamma, 2\gamma]$. By the triangle inequality, this implies that the minimum of $\|P_I - D_I\|_1$ over the distributions $P$ of $\mathcal{S}_\gamma$ with $P(I) \geq 1 - (\varepsilon + \gamma/2)$ will be within an additive $O(\varepsilon)$ of $\ell_1(D, \mathcal{PBD}_n^{\varepsilon,I})$. The fact that the former can be computed in time $\mathrm{poly}(n) \cdot (1/\varepsilon)^{O(\log^2 1/\varepsilon)}$ concludes the proof.

As previously mentioned, this approximation guarantee for $\ell_1(D, \mathcal{PBD}_n^{\varepsilon,I})$ is sufficient for the purpose of Algorithm 1.

Proof of Corollary 5.5. Combining the above, we invoke Theorem 5.4 with $M(n, \varepsilon) = O(\sqrt{n \log(1/\varepsilon)})$ (Fact 5.6) and $L(m, \gamma) = O\left(\frac{\log^2 m}{\gamma}\right)$ (Corollary 4.2). This yields the claimed sample complexity; finally, the efficiency is a direct consequence of Claim 5.7.

6 Lower Bounds

6.1 Reduction-based Lower Bound Approach

We now turn to proving converses to our positive results, namely, that many of the upper bounds we obtain cannot be significantly improved upon. As in our algorithmic approach, we describe for this purpose a generic framework for obtaining lower bounds.

In order to state our results, we will require the usual definition of agnostic learning. Recall that an algorithm is said to be a semi-agnostic learner for a class $\mathcal{C}$ if it satisfies the following. Given sample access to an arbitrary distribution $D$ and parameter $\varepsilon$, it outputs a hypothesis $\hat{D}$ which (with high probability) does "almost as well as it gets":
$$\|\hat{D} - D\|_1 \leq c \cdot \mathrm{opt}_{\mathcal{C},D} + O(\varepsilon),$$
where $\mathrm{opt}_{\mathcal{C},D} \stackrel{\rm def}{=} \inf_{D' \in \mathcal{C}} \ell_1(D', D)$, and $c \geq 1$ is some absolute constant (if $c = 1$, the learner is said to be agnostic).

High-level idea.
The motivation for our result is the observation of [BKR04] that "monotonicity is at least as hard as uniformity." Unfortunately, their specific argument does not generalize easily to other classes of distributions. The starting point of our approach is to observe that while uniformity testing is hard in general, it becomes very easy under the promise that the distribution is monotone, or even only close to monotone (namely, $O(1/\varepsilon^2)$ samples suffice). This gives an alternate proof of the lower bound for monotonicity testing, via a different reduction: first, test if the unknown distribution is monotone; if it is, test whether it is uniform, now assuming closeness to monotone. More generally, this idea applies to any class $\mathcal{C}$ which (a) contains the uniform distribution, and (b) for which we have an $o(\sqrt{n})$-sample agnostic learner $L$, as follows. Assuming we have a tester $T$ for $\mathcal{C}$ with sample complexity $o(\sqrt{n})$, define a uniformity tester as below.

• test if $D \in \mathcal{C}$ using $T$; if not, reject (as $\mathcal{U} \in \mathcal{C}$, $D$ cannot be uniform);
• otherwise, agnostically learn $D$ with $L$ (since $D$ is close to $\mathcal{C}$), and obtain a hypothesis $\hat{D}$;
• check offline if $\hat{D}$ is close to uniform.

By assumption, $T$ and $L$ each use $o(\sqrt{n})$ samples, so the whole process does as well; but this contradicts the lower bound of [BFR+00, Pan08] on uniformity testing. Hence, $T$ must use $\Omega(\sqrt{n})$ samples. This "testing-by-narrowing" reduction argument can be further extended to properties other than uniformity, as we show below:

Theorem 6.1. Let $\mathcal{C}$ be a class of distributions over $[n]$ for which the following hold:
(i) there exists a semi-agnostic learner $L$ for $\mathcal{C}$, with sample complexity $q_L(n, \varepsilon, \delta)$ and "agnostic constant" $c$;
(ii) there exists a subclass $\mathcal{C}_{\rm Hard} \subseteq \mathcal{C}$ such that testing $\mathcal{C}_{\rm Hard}$ requires $q_H(n, \varepsilon)$ samples.
Suppose further that $q_L(n, \varepsilon, 1/10) = o(q_H(n, \varepsilon))$. Then, any tester for $\mathcal{C}$ must use $\Omega(q_H(n, \varepsilon))$ samples.

Proof. The theorem relies on the reduction outlined above, which we rigorously detail here. Assuming $\mathcal{C}$, $\mathcal{C}_{\rm Hard}$, $L$ as above (with semi-agnostic constant $c \geq 1$), and a tester $T$ for $\mathcal{C}$ with sample complexity $q_T(n, \varepsilon)$, we define a tester $T_{\rm Hard}$ for $\mathcal{C}_{\rm Hard}$. On input $\varepsilon \in (0,1]$ and given sample access to a distribution $D$ on $[n]$, $T_{\rm Hard}$ acts as follows:

• call $T$ with parameters $n$, $\frac{\varepsilon'}{c}$ (where $\varepsilon' \stackrel{\rm def}{=} \frac{\varepsilon}{3}$) and failure probability $1/6$, to $\frac{\varepsilon'}{c}$-test if $D \in \mathcal{C}$. If not, reject.
• otherwise, agnostically learn a hypothesis $\hat{D}$ for $D$, with $L$ called with parameters $n$, $\varepsilon'$ and failure probability $1/6$;
• check offline if $\hat{D}$ is $\varepsilon'$-close to $\mathcal{C}_{\rm Hard}$; accept if and only if this is the case.

We condition on both calls (to $T$ and $L$) being successful, which overall happens with probability at least $2/3$ by a union bound. The completeness is immediate: if $D \in \mathcal{C}_{\rm Hard} \subseteq \mathcal{C}$, $T$ accepts, and the hypothesis $\hat{D}$ satisfies $\|\hat{D} - D\|_1 \leq \varepsilon'$. Therefore, $\ell_1(\hat{D}, \mathcal{C}_{\rm Hard}) \leq \varepsilon'$, and $T_{\rm Hard}$ accepts.

For the soundness, we proceed by contrapositive. Suppose $T_{\rm Hard}$ accepts; this means that each step was successful. In particular, $\ell_1(D, \mathcal{C}) \leq \varepsilon'/c$, so that the hypothesis output by the agnostic learner satisfies $\|\hat{D} - D\|_1 \leq c \cdot \mathrm{opt} + \varepsilon' \leq 2\varepsilon'$. In turn, since the last step passed, by the triangle inequality we get, as claimed, $\ell_1(D, \mathcal{C}_{\rm Hard}) \leq 2\varepsilon' + \ell_1(\hat{D}, \mathcal{C}_{\rm Hard}) \leq 3\varepsilon' = \varepsilon$. Observing that the overall sample complexity is $q_T(n, \frac{\varepsilon'}{c}) + q_L(n, \varepsilon', \frac{1}{10}) = q_T(n, \frac{\varepsilon'}{c}) + o(q_H(n, \varepsilon'))$ concludes the proof.
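The three steps of $T_{\rm Hard}$ above can be sketched as follows; all three subroutines (the tester $T$, the learner $L$, and the offline distance oracle to $\mathcal{C}_{\rm Hard}$) are assumed given, and the interfaces are ours for illustration.

```python
def make_hard_tester(test_C, learn_C, dist_to_hard):
    """Testing-by-narrowing reduction (Theorem 6.1, sketch): build a tester
    for C_Hard from a tester for C, a semi-agnostic learner for C with
    constant c, and an offline l1-distance oracle to C_Hard."""
    def T_hard(sample_oracle, n, eps, c=1):
        eps_p = eps / 3
        # Step 1: (eps'/c)-test membership of D in C; reject if far.
        if not test_C(sample_oracle, n, eps_p / c):
            return False
        # Step 2: agnostically learn a hypothesis for D.
        D_hat = learn_C(sample_oracle, n, eps_p)
        # Step 3: offline check that the hypothesis is eps'-close to C_Hard.
        return dist_to_hard(D_hat) <= eps_p
    return T_hard
```

The point of the reduction is purely in the sample accounting: Steps 1 and 2 consume $q_T + q_L$ samples, and Step 3 consumes none, so a too-cheap tester for $\mathcal{C}$ would contradict the known hardness of testing $\mathcal{C}_{\rm Hard}$.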
Taking $\mathcal{C}_{\rm Hard}$ to be the singleton consisting of the uniform distribution, and from the semi-agnostic learners of [CDSS13, CDSS14a] (each with sample complexity either $\mathrm{poly}(1/\varepsilon)$ or $\mathrm{poly}(\log n, 1/\varepsilon)$), we obtain the following:⁷

Corollary 1.6. Testing log-concavity, convexity, concavity, MHR, unimodality, $t$-modality, $t$-histograms, and $t$-piecewise degree-$d$ distributions each require $\Omega(\sqrt{n}/\varepsilon^2)$ samples (the last three for $t = o(\sqrt{n})$ and $t(d+1) = o(\sqrt{n})$, respectively), for any $\varepsilon \geq 1/n^{O(1)}$.

Similarly, we can use another result, of [DDS12b], which shows how to agnostically learn Poisson Binomial Distributions with $\tilde{O}(1/\varepsilon^2)$ samples.⁸ Taking $\mathcal{C}_{\rm Hard}$ to be the single $\mathrm{Bin}(n, 1/2)$ distribution (along with the testing lower bound of [VV14]), this yields the following:

Corollary 1.7. Testing the classes of Binomial and Poisson Binomial Distributions each require $\Omega(n^{1/4}/\varepsilon^2)$ samples, for any $\varepsilon \geq 1/n^{O(1)}$.

Finally, we derive a lower bound on testing $k$-SIIRVs from the agnostic learner of [DDO+13] (which has sample complexity $\mathrm{poly}(k, 1/\varepsilon)$, independent of $n$):

Corollary 1.8. There exist absolute constants $c > 0$ and $\varepsilon_0 > 0$ such that testing the class of $k$-SIIRV distributions requires $\Omega(k^{1/2} n^{1/4})$ samples, for any $k = o(n^c)$ and $\varepsilon \leq \varepsilon_0$.

Proof of Corollary 1.8. To prove this result, it is enough by Theorem 6.1 to exhibit a particular $k$-SIIRV $S$ such that testing identity to $S$ requires this many samples. Moreover, from [VV14] this last part amounts to proving that the (truncated) 2/3-norm $\|S^{-\max}_{-\varepsilon_0}\|_{2/3}$ of $S$ is $\Omega(k^{1/2} n^{1/4})$ (for some small $\varepsilon_0 > 0$). Our hard instance $S$ will be defined as follows: it is the distribution of $X_1 + \cdots + X_n$, where the $X_i$'s are independent integer random variables uniform on $\{0, \ldots$
, $k - 1\}$ (in particular, for $k = 2$ we get a $\mathrm{Bin}(n, 1/2)$ distribution). It is straightforward to verify that $\mathbb{E}S = \frac{n(k-1)}{2}$ and $\sigma^2 \stackrel{\rm def}{=} \mathrm{Var}\,S = \frac{(k^2-1)n}{12} = \Theta(k^2 n)$; moreover, $S$ is log-concave (as the convolution of $n$ uniform distributions). From this last point, we get that (i) the maximum probability of $S$, attained at its mode, is $\|S\|_\infty = \Theta(1/\sigma)$; and (ii) for every $j$ in an interval $I$ of length $2\sigma$ centered at this mode, $S(j) \geq \Omega(\|S\|_\infty)$. Putting this together, we get that the 2/3-norm (and similarly the truncated 2/3-norm) of $S$ is lower bounded by
$$\left( \sum_{j \in I} S(j)^{2/3} \right)^{3/2} \geq \left( 2\sigma \cdot \Omega(1/\sigma)^{2/3} \right)^{3/2} = \Omega\left(\sigma^{1/2}\right) = \Omega\left(k^{1/2} n^{1/4}\right),$$
which concludes the proof.

⁷ Specifically, these lower bounds hold as long as $\varepsilon = \Omega(1/n^\alpha)$ for some absolute constant $\alpha > 0$ (so that the sample complexity of the agnostic learner is indeed negligible compared to $\sqrt{n}/\varepsilon^2$).
⁸ Note the quasi-quadratic dependence on $\varepsilon$ of the learner, which allows us to get $\varepsilon$ into our lower bound for $n \gg \mathrm{poly}\log(1/\varepsilon)$.

6.2 Tolerant Testing

The lower bound framework from the previous section carries over to tolerant testing as well, resulting in this analogue of Theorem 6.1:

Theorem 6.2. Let $\mathcal{C}$ be a class of distributions over $[n]$ for which the following hold:
(i) there exists a semi-agnostic learner $L$ for $\mathcal{C}$, with sample complexity $q_L(n, \varepsilon, \delta)$ and "agnostic constant" $c$;
(ii) there exists a subclass $\mathcal{C}_{\rm Hard} \subseteq \mathcal{C}$ such that tolerant testing of $\mathcal{C}_{\rm Hard}$ requires $q_H(n, \varepsilon_1, \varepsilon_2)$ samples for some parameters $\varepsilon_2 > (4c+1)\varepsilon_1$.
Suppose further that $q_L(n, \varepsilon_2 - \varepsilon_1, 1/10) = o(q_H(n, \varepsilon_1, \varepsilon_2))$. Then, any tolerant tester for $\mathcal{C}$ must use $\Omega(q_H(n, \varepsilon_1, \varepsilon_2))$ samples (for some explicit parameters $\varepsilon'_1, \varepsilon'_2$).

Proof. The argument follows the same ideas as for Theorem 6.1, up to the details of the parameters.
Assuming $\mathcal{C}$, $\mathcal{C}_{\rm Hard}$, $L$ as above (with semi-agnostic constant $c \geq 1$), and a tolerant tester $T$ for $\mathcal{C}$ with sample complexity $q_T(n, \varepsilon_1, \varepsilon_2)$, we define a tolerant tester $T_{\rm Hard}$ for $\mathcal{C}_{\rm Hard}$. On input $0 < \varepsilon_1 < \varepsilon_2 \leq 1$ with $\varepsilon_2 > (4c+1)\varepsilon_1$, and given sample access to a distribution $D$ on $[n]$, $T_{\rm Hard}$ acts as follows. After setting $\varepsilon'_1 \stackrel{\rm def}{=} \frac{\varepsilon_2 - \varepsilon_1}{4}$, $\varepsilon'_2 \stackrel{\rm def}{=} \frac{\varepsilon_2 - \varepsilon_1}{2}$, $\varepsilon' \stackrel{\rm def}{=} \frac{\varepsilon_2 - \varepsilon_1}{16}$ and $\tau \stackrel{\rm def}{=} \frac{6\varepsilon_2 + 10\varepsilon_1}{16}$:

• call $T$ with parameters $n$, $\frac{\varepsilon'_1}{c}$, $\frac{\varepsilon'_2}{c}$ and failure probability $1/6$, to tolerantly test if $D \in \mathcal{C}$. If $\ell_1(D, \mathcal{C}) > \varepsilon'_2/c$, reject.
• otherwise, agnostically learn a hypothesis $\hat{D}$ for $D$, with $L$ called with parameters $n$, $\varepsilon'$ and failure probability $1/6$;
• check offline if $\hat{D}$ is $\tau$-close to $\mathcal{C}_{\rm Hard}$; accept if and only if this is the case.

We condition on both calls (to $T$ and $L$) being successful, which overall happens with probability at least $2/3$ by a union bound. We first argue completeness: assume $\ell_1(D, \mathcal{C}_{\rm Hard}) \leq \varepsilon_1$. This implies $\ell_1(D, \mathcal{C}) \leq \varepsilon_1$, so that $T$ accepts, as $\varepsilon_1 \leq \varepsilon'_1/c$ (which is the case because $\varepsilon_2 > (4c+1)\varepsilon_1$). Thus, the hypothesis $\hat{D}$ satisfies $\|\hat{D} - D\|_1 \leq c \cdot \varepsilon'_1/c + \varepsilon' = \varepsilon'_1 + \varepsilon'$. Therefore, $\ell_1(\hat{D}, \mathcal{C}_{\rm Hard}) \leq \|\hat{D} - D\|_1 + \ell_1(D, \mathcal{C}_{\rm Hard}) \leq \varepsilon'_1 + \varepsilon' + \varepsilon_1 < \tau$, and $T_{\rm Hard}$ accepts.

For the soundness, we again proceed by contrapositive. Suppose $T_{\rm Hard}$ accepts; this means that each step was successful. In particular, $\ell_1(D, \mathcal{C}) \leq \varepsilon'_2/c$, so that the hypothesis output by the agnostic learner satisfies $\|\hat{D} - D\|_1 \leq c \cdot \mathrm{opt} + \varepsilon' \leq \varepsilon'_2 + \varepsilon'$. In turn, since the last step passed, by the triangle inequality we get, as claimed, $\ell_1(D, \mathcal{C}_{\rm Hard}) \leq \varepsilon'_2 + \varepsilon' + \ell_1(\hat{D}, \mathcal{C}_{\rm Hard}) \leq \varepsilon'_2 + \varepsilon' + \tau < \varepsilon_2$. Observing that the overall sample complexity is $q_T(n, \frac{\varepsilon'_1}{c}, \frac{\varepsilon'_2}{c}) + q_L(n, \varepsilon', \frac{1}{10}) = q_T(n, \frac{\varepsilon'_1}{c}, \frac{\varepsilon'_2}{c}) + o(q_H(n, \varepsilon_1, \varepsilon_2))$ concludes the proof.
As before, we instantiate the general theorem to obtain specific lower bounds for tolerant testing of the classes we covered in this paper. That is, taking $\mathcal{C}_{\rm Hard}$ to be the singleton consisting of the uniform distribution (combined with the tolerant testing lower bound of [VV10]), and again from the semi-agnostic learners of [CDSS13, CDSS14a] (each with sample complexity either $\mathrm{poly}(1/\varepsilon)$ or $\mathrm{poly}(\log n, 1/\varepsilon)$), we obtain the following:

Corollary 1.11. Tolerant testing of log-concavity, convexity, concavity, MHR, unimodality, and $t$-modality each require $\Omega\left(\frac{1}{\varepsilon_2 - \varepsilon_1} \cdot \frac{n}{\log n}\right)$ samples (the latter for $t = o(n)$).

Similarly, we again turn to the class of Poisson Binomial Distributions, for which we can invoke as before the $\tilde{O}(1/\varepsilon^2)$-sample agnostic learner of [DDS12b]. As before, we would like to choose for $\mathcal{C}_{\rm Hard}$ the single $\mathrm{Bin}(n, 1/2)$ distribution; however, as no tolerant testing lower bound for this distribution exists in the literature, to the best of our knowledge, we first need to establish the lower bound we will rely upon:

Theorem 6.3. There exists an absolute constant $\varepsilon_0 > 0$ such that the following holds. Any algorithm which, given sampling access to an unknown distribution $D$ on $\Omega$ and parameter $\varepsilon \in (0, \varepsilon_0)$, distinguishes with probability at least $2/3$ between (i) $\|D - \mathrm{Bin}(n, 1/2)\|_1 \leq \varepsilon$ and (ii) $\|D - \mathrm{Bin}(n, 1/2)\|_1 \geq 100\varepsilon$, must use $\Omega\left(\frac{1}{\varepsilon} \cdot \frac{\sqrt{n}}{\log n}\right)$ samples.

The proof relies on a reduction from tolerant testing of uniformity, drawing on a result of Valiant and Valiant [VV10]; for the sake of conciseness, the details are deferred to Appendix D. With Theorem 6.3 in hand, we can apply Theorem 6.2 to obtain the desired lower bound:

Corollary 1.12. Tolerant testing of the classes of Binomial and Poisson Binomial Distributions each require $\Omega\left(\frac{1}{\varepsilon_2 - \varepsilon_1} \cdot \frac{\sqrt{n}}{\log n}\right)$ samples.
We observe that both Corollary 1.11 and Corollary 1.12 are tight (with regard to the dependence on $n$), as proven in the next section (Section 7).

7 A Generic Tolerant Testing Upper Bound

To conclude this work, we address the question of tolerant testing of distribution classes. In the same spirit as before, we focus on describing a generic approach to obtain such bounds, in a clean conceptual manner. The most general statement of the result we prove in this section is given below; we then instantiate it to match the lower bounds from Section 6.2:

Theorem 7.1. Let $\mathcal{C}$ be a class of distributions over $[n]$ for which the following hold:
(i) there exists a semi-agnostic learner $L$ for $\mathcal{C}$, with sample complexity $q_L(n, \varepsilon, \delta)$ and "agnostic constant" $c$;
(ii) for any $\eta \in [0,1]$, every distribution in $\mathcal{C}$ has $\eta$-effective support of size at most $M(n, \eta)$.
Then, there exists an algorithm that, for any fixed $\kappa > 1$ and on input $\varepsilon_1, \varepsilon_2 \in (0,1)$ such that $\varepsilon_2 \geq C\varepsilon_1$, has the following guarantee (where $C > 2$ depends on $c$ and $\kappa$ only). The algorithm takes $O\left(\frac{1}{(\varepsilon_2 - \varepsilon_1)^2} \cdot \frac{m}{\log m}\right) + q_L(n, \frac{\varepsilon_2 - \varepsilon_1}{\kappa}, \frac{1}{10})$ samples (where $m = M(n, \varepsilon_1)$), and with probability at least $2/3$ distinguishes between (a) $\ell_1(D, \mathcal{C}) \leq \varepsilon_1$ and (b) $\ell_1(D, \mathcal{C}) > \varepsilon_2$. (Moreover, one can take $C = 1 + \frac{(5c+6)\kappa}{\kappa - 1}$.)

Corollary 1.9. Tolerant testing of log-concavity, convexity, concavity, MHR, unimodality, and $t$-modality can be performed with $O\left(\frac{1}{(\varepsilon_2 - \varepsilon_1)^2} \cdot \frac{n}{\log n}\right)$ samples, for $\varepsilon_2 \geq C\varepsilon_1$ (where $C > 2$ is an absolute constant).

Applying now the theorem with $M(n, \varepsilon) = \sqrt{n \log(1/\varepsilon)}$ (as per Corollary 5.5), we obtain an improved upper bound for Binomial and Poisson Binomial distributions:

Corollary 1.10.
Tolerant testing of the classes of Binomial and Poisson Binomial Distributions can be performed with $O\left(\frac{1}{(\varepsilon_2 - \varepsilon_1)^2} \cdot \frac{\sqrt{n \log(1/\varepsilon_1)}}{\log n}\right)$ samples, for $\varepsilon_2 \geq C\varepsilon_1$ (where $C > 2$ is an absolute constant).

High-level idea. Somewhat similarly to the lower bound framework developed in Section 6, the gist of the approach is to reduce the problem of tolerantly testing membership of $D$ in the class $\mathcal{C}$ to that of tolerantly testing identity to a known distribution, namely the distribution $\hat{D}$ obtained by trying to agnostically learn $D$. Intuitively, an agnostic learner for $\mathcal{C}$ should result in a good enough hypothesis $\hat{D}$ (i.e., $\hat{D}$ close enough to both $D$ and $\mathcal{C}$) when $D$ is $\varepsilon_1$-close to $\mathcal{C}$, but output a $\hat{D}$ that is significantly far from either $D$ or $\mathcal{C}$ when $D$ is $\varepsilon_2$-far from $\mathcal{C}$, by enough for us to be able to tell. Besides the many technical details one has to control for the parameters to work out, one key element is the use of a tolerant testing algorithm for closeness of two distributions due to [VV11b], whose (tight) sample complexity scales as $n/\log n$ for a domain of size $n$. In order to get the right dependence on the effective support (required in particular for Corollary 1.10), we have to perform a first test to identify the effective support of the distribution and check its size, in order to call this tolerant closeness testing algorithm only on this much smaller subset. (This additional preprocessing step itself has to be carefully done, and comes at the price of a slightly worse constant $C = C(c, \kappa)$ in the statement of the theorem.)

7.1 Proof of Theorem 7.1

As described in the preceding section, the algorithm will rely on the ability to perform tolerant testing of equivalence between two unknown distributions (over some known domain of size $m$).
This is ensured by an algorithm of Valiant and Valiant, restated below:

Theorem 7.2 ([VV11b, Theorems 3 and 4]). There exists an algorithm $E$ which, given sampling access to two unknown distributions $D_1, D_2$ over $[m]$, satisfies the following. On input $\varepsilon \in (0,1]$, it takes $O\left(\frac{1}{\varepsilon^2} \cdot \frac{m}{\log m}\right)$ samples from $D_1$ and $D_2$, and outputs a value $\Delta$ such that $|\|D_1 - D_2\|_1 - \Delta| \leq \varepsilon$ with probability $1 - 1/\mathrm{poly}(m)$. (Furthermore, $E$ runs in time $\mathrm{poly}(m)$.)

For the proof, we will also need the following fact, similar to Lemma 5.3, which relates the distance between two distributions to that of their conditional distributions on a subset of the domain:

Fact 7.3. Let $D$ and $P$ be distributions over $[n]$, and $I \subseteq [n]$ an interval such that $D(I) \geq 1 - \alpha$ and $P(I) \geq 1 - \beta$. Then,
• $\|D_I - P_I\|_1 \leq \frac{3}{2} \cdot \frac{\|D - P\|_1}{D(I)} \leq 3\|D - P\|_1$ (the last inequality for $\alpha \leq \frac{1}{2}$); and
• $\|D_I - P_I\|_1 \geq \|D - P\|_1 - 2(\alpha + \beta)$.

Proof. To establish the first item, write:
$$\|D_I - P_I\|_1 = \sum_{i \in I} \left| \frac{D(i)}{D(I)} - \frac{P(i)}{P(I)} \right| = \frac{1}{D(I)} \sum_{i \in I} \left| D(i) - P(i) + P(i)\left(1 - \frac{D(I)}{P(I)}\right) \right|$$
$$\leq \frac{1}{D(I)} \left( \sum_{i \in I} |D(i) - P(i)| + \left|1 - \frac{D(I)}{P(I)}\right| \sum_{i \in I} P(i) \right) = \frac{1}{D(I)} \left( \sum_{i \in I} |D(i) - P(i)| + |P(I) - D(I)| \right)$$
$$\leq \frac{1}{D(I)} \left( \sum_{i \in I} |D(i) - P(i)| + \frac{1}{2}\|D - P\|_1 \right) \leq \frac{1}{D(I)} \cdot \frac{3}{2} \|D - P\|_1,$$
where we used the fact that $|P(I) - D(I)| \leq d_{\rm TV}(D, P) = \frac{1}{2}\|D - P\|_1$.
Turning now to the second item, we have:
$$\|D_I - P_I\|_1 = \frac{1}{D(I)} \sum_{i \in I} \left| D(i) - P(i) + P(i)\left(1 - \frac{D(I)}{P(I)}\right) \right| \geq \frac{1}{D(I)} \left( \sum_{i \in I} |D(i) - P(i)| - \left|1 - \frac{D(I)}{P(I)}\right| \sum_{i \in I} P(i) \right)$$
$$= \frac{1}{D(I)} \left( \sum_{i \in I} |D(i) - P(i)| - |P(I) - D(I)| \right) \geq \frac{1}{D(I)} \left( \sum_{i \in I} |D(i) - P(i)| - (\alpha + \beta) \right)$$
$$\geq \frac{1}{D(I)} \left( \|D - P\|_1 - \sum_{i \notin I} |D(i) - P(i)| - (\alpha + \beta) \right) \geq \frac{1}{D(I)} \left( \|D - P\|_1 - 2(\alpha + \beta) \right) \geq \|D - P\|_1 - 2(\alpha + \beta).$$

With these two ingredients, we are in a position to establish our theorem:

Proof of Theorem 7.1. The algorithm proceeds as follows, where we set $\varepsilon \stackrel{\rm def}{=} \frac{\varepsilon_2 - \varepsilon_1}{17\kappa}$, $\theta \stackrel{\rm def}{=} \varepsilon_2 - ((6+c)\varepsilon_1 + 11\varepsilon)$, and $\tau \stackrel{\rm def}{=} \frac{(3+c)\varepsilon_1 + 5\varepsilon}{2}$:

(1) using $O(\frac{1}{\varepsilon^2})$ samples, get (with probability at least $1 - 1/10$, by Theorem 2.10) a distribution $\tilde{D}$ that is $\frac{\varepsilon}{2}$-close to $D$ in Kolmogorov distance; and let $I \subseteq [n]$ be the smallest interval such that $\tilde{D}(I) > 1 - \frac{3}{2}\varepsilon_1 - \varepsilon$. Output REJECT if $|I| > M(n, \varepsilon_1)$.
(2) invoke $L$ on $D$ with parameter $\varepsilon$ and failure probability $\frac{1}{10}$, to obtain a hypothesis $\hat{D}$;
(3) call $E$ (from Theorem 7.2) on $D_I$, $\hat{D}_I$ with parameter $\frac{\varepsilon}{6}$ to get an estimate $\hat{\Delta}$ of $\|D_I - \hat{D}_I\|_1$;
(4) output REJECT if $\hat{D}(I) < 1 - \tau$;
(5) compute "offline" (an estimate, accurate within $\varepsilon$, of) $\ell_1(\hat{D}, \mathcal{C})$, denoted $\Delta$;
(6) output REJECT if $\Delta + \hat{\Delta} > \theta$, and output ACCEPT otherwise.

The claimed sample complexity is immediate from Steps (2) and (3), along with Theorem 7.2. Turning to correctness, we condition on both subroutines meeting their guarantees (i.e., $\|D - \hat{D}\|_1 \leq c \cdot \mathrm{opt} + \varepsilon$ and $\|D_I - \hat{D}_I\|_1 \in [\hat{\Delta} - \varepsilon, \hat{\Delta} + \varepsilon]$), which happens with probability at least $8/10 - 1/\mathrm{poly}(n) \geq 3/4$ by a union bound.
• Completeness: if $\ell_1(D, \mathcal{C}) \leq \varepsilon_1$, then $D$ is $\varepsilon_1$-close to some $P \in \mathcal{C}$, for which there exists an interval $J \subseteq [n]$ of size at most $M(n, \varepsilon_1)$ such that $P(J) \geq 1 - \varepsilon_1$. It follows that $D(J) \geq 1 - \frac{3}{2}\varepsilon_1$ (since $|D(J) - P(J)| \leq \frac{\varepsilon_1}{2}$) and $\tilde{D}(J) \geq 1 - \frac{3}{2}\varepsilon_1 - 2 \cdot \frac{\varepsilon}{2}$, establishing the existence of a good interval $I$ to be found (and Step (1) does not end with REJECT). Additionally, $\|D - \hat{D}\|_1 \leq c\varepsilon_1 + \varepsilon$, and by the triangle inequality this implies $\ell_1(\hat{D}, \mathcal{C}) \leq (1+c)\varepsilon_1 + \varepsilon$. Moreover, as $D(I) \geq \tilde{D}(I) - 2 \cdot \frac{\varepsilon}{2} \geq 1 - \frac{3}{2}\varepsilon_1 - 2\varepsilon$ and $|\hat{D}(I) - D(I)| \leq \frac{1}{2}\|D - \hat{D}\|_1$, we do have $\hat{D}(I) \geq 1 - \frac{3}{2}\varepsilon_1 - 2\varepsilon - \frac{c\varepsilon_1}{2} - \frac{\varepsilon}{2} = 1 - \tau$, and the algorithm does not reject in Step (4). To conclude, one has by Fact 7.3 that
$$\|D_I - \hat{D}_I\|_1 \leq \frac{3}{2} \cdot \frac{\|D - \hat{D}\|_1}{D(I)} \leq \frac{\frac{3}{2}(c\varepsilon_1 + \varepsilon)}{1 - \frac{3}{2}\varepsilon_1 - 2\varepsilon} \leq 3(c\varepsilon_1 + \varepsilon)$$
(for $\varepsilon_1 < 1/4$, as $\varepsilon < 1/17$). Therefore, $\Delta + \hat{\Delta} \leq \ell_1(\hat{D}, \mathcal{C}) + \varepsilon + \|D_I - \hat{D}_I\|_1 + \varepsilon \leq (4c+1)\varepsilon_1 + 6\varepsilon \leq \varepsilon_2 - ((6+c)\varepsilon_1 + 11\varepsilon) = \theta$ (the last inequality by the assumption on $\varepsilon_1, \varepsilon_2$), and the tester accepts.

• Soundness: if $\ell_1(D, \mathcal{C}) > \varepsilon_2$, then we must have $\|D - \hat{D}\|_1 + \ell_1(\hat{D}, \mathcal{C}) > \varepsilon_2$. If the algorithm does not already reject in Step (4), then $\hat{D}(I) \geq 1 - \tau$. But, by Fact 7.3,
$$\|D_I - \hat{D}_I\|_1 \geq \|D - \hat{D}\|_1 - 2\left(D(I^c) + \hat{D}(I^c)\right) \geq \|D - \hat{D}\|_1 - 2\left(\tfrac{3}{2}\varepsilon_1 + 2\varepsilon + \tau\right) = \|D - \hat{D}\|_1 - ((6+c)\varepsilon_1 + 9\varepsilon),$$
so we then have $\|D_I - \hat{D}_I\|_1 + \ell_1(\hat{D}, \mathcal{C}) > \varepsilon_2 - ((6+c)\varepsilon_1 + 9\varepsilon)$. This implies $\Delta + \hat{\Delta} > \varepsilon_2 - ((6+c)\varepsilon_1 + 9\varepsilon) - 2\varepsilon = \varepsilon_2 - ((6+c)\varepsilon_1 + 11\varepsilon) = \theta$, and the tester rejects.

(Note that the two cases above were labeled "Soundness" and "Completeness" in the opposite order in an earlier version; accepting when $\ell_1(D, \mathcal{C}) \leq \varepsilon_1$ is the completeness case, and rejecting when $\ell_1(D, \mathcal{C}) > \varepsilon_2$ is the soundness case, consistently with the proof of Theorem 6.2.) Finally, the testing algorithm defined above is computationally efficient as long as both the learning algorithm (Step (2)) and the estimation procedure (Step (5)) are.
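The acceptance logic of the Theorem 7.1 tester, Steps (1) to (6), can be sketched as follows. All subroutines (the learner $L$, the closeness estimator $E$, the offline distance to $\mathcal{C}$, and the effective-support stage) are assumed given; the interfaces and names are ours, and $\tau$ is the reconstructed threshold $((3+c)\varepsilon_1 + 5\varepsilon)/2$.

```python
def tolerant_test_C(learner, closeness_estimator, dist_to_C, eff_support,
                    eps1, eps2, c=1, kappa=2):
    """Skeleton of the Theorem 7.1 tolerant tester: only the parameter
    setting and the accept/reject logic are shown; sampling is hidden
    inside the supplied subroutines."""
    eps = (eps2 - eps1) / (17 * kappa)
    theta = eps2 - ((6 + c) * eps1 + 11 * eps)
    tau = ((3 + c) * eps1 + 5 * eps) / 2
    # Steps (1) and (4): find the candidate effective support I, check its
    # size, and check that the hypothesis puts enough mass on it.
    I, hyp_mass_on_I, size_ok = eff_support(eps1, eps)
    if not size_ok or hyp_mass_on_I < 1 - tau:
        return False
    D_hat = learner(eps)                                 # Step (2)
    delta_hat = closeness_estimator(I, D_hat, eps / 6)   # Step (3): ||D_I - D_hat_I||_1
    delta = dist_to_C(D_hat)                             # Step (5): offline l1(D_hat, C)
    return delta + delta_hat <= theta                    # Step (6)
```

The point of the threshold $\theta$ is that in the completeness case the two estimated distances sum to at most $(4c+1)\varepsilon_1 + 6\varepsilon \leq \theta$, while in the soundness case their sum exceeds $\theta$.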
References

[AAK+07] Noga Alon, Alexandr Andoni, Tali Kaufman, Kevin Matulef, Ronitt Rubinfeld, and Ning Xie. Testing k-wise and almost k-wise independence. In Proceedings of the 39th ACM Symposium on Theory of Computing, STOC 2007, San Diego, California, USA, June 11-13, 2007, pages 496–505, New York, NY, USA, 2007.

[AD15] Jayadev Acharya and Constantinos Daskalakis. Testing Poisson Binomial Distributions. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, San Diego, CA, USA, January 4-6, 2015, pages 1829–1840, 2015.

[ADK15] Jayadev Acharya, Constantinos Daskalakis, and Gautam C. Kamath. Optimal Testing for Properties of Distributions. In C. Cortes, N.D. Lawrence, D.D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3577–3598. Curran Associates, Inc., 2015.

[ADLS15] Jayadev Acharya, Ilias Diakonikolas, Jerry Zheng Li, and Ludwig Schmidt. Sample-optimal density estimation in nearly-linear time. CoRR, abs/1506.00671, 2015.

[AK03] Sanjeev Arora and Subhash Khot. Fitting algebraic curves to noisy data. Journal of Computer and System Sciences, 67(2):325–340, 2003. Special Issue on STOC 2002.

[An96] Mark Y. An. Log-concave probability distributions: theory and statistical testing. Technical report, Centre for Labour Market and Social Research, Denmark, 1996.

[BB05] Mark Bagnoli and Ted Bergstrom. Log-concave probability and its applications. Economic Theory, 26(2):445–469, 2005.

[BDBB72] Richard E. Barlow, D.J. Bartholomew, J.M. Bremner, and H.D. Brunk. Statistical Inference Under Order Restrictions: The Theory and Application of Isotonic Regression. Wiley Series in Probability and Mathematical Statistics. J. Wiley, London, New York, 1972.
[BDKR05] Tuğkan Batu, Sanjoy Dasgupta, Ravi Kumar, and Ronitt Rubinfeld. The complexity of approximating the entropy. SIAM Journal on Computing, 35(1):132–150, 2005.

[BFF+01] Tuğkan Batu, Eldar Fischer, Lance Fortnow, Ravi Kumar, Ronitt Rubinfeld, and Patrick White. Testing random variables for independence and identity. In 42nd Annual IEEE Symposium on Foundations of Computer Science, FOCS 2001, Las Vegas, Nevada, USA, October 14-17, 2001, pages 442–451, 2001.

[BFR+00] Tuğkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D. Smith, and Patrick White. Testing that distributions are close. In 41st Annual IEEE Symposium on Foundations of Computer Science, FOCS 2000, Redondo Beach, California, USA, November 12-14, 2000, pages 259–269, 2000.

[BKR04] Tuğkan Batu, Ravi Kumar, and Ronitt Rubinfeld. Sublinear algorithms for testing monotone and unimodal distributions. In Proceedings of the 36th ACM Symposium on Theory of Computing, STOC 2004, Chicago, IL, USA, June 13-16, 2004, pages 381–390, New York, NY, USA, 2004. ACM.

[Can15] Clément L. Canonne. A Survey on Distribution Testing: Your data is Big. But is it Blue? Electronic Colloquium on Computational Complexity (ECCC), 22:63, April 2015.

[CDSS13] Siu-on Chan, Ilias Diakonikolas, Rocco A. Servedio, and Xiaorui Sun. Learning mixtures of structured distributions over discrete domains. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2013, New Orleans, Louisiana, USA, January 6-8, 2013, pages 1380–1394, 2013.

[CDSS14a] Siu-on Chan, Ilias Diakonikolas, Rocco A. Servedio, and Xiaorui Sun. Efficient density estimation via piecewise polynomial approximation. In Proceedings of the 45th ACM Symposium on Theory of Computing, STOC 2014, New York, NY, USA, May 31 - June 03, 2014, pages 604–613. ACM, 2014.
[CDSS14b] Siu-On Chan, Ilias Diakonikolas, Rocco A. Servedio, and Xiaorui Sun. Near-optimal density estimation in near-linear time using variable-width histograms. In Annual Conference on Neural Information Processing Systems (NIPS), pages 1844–1852, 2014.

[CDVV14] Siu-On Chan, Ilias Diakonikolas, Gregory Valiant, and Paul Valiant. Optimal algorithms for testing closeness of discrete distributions. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2014, Portland, Oregon, USA, January 5-7, 2014, pages 1193–1203, 2014.

[DDO+13] Constantinos Daskalakis, Ilias Diakonikolas, Ryan O'Donnell, Rocco A. Servedio, and Li-Yang Tan. Learning Sums of Independent Integer Random Variables. In 54th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2013, Berkeley, CA, USA, October 26-29, 2013, pages 217–226. IEEE Computer Society, 2013.

[DDS12a] Constantinos Daskalakis, Ilias Diakonikolas, and Rocco A. Servedio. Learning k-modal distributions via testing. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2012, Kyoto, Japan, January 17-19, 2012, pages 1371–1385. Society for Industrial and Applied Mathematics (SIAM), 2012.

[DDS12b] Constantinos Daskalakis, Ilias Diakonikolas, and Rocco A. Servedio. Learning Poisson Binomial Distributions. In Proceedings of the 44th ACM Symposium on Theory of Computing, STOC 2012, New York, NY, USA, May 19-22, 2012, pages 709–728, New York, NY, USA, 2012. ACM.

[DDS+13] Constantinos Daskalakis, Ilias Diakonikolas, Rocco A. Servedio, Gregory Valiant, and Paul Valiant. Testing k-modal distributions: Optimal algorithms via reductions. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2013, New Orleans, Louisiana, USA, January 6-8, 2013, pages 1833–1852. Society for Industrial and Applied Mathematics (SIAM), 2013.

[Dia16] Ilias Diakonikolas. Learning structured distributions. In Handbook of Big Data. CRC Press, 2016.

[DKN15a] Ilias Diakonikolas, Daniel M. Kane, and Vladimir Nikishkin. Optimal algorithms and lower bounds for testing closeness of structured distributions. In 56th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2015, 2015.

[DKN15b] Ilias Diakonikolas, Daniel M. Kane, and Vladimir Nikishkin. Testing Identity of Structured Distributions. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, San Diego, CA, USA, January 4-6, 2015, 2015.

[DKS15] Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Nearly optimal learning and sparse covers for sums of independent integer random variables. CoRR, abs/1505.00662, 2015.

[DKW56] Aryeh Dvoretzky, Jack Kiefer, and Jacob Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, 27(3):642–669, 1956.

[GMV06] Sudipto Guha, Andrew McGregor, and Suresh Venkatasubramanian. Streaming and sublinear approximation of entropy and information distances. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2006, Miami, Florida, USA, January 22-26, 2006, pages 733–742, Philadelphia, PA, USA, 2006. Society for Industrial and Applied Mathematics (SIAM).

[GR00] Oded Goldreich and Dana Ron. On testing expansion in bounded-degree graphs. Technical Report TR00-020, Electronic Colloquium on Computational Complexity (ECCC), 2000.

[Hou86] Philip Hougaard. Survival models for heterogeneous populations derived from stable distributions. Biometrika, 73:387–396, 1986.

[ILR12] Piotr Indyk, Reut Levi, and Ronitt Rubinfeld. Approximating and Testing k-Histogram Distributions in Sub-linear Time. In Proceedings of PODS, pages 15–22, 2012.

[KG71] J. Keilson and H. Gerber. Some results for discrete unimodality. Journal of the American Statistical Association, 66(334):386–389, 1971.

[Man63] Benoit Mandelbrot. New methods in statistical economics. Journal of Political Economy, 71(5):421–440, 1963.

[Mas90] Pascal Massart. The tight constant in the Dvoretzky–Kiefer–Wolfowitz inequality. The Annals of Probability, 18(3):1269–1283, 1990.

[MP07] Pascal Massart and Jean Picard. Concentration inequalities and model selection. Lecture Notes in Mathematics, 33, 2003, Saint-Flour, Cantal, 2007. Springer.

[Pan08] Liam Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Transactions on Information Theory, 54(10):4750–4755, 2008.

[Ron08] Dana Ron. Property Testing: A Learning Theory Perspective. Foundations and Trends in Machine Learning, 1(3):307–402, 2008.

[Ron10] Dana Ron. Algorithmic and analysis techniques in property testing. Foundations and Trends in Theoretical Computer Science, 5:73–205, 2010.

[Rub12] Ronitt Rubinfeld. Taming Big Probability Distributions. XRDS, 19(1):24–28, September 2012.

[SN99] Debasis Sengupta and Asok K. Nanda. Log-concave and concave distributions in reliability. Naval Research Logistics (NRL), 46(4):419–433, 1999.

[SS01] Mervyn J. Silvapulle and Pranab K. Sen. Constrained Statistical Inference. John Wiley & Sons, Inc., 2001.

[TLSM95] Constantino Tsallis, Silvio V. F. Levy, André M. C. Souza, and Roger Maynard. Statistical-mechanical foundation of the ubiquity of Lévy distributions in nature. Phys. Rev. Lett., 75:3589–3593, Nov 1995.

[Val11] Paul Valiant. Testing symmetric properties of distributions. SIAM Journal on Computing, 40(6):1927–1968, 2011.

[VV10] Gregory Valiant and Paul Valiant. A CLT and tight lower bounds for estimating entropy. Electronic Colloquium on Computational Complexity (ECCC), 17:179, 2010.

[VV11a] Gregory Valiant and Paul Valiant. Estimating the unseen: An n/log n-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011, San Jose, CA, USA, 6-8 June 2011, pages 685–694, 2011.

[VV11b] Gregory Valiant and Paul Valiant. The power of linear estimators. In 52nd Annual IEEE Symposium on Foundations of Computer Science, FOCS 2011, Palm Springs, CA, USA, October 22-25, 2011, pages 403–412, 2011.

[VV14] Gregory Valiant and Paul Valiant. An automatic inequality prover and instance optimal identity testing. In 55th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2014, Philadelphia, PA, USA, October 18-21, 2014, 2014.

[Wal09] Guenther Walther. Inference and modeling with log-concave distributions. Statistical Science, 24(3):319–327, 2009.

A Proof of Lemma 2.9

We now give the proof of Lemma 2.9, restated below:

Lemma 2.9 (Adapted from [DKN15b, Theorem 11]). There exists an algorithm Check-Small-$\ell_2$ which, given parameters $\varepsilon, \delta \in (0,1)$ and $c \cdot \sqrt{|I|}/\varepsilon^2 \cdot \log(1/\delta)$ independent samples from a distribution $D$ over $I$ (for some absolute constant $c > 0$), outputs either yes or no, and satisfies the following.

• If $\|D - \mathcal{U}_I\|_2 > \varepsilon/\sqrt{|I|}$, then the algorithm outputs no with probability at least $1-\delta$;
• If $\|D - \mathcal{U}_I\|_2 \le \varepsilon/(2\sqrt{|I|})$, then the algorithm outputs yes with probability at least $1-\delta$.

Proof.
To do so, we first describe an algorithm that distinguishes between $\|D - \mathcal{U}\|_2^2 \ge \varepsilon^2/n$ and $\|D - \mathcal{U}\|_2^2 < \varepsilon^2/(2n)$ with probability at least $2/3$, using $C \cdot \frac{\sqrt{n}}{\varepsilon^2}$ samples. Boosting the success probability to $1-\delta$ at the price of a multiplicative $\log\frac{1}{\delta}$ factor can then be achieved by standard techniques.

As in the proof of Theorem 11 (whose algorithm we use, but with a threshold $\tau \stackrel{\rm def}{=} \frac{3}{4}\frac{m^2\varepsilon^2}{n}$ instead of $4m\sqrt{n}$), define the quantities

$Z_k \stackrel{\rm def}{=} \left(X_k - \frac{m}{n}\right)^2 - X_k$, for $k \in [n]$, and $Z \stackrel{\rm def}{=} \sum_{k=1}^n Z_k$,

where the $X_k$'s (and thus the $Z_k$'s) are independent by Poissonization, and $X_k \sim \mathrm{Poisson}(m D(k))$. It is not hard to see that $\mathbb{E} Z_k = m^2 \Delta_k^2$, where $\Delta_k \stackrel{\rm def}{=} \frac{1}{n} - D(k)$, so that $\mathbb{E} Z = m^2 \|D - \mathcal{U}\|_2^2$. Furthermore, we also get

$\mathrm{Var}\, Z_k = 2m^2 \left(\frac{1}{n} - \Delta_k\right)^2 + 4m^3 \left(\frac{1}{n} - \Delta_k\right) \Delta_k^2$,

so that

$\mathrm{Var}\, Z = 2m^2 \left( \sum_{k=1}^n \Delta_k^2 + \frac{1}{n} - 2m \sum_{k=1}^n \Delta_k^3 \right)$  (2)

(after expanding, and since $\sum_{k=1}^n \Delta_k = 0$).

Soundness. (Almost straight from [DKN15b], but the threshold has changed.) Assume $\Delta^2 \stackrel{\rm def}{=} \|D - \mathcal{U}\|_2^2 \ge \varepsilon^2/n$; we will show that $\Pr[Z < \tau] \le 1/3$. By Chebyshev's inequality, it is sufficient to show that $\tau \le \mathbb{E} Z - \sqrt{3}\sqrt{\mathrm{Var}\, Z}$, as

$\Pr\left[\, \mathbb{E} Z - Z > \sqrt{3}\sqrt{\mathrm{Var}\, Z} \,\right] \le 1/3$.

As $\tau < \frac{3}{4}\mathbb{E} Z$, arguing that $\sqrt{3}\sqrt{\mathrm{Var}\, Z} \le \frac{1}{4}\mathbb{E} Z$ is enough, i.e. that $48\,\mathrm{Var}\, Z \le (\mathbb{E} Z)^2$. From (2), this is equivalent to showing

$\Delta^2 + \frac{1}{n} - 2m \sum_{k=1}^n \Delta_k^3 \le \frac{m^2 \Delta^4}{96}$.

We bound the LHS term by term.

• As $\Delta^2 \ge \frac{\varepsilon^2}{n}$, we get $m^2 \Delta^2 \ge C^2/\varepsilon^2$, and thus $\frac{m^2 \Delta^4}{288} \ge \frac{C^2}{288 \varepsilon^2} \Delta^2 \ge \Delta^2$ (as $C \ge 17$ and $\varepsilon \le 1$).

• Similarly, $\frac{m^2 \Delta^4}{288} \ge \frac{C^2}{288 \varepsilon^2} \cdot \frac{\varepsilon^2}{n} \ge \frac{1}{n}$.

• Finally, recalling that⁹

$\sum_{k=1}^n |\Delta_k|^3 \le \left( \sum_{k=1}^n |\Delta_k|^2 \right)^{3/2} = \Delta^3$,

we get that $\left| 2m \sum_{k=1}^n \Delta_k^3 \right| \le 2m \Delta^3 = \frac{m^2 \Delta^4}{288} \cdot \frac{2 \cdot 288}{m \Delta} \le \frac{m^2 \Delta^4}{288}$, using the fact that $\frac{m \Delta}{2 \cdot 288} \ge \frac{C}{576 \varepsilon} \ge 1$ (by choice of $C \ge 576$).
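As an aside, the statistic $Z$ and the threshold $\tau = \frac{3}{4}\frac{m^2\varepsilon^2}{n}$ above translate directly into a one-shot test with constant success probability. The sketch below takes already-Poissonized counts as input; the function name `check_small_l2` is ours, not from [DKN15b], and no sampling or amplification step is included.

```python
def check_small_l2(counts, m, n, eps):
    """Sketch of a single run of the l2 tester from the proof of Lemma 2.9.

    counts[k] plays the role of X_k ~ Poisson(m * D(k)); we accept
    (return True, i.e. "yes") iff Z < tau, with tau = (3/4) m^2 eps^2 / n.
    """
    # Z = sum_k (X_k - m/n)^2 - X_k, an unbiased estimator of m^2 ||D - U||_2^2.
    z = sum((x - m / n) ** 2 - x for x in counts)
    tau = 0.75 * m * m * eps * eps / n
    return z < tau
```

With perfectly uniform counts the statistic is negative, so the test accepts; a heavily concentrated profile drives $Z$ far above the threshold and is rejected.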
Overall, the LHS is at most $3 \cdot \frac{m^2 \Delta^4}{288} = \frac{m^2 \Delta^4}{96}$, as claimed.

Completeness. Assume $\Delta^2 = \|D - \mathcal{U}\|_2^2 < \varepsilon^2/(4n)$. We need to show that $\Pr[Z \ge \tau] \le 1/3$. Chebyshev's inequality implies

$\Pr\left[\, Z - \mathbb{E} Z > \sqrt{3}\sqrt{\mathrm{Var}\, Z} \,\right] \le 1/3$,

and therefore it is sufficient to show that $\tau \ge \mathbb{E} Z + \sqrt{3}\sqrt{\mathrm{Var}\, Z}$. Recalling the expressions of $\mathbb{E} Z$ and $\mathrm{Var}\, Z$ from (2), this is tantamount to showing

$\frac{3}{4}\frac{m^2 \varepsilon^2}{n} \ge m^2 \Delta^2 + \sqrt{6}\, m \sqrt{\Delta^2 + \frac{1}{n} - 2m \sum_{k=1}^n \Delta_k^3}$,

or equivalently

$\frac{3}{4}\frac{m \varepsilon^2}{\sqrt{n}} \ge m \sqrt{n}\, \Delta^2 + \sqrt{6} \sqrt{1 + n \Delta^2 - 2nm \sum_{k=1}^n \Delta_k^3}$.

Since $\sqrt{1 + n \Delta^2 - 2nm \sum_{k=1}^n \Delta_k^3} \le \sqrt{1 + n \Delta^2} \le \sqrt{1 + \varepsilon^2/4} \le \sqrt{5/4}$, we get that the second term is at most $\sqrt{30}/2 < 3$. All that remains is to show that $m \sqrt{n}\, \Delta^2 \le \frac{3 m \varepsilon^2}{4 \sqrt{n}} - 3$. But as $\Delta^2 < \varepsilon^2/(4n)$, we have $m \sqrt{n}\, \Delta^2 \le \frac{m \varepsilon^2}{4 \sqrt{n}}$; and our choice of $m \ge C \cdot \frac{\sqrt{n}}{\varepsilon^2}$ for some absolute constant $C \ge 6$ ensures this holds.

9 For any sequence $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$, the map $p > 0 \mapsto \|x\|_p$ is non-increasing. In particular, for $0 < p \le q < \infty$,

$\left( \sum_i |x_i|^q \right)^{1/q} = \|x\|_q \le \|x\|_p = \left( \sum_i |x_i|^p \right)^{1/p}$.

To see why, one can easily prove that if $\|x\|_p = 1$, then $\|x\|_q^q \le 1$ (bounding each term $|x_i|^q \le |x_i|^p$), and therefore $\|x\|_q \le 1 = \|x\|_p$. Next, for the general case, apply this to $y = x/\|x\|_p$, which has unit $\ell_p$ norm, and conclude by homogeneity of the norm.

B Proof of Theorem 4.5

In this section, we prove our structural result for MHR distributions, Theorem 4.5:

Theorem 4.5 (Monotone Hazard Rate). For all $\gamma > 0$, the class $\mathcal{MHR}$ of MHR distributions on $[n]$ is $(\gamma, L)$-decomposable for $L \stackrel{\rm def}{=} O\left(\frac{\log n}{\gamma}\right)$.

Proof. We reproduce and adapt the argument of [CDSS13, Section 5.1] to meet our definition of decomposability (which, albeit related, is incomparable to theirs).
First, we modify the algorithm at the core of their constructive proof, in Algorithm 3: note that the only two changes are in Steps 2 and 3, where we use parameters respectively $\frac{\gamma}{n}$ and $\frac{\gamma}{n^2}$.

Algorithm 3 Decompose-MHR′(D, γ)
Require: explicit description of MHR distribution D over [n]; accuracy parameter γ > 0
1: Set J ← [n] and Q ← ∅.
2: Let I ← Right-Interval(D, J, γ/n) and I′ ← Right-Interval(D, J \ I, γ/n). Set J ← J \ (I ∪ I′).
3: Set i ∈ J to be the smallest integer such that D(i) ≥ γ/n². If no such i exists, let I″ ← J and go to Step 9. Otherwise, let I″ ← {1, ..., i−1} and J ← J \ I″.
4: while J ≠ ∅ do
5:   Let j ∈ J be the smallest integer such that D(j) ∉ [1/(1+γ), 1+γ]·D(i). If no such j exists, let I‴ ← J; otherwise let I‴ ← {i, ..., j−1}.
6:   Add I‴ to Q and set J ← J \ I‴.
7:   Let i ← j.
8: end while
9: Return Q ∪ {I, I′, I″}.

Following the structure of their proof, we write $Q = \{I_1, \ldots, I_{|Q|}\}$ with $I_i = [a_i, b_i]$, and define

$Q' = \{ I_i \in Q : D(a_i) > D(a_{i+1}) \}$, $Q'' = \{ I_i \in Q : D(a_i) \le D(a_{i+1}) \}$.

We immediately obtain the analogues of their Lemmas 5.2 and 5.3:

Lemma B.1. We have $\prod_{I_i \in Q'} \frac{D(a_i)}{D(a_{i+1})} \le \frac{n}{\gamma}$.

Lemma B.2. Step 4 of Algorithm 3 adds at most $O\left(\frac{1}{\gamma} \log \frac{n}{\gamma}\right)$ intervals to Q.

Sketch. This derives from observing that now $D(I \cup I') \ge \gamma/n$, which as in [CDSS13, Lemma 5.3] in turn implies $1 \ge \frac{\gamma}{n} (1+\gamma)^{|Q'|-1}$, so that $|Q'| = O\left(\frac{1}{\gamma} \log \frac{n}{\gamma}\right)$.
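For intuition, the greedy loop of Algorithm 3 (Steps 3–8) can be sketched as follows. This is a simplified illustration under our own assumptions: $D$ is given as a list of point masses, the Right-Interval peeling of Step 2 is omitted, and the name `decompose_mhr` is ours.

```python
def decompose_mhr(D, gamma):
    """Simplified sketch of Steps 3-8 of Decompose-MHR'.

    Returns a list of (left, right) index intervals: an initial low-mass
    prefix (points below gamma/n^2), then maximal intervals on which D stays
    within a multiplicative [1/(1+gamma), 1+gamma] band of its left endpoint.
    """
    n = len(D)
    # Step 3: peel off the prefix of points with mass below gamma / n^2.
    i = 0
    while i < n and D[i] < gamma / n ** 2:
        i += 1
    intervals = [(0, i - 1)] if i > 0 else []
    # Steps 4-8: cut whenever D(j) leaves [1/(1+gamma), 1+gamma] * D(start).
    start = i
    for j in range(start + 1, n):
        if not (D[start] / (1 + gamma) <= D[j] <= (1 + gamma) * D[start]):
            intervals.append((start, j - 1))
            start = j
    if start < n:
        intervals.append((start, n - 1))
    return intervals
```

Each returned interval is either low-mass or multiplicatively close to constant, which is exactly the structure the decomposability argument exploits.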
Again following their argument, we also get

$\frac{D(a_{|Q|+1})}{D(a_1)} = \prod_{I_i \in Q''} \frac{D(a_{i+1})}{D(a_i)} \cdot \prod_{I_i \in Q'} \frac{D(a_{i+1})}{D(a_i)}$;

by combining Lemma B.1 with the fact that $D(a_{|Q|+1}) \le 1$ and that by construction $D(a_1) \ge \gamma/n^2$, we get

$\prod_{I_i \in Q''} \frac{D(a_{i+1})}{D(a_i)} \le \frac{n}{\gamma} \cdot \frac{n^2}{\gamma} = \frac{n^3}{\gamma^2}$.

But since each term in the product is at least $(1+\gamma)$ (by construction of Q and the definition of Q″), this leads to $(1+\gamma)^{|Q''|} \le \frac{n^3}{\gamma^2}$, and thus $|Q''| = O\left(\frac{1}{\gamma} \log \frac{n}{\gamma}\right)$ as well.

It remains to show that Q ∪ {I, I′, I″} is indeed a good decomposition of [n] for D, as per Definition 3.1. Since by construction every interval in Q satisfies item (ii), we are only left with the case of I, I′ and I″. For the first two, as they were returned by Right-Interval, either (a) they are singletons, in which case item (ii) trivially holds; or (b) they have at least two elements, in which case they have probability mass at most γ/n (by the choice of parameters for Right-Interval) and thus item (i) is satisfied. Finally, it is immediate to see that by construction D(I″) ≤ n · γ/n² = γ/n, and item (i) holds in this case as well.

C Proofs from Section 4

This section contains the proofs omitted from Section 4, namely the distance estimation procedures for t-piecewise degree-d (Theorem 4.13), monotone hazard rate (Lemma 4.14), and log-concave distributions (Lemma 4.15).

C.1 Proof of Theorem 4.13

In this section, we prove the following:

Theorem C.1. Let p be an ℓ-histogram over [−1, 1). There is an algorithm ProjectSinglePoly(d, ε) which runs in time poly(ℓ, d+1, 1/ε), and outputs a degree-d polynomial q which defines a pdf over [−1, 1) such that $\|p - q\|_1 \le 3\,\ell_1(p, \mathcal{P}_d) + O(\varepsilon)$.

As mentioned in Section 4, the proof of this statement is a rather straightforward adaptation of the proof of [CDSS14a, Theorem 9], with two differences: first, in our setting there is no uncertainty nor probabilistic argument due to sampling, as we are provided with an explicit description of the histogram p. Second, Chan et al.
require some "well-behavedness" assumption on the distribution p (for technical reasons essentially due to the sampling access), which we remove here. Besides these two points, the proof is almost identical to theirs, and we only reproduce (our modification of) it here for the sake of completeness. (Any error introduced in the process, however, is solely our responsibility.)

Proof. Some preliminary definitions will be helpful:

Definition C.2 (Uniform partition). Let p be a subdistribution on an interval I ⊆ [−1, 1). A partition $\mathcal{I} = \{I_1, \ldots, I_\ell\}$ of I is (p, η)-uniform if $p(I_j) \le \eta$ for all $1 \le j \le \ell$.

We will also use the following notation: for this subsection, let I = [−1, 1) (I will denote a subinterval of [−1, 1) when the results are applied in the next subsection). We write $\|f\|_1^{(I)}$ to denote $\int_I |f(x)|\, dx$, and we write $d_{\mathrm{TV}}^{(I)}(p, q)$ to denote $\|p - q\|_1^{(I)}/2$. We write $\mathrm{opt}_{1,d}^{(I)}$ to denote the infimum of the distance $\|p - g\|_1^{(I)}$ between p and any degree-d subdistribution g on I that satisfies $g(I) = p(I)$.

The key step of ProjectSinglePoly is Step 2, where it calls the FindSinglePoly procedure. In this procedure, $T_i(x)$ denotes the degree-i Chebyshev polynomial of the first kind. The function F computed by FindSinglePoly should be thought of as the CDF of a "quasi-distribution" f; we say that f = F′ is a "quasi-distribution" and not a bona fide probability distribution because it is not guaranteed to be non-negative everywhere on [−1, 1). Step 2 of FindSinglePoly processes f slightly to obtain a polynomial q which is an actual distribution over [−1, 1).

Algorithm 4 ProjectSinglePoly
Require: parameters d, ε; and the full description of an ℓ-histogram p over [−1, 1).
Ensure: a degree-d distribution q such that $d_{\mathrm{TV}}(p, q) \le 3 \cdot \mathrm{opt}_{1,d} + O(\varepsilon)$
1: Partition [−1, 1) into z = Θ((d+1)/ε) intervals $I_0 = [i_0, i_1), \ldots, I_{z-1} = [i_{z-1}, i_z)$, where $i_0 = -1$ and $i_z = 1$, such that for each $j \in \{1, \ldots, z\}$ we have $p(I_j) = \Theta(\varepsilon/(d+1))$ or ($|I_j| = 1$ and $p(I_j) = \Omega(\varepsilon/(d+1))$).
2: Call FindSinglePoly(d, ε, η := Θ(ε/(d+1)), $\{I_0, \ldots, I_{z-1}\}$, p) and output the hypothesis q that it returns.

The rest of this subsection gives the proof of Theorem C.1. The claimed running time bound is obvious (the computation is dominated by solving the poly(d, 1/ε)-size LP in ProjectSinglePoly, with an additional term linear in ℓ when partitioning [−1, 1) in the initial first step), so it suffices to prove correctness.

Before launching into the proof, we give some intuition for the linear program. Intuitively, F(x) represents the CDF of a degree-d polynomial distribution f where f = F′. Constraint (a) captures the endpoint constraints that any CDF must obey if it has the same total weight as p. Intuitively, constraint (b) ensures that for each interval $[i_j, i_k)$, the value $F(i_k) - F(i_j)$ (which we may alternately write as $f([i_j, i_k))$) is close to the weight $p([i_j, i_k))$ that the distribution puts on the interval. Recall that by assumption p is $\mathrm{opt}_{1,d}$-close to some degree-d polynomial r. Intuitively, the variable $w_\ell$ represents $\int_{[i_\ell, i_{\ell+1})} (r - p)$ (note that these values sum to zero by constraint (c)(4)), and $y_\ell$ represents the absolute value of $w_\ell$ (see constraint (c)(5)). The value τ, which by constraint (c)(6) is at least the sum of the $y_\ell$'s, represents a lower bound on $\mathrm{opt}_{1,d}$.
The constraints in (d) and (e) reflect the fact that, as a CDF, F should be bounded between 0 and 1 (more on this below), and the (f) constraints reflect the fact that the pdf f = F′ should be everywhere non-negative (again more on this below).

We begin by observing that ProjectSinglePoly calls FindSinglePoly with input parameters that satisfy FindSinglePoly's input requirements: (I) the non-singleton intervals $I_0, \ldots, I_{z-1}$ are (p, η)-uniform; and

Algorithm 5 FindSinglePoly
Require: degree parameter d; error parameter ε; parameter η; (p, η)-uniform partition $\mathcal{I} = \{I_1, \ldots, I_z\}$ of interval I into z intervals such that $\sqrt{\varepsilon z} \cdot \eta \le \varepsilon/2$; a subdistribution p on I
Ensure: a number τ and a degree-d subdistribution q on I such that $q(I) = p(I)$, $\mathrm{opt}_{1,d}^{(I)} \le \|p - q\|_1^{(I)} \le 3\,\mathrm{opt}_{1,d}^{(I)} + \sqrt{\varepsilon z (d+1)} \cdot \eta + \mathrm{error}$, $0 \le \tau \le \mathrm{opt}_{1,d}^{(I)}$, and $\mathrm{error} = O((d+1)\eta)$.
1: Let τ be the solution to the following LP: minimize τ subject to the following constraints: (Below $F(x) = \sum_{i=0}^{d+1} c_i T_i(x)$ where $T_i(x)$ is the degree-i Chebyshev polynomial of the first kind, and $f(x) = F'(x) = \sum_{i=0}^{d+1} c_i T_i'(x)$.)
(a) F(−1) = 0 and F(1) = p(I);
(b) For each $0 \le j < k \le z$, $p([i_j, i_k)) + \sum_{j \le \ell} \ldots$ […]

[…] $> 100\varepsilon$, then the procedure returns no.

Proof. For convenience, let $\alpha \stackrel{\rm def}{=} \varepsilon^3$; we also write $[i, j]$ instead of $\{i, \ldots, j\}$. First, we note that it is easy to reduce our problem to the case where, in the completeness case, we have $P \in \mathcal{MHR}$ such that $\|D - P\|_1 \le 2\varepsilon$ and $\|D - P\|_{\mathrm{Kol}} \le 2\alpha$; while in the soundness case $\ell_1(D, \mathcal{MHR}) \ge 99\varepsilon$.
Indeed, this can be done with a linear program on poly(k, ℓ) variables, asking to find a (k+ℓ)-histogram D″ on a refinement of D and D′ minimizing the ℓ₁ distance to D, under the constraint that the Kolmogorov distance to D′ be bounded by ε. (In the completeness case, clearly a feasible solution exists, as P is one.) We therefore follow with this new formulation: either (a) D is ε-close to a monotone hazard rate distribution P (in ℓ₁ distance) and D is α-close to P (in Kolmogorov distance); or (b) D is 32ε-far from monotone hazard rate; where D is a (k+ℓ)-histogram.

We then proceed by observing the following easy fact: suppose P is an MHR distribution on [n], i.e., such that the quantity

$h_i \stackrel{\rm def}{=} \frac{P(i)}{\sum_{j=i}^n P(j)}$, $i \in [n]$,

is non-decreasing. Then we have

$P(i) = h_i \prod_{j=1}^{i-1} (1 - h_j)$, $i \in [n]$,  (10)

and there is a bijective correspondence between P and $(h_i)_{i \in [n]}$.

We will write a linear program with variables $y_1, \ldots, y_n$, with the correspondence $y_i \stackrel{\rm def}{=} \ln(1 - h_i)$. Note that with this parameterization, we get that if the $(y_i)_{i \in [n]}$ correspond to an MHR distribution P, then for $i \in [n]$

$P([i, n]) = \prod_{j=1}^{i-1} e^{y_j} = e^{\sum_{j=1}^{i-1} y_j}$,

and asking that $\ln(1-\varepsilon) \le \sum_{j=1}^{i-1} y_j - \ln D([i, n]) \le \ln(1+\varepsilon)$ amounts to requiring $P([i, n]) \in [1 \pm \varepsilon]\, D([i, n])$.

We focus first on the completeness case, to provide intuition for the linear program. Suppose there exists $P \in \mathcal{MHR}$ such that $\|D - P\|_1 \le \varepsilon$ and $\|D' - P\|_{\mathrm{Kol}} \le \alpha$. This implies that for all $i \in [n]$, $|P([i, n]) - D([i, n])| \le 2\alpha$. Define $I = \{b+1, \ldots, n\}$ to be the longest interval such that $D(\{b+1, \ldots, n\}) \le \varepsilon/2$.
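Before continuing, the bijection between P and its hazard rates, i.e. Equation (10) and its inverse map, can be sanity-checked with a short script (the helper names are ours, for illustration only):

```python
def hazard_rates(P):
    """Hazard rates h_i = P(i) / P([i, n]) of a distribution P on [n]."""
    n = len(P)
    tails = [sum(P[i:]) for i in range(n)]  # tails[i] = P([i, n])
    return [P[i] / tails[i] for i in range(n)]

def from_hazard_rates(h):
    """Inverse map of Equation (10): P(i) = h_i * prod_{j < i} (1 - h_j)."""
    P, surv = [], 1.0  # surv tracks prod_{j < i} (1 - h_j)
    for hi in h:
        P.append(hi * surv)
        surv *= 1.0 - hi
    return P
```

Composing the two maps recovers the original distribution, which is the bijective correspondence the linear program relies on.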
It follows that for every $i \in [n] \setminus I$,

$\frac{P([i,n])}{D([i,n])} \le \frac{D([i,n]) + 2\alpha}{D([i,n])} \le 1 + \frac{2\alpha}{\varepsilon/2} = 1 + 4\varepsilon^2 \le 1 + \varepsilon$  (11)

and similarly $\frac{P([i,n])}{D([i,n])} \ge \frac{D([i,n]) - 2\alpha}{D([i,n])} \ge 1 - \varepsilon$. This means that for the points i in $[n] \setminus I$, we can write constraints asking for multiplicative closeness (within $1 \pm \varepsilon$) between $e^{\sum_{j=1}^{i-1} y_j}$ and $D([i, n])$, which is very easy to write down as linear constraints on the $y_i$'s.

The linear program. Let T and S be respectively the sets of "light" and "heavy" points, defined as

$T = \left\{ i \in \{1, \ldots, b\} : D(i) \le \varepsilon^2 \right\}$ and $S = \left\{ i \in \{1, \ldots, b\} : D(i) > \varepsilon^2 \right\}$,

where b is as above. (In particular, $|S| \le 1/\varepsilon^2$.)

Algorithm 6 Linear Program
Find $y_1, \ldots, y_b$ s.t.
  $y_i \le 0$  (12)
  $y_{i+1} \le y_i$  $\forall i \in \{1, \ldots, b-1\}$  (13)
  $\ln(1-\varepsilon) \le \sum_{j=1}^{i-1} y_j - \ln D([i,n]) \le \ln(1+\varepsilon)$  $\forall i \in \{1, \ldots, b\}$  (14)
  $\frac{D(i) - \varepsilon_i}{(1+\varepsilon) D([i,n])} \le -y_i \le \frac{(1+4\varepsilon)(D(i) + \varepsilon_i)}{(1-\varepsilon) D([i,n])}$  $\forall i \in T$  (15)
  $\sum_{i \in T} \varepsilon_i \le \varepsilon$  (16)
  $0 \le \varepsilon_i \le 2\alpha$  $\forall i \in T$  (17)
  $\ln\left(1 - \frac{D(i) + 2\alpha}{(1-\varepsilon) D([i,n])}\right) \le y_i \le \ln\left(1 - \frac{D(i) - 2\alpha}{(1+\varepsilon) D([i,n])}\right)$  $\forall i \in S$  (18)

Given a solution to the linear program above, define $\tilde{P}$ (a non-normalized probability distribution) by setting $\tilde{P}(i) = (1 - e^{y_i})\, e^{\sum_{j=1}^{i-1} y_j}$ for $i \in \{1, \ldots, b\}$, and $\tilde{P}(i) = 0$ for $i \in I = \{b+1, \ldots, n\}$. An MHR distribution is then obtained by normalizing $\tilde{P}$.

Completeness. Suppose $P \in \mathcal{MHR}$ is as promised. In particular, by the Kolmogorov distance assumption we know that every $i \in T$ has $P(i) \le \varepsilon^2 + 2\alpha < 2\varepsilon^2$.
• For any $i \in T$, we have that $\frac{P(i)}{P([i,n])} \le \frac{2\varepsilon^2}{(1-\varepsilon)\varepsilon} \le 4\varepsilon$, and

$\frac{D(i) - \varepsilon_i}{(1+\varepsilon) D([i,n])} \le \frac{P(i)}{P([i,n])} \le \underbrace{-\ln\left(1 - \frac{P(i)}{P([i,n])}\right)}_{-y_i} \le (1+4\varepsilon)\frac{P(i)}{P([i,n])} = (1+4\varepsilon)\frac{D(i) + \varepsilon_i}{P([i,n])} \le \frac{1+4\varepsilon}{1-\varepsilon} \cdot \frac{D(i) + \varepsilon_i}{D([i,n])}$  (19)

where we used Equation (11) for the two outer inequalities; and so (15), (16), and (17) would follow from setting $\varepsilon_i \stackrel{\rm def}{=} |P(i) - D(i)|$ (along with the guarantees on ℓ₁ and Kolmogorov distances between P and D).

• For $i \in S$, Constraint (18) is also met, as

$\frac{P(i)}{P([i,n])} \in \left[\frac{D(i) - 2\alpha}{P([i,n])}, \frac{D(i) + 2\alpha}{P([i,n])}\right] \subseteq \left[\frac{D(i) - 2\alpha}{(1+\varepsilon) D([i,n])}, \frac{D(i) + 2\alpha}{(1-\varepsilon) D([i,n])}\right]$.

Soundness. Assume a feasible solution to the linear program is found. We argue that this implies D is O(ε)-close to some MHR distribution, namely to the distribution obtained by renormalizing $\tilde{P}$. In order to do so, we bound separately the ℓ₁ distance between D and $\tilde{P}$, from I, S, and T.

First, $\sum_{i \in I} |D(i) - \tilde{P}(i)| = \sum_{i \in I} D(i) \le \varepsilon/2$ by construction. For $i \in T$, we have $\frac{D(i)}{D([i,n])} \le \varepsilon$, and $\tilde{P}(i) = (1 - e^{y_i})\, e^{\sum_{j=1}^{i-1} y_j} \in [1 \pm \varepsilon](1 - e^{y_i})\, D([i,n])$. Now,

$1 - (1-\varepsilon)\frac{D(i) - \varepsilon_i}{(1+\varepsilon) D([i,n])} \ge e^{-\frac{D(i) - \varepsilon_i}{(1+\varepsilon) D([i,n])}} \ge e^{y_i} \ge e^{-\frac{(1+4\varepsilon)(D(i) + \varepsilon_i)}{(1-\varepsilon) D([i,n])}} \ge 1 - \frac{(1+4\varepsilon)(D(i) + \varepsilon_i)}{(1-\varepsilon) D([i,n])}$

so that

$(1-\varepsilon)\frac{(1-\varepsilon)}{(1+\varepsilon)}(D(i) - \varepsilon_i) \le \tilde{P}(i) \le (1+4\varepsilon)\frac{(1+\varepsilon)}{(1-\varepsilon)}(D(i) + \varepsilon_i)$

which implies

$(1 - 10\varepsilon)(D(i) - \varepsilon_i) \le \tilde{P}(i) \le (1 + 10\varepsilon)(D(i) + \varepsilon_i)$

so that

$\sum_{i \in T} |D(i) - \tilde{P}(i)| \le 10\varepsilon \sum_{i \in T} D(i) + (1 + 10\varepsilon) \sum_{i \in T} \varepsilon_i \le 10\varepsilon + (1 + 10\varepsilon)\varepsilon \le 20\varepsilon$

where the last inequality follows from Constraint (16).
To analyze the contribution from S, we observe that Constraint (18) implies that, for any $i \in S$,

$\frac{D(i) - 2\alpha}{(1+\varepsilon) D([i,n])} \le \frac{\tilde{P}(i)}{\tilde{P}([i,n])} \le \frac{D(i) + 2\alpha}{(1-\varepsilon) D([i,n])}$

which, combined with Constraint (14), guarantees

$\frac{D(i) - 2\alpha}{(1+\varepsilon)^2\, \tilde{P}([i,n])} \le \frac{\tilde{P}(i)}{\tilde{P}([i,n])} \le \frac{D(i) + 2\alpha}{(1-\varepsilon)^2\, \tilde{P}([i,n])}$

which in turn implies that $|\tilde{P}(i) - D(i)| \le 3\varepsilon \tilde{P}(i) + 2\alpha$. Recalling that $|S| \le 1/\varepsilon^2$ and $\alpha = \varepsilon^3$, this yields

$\sum_{i \in S} |D(i) - \tilde{P}(i)| \le 3\varepsilon \sum_{i \in S} \tilde{P}(i) + 2\varepsilon \le 3\varepsilon(1+\varepsilon) + 2\varepsilon \le 8\varepsilon$.

Summing up, we get $\sum_{i=1}^n |D(i) - \tilde{P}(i)| \le 30\varepsilon$, which finally implies by the triangle inequality that the ℓ₁ distance between D and the normalized version of $\tilde{P}$ (a valid MHR distribution) is at most 32ε.

Running time. The running time is immediate, from executing the two linear programs on poly(n, 1/ε) variables and constraints.

C.3 Proof of Lemma 4.15

Lemma 4.15 (Log-concavity). There exists a procedure ProjectionDist∗_L that, on input n as well as the full specifications of a k-histogram distribution D on [n] and an ℓ-histogram distribution D′ on [n], runs in time poly(n, k, ℓ, 1/ε), and satisfies the following.
• If there is $P \in \mathcal{L}$ such that $\|D - P\|_1 \le \varepsilon$ and $\|D' - P\|_{\mathrm{Kol}} \le \frac{\varepsilon^2}{\log^2(1/\varepsilon)}$, then the procedure returns yes;
• If $\ell_1(D, \mathcal{L}) \ge 100\varepsilon$, then the procedure returns no.

Proof. We set $\alpha \stackrel{\rm def}{=} \frac{\varepsilon^2}{\log^2(1/\varepsilon)}$, $\beta \stackrel{\rm def}{=} \frac{\varepsilon^2}{\log(1/\varepsilon)}$, and $\gamma \stackrel{\rm def}{=} \frac{\varepsilon^2}{10}$ (so that $\alpha \ll \beta \ll \gamma \ll \varepsilon$). Given the explicit description of a distribution D on [n], which is a k-histogram over a partition $\mathcal{I} = (I_1, \ldots$
$, I_k)$ of [n] with k = poly(log n, 1/ε), and the explicit description of a distribution D′ on [n], one must efficiently distinguish between: (a) D is ε-close to a log-concave P (in ℓ₁ distance) and D′ is α-close to P (in Kolmogorov distance); and (b) D is 100ε-far from log-concave. If we are willing to pay an extra factor of O(n), we can assume without loss of generality that we know the mode of the closest log-concave distribution (which is implicitly assumed in the following: the final algorithm will simply try all possible modes).

Outline. First, we argue that we can simplify to the case where D is unimodal. Then, we reduce to the case where D and D′ are only one distribution, satisfying both requirements from the completeness case. Both can be done efficiently (Section C.3.1), and make the rest much easier. Then, we perform some ad hoc partitioning of [n], using our knowledge of D, into $\tilde{O}(1/\varepsilon^2)$ pieces such that each piece is either a "heavy" singleton, or an interval I with weight very close (multiplicatively) to D(I) under the target log-concave distribution, if it exists (Section C.3.2). This in particular simplifies the type of log-concave distribution we are looking for: it is sufficient to look for distributions putting that very specific weight on each piece, up to a (1 + o(1)) factor. Then, in Section C.3.3, we write and solve a linear program to try and find such a "simplified" log-concave distribution, and reject if no feasible solution exists.

Note that the first two sections allow us to argue that, instead of additive (in ℓ₁) closeness, we can enforce constraints on multiplicative (within a (1+ε) factor) closeness between D and the target log-concave distribution.
This is what enables a linear program with variables being the logarithms of the probabilities, which plays very nicely with the log-concavity constraints.

We will require the following result of Chan, Diakonikolas, Servedio, and Sun:

Theorem C.6 ([CDSS13, Lemma 4.1]). Let D be a distribution over [n], log-concave and non-decreasing over $\{1, \ldots, b\} \subseteq [n]$. Let $a \le b$ be such that $\sigma = D(\{1, \ldots, a-1\}) > 0$, and write $\tau = D(\{a, \ldots, b\})$. Then $\frac{D(b)}{D(a)} \le 1 + \frac{\tau}{\sigma}$.

C.3.1 Step 1

Reducing to D unimodal. Using a linear program, find a closest unimodal distribution $\tilde{D}$ to D (also a k-histogram on $\mathcal{I}$) under the constraint that $\|\tilde{D} - D'\|_{\mathrm{Kol}} \le \alpha$: this can be done in time poly(k). If $\|D - \tilde{D}\|_1 > \varepsilon$, output REJECT.

• If D is ε-close to a log-concave distribution P as above, then it is in particular ε-close to unimodal and we do not reject. Moreover, by the triangle inequality $\|\tilde{D} - P\|_1 \le 2\varepsilon$ and $\|\tilde{D} - P\|_{\mathrm{Kol}} \le 2\alpha$.
• If D is 100ε-far from log-concave and we do not reject, then $\ell_1(\tilde{D}, \mathcal{L}) \ge 99\varepsilon$.

Reducing to D = D′. First, we note that it is easy to reduce our problem to the case where, in the completeness case, we have $P \in \mathcal{L}$ such that $\|D - P\|_1 \le 4\varepsilon$ and $\|D - P\|_{\mathrm{Kol}} \le 4\alpha$; while in the soundness case $\ell_1(D, \mathcal{L}) \ge 97\varepsilon$. Indeed, this can be done with a linear program on poly(k, ℓ) variables and constraints, asking to find a (k+ℓ)-histogram D″ on a refinement of D and D′ minimizing the ℓ₁ distance to D, under the constraint that the Kolmogorov distance to D′ be bounded by 2α. (In the completeness case, clearly a feasible solution exists, as (the flattening on this (k+ℓ)-interval partition of) P is one.)
We therefore follow with this new formulation: either (a) D is 4ε-close to a log-concave P (in ℓ₁ distance) and D is 4α-close to P (in Kolmogorov distance); or (b) D is 97ε-far from log-concave; where D is a (k+ℓ)-histogram. This way, we have reduced the problem to a slightly more convenient one, that of Section C.3.2.

Reducing to knowing the support [a, b]. The next step is to compute a good approximation of the support of any target log-concave distribution. This is easily obtained in time O(k), as the interval $\{a, \ldots, b\}$ such that
• $D(\{1, \ldots, a-1\}) \le \alpha$ but $D(\{1, \ldots, a\}) > \alpha$; and
• $D(\{b+1, \ldots, n\}) \le \alpha$ but $D(\{b, \ldots, n\}) > \alpha$.

Any log-concave distribution that is α-close to D must include $\{a, \ldots, b\}$ in its support, since otherwise the ℓ₁ distance between D and P would already be greater than α. Conversely, if P is a log-concave distribution α-close to D, it is easy to see that the distribution obtained by setting P to be zero outside $\{a, \ldots, b\}$ and renormalizing the result is still log-concave, and O(α)-close to D.

C.3.2 Step 2

Given the explicit description of a unimodal distribution D on [n], which is a k-histogram over a partition $\mathcal{I} = (I_1, \ldots, I_k)$ of [n] with k = poly(log n, 1/ε), one must efficiently distinguish between: (a) D is ε-close to a log-concave P (in ℓ₁ distance) and α-close to P (in Kolmogorov distance); and (b) D is 24ε-far from log-concave; assuming we know the mode of the closest log-concave distribution, which has support [n].
In this stage, we compute a partition $\mathcal{J}$ of $[n]$ into $\tilde{O}(1/\varepsilon^2)$ intervals (here, we implicitly use the knowledge of the mode of the closest log-concave distribution, in order to apply Theorem C.6 separately on the two intervals of the support corresponding to the non-decreasing and non-increasing parts of the target log-concave distribution). As $D$ is unimodal, we can efficiently (in time $O(\log k)$) find the interval $S$ of heavy points, that is $S \stackrel{\rm def}{=} \{x \in [n] : D(x) \ge \beta\}$. Each point in $S$ will form a singleton interval in our partition. Let $T \stackrel{\rm def}{=} [n] \setminus S$ be its complement ($T$ is the union of at most two intervals $T_1, T_2$ on which $D$ is monotone, the head and the tail of the distribution). For convenience, we focus on only one of these two intervals, without loss of generality the "head" $T_1$ (on which $D$ is non-decreasing).

1. Greedily find $J = \{1, \dots, a\}$, the smallest prefix of the distribution satisfying $D(J) \in \left[\frac{\varepsilon}{10} - \beta, \frac{\varepsilon}{10}\right]$.

2. Similarly, partition $T_1 \setminus J$ into intervals $I'_1, \dots, I'_s$ (with $s = O(1/\gamma) = O(1/\varepsilon^2)$) such that $\frac{\gamma}{10} \le D(I'_j) \le \frac{9\gamma}{10}$ for all $1 \le j \le s-1$, and $\frac{\gamma}{10} \le D(I'_s) \le \gamma$. This is possible as all points not in $S$ have weight less than $\beta$, and $\beta \ll \gamma$.

Discussion: why do this? We focus on the completeness case: let $P \in \mathcal{L}$ be a log-concave distribution such that $\|D - P\|_1 \le \varepsilon$ and $\|D - P\|_{\mathrm{Kol}} \le \alpha$. Applying Theorem C.6 to $J$ and the $I'_j$'s, we obtain (using the fact that $|P(I'_j) - D(I'_j)| \le 2\alpha$) that
$$\frac{\max_{x \in I'_j} P(x)}{\min_{x \in I'_j} P(x)} \le 1 + \frac{D(I'_j) + 2\alpha}{D(J) - 2\alpha} \le 1 + \frac{\gamma + 2\alpha}{\frac{\varepsilon}{10} - 2\alpha} = 1 + \varepsilon + O\!\left(\frac{\varepsilon^2}{\log^2(1/\varepsilon)}\right) \stackrel{\rm def}{=} 1 + \kappa.$$
Moreover, we also get that each resulting interval $I'_j$ satisfies
$$D(I'_j)(1 - \kappa_j) = D(I'_j) - 2\alpha \le P(I'_j) \le D(I'_j) + 2\alpha = D(I'_j)(1 + \kappa_j)$$
with $\kappa_j \stackrel{\rm def}{=} \frac{2\alpha}{D(I'_j)} = \Theta\!\left(1/\log^2(1/\varepsilon)\right)$.
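The greedy construction of steps 1 and 2 can be sketched as follows. This is our illustrative code, not the paper's: it builds the prefix $J$ and then closes each interval as soon as its mass reaches $\gamma/10$ (which keeps every interval within the stated windows, since each light point weighs less than $\beta \ll \gamma$); the concrete parameter values in the usage line are ours:

```python
# A hypothetical sketch of the greedy partition of the head T1 into the
# prefix J and intervals I'_1, ..., I'_s of mass roughly gamma/10 each.
def greedy_partition(D, eps, gamma, beta):
    """D: point masses of T1 in order, each < beta. Returns (J, intervals)."""
    mass, a = 0.0, len(D)
    for i, w in enumerate(D):          # stop upon entering [eps/10 - beta, eps/10]
        mass += w
        if mass >= eps / 10 - beta:
            a = i + 1
            break
    intervals, start, mass = [], a, 0.0
    for i in range(a, len(D)):         # close an interval once mass >= gamma/10
        mass += D[i]
        if mass >= gamma / 10:
            intervals.append([start, i + 1])
            start, mass = i + 1, 0.0
    if start < len(D) and intervals:   # fold a light leftover into I'_s
        intervals[-1][1] = len(D)
    return list(range(a)), intervals

J, ivs = greedy_partition([0.001] * 100, eps=0.1, gamma=0.02, beta=0.002)
print(len(J), len(ivs))                # → 8 46
```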
Summing up, we have a partition of $[n]$ into $|S| + 2 = \tilde{O}(1/\varepsilon^2)$ intervals such that:

• the (at most) two end intervals have $D(J) \in \left[\frac{\varepsilon}{10} - \beta, \frac{\varepsilon}{10}\right]$, and thus $P(J) \in \left[\frac{\varepsilon}{10} - \beta - 2\alpha, \frac{\varepsilon}{10} + 2\alpha\right]$;

• the $\tilde{O}(1/\varepsilon^2)$ singleton intervals from $S$ are points $x$ with $D(x) \ge \beta$, so that $P(x) \ge \beta - 2\alpha \ge \frac{\beta}{2}$;

• each other interval $I = I'_j$ satisfies
$$(1 - \kappa_j) D(I) \le P(I) \le (1 + \kappa_j) D(I) \qquad (20)$$
with $\kappa_j = O\!\left(1/\log^2(1/\varepsilon)\right)$; and
$$\frac{\max_{x \in I} P(x)}{\min_{x \in I} P(x)} \le 1 + \kappa < 1 + \frac{3}{2}\varepsilon. \qquad (21)$$

We will use in the constraints of the linear program the facts that $(1 + \frac{3}{2}\varepsilon)(1 + \kappa_j) \le 1 + 2\varepsilon$ and $\frac{1 - \kappa_j}{1 + \frac{3}{2}\varepsilon} \ge \frac{1}{1 + 2\varepsilon}$.

C.3.3 Step 3

We start by computing the partition $\mathcal{J} = (J_1, \dots, J_\ell)$ as in Section C.3.2, with $\ell = \tilde{O}(1/\varepsilon^2)$, and write $J_j = \{a_j, \dots, b_j\}$ for all $j \in [\ell]$. We further denote by $S$ and $T$ the sets of heavy and light points, following the notation of Section C.3.2, and let $T' \stackrel{\rm def}{=} T_1 \cup T_2$ be the set obtained by removing the two "end intervals" (called $J$ in the previous section) from $T$.

Algorithm 7 Linear Program

Find $x_1, \dots, x_n, \varepsilon_1, \dots, \varepsilon_{|S|}$ s.t.
$$x_i \le 0 \qquad (22)$$
$$x_i - x_{i-1} \ge x_{i+1} - x_i \quad \forall i \in [n] \qquad (23)$$
$$-\ln(1 + 2\varepsilon) \le x_i - \mu_j \le \ln(1 + 2\varepsilon) \quad \forall j \in T', \forall i \in J_j \qquad (24)$$
$$-\frac{2\varepsilon_i}{D(i)} \le x_i - \ln D(i) \le \frac{\varepsilon_i}{D(i)} \quad \forall i \in S \qquad (25)$$
$$\sum_{i \in S} \varepsilon_i \le \varepsilon \qquad (26)$$
$$0 \le \varepsilon_i \le 2\alpha \quad \forall i \in S \qquad (27)$$
where $\mu_j \stackrel{\rm def}{=} \ln \frac{D(J_j)}{|J_j|}$ for $j \in T'$.

Lemma C.7 (Soundness). If the linear program (Algorithm 7) has a feasible solution, then $\ell_1(D, \mathcal{L}) \le O(\varepsilon)$.

Proof. A feasible solution to this linear program will define (setting $p_i = e^{x_i}$) a sequence $p = (p_1, \dots$
$\dots, p_n) \in (0, 1]^n$ such that

• $p$ takes values in $(0, 1]$ (from (22));

• $p$ is log-concave (from (23));

• $p$ is "$(1 + O(\varepsilon))$-multiplicatively constant" on each interval $J_j$ (from (24));

• $p$ puts roughly the right amount of weight on each $J_j$:

  – weight $(1 \pm O(\varepsilon)) D(J)$ on every $J$ from $T'$ (from (24)), so that the $\ell_1$ distance between $D$ and $p$ coming from $T'$ is at most $O(\varepsilon)$;

  – weight approximately $D(J)$ on every singleton $J$ from $S$, i.e., such that $D(J) \ge \beta$. To see why, observe that each $\varepsilon_i$ is in $[0, 2\alpha]$ by constraint (27). In particular, this means that $\frac{\varepsilon_i}{D(i)} \le \frac{2\alpha}{\beta} \ll 1$, and we have
$$D(i) - 4\varepsilon_i \le D(i) \, e^{-\frac{4\varepsilon_i}{D(i)}} \le p_i = e^{x_i} \le D(i) \, e^{\frac{2\varepsilon_i}{D(i)}} \le D(i) + 4\varepsilon_i,$$
and together with (26) this guarantees that the $\ell_1$ distance between $D$ and $p$ coming from $S$ is at most $O(\varepsilon)$.

Note that the solution obtained this way may not sum to one, i.e., it is not necessarily a probability distribution. However, it is easy to renormalize $p$ to obtain a bona fide probability distribution $\tilde{P}$ as follows: set $\tilde{P}(i) = \frac{p(i)}{\sum_{j \in S \cup T'} p(j)}$ for all $i \in S \cup T'$, and $\tilde{P}(i) = 0$ for $i \in T \setminus T'$. Since by the above discussion we know that $p(S \cup T')$ is within $O(\varepsilon)$ of $D(S \cup T')$ (itself in $[1 - \frac{9\varepsilon}{5}, 1 + \frac{9\varepsilon}{5}]$ by construction of $T'$), $\tilde{P}$ is a log-concave distribution such that $\|\tilde{P} - D\|_1 = O(\varepsilon)$.

Lemma C.8 (Completeness). If there is $P \in \mathcal{L}$ such that $\|D - P\|_1 \le \varepsilon$ and $\|D - P\|_{\mathrm{Kol}} \le \alpha$, then the linear program (Algorithm 7) has a feasible solution.

Proof. Let $P \in \mathcal{L}$ be such that $\|D - P\|_1 \le \varepsilon$ and $\|D - P\|_{\mathrm{Kol}} \le \alpha$. Define $x_i \stackrel{\rm def}{=} \ln P(i)$ for all $i \in [n]$. Constraints (22) and (23) are immediately satisfied, since $P$ is log-concave. By the discussion from Section C.3.2 (more specifically, Eq. (20) and (21)), constraint (24) holds as well.
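The feasibility question at the heart of Algorithm 7 can be posed to an off-the-shelf LP solver. The sketch below is our toy instance (using scipy's `linprog`, with the heavy/light split chosen by hand), not the paper's implementation; it encodes constraints (22)–(27) directly over the variables $x_i$ and $\varepsilon_i$:

```python
# A sketch of the Algorithm 7 feasibility check via scipy, for a toy instance.
# Variables: x[0..n-1] (log-probabilities), then one eps_i per heavy index.
import math
from scipy.optimize import linprog

def lp_feasible(D, S, light_intervals, eps, alpha):
    """D: point masses; S: heavy indices; light_intervals: the J_j (0-based)."""
    n, m = len(D), len(S)
    nv = n + m
    A_ub, b_ub = [], []

    def row():
        return [0.0] * nv

    # (23) log-concavity: x[i-1] - 2 x[i] + x[i+1] <= 0 at interior points
    for i in range(1, n - 1):
        r = row(); r[i - 1], r[i], r[i + 1] = 1.0, -2.0, 1.0
        A_ub.append(r); b_ub.append(0.0)
    # (24) |x_i - mu_j| <= ln(1 + 2 eps) on each light interval J_j
    bound = math.log(1 + 2 * eps)
    for J in light_intervals:
        mu = math.log(sum(D[i] for i in J) / len(J))
        for i in J:
            r = row(); r[i] = 1.0
            A_ub.append(r); b_ub.append(mu + bound)
            r = row(); r[i] = -1.0
            A_ub.append(r); b_ub.append(bound - mu)
    # (25) -2 eps_i / D(i) <= x_i - ln D(i) <= eps_i / D(i) for heavy i
    for k, i in enumerate(S):
        r = row(); r[i], r[n + k] = 1.0, -1.0 / D[i]
        A_ub.append(r); b_ub.append(math.log(D[i]))
        r = row(); r[i], r[n + k] = -1.0, -2.0 / D[i]
        A_ub.append(r); b_ub.append(-math.log(D[i]))
    # (26) sum of the eps_i is at most eps
    r = row()
    for k in range(m):
        r[n + k] = 1.0
    A_ub.append(r); b_ub.append(eps)
    # (22) x_i <= 0 and (27) 0 <= eps_i <= 2 alpha, as variable bounds
    bounds = [(None, 0.0)] * n + [(0.0, 2 * alpha)] * m
    res = linprog(c=[0.0] * nv, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.status == 0          # status 0: a feasible point was found

# A log-concave 5-point toy distribution: x_i = ln D(i) is itself feasible.
D = [0.1, 0.2, 0.4, 0.2, 0.1]
print(lp_feasible(D, S=[2], light_intervals=[[0, 1], [3, 4]], eps=0.3, alpha=0.1))
```

With a tiny budget such as `eps=1e-6`, the same instance becomes infeasible, since (24) then pins the light points near their interval averages while (23) still demands concavity of the $x_i$.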
Letting $\varepsilon_i \stackrel{\rm def}{=} |P(i) - D(i)|$ for $i \in S$, we also immediately have (26) and (27) (since $\|P - D\|_1 \le \varepsilon$ and $\|D - P\|_{\mathrm{Kol}} \le \alpha$ by assumption). Finally, to see why (25) is satisfied, we rewrite
$$x_i - \ln D(i) = \ln \frac{P(i)}{D(i)} = \ln \frac{D(i) \pm \varepsilon_i}{D(i)} = \ln\!\left(1 \pm \frac{\varepsilon_i}{D(i)}\right)$$
and use the facts that $\ln(1 + x) \le x$ and $\ln(1 - x) \ge -2x$ (the latter for $x < \frac{1}{2}$, along with $\frac{\varepsilon_i}{D(i)} \le \frac{2\alpha}{\beta} \ll 1$).

C.3.4 Putting it all together: Proof of Lemma 4.10

The algorithm is as follows (keeping the notation from Section C.3.1 to Section C.3.3):

• Set $\alpha, \beta, \gamma$ as above.

• Follow Section C.3.1 to reduce the problem to the case where $D$ is unimodal and satisfies the conditions on Kolmogorov and $\ell_1$ distance, and a good approximation $[a, b]$ of the support is known.

• For each of the $O(n)$ possible modes $c \in [a, b]$:

  – run the linear program of Algorithm 7, and return ACCEPT if a feasible solution is found.

• If none of the linear programs was feasible, return REJECT.

The correctness follows from Lemma C.7, Lemma C.8, and the discussions in Sections C.3.1 to C.3.3; as for the claimed running time, it is immediate from the algorithm and the fact that the linear program executed at each step has $\mathrm{poly}(n, 1/\varepsilon)$ constraints and variables.

D Proof of Theorem 6.3

In this section, we establish our lower bound for tolerant testing of the Binomial distribution, restated below:

Theorem 6.3. There exists an absolute constant $\varepsilon_0 > 0$ such that the following holds. Any algorithm which, given sampling access to an unknown distribution $D$ on $\Omega$ and parameter $\varepsilon \in (0, \varepsilon_0)$, distinguishes with probability at least $2/3$ between (i) $\|D - \mathrm{Bin}(n, 1/2)\|_1 \le \varepsilon$ and (ii) $\|D - \mathrm{Bin}(n, 1/2)\|_1 \ge 100\varepsilon$, must use $\Omega\!\left(\frac{1}{\varepsilon} \frac{\sqrt{n}}{\log n}\right)$ samples.

The theorem will be a consequence of the (slightly) more general result below:

Theorem D.1.
There exist absolute constants $\varepsilon_0 > 0$ and $\lambda > 0$ such that the following holds. Any algorithm which, given SAMP access to an unknown distribution $D$ on $\Omega$ and parameter $\varepsilon \in (0, \varepsilon_0)$, distinguishes with probability at least $2/3$ between (i) $\|D - \mathrm{Bin}(n, \frac{1}{2})\|_1 \le \varepsilon$ and (ii) $\|D - \mathrm{Bin}(n, \frac{1}{2})\|_1 \ge \lambda \varepsilon^{1/3} - \varepsilon$, must use $\Omega\!\left(\frac{\varepsilon \sqrt{n}}{\log(\varepsilon^{2/3} n)}\right)$ samples.

By choosing a suitable $\phi$ and working out the corresponding parameters, this for instance enables us to derive the following:

Corollary D.2. There exists an absolute constant $\varepsilon_0 \in (0, 1/1000)$ such that the following holds. Any algorithm which, given SAMP access to an unknown distribution $D$ on $\Omega$, distinguishes with probability at least $2/3$ between (i) $\|D - \mathrm{Bin}(n, \frac{1}{2})\|_1 \le \varepsilon_0$ and (ii) $\|D - \mathrm{Bin}(n, \frac{1}{2})\|_1 \ge 100\varepsilon_0$, must use $\Omega\!\left(\frac{\sqrt{n}}{\log n}\right)$ samples.

By standard techniques, this will in turn imply Theorem 6.3.

Proof of Theorem D.1. Hereafter, we write for convenience $B_n \stackrel{\rm def}{=} \mathrm{Bin}(n, \frac{1}{2})$. To prove this lower bound, we will rely on the following:

Theorem D.3 ([VV10, Theorem 1]). For any constant $\phi \in (0, 1/4)$, the following holds. Any algorithm which, given SAMP access to an unknown distribution $D$ on $\Omega$, distinguishes with probability at least $2/3$ between (i) $\|D - \mathcal{U}_n\|_1 \le \phi$ and (ii) $\|D - \mathcal{U}_n\|_1 \ge \frac{1}{2} - \phi$, must have sample complexity at least $\frac{\phi}{32} \cdot \frac{n}{\log n}$.

Without loss of generality, assume $n$ is even (so that $B_n$ has only one mode, located at $\frac{n}{2}$). For $c > 0$, we write $I_{n,c}$ for the interval $\{\frac{n}{2} - c\sqrt{n}, \dots, \frac{n}{2} + c\sqrt{n}\}$ and $J_{n,c} \stackrel{\rm def}{=} \Omega \setminus I_{n,c}$.

Fact D.4. For any $c > 0$,
$$\frac{B_n\!\left(\frac{n}{2} + c\sqrt{n}\right)}{B_n(n/2)}, \; \frac{B_n\!\left(\frac{n}{2} - c\sqrt{n}\right)}{B_n(n/2)} \underset{n \to \infty}{\sim} e^{-2c^2} \qquad \text{and} \qquad B_n(I_{n,c}) \in (1 \pm o(1)) \cdot [e^{-2c^2}, 1] \cdot 2c\sqrt{\frac{2}{\pi}} = \Theta(c).$$
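The first part of Fact D.4 is easy to check numerically (our illustration; we work in log-space via the log-Gamma function to avoid overflow for large $n$):

```python
# Numerical check of Fact D.4: for B_n = Bin(n, 1/2), the ratio
# B_n(n/2 + c*sqrt(n)) / B_n(n/2) tends to exp(-2 c^2) as n grows.
import math

def log_binom_pmf(n, k):
    """Logarithm of the Bin(n, 1/2) pmf at k, via log-Gamma."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1)
            - math.lgamma(n - k + 1) - n * math.log(2))

n, c = 10**6, 1.0
k = n // 2 + int(c * math.sqrt(n))
ratio = math.exp(log_binom_pmf(n, k) - log_binom_pmf(n, n // 2))
assert abs(ratio - math.exp(-2 * c**2)) < 0.01   # e^{-2} ≈ 0.1353
```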
The reduction proceeds as follows: given sampling access to $D$ on $[n]$, we can simulate sampling access to a distribution $D'$ on $[N]$ (where $N = \Theta(n^2)$) such that

• if $\|D - \mathcal{U}_n\|_1 \le \phi$, then $\|D' - B_N\|_1 < \varepsilon$;
• if $\|D - \mathcal{U}_n\|_1 \ge \frac{1}{2} - \phi$, then $\|D' - B_N\|_1 > \varepsilon' - \varepsilon$, for $\varepsilon \stackrel{\rm def}{=} \Theta(\phi^{3/2})$ and $\varepsilon' \stackrel{\rm def}{=} \Theta(\phi^{1/2})$;

in a way that preserves the sample complexity. More precisely, define $c \stackrel{\rm def}{=} \sqrt{\frac{1}{2}\ln\frac{1}{1-\phi}} = \Theta(\sqrt{\phi})$ (so that $\phi = 1 - e^{-2c^2}$) and $N$ such that $|I_{N,c}| = n$ (that is, $N = (n/(2c))^2 = \Theta(n^2/\phi)$). From now on, we can therefore identify $[n]$ with $I_{N,c}$ in the obvious way, and see a draw from $D$ as an element of $I_{N,c}$. Let $p \stackrel{\rm def}{=} B_N(I_{N,c}) = \Theta(\sqrt{\phi})$, and let $B_{N,c}$, $\bar{B}_{N,c}$ respectively denote the conditional distributions induced by $B_N$ on $I_{N,c}$ and $J_{N,c}$. Intuitively, we want $D$ to be mapped to the conditional distribution of $D'$ on $I_{N,c}$, and the conditional distribution of $D'$ on $J_{N,c}$ to be exactly $\bar{B}_{N,c}$. This is done by defining $D'$ via the process below:

• with probability $p$, draw a sample from $D$ (seen as an element of $I_{N,c}$);
• with probability $1 - p$, draw a sample from $\bar{B}_{N,c}$.

Let $\tilde{B}_N$ be defined as the distribution which exactly matches $B_N$ on $J_{N,c}$, but is uniform on $I_{N,c}$:
$$\tilde{B}_N(i) = \begin{cases} \frac{p}{|I_{N,c}|} & i \in I_{N,c} \\ B_N(i) & i \in J_{N,c}. \end{cases}$$
From the above, we have that $\|D' - \tilde{B}_N\|_1 = p \cdot \|D - \mathcal{U}_n\|_1$. Furthermore, by Fact D.4, Lemma 2.8, and the definition of $I_{N,c}$, we get that $\|B_N - \tilde{B}_N\|_1 = p \cdot \|(B_N)_{I_{N,c}} - \mathcal{U}_{I_{N,c}}\|_1 \le p \cdot \phi$. Putting it all together:

• if $\|D - \mathcal{U}_n\|_1 \le \phi$, then by the triangle inequality $\|D' - B_N\|_1 \le p(\phi + \phi) = 2p\phi$;
• if $\|D - \mathcal{U}_n\|_1 \ge \frac{1}{2} - \phi$, then similarly $\|D' - B_N\|_1 \ge p\left(\frac{1}{2} - \phi - \phi\right) = \frac{p}{2} - 2p\phi$.

Recalling that $p = \Theta(\sqrt{\phi})$ and setting $\varepsilon \stackrel{\rm def}{=} 2p\phi$ concludes the reduction. From Theorem D.3, we conclude that
$$\frac{\phi}{32} \cdot \frac{n}{\log n} = \Omega\!\left(\frac{\phi\sqrt{\phi N}}{\log(\phi N)}\right)$$
$$= \Omega\!\left(\frac{\varepsilon \sqrt{N}}{\log(\varepsilon^{2/3} N)}\right)$$
samples are necessary.

Proof of Corollary D.2. The corollary follows from the proof of Theorem D.1, by taking $\phi = 1/1000$ and computing the corresponding $\varepsilon$ and $\varepsilon' - \varepsilon$ to check that indeed $\lim_{n \to \infty} \frac{\varepsilon' - \varepsilon}{\varepsilon} > 100$.
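The two-branch sampling process defining $D'$ in the proof of Theorem D.1 can be sketched as follows. This is our illustration, not the paper's code: `draw_D` is a hypothetical black-box sampler for $D$, the binomial draw is simulated by explicit coin flips (slow but self-contained), and $\bar{B}_{N,c}$ is obtained by rejection:

```python
# A sketch of one draw from D' on {0, ..., N}, given one draw from D on [n],
# using the embedding of [n] into I_{N,c}. Intended for tiny N only.
import math
import random

def binom_pmf(N, k):
    return math.comb(N, k) / 2 ** N

def make_D_prime_sampler(draw_D, n, c, N, rng=random):
    left = N // 2 - int(c * math.sqrt(N))          # left endpoint of I_{N,c}
    I = range(left, left + n)                      # the embedded copy of [n]
    p = sum(binom_pmf(N, k) for k in I)            # p = B_N(I_{N,c})

    def draw_B_bar():                              # conditional of B_N on J_{N,c}
        while True:
            s = sum(rng.random() < 0.5 for _ in range(N))   # one Bin(N, 1/2) draw
            if s not in I:                         # reject draws inside I_{N,c}
                return s

    def draw_D_prime():
        if rng.random() < p:
            return left + (draw_D() - 1)           # with prob. p: a draw from D
        return draw_B_bar()                        # else: a draw from bar{B}_{N,c}

    return draw_D_prime

# toy usage: D uniform on [n], with N = (n/(2c))^2 as in the proof (tiny here)
n, c = 5, 1.0
N = int((n / (2 * c)) ** 2)
draw = make_D_prime_sampler(lambda: random.randint(1, n), n, c, N)
samples = [draw() for _ in range(1000)]
assert all(0 <= s <= N for s in samples)
```

Since here $p = B_N(I_{N,c}) \approx 0.97$, the vast majority of the simulated draws land inside the embedded copy of $[n]$, as the reduction intends.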
