Testing Identity of Structured Distributions

Ilias Diakonikolas∗ (University of Edinburgh, ilias.d@ed.ac.uk), Daniel M. Kane† (University of California, San Diego, dakane@cs.ucsd.edu), Vladimir Nikishkin‡ (University of Edinburgh, v.nikishkin@sms.ed.ac.uk)

July 24, 2018

Abstract

We study the question of identity testing for structured distributions. More precisely, given samples from a structured distribution q over [n] and an explicit distribution p over [n], we wish to distinguish whether q = p versus q is at least ε-far from p, in L_1 distance. In this work, we present a unified approach that yields new, simple testers, with sample complexity that is information-theoretically optimal, for broad classes of structured distributions, including t-flat distributions, t-modal distributions, log-concave distributions, monotone hazard rate (MHR) distributions, and mixtures thereof.

1 Introduction

How many samples do we need to verify the identity of a distribution? This is arguably the single most fundamental question in statistical hypothesis testing [NP33], with Pearson's chi-squared test [Pea00] (and variants thereof) still being the method of choice used in practice. This question has also been extensively studied by the TCS community in the framework of property testing [RS96, GGR98]: Given sample access to an unknown distribution q over a finite domain [n] := {1, ..., n}, an explicit distribution p over [n], and a parameter ε > 0, we want to distinguish between the cases that q and p are identical versus ε-far from each other in L_1 norm (statistical distance). Previous work on this problem focused on characterizing the sample size needed to test the identity of an arbitrary distribution of a given support size.
After more than a decade of study, this "worst-case" regime is well understood: there exists a computationally efficient estimator with sample complexity O(√n/ε^2) [VV14] and a matching information-theoretic lower bound [Pan08]. While this is certainly a significant improvement over naive approaches and is tight in general, the bound of Θ(√n) is still impractical if the support size n is very large. We emphasize that the aforementioned sample complexity characterizes worst-case instances, and one might hope that drastically better results can be obtained for most natural settings. In contrast to this setting, in which we assume nothing about the structure of the unknown distribution q, in many cases we know a priori that the distribution q in question has some "nice structure". For example, we may have some qualitative information about the density q; e.g., it may be a mixture of a small number of log-concave distributions, or a multi-modal distribution with a bounded number of modes. The following question naturally arises:

Can we exploit the underlying structure in order to perform the desired statistical estimation task more efficiently?

One would optimistically hope for the answer to the above question to be "YES." While this has been confirmed in several cases for the problem of learning (see e.g., [DDS12a, DDS12b, DDO+13, CDSS14]), relatively little work has been done for testing properties of structured distributions. In this paper, we show that this is indeed the case for the aforementioned problem of identity testing for a broad spectrum of natural and well-studied distribution classes.

∗Supported by EPSRC grant EP/L021749/1, a Marie Curie Career Integration Grant, and a SICSA grant.
†Supported in part by an NSF Postdoctoral Fellowship.
‡Supported by a University of Edinburgh PCD Scholarship.
To describe our results in more detail, we will need some terminology. Let C be a class of distributions over [n]. The problem of identity testing for C is the following: Given sample access to an unknown distribution q ∈ C, and an explicit distribution p ∈ C,¹ we want to distinguish between the case that q = p versus ||q − p||_1 ≥ ε. We emphasize that the sample complexity of this testing problem depends on the underlying class C, and we believe it is of fundamental interest to obtain efficient algorithms that are sample optimal for C. One approach to solve this problem is to learn q up to L_1 distance ε/2 and check that the hypothesis is ε/2-close to p. Thus, the sample complexity of identity testing for C is bounded from above by the sample complexity of learning (an arbitrary distribution in) C. It is natural to ask whether a better sample size bound could be achieved for the identity testing problem, since this task is, in some sense, less demanding than the task of learning. In this work, we provide a comprehensive picture of the sample and computational complexities of identity testing for a broad class of structured distributions. More specifically, we propose a unified framework that yields new, simple, and provably optimal identity testers for various structured classes C; see Table 1 for an indicative list of distribution classes to which our framework applies. Our approach relies on a single unified algorithm that we design, which yields highly efficient identity testers for many shape-restricted classes of distributions. As an interesting byproduct, we establish that, for various structured classes C, identity testing for C is provably easier than learning.
In particular, the sample bounds in the third column of Table 1 from [CDSS14] also apply for learning the corresponding class C, and are known to be information-theoretically optimal for the learning problem. Our main result (see Theorem 1 and Proposition 2 in Section 2) can be phrased, roughly, as follows: Let C be a class of univariate distributions such that any pair of distributions p, q ∈ C have "essentially" at most k crossings, that is, points of the domain where q − p changes its sign. Then, the identity problem for C can be solved with O(√k/ε^2) samples. Moreover, this bound is information-theoretically optimal. By the term "essentially" we mean that a constant fraction of the contribution to ||q − p||_1 is due to a set of k crossings; the actual number of crossings can be arbitrary. For example, if C is the class of t-piecewise constant distributions, it is clear that any two distributions in C have O(t) crossings, which gives us the first line of Table 1. As a more interesting example, consider the class C of log-concave distributions over [n]. While the number of crossings between p, q ∈ C can be Ω(n), it can be shown (see Lemma 17 in [CDSS14]) that the essential number of crossings is k = Õ(1/√ε), which gives us the third line of the table. More generally, we obtain asymptotic improvements over the standard O(√n/ε^2) bound for any class C such that the essential number of crossings is k = o(n). This condition applies for any class C that can be well-approximated in L_1 distance by piecewise low-degree polynomials (see Corollary 3 for a precise statement).

¹It is no loss of generality to assume that p ∈ C; otherwise the tester can output "NO" without drawing samples.
| Class of Distributions over [n] | Our upper bound | Previous work |
|---|---|---|
| t-piecewise constant | O(√t/ε^2) | O(t/ε^2) [CDSS14] |
| t-piecewise degree-d polynomial | O(√(t(d+1))/ε^2) | O(t(d+1)/ε^2) [CDSS14] |
| log-concave | Õ(1/ε^{9/4}) | Õ(1/ε^{5/2}) [CDSS14] |
| k-mixture of log-concave | √k · Õ(1/ε^{9/4}) | Õ(k/ε^{5/2}) [CDSS14] |
| t-modal | O(√(t log n)/ε^{5/2}) | O(√(t log n)/ε^3 + t^2/ε^4) [DDS+13] |
| k-mixture of t-modal | O(√(kt log n)/ε^{5/2}) | O(√(kt log n)/ε^3 + k^2 t^2/ε^4) [DDS+13] |
| monotone hazard rate (MHR) | O(√(log(n/ε))/ε^{5/2}) | O(log(n/ε)/ε^3) [CDSS14] |
| k-mixture of MHR | O(√(k log(n/ε))/ε^{5/2}) | O(k log(n/ε)/ε^3) [CDSS14] |

Table 1: Algorithmic results for identity testing of various classes of probability distributions. The second column indicates the sample complexity of our general algorithm applied to the class under consideration. The third column indicates the sample complexity of the best previously known algorithm for the same problem.

1.1 Related and Prior Work

In this subsection we review the related literature and compare our results with previous work.

Distribution Property Testing. The area of distribution property testing, initiated in the TCS community by the work of Batu et al. [BFR+00, BFR+13], has developed into a very active research area with intimate connections to information theory, learning, and statistics. The paradigmatic algorithmic problem in this area is the following: given sample access to an unknown distribution q over an n-element set, we want to determine whether q has some property or is "far" (in statistical distance or, equivalently, L_1 norm) from any distribution having the property. The overarching goal is to obtain a computationally efficient algorithm that uses as few samples as possible: certainly asymptotically fewer than the support size n, and ideally much less than that.
See [GR00, BFR+00, BFF+01, Bat01, BDKR02, BKR04, Pan08, Val11, VV11, DDS+13, ADJ+11, LRR11, ILR12] for a sample of works and [Rub12] for a survey. One of the first problems studied in this line of work is that of "identity testing against a known distribution": Given samples from an unknown distribution q and an explicitly given distribution p, distinguish between the case that q = p versus the case that q is ε-far from p in L_1 norm. The problem of uniformity testing, the special case of identity testing when p is the uniform distribution, was first considered by Goldreich and Ron [GR00] who, motivated by a connection to testing expansion in graphs, obtained a uniformity tester using O(√n/ε^4) samples. Subsequently, Paninski gave the tight bound of Θ(√n/ε^2) [Pan08] for this problem. Batu et al. [BFF+01] obtained an identity testing algorithm against an arbitrary explicit distribution with sample complexity Õ(√n/ε^4). The tight bound of Θ(√n/ε^2) for the general identity testing problem was given only recently in [VV14].

Shape-Restricted Statistical Estimation. The area of inference under shape constraints, that is, inference about a probability distribution under the constraint that its probability density function (pdf) satisfies certain qualitative properties, is a classical topic in statistics starting with the pioneering work of Grenander [Gre56] on monotone distributions (see [BBBB72] for an early book on the topic). Various structural restrictions have been studied in the statistics literature, starting from monotonicity, unimodality, and concavity [Gre56, Bru58, Rao69, Weg70, HP76, Gro85, Bir87a, Bir87b, Fou97, CT04, JW09], and more recently focusing on structural restrictions such as log-concavity and k-monotonicity [BW07, DR09, BRW09, GW09, BW10, KM10].
Shape-restricted inference is well motivated in its own right, and has seen a recent surge of research activity in the statistics community, in part due to the ubiquity of structured distributions in the natural sciences. Such structural constraints on the underlying distributions are sometimes direct consequences of the studied application problem (see e.g., Hampel [Ham87], or Wang et al. [WWW+05]), or they are a plausible explanation of the model under investigation (see e.g., [Reb05] and references therein for applications to economics and reliability theory). We also point the reader to the recent survey [Wal09] highlighting the importance of log-concavity in statistical inference. The hope is that, under such structural constraints, the quality of the resulting estimators may dramatically improve, both in terms of sample size and in terms of computational efficiency. We remark that the statistics literature on the topic has focused primarily on the problem of density estimation, or learning an unknown structured distribution. That is, given samples from a distribution q promised to belong to some distribution class C, we would like to output a hypothesis distribution that is a good approximation to q. In recent years, there has been a flurry of results in the TCS community on learning structured distributions, with a focus on both sample complexity and computational complexity; see [KMR+94, FOS05, BS10, KMV10, MV10, DDS12a, DDS12b, CDSS13, DDO+13, CDSS14] for some representative works.

Comparison with Prior Work. In recent work, Chan, Diakonikolas, Servedio, and Sun [CDSS14] proposed a general approach to learn univariate probability distributions that are well approximated by piecewise polynomials.
[CDSS14] obtained a computationally efficient and sample near-optimal algorithm to agnostically learn piecewise polynomial distributions, thus obtaining efficient estimators for various classes of structured distributions. For many of the classes C considered in Table 1, the best previously known sample complexity for the identity testing problem for C is identified with the sample complexity of the corresponding learning problem from [CDSS14]. We remark that the results of this paper apply to all classes C considered in [CDSS14], and are in fact more general, as our condition (any p, q ∈ C have a bounded number of "essential" crossings) subsumes the piecewise polynomial condition (see the discussion before Corollary 3 in Section 2). At the technical level, in contrast to the learning algorithm of [CDSS14], which relies on a combination of linear programming and dynamic programming, our identity tester is simple and combinatorial. In the context of property testing, Batu, Kumar, and Rubinfeld [BKR04] gave algorithms for the problem of identity testing of unimodal distributions with sample complexity O(log^3 n). More recently, Daskalakis, Diakonikolas, Servedio, Valiant, and Valiant [DDS+13] generalized this result to t-modal distributions, obtaining an identity tester with sample complexity O(√(t log n)/ε^3 + t^2/ε^4). We remark that for the class of t-modal distributions our approach yields an identity tester with sample complexity O(√(t log n)/ε^{5/2}), matching the lower bound of [DDS+13]. Moreover, our work yields sample-optimal identity testing algorithms not only for t-modal distributions, but for a broad spectrum of structured distributions via a unified approach. It should be emphasized that the main ideas underlying this paper are very different from those of [DDS+13].
The algorithm of [DDS+13] is based on the fact from [Bir87a] that any t-modal distribution is ε-close in L_1 norm to a piecewise constant distribution with k = O(t · log(n)/ε) intervals. Hence, if the location and the width of these k "flat" intervals were known in advance, the problem would be easy: The algorithm could just test identity between the "reduced" distributions supported on these k intervals, thus obtaining the optimal sample complexity of O(√k/ε^2) = O(√(t log n)/ε^{5/2}). To circumvent the problem that this decomposition is not known a priori, [DDS+13] start by drawing samples from the unknown distribution q to construct such a decomposition. There are two caveats with this strategy: First, the number of samples used to achieve this is Ω(t^2), and the number of intervals of the constructed decomposition is significantly larger than k, namely k′ = Ω(k/ε). As a consequence, the sample complexity of identity testing for the reduced distributions on support k′ is Ω(√k′/ε^2) = Ω(√(t log n)/ε^3). In conclusion, the approach of [DDS+13] involves constructing an adaptive interval decomposition of the domain followed by a single application of an identity tester to the reduced distributions over those intervals. At a high level, our novel approach works as follows: We consider several oblivious interval decompositions of the domain (i.e., without drawing any samples from q) and apply a "reduced" identity tester for each such decomposition. While it may seem surprising that such an approach can be optimal, our algorithm and its analysis exploit a certain strong property of uniformity testers, namely their performance guarantee with respect to the L_2 norm. See Section 2 for a detailed explanation of our techniques. Finally, we comment on the relation of this work to the recent paper [VV14].
In [VV14], Valiant and Valiant study the sample complexity of the identity testing problem as a function of the explicit distribution. In particular, [VV14] makes no assumptions about the structure of the unknown distribution q, and characterizes the sample complexity of the identity testing problem as a function of the known distribution p. The current work provides a unified framework to exploit structural properties of the unknown distribution q, and yields sample-optimal identity testers for various shape restrictions. Hence, the results of this paper are orthogonal to the results of [VV14].

2 Our Results and Techniques

2.1 Basic Definitions

We start with some notation that will be used throughout this paper. We consider discrete probability distributions over [n] := {1, ..., n}, which are given by probability density functions p : [n] → [0, 1] such that Σ_{i=1}^n p_i = 1, where p_i is the probability of element i in distribution p. By abuse of notation, we will sometimes use p to denote the distribution with density function p_i. We emphasize that we view the domain [n] as an ordered set. Throughout this paper we will be interested in structured distribution families that respect this ordering. The L_1 (resp. L_2) norm of a distribution is identified with the L_1 (resp. L_2) norm of the corresponding density function, i.e., ||p||_1 = Σ_{i=1}^n |p_i| and ||p||_2 = √(Σ_{i=1}^n p_i^2). The L_1 (resp. L_2) distance between distributions p and q is defined as the L_1 (resp. L_2) norm of the vector of their difference, i.e., ||p − q||_1 = Σ_{i=1}^n |p_i − q_i| and ||p − q||_2 = √(Σ_{i=1}^n (p_i − q_i)^2). We will denote by U_n the uniform distribution over [n].

Interval Partitions and the A_k-distance. Fix a partition of [n] into disjoint intervals I := (I_i)_{i=1}^ℓ.
For such a collection I we will denote its cardinality by |I|, i.e., |I| = ℓ. For an interval J ⊆ [n], we denote by |J| its cardinality or length, i.e., if J = [a, b], with a ≤ b ∈ [n], then |J| = b − a + 1. The reduced distribution p_r^I corresponding to p and I is the distribution over [ℓ] that assigns the i-th "point" the mass that p assigns to the interval I_i; i.e., for i ∈ [ℓ], p_r^I(i) = p(I_i). We now define a distance metric between distributions that will be crucial for this paper. Let J_k be the collection of all partitions of [n] into k intervals, i.e., I ∈ J_k if and only if I = (I_i)_{i=1}^k is a partition of [n] into intervals I_1, ..., I_k. For p, q : [n] → [0, 1] and k ∈ Z+, 2 ≤ k ≤ n, we define the A_k-distance between p and q by

||p − q||_{A_k} := max_{I = (I_i)_{i=1}^k ∈ J_k} Σ_{i=1}^k |p(I_i) − q(I_i)| = max_{I ∈ J_k} ||p_r^I − q_r^I||_1.

We remark that the A_k-distance between distributions² is well studied in probability theory and statistics. Note that for any pair of distributions p, q : [n] → [0, 1], and any k ∈ Z+ with 2 ≤ k ≤ n, we have that ||p − q||_{A_k} ≤ ||p − q||_1, and the two metrics are identical for k = n. Also note that ||p − q||_{A_2} = 2 · d_K(p, q), where d_K is the Kolmogorov metric (i.e., the L_∞ distance between the CDFs).

²We note that the definition of the A_k-distance in this work is slightly different from that of [DL01, CDSS14], but is easily seen to be essentially equivalent. In particular, [CDSS14] considers the quantity max_{S ∈ S_k} |p(S) − q(S)|, where S_k is the collection of all unions of at most k intervals in [n]. It is a simple exercise to verify that ||p − q||_{A_k} ≤ 2 · max_{S ∈ S_k} |p(S) − q(S)| = ||p − q||_{A_{2k+1}}, which implies that the two definitions are equivalent up to constant factors for the purpose of both upper and lower bounds.

Discussion. The well-known Vapnik-Chervonenkis (VC) inequality (see e.g., [DL01, p. 31]) provides the information-theoretically optimal sample size to learn an arbitrary distribution q over [n] in this metric. In particular, it implies that m = Ω(k/ε^2) iid draws from q suffice in order to learn q within A_k-distance ε (with probability at least 9/10). This fact has recently proved useful in the context of learning structured distributions: By exploiting this fact, Chan, Diakonikolas, Servedio, and Sun [CDSS14] recently obtained computationally efficient and near-sample-optimal algorithms for learning various classes of structured distributions with respect to the L_1 distance. It is thus natural to ask the following question: What is the sample complexity of testing properties of distributions with respect to the A_k-distance? Can we use property testing algorithms in this metric to obtain sample-optimal testing algorithms for interesting classes of structured distributions with respect to the L_1 distance? In this work we answer both questions in the affirmative for the problem of identity testing.

2.2 Our Results

Our main result is an optimal algorithm for the identity testing problem under the A_k-distance metric:

Theorem 1 (Main). Given ε > 0, an integer k with 2 ≤ k ≤ n, sample access to a distribution q over [n], and an explicit distribution p over [n], there is a computationally efficient algorithm which uses O(√k/ε^2) samples from q, and with probability at least 2/3 distinguishes whether q = p versus ||q − p||_{A_k} ≥ ε. Additionally, Ω(√k/ε^2) samples are information-theoretically necessary.

The information-theoretic sample lower bound of Ω(√k/ε^2) can be easily deduced from the known lower bound of Ω(√n/ε^2) for uniformity testing over [n] under the L_1 norm [Pan08]. Indeed, if the underlying distribution q over [n] is piecewise constant with k pieces, and p is the uniform distribution over [n], we have ||q − p||_{A_k} = ||q − p||_1. Hence, our A_k-uniformity testing problem in this case is at least as hard as L_1-uniformity testing over support of size k.

The proof of Theorem 1 proceeds in two stages: In the first stage, we reduce the A_k identity testing problem to A_k uniformity testing without incurring any loss in the sample complexity. In the second stage, we use an optimal L_2 uniformity tester as a black box to obtain an O(√k/ε^2) sample algorithm for A_k uniformity testing. We remark that the L_2 uniformity tester is not applied to the distribution q directly, but to a sequence of reduced distributions q_r^I, for an appropriate collection of interval partitions I. See Section 2.3 for a detailed intuitive explanation of the proof. We remark that an application of Theorem 1 for k = n yields a sample-optimal L_1 identity tester (for an arbitrary distribution q), giving a new algorithm matching the recent tight upper bound in [VV14]. Our new L_1 identity tester is arguably simpler and more intuitive, as it only uses an L_2 uniformity tester in a black-box manner. We show that Theorem 1 has a wide range of applications to the problem of L_1 identity testing for various classes of natural and well-studied structured distributions. At a high level, the main message of this work is that the A_k distance can be used to characterize the sample complexity of L_1 identity testing for broad classes of structured distributions. The following simple proposition underlies our approach:

Proposition 2.
For a distribution class C over [n] and ε > 0, let k = k(C, ε) be the smallest integer such that for any f_1, f_2 ∈ C it holds that ||f_1 − f_2||_1 ≤ ||f_1 − f_2||_{A_k} + ε/2. Then there exists an L_1 identity testing algorithm for C using O(√k/ε^2) samples.

The proof of the proposition is straightforward: Given sample access to q ∈ C and an explicit description of p ∈ C, we apply the A_k-identity testing algorithm of Theorem 1 for the value of k in the statement of the proposition, and error ε′ = ε/2. If q = p, the algorithm will output "YES" with probability at least 2/3. If ||q − p||_1 ≥ ε, then by the condition of Proposition 2 we have that ||q − p||_{A_k} ≥ ε′, and the algorithm will output "NO" with probability at least 2/3. Hence, as long as the underlying distribution satisfies the condition of Proposition 2 for a value of k = o(n), Theorem 1 yields an asymptotic improvement over the sample complexity of Θ(√n/ε^2). We remark that the value of k in the proposition is a natural complexity measure for the difference between two probability density functions in the class C. It follows from the definition of the A_k distance that this value corresponds to the number of "essential" crossings between f_1 and f_2, i.e., the number of crossings between the functions f_1 and f_2 that significantly affect their L_1 distance. Intuitively, the number of essential crossings, as opposed to the domain size, is, in some sense, the "right" parameter to characterize the sample complexity of L_1 identity testing for C. As we explain below, the upper bound implied by the above proposition is information-theoretically optimal for a wide range of structured distribution classes C.
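The quantities appearing in Proposition 2 are easy to experiment with on tiny domains. The sketch below (function names are ours, not from the paper) computes reduced distributions and the A_k distance by brute-force enumeration of all partitions of [n] into k intervals; this is exponential time and meant only to make the definitions concrete.

```python
from itertools import combinations

def reduced(p, partition):
    """Reduced distribution: the mass p assigns to each interval [a, b)."""
    return [sum(p[a:b]) for (a, b) in partition]

def ak_distance(p, q, k):
    """Brute-force A_k distance: max over all partitions of [n] into
    k intervals of sum_i |p(I_i) - q(I_i)|.  For tiny n only."""
    n = len(p)
    best = 0.0
    for cuts in combinations(range(1, n), k - 1):   # k-1 interior cut points
        bounds = [0, *cuts, n]
        partition = list(zip(bounds, bounds[1:]))
        pr, qr = reduced(p, partition), reduced(q, partition)
        best = max(best, sum(abs(a - b) for a, b in zip(pr, qr)))
    return best

p = [0.25, 0.25, 0.25, 0.25]                # uniform over [4]
q = [0.40, 0.10, 0.40, 0.10]
l1 = sum(abs(a - b) for a, b in zip(p, q))  # the L1 distance, here 0.6
print(ak_distance(p, q, 4))  # k = n recovers the full L1 distance
print(ak_distance(p, q, 2))  # strictly smaller: coarse partitions can hide discrepancy
```

The k = 2 case also illustrates the identity ||p − q||_{A_2} = 2 · d_K(p, q) from the definitions above.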
More specifically, our framework can be applied to all structured distribution classes C that can be well-approximated in L_1 distance by piecewise low-degree polynomials. We say that a distribution p over [n] is t-piecewise degree-d if there exists a partition of [n] into t intervals such that p is a (discrete) degree-d polynomial within each interval. Let P_{t,d} denote the class of all t-piecewise degree-d distributions over [n]. We say that a distribution class C is ε-close in L_1 to P_{t,d} if for any f ∈ C there exists p ∈ P_{t,d} such that ||f − p||_1 ≤ ε. It is easy to see that any pair of distributions p, q ∈ P_{t,d} have at most 2t(d + 1) crossings, which implies that ||p − q||_{A_k} = ||p − q||_1 for k = 2t(d + 1) (see e.g., Proposition 6 in [CDSS14]). We therefore obtain the following:

Corollary 3. Let C be a distribution class over [n] and ε > 0. Consider parameters t = t(C, ε) and d = d(C, ε) such that C is ε/4-close in L_1 to P_{t,d}. Then there exists an L_1 identity testing algorithm for C using O(√(t(d + 1))/ε^2) samples.

Note that any pair of values (t, d) satisfying the condition above suffices for the conclusion of the corollary. Since our goal is to minimize the sample complexity, for a given class C we would like to apply the corollary for values t and d that satisfy the above condition and are such that the product t(d + 1) is minimized. The appropriate choice of these values is crucial, and is based on properties of the underlying distribution family. Observe that the sample bound of O(√(t(d + 1))/ε^2) is tight in general, as follows by selecting C = P_{t,d}. This can be deduced from the general lower bound of Ω(√n/ε^2) for uniformity testing, and the fact that for n = t(d + 1), any distribution over support [n] can be expressed as a t-piecewise degree-d distribution.
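The crossing bound behind Corollary 3 is easiest to see in the piecewise-constant case (d = 0). The illustrative sketch below (our own code, with hypothetical names) cuts the domain exactly at the sign changes of p − q; on the resulting partition the interval masses already capture the full L_1 distance, which is the content of ||p − q||_{A_k} = ||p − q||_1 once k is at least the number of sign-constant pieces.

```python
def sign(x):
    return (x > 0) - (x < 0)

def sign_constant_partition(p, q):
    """Maximal intervals on which p - q does not change (nonzero) sign."""
    diff = [a - b for a, b in zip(p, q)]
    cuts, last = [0], 0
    for i, d in enumerate(diff):
        if sign(d) != 0 and last != 0 and sign(d) != last:
            cuts.append(i)           # a crossing: start a new interval
        if sign(d) != 0:
            last = sign(d)
    cuts.append(len(p))
    return list(zip(cuts, cuts[1:]))

# Two piecewise-constant densities over [6]: p - q changes sign only a
# bounded number of times, and cutting at those crossings recovers the
# full L1 distance from interval masses alone.
p = [0.3, 0.3, 0.1, 0.1, 0.1, 0.1]
q = [0.1, 0.1, 0.2, 0.2, 0.2, 0.2]
parts = sign_constant_partition(p, q)
l1 = sum(abs(a - b) for a, b in zip(p, q))
ak = sum(abs(sum(p[a:b]) - sum(q[a:b])) for a, b in parts)
assert abs(ak - l1) < 1e-9        # partition at crossings loses nothing
assert len(parts) < len(p)        # far fewer pieces than the domain size
```

For general degree d the same argument applies, since two degree-d polynomials can cross at most d + 1 times per pair of overlapping pieces.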
The concrete testing results of Table 1 are obtained from Corollary 3 by using known existential approximation theorems [Bir87a, CDSS13, CDSS14] for the corresponding structured distribution classes. In particular, we obtain efficient identity testers, in most cases with provably optimal sample complexity, for all the structured distribution classes studied in [CDSS13, CDSS14] in the context of learning. Perhaps surprisingly, our upper bounds are tight not only for the class of piecewise polynomials, but also for the specific shape-restricted classes of Table 1. The corresponding lower bounds for specific classes are either known from previous work (as e.g., in the case of t-modal distributions [DDS+13]) or can be obtained using standard constructions. Finally, we remark that the results of this paper can be appropriately generalized to the setting of testing the identity of continuous distributions over the real line. More specifically, Theorem 1 also holds for probability distributions over R. (The only additional assumption required is that the explicitly given continuous pdf p can be efficiently integrated up to any additive accuracy.) In fact, the proof for the discrete setting extends almost verbatim to the continuous setting with minor modifications. It is easy to see that both Proposition 2 and Corollary 3 hold for the continuous setting as well.

2.3 Our Techniques

We now provide a detailed intuitive explanation of the ideas that lead to our main result, Theorem 1. Given sample access to a distribution q and an explicit distribution p, we want to test whether q = p versus ||q − p||_{A_k} ≥ ε. By definition we have that ||q − p||_{A_k} = max_I ||q_r^I − p_r^I||_1.
So, if the "optimal" partition J* = (J*_i)_{i=1}^k maximizing this expression were known a priori, the problem would be easy: Our algorithm could then consider the reduced distributions q_r^{J*} and p_r^{J*}, which are supported on sets of size k, and call a standard L_1-identity tester to decide whether q_r^{J*} = p_r^{J*} versus ||q_r^{J*} − p_r^{J*}||_1 ≥ ε. (Note that for any given partition I of [n] into intervals and any distribution q, given sample access to q one can simulate sample access to the reduced distribution q_r^I.) The difficulty, of course, is that the optimal k-partition is not fixed, as it depends on the unknown distribution q; thus it is not available to the algorithm. Hence, a more refined approach is necessary. Our starting point is a new, simple reduction of the general problem of identity testing to its special case of uniformity testing. The main idea of the reduction is to appropriately "stretch" the domain size, using the explicit distribution p, in order to transform the identity testing problem between q and p into a uniformity testing problem for a (different) distribution q′ (that depends on q and p). To show correctness of this reduction we need to show that it preserves the A_k distance, and that we can sample from q′ given samples from q. We now proceed with the details. Since p is given explicitly in the input, we assume for simplicity that each p_i is a rational number; hence there exists some (potentially large) N ∈ Z+ such that p_i = α_i/N, where α_i ∈ Z+ and Σ_{i=1}^n α_i = N.³ Given sample access to q and an explicit p over [n], we construct an instance of the uniformity testing problem as follows: Let p′ be the uniform distribution over [N] and let q′ be the distribution over [N] obtained from q by subdividing the probability mass of q_i, i ∈ [n], equally among α_i new consecutive points.
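The stretching construction just described is straightforward to implement. In the sketch below (the helper names and the tiny example are ours), `alpha` holds the integers α_i with p_i = α_i/N; drawing from q′ means drawing i from q and then picking one of the α_i subdivided points uniformly.

```python
import random

def stretch_offsets(alpha):
    """Starting position in [N] of the block of alpha_i points for each i."""
    offsets, total = [], 0
    for a in alpha:
        offsets.append(total)
        total += a
    return offsets, total            # total == N

def sample_q_prime(sample_q, alpha, offsets):
    """One draw from q': draw i ~ q, then a uniform point in i's block."""
    i = sample_q()                   # i in {0, ..., n-1}
    return offsets[i] + random.randrange(alpha[i])

# Sanity check of the construction: if q = p, then q' is uniform on [N],
# since each subdivided point carries mass q_i / alpha_i = 1/N.
alpha = [2, 1, 1]                    # encodes p = (2/4, 1/4, 1/4), N = 4
offsets, N = stretch_offsets(alpha)
q = [0.5, 0.25, 0.25]                # here q = p
q_prime = [q[i] / alpha[i] for i in range(len(q)) for _ in range(alpha[i])]
assert q_prime == [0.25] * N         # exactly the uniform density on [4]
```

The pre-computation of `offsets` is the O(n) step mentioned below; each subsequent sample costs O(1) extra work.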
It is clear that this reduction preserves the $\mathcal{A}_k$ distance, i.e., $\|q - p\|_{\mathcal{A}_k} = \|q' - p'\|_{\mathcal{A}_k}$. The only remaining task is to show how to simulate sample access to $q'$, given samples from $q$. Given a sample $i$ from $q$, our sample for $q'$ is selected uniformly at random from the corresponding set of $\alpha_i$ many new points. Hence, we have reduced the problem of identity testing between $q$ and $p$ in $\mathcal{A}_k$ distance to the problem of uniformity testing of $q'$ in $\mathcal{A}_k$ distance. Note that this reduction is also computationally efficient, as it only requires $O(n)$ pre-computation to specify the new intervals.

For the rest of this section, we focus on the problem of $\mathcal{A}_k$ uniformity testing. For notational convenience, we will use $q$ to denote the unknown distribution and $p$ to denote the uniform distribution over $[n]$. The rough idea is to consider an appropriate collection of interval partitions of $[n]$ and call a standard $L_1$-uniformity tester for each of these partitions. To make such an approach work and give us a sample-optimal algorithm for our $\mathcal{A}_k$-uniformity testing problem, we need to use a subtle and strong property of uniformity testing, namely its performance guarantee under the $L_2$ norm. We elaborate on this point below. For any partition $\mathcal{I}$ of $[n]$ into $k$ intervals, by definition we have that $\|q_r^{\mathcal{I}} - p_r^{\mathcal{I}}\|_1 \le \|q - p\|_{\mathcal{A}_k}$. Therefore, if $q = p$, we will also have $q_r^{\mathcal{I}} = p_r^{\mathcal{I}}$.

(Footnote 3: We remark that this assumption is not necessary: for the case of irrational $p_i$'s, we can approximate them by rational numbers $\tilde{p}_i$ up to sufficient accuracy and proceed with the approximate distribution $\tilde{p}$. This approximation step does not preserve perfect completeness; however, we point out that our testers have some mild robustness in the completeness case, which suffices for all the arguments to go through.)
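The simulation of reduced-distribution samples used throughout this section amounts to mapping each sample to the index of the interval containing it; a minimal sketch (names are illustrative):

```python
import bisect

def reduced_sampler(partition_right_ends, q_sampler):
    """Simulate sample access to the reduced distribution q_r^I.

    partition_right_ends: sorted list of the (inclusive) right endpoints of
        the intervals I_1, ..., I_k partitioning the domain {0, ..., n-1}.
    q_sampler: zero-arg function returning a sample from q over {0, ..., n-1}.
    Returns a zero-arg function that draws from q over the original domain and
    reports the interval index, i.e., a sample from q_r^I over {0, ..., k-1}.
    """
    def sample():
        x = q_sampler()
        # First interval whose right endpoint is >= x contains x.
        return bisect.bisect_left(partition_right_ends, x)
    return sample
```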
The issue is that $\|q_r^{\mathcal{I}} - p_r^{\mathcal{I}}\|_1$ can be much smaller than $\|q - p\|_{\mathcal{A}_k}$; in fact, it is not difficult to construct examples where $\|q - p\|_{\mathcal{A}_k} = \Omega(1)$ and $\|q_r^{\mathcal{I}} - p_r^{\mathcal{I}}\|_1 = 0$. In particular, it is possible for the points where $q$ is larger than $p$, and where it is smaller than $p$, to cancel each other out within each interval in the partition, thus making the partition useless for distinguishing $q$ from $p$. In other words, if the partition $\mathcal{I}$ is not "good", we may not be able to detect any existing discrepancy.

A simple, but suboptimal, way to circumvent this issue is to consider a partition $\mathcal{I}'$ of $[n]$ into $k' = \Theta(k/\varepsilon)$ intervals of the same length. Note that each such interval will have probability mass $1/k' = \Theta(\varepsilon/k)$ under the uniform distribution $p$. If the constant in the big-$\Theta$ is appropriately selected, say $k' = 10k/\varepsilon$, it is not hard to show that $\|q_r^{\mathcal{I}'} - p_r^{\mathcal{I}'}\|_1 \ge \|q - p\|_{\mathcal{A}_k} - \varepsilon/2$; hence, we will necessarily detect a large discrepancy for the reduced distribution. By applying the optimal $L_1$ uniformity tester, this approach will require $\Omega(\sqrt{k'}/\varepsilon^2) = \Omega(\sqrt{k}/\varepsilon^{2.5})$ samples.

A key tool that is essential in our analysis is a strong property of uniformity testing. An optimal $L_1$ uniformity tester for $q$ can distinguish between the uniform distribution and the case that $\|q - p\|_1 \ge \varepsilon$ using $O(\sqrt{n}/\varepsilon^2)$ samples. However, a stronger guarantee is possible: with the same sample size, we can distinguish the uniform distribution from the case that $\|q - p\|_2 \ge \varepsilon/\sqrt{n}$. We emphasize that such a strong $L_2$ guarantee is specific to uniformity testing, and is provably not possible for the general problem of identity testing. In previous work, Goldreich and Ron [GR00] gave such an $L_2$ guarantee for uniformity testing, but their algorithm uses $O(\sqrt{n}/\varepsilon^4)$ samples.
Paninski's $O(\sqrt{n}/\varepsilon^2)$ uniformity tester works for the $L_1$ norm, and it is not known whether it achieves the desired $L_2$ property. As one of our main tools, we show the following $L_2$ guarantee, which is optimal as a function of $n$ and $\varepsilon$:

Theorem 4. Given $0 < \varepsilon, \delta < 1$ and sample access to a distribution $q$ over $[n]$, there is an algorithm Test-Uniformity-$L_2$($q$, $n$, $\varepsilon$, $\delta$) which uses $m = O(\sqrt{n}/\varepsilon^2) \cdot \log(1/\delta)$ samples from $q$, runs in time linear in its sample size, and with probability at least $1 - \delta$ distinguishes whether $q = U_n$ versus $\|q - U_n\|_2 \ge \varepsilon/\sqrt{n}$.

To prove Theorem 4, we show that a variant of Pearson's chi-squared test [Pea00] (which can be viewed as a special case of the recent "chi-square type" testers in [CDVV14, VV14]) has the desired $L_2$ guarantee. While this tester has been (implicitly) studied in [CDVV14, VV14], and it is known to be sample optimal with respect to the $L_1$ norm, it had not previously been analyzed for the $L_2$ norm. The novelty of Theorem 4 lies in the tight analysis of the algorithm under the $L_2$ distance, and is presented in Appendix A.

Armed with Theorem 4, we proceed as follows: we consider a set of $j_0 = O(\log(1/\varepsilon))$ different partitions of the domain $[n]$ into intervals. For $0 \le j < j_0$, the partition $\mathcal{I}^{(j)}$ consists of $\ell_j := |\mathcal{I}^{(j)}| = k \cdot 2^j$ many intervals $I^{(j)}_i$, $i \in [\ell_j]$, i.e., $\mathcal{I}^{(j)} = (I^{(j)}_i)_{i=1}^{\ell_j}$. For a fixed value of $j$, all intervals in $\mathcal{I}^{(j)}$ have the same length, or equivalently, the same probability mass under the uniform distribution. Then, for any fixed $j \in [j_0]$, we have $p(I^{(j)}_i) = 1/(k \cdot 2^j)$ for all $i \in [\ell_j]$. (Observe that, by our aforementioned reduction to the uniform case, we may assume that the domain size $n$ is a multiple of $k \cdot 2^{j_0}$, and thus that it is possible to evenly divide $[n]$ into such intervals of the same length.)
Note that if $q = p$, then for all $0 \le j < j_0$, it holds that $q_r^{\mathcal{I}^{(j)}} = p_r^{\mathcal{I}^{(j)}}$. Recalling that all intervals in $\mathcal{I}^{(j)}$ have the same probability mass under $p$, it follows that $p_r^{\mathcal{I}^{(j)}} = U_{\ell_j}$, i.e., $p_r^{\mathcal{I}^{(j)}}$ is the uniform distribution over its support. So, if $q = p$, for any partition we have $q_r^{\mathcal{I}^{(j)}} = U_{\ell_j}$. Our main structural result (Lemma 6) is a robust inverse lemma: if $q$ is far from uniform in $\mathcal{A}_k$ distance then, for at least one of the partitions $\mathcal{I}^{(j)}$, the reduced distribution $q_r^{\mathcal{I}^{(j)}}$ will be far from uniform in $L_2$ distance. The quantitative version of this statement is quite subtle. In particular, we start from the assumption of being $\varepsilon$-far in $\mathcal{A}_k$ distance and can only deduce "far" in $L_2$ distance. This is absolutely critical for us to be able to obtain the optimal sample complexity. The key insight for the analysis comes from noting that the optimal partition separating $q$ from $p$ in $\mathcal{A}_k$ distance cannot have too many parts. Thus, if the "highs" and "lows" cancel out over some small intervals, they must be very large in order to compensate for the fact that they are relatively narrow. Therefore, when $p$ and $q$ differ on a smaller scale, their $L_2$ discrepancy will be greater, and this compensates for the fact that the partition detecting this discrepancy will need to have more intervals in it. In Section 3 we present our sample-optimal uniformity tester under the $\mathcal{A}_k$ distance, thereby establishing Theorem 1.

3 Testing Uniformity under the $\mathcal{A}_k$-norm

Algorithm Test-Uniformity-$\mathcal{A}_k$($q$, $n$, $\varepsilon$)
Input: sample access to a distribution $q$ over $[n]$, $k \in \mathbb{Z}_+$ with $2 \le k \le n$, and $\varepsilon > 0$.
Output: "YES" if $q = U_n$; "NO" if $\|q - U_n\|_{\mathcal{A}_k} \ge \varepsilon$.
1. Draw a sample $S$ of size $m = O(\sqrt{k}/\varepsilon^2)$ from $q$.
2. Fix $j_0 \in \mathbb{Z}_+$ such that $j_0 := \lceil \log_2(1/\varepsilon) \rceil + O(1)$.
Consider the collection $\{\mathcal{I}^{(j)}\}_{j=0}^{j_0-1}$ of $j_0$ partitions of $[n]$ into intervals; the partition $\mathcal{I}^{(j)} = (I^{(j)}_i)_{i=1}^{\ell_j}$ consists of $\ell_j = k \cdot 2^j$ many intervals with $p(I^{(j)}_i) = 1/(k \cdot 2^j)$, where $p := U_n$.
3. For $j = 0, 1, \ldots, j_0 - 1$:
(a) Consider the reduced distributions $q_r^{\mathcal{I}^{(j)}}$ and $p_r^{\mathcal{I}^{(j)}} \equiv U_{\ell_j}$. Use the sample $S$ to simulate samples from $q_r^{\mathcal{I}^{(j)}}$.
(b) Run Test-Uniformity-$L_2$($q_r^{\mathcal{I}^{(j)}}$, $\ell_j$, $\varepsilon_j$, $\delta_j$) for $\varepsilon_j = C \cdot \varepsilon \cdot 2^{3j/8}$, for $C > 0$ a sufficiently small constant, and $\delta_j = 2^{-j}/6$; i.e., test whether $q_r^{\mathcal{I}^{(j)}} = U_{\ell_j}$ versus $\|q_r^{\mathcal{I}^{(j)}} - U_{\ell_j}\|_2 > \gamma_j := \varepsilon_j/\sqrt{\ell_j}$.
4. If all the testers in Step 3(b) output "YES", then output "YES"; otherwise output "NO".

Proposition 5. The algorithm Test-Uniformity-$\mathcal{A}_k$($q$, $n$, $\varepsilon$), on input a sample of size $m = O(\sqrt{k}/\varepsilon^2)$ drawn from a distribution $q$ over $[n]$, $\varepsilon > 0$, and an integer $k$ with $2 \le k \le n$, correctly distinguishes the case that $q = U_n$ from the case that $\|q - U_n\|_{\mathcal{A}_k} \ge \varepsilon$, with probability at least $2/3$.

Proof. First, it is straightforward to verify the claimed sample complexity, as the algorithm only draws samples in Step 1. Note that the algorithm uses the same set of samples $S$ for all testers in Step 3(b). By Theorem 4, the tester Test-Uniformity-$L_2$($q_r^{\mathcal{I}^{(j)}}$, $\ell_j$, $\varepsilon_j$, $\delta_j$), on input a set of $m_j = O((\sqrt{\ell_j}/\varepsilon_j^2) \cdot \log(1/\delta_j))$ samples from $q_r^{\mathcal{I}^{(j)}}$, distinguishes the case that $q_r^{\mathcal{I}^{(j)}} = U_{\ell_j}$ from the case that $\|q_r^{\mathcal{I}^{(j)}} - U_{\ell_j}\|_2 \ge \gamma_j := \varepsilon_j/\sqrt{\ell_j}$ with probability at least $1 - \delta_j$. From our choice of parameters, it can be verified that $\max_j m_j \le m = O(\sqrt{k}/\varepsilon^2)$; hence we can use the same sample $S$ as input to these testers for all $0 \le j \le j_0 - 1$. In fact, it is easy to see that $\sum_{j=0}^{j_0-1} m_j = O(m)$, which implies that the overall algorithm runs in sample-linear time.
Since each tester in Step 3(b) has error probability $\delta_j$, by a union bound over all $j \in \{0, \ldots, j_0 - 1\}$, the total error probability is at most $\sum_{j=0}^{j_0-1} \delta_j \le (1/6) \cdot \sum_{j=0}^{\infty} 2^{-j} = 1/3$. Therefore, with probability at least $2/3$, all the testers in Step 3(b) succeed. We will henceforth condition on this "good" event, and establish the completeness and soundness properties of the overall algorithm under this conditioning.

We start by establishing completeness. If $q = p = U_n$, then for any partition $\mathcal{I}^{(j)}$, $0 \le j \le j_0 - 1$, we have that $q_r^{\mathcal{I}^{(j)}} = p_r^{\mathcal{I}^{(j)}} = U_{\ell_j}$. By our aforementioned conditioning, all testers in Step 3(b) will output "YES"; hence the overall algorithm will also output "YES", as desired.

We now proceed to establish the soundness of our algorithm. Assuming that $\|q - p\|_{\mathcal{A}_k} \ge \varepsilon$, we want to show that the algorithm Test-Uniformity-$\mathcal{A}_k$($q$, $n$, $\varepsilon$) outputs "NO" with probability at least $2/3$. Towards this end, we prove the following structural lemma:

Lemma 6. There exists a constant $C > 0$ such that the following holds: if $\|q - p\|_{\mathcal{A}_k} \ge \varepsilon$, there exists $j \in \mathbb{Z}_+$ with $0 \le j \le j_0 - 1$ such that $\|q_r^{\mathcal{I}^{(j)}} - U_{\ell_j}\|_2^2 \ge \gamma_j^2 := \varepsilon_j^2/\ell_j = C^2 \cdot (\varepsilon^2/k) \cdot 2^{-j/4}$.

Given the lemma, the soundness property of our algorithm follows easily. Indeed, since all testers Test-Uniformity-$L_2$($q_r^{\mathcal{I}^{(j)}}$, $\ell_j$, $\varepsilon_j$, $\delta_j$) of Step 3(b) are successful by our conditioning, Lemma 6 implies that at least one of them outputs "NO"; hence the overall algorithm will output "NO".

The proof of Lemma 6 in its full generality is quite technical. For the sake of intuition, in the following subsection (Section 3.1) we provide a proof of the lemma for the important special case that the unknown distribution $q$ is promised to be $k$-flat, i.e., piecewise constant with $k$ pieces.
This setting captures many of the core ideas and, at the same time, avoids some of the technical difficulties of the general case. Finally, in Section 3.2 we present our proof for the general case.

3.1 Proof of Structural Lemma: $k$-flat Case

For this special case, we will prove the lemma for $C = 1/80$. Since $q$ is $k$-flat, there exists a partition $\mathcal{I}^* = (I^*_j)_{j=1}^k$ of $[n]$ into $k$ intervals so that $q$ is constant within each such interval. This in particular implies that $\|q - p\|_{\mathcal{A}_k} = \|q - p\|_1$, where $p := U_n$. For $J \in \mathcal{I}^*$, let us denote by $q_J$ the value of $q$ within the interval $J$; that is, for all $j \in [k]$ and $i \in I^*_j$ we have $q_i = q_{I^*_j}$. For notational convenience, we sometimes use $p_J$ to denote the value of $p = U_n$ within the interval $J$. By assumption, we have that $\|q - p\|_1 = \sum_{j=1}^k |I^*_j| \cdot |q_{I^*_j} - 1/n| \ge \varepsilon$.

Throughout the proof, we work with intervals $I^*_j \in \mathcal{I}^*$ such that $q_{I^*_j} < 1/n$. We will henceforth refer to such intervals as troughs, and will denote by $\mathcal{T} \subseteq [k]$ the corresponding set of indices, i.e., $\mathcal{T} = \{j \in [k] \mid q_{I^*_j} < 1/n\}$. For each trough $J \in \{I^*_j\}_{j \in \mathcal{T}}$, we define its depth as $\mathrm{depth}(J) = (p_J - q_J)/p_J = n \cdot (1/n - q_J)$ and its width as $\mathrm{width}(J) = p(J) = (1/n) \cdot |J|$. Note that the width of $J$ is identified with the probability mass that the uniform distribution assigns to it. The discrepancy of a trough $J$ is defined by $\mathrm{Discr}(J) = \mathrm{depth}(J) \cdot \mathrm{width}(J) = |J| \cdot (1/n - q_J)$ and corresponds to the contribution of $J$ to the $L_1$ distance between $q$ and $p$. It follows from Scheffé's identity that half of the contribution to $\|q - p\|_1$ comes from troughs, namely $\|q - p\|_1^{\mathcal{T}} := \sum_{j \in \mathcal{T}} \mathrm{Discr}(I^*_j) = (1/2) \cdot \|q - p\|_1 \ge \varepsilon/2$.
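For concreteness, the three quantities just defined can be computed as follows (an illustrative sketch; the helper name is ours):

```python
def trough_stats(J_len, qJ, n):
    """Depth, width, and discrepancy of a trough J for a k-flat q vs. p = U_n.

    J_len: number of domain points |J| in the interval J.
    qJ: the (constant) per-point probability of q inside J, with qJ < 1/n.
    """
    pJ = 1.0 / n             # per-point mass of the uniform distribution
    depth = (pJ - qJ) / pJ   # = n * (1/n - qJ), a value in (0, 1]
    width = J_len / n        # probability mass of J under U_n
    discr = depth * width    # = |J| * (1/n - qJ): contribution to the L1 distance
    return depth, width, discr
```

For example, with $n = 100$, $|J| = 10$, and $q_J = 0.005$, the trough has depth $1/2$, width $1/10$, and discrepancy $1/20$.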
An important observation is that we may assume that all troughs have width at most $1/k$, at the cost of potentially doubling the total number of intervals. Indeed, it is easy to see that we can artificially subdivide "wider" troughs so that each new trough has width at most $1/k$. This process comes at the expense of at most doubling the number of troughs. Let us denote by $\{\tilde{I}_j\}_{j \in \mathcal{T}'}$ this set of (new) troughs, where $|\mathcal{T}'| \le 2k$ and each $\tilde{I}_j$ is a subset of some $I^*_i$, $i \in \mathcal{T}$. We will henceforth deal with the set of troughs $\{\tilde{I}_j\}_{j \in \mathcal{T}'}$, each of width at most $1/k$. By construction, it is clear that
$$\|q - p\|_1^{\mathcal{T}'} := \sum_{j \in \mathcal{T}'} \mathrm{Discr}(\tilde{I}_j) = \|q - p\|_1^{\mathcal{T}} \ge \varepsilon/2. \quad (1)$$

At this point, we note that we can essentially ignore troughs $J \in \{\tilde{I}_j\}_{j \in \mathcal{T}'}$ with small discrepancy. Indeed, the total contribution of intervals $J \in \{\tilde{I}_j\}_{j \in \mathcal{T}'}$ with $\mathrm{Discr}(J) \le \varepsilon/(20k)$ to the LHS of (1) is at most $|\mathcal{T}'| \cdot (\varepsilon/(20k)) \le 2k \cdot (\varepsilon/(20k)) = \varepsilon/10$. Let $\mathcal{T}^*$ be the subset of $\mathcal{T}'$ corresponding to troughs with discrepancy at least $\varepsilon/(20k)$, i.e., $j \in \mathcal{T}^*$ if and only if $j \in \mathcal{T}'$ and $\mathrm{Discr}(\tilde{I}_j) \ge \varepsilon/(20k)$. Then, we have that
$$\|q - p\|_1^{\mathcal{T}^*} := \sum_{j \in \mathcal{T}^*} \mathrm{Discr}(\tilde{I}_j) \ge 2\varepsilon/5. \quad (2)$$

Observe that for any interval $J$ it holds that $\mathrm{Discr}(J) \le \mathrm{width}(J)$. Note that this part of the argument depends critically on considering only troughs. Hence, for $j \in \mathcal{T}^*$ we have that
$$\varepsilon/(20k) \le \mathrm{width}(\tilde{I}_j) \le 1/k. \quad (3)$$

Thus far, we have argued that a constant fraction of the contribution to $\|q - p\|_1$ comes from troughs whose width satisfies (3). Our next crucial claim is that each such trough must have a "large" overlap with one of the intervals $I^{(j)}_i$ considered by our algorithm Test-Uniformity-$\mathcal{A}_k$. In particular, consider a trough $J \in \{\tilde{I}_j\}_{j \in \mathcal{T}^*}$. We claim that there exists $j \in \{0, \ldots, j_0 - 1\}$ and $i \in [\ell_j]$ such that $|I^{(j)}_i| \ge |J|/4$ and $I^{(j)}_i \subseteq J$. To see this, we first pick $j$ so that $\mathrm{width}(J)/2 > 2^{-j}/k \ge \mathrm{width}(J)/4$. Since the intervals $I^{(j)}_i$ have width less than half that of $J$, $J$ must intersect at least three of these intervals. Thus, any but the two outermost such intervals will be entirely contained within $J$, and furthermore each has width $2^{-j}/k \ge \mathrm{width}(J)/4$. Since the interval $L \in \mathcal{I}^{(j+1)}$ is a "domain point" for the reduced distribution $q_r^{\mathcal{I}^{(j+1)}}$, the $L_1$ error between $q_r^{\mathcal{I}^{(j+1)}}$ and $U_{\ell_{j+1}}$ incurred by this element is at least $\frac{1}{4} \cdot \mathrm{Discr}(J)$, and the corresponding $L_2^2$ error is at least $\frac{1}{16} \cdot (\mathrm{Discr}(J))^2 \ge \frac{\varepsilon}{320k} \cdot \mathrm{Discr}(J)$, where the inequality follows from the fact that $\mathrm{Discr}(J) \ge \varepsilon/(20k)$. Hence, we have that
$$\|q_r^{\mathcal{I}^{(j+1)}} - U_{\ell_{j+1}}\|_2^2 \ge \frac{\varepsilon}{320k} \cdot \mathrm{Discr}(J). \quad (4)$$

As shown above, for every trough $J \in \{\tilde{I}_j\}_{j \in \mathcal{T}^*}$ there exists a level $j \in \{0, \ldots, j_0 - 1\}$ such that (4) holds. Hence, summing (4) over all levels, we obtain
$$\sum_{j=0}^{j_0-1} \|q_r^{\mathcal{I}^{(j+1)}} - U_{\ell_{j+1}}\|_2^2 \ge \frac{\varepsilon}{320k} \cdot \sum_{j \in \mathcal{T}^*} \mathrm{Discr}(\tilde{I}_j) \ge \varepsilon^2/(800k), \quad (5)$$
where the second inequality follows from (2). Note that
$$\sum_{j=0}^{j_0-1} \gamma_j^2 \le \sum_{j=0}^{j_0-1} \frac{\varepsilon^2 \cdot 2^{3j/4}}{80^2 \cdot k \cdot 2^j} = \frac{\varepsilon^2}{6400k} \sum_{j=0}^{j_0-1} 2^{-j/4} < \varepsilon^2/(800k).$$
Therefore, by the above, we must have that $\|q_r^{\mathcal{I}^{(j+1)}} - U_{\ell_{j+1}}\|_2^2 > \gamma_j^2$ for some $0 \le j \le j_0 - 1$. This completes the proof of Lemma 6 for the special case of $q$ being $k$-flat.

3.2 Proof of Structural Lemma: General Case

To prove the general version of our structural result for the $\mathcal{A}_k$ distance, we will need to choose an appropriate value for the universal constant $C$. We show that it is sufficient to take $C \le 5 \cdot 10^{-6}$.
(While we have not attempted to optimize constant factors, we believe that a more careful analysis would lead to substantially better constants.) A useful observation is that our Test-Uniformity-$\mathcal{A}_k$ algorithm only distinguishes which of the intervals of $\mathcal{I}^{(j_0-1)}$ each of our samples lies in, and can therefore equivalently be thought of as a uniformity tester for the reduced distribution $q_r^{\mathcal{I}^{(j_0-1)}}$. In order to show that it suffices to consider only this restricted sample set, we claim that $\|q_r^{\mathcal{I}^{(j_0-1)}} - U_{\ell_{j_0-1}}\|_{\mathcal{A}_k} \ge \|p - q\|_{\mathcal{A}_k} - \varepsilon/2$. In particular, these $\mathcal{A}_k$ distances would be equal if the dividers of the optimal partition for $q$ were all on boundaries between intervals of $\mathcal{I}^{(j_0-1)}$. If this were not the case, though, we could round the endpoints of each trough inward to the nearest such boundary (note that we can assume that the optimal partition has no two adjacent troughs). This changes the discrepancy of each trough by at most $2k \cdot 2^{-j_0}$, and thus, for $j_0 - \log_2(1/\varepsilon)$ a sufficiently large universal constant, the total discrepancy decreases by at most $\varepsilon/2$. Thus, we have reduced ourselves to the case where $n = 2^{j_0-1} \cdot k$, and have argued that it suffices to show that our algorithm distinguishes $\mathcal{A}_k$-distance in this setting with $\varepsilon_j = 10^{-5} \cdot \varepsilon \cdot 2^{3j/8}$.

The analysis of the completeness and the soundness of the tester is identical to Proposition 5. The only missing piece is the proof of Lemma 6, which we now restate for the sake of convenience:

Lemma 7. If $\|q - p\|_{\mathcal{A}_k} \ge \varepsilon$, there exists some $j \in \mathbb{Z}_+$ with $0 \le j \le j_0 - 1$ such that $\|q_r^{\mathcal{I}^{(j)}} - U_{\ell_j}\|_2^2 \ge \gamma_j^2 := \varepsilon_j^2/\ell_j = 10^{-10} \cdot 2^{-j/4} \cdot \varepsilon^2/k$.

The analysis of the general case here is somewhat more complicated than that of the $k$-flat special case presented in the previous section.
This is because it is possible for one of the intervals $J$ in the optimal partition (i.e., the interval partition $\mathcal{I}^* \in \mathcal{J}_k$ maximizing $\|q_r^{\mathcal{I}} - p_r^{\mathcal{I}}\|_1$ in the definition of the $\mathcal{A}_k$ distance) to have large overlap with an interval $I$ that our algorithm considers (that is, $I \in \cup_{j=0}^{j_0-1} \mathcal{I}^{(j)}$) without having $q(I)$ and $p(I)$ differ substantially. Note that the unknown distribution $q$ is not guaranteed to be constant within such an interval $J$, and in particular the difference $q - p$ does not necessarily preserve its sign within $J$. To deal with this issue, we note that there are two possibilities for an interval $J$ in the optimal partition: either one of the intervals $I^{(j)}_i$ (considered by our algorithm) of size at least $|J|/2$ has discrepancy comparable to $J$, or the distribution $q$ differs from $p$ even more substantially on one of the intervals separating the endpoints of $I^{(j)}_i$ from the endpoints of $J$. Therefore, either an interval contained within this one will detect a large $L_2$ error, or we will need to again pass to a subinterval. To make this intuition rigorous, we will need a mechanism for detecting where this recursion will terminate. To handle this formally, we introduce the following definition:

Definition 8. For $p = U_n$ and $q$ an arbitrary distribution over $[n]$, we define the scale-sensitive-$L_2$ distance between $q$ and $p$ to be
$$\|q - p\|_{[k]}^2 := \max_{\mathcal{I} = (I_1, \ldots, I_r) \in \mathcal{W}_{1/k}} \sum_{i=1}^r \frac{\mathrm{Discr}^2(I_i)}{\mathrm{width}^{1/8}(I_i)},$$
where $\mathcal{W}_{1/k}$ is the collection of all partitions of $[n]$ into intervals of width at most $1/k$.

The notion of the scale-sensitive-$L_2$ distance will be a useful intermediate tool in our analysis. The rough idea of the definition is that the optimal partition will be able to detect the correctly sized intervals for our tester to notice.
(It will act as an analogue of the partition into the intervals where $q$ is constant from the $k$-flat case.) The first thing we need to show is that if $q$ and $p$ have large $\mathcal{A}_k$ distance, then they also have large scale-sensitive-$L_2$ distance. Indeed, we have the following lemma:

Lemma 9. For $p = U_n$ and $q$ an arbitrary distribution over $[n]$, we have that
$$\|q - p\|_{[k]}^2 \ge \frac{\|q - p\|_{\mathcal{A}_k}^2}{(2k)^{7/8}}.$$

Proof. Let $\varepsilon = \|q - p\|_{\mathcal{A}_k}$. Consider the optimal $\mathcal{I}^*$ in the definition of the $\mathcal{A}_k$ distance. As in our analysis for the $k$-flat case, by further subdividing intervals of width more than $1/k$ into smaller ones, we can obtain a new partition $\mathcal{I}' = (I'_i)_{i=1}^s$ of cardinality $s \le 2k$, all of whose parts have width at most $1/k$. Furthermore, we have that $\sum_i \mathrm{Discr}(I'_i) \ge \varepsilon$. Using this partition to bound $\|q - p\|_{[k]}^2$ from below, by Cauchy-Schwarz we obtain that
$$\|q - p\|_{[k]}^2 \ge \sum_i \frac{\mathrm{Discr}^2(I'_i)}{\mathrm{width}(I'_i)^{1/8}} \ge \frac{\left(\sum_i \mathrm{Discr}(I'_i)\right)^2}{\sum_i \mathrm{width}(I'_i)^{1/8}} \ge \frac{\varepsilon^2}{2k \cdot (1/(2k))^{1/8}} = \frac{\varepsilon^2}{(2k)^{7/8}}.$$

The second important fact about the scale-sensitive-$L_2$ distance is that if it is large, then one of the partitions considered by our algorithm will produce a large $L_2$ error.

Proposition 10. Let $p = U_n$ be the uniform distribution and $q$ a distribution over $[n]$. Then we have that
$$\|q - p\|_{[k]}^2 \le 10^8 \sum_{j=0}^{j_0-1} \sum_{i=1}^{2^j \cdot k} \frac{\mathrm{Discr}^2(I^{(j)}_i)}{\mathrm{width}^{1/8}(I^{(j)}_i)}. \quad (6)$$

Proof. Let $\mathcal{J} \in \mathcal{W}_{1/k}$ be the optimal partition used when computing the scale-sensitive-$L_2$ distance $\|q - p\|_{[k]}$. In particular, it is a partition into intervals of width at most $1/k$ such that $\sum_i \frac{\mathrm{Discr}^2(J_i)}{\mathrm{width}(J_i)^{1/8}}$ is maximized. To prove Eq. (6), we prove a notably stronger claim.
In particular, we will prove that for each interval $J_\ell \in \mathcal{J}$,
$$\frac{\mathrm{Discr}^2(J_\ell)}{\mathrm{width}^{1/8}(J_\ell)} \le 10^8 \sum_{j=0}^{j_0-1} \sum_{i : I^{(j)}_i \subset J_\ell} \frac{\mathrm{Discr}^2(I^{(j)}_i)}{\mathrm{width}^{1/8}(I^{(j)}_i)}. \quad (7)$$
Summing over $\ell$ would then yield $\|q - p\|_{[k]}^2$ on the left-hand side and a strict subset of the terms from Eq. (6) on the right-hand side. From here on, we will consider only a single interval $J_\ell$. For notational convenience, we will drop the subscript and merely call it $J$.

First, note that if $|J| \le 10^8$, then this follows easily from considering just the sum over $j = j_0 - 1$. Then, if $t = |J|$, $J$ is divided into $t$ intervals of size one. The sum of the discrepancies of these intervals equals the discrepancy of $J$, and thus, the sum of the squares of the discrepancies is at least $\mathrm{Discr}^2(J)/t$. Furthermore, the widths of these subintervals are all smaller than the width of $J$ by a factor of $t$. Thus, in this case, the sum on the right-hand side of Eq. (7) is at least $1/t^{7/8} \ge \frac{1}{10^7}$ times the left-hand side.

Otherwise, if $|J| > 10^8$, we can find a $j$ such that $\mathrm{width}(J)/10^8 < 1/(2^j \cdot k) \le 2 \cdot \mathrm{width}(J)/10^8$. We claim that in this case Eq. (7) holds even if we restrict the sum on the right-hand side to this value of $j$. Note that $J$ contains at most $10^8$ intervals of $\mathcal{I}^{(j)}$, and that it is covered by these intervals plus two narrower intervals on the ends. Call these end-intervals $R_1$ and $R_2$. We claim that $\mathrm{Discr}(R_i) \le \mathrm{Discr}(J)/3$. This is because otherwise it would be the case that
$$\frac{\mathrm{Discr}^2(R_i)}{\mathrm{width}^{1/8}(R_i)} > \frac{\mathrm{Discr}^2(J)}{\mathrm{width}^{1/8}(J)}$$
(since $(1/3)^2 \cdot (2/10^8)^{-1/8} > 1$). This is a contradiction, since it would mean that partitioning $J$ into $R_i$ and its complement would improve the sum defining $\|q - p\|_{[k]}$, which was assumed to be maximal.
This in turn implies that the sum of the discrepancies of the $I^{(j)}_i$ contained in $J$ must be at least $\mathrm{Discr}(J)/3$, so the sum of their squares is at least $\mathrm{Discr}^2(J)/(9 \cdot 10^8)$. On the other hand, each of these intervals is narrower than $J$ by a factor of at least $10^8/2$; thus the appropriate sum of $\frac{\mathrm{Discr}^2(I^{(j)}_i)}{\mathrm{width}^{1/8}(I^{(j)}_i)}$ is at least $\frac{\mathrm{Discr}^2(J)}{10^8 \cdot \mathrm{width}^{1/8}(J)}$. This completes the proof.

We are now ready to prove Lemma 7.

Proof. If $\|q - p\|_{\mathcal{A}_k} \ge \varepsilon$, we have by Lemma 9 that $\|q - p\|_{[k]}^2 \ge \frac{\varepsilon^2}{(2k)^{7/8}}$. By Proposition 10, this implies that
$$\frac{\varepsilon^2}{(2k)^{7/8}} \le 10^8 \sum_{j=0}^{j_0-1} \sum_{i=1}^{2^j \cdot k} \frac{\mathrm{Discr}^2(I^{(j)}_i)}{\mathrm{width}^{1/8}(I^{(j)}_i)} = 10^8 \sum_{j=0}^{j_0-1} (2^j k)^{1/8} \, \|q_r^{\mathcal{I}^{(j)}} - U_{\ell_j}\|_2^2.$$
Therefore,
$$\sum_{j=0}^{j_0-1} 2^{j/8} \, \|q_r^{\mathcal{I}^{(j)}} - U_{\ell_j}\|_2^2 \ge 5 \cdot 10^{-9} \cdot \varepsilon^2/k. \quad (8)$$
On the other hand, if $\|q_r^{\mathcal{I}^{(j)}} - U_{\ell_j}\|_2^2$ were at most $10^{-10} \cdot 2^{-j/4} \cdot \varepsilon^2/k$ for each $j$, then the sum above would be at most
$$10^{-10} \cdot \frac{\varepsilon^2}{k} \sum_j 2^{-j/8} < 5 \cdot 10^{-9} \cdot \varepsilon^2/k.$$
This would contradict Eq. (8), thus proving that $\|q_r^{\mathcal{I}^{(j)}} - U_{\ell_j}\|_2^2 \ge 10^{-10} \cdot 2^{-j/4} \cdot \varepsilon^2/k$ for at least one $j$, proving Lemma 7.

4 Conclusions and Future Work

In this work, we designed a computationally efficient algorithm for the problem of identity testing against a known distribution, which yields sample-optimal bounds for a wide range of natural and important classes of structured distributions. A natural direction for future work is to generalize our results to the problem of identity testing between two unknown structured distributions. What is the optimal sample complexity in this more general setting? We emphasize that new ideas are required for this problem, as the algorithm and analysis in this work crucially exploit the a priori knowledge of the explicit distribution.

References

[ADJ+11] J. Acharya, H. Das, A. Jafarpour, A.
Orlitsky, and S. Pan. Competitive closeness testing. Journal of Machine Learning Research - Proceedings Track, 19:47-68, 2011.

[Bat01] T. Batu. Testing Properties of Distributions. PhD thesis, Cornell University, 2001.

[BBBB72] R.E. Barlow, D.J. Bartholomew, J.M. Bremner, and H.D. Brunk. Statistical Inference under Order Restrictions. Wiley, New York, 1972.

[BDKR02] T. Batu, S. Dasgupta, R. Kumar, and R. Rubinfeld. The complexity of approximating entropy. In ACM Symposium on Theory of Computing, pages 678-687, 2002.

[BFF+01] T. Batu, E. Fischer, L. Fortnow, R. Kumar, R. Rubinfeld, and P. White. Testing random variables for independence and identity. In Proc. 42nd IEEE Symposium on Foundations of Computer Science, pages 442-451, 2001.

[BFR+00] T. Batu, L. Fortnow, R. Rubinfeld, W. D. Smith, and P. White. Testing that distributions are close. In IEEE Symposium on Foundations of Computer Science, pages 259-269, 2000.

[BFR+13] T. Batu, L. Fortnow, R. Rubinfeld, W. D. Smith, and P. White. Testing closeness of discrete distributions. J. ACM, 60(1):4, 2013.

[Bir87a] L. Birgé. Estimating a density under order restrictions: Nonasymptotic minimax risk. Annals of Statistics, 15(3):995-1012, 1987.

[Bir87b] L. Birgé. On the risk of histograms for estimating decreasing densities. Annals of Statistics, 15(3):1013-1022, 1987.

[BKR04] T. Batu, R. Kumar, and R. Rubinfeld. Sublinear algorithms for testing monotone and unimodal distributions. In ACM Symposium on Theory of Computing, pages 381-390, 2004.

[Bru58] H. D. Brunk. On the estimation of parameters restricted by inequalities. The Annals of Mathematical Statistics, 29(2):437-454, 1958.

[BRW09] F. Balabdaoui, K. Rufibach, and J. A. Wellner. Limit distribution theory for maximum likelihood estimation of a log-concave density. The Annals of Statistics, 37(3):
1299-1331, 2009.

[BS10] M. Belkin and K. Sinha. Polynomial learning of distribution families. In FOCS, pages 103-112, 2010.

[BW07] F. Balabdaoui and J. A. Wellner. Estimation of a k-monotone density: Limit distribution theory and the spline connection. The Annals of Statistics, 35(6):2536-2564, 2007.

[BW10] F. Balabdaoui and J. A. Wellner. Estimation of a k-monotone density: characterizations, consistency and minimax lower bounds. Statistica Neerlandica, 64(1):45-70, 2010.

[CDSS13] S. Chan, I. Diakonikolas, R. Servedio, and X. Sun. Learning mixtures of structured distributions over discrete domains. In SODA, pages 1380-1394, 2013.

[CDSS14] S. Chan, I. Diakonikolas, R. Servedio, and X. Sun. Efficient density estimation via piecewise polynomial approximation. In STOC, pages 604-613, 2014.

[CDVV14] S. Chan, I. Diakonikolas, P. Valiant, and G. Valiant. Optimal algorithms for testing closeness of discrete distributions. In SODA, pages 1193-1203, 2014.

[CT04] K.S. Chan and H. Tong. Testing for multimodality with dependent data. Biometrika, 91(1):113-123, 2004.

[DDO+13] C. Daskalakis, I. Diakonikolas, R. O'Donnell, R.A. Servedio, and L. Tan. Learning sums of independent integer random variables. In FOCS, pages 217-226, 2013.

[DDS12a] C. Daskalakis, I. Diakonikolas, and R.A. Servedio. Learning k-modal distributions via testing. In SODA, pages 1371-1385, 2012.

[DDS12b] C. Daskalakis, I. Diakonikolas, and R.A. Servedio. Learning Poisson binomial distributions. In STOC, pages 709-728, 2012.

[DDS+13] C. Daskalakis, I. Diakonikolas, R. Servedio, G. Valiant, and P. Valiant. Testing k-modal distributions: Optimal algorithms via reductions. In SODA, pages 1833-1852, 2013.

[DL01] L. Devroye and G. Lugosi. Combinatorial Methods in Density Estimation. Springer Series in Statistics, Springer, 2001.

[DR09] L. Dumbgen and K. Rufibach.
Maximum likelihood estimation of a log-concave density and its distribution function: Basic properties and uniform consistency. Bernoulli, 15(1):40-68, 2009.

[FOS05] J. Feldman, R. O'Donnell, and R. Servedio. Learning mixtures of product distributions over discrete domains. In Proc. 46th Symposium on Foundations of Computer Science (FOCS), pages 501-510, 2005.

[Fou97] A.-L. Fougères. Estimation de densités unimodales. Canadian Journal of Statistics, 25:375-387, 1997.

[GGR98] O. Goldreich, S. Goldwasser, and D. Ron. Property testing and its connection to learning and approximation. Journal of the ACM, 45:653-750, 1998.

[GR00] O. Goldreich and D. Ron. On testing expansion in bounded-degree graphs. Technical Report TR00-020, Electronic Colloquium on Computational Complexity, 2000.

[Gre56] U. Grenander. On the theory of mortality measurement. Skand. Aktuarietidskr., 39:125-153, 1956.

[Gro85] P. Groeneboom. Estimating a monotone density. In Proc. of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, pages 539-555, 1985.

[GW09] F. Gao and J. A. Wellner. On the rate of convergence of the maximum likelihood estimator of a k-monotone density. Science in China Series A: Mathematics, 52:1525-1538, 2009.

[Ham87] F. R. Hampel. Design, modelling, and analysis of some biological data sets. In Design, Data & Analysis, pages 93-128. John Wiley & Sons, Inc., New York, NY, USA, 1987.

[HP76] D. L. Hanson and G. Pledger. Consistency in concave regression. The Annals of Statistics, 4(6):1038-1050, 1976.

[ILR12] P. Indyk, R. Levi, and R. Rubinfeld. Approximating and testing k-histogram distributions in sub-linear time. In PODS, pages 15-22, 2012.

[JW09] H. K. Jankowski and J. A. Wellner. Estimation of a discrete monotone density. Electronic Journal of Statistics, 3:1567-1605, 2009.

[KM10] R. Koenker and I. Mizera.
Quasi-concave density estimation. Ann. Statist., 38(5):2998–3027, 2010.
[KMR+94] M. Kearns, Y. Mansour, D. Ron, R. Rubinfeld, R. Schapire, and L. Sellie. On the learnability of discrete distributions. In Proceedings of the 26th Symposium on Theory of Computing, pages 273–282, 1994.
[KMV10] A. T. Kalai, A. Moitra, and G. Valiant. Efficiently learning mixtures of two Gaussians. In STOC, pages 553–562, 2010.
[LRR11] R. Levi, D. Ron, and R. Rubinfeld. Testing properties of collections of distributions. In ICS, pages 179–194, 2011.
[MV10] A. Moitra and G. Valiant. Settling the polynomial learnability of mixtures of Gaussians. In FOCS, pages 93–102, 2010.
[NP33] J. Neyman and E. S. Pearson. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231(694-706):289–337, 1933.
[Pan08] L. Paninski. A coincidence-based test for uniformity given very sparsely-sampled discrete data. IEEE Transactions on Information Theory, 54:4750–4755, 2008.
[Pea00] K. Pearson. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine Series 5, 50(302):157–175, 1900.
[Rao69] B. L. S. Prakasa Rao. Estimation of a unimodal density. Sankhya Ser. A, 31:23–36, 1969.
[Reb05] L. Reboul. Estimation of a function under shape restrictions. Applications to reliability. Ann. Statist., 33(3):1330–1356, 2005.
[RS96] R. Rubinfeld and M. Sudan. Robust characterizations of polynomials with applications to program testing. SIAM J. on Comput., 25:252–271, 1996.
[Rub12] R. Rubinfeld. Taming big probability distributions. XRDS, 19(1):24–28, 2012.
[Val11] P. Valiant.
Testing symmetric properties of distributions. SIAM J. Comput., 40(6):1927–1968, 2011.
[VV11] G. Valiant and P. Valiant. Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In STOC, pages 685–694, 2011.
[VV14] G. Valiant and P. Valiant. An automatic inequality prover and instance optimal identity testing. In FOCS, 2014.
[Wal09] G. Walther. Inference and modeling with log-concave distributions. Statistical Science, 24(3):319–327, 2009.
[Weg70] E. J. Wegman. Maximum likelihood estimation of a unimodal density. I and II. Ann. Math. Statist., 41:457–471, 2169–2174, 1970.
[WWW+05] X. Wang, M. Woodroofe, M. Walker, M. Mateo, and E. Olszewski. Estimating dark matter distributions. The Astrophysical Journal, 626:145–158, 2005.

Appendix: Omitted Proofs

A  A Useful Primitive: Testing Uniformity in $L_2$ Norm

In this section, we give an algorithm for uniformity testing with respect to the $L_2$ distance, thereby establishing Theorem 4. The algorithm Test-Uniformity-$L_2(q, n, \varepsilon)$ described below draws $O(\sqrt{n}/\varepsilon^2)$ samples from a distribution $q$ over $[n]$ and distinguishes between the cases that $q = U_n$ versus $\|q - U_n\|_2 > \varepsilon/\sqrt{n}$ with probability at least $2/3$. Repeating the algorithm $O(\log(1/\delta))$ times and taking the majority answer results in a confidence probability of $1 - \delta$, giving the desired algorithm Test-Uniformity-$L_2(q, n, \varepsilon, \delta)$ of Theorem 4. Our estimator is a variant of Pearson's chi-squared test [Pea00], and can be viewed as a special case of the recent "chi-square type" testers in [CDVV14, VV14].
We remark that, as follows from the Cauchy-Schwarz inequality, the same estimator distinguishes the uniform distribution from any distribution $q$ such that $\|q - U_n\|_1 > \varepsilon$, i.e., algorithm Test-Uniformity-$L_2(q, n, \varepsilon)$ is an optimal uniformity tester for the $L_1$ norm. The $L_2$ guarantee we prove here is new, is strictly stronger than the aforementioned $L_1$ guarantee, and is crucial for our purposes in Section 3.

For $\lambda \ge 0$, we denote by $\mathrm{Poi}(\lambda)$ the Poisson distribution with parameter $\lambda$. In our algorithm below, we employ the standard "Poissonization" approach: namely, we assume that, rather than drawing $m$ independent samples from a distribution, we first select $m'$ from $\mathrm{Poi}(m)$, and then draw $m'$ samples. This Poissonization makes the number of times different elements occur in the sample independent, with the number of occurrences of the $i$-th domain element distributed as $\mathrm{Poi}(mq_i)$, simplifying the analysis. As $\mathrm{Poi}(m)$ is tightly concentrated about $m$, we can carry out this Poissonization trick without loss of generality at the expense of only sub-constant factors in the sample complexity.

Algorithm Test-Uniformity-$L_2(q, n, \varepsilon)$
Input: sample access to a distribution $q$ over $[n]$, and $\varepsilon > 0$.
Output: "YES" if $q = U_n$; "NO" if $\|q - U_n\|_2 \ge \varepsilon/\sqrt{n}$.
1. Draw $m' \sim \mathrm{Poi}(m)$ iid samples from $q$.
2. Let $X_i$ be the number of occurrences of the $i$-th domain element in the sample from $q$.
3. Define $Z = \sum_{i=1}^n (X_i - m/n)^2 - X_i$.
4. If $Z \ge 4m/\sqrt{n}$, return "NO"; otherwise, return "YES".

The following theorem characterizes the performance of the above estimator:

Theorem 11. For any distribution $q$ over $[n]$, the above algorithm distinguishes the case that $q = U_n$ from the case that $\|q - U_n\|_2 \ge \varepsilon/\sqrt{n}$, when given $O(\sqrt{n}/\varepsilon^2)$ samples from $q$, with probability at least $2/3$.

Proof. Define $Z_i = (X_i - m/n)^2 - X_i$.
Since $X_i$ is distributed as $\mathrm{Poi}(mq_i)$, we have $\mathbb{E}[X_i] = \mathrm{Var}[X_i] = mq_i$, hence $\mathbb{E}[Z_i] = \mathrm{Var}[X_i] + (mq_i - m/n)^2 - \mathbb{E}[X_i] = m^2\Delta_i^2$, where $\Delta_i := 1/n - q_i$. By linearity of expectation we can write $\mathbb{E}[Z] = \sum_{i=1}^n \mathbb{E}[Z_i] = m^2 \cdot \sum_{i=1}^n \Delta_i^2$. Similarly, we can calculate $\mathrm{Var}[Z_i] = 2m^2(\Delta_i - 1/n)^2 + 4m^3(1/n - \Delta_i)\Delta_i^2$. Since the $X_i$'s (and hence the $Z_i$'s) are independent, it follows that $\mathrm{Var}[Z] = \sum_{i=1}^n \mathrm{Var}[Z_i]$.

We start by establishing completeness. Suppose $q = U_n$. We will show that $\Pr[Z \ge 4m/\sqrt{n}] \le 1/3$. Note that in this case $\Delta_i = 0$ for all $i \in [n]$, hence $\mathbb{E}[Z] = 0$ and $\mathrm{Var}[Z] = 2m^2/n$. Chebyshev's inequality implies that
$$\Pr[Z \ge 4m/\sqrt{n}] = \Pr\left[Z \ge (2\sqrt{2})\sqrt{\mathrm{Var}[Z]}\,\right] \le 1/8 < 1/3,$$
as desired.

We now proceed to prove soundness of the tester. Suppose that $\|q - U_n\|_2 \ge \varepsilon/\sqrt{n}$. In this case we will show that $\Pr[Z \le 4m/\sqrt{n}] \le 1/3$. Note that Chebyshev's inequality implies that
$$\Pr\left[Z \le \mathbb{E}[Z] - 2\sqrt{\mathrm{Var}[Z]}\,\right] \le 1/4.$$
It thus suffices to show that $\mathbb{E}[Z] \ge 8m/\sqrt{n}$ and $\mathbb{E}[Z]^2 \ge 16\,\mathrm{Var}[Z]$. Establishing the former inequality is easy. Indeed, $\mathbb{E}[Z] = m^2 \cdot \|q - U_n\|_2^2 \ge m^2 \cdot (\varepsilon^2/n) \ge 8m/\sqrt{n}$ for $m \ge 8\sqrt{n}/\varepsilon^2$.

Proving the latter inequality requires a more detailed analysis. We will show that for a sufficiently large constant $C > 0$, if $m \ge C\sqrt{n}/\varepsilon^2$, we will have $\mathrm{Var}[Z] \ll \mathbb{E}[Z]^2$. Ignoring multiplicative constant factors, we equivalently need to show that
$$m^2 \cdot \left( \sum_{i=1}^n \left( \Delta_i^2 - 2\Delta_i/n \right) + 1/n \right) + m^3 \cdot \sum_{i=1}^n \left( \Delta_i^2/n + \Delta_i^3 \right) \ll m^4 \cdot \left( \sum_{i=1}^n \Delta_i^2 \right)^2.$$
To prove the desired inequality, it suffices to bound from above the absolute value of each of the five terms on the LHS separately. For the first term we need to show that $m^2 \cdot \sum_{i=1}^n \Delta_i^2 \ll m^4 \cdot \left( \sum_{i=1}^n \Delta_i^2 \right)^2$, or equivalently
$$m \gg 1/\|q - U_n\|_2. \quad (9)$$
Since $\|q - U_n\|_2 \ge \varepsilon/\sqrt{n}$, the RHS of (9) is bounded from above by $\sqrt{n}/\varepsilon$, hence (9) holds true for our choice of $m$.
For the second term we want to show that $\sum_{i=1}^n |\Delta_i| \ll m^2 n \cdot \left( \sum_{i=1}^n \Delta_i^2 \right)^2$. Recalling that $\sum_{i=1}^n |\Delta_i| \le \sqrt{n} \cdot \sqrt{\sum_{i=1}^n \Delta_i^2}$, as follows from the Cauchy-Schwarz inequality, it suffices to show that $m^2 \gg (1/\sqrt{n}) \cdot 1/\left( \sum_{i=1}^n \Delta_i^2 \right)^{3/2}$, or equivalently
$$m \gg \frac{1}{n^{1/4}} \cdot \frac{1}{\|q - U_n\|_2^{3/2}}. \quad (10)$$
Since $\|q - U_n\|_2 \ge \varepsilon/\sqrt{n}$, the RHS of (10) is bounded from above by $\sqrt{n}/\varepsilon^{3/2}$, hence (10) is also satisfied.

For the third term we want to argue that $m^2/n \ll m^4 \cdot \left( \sum_{i=1}^n \Delta_i^2 \right)^2$, or
$$m \gg \frac{1}{n^{1/2}} \cdot \frac{1}{\|q - U_n\|_2^2}, \quad (11)$$
which holds for our choice of $m$, since the RHS is bounded from above by $\sqrt{n}/\varepsilon^2$.

Bounding the fourth term amounts to showing that $(m^3/n) \sum_{i=1}^n \Delta_i^2 \ll m^4 \left( \sum_{i=1}^n \Delta_i^2 \right)^2$, which can be rewritten as
$$m \gg \frac{1}{n} \cdot \frac{1}{\|q - U_n\|_2^2}, \quad (12)$$
and is satisfied since the RHS is at most $1/\varepsilon^2$.

Finally, for the fifth term we want to prove that $m^3 \cdot \sum_{i=1}^n |\Delta_i|^3 \ll m^4 \cdot \left( \sum_{i=1}^n \Delta_i^2 \right)^2$, or that $\sum_{i=1}^n |\Delta_i|^3 \ll m \cdot \left( \sum_{i=1}^n \Delta_i^2 \right)^2$. From Jensen's inequality it follows that $\sum_{i=1}^n |\Delta_i|^3 \le \left( \sum_{i=1}^n |\Delta_i|^2 \right)^{3/2}$; hence, it is sufficient to show that $\left( \sum_{i=1}^n |\Delta_i|^2 \right)^{3/2} \ll m \cdot \left( \sum_{i=1}^n \Delta_i^2 \right)^2$, or
$$m \gg 1/\|q - U_n\|_2. \quad (13)$$
Since $\|q - U_n\|_2 \ge \varepsilon/\sqrt{n}$, the above RHS is at most $\sqrt{n}/\varepsilon$, and (13) is satisfied. This completes the soundness proof and the proof of Theorem 11.