Telling Two Distributions Apart: a Tight Characterization
Authors: Eyal Even-Dar and Mark Sandler
Eyal Even-Dar⋆ and Mark Sandler⋆⋆

Abstract. We consider the problem of distinguishing between two arbitrary black-box distributions defined over the domain [n], given access to s samples from both. It is known that in the worst case Õ(n^{2/3}) samples are both necessary and sufficient, provided that the distributions have L1 difference of at least ε. However, it is also known that in many cases fewer samples suffice. We identify a new parameter that precisely controls the number of samples needed, and present an algorithm whose sample complexity depends only on this parameter and is independent of the domain size. For a large subclass of distributions we also provide a lower bound that matches our algorithm up to a polylogarithmic factor.

1 Introduction

One of the most fundamental challenges facing modern data analysis is to understand and infer hidden properties of the data being observed. The property testing framework [9,6,2,1,3,4] has recently emerged as one tool to test, with only a few queries, whether a given data set has a certain property with high probability. One problem that commonly arises in applications is to test whether several sources of random samples follow the same distribution or are far apart, and this is the problem we study in this paper. More specifically, suppose we are given a black box that generates independent samples of a distribution P over [n], a black box that generates independent samples of a distribution Q over [n], and finally a black box that generates independent samples of a distribution T, which is either P or Q. How many samples do we need to decide whether T is identical to P or to Q? This problem arises regularly in change detection [5], testing whether a Markov chain is rapidly mixing [2], and other contexts.
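As a concrete illustration of this sampling model, the three black boxes can be mocked in a few lines. This is a hypothetical sketch: the distributions, seeds, and helper names below are ours, not the paper's.

```python
import random

def make_blackbox(dist, rng):
    """Return a zero-argument sampler: one independent draw from `dist`,
    a dict mapping elements of [n] to probabilities."""
    elems = list(dist)
    weights = [dist[e] for e in elems]
    return lambda: rng.choices(elems, weights)[0]

rng = random.Random(0)
n = 6
P = {i: 1.0 / n for i in range(n)}                           # uniform on [n]
Q = {i: (2.0 / n if i < n // 2 else 0.0) for i in range(n)}  # uniform on the first half

sample_P = make_blackbox(P, rng)
sample_Q = make_blackbox(Q, rng)
sample_T = make_blackbox(P, rng)   # hidden coin: here T happens to be P

# A tester interacts only through these three samplers, never the tables.
draws = [sample_T() for _ in range(10)]
print(draws)
```

The tester's entire view of the world is the sequence of draws; everything that follows is about how few such draws are needed.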
Our Contribution. Our results generalize those of Batu et al. [2], who showed that there exists a pair of distributions P, Q on domain [n] with a large statistical difference ||P − Q||_1 ≥ 1/2, such that no algorithm can tell apart the case (P, P) from (P, Q) with o(n^{2/3}) samples. They also provided an algorithm that nearly matches the lower bound for a specific pair of distributions. In the present paper, instead of analyzing the "hardest" pair of distributions, we characterize the property that controls the number of samples one needs to tell a pair of distributions apart. This characterization allows us to factor out the dependency on the domain size. Namely, for every two distributions P and Q that satisfy certain technical properties that we describe below, we establish both an algorithm and a lower bound, such that the number of samples both necessary and sufficient is Θ(||P + Q||_2 / ||P − Q||_2^2) up to polylogarithmic factors. For the lower bound example of [2] this quantity amounts to n^{2/3}, and the distributions in [2] satisfy the needed technical properties; thus our results generalize upon their result.

⋆ evendar.eyal@gmail.com, Final Inc; this research was performed while the author was at Google Research, NY.
⋆⋆ sandler@google.com, Google Research, NY.

From a practical perspective such a characterization is important because high-level properties of the distributions may be learned empirically (for instance, it might be known that the potential class of distributions is a power law), and our results then allow one to significantly reduce the number of samples needed for testing. In many respects, our results complement those of Valiant [10], where it was shown that for testing symmetric and continuous properties it is both necessary and sufficient to consider only the high frequency elements.
In contrast, we show that for our problem the low frequency elements provide all the necessary information. This is quite surprising, given that low frequency elements carry no useful information for continuous properties. For the computability part, [10] introduces the canonical tester: an exponential-time algorithm that finds all feasible underlying inputs that could have produced the observed output. If all checked inputs have a consistent property value, it reports that value, and otherwise it gives a random answer. In contrast, our algorithm guarantees that for any underlying pair of distributions, after observing the sample it will be correct with high probability, even though there might be a valid input that would cause the algorithm to fail. Finally, we develop a new technique that allows a tight concentration-bound analysis of a heterogeneous balls-and-bins problem, which might be of independent interest.

Paper Overview. In the next section we describe our problem in more detail, connect it to the closeness problem studied in earlier work, and state our results. Section 4 proves useful concentration bounds and introduces the main technique that we use in the algorithm analysis. Section 5 provides the algorithm and its analysis, and finally in Section 6 we prove our lower bounds.

2 Problem Formulation

We consider arbitrary distributions over the domain [n]. We assume that the only way of interacting with a distribution is through a black-box sampling mechanism. The main problem we consider is as follows:

Problem 1 (Distinguishability problem). Given a "training phase" of s samples from X and s samples from Y, and a "testing phase" of a sample of size s from either X or Y, output whether the first or the second distribution generated the testing sample.
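A minimal sketch of the training/testing protocol of Problem 1, with a naive empirical-distance rule standing in for a real distinguisher. The rule and all names here are our illustration, not the paper's algorithm:

```python
import random
from collections import Counter

def naive_distinguish(train_x, train_y, test):
    """Toy rule: compare the empirical L1 distance of the test sample's
    histogram to each training histogram, and pick the closer one."""
    cx, cy, ct = Counter(train_x), Counter(train_y), Counter(test)
    keys = set(cx) | set(cy) | set(ct)
    dx = sum(abs(ct[k] - cx[k]) for k in keys)
    dy = sum(abs(ct[k] - cy[k]) for k in keys)
    return "first" if dx <= dy else "second"

rng = random.Random(1)
s, n = 200, 10
draw_X = lambda: rng.randrange(n)        # uniform on [n]
draw_Y = lambda: rng.randrange(n // 2)   # uniform on the first half

train_x = [draw_X() for _ in range(s)]   # training phase
train_y = [draw_Y() for _ in range(s)]
test = [draw_X() for _ in range(s)]      # testing phase: T = X here
print(naive_distinguish(train_x, train_y, test))
```

For such well-separated distributions the naive rule succeeds, but its sample cost grows with the domain size; the point of the paper is a rule whose cost depends only on norm quantities of P and Q.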
We say that an algorithm solves the distinguishability problem for a class of distribution pairs P, with s samples, if for any (X, Y) ∈ P it outputs the correct answer with probability at least 1 − polylog(s)/s². Further, if X and Y are identical, it outputs "first" or "second" each with probability at least 0.5 − polylog(s)/s². We show that the distinguishability problem is equivalent to the following problem studied in [2,10]:

Problem 2 (Closeness problem [2]). Given s samples from X and s samples from Y, decide whether X and Y are almost identical or far apart, with acceptable error at most polylog(s)/s². An algorithm solves the closeness problem for a class of distribution pairs P if for every pair (X, Y) ∈ P it outputs "different" with probability at least 1 − polylog(s)/s², and for every input of the form (X, X) it outputs "same" with probability at least 1 − polylog(s)/s².

Our first observation is that if either of the problems can be solved for a certain class of distribution pairs, then the other can be solved with at most logarithmically more samples and time. The following lemma formalizes this statement; due to space limitations the proof is deferred to the appendix.

Lemma 1. If there is an algorithm that solves the distinguishability problem for a class of distribution pairs P with s samples, then there is an algorithm that solves the closeness problem for the class P with at most O(s log s) samples. If there is an algorithm that solves the closeness problem for a class P with at most s samples, then there is an algorithm that solves the distinguishability problem with at most s samples.

3 Results

Our algorithmic results can be described as follows:

Theorem 1.
Consider a class of distribution pairs such that ||P − Q||_2^2 / ||P + Q||_2 ≥ α, let s = 60(|log α|)^{7/2}/α, and suppose each p_i and q_i is at most 1/(2s). Then Algorithm 3 produces the correct answer with probability at least 1 − c/s², where c is some universal constant.

Essentially, the theorem above states that ||P + Q||_2 / ||P − Q||_2^2 controls the distinguishability of a distribution pair. There are several interesting cases: if ||P − Q||_2 is comparable to either ||P||_2 or ||Q||_2, then the result says that s ≈ 1/||P − Q||_2 suffices. This generalizes the observation from [2] that if the L2 difference is large then a constant number of samples suffices. We also note that the condition p_i ≤ 1/(2s) is a technical condition guaranteeing that any fixed element is expected to appear at most 1/2 times; in other words, no element can be expected to be seen in the sample with high probability. This requirement is particularly striking, given that the results of [10] say that elements with low expectation are provably not useful when testing continuous properties. Further exploiting this contrast is a very interesting open direction.

For the lower bound part, our results apply to a special class of distributions that we call weakly disjoint distributions.

Definition 1. Distributions P and Q are weakly disjoint if every element x satisfies one of the following: (1) p_x = q_x, (2) p_x > 0 and q_x = 0, (3) q_x > 0 and p_x = 0. We denote the set of elements such that p_x = q_x by C(P, Q); the rest are denoted by D(P, Q).

It is worth noting that the known worst-case examples of [2] belong to this class. We conjecture that weakly disjoint distributions represent the hardest case for the lower bound, that all other distributions need fewer samples, and that the result above generalizes to all distributions.
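When P and Q are known, the controlling quantity of Theorem 1 is easy to evaluate. The sketch below (helper names are ours; we assume the natural logarithm in |log α|) computes α and the resulting sample bound for a toy pair:

```python
import math

def alpha_param(P, Q):
    """alpha = ||P - Q||_2^2 / ||P + Q||_2, the parameter of Theorem 1."""
    diff_sq = sum((p - q) ** 2 for p, q in zip(P, Q))
    plus_norm = math.sqrt(sum((p + q) ** 2 for p, q in zip(P, Q)))
    return diff_sq / plus_norm

def sample_bound(alpha):
    """s = 60 (|log alpha|)^{7/2} / alpha; the log base is our assumption."""
    return 60 * abs(math.log(alpha)) ** 3.5 / alpha

n = 8
P = [1.0 / n] * n                             # uniform on [n]
Q = [2.0 / n] * (n // 2) + [0.0] * (n // 2)   # uniform on the first half
a = alpha_param(P, Q)
print(a, sample_bound(a))  # a = 0.125 / sqrt(0.625), roughly 0.158
```

Note that the bound is entirely norm-based: nothing in it refers to the domain size n beyond its effect on the norms themselves.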
For the rest of the paper we concentrate on the distinguishability problem, but via Lemma 1 the results immediately apply to the closeness problem. We now formulate our lower bound results:

Theorem 2. Let P and Q be weakly disjoint distributions, and let s ≤ min{0.25/||P − Q||_3, 1/||P||_∞, 1/||Q||_∞, (1/c)·||P + Q||_2/||P − Q||_2^2}, where c is some universal constant. No algorithm can solve the distinguishability problem for the class of distribution pairs defined by arbitrary permutations of (P, Q), i.e. P = {(πP, πQ)}.

The first, second and third constraints in the min expression above are technical assumptions. In fact, for many distributions, including the worst-case scenario of [2], ||P + Q||_2/||P − Q||_2^2 ≪ min{1/||P − Q||_3, 1/||P||_∞, 1/||Q||_∞}, and hence those constraints can be dropped.

4 Preliminaries

4.1 Distinguishing between sequences of Bernoulli random variables

Consider two sequences of Bernoulli random variables x_i and y_i. In this section we characterize when the sign of the observation Σx_i − Σy_i can be expected to agree with the sign of E[Σx_i − Σy_i] with high probability. We defer the proof to the appendix.

Lemma 2. Suppose {x_i} and {y_i} are sequences of Bernoulli random variables such that E[Σx_i] = α and E[Σy_i] = β, where α < β. Then

Pr[Σx_i > Σy_i] < 2 exp[−(α − β)²/(8(α + β))].

4.2 Weight Concentration Results for Heterogeneous Balls and Bins

Consider a sample S = {s_1, s_2, ..., s_s} of size s from a fixed distribution P over the domain [n]. Let the sample configuration be {c_1, ..., c_n}, where c_i is the number of times element i was selected. A standard interpretation is that the sample represents s balls being dropped into n bins, where P describes the probability of a ball going into each bin.
The sample configuration is the final distribution of balls in the bins. Note that Σ_j c_j = s. We show tight concentration for the quantity Σ_{i=1}^n α_i c_i, for non-negative α_i.¹ Note that c_i and c_j are correlated, and therefore the terms in the sum are not independent. An immediate approach would be to consider the contribution of the i-th sample to the sum. It is easy to see that this contribution is bounded by max_i α_i, and thus one can apply McDiarmid's inequality [8]; however, the resulting bound would be too weak, since we would have to apply a uniform upper bound. In what follows we call the sampling procedure where each sample is selected independently from a distribution over the domain [n] type I sampling. Now we consider a different kind of sampling, which we call type II.

Definition 2 (Type II sampling). For each i in [n] we flip a p_i-biased coin s times, and select the corresponding element every time we get heads.

We show that for almost any sample selection of type I, the corresponding configuration would have similar weight under type II sampling; the total weight of all type I configurations not satisfying this constraint is o(1/s). Once we show that, any concentration bound for type II translates to a corresponding concentration bound for type I. We use P^(I)[·] and P^(II)[·] to denote probability according to type I and type II sampling. Due to space limitations all the proofs are deferred to the appendix. We now show the lower and upper bound connections between type I and type II sampling.

Lemma 3. For every configuration C = {i_1, ..., i_n} such that i_j ≤ ln s and Σ_j i_j = s′, where s′ ∈ [s ± √s],

(2/(3√s)) P^(II)[C] ≤ P^(I)[C] ≤ 30 s^{3/2} P^(II)[C],
where P^(I)[C] is the probability of observing C in type I sampling with s′ elements, and P^(II)[C] is the probability of observing C in type II sampling with s elements.

The following lemma bounds the total probability of configurations in which some element appears more than ln s times. The proof is deferred to the appendix.

Lemma 4. In type I sampling, the probability that there exists an element that was sampled more than ln s times is at most 1/s^{ln ln s}.

¹ In fact the technique applies to arbitrary bounded functions.

Now we formalize the translation between concentration bounds for type I and type II samplings.

Lemma 5. Suppose we sample from distribution P, s times, using type I and type II sampling, resulting in configurations C and C′ respectively. Let A = {α_i} be an arbitrary vector with non-negative elements, let r ≥ 0, and let W = Σ_{i=1}^n α_i c_i and W′ = Σ_{i=1}^n α_i c′_i. Then

P^(I)[|W − E[W]| > r] ≤ 30 s^{3/2} P^(II)[|W′ − E[W′]| > r] + 1/s^{ln ln s}.

Lemma 6. Consider s samples selected using type I sampling from the distribution P = {p_i}, where s ≥ 10, and let A = {α_i} be an arbitrary vector. Let W = Σ_{i=1}^n α_i c_i. Then

Pr[|W − E[W]| ≥ 2(ln s)^{3/2} ||A||_2] ≤ 1/s².

5 Algorithm and Analysis

At a high level our algorithm works as follows. First, we check whether the 2-norms of the distributions P and Q are sufficiently far from each other. If this is the case, then we can decide simply by looking at the estimates of the 2-norms of P, Q and T. On the other hand, if ||P||_2 ≈ ||Q||_2, then we show that counting the number of collisions of a sample from T with P and with Q, and then choosing the one with the higher number of collisions, gives the correct answer with high probability. Algorithm 2 provides a full description of the 2-norm estimation.
The idea is to estimate the probability mass of a sample S by computing the number of collisions of fresh samples with S, and then noting that the expected mass of a sample of size l is l||P||_2^2. One important caveat is that if S contains a particular element more than once, we need to compute the collisions carefully so as to keep the probability of a collision at l||P||_2^2; to achieve that, we split our sampling into max_i c_i phases. During phase i we only count collisions with elements that have occurred at least i times. For more details we refer to Algorithm 1, which is used as a subroutine both for Algorithm 2 and for the main Algorithm 3. For the analysis we mainly use the technique developed in Section 4.

Lemma 7. Algorithm 1 outputs a set S such that E[|S|] = Σ_{i=1}^n c_i p_i.

Proof. Let w_i be the contribution of s_i: E[w_i] = Σ_{j=1}^n p_j 1_{c_j ≥ i}. Summing over all elements of S and using linearity of expectation, we have E[|S|] = Σ_{i=1}^m E[w_i] = Σ_{i=1}^m Σ_{j=1}^n p_j 1_{c_j ≥ i} = Σ_{i=1}^n p_i c_i, where the last equality follows from changing the order of summation.

Lemma 8. The total number of elements selected after l iterations of step 3 is a sum of independent Bernoulli random variables, and its expectation is lW.

Algorithm 1 (Sampling according to a given pattern)
Input: configuration {c_1, ..., c_n}, where m ≥ c_i ≥ 0; distribution P.
Output: multiset of elements S such that E[|S|] = Σ_{i=1}^n c_i p_i.
Description:
1. Sample m elements from P: s_1, ..., s_m.
2. For each s_i, if c_{s_i} ≥ i, include s_i in the set S.

Algorithm 2 (Computing the 2-norm of distribution P)
Input: distribution P, accuracy parameter l.
Output: approximate value of ||P||_2^2.
Description:
1. Select l samples from P; let c_1, ..., c_n be the configuration. Note that the expected weight W of the configuration is l||P||_2^2.
2. If max_i c_i ≥ log l, report failure.
3. Sample with repetition using Algorithm 1, l times. Let r_i be the respective set size from the i-th simulation. Note that E[r_i] = W.
4. Report Σ_{i=1}^l r_i / l² as the approximate value.

Proof (of Lemma 8). Indeed, the expected number of selections for every invocation of Algorithm 1 is W, and each invocation is in itself a sum of m ≤ log s Bernoulli variables. Thus we have l·log s Bernoulli random variables with a total expected weight of lW.

Furthermore, the total number of samples used by the algorithm is bounded.

Lemma 9. The total number of samples used by Algorithm 2 is at most 2l log l.

Now we are ready to prove the main property of Algorithm 2.

Lemma 10 (Concentration results for 2-norm estimation). Suppose Algorithm 2 is run for distributions P and T with parameter l > 10. If P = T, then the estimate for ||P||_2 is greater than the estimate for ||T||_2 with probability 1/2. If s(||T||_2^2 − ||P||_2^2) ≥ 4(ln l)^{3/2}(||P||_2 + ||T||_2), then the estimate for ||P||_2 is smaller than the estimate for ||T||_2 with probability at least 1 − c/l², where c is some universal constant.

Proof. The first part is due to symmetry. For the second part, we use Lemma 6 to note that W_T ≤ s||T||_2^2 − 3(ln l)^{3/2}||T||_2 or W_P ≥ s||P||_2^2 + 3(ln l)^{3/2}||P||_2 holds with probability at most 2/l². Therefore with probability at least 1 − 2/l² we have |W_T − W_P| ≥ (ln l)^{3/2}(||P||_2 + ||T||_2). Using Lemmas 8 and 2 we have:

Pr[W̃_T ≤ W̃_P] ≤ o(1/l²) + 2 exp[−(W_T − W_P)²/(8(W_T + W_P))] ≤ 1/(2l²) + 2 exp[−(ln³ l)(||P||_2 + ||T||_2)²/(8(||T||_2^2 + ||P||_2^2))] ≤ 1/l²,

Bringing it all together: main algorithm and analysis

Algorithm 3 (Distinguishing between two distributions)
Input: black boxes providing samples from P, Q and T.
Estimate s = ||P + Q||_2/||P − Q||_2^2.
Output: "P" if T is P, and "Q" otherwise.
Description:
1. Compute the L2 norms of P, Q and T using Algorithm 2 with accuracy parameter l = 30 s (ln s)^{3/2}. Repeat log s times. Let P̃_i, Q̃_i and T̃_i denote the estimated norms in the i-th iteration.
2. If T̃_i ≥ P̃_i for all i, or T̃_i ≤ P̃_i for all i, then report "Q" and quit.
3. If T̃_i ≥ Q̃_i for all i, or T̃_i ≤ Q̃_i for all i, then report "P" and quit.
4. Else:
(a) Training phase: sample l = 30(ln s)^{3/2} s elements from distributions P and Q. Let C_P and C_Q denote the configurations of elements that were selected.
(b) Testing phase: use Algorithm 1 to sample with repetition s times, on both C_P and C_Q, using a fresh sample each time. Let c_P and c_Q denote the total size of the Algorithm 1 output for C_P and C_Q respectively.
(c) If c_P > c_Q report "P", otherwise report "Q".

where we have used exp(−ln³ l) < 1/(2l²) for l > 10.

Let W_P and W_Q denote the total probability mass under T of the samples selected from P and Q. In other words, W_P = Σ_i t_{s_P(i)}, where s_P(i) is the i-th sample from P and t_x is the mass of element x under T. Now, consider H_1 (i.e. T = P). We have E[W_P] = s Σ_{i=1}^n p_i² and E[W_Q] = s Σ_{i=1}^n p_i q_i, where the i-th term is the expected contribution of the i-th element of the distribution to the total sum for a single selection. Therefore we have

E[W_P − W_Q] = s Σ_{i=1}^n p_i (p_i − q_i).

Similarly, under hypothesis H_2 we have

E[W_Q − W_P] = s Σ_{i=1}^n q_i (q_i − p_i).

The rest of the analysis proceeds as follows. We first show that if the algorithm passes the first stage, then with high probability |(||P||_2^2 − ||Q||_2^2)| ≤ 4(ln s)^{3/2}||P + Q||_2, in which case the majority voting on collision counts gives the desired result. The following lemma is almost immediate from Lemma 10.

Lemma 11 (Correctness of the case when ||P||_2 is not close to ||Q||_2).
If s|Σ_{i=1}^n (p_i² − q_i²)| ≥ (ln l)^{3/2}(||P||_2 + ||Q||_2), then the algorithm terminates on or before step 3 of Algorithm 3 with probability at least 1 − c/s². Further, the probability of an incorrect answer if it terminates is at most c/s², for some constant c.

Proof. Without loss of generality we can assume T = P. Using the result of Lemma 10, the probability that P̃_i < T̃_i for all i (or P̃_i > T̃_i for all i) is at most (1/2)^{2 log s} ≤ 1/s². Thus the probability of an incorrect answer is at most 1/s². Second, if the condition of the lemma is satisfied, then from Lemma 10 and the union bound the probability of inconsistent measurements is at most log l/l² ≤ 1/s².

Given the lemma above, if the algorithm passed stage 1, then

−4(ln l)^{3/2}(||P||_2 + ||Q||_2) ≤ l(Σ p_i² − Σ q_i²) ≤ 4(ln l)^{3/2}(||P||_2 + ||Q||_2).

Therefore under hypothesis H_1:

E[W_P − W_Q] = l(Σ_{i=1}^n p_i² − Σ_{i=1}^n q_i p_i) ≥ l ||P − Q||_2²/2 − 4(ln l)^{3/2}(||P||_2 + ||Q||_2).

We have ln l = ln(s ln^{3/2} s) ≤ ln ln s + ln s + 3 ≤ 1.5 log s, and thus l ≥ 20(ln l)^{3/2} ||P + Q||_2/||P − Q||_2². Substituting, we have:

E[W_P − W_Q] ≥ l ||P − Q||_2²/2 − (4/20) l ||P − Q||_2² ≥ 6(ln l)^{3/2}(||P||_2 + ||Q||_2).

Applying Lemma 6 with weight function P, the probability that either W_P or W_Q deviates from its expectation by more than 2(ln l)^{3/2}||P||_2 is at most c/s² for a fixed constant c. Thus

Pr[W_P − W_Q ≤ (ln l)^{3/2}(||P||_2 + ||Q||_2)] ≤ c/(2s²).   (1)

Therefore W_P and W_Q are with high probability far apart, and can be distinguished using the bounds from Lemma 2.
Indeed, the total expected numbers of hits for C_P and C_Q are sW_P and sW_Q respectively; thus the probability that the total number of hits for C_P is smaller than for C_Q under H_1 is at most:

Pr[c_P < c_Q] ≤ exp[−s(W_P − W_Q)²/(W_P + W_Q)] ≤ exp[−s ln³ l ||P + Q||_2² / (l(||P||_2² + ||Q||_2²))] ≤ 1/l².   (2)

Combining (1) and (2), we have that for some universal constant c, with probability at least 1 − c/s², C_P will receive more hits than C_Q, and symmetrically under H_2, C_Q will receive more hits than C_P. Thus the algorithm produces the correct answer with high probability, and the proof of Theorem 1 is immediate.

6 Lower Bounds for Weakly Disjoint Distributions

In this section we prove Theorem 2. First we observe that for any fixed pair of distributions there is in fact an algorithm that can differentiate between them with far fewer samples than our lower bound theorem dictates.³ Thus, the main challenge is to prove that there is no universal algorithm that can differentiate between arbitrary pairs of distributions. Here, we show that even for the simpler problem where the pair of distributions is fixed and a random permutation π is applied to both of them, there is no algorithm that can differentiate between πP and πQ. Since this problem is simpler than the original problem (we know the distribution shape), the lower bound applies to the original problem.

Problem 3 (Differentiating between two known distributions with an unknown permutation). Suppose P and Q are two known distributions on domain D. Solve the distinguishability problem on the class of distributions defined by (πP, πQ), for all permutations π.
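The permutation structure in Problem 3 can be made concrete: relabeling the domain by any π changes which elements appear, but not the multiset of per-element count triples, which is therefore all an algorithm can rely on. A small self-contained check (helper names and the toy samples are ours):

```python
import random
from collections import Counter

def count_fingerprint(sp, sq, st, domain):
    """Multiset of per-element count triples (i, j, k): how often each
    element appears in the P-sample, Q-sample and test sample."""
    cp, cq, ct = Counter(sp), Counter(sq), Counter(st)
    return Counter((cp[x], cq[x], ct[x]) for x in domain)

rng = random.Random(2)
domain = list(range(20))
sp = [rng.choice(domain) for _ in range(15)]
sq = [rng.choice(domain) for _ in range(15)]
st = [rng.choice(domain) for _ in range(15)]

pi = domain[:]                 # a random relabeling of the domain
rng.shuffle(pi)
relabel = dict(zip(domain, pi))

fp1 = count_fingerprint(sp, sq, st, domain)
fp2 = count_fingerprint([relabel[x] for x in sp],
                        [relabel[x] for x in sq],
                        [relabel[x] for x in st], domain)
assert fp1 == fp2   # relabeling leaves the fingerprint unchanged
```

Any statistic that is not a function of this fingerprint would let the adversary's choice of π defeat the algorithm.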
In Problem 3 the algorithm needs to solve the problem for every π; thus, if such an algorithm exists, it must be able to solve the distinguishability problem with π chosen at random, and from the perspective of the algorithm, elements that were chosen the same number of times are equivalent. Thus the only information the algorithm can differentiate upon is the counts of how often each element appeared in the different phases. More specifically, we will use the notation |(i, j, k)| to denote the number of elements that were chosen i times while sampling from P, j times while sampling from Q in the training phase, and k times during the testing phase. We also use the notation |(i, j, ∗)| to denote the total number of elements that were selected i and j times during the training phase and an arbitrary number of times during the testing phase. Finally, |(i, j, +)| denotes the number of such elements that were selected at least once during the testing phase. In what follows we use H_1 and H_2 to denote the two possible hypotheses, namely that the testing distribution is in fact P or Q respectively.

To prove the theorem we simplify the problem further by disclosing some additional information about the distributions; this allows us to show that some data in the input possesses no information and can be ignored. Specifically, we rely on the following variant of the problem:

Problem 4 (Differentiating with a hint). Suppose P and Q are two known distributions on D, and π is an unknown permutation on D. For each element that satisfies one of the following conditions, it is revealed to the algorithm whether the element belongs to the common or the disjoint set: (a) it was selected at least once while sampling from T and at least twice while sampling from P or Q; (b) it was selected at least twice while sampling from P or Q, and belongs to C(πP, πQ). The set of all elements whose probabilities are thus known is called the hint.
Given the hint, differentiate between πP and πQ.

³ For instance, for distinguishing between two uniform distributions on half the domain, a constant number of samples is sufficient.

Note that Problem 4 is immediately reducible to Problem 3; thus a lower bound for Problem 4 immediately implies a lower bound for Problem 3. If an element from the disjoint part of P and Q has its identity revealed, then the algorithm can immediately output the correct answer. We call such elements helpful. We call the other revealed elements unhelpful; note that the set of unhelpful elements is fully determined by the training phase (these are the elements that were selected at least twice during the training phase and belong to the common set). First, we prove a bound on the probability of observing helpful elements. Later we show that knowledge of the unhelpful elements does not reveal much to the algorithm.

Lemma 12. If the number of samples s ≤ 0.25/||P − Q||_3, then the probability that there is one or more helpful elements is at most 1/20.

Proof. For every element with probability of selection p during the testing phase, the probability that it becomes helpful is at most 2·C(s,3)·p³ < 8s³p³/6 ≤ 1.5 s³p³ if it belongs to the disjoint section, and 0 otherwise. Therefore the total expected number of helpful elements is at most 1.5 s³||P − Q||_3³. Using Markov's inequality we immediately have that the probability of observing one or more helpful elements is at most 1.5 s³||P − Q||_3³ ≤ 1/20, as desired.

Since the probability of observing a helpful element is bounded by 1/20, it suffices to prove that any algorithm that does not observe any helpful elements is correct with probability at most 1 − Ω(1). The next step is to show that disclosed elements from the common part are ignored by the optimal algorithm.

Lemma 13. The optimal algorithm does not depend on the set of unhelpful elements.

Proof.
More formally, let C denote the testing configuration, and let Y denote the unhelpful part of the hint. Let A(C, Y) be the optimal algorithm, which takes these as input and outputs H_1 or H_2. Suppose there exist Y′ and Y′′ such that A(C, Y′) ≠ A(C, Y′′), and without loss of generality assume A(C, Y′) = H_1 and A(C, Y′′) = H_2. Of all optimal algorithms we choose the one that minimizes the number of triples (C, Y′, Y′′) that satisfy this property. Without loss of generality Pr[C | H_1] ≥ Pr[C | H_2]; let A_1 be the modification of A such that A_1(C, Y′′) = H_1. But then the total probability of error of A_1 decreases by Pr[C | H_1]·Pr[Y′′] − Pr[C | H_2]·Pr[Y′′] ≥ 0, so the error of A_1 is no larger than that of A, a contradiction with either the optimality or the minimality of A.

So far we have shown that helpful elements terminate the algorithm, and unhelpful ones do not affect the outcome. Therefore the only signatures (i, j, k) that the algorithm has knowledge of belong to the following set: {(0,0,0), (0,0,1), (0,1,0), (1,0,0), (0,1,1), (1,0,1), (0,0,2), (2,0,0), (0,2,0)}. Consider the random variables |(0,1,∗)|, |(1,0,∗)|, |(0,0,∗)|, |(0,2,∗)|, |(2,0,∗)|. They are fully determined by the training phase and thus are independent of the hypotheses. We call these random variables the training configuration. The following lemma is immediate, and the proof is deferred to the appendix.

Lemma 14. If the training configuration is fixed, then the values |(0,1,1)|, |(1,0,1)|, |(0,0,1)| fully determine all the data available to the algorithm. Therefore, for a fixed training configuration, the output of the optimal algorithm depends only on |(0,1,1)|, |(1,0,1)| and |(0,0,1)|.
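For a fixed training configuration, the three sufficient statistics of Lemma 14 can be read off directly from the samples. A sketch with hypothetical tiny samples (all names are ours, chosen for illustration):

```python
from collections import Counter

def signature_counts(sample_p, sample_q, sample_t, domain):
    """Return |(0,1,1)|, |(1,0,1)| and |(0,0,1)|: the number of elements
    seen once in the testing phase and (0,1), (1,0) or (0,0) times in
    the two training samples."""
    cp, cq, ct = Counter(sample_p), Counter(sample_q), Counter(sample_t)
    sig = Counter((cp[x], cq[x], ct[x]) for x in domain)
    return sig[(0, 1, 1)], sig[(1, 0, 1)], sig[(0, 0, 1)]

domain = range(8)
sample_p = [0, 1]        # hypothetical tiny training sample from P
sample_q = [2, 3]        # hypothetical tiny training sample from Q
sample_t = [1, 3, 4]     # hypothetical testing sample
print(signature_counts(sample_p, sample_q, sample_t, domain))  # → (1, 1, 1)
```

Here element 1 contributes to |(1,0,1)|, element 3 to |(0,1,1)|, and element 4 to |(0,0,1)|; the lower bound argument shows these three numbers are all the optimal algorithm may use.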
Consider h(i, j), the total number of times elements with signature (i, j, ∗) were selected during the testing phase. Note that h(i, j) = Σ_{k=1}^s k·|(i, j, k)|. Now we prove that no algorithm observing h(0,1), h(1,0), h(0,0) can have error probability less than 1/17. Again, the proof is deferred to the appendix due to space constraints.

Lemma 15. Let C_P, D_P and C_Q, D_Q be the probability masses of the subsets of C(πP, πQ) and D(πP, πQ) that were sampled during the training phases of P and Q respectively. Assume without loss of generality Pr[D_P ≥ D_Q] ≥ 1/2. Then with probability at least 1/2, under hypothesis H_1, for the observed h(1,0) = a, h(0,1) = b, h(0,0) = c we have:

Pr[h(1,0) = a, h(0,1) = b, h(0,0) = c | H_1] / Pr[h(1,0) = a, h(0,1) = b, h(0,0) = c | H_2] ≤ 8.   (3)

From this lemma it immediately follows that any algorithm will be wrong with probability at least 1/16. Now we are ready to prove the main theorem of this section.

Proof of Theorem 2. By Lemmas 13 and 14, the optimal algorithm can ignore all elements with signatures other than (0,1,1), (1,0,1) and (0,0,1). Now, suppose there exists an algorithm A(x, y, z) that observes only |(0,1,1)|, |(1,0,1)|, |(0,0,1)| and has error probability less than 1/100. Then the algorithm that observes |(0,1,+)|, |(1,0,+)| and |(0,0,+)| and simply substitutes the latter into the former will have error probability at most 1/100 + 1/20 < 1/16. Indeed, let x = |(0,1,+)|, y = |(1,0,+)| and z = |(0,0,+)| − 2(s − x − y); executing the algorithm A(x, y, z) gives error at most 1/100 + 1/20. Thus we have contradicted Lemma 15.
7 Conclusions and Open Directions

Perhaps the most interesting immediate extension of our work is to incorporate high-frequency elements into our analysis, so as to eliminate the technical assumptions made in our theorems. One possible approach is to combine the techniques of Valiant and Micali, but it remains to be seen whether a hybrid approach would produce better results. Proving or disproving our conjecture that weakly disjoint distributions are indeed the hardest when it comes to telling distributions apart would also be an important technical result. Another direction is to extend the techniques of Section 4. For instance, these techniques could be used to establish various concentration bounds on how many heterogeneous bins receive exactly $t$ balls, and it remains an interesting question how far they can be pushed. On the more technical side, an interesting question is whether (under some reasonable assumptions) the probability ratio between type I and type II configurations is in fact bounded by a constant rather than by $O(s)$. Using a tighter analysis we can in fact show that this ratio is $O(\sqrt{s})$, though reducing it further remains an open question.

References

1. T. Batu, E. Fischer, L. Fortnow, R. Kumar, R. Rubinfeld, and P. White. Testing random variables for independence and identity. In Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science, pages 442–451, 2001.
2. T. Batu, L. Fortnow, R. Rubinfeld, W. D. Smith, and P. White. Testing that distributions are close. In Proceedings of the 41st IEEE Symposium on Foundations of Computer Science, page 259, 2000.
3. T. Batu, S. Dasgupta, R. Kumar, and R. Rubinfeld. The complexity of approximating the entropy. In Proceedings of the 17th IEEE Conference on Computational Complexity, page 17, 2002.
4. T. Batu, R. Kumar, and R. Rubinfeld. Sublinear algorithms for testing monotone and unimodal distributions. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC '04), page 381, 2004.
5. S. Ben-David, J. Gehrke, and D. Kifer. Detecting change in data streams. In Proceedings of the International Conference on Very Large Data Bases, pages 181–190, 2004.
6. O. Goldreich, S. Goldwasser, and D. Ron. Property testing and its connection to learning and approximation. Journal of the ACM, 45(4):653–750, 1998.
7. W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, March 1963.
8. C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, 1989, London Mathematical Society Lecture Note Series 141, pages 148–188, 1989.
9. D. Ron. Property testing (a tutorial). In Handbook of Randomization, 2000.
10. P. Valiant. Testing symmetric properties of distributions. In Proceedings of STOC, pages 383–392, New York, NY, USA, 2008. ACM.

A Equivalence of identity and distinguishability problems

Proof of Lemma 1. We first show how to simulate a correct algorithm $A_I$ for the identity problem using an algorithm $A_D$ for the distinguishability problem. We run $A_D$ $3\log s$ times on fresh input, where the sample for the testing phase is always taken from the first distribution. If the answer is always the same, we say "different"; otherwise we say "the same". If the two distributions are the same, then $A_D$ gives a random answer, so the probability that $A_D$ produces the same answer for all $3\log s$ iterations (and hence that $A_I$ produces the wrong answer) is at most $(3/5)^{3\log s} \le \frac{1}{s^2}$. Similarly, if the distributions are different, then $A_D$ giving two different answers means that it was mistaken at least once, which happens with probability at most $3\log s \cdot \mathrm{polylog}(s)/s^2$, which remains polylogarithmic in $s$ over $s^2$, as desired.
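The first direction of the reduction above amounts to repetition with a unanimity rule. A runnable sketch follows; the `distinguisher(xs, ys, ts)` call signature and the sampler callbacks are our assumed interface, not the paper's:

```python
import math
import random

def identity_tester(distinguisher, draw_x, draw_y, s):
    """Simulate the identity tester A_I via a distinguishability tester A_D.

    A_D is run 3*log(s) times on fresh samples, with the testing-phase
    sample always drawn from the first box; a unanimous answer means the
    distributions are "different", a split answer means "same".
    """
    runs = 3 * max(1, math.ceil(math.log2(s)))
    answers = set()
    for _ in range(runs):
        xs = [draw_x() for _ in range(s)]
        ys = [draw_y() for _ in range(s)]
        ts = [draw_x() for _ in range(s)]  # testing sample from the first box
        answers.add(distinguisher(xs, ys, ts))
    return "different" if len(answers) == 1 else "same"
```

When $P = Q$, the distinguisher's answer is an unbiased coin, so unanimity over $3\log s$ runs has probability at most $(3/5)^{3\log s} \le 1/s^2$, matching the error accounting in the proof.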
For the other direction, we simulate a correct algorithm for the distinguishability problem using an algorithm $A_I$ for the identity problem. We test a new sample against both $X$ and $Y$; if $A_I$ says that both are the same, or that both are different, we output $X$ with probability $0.5$. If the output of $A_I$ is "the same as $X$", we output "$X$"; otherwise we output "$Y$". If $X$ and $Y$ are the same, the testing algorithm says "the same" with probability at least $1 - 2\,\mathrm{polylog}(s)/s^2$, so the distinguishing algorithm says $X$ with probability at least $0.5 - \mathrm{polylog}(s)/s^2$ and $Y$ with probability at least $0.5 - \mathrm{polylog}(s)/s^2$. If $X$ and $Y$ are different, then the testing algorithm makes a mistake with probability at most $2\,\mathrm{polylog}(s)/s^2$; thus the identity algorithm produces different answers, and the distinguishing algorithm is correct with probability at least $1 - 2\,\mathrm{polylog}(s)/s^2$.

B Proofs from Subsection 4.1

Lemma 16. Let $\{x_i\}$ and $\{y_i\}$ be sequences of $N$ independent, though not necessarily identical, Bernoulli random variables such that $E[\sum x_i] = pN$, $E[\sum y_i] = qN$, and $pN < qN$. If

$$N \ge \frac{8\,|\log_2 \delta|\,(p+q)}{(p-q)^2},$$

then $\Pr\big[\sum x_i \ge \sum y_i\big] \le \delta$.

Proof. Using the Chernoff inequality, for any $\alpha > 0$ we have

$$\Pr\Big[\sum_{i=1}^{N} x_i \ge (1+\alpha)pN\Big] < \left(\frac{e^{\alpha}}{(1+\alpha)^{1+\alpha}}\right)^{pN}. \quad (4)$$

On the other hand, for the $y_i$ we have:

$$\Pr\Big[\sum_{i=1}^{N} y_i \le (1-\beta)qN\Big] < \exp\Big[-\frac{Nq\beta^2}{2}\Big]. \quad (5)$$

Choose $1 - \beta = \frac{p+q}{2q}$ and $1 + \alpha = \frac{p+q}{2p}$. Since $p < q$, we have $\alpha, \beta > 0$. First consider the case $p \le q/8$. Substituting into (4) we get:

$$\Pr\Big[\sum_{i=1}^{N} x_i \ge \frac{p+q}{2}N\Big] \le \frac{\exp\big[\frac{q-p}{2}N\big]}{\big(\frac{p+q}{2p}\big)^{\frac{p+q}{2}N}} \le \exp\Big[N\Big(\frac{q-p}{2} - \frac{3(p+q)}{4}\Big)\Big] \le \exp\Big[-N\Big(\frac{q}{4} + \frac{5p}{4}\Big)\Big] \le \exp\Big[-\frac{N(q-p)}{4}\Big] \le \exp\Big[-\frac{N(q-p)^2}{4(p+q)}\Big],$$

where we have used $\frac{p+q}{2p} \ge 4.5 \ge e^{1.5}$.
Similarly, for (5) we have:

$$\Pr\Big[\sum_{i=1}^{N} y_i \le (1-\beta)qN\Big] \le \exp\Big[-\frac{N(q-p)^2}{8q}\Big] \le \exp\Big[-\frac{N(q-p)^2}{8(p+q)}\Big]. \quad (6)$$

If, on the other hand, $q \ge p \ge q/8$, then $\alpha = \frac{p+q}{2p} - 1 \le 4$, and we can use the following variant of the Chernoff bound:

$$\Pr\Big[\sum_{i=1}^{N} x_i \ge N\frac{p+q}{2}\Big] \le \exp[-pN\alpha^2/4] \le \exp\Big[-\frac{N(q-p)^2}{16p}\Big] \le \exp\Big[-\frac{N(q-p)^2}{8(p+q)}\Big], \quad (7)$$

where the last step uses $p \le q$, and similarly:

$$\Pr\Big[\sum_{i=1}^{N} y_i \le N\frac{p+q}{2}\Big] \le \exp[-qN\beta^2/2] \le \exp\Big[-\frac{N(q-p)^2}{8q}\Big] \le \exp\Big[-\frac{N(q-p)^2}{8(p+q)}\Big].$$

Combining the two tail bounds, we have the desired result. Lemma 2 immediately follows from Lemma 16.

C Proofs of concentration bounds for balls and bins

Proof of Lemma 3. Note that type II sampling is over $s$ elements, whereas type I is over $s'$ elements. We have

$$P^{(I)}[C] = p_1^{i_1} p_2^{i_2} \cdots p_n^{i_n}\, \frac{s'!}{i_1!\, i_2! \cdots i_n!},$$

whereas

$$P^{(II)}[C] = \prod_j \frac{s!}{i_j!\,(s-i_j)!}\, p_j^{i_j} (1-p_j)^{s-i_j} = \frac{p_1^{i_1} \cdots p_n^{i_n}}{i_1! \cdots i_n!} \prod_j \frac{s!}{(s-i_j)!}\, (1-p_j)^{s-i_j}.$$

Recalling that $i_j \le \ln s$, we have $s^{i_j} \ge s!/(s-i_j)! \ge (s-\ln s)^{i_j}$ and $\sum_j i_j = s'$; therefore:

$$s'^{s'} e^{s-s'} \ge s^{s'} \ge \prod_j \frac{s!}{(s-i_j)!} \ge (s-\ln s)^{s'} = s^{s'}\Big(1 - \frac{\ln s}{s}\Big)^{s'} \ge s'^{s'} e^{(s-s')/2}\, \frac{1}{2s}, \quad (8)$$

where we have used $s^{s'} = s'^{s'}\big(1 + \frac{s-s'}{s'}\big)^{s'}$ and $e^{s-s'} \ge \big(1 + \frac{s-s'}{s'}\big)^{s'} \ge e^{(s-s')/2}$. Substituting, we have:

$$\frac{P^{(II)}[C]}{P^{(I)}[C]} \ge \frac{s'^{s'} e^{s-s'}}{2es\, s'!} \prod_j \frac{(1-p_j)^{s}}{(1-p_j)^{i_j}} \ge \frac{e^s}{9es\sqrt{s'}} \exp\Big[-\sum_{j=1}^{n} p_j (s+1)\Big] \ge \frac{1}{10 e^2 s^{3/2}},$$

where in the first transition we used the fact that $(1 - \frac{x}{s+1})^{s} \ge e^{-x}$ for $x < 1$ and $s > 1$ together with $p_j(s+1) < 1$, and finally Stirling's formula $s'! \le 3\sqrt{s'}\,(s'/e)^{s'}$, as desired. To get the upper bound we have:

$$\frac{P^{(II)}[C]}{P^{(I)}[C]} \le \frac{s'^{s'} e^{s-s'}}{s'!} \prod_j (1-p_j)^{s-i_j} \le \frac{e^s}{2\sqrt{s'}} \prod_j (1-1/s)^{-i_j} (1-p_j)^{s} \quad (9)$$

$$\le \frac{3 e^s}{2\sqrt{s'}} \exp\Big[-\sum_{j=1}^{n} p_j s\Big] \le \frac{3}{2}\sqrt{s}, \quad (10)$$

where in the second transition we used $n! \ge 2\sqrt{n}\,(n/e)^n$ and $p_j < 1/s$, and in the third we used $(1-1/s)^{-s} < 3$ and $(1-x/s)^{s} \le e^{-x}$.

Proof of Lemma 4. Let $d_i$ denote the total number of times that sample $i$ was repeated in $S$. Recall that $p_i s < 1/2$ for all $i$, and thus we can uniformly upper-bound:

$$\Pr[d_i \ge \ln s] \le \max_k \Pr[c_k \ge \ln s] \le \max_k \exp[\ln s - p_k s] \Big(\frac{p_k s}{\ln s}\Big)^{\ln s} \le \max_k \frac{s\,(p_k s)^{\ln s}}{s^{\ln\ln s}} \le \frac{1}{s^{\ln\ln s + 1}}. \quad (11)$$

A union bound over all $d_i$ gives the desired result.

Proof of Lemma 5. Let $D$ denote the set of all configurations satisfying $|\sum_{i=1}^{n} \alpha_i c_i - E[\sum_{i=1}^{n} \alpha_i c_i]| > r$. For any configuration $C \in D$ such that $\max_i c_i \le \ln s$ we can apply Lemma 3, and thus $P^{(I)}[C] \le 30 s^{3/2} P^{(II)}[C]$. In addition, the total probability mass of all configurations $C$ such that $\max_i c_i > \ln s$ is at most $\frac{1}{s^{\ln\ln s}}$. Therefore

$$P^{(I)}[D] \le 30 s^{3/2}\, \Pr\big[|W' - E[W']| > r\big] + \frac{1}{s^{\ln\ln s}},$$

as desired.

Proof of Lemma 6. Consider type II sampling with $s$ samples and let $W'$ be the total selection weight. Note that $E[W] = E[W'] = s\|P\|_2^2$. Further, we have $W' = \sum_{i=1}^{n} c'_i p_i$, where $c'_i$ is the count of how many times the $i$-th element was chosen. The individual terms in the sum are independent; however, they are not bounded, so we cannot use the Hoeffding inequality [7]. Instead we consider $V' = \sum_{i=1}^{n} \min(c'_i, \ln s)\, p_i$. Using the Hoeffding inequality we have:

$$P^{(II)}\Big[|V' - E[V']| \ge 1.5 (\ln s)^{3/2} \|A\|_2\Big] \le \exp\Big[-\frac{4.5 \ln^3 s\, \|A\|_2^2}{(\ln s)^2 \sum_{i=1}^{n} \alpha_i^2}\Big] \le \frac{1}{s^4}.$$

Furthermore, because of Lemma 4 we have:

$$E[V' - W'] \le \Pr[V' \ne W']\, \|A\|_\infty\, s \le \frac{\|A\|_\infty\, s}{s^{\ln\ln s}} \le \|A\|_2,$$

and hence

$$\Pr\Big[|W' - E[W']| \ge 2 (\ln s)^{3/2} \|P\|_2\Big] \le \Pr\Big[|V' - E[V']| \ge 1.5 (\ln s)^{3/2} \|P\|_2\Big] + \Pr[V' \ne W'] \le \frac{2}{s^4},$$

where we have used $\Pr[V' \ne W'] \le \frac{1}{s^4}$. Thus, using the concentration Lemma 5, we have

$$\Pr\Big[|W - E[W]| \ge 2 (\ln s)^{3/2} \|P\|_2\Big] \le 30 s^{3/2}\, \Pr\Big[|W' - E[W']| \ge 2 (\ln s)^{3/2} \|P\|_2\Big] + \frac{1}{s^{\ln\ln s}} \le \frac{1}{s^2}.$$

D Proofs for lower bounds

Proof of Lemma 14. Indeed:

$$|(0,0,2)| = \big(s - |(1,0,1)| - |(0,1,1)| - |(0,0,1)|\big)/2,$$
$$|(1,0,0)| = |(1,0,*)| - |(1,0,1)|, \qquad |(0,1,0)| = |(0,1,*)| - |(0,1,1)|,$$
$$|(0,0,0)| = |(0,0,*)| - |(0,0,2)| - |(0,0,1)|, \qquad |(0,2,0)| = |(0,2,*)|, \qquad |(2,0,0)| = |(2,0,*)|.$$

Proof of Lemma 15. Denote $\Pr[h(1,0)=a,\, h(0,1)=b,\, h(0,0)=c \mid H_i]$ by $\pi_i$. We need to bound $\pi_1/\pi_2$. We have $\pi_1 = (C_P + D_P)^a\, C_Q^b\, (1 - C_P - C_Q - D_P)^c$ and $\pi_2 = C_P^a\, (C_Q + D_Q)^b\, (1 - C_P - C_Q - D_Q)^c$; therefore

$$\frac{\pi_1}{\pi_2} = \Big(1 + \frac{D_P}{C_P}\Big)^{a} \Big(1 + \frac{D_Q}{C_Q}\Big)^{-b} \Big(1 + \frac{D_P - D_Q}{1 - C_P - C_Q - D_P}\Big)^{-c} \quad (13)$$

$$\le 2 \exp\Big[\frac{D_P}{C_P} a - \frac{D_Q}{C_Q} b - \frac{D_P - D_Q}{1 - C_P - C_Q - D_P} c\Big]. \quad (14)$$

Now we substitute $a = a_0 + \delta_a$, $b = b_0 + \delta_b$ and $c = c_0 + \delta_c$, where $a_0 = (C_P + D_P)s$, $b_0 = C_Q s$ and $c_0 = (1 - C_P - C_Q - D_P)s$, and we get:

$$\frac{\pi_1}{\pi_2} \le 2 \exp\Big[\frac{D_P^2}{C_P} s\Big] \exp\Big[\frac{D_P}{C_P}\delta_a\Big] \exp\Big[\frac{D_Q}{C_Q}\delta_b\Big] \exp\Big[\frac{|D_P - D_Q|}{1 - C_P - C_Q - D_P}\delta_c\Big].$$

Note that $E[\delta_a] = E[\delta_b] = E[\delta_c] = 0$, and with probability at least $0.5$, $\delta_a \le 3\sqrt{(C_P + D_P)s}$, $\delta_b \le 3\sqrt{C_Q s}$ and $\delta_c \le 3\sqrt{(1 - C_P - C_Q - D_P)s}$.
Substituting, we have with high probability:

$$\frac{\pi_1}{\pi_2} \le 2 \exp\Big[\frac{D_P^2}{C_P} s\Big] \exp\Big[\frac{3 D_P \sqrt{s}}{\sqrt{C_P}}\Big] \exp\Big[\frac{3 D_Q \sqrt{s}}{\sqrt{C_Q}}\Big] \exp\Big[\frac{3 |D_P - D_Q| \sqrt{s}}{\sqrt{1 - C_P - C_Q - D_P}}\Big]. \quad (16)$$

Let $\|P_D\|_2$ denote the 2-norm of the disjoint part of $P$ and $Q$, and $\|P_C\|_2$ the 2-norm of the common part of $P$ (or $Q$). We have $\|P - Q\|_2^2 = \|P_D\|_2^2 + \|Q_D\|_2^2$ and $\|P + Q\|_2 \le \|P_C + P_D\|_2$. With high probability we have $D_P \le 2\|P_D\|_2^2 s \le \|P_C\|_2/4$, where we used concentration bounds and substituted $s \le \frac{1}{10}\frac{\|P+Q\|_2}{\|P-Q\|_2^2} \le \frac{\|P_C\|_2}{8\|P_D\|_2^2}$. Similarly $D_Q \le \|P_C\|_2/4$, and $C_P \ge \frac{\|P_C\|_2^2 s}{2}$; therefore we have

$$\frac{D_P^2 s}{C_P} \le \frac{1}{10}.$$

Substituting into (16) and using the fact that the last exponent is $o(1)$, we get the desired result.
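The algebraic identity between the raw ratio $\pi_1/\pi_2$ and the product form (13) can be verified directly. A small sketch (ours; the numeric masses in the usage below are hypothetical):

```python
def likelihood_ratio(CP, CQ, DP, DQ, a, b, c):
    """pi_1 / pi_2 from the raw probabilities in the proof of Lemma 15."""
    pi1 = (CP + DP) ** a * CQ ** b * (1 - CP - CQ - DP) ** c
    pi2 = CP ** a * (CQ + DQ) ** b * (1 - CP - CQ - DQ) ** c
    return pi1 / pi2

def likelihood_ratio_product(CP, CQ, DP, DQ, a, b, c):
    """The same ratio written in the product form of equation (13)."""
    return ((1 + DP / CP) ** a
            * (1 + DQ / CQ) ** (-b)
            * (1 + (DP - DQ) / (1 - CP - CQ - DP)) ** (-c))
```

The two functions agree exactly for any admissible masses, since each factor in (13) is the ratio of the corresponding factors of $\pi_1$ and $\pi_2$.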