Testing Distribution Identity Efficiently
Krzysztof Onak
MIT
konak@mit.edu

Abstract

We consider the problem of testing distribution identity. Given a sequence of independent samples from an unknown distribution on a domain of size n, the goal is to check if the unknown distribution approximately equals a known distribution on the same domain. While Batu, Fortnow, Fischer, Kumar, Rubinfeld, and White (FOCS 2001) proved that the sample complexity of the problem is Õ(√n · poly(1/ε)), the running time of their tester is much higher: O(n) + Õ(√n · poly(1/ε)). We modify their tester to achieve a running time of Õ(√n · poly(1/ε)).

Let p and q be two probability distributions on [n]¹, and let ‖p − q‖₁ denote the ℓ₁-distance between p and q. In this paper, algorithms have access to two distributions q and p.

• The distribution p is known: for each i ∈ [n], the algorithm can query the probability pᵢ of i in constant time.
• The distribution q is unknown: the algorithm can only obtain an independent sample from q in constant time.

An identity tester is an algorithm such that:

• if p = q, then it accepts with probability 2/3,
• if ‖p − q‖₁ ≥ ε, then it rejects with probability 2/3.

Batu, Fortnow, Fischer, Kumar, Rubinfeld, and White [BFF+01] proved that there is an identity tester that uses only Õ(√n · poly(1/ε)) samples from q. A shortcoming of their algorithm is its running time of O(n) + Õ(√n · poly(1/ε)). In this note, we show that their tester can be modified to achieve a running time of Õ(√n · poly(1/ε)). It is also well known that Ω(√n) samples are required to tell the uniform distribution on [n] from a distribution that is uniform on a random subset of [n] of size n/2.

1 The Original Tester

We now describe the tester of Batu et al. [BFF+01], which is outlined as Algorithm 1.
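The access model just described can be captured by a small Python harness (hypothetical code, for illustration only; the helper names `l1_distance` and `make_sampler` are ours, not the paper's):

```python
import random

def l1_distance(p, q):
    """l1-distance between two distributions given as probability vectors."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def make_sampler(q, seed=0):
    """One independent sample from q per call. (random.choices rescans the
    weights on every call; a genuinely O(1)-per-sample implementation would
    precompute an alias table, but that detail is irrelevant to the tester,
    which never sees q's weights.)"""
    rng = random.Random(seed)
    n = len(q)
    return lambda: rng.choices(range(n), weights=q, k=1)[0]

# The known distribution: p[i] is an O(1) probability query.
p = [0.5, 0.25, 0.25]
# The unknown distribution: accessible only through sample_q().
sample_q = make_sampler([0.5, 0.3, 0.2])
```

A tester thus receives p as a queryable vector and q only through the sampler.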
Let ε′ = ε/C, where C is a sufficiently large positive constant. The tester starts by partitioning the set [n] into k + 1 = log_{1+ε′}(2n/ε) + 1 = O((1/ε) · log(n/ε)) sets R₀, R₁, . . . , R_k in Step 1, where

  R_j = { i ∈ [n] : (ε/2n) · (1 + ε′)^{j−1} < pᵢ ≤ (ε/2n) · (1 + ε′)^j }

for j > 0, and R₀ = { i ∈ [n] : pᵢ ≤ ε/2n }.

¹We write [k] to denote the set {1, 2, . . . , k}, for any positive integer k.

Algorithm 1: Outline of the tester of Batu et al. [BFF+01]
1. Partition [n] into R₀, R₁, . . . , R_k.
2. Compute P_j, for j ∈ {0, 1, . . . , k}.
3. Use O((k/ε)² · log k) samples from q to get an estimate Q′_j of each Q_j up to ε/(4k + 4).
4. If ‖(P₀, . . . , P_k) − (Q′₀, . . . , Q′_k)‖₁ > ε/4, then REJECT.
5. Let sᵢ, i ∈ [n], be the number of occurrences of i in a sample of size S = Õ(√n · poly(1/ε)).
6. For each j > 0 such that P_j > ε/(4k + 4):
7.   If Σ_{i∈R_j} sᵢ² > (1 + ε/4) · S² · P_j · (ε/2n) · (1 + ε′)^j, then REJECT.
8. ACCEPT.

We then define the probabilities of each set according to p and q:

  P_j = Σ_{i∈R_j} pᵢ  and  Q_j = Σ_{i∈R_j} qᵢ.

The tester computes and estimates those probabilities in Steps 2 and 3. In Step 4, the tester verifies that the probabilities of the sets R_j in both distributions are close. Finally, in Steps 5–7, the tester verifies that q restricted to each R_j is approximately uniform, by comparing second moments of p and q over each R_j; note that (ε/2n) · (1 + ε′)^j in Step 7 is an upper bound on pᵢ for i ∈ R_j. If q passes the test with probability greater than 1/3, it must be close to p. On the other hand, if p = q, then the parameters can be set so that q passes with probability 2/3. Note that the additive linear term in the complexity of the tester comes from explicitly computing each R_j and each P_j in Steps 1–2.
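For concreteness, the tester above can be sketched in Python. This is a hypothetical transcription, not the authors' code: the constant C is set to 1 (so ε′ = ε), the sample-size multipliers are illustrative, and a small 1e-9 slack guards the bucket-boundary logarithms against floating-point error.

```python
import math
import random
from collections import Counter

def original_tester(p, sample_q, eps, S):
    """Sketch of Algorithm 1. `p` is the known distribution (a list, so p[i]
    is an O(1) query); `sample_q` yields one independent sample from q."""
    n = len(p)
    eps_p = eps                                  # stand-in for eps' = eps/C, with C = 1
    k = int(math.log(2 * n / eps, 1 + eps_p) + 1e-9) + 1

    def bucket(pi):
        # R_0 holds p_i <= eps/2n; R_j holds (eps/2n)(1+eps')^(j-1) < p_i <= (eps/2n)(1+eps')^j.
        if pi <= eps / (2 * n):
            return 0
        return min(k, math.ceil(math.log(pi * 2 * n / eps, 1 + eps_p) - 1e-9))

    # Steps 1-2: explicit O(n) pass over the domain -- the linear-time bottleneck.
    P = [0.0] * (k + 1)
    for i in range(n):
        P[bucket(p[i])] += p[i]

    # Step 3: estimate each Q_j from O((k/eps)^2 log k) samples of q.
    m = int((k / eps) ** 2 * math.log(k + 2)) + 1
    counts = Counter(bucket(p[sample_q()]) for _ in range(m))
    Q_est = [counts[j] / m for j in range(k + 1)]

    # Step 4: reject if the bucket probabilities differ noticeably.
    if sum(abs(P[j] - Q_est[j]) for j in range(k + 1)) > eps / 4:
        return False  # REJECT

    # Steps 5-7: second-moment (collision) test inside each heavy bucket;
    # (eps/2n)(1+eps')^j upper-bounds p_i on R_j.
    s = Counter(sample_q() for _ in range(S))
    for j in range(1, k + 1):
        if P[j] > eps / (4 * k + 4):
            second_moment = sum(c * c for i, c in s.items() if bucket(p[i]) == j)
            cap = (eps / (2 * n)) * (1 + eps_p) ** j
            if second_moment > (1 + eps / 4) * S * S * P[j] * cap:
                return False  # REJECT
    return True  # Step 8: ACCEPT
```

For instance, with p uniform on [16] and q = p the sketch accepts, while q concentrated on a single point is caught by the second-moment test.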
2 Our Improvement

Note that the partition of [n] into sets R_j need not be computed explicitly, since for each sample i from q, one can check which R_j it belongs to by querying pᵢ. We observe that one can also verify that ‖(P₀, . . . , P_k) − (Q₀, . . . , Q_k)‖₁ is small without explicitly computing each P_j. We use Algorithm 2 for this purpose. Let j⋆ be an index such that an element of probability 1/√n would belong to R_{j⋆}. The algorithm is based on the following facts:

• For j < j⋆, if P_j is not negligible, then R_j must be large, and a good additive estimate of P_j can be obtained by uniformly sampling Õ(√n · poly(1/ε)) elements of [n] and computing the weight of those that belong to R_j.
• If p = q, we are likely to learn all elements of R_j, j ≥ j⋆, by sampling only Õ(√n) elements of q. This gives the exact value of each P_j, j ≥ j⋆. If p ≠ q, this method still gives a lower bound on each P_j.

If ‖(P₀, . . . , P_k) − (Q₀, . . . , Q_k)‖₁ ≥ δ, our estimates of P_j and Q_j are therefore likely to be sufficiently different. A detailed proof follows.

Algorithm 2: Telling p = q (Case 1) from ‖(P₀, . . . , P_k) − (Q₀, . . . , Q_k)‖₁ ≥ δ (Case 2)
1. Use O((k/δ)² · log k) samples from q to get an estimate Q′_j of each Q_j up to δ/(8k + 8).
2. Let j⋆ be an index such that an element of probability 1/√n would belong to R_{j⋆}.
3. Let S₁ be a set of O(√n · log n) samples from q.
4. For each j such that j⋆ ≤ j ≤ k:
5.   Let T_j = S₁ ∩ R_j (with no repetitions).
6.   If Q′_j − Σ_{i∈T_j} pᵢ > δ/(8k + 8), then return "Case 2".
7. Let S₂ be a multiset of O((k/δ)³ · √n · log k) independent uniform samples from [n], drawn with replacement.
8. For each j such that j < j⋆:
9.   Let U_j = S₂ ∩ R_j (with repetitions).
10.  If Q′_j − (n/|S₂|) · Σ_{i∈U_j} pᵢ > δ/(4k + 4), then return "Case 2".
11. Return "Case 1".

Lemma 1 Algorithm 2 with appropriately chosen constants tells p = q (Case 1) from ‖(P₀, . . . , P_k) − (Q₀, . . . , Q_k)‖₁ ≥ δ (Case 2) with probability 9/10.

Proof The multiplicative constant in the sample size of Step 1 is chosen so that Step 1 succeeds with probability 99/100. The size of S₁ is chosen so that, with probability 99/100, S₁ contains all elements i of probability qᵢ ≥ 1/(2√n); this follows from the coupon collector's problem. Finally, the size of S₂ is chosen so that, with probability 99/100, for each j < j⋆,

  | (n/|S₂|) · Σ_{i∈U_j} pᵢ − P_j | ≤ δ/(8k + 8).

To see this, let us first focus on j < j⋆ such that P_j ≥ δ/(16(k + 1)). Note that each i ∈ S₂ contributes a value in [0, 1/√n] to Σ_{i∈U_j} pᵢ. By the Chernoff bound, O((k/δ)³ · √n · log k) samples suffice to estimate P_j within a multiplicative factor of 1 + δ/(8k + 8) with probability 1 − 1/(200k), which implies an additive error of at most δ/(8k + 8) as well. For j < j⋆ such that P_j < δ/(16(k + 1)), the Chernoff bound still guarantees, with the same probability, that the estimate is less than δ/(8k + 8).

If p = q, then Algorithm 2 discovers this with probability 97/100, due to the following facts. Firstly, T_j = R_j for j ≥ j⋆, so Σ_{i∈T_j} pᵢ = P_j. Therefore, provided all Q′_j are good approximations to the corresponding Q_j, q always passes Step 6. Secondly, if all (n/|S₂|) · Σ_{i∈U_j} pᵢ, 0 ≤ j < j⋆, are good approximations of the corresponding P_j, then q always passes Step 10 as well.

If ‖(P₀, . . . , P_k) − (Q₀, . . . , Q_k)‖₁ ≥ δ, there is a j′ such that Q_{j′} > P_{j′} + δ/(2k). If j′ ≥ j⋆, then because P_j is always greater than or equal to Σ_{i∈T_j} pᵢ, the tester concludes in Step 6 for j = j′ that Case 2 occurs, provided Q′_{j′} is a good approximation to Q_{j′}, which happens with probability at least 99/100.
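A Python sketch of Algorithm 2 may also help. Again this is hypothetical code with illustrative constants (C = 1 so ε′ = ε; small multipliers on the sample sizes; a 1e-9 slack in the boundary logarithms), and the normalization n/|S₂| in the light-bucket estimate is written out explicitly.

```python
import math
import random
from collections import Counter

def algorithm2(p, sample_q, eps, delta, rng=None):
    """Sketch of Algorithm 2: return "Case 2" if the bucket probabilities of
    p and q look >= delta apart in l1-distance, else "Case 1".
    Buckets R_j are those of Algorithm 1, recomputed on the fly from p."""
    rng = rng or random.Random(2)
    n = len(p)
    eps_p = eps                                  # stand-in for eps' = eps/C, with C = 1
    k = int(math.log(2 * n / eps, 1 + eps_p) + 1e-9) + 1

    def bucket(pi):
        if pi <= eps / (2 * n):
            return 0
        return min(k, math.ceil(math.log(pi * 2 * n / eps, 1 + eps_p) - 1e-9))

    # Step 1: estimate each Q_j from O((k/delta)^2 log k) samples of q.
    m = int((k / delta) ** 2 * math.log(k + 2)) + 1
    counts = Counter(bucket(p[sample_q()]) for _ in range(m))
    Q_est = [counts[j] / m for j in range(k + 1)]

    # Step 2: bucket that an element of probability 1/sqrt(n) would fall into.
    j_star = bucket(1 / math.sqrt(n))

    # Steps 3-6: heavy buckets. O(sqrt(n) log n) samples of q likely contain
    # every element of R_j, j >= j_star, so summing p_i over the distinct
    # samples gives a lower bound on (and for p = q, exactly) P_j.
    S1 = {sample_q() for _ in range(int(20 * math.sqrt(n) * math.log(n + 2)))}
    lower = [0.0] * (k + 1)
    for i in S1:
        lower[bucket(p[i])] += p[i]
    for j in range(j_star, k + 1):
        if Q_est[j] - lower[j] > delta / (8 * k + 8):
            return "Case 2"

    # Steps 7-10: light buckets. Uniform samples from [n]; (n/|S2|) times the
    # p-weight of the samples landing in R_j is an additive estimate of P_j.
    m2 = int((k / delta) ** 3 * math.sqrt(n) * math.log(k + 2)) + 1
    weight = [0.0] * (k + 1)
    for _ in range(m2):
        i = rng.randrange(n)
        weight[bucket(p[i])] += p[i]
    for j in range(j_star):
        if Q_est[j] - n * weight[j] / m2 > delta / (4 * k + 4):
            return "Case 2"

    return "Case 1"  # Step 11
```

For example, with p = q uniform on [16] the sketch returns "Case 1", while a p that puts half its mass on one element against a uniform q is caught in the light-bucket loop.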
If j′ < j⋆, then because we have good approximations to both Q_{j′} and P_{j′} with probability 98/100, and their distance is at least δ/(2k) − 2δ/(8k + 8) > δ/(4k + 4), the algorithm concludes in Step 10 for j = j′ that Case 2 occurs.

To get an efficient tester, we replace Steps 2–4 of Algorithm 1 with Algorithm 2, where we set δ to ε/C for a sufficiently large constant C. If Algorithm 2 concludes that Case 2 occurs, the new algorithm immediately rejects. Furthermore, if it is not the case that ‖(P₀, . . . , P_k) − (Q₀, . . . , Q_k)‖₁ ≥ δ, then Steps 6 and 7 work with the estimates Q′_j instead of the exact values P_j, up to a modification of constants.

Acknowledgment

The author thanks Ronitt Rubinfeld for asking the question.

References

[BFF+01] Tuğkan Batu, Lance Fortnow, Eldar Fischer, Ravi Kumar, Ronitt Rubinfeld, and Patrick White. Testing random variables for independence and identity. In FOCS, pages 442–451, 2001.