We consider the problem of testing distribution identity. Given a sequence of independent samples from an unknown distribution on a domain of size n, the goal is to check if the unknown distribution approximately equals a known distribution on the same domain. While Batu, Fortnow, Fischer, Kumar, Rubinfeld, and White (FOCS 2001) proved that the sample complexity of the problem is O~(sqrt(n) * poly(1/epsilon)), the running time of their tester is much higher: O(n) + O~(sqrt(n) * poly(1/epsilon)). We modify their tester to achieve a running time of O~(sqrt(n) * poly(1/epsilon)).
Deep Dive into Testing Distribution Identity Efficiently.
Let $p$ and $q$ be two probability distributions on $[n]$:
• The distribution $p$ is known: for each $i \in [n]$, the algorithm can query the probability $p_i$ of $i$ in constant time.
• The distribution $q$ is unknown: the algorithm can only obtain an independent sample from $q$ in constant time.
An identity tester is an algorithm such that:
• if $p = q$, then it accepts with probability at least 2/3,
• if $\|p - q\|_1 \ge \varepsilon$, then it rejects with probability at least 2/3.
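The access model and the distance used in the definition above can be made concrete with a minimal sketch. The names `make_access`, `query_p`, `sample_q`, and `l1_distance` are hypothetical and chosen only for illustration; `l1_distance` needs both distributions in full, so it is useful for analysis and experiments, not something a tester could compute.

```python
import random
from typing import Callable, Sequence, Tuple


def make_access(
    p: Sequence[float], q: Sequence[float], seed: int = 0
) -> Tuple[Callable[[int], float], Callable[[], int]]:
    """Return (query_p, sample_q), mimicking the two kinds of access."""
    rng = random.Random(seed)
    n = len(p)

    def query_p(i: int) -> float:
        # Known distribution: a probability lookup.
        return p[i]

    def sample_q() -> int:
        # Unknown distribution: one independent sample per call.
        return rng.choices(range(n), weights=q, k=1)[0]

    return query_p, sample_q


def l1_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """L1 distance between two distributions on the same domain."""
    return sum(abs(a - b) for a, b in zip(p, q))
```

A tester in this model would only ever call `query_p` and `sample_q`; its cost is measured in the number of such calls plus the remaining computation.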
Batu, Fortnow, Fischer, Kumar, Rubinfeld, and White [BFF+01] proved that there is an identity tester that uses only $\tilde{O}(\sqrt{n} \cdot \mathrm{poly}(1/\varepsilon))$ samples from $q$. A shortcoming of their algorithm is a running time of $O(n) + \tilde{O}(\sqrt{n} \cdot \mathrm{poly}(1/\varepsilon))$.
In this note, we show that their tester can be modified to achieve a running time of $\tilde{O}(\sqrt{n} \cdot \mathrm{poly}(1/\varepsilon))$. It is also well known that $\Omega(\sqrt{n})$ samples are required to tell the uniform distribution on $[n]$ from a distribution that is uniform on a random subset of $[n]$ of size $n/2$.
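The hard pair of distributions behind the lower bound can be written down explicitly. The sketch below (the helper name `hard_instance` is hypothetical, and $n$ is assumed even) builds both distributions with exact rational arithmetic and checks that they are at $\ell_1$ distance exactly 1, even though telling them apart from samples requires $\Omega(\sqrt{n})$ of them.

```python
import random
from fractions import Fraction
from typing import List, Tuple


def hard_instance(n: int, seed: int = 0) -> Tuple[List[Fraction], List[Fraction]]:
    """Uniform on [n] versus uniform on a random subset of size n/2 (n even)."""
    rng = random.Random(seed)
    support = set(rng.sample(range(n), n // 2))
    p = [Fraction(1, n)] * n
    q = [Fraction(2, n) if i in support else Fraction(0) for i in range(n)]
    return p, q


p, q = hard_instance(1000)
# Half of the domain contributes |1/n - 2/n| and the other half |1/n - 0|.
distance = sum(abs(a - b) for a, b in zip(p, q))
```

Intuitively, a sample much smaller than $\sqrt{n}$ is unlikely to contain any repeated element under either distribution, so the two sample sets look alike.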
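We now describe the tester of Batu et al. [BFF+01], which is outlined as Algorithm 1. Let $\varepsilon' = \varepsilon/C$, where $C$ is a sufficiently large positive constant. The tester starts by partitioning the set $[n]$ into sets $R_0, \ldots, R_k$ in Step 1, where membership of $i$ in $R_j$ is determined by the value of $p_i$; the definition is given separately for $j > 0$ and for the remaining set.

[Algorithm 1: Outline of the tester of Batu et al. The listing and the formulas defining the sets $R_j$ were lost in extraction. One surviving fragment defines, for each $i$, the number of occurrences of $i$ in a sample of size $S = \tilde{O}(\sqrt{n} \cdot \mathrm{poly}(1/\varepsilon))$.]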
We then define the probability of each set according to $p$ and $q$: $P_j = \sum_{i \in R_j} p_i$ and $Q_j = \sum_{i \in R_j} q_i$. The tester computes and estimates those probabilities in Steps 2 and 3. In Step 4, the tester verifies that the probabilities of the sets $R_j$ under the two distributions are close. Finally, in Steps 5-7, the tester verifies that $q$ restricted to each $R_j$ is approximately uniform, by comparing second moments of $p$ and $q$ over each $R_j$. If $q$ passes the test with probability greater than 1/3, it must be close to $p$. On the other hand, if $p = q$, then the parameters can be set so that $q$ passes with probability at least 2/3. Note that the additive linear term in the complexity of the tester comes from explicitly computing each $R_j$ and each $P_j$ in Steps 1-2.
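Comparing second moments requires an estimate of $\sum_i q_i^2$ from samples alone. One standard way to obtain it is collision counting: the fraction of sample pairs that coincide is an unbiased estimate of the second moment. The sketch below illustrates that general technique; whether Algorithm 1 uses exactly this statistic is an assumption, since its listing is not reproduced here, and the function name is hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable


def collision_estimate(samples: Iterable[Hashable]) -> float:
    """Estimate sum_i q_i^2 as (colliding pairs) / (all pairs)."""
    counts = Counter(samples)
    m = sum(counts.values())
    if m < 2:
        raise ValueError("need at least two samples")
    # An element seen c times contributes c * (c - 1) / 2 colliding pairs.
    colliding_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    total_pairs = m * (m - 1) // 2
    return colliding_pairs / total_pairs
```

For a distribution supported on a set $R$, the second moment is at least $1/|R|$, with equality exactly for the uniform distribution, which is why a second moment close to $1/|R|$ indicates near-uniformity on $R$.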
Note that the partition of $[n]$ into the sets $R_j$ need not be computed explicitly, since for each sample $i$ from $q$, one can check which $R_j$ it belongs to by querying $p_i$.
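The implicit partition can be sketched as a function from a queried probability to a bucket index. The geometric bucketing below, with ratio $1 + \varepsilon'$ above a cutoff `floor_prob`, is a hypothetical stand-in chosen for illustration; it is not the exact definition of the sets $R_j$ from [BFF+01], and all names are assumptions.

```python
import math
from typing import Callable


def bucket_index(p_i: float, eps_prime: float, floor_prob: float) -> int:
    """Hypothetical bucketing: bucket 0 holds probabilities below
    floor_prob; bucket j >= 1 holds probabilities in
    [floor_prob * (1 + eps')^(j-1), floor_prob * (1 + eps')^j)."""
    if p_i < floor_prob:
        return 0
    return 1 + math.floor(math.log(p_i / floor_prob) / math.log1p(eps_prime))


def bucket_of_sample(
    i: int, query_p: Callable[[int], float], eps_prime: float, floor_prob: float
) -> int:
    # One query to p per sample; no pass over the whole domain is needed.
    return bucket_index(query_p(i), eps_prime, floor_prob)
```

The point of the design is that the cost per sample is a single query and a constant amount of arithmetic, independent of $n$.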
We observe that one can verify that $\|(P_0, \ldots, P_k) - (Q_0, \ldots, Q_k)\|_1$ is small without explicitly computing each $P_j$. We use Algorithm 2 for this purpose. Let $j^\star$ be an index such that an element of probability $1/\sqrt{n}$ would belong to $R_{j^\star}$. The algorithm is based on the following facts:
• For $j < j^\star$, if $P_j$ is not negligible, $R_j$ must be large, and a good additive estimate of $P_j$ can be obtained by uniformly sampling $\tilde{O}(\ldots)$ elements [the sample-size bound is truncated in the source] and computing the weight of those that belong to $R_j$.
• If $p = q$, we are likely to learn all elements in $R_j$, $j \ge j^\star$, by sampling only $\tilde{O}(\sqrt{n})$ elements of $q$. This gives the exact value of each $P_j$, $j \ge j^\star$. If $p \ne q$, this method still gives lower bounds for each $P_j$.
• If $\|(P_0, \ldots, P_k) - (Q_0, \ldots, Q_k)\|_1 \ge \delta$, our estimates for $P_j$ and $Q_j$ are likely to be sufficiently different.
A detailed proof follows.
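The two ways of estimating $P_j$ described in the facts above can be sketched as follows. This is an illustration under stated assumptions, not the paper's Algorithm 2: the bucket is passed as a hypothetical membership predicate `in_bucket`, and the rescaling by $n / |S_2|$ in `light_estimate` is the standard unbiased estimator for uniform sampling, assumed here rather than taken from the source.

```python
import random
from typing import Callable, Iterable


def heavy_lower_bound(
    samples_from_q: Iterable[int],
    query_p: Callable[[int], float],
    in_bucket: Callable[[int], bool],
) -> float:
    """Sum p_i over the distinct sampled elements of the bucket.
    Never exceeds P_j; equals P_j once every bucket element was seen."""
    seen = {i for i in samples_from_q if in_bucket(i)}
    return sum(query_p(i) for i in seen)


def light_estimate(
    n: int,
    num_uniform: int,
    query_p: Callable[[int], float],
    in_bucket: Callable[[int], bool],
    rng: random.Random,
) -> float:
    """Estimate P_j from uniform samples of the domain {0, ..., n-1}."""
    total = 0.0
    for _ in range(num_uniform):
        i = rng.randrange(n)
        if in_bucket(i):
            total += query_p(i)
    # A uniform element lands on i with probability 1/n, hence the factor n.
    return n * total / num_uniform
```

The first function is cheap when the bucket consists of heavy elements, since few samples from $q$ reveal all of them; the second is accurate when every element of the bucket is light, since each sample then changes the sum only slightly.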
Lemma 1. Algorithm 2 with appropriately chosen constants tells $p = q$ (Case 1) from $\|(P_0, \ldots, P_k) - (Q_0, \ldots, Q_k)\|_1 \ge \delta$ (Case 2) with probability 9/10.
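[Algorithm 2: Telling $p = q$ (Case 1) from $\|(P_0, \ldots, P_k) - (Q_0, \ldots, Q_k)\|_1 \ge \delta$ (Case 2). Only the end of the listing survives: a comparison against $\frac{\delta}{4k+4}$ that returns “Case 2”, followed by line 10, which returns “Case 1”.]

The multiplicative constant in the sample size of Step 1 is such that Step 1 succeeds with probability 99/100. The size of $S_1$ is chosen such that with probability 99/100, $S_1$ contains all elements $i$ of probability [threshold garbled in the source; only a trailing "$n$" survives] by the coupon collector's problem. Finally, the size of $S_2$ is chosen such that with probability 99/100, for each $j < j^\star$, [the displayed bound is missing in the source].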
To see this, let us first focus on $j < j^\star$ such that $P_j \ge \frac{\delta}{16(k+1)}$. Note that each $i \in S_2$ contributes a value in $[0, 1/\sqrt{n}]$ to $\sum_{i \in U_j} p_i$. By the Chernoff bound, [leading factor missing in the source] $\cdot \log k$ samples suffice to estimate $P_j$ within a multiplicative factor of $1 + \frac{\delta}{8k+8}$ with probability $1 - \frac{1}{200k}$, which implies additive error at most $\frac{\delta}{8k+8}$ as well. For $j < j^\star$ such that $P_j < \frac{\delta}{16(k+1)}$, the Chernoff bound still guarantees with the same probability that the estimate is less than $\frac{\delta}{8k+8}$.

If $p = q$, then Algorithm 2 discovers this with probability at least 97/100 due to the following facts. First, $T_j = R_j$ for $j \ge j^\star$, so $\sum_{i \in T_j} p_i = P_j$. Therefore, provided all $Q'_j$ are good approximations to the corresponding $Q_j$, $q$ always passes Step 6. Second, if the estimates computed from $\sum_{i \in U_j} p_i$ and $|S_2|$, for $0 \le j < j^\star$, are all good approximations of the corresponding $P_j$, $q$ always passes Step 10 as well.
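Two small steps in this argument can be spelled out. The derivation below is a sketch that reads "multiplicative error $1 + \gamma$" as a two-sided factor, with $\gamma = \frac{\delta}{8k+8}$ and $\widehat{P}_j$ denoting the estimate of $P_j$; both the notation and that reading are assumptions made for illustration.

```latex
% Multiplicative error implies additive error, using 0 <= P_j <= 1:
\frac{P_j}{1+\gamma} \le \widehat{P}_j \le (1+\gamma)\,P_j
\quad\Longrightarrow\quad
|\widehat{P}_j - P_j| \le \gamma\,P_j \le \gamma = \frac{\delta}{8k+8}.

% Union bound over the at most k indices j < j^\star (k >= 1):
k \cdot \frac{1}{200k} = \frac{1}{200} \le \frac{1}{100}.

% Union bound over the three events that each hold with probability 99/100:
1 - 3 \cdot \frac{1}{100} = \frac{97}{100}.
```

The second line is consistent with the requirement that all estimates for $j < j^\star$ are simultaneously good with probability 99/100, and the third with the 97/100 success probability in the case $p = q$.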
If $\|(P_0, \ldots, P_k) - (Q_0, \ldots, Q_k)\|_1 \ge \delta$, [the rest of this sentence, which introduces the index $j'$, is missing in the source].

If $j' \ge j^\star$, then because $P_j$ is always greater than or equal to $\sum_{i \in T_j} p_i$, the tester concludes in Step 6 for $j = j'$ that Case 2 occurs, provided $Q'_{j'}$ is a good approximation to $Q_{j'}$, which happens with probability at least 99/100. If $j' < j^\star$, then because we have good approximations to both $Q_{j'}$ and $P_{j'}$ with probability 98/100, and their distance is at least $\frac{\delta}{2k} - 2 \cdot \frac{\delta}{8k+8} > \frac{\delta}{4k+4}$, the algorithm concludes in Step 10 for $j = j'$ that Case 2 occurs.

To get an efficient tester, we replace Steps 2-4 of Algorith
…(Full text truncated)…