Testing Distribution Identity Efficiently

Reading time: 7 minutes

📝 Original Info

  • Title: Testing Distribution Identity Efficiently
  • ArXiv ID: 0910.3243
  • Date: 2009-10-20
  • Authors: Researchers from original ArXiv paper

📝 Abstract

We consider the problem of testing distribution identity. Given a sequence of independent samples from an unknown distribution on a domain of size n, the goal is to check if the unknown distribution approximately equals a known distribution on the same domain. While Batu, Fortnow, Fischer, Kumar, Rubinfeld, and White (FOCS 2001) proved that the sample complexity of the problem is O~(sqrt(n) * poly(1/epsilon)), the running time of their tester is much higher: O(n) + O~(sqrt(n) * poly(1/epsilon)). We modify their tester to achieve a running time of O~(sqrt(n) * poly(1/epsilon)).


📄 Full Content

• The distribution p is known: for each i ∈ [n], the algorithm can query the probability p_i of i in constant time.

• The distribution q is unknown: the algorithm can only obtain an independent sample from q in constant time.

An identity tester is an algorithm such that:

• if p = q, then it accepts with probability 2/3,

• if ‖p − q‖_1 ≥ ε, then it rejects with probability 2/3.
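The access model above can be sketched as a small harness (all names here are illustrative, not from the paper): the tester may query p_i in constant time and draw independent samples from q.

```python
import random

# Illustrative harness for the identity-testing access model (hypothetical
# names; the paper defines only the model, not this code).

class KnownDist:
    """Known distribution p on the domain [n] = {0, ..., n-1}."""
    def __init__(self, probs):
        assert abs(sum(probs) - 1.0) < 1e-9
        self.probs = probs

    def p(self, i):
        # The O(1) probability query the tester is allowed to make.
        return self.probs[i]

def sampler(probs):
    """Black-box access to the unknown q: each call draws one sample."""
    n = len(probs)
    def draw():
        return random.choices(range(n), weights=probs, k=1)[0]
    return draw

def l1_distance(p_probs, q_probs):
    """The distance in which closeness is measured: ||p - q||_1."""
    return sum(abs(a - b) for a, b in zip(p_probs, q_probs))
```

An identity tester receives only `KnownDist.p` and `draw` as interfaces to the two distributions, and must accept with probability 2/3 when p = q and reject with probability 2/3 when their ℓ_1 distance is at least ε.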

Batu, Fortnow, Fischer, Kumar, Rubinfeld, and White [BFF+01] proved that there is an identity tester that uses only Õ(√n · poly(1/ε)) samples from q. A shortcoming of their algorithm is a running time of O(n) + Õ(√n · poly(1/ε)).

In this note, we show that their tester can be modified to achieve a running time of Õ( √ n • poly(1/ε)). It is also well known that Ω( √ n) samples are required to tell the uniform distribution on [n] from a distribution that is uniform on a random subset of [n] of size n/2.
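A rough illustration of this lower bound (our own simulation, not from the paper): the two hard distributions differ in their collision statistics, and colliding pairs only start to appear once the number of samples approaches √n.

```python
import itertools
import random

# Uniform on [n] has collision probability 1/n; uniform on a random half
# of [n] has collision probability 2/n. With s samples, the expected
# number of colliding pairs is about C(s,2)/n vs. 2*C(s,2)/n, so the two
# distributions look identical until s is on the order of sqrt(n).

def collision_count(sample):
    """Number of colliding pairs among the drawn samples."""
    return sum(1 for a, b in itertools.combinations(sample, 2) if a == b)

n = 10_000
domain = range(n)
half = random.sample(domain, n // 2)     # support of the "far" distribution

s = int(4 * n ** 0.5)                    # a few multiples of sqrt(n)
print(collision_count([random.choice(domain) for _ in range(s)]))
print(collision_count([random.choice(half) for _ in range(s)]))
```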

We now describe the tester of Batu et al. [BFF+01], which is outlined as Algorithm 1. Let ε′ = ε/C, where C is a sufficiently large positive constant. In Step 1, the tester partitions the set [n] into buckets R_0, R_1, …, R_k according to the probabilities p_i, grouping together (for j > 0) elements whose probabilities lie within a factor of 1 + ε′ of each other; the exact bucket boundaries are truncated in this extract. For each i, the tester also counts the number of occurrences of i in a sample of size S = Õ(√n · poly(1/ε)) drawn from q.

We then define probabilities of each set according to p and q: P_j = Σ_{i∈R_j} p_i and Q_j = Σ_{i∈R_j} q_i. The tester computes and estimates these probabilities in Steps 2 and 3. In Step 4, the tester verifies that the probabilities of the sets R_j under both distributions are close. Finally, in Steps 5-7, the tester verifies that q restricted to each R_j is approximately uniform, by comparing second moments of p and q over each R_j. If q passes the test with probability greater than 1/3, it must be close to p. On the other hand, if p = q, then the parameters can be set so that q passes with probability 2/3. Note that the additive linear term in the complexity of the tester comes from explicitly computing each R_j and each P_j in Steps 1-2.

Note that the partition of [n] into sets R_j need not be computed explicitly, since for each sample i from q, one can check which R_j it belongs to by querying p_i.
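This lazy membership test can be sketched as follows, under the assumption that buckets group elements whose probabilities lie within a factor of 1 + ε′ of each other (the paper's exact bucket boundaries are not shown in this extract):

```python
import math

# Hypothetical bucket-index function: instead of materializing the
# partition of [n], compute the bucket of a sampled element i from a
# single query to p_i. (The boundary convention here is an assumption.)
def bucket_index(p_i, eps_prime, k):
    """Bucket j with p_i roughly in ((1+eps')^-(j+1), (1+eps')^-j]."""
    if p_i <= 0.0:
        return k                       # zero-probability elements: last bucket
    j = math.floor(-math.log(p_i) / math.log(1.0 + eps_prime))
    return min(j, k)                   # clamp tiny probabilities into R_k
```

Each sample from q then costs only O(1) extra work to classify, which is what removes the explicit partitioning of Step 1 from the running time.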

We observe that one can verify that ‖(P_0, …, P_k) − (Q_0, …, Q_k)‖_1 is small without explicitly computing each P_j. We use Algorithm 2 for this purpose. Let j⋆ be an index such that an element of probability 1/√n would belong to R_{j⋆}. The algorithm is based on the following facts:

• For j < j⋆, if P_j is not negligible, R_j must be large, and a good additive estimate to P_j can be obtained by sampling Õ(·) uniformly random elements of the domain (the exact sample-size bound is truncated in this extract) and computing the weight of those that belong to R_j.

• If p = q, we are likely to learn all elements in R_j, j ≥ j⋆, by sampling only Õ(√n) elements of q. This gives the exact value of each P_j, j ≥ j⋆. If p ≠ q, this method still gives lower bounds for each P_j.
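The two facts above can be sketched as follows (an illustrative simplification with hypothetical helper names and parameters, not the paper's exact Algorithm 2):

```python
import random

# Two estimation regimes for the bucket weights P_j = sum_{i in R_j} p_i:
#   heavy buckets (j >= j_star): sum p_i over distinct sampled elements
#     of q -- exact if p = q and the samples cover the bucket, and a
#     lower bound on P_j otherwise;
#   light buckets (j < j_star): sample uniform domain elements and
#     rescale the weight of the hits landing in R_j.

def estimate_heavy_P(p, bucket_of, q_samples, k, j_star):
    est = [0.0] * (k + 1)
    for i in set(q_samples):            # distinct elements only
        j = bucket_of(i)
        if j >= j_star:
            est[j] += p(i)              # i's exact contribution to P_j
    return est                          # lower bounds for j >= j_star

def estimate_light_P(p, bucket_of, n, k, j_star, num_uniform):
    est = [0.0] * (k + 1)
    for _ in range(num_uniform):
        i = random.randrange(n)         # uniform element of the domain
        j = bucket_of(i)
        if j < j_star:
            est[j] += p(i)
    # E[est[j]] = num_uniform * P_j / n, so rescale by n / num_uniform.
    return [e * n / num_uniform for e in est]
```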

If ‖(P_0, …, P_k) − (Q_0, …, Q_k)‖_1 ≥ δ, our estimates for P_j and Q_j are likely to be sufficiently different. A detailed proof follows.

Lemma 1. Algorithm 2, with appropriately chosen constants, tells p = q (Case 1) from ‖(P_0, …, P_k) − (Q_0, …, Q_k)‖_1 ≥ δ (Case 2) with probability 9/10.

The multiplicative constant in the sample size of Step 1 is such that Step 1 succeeds with probability 99/100. The size of S_1 is chosen such that, with probability 99/100, S_1 contains all elements i of probability roughly 1/√n or larger (the exact threshold is garbled in this extract); this follows from the coupon collector's problem.

[Algorithm 2: Telling p = q (Case 1) from ‖(P_0, …, P_k) − (Q_0, …, Q_k)‖_1 ≥ δ (Case 2). Its pseudocode is garbled in this extract; its final steps return "Case 2" when an estimated difference exceeds δ/(4k+4), and "Case 1" otherwise.]

Finally, the size of S_2 is chosen such that, with probability 99/100, for each j < j⋆, the following estimate is accurate (the displayed guarantee is truncated in this extract).

To see this, let us first focus on j < j⋆ such that P_j ≥ δ/(16(k+1)). Note that each i ∈ S_2 contributes a value in [0, 1/√n] to Σ_{i∈U_j} p_i. By the Chernoff bound, Õ(·) · log k samples (the leading factor is garbled in this extract) suffice to estimate P_j with multiplicative error 1 + δ/(8k+8) with probability 1 − 1/(200k), which implies additive error at most δ/(8k+8) as well. For j < j⋆ such that P_j < δ/(16(k+1)), the Chernoff bound still guarantees, with the same probability, that the estimate is less than δ/(8k+8).

If p = q, then Algorithm 2 discovers this with probability 97/100 due to the following facts. Firstly, T_j = R_j for j ≥ j⋆, so Σ_{i∈T_j} p_i = P_j. Therefore, provided all Q′_j are good approximations to the corresponding Q_j, q always passes Step 6. Secondly, if all Σ_{i∈U_j} p_i / |S_2|, 0 ≤ j < j⋆, are good approximations of the corresponding P_j, q always passes Step 10 as well.
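The Chernoff step above can be sanity-checked with one standard multiplicative form of the bound (a textbook parameterization with illustrative constants, not the paper's): for i.i.d. contributions in [0, B] with per-sample mean μ, about 3·B·ln(2/η)/(ρ²·μ) samples give relative error ρ except with probability η.

```python
import math

# Standard multiplicative Chernoff sample bound (illustrative constants).
def chernoff_samples(B, mu, rho, eta):
    return math.ceil(3 * B * math.log(2 / eta) / (rho ** 2 * mu))

k, delta = 20, 0.1
rho = delta / (8 * k + 8)          # target error, as in the argument above
eta = 1 / (200 * k)                # per-bucket failure probability

def samples_for(n):
    B = 1 / math.sqrt(n)           # each p_i in a light bucket is <= 1/sqrt(n)
    P_j = delta / (16 * (k + 1))   # smallest non-negligible bucket weight
    return chernoff_samples(B, P_j / n, rho, eta)

# B / mu = sqrt(n) / P_j, so the bound scales as sqrt(n) * poly(k/delta):
# quadrupling the domain size only doubles the required sample count.
print(samples_for(1_000_000), samples_for(4_000_000))
```

Up to constants, this is why the uniform-sampling estimates fit within the Õ(√n · poly(1/ε)) budget whenever P_j ≥ δ/(16(k+1)).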

If ‖(P_0, …, P_k) − (Q_0, …, Q_k)‖_1 ≥ δ, then there is an index j′ for which |P_{j′} − Q_{j′}| is large (the precise statement is truncated in this extract).

If j′ ≥ j⋆, then because P_j is always greater than or equal to Σ_{i∈T_j} p_i, the tester concludes in Step 6 for j = j′ that Case 2 occurs, provided Q′_{j′} is a good approximation to Q_{j′}, which happens with probability at least 99/100. If j′ < j⋆, then because we have good approximations to both Q_{j′} and P_{j′} with probability 98/100, and their distance is at least δ/(2k) − 2·δ/(8k+8) > δ/(4k+4), the algorithm concludes in Step 10 for j = j′ that Case 2 occurs. To obtain an efficient tester, we replace Steps 2-4 of Algorithm 1.

…(Full text truncated)…

📸 Image Gallery

cover.png

Reference

This content is AI-processed based on ArXiv data.
