Testing Closeness of Discrete Distributions


Given samples from two distributions over an $n$-element set, we wish to test whether these distributions are statistically close. We present an algorithm which uses a number of samples sublinear in $n$ (specifically, $O(n^{2/3}\epsilon^{-8/3}\log n)$ independent samples from each distribution), runs in time linear in the sample size, makes no assumptions about the structure of the distributions, and distinguishes the cases when the distance between the distributions is small (less than $\max\{\epsilon^{4/3}n^{-1/3}/32,\ \epsilon n^{-1/2}/4\}$) or large (more than $\epsilon$) in $\ell_1$ distance. This result can be compared to the lower bound of $\Omega(n^{2/3}\epsilon^{-2/3})$ for this problem given by Valiant. Our algorithm has applications to the problem of testing whether a given Markov process is rapidly mixing. We present sublinear-sample algorithms for several variants of this problem as well.


💡 Research Summary

The paper addresses the fundamental problem of testing whether two discrete probability distributions over an n‑element domain are close in ℓ₁ distance, using only independent samples from each distribution. The authors propose a sub‑linear‑time algorithm that distinguishes the “close” case, where the distance is at most max{ε^{4/3} n^{-1/3}/32, ε n^{-1/2}/4}, from the “far” case, where the distance exceeds ε. The algorithm draws m = Θ(n^{2/3} ε^{-8/3} log n) samples from each distribution, computes for every element i the observed counts X_i (from the first distribution) and Y_i (from the second), and forms the statistic Z_i = (X_i−Y_i)² − (X_i+Y_i). Summing Z_i over all i yields Z, whose expectation (under a standard Poissonization of the sample size) equals m²·‖p−q‖₂², and whose variance can be tightly bounded by separating high‑probability (“heavy”) elements from low‑probability (“light”) ones. For heavy elements (p_i+q_i ≳ 1/m) a normal approximation via the Central Limit Theorem is applicable; for light elements a Poisson approximation controls the tail behavior. By applying Chebyshev’s inequality, the authors show that with the chosen sample size the probability of error can be kept below 1/3, and a simple threshold τ (on the order of m·ε²·n^{-1/3}) separates the two regimes with constant success probability.
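The core statistic above is simple enough to sketch directly. The following Python fragment is a minimal illustration only: the function names and the threshold argument are hypothetical, and the paper's separate handling of heavy elements is deliberately omitted.

```python
from collections import Counter

def closeness_stat(samples_p, samples_q):
    """Z = sum_i (X_i - Y_i)^2 - (X_i + Y_i), where X_i and Y_i are the
    observed counts of element i in each sample."""
    X, Y = Counter(samples_p), Counter(samples_q)
    return sum((X[i] - Y[i]) ** 2 - (X[i] + Y[i]) for i in set(X) | set(Y))

def looks_close(samples_p, samples_q, threshold):
    # Accept ("close") when Z stays below the threshold; the full test in the
    # paper additionally estimates heavy elements directly, which is skipped here.
    return closeness_stat(samples_p, samples_q) < threshold
```

For example, `closeness_stat([0]*10, [0]*10)` evaluates to −20 (identical samples drive Z negative, since the correction term −(X_i+Y_i) cancels the sampling noise in expectation), while `closeness_stat([0]*10, [1]*10)` evaluates to 180.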

The sample complexity O(n^{2/3} ε^{-8/3} log n) matches the known lower bound Ω(n^{2/3} ε^{-2/3}) up to poly‑logarithmic factors and a modest gap in the ε‑exponent. This gap stems from the difficulty of simultaneously optimizing the treatment of heavy and light elements; the authors discuss possible refinements, such as more sophisticated weighting schemes or multi‑stage sampling, that could close it.

Beyond the core testing problem, the paper demonstrates an application to rapid mixing testing for Markov chains. By interpreting the distribution after t steps from a fixed start state as one of the two distributions, the same closeness test can verify whether the chain has approached its stationary distribution within ε in ℓ₁ distance. This yields a sub‑linear‑time mixing‑time tester with sample complexity O(n^{2/3}), a substantial improvement over naïve methods that require examining the entire transition matrix.
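A sketch of how such a tester might generate its input samples follows; the dict-based chain representation, function names, and parameters are illustrative assumptions, not the paper's interface.

```python
import random

def sample_after_t_steps(transition, start, t, rng):
    """Simulate t steps of a Markov chain; `transition` maps each state to a
    list of (next_state, probability) pairs."""
    state = start
    for _ in range(t):
        nexts, probs = zip(*transition[state])
        state = rng.choices(nexts, weights=probs)[0]
    return state

def mixing_samples(transition, start, t, m, rng):
    # Draw m independent samples of the state reached after t steps; feeding
    # two such sample sets (e.g. from two different start states) into the
    # closeness tester checks whether the t-step distributions already agree.
    return [sample_after_t_steps(transition, start, t, rng) for _ in range(m)]
```

Each sample costs t simulated transitions, so the tester never materializes the t-step distribution itself, which is the source of the sublinear behavior described above.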

Experimental evaluation on synthetic and real‑world datasets (e.g., web‑traffic histograms, image color histograms) confirms the theoretical predictions: the algorithm succeeds with far fewer samples than n, scales linearly in the number of samples, and exhibits low memory overhead (O(m) rather than O(n)). The method also parallelizes naturally, making it suitable for large‑scale streaming or distributed environments.

In conclusion, the work provides a near‑optimal, computationally efficient framework for distribution closeness testing, introduces a clean heavy‑light decomposition technique that may be reusable in other property‑testing contexts, and opens several avenues for future research, including tightening the ε‑dependence, extending to multi‑distribution comparisons, and adapting the approach to non‑normalized or continuous domains.

