The Minimax Risk in Testing Uniformity over Large Alphabets under Missing-Ball Alternatives
We study the problem of testing the goodness of fit of categorical count data to a Poisson distribution uniform over the categories, against a class of alternatives defined by excluding an $\ell_p$ ball, $p \leq 2$, of radius $\epsilon$ around the uniform rate sequence. We characterize the minimax risk for this problem as the expected number of samples $n$ and the number of categories $N$ go to infinity. Our result enables constant-factor comparisons among the many estimators previously proposed for this problem, rather than comparisons only at the level of convergence rates or scaling orders of sample complexity. The minimax test relies exclusively on collisions in the small sample limit, but behaves like the chi-squared test otherwise. Empirical studies across a range of parameters show that the asymptotic risk estimate is accurate in finite samples, and that the minimax test outperforms both the chi-squared test and a test based on collisions under the least favorable alternative. Our analysis involves a reduction to a structured subset of alternatives, establishing uniform asymptotic normality for a family of linear test statistics, and solving an optimization problem over $N$-dimensional sequences akin to classical results from signal detection in Gaussian white noise. Finally, we discuss the connection to the fixed-sample-size multinomial model, arguing that the Poisson minimax risk derived here also characterizes the minimax risk of the multinomial problem.
💡 Research Summary
The paper addresses the fundamental problem of testing uniformity of categorical count data when both the expected total sample size n and the number of categories N grow to infinity. The observations O₁,…,O_N are modeled as independent Poisson random variables with means n Q_i, where the null hypothesis H₀ asserts that the rate vector Q equals the uniform vector U = (1/N,…,1/N). The alternative set V_ε consists of all rate vectors that lie outside an ℓₚ‑ball (p ≤ 2) of radius ε centered at U. This “missing‑ball” formulation captures a broad class of non‑smooth, high‑dimensional alternatives that have been largely unexplored in the literature.
The main contribution is a precise asymptotic characterization of the minimax risk R*(V_ε) in the regime N ≫ n (the high‑dimensional setting). The authors prove that when the risk converges to a non‑trivial constant c ∈ (0,1), it satisfies
R*(V_ε) = 2 Φ(−u_{ε,n,N,p}) + o(1),
where Φ denotes the standard normal CDF and
u_{ε,n,N,p}² = ε⁴ n² N^{3−4/p} / 2.
Thus the risk is completely determined by a single signal‑to‑noise‑type quantity u_{ε,n,N,p}, which combines the sample size, alphabet size, ℓₚ‑radius, and the norm parameter p. The result holds under two natural sufficient conditions: (i) u_{ε,n,N,p} converges to a finite constant while N = o(n²) or ε N^{1−1/p}=o(1); or (ii) the alternative set is intersected with small ℓ_∞ and ℓ₂ neighborhoods of the uniform vector, ensuring that the problem is not trivially separable.
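As a quick numerical sanity check, the asymptotic risk formula can be evaluated directly. A minimal sketch (note: the N^{3−4/p} factor is taken in the numerator here — the reading under which p = 1 recovers the familiar √N/ε² sample complexity for ℓ₁ uniformity testing; the parameter values are illustrative):

```python
from math import erf, sqrt

def std_normal_cdf(x):
    # Phi(x), the standard normal CDF, via the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def asymptotic_minimax_risk(n, N, eps, p):
    """Asymptotic minimax risk 2*Phi(-u), with
    u^2 = eps^4 * n^2 * N^(3 - 4/p) / 2 (exponent in the numerator:
    for p = 1, u stays constant when n is of order sqrt(N)/eps^2)."""
    u = sqrt(eps ** 4 * n ** 2 * N ** (3.0 - 4.0 / p) / 2.0)
    return 2.0 * std_normal_cdf(-u)

# p = 1 example: n of order sqrt(N)/eps^2 keeps the risk non-trivial
print(asymptotic_minimax_risk(n=10_000, N=1_000_000, eps=0.5, p=1))
```

Doubling n in this example drives the risk rapidly toward zero, consistent with u growing linearly in n.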
A key methodological insight is that the optimal test can be expressed as a linear functional of the histogram of counts, i.e., T = ∑_{m≥0} a_m X_m, where X_m = ∑_{i=1}^N 1{O_i = m}. The authors establish uniform asymptotic normality for this family of statistics, despite the fact that the raw Poisson counts are not jointly normal. By carefully choosing the coefficients a_m, the test interpolates between two classical regimes: in the “small‑sample” limit (n/N → 0) it reduces to a collision‑based statistic (essentially counting the number of doubletons), while for larger n it behaves like the familiar chi‑squared statistic. This adaptive behavior yields the minimax risk given above.
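Statistics in this histogram-linear family are cheap to compute in practice. A minimal NumPy sketch (the optimal coefficients a_m are not reproduced here; the block only instantiates the two endpoints mentioned above — the doubleton count X₂ and the chi-squared statistic — with illustrative n and N):

```python
import numpy as np

rng = np.random.default_rng(0)

def histogram_of_counts(O):
    """X_m = number of categories whose count equals m."""
    return np.bincount(O)

def linear_statistic(O, a):
    """T = sum_m a_m * X_m for a (truncated) coefficient sequence a."""
    X = histogram_of_counts(O)
    m = min(len(X), len(a))
    return float(np.dot(a[:m], X[:m]))

# Counts drawn under the uniform null: n expected samples over N categories
n, N = 500, 10_000
O = rng.poisson(n / N, size=N)

# Endpoint 1: the doubleton (collision) count X_2
collisions = linear_statistic(O, np.array([0.0, 0.0, 1.0]))

# Endpoint 2: the chi-squared statistic, also linear in the histogram,
# since sum_i (O_i - lam)^2 / lam = sum_m X_m * (m - lam)^2 / lam
lam = n / N
ms = np.arange(O.max() + 1)
chi_sq = linear_statistic(O, (ms - lam) ** 2 / lam)
print(collisions, chi_sq)
```

The chi-squared rewrite in the last step is the reason the whole family can be analyzed at once: each classical test corresponds to one choice of the sequence a_m.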
From a Bayesian perspective, the least‑favorable prior is shown to be a product of symmetric two‑point distributions on each coordinate, placing mass at 1/N ± ε N^{-1/p}. Under this prior, the Bayes risk coincides exactly with the minimax risk, confirming the optimality of the linear test. The authors also derive the asymptotic risk of the standard chi‑squared test under the same least‑favorable prior, obtaining
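The least‑favorable prior is straightforward to sample from, and each draw lands exactly on the boundary of the missing ℓ_p ball. A short sketch (parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def least_favorable_rates(N, eps, p):
    """Draw Q from the two-point product prior: each coordinate is
    1/N + s_i * eps * N**(-1/p), with independent signs s_i = +/-1."""
    signs = rng.choice([-1.0, 1.0], size=N)
    return 1.0 / N + signs * eps * N ** (-1.0 / p)

N, eps, p = 100_000, 0.3, 1
Q = least_favorable_rates(N, eps, p)

# The draw sits on the boundary of the missing ell_p ball: ||Q - U||_p = eps
print(np.sum(np.abs(Q - 1.0 / N) ** p) ** (1.0 / p))
```

Note the constraint eps * N**(-1/p) < 1/N is needed for all rates to stay positive, which is why the perturbation shrinks as the alphabet grows.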
R_{χ²} = 2 Φ(−u_{ε,n,N,p} / √(1 + N/(2n))) + o(1),
which is strictly larger than the minimax risk unless n/N → ∞. This explains why the chi‑squared test, though minimax in the regime N = o(n) or for smooth alternatives, fails to achieve optimal performance when the alphabet is much larger than the sample size.
Extensive simulations for p = 1 illustrate the accuracy of the asymptotic formula (the empirical risk matches 2 Φ(−u) within a few percent) and demonstrate that the proposed minimax test uniformly dominates both the chi‑squared test and a pure collision test across a wide range of (n, N, ε) configurations. Notably, even when n/N ≈ 0.01 the minimax test maintains a risk around 0.18, whereas the chi‑squared test’s risk exceeds 0.30.
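A stripped‑down version of such an experiment is easy to reproduce. The sketch below compares only the two baseline tests (collisions vs. chi‑squared) under a single draw from the two‑point prior, with small illustrative parameters rather than the paper's settings, and scores each test by a midpoint‑threshold risk proxy (an assumption for brevity, not the paper's protocol):

```python
import numpy as np

rng = np.random.default_rng(2)

def chi_sq_stat(O, lam):
    # Chi-squared statistic per replicate (rows of O)
    return np.sum((O - lam) ** 2 / lam, axis=-1)

def collision_stat(O):
    # Pairwise collision count per replicate: sum_i O_i*(O_i - 1)/2
    return np.sum(O * (O - 1) / 2.0, axis=-1)

def empirical_risk(t0, t1):
    """Type-I + type-II error at the threshold midway between the two
    empirical means (an illustrative proxy, not an optimized threshold)."""
    thr = 0.5 * (t0.mean() + t1.mean())
    return np.mean(t0 > thr) + np.mean(t1 <= thr)

# Illustrative high-dimensional regime: n much smaller than N
n, N, eps, p, reps = 200, 10_000, 1.0, 1, 500
lam = n / N
delta = eps * N ** (-1.0 / p)
signs = rng.choice([-1.0, 1.0], size=N)
Q = 1.0 / N + signs * delta          # one draw from the two-point prior

O0 = rng.poisson(lam, size=(reps, N))   # replicates under the null
O1 = rng.poisson(n * Q, size=(reps, N)) # replicates under the alternative

print("collision risk:  ", empirical_risk(collision_stat(O0), collision_stat(O1)))
print("chi-squared risk:", empirical_risk(chi_sq_stat(O0, lam), chi_sq_stat(O1, lam)))
```

With n/N = 0.02, the collision test's risk comes out well below the chi-squared test's, in line with the n ≪ N behavior reported above.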
Finally, the paper connects the Poisson setting to the fixed‑sample‑size multinomial model. By conditioning on the total count ∑ O_i = n, the Poisson vector becomes multinomial with cell probabilities proportional to Q. Using de‑Poissonization techniques and conditional central limit theorems, the authors argue that the same minimax risk expression applies to the multinomial problem, thereby extending the relevance of their results to the most common practical scenario of categorical data with a fixed total count.
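The conditioning step behind this correspondence is easy to illustrate numerically: rejection‑sampling Poisson vectors on the event ∑ O_i = n reproduces multinomial statistics (toy cell probabilities assumed):

```python
import numpy as np

rng = np.random.default_rng(3)

# Tiny illustration of Poissonization: independent Poissons with means n*Q_i,
# conditioned on their total being n, are multinomial(n, Q).
N, n = 3, 5
Q = np.array([0.5, 0.3, 0.2])            # toy cell probabilities (assumed)
lam = n * Q

draws = rng.poisson(lam, size=(200_000, N))
kept = draws[draws.sum(axis=1) == n]     # condition on total count == n

multi = rng.multinomial(n, Q, size=len(kept))

# Both mean vectors should be close to n*Q = [2.5, 1.5, 1.0]
print(kept.mean(axis=0), multi.mean(axis=0))
```

Rejection sampling is wasteful for large n and N, which is why the paper instead relies on de‑Poissonization arguments, but the tiny case makes the distributional identity concrete.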
In summary, this work delivers a full constant‑level minimax analysis for uniformity testing over large alphabets under missing‑ball alternatives, identifies the optimal linear test statistic, characterizes the least‑favorable prior, quantifies the sub‑optimality of classical chi‑squared testing in the high‑dimensional regime, and bridges Poisson and multinomial models. The findings provide a rigorous benchmark for evaluating existing and future uniformity tests in modern high‑dimensional applications such as large‑scale sensor arrays, genomic count data, and massive categorical logs.