On the statistical analysis of grouped data: when Pearson $χ^2$ and other divisible statistics are not goodness-of-fit tests
Each year, thousands of experiments are analyzed and papers published that involve the statistical analysis of grouped data. While this area of statistics is often perceived – somewhat naively – as saturated, several misconceptions still affect everyday practice, and new frontiers have so far remained unexplored. Researchers must be aware of the limitations affecting their analyses and of the new possibilities in their hands. Motivated by this need, the article introduces a unifying approach to the analysis of grouped data, which allows us to study the class of divisible statistics – which includes Pearson’s $χ^2$ and the likelihood ratio as special cases – with a fresh perspective. The contributions collected in this manuscript span from modeling and estimation to distribution-free goodness-of-fit tests. Perhaps the most surprising result presented here is that, in a sparse regime, all tests proposed in the literature are dominated by members of the class of weighted linear statistics.
💡 Research Summary
The paper addresses a long‑standing gap in the statistical analysis of grouped (binned) data, focusing on the performance of traditional goodness‑of‑fit (GOF) tests such as Pearson’s χ², the likelihood‑ratio (LR), and related “divisible” statistics when the data are sparse and high‑dimensional. The authors first formalize the class of divisible statistics as sums of a function g(ν(xₖ), mθ(xₖ)) over K cells, where ν(xₖ) denotes the observed count in cell k and mθ(xₖ) the corresponding expected count under a parametric model. They show that many classical GOF tests, as well as newer spectral and cumulative statistics, belong to this class.
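As a rough illustration of the definition above, a divisible statistic can be sketched as a sum of a per-cell function g over the observed and expected counts. The function names, the toy counts, and the Poisson-deviance form chosen for the likelihood-ratio term are assumptions made for this sketch and are not taken from the paper.

```python
import math

def divisible_statistic(g, observed, expected):
    """Sum g(nu_k, m_k) over all cells k = 1..K."""
    return sum(g(nu, m) for nu, m in zip(observed, expected))

def pearson_term(nu, m):
    # Per-cell contribution to Pearson's chi-square.
    return (nu - m) ** 2 / m

def lr_term(nu, m):
    # Per-cell likelihood-ratio contribution in Poisson-deviance form
    # (an assumed convention), with 0 * log(0) taken as 0.
    log_part = nu * math.log(nu / m) if nu > 0 else 0.0
    return 2.0 * (log_part - (nu - m))

# Hypothetical toy data: K = 4 cells, expected count 1.5 in each.
observed = [3, 0, 2, 1]
expected = [1.5, 1.5, 1.5, 1.5]

chi2 = divisible_statistic(pearson_term, observed, expected)
lr = divisible_statistic(lr_term, observed, expected)
print(f"Pearson chi-square: {chi2:.4f}")
print(f"Likelihood ratio:   {lr:.4f}")
```

Only the per-cell function changes between the two statistics, which is the sense in which both belong to the same class.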
A central contribution is the introduction of a unified asymptotic framework in which both the total expected count T and the number of cells K grow to infinity while their ratio c = T/K converges to a positive constant. Under this regime the expected counts per cell remain O(1) and the cell counts retain a Poisson distribution rather than converging to a Gaussian limit. This “sparse regime” captures realistic situations in text analysis, ecological species counts, large‑scale surveys, and high‑energy physics where many categories are observed only a few times.
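To make the sparse regime concrete, the following minimal simulation draws independent Poisson counts with a common mean c = T/K and checks how many cells stay empty. Equal expected counts across cells, the values of K and c, and the use of Knuth's multiplication method as the sampler are all assumptions of this sketch.

```python
import math
import random

def poisson_draw(rng, mean):
    """Draw one Poisson variate by Knuth's multiplication method (fine for small means)."""
    limit = math.exp(-mean)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

rng = random.Random(0)   # fixed seed so the run is reproducible
K = 20000                # number of cells (hypothetical)
c = 2.0                  # expected count per cell, c = T / K (hypothetical)

counts = [poisson_draw(rng, c) for _ in range(K)]
empty_fraction = sum(1 for n in counts if n == 0) / K
mean_count = sum(counts) / K

print(f"mean count per cell:   {mean_count:.3f} (target {c})")
print(f"fraction of empty cells: {empty_fraction:.3f} (Poisson predicts {math.exp(-c):.3f})")
```

A noticeable share of cells remains empty however large K becomes, which is why a Gaussian approximation per cell is not appropriate when c stays fixed.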
Within this setting the authors develop a Poisson brush model: the data are generated by a Poisson process N_T on a bounded region X with intensity λβ(x) scaled by T. The observed frequencies ν(xₖ) are the increments of N_T over a regular partition of X into K equal‑volume bins. They derive the asymptotic behavior of the entire class of divisible statistics, showing that when the model parameters θ = (c, β) are known, no single divisible statistic can uniformly detect all local alternatives – a result that mirrors earlier findings for fixed‑K asymptotics but now holds in the high‑dimensional sparse context.
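The data-generating mechanism described above can be sketched in one dimension. This example assumes X = [0, 1], a hypothetical density β(x) = 2x, and intensity T·β(x); the points are simulated by thinning a homogeneous Poisson process, a standard technique chosen here for convenience rather than taken from the paper.

```python
import random

def beta(x):
    # Hypothetical density on [0, 1]; not from the paper.
    return 2.0 * x

BETA_MAX = 2.0  # upper bound of beta on [0, 1]

def simulate_points(rng, T):
    """Simulate a Poisson process on [0, 1] with intensity T * beta(x) by thinning."""
    points = []
    rate = T * BETA_MAX                  # rate of the dominating homogeneous process
    x = rng.expovariate(rate)
    while x < 1.0:
        if rng.random() < beta(x) / BETA_MAX:
            points.append(x)             # keep the point with probability beta(x) / BETA_MAX
        x += rng.expovariate(rate)
    return points

def bin_counts(points, K):
    """Observed frequencies: increments of the process over K equal-width bins."""
    counts = [0] * K
    for x in points:
        counts[min(int(x * K), K - 1)] += 1
    return counts

def expected_counts(T, K):
    """Integral of T * 2x over each bin [k/K, (k+1)/K]."""
    return [T * (2 * k + 1) / K**2 for k in range(K)]

rng = random.Random(1)
T, K = 2000, 1000                        # c = T / K = 2 (hypothetical values)
points = simulate_points(rng, T)
nu = bin_counts(points, K)
m = expected_counts(T, K)
print(f"total observed: {sum(nu)}, total expected: {sum(m):.1f}")
```

The lists `nu` and `m` play the roles of ν(xₖ) and mθ(xₖ) and could be passed to any divisible statistic.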
The most striking theoretical result (Section 6.4) proves that, in the sparse regime, every test based on a divisible statistic is asymptotically dominated by a weighted linear statistic of the form
S_w = ∑_{k=1}^K w_k