Chi-square and classical exact tests often wildly misreport significance; the remedy lies in computers

Notice: This research summary and analysis were automatically generated using AI. For accuracy, please refer to the original arXiv source.

If a discrete probability distribution in a model being tested for goodness-of-fit is not close to uniform, then forming the Pearson chi-square statistic can involve division by nearly zero. This often leads to serious trouble in practice – even in the absence of round-off errors – as the present article illustrates via numerous examples. Fortunately, with the now widespread availability of computers, avoiding all the trouble is simple and easy: without the problematic division by nearly zero, the actual values taken by goodness-of-fit statistics are not humanly interpretable, but black-box computer programs can rapidly calculate their precise significance.


💡 Research Summary

The paper addresses a fundamental flaw in the widely used Pearson chi‑square (χ²) goodness‑of‑fit test when applied to discrete probability models whose expected cell probabilities are far from uniform. In such cases the χ² statistic involves division by very small expected probabilities, effectively “division by nearly zero,” which can dramatically inflate the statistic and produce wildly inaccurate significance levels, even in the absence of round‑off errors. The authors demonstrate this problem through a series of synthetic and real‑world examples, showing that χ² can severely over‑reject the null hypothesis whenever a few cells have tiny expected counts.
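A small numeric sketch makes the failure mode concrete. The model and counts below are illustrative, not taken from the paper: one cell holds almost all the probability mass, so a fluctuation of just three observations in a rare cell dominates the Pearson statistic through the division by a near-zero expected count.

```python
import numpy as np

# Hypothetical 3-cell model: nearly all mass in one cell, two very rare cells.
# (Illustrative numbers, not from the paper.)
p = np.array([0.998, 0.001, 0.001])
n = 1000                      # sample size; expected counts are n * p = [998, 1, 1]

def pearson_chi2(counts, p, n):
    """Pearson chi-square: sum of (observed - expected)^2 / expected."""
    expected = n * p
    return float(np.sum((counts - expected) ** 2 / expected))

# A mild fluctuation: three extra observations land in one rare cell.
counts = np.array([995, 4, 1])
stat = pearson_chi2(counts, p, n)
print(stat)  # the rare cell alone contributes (4 - 1)^2 / 1 = 9
```

Against the asymptotic χ² reference with two degrees of freedom, a statistic near 9 looks highly significant, even though only 3 of 1000 observations moved.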

To remedy the issue, the authors propose abandoning the weighted average that underlies χ² and instead using the unweighted root‑mean‑square (RMS) statistic

 X = √( ∑ₖ (qₖ − pₖ(θ̂))² )

where qₖ are the observed relative frequencies and pₖ(θ̂) are the model probabilities evaluated at the maximum‑likelihood estimate θ̂. Because the RMS does not divide by the expected probabilities, it remains well‑behaved even when some pₖ are close to zero. They also discuss the log‑likelihood‑ratio statistic G² and the Freeman‑Tukey (Hellinger) statistic H² as alternative, similarly robust measures.
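The three statistics above can be sketched directly in terms of the observed relative frequencies qₖ and model probabilities pₖ. This is a minimal illustration (function names and the toy data are my own, not the paper's code); note that only the χ² statistic divides by pₖ.

```python
import numpy as np

def rms_stat(q, p):
    """The paper's X: square root of the unweighted sum of squared differences."""
    return float(np.sqrt(np.sum((q - p) ** 2)))

def chi2_stat(q, p, n):
    """Pearson chi-square, written in terms of relative frequencies."""
    return float(n * np.sum((q - p) ** 2 / p))

def g2_stat(q, p, n):
    """Log-likelihood-ratio G^2 = 2n * sum_k q_k log(q_k / p_k), with 0 log 0 = 0."""
    mask = q > 0
    return float(2 * n * np.sum(q[mask] * np.log(q[mask] / p[mask])))

# Toy example (hypothetical numbers): 100 draws over three cells.
p = np.array([0.5, 0.3, 0.2])
counts = np.array([52, 28, 20])
n = counts.sum()
q = counts / n
print(rms_stat(q, p), chi2_stat(q, p, n), g2_stat(q, p, n))
```

Because no cell probability here is small, the three statistics tell a similar story; the differences emerge when some pₖ are tiny and the 1/pₖ weights in χ² blow up.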

A central methodological contribution is the use of Monte‑Carlo simulation to compute exact p‑values for any of these statistics. By generating many synthetic data sets under the fitted model, the empirical distribution of the test statistic is obtained, allowing precise significance assessment without relying on asymptotic χ² approximations. The authors implement the simulations in C using a “multiply‑with‑carry” random‑number generator to ensure reproducibility and to provide guaranteed error bounds on the estimated p‑values.
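The simulation scheme can be sketched as follows for a fully specified model (no parameter estimation; when θ̂ is fitted, the paper re-fits the parameters within each simulated data set, which is omitted here for brevity). The function name, defaults, and example data are illustrative; the paper's own implementation is in C with a multiply-with-carry generator.

```python
import numpy as np

def mc_pvalue(counts, p, num_sims=10_000, seed=0):
    """Estimate the exact p-value of the RMS statistic by Monte Carlo:
    draw num_sims multinomial data sets from the model p and count how
    often the simulated statistic is at least as large as the observed one."""
    rng = np.random.default_rng(seed)
    n = counts.sum()
    q_obs = counts / n
    stat_obs = np.sqrt(np.sum((q_obs - p) ** 2))
    sims = rng.multinomial(n, p, size=num_sims) / n       # simulated q's
    stats = np.sqrt(np.sum((sims - p) ** 2, axis=1))
    # The "+1" correction keeps the estimate conservative and never exactly zero.
    return (np.sum(stats >= stat_obs) + 1) / (num_sims + 1)

p = np.array([0.7, 0.2, 0.1])
pv_good = mc_pvalue(np.array([70, 20, 10]), p)   # perfect fit: p-value 1.0
pv_bad = mc_pvalue(np.array([50, 40, 10]), p)    # gross misfit: tiny p-value
print(pv_good, pv_bad)
```

The Monte-Carlo error of such an estimate shrinks like 1/√num_sims, which is what allows the guaranteed error bounds mentioned above.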

The paper’s empirical section (Section 4) applies the methods to a wide variety of data sets: synthetic examples with extreme sparsity, Zipf’s law for word frequencies, radioactive decay counts, blood‑type distributions, health‑assessment surveys, butterfly species counts, and religious affiliation data. In each case the χ² statistic either yields absurdly small p‑values (falsely rejecting the model) or, after ad‑hoc re‑binning, becomes difficult to interpret. By contrast, the RMS (and often G²) statistics give sensible p‑values that correctly reflect the degree of model misfit. The authors emphasize that re‑binning to avoid small expected counts is a “black art” that can bias results, whereas RMS works directly on the original bins.

Section 5 conducts a systematic power analysis. Various families of discrete distributions are examined, including truncated power‑law, modified Poisson, and shifted geometric laws, both with and without parameter estimation. Across all scenarios, the RMS test consistently exhibits higher power than χ², especially when expected counts per cell are five or fewer. The authors also show that RMS and χ² coincide only when the model is uniform; otherwise RMS dominates in terms of both power and robustness.

The discussion clarifies that while the discrete Kolmogorov–Smirnov or Kuiper statistics can be more powerful in certain settings, the RMS statistic is attractive because it is simple, does not require re‑binning, and its asymptotic distribution matches that of χ² when the model holds. The paper concludes that, given today’s ubiquitous computational resources, there is no longer any justification for relying on the classical χ² test for discrete goodness‑of‑fit problems. Instead, practitioners should adopt RMS (or G²) together with Monte‑Carlo derived exact p‑values, thereby obtaining reliable inference even in highly sparse, non‑uniform settings.

