Testing Consistency of Two Histograms


Several approaches to testing the hypothesis that two histograms are drawn from the same distribution are investigated. We note that single-sample continuous distribution tests may be adapted to this two-sample grouped data situation. The difficulty of not having a fully-specified null hypothesis is an important consideration in the general case, and care is required in estimating probabilities with "toy" Monte Carlo simulations. The performance of several common tests is compared; no single test performs best in all situations.


💡 Research Summary

The paper investigates a variety of statistical methods for testing whether two histograms are drawn from the same underlying distribution. The authors model each histogram as a multivariate Poisson process with identical bin boundaries, denoting the bin counts of the first histogram by U_i and those of the second by V_i, with means μ_i and ν_i respectively. Two null hypotheses are considered: (H₀) the bin‑by‑bin means are equal, and (H′₀) the shapes of the distributions are equal up to an overall normalization factor.

For large‑count bins the Poisson variables are approximated by normal distributions, leading to the test statistic
 T = Σ_i (U_i − V_i)² / (U_i + V_i).
If the variances σ_i² are estimated by the observed sums (U_i+V_i), T follows approximately a χ² distribution with k degrees of freedom for an absolute comparison, or k‑1 degrees of freedom when the histograms are first rescaled to a common total count N = (N_u+N_v)/2 for a shape‑only comparison. The authors note that this approximation becomes conservative when many bins contain few events.
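The statistic and its χ² p-value can be sketched as follows; the bin counts below are illustrative, not the paper's data, and empty bins are simply skipped:

```python
# Sketch of the chi-square histogram-consistency test described above.
import numpy as np
from scipy.stats import chi2

def chi2_hist_test(u, v, shape_only=False):
    """T = sum (u_i - v_i)^2 / (u_i + v_i), with the observed sums
    used as variance estimates.

    shape_only: rescale both histograms to the mean total count
    N = (N_u + N_v)/2 first, losing one degree of freedom.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    if shape_only:
        n = 0.5 * (u.sum() + v.sum())
        u = u * n / u.sum()
        v = v * n / v.sum()
    mask = (u + v) > 0                      # skip empty bins
    t = np.sum((u[mask] - v[mask]) ** 2 / (u[mask] + v[mask]))
    dof = int(mask.sum()) - (1 if shape_only else 0)
    return t, chi2.sf(t, dof)

u = [10, 22, 35, 28, 12]                    # made-up example counts
v = [ 8, 25, 30, 31, 10]
print(chi2_hist_test(u, v))                 # absolute comparison
print(chi2_hist_test(u, v, shape_only=True))  # shape-only comparison
```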

The paper applies these procedures to a concrete example (Fig. 1) where the two histograms contain 492 and 424 total counts and were generated with a 10 % difference in expected bin contents. Table I shows χ²‑based p‑values of 0.86–0.96 for both absolute and shape tests, which agree closely with p‑values obtained from “toy” Monte‑Carlo simulations that explicitly sample the Poisson model under the null. This agreement suggests that, in moderate‑to‑high statistics regimes, the χ² approximation is acceptable.
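A toy Monte Carlo of this kind can be sketched as below. It plugs the average of the observed counts in as the common bin means under the null, which is only an estimate; that simplification is precisely the difficulty the paper highlights:

```python
# Toy Monte Carlo estimate of the p-value for the absolute test.
import numpy as np

rng = np.random.default_rng(0)

def toy_mc_pvalue(u, v, n_toys=20000):
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    mu = 0.5 * (u + v)              # estimated common bin means (a plug-in)
    mask = (u + v) > 0
    t_obs = np.sum((u[mask] - v[mask]) ** 2 / (u[mask] + v[mask]))
    # Sample both histograms from Poisson(mu) and recompute T each time.
    uu = rng.poisson(mu, size=(n_toys, len(mu)))
    vv = rng.poisson(mu, size=(n_toys, len(mu)))
    s = uu + vv
    with np.errstate(divide="ignore", invalid="ignore"):
        t_toys = np.where(s > 0, (uu - vv) ** 2 / s, 0.0).sum(axis=1)
    return float(np.mean(t_toys >= t_obs))

print(toy_mc_pvalue([10, 22, 35, 28, 12], [8, 25, 30, 31, 10]))
```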

A key conceptual difficulty highlighted is that the null hypothesis is not fully specified: the true means μ_i and ν_i are unknown and must be estimated, typically by maximum‑likelihood (i.e., the observed counts). This estimation introduces uncertainty, especially in low‑count bins, and can lead to either overly conservative or anti‑conservative behavior. To address this, the authors propose a conjecture: for large values of the test statistic T_c, the true tail probability P(T > T_c) is bounded above by the χ² tail probability with the same degrees of freedom. They provide a heuristic derivation for the single‑bin case, showing that the ratio of the exact Poisson tail to the χ² tail decays to zero as T grows, supporting the conjecture. Consequently, using χ²‑based p‑values tends to reject the null less often than the exact test would—i.e., the approach is conservative for large discrepancies.
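The single-bin comparison behind this heuristic can be checked numerically: for an assumed common mean μ, enumerate Poisson pairs (U, V), compute the exact tail probability of T = (U − V)²/(U + V), and compare it with the χ²(1) tail. A minimal sketch:

```python
# Exact single-bin Poisson tail of T versus the chi2(1) tail.
import numpy as np
from scipy.stats import chi2, poisson

def exact_vs_chi2_tail(mu, t_c, kmax=200):
    """Return (exact P(T >= t_c), chi2(1) tail) for U, V ~ Poisson(mu),
    enumerating counts up to kmax (adequate for small mu)."""
    k = np.arange(kmax + 1)
    pk = poisson.pmf(k, mu)
    U, V = np.meshgrid(k, k, indexing="ij")
    s = U + V
    with np.errstate(divide="ignore", invalid="ignore"):
        t = np.where(s > 0, (U - V) ** 2 / s, 0.0)
    p_exact = float((pk[:, None] * pk[None, :])[t >= t_c].sum())
    return p_exact, float(chi2.sf(t_c, 1))

print(exact_vs_chi2_tail(5.0, 4.0))
```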

Beyond the χ² family, the authors evaluate several non‑parametric two‑sample tests traditionally used for continuous data: Kolmogorov–Smirnov, Cramér–von‑Mises, and Anderson–Darling. While these tests can be applied to binned data, the discrete nature of histograms complicates the exact null distribution, and the resulting p‑values (Table I) are generally less informative.
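For binned data the Kolmogorov–Smirnov statistic reduces to the maximum absolute difference between the cumulative bin fractions of the two histograms. A minimal sketch of the distance itself follows; as noted above, the standard continuous-data p-value formula does not apply to discrete (binned) data:

```python
# KS-type distance between two histograms with common bin boundaries.
import numpy as np

def binned_ks_distance(u, v):
    """Maximum absolute difference of the cumulative bin fractions."""
    cu = np.cumsum(u) / np.sum(u)
    cv = np.cumsum(v) / np.sum(v)
    return float(np.max(np.abs(cu - cv)))

print(binned_ks_distance([10, 22, 35, 28, 12], [8, 25, 30, 31, 10]))
```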

A geometric alternative, the Bhattacharyya distance measure (BDM), treats each histogram as a unit vector in k‑dimensional space with components √(U_i/N_u) and computes the dot product Σ_i √(U_i V_i)/√(N_u N_v). For the example data BDM = 0.986, yielding a p‑value of 0.97, indicating very high similarity. This statistic is closely related to the χ² expression but emphasizes directional alignment rather than magnitude differences.
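A sketch of the BDM as the dot product of the unit vectors with components √(U_i/N_u) and √(V_i/N_v):

```python
# Bhattacharyya distance measure between two histograms.
import numpy as np

def bdm(u, v):
    """Dot product of the unit vectors sqrt(u_i/N_u) and sqrt(v_i/N_v);
    equals 1 for identical shapes, 0 for non-overlapping ones."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.sum(np.sqrt(u * v)) / np.sqrt(u.sum() * v.sum()))

print(bdm([10, 22, 35, 28, 12], [8, 25, 30, 31, 10]))
```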

The authors also discuss a pragmatic “bin‑combining” strategy: adjacent bins are merged until each combined bin contains at least a user‑specified minimum count (minBin). Figure 3 illustrates how the test statistic T and its associated p‑value evolve as minBin is increased. While this improves the validity of the normal approximation, it can reduce statistical power by decreasing the effective number of degrees of freedom.
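The bin-combining idea can be sketched as below. The particular merge rule here (scan left to right on the summed histogram, fold any small remainder into the last bin) is a hypothetical choice for illustration; the paper's exact minBin procedure is not reproduced:

```python
# Merge adjacent bins until each combined bin of u+v holds >= min_count.
def combine_bins(u, v, min_count=10):
    cu, cv = [], []
    acc_u = acc_v = 0
    for a, b in zip(u, v):
        acc_u += a
        acc_v += b
        if acc_u + acc_v >= min_count:
            cu.append(acc_u)
            cv.append(acc_v)
            acc_u = acc_v = 0
    if acc_u or acc_v:                  # fold remainder into the last bin
        if cu:
            cu[-1] += acc_u
            cv[-1] += acc_v
        else:
            cu.append(acc_u)
            cv.append(acc_v)
    return cu, cv

print(combine_bins([1, 2, 3, 10], [0, 1, 2, 9], min_count=5))
```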

For testing overall normalization (total counts) independently of shape, the paper derives a uniformly most powerful test based on the binomial distribution of N_v given N_u+N_v = N. Under H₀: μ_T = ν_T, the conditional distribution of N_v is Binomial(N, ½). Applying this to the example (N = 916, N_v = 424) yields a two‑sided p‑value of 0.027, consistent with the normal‑approximation estimate of 0.025.
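The exact binomial test on the total counts is easy to reproduce with the paper's numbers (N = 916, N_v = 424). This sketch uses the common "sum of outcome probabilities no larger than that of the observed outcome" two-sided convention, implemented with only the standard library:

```python
# Exact two-sided binomial test for H0: N_v ~ Binomial(N, 1/2).
from math import comb

def binom_two_sided_p(k, n):
    """Two-sided p-value: total probability of all outcomes at least
    as improbable as the observed count k under Binomial(n, 1/2)."""
    pmf = [comb(n, i) * 0.5 ** n for i in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-12)       # tolerance for float ties
    return sum(p for p in pmf if p <= cutoff)

print(binom_two_sided_p(424, 916))
```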

In summary, the study demonstrates that no single test dominates across all scenarios. χ²‑based methods work well when bin counts are moderate to large, but become overly conservative in low‑count regimes. Non‑parametric tests suffer from discreteness issues, while geometric measures like BDM provide an intuitive alternative for shape comparison. The authors stress the importance of Monte‑Carlo validation when the null hypothesis is not fully specified and recommend selecting the test based on data characteristics (total counts, bin occupancy, interest in absolute vs. relative normalization). The paper’s comprehensive comparison offers practical guidance for physicists and other scientists who routinely need to assess histogram consistency.

