Annotated Bibliography of Some Papers on Combining Significances or p-values


A question that comes up repeatedly is how to combine the results of two experiments if all that is known is that one experiment had an n-sigma effect and the other had an m-sigma effect. This question is not well-posed: depending on what additional assumptions are made, the preferred answer differs. This note lists some of the more prominent papers on the topic, with brief comments and excerpts.


💡 Research Summary

The paper is an annotated bibliography that surveys the statistical literature on how to combine the significance levels (or p‑values) reported by two independent experiments, each described only by an “n‑sigma” and an “m‑sigma” effect. The authors begin by emphasizing that the question is ill‑posed unless additional assumptions are made: the sigma notation presupposes a normal (Gaussian) test statistic, and the two experiments must be independent, have comparable error structures, and share a common null hypothesis. Without these conditions, simply adding the sigma values is mathematically unjustified.

The bibliography is organized around the most widely used combination methods, each accompanied by a brief description, the underlying assumptions, and representative citations. The first method discussed is Fisher’s combined probability test, which transforms each p‑value into –2 ln p and sums them; the resulting statistic follows a chi‑square distribution with 2k degrees of freedom (k = number of experiments). Fisher’s method is non‑directional and robust under the assumption of independent, uniformly distributed p‑values, but it discards information about the sign of the effect.
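Fisher's combination can be sketched in a few lines. This is a minimal stdlib-only illustration (the function name is ours, not from the bibliography); it exploits the fact that the chi-square survival function has a closed form for even degrees of freedom:

```python
import math

def fisher_combine(pvals):
    """Fisher's method: X = -2 * sum(ln p_i) ~ chi^2 with 2k df under H0."""
    x = -2.0 * sum(math.log(p) for p in pvals)
    k = len(pvals)
    # Chi-square survival function for even df = 2k has a closed form:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))

# Example: two independent experiments reporting p = 0.05 and p = 0.01
p_combined = fisher_combine([0.05, 0.01])  # roughly 0.0043
```

Note that the combined p-value (about 0.0043) is smaller than either input, but the method treats p = 0.05 from an upward fluctuation and from a downward one identically, which is the loss of sign information mentioned above.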

Next, Stouffer’s Z‑method (also called the inverse‑normal method) is presented. Here each sigma is converted to a Z‑score, optionally weighted (often by the square root of the sample size or the inverse variance), and then averaged. The combined Z follows a standard normal distribution, preserving both magnitude and sign. The authors note that the choice of weights can be subjective and that the method assumes normality of the underlying test statistics; violations can lead to bias.

The bibliography then covers Tippett’s minimum‑p method, which focuses on the smallest p‑value among the studies and uses its extreme‑value distribution for inference. This approach is powerful when one study provides a very strong signal but ignores the contribution of weaker, consistent evidence. Lipták’s weighted Z method is introduced as a generalization of Stouffer’s approach, allowing arbitrary pre‑specified weights that can encode prior knowledge about experimental sensitivity or systematic uncertainties.
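Tippett's rule follows directly from the distribution of the minimum of k independent uniform variables; a one-line sketch (naming is ours):

```python
def tippett_combine(pvals):
    """Tippett's method: under H0 the smallest of k uniform p-values
    has CDF 1 - (1 - p)^k, which serves as the combined p-value."""
    k = len(pvals)
    return 1.0 - (1.0 - min(pvals)) ** k

# One strong result dominates; the weaker p = 0.4 contributes nothing:
p_combined = tippett_combine([0.01, 0.4])  # 1 - 0.99^2 = 0.0199
```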

Bayesian combination techniques are also surveyed. In the Bayesian framework, each experiment contributes a likelihood (or a posterior) that is multiplied to obtain a joint posterior distribution. The Bayes factor derived from this joint posterior quantifies the overall evidence for the alternative hypothesis. The authors stress that Bayesian methods are highly flexible but depend critically on the choice of prior distributions, which can be a source of controversy in high‑energy physics.
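For independent experiments, the joint Bayes factor is simply the product of the individual ones. A toy sketch for simple-versus-simple hypotheses, assuming each experiment is a unit-variance Gaussian measurement and a point alternative mean `mu1` (both the setup and the names are illustrative assumptions, not from the bibliography):

```python
import math

def gaussian_bayes_factor(z, mu1):
    """Bayes factor L(z | mu = mu1) / L(z | mu = 0) for a single
    unit-variance Gaussian observation z (toy simple-vs-simple case)."""
    return math.exp(mu1 * z - 0.5 * mu1 * mu1)

# Independence makes the evidence multiplicative:
bf_joint = gaussian_bayes_factor(2.0, 1.0) * gaussian_bayes_factor(3.0, 1.0)
# equals exp(1.5) * exp(2.5) = exp(4), about 55-to-1 in favour of mu = 1
```

With a composite alternative the point value `mu1` would be replaced by an integral over a prior, which is exactly where the prior sensitivity noted above enters.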

A more general likelihood‑ratio combination is described, wherein the full likelihood functions of the experiments are multiplied to form a joint likelihood. Test statistics derived from this joint likelihood (e.g., a global likelihood‑ratio) can be evaluated via asymptotic theory or Monte‑Carlo simulation. This approach is theoretically optimal because it uses the complete information about the data-generating process, but it is computationally intensive and sensitive to model misspecification.
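In the tractable special case of independent Gaussian measurements of a common parameter, the joint likelihood-ratio test can be written out exactly; a sketch under that assumption (function and variable names are ours):

```python
import math

def combine_gaussian_likelihoods(measurements):
    """Joint likelihood ratio for independent Gaussian measurements
    (x_i, sigma_i) of a common mean mu. The statistic
    q0 = -2 ln [L(mu=0) / L(mu_hat)] tests the null mu = 0."""
    w = [1.0 / s ** 2 for _, s in measurements]          # inverse-variance weights
    mu_hat = sum(wi * x for wi, (x, _) in zip(w, measurements)) / sum(w)
    q0 = sum(wi * (x ** 2 - (x - mu_hat) ** 2)
             for wi, (x, _) in zip(w, measurements))
    return mu_hat, math.sqrt(max(q0, 0.0))               # sqrt(q0): combined sigma

# Two unit-error measurements: a 2-sigma and a 3-sigma effect
mu, z = combine_gaussian_likelihoods([(2.0, 1.0), (3.0, 1.0)])
# mu_hat = 2.5, combined significance = 5 / sqrt(2), about 3.54 sigma
```

In this Gaussian case the result coincides with inverse-variance-weighted Stouffer, which illustrates why the likelihood approach is regarded as the reference point: the simpler methods recover it only under specific assumptions.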

The paper also discusses practical issues that arise in meta‑analysis: one‑sided versus two‑sided tests, correction for multiple comparisons, and heterogeneity among studies. When effect sizes vary substantially across experiments, a random‑effects model (e.g., DerSimonian‑Laird) may be more appropriate than a fixed‑effects weighted average. The bibliography cites applications in particle physics (combining ATLAS and CMS Higgs boson searches), astrophysics (joint neutrino detections by IceCube and ANTARES), and genomics (meta‑analysis of genome‑wide association studies).
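The DerSimonian-Laird moment estimator mentioned above can be sketched directly from its definition, assuming each study reports an effect estimate and its within-study variance (names are ours):

```python
def dersimonian_laird(effects, variances):
    """DerSimonian-Laird random-effects meta-analysis: estimate the
    between-study variance tau^2 by moments, then re-weight."""
    w = [1.0 / v for v in variances]                     # fixed-effects weights
    sw = sum(w)
    mu_fixed = sum(wi * e for wi, e in zip(w, effects)) / sw
    # Cochran's heterogeneity statistic Q
    q = sum(wi * (e - mu_fixed) ** 2 for wi, e in zip(w, effects))
    k = len(effects)
    c = sw - sum(wi * wi for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)                   # truncate at zero
    # Random-effects weights include the between-study variance
    w_star = [1.0 / (v + tau2) for v in variances]
    mu_re = sum(wi * e for wi, e in zip(w_star, effects)) / sum(w_star)
    return mu_re, tau2

# Two heterogeneous studies with unit variances:
mu_re, tau2 = dersimonian_laird([1.0, 3.0], [1.0, 1.0])  # tau^2 = 1
```

When the studies agree (Q below its expectation k - 1), tau^2 truncates to zero and the estimate reduces to the fixed-effects weighted average.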

Throughout the annotated list, the authors highlight that no single method dominates; the “best” combination depends on the scientific context, the degree of independence, the availability of effect‑size information, and the tolerance for model assumptions. They conclude with a call for further research on combining non‑independent data, handling non‑Gaussian error structures, and exploring machine‑learning‑based non‑linear combination schemes that could adaptively weight studies based on their predictive performance. The bibliography thus serves both as a quick reference for practitioners needing to merge sigma‑level results and as a roadmap for methodological development in the field.

