Beyond single-threshold searches: the Event Stacking Test


We present a new statistical test that examines the consistency of the tails of two empirical distributions at multiple thresholds. Such distributions are often encountered in counting experiments, in physics and elsewhere, where the significance of populations of events is evaluated. This multi-threshold approach has the effect of “stacking” multiple events into the tail bin of the distribution, and thus we call it the Event Stacking Test. This test has the ability to confidently detect inconsistencies composed of multiple events, even if these events are low-significance outliers in isolation. We derive the Event Stacking Test from first principles and show that the p-value it reports is a well-calibrated representation of noise fluctuations. When applying this test to the detection of gravitational-wave transients in LIGO-Virgo data, we find that it performs better than or comparably to other statistical tests historically used within the gravitational-wave community. This test is particularly well-suited for detecting classes of gravitational-wave transients that are minimally-modeled, i.e., gravitational-wave bursts. We show that the Event Stacking Test allows us to set upper limits on the astrophysical rate-density of gravitational-wave bursts that are stricter than those set using other statistical tests by factors of up to 2–3.

💡 Research Summary

The paper introduces a novel statistical method called the Event Stacking Test (EST) designed to assess the consistency of the tails of two empirical distributions across multiple thresholds. In the context of gravitational‑wave (GW) searches, the two distributions are the “0‑lag” data (the real observation) and the background data obtained by time‑shifting detector streams to destroy any true astrophysical coincidences. Both are modeled as Poisson processes with a rate λ(Λ) that depends on a search statistic Λ, where larger Λ values indicate more GW‑like events.
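
As a rough illustration of the time-shift idea, the sketch below counts coincident triggers between two detectors at zero lag and again after sliding one detector's trigger times by offsets much larger than the coincidence window. The function names, the circular wrap-around, and the toy trigger times are assumptions made for this example; they are not taken from the paper or from any search pipeline.

```python
def count_coincidences(times_a, times_b, window):
    """Count triggers in A that have at least one trigger in B within +/- window."""
    return sum(
        1 for ta in times_a
        if any(abs(ta - tb) <= window for tb in times_b)
    )


def time_shifted_counts(times_a, times_b, window, duration, shifts):
    """Coincidence counts after sliding detector B by each shift.

    Shifted times wrap around the segment of length `duration` (an assumed
    convention), so every shift analyses the same amount of data.
    """
    counts = []
    for shift in shifts:
        shifted_b = [(tb + shift) % duration for tb in times_b]
        counts.append(count_coincidences(times_a, shifted_b, window))
    return counts


# Toy trigger times in seconds (hypothetical values).
times_a = [10.0, 50.0, 120.0]
times_b = [10.005, 80.0, 170.0]

zero_lag = count_coincidences(times_a, times_b, window=0.01)
background = time_shifted_counts(
    times_a, times_b, window=0.01, duration=200.0, shifts=[30.0, 40.0]
)
print(zero_lag, background)
```

Each shift behaves like an independent background experiment, so under these assumptions the effective background livetime grows roughly as the number of shifts times the segment duration.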

Derivation of the test
The authors first derive the familiar single‑threshold false‑alarm probability (FAP) using both a maximum‑likelihood estimate of λ (λ̂ = N_back/T_back) and a fully Bayesian marginalisation over λ with Gamma‑family priors (uniform and Jeffreys). They then extend the framework to k thresholds Λ₁ < Λ₂ < … < Λ_k. For each threshold the cumulative number of events exceeding it (N_i) is recorded for both data sets. By constructing a directed acyclic graph (DAG) they identify conditional independencies that allow the joint probability P(N_0‑lag | N_back, T_0‑lag, T_back) to be factorised. Integrating over λ analytically (possible because the Gamma prior is conjugate to the Poisson likelihood) yields a closed‑form expression for the multi‑threshold FAP. This expression automatically includes the “multiple‑testing” penalty: without it, examining many thresholds would inflate the apparent significance.
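
The single‑threshold case can be sketched in a few lines. The code below is a minimal illustration, not the paper's multi‑threshold expression: it computes the probability of seeing at least one 0‑lag event above a threshold, first with the plug‑in estimate λ̂ = N_back/T_back and then after integrating λ over its Gamma posterior. The function names and the `prior_shape` parameter are assumptions introduced here; `prior_shape=1.0` corresponds to a uniform prior and `prior_shape=0.5` to the Jeffreys prior (proportional to λ^(-1/2)).

```python
import math


def fap_mle(n_back, t_back, t_zero):
    """P(at least one 0-lag event) using the plug-in rate n_back / t_back."""
    rate_hat = n_back / t_back
    return 1.0 - math.exp(-rate_hat * t_zero)


def fap_marginalised(n_back, t_back, t_zero, prior_shape=1.0):
    """P(at least one 0-lag event) after integrating the rate analytically.

    A prior proportional to lambda**(prior_shape - 1) combined with n_back
    Poisson events in t_back gives a Gamma posterior with shape
    n_back + prior_shape and rate t_back. Averaging exp(-lambda * t_zero)
    over that posterior gives (t_back / (t_back + t_zero)) ** shape.
    """
    shape = n_back + prior_shape
    return 1.0 - (t_back / (t_back + t_zero)) ** shape


# Hypothetical numbers: 10 background events in 1000 s, 10 s of 0-lag data.
print(fap_mle(10, 1000.0, 10.0))
print(fap_marginalised(10, 1000.0, 10.0, prior_shape=1.0))  # uniform prior
print(fap_marginalised(10, 1000.0, 10.0, prior_shape=0.5))  # Jeffreys prior
```

One consequence of the marginalisation is that the FAP stays non‑zero even when no background events exceed the threshold, whereas the plug‑in estimate returns exactly zero in that case.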

Key properties of EST

Inclusion of event rates – Unlike the Binomial Test, EST retains information about the absolute event rate, which is crucial when a genuine GW signal raises the overall rate in the 0‑lag data.
Analytical p‑values – The test provides exact p‑values without resorting to computationally expensive Monte‑Carlo simulations.
Stacking of low‑significance events – By evaluating the joint consistency of the k loudest events, EST can detect a population of modest‑significance outliers that would be missed by the traditional Loudest Event Test (LET), which treats each candidate in isolation.
Flexibility – The method can be applied to any counting experiment where the background is well‑characterised and the signal is minimally modelled.
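
The stacking effect in the list above can be illustrated with per‑threshold tail probabilities. The sketch below is an assumption‑laden simplification: it evaluates each threshold separately with the Gamma‑Poisson (negative‑binomial) predictive distribution, and it does not reproduce the joint multi‑threshold FAP or its multiple‑testing penalty. The helper names `counts_above` and `tail_prob`, the uniform prior, and all numbers are hypothetical.

```python
import math


def counts_above(stats, thresholds):
    """Cumulative number of events whose statistic exceeds each threshold."""
    return [sum(1 for s in stats if s > thr) for thr in thresholds]


def tail_prob(n_zero, n_back, t_back, t_zero, prior_shape=1.0):
    """P(N >= n_zero) for the 0-lag count, with the rate marginalised.

    The predictive distribution is negative binomial with
    r = n_back + prior_shape and p = t_back / (t_back + t_zero).
    """
    r = n_back + prior_shape
    p = t_back / (t_back + t_zero)
    cdf = 0.0
    for n in range(n_zero):
        log_pmf = (
            math.lgamma(n + r) - math.lgamma(r) - math.lgamma(n + 1)
            + r * math.log(p) + n * math.log1p(-p)
        )
        cdf += math.exp(log_pmf)
    return max(0.0, 1.0 - cdf)


# Hypothetical search: thresholds on the statistic, background counts above
# each threshold in 1000 units of background time, and four 0-lag events
# observed in 1 unit of time.
thresholds = [9.0, 8.0, 7.0]
background_counts = [50, 200, 500]
zero_lag_stats = [9.2, 8.5, 8.1, 7.3]

zero_lag_counts = counts_above(zero_lag_stats, thresholds)
p_values = [
    tail_prob(n0, nb, 1000.0, 1.0)
    for n0, nb in zip(zero_lag_counts, background_counts)
]
for thr, n0, pv in zip(thresholds, zero_lag_counts, p_values):
    print(thr, n0, pv)
```

In this toy setup the single loudest event is only mildly unusual, while the three events above the middle threshold are collectively much less probable under the background. That is the kind of multi‑event excess a loudest‑event comparison cannot see.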

Performance on simulated data
The authors validate EST using synthetic Poisson realizations of background and signal. They demonstrate that the reported FAPs are well‑calibrated (i.e., the empirical false‑alarm rate matches the nominal value) and that EST’s detection efficiency exceeds that of LET for scenarios involving two or more sub‑threshold events. The gain is most pronounced when the signal contributes a modest excess to the tail of the Λ distribution.
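
A calibration check of this kind can be sketched with a small Monte Carlo experiment. The version below is a simplified stand‑in, not the authors' validation: it uses a single‑threshold count p‑value with a uniform prior, draws background and 0‑lag counts from the same Poisson rate, and measures how often the p‑value falls at or below a nominal level. All names and parameter values are assumptions chosen for illustration.

```python
import math
import random


def poisson_sample(mean, rng):
    """Draw a Poisson count by accumulating unit-rate exponential waiting times."""
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def tail_prob(n_zero, n_back, t_back, t_zero, prior_shape=1.0):
    """P(N >= n_zero) under the Gamma-Poisson (negative-binomial) predictive."""
    r = n_back + prior_shape
    p = t_back / (t_back + t_zero)
    cdf = 0.0
    for n in range(n_zero):
        log_pmf = (
            math.lgamma(n + r) - math.lgamma(r) - math.lgamma(n + 1)
            + r * math.log(p) + n * math.log1p(-p)
        )
        cdf += math.exp(log_pmf)
    return max(0.0, 1.0 - cdf)


def empirical_false_alarm_fraction(rate, t_back, t_zero, alpha, trials, seed=0):
    """Fraction of noise-only trials whose p-value is at or below alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        n_back = poisson_sample(rate * t_back, rng)
        n_zero = poisson_sample(rate * t_zero, rng)  # no signal injected
        if tail_prob(n_zero, n_back, t_back, t_zero) <= alpha:
            hits += 1
    return hits / trials


fraction = empirical_false_alarm_fraction(
    rate=0.5, t_back=200.0, t_zero=1.0, alpha=0.05, trials=2000
)
print(fraction)
```

For a calibrated p‑value the printed fraction should not exceed the nominal level beyond sampling noise. Because counts are discrete, a test like this one typically comes out somewhat conservative, that is, below the nominal level.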

Application to LIGO‑Virgo data
EST is applied to real O1/O2 data from the LIGO‑Virgo collaboration, using the OLIB pipeline for GW burst searches and the PyCBC pipeline for compact‑binary coalescences. In the burst analysis, EST identifies clusters of events that are not significant individually but collectively produce a low multi‑threshold FAP. Compared with LET, EST yields a ~10–20 % improvement in overall detection efficiency for burst‑like signals. Moreover, when setting upper limits on the astrophysical rate density of GW bursts, EST produces limits that are up to 2–3 times stricter than those derived from LET, reflecting its superior sensitivity to weak, short‑duration transients.

Limitations and future work
EST requires a pre‑specified number k of events to stack: choosing k too small reduces sensitivity, while choosing it too large may make the test overly conservative. The method also assumes a Poissonian background; deviations (e.g., non‑stationary noise) would necessitate extensions such as Cox processes or adaptive priors. The authors suggest future directions including dynamic selection of k via Bayesian model comparison, incorporation of non‑Poisson background models, and combining multiple search pipelines into a multidimensional EST.

Conclusion
The Event Stacking Test offers a rigorous, analytically tractable way to test the agreement of tail regions of empirical distributions across several thresholds. By “stacking” multiple low‑significance events, it can reveal populations of signals that escape detection by traditional single‑threshold tests. Its application to LIGO‑Virgo data demonstrates tangible improvements in detection efficiency and tighter astrophysical constraints, making EST a valuable addition to the statistical toolbox for gravitational‑wave astronomy and other fields that rely on counting experiments with poorly modelled signals.
