Fundamental statistical limitations of future dark matter direct detection experiments

We discuss irreducible statistical limitations of future ton-scale dark matter direct detection experiments. We focus in particular on the coverage of confidence intervals, which quantifies the reliability of the statistical method used to reconstruct the dark matter parameters, and the bias of the reconstructed parameters. We study 36 benchmark dark matter models within the reach of upcoming ton-scale experiments. We find that approximate confidence intervals from a profile-likelihood analysis exactly cover or over-cover the true values of the WIMP parameters, and are hence conservative. We evaluate the probability that unavoidable statistical fluctuations in the data might lead to a biased reconstruction of the dark matter parameters, or large uncertainties on the reconstructed parameter values. We show that this probability can be surprisingly large, even for benchmark models leading to a large event rate of order a hundred counts. We find that combining data sets from two different targets leads to improved coverage properties, as well as a substantial reduction of statistical bias and uncertainty on the dark matter parameters.

💡 Research Summary

The paper investigates the intrinsic statistical constraints that will affect upcoming ton‑scale dark‑matter direct‑detection experiments. By focusing on the coverage of confidence intervals and the potential bias in reconstructed WIMP parameters, the authors provide a rigorous assessment of how reliable standard statistical methods are when applied to future high‑exposure data sets.

A set of 36 benchmark WIMP models spanning a wide range of masses (∼10 GeV to 1 TeV) and spin‑independent cross sections (10⁻⁴⁶ cm² to 10⁻⁴⁰ cm²) is used to generate thousands of pseudo‑experiments. For each pseudo‑experiment the authors perform a profile‑likelihood analysis, extracting 68 % and 95 % confidence intervals for the mass and cross‑section. The coverage—defined as the fraction of pseudo‑experiments in which the true parameter lies inside the constructed interval—is then measured directly from the Monte‑Carlo ensembles.

The key findings are threefold. First, the profile‑likelihood intervals are found to be either exact or over‑covering. In the 95 % case the empirical coverage lies between 96 % and 99 % for virtually all benchmark points, indicating that the method is conservative: the intervals are slightly wider than strictly necessary, which reduces the risk of falsely excluding the true model. Second, despite this conservatism, statistical fluctuations can still lead to substantial bias or inflated uncertainties. Even for benchmarks that would generate of order one hundred signal events, the probability that the reconstructed mass or cross‑section deviates by more than two standard deviations from the true value is surprisingly high (≈8–12 %). This demonstrates that a large event count alone does not guarantee unbiased parameter recovery; the stochastic nature of the data can produce outlier outcomes with non‑negligible frequency. Third, the authors show that combining data from two different target materials dramatically improves both coverage and bias. By jointly analysing, for example, xenon‑based and argon‑based detectors, the average 95 % coverage rises to about 97 % and the bias probability drops by roughly 40 % relative to a single‑target analysis. The improvement stems from the complementary nuclear response of the two targets, which helps break degeneracies in the (mass, cross‑section) parameter space.

Beyond the quantitative results, the paper discusses practical implications for experiment design. It recommends that future collaborations incorporate multi‑target strategies whenever feasible, as the statistical gains are comparable to those obtained by increasing exposure alone. Moreover, the authors suggest supplementing the frequentist profile‑likelihood approach with Bayesian priors, cross‑validation techniques, or pre‑analysis simulations that explicitly quantify the risk of bias for a given exposure and background level. Such safeguards would allow experimental teams to anticipate and mitigate the occasional but potentially misleading statistical fluctuations identified in the study.

In summary, the work provides a thorough statistical audit of ton‑scale direct‑detection prospects. It confirms that standard profile‑likelihood confidence intervals are safe in the sense of over‑coverage, but it also warns that even high‑statistics data sets can yield biased reconstructions with a non‑trivial probability. The combination of multiple detector targets emerges as a powerful tool to enhance reliability, reduce bias, and tighten constraints on WIMP properties. These insights should inform the planning, data‑analysis pipelines, and interpretation frameworks of the next generation of dark‑matter searches.