Hypothesis tests and model parameter estimation on data sets with missing correlation information
Ideally, all analyses of normally distributed data should include the full covariance information between all data points. In practice, the full covariance matrix is not always available, either because a result was published without one or because one tries to combine multiple results from separate publications. For simple hypothesis tests, it is possible to define robust test statistics that behave conservatively in the presence of unknown correlations. For model parameter fits, one can inflate the variance by a factor to ensure that the results remain conservative at least up to a chosen confidence level. This paper describes a class of robust test statistics for simple hypothesis tests, as well as an algorithm to determine the necessary inflation factor for model parameter fits, goodness-of-fit tests, and composite hypothesis tests. It then presents example applications of the methods to real neutrino interaction data and model comparisons.
💡 Research Summary
The paper addresses a common problem in many fields of experimental science: the lack of a complete covariance matrix when analysing normally distributed data. While the ideal situation would provide a full covariance matrix for all data points, in practice results are often published without this information, or researchers need to combine results from different publications where inter‑study correlations are unknown. Ignoring unknown correlations can lead to misleading conclusions, especially in hypothesis testing and parameter estimation.
The authors build on a previous method that treats unknown covariances as nuisance parameters and minimizes the Mahalanobis distance over all admissible covariance matrices. They generalise this approach to the case where the covariance matrix is known in block‑diagonal form (each block corresponds to a subset of data with a known internal covariance) but the off‑diagonal blocks (the correlations between blocks) are completely unknown.
The key result is the “fitted” test statistic:
fitted(x | µ, S) = max_i D_i²,
where D_i² = (x_i − µ_i)ᵀ S_ii⁻¹ (x_i − µ_i) is the Mahalanobis distance for block i. The authors prove that, for any admissible choice of the unknown off‑diagonal blocks, the smallest possible overall Mahalanobis distance is exactly the largest block distance. The proof uses a linear transformation that isolates the last block, constructs specific off‑diagonal elements that cancel the contributions of all other blocks, and applies Sylvester’s criterion to guarantee positive‑definiteness. By recursively applying the same construction, the result holds for any number of blocks.
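The minimisation result can be checked numerically in the simplest special case of two one-dimensional blocks, where the only unknown is a single correlation ρ. The sketch below (our own illustration, not code from the paper) minimises the full Mahalanobis distance over all admissible ρ and recovers the larger of the two standardised block distances:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def mahalanobis_2d(rho, u, v):
    """Full Mahalanobis distance of two standardised scalar 'blocks'
    u and v under an assumed inter-block correlation rho."""
    return (u * u - 2.0 * rho * u * v + v * v) / (1.0 - rho * rho)

# Standardised residuals of the two one-dimensional blocks:
# block distances are D1^2 = u^2 = 4.0 and D2^2 = v^2 = 1.44.
u, v = 2.0, 1.2

# Minimise over all admissible correlations rho in (-1, 1).
res = minimize_scalar(mahalanobis_2d, bounds=(-0.999, 0.999),
                      args=(u, v), method="bounded")

print(res.fun)          # ~4.0, equal to max(u**2, v**2)
print(max(u**2, v**2))  # 4.0
```

The minimum is attained at ρ = v/u (the ratio of the smaller to the larger standardised residual), at which point the smaller block's contribution is cancelled exactly, in line with the construction described above.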
Because the statistic reduces to the maximum of a fixed number of independent χ² variables (one per block), its null distribution can be written analytically as the product of the individual χ² CDFs. The authors call this the “Cee‑squared” distribution. They provide a Python implementation (NuStatTools) that evaluates the CDF and p‑values for any set of block sizes.
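The product form of the null CDF is straightforward to evaluate directly. The following minimal sketch (our own, not necessarily the NuStatTools API) computes the "Cee-squared" CDF and the corresponding p-value for arbitrary block sizes:

```python
from functools import reduce
from scipy.stats import chi2

def cee_squared_cdf(t, block_dofs):
    """CDF of max_i D_i^2 when the D_i^2 are independent chi-square
    variables with the given per-block degrees of freedom:
    P(max_i D_i^2 <= t) = prod_i F_{k_i}(t)."""
    return reduce(lambda acc, k: acc * chi2.cdf(t, k), block_dofs, 1.0)

def cee_squared_pvalue(observed_max, block_dofs):
    """p-value of the fitted statistic under the null hypothesis."""
    return 1.0 - cee_squared_cdf(observed_max, block_dofs)

# Two blocks with 5 degrees of freedom each, observed max distance 15.0.
p = cee_squared_pvalue(15.0, [5, 5])
```

For identical block sizes this reduces to `1 - chi2.cdf(t, k)**n`, but the product form handles blocks of different dimension equally well.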
To illustrate the behaviour, the authors generate 10‑dimensional Gaussian toy data with two blocks of five variables each and vary the true inter‑block correlation ρ (0, 0.5, 0.9, 0.99). The conventional “naïve” Mahalanobis test, which assumes the blocks are uncorrelated, rejects the true hypothesis increasingly more often than the nominal rate as ρ grows. In contrast, the fitted statistic remains conservative for all ρ: the observed p‑values are always larger than the nominal ones, so the Type I error rate never exceeds the chosen significance level.
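This qualitative behaviour is easy to reproduce with a small Monte Carlo study. The toy below (our own setup, using identity block covariances and a sample size of our choosing) compares the empirical Type I error of the naïve test against the fitted statistic at ρ = 0.9:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(42)

def type1_rates(rho, n_toys=5000, alpha=0.05):
    """Empirical Type I error of the naive chi^2 test vs. the fitted
    (max-of-blocks) statistic, for two 5-d blocks with identity internal
    covariance and true inter-block correlation rho."""
    dim, k = 10, 5
    cov = np.eye(dim)
    cov[:k, k:] = rho * np.eye(k)   # the "unknown" inter-block correlation
    cov[k:, :k] = rho * np.eye(k)
    x = rng.multivariate_normal(np.zeros(dim), cov, size=n_toys)

    # Naive: pretend the blocks are uncorrelated (identity covariance).
    naive_p = chi2.sf(np.sum(x**2, axis=1), dim)

    # Fitted: maximum of the per-block Mahalanobis distances, referred
    # to the product-of-CDFs ("Cee-squared") null distribution.
    d1 = np.sum(x[:, :k]**2, axis=1)
    d2 = np.sum(x[:, k:]**2, axis=1)
    fitted_p = 1.0 - chi2.cdf(np.maximum(d1, d2), k)**2

    return np.mean(naive_p < alpha), np.mean(fitted_p < alpha)

naive_rate, fitted_rate = type1_rates(rho=0.9)
# naive_rate exceeds the nominal 0.05; fitted_rate stays at or below it
```

The naïve rejection rate climbs well above 5 % because the true distribution of the summed statistic has a heavier tail than χ²(10), while the fitted statistic stays conservative.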
For model parameter estimation, the paper introduces a variance‑inflation (derating) factor. The idea is to multiply the known block covariances by a factor λ ≥ 1 such that the fitted statistic’s distribution, evaluated with the inflated covariances, still yields a p‑value above a pre‑selected confidence threshold (e.g., 95 %). An algorithm is given that computes the minimal λ required, based on the observed block distances and their degrees of freedom. This allows practitioners to continue using standard χ² minimisation while ensuring that the resulting confidence intervals are conservative despite unknown inter‑block correlations.
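The core of the derating idea can be sketched as a one-dimensional root-finding problem: scaling every block covariance by λ scales every block distance by 1/λ, so the minimal λ solves a monotone equation in the product-of-CDFs null distribution. The code below is our own sketch of that idea, not the paper's exact algorithm:

```python
from functools import reduce
from scipy.optimize import brentq
from scipy.stats import chi2

def min_derating(block_d2, block_dofs, cl=0.95):
    """Smallest lambda >= 1 such that, after inflating every block
    covariance by lambda (equivalently, scaling every block distance
    D_i^2 by 1/lambda), the observed point is no longer excluded at
    confidence level cl under the product-of-CDFs null distribution."""
    def cdf(lam):
        return reduce(lambda acc, dk: acc * chi2.cdf(dk[0] / lam, dk[1]),
                      zip(block_d2, block_dofs), 1.0)
    if cdf(1.0) <= cl:   # already not excluded: no inflation needed
        return 1.0
    # cdf(lam) decreases monotonically in lam, so bracket and solve.
    return brentq(lambda lam: cdf(lam) - cl, 1.0, 1e6)

# Two 5-dof blocks with observed distances 20.0 and 15.0: excluded at
# 95% without derating, so the minimal lambda is greater than 1.
lam = min_derating([20.0, 15.0], [5, 5])
```

Because the CDF is strictly decreasing in λ, the bracketed root is unique, and the returned λ is exactly the minimal inflation factor for the chosen confidence level.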
The authors further generalise the fitted statistic to a family of “f_max” statistics:
f_max(x | µ, S) = max_i f_i(D_i²),
where each f_i is a strictly increasing function. Different choices of f_i can emphasise the most significant block (e.g., by using the χ² CDF to select the smallest p‑value) or apply other weighting schemes. All such statistics share the robustness property because they depend only on the ordering of the block distances, not on the unknown correlations. The paper derives the corresponding null CDF as a product of transformed χ² CDFs.
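One natural member of this family takes f_i to be the χ² CDF of block i, so that the statistic selects the block with the smallest individual p-value and blocks of different dimension are compared on an equal footing. A minimal sketch of this choice (our own illustration of one f_max variant):

```python
from scipy.stats import chi2

def fmax_minp(block_d2, block_dofs):
    """f_max statistic with f_i = chi-square CDF of block i: the block
    with the smallest individual p-value dominates."""
    return max(chi2.cdf(d, k) for d, k in zip(block_d2, block_dofs))

def fmax_minp_pvalue(block_d2, block_dofs):
    """Under the null, each transformed block distance chi2.cdf(D_i^2, k_i)
    is an independent Uniform(0, 1) variable, so the CDF of their maximum
    is t**n and the p-value is 1 - t**n."""
    t = fmax_minp(block_d2, block_dofs)
    return 1.0 - t ** len(block_d2)

# A 5-dof block at distance 20.0 and a 40-dof block at distance 30.0:
# the small block is far more significant and drives the statistic.
p = fmax_minp_pvalue([20.0, 30.0], [5, 40])
```

This variant is essentially a Šidák-style minimum-p combination, and it inherits the same conservativeness guarantee as the plain fitted statistic.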
The methodology is applied to real neutrino‑interaction data from the T2K, MINERvA, and MicroBooNE experiments. Each experiment’s published cross‑section measurements are treated as separate blocks; inter‑experiment correlations are unknown. Using the fitted statistic, the authors evaluate several theoretical neutrino‑interaction models (Spectral Function, Local Fermi Gas, Relativistic Fermi Gas, various 2p2h and final‑state‑interaction variants). For single‑experiment tests, some models are compatible, while others are excluded at the 98–99 % level. When combining experiments, the maximum block distance dominates, and virtually all models are excluded at >99.7 % confidence, illustrating how the method prevents over‑optimistic claims that could arise from ignoring unknown correlations.
A discussion of the “dilution effect” follows: a very strong exclusion from a small‑degree‑of‑freedom measurement can be softened when combined with a large‑df measurement, because the combined statistic is still the maximum of the block distances, but the reference CDF now has many degrees of freedom. This is analogous to the look‑elsewhere effect but originates from unknown inter‑study correlations.
In conclusion, the paper provides a mathematically rigorous, computationally tractable framework for hypothesis testing and parameter estimation when only partial covariance information is available. The fitted and f_max statistics guarantee conservativeness, the variance‑inflation algorithm supplies a practical way to obtain reliable confidence intervals, and the open‑source implementation makes the approach readily usable. Future extensions could address non‑Gaussian data, non‑linear models, and Bayesian formulations.