Measurement of statistical evidence on an absolute scale following thermodynamic principles

Statistical analysis is used throughout biomedical research and elsewhere to assess strength of evidence. We have previously argued that typical outcome statistics (including p-values and maximum likelihood ratios) have poor measure-theoretic properties: they can erroneously indicate decreasing evidence as data supporting an hypothesis accumulate; and they are not amenable to calibration, necessary for meaningful comparison of evidence across different study designs, data types, and levels of analysis. We have also previously proposed that thermodynamic theory, which allowed for the first time derivation of an absolute measurement scale for temperature (T), could be used to derive an absolute scale for evidence (E). Here we present a novel thermodynamically-based framework in which measurement of E on an absolute scale, for which “one degree” always means the same thing, becomes possible for the first time. The new framework invites us to think about statistical analyses in terms of the flow of (evidential) information, placing this work in the context of a growing literature on connections among physics, information theory, and statistics.

💡 Research Summary

The paper tackles a long‑standing problem in statistical inference: the lack of a universally comparable, absolute measure of evidence. Conventional statistics rely on p‑values, likelihood ratios, or Bayes factors, all of which are relative indices that can behave paradoxically—evidence may appear to diminish even as data increasingly support a hypothesis, and the same strength of evidence can be assigned different numerical values across study designs, sample sizes, or data types. Because these measures are not grounded in a proper measurement theory, they cannot be calibrated, which precludes meaningful cross‑study or cross‑disciplinary comparisons.

Drawing inspiration from thermodynamics, the authors propose a novel framework that treats statistical evidence as a physical quantity analogous to temperature. In thermodynamics, temperature is defined on an absolute scale (Kelvin) through the relationship dU = T dS, where U is internal energy, S is entropy, and T is the temperature that quantifies the amount of energy per unit change in entropy. By constructing parallel concepts—“evidence entropy” (S_E) that quantifies the reduction in uncertainty provided by the data, and “evidence internal energy” (U_E) that represents the expected informational content of a hypothesis—the authors derive an analogous differential equation dU_E = E dS_E. Here, E plays the role of temperature and is measured in a newly coined unit, the “E‑kelvin.” One E‑kelvin always corresponds to the same amount of evidential information, regardless of the underlying model or data type, thereby establishing an absolute scale for evidence.

The paper systematically develops the mathematical foundations of this scale. Starting from measure‑theoretic arguments, it demonstrates that traditional statistics violate monotonicity: as sample size grows and data continue to favor a hypothesis, p‑values can increase and likelihood ratios can decrease. By contrast, the proposed evidence temperature E is provably monotonic under broad regularity conditions; more supportive data inevitably raise E. The authors also show how Bayesian updating can be interpreted as “evidence work”: moving from a prior to a posterior distribution consumes or releases evidential energy, analogous to thermodynamic work performed on a system. This interpretation clarifies the relationship between Bayes factors and the new absolute measure, while highlighting the limitations of relative indices.

To illustrate applicability, the framework is applied to several canonical statistical problems. For a simple binomial test (testing H0: p = 0.5 vs. H1: p ≠ 0.5), the authors compute S_E from the binomial entropy and U_E from the expected log‑likelihood under each hypothesis, yielding a closed‑form expression for E as a function of the observed number of successes k and total trials n. In the two‑sample t‑test, continuous data are mapped onto a Gaussian entropy framework, and the resulting E captures both effect size and sample size in a single absolute unit. The authors extend the approach to mixed‑effects models, demonstrating how componentwise evidence temperatures can be aggregated (via weighted sums) to obtain a global E for hierarchical structures.

Simulation studies validate the theoretical claims. Across a range of effect sizes, sample sizes, and data modalities, the evidence temperature consistently increases with accumulating supportive data and remains comparable across disparate designs. Situations where p‑values or Bayes factors display “evidence reversal” (e.g., the Jeffreys–Lindley paradox) show no reversal in E; the temperature rises monotonically, reflecting the intuitive notion that more data cannot reduce the amount of information supporting a hypothesis. Moreover, the absolute scale enables direct comparison of evidence from a small, high‑precision experiment with that from a large, noisy study—something impossible with traditional relative metrics.

The discussion emphasizes the broader implications. An absolute evidence scale would transform meta‑analysis by allowing the aggregation of E‑kelvin values across studies without the need to standardize effect sizes or p‑values. In clinical trial decision‑making, a pre‑specified threshold in E could serve as an objective stopping rule, replacing ad‑hoc p‑value cut‑offs. Policy makers could weigh prior knowledge against new data quantitatively, because the framework explicitly incorporates prior information as a contribution to the evidential internal energy.

Finally, the authors outline future research directions: extending the theory to multivariate and non‑parametric settings, integrating it with modern machine‑learning models (e.g., deep neural networks), and building large‑scale empirical repositories to calibrate and validate the E‑kelvin across disciplines. They argue that by grounding statistical evidence in thermodynamic principles, the field gains a rigorous, absolute measurement system that resolves longstanding paradoxes and opens new avenues for quantitative scientific reasoning.