Null hypothesis significance tests: A mix-up of two different theories, the basis for widespread confusion and numerous misinterpretations

Null hypothesis statistical significance tests (NHST) are widely used in quantitative research in the empirical sciences, including scientometrics. Nevertheless, since their introduction nearly a century ago, significance tests have been controversial. Many researchers are not aware of the numerous criticisms raised against NHST. As practiced, NHST has been characterized as a null ritual that is overused and too often misapplied and misinterpreted. NHST is in fact a patchwork of two fundamentally different classical statistical testing models, often blended with some wishful quasi-Bayesian interpretations. This is undoubtedly a major reason why NHST is so often misunderstood. But NHST also has intrinsic logical problems, and the epistemic range of the information provided by such tests is much more limited than most researchers recognize. In this article we introduce to the scientometric community the theoretical origins of NHST, which are mostly absent from standard statistical textbooks, discuss some of the most prevalent problems relating to the practice of NHST, and trace these problems back to the mix-up of the two different theoretical origins. Finally, we illustrate some of the misunderstandings with examples from the scientometric literature and bring forward some modest recommendations for a sounder practice in quantitative data analysis.


💡 Research Summary

The paper provides a comprehensive critique of Null Hypothesis Significance Testing (NHST) as it is currently practiced, especially within the field of scientometrics. It begins by tracing the historical origins of NHST to two distinct statistical traditions: Ronald Fisher’s “significance testing” and the Neyman‑Pearson “hypothesis testing” framework. Fisher introduced the p‑value as a measure of how unlikely the observed data would be if the null hypothesis were true, treating it as an evidential index rather than a decision rule. In contrast, Neyman and Pearson formalized a decision‑theoretic procedure that requires pre‑specifying both a null and an alternative hypothesis, setting Type I (α) and Type II (β) error rates, and using a critical region to either reject or retain the null.
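To make the contrast concrete, the following minimal sketch shows the two traditions applied to the same result: Fisher's approach reports the exact p-value as graded evidence, while the Neyman-Pearson approach converts it into a binary decision against a pre-specified α. The sample values, the one-sample t-test, and the α level are illustrative assumptions, not material from the paper.

```python
# Minimal sketch: the same data analysed in the Fisherian and the
# Neyman-Pearson style. The sample values and alpha are illustrative only.
from scipy import stats

# Hypothetical measurements (e.g. log-ratios of citation counts), tested
# against a null mean of zero.
sample = [0.3, -0.1, 0.8, 0.5, 0.2, 0.6, -0.2, 0.4, 0.7, 0.1]
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)

# Fisher's significance testing: the exact p-value is reported as a graded,
# evidential index of how unusual the data are under the null hypothesis.
print(f"Fisher: t = {t_stat:.2f}, p = {p_value:.3f}  (evidential index, no fixed cutoff)")

# Neyman-Pearson hypothesis testing: alpha is fixed before the data are seen,
# and the outcome is a binary decision with known long-run error rates.
alpha = 0.05
decision = "reject H0" if p_value < alpha else "retain H0"
print(f"Neyman-Pearson: alpha = {alpha}, decision: {decision}")
```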

The authors argue that modern NHST is a confused amalgam of these two models, often blended with ad‑hoc Bayesian interpretations. This mixture generates several pervasive misunderstandings: (1) treating the p‑value as the probability that the null hypothesis is true (or that the alternative is true), (2) using the conventional 0.05 threshold as a universal marker of “significance” regardless of context, and (3) ignoring effect sizes and statistical power, thereby equating statistical significance with substantive importance.
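A small back-of-the-envelope calculation illustrates misunderstanding (1). Assuming, purely for illustration, a 10% prior probability that a real effect exists, 80% power, and α = 0.05 (none of these numbers come from the paper), the probability that the null hypothesis is true given a "significant" result is far from 5%:

```python
# Back-of-the-envelope illustration: why p < .05 does not mean "a 5% chance
# that H0 is true". The prior, alpha and power are illustrative assumptions.
prior_h1 = 0.10   # assumed prior probability that a real effect exists
alpha = 0.05      # Type I error rate: P(significant | H0 true)
power = 0.80      # P(significant | H1 true)

p_significant = power * prior_h1 + alpha * (1 - prior_h1)
p_h0_given_sig = alpha * (1 - prior_h1) / p_significant

print(f"P(H0 true | significant result) = {p_h0_given_sig:.2f}")  # ~0.36, not 0.05
```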

To illustrate the consequences of these misinterpretations, the paper surveys a range of recent scientometric studies—comparisons of citation counts across countries, analyses of co‑authorship network centrality, pre‑post policy impact assessments, etc. In many cases, authors report only p‑values and declare results “significant” or “non‑significant” without discussing the magnitude of the effect, confidence intervals, or the adequacy of sample size. The authors show that such practices can lead to false conclusions: with large samples, trivial differences become statistically significant; with small samples, meaningful differences may be missed. Moreover, p‑values are highly sensitive to underlying assumptions (normality, independence, homoscedasticity), so violations can distort the evidential value of the test.
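The sample-size point can be demonstrated with a short simulation (the means, standard deviation, and group sizes are illustrative, not data from the surveyed studies): the same trivially small standardized difference is typically "non-significant" with a small sample and "highly significant" with a large one.

```python
# Simulation sketch: one small and one very large sample drawn from populations
# that differ by a trivially small amount. All numbers are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
mean_a, mean_b, sd = 10.0, 10.2, 2.0   # population standardized difference d = 0.1

for n in (20, 50_000):
    group_a = rng.normal(mean_a, sd, size=n)
    group_b = rng.normal(mean_b, sd, size=n)
    t_stat, p_value = stats.ttest_ind(group_a, group_b)
    cohens_d = (group_b.mean() - group_a.mean()) / sd  # simple standardized effect
    print(f"n = {n:>6} per group: p = {p_value:.4f}, observed d ≈ {cohens_d:.2f}")

# Typical outcome: "non-significant" at n = 20 and "highly significant" at
# n = 50,000, although the magnitude of the difference is essentially the same.
```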

Beyond the empirical examples, the paper outlines the logical limits of NHST. A p‑value merely quantifies the compatibility of the data with the null hypothesis; it does not provide direct evidence for the alternative hypothesis, nor does it measure the probability that a hypothesis is true. Consequently, NHST offers a narrow epistemic window that many researchers over‑interpret.
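In standard textbook notation (not notation taken from the paper), the distinction can be stated compactly: the p-value conditions on the null hypothesis, whereas the quantity researchers typically want, the probability that a hypothesis is true given the data, conditions the other way around and is not something NHST supplies.

```latex
% The p-value: probability, computed under H_0, of a test statistic at least
% as extreme as the one actually observed.
p = \Pr\!\left(T \geq t_{\mathrm{obs}} \mid H_0\right)
% The quantity researchers often want, but which NHST does not provide,
% conditions the other way around:
\Pr\!\left(H_0 \mid \mathrm{data}\right)
```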

In the final section, the authors propose a set of practical recommendations to move toward more robust quantitative analysis. They advocate for: (a) conducting a priori power analyses to ensure adequate sample sizes; (b) reporting effect sizes and confidence intervals alongside p‑values, thereby emphasizing substantive significance; (c) considering Bayesian methods, information‑theoretic criteria (AIC, BIC), or model‑averaging approaches as alternatives or complements to NHST; (d) treating p‑values as graded evidence rather than binary decision thresholds; and (e) enhancing transparency through data visualisation, exploratory analyses, and clear articulation of the research hypotheses.
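A brief sketch of recommendations (a) and (b), using scipy and statsmodels; the target effect size, α, power, and the simulated data are illustrative assumptions rather than values from the paper:

```python
# Sketch of recommendations (a) and (b): an a priori power analysis, then
# reporting an effect size and confidence interval alongside the p-value.
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

# (a) A priori power analysis: sample size per group needed to detect a
# medium effect (Cohen's d = 0.5) with 80% power at alpha = 0.05.
n_required = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"Required sample size per group: {np.ceil(n_required):.0f}")

# (b) Report the effect size and a 95% confidence interval, not only p.
rng = np.random.default_rng(1)
group_a = rng.normal(10.0, 2.0, size=64)
group_b = rng.normal(11.0, 2.0, size=64)
t_stat, p_value = stats.ttest_ind(group_a, group_b)

diff = group_b.mean() - group_a.mean()
se_diff = np.sqrt(group_a.var(ddof=1) / 64 + group_b.var(ddof=1) / 64)
ci_low, ci_high = diff - 1.96 * se_diff, diff + 1.96 * se_diff  # approximate 95% CI
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)

print(f"p = {p_value:.4f}, Cohen's d = {diff / pooled_sd:.2f}, "
      f"95% CI for the mean difference: [{ci_low:.2f}, {ci_high:.2f}]")
```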

The conclusion stresses that while NHST need not be discarded entirely, its limitations must be acknowledged and mitigated. For scientometric researchers in particular, reducing reliance on p‑value rituals and embracing richer statistical reporting will improve the credibility, reproducibility, and interpretability of their findings.