Title: Reproducibility and Statistical Methodology
ArXiv ID: 2602.15697
Date: 2026-02-17
Authors: ** Author information is not stated in the provided text; to be confirmed from the original paper. **
📝 Abstract
In 2015 the Open Science Collaboration (OSC) (Nosek et al., 2015) published a highly influential paper which claimed that a large fraction of published results in the psychological sciences were not reproducible. In this article we review this claim from several points of view. We first offer an extended analysis of the methods used in that study. We show that the OSC methodology induces a bias that is by itself able to explain the discrepancy between the OSC estimates of reproducibility and other, more optimistic estimates made by similar studies. The article also offers a more general literature review and discussion of reproducibility in experimental science. We argue, for both scientific and ethical reasons, that a considered balance of false positive and false negative rates is preferable to a single-minded concentration on false positive rates alone.
📄 Full Content
The value of any kind of research depends on the reproducibility of the results. "The salutary habit," suggests Ronald Fisher, "of repeating important experiments, or of carrying out original observations in replicate, shows a tacit appreciation of the fact that the object of our study is not the individual result, but the population of possibilities of which we do our best to make our experiments representative." (Fisher, 1925).
If a study produces an outcome that cannot be reproduced in a similar environment at another point in time, then that study has failed in its original goal of providing observations that reflect real-world phenomena. Ensuring the reproducibility of studies is therefore a top priority in all fields of science. However, a number of investigations into current scientific practice have suggested that a large number of published results cannot be reproduced, implying that these results have no implications outside of the initial test conditions. This issue, known in the literature as the reproducibility (or replication) crisis, has become a significant concern. In fact, according to a survey of researchers reported in Baker (2016), "… 52% of those surveyed agree that there is a significant 'crisis' of reproducibility …". As a consequence, failure of reproducibility is seen as a major issue across most fields of science, and many individuals and organizations are investigating these claims.
In order to test this claim, the Open Science Collaboration (OSC) (osf.io/vmrgu) conducted a replication project (OSC-RP), selecting 100 studies from prominent psychology journals and reproducing them under conditions close to the originals (Nosek et al., 2015). The OSC-RP reports that while 97% of the original experiments had significant results (P ≤ 0.05), only 36% of the replications were found to be significant under similar conditions. Furthermore, 47% of the effect sizes from the original experiments fell within the 95% confidence intervals of the effect sizes from the replications, and 39% of effects were subjectively judged to be successful reproductions.
The OSC-RP report does not identify any particular systematic flaw in experimental or statistical methodology, although it conjectures that incentive structures in the scientific community may play a role. The report does conclude, however, that the scientific community faces a genuine risk of irreproducible and unreliable experimental data appearing in major scientific journals. If true, this is a serious concern that requires some degree of reform in experimentation and publication practices.
However, not all studies of reproducibility yield pessimistic conclusions. Etz and Vandekerckhove (2016) argue that the reported failure of reproducibility in the OSC-RP can be attributed to an "overestimation of effect sizes". Klein et al. (2014) summarize a separate replication study, the Many Labs Replication Project (ML-RP), which reported a much higher reproducibility rate of 85%, one compatible with strict adherence to good experimental and statistical practice (https://osf.io/wx7ck/). In a similar study reported in Camerer et al. (2016), 18 experimental studies in the field of economics were replicated using a similar replication protocol (ECO-RP), again yielding a higher reproducibility rate than the OSC-RP (61%). See also the exchange in Gilbert et al. (2016) and Anderson et al. (2016) following the publication of Nosek et al. (2015).
Therefore, there may be a different conclusion to draw from the statistical data presented by the OSC-RP, one that is both sensible and elementary with regard to basic statistical principles. To investigate further, we must begin by clarifying and reconsidering the degree of reproducibility that researchers should expect from such studies.
We first develop a "reproducibility model" with which to precisely define a reproducibility rate and to offer guidance regarding its estimation. Throughout, the model assumptions never deviate from standard practice, and we assume, in particular, that the probabilities of conventional type I error (false positive) and type II error (false negative) are accurately reported where needed. If this model can predict the reproducibility rates reported by the OSC-RP under these conditions, then those rates provide no basis for faulting contemporary research and publication practice.
Let us define a universe U of hypothesis tests, each with null hypothesis H₀ and alternative hypothesis Hₐ. Alternative hypotheses represent effects of scientific interest, and as a consequence are published in journals.
A proportion π of U, which we refer to as the effect prevalence, is truly Hₐ. Let α denote the type I (false positive) error rate and β the type II (false negative) error rate. All possible reported outcomes of a study can be represented with the decision tree illustrated in Figure
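To make the definitions above concrete, the sketch below computes the replication rate such a model predicts when only significant results are published, using the standard positive-predictive-value calculation implied by π, α and β. This is an illustrative sketch only: the function name and the numerical values chosen for π, α and β are assumptions for demonstration, not quantities taken from the paper or from the OSC-RP data.

```python
# Minimal sketch of the reproducibility model described above.
# The values of pi (effect prevalence), alpha (type I error rate) and
# beta (type II error rate) below are purely illustrative assumptions.

def expected_replication_rate(pi, alpha, beta):
    """Expected fraction of originally significant results that are
    significant again in an exact replication, assuming a proportion
    pi of tested hypotheses are truly H_a and only significant
    results are published."""
    # Probability an original study reports a significant result.
    p_sig = pi * (1 - beta) + (1 - pi) * alpha
    # Probability a published (significant) result reflects a true
    # effect (the positive predictive value).
    ppv = pi * (1 - beta) / p_sig
    # True effects replicate with power 1 - beta; false positives
    # replicate only by chance, with probability alpha.
    return ppv * (1 - beta) + (1 - ppv) * alpha


if __name__ == "__main__":
    for pi in (0.1, 0.25, 0.5):
        rate = expected_replication_rate(pi=pi, alpha=0.05, beta=0.2)
        print(f"pi = {pi:.2f}: expected replication rate = {rate:.2f}")
```

Under these illustrative settings the expected replication rate falls well below the nominal power 1 − β whenever the effect prevalence π is small, which shows how a modest replication rate can arise even when type I and type II error rates are accurately reported and standard practice is followed.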