Consider avoiding the .05 significance level

Reading time: 6 minutes

📝 Abstract

It is suggested that some shortcomings of Null Hypothesis Significance Testing (NHST), viewed from the perspective of Bayesian statistics, turn benign once the traditional threshold p value of .05 is replaced by a sufficiently smaller value. To illustrate, the posterior probability of an H0 stating P=.5, given data that just barely render it rejected by NHST with a p value of .05 (and a uniform prior), is shown here to be not much smaller than .50 for most values of N below 100 (and even to exceed .50 for N≥100); in contrast, with a p value of .001 the posterior probability does not exceed .06 for N≤100 (nor .25 for N<9000). More interesting still, the posterior probability becomes quite independent of N with a p value of .0001, thus practically satisfying the α postulate – set by Cornfield (1966) as the condition for the p value to be a measure of evidence in itself. In view of the low prospect that most researchers will soon convert to using Bayesian statistics in any form, we suggest that researchers who elect the conservative option of resorting to NHST be encouraged to avoid, as much as possible, using a p value of .05 as the threshold for rejecting H0. The analysis presented here may be used to discuss afresh which threshold p value would be a reasonable, practical substitute.

📄 Content

Consider avoiding the .05 significance level

David Navon Yoav Cohen

The University of Haifa, Israel National Institute for Testing and Evaluation, Israel

Running Head: Avoiding .05

Address:
David Navon Department of Psychology University of Haifa Haifa 31905 Israel

Phone: 972-4-8240927 E-mail: dnavon@psy.haifa.ac.il

Abstract

It is suggested that some shortcomings of Null Hypothesis Significance Testing (NHST), viewed from the perspective of Bayesian statistics, turn benign once the traditional threshold p value of .05 is replaced by a sufficiently smaller value. To illustrate, the posterior probability of an H0 stating P=.5, given data that just barely render it rejected by NHST with a p value of .05 (and a uniform prior), is shown here to be not much smaller than .50 for most values of N below 100 (and even to exceed .50 for N≥100); in contrast, with a p value of .001 the posterior probability does not exceed .06 for N≤100 (nor .25 for N<9000). More interesting still, the posterior probability becomes quite independent of N with a p value of .0001, thus practically satisfying the α postulate – set by Cornfield (1966) as the condition for the p value to be a measure of evidence in itself. In view of the low prospect that most researchers will soon convert to using Bayesian statistics in any form, we suggest that researchers who elect the conservative option of resorting to NHST be encouraged to avoid, as much as possible, using a p value of .05 as the threshold for rejecting H0. The analysis presented here may be used to discuss afresh which threshold p value would be a reasonable, practical substitute.
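The abstract's illustrative computation can be sketched in a few lines. The following is a minimal reconstruction under stated assumptions, not the authors' own code: it assumes equal prior odds on H0 and H1 (the `prior_h0` parameter), a point null P=.5, and a uniform prior on P under H1, whose marginal likelihood for any count of k successes in N trials is then 1/(N+1); the function name `posterior_h0` is ours.

```python
from math import comb

def posterior_h0(n, alpha=0.05, prior_h0=0.5):
    """Posterior P(H0: P=.5) given the smallest success count k that a
    two-sided binomial test just rejects at significance level alpha.

    Assumed setup (a sketch, not the authors' code): H1 places a
    uniform prior on P, so its marginal likelihood of k successes in
    n trials is 1/(n+1) for every k.
    """
    # Smallest k above n/2 whose two-sided p value is <= alpha.
    for k in range(n // 2 + 1, n + 1):
        upper_tail = sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n
        if 2 * upper_tail <= alpha:
            break
    lik_h0 = comb(n, k) / 2 ** n   # Binomial(k | n, P=1/2)
    lik_h1 = 1 / (n + 1)           # marginal under uniform prior on P
    num = prior_h0 * lik_h0
    return num / (num + (1 - prior_h0) * lik_h1)

# At n = 100, the posterior at the .05 rejection boundary hovers near
# chance level, while at the .001 boundary it is far smaller.
print(round(posterior_h0(100, alpha=0.05), 3))
print(round(posterior_h0(100, alpha=0.001), 3))
```

Sweeping `n` (and trying `alpha=0.0001`) reproduces the abstract's further point that the posterior becomes nearly independent of N at sufficiently small thresholds.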

Highlights

• We argue that some shortcomings of NHST are not as irreparable as sometimes presented
• The posterior probability of an H0 rejected with a p value of .05 hovers around chance level
• That probability is much smaller, and much less dependent on N, the lower the p value is
• We suggest using, as threshold significance levels, p values much smaller than .05

Key words: Hypothesis testing, significance level, replicability, type-I error


Null Hypothesis Significance Testing (NHST) is still employed as the main inference tool in experimental studies despite the longstanding scholarly controversy over its use. That dispute started roughly at the time that Ronald Fisher began criticizing other statisticians, among them J. Neyman and M.G. Kendall, for overstretching and convoluting his rudimentary idea of the null hypothesis, which he had first introduced¹ in Fisher (1925) and brilliantly explained a decade later (Fisher, 1935; cf. review in Salsburg, 2001).
The use of NHST grew considerably more controversial in recent decades in light of several concerns (see, e.g., Bakan, 1966; Campbell, 1982; Carver, 1978, 1993; Cohen, 1990, 1994; Edwards, Lindman & Savage, 1963; Falk & Greenbaum, 1995; Hunter, 1997; Krantz, 1999; Krueger, 2001; Lykken, 1968; Nickerson, 2000; Simmons, Nelson & Simonsohn, 2011; Wilson, Miller & Lower, 1967; and lately, Nuzzo, 2014). A prominent concern is that NHST does not measure evidential weight but rather collapses a continuum of evidential weight into a binary decision (reject/accept H0), thereby nullifying the potential impact of any piece of evidence that fails to meet, even just barely, the significance criterion, often for quite prosaic reasons (budgetary constraints, for example). Another worry is that NHST focuses on the cost of a false positive while almost neglecting the value of a true positive (Killeen, 2006). Vehement NHST critics, like Armstrong (2007) and Krueger (2001), deplore what one of them (Krueger, ibid, p. 16) called "the survival of a flawed method" and others designated as an instance of "trained incapacity" (Ziliak & McCloskey, 2008, p. 238) or of a "cult" (ibid, book's title). Consequently, reliance on confidence intervals and effect sizes has increased in recent decades (e.g., Kline, 2004), and quite a few authors now advocate that they officially replace NHST (e.g., Cumming, 2013).
On the other hand, there is no consensus that NHST is basically wrong (see, e.g., Hagen, 1997; Mogie, 2004; Nickerson, 2000). Nickerson chose to summarize his thorough review of the controversy as follows: "NHST is easily misunderstood and misused but when applied with good judgment it can be an effective aid to the interpretation of experimental data" (ibid, p. 241). One way or the other, it is a fact that NHST is still applied to research data almost as prevalently as before (as noted in Lakens & Evers, 2014, p. 284), not least because it is still required by the publication manuals of many scholarly journals.
True, many people doing research in psychology fail to heed the warnings about NHST made in textbooks of statistics for psychologists (and refreshed from time to time in articles
