Consider avoiding the .05 significance level
📝 Abstract
It is suggested that some shortcomings of Null Hypothesis Significance Testing (NHST), viewed from the perspective of Bayesian statistics, turn benign once the traditional threshold p value of .05 is substituted by a sufficiently smaller value. To illustrate, the posterior probability of H0 stating P=.5, given data that just render it rejected by NHST with a p value of .05 (and a uniform prior), is shown here to be not much smaller than .50 for most values of N below 100 (and even exceeds .50 for N≥100); in contrast, with a p value of .001 the posterior probability does not exceed .06 for N≤100 (nor .25 for N<9000). Yet more interestingly, posterior probability becomes quite independent of N with a p value of .0001, hence practically satisfying the α postulate – set by Cornfield (1966) as the condition for the p value being a measure of evidence in itself. In view of the low prospect that most researchers will soon convert to using Bayesian statistics in any form, we thus suggest that researchers who elect the conservative option of resorting to NHST be encouraged to avoid as much as possible using a p value of .05 as a threshold for rejecting H0. The analysis presented here may be used to discuss afresh which threshold p value seems to be a reasonable, practical substitute.
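The numerical claims above can be reproduced, at least approximately, with a short computation. The sketch below (the function names `two_sided_p` and `posterior_h0` are ours, not the paper's) follows the setup the abstract describes: prior odds of 1:1 on H0 versus H1 and a uniform prior on P under H1, so that the marginal likelihood of k successes in N trials under H1 integrates to 1/(N+1). Because success counts are discrete, exact-test values differ somewhat from figures obtained by treating the p value as lying exactly at the threshold.

```python
from math import comb

def two_sided_p(k, n):
    """Exact two-sided binomial p-value for H0: P = .5 (doubling the smaller tail)."""
    tail = sum(comb(n, i) for i in range(min(k, n - k) + 1)) * 0.5 ** n
    return min(1.0, 2.0 * tail)

def posterior_h0(n, alpha):
    """Posterior probability of H0: P = .5 for data that just reject H0 at level alpha.

    Assumes prior odds of 1:1 on H0 vs. H1 and a uniform prior on P under H1;
    k is the smallest success count (k >= n/2) whose two-sided p-value drops
    to or below alpha.
    """
    k = next(k for k in range(n // 2, n + 1) if two_sided_p(k, n) <= alpha)
    like_h0 = comb(n, k) * 0.5 ** n  # binomial likelihood at P = .5
    like_h1 = 1.0 / (n + 1)          # integral of C(n,k) p^k (1-p)^(n-k) dp over [0,1]
    return like_h0 / (like_h0 + like_h1)
```

Comparing `posterior_h0(100, 0.05)` with `posterior_h0(100, 0.001)` illustrates the order-of-magnitude drop the abstract reports when the threshold is tightened from .05 to .001.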
📄 Content
Consider avoiding the .05 significance level
David Navon, The University of Haifa, Israel
Yoav Cohen, National Institute for Testing and Evaluation, Israel
Running Head: Avoiding .05
Address:
David Navon
Department of Psychology
University of Haifa
Haifa 31905
Israel
Phone: 972-4-8240927 E-mail: dnavon@psy.haifa.ac.il
Highlights
• We argue that some shortcomings of NHST are not as irreparable as sometimes presented
• Posterior probability of a H0 rejected with p value of .05 hovers around chance level
• That probability is much smaller and much less dependent on N, the lower the p value is
• We suggest using as threshold significance level p values much smaller than .05
Key words: Hypothesis testing, significance level, replicability, type-I error
Null Hypothesis Significance Testing (NHST) is being employed as the main
inference tool in experimental studies despite the longstanding scholarly controversy over
its use. That dispute started roughly at the time that Ronald Fisher began criticizing other
statisticians, among them J. Neyman and M.G. Kendall, for overstretching and convoluting
his rudimentary idea of the null hypothesis, as he had first introduced it¹ in Fisher (1925) and
brilliantly explained a decade later (Fisher, 1935; cf. review in Salsburg, 2001).
The use of NHST grew considerably more controversial in recent decades in light
of several concerns (see, e.g., Bakan, 1966; Campbell, 1982; Carver, 1978, 1993; Cohen,
1990, 1994; Edwards, Lindman & Savage, 1963; Falk & Greenbaum, 1995; Hunter, 1997;
Krantz, 1999; Krueger, 2001; Lykken, 1968; Nickerson, 2000; Simmons, Nelson &
Simonsohn, 2011; Wilson, Miller & Lower, 1967; and lately, Nuzzo, 2014). A prominent
concern is that NHST does not measure evidential weight but rather collapses a
continuum of evidential weight into a binary decision (reject/accept H0), thereby nullifying
the potential impact of any piece of evidence that fails to meet, even just barely, the
significance criterion, often for quite prosaic reasons (budgetary constraints, for example).
Another worry is that NHST focuses on the cost of a false positive while almost neglecting
the value of a true positive (Killeen, 2006). Vehement NHST critics, like Armstrong (2007)
and Krueger (2001), deplore what one of them (Krueger, ibid, p.16) called “the survival of a
flawed method” and others designated as an instance of “trained incapacity” (Ziliak &
McCloskey, 2008, p. 238) or of a “cult” (ibid, book’s title). Consequently, reliance on
confidence intervals and effect sizes has increased in recent decades (e.g., Kline, 2004),
and quite a few authors now advocate that these officially replace NHST (e.g., Cumming,
2013).
On the other hand, there is no consensus that NHST is basically wrong (see, e.g.,
Hagen, 1997; Mogie, 2004; Nickerson, 2000). Nickerson chose to summarize his thorough
review of the controversy as follows: “NHST is easily misunderstood and misused but
when applied with good judgment it can be an effective aid to the interpretation of
experimental data” (ibid, p. 241). One way or the other, it is a fact that NHST is still applied
to research data almost as prevalently as before (as noted in Lakens & Evers, 2014, p.
284), not least because it is still required in many publication manuals of scholarly
journals.
True, many people doing research in psychology fail to heed the warnings about
NHST made in textbooks of statistics for psychologists (and refreshed from time to time in
articles