A reckless guide to P-values: local evidence, global errors
This chapter demystifies P-values, hypothesis tests and significance tests, and introduces the concepts of local evidence and global error rates. The local evidence is embodied in \textit{this} data and concerns the hypotheses of interest for \textit{this} experiment, whereas the global error rate is a property of the statistical analysis and sampling procedure. It is shown using simple examples that local evidence and global error rates can be, and should be, considered together when making inferences. Power analysis for experimental design for hypothesis testing is explained, along with the more locally focussed expected P-values. Issues relating to multiple testing, HARKing, and P-hacking are explained, and it is shown that, in many situations, their effects on local evidence and global error rates are in conflict, a conflict that can always be overcome by a fresh dataset from replication of key experiments. Statistics is complicated, and so is science. There is no singular right way to do either, and universally acceptable compromises may not exist. Statistics offers a wide array of tools for assisting with scientific inference by calibrating uncertainty, but statistical inference is not a substitute for scientific inference. P-values are useful indices of evidence and deserve their place in the statistical toolbox of basic pharmacologists.
Authors: Michael J. Lew
Chapter 13 in the book Good Research Practice in Experimental Pharmacology, editors A. Bespalov, M.C. Michel, and T. Steckler, to be published by Springer. Open Access, Creative Commons 4.0. Michael J. Lew, Department of Pharmacology and Therapeutics, University of Melbourne. October 7, 2019.

Contents
1 Abstract
2 Introduction
2.1 On the role of statistics
3 All about P-values
3.1 Hypothesis test and Significance test
3.2 Contradictory instructions
3.3 Evidence is local; error rates are global
3.4 On the scaling of P-values
3.5 Power and expected P-values
4 Practical problems with P-values
4.1 The significance filter exaggeration machine
4.2 Multiple comparisons
4.3 P-hacking
4.4 What is a statistical model?
5 P-values and inference

2 Introduction

There is a widespread consensus that we are in the midst of a ‘reproducibility crisis’ and that inappropriate application of statistical methods facilitates, or even causes, irreproducibility [Ioannidis, 2005, Nuzzo, 2014, Colquhoun, 2014, George et al., 2017, Wagenmakers et al., 2018]. P-values are a “pervasive problem” [Wagenmakers, 2007] because they are misunderstood, misapplied, and answer a question that no-one asks [Royall, 1997, Halsey et al., 2015, Colquhoun, 2014]. They exaggerate evidence [Johnson, 2013, Benjamin et al., 2018] or they are irreconcilable with evidence [Berger and Sellke, 1987]. What’s worse, ‘P-hacking’ amplifies their intrinsic shortcomings [Fraser et al., 2018].
The inescapable conclusion, it would seem, is that P-values should be eliminated by replacement with Bayes factors [Goodman, 2001, Wagenmakers, 2007] or confidence intervals [Cumming, 2008], or by simply doing without [Trafimow and Marks, 2015]. However, much of the blame for irreproducibility that is apportioned to P-values is based on pervasive and pernicious misunderstandings.

This chapter is an attempt to resolve those misunderstandings. Some might say it is a reckless attempt because history suggests that it is doomed to failure, and reckless also because it goes against much of the conventional wisdom regarding P-values and will therefore be seen by some as promoting inappropriate statistical practices. That’s OK though, because the conventional wisdom regarding P-values is mistaken in important ways, and those mistakes fuel false suppositions regarding what practices are appropriate.

2.1 On the role of statistics

Statistics is complicated[1] but is usually presented simplistically in the statistics textbooks and courses studied by pharmacologists. Readers of those books and graduates of those courses should therefore be forgiven for wrongly assuming that statistics is a set of rules and recipes that must be applied in order to obtain a statistically valid result. The instructions say that you match the data to the recipe, turn the crank, and bingo: it’s significant, or not. If you do it right then you might be rewarded with a star! No matter how explicable that simplistic view of statistics might be, it is far too limiting. It leads to thoughtless use of a limited set of methods and to over-reliance on the familiar but misunderstood P-value. It prevents the full utilisation of statistical thinking within scientific inference, and allows bad statistics to license false inferences.

Footnote 1: Even its grammatical form is complicated: “statistics” looks like a plural noun, but it is both plural when referring to values calculated from data and singular when referring to the discipline or approaches to data analysis.

We have to aim for more than the rote-learning of recipes in statistics courses because while statistics is not simple, good science is harder. I therefore take as a working assumption the notion that good scientists are capable of dealing with the intricacies of statistical thinking.

I will admit up front that it is not essential to have a statistical inference in order to make a scientific inference. For example, there is little need for a formal statistical analysis if results can be dealt with using the inter-ocular impact test[2]. However, scientific inferences can be made more securely with statistics because it offers a rich set of tools for calibrating uncertainty. Statistical analysis is particularly helpful in the penumbral ‘maybe zone’ where the uncertainty is relatively evenly balanced—the zone where scientists are most likely to be swayed by biasses into over-interpretation of random deviations within the noise. The extra insight from a well-implemented statistical analysis can protect from the desire to find something notable, and thereby reduce the number of false claims made.

Footnote 2: In other words, results that hit you right between the eyes. In the Australian vernacular the inter-ocular impact test is the bloody obvious test.

Most people need all the help they can get to prevent them making fools of themselves by claiming that their favourite theory is substantiated by observations that do nothing of the sort. —[Colquhoun, 1971, p. 1]
Improved utilisation of statistical approaches would indeed help to minimise the number of times that pharmacologists make fools of themselves by reducing the number of false positive results in pharmacological journals and, consequently, reduce the number of faulty leads that fail to translate into a therapeutic [Begley and Ellis, 2012]. However, even ideal application of the most appropriate statistical methods would not improve the replicability of published results quite as much as might be assumed, because not every result that fails to be replicated is a false positive and not every mistaken conclusion would be prevented by better statistical inferences.

Basic pharmacological studies are typically performed using biological models such as cell lines, tissue samples, or laboratory animals, and so even if the original results are not false positives a replication might fail when it is conducted using different models [Drucker, 2016]. Replications might also fail when the original results are critically dependent on unrecognised methodological details, or on reagents such as antibodies that have properties that can vary over time or between sources [Berglund et al., 2008, Baker and Dolgin, 2017, Voelkl et al., 2018]. It is those types of irreproducibility rather than false positives that are responsible for many failures of published leads to translate into clinical targets or therapeutics (see also Chapter 11). The distinction being made here is between false positive inferences, which lack ‘internal validity’, and failures of generalisability, which lack ‘external validity’ even though correct in themselves. It is an important distinction because the former can be reduced by more appropriate use of statistical methods but the latter cannot.

The inherent objectivity of statistics can minimise the number of times that we make fools of ourselves, but just doing statistics is not enough, because it is not a set of rules for scientists to follow to make automated scientific inferences. To get from calibrated statistical inferences to reliable inferences about the real world, the statistical analyses have to be interpreted thoughtfully and in the full knowledge of the properties of the tool and the nature of the real world system being probed. Some researchers might be disconcerted by the fact that statistics cannot provide such certainty, because they just want to be told whether their latest result is “real”. No matter how attractive it might be to fob off onto statistics the responsibility for inferences, the answers that scientists seek cannot be provided by statistics alone.

3 All about P-values

P-values are not everything, and they are certainly not nothing. There are many, many useful procedures and tools in statistics that do not involve or provide P-values, but P-values are by far the most widely used inferential statistic in basic pharmacological research papers.

P-values are a practical success but a critical failure. Scientists the world over use them, but scarcely a statistician can be found to defend them. —[Senn, 2001, p. 193]

Not only are P-values rarely defended, they are frequently derided (e.g. Berger and Sellke [1987], Lecoutre et al. [2001], Goodman [2001], Wagenmakers [2007]).
Even so, support for the continued use of P-values for at least some purposes with some caveats can be found (e.g. Nickerson [2000], Senn [2001], García-Pérez [2016], Krueger and Heck [2017]). One crucial caveat is that a clear distinction has to be drawn between the dichotomisation of P-values into ‘significant’ or ‘not significant’ (typically on the basis of a threshold set at 0.05) and the evidential meaning of the actual numerically specified P-value. The former comes from a hypothesis test and the latter from a significance test. Contrary to what many readers will think and have been taught, they are not the same things. It might be argued that the battle to retain a clear distinction between significance tests and hypothesis tests has long been lost, but I have to continue that battle here because that distinction is critical for understanding the uses and misuses of P-values. Detailed accounts can also be found elsewhere [Huberty, 1993, Senn, 2001, Hubbard et al., 2003, Lenhard, 2006, Hurlbert and Lombardi, 2009, Lew, 2012].

3.1 Hypothesis test and Significance test

When comparing significance tests and hypothesis tests it is conventional to note that the former are ‘Fisherian’ (or, perhaps, “neoFisherian” [Hurlbert and Lombardi, 2009]) and the latter are ‘Neyman–Pearsonian’. R.A. Fisher did not invent significance tests per se—Gosset published what became Student’s t-test before Fisher’s career had begun [Student, 1908] and even that is not the first example—but Fisher did effectively popularise their use with his book Statistical Methods for Research Workers (1925), and he is credited with (or blamed for!) the convention of P < 0.05 as a criterion for ‘significance’. It is important to note that Fisher’s ‘significant’ denoted something along the lines of worthy of further consideration or investigation, which is different to what is denoted by the same word applied to the results of a hypothesis test.

Hypothesis tests came later, with the 1933 paper by Neyman & Pearson that set out the workings of dichotomising hypothesis tests and also introduced the ideas of “errors of the first kind” (false positive errors; type I errors) and “errors of the second kind” (false negative errors; type II errors) and a formalisation of the concept of statistical power.

A Neyman–Pearsonian hypothesis test is more than a simple statistical calculation. It is a method that properly encompasses experimental planning and experimenter behaviour as well. Before an experiment is conducted, the experimenter chooses α, the size of the critical region in the distribution of the test statistic, on the basis of the acceptable false positive (i.e. type I) error rate and sets the sample size on the basis of an acceptable false negative (i.e. type II) error rate. In effect the sample size, power[3], and α are traded off against each other to obtain an experimental design with the appropriate mix of cost and error rates. In order for the error rates of the procedure to be well calibrated, the sample size and α have to be set in advance of the experiment being performed, a detail that is often overlooked by pharmacologists.

Footnote 3: The ‘power’ of the experiment is one minus the false negative error rate, but it is a function of the true effect size, as explained later.
After the experiment has been run and the data are in hand, the mechanics of the test involve a determination of whether the observed value of the test statistic lies within a pre-determined critical region under the sampling distribution provided by a statistical model and the null hypothesis. When the observed value of the test statistic falls within the critical region the result is ‘significant’ and the analyst discards the null hypothesis. When the observed test statistic falls outside the critical region the result is ‘not significant’ and the null hypothesis is not discarded.

In current practice, dichotomisation of results into significant and not significant is most often made on the basis of the observed P-value being less than or greater than a conventional threshold of 0.05, so we have the familiar P < 0.05 for α = 0.05. The one-to-one relationship between the test statistic being within the critical region and the P-value being less than α means that such practice is not intrinsically problematical, but using a P-value as an intermediate in a hypothesis test obscures the nature of the test and contributes to the conflation of significance tests and hypothesis tests.

The classical Neyman–Pearsonian hypothesis test is an acceptance procedure, or a decision theory procedure [Birnbaum, 1977, Hurlbert and Lombardi, 2009], that does not require, or provide, a P-value. Its output is a binary decision: either reject the null hypothesis or fail to reject the null hypothesis. In contrast, a Fisherian significance test yields a P-value that encodes the evidence in the data against the null hypothesis, but not, directly, a decision. The P-value is the probability of observing data as extreme as that observed, or more extreme, when the null hypothesis is true. That probability is generated or determined by a statistical model of some sort, and so we should really include the phrase ‘according to the statistical model’ in the definition. In the Fisherian tradition[4] a P-value is interpreted evidentially: the smaller the P-value the stronger the evidence against the null hypothesis and the more implausible the null hypothesis is, according to the statistical model. No behavioural or inferential consequences attach to the observed P-value and no threshold need be applied because the P-value is a continuous index.

Footnote 4: It has been argued that because Fisher regularly described experimental results as ‘significant’ or ‘not significant’ he was treating P-values dichotomously and that he used a fixed threshold for that dichotomisation (e.g. [Lehmann, 2011, pp. 51–53]). However, Fisher meant the word ‘significant’ to denote only that a result is worthy of attention and follow up, and he quoted P-values as being less than 0.05, 0.02, and 0.01 because he was working from tables of critical values of test statistics rather than laboriously calculating exact P-values manually. He wrote about the issue on several occasions, for example this: “Convenient as it is to note that a hypothesis is contradicted at some familiar level of significance such as 5% or 2% or 1% we do not, in Inductive Inference, ever need to lose sight of the exact strength which the evidence has in fact reached, or to ignore the fact that with further trial it might come to be stronger, or weaker.” —[Fisher, 1960, p. 25]

In practice the probabilistic nature of P-values has proved difficult to use because people tend to mistakenly assume that the P-value measures the probability of the null hypothesis or the probability of an erroneous decision—it seems that they prefer any probability that is more noteworthy or less of a mouthful than the probability according to a statistical model of observing data as extreme or more extreme when the null hypothesis is true. Happily, there are no ordinary uses of P-values that require them to be interpreted as probabilities. My advice is to forget that P-values can be defined as probabilities and instead look at them as indices of surprisingness or unusualness of data: the smaller the P-value the more surprising are the data compared to what the statistical model predicts when the null hypothesis is true.
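As a concrete illustration of that definition, the following R sketch (the data and seed are invented for this example and do not come from the chapter) compares the one-tailed P-value reported by Student's t-test with the proportion of simulated null-hypothesis experiments that produce a test statistic at least as extreme as the one observed; the two numbers agree closely, which is all that the 'surprise index' reading of a P-value requires.

## Hypothetical data for two groups of n = 5 (invented for illustration).
control <- c(21, 24, 19, 23, 22)
treated <- c(24, 26, 21, 25, 23)
obs_t <- t.test(treated, control, var.equal = TRUE)$statistic

## The statistical model for the null hypothesis: both groups drawn from
## the same normal population.
set.seed(1)
null_t <- replicate(20000, {
  x <- rnorm(5, mean = 22, sd = 2)
  y <- rnorm(5, mean = 22, sd = 2)
  t.test(y, x, var.equal = TRUE)$statistic
})

## One-tailed P-value as a surprise index: the proportion of null-model
## results at least as extreme as the observed one...
mean(null_t >= obs_t)
## ...and the model-based value from the t-distribution, for comparison.
t.test(treated, control, var.equal = TRUE, alternative = "greater")$p.value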
Conflation of significance tests and hypothesis tests may be encouraged by their apparently equivalent outputs (significance and P-values), but the conflation is too often encouraged by textbook authors, even to the extent of presenting a hybrid approach containing features of both. The problem has deep roots: when Neyman & Pearson published their hypothesis test in 1933 it was immediately assumed that their test was an extension of Fisher’s significance tests. Substantive differences in the philosophical and theoretical underpinnings soon became apparent to the protagonists and a long-lasting and bitter personal enmity developed between Fisher and Neyman [Lenhard, 2006, Lehmann, 2011]. That feud seems likely to be one of the causes of the confusion that we have today, as it has been suggested that authors of statistics textbooks avoided taking sides in the feud—an understandable response given the vehemence and forceful personalities of the protagonists—either by presenting only one of the approaches without mention of the other or by presenting a mixture of both [Cowles, 1989, Huberty, 1993, Halpin and Stam, 2006].

Whatever the origin of the confusion, the fact that significance tests and hypothesis tests are rarely explained as distinct alternatives in textbooks encourages many to mistakenly assume that ‘significance test’ and ‘hypothesis test’ are synonyms. It also encourages the use of a hybrid of the two which is commonly called NHST (for Null Hypothesis Significance Test). NHST has been derided, for example as an “inconsistent mishmash” [Gigerenzer, 1998] and as a “jerry-built framework” [Krueger and Heck, 2017, p. 1], but versions of NHST are nonetheless more common than well-constructed hypothesis tests and significance tests together. Users of NHST almost universally assume that they are ‘doing it right’ and the assumption that P-value equals NHST persists, largely unnoticed, particularly in the commentaries of those clamouring for the elimination of P-values. I therefore feel compelled to add to the list of derogatory epithets: NHST is like a reverso-platypus. The platypus was at one time derided as a fake[5]—a composite creature consisting of parts of several animals—but is a real animal, rare but beautiful, and perfectly adapted to its ecological niche.
The common NHST is assumed by its many users to be a proper statistical procedure but is, in fact, an ugly composite, maladapted for almost all analytic purposes.

Footnote 5: Well, that’s the conventional wisdom, but it may be an exaggeration. The first scientific description of the “duck-billed platypus” was done in England by Shaw & Nodder (1789), who wrote “Of all Mammalia yet known it seems the most extraordinary in its conformation; exhibiting the perfect resemblance of the beak of a Duck engrafted on the head of a quadruped. So accurate is the similitude that, at first view, it naturally excites the idea of some deceptive preparation by artificial means”. If Shaw & Nodder really thought it a fake, they did not do so for long.

3.2 Contradictory instructions

No-one should be using NHST, but should we use hypothesis testing or significance testing? The answer should depend on what your analytical objectives are, but in practice it more often depends on who you ask. Not all advice is good advice, and not even the experts agree. Responses to the American Statistical Association’s official statement on P-values provide a case in point. In response to the widespread expressions of concern over the misuse and misunderstanding of P-values, the ASA convened a group of experts to consider the issues and to collaborate on drafting an official statement on P-values [Wasserstein and Lazar, 2016]. Invited commentaries were published alongside the final statement, and even a brief reading of those commentaries on the statement will turn up misgivings and disagreements. Given that most of the commentaries were written by participants in the expert group, such disquiet and dissent confirms the difficulty of this topic. It should also signal to readers that their practical familiarity with P-values does not ensure that they understand P-values.

The official ASA statement on P-values sets out six numbered principles concerning P-values and scientific inference:

1. P-values can indicate how incompatible the data are with a specified statistical model.
2. P-values do not measure the probability that the studied hypothesis is true, or the chance that the data were produced by random chance.
3. Scientific conclusions and business or policy decisions should not be based only on whether a P-value passes a specific threshold.
4. Proper inference requires full reporting and transparency.
5. A P-value, or statistical significance, does not measure the size of an effect or the importance of a result.
6. By itself, a P-value does not provide a good measure of evidence regarding a model or hypothesis.

Those principles are all sound—some derive directly from the definition of P-values and some are self-evidently good advice about the formation and reporting of scientific conclusions—but hypothesis tests and significance tests are not even mentioned in the statement, and so it does not directly answer the question about whether we should use significance tests or hypothesis tests that I asked at the start of this section. Nevertheless, the statement offers a useful perspective and is not entirely neutral on the question. It urges against the use of a threshold in Principle 3, which says “Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.” Without a threshold we cannot use a hypothesis test.
Lest any reader think that the intention is that P-values should not be used, I point out that the explanatory note for that principle in the ASA document begins thus:

Practices that reduce data analysis or scientific inference to mechanical “bright-line” rules (such as “p < 0.05”) for justifying scientific claims or conclusions can lead to erroneous beliefs and poor decision making. —[Wasserstein and Lazar, 2016, p. 131]

“Bright-line rule” is an American legal phrase denoting an approach to simplifying ambiguous or complex legal issues by establishment of a clear, consistent ruling on the basis of objective factors. In other words, subtleties of circumstance and subjective factors are ignored in favour of consistency and simplicity. Such a rule might be useful in the legal setting, but it does not sound like an approach well-suited to the considerations that should underlie scientific inference. It is unfortunate, therefore, that a mechanical bright-line rule is so often used in basic pharmacological research, and even worse that it is demanded by the instructions to authors of the British Journal of Pharmacology:

When comparing groups, a level of probability (P) deemed to constitute the threshold for statistical significance should be defined in Methods, and not varied later in Results (by presentation of multiple levels of significance). Thus, ordinarily P < 0.05 should be used throughout a paper to denote statistically significant differences between groups. —[Curtis et al., 2015]

Figure 1: P=0.04 is not very different from P=0.06. Pseudo-data devised to yield one-tailed P=0.06 (left) and P=0.04 (right) from a Student’s t-test for independent samples, n = 5 per group. The y-axis is an arbitrarily scaled measure.

An updated version of the guidelines retains those instructions [?], but because it is a bad instruction I present three objections. The first is that routine use of an arbitrary P-value threshold for declaring a result significant ignores almost all of the evidential content of the P-value by forcing an all-or-none distinction between a P-value small enough and one not small enough. The arbitrariness of a threshold for significance is well known and flows from the fact that there is no natural cutoff point or inflection point in the scale of P-values. Anyone who is unconvinced that it matters should note that the evidence in a result of P=0.06 is not so different from that in a result of P=0.04 as to support an opposite conclusion (Figure 1).

The second objection to the instruction to use a threshold of P < 0.05 is that exclusive focus on whether the result is above or below the threshold blinds analysts to information beyond the sample in question. If the statistical procedure says discard the null hypothesis (or don’t discard it) then that statistical decision seems to override and make redundant any further considerations of evidence, theory, or scientific merit. That is quite dangerous, because all relevant material should be considered and integrated into scientific inferences.

The third objection refers to the strength of evidence needed to reach the threshold: the British Journal of Pharmacology instruction licenses claims on the basis of relatively weak evidence.[6]

Footnote 6: Accepting P=0.05 as a sufficient reason to suppose that a treatment is effective is akin to accepting 50% as a passing grade: it’s traditional in many settings, but it’s far from reassuring.
The evidential disfavouring of the null hypothesis in a P-value close to 0.05 is surprisingly weak when viewed as a likelihood ratio or Bayes factor [Goodman and Royall, 1988, Johnson, 2013, Benjamin et al., 2018], a weakness that can be confirmed by simply ‘eyeballing’ Figure 1.

A fixed threshold corresponding to weak evidence might sometimes be reasonable, but often it is not. As Carl Sagan said: “Extraordinary claims require extraordinary evidence.”[7] It would be possible to overcome this last objection by setting a lower threshold whenever an extraordinary claim is to be made, but the British Journal of Pharmacology instructions preclude such a choice by insisting that the same threshold be applied to all tests within the whole study. There has been a serious proposal that a lower threshold of P < 0.005 be adopted as the default [Johnson, 2013, Benjamin et al., 2018], but even if that would ameliorate the weakness of evidence objection, it doesn’t address all of the problems posed by dichotomising results into significant and not significant, as is acknowledged by the many authors of that proposal.

Footnote 7: That phrase comes from the television series Cosmos, 1980, but may derive from Laplace (1812), who wrote “The weight of evidence for an extraordinary claim must be proportioned to its strangeness.” [translated, the original is in French].

Should the British Journal of Pharmacology enforce its guideline on the use of Neyman–Pearsonian hypothesis testing with a fixed threshold for statistical significance? Definitely not, and laboratory pharmacologists should usually avoid such tests because their nature is ill-suited to the reality of basic pharmacological studies.

The shortcoming of hypothesis testing is that it offers an all-or-none outcome and it engenders a one-and-done response to an experiment. All-or-none in that the significant or not significant outcome is dichotomous. One-and-done because once a decision has been made to reject the null hypothesis there is little apparent reason to re-test that null hypothesis the same way, or differently. There is no mechanism within the classical Neyman–Pearsonian hypothesis testing framework for a result to be treated as provisional. That is not particularly problematical in the context of a classical randomised clinical trial (RCT) because an RCT is usually conducted only after preclinical studies have addressed the relevant biological questions. That allows the scientific aims of the study to be simple—they are designed to provide a definitive answer to the primary question. An all-or-none one-and-done hypothesis test is therefore appropriate for an RCT.[8]

Footnote 8: Clinical trials are sometimes aggregated in meta-analyses, but the substrate for meta-analytical combination is the observed effect sizes and sample sizes of the individual trials, not the dichotomised significant or not significant outcomes.

But the majority of basic pharmacological laboratory studies do not have much in common with an RCT because they consist of a series of interlinked and inter-related experiments contributing variously to the primary inference. For example, a basic pharmacological study will often include experiments that validate experimental methods and reagents, concentration-response curves for one or more drugs, positive and negative controls, and other experiments subsidiary to the main purpose of the study.
The design of the ‘headline’ experiment (assuming there is one) and interpretation of its results is dependent on the results of those subsidiary experiments, and even when there is a singular scientific hypothesis, it might be tested in several ways using observations within the study. It is the aggregate of all of the experiments that informs the scientific inferences. The all-or-none one-and-done outcome of a hypothesis test is less appropriate to basic research than it is to a clinical trial.

Pharmacological laboratory experiments also differ from RCTs in other ways that are relevant to the choice of statistical methodologies. Compared to an RCT, basic pharmacological research is very cheap, and the experiments can be completed very quickly, with the results available for analysis almost immediately. Those advantages mean that a pharmacologist might design some of the experiments within a study in response to results obtained in that same study,[9] and so a basic pharmacological study will often contain preliminary or exploratory research. Basic research and clinical trials also differ in the consequences of erroneous inference. A false positive in an RCT might prove very damaging by encouraging the adoption of an ineffective therapy, but in the much more preliminary world of basic pharmacological research a false positive result might have relatively little influence on the wider world. It could be argued that statistical protections against false positive outcomes that are appropriate in the realm of clinical trials can be inappropriate in the realm of basic research. This idea is illustrated in a later section of this chapter.

Footnote 9: Yes, that is also done in ‘adaptive’ clinical trials, but they are not the archetypical RCT that is the comparator here.

The multi-faceted nature of the basic pharmacological study means that statistical approaches yielding dichotomous yes or no outcomes are less relevant than they are to the archetypical RCT. The scientific conclusions drawn from basic pharmacological experiments should be based on thoughtful consideration of the entire suite of results in conjunction with any other relevant information, including both pre-existing evidence and theory. The dichotomous all-or-none, one-and-done hypothesis test is poorly adapted to the needs of basic pharmacological experiments, and is probably poorly adapted to the needs of most basic scientific studies. Scientific studies depend on a detailed evaluation of evidence but a hypothesis test does not fully support such an evaluation.

3.3 Evidence is local; error rates are global

A way to understand the difference between the Fisherian significance test and the Neyman–Pearsonian hypothesis test is to recognise that the former supports ‘local’ inference, whereas the latter is designed to protect against ‘global’ long-run error. The P-value of a significance test is local because it is an index of the evidence in this data against this null hypothesis.
In contrast, the hypothesis test decision regarding rejection of the null hypothesis is global because it is based on a parameter, α, which is set without reference to the observed data. The long run performance of the hypothesis test is a property of the procedure itself and is independent of any particular data, and so it is global. Local evidence; global errors. This is not an ahistoric imputation, because Neyman & Pearson were clear about their preference for global error protection rather than local evidence and about their objectives in devising hypothesis tests:

We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis. But we may look at the purpose of tests from another view-point. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong. —Neyman and Pearson [1933]

The distinction between local and global properties or information is relatively little known, but Liu & Meng (2016) offer a much more technical and complete discussion of the local/global distinction, using the descriptors ‘individualised’ and ‘relevant’ for the local and ‘robust’ for the global. They demonstrate a trade-off between relevance and robustness that requires judgement on the part of the analyst. In short, the desirability of methods that have good long-run error properties is undeniable, but paying attention exclusively to the global blinds us to the local information that is relevant to inferences. The instructions of the British Journal of Pharmacology are inappropriate because they attend entirely to the global and because the dichotomising of each experimental result into significant and not significant hinders thoughtful inference. Many of the battles and controversies regarding statistical tests swirl around issues that might be clarified using the local versus global distinction, and so it will be referred to repeatedly in what follows.

3.4 On the scaling of P-values

In order to be able to safely interpret the local, evidential, meaning of a P-value, a pharmacologist should understand its scaling. Just like the EC50s with which pharmacologists are so familiar, P-values have a bounded scale, and just as is the case with EC50s it makes sense to scale P-values geometrically (or logarithmically). The non-linear relationship between P-values and an intuitive scaling of evidence against the null hypothesis can be gleaned from Figure 2. Of course, a geometric scaling of the evidential meaning of P-values implies that the descriptors of evidence should be similarly scaled, and so such a scale is proposed in Figure 3, with P-values around 0.05 being called ‘trivial’ in recognition of the relatively unimpressive evidence for a real difference between condition A and control in Figure 2. Attentive readers will have noticed that the P-values in Figures 1, 2, and 3 are all one-tailed.

Figure 2: What simple evidence looks like. Pseudo-data devised to yield one-tailed P-values from 0.05 to 0.0001 from a Student’s t-test for independent samples, n = 5 per group. The left-most group of values is the control against which each of the other sets is compared, and the pseudo-datasets A, B, C, and D were generated by arithmetic adjustment of a single dataset to obtain the indicated P-values. The y-axis is an arbitrarily scaled measure.
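The EC50 analogy can be made concrete with a minimal R sketch (the P-values below are chosen by me for illustration, not taken from the figures): expressing a P-value as −log10(P), much as an EC50 is expressed as a pEC50, turns equal ratios of P-values into equal evidential steps.

p <- c(0.05, 0.005, 0.0005, 0.00005)   # a constant-ratio sequence of P-values
data.frame(P = p, neg_log10_P = -log10(p))
## On the raw scale the successive drops look wildly unequal (0.045, 0.0045, ...),
## but on the geometric scale each tenfold drop in P adds exactly one unit,
## which is why the descriptors in Figure 3 are laid out semi-geometrically.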
The number of tails that published P-values have is inconsistent, is often unspecified, and the number of tails that a P-value should have is controversial (e.g. see Dubey [1991], Bland and Bland [1994], Kobayashi [1997], Freedman [2008], Lombardi and Hurlbert [2009], Ruxton and Neuhaeuser [2010]). Arguments about P-value tails are regularly confounded by differences between local and global considerations. The most compelling reasons to favour two tails relate to global error rates, which means that they apply only to P-values that are dichotomised into significant and not significant in a hypothesis test. Those arguments can safely be ignored when P-values are used as indices of evidence, and I therefore recommend one-tailed P-values for general use in pharmacological experiments—as long as the P-values are interpreted as evidence and not as a surrogate for decision. (Either way, the number of tails should always be specified.)

Figure 3: Evidential descriptors for P-values. Strength of evidence against the null hypothesis scales semi-geometrically with the smallness of the P-value. Note that the descriptors for strength of evidence are illustrative only, and it would be a mistake to assume, for example, that a P-value of 0.001 indicates moderately strong evidence against the null hypothesis in every circumstance.

3.5 Power and expected P-values

The Neyman–Pearsonian hypothesis test is a decision procedure that, with a few assumptions, can be an optimal procedure. Optimal only in the restricted sense that the smallest sample gives the highest power to reject the null hypothesis when it is false, for any specified rate of false positive errors. To achieve that optimality the experimental sample size and α are selected prior to the experiment using a power analysis and with consideration of the costs of the two specified types of error and the benefits of potentially correct decisions. In other words, there is a loss function built into the design of experiments. However, outside of the clinical trials arena, few pharmacologists seem to design experiments in that way. For example, a study of 22 basic biomedical research papers published in Nature Medicine found that none of them included any mention of a power analysis for setting the sample size [Strasak et al., 2007], and a simple survey of the research papers in the most recent issue of British Journal of Pharmacology (2018, issue 17 of volume 175) gives a similar picture, with power analyses specified in only one out of the 11 research papers that used P < 0.05 as a criterion for statistical significance.
It is notable that all of those BJP papers included statements in their methods sections claiming compliance with the guidelines for experimental design and analysis, guidelines that include this as the first key point:

Experimental design should be subjected to a priori power analysis so as to ensure that the size of treatment and control groups is adequate [...] —[Curtis et al., 2015]

The most recent issue of Journal of Pharmacology and Experimental Therapeutics (2018, issue 3 of volume 366) similarly contains no mention of power or sample size determination in any of its 9 research papers, although none of its authors had to pay lip service to guidelines requiring it.

In reality, power analyses are not always necessary or helpful. They have no clear role in the design of a preliminary or exploratory experiment that is concerned more with hypothesis generation than hypothesis testing, and a large fraction of the experiments published in basic pharmacological journals are exploratory or preliminary in nature. Nonetheless, they are described here in detail because experience suggests they are mysterious to many pharmacologists and they are very useful for planning confirmatory experiments.

For a simple test like Student’s t-test a pre-experiment power analysis for determination of sample size is easily performed. The power of a Student’s t-test is dependent on: (i) the predetermined acceptable false positive error rate, α (bigger α gives more power); (ii) the true effect size, which we will denote as δ (more power when δ is larger); (iii) the population standard deviation, σ (smaller σ gives more power); and (iv) the sample size (larger n for more power). The common approach to a power test is to specify an effect size of interest and the minimum desired power, so say we wish to detect a true effect of δ = 3 in a system where we expect the standard deviation to be σ = 2. The free software[10] called R has the function power.t.test() that gives this result:

> power.t.test(delta = 3, sd = 2, power = 0.8, sig.level = 0.05,
                alternative = 'one.sided', n = NULL)

     Two-sample t test power calculation

              n = 6.298691
          delta = 3
             sd = 2
      sig.level = 0.05
          power = 0.8
    alternative = one.sided

NOTE: n is number in *each* group

It is conventional to round the sample size up to the next integer, so the sample size would be 7 per group.

Footnote 10: www.r-project.org

While a single point power analysis like that is straightforward, it provides relatively little information compared to the information supplied by the analyst, and its output is specific to the particular effect size specified, an effect size that more often than not has to be ‘guesstimated’ instead of estimated because it is the unknown that is the object of study. A plot of power versus effect size is far more informative than the point value supplied by the conventional power test (Figure 4).

Figure 4: Power functions for α = 0.05 and 0.005. Power of one-sided Student’s t-test for independent samples expressed as a function of standardised true effect size δ/σ for sample sizes (per group) from n = 3 to n = 40. Note that δ = µ1 − µ2 and σ are population parameters rather than sample estimates.
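Curves in the style of Figure 4 can be traced by calling power.t.test() over a grid of standardised effect sizes; the sketch below (my own illustration of the approach, not code from the chapter) uses the same per-group sample sizes as the figure and a one-sided test at α = 0.05.

## Power of the one-sided two-sample t-test as a function of the
## standardised effect size delta/sigma, for several per-group sample sizes.
effect_sizes <- seq(0.1, 4, by = 0.1)   # delta/sigma
group_sizes  <- c(3, 5, 10, 20, 40)     # n per group, as in Figure 4

power_curve <- function(n) {
  sapply(effect_sizes, function(d)
    power.t.test(n = n, delta = d, sd = 1, sig.level = 0.05,
                 alternative = "one.sided")$power)
}

pow <- sapply(group_sizes, power_curve)  # columns correspond to group sizes
matplot(effect_sizes, pow, type = "l", lty = 1,
        xlab = "Effect size (delta/sigma)", ylab = "Power")
legend("bottomright", legend = paste("n =", group_sizes),
       col = seq_along(group_sizes), lty = 1)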
Those graphical power functions show clearly the three-way relationship between sample size, effect size and the risk of a false negative outcome (i.e. one minus the power).

Some experimenters are tempted to perform a post-experiment power analysis when their observed P-value is unsatisfyingly large. They aim to answer the question of how large the sample should have been, and proceed to plug in the observed effect size and standard deviation and pull out a larger sample size—always larger—that might have given them the desired small P-value. Their interpretation is then that the result would have been significant but for the fact that the experiment was underpowered. That interpretation ignores the fact that the observed effect size might be an exaggeration, or the observed standard deviation might be an underestimation, and the null hypothesis might be true! Such a procedure is generally inappropriate and dangerous [Hoenig and Heisey, 2001]. There is a one-to-one correspondence between the observed P-value and post-experiment power, and no matter what the sample size, a larger than desired P-value always corresponds to a low power at the observed effect size, whether the null hypothesis is true or false. Power analyses are useful in the design of experiments, not for the interpretation of experimental results.

Power analyses are tied closely to dichotomising Neyman–Pearsonian hypothesis tests, even when expanded to provide full power functions as in Figure 4. However, there is an alternative more closely tied to Fisherian significance testing—an approach better aligned to the objectives of evidence gathering. That alternative is a plot of average expected P-values as functions of effect size and sample size [Sackrowitz and Samuel-Cahn, 1999, Bhattacharya and Habtzghi, 2002]. The median is more relevant than the mean, both because the distribution of expected P-values is very skewed and because the median value offers a convenient interpretation of there being a 50:50 bet that an observed P-value will be either side of it. An equivalent plot showing the 90th percentile of expected P-values gives another option for experiment sample size planning purposes (Figure 5).

Figure 5: Expected P-value functions. P-values expected from Student’s t-test for independent samples expressed as a function of standardised true effect size δ/σ for sample sizes (per group) from n = 3 to n = 40. The graph on the left shows the median of expected P-values (i.e. the 50th percentile) and the graph on the right shows the 90th percentile. It can be expected that 50% of observed P-values will lie below the median lines and 90% will lie below the 90th percentile lines for corresponding sample sizes and effect sizes. The dashed lines indicate P=0.05 and 0.005.
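The expected P-value idea lends itself to a brute-force simulation sketch (my own illustration, not the chapter's code; one-tailed, as elsewhere in the chapter): for a chosen per-group sample size and true standardised effect size, simulate many experiments and take the median and 90th percentile of the resulting P-values.

## Median and 90th percentile of P-values expected from a one-sided
## two-sample Student's t-test, given the true effect size delta/sigma.
expected_p <- function(n, effect, nsim = 10000) {
  p <- replicate(nsim, {
    x <- rnorm(n, mean = 0,      sd = 1)   # control
    y <- rnorm(n, mean = effect, sd = 1)   # treatment; true delta/sigma = effect
    t.test(y, x, alternative = "greater", var.equal = TRUE)$p.value
  })
  quantile(p, probs = c(0.5, 0.9))
}

set.seed(4)
expected_p(n = 5, effect = 1.5)   # e.g. n = 5 per group, delta/sigma = 1.5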
Should the British Journal of Pharmacology enforce its power guideline? In general no, but pharmacologists should use power curves or expected P-value curves for designing some of their experiments, and ought to say so when they do. Power analyses for sample size are very important for experiments that are intended to be definitive and decisive, and that’s why sample size considerations are dealt with in detail when planning clinical trials. Even though the majority of experiments in basic pharmacological research papers are not like that, as discussed above, even preliminary experiments should be planned to a degree, and power curves and expected P-value curves are both useful in that role.

4 Practical problems with P-values

The sections above deal with the most basic misconceptions regarding the nature of P-values, but critics of P-values usually focus on other important issues. In this section I will deal with the significance filter, multiple comparisons, and some forms of P-hacking, and I need to point out immediately that most of the issues are not specific to P-values even if some of them are enabled by the unfortunate dichotomisation of P-values into significant and not significant. In other words, the practical problems with P-values are largely the practical problems associated with the misuse of P-values and with sloppy statistical inference generally.

4.1 The significance filter exaggeration machine

It is natural to assume that the effect size observed in an experiment is a good estimate of the true effect size, and in general that can be true. However, there are common circumstances where the observed effect size consistently overestimates the true effect, sometimes wildly so. The overestimation depends on the facts that experimental results exaggerating the true effect are more likely to be found statistically significant, and that we pay more attention to the significant results and are more likely to report them. The key to the effect is selective attention to a subset of results—the significant results—and so the process is appropriately called the significance filter.

If there is nothing untoward in the sampling mechanism,[11] sample means are unbiassed estimators of population means and sample-based standard deviations are nearly unbiassed estimators of population standard deviations.[12] Because of that we can assume that, on average, a sample mean provides a sensible ‘guesstimate’ for the population parameter and, to a lesser degree, so does the observed standard deviation. That is indeed the case for averages over all samples, but it cannot be relied upon for any particular sample. If attention has been drawn to a sample on the basis that it is ‘statistically significant’ then that sample is likely to offer an exaggerated picture of the true effect. The phenomenon is usually called the significance filter.

Footnote 11: That is not a safe assumption, in particular because a haphazard sample is not a random sample. When was the last time that you used something like a random number generator for allocation of treatments?

Footnote 12: The variance is unbiassed but the non-linear square root transformation into the standard deviation damages that unbiassed-ness. Standard deviations calculated from small samples are biassed toward underestimation of the true standard deviation. For example, if the true standard deviation is 1 the expected average observed standard deviation for samples of n = 5 is 0.92.
The way it works is fairly easily described but, as usual, there are some complexities in its interpretation. Say we are in the position to run an experiment 100 times with random samples of n = 5 from a single normally distributed population with mean µ = 1 and standard deviation σ = 1. We would expect that, on average, the sample means, x̄, would be scattered symmetrically around the true value of 1, and the sample-based standard deviations, s, would be scattered around the true value of 1, albeit slightly asymmetrically. A set of 100 simulations matching that scenario shows exactly that result (see the left panel of Figure 6), with the median of x̄ being 0.97 and the median of s being 0.94, both of which are close to the expected values of exactly 1 and about 0.92, respectively. If we were to pay attention only to the results where the observed P-value was less than 0.05 (with the null hypothesis being that the population mean is 0), then we get a different picture because the values are very biassed (see the right panel of Figure 6). Among the ‘significant’ results the median sample mean is 1.2 and the median standard deviation is 0.78.

Figure 6: The significance filter. The dots in the graphs are means and standard deviations of samples of n = 5 drawn from a normally distributed population with mean µ = 1 and standard deviation σ = 1. The left panel shows all 100 samples and the right panel shows only the results where P < 0.05. The vertical and horizontal lines indicate the true parameter values. ‘Significant’ results tend to over-estimate the population mean and under-estimate the population standard deviation.

The systematic bias of mean and standard deviation among ‘significant’ results in those simulations might not seem too bad, but it is conventional to scale the effect size as the standardised ratio x̄/s,[13] and the median of that ratio among the ‘significant’ results is fully 50% larger than the correct value. What’s more, the biasses get worse with smaller samples, with smaller true effect sizes, and with lower P-value thresholds for ‘significance’.

Footnote 13: That ratio is often called Cohen’s d. Pharmacologists should pay no attention to Cohen’s specifications of small, medium and large effect sizes [Cohen, 1992] because they are much smaller than the effects commonly seen in basic pharmacological experiments.

It is notable that even the results with the most extreme exaggeration of effect size in Figure 6—550%—would not be counted as an error within the Neyman–Pearsonian hypothesis testing framework! It would not lead to the false rejection of a true null or to an inappropriate failure to reject a false null and so it is neither a type I nor a type II error. But it is some type of error, a substantial error in estimation of the magnitude of the effect. The term type M error has been devised for exactly that kind of error [Gelman and Carlin, 2014]. A type M error might be underestimation as well as overestimation, but overestimation is the more common in theory [Lu et al., 2018] and in practice [Camerer et al., 2018].

The effect size exaggeration coming from the significance filter is not a result of sampling, or of significance testing, or of P-values. It is a result of paying extra attention to a subset of all results—the ‘significant’ subset.
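The simulations behind Figure 6 are easy to sketch in R. The details below—a one-sample, one-tailed t-test against a null mean of zero, and an arbitrary random seed—are my assumptions for illustration, so the medians will differ slightly from the chapter's 0.97, 0.94, 1.2 and 0.78.

## 100 experiments, each a sample of n = 5 from a population with
## true mean 1 and true SD 1, tested against the null hypothesis mean = 0.
set.seed(2)
sims <- as.data.frame(t(replicate(100, {
  x <- rnorm(5, mean = 1, sd = 1)
  c(mean = mean(x), sd = sd(x),
    p = t.test(x, mu = 0, alternative = "greater")$p.value)
})))

## All 100 samples: medians close to the true values (left panel of Figure 6).
median(sims$mean); median(sims$sd)

## The 'significant' subset only (right panel): exaggerated mean, shrunken SD,
## and an inflated standardised effect size mean/sd.
sig <- subset(sims, p < 0.05)
median(sig$mean); median(sig$sd)
median(sig$mean / sig$sd) / median(sims$mean / sims$sd)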
The significance filter presents a peculiar difficulty. It leads to exaggeration on average, but any particular result may well be close to the correct size whether it is ‘significant’ or not. A real-world sample mean of, say, x̄ = 1.5 might be an exaggeration of µ = 1, it might be an underestimation of µ = 2, or it might be pretty close to µ = 1.4, and there would be no way to be certain without knowing µ, and if µ were known then the experiment would probably not have been necessary in the first place. That means that the possibility of a type M error looms over any experimental result that is interesting because of a small P-value, and that is particularly true when the sample size is small. The only way to gain more confidence that a particular significant result closely approximates the true state of the world is to repeat the experiment—the second result would not have been run through the significance filter and so its results would not have a greater than average risk of exaggeration, and the overall inference can be informed by both results. Of course, experiments intended to repeat or replicate an interesting finding should take the possible exaggeration into account by being designed to have higher power than the original.

4.2 Multiple comparisons

Multiple testing is the situation where the tension between global and local considerations is most stark. It is also the situation where the well-known jelly beans cartoon from XKCD.com is irresistible (Figure 7). The cartoon scenario is that jelly beans were suspected of causing acne, but a test found “no link between jelly beans and acne (P > 0.05)”, and so the possibility that only a certain colour of jelly bean causes acne is then entertained. All 20 colours of jelly bean are independently tested, with only the result from green jelly beans being significant, “(P < 0.05)”. The newspaper headline at the end of the cartoon mentions only the green jelly beans result, and it does that with exaggerated certainty.

The usual interpretation of that cartoon is that the significant result with green jelly beans is likely to be a false positive because, after all, hypothesis testing with the threshold of P < 0.05 is expected to yield a false positive one time in 20, on average, when the null is true. The more hypothesis tests there are, the higher the risk that one of them will yield a false positive result. The textbook response to multiple comparisons is to introduce ‘corrections’ that protect an overall maximum false positive error rate by adjusting the threshold according to the number of tests in the family, to give protection from inflation of the family-wise false positive error rate. The Bonferroni adjustment is the best-known method, and while there are several alternative ‘corrections’ that perform a little better, none of those is nearly as simple. A Bonferroni adjustment for the family of experiments in the cartoon would preserve an overall false positive error rate of 5% by setting a threshold for significance of 0.05/20 = 0.0025 in each of the 20 hypothesis tests.[14] It must be noted that such protection does not come for free, because adjustments for multiplicity invariably strip statistical power from the analysis.

Footnote 14: You may notice that the first test of jelly beans without reference to colour has been ignored here. There is no set rule for saying exactly which experiments constitute a family for the purposes of correction of multiplicity.
We do not know whether the 'significant' link between green jelly beans and acne would survive a Bonferroni adjustment because the actual P-values were not supplied (that serves to illustrate one facet of the inadequacy of reporting 'P less thans' in place of actual P-values), but, as an example, a P-value of 0.003, low enough to be quite encouraging as the result of a significance test, would be 'not significant' according to the Bonferroni adjustment. Such a result would present us with a serious dilemma because the inference supported by the local evidence would be apparently contradicted by global error rate considerations. However, that contradiction is not what it seems, because the null hypothesis of the significance test P-value is a different null hypothesis from that tested by the Bonferroni-adjusted hypothesis test. The significance test null concerns only the green jelly beans, whereas the null hypothesis of the Bonferroni adjustment is an omnibus null hypothesis that says that the link between green jelly beans and acne is zero and the link between purple jelly beans and acne is zero and the link between brown jelly beans and acne is zero, and so on. The P-value null hypothesis is local and the omnibus null is global. The global null hypothesis might be appropriate before the evidence is available (i.e. for power calculations and experimental planning), but after the data are in hand the local null hypothesis concerning just the green jelly beans gains importance.

It is important to avoid being blinded to the local evidence by a non-significant global result. After all, the pattern of evidence in the cartoon is exactly what would be expected if the green colouring agent caused acne: green jelly beans are associated with acne but the other colours are not. (The failure to see an effect of the mixed jelly beans in the first test is easily explicable on the basis of the lower dose of green.) If the data from the trial of green jelly beans are independent of the data from the trials of other colours, then there is no way that the existence of those other data, or their analysis, can influence the nature of the green data. The green jelly bean data cannot logically have been affected by the fact that mauve and beige jelly beans were tested at a later point in time (the subsequent cannot affect the previous), and the experimental system would have to be bizarrely flawed for the testing of the purple or brown jelly beans to affect the subsequent experiment with green jelly beans. If the multiplicity of tests did not affect the data then it is only reasonable to say that it did not affect the evidence. The omnibus global result does not cancel the local evidence, or even alter it, and yet the elevated risk of a false positive error is real. That presents us with a dilemma and, unfortunately, statistics does not provide a way around it. Global error rates and local evidence operate in different logical spaces [Thompson, 2007] and so there can be no strictly statistical way to weigh them together.

Figure 7: Multiple testing cartoon from XKCD, https://xkcd.com/882/
All is not lost, though, because statistical limitations do not preclude thoughtful integration of local and global issues when making inferences. We just have to be more than normally cautious when the local and the global pull in different directions. For example, in the case of the cartoon, the evidence in the data favours the idea that green jelly beans are linked with acne (and if we had an exact P-value then we could specify the strength of that favouring), but because the data were obtained by a method with a substantial false positive error rate we should be somewhat reluctant to take that evidence at face value. It would be up to the scientist in the cartoon (the one with safety glasses) to form a provisional scientific conclusion regarding the effect of green jelly beans, even if that inference is that any decision should be deferred until more evidence is available. Whatever the inference, the evidence, the theory, the method, and any other corroborating or rebutting information should all be considered and reported.

    A man or woman who sits and deals out a deck of cards repeatedly will eventually get a very unusual set of hands. A report of unusualness would be taken differently if we knew it was the only deal made, or one of a thousand deals, or one of a million deals, etc. [Tukey, 1991, p. 133]

In isolation the cartoon experiments are probably only sufficient to suggest that the association between green jelly beans and acne is worthy of further investigation (with the earnestness of that suggestion being inversely related to the size of the relevant P-value). The only way to be in a position to report an inference concerning those jelly beans without having to hedge around the family-wise false positive error rate and the significance filter is to re-test the green jelly beans. New data from a separate experiment will be free from the taint of elevated family-wise error rates and untouched by the significance filter exaggeration machine. And, of course, all of the original experiments should be reported alongside the new, as well as reasoned argument incorporating corroborating or rebutting information and theory.

The fact that a fresh experiment is necessary to allow a straightforward conclusion about the effect of the green jelly beans means that the experimental series shown in the cartoon is a preliminary, exploratory study. Preliminary or exploratory research is essential to scientific progress and can merit publication as long as it is reported completely and openly as preliminary. Too often scientists fall into the pattern of misrepresenting the processes that lead to their experimental results, perhaps under the mistaken assumption that science has to be hypothesis driven [Medawar, 1963, du Prel et al., 2009, Howitt and Wilson, 2014]. That misrepresentation may take the form of a suggestion, implied or stated, that the green jelly beans were the intended subject of the study, a behaviour described as HARKing (hypothesising after the results are known), or cherry picking, where only the significant results are presented.
The reason that HARKing is problematical is that hypotheses cannot be tested using the data that suggested the hypothesis in the first place, because those data always support that hypothesis (otherwise they would not be suggesting it!), and cherry picking introduces a false impression of the nature of the total evidence and allows the direct introduction of experimenter bias. Either way, focussing on just the unusual observations from a multitude is bad science. It takes little effort and few words to say that 20 colours were tested and only the green yielded a statistically significant effect, and a scientist can (should) then hypothesise that green jelly beans cause acne and test that hypothesis with new data.

4.3 P-hacking

P-hacking is where an experiment or its analysis is directed at obtaining a small enough P-value to claim significance instead of being directed at the clarification of a scientific issue or the testing of a hypothesis. Deliberate P-hacking does happen, perhaps driven by the incentives built into the systems of academic reward and publication imperatives, but most P-hacking is accidental: honest researchers doing 'the wrong thing' through ignorance. P-hacking is not always as wrong as might be assumed, as the idea of P-hacking comes from paying attention exclusively to global considerations of error rates, and most particularly to false positive error rates. Those most stridently opposed to P-hacking will point to the increased risk of false positive errors, but rarely to the lowered risk of false negative errors. I will recklessly note that some categories of P-hacking look entirely unproblematical when viewed through the prism of local evidence. The local versus global distinction allows a more nuanced response to P-hacking.

Some P-hacking is outright fraud. Consider this example that has recently come to light:

    One sticking point is that although the stickers increase apple selection by 71%, for some reason this is a p value of .06. It seems to me it should be lower. Do you want to take a look at it and see what you think. If you can get the data, and it needs some tweeking, it would be good to get that one value below .05. (Email from Brian Wansink to David Just on Jan. 7, 2012 [Lee, 2018])

I do not expect that any readers would find P-hacking of that kind to be acceptable. However, the line between fraudulent P-hacking and the more innocent P-hacking through ignorance is hard to define, particularly given the fact that some behaviours derided as P-hacking can be perfectly legitimate as part of a scientific research program. Consider this cherry-picked list of responses to a P-value being greater than 0.05 that have been described as P-hacking [Motulsky, 2014] (there are nine specified in the original but I discuss only five: cherry picking!):

• Analyze only a subset of the data;
• Remove suspicious outliers;
• Adjust data (e.g. divide by body weight);
• Transform the data (i.e. logarithms);
• Repeat to increase sample size (n).

Before going any further I need to point out that Motulsky has a more realistic attitude to P-hacking than might be assumed from my treatment of his list. He writes: "If you use any form of P-hacking, label the conclusions as 'preliminary'." [Motulsky, 2014, p. 1019].

Analysis of only a subset of the data is illicit if the unanalysed portion is omitted in order to manipulate the P-value, but unproblematical if it is omitted for being irrelevant to the scientific question at hand.
Removal of suspicious outliers is similar in being only sometimes inappropriate: it depends on what is meant by the term "outlier". If it indicates that a datum is a mistake, such as a typographical or transcriptional error, then of course it should be removed (or corrected). If an outlier is the result of a technical failure of a particular run of the experiment then perhaps it should be removed, but the technical success or failure of an experimental run must not be judged by the influence of its data on the overall P-value. If the word outlier just denotes a datum that is further from the mean than the others in the dataset, then omit it at your peril! Omission of that type of outlier will reduce the variability in the data and give a lower P-value, but it will markedly increase the risk of false positive results and it is, indeed, an illicit and damaging form of P-hacking.

Adjusting the data by standardisation is appropriate, desirable even, in some circumstances. For example, if a study concerns feeding or organ masses then standardising to body weight is probably a good idea. Such manipulation of data should be considered P-hacking only if an analyst finds a too large P-value in unstandardised data, then tries out various re-expressions of the data in search of a low P-value, and then reports the results as if that expression of the data was intended all along.

The P-hackingness of log-transformation is similarly situationally dependent. Consider pharmacological EC50s or drug affinities: they are strictly bounded at zero and so their distributions are skewed. In fact the distributions are quite close to log-normal and so log-transformation before statistical analysis is appropriate and desirable. Log-transformation of EC50s gives more power to parametric tests and so it is common that significance testing of logEC50s gives lower P-values than significance testing of the untransformed EC50s. An experienced analyst will choose the log-transformation because it is known from empirical and theoretical considerations that the transformation makes the data better match the expectations of a parametric statistical analysis. It might sensibly be categorised as P-hacking only if the log-transformation was selected with no justification other than it giving a low P-value.
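As an illustration of that last point, the hedged sketch below simulates hypothetical EC50 data for two groups (the values, group sizes and seed are all invented) and compares a t-test on the raw EC50s with one on the logEC50s. Choosing the log scale on pharmacological grounds before seeing the data is sound practice; choosing it after the fact solely because it yields the smaller P-value is the P-hacking described above.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)                      # arbitrary seed
# Hypothetical EC50s (nM) for two groups of 8, drawn from log-normal distributions
# whose medians differ two-fold; none of these numbers comes from a real assay.
ec50_control = rng.lognormal(mean=np.log(100), sigma=0.5, size=8)
ec50_treated = rng.lognormal(mean=np.log(200), sigma=0.5, size=8)

_, p_raw = stats.ttest_ind(ec50_control, ec50_treated)                       # skewed raw EC50s
_, p_log = stats.ttest_ind(np.log10(ec50_control), np.log10(ec50_treated))  # logEC50s

print(f"P from raw EC50s: {p_raw:.3f}")
print(f"P from logEC50s:  {p_log:.3f}")
# The log-scale analysis is justified because EC50s are close to log-normal,
# not because it happens to give the smaller P-value on any particular run.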
The last form of P-hacking in the list requires a good deal more consideration than the others because, well, statistics is complicated. That consideration is facilitated by a concrete scenario, one that might seem surprisingly realistic to some readers. Say you run an experiment with n = 5 observations in each of two independent groups, one treated and one control, and obtain a P-value of 0.07 from Student's t-test. You might stop and integrate the very weak evidence against the null hypothesis into your inferential considerations, but you decide that more data will clarify the situation. Therefore you run some extra replicates of the experiment to obtain a total of n = 10 observations in each group (including the initial 5), and find that the P-value for the data in aggregate is 0.002. The risk of the 'significant' result being a false positive error is elevated because the data have had two chances to lead you to discard the null hypothesis. Conventional wisdom says that you have P-hacked.

However, there is more to be considered before the experiment is discarded. Conventional wisdom usually takes the global perspective. As mentioned above, it typically privileges false positive errors over any other consideration, and calls the procedure invalid. However, the extra data have added power to the experiment and lowered the expected P-value for any true effect size. From a local evidence point of view, increasing the sample increases the amount of evidence available for use in inference, which is a good thing. Is extending an experiment after the statistical analysis a good thing or a bad thing? The conventional answer is that it is a bad thing and so the conventional advice is don't do it! However, a better response might balance the bad effects of extending the experiment against the good. Consideration of the local and global aspects of statistical inference allows a much more nuanced answer. The procedure described would be perfectly acceptable for a preliminary experiment.

Technically, the two-stage procedure in that scenario allows optional stopping. The scenario is not explicit, but it can be discerned that the stopping rule was, in effect: run n = 5 and inspect the P-value; if it is small enough then stop and make inferences about the null hypothesis; if the P-value is not small enough for the stop but nonetheless small enough to represent some evidence against the null hypothesis, add an extra 5 observations to each group to give n = 10, stop, and analyse again. We do not know how low the interim P-value would have to be for the protocol to stop, and we do not know how high it could be and the extra data still be gathered, but no matter where those thresholds are set, such stopping rules yield false positive rates higher than the nominal critical value for stopping would suggest. Because of that, the conventional view (the global perspective, of course) is that the protocol is invalid, but it would be more accurate to say that such a protocol would be invalid unless the P-value or the threshold for a Neyman–Pearsonian dichotomous decision is adjusted, as would be done with a formal sequential test. It is interesting to note that the elevation of the false positive rate is not necessarily large. Simulations of the scenario as specified, with P < 0.1 as the threshold for continuing, show that the overall false positive error rate would be about 0.008 when the critical value for stopping at the first stage is 0.005, and about 0.06 when that critical value is 0.05.

The increased rate of false positives (global error rate) is real, but that doesn't mean that the evidential meaning of the final P-value of 0.002 is changed. It is the same local evidence against the null as if it were obtained from a simpler one-stage protocol with n = 10. After all, the data are exactly the same as if the experimenter had intended to obtain n = 10 from the beginning. The optional stopping has changed the global properties of the statistical procedure but not the local evidence, which is contained in the actualised data.

You might be wondering how it is possible that the local evidence is unaffected by a process that increases the global false positive error rate. The rationale is that the evidence is contained within the data but the error rate is a property of the procedure: evidence is local and error rates are global.
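Those false positive rates can be checked with a short simulation. The sketch below is not the author's code and it assumes one particular reading of the stopping rule: stop and declare significance if the interim P-value is below the critical value, extend to n = 10 per group if the interim P-value lies between the critical value and 0.1, and apply the same critical value to the final test. With a modest number of simulated experiments the estimates will only roughly match the 0.008 and 0.06 quoted above; increase n_sims for steadier numbers.

import numpy as np
from scipy import stats

def false_positive_rate(crit, continue_below=0.1, n_sims=20_000, seed=3):
    """Two-stage procedure under a true null: test at n = 5 per group, stop and
    'reject' if P < crit, extend to n = 10 per group if crit <= P < continue_below,
    and apply the same critical value to the aggregated data."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        a, b = rng.normal(size=5), rng.normal(size=5)     # both groups from the same population
        p1 = stats.ttest_ind(a, b).pvalue
        if p1 < crit:
            rejections += 1                               # stopped early, false positive
        elif p1 < continue_below:
            a = np.concatenate([a, rng.normal(size=5)])
            b = np.concatenate([b, rng.normal(size=5)])
            if stats.ttest_ind(a, b).pvalue < crit:
                rejections += 1                           # false positive after extension
    return rejections / n_sims

for crit in (0.005, 0.05):
    print(f"critical value {crit}: overall false positive rate ~ {false_positive_rate(crit):.3f}")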
Recall that false positive errors can only occur when the null hypothesis is true. If the null is true then the procedure has increased the risk of the data leading us to a false positive decision, but if the null is false then the procedure has decreased the risk of a false negative decision. Which of those has paid out in this case cannot be known because we do not know the truth of this local null hypothesis. It might be argued that an increase in the global risk of false positive decisions should outweigh the decreased risk of false negatives, but that is a value judgement that ought to take into account the particulars of the experiment in question, the role of that experiment in the overall study, and other contextual factors that are unspecified in the scenario and that vary from circumstance to circumstance.

So, what can be said about the result of that scenario? The result of P = 0.002 provides moderately strong evidence against the null hypothesis, but it was obtained from a procedure with sub-optimal false positive error characteristics. That sub-optimality should be accounted for in the inferences that are made from the evidence, but it is only confusing to say that it alters the evidence itself, because it is the data that contain the evidence and the sub-optimality did not change the data. Motulsky provides good advice on what to do when your experiment has involved optional stopping:

• For each figure or table, clearly state whether or not the sample size was chosen in advance, and whether every step used to process and analyze the data was planned as part of the experimental protocol.
• If you used any form of P-hacking, label the conclusions as "preliminary."

Given that basic pharmacological experiments are often relatively inexpensive and quickly completed, one can add to that list the option of also corroborating (or not) those results with a fresh experiment designed to have a larger sample size (remember the significance filter exaggeration machine) and performed according to the design. Once we move beyond the globalist mindset of one-and-done such an option will seem obvious.

4.4 What is a statistical model?

I remind the reader that this chapter is written under the assumption that pharmacologists can be trusted to deal with the full complexity of statistics. That assumption gives me licence to discuss unfamiliar notions like the role of the statistical model in statistical analysis. All too often the statistical model is invisible to ordinary users of statistics, and that invisibility encourages thoughtless use of flawed and inappropriate models, thereby contributing to the misuse of inferential statistics like P-values.

A statistical model is what allows the formation of calibrated statistical inferences and non-trivial probabilistic statements in response to data. The model does that by assigning probabilities to potential arrangements of data. A statistical model can be thought of as a set of assumptions, although it might be more realistic to say that a chosen statistical model imposes a set of assumptions onto the experimenter.
    I have often been struck by the extent to which most textbooks, on the flimsiest of evidence, will dismiss the substitution of assumptions for real knowledge as unimportant if it happens to be mathematically convenient to do so. Very few books seem to be frank about, or perhaps even aware of, how little the experimenter actually knows about the distribution of errors in his observations, and about facts that are assumed to be known for the purposes of statistical calculations. [Colquhoun, 1971, p. v]

Statistical models can take a variety of forms [McCullagh, 2002], but the model for the familiar Student's t-test for independent samples is reasonably representative. That model consists of assumed distributions (normal) of two populations with parameters mean (µ1 and µ2) and standard deviation (σ1 and σ2) (the ordinary Student's t-test assumes that σ1 = σ2, but the Welch–Satterthwaite variant relaxes that assumption), and a rule for obtaining samples (e.g. a randomly selected sample of n = 6 observations from each population). A specified value of the difference between means serves as the null hypothesis, so H0: µ1 − µ2 = δ_H0. The test statistic is

    t = \frac{(\bar{x}_1 - \bar{x}_2) - \delta_{H_0}}{s_p \sqrt{1/n_1 + 1/n_2}}

where x̄ is a sample mean and s_p is the pooled standard deviation. (Oh no! An equation! Don't worry, it's the only one, and, anyway, it is too late now to stop reading.) The explicit inclusion of a null hypothesis term in the equation for t is relatively rare, but it is useful because it shows that the null hypothesis is just a possible value of the difference between means. Most commonly the null hypothesis says that the difference between means is zero (it can be called a 'nil-null') and in that case the omission of δ_H0 from the equation makes no numerical difference. Values of t calculated by that equation have a known distribution when µ1 − µ2 = δ_H0, and that distribution is Student's t-distribution (technically it is the central Student's t-distribution; when δ ≠ δ_H0 it is a non-central t-distribution [Cumming and Finch, 2001]). Because the distribution is known it is possible to define hypothesis test acceptance regions for any level of α for a hypothesis test, and any observed t-value can be converted into a P-value in a significance test.
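For readers who prefer code to algebra, the sketch below computes the same t statistic with an explicit δ_H0 term and converts it into a two-sided P-value from the central t-distribution. The two samples are invented for illustration, and with δ_H0 = 0 the result matches the ordinary equal-variance test in scipy.

import numpy as np
from scipy import stats

x1 = np.array([4.1, 5.3, 4.8, 6.0, 5.1, 4.4])    # hypothetical sample 1 (n = 6)
x2 = np.array([5.9, 6.8, 6.1, 7.4, 6.6, 5.8])    # hypothetical sample 2 (n = 6)
delta_h0 = 0.0                                   # the 'nil-null': hypothesised difference is zero

n1, n2 = len(x1), len(x2)
sp = np.sqrt(((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2))
t = ((x1.mean() - x2.mean()) - delta_h0) / (sp * np.sqrt(1 / n1 + 1 / n2))
p = 2 * stats.t.sf(abs(t), df=n1 + n2 - 2)       # two-sided P from the central t-distribution

print(f"t = {t:.3f}, P = {p:.4f}")
print(stats.ttest_ind(x1, x2, equal_var=True))   # same t and P when delta_h0 is zero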
An important problem that a pharmacologist is likely to face when using a statistical model is that it's just a model. Scientific inferences are usually intended to communicate something about the real world, not the mini-world of a statistical model, and the connection between a model-based probability of obtaining a test statistic value and the state of the real world is always indirect and often inscrutable. Consider the meaning conveyed by an observed P-value of 0.002. It indicates that the data are strange or unusual compared to the expectations of the statistical model when the parameter of interest is set to the value specified by the null hypothesis. The statistical model expects a P-value of, say, 0.002 to occur only 2 times out of a thousand on average when the null is true. If such a P-value is observed then one of these situations has arisen:

• a two in a thousand accident of random sampling has occurred;
• the null hypothesised parameter value is not close to the true value;
• the statistical model is flawed or inapplicable because one or more of the assumptions underlying its application are erroneous.

Typically only the first and second are considered, but the last is every bit as important, because when the statistical model is flawed or inapplicable the expectations of the model are not relevant to the real world system that spawned the data. Figure 8 shows the issue diagrammatically.

Figure 8: Diagram of inference using a statistical model. The real world system under investigation and the data sampled from it are assumed to be equivalent to the population and sample of the statistical model, and the model-based inference is assumed to be relevant to the desired real world inference.

When we use that statistical inference to inform inferences about the real world we are implicitly assuming: (i) that the real world system that generated the data is an analog of the population in the statistical model; (ii) that the way the data were obtained is well described by the sampling rule of the statistical model; and (iii) that the observed data are analogous to the random sample assumed in the statistical model. To the degree that those assumptions are erroneous there is degradation of the relevance of the model-based statistical inference to the real world inference that is desired.

Considerations of model applicability are often limited to the population distribution (is my data normal enough to use a Student's t-test?) but it is much more important to consider whether there is a definable population that is relevant to the inferential objectives and whether the experimental units ("subjects") approximate a random sample. Cell culture experiments are notorious for having ill-defined populations, and while experiments with animal tissues may have a definable population, the animals are typically delivered from an animal breeding or holding facility and are unlikely to be a random sample. Issues like those mean that the calibration of uncertainty offered by statistical methods might be more or less uncalibrated. For good inferential performance in the real world, there has to be a flexible and well-considered linking of model-based statistical inferences and scientific inferences concerning the real world.

5 P-values and inference

A P-value tells you how well the data match the expectations of a statistical model when the null hypothesis is true. But, as we have seen, there are many considerations that have to be made before a low P-value can safely be taken to provide sufficient reason to say that the null hypothesis is false. What's more, inferences about the null hypothesis are not always useful. Royall argues that there are three fundamental inferential questions that should be considered when making scientific inferences [Royall, 1997] (here paraphrased and re-ordered):

1. What do these data say?
2. What should I believe now that I have these data?
3. What should I do or decide now that I have these data?

Those questions are distinct, but not entirely independent, and there is no single best way to answer any of them. A P-value from a significance test is an answer to the first question. It communicates how strongly the data argue against the null hypothesis, with a smaller P-value being a more insistent shout of "I disagree!". However, the answer provided by a P-value is at best incomplete, because it is tied to a particular null hypothesis within a particular statistical model and because it captures and communicates only some of the information that might be relevant to scientific inference.
The limitations of a P-value can be thought of as analogous to a black and white photograph that captures the essence of a scene but misses coloured detail that might be vital for a correct interpretation. Likelihood functions provide more detail than P-values and so they can be superior to P-values as answers to the question of what the data say. However, they will be unfamiliar to most pharmacologists and they are not immune to problems relating to the relevance of the statistical model and the peculiarities of experimental protocol. (Royall [1997] and other proponents of likelihood-based inference, e.g. Berger and Wolpert [1988], make a contrary argument based on the likelihood principle and the irrelevance of the sampling rule, but those arguments may fall down when viewed with the local versus global distinction in mind. Happily, those issues are beyond the scope of this chapter.) As this chapter is about P-values we will not consider likelihoods any further, and those who, correctly, see that they might offer utility can read Royall's book [Royall, 1997].

The second of Royall's questions, what should I believe now that I have these data?, requires integration of the evidence of the data with what was believed prior to the evidence being available. A formal statistical combination of the evidence with prior beliefs can be done using Bayesian methods, but they are rarely used for the analysis of basic pharmacological experiments and are outside the scope of this chapter about P-values. Considerations of belief can be assisted by P-values, because when the data argue strongly against the null hypothesis one should be less inclined to believe it true, but it is important to realise that P-values do not in any way measure or communicate belief.

The Neyman–Pearsonian hypothesis test framework was devised specifically to answer the third question: it is a decision theoretic framework. Of course, it is a good decision procedure only when α is specified prior to the data being available, and when a loss function informs the experimental design. And it is only useful when there is a singular decision to be made regarding a null hypothesis, as can be the case in acceptance sampling and in some randomised clinical trials. A singular decision regarding a null hypothesis is rarely a sufficient inference from the collection of experiments and observations that typically makes up a basic pharmacological study, and so hypothesis tests should not be a default analytical tool (and the hybrid NHST should not be used in any circumstance).

Readers might feel that this section has failed to provide a clear method for making inferences about any of the three questions, and they would be correct. Statistics is a set of tools to help with inferences and not a set of inferential recipes; scientific inferences concerning the real world have to be made by scientists, and my intention with this reckless guide to P-values is to encourage an approach to scientific inference that is more thoughtful than statistical significance. After all, those scientists invariably know much more than statistics does about the real world, and have a superior understanding of the system under study. Scientific inferences should be made after principled consideration of the available evidence, theory and, sometimes, informed opinion.
A full evaluation of evidence will include both consideration of the strength of the local evidence and of the global properties of the experimental system and statistical model from which that evidence was obtained. It is often difficult, just like statistics, and there is no recipe.

References

M. Baker and E. Dolgin. Reproducibility project yields muddy results. Nature, 541(7637):269–270, 2017.

C. G. Begley and L. M. Ellis. Drug development: Raise standards for preclinical cancer research. Nature, 483(7391):531–533, Mar. 2012.

D. J. Benjamin, J. O. Berger, M. Johannesson, B. A. Nosek, E. J. Wagenmakers, R. Berk, K. A. Bollen, B. Brembs, L. Brown, C. Camerer, D. Cesarini, C. D. Chambers, M. Clyde, T. D. Cook, P. De Boeck, Z. Dienes, A. Dreber, K. Easwaran, C. Efferson, E. Fehr, F. Fidler, A. P. Field, M. Forster, E. I. George, R. Gonzalez, S. Goodman, E. Green, D. P. Green, A. G. Greenwald, J. D. Hadfield, L. V. Hedges, L. Held, T.-H. Ho, H. Hoijtink, D. J. Hruschka, K. Imai, G. Imbens, J. P. A. Ioannidis, M. Jeon, J. H. Jones, M. Kirchler, D. Laibson, J. List, R. Little, A. Lupia, E. Machery, S. E. Maxwell, M. McCarthy, D. A. Moore, S. L. Morgan, M. Munafò, S. Nakagawa, B. Nyhan, T. H. Parker, L. Pericchi, M. Perugini, J. Rouder, J. Rousseau, V. Savalei, F. D. Schönbrodt, T. Sellke, B. Sinclair, D. Tingley, T. Van Zandt, S. Vazire, D. J. Watts, C. Winship, R. L. Wolpert, Y. Xie, C. Young, J. Zinman, and V. E. Johnson. Redefine statistical significance. Nature Human Behaviour, pages 1–5, Jan. 2018.

J. Berger and T. Sellke. Testing a point null hypothesis: the irreconcilability of P values and evidence. Journal of the American Statistical Association, pages 112–122, 1987.

J. O. Berger and R. L. Wolpert. The Likelihood Principle. Lecture Notes–Monograph Series. IMS, 1988.

L. Berglund, E. Björling, P. Oksvold, L. Fagerberg, A. Asplund, C. A.-K. Szigyarto, A. Persson, J. Ottosson, H. Wernérus, P. Nilsson, E. Lundberg, A. Sivertsson, S. Navani, K. Wester, C. Kampf, S. Hober, F. Pontén, and M. Uhlén. A genecentric Human Protein Atlas for expression profiles based on antibodies. Molecular & Cellular Proteomics: MCP, 7(10):2019–2027, Oct. 2008.

B. Bhattacharya and D. Habtzghi. Median of the P Value Under the Alternative Hypothesis. The American Statistician, 56(3):202–206, Aug. 2002.

A. Birnbaum. The Neyman-Pearson Theory as Decision Theory, and as Inference Theory; With a Criticism of the Lindley-Savage Argument for Bayesian Theory. Synthese, 36(1):19–49, Sept. 1977.

J. M. Bland and D. G. Bland. Statistics Notes: One and two sided tests of significance. BMJ, 309(6949):248, July 1994.

C. F. Camerer, A. Dreber, F. Holzmeister, T.-H. Ho, J. Huber, M. Johannesson, M. Kirchler, G. Nave, B. A. Nosek, T. Pfeiffer, A. Altmejd, N. Buttrick, T. Chan, Y. Chen, E. Forsell, A. Gampa, E. Heikensten, L. Hummer, T. Imai, S. Isaksson, D. Manfredi, J. Rose, E.-J. Wagenmakers, and H. Wu. Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour, pages 1–10, Sept. 2018.

J. Cohen. A power primer. Psychological Bulletin, 112(1):155–159, July 1992.

D. Colquhoun. Lectures on Biostatistics. Oxford University Press, 1971.

D. Colquhoun. An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science, 1(3):140216, Nov. 2014.
M. Cowles. Statistics in Psychology: An Historical Perspective. Lawrence Erlbaum Associates, Inc., 1989.

G. Cumming. Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspectives on Psychological Science, 3(4):286–300, 2008.

G. Cumming and S. Finch. A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational and Psychological Measurement, 61(4):532–574, 2001.

M. Curtis, R. Bond, D. Spina, A. Ahluwalia, S. Alexander, M. Giembycz, A. Gilchrist, D. Hoyer, P. Insel, A. Izzo, A. Lawrence, D. MacEwan, L. Moon, S. Wonnacott, A. Weston, and J. McGrath. Experimental design and analysis and their reporting: new guidance for publication in BJP. British Journal of Pharmacology, 172(2):3461–3471, 2015.

D. J. Drucker. Never Waste a Good Crisis: Confronting Reproducibility in Translational Research. Cell Metabolism, 24(3):348–360, Sept. 2016.

J.-B. du Prel, G. Hommel, B. Röhrig, and M. Blettner. Confidence interval or p-value?: Part 4 of a series on evaluation of scientific publications. Deutsches Ärzteblatt International, 106(19):335–339, May 2009.

S. D. Dubey. Some thoughts on the one-sided and two-sided tests. Journal of Biopharmaceutical Statistics, 1(1):139–150, 1991.

R. Fisher. Statistical Methods for Research Workers. Oliver & Boyd, 1925.

R. Fisher. Design of Experiments. Hafner, New York, 1960.

H. Fraser, T. Parker, S. Nakagawa, A. Barnett, and F. Fidler. Questionable research practices in ecology and evolution. PLoS ONE, 13(7):e0200303, 2018.

L. S. Freedman. An analysis of the controversy over classical one-sided tests. Clinical Trials, 5(6):635–640, 2008.

M. A. García-Pérez. Thou Shalt Not Bear False Witness Against Null Hypothesis Significance Testing. Educational and Psychological Measurement, 77(4):631–662, Oct. 2016.

A. Gelman and J. Carlin. Beyond Power Calculations. Perspectives on Psychological Science, 9(6):641–651, Nov. 2014.

C. H. George, S. C. Stanford, S. Alexander, G. Cirino, J. R. Docherty, M. A. Giembycz, D. Hoyer, P. A. Insel, A. A. Izzo, Y. Ji, D. J. MacEwan, C. G. Sobey, S. Wonnacott, and A. Ahluwalia. Updating the guidelines for data transparency in the British Journal of Pharmacology - data sharing and the use of scatter plots instead of bar charts. British Journal of Pharmacology, 174(17):2801–2804, Aug. 2017.

G. Gigerenzer. We need statistical thinking, not statistical rituals. Behavioral and Brain Sciences, 1998.

S. N. Goodman. Of P-Values and Bayes: A Modest Proposal. Epidemiology, 12(3):295–297, May 2001.

S. N. Goodman and R. Royall. Evidence and scientific research. American Journal of Public Health, 78(12):1568–1574, 1988.

P. F. Halpin and H. J. Stam. Inductive inference or inductive behavior: Fisher and Neyman-Pearson approaches to statistical testing in psychological research (1940-1960). The American Journal of Psychology, 119(4):625–653, 2006.

L. Halsey, D. Curran-Everett, S. Vowler, and G. Drummond. The fickle P value generates irreproducible results. Nature Methods, 12(3):179–185, 2015.

J. Hoenig and D. Heisey. The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis. The American Statistician, 2001.
S. M. Howitt and A. N. Wilson. Revisiting "Is the scientific paper a fraud?": The way textbooks and scientific research articles are being used to teach undergraduate students could convey a misleading image of scientific research. EMBO Reports, 15(5):481–484, May 2014.

R. Hubbard, M. Bayarri, K. Berk, and M. Carlton. Confusion over Measures of Evidence (p's) versus Errors (α's) in Classical Statistical Testing. The American Statistician, 57(3), Aug. 2003.

C. J. Huberty. Historical origins of statistical testing practices: The treatment of Fisher versus Neyman-Pearson views in textbooks. The Journal of Experimental Education, pages 317–333, 1993.

S. Hurlbert and C. Lombardi. Final collapse of the Neyman-Pearson decision theoretic framework and rise of the neoFisherian. Annales Zoologici Fennici, 46(5):311–349, 2009.

J. P. A. Ioannidis. Why Most Published Research Findings Are False. PLoS Medicine, 2(8):e124, Aug. 2005.

V. E. Johnson. Revised standards for statistical evidence. Proceedings of the National Academy of Sciences, 110(48):19313–19317, 2013.

K. Kobayashi. A comparison of one- and two-sided tests for judging significant differences in quantitative data obtained in toxicological bioassay of laboratory animals. Journal of Occupational Health, 39(1):29–35, 1997.

J. I. Krueger and P. R. Heck. The Heuristic Value of p in Inductive Statistical Inference. Frontiers in Psychology, 8:108–16, June 2017.

P. Laplace. Théorie analytique des probabilités. 1812.

B. Lecoutre, M.-P. Lecoutre, and J. Poitevineau. Uses, Abuses and Misuses of Significance Tests in the Scientific Community: Won't the Bayesian Choice Be Unavoidable? International Statistical Review / Revue Internationale de Statistique, 69(3):399–417, Dec. 2001.

S. M. Lee. BuzzFeed News: Here's how Cornell scientist Brian Wansink turned shoddy data into viral studies about how we eat, February 2018. URL https://www.buzzfeednews.com/article/stephaniemlee/brian-wansink-cornell-p-hacking.

E. Lehmann. Fisher, Neyman, and the Creation of Classical Statistics. Springer, 2011.

J. Lenhard. Models and statistical inference: The controversy between Fisher and Neyman-Pearson. Br. J. Philos. Sci., 57(1):69–91, March 2006. ISSN 0007-0882. doi: 10.1093/bjps/axi152.

M. J. Lew. Bad statistical practice in pharmacology (and other basic biomedical disciplines): you probably don't know P. British Journal of Pharmacology, 166(5):1559–1567, June 2012.

K. Liu and X.-L. Meng. There is individualized treatment. Why not individualized inference? Annual Review of Statistics and Its Application, 3(1):79–111, 2016. doi: 10.1146/annurev-statistics-010814-020310.

C. Lombardi and S. Hurlbert. Misprescription and misuse of one-tailed tests. Austral Ecology, May 2009.

J. Lu, Y. Qiu, and A. Deng. A note on type S & M errors in hypothesis testing. British Journal of Mathematical and Statistical Psychology, online version of record, 2018.

P. McCullagh. What is a statistical model? The Annals of Statistics, 30(5):1125–1310, 2002.

P. Medawar. Is the scientific paper a fraud? Listener, 70:377–378, 1963.

H. J. Motulsky. Common misconceptions about data analysis and statistics. Naunyn-Schmiedeberg's Archives of Pharmacology, 387(11):1017–1023, 2014.
J. Neyman and E. Pearson. On the Problem of the Most Efficient Tests of Statistical Hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231:289–337, 1933.

R. S. Nickerson. Null hypothesis significance testing: a review of an old and continuing controversy. Psychological Methods, 5(2):241–301, June 2000.

R. Nuzzo. Statistical errors: P values, the 'gold standard' of statistical validity, are not as reliable as many scientists assume. Nature, 506:150–152, 2014.

R. Royall. Statistical Evidence: A Likelihood Paradigm, volume 71 of Monographs on Statistics and Applied Probability. Chapman & Hall, 1997.

G. D. Ruxton and M. Neuhaeuser. When should we use one-tailed hypothesis testing? Methods in Ecology and Evolution, 1(2):114–117, 2010.

H. Sackrowitz and E. Samuel-Cahn. P Values as Random Variables - Expected P Values. American Statistician, pages 326–331, Nov. 1999.

S. Senn. Two cheers for P-values? Journal of Epidemiology and Biostatistics, 6(2):193–204, Dec. 2001.

G. Shaw and F. Nodder. The Naturalist's Miscellany: or coloured figures of natural objects; drawn and described immediately from nature. 1789.

A. Strasak, Q. Zaman, G. Marinell, and K. Pfeiffer. The Use of Statistics in Medical Research: A Comparison of The New England Journal of Medicine and Nature Medicine. The American Statistician, 61(1):47–55, 2007.

Student. The Probable Error of a Mean. Biometrika, 6(1):1–25, Mar. 1908.

B. Thompson. The Nature of Statistical Evidence, volume 189 of Lecture Notes in Statistics. Springer, 2007.

D. Trafimow and M. Marks. Editorial. Basic and Applied Social Psychology, 37(1):1–2, 2015. doi: 10.1080/01973533.2015.1012991.

J. W. Tukey. The Philosophy of Multiple Comparisons. Statistical Science, 6(1):100–116, Feb. 1991.

B. Voelkl, L. Vogt, E. S. Sena, and H. Würbel. Reproducibility of preclinical animal research improves with heterogeneity of study samples. PLOS Biology, 16(2):e2003693, Feb. 2018.

E.-J. Wagenmakers. A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5):779–804, Oct. 2007.

E.-J. Wagenmakers, M. Marsman, T. Jamil, A. Ly, J. Verhagen, J. Love, R. Selker, Q. F. Gronau, M. Šmíra, S. Epskamp, D. Matzke, J. N. Rouder, and R. D. Morey. Bayesian inference for psychology. Part I: Theoretical advantages and practical ramifications. pages 1–23, Mar. 2018.

R. L. Wasserstein and N. A. Lazar. The ASA's Statement on p-Values: Context, Process, and Purpose. The American Statistician, 70(2):129–133, June 2016.