A reckless guide to P-values: local evidence, global errors
This chapter demystifies P-values, hypothesis tests and significance tests, and introduces the concepts of local evidence and global error rates. The local evidence is embodied in \textit{this} data and concerns the hypotheses of interest for \textit{this} experiment, whereas the global error rate is a property of the statistical analysis and sampling procedure. It is shown using simple examples that local evidence and global error rates can be, and should be, considered together when making inferences. Power analysis for experimental design for hypothesis testing is explained, along with the more locally focussed expected P-values. Issues relating to multiple testing, HARKing, and P-hacking are explained, and it is shown that, in many situations, their effects on local evidence and global error rates are in conflict, a conflict that can always be overcome by a fresh dataset from replication of key experiments. Statistics is complicated, and so is science. There is no singular right way to do either, and universally acceptable compromises may not exist. Statistics offers a wide array of tools for assisting with scientific inference by calibrating uncertainty, but statistical inference is not a substitute for scientific inference. P-values are useful indices of evidence and deserve their place in the statistical toolbox of basic pharmacologists.
Authors: Michael J. Lew
Chapter 13 in the book Good Research Practice in Experimental Pharmacology, editors A. Bespalov, M.C. Michel, and T. Steckler, to be published by Springer. Open Access, Creative Commons 4.0. Michael J. Lew, Department of Pharmacology and Therapeutics, University of Melbourne. October 7, 2019.

Contents
1 Abstract
2 Introduction
2.1 On the role of statistics
3 All about P-values
3.1 Hypothesis test and Significance test
3.2 Contradictory instructions
3.3 Evidence is local; error rates are global
3.4 On the scaling of P-values
3.5 Power and expected P-values
4 Practical problems with P-values
4.1 The significance filter exaggeration machine
4.2 Multiple comparisons
4.3 P-hacking
4.4 What is a statistical model?
5 P-values and inference

2 Introduction

There is a widespread consensus that we are in the midst of a ‘reproducibility crisis’ and that inappropriate application of statistical methods facilitates, or even causes, irreproducibility [Ioannidis, 2005, Nuzzo, 2014, Colquhoun, 2014, George et al., 2017, Wagenmakers et al., 2018]. P-values are a “pervasive problem” [Wagenmakers, 2007] because they are misunderstood, misapplied, and answer a question that no-one asks [Royall, 1997, Halsey et al., 2015, Colquhoun, 2014]. They exaggerate evidence [Johnson, 2013, Benjamin et al., 2018] or they are irreconcilable with evidence [Berger and Sellke, 1987]. What’s worse, ‘P-hacking’ amplifies their intrinsic shortcomings [Fraser et al., 2018].
The inescapable conclusion, it would seem, is that P-values should be eliminated by replacement with Bayes factors [Goodman, 2001, Wagenmakers, 2007] or confidence intervals [Cumming, 2008], or by simply doing without [Trafimow and Marks, 2015]. However, much of the blame for irreproducibility that is apportioned to P-values is based on pervasive and pernicious misunderstandings.

This chapter is an attempt to resolve those misunderstandings. Some might say it is a reckless attempt because history suggests that it is doomed to failure, and reckless also because it goes against much of the conventional wisdom regarding P-values and will therefore be seen by some as promoting inappropriate statistical practices. That’s OK though, because the conventional wisdom regarding P-values is mistaken in important ways, and those mistakes fuel false suppositions regarding what practices are appropriate.

2.1 On the role of statistics

Statistics is complicated[1] but is usually presented simplistically in the statistics textbooks and courses studied by pharmacologists. Readers of those books and graduates of those courses should therefore be forgiven for wrongly assuming that statistics is a set of rules and recipes that must be applied in order to obtain a statistically valid result. The instructions say that you match the data to the recipe, turn the crank, and bingo: it’s significant, or not. If you do it right then you might be rewarded with a star! No matter how explicable that simplistic view of statistics might be, it is far too limiting. It leads to thoughtless use of a limited set of methods and to over-reliance on the familiar but misunderstood P-value. It prevents the full utilisation of statistical thinking within scientific inference, and allows bad statistics to license false inferences.

Footnote 1: Even its grammatical form is complicated: “statistics” looks like a plural noun, but it is both plural when referring to values calculated from data and singular when referring to the discipline or approaches to data analysis.

We have to aim for more than the rote-learning of recipes in statistics courses because while statistics is not simple, good science is harder. I therefore take as a working assumption the notion that good scientists are capable of dealing with the intricacies of statistical thinking.

I will admit up front that it is not essential to have a statistical inference in order to make a scientific inference. For example, there is little need for a formal statistical analysis if results can be dealt with using the inter-ocular impact test[2]. However, scientific inferences can be made more securely with statistics because it offers a rich set of tools for calibrating uncertainty. Statistical analysis is particularly helpful in the penumbral ‘maybe zone’ where the uncertainty is relatively evenly balanced—the zone where scientists are most likely to be swayed by biasses into over-interpretation of random deviations within the noise. The extra insight from a well-implemented statistical analysis can protect from the desire to find something notable, and thereby reduce the number of false claims made.

Footnote 2: In other words, results that hit you right between the eyes. In the Australian vernacular the inter-ocular impact test is the bloody obvious test.

Most people need all the help they can get to prevent them making fools of themselves by claiming that their favourite theory is substantiated by observations that do nothing of the sort. —[Colquhoun, 1971, p. 1]
Improved utilisation of statistical approaches would indeed help to minimise the number of times that pharmacologists make fools of themselves by reducing the number of false positive results in pharmacological journals and, consequently, reduce the number of faulty leads that fail to translate into a therapeutic [Begley and Ellis, 2012]. However, even ideal application of the most appropriate statistical methods would not improve the replicability of published results quite as much as might be assumed, because not every result that fails to be replicated is a false positive and not every mistaken conclusion would be prevented by better statistical inferences.

Basic pharmacological studies are typically performed using biological models such as cell lines, tissue samples, or laboratory animals, and so even if the original results are not false positives a replication might fail when it is conducted using different models [Drucker, 2016]. Replications might also fail when the original results are critically dependent on unrecognised methodological details, or on reagents such as antibodies that have properties that can vary over time or between sources [Berglund et al., 2008, Baker and Dolgin, 2017, Voelkl et al., 2018]. It is those types of irreproducibility rather than false positives that are responsible for many failures of published leads to translate into clinical targets or therapeutics (see also Chapter 11). The distinction being made here is between false positive inferences, which lack ‘internal validity’, and failures of generalisability, which lack ‘external validity’ even though correct in themselves. It is an important distinction because the former can be reduced by more appropriate use of statistical methods but the latter cannot.

The inherent objectivity of statistics can minimise the number of times that we make fools of ourselves, but just doing statistics is not enough, because it is not a set of rules for scientists to follow to make automated scientific inferences. To get from calibrated statistical inferences to reliable inferences about the real world, the statistical analyses have to be interpreted thoughtfully and in the full knowledge of the properties of the tool and the nature of the real world system being probed. Some researchers might be disconcerted by the fact that statistics cannot provide such certainty, because they just want to be told whether their latest result is “real”. No matter how attractive it might be to fob off onto statistics the responsibility for inferences, the answers that scientists seek cannot be provided by statistics alone.

3 All about P-values

P-values are not everything, and they are certainly not nothing. There are many, many useful procedures and tools in statistics that do not involve or provide P-values, but P-values are by far the most widely used inferential statistic in basic pharmacological research papers.

P-values are a practical success but a critical failure. Scientists the world over use them, but scarcely a statistician can be found to defend them. —[Senn, 2001, p. 193]

Not only are P-values rarely defended, they are frequently derided (e.g. Berger and Sellke [1987], Lecoutre et al. [2001], Goodman [2001], Wagenmakers [2007]).
Even so, support for the continued use of P-values for at least some purposes with some caveats can be found (e.g. Nickerson [2000], Senn [2001], García-Pérez [2016], Krueger and Heck [2017]). One crucial caveat is that a clear distinction has to be drawn between the dichotomisation of P-values into ‘significant’ or ‘not significant’ (typically on the basis of a threshold set at 0.05) and the evidential meaning of the actual numerically specified P-value. The former comes from a hypothesis test and the latter from a significance test. Contrary to what many readers will think and have been taught, they are not the same things. It might be argued that the battle to retain a clear distinction between significance tests and hypothesis tests has long been lost, but I have to continue that battle here because that distinction is critical for understanding the uses and misuses of P-values. Detailed accounts can also be found elsewhere [Huberty, 1993, Senn, 2001, Hubbard et al., 2003, Lenhard, 2006, Hurlbert and Lombardi, 2009, Lew, 2012].

3.1 Hypothesis test and Significance test

When comparing significance tests and hypothesis tests it is conventional to note that the former are ‘Fisherian’ (or, perhaps, “neoFisherian” [Hurlbert and Lombardi, 2009]) and the latter are ‘Neyman–Pearsonian’. R.A. Fisher did not invent significance tests per se—Gosset published what became Student’s t-test before Fisher’s career had begun [Student, 1908] and even that is not the first example—but Fisher did effectively popularise their use with his book Statistical Methods for Research Workers (1925), and he is credited with (or blamed for!) the convention of P < 0.05 as a criterion for ‘significance’. It is important to note that Fisher’s ‘significant’ denoted something along the lines of worthy of further consideration or investigation, which is different to what is denoted by the same word applied to the results of a hypothesis test.

Hypothesis tests came later, with the 1933 paper by Neyman & Pearson that set out the workings of dichotomising hypothesis tests and also introduced the ideas of “errors of the first kind” (false positive errors; type I errors) and “errors of the second kind” (false negative errors; type II errors) and a formalisation of the concept of statistical power.

A Neyman–Pearsonian hypothesis test is more than a simple statistical calculation. It is a method that properly encompasses experimental planning and experimenter behaviour as well. Before an experiment is conducted, the experimenter chooses α, the size of the critical region in the distribution of the test statistic, on the basis of the acceptable false positive (i.e. type I) error rate and sets the sample size on the basis of an acceptable false negative (i.e. type II) error rate. In effect the sample size, power[3], and α are traded off against each other to obtain an experimental design with the appropriate mix of cost and error rates. In order for the error rates of the procedure to be well calibrated, the sample size and α have to be set in advance of the experiment being performed, a detail that is often overlooked by pharmacologists.

Footnote 3: The ‘power’ of the experiment is one minus the false negative error rate, but it is a function of the true effect size, as explained later.
After the experiment has been run and the data are in hand, the mechanics of the test involve a determination of whether the observed value of the test statistic lies within a pre-determined critical region under the sampling distribution provided by a statistical model and the null hypothesis. When the observed value of the test statistic falls within the critical region the result is ‘significant’ and the analyst discards the null hypothesis. When the observed test statistic falls outside the critical region the result is ‘not significant’ and the null hypothesis is not discarded.

In current practice, dichotomisation of results into significant and not significant is most often made on the basis of the observed P-value being less than or greater than a conventional threshold of 0.05, so we have the familiar P < 0.05 for α = 0.05. The one-to-one relationship between the test statistic being within the critical region and the P-value being less than α means that such practice is not intrinsically problematical, but using a P-value as an intermediate in a hypothesis test obscures the nature of the test and contributes to the conflation of significance tests and hypothesis tests.

The classical Neyman–Pearsonian hypothesis test is an acceptance procedure, or a decision theory procedure [Birnbaum, 1977, Hurlbert and Lombardi, 2009], that does not require, or provide, a P-value. Its output is a binary decision: either reject the null hypothesis or fail to reject the null hypothesis. In contrast, a Fisherian significance test yields a P-value that encodes the evidence in the data against the null hypothesis, but not, directly, a decision. The P-value is the probability of observing data as extreme as that observed, or more extreme, when the null hypothesis is true. That probability is generated or determined by a statistical model of some sort, and so we should really include the phrase ‘according to the statistical model’ in the definition. In the Fisherian tradition[4] a P-value is interpreted evidentially: the smaller the P-value the stronger the evidence against the null hypothesis and the more implausible the null hypothesis is, according to the statistical model. No behavioural or inferential consequences attach to the observed P-value and no threshold need be applied because the P-value is a continuous index.

Footnote 4: It has been argued that because Fisher regularly described experimental results as ‘significant’ or ‘not significant’ he was treating P-values dichotomously and that he used a fixed threshold for that dichotomisation (e.g. [Lehmann, 2011, pp. 51–53]). However, Fisher meant the word ‘significant’ to denote only that a result is worthy of attention and follow up, and he quoted P-values as being less than 0.05, 0.02, and 0.01 because he was working from tables of critical values of test statistics rather than laboriously calculating exact P-values manually. He wrote about the issue on several occasions, for example this: “Convenient as it is to note that a hypothesis is contradicted at some familiar level of significance such as 5% or 2% or 1% we do not, in Inductive Inference, ever need to lose sight of the exact strength which the evidence has in fact reached, or to ignore the fact that with further trial it might come to be stronger, or weaker.” —[Fisher, 1960, p. 25]

In practice the probabilistic nature of P-values has proved difficult to use because people tend to mistakenly assume that the P-value measures the probability of the null hypothesis or the probability of an erroneous decision—it seems that they prefer any probability that is more noteworthy or less of a mouthful than the probability according to a statistical model of observing data as extreme or more extreme when the null hypothesis is true. Happily, there are no ordinary uses of P-values that require them to be interpreted as probabilities. My advice is to forget that P-values can be defined as probabilities and instead look at them as indices of surprisingness or unusualness of data: the smaller the P-value the more surprising are the data compared to what the statistical model predicts when the null hypothesis is true.
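As a concrete illustration of that definition, the following R sketch (the data and seed are invented for this example and do not come from the chapter) compares the one-tailed P-value reported by Student's t-test with the proportion of simulated null-hypothesis experiments that produce a test statistic at least as extreme as the one observed; the two numbers agree closely, which is all that the 'surprise index' reading of a P-value requires.

## Hypothetical data for two groups of n = 5 (invented for illustration).
control <- c(21, 24, 19, 23, 22)
treated <- c(24, 26, 21, 25, 23)
obs_t <- t.test(treated, control, var.equal = TRUE)$statistic

## The statistical model for the null hypothesis: both groups drawn from
## the same normal population.
set.seed(1)
null_t <- replicate(20000, {
  x <- rnorm(5, mean = 22, sd = 2)
  y <- rnorm(5, mean = 22, sd = 2)
  t.test(y, x, var.equal = TRUE)$statistic
})

## One-tailed P-value as a surprise index: the proportion of null-model
## results at least as extreme as the observed one...
mean(null_t >= obs_t)
## ...and the model-based value from the t-distribution, for comparison.
t.test(treated, control, var.equal = TRUE, alternative = "greater")$p.value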
Conflation of significance tests and hypothesis tests may be encouraged by their apparently equivalent outputs (significance and P-values), but the conflation is too often encouraged by textbook authors, even to the extent of presenting a hybrid approach containing features of both. The problem has deep roots: when Neyman & Pearson published their hypothesis test in 1933 it was immediately assumed that their test was an extension of Fisher’s significance tests. Substantive differences in the philosophical and theoretical underpinnings soon became apparent to the protagonists and a long-lasting and bitter personal enmity developed between Fisher and Neyman [Lenhard, 2006, Lehmann, 2011]. That feud seems likely to be one of the causes of the confusion that we have today, as it has been suggested that authors of statistics textbooks avoided taking sides in the feud—an understandable response given the vehemence and forceful personalities of the protagonists—either by presenting only one of the approaches without mention of the other or by presenting a mixture of both [Cowles, 1989, Huberty, 1993, Halpin and Stam, 2006].

Whatever the origin of the confusion, the fact that significance tests and hypothesis tests are rarely explained as distinct alternatives in textbooks encourages many to mistakenly assume that ‘significance test’ and ‘hypothesis test’ are synonyms. It also encourages the use of a hybrid of the two which is commonly called NHST (for Null Hypothesis Significance Test). NHST has been derided, for example as an “inconsistent mishmash” [Gigerenzer, 1998] and as a “jerry-built framework” [Krueger and Heck, 2017, p. 1], but versions of NHST are nonetheless more common than well-constructed hypothesis tests and significance tests together. Users of NHST almost universally assume that they are ‘doing it right’ and the assumption that P-value equals NHST persists, largely unnoticed, particularly in the commentaries of those clamouring for the elimination of P-values. I therefore feel compelled to add to the list of derogatory epithets: NHST is like a reverso-platypus. The platypus was at one time derided as a fake[5]—a composite creature consisting of parts of several animals—but is a real animal, rare but beautiful, and perfectly adapted to its ecological niche.
The common NHST is assumed by its many users to be a proper statistical procedure but is, in fact, an ugly composite, maladapted for almost all analytic purposes.

Footnote 5: Well, that’s the conventional wisdom, but it may be an exaggeration. The first scientific description of the “duck-billed platypus” was done in England by Shaw & Nodder (1789), who wrote “Of all Mammalia yet known it seems the most extraordinary in its conformation; exhibiting the perfect resemblance of the beak of a Duck engrafted on the head of a quadruped. So accurate is the similitude that, at first view, it naturally excites the idea of some deceptive preparation by artificial means”. If Shaw & Nodder really thought it a fake, they did not do so for long.

3.2 Contradictory instructions

No-one should be using NHST, but should we use hypothesis testing or significance testing? The answer should depend on what your analytical objectives are, but in practice it more often depends on who you ask. Not all advice is good advice, and not even the experts agree. Responses to the American Statistical Association’s official statement on P-values provide a case in point. In response to the widespread expressions of concern over the misuse and misunderstanding of P-values, the ASA convened a group of experts to consider the issues and to collaborate on drafting an official statement on P-values [Wasserstein and Lazar, 2016]. Invited commentaries were published alongside the final statement, and even a brief reading of those commentaries on the statement will turn up misgivings and disagreements. Given that most of the commentaries were written by participants in the expert group, such disquiet and dissent confirms the difficulty of this topic. It should also signal to readers that their practical familiarity with P-values does not ensure that they understand P-values.

The official ASA statement on P-values sets out six numbered principles concerning P-values and scientific inference:

1. P-values can indicate how incompatible the data are with a specified statistical model.
2. P-values do not measure the probability that the studied hypothesis is true, or the chance that the data were produced by random chance.
3. Scientific conclusions and business or policy decisions should not be based only on whether a P-value passes a specific threshold.
4. Proper inference requires full reporting and transparency.
5. A P-value, or statistical significance, does not measure the size of an effect or the importance of a result.
6. By itself, a P-value does not provide a good measure of evidence regarding a model or hypothesis.

Those principles are all sound—some derive directly from the definition of P-values and some are self-evidently good advice about the formation and reporting of scientific conclusions—but hypothesis tests and significance tests are not even mentioned in the statement, and so it does not directly answer the question about whether we should use significance tests or hypothesis tests that I asked at the start of this section. Nevertheless, the statement offers a useful perspective and is not entirely neutral on the question. It urges against the use of a threshold in Principle 3, which says “Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.” Without a threshold we cannot use a hypothesis test.
Lest any reader think that the intention is that P-values should not be used, I point out that the explanatory note for that principle in the ASA document begins thus:

Practices that reduce data analysis or scientific inference to mechanical “bright-line” rules (such as “p < 0.05”) for justifying scientific claims or conclusions can lead to erroneous beliefs and poor decision making. —[Wasserstein and Lazar, 2016, p. 131]

“Bright-line rule” is an American legal phrase denoting an approach to simplifying ambiguous or complex legal issues by establishment of a clear, consistent ruling on the basis of objective factors. In other words, subtleties of circumstance and subjective factors are ignored in favour of consistency and simplicity. Such a rule might be useful in the legal setting, but it does not sound like an approach well-suited to the considerations that should underlie scientific inference. It is unfortunate, therefore, that a mechanical bright-line rule is so often used in basic pharmacological research, and even worse that it is demanded by the instructions to authors of the British Journal of Pharmacology:

When comparing groups, a level of probability (P) deemed to constitute the threshold for statistical significance should be defined in Methods, and not varied later in Results (by presentation of multiple levels of significance). Thus, ordinarily P < 0.05 should be used throughout a paper to denote statistically significant differences between groups. —[Curtis et al., 2015]

Figure 1: P=0.04 is not very different from P=0.06. Pseudo-data devised to yield one-tailed P=0.06 (left) and P=0.04 (right) from a Student’s t-test for independent samples, n = 5 per group. The y-axis is an arbitrarily scaled measure.

An updated version of the guidelines retains those instructions [?], but because it is a bad instruction I present three objections. The first is that routine use of an arbitrary P-value threshold for declaring a result significant ignores almost all of the evidential content of the P-value by forcing an all-or-none distinction between a P-value small enough and one not small enough. The arbitrariness of a threshold for significance is well known and flows from the fact that there is no natural cutoff point or inflection point in the scale of P-values. Anyone who is unconvinced that it matters should note that the evidence in a result of P=0.06 is not so different from that in a result of P=0.04 as to support an opposite conclusion (Figure 1).

The second objection to the instruction to use a threshold of P < 0.05 is that exclusive focus on whether the result is above or below the threshold blinds analysts to information beyond the sample in question. If the statistical procedure says discard the null hypothesis (or don’t discard it) then that statistical decision seems to override and make redundant any further considerations of evidence, theory, or scientific merit. That is quite dangerous, because all relevant material should be considered and integrated into scientific inferences.

The third objection refers to the strength of evidence needed to reach the threshold: the British Journal of Pharmacology instruction licenses claims on the basis of relatively weak evidence.[6]

Footnote 6: Accepting P=0.05 as a sufficient reason to suppose that a treatment is effective is akin to accepting 50% as a passing grade: it’s traditional in many settings, but it’s far from reassuring.
The evidential disfavouring of the null hypothesis in a P-value close to 0.05 is surprisingly weak when viewed as a likelihood ratio or Bayes factor [Goodman and Royall, 1988, Johnson, 2013, Benjamin et al., 2018], a weakness that can be confirmed by simply ‘eyeballing’ Figure 1.

A fixed threshold corresponding to weak evidence might sometimes be reasonable, but often it is not. As Carl Sagan said: “Extraordinary claims require extraordinary evidence.”[7] It would be possible to overcome this last objection by setting a lower threshold whenever an extraordinary claim is to be made, but the British Journal of Pharmacology instructions preclude such a choice by insisting that the same threshold be applied to all tests within the whole study. There has been a serious proposal that a lower threshold of P < 0.005 be adopted as the default [Johnson, 2013, Benjamin et al., 2018], but even if that would ameliorate the weakness of evidence objection, it doesn’t address all of the problems posed by dichotomising results into significant and not significant, as is acknowledged by the many authors of that proposal.

Footnote 7: That phrase comes from the television series Cosmos, 1980, but may derive from Laplace (1812), who wrote “The weight of evidence for an extraordinary claim must be proportioned to its strangeness.” [translated, the original is in French].

Should the British Journal of Pharmacology enforce its guideline on the use of Neyman–Pearsonian hypothesis testing with a fixed threshold for statistical significance? Definitely not, and laboratory pharmacologists should usually avoid such tests because their nature is ill-suited to the reality of basic pharmacological studies.

The shortcoming of hypothesis testing is that it offers an all-or-none outcome and it engenders a one-and-done response to an experiment. All-or-none in that the significant or not significant outcome is dichotomous. One-and-done because once a decision has been made to reject the null hypothesis there is little apparent reason to re-test that null hypothesis the same way, or differently. There is no mechanism within the classical Neyman–Pearsonian hypothesis testing framework for a result to be treated as provisional. That is not particularly problematical in the context of a classical randomised clinical trial (RCT) because an RCT is usually conducted only after preclinical studies have addressed the relevant biological questions. That allows the scientific aims of the study to be simple—they are designed to provide a definitive answer to the primary question. An all-or-none one-and-done hypothesis test is therefore appropriate for an RCT.[8]

Footnote 8: Clinical trials are sometimes aggregated in meta-analyses, but the substrate for meta-analytical combination is the observed effect sizes and sample sizes of the individual trials, not the dichotomised significant or not significant outcomes.

But the majority of basic pharmacological laboratory studies do not have much in common with an RCT because they consist of a series of interlinked and inter-related experiments contributing variously to the primary inference. For example, a basic pharmacological study will often include experiments that validate experimental methods and reagents, concentration-response curves for one or more drugs, positive and negative controls, and other experiments subsidiary to the main purpose of the study.
The design of the ‘headline’ experiment (assuming there is one) and interpretation of its results is dependent on the results of those subsidiary experiments, and even when there is a singular scientific hypothesis, it might be tested in several ways using observations within the study. It is the aggregate of all of the experiments that informs the scientific inferences. The all-or-none one-and-done outcome of a hypothesis test is less appropriate to basic research than it is to a clinical trial.

Pharmacological laboratory experiments also differ from RCTs in other ways that are relevant to the choice of statistical methodologies. Compared to an RCT, basic pharmacological research is very cheap, and the experiments can be completed very quickly, with the results available for analysis almost immediately. Those advantages mean that a pharmacologist might design some of the experiments within a study in response to results obtained in that same study,[9] and so a basic pharmacological study will often contain preliminary or exploratory research. Basic research and clinical trials also differ in the consequences of erroneous inference. A false positive in an RCT might prove very damaging by encouraging the adoption of an ineffective therapy, but in the much more preliminary world of basic pharmacological research a false positive result might have relatively little influence on the wider world. It could be argued that statistical protections against false positive outcomes that are appropriate in the realm of clinical trials can be inappropriate in the realm of basic research. This idea is illustrated in a later section of this chapter.

Footnote 9: Yes, that is also done in ‘adaptive’ clinical trials, but they are not the archetypical RCT that is the comparator here.

The multi-faceted nature of the basic pharmacological study means that statistical approaches yielding dichotomous yes or no outcomes are less relevant than they are to the archetypical RCT. The scientific conclusions drawn from basic pharmacological experiments should be based on thoughtful consideration of the entire suite of results in conjunction with any other relevant information, including both pre-existing evidence and theory. The dichotomous all-or-none, one-and-done hypothesis test is poorly adapted to the needs of basic pharmacological experiments, and is probably poorly adapted to the needs of most basic scientific studies. Scientific studies depend on a detailed evaluation of evidence but a hypothesis test does not fully support such an evaluation.

3.3 Evidence is local; error rates are global

A way to understand the difference between the Fisherian significance test and the Neyman–Pearsonian hypothesis test is to recognise that the former supports ‘local’ inference, whereas the latter is designed to protect against ‘global’ long-run error. The P-value of a significance test is local because it is an index of the evidence in this data against this null hypothesis.
In contrast, the hypothesis test decision regarding rejection of the null hypothesis is global because it is based on a parameter, α, which is set without reference to the observed data. The long run performance of the hypothesis test is a property of the procedure itself and is independent of any particular data, and so it is global. Local evidence; global errors. This is not an ahistoric imputation, because Neyman & Pearson were clear about their preference for global error protection rather than local evidence and about their objectives in devising hypothesis tests:

We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis. But we may look at the purpose of tests from another view-point. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong. —Neyman and Pearson [1933]

The distinction between local and global properties or information is relatively little known, but Liu & Meng (2016) offer a much more technical and complete discussion of the local/global distinction, using the descriptors ‘individualised’ and ‘relevant’ for the local and ‘robust’ for the global. They demonstrate a trade-off between relevance and robustness that requires judgement on the part of the analyst. In short, the desirability of methods that have good long-run error properties is undeniable, but paying attention exclusively to the global blinds us to the local information that is relevant to inferences. The instructions of the British Journal of Pharmacology are inappropriate because they attend entirely to the global and because the dichotomising of each experimental result into significant and not significant hinders thoughtful inference. Many of the battles and controversies regarding statistical tests swirl around issues that might be clarified using the local versus global distinction, and so it will be referred to repeatedly in what follows.

3.4 On the scaling of P-values

In order to be able to safely interpret the local, evidential, meaning of a P-value, a pharmacologist should understand its scaling. Just like the EC50s with which pharmacologists are so familiar, P-values have a bounded scale, and just as is the case with EC50s it makes sense to scale P-values geometrically (or logarithmically). The non-linear relationship between P-values and an intuitive scaling of evidence against the null hypothesis can be gleaned from Figure 2. Of course, a geometric scaling of the evidential meaning of P-values implies that the descriptors of evidence should be similarly scaled, and so such a scale is proposed in Figure 3, with P-values around 0.05 being called ‘trivial’ in recognition of the relatively unimpressive evidence for a real difference between condition A and control in Figure 2. Attentive readers will have noticed that the P-values in Figures 1, 2, and 3 are all one-tailed.

Figure 2: What simple evidence looks like. Pseudo-data devised to yield one-tailed P-values from 0.05 to 0.0001 from a Student’s t-test for independent samples, n = 5 per group. The left-most group of values is the control against which each of the other sets is compared, and the pseudo-datasets A, B, C, and D were generated by arithmetic adjustment of a single dataset to obtain the indicated P-values. The y-axis is an arbitrarily scaled measure.
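The EC50 analogy can be made concrete with a minimal R sketch (the P-values below are chosen by me for illustration, not taken from the figures): expressing a P-value as −log10(P), much as an EC50 is expressed as a pEC50, turns equal ratios of P-values into equal evidential steps.

p <- c(0.05, 0.005, 0.0005, 0.00005)   # a constant-ratio sequence of P-values
data.frame(P = p, neg_log10_P = -log10(p))
## On the raw scale the successive drops look wildly unequal (0.045, 0.0045, ...),
## but on the geometric scale each tenfold drop in P adds exactly one unit,
## which is why the descriptors in Figure 3 are laid out semi-geometrically.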
The number of tails that published P-values have is inconsistent, is often unspecified, and the number of tails that a P-value should have is controversial (e.g. see Dubey [1991], Bland and Bland [1994], Kobayashi [1997], Freedman [2008], Lombardi and Hurlbert [2009], Ruxton and Neuhaeuser [2010]). Arguments about P-value tails are regularly confounded by differences between local and global considerations. The most compelling reasons to favour two tails relate to global error rates, which means that they apply only to P-values that are dichotomised into significant and not significant in a hypothesis test. Those arguments can safely be ignored when P-values are used as indices of evidence, and I therefore recommend one-tailed P-values for general use in pharmacological experiments—as long as the P-values are interpreted as evidence and not as a surrogate for decision. (Either way, the number of tails should always be specified.)

Figure 3: Evidential descriptors for P-values. Strength of evidence against the null hypothesis scales semi-geometrically with the smallness of the P-value. Note that the descriptors for strength of evidence are illustrative only, and it would be a mistake to assume, for example, that a P-value of 0.001 indicates moderately strong evidence against the null hypothesis in every circumstance.

3.5 Power and expected P-values

The Neyman–Pearsonian hypothesis test is a decision procedure that, with a few assumptions, can be an optimal procedure. Optimal only in the restricted sense that the smallest sample gives the highest power to reject the null hypothesis when it is false, for any specified rate of false positive errors. To achieve that optimality the experimental sample size and α are selected prior to the experiment using a power analysis and with consideration of the costs of the two specified types of error and the benefits of potentially correct decisions. In other words, there is a loss function built into the design of experiments. However, outside of the clinical trials arena, few pharmacologists seem to design experiments in that way. For example, a study of 22 basic biomedical research papers published in Nature Medicine found that none of them included any mention of a power analysis for setting the sample size [Strasak et al., 2007], and a simple survey of the research papers in the most recent issue of British Journal of Pharmacology (2018, issue 17 of volume 175) gives a similar picture, with power analyses specified in only one out of the 11 research papers that used P < 0.05 as a criterion for statistical significance.
It is notable that all of those BJP papers included statements in their methods sections claiming compliance with the guidelines for experimental design and analysis, guidelines that include this as the first key point:

Experimental design should be subjected to a priori power analysis so as to ensure that the size of treatment and control groups is adequate [...] —[Curtis et al., 2015]

The most recent issue of Journal of Pharmacology and Experimental Therapeutics (2018, issue 3 of volume 366) similarly contains no mention of power or sample size determination in any of its 9 research papers, although none of its authors had to pay lip service to guidelines requiring it.

In reality, power analyses are not always necessary or helpful. They have no clear role in the design of a preliminary or exploratory experiment that is concerned more with hypothesis generation than hypothesis testing, and a large fraction of the experiments published in basic pharmacological journals are exploratory or preliminary in nature. Nonetheless, they are described here in detail because experience suggests they are mysterious to many pharmacologists and they are very useful for planning confirmatory experiments.

For a simple test like Student’s t-test a pre-experiment power analysis for determination of sample size is easily performed. The power of a Student’s t-test is dependent on: (i) the predetermined acceptable false positive error rate, α (bigger α gives more power); (ii) the true effect size, which we will denote as δ (more power when δ is larger); (iii) the population standard deviation, σ (smaller σ gives more power); and (iv) the sample size (larger n for more power). The common approach to a power test is to specify an effect size of interest and the minimum desired power, so say we wish to detect a true effect of δ = 3 in a system where we expect the standard deviation to be σ = 2. The free software[10] called R has the function power.t.test() that gives this result:

> power.t.test(delta = 3, sd = 2, power = 0.8, sig.level = 0.05,
                alternative = 'one.sided', n = NULL)

     Two-sample t test power calculation

              n = 6.298691
          delta = 3
             sd = 2
      sig.level = 0.05
          power = 0.8
    alternative = one.sided

NOTE: n is number in *each* group

It is conventional to round the sample size up to the next integer, so the sample size would be 7 per group.

Footnote 10: www.r-project.org

While a single point power analysis like that is straightforward, it provides relatively little information compared to the information supplied by the analyst, and its output is specific to the particular effect size specified, an effect size that more often than not has to be ‘guesstimated’ instead of estimated because it is the unknown that is the object of study. A plot of power versus effect size is far more informative than the point value supplied by the conventional power test (Figure 4).

Figure 4: Power functions for α = 0.05 and 0.005. Power of one-sided Student’s t-test for independent samples expressed as a function of standardised true effect size δ/σ for sample sizes (per group) from n = 3 to n = 40. Note that δ = µ1 − µ2 and σ are population parameters rather than sample estimates.
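Curves in the style of Figure 4 can be traced by calling power.t.test() over a grid of standardised effect sizes; the sketch below (my own illustration of the approach, not code from the chapter) uses the same per-group sample sizes as the figure and a one-sided test at α = 0.05.

## Power of the one-sided two-sample t-test as a function of the
## standardised effect size delta/sigma, for several per-group sample sizes.
effect_sizes <- seq(0.1, 4, by = 0.1)   # delta/sigma
group_sizes  <- c(3, 5, 10, 20, 40)     # n per group, as in Figure 4

power_curve <- function(n) {
  sapply(effect_sizes, function(d)
    power.t.test(n = n, delta = d, sd = 1, sig.level = 0.05,
                 alternative = "one.sided")$power)
}

pow <- sapply(group_sizes, power_curve)  # columns correspond to group sizes
matplot(effect_sizes, pow, type = "l", lty = 1,
        xlab = "Effect size (delta/sigma)", ylab = "Power")
legend("bottomright", legend = paste("n =", group_sizes),
       col = seq_along(group_sizes), lty = 1)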
Those graphical power functions show clearly the three-way relationship between sample size, effect size and the risk of a false negative outcome (i.e. one minus the power).

Some experimenters are tempted to perform a post-experiment power analysis when their observed P-value is unsatisfyingly large. They aim to answer the question of how large the sample should have been, and proceed to plug in the observed effect size and standard deviation and pull out a larger sample size—always larger—that might have given them the desired small P-value. Their interpretation is then that the result would have been significant but for the fact that the experiment was underpowered. That interpretation ignores the fact that the observed effect size might be an exaggeration, or the observed standard deviation might be an underestimation, and the null hypothesis might be true! Such a procedure is generally inappropriate and dangerous [Hoenig and Heisey, 2001]. There is a one-to-one correspondence between the observed P-value and post-experiment power, and no matter what the sample size, a larger than desired P-value always corresponds to a low power at the observed effect size, whether the null hypothesis is true or false. Power analyses are useful in the design of experiments, not for the interpretation of experimental results.

Power analyses are tied closely to dichotomising Neyman–Pearsonian hypothesis tests, even when expanded to provide full power functions as in Figure 4. However, there is an alternative more closely tied to Fisherian significance testing—an approach better aligned to the objectives of evidence gathering. That alternative is a plot of average expected P-values as functions of effect size and sample size [Sackrowitz and Samuel-Cahn, 1999, Bhattacharya and Habtzghi, 2002]. The median is more relevant than the mean, both because the distribution of expected P-values is very skewed and because the median value offers a convenient interpretation of there being a 50:50 bet that an observed P-value will be either side of it. An equivalent plot showing the 90th percentile of expected P-values gives another option for experiment sample size planning purposes (Figure 5).

Figure 5: Expected P-value functions. P-values expected from Student’s t-test for independent samples expressed as a function of standardised true effect size δ/σ for sample sizes (per group) from n = 3 to n = 40. The graph on the left shows the median of expected P-values (i.e. the 50th percentile) and the graph on the right shows the 90th percentile. It can be expected that 50% of observed P-values will lie below the median lines and 90% will lie below the 90th percentile lines for corresponding sample sizes and effect sizes. The dashed lines indicate P=0.05 and 0.005.
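The expected P-value idea lends itself to a brute-force simulation sketch (my own illustration, not the chapter's code; one-tailed, as elsewhere in the chapter): for a chosen per-group sample size and true standardised effect size, simulate many experiments and take the median and 90th percentile of the resulting P-values.

## Median and 90th percentile of P-values expected from a one-sided
## two-sample Student's t-test, given the true effect size delta/sigma.
expected_p <- function(n, effect, nsim = 10000) {
  p <- replicate(nsim, {
    x <- rnorm(n, mean = 0,      sd = 1)   # control
    y <- rnorm(n, mean = effect, sd = 1)   # treatment; true delta/sigma = effect
    t.test(y, x, alternative = "greater", var.equal = TRUE)$p.value
  })
  quantile(p, probs = c(0.5, 0.9))
}

set.seed(4)
expected_p(n = 5, effect = 1.5)   # e.g. n = 5 per group, delta/sigma = 1.5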
Should the British Journal of Pharmacology enforce its power guideline? In general no, but pharmacologists should use power curves or expected P-value curves for designing some of their experiments, and ought to say so when they do. Power analyses for sample size are very important for experiments that are intended to be definitive and decisive, and that’s why sample size considerations are dealt with in detail when planning clinical trials. Even though the majority of experiments in basic pharmacological research papers are not like that, as discussed above, even preliminary experiments should be planned to a degree, and power curves and expected P-value curves are both useful in that role.

4 Practical problems with P-values

The sections above deal with the most basic misconceptions regarding the nature of P-values, but critics of P-values usually focus on other important issues. In this section I will deal with the significance filter, multiple comparisons, and some forms of P-hacking, and I need to point out immediately that most of the issues are not specific to P-values even if some of them are enabled by the unfortunate dichotomisation of P-values into significant and not significant. In other words, the practical problems with P-values are largely the practical problems associated with the misuse of P-values and with sloppy statistical inference generally.

4.1 The significance filter exaggeration machine

It is natural to assume that the effect size observed in an experiment is a good estimate of the true effect size, and in general that can be true. However, there are common circumstances where the observed effect size consistently overestimates the true effect, sometimes wildly so. The overestimation depends on the facts that experimental results exaggerating the true effect are more likely to be found statistically significant, and that we pay more attention to the significant results and are more likely to report them. The key to the effect is selective attention to a subset of results—the significant results—and so the process is appropriately called the significance filter.

If there is nothing untoward in the sampling mechanism,[11] sample means are unbiassed estimators of population means and sample-based standard deviations are nearly unbiassed estimators of population standard deviations.[12] Because of that we can assume that, on average, a sample mean provides a sensible ‘guesstimate’ for the population parameter and, to a lesser degree, so does the observed standard deviation. That is indeed the case for averages over all samples, but it cannot be relied upon for any particular sample. If attention has been drawn to a sample on the basis that it is ‘statistically significant’ then that sample is likely to offer an exaggerated picture of the true effect. The phenomenon is usually called the significance filter.

Footnote 11: That is not a safe assumption, in particular because a haphazard sample is not a random sample. When was the last time that you used something like a random number generator for allocation of treatments?

Footnote 12: The variance is unbiassed but the non-linear square root transformation into the standard deviation damages that unbiassed-ness. Standard deviations calculated from small samples are biassed toward underestimation of the true standard deviation. For example, if the true standard deviation is 1 the expected average observed standard deviation for samples of n = 5 is 0.92.
The way it works is fairly easily described but, as usual, there are some complexities in its interpretation. Say we are in the position to run an experiment 100 times with random samples of n = 5 from a single normally distributed population with mean µ = 1 and standard deviation σ = 1. We would expect that, on average, the sample means, x̄, would be scattered symmetrically around the true value of 1, and the sample-based standard deviations, s, would be scattered around the true value of 1, albeit slightly asymmetrically. A set of 100 simulations matching that scenario shows exactly that result (see the left panel of Figure 6), with the median of x̄ being 0.97 and the median of s being 0.94, both of which are close to the expected values of exactly 1 and about 0.92, respectively. If we were to pay attention only to the results where the observed P-value was less than 0.05 (with the null hypothesis being that the population mean is 0), then we get a different picture because the values are very biassed (see the right panel of Figure 6). Among the ‘significant’ results the median sample mean is 1.2 and the median standard deviation is 0.78.

Figure 6: The significance filter. The dots in the graphs are means and standard deviations of samples of n = 5 drawn from a normally distributed population with mean µ = 1 and standard deviation σ = 1. The left panel shows all 100 samples and the right panel shows only the results where P < 0.05. The vertical and horizontal lines indicate the true parameter values. ‘Significant’ results tend to over-estimate the population mean and under-estimate the population standard deviation.

The systematic bias of mean and standard deviation among ‘significant’ results in those simulations might not seem too bad, but it is conventional to scale the effect size as the standardised ratio x̄/s,[13] and the median of that ratio among the ‘significant’ results is fully 50% larger than the correct value. What’s more, the biasses get worse with smaller samples, with smaller true effect sizes, and with lower P-value thresholds for ‘significance’.

Footnote 13: That ratio is often called Cohen’s d. Pharmacologists should pay no attention to Cohen’s specifications of small, medium and large effect sizes [Cohen, 1992] because they are much smaller than the effects commonly seen in basic pharmacological experiments.

It is notable that even the results with the most extreme exaggeration of effect size in Figure 6—550%—would not be counted as an error within the Neyman–Pearsonian hypothesis testing framework! It would not lead to the false rejection of a true null or to an inappropriate failure to reject a false null and so it is neither a type I nor a type II error. But it is some type of error, a substantial error in estimation of the magnitude of the effect. The term type M error has been devised for exactly that kind of error [Gelman and Carlin, 2014]. A type M error might be underestimation as well as overestimation, but overestimation is the more common in theory [Lu et al., 2018] and in practice [Camerer et al., 2018].

The effect size exaggeration coming from the significance filter is not a result of sampling, or of significance testing, or of P-values. It is a result of paying extra attention to a subset of all results—the ‘significant’ subset.
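The simulations behind Figure 6 are easy to sketch in R. The details below—a one-sample, one-tailed t-test against a null mean of zero, and an arbitrary random seed—are my assumptions for illustration, so the medians will differ slightly from the chapter's 0.97, 0.94, 1.2 and 0.78.

## 100 experiments, each a sample of n = 5 from a population with
## true mean 1 and true SD 1, tested against the null hypothesis mean = 0.
set.seed(2)
sims <- as.data.frame(t(replicate(100, {
  x <- rnorm(5, mean = 1, sd = 1)
  c(mean = mean(x), sd = sd(x),
    p = t.test(x, mu = 0, alternative = "greater")$p.value)
})))

## All 100 samples: medians close to the true values (left panel of Figure 6).
median(sims$mean); median(sims$sd)

## The 'significant' subset only (right panel): exaggerated mean, shrunken SD,
## and an inflated standardised effect size mean/sd.
sig <- subset(sims, p < 0.05)
median(sig$mean); median(sig$sd)
median(sig$mean / sig$sd) / median(sims$mean / sims$sd)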
The significance filter presents a peculiar difficulty. It leads to exaggeration on average, but any particular result may well be close to the correct size whether it is ‘significant’ or not. A real-world sample mean of, say, x̄ = 1.5 might be an exaggeration of µ = 1, it might be an underestimation of µ = 2, or it might be pretty close to µ = 1.4, and there would be no way to be certain without knowing µ, and if µ were known then the experiment would probably not have been necessary in the first place. That means that the possibility of a type M error looms over any experimental result that is interesting because of a small P-value, and that is particularly true when the sample size is small. The only way to gain more confidence that a particular significant result closely approximates the true state of the world is to repeat the experiment—the second result would not have been run through the significance filter and so its results would not have a greater than average risk of exaggeration, and the overall inference can be informed by both results. Of course, experiments intended to repeat or replicate an interesting finding should take the possible exaggeration into account by being designed to have higher power than the original.

4.2 Multiple comparisons

Multiple testing is the situation where the tension between global and local considerations is most stark. It is also the situation where the well-known jelly beans cartoon from XKCD.com is irresistible (Figure 7). The cartoon scenario is that jelly beans were suspected of causing acne, but a test found “no link between jelly beans and acne (P > 0.05)”, and so the possibility that only a certain colour of jelly bean causes acne is then entertained. All 20 colours of jelly bean are independently tested, with only the result from green jelly beans being significant, “(P < 0.05)”. The newspaper headline at the end of the cartoon mentions only the green jelly beans result, and it does that with exaggerated certainty.

The usual interpretation of that cartoon is that the significant result with green jelly beans is likely to be a false positive because, after all, hypothesis testing with the threshold of P < 0.05 is expected to yield a false positive one time in 20, on average, when the null is true. The more hypothesis tests there are, the higher the risk that one of them will yield a false positive result. The textbook response to multiple comparisons is to introduce ‘corrections’ that protect an overall maximum false positive error rate by adjusting the threshold according to the number of tests in the family, to give protection from inflation of the family-wise false positive error rate. The Bonferroni adjustment is the best-known method, and while there are several alternative ‘corrections’ that perform a little better, none of those is nearly as simple. A Bonferroni adjustment for the family of experiments in the cartoon would preserve an overall false positive error rate of 5% by setting a threshold for significance of 0.05/20 = 0.0025 in each of the 20 hypothesis tests.[14] It must be noted that such protection does not come for free, because adjustments for multiplicity invariably strip statistical power from the analysis.

Footnote 14: You may notice that the first test of jelly beans without reference to colour has been ignored here. There is no set rule for saying exactly which experiments constitute a family for the purposes of correction of multiplicity.
We do not know whether the 'significant' link between green jelly beans and acne would survive a Bonferroni adjustment because the actual P-values were not supplied (that serves to illustrate one facet of the inadequacy of reporting 'P less thans' in place of actual P-values), but, as an example, a P-value of 0.003, low enough to be quite encouraging as the result of a significance test, would be 'not significant' according to the Bonferroni adjustment. Such a result would present us with a serious dilemma because the inference supported by the local evidence would be apparently contradicted by global error rate considerations. However, that contradiction is not what it seems, because the null hypothesis of the significance test P-value is a different null hypothesis from that tested by the Bonferroni-adjusted hypothesis test. The significance test null concerns only the green jelly beans, whereas the null hypothesis of the Bonferroni adjustment is an omnibus null hypothesis that says that the link between green jelly beans and acne is zero and the link between purple jelly beans and acne is zero and the link between brown jelly beans and acne is zero, and so on. The P-value null hypothesis is local and the omnibus null is global. The global null hypothesis might be appropriate before the evidence is available (i.e. for power calculations and experimental planning), but after the data are in hand the local null hypothesis concerning just the green jelly beans gains importance.

It is important to avoid being blinded to the local evidence by a non-significant global result. After all, the pattern of evidence in the cartoon is exactly what would be expected if the green colouring agent caused acne: green jelly beans are associated with acne but the other colours are not. (The failure to see an effect of the mixed jelly beans in the first test is easily explicable on the basis of the lower dose of green.) If the data from the trial of green jelly beans are independent of the data from the trials of other colours, then there is no way that the existence of those other data, or their analysis, can influence the nature of the green data. The green jelly bean data cannot logically have been affected by the fact that mauve and beige jelly beans were tested at a later point in time (the subsequent cannot affect the previous), and the experimental system would have to be bizarrely flawed for the testing of the purple or brown jelly beans to affect the subsequent experiment with green jelly beans. If the multiplicity of tests did not affect the data then it is only reasonable to say that it did not affect the evidence. The omnibus global result does not cancel the local evidence, or even alter it, and yet the elevated risk of a false positive error is real. That presents us with a dilemma and, unfortunately, statistics does not provide a way around it. Global error rates and local evidence operate in different logical spaces [Thompson, 2007] and so there can be no strictly statistical way to weigh them together.

Figure 7: Multiple testing cartoon from XKCD, https://xkcd.com/882/
All is not lost, though, because statistical limitations do not preclude thoughtful integration of local and global issues when making inferences. We just have to be more than normally cautious when the local and the global pull in different directions. For example, in the case of the cartoon, the evidence in the data favours the idea that green jelly beans are linked with acne (and if we had an exact P-value then we could specify the strength of that favouring), but because the data were obtained by a method with a substantial false positive error rate we should be somewhat reluctant to take that evidence at face value. It would be up to the scientist in the cartoon (the one with safety glasses) to form a provisional scientific conclusion regarding the effect of green jelly beans, even if that inference is that any decision should be deferred until more evidence is available. Whatever the inference, the evidence, the theory, the method, and any other corroborating or rebutting information should all be considered and reported.

    A man or woman who sits and deals out a deck of cards repeatedly will eventually get a very unusual set of hands. A report of unusualness would be taken differently if we knew it was the only deal made, or one of a thousand deals, or one of a million deals, etc. [Tukey, 1991, p. 133]

In isolation the cartoon experiments are probably only sufficient to suggest that the association between green jelly beans and acne is worthy of further investigation (with the earnestness of that suggestion being inversely related to the size of the relevant P-value). The only way to be in a position to report an inference concerning those jelly beans without having to hedge around the family-wise false positive error rate and the significance filter is to re-test the green jelly beans. New data from a separate experiment will be free from the taint of elevated family-wise error rates and untouched by the significance filter exaggeration machine. And, of course, all of the original experiments should be reported alongside the new, as well as reasoned argument incorporating corroborating or rebutting information and theory.

The fact that a fresh experiment is necessary to allow a straightforward conclusion about the effect of the green jelly beans means that the experimental series shown in the cartoon is a preliminary, exploratory study. Preliminary or exploratory research is essential to scientific progress and can merit publication as long as it is reported completely and openly as preliminary. Too often scientists fall into the pattern of misrepresenting the processes that lead to their experimental results, perhaps under the mistaken assumption that science has to be hypothesis driven [Medawar, 1963, du Prel et al., 2009, Howitt and Wilson, 2014]. That misrepresentation may take the form of a suggestion, implied or stated, that the green jelly beans were the intended subject of the study, a behaviour described as HARKing (hypothesising after the results are known), or cherry picking, where only the significant results are presented.
The reason that HARKing is problematical is that hypotheses cannot be tested using the data that suggested the hypothesis in the first place, because those data always support that hypothesis (otherwise they would not be suggesting it!), and cherry picking introduces a false impression of the nature of the total evidence and allows the direct introduction of experimenter bias. Either way, focussing on just the unusual observations from a multitude is bad science. It takes little effort and few words to say that 20 colours were tested and only the green yielded a statistically significant effect, and a scientist can (should) then hypothesise that green jelly beans cause acne and test that hypothesis with new data.

4.3 P-hacking

P-hacking is where an experiment or its analysis is directed at obtaining a small enough P-value to claim significance instead of being directed at the clarification of a scientific issue or the testing of a hypothesis. Deliberate P-hacking does happen, perhaps driven by the incentives built into the systems of academic reward and publication imperatives, but most P-hacking is accidental: honest researchers doing 'the wrong thing' through ignorance. P-hacking is not always as wrong as might be assumed, as the idea of P-hacking comes from paying attention exclusively to global considerations of error rates, and most particularly to false positive error rates. Those most stridently opposed to P-hacking will point to the increased risk of false positive errors, but rarely to the lowered risk of false negative errors. I will recklessly note that some categories of P-hacking look entirely unproblematical when viewed through the prism of local evidence. The local versus global distinction allows a more nuanced response to P-hacking.

Some P-hacking is outright fraud. Consider this example that has recently come to light:

    One sticking point is that although the stickers increase apple selection by 71%, for some reason this is a p value of .06. It seems to me it should be lower. Do you want to take a look at it and see what you think. If you can get the data, and it needs some tweeking, it would be good to get that one value below .05. (Email from Brian Wansink to David Just on Jan. 7, 2012 [Lee, 2018])

I do not expect that any readers would find P-hacking of that kind to be acceptable. However, the line between fraudulent P-hacking and the more innocent P-hacking through ignorance is hard to define, particularly given the fact that some behaviours derided as P-hacking can be perfectly legitimate as part of a scientific research program. Consider this cherry-picked list of responses to a P-value being greater than 0.05 that have been described as P-hacking [Motulsky, 2014] (there are nine specified in the original but I discuss only five: cherry picking!):

• Analyze only a subset of the data;
• Remove suspicious outliers;
• Adjust data (e.g. divide by body weight);
• Transform the data (i.e. logarithms);
• Repeat to increase sample size (n).

Before going any further I need to point out that Motulsky has a more realistic attitude to P-hacking than might be assumed from my treatment of his list. He writes: "If you use any form of P-hacking, label the conclusions as 'preliminary'." [Motulsky, 2014, p. 1019].

Analysis of only a subset of the data is illicit if the unanalysed portion is omitted in order to manipulate the P-value, but unproblematical if it is omitted for being irrelevant to the scientific question at hand.
Removal of suspicious outliers is similar in being only sometimes inappropriate: it depends on what is meant by the term "outlier". If it indicates that a datum is a mistake, such as a typographical or transcriptional error, then of course it should be removed (or corrected). If an outlier is the result of a technical failure of a particular run of the experiment then perhaps it should be removed, but the technical success or failure of an experimental run must not be judged by the influence of its data on the overall P-value. If the word outlier just denotes a datum that is further from the mean than the others in the dataset, then omit it at your peril! Omission of that type of outlier will reduce the variability in the data and give a lower P-value, but it will markedly increase the risk of false positive results and it is, indeed, an illicit and damaging form of P-hacking.

Adjusting the data by standardisation is appropriate, desirable even, in some circumstances. For example, if a study concerns feeding or organ masses then standardising to body weight is probably a good idea. Such manipulation of data should be considered P-hacking only if an analyst finds a too large P-value in unstandardised data, then tries out various re-expressions of the data in search of a low P-value, and then reports the results as if that expression of the data was intended all along.

The P-hackingness of log-transformation is similarly situationally dependent. Consider pharmacological EC50s or drug affinities: they are strictly bounded at zero and so their distributions are skewed. In fact the distributions are quite close to log-normal and so log-transformation before statistical analysis is appropriate and desirable. Log-transformation of EC50s gives more power to parametric tests and so it is common that significance testing of logEC50s gives lower P-values than significance testing of the untransformed EC50s. An experienced analyst will choose the log-transformation because it is known from empirical and theoretical considerations that the transformation makes the data better match the expectations of a parametric statistical analysis. It might sensibly be categorised as P-hacking only if the log-transformation was selected with no justification other than it giving a low P-value.
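As an illustration of that last point, the hedged sketch below simulates hypothetical EC50 data for two groups (the values, group sizes and seed are all invented) and compares a t-test on the raw EC50s with one on the logEC50s. Choosing the log scale on pharmacological grounds before seeing the data is sound practice; choosing it after the fact solely because it yields the smaller P-value is the P-hacking described above.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)                      # arbitrary seed
# Hypothetical EC50s (nM) for two groups of 8, drawn from log-normal distributions
# whose medians differ two-fold; none of these numbers comes from a real assay.
ec50_control = rng.lognormal(mean=np.log(100), sigma=0.5, size=8)
ec50_treated = rng.lognormal(mean=np.log(200), sigma=0.5, size=8)

_, p_raw = stats.ttest_ind(ec50_control, ec50_treated)                       # skewed raw EC50s
_, p_log = stats.ttest_ind(np.log10(ec50_control), np.log10(ec50_treated))  # logEC50s

print(f"P from raw EC50s: {p_raw:.3f}")
print(f"P from logEC50s:  {p_log:.3f}")
# The log-scale analysis is justified because EC50s are close to log-normal,
# not because it happens to give the smaller P-value on any particular run.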
The last form of P-hacking in the list requires a good deal more consideration than the others because, well, statistics is complicated. That consideration is facilitated by a concrete scenario, one that might seem surprisingly realistic to some readers. Say you run an experiment with n = 5 observations in each of two independent groups, one treated and one control, and obtain a P-value of 0.07 from Student's t-test. You might stop and integrate the very weak evidence against the null hypothesis into your inferential considerations, but you decide that more data will clarify the situation. Therefore you run some extra replicates of the experiment to obtain a total of n = 10 observations in each group (including the initial 5), and find that the P-value for the data in aggregate is 0.002. The risk of the 'significant' result being a false positive error is elevated because the data have had two chances to lead you to discard the null hypothesis. Conventional wisdom says that you have P-hacked.

However, there is more to be considered before the experiment is discarded. Conventional wisdom usually takes the global perspective. As mentioned above, it typically privileges false positive errors over any other consideration, and calls the procedure invalid. However, the extra data have added power to the experiment and lowered the expected P-value for any true effect size. From a local evidence point of view, increasing the sample increases the amount of evidence available for use in inference, which is a good thing. Is extending an experiment after the statistical analysis a good thing or a bad thing? The conventional answer is that it is a bad thing and so the conventional advice is don't do it! However, a better response might balance the bad effects of extending the experiment against the good. Consideration of the local and global aspects of statistical inference allows a much more nuanced answer. The procedure described would be perfectly acceptable for a preliminary experiment.

Technically, the two-stage procedure in that scenario allows optional stopping. The scenario is not explicit, but it can be discerned that the stopping rule was, in effect: run n = 5 and inspect the P-value; if it is small enough then stop and make inferences about the null hypothesis; if the P-value is not small enough for the stop but nonetheless small enough to represent some evidence against the null hypothesis, add an extra 5 observations to each group to give n = 10, stop, and analyse again. We do not know how low the interim P-value would have to be for the protocol to stop, and we do not know how high it could be and the extra data still be gathered, but no matter where those thresholds are set, such stopping rules yield false positive rates higher than the nominal critical value for stopping would suggest. Because of that, the conventional view (the global perspective, of course) is that the protocol is invalid, but it would be more accurate to say that such a protocol would be invalid unless the P-value or the threshold for a Neyman–Pearsonian dichotomous decision is adjusted, as would be done with a formal sequential test. It is interesting to note that the elevation of the false positive rate is not necessarily large. Simulations of the scenario as specified, with P < 0.1 as the threshold for continuing, show that the overall false positive error rate would be about 0.008 when the critical value for stopping at the first stage is 0.005, and about 0.06 when that critical value is 0.05.

The increased rate of false positives (global error rate) is real, but that doesn't mean that the evidential meaning of the final P-value of 0.002 is changed. It is the same local evidence against the null as if it were obtained from a simpler one-stage protocol with n = 10. After all, the data are exactly the same as if the experimenter had intended to obtain n = 10 from the beginning. The optional stopping has changed the global properties of the statistical procedure but not the local evidence, which is contained in the actualised data.

You might be wondering how it is possible that the local evidence is unaffected by a process that increases the global false positive error rate. The rationale is that the evidence is contained within the data but the error rate is a property of the procedure: evidence is local and error rates are global.
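Those false positive rates can be checked with a short simulation. The sketch below is not the author's code and it assumes one particular reading of the stopping rule: stop and declare significance if the interim P-value is below the critical value, extend to n = 10 per group if the interim P-value lies between the critical value and 0.1, and apply the same critical value to the final test. With a modest number of simulated experiments the estimates will only roughly match the 0.008 and 0.06 quoted above; increase n_sims for steadier numbers.

import numpy as np
from scipy import stats

def false_positive_rate(crit, continue_below=0.1, n_sims=20_000, seed=3):
    """Two-stage procedure under a true null: test at n = 5 per group, stop and
    'reject' if P < crit, extend to n = 10 per group if crit <= P < continue_below,
    and apply the same critical value to the aggregated data."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        a, b = rng.normal(size=5), rng.normal(size=5)     # both groups from the same population
        p1 = stats.ttest_ind(a, b).pvalue
        if p1 < crit:
            rejections += 1                               # stopped early, false positive
        elif p1 < continue_below:
            a = np.concatenate([a, rng.normal(size=5)])
            b = np.concatenate([b, rng.normal(size=5)])
            if stats.ttest_ind(a, b).pvalue < crit:
                rejections += 1                           # false positive after extension
    return rejections / n_sims

for crit in (0.005, 0.05):
    print(f"critical value {crit}: overall false positive rate ~ {false_positive_rate(crit):.3f}")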
Recall that false positive errors can only occur when the null hypothesis is true. If the null is true then the procedure has increased the risk of the data leading us to a false positive decision, but if the null is false then the procedure has decreased the risk of a false negative decision. Which of those has paid out in this case cannot be known because we do not know the truth of this local null hypothesis. It might be argued that an increase in the global risk of false positive decisions should outweigh the decreased risk of false negatives, but that is a value judgement that ought to take into account the particulars of the experiment in question, the role of that experiment in the overall study, and other contextual factors that are unspecified in the scenario and that vary from circumstance to circumstance.

So, what can be said about the result of that scenario? The result of P = 0.002 provides moderately strong evidence against the null hypothesis, but it was obtained from a procedure with sub-optimal false positive error characteristics. That sub-optimality should be accounted for in the inferences that are made from the evidence, but it is only confusing to say that it alters the evidence itself, because it is the data that contain the evidence and the sub-optimality did not change the data. Motulsky provides good advice on what to do when your experiment has involved optional stopping:

• For each figure or table, clearly state whether or not the sample size was chosen in advance, and whether every step used to process and analyze the data was planned as part of the experimental protocol.
• If you used any form of P-hacking, label the conclusions as "preliminary."

Given that basic pharmacological experiments are often relatively inexpensive and quickly completed, one can add to that list the option of also corroborating (or not) those results with a fresh experiment designed to have a larger sample size (remember the significance filter exaggeration machine) and performed according to the design. Once we move beyond the globalist mindset of one-and-done such an option will seem obvious.

4.4 What is a statistical model?

I remind the reader that this chapter is written under the assumption that pharmacologists can be trusted to deal with the full complexity of statistics. That assumption gives me licence to discuss unfamiliar notions like the role of the statistical model in statistical analysis. All too often the statistical model is invisible to ordinary users of statistics, and that invisibility encourages thoughtless use of flawed and inappropriate models, thereby contributing to the misuse of inferential statistics like P-values.

A statistical model is what allows the formation of calibrated statistical inferences and non-trivial probabilistic statements in response to data. The model does that by assigning probabilities to potential arrangements of data. A statistical model can be thought of as a set of assumptions, although it might be more realistic to say that a chosen statistical model imposes a set of assumptions onto the experimenter.
    I have often been struck by the extent to which most textbooks, on the flimsiest of evidence, will dismiss the substitution of assumptions for real knowledge as unimportant if it happens to be mathematically convenient to do so. Very few books seem to be frank about, or perhaps even aware of, how little the experimenter actually knows about the distribution of errors in his observations, and about facts that are assumed to be known for the purposes of statistical calculations. [Colquhoun, 1971, p. v]

Statistical models can take a variety of forms [McCullagh, 2002], but the model for the familiar Student's t-test for independent samples is reasonably representative. That model consists of assumed distributions (normal) of two populations with parameters mean (µ1 and µ2) and standard deviation (σ1 and σ2) (the ordinary Student's t-test assumes that σ1 = σ2, but the Welch–Satterthwaite variant relaxes that assumption), and a rule for obtaining samples (e.g. a randomly selected sample of n = 6 observations from each population). A specified value of the difference between means serves as the null hypothesis, so H0: µ1 − µ2 = δ_H0. The test statistic is

    t = \frac{(\bar{x}_1 - \bar{x}_2) - \delta_{H_0}}{s_p \sqrt{1/n_1 + 1/n_2}}

where x̄ is a sample mean and s_p is the pooled standard deviation. (Oh no! An equation! Don't worry, it's the only one, and, anyway, it is too late now to stop reading.) The explicit inclusion of a null hypothesis term in the equation for t is relatively rare, but it is useful because it shows that the null hypothesis is just a possible value of the difference between means. Most commonly the null hypothesis says that the difference between means is zero (it can be called a 'nil-null') and in that case the omission of δ_H0 from the equation makes no numerical difference. Values of t calculated by that equation have a known distribution when µ1 − µ2 = δ_H0, and that distribution is Student's t-distribution (technically it is the central Student's t-distribution; when δ ≠ δ_H0 it is a non-central t-distribution [Cumming and Finch, 2001]). Because the distribution is known it is possible to define hypothesis test acceptance regions for any level of α for a hypothesis test, and any observed t-value can be converted into a P-value in a significance test.
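For readers who prefer code to algebra, the sketch below computes the same t statistic with an explicit δ_H0 term and converts it into a two-sided P-value from the central t-distribution. The two samples are invented for illustration, and with δ_H0 = 0 the result matches the ordinary equal-variance test in scipy.

import numpy as np
from scipy import stats

x1 = np.array([4.1, 5.3, 4.8, 6.0, 5.1, 4.4])    # hypothetical sample 1 (n = 6)
x2 = np.array([5.9, 6.8, 6.1, 7.4, 6.6, 5.8])    # hypothetical sample 2 (n = 6)
delta_h0 = 0.0                                   # the 'nil-null': hypothesised difference is zero

n1, n2 = len(x1), len(x2)
sp = np.sqrt(((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2))
t = ((x1.mean() - x2.mean()) - delta_h0) / (sp * np.sqrt(1 / n1 + 1 / n2))
p = 2 * stats.t.sf(abs(t), df=n1 + n2 - 2)       # two-sided P from the central t-distribution

print(f"t = {t:.3f}, P = {p:.4f}")
print(stats.ttest_ind(x1, x2, equal_var=True))   # same t and P when delta_h0 is zero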
An important problem that a pharmacologist is likely to face when using a statistical model is that it's just a model. Scientific inferences are usually intended to communicate something about the real world, not the mini-world of a statistical model, and the connection between a model-based probability of obtaining a test statistic value and the state of the real world is always indirect and often inscrutable. Consider the meaning conveyed by an observed P-value of 0.002. It indicates that the data are strange or unusual compared to the expectations of the statistical model when the parameter of interest is set to the value specified by the null hypothesis. The statistical model expects a P-value of, say, 0.002 to occur only 2 times out of a thousand on average when the null is true. If such a P-value is observed then one of these situations has arisen:

• a two in a thousand accident of random sampling has occurred;
• the null hypothesised parameter value is not close to the true value;
• the statistical model is flawed or inapplicable because one or more of the assumptions underlying its application are erroneous.

Typically only the first and second are considered, but the last is every bit as important, because when the statistical model is flawed or inapplicable the expectations of the model are not relevant to the real world system that spawned the data. Figure 8 shows the issue diagrammatically.

Figure 8: Diagram of inference using a statistical model. The real world system under investigation and the data sampled from it are assumed to be equivalent to the population and sample of the statistical model, and the model-based inference is assumed to be relevant to the desired real world inference.

When we use that statistical inference to inform inferences about the real world we are implicitly assuming: (i) that the real world system that generated the data is an analog of the population in the statistical model; (ii) that the way the data were obtained is well described by the sampling rule of the statistical model; and (iii) that the observed data are analogous to the random sample assumed in the statistical model. To the degree that those assumptions are erroneous there is degradation of the relevance of the model-based statistical inference to the real world inference that is desired.

Considerations of model applicability are often limited to the population distribution (is my data normal enough to use a Student's t-test?) but it is much more important to consider whether there is a definable population that is relevant to the inferential objectives and whether the experimental units ("subjects") approximate a random sample. Cell culture experiments are notorious for having ill-defined populations, and while experiments with animal tissues may have a definable population, the animals are typically delivered from an animal breeding or holding facility and are unlikely to be a random sample. Issues like those mean that the calibration of uncertainty offered by statistical methods might be more or less uncalibrated. For good inferential performance in the real world, there has to be a flexible and well-considered linking of model-based statistical inferences and scientific inferences concerning the real world.

5 P-values and inference

A P-value tells you how well the data match the expectations of a statistical model when the null hypothesis is true. But, as we have seen, there are many considerations that have to be made before a low P-value can safely be taken to provide sufficient reason to say that the null hypothesis is false. What's more, inferences about the null hypothesis are not always useful. Royall argues that there are three fundamental inferential questions that should be considered when making scientific inferences [Royall, 1997] (here paraphrased and re-ordered):

1. What do these data say?
2. What should I believe now that I have these data?
3. What should I do or decide now that I have these data?

Those questions are distinct, but not entirely independent, and there is no single best way to answer any of them. A P-value from a significance test is an answer to the first question. It communicates how strongly the data argue against the null hypothesis, with a smaller P-value being a more insistent shout of "I disagree!". However, the answer provided by a P-value is at best incomplete, because it is tied to a particular null hypothesis within a particular statistical model and because it captures and communicates only some of the information that might be relevant to scientific inference.
The limitations of a P-value can be thought of as analogous to a black and white photograph that captures the essence of a scene but misses coloured detail that might be vital for a correct interpretation. Likelihood functions provide more detail than P-values and so they can be superior to P-values as answers to the question of what the data say. However, they will be unfamiliar to most pharmacologists and they are not immune to problems relating to the relevance of the statistical model and the peculiarities of experimental protocol. (Royall [1997] and other proponents of likelihood-based inference, e.g. Berger and Wolpert [1988], make a contrary argument based on the likelihood principle and the irrelevance of the sampling rule, but those arguments may fall down when viewed with the local versus global distinction in mind. Happily, those issues are beyond the scope of this chapter.) As this chapter is about P-values we will not consider likelihoods any further, and those who, correctly, see that they might offer utility can read Royall's book [Royall, 1997].

The second of Royall's questions, what should I believe now that I have these data?, requires integration of the evidence of the data with what was believed prior to the evidence being available. A formal statistical combination of the evidence with prior beliefs can be done using Bayesian methods, but they are rarely used for the analysis of basic pharmacological experiments and are outside the scope of this chapter about P-values. Considerations of belief can be assisted by P-values, because when the data argue strongly against the null hypothesis one should be less inclined to believe it true, but it is important to realise that P-values do not in any way measure or communicate belief.

The Neyman–Pearsonian hypothesis test framework was devised specifically to answer the third question: it is a decision theoretic framework. Of course, it is a good decision procedure only when α is specified prior to the data being available, and when a loss function informs the experimental design. And it is only useful when there is a singular decision to be made regarding a null hypothesis, as can be the case in acceptance sampling and in some randomised clinical trials. A singular decision regarding a null hypothesis is rarely a sufficient inference from the collection of experiments and observations that typically makes up a basic pharmacological study, and so hypothesis tests should not be a default analytical tool (and the hybrid NHST should not be used in any circumstance).

Readers might feel that this section has failed to provide a clear method for making inferences about any of the three questions, and they would be correct. Statistics is a set of tools to help with inferences and not a set of inferential recipes; scientific inferences concerning the real world have to be made by scientists, and my intention with this reckless guide to P-values is to encourage an approach to scientific inference that is more thoughtful than statistical significance. After all, those scientists invariably know much more than statistics does about the real world, and have a superior understanding of the system under study. Scientific inferences should be made after principled consideration of the available evidence, theory and, sometimes, informed opinion.
A full evaluation of evidence will include both consideration of the strength of the local evidence and of the global properties of the experimental system and statistical model from which that evidence was obtained. It is often difficult, just like statistics, and there is no recipe.

References

M. Baker and E. Dolgin. Reproducibility project yields muddy results. Nature, 541(7637):269–270, 2017.

C. G. Begley and L. M. Ellis. Drug development: Raise standards for preclinical cancer research. Nature, 483(7391):531–533, Mar. 2012.

D. J. Benjamin, J. O. Berger, M. Johannesson, B. A. Nosek, E. J. Wagenmakers, R. Berk, K. A. Bollen, B. Brembs, L. Brown, C. Camerer, D. Cesarini, C. D. Chambers, M. Clyde, T. D. Cook, P. De Boeck, Z. Dienes, A. Dreber, K. Easwaran, C. Efferson, E. Fehr, F. Fidler, A. P. Field, M. Forster, E. I. George, R. Gonzalez, S. Goodman, E. Green, D. P. Green, A. G. Greenwald, J. D. Hadfield, L. V. Hedges, L. Held, T.-H. Ho, H. Hoijtink, D. J. Hruschka, K. Imai, G. Imbens, J. P. A. Ioannidis, M. Jeon, J. H. Jones, M. Kirchler, D. Laibson, J. List, R. Little, A. Lupia, E. Machery, S. E. Maxwell, M. McCarthy, D. A. Moore, S. L. Morgan, M. Munafò, S. Nakagawa, B. Nyhan, T. H. Parker, L. Pericchi, M. Perugini, J. Rouder, J. Rousseau, V. Savalei, F. D. Schönbrodt, T. Sellke, B. Sinclair, D. Tingley, T. Van Zandt, S. Vazire, D. J. Watts, C. Winship, R. L. Wolpert, Y. Xie, C. Young, J. Zinman, and V. E. Johnson. Redefine statistical significance. Nature Human Behaviour, pages 1–5, Jan. 2018.

J. Berger and T. Sellke. Testing a point null hypothesis: the irreconcilability of P values and evidence. Journal of the American Statistical Association, pages 112–122, 1987.

J. O. Berger and R. L. Wolpert. The Likelihood Principle. Lecture Notes–Monograph Series. IMS, 1988.

L. Berglund, E. Björling, P. Oksvold, L. Fagerberg, A. Asplund, C. A.-K. Szigyarto, A. Persson, J. Ottosson, H. Wernérus, P. Nilsson, E. Lundberg, A. Sivertsson, S. Navani, K. Wester, C. Kampf, S. Hober, F. Pontén, and M. Uhlén. A genecentric Human Protein Atlas for expression profiles based on antibodies. Molecular & Cellular Proteomics: MCP, 7(10):2019–2027, Oct. 2008.

B. Bhattacharya and D. Habtzghi. Median of the P Value Under the Alternative Hypothesis. The American Statistician, 56(3):202–206, Aug. 2002.

A. Birnbaum. The Neyman-Pearson Theory as Decision Theory, and as Inference Theory; With a Criticism of the Lindley-Savage Argument for Bayesian Theory. Synthese, 36(1):19–49, Sept. 1977.

J. M. Bland and D. G. Bland. Statistics Notes: One and two sided tests of significance. BMJ, 309(6949):248, July 1994.

C. F. Camerer, A. Dreber, F. Holzmeister, T.-H. Ho, J. Huber, M. Johannesson, M. Kirchler, G. Nave, B. A. Nosek, T. Pfeiffer, A. Altmejd, N. Buttrick, T. Chan, Y. Chen, E. Forsell, A. Gampa, E. Heikensten, L. Hummer, T. Imai, S. Isaksson, D. Manfredi, J. Rose, E.-J. Wagenmakers, and H. Wu. Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour, pages 1–10, Sept. 2018.

J. Cohen. A power primer. Psychological Bulletin, 112(1):155–159, July 1992.

D. Colquhoun. Lectures on Biostatistics. Oxford University Press, 1971.

D. Colquhoun. An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science, 1(3):140216, Nov. 2014.
M. Cowles. Statistics in Psychology: An Historical Perspective. Lawrence Erlbaum Associates, Inc., 1989.

G. Cumming. Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspectives on Psychological Science, 3(4):286–300, 2008.

G. Cumming and S. Finch. A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational and Psychological Measurement, 61(4):532–574, 2001.

M. Curtis, R. Bond, D. Spina, A. Ahluwalia, S. Alexander, M. Giembycz, A. Gilchrist, D. Hoyer, P. Insel, A. Izzo, A. Lawrence, D. MacEwan, L. Moon, S. Wonnacott, A. Weston, and J. McGrath. Experimental design and analysis and their reporting: new guidance for publication in BJP. British Journal of Pharmacology, 172(2):3461–3471, 2015.

D. J. Drucker. Never Waste a Good Crisis: Confronting Reproducibility in Translational Research. Cell Metabolism, 24(3):348–360, Sept. 2016.

J.-B. du Prel, G. Hommel, B. Röhrig, and M. Blettner. Confidence interval or p-value?: Part 4 of a series on evaluation of scientific publications. Deutsches Ärzteblatt International, 106(19):335–339, May 2009.

S. D. Dubey. Some thoughts on the one-sided and two-sided tests. Journal of Biopharmaceutical Statistics, 1(1):139–150, 1991.

R. Fisher. Statistical Methods for Research Workers. Oliver & Boyd, 1925.

R. Fisher. Design of Experiments. Hafner, New York, 1960.

H. Fraser, T. Parker, S. Nakagawa, A. Barnett, and F. Fidler. Questionable research practices in ecology and evolution. PLoS ONE, 13(7):e0200303, 2018.

L. S. Freedman. An analysis of the controversy over classical one-sided tests. Clinical Trials, 5(6):635–640, 2008.

M. A. García-Pérez. Thou Shalt Not Bear False Witness Against Null Hypothesis Significance Testing. Educational and Psychological Measurement, 77(4):631–662, Oct. 2016.

A. Gelman and J. Carlin. Beyond Power Calculations. Perspectives on Psychological Science, 9(6):641–651, Nov. 2014.

C. H. George, S. C. Stanford, S. Alexander, G. Cirino, J. R. Docherty, M. A. Giembycz, D. Hoyer, P. A. Insel, A. A. Izzo, Y. Ji, D. J. MacEwan, C. G. Sobey, S. Wonnacott, and A. Ahluwalia. Updating the guidelines for data transparency in the British Journal of Pharmacology - data sharing and the use of scatter plots instead of bar charts. British Journal of Pharmacology, 174(17):2801–2804, Aug. 2017.

G. Gigerenzer. We need statistical thinking, not statistical rituals. Behavioral and Brain Sciences, 1998.

S. N. Goodman. Of P-Values and Bayes: A Modest Proposal. Epidemiology, 12(3):295–297, May 2001.

S. N. Goodman and R. Royall. Evidence and scientific research. American Journal of Public Health, 78(12):1568–1574, 1988.

P. F. Halpin and H. J. Stam. Inductive inference or inductive behavior: Fisher and Neyman-Pearson approaches to statistical testing in psychological research (1940-1960). The American Journal of Psychology, 119(4):625–653, 2006.

L. Halsey, D. Curran-Everett, S. Vowler, and G. Drummond. The fickle P value generates irreproducible results. Nature Methods, 12(3):179–185, 2015.

J. Hoenig and D. Heisey. The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis. The American Statistician, 2001.
S. M. Howitt and A. N. Wilson. Revisiting "Is the scientific paper a fraud?": The way textbooks and scientific research articles are being used to teach undergraduate students could convey a misleading image of scientific research. EMBO Reports, 15(5):481–484, May 2014.

R. Hubbard, M. Bayarri, K. Berk, and M. Carlton. Confusion over Measures of Evidence (p's) versus Errors (α's) in Classical Statistical Testing. The American Statistician, 57(3), Aug. 2003.

C. J. Huberty. Historical origins of statistical testing practices: The treatment of Fisher versus Neyman-Pearson views in textbooks. The Journal of Experimental Education, pages 317–333, 1993.

S. Hurlbert and C. Lombardi. Final collapse of the Neyman-Pearson decision theoretic framework and rise of the neoFisherian. Annales Zoologici Fennici, 46(5):311–349, 2009.

J. P. A. Ioannidis. Why Most Published Research Findings Are False. PLoS Medicine, 2(8):e124, Aug. 2005.

V. E. Johnson. Revised standards for statistical evidence. Proceedings of the National Academy of Sciences, 110(48):19313–19317, 2013.

K. Kobayashi. A comparison of one- and two-sided tests for judging significant differences in quantitative data obtained in toxicological bioassay of laboratory animals. Journal of Occupational Health, 39(1):29–35, 1997.

J. I. Krueger and P. R. Heck. The Heuristic Value of p in Inductive Statistical Inference. Frontiers in Psychology, 8:108–16, June 2017.

P. Laplace. Théorie analytique des probabilités. 1812.

B. Lecoutre, M.-P. Lecoutre, and J. Poitevineau. Uses, Abuses and Misuses of Significance Tests in the Scientific Community: Won't the Bayesian Choice Be Unavoidable? International Statistical Review / Revue Internationale de Statistique, 69(3):399–417, Dec. 2001.

S. M. Lee. BuzzFeed News: Here's how Cornell scientist Brian Wansink turned shoddy data into viral studies about how we eat, February 2018. URL https://www.buzzfeednews.com/article/stephaniemlee/brian-wansink-cornell-p-hacking.

E. Lehmann. Fisher, Neyman, and the Creation of Classical Statistics. Springer, 2011.

J. Lenhard. Models and statistical inference: The controversy between Fisher and Neyman-Pearson. Br. J. Philos. Sci., 57(1):69–91, March 2006. ISSN 0007-0882. doi: 10.1093/bjps/axi152.

M. J. Lew. Bad statistical practice in pharmacology (and other basic biomedical disciplines): you probably don't know P. British Journal of Pharmacology, 166(5):1559–1567, June 2012.

K. Liu and X.-L. Meng. There is individualized treatment. Why not individualized inference? Annual Review of Statistics and Its Application, 3(1):79–111, 2016. doi: 10.1146/annurev-statistics-010814-020310.

C. Lombardi and S. Hurlbert. Misprescription and misuse of one-tailed tests. Austral Ecology, May 2009.

J. Lu, Y. Qiu, and A. Deng. A note on type S & M errors in hypothesis testing. British Journal of Mathematical and Statistical Psychology, online version of record, 2018.

P. McCullagh. What is a statistical model? The Annals of Statistics, 30(5):1125–1310, 2002.

P. Medawar. Is the scientific paper a fraud? Listener, 70:377–378, 1963.

H. J. Motulsky. Common misconceptions about data analysis and statistics. Naunyn-Schmiedeberg's Archives of Pharmacology, 387(11):1017–1023, 2014.
J. Neyman and E. Pearson. On the Problem of the Most Efficient Tests of Statistical Hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231:289–337, 1933.

R. S. Nickerson. Null hypothesis significance testing: a review of an old and continuing controversy. Psychological Methods, 5(2):241–301, June 2000.

R. Nuzzo. Statistical errors: P values, the 'gold standard' of statistical validity, are not as reliable as many scientists assume. Nature, 506:150–152, 2014.

R. Royall. Statistical Evidence: A Likelihood Paradigm, volume 71 of Monographs on Statistics and Applied Probability. Chapman & Hall, 1997.

G. D. Ruxton and M. Neuhaeuser. When should we use one-tailed hypothesis testing? Methods in Ecology and Evolution, 1(2):114–117, 2010.

H. Sackrowitz and E. Samuel-Cahn. P Values as Random Variables - Expected P Values. American Statistician, pages 326–331, Nov. 1999.

S. Senn. Two cheers for P-values? Journal of Epidemiology and Biostatistics, 6(2):193–204, Dec. 2001.

G. Shaw and F. Nodder. The Naturalist's Miscellany: or coloured figures of natural objects; drawn and described immediately from nature. 1789.

A. Strasak, Q. Zaman, G. Marinell, and K. Pfeiffer. The Use of Statistics in Medical Research: A Comparison of The New England Journal of Medicine and Nature Medicine. The American Statistician, 61(1):47–55, 2007.

Student. The Probable Error of a Mean. Biometrika, 6(1):1–25, Mar. 1908.

B. Thompson. The Nature of Statistical Evidence, volume 189 of Lecture Notes in Statistics. Springer, 2007.

D. Trafimow and M. Marks. Editorial. Basic and Applied Social Psychology, 37(1):1–2, 2015. doi: 10.1080/01973533.2015.1012991.

J. W. Tukey. The Philosophy of Multiple Comparisons. Statistical Science, 6(1):100–116, Feb. 1991.

B. Voelkl, L. Vogt, E. S. Sena, and H. Würbel. Reproducibility of preclinical animal research improves with heterogeneity of study samples. PLOS Biology, 16(2):e2003693, Feb. 2018.

E.-J. Wagenmakers. A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5):779–804, Oct. 2007.

E.-J. Wagenmakers, M. Marsman, T. Jamil, A. Ly, J. Verhagen, J. Love, R. Selker, Q. F. Gronau, M. Šmíra, S. Epskamp, D. Matzke, J. N. Rouder, and R. D. Morey. Bayesian inference for psychology. Part I: Theoretical advantages and practical ramifications. pages 1–23, Mar. 2018.

R. L. Wasserstein and N. A. Lazar. The ASA's Statement on p-Values: Context, Process, and Purpose. The American Statistician, 70(2):129–133, June 2016.