The Jeffreys-Lindley Paradox and Discovery Criteria in High Energy Physics


Authors: Robert D. Cousins

Robert D. Cousins*
Department of Physics and Astronomy, University of California, Los Angeles, California 90095, USA

August 23, 2014

Abstract

The Jeffreys–Lindley paradox displays how the use of a p-value (or number of standard deviations z) in a frequentist hypothesis test can lead to an inference that is radically different from that of a Bayesian hypothesis test in the form advocated by Harold Jeffreys in the 1930s and common today. The setting is the test of a well-specified null hypothesis (such as the Standard Model of elementary particle physics, possibly with "nuisance parameters") versus a composite alternative (such as the Standard Model plus a new force of nature of unknown strength). The p-value, as well as the ratio of the likelihood under the null hypothesis to the maximized likelihood under the alternative, can strongly disfavor the null hypothesis, while the Bayesian posterior probability for the null hypothesis can be arbitrarily large. The academic statistics literature contains many impassioned comments on this paradox, yet there is no consensus either on its relevance to scientific communication or on its correct resolution. The paradox is quite relevant to frontier research in high energy physics. This paper is an attempt to explain the situation to both physicists and statisticians, in the hope that further progress can be made.

*cousins@physics.ucla.edu

Contents

1 Introduction
2 The original "paradox" of Lindley, as corrected by Bartlett
  2.1 Is there really a "paradox"?
  2.2 The JL paradox is not about testing simple H0 vs simple H1
3 Do point null hypotheses make sense in principle, or in practice?
4 Three scales for θ yield a paradox
5 HEP and belief in the null hypothesis
  5.1 Examples of three scales for θ in HEP experiments
  5.2 Test statistics for computing p-values in HEP
  5.3 Are HEP experimenters biased against their null hypotheses?
  5.4 Cases of an artificial null that carries little or no belief
6 What sets the scale τ?
  6.1 Comments on non-subjective priors for estimation and model selection
7 The reference analysis approach of Bernardo
8 Effect size in HEP
  8.1 No effect size is too small in core physics models
  8.2 Small effect size can indicate new phenomena at higher energy
9 Neyman-Pearson testing and the choice of Type I error probability α
  9.1 The mythology of 5σ
  9.2 Multiple trials factors for scanning nuisance parameters that are not eliminated
10 Can results of hypothesis tests be cross-calibrated among different searches?
11 Summary and Conclusions

1 Introduction

On July 4, 2012, the leaders of two huge collaborations (CMS and ATLAS) presented their results at a joint seminar at the CERN laboratory, located on the French–Swiss border outside Geneva. Each described the observation of a "new boson" (a type of particle), suspected to be the long-sought Higgs boson (Incandela and Gianotti, 2012). The statistical significances of the results were expressed in terms of "σ": carefully calculated p-values (not assuming normality) were mapped onto the equivalent number of standard deviations in a one-tailed test of the mean of a normal (i.e., Gaussian) distribution.
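The mapping between a p-value and the equivalent number of standard deviations referred to above is just the one-tailed normal tail probability. A minimal standard-library sketch (the function names are mine, not from any experiment's software):

```python
import math

def z_to_p(z):
    """One-tailed p-value: p = 1 - Phi(z), via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def p_to_z(p):
    """Equivalent number of standard deviations for a given p-value (bisection)."""
    lo, hi = 0.0, 40.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if z_to_p(mid) > p:   # z_to_p is decreasing in z
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(z_to_p(5.0))   # ~2.87e-7, the one-tailed "5 sigma" tail probability
print(p_to_z(0.05))  # ~1.64
```

The conversion is purely a convention for reporting; no claim of actual normality of the test statistic is implied by quoting a result "in σ".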
ATLAS observed 5σ significance by combining the two most powerful detection modes (different kinds of particles into which the boson decayed) in 2012 data with full results from earlier data. With independent data from a different apparatus, and only partially correlated analysis assumptions, CMS observed 5σ significance in a similar combination, and, when combining with some other modes as CMS had planned for that data set, 4.9σ. With ATLAS and CMS also measuring similar values for the rates of production of the detected particles, the new boson was immediately interpreted as the most anticipated and publicized discovery in high energy physics (HEP) since the Web was born (also at CERN). Journalists went scurrying for explanations of the meaning of "σ", and why "high energy physicists require 5σ for a discovery". Meanwhile, some who knew about Bayesian hypothesis testing asked why high energy physicists were using frequentist p-values rather than calculating the posterior belief in the hypotheses.

In this paper, I describe some of the traditions for claiming discovery in HEP, which have a decidedly frequentist flavor, drawing in a pragmatic way on both Fisher's ideas and the Neyman–Pearson (NP) approach, despite their disagreements over foundations of statistical inference. Of course, some HEP practitioners have been aware of the criticisms of this approach, having enjoyed interactions with some of the influential Bayesian statisticians (both subjective and objective in flavor) who attended HEP workshops on statistics. These issues lead directly to a famous "paradox", as Lindley (1957) called it, when testing the hypothesis of a specific value θ0 of a parameter against a continuous set of alternatives θ.
The different scaling of p-values and Bayes factors with sample size, described by Jeffreys and emphasized by Lindley, can lead the frequentist and the Bayesian to inconsistent strengths of inference that in some cases can even reverse the apparent inferences. However, as described below, it is an understatement to say that the community of Bayesian statisticians has not reached full agreement on what should replace p-values in scientific communication. For example, two of the most prominent voices of "objective" Bayesianism (J. Berger and J. Bernardo) advocate fundamentally different approaches to hypothesis testing for scientific communication. Furthermore, views in the Bayesian literature regarding the validity of models (in the social sciences, for example) are strikingly different from those common in HEP. This paper describes today's rather unsatisfactory situation. Progress in HEP meanwhile continues, but it would be potentially quite useful if more statisticians became aware of the special circumstances in HEP, and reflected on what the Jeffreys–Lindley (JL) paradox means to HEP, and vice versa.

In "high energy physics", also known as "elementary particle physics", the objects of study are the smallest building blocks of matter and the forces among them. (For one perspective, see Wilczek (2004).) The experimental techniques often make use of the highest-energy accelerated beams attainable. But due to the magic of quantum mechanics, it is possible to probe much higher energy scales through precise measurements of certain particle decays at lower energy; and since the early universe was hotter than our most energetic beams, and still has powerful cosmic accelerators and extreme conditions, astronomical observations are another crucial source of information on "high energy physics".
Historically, many discoveries in HEP have been in the category known to statisticians as "the interocular traumatic test; you know what the data mean when the conclusion hits you between the eyes" (Edwards et al., 1963, p. 217, citing J. Berkson). In other cases, evidence accumulated slowly, and it was considered essential to quantify evidence in a fashion that relates directly to the subject of this review.

A wide range of views on the JL paradox can be found in reviews with commentary by many distinguished statisticians, in particular those of Shafer (1982), Berger and Sellke (1987), Berger and Delampady (1987a), and Robert, Chopin, and Rousseau (2009). The review of Bayes factors by Kass and Raftery (1995) and the earlier book by economist Leamer (1978) also offer interesting insights. Some of these authors view statistical issues in their typical data analyses rather differently than do physicists in HEP; perhaps the greatest contrast is that physicists do often have non-negligible belief that their null hypotheses are valid to a precision much greater than our measurement capability. Regarding the search by ATLAS and CMS that led to the discovery of "a Higgs boson", statistician van Dyk (2014) has prepared an informative summary of the statistical procedures that were used.

In Sections 2–4, I review the paradox, discuss the concept of the point null hypothesis, and observe that the paradox arises if there are three different scales in θ having a hierarchy that is common in HEP. In Section 5, I address the notions common among statisticians that "all models are wrong", and that scientists tend to be biased against the null hypothesis, so that the paradox is irrelevant. I also describe the likelihood ratio commonly used in HEP as the test statistic.
In Section 6, I discuss the difficult issue of choosing the prior for θ, and in particular the scale τ of those values of θ for which there is non-negligible prior belief. Section 7 briefly describes the completely different approach to hypothesis testing advocated by Bernardo, which stands apart from the bulk of the Bayesian literature. In Section 8, I discuss how measured values and confidence intervals, for quantities such as production and decay rates, augment the quoted p-value, and how small but precisely measured effects can provide a window into very high energy physics. Section 9 discusses the choice of Type I error α (probability of rejecting H0 when it is true) when adopting the approach of NP hypothesis testing, with some comments on the "5σ myth" of HEP. Finally, in Section 10, I discuss the seemingly universal agreement that a single p-value is (at best) a woefully incomplete summary of the data, and how confidence intervals at various confidence levels help readers assess the experimental results. I summarize and conclude in Section 11.

As it is useful to use precisely defined terms, we must be aware that statisticians and physicists (and psychologists, etc.) have different naming conventions. For example, a physicist says "measured value", while a statistician says "point estimate" (and while a psychologist says "effect size in original units"). This paper uses primarily the language of statisticians, unless otherwise stated. Thus "estimation" does not mean "guessing", but rather the calculation of "point estimates" and "interval estimates". The latter refers to frequentist confidence intervals or their analogs in other paradigms, known to physicists as "uncertainties on the measured values". In this paper, "error" is generally used in the precisely defined sense of Type I and Type II errors of Neyman-Pearson theory (Section 9), unless obvious from context.
Other terms are defined in context below. Citations are provided for the benefit of readers who may not be aware that certain terms (such as "loss") have specific technical meanings in the statistics literature. "Effect size" is commonly used in the psychology literature, with at least two meanings. The first meaning, described by the field's publication manual (APA, 2010, p. 34) as "most often easily understood", is simply the measured value of a quantity in the original (often dimensionful) units. Alternatively, a "standardized" dimensionless effect size is obtained by dividing by a scale such as a standard deviation. In this paper, the term always refers to the former definition (original units), corresponding to the physicist's usual measured value of a parameter or physical quantity. Finally, the word "model" in statistics literature usually refers to a probabilistic equation that describes the assumed data-generating mechanisms (Poisson, binomial, etc.), often with adjustable parameters. The use of "model" for a "law of nature" is discussed below.

2 The original "paradox" of Lindley, as corrected by Bartlett

Lindley (1957), with a crucial correction by Bartlett (1957), lays out the paradox in a form that is useful as our starting point. This exposition also draws on Section 5.0 of Jeffreys (1961) and on Berger and Delampady (1987a). It mostly follows the notation of the latter, with the convention of upper case for the random variable and lower case for observed values. Figure 1 serves to illustrate various quantities defined below.

Suppose X having density f(x|θ) is sampled, where θ is an unknown element of the parameter space Θ. It is desired to test H0: θ = θ0 versus H1: θ ≠ θ0.
Following the Bayesian approach to hypothesis testing pioneered by Jeffreys (also referred to as Bayesian model selection), we assign prior probabilities π0 and π1 = 1 − π0 to the respective hypotheses. Conditional on H1 being true, one also has a continuous prior probability density g(θ) for the unknown parameter. As discussed in the following sections, formulating the problem in this manner leads to a conceptual issue, since in the continuous parameter space Θ, a single point θ0 (a set of measure zero) has non-zero probability associated with it. This is impossible with a usual probability density, for which the probability assigned to an interval tends to zero as the width of the interval tends to zero. Assignment of non-zero probability π0 to a single point θ0 is familiar to physicists via the Dirac δ-function (times π0) at θ0, while statisticians often refer to placing "probability mass" at θ0, or to using "counting measure" for θ0 (in distinction to "Lebesgue measure" for the usual density g for θ ≠ θ0). The null hypothesis corresponding to the single point θ0 is also commonly referred to as a "point null" hypothesis, or as a "sharp hypothesis".

Figure 1: Illustration of quantities used to define the JL paradox. The unknown parameter is θ, with likelihood function L(θ) resulting from a measurement with uncertainty σ_tot. The MLE is θ̂, which in the sketch is about 5σ_tot away from the null hypothesis, the "point null" θ0. The point null hypothesis has prior probability π0, which can be spread out over a small interval of width ε0 without materially affecting the paradox. The width of the prior pdf g(θ) under H1 has scale τ. The scales have the hierarchy ε0 ≪ σ_tot ≪ τ.
As discussed below, just as a δ-function can be viewed as a useful approximation to a highly peaked function, for hypotheses in HEP it is often the case that the point null hypothesis is a useful approximation to a prior that is sufficiently concentrated around θ0.

If the density f(x|θ) under H1 is normal with mean θ and known variance σ², then for a random sample {x1, x2, . . . , xn}, the sample mean is normal with variance σ²/n, i.e., X̄ has density N(θ, σ²/n). For conciseness (and eventually to make the point that "n" can be obscure), let

σ_tot ≡ σ/√n.  (1)

The likelihood is then

L(θ) = (1/(√(2π) σ_tot)) exp{−(x̄ − θ)²/2σ_tot²},  (2)

with maximum likelihood estimate (MLE) θ̂ = x̄. By Bayes's Theorem, the posterior probabilities of the hypotheses, given θ̂, are

P(H0|θ̂) = (1/A) π0 L(θ0) = (1/A) π0 (1/(√(2π) σ_tot)) exp{−(θ̂ − θ0)²/2σ_tot²}  (3)

and

P(H1|θ̂) = (1/A) π1 ∫ g(θ) L(θ) dθ = (1/A) π1 ∫ g(θ) (1/(√(2π) σ_tot)) exp{−(θ̂ − θ)²/2σ_tot²} dθ.  (4)

Here A is a normalization constant making the sum of the two probabilities equal to unity, and the integral is over the support of the prior g(θ). There will typically be a scale τ that indicates the range of values of θ over which g(θ) is relatively large.
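Eqs. (3)–(4) can be checked numerically. The sketch below assumes, purely for illustration, a normal prior g(θ) = N(θ0, τ²) (the text leaves g general) and averages the likelihood over it with a simple Riemann sum:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def posteriors(theta_hat, theta0, sigma_tot, tau, pi0=0.5):
    """P(H0|theta_hat) and P(H1|theta_hat) of Eqs. (3)-(4), with g = N(theta0, tau^2)."""
    pi1 = 1.0 - pi0
    like_h0 = normal_pdf(theta_hat, theta0, sigma_tot)          # L(theta0), Eq. (2)
    # Riemann sum for the integral of g(theta) * L(theta) in Eq. (4);
    # the grid must resolve the narrower of the two scales sigma_tot and tau.
    step = min(sigma_tot, tau) / 50.0
    lo = min(theta0 - 8.0 * tau, theta_hat - 8.0 * sigma_tot)
    hi = max(theta0 + 8.0 * tau, theta_hat + 8.0 * sigma_tot)
    n = int((hi - lo) / step)
    like_h1 = step * sum(
        normal_pdf(t, theta0, tau) * normal_pdf(theta_hat, t, sigma_tot)
        for t in (lo + (i + 0.5) * step for i in range(n)))
    a = pi0 * like_h0 + pi1 * like_h1                           # normalization A
    return pi0 * like_h0 / a, pi1 * like_h1 / a

# theta_hat lying 4 sigma_tot from theta0, with tau = 100 sigma_tot (illustrative):
p_h0, p_h1 = posteriors(theta_hat=4.0, theta0=0.0, sigma_tot=1.0, tau=100.0)
print(p_h0, p_h1)
```

With these illustrative scales, P(H0|θ̂) comes out near 0.03, far larger than the one-tailed p-value of about 3×10⁻⁵ for z = 4. For this normal prior the result can also be checked in closed form, since the marginal density of θ̂ under H1 is N(θ0, σ_tot² + τ²).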
One considers the case

σ_tot ≪ τ,  (5)

so that g(θ) varies slowly where the rest of the integrand is non-negligible, and therefore the integral approximately equals g(θ̂), so that

P(H1|θ̂) ≈ (1/A) π1 g(θ̂).  (6)

Then the ratio of posterior odds to prior odds for H0, i.e., the Bayes factor (BF), is independent of A and π0, and given by

BF ≡ [P(H0|θ̂)/P(H1|θ̂)] / [π0/π1] ≈ (1/(√(2π) σ_tot g(θ̂))) exp{−(θ̂ − θ0)²/2σ_tot²} = (1/(√(2π) σ_tot g(θ̂))) exp(−z²/2),  (7)

where

z = (θ̂ − θ0)/σ_tot = √n (θ̂ − θ0)/σ  (8)

is the usual statistic providing the departure from the null hypothesis in units of σ_tot. Some authors (e.g., Kass and Raftery (1995)) use the notation B01 for this Bayes factor, to make clear which hypotheses are used in the ratio; as this paper always uses the same ratio, the subscripts are suppressed. Then the p-value for the two-tailed test is p = 2(1 − Φ(z)), where Φ is the standard normal cumulative distribution function. (As discussed in Section 5.2, in HEP θ is often physically non-negative, and hence a one-tailed test is used, i.e., p = 1 − Φ(z).)

Jeffreys (1961, p. 248) notes that g(θ̂) is independent of n while σ_tot goes as 1/√n, and therefore a given cutoff value of BF does not correspond to a fixed value of z. This discrepancy in the sample-size scaling of z and p-values compared to that of Bayes factors (already noted for a constant g on p. 194 of his first edition of 1939) is at the core of the JL paradox, even if one does not take values of n so extreme as to make P(H0|θ̂) > P(H1|θ̂). Jeffreys (1961, Appendix B, p. 435) curiously downplays the discrepancy at the end of a sentence that summarizes his objections to testing based on p-values (almost verbatim with p.
360 of his 1939 edition): "In spite of the difference in principle between my tests and those based on [p-values], and the omission of the latter to give the increase in the critical values for large n, dictated essentially by the fact that in testing a small departure found from a large number of observations we are selecting a value out of a long range and should allow for selection, it appears that there is not much difference in the practical recommendations." He does say, "At large numbers of observations there is a difference", but he suggests that this will be rare and that the test might not be properly formulated: "internal correlation should be suspected and tested".

In contrast, Lindley (1957) emphasized how large the discrepancy could be, using the example where g(θ) is taken to be constant over an interval that contains both θ̂ and the range of θ in which the integrand is non-negligible. For any arbitrarily small p-value (arbitrarily large z) that is traditionally interpreted as evidence against the null hypothesis, there will always exist n for which the BF can be arbitrarily large in favor of the null hypothesis. Bartlett (1957) quickly noted that Lindley had neglected the length of the interval over which g(θ) is constant, which should appear in the numerator of the BF, and which makes the posterior probability of H0 "much more arbitrary". More generally, the normalization of g always has a scale τ that characterizes the extent in θ for which g is non-negligible, which implies that g(θ̂) ∝ 1/τ. Thus, there is a factor of τ in the numerator of BF. For example, Berger and Delampady (1987a) and others consider g(θ) having density N(θ0, τ²), which, in the limit of Eqn. 5, leads to

BF = (τ/σ_tot) exp(−z²/2).  (9)

There is the same proportionality in the Lindley/Bartlett example if the length of their interval is τ.
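Lindley's point can be made concrete with Eqn. 9: hold z (and hence the p-value) fixed while the sample size n grows. Since σ_tot = σ/√n, the BF grows like √n and eventually favors H0, however small the p-value. A sketch with purely illustrative scales:

```python
import math

z = 5.0                  # fixed "5 sigma" excess; one-tailed p ~ 2.9e-7 throughout
sigma, tau = 1.0, 1.0    # per-observation sigma and prior scale (illustrative)

for n in (10 ** 2, 10 ** 6, 10 ** 10, 10 ** 12, 10 ** 14):
    sigma_tot = sigma / math.sqrt(n)
    bf = (tau / sigma_tot) * math.exp(-z ** 2 / 2)   # Eq. (9), normal prior
    print(n, bf)

# BF crosses 1 (evidence *for* H0) once sqrt(n) > exp(z^2/2) * sigma / tau,
# i.e. for n above roughly 7e10 here, while the p-value never changes.
```

The loop is just Eqn. 9 evaluated at several sample sizes; nothing about the data changes except how many observations produced the same z.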
The crucial point is the generic scaling,

BF ∝ (τ/σ_tot) exp(−z²/2).  (10)

Of course, the value of the proportionality constant depends on the form of g and specifically on g(θ̂). Meanwhile, from Eqn. 2, the ratio λ of the likelihood of θ0 under H0 and the maximum likelihood under H1 is

λ = L(θ0)/L(θ̂)  (11)
  = exp{−(θ̂ − θ0)²/2σ_tot²} / exp{−(θ̂ − θ̂)²/2σ_tot²}  (12)
  = exp(−z²/2)  (13)
  ∝ (σ_tot/τ) BF.  (14)

Thus, unlike the case of simple-vs-simple hypotheses discussed below in Section 2.2, this maximum likelihood ratio takes the side of the p-value in disfavoring the null hypothesis for large z, independent of σ_tot/τ, and thus independent of sample size n. This difference between maximizing L(θ) under H1, and averaging it under H1 weighted by the prior g(θ), can be dramatic.

The factor σ_tot/τ (arising from the average of L weighted by g in Eqn. 4) is often called the "Ockham factor" that provides a desirable "Ockham's razor" effect (Jaynes, 2003, Chapter 20) by penalizing H1 for imprecise specification of θ. But the fact that (even asymptotically) BF depends directly on the scale τ of the prior g(θ) (and more precisely on g(θ̂)) can come as a surprise to those deeply steeped in Bayesian point and interval estimation, where typically the dependence on all priors diminishes asymptotically. The surprise is perhaps enhanced since the BF is often introduced as the factor by which prior odds (even if subjective) are modified in light of the observed data, giving the initial impression that the subjective part is factorized out of the BF.

The likelihood ratio λ = exp(−z²/2) takes on the numerical values 0.61, 0.14, 0.011, 0.00034, and 3.7×10⁻⁶, as z is equal to 1, 2, 3, 4, and 5, respectively.
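These numbers, and the reversal condition implied by Eqn. 14 in the normal-prior case of Eqn. 9, are quick to verify:

```python
import math

# lambda = exp(-z^2/2), Eq. (13), for z = 1..5:
for z in range(1, 6):
    print(z, math.exp(-z ** 2 / 2))
# Matches the values quoted in the text (0.61, 0.14, 0.011, 0.00034, 3.7e-06).
# In the normal-prior case of Eq. (9), BF >= 1 requires the Ockham factor
# sigma_tot/tau to be no larger than lambda, i.e. tau/sigma_tot >= exp(z^2/2):
# about 1.6, 7.4, 90, 3.0e3, and 2.7e5 for z = 1..5 respectively.
```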
Thus, in order for the Ockham factor to reverse the preferences of the hypotheses in the BF compared to the maximum likelihood ratio λ, the Ockham factor must be smaller than these numbers in the respective cases. Some examples of σ_tot and τ in HEP that can do this (at least up to z = 4) are in Section 5.1. As discussed below, even when not in the extreme case where the Ockham factor reverses the preference of the hypotheses, its effect deserves scrutiny.

From the derivation, the origin of the Ockham factor (and hence the sample-size dependence) does not depend on the chosen value of π0, and thus not on the commonly suggested choice of π0 = 1/2. The scaling in Eqn. 10 follows from assigning any non-zero probability to the single point θ = θ0, as described above using the Dirac δ-function, or "probability mass".

The situation clearly invited further studies, and various authors, beginning with Edwards et al. (1963), have explored the impact of changing g(θ), making numerical comparisons of p-values to Bayes factors in contexts such as testing a point null hypothesis for a binomial parameter. Generally they have given examples in which the p-value is always numerically smaller than the BF, even when the prior for θ "gives the utmost generosity to the alternative hypothesis".

2.1 Is there really a "paradox"?

A trivial "resolution" of the JL paradox is to point out that there is no reason to expect the numerical results of frequentist and Bayesian hypothesis testing to agree, as they calculate different quantities. Still, it is unnerving to many that "hypothesis tests" that are both communicating scientific results for the same data can have such a large discrepancy. So is it a paradox?

I prefer to use the word "paradox" with the meaning I recall from school, "a statement that is seemingly contradictory or opposed to common sense and yet is perhaps true" (Webster, 1969, definition 2a).
This is the meaning of the word, for example, in the celebrated "paradoxes" of Special Relativity, such as the Twin Paradox and the Pole-in-Barn Paradox. The "resolution" of a paradox is then a careful explanation of why it is not a contradiction. I therefore do not use the word paradox as a synonym for contradiction; that takes a word with (I think) a very useful meaning and wastes it on a redundant meaning of another word. It can however be confusing that what is deemed paradoxical depends on the personal perspective of what is "seemingly" contradictory. If someone says, "What Lindley called a paradox is not a paradox", then typically they either define paradox as a synonym for contradiction, or it was always so obvious to them that the paradox is not a contradiction that they think it is not paradoxical. (It could also be that there is a contradiction that cannot be resolved, but I have not seen that used as an argument for why it is not a paradox.) Although it may still be questionable as to whether there is a resolution satisfactory to everyone, for now I think that the word paradox is quite apt. As the deep issue is the scaling of the BF with sample size (for fixed p-value), as pointed out by Jeffreys already in 1939, I follow some others in calling it the Jeffreys–Lindley (JL) paradox.

Other ambiguities in discussions regarding the JL paradox include whether the focus is on the posterior odds of H0 (which include the prior odds) or on the BF (which does not). In addition, while one often introduces the paradox by noting the extreme cases where the p-value and the BF seem to imply opposite inferences, one should also emphasize the less dramatic (but still disturbing) cases where the Ockham factor plays a large (and potentially arbitrary) role, even if the BF favors H1. In the latter cases, it can be claimed that the p-value overstates the evidence against H0.
In this paper I focus on the BF, following some others, e.g., Edwards et al. (1963, who somewhat confusingly denote it by L, p. 218) and Bernardo (1999, p. 102). I also take a rather inclusive view of the paradox, as the issue of differences in sample-size scaling is always present, even if not taken to the extreme limit where the Ockham factor overwhelms the BF, and even reverses arbitrarily small prior probability for H0.

2.2 The JL paradox is not about testing simple H0 vs simple H1

Testing simple H0: θ = θ0 vs simple H1: θ = θ1 provides another interesting contrast between Bayesian and frequentist hypothesis testing, but this is not an example of the JL paradox. The Bayes factor and the likelihood ratio are the same (in the absence of nuisance parameters), and therefore in agreement as to which hypothesis the data favor. This is in contrast to the high-n limit of the JL paradox. In the situation of the JL paradox, there is a value of θ under H1 that is equal to the MLE θ̂, and which consequently has a likelihood no lower than that of θ0. The extent to which θ̂ is not favored by the prior is encoded in the Ockham factor of Eqn. 14, which means that the BF and the likelihood ratio λ can disagree on both the magnitude and even the direction of the evidence.

Simple-vs-simple hypothesis tests are far less common in HEP than simple-vs-composite tests, but have arisen as the CERN experiments have been attempting to infer properties of the new boson, such as the quantum numbers that characterize its spin and parity. Again supposing X having density f(x|θ) is sampled, now one can form two well-defined p-values, namely p0 indicating departures from H0 in the direction of H1, and p1 indicating departures from H1 in the direction of H0. A physicist will examine both p-values in making an inference. Thompson (2007, p.
108) argues that the set of the two p-values is "the evidence", and many in HEP may agree. Certainly neglecting one of the p-values can be dangerous. For example, if θ0 < θ̂ < θ1, and σ_tot ≪ θ1 − θ0, then it is conceivable that H0 is rejected at 5σ, while if H1 were the null hypothesis, it would be rejected at 7σ. A physicist would be well aware of this circumstance and hardly fall into the straw-man trap of implicitly accepting H1 by focusing only on p0 and "rejecting" (only) H0. The natural reaction would be to question both hypotheses; i.e., the two-simple-hypothesis model would be questioned. (In this context, Senn (2001, pp. 200-201) has further criticism and references regarding the issue of sample-size dependence of p-values.)

3 Do point null hypotheses make sense in principle, or in practice?

In the Bayesian literature, there are notably differing attitudes expressed regarding the relevance of a point null hypothesis θ = θ0. Starting with Jeffreys, the fact that Bayesian hypothesis testing can treat a point null hypothesis in a special way is considered by many proponents to be an advantage. (As discussed in Section 9, frequentist testing of a point null vs a composite alternative is tied to interval estimation, a completely different approach.) The hypothesis test is often phrased in the language of model selection: the "smaller" model H0 is nested in the "larger" model H1. From this point of view, it seems natural to have one's prior probabilities π0 and π1 for the two models. However, as mentioned above, from the point of view of putting a prior on the entire space Θ in the larger model, this corresponds to a non-regular prior that has counting measure (δ-function to physicists) on θ0 and Lebesgue measure (usual probability density to physicists) on θ ≠ θ0.
As discussed by Casella and Berger (1987a), some of the more disturbing aspects of the JL paradox are ameliorated (or even "reconciled") if there is no point null, and the test is the so-called "one-sided test", namely H0: θ ≤ θ0 vs H1: θ > θ0. Given the importance of the issue of probability assigned to the point null, some of the opinions expressed in the statistics literature are highlighted below, to contrast with the attitude in HEP described in Section 5.

Lindley (2009) lauds the "triumph" of Jeffreys's "general method of significance tests, putting a concentration of prior probability on the null—no ignorance here—and evaluating the posterior probability using what we now call Bayes factors." As a strong advocate of the use of subjective priors that represent personal belief, Lindley views the probability mass on the point null as subjective. (In the same comment, Lindley criticizes Jeffreys's "error" of integrating over the sample space of unobserved data in formulating his eponymous priors for use in point and interval estimation.)

At the other end of the spectrum of Bayesian theorists, Bernardo (2009) comments on Robert et al. (2009): "Jeffreys intends to obtain a posterior probability for a precise null hypothesis, and, to do this, he is forced to use a mixed prior which puts a lump of probability p = Pr(H0) on the null, say H0 ≡ θ = θ0, and distributes the rest with a proper prior p(θ) (he mostly chooses p = 1/2). This has a very upsetting consequence, usually known as Lindley's paradox: for any fixed prior probability p independent of the sample size n, the procedure will wrongly accept H0 whenever the likelihood is concentrated around a true parameter value which lies O(n^(−1/2)) from H0.
I find it difficult to accept a procedure which is known to produce the wrong answer under specific, but not controllable, circumstances." When pressed by commenters, Bernardo (2011b) says that "I am sure that there are situations where the scientist is willing to use a prior distribution highly concentrated at a particular region and explore the consequences of this assumption. . . What I claim is that, even in precise hypothesis testing situations, the scientist is often interested in an analysis which does not assume this type of sharp prior knowledge. . . ." Bernardo goes on to advocate a different approach (Section 7), which "has the nontrivial merit of being able to use for both estimation and hypothesis testing problems a single, unified theory for the derivation of objective 'reference' priors."

Some statisticians find point null hypotheses irrelevant to their own work. In the context of an unenthusiastic comment on the Bayesian information criterion (BIC), Gelman and Rubin (1995) say, "More generally, realistic prior distributions in social science do not have a mass of probability at zero. . . ." Raftery (1995b) disagrees, saying that "social scientists are prepared to act as if they had prior distributions with point masses at zero. . . social scientists often entertain the possibility that an effect is small". In the commentary of Bernardo (2011b), C. Robert and J. Rousseau say, "Down with point masses! The requirement that one uses a point mass as a prior when testing for point null hypotheses is always an embarrassment and often a cause of misunderstanding in our classrooms. Rephrasing the decision to pick the simpler model as the result of a larger advantage is thus much more likely to convince our students.
What matters in pointwise hypothesis testing is not whether or not θ = θ₀ holds but what the consequences of a wrong decision are.”

Some comments on the point null hypothesis are related to another claim, that all models and all point nulls are at best approximations that are wrong at some level. I discuss this point in more detail in Section 5, but include a few quotes here. Edwards et al (1963) say, “. . . in typical applications, one of the hypotheses—the null hypothesis—is known by all concerned to be false from the outset,” citing others including Berkson (1938). Vardeman (1987) claims, “Competent scientists do not believe their own models or theories, but rather treat them as convenient fictions. A small (or even 0) prior probability that the current theory is true is not just a device to make posterior probabilities as small as p-values, it is the way good scientists think!” Casella and Berger (1987b) object specifically to Jeffreys’s use of π₀ = π₁ = 1/2, used in modern papers as well: “Most researchers would not put 50% prior probability on H₀. The purpose of an experiment is often to disprove H₀ and researchers are not performing experiments that they believe, a priori, will fail half the time!” Kadane (1987) expresses a similar sentiment: “For the last 15 years or so I have been looking seriously for special cases in which I might have some serious belief in a null hypothesis. I have found only one [testing astrologer]. . . I do not expect to test a precise hypothesis as a serious statistical calculation.” As discussed below, such statisticians have evidently not been socializing with many HEP physicists.
In fact, in the literature I consulted, I encountered very few statisticians who granted, as did Zellner (2009), that physical laws such as E = mc² are point hypotheses, and “Many other examples of sharp or precise hypotheses can be given and it is incorrect to exclude such hypotheses a priori or term them ‘unrealistic’. . . .”

Condensed matter physicist and Nobel Laureate Philip Anderson (1992) argued for Jeffreys-style hypothesis testing with respect to a claim for evidence for a fifth force of nature: “Let us take the ‘fifth force’. If we assume from the outset that there is a fifth force, and we need only measure its magnitude, we are assigning the bin with zero range and zero magnitude an infinitesimal probability to begin with. Actually, we should be assigning this bin, which is the null hypothesis we want to test, some finite a priori probability—like 1/2—and sharing out the remaining 1/2 among all the other strengths and ranges.”

Already in Edwards et al (1963, p. 235) there was a key point related to the situation in HEP: “Bayesians. . . must remember that the null hypothesis is a hazily defined small region rather than a point.” They also emphasized the subjective nature of singling out a point null hypothesis: “At least for Bayesian statisticians, however, no procedure for testing a sharp null hypothesis is likely to be appropriate unless the null hypothesis deserves special initial credence.”

That the “point” null can really be a “hazily defined small region” is clear from the derivation in Section 2. The general scaling conclusion of Eqn. 10 remains valid if “hazily defined small region” means that the region of θ included in H₀ has a scale ε₀ such that ε₀ ≪ σ_tot. To a physicist, this just means that computing integrals using a δ-function is a good approximation to integrating over a finite region in θ.
(Some authors, such as Berger and Delampady (1987a), have explored quantitatively the approximation induced in the BF by non-zero ε₀.)

4 Three scales for θ yield a paradox

From the preceding sections, we can conclude that for the JL paradox to arise, it is sufficient that there exist three scales in the parameter space Θ, namely:

1. ε₀, the scale under H₀;
2. σ_tot, the scale for the total measurement uncertainty; and
3. τ, the scale under H₁;

and that they have the hierarchy

ε₀ ≪ σ_tot ≪ τ. (15)

This situation is common in frontier experiments in HEP, where, as discussed in Section 5.1, the three scales are often largely independent. We even have cases where ε₀ = 0, i.e., most of the subjective prior probability is on θ = 0. This is the case if θ is the mass of the photon.

As noted for example by Shafer (1982), the source of the precision of σ_tot does not matter as long as the condition in Eqn. 15 is satisfied. The statistics literature tends to focus on the case where σ_tot arises from a sample size n via Eqn. 1. This invites the question as to whether n can really be arbitrarily large in order to make σ_tot arbitrarily small. In my view the existence of a regime where the BF goes as τ/σ_tot for fixed z (as in Eqn. 10) is the fundamental characteristic that can lead to the JL paradox, even if this regime does not extend to σ_tot → 0. As I discuss in Section 5.1, such regimes are present in HEP analyses, and there is not always a well-defined n underlying σ_tot, a point I return to in Sections 5.2 and 6 below in discussing τ. But we first consider the model itself.

5 HEP and belief in the null hypothesis

At the heart of the measurement models in HEP are well-established equations that are commonly known as “laws of nature”.
By some historical quirks, the current “laws” of elementary particle physics, which have survived several decades of intense scrutiny with only a few well-specified modifications, are collectively called a “model”, namely the Standard Model (SM). In this review, I refer to the equations of such “laws”, or alternatives considered as potential replacements for them, as “core physics models”. The currently accepted core physics models have parameters, such as masses of the quarks and leptons, which with few exceptions have all been measured reasonably precisely (even if requiring care to define).

Multiple complications arise in going from the core physics model to the full measurement model that describes the probability densities for observations such as the momentum spectra of particles emerging from proton-proton collisions. Theoretical calculations based on the core physics model can be quite complex, requiring, for example, approximations due to truncation of power series, incomplete understanding of the internal structure of colliding protons, and insufficient understanding of the manner in which quarks emerging from the collision recombine into sprays of particles (“jets”) that can be detected. The results of such calculations, with their attendant uncertainties, must then be propagated through simulations of the response of detectors that are parametrized using many calibration constants, adjustments for inefficient detection, misidentification of particles, etc. Much of the work in data analysis in HEP involves subsidiary analyses to measure and calibrate detector responses, to check the validity of theoretical predictions to describe data (especially where no departures are expected), and to confirm the accuracy of many aspects of the simulations.
The aphorism “all models are wrong” (Box, 1976) can certainly apply to the detector simulation, where common assumptions of normal or log-normal parametrizations are, at best, only good approximations. But the pure core physics models still exist as testable hypotheses that may be regarded as point null hypotheses. Alternatives to the SM are more generalized models in which the SM is nested. It is certainly worth trying to understand if some physical parameter in the alternative core physics model is zero (corresponding to the SM), even if it is necessary to do so through the smoke of imperfect detector descriptions with many uninteresting and imperfectly known nuisance parameters. Indeed much of what distinguishes the capabilities of experimenters is how well they can do precisely that by determining the detector response through careful calibration and cross-checks. This distinction is overlooked in the contention (Berger and Delampady, 1987a, p. 320) that a point null hypothesis in a core physics model cannot be precisely tested if the rest of the measurement model is not specified perfectly.

There is a deeper point to be made about core physics models concerning the difference between a model being a good “approximation” in the ordinary sense of the word, and the concept of a mathematical limit. The equations of Newtonian physics have been superseded by those of special and general relativity, but the earlier equations are not just approximations that did a good job in predicting (most) planetary orbits; they are the correct mathematical limits in a precise sense. The kinematic expressions for momentum, kinetic energy, etc., are the limits of the special relativity equations in the limit as the speed goes to zero.
That is, if you specify a maximum tolerance for error due to the approximation of Newtonian mechanics, then there exists a speed below which it will always be correct within that tolerance. Similarly, Newton’s universal law of gravity is the correct mathematical limit of General Relativity in the limit of small gravitational fields and low speeds (conditions that were famously not satisfied to observational precision for the orbit of the planet Mercury). This limiting behavior can often be viewed through an appropriate power series. For example, we can expand the expression for kinetic energy T from special relativity, T = √(p² + m²) − m, in powers of p²/m² in the non-relativistic limit where momentum p is much smaller than the mass m. The Newtonian expression, T = p²/2m, is the first term in the series, followed by the lowest-order relativistic correction term of −p⁴/8m³. (I use the usual HEP units in which the speed of light c is 1 and dimensionless; to use other units, substitute pc for p, and mc² for m.)

An analogous, deeper concept arises in the context of effective field theories. An effective field theory in a sense consists of the correct first term(s) in a power series of inverse powers of some scale that is much higher than the applicable scale of the effective theory (Georgi, 1993). When a theory is expressed as an infinite series, a key issue is whether there is a finite number of coefficients to be determined experimentally, from which all other coefficients can be (at least in principle) calculated, with no unphysical answers (in particular infinity) appearing for measurable quantities. Theories having this property are called renormalizable, and are naturally greatly favored over theories that give infinities for measurable quantities or that require in effect an infinite number of adjustable parameters.
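This limiting behavior is easy to check numerically. The snippet below is a minimal sketch (the unit choices are illustrative, not from the paper) comparing the exact relativistic kinetic energy with its Newtonian first term and with the series including the lowest-order relativistic correction, in units with c = 1:

```python
import math

def kinetic_energy_exact(p, m):
    """Relativistic kinetic energy T = sqrt(p^2 + m^2) - m (units with c = 1)."""
    return math.sqrt(p * p + m * m) - m

def kinetic_energy_series(p, m):
    """Newtonian term p^2/2m plus the lowest-order relativistic correction -p^4/8m^3."""
    return p * p / (2 * m) - p ** 4 / (8 * m ** 3)

# For p much smaller than m, each added term improves the agreement.
m = 1.0     # mass in arbitrary units
p = 0.01    # momentum << m, i.e. deep in the non-relativistic regime
exact = kinetic_energy_exact(p, m)
newtonian = p * p / (2 * m)
series = kinetic_energy_series(p, m)
print(abs(exact - newtonian), abs(exact - series))
```

The residual after the correction term is smaller than the residual of the Newtonian term alone by roughly another factor of p²/m², as the power-series picture predicts.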
It was a major milestone in HEP theory when it was shown that the SM (including its Higgs boson) is in a class of renormalizable theories (’t Hooft, 1999); removing the Higgs boson destroys this property.

In the last three or four decades, thousands of measurements have tested the consistency of the predictions of the SM, many with remarkable precision, including of course measurements at the LHC. Nonetheless, the SM is widely believed to be incomplete, as it leaves unanswered some obvious questions (such as why there are three generations of quarks and leptons, and why their masses have the values they do). If the goal of a unified theory of forces is to succeed, the current mathematical formulation will become embedded into a larger mathematical structure, such that more forces and quanta will have to be added. Indeed much of the current theoretical and experimental research program is aimed at uncovering these extensions, while a significant effort is also spent on understanding further the consequences of the known relationships. Nevertheless, whatever new physics is added, we also expect that the SM will remain a correct mathematical limit, or a correct effective field theory, within a more inclusive theory. It is in this sense of being the correct limit or correct effective field theory that physicists believe that the SM is “true”, both in its parts and in the collective whole. (I am aware that there are deep philosophical questions about reality, and that this point of view can be considered “naive”, but this is a point of view that is common among high energy physicists.)

It may be that on deeper inspection the distinction between an ordinary “approximation” and a mathematical limit will not be so great, as even crude approximations might be considered as types of limits. Also, the usefulness of power series breaks down in certain important “non-perturbative” regimes.
Nonetheless, the concepts of renormalizability, limits, and effective field theories are helpful in clarifying what is meant by belief in core physics models. Comparing the approach of many physicists to that of statisticians working in other fields, an important distinction appears to be the absence of core “laws” in their models. Under such circumstances, one would naturally be averse to obsessing over exact values of model parameters when the uncertainty in the model itself is already dominant.

5.1 Examples of three scales for θ in HEP experiments

Many searches at the frontier of HEP have three scales with the hierarchy in Eqn. 15. An example is an experiment in the 1980s that searched for a particular decay of a particle called the long-lived neutral kaon, the K⁰_L. This decay, to a muon and an electron, had been previously credibly ruled out for a branching fraction (probability per kaon decay) of 10⁻⁸ or higher. With newer technology and better beams, the proposal was to search down to a level of 10⁻¹². This decay was forbidden at this level in the SM, but there was a possibility that the decay occurred at the 10⁻¹⁷ level (Barroso et al, 1984) or lower via a process where neutrinos change type within an expanded version of the SM; since this latter process was out of reach, it was included in the “point null” hypothesis. This search was therefore a “fishing expedition” for “physics beyond the Standard Model” (BSM physics), in this case a new force of nature, with σ_tot ≈ 10⁻¹² and ε₀ ≈ 10⁻¹⁷. Both the scale τ of prior belief and g(θ) would be hard to define, as the motivation for performing the experiment was the capability to explore the unknown with the potential for a major discovery of a new force. For me personally, π₁ was small (say 1%), and the scale τ was probably close to that of the range being explored, 10⁻⁸.
(The first incarnation of the experiment reached σ_tot ≈ 10⁻¹¹, without evidence for a new force (Arisaka et al, 1993).) As discussed in Section 8.2, searches for such rare decays are typically interpreted in terms of the influence of possible new particles with very high masses, higher than can be directly produced.

As another example, perhaps the most extreme, it is of great interest to determine whether or not protons decay, i.e., whether or not the decay rate is exactly zero, as so far seems to be the case experimentally. Experiments have already probed values of the average decay rate per proton of 1 decay per 10³¹ to 10³³ years. This is part of the range of values predicted by certain unified field theories that extend the SM (Wilczek, 2004). As the age of the universe is of order 10¹⁰ years, these are indeed very small rates. Thanks to the exponential nature of such decays in quantum mechanics, the search for such tiny decay rates is possible by observing nearly 10³⁴ protons (many kilotons of water) for several years, rather than by observing several protons for 10³⁴ years! Assigning the three scales is rather arbitrary, but I would say that σ_tot ≈ 10⁻³² and τ initially was perhaps 10⁻²⁸. Historically the null hypothesis under the SM was considered to be a point exactly at zero decay rate, until 1976, when ’t Hooft (1976) pointed out an exotic non-perturbative mechanism for proton decay. But his formula for the SM rate has a factor of about exp(−137π) ≈ 10⁻¹⁸⁷ that makes it negligible even compared to the BSM rates being explored experimentally. (See Babu et al (2013) for a recent review.)

Finally, among the multitude of current searches for BSM physics at the LHC to which Eqn. 15 applies, I mention the example of the search for production of a heavy version of the Z⁰ boson (Section 8), a so-called Z′ (pronounced “Z-prime”).
The Z′ would be the quantum of a new force that appears generically in many speculative BSM models, but without any reliable prediction as to whether the mass or production rate is accessible at the LHC. For these searches, ε₀ = 0 in the SM; σ_tot is determined by the LHC beam energies, intensities, and the general-purpose detector’s measuring capabilities; the scale τ is again rather arbitrary (as are π₀ and g), but much larger than σ_tot.

In all three of these examples, the conditions of Eqn. 15 are met. Furthermore, the three scales are largely independent. There can be a loose connection in that an experiment may be designed with a particular subjective value of τ in mind, which then influences how resources are allocated, if feasible, to obtain a value of σ_tot that may settle a particular scientific issue. But this kind of connection can be tenuous in HEP, especially when an existing general-purpose apparatus such as CMS or ATLAS is applied to a new measurement. Therefore there is no generally applicable rule of thumb relating τ to σ_tot.

Even if some sense of the scale τ can be specified, there still exists the arbitrariness in choosing the form of g. Many experimenters in HEP think in terms of “orders of magnitude”, with an implicit metric that is uniform in the log of the decay rate. For example, some might say that “the experiment is worth doing if it extends the reach by a factor of 10”, or that “it is worth taking data for another year if the number of interactions observed is doubled”. But it is not at all clear that such phrasing really corresponds to a belief that is uniform in the implicit logarithmic metric.
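The role of the hierarchy in Eqn. 15 can be sketched numerically. Assuming, purely for illustration, the simple normal measurement model N(θ, σ²_tot) used throughout this paper, with H₀: θ = 0 and a normal prior of width τ for θ under H₁ (the form of g is, as stressed above, a subjective choice), the marginal density of the data under H₁ is N(0, σ²_tot + τ²), and the Bayes factor in favor of the null follows in closed form:

```python
import math

def bayes_factor_null(z, ratio):
    """BF_01 for x ~ N(theta, sigma^2), testing H0: theta = 0 vs
    H1: theta ~ N(0, tau^2) (an illustrative choice of g).

    z is the observed number of standard deviations x/sigma; ratio = tau/sigma.
    Marginally under H1, x ~ N(0, sigma^2 + tau^2), so BF_01 is the ratio
    of the two normal densities evaluated at the observed x.
    """
    s2 = 1.0 + ratio ** 2  # (sigma^2 + tau^2)/sigma^2
    return math.sqrt(s2) * math.exp(-0.5 * z * z + 0.5 * z * z / s2)

# Fixed z = 5 ("5 sigma"): the p-value is fixed and tiny, yet BF_01 grows
# roughly linearly with tau/sigma_tot, eventually exceeding 1.
for ratio in (1e2, 1e4, 1e6):
    print(ratio, bayes_factor_null(5.0, ratio))
```

At fixed z, BF₀₁ scales as τ/σ_tot, so for a sufficiently extreme hierarchy the posterior can favor the point null even though the p-value screams “discovery”: the JL paradox in miniature.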
5.2 Test statistics for computing p-values in HEP

There is a long tradition in HEP of using likelihood ratios for both hypothesis testing and estimation, following established frequentist theory (Stuart et al, 1999, Chapter 22) such as the Neyman–Pearson (NP) Lemma and Wilks’s Theorem. This is sometimes described in the jargon of HEP (James, 1980), and other times with more extensive sourcing (Eadie et al, 1971; Baker and Cousins, 1984; James, 2006; Cowan et al, 2011). When merited, quite detailed likelihood functions (both binned and unbinned) are constructed. In many cases, θ is a physically non-negative quantity (such as a mass or a Poisson mean) that vanishes under the null hypothesis (θ₀ = 0), and the alternative is H₁: θ > 0. The likelihood-ratio test statistic, denoted by λ, and its distribution under the null hypothesis (see below) are used in a one-tailed test to obtain a p-value, which is then converted to z, the equivalent number of standard deviations (σ) in a one-tailed test of the mean of a normal distribution,

z = Φ⁻¹(1 − p) = √2 erf⁻¹(1 − 2p). (16)

For example, z = 3 corresponds to a p-value of 1.35 × 10⁻³, and z = 5 to a p-value of 2.9 × 10⁻⁷. (For upper confidence limits on θ, p-values are commonly modified to mitigate some issues caused by downward fluctuations, but this does not affect the procedure for testing H₀.)

Nuisance parameters arising from detector calibration, estimates of background rates, etc., are abundant in these analyses. A large part of the analysis effort is devoted to understanding and validating the (often complicated) descriptions of the response of the experimental apparatus that is included in λ. For nuisance parameters, the uncertainties are typically listed as “systematic” in nature, the name that elementary statistics books use for uncertainties that are not reduced with more sampling.
Nevertheless, some systematic uncertainties can be reduced as more data are taken and used in the subsidiary analyses for calibrations. A typical example is the calibration of the response of the detector to a high-energy photon (γ), crucial for detecting the decay of the Higgs boson to two photons. The basic detector response (an optical flash converted to an analog electrical pulse that is digitized) must be converted to units of energy. The resulting energy “measurement” suffers from a smearing due to resolution as well as errors in offset and scale. Special calibration data and computer simulations are used to measure both the width and shape of the smearing function, as well as to determine offsets and scales that still have residual uncertainty. In terms of the simple N(θ, σ²_tot) model discussed throughout this paper, we have complications: the response function may not be normal but can be measured; the bias on θ may not be zero but can be measured; and σ_tot is also measured. All of the calibrations may change with temperature, position in the detector, radiation damage, etc. Many resources are put into tracking the time-evolution of calibration parameters, and therefore minimizing, but of course never eliminating, the uncertainties. Such calibration takes place for all the subdetectors used in a HEP experiment, for all the basic types of detected particles (electrons, muons, pions, etc.). Ultimately, with enough data, certain systematic uncertainties approach constant values that limit the usefulness of adding more data. (Examples of limiting systematics would include finite resolution on the time dependence of detector response; control of the lasers used for calibration; magnetic field inhomogeneities not perfectly mapped; imperfect material description in the detector simulation; and various theoretical uncertainties.)
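For reference, the conversion between p and z in Eqn. 16 can be reproduced with Python’s standard library, whose NormalDist provides Φ and Φ⁻¹ directly:

```python
from statistics import NormalDist

def p_value_to_z(p):
    """Convert a one-tailed p-value to the equivalent number of standard
    deviations, as in Eqn. 16: z = Phi^{-1}(1 - p)."""
    return NormalDist().inv_cdf(1.0 - p)

def z_to_p_value(z):
    """Inverse conversion: p = 1 - Phi(z)."""
    return 1.0 - NormalDist().cdf(z)

print(z_to_p_value(3.0))  # about 1.35e-3
print(z_to_p_value(5.0))  # about 2.9e-7
```

The printed values reproduce the z = 3 and z = 5 benchmarks quoted after Eqn. 16.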
Once models for the nuisance parameters are selected, various approaches can be used to “eliminate” them from the likelihood ratio λ (Cousins, 2005). “Profiling” the nuisance parameters (i.e., re-optimizing the MLEs of the nuisance parameters for each trial value of the parameter of interest) has been part of the basic HEP software tools (though not called profiling) for decades (James, 1980). The results on the Higgs boson at the LHC have been based on profiling, partly because asymptotic formulas for profile likelihoods were generalized (Cowan et al, 2011) and found to be useful. It is also common to integrate out (marginalize) nuisance parameters in λ in a Bayesian fashion (typically using evidence-based priors), usually through Monte Carlo integration (while treating the parameter of interest in a frequentist manner).

In many analyses, the result is fairly robust to the treatment of nuisance parameters in the definition of λ. For the separate step of obtaining the distribution of λ under the null hypothesis, asymptotic theory (Cowan et al, 2011) can be applicable, but when feasible the experimenters also perform Monte Carlo simulations of pseudo-experiments. These simulations treat the nuisance parameters in some frequentist and Bayesian-inspired ways, and are typically (though not always) rather insensitive to the choice of method.

To the extent that integrations are performed over the nuisance parameters, or that profiling yields similar results, the use of λ as a test statistic for a frequentist p-value is reminiscent of Bayesian-frequentist hybrids in the statistics literature (Good, 1992, Section 1), including the prior-predictive p-value of Box (1980). Within HEP, this mix of paradigms has been advocated (Cousins and Highland, 1992) as a pragmatic approach, and found in general to yield reasonable results under a variety of circumstances.
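To illustrate what “profiling” means, here is a minimal toy sketch (my own illustrative model, not any experiment’s actual likelihood): a main measurement x of θ + b with unit variance, plus a subsidiary measurement y constraining the background-like nuisance parameter b. The nuisance parameter is re-fit at the trial value θ = 0, and the resulting −2 ln λ shows the familiar inflation of the effective variance:

```python
import math

def q0_profiled(x, y, sigma_b):
    """Profile-likelihood test statistic for H0: theta = 0 in a toy model:
    main measurement x ~ N(theta + b, 1), subsidiary y ~ N(b, sigma_b^2).
    The global fit (theta_hat = x - y, b_hat = y) matches both observations
    exactly, so -2 ln(lambda) is just the chi-square at theta = 0 with b
    re-optimized ("profiled")."""
    # Conditional MLE of b at theta = 0 (minimizes the chi-square below).
    b_hat_hat = (sigma_b ** 2 * x + y) / (sigma_b ** 2 + 1.0)
    return (x - b_hat_hat) ** 2 + (y - b_hat_hat) ** 2 / sigma_b ** 2

# Profiling reproduces the closed form q0 = (x - y)^2 / (1 + sigma_b^2),
# i.e. z = (x - y)/sqrt(1 + sigma_b^2): the nuisance inflates the variance.
x, y, sigma_b = 6.0, 1.0, 0.5
print(math.sqrt(q0_profiled(x, y, sigma_b)),
      (x - y) / math.sqrt(1.0 + sigma_b ** 2))
```

In this Gaussian toy, marginalizing b with a flat prior would give the same inflated variance, consistent with the observation above that results are often robust to the choice of treatment.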
The complexity of such analyses is worth keeping in mind in Section 6, when the concept of the “unit measurement” with σ = √n σ_tot is introduced as a basis for some “objective” methods of setting the scale τ. The overall σ_tot is a synthesis of many samplings of events of interest as well as events in the numerous calibration data sets (some disjoint from the final analysis, some not). It is unclear what could be identified as the number of events n, since the analysis does not fit neatly into the concept of n identical samplings.

5.3 Are HEP experimenters biased against their null hypotheses?

Practitioners in disciplines outside of HEP are sometimes accused of being biased against accepting null hypotheses, to the point that experiments are set up with the sole purpose of rejecting the null hypothesis (Bayarri, 1987). Strong bias against publishing null results (i.e., results that do not reject the null hypothesis) has been described, for example, in psychology (Ferguson and Heene, 2012). Researchers might feel the need to reject the null hypothesis in order to publish their results, etc. It is unclear to what extent these characterizations might be valid in different fields, but in HEP there is often significant prior belief in both the model and the point null hypothesis (within ε₀). In many searches in HEP, there is a hope to reject the SM and make a major discovery of BSM physics in which the SM is nested. But there is nonetheless high (or certainly non-negligible) prior belief in the null hypothesis. There have been hundreds of experimental searches for BSM physics that have not rejected the SM. In HEP, it is normal to publish results that advance exploration of the frontiers even if they do not reject the null hypothesis. The literature, including the most prestigious journals, has many papers beginning with “Search for. . .
” that report no significant evidence for the sought-for BSM physics. Often these publications provide useful constraints on theoretical speculation, and offer guidance for future searches.

For physical quantities θ that cannot have negative values, the unbiased estimates will be in the unphysical negative region about half of the time if the true value of θ is small compared to σ_tot. It might appear that the measurement model is wrong if half the results are unphysical. But the explanation in retrospect is that the null hypotheses in HEP have tended to be true, or almost so. As no BSM physics has been observed thus far at the LHC, the choices of experiments might be questioned, but they are largely constrained by resources and by what nature has to offer for discovery. Huge detector systems such as CMS and ATLAS are multipurpose experiments that may not have the desired sensitivity to some specific processes of interest. Within constraints of available resources, and loosely prioritized as to speculation about where the BSM physics may be observed, the collaborations try to look wherever there is some capability for observing new phenomena.

5.4 Cases of an artificial null that carries little or no belief

As noted above, the “core physics models” used in our searches typically include the SM as well as larger models in which the SM is embedded. In a typical search for BSM physics, the SM is the null hypothesis and carries a non-negligible belief. However, there does exist a class of searches for which physicists place little prior belief on the null hypothesis, namely when the null hypothesis is the SM with a missing piece! This occurs when experimenters are looking for the “first observation” of a phenomenon that is predicted by the SM to have non-zero strength θ = θ₁, but which is yet to be confirmed in data.
The null hypothesis is then typically defined to be the simple hypothesis θ = θ₀ = 0, i.e., everything in the SM except the as-yet-unconfirmed phenomenon. While the alternative hypothesis could be taken to be the simple hypothesis θ = θ₁, it is more common to take the alternative to be θ > 0. Results are then reported in two pieces: (i) a simple-vs-composite hypothesis test that reports the p-value for the null hypothesis, and (ii) confidence interval(s) for θ at one or more confidence levels, which can then be compared to θ₁. This gives more flexibility in interpretation, including rejection of θ₀ = 0, but with a surprising value of θ̂ that points to an alternative other than the SM value θ₁. Furthermore, as in all searches, collaborations typically present plots showing the distribution of z values obtained from Monte Carlo simulation of pseudo-experiments under each of the hypotheses. From these plots one can read off the “expected z” (usually defined as the median) for each hypothesis, and also get a sense for how likely is a statistical fluctuation to the observed z.

An example from Fermilab is the search for production of single top quarks via the weak force in proton-antiproton collisions (Abazov et al, 2009; Aaltonen et al, 2009; Fermilab, 2009). This search was performed after the weak force was clearly characterized, and after top quarks were observed via their production in top-antitop quark pairs by the strong force. The search for single top-quark production was experimentally challenging, and the yields could have differed from expectations of the SM due to the possibility of BSM physics. But there was not much credence in the null hypothesis that production of single top quarks did not exist at all. Eventually that null was rejected at more than 5σ.
The interest remains on measured values and particularly confidence intervals for the production rates (via more than one mechanism), which thus far are consistent with SM expectations.

Another example is the search for a specific decay mode of the B_s particle that contains a bottom quark (b) and anti-strange quark (s). The SM predicts that a few out of 10⁹ B_s decays yield two muons (heavy versions of electrons) as decay products. This measurement has significant potential for discovering BSM physics that might enhance (or even reduce) the SM probability for this decay. The search used the null hypothesis that the B_s decay to two muons had zero probability, a null that was recently rejected at the 5σ level. As with single top-quark production, the true physics interest was in the measured confidence interval(s), as there was negligible prior belief in the artificial null hypothesis of exactly zero probability for this decay mode. Of course, a prerequisite for measuring the small decay probability was high confidence in the presence of this process in the analyzed data. Thus the clear observation (rejection of the null) at high significance by each of two experiments was one of the highlights of results from the LHC in 2013 (Chatrchyan et al, 2013a; Aaij et al, 2013; CERN, 2013).

As the Higgs boson is an integral part of the SM (required for the renormalizability of the SM), the operational null hypothesis used in searching for it was similarly taken to be an artificial model that included all of the SM except the Higgs boson, and which had no BSM physics to replace the Higgs boson with a “Higgs-like” boson. However, the attitude toward the hypotheses was not as simple as in the two previous examples.
The null hypothesis of having "no Higgs boson" carried some prior belief, in the sense that it was perhaps plausible that BSM physics might mean that no SM Higgs boson (or Higgs-like boson) was observable in the manner in which we were searching. Furthermore, the search for the Higgs boson had such a long history, and had become so well known in the press, that there would have been a notable cost to a false discovery claim. In my opinion, this was an important part of the justification for the high threshold that the experimenters used for declaring an observation. (Section 9 discusses factors affecting the threshold.)

Analogous to the two previous examples, the alternative hypothesis was implemented as the complete SM with a composite θ for the strength of the Higgs boson signal. (This generalized alternative allowed for a "Higgs-like" boson that perhaps could not be easily distinguished with data in hand.) However, the mass of the Higgs boson is a free parameter in the SM, and had been only partially constrained by previous measurements and theoretical arguments. Compared to the two previous examples, this complicated the search significantly, as the probabilities of different decay modes of the Higgs boson change dramatically as a function of its mass.

This null hypothesis of no Higgs (or Higgs-like) boson was definitively rejected upon the announcement of the observation of a new boson by both ATLAS and CMS on July 4, 2012. The confidence intervals for signal strength θ in various decay subclasses, though not yet precise, were in reasonable agreement with the predictions for the SM Higgs boson. Subsequently, much of the focus shifted to measurements describing different production and decay mechanisms.
For measurements of continuous parameters, the null hypothesis has reverted to the complete SM with its Higgs boson, and the tests (e.g., Chatrchyan et al (2014, Figure 22) and Aad et al (2013, Figures 10-13)) use the frequentist duality (Section 9 below) between interval estimation and hypothesis testing. One constructs (approximate) confidence intervals and regions for parameters controlling various distributions, and checks whether the predicted values for the SM Higgs boson are within the confidence regions. For an important simple-vs-simple hypothesis test of the quantum mechanical property called parity, p-values for both hypotheses were reported (Chatrchyan et al, 2013b), as described in Section 2.2.

6 What sets the scale τ?

As discussed by Jeffreys (1961, p. 251) and re-emphasized by Bartlett (1957), defining the scale τ (the range of values of θ over which the prior g(θ) is relatively large) is a significant issue. Fundamentally, the scale appears to be personal and subjective, as is the more detailed specification of g(θ). Berger and Delampady (1987a,b) state that "the precise null testing situation is a prime example in which objective procedures do not exist," and "Testing a precise hypothesis is a situation in which there is clearly no objective Bayesian analysis and, by implication, no sensible objective analysis whatsoever." Nonetheless, as discussed in this section, Berger and others have attempted to formulate principles for specifying default values of τ for communicating scientific results.

Bartlett (1957) suggests that τ might scale as 1/√n, canceling the sample-size scaling in σ_tot and making the Bayes factor independent of n. Cox (2006, p. 106) suggests this as well, on the grounds that ". . .
in most, if not all, specific applications in which a test of such a hypothesis [θ = θ0] is thought worth doing, the only serious possibilities needing consideration are that either the null hypothesis is (very nearly) true or that some alternative within a range fairly close to θ0 is true." This avoids the situation that he finds unrealistic, in which "the corresponding answer depends explicitly on n because, typically unrealistically, large portions of prior probability are in regions remote from the null hypothesis relative to the information in the data." Part of Cox's argument was already given by Jeffreys (1961, p. 251): ". . . the mere fact that it has been suggested that [θ] is zero corresponds to some presumption that [θ] is small." Leamer (1978, p. 114) makes a similar point: ". . . a prior that allocates positive probability to subspaces of the parameter space but is otherwise diffuse represents a peculiar and unlikely blend of knowledge and ignorance". (As Section 5.1 discusses, this "peculiar and unlikely blend" is common in HEP.) Andrews (1994) also explores the consequences of τ shrinking with sample size, but these ideas seem not to have led to a standard. As another possible reconciliation, Robert (1993) considers π1 that increases with τ, but this seems not to have been pursued further.

Many attempts in the Bayesian literature to specify a default τ arrive at a suggestion that does not depend on n, and hence does not remove the dependence of the Ockham factor on n. In the search for any non-subjective n-independent scale, the only option seemingly at hand is σ_tot when n = 1, i.e., the original σ (Eqn. 1) that expresses the uncertainty of a single measurement. This was in fact suggested by Jeffreys (1961, p. 268), on the grounds that there is nothing else in the problem that can set the scale, and was followed, for example, in generalizations by Zellner and Siow (1980).
Kass and Wasserman (1995) do the same, which "has the interpretation of 'the amount of information in the prior on [θ] is equal to the amount of information about [θ] contained in one observation' ". They refer to this as a "unit information prior", citing Smith and Spiegelhalter (1980) as also using this "appealing interpretation of the prior." It is not clear to me why this "unit information" approach is "appealing", or how it could lead to useful, universally cross-calibrated Bayes factors in HEP. As discussed in Section 5.2, the detector may also have some intrinsic σ_tot for which no preferred n is evident.

Raftery (1995a, pp. 132, 135) points out the same problem. After defining a prior for which, "roughly speaking, the prior distribution contains the same amount of information as would, on average, one observation", he notes the obvious problem in practice: the "important ambiguity. . . the definition of [n], the sample size." He gives several examples for which he has a recommendation. Berger and Pericchi (2001, with commentary) review more general possibilities based on use of the information in a small subset of the data, and for one method claim that "this is the first general approach to the construction of conventional priors in nested models." Berger (2008, 2011) applied one of these so-called "intrinsic priors" to a pedagogical example and its generalization from HEP. Unfortunately, I am not aware of anyone in HEP who has pursued these suggestions. Meanwhile, recently Bayarri, Berger, Forte, and García-Donato (2012) have reconsidered the issue and formulated principles resulting ". . . in a new model selection objective prior with a number of compelling properties." I think that it is fair to conclude that this is still an active area of research.
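The role of τ can be made concrete in the Gaussian setting, where, if the prior under H1 is taken to be N(0, τ²) (purely for algebraic convenience, instead of Jeffreys's Cauchy form), the Bayes factor B01 has a closed form. The sketch below uses hypothetical numbers with σ = 1, so σ_tot = 1/√n as in Eqn. 1:

```python
import math

def bayes_factor_01(z, sigma_tot, tau):
    """B01 for H0: theta = 0 vs H1: theta ~ N(0, tau^2), given a measurement
    xbar ~ N(theta, sigma_tot^2) observed at z = xbar / sigma_tot."""
    s2, t2 = sigma_tot ** 2, tau ** 2
    xbar2 = (z * sigma_tot) ** 2
    return math.sqrt((t2 + s2) / s2) * \
        math.exp(-0.5 * xbar2 * (1.0 / s2 - 1.0 / (t2 + s2)))

# Fixed z = 5 and fixed tau = sigma = 1: B01 grows roughly like sqrt(n),
# so the same "5 sigma" result favors H0 more and more as n grows (the JL paradox).
for n in (100, 10_000, 1_000_000):
    print(n, bayes_factor_01(5.0, 1.0 / math.sqrt(n), tau=1.0))

# Bartlett's tau proportional to 1/sqrt(n) cancels the n-dependence entirely:
for n in (100, 10_000, 1_000_000):
    print(n, bayes_factor_01(5.0, 1.0 / math.sqrt(n), tau=1.0 / math.sqrt(n)))
```

The second loop prints the same value for every n, which is exactly the cancellation Bartlett and Cox describe; whether such a shrinking τ is defensible is the subject of this section.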
6.1 Comments on non-subjective priors for estimation and model selection

For point and interval estimation, Jeffreys (1961) suggests two approaches for obtaining a prior for a physically non-negative quantity such as the magnitude of the charge q of the electron. Both involve invariance concepts. The first approach (pp. 120-123) considers only the parameter being measured. In his example, one person might consider the charge q to be the fundamental quantity, while another might consider q^2 (or some other power q^m) to be the fundamental quantity. In spite of this arbitrariness of the power m, everyone will arrive at consistent posterior densities if they each take the prior for q^m to be 1/q^m, since all expressions d(q^m)/q^m differ only by a proportionality constant. (Equivalently, they can all take the prior as uniform in ln q^m, i.e., in ln q.)

Jeffreys's more famous second approach, leading to his eponymous rule and priors, is based on the likelihood function and some averages over the sample space (i.e., over possible observations). The likelihood function is based on what statisticians call the measurement "model". This means that "Jeffreys's prior" is derived not by considering only the parameter being measured, but rather by examining the measuring apparatus. For example, Jeffreys's prior for a Gaussian (normal) measurement apparatus is uniform in the measured value. If the measuring apparatus has Gaussian response in q, the prior is uniform in q. If the measuring apparatus has Gaussian response in q^2, then the prior is uniform in q^2. If the physical parameter is measured with Gaussian resolution and is physically non-negative, as for the charge magnitude q, then the functional form of the prior remains the same (uniform) and is set to zero in the unphysical region (Berger, 1985, p. 89).
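The first invariance argument admits a quick numerical check (a sketch with made-up numbers: a Gaussian likelihood with measurement x = 2 and resolution 0.5, and the range restricted to q > 0.5 so that the improper 1/q prior causes no trouble near the origin):

```python
import math

def likelihood(q, x=2.0, sigma=0.5):
    """Gaussian likelihood for a measurement x of the positive quantity q."""
    return math.exp(-0.5 * ((x - q) / sigma) ** 2)

def prob_below(density, lo, hi, cut, n=200_000):
    """P(variable < cut) for an unnormalized density on [lo, hi] (trapezoid rule)."""
    h = (hi - lo) / n
    num = den = 0.0
    for i in range(n + 1):
        v = lo + i * h
        f = (0.5 if i in (0, n) else 1.0) * density(v)
        den += f
        if v < cut:
            num += f
    return num / den

# Analyst A treats q as fundamental, with prior 1/q; analyst B treats u = q^2
# as fundamental, with prior 1/u.  Their posterior probabilities for the
# equivalent events q < 2 and u < 4 agree, as Jeffreys's argument promises.
p_q = prob_below(lambda q: likelihood(q) / q, 0.5, 10.0, cut=2.0)
p_u = prob_below(lambda u: likelihood(math.sqrt(u)) / u, 0.25, 100.0, cut=4.0)
print(p_q, p_u)
```

The agreement follows from the change of variables u = q², under which the density L(√u)/u pulls back to 2L(q)/q, proportional to analyst A's posterior.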
Berger and Bernardo refer to "non-subjective" priors such as Jeffreys's prior as "objective" priors. This strikes me as rather like referring to "non-cubical" volumes as "spherical" volumes; one is changing the usual meaning of the word. Bernardo (2011b) defends the use of "objective" as follows: "No statistical analysis is really objective, since both the experimental design and the model assumed have very strong subjective inputs. However, frequentist procedures are often branded as 'objective' just because their conclusions are only conditional on the model assumed and the data obtained. Bayesian methods where the prior function is directly derived from the assumed model are objective in this limited, but precise sense."

Whether or not this defense is accepted, so-called "objective" priors can be deemed useful for point and interval estimation, even to frequentists, as there is a deep (frequentist) reason for their potential appeal. Because the priors are derived by using knowledge of the properties of the measuring apparatus, it is at least conceivable that Bayesian credible intervals based on them might have better-than-typical frequentist coverage properties when interpreted as approximate frequentist confidence intervals. As Welch and Peers (1963) showed, for Jeffreys's priors this is indeed the case for one-parameter problems. Under suitable regularity conditions, the approximate coverage of the resulting Bayesian credible intervals is uniquely good to order 1/n, compared to the slower convergence for other priors, which is good to order 1/√n. Hence, except at very small n, by using "objective" priors, one can (at least approximately) obey the Likelihood Principle and obtain decent frequentist coverage, which for some is a preferred "compromise".
Reasonable coverage can also be the experience for Reference Priors with more than one parameter (Philippe and Robert, 1998, and references therein). This can happen even though objective priors are improper (i.e., not normalizable) for many prototype problems; the ill-defined normalization constant cancels out in the calculation of the posterior. (Equivalently, if a cutoff parameter is introduced to make the prior proper, the dependence on the cutoff vanishes as it increases without bound.)

For model selection, Jeffreys proposed a third approach to priors. As discussed in Sections 2 and 3, from the point of view of the larger model, the prior is irregular, as it is described by a probability mass (a Dirac δ-function) on the null value θ0 that has measure zero. The prior g(θ) on the rest of Θ must be normalizable (eliminating improper priors used for estimation) in order for the posterior probability to be well-defined. For Gaussian measurements, Jeffreys argued that g should be a Cauchy density ("Breit-Wigner" in HEP). Apart from the subtleties that led Jeffreys to choose the Cauchy form for g, there is the major issue of the scale τ of g, as discussed in Section 6. The typical assumption of "objective Bayesians" is that, basically by definition, an objective τ is one that is derived from the measuring apparatus. And then, assuming that σ²_tot reflects n measurements using an apparatus that provides a variance for each of σ², as in Eqn. 1, they invoke σ as the scale of the prior g. Lindley (e.g., in commenting on Bernardo (2011b)) argues in cases like this that objective Bayesians can get lost in the Greek letters and lose contact with the actual context.
I too find it puzzling that one can first argue that the Ockham factor is a crucial feature of Bayesian logic that is absent from frequentist reasoning, and then resort to choosing this factor based on the measurement apparatus, and on a concept of sample size n that can be difficult to identify. The textbook by Lee (2004, p. 130) appears to agree that this is without compelling foundation: "Although it seems reasonable that [τ] should be chosen proportional to [σ], there does not seem to be any convincing argument for choosing this to have any particular value. . . ." It seems that, in order to be useful, any "objective" choice of τ must provide demonstrable cross-calibration of experiments with different σ_tot when n is not well-defined. Another voice emphasizing the practical nature of the problem is that of Kass (2009), saying that Bayes factors for hypothesis testing "remain sensitive—to first order—to the choice of the prior on the parameter being tested." The results are "contaminated by a constant that does not go away asymptotically." He says that this approach is "essentially nonexistent" in neuroscience.

7 The reference analysis approach of Bernardo

Bernardo (1999) (with critical discussion) defines Bayesian hypothesis testing in terms very different from calculating the posterior probability of H0: θ = θ0.
He proposes to judge whether H0 is compatible (his italics) with the data: "Any Bayesian solution to the problem posed will obviously require a prior distribution p(θ) over Θ, and the result may well be very sensitive to the particular choice of such prior; note that, in principle, there is no reason to assume that the prior should necessarily be concentrated around a particular θ0; indeed, for a judgement on the compatibility of a particular parameter value with the observed data to be useful for scientific communication, this should only depend on the assumed model and the observed data, and this requires some form of non-subjective prior specification for θ which could be argued to be 'neutral'; a sharply concentrated prior around a particular θ0 would hardly qualify." He later continues, ". . . nested hypothesis testing problems are better described as specific decision problems about the choice of a useful model and that, when formulated within the framework of decision theory, they do have a natural, fully Bayesian, coherent solution."

Unlike Jeffreys, Bernardo advocates using the same non-subjective priors (even when improper) for hypothesis testing as for point and interval estimation. He defines a discrepancy measure d whose scaling properties can be complicated for small n, but which asymptotically can be much more akin to those of p-values than to those of Bayes factors. In fact, if the posterior becomes asymptotically normal, then d approaches (1 + z²)/2 (Bernardo, 2011a,b). A fixed cutoff for his d (which he regards as the objective approach), just as a fixed cutoff for z, is inconsistent in the statistical sense, namely it does not accept H0 with probability 1 when H0 is true and the sample size increases without bound.
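The asymptotic relation d ≈ (1 + z²)/2 can be checked by Monte Carlo in the simplest Gaussian-mean case, where under the flat reference prior the posterior is θ ~ N(x̄, σ²_tot) and the loss is the Kullback-Leibler discrepancy (θ − θ0)²/(2σ²_tot). This is only a sketch of the idea (working in units where σ_tot = 1 and θ0 = 0, so x̄ = z), not Bernardo's general construction:

```python
import random
import statistics

def intrinsic_stat_mc(z, n_draws=200_000, seed=7):
    """Monte Carlo estimate of the expected posterior KL loss theta^2 / 2,
    with posterior theta ~ N(z, 1) (sigma_tot = 1, theta0 = 0, xbar = z)."""
    rng = random.Random(seed)
    return statistics.fmean(rng.gauss(z, 1.0) ** 2 / 2.0
                            for _ in range(n_draws))

for z in (1.0, 3.0, 5.0):
    # Monte Carlo estimate vs the closed form (1 + z^2)/2
    print(z, intrinsic_stat_mc(z), (1.0 + z * z) / 2.0)
```

In this Gaussian case the agreement is exact in expectation, since E[(θ − z)² + z²]/2 = (1 + z²)/2; the point is that d grows with z like a (squared) significance, not like a Bayes factor.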
Bernardo and Rueda (2002) elaborate this approach further, emphasizing that the Bayes factor approach, when viewed from the framework of Bernardo's formulation in terms of decision theory, corresponds to a "zero-one" loss-difference function, which they refer to as "simplistic". (Loss functions are discussed by Berger (1985, Section 2.4). The zero-one loss is so named because the loss is zero if a correct decision is made, and 1 if an incorrect decision is made. Berger states that, in practice, this loss will rarely be a good approximation to the true loss.) Bernardo and Rueda prefer continuous loss functions (such as quadratic loss) that do not require the use of non-regular priors. A prior sharply spiked at θ0 "assumes important prior knowledge . . . very strong prior beliefs," and hence "Bayes factors should not be used to test the compatibility of the data with H0, for they inextricably combine what the data have to say with (typically subjective) strong beliefs about the value of θ." This contrasts with the commonly followed statement of Jeffreys (1961, p. 246) that (in present notation), "To say that we have no information initially as to whether the new parameter is needed or not we must take π0 = π1 = 1/2". Bernardo and Rueda reiterate Bernardo's above-mentioned recommendation of applying the discrepancy measure (expressed in "natural" units of information) according to an absolute scale that is independent of the specific problem.

Bernardo (2011b) provides a major review (with extensive commentary), referring unapprovingly to point null hypotheses in an "objective" framework, and to the use begun by Jeffreys of two "radically different" types of priors for estimation and for hypothesis testing.
He clarifies his view of hypothesis testing, that it is a decision whether "to act as if H0 were true", based on the expected posterior loss from using the simpler model rather than the alternative (full model) in which it is nested. In his rejoinder, Bernardo states that the JL paradox "clearly poses a very serious problem to Bayes factors, in that, under certain conditions, they may lead to misleading answers. Whether you call this a paradox or a disagreement, the fact that the Bayes factor for the null may be arbitrarily large for sufficiently large n, however relatively unlikely the data may be under H0 is, to say the least, deeply disturbing. . . the Bayes factor analysis may be completely misleading, in that it would suggest accepting the null, even if the likelihood ratio for the MLE against the null is very large."

At a recent PhyStat workshop where Bernardo (2011a) summarized this approach, physicist Demortier (2011) considered it appropriate when the point null hypothesis is a useful simplification (in the sense of definitions in decision theory) rather than a point having significant prior probability. He noted (as did Bernardo) that the formalism can account for point nulls if this is desired.

8 Effect size in HEP

As noted in the introduction, in this paper "effect size" refers to the point and interval estimates (measured values and uncertainties) of a parameter or physical quantity, typically expressed in the original units. Apparently, reporting of effect sizes is not always automatic in some disciplines, leading to repeated reminders to report them (Kirk, 1996; Wilkinson et al, 1999; Nakagawa and Cuthill, 2007; APA, 2010). In HEP, however, point estimates and confidence intervals for model parameters are used to summarize the results of nearly all experiments, and to compare to the predictions of theory (which often have uncertainties as well).
For experiments in which one particle interacts with another, the meeting point for comparison of theory and experiment is frequently an interaction probability referred to as a "cross section". For particles produced in interactions that subsequently decay (into other particles), the comparison of theory and experiment typically involves the decay rate (probability of decay per second) or its inverse, the mean lifetime. Measurements of cross sections and decay rates can be subdivided into distinguishable subprocesses, as functions of both continuous parameters (such as production angles) and discrete parameters (such as the probabilities known as "branching fractions" for decay into different sets of decay products).

In the example of the Higgs boson discovery, the effect size was quantified through confidence intervals on the product of cross sections and the branching fractions for different sets of decay products. These confidence intervals provided exciting indications that the new boson was indeed "Higgs-like", as described by Incandela and Gianotti and the subsequent ATLAS and CMS publications (Aad et al, 2012; Chatrchyan et al, 2012). By spring 2013, more data had been analyzed and it seemed clear to both collaborations that the boson was "a" Higgs boson (leaving open the possibility that there might be more than one). Some of the key figures are described in the information accompanying the announcement of the 2013 Nobel Prize in Physics (Swedish Academy, 2013, Figures 6 and 7).

8.1 No effect size is too small in core physics models

If one takes the point of view that "all models are wrong" (Box, 1976), then a tiny departure from the null hypothesis for a parameter in a normal model, which is conditional on the model being true, might be properly disregarded as uninteresting.
Even if the model is true, a small p-value might be associated with a departure from the null hypothesis (effect size) that is too small to have practical significance in formulating public policy or decision-making. In contrast, core physics models reflect presumed "laws of nature", and it is always of major interest if departures with any effect size can be established with high confidence.

In HEP, tests of core physics models also benefit from what we believe to be the world's most perfect random-sampling mechanism, namely quantum mechanics. In each of many repetitions of a given initial state, nature randomly picks out a final state according to the weights given by the (true, but not completely known) laws of physics and quantum mechanics. Furthermore, the most perfect incarnation of "identical" is achieved through the fundamental quantum-mechanical property that elementary particles of the same type are indistinguishable. The underlying statistical model is typically binomial or its generalizations and approximations, especially the Poisson distribution.

8.2 Small effect size can indicate new phenomena at higher energy

For every force there is a quantum field that permeates all space. As suggested in 1905 by Einstein for the electromagnetic (EM) field, associated with every quantum field is an "energy quantum" (called the photon for the EM field) that is absorbed or emitted ("exchanged") by other particles interacting via that field. While the mass of the photon is presumed to be exactly zero, the masses of quanta of some other fields are non-zero. The nominal mass m, energy E, and momentum p of such energy quanta are related through Einstein's equation, m² = E² − p². (For unstable particles, the definition of the nominal mass is somewhat technical, but there are agreed-on conventions.)
Interactions in modern physics are possible because energy quanta can be exchanged even when the energy ∆E and momentum ∆p being transferred in the interaction do not correspond to the nominal mass of the exchanged quantum. With a quantity q² (unrelated to the symbol for the charge q of the electron) defined by q² = (∆E)² − (∆p)², quantum mechanics reduces the probability of the reaction as q² departs from the true m² of the exchanged particle. In many processes, the reduction factor is at leading order proportional to

1/(m² − q²)².    (17)

(As q² can be positive or negative, the relative sign of q² and m² depends on details of the process. For positive q², the singularity at m² = q² is made finite by another term that can be otherwise neglected in the present discussion.) What q² is accessible depends on the available technology; in general, larger q² requires higher-energy particle beams and therefore more costly accelerators. For the photon, m = 0, and the interaction probability goes as 1/q⁴. On the other hand, if the mass m of the quantum of a force is so large that m² ≫ |q²|, then the probability for an interaction to occur due to the exchange of such a quantum is proportional to 1/m⁴. By looking for interactions or decays having very low probability, it is possible to probe the existence of massive quanta with m² well beyond those that can be created with concurrent technology.

An illustrative example, studied by Galison (1983), is the accumulation of evidence for the Z⁰ boson (with mass m_Z), an electrically neutral quantum of the weak force hypothesized in the 1960s. Experiments were performed in the late 1960s and early 1970s using intense beams of neutrinos scattering off targets of ordinary matter.
The available |q²| was much smaller than m²_Z, resulting in a small reaction probability in the presence of other processes that obscured the signal. CERN staked the initial claim for observation (Hasert et al, 1973). After a period of confusion, both CERN and Fermilab experimental teams agreed that they had observed interactions mediated by Z⁰ bosons, even though no Z⁰ bosons were detected directly, as the energies involved (and hence √|q²|) were well below m_Z.

In another type of experiment probing the Z⁰ boson, conducted at SLAC in the late 1970s (Prescott et al, 1978), specially prepared electrons ("spin-polarized electrons" in physics jargon) were scattered off nuclei to seek a very subtle left-right asymmetry in the scattered electrons arising from the combined action of electromagnetic and weak forces. In an exquisite experiment, an asymmetry of about 1 part in 10⁴ was measured to about 10% statistical precision, with an estimated systematic uncertainty also about 10%. The statistical model was binomial, and the experiment had the ability to measure departures from unity of twice the binomial parameter with an uncertainty of about 10⁻⁵; i.e., the sample size of scattered electrons was of order 10¹⁰. This precision in a binomial parameter is finer than that in an ESP example that has generated lively discussion in the statistics literature on the JL paradox (Bernardo, 2011b, pp. 19, 26, and cited references, and comments and rejoinder). More recent experiments measure this scattering asymmetry even more precisely. The results of Prescott et al. confirmed predictions of the model of electroweak interactions put forward by Glashow, Weinberg, and Salam, clearing the way for their Nobel Prize in 1979.
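The binomial arithmetic behind these numbers is simple: for an asymmetry A = 2p − 1 estimated from N scattered electrons, σ(A) = 2√(p(1−p)/N) ≈ 1/√N near p = 1/2, so an absolute uncertainty of 10⁻⁵ indeed requires N of order 10¹⁰ (a back-of-envelope sketch, not the experiment's actual statistical treatment):

```python
import math

def n_required(sigma_asym, p=0.5):
    """Sample size N needed so that the asymmetry A = 2p - 1 of a binomial
    proportion p has absolute uncertainty sigma(A) = 2*sqrt(p*(1-p)/N)."""
    return (2.0 * math.sqrt(p * (1.0 - p)) / sigma_asym) ** 2

print(n_required(1e-5))  # about 1e10 events, the Prescott et al. scale
```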
Finally, in 1982, the technology for creating interactions with q² = m²_Z was realized at CERN through collisions of high-energy protons and antiprotons (and subsequently at Fermilab). And in 1989, "Z⁰ factories" turned on at SLAC and CERN, colliding electrons and positrons at beam energies tuned to q² = m²_Z. At this q², the small denominator in Eqn. 17 causes the tiny deviation in the previous experiments to become a huge increase in the interaction probability, a factor of 1000 increase compared to the null hypothesis of "no Z⁰ boson". (There is an additional term in the denominator of Eqn. 17 that reflects the instability of the Z⁰ boson to decay and that I have neglected thus far; at q² = m²_Z, it keeps the expression finite.)

This sequence of events in the experimental pursuit of the Z⁰ boson is somewhat of a prototype for what many in HEP hope will happen again. A given process (scattering or decay) has rate zero (or immeasurably small, ≈ 0) according to the SM. If, however, there is a new boson X with mass m_X much higher than accessible with current technology, then the boson may give a non-zero rate, proportional to 1/m⁴_X, for the given process. The null hypothesis is that X does not exist and the rate for the process is immeasurably small. As m_X is not known, the possible rates for the process if X does exist comprise a continuum, including rates arbitrarily close to zero. But these tiny numbers in the continuum map onto possibilities for major, discrete modifications to the laws of nature—new forces! The searches for rare decays described in Section 5.1 are examples of this approach. For rare decays of K⁰_L particles, an observation of a branching fraction at the 10⁻¹¹ level would have indicated the presence of a new mass scale some 1000 times greater than the mass of the Z⁰ boson, which is more than a factor of 10 above currently accessible q² values at LHC.
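The suppression in Eqn. 17 can be illustrated with hypothetical numbers (units of GeV², a heavy quantum of mass 1000 GeV, and spacelike q² = −10 GeV²; the width term that regulates the resonance at q² = m² is omitted, as in the text):

```python
def propagator_factor(q2, m2):
    """Leading-order rate suppression 1/(m^2 - q^2)^2 from Eqn. 17."""
    return 1.0 / (m2 - q2) ** 2

q2 = -10.0          # spacelike momentum transfer, GeV^2 (hypothetical)
m2 = 1000.0 ** 2    # a (1000 GeV)^2 quantum, far above accessible q^2

# For m^2 >> |q^2| the factor is essentially 1/m^4:
print(propagator_factor(q2, m2) * m2 ** 2)  # close to 1

# A massless photon at the same q^2 gives 1/q^4, enormously larger:
print(propagator_factor(q2, 0.0) / propagator_factor(q2, m2))
```

The second ratio, about 10¹⁰ here, is the sense in which exchange of a very heavy quantum produces rates "arbitrarily close to zero" while still encoding a discrete new law of nature.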
Such mass scales are also probed by measuring the difference between the mass of the K⁰_L and that of a closely related particle, the short-lived neutral kaon (K⁰_S). The mass of the K⁰_L is about half the mass of the proton, and has been measured to a part in 10⁴. The K⁰_L − K⁰_S mass difference has been measured to a part in 10¹⁴, far more precisely than the mass itself. The difference arises from higher-order terms in the weak interaction, and is extremely sensitive to certain classes of speculative BSM physics. Even more impressively, the observation of proton decay with a decay rate at the level probed by current experiments would spectacularly indicate a new mass scale a factor of 10¹³ greater than the mass of the Z⁰ boson.

Alas, none of these experiments has observed processes that would indicate BSM physics. In the intervening years, there have however been major discoveries in neutrino physics that have redefined and extended the SM. These discoveries established that the mass of the neutrino, while tiny, is not zero. In some physics models called "seesaw models", the neutrino mass is inversely proportional to a mass scale of BSM physics; thus one interpretation is that the tiny neutrino masses indicate a new very large mass scale, perhaps approaching the scale probed by proton decay (Hirsch et al, 2013).

9 Neyman-Pearson testing and the choice of Type I error probability α

In Neyman-Pearson (NP) hypothesis testing, the Type I error α is the probability of rejecting H0 when it is true. For testing a point null vs a composite alternative, there is a duality between NP hypothesis testing and frequentist interval estimation via confidence intervals. The hypothesis test for H0: θ = θ0 vs H1: θ ≠ θ0, at significance level ("size") α, is entirely equivalent to asking whether θ0 is contained in a confidence interval for θ with confidence level (CL) of 1 − α.
As emphasized by Stuart, Ord, and Arnold (1999, p. 175), "Thus there is no need to derive optimal properties separately for tests and intervals: there is a one-to-one correspondence between the problems. . . ." Mayo and Spanos (2006) argue that confidence intervals have shortcomings that are avoided by using Mayo's concept of "severe testing". Spanos (2013) argues this specifically in the context of the JL paradox. I am not aware of widespread application of the severe testing approach, and do not yet understand it well enough to see how it would improve scientific communication in HEP if adopted. Hence the present paper focuses on the traditional frequentist methods.

As mentioned in Section 5.2, in HEP the workhorse test statistic for testing and estimation is often a likelihood ratio λ. In practice, sometimes one first performs the hypothesis test and uses the duality to "invert the test" to obtain confidence intervals, and sometimes one first finds intervals. Performing the test and inverting it in a rigorous manner is equivalent to the original "Neyman construction" of confidence intervals (Neyman, 1937). Such a construction using the likelihood-ratio test statistic has been advocated by Feldman and Cousins (1998), particularly in irregular problems such as when the null hypothesis is on the boundary. In more routine applications, approximate confidence intervals or regions can be obtained by finding maximum-likelihood estimates of unknown parameters and forming regions bounded by contours of differences in ln λ as in Wilks's Theorem (James, 1980, 2006). Confidence intervals in HEP are typically presented for conventional confidence levels (68%, 90%, 95%, etc.).
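The test-inversion duality is easy to exhibit in the Gaussian case (a toy sketch: a two-sided z-test with known σ_tot; the critical value is obtained by bisection rather than a special-function library):

```python
import math

def p_value(xbar, theta0, sigma_tot):
    """Two-sided Gaussian p-value for H0: theta = theta0."""
    z = abs(xbar - theta0) / sigma_tot
    return math.erfc(z / math.sqrt(2.0))

def z_crit(alpha):
    """z such that the two-sided tail probability erfc(z/sqrt(2)) equals alpha."""
    lo, hi = 0.0, 40.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if math.erfc(mid / math.sqrt(2.0)) > alpha:
            lo = mid  # tail probability too big: need larger z
        else:
            hi = mid
    return 0.5 * (lo + hi)

def in_confidence_interval(theta0, xbar, sigma_tot, alpha):
    """Is theta0 inside the 1 - alpha interval xbar +/- z_crit(alpha)*sigma_tot?"""
    return abs(xbar - theta0) <= z_crit(alpha) * sigma_tot

# Duality: theta0 is in the 1 - alpha interval exactly when p >= alpha.
for xbar in (0.5, 1.5, 2.5, 3.5):
    p = p_value(xbar, 0.0, 1.0)
    print(xbar, p, in_confidence_interval(0.0, xbar, 1.0, alpha=0.05), p >= 0.05)
```

The same inversion, carried out rigorously for a likelihood-ratio statistic instead of |x̄ − θ0|, is the Neyman construction referred to above.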
Alternatively, when experimenters report a p-value with respect to some null value, anyone can invoke the NP accept/reject paradigm by comparing the reported p-value to one's own (previously chosen) value of α. From a mathematical point of view, one can define the post-data p-value as the smallest significance level α at which the null hypothesis would be rejected, had that α been specified in advance (Rice, 2007, p. 335). This may offend some who point out that Fisher did not define the p-value this way when he introduced the term, but these protests do not negate the numerical identity with Fisher's p-value, even when the different interpretations are kept distinct.

Regardless of the steps through which one learns whether the test statistic λ is in the rejection region of a particular value of θ, one must choose the size α, the Type I error probability of rejecting H0 when it is true. Neyman and Pearson introduced the alternative hypothesis H1 and the Type II error β for the probability under H1 that H0 is not rejected when it is false. They remarked (Neyman and Pearson, 1933a, p. 296), "These two sources of error can rarely be eliminated completely; in some cases it will be more important to avoid the first, in others the second. . . . The use of these statistical tools in any given case, in determining just how the balance should be struck, must be left to the investigator."

Lehmann and Romano (2005, p. 57, and earlier editions by Lehmann) echo this point in terms of the power of the test, defined as 1 − β: "The choice of a level of significance α is usually somewhat arbitrary . . . the choice should also take into consideration the power that the test will achieve against the alternatives of interest. . . ." For simple-vs-simple hypothesis tests discussed in Section 2.2, the power 1 − β is well-defined, and, in fact, Neyman and Pearson (1933b, p.
497) discuss how to balance the two types of error, for example by considering their sum. It is well known today that such an approach, including minimizing a weighted sum, can remove some of the unpleasant aspects of testing with a fixed α, such as inconsistency in the statistical sense (as mentioned in Section 7, not accepting H0 with probability 1 when H0 is true and the sample size increases without bound). But this optimization of the tradeoff between α and β becomes ill-defined for a test of simple vs composite hypotheses when the composite hypothesis has values of θ arbitrarily close to θ0, since the limiting value of β is 0.5, independent of α (Neyman and Pearson, 1933b, p. 496). Robert (2013) echoes this concern that in NP testing, "there is a fundamental difficulty in finding a proper balance (or imbalance) between Type I and Type II errors, since such balance is not provided by the theory, which settles for the sub-optimal selection of a fixed Type I error. In addition, the whole notion of power, while central to this referential, has arguable foundations in that this is a function that inevitably depends on the unknown parameter θ. In particular, the power decreases to the Type I error at the boundary between the null and the alternative hypotheses in the parameter set." Unless a value of θ in the composite hypothesis is of sufficiently special interest to justify its use for considering power, there is no clear procedure. A Bayesian-inspired approach would allow optimization by weighting the values of θ under H1 by a prior g(θ). As Raftery (1995a, p. 142) notes, "Bayes factors can be viewed as a precise way of implementing the advice of [Neyman and Pearson (1933a)] that power and significance be balanced when setting the significance level. . .
there is a conflict between Bayes factors and significance testing at predetermined levels such as .05 or .01." In fact, Neyman and Pearson (1933b, p. 502) suggest this possibility if multiple θi under the alternative hypothesis are genuinely sampled from known probabilities Φi: ". . . if the Φi's were known, a test of greater resultant power could almost certainly be found."

Kendall and Stuart and successors (Stuart, Ord, and Arnold, 1999, Section 20.29) view the choice of α in terms of costs: ". . . unless we have supplemental information in the form of the costs (in money or other common terms) of the two types of error, and costs of observations, we cannot obtain an optimal combination of α, β, and n for any given problem." But prior belief should also play a role, as remarked by Lehmann and Romano (2005, p. 58) (and earlier editions by Lehmann): "Another consideration that may enter into the specification of a significance level is the attitude toward the hypothesis before the experiment is performed. If one firmly believes the hypothesis to be true, extremely convincing evidence will be required before one is willing to give up this belief, and the significance level will accordingly be set very low."

Of course, these vague statements about choosing α do not come close to a formal decision theory (which is, however, not visibly practiced in HEP). For the case of simple vs composite hypotheses relevant to the JL paradox, HEP physicists informally take into account prior belief, the measured value of θ and its confidence interval, as well as relative costs of errors, contrary to myths about automatic use of a "5σ" criterion discussed in the next section.

9.1 The mythology of 5σ

Nowadays it is commonly written that 5σ is the criterion for a discovery in HEP.
Such a fixed one-size-fits-all level of significance ignores the consideration noted above by Lehmann, and violates one of the most commonly stated tenets of science: that the more extraordinary the claim, the more extraordinary must be the evidence. I do not believe that experienced physicists have such an automatic response to a p-value, but it may be that some people in the field take the fixed threshold more seriously than is warranted.

The (quite sensible) historical roots of the 5σ criterion were in a specific context, namely searches performed in the 1960s for new "elementary particles", now known to be composite particles with different configurations of quarks in their substructure. A plethora of histograms were made, and presumed new particles, known as "resonances", showed up as localized excesses ("bumps") spanning several histogram bins. Upon finding an excess and defining those bins as the "signal region", the "local p-value" could be calculated as follows. First the nearby bins in the histogram ("sidebands") were used to formulate the null hypothesis corresponding to the expected number of events in the signal region in the absence of a new particle. Then the (Poisson) probability under the null hypothesis of observing a bump as large as or larger than that seen was calculated, and expressed in terms of standard deviations "σ" by analogy to a one-sided test of a normal distribution.

The problem was that the location of a new resonance was typically not known in advance, and the local p-value did not include the fact that "pure chance" had lots of opportunities (lots of histograms and many bins) to provide an unlikely occurrence. Over time many of the alleged new resonances were not confirmed in other independent experiments.
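The sideband-based local p-value calculation described above can be sketched as follows; the background expectation and observed count are invented for illustration.

```python
import math

def poisson_tail(n_obs, b):
    """P(N >= n_obs) for N ~ Poisson(b): 1 minus the CDF at n_obs - 1."""
    term, cdf = math.exp(-b), 0.0
    for k in range(n_obs):
        cdf += term
        term *= b / (k + 1)
    return 1.0 - cdf

def p_to_z(p, tol=1e-10):
    """One-sided z such that P(Z >= z) = p for standard normal Z, by bisection."""
    lo, hi = 0.0, 40.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if 0.5 * math.erfc(mid / math.sqrt(2)) > p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

b = 10.0      # events expected in the signal region, from sidebands (hypothetical)
n_obs = 25    # events observed in the signal region (hypothetical)
p_local = poisson_tail(n_obs, b)
print(p_local, p_to_z(p_local))  # local p-value and its one-sided "sigma" equivalent
```

This is exactly the "as large as or larger" Poisson tail probability, converted to σ units by analogy to a one-sided normal test.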
In the group led by Alvarez at Berkeley, histograms with putative new resonances were compared to simulations drawn from smooth distributions (Alvarez, 1968). Rosenfeld (1968, p. 465) describes such simulations and rough hand calculations of the number of trials, and concludes, "To the theorist or phenomenologist the moral is simple: wait for nearly 5σ effects. For the experimental group who have spent a year of their time and perhaps a million dollars, the problem is harder. . . go ahead and publish. . . but they should realize that any bump less than about 5σ calls only for a repeat of the experiment."

The original concept of "5σ" in HEP was therefore mainly motivated as a (fairly crude) way to account for a multiple trials factor (MTF, Section 9.2) in searches for phenomena poorly specified in advance. However, the threshold had at least one other likely motivation, namely that in retrospect spurious resonances often were attributed to mistakes in modeling the detector or other so-called "systematic effects" that were either unknown or not properly taken into account. The "5σ" threshold provides crude protection against such mistakes.

Unfortunately, many current HEP practitioners are unaware of the original motivation for "5σ", and some may apply this rule without much thought. For example, it is sometimes used as a threshold when an MTF correction (Section 9.2) has already been applied, or when there is no MTF from multiple bins or histograms because the measurement corresponds to a completely specified location in parameter space, aside from the value of θ in the composite hypothesis. In this case, there is still the question of how many measurements of other quantities to include in the number of trials (Lyons, 2010). Further thoughts on 5σ are given in a recent note by Lyons (2013).
9.2 Multiple trials factors for scanning nuisance parameters that are not eliminated

The situation with the MTF described in the previous section can arise whenever there is a nuisance parameter ψ that the analysts choose not to eliminate, but instead choose to communicate the results (p-value and confidence interval for θ) as a function of ψ. The search for the Higgs boson (Aad et al, 2012; Chatrchyan et al, 2012) is such an example, where ψ is the mass of the boson, while θ is the Poisson mean (relative to that expected for the SM Higgs boson) of any putative excess of events at mass ψ. For each mass ψ there is a p-value for the departure from H0, as if that mass had been fixed in advance, as well as a confidence interval for θ, given that ψ. This p-value is the "local" p-value, the probability for a deviation at least as extreme as that observed, at that particular mass. (Local p-values are correlated with those at nearby masses due to experimental resolution of the mass measurement.)

One can then scan all masses in a specified range and find the smallest local p-value, p_min. The probability of having a local p-value as small as or smaller than p_min, anywhere in a specified mass range, is greater than p_min, by a factor that is effectively an MTF (also known as the "Look Elsewhere Effect" in HEP). When feasible, the LHC experiments use Monte Carlo simulations to calculate the p-value that takes this MTF into account, and refer to that as a "global" p-value for the specified mass range. When this is too computationally demanding, they estimate the global p-value using the method advocated by Gross and Vitells (2010), which is based on that of Davies (1987).
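A toy Monte Carlo in the spirit of the global p-value calculation described above. The setup (50 independent Poisson bins with background mean 10, and the chosen local p-value threshold) is invented for illustration; real analyses must also handle the correlations between nearby masses mentioned above.

```python
import math
import random

def poisson_tail(n, b):
    """P(N >= n) for N ~ Poisson(b)."""
    term, cdf = math.exp(-b), 0.0
    for k in range(n):
        cdf += term
        term *= b / (k + 1)
    return 1.0 - cdf

def poisson_sample(b, rng):
    """Knuth's multiplication method; adequate for small mean b."""
    limit, k, prod = math.exp(-b), 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= limit:
            return k
        k += 1

rng = random.Random(1)
n_bins, b, n_toys = 50, 10.0, 2000
p_local = 0.001   # a hypothetical smallest local p-value observed in a scan

# Global p-value: fraction of background-only toys in which ANY bin
# fluctuates to a local p-value at least as small as p_local.
hits = 0
for _ in range(n_toys):
    counts = [poisson_sample(b, rng) for _ in range(n_bins)]
    p_min = min(poisson_tail(n, b) for n in counts)
    if p_min <= p_local:
        hits += 1
p_global = hits / n_toys
print(p_global)   # much larger than p_local: the price of scanning many bins
```

The ratio p_global/p_local is the effective multiple trials factor, here of order the number of independent bins scanned.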
To emphasize that the range of masses used for this effective MTF is arbitrary or subjective, and to indicate the sensitivity to the range, the LHC collaborations chose to give the global p-value for two ranges of mass (Aad et al (2012, pp. 11, 14) and Chatrchyan et al (2012, pp. 33, 41)). Some possibilities were the range of masses for which the SM Higgs boson was not previously ruled out at high confidence; the range of masses for which the experiment is capable of observing the SM Higgs boson; or the range of masses for which sufficient data had been acquired to search for any new boson. The collaborations made different choices.

10 Can results of hypothesis tests be cross-calibrated among different searches?

In communicating the results of an experiment, generally the goal is to describe the methods, data analysis, and results, as well as the authors' interpretations and conclusions, in a manner that enables readers to draw their own conclusions. Although at times authors provide a description of the likelihood function for their observations, it is common to assume that confidence intervals (often given for more than one confidence level) and p-values (frequently expressed as the equivalent z of Eqn. 16) are sufficient input into inferences or decisions to be made by readers. It can therefore be asked what is the result of an author (or reader) taking the p-value as the "observed data" for a full (subjective) Bayesian calculation of the posterior probability of H0. One could even attempt to go further and formulate a decision on whether to claim publicly that H0 is false, using a (subjective) loss function describing one's personal costs of falsely declaring a discovery, compared to not declaring a true discovery.

From Eqn. 10, clearly z alone is not sufficient to recover the Bayes factor and proceed as a Bayesian. This point is repeatedly emphasized in articles already cited.
(Even worse is to try to recover the BF using only the binary inputs as to whether the p-value was above some fixed thresholds (Dickey, 1977; Berger and Mortera, 1991; Johnstone and Lindley, 1995).) The oft-repeated argument (e.g., Raftery (1995a, p. 143)) is that there is no justification for the step in the derivation of the p-value where "probability density for data as extreme as that observed" is replaced with "probability for data as extreme, or more extreme". Jeffreys (1961, p. 385) still seems to be unsurpassed in his ironic way of saying this (italics in original): "What the use of [the p-value] implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred." Good (1992) opined that, "The real objection to [p-values] is not that they usually are utter nonsense, but rather that they can be highly misleading, especially if the value of [n] is not also taken into account and is large." He suggested a rule of thumb for taking n into account by standardizing the p-value to an effective sample size of n = 100, but this seems not to have attracted a following.

Meanwhile, often a confidence interval for θ (as invariably reported in HEP publications for 68% CL and at times for other values) does give a good sense of the magnitude of σ_tot (although this might be misleading in certain special cases). And one has a subjective prior and therefore its scale τ. Thus, at least crudely, the required inputs are in hand to recover the result from something like Eqn. 10. It is perhaps doubtful that most physicists would use them to arrive at the same Ockham factor as calculated through a BF from the original likelihood function. On the other hand, a BF based on an arbitrary ("objective") τ does not seem to be an obviously better way to communicate a result.
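The insufficiency of z alone can be made concrete with the textbook normal model (Eqn. 10 of the paper is not reproduced in this excerpt, so this is the standard setup of that form): data xbar ~ Normal(θ, σ_tot²), H0: θ = θ0, and under H1 a prior θ ~ Normal(θ0, τ²), so that the marginal of xbar under H1 is Normal(θ0, τ² + σ_tot²) and the Bayes factor B01 is a ratio of two normal densities.

```python
import math

def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def bayes_factor_B01(z, sigma_tot, tau, theta0=0.0):
    """B01 for xbar ~ N(theta, sigma_tot^2), H1 prior theta ~ N(theta0, tau^2)."""
    xbar = theta0 + z * sigma_tot
    return (normal_pdf(xbar, theta0, sigma_tot)
            / normal_pdf(xbar, theta0, math.hypot(tau, sigma_tot)))

# Fixed z = 5 ("5 sigma"), varying the prior scale tau relative to sigma_tot:
for ratio in (10.0, 100.0, 1000.0):
    print(ratio, bayes_factor_B01(5.0, 1.0, ratio))
# For tau well above sigma_tot, B01 grows roughly linearly with tau/sigma_tot
# (the Ockham factor), so the same z can leave H0 anywhere from strongly
# disfavored to comfortably alive depending on tau.
```

This is the Jeffreys-Lindley mechanism in miniature: without the scale τ (or equivalently σ_tot/τ), the reported z cannot be converted into a Bayes factor.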
While the "5σ" criterion in HEP gets a lot of press, I think that when a decision needs to be made, physicists intuitively and informally adjust their decision-making based on the p-value, the confidence interval, their prior belief in H0 and g(θ), and their personal sense of costs and risks.

11 Summary and Conclusions

More than a half century after Lindley drew attention to the different dependence of p-values and Bayes factors on sample size n (described two decades previously by Jeffreys), there is still no consensus on how best to communicate results of testing scientific hypotheses. The argument continues, especially within the broader Bayesian community, where there is much criticism of p-values, and praise for the "logical" approach of Bayes factors. A core issue for scientific communication is that the Ockham factor σ_tot/τ is either arbitrary or personal, even asymptotically for large n.

It has always been important in Bayesian point and interval estimation for the analyst to describe the sensitivity of results to choices of prior probability, especially for problems involving many parameters. In testing hypotheses, such sensitivity analysis is clearly mandatory.

The issue is not really the difference in numerical value of p-values and posterior probabilities (or Bayes factors), as one must commit the error of transposing the conditional probability (fallacy of probability inversion) to equate the two. Rather, the fundamental question is whether a summary of the experimental results, with say two or three numbers, can (even in principle) be interpreted in a manner cross-calibrated across different experiments. The difference in scaling with sample size (or more generally, the difference in scaling with σ_tot/τ) of the BF and likelihood ratio λ is already apparent in Eqn.
14; therefore the additional issue of tail probabilities of data not observed, pithily derided by Jeffreys (Section 10 above), cannot bear all the blame for the paradox. It is important to gain more experience in HEP with Bayes factors, and also with Bernardo's intriguing proposals. For statisticians, I hope that this discussion of the issues in HEP provides "existence proofs" of situations where we cannot ignore the JL paradox, and renews some attempts to improve methods of scientific communication.

Acknowledgments

I thank my colleagues in high energy physics, and in the CMS collaboration in particular, for many useful discussions. I am grateful to members of the CMS Statistics Committee for comments on an early draft of the manuscript, and in particular to Luc Demortier and Louis Lyons for continued discussions. Tom Ferbel provided invaluable detailed comments on two previous versions that I posted on the arXiv. The PhyStat series of workshops organized by Louis Lyons has led to many fruitful discussions and enlightening contact with prominent members of the statistics community. This material is based upon work partially supported by the U.S. Department of Energy under Award Number DE-SC0009937.

References

Aad G, et al (2012) Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC. Physics Letters B 716(1):1–29, DOI 10.1016/j.physletb.2012.08.020

Aad G, et al (2013) Measurements of Higgs boson production and couplings in diboson final states with the ATLAS detector at the LHC. Physics Letters B 726:88–119, DOI 10.1016/j.physletb.2013.08.010

Aaij R, et al (2013) Measurement of the B_s^0 → µ+µ− branching fraction and search for B^0 → µ+µ− decays at the LHCb experiment. Phys Rev Lett 111:101805, DOI 10.1103/PhysRevLett.111.101805

Aaltonen T, et al (2009) First observation of electroweak single top quark production.
Phys Rev Lett 103:092002, DOI 10.1103/PhysRevLett.103.092002

Abazov V, et al (2009) Observation of single top quark production. Phys Rev Lett 103:092001, DOI 10.1103/PhysRevLett.103.092001

Alvarez L (1968) Nobel lecture: Recent developments in particle physics. URL http://www.nobelprize.org/nobel_prizes/physics/laureates/1968/alvarez-lecture.html

Anderson PW (1992) The Reverend Thomas Bayes, needles in haystacks, and the fifth force. Physics Today 45(1):9–11, DOI 10.1063/1.2809482

Andrews DWK (1994) The large sample correspondence between classical hypothesis tests and Bayesian posterior odds tests. Econometrica 62(5):1207–1232, URL http://www.jstor.org/stable/2951513

APA (2010) Publication Manual of the American Psychological Association, 6th edn. American Psychological Association, Washington, DC

Arisaka K, et al (1993) Improved upper limit on the branching ratio B(K_L^0 → µ±e∓). Phys Rev Lett 70:1049–1052, DOI 10.1103/PhysRevLett.70.1049

Babu K, et al (2013) Baryon number violation, arXiv:1311.5285 [hep-ph]

Baker S, Cousins RD (1984) Clarification of the use of chi-square and likelihood functions in fits to histograms. Nucl Instrum Meth 221:437–442, DOI 10.1016/0167-5087(84)90016-4

Barroso A, Branco G, Bento M (1984) K_L^0 → µ̄e: Can it be observed? Physics Letters B 134(1-2):123–127, DOI 10.1016/0370-2693(84)90999-7

Bartlett MS (1957) A comment on D. V. Lindley's statistical paradox. Biometrika 44(3/4):533–534, URL http://www.jstor.org/stable/2332888

Bayarri MJ (1987) [Testing precise hypotheses]: Comment. Statistical Science 2(3):342–344, URL http://www.jstor.org/stable/2245776

Bayarri MJ, Berger JO, Forte A, García-Donato G (2012) Criteria for Bayesian model choice with application to variable selection. The Annals of Statistics 40(3):1550–1577, URL http://www.jstor.org/stable/41713685

Berger J (2008) A comparison of testing methodologies.
In: Prosper H, Lyons L, De Roeck A (eds) Proceedings of PHYSTAT LHC Workshop on Statistical Issues for LHC Physics, CERN, Geneva, Switzerland, 27-29 June 2007, CERN, CERN-2008-001, pp 8–19, URL http://cds.cern.ch/record/1021125

Berger J (2011) The Bayesian approach to discovery. In: Prosper HB, Lyons L (eds) Proceedings of PHYSTAT 2011 Workshop on Statistical Issues Related to Discovery Claims in Search Experiments and Unfolding, CERN, Geneva, Switzerland, 17-20 January 2011, CERN, CERN-2011-006, pp 17–26, URL http://cdsweb.cern.ch/record/1306523

Berger JO (1985) Statistical Decision Theory and Bayesian Analysis, 2nd edn. Springer Series in Statistics, Springer, New York

Berger JO, Delampady M (1987a) Testing precise hypotheses. Statistical Science 2(3):317–335, URL http://www.jstor.org/stable/2245772

Berger JO, Delampady M (1987b) [Testing precise hypotheses]: Rejoinder. Statistical Science 2(3):348–352, URL http://www.jstor.org/stable/2245779

Berger JO, Mortera J (1991) Interpreting the stars in precise hypothesis testing. International Statistical Review / Revue Internationale de Statistique 59(3):337–353, URL http://www.jstor.org/stable/1403691

Berger JO, Pericchi LR (2001) Objective Bayesian methods for model selection: Introduction and comparison. Lecture Notes-Monograph Series 38:135–207, URL http://www.jstor.org/stable/4356165

Berger JO, Sellke T (1987) Testing a point null hypothesis: The irreconcilability of p values and evidence. Journal of the American Statistical Association 82(397):112–122, URL http://www.jstor.org/stable/2289131

Berkson J (1938) Some difficulties of interpretation encountered in the application of the chi-square test. Journal of the American Statistical Association 33(203):526–536, URL http://www.jstor.org/stable/2279690

Bernardo JM (1999) Nested hypothesis testing: The Bayesian reference criterion. In: Bernardo JM, Berger JO, Dawid AP, Smith AFM (eds) Bayesian Statistics 6.
Proceedings of the Sixth Valencia International Meeting, Oxford U. Press, Oxford, U.K., pp 101–130

Bernardo JM (2009) [Harold Jeffreys's theory of probability revisited]: Comment. Statistical Science 24(2):173–175, URL http://www.jstor.org/stable/25681292

Bernardo JM (2011a) Bayes and discovery: Objective Bayesian hypothesis testing. In: Prosper HB, Lyons L (eds) Proceedings of PHYSTAT 2011 Workshop on Statistical Issues Related to Discovery Claims in Search Experiments and Unfolding, CERN, Geneva, Switzerland, 17-20 January 2011, CERN-2011-006, pp 27–49, URL http://cdsweb.cern.ch/record/1306523

Bernardo JM (2011b) Integrated objective Bayesian estimation and hypothesis testing. In: Bernardo JM, Bayarri MJ, Berger JO, Dawid AP, Heckerman D, Smith AFM, West M (eds) Bayesian Statistics 9. Proceedings of the Ninth Valencia International Meeting, Oxford U. Press, Oxford, U.K., pp 1–68, URL http://www.uv.es/bernardo/

Bernardo JM, Rueda R (2002) Bayesian hypothesis testing: A reference approach. International Statistical Review / Revue Internationale de Statistique 70(3):351–372, URL http://www.jstor.org/stable/1403862

Box GEP (1976) Science and statistics. Journal of the American Statistical Association 71(356):791–799, URL http://www.jstor.org/stable/2286841

Box GEP (1980) Sampling and Bayes' inference in scientific modelling and robustness. Journal of the Royal Statistical Society Series A (General) 143(4):383–430, URL http://www.jstor.org/stable/2982063

Casella G, Berger RL (1987a) Reconciling Bayesian and frequentist evidence in the one-sided testing problem. Journal of the American Statistical Association 82(397):106–111, URL http://www.jstor.org/stable/2289130

Casella G, Berger RL (1987b) [Testing precise hypotheses]: Comment. Statistical Science 2(3):344–347, URL http://www.jstor.org/stable/2245777

CERN (2013) CERN experiments put Standard Model to stringent test.
URL http://press.web.cern.ch/press-releases/2013/07/cern-experiments-put-standard-model-stringent-test

Chatrchyan S, et al (2012) Observation of a new boson at a mass of 125 GeV with the CMS experiment at the LHC. Physics Letters B 716(1):30–61, DOI 10.1016/j.physletb.2012.08.021

Chatrchyan S, et al (2013a) Measurement of the B_s^0 → µ+µ− branching fraction and search for B^0 → µ+µ− with the CMS experiment. Phys Rev Lett 111:101804, DOI 10.1103/PhysRevLett.111.101804

Chatrchyan S, et al (2013b) Study of the mass and spin-parity of the Higgs boson candidate via its decays to Z boson pairs. Phys Rev Lett 110:081803, DOI 10.1103/PhysRevLett.110.081803

Chatrchyan S, et al (2014) Measurement of the properties of a Higgs boson in the four-lepton final state. Phys Rev D 89:092007, DOI 10.1103/PhysRevD.89.092007

Cousins RD (2005) Treatment of nuisance parameters in high energy physics, and possible justifications and improvements in the statistics literature. In: Lyons L, Unel MK (eds) Proceedings of PHYSTAT 05 Statistical Problems in Particle Physics, Astrophysics and Cosmology, Oxford, U.K., September 12-15, 2005, Imperial College Press, pp 75–85, URL http://www.physics.ox.ac.uk/phystat05/proceedings/

Cousins RD, Highland VL (1992) Incorporating systematic uncertainties into an upper limit. Nuclear Instruments and Methods A 320:331–335, DOI 10.1016/0168-9002(92)90794-5

Cowan G, Cranmer K, Gross E, Vitells O (2011) Asymptotic formulae for likelihood-based tests of new physics. Eur Phys J C 71:1554, DOI 10.1140/epjc/s10052-011-1554-0

Cox DR (2006) Principles of Statistical Inference. Cambridge University Press, Cambridge

Davies RB (1987) Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika 74(1):33–43

Demortier L (2011) Open issues in the wake of Banff 2010.
In: Prosper HB, Lyons L (eds) Proceedings of PHYSTAT 2011 Workshop on Statistical Issues Related to Discovery Claims in Search Experiments and Unfolding, CERN, Geneva, Switzerland, 17-20 January 2011, CERN-2011-006, pp 1–11, URL http://cdsweb.cern.ch/record/1306523

Dickey JM (1977) Is the tail area useful as an approximate Bayes factor? Journal of the American Statistical Association 72(357):138–142, DOI 10.1080/01621459.1977.10479922, URL http://www.jstor.org/stable/2286921

Eadie W, et al (1971) Statistical Methods in Experimental Physics, 1st edn. North Holland, Amsterdam

Edwards W, Lindman H, Savage LJ (1963) Bayesian statistical inference for psychological research. Psychological Review 70(3):193–242

Feldman GJ, Cousins RD (1998) Unified approach to the classical statistical analysis of small signals. Phys Rev D 57:3873–3889, DOI 10.1103/PhysRevD.57.3873, physics/9711021

Ferguson CJ, Heene M (2012) A vast graveyard of undead theories: Publication bias and psychological science's aversion to the null. Perspectives on Psychological Science 7(6):555–561, DOI 10.1177/1745691612459059

Fermilab (2009) Fermilab collider experiments discover rare single top quark. URL http://www.fnal.gov/pub/presspass/press_releases/Single-Top-Quark-March2009.html

Galison P (1983) How the first neutral-current experiments ended. Rev Mod Phys 55:477–509, DOI 10.1103/RevModPhys.55.477

Gelman A, Rubin DB (1995) Avoiding model selection in Bayesian social research. Sociological Methodology 25:165–173, URL http://www.jstor.org/stable/271064

Georgi H (1993) Effective field theory. Ann Rev Nucl Part Sci 43:209–252, DOI 10.1146/annurev.ns.43.120193.001233

Good IJ (1992) The Bayes/non-Bayes compromise: A brief review. Journal of the American Statistical Association 87(419):597–606, URL http://www.jstor.org/stable/2290192

Gross E, Vitells O (2010) Trial factors or the look elsewhere effect in high energy physics.
Eur Phys J C 70:525–530, DOI 10.1140/epjc/s10052-010-1470-8

Hasert F, et al (1973) Observation of neutrino-like interactions without muon or electron in the Gargamelle neutrino experiment. Physics Letters B 46(1):138–140, DOI 10.1016/0370-2693(73)90499-1

Hirsch M, Päs H, Porod W (2013) Ghostly beacons of new physics. Scientific American 308(April):40–47, DOI 10.1038/scientificamerican0413-40

Incandela J, Gianotti F (2012) Latest update in the search for the Higgs boson, public seminar at CERN. Video: http://cds.cern.ch/record/1459565; slides: http://indico.cern.ch/conferenceDisplay.py?confId=197461

James F (1980) Interpretation of the shape of the likelihood function around its minimum. Comput Phys Commun 20:29–35, DOI 10.1016/0010-4655(80)90103-4

James F (2006) Statistical Methods in Experimental Physics, 2nd edn. World Scientific, Singapore

Jaynes E (2003) Probability Theory: The Logic of Science. Cambridge University Press, Cambridge, U.K.

Jeffreys H (1961) Theory of Probability, 3rd edn. Oxford University Press, Oxford

Johnstone D, Lindley D (1995) Bayesian inference given data 'significant at α': Tests of point hypotheses. Theory and Decision 38(1):51–60, DOI 10.1007/BF01083168

Kadane JB (1987) [Testing precise hypotheses]: Comment. Statistical Science 2(3):347–348, URL http://www.jstor.org/stable/2245778

Kass R (2009) Comment: The importance of Jeffreys's legacy. Statistical Science 24(2):179–182, URL http://www.jstor.org/stable/25681294

Kass RE, Raftery AE (1995) Bayes factors. Journal of the American Statistical Association 90(430):773–795, URL http://www.jstor.org/stable/2291091

Kass RE, Wasserman L (1995) A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association 90(431):928–934, URL http://www.jstor.org/stable/2291327

Kirk RE (1996) Practical significance: A concept whose time has come.
Educational and Psychological Measurement 56(5):746–759, DOI 10.1177/0013164496056005002

Leamer EE (1978) Specification Searches: Ad Hoc Inference with Nonexperimental Data. Wiley Series in Probability and Mathematical Statistics, Wiley, New York

Lee PM (2004) Bayesian Statistics: An Introduction, 3rd edn. Wiley, Chichester, U.K.

Lehmann E, Romano JP (2005) Testing Statistical Hypotheses, 3rd edn. Springer, New York

Lindley D (2009) [Harold Jeffreys's theory of probability revisited]: Comment. Statistical Science 24(2):183–184, URL http://www.jstor.org/stable/25681295

Lindley DV (1957) A statistical paradox. Biometrika 44(1/2):187–192, URL http://www.jstor.org/stable/2333251

Lyons L (2010) Comments on 'look elsewhere effect'. http://www.physics.ox.ac.uk/Users/lyons/LEE_feb7_2010.pdf

Lyons L (2013) Discovering the significance of 5 sigma,

Mayo DG, Spanos A (2006) Severe testing as a basic concept in a Neyman-Pearson philosophy of induction. The British Journal for the Philosophy of Science 57(2):323–357, URL http://www.jstor.org/stable/3873470

Nakagawa S, Cuthill IC (2007) Effect size, confidence interval and statistical significance: a practical guide for biologists. Biological Reviews 82(4):591–605, DOI 10.1111/j.1469-185X.2007.00027.x

Neyman J (1937) Outline of a theory of statistical estimation based on the classical theory of probability. Philosophical Transactions of the Royal Society of London Series A, Mathematical and Physical Sciences 236(767):333–380, URL http://www.jstor.org/stable/91337

Neyman J, Pearson ES (1933a) On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London Series A, Containing Papers of a Mathematical or Physical Character 231:289–337, URL http://www.jstor.org/stable/91247

Neyman J, Pearson ES (1933b) The testing of statistical hypotheses in relation to probabilities a priori.
Mathematical Proceedings of the Cambridge Philosophical Society 29:492–510, DOI 10.1017/S030500410001152X
Philippe A, Robert C (1998) A note on the confidence properties of reference priors for the calibration model. Sociedad de Estadística e Investigación Operativa Test 7(1):147–160, DOI 10.1007/BF02565107
Prescott C, et al (1978) Parity non-conservation in inelastic electron scattering. Physics Letters B 77(3):347–352, DOI 10.1016/0370-2693(78)90722-0
Raftery AE (1995a) Bayesian model selection in social research. Sociological Methodology 25:111–163, URL http://www.jstor.org/stable/271063
Raftery AE (1995b) Rejoinder: Model selection is unavoidable in social research. Sociological Methodology 25:185–195, URL http://www.jstor.org/stable/271066
Rice JA (2007) Mathematical Statistics and Data Analysis, 3rd edn. Thomson, Belmont, CA
Robert CP (1993) A note on Jeffreys-Lindley paradox. Statistica Sinica 3(2):601–608
Robert CP (2013) On the Jeffreys-Lindley paradox
Robert CP, Chopin N, Rousseau J (2009) Harold Jeffreys's theory of probability revisited. Statistical Science 24(2):141–172, URL http://www.jstor.org/stable/25681291
Rosenfeld AH (1968) Are there any far-out mesons or baryons? In: Baltay C, Rosenfeld AH (eds) Meson spectroscopy: A collection of articles, W.A. Benjamin, New York, pp 455–483. From the preface: based on reviews presented at the Conference on Meson Spectroscopy, April 26-27, 1968, Philadelphia, PA, USA; "...not, however, intended to be the proceedings..."
Senn S (2001) Two cheers for p-values? Journal of Epidemiology and Biostatistics 6(2):193–204
Shafer G (1982) Lindley's paradox. Journal of the American Statistical Association 77(378):325–334, URL http://www.jstor.org/stable/2287244
Smith AFM, Spiegelhalter DJ (1980) Bayes factors and choice criteria for linear models.
Journal of the Royal Statistical Society Series B (Methodological) 42(2):213–220, URL http://www.jstor.org/stable/2984964
Spanos A (2013) Who should be afraid of the Jeffreys-Lindley paradox? Philosophy of Science 80(1):73–93, URL http://www.jstor.org/stable/10.1086/668875
Stuart A, Ord K, Arnold S (1999) Kendall's Advanced Theory of Statistics, vol 2A, 6th edn. Arnold, London, and earlier editions by Kendall and Stuart
Swedish Academy (2013) Advanced information: Scientific background: The BEH-mechanism, interactions with short range forces and scalar particles. URL http://www.nobelprize.org/nobel_prizes/physics/laureates/2013/advanced.html
't Hooft G (1976) Symmetry breaking through Bell-Jackiw anomalies. Phys Rev Lett 37:8–11, DOI 10.1103/PhysRevLett.37.8
't Hooft G (1999) Nobel lecture: A confrontation with infinity. URL http://www.nobelprize.org/nobel_prizes/physics/laureates/1999/thooft-lecture.html; this web page has video, slides, and pdf writeup.
Thompson B (2007) The Nature of Statistical Evidence. Lecture Notes in Statistics, Springer, New York
van Dyk DA (2014) The role of statistics in the discovery of a Higgs boson. Annual Review of Statistics and Its Application 1(1):41–59, DOI 10.1146/annurev-statistics-062713-085841
Vardeman SB (1987) [Testing a point null hypothesis: The irreconcilability of p values and evidence]: Comment. Journal of the American Statistical Association 82(397):130–131, URL http://www.jstor.org/stable/2289136
Webster (1969) Webster's Seventh New Collegiate Dictionary, based on Webster's Third New International Dictionary. G. and C. Merriam Company, Springfield, MA
Welch BL, Peers HW (1963) On formulae for confidence points based on integrals of weighted likelihoods. Journal of the Royal Statistical Society Series B (Methodological) 25(2):318–329, URL http://www.jstor.org/stable/2984298
Wilczek F (2004) Nobel lecture: Asymptotic freedom: From paradox to paradigm.
URL http://www.nobelprize.org/nobel_prizes/physics/laureates/2004/wilczek-lecture.html; this web page has video, slides, and pdf writeup.
Wilkinson L, et al (1999) Statistical methods in psychology journals - guidelines and explanations. American Psychologist 54(8):594–604, DOI 10.1037//0003-066X.54.8.594
Zellner A (2009) [Harold Jeffreys's theory of probability revisited]: Comment. Statistical Science 24(2):187–190, URL http://www.jstor.org/stable/25681297
Zellner A, Siow A (1980) Posterior odds ratios for selected regression hypotheses. Trabajos de Estadistica Y de Investigacion Operativa 31(1):585–603, DOI 10.1007/BF02888369
