Open statistical issues in particle physics

The Annals of Applied Statistics 2008, Vol. 2, No. 3, 887–915. DOI: 10.1214/08-AOAS163. © Institute of Mathematical Statistics, 2008

By Louis Lyons
Oxford University

Many statistical issues arise in the analysis of Particle Physics experiments. We give a brief introduction to Particle Physics, before describing the techniques used by Particle Physicists for dealing with statistical problems, and also some of the open statistical questions.

Received September 2007; revised January 2008.
Footnote 1 (to the title). Supported in part by a Leverhulme Foundation grant.
Key words and phrases. Particle Physics, parameter determination, goodness of fit, p-values, hypothesis testing, nuisance parameters, upper limits, blind analysis, signal-background separation, combining results.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Applied Statistics, 2008, Vol. 2, No. 3, 887–915. This reprint differs from the original in pagination and typographic detail.

1. Introduction. Particle Physics tries to delve into the structure of matter at its most basic level. It continues a tradition that dates back to the Greeks [footnote 2] or even earlier. In the early days of Chemistry, the smallest entities were atoms. Early in the 20th century, the experiments of Rutherford demonstrated that atoms consisted of a small nucleus, with the electrons circulating at distances of $\sim 10^{-10}$ metres. Subsequently, the nucleus was found to be made of protons and neutrons. Many other particles (known as hadrons) like protons and neutrons have subsequently been discovered, but within the last 30 years, the quark model has brought understanding to the multitude of what used to be called “elementary particles.”

The entities that we currently believe are fundamental (i.e., they do not seem to have any sub-structure) are the quarks and leptons shown in Table 1. There are 6 of each, and they appear to be arranged in 3 “generations” of increasing mass, each containing quarks of electric charge $+2/3$ and $-1/3$ (in units where the electron’s charge is $-1$) and leptons of charge $-1$ and 0. The neutral leptons are called neutrinos. Although charged leptons and neutrinos have been detected, quarks are believed to be confined within hadrons. They have not been observed directly, but their existence is inferred from the simplification they bring to the multitude of hadrons, and to the way they explain many features of the way hadrons interact with each other or with leptons.

Footnote 2. Although the notion of what constitutes a satisfactory theory has changed over the centuries, it has always been considered desirable that the number of basic elements out of which everything is constructed should number at most “A FEW.” Since for the Greeks the basic elements were Air, Fire, Earth and Water, it is clear that they not only understood the basic principles of Science, but also had an excellent command of the English language.

Table 1. The basic particles

Particle, charge   Generation 1                      Generation 2                  Generation 3
Quark, $+2/3$      $u$ (0.3)                         $c$ (1.5)                     $t$ (175)
Quark, $-1/3$      $d$ (0.3)                         $s$ (0.5)                     $b$ (5)
Neutrino, 0        $\nu_e$ ($< 3 \times 10^{-9}$)    $\nu_\mu$ ($< 2 \times 10^{-4}$)   $\nu_\tau$ ($< 0.02$)
Lepton, $-1$       electron ($5 \times 10^{-4}$)     $\mu$ (0.1)                   $\tau$ (1.8)

Masses shown in brackets are in GeV/$c^2$. In these units, the mass of the proton is 0.9.

In addition to these particles, there are also others responsible for mediating the various fundamental forces. These include the massless photon $\gamma$, responsible for the electro-magnetic force; the massive $W$ and $Z$ bosons which mediate the weak force; and the gluons $g$ responsible for the strong force.
In addition, there is the still to be detected graviton which mediates gravitational forces, and is usually denoted by the symbol   . Because the    interacts so weakly it is hard to observe. Finally, there is the undiscovered Higgs boson, which is believed to be responsible for the mass of the other particles, and which is the object of intense searches in current experiments. Of course, theoretical physicists are prolific at inventing models, and so there are many other suggested particles.

Experiments in Particle Physics are usually conducted at large accelerators, for example, at the European Centre for Nuclear Research (CERN) in Geneva, or at Fermi National Accelerator Lab near Chicago. CERN’s soon-to-be-running Large Hadron Collider (LHC) is in a tunnel about 100 metres below the surface and 27 kilometres in circumference, and which straddles the French–Swiss border. Protons circulate in bunches in opposite directions around the ring, and collide with each other at the center of large detectors. The bunches are about the width of a human hair, and are $\sim 10$ centimetres long. When they collide, new particles are produced by converting the available kinetic energy into mass. The detectors are designed to track the path of each particle, measure its curvature in the magnetic field and hence determine the particle’s momentum, and also to give information on the particle’s identity (e.g., whether it is an electron, muon, pion, kaon or proton).

Reactions between colliding protons will occur at a very high rate, but most of them are fairly uninteresting. Thus, experiments are designed to have a trigger, which makes a very fast decision as to whether the collision (called an “event”) is likely to be interesting, and hence whether the data from the detector is worth storing.
Because of data read-out and storage constraints, only about 100 events per second are recorded, and each may contain about a Megabyte of information. Since the accelerator may run for 15 years, some $10^{10}$ events can be collected by each experiment. In analyzing data, allowance must be made for the distorting effect introduced by any selection bias of the trigger.

This review attempts to present some interesting statistical issues in the analysis of data collected in Particle Physics experiments. The items discussed below are a mixture of current practice, ideals to which we aspire and some personal prejudices of the author. It is hoped that the approaches mentioned in this article will be interesting or outrageous enough to provoke some Statisticians either to collaborate with Particle Physicists, or to provide them with suggestions for improving their analyses. It is to be noted that the techniques described are simply those used by Particle Physicists; no claim is made that they are necessarily optimal.

A Glossary of Particle Physics terminology appears in the supplementary material [Lyons (2008)].

2. Particle Physics analyses. This section starts with two typical examples of Particle Physics analyses, the first involving parameter determination, while the second tests whether data is consistent with a null hypothesis $H_0$, or whether an alternative hypothesis $H_1$ is favored. Further examples are described later. More detailed descriptions can be found in the various papers of the PHYSTAT series of Conferences [see James, Lyons and Perrin (2000), Cheung and Lyons (2000), Whalley and Lyons (2002), Lyons, Mount and Reitmeyer (2003), Lyons and Ünel (2005), Reid, Linnemann and Lyons (2006), Prosper, Lyons and De Roeck (2007)].
In particular, at the PHYSTAT-LHC meeting at CERN in 2007, the major experiments at the LHC presented their statistical “wish-lists” [Gross (2007), Belikov (2007), Xie (2007)].

2.1. Lifetimes. Here we estimate the lifetime of some specific particle. Thus, we could have $n$ independent observations $t_1, \ldots, t_i, \ldots, t_n$ for the times between the production and decay for this particle in the selected events. Then the mean lifetime $\tau$ could be determined by an unbinned likelihood fit to the probability density $\tau^{-1}\exp(-t/\tau)$. In real life we would have a more complicated expression, to allow for a possible background with a different time dependence, experimental resolution on the determination of $t_i$, and experimental acceptance of the detector and the trigger, which depends on $t$.

The various steps in the data analysis include:

• Reconstruct tracks from the hits in the detector.
• Select wanted events that are enriched in the particle whose lifetime we wish to measure.
• For each interaction, extract the decay time $t$ from $L$ and $v$, the distance the particle travels and its speed. Typical values are picoseconds, mms and 99% of the speed of light respectively.
• Model the signal, typically by an exponential time dependence probably smeared by time resolution effects, and the background. Time-dependent efficiencies for collecting the data may also be relevant.
• Perform a likelihood fit, to determine $\tau$ and its statistical error $\sigma_{\rm stat}$.
• Estimate the systematic error $\sigma_{\rm syst}$, and quote the result as $\tau \pm \sigma_{\rm stat} \pm \sigma_{\rm syst}$.
These systematics [Heinrich and Lyons (2007)] can arise from uncertainties in some of the extra parameters involved in modeling the data (e.g., the level of background contaminating our signal), or from possible uncertainties in the theory (maybe the expected exponential decay distribution is complicated by the existence of two overlapping particles). Statisticians usually refer to the former as “nuisance parameters.” In analyses involving enough data to achieve reasonable statistical accuracy, considerably more effort is devoted to assessing the systematic error than to determining the parameter of interest and its statistical error.
• Assess the goodness-of-fit between the data and the model, and ignore the estimated value for the parameter if the fit is unsatisfactory.

2.2. Significant peak? Another type of analysis might consist of looking at a mass spectrum (see Figure 1). In many situations we would expect to observe a rather smooth and somewhat boring distribution, but sometimes there may be a significant-looking peak at some mass position. This could correspond to the exciting discovery of a new particle, to a boring statistical fluctuation of the smooth background or to some unfortunately overlooked effect in the analysis.

We can make some numerical statement about the probability of obtaining a statistical fluctuation at least as extreme as the one we have observed. In this situation, we are performing a “Goodness of Fit” test, that is, we are comparing our data with the null hypothesis of a smooth distribution. Alternatively and probably more sensitively, we could use our data to compare the two hypotheses (just smooth background, or an interesting peak above the background); this is “Hypothesis Testing.”

2.3. Bayes or frequentism?
In many analyses the question arises whether to use a Bayesian or a Neyman–Pearson Frequentist approach, or one which is neither (e.g., $\chi^2$, likelihood, etc.). Particle Physicists tend to favor a frequentist method. This is because we really do consider that our data are representative as samples drawn according to the model we are using (decay time distributions often are exponential; the counts in repeated time intervals do follow a Poisson distribution, etc.), and hence we want to use a statistical approach that allows the data “to speak for themselves,” rather than our analysis being dominated by our assumptions and beliefs, as embodied in Bayesian priors. The reluctance to use priors is strongest in situations with several variables where multidimensional priors would be required, or in cases where very little is known about the relevant parameter; it may be acceptable to use prior information about a parameter which is already well measured, but more problematic to try to quantify prior ignorance.

However, in practice, it is very hard to use the Neyman frequentist construction when more than two or three parameters are involved: software to perform a Neyman construction efficiently in several dimensions would be most welcome. The choice of a useful ordering rule is also very important. Thus from a pragmatic point of view, even ardent frequentists are prepared to use Bayesian techniques. Most of them, however, would like to ensure that the technique they use provides parameter intervals with reasonable frequentist coverage. There are even mixed methods [Cousins and Highland (1992)] that use Bayesian priors for nuisance parameters, but a frequentist method for the parameter of interest.
The thinking here is that, although such an approach cannot be justified from fundamentals, it provides a practical method whose properties can be checked, and are often satisfactory.

Particle Physicists would appreciate advice on how to construct priors for parameters of interest, to be used in conjunction with information-based priors for nuisance parameters, and which might give reasonable coverage [see Demortier (2005)].

Fig. 1. Mass histogram. This is for reactions producing a neutron ($n$), $\pi^+$, $K^+$ and $K^-$. A histogram of the effective mass of the $nK^+$ combination is plotted. If a particle decaying into a neutron and a $K^+$ is produced in these reactions, a narrow peak should appear in this histogram at the particle’s mass, but if not the distribution should be smooth. The curve is an attempt to deduce this smooth background. Does the histogram provide evidence for a new particle, as opposed to there being a statistical fluctuation from the smooth background, and/or an incorrectly estimated background? A new particle here would be very interesting, as it would not fit into the simple quark model, because it would require a 5-quark structure.

3. Experimental design. Because experimental detectors are so expensive to construct, the time-scale over which they are built and operated is so long, and they have to operate under harsh radiation conditions, great care is devoted to their design and construction. This differs from the traditional statistical approach for the design of agricultural tests of different fertilisers, but instead starts with a list of physics issues which the experiment hopes to address.
The idea is to design a detector which will provide answers to the physics questions, subject to the constraints imposed by the cost of the planned detectors, their physical and mechanical limitations, and perhaps also the limited available space. Inevitably, compromises in the design are required, and testing of any proposed scheme involves analysis of the simulated “data” to see if the physics aims can indeed be achieved.

Design is also involved when planning what technique is to be used to analyze the experiment’s real data. This will be especially detailed if a blind analysis is to be performed (see Section 8).

Another example is provided by the attempt to assess the systematic error on an estimated parameter, caused by nuisance parameters. This often requires producing simulations of the data with different values of the nuisance parameter, and seeing how much the physics parameter’s value changes when the nuisance parameter value is changed by its uncertainty (compare Sections 5.4 and 6.2 for ways of incorporating nuisance parameters in upper limit and in p-value calculations respectively). When several nuisance parameters are involved, there is the question of whether separate simulations should be produced, in each of which only one of the nuisance parameters is changed from its optimal value by its uncertainty; or whether it is better to generate simulations in each of which all nuisance parameters are simultaneously changed from their optimal values according to their expected (possibly correlated) multivariate distribution. The two methods are sometimes referred to as unisim (or OFAT = One Factor At a Time) and multisim respectively. The question is which method requires less computing time to achieve the same accuracy for the systematic error [Roe (2007)].
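
The distinction between unisim and multisim can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not a procedure taken from the references: the function `physics_estimate` is a hypothetical stand-in for a full simulation followed by a fit, its linear response and the two nuisance uncertainties are invented for illustration, the nuisance parameters are taken as uncorrelated Gaussians, and each unisim variation is a single $+1\sigma$ shift.

```python
import math
import random

def physics_estimate(nuisances):
    """Hypothetical stand-in for 'simulate the data, then fit': returns the
    physics parameter obtained for the given nuisance parameter values."""
    a, b = nuisances
    return 1.50 + 0.20 * a - 0.10 * b  # toy linear response (assumption)

NOMINAL = [0.0, 0.0]  # optimal nuisance values (illustrative)
SIGMA = [1.0, 2.0]    # their uncertainties, assumed uncorrelated

def unisim(estimate, nominal, sigma):
    """Change one nuisance parameter at a time by +1 sigma and add the
    resulting shifts of the physics parameter in quadrature."""
    base = estimate(nominal)
    total = 0.0
    for i, s in enumerate(sigma):
        shifted = list(nominal)
        shifted[i] += s
        total += (estimate(shifted) - base) ** 2
    return math.sqrt(total)

def multisim(estimate, nominal, sigma, n_sims, rng):
    """Change all nuisance parameters simultaneously, drawing each from an
    independent Gaussian; the systematic error is the spread of the results."""
    values = []
    for _ in range(n_sims):
        drawn = [rng.gauss(m, s) for m, s in zip(nominal, sigma)]
        values.append(estimate(drawn))
    mean = sum(values) / n_sims
    variance = sum((v - mean) ** 2 for v in values) / (n_sims - 1)
    return math.sqrt(variance)

syst_unisim = unisim(physics_estimate, NOMINAL, SIGMA)
syst_multisim = multisim(physics_estimate, NOMINAL, SIGMA, 20000, random.Random(1))
print(syst_unisim, syst_multisim)
```

For this linear toy both estimates should be close to $\sqrt{0.08} \approx 0.28$, with the multisim value carrying a statistical error from the finite number of simulations. The sketch says nothing about which method is cheaper for a realistic non-linear or correlated problem, which is the question raised above.
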
How to assess systematics was much discussed at the Banff meeting [Reid, Linnemann and Lyons (2006)] and PHYSTAT-LHC [Read (2007), Neal (2007), Linnemann (2007)].

4. Separating signal from background. Almost every Particle Physics analysis uses some technique for separating signal from background. This is because only a fraction $f$ of the complete set of stored events (which because of the trigger can be a factor of $10^{7}$ down on the total reaction rate) will contain interactions of interest for the analysis being performed. Depending on the investigation being undertaken, $f$ could be as small as $10^{-8}$.

First some simple “cuts” are applied; these are generally loose selections on single variables, which are designed to remove background while barely reducing the signal. For example, the selected events could be required to have no more than 6 charged tracks. Then some more sophisticated analysis is performed, perhaps using more complicated derived variables, for example, the mass of a possible particle decaying into a kaon and 3 charged pions. To separate signal from background in the multi-dimensional space of the event observables, these analyses typically use methods like Fisher discriminants, boosted decision trees, artificial neural networks (including Bayesian nets), support vector machines, etc. [Prosper (2002), Friedman (2003, 2005)]. A description of the software available for implementing some of these techniques can be found in Narsky (2006) and Höcker (2007).

If a large data sample is available to perform an accurate measurement of a property of some particle, then it is not a disaster if there is some level of background in the finally selected events, provided that it can be accurately assessed and allowed for in the subsequent analysis.
At the other extreme, the separation technique may be used to see if there is any evidence for the existence of some hypothesised particle (the potential signal), in the presence of background from well-known sources. Then the actual data may in fact contain no observable signal.

These techniques are usually “taught” to recognize signal and background by being given examples consisting of large numbers of events of each type. These may be produced by Monte Carlo simulation, but then there is a problem of trying to verify that the simulation is a sufficiently accurate representation of reality. It is better to use real data, but the difficulty then is to obtain sufficiently pure samples of background and signal. Indeed, for the search for a new particle, true data examples do not exist. However, it is the accurate representation of background that is likely to pose a more serious problem.

The way that, for example, neural networks are trained is to present the software with approximately equal numbers of signal and background events [footnote 3], and then to optimize the cost function $C$ for the network. This is defined as $C = \sum_i (z_i - t_i)^2$, where $z_i$ is the trained network’s output for the $i$th event; $t_i$ is the target output, usually chosen as 1 for signal and zero for background; and the summation is over all testing events presented to the network.

Footnote 3. For searches for rare processes, it is clearly inappropriate to use the actual fractions expected in the data to determine the ratio of signal to background Monte Carlo events in the training sample, because the network could achieve an excellent score simply by classifying everything as background.

The problem with this is that $C$ is not what we really want to optimize.
For a search for a new particle, this could be the sensitivity of the experimental upper limit in the absence of signal, while for an analysis measuring the properties (such as mass or lifetime) of some well-established particle, we would be interested in minimizing the error (including systematic effects) on the result.

So the open questions are as follows:

• Is it possible to define what multivariate method will perform well in a given class of problems?
• How can we check that our multi-dimensional training samples for signal and background are reliable descriptions of reality?
• How many events are required for training?
• How should they be divided between signal and background, especially when there are several different sources of background?
• What is the best way of allowing for nuisance parameters in the models of the signal and/or background?
• Are there easy ways of optimizing on what is really of interest?

5. Upper limits. Most searches for new phenomena have not found any evidence for exciting new physics. Recent examples from Particle Physics include searches for the Higgs boson, supersymmetric particles, dark matter, etc.; attempts to find substructure of quarks or leptons; looking for extra spatial dimensions; measuring the mass of a neutrino; etc. Rather than just saying that nothing was found, it is more useful to quote an upper limit on the sought-for effect, as this could be useful in ruling out some theories. An example of this was the experiment by Michelson and Morley in 1887 which attempted to measure the speed of the Earth with respect to the aether. No effect was seen, but the experiment was sensitive enough to lead to the demise of the aether theory.

A simple scenario is a counting experiment where a background $b$ is expected from conventional sources, together with the possibility of an interesting signal $s$.
The number of counts $n$ observed is expected to be Poisson distributed with a mean $\mu = \epsilon s + b$, where $\epsilon$ is a factor for converting the basic physics parameter $s$ into the number of signal events expected in our particular experiment; it thus allows for experimental inefficiency, the experiment’s running time, etc. Then given a value of $n$ which is comparable to the expected background, what can we say about $s$? The true value of $s$ is constrained to be non-negative. The problem is interesting enough if $b$ and $\epsilon$ are known exactly; it becomes more complicated when only estimates with uncertainties $\sigma_b$ and $\sigma_\epsilon$ are available.

Even without the nuisance parameters, a variety of methods is available. These include likelihood, $\chi^2$, Bayesian with various priors for $s$, frequentist Neyman constructions with a variety of ordering rules for $n$, and various ad hoc approaches. The methods give different upper limits for the same data. [footnote 4] A comparison of several methods can be found in Narsky (2000). The largest discrepancies arise when the observed $n$ is less than the expected background $b$, presumably because of a downward statistical fluctuation. The following different behaviors of the upper limit (when $n < b$) can be obtained:

• Frequentist methods can give empty intervals for $s$, that is, there are no values of $s$ for which the data are likely. Particle Physicists tend to be unhappy when their years of work result in an empty interval for the parameter of interest, and it is little consolation to hear that frequentist statisticians are satisfied with this feature, as it does not lead to undercoverage. When $n$ is not quite small enough to result in an empty interval, the upper limit might be very small.
This could confuse people into thinking that the experiment was much more sensitive than it really was. [footnote 5]
• The Feldman–Cousins frequentist method [Feldman and Cousins (1998)] that employs a likelihood-ratio ordering rule gives upper limits which decrease as $n$ gets smaller at constant $b$. (This can also occur in other frequentist methods.) A related effect is the growth of the limit as $b$ decreases at constant $n$. Thus, if no events are observed ($n = 0$), the upper limit for a 90% interval is 1.08 for $b = 3.0$, but 2.44 for $b = 0$. This is sometimes presented as a paradox, in that if a bright graduate student worked hard and discovered how to eliminate the expected background, they would be “rewarded” by obtaining a weaker upper limit. [footnote 6] An answer is that although the actual limit had increased, the sensitivity of the experiment with the smaller background was better. There are other situations (for example, various random choices of measuring instruments [Cox (1958)]) where a measurement with better sensitivity can on occasion give a less-precise result.
• In the Bayesian approach, the dependence of the limit on $b$ is weaker. Indeed, when $n = 0$, the limit does not depend on $b$.
• Sen et al. (2008) consider a related problem, of a physical non-negative parameter $\lambda$ producing a measurement $x$, which is distributed about $\lambda$ as a Gaussian of variance $\sigma^2$. As the observable $x$ becomes more and more negative, the upper limit on $\lambda$ increases, because it is deduced that $\sigma$ must in fact be larger than its originally quoted value.

Footnote 4. By coincidence, the values obtained by the Bayesian approach with an (improper) flat prior for $s$ and by the Neyman construction for upper limits agree when $b = 0$.

Footnote 5. Bayesian methods that use priors with part of the probability density being a $\delta$-function at $s = 0$ can result in a posterior with an enhanced $\delta$-function at zero, such that the upper limit contains only the single point $s = 0$.

Footnote 6. The $n = 0$ situation is perhaps a special case, as the number of observed events cannot decrease as further selections are imposed to reduce the expected background. For nonzero observed events, if $n$ decreases with the tighter cuts (as expected for reduced background), the upper limit is likely to go down, in agreement with intuition. But if $n$ stays constant, that could be because the observed events contain signal, so it is perhaps not surprising that the upper limit increases.

In trying to assess which of the methods is best, one first needs a list of desirable properties. These include:

• Coverage: Even most Bayesian Particle Physicists would like the coverage of their intervals to match their quoted credibility, at least approximately. Because the data in counting experiments are discrete, it is impossible in any sensible way to achieve exact coverage for all $\mu$. However, it is not completely obvious that even Frequentists need coverage for every possible value of $\mu$, since different experiments will have different values of $b$ and of $\epsilon$. Thus, even for a constant value of the physical parameter $s$, different experiments will have different $\mu = \epsilon s + b$. Thus, it would appear that, if coverage in some average (over $\mu$) sense were satisfactory, the frequentist requirement for intervals to contain the true value at the requisite rate would be maintained. This, however, is not the generally accepted view by Particle Physicists, who would like not to undercover for any $\mu$.
• Not too much overcoverage: Because coverage varies with $\mu$, for methods that aim not to undercover anywhere, some overcoverage is inevitable.
This corresponds to having some upper limits which are high, and this leads to undesirable loss of power in rejecting alternative hypotheses about the parameter’s value.
• Short and empty intervals: These can be obtained for certain values of the observable, without resulting in undercoverage. They are generally regarded as undesirable for the reasons explained above.

It is not obvious how to incorporate the above desiderata on interval length into an algorithm that would be useful for choosing between different methods for setting limits.

5.1. Two-sided intervals. An alternative to giving upper limits is to quote two-sided intervals. For example, a 68% confidence interval for the mass of the top quark might be 169 to 173 GeV/$c^2$, as opposed to its 90% upper limit being 174 GeV/$c^2$. Most of the difficulties and ambiguities mentioned above apply in this case too, together with some extra possibilities. Thus, while it is clear which of two possible upper limits is tighter, this is not necessarily so for two-sided intervals, where which is shorter may be metric dependent; the first of two intervals for a particle’s lifetime $\tau$ may be shorter, but the second may be shorter when the ranges are quoted for decay rate ($= 1/\tau$). Also, there is more scope for choice of ordering rule for the frequentist Neyman construction, or for choosing the interval from the Bayesian posterior probability density. [footnote 7]

It has been pointed out by Feldman and Cousins (1998) that an apparently innocuous procedure for choosing what result to quote may lead to undercoverage. Many physicists would quote an upper limit on any possible signal if their observation was not more than three standard deviations above the expected background, but a two-sided interval if their result was above this.
With each type of interval constructed to give 90% coverage, there are some values of the parameter for which the coverage for this mixed procedure drops to 85%; Feldman and Cousins refer to this as “flip-flop.” They circumvent the problem by using a “unified” approach, in which the method automatically yields upper limits for small values of the data, but two-sided intervals for larger measurements, while maintaining correct coverage for all possible true values of the signal.

5.2. Sensitivity. We have already mentioned the idea of quoting the sensitivity of a procedure, as well as the actual upper limit as derived from the observed data. [footnote 8] For upper limits or for uncertainties on measurements, this can be defined as the median value that would be obtained if the procedure was repeated a large number of times. Using the median is preferable to the mean because (a) it is metric independent (i.e., the median lifetime upper limit would be the reciprocal of the median decay rate lower limit); and (b) it is much less sensitive to a few anomalously large upper limits or error estimates.

Punzi (2003) has drawn attention to the fact that this choice of definition for sensitivity has some undesirable features. Thus, minimizing the median upper limit for a search provides a different optimization from maximizing the median number of standard deviations for the significance of a discovery. Also, there is only a 50% chance of achieving the median result or better. Instead, for pre-defined levels $\alpha$ and $CL$, Punzi determines at what signal strength there is a probability of at least $CL$ for establishing a discovery at a significance level $\alpha$. This is what he quotes as the sensitivity, and is the signal strength at which we are sure to be able to claim a discovery or to exclude its existence.
Below this, the presence or otherwise of a signal makes too little difference, and we may remain uncertain (see Figure 2).

^7 A Bayesian statistician would be happy with the posterior as the final result. Particle Physicists like to quote an interval as a convenient summary.

^8 The sensitivity on its own will not do, because it is independent of the data.

Fig. 2. Punzi definition of sensitivity. Expected distributions for a statistic $t$ (which in simple cases could be simply the observed number of events $n$), for $H_0$ = background only (solid curves) and for $H_1$ = background plus signal (dashed curves). In (a), the signal strength is very weak, and it is impossible to choose between $H_0$ and $H_1$. As shown in (b), which is for moderate signal strength, $p_0$ is the probability according to $H_0$ of $t$ being equal to or larger than the observed $t_0$. To claim a discovery, $p_0$ should be smaller than some pre-set level $\alpha$, usually taken to correspond to $5\sigma$; $t_{crit}$ is the minimum value of $t$ for this to be so. Then the power function $1 - \beta$ [equivalent to $p_1$ in Figure 3(b)] is the probability according to the alternative hypothesis that $t$ will exceed $t_{crit}$. According to Punzi, the sensitivity should be defined as the expected production strength of the signal such that $1 - \beta$ exceeds another predefined level CL, for example, 95%. The exclusion region in (b) corresponds to $t_0$ in the 5% lower tail of $H_1$, while the discovery region has $t_0$ in the $5\sigma$ upper tail of $H_0$; there is a “No decision” region in between, as the signal strength in (b) is below the sensitivity value. The sensitivity is thus the signal strength above which there is a 95% chance of making a $5\sigma$ discovery. That is, the distributions for $H_0$ and $H_1$ are sufficiently separated that, apart possibly for the $5\sigma$ upper tail of $H_0$ and the 5% lower tail of $H_1$, they do not overlap. In (c) the signal strength is so large that there is no ambiguity in choosing between the hypotheses.

5.3. $CL_s$. This is a technique [Read (2000, 2004)] which is used for situations in which a discovery is not made, and instead various parameter values are excluded. For example, the Standard Model Higgs boson is such that, even before it is discovered, everything about it is well defined by theory except for its mass. The rate at which it is produced in a given experiment does depend on its mass. The failure to observe it can be converted into a mass range for the Higgs which is excluded (at some confidence level).

Figure 3 illustrates the expected distributions for some suitably chosen test statistic under two different hypotheses: the null $H_0$ in which there is only standard known physics, and $H_1$ which also includes some specific new particle, such as the Higgs boson. In the simplest case, the statistic could be simply the observed number of events $n$ in some selected region. In Figure 3(c), the new particle is produced prolifically, and an experimental observation of $n$ should fall in one peak or the other, and easily distinguish between the two hypotheses. In contrast, Figure 3(a) corresponds to very weak production of the new particle and it is almost impossible to know whether the new particle is being produced or not.

The conventional method of claiming new particle production would be if $n$ fell well above the main peak of the $H_0$ distribution; typically a $p_0$ value corresponding to $5\sigma$ would be required. In a similar way, new particle production would be excluded if $n$ were below the main part of the $H_1$ distribution. Typically, a 95% exclusion region would be chosen (i.e., $1 - p_1 \le 0.05$).
Fig. 3. The $CL_s$ method. The expected distributions for a data statistic $n$ are shown: (i) for the null hypothesis $H_0$ of background only (solid curve); and (ii) for $H_1$ (dashed curve), where there is also some exciting new physics, which tends to result in larger $n$. In (b), the tail areas of $H_0$ above the observed $n_0$ and of $H_1$ below $n_0$ are indicated by arrows; they correspond to probabilities $p_0$ and $1 - p_1$ respectively. Figure (c) shows a situation where the new physics is strongly produced, and $H_0$ and $H_1$ are well separated. Thus, $n_0$ would result in $H_1$ being excluded, while $n_1$ would be taken as evidence in favour of new physics. In (a), production is very weak, and the $H_0$ and $H_1$ curves are barely distinguishable. In order to protect against a downward fluctuation (statistic = $n_0$) in a situation like (a) resulting in an exclusion of $H_1$ when the curves are essentially identical, $CL_s$ is defined as $(1 - p_1)/(1 - p_0)$.

The $CL_s$ method aims to provide protection against a downward fluctuation of $n$ in Figure 3(a), resulting in a claim of exclusion in a situation where the experiment has no sensitivity to the production of the new particle; this could happen in 5% of experiments. It achieves this by defining^9

$$CL_s = (1 - p_1)/(1 - p_0), \qquad (1)$$

and requiring $CL_s$ to be below 0.05. From the definition, it is clear that $CL_s$ cannot be smaller than $1 - p_1$, and hence is a conservative version of the frequentist quantity $1 - p_1$. It tends to $1 - p_1$ when $n$ lies above the $H_0$ distribution, and to unity when $H_0$ and $H_1$ are very similar. Statisticians may find $CL_s$, which is the ratio of two $p$-values, to be lacking in formal justification. Its appeal to Particle Physicists is the protection it provides against excluding particles from data which have no sensitivity to them.
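As a hedged illustration of the ratio in equation (1), the sketch below evaluates $CL_s$ for the simplest case mentioned above, where the statistic is the observed number of events $n$ in a Poisson counting experiment. The function names and the rates `s` (signal) and `b` (background) are hypothetical and are not taken from Read (2000, 2004). One assumption deserves emphasis: for discrete counts the denominator is taken here as $P(n \le n_{obs} \mid b)$, which differs from $1 - p_0$ as defined in the text by the probability of observing exactly $n_{obs}$ under $H_0$.

```python
import math


def poisson_cdf(n, mu):
    """Return P(N <= n) for N ~ Poisson(mu)."""
    if n < 0:
        return 0.0
    term = math.exp(-mu)      # P(N = 0)
    total = term
    for k in range(1, n + 1):
        term *= mu / k        # P(N = k) from P(N = k - 1)
        total += term
    return min(total, 1.0)


def cls(n_obs, s, b):
    """CLs for a counting experiment (illustrative sketch only).

    Numerator:   1 - p1 = P(N <= n_obs | s + b), the lower tail of H1.
    Denominator: P(N <= n_obs | b), used here in place of 1 - p0
                 (an assumption made for discrete data, see the lead-in).
    """
    one_minus_p1 = poisson_cdf(n_obs, s + b)
    denominator = poisson_cdf(n_obs, b)
    return one_minus_p1 / denominator


def excluded_at_95(n_obs, s, b):
    """True if the signal rate s would be excluded because CLs < 0.05."""
    return cls(n_obs, s, b) < 0.05
```

With `b = 10`, `s = 1` and a downward fluctuation to `n_obs = 4`, the quantity $1 - p_1$ is about 0.015, so the purely frequentist criterion would exclude a signal to which the experiment has little sensitivity, whereas `cls(4, 1.0, 10.0)` is about 0.5 and no exclusion is claimed. By construction the ratio is never smaller than $1 - p_1$.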
We thus regard it as a conservative frequentist approach.

5.4. Nuisance parameters. For calculating upper limits in the simple counting experiment described in Section 5, the nuisance parameters arise from the uncertainties in the background rate $b$ and the acceptance $\epsilon$. These uncertainties are usually quoted as $\sigma_b$ and $\sigma_\epsilon$ (e.g., $b = 3.1 \pm 0.5$), and the question arises of what these errors mean. Sometimes they encapsulate the results of a subsidiary measurement, performed to estimate $b$ or $\epsilon$, and then they would express the width of the Bayesian posterior or of the frequentist interval obtained for the nuisance parameters. However, in many situations, the errors may be based on a series of subsidiary measurements; they may involve Monte Carlo simulations, which have systematic uncertainties (e.g., related to how well the simulation describes the real data) as well as statistical errors; or they may reflect uncertainties or ambiguities in theoretical calculations required to derive $b$ and/or $\epsilon$. In the absence of further information the posterior is often assumed to be a Gaussian, usually truncated so as to exclude unphysical (e.g., negative) values. This may be at best only approximately true, and deviations are likely to be most serious in the tails of the distribution.

There are many methods for incorporating nuisance parameters in upper limit calculations. These include:

• Profile likelihood. The likelihood, based on the data from the main and from the subsidiary measurements, is a function of the parameter of interest $s$ and of the nuisance parameters. The profile likelihood $L_{prof}(s)$ is simply the full likelihood $L(s, b_{best}(s), \epsilon_{best}(s))$, evaluated at the values of the nuisance parameters that maximize the likelihood at each $s$.
  Then the profile likelihood is simply used to extract the limits on $s$, much as the ordinary likelihood could be used for the case when there are no nuisance parameters.

  Rolke et al. (2005) have studied the behavior of the profile likelihood method for limits. Heinrich (2003a) had shown that the likelihood approach for estimating a Poisson parameter (in the absence of both background and of nuisance parameters) can have poor coverage at low values of the Poisson parameter. However, the profile likelihood seems to do better, probably because the nuisance parameters have the effect of smoothing away the fluctuating coverage observed by Heinrich.

^9 Given the fact that $CL_s$ is essentially the ratio of two $p$-values, the choice of symbol $CL_s$ (standing for “confidence level of signal”) is confusing.

• Full Bayes. When there is a subsidiary measurement, a prior is chosen for $b$ (or $\epsilon$), the data is used to extract the likelihood, and then Bayes' theorem is used to deduce the posterior for the nuisance parameter. This posterior from the subsidiary measurement is then used as the prior for the nuisance parameter in the main measurement (this prior could alternatively come from information other than a subsidiary experiment); together with the prior for $s$ and the likelihood for the main measurement, the overall joint posterior for $s$ and the nuisance parameter(s) is derived.^10 This is then integrated over the nuisance parameter(s) to determine the posterior for $s$, from which an upper limit can be derived.

  Numerical examples of upper limits can be found in Heinrich et al. (2004), where the method is discussed in detail. Thus, for precisely determined backgrounds, the effect of a 10% uncertainty in $\epsilon$ can be seen for various measured values of $n$ in Table 2.
  A plot of the coverage when the uncertainty in $\epsilon$ is 20% is reproduced in Figure 4.

  It is not universally appreciated that the choice for the main measurement of a truncated Gaussian prior for $\epsilon$ and an (improper) constant prior for nonnegative $s$ results in a posterior for $s$ which diverges. Thus, numerical estimates of the relevant integrals are meaningless. Another problem comes from the difficulty of choosing sensible multi-dimensional priors. Heinrich has pointed out the problems that can arise for the above Poisson counting experiment, when it is extended to deal with several data channels simultaneously [Heinrich (2005)].

^10 This is usually equivalent to starting with a prior for $s$ and the nuisance parameters, and the likelihood for the data from the main and the subsidiary experiments together, to obtain the joint posterior.

Table 2
90% confidence level upper limits for the production rate $s$ as a function of $n$, the observed number of events

   n     b = 0.0           b = 3.0
   0     2.35 (2.30)       2.35 (2.30)
   3     6.87 (6.68)       4.46 (4.36)
   6     10.88 (10.53)     7.80 (7.60)
   9     14.71 (14.21)     11.56 (11.21)
  20     28.27 (27.05)     25.05 (24.05)

The Poisson parameter $\mu = \epsilon s + b$, where the expected background $b$ is either 0.0 or 3.0, and is precisely known; and $\epsilon$, whose true value is 1.0, is estimated in a subsidiary measurement with 10% accuracy. The numbers in brackets are the corresponding upper limits when $\epsilon$ is known precisely. At large $n$, the limits for $b = 3.0$ are 3 units lower than those for $b = 0.0$; the latter are approximately $n + 1.28\sqrt{n}$ at large $n$. The effect of the uncertainty in $\epsilon$ is to increase the limits, and by a larger amount at large $n$. For $n = 0$, the Bayesian limits are independent of the expected background $b$.

Fig. 4. The coverage $C$ for the estimated 90% confidence level upper limit as a function of the true parameter $s_{true}$. The background $b = 3.0$ is assumed to be known exactly, while the subsidiary measurement for $\epsilon$ gives a 20% accuracy. The discontinuities are a result of the discrete (integer) nature of the measurements. There appears to be no undercoverage.

• Fully frequentist. In principle, the fully frequentist approach to setting limits when provided with data from the main and from subsidiary measurements is straightforward: the Neyman construction is performed in the multidimensional space where the parameters are $s$ and the nuisance parameters, and the data are from all the relevant measurements. Then the region in parameter space for which the observed data were likely is projected onto the $s$-axis, to obtain the confidence region for $s$.

  In practice, there are severe difficulties in writing a program to do this in a reasonable amount of time. To date, the largest number of parameters used is three [Nicola and Signorelli (2002)]. Another problem is that, unless a clever ordering rule is used for producing the acceptance region in data space for fixed values of the parameters, the projection phase leads to overcoverage, which can become larger as the number of nuisance parameters increases. Good ordering rules have been found for a version of the Poisson counting experiment [Punzi (2005)], and for the ratio of Poisson means [Cousins (1998)], where the confidence intervals are tighter than those obtained by conditioning on the sum of the numbers of counts in the two observations.

  For the fully frequentist method, it is guaranteed that there will be no undercoverage for any combination of parameter true values.
  This is not so for any other method, and so most Particle Physicists would like assurance that the technique used does indeed provide reasonable coverage, at least for $s$. There is usually lively debate between frequentists and Bayesians as to whether coverage is desirable for all values of the nuisance parameter(s), or whether one should be happy with no or little undercoverage when experiments are averaged over the nuisance parameter true values.

• Mixed. Because of the difficulty of performing a fully frequentist analysis in all but the simplest problems, an alternative approach [Cousins and Highland (1992)] is to use Bayesian averaging over the nuisance parameters, but then to employ a frequentist approach for $s$. The hope is that for most experiments setting upper limits, the statistical errors on the data are relatively large and so, provided the uncertainties in the nuisance parameters are not too large, the effect of the systematics on the upper limits will be small, and hence an approximate method of dealing with them may be justified.

5.5. Banff challenge. Given the large number of techniques available for extracting upper limits from data, especially in the presence of nuisance parameters, it was decided at the Banff meeting [Reid, Linnemann and Lyons (2006)] that it would be useful to compare the properties of the different approaches under comparable conditions. This led to the setting up of the “Banff Challenge,” which consisted of providing common data sets for anyone to calculate their upper limits. This was organized by Joel Heinrich, who reported on the performance of the various methods at the PHYSTAT-LHC meeting [Heinrich (2007)].

6. Discovery issues. Searches for new particles are an exciting endeavor, and will play an even bigger role with the start-up of the LHC at CERN, expected in 2008.
The 2007 PHYSTAT Workshop at CERN [Prosper, Lyons and De Roeck (2007)] was devoted to statistical issues that arise in discovery-oriented analyses.

6.1. $p$-values. In order to quantify the chance of the observed effect being due to an uninteresting statistical fluctuation, some statistic is chosen for the data. The simplest case would be the observed number $n_0$ of interesting events. Then the $p$-value is calculated, which is simply the probability that, given the expected background rate $b$ from known sources, the observed number of events would fluctuate up to $n_0$ or larger. A small value of $p$ indicates that the data are not very compatible with the theory (which may be because we do not understand our detector, rather than the theory being wrong).

Particle physicists usually convert $p$ into the number of standard deviations $\sigma$ of a Gaussian distribution, beyond which the one-sided tail area corresponds to $p$. Thus, $5\sigma$ corresponds to a $p$-value of $3 \times 10^{-7}$. This is done simply because it provides a number which is easier to remember, and not because Gaussians are relevant for every situation.

Unfortunately, $p$-values are often misinterpreted as the probability of the theory being true, given the data. It sometimes helps colleagues clarify the difference between $p(A|B)$ and $p(B|A)$ by reminding them that the probability of being pregnant, given the fact that you are female, is considerably smaller than the probability of being female, given the fact that you are pregnant.

6.2. Nuisance parameters. The calculation of $p$-values is complicated in practice by the existence of nuisance parameters. For example, for the simple situation described above, there could be some uncertainty in the estimated background. Although pivots are not generally used, there are numerous ways of incorporating nuisance parameters.
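Before nuisance parameters enter, the two calculations of Section 6.1 can be sketched in a few lines: the $p$-value for a counting experiment with a precisely known background, and its conversion to a one-sided number of Gaussian standard deviations. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and the background `b` is assumed to be positive and known exactly.

```python
import math
from statistics import NormalDist


def poisson_p_value(n_obs, b):
    """P(N >= n_obs) for N ~ Poisson(b), with b > 0 assumed known exactly.

    The upper tail is summed directly rather than computed as 1 - CDF,
    so that very small p-values are not lost to rounding.
    """
    if n_obs <= 0:
        return 1.0
    # Start from P(N = n_obs), evaluated in logs to avoid overflow.
    term = math.exp(-b + n_obs * math.log(b) - math.lgamma(n_obs + 1))
    total = 0.0
    k = n_obs
    while term > 1e-17 * total:   # stop once further terms are negligible
        total += term
        k += 1
        term *= b / k             # P(N = k) from P(N = k - 1)
    return min(total, 1.0)


def sigma_to_p(z):
    """One-sided Gaussian tail area beyond z standard deviations."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def p_to_sigma(p):
    """Number of standard deviations whose one-sided tail area equals p.

    Requires 0 < p < 1; NormalDist.inv_cdf raises an error otherwise.
    """
    return -NormalDist().inv_cdf(p)
```

Here `sigma_to_p(5.0)` reproduces the value of about $3 \times 10^{-7}$ quoted above, and `p_to_sigma` inverts it. The conversion is only a convenient relabelling of $p$; it does not assume that the data themselves are Gaussian.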
Methods for incorporating nuisance parameters include:

• Conditioning: In simple cases with a single nuisance parameter, it may be possible to condition on the sum of the number of counts in the main and the subsidiary experiments, and then to use the binomial distribution to obtain the $p$-value.
• Plug-in $p$-value: The best estimate of the nuisance parameters is used to calculate $p$.
• Prior predictive $p$-value: The $p$-values are averaged over the nuisance parameters, weighted by their prior distributions.
• Posterior predictive $p$-value: This time, the posterior distributions of the nuisance parameters are used for weighting.
• Supremum $p$-value: The largest $p$-value for any possible value of the nuisance parameter is used. This is likely to be useful only when the nuisance parameter is forced to be within some range; or when there is only a finite number of possible alternative theoretical interpretations.
• Confidence interval: A confidence region of size $1 - \gamma$ is used for the nuisance parameter(s), and then the adjusted $p$-value is $p_{max} + \gamma$, where $p_{max}$ is the largest $p$-value as the nuisance parameters are varied over their confidence region. Clearly, if it is desired to establish a discovery from $p$-values around $10^{-7}$ or smaller, then $\gamma$ should be chosen at least an order of magnitude below this.

The properties of these and other methods are compared by Demortier (2007), while Cranmer (2007) has discussed some of them in the context of searches at the LHC, where the tails of the probability distributions for data can be very relevant. Again, any experience of Statisticians about incorporating nuisance parameters could result in useful advice.

The role of systematic effects is likely to be more serious here than for upper limits discussed in Section 5.4.
This is because in upper limit situations the number of events is usually small, and so statistical errors dominate. In contrast, discovery claims have $p$-values of $3 \times 10^{-7}$ or smaller, and so tails of distributions are likely to be important.

6.3. Why $5\sigma$? Unfortunately the usually accepted ideal for claiming a discovery in Particle Physics is that $p$ should correspond to at least $5\sigma$. Statisticians almost invariably ask why we use such a stringent level. One answer is past experience: we have all too often seen interesting effects at the $3\sigma$ or $4\sigma$ level go away as more data are collected. Another is the multiple comparison problem, or “look elsewhere” effect. While the chance of obtaining a $5\sigma$ effect in one bin of a particular histogram is really small, it is to be remembered that histograms have many bins,^11 they could be plotted with different selection criteria and different binning,^12 and there are very many other histograms that were or could have been looked at in the course of the experiment.^13 Thus, the chance of a $5\sigma$ fluctuation occurring somewhere in the data is much larger than might at first appear. Finally, physicists subconsciously incorporate Bayes' priors in assessing how likely they feel that they have discovered something new, and hence, whether they should claim a discovery. Thus, in deciding between the possibilities of a new discovery or of an undetected systematic effect, our priors might favor the latter, and hence, strong evidence for discovery is required from the data.

It is not necessarily equitable to use a uniform standard for large general-purpose experiments and for small ones with a specific aim; or for looking for a process which is expected, as compared with a very speculative search.
But physicists and journal editors do like a defined rule rather than a flexible criterion, so this bolsters the $5\sigma$ standard. The general attitude is that, in the absence of a case for special pleading, $5\sigma$ is a reasonable requirement. In any case, it is largely a semantic issue, in that physicists finding a $4.5\sigma$ effect would clearly report it, using judiciously chosen wording to describe the status of their observation.

^11 In calculating a $p$-value in such a case, it is very desirable to take into account the number of chances for a statistical fluctuation to occur anywhere in the histogram. At very least, it should be made clear what the basis of the calculated $p$-value is.

^12 If a blind analysis is performed, such decisions are made before looking at the data, and so this aspect of the “look elsewhere” effect is reduced.

^13 The extent to which other people's searches should be included in an allowance for the “look elsewhere” effect depends subtly on the implied question being addressed. Thus are we considering the chance of obtaining a statistical fluctuation in any of the analyses we have performed; or by anyone analysing data in our experiment; or by any Particle Physicist this year? Anyone observing a possible Higgs signal at the LHC would be very unhappy about having to reduce the significance of their result because of the statistical fluctuations that could occur in speculative searches performed elsewhere.

Statisticians also ask whether we really believe our models out into the extreme tails of the distributions. In general, this may be so: counting experiments are expected to follow Poisson distributions, with small corrections for possible long time-scale drifts in detector calibrations; and particle decays usually are described by exponential distributions in time.
However, the situation is much less clear for nuisance parameters, where error estimates may be less rigorous, and their distribution is often assumed to be Gaussian (or truncated Gaussian) by default. The effect of these uncertainties on very small $p$-values needs to be investigated case-by-case.

We also have to remember that $p$-values merely test the null hypothesis. A more sensitive way to look for new physics is via the likelihood ratio or the differences in $\chi^2$ for the two hypotheses, that is, with and without the new effect. Thus, a very small $p$-value on its own is usually not enough to make a convincing case for discovery.

6.4. Repetitions in time. A typical experiment at a large accelerator may collect data over 10–15 years. The same search for a new effect will typically be repeated once or twice each year as more data is collected. Does this constitute another factor of $\approx 20$ in the number of opportunities for a statistical fluctuation to appear? Our reply is “No.” If there had been a $6\sigma$ signal with half the data (which resulted in a claim for discovery), which then became only $3\sigma$ with more data, this would be grounds for downplaying the earlier discovery claim. Thus, at any time, there is only one set of data (everything) that is relevant.

6.5. Combining $p$-values. In looking for a given new effect, there may be several separate and uncorrelated analyses which are relevant. These could correspond to different decay possibilities for the new particle or different experiments looking for the same signal. Thus, if the $p$-values for the null hypothesis (i.e., no new physics) for the separate analyses were $10^{-6}$ and $0.1$, what is the corresponding $p$-value for the pair of results^14? The unambiguous answer is that there is no unique recipe for combining them [CDF (2007), Cousins (2007)].
There is no single way of taking a uniform distribution in two variables, and finding a transformation $p_{comb}(p_1, p_2)$ that converts it into a uniform distribution of the single variable $p_{comb}$. Two popular recipes involve asking what is the probability that the smaller $p$-value will be $10^{-6}$ or smaller, or that the product is below $p_1 \times p_2 = 10^{-7}$. None of the possible methods has the property that, in combining 3 $p$-values, the same answer is obtained whether $p_1$ is first combined with $p_2$ and the result is then combined with $p_3$, or some different ordering is used. Clearly, it is important to decide what combination method should be used, without reference to the specific data.

^14 Rather than combining $p$-values, it is of course better to use the complete sets of original data (if available) for obtaining the combined result.

6.6. Peak above smooth background. When comparing two hypotheses with our data, we can use the numerical values of the two $\chi^2$ quantities. For example, we may be fitting a smooth distribution by a power series, and wonder whether we need a quadratic term, or whether a linear expression would suffice. Alternatively, we may want to assess whether a mass spectrum favors the existence of a peak on top of a smooth background, as compared with just the smooth background. Qualitatively, if the extra term(s) are unnecessary, they will result in a relatively small reduction in $\chi^2$, while if they really are required, the reduction could be larger.

It is sometimes possible to be quantitative about the expected reduction when the extra terms are not needed [Wilks (1938)].
If we are in the asymptotic regime, and if the hypotheses are nested, and if the extra parameters of the larger hypothesis are defined under the smaller one, and in that case do not lie on the boundary of their allowed region, then the difference in $\chi^2$ should itself be distributed as a $\chi^2$, with the number of degrees of freedom equal to the number of extra parameters.

An example that satisfies this is provided by the different order polynomials. Provided we have a large amount of data, we expect the difference in $\chi^2$ to be distributed as a $\chi^2$ with one degree of freedom, so a value larger than around 5 would be unlikely. A contrast is provided by a smooth background $C(x)$ compared with a background plus peak, $C(x) + A \exp[-0.5 (x - x_0)^2/\sigma^2]$. The extra parameters for the peak are its amplitude, position and width: $A$, $x_0$ and $\sigma$ respectively. Again, the hypotheses are nested, in that $C(x)$ is just a special case of the peak plus background, with $A = 0$. However, although $A$ is defined in the background only case, $x_0$ and $\sigma$ are not, as their values become completely irrelevant when $A = 0$. Furthermore, unless the peak plus background fit allows $A$ to be negative, zero is on the boundary of its allowed region. We thus should not expect the difference of the $\chi^2$ quantities itself to be distributed as a $\chi^2$ [Protassov et al. (2002), Demortier (2006)].

To assess the significance of a particular $\chi^2$ difference, this unfortunately means that we have to obtain its distribution ourselves, presumably by Monte Carlo. If we want to find out probabilities of statistical fluctuations at the $10^{-6}$ level, this requires a lot of simulation, and probably needs us to use something better than brute force.

Another example of comparing hypotheses by their $\chi^2$ values is given in Section 11.3.
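The well-behaved polynomial case can be checked with a small toy Monte Carlo, which also shows the brute-force procedure one would fall back on when the conditions of the theorem fail. The sketch below is only an illustration under stated assumptions: data are generated from a straight line with equal, known Gaussian errors, and the $\chi^2$ of a linear and of a quadratic least-squares fit are compared. The function names, the 20 points on $[-1, 1]$, the true line $1 + 2x$ and the error 0.5 are all arbitrary choices, not taken from the paper.

```python
import random


def chi2_polyfit(x, y, sigma, degree):
    """Chi-square of the best least-squares polynomial fit of a given degree.

    Assumes equal Gaussian errors sigma on every point. The columns
    1, x, x^2, ... are orthonormalised by Gram-Schmidt and the part of y
    lying along each of them is removed; what remains is the fit residual.
    """
    residual = list(y)
    basis = []
    for d in range(degree + 1):
        col = [xi ** d for xi in x]
        for e in basis:  # remove components along earlier basis vectors
            proj = sum(c * ei for c, ei in zip(col, e))
            col = [c - proj * ei for c, ei in zip(col, e)]
        norm = sum(c * c for c in col) ** 0.5
        e_new = [c / norm for c in col]
        basis.append(e_new)
        coeff = sum(r * ei for r, ei in zip(residual, e_new))
        residual = [r - coeff * ei for r, ei in zip(residual, e_new)]
    return sum(r * r for r in residual) / sigma ** 2


def toy_delta_chi2(n_toys=2000, seed=12345):
    """Chi-square differences (linear minus quadratic) for toys with no quadratic term."""
    rng = random.Random(seed)
    x = [-1.0 + 2.0 * i / 19 for i in range(20)]
    sigma = 0.5
    deltas = []
    for _ in range(n_toys):
        y = [1.0 + 2.0 * xi + rng.gauss(0.0, sigma) for xi in x]
        deltas.append(chi2_polyfit(x, y, sigma, 1) - chi2_polyfit(x, y, sigma, 2))
    return deltas


def tail_fraction(deltas, threshold):
    """Fraction of toys whose chi-square difference exceeds the threshold."""
    return sum(d > threshold for d in deltas) / len(deltas)
```

For a $\chi^2$ with one degree of freedom, about 5% of toys should exceed 3.84 and about 2.5% should exceed 5, which is what `tail_fraction(toy_delta_chi2(), 3.84)` can be compared against. For the peak-plus-background case the same loop would be needed with a real fit in place of `chi2_polyfit`, and the resulting distribution should not be expected to follow a $\chi^2$.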
The problem of nonstandard limiting distributions for $\chi^2$ tests has a substantial statistical literature [see, e.g., Self and Liang (1987) and Drton (2007)].

7. Goodness-of-fit. With sparse data, the unbinned likelihood method is a good one for estimating parameters of a model. In order to understand whether these estimates of the parameters are meaningful, we need to know whether the model provides an adequate description of the data. Unfortunately, as emphasised by Heinrich (2003b), maximum likelihood is often insensitive to whether or not the data agree with the model. It would be very useful to have a way of utilizing the unbinned likelihood so that it does provide a measure of the goodness-of-fit.

The standard method loved by Particle Physicists is $\chi^2$. This, however, is only applicable to binned data (i.e., in a one or more dimensional histogram). Furthermore, it loses its attractive feature that its distribution is model-independent when there are not enough data, which is likely to be so in the multi-dimensional case.

An alternative that is used for sparse one-dimensional data is the Kolmogorov–Smirnov (KS) approach or one of its variants. However, in the presence of fitted parameters, simulation is again required to determine the expected distribution of the KS-distance. Also because of the problem of how to order the data, it is not used by Particle Physicists in multi-dimensional situations.

Aslan and Zech (2004, 2005) have described a method that can be used with sparse multi-dimensional data.^15 It compares two separate sets of events, which could be data and simulation based on a theoretical model or two sets of data taken under slightly different conditions, etc.
The first set of points is assigned positive electric charges, and the second set negative ones, and then the “electrostatic energy” of the system is calculated as $E = \sum\sum q_i q_j f(d_{ij})$, where the summation extends over all pairs of observations; $q_i$ is the charge of the $i$th observation; and $f(d_{ij})$ is a function of the distance $d_{ij}$ between observations $i$ and $j$. For real electrostatics in 3 dimensions, $f(d)$ is proportional to $1/d$, but here it can be chosen to give desirable behavior; Aslan and Zech favor $-\ln(d + \epsilon)$, where $\epsilon$ is a small constant to avoid problems as $d$ tends to zero. This method requires the choice of a metric for each of the observables, and it also needs simulation to determine the expected distribution of $E$ assuming the two distributions are identical. Aslan and Zech find that their method compares favorably with other approaches (e.g., $\chi^2$, KS and its variants, etc.) in rejecting alternative hypotheses in various one-dimensional problems.

^15 A similar approach can be found in the statistics literature [Cuadras, Fortiana and Oliva (1997, 2003)].

8. Blind analyses. These are becoming increasingly popular in Particle Physics, as a means of avoiding personal bias affecting the result. They involve keeping part of the data unseen by the physicists, until the data selection procedure and the analysis method have been completely defined, all correction procedures specified, etc.

The original suggestion to use a blind analysis for a Particle Physics experiment was due to Luis Alvarez. An experiment at Stanford had looked for quarks, by measuring the residual charge on small spheres that were levitated in a superconducting magnet. If a single free quark was present in a sphere, the residual charge would be a third or two-thirds of the electron's charge.
Several of the balls tested indeed yielded such values. A potential problem was that large corrections had to be applied to the raw data in order to extract the final result for the charge. The suspicion was that maybe the experimenters were (subconsciously) applying corrections until the value turned out to be “satisfactory.” The blind approach would involve the computer adding a random number to the raw value of the charge, which would then be corrected until the experimentalists were satisfied, and only then would the computer subtract the random number to reveal the final answer for that sphere.$^{16}$

There are various methods of performing blind analyses [Klein and Roodman (2005)], most of which aim to allow the experimentalists to look at some of the real data, in order to perform checks that nothing is terribly wrong. Some of these are as follows:

• The computer adds a random number to the data, which is only subtracted after all corrections are applied. This was the method suggested by Alvarez.
• Use only Monte Carlo to define the procedure. This completely avoids the danger of allowing the data to determine the procedure to be used, but suffers from the drawback that the data cannot be compared with the Monte Carlo, to check that the latter is reasonable.
• Use only a fraction of the data for defining the procedure. Then this is held fixed for the remainder of the data. In principle, an optimization can be employed to determine the fraction to be kept open, but, in practice, this is often decided by choosing a semi-arbitrary time after which the future data is kept blind.
• The signal region is defined as a certain part of multi-dimensional space, and this is kept hidden, but all other regions, including those adjacent to the signal, are available for inspection.
• Keep the Monte Carlo parameters hidden. This is a technique used by the TWIST experiment in their high statistics precision determination of parameters associated with muon decay. The procedure involves comparing the data with various simulated sets, generated with a series of different parameter values. The data and the simulations are both visible, but the parameter values used to generate the simulations are kept hidden.
• Keep visible only a fraction of the contents of each bin of a histogram. This is used by the MINOS experiment searching for neutrino oscillations; these would affect the energy distribution of the observed events. By keeping visible different unknown fractions of the data in each bin, the energy spectral shape cannot be determined from the visible part of the data.

$^{16}$ This suggestion was implemented, but in fact no subsequent results were published. The current consensus is that this “discovery” of free quarks is probably spurious.

If several different groups within the same collaboration are performing similar analyses for extracting some specific parameter, then it is desirable to fix the procedure for selecting which result to present, or alternatively how to combine the separate results. This should be done before the results are seen, and is worth doing even if the individual analyses were not “blind.”

A question that arises with blind analyses is whether it should be permitted to modify the analysis after the data have been unblinded. It is generally agreed that this should not be done... unless everyone would regard it as ridiculous not to do so. For example, if a search for rare events yielded 10 candidates over the course of a year’s run, and it was found that all of these occurred on Sunday mornings at precisely 1:17 a.m., it would be prudent to do some further investigation before publishing.
If “post-unblinding” modification of the procedure is performed, this should be made clear in any publication.

9. Combining results. A commonly used procedure is to combine $N$ different uncorrelated measurements $a_i \pm \sigma_i$ of the same physical quantity $a$. When the measurements are believed to be Gaussian distributed about the true value $a_{\rm true}$, the well-known result is that the best estimate $a_{\rm best} \pm \sigma_{\rm best}$ is given by

\[ a_{\rm best} = \sum a_i w_i \Big/ \sum w_i, \qquad \sigma_{\rm best} = 1 \Big/ \sqrt{\sum w_i}, \tag{2} \]

where the weights are defined as $w_i = 1/\sigma_i^2$. This is readily derived from minimizing with respect to $a$ a weighted sum of squared deviations$^{17}$

\[ S(a) = \sum (a_i - a)^2/\sigma_i^2. \tag{3} \]

$^{17}$ A problem arises if the measurements are discrepant. If $S$ is much larger than $N - 1$, then some serious problem exists, and it is probably unwise to combine the results. But for $S/(N - 1)$ somewhat larger than unity, a commonly adopted procedure [Particle Data Group (2006)] is to scale up the uncertainty on the weighted average by the square root of this factor.

The extension to the case where the individual measurements are correlated (as is often the case for analyses using different techniques on the same data) is straightforward: $S$ becomes $\sum\sum (a_i - a) H_{ij} (a_j - a)$, where $H$ is the inverse error matrix.

There are, however, practical details that complicate its application. For example, in the above formula, the $\sigma_i$ are supposed to be the true accuracies of the measurements. Often, all that we have available are estimates of their values. Problems arise in situations where the error estimate depends on the measured value $a_i$. For example, in counting experiments with Poisson statistics, it is typical to set the error as the square root of the observed number.
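As a purely illustrative sketch (not taken from any of the analyses cited here), the code below implements the weighted average of equation (2) and applies it to hypothetical Poisson counts from runs of equal exposure. The function names and the example counts are our own assumptions. With $\sigma_i^2 = n_i$ the weighted average reduces to the harmonic mean of the counts, whereas setting $\sigma^2$ equal to the expected number $a$ and minimizing $S(a) = \sum (n_i - a)^2/a$ gives the root-mean-square of the counts.

```python
import math


def weighted_average(values, sigmas):
    """Equation (2): inverse-variance weighted mean and its uncertainty."""
    weights = [1.0 / s ** 2 for s in sigmas]
    total = sum(weights)
    best = sum(a * w for a, w in zip(values, weights)) / total
    return best, 1.0 / math.sqrt(total)


def combine_sqrt_observed(counts):
    """Combine counts using sigma_i = sqrt(n_i); needs every n_i > 0."""
    return weighted_average(counts, [math.sqrt(n) for n in counts])


def combine_sqrt_expected(counts):
    """Minimize S(a) = sum (n_i - a)^2 / a, i.e. sigma^2 = expected number a.

    Setting dS/da = 0 gives a^2 = mean(n_i^2), the root-mean-square count.
    """
    return math.sqrt(sum(n * n for n in counts) / len(counts))


# Hypothetical counts from three runs of equal exposure (assumed values).
counts = [90, 100, 110]
plain_mean = sum(counts) / len(counts)
low, low_err = combine_sqrt_observed(counts)
high = combine_sqrt_expected(counts)
print(f"sqrt(observed) errors: {low:.2f} +- {low_err:.2f}")
print(f"unweighted mean:       {plain_mean:.2f}")
print(f"sqrt(expected) errors: {high:.2f}")
```

For these counts the first combination falls below the unweighted mean (which is the maximum likelihood estimate for equal-exposure Poisson counts) and the second falls above it.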
A downward fluctuation in the observation then results in an overestimated weight, and $a_{\rm best}$ is biassed downward. If instead the error is estimated as the square root of the expected number $a$, the combined result is biassed upward: the increased error reduces $S$ at larger $a$. A way round this difficulty has been suggested by Lyons, Martin and Saxon (1990).

Another problem arises when the individual measurements are very correlated. When the correlation coefficient of two uncertainties is larger than $\sigma_1/\sigma_2$ (where $\sigma_1$ is the smaller error), $a_{\rm best}$ lies outside the range of the two measurements. As the correlation coefficient tends to $+1$, the extrapolation becomes larger, and is very sensitive to the exact value assumed for the correlation coefficient. The situation is aggravated by the fact that $\sigma_{\rm best}$ tends to zero. This is usually dealt with by selecting one of the two analyses, rather than trying to combine them.

Another extension of this procedure is for combining $N$ pairs of correlated measurements (e.g., the gradient and intercept of a straight line fit to several sets of data). The prescription to be adopted for scaling the errors when the individual measurements are somewhat discrepant has complications.

10. Accuracy of answer. Sometimes a result appears to be more accurate than is justified. This can arise when an upper limit is much lower than the sensitivity of the procedure (e.g., when the observed number of events in a counting experiment is smaller than the expected background) or when by chance individual observations happen to lie close to each other. This can cause problems in deciding which measurement is “better.” This can be relevant in choosing which of several competing analyses on the same data to quote as the result of the experiment; or in combining different results (see previous section).
In the former situation, if the estimated error increases with the estimated value, choosing the result with the smallest estimated error can produce a downward bias. On the other hand, using the smallest expected error can cause us to ignore an analysis which had a particularly favorable statistical fluctuation, which produced a result that was genuinely more precise than expected.$^{18}$ How to deal with this situation in general is an open question. It has features in common with the problem of measuring a voltage by choosing at random a voltmeter from a cupboard containing meters of different sensitivities [Cox (1958)].

11. Recent improvements in understanding. In this section we list a few of the issues on which Particle Physicists have recently improved their understanding of statistical issues. To those can be added a few already discussed above (see Section 6.6 and the remarks about unbinned likelihoods in the first paragraph of Section 7).

11.1. Number of degrees of freedom. If we construct the weighted sum of squares $S$ between a predicted theoretical curve and some data in the form of a histogram, provided the Poisson distribution of the data can be approximated by a Gaussian (and the theory is correct, the data are unbiassed, the error estimates are correct, etc.), asymptotically$^{19}$ $S$ will be distributed as $\chi^2$ with the number of degrees of freedom $\nu = n - f$, where $n$ is the number of data points and $f$ is the number of free parameters whose values are determined in the fit. The relevance of the asymptotic requirement can be seen by imagining fitting a more or less flat distribution by the expression $N(1 + 10^{-6}\cos(x - x_0))$, where the free parameters are the normalization $N$ and the phase $x_0$.
It is clear that, although $x_0$ is left free in the fit, because of the $10^{-6}$ factor, it will have a negligible effect on the fitted curve, and hence will not result in the typical reduction in $S$ associated with having an extra free parameter. Of course, with an enormous amount of data, we would have sensitivity to $x_0$, and so asymptotically it does reduce $\nu$ by one unit, but not for smaller amounts of data.

Another example involves the search for neutrino oscillations. The neutrino energy spectrum is fitted by a survival probability $P$ of the form

\[ P = 1 - A \sin^2(C\,\Delta m^2), \tag{4} \]

where $C$ is a known function of the neutrino energy and the length of its flight path, $A$ is a parameter which depends on the neutrino mixing angle, and $\Delta m^2$ is the difference in mass squared of the relevant neutrino species. For small values of $C\,\Delta m^2$,

\[ P \approx 1 - A (C\,\Delta m^2)^2. \tag{5} \]

Thus, the survival probability depends only on the two parameters in the combination $A(\Delta m^2)^2$. Because this combination is all that we can hope to determine, we effectively have only one free parameter rather than two. Of course, an enormous amount of data can manage to distinguish between $\sin(C\,\Delta m^2)$ and $C\,\Delta m^2$, and so asymptotically we have two free parameters as expected.

$^{18}$ For example, the ALEPH experiment at LEP produced a tighter-than-expected upper limit on the mass of $\nu_\tau$ because they happened to observe a decay configuration producing $\nu_\tau$ which was particularly sensitive for determining its mass.

$^{19}$ The examples in this section are independent of the requirement that we need enough events for the Poisson distribution to be well approximated by a Gaussian.

It would be useful to have some indication of when data are near enough to asymptopia, so as to avoid the necessity for Monte Carlo calculations of the expected distribution of $S$.
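The flat-distribution example of this section can be explored with a small toy Monte Carlo. The sketch below is only illustrative and rests on our own assumptions: 20 bins with Gaussian errors of known width, a scan over a grid of trial phases $x_0$ in place of a proper minimizer, and hypothetical names and settings throughout. For each toy data set it compares the minimum $S$ of the two-parameter fit $N(1 + 10^{-6}\cos(x - x_0))$ with that of a one-parameter constant fit.

```python
import math
import random


def toy_s_values(n_bins=20, n_toys=1000, eps=1e-6, level=100.0,
                 sigma=10.0, n_phase=24, seed=1):
    """Return (S_flat, S_phase): minimum S per toy for the two fits."""
    rng = random.Random(seed)
    xs = [2.0 * math.pi * k / n_bins for k in range(n_bins)]
    # Shape 1 + eps*cos(x - x0), tabulated for each trial phase x0.
    shapes = []
    for j in range(n_phase):
        x0 = 2.0 * math.pi * j / n_phase
        shapes.append([1.0 + eps * math.cos(x - x0) for x in xs])

    s_flat, s_phase = [], []
    for _ in range(n_toys):
        ys = [rng.gauss(level, sigma) for _ in xs]
        # One free parameter: with equal errors the best constant is the mean.
        mean = sum(ys) / n_bins
        s_flat.append(sum((y - mean) ** 2 for y in ys) / sigma ** 2)
        # Two free parameters: N solved analytically, x0 scanned on the grid.
        best = math.inf
        for g in shapes:
            n_hat = (sum(y * gk for y, gk in zip(ys, g))
                     / sum(gk * gk for gk in g))
            s = sum((y - n_hat * gk) ** 2 for y, gk in zip(ys, g)) / sigma ** 2
            best = min(best, s)
        s_phase.append(best)
    return s_flat, s_phase


s_flat, s_phase = toy_s_values()
print("mean S, constant fit (f = 1):   ", sum(s_flat) / len(s_flat))
print("mean S, free phase fit (f = 2): ", sum(s_phase) / len(s_phase))
print("n - 1 = 19, n - 2 = 18")
```

Under these assumptions the two sets of $S$ values should be practically identical, so the mean of $S$ stays close to $n - 1$ rather than dropping to $n - 2$. Increasing the amplitude `eps` or shrinking `sigma` is one way to see the extra parameter begin to count.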
11.2. $\Delta(\ln L) = 0.5$ rule. In the maximum likelihood approach to parameter determination, the best value $\lambda_0$ of a parameter is determined by finding where the likelihood maximizes; and its error $\sigma_\lambda$ is estimated by finding how much the parameter must be changed in order for the logarithm of the likelihood to decrease by 0.5 as compared with the maximum.$^{20}$ From a frequentist viewpoint, this should ideally result in the range from $\lambda_0 - \sigma_\lambda$ to $\lambda_0 + \sigma_\lambda$ having 68% coverage.

If the measurement is distributed about the true value as a Gaussian with a constant width, then exact coverage is obtained, but in general this is not so. For example, Heinrich (2003a) has investigated the properties of the likelihood approach to estimate $\mu$, the mean of a Poisson, when $n_{\rm obs}$ events are observed. Because $n_{\rm obs}$ is a discrete variable, the coverage is a discontinuous function of $\mu$, and varies from 100% at $\mu = 0$ down to 30% at $\mu \approx 0.5$.$^{21}$

$^{20}$ If there is more than just one parameter, the likelihood must be remaximized with respect to all the other parameters when looking for the $\Delta(\ln L) = 0.5$ points.

11.3. Comparing two hypotheses via $\chi^2$. Assume we have a histogram with 100 bins, and that we are using a $\chi^2$ method for fitting it with a function with one free parameter. We expect to obtain a $\chi^2$ value of $99 \pm 14$. Thus, if $p_0$, the best value of the parameter, yields a $\chi^2$ of 85, we would regard that as very satisfactory. However, a theoretical colleague has a model which predicts that the parameter should have a different value $p_1$, and wants to know what the data has to say about that. We test this by calculating the $\chi^2$ for that $p_1$ and obtain a value of 110. We appear to have two contradictory conclusions:

• $p_1$ is satisfactory: This is based on the fact that the relevant $\chi^2$ of 110 is well within the expected range of $99 \pm 14$.
• $p_1$ is ruled out: The uncertainty on $p$ is estimated by seeing how much it must change from its optimum value in order to make $\chi^2$ increase by 1 unit. For this data, $\chi^2(p_1)$ is 25 units larger than $\chi^2(p_0)$, and so, assuming that the behavior of $\chi^2$ in the neighborhood of the minimum is parabolic, $p_1$ is ruled out at the 5 standard deviation level.

Unfortunately, many physicists, over-impressed by the fact that $\chi^2(p_1)$ appears to be satisfactory, are reluctant to accept that $p_0$ is strongly favored by the data.

A similar argument applies to comparing a given set of data with 2 separate hypotheses, for example, fitting a histogram with an exponential or a straight line. Again the difference between the $\chi^2$ quantities provides better discrimination between the hypotheses than do the individual $\chi^2$ [Lyons (1999)].

There are of course other ways of comparing two hypotheses, e.g., likelihood ratio, Bayes factor, Bayesian information criterion, etc. Trotta (2008) has discussed their application in cosmology.

$^{21}$ It is of course not surprising that methods that are expected to have good asymptotic behavior may not display optimal properties for $\mu \approx 0$.

12. Conclusions. It is clear that there are many practical issues to be resolved in Particle Physics. Some of these may be of interest to Statisticians. With analyses becoming more and more complex, we would welcome more active involvement that would lead to improved analyses of our data. Any suggestions regarding improvements in the approaches outlined in this review would also be appreciated.

Acknowledgments.
I wish to acknowledge the patience and expertise of David Cox, Brad Efron and Michael Stein and also of other Statisticians too numerous to list, in explaining statistical issues to me; the ones who have contributed to the PHYSTAT meetings have been particularly helpful. My understanding of the practical application of statistical techniques has improved considerably as a result of discussions with many experimental Particle Physics colleagues. I especially want to thank the members of the CDF Statistics Committee and Bob Cousins. To all of you, I am most grateful.

SUPPLEMENTARY MATERIAL

Appendix: Glossary of Particle Physics terms (DOI: 10.1214/08-AOAS163SUPP; pdf).

REFERENCES

Aslan, B. and Zech, G. (2004). A multivariate two-sample test based on the concept of minimum energy. J. Stat. Comp. Simul. 75 109. MR2117010
Aslan, B. and Zech, G. (2005). Statistical energy as a tool for binning-free multivariate goodness of fit tests, two-sample comparison and unfolding. Nuclear Instruments and Methods A537 626.
Belikov, J. (2007). ALICE statistical wish-list. Available at http://phystat-lhc.web.cern.ch/phystat-lhc/.
CDF Statistics Committee (2007). Frequently asked questions. Available at http://www-cdf.fnal.gov/physics/statistics/statistics_faq.html#iptn4.
Cheung, H. and Lyons, L. (2000). FNAL confidence limits workshop. Available at http://conferences.fnal.gov/CLW/.
Cousins, R. (1998). Improved central confidence intervals for the ratio of Poisson means. Nuclear Instruments and Methods A417 391–399.
Cousins, R. (2007). Annotated bibliography on some papers on combining significances or p-values. Available at arXiv:0705.2209.
Cousins, R. D. and Highland, V. L. (1992). Incorporating systematic uncertainties into an upper limit. Nuclear Instruments and Methods A320 331.
Cox, D. R. (1958). Some problems connected with statistical inference. Ann. Math. Statist. 29 357–372. MR0094890
Cranmer, K. (2007). Progress, challenges and future of statistics at the LHC. Available at http://phystat-lhc.web.cern.ch/phystat-lhc/2008-001.pdf.
Cuadras, C. M., Fortiana, J. and Oliva, F. (1997). The proximity of an individual to a population with applications to discriminant analysis. J. Classification 14 117–136. MR1449744
Cuadras, C. M. and Fortiana, J. (2003). Distance-based multivariate two sample tests. Available at http://www.imub.ub.es/publications/preprints/pdf/Cuadras-Fortiana.334.pdf. MR2091618
Demortier, L. (2005). Bayesian reference analysis. Available at http://www.physics.ox.ac.uk/phystat05/proceedings/files/demortier-refana.ps. MR2270218
Demortier, L. (2006). Setting the scene for p-values. Available at www.birs.ca/workshops/2006/06w5054/report06w5054.pdf.
Demortier, L. (2007). P-values and nuisance parameters. Available at http://phystat-lhc.web.cern.ch/phystat-lhc/2008-001.pdf.
Drton, M. (2007). Likelihood ratio tests and singularities. Available at http://front.math.ucdavis.edu/0703.5360.
Feldman, G. J. and Cousins, R. D. (1998). Unified approach to the classical statistical analysis of small signals. Phys. Rev. D 57 3873–3889.
Friedman, J. H. (2003). Recent advances in predictive (machine) learning. Available at http://www-stat.stanford.edu/~jhf/ftp/machine.pdf.
Friedman, J. H. (2005). Separating signal from background using ensembles of rules. Available at http://www.physics.ox.ac.uk/phystat05/proceedings/files/friedman phystat.pdf.
Gross, E. (2007). ATLAS and CMS statistical wish-list. Available at http://phystat-lhc.web.cern.ch/phystat-lhc/2008-001.pdf.
Heinrich, J. (2003a). Coverage of error bars for Poisson data. Available at http://www-cdf.fnal.gov/publications/cdf6438_coverage.pdf.
Heinrich, J. (2003b). Pitfalls of goodness-of-fit from likelihood. Available at http://www.slac.stanford.edu/econf/C030908/papers/MOCT001.pdf.
Heinrich, J. (2005). The Bayesian approach to setting limits: What to avoid. Available at http://www.physics.ox.ac.uk/phystat05/proceedings/files/heinrich.ps.
Heinrich, J. (2007). Review of Banff challenge on upper limits. Available at http://phystat-lhc.web.cern.ch/phystat-lhc/2008-001.pdf.
Heinrich, J. and Lyons, L. (2007). Systematic errors. Annual Reviews of Nuclear and Particle Science 57 145–169.
Heinrich, J. et al. (2004). Interval estimation in the presence of nuisance parameters. 1. Bayesian approach. CDF Note 7117. Available at http://www-cdf.fnal.gov/publications/cdf7117_bayesianlimit.pdf.
Höcker, A. et al. (2007). TMVA: Toolkit for multivariate data analysis. Available at http://phystat-lhc.web.cern.ch/phystat-lhc/2008-001.pdf.
James, F., Lyons, L. and Perrin, Y. (2000). Workshop on confidence limits. CERN Yellow Report 2000-05. http://documents.cern.ch/cgi-bin/setlink?base=cernrep&categ=Yellow Report&id=2000-005#top.
Klein, J. R. and Roodman, A. (2005). Blind analysis in nuclear and particle physics. Annual Review of Nuclear and Particle Physics 55 141.
Linnemann, J. (2007). A pitfall in evaluating systematic errors. Available at http://phystat-lhc.web.cern.ch/phystat-lhc/2008-001.pdf.
Lyons, L. (1999). Comparing two hypotheses. Available at http://www-cdf.fnal.gov/physics/statistics/statistics_recommendations.html.
Lyons, L. (2008). Supplement to “Open statistical issues in particle physics.” DOI: 10.1214/08-AOAS163SUPP.
Lyons, L., Martin, A. and Saxon, D. (1990). On the determination of the B lifetime by combining the results of different experiments. Phys. Rev. D 41 982.
Lyons, L., Mount, R. and Reitmeyer, R., eds. (2003). PHYSTAT 2003. eConf C030908, SLAC-R-703. Available at http://www.slac.stanford.edu/econf/C030908/proceedings.html.
Lyons, L. and Ünel, M. K. (2005). Statistical problems in particle physics, astrophysics and cosmology. Imperial College Press, London. Available at http://www.physics.ox.ac.uk/phystat05/. MR2270215
Narsky, I. (2000). Comparison of upper limits. Available at http://conferences.fnal.gov/cl2k/copies/inarsky.pdf.
Narsky, I. (2006). StatPatternRecognition: A C++ package for multi-variate classification. Available at http://www.hep.caltech.edu/~narsky/SPR/SPR narsky chep2006.pdf.
Neal, R. (2007). Computing likelihood functions when distributions are defined by simulators with nuisance parameters. Available at http://phystat-lhc.web.cern.ch/phystat-lhc/2008-001.pdf.
Nicolo, D. and Signorelli, G. (2002). Application of strong confidence to the CHOOZ experiment with frequentist inclusion of nuisance parameters. Available at http://www.ippp.dur.ac.uk/old/Workshops/02/statistics/proceedings/signorelli.pdf.
Particle Data Group (2006). J. Phys. G: Nucl. Part. Phys. 33 1 (see page 14).
Prosper, H. B. (2002). Multivariate analysis: A unified perspective. Available at www.ippp.dur.ac.uk/Workshops/02/statistics/papers/prosper HBPDurham2002.ppt.
Prosper, H. B., Lyons, L. and De Roeck, A. (2007). PHYSTAT-LHC Workshop at CERN on Statistical Issues for LHC Physics. Available at http://phystat-lhc.web.cern.ch/phystat-lhc/2008-001.pdf.
Protassov, R. et al. (2002). Statistics: Handle with care. Detecting multiple model components with the likelihood ratio test. Astrophysics J. 571 545–559.
Punzi, G. (2003). Sensitivity of searches for new signals and its optimization. Available at http://www.slac.stanford.edu/econf/C030908/papers/MODT002.pdf.
Punzi, G. (2005). Ordering algorithms and confidence intervals in the presence of nuisance parameters. Available at http://www.physics.ox.ac.uk/phystat05/proceedings/files/Punzi PHYSTAT05 final.pdf.
Read, A. L. (2000). Modified frequentist analysis of search results (the $CL_s$ method). Available at http://doc.cern.ch/archive/electronic/cern/preprints/open/open-2000-205.pdf.
Read, A. L. (2004). Presentation of search results: the $CL_s$ method. J. Phys. G: Nucl. Part. Phys. 28 2693–2704.
Reid, N. (2007). Some aspects of design of experiments. Available at http://phystat-lhc.web.cern.ch/phystat-lhc/2008-001.pdf.
Reid, N., Linnemann, J. and Lyons, L. (2006). Workshop on statistical inference problems in high energy physics and astronomy. Available at http://www.birs.ca/birspages.php?task=displayevent&event_id=06w5054.
Roe, B. P. (2007). Statistical errors in Monte Carlo estimates of systematic errors. Nuclear Instruments and Methods A570 159–164.
Rolke, W. A., Lopez, A. M. and Conrad, J. (2005). Limits and confidence intervals in the presence of nuisance parameters. Nuclear Instruments and Methods A551 493–503.
Self, S. G. and Liang, K. Y. (1987). Asymptotic properties of maximum likelihood estimators and likelihood ratio test under non-standard conditions. J. Amer. Statist. Assoc. 82 605–610. MR0898365
Sen, B., Walker, M. and Woodroofe, M. (2008). On the unified method with nuisance parameters. Statist. Sinica. To appear.
Trotta, R. (2008). Bayes in the sky: Bayesian inference and model selection in cosmology. Contemporary Physics 49 71–104.
Whalley, M. and Lyons, L., eds. (2002). Advanced statistical techniques in particle physics, Durham IPPP/02/39. Available at http://www.ippp.dur.ac.uk/old/Workshops/02/statistics/proceedings.shtml.
Wilks, S. S. (1938). The large-sample distribution of the likelihood ratio for testing composite hypotheses. Ann. Math. Statist. 9 60–62.
Xie, Y. (2007). LHCb statistical wishlist. Available at http://phystat-lhc.web.cern.ch/phystat-lhc/2008-001.pdf.

Particle Physics
Denys Wilkinson Building
University of Oxford
Keble Road
Oxford OX1 3RH
United Kingdom
E-mail: l.lyons@physics.ox.ac.uk
