Frequentist tests for Bayesian models

Analogues of the frequentist chi-square and F tests are proposed for testing goodness-of-fit and consistency for Bayesian models. Simple examples exhibit these tests' detection of inconsistency between consecutive experiments with identical parameter…

Authors: L.B. Lucy

Frequentist tests for Bayesian models
Astronomy & Astrophysics manuscript no . lblucy c  ESO 201 8 April 17, 2018 F requen tist tests fo r Ba y es ian mo dels L.B.Lucy Astrophysics G roup, Blac kett Lab oratory , Imp erial College London, Prince Consort Road, London SW7 2AZ, UK Received ; Accepted ABSTRACT Analogues of the frequenti st chi-square and F tests are prop osed for testing goo dness-of-fit an d consistency for Bay esian mod els. Simple examples exh ibit these tests’ detection of inconsistency b etw een consecutive exp eriments with identical parameters, when the first exp eriment provides th e prior for the second. I n a related analysis, a q uantitativ e measure is derived for judging the degree of tension betw een tw o different exp erimen ts with partially o verlapping parameter vectors . Key wo rds. Meth o ds: d ata analysis – Metho ds: statistical 1. Int ro duction Bay esian statistical metho ds ar e now widely a pplied in a s- tronomy . Of the new tec hniques th us introduced, mo del selection (or comparison) is e s pe cially notable in that it replaces the frequentist a c ceptance-rejection pa radigm for testing h yp otheses . Th us, given a data set D , there might be several hypo theses { H k } tha t ha ve the potential to ex- plain D . F rom these, Bayesian mo del selection identifies the particular H k that b est explains D . The pro cedure is sim- ple, though computationally demanding: starting with an arbitrar y pair of the { H k } , we apply the mo del selec tion machinery , discar d the weaker hypothesis, replace it with one of the rema ining H k , and rep eat. This pro cedure usefully disposes of the weak est hy- po theses, but there is no guara ntee tha t the surviving H k explains D . If the corr ect hypothesis is not included in the { H k } , we are left with the ’bes t’ hypo thesis but are not made aw ar e that the search for an explanation should con- tin ue. In the co ntext of model selection, the next step would be a comparis on of this ’b est’ H k with the hypothesis that H k is fals e. But model selection fails at this p oint b ecause we cannot compute the requir e d likeliho o d (Sivia & Skilling 2006, p.8 4). Clea r ly , what is needed is a go o dness-of-fit cri- terion for Bay esia n mo dels. This issue is discuss e d b y P ress et a l. (2007, p.779 ). They note that “ There are no go o d fully Bay esia n metho ds for asse ssing go o dness- o f-fit ...” and go o n to r ep ort that “Sensible Bayesians usually fall back to p -v alue tail statis- tics ...when they really need to know if a mo del is wrong.” On the assumption that astronomers do rea lly need to know if their mo dels are wrong , this pape r a dopts a fre- quentest appr oach to tes ting Bay esia n mo dels. Although this approach ma y b e immediately abho rrent to co mmit- ted Bayesians, the r ole o f the tests prop osed herein is merely to pr ovide a qua nt itative mea sure acco r ding to which Bayesians decide whether their mo dels are satisfac- tory . When they are, the Bay esian infer ences are pr esented - and with incr eased confidence. When not, flaws in the mo d- Send offprint r e quests to : L.B.Lucy els or the da ta must b e so ught, with the aim of e ventually achieving satisfactory Bay esia n inferences. 2. B a y esian m o dels The term Bay esia n mo del - subseq uent ly denoted by M - m ust now b e defined. The na tural definition of M is that which m ust be sp ecified in orde r that B ay esian inferences can be drawn fro m D - i.e., in order to compute p os terior probabilities. This definition implies that, in addition to the hypothesis H , which intro duces the parameter vector α , the prior proba bility distribution π ( α ) mu st also b e included in M . Thus, symbolic a lly , we write M ≡ { π , H } (1) It follows that differen t Ba yesian mo dels ca n share a com- mon H . F or exa mple, H may b e the hypothesis that D is due to Kepler ian motion. But for the motio n of a star ab out the Galaxy’s central black ho le , the a ppropriate π will dif- fer fr om tha t for the r eflex or bit of star due to a planetar y companion. A further consequence is that a failure of M to e x plain D is not necess arily due to H : an inappropria te π is als o a po ssibility . T o astronomer s accustomed to working only with uni- form priors, the no tion that a Bay esian mo del’s p o or fit to D could b e due to π might b e surprising . A specific cir- cumstance where π could be at fault arises when Bay esian metho ds ar e used to rep eatedly up date our knowledge of some pheno menon - e.g., the v a lue of a fundamental con- stant that ov er the years is the sub ject of numerous ex- per iments ( X 1 , . . . , X i , . . . ), usually of increasing precisio n. With an orthodox Bay esian approa ch, X i +1 is analy sed with the prior set equa l to the p oster ior fr om X i . Thus π i +1 ( α ) = p ( α | H , D i ) (2) This is the classical use o f Bay es theore m to update our opinion by incorp or ating past exp er ience. Howev er, if X i is flawed - e.g., due to an unrecognized systematic error - 1 Lucy: T ests for Bay esian mo dels then this c hoice of π impac ts negatively on the ana lysis of X i +1 , leading p erhaps to a p o or fit to D i +1 Now, since s ubsequent fla wless exp eriments result in the decay o f the nega tive impact o f X i , this r ecursive pro cedure is self-cor recting, so a Bay esia n might a rgue tha t the prob- lem can b e ignored. But scientists feel obliged to reso lve such discrepancies b efor e publishing or underta k ing further exp eriments a nd therefore need a statistic that quantifies any failure of M to explain D . 3. A go o dness-of-fit statistic for Ba y esian m o dels The most widely used go o dness-o f-fit statistic in the fre- quentist a pproach to hypo thesis testing is χ 2 , whos e v alue is determined by the residuals b etw een the fitted mo del and the data, with no input fro m pr ior knowledge. Thus, χ 2 0 = χ 2 ( α 0 ) (3) is the go o dnes-of-fit statistic for the minimum- χ 2 solution α 0 . A simple analog ue o f χ 2 0 for Bayesian mo dels is the p os- terior mean of χ 2 ( α ), h χ 2 i π = Z χ 2 ( α ) p ( α | H , D ) dV α (4) where the p osterio r distribution p ( α | H, D ) = π ( α ) L ( α | H , D ) R π ( α ) L ( α | H , D ) dV α (5) Here L ( α | H , D ) is the lik eliho o d o f α g iven data D . Note that since h χ 2 i π depe nds on bo th constituents of M , namely H and π , it has the potential to detect a p o o r fit due to either or b oth b eing a t fault, as req uir ed by Sect.2 . In Eq.(4) the subscript π is attached to h χ 2 i to str e ss that a non-trivial, informative prior is included in M . On the o ther hand, when a uniform prio r is assumed, h χ 2 i is independent of the prio r and is then denoted by h χ 2 i u . The Bayesian go o dness- of-fit statistic h χ 2 i u is used in Lucy (2 014; L14) to illustrate the detection of a weak sec- ond or bit in simulations of Gaia scans of an a strometric binary . In that case , H states that the scan res idua ls a r e due to the reflex K eplerian o r bit caused by one invisible companion. With increasing amplitude of the s econd orbit, the p oint comes when the inv estiga tor will surely a bandon H - i.e., abandon the assumption of just one companion - see Fig.12, L14. 3.1. P -values With the classical frequentist acceptance-r ejection paradigm, a n ull hypothesis H 0 is rejected on the basis of a p -v alue ta il sta tistic. Thus, with the χ 2 test, H 0 is rejected if χ 2 0 > χ 2 ν,β , where p ( χ 2 ν > χ 2 ν,β ) = β , and accepted otherwise. Here ν = n − k is the num be r of deg rees o f freedom, where n is the num b er of meas ur ements and k is the nu mber of parameter s introduced by H 0 , and β is the designated p -thresho ld chosen by the inv estiga tor. F or a Bay esia n mo del, a p -v alue can b e co mputed from the h χ 2 i π statistic, whose approximate distribution is de- rived b elow in Sect.5.1. How ever, a shar p transition from acceptance to rejection of the null hypothesis at some desig- nated p - v alue is no t reco mmended. Firs t, p -v a lues ov ers tate the strength of the evidence ag ainst H 0 (e.g., Sellke et al. 2001). In particular, the v alue p = 0 . 05 recommended in elementary texts does no t imply strong evidence againts H 0 . Second, the p -v a lue is b est reg a rded (Sivia & Skilling 2006, p.85) as serving a qualitative purp ose, with a small v alue prompting us to think ab out alter na tive hypotheses. Thu s, if h χ 2 i π exceeds the chosen threshold χ 2 ν,β , this is a w ar ning that something is a miss a nd s hould b e in vesti- gated, with the degr ee o f concern increasing as β decrea s es. If the β = 0 . 001 threshold is exceeded, then the inv estigator would be w ell-adv ised to sus pec t that M or D is at fault. Although statistics texts emphasiz e tests o f H not D , as- tronomers know that D can b e co r rupted by biases or ca li- bration erro rs. Departures from normally-dis tr ibuted errors can also increas e h χ 2 i π . If D is not a t fault, then M is the culprit, implying that either π o r H is at fault. If the fault lies with π not H , then we exp ect that h χ 2 i u < χ 2 ν,β even though h χ 2 i π > χ 2 ν,β . If neither D nor π ca n b e faulted, then the inv estigator m ust s eek a refined or alterna tive H . 3.2. T yp e I and ty p e I I errors In the frequentist appr oach to hypothesis testing , decision error s ar e said to be o f type I if H is true but the test says reject H , and of t yp e I I if H is false but the test says accept H . Since testing a Bay esian model is not co nc e r ned exclusively with H , these definitions must b e revised, as follows: A type I erro r a rises when M and D are flawless but the statistic (e.g., h χ 2 i π ) exceeds the designa ted thre s hold. A t yp e I I error arise when M o r D are flaw ed but the statistic do es not exceed the desig nated thr e shold. Here the words accept and r eject are av oided. Mor e ov er, no par ticular thresho ld is mandator y: it is at the discre tio n of the in vestigator and is c hosen with rega r d to the co nse- quences of making a decisio n er ror. 4. S tatistics of h χ 2 i The intuitiv e understanding that scientists have r egarding χ 2 0 derives fro m its simplicity and the rigo rous theorems on its distribution that allow us to derive co nfidence regio ns for multi-dimensional linear mo dels (e.g., Press et al. 2007 , Sect.15.6.4). Rigorous statistics for χ 2 require tw o assumptions : 1) that the mo del fitted to D is linear in its parameters, and 2) that measurement err ors ob ey the normal distribution. Nevertheless, even when these standard assumptions do not strictly hold, s cientists s till commo nly rely on χ 2 0 to g auge go o dness-o f-fit, with perha ps Mo nt e Carlo (MC) sampling to pro vide justification or ca libration (e.g ., P ress et a l. 2 007, Sect.15.6.1). Rigorous results for the statistic h χ 2 i ar e therefore of in- terest. In fact, if we a dd the assumption of a uniform prior to the ab ov e standard assumptions, then w e may pr ov e (Appendix A) tha t h χ 2 i u = χ 2 0 + k (6) where k is the num ber o f parameter s. 2 Lucy: T ests for Bay esian mo dels Given that Eq.(6) is exact under the stated assump- tions, it follows that the quantit y h χ 2 i u − k is distributed as χ 2 ν with ν = n − k degrees of freedom. F or minimum- χ 2 fitting of linear mo dels, the s o lution is alwa ys a b etter fit to D than is the true solution. In par- ticular, E ( χ 2 tr ue ) = n , whereas E ( χ 2 0 ) = n − k . Accor dingly , the + k term in E q.(6) ’corrects’ χ 2 0 for overfitting and so E ( h χ 2 i u ) = E ( χ 2 tr ue ) - i.e., the expected v alue o f χ 2 for the actual measurement err ors. 4.1. Effect of an informative prio r When an informative pr ior is included in M , an analytic formula for h χ 2 i π is in general not av ailable. How ever, its approximate statistical prop er ties a re r eadily found. Consider again a linear model with normally-distributed error s and supp ose further that the ex pe r iment ( X 2 ) is without flaws. The χ 2 surfaces are then self-s imilar k - dimensional ellipsoids with minim um at α 0 ≈ α true , the unknown exact so lution. Let us no w further suppo s e that the informative pr io r π ( α ) derives from a previous flawless exp eriment ( X 1 ). The prior π will then be a conv ex (bell- shap ed) function with maximum at α max ≈ α 0 . Now, con- sider a 1-D family of suc h functions all cen tred on α 0 and ranging from the very bro ad to the very narrow. F or the former h χ 2 i π ≈ h χ 2 i u ; for the latter h χ 2 i π ≈ χ 2 0 . Th us, ideally , when a Bay esia n mo de l M is applied to data D w e exp ect that h χ 2 i u > ∼ h χ 2 i π ≥ χ 2 0 (7) Now, a uniform prior and δ ( α − α 0 ) ar e the limits of the ab ov e family o f bell- shap ed functions. Since the delta func- tion limit is not likely to b e c losely approached in practice, a first appr oximation to the distribution of h χ 2 i π is that of h χ 2 i u - i.e ., that o f χ 2 n − k + k . The ab ov e discussion a ssumes faultles s X 1 and X 2 . But now supp ose that there is an inconsistency betw een π and X 2 . The p eak of π a t α max will then in genera l b e offset from the minimum of χ 2 at α 0 . Accordingly , in the calcu- lation of h χ 2 i π from Eq.(4), the neighbourho o d of the χ 2 minim um χ 2 0 at α 0 has reduced weigh t relativ e to χ 2 ( α max ) at the p eak of π . E vidently , in this c ir cumstance, h χ 2 i π can greatly exceed h χ 2 i u , and the inv estigato r is then aler ted to the inconsistency . 5. Numerical exp eriments Given that r igoro us res ults h χ 2 i ar e not av ailable for infor- mative pr iors, num er ic al tests ar e ess e n tial to illustrate the discussion of Sect.4.1. 5.1. A toy mo del A simple example with just one pa rameter µ is a s follows: H states that u = µ , and D co mprises n mea surements u i = µ + σ z i , wher e the z i here and b elow are indepe ndent gaussian v ar iates randomly sampling N (0 , 1). In cr eating synthetic data, we set µ = 0 , σ = 1 a nd n = 100. In the first numerical exp eriment, t wo independent data sets D 1 and D 2 are created compris ing n 1 and n 2 measure- men ts, res p ectively . On the a ssumption of a uniform prior , the p osterior density of µ der ived from D 1 is p ( µ | H, D 1 ) = L ( µ | H, D 1 ) R L ( µ | H, D 1 ) dµ (8) 0 0.01 0.02 0.03 80 120 Probability density < χ 2 > π PDF of < χ 2 > π N = 10 6 Fig. 1. Empirical PDF of h χ 2 i π derived from the analysis of 10 6 data pairs D 1 , D 2 as described in Sect.5.1. The dashed curve is the theoretical PDF for h χ 2 i u . W e no w r egard p ( µ | H , D 1 ) as prio r knowledge to b e tak en int o acc o unt in analysing D 2 . Thus π ( µ ) = p ( µ | H , D 1 ) (9) so that the p osterio r distributio n derived from D 2 is p ( µ | H, D 2 ) = π ( µ ) L ( µ | H, D 2 ) R π ( µ ) L ( µ | H , D 2 ) dµ (10) The statistic quantifying the go o dness -of-fit achiev ed when M = { π , H } is applied to da ta D 2 is then h χ 2 i π = Z χ 2 ( µ ) p ( µ | H , D 2 ) dµ (11) F ro m N indep endent data pair s ( D 1 , D 2 ), we obta in N independent v a lue s of h χ 2 i π th us allowing us to test the exp ectation (Sect.4.1) that this statistic is approximately distributed as h χ 2 i u . In Fig.1 , this empirica l PDF is plot- ted together the theor etical PDF for h χ 2 i u . The accuracy of the approximation at la r ge v alues of χ 2 is of sp ecial im- po rtance. F o r n = 100 and k = 1, the 0.0 5 ,0.01 and 0 .001 critical p oints o f h χ 2 i u are 124.2 , 1 3 5.6 and 149.2, respe c- tively . F ro m a simulation with N = 10 6 , the num b er o f h χ 2 i π v alues exceeding these thresholds are 50 177, 1 0011 and 102 5, resp ectively . Thus, the frac tion o f h χ 2 i π exceed- ing the c ritical v a lues derived from the distribution of h χ 2 i u are clo se to their pr edicted v a lues. According ly , the co nv en- tional int er pretation of these critica l v alues is v alid. In this exp eriment, the analysis of X 2 bene fits from knowledge g ained from X 1 . W e exp e ct ther efore that h χ 2 i π is less than h χ 2 i u , since replac ing the unifor m prio r with the informative π obtaine d from X 1 should improv e the fit. F ro m 10 6 rep etitions, this prov es to b e so with probabil- it y 0.68 3. Sampling noise in D 1 and D 2 accounts for the shortfall. 3 Lucy: T ests for Bay esian mo dels 50 100 150 200 250 0 0.5 1 1.5 2 2.5 < χ 2 > b / σ Detection of bias in X 1 X 1 precedes X 2 Fig. 2. Detection of bias in X 1 with the h χ 2 i π statistic when X 2 is analysed with prior deriv ed from X 1 . V alues of h χ 2 i π (filled circles) and h χ 2 i u (op en circles) are plotted against the bias parameter b/σ . The d ashed lines are the 5 and 95% levels . 5.2. Bias test When X 1 and X 2 are flawless, the statistic h χ 2 i π indicates doubts - i.e ., type I erro rs (Sect.3.2) - ab out the m utual consistency of X 1 and X 2 with just the exp ected frequency . Thu s, with the 5% thresho ld, doubts arise in 5 . 02% of the ab ov e 10 6 trials. This encourag es the use o f h χ 2 i π to detect inconsistency . Accordingly , in a second test, X 1 is flawed due to biase d measurements. Thus, now u i = µ + σ z i + b fo r D 1 . As a re- sult, the prior for X 2 obtained fr om E q.(9) is compromised, and this impacts on the statistic h χ 2 i π from Eq.(11). In Fig.2, the v alues of h χ 2 i π and h χ 2 i u are plotted against b/σ . Since the compro mis ed prior is excluded from h χ 2 i u , its v alues dep end o nly on the fla wless data sets D 2 , and so mostly fall within the (0 . 05 , 0 . 95) interv al. In co n- trast, as b /σ increases, the v alues o f h χ 2 i π are incr easingly affected by the compro mised prior . The mu tual co nsistency of X 1 and X 2 is assess e d on the basis of h χ 2 i π , with choice of critical level at our discr etion. How ever, when h χ 2 i π exceeds the 0 . 1% level at 149 . 2, we would surely conclude that X 1 and X 2 are in conflict and seek to resolve the discrepancy . On the o ther hand, when inconsistency is not indicated, we may accept the Bay esian inferences derived fro m X 2 in the confident b elief that in- corp ora ting prior knowledge from X 1 is justified and ben- eficial. This test illus trates the important point that an inappropria te π can cor rupt the B ay esian inferences dr awn from a flawless exp eriment. Th us, in this case , the bias in D 1 propaga tes into the poster ior p ( µ | H , D 2 ) derived from X 2 . This can b e (and is) av oided by preferr ing the freq uen- tist methodo logy . But to do so is to forgo the grea t merit of Bay esian inference, namely its ability to incor p o rate infor- mative prio r information (F eldman & Cousins 1 9 98). If one do es therefore prefer Bay esian inference , it is evident that a go o dness-of-fit statistic such as h χ 2 i π is essential in orde r to detect error s propag ating into the p o sterior distribution from an ill-chosen prior . 5.3. Order reversed In the ab ov e test, the a na lysis of X 2 is preceded by that of X 1 . This o rder can be reversed. Th us, with the same N data pairs ( D 1 , D 2 ), we now first analys e X 2 with a uni- form pr ior to o btain p ( µ | H , D 2 ). This be comes the prior for the analysis of X 1 . This analys is then gives the po sterior p ( µ | H, D 1 ) from which a new v alue of h χ 2 i π is obtained. When the v alues of h χ 2 i π obtained with this reversed o r- der of analy sis a r e plotted aga inst b/σ , the res ult is similar to Fig.2, implying that the order is unimp ortant. Indeed, statistically , the same decision is reached indep e nden tly of order. F or example, for 10 5 independent data pairs ( D 1 , D 2 ) with b/σ = 1, the n umber with h χ 2 i π > 12 4 . 2, the 5 % threshold, is 502 67 when X 1 precedes X 2 and 5 0149 when X 2 precedes X 1 . 5.4. Alternative statis ti c Noting that the Bay esian evidence is = ¯ L , the pr ior- weigh ted mean o f the lik eliho o d, we can, under standar d assumptions, write ¯ L ∝ exp  − 1 2 χ 2 eff  (12) where the effective χ 2 (Bridges et al. 20 09) is χ 2 eff = − 2 ln Z π ( α ) exp  − 1 2 χ 2 ( α )  dV α (13) This is a po ssible alternative to h χ 2 i π defined in Eq .(4). How ever, in the test o f Sect.5.1, the tw o v alues are so near ly ident ical it is immateria l which mean is used. Here h χ 2 i π is preferre d because it remains well-defined for a uniform prior, for which an a na lytic r esult is av a ilable (App endix A). Because h χ 2 i π and χ 2 eff are nearly identical, the distri- bution of χ 2 eff should approximate tha t of h χ 2 i u (Sect.4.1). T o test this, the exper iment of Sect.5.1 is rep eated with χ 2 eff replacing h χ 2 i π . F r om a s im ulatio n with N = 10 6 , the nu mber of χ 2 eff v alues exceeding the 0.05,0.01 and 0.001 thresholds ar e 5 0167, 9951 and 9 70, resp ectively . Thus if χ 2 eff is c hos en as the go o dness-of- fit s tatistic, accurate p - v alues ca n b e derived on the assumption that χ 2 eff − k is distributed a s χ 2 ν with ν = n − k degrees of freedom. F ro m Sect.4.1, we exp ect these p -v alues to b e accur ate if π ( α ) is not mo re sharply p eaked than L ( α | H , D ). 6. A n F statistic fo r Bay esian mo dels Insp e ction of Fig.2 shows that a mor e pow er ful test of in- consistency betw een X 1 and X 2 m ust exist. A sys tematic displacement of h χ 2 i π relative to h χ 2 i u is already evident even when h χ 2 i π is below the 5 % threshold at 124.2. This suggests that a Bay esia n analogue of the F statistic b e con- structed. 4 Lucy: T ests for Bay esian mo dels 6.1. The frequentist F -test In frequentist statistics, a sta ndard result (e.g., Hamilton 1964, p.139 ) in the testing of line ar h yp otheses is the fol- lowing: we define the statistic F = n − i j χ 2 c − χ 2 0 χ 2 0 (14) where χ 2 0 is the minimum v alue of χ 2 when all i param- eters are adjusted, a nd χ 2 c is the minimum v a lue when a linear constra int is imposed on j ( ≤ i ) pa rameters, so that only the r emaining i − j ar e adjusted. Then, o n the null hypothesis H that the constraint is true, F is distributed as F ν 1 ,ν 2 with ν 1 = j and ν 2 = n − i , where n is the num ber of meas ur ements. Accordingly , H is tested b y comparing the v alue F given b y E q.(14) with critica l v alues F ν 1 ,ν 2 ,β derived from the distribution o f F ν 1 ,ν 2 . Note that when j = i , the constraint c o mpletely deter- mines α . If this v alue is α ∗ , then χ 2 c = χ 2 ( α ∗ ) and H states that α ∗ = α tr ue . A pa rticular mer it of the statistic F is that it is inde- pendent o f σ . Howev er, the resulting F -test do es assume normally-dis tr ibuted measurement err ors. 6.2. A Bay esian F In a Bay esian context, the frequentist hypothesis that α tr ue = α ∗ is replaced b y the statemen t that α tr ue ob eys the po sterior distribution p ( α | H, D 2 ). Thus an exact con- straint is replaced by a fuzzy co ns taint. Adopting the simplest a pproach, w e define F , a Bay esian analo g ue of F , by taking χ 2 c to b e the v alue at the p osterior mean of α , h α i π = R α π ( α ) L ( α | H , D 2 ) dV α R π ( α ) L ( α | H , D 2 ) dV α (15) where π ( α ) is the infor mative prior. Consideratio ns of ac- curacy when v alues of χ 2 are computed on a gr id suggest we take χ 2 0 to b e the v alue at h α i u = R α L ( α | H , D 2 ) dV α R L ( α | H , D 2 ) dV α (16) the p osterior mean for a unifor m pr ior. With χ 2 c and χ 2 o th us defined, the Bayesian F is F = n − i j χ 2 ( h α i π ) − χ 2 ( h α i u ) χ 2 ( h α i u ) (17) and its v alue is to b e compared w ith the chosen threshold F ν 1 ,ν 2 ,β when testing the consistency o f X 1 and X 2 . Since h α i u is independent of π ( α ), the statistic F mea sures the effect of π ( α ) in displa cing h α i π from h α i u . In this Bayesian version of the F -test, the null hypoth- esis states that the p os ter ior mean h α i π = α tr ue . This will be approximately tr ue when a flawless Bay esia n mo del M is applied to a flawless data set D . Ho wev er , if the cho- sen threshold o n F is excee de d, then one or more of π , H and D is suspe ct as discussed in Sect.3.1. If the threshold is not exceeded, then B ay esian inferences drawn from the po sterior distribution p ( α | H , D 2 ) are supp orted. 6.3. T est of p -values from F If Eq.(17) gives F = F ∗ , the c o rresp onding p -v alue is p ∗ = Z ∞ F ∗ F ν 1 ,ν 2 dF (18) where ν 1 = j a nd ν 2 = n − i . The accuracy of these p -v alues can b e tested with the 1 -D toy model of Sect.5.1 as follows: (i) Indep endent data sets D 1 , D 2 are crea ted cor re- sp onding to X 1 , X 2 . (ii) With π from X 1 , the qua ntities h µ i ∗ π and F ∗ are calculated for X 2 with Eqs. (15)-(17 ). (iii) M indep endent data s e ts D m are now created with u i = h µ i ∗ π + σ z i . (iv) F or each D m , the new v alue of χ 2 ( h α i ∗ π ) is ca lcu- lated for the h µ i ∗ π derived at step (ii). (v) F or each D m , the new v alue of χ 2 ( h α i u ) is ca lc u- lated with h α i u from Eq.(1 6 ). (vi) With these new χ 2 v alues , the statistic F m is obtained from Eq.(1 7). The resulting M v a lues of F m give us an empirical es- timate of p ∗ , namely f ∗ , the fractio n of the F m that ex- ceed F ∗ . In one example of this test, a data pair D 1 , D 2 gives h µ i ∗ π = 0 . 089 and F ∗ = 1 . 0021. With ν 1 = 1 and ν 2 = 99, Eq.(1 8 ) then gives p ∗ = 0 . 3192 . This is chec ked by creating M = 10 5 data s ets D m with the a ssumption that µ tr ue = 0 . 089. The resulting empirical estimate is f ∗ = 0 . 3189 , in close agre ement with p ∗ F ro m 100 indep endent pair s D 1 , D 2 , the mean v alue of | p ∗ − f ∗ | is 0 . 00 1. This simulation confirms the accur acy of p -v alues der ived from Eq.(18 ) a nd therefor e of dec ision thresholds F ν 1 ,ν 2 ,β . 6.4. Bias test T o inv estiga te this Bay esian F -test’s ability to detect in- consistecy b etw een X 1 and X 2 , the bia s test of Sect.5.2 is r e p ea ted, again with n 1 = n 2 = 1 00. The vector α in Sect.6.1 now b ecomes the sca la r µ , a nd i = 1. In Fig.3, the v a lues of F from E q.(17) with j = i = 1 are plotted aga inst the bias parameter b/σ . Also plotted are critical v alues derived from the dis tribution F ν 1 ,ν 2 with ν 1 = 1 a nd ν 2 = 99. In contrast to Fig.2 for the h χ 2 i π statistic, inco ns istency b etw een X 1 and X 2 is now detected down to b/σ ≈ 0 . 5 . In this test of inco nsistency , the flaw in X 1 is the bias b . Now, if it were known for certain that this was the flaw, then a Bay esian analysis with H 1 changed from u = µ to u = µ + b - i.e ., with an extr a pa rameter - is sta ightforw ard. The result is the p oster ior density for b , allowing for correc- tion. In contrast, the detection of a flaw with h χ 2 i π or F is cause-indep endent. Although Figs.2 a nd 3 hav e b/σ as the abscissa, for r eal exp eriments this quantit y is not known and c o nclusions are drawn just from the or dinate. 5 Lucy: T ests for Bay esian mo dels 0 5 10 15 20 25 30 0 0.2 0.4 0.6 0.8 Bayes F b/ σ Detection of inconsistency 0.1% 5% 50% Fig. 3. Detection of inconsistency b etw een X 1 and X 2 . V alues of F are plotted against the bias p arameter b/σ . The dashed lines are th e 0 . 1 , 5 and 50% levels . N ote that the horizontal scale differs from Fig.2 7. T ension b et wee n exp eriments In the ab ov e, the go o dnes s -of-fit of M to D is inv estigated taking in to account the p ossibility that a p o or fit might be due the prior derived from a pr e vious exp eriment. A related g o o dness-of-fit iss ue co mmonly ar ises in moder n astrophysics, pa rticularly cosmology , namely whether esti- mates of no n-identical but partially overlapping parameter vectors fr om different exp er imen ts ar e mut ually co nsistent. The term c o mmonly use d in this r egard is tensio n, with inv estigators often rep o rting their subje ct ive assess men ts - e.g., there is mar ginal tension b etw een X 1 and X 2 - ba sed on the t wo credibility domains (often multi-dimensional) for the parameter s in co mmon. In a recent pap e r, Seeha rs et al. (2 015) re v iew the at- tempts in the cosmologica l litera ture to q uantif y the con- cept o f tension, with emphasis on CMB exper iment s. Below, we dev elop a rather different approach base d on the F statistic defined in Sect.6.2. Since detecting and re solving conflicts b etw een exp er- imen ts is essential fo r sc ie n tific progress , it is desirable to quantify the term tensio n and to optimize its detection. The conjecture here is that this optimum is achiev ed when inferences from X 1 are imp ose d on the a na lysis o f X 2 7.1. Identical parameter vectors A special case of a ssessing tension betw een exp eriments is that where the parameter vectors are identical. But this is just the problem inv estigated in Sects. 5 and 6. When X 1 (with bias) and X 2 (without bia s) are sepa- rately analysed, the limited ov erlap of the tw o credibility int erv als for µ provides a qualitative indication of tension. How ever, if X 2 is analysed with a prior derived from X 1 , then the statistic h χ 2 i π - see Fig.2 - or, more p ow er fully , the statistic F - see Fig. 3 - provide a quantitative measure to infor m statements ab out the de g ree o f tension. 7.2. Non-id e n tical parameter vectors-I W e now suppos e that D 1 , D 2 are data sets from differ e n t exp eriments X 1 , X 2 designed to test the hypotheses H 1 , H 2 . How ever, thoug h different, these hypotheses hav e para me- ters in common. Sp ecifica lly , the parameter vectors of H 1 and H 2 are ξ = ( α , β ) and η = ( β , γ ), resp ectively , and k , l, m are the num b ers o f elements in α , β , γ , res p ectively . If X 1 and X 2 are a na lysed indep endently , w e may w ell find that M 1 and M 2 provide satisfactory fits to D 1 and D 2 and y et still be oblig ed to rep or t tension b etw een the exp eriments b ecause o f a per ceived inadequa te ov erlap of the tw o l - dimensional cr edibility domains for β . A quan titative measur e of tens ion b etw een X 1 and X 2 can b e derived via the priors, as follows: The analys is of X 1 gives p ( ξ | H 1 , D 1 ), the p osterior distr ibution o f ξ , from which we may derive the p os terior distribution of β by in- tegrating ov er α . Th us p ( β | H 1 , D 1 ) = Z p ( ξ | H 1 , D 1 ) dV α (19) Now, for a Bayesian analys is of X 2 , we must specify π ( η ) throughout η - s pace, not just β -space . But π ( η ) = π ( β , γ ) = π ( γ | β ) π ( β ) (20) Accordingly , what we infer from X 1 can b e imp ose d on the analysis of X 2 by writing π ( η ) = π ( γ | β ) p ( β | H 1 , D 1 ) (21) The conditiona l prior π ( γ | β ) must now be sp e c ified. This can b e taken to b e uniform unless we hav e prio r knowledge from other sources - i.e., not from D 2 . With π ( η ) sp ecified, the analy s is of X 2 gives the p oste- rior density p ( η | H 2 , D 2 ). As in Sect.6.2, we now reg ard this as a fuzzy c onstraint on η from whic h we compute the sharp constraint h η i π given b y Eq.(15). Now, in gene r al, h η i π will be displaced from h η i u given by Eq.(16). The q uestion then is: Is the incr e men t in χ 2 ( η | H 2 , D 2 ) b etw een h η i π and h η i u so large that we must acknowledge tension b etw een X 1 and X 2 ? F ollowing Sect.6, we answer this question by computing F from E q.(17) with i = j = l + m , the total n umber parameters in η . The result is then co mpared to selected critical v a lues from the F ν 1 ,ν 2 distribution, where ν 1 = l + m and ν 2 = n 2 − l − m . With standar d assumptions, F ob eys this distribution if h η i π = η tr ue - i.e, if h β i π = β tr ue and h γ i π = γ tr ue - see Sect.6.3. 7.3. A toy mo del A simple example with o ne pa r ameter ( µ ) for X 1 and tw o ( µ, κ ) for X 2 is as follows: H 1 states that u = µ a nd H 2 states that v = µ + κx . The data set D 1 comprises n 1 measurements u i = µ + σ z i + b , where b is the bias. The data set D 2 comprises n 2 measurements v j = µ + κx j + σz j , where the x j are uniform in ( − 1 , +1). T he parameter s are µ = 0 , κ = 1 , σ = 1 and n 1 = n 2 = 1 00. In the notation of Sect.7.2, the vectors β , γ con tra ct to the scalars µ, κ , whence l = m = 1, a nd α do es not a ppe a r, whence k = 0. 6 Lucy: T ests for Bay esian mo dels 0 5 10 0 0.2 0.4 0.6 0.8 Bayes F b / σ Detection of tension between X 1 and X 2 0.1% 5% 50% Fig. 4. Detection of tension b etw een different ex p erimen ts. V alues of F are plotted against the bias parameter b/σ . The dashed lines are the 0 . 1 , 5 and 50% levels. 7.4. Bias test In the a bove, H 1 are H 2 are different hypotheses but hav e the pa rameter µ in commo n. If b = 0, the analyses of X 1 are X 2 should give s imilar credibility interv a ls for µ and therefore no tension. But with sufficien tly large b , tensio n should arise. This is inv estigated following Sect.7.2 . Applying M 1 to D 1 , we der ive p ( µ | H 1 , D 1 ). Then, tak ing the conditiona l prior π ( κ | µ ) to b e co nstant, we obtain π ( µ, κ ) ∝ p ( µ | H 1 , D 1 ) (22) as the pr ior for the analysis of X 2 . This gives us the p os- terior distribution p ( µ, κ | H 2 , D 2 ), whic h is a fuzzy con- straint in ( µ, κ )-space. Replacing this by the sharp con- straint ( h µ i , h κ i ), the co nstrained χ 2 is χ 2 c = χ 2 ( h µ i , h κ i| H 2 , D 2 ) (23) Substitution in Eq.(1 7 ) with j = i = 2, then g ives F a s a measure of the tension betw een X 1 and X 2 . Under standard assumptions, F is distributed as F ν 1 ,ν 2 with ν 1 = 2 , ν 2 = n 2 − 2 if ( h µ i , h κ i ) = ( µ, κ ) tr ue . In Fig.4, the v a lues of F a re plotted aga inst b/σ to gether with critical v alues for F ν 1 ,ν 2 with ν 1 = 2 , ν 2 = 9 8. This plot shows that tension is detected for b/σ > ∼ 0 . 6. This is slightly inferior to Fig.3 as is to b e exp ected b eca use of the mo re complicated X 2 . The impo rtance of Fig .4 is in s howing that the F statis- tic has the desir ed characteristics of reliably infor ming the inv estigator of tensio n betw een different e x pe r iments with partially overlapping parameter vectors. When the in- consistency is slight ( b/ σ < ∼ 0 . 2), this statistic do es not sound a false alarm. When the inconsistency is substantial ( b/σ > ∼ 0 . 6), the sta tistic do es no t fail to so und the a larm. 7.5. Non-id e n tical parameter vectors-I I If F calcula ted as in Sect.7.2 indica tes tension, the p os sible flaws include the conditional pr ior π ( γ | β ). Thus, tension could b e indicated even if the prior π ( β ) inferr e d from X 1 is a ccurate and consistent with D 2 . Accordingly , w e might pre fer to fo cus just o n β - i.e., on the parameters in common. If so , we co mpute h β i π = Z β p ( η | H 2 , D 2 ) dV η (24) and then find the minim um of χ 2 ( η | H 2 , D 2 ) when β = h β i π . Thus, the contrained χ 2 is now χ 2 c = min γ χ 2 ( h β i π , γ ) (25) The F test also applies in this case - s ee Sect.6.1. Thus this v a lue χ 2 c is substituted in Eq.(17), where w e now take j = l , the num ber of para meters in β . The resulting F is then compared to critical v alues derived fro m F ν 1 ,ν 2 with ν 1 = l , ν 2 = n 2 − l − m . With the standar d ass umptions , F ob eys this distribution if h β i π = β tr ue . F or the simple mo del of Sect.7.3, the resulting plot of F against b /σ is closely similar to Fig.4 and so is o mit- ted. Nevertheless, the option of constraining a s ubset of the pa rameters is likely to b e a p owerful dia g nostic to ol for complex, multi-dimensional pr oblems, iden tifying the pa- rameters contributing most to the tension revealed when the entire vector is cons trained (cf. Seehars et a l. 2 0 15). 8. Discussion and Conclusion A legitimate question to as k about the statistical analysis of data acquired in a s cientific exp eriment is: How well or how badly do es the mo del fit the data? Asking this question is, after all, just the last step in a pplying the scientific method. In a frequentist a na lysis, where the estimated parameter vector α 0 is typically the minimum- χ 2 po int , this ques tion is answered by rep orting the go o dness -of-fit sta tisic χ 2 0 = χ 2 ( α 0 ) or, e quiv a le n tly , the corresp onding p - v a lue. If a p o or fit is ther eby indicated, the inv estigator and the communit y are aw are tha t not only is the mo del called into q uestion but so also is the estimate α 0 and its confidence doma in. If the same data is sub ject to a Bay esian analysis, the same question m ust surely be asked: The Bayesian ap- proach do es not exempt the inv estiga tor from the obliga- tion to rep ort on the success or otherwise of the adopted mo del. In this case, if the fit is po or, the Bay esian mo del is called in to question and so also are inferences dr awn from the p osterior distribution p ( α | H, D ). As noted in Sect.1 , the difficulty in testing B ay esian mo dels is that there are no go o d fully B ayesian metho ds for assessing go o dness-of-fit. Accordingly , in this paper , a frequentist approa ch is advoca ted. Sp ecifically , h χ 2 i π is pro- po sed in Sect.3 as a suitable go o dness-of-fit statisic for Bay esian mo dels . Under the null hypo thesis that M and D are flawless and with the standar d assumptions of lin- earity and normally-dis tr ibuted err ors, then, as argued in Sect.4.1 and illustrated in Fig.1, h χ 2 i π − k is approximately distributed as χ 2 n − k , and so p -v a lues can b e computed. A p -v a lue thus der ived from h χ 2 i π quantifies the av erage go o dness-o f-fit provided by the po s terior distribution. In contrast, in a fr equentist minim um- χ 2 analysis, the p -v alue 7 Lucy: T ests for Bay esian mo dels quantifies the go o dness-of-fit provided by the p oint esti- mate α 0 . In the ab ov e, it is rega rded as self-eviden t that as- tronomers want to adhere to the scientific metho d b y always rep orting the g o o dness-of-fit achiev ed in Bay esian analyse s of o bserv ational da ta. How ever, Gelman & Shalizi (201 3), in an essay on the philoso ph y and pra ctice of Bay esia n statis- tics, note that inv estigator s who identify Bayesian inference with inductive inference regular ly fit and compar e models without chec king them. They deplor e this practice. Instead, these authors advocate the hypothetico -deductive approa ch in which model checking is c r ucial. As in this pape r, they discuss n on-Bayesian chec king o f B ay esian mo dels - sp ecif- ically , the deriv a tio n of p -v a lues from p osterior predictive distributions. Mor eov er, they also stress that the prior dis- tribution is a testable pa r t o f a Bay esian mo del. In the astronomical literature, the use of frequentist tests to v alidate Ba yesian mo dels is no t unique to this pap er. Recently , Harrison et al. (2015) hav e pr e sented an ingenious pro cedur e for v alidating multidimensional p oste- rior distributions with the frequentist Kolmogorov-Smirnov (KS) test for one- dimensional data. Their aim, a s here, is to tes t the entire Bayesian inference pro cedure. F req ue ntist testing a ls o arises in recent a pplications of the Kullback-Leibler divergence to quantify tension be- t ween c o smologica l pro b es (e.g. Seehars et al. 201 5). F or linear mo dels , and with the assumption of Gaussia n prior s and likelihoo ds , a term in the rela tive ent ro py is a statistic that measures tens io n. With these a ssumptions, the s tatis- tic follows a generalized χ 2 distribution, thus allowing a p -v alue to b e co mputed. Seehars et a l.(2015) a lso investigate v ario us purely Bay esian measures o f tensio n. They conclude that interpret- ing these mea sures on a fixed, problem- independent scale - e.g., the Jeffrey’s scale - can b e misleading - see also Nesseris & Garca-B e llido (201 3). A cknow le dgements. I thank D . J.Mortlo ck for comments on an early draft of this pap er, and A.H . Jaffe, M.P .Hobson and the r eferee for useful references. App endix A: Evaluation of h χ 2 i u If α 0 denotes the minimum- χ 2 solution, then α = α 0 + a (A.1) where a is the displacemen t from α 0 . Then, on the assumption of linearity , ∆ χ 2 ( a ) = χ 2 0 + X i,j A ij a i a j (A.2) where the A ij are the constan t elemen ts of the k × k curv ature ma- trix (Press et al . 2007, p.680), where k is the num b er of parameters. It f ollows that surfaces of constan t χ 2 are k -di mensional self-si milar ellipsoids cent ered on α 0 . Now, give n the second assumption of normally- distributed mea- suremen t er rors, the li k eliho o d L ( α ) ∝ exp  − 1 2 χ 2 0  × exp  − 1 2 ∆ χ 2  (A.3) Th us, i n the case of a uniform prior, the posterior mean of χ 2 is h χ 2 i u = χ 2 0 + R ∆ χ 2 exp( − 1 2 ∆ χ 2 ) dV α R exp( − 1 2 ∆ χ 2 ) dV α (A.4) Because surfaces of constan t ∆ χ 2 are self-si milar, the k -dimensional int egrals in Eq.(A.4) reduce to 1-D integrals. Suppose ∆ χ 2 = ∆ χ 2 ∗ on the surf ace of the ell ipsoid with volume V ∗ . If lengths are increased by the factor λ , then the new ellipsoid has ∆ χ 2 = ∆ χ 2 ∗ × λ 2 and V = V ∗ × λ k (A.5) With these scaling r elations, the integrals in Eq.(A.4) can b e trans- formed into integrals ov er λ . The r esul t is h χ 2 i u = χ 2 0 + 2 b R ∞ 0 λ k +1 exp( − bλ 2 ) dλ R ∞ 0 λ k − 1 exp( − bλ 2 ) dλ (A.6) where 2 b = ∆ χ 2 ∗ . The i n tegrals hav e now b een transformed to a known definite integral, Z ∞ 0 λ z exp( − bλ 2 ) dλ = 1 2 Γ( x ) b − x (A.7) where x = ( z + 1) / 2. Aplying this formula, we obtain h χ 2 i u = χ 2 0 + k (A.8) an exact result under the stated assumptions. References Bridges, M ., F eroz, F., Hobson, M . P . & Lasenb y , A . N. 2009, MNRAS, 400, 1075 F eldman, G. J., & Cousins, R. D . 1998, PhRvD, 57, 3873 Gelman, A. & Shalizi 2013, British Journal of Mathematical and Statistical Psyc hology , 66, 8 Hamilton, W.C. 1964 Statistics i n Physical Science (Ronald Pr ess, New Y ork) Harrison, D., Sutton, D., Carv alho, P . & Hobson, M. 2015, MNRAS, 451, 2610 Lucy , L.B. 2014 A&A, 571, 86 (L14) Nesseris, S. & Garca-Belli do, J. 2013, JCAP , 08, 036 Press, W.H., T eukolsky , S.A., V etterling, W.T., & Flannery , B.P . 2007 Numerical Recip es (Cambridge Uni v ersi t y Press, Cambridge) Seehars, S., Grandis, S., Amara, A., & Refregier, A. 2015 Sellke , T., Ba yarri, M.J., & Berger, J. 2001, The American Statistician, 55, 62 Sivia, D.S., & Skilling,J. 2006 Data Analysis: A Ba yesian T utorial , 2nd Ed. (Oxford Uni v ersi ty Pr ess, Oxford) 8

Original Paper

Loading high-quality paper...

Comments & Academic Discussion

Loading comments...

Leave a Comment