Approximate Bayesian computation (ABC) gives exact results under the assumption of model error


Authors: Richard D. Wilkinson

School of Mathematical Sciences, University of Nottingham, NG7 2RD
r.d.wilkinson@nottingham.ac.uk

November 26, 2024

Abstract

Approximate Bayesian computation (ABC) or likelihood-free inference algorithms are used to find approximations to posterior distributions without making explicit use of the likelihood function, depending instead on simulation of sample data sets from the model. In this paper we show that under the assumption of the existence of a uniform additive model error term, ABC algorithms give exact results when sufficient summaries are used. This interpretation allows the approximation made in many previous application papers to be understood, and should guide the choice of metric and tolerance in future work. ABC algorithms can be generalized by replacing the 0-1 cut-off with an acceptance probability that varies with the distance of the simulated data from the observed data. The acceptance density gives the distribution of the error term, enabling the uniform error usually used to be replaced by a general distribution. This generalization can also be applied to approximate Markov chain Monte Carlo algorithms. In light of this work, ABC algorithms can be seen as calibration techniques for implicit stochastic models, inferring parameter values in light of the computer model, data, prior beliefs about the parameter values, and any measurement or model errors.

Keywords: Approximate Bayesian computation; calibration; implicit inference; likelihood-free inference.

1 Introduction

Approximate Bayesian computation (ABC) algorithms are a group of methods for performing Bayesian inference without the need for explicit evaluation of the model likelihood function (Beaumont et al., 2002, Marjoram et al., 2003, Sisson et al., 2007).
The algorithms can be used with implicit computer models (Diggle and Gratton, 1984) that generate sample data sets rather than return likelihoods. ABC methods have become popular in the biological sciences (Sunnaker et al., 2013), with applications in genetics (see, for example, Siegmund et al., 2008, Foll et al., 2008), epidemiology (Blum and Tran, 2010, Tanaka et al., 2006) and population biology (Ratmann et al., 2007, Hamilton et al., 2005, Cornuet et al., 2008) most common. This popularity is primarily due to the fact that the likelihood function, which can be difficult or impossible to compute for some models, is not needed in order to do inference. However, despite their popularity, little is known about the quality of the approximation they provide beyond results shown in simulation studies. In this paper we give a framework in which the accuracy of ABC methods can be understood.

The notation throughout this paper is as follows. Let θ denote the vector of unknown model parameters we wish to infer, and let η(·) denote the computer simulator. We assume η(·) is stochastic, so that the simulator repeatedly run at θ will give a range of possible outputs, and write X ∼ η(θ) to denote that X has the same distribution as the model run at θ. To distinguish the model output from the observed data, let D denote the observations. The aim is to calibrate the model to the data, in order to learn about the true value of the parameter. The Bayesian approach is to find the posterior distribution of θ given D, given by

π(θ | D) = π(D | θ) π(θ) / π(D).

Throughout this paper, π(·) is used to denote different probability densities, and π(· | ·) conditional densities, with the context clear from the arguments.
Above, π(θ) is the prior distribution, π(D | θ) is the likelihood of the data under the model given parameter θ (the probability distribution of η(θ)), π(θ | D) is the posterior distribution, and π(D) is the evidence for the model.

It is usual in Bayesian inference to find that the normalizing constant π(D) is intractable, and a wide range of Monte Carlo techniques have been developed to deal with this case (Liu, 2001). Doubly-intractable distributions are distributions which have a likelihood function π(D | θ) = q(D | θ)/c(θ) which is known only up to a normalizing constant, c(θ), which is intractable. Standard Monte Carlo techniques do not apply to these distributions, and Murray et al. (2006) have developed algorithms which can be used in this case. ABC methods are Monte Carlo techniques developed for use with completely-intractable distributions, where the likelihood function π(D | θ) is not even known up to a normalizing constant.

ABC algorithms, sometimes called likelihood-free algorithms, enable inference using only simulations generated from the model, and do not require any evaluation of the likelihood. The most basic form of the ABC algorithm is based on the rejection algorithm, and is as follows:

Algorithm A: approximate rejection algorithm

A1 Draw θ ∼ π(θ)
A2 Simulate X from the simulator, X ∼ η(θ)
A3 Accept θ if ρ(X, D) ≤ δ.

Here, ρ(·, ·) is a distance measure on the model output space, and δ is a tolerance determining the accuracy of the algorithm. Accepted values of θ are not from the true posterior distribution, but from an approximation to it, written π(θ | ρ(D, X) ≤ δ). When δ = 0 this algorithm is exact and gives draws from the posterior distribution π(θ | D), whereas as δ → ∞ the algorithm gives draws from the prior.
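As a concrete illustration, Algorithm A can be sketched in a few lines of Python. The simulator, prior, metric and tolerance below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def approximate_rejection(simulator, prior_draw, D, rho, delta, n_accept):
    """Algorithm A: keep theta whenever the simulated data fall within
    tolerance delta of the observed data D under the metric rho."""
    accepted = []
    while len(accepted) < n_accept:
        theta = prior_draw()           # A1: draw from the prior
        X = simulator(theta)           # A2: run the simulator
        if rho(X, D) <= delta:         # A3: 0-1 cut-off
            accepted.append(theta)
    return np.array(accepted)

# Illustrative setup (not from the paper): X ~ N(theta, 1),
# theta ~ U[-10, 10], observed D = 0, metric |X - D|.
samples = approximate_rejection(
    simulator=lambda t: rng.normal(t, 1.0),
    prior_draw=lambda: rng.uniform(-10, 10),
    D=0.0, rho=lambda x, d: abs(x - d), delta=0.5, n_accept=500)
```

Shrinking delta tightens the approximation but lowers the acceptance rate in step A3, which is the trade-off discussed in the text.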
While smaller values of δ lead to samples which better approximate the true posterior, they also lead to lower acceptance rates in step A3 than larger values, and so more computation must be done to get a given sample size. Consequently, the tolerance δ can be considered as controlling a trade-off between computability and accuracy.

Several extensions have been made to the approximate rejection algorithm. If the data are high dimensional, then a standard change to the algorithm is to summarize the model output and data, using a summary statistic S(·) to project X and D onto a lower dimensional space. Algorithm A is then changed so that step A3 reads

A3′ Accept θ if ρ(S(X), S(D)) ≤ δ.

Ideally, S(·) should be chosen to be a sufficient statistic for θ. However, if the likelihood is unknown, then sufficient statistics cannot be identified. Summarizing the data and model output using a non-sufficient summary adds another layer of approximation on top of that added by the use of the distance measure and tolerance, but again, it is not known what effect any given choice for S(·) has on the approximation.

In this paper it is shown that the basic approximate rejection algorithm can be interpreted as performing exact inference in the presence of uniform model or measurement error. In other words, it is shown that ABC gives exact inference for the wrong model, and we give a distribution for the model error term for whatever choice of metric and tolerance are used. This interpretation allows us to show the effect a given choice of metric, tolerance and weighting have had in previous applications, and should provide guidance when choosing metrics and weightings in future work. It is also shown that Algorithm A can be generalized to give inference under the assumption of a completely flexible form for the model error.
We discuss how to model the model error, and show how some models can be rewritten to give exact inference. Ratmann et al. (2007) explored a related idea, and looked at using ABC algorithms to diagnose model inadequacies. The aim here is not to diagnose errors, but to account for them in the inference so as to provide posteriors that take known inadequacies into account, and to understand the effect of using standard ABC approaches.

Finally, ABC has been extended by Marjoram et al. (2003) from the rejection algorithm to approximate Markov chain Monte Carlo algorithms, and by Sisson et al. (2007), Toni et al. (2009), and Beaumont et al. (2009) to approximate sequential Monte Carlo algorithms. We extend the approximate Markov chain Monte Carlo algorithm to give inference for a general form of error, and suggest methods for calculating Bayes factors and integrals for completely-intractable distributions.

2 Interpreting ABC

In this section a framework is described which enables the effect a given metric and weighting have in ABC algorithms to be understood. This will then allow us to improve the inference by carefully choosing a metric and weighting which more closely represent our true beliefs. The key idea is to assume that there is a discrepancy between the best possible model prediction and the data. This discrepancy represents either measurement error on the data, or model error describing our statistical beliefs about where the model is wrong. George Box famously wrote that 'all models are wrong, but some are useful', and in order to link models to reality it is necessary to account for this model error when performing inference.
In the context of deterministic models, this practice is well established (Campbell, 2006, Goldstein and Rougier, 2009, Higdon et al., 2008), and should also be undertaken when linking stochastic models to reality, despite the fact that the variability in the model can seemingly explain the data as they are.

The framework introduced here uses the best input approach, similar to that given in Kennedy and O'Hagan (2001). We assume that the measurement D can be considered as a realization of the model run at its best input value, θ̂, plus an independent error term

D = η(θ̂) + ε.   (1)

The error ε might represent measurement error on D, or model error in η(·), or both, in which case we write ε = ε₁ + ε₂. Discussion about the validity of Equation (1), and what ε represents and how to model it, is delayed until Section 3; for the time being we simply consider ε to have density π_ε(·). The aim is to describe our posterior beliefs about the best input θ̂ in light of the error ε, the data D, and prior beliefs about θ̂. Consider the following algorithm:

Algorithm B: probabilistic approximate rejection algorithm

B1 Draw θ ∼ π(θ)
B2 Simulate X from the model, X ∼ η(θ)
B3 Accept θ with probability π_ε(D − X)/c.

Here, c is a constant chosen to guarantee that π_ε(D − X)/c defines a probability. In most cases we will expect ε to have a modal value of 0, and so taking c = π_ε(0) will make the algorithm valid and also ensure efficiency by maximizing the acceptance rate. If summaries are involved, or if D and X live in non-comparable spaces so that D − X does not make sense, we can instead use any distribution π_ε(D | X) relating X to D. The main innovation in this paper is to show that Algorithm B gives exact inference for the statistical model described above by Equation (1).
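Algorithm B can be sketched in a few lines under assumed choices (a Gaussian toy simulator, a uniform prior, and a Gaussian error density, none of which come from the paper); taking c = π_ε(0) follows the suggestion above:

```python
import numpy as np

rng = np.random.default_rng(1)

def gauss_pdf(x, s):
    """Zero-mean Gaussian density with standard deviation s."""
    return np.exp(-x**2 / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)

def probabilistic_rejection(simulator, prior_draw, D, error_pdf, c, n_accept):
    """Algorithm B: accept theta with probability pi_eps(D - X) / c."""
    accepted = []
    while len(accepted) < n_accept:
        theta = prior_draw()                        # B1: draw from the prior
        X = simulator(theta)                        # B2: run the simulator
        if rng.uniform() < error_pdf(D - X) / c:    # B3: probabilistic accept
            accepted.append(theta)
    return np.array(accepted)

# Illustrative setup: X ~ N(theta, 1), theta ~ U[-10, 10], D = 0,
# error eps ~ N(0, 0.5^2); c = pi_eps(0) maximizes the acceptance rate.
error_pdf = lambda r: gauss_pdf(r, 0.5)
samples = probabilistic_rejection(
    simulator=lambda t: rng.normal(t, 1.0),
    prior_draw=lambda: rng.uniform(-10, 10),
    D=0.0, error_pdf=error_pdf, c=error_pdf(0.0), n_accept=500)
```

Replacing the 0-1 cut-off of Algorithm A with this acceptance probability is the only change; everything else is the standard rejection sampler.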
This is essentially saying that ABC gives exact inference, but for the wrong model.

Theorem 1 Algorithm B gives draws from the posterior distribution π(θ̂ | D) under the assumption that D = η(θ̂) + ε and ε ∼ π_ε(·) independently of η(θ̂).

Proof 1 Let I = 1 if θ is accepted, and I = 0 otherwise. We then find that

pr(I = 1 | θ) = ∫ pr(I = 1 | η(θ) = x, θ) π(x | θ) dx = ∫ (π_ε(D − x)/c) π(x | θ) dx.

This gives that the distribution of accepted values of θ is

π(θ | I = 1) = π(θ) ∫ π_ε(D − x) π(x | θ) dx / ∫ π(θ′) ∫ π_ε(D − x) π(x | θ′) dx dθ′.

To complete the proof we must find the posterior distribution of the best model input θ̂ given the data D under the assumption of model error. Note that π(D | η(θ̂) = x) = π_ε(D − x), which implies that the likelihood of θ̂ is

π(D | θ̂) = ∫ π(D | η(θ̂) = x, θ̂) π(x | θ̂) dx = ∫ π_ε(D − x) π(x | θ̂) dx.

Consequently, the posterior distribution of θ̂ is

π(θ̂ | D) = π(θ̂) ∫ π_ε(D − x) π(x | θ̂) dx / ∫ π(θ) ∫ π_ε(D − x) π(x | θ) dx dθ,

which matches the distribution of accepted values from Algorithm B.

To illustrate the algorithm, we consider the toy example used in Sisson et al. (2007) and again in Beaumont et al. (2009), where analytic expressions can be calculated for the approximations.

Example 1 Assume the model is a mixture of two normal distributions with a uniform prior for the mean:

η(θ) ∼ (1/2) N(θ, 1) + (1/2) N(θ, 1/100),   θ ∼ U[−10, 10].

Further assume that we observe D = 0, but that there is measurement error ε on this datum.
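This example can be explored directly by simulation: the sketch below runs Algorithm A (0-1 cut-off) and Algorithm B (Gaussian acceptance kernel with variance δ²/3, matching the variance of a U[−δ, δ] error) on the mixture model. The sample size and seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(theta):
    """Example 1 simulator: N(theta, 1) w.p. 1/2, else N(theta, 1/100)."""
    narrow = rng.uniform(size=theta.shape) < 0.5
    return np.where(narrow, rng.normal(theta, 0.1), rng.normal(theta, 1.0))

def abc(kernel, n=200_000):
    """One rejection pass; kernel(r) gives the acceptance probability."""
    theta = rng.uniform(-10, 10, size=n)        # prior draws
    r = np.abs(simulate(theta) - 0.0)           # rho(X, D) with D = 0
    return theta[rng.uniform(size=n) < kernel(r)]

delta = 1.0
a = abc(lambda r: (r <= delta).astype(float))            # Algorithm A
b = abc(lambda r: np.exp(-r**2 / (2 * delta**2 / 3)))    # Algorithm B

# Both samples are centred at 0; their shapes differ for large delta.
print(len(a), a.mean(), len(b), b.mean())
```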
If ε ∼ U[−δ, δ], which is the assumption made when using Algorithm A with ρ(x, 0) = |x|, then it is possible to show that the approximation is

π(θ | ε ∼ U[−δ, δ], D = 0) ∝ Φ(δ − θ) − Φ(−δ − θ) + Φ(10(δ − θ)) − Φ(−10(δ + θ))

for θ ∈ [−10, 10], where Φ(·) is the standard Gaussian cumulative distribution function. An alternative to assuming uniform error is to suppose that the error has a normal distribution, ε ∼ N(0, δ²/3). It can then be shown that the posterior distribution of θ is

π(θ | ε ∼ N(0, δ²/3), D = 0) ∝ (1/2) φ(θ; 0, 1 + δ²/3) + (1/2) φ(θ; 0, 1/100 + δ²/3)

truncated onto [−10, 10]. This is the approximation found when using Algorithm B with a Gaussian acceptance kernel, where φ(·; µ, σ²) is the probability density function of a Gaussian distribution with mean µ and variance σ². The value of the variance, δ²/3, is chosen to be equal to the variance of a U[−δ, δ] random variable. For large values of the tolerance δ, the difference between the two approximations can be significant (see Figure 1), but in the limit as δ tends to zero, the two approximations will be the same, corresponding to zero error.

Figure 1: The posterior distributions found when using Algorithm A (solid line) and Algorithm B (dashed line) with a Gaussian acceptance kernel. The left plot is for δ = 1 and the right plot for δ = 0.1. The two curves are indistinguishable in the second plot.

3 Model discrepancy

The interpretation of ABC given by Theorem 1 allows us to revisit previous analyses done using the ABC algorithm, and to understand the approximation in the posterior in terms of the distribution implicitly assumed for the error term.
If the approximate rejection algorithm (Algorithm A) was used to do the analysis, we can see that this is equivalent to using the acceptance probability

π_ε(r)/c = 1 if ρ(r) ≤ δ, and 0 otherwise,

where r is the difference between the simulated and observed data. This says that Algorithm A gives exact inference for the model which assumes a uniform measurement error on the region defined by the 0-1 cut-off, i.e.,

ε ∼ U{x : ρ(x, D) ≤ δ}.

If ρ(·, ·) is a Euclidean metric, ρ(D, x) = (x − D)ᵀ(x − D), this is equivalent to assuming uniform measurement error on a ball of radius δ about D. In most situations, this is likely to be a poor choice for a model of the measurement error, because the tails of the distribution are short, with zero mass outside of the ball.

There are two ways we can choose to view the error term: either as measurement error or as model error. Interpreting ε to represent measurement error is relatively straightforward, as scientists usually hold beliefs about the distribution and magnitude of measurement error on their data. For most problems, assumptions of uniform measurement error will be inappropriate, and so using Algorithm A with a 0-1 cut-off will be inappropriate. But we have shown how to replace this uniform assumption with a distribution which more closely represents the beliefs of the scientist. Although the distribution of the measurement error will often be completely specified by the scientist, for example zero-mean Gaussian error with known variance, it is possible to include unknown parameters for the distribution of ε in θ and infer them along with the model parameters. Care needs to be taken to choose the constant c so that the acceptance rate in step B3 is less than one for all values of the parameter, but other than this it is in theory simple to infer error parameters along with the model parameters.
So, for example, if ε ∼ N(0, σ²) where σ² is unknown, we could include σ² in θ.

Some models have sampling or measurement error built into the computer code, so that the model output includes a realization of this noise. Rather than coding the noise process into the model, it will sometimes be possible to rewrite the model so that it outputs the latent underlying signal. If the likelihood of the data given the latent signal is computable (as it often is), then it may be possible to analytically account for the noise with the acceptance probability π_ε(·). ABC methods have proven most popular in fields such as genetics, epidemiology, and population biology, where a common occurrence is to have data generated by sampling a hidden underlying tree structure. In many cases, it is the partially observed tree structure which causes the likelihood to be intractable, and given the underlying tree the sampling process will have a known distribution. If this is the case (and if computational constraints allow), we can use the probabilistic ABC algorithm to do the sampling, giving exact inference without any assumption of model error. Note that if the sampling process gives continuous data, then exact inference using the rejection algorithm would not be possible, and so this approach has the potential to give a significant improvement over current methods.

Example 2 To illustrate the idea of rewriting the model in order to do analytic sampling, we describe a version of the problem considered in Plagnol and Tavaré (2004) and Wilkinson and Tavaré (2009). Their aim was to use the primate fossil record to date the divergence time of the primates. They used an inhomogeneous branching process to model speciation, with trees rooted at time t = τ, and simulated forwards in time to time t = 0, so that the depth of the tree, τ, represents the divergence time of interest.
The branching process is parametrized by λ, which can either be estimated and fixed, or treated as unknown and given a prior distribution. Time is split into geologic epochs τ < t_k < · · · < t_1 < 0, and the data consist of counts of the number of primate species that have been found in each epoch of the fossil record, D = (D_1, . . . , D_k). Fossil finds are modelled by a discrete marking process on the tree, with each species having equal probability α of being preserved as a fossil in the record. If we let N_i be the cumulative number of branches that exist during any point of epoch i, then the model used for the fossil finds process can be written as D_i ∼ Binomial(N_i, α). The distribution of N = (N_1, . . . , N_k) cannot be calculated explicitly, and so we cannot use a likelihood-based approach to find the posterior distribution of the unknown parameter θ = (λ, τ, α). The ABC approach used in Plagnol and Tavaré (2004) was to draw a value of θ from its prior, simulate a sample tree and fossil finds, and then count the number of simulated fossils in each epoch to find a simulated value of the data X. They then accepted θ if ρ(D, X) ≤ δ for some metric ρ(·, ·) and tolerance δ. This gives an approximation to the posterior distribution of the parameter given the data and the model, where the approximation can be viewed as model or measurement error. However, instead of approximating the posterior, it is possible in theory to rewrite the model and perform the sampling analytically to find the exact posterior distribution:

1. Draw θ = (λ, τ, α) ∼ π(·)
2. Simulate a tree T using parameter λ and count N
3. Accept θ with probability ∏_{i=1}^{k} (N_i choose D_i) α^{D_i} (1 − α)^{N_i − D_i}.
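The rewritten sampling step can be sketched as follows. The branching-process simulator is beyond the scope of a short example, so `simulate_species_counts` below is a hypothetical stand-in, and the data and priors are illustrative; only the binomial acceptance probability of step 3 follows the text:

```python
import numpy as np
from math import comb

rng = np.random.default_rng(3)

def step3_probability(N, D, alpha):
    """Step 3 acceptance probability:
    prod_i C(N_i, D_i) * alpha^D_i * (1 - alpha)^(N_i - D_i)."""
    p = 1.0
    for n_i, d_i in zip(N, D):
        if d_i > n_i:
            return 0.0        # cannot find more fossils than species
        p *= comb(n_i, d_i) * alpha**d_i * (1 - alpha)**(n_i - d_i)
    return p

# Hypothetical stand-in for the branching-process simulator: it should
# return the cumulative species count N_i for each epoch; the real model
# simulates a tree with rate lambda and depth tau.
def simulate_species_counts(lam, tau, k):
    return rng.poisson(lam * tau, size=k) + 1

D = [2, 1, 3, 0]              # illustrative fossil counts, not the real data
accepted = []
while len(accepted) < 100:
    lam, tau, alpha = rng.uniform(0.5, 1.5), rng.uniform(2, 8), rng.uniform(0.1, 0.5)
    N = simulate_species_counts(lam, tau, len(D))
    if rng.uniform() < step3_probability(N, D, alpha):
        accepted.append((lam, tau, alpha))
```

Because each factor in step 3 is a binomial probability mass, the product is automatically a valid acceptance probability without a normalizing constant, which is also why it can be small in practice.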
This algorithm gives exact draws from the posterior distribution of θ given D, and in theory there is no need for any assumption of measurement error. Note that θ can include the parameter α for the sampling rate, to be inferred along with the other model parameters. However, this makes finding a normalizing constant in step 3 difficult. Without a normalizing constant to increase the acceptance rate, applying this algorithm directly will be slow for many values of D and k (the choice of prior distribution and number of parameters we choose to include in θ can also have a significant effect on the efficiency). A practical solution would be to add an error term and assume the presence of measurement error on the data (which is likely to exist in this case), in order to increase the acceptance probability in step 3. Approaching the problem in this way, it is possible to carefully model the error on D and improve the estimate of the divergence time.

Using ε to represent measurement error is straightforward, whereas using ε to model the model discrepancy (to account for the fact that the model is wrong) is harder to conceptualize and not as commonly used. For deterministic models, the idea of using a model error term when doing calibration or data assimilation is well established, and is described for a Bayesian framework in Kennedy and O'Hagan (2001). They assume that the model run at its 'best' input, η(θ̂), is sufficient for the model when estimating θ̂. In other words, knowledge of the model run at its best input provides all the available information about the system for the purpose of prediction. If this is the case, then we can define ε to be the difference between η(θ̂) and D, and assume ε is independent of η(θ̂). Note that the error is the difference between the data and the model run at its best input, and does not depend on θ.
If we do not include an error term ε, then the best input is the value of θ that best explains the data, given the model. When we include an error term which is carefully modelled and represents our beliefs about the discrepancy between η(·) and reality, then it can be argued that θ̂ represents the 'true' value of θ, and that π(θ̂ | D, ε ∼ π_ε(·)) should be our posterior distribution for θ̂ in light of the data and the model.

For deterministic models, Goldstein and Rougier (2009) provide a framework to help think about the model discrepancy. To specify the distribution of ε, it can help to break the discrepancy down into various parts: physical processes not modelled, errors in the specification of the model, imperfect implementation, etc. So, for example, if η(·) represents a global climate model predicting average temperatures, then common model errors could be not including processes such as clouds or CO₂ emissions from vegetation; error in the specification might be using an unduly simple model of economic activity; and imperfect implementation would include using grid cells too large to accurately solve the underlying differential equations. In some cases it may be necessary to consider both model and measurement error, ε + e say, and model each process separately. For stochastic models, as far as we are aware, no guidance exists about how to model the error, and indeed it is not clear what ε should represent.

To complicate matters further, for many models the dimension of D and X will be large, making it likely that the acceptance probability π_ε(D − X) will be small. As noted above, in this case it is necessary to summarize the model output and the data using a multidimensional summary S(·). Using a summary means that rather than approximating π(θ | D), the algorithms approximate π(θ | S(D)).
The interpretation of ε as model or measurement error still holds, but now the error is on the measurement S(D) or the model prediction S(X). If each element of S(·) has an interpretation in terms of a physical process, this may make it easier to break the error down into independent components. For example, suppose that we use S(x) = (x̄, s_xx), the sample mean and variance of x, and that we then use the following acceptance density:

π_ε(S(X) − S(D)) = π₁(X̄ − D̄) π₂(s_XX − s_DD).

This is equivalent to assuming that there are two sources of model error. Firstly, the mean prediction is assumed to be wrong, with the error distributed with density π₁(·). Secondly, it assumes that the model prediction of the variance is wrong, with the error having distribution π₂(·). It also assumes that the error in the mean prediction is independent of the error in the variance prediction. This independence is not necessary, but helps with visualization and elicitation. For this reason it can be helpful to choose the different components of S(·) so that they are close to independent (independence may also help increase the acceptance rate). Another possibility for choosing S(·) is to use principal component analysis (if dim(X) is large) to find a smaller number of uncorrelated summaries of the data which may have meaningful interpretations. In general, however, it is not known how to choose good summaries. Joyce and Marjoram (2008) have suggested a method for selecting between different summaries and for deciding how many summaries it is optimal to include. However, more work is required to find summaries which are informative, interpretable, and for which we can describe the model error.
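The factorized acceptance density above can be coded directly; the Gaussian component kernels and their scales here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

def gauss(x, s):
    """Gaussian kernel normalized so that gauss(0, s) = 1."""
    return np.exp(-x**2 / (2 * s**2))

def acceptance_density(X, D, s_mean=0.1, s_var=0.2):
    """Factorized acceptance density on S(x) = (mean, variance):
    independent Gaussian errors on the mean and variance predictions."""
    pi1 = gauss(X.mean() - D.mean(), s_mean)             # error on the mean
    pi2 = gauss(X.var(ddof=1) - D.var(ddof=1), s_var)    # error on the variance
    return pi1 * pi2   # maximum value 1, so usable directly as a probability

D = rng.normal(0.3, 1.0, size=100)      # illustrative "observed" data
theta = rng.uniform(-2, 2)
X = rng.normal(theta, 1.0, size=100)    # simulated data at a prior draw
accept = rng.uniform() < acceptance_density(X, D)
```

Because each factor is normalized to peak at 1, the product needs no further constant c, mirroring the choice c = π_ε(0) discussed earlier.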
Finally, once we have specified a distribution for ε, we may find the acceptance rate is too small to be practicable, and that it is necessary to compromise (as in Example 2 above). A pragmatic way to increase the acceptance rate is to use a more disperse distribution for ε. This moves us from the realm of using ε to model an error we believe exists, to using it to approximate the true posterior. This is currently how most ABC methods are used. However, even when making a pragmatic compromise, the interpretation of the approximation in terms of an error will allow us to think more carefully about how to choose between different compromise solutions.

Example 3 One of the first uses of an ABC algorithm was by Pritchard et al. (1999), who used a simple stochastic model to study the demographic history of the Y chromosome, and used an approximate rejection algorithm to infer mutation and demographic parameters for their model. Their data consisted of 445 Y chromosomes sampled at eight different loci from a mixture of populations from around the world, which they summarized by just three statistics: the mean (across loci) of the variance of repeat numbers V, the mean effective heterozygosity H, and the number of distinct haplotypes N. The observed value of the summaries for their sample was D ≡ (V, H, N)ᵀ = (1.149, 0.6358, 316)ᵀ. They elicited prior distributions for the mutation rates from the literature, and used diffuse priors for population parameters such as the growth rate and the effective number of ancestral Y chromosomes. Population growth was modelled using a standard coalescent model growing at an exponential rate from a constant ancestral level, and various different mutation models were used to simulate sample values for the three summaries measured in the data.
They then applied Algorithm A using the metric

ρ(D, X) = max_{i=1,2,3} |D_i − X_i| / D_i,   (2)

where X is a triplet of simulated values for the three summary statistics. They used a tolerance value of δ = 0.1, which for their choice of metric corresponds to an error of 10% on each measurement. This gives results equivalent to assuming that there is independent uniform measurement error on the three data summaries, so that the true values of the three summaries have the following distributions:

V ∼ U[1.0341, 1.2624], H ∼ U[0.58122, 0.71038], N ∼ U[284, 348].

Beaumont et al. (2002) used the same model and data set to compare the relative performance of Algorithm A with an algorithm similar to Algorithm B, using an Epanechnikov density applied to the metric value (2) for the acceptance probability π_ε(·). They set the value of δ (the cut-off in Algorithm A and the range of the support for ε in Algorithm B) by using a quantile P_δ of the empirical distribution function of simulated values of ρ(D, X); i.e., P_0.01 means they accepted the 1% of model runs with values closest to D. They concluded that Algorithm B gives more accurate results than Algorithm A, meaning that the distribution found using Algorithm B is closer to the posterior found when assuming no measurement error (δ = 0). The conclusion that Algorithm B is preferable to Algorithm A for this model is perhaps not surprising in light of what we now know, as it was not taken into account that both algorithms used the same value of δ. For Algorithm A this corresponds to assuming a measurement error with variance δ²/3, whereas using an Epanechnikov acceptance probability is equivalent to assuming a measurement error with variance δ²/5.
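The two variances quoted here can be checked numerically by treating each acceptance kernel, suitably normalized, as the density of the error on [−δ, δ]:

```python
import numpy as np

# Treat each acceptance kernel on [-delta, delta] as an (unnormalized)
# error density and compute the implied error variance on a fine grid.
delta = 0.1
x = np.linspace(-delta, delta, 200_001)

uniform = np.ones_like(x)            # 0-1 cut-off of Algorithm A
epanechnikov = 1 - (x / delta)**2    # Epanechnikov kernel, up to a constant

def variance(density):
    """Variance of the error under a density known up to a constant."""
    w = density / density.sum()
    return float((x**2 * w).sum())

print(variance(uniform) / delta**2)       # close to 1/3
print(variance(epanechnikov) / delta**2)  # close to 1/5
```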
Therefore, Algorithm B uses a measurement error only 60% as variable as that assumed in Algorithm A, and so it is perhaps not surprising that Algorithm B gives more accurate results in this case.

4 Approximate Markov chain Monte Carlo

For problems which have a tightly constrained posterior distribution (relative to the prior), repeatedly drawing independent values of θ from its prior distribution in the rejection algorithm can be inefficient. For problems with a high dimensional θ, this inefficiency is likely to make the application of a rejection-type algorithm impracticable. The idea behind Markov chain Monte Carlo (MCMC) is to build a Markov chain on θ and correlate successive observations, so that more time is spent in regions of high posterior probability. Most MCMC algorithms, such as the Metropolis-Hastings algorithm, depend on knowledge of the likelihood function, which we have assumed is not known. Marjoram et al. (2003) give an approximate version of the Metropolis-Hastings algorithm, which approximates the acceptance probability by using simulated model output with a metric and a 0-1 cut-off to approximate the likelihood ratio. This, as before, is equivalent to assuming uniform error on a set defined by the metric and the tolerance. As above, this algorithm can be generalized from assuming uniform measurement error to an arbitrary error term.

Below are two different algorithms to perform MCMC for the model described by Equation (1). The difference between the two algorithms is in the choice of sample space used to construct the stationary distribution. In Algorithm C we consider the state variable to belong to the space of parameter values Θ, and construct a Markov chain {θ_1, θ_2, . . .} which obeys the following dynamics:

Algorithm C: probabilistic approximate MCMC 1

C1 At time t, propose a move from θ_t to θ′ according to transition kernel q(θ_t, θ′).
C2 Simulate X′ ∼ η(θ′).

C3 Set θ_{t+1} = θ′ with probability
\[
r(\theta_t, \theta' \mid X') = \frac{\pi_\varepsilon(D - X')}{c} \min\left(1,\; \frac{q(\theta', \theta_t)\,\pi(\theta')}{q(\theta_t, \theta')\,\pi(\theta_t)}\right), \tag{3}
\]
otherwise set θ_{t+1} = θ_t.

An alternative approach is to introduce the value of the simulated output as an auxiliary variable and construct the Markov chain on the space Θ × X, where X is the space of model outputs.

Algorithm D: probabilistic approximate MCMC 2

D1 At time t, propose a move from ψ_t = (θ_t, X_t) to ψ′ = (θ′, X′), with θ′ drawn from transition kernel q(θ_t, θ′) and X′ simulated from the model using θ′: X′ ∼ η(θ′).

D2 Set ψ_{t+1} = (θ′, X′) with probability
\[
r((\theta_t, X_t), (\theta', X')) = \min\left(1,\; \frac{\pi_\varepsilon(D - X')\, q(\theta', \theta_t)\, \pi(\theta')}{\pi_\varepsilon(D - X_t)\, q(\theta_t, \theta')\, \pi(\theta_t)}\right), \tag{4}
\]
otherwise set ψ_{t+1} = ψ_t.

Proof 2 (of convergence) To show that these Markov chains converge to the required posterior distribution, it is sufficient to show that the chains satisfy the detailed balance equations
\[
\pi(s)\, p(s, t) = \pi(t)\, p(t, s) \quad \text{for all } s, t,
\]
where p(·, ·) is the transition kernel of the chain and π(·) the required stationary distribution.

For Algorithm C the transition kernel is the product of q(θ, θ′) and the acceptance rate. To calculate the acceptance rate, note that in Equation (3) the acceptance probability is conditioned upon knowledge of X′, and so we must integrate out X′ to find r(θ, θ′). This gives the transition kernel for the chain:
\[
p(\theta_t, \theta') = q(\theta_t, \theta') \int_{\mathcal{X}} \frac{\pi_\varepsilon(D - X')}{c} \min\left(1,\; \frac{q(\theta', \theta_t)\,\pi(\theta')}{q(\theta_t, \theta')\,\pi(\theta_t)}\right) \pi(X' \mid \theta')\, \mathrm{d}X'.
\]
The target stationary distribution is
\[
\pi(\theta \mid D) = \frac{\pi(\theta) \int_{\mathcal{X}} \pi_\varepsilon(D - X)\, \pi(X \mid \theta)\, \mathrm{d}X}{\pi(D)}.
\]
It is then simple to show that the Markov chain described by Algorithm C satisfies the detailed balance equations (see Liu (2001) for comparable calculations).
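In practice, Algorithm D is straightforward to implement. The following sketch runs it on a toy set-up; the prior N(0,1), simulator X | θ ~ N(θ, 1), Gaussian acceptance kernel, and random-walk proposal are illustrative assumptions, not choices made in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 1.2        # observed data (toy, scalar)
eps_sd = 0.3   # scale of the assumed Gaussian acceptance kernel pi_eps
prop_sd = 0.5  # random-walk proposal standard deviation

def pi_eps(e):
    """Gaussian acceptance kernel: density of the assumed measurement/model error."""
    return np.exp(-0.5 * (e / eps_sd) ** 2) / (eps_sd * np.sqrt(2 * np.pi))

def prior_density(theta):
    """N(0, 1) prior for theta."""
    return np.exp(-0.5 * theta ** 2) / np.sqrt(2 * np.pi)

def simulate(theta):
    """The simulator eta(theta): here X | theta ~ N(theta, 1)."""
    return rng.normal(theta, 1.0)

def algorithm_d(n_iter, theta0):
    theta, x = theta0, simulate(theta0)
    chain = np.empty(n_iter)
    for t in range(n_iter):
        # D1: propose theta' from a symmetric random walk; simulate X' ~ eta(theta')
        theta_p = rng.normal(theta, prop_sd)
        x_p = simulate(theta_p)
        # D2: accept with the probability in Equation (4);
        # q is symmetric here, so the q terms cancel.
        ratio = (pi_eps(D - x_p) * prior_density(theta_p)) / \
                (pi_eps(D - x) * prior_density(theta) + 1e-300)
        if rng.random() < min(1.0, ratio):
            theta, x = theta_p, x_p
        chain[t] = theta
    return chain

chain = algorithm_d(20000, theta0=0.0)
print(chain[5000:].mean())
```

For this conjugate toy model the target implied by the kernel is exact: D | θ ~ N(θ, 1 + eps_sd²), so the posterior is Gaussian with mean D / (2 + eps_sd²) ≈ 0.57, which the chain mean should approach after burn-in.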
For Algorithm D, the transition kernel is
\[
p((\theta_t, X_t), (\theta', X')) = q(\theta_t, \theta')\, \pi(X' \mid \theta') \min\left(1,\; \frac{\pi_\varepsilon(D - X')\, q(\theta', \theta_t)\, \pi(\theta')}{\pi_\varepsilon(D - X_t)\, q(\theta_t, \theta')\, \pi(\theta_t)}\right). \tag{5}
\]
The Markov chain in this case takes values in Θ × X and the required stationary distribution is
\[
\pi(\theta, X \mid D) = \frac{\pi_\varepsilon(D - X)\, \pi(X \mid \theta)\, \pi(\theta)}{\pi(D)}. \tag{6}
\]
It can then be shown that Equations (5) and (6) satisfy the detailed balance equations.

While Algorithm C is more recognisable as a generalization of the approximate MCMC algorithm given in Marjoram et al. (2003), Algorithm D is likely to be more efficient in most cases. This is because the ratio of model error densities that occurs in acceptance rate (4) is likely to result in larger probabilities than those given by Equation (3), which simply has a π_ε(D − x)/c term instead. Algorithm D also has the advantage of not requiring a normalizing constant.

5 Extensions

5.1 Importance sampling

Suppose our aim is to calculate expectations of the form
\[
\mathbb{E}(f(\theta) \mid D) = \int f(\theta)\, \pi(\theta \mid D)\, \mathrm{d}\theta,
\]
where the expectation is taken with respect to the posterior distribution of θ. The simplest way to approximate this is to draw a sample of θ values {θ_i}_{i=1,...,n} from π(θ | D) using Algorithm B, C or D, and then approximate using the sum n^{-1} ∑ f(θ_i). However, a more stable estimator can be obtained by using draws from the prior weighted by π_ε(D − X_i), as in Algorithm B. For each (θ_i, X_i) pair drawn from the prior and simulator in steps B1 and B2, assign θ_i weight w_i = π_ε(D − X_i). Then an estimator of the required expectation is
\[
\frac{\sum f(\theta_i)\, w_i}{\sum w_i}.
\]
This is an importance sampling algorithm targeting the joint distribution π(X, θ | D) ∝ π_ε(D − X) π(X | θ) π(θ) using instrumental distribution π(X | θ) π(θ).
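The weighted estimator above can be sketched in a few lines. As before, the prior, simulator, and Gaussian acceptance kernel below are illustrative assumptions chosen so the answer can be checked against a conjugate calculation.

```python
import numpy as np

rng = np.random.default_rng(1)

D = 1.2        # observed data (toy, scalar)
eps_sd = 0.3   # scale of the assumed Gaussian acceptance kernel pi_eps
n = 200000

def pi_eps(e):
    return np.exp(-0.5 * (e / eps_sd) ** 2) / (eps_sd * np.sqrt(2 * np.pi))

# B1: draw theta_i from the prior; B2: simulate X_i ~ eta(theta_i)
theta = rng.normal(0.0, 1.0, size=n)   # prior N(0, 1)
x = rng.normal(theta, 1.0)             # simulator X | theta ~ N(theta, 1)

# Weight each draw by pi_eps(D - X_i) and form the self-normalised estimator
# of E(theta | D); here f(theta) = theta.
w = pi_eps(D - x)
post_mean = np.sum(theta * w) / np.sum(w)
print(post_mean)
```

In this conjugate set-up D | θ ~ N(θ, 1 + eps_sd²), so the exact posterior mean is D / (2 + eps_sd²) ≈ 0.57, and the weighted estimate should sit close to it; a 0-1 kernel would instead reduce the weights to accept/reject indicators.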
Note that if the uniform acceptance kernel π_ε(D − X) ∝ 𝟙{ρ(D, X) ≤ δ} is used, then this approach reduces to the rejection algorithm, as proposals are given weight 1 (accepted) or 0 (rejected), showing that there is no uniform-ABC version of importance sampling. Sequential Monte Carlo algorithms are possible, however, and Sisson et al. (2007) and Toni et al. (2009) considered algorithms in which the tolerance δ is slowly reduced through a schedule δ_1, . . . , δ_T to some small final value. Both of these algorithms, as well as variants such as Drovandi et al. (2011) and Del Moral et al. (2012), which use Metropolis-Hastings moves of the parameter between iterations to provide more efficient proposals, can be extended to the generalised ABC case using general acceptance kernels. The move from 0-1 cut-offs to general acceptance rates can introduce difficulties with memory constraints, due to the requirement to store a large number of particles, many with small but non-zero weights. Partial rejection control, introduced by Liu (2001) and extended to an ABC setting by Peters et al. (2012), can be used to reject particles with small weights, only keeping particles which have weight above some threshold.

5.2 Model selection

The theoretical acceptance rate from the rejection algorithm (Algorithm A with δ = 0) is equal to the model evidence π(D). The evidence from different models can then be used to calculate Bayes factors, which can be used to perform model selection (Kass and Raftery, 1995). It is possible to approximate the value of π(D) by using the acceptance rate from Algorithm A (Wilkinson, 2007). By doing this for two or more competing models, we can perform approximate model selection, although in practice this approach can be unstable as δ varies (Wilkinson, 2007).
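The acceptance-rate estimate of π(D) can be illustrated on the same toy Gaussian model used above (again an assumption for illustration). For a uniform kernel on a band of half-width δ, the acceptance probability is approximately 2δ·π(D) for small δ, so dividing the empirical acceptance rate by 2δ recovers an evidence estimate that can be compared to the exact marginal density.

```python
import numpy as np

rng = np.random.default_rng(2)

D = 1.2
delta = 0.05
n = 400000

theta = rng.normal(0.0, 1.0, size=n)   # prior N(0, 1)
x = rng.normal(theta, 1.0)             # simulator X | theta ~ N(theta, 1)

# Algorithm A acceptance rate estimates the probability mass of pi(X|theta)pi(theta)
# in the band |D - X| <= delta, which is roughly 2 * delta * pi(D) for small delta.
accept = np.abs(D - x) <= delta
evidence_hat = accept.mean() / (2 * delta)

# Exact evidence for this toy model: marginally D ~ N(0, 2)
exact = np.exp(-D**2 / 4) / np.sqrt(4 * np.pi)
print(evidence_hat, exact)
```

The instability noted in the text appears here as the trade-off in δ: a smaller δ reduces the O(δ²) bias of the band approximation but leaves fewer acceptances, inflating the Monte Carlo variance of the estimate.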
The estimate of π(D) can be improved and made interpretable by using the weighted estimate
\[
\frac{1}{n} \sum_{i=1}^{n} \frac{1}{m} \sum_{j=1}^{m} \pi_\varepsilon(D - X_i^j),
\]
where X_i^1, . . . , X_i^m ∼ η(θ_i) and θ_1, . . . , θ_n ∼ π(·). This gives a more stable estimator than simply taking the acceptance rate, and also tends to the exact value (as n, m → ∞) for the model given by Equation (1).

Robert et al. (2011) and Didelot et al. (2011) have highlighted the dangers of using ABC algorithms for model selection when D and X are replaced by summaries S(D) and S(X). The Bayes factor based on the full data D will in general differ from the Bayes factor based on the summary S(D), even when S is a sufficient statistic for θ for both simulators.

The approach advocated here, of considering ABC as an extension of the modelling process, using acceptance kernel π_ε(D − X) (or π_ε(S(D) | S(X)) in a more general non-additive setting) to represent the relationship between simulator output and observations (as encapsulated by Theorem 1), suggests a different approach. The choice of acceptance kernel should be made after careful consideration of the simulator's ability, and inevitably involves a degree of subjective judgement (as does the choice of simulator, prior, and summary statistics). The kernel used forms part of the statistical model, and any model selection scheme will assess this choice, as well as the choice of simulator and prior. In other words, it is inevitable that the estimated Bayes factor will depend upon π_ε in general, further highlighting the need for its careful design.

Similarly, the choice of summary statistic S(D) used to reduce the dimension of the data and simulator output should be based on careful consideration of what aspects of the data we expect the simulator to be able to reproduce. Recent work by Nunes and Balding (2010), Fearnhead and Prangle (2012), Barnes et al.
(2012) and Prangle et al. (2013) has focussed on automated methods for choosing good summary statistics, but care should be taken to ensure the summaries selected coincide with the modeller's expectations of what the simulator can reproduce. Examples can be constructed in which summaries are strongly informative about the parameters (in the sense of π(θ | S(D)) differing from π(θ)), but which do not produce believable posteriors. For example, in dynamical system models, phase-sensitive summaries (such as the sum of square differences) are usually informative about the simulator parameters, even though the simulators were only designed to capture the phase-insensitive parts of the system. Using these summaries will give the appearance of having learned about the parameters, as the posterior will differ from the prior, but it is unclear whether what has been learnt is of value. If the summary S(D) is chosen on a sound physical basis, and the inference is viewed as conditional upon this choice (i.e., the posterior is taken to be π(θ | S(D)) and is not seen as an approximation to π(θ | D)), then the difficulties for ABC model selection raised by Robert et al. (2011) are circumvented, and the interpretation is clear.

6 Discussion

It has been shown in this paper that approximate Bayesian computation algorithms can be considered to give exact inference under the assumption of model error. However, this is only part of the way towards a complete understanding of ABC algorithms. In the majority of the application papers using ABC methods, summaries of the data and model output have been used to reduce the dimension of the output. It cannot be known whether these summaries are sufficient for the data, and so in most cases the use of summaries means that there is another layer of approximation.
While this work allows us to understand the error assumed on the measurement of the summary, it says nothing about what effect using the summary rather than the complete data has on the inference.

The use of a simulator discrepancy term when making inferences is important if one wants to move from making statements about the simulator to statements about reality. There has currently been only minimal work done on modelling the discrepancy term for stochastic models. One way to approach this is to view the model as deterministic, outputting a density π_θ(x) for each value of the input θ (many realizations of η(θ) would be needed to learn π_θ(x)). The discrepancy term ε can then be considered as representing the difference between π_θ(x) and the true variability inherent in the physical system.

Acknowledgements

I would like to thank Professor Paul Blackwell for his suggestion that the metric in the ABC algorithm might have a probabilistic interpretation. I would also like to thank Professor Simon Tavaré, Professor Tony O'Hagan, and Dr Jeremy Oakley for useful discussions on an early draft of the manuscript.

References

Barnes, C. P., Filippi, S., Stumpf, M. P. H., Thorne, T., 2012. Considerate approaches to achieving sufficiency for ABC model selection. Stat. Comput. 22, 1181–1197.

Beaumont, M. A., Cornuet, J. M., Marin, J. M., Robert, C. P., 2009. Adaptivity for ABC algorithms: the ABC-PMC. Biometrika 96 (4), 983–990.

Beaumont, M. A., Zhang, W., Balding, D. J., 2002. Approximate Bayesian computation in population genetics. Genetics 162 (4), 2025–2035.

Blum, M., Tran, V., 2010. HIV with contact tracing: a case study in approximate Bayesian computation. Biostatistics 11 (4), 644–660.

Campbell, K., 2006. Statistical calibration of computer simulations. Reliab. Eng. Syst. Safe. 91, 1358–1363.

Cornuet, J.-M., Santos, F., Beaumont, M. A., Robert, C. P., Marin, J.-M., Balding, D.
J., Guillemaud, T., Estoup, A., 2008. Inferring population history with DIY ABC: a user-friendly approach to approximate Bayesian computation. Bioinformatics 24 (23), 2713–2719.

Del Moral, P., Doucet, A., Jasra, A., 2012. An adaptive sequential Monte Carlo method for approximate Bayesian computation. Stat. Comput. 22, 1009–1020.

Didelot, X., Everitt, R. G., Johansen, A. M., Lawson, D. J., 2011. Likelihood-free estimation of model evidence. Bayesian Analysis 6, 49–76.

Diggle, P. J., Gratton, R. J., 1984. Monte Carlo methods of inference for implicit statistical models. J. R. Statist. Soc. B 46 (2), 193–227.

Drovandi, C. C., Pettitt, A. N., Faddy, M. J., 2011. Approximate Bayesian computation using indirect inference. J. Roy. Stat. Soc. Ser. C 60, 317–337.

Fearnhead, P., Prangle, D., 2012. Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. J. Roy. Stat. Soc. Ser. B 74, 419–474.

Foll, M., Beaumont, M. A., Gaggiotti, O., 2008. An approximate Bayesian computation approach to overcome biases that arise when using amplified fragment length polymorphism markers to study population structure. Genetics 179, 927–939.

Goldstein, M., Rougier, J., 2009. Reified Bayesian modelling and inference for physical systems. J. Stat. Plan. Infer. 139 (3), 1221–1239.

Hamilton, G., Currat, M., Ray, N., Heckel, G., Beaumont, M. A., Excoffier, L., 2005. Bayesian estimation of recent migration rates after a spatial expansion. Genetics 170, 409–417.

Higdon, D., Gattiker, J., Williams, B., Rightley, M., 2008. Computer model calibration using high-dimensional output. J. Am. Statis. Assoc. 103, 570–583.

Joyce, P., Marjoram, P., 2008. Approximately sufficient statistics and Bayesian computation. Stat. Appl. Genet. Mo. B. 7 (1), article 26.

Kass, R. E., Raftery, A. E., 1995. Bayes factors. J. Am. Statis. Assoc. 90 (430), 773–795.
Kennedy, M., O'Hagan, A., 2001. Bayesian calibration of computer models (with discussion). J. R. Statist. Soc. Ser. B 63, 425–464.

Liu, J. S., 2001. Monte Carlo Strategies in Scientific Computing. Springer Series in Statistics. New York: Springer.

Marjoram, P., Molitor, J., Plagnol, V., Tavaré, S., 2003. Markov chain Monte Carlo without likelihoods. P. Natl. Acad. Sci. USA 100 (26), 15324–15328.

Murray, I., Ghahramani, Z., MacKay, D. J. C., 2006. MCMC for doubly-intractable distributions. In: Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence. UAI.

Nunes, M. A., Balding, D. J., 2010. On optimal selection of summary statistics for approximate Bayesian computation. Stat. Appl. Genet. Mo. B. 9, Article 34.

Peters, G., Fan, Y., Sisson, S., 2012. On sequential Monte Carlo, partial rejection control and approximate Bayesian computation. Stat. Comput. 22, 1209–1222.

Plagnol, V., Tavaré, S., 2004. Approximate Bayesian computation and MCMC. In: Niederreiter, H. (Ed.), Proceedings of Monte Carlo and Quasi-Monte Carlo Methods 2002. Springer-Verlag, pp. 99–114.

Prangle, D., Fearnhead, P., Cox, M. P., French, N. P., 2013. Semi-automatic selection of summary statistics for ABC model choice. In submission, available as arXiv:1302:5624.

Pritchard, J. K., Seielstad, M. T., Perez-Lezaun, A., Feldman, M. W., 1999. Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Mol. Biol. Evol. 16 (12), 1791–1798.

Ratmann, O., Jorgensen, O., Hinkley, T., Stumpf, M., Richardson, S., Wiuf, C., 2007. Using likelihood-free inference to compare evolutionary dynamics of the protein networks of H. pylori and P. falciparum. PLoS Comput. Biol. 3 (11), 2266–2276.

Robert, C. P., Cornuet, J. M., Marin, J. M., Pillai, N. S., 2011. Lack of confidence in approximate Bayesian computation model choice. P. Natl. Acad. Sci. USA 108, 15112–15117.

Siegmund, K.
D., Marjoram, P., Shibata, D., 2008. Modeling DNA methylation in a population of cancer cells. Stat. Appl. Genet. Mo. B. 7 (1), article 18.

Sisson, S. A., Fan, Y., Tanaka, M. M., 2007. Sequential Monte Carlo without likelihoods. Proc. Natl. Acad. Sci. USA 104 (6), 1760–1765.

Sunnaker, M., Busetto, A. G., Numminen, E., Corander, J., Foll, M., Dessimoz, C., 2013. Approximate Bayesian computation. PLoS Comput. Biol. 9, e1002803.

Tanaka, M. M., Francis, A. R., Luciani, F., Sisson, S. A., 2006. Using approximate Bayesian computation to estimate tuberculosis transmission parameters from genotype data. Genetics 173, 1511–1520.

Toni, T., Welch, D., Strelkowa, N., Ipsen, A., Stumpf, M. P. H., 2009. Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. J. R. Soc. Interface 6 (31), 187–202.

Wilkinson, R. D., 2007. Bayesian inference of primate divergence times. Ph.D. thesis, University of Cambridge.

Wilkinson, R. D., Tavaré, S., 2009. Estimating primate divergence times by using conditioned birth-and-death processes. Theor. Popul. Biol. 75, 278–285.
