Notes to Robert et al.: Model criticism informs model choice and model comparison


Authors: Oliver Ratmann, Christophe Andrieu, Carsten Wiuf, Sylvia Richardson

Oliver Ratmann*,†, Christophe Andrieu‡, Carsten Wiuf§ and Sylvia Richardson¶

*Biology Department, Duke University, Box 90338, Durham, NC 27708, USA; †Statistical and Applied Mathematical Sciences Institute, Research Triangle Park, NC 27709, USA; ‡Department of Mathematics, University of Bristol, Bristol, United Kingdom; §Bioinformatics Research Center, University of Aarhus, Aarhus, Denmark; ¶Centre for Biostatistics, Imperial College London, London, United Kingdom. Email: oliver.ratmann@duke.edu

In their letter to PNAS and a comprehensive set of notes on arXiv [1, 2], Christian Robert, Kerrie Mengersen and Carla Chen (RMC) represent our approach to model criticism in situations when the likelihood cannot be computed as a way to "contrast several models with each other". In addition, guided by an analysis of scalar error terms on simple examples, RMC argue that model assessment with Approximate Bayesian Computation under model uncertainty (ABCμ) is unduly challenging and question its Bayesian foundations. We thank RMC for their interest and their detailed comments on our work, which give us an opportunity to clarify the construction of ABCμ and to explain further the utility of ABCμ for the purpose of model criticism. Here, we provide a comprehensive set of answers to RMC's comments, which go beyond our short response [3]. For the sake of clarity, we re-state RMC's main points in italics before we answer each of them in turn.

We wish to emphasize that the use of multiple error terms ε_{1:K} is a necessary and integral part of ABCμ. In the first section of [4], we introduced ABCμ with the number of error terms set to K = 1 to keep the presentation simple.
In retrospect, we hope that this initial simplification did not lead to confusion (although in later sections and in our applications we clearly use multiple error terms).

Introduction and notation. Approximate Bayesian Computation (ABC) exploits model simulations x of a data-generating process M for sampling from approximate posterior distributions of the model parameters θ [5]. Typically, such predictions form the basis for model criticism [6], and we propose to use the data already generated by Monte Carlo implementations of ABC for this purpose too [4]. In ABCμ, the dual use of the model predictions is reflected in an extension of the state space of the targeted random variables: whenever the simulated summaries S(x) = (S_1(x), …, S_K(x)), x ∼ f(·|θ, M), are sufficiently close to the observed summaries S(x_0), we retain not only θ but also the computed discrepancies. The rationale of ABCμ is that small discrepancies between x and the observed data x_0 indicate favorable θ, whereas if these discrepancies are always large, the data-generating process (in short: model) M cannot describe the observed data well. The full potential of ABCμ is realized when we compute multiple discrepancies, each for one summary statistic S_k, ρ_k(S_k(x), S_k(x_0)). From first principles, we derived in [4] the sampling density of the accepted pairs (θ, (ρ_k(S_k(x), S_k(x_0)))_{1:K}), which we denote by

    f_{ρ,τ}(θ, ε_{1:K} | x_0, M) ∝ ξ_{x_0,θ}(ε_{1:K}) π_θ(θ | M) π_{ε_{1:K}}(ε_{1:K} | M).    (1)

We obtained a formula for the "augmented likelihood" ξ_{x_0,θ}(ε_{1:K}), which enables us to relate the posterior error density

    f_{ρ,τ}(ε_{1:K} | x_0, M) = ∫ f_{ρ,τ}(θ, ε_{1:K} | x_0, M) dθ

to the prior predictive error density, a well-known Bayesian quantity that was systematically discussed in a seminal paper [7] by Box (when K = 1 and ρ(S(x), S(x_0)) = x − x_0).
We have

    f_{ρ,τ}(ε_{1:K} | x_0, M) ∝ π_{ε_{1:K}}(ε_{1:K} | M) × L_ρ(ε_{1:K} | M)    (2)

where the prior predictive error density is given by

    L_ρ(ε_{1:K} | M) = ∫ δ{ (ρ_k(S_k(x), S_k(x_0)) = ε_k)_{1:K} } π(x | M) dx,

and π(x | M) = ∫ f(x | θ, M) π_θ(θ | M) dθ denotes the prior predictive (data) density. The shorthand δ notation represents a limit of functions, as detailed in Section S1.1 of the PNAS Supplementary Material [4]. The density π_{ε_{1:K}} is fully determined by the ABC kernel in the likelihood approximation,

    f_{ρ,τ}(θ | x_0, M) = ∫ f_{ρ,τ}(θ, ε_{1:K} | x_0, M) dε_{1:K} ∝ π_θ(θ | M) ∫ π_{ε_{1:K}}( (ρ_k(S_k(x), S_k(x_0)))_{1:K} | M ) f(x | θ, M) dx,    (3)

and can be interpreted as a prior density [8]. The relationship in Eq. 2 enables us to associate a statistical interpretation with our posterior errors and to relate them to other Bayesian quantities.

Standard assumptions in ABC and ABCμ. We assume that (A1) π_{ε_{1:K}} factorizes into ∏_{k=1}^K π_{ε_k}, is centered at zero and only depends on a multi-dimensional scale parameter τ = (τ_1, …, τ_K). The main reason behind (A1) is that otherwise, the same aspects of the data might be used to adjust the ABC kernel (or "prior" density) π_{ε_{1:K}} as well as the magnitude of the errors ρ_k(S_k(x), S_k(x_0)), and hence (potentially) more than once to inform our quantities of interest f_{ρ,τ}(θ | x_0, M) and f_{ρ,τ}(ε_{1:K} | x_0, M). Furthermore, ABCμ might falsely suggest rejecting the hypothesis that a model is an adequate representation of the data if π_{ε_{1:K}} is not centered at zero. Typical choices are

    π_{ε_k}(ε_k | M) = 1/τ_k 1{ |ε_k| ≤ τ_k/2 },
    π_{ε_k}(ε_k | M) = (2πτ_k²)^{−1/2} exp( −ε_k² / (2τ_k²) ),
    π_{ε_k}(ε_k | M) = 1/τ_k exp( −2|ε_k| / τ_k ).

We emphasize that in ABC and ABCμ, (A2) the scale parameter τ of the prior π_{ε_{1:K}} is in general chosen as small as possible.
Otherwise, if all model simulations are "acceptable", we have that

    f_{ρ,τ}(θ | x_0, M) ∝ π_θ(θ | M) ∫ π_{ε_{1:K}}( (ρ_k(S_k(x), S_k(x_0)))_{1:K} | M ) f(x | θ, M) dx = π_θ(θ | M) ∫ const × f(x | θ, M) dx = π_θ(θ | M).

Furthermore, (A3) the compound function x → ρ(S(x), S(x_0)) must be sensitive to changes in θ. Otherwise, we obtain

    f_{ρ,τ}(θ | x_0, M) ∝ ∫ π_ε(const) f(x | θ, M) dx π_θ(θ | M) = π_θ(θ | M).

The idea is to construct useful discrepancies which reflect changes in the simulated data as θ changes. (A4) As in ABC, we require that ρ_k(S_k(x), S_k(x_0)) = 0 if and only if S_k(x) = S_k(x_0). In contrast to most implementations of ABC, these discrepancies should be real-valued rather than non-negative. For example, in the case of scalar summaries, we use ρ_k(S_k(x), S_k(x_0)) = S_k(x) − S_k(x_0) instead of ρ_k(S_k(x), S_k(x_0)) = |S_k(x) − S_k(x_0)| [4]. We seek to construct (A5) roughly symmetric predictive error densities L_ρ(ε_{1:K} | M) with mode at zero under the null hypothesis that the prior model is an adequate representation of the data. Otherwise, negative small errors (−τ_k ≤ ε_k ≤ 0) may be significantly more (or less) frequent than positive small errors (0 ≤ ε_k ≤ τ_k) under the null, and conditioning on error magnitude could result in a large negative (or positive) posterior mean error even if the prior model is correct. Finally, we assume (A6) that the cumulative distribution function

    P_{θ,x_0}( ε_1 ∈ E_1, …, ε_K ∈ E_K ) = ∫_X 1{ (ρ_k(S_k(x), S_k(x_0)) ∈ E_k)_{1:K} } f(x | θ, M) dx

is either continuously differentiable when the observation space X is continuous, or a step function when X is finite. In this case, ξ_{x_0,θ}(ε_{1:K}) can be rewritten in terms of its elementary derivative. Next, in order to derive Eqns.
2–3, we also assume that the data-generating process M given by f(·|θ, M) is sufficiently regular to exchange the order of integration and limits; recall Section S1.1 of the PNAS Supplementary Material [4].

Construction of ABCμ

1. RMC point out that "the denomination [of ξ_{x_0,θ}(ε_{1:K}) as a] likelihood is debatable" [2] and that "the product ξ_{x_0,θ}(ε_{1:K}) π_{ε_{1:K}}(ε_{1:K}) is probabilistically incoherent" [1]. This conclusion derives from at least two observations: (i) "ξ_{x_0,θ} is strictly speaking not proportional to a density in x_0" [2] and (ii) "ξ_{x_0,θ}(ε_{1:K}) π_{ε_{1:K}}(ε_{1:K}) is not invariant under reparameterization" [2].

- In ABC, the observed data is reduced to a set of summary statistics and compared to simulated summaries with a positive, scalar-valued discrepancy function ρ(S(x), S(x_0)). For the purpose of parameter inference, we only need to plug ε = ρ(S(x), S(x_0)) into the ABC kernel. In other words, the scalar, positive error ε is in ABC merely a latent random variable, introduced to facilitate Bayesian computation [9, 10]. In [4], we derive the sampling distribution of the random variable ε, and recognize the utility of the related multiple error terms ε_{1:K}, each associated with one summary, for the purpose of model criticism. To us, ε_{1:K} is a random variable of particular statistical interest and no longer a latent variable introduced for computational reasons. Intuitively, we shift the observed summaries by ε_{1:K} and propose to infer whether summaries of x_0 that are shifted away from zero would occur at a higher frequency and hence be more probable than the (unshifted) observed summaries.
Formally, we define and identify the probability density

    ε_{1:K} → ξ_{θ,x_0}(ε_{1:K}) = lim_{h→0} ∫ δ_h{ (ρ_k(S_k(x), S_k(x_0)) − ε_k)_{1:K} } f(x | θ, M) dx,    (4)

where the δ_h function is given in Section S1.1 of the PNAS Supplementary Material [4]. For any given x_0 and θ, ξ_{θ,x_0}(ε_{1:K}) is the infinitesimal frequency with which we observe the multi-dimensional error ε_{1:K}. As RMC remark insightfully, it can be called a predictive error density that conditions on the observed data and the model parameter θ. In [4], we termed

    θ, ε_{1:K} → f_{ρ,τ}(x_0 | θ, ε_{1:K}) = ξ_{θ,x_0}(ε_{1:K})    (5)

an "augmented likelihood" simply to indicate that the state space was extended.

Example 1. Suppose we observe a single, one-dimensional data point x_0, and let us believe it is Poisson distributed with rate θ (denoted by M_1). Consider the scalar error ε = x − x_0. By construction, we have

    ξ_{θ,x_0}(ε) = lim_{h→0} ∫ δ_h( x − x_0 − ε ) Poisson(x; θ) dx = θ^{x_0+ε} e^{−θ} / (x_0+ε)! · 1{ x_0+ε ∈ [0, ∞) },

and the right-hand side equals in ε a Poisson distribution shifted by −x_0 and in x_0 a Poisson distribution shifted by −ε. Thus, when interpreted as a function in x_0, ξ_{θ,x_0}(ε) is also defined for negative values.

RMC's illuminating Poisson example serves to demonstrate how ξ_{θ,x_0}(ε) differs from a "likelihood". However, RMC go beyond our construction Eq. 4 and truncate x_0 → ξ_{θ,x_0}(ε) to positive values so as to re-adjust ξ_{θ,x_0}(ε) to the likelihood f(x_0 | θ, M_1) that is only defined for positive x_0 [1, 2]. To be clear, this re-adjustment is not part of ABCμ. Eq. 4 corresponds to a non-parametric evaluation of the sampling model in the context of model uncertainty. We adhere to the sampling model in that data is simulated under the likelihood, x ∼ f(·|θ, M), and probe the model predictions in several directions at the same time.
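The shifted-Poisson identity in Example 1 can be checked numerically. The following sketch is ours (the Poisson simulator by cdf inversion is a convenience, not part of the original example): for x ∼ Poisson(θ), the empirical distribution of ε = x − x_0 matches the Poisson pmf evaluated at x_0 + ε.

```python
import math
import random

# Numerical check of Example 1 (our own sketch): for x ~ Poisson(theta),
# the error eps = x - x0 has density xi(eps) = Poisson(x0 + eps; theta),
# i.e. a Poisson distribution shifted by -x0.

def poisson_pmf(n, theta):
    return math.exp(-theta) * theta**n / math.factorial(n) if n >= 0 else 0.0

random.seed(1)
theta, x0, n_sim = 3.0, 5, 200_000

counts = {}
for _ in range(n_sim):
    # simulate x ~ Poisson(theta) by inversion of the cdf
    u, n, p, cdf = random.random(), 0, math.exp(-theta), math.exp(-theta)
    while u > cdf:
        n += 1
        p *= theta / n
        cdf += p
    eps = n - x0
    counts[eps] = counts.get(eps, 0) + 1

# Empirical frequency of each error matches the shifted Poisson pmf:
for eps in range(-x0, 6):
    freq = counts.get(eps, 0) / n_sim
    assert abs(freq - poisson_pmf(x0 + eps, theta)) < 0.01
```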
If ε_{1:K} = 0, we have with (A4) that ξ_{θ,x_0}(ε_{1:K}) corresponds to the probability of the observed summaries under θ. For error terms different from zero, we quantify the probability of deviations from the observed summaries under the sampling model. Labeling f_{ρ,τ}(x_0 | θ, ε_{1:K}) in Eq. 5 a "shifted likelihood" therefore seems more appropriate. Because we only shift the observed summaries in Eq. 4 (with no further re-adjustments towards the original likelihood as in [1, 2]), the re-normalization required when considering Eq. 5 as a function in x_0 does not depend on ε_{1:K}, and f_{ρ,τ}(x_0 | θ, ε_{1:K}) is proportional to a density in x_0.

Next, let us recall that our error terms ε_k correspond directly to the compound functions x → ρ_k(S_k(x), S_k(x_0)). Therefore, in ABCμ, a transformation of ε_k implies a change in how the data is being summarized. Typically, such a change requires modifying the scale parameter τ of the prior density π_{ε_{1:K}} when the scale of the discrepancies changes too. Therefore, transformations of the product ξ_{x_0,θ}(ε) π_ε(ε) must also change τ in π_{ε_{1:K}} when the Jacobian is not constant. In the ABC literature, it is well known that the approximate posterior density f_{ρ,τ}(θ | x_0, M) depends on the choice of discrepancies and the stringency of τ [5, 11, 12]. Since ABCμ only uses the information provided in ABC to a fuller extent, the joint posterior density f_{ρ,τ}(θ, ε_{1:K} | x_0, M) is equally sensitive to changes in the compound functions x → ρ_k(S_k(x), S_k(x_0)) and the vector τ. In other words, the ABC and ABCμ target densities f_{ρ,τ}(θ | x_0, M) and f_{ρ,τ}(θ, ε_{1:K} | x_0, M) are not invariant under different approximation schemes. This leaves ABCμ probabilistically sound, but warrants particular caution and calls for sensitivity analyses, perhaps to a larger extent than is common practice.

Model assessment

2.
Model assessment with ABCμ requires that "the data is informative" and "is challenging" [1, 2]. In the location-family example [2] it is shown that the posterior error equals the prior error if the prior predictive density is flat.

- We agree with RMC that ABCμ cannot criticize a model when the observed data x_0 reduces to a single, one-dimensional data point, as in their examples [2] on pages 1–2. More generally, we showed that the posterior error can be interpreted as a weighted prior predictive error, Eq. 2 [4]. If the prior predictive error is uninformative, then the posterior error will simply reflect the weighting.

Example 2. Suppose a Gaussian likelihood model M_2 with unknown mean θ and fixed variance 1, and a Gaussian prior density π_θ(θ | M_2) = N(θ; θ*, h²). We have

    π(x | M_2) = ∫ N(x; θ, 1) N(θ; θ*, h²) dθ = N(x; θ*, h² + 1),

and L_{ρ,τ}(ε | M_2) = N(ε; θ* − x_0, h² + 1). We mimic a situation where π_θ is uniform by letting h → ∞, so that π(x | M_2) and L_{ρ,τ}(ε | M_2) become improper. Suppose further a Gaussian error density π_ε(ε | M_2) = N(ε; 0, τ²). Then, f_{ρ,τ}(ε | x_0, M_2) = N(ε; 0, τ²). Likewise, when models have comparable parameter spaces, the Bayes factor will be indecisive under non-informative priors π_θ [13]. Consider the alternative Gaussian model M_2' defined by f(x | θ, M_2') = N(x; θ, 3), the same prior density π_θ, and let us focus on the approximate Bayes factor

    B_{ρ,τ} = f_{ρ,τ}(x_0 | M_2') / f_{ρ,τ}(x_0 | M_2) = [ ∫ f_{ρ,τ}(x_0 | θ, M_2') π_θ(θ | M_2') dθ ] / [ ∫ f_{ρ,τ}(x_0 | θ, M_2) π_θ(θ | M_2) dθ ]

to mimic the situation that we cannot readily evaluate the likelihood. We obtain

    B_{ρ,τ} = sqrt( (τ² + h² + 1) / (τ² + h² + 3) ) exp( x_0² / [ (τ² + h² + 1)(τ² + h² + 3) ] ),

which tends rapidly to one as h → ∞.
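The closed-form approximate Bayes factor above can be verified against the ratio of the two Gaussian ABC marginals of x_0. This sketch is ours and takes θ* = 0, so the printed formula applies as written; the numerical values are illustrative.

```python
import math

# Check of Example 2 (our sketch, theta* = 0): with a Gaussian ABC kernel
# N(eps; 0, tau^2), the ABC marginal of x0 under a N(theta, sigma^2) model
# with prior N(theta; 0, h^2) is N(x0; 0, tau^2 + h^2 + sigma^2), so the
# approximate Bayes factor between sigma^2 = 3 and sigma^2 = 1 is
#   B = sqrt((t+h^2+1)/(t+h^2+3)) * exp(x0^2 / ((t+h^2+1)(t+h^2+3))),  t = tau^2.

def normal_pdf(x, var):
    return math.exp(-0.5 * x * x / var) / math.sqrt(2 * math.pi * var)

def bayes_factor(x0, tau2, h2):
    # ratio of the two ABC marginal densities of the observed point
    return normal_pdf(x0, tau2 + h2 + 3) / normal_pdf(x0, tau2 + h2 + 1)

def closed_form(x0, tau2, h2):
    a, b = tau2 + h2 + 1, tau2 + h2 + 3
    return math.sqrt(a / b) * math.exp(x0 * x0 / (a * b))

x0, tau2 = 2.0, 0.01
for h2 in (1.0, 100.0, 1e6):
    assert abs(bayes_factor(x0, tau2, h2) - closed_form(x0, tau2, h2)) < 1e-9
# The factor tends to one as the prior flattens (h -> infinity):
assert abs(bayes_factor(x0, tau2, 1e6) - 1.0) < 1e-3
```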
In this setting, both our posterior error and approximate Bayes factor give reasonable answers for the purpose of model criticism and model comparison respectively. Based on one data point, we cannot reject the current model M_2 and likewise, the approximate Bayes factor for choosing among M_2 and a comparable model is indecisive.

Clearly, there is no guarantee that ABCμ always uncovers existing model mismatch. But how difficult is it to uncover existing discrepancies with ABCμ in practice? Typically, x_0 contains some structure and/or repeated observations. Instead of using just one data point in Example 2, let us imagine a data set of 100 samples and summarize this data with two statistics, leading to a two-dimensional posterior error density.

Example 3. Consider a data set x_0 of 100 independent samples that are Exponentially distributed with rate 1/μ_t = 0.2. We believe that each sample is generated from N(·; θ, 1) and consider a Gaussian prior density π_θ(θ | M_2) = N(θ; θ*, h²). We summarize the data with the sample mean x̄_0 and the sample median median(x), use the discrepancies ρ(S_k(x), S_k(x_0)) = S_k(x) − S_k(x_0) and consider the prior density π_{ε_{1:K}}(ε_{1:K} | M_2) = ∏_k 1/τ_k exp( −2|ε_k|/τ_k ) with τ_k = 0.1. To illustrate that ABCμ may reveal inappropriate prior specifications, we set θ* = 0 and h² = 0.1. We applied the Metropolis–Hastings sampler proposed by Marjoram et al. [14] and recorded the computed discrepancies to estimate our posterior error (mcmcABCμ; see below). A more detailed discussion of various algorithms to sample from the ABCμ target density Eq. 1 will appear elsewhere. Figures 1A–B show that the marginal densities f_{ρ,τ}(ε_k | x_0, M_2) are far from zero, suggesting that our strong prior beliefs are inadequate to explain the data.
To illustrate that ABCμ may identify a faulty sampling model, we set θ* = 5, h² = 100000. Again, we estimated the ABCμ target density numerically with mcmcABCμ. Even though π_θ is essentially flat, our marginal posterior errors do not center at zero; see Figures 1C–D. In [4], we investigated primarily the marginal posterior densities f_{ρ,τ}(ε_k | x_0, M) = ∫ f_{ρ,τ}(ε_{1:K} | x_0, M) dε_{−k}. Here, we also show heat plots which reflect more comprehensively the multi-dimensional character of our error density f_{ρ,τ}(ε_x̄, ε_median | x_0, M_2) in Figures 2A–B.

Intuitively, ABCμ will indicate model mismatch whenever all discrepancies are simultaneously not close to zero for any θ. To escape unidentifiability, the crux in Example 3 is to use multiple error terms associated with co-dependent summaries that may reveal model inconsistencies; see also [4] for a similar example. In real-world applications, (most) summary statistics are usually co-dependent, rendering ABCμ a potentially

Figure 1: Numerical reconstructions of the densities f_{ρ,τ}(ε_x̄ | x_0, M_2), f_{ρ,τ}(ε_median | x_0, M_2) in Example 3, obtained with samples generated by mcmcABCμ. The (A–B) posterior error under inappropriate prior specifications π_θ and the (C–D) posterior error under essentially flat prior specifications π_θ suggest model mismatch.
very powerful method to reveal model inconsistencies.

Figure 2: Heat plots of the density f_{ρ,τ}(ε_x̄, ε_median | x_0, M_2) in Example 3, obtained with samples generated by mcmcABCμ, (A) under inappropriate prior specifications π_θ and (B) under essentially flat prior specifications π_θ.

Because model inconsistencies can only increase with the inclusion of new summary statistics, we are typically prepared to use a large set of summaries. Moreover, it is not required that these co-dependent summaries are sufficient for θ under the data-generating process M, as we illustrate in [3], Figure 1. A more detailed discussion will appear elsewhere; here we only note that these properties are appealing because in real-world applications of ABC and ABCμ, it is typically not known whether any set of summaries is sufficient for the parameters of a given model, while it is relatively easy to come up with co-dependent summaries. However, the extent to which f_{ρ,τ}(ε_{1:K} | x_0, M) merely reflects the weighting π_{ε_{1:K}} should be checked, because the discrepancies might not retain enough information about the data to question a model (recall A3). The perhaps simplest (but not necessarily successful) approach is to compare the posterior error density to the shape of π_{ε_{1:K}}. Reassuringly, in our real-world applications, f_{ρ,τ}(ε_{1:K} | x_0, M) differs markedly from π_{ε_{1:K}}; see e.g. Figure 3 in the PNAS paper, where the prior error density is indicated in dotted lines. Crucially, since ABCμ does not reject a model when f_{ρ,τ}(ε_{1:K} | x_0, M) is close to π_{ε_{1:K}} (recall A1), no harm is done should the discrepancies not be informative. The power of ABCμ in assessing goodness-of-fit stems, firstly, from probing a model in multiple directions at the same time.
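The two-direction probe of Example 3 can be sketched with a plain rejection sampler. This is our own simplified stand-in for mcmcABCμ: the uniform kernel scale τ_k = 2 (far looser than the paper's τ_k = 0.1), the uniform prior on θ, and the use of population summaries in place of a sampled data set are all our assumptions, chosen so the sketch runs quickly.

```python
import math
import random
import statistics

# Rejection-sampling sketch of Example 3 (ours, not the paper's mcmcABC-mu):
# exponential data with mean 5 are fit with a N(theta, 1) model. The two
# discrepancies eps_mean = mean(x) - mean(x0) and eps_median = median(x) - median(x0)
# cannot both be near zero (the exponential mean exceeds its median by ~1.5,
# while a Gaussian's mean and median coincide), which reveals the mismatch.

random.seed(2)
s_mean_obs, s_median_obs = 5.0, 5.0 * math.log(2.0)   # population summaries, ~3.466
tau = 2.0                                             # accept if |eps_k| <= tau/2
accepted_mean, accepted_median = [], []

for _ in range(20_000):
    theta = random.uniform(0.0, 10.0)                 # broad prior draw
    x = [random.gauss(theta, 1.0) for _ in range(100)]
    eps_mean = statistics.fmean(x) - s_mean_obs
    eps_median = statistics.median(x) - s_median_obs
    if abs(eps_mean) <= tau / 2 and abs(eps_median) <= tau / 2:
        accepted_mean.append(eps_mean)
        accepted_median.append(eps_median)

# Both posterior errors sit well away from zero, with opposite signs:
assert len(accepted_mean) > 100
assert statistics.fmean(accepted_mean) < -0.4
assert statistics.fmean(accepted_median) > 0.4
```

Tightening τ_k toward the paper's 0.1 drives the acceptance probability to zero here, which is exactly the simultaneous conflict among co-dependent summaries that the posterior error density is designed to expose.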
We hope that our simple examples illuminate the contribution of model inconsistencies, as reflected in multiple error terms, to model assessment. Secondly, ABCμ makes it possible to criticize a model whose likelihood cannot be readily evaluated, and does not incur any extra computational cost when compared to ABC (in contrast to related predictive approaches discussed in point 4 below).

Model criticism

3. ABCμ "is strongly impacted by prior modeling [and] fails to condition on the observed data" [1].

- The ABC kernel, which can be interpreted as a prior density π_{ε_{1:K}} [8], is at the heart of ABC (recall A1) and modulates the degree to which ABC and ABCμ condition on the observed data. In other words, the ABC and ABCμ target densities are sensitive to the choice of π_{ε_{1:K}} and particularly its scale parameter τ. Indeed, ABCμ conditions on the observed data by accepting θ in relation to the magnitudes of the K computed discrepancies taken together. Accordingly, under (A2, A3), the posterior error density f_{ρ,τ}(ε_{1:K} | x_0, M) updates the prior predictive error density L_ρ(ε_{1:K} | M); see Example 4 below. Based on the observation that small errors boost the weight of the associated values of θ that are simulated from π_θ, we say that "ABCμ criticizes a fitted model". This can be illustrated with the location family in Example 2.

Example 4. Consider again the Gaussian likelihood model M_2 and a Gaussian prior density π_θ as in Example 2. For illustration purposes, let us choose π_θ broad but not flat: h² = 9. In this case,

    f_{ρ,τ}(ε | x_0, M_2) ∝ N(ε; θ* − x_0, 10) N(ε; 0, τ²) ∝ N(ε; θ̃, σ̃²)

where θ̃ = [ τ² / (τ² + 10) ] × (θ* − x_0) and σ̃² = 10τ² / (τ² + 10) ≤ 10. The posterior error density "updates" the prior predictive error in that the variance of f_{ρ,τ}(ε | x_0, M_2) is smaller than that of L_{ρ,τ}(ε | M_2).
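The Gaussian update in Example 4 can be verified directly: the product of the two normal densities is, up to a constant in ε, a normal density with the stated mean and variance. This check is our own sketch; the numerical values of θ*, x_0 and τ² are illustrative.

```python
import math

# Numerical check of Example 4 (our sketch): the product of the prior
# predictive error density N(eps; theta* - x0, 10) and the ABC kernel
# N(eps; 0, tau^2) is proportional to N(eps; theta_tilde, sigma_tilde2) with
#   theta_tilde  = tau^2 / (tau^2 + 10) * (theta* - x0)
#   sigma_tilde2 = 10 * tau^2 / (tau^2 + 10).

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

theta_star, x0, tau2 = 2.0, 5.0, 0.25
m = theta_star - x0                       # mean of the prior predictive error
theta_tilde = tau2 / (tau2 + 10.0) * m
sigma_tilde2 = 10.0 * tau2 / (tau2 + 10.0)

# The ratio product / N(eps; theta_tilde, sigma_tilde2) is constant in eps:
ratios = []
for eps in (-1.0, -0.5, 0.0, 0.5, 1.0):
    product = normal_pdf(eps, m, 10.0) * normal_pdf(eps, 0.0, tau2)
    ratios.append(product / normal_pdf(eps, theta_tilde, sigma_tilde2))
assert max(ratios) - min(ratios) < 1e-12
# The posterior error has smaller variance, and its mean is pulled toward zero:
assert sigma_tilde2 < 10.0 and abs(theta_tilde) < abs(m)
```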
Furthermore, we observe that |θ̃| is smaller than the absolute mean of L_{ρ,τ}(ε | M_2), reflecting the fact that f_{ρ,τ}(ε | x_0, M_2) criticizes a fitted model rather than the prior model (M_2, π_θ).

For the purpose of model criticism, it is important to recognize that the dependency of our posterior error on π_{ε_{1:K}} is a good thing to the extent to which the prior π_θ is not an adequate model parameterization. The smaller τ can be chosen, the more we are able to criticize a fitted model and the more we attenuate the influence of π_θ in f_{ρ,τ}(ε_{1:K} | x_0, M). The latter point can also be illustrated with the location family in Example 2; recall Section S1.2 in the PNAS Supplementary Material.

Figure 3: Heat plots of our posterior error density f_{ρ,τ}(ε_x̄, ε_median | x_0, M_3) in Example 5 for broad π_θ: (A) τ_0 = 1000; (B) α_0 = 2 and β_0 = τ_0 = 1000; and when τ is set too large, (C) α_0 = 2, β_0 = τ_0 = 1000 and τ_k = 6.4.

Let us illustrate the influence of π_θ and π_{ε_{1:K}} when the summaries are co-dependent but not sufficient for θ under model M.

Example 5. Consider again the data set x_0 of 100 independent samples that are Exponentially distributed with rate 0.2, suppose now that each sample is generated according to a Gaussian likelihood model with unknown mean μ ∈ R and σ² ≥ 0 (denoted by M_3), and summarize the data with the sample mean and median. We consider ρ_k(S_k(x), S_k(x_0)) = S_k(x) − S_k(x_0), π_{ε_{1:K}}(ε_{1:K} | M_3) = ∏_{k=1}^K 1/τ_k 1{ |ε_k| ≤ τ_k/2 }, and the prior density π_θ(θ | M_3) = π(μ | M_3) π(σ² | M_3), where π(σ² | M_3) = IG(σ²; α_0, β_0) and π(μ | M_3) ∝ 1{ |μ − μ_0| ≤ τ_0 }.
In [3], Figure 1, we chose a slightly different prior density π_θ with hyperparameters μ_0 = 5, τ_0 = 10, α_0 = 4 and β_0 = 75, such that the prior means of μ and σ² are 5 and β_0/(α_0 − 1) = 25, matching the empirical mean and variance of the observed data. To illustrate that ABCμ uncovers existing model mismatch with co-dependent summaries that are not sufficient for θ under M_3 even when π_θ differs markedly from the data or is uninformative, we now vary these hyperparameters. First, let us set τ_0 = 1000. We ran mcmcABCμ (see below) for 10,000 iterations to sample from f_{ρ,τ}(μ, σ², ε_x̄, ε_median | x_0, M_3) with τ_k set to 1.6, and repeated this run four times from overdispersed starting values to assess the convergence of the chains. Samples from the burn-in period were discarded. Figure 3A illustrates that our joint posterior error density remains virtually unchanged (compare to Figure 1C in [3]). Next, we set α_0 = 2 and β_0 = τ_0 = 1000 and ran mcmcABCμ as above. Even though π_θ is now extremely broad, our joint posterior error density continues to identify model mismatch; see Figure 3B.

ABC depends on the error threshold τ, and so does ABCμ. In order to identify model mismatch with f_{ρ,τ}(ε_{1:K} | x_0, M), existing conflicts among several summary statistics are uncovered by setting τ sufficiently small. For example, setting τ_k to 6.4, such that the acceptance probability of mcmcABCμ is larger than 80%, the posterior error is very broad and does not suggest model mismatch; see Figure 3C. In summary, the ability of ABCμ to criticize a fitted model is strongly modulated by the choice of discrepancies and the error threshold τ. Probing a model under particular assumptions on π_θ in the directions specified by ε_{1:K} is not guaranteed to uncover existing model mismatch.
For example, using ABCμ with the sample mean and the standard deviation (two independent summaries) in place of the sample mean and the median in Example 5 fails to uncover existing model mismatch. Similarly, using the sample mean and the 25% quantile fails to reveal model inconsistencies as clearly as the sample mean and the sample median. In principle, the contribution of "the data" to f_{ρ,τ}(ε_{1:K} | x_0, M) can only increase with larger K and/or more stringent choices of τ, and cannot be quantified by considering flat π_{ε_{1:K}} (see also A2). In the directions determined by ε_{1:K} and τ, the criticized model comprises the sampling model M and our prior assumptions π_θ, and we acknowledge that "having no way to distinguish between prior and sampling model inadequacy is a difficulty" [2]. More work is needed here.

4. From an ABC perspective, using the posterior predictive m(x | x_0, M) instead of the prior predictive π(x | M) "requires same computing times" [2].

- In the context of ABC, when the likelihood cannot be readily evaluated, the use of the posterior predictive (data) density

    m(x | x_0, M) = ∫ f(x | θ, M) f(θ | x_0, M) dθ

is complicated by the fact that samples from the true posterior density f(θ | x_0, M) are in general not available. However, m(x | x_0, M) can be approximated by

    m_{ρ,τ}(x | x_0, M) = ∫ f(x | θ, M) f_{ρ,τ}(θ | x_0, M) dθ.

An alternative approach for model criticism could be the approximate posterior predictive (APP) error density

    L_{ρ,τ,x_0}(ε_{1:K} | M) = ∫ δ{ (ρ_k(S_k(x), S_k(x_0)) = ε_k)_{1:K} } m_{ρ,τ}(x | x_0, M) dx.

However, for a complex model, the extra volatility induced by simulating from f(· | θ, M) means that re-simulations from f_{ρ,τ}(θ | x_0, M) need not meet the stringency requirement π_{ε_{1:K}}.
It might therefore also be useful to consider the weighted approximate posterior predictive (wAPP) error density

    f_{ρ,τ,x_0}(ε_{1:K} | x_0, M) ∝ L_{ρ,τ,x_0}(ε_{1:K} | M) π(ε_{1:K} | M).

Both L_{ρ,τ,x_0}(ε_{1:K} | M) and f_{ρ,τ,x_0}(ε_{1:K} | x_0, M) adopt a sequential approach to model criticism, comprising a training step (inference of f_{ρ,τ}(θ | x_0, M)) and a testing step (APP or wAPP). The testing step adds a computational overhead to typical ABC procedures. For example, it takes about two minutes to evaluate our seven summaries on the Saccharomyces cerevisiae PPI data set [15], and hence an extra 2 × 500/60 ≥ 16 hrs to obtain 500 samples from APP on one computer. Assuming a large acceptance probability of 10%, the extra time required to obtain 500 samples from wAPP is more than 6 days. By contrast, our posterior error incurs no additional computational cost because the discrepancies already computed in any ABC algorithm are only used to a fuller extent.

Nevertheless, one might be prepared to pay this cost if the densities L_{ρ,τ,x_0}(ε_{1:K} | M) and f_{ρ,τ,x_0}(ε_{1:K} | x_0, M) had an intrinsic advantage compared to our posterior error density f_{ρ,τ}(ε_{1:K} | x_0, M). In general, it is difficult to compare the behavior of our posterior error with APP and wAPP under model uncertainty. First, we note that L_{ρ,τ,x_0}(ε_{1:K} | M) and our f_{ρ,τ}(ε_{1:K} | x_0, M) are very different quantities, relating respectively to sequential and simultaneous approaches to model criticism. This is also reflected in their distinct asymptotic properties as τ → 0. Second, Example 7 in the Appendix demonstrates that, counter-intuitively, f_{ρ,τ}(θ | x_0, M) may be broader than π_θ under model uncertainty. An additional complication to be considered with APP and wAPP is that the same aspects of the data are used to inform both the training and the testing phase.
Hence, these quantities violate the fundamental requirement in statistical learning that the training data be independent from the testing data [16]. It is possible to use different aspects of the data during the two stages, and this brings us back to the partially predictive and conditionally predictive densities previously discussed by Bayarri and Berger [17]. Unfortunately, in real-world applications of ABC, it is often difficult to identify discrepancies that are independent of each other.

5. Our estimator ξ̂ of the augmented likelihood that is based on B repeat samples "cannot be used as a practical device because B is necessarily small" [1] "… in which case the non-parametric approximation is poor, or B is large in which case producing the x's is too time-consuming" [2].

- In our applications, we found that we obtained the largest improvements in terms of the effective sampling size for small to moderate values of B that are computationally feasible. Let us also recall that the choice of proposal kernel in ε_{1:K} is a crucial element of the second algorithm in [4] and should not be omitted when considering its efficiency. We acknowledge that our observations may not readily extend to other applications. In the same way that we augmented standard ABC to what we call Std-ABCμ in [4], it is straightforward to modify any existing ABC algorithm for the purpose of model criticism by (i) using many co-dependent, real-valued discrepancies and (ii) recording those discrepancies. For example, the Metropolis–Hastings sampler proposed by Marjoram et al. [14] can be adapted to provide samples from the target distribution

    f_{ρ,τ}(dθ, dx, dε_{1:K} | x_0, M) = [ π_θ(θ | M) π_{ε_{1:K}}(ε_{1:K} | M) f(x | θ, M) / f_{ρ,τ}(x_0 | M) ] δ_{ρ_{1:K}(x)}(dε_{1:K}) dx dθ,

where we put ρ_{1:K}(x) = ( ρ_k(S_k(x), S_k(x_0)) )_{1:K} for brevity.
Here, δ_{ρ_1:K(x)}(dε_1:K) denotes the Dirac measure at the point ρ_1:K(x). Suppose initial values θ⁰, x⁰ ∼ f(· | θ⁰, M) and set ε⁰_k = ρ_k(S_k(x⁰), S_k(x0)).

mcmcABCµ1. If now at θ, propose a move to θ′ according to a proposal density q(θ → θ′).
mcmcABCµ2. Generate x′ ∼ f(· | θ′, M) and compute ε′_k = ρ_k(S_k(x′), S_k(x0)) for k = 1, . . . , K.
mcmcABCµ3. Accept (θ′, x′, ε′_1:K) with probability mh(θ, x, ε_1:K; θ′, x′, ε′_1:K) = min{ 1, r_vanilla(θ, x, ε_1:K; θ′, x′, ε′_1:K) }, where

r_vanilla(θ, x, ε_1:K; θ′, x′, ε′_1:K) = [ π_θ(θ′ | M) q(θ′ → θ) π_ε_1:K(ε′_1:K | M) ] / [ π_θ(θ | M) q(θ → θ′) π_ε_1:K(ε_1:K | M) ],

and otherwise stay at (θ, x, ε_1:K). Then return to mcmcABCµ1.

As is standard practice, the algorithm is run sufficiently long and samples from an initial "burn-in" period are discarded. It is not difficult to show that, marginally in (θ, ε_1:K), mcmcABCµ provides samples from f_ρ,τ(θ, ε_1:K | x0, M) (Eq. 1) for suitable proposal kernels q(θ → θ′) under our regularity assumptions (A6).

Model criticism and model comparison

6. "The Bayesian foundations of ABCµ are questionable: the consequences of rejecting a model are ignored by ABCµ but include constructing another model" [1] and "this leads to wonder about the gain compared with using the Bayes factor" [2]. Moreover, "the estimation of Bayes' factors is even faster" [1] and "provides a different answer" [2].

- To us, model criticism and model comparison are important and complementary aspects of statistical reasoning. Indeed, methods for model comparison attempt to choose between candidate models, even if none of them matches the data well in one or several aspects.
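Returning to the sampler above: as a concrete illustration, the steps mcmcABCµ1-3 can be sketched in a few lines of Python for a toy Poisson model with a single discrepancy (K = 1). The model, prior, summary, kernel and proposal below are our own illustrative choices, not those of [4]; with a symmetric random-walk proposal, the q-ratio cancels in r_vanilla.

```python
import math
import random

random.seed(1)

# Illustrative choices (ours): Poisson model, Exp(1) prior, identity summary,
# discrepancy eps = x - x0, ABC kernel pi_eps(eps | M) proportional to 2^(-|eps|/tau).
x0 = 5          # observed datum
tau = 2.0       # "error threshold" of the ABC kernel

def log_prior(theta):                 # Exp(1) prior density, log scale
    return -theta if theta >= 0 else float("-inf")

def simulate(theta):                  # draw x ~ Poisson(theta), Knuth's method
    limit, k, p = math.exp(-theta), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

def log_kernel(eps):                  # log of 2^(-|eps|/tau)
    return -abs(eps) / tau * math.log(2.0)

# mcmcABCmu: Metropolis-Hastings on (theta, x, eps_1:K) with K = 1.
theta = 1.0
eps = simulate(theta) - x0
samples = []
for _ in range(20000):
    theta_p = theta + random.gauss(0.0, 1.0)          # mcmcABCmu1: propose
    if theta_p >= 0:                                  # prior support check
        eps_p = simulate(theta_p) - x0                # mcmcABCmu2: simulate
        log_r = (log_prior(theta_p) + log_kernel(eps_p)
                 - log_prior(theta) - log_kernel(eps))
        if math.log(random.random()) < log_r:         # mcmcABCmu3: accept
            theta, eps = theta_p, eps_p
    samples.append((theta, eps))

post = samples[5000:]                                 # discard burn-in
post_mean_eps = sum(e for _, e in post) / len(post)
print(round(post_mean_eps, 2))  # clearly negative here, flagging mismatch at x0 = 5
```

Note that the simulation density f(x | θ, M) never appears in the acceptance ratio; only the prior and the kernel π_ε are evaluated, which is why the normalizing constant of π_ε need not be known.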
ABC is very flexible in that arbitrary data-generating processes M can be analyzed without the need to compute the likelihood, so long as the evaluation of the summary statistics is computationally tractable. ABCµ makes it possible to evaluate, at no extra computational cost, whether a model matches the observed data in terms of a large set of summary statistics, and to obtain useful indications of how a faulty model should be modified. In our work, we found that ABCµ thus enables us to iterate rapidly through the initial stages of model design and to identify one model, or a suite of models, in agreement with the data, even when the likelihood cannot be readily evaluated. We believe that the ability of ABCµ to offer statistical rigor at this point is highly valuable to areas of modern science where complex models are now formulated to explain and agree with data collected across the traditional boundaries of disciplines. For example, in biology, we face a wealth of data that is hard to analyze in its entirety with current computer resources (e.g. molecular genetic data), or we have one intricate data set (e.g. molecular interaction networks), or we cross boundaries of biological organization (e.g. systems biology). The methods presented in Ratmann et al. [4] do not address the problem of choosing a model from a suite of candidates. Model comparison when the likelihood cannot be readily evaluated is not the topic of [4], and has been introduced elsewhere [18, 19, 20, 21, 22].
Example 6. Let us re-visit RMC's Poisson example [1, 2] in order to (a) illustrate model criticism with ABCµ when the errors are discrete rather than continuous random variables, (b) inspect the case of asymmetric predictive error densities L_ρ(ε_1:K | M) and (c) re-examine the behavior of the approximate marginal likelihood as presented in [2], Figure 1. Consider the Poisson model M1 of Example 1 and suppose that π_θ(θ | M1) = exp(−θ) 1{θ ≥ 0}. We have that

π(x | M1) = ∫₀^∞ Poisson(x; θ) exp(−θ) dθ = 2^{−x−1} 1{x ≥ 0},

and hence L_ρ(ε | M1) ∝ 2^{−x0−ε−1} 1{x0 + ε ≥ 0}. The ABC kernel always depends on an "error threshold" τ (recall A1) and, given the form of L_ρ(ε | M1), we consider here π_ε: {0, 1, −1, 2, . . .} → R₀⁺ with π_ε(ε | M1) ∝ 2^{−|ε|/τ}.

Figure 4: Density plots of f_ρ,τ(ε | M1) in Example 6. Posterior mean errors are indicated by large diamonds. (A) We fix τ = 2 and consider a data point x0 = 1 that is in agreement with our prior belief π_θ = Exp(1), as well as a data point x0 = 5 that differs from our prior model. In the latter case, the posterior mean error suggests mismatch between the model and the data. (B) We fix x0 = 5 and consider the prior predictive error density (corresponding to an essentially flat π_ε with τ = 50) and the posterior error density associated with τ = 2/3. Again, we observe that this fitted model is harder to criticize than the prior model.
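The closed form π(x | M1) = 2^{−(x+1)} above is easy to confirm numerically; the following short check (our own sketch, not part of [1, 2, 4]) integrates Poisson(x; θ) exp(−θ) with a midpoint rule.

```python
import math

# Check pi(x | M1) = 2^(-(x+1)): integrate Poisson(x; theta) * exp(-theta)
# over theta in [0, 40] by a midpoint rule (the tail beyond 40 is negligible).
def marginal(x, n=200000, upper=40.0):
    h = upper / n
    fx = math.factorial(x)
    total = 0.0
    for i in range(n):
        th = (i + 0.5) * h
        total += th ** x * math.exp(-2.0 * th) / fx * h   # e^-th * e^-th = e^-2th
    return total

for x in range(6):
    assert abs(marginal(x) - 2.0 ** -(x + 1)) < 1e-6
print("pi(x | M1) = 2^-(x+1) holds for x = 0..5")
```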
Then, our marginal posterior error is f_ρ,τ(ε | M1) ∝ 2^{−(x0+ε+|ε|/τ+1)} 1{x0 + ε ≥ 0}, with normalizing constant

f_ρ,τ(x0 | M1) = Σ_{ε=−x0}^{∞} 2^{−(x0+ε+|ε|/τ+1)}
             = 2^{−(x0+1)} [ 1{x0 > 0} ( (1 − 2^{(1−1/τ)(x0+1)}) / (1 − 2^{1−1/τ}) − 1 ) + 1/(1 − 2^{−1−1/τ}) ]    (6)

under the assumption that x0 ≥ 0. Figure 4 illustrates the posterior error density for various choices of x0 and τ, and the respective posterior means are indicated by large diamonds. Note that π_ε was here only chosen for reasons of analytical tractability; we could still use our two-sided Exponential density π_ε: R → R₀⁺ with π_ε(ε | M) = 1/τ exp(−2|ε|/τ), or the standard indicator function. Indeed, even if we do not know the set of possible discrete errors under a model M, all we miss is the correct normalizing constant of π_ε: {0, 1, −1, 2, . . .} → R₀⁺ with π_ε(ε | M) ∝ exp(−2|ε|/τ).

Figure 5: (A) Plots of the posterior mean error ∫ ε f_ρ,τ(ε | x0, M1) dε in Example 6 as a function of x0 for τ = 2/3 (red), τ = 2 (black) and τ = ∞ (grey). Provided the prior model is an adequate representation of the data (x0 = 1), the prior predictive mean error is zero (dashed lines). By contrast, the posterior mean error is not zero in this case (red and black lines), simply because L_ρ(ε | M1) is not symmetric. (B) Plots of the ABCµ approximate marginal likelihood as a function of x0 for τ = 2/3 (red), τ = 2 (black) and τ = ∞ (grey). Note that this plot differs qualitatively from the one in [2] because RMC decided to truncate ξ_θ,x0 to positive values.
This constant need not be known; see for example our algorithm mcmcABCµ. Figure 5A illustrates the posterior mean error ∫ ε f_ρ,τ(ε | x0, M1) dε as a function of x0 for various choices of τ. Setting τ = ∞, we obtain the prior predictive mean error, which is zero if the model corresponds well to the observed data (x0 = 1). Since L_ρ(ε | M1) is not symmetric around zero when the prior model is adequate (opposing A5), conditioning on error magnitude results in a slightly negative posterior mean error when x0 = 1. Let us recall that RMC decided to truncate the density ε → ξ_θ,x0(ε) to non-negative values, and then plotted the associated marginal likelihood f_trunc(x0 | M1) = ∫∫ ξ_trunc_θ,x0(ε) π_θ(θ | M1) π_ε(ε | M1) dθ dε as a function of x0 in [2], Figure 1. In Figure 5B, we plot the ABCµ marginal likelihood Eq. 6 as a function of x0. In this example, the approximate marginal likelihood decreases monotonically for all values of τ, and only small values of τ provide a suitable approximation of the true marginal likelihood (τ = 0). Thus, the posterior mean error and the approximate marginal likelihood both depend on the precise value of the "error threshold", suggesting that sensitivity analyses are required for ABCµ as well as for complementary tools for model comparison that are based on approximate marginal likelihoods.

The Bayes' factor is a tool that may address both model comparison and model criticism, depending on the formulation of the null and alternative hypotheses. It is possible to devise approximate Bayes' factors to test the null hypothesis ε = 0 versus the alternative ε ≠ 0 as a surrogate measure for the hypothesis that the model describes the data adequately well, i.e. for the purpose of model criticism (unpublished results, but see [23, 24]).
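For readers who wish to reproduce these numbers, Eq. 6 and the posterior mean errors of Figure 5A can be verified with a short script (our own sketch; `emax` is an illustrative truncation point for the infinite sum, whose terms decay geometrically).

```python
# Check Eq. 6 against a brute-force truncated sum, and compute the posterior
# mean error of Example 6 for the values of x0 and tau used in Figure 5A.
def weight(x0, tau, e):
    # unnormalized posterior error mass 2^{-(x0 + e + |e|/tau + 1)}
    return 2.0 ** -(x0 + e + abs(e) / tau + 1)

def f_norm_direct(x0, tau, emax=400):
    return sum(weight(x0, tau, e) for e in range(-x0, emax))

def f_norm_closed(x0, tau):            # Eq. 6 (valid for tau != 1)
    r, s = 2.0 ** (1 - 1 / tau), 2.0 ** (-1 - 1 / tau)
    neg = ((1 - r ** (x0 + 1)) / (1 - r) - 1) if x0 > 0 else 0.0
    return 2.0 ** -(x0 + 1) * (neg + 1 / (1 - s))

def posterior_mean_error(x0, tau, emax=400):
    z = f_norm_direct(x0, tau, emax)
    return sum(e * weight(x0, tau, e) for e in range(-x0, emax)) / z

for x0 in (0, 1, 5):
    for tau in (2 / 3, 2.0):
        assert abs(f_norm_direct(x0, tau) - f_norm_closed(x0, tau)) < 1e-12
print(round(posterior_mean_error(1, 2.0), 3))   # slightly negative at x0 = 1
print(round(posterior_mean_error(5, 2.0), 3))   # strongly negative at x0 = 5
```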
However, in both cases, the robustness of the Bayes' factor (with regard to the choice of τ and the quality of the numerical approximation of the ABC or ABCµ target densities) is debatable (unpublished results). While we agree that computing the Bayes' factor proposed in [20, 19] is faster than computing posterior predictive checks, we also note that it cannot be faster than sampling from f_ρ,τ(θ, ε_1:K | x0, M) so long as the same ABC kernel (i.e. π_ε_1:K) is used.

7. "Comparing models via the posterior error is missing the model complexity penalisation from Bayesian model comparison" [1].

- We acknowledge that model complexity is an important quantity to consider during model comparison.

8. "The choice of ε and π_ε(ε) is model dependent and the comparison [of models] reflects prior modeling, not data assessment" [1]. Finally, "using the same τ across all models does not seem to be recommendable on a general basis" [1].

- We agree that the choice of discrepancies (hence errors) and τ are application- and model-specific. Although the same summaries can typically be used across models that attempt to explain the same data, model predictions will typically vary, and hence so will the scales of the simulated summaries. This implies that the same τ may not always be used across different models. In this case, it may be difficult to compare the posterior error density f_ρ,τ(ε | x0, M) across different models. In [4], we only suggest using f_ρ,τ(ε | x0, M) to compare each model against the observed data.

Conclusion

We still find that ABCµ enables us to comprehensively quantify discrepancies between a data-generating process M and the data, simultaneously with parameter inference, even when the likelihood cannot be readily evaluated, thus providing valuable guidance on the interpretability of parameter estimates and on how to improve models.
However, the method has its limitations. The posterior error reflects an interplay between the prior predictive error and the stringency with which that error is updated; recall Eq. 2. At present, there is no formal procedure to disentangle the contributions of π_ε_1:K and L_ρ(ε_1:K | M) to f_ρ,τ(ε_1:K | x0, M); this prompted us to caution that it is difficult to convincingly associate a formal, probabilistic framework with credibility intervals of f_ρ,τ(ε_k | x0, M) [4]. In other words, there is no formal guarantee that zero is included in a 95% credibility interval with a probability of 0.95 under the hypothesis that the prior model is correct. We agree that the methods proposed by Verdinelli and Wasserman [24] are promising for the purpose of model criticism via Bayes' factors, although the sharp null hypothesis ε_1:K = 0 has limitations in itself [25]. Moreover, we emphasize that f_ρ,τ(ε_1:K | x0, M) cannot be thought of as a purely Bayesian quantity because π_ε_1:K also determines the approximation quality of f_ρ,τ(θ | x0, M); recall Eq. 3. In particular, this implies that the contribution of "the data" to f_ρ,τ(ε_1:K | x0, M) cannot be directly quantified by setting π_ε_1:K uniform. Finally, we agree with RMC that the ABC and ABCµ target densities, i.e. f_ρ,τ(θ | x0, M) and f_ρ,τ(θ, ε_1:K | x0, M), are sensitive to changes in the compound functions x → ρ_k(S_k(x), S_k(x0)) (i.e. not invariant) and may attain different meanings under different choices of τ. This leaves the whole method probabilistically coherent, but calls for sensitivity analyses. Nonetheless, we believe that ABC and ABCµ are useful to compare observed data and model simulations in a coherent way, and to make inference on the model parameters as well as the error terms.
It is difficult to interpret posterior quantities of f_ρ,τ(θ | x0, M) in place of posterior quantities of the true posterior density f(θ | x0, M), but it makes good sense to interpret them as quantities that lie between the prior and posterior density. With our inability to evaluate the likelihood f(x0 | θ, M), we acknowledge that it is too cumbersome (or would take too long) to comprehend the data in full, and turn to those aspects S_k of the data which we consider to be most relevant. In doing so, we retain some of the information in the available data, and our posterior density f_ρ,τ(θ | x0, M) updates our prior beliefs accordingly. Simultaneously, we can make use of the very same information to investigate the adequacy of a model M in explaining the data, and to update our prior predictions according to error magnitude. In conclusion, if we interpret f_ρ,τ(θ | x0, M) as an update of our prior beliefs in θ and f_ρ,τ(ε_1:K | x0, M) as an update of our prior predictive density under M, then both are useful and meaningful quantities, particularly when the likelihood f(x0 | θ, M) cannot be evaluated.

Acknowledgements

We thank Julien Cornebise for insightful comments on an earlier version of this manuscript, as well as Christian Robert for insightful discussions that stimulated parts of these notes. OR gratefully acknowledges support from NSF grant NSF-EF-08-27416; CA is supported by an EPSRC Advanced Research Fellowship; CW by the Danish Research Council; and SR by the Centre for Integrative Systems Biology at Imperial College as well as grant G-0600-609 from the Medical Research Council.

Appendix

Example 7. Reconsider the data set x0 of n = 100 independent samples that are Exponentially distributed with rate 1/µ_t = 0.2.
We believe again that each sample of x0 is generated from N(µ, σ²) with µ ∈ R, σ² ≥ 0 unknown, and take a conjugate normal inverse-gamma prior density for µ and σ² with hyperparameters µ0 ∈ R, n0 = 1, α0 > 0 and β0 > 0:

π(σ² | M3) = f_IG(α0, β0)(σ²) = β0^{α0} (σ^{−2})^{α0+1} exp(−β0 σ^{−2}) / Γ(α0),
π(µ | σ², M3) = f_N(µ0, σ²/n0)(µ),
π_ε_1:K(ε_1:K | M3) = ∏_{k=1}^{K} 1/τ_k 1{|ε_k| ≤ τ_k/2},

with hyperparameters set to µ0 = 5, n0 = 1, α0 = 4 and β0 = 75. We ran Std-ABCµ based on the summary SYMM(x) = x̄ − median(x), the discrepancy ρ(S(x), S(x0)) = SYMM(x) − SYMM(x0) and the above conjugate prior in θ, for 20,000 iterations, to obtain samples from the joint posterior density f_ρ,τ(µ, σ², ε_SYMM | x0, M3) for various values of τ. Interestingly, the approximate posterior density f_ρ,τ(θ | x0, M) broadens for decreasing values of τ, as shown in Figures 6(A-B).

Figure 6: Numerical estimates of (A) f_ρ,τ(µ | x0, M3) and (B) f_ρ,τ(σ² | x0, M3) in Example 7 for decreasing values of τ_SYMM (different colors). The respective marginal prior densities are overlaid (black, dashed).

Bibliography

[1] Robert CP, Mengersen KL, Chen C (2009) Letter: Model choice versus model criticism. Proc Natl Acad Sci USA.
[2] Robert CP, Mengersen KL, Chen C (2009) Model choice versus model criticism. Preprint, arXiv:0909.5673.
[3] Ratmann O, Andrieu C, Wiuf C, Richardson S (2009) Reply to Robert et al.: Model criticism informs model choice and model comparison. Proc Natl Acad Sci USA, in press.
[4] Ratmann O, Andrieu C, Wiuf C, Richardson S (2009) Model criticism based on likelihood-free inference, with an application to protein network evolution. Proc Natl Acad Sci USA 106:10576–10581.
[5] Beaumont M, Zhang W, Balding D (2002) Approximate Bayesian Computation in population genetics. Genetics 162:2025–2035.
[6] Jeffreys H (1961) Theory of Probability. Oxford University Press, 3rd edition.
[7] Box GEP (1980) Sampling and Bayes' inference in scientific modelling and robustness. J Roy Stat Soc A (General) 143:383–430.
[8] Wilkinson RD (2008) Approximate Bayesian Computation (ABC) gives exact results under the assumption of model error. Preprint.
[9] Andrieu C (2006) The expected auxiliary variable method for Monte Carlo simulation. URL http://www.newton.cam.ac.uk/webseminars/pg+ws/2006/scb/scbw01/1101/andrieu/. Isaac Newton Institute for Mathematical Sciences, Cambridge, UK.
[10] Andrieu C, Roberts G (2009) The pseudo-marginal approach for efficient Monte Carlo computations. Ann Stat 37:697–725.
[11] Pritchard J, Seielstad M, Perez-Lezaun A, Feldman M (1999) Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Mol Biol Evol 16:1791–1798.
[12] Joyce P, Marjoram P (2008) Approximately sufficient statistics and Bayesian computation. Stat Appl Gen Mol Biol 7:26.
[13] Berger JO, Pericchi LR (2001) Objective Bayesian methods for model selection: Introduction and comparison. IMS Lecture Notes-Monograph Series 38.
[14] Marjoram P, Molitor J, Plagnol V, Tavaré S (2003) Markov Chain Monte Carlo without likelihoods. Proc Natl Acad Sci USA 100:15324–15328.
[15] Reguly T, Breitkreutz A, Boucher L, Breitkreutz B, Hon G, et al. (2006) Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae. J Biol 5:11.
[16] Hastie T, Tibshirani R, Friedman J (2001) The Elements of Statistical Learning. Springer-Verlag, New York.
[17] Bayarri MJ, Berger JO (1999) Quantifying surprise in the data and model verification (with discussion). In: Bayesian Statistics 6, Oxford University Press, pp. 53–82.
[18] Wilkinson RD (2007) Bayesian inference of primate divergence times. Ph.D. thesis, University of Cambridge.
[19] Beaumont M (2008) Simulations, genetics and human prehistory. McDonald Institute Monographs, University of Cambridge.
[20] Fagundes NJR, Ray N, Beaumont M, Neuenschwander S, Salzano FM, et al. (2007) Statistical evaluation of alternative models of human evolution. Proc Natl Acad Sci USA 104:17614–17619.
[21] Toni T, Stumpf MPH (2009) Simulation-based model selection for dynamical systems in systems and population biology. Bioinformatics, btp619.
[22] Grelaud A, Robert CP, Marin JM, Rodolphe F, Taly JF (2009) ABC likelihood-free methods for model choice in Gibbs random fields. Bayesian Analysis 4:317–336.
[23] Dickey JM, Lientz BP (1970) The weighted likelihood ratio, sharp hypotheses about chances, the order of a Markov chain. Ann Math Stat 41:214–226.
[24] Verdinelli I, Wasserman L (1995) Computing Bayes' factors using a generalization of the Savage-Dickey density ratio. J Am Stat Ass 90:614–618.
[25] Berger JO, Delampady M (1987) Testing precise hypotheses. Statistical Science 2:317–335.
