Importance Tempering



Robert Gramacy & Richard Samworth
Statistical Laboratory, University of Cambridge
{bobby, rjs57}@statslab.cam.ac.uk

Ruth King
CREEM, University of St Andrews
ruth@mcs.st-and.ac.uk

October 22, 2018

Abstract

Simulated tempering (ST) is an established Markov chain Monte Carlo (MCMC) method for sampling from a multimodal density $\pi(\theta)$. Typically, ST involves introducing an auxiliary variable $k$ taking values in a finite subset of $[0,1]$ and indexing a set of tempered distributions, say $\pi_k(\theta) \propto \pi(\theta)^k$. In this case, small values of $k$ encourage better mixing, but samples from $\pi$ are only obtained when the joint chain for $(\theta, k)$ reaches $k = 1$. However, the entire chain can be used to estimate expectations under $\pi$ of functions of interest, provided that importance sampling (IS) weights are calculated. Unfortunately this method, which we call importance tempering (IT), can disappoint. This is partly because the most immediately obvious implementation is naïve and can lead to high-variance estimators. We derive a new optimal method for combining multiple IS estimators and prove that the resulting estimator has a highly desirable property related to the notion of effective sample size. We briefly report on the success of the optimal combination in two modelling scenarios requiring reversible-jump MCMC, where the naïve approach fails.

Keywords: simulated tempering, importance sampling, Markov chain Monte Carlo (MCMC), Metropolis-coupled MCMC

1 Introduction

Markov chain Monte Carlo (MCMC) algorithms, in particular Metropolis–Hastings (MH) and Gibbs sampling (GS), are by now the most widely used methods for simulation-based inference in Bayesian statistics. The beauty of MCMC is its simplicity. Very little user input or expertise is required in order to establish a Markov chain whose stationary distribution is proportional to $\pi(\theta)$, for $\theta \in \Theta \subseteq \mathbb{R}^d$. As long as the chain is irreducible, the theory of Markov chains guarantees that sample averages computed from this realisation will converge in an appropriate sense to their expectations under $\pi$. However, difficulties can arise when $\pi$ has isolated modes, between which the Markov chain moves only rarely. In such cases convergence is slow, meaning that often infeasibly large sample sizes are needed to obtain accurate estimates.

New MCMC algorithms have been proposed to improve mixing. Two related algorithms are Metropolis-coupled MCMC (MC$^3$) (Geyer, 1991; Hukushima and Nemoto, 1996) and simulated tempering (ST) (Marinari and Parisi, 1992; Geyer and Thompson, 1995). Both are closely related to the optimisation technique of simulated annealing (SA) (Kirkpatrick et al., 1983). SA works with a set of tempered distributions $\pi_k(\theta)$ indexed by an inverse-temperature parameter $k \in [0, \infty)$. One popular form of tempering is called "powering up", where $\pi_k(\theta) \propto \pi(\theta)^k$. Small values of $k$ have the effect of flattening/widening the peaks and raising the troughs of $\pi_k$ relative to $\pi$. In MC$^3$ and ST we define a temperature ladder $1 = k_1 > k_2 > \cdots > k_m \ge 0$, and call the $k_i$ its rungs. Both MC$^3$ and ST involve simulating from the set of $m$ tempered densities $\pi_{k_1}, \ldots, \pi_{k_m}$.
MC$^3$ runs $m$ parallel MCMC chains, one at each temperature, and regularly proposes swaps of states at adjacent rungs $k_i$ and $k_{i+1}$. Usually, samples are only saved from the "cold distribution" $\pi_{k_1}$. In contrast, ST works with a "pseudo-prior" $p(k_i)$ and uses a single chain to sample from the joint distribution, which is proportional to $\pi_k(\theta)\, p(k)$. Again, it is only at iterations $t$ for which $k^{(t)} = 1$ that the corresponding realisation of $\theta^{(t)}$ is retained. ST has an advantage over MC$^3$ in that only one copy of the process $\{\theta^{(t)} : t = 1, \ldots, T\}$ is needed, rather than $m$, so the chain uses less storage and also has better mixing (Geyer, 1991). The disadvantage is that it requires a good choice of pseudo-prior. For further comparison and review, see Jasra et al. (2007a) and Iba (2001).

Both MC$^3$ and ST suffer from inefficiency because they discard all samples from $\pi_k$ for $k \neq 1$. The discarded samples could be used to estimate expectations under $\pi$ if they were given appropriate importance sampling (IS) weights. For an inclusive review of IS and related methods see Liu (2001, Chapter 2). Moreover, it may be the case that an IS estimator constructed with samples from a tempered distribution has smaller variance than one based on a sample of the same size from $\pi$. As a simple motivating example, let $\pi(\theta) = N(\theta \mid \mu, \sigma^2)$, and consider estimating $\mu = E_\pi(\theta)$ by IS from a tempered distribution $\pi_k(\theta) \propto \pi(\theta)^k$. A straightforward calculation shows that the value of $k$ which minimises the variance of the IS estimator is

$$ k^* = \begin{cases} 1/2 & \text{if } \mu = 0, \\[4pt] \dfrac{3}{2} + \Big(\dfrac{\sigma}{\mu}\Big)^2 - \dfrac{1}{2}\Big\{1 + 8\Big(\dfrac{\sigma}{\mu}\Big)^2 + 4\Big(\dfrac{\sigma}{\mu}\Big)^4\Big\}^{1/2} & \text{otherwise.} \end{cases} \qquad (1) $$

Note that $k^* \in (1/2, 1)$ for all $\mu$ and $\sigma^2$. Moreover, one can compute (numerically) $k^- = k^-(\sigma/\mu) < k^*$ such that for all $k \in (k^-, 1)$, the variance of the IS estimator $\hat\mu_k$ based on samples from $\pi_k$ is smaller than that of one based on a sample of the same size from $\pi$. However, $\mathrm{Var}(\hat\mu_k) \to \infty$ as $k \to 0$ for all $\mu$ and $\sigma^2$. Table 1 gives $k^*$ and $k^-$ for various values of $\sigma/\mu$; a numerical check appears in the sketch at the end of this section.

| $\sigma/\mu$ | 1/16 | 1/4 | 1 | 4 | 16 |
|---|---|---|---|---|---|
| $k^*$ | 1.00 | 0.95 | 0.70 | 0.52 | 0.50 |
| $k^-$ | 0.99 | 0.89 | 0.42 | 0.18 | 0.16 |

Table 1: Values of $k^*$ and $k^-$ for various values of $\sigma/\mu$.

Therefore, there is a trade-off in the choice of tempered IS proposals. On the one hand, low inverse-temperatures $k$ in ST can guard against missing modes of $\pi$ with large support by encouraging better mixing between modes, but can yield very inefficient (IS) estimators overall. On the other hand, "lukewarm" temperatures $k$, especially $k \in (1/2, 1)$, can yield more efficient estimators within modes than those obtained from samples at $k = 1$.

Jennison (1993) was the first to suggest using a single tempered distribution as a proposal in IS, and Neal (1996, 2001, 2005) has since written several papers combining IS and tempering. Indeed, in the discussion of the 1996 paper on tempered transitions, Neal writes "simulated tempering allows data associated with $p_i$ other than $p_0$ [the cold distribution] to be used to calculate expectations with respect to ... $p_0$ (using an importance sampling estimator)". (A similar note is made in the 2001 paper with regard to annealed importance sampling.) It is this natural extension that we call importance tempering (IT), with IMC$^3$ defined similarly.
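Equation (1) and the thresholds in Table 1 are easy to verify numerically. The sketch below is ours, not the authors': it assumes the normalised tempered proposal $\pi_k = N(\mu, \sigma^2/k)$ and the unnormalised IS estimator $\hat\mu_k = T^{-1}\sum_t w(\theta^{(t)})\,\theta^{(t)}$ with $w = \pi/\pi_k$, evaluates the per-sample variance of $\hat\mu_k$ by numerical integration, and recovers both $k^*$ and $k^-$.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq
from scipy.stats import norm

def k_star(r):
    """Variance-minimising inverse-temperature from Eq. (1), r = sigma/mu
    (Eq. (1) gives k* = 1/2 separately when mu = 0)."""
    return 1.5 + r**2 - 0.5 * np.sqrt(1 + 8 * r**2 + 4 * r**4)

def is_var(k, mu, sigma):
    """Per-sample variance of the IS estimate of mu with proposal
    pi_k = N(mu, sigma^2/k): E_{pi_k}[w^2 theta^2] - mu^2."""
    t2w2 = lambda t: (norm.pdf(t, mu, sigma)**2 * t**2
                      / norm.pdf(t, mu, sigma / np.sqrt(k)))
    # the integrand concentrates near mu with spread ~ sigma, so finite
    # limits keep the adaptive quadrature from missing the peak
    second, _ = quad(t2w2, mu - 30 * sigma, mu + 30 * sigma)
    return second - mu**2

def k_minus(mu, sigma):
    """Smallest k below k* at which the variance equals sigma^2, its k=1 value."""
    return brentq(lambda k: is_var(k, mu, sigma) - sigma**2,
                  1e-3, k_star(sigma / mu))

for r in [1/16, 1/4, 1, 4, 16]:   # reproduces Table 1
    print(f"sigma/mu = {r:7.4f}   k* = {k_star(r):.2f}   "
          f"k- = {k_minus(1.0, r):.2f}")
```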
Given the work of the above-mentioned authors, and the fact that calculating importance weights is relatively trivial, it may be surprising that successful IT and IMC$^3$ applications have yet to be published. Liu (2001) comes close in proposing to augment ST with dynamic weighting (Wong and Liang, 1997) and in applying the Wang–Landau algorithm (Atchadé and Liu, 2007) to ST. This paper addresses why the straightforward methodology described above has tended not to work well in practice, primarily due to the lack of a principled way of combining the importance weights collected at each temperature into an overall estimator.

If we are interested in estimating $E_\pi\{h(\theta)\}$, one way to do this is with

$$ \hat h = W^{-1} \sum_{t=1}^T w(\theta^{(t)}, k^{(t)})\, h(\theta^{(t)}), \qquad \text{where } W = \sum_{t=1}^T w(\theta^{(t)}, k^{(t)}), \qquad (2) $$

and $w(\theta, k) = \pi(\theta)/\pi(\theta)^k = \pi(\theta)^{1-k}$. Observe that this estimator is of the form $\hat h = \sum_{i=1}^m \lambda_i \hat h_i$, where $0 \le \lambda_i \le \sum_{i=1}^m \lambda_i = 1$, with $\lambda_i = W^{-1} \sum_{t=1}^T w(\theta^{(t)}, k^{(t)})\, I\{k^{(t)} = k_i\}$, and where each $\hat h_i$ is an IS estimator of $E_\pi\{h(\theta)\}$ constructed using only the observations at the inverse-temperature $k_i$. We show how to improve this estimator by choosing $\lambda_1, \ldots, \lambda_m$ to maximise the effective sample size (see the next paragraph), which approximately corresponds to minimising the variance of $\hat h$ (Liu, 2001, Section 2.5.3). For the applications that we have in mind, it is important that our estimator can be constructed without knowledge of the normalising constants of $\pi_{k_1}, \ldots, \pi_{k_m}$. It is for this reason that methods motivated by the balance heuristic (Veach and Guibas, 1995; Owen and Zhou, 2000; Madras and Piccioni, 1999) cannot be applied.

The notion of effective sample size plays an important role in the study of IS estimators. Suppose we are interested in estimating $E_\pi\{h(\theta)\}$ using a vector of observations $\boldsymbol{\theta} = (\theta^{(1)}, \ldots, \theta^{(T)})$ from a density $\pi'$. Define the vector of importance weights $w \equiv w(\boldsymbol{\theta}) = (w(\theta^{(1)}), \ldots, w(\theta^{(T)}))$, where $w(\theta) = \pi(\theta)/\pi'(\theta)$. Following Liu (2001, Section 2.5.3) we define the effective sample size by

$$ \mathrm{ESS}(w) = \frac{T}{1 + \mathrm{cv}^2(w)}, \qquad (3) $$

where $\mathrm{cv}^2(w)$ is the coefficient of variation of the weights, given by

$$ \mathrm{cv}^2(w) = \frac{\sum_{t=1}^T \{w(\theta^{(t)}) - \bar w\}^2}{(T-1)\,\bar w^2}, \qquad \text{where } \bar w = T^{-1} \sum_{t=1}^T w(\theta^{(t)}); $$

a short implementation appears in the sketch at the end of this section. This should not be confused with the concept of effective sample size due to autocorrelation (Kass et al., 1998), which arises from serially correlated samples from a Markov chain. That latter notion is discussed briefly in Section 4.

Observe that the swap operations in MC$^3$ require that the state space $\Theta$ be common to all $m$ tempered distributions. This is not a requirement for ST, as the state stays fixed when changes in temperature are proposed. Thus applying MC$^3$ is less straightforward in (Bayesian) model selection/averaging problems, which typically involve trans-dimensional Markov chains as in reversible-jump MCMC (RJMCMC) (Green, 1995), though it is possible (Jasra et al., 2007b).
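Definition (3) translates directly into code. A minimal sketch (ours; the function name `ess` is illustrative), noting that the ESS is invariant to rescaling the weights, so unnormalised weights may be passed in:

```python
import numpy as np

def ess(w):
    """Effective sample size of IS weights, Eq. (3): T / (1 + cv^2(w))."""
    w = np.asarray(w, dtype=float)
    T, wbar = len(w), w.mean()
    cv2 = np.sum((w - wbar)**2) / ((T - 1) * wbar**2)
    return T / (1 + cv2)

print(ess(np.ones(1000)))              # 1000.0: equal weights are ideal
print(ess(np.r_[1e3, np.ones(999)]))   # ~4: one dominant weight is ruinous
```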
Since RJMCMC algorithms are particularly prone to slow mixing, and hence are an excellent source of applications for our idea (as illustrated in Section 3), the rest of the paper will focus on IT. Most of our results apply equally to IMC$^3$ by ignoring the pseudo-prior.

The outline of the paper is as follows. In Section 2 we derive the optimal convex combination of multiple IS estimators, and show that this estimator has a particularly attractive property with regard to its effective sample size. In Section 3 we briefly report on the effectiveness of optimal IT, and the poor performance of the naïve approach, on several real and synthetic examples. Section 4 concludes with a discussion.

2 Importance tempering

The simulated tempering (ST) (Geyer and Thompson, 1995) algorithm is an application of MH on the product space of parameters and inverse-temperatures. That is, samples are obtained from the joint chain $\pi(\theta, k) \propto \pi(\theta)^k p(k)$. This is only possible if $\pi(\theta)^k$ is integrable, and Hölder's inequality may be used to show that this is indeed the case provided that $E_\pi\big(\|\theta\|^{(1-k)/k + \delta}\big) < \infty$ for some $\delta > 0$, where $\|\cdot\|$ denotes the Euclidean norm. The success of ST depends crucially on the ability of the Markov chain to frequently: (a) visit high temperatures (low $k$), where the probability of escaping local modes is high; and (b) visit $k = 1$, to obtain samples from $\pi$. The algorithm can be tuned by (i) adjusting the number and location of the rungs of the temperature ladder, or (ii) adjusting the pseudo-prior $p(k)$.

Geyer and Thompson (1995) give some automated ways of adjusting the spacing of the rungs of the ladder. Iba (2001) reviews similar techniques from the physics literature. A recent, and very promising, alternative involves the Wang–Landau algorithm (Atchadé and Liu, 2007). However, many authors prefer to rely on defaults, e.g.,

$$ k_i = \begin{cases} (1 + \Delta_k)^{1-i} & \text{geometric spacing}, \\ \{1 + \Delta_k (i-1)\}^{-1} & \text{harmonic spacing}, \end{cases} \qquad i = 1, \ldots, m, \qquad (4) $$

as implemented in the sketch at the end of this section. The rate parameter $\Delta_k > 0$ can be problem specific. Motivation for such default spacings is outlined by Liu (2001, Chapter 10, pp. 213 & 233). Geometric spacing, or uniform spacing of $\log(k_i)$, is also advocated by Neal (1996, 2001).

Once a suitable ladder has been chosen, the goal is typically to choose the pseudo-prior so that the posterior over temperatures is uniform. The best way to accomplish this is to set $p(k_i) = 1/Z_i$, where $Z_i = \int_\Theta \pi(\theta)^{k_i}\, d\theta$ is the normalising constant in $\pi_{k_i} = \pi^{k_i}/Z_i$, which is generally unknown. So while normalising constants are not a prerequisite for ST, it can certainly be useful to know them. We follow the suggestions of Geyer and Thompson (1995) in setting the pseudo-prior by a method that roughly approximates the $Z_i$ in two stages: first by stochastic approximation (Kushner and Yin, 1997), and then by observation counts accumulated through pilot runs.

To some extent, a non-uniform posterior on the temperatures is less troublesome in the context of IT than of ST. So long as the chain still visits the heated temperatures often enough to get good mixing in $\Theta$, and the ESS of the IS estimators at some temperature(s) is not too low, useful samples can be obtained without ever visiting the cold distribution.
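Both defaults in Eq. (4) are one-liners; in the sketch below (ours; the function name and the choice $\Delta_k = 0.3$ are illustrative only):

```python
import numpy as np

def ladder(m, delta_k, spacing="geometric"):
    """Default temperature ladders from Eq. (4): k_1 = 1 > ... > k_m > 0."""
    i = np.arange(1, m + 1)
    if spacing == "geometric":
        return (1.0 + delta_k) ** (1 - i)
    if spacing == "harmonic":
        return 1.0 / (1.0 + delta_k * (i - 1))
    raise ValueError("spacing must be 'geometric' or 'harmonic'")

print(ladder(10, 0.3))              # 1.0, 0.769, 0.592, ... (uniform in log k)
print(ladder(10, 0.3, "harmonic"))  # 1.0, 0.769, 0.625, ... (slower decay)
```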
2.1 A new optimal way to combine IS estimators

ST provides us with $\{(\theta^{(t)}, k^{(t)}) : t = 1, \ldots, T\}$, where $\theta^{(t)}$ is a sample from $\pi_{k^{(t)}}$. Write $\mathcal{T}_i = \{t : k^{(t)} = k_i\}$ for the index set of observations at the $i$th temperature, and let $T_i = |\mathcal{T}_i|$. Collect the observations at the $i$th temperature in the vector $\theta_i = (\theta_{i1}, \ldots, \theta_{iT_i})$, so that $\{\theta_{ij}\}_{j=1}^{T_i} \sim \pi_{k_i}$. Similarly, the vector of IS weights at the $i$th temperature is $w_i = w_i(\theta_i) = (w_i(\theta_{i1}), \ldots, w_i(\theta_{iT_i}))$, where $w_i(\theta) = \pi(\theta)/\pi_{k_i}(\theta)$. Each vector $\theta_i$ can be used to construct an IS estimator of $E_\pi\{h(\theta)\}$ by setting

$$ \hat h_i = \frac{\sum_{j=1}^{T_i} w_i(\theta_{ij})\, h(\theta_{ij})}{\sum_{j=1}^{T_i} w_i(\theta_{ij})} \equiv \frac{\sum_{j=1}^{T_i} w_{ij}\, h(\theta_{ij})}{W_i}. $$

It is natural to consider an overall estimator of $E_\pi\{h(\theta)\}$ defined by a convex combination:

$$ \hat h_\lambda = \sum_{i=1}^m \lambda_i \hat h_i, \qquad \text{where } 0 \le \lambda_i \le \sum_{i=1}^m \lambda_i = 1. \qquad (5) $$

Unfortunately, if $\lambda_1, \ldots, \lambda_m$ are not chosen carefully, $\mathrm{Var}(\hat h_\lambda)$ can be nearly as large as the largest $\mathrm{Var}(\hat h_i)$ (Owen and Zhou, 2000). Notice that ST is recovered as a special case when $\lambda_1 = 1$ and $\lambda_2 = \cdots = \lambda_m = 0$. It may be tempting to choose $\lambda_i = W_i/W$, where $W = \sum_{i=1}^m W_i$, recovering the estimator in Eq. (2). This can lead to a very poor estimator, even compared to ST, as is demonstrated empirically in Section 3.

Observe that we can write

$$ \hat h_\lambda = \sum_{i=1}^m \sum_{j=1}^{T_i} w^\lambda_{ij}\, h(\theta_{ij}), \qquad (6) $$

where $w^\lambda_{ij} = \lambda_i w_{ij}/W_i$. Let $w^\lambda = (w^\lambda_{11}, \ldots, w^\lambda_{1T_1}, w^\lambda_{21}, \ldots, w^\lambda_{2T_2}, \ldots, w^\lambda_{m1}, \ldots, w^\lambda_{mT_m})$. Attempting to choose $\lambda_1, \ldots, \lambda_m$ to minimise $\mathrm{Var}(\hat h_\lambda)$ directly can be difficult. In the balance heuristic, Veach and Guibas (1995) explore combinations of IS estimators of the form (6), where $w_i(\theta) = \pi(\theta)/g_i(\theta)$ for a family of proposal densities $g_i$, with

$$ \lambda_{ij} = \frac{c_i\, g_i(\theta_{ij})}{\sum_{r=1}^m c_r\, g_r(\theta_{ij})}, \qquad (7) $$

and where $0 \le c_i \le \sum_{i=1}^m c_i = 1$ is the proportion of samples taken from $g_i$. It turns out that this is equivalent to IS with the mixture proposal $\tilde\pi(\theta) = \sum_{r=1}^m c_r\, g_r(\theta)$:

$$ \hat h_{\mathrm{bal}} \equiv \frac{1}{T} \sum_{t=1}^T w(\theta^{(t)})\, h(\theta^{(t)}), \qquad \text{where } w(\theta) = \frac{\pi(\theta)}{\sum_{r=1}^m c_r\, g_r(\theta)}. \qquad (8) $$

The balance heuristic has since been generalised by Owen and Zhou (2000); it was reinvented by Madras and Piccioni (1999, Section 4) in the context of applied probability. Note that, due to the denominator in the definition of $w(\theta)$ in Eq. (8), the $g_i$ must be normalised densities. This precludes us from using the balance heuristic with $g_i \propto \pi^{k_i}$: when MCMC is necessary to sample from $\pi$, the normalising constant of $\pi$, and therefore of $\pi_{k_i}$, is generally unknown. The method also requires evaluations of $\pi_{k_i}(\theta^{(t)})$, $i = 1, \ldots, m$, at all $T$ rounds, an $O(mT)$ operation that trivialises any computational advantage ST has over MC$^3$.

Instead, we consider maximising the ESS of $\hat h_\lambda$ in (5).

Proposition 2.1. Among estimators of the form (5), $\mathrm{ESS}(w^\lambda)$ is maximised by $\lambda = \lambda^*$, where, for $i = 1, \ldots, m$,

$$ \lambda^*_i = \frac{\ell_i}{\sum_{r=1}^m \ell_r}, \qquad \text{and} \qquad \ell_i = \frac{W_i^2}{\sum_{j=1}^{T_i} w_{ij}^2}. $$

Proof. Since $\sum_{i=1}^m \sum_{j=1}^{T_i} w^\lambda_{ij} = 1$, the problem of maximising the effective sample size is the same as

$$ \min_{\lambda_1, \ldots, \lambda_m} \sum_{i=1}^m \sum_{j=1}^{T_i} \Big( \frac{\lambda_i w_{ij}}{W_i} - \frac{1}{T} \Big)^2, \qquad \text{subject to } 0 \le \lambda_i \le \sum_{i=1}^m \lambda_i = 1. $$

The result then follows by a straightforward Lagrange multiplier argument.
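Proposition 2.1 is simple to apply as a post-processing step. The sketch below is ours (the function and argument names are illustrative); it takes the ST output grouped by rung, with unnormalised weights $w_{ij} = \pi(\theta_{ij})^{1-k_i}$, and returns $\hat h_{\lambda^*}$ along with $\lambda^*$:

```python
import numpy as np

def optimal_it(w_by_rung, h_by_rung):
    """Optimally combined IT estimator of E_pi[h(theta)] (Proposition 2.1).

    w_by_rung: list of m arrays of unnormalised IS weights w_ij
    h_by_rung: list of m arrays of h evaluated at the theta_ij
    """
    ells, h_hats = [], []
    for w, h in zip(w_by_rung, h_by_rung):
        W_i = w.sum()
        ells.append(W_i**2 / np.sum(w**2))   # ell_i = W_i^2 / sum_j w_ij^2
        h_hats.append(np.sum(w * h) / W_i)   # per-rung IS estimator hat h_i
    lam = np.array(ells) / np.sum(ells)      # lambda*_i = ell_i / sum_r ell_r
    return float(np.dot(lam, h_hats)), lam
```

Note that each $\hat h_i$ is self-normalised and $\lambda^*$ depends on the weights only through ratios, so unknown normalising constants cancel, exactly as required.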
In the following discussion and in Remark 2.2 below, we assume that $T_i \ge 2$ for $i = 1, \ldots, m$. The efficiency of each IS estimator $\hat h_i$ can be measured through $\mathrm{ESS}(w_i)$. Intuitively, we hope that with a good choice of $\lambda$, the ESS of $\hat h_\lambda$, given by

$$ \mathrm{ESS}(w^\lambda) = \frac{T(T-1)}{T^2 \sum_{i=1}^m \lambda_i^2/\ell_i - 1}, $$

would be close to the sum over $i$ of the effective sample sizes of the $\hat h_i$, namely

$$ \mathrm{ESS}(w_i) = \frac{T_i (T_i - 1)\, \ell_i}{T_i^2 - \ell_i}. \qquad (9) $$

The remark below shows that this is indeed the case for $\hat h_{\lambda^*}$.

Remark 2.2. We have

$$ \mathrm{ESS}(w^{\lambda^*}) \ge \sum_{i=1}^m \mathrm{ESS}(w_i) - \frac{1}{4} - \frac{1}{T}. $$

Proof. Since $\mathrm{ESS}(w_i) \le T_i$, it follows from (9) that $\ell_i \le T_i$. Thus

$$ \mathrm{ESS}(w^{\lambda^*}) = \frac{(1 - T^{-1}) \sum_{i=1}^m \ell_i}{1 - \sum_{i=1}^m \ell_i / T^2} \ge \Big(1 - \frac{1}{T}\Big) \Big(1 + \frac{1}{T^2} \sum_{i=1}^m \ell_i\Big) \sum_{i=1}^m \ell_i = \sum_{i=1}^m \ell_i - \frac{\sum_{i=1}^m \ell_i}{T} \Big(1 - \frac{\sum_{i=1}^m \ell_i}{T}\Big) - \frac{(\sum_{i=1}^m \ell_i)^2}{T^3} \ge \sum_{i=1}^m \ell_i - \frac{1}{4} - \frac{1}{T}, $$

since $x(1-x)$ attains its maximum of $1/4$ at $x = 1/2$ and $\sum_i \ell_i \le \sum_i T_i = T$. Finally, since $\ell_i \le T_i$, (9) gives $\mathrm{ESS}(w_i) \le \ell_i$, and the result follows.

In practice we have found that this bound is slightly conservative, and that often it is the case that $\mathrm{ESS}(w^{\lambda^*}) \ge \sum_{i=1}^m \mathrm{ESS}(w_i)$. Thus our optimally-combined IS estimator has a highly desirable and intuitive property in terms of its effective sample size.

3 Empirical results

Here we briefly report on the success of optimal IT, relative to the naïve approach and ST, on one simple example and two involving RJMCMC.

3.1 A simple mixture of normals

Consider the following toy density $\pi$, a mixture of two normals:

$$ \pi(\theta) = 0.6\, N(\theta \mid \mu_1 = -8, \sigma_1^2 = 0.5^2) + 0.4\, N(\theta \mid \mu_2 = 8, \sigma_2^2 = 0.9^2). \qquad (10) $$

Table 2 summarises Kolmogorov–Smirnov (K–S) distances obtained under three IT estimators: ST ($\lambda_1 = 1$), naïve IT ($\lambda_i = W_i/W$), and the optimally-combined IT estimator ($\hat h_{\lambda^*}$).

| Method | $\mathrm{ESS}(w^\lambda)$ | K–S mean | K–S var |
|---|---|---|---|
| ST | 2535 | 0.0938 | $8.5 \times 10^{-4}$ |
| naïve IT | 17779 | 0.0849 | $1.4 \times 10^{-4}$ |
| $\hat h_{\lambda^*}$ | 22913 | 0.0836 | $5.2 \times 10^{-5}$ |
| $\sum_i \mathrm{ESS}(w_i)$ | 22910 | | |

Table 2: Summary of K–S distances to the true mixture of normals (10) for ST ($\lambda_1 = 1$), naïve IT ($\lambda_i = W_i/W$), and the optimally-combined IT estimator ($\hat h_{\lambda^*}$). We used 100 repeated samples of size $10^5$, with tempered RWM proposals.

Observe that the optimally-combined IT estimator has both the largest ESS and the smallest variance of the three estimators, and that $\mathrm{ESS}(w^{\lambda^*}) > \sum_i \mathrm{ESS}(w_i)$. Naïve IT improves upon ST in this example, but has higher variance than $\hat h_{\lambda^*}$.
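The toy example can be reproduced end to end. The sketch below is ours (the ladder, random-walk step size, uniform pseudo-prior, and run length are illustrative tuning choices, not those of the paper): it runs a small ST chain on the mixture (10) and compares the naïve and optimal combinations for estimating $E_\pi(\theta) = -1.6$.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_pi(x):
    """Log density of the two-component normal mixture in Eq. (10)."""
    c1 = 0.6 * np.exp(-0.5 * ((x + 8) / 0.5)**2) / (0.5 * np.sqrt(2 * np.pi))
    c2 = 0.4 * np.exp(-0.5 * ((x - 8) / 0.9)**2) / (0.9 * np.sqrt(2 * np.pi))
    return np.log(c1 + c2)

m, T = 10, 100_000
ks = 1.3 ** -np.arange(m)              # geometric ladder, k_1 = 1
x, i = -8.0, 0                         # current state and rung
xs = np.empty(T); kis = np.empty(T, dtype=int)

for t in range(T):
    # random-walk Metropolis move within the current tempered target
    y = x + rng.normal(0.0, 2.0)
    if np.log(rng.uniform()) < ks[i] * (log_pi(y) - log_pi(x)):
        x = y
    # propose an adjacent rung; with a uniform pseudo-prior the
    # acceptance ratio is pi(x)^(k_j - k_i)
    j = min(max(i + rng.choice([-1, 1]), 0), m - 1)
    if np.log(rng.uniform()) < (ks[j] - ks[i]) * log_pi(x):
        i = j
    xs[t], kis[t] = x, i

w = np.exp((1.0 - ks[kis]) * log_pi(xs))   # IS weights pi(x)^(1 - k)
h = xs                                     # estimating E_pi[theta] = -1.6

ells = np.empty(m); h_hats = np.empty(m)
for r in range(m):
    wr, hr = w[kis == r], h[kis == r]
    ells[r] = wr.sum()**2 / np.sum(wr**2)
    h_hats[r] = np.sum(wr * hr) / wr.sum()

print("naive IT  :", np.sum(w * h) / w.sum(),
      " ESS ~", w.sum()**2 / np.sum(w**2))
print("optimal IT:", np.dot(ells / ells.sum(), h_hats),
      " ESS ~", ells.sum())
```

On runs like this, both estimates should land near $-1.6$, and the optimal combination should attain at least as large an ESS as the naïve pooling, in line with Remark 2.2 and Table 2.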
3.2 Bayesian treed Gaussian process models

Bayesian treed models extend classification and regression tree (CART) models (Breiman et al., 1984) by putting a prior on the tree structure. We focus on the implementation of Gramacy and Lee (2006), who fit Gaussian process (GP) models at the leaves of the tree, specify the tree prior through a process that limits its depth, and then define the tree operations grow, prune, change, and swap to allow inference to proceed by RJMCMC. The RJMCMC chain usually identifies the correct maximum a posteriori (MAP) tree, but consistently and significantly overestimates the posterior probability of deep trees. To guard against the trans-dimensional chain getting stuck in local modes of the posterior, Gramacy and Lee (2006) resorted to regularly restarting the chain from the null tree. ST provides an alternative by increasing the rate of accepted tree operations at higher temperatures. In particular, we find that ST can increase the rate of accepted prune operations by an order of magnitude, thus enabling the chain to escape the local modes of deep trees.

To demonstrate IT we fit a treed GP model with ST, using a geometric ladder with $m = 40$ and $k_m = 0.1$, to two datasets first explored by Gramacy and Lee (2006): the 1-d motorcycle accident data and the 2-d exponential data. We refer to that paper for details about the data and models. For the motorcycle accident data the ST chain was run for $T = 1.5 \times 10^5$ iterations, of which a total of $T_1 = 3732$ ($\approx T/m = 3750$) samples were obtained from the cold distribution. That $\mathrm{ESS}(w^{\lambda^*}) = 9338 \approx 2.5\, T_1$ shows the considerable improvement of IT over ST. Moreover, we have $\mathrm{ESS}(w^{\lambda^*}) > \sum_i \mathrm{ESS}(w_i) = 9334$. The naïve combination $\lambda_i = W_i/W$ in (2) yields $\mathrm{ESS}(w^\lambda) = 285 < \frac{1}{10} T_1$, undermining the very motivation of IT.

For the exponential data the ST chain was run for a total of $T = 5 \times 10^5$ iterations. A total of $T_1 = 12436$ ($\approx T/m = 12500$) samples were obtained from the cold distribution. We found that $\mathrm{ESS}(w^{\lambda^*}) = 21778 \approx 1.75\, T_1$, illustrating how IT improves on ST. Moreover, we have $\mathrm{ESS}(w^{\lambda^*}) > \sum_i \mathrm{ESS}(w_i) = 21776$. The naïve combination $\lambda_i = W_i/W$ in (2) yields $\mathrm{ESS}(w^\lambda) = 654 \approx \frac{1}{18} T_1$, worse than ST.

3.3 Mark-recapture-recovery data

We now consider a Bayesian model selection problem with data relating to the mark-recapture and recovery of shags on the Isle of May (King and Brooks, 2002). The three demographic parameters of interest are survival rates, recapture rates, and recovery rates. The models considered for each of the demographic parameters allowed a possible age- and/or time-dependence, where the time dependence was conditional on the age structure of the parameters. Typically, movement between the different possible models, by adding/removing time dependence for a given age group or updating the age structure of the parameters, is slow, with small acceptance probabilities. For further details of the data, model structure, and RJMCMC algorithm see King and Brooks (2002).

Using the same ST setup as above, we ran $T = 10^7$ iterations and discarded the first 10% as burn-in. As with the treed examples, higher temperatures yielded higher acceptance rates and an order of magnitude better exploration of model space compared to (untempered) RJMCMC. A total of $T_1 = 248158$ ($\approx T/m = 225000$) realisations were obtained from the cold distribution. By comparison, for optimal IT we have $\mathrm{ESS}(w^{\lambda^*}) = 612026 \approx 2.5\, T_1$ and $\mathrm{ESS}(w^{\lambda^*}) > \sum_i \mathrm{ESS}(w_i) = 612020$. The corresponding naïve IT approach (using $\lambda_i = W_i/W$) performed exceptionally poorly, with an $\mathrm{ESS}(w^\lambda)$ of only 5.43, due to a few large weights obtained at hot temperatures.

4 Discussion

This paper has addressed the inefficiencies and wastefulness of simulated tempering (ST) and related algorithms that are designed to improve mixing in the Markov chain using tempered distributions. We argued that importance sampling (IS) from tempered distributions can produce estimators that are more efficient than ones based on independent sampling, provided that the temperature is chosen carefully. This motivated augmenting the ST algorithm by calculating importance weights to salvage discarded samples, a technique which we have called importance tempering (IT).
This idea has been suggested before, but to our knowledge little exploration has been carried out for real, complex applications. We have derived optimal combination weights for the resulting collection of IS estimators, which can be calculated even when the normalisation constants of the tempered distributions are unknown. The weights are essentially proportional to the effective sample size (ESS) of the individual estimators, and we found that the resulting combined ESS would be approximately equal to their sum.

We note that the overall success of the optimal IT estimator depends crucially on a successful implementation of ST, i.e., having a good temperature ladder and pseudo-prior. However, it is also important to recognise that the optimal combination, as a resource-efficient post-processing step, is equally applicable in other contexts, e.g., within MC$^3$, or even outside the domain of tempered MCMC to combine any collection of IS estimators. Sequential Monte Carlo samplers (Del Moral et al., 2006) may facilitate a natural extension. We have illustrated IT on several examples which benefit from the improved mixing that ST provides. For example, the optimal IT methodology can increase the resulting ESS, compared to retaining samples only from the cold distribution, by roughly a factor of two.

Since IT involves sampling from a Markov chain, ideally one would take into account the serial correlation in the objective criteria for combining the individual estimators. The effective sample size due to autocorrelation is defined (Kass et al., 1998) by

$$ \mathrm{ESS}_\rho(\theta) = \frac{T}{1 + 2 \sum_{\ell=1}^{T-1} \hat\rho(\ell, \theta)}, \qquad (11) $$

where $\hat\rho(\ell, \theta)$ is the sample autocorrelation in $\theta$ at lag $\ell$; thus for scalar $\theta$ we have that $\hat\rho(\ell, \theta) = \hat\gamma(\ell, \theta)/\hat\gamma(0, \theta)$, where $\hat\gamma(\ell, \theta) = (T - \ell)^{-1} \sum_{t=1}^{T-\ell} (\theta^{(t)} - \bar\theta)(\theta^{(t+\ell)} - \bar\theta)$ and $\bar\theta = T^{-1} \sum_{t=1}^T \theta^{(t)}$. The results from the previous section suggest that, when the temperature ladder is fixed, a sensible heuristic might be to combine the individual estimators with weights $\lambda^*_i$ proportional to the product of $T_i^{-1}\,\mathrm{ESS}_\rho(\theta_i)$ and $\mathrm{ESS}(w_i)$, say. However, when considering modifications to the number ($m$) and spacing of inverse temperatures $k = \{k_1, \ldots, k_m\}$, there is clearly a conflict of interest between the two measures of effective sample size. Adding more inverse temperatures near one may increase $\mathrm{ESS}(w^{\lambda^*})$, but may also increase the autocorrelation in the marginal chain for $k$. Therefore it may be sensible to factor $\mathrm{ESS}_\rho(k)$ into the objective as well. Searching for temperature ladders that maximise a hybrid of ESS and $\mathrm{ESS}_\rho$ would represent a natural extension of this work.
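For completeness, Eq. (11) in code (ours, not the authors'): truncating the lag sum, rather than running it to $T-1$ as written, is a standard practical safeguard not spelled out in the paper, since sample autocorrelations at high lags are pure noise.

```python
import numpy as np

def ess_rho(theta, max_lag=None):
    """Autocorrelation-based effective sample size, Eq. (11), with the
    lag sum truncated at max_lag or at the first non-positive
    autocorrelation (a common heuristic)."""
    theta = np.asarray(theta, dtype=float)
    T = len(theta)
    d = theta - theta.mean()
    gamma0 = np.dot(d, d) / T                 # hat gamma(0, theta)
    s, L = 0.0, max_lag if max_lag else T - 1
    for lag in range(1, L + 1):
        # hat rho(lag) = hat gamma(lag) / hat gamma(0)
        rho = np.dot(d[:-lag], d[lag:]) / ((T - lag) * gamma0)
        if rho <= 0:
            break
        s += rho
    return T / (1 + 2 * s)
```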
References

Atchadé, Y. and Liu, J. (2007). "The Wang–Landau algorithm in general state spaces: applications and convergence analysis." Tech. rep., Harvard University.

Breiman, L., Friedman, J. H., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Belmont, CA: Wadsworth.

Del Moral, P., Doucet, A., and Jasra, A. (2006). "Sequential Monte Carlo samplers." Journal of the Royal Statistical Society, Series B, 68, 411–436.

Geyer, C. (1991). "Markov chain Monte Carlo maximum likelihood." In Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, 156–163.

Geyer, C. and Thompson, E. (1995). "Annealing Markov chain Monte Carlo with applications to ancestral inference." Journal of the American Statistical Association, 90, 909–920.

Gramacy, R. B. and Lee, H. K. H. (2006). "Bayesian treed Gaussian process models." Tech. rep., Dept. of Applied Math & Statistics, University of California, Santa Cruz.

Green, P. (1995). "Reversible jump Markov chain Monte Carlo computation and Bayesian model determination." Biometrika, 82, 711–732.

Hukushima, K. and Nemoto, K. (1996). "Exchange Monte Carlo method and application to spin glass simulations." Journal of the Physical Society of Japan, 65, 4, 1604–1608.

Iba, Y. (2001). "Extended ensemble Monte Carlo." International Journal of Modern Physics, 12, 5, 623–656.

Jasra, A., Stephens, D., and Holmes, C. (2007a). "On population-based simulation for static inference." Statistics and Computing, 17, 3, 263–279.

— (2007b). "Population-based reversible jump Markov chain Monte Carlo." Biometrika. (to appear).

Jennison, C. (1993). "Discussion on the meeting on the Gibbs sampler and other Markov chain Monte Carlo methods." Journal of the Royal Statistical Society, Series B, 55, 54–56.

Kass, R. E., Carlin, B. P., Gelman, A., and Neal, R. M. (1998). "Markov chain Monte Carlo in practice: a roundtable discussion." The American Statistician, 52, 2, 93–100.

King, R. and Brooks, S. (2002). "Model selection for integrated recovery/recapture data." Biometrics, 58, 841–851.

Kirkpatrick, S., Gelatt, C., and Vecchi, M. (1983). "Optimization by simulated annealing." Science, 220, 671–680.

Kushner, H. and Yin, G. (1997). Stochastic Approximation Algorithms and Applications. New York: Springer.

Liu, J. S. (2001). Monte Carlo Strategies in Scientific Computing. New York: Springer.

Madras, N. and Piccioni, M. (1999). "Importance sampling for families of distributions." Annals of Applied Probability, 9, 1202–1225.

Marinari, E. and Parisi, G. (1992). "Simulated tempering: a new Monte Carlo scheme." Europhysics Letters, 19, 451–458.

Neal, R. M. (1996). "Sampling from multimodal distributions using tempered transitions." Statistics and Computing, 6, 353–366.

— (2001). "Annealed importance sampling." Statistics and Computing, 11, 125–139.

— (2005). "Estimating ratios of normalizing constants using linked importance sampling." Tech. Rep. 0511, Department of Statistics, University of Toronto. 37 pages.

Owen, A. and Zhou, Y. (2000). "Safe and effective importance sampling." Journal of the American Statistical Association, 95, 449, 135–143.

Veach, E. and Guibas, L. J. (1995). "Optimally combining sampling techniques for Monte Carlo rendering." In SIGGRAPH '95 Conference Proceedings, 419–428. Reading, MA: Addison–Wesley.

Wong, W. and Liang, F. (1997). "Dynamic weighting in Monte Carlo and optimization." Proceedings of the National Academy of Sciences of the USA, 94(26), 14220–14224.
