Diffusive Nested Sampling

Brendon J. Brewer · Livia B. Pártay · Gábor Csányi

Abstract We introduce a general Monte Carlo method based on Nested Sampling (NS), for sampling complex probability distributions and estimating the normalising constant. The method uses one or more particles, which explore a mixture of nested probability distributions, each successive distribution occupying ~e^{-1} times the enclosed prior mass of the previous distribution. While NS technically requires independent generation of particles, Markov Chain Monte Carlo (MCMC) exploration fits naturally into this technique. We illustrate the new method on a test problem and find that it can achieve four times the accuracy of classic MCMC-based Nested Sampling, for the same computational effort; equivalent to a factor of 16 speedup. An additional benefit is that more samples and a more accurate evidence value can be obtained simply by continuing the run for longer, as in standard MCMC.

Keywords Nested Sampling · Bayesian Computation · Markov Chain Monte Carlo

Software (in C++) implementing Diffusive Nested Sampling is available at http://lindor.physics.ucsb.edu/DNest/ under the GNU General Public License.

B. J. Brewer
Department of Physics, University of California, Santa Barbara, CA 93106-9530, USA
E-mail: brewer@physics.ucsb.edu

L. B. Pártay
University Chemical Laboratory, University of Cambridge, Lensfield Road, CB2 1EW, Cambridge, United Kingdom

G. Csányi
Engineering Laboratory, University of Cambridge, Trumpington Street, CB2 1PZ Cambridge, United Kingdom

1 Introduction

In probabilistic inference and statistical mechanics, it is often necessary to characterise complex multidimensional probability distributions. Typically, we have the ability to evaluate a function proportional to the probability or probability density at any given point.
Given this ability, we would like to produce sample points from the target distribution (to characterise our uncertainties), and also evaluate the normalising constant. In Bayesian inference, a prior distribution π(θ) is modulated by the likelihood function L(θ) to produce the posterior distribution f(θ):

f(θ) = (1/Z) π(θ) L(θ)    (1)

where Z = ∫ π(θ) L(θ) dθ is the evidence value, which allows the entire model to be tested against any proposed alternative. In statistical mechanics, the goal is to produce samples from a canonical distribution

f(θ) = (g(θ)/Z(λ)) exp(−λE(θ))    (2)

(g(θ) = density of states, E(θ) = energy) at a range of inverse temperatures λ, and also to obtain the normalisation Z(λ) as a function of λ. The challenge is to develop sampling algorithms that are of general applicability, and are capable of handling the following complications: multimodality, phase transitions, and strong correlations between variables (particularly when the Markov Chain Monte Carlo (MCMC) proposals are not aware of these correlations). These challenges have led to the development of a large number of sophisticated and efficient techniques (e.g. Chib & Ramamurthy, 2010; Trias, Vecchio & Veitch, 2009). However, our goal in the present paper is to develop a method that requires little problem-specific tuning, obviating the need for large numbers of preliminary runs or analytical work; Diffusive Nested Sampling can be applied to any problem where Metropolis-Hastings can be used. Finally, we note that no sampling method can ever be completely immune from failure; for example, it is difficult to imagine any method that will find an exceedingly small, sharp peak whose domain of attraction is also very small.
1.1 Nested Sampling

Nested Sampling (NS) is a powerful and widely applicable algorithm for Bayesian computation (Skilling, 2006). Starting with a population of particles {θ_i} drawn from the prior distribution π(θ), the worst particle (lowest likelihood L(θ)) is recorded and then replaced with a new sample from the prior distribution, subject to the constraint that its likelihood must be higher than that of the point it is replacing. As this process is repeated, the population of points moves progressively higher in likelihood.

Each time the worst particle is recorded, it is assigned a value X ∈ [0, 1], which represents the amount of prior mass estimated to lie at a higher likelihood than that of the discarded point. Assigning X-values to points creates a mapping from the parameter space to [0, 1], where the prior becomes a uniform distribution over [0, 1] and the likelihood function is a decreasing function of X. Then, the evidence can be computed by simple numerical integration, and posterior weights can be assigned by assigning a width to each point, such that the posterior mass associated with the point is proportional to width × likelihood.

The key challenge in implementing Nested Sampling for real problems is to be able to generate the new particle from the prior, subject to the hard likelihood constraint. If the discarded point has likelihood L*, the newly generated point should be sampled from the constrained distribution:

p_{L*}(θ) = π(θ)/X* if L(θ) > L*, and 0 otherwise    (3)

where X* is the normalising constant. Technically, our knowledge of this new point should be independent of all of the surviving points.
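The constrained distribution of Equation 3 is exactly what MCMC-based implementations target. A minimal sketch of one such constrained Metropolis step is given below; the uniform prior on [0, 1]^d and the single-coordinate Gaussian random-walk proposal are our illustrative assumptions, not the paper's prescription.

```python
import random

def constrained_step(theta, logL, logL_star, log_likelihood, step=0.02):
    """One Metropolis step targeting Equation 3: a random-walk move that
    keeps the prior invariant (here, uniform on [0, 1]^d, an assumption
    for illustration), accepted only if the new point's likelihood
    exceeds the current cutoff L*."""
    i = random.randrange(len(theta))            # update one coordinate
    proposal = list(theta)
    proposal[i] += step * random.gauss(0.0, 1.0)
    if not 0.0 <= proposal[i] <= 1.0:
        return theta, logL                      # outside the prior support
    logL_new = log_likelihood(proposal)
    if logL_new > logL_star:                    # the hard NS constraint
        return proposal, logL_new
    return theta, logL                          # reject: constraint violated
```

Iterating this step for long enough yields a point approximately distributed according to Equation 3, which is the "classic" strategy discussed next.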
A simple way to generate such a point is suggested by Sivia & Skilling (2006): copy one of the surviving points and evolve it via MCMC with respect to the prior distribution, rejecting proposals that would take the likelihood below the current cutoff value L*. This evolves the particle with Equation 3 as the target distribution. If the MCMC is done for long enough, the new point will be effectively independent of the surviving population and will be distributed according to Equation 3. Throughout this paper we refer to this strategy as "classic" Nested Sampling.

However, in complex problems, this approach can easily fail: constrained distributions can often be very difficult to explore efficiently via MCMC, particularly if the target distribution is multimodal or highly correlated. To overcome these drawbacks, several methods have been developed for generating the new particle (Mukherjee et al., 2006; Feroz, Hobson, & Bridges, 2008), and these methods have been successful in improving the performance of Nested Sampling, at least in low-dimensional problems. Techniques for diagnosing multimodality have also been suggested by Pártay, Bartók, & Csányi (2009). In Section 2 we introduce our multi-level method, which retains the flexibility and generality of MCMC exploration, and is also capable of efficiently exploring difficult constrained distributions.

The main advantage of Nested Sampling is that successive constrained distributions (i.e. p_{L*_j}(θ), p_{L*_{j+1}}(θ), and so on) are, by construction, all compressed by the same factor relative to their predecessors. This is not the case with tempered distributions of the form p_T(θ) ∝ π(θ) L(θ)^{1/T}, where a small change in temperature T can correspond to a small or a large compression. Tempering-based algorithms (e.g.
simulated annealing, parallel tempering) will fail unless the density of temperature levels is adjusted according to the specific heat, which becomes difficult at a first-order phase transition (the uniform-energy sampling of Wang & Landau, 2001, is also incapable of handling first-order phase transitions). Unfortunately, knowing the appropriate values for the temperature levels is equivalent to having already solved the problem. Nested Sampling does not suffer from this issue because it asks the question "what should the next distribution be, such that it will be compressed by the desired factor?", rather than the tempering question "the next distribution is pre-defined; how compressed is it relative to the current distribution?"

2 Multi-Level Exploration

Our algorithm begins by generating a point from the prior (π(θ) = p_{L*_0} where L*_0 = 0), and evolving it via MCMC (or independent sampling), storing all of the likelihood values of the visited points. After some pre-determined number of iterations, we find the 1 − e^{-1} ≈ 0.63212 quantile of the accumulated likelihoods, and record it as L*_1, creating a new level that occupies about e^{-1} times as much prior mass as p_{L*_0}. Likelihood values less than L*_1 are then removed from the accumulated likelihood array.

Next, classic Nested Sampling would attempt to sample p_{L*_1} via MCMC. The difference here is that we attempt to sample a mixture (weighted sum) of p_{L*_0} and p_{L*_1}. Thus, there is some chance of the particle escaping to lower constrained distributions, where it is allowed to explore more freely. Once we have enough samples from p_{L*_1} (equivalent to all points from the mixture that happen to exceed L*_1), we find the 1 − e^{-1} ≈ 0.63212 quantile of these likelihoods and record it as L*_2.
Likelihood values less than L*_2 are then removed from the accumulated likelihood array. The particle then explores a mixture of p_{L*_0}, p_{L*_1} and p_{L*_2}, and so on. Each time we create a new level, its corresponding constrained distribution covers about e^{-1} as much prior mass as the last. Thus, we can estimate the X-value of the kth level created as being exp(−k). This estimate can be refined later on, as explained in Section 3. Once we have obtained the desired number of levels, we allow the particle to continue exploring a mixture of all levels. This multi-level exploration scheme is similar to simulated tempering (Marinari & Parisi, 1992), but uses the well-tuned L* values from the run, rather than a pre-defined sequence of temperatures that may be poorly adapted to the problem at hand.

2.1 Weighting Schemes

The use of a mixture of constrained distributions raises the question: how should we weight each of the mixture components? Naive uniform weighting works, but takes time proportional to N² to create N levels; contrast this with classic Nested Sampling, which has O(N) performance in this regard. A simple parametric family of weighting schemes is given by the exponentially-decaying weights:

w_j ∝ exp((j − J)/Λ),   for j ∈ {1, 2, ..., J}    (4)

where J is the current highest level, and Λ is a scale length, describing how far we are willing to let the particle "backtrack" to freer distributions, in order to assist with exploration and the creation of a new, higher level. With this exponential choice, the time taken to create N levels is proportional to N, with the proportionality constant depending on Λ. Smaller values are more aggressive; larger values, while slower, are more fail-safe because they allow the particle to explore for longer, and more freely.
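These two ingredients, creating a level at the 1 − e^{-1} quantile of the accumulated likelihoods and the exponentially-decaying weights of Equation 4, can be sketched as follows. The helper names are ours, and for simplicity the weights here are normalised over all levels including the prior level j = 0.

```python
import math

def create_level(likelihoods):
    """Return the 1 - e^-1 ≈ 0.63212 quantile of the accumulated
    likelihoods (the new threshold L*), plus the surviving likelihoods
    above it.  The new level encloses about e^-1 times the prior mass
    of the previous one."""
    q = 1.0 - math.exp(-1.0)
    ordered = sorted(likelihoods)
    threshold = ordered[int(q * len(ordered))]
    return threshold, [L for L in likelihoods if L > threshold]

def level_weights(J, Lam):
    """Exponentially-decaying mixture weights w_j ∝ exp((j - J)/Λ)
    (Equation 4), normalised over levels j = 0..J."""
    w = [math.exp((j - J) / Lam) for j in range(J + 1)]
    total = sum(w)
    return [wj / total for wj in w]
```

A small Λ concentrates the weight near the top level J; a large Λ spreads it over many lower levels, matching the backtracking trade-off described above.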
Once the desired number of levels has been created, the weights {w_j} are changed to uniform, to allow further samples to be drawn. Non-uniform weights can also be used, for example to spend more time at levels that are significant for the posterior. The simulation can be run for as long as required, and the evidence and posterior samples will converge in a manner analogous to standard MCMC. Additional separate runs are not required, provided that enough levels have been created.

2.2 Exploring the mixture

Suppose we have obtained a set of constrained distributions p_{L*_j}(θ) from the algorithm described above (Section 2). Each of the constrained distributions is defined as follows:

p_{L*_j}(θ) = π(θ)/X_j if L(θ) > L*_j, and 0 otherwise    (5)

The particle in our simulation, θ, is assigned a label j indicating which particular constrained distribution it is currently representing. j = 0 denotes the prior, and j = 1, 2, 3, ..., J denote progressively higher likelihood cutoffs, where j = J is the current top level. To allow θ to explore the mixture, we update {θ, j} so as to explore the joint distribution

p(θ, j) = p(j) p(θ | j) = (w_j/X_j) π(θ) if L(θ) > L*_j, and 0 otherwise    (6)

where p(j) = w_j is the chosen weighting scheme discussed in Section 2.1, and X_j is the normalising constant for p(θ | j), which in general depends on j. In fact, X_j is simply the amount of prior mass enclosed at a likelihood above L*_j, and is the abscissa in the standard presentation of Nested Sampling (Skilling, 2006), hence the notation X rather than the usual Z for a normalising constant. Updating θ is done by proposing a change that keeps the prior π(θ) invariant, and then accepting as long as the likelihood exceeds L*_j.
To update j, we propose a new value from some proposal distribution (e.g., move either up or down one level, with 50% probability) and accept using the usual Metropolis rule, with Equation 6 as the target distribution. This cannot be done without knowing the {X_j}; however, our procedure for adding new levels ensures that the ratio of the X's for neighbouring levels is approximately e^{-1}. So we at least have some approximate values for the {X_j}, and these can be used when updating j. However, since our creation of new levels is based on sampling, and therefore not exact, the ratios of X's will be a little different from the theoretical expected value e^{-1}. Hence, we will actually be exploring Equation 6 with a marginal p(j) that is slightly different from the weights {w_j} that we really wanted. However, we can further revise our estimates of the X's at each step to achieve more correct exploration; our method for doing this is explained in Section 3. This refinement not only allows the particle to explore the levels with the desired weighting, but also increases the accuracy of the resulting evidence estimate.

Note that it is possible to explicitly marginalise over j and explore a distribution for θ only; however, it is convenient to retain j for various purposes, most notably the refinements in Sections 3 and 3.1.

Fig. 1 An illustration of the distributions that must be sampled as Nested Sampling progresses. In the classic scheme, at Step 3, we must obtain a sample from the coloured region, which is composed of two separate islands; this is usually very difficult if MCMC is the only exploration option. To ameliorate these difficulties, we explore the mixture distribution (bottom right), where travel between isolated modes is more likely.
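A sketch of the j-update just described, using the current estimates of the log X_j and the chosen weights; the helper is hypothetical (names are ours), with thresholds[] holding the log-likelihood cutoffs and thresholds[0] = −∞ representing the prior level.

```python
import math
import random

def update_level(j, logL, thresholds, log_X, log_w):
    """One Metropolis update of the level label j, with Equation 6 as
    the target: conditional on the particle's position, p(j) ∝ w_j/X_j
    over the levels whose cutoff its likelihood exceeds."""
    J = len(thresholds) - 1
    j_new = j + random.choice([-1, 1])      # propose up or down one level
    if j_new < 0 or j_new > J:
        return j                            # no such level exists
    if logL <= thresholds[j_new]:
        return j                            # hard likelihood constraint fails
    # Metropolis ratio: (w_{j'} / X_{j'}) / (w_j / X_j)
    log_alpha = (log_w[j_new] - log_X[j_new]) - (log_w[j] - log_X[j])
    if math.log(random.random()) < min(0.0, log_alpha):
        return j_new
    return j
```

With perfect estimates X_j = e^{-j} and uniform weights, an upward move (when allowed by the constraint) is always accepted and a downward move is accepted with probability e^{-1}, which is what makes backtracking possible but not dominant.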
3 Revising the X's

As sampling progresses and more levels are added, the actual X-values of the levels can become different from the theoretical expectation X_{j+1} = e^{-1} X_j which would be realised if our sampling were perfect. This causes the simulation to explore a p(j) that is very different from the desired weights {w_j}. Fortunately, we can use details of the exploration to obtain estimates of the X's that are more accurate than those given by the theoretical expectation. Conditioning on a particular level j, the particle's likelihood should exceed level j+1's likelihood cutoff a fraction X_{j+1}/X_j of the time. Thus, we can use the actual fraction of exceedances n(L > L*_{j+1} | j)/n(j) as an estimate of the true ratio of normalising constants for consecutive levels. In practice, we only keep track of the exceedance fraction for consecutive levels, and use the theoretical expected value e^{-1} to stabilise the estimate when the number of visits n(j) is low:

X_{j+1}/X_j ≈ (n(L > L*_{j+1} | j) + C e^{-1}) / (n(j) + C)    (7)

Here, the number C represents our effective confidence in the theoretical expected value relative to the empirical estimate, and thus resembles a Bayesian estimate. The theoretical estimate e^{-1} dominates until the sample size n(j) exceeds C, whereupon the empirical estimate becomes more important. Clearly, n(j) should not be accumulated until level j+1 exists. An additional refinement of this approach can be made by realising that the particle, even if it is at level j, may also be considered as sampling higher distributions if its likelihood happens to exceed the higher levels' thresholds.

3.1 Enforcing the Target Weighting

Note that the ratio of normalising constants between level j and level j+1 is estimated using only samples at level j.
Sometimes, if the upper levels have been created incorrectly (typically they are too closely spaced in log X), the particle spends too much time in the upper levels, rarely spending time in the lower levels, which would be needed to correct the erroneous X-values.

To work around this issue, we also keep track of the number of visits to each level relative to the expected number of visits, and add a term to the acceptance probability for moving j, to favour levels that haven't been visited as often as they should have. If a move is proposed from level j to level k, the Metropolis-Hastings acceptance probability is multiplied by the following factor:

α_enforce = [ ((n_j + C)/(⟨n_j⟩ + C)) / ((n_k + C)/(⟨n_k⟩ + C)) ]^β    (8)

where n_j and n_k are the number of visits to levels j and k respectively (these are different from n(j) in Section 3, as they can be accumulated immediately, without needing a higher level to exist). Thus, the transition is favoured if it moves to a level k that has not been visited as often as it should have (⟨n_k⟩). This procedure is analogous to the approach used by Wang & Landau (2001) to sample a uniform distribution of energies. These expected numbers of visits are tracked throughout the entire simulation; they are essentially the integrated values of the weights {w_j} over the history of the simulation. The power β controls the strength of the effect, and C again acts to regularise the effect when the number counts are low.

This procedure (and the one in Section 3) destroys the Markov property, but this effect decreases towards zero as the simulation proceeds, so the eventual convergence of the algorithm is not affected (this is analogous to the kind of tuning discussed by Roberts, Gelman & Gilks, 1997; Rosenthal, 2010).
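Both corrections are simple ratios; the following is a direct transcription of Equations 7 and 8 (the function names are ours, with C and β as in the paper's notation).

```python
import math

def compression_estimate(n_exceed, n_visits, C=1000):
    """Equation 7: estimate X_{j+1}/X_j from the fraction of the
    particle's visits to level j whose likelihood also exceeded level
    j+1's cutoff, shrunk towards the theoretical value e^-1 until the
    number of visits n(j) exceeds C."""
    return (n_exceed + C * math.exp(-1.0)) / (n_visits + C)

def enforcement_factor(n_j, n_k, exp_j, exp_k, C=1000, beta=10):
    """Equation 8: multiplies the acceptance probability of a proposed
    move from level j to level k, favouring levels visited less often
    than expected (exp_j, exp_k are the expected visit counts)."""
    return (((n_j + C) / (exp_j + C)) / ((n_k + C) / (exp_k + C))) ** beta
```

With no visits at all, compression_estimate returns exactly e^{-1}; with equal actual and expected counts, enforcement_factor is exactly 1, so the correction vanishes when exploration matches the target weighting.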
In practice, any biases introduced by this procedure (even with a strong choice of β = 10) have been found to be negligible relative to the dominant source of error, which is incorrect estimates of the X's of the levels.

4 Assigning X-values to samples

A Diffusive Nested Sampling run explores the joint distribution of Equation 6 in an MCMC-style way. Usually this involves a lot of steps, and it is wasteful (of disk space and I/O overhead) to save them all; therefore it is necessary to "thin the chain". This results in an output sample of θ points. To use them for calculating the evidence and posterior quantities, these points must be assigned X-values.

Firstly, each θ in the sample is assigned an interval of possible X-values, by finding the two levels whose likelihoods sandwich the particle. The conditional distribution of the X-values of points given that they lie in some interval is uniform, so we can assign X-values by generating uniform random variates between the X-values of the two sandwiching levels. This probabilistic assignment of X-values to the samples gives error bars on the evidence and posterior quantities, as classic NS does. However, it assumes that the X-values of the levels are known exactly, which they are not. This error tends to be more significant than the uncertainty of not knowing where each particle lies within its interval, and the error bars are over-optimistic as a result. Unfortunately, obtaining reliable error bars from MCMC-based computation remains difficult. Possible methods for assigning uncertainties to the X-values of the levels would make use of the number counts n(L > L*_{j+1} | j); however, such a scheme has not been implemented at the time of writing.
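The interval assignment can be sketched as follows; this assumes level_logL is sorted ascending with level_logL[0] = −∞ for the prior, and level_X holds the corresponding (decreasing) X estimates. The helper name is ours.

```python
import bisect
import random

def assign_X(sample_logL, level_logL, level_X):
    """Assign an X-value to a saved sample: find the highest level whose
    cutoff the sample's likelihood exceeds, then draw uniformly between
    that level's X and the next level's X (the conditional distribution
    of X within the sandwiching interval is uniform)."""
    # index of the highest level whose cutoff the sample exceeds
    j = bisect.bisect_left(level_logL, sample_logL) - 1
    X_hi = level_X[j]                                 # lower level: larger X
    X_lo = level_X[j + 1] if j + 1 < len(level_X) else 0.0
    return random.uniform(X_lo, X_hi)
```

Repeating the draw gives the spread of evidence and posterior estimates mentioned above, though, as noted, that spread ignores the uncertainty in the level X-values themselves.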
5 Test Problem

In this section we demonstrate the Diffusive Nested Sampling method on a test problem that is quite simple, yet has enough complications to shed some light on the differences between the diffusive and classic NS algorithms. The problem is adapted from Skilling (2006)'s "Statistics Problem" but modified to be bimodal. The parameter space is 20-dimensional, and the prior is uniform between [−0.5, 0.5] in each dimension:

π(x_1, x_2, ..., x_20) = ∏_{i=1}^{20} [1 if x_i ∈ [−0.5, 0.5], 0 otherwise]    (9)

The likelihood function is the sum of two Gaussians, one centred at the origin and having width v = 0.1 in each dimension, and the other centred at (0.031, 0.031, ..., 0.031) and having width u = 0.01. The narrower peak is weighted so that it contains 100 times as much posterior mass as the broad peak. The likelihood function is:

L(x_1, x_2, ..., x_20) = ∏_{i=1}^{20} (1/(v√(2π))) exp(−x_i²/(2v²)) + 100 ∏_{i=1}^{20} (1/(u√(2π))) exp(−(x_i − 0.031)²/(2u²))    (10)

For this problem the true value of the evidence is log Z = log(101) ≈ 4.6151, to a very good approximation. Skilling (2006) showed that this system has a phase transition which is handled easily by classic Nested Sampling. By shifting the dominant mode off-centre, the problem retains the phase transition and is also bimodal. The bimodality allows us to illustrate an interesting effect where imperfect MCMC exploration causes the dominant mode to be discovered late (Section 5.1).

The MCMC transitions used throughout this section were all naive random walk transitions, centred on the current value, with one of the x's chosen at random to be updated. To work around the fact that the optimal step size changes throughout the run, we draw the step size S randomly from a scale-invariant Jeffreys prior ∝ 1/S at each step, with an upper limit for S of 10^0 = 1 and a lower limit of 10^{-6}.
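For concreteness, the prior and likelihood of Equations 9 and 10 can be written down directly, computing the log-likelihood stably in log space. This is a sketch for illustration, not the authors' C++ implementation.

```python
import math

V, U = 0.1, 0.01       # widths of the broad and narrow Gaussians
SHIFT = 0.031          # centre of the narrow peak in every dimension

def log_prior(x):
    """Uniform prior on the box [-0.5, 0.5]^20 (Equation 9): log-density
    is 0 inside the box (up to a constant) and -inf outside."""
    return 0.0 if all(-0.5 <= xi <= 0.5 for xi in x) else -math.inf

def log_likelihood(x):
    """Equation 10: a broad Gaussian at the origin plus a narrow one at
    (0.031, ..., 0.031) carrying 100 times the posterior mass, combined
    via a stable log-sum-exp."""
    log_broad = sum(-0.5 * (xi / V) ** 2 - math.log(V * math.sqrt(2 * math.pi))
                    for xi in x)
    log_narrow = math.log(100.0) + sum(
        -0.5 * ((xi - SHIFT) / U) ** 2 - math.log(U * math.sqrt(2 * math.pi))
        for xi in x)
    m = max(log_broad, log_narrow)
    return m + math.log(math.exp(log_broad - m) + math.exp(log_narrow - m))
```

Since each normalised Gaussian integrates to essentially 1 over the prior box, the two terms contribute about 1 and 100 to the evidence, giving Z ≈ 101 as stated above.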
The proposal distribution used to change j was a Gaussian centred at the current value, with standard deviation S′ chosen from a Jeffreys prior between 1 and 100 at each step. The proposed j was then, of course, rounded to an integer value and rejected immediately if it fell outside of the allowed range of existing levels. A more sophisticated approach to setting the step size would be to use Slice Sampling (Neal, 2003).

5.1 Typical Performance of Diffusive NS

To illustrate the typical output from a Diffusive Nested Sampling run, we executed the program on the test problem using the numerical parameters defined in Table 1. The progress of the algorithm is displayed graphically in Figure 2. Diffusive NS has two distinct stages: the initial mostly-upward progress of the particle, and then an exploration stage (in this case, exploring all levels uniformly) which can be run indefinitely, and which constantly generates new samples and refines the estimates of the level X-values.

Table 1 Parameter values used for diffusive nested sampling on the test problem.

Parameter                                                Value
Number of particles                                      1
Number of likelihoods needed to create new level         10,000
Interval between particle saves                          10,000
Maximum number of levels                                 100
Regularisation constant (C)                              1,000
Degree of backtracking (Λ)                               10
Strength of effect to enforce exploration weights (β)    10

Of particular note is the fact that, during the uniform exploration phase, the particle can easily mix between levels 55 and 100, and between 0 and 55, but jumps between these regions occur more rarely. This occurs around where the narrow peak in the posterior becomes important; imperfect MCMC exploration does not notice its presence right away. See Figure 3 for more information about this phenomenon. The final log-likelihood vs prior-mass curve, and the output samples, are shown in Figure 4.
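The two proposal mechanisms described in this section can be sketched together as follows. The helper names are ours, and the use of a Gaussian innovation for the single-coordinate move is our assumption for illustration.

```python
import random

def jeffreys(lo, hi):
    """Draw from the scale-invariant prior p(S) ∝ 1/S on [lo, hi],
    via inversion: S = lo * (hi/lo)^u with u ~ Uniform(0, 1)."""
    return lo * (hi / lo) ** random.random()

def propose_theta(x):
    """Random-walk move on one randomly chosen coordinate, with the
    step size S drawn fresh from the Jeffreys prior on [1e-6, 1]."""
    S = jeffreys(1e-6, 1.0)
    i = random.randrange(len(x))
    y = list(x)
    y[i] += S * random.gauss(0.0, 1.0)
    return y

def propose_j(j, J):
    """Gaussian move on the level index with standard deviation S'
    drawn from the Jeffreys prior on [1, 100], rounded to an integer;
    returns None (immediate rejection) outside the existing levels."""
    S = jeffreys(1.0, 100.0)
    j_new = int(round(j + S * random.gauss(0.0, 1.0)))
    return j_new if 0 <= j_new <= J else None
```

Redrawing the scale at every step is what makes the proposal robust to the changing width of the constrained distributions during a run.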
5.2 Empirical comparisons with Classic NS

It is of interest to evaluate the performance of the diffusive NS method in comparison to classic NS. Is it less, equally, or more efficient? In particular, how well do the two methods cope with imperfect MCMC exploration? In this section we explore this issue, albeit in a limited way. We ran diffusive and classic NS on the test problem defined in Section 5, with the parameters of the algorithms chosen to be as similar as possible. Both algorithms were limited to 10^7 likelihood evaluations for all of the tests. The parameters for the diffusive NS runs are shown in Table 1, and the classic NS parameters were chosen such that the programs had reached log X = −100 by the time the allotted 10^7 likelihood evaluations had occurred. These parameters are listed in Table 2. The number-of-steps parameter listed in Table 2 for classic NS defines how many MCMC steps were used in an attempt to equilibrate a particle. The implementation of the classic Nested Sampling algorithm was the standard one, where a particle is copied before being evolved.

Each of these algorithms was run 24 times, and the root mean square error of the estimated (posterior mean) evidence values was calculated. These results are shown in Table 2 and Figure 5, and show a significant advantage to diffusive NS, by a factor of about four even compared to the most favourable classic NS parameters. To obtain similar accuracy with classic NS would have required 16 times as much computation. This improvement can be attributed to diffusive NS's use of all visited likelihoods when creating a new level. Classic NS uses only the likelihood of the endpoint of the equilibration process, yet presumably all of the likelihoods encountered during the exploration are relevant for creating a new level, or for judging the compression of an existing level.
Another feature of diffusive NS that probably contributes to its efficiency is the fact that the particle can backtrack and explore with respect to broader distributions. This is particularly important when the target distribution is correlated, yet the proposal distribution is not. Then, the particle can fall down several levels, take larger steps across to the other side of the correlated target distribution, and then climb back up to the target distribution.

For classic NS with a single particle, the RMS error is comparable with the theoretical √(H/N) ≈ √(63.2/N) error bar of classic NS. The RMS error decreases with more particles, as expected (Murray, 2007), but fails to fall off as 1/√N, eventually increasing again. This is because the number of MCMC steps per iteration had to be decreased to compensate for the computational overhead of the increase in the number of particles, and the theoretical error bars assume that the exploration is perfect. While it may be possible to improve classic NS by attempting to retain the diversity of the initial population (sometimes evolving the deleted particle, rather than copying), this has not been implemented yet, and thus the comparison presented in this paper is against the common implementation of classic NS.

5.3 Multi-Particle Diffusive NS

On highly correlated and multimodal target distributions, two strategies suggest themselves. One is to make Λ large, so that the particle can backtrack a long way. Then, when the particle reaches the top level again, it is likely to be very weakly correlated with the original point. A more efficient alternative would be to run diffusive NS with multiple particles, evolving them in succession or choosing one at random at each step.
If a particle gets stuck in a subordinate mode, this can be detected by its repeated failure to be within a distance of a few times Λ of the top level J. Such particles can simply be deleted, and more CPU time can be used for evolving the surviving particles. This approach performs better than the copying method of classic NS because particles are only deleted when they are known to be failing, whereas in classic NS the copying operation will inevitably destroy the diversity of the initial population even when the MCMC exploration is satisfactory.

When running multi-particle diffusive NS, it is advisable to reset the counters (from Sections 3 and 3.1) once the desired number of levels have been created. This prevents the deleted particles from creating unnecessary barriers to exploration.

Fig. 2 The anatomy of a diffusive-NS run (level j versus iteration/10,000). Initially, the particle tends to move upwards, creating new likelihood levels. During this phase, the particle can backtrack somewhat, allowing freer exploration and also refining the estimates of the compression ratios of the newly created levels. Once the desired number of levels (in this case, 100) have been created, the particle attempts to explore all levels with the prescribed weights, in this case uniform weights.

Fig. 3 Estimated compression from each level's distribution to the next (log(X) difference versus level j). If sampling were perfect, the log(X) difference between successive levels would be −1. This is approximately correct except around levels 50-55, where some levels were "accidentally" created too close together. This is because the slowly-exploring particle failed to notice the presence of the narrow peak immediately, but in the meantime still created some new levels.

Table 2 The parameters for the test runs of the algorithms, and the resulting RMS error (from 24 runs of each algorithm) of the log evidence. Each algorithm was allowed 10^7 likelihood evaluations. Diffusive NS outperformed classic NS significantly on this problem.

Algorithm      Parameter Values                               RMS Deviation in log(Z)    Theoretical √(H/N)
Diffusive NS   As per Table 1                                 0.583                      -
Classic NS     1 particle, 100,000 MCMC steps per NS step     5.82                       7.95
Classic NS     10 particles, 10,000 MCMC steps per NS step    2.96                       2.51
Classic NS     100 particles, 1,000 MCMC steps per NS step    2.07                       0.80
Classic NS     300 particles, 333 MCMC steps per NS step      2.71                       0.46

Fig. 4 Estimated log-likelihood curve for the test problem, based on a diffusive Nested Sampling run (two panels: posterior weight versus log(X), and log(L) versus log(X)). The blue curve is constructed from the estimated X-values of the levels, while the circles represent sample points. The chain has been thinned further in order to show the discrete points more clearly.

Fig. 5 Performance of five different methods on the test problem. The first method (leftmost box-and-whisker plot) is diffusive NS; the others are classic NS in increasing order of the number of particles. The bars indicate the spread of results obtained from repeated runs of each method with different random number seeds. The horizontal line indicates the true value.

6 Conclusions

In this paper we introduced a variant of Nested Sampling that uses MCMC to explore a mixture of progressively constrained distributions. This method shares the advantages of the classic Nested Sampling method, but has several unique features.
Firstly, once the desired number of constrained distributions have been created, the particle is allowed to diffusively explore all levels, perpetually obtaining more posterior samples and refining the estimate of the evidence. Secondly, this method uses the entire exploration history of the particle in order to estimate the enclosed prior mass associated with each level, and hence tends to estimate the enclosed prior mass of each level more accurately than the classic Nested Sampling method. We ran simple tests of the new algorithm on a test problem, and found that diffusive Nested Sampling gives more accurate evidence estimates for the same computational effort. Whether this will hold true in general, or is a problem-dependent advantage, will be explored in the future.

7 Acknowledgements

We would like to thank John Skilling and Iain Murray for helpful discussions and comments on earlier versions of this paper. This work grew directly out of discussions that took place at MaxEnt 2009.

References

Chib, S., Ramamurthy, S., Tailored Randomized-block MCMC Methods with Application to DSGE Models, Journal of Econometrics, 155, 19-38 (2010)
Feroz, F., Hobson, M. P., Bridges, M., MultiNest: an efficient and robust Bayesian inference tool for cosmology and particle physics, arXiv:0809.3437 (2008)
Marinari, E., Parisi, G., Simulated Tempering: A New Monte Carlo Scheme, Europhysics Letters, 19, 451 (1992)
Mukherjee, P., Parkinson, D., Liddle, A. R., A Nested Sampling Algorithm for Cosmological Model Selection, The Astrophysical Journal, 638, L51-L54 (2006)
Murray, I., Advances in Markov chain Monte Carlo methods, PhD thesis, Gatsby Computational Neuroscience Unit, University College London (2007)
Neal, R. M., Slice sampling (with discussion), Annals of Statistics, 31, 705-767 (2003)
Pártay, L. B., Bartók, A. P., Csányi, G., Full sampling of atomic configurational spaces, arXiv:0906.3544 (2009)
Roberts, G. O., Gelman, A., Gilks, W. R., Weak convergence and optimal scaling of random walk Metropolis algorithms, Annals of Applied Probability, 7(1), 110-120 (1997)
Rosenthal, J. S., Optimal proposal distributions and adaptive MCMC, in S. P. Brooks, A. Gelman, G. Jones, and X.-L. Meng (Eds.), Handbook of Markov Chain Monte Carlo, Chapman and Hall/CRC Press (2010)
Sivia, D. S., Skilling, J., Data Analysis: A Bayesian Tutorial, 2nd Edition, Oxford University Press (2006)
Skilling, J., Nested Sampling for General Bayesian Computation, Bayesian Analysis, 4, 833-860 (2006)
Trias, M., Vecchio, A., Veitch, J., Delayed rejection schemes for efficient Markov-Chain Monte-Carlo sampling of multimodal distributions, arXiv:0904.2207 (2009)
Wang, F., Landau, D. P., Efficient, Multiple-Range Random Walk Algorithm to Calculate the Density of States, Physical Review Letters, 86, 2050 (2001)
