Inference for Trans-dimensional Bayesian Models with Diffusive Nested Sampling
Authors: Brendon J. Brewer
Department of Statistics, The University of Auckland
Private Bag 92019, Auckland 1142, New Zealand
bj.brewer@auckland.ac.nz

Abstract

Many inference problems involve inferring the number N of components in some region, along with their properties {x_i}_{i=1}^N, from a dataset D. A common statistical example is finite mixture modelling. In the Bayesian framework, these problems are typically solved using one of the following two methods: i) by executing a Monte Carlo algorithm (such as Nested Sampling) once for each possible value of N, and calculating the marginal likelihood or evidence as a function of N; or ii) by doing a single run that allows the model dimension N to change (such as Markov Chain Monte Carlo with birth/death moves), and obtaining the posterior for N directly. In this paper we present a general approach to this problem that uses trans-dimensional MCMC embedded within a Nested Sampling algorithm, allowing us to explore the posterior distribution and calculate the marginal likelihood (summed over N) even if the problem contains a phase transition or other difficult features such as multimodality. We present two example problems: finding sinusoidal signals in noisy data, and finding and measuring galaxies in a noisy astronomical image. Both examples demonstrate phase transitions in the relationship between the likelihood and the cumulative prior mass, highlighting the need for Nested Sampling.

1 Introduction

Consider the following class of inference problems. There is some unknown number, N, of components in a (physical or metaphorical) region. Each of the N components has parameters x, which may be a single scalar value (for example, a mass) or a set of values (e.g. a two-dimensional position and a mass). These problems are often challenging to solve because of the unknown (and potentially large) dimensionality. In addition, they have the so-called "label-switching degeneracy" (Jasra et al., 2005), where the meaning of the model is invariant to switching the identities of the components. Finite mixture modelling is a common statistical example, but this kind of problem appears in many other contexts (e.g. Skilling, 1998).

To carry out the inference, we must assign a prior to the components' parameters {x_i}_{i=1}^N. Since N may be large, it is usually easier to assign a "conditional prior" given some hyperparameters α, and then assign a prior to α. This kind of model is usually called hierarchical. The prior for N, α, and {x_i} is usually factorised in the following way:

p(N, \alpha, \{x_i\}) = p(N) \, p(\alpha \,|\, N) \, p(\{x_i\} \,|\, \alpha, N)    (1)
                      = p(N) \, p(\alpha) \prod_{i=1}^{N} p(x_i \,|\, \alpha).    (2)

Here we have assumed that the priors for N and α are independent, and that the conditional prior for {x_i} is iid and does not depend on N. Given data D, we usually want to calculate properties of the posterior distribution, given by

p(N, \alpha, \{x_i\} \,|\, \mathcal{D}) \propto p(N, \alpha, \{x_i\}) \, p(\mathcal{D} \,|\, N, \alpha, \{x_i\}).    (3)

The structure of the model is depicted graphically in Figure 1. The computational method used to calculate properties of the posterior distribution is to generate samples from it using Markov Chain Monte Carlo (MCMC). Since N is unknown, an MCMC sampler needs to be able to jump between candidate solutions with different numbers of components, using either birth-death MCMC (Stephens, 2000) or the reversible jump framework of Green (1995).
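As a concrete illustration of Equation 2, the following sketch evaluates the log joint prior density for one simple set of choices: a discrete uniform prior over N, a log-uniform hyperprior for a scalar α, and an exponential conditional prior with mean α (similar to the amplitude prior used later in Section 3). All names and prior choices here are assumptions for the example, not part of the paper's software.

```cpp
// Sketch: log p(N, alpha, {x_i}) = log p(N) + log p(alpha) + sum_i log p(x_i | alpha)
#include <cmath>
#include <cstdio>
#include <vector>

const int N_MAX = 100;  // assumed upper limit for the number of components

// Discrete uniform prior for N on {0, ..., N_MAX}
double log_prior_N() { return -std::log(N_MAX + 1.0); }

// Log-uniform hyperprior for alpha on [1e-3, 1e3]: p(alpha) proportional to 1/alpha
double log_prior_alpha(double alpha) {
    return -std::log(alpha) - std::log(std::log(1e3 / 1e-3));
}

// Exponential conditional prior with mean alpha: p(x | alpha) = exp(-x/alpha)/alpha
double log_cond_prior(double x, double alpha) {
    return -std::log(alpha) - x / alpha;
}

double log_joint_prior(double alpha, const std::vector<double>& xs) {
    double logp = log_prior_N() + log_prior_alpha(alpha);
    for (double x : xs)
        logp += log_cond_prior(x, alpha);  // components are iid given alpha
    return logp;
}

int main() {
    std::vector<double> xs = {0.5, 2.0, 1.3};  // N = 3 components
    std::printf("log p(N, alpha, {x}) = %f\n", log_joint_prior(1.5, xs));
}
```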
The marginal likelihood, or evidence, is also an important quantity: it is the normalisation constant of the posterior given in Equation 3:

Z = p(\mathcal{D})    (4)
  = \sum_{N=0}^{N_{\max}} \int p(N, \alpha, \{x_i\}) \, p(\mathcal{D} \,|\, N, \alpha, \{x_i\}) \, d^N x \, d\alpha.    (5)

Some authors have addressed these kinds of problems by doing separate runs of Nested Sampling (or another method of calculating marginal likelihoods), with N fixed at various trial values (e.g. Hou et al., 2014; Feroz and Hobson, 2014). The posterior for N can then be calculated from the estimates of the marginal likelihood obtained as a function of N. This approach does not generalise well to large N, as it would require a large number of parallel runs.

In this paper, we introduce an approach that uses trans-dimensional MCMC moves to infer N, but embeds this process in the Nested Sampling framework to overcome difficulties that would be encountered by sampling the posterior distribution directly. Our motivation for using Nested Sampling is not primarily to calculate the marginal likelihood Z (including summing over N, as in Equation 5), although Z is readily available from the output. The motivation for using Nested Sampling, rather than just sampling the posterior using trans-dimensional MCMC, is that the posterior often contains difficult features, such as strong dependencies or multiple modes, that cause problems with mixing.

1.1 Phase Transitions and Nested Sampling

In addition to the well-known challenges caused by multiple modes and strong dependencies, phase transitions (Skilling, 2006) can also cause difficulties for Bayesian computation. Phase transitions occur when the posterior distribution is a mixture of a "slab" with high volume but low density, and a "spike" with low volume but high density. To understand why this causes problems, imagine sampling the posterior using the Metropolis algorithm. If the sampler is in the slab, it will only rarely propose a move into the spike, because the spike's volume is so small.

Figure 1: A probabilistic graphical model (PGM) depicting the kind of model discussed in this paper, produced using daft (daft-pgm.org). The number, N, of components in the model is an unknown parameter, as are the properties {x_i} of the components. The data D depends on the properties of the components. Note that it is possible (and common) to have other parameters in the model that point directly to the data; such parameters are omitted here for simplicity.

If the sampler is in the spike, a proposal to move into the slab will have a very low acceptance probability, because the density of the slab is so low relative to that of the spike. If the spike has a volume V_spike and the slab has a volume V_slab, then the number of MCMC iterations needed to jump from the spike to the slab, or vice versa, is of order V_slab/V_spike, which may be very large. In many problems, the posterior distribution is a mixture of a spike component and a slab component; however, the total posterior probability in the slab component is negligible.
In this situation, posterior sampling is unaffected, but problems can arise when calculating the marginal likelihood using an annealing-type approach. For example, consider the thermodynamic integral, which gives the log of the marginal likelihood:

\log(Z) = \int_0^1 \left\langle \log L(\theta) \right\rangle_\beta \, d\beta    (6)

where L(θ) is the likelihood function, and the expectation is taken with respect to the "thermal" distributions, which are intermediate between the prior π(θ) and the posterior:

p_\beta(\theta) \propto \pi(\theta) \, L(\theta)^\beta.    (7)

If the posterior distribution is a mixture of a slab and a spike, then for some range of inverse temperatures β the probability mass in the slab and in the spike will be comparable. At these temperatures the MCMC method will be unable to mix (the sampler will visit only the spike or the slab, not both) and will give an incorrect estimate of ⟨log L(θ)⟩_β.

Nested Sampling (Skilling, 2006) overcomes these problems by working with an alternative sequence of probability distributions, proportional to the prior but with a hard constraint on the value of the likelihood:

p_l(\theta) \propto \pi(\theta) \, \mathbb{1}\left[ L(\theta) > l \right].    (8)

The normalising constants of these distributions are given by

X(l) = \int \pi(\theta) \, \mathbb{1}\left[ L(\theta) > l \right] \, d\theta    (9)

which is the amount of prior mass that has likelihood greater than l. As Nested Sampling proceeds, X shrinks geometrically with time, so the number of iterations required to compress from the slab to the spike is of order log(V_slab/V_spike). In Diffusive Nested Sampling (Section 1.2), it takes of order log(V_slab/V_spike) iterations to pass from the slab to the spike initially, and then of order log(V_slab/V_spike)^2 time to revisit. During this process Nested Sampling visits states with likelihoods intermediate between those of the slab and the spike, whereas no choice of β in Equation 7 gives these states substantial probability.

1.2 Diffusive Nested Sampling

The main difficulty with Nested Sampling is devising methods to sample the constrained prior distributions of Equation 8. Various implementations of Nested Sampling exist, with different strategies for sampling Equation 8, as well as other features to infer whether a run was successful, or to improve the accuracy of the marginal likelihood estimate (e.g. Feroz, Hobson, & Bridges, 2009; Feroz et al., 2013; Corsaro and De Ridder, 2014; Buchner, 2014).

Diffusive Nested Sampling (DNS; Brewer, Pártay, & Csányi, 2011b) is an alternative version of Nested Sampling that is appropriate when MCMC is used to explore the parameter space.¹ Rather than working with a sequence of constrained prior distributions as in Equation 8, DNS replaces the posterior distribution with an alternative target distribution, composed of a mixture of the prior distribution with the constrained priors, facilitating mixing between the multiple modes and along degeneracy curves. The mixture of constrained priors is given by a mixture of distributions of the form of Equation 8:

p(\theta) = \sum_{i=0}^{n} w_i \, \frac{\pi(\theta) \, \mathbb{1}\left[ L(\theta) > l_i \right]}{X_i}    (10)

where {l_i} is a set of likelihood thresholds or "levels", constructed such that the prior mass X_{i+1} associated with level i+1 is approximately e^{-1} times that of level i. The {w_i} are a set of mixture weights, usually chosen to emphasise the higher levels during the initial stages of a run (while levels are being created), and set to uniform once the desired number of levels n has been created.

¹ A C++ implementation of DNS, called DNest3, is available at https://github.com/eggplantbren/DNest3.
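To make Equation 10 concrete, the following sketch evaluates the factor by which the mixture of constrained priors reweights the prior density at a point. Uniform weights w_i and the geometric level masses X_i ≈ e^{-i} described above are assumptions of this illustration; DNest3 itself explores the joint pair (level index, θ) by MCMC rather than evaluating this sum directly.

```cpp
// Sketch: the reweighting factor of the DNS mixture target (Equation 10)
#include <cmath>
#include <cstdio>
#include <vector>

double mixture_factor(double logL, const std::vector<double>& log_levels) {
    // log_levels[i] holds log(l_i); level 0 has l_0 = 0, i.e. log l_0 = -infinity
    double factor = 0.0;
    int n = (int)log_levels.size();
    for (int i = 0; i < n; ++i)
        if (logL > log_levels[i])                       // indicator 1[L(theta) > l_i]
            factor += (1.0 / n) * std::exp((double)i);  // w_i / X_i with X_i = e^{-i}
    return factor;  // p(theta) is proportional to pi(theta) * factor
}

int main() {
    std::vector<double> log_levels = {-1e300, -745.0, -737.0, -732.0};
    std::printf("reweighting factor: %e\n", mixture_factor(-734.0, log_levels));
}
```

Because the prior (level 0) is always a component of the mixture, the target density is non-negligible everywhere, which is what allows travel between well-separated modes.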
Since DNS does not sample the posterior distribution, the samples need to be assigned weights (in a manner similar to importance sampling) to represent the posterior. See Brewer, Pártay, & Csányi (2011b) for more details about DNS.

The idea of replacing the target distribution with something other than the posterior is similar to annealed importance sampling (Neal, 2001) or parallel tempering (Hansmann, 1997), where the target distribution is modified from the posterior to something "easier". However, the sequence of distributions used in Nested Sampling avoids some of the problems with annealing-based methods. In particular, there is no need to choose a temperature schedule (a potentially large set of tuning parameters), and the method does not fail when phase transitions are present.

Figure 2: Left: an example of a two-dimensional posterior distribution with multiple modes, strong dependence, and a phase transition (in the top-right mode). While this low-dimensional example is not difficult to sample, an analogous problem in high dimensions would be. Right: Diffusive Nested Sampling replaces the target posterior distribution by a mixture of constrained priors. The density is non-negligible everywhere (since the prior is a component of this distribution), and states intermediate between the two phases in the top-right mode are appropriately up-weighted.

Figure 2 shows a mock two-dimensional posterior density with three modes, one of which contains a strong dependence, and another of which contains a phase transition. The target distribution used by DNS includes the prior as part of the mixture, so travel between the modes is possible.

Since DNS is effectively the Metropolis-Hastings algorithm applied to a distribution other than the posterior, proposal distributions are needed which define how a walker moves around the hypothesis space. In Section 2, we outline a set of proposal distributions used to explore the hypothesis space of possible values for N, α, and {x_i}. A C++ implementation of the ideas described in this paper is available online (under the terms of the GNU General Public Licence) at https://github.com/eggplantbren/RJObject.

2 Metropolis proposals for the general problem

We will now define a set of proposal distributions that can be used to sample the prior distribution p(N, α, {x_i}) = p(N) p(α) ∏_{i=1}^N p(x_i | α). These proposals will need to satisfy detailed balance in order to be valid when used inside DNS. When the proposal satisfies detailed balance with respect to the prior, the DNS algorithm can incorporate hard likelihood constraints by rejecting any proposal whose likelihood does not satisfy the constraint. The overall proposal is a mixture of all of the proposals listed here.
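The following toy sketch shows this recipe in its simplest form: a random-walk proposal, wrapped so that it satisfies detailed balance with respect to a Uniform(0,1) prior, combined with a hard likelihood constraint, so a move is accepted exactly when the proposed point exceeds the current likelihood threshold. The gaussian likelihood and all numbers are invented for illustration; this is not the RJObject implementation.

```cpp
// Toy sketch: Metropolis exploration of a constrained prior (Equation 8)
#include <cmath>
#include <cstdio>
#include <random>

double log_likelihood(double u) {  // toy likelihood, peaked at u = 0.7
    return -0.5 * std::pow((u - 0.7) / 0.01, 2);
}

int main() {
    std::mt19937 rng(0);
    std::normal_distribution<double> N01(0.0, 1.0);

    double u = 0.7;        // start inside the constrained region
    double log_l = -50.0;  // current likelihood threshold, log(l)
    int accepts = 0, steps = 100000;
    for (int i = 0; i < steps; ++i) {
        // Wrapped random walk: symmetric, preserves the Uniform(0,1) prior
        double trial = std::fmod(u + 0.1 * N01(rng) + 10.0, 1.0);
        if (log_likelihood(trial) > log_l) { u = trial; ++accepts; }
    }
    std::printf("acceptance rate = %.3f\n", (double)accepts / steps);
}
```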
When using DNS, the proposals should typically be very heavy-tailed. When DNS is exploring a distribution close to the prior, large proposals will generally be needed to explore the prior efficiently; when DNS is exploring a distribution constrained to very high values of the likelihood function, much smaller proposals will be appropriate. Rather than attempting expensive tuning (since the number of distributions involved may number in the hundreds or even thousands), we apply heavy-tailed proposal distributions and simply accept that there will be some waste.

If a parameter u has a uniform prior between 0 and 1, then a particular choice of heavy-tailed proposal is defined by

u' := \mathrm{mod}\left( u + 10^{1.5 - 6a} \, b, \; 1 \right)    (11)

where a ~ Uniform(0, 1) and b ~ N(0, 1). This is like a standard gaussian/normal random-walk proposal, but with a non-fixed step size, resulting in a scale mixture of normals. When a is close to zero, the size of the proposal is of order 10, which (because of the mod function) effectively randomises u' from the prior. When a is close to 1, the scale of the proposal is roughly 10^{-4} times the prior width. The coefficient of a was set to 6 because smaller proposals are usually not necessary. The value 1.5 in the exponent was found to optimise the performance of this proposal when sampling from Uniform(0, 1) and N(0, 1) priors.

Most of the proposals discussed in this paper involve changing one (or a subset) of the parameters and/or hyperparameters, while keeping the others fixed. Whenever a heavy-tailed proposal distribution is required, we use the proposal from Equation 11, or an analogous one, e.g. a discrete version when the proposal is to change N. The DNest3 package monitors the acceptance rate as a function of level. In most applications, the acceptance probability is close to 1 (or may be precisely 1) when sampling the prior, and decreases to 5-40% when sampling high levels.

2.1 Proposals that modify N

The first kind of proposal we consider changes the dimension of the model, i.e. the value of N. By necessity, these proposals will also change the parameters {x_i}_{i=1}^N, because the number of parameters changes. The proposals used here are traditionally called birth and death proposals. We will assume that the prior for N is a discrete uniform distribution between 0 and some N_max, inclusive. The proposal starts by choosing a new value N' according to

N' = \mathrm{mod}\left( N + \delta_N, \; N_{\max} + 1 \right)    (12)

where δ_N is drawn from a heavy-tailed distribution which is a discrete analogue of Equation 11 (N is treated as a real number in [0, N_max + 1] and the integer part is taken at the end). The most probable values of δ_N are ±1, but values of order N_max also have some probability, to allow fast exploration when the target distribution is similar to the prior. If N' = N, N' is regenerated as N + 1 with probability 0.5 and N - 1 with probability 0.5, and then wrapped back into the range {0, ..., N_max} if necessary.

If N' > N, i.e. the proposal is to add components to the model, the extra parameters needed, {x_i}_{i=N+1}^{N'}, are drawn from their prior conditional on the current value of the hyperparameters, i.e. the conditional prior p(x | α). If N' < N, i.e. the proposal is to remove components from the model, then N - N' components must be selected for removal, all components having the same probability of being selected.
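A minimal sketch of the heavy-tailed proposal of Equation 11 and a discrete analogue for N (Equation 12) follows. The continuous embedding of N (placing it at the centre of its unit cell before perturbing and taking the integer part) is one plausible reading of the description above, not necessarily how DNest3/RJObject implements it.

```cpp
// Sketch: heavy-tailed proposals, continuous (Eq. 11) and discrete (Eq. 12)
#include <cmath>
#include <cstdio>
#include <random>

std::mt19937 rng(42);
std::uniform_real_distribution<double> U(0.0, 1.0);
std::normal_distribution<double> N01(0.0, 1.0);

// Wrap x into [0, length)
double wrap(double x, double length) { return x - length * std::floor(x / length); }

// Equation 11: u' = mod(u + 10^{1.5 - 6a} b, 1), a ~ Uniform(0,1), b ~ N(0,1)
double heavy_tailed_step(double u) {
    double scale = std::pow(10.0, 1.5 - 6.0 * U(rng));
    return wrap(u + scale * N01(rng), 1.0);
}

// Discrete analogue for proposals that change N (Equation 12)
int propose_N(int N, int N_max) {
    double scale = std::pow(10.0, 1.5 - 6.0 * U(rng));
    int trial = (int)wrap(N + 0.5 + scale * N01(rng), N_max + 1.0);
    if (trial == N)  // regenerate as N +/- 1 and wrap, as described above
        trial = (int)wrap(N + (U(rng) < 0.5 ? 1.0 : -1.0), N_max + 1.0);
    return trial;
}

int main() {
    std::printf("u' = %f, N' = %d\n", heavy_tailed_step(0.5), propose_N(10, 100));
}
```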
2.2 Proposals that modify components

We now consider a proposal distribution that modifies one or more of the components {x_i}, while keeping the number of components N, as well as the hyperparameters α, fixed. Let F(x; α) be a function that takes a component x and transforms it to a value u that has a uniform distribution between 0 and 1, given α. If the component x consists of a single scalar value, F is the cumulative distribution function of the conditional prior. Denote the inverse of F by G. A proposal for a component involves transforming its parameters to [0, 1] using F, making the proposal in terms of u, and then transforming back. Specifically, the proposal chooses a new value x'_i from the current value x_i as follows:

u_i := F(x_i; \alpha)    (13)
u'_i := \mathrm{mod}(u_i + \delta_u, \; 1)    (14)
x'_i := G(u'_i; \alpha)    (15)

where δ_u is drawn from a heavy-tailed distribution as in Equation 11. Choosing just one component to change is most appropriate when the DNS distribution is very constrained. When it is close to the prior, bolder proposals that change more than one component at a time are possible. Hence, we choose the number of components to change from a heavy-tailed distribution where the most probable value is 1, but there is also a non-trivial probability of proposing to change ~N components at once.

2.3 Proposals that change the hyperparameters, keeping the components fixed

Another kind of proposal that we will consider keeps all of the components fixed in place (i.e. leaves {x_i} unchanged), but changes the hyperparameter(s) from their current value α to a new value α'. The proposal for the hyperparameter(s) is chosen to satisfy detailed balance with respect to p(α) and should be heavy-tailed. The overall Metropolis acceptance probability for this kind of move, if sampling the prior (or the constrained prior of DNS), must include the following factor:

\frac{\prod_{i=1}^{N} p(x_i \,|\, \alpha')}{\prod_{i=1}^{N} p(x_i \,|\, \alpha)}    (16)

Since this proposal leaves the components {x_i} fixed, it will usually not affect the value of the likelihood, and therefore the likelihood will not need to be recomputed. This proposal is therefore most useful for mixing the values of the hyperparameters α when the target distribution is highly constrained.

2.4 Proposals that change the hyperparameters and all of the components

The above proposals allow for changes to N, α, and {x_i}, and are therefore sufficient to allow for "correct" exploration of the prior distribution, and of the constrained prior distributions used in DNS. However, they do not necessarily allow for efficient exploration, even of the prior itself. The main reason for this is the inability to make large changes to the hyperparameters α. If the proposed change from α to α' is large, the ratio in Equation 16 is likely to be very small, and the move will probably be rejected.

Therefore, we allow an additional move that changes α to a new value α', but rather than leaving the components {x_i} fixed, "drags" them so they represent the distribution p(x | α') rather than p(x | α). We do this by making use of the transformation functions F(x; α) and G(u; α) defined in Section 2.2. The "dragging" process works as follows, and is carried out on each component:

u_i := F(x_i; \alpha)    (17)
x'_i := G(u_i; \alpha')    (18)

In other words, the components' parameters are transformed to [0, 1] using the current value of the hyperparameters, and then transformed back using the proposed value of the hyperparameters, so they represent the new conditional prior rather than the old one.
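Both the component move (Equations 13-15) and the dragging move (Equations 17-18) need only the pair of transformations F and G. The sketch below spells them out for an exponential conditional prior with mean μ, the same family used for the amplitudes in Section 3; for this family, dragging reduces to rescaling each component by μ'/μ. This is illustrative code, not the RJObject implementation.

```cpp
// Sketch: transform-based moves of Sections 2.2 and 2.4 for an exponential
// conditional prior with mean mu. F is the prior's CDF; G is its inverse.
#include <cmath>
#include <cstdio>

double F(double x, double mu) { return 1.0 - std::exp(-x / mu); }  // CDF
double G(double u, double mu) { return -mu * std::log(1.0 - u); }  // inverse CDF

double wrap01(double x) { return x - std::floor(x); }

// Section 2.2: move one component within the conditional prior (Eqs. 13-15)
double move_component(double x, double mu, double delta_u) {
    return G(wrap01(F(x, mu) + delta_u), mu);
}

// Section 2.4: "drag" a component to follow a hyperparameter change (Eqs. 17-18)
double drag_component(double x, double mu_old, double mu_new) {
    return G(F(x, mu_old), mu_new);  // here equal to x * mu_new / mu_old
}

int main() {
    std::printf("moved:   %f\n", move_component(2.0, 1.5, 0.01));
    std::printf("dragged: %f\n", drag_component(2.0, 1.5, 3.0));
}
```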
3 Sinusoidal example

In this section we demonstrate a seemingly simple example which exhibits a phase transition, making a standard MCMC approach difficult. Suppose a signal is composed of N sinusoids of different periods, amplitudes, and phases. The signal is observed with noise at various times {t_i}_{i=1}^m, and we want to use the resulting data to infer the number of sinusoids N, along with the periods {T_i}_{i=1}^N, amplitudes {A_i}_{i=1}^N, and phases {φ_i}_{i=1}^N. This kind of model has many applications, and has been solved analytically under certain sets of assumptions (see e.g. Bretthorst, 1988; Mortier et al., 2014). Similar models have been used in many different fields (e.g. Bretthorst, 2003; Umstätter et al., 2005; Brewer et al., 2007; Brewer and Stello, 2009). Note that this problem is also very closely related to the problem of detecting exoplanet signals in radial velocity data, which has attracted a lot of research attention in recent years (e.g. Gregory, 2011; Hou et al., 2014; Feroz et al., 2011).

3.1 Simulated Dataset

To demonstrate the techniques, we simulated some data based on the assumption N = 2, i.e. the true signal was composed of two sinusoids. The simulated data are shown in Figure 3: a large, slow oscillation with a smaller, much faster oscillation superimposed. The noise level is such that the fast oscillation is difficult to detect. The true values of the parameters were N = 2, A = {7, 0.155}, T = {30, 2}, and φ = {0, 1}. The signal was observed at m = 1001 points equally spaced between t = 0 and t = 100. The true form of the signal is

y(t) = \sin\left( \frac{2\pi t}{30} \right) + 0.3 \sin\left( \frac{2\pi t}{2} + 1 \right)    (19)

and the noise standard deviation was σ = 0.5.

Figure 3: The simulated data for the sinusoidal example. The solid line shows the true signal, which consists of a large slow oscillation and a much smaller amplitude fast oscillation. The points are the measurements, simulated with a noise standard deviation of σ = 0.5.

3.2 Prior Distributions

To infer N, the number of sinusoids, as well as all the periods, amplitudes, and phases, we need to define prior distributions. In the notation of Section 1, the parameters of each component are

x_i = \{A_i, T_i, \phi_i\}.    (20)

For the conditional prior p(x_i | α), we introduced a single hyperparameter μ, such that A_i given μ has an exponential prior with mean μ. For the periods, we assigned a log-uniform distribution between fixed boundaries, and for the phases a uniform distribution between 0 and 2π. In other words, our prior is only hierarchical for the amplitudes {A_i}. The prior for μ was log-uniform, with probability density ∝ 1/μ, where μ ∈ [10^{-3}, 10^3].

The model for the shape of the (noise-free) signal is

y(t) = \sum_{i=1}^{N} A_i \sin\left( \frac{2\pi t}{T_i} + \phi_i \right)    (21)

where there are N sinusoids in the signal, with amplitudes {A_i}, periods {T_i}, and phases {φ_i}. The sampling distribution (the probability distribution for the data given the parameters) was a normal (gaussian) distribution centred on the model signal, with zero-mean noise of standard deviation σ applied independently to each data point:

Y_i \,|\, N, \{A_i\}, \{T_i\}, \{\phi_i\}, \sigma \sim \mathcal{N}\left( y(t_i), \sigma^2 \right).    (22)

This also depends on an additional noise standard deviation parameter σ, which will also be inferred. We used a log-uniform prior for σ between 10^{-3} and 10^3.
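A direct transcription of Equations 21 and 22 into code gives the log likelihood below. This is a sketch for illustration; the variable names are not taken from the paper's implementation.

```cpp
// Sketch: log likelihood for the sinusoidal model (Equations 21-22)
#include <cmath>
#include <cstdio>
#include <vector>

const double PI = 3.141592653589793;

double log_likelihood(const std::vector<double>& t, const std::vector<double>& Y,
                      const std::vector<double>& A, const std::vector<double>& T,
                      const std::vector<double>& phi, double sigma) {
    double logL = 0.0;
    for (std::size_t j = 0; j < t.size(); ++j) {
        double y = 0.0;                               // noise-free signal, Eq. (21)
        for (std::size_t i = 0; i < A.size(); ++i)
            y += A[i] * std::sin(2.0 * PI * t[j] / T[i] + phi[i]);
        double r = Y[j] - y;                          // residual
        logL += -0.5 * std::log(2.0 * PI * sigma * sigma)
                - 0.5 * r * r / (sigma * sigma);      // gaussian term, Eq. (22)
    }
    return logL;
}

int main() {
    std::vector<double> t = {0.0, 0.1, 0.2}, Y = {1.0, 1.2, 0.9};
    std::printf("%f\n", log_likelihood(t, Y, {1.0}, {30.0}, {0.0}, 0.5));
}
```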
3.3 Results

We ran DNS on the simulated data shown in Figure 3. The log likelihood as a function of the log of the enclosed prior mass X (Equation 9) is plotted in Figure 5. This curve is a standard output of Nested Sampling methods and can be used to gain insight into the structure of the problem. In particular, when phase transitions are present, this curve will have concave-up regions (Skilling, 2006). Two phase transitions can be seen in Figure 5. The first phase transition, at log(X) ≈ -10 nats, separates "noise-only" models (where the entire dataset is accounted for by the noise term) from "one-sinusoid" models where N = 1. At log(X) ≈ -35 nats, a second phase transition separates N = 1 models from N = 2 models. This phase transition causes a double-peaked feature in the posterior weights (lower panel of Figure 5).

Figure 4: A "trace plot" of the log likelihood in the sinusoidal model, for an MCMC chain designed to sample the posterior distribution. The two phases have log-likelihoods around -730 and -745 respectively; however, transitions between the two phases are quite uncommon.

If we were to do this analysis by trying to explore the posterior directly, rather than by using Nested Sampling, it would be difficult to jump between the N = 1 solution and the N = 2 solution, because the posterior distribution would be composed of a mixture of a small-volume but high-likelihood spike and a large-volume but (relatively) low-likelihood slab. Note that this phase transition still exists if we condition on N = 2, so it is not caused by the trans-dimensional model. To demonstrate this, Figure 7 shows the value of the log likelihood versus time for 45 million iterations of MCMC targeting the posterior distribution. The sampler should visit states with log likelihood ~ -745 about 10% of the time (corresponding to models that don't contain the small oscillation), and the rest of the time should be spent in states with log likelihood around -730. In this run, transitions between these two phases occurred only a handful of times. In a practical application, computing the posterior probability for the existence of the small oscillation might be the whole point of the calculation, so mixing between models where it is present and models where it is absent is crucial.

Although DNS explores a distribution other than the posterior (a mixture of constrained priors is used instead), the saved samples can be assigned importance weights so that we can represent the posterior distribution. The marginal posterior distribution for N is shown in Figure 6. The most probable value is N = 2, which is also the true value, and there is a small probability for N = 1. The decreasing probabilities for N > 2 are due to the well-known natural "Occam's Razor" effect that can occur in Bayesian inference (e.g. MacKay, 2003, Chapter 28). The marginal likelihood estimate returned by DNS for this data was log(Z) = -771.8, and the information, or Kullback-Leibler divergence from the prior to the posterior, was estimated to be 39.8 nats.
Figure 5: Top panel: the shape of the likelihood function with respect to prior mass. Bottom panel: the posterior weights of the saved samples. The existence of a phase transition causes this to display two separate peaks, which would be difficult to mix between if we simply attempted to sample the posterior distribution.

Figure 6: The inference for N, the number of sinusoids, based on the sinusoid data. The true value was N = 2, which is also the most probable value given the data. Of course, this is not always the case.

Figure 7: A "trace plot" of the log likelihood in the SineWave model, for an MCMC chain designed to sample the posterior distribution. The two phases have log-likelihoods around -730 and -745 respectively; however, transitions between the two phases are quite uncommon.

4 "Galaxy Field" Example

Source detection is an important problem in astrophysics. Given some data, usually one or more noisy, blurred images of the sky, we would like to know how many components N (such as stars or galaxies) are in the image, and their properties (such as flux, size, and orientation). Various approaches exist, covering a wide spectrum between ad-hoc approaches and principled inference approaches (e.g. Irwin, 1985; Bertin and Arnouts, 1996; Dolphin, 2000; Hobson & McLachlan, 2003; Brewer et al., 2013). The more principled techniques tend to be much more computationally intensive, so they should only be applied to interesting or especially challenging subsets of astronomical imaging. In fact, the approach used by Brewer et al. (2013) was essentially an early version of the methodology presented in this paper.

In this section we apply the DNS approach to a toy version of the problem of detecting and quantifying extended components, such as galaxies, in a noisy image. We model each galaxy as a mixture of two concentric elliptical gaussian components, with the following eight parameters. Firstly, two parameters x_c and y_c describe the central position of the galaxy within the image. The flux f describes the total integral of the galaxy's intensity profile. The axis ratio q describes the ellipticity of the galaxy, and θ its orientation angle with respect to horizontal. The radius of the bigger gaussian is a parameter w. Finally, we include a "radius ratio" u describing the ratio of the radius of the smaller gaussian to that of the bigger gaussian, and a parameter v describing the fraction of the total flux in the smaller gaussian (therefore the fraction of the light in the bigger gaussian is 1 - v). In a coordinate system aligned with the major axis of the ellipse, the surface brightness profile of a "galaxy" is given by

\rho(x, y) = \frac{f(1 - v)}{2\pi w^2} \exp\left[ -\frac{1}{2w^2}\left( q x^2 + y^2/q \right) \right]    (23)
           + \frac{f v}{2\pi (uw)^2} \exp\left[ -\frac{1}{2(uw)^2}\left( q x^2 + y^2/q \right) \right].    (24)

Figure 8: The simulated image for the GalaxyField example. The image is 200 × 200 pixels in size and contains N = 47 "galaxies".

We generated a simulated 200 × 200 pixel noisy image (Figure 8) containing 47 "galaxies", in order to test the algorithm.
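A sketch of Equations 23-24 in code follows. The shift to the centre (x_c, y_c) and the rotation by θ into the major-axis frame are implied by the model description rather than written out in the equations, so that step is an assumption spelled out explicitly here.

```cpp
// Sketch: two-component galaxy surface brightness profile (Equations 23-24)
#include <cmath>
#include <cstdio>

struct Galaxy { double xc, yc, f, q, theta, w, u, v; };

double surface_brightness(const Galaxy& g, double x, double y) {
    const double PI = 3.141592653589793;
    // Rotate into the coordinate system aligned with the major axis
    double dx =  (x - g.xc) * std::cos(g.theta) + (y - g.yc) * std::sin(g.theta);
    double dy = -(x - g.xc) * std::sin(g.theta) + (y - g.yc) * std::cos(g.theta);
    double r2  = g.q * dx * dx + dy * dy / g.q;  // elliptical radius squared
    double w2  = g.w * g.w;                      // larger gaussian, Eq. (23)
    double uw2 = (g.u * g.w) * (g.u * g.w);      // smaller gaussian, Eq. (24)
    return g.f * (1.0 - g.v) / (2.0 * PI * w2)  * std::exp(-0.5 * r2 / w2)
         + g.f * g.v         / (2.0 * PI * uw2) * std::exp(-0.5 * r2 / uw2);
}

int main() {
    Galaxy g{0.0, 0.0, 10.0, 0.8, 0.3, 0.05, 0.4, 0.25};
    std::printf("%e\n", surface_brightness(g, 0.01, -0.02));
}
```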
4.1 Priors

For the conditional priors for the total fluxes of the galaxies, and for the widths of the galaxies, we used the Pareto distribution. The probability density function for a variable x, given hyperparameters x_min and a, is

p(x \,|\, x_{\min}, a) = \begin{cases} \dfrac{a \, x_{\min}^a}{x^{a+1}}, & x \ge x_{\min} \\ 0, & \text{otherwise.} \end{cases}    (25)

Since we are using this for both the fluxes and the widths of the galaxies, there will be four hyperparameters in total: two lower limits (one for the fluxes and one for the widths) and two slopes. The lower limits were assigned log-uniform priors between 10^{-3} and 10^3, and uniform priors between 0 and 1 were assigned to the inverses of the slopes. This latter choice was based on the fact that a Pareto distribution for x is an exponential distribution for ln(x), with scale length 1/a. If we do not expect the scale length to be very extreme, then we might want to restrict it to [0, 1] with a uniform prior.

The conditional priors for the radius ratios u and the flux ratios v were both uniform distributions. Since u ∈ [0, 1] and v ∈ [0, 1], the upper limits of these uniform conditional priors must lie between 0 and 1. Therefore the prior for the upper limit was a uniform distribution between 0 and 1, and the prior for the lower limit, given the upper limit, was a uniform distribution between 0 and the upper limit. As in Section 3, we allowed the noise standard deviation σ to be a free parameter with a log-uniform prior between 10^{-3} and 10^3.
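For the transform-based proposals of Section 2, the Pareto conditional prior needs its CDF F and inverse CDF G, which have simple closed forms. The sketch below is illustrative code, not the paper's implementation.

```cpp
// Sketch: the Pareto conditional prior of Equation 25, with F and G
#include <cmath>
#include <cstdio>

double pareto_pdf(double x, double xmin, double a) {
    return (x >= xmin) ? a * std::pow(xmin, a) / std::pow(x, a + 1.0) : 0.0;
}
double pareto_F(double x, double xmin, double a) {  // CDF
    return (x >= xmin) ? 1.0 - std::pow(xmin / x, a) : 0.0;
}
double pareto_G(double u, double xmin, double a) {  // inverse CDF, u in [0, 1)
    return xmin * std::pow(1.0 - u, -1.0 / a);
}

int main() {
    double xmin = 0.1, a = 2.0;
    double x = pareto_G(0.9, xmin, a);  // 90th percentile of the prior
    std::printf("x = %f, F(x) = %f, p(x) = %f\n",
                x, pareto_F(x, xmin, a), pareto_pdf(x, xmin, a));
}
```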
4.2 Results

The results from running DNS on the simulated galaxy data are shown in Figures 9 and 10. This problem, like the sinusoidal problem, exhibits a phase transition, which can be seen in Figure 9. The phase transition separates models which fit only the brightest galaxies from models that also fit the faint galaxies close to the noise level. This phase transition does not affect the posterior distribution, as the low-N solutions are eventually found to have very low importance. However, it would cause difficulty for a thermal approach to calculating the marginal likelihood. The marginal likelihood estimate returned by DNS for this data was log(Z) = -319707.6, and the information, or Kullback-Leibler divergence from the prior to the posterior, was estimated to be 550.2 nats.

Figure 9: Top panel: the shape of the likelihood function with respect to prior mass for the GalaxyField data. Note the phase transition at log(X) ≈ -100 nats. This separates models with just the few brightest galaxies from models with many faint galaxies as well. While this doesn't affect the posterior distribution (unlike the sinewave example), it would affect calculation of the marginal likelihood if annealing were used. Bottom panel: the posterior weights of the saved samples.

Figure 10: The inference for N, the number of galaxies, based on the GalaxyField data. The true value was N = 47.

5 Optimisation Techniques

In both of the examples discussed in this paper, the likelihood evaluation involved computing a mock, noise-free signal from the current value of the model parameters. The mathematical form of the mock signal was a sum over all N components present. Computing the mock signal is often the most expensive step in the evaluation of the likelihood. However, many of the proposal distributions used involve changing a subset of the parameters, while keeping others fixed. Considerable speedups can be achieved by, when appropriate, updating the mock signal to reflect the proposed change to the model, rather than computing the entire mock signal from scratch. For example, if the proposal is to add two new components to the model, the mock signal can simply be updated by adding the effect of the two new components.

In general, it is possible to compute the number of components that have been affected by the proposal. If this is greater than or equal to N, we compute the mock signal from scratch; otherwise, we subtract the effect of those components that have been removed and add in the effect of those that have been added. In the C++ implementation of this work, when a proposal takes place, the difference between the old and new model is cached and is available to the user, so that the likelihood can be updated rather than computed from scratch. Using this technique offers a speed advantage of a factor of ~2 for both of the example problems. The sinusoidal example could be solved in a few minutes on a desktop computer, whereas the galaxy example took much longer, about a day, to execute.
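A sketch of this bookkeeping, using the sinusoidal model as a concrete case, is given below: each component's contribution can be added to or subtracted from a cached mock signal, so a proposal that swaps one component touches only that component. The RJObject implementation caches the model difference in its own form; this only illustrates the idea.

```cpp
// Sketch: incremental update of a cached mock signal (Section 5)
#include <cmath>
#include <cstdio>
#include <vector>

const double PI = 3.141592653589793;
struct Sinusoid { double A, T, phi; };

// Add (sign = +1) or subtract (sign = -1) one component's contribution
void apply(std::vector<double>& mock, const std::vector<double>& t,
           const Sinusoid& s, double sign) {
    for (std::size_t j = 0; j < t.size(); ++j)
        mock[j] += sign * s.A * std::sin(2.0 * PI * t[j] / s.T + s.phi);
}

int main() {
    std::vector<double> t = {0.0, 0.5, 1.0, 1.5}, mock(t.size(), 0.0);
    Sinusoid s1{1.0, 30.0, 0.0}, s2{0.3, 2.0, 1.0}, s3{0.5, 10.0, 0.2};

    apply(mock, t, s1, +1.0);  // build the initial mock signal
    apply(mock, t, s2, +1.0);

    // Proposal: replace s2 with s3. Update the cache instead of rebuilding.
    apply(mock, t, s2, -1.0);
    apply(mock, t, s3, +1.0);

    for (double m : mock) std::printf("%f\n", m);
}
```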
6 Conclusions

In this paper, we presented a general approach to trans-dimensional Bayesian inference problems. To compute the posterior distribution in these contexts, birth and death MCMC moves are commonly used. However, the approach presented here replaces the posterior distribution with the alternative target distribution used in Diffusive Nested Sampling. This is done to facilitate mixing between multiple modes, along strong dependencies, and between different phases. Additionally, the marginal likelihood can be computed, including the sum over the unknown number of components. The method was demonstrated on two illustrative examples inspired by astronomical data analysis, but should be applicable in other contexts such as mixture modelling. Software (in C++) implementing the general Metropolis proposals for these models, as well as the specific examples, is available at https://github.com/eggplantbren/RJObject.

Acknowledgements

This work is supported by a Marsden Fast-Start grant from the Royal Society of New Zealand. I would like to thank the following people for valuable conversations and inspiration: Anna Pancoast (UCSB), David Hogg (NYU), Daniel Foreman-Mackey (NYU), Courtney Donovan (Auckland), Tom Loredo (Cornell), Iain Murray (Edinburgh), Ewan Cameron (Oxford), John Skilling (MaxEnt Data Consultants), and Daniela Huppenkothen (Amsterdam, NYU).

References

Bertin, E., Arnouts, S. 1996. SExtractor: Software for source extraction. Astronomy and Astrophysics Supplement Series 117, 393-404.

Bretthorst, G. L. 1988. Bayesian Spectrum Analysis and Parameter Estimation. Lecture Notes in Statistics 48, Springer-Verlag, New York.

Bretthorst, G. L. 2003. Frequency Estimation, Multiple Stationary Nonsinusoidal Resonances With Trend. Bayesian Inference and Maximum Entropy Methods in Science and Engineering 659, 3-22.

Brewer, B. J., Bedding, T. R., Kjeldsen, H., Stello, D. 2007. Bayesian Inference from Observations of Solar-like Oscillations. The Astrophysical Journal 654, 551-557.

Brewer, B. J., Stello, D. 2009. Gaussian process modelling of asteroseismic data. Monthly Notices of the Royal Astronomical Society 395, 2226-2233.

Brewer, B. J., Foreman-Mackey, D., Hogg, D. W. 2013. Probabilistic Catalogs for Crowded Stellar Fields. The Astronomical Journal 146, 7.

Brewer, B. J., Pártay, L. B., Csányi, G. 2011. Statistics and Computing 21(4), 649-656.

Brewer, B. J., Lewis, G. F., Belokurov, V., Irwin, M. J., Bridges, T. J., Evans, N. W. 2011. Modelling of the complex CASSOWARY/SLUGS gravitational lenses. Monthly Notices of the Royal Astronomical Society 412, 2521-2529.

Buchner, J. 2014. A statistical test for Nested Sampling algorithms. Statistics and Computing, 1-10.

Corsaro, E., De Ridder, J. 2014. DIAMONDS: A new Bayesian nested sampling tool. Application to peak bagging of solar-like oscillations. Astronomy and Astrophysics 571, AA71.

Dolphin, A. E. 2000. PASP 112, 1383.

Feroz, F., Hobson, M. P., Cameron, E., Pettitt, A. N. 2013. Importance Nested Sampling and the MultiNest Algorithm. ArXiv e-prints.

Feroz, F., Hobson, M. P., Bridges, M. 2009. MNRAS 398, 1601.

Feroz, F., Balan, S. T., Hobson, M. P. 2011. Detecting extrasolar planets from stellar radial velocities using Bayesian evidence. Monthly Notices of the Royal Astronomical Society 415, 3462-3472.

Feroz, F., Hobson, M. P. 2014. Bayesian analysis of radial velocity data of GJ667C with correlated noise: evidence for only two planets. Monthly Notices of the Royal Astronomical Society 437, 3540-3549.

Green, P. J. 1995. Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination. Biometrika 82(4), 711-732.

Gregory, P. C. 2011. Bayesian exoplanet tests of a new method for MCMC sampling in highly correlated model parameter spaces. Monthly Notices of the Royal Astronomical Society 410, 94-110.

Hansmann, U. H. E. 1997. Parallel tempering algorithm for conformational studies of biological molecules. Chemical Physics Letters 281(1), 140-150.

Hobson, M. P., McLachlan, C. 2003. MNRAS 338, 765.

Hou, F., Goodman, J., Hogg, D. W. 2014. The Probabilities of Orbital-Companion Models for Stellar Radial Velocity Data. ArXiv e-prints.

Irwin, M. J. 1985. MNRAS 214, 575.

Jasra, A., Holmes, C. C., Stephens, D. A. 2005. Markov Chain Monte Carlo Methods and the Label Switching Problem in Bayesian Mixture Modeling. Statistical Science 20(1), 50-67.

MacKay, D. J. C. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, Cambridge.

Mortier, A., Faria, J. P., Correia, C. M., Santerne, A., Santos, N. C. 2014. BGLS: A Bayesian formalism for the generalised Lomb-Scargle periodogram. ArXiv e-prints.

Neal, R. M. 2001. Annealed importance sampling. Statistics and Computing 11, 125-139.

Skilling, J. 1998. Massive Inference and Maximum Entropy. In Maximum Entropy and Bayesian Methods, Kluwer Academic Publishers, Dordrecht/Boston/London, p. 14.

Skilling, J. 2006. Nested Sampling for General Bayesian Computation. Bayesian Analysis 1(4), 833-860.

Stephens, M. 2000. Bayesian analysis of mixture models with an unknown number of components: an alternative to reversible jump methods. Annals of Statistics 28(1), 40-74.

Umstätter, R., Christensen, N., Hendry, M., Meyer, R., Simha, V., Veitch, J., Vigeland, S., Woan, G. 2005. Bayesian modeling of source confusion in LISA data. Physical Review D 72, 022001.