Technical Report No. 1002, Department of Statistics, University of Toronto

Covariance-Adaptive Slice Sampling

Madeleine Thompson, Department of Statistics, University of Toronto (mthompson@utstat.toronto.edu)
Radford M. Neal, Departments of Statistics and Computer Science, University of Toronto (radford@utstat.toronto.edu, http://www.cs.utoronto.ca/~radford/)

March 8, 2010

Abstract

We describe two slice sampling methods for taking multivariate steps using the crumb framework. These methods use the gradients at rejected proposals to adapt to the local curvature of the log-density surface, a technique that can produce much better proposals when parameters are highly correlated. We evaluate our methods on four distributions and compare their performance to that of a non-adaptive slice sampling method and a Metropolis method. The adaptive methods perform favorably on low-dimensional target distributions with highly-correlated parameters.

1 Introduction

Markov chain Monte Carlo methods often perform poorly when parameters are highly correlated. Our goal has been to develop MCMC methods that work well on such distributions without requiring prior knowledge about the distribution or extensive tuning runs.

Slice sampling (Neal, 2003) is an auxiliary-variable MCMC method based on the idea of drawing points uniformly from underneath the density surface of a target distribution. If one discards the density coordinate from the sample, the resulting marginal distribution is the target distribution. This document presents two samplers in the "crumb" framework (Neal, 2003, §5.2), a general framework for slice sampling methods that take multivariate steps. Unlike many MCMC samplers, they perform well when the target distribution has highly correlated parameters. These methods can improve on univariate slice sampling in the same way Metropolis with a properly chosen multivariate proposal distribution can improve on Metropolis with a spherical proposal distribution and on Metropolis updates of one coordinate at a time.

2 Multivariate Steps in the Crumb Framework

This section describes the crumb framework for taking multivariate steps in slice sampling. The overall goal is to sample from a target distribution, such as the one with log-density contours shown in figure 1(a). The dimension of the target distribution's parameter space is denoted by p. (The example in the figure has p = 2.) The state space of the Markov chain for slice sampling has dimension p + 1, of which p dimensions correspond to the target distribution and one to the current slice level. Slice sampling alternates between sampling the target coordinates and the density coordinate.

[Figure 1: (a) Contours of a target distribution. (b) A slice, S_{y_0}, and the current target component, x_0.]

Let f(x) be proportional to the density function of the target distribution, and let (x_0, y_0) be the current state in the augmented sample space, where x_0 ∈ R^p and y_0 ∈ R. To update the density component, y_0, we sample uniformly from [0, f(x_0)]. Updating the target component, x_0, is more difficult. Let S_{y_0} be the slice through the distribution at level y_0:

    S_{y_0} = { x : f(x) ≥ y_0 }    (1)

The set S_{y_0} is outlined by a thick line in figure 1(b).
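To make the density-coordinate update concrete, here is a minimal Python/NumPy sketch (our illustration, not code from the paper; the function name and interface are hypothetical). It works with log f(x) and an Exponential(1) draw, the underflow-avoiding form also used later in the pseudocode of figure 4:

    import numpy as np

    rng = np.random.default_rng(1)

    def draw_log_slice_level(log_f, x0):
        """Sample y_0 ~ Uniform(0, f(x_0)) and return log y_0.

        If U ~ Uniform(0, 1), then log(U f(x_0)) = log f(x_0) + log U,
        and -log U ~ Exponential(1), so f itself is never evaluated.
        """
        return log_f(x0) - rng.exponential(1.0)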
The difficulty in updating the target component is that we rarely have an analytic expression for the boundary of S_{y_0}, which may not even be a connected curve, so we cannot sample uniformly from S_{y_0} as we would like to. The methods of this document instead perform updates that leave the uniform distribution on S_{y_0} invariant, leaving the resulting chain with the desired stationary distribution.

These methods begin by sampling a "crumb," c_1, from a distribution that depends on x_0, then drawing a proposal, x_1, from the distribution of points that could have generated c_1, treating the crumb as data and x_0 as an unknown. An example crumb and proposal are drawn in figure 2(a), with the distribution of c_1 shown as a dashed line and the distribution of x_1 shown as a dotted line. If x_1 were inside S_{y_0}, we would accept it as the new value for the target component of the state. In figure 2(a), it is not, so we draw a new crumb, c_2, and draw a new proposal, x_2, from the distribution of x_0 given both c_1 and c_2 as data. This is plotted in figure 2(b). While the distribution of proposed moves is determined by the crumbs and their distributions, we can choose the distribution of the crumbs arbitrarily, using the densities and gradients at rejected proposals if we desire.

In the example of figure 2(b), x_2 is in S_{y_0}, so we accept x_2 as the new target component. If it were not, we would keep drawing crumbs, adapting their distributions so that the proposal distribution would be as close as possible to uniform sampling over S_{y_0}, and metaphorically following these crumbs back to x_0 by drawing proposals from the distribution of x_0 conditional on having observed the crumbs (Grimm and Grimm, 2008).

[Figure 2: (a) The first crumb, c_1, and a proposal, x_1. The distribution of c_1 is shown as dashed; the distribution of x_1 is shown as dotted. In this figure and subsequent ones, probability distributions in a two-dimensional space are represented by ellipses such that a uniform distribution over the ellipse has the same mean and covariance as the represented distribution. This method allows multiple distributions to be drawn in the same plot. (b) The second crumb, c_2, and a proposal, x_2. The distribution of c_2 is shown as dashed; the distribution of x_2 is shown as dotted. The distribution for c_2 has been updated using the method described in section 4.]

Neal (2003, §5.2) has more information on the crumb framework, including a proof that methods in this framework leave the target distribution invariant.

3 Overview of Adaptive Gaussian Crumbs

We now specialize the crumb framework to Gaussian crumbs and proposals, using log densities at proposals and their gradients to choose the crumb covariances. Without violating detailed balance, the crumb distribution can depend on these log densities and gradients. We assume that while computing the log density at a proposal, we can compute its gradient with minimal additional cost.

Ideally, the proposal distribution would be a uniform distribution over S_{y_0}. To approximate uniform sampling over S_{y_0}, we attempt to draw a sequence of crumbs that results in a Gaussian proposal distribution with the same covariance as a uniform distribution over the slice.

In both adaptive methods discussed in this document, the first crumb has a multivariate Gaussian distribution:

    c_1 ~ N(x_0, W_1^{-1})   where   W_1 = σ_c^{-2} I    (2)

The standard deviation of the initial crumb, σ_c, is the only tuning parameter for either method that is modified in normal use. The distribution for x_0 given c_1 is a Gaussian with mean c_1 and precision matrix W_1, so we draw a proposal from this distribution:

    x_1 ~ N(c_1, W_1^{-1})    (3)

If f(x_1) is at least y_0, then x_1 is inside the slice, so we accept x_1 as the target component of the next state of the chain. This update leaves the density component, y_0, unchanged, though it is usually forgotten after a proposal is accepted, anyway. When the proposal is not in the slice, we choose a different covariance matrix, W_{k+1}, for the next crumb, so that the covariance of the next proposal will look more like that of uniform sampling over S_{y_0}. The two methods proposed in this document differ in how they make that choice; sections 4 and 5 describe the details.

After sampling k crumbs from Gaussians with mean x_0 and precision matrices (W_1, ..., W_k), the distribution for the kth proposal (the posterior for x_0 given the crumbs) is:

    x_k ~ N(c̄_k, Λ_k^{-1})    (4)
    where  Λ_k = W_1 + ··· + W_k    (5)
    and  c̄_k = Λ_k^{-1} (W_1 c_1 + ··· + W_k c_k)    (6)

If x_k ∈ S_{y_0}—that is, if f(x_k) ≥ y_0—we accept x_k as the target component. Otherwise, we must choose a covariance for the distribution of the (k+1)th crumb, draw a new proposal, and repeat until a proposal is accepted.
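As an illustration of equations 2 through 6, the following Python/NumPy sketch maintains the proposal distribution implied by a sequence of Gaussian crumbs. It is our own sketch, not the paper's implementation: it stores Λ_k and the un-normalized mean as dense arrays and inverts them directly, which costs O(p³) per draw; section 4.3 shows how the paper avoids this with Cholesky-factor updates. The class and method names are hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)

    class GaussianCrumbs:
        """Proposal distribution implied by Gaussian crumbs (eqs. 4-6)."""

        def __init__(self, p):
            self.Lambda = np.zeros((p, p))  # proposal precision (eq. 5)
            self.cstar = np.zeros(p)        # un-normalized mean: sum of W_j c_j

        def draw_crumb(self, x0, W):
            # c_k ~ N(x_0, W^{-1}); the crumb covariance is the inverse precision
            c = rng.multivariate_normal(x0, np.linalg.inv(W))
            self.Lambda += W                # eq. 5
            self.cstar += W @ c
            return c

        def draw_proposal(self):
            # x_k ~ N(c_bar_k, Lambda_k^{-1}) with c_bar_k from eq. 6
            cbar = np.linalg.solve(self.Lambda, self.cstar)
            return rng.multivariate_normal(cbar, np.linalg.inv(self.Lambda))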
4 First Method: Matching the Slice Covariance

In the method described in this section, we attempt to find W_{k+1} so that the (k+1)th proposal distribution has the same conditional variance as uniform sampling from S_{y_0} in the direction of the gradient of log f(·) at x_k. This gradient is a good guess at the direction in which the proposal distribution is least like S_{y_0}. Figure 3(a) shows an example of this. We plotted the gradients of the log density at thirty points drawn from the same distribution (shown as a dotted line) as a rejected proposal; most of these gradients point in the direction where the proposal variance is least like the slice. Generally, in an ill-conditioned distribution, these gradients do not point towards the mode; they point towards the nearest point on the slice. (In a well-conditioned distribution, the directions to the mode and to the nearest point on the slice will be similar.)

4.1 Choosing Crumb Covariances

To compute W_{k+1}, we will estimate the second derivative of the target distribution in the direction of the gradient, assuming the target distribution is approximately locally Gaussian. Consider the (approximately) parabolic cut through the log-density surface in figure 3(b), which has the equation:

    ℓ = -(1/2) κ t² + β t + γ    (7)

Here, t is a parameter that is zero at x_k and increases in the direction of the gradient; κ, β, and γ are unknown constants we wish to estimate. The coefficient -1/2 is arbitrary; this choice makes κ equal to the negative of the second derivative of the parabola. We already had to compute f(x_k) so that we could determine whether x_k was in S_{y_0}; we assume that ∇ log f(x_k) was computed simultaneously.
[Figure 3: (a) Gradients at thirty points drawn from the first proposal distribution (shown as a dotted line); these gradients tend to point towards the slice (shown as a thick line). (b) A parabolic cut through the log-density surface that goes through a rejected proposal, x_1, in the direction of the gradient at x_1, shown as an arrow; u_1 is a point on the cut defined by equation 12. (c) adds a parabolic cut through the mode in the same direction. These two parabolas have approximately the same second derivative. The distance between the two arms of the second parabola when the vertical coordinate is equal to log y_0 is shown as a double-ended arrow, equal to d in equation 16.]

To fix two degrees of freedom of the parabola, we plug these quantities into equation 7 and its derivative with respect to t:

    log f(x_k) = -(1/2) κ · 0² + β · 0 + γ = γ    (8)
    ‖∇ log f(x_k)‖ = -κ · 0 + β = β    (9)

We still have one degree of freedom left, so we must evaluate log f(·) at another point on the parabola. We choose a point as far away from x_k as c_k is, hoping that this distance is within the range where the distribution is approximately Gaussian. Let δ be this distance:

    δ = ‖x_k − c_k‖    (10)

Let g be the normalized gradient at x_k:

    g = ∇ log f(x_k) / ‖∇ log f(x_k)‖    (11)

Then, the point u_k is defined to be δ away from x_k in the direction g:

    u_k = x_k + δ · g    (12)

In equation 7, u_k corresponds to t = δ. We evaluate the density at u_k to fix the parabolic cut's third degree of freedom, and plug this into equation 7:

    log f(u_k) = -(1/2) κ δ² + β δ + γ    (13)

Solving equations 8, 9, and 13 for κ gives:

    κ = -(2/δ²) ( log f(u_k) − log f(x_k) − ‖∇ log f(x_k)‖ · δ )    (14)

We can now use κ to approximate the conditional variance in the direction g. If the Hessian is locally approximately constant, as it is for Gaussians, a cut through the mode in the direction of ∇ log f(x_k) would have the same second derivative as the cut through x_k. This second parabola, shown in figure 3(c), has the equation:

    ℓ = -(1/2) κ t² + M    (15)

M is the log density at the mode. We set t = 0 at the mode, so there is no linear term. For now, assume f(·) is unimodal and that M was computed with the conjugate gradient method (Nocedal and Wright, 2006, ch. 5) or some other similar procedure before starting the Markov chain. We solve equation 15 for the parabola's diameter d at the level of the current slice, log y_0, shown as a double-ended arrow in figure 3(c):

    d = sqrt( 8 (M − log y_0) / κ )    (16)

Since the distribution of points drawn from an ellipsoidal slice, conditional on their lying on that particular one-dimensional cut, is uniform with length d, the conditional variance in the direction of the gradient at x_k is:

    σ²_{k+1} = d²/12 = (2/3) · (M − log y_0) / κ    (17)
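The fit of equations 10 through 17 reduces to a few lines of arithmetic. Below is a hedged Python/NumPy sketch of it (our illustration; log_f and grad_log_f are assumed user-supplied callables, and M is taken as known here, as in the unimodal setting; section 4.2 replaces that assumption):

    import numpy as np

    def conditional_variance(log_f, grad_log_f, x_k, c_k, M, log_y0):
        """Fit the parabolic cut (eq. 7) through x_k along the gradient
        and return sigma^2_{k+1} from eq. 17. Assumes kappa > 0, i.e.,
        the cut is concave, as it is for a locally Gaussian target."""
        G = grad_log_f(x_k)
        g = G / np.linalg.norm(G)             # eq. 11
        delta = np.linalg.norm(x_k - c_k)     # eq. 10
        u_k = x_k + delta * g                 # eq. 12
        # eq. 14: solve eqs. 8, 9, and 13 for the curvature kappa
        kappa = -(2.0 / delta**2) * (
            log_f(u_k) - log_f(x_k) - np.linalg.norm(G) * delta)
        # eqs. 16-17: a uniform draw on a chord of length d has variance d^2/12
        return (2.0 / 3.0) * (M - log_y0) / kappa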
With this variance, we can construct a crumb precision matrix that will lead to the desired proposal precision matrix. We want to draw a crumb c_{k+1} so that the posterior of x_0 given the k + 1 crumbs has a variance equal to σ²_{k+1} in the direction g. Using equation 5, the precision of the proposal given these crumbs is:

    Λ_{k+1} = Λ_k + W_{k+1}    (18)

If we multiply both sides of equation 18 by g^T on the left and g on the right, the left side is the conditional precision in the direction g:

    g^T Λ_{k+1} g = g^T ( Λ_k + W_{k+1} ) g    (19)

We would like to choose W_{k+1} so that this conditional precision is σ_{k+1}^{-2}, so we replace the left hand side of equation 19 with that:

    σ_{k+1}^{-2} = g^T ( Λ_k + W_{k+1} ) g    (20)

As will be discussed in section 4.3, computation will be particularly easy if we choose W_{k+1} to be a scaled copy of Λ_k with a rank-one update, so we choose W_{k+1} to be of the form:

    W_{k+1} = θ Λ_k + α g g^T    (21)

Here, α and θ are unknown scalars. θ controls how fast the precision as a whole increases. If we would like the variance in directions other than g to shrink by 9/10, for example, we choose θ = 1/9. Since θ is constant, there will be exponentially fast convergence of the proposal covariance in all directions, which allows quick recovery from overly diffuse initial crumb distributions. For this method, θ is not a critical choice; θ = 1 is reasonable. Substituting equation 21 into equation 20 gives:

    σ_{k+1}^{-2} = g^T ( Λ_k + θ Λ_k + α g g^T ) g    (22)

Noting that g^T g = 1, we solve for α, and set:

    α = max{ σ_{k+1}^{-2} − (1 + θ) g^T Λ_k g, 0 }    (23)

α is restricted to be non-negative to guarantee positive definiteness of the crumb covariance. (By choosing θ simultaneously, we could perhaps encounter this restriction less frequently, but we have not explored this.) Once we know α, we compute W_{k+1} using equation 21. The resulting crumb distribution is:

    c_{k+1} ~ N( x_0, W_{k+1}^{-1} )    (24)

After drawing such a crumb, we draw a proposal according to equations 4 to 6, accepting or rejecting depending on whether f(x_{k+1}) ≥ y_0, drawing more crumbs and adapting until a proposal is accepted.
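A sketch of this precision update (our illustration, using dense matrices; the efficient Cholesky form comes in section 4.3, and the function name is hypothetical):

    import numpy as np

    def next_crumb_precision(Lambda_k, g, sigma2, theta=1.0):
        """W_{k+1} = theta * Lambda_k + alpha * g g^T (eq. 21), with alpha
        chosen so the proposal's conditional precision along g is 1/sigma2."""
        # eq. 23: clip alpha at zero so W_{k+1} stays positive definite
        alpha = max(1.0 / sigma2 - (1.0 + theta) * (g @ Lambda_k @ g), 0.0)
        return theta * Lambda_k + alpha * np.outer(g, g)

The next crumb is then drawn as c_{k+1} ~ N(x_0, W_{k+1}^{-1}) per equation 24, for instance with the GaussianCrumbs sketch above.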
4.2 Estimating the Density at the Mode

We now modify the method to remove the restriction that the target distribution be unimodal and remove the requirement that we precompute the log density at the mode. Estimating M each time we update x_0, instead of precomputing it, allows M to take on values appropriate to local modes. Since the proposal distribution only approximates the slice even in the best of circumstances, it is not essential that the estimate of M be particularly good.

To estimate M, we initialize M to the slice level, log y_0, before drawing the first crumb. Then, every time we fit the parabola described by equation 7, we update M to be the maximum of the current value of M and the estimated peak of the parabola. As more crumbs are drawn, M becomes a better estimate of the local maximum. We could also use the values of the log density at rejected proposals and at the {u_k} to bound M, but if the log density is locally concave, the log densities at the peaks of the parabolas will always be larger than these values.

4.3 Efficient Computation of the Crumb Covariance

This section describes a method for using Cholesky factors of precision matrices to make implementation of the covariance-matching method of section 4.1 efficient. If implemented naively, the method of section 4.1 would use O(p³) operations when drawing a proposal with equations 4 and 6. We would like to avoid this. One way is to represent W_k and Λ_k by their upper-triangular Cholesky factors F_k and R_k, where F_k^T F_k = W_k and R_k^T R_k = Λ_k.

First, we must draw proposals efficiently. If z_1 and z_2 are p-vectors of standard normal variates, we can replace the crumb and proposal draws of equations 24 and 4 with:

    c_k = x_0 + F_k^{-1} z_1    (25)
    x_k = c̄_k + R_k^{-1} z_2    (26)

Since Cholesky factors are upper-triangular, evaluation of F_k^{-1} z_1 and R_k^{-1} z_2 by backward substitution takes O(p²) operations.

We must also update the Cholesky factors efficiently. We replace the updates of W_k and Λ_k in equations 21 and 18 with:

    F_{k+1} = chud( sqrt(θ) R_k, sqrt(α) g )    (27)
    R_{k+1} = chud( sqrt(1 + θ) R_k, sqrt(α) g )    (28)

Here, chud(R, v) is the Cholesky factor of R^T R + v v^T. The function name is an abbreviation for "Cholesky update." It can be computed with the LINPACK routine DCHUD, which uses Givens rotations to compute the update in O(p²) operations (Dongarra et al., 1979, ch. 10).

Finally, we would like to compute the proposal mean efficiently. We do this by keeping a running sum of the un-normalized crumb mean (the parenthesized expression in equation 6), which we will represent by c̄*_k. Define:

    c̄*_k = W_1 c_1 + ··· + W_k c_k    (29)
         = c̄*_{k−1} + W_k c_k    (30)
         = c̄*_{k−1} + F_k^T F_k c_k    (31)

Then, using forward and backward substitution, we can compute the normalized crumb mean, c̄_k, as:

    c̄_k = R_k^{-1} R_k^{-T} c̄*_k    (32)

This way, we can compute c̄_k in O(p²) operations and do not need to save all the crumbs and crumb covariances. With these changes, the resulting algorithm is numerically stable even with ill-conditioned target distributions. Each crumb and proposal draw takes O(p²) operations. Figure 4 shows pseudocode for this method.

Figure 4: Slice sampling with covariance matching (the adaptive algorithm of section 4).

    Initialize N, f, x_0 ∈ R^p, σ_c, and θ.
    X ← ( )
    repeat N times:
        M ← log f(x_0)
        e ← draw from Exponential(1)
        ỹ_0 ← M − e
        R ← σ_c^{-1} I
        F ← σ_c^{-1} I
        c̄* ← 0
        repeat until a proposal is accepted:
            z ← draw from N_p(0, I)
            c ← x_0 + F^{-1} z
            c̄* ← c̄* + F^T F c
            c̄ ← R^{-1} R^{-T} c̄*
            z ← draw from N_p(0, I)
            x ← c̄ + R^{-1} z
            ỹ ← log f(x)
            if ỹ ≥ ỹ_0: proposal accepted, break
            G ← ∇ log f(x)
            g ← G / ‖G‖
            δ ← ‖x − c‖
            u ← x + δ g
            ℓ_u ← log f(u)
            κ ← −2 δ^{-2} (ℓ_u − ỹ − δ ‖G‖)
            ℓ_{x,u} ← (1/2) ‖G‖² / κ + ỹ
            M ← max{ M, ℓ_{x,u} }
            σ² ← (2/3) (M − ỹ_0) / κ
            α ← max{ 0, σ^{-2} − (1 + θ) g^T R^T R g }
            F ← chud( sqrt(θ) R, sqrt(α) g )
            R ← chud( sqrt(1 + θ) R, sqrt(α) g )
        append x to X
        x_0 ← x
    return X

The variables are mostly the same as in the text, with a few exceptions. To avoid underflow, we set ỹ = log y and ỹ_0 = log y_0; see Neal (2003, §4) for discussion. The k subscript is dropped since there is no need to keep copies of the values from any but the most recent iteration.
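NumPy and SciPy do not provide a rank-one Cholesky update, so the sketch below implements chud(R, v) directly with plane rotations in O(p²) operations, in the spirit of the LINPACK DCHUD routine cited above (our illustration, not the paper's or LINPACK's code):

    import numpy as np

    def chud(R, v):
        """Return upper-triangular R1 with R1^T R1 = R^T R + v v^T."""
        R, v = R.astype(float).copy(), v.astype(float).copy()
        p = v.size
        for k in range(p):
            r = np.hypot(R[k, k], v[k])       # rotated diagonal element
            c, s = r / R[k, k], v[k] / R[k, k]
            R[k, k] = r
            if k + 1 < p:
                R[k, k+1:] = (R[k, k+1:] + s * v[k+1:]) / c
                v[k+1:] = c * v[k+1:] - s * R[k, k+1:]
        return R

    # Check against a direct factorization:
    A = np.array([[4.0, 1.0], [1.0, 3.0]])
    R = np.linalg.cholesky(A).T               # upper-triangular factor of A
    v = np.array([0.5, -1.0])
    R1 = chud(R, v)
    assert np.allclose(R1.T @ R1, A + np.outer(v, v))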
The variable ℓ_{x,u} in figure 4 indicates the peak of the parabolic cut through x and u; N indicates the number of samples to draw. The generated samples are stored in the ordered set X.

5 Second Method: Shrinking the Rank of the Crumb Covariance

The method of section 4 attempts to adapt the crumb distribution so that the proposal distribution matches the shape of the slice. However, it often cannot, due to positive-definiteness constraints, requiring the max{·, ·} operation in equation 23. Even when it can perform the adaptation, it may not be appropriate if the underlying distribution is not approximately Gaussian.

This section describes a different method, also based on the framework of section 3. Instead of attempting to match the conditional variance in the direction of the gradient, it just sets it to zero. This is reasonable in that the gradient at a proposal probably points in a direction where the variance is small, so it is more efficient to move in a different direction.

With this method, the crumb covariance is zero in some directions and spherically symmetric in the rest, so its simplest representation is the pair (w, J), where J is a matrix of orthonormal columns of directions in which the conditional variance is zero, and w² is the variance in the other directions. Define P(J, v) to be the function that returns the component of vector v orthogonal to the columns of J:

    P(J, v) = v − J J^T v   if J has at least one column
    P(J, v) = v             if J has no columns    (33)

For simplicity of computation, since the crumbs are located in a common subspace with x_0, this method will consider the origin to be at x_0 except when calling f(·), which is provided by the user, and when returning samples to the user. Each crumb is drawn by:

    c_k = σ_c P(J, z)   where z ~ N_p(0, I)    (34)

When the first crumb is drawn, J has no columns, so P(J, z) = z and the first crumb has the distribution specified in section 3.

Given k crumbs and the covariances of their distributions, we know that x_0 must lie in the intersection of the subspaces of their covariances. Since the subspace c_j is drawn from contains the subspace c_k is drawn from when k > j, this is equivalent to saying that x_0 must lie in the subspace of c_k's covariance, the orthogonal complement of J. So, the precision of the posterior for x_0 in the direction of columns of J is infinite. It is equal to k σ_c^{-2} in all other directions, since there are k crumbs, each with spherical variance equal to σ_c² in the subspace they were drawn in. The mean of the proposal distribution (with origin x_0) is the projection of:

    c̄ = ( σ_c^{-2} c_1 + ··· + σ_c^{-2} c_k ) / ( k σ_c^{-2} ) = k^{-1} ( c_1 + ··· + c_k )    (35)

Therefore, to draw a proposal, we draw a vector of standard normal variates, project it into the orthogonal complement of J, scale by σ_c / sqrt(k), and add c̄, also projected into the orthogonal complement of J. With the original origin, the proposal is:

    x_k = x_0 + P(J, c̄) + ( σ_c / sqrt(k) ) · P(J, z)   where z ~ N_p(0, I)    (36)

If the proposal is in the slice (that is, if f(x_k) ≥ y_0), we accept it. Otherwise, we adapt the crumb distribution. If J has p − 1 columns, we cannot add any more without the crumb covariance having a rank of zero, so we do not adapt in that case. Otherwise, we add a single new column in the direction of ∇ log f(x_k), projected into the directions not already spanned by J, which we denote by g*. Thus, the new value of J would be:

    J_{k+1} = [ J_k   g*/‖g*‖ ]    (37)
    where  g* = P( J_k, ∇ log f(x_k) )    (38)

To prevent meaningless adaptations, we only perform this operation when the angle between the gradient and its projection into the nullspace of J is less than 60°. Equivalently, we only adapt when:

    g*^T ∇ log f(x_k) / ( ‖g*‖ ‖∇ log f(x_k)‖ ) > 1/2    (39)

After possibly updating J, we draw another crumb and proposal, repeating until a proposal is accepted. The method of this section is summarized in figure 5.
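Figure 5 gives the paper's pseudocode; as a complement, here is a brief Python/NumPy sketch of the projection and adaptation steps (equations 33 and 36 through 39). It is our illustration, with J stored as a p-by-m array (initially np.zeros((p, 0))) and hypothetical function names:

    import numpy as np

    rng = np.random.default_rng(0)

    def project(J, v):
        """P(J, v): component of v orthogonal to the columns of J (eq. 33)."""
        return v if J.shape[1] == 0 else v - J @ (J.T @ v)

    def draw_proposal(x0, J, cbar, sigma_c, k):
        """Draw a proposal by eq. 36, shifting the origin back to x_0."""
        z = rng.standard_normal(x0.size)
        return x0 + project(J, cbar) + (sigma_c / np.sqrt(k)) * project(J, z)

    def maybe_extend(J, grad):
        """Add the projected gradient as a column of J (eqs. 37-38) when
        the angle condition of eq. 39 holds; otherwise leave J alone."""
        p = J.shape[0]
        if J.shape[1] >= p - 1:   # the crumb covariance would have rank zero
            return J
        g_star = project(J, grad)
        if g_star @ grad > 0.5 * np.linalg.norm(g_star) * np.linalg.norm(grad):
            J = np.hstack([J, (g_star / np.linalg.norm(g_star))[:, None]])
        return J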
Figure 5: Crumbs with shrinking-rank covariance (the adaptive algorithm of section 5).

    Initialize N, f, x_0 ∈ R^p, and σ_c.
    X ← ( )
    repeat N times:
        e ← draw from Exponential(1)
        ỹ_0 ← log f(x_0) − e
        J ← [ ]
        k ← 0
        c̄ ← 0
        repeat until a proposal is accepted:
            k ← k + 1
            z ← draw from N_p(0, I)
            c ← σ_c z
            c̄ ← k^{-1} ( (k − 1) c̄ + c )
            z ← draw from N_p(0, I)
            x ← x_0 + P( J, c̄ + (σ_c / sqrt(k)) z )
            if log f(x) ≥ ỹ_0: proposal accepted, break
            if J has fewer than p − 1 columns:
                g* ← P( J, ∇ log f(x) )
                if g*^T ∇ log f(x) > (1/2) ‖g*‖ ‖∇ log f(x)‖:
                    J ← [ J   g*/‖g*‖ ]
        append x to X
        x_0 ← x
    return X

This pseudocode differs from the description in section 5 in that the projection into the nullspace of J is delayed until drawing a proposal, reducing the number of matrix products computed.

A variation on the method (which we use in the implementation tested in section 7) shrinks the crumb standard deviation in the nonzero directions, σ_c, by a constant factor each time a crumb is drawn; section 7 uses 0.9. This results in exponentially falling proposal variance, which, as described in section 4.1, allows large initial crumb variances to be used.

6 Procedure for Evaluation

Figures 6 and 7 demonstrate the performance of the methods described in this document. Figure 6 demonstrates the performance of four samplers on a Gaussian distribution (to be described in section 7). It contains two graphs, one for each of two figures of merit. The top graph plots log-density function evaluations per independent sample against a tuning parameter. The bottom graph plots processor-seconds per independent sample against a tuning parameter. For both figures of merit, smaller is better. Each of the four panes in each graph contains data from a single sampler, with each point representing a run with a specific tuning parameter. The tuning parameter for each sampler has the same scale; the sampler initially attempts to take steps roughly that size.

Both figures of merit require us to determine the correlation length, the number of correlated samples equivalent to an independent sample. For the runs done here, an AR(1) model captures the necessary structure. For each parameter, we fit the following model:

    X_t = E(X_t) + π · X_{t−1} + a_t    (40)

where a_t is a noise process. Then, the number of samples equivalent to a single independent sample is:

    τ = var(a_t) / ( var(X_t) · (1 − π)² )    (41)

This formula is based on the effectiveSize function in CODA (Plummer et al., 2006), which uses the spectral approach of Heidelberger and Welch (1981). Wei (2006, pp. 276–278) has a more in-depth discussion of the spectrum of AR processes. To estimate a correlation length for a multivariate distribution, we take the largest estimated correlation length of each of its parameters. This is not valid in general, but is an acceptable approximation for this experiment.
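For an AR(1) process, var(a_t) = var(X_t)(1 − π²), so equation 41 reduces to τ = (1 + π)/(1 − π). The following Python/NumPy sketch (our illustration, not CODA's effectiveSize) estimates π by the lag-one autocorrelation and applies that formula:

    import numpy as np

    def correlation_length(x):
        """Estimate tau (eq. 41) for one parameter's chain via an AR(1) fit."""
        x = np.asarray(x, dtype=float)
        x = x - x.mean()
        pi = (x[1:] @ x[:-1]) / (x @ x)   # lag-one autocorrelation estimate
        return (1.0 + pi) / (1.0 - pi)

    # For a multivariate chain (samples in rows), the paper takes the largest
    # correlation length over the parameters:
    # tau = max(correlation_length(chain[:, j]) for j in range(chain.shape[1]))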
Most chains are displayed with a circle indicating an estimate of the figure of merit, with a line indicating a nominal 95% confidence interval. The intervals are based on normality of the increments of the AR(1) process, so they should be viewed as a lower bound on the uncertainty of the point estimates. Chains whose figures are plotted as question marks were estimated as having an effective sample size of less than four.

Figure 7 contains information on four different samplers. The panes of figure 7 are similar to those of figure 6, and each is labeled with the distribution and the sampler the chains in that pane come from. The columns of panes correspond to samplers; the rows of panes correspond to distributions. By reading across, one can see how different samplers perform on a given distribution. By reading down, one can see how the performance of a given sampler varies from distribution to distribution.

Every chain has 150,000 samples. In general, the chain length does not affect the results; we will point out exceptions to this.

7 Results of Evaluation

We simulated four samplers on four distributions for twelve different tuning parameters each. The samplers we considered are:

• Covariance-Matching: This is the method described in section 4. The tuning parameter is σ_c.

• Shrinking-Rank: This is the method described in section 5. The tuning parameter is σ_c.

• Non-Adaptive Crumbs: This is a non-adaptive variant of the general method of section 3. It is like the method of section 5, but never shrinks rank. However, like that method, it scales the crumb standard deviation down by 0.9 after each proposal. The tuning parameter is σ_c.

• Metropolis (with Trials): This method is a Metropolis sampler that uses trial runs to automatically determine a suitable Gaussian proposal distribution. The tuning parameter specifies the standard deviation of the proposal distribution for the first trial run.

The distributions we considered are:

• N_4(ρ = 0.999): This is a four-dimensional Gaussian with highly-correlated parameters. The marginal variances for each parameter are one; each parameter has a correlation of 0.999 with each other parameter. The covariance matrix has a condition number of about 2900 and is not diagonal.

• N_4(ρ = −0.3329): This is a four-dimensional Gaussian with negatively-correlated parameters. The marginal variances for each parameter are one, and each parameter has a correlation of −0.3329 with each other parameter. The covariance has a condition number of about 2900, like N_4(ρ = 0.999), but instead of one large eigenvalue and three small ones, this distribution has one small eigenvalue and three large ones.

• Eight Schools: This is a multilevel model in ten dimensions, consisting of eight group means and hyperparameters for their mean and variance. It comes from Gelman et al. (2004, pp. 138–145).

• Ten-Component Mixture: This is a ten-component Gaussian mixture in R^10.
Each mode is a spherically symmetric Gaussian with unit variance. The modes are uniformly distributed on a hypercube with edge-length ten.

[Figure 6: The performance of four MCMC samplers on N_4(ρ = 0.999). Top graph: density evaluations per independent sample; bottom graph: processor-seconds per independent sample; both plotted against the scale tuning parameter, with one pane per sampler (Covariance-Matching, Shrinking-Rank, Non-Adaptive Crumbs, Metropolis (with Trials)). See section 6 for a description of the graphs and section 7 for discussion.]

By comparing the top and bottom graphs of figure 6, we see that for N_4(ρ = 0.999), processor time and number of density evaluations tell the same story. The plots of function evaluations and the plots of processor time are nearly identical except for their vertical scales. This is true of the other distributions as well, so we do not repeat the processor-time plots for the others.

Figure 6 (and the identical first row of figure 7) shows the performance of the four methods on a highly-correlated four-dimensional Gaussian, N_4(ρ = 0.999). Both adaptive slice sampling methods perform well once the tuning parameter is at least the same order as the standard deviation of the target distribution. This hockey-stick performance curve is characteristic of slice sampling methods. The non-adaptive slice sampling method always takes steps of the order of the smallest eigenvalue once its tuning parameter is at least that large, so its performance is bad, but after this threshold, its performance does not depend much on the tuning parameter. The Metropolis sampler also fails to capture the shape of the distribution, a result that depends more on chain length. Had the chain been longer, the preliminary runs the sampler uses to estimate a proposal distribution might have worked better, leading to average performance comparable to the adaptive slice samplers.

[Figure 7: The performance of four MCMC samplers on four distributions: density evaluations per independent sample against the scale tuning parameter, one pane per sampler–distribution pair (columns: samplers; rows: distributions). See section 6 for a description of the graph and section 7 for discussion.]

The second row of figure 7 shows sampler performance on a similar but negatively correlated four-dimensional Gaussian, N_4(ρ = −0.3329). The adaptive slice sampling methods continue to perform well on this distribution, and the non-adaptive sampler improves somewhat.
The Metropolis sampler improves a great deal; on this target distribution, it is able to choose a reasonable proposal distribution, so it performs comparably to the adaptive slice sampling methods.

The third row of figure 7 shows sampler performance on Eight Schools. The minimum threshold appears again for both adaptive slice samplers as well as the non-adaptive slice sampler. Since the condition number of the covariance of this distribution is only about seven, adaptivity is not as important, though slice sampling's robustness to improper tuning parameters remains important. The adaptive Metropolis method again fails to identify a reasonable proposal distribution for small tuning parameters, and fails to generate any proposal distribution at all for large ones. This is partially a reflection on this particular implementation, which only tries preliminary runs with proposal distribution standard deviations within four orders of magnitude of the tuning parameter.

The bottom row of figure 7, which shows performance on Ten-Component Mixture, has a similar pattern. The results are more erratic since the distribution has multiple, moderately-separated modes, and none of the samplers are designed to perform well on multimodal distributions.

8 Discussion

The adaptive slice sampling methods of sections 4 and 5 generally perform at least as well as non-adaptive slice sampling methods and Metropolis. Slice sampling in general tends to be more robust to imperfect choice of tuning parameters than Metropolis. Preliminary chains are usually unnecessary, avoiding the hassle of manual management of these runs or the idiosyncratic performance of automatic evaluation of the runs. The main disadvantage of the adaptive slice samplers relative to Metropolis and non-adaptive slice sampling is that they require the log density to have an analytically computable gradient, though this is a standard requirement in numerical optimization, and experience in that domain has shown that computing the gradient is often straightforward.

The two adaptive slice sampling methods tend to perform similarly to each other. The shrinking-rank method usually performs slightly better, but this advantage can be mitigated by making the approximation log f(u_k) ≈ log y_0. There is no theoretical justification for this, but it cuts the number of log density evaluations by half with negligible performance cost, making the performance of the two adaptive methods indistinguishable. The shrinking-rank method is simpler, though, and requires only O(min(k, p) · p) operations to draw the kth crumb and proposal, slightly better than the O(p²) operations needed by the covariance-matching method.

Like most variations of multivariate slice sampling, both the non-adaptive and adaptive methods described here do not work well for target distributions in spaces of more than a few dozen dimensions. The variation in log density of samples increases with dimension, but slice sampling takes steps in the log density of only order one each iteration. So, in high-dimensional spaces, samples tend to be highly correlated.

Due to this poor scaling with dimensionality, the methods of this document have limited usefulness on their own. They may be useful in highly-correlated low-dimensional problems, though, and can be used to take steps in highly-correlated low-dimensional subspaces as part of larger sampling schemes.
We hope to address this limitation in future work, possibly with polar slice sampling (Roberts and Rosenthal, 2002).

An R implementation of these methods is available at http://www.cs.utoronto.ca/~radford.

References

Dongarra, J. J., Moler, C. B., Bunch, J. R., and Stewart, G. W. (1979). LINPACK Users' Guide. Society for Industrial and Applied Mathematics.

Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2004). Bayesian Data Analysis, Second Edition. Chapman and Hall/CRC.

Grimm, J. and Grimm, W. (2008). Hansel and Gretel. In Grimm's Fairy Tales. Gutenberg Project.

Heidelberger, P. and Welch, P. D. (1981). A spectral method for confidence interval generation and run length control in simulations. Communications of the ACM, 24(4):233–245.

Neal, R. M. (2003). Slice sampling. Annals of Statistics, 31:705–767.

Nocedal, J. and Wright, S. J. (2006). Numerical Optimization, Second Edition. Springer.

Plummer, M., Best, N., Cowles, K., and Vines, K. (2006). CODA: Convergence diagnosis and output analysis for MCMC. R News, 6(1):7–11.

Roberts, G. O. and Rosenthal, J. S. (2002). The polar slice sampler. Stochastic Models, 18(2):257–280.

Wei, W. W. S. (2006). Time Series Analysis: Univariate and Multivariate Methods, Second Edition. Pearson/Addison Wesley.
