Importance Tempering

Reading time: 7 minutes

📝 Original Info

  • Title: Importance Tempering
  • ArXiv ID: 0707.4242
  • Date: 2008-11-03
  • Authors: Robert B. Gramacy, Richard Samworth, Ruth King

📝 Abstract

Simulated tempering (ST) is an established Markov chain Monte Carlo (MCMC) method for sampling from a multimodal density $\pi(\theta)$. Typically, ST involves introducing an auxiliary variable $k$ taking values in a finite subset of $[0,1]$ and indexing a set of tempered distributions, say $\pi_k(\theta) \propto \pi(\theta)^k$. In this case, small values of $k$ encourage better mixing, but samples from $\pi$ are only obtained when the joint chain for $(\theta,k)$ reaches $k=1$. However, the entire chain can be used to estimate expectations under $\pi$ of functions of interest, provided that importance sampling (IS) weights are calculated. Unfortunately this method, which we call importance tempering (IT), can disappoint. This is partly because the most immediately obvious implementation is naïve and can lead to high variance estimators. We derive a new optimal method for combining multiple IS estimators and prove that the resulting estimator has a highly desirable property related to the notion of effective sample size. We briefly report on the success of the optimal combination in two modelling scenarios requiring reversible-jump MCMC, where the naïve approach fails.


📄 Full Content

Markov chain Monte Carlo (MCMC) algorithms, in particular Metropolis-Hastings (MH) and Gibbs sampling (GS), are by now the most widely used methods for simulation-based inference in Bayesian statistics. The beauty of MCMC is its simplicity. Very little user input or expertise is required in order to establish a Markov chain whose stationary distribution is proportional to $\pi(\theta)$, for $\theta \in \Theta \subseteq \mathbb{R}^d$. As long as the chain is irreducible, the theory of Markov chains guarantees that sample averages computed from a realisation of the chain will converge in an appropriate sense to their expectations under $\pi$. However, difficulties can arise when $\pi$ has isolated modes, between which the Markov chain moves only rarely. In such cases convergence is slow, meaning that often infeasibly large sample sizes are needed to obtain accurate estimates.
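As a concrete illustration of this slow mixing, here is a minimal random-walk Metropolis sampler for a toy bimodal target, an equal mixture of N(−4, 1) and N(4, 1). The target and all settings are illustrative choices of ours, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_pi(theta):
    # Toy bimodal target: equal mixture of N(-4, 1) and N(4, 1), up to a constant
    return np.logaddexp(-0.5 * (theta + 4) ** 2, -0.5 * (theta - 4) ** 2)

def metropolis_hastings(log_target, theta0, n_iter, step=1.0):
    """Random-walk Metropolis: propose theta' ~ N(theta, step^2) and
    accept with probability min(1, pi(theta') / pi(theta))."""
    theta = theta0
    samples = np.empty(n_iter)
    for t in range(n_iter):
        prop = theta + step * rng.normal()
        if np.log(rng.uniform()) < log_target(prop) - log_target(theta):
            theta = prop
        samples[t] = theta
    return samples

samples = metropolis_hastings(log_pi, theta0=-4.0, n_iter=20000)
# With well-separated modes, the chain tends to stay near one mode for very
# long stretches -- exactly the slow mixing between isolated modes described above.
```

With a proposal step of 1 and modes 8 units apart, crossings through the deep trough at 0 are rare, so the sample average converges very slowly to the true mean of 0.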

New MCMC algorithms have been proposed to improve mixing. Two related algorithms are Metropolis-coupled MCMC (MC³) (Geyer, 1991; Hukushima and Nemoto, 1996) and simulated tempering (ST) (Marinari and Parisi, 1992; Geyer and Thompson, 1995). Both are closely related to the optimisation technique of simulated annealing (SA) (Kirkpatrick et al., 1983). SA works with a set of tempered distributions $\pi_k(\theta)$ indexed by an inverse-temperature parameter $k \in [0, \infty)$. One popular form of tempering is called “powering up”, where $\pi_k(\theta) \propto \pi(\theta)^k$. Small values of $k$ have the effect of flattening/widening the peaks and raising the troughs of $\pi_k$ relative to $\pi$.
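A quick numerical sketch of the “powering up” effect, using a toy bimodal target of our own choosing: since $\log \pi_k = k \log \pi$ (up to a constant), the log-density gap between a mode and the trough scales linearly in $k$, so small $k$ flattens the landscape.

```python
import numpy as np

def log_pi(theta):
    # Toy bimodal target: equal mixture of N(-4, 1) and N(4, 1), up to a constant
    return np.logaddexp(-0.5 * (theta + 4) ** 2, -0.5 * (theta - 4) ** 2)

def log_pi_k(theta, k):
    # "Powering up": pi_k(theta) ∝ pi(theta)^k, i.e. k * log pi(theta)
    return k * log_pi(theta)

# Log-density gap between a mode (theta = 4) and the trough (theta = 0):
gaps = []
for k in (1.0, 0.5, 0.1):
    gap = log_pi_k(4.0, k) - log_pi_k(0.0, k)
    gaps.append(gap)
    print(f"k={k}: mode-to-trough log density gap = {gap:.2f}")
# The gap shrinks proportionally to k, so a sampler targeting pi_k with small k
# crosses between modes far more easily.
```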

In MC³ and ST we define a temperature ladder $1 = k_1 > k_2 > \dots > k_m \ge 0$, and call the $k_i$ its rungs. Both MC³ and ST involve simulating from the set of $m$ tempered densities $\pi_{k_1}, \dots, \pi_{k_m}$. MC³ runs $m$ parallel MCMC chains, one at each temperature, and regularly proposes swaps of states at adjacent rungs $k_i$ and $k_{i+1}$. Usually, samples are only saved from the “cold distribution” $\pi_{k_1}$. In contrast, ST works with a “pseudo-prior” $p(k_i)$ and uses a single chain to sample from the joint distribution, which is proportional to $\pi_k(\theta)p(k)$. Again, it is only at iterations $t$ for which $k^{(t)} = 1$ that the corresponding realisation of $\theta^{(t)}$ is retained. ST has an advantage over MC³ in that only one copy of the process $\{\theta^{(t)} : t = 1, \dots, T\}$ is needed, rather than $m$, so the chain uses less storage and also has better mixing (Geyer, 1991). The disadvantage is that it needs a good choice of pseudo-prior. For further comparison and review, see Jasra et al. (2007a) and Iba (2001).
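The ST mechanics described above can be sketched as follows. This is a simplified, illustrative implementation, not the paper's algorithm: the pseudo-prior is left uniform (in practice it should approximate the unknown normalising constants so the chain visits all rungs), and rung proposals that fall off the ladder are simply rejected.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_pi(theta):
    # Toy bimodal target: equal mixture of N(-4, 1) and N(4, 1), up to a constant
    return np.logaddexp(-0.5 * (theta + 4) ** 2, -0.5 * (theta - 4) ** 2)

# Temperature ladder 1 = k_1 > ... > k_m and a pseudo-prior p(k_i).
ladder = np.array([1.0, 0.7, 0.4, 0.1])
log_pseudo = np.zeros(len(ladder))  # log p(k_i); uniform is a naive placeholder

def simulated_tempering(n_iter, step=1.0):
    theta, i = 0.0, 0          # i indexes the current rung; start cold
    out = []
    for _ in range(n_iter):
        # (a) MH move for theta targeting pi_{k_i} ∝ pi^{k_i}
        prop = theta + step * rng.normal()
        if np.log(rng.uniform()) < ladder[i] * (log_pi(prop) - log_pi(theta)):
            theta = prop
        # (b) propose a move to an adjacent rung (off-ladder proposals rejected)
        j = i + rng.choice([-1, 1])
        if 0 <= j < len(ladder):
            log_alpha = ((ladder[j] - ladder[i]) * log_pi(theta)
                         + log_pseudo[j] - log_pseudo[i])
            if np.log(rng.uniform()) < log_alpha:
                i = j
        out.append((theta, ladder[i]))
    return out

chain = simulated_tempering(5000)
cold = [th for th, k in chain if k == 1.0]  # the samples plain ST would retain
```

The `cold` list is typically a small fraction of the whole run, which is precisely the inefficiency that importance tempering aims to remove.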

Both MC³ and ST suffer from inefficiency because they discard all samples from $\pi_k$ for $k \neq 1$. The discarded samples could be used to estimate expectations under $\pi$ if they were given appropriate importance sampling (IS) weights. For an inclusive review of IS and related methods see Liu (2001, Chapter 2). Moreover, it may be the case that an IS estimator constructed with samples from a tempered distribution has smaller variance than one based on a sample of the same size from $\pi$. As a simple motivating example, let $\pi(\theta) = N(\theta \mid \mu, \sigma^2)$, and consider estimating $\mu = E_\pi(\theta)$ by IS from a tempered distribution $\pi_k(\theta) \propto \pi(\theta)^k$. A straightforward calculation shows that the value of $k$ which minimises the variance of the IS estimator is

$$
k^* = \begin{cases}
\dfrac{1}{2}\left(3 + \dfrac{2\sigma^2}{\mu^2} - \sqrt{\dfrac{4\sigma^4}{\mu^4} + \dfrac{8\sigma^2}{\mu^2} + 1}\,\right), & \text{if } \mu \neq 0,\\[8pt]
\dfrac{1}{2}, & \text{otherwise.}
\end{cases} \qquad (1)
$$

Note that $k^* \in (1/2, 1)$ for all $\mu$ and $\sigma^2$. Moreover, one can compute (numerically) $k^- = k^-(\sigma/\mu) < k^*$ such that for all $k \in (k^-, 1)$, the variance of the IS estimator $\hat{\mu}_k$ based on samples from $\pi_k$ is smaller than that of one based on a sample of the same size from $\pi$.
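This variance trade-off is easy to check by simulation. The sketch below assumes $\mu = \sigma = 1$ (our choice of illustrative values) and uses the exact density ratio $\pi/\pi_k$ as the IS weight, which is available in closed form for this Gaussian toy example since $\pi_k$ is $N(\mu, \sigma^2/k)$:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 1.0, 1.0   # illustrative target N(mu, sigma^2)

def per_sample_is_var(k, n=200_000):
    """Monte Carlo estimate of the per-sample variance of the IS estimator
    of mu that draws from pi_k ∝ pi^k, i.e. N(mu, sigma^2/k), and weights
    by the exact density ratio w = pi / pi_k."""
    theta = rng.normal(mu, sigma / np.sqrt(k), size=n)
    t = theta - mu
    w = np.exp((k - 1.0) * t**2 / (2 * sigma**2)) / np.sqrt(k)
    x = theta * w                  # one-sample IS estimate of mu
    return np.mean(x**2) - mu**2   # Var(x), since E[x] = mu exactly

v_cold = per_sample_is_var(1.0)    # ordinary sampling from pi itself
v_warm = per_sample_is_var(0.7)    # "lukewarm" proposal, k in (1/2, 1)
v_hot  = per_sample_is_var(0.05)   # very flat proposal, k near 0
# Expect v_warm < v_cold < v_hot: a lukewarm proposal beats direct sampling,
# while a very hot one inflates the variance.
```

With these values the lukewarm estimator has roughly 15% lower variance than direct sampling from $\pi$, while the very flat proposal is several times worse, matching the trade-off described in the text.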

However, $\mathrm{Var}(\hat{\mu}_k) \to \infty$ as $k \to 0$ for all $\mu$ and $\sigma^2$. Therefore, there is a trade-off in the choice of tempered IS proposals. On the one hand, low inverse-temperatures $k$ in ST can guard against missing modes of $\pi$ with large support by encouraging better mixing between modes, but can yield very inefficient IS estimators overall. On the other hand, “lukewarm” temperatures $k$, especially $k \in (1/2, 1)$, can yield more efficient estimators within modes than those obtained from samples at $k = 1$. Jennison (1993) was the first to suggest using a single tempered distribution as a proposal in IS, and Neal (1996, 2001, 2005) has since written several papers combining IS and tempering. Indeed, in the discussion of the 1996 paper on tempered transitions, Neal writes “simulated tempering allows data associated with $p_i$ other than $p_0$ [the cold distribution] to be used to calculate expectations with respect to … $p_0$ (using an importance sampling estimator)”. It is this natural extension that we call importance tempering (IT), with IMC³ defined similarly. Given the work of the above-mentioned authors, and the fact that calculating importance weights is relatively trivial, it may be surprising that successful IT and IMC³ applications have yet to be published. Liu (2001) comes close in proposing to augment ST with dynamic weighting (Wong and Liang, 1997) and in applying the Wang-Landau algorithm (Atchadé and Liu, 2007) to ST. This paper addresses why the straightforward method
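The “immediately obvious” implementation criticised in the abstract can be sketched as follows. This is our hedged reading of the naive pooled estimator, with function names of our own invention: all ST samples are weighted by $w_t \propto \pi(\theta_t)^{1-k_t}$ and self-normalised, ignoring the per-rung normalising constants $Z_{k_t}$, which is one source of the high variance the paper addresses.

```python
import numpy as np

def naive_it_estimate(thetas, ks, log_pi, f=lambda th: th):
    """Naive importance-tempering estimate of E_pi[f(theta)] from an ST chain
    {(theta_t, k_t)}: pool every sample with self-normalised weights
    w_t ∝ pi(theta_t)^(1 - k_t)."""
    thetas, ks = np.asarray(thetas, dtype=float), np.asarray(ks, dtype=float)
    log_w = (1.0 - ks) * log_pi(thetas)
    log_w -= log_w.max()           # stabilise before exponentiating
    w = np.exp(log_w)
    return np.sum(w * f(thetas)) / np.sum(w)

# Sanity check: when every k_t = 1 the weights are constant,
# so the estimate reduces to a plain sample average.
print(naive_it_estimate([0.0, 1.0, 2.0], [1.0, 1.0, 1.0],
                        lambda th: -0.5 * th**2))  # → 1.0
```

Samples drawn at small $k_t$ receive weights that can vary over many orders of magnitude, so a handful of them can dominate the sum; this is the high-variance failure mode that motivates the paper's optimal combination of per-rung IS estimators.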

…(Full text truncated)…


Reference

This content is AI-processed based on ArXiv data.
