Bayesian optimization using sequential Monte Carlo

Reading time: 6 minutes

📝 Original Info

  • Title: Bayesian optimization using sequential Monte Carlo
  • ArXiv ID: 1111.4802
  • Date: 2011-11-22
  • Authors: not provided in the source data. (The author names could not be confirmed from the original text; please verify and add this information.)

📝 Abstract

We consider the problem of optimizing a real-valued continuous function $f$ using a Bayesian approach, where the evaluations of $f$ are chosen sequentially by combining prior information about $f$, which is described by a random process model, and past evaluation results. The main difficulty with this approach is to be able to compute the posterior distributions of quantities of interest which are used to choose evaluation points. In this article, we decide to use a Sequential Monte Carlo (SMC) approach.

💡 Deep Analysis

📄 Full Content

We consider the problem of finding the global maximum of a function $f : \mathbb{X} \to \mathbb{R}$, where $\mathbb{X} \subset \mathbb{R}^d$ is assumed bounded, using the expected improvement (EI) criterion [1,3]. Many examples in the literature show that the EI algorithm is particularly interesting for dealing with the optimization of functions which are expensive to evaluate, as is often the case in design and analysis of computer experiments [2]. However, going from the general framework expressed in [1] to an actual computer implementation is a difficult issue.

The main idea of an EI-based algorithm is a Bayesian one: $f$ is viewed as a sample path of a random process $\xi$ defined on $\mathbb{R}^d$. For the sake of tractability, it is generally assumed that $\xi$ has a Gaussian process distribution conditionally on a parameter $\theta \in \Theta \subseteq \mathbb{R}^s$, which tunes the mean and covariance functions of the process. Then, given a prior distribution $\pi_0$ on $\theta$ and some initial evaluation results $\xi(X_1), \ldots, \xi(X_{n_0})$ at $X_1, \ldots, X_{n_0}$, an (idealized) EI algorithm constructs a sequence of evaluation points $X_{n_0+1}, X_{n_0+2}, \ldots$ such that, for each $n \ge n_0$,

$$X_{n+1} = \operatorname*{argmax}_{x \in \mathbb{X}} \rho_n(x), \qquad \rho_n(x) = \int_\Theta \rho_n(x; \theta)\, \pi_n(\mathrm{d}\theta), \tag{1}$$

where $\pi_n$ stands for the posterior distribution of $\theta$, conditional on the $\sigma$-algebra $\mathcal{F}_n$ generated by $X_1, \xi(X_1), \ldots, X_n, \xi(X_n)$, and $\rho_n(x; \theta) = \mathbb{E}_{n,\theta}\left[(\xi(x) - M_n)_+\right]$ is the expected improvement at $x$, with $M_n = \max(\xi(X_1), \ldots, \xi(X_n))$ and $\mathbb{E}_{n,\theta}$ the conditional expectation given $\mathcal{F}_n$ and $\theta$. In practice, the computation of $\rho_n(x; \theta)$ is easily carried out (see [3]), but the answers to the following two questions will probably have a direct impact on the performance and applicability of a particular implementation: a) how to deal with the integral in $\rho_n$? b) how to deal with the maximization of $\rho_n$ at each step?
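Under the Gaussian process model, the inner criterion $\rho_n(x; \theta)$ has the classical closed form of the expected improvement (see [3]). Below is a minimal sketch, assuming the GP posterior mean `m` and standard deviation `s` at `x` have already been computed for the given $\theta$ (the helper name is ours, not from the paper):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(m, s, M_n):
    """Closed-form EI: rho_n(x; theta) = E[(xi(x) - M_n)_+] under a
    Gaussian posterior with mean m and standard deviation s at x."""
    m, s = np.asarray(m, dtype=float), np.asarray(s, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        z = (m - M_n) / s
        ei = (m - M_n) * norm.cdf(z) + s * norm.pdf(z)
    # Degenerate case s == 0: the improvement is known exactly.
    return np.where(s > 0, ei, np.maximum(m - M_n, 0.0))
```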

We can safely say that most implementations, including the popular EGO algorithm [3], deal with the first issue by using an empirical Bayes (or plug-in) approach, which consists in approximating $\pi_n$ by a Dirac mass at the maximum likelihood estimate of $\theta$. A plug-in approach using maximum a posteriori estimation has been used in [6]; fully Bayesian methods are more difficult to implement (see [4] and references therein). Regarding the optimization of $\rho_n$ at each step, several strategies have been proposed (see, e.g., [3,5,7,10]).
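In equation form, the plug-in shortcut collapses the posterior in (1) to a point mass at the maximum likelihood estimate (the notation $\hat{\theta}_n^{\mathrm{ML}}$ is ours):

```latex
% Empirical Bayes (plug-in): the posterior is replaced by a Dirac mass,
% so the integral in (1) reduces to a single evaluation of rho_n(x; .).
\pi_n \;\approx\; \delta_{\hat{\theta}_n^{\mathrm{ML}}}
\quad\Longrightarrow\quad
\rho_n(x) \;\approx\; \rho_n\bigl(x;\, \hat{\theta}_n^{\mathrm{ML}}\bigr).
```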

This article addresses both questions simultaneously, using a sequential Monte Carlo (SMC) approach [8,9] and taking particular care to control the numerical complexity of the algorithm. The main ideas are the following. First, as in [5], a weighted sample $\mathcal{T}_n = \{(\theta_{n,i}, w_{n,i}),\ 1 \le i \le I\}$ drawn from the posterior $\pi_n$ is used to approximate the integral in (1), so that $\rho_n(x) \approx \sum_{i=1}^{I} w_{n,i}\, \rho_n(x; \theta_{n,i})$. Besides, at each step $n$, we attach to each $\theta_{n,i}$ a (small) population of candidate evaluation points $\{x_{n,i,j},\ 1 \le j \le J\}$ which is expected to cover promising regions for that particular value of $\theta$, and such that $\max_{i,j} \rho_n(x_{n,i,j}) \approx \max_x \rho_n(x)$.
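Concretely, the posterior integral in (1) becomes a finite mixture over the $\theta$-particles. A minimal sketch, assuming a `predict` helper that returns the GP posterior mean and standard deviation for a given $\theta$ (both names are illustrative, not from the paper):

```python
import numpy as np
from scipy.stats import norm

def mixture_ei(x, thetas, weights, predict, M_n):
    """Approximate rho_n(x) = int rho_n(x; theta) pi_n(dtheta) by the
    particle sum  sum_i w_{n,i} * rho_n(x; theta_{n,i}).

    `predict(x, theta) -> (m, s)` is an assumed helper returning the GP
    posterior mean and standard deviation at x for hyperparameters theta."""
    total = 0.0
    for theta_i, w_i in zip(thetas, weights):
        m, s = predict(x, theta_i)
        if s > 0:
            z = (m - M_n) / s
            ei = (m - M_n) * norm.cdf(z) + s * norm.pdf(z)  # closed-form EI
        else:
            ei = max(m - M_n, 0.0)  # degenerate predictive distribution
        total += w_i * ei
    return total
```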

At each step $n \ge n_0$ of the algorithm, our objective is to construct a set of weighted particles

$$\mathcal{G}_n = \left\{\left((\theta_{n,i}, x_{n,i,j}),\ w'_{n,i,j}\right),\ 1 \le i \le I,\ 1 \le j \le J\right\} \tag{2}$$

so that $\sum_{i,j} w'_{n,i,j}\, \delta_{(\theta_{n,i},\, x_{n,i,j})}$ converges to $\pi'_n$ as $I, J \to \infty$, with

$$\pi'_n(\mathrm{d}\theta, \mathrm{d}x) = \pi_n(\mathrm{d}\theta)\, \frac{g_n(x \mid \theta)}{c_n(\theta)}\, \lambda(\mathrm{d}x),$$

where $\lambda$ denotes the Lebesgue measure, $g_n(x \mid \theta)$ is a criterion that reflects the interest of evaluating at $x$ (given $\theta$ and past evaluation results), and $c_n(\theta) = \int_{\mathbb{X}} g_n(x \mid \theta)\, \mathrm{d}x$ is a normalizing term. For instance, a relevant choice for $g_n$ is the probability, at step $n$, that $\xi$ exceeds $M_n$ at $x$. (Note that we consider fewer $\theta$s than $x$s in $\mathcal{G}_n$ to keep the numerical complexity of the algorithm low.)
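With a Gaussian predictive distribution, this suggested choice of $g_n$ is just a normal tail probability. A sketch, again with the assumed `predict` helper:

```python
from scipy.stats import norm

def g_exceedance(x, theta, predict, M_n):
    """Probability of exceedance g_n(x | theta) = P_{n,theta}(xi(x) > M_n),
    computed from the Gaussian predictive distribution at x.

    `predict(x, theta) -> (m, s)` is an assumed helper (not from the paper)."""
    m, s = predict(x, theta)
    if s <= 0:
        return float(m > M_n)        # degenerate predictive distribution
    return norm.sf((M_n - m) / s)    # 1 - Phi((M_n - m) / s)
```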

To initialize the algorithm, generate a weighted sample $\mathcal{T}_{n_0} = \{(\theta_{n_0,i}, w_{n_0,i}),\ 1 \le i \le I\}$ from the distribution $\pi_{n_0}$, using for instance importance sampling with $\pi_0$ as the instrumental distribution, and pick a density $q_{n_0}$ over $\mathbb{X}$ (the uniform density, for example). Then, for each $n \ge n_0$:

Step 1: demarginalize. Using $\mathcal{T}_n$ and $q_n$, construct a weighted sample $\mathcal{G}_n$ of the form (2), with $x_{n,i,j} \overset{\text{iid}}{\sim} q_n$, $w'_{n,i,j} = w_{n,i}\, \frac{g_n(x_{n,i,j} \mid \theta_{n,i})}{q_n(x_{n,i,j})\, c_{n,i}}$, and $c_{n,i} = \sum_{j=1}^{J} \frac{g_n(x_{n,i,j} \mid \theta_{n,i})}{q_n(x_{n,i,j})}$, so that $\sum_j w'_{n,i,j} = w_{n,i}$.
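A sketch of this step, with $q_n$ and $g_n$ passed in as callables (all names illustrative):

```python
import numpy as np

def demarginalize(thetas, w, q_sample, q_pdf, g, J, rng):
    """Step 1: attach J candidate points to each theta-particle.  The weights
    are w'_{i,j} = w_i * g(x_{i,j} | theta_i) / (q(x_{i,j}) * c_i), with c_i
    chosen so that sum_j w'_{i,j} = w_i (self-normalized importance sampling)."""
    X, W = [], []
    for theta_i, w_i in zip(thetas, w):
        x_ij = [q_sample(rng) for _ in range(J)]         # x_{n,i,j} iid ~ q_n
        ratios = np.array([g(x, theta_i) / q_pdf(x) for x in x_ij])
        c_i = ratios.sum()                               # normalizing constant c_{n,i}
        W.append(w_i * ratios / c_i if c_i > 0 else np.full(J, w_i / J))
        X.append(x_ij)
    return X, np.array(W)   # candidate points and weights w'_{n,i,j}
```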

Step 2: evaluate. Evaluate $\xi$ at $X_{n+1} = \operatorname*{argmax}_{x_{n,i,j}}\ \sum_{i'=1}^{I} w_{n,i'}\, \rho_n(x_{n,i,j}; \theta_{n,i'})$.
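This step reduces to an argmax over the finite candidate set. A sketch, with `rho(x, theta)` an assumed helper computing $\rho_n(x; \theta)$:

```python
import numpy as np

def select_next_point(X_cand, thetas, w, rho):
    """Step 2: pick X_{n+1} as the candidate x_{n,i,j} maximizing the particle
    approximation  sum_{i'} w_{n,i'} * rho_n(x_{n,i,j}; theta_{n,i'})."""
    best_val, best_x = -np.inf, None
    for row in X_cand:            # one row of J candidates per theta-particle
        for x in row:
            val = sum(w_ip * rho(x, th_ip) for th_ip, w_ip in zip(thetas, w))
            if val > best_val:
                best_val, best_x = val, x
    return best_x
```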

Step 3: reweight/resample/move. Construct $\mathcal{T}_{n+1}$ from $\mathcal{T}_n$ as in [8]: reweight the $\theta_{n,i}$s using $w_{n+1,i} \propto \frac{\pi_{n+1}(\theta_{n,i})}{\pi_n(\theta_{n,i})}\, w_{n,i}$, resample (e.g., by multinomial resampling), and move the $\theta_{n,i}$s to get the $\theta_{n+1,i}$s using an independent Metropolis-Hastings kernel.
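A sketch of this recursion in the style of [8], with the (possibly unnormalized) posterior densities and the independent proposal passed in as callables (names illustrative):

```python
import numpy as np

def reweight_resample_move(thetas, w, pi_new, pi_old, prop_sample, prop_pdf, rng):
    """Step 3: reweight by pi_{n+1}/pi_n, multinomial-resample, then move each
    particle with one independent Metropolis-Hastings step targeting pi_{n+1}."""
    w_new = np.asarray(w) * np.array([pi_new(t) / pi_old(t) for t in thetas])
    w_new /= w_new.sum()
    idx = rng.choice(len(thetas), size=len(thetas), p=w_new)  # multinomial resampling
    thetas = [thetas[k] for k in idx]
    moved = []
    for t in thetas:
        t_prop = prop_sample(rng)
        # Independent MH acceptance ratio: target pi_{n+1}, proposal prop_pdf.
        a = (pi_new(t_prop) * prop_pdf(t)) / (pi_new(t) * prop_pdf(t_prop))
        moved.append(t_prop if rng.random() < a else t)
    # Equal weights after resampling.
    return moved, np.full(len(thetas), 1.0 / len(thetas))
```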

Step 4: forge $q_{n+1}$. Form an estimate $q_{n+1}$ of the second marginal of $\pi'_n$ from the weighted sample $\{(x_{n,i,j},\ w'_{n,i,j}),\ 1 \le i \le I,\ 1 \le j \le J\}$.

Hopefully, such a choice of $q_{n+1}$ will provide a good instrumental density for the next demarginalization step. Any (parametric or non-parametric) density estimator can be used, as long as it is easy to sample from; in this paper, a tree-based histogram estimator is used.
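The paper uses a tree-based histogram estimator; as a simpler stand-in, here is a sketch of a weighted histogram density on $[0, 1]$ that is easy to sample from (one-dimensional for readability, and not the estimator used in the paper):

```python
import numpy as np

def fit_histogram_density(x, w, bins=20):
    """Step 4 (simplified): weighted histogram estimate of the x-marginal of
    pi'_n on [0, 1], returned as (sampler, pdf) callables."""
    x, w = np.ravel(x), np.ravel(w)
    hist, edges = np.histogram(x, bins=bins, range=(0.0, 1.0), weights=w)
    hist = hist.astype(float) + 1e-12      # keep the density positive everywhere
    widths = np.diff(edges)
    hist /= (hist * widths).sum()          # normalize to a proper density
    probs = hist * widths                  # bin probabilities

    def sampler(rng):
        k = rng.choice(bins, p=probs)      # pick a bin, then draw uniformly in it
        return rng.uniform(edges[k], edges[k + 1])

    def pdf(t):
        k = min(int(np.clip(t, 0.0, 1.0 - 1e-12) * bins), bins - 1)
        return hist[k]

    return sampler, pdf
```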

Nota bene: when possible, some components of $\theta$ are integrated out analytically in (1) instead of being sampled; see [4].

Experiments. Preliminary numerical results, showing the relevance of a fully Bayesian approach with respect to the empirical Bayes approach, have been provided in [4]. The scope of these results, however, was limited by a rather simplistic implementation (involving a quadrature approximation for $\rho_n$ and a non-adaptive grid-based optimization for the choice of $X_{n+1}$). We present here some results that demonstrate the capability of


