Bayesian optimization using sequential Monte Carlo
Authors: Romain Benassi, Julien Bect, Emmanuel Vazquez
SUPELEC, Gif-sur-Yvette, France

Abstract. We consider the problem of optimizing a real-valued continuous function $f$ using a Bayesian approach, where the evaluations of $f$ are chosen sequentially by combining prior information about $f$, which is described by a random process model, and past evaluation results. The main difficulty with this approach is to be able to compute the posterior distributions of quantities of interest which are used to choose evaluation points. In this article, we decide to use a Sequential Monte Carlo (SMC) approach.

1 Overview of the proposed contribution

We consider the problem of finding the global maxima of a function $f : \mathbb{X} \to \mathbb{R}$, where $\mathbb{X} \subset \mathbb{R}^d$ is assumed bounded, using the expected improvement (EI) criterion [1, 3]. Many examples in the literature show that the EI algorithm is particularly interesting for dealing with the optimization of functions which are expensive to evaluate, as is often the case in design and analysis of computer experiments [2]. However, going from the general framework expressed in [1] to an actual computer implementation is a difficult issue.

The main idea of an EI-based algorithm is a Bayesian one: $f$ is viewed as a sample path of a random process $\xi$ defined on $\mathbb{R}^d$. For the sake of tractability, it is generally assumed that $\xi$ has a Gaussian process distribution conditionally on a parameter $\theta \in \Theta \subseteq \mathbb{R}^s$, which tunes the mean and covariance functions of the process. Then, given a prior distribution $\pi_0$ on $\theta$ and some initial evaluation results $\xi(X_1), \ldots, \xi(X_{n_0})$ at $X_1, \ldots, X_{n_0}$, an (idealized) EI algorithm constructs a sequence of evaluation points $X_{n_0+1}, X_{n_0+2}, \ldots$
such that, for each $n \ge n_0$,

$$X_{n+1} = \operatorname*{argmax}_{x \in \mathbb{X}} \; \bar\rho_n(x) := \int_{\theta \in \Theta} \rho_n(x; \theta) \, \mathrm{d}\pi_n(\theta), \qquad (1)$$

where $\pi_n$ stands for the posterior distribution of $\theta$, conditional on the $\sigma$-algebra $\mathcal{F}_n$ generated by $X_1, \xi(X_1), \ldots, X_n, \xi(X_n)$, and

$$\rho_n(x; \theta) := \mathbb{E}_{n,\theta}\bigl( (\xi(X_{n+1}) - M_n)_+ \mid X_{n+1} = x \bigr)$$

is the EI at $x$ given $\theta$, with $M_n = \xi(X_1) \vee \cdots \vee \xi(X_n)$ and $\mathbb{E}_{n,\theta}$ the conditional expectation given $\mathcal{F}_n$ and $\theta$. In practice, the computation of $\rho_n$ is easily carried out (see [3]), but the answers to the following two questions will probably have a direct impact on the performance and applicability of a particular implementation: a) How to deal with the integral in $\bar\rho_n$? b) How to deal with the maximization of $\bar\rho_n$ at each step?

We can safely say that most implementations, including the popular EGO algorithm [3], deal with the first issue by using an empirical Bayes (or plug-in) approach, which consists in approximating $\pi_n$ by a Dirac mass at the maximum likelihood estimate of $\theta$. A plug-in approach using maximum a posteriori estimation has been used in [6]; fully Bayesian methods are more difficult to implement (see [4] and references therein). Regarding the optimization of $\bar\rho_n$ at each step, several strategies have been proposed (see, e.g., [3, 5, 7, 10]).

This article addresses both questions simultaneously, using a sequential Monte Carlo (SMC) approach [8, 9] and taking particular care to control the numerical complexity of the algorithm. The main ideas are the following. First, as in [5], a weighted sample $T_n = \{ (\theta_{n,i}, w_{n,i}) \in \Theta \times \mathbb{R}, \ 1 \le i \le I \}$ from $\pi_n$ is used to approximate $\bar\rho_n$; that is, $\sum_{i=1}^{I} w_{n,i} \, \rho_n(x; \theta_{n,i}) \to_I \bar\rho_n(x)$.
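Under a Gaussian posterior, $\rho_n(x;\theta)$ has the well-known closed form $(m - M_n)\,\Phi(u) + s\,\varphi(u)$ with $u = (m - M_n)/s$, where $m$ and $s$ are the posterior mean and standard deviation of $\xi(x)$ given $\mathcal{F}_n$ and $\theta$. A minimal sketch of this formula and of the weighted-sample approximation of $\bar\rho_n$ follows; the function names are ours, and the GP posterior $(m, s)$ at each particle is assumed to be computed elsewhere:

```python
import math

def norm_pdf(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def norm_cdf(u):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def expected_improvement(mean, sd, current_max):
    """Closed-form EI rho_n(x; theta) when the posterior of xi(x) given
    F_n and theta is N(mean, sd^2); current_max plays the role of M_n."""
    if sd <= 0.0:
        return max(mean - current_max, 0.0)
    u = (mean - current_max) / sd
    return (mean - current_max) * norm_cdf(u) + sd * norm_pdf(u)

def averaged_ei(posteriors, weights, current_max):
    """Particle approximation of the integrated criterion bar-rho_n(x):
    a weighted sum of per-theta EI values, one term per particle theta_{n,i}.
    `posteriors` holds the (mean, sd) pair of the GP posterior at x for
    each particle (an assumption: computing them is model-specific)."""
    return sum(w * expected_improvement(m, s, current_max)
               for (m, s), w in zip(posteriors, weights))
```

With normalized weights, `averaged_ei` is exactly the sum $\sum_i w_{n,i}\,\rho_n(x;\theta_{n,i})$ that the algorithm maximizes in place of the intractable integral in (1).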
Besides, at each step $n$, we attach to each $\theta_{n,i}$ a (small) population of candidate evaluation points $\{ x_{n,i,j}, \ 1 \le j \le J \}$, which is expected to cover promising regions for that particular value of $\theta$ and such that $\max_{i,j} \bar\rho_n(x_{n,i,j}) \approx \max_x \bar\rho_n(x)$.

2 Algorithm and results

At each step $n \ge n_0$ of the algorithm, our objective is to construct a set of weighted particles

$$G_n = \bigl\{ (\gamma_{n,i,j}, w'_{n,i,j}), \ \gamma_{n,i,j} = (\theta_{n,i}, x_{n,i,j}) \in \Theta \times \mathbb{X}, \ w'_{n,i,j} \in \mathbb{R}, \ 1 \le i \le I, \ 1 \le j \le J \bigr\} \qquad (2)$$

so that $\sum_{i,j} w'_{n,i,j} \, \delta_{\gamma_{n,i,j}} \to_{I,J} \pi'_n$, with

$$\mathrm{d}\pi'_n(\gamma) = \tilde g_n(x \mid \theta) \, \mathrm{d}\lambda(x) \, \mathrm{d}\pi_n(\theta), \quad x \in \mathbb{X}, \ \theta \in \Theta, \ \gamma = (\theta, x),$$

where $\lambda$ denotes the Lebesgue measure, $\tilde g_n(x \mid \theta) = g_n(x \mid \theta) / c_n(\theta)$, $g_n(x \mid \theta)$ is a criterion that reflects the interest of evaluating at $x$ (given $\theta$ and past evaluation results), and $c_n(\theta) = \int_{\mathbb{X}} g_n(x \mid \theta) \, \mathrm{d}x$ is a normalizing term. For instance, a relevant choice for $g_n$ is to consider the probability that $\xi$ exceeds $M_n$ at $x$, at step $n$. (Note that we consider fewer $\theta$s than $x$s in $G_n$ to keep the numerical complexity of the algorithm low.)

To initialize the algorithm, generate a weighted sample $T_{n_0} = \{ (\theta_{n_0,i}, w_{n_0,i}), \ 1 \le i \le I \}$ from the distribution $\pi_{n_0}$, using for instance importance sampling with $\pi_0$ as the instrumental distribution, and pick a density $q_{n_0}$ over $\mathbb{X}$ (the uniform density, for example). Then, for each $n \ge n_0$:

Step 1: demarginalize. Using $T_n$ and $q_n$, construct a weighted sample $G_n$ of the form (2), with $x_{n,i,j} \overset{\text{iid}}{\sim} q_n$,

$$w'_{n,i,j} = \frac{w_{n,i}}{c_{n,i}} \, \frac{g_n(x_{n,i,j} \mid \theta_{n,i})}{q_n(x_{n,i,j})}, \quad \text{and} \quad c_{n,i} = \sum_{j'=1}^{J} \frac{g_n(x_{n,i,j'} \mid \theta_{n,i})}{q_n(x_{n,i,j'})}.$$

Step 2: evaluate. Evaluate $\xi$ at $X_{n+1} = \operatorname*{argmax}_{i,j} \sum_{i'=1}^{I} w_{n,i'} \, \rho_n(x_{n,i,j}; \theta_{n,i'})$.
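Steps 1 and 2 above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the callables `q_sample`, `q_pdf`, `g`, and `rho` stand for the instrumental density $q_n$, the criterion $g_n(\cdot \mid \theta)$, and the per-$\theta$ EI $\rho_n$, all of which are model-specific and assumed given:

```python
import random

def demarginalize(theta_particles, q_sample, q_pdf, g, J):
    """Step 1 (sketch): for each weighted parameter particle (theta_i, w_i),
    draw J candidate points x_{i,j} ~ q_n and attach the self-normalized
    importance weights w'_{i,j} = (w_i / c_i) * g(x_{i,j}|theta_i) / q(x_{i,j})."""
    particles = []
    for theta, w in theta_particles:
        xs = [q_sample() for _ in range(J)]
        ratios = [g(x, theta) / q_pdf(x) for x in xs]
        c = sum(ratios)  # c_{n,i}: per-theta normalization over the J points
        particles.append([(theta, x, w * r / c) for x, r in zip(xs, ratios)])
    return particles

def choose_next_point(particles, theta_particles, rho):
    """Step 2 (sketch): among all candidate points x_{i,j}, select the one
    maximizing the particle approximation of bar-rho_n, i.e. the weighted
    sum of rho_n(x; theta_{i'}) over the parameter particles."""
    candidates = [x for group in particles for (_, x, _) in group]
    return max(candidates,
               key=lambda x: sum(w * rho(x, theta) for theta, w in theta_particles))
```

Note that only the $I \times J$ sampled candidates are scored, so the per-step cost stays bounded regardless of the dimension of $\mathbb{X}$.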
Step 3: reweight/resample/move. Construct $T_{n+1}$ from $T_n$ as in [8]: reweight the $\theta_{n,i}$s using $w_{n+1,i} \propto \frac{\pi_{n+1}(\theta_{n,i})}{\pi_n(\theta_{n,i})} \, w_{n,i}$, resample (e.g., by multinomial resampling), and move the $\theta_{n,i}$s to get the $\theta_{n+1,i}$s using an independent Metropolis-Hastings kernel.

Step 4: forge $q_{n+1}$. Form an estimate $q_{n+1}$ of the second marginal of $\pi'_n$ from the weighted sample $\mathcal{X}_n = \{ (x_{n,i,j}, w'_{n,i,j}), \ 1 \le i \le I, \ 1 \le j \le J \}$. Hopefully, such a choice of $q_{n+1}$ will provide a good instrumental density for the next demarginalization step. Any (parametric or non-parametric) density estimator can be used, as long as it is easy to sample from; in this paper, a tree-based histogram estimator is used.

Nota bene: when possible, some components of $\theta$ are integrated out analytically in (1) instead of being sampled from; see [4].

Experiments. Preliminary numerical results, showing the relevance of a fully Bayesian approach with respect to the empirical Bayes approach, have been provided in [4]. The scope of these results, however, was limited by a rather simplistic implementation (involving a quadrature approximation for $\bar\rho_n$ and a non-adaptive grid-based optimization for the choice of $X_{n+1}$). We present here some results that demonstrate the capability of our new SMC-based algorithm to overcome these limitations.

The experimental setup is as follows. We compare our SMC-based algorithm, with $I = J = 100$, to an EI algorithm in which: 1) we fix $\theta$ (at a "good" value obtained using maximum likelihood estimation on a large dataset); 2) $X_{n+1}$ is obtained by exhaustive search on a fixed LHS of size $I \times J$. In both cases, we consider a Gaussian process $\xi$ with a constant but unknown mean function (with a uniform distribution on $\mathbb{R}$) and an anisotropic Matérn covariance function with regularity parameter $\nu = 5/2$.
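The reweight/resample/move pattern of Step 3 can be sketched as below, in the spirit of Chopin's SMC for static models [8]. The callables are placeholders for model-specific quantities: `incremental(theta)` stands for the ratio $\pi_{n+1}(\theta)/\pi_n(\theta)$, while `mh_propose` and `accept_prob` define an independent Metropolis-Hastings move targeting $\pi_{n+1}$:

```python
import random

def reweight_resample_move(particles, incremental, mh_propose, accept_prob):
    """One SMC step (sketch): particles is a list of (theta, weight) pairs
    targeting pi_n; the returned list targets pi_{n+1} with uniform weights."""
    thetas, weights = zip(*particles)
    # Reweight: w_{n+1,i} proportional to [pi_{n+1}/pi_n](theta_i) * w_{n,i}
    new_w = [w * incremental(t) for t, w in zip(thetas, weights)]
    total = sum(new_w)
    new_w = [w / total for w in new_w]
    # Resample: multinomial resampling according to the new weights
    resampled = random.choices(thetas, weights=new_w, k=len(thetas))
    # Move: one independent MH step per particle to fight degeneracy;
    # after resampling, the weights are reset to uniform
    moved = []
    for t in resampled:
        prop = mh_propose()
        moved.append(prop if random.random() < accept_prob(t, prop) else t)
    return [(t, 1.0 / len(moved)) for t in moved]
```

In the paper's setting, a single such step per function evaluation keeps the $\theta$-sample tracking the slowly evolving posterior $\pi_n$ at modest cost.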
Moreover, for the SMC approach, the variance parameter of the Matérn covariance function is integrated out using a Jeffreys prior, and the range parameters are endowed with independent lognormal priors.

Results. Figures 1(a) and 1(b) show the average error over 100 runs of both algorithms, for the Branin function ($d = 2$) and the log-transformed Hartmann 6 function ($d = 6$). For the Branin function, the reference algorithm performs better on the first iterations, probably thanks to the "hand-tuned" parameters, but soon stalls due to its non-adaptive search strategy. Our SMC-based algorithm, however, quickly catches up and eventually overtakes the reference algorithm. On the Hartmann 6 function, we observe that the reference algorithm always lags behind our new algorithm. We have been able to find results of this kind for other test functions. These findings are promising and need to be further investigated in a more systematic large-scale benchmark study.

Fig. 1. A comparison of the average error to the maximum (100 runs): (a) Branin function (dimension 2); (b) Hartmann 6 function (dimension 6). [Plots of $\log(\max f - M_n)$ versus the number of function evaluations, for "ref EI" and "EI+SMC".]

References

1. J. Mockus, V. Tiesis, and A. Zilinskas. The application of Bayesian methods for seeking the extremum. In L. Dixon and G. Szego, editors, Towards Global Optimization, volume 2, pages 117-129. Elsevier, 1978.
2. T. J. Santner, B. J. Williams, and W. I. Notz. The Design and Analysis of Computer Experiments. Springer, 2003.
3. D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. J. Global Optim., 13(4):455-492, 1998.
4. R. Benassi, J. Bect, and E. Vazquez. Robust Gaussian process-based global optimization using a fully Bayesian expected improvement criterion. In LION 5, online proceedings, Roma, Italy, 2011.
5. R. Gramacy and N. Polson. Particle learning of Gaussian process models for sequential design and optimization. J. Comput. Graph. Stat., 20(1):102-118, 2011.
6. D. J. Lizotte, R. Greiner, and D. Schuurmans. An experimental methodology for response surface optimization methods. To appear in J. Global Optim., 38 pages, 2011.
7. R. Bardenet and B. Kégl. Surrogating the surrogate: accelerating Gaussian-process-based global optimization with a mixture cross-entropy algorithm. In ICML 2010, proceedings, Haifa, Israel, 2010.
8. N. Chopin. A sequential particle filter method for static models. Biometrika, 89(3):539-552, 2002.
9. P. Del Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo samplers. J. R. Stat. Soc. B, 68(3):411-436, 2006.
10. D. Ginsbourger and O. Roustant. DiceOptim: Kriging-based optimization for computer experiments. R package version 1.2, 2011.