Free energy computations by minimization of Kullback-Leibler divergence: an efficient adaptive biasing potential method for sparse representations


Authors: I. Bilionis, P. S. Koutsourelakis

I. Bilionis — Center for Applied Mathematics, Cornell University, Ithaca, NY 14853, USA
P. S. Koutsourelakis ∗ — School of Civil and Environmental Engineering & Center for Applied Mathematics, Cornell University, Ithaca, NY 14853, USA

Abstract

The present paper proposes an adaptive biasing potential for the computation of free energy landscapes. It is motivated by statistical learning arguments and unifies the tasks of biasing the molecular dynamics to escape free energy wells and estimating the free energy function under the same objective. It offers rigorous convergence diagnostics even though history-dependent, non-Markovian dynamics are employed. It makes use of a greedy optimization scheme in order to obtain sparse representations of the free energy function, which can be particularly useful in multidimensional cases. It employs embarrassingly parallelizable sampling schemes that are based on adaptive Sequential Monte Carlo and can be readily coupled with legacy molecular dynamics simulators. The sequential nature of the learning and sampling scheme enables the efficient calculation of free energy functions parametrized by the temperature. The characteristics and capabilities of the proposed method are demonstrated in three numerical examples.

Preprint submitted to Journal of Computational Physics, 25 October 2018

Key words: free energy computations, adaptive biasing potential, Sequential Monte Carlo, atomistic simulations, statistical learning

1 Introduction

Free energy is a central concept in thermodynamics and in the study of several systems in biology, chemistry and physics [8].
It represents a rigorous way to coarse-grain systems consisting of very large numbers of atomistic degrees of freedom, to probe states not accessible experimentally, and to characterize global changes as well as investigate relative stabilities. In most applications, a brute-force computation based on sampling the atomistic positions is impractical or infeasible, as the free energy barriers to overcome are so large that the system remains trapped in metastable free energy sets [40,42,8,56].

Equilibrium techniques for computing free energy surfaces, such as Thermodynamic Integration [29] or Adaptive Integration [49,21], require the simulation of very long atomistic trajectories in order to achieve equilibrium and lack convergence diagnostics. Techniques based on non-equilibrium path sampling [26,27,23,25] lack adaptivity and require the user to specify a particular path in the reaction coordinate space connecting two energetically important free energy regions, which can be a non-trivial task [20]. Furthermore, sampling along these paths correctly might necessitate advanced and quite involved techniques [10]. More recently proposed adaptive biasing potential [2,55,32,1,18] and adaptive biasing force [11,12,41,53,24] techniques are capable of dynamically utilizing information obtained from the atomistic trajectories to bias the current dynamics in order to facilitate the escape from metastable sets [35]. They are able to automatically discover important regions of the reaction coordinate space.

∗ Corresponding Author. Tel: 607-254-5441. Email addresses: ib227@cornell.edu (I. Bilionis), pk285@cornell.edu (P. S. Koutsourelakis).
Since they rely on history-dependent, non-Markovian dynamics, it is not a priori clear whether, and in which sense, the system reaches a stationary state, although some work has been done along these lines in [4] for Langevin-type systems and in [39,35].

We propose an adaptive biasing potential technique where the two tasks of biasing the dynamics and estimating the free energy landscape are unified under the same objective of minimizing the Kullback-Leibler divergence between appropriately selected distributions on the extended space that includes atomic coordinates and the collective variables [37,38]. This framework provides a natural way for selecting the basis functions used in the approximation of the free energy and obtaining sparse representations, which is critical when multi-dimensional collective variables are used. It allows the analyst to utilize and correct any prior information on the free energy landscape and provides an efficient manner of obtaining good estimates at various temperatures. The scheme proposed is embarrassingly parallelizable and relies on adaptive Sequential Monte Carlo procedures which enable efficient sampling from the high-dimensional and potentially multi-modal distributions of interest.

2 Methodology - A statistical learning approach for adaptively calculating free energies

For clarity of the presentation, we will first introduce our method for the so-called alchemical case and generalize it later for the reaction coordinate case. Consider a molecular system with generalized coordinates q ∈ M ⊂ R^N following a Boltzmann-like distribution which in turn depends on some parameters z ∈ D ⊂ R^d:

    p(q | z) ∝ exp(−β V(q; z))    (1)

where V(q; z) is the potential energy of the system and β plays the role of inverse temperature.
The free energy A(z) is defined, up to an additive constant, by:

    A(z) = −β^{−1} log ∫ exp(−β V(q; z)) dq    (2)

Our goal is to compute the function A(z) over the whole domain D. Let Â(z; θ) be an estimate of A(z) parametrized by θ ∈ Θ ⊂ R^K. We adopt a statistical perspective of learning A(z) from simulation data. A popular approach to carrying out regression tasks and functional approximations relies on kernel models [28]. Kernel regression models have proven successful in high-dimensional scenarios where d is of the order of 10 or 100 [51,52]. The unknown function is selected from a Reproducing Kernel Hilbert Space (RKHS) H_K induced by a positive semi-definite kernel K(·,·). We adopt representations with respect to a kernel function K(·,·) [28]:

    Â(z; θ) = Σ_{j=1}^{K} θ_j K_j(z, z_j),   z ∈ D    (3)

where z_j are points in D which are selected as described in the sequel. In order to fix the additive constant, we select a point z_0 ∈ D such that Â(z_0; θ) = 0. ¹

In the relevant literature, different types of kernel functions have been used, such as thin plate splines, multiquadrics, or Gaussians. While all these functions can be employed in the framework presented, we focus our discussion here on Gaussian kernels, which also have an intuitive parametrization with regard to the scale of variability of Â, as quantified by the bandwidth parameters τ_j = {τ_{j,l}}_{l=1}^{d} in each dimension:

    K_j(z) = K(z, z_j; τ_j) = exp{ −Σ_{l=1}^{d} τ_{j,l} (z_l − z_{j,l})² }    (4)

Gaussian kernels in the context of free energy approximations have also been used in [32,37,18].

¹ This is always possible by changing the kernels in Equation (3) to K′_j(z, z_j) = K_j(z) − K_j(z_0, z_j).
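As a concrete illustration of the representation in Equations (3)-(4), the following minimal sketch (function names and the constant-fixing convention of footnote 1 are our own rendering, not the authors' code) evaluates the kernel expansion Â(z; θ) for anisotropic Gaussian kernels:

```python
import numpy as np

def kernel(z, zj, tau):
    """Anisotropic Gaussian kernel of Equation (4):
    K(z, z_j; tau_j) = exp(-sum_l tau_{j,l} (z_l - z_{j,l})^2)."""
    z, zj, tau = (np.asarray(a, dtype=float) for a in (z, zj, tau))
    return np.exp(-np.sum(tau * (z - zj) ** 2, axis=-1))

def A_hat(z, theta, centers, bandwidths, z0=None):
    """Free energy estimate of Equation (3); if a reference point z0 is
    given, the additive constant is fixed so that A_hat(z0) = 0, as in
    footnote 1 (kernels shifted by their value at z0)."""
    val = sum(th * kernel(z, zj, tau)
              for th, zj, tau in zip(theta, centers, bandwidths))
    if z0 is not None:
        val -= sum(th * kernel(z0, zj, tau)
                   for th, zj, tau in zip(theta, centers, bandwidths))
    return val
```

With `z0` supplied, `A_hat(z0, ...)` returns exactly zero, matching the convention Â(z_0; θ) = 0.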
We define a joint probability distribution on the generalized coordinates q and the parameters z as follows:

    p(q, z | θ) = (1/Z(θ)) 1_D(z) e^{−β(V(q,z) − Â(z;θ))}    (5)

where 1_D(z) is the indicator function on D and Z(θ) is the normalization constant, i.e.:

    Z(θ) = ∫ 1_D(z) e^{−β(V(q,z) − Â(z;θ))} dq dz    (6)

It is noted that the first-order partial derivatives of the log of the normalization function give rise to the expectation of the respective kernel:

    ∂ log Z/∂θ_j = (1/Z(θ)) ∂Z/∂θ_j = β E_{p(z|θ)}[K_j(z)]    (7)

whereas the second-order derivatives produce the covariance between the kernels:

    ∂² log Z/(∂θ_j ∂θ_l) = −(1/Z²(θ)) (∂Z/∂θ_j)(∂Z/∂θ_l) + (1/Z(θ)) ∂²Z/(∂θ_j ∂θ_l)
                         = β² E_{p(z|θ)}[ (K_j(z) − E_{p(z|θ)}[K_j(z)]) (K_l(z) − E_{p(z|θ)}[K_l(z)]) ]
                         = β² Cov_{p(z|θ)}[K_j, K_l]    (8)

The expectations in the two equations above involve the unknown marginal density with respect to the parameters z ∈ D:

    p(z | θ) = ∫ p(q, z | θ) dq = (1/Z(θ)) 1_D(z) e^{−β(A(z) − Â(z;θ))}    (9)

which depends on the unknown free energy A(z) of Equation (2). The key property of p(z|θ) is that it reduces to the uniform distribution for z ∈ D if and only if the free energy estimate is exact, i.e. Â(z;θ) = A(z), z ∈ D. If π(z) = 1_D(z)/|D| is the uniform density on D (whose volume is denoted by |D|), then a natural strategy for estimating A(z) is to minimize a distance metric between π(z) and p(z|θ) in Equation (9) above.
To that end, we propose employing the Kullback-Leibler (KL) divergence KL(π(z) ‖ p(z|θ)) [50]:

    KL(π ‖ p) = ∫ π(z) log [π(z)/p(z|θ)] dz
              = ∫ π(z) log π(z) dz − ∫ π(z) log p(z|θ) dz
              = −log|D| − ∫ π(z) log p(z|θ) dz ≥ 0    (10)

The latter is not a metric in the mathematical sense, but it is frequently used as a measure of the distance between two probability distributions. It is always non-negative and becomes zero if and only if π(z) ≡ p(z|θ), or equivalently Â(z;θ) = A(z), z ∈ D. ²

The aforementioned formulation offers a clear strategy for estimating the free energy by minimizing the following form with respect to θ:

    I(θ) = −∫ π(z) log p(z|θ) dz    (11)

Since the KL divergence is always non-negative, the formulation above provides a lower bound on the objective function I(θ):

    I(θ) ≥ log|D|    (12)

which can be readily calculated and used to monitor convergence as well as the quality of the approximation obtained. Even though I(θ) depends on the unknown free energy A(z) (from Equation (9)):

    I(θ) = −∫ π(z) log p(z|θ) dz = β ∫ π(z) (A(z) − Â(z;θ)) dz + log Z(θ)    (13)

its partial derivatives J(θ) = ∂I(θ)/∂θ do not, i.e. from Equation (7):

    J_j(θ) = ∂I(θ)/∂θ_j = −β E_{π(z)}[∂Â/∂θ_j] + ∂ log Z/∂θ_j
           = −β ( E_{π(z)}[K_j(z)] − E_{p(z|θ)}[K_j(z)] )    (14)

where E_{π(z)}[·] denotes an expectation with respect to π(z).

² Of interest are free-energy differences, and therefore perturbations of A(z) or Â(z;θ) by a constant are ignored.
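Equation (14) says each gradient component is a difference of kernel expectations: one under the uniform target π (computable by quadrature over D) and one under p(z|θ), which must be estimated from samples. A minimal one-dimensional sketch (our own illustration, anticipating the weighted-particle approximation of section 2.3; the grid quadrature and function names are assumptions, not the paper's code):

```python
import numpy as np

def gradient_estimate(beta, centers, bandwidths, z_particles, weights, z_grid):
    """Noisy estimate of Equation (14):
    J_j = -beta * ( E_pi[K_j] - E_{p(z|theta)}[K_j] ).
    E_pi[.] is computed on a dense grid over a 1-d domain D (uniform
    density); the second expectation is a weighted particle average."""
    W = np.asarray(weights, dtype=float)
    W = W / W.sum()                      # normalized importance weights
    zp = np.asarray(z_particles, dtype=float)
    J = np.empty(len(centers))
    for j, (zj, tau) in enumerate(zip(centers, bandwidths)):
        Kj = lambda z: np.exp(-tau * (z - zj) ** 2)   # 1-d Gaussian kernel
        E_pi = Kj(z_grid).mean()                      # quadrature over D
        E_p = np.sum(W * Kj(zp))                      # particle estimate
        J[j] = -beta * (E_pi - E_p)
    return J
```

As a sanity check, when the particles themselves are uniform over D with equal weights, the two expectations coincide and the gradient vanishes, consistent with p(z|θ) = π(z) being the minimizer.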
It is important to note that, according to Equation (8), the Hessian of the objective function H(θ) = ∂²I(θ)/(∂θ ∂θ^T) is proportional to the covariance between the kernels, i.e.:

    ∂²I/(∂θ_j ∂θ_l) = ∂² log Z(θ)/(∂θ_j ∂θ_l) = β² Cov_{p(z|θ)}[K_j, K_l]    (15)

Hence the objective function is convex with respect to θ and there is a unique minimum.

Furthermore, the approximation of the free energy Â(z;θ) biases the potential of p(q,z|θ) (Equation (5)) and allows the system to overcome free energy barriers [39]. As in [18], no binning is needed and the bias potential is non-local, providing information about the free energy landscape not only at the states visited but in their neighborhood as well. In contrast to other adaptive schemes, the proposed formulation connects the problems of estimating the free energy landscape and steering the atomistic dynamics beyond metastable wells under a unified umbrella, and provides a clear convergence criterion [35].

From an algorithmic point of view, the proposed strategy poses two problems. The first involves the selection of the kernels to be used in the expansion of Equation (3). This is critical to the sparseness of the representation obtained, particularly for multidimensional z. To that end, a greedy selection strategy is discussed in section 2.2 which progressively adds kernels (i.e. increases the cardinality K of the expansion in Equation (3)) as needed. The second problem involves the optimization of the objective function I(θ), which depends on the unknown free energy A(z) and the intractable partition function Z(θ) (Equation (13)). An obvious approach is gradient descent, which is discussed in detail in section 2.1. This requires the computation of expectations with respect to p(z|θ) (Equation (14)).
The intractability of p(z|θ) necessitates the use of Monte Carlo sampling, which must be carried out in the expanded space with respect to the joint density p(q,z|θ) (Equation (5)). This should nevertheless be able to capture multiple modes in the high-dimensional state space consisting of atomic degrees of freedom q and parameters z. To that end, we propose performing this step by using non-equilibrium path sampling techniques based on adaptive Sequential Monte Carlo schemes, discussed in section 2.3. Similar schemes for creating system replicas in parallel have been employed in [41,35]. We discuss a novel adaptive version that retains previous advantages while providing accurate estimates at reduced computational effort. These estimates can be readily updated as θ changes after each optimization step. A discussion of each of the aforementioned algorithmic modules is contained in the ensuing subsections. The steps of the scheme proposed can be found in Algorithm 2.

2.1 Optimization with noisy gradients

We propose employing a gradient descent scheme in order to determine θ, although more involved procedures such as Improved Iterative Scaling [3,16] or noisy conjugate gradients [46] can also be employed. Second-order (quasi-)Newton techniques are also possible, although the unavoidable Monte Carlo noise in the computation of the Hessian (i.e. the covariance in Equation (15)) can destroy its positive definiteness.

Let θ_K denote the vector of kernel amplitudes (Equation (3)) when K such kernels are used. Let also θ_{K,m} denote the estimate of θ_K after m iterations of the gradient descent algorithm. Then at the (m+1)th iteration, the following update equation could be used:

    θ_{K,m+1} = θ_{K,m} − λ J(θ_{K,m})    (16)

where λ is the learning rate.
Since only a noisy Monte Carlo estimate of the gradients J(θ) (Equation (14)) is available, it is anticipated that the noise could impede convergence. For that purpose, we propose employing a stochastic approximation variant of the Robbins & Monro scheme [43,7]. Rather than increasing the simulation size in order to reduce the variance, we compute a weighted average of the gradient's estimates at the current and previous iterations. By employing a decreasing sequence of weights, information from the earlier iterations gets discarded gradually and more emphasis is placed on the recent iterations. As shown in [17], this method converges with a fixed sample size. In particular, if Ĵ(θ_{K,m}) denotes the Monte Carlo estimate of the gradient (the details of this estimator are discussed in section 2.3) at the m-th iteration, then we calculate:

    J̃_m = (1 − η_m) J̃_{m−1} + η_m Ĵ(θ_{K,m})    (17)

and, rather than Equation (16), we update θ as follows:

    θ_{K,m+1} = θ_{K,m} − λ J̃_m    (18)

where the sequence of weights {η_m} is such that Σ_{m=1}^∞ η_m = ∞ and Σ_{m=1}^∞ η_m² < ∞. ³

2.2 Kernel selection - Sparse representation of the free energy landscape

A critical objective in the proposed framework relates to the sparseness of the free energy approximation, i.e. the cardinality K of the expansion in Equation (3). This is important in at least two ways. Firstly, because sparser representations can more clearly expose salient features of the free energy landscape and, as a consequence, of the atomistic ensemble considered. Secondly, because they reduce the number of parameters θ with respect to which the optimization problem needs to be solved (section 2.1).

³ A family of such sequences that was used in this work is η_m = η m^{−p} with 1/2 < p ≤ 1.
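To make the smoothed update of subsection 2.1 (Equations (17)-(18)) concrete, here is a minimal sketch; the function name, default learning rate and the specific weight sequence η_m = η₀ m^{−p} from footnote 3 are our illustrative choices, not the authors' implementation:

```python
import numpy as np

def rm_update(theta, J_tilde_prev, J_hat, m, lam=0.1, eta0=1.0, p=0.75):
    """One iteration of Equations (17)-(18): a decaying-weight average of
    noisy gradient estimates (eta_m = eta0 * m**-p with 1/2 < p <= 1),
    followed by a descent step with learning rate lam."""
    eta_m = eta0 * m ** (-p)
    J_tilde = (1.0 - eta_m) * np.asarray(J_tilde_prev) + eta_m * np.asarray(J_hat)
    theta_new = np.asarray(theta) - lam * J_tilde
    return theta_new, J_tilde
```

At m = 1 the average reduces to the current noisy estimate; at later iterations earlier gradients are gradually discounted, which is what allows convergence at a fixed Monte Carlo sample size.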
Given a vocabulary of potentially overcomplete basis functions and a prescribed K, the problem amounts to identifying those kernels (Equation (3)) that best approximate the true free energy surface, i.e. minimize the KL divergence for z ∈ D (Equation (10)). This obviously implies an excessive computational effort, since the aforementioned optimization problem would need to be solved for all possible K-sized combinations of basis functions.

For that purpose, we propose a hierarchical scheme that proceeds by adding a single kernel at each step. Similar greedy optimization procedures have been successfully applied in maximum entropy problems [16]. Without loss of generality, one can consider a vocabulary of functions that consists of the isotropic Gaussian kernels discussed in Equation (4). Each of these is parametrized by the location z_j of the kernel and its bandwidth τ_j. Given K such kernels, the corresponding parameters θ_K = {θ_j}_{j=1}^K (Equation (3)) that minimize I(θ) in Equation (11), and samples from the density p(z|θ_K) (Equation (9) or Equation (5)), we propose selecting the (K+1)th kernel by choosing (z_{K+1}, τ_{K+1} = {τ_{K+1,l}}_{l=1}^d) that maximize:

    (z_{K+1}, τ_{K+1}) = argmax_{(z_{K+1}, τ_{K+1})} | E_{π(z)}[K(z, z_{K+1}; τ_{K+1})] − E_{p(z|θ_K)}[K(z, z_{K+1}; τ_{K+1})] |    (19)

Based on Equation (14), this suggests augmenting our expansion with the kernel that locally maximizes the gradient of I(θ). Intuitively, this means that we incorporate the kernel function whose expected value with respect to the target, uniform distribution is worst approximated by the current density p(z|θ_K). It is obviously a suboptimal strategy, necessitated by reasons of computational cost. The maximization of the objective in Equation (19) can be readily carried out given samples from p(z|θ_K).
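The greedy step of Equation (19) can be sketched as a search over a finite candidate set of (center, bandwidth) pairs; the discretized candidate grid and one-dimensional setting are simplifying assumptions of ours, not the paper's (which maximizes over continuous (z_{K+1}, τ_{K+1})):

```python
import numpy as np

def select_next_kernel(z_grid, z_particles, weights,
                       candidate_centers, candidate_taus):
    """Greedy step of Equation (19): over candidate (center, bandwidth)
    pairs, pick the Gaussian kernel whose expectation under the uniform
    target (grid average over D) differs most from its expectation under
    the current density p(z|theta_K) (weighted particle average)."""
    W = np.asarray(weights, dtype=float)
    W = W / W.sum()
    zp = np.asarray(z_particles, dtype=float)
    best, best_gap = None, -np.inf
    for zc in candidate_centers:
        for tau in candidate_taus:
            K = lambda z: np.exp(-tau * (z - zc) ** 2)
            gap = abs(K(z_grid).mean() - np.sum(W * K(zp)))
            if gap > best_gap:
                best, best_gap = (zc, tau), gap
    return best, best_gap
```

For instance, if the current particles pile up near one end of D, the selected kernel sits where the mismatch between the uniform target and the particle cloud is largest, i.e. where the next bias correction is most needed.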
The same formulation can be applied to any type of kernel or overcomplete basis employed (e.g. wavelets). The proposed strategy promotes sparseness and computational efficiency while offering a progressive resolution of the free energy landscape that naturally involves kernels with larger bandwidths (smaller τ) in the first steps, and a successive unveiling of the finer details which can be captured by kernels of smaller bandwidths (i.e. larger τ).

Furthermore, it offers a rigorous metric for monitoring convergence. In particular, if Â_K(z;θ_K) and Â_{K+1}(z;θ_{K+1}) denote the free-energy approximations obtained at successive steps using K and K+1 kernels respectively, and p_K and p_{K+1} the corresponding densities in Equation (5), then the improvement in terms of Kullback-Leibler divergence (Equation (10)), denoted by Δ_{K+1}, can be assessed with:

    0 ≤ Δ_{K+1} = KL(π ‖ p_K) − KL(π ‖ p_{K+1})
                = β E_π[ Â_{K+1}(z;θ_{K+1}) − Â_K(z;θ_K) ] − log [Z(θ_{K+1})/Z(θ_K)]    (20)

The expectation with respect to the uniform density can in general be calculated analytically, whereas the ratio of normalizing constants Z(θ_{K+1})/Z(θ_K) (Equation (5)) is a direct output of the Sequential Monte Carlo sampling that is used to sample from the augmented densities and is discussed in the next section.

2.3 Adaptive Sequential Monte Carlo

The learning scheme proposed relies on efficient computation of the gradients appearing in Equation (14). These depend on expectations with respect to p(z|θ) (Equation (9)), which is not available analytically since the actual free energy A(z) is unknown. We resort to a Monte Carlo scheme that draws samples from the joint density p(q,z|θ) in Equation (5), which involves the atomic degrees of freedom q.
A brute-force approach would generally be inefficient, as simulating atomistic trajectories suffers from well-known difficulties such as the high dimensionality of q, the disparity in scales between q and z, and the presence of several metastable wells [22]. Furthermore, since p(q,z|θ) depends on θ, new samples would have to be drawn every time θ changes after each iteration of the optimization routine.

For these reasons, we propose a parallelizable strategy that relies on Sequential Monte Carlo samplers (SMC, [15,13]). These offer a general statistical perspective that unifies a range of pertinent schemes that have been proposed in the context of non-equilibrium path sampling following Jarzynski's work [26,48]. We propose novel extensions that allow the algorithm to automatically adapt to the difficulties of the target density. They retain the ability to interact seamlessly with legacy molecular dynamics simulators.

The proposed SMC schemes offer a flexible framework for sampling from a sequence of unnormalized probability distributions and are therefore highly suited to the dynamic setting of the problem at hand, where the target density p(q,z|θ) changes with θ. For a given θ, they approximate p(q,z|θ) with a set of n random samples (or particles) {q^{(i)}, z^{(i)}}_{i=1}^n, which are updated using a combination of importance sampling, resampling and MCMC-based rejuvenation mechanisms [14]. Each of these particles is associated with an importance weight w^{(i)} which is proportional to p(q^{(i)}, z^{(i)}|θ). The weights are updated sequentially along with the particle locations in order to provide a particulate approximation:

    p(q, z | θ) ≈ Σ_{i=1}^n W^{(i)} δ_{q^{(i)}}(q) δ_{z^{(i)}}(z)    (21)

where W^{(i)} = w^{(i)} / Σ_{k=1}^n w^{(k)} are the normalized weights and δ_{x^{(i)}}
(·) is the Dirac function centered at x^{(i)}. These particles and weights can be used to estimate expectations of any p(q,z|θ)-integrable function [13,9]. In particular, for Equation (14):

    Σ_{i=1}^n W^{(i)} K_j(z^{(i)}) → ∫ K_j(z) p(q,z|θ) dq dz = E_{p(z|θ)}[K_j(z)]   (almost surely)    (22)

The proposed SMC algorithms will be used iteratively, after each step of the gradient descent algorithm. Given two successive estimates θ_{K,m} and θ_{K,m+1} (Equation (18)) and a particulate approximation of p(q,z|θ_{K,m}), the goal is to obtain new samples from p(q,z|θ_{K,m+1}) (Algorithm 2) and compute the new expectations in Equation (14) based on Equation (22). The quality of the Monte Carlo estimates in Equation (22) depends on the proximity of the distributions p(q,z|θ_{K,m}) and p(q,z|θ_{K,m+1}). We propose building a path of intermediate, unnormalized distributions that will bridge this gap based on Equation (5): ⁴

    π_γ(q,z) ∝ p(q, z | (1−γ)θ_{K,m} + γθ_{K,m+1}) = exp{ −β (V(q,z) − Â(z; θ_γ)) },   γ ∈ [0,1]    (23)

where:

    θ_γ = (1−γ) θ_{K,m} + γ θ_{K,m+1}    (24)

Clearly, for γ = 0 one recovers p(q,z|θ_{K,m}) and for γ = 1, p(q,z|θ_{K,m+1}). The role of these auxiliary distributions is to provide a smooth transition path where importance sampling can be efficiently applied. Naturally, the more intermediate distributions are considered along this path, the higher the accuracy of the final estimates, but also the higher the computational cost. On the other hand, too few intermediate distributions π_γ can adversely affect the overall accuracy of the approximation.

To that end, we propose an adaptive SMC scheme that automatically determines the number of intermediate distributions needed [14,31]. In this process we are guided by the Effective Sample Size (ESS, [36]).

⁴ Subscripts K and m, indicating the number of kernels and optimization iterations respectively, have been dropped.
In particular, let S be the total number of intermediate distributions (which is unknown a priori) and γ_s, s = 1, 2, ..., S, the associated reciprocal temperatures such that 0 = γ_1 < γ_2 < ... < γ_S = 1, which are also unknown a priori. Let also {(q_s^{(i)}, z_s^{(i)}), W_s^{(i)}}_{i=1}^n denote the particulate approximation of π_{γ_s}, defined as in Equation (23) for γ = γ_s. The Effective Sample Size of these particles is then defined as ESS_s = 1/Σ_{i=1}^n (W_s^{(i)})² and provides a measure of the population variance. One extreme, i.e. ESS_s = 1, arises when a single particle has unit normalized weight whereas the rest have zero weights and as a result provide no information. The other extreme, i.e. ESS_s = n, arises when all the particles are equally informative and have equal weights W_s^{(i)} = 1/n. If the next bridging distribution π_{γ_{s+1}} is very similar to π_{γ_s} (i.e. γ_{s+1} ≈ γ_s), then ESS_{s+1} should not be that much different from ESS_s. On the other hand, if that difference is pronounced, then ESS_{s+1} could drop dramatically. Hence, in determining the next auxiliary distribution, we define an acceptable reduction in the ESS, i.e. ESS_{s+1} ≥ ζ ESS_s (where ζ < 1), and prescribe γ_{s+1} (Equation (23)) accordingly.

The proposed adaptive SMC algorithm is summarized in Algorithm 1. It should be noted that, unlike MCMC schemes, the particle perturbations in the Rejuvenation step do not require that P_s(·,·) is ergodic [15]. It suffices that it is a π_{γ_s}-invariant kernel, which readily allows adaptively changing its parameters in order to achieve better mixing rates.
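The ESS-controlled choice of the next tempering parameter can be sketched as a one-dimensional root search. The sketch below (our own; it assumes, for simplicity, equal weights at step s, e.g. right after resampling, and a user-supplied `log_inc(gamma)` returning the per-particle log-weight increments for a move γ_s → γ, as in the Reweighting step of Algorithm 1) uses bisection to find γ_{s+1} with ESS_{s+1} = ζ ESS_s:

```python
import numpy as np

def ess(logw):
    """Effective sample size 1 / sum_i W_i^2 from unnormalized log-weights."""
    w = np.exp(logw - logw.max())
    W = w / w.sum()
    return 1.0 / np.sum(W ** 2)

def next_gamma(gamma_prev, log_inc, zeta=0.9, tol=1e-6):
    """Bisection for the next tempering parameter: the largest practical
    step gamma in (gamma_prev, 1] such that the ESS after reweighting
    equals zeta times the current ESS (jump straight to 1 if possible)."""
    target = zeta * ess(log_inc(gamma_prev))       # zeta * ESS_s
    if ess(log_inc(1.0)) >= target:                # can finish in one step
        return 1.0
    lo, hi = gamma_prev, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ess(log_inc(mid)) >= target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Bisection is adequate here because, for the exponential-family bridges of Equation (23), the ESS typically decreases monotonically as γ moves away from γ_s.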
In the examples presented, a Metropolized Gibbs scheme was used to sample q and z separately by employing a Metropolis-Adjusted Langevin Algorithm (MALA) for each set of coordinates [44]. In particular, given (q_{s−1}^{(i)}, z_{s−1}^{(i)}), these consist of:

• Updating q_{s−1}^{(i)} → q_s^{(i)}:

    q_s^{(i)} − q_{s−1}^{(i)} = (Δt_q/2) ∇_q log π_{γ_s}(q_{s−1}^{(i)}, z_{s−1}^{(i)}) + √(Δt_q) r_q
                              = −(β Δt_q/2) ∇_q V(q_{s−1}^{(i)}, z_{s−1}^{(i)}) + √(Δt_q) r_q    (26)

• Updating z_{s−1}^{(i)} → z_s^{(i)}:

    z_s^{(i)} − z_{s−1}^{(i)} = (Δt_z/2) ∇_z log π_{γ_s}(q_s^{(i)}, z_{s−1}^{(i)}) + √(Δt_z) r_z
                              = −(β Δt_z/2) ( ∇_z V(q_s^{(i)}, z_{s−1}^{(i)}) − ∇_z Â(z_{s−1}^{(i)}; θ_{γ_s}) ) + √(Δt_z) r_z    (27)

Algorithm 1 Adaptive SMC
Require: s = 1, γ_1 = 0 and a population {(q_1^{(i)}, z_1^{(i)}), w_1^{(i)}}_{i=1}^n which approximates π_{γ_1} ≡ p(q,z|θ_{K,m}) in Equation (23).
Ensure: The final population {(q_s^{(i)}, z_s^{(i)}), w_s^{(i)}}_{i=1}^n provides a particulate approximation of π_{γ_s} in the sense of Equations (21), (22).
while γ_s < 1 do
    s ← s + 1
    {Reweighting - Importance Sampling}
    Let

        w_s^{(i)}(γ_s) = w_{s−1}^{(i)} π_{γ_s}(q_{s−1}^{(i)}, z_{s−1}^{(i)}) / π_{γ_{s−1}}(q_{s−1}^{(i)}, z_{s−1}^{(i)})
                       = w_{s−1}^{(i)} exp{−β(V(q_{s−1}^{(i)}, z_{s−1}^{(i)}) − Â(z_{s−1}^{(i)}; θ_{γ_s}))} / exp{−β(V(q_{s−1}^{(i)}, z_{s−1}^{(i)}) − Â(z_{s−1}^{(i)}; θ_{γ_{s−1}}))}
                       = w_{s−1}^{(i)} exp{ β Â(z_{s−1}^{(i)}; (γ_s − γ_{s−1})(θ_{K,m+1} − θ_{K,m})) }    (25)

    be the updated weights as a function of γ_s. Determine γ_s ∈ (γ_{s−1}, 1] so that ESS_s = ζ ESS_{s−1}.
    {Resampling}
    if ESS_s ≤ ESS_min then
        Resample
    end if
    {Rejuvenation}
    Use an MCMC kernel P_s((q_{s−1}^{(i)}, z_{s−1}^{(i)}), (q_s^{(i)}, z_s^{(i)})) that leaves π_{γ_s} invariant to perturb each particle (q_{s−1}^{(i)}, z_{s−1}^{(i)}) → (q_s^{(i)}, z_s^{(i)}).
end while

where r_q and r_z are i.i.d. standard Gaussian vectors. A Metropolis accept/reject step with respect to the target invariant density π_{γ_s}(·) was performed after each update. Two different time steps, Δt_q and Δt_z, were used for the q and z coordinates respectively. Their values were adjusted after each iteration s so as to retain an average acceptance ratio (over all n particles) between 50% and 80% [45]. The adaptivity afforded by the proposed scheme relies on the fact that ergodicity is not required of the rejuvenation step. As a result, several MALA time steps can be performed in Equations (26) and (27) (at additional computational expense), or other molecular dynamics samplers can be employed which could potentially exhibit better mixing or fit more closely to the physics of the problem at hand [6].

Finally, we note that the estimates of the ratio of normalization constants Z_s/Z_{s−1} between two successive unnormalized densities π_{γ_{s−1}} and π_{γ_s} can be obtained by averaging the unnormalized updated weights in Equation (25), as a direct consequence of the importance sampling identity:

    Z_s/Z_{s−1} = ∫ π_{γ_s}(q,z) dq dz / ∫ π_{γ_{s−1}}(q,z) dq dz
                = ∫ [π_{γ_s}(q,z)/π_{γ_{s−1}}(q,z)] [π_{γ_{s−1}}(q,z)/Z_{s−1}] dq dz
                ≈ Σ_{i=1}^n W_{s−1}^{(i)} π_{γ_s}(q_{s−1}^{(i)}, z_{s−1}^{(i)}) / π_{γ_{s−1}}(q_{s−1}^{(i)}, z_{s−1}^{(i)})    (28)

These estimators can be telescopically multiplied ([15,30]) in order to compute the ratio of normalization constants between any pair of distributions, as required in Equation (20).
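A generic MALA move of the kind used for the updates in Equations (26)-(27) can be sketched as follows; this is our own minimal, single-coordinate-block illustration (a fixed seed and generic `log_target`/`grad_log_target` callables are assumptions), not the authors' simulator coupling:

```python
import numpy as np

rng = np.random.default_rng(0)

def mala_step(x, log_target, grad_log_target, dt):
    """One Metropolis-Adjusted Langevin move: Euler-Maruyama proposal
    x + (dt/2) grad log pi + sqrt(dt) noise, corrected by accept/reject
    with the asymmetric Gaussian proposal densities."""
    mu = x + 0.5 * dt * grad_log_target(x)
    prop = mu + np.sqrt(dt) * rng.standard_normal(x.shape)
    mu_back = prop + 0.5 * dt * grad_log_target(prop)
    # log proposal densities q(prop|x) and q(x|prop), up to common constants
    log_q_fwd = -np.sum((prop - mu) ** 2) / (2.0 * dt)
    log_q_bwd = -np.sum((x - mu_back) ** 2) / (2.0 * dt)
    log_alpha = (log_target(prop) - log_target(x)) + (log_q_bwd - log_q_fwd)
    if np.log(rng.uniform()) < log_alpha:
        return prop, True
    return x, False
```

For the z-block of Equation (27), `log_target` would be log π_{γ_s}(q, ·) = −β(V(q, ·) − Â(·; θ_{γ_s})) up to a constant, so the drift reduces to the biased force shown there.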
Given the preceding discussion in sections 2.1, 2.2 and 2.3, we summarize below the basic steps in the proposed free energy computation scheme. In the inner loop, and for fixed K, gradient descent (subsection 2.1) is performed, which makes use of the adaptive SMC scheme (subsection 2.3) in order to compute the expectations in the gradient. In the outer loop, the cardinality of the expansion K is increased by adding one kernel (i.e. K ← K + 1) based on Equation (19). This is terminated when the KL gain (Equation (20)) does not exceed a prescribed tolerance.

Algorithm 2 Calculation of the free energy at a given temperature.
Require: K = 0, θ_0 ≡ 0 and a particulate approximation of p(q,z|θ_0) (Equation (5)) at the desired temperature β.
while true do
    Calculate Δ_K based on Equation (20).
    if Δ_K ≤ tol then
        Break the loop.
    else
        Add the optimal (K+1)th kernel based on Equation (19) and set K ← K + 1.
        repeat
            Estimate the gradient at θ_{K,m} and calculate the update θ_{K,m+1} based on Equation (18).
            Use adaptive SMC (section 2.3) to construct a particulate approximation of p(q,z|θ_{K,m+1}) from p(q,z|θ_{K,m}).
        until Convergence criteria are met.
    end if
end while

2.4 Obtaining the free energy landscape for various temperatures.

The methodology described in the previous sections is suitable for calculating the free energy as a function of z at a given temperature. However, one is often interested in the free energy landscape as a function of the temperature as well. In order to achieve this goal, we make use of the following two facts: firstly, the free energy landscape at higher temperatures is flatter; secondly, nearby temperatures have similar free energy landscapes. Based on these, we propose a natural extension to the sequential sampling framework of subsection 2.3 that can efficiently produce estimates of the free energy at various temperatures.
The idea is to start from a higher temperature, compute the free energy as described before, and then gradually move towards lower temperatures, using the free energy of the previous temperature as an initial guess for the new one. In particular, given the free energy estimate $\hat A_{\beta_1}(\mathbf{z}; \boldsymbol\theta(\beta_1))$ and the particulate approximation of the joint density $p_{\beta_1}(\mathbf{q}, \mathbf{z} \mid \boldsymbol\theta(\beta_1))$ at a temperature $1/\beta_1$, we propose employing the aforementioned adaptive SMC in order to obtain a particulate approximation of the following joint density at $\beta_2 > \beta_1$ (i.e. for a lower temperature):
$$
p_{\beta_2}(\mathbf{q}, \mathbf{z} \mid \boldsymbol\theta(\beta_1)) \propto \exp\left\{-\beta_2 \left( V(\mathbf{q}, \mathbf{z}) - \hat A_{\beta_1}(\mathbf{z}; \boldsymbol\theta(\beta_1)) \right)\right\} \quad (29)
$$
The iterations enumerated in Algorithm 2 can then be carried out in the same fashion, by updating the existing $\boldsymbol\theta$ as well as adding new kernels if the convergence criteria are not satisfied. The critical step involves building a sequence of distributions from $p_{\beta_1}(\mathbf{q}, \mathbf{z} \mid \boldsymbol\theta(\beta_1))$ to $p_{\beta_2}(\mathbf{q}, \mathbf{z} \mid \boldsymbol\theta(\beta_1))$ in Equation (29). For this purpose, and similarly to a simulated annealing schedule, we employ:
$$
\pi_\gamma(\mathbf{q}, \mathbf{z}) \propto \exp\left\{-\left((1-\gamma)\beta_1 + \gamma \beta_2\right) \left( V(\mathbf{q}, \mathbf{z}) - \hat A_{\beta_1}(\mathbf{z}; \boldsymbol\theta(\beta_1)) \right)\right\} \quad (30)
$$
The steps in Algorithm 1 should be adjusted to the aforementioned sequence of bridging distributions, with the most striking difference in the Reweighing step, where the updated weights in Equation (25) should now be given by:
$$
w_s^{(i)}(\gamma_s) = w_{s-1}^{(i)} \frac{\pi_{\gamma_s}(\mathbf{q}_{s-1}^{(i)}, \mathbf{z}_{s-1}^{(i)})}{\pi_{\gamma_{s-1}}(\mathbf{q}_{s-1}^{(i)}, \mathbf{z}_{s-1}^{(i)})}
= w_{s-1}^{(i)} \exp\left\{-(\gamma_s - \gamma_{s-1})(\beta_2 - \beta_1) \left( V(\mathbf{q}_{s-1}^{(i)}, \mathbf{z}_{s-1}^{(i)}) - \hat A_{\beta_1}(\mathbf{z}_{s-1}^{(i)}; \boldsymbol\theta(\beta_1)) \right)\right\} \quad (31)
$$
We demonstrate the efficacy of such an approach in the last example of section 3. It is finally noted that, at the beginning of the iterations at each new temperature, kernels with very small weights $\theta_j$ were removed, i.e. those with $\theta_j / \max_i \theta_i \le 0.01$.
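The reweighing of Equation (31) is conveniently carried out in log-space for numerical stability. A minimal sketch, in which the energy values `V_vals` and free energy estimates `A_vals` are placeholders standing in for $V(\mathbf{q}^{(i)}, \mathbf{z}^{(i)})$ and $\hat A_{\beta_1}(\mathbf{z}^{(i)}; \boldsymbol\theta(\beta_1))$:

```python
import numpy as np

def reweigh(w_prev, V_vals, A_vals, gamma_prev, gamma_s, beta1, beta2):
    """One bridging step of Eq. (31): update the particle weights, renormalize."""
    log_w = np.log(w_prev) - (gamma_s - gamma_prev) * (beta2 - beta1) * (V_vals - A_vals)
    log_w -= log_w.max()                   # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum()

rng = np.random.default_rng(1)
n = 1000
w0 = np.full(n, 1.0 / n)
V_vals = rng.normal(0.0, 1.0, n)           # placeholder energies
A_vals = np.zeros(n)                       # placeholder free energy estimates
w1 = reweigh(w0, V_vals, A_vals, gamma_prev=0.0, gamma_s=0.2, beta1=1.0, beta2=2.0)
```

The size of the step $\gamma_s - \gamma_{s-1}$ would in practice be chosen adaptively, e.g. to keep the effective sample size of the updated weights above a threshold.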
2.5 The reaction coordinate case.

The proposed method was described for the alchemical case. However, it is straightforwardly generalized to cover also the general reaction coordinate case. Let $\xi : \mathcal{M} \to D$ be a function of the system coordinates. This function is called a reaction coordinate [33]. It is evident that $\mathbf{q}$ can be viewed in a probabilistic framework as a random variable and so
$$
\mathbf{z} = \xi(\mathbf{q}) \quad (32)
$$
is also a random variable. The probability distribution of $\mathbf{z}$ can be found by integrating out $\mathbf{q}$:
$$
p(\mathbf{z} \mid \beta) = \int p(\mathbf{q})\, \delta(\xi(\mathbf{q}) - \mathbf{z})\, d\mathbf{q} \propto \int \exp(-\beta V(\mathbf{q}))\, \delta(\xi(\mathbf{q}) - \mathbf{z})\, d\mathbf{q} \quad (33)
$$
The free energy $A(\mathbf{z})$ with respect to the reaction coordinate $\xi(\mathbf{q})$ is defined to be the effective potential of $\mathbf{z} = \xi(\mathbf{q})$:
$$
p(\mathbf{z}) \propto \exp(-\beta A(\mathbf{z})) \quad (34)
$$
Combining these two equations we see that:
$$
A(\mathbf{z}) = -\beta^{-1} \log \int \exp(-\beta V(\mathbf{q}))\, \delta(\xi(\mathbf{q}) - \mathbf{z})\, d\mathbf{q} \quad (35)
$$
If $\hat A(\mathbf{z}; \boldsymbol\theta)$ is an estimate of $A(\mathbf{z})$, we define a new probability distribution over $\mathbf{q}$ by:
$$
p(\mathbf{q} \mid \boldsymbol\theta) \propto \mathbb{1}_D(\xi(\mathbf{q})) \exp\left(-\beta \left( V(\mathbf{q}) - \hat A(\xi(\mathbf{q}); \boldsymbol\theta) \right)\right) \quad (36)
$$
It is straightforward to see that under this new distribution for $\mathbf{q}$, the pdf of $\mathbf{z}$ becomes:
$$
p(\mathbf{z} \mid \boldsymbol\theta) = \int p(\mathbf{q} \mid \boldsymbol\theta)\, \delta(\xi(\mathbf{q}) - \mathbf{z})\, d\mathbf{q} \propto \mathbb{1}_D(\mathbf{z}) \exp\left(-\beta \left( A(\mathbf{z}) - \hat A(\mathbf{z}; \boldsymbol\theta) \right)\right) \quad (37)
$$
This coincides with the expression in Equation (9) and therefore the ensuing derivations hold identically. From a practical point of view, sampling need only be performed in the $\mathbf{q}$ space, and therefore the adaptive SMC schemes are employed to obtain particulate approximations of the density in Equation (36). The only difference appears in the MCMC-based Rejuvenation step, where the MALA sampler is employed only with regards to $\mathbf{q}$.
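The relation between Equations (33)-(34) can be illustrated numerically: for a toy potential $V(\mathbf{q}) = (q_1^2 + q_2^2)/2$ with $\beta = 1$ and reaction coordinate $\xi(\mathbf{q}) = q_1$, the free energy is $A(z) = z^2/2$ up to an additive constant, and it can be recovered from a histogram of $\xi(\mathbf{q})$ over exact samples of $p(\mathbf{q})$. All choices below (potential, coordinate, bin counts) are illustrative:

```python
import numpy as np

beta = 1.0
rng = np.random.default_rng(3)
q = rng.normal(0.0, 1.0 / np.sqrt(beta), size=(500_000, 2))  # samples of p(q)
z = q[:, 0]                                                  # z = xi(q)
hist, edges = np.histogram(z, bins=40, range=(-1.5, 1.5), density=True)
mids = 0.5 * (edges[:-1] + edges[1:])
A_est = -np.log(hist) / beta               # effective potential, cf. Eq. (34)
A_est -= A_est.min()                       # fix the additive constant
A_true = mids**2 / 2
A_true -= A_true.min()
```

Such a direct histogram estimate is of course only viable in low dimensions and absent metastability; the biased density of Equation (36) is what makes the approach practical in general.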
In particular the update of Equation (26) now becomes:
$$
\mathbf{q}_s^{(i)} - \mathbf{q}_{s-1}^{(i)} = \frac{\Delta t_q}{2} \nabla_{\mathbf{q}} \log \pi_{\gamma_s}(\mathbf{q}_{s-1}^{(i)}) + \sqrt{\Delta t_q}\, \mathbf{r}_q
= -\frac{\beta \Delta t_q}{2} \left( \nabla_{\mathbf{q}} V(\mathbf{q}_{s-1}^{(i)}) - \frac{\partial \hat A}{\partial \mathbf{z}} \nabla_{\mathbf{q}} \xi(\mathbf{q}_{s-1}^{(i)}) \right) + \sqrt{\Delta t_q}\, \mathbf{r}_q \quad (38)
$$
It is noted that, in contrast to some ABF methods which require second-order derivatives of $\xi$ [24], the proposed technique only needs first-order derivatives. Finally, we point out that the ability of the proposed approach to efficiently provide estimates of parametrized free energy surfaces (as in section 2.4 with respect to the temperature $\beta$) can also be exploited in the reaction coordinate case by defining a joint density:
$$
p(\mathbf{q}, \mathbf{z} \mid \boldsymbol\theta) \propto \exp\left(-\beta \left( V(\mathbf{q}) + \frac{\mu}{2} \|\mathbf{z} - \xi(\mathbf{q})\|^2 - \hat A_\mu(\mathbf{z}; \boldsymbol\theta) \right)\right) \quad (39)
$$
where, as in [37], an artificial spring with stiffness $\mu$ has been added. Clearly for $\mu \to \infty$ one recovers the aforementioned description, but for all other values of $\mu$ the formulation reduces to that of Equation (5), where in place of $V(\mathbf{q}, \mathbf{z})$ we now have $V(\mathbf{q}) + \frac{\mu}{2} \|\mathbf{z} - \xi(\mathbf{q})\|^2$. One can therefore obtain free energy surfaces for various $\mu$ values. For smaller $\mu$ the free energy would be flatter, and in the extreme case of $\mu = 0$ it would be constant. As $\mu$ increases, the complexities of the free energy surface would become pronounced. Hence, by exploiting the idea of section 2.4, a sequence of problems parametrized by $\mu$ rather than $\beta$ can be constructed to gradually move to larger $\mu$ values, using the free energy of the previous $\mu$ as an initial guess for the new one. The adaptive SMC scheme would ensure a smooth enough transition while retaining a good level of accuracy for the approximations obtained.
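The drift term of Equation (38) can be sketched in a toy 1-D setting with $\xi(q) = q$, $V(q) = q^2/2$ and a hypothetical linear free energy estimate $\hat A(z) = 0.3z$; the Metropolis accept/reject correction mentioned in the text is omitted here for brevity, so this is the unadjusted Langevin update only. Only first derivatives of $\xi$ are needed, as noted above.

```python
import numpy as np

beta, dt = 1.0, 0.01
grad_V = lambda q: q                        # dV/dq for V(q) = q^2/2
dA_dz = lambda z: 0.3                       # d A_hat / dz (assumed estimate)
grad_xi = lambda q: 1.0                     # d xi / dq for xi(q) = q

rng = np.random.default_rng(2)
q = np.full(2000, 1.0)                      # an ensemble of particles
for _ in range(2000):
    # Biased drift of Eq. (38): -(beta dt / 2) (grad V - A_hat' grad xi)
    drift = -0.5 * beta * dt * (grad_V(q) - dA_dz(q) * grad_xi(q))
    q = q + drift + np.sqrt(dt) * rng.standard_normal(q.size)
# Under the biased density exp(-(q**2/2 - 0.3*q)) the ensemble mean tends to 0.3.
```

Setting `dA_dz` to zero recovers plain unbiased Langevin sampling of $\exp(-\beta V)$; a converged $\hat A$ flattens the effective potential along $\xi$.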
3 Numerical Examples

3.1 Two-Dimensional Toy Example

Consider a two-dimensional system [54,34] with a single parameter $z$, interacting with potential energy
$$
V(q; z) = \cos(2\pi z)(1 + d_1 q) + d_2 q^2
$$
Assume that $q$ given $z$ and $\beta$ is distributed according to
$$
p(q \mid z, \beta) \propto \exp(-\beta V(q; z))
$$
where $\beta$ is also a fixed parameter that plays the role of an inverse temperature. We wish to calculate an approximation $\hat A(z)$ of the free energy $A(z)$ on an interval $D = [-0.5, 0.5]$. The true free energy can be found analytically to be
$$
A(z) = \cos(2\pi z) - \frac{d_1^2 \cos^2(2\pi z)}{4 d_2} + c
$$
where $c$ is a constant that depends upon the specific choice of the fixed parameters. In what follows, we choose $c$ so that $A(-0.5) = 0$. To demonstrate our method in this simple example we used $d_1 = 2$, $d_2 = 30$. The potential energy $V(q; z)$ for this choice of the parameters is depicted in Figure 1(a). We fix the inverse temperature to $\beta = 10$. As shown in Figure 1(b), the distribution is bimodal, with a big region of practically zero probability separating the two modes. Hence, metastability along the parameter $z$ is apparent. The performance of the proposed method with respect to the number of particles used in the adaptive SMC scheme is depicted in Figures 2 and 3, which show the evolution of the estimated free energy landscape with $n = 100$ and $n = 10{,}000$ particles respectively. In both cases the method is capable of capturing the correct characteristics of the reference solution and, as expected, the noise in the computations is smaller when the number of particles is larger. In both cases the Robbins-Monro learning series is picked to be $\eta_m = m^{-p}$ with $p = 0.6$ and the learning rate $\lambda = 0.1$. Figure 4(a) shows the first three kernels selected by the greedy scheme of section 2.2.
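The analytic expression for $A(z)$ above follows from a Gaussian integral over $q$ and can be verified by one-dimensional quadrature: $-\beta^{-1}\log\int \exp(-\beta V(q;z))\,dq$ should differ from $\cos(2\pi z) - d_1^2\cos^2(2\pi z)/(4d_2)$ only by the additive constant $c$. A quick numerical check (grid choices are illustrative):

```python
import numpy as np

d1, d2, beta = 2.0, 30.0, 10.0
q = np.linspace(-2.0, 2.0, 4001)
dq = q[1] - q[0]

def A_quad(z):
    """Quadrature evaluation of -log(int exp(-beta V(q; z)) dq) / beta."""
    V = np.cos(2 * np.pi * z) * (1 + d1 * q) + d2 * q**2
    return -np.log(np.sum(np.exp(-beta * V)) * dq) / beta

def A_exact(z):
    """Closed form up to the additive constant c."""
    c2 = np.cos(2 * np.pi * z)
    return c2 - d1**2 * c2**2 / (4 * d2)

zs = np.linspace(-0.5, 0.5, 11)
diff = np.array([A_quad(zz) - A_exact(zz) for zz in zs])
# diff is numerically constant in z: that constant is c (before normalization).
```

The integrand is a narrow Gaussian in $q$ (standard deviation $1/\sqrt{2\beta d_2} \approx 0.04$), so the simple sum over a fine grid is already extremely accurate.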
Figure 4(b) depicts the log-values of the kernel weights $\{\theta_j\}_{j=1}^{K}$, which clearly demonstrate the ability of the proposed approach to provide sparse approximations. The first kernel selected has the greatest weight and hence it contains the majority of the information about the free energy curve. The rest of the kernels are progressive corrections of the estimate given by the first kernel. This conclusion is also supported by the result of Figure 5, which shows the evolution of the reduction in the KL divergence with respect to the total number of iterations, as quantified by adding the $\Delta_{K+1}$ in Equation (20). Clearly the first kernel offers the greatest KL gain ($\Delta_1$) and further kernel additions offer progressively smaller reductions in the KL divergence.

Fig. 1. Potential energy and pdf for the toy example of section 3.1. (a) The potential energy $V(q_1, q_2)$ for $d_1 = 2$, $d_2 = 30$. (b) The probability distribution $p(q_1, q_2 \mid \beta)$ for $d_1 = 2$, $d_2 = 30$, $\beta = 10$.

3.2 WCA Dimer

We consider $N = 16$ atoms in a two-dimensional fully periodic box of side $l$ which interact with a purely repulsive WCA pair potential [34]:
$$
V_{WCA}(r) =
\begin{cases}
4\epsilon \left[ \left(\dfrac{\sigma}{r}\right)^{12} - \left(\dfrac{\sigma}{r}\right)^{6} \right] + \epsilon, & \text{if } r \le r_0 \\
0, & \text{otherwise}
\end{cases}
$$
where $\sigma$ and $\epsilon$ give the length and energy scales respectively, and $r_0 = 2^{1/6}\sigma$ is the cutoff at which the potential vanishes. Two of these atoms (say atoms 1 and 2) are assumed to form a dimer.

Fig. 2 [panels (a) K=1, (b) K=2, (c) K=3, (d) K=8; axes $z$ vs. $A(z)$, against the reference solution].
Evolution of free energy estimates at various kernel numbers $K$ when using $n = 100$ particles in the adaptive SMC scheme.

Their interaction is described instead with a double well potential:
$$
V_S(r) = h \left[ 1 - \frac{(r - r_0 - w)^2}{w^2} \right]^2
$$
where $h$, $w$, $r_0$ are fixed parameters and $r$ the distance between the two atoms. This potential has two equilibrium points, $r_0$ and $r_0 + 2w$. The barrier separating the two equilibria is $h$. Let $\mathbf{q} = (q_1, q_2, \ldots, q_N)$ with $q_i \in \mathbb{R}^2$ denoting the position of atom $i$. The potential energy of the system is:
$$
V(\mathbf{q}) = V_S(|q_1 - q_2|) + \sum_{i=1}^{2} \sum_{j=3}^{N} V_{WCA}(|q_i - q_j|) + \sum_{2 < i < j \le N} V_{WCA}(|q_i - q_j|)
$$
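The two pair potentials above can be sketched directly, assuming the standard WCA cutoff $r_0 = 2^{1/6}\sigma$; the parameter values below ($\sigma$, $\epsilon$, $h$, $w$) are illustrative, not taken from the paper:

```python
import numpy as np

sigma, eps = 1.0, 1.0
r0 = 2.0 ** (1.0 / 6.0) * sigma             # WCA cutoff (assumed 2^(1/6) sigma)
h, w = 5.0, 0.3                             # barrier height and well half-width

def V_wca(r):
    """Purely repulsive WCA potential: truncated-shifted LJ, zero beyond r0."""
    if r > r0:
        return 0.0
    sr6 = (sigma / r) ** 6
    return 4.0 * eps * (sr6**2 - sr6) + eps

def V_s(r):
    """Double-well dimer potential: minima at r0 and r0 + 2w, barrier h."""
    u = (r - r0 - w) ** 2 / w**2
    return h * (1.0 - u) ** 2

# Sanity checks: V_wca vanishes continuously at the cutoff; V_s has equal
# minima at r0 and r0 + 2w separated by a barrier of height h at r0 + w.
print(V_wca(r0), V_s(r0), V_s(r0 + 2 * w), V_s(r0 + w))
```

The shift by $+\epsilon$ and the cutoff at the Lennard-Jones minimum are what make $V_{WCA}$ both continuous and purely repulsive.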
