Free energy computations by minimization of Kullback-Leibler divergence: an efficient adaptive biasing potential method for sparse representations


Authors: I. Bilionis, P. S. Koutsourelakis

I. Bilionis — Center for Applied Mathematics, Cornell University, Ithaca, NY 14853, USA
P. S. Koutsourelakis ∗ — School of Civil and Environmental Engineering & Center for Applied Mathematics, Cornell University, Ithaca, NY 14853, USA

Abstract

The present paper proposes an adaptive biasing potential for the computation of free energy landscapes. It is motivated by statistical learning arguments and unifies the tasks of biasing the molecular dynamics to escape free energy wells and estimating the free energy function under the same objective. It offers rigorous convergence diagnostics even though history-dependent, non-Markovian dynamics are employed. It makes use of a greedy optimization scheme in order to obtain sparse representations of the free energy function, which can be particularly useful in multidimensional cases. It employs embarrassingly parallelizable sampling schemes that are based on adaptive Sequential Monte Carlo and can be readily coupled with legacy molecular dynamics simulators. The sequential nature of the learning and sampling scheme enables the efficient calculation of free energy functions parametrized by the temperature. The characteristics and capabilities of the proposed method are demonstrated in three numerical examples.

Preprint submitted to Journal of Computational Physics, 25 October 2018

Key words: free energy computations, adaptive biasing potential, Sequential Monte Carlo, atomistic simulations, statistical learning

1 Introduction

Free energy is a central concept in thermodynamics and in the study of several systems in biology, chemistry and physics [8].
It represents a rigorous way to coarse-grain systems consisting of very large numbers of atomistic degrees of freedom, to probe states not accessible experimentally, and to characterize global changes as well as investigate relative stabilities. In most applications, a brute-force computation based on sampling the atomistic positions is impractical or infeasible, as the free energy barriers to overcome are so large that the system remains trapped in metastable free energy sets [40,42,8,56].

Equilibrium techniques for computing free energy surfaces, such as Thermodynamic Integration [29] or Adaptive Integration [49,21], require the simulation of very long atomistic trajectories in order to achieve equilibrium and lack convergence diagnostics. Techniques based on non-equilibrium path sampling [26,27,23,25] lack adaptivity and require the user to specify a particular path in the reaction coordinate space connecting two energetically important free energy regions, which can be a non-trivial task [20]. Furthermore, sampling along these paths correctly might necessitate advanced and quite involved techniques [10]. More recently proposed adaptive biasing potential [2,55,32,1,18] and adaptive biasing force [11,12,41,53,24] techniques are capable of dynamically utilizing information obtained from the atomistic trajectories to bias the current dynamics in order to facilitate the escape from metastable sets [35]. They are able to automatically discover important regions of the reaction coordinate space.

∗ Corresponding Author. Tel: 607-254-5441. Email addresses: ib227@cornell.edu (I. Bilionis), pk285@cornell.edu (P. S. Koutsourelakis).
Since they rely on history-dependent, non-Markovian dynamics, it is not a priori clear whether, and in which sense, the system reaches a stationary state, although some work has been done along these lines in [4] for Langevin-type systems and in [39,35].

We propose an adaptive biasing potential technique where the two tasks of biasing the dynamics and estimating the free energy landscape are unified under the same objective of minimizing the Kullback-Leibler divergence between appropriately selected distributions on the extended space that includes atomic coordinates and the collective variables [37,38]. This framework provides a natural way for selecting the basis functions used in the approximation of the free energy and obtaining sparse representations, which is critical when multi-dimensional collective variables are used. It allows the analyst to utilize and correct any prior information on the free energy landscape and provides an efficient manner of obtaining good estimates at various temperatures. The scheme proposed is embarrassingly parallelizable and relies on adaptive Sequential Monte Carlo procedures which enable efficient sampling from the high-dimensional and potentially multi-modal distributions of interest.

2 Methodology - A statistical learning approach for adaptively calculating free energies

For clarity of the presentation, we will first introduce our method for the so-called alchemical case and generalize it later for the reaction coordinate case. Consider a molecular system with generalized coordinates q ∈ M ⊂ R^N following a Boltzmann-like distribution which in turn depends on some parameters z ∈ D ⊂ R^d:

    p(q | z) ∝ exp(−β V(q; z))    (1)

where V(q; z) is the potential energy of the system and β plays the role of inverse temperature.
The free energy A(z) is defined, up to an additive constant, by:

    A(z) = −β^{−1} log ∫ exp(−β V(q; z)) dq    (2)

Our goal is to compute the function A(z) over the whole domain D. Let Â(z; θ) be an estimate of A(z) parametrized by θ ∈ Θ ⊂ R^K. We adopt a statistical perspective of learning A(z) from simulation data. A popular approach to carrying out regression tasks and functional approximations relies on kernel models [28]. Kernel regression models have proven successful in high-dimensional scenarios where d is of the order of 10 or 100 [51,52]. The unknown function is selected from a Reproducing Kernel Hilbert Space (RKHS) H_K induced by a positive semi-definite kernel K(·,·). We adopt representations with respect to a kernel function K(·,·) [28]:

    Â(z; θ) = Σ_{j=1}^{K} θ_j K_j(z, z_j),   z ∈ D    (3)

where z_j are points in D which are selected as described in the sequel. In order to fix the additive constant, we select a point z_0 ∈ D such that Â(z_0; θ) = 0. ¹

In the relevant literature, different types of kernel functions have been used, such as thin plate splines, multiquadrics, or Gaussians. While all these functions can be employed in the framework presented, we focus our discussion here on Gaussian kernels, which also have an intuitive parametrization with regard to the scale of variability of Â, as quantified by the bandwidth parameters τ_j = {τ_{j,l}}_{l=1}^{d} in each dimension:

    K_j(z) = K(z, z_j; τ_j) = exp{ −Σ_{l=1}^{d} τ_{j,l} (z_l − z_{j,l})² }    (4)

Gaussian kernels in the context of free energy approximations have also been used in [32,37,18].

¹ This is always possible by changing the kernels in Equation (3) to K′_j(z, z_j) = K_j(z) − K_j(z_0, z_j).
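As a concrete illustration of the representation in Equations (3)-(4), the following minimal sketch (function names and the constant-fixing convention of footnote 1 are our own rendering, not the authors' code) evaluates the kernel expansion Â(z; θ) for anisotropic Gaussian kernels:

```python
import numpy as np

def kernel(z, zj, tau):
    """Anisotropic Gaussian kernel of Equation (4):
    K(z, z_j; tau_j) = exp(-sum_l tau_{j,l} (z_l - z_{j,l})^2)."""
    z, zj, tau = (np.asarray(a, dtype=float) for a in (z, zj, tau))
    return np.exp(-np.sum(tau * (z - zj) ** 2, axis=-1))

def A_hat(z, theta, centers, bandwidths, z0=None):
    """Free energy estimate of Equation (3); if a reference point z0 is
    given, the additive constant is fixed so that A_hat(z0) = 0, as in
    footnote 1 (kernels shifted by their value at z0)."""
    val = sum(th * kernel(z, zj, tau)
              for th, zj, tau in zip(theta, centers, bandwidths))
    if z0 is not None:
        val -= sum(th * kernel(z0, zj, tau)
                   for th, zj, tau in zip(theta, centers, bandwidths))
    return val
```

With `z0` supplied, `A_hat(z0, ...)` returns exactly zero, matching the convention Â(z_0; θ) = 0.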
We define a joint probability distribution on the generalized coordinates q and the parameters z as follows:

    p(q, z | θ) = (1/Z(θ)) 1_D(z) e^{−β(V(q,z) − Â(z;θ))}    (5)

where 1_D(z) is the indicator function on D and Z(θ) is the normalization constant, i.e.:

    Z(θ) = ∫ 1_D(z) e^{−β(V(q,z) − Â(z;θ))} dq dz    (6)

It is noted that the first-order partial derivatives of the log of the normalization function give rise to the expectation of the respective kernel:

    ∂ log Z/∂θ_j = (1/Z(θ)) ∂Z/∂θ_j = β E_{p(z|θ)}[K_j(z)]    (7)

whereas the second-order derivatives produce the covariance between the kernels:

    ∂² log Z/(∂θ_j ∂θ_l) = −(1/Z²(θ)) (∂Z/∂θ_j)(∂Z/∂θ_l) + (1/Z(θ)) ∂²Z/(∂θ_j ∂θ_l)
                         = β² E_{p(z|θ)}[ (K_j(z) − E_{p(z|θ)}[K_j(z)]) (K_l(z) − E_{p(z|θ)}[K_l(z)]) ]
                         = β² Cov_{p(z|θ)}[K_j, K_l]    (8)

The expectations in the two equations above involve the unknown marginal density with respect to the parameters z ∈ D:

    p(z | θ) = ∫ p(q, z | θ) dq = (1/Z(θ)) 1_D(z) e^{−β(A(z) − Â(z;θ))}    (9)

which depends on the unknown free energy A(z) of Equation (2). The key property of p(z|θ) is that it reduces to the uniform distribution for z ∈ D if and only if the free energy estimate is exact, i.e. Â(z;θ) = A(z), z ∈ D. If π(z) = 1_D(z)/|D| is the uniform density on D (whose volume is denoted by |D|), then a natural strategy for estimating A(z) is to minimize a distance metric between π(z) and p(z|θ) in Equation (9) above.
To that end, we propose employing the Kullback-Leibler (KL) divergence KL(π(z) ‖ p(z|θ)) [50]:

    KL(π ‖ p) = ∫ π(z) log [π(z)/p(z|θ)] dz
              = ∫ π(z) log π(z) dz − ∫ π(z) log p(z|θ) dz
              = −log|D| − ∫ π(z) log p(z|θ) dz ≥ 0    (10)

The latter is not a metric in the mathematical sense, but it is frequently used as a measure of the distance between two probability distributions. It is always non-negative and becomes zero if and only if π(z) ≡ p(z|θ), or equivalently Â(z;θ) = A(z), z ∈ D. ²

The aforementioned formulation offers a clear strategy for estimating the free energy by minimizing the following form with respect to θ:

    I(θ) = −∫ π(z) log p(z|θ) dz    (11)

Since the KL divergence is always non-negative, the formulation above provides a lower bound on the objective function I(θ):

    I(θ) ≥ log|D|    (12)

which can be readily calculated and used to monitor convergence as well as the quality of the approximation obtained. Even though I(θ) depends on the unknown free energy A(z) (from Equation (9)):

    I(θ) = −∫ π(z) log p(z|θ) dz = β ∫ π(z) (A(z) − Â(z;θ)) dz + log Z(θ)    (13)

its partial derivatives J(θ) = ∂I(θ)/∂θ do not, i.e. from Equation (7):

    J_j(θ) = ∂I(θ)/∂θ_j = −β E_{π(z)}[∂Â/∂θ_j] + ∂ log Z/∂θ_j
           = −β ( E_{π(z)}[K_j(z)] − E_{p(z|θ)}[K_j(z)] )    (14)

where E_{π(z)}[·] denotes an expectation with respect to π(z).

² Of interest are free-energy differences, and therefore perturbations of A(z) or Â(z;θ) by a constant are ignored.
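Equation (14) says each gradient component is a difference of kernel expectations: one under the uniform target π (computable by quadrature over D) and one under p(z|θ), which must be estimated from samples. A minimal one-dimensional sketch (our own illustration, anticipating the weighted-particle approximation of section 2.3; the grid quadrature and function names are assumptions, not the paper's code):

```python
import numpy as np

def gradient_estimate(beta, centers, bandwidths, z_particles, weights, z_grid):
    """Noisy estimate of Equation (14):
    J_j = -beta * ( E_pi[K_j] - E_{p(z|theta)}[K_j] ).
    E_pi[.] is computed on a dense grid over a 1-d domain D (uniform
    density); the second expectation is a weighted particle average."""
    W = np.asarray(weights, dtype=float)
    W = W / W.sum()                      # normalized importance weights
    zp = np.asarray(z_particles, dtype=float)
    J = np.empty(len(centers))
    for j, (zj, tau) in enumerate(zip(centers, bandwidths)):
        Kj = lambda z: np.exp(-tau * (z - zj) ** 2)   # 1-d Gaussian kernel
        E_pi = Kj(z_grid).mean()                      # quadrature over D
        E_p = np.sum(W * Kj(zp))                      # particle estimate
        J[j] = -beta * (E_pi - E_p)
    return J
```

As a sanity check, when the particles themselves are uniform over D with equal weights, the two expectations coincide and the gradient vanishes, consistent with p(z|θ) = π(z) being the minimizer.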
It is important to note that, according to Equation (8), the Hessian of the objective function H(θ) = ∂²I(θ)/(∂θ ∂θ^T) is proportional to the covariance between the kernels, i.e.:

    ∂²I/(∂θ_j ∂θ_l) = ∂² log Z(θ)/(∂θ_j ∂θ_l) = β² Cov_{p(z|θ)}[K_j, K_l]    (15)

Hence the objective function is convex with respect to θ and there is a unique minimum.

Furthermore, the approximation of the free energy Â(z;θ) biases the potential of p(q,z|θ) (Equation (5)) and allows the system to overcome free energy barriers [39]. As in [18], no binning is needed and the bias potential is non-local, providing information about the free energy landscape not only at the states visited but in their neighborhood as well. In contrast to other adaptive schemes, the proposed formulation connects the problems of estimating the free energy landscape and steering the atomistic dynamics beyond metastable wells under a unified umbrella, and provides a clear convergence criterion [35].

From an algorithmic point of view, the proposed strategy poses two problems. The first involves the selection of the kernels to be used in the expansion of Equation (3). This is critical to the sparseness of the representation obtained, particularly for multidimensional z. To that end, a greedy selection strategy is discussed in section 2.2 which progressively adds kernels (i.e. increases the cardinality K of the expansion in Equation (3)) as needed. The second problem involves the optimization of the objective function I(θ), which depends on the unknown free energy A(z) and the intractable partition function Z(θ) (Equation (13)). An obvious approach is gradient descent, which is discussed in detail in section 2.1. This requires the computation of expectations with respect to p(z|θ) (Equation (14)).
The intractability of p(z|θ) necessitates the use of Monte Carlo sampling, which must be carried out in the expanded space with respect to the joint density p(q,z|θ) (Equation (5)). This should nevertheless be able to capture multiple modes in the high-dimensional state space consisting of atomic degrees of freedom q and parameters z. To that end, we propose performing this step by using non-equilibrium path sampling techniques based on adaptive Sequential Monte Carlo schemes, discussed in section 2.3. Similar schemes for creating system replicas in parallel have been employed in [41,35]. We discuss a novel adaptive version that retains previous advantages while providing accurate estimates at reduced computational effort. These estimates can be readily updated as θ changes after each optimization step. A discussion of each of the aforementioned algorithmic modules is contained in the ensuing subsections. The steps of the scheme proposed can be found in Algorithm 2.

2.1 Optimization with noisy gradients

We propose employing a gradient descent scheme in order to determine θ, although more involved procedures such as Improved Iterative Scaling [3,16] or noisy conjugate gradients [46] can also be employed. Second-order (quasi-)Newton techniques are also possible, although the unavoidable Monte Carlo noise in the computation of the Hessian (i.e. the covariance in Equation (15)) can destroy its positive definiteness.

Let θ_K denote the vector of kernel amplitudes (Equation (3)) when K such kernels are used. Let also θ_{K,m} denote the estimate of θ_K after m iterations of the gradient descent algorithm. Then at the (m+1)th iteration, the following update equation could be used:

    θ_{K,m+1} = θ_{K,m} − λ J(θ_{K,m})    (16)

where λ is the learning rate.
Since only a noisy Monte Carlo estimate of the gradients J(θ) (Equation (14)) is available, it is anticipated that the noise could impede convergence. For that purpose, we propose employing a stochastic approximation variant of the Robbins & Monro scheme [43,7]. Rather than increasing the simulation size in order to reduce the variance, we compute a weighted average of the gradient's estimates at the current and previous iterations. By employing a decreasing sequence of weights, information from the earlier iterations gets discarded gradually and more emphasis is placed on the recent iterations. As shown in [17], this method converges with a fixed sample size. In particular, if Ĵ(θ_{K,m}) denotes the Monte Carlo estimate of the gradient (the details of this estimator are discussed in section 2.3) at the m-th iteration, then we calculate:

    J̃_m = (1 − η_m) J̃_{m−1} + η_m Ĵ(θ_{K,m})    (17)

and, rather than Equation (16), we update θ as follows:

    θ_{K,m+1} = θ_{K,m} − λ J̃_m    (18)

where the sequence of weights {η_m} is such that Σ_{m=1}^∞ η_m = ∞ and Σ_{m=1}^∞ η_m² < ∞. ³

2.2 Kernel selection - Sparse representation of the free energy landscape

A critical objective in the proposed framework relates to the sparseness of the free energy approximation, i.e. the cardinality K of the expansion in Equation (3). This is important in at least two ways. Firstly, because sparser representations can more clearly expose salient features of the free energy landscape and, as a consequence, of the atomistic ensemble considered. Secondly, because they reduce the number of parameters θ with respect to which the optimization problem needs to be solved (section 2.1).

³ A family of such sequences that was used in this work is η_m = η m^{−p} with 1/2 < p ≤ 1.
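To make the smoothed update of subsection 2.1 (Equations (17)-(18)) concrete, here is a minimal sketch; the function name, default learning rate and the specific weight sequence η_m = η₀ m^{−p} from footnote 3 are our illustrative choices, not the authors' implementation:

```python
import numpy as np

def rm_update(theta, J_tilde_prev, J_hat, m, lam=0.1, eta0=1.0, p=0.75):
    """One iteration of Equations (17)-(18): a decaying-weight average of
    noisy gradient estimates (eta_m = eta0 * m**-p with 1/2 < p <= 1),
    followed by a descent step with learning rate lam."""
    eta_m = eta0 * m ** (-p)
    J_tilde = (1.0 - eta_m) * np.asarray(J_tilde_prev) + eta_m * np.asarray(J_hat)
    theta_new = np.asarray(theta) - lam * J_tilde
    return theta_new, J_tilde
```

At m = 1 the average reduces to the current noisy estimate; at later iterations earlier gradients are gradually discounted, which is what allows convergence at a fixed Monte Carlo sample size.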
Given a vocabulary of potentially overcomplete basis functions and a prescribed K, the problem amounts to identifying those kernels (Equation (3)) that best approximate the true free energy surface, i.e. minimize the KL divergence for z ∈ D (Equation (10)). This obviously implies an excessive computational effort, since the aforementioned optimization problem would need to be solved for all possible K-sized combinations of basis functions.

For that purpose, we propose a hierarchical scheme that proceeds by adding a single kernel at each step. Similar greedy optimization procedures have been successfully applied in maximum entropy problems [16]. Without loss of generality, one can consider a vocabulary of functions that consists of the isotropic Gaussian kernels discussed in Equation (4). Each of these is parametrized by the location z_j of the kernel and its bandwidth τ_j. Given K such kernels, the corresponding parameters θ_K = {θ_j}_{j=1}^K (Equation (3)) that minimize I(θ) in Equation (11), and samples from the density p(z|θ_K) (Equation (9) or Equation (5)), we propose selecting the (K+1)th kernel by choosing (z_{K+1}, τ_{K+1} = {τ_{K+1,l}}_{l=1}^d) that maximize:

    (z_{K+1}, τ_{K+1}) = argmax_{(z_{K+1}, τ_{K+1})} | E_{π(z)}[K(z, z_{K+1}; τ_{K+1})] − E_{p(z|θ_K)}[K(z, z_{K+1}; τ_{K+1})] |    (19)

Based on Equation (14), this suggests augmenting our expansion with the kernel that locally maximizes the gradient of I(θ). Intuitively, this means that we incorporate the kernel function whose expected value with respect to the target, uniform distribution is worst approximated by the current density p(z|θ_K). It is obviously a suboptimal strategy, necessitated by reasons of computational cost. The maximization of the objective in Equation (19) can be readily carried out given samples from p(z|θ_K).
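The greedy step of Equation (19) can be sketched as a search over a finite candidate set of (center, bandwidth) pairs; the discretized candidate grid and one-dimensional setting are simplifying assumptions of ours, not the paper's (which maximizes over continuous (z_{K+1}, τ_{K+1})):

```python
import numpy as np

def select_next_kernel(z_grid, z_particles, weights,
                       candidate_centers, candidate_taus):
    """Greedy step of Equation (19): over candidate (center, bandwidth)
    pairs, pick the Gaussian kernel whose expectation under the uniform
    target (grid average over D) differs most from its expectation under
    the current density p(z|theta_K) (weighted particle average)."""
    W = np.asarray(weights, dtype=float)
    W = W / W.sum()
    zp = np.asarray(z_particles, dtype=float)
    best, best_gap = None, -np.inf
    for zc in candidate_centers:
        for tau in candidate_taus:
            K = lambda z: np.exp(-tau * (z - zc) ** 2)
            gap = abs(K(z_grid).mean() - np.sum(W * K(zp)))
            if gap > best_gap:
                best, best_gap = (zc, tau), gap
    return best, best_gap
```

For instance, if the current particles pile up near one end of D, the selected kernel sits where the mismatch between the uniform target and the particle cloud is largest, i.e. where the next bias correction is most needed.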
The same formulation can be applied to any type of kernel or overcomplete basis employed (e.g. wavelets). The proposed strategy promotes sparseness and computational efficiency while offering a progressive resolution of the free energy landscape that naturally involves kernels with larger bandwidths (smaller τ) in the first steps, and a successive unveiling of the finer details which can be captured by kernels of smaller bandwidths (i.e. larger τ).

Furthermore, it offers a rigorous metric for monitoring convergence. In particular, if Â_K(z;θ_K) and Â_{K+1}(z;θ_{K+1}) denote the free-energy approximations obtained at successive steps using K and K+1 kernels respectively, and p_K and p_{K+1} the corresponding densities in Equation (5), then the improvement in terms of Kullback-Leibler divergence (Equation (10)), denoted by Δ_{K+1}, can be assessed with:

    0 ≤ Δ_{K+1} = KL(π ‖ p_K) − KL(π ‖ p_{K+1})
                = β E_π[ Â_{K+1}(z;θ_{K+1}) − Â_K(z;θ_K) ] − log [Z(θ_{K+1})/Z(θ_K)]    (20)

The expectation with respect to the uniform density can in general be calculated analytically, whereas the ratio of normalizing constants Z(θ_{K+1})/Z(θ_K) (Equation (5)) is a direct output of the Sequential Monte Carlo sampling that is used to sample from the augmented densities and is discussed in the next section.

2.3 Adaptive Sequential Monte Carlo

The learning scheme proposed relies on efficient computation of the gradients appearing in Equation (14). These depend on expectations with respect to p(z|θ) (Equation (9)), which is not available analytically since the actual free energy A(z) is unknown. We resort to a Monte Carlo scheme that draws samples from the joint density p(q,z|θ) in Equation (5), which involves the atomic degrees of freedom q.
A brute-force approach would generally be inefficient, as simulating atomistic trajectories suffers from well-known difficulties such as the high dimensionality of q, the disparity in scales between q and z, and the presence of several metastable wells [22]. Furthermore, since p(q,z|θ) depends on θ, new samples would have to be drawn every time θ changes after each iteration of the optimization routine.

For these reasons, we propose a parallelizable strategy that relies on Sequential Monte Carlo samplers (SMC, [15,13]). These offer a general statistical perspective that unifies a range of pertinent schemes that have been proposed in the context of non-equilibrium path sampling following Jarzynski's work [26,48]. We propose novel extensions that allow the algorithm to automatically adapt to the difficulties of the target density. They retain the ability to interact seamlessly with legacy molecular dynamics simulators.

The proposed SMC schemes offer a flexible framework for sampling from a sequence of unnormalized probability distributions and are therefore highly suited to the dynamic setting of the problem at hand, where the target density p(q,z|θ) changes with θ. For a given θ, they approximate p(q,z|θ) with a set of n random samples (or particles) {q^{(i)}, z^{(i)}}_{i=1}^n, which are updated using a combination of importance sampling, resampling and MCMC-based rejuvenation mechanisms [14]. Each of these particles is associated with an importance weight w^{(i)} which is proportional to p(q^{(i)}, z^{(i)}|θ). The weights are updated sequentially along with the particle locations in order to provide a particulate approximation:

    p(q, z | θ) ≈ Σ_{i=1}^n W^{(i)} δ_{q^{(i)}}(q) δ_{z^{(i)}}(z)    (21)

where W^{(i)} = w^{(i)} / Σ_{k=1}^n w^{(k)} are the normalized weights and δ_{x^{(i)}}
(·) is the Dirac function centered at x^{(i)}. These particles and weights can be used to estimate expectations of any p(q,z|θ)-integrable function [13,9]. In particular, for Equation (14):

    Σ_{i=1}^n W^{(i)} K_j(z^{(i)}) → ∫ K_j(z) p(q,z|θ) dq dz = E_{p(z|θ)}[K_j(z)]   (almost surely)    (22)

The proposed SMC algorithms will be used iteratively, after each step of the gradient descent algorithm. Given two successive estimates θ_{K,m} and θ_{K,m+1} (Equation (18)) and a particulate approximation of p(q,z|θ_{K,m}), the goal is to obtain new samples from p(q,z|θ_{K,m+1}) (Algorithm 2) and compute the new expectations in Equation (14) based on Equation (22). The quality of the Monte Carlo estimates in Equation (22) depends on the proximity of the distributions p(q,z|θ_{K,m}) and p(q,z|θ_{K,m+1}). We propose building a path of intermediate, unnormalized distributions that will bridge this gap based on Equation (5): ⁴

    π_γ(q,z) ∝ p(q, z | (1−γ)θ_{K,m} + γθ_{K,m+1}) = exp{ −β (V(q,z) − Â(z; θ_γ)) },   γ ∈ [0,1]    (23)

where:

    θ_γ = (1−γ) θ_{K,m} + γ θ_{K,m+1}    (24)

Clearly, for γ = 0 one recovers p(q,z|θ_{K,m}) and for γ = 1, p(q,z|θ_{K,m+1}). The role of these auxiliary distributions is to provide a smooth transition path where importance sampling can be efficiently applied. Naturally, the more intermediate distributions are considered along this path, the higher the accuracy of the final estimates, but also the higher the computational cost. On the other hand, too few intermediate distributions π_γ can adversely affect the overall accuracy of the approximation.

To that end, we propose an adaptive SMC scheme that automatically determines the number of intermediate distributions needed [14,31]. In this process we are guided by the Effective Sample Size (ESS, [36]).

⁴ Subscripts K and m, indicating the number of kernels and optimization iterations respectively, have been dropped.
In particular, let S be the total number of intermediate distributions (which is unknown a priori) and γ_s, s = 1, 2, ..., S, the associated reciprocal temperatures such that 0 = γ_1 < γ_2 < ... < γ_S = 1, which are also unknown a priori. Let also {(q_s^{(i)}, z_s^{(i)}), W_s^{(i)}}_{i=1}^n denote the particulate approximation of π_{γ_s}, defined as in Equation (23) for γ = γ_s. The Effective Sample Size of these particles is then defined as ESS_s = 1/Σ_{i=1}^n (W_s^{(i)})² and provides a measure of the population variance. One extreme, i.e. ESS_s = 1, arises when a single particle has unit normalized weight whereas the rest have zero weights and as a result provide no information. The other extreme, i.e. ESS_s = n, arises when all the particles are equally informative and have equal weights W_s^{(i)} = 1/n. If the next bridging distribution π_{γ_{s+1}} is very similar to π_{γ_s} (i.e. γ_{s+1} ≈ γ_s), then ESS_{s+1} should not be that much different from ESS_s. On the other hand, if that difference is pronounced, then ESS_{s+1} could drop dramatically. Hence, in determining the next auxiliary distribution, we define an acceptable reduction in the ESS, i.e. ESS_{s+1} ≥ ζ ESS_s (where ζ < 1), and prescribe γ_{s+1} (Equation (23)) accordingly.

The proposed adaptive SMC algorithm is summarized in Algorithm 1. It should be noted that, unlike MCMC schemes, the particle perturbations in the Rejuvenation step do not require that P_s(·,·) is ergodic [15]. It suffices that it is a π_{γ_s}-invariant kernel, which readily allows adaptively changing its parameters in order to achieve better mixing rates.
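The ESS-controlled choice of the next tempering parameter can be sketched as a one-dimensional root search. The sketch below (our own; it assumes, for simplicity, equal weights at step s, e.g. right after resampling, and a user-supplied `log_inc(gamma)` returning the per-particle log-weight increments for a move γ_s → γ, as in the Reweighting step of Algorithm 1) uses bisection to find γ_{s+1} with ESS_{s+1} = ζ ESS_s:

```python
import numpy as np

def ess(logw):
    """Effective sample size 1 / sum_i W_i^2 from unnormalized log-weights."""
    w = np.exp(logw - logw.max())
    W = w / w.sum()
    return 1.0 / np.sum(W ** 2)

def next_gamma(gamma_prev, log_inc, zeta=0.9, tol=1e-6):
    """Bisection for the next tempering parameter: the largest practical
    step gamma in (gamma_prev, 1] such that the ESS after reweighting
    equals zeta times the current ESS (jump straight to 1 if possible)."""
    target = zeta * ess(log_inc(gamma_prev))       # zeta * ESS_s
    if ess(log_inc(1.0)) >= target:                # can finish in one step
        return 1.0
    lo, hi = gamma_prev, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ess(log_inc(mid)) >= target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Bisection is adequate here because, for the exponential-family bridges of Equation (23), the ESS typically decreases monotonically as γ moves away from γ_s.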
In the examples presented, a Metropolized Gibbs scheme was used to sample q and z separately by employing a Metropolis-Adjusted Langevin Algorithm (MALA) for each set of coordinates [44]. In particular, given (q_{s−1}^{(i)}, z_{s−1}^{(i)}), these consist of:

• Updating q_{s−1}^{(i)} → q_s^{(i)}:

    q_s^{(i)} − q_{s−1}^{(i)} = (Δt_q/2) ∇_q log π_{γ_s}(q_{s−1}^{(i)}, z_{s−1}^{(i)}) + √(Δt_q) r_q
                              = −(β Δt_q/2) ∇_q V(q_{s−1}^{(i)}, z_{s−1}^{(i)}) + √(Δt_q) r_q    (26)

• Updating z_{s−1}^{(i)} → z_s^{(i)}:

    z_s^{(i)} − z_{s−1}^{(i)} = (Δt_z/2) ∇_z log π_{γ_s}(q_s^{(i)}, z_{s−1}^{(i)}) + √(Δt_z) r_z
                              = −(β Δt_z/2) ( ∇_z V(q_s^{(i)}, z_{s−1}^{(i)}) − ∇_z Â(z_{s−1}^{(i)}; θ_{γ_s}) ) + √(Δt_z) r_z    (27)

Algorithm 1 Adaptive SMC
Require: s = 1, γ_1 = 0 and a population {(q_1^{(i)}, z_1^{(i)}), w_1^{(i)}}_{i=1}^n which approximates π_{γ_1} ≡ p(q,z|θ_{K,m}) in Equation (23).
Ensure: The final population {(q_s^{(i)}, z_s^{(i)}), w_s^{(i)}}_{i=1}^n provides a particulate approximation of π_{γ_s} in the sense of Equations (21), (22).
while γ_s < 1 do
    s ← s + 1
    {Reweighting - Importance Sampling}
    Let

        w_s^{(i)}(γ_s) = w_{s−1}^{(i)} π_{γ_s}(q_{s−1}^{(i)}, z_{s−1}^{(i)}) / π_{γ_{s−1}}(q_{s−1}^{(i)}, z_{s−1}^{(i)})
                       = w_{s−1}^{(i)} exp{−β(V(q_{s−1}^{(i)}, z_{s−1}^{(i)}) − Â(z_{s−1}^{(i)}; θ_{γ_s}))} / exp{−β(V(q_{s−1}^{(i)}, z_{s−1}^{(i)}) − Â(z_{s−1}^{(i)}; θ_{γ_{s−1}}))}
                       = w_{s−1}^{(i)} exp{ β Â(z_{s−1}^{(i)}; (γ_s − γ_{s−1})(θ_{K,m+1} − θ_{K,m})) }    (25)

    be the updated weights as a function of γ_s. Determine γ_s ∈ (γ_{s−1}, 1] so that ESS_s = ζ ESS_{s−1}.
    {Resampling}
    if ESS_s ≤ ESS_min then
        Resample
    end if
    {Rejuvenation}
    Use an MCMC kernel P_s((q_{s−1}^{(i)}, z_{s−1}^{(i)}), (q_s^{(i)}, z_s^{(i)})) that leaves π_{γ_s} invariant to perturb each particle (q_{s−1}^{(i)}, z_{s−1}^{(i)}) → (q_s^{(i)}, z_s^{(i)}).
end while

where r_q and r_z are i.i.d. standard Gaussian vectors. A Metropolis accept/reject step with respect to the target invariant density π_{γ_s}(·) was performed after each update. Two different time steps, Δt_q and Δt_z, were used for the q and z coordinates respectively. Their values were adjusted after each iteration s so as to retain an average acceptance ratio (over all n particles) between 50% and 80% [45]. The adaptivity afforded by the proposed scheme relies on the fact that ergodicity is not required of the rejuvenation step. As a result, several MALA time steps can be performed in Equations (26) and (27) (at additional computational expense), or other molecular dynamics samplers can be employed which could potentially exhibit better mixing or fit more closely to the physics of the problem at hand [6].

Finally, we note that the estimates of the ratio of normalization constants Z_s/Z_{s−1} between two successive unnormalized densities π_{γ_{s−1}} and π_{γ_s} can be obtained by averaging the unnormalized updated weights in Equation (25), as a direct consequence of the importance sampling identity:

    Z_s/Z_{s−1} = ∫ π_{γ_s}(q,z) dq dz / ∫ π_{γ_{s−1}}(q,z) dq dz
                = ∫ [π_{γ_s}(q,z)/π_{γ_{s−1}}(q,z)] [π_{γ_{s−1}}(q,z)/Z_{s−1}] dq dz
                ≈ Σ_{i=1}^n W_{s−1}^{(i)} π_{γ_s}(q_{s−1}^{(i)}, z_{s−1}^{(i)}) / π_{γ_{s−1}}(q_{s−1}^{(i)}, z_{s−1}^{(i)})    (28)

These estimators can be telescopically multiplied ([15,30]) in order to compute the ratio of normalization constants between any pair of distributions, as required in Equation (20).
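A generic MALA move of the kind used for the updates in Equations (26)-(27) can be sketched as follows; this is our own minimal, single-coordinate-block illustration (a fixed seed and generic `log_target`/`grad_log_target` callables are assumptions), not the authors' simulator coupling:

```python
import numpy as np

rng = np.random.default_rng(0)

def mala_step(x, log_target, grad_log_target, dt):
    """One Metropolis-Adjusted Langevin move: Euler-Maruyama proposal
    x + (dt/2) grad log pi + sqrt(dt) noise, corrected by accept/reject
    with the asymmetric Gaussian proposal densities."""
    mu = x + 0.5 * dt * grad_log_target(x)
    prop = mu + np.sqrt(dt) * rng.standard_normal(x.shape)
    mu_back = prop + 0.5 * dt * grad_log_target(prop)
    # log proposal densities q(prop|x) and q(x|prop), up to common constants
    log_q_fwd = -np.sum((prop - mu) ** 2) / (2.0 * dt)
    log_q_bwd = -np.sum((x - mu_back) ** 2) / (2.0 * dt)
    log_alpha = (log_target(prop) - log_target(x)) + (log_q_bwd - log_q_fwd)
    if np.log(rng.uniform()) < log_alpha:
        return prop, True
    return x, False
```

For the z-block of Equation (27), `log_target` would be log π_{γ_s}(q, ·) = −β(V(q, ·) − Â(·; θ_{γ_s})) up to a constant, so the drift reduces to the biased force shown there.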
Given the preceding discussion in sections 2.1, 2.2 and 2.3, we summarize below the basic steps in the proposed free energy computation scheme. In the inner loop, and for fixed K, gradient descent (subsection 2.1) is performed, which makes use of the adaptive SMC scheme (subsection 2.3) in order to compute the expectations in the gradient. In the outer loop, the cardinality of the expansion K is increased by adding one kernel (i.e. K ← K + 1) based on Equation (19). This is terminated when the KL gain (Equation (20)) does not exceed a prescribed tolerance.

Algorithm 2 Calculation of the free energy at a given temperature.
Require: K = 0, θ_0 ≡ 0 and a particulate approximation of p(q,z|θ_0) (Equation (5)) at the desired temperature β.
while true do
    Calculate Δ_K based on Equation (20).
    if Δ_K ≤ tol then
        Break the loop.
    else
        Add the optimal (K+1)th kernel based on Equation (19) and set K ← K + 1.
        repeat
            Estimate the gradient at θ_{K,m} and calculate the update θ_{K,m+1} based on Equation (18).
            Use adaptive SMC (section 2.3) to construct a particulate approximation of p(q,z|θ_{K,m+1}) from p(q,z|θ_{K,m}).
        until Convergence criteria are met.
    end if
end while

2.4 Obtaining the free energy landscape for various temperatures.

The methodology described in the previous sections is suitable for calculating the free energy as a function of z at a given temperature. However, one is often interested in the free energy landscape as a function of the temperature as well. In order to achieve this goal, we make use of the following two facts: firstly, the free energy landscape at higher temperatures is flatter; secondly, nearby temperatures have similar free energy landscapes. Based on these, we propose a natural extension to the sequential sampling framework of subsection 2.3 that can efficiently produce estimates of the free energy at various temperatures.
The idea is to start from a higher temperature, compute the free energy as described before, and then gradually move towards lower temperatures, using the free energy of the previous temperature as an initial guess for the new one. In particular, given the free energy estimate $\hat A_{\beta_1}(\mathbf{z}; \boldsymbol\theta(\beta_1))$ and the particulate approximation of the joint density $p_{\beta_1}(\mathbf{q}, \mathbf{z} \mid \boldsymbol\theta(\beta_1))$ at a temperature $1/\beta_1$, we propose employing the aforementioned adaptive SMC in order to obtain a particulate approximation of the following joint density at $\beta_2 > \beta_1$ (i.e. for a lower temperature):
$$
p_{\beta_2}(\mathbf{q}, \mathbf{z} \mid \boldsymbol\theta(\beta_1)) \propto \exp\left\{-\beta_2 \left( V(\mathbf{q}, \mathbf{z}) - \hat A_{\beta_1}(\mathbf{z}; \boldsymbol\theta(\beta_1)) \right)\right\} \quad (29)
$$
The iterations enumerated in Algorithm 2 can then be carried out in the same fashion, by updating the existing $\boldsymbol\theta$ as well as adding new kernels if the convergence criteria are not satisfied. The critical step involves building a sequence of distributions from $p_{\beta_1}(\mathbf{q}, \mathbf{z} \mid \boldsymbol\theta(\beta_1))$ to $p_{\beta_2}(\mathbf{q}, \mathbf{z} \mid \boldsymbol\theta(\beta_1))$ in Equation (29). For this purpose, and similarly to a simulated annealing schedule, we employ:
$$
\pi_\gamma(\mathbf{q}, \mathbf{z}) \propto \exp\left\{-\left((1-\gamma)\beta_1 + \gamma \beta_2\right) \left( V(\mathbf{q}, \mathbf{z}) - \hat A_{\beta_1}(\mathbf{z}; \boldsymbol\theta(\beta_1)) \right)\right\} \quad (30)
$$
The steps in Algorithm 1 should be adjusted to the aforementioned sequence of bridging distributions, with the most striking difference in the Reweighing step, where the updated weights in Equation (25) should now be given by:
$$
w_s^{(i)}(\gamma_s) = w_{s-1}^{(i)} \frac{\pi_{\gamma_s}(\mathbf{q}_{s-1}^{(i)}, \mathbf{z}_{s-1}^{(i)})}{\pi_{\gamma_{s-1}}(\mathbf{q}_{s-1}^{(i)}, \mathbf{z}_{s-1}^{(i)})}
= w_{s-1}^{(i)} \exp\left\{-(\gamma_s - \gamma_{s-1})(\beta_2 - \beta_1) \left( V(\mathbf{q}_{s-1}^{(i)}, \mathbf{z}_{s-1}^{(i)}) - \hat A_{\beta_1}(\mathbf{z}_{s-1}^{(i)}; \boldsymbol\theta(\beta_1)) \right)\right\} \quad (31)
$$
We demonstrate the efficacy of such an approach in the last example of section 3. It is finally noted that, at the beginning of the iterations at each new temperature, kernels with very small weights $\theta_j$ were removed, i.e. those with $\theta_j / \max_i \theta_i \le 0.01$.
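The reweighing of Equation (31) is conveniently carried out in log-space for numerical stability. A minimal sketch, in which the energy values `V_vals` and free energy estimates `A_vals` are placeholders standing in for $V(\mathbf{q}^{(i)}, \mathbf{z}^{(i)})$ and $\hat A_{\beta_1}(\mathbf{z}^{(i)}; \boldsymbol\theta(\beta_1))$:

```python
import numpy as np

def reweigh(w_prev, V_vals, A_vals, gamma_prev, gamma_s, beta1, beta2):
    """One bridging step of Eq. (31): update the particle weights, renormalize."""
    log_w = np.log(w_prev) - (gamma_s - gamma_prev) * (beta2 - beta1) * (V_vals - A_vals)
    log_w -= log_w.max()                   # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum()

rng = np.random.default_rng(1)
n = 1000
w0 = np.full(n, 1.0 / n)
V_vals = rng.normal(0.0, 1.0, n)           # placeholder energies
A_vals = np.zeros(n)                       # placeholder free energy estimates
w1 = reweigh(w0, V_vals, A_vals, gamma_prev=0.0, gamma_s=0.2, beta1=1.0, beta2=2.0)
```

The size of the step $\gamma_s - \gamma_{s-1}$ would in practice be chosen adaptively, e.g. to keep the effective sample size of the updated weights above a threshold.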
2.5 The reaction coordinate case.

The proposed method was described for the alchemical case. However, it is straightforwardly generalized to cover also the general reaction coordinate case. Let $\xi : \mathcal{M} \to D$ be a function of the system coordinates. This function is called a reaction coordinate [33]. It is evident that $\mathbf{q}$ can be viewed in a probabilistic framework as a random variable and so
$$
\mathbf{z} = \xi(\mathbf{q}) \quad (32)
$$
is also a random variable. The probability distribution of $\mathbf{z}$ can be found by integrating out $\mathbf{q}$:
$$
p(\mathbf{z} \mid \beta) = \int p(\mathbf{q})\, \delta(\xi(\mathbf{q}) - \mathbf{z})\, d\mathbf{q} \propto \int \exp(-\beta V(\mathbf{q}))\, \delta(\xi(\mathbf{q}) - \mathbf{z})\, d\mathbf{q} \quad (33)
$$
The free energy $A(\mathbf{z})$ with respect to the reaction coordinate $\xi(\mathbf{q})$ is defined to be the effective potential of $\mathbf{z} = \xi(\mathbf{q})$:
$$
p(\mathbf{z}) \propto \exp(-\beta A(\mathbf{z})) \quad (34)
$$
Combining these two equations we see that:
$$
A(\mathbf{z}) = -\beta^{-1} \log \int \exp(-\beta V(\mathbf{q}))\, \delta(\xi(\mathbf{q}) - \mathbf{z})\, d\mathbf{q} \quad (35)
$$
If $\hat A(\mathbf{z}; \boldsymbol\theta)$ is an estimate of $A(\mathbf{z})$, we define a new probability distribution over $\mathbf{q}$ by:
$$
p(\mathbf{q} \mid \boldsymbol\theta) \propto \mathbb{1}_D(\xi(\mathbf{q})) \exp\left(-\beta \left( V(\mathbf{q}) - \hat A(\xi(\mathbf{q}); \boldsymbol\theta) \right)\right) \quad (36)
$$
It is straightforward to see that under this new distribution for $\mathbf{q}$, the pdf of $\mathbf{z}$ becomes:
$$
p(\mathbf{z} \mid \boldsymbol\theta) = \int p(\mathbf{q} \mid \boldsymbol\theta)\, \delta(\xi(\mathbf{q}) - \mathbf{z})\, d\mathbf{q} \propto \mathbb{1}_D(\mathbf{z}) \exp\left(-\beta \left( A(\mathbf{z}) - \hat A(\mathbf{z}; \boldsymbol\theta) \right)\right) \quad (37)
$$
This coincides with the expression in Equation (9) and therefore the ensuing derivations hold identically. From a practical point of view, sampling need only be performed in the $\mathbf{q}$ space, and therefore the adaptive SMC schemes are employed to obtain particulate approximations of the density in Equation (36). The only difference appears in the MCMC-based Rejuvenation step, where the MALA sampler is employed only with regards to $\mathbf{q}$.
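The relation between Equations (33)-(34) can be illustrated numerically: for a toy potential $V(\mathbf{q}) = (q_1^2 + q_2^2)/2$ with $\beta = 1$ and reaction coordinate $\xi(\mathbf{q}) = q_1$, the free energy is $A(z) = z^2/2$ up to an additive constant, and it can be recovered from a histogram of $\xi(\mathbf{q})$ over exact samples of $p(\mathbf{q})$. All choices below (potential, coordinate, bin counts) are illustrative:

```python
import numpy as np

beta = 1.0
rng = np.random.default_rng(3)
q = rng.normal(0.0, 1.0 / np.sqrt(beta), size=(500_000, 2))  # samples of p(q)
z = q[:, 0]                                                  # z = xi(q)
hist, edges = np.histogram(z, bins=40, range=(-1.5, 1.5), density=True)
mids = 0.5 * (edges[:-1] + edges[1:])
A_est = -np.log(hist) / beta               # effective potential, cf. Eq. (34)
A_est -= A_est.min()                       # fix the additive constant
A_true = mids**2 / 2
A_true -= A_true.min()
```

Such a direct histogram estimate is of course only viable in low dimensions and absent metastability; the biased density of Equation (36) is what makes the approach practical in general.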
In particular the update of Equation (26) now becomes:
$$
\mathbf{q}_s^{(i)} - \mathbf{q}_{s-1}^{(i)} = \frac{\Delta t_q}{2} \nabla_{\mathbf{q}} \log \pi_{\gamma_s}(\mathbf{q}_{s-1}^{(i)}) + \sqrt{\Delta t_q}\, \mathbf{r}_q
= -\frac{\beta \Delta t_q}{2} \left( \nabla_{\mathbf{q}} V(\mathbf{q}_{s-1}^{(i)}) - \frac{\partial \hat A}{\partial \mathbf{z}} \nabla_{\mathbf{q}} \xi(\mathbf{q}_{s-1}^{(i)}) \right) + \sqrt{\Delta t_q}\, \mathbf{r}_q \quad (38)
$$
It is noted that, in contrast to some ABF methods which require second-order derivatives of $\xi$ [24], the proposed technique only needs first-order derivatives. Finally, we point out that the ability of the proposed approach to efficiently provide estimates of parametrized free energy surfaces (as in section 2.4 with respect to the temperature $\beta$) can also be exploited in the reaction coordinate case by defining a joint density:
$$
p(\mathbf{q}, \mathbf{z} \mid \boldsymbol\theta) \propto \exp\left(-\beta \left( V(\mathbf{q}) + \frac{\mu}{2} \|\mathbf{z} - \xi(\mathbf{q})\|^2 - \hat A_\mu(\mathbf{z}; \boldsymbol\theta) \right)\right) \quad (39)
$$
where, as in [37], an artificial spring with stiffness $\mu$ has been added. Clearly for $\mu \to \infty$ one recovers the aforementioned description, but for all other values of $\mu$ the formulation reduces to that of Equation (5), where in place of $V(\mathbf{q}, \mathbf{z})$ we now have $V(\mathbf{q}) + \frac{\mu}{2} \|\mathbf{z} - \xi(\mathbf{q})\|^2$. One can therefore obtain free energy surfaces for various $\mu$ values. For smaller $\mu$ the free energy would be flatter, and in the extreme case of $\mu = 0$ it would be constant. As $\mu$ increases, the complexities of the free energy surface would become pronounced. Hence, by exploiting the idea of section 2.4, a sequence of problems parametrized by $\mu$ rather than $\beta$ can be constructed to gradually move to larger $\mu$ values, using the free energy of the previous $\mu$ as an initial guess for the new one. The adaptive SMC scheme would ensure a smooth enough transition while retaining a good level of accuracy for the approximations obtained.
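The drift term of Equation (38) can be sketched in a toy 1-D setting with $\xi(q) = q$, $V(q) = q^2/2$ and a hypothetical linear free energy estimate $\hat A(z) = 0.3z$; the Metropolis accept/reject correction mentioned in the text is omitted here for brevity, so this is the unadjusted Langevin update only. Only first derivatives of $\xi$ are needed, as noted above.

```python
import numpy as np

beta, dt = 1.0, 0.01
grad_V = lambda q: q                        # dV/dq for V(q) = q^2/2
dA_dz = lambda z: 0.3                       # d A_hat / dz (assumed estimate)
grad_xi = lambda q: 1.0                     # d xi / dq for xi(q) = q

rng = np.random.default_rng(2)
q = np.full(2000, 1.0)                      # an ensemble of particles
for _ in range(2000):
    # Biased drift of Eq. (38): -(beta dt / 2) (grad V - A_hat' grad xi)
    drift = -0.5 * beta * dt * (grad_V(q) - dA_dz(q) * grad_xi(q))
    q = q + drift + np.sqrt(dt) * rng.standard_normal(q.size)
# Under the biased density exp(-(q**2/2 - 0.3*q)) the ensemble mean tends to 0.3.
```

Setting `dA_dz` to zero recovers plain unbiased Langevin sampling of $\exp(-\beta V)$; a converged $\hat A$ flattens the effective potential along $\xi$.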
3 Numerical Examples

3.1 Two-Dimensional Toy Example

Consider a two-dimensional system [54,34] with a single parameter $z$, interacting with potential energy
$$
V(q; z) = \cos(2\pi z)(1 + d_1 q) + d_2 q^2
$$
Assume that $q$ given $z$ and $\beta$ is distributed according to
$$
p(q \mid z, \beta) \propto \exp(-\beta V(q; z))
$$
where $\beta$ is also a fixed parameter that plays the role of an inverse temperature. We wish to calculate an approximation $\hat A(z)$ of the free energy $A(z)$ on an interval $D = [-0.5, 0.5]$. The true free energy can be found analytically to be
$$
A(z) = \cos(2\pi z) - \frac{d_1^2 \cos^2(2\pi z)}{4 d_2} + c
$$
where $c$ is a constant that depends upon the specific choice of the fixed parameters. In what follows, we choose $c$ so that $A(-0.5) = 0$. To demonstrate our method in this simple example we used $d_1 = 2$, $d_2 = 30$. The potential energy $V(q; z)$ for this choice of the parameters is depicted in Figure 1(a). We fix the inverse temperature to $\beta = 10$. As shown in Figure 1(b), the distribution is bimodal, with a big region of practically zero probability separating the two modes. Hence, metastability along the parameter $z$ is apparent. The performance of the proposed method with respect to the number of particles used in the adaptive SMC scheme is depicted in Figures 2 and 3, which show the evolution of the estimated free energy landscape with $n = 100$ and $n = 10{,}000$ particles respectively. In both cases the method is capable of capturing the correct characteristics of the reference solution and, as expected, the noise in the computations is smaller when the number of particles is larger. In both cases the Robbins-Monro learning series is picked to be $\eta_m = m^{-p}$ with $p = 0.6$ and the learning rate $\lambda = 0.1$. Figure 4(a) shows the first three kernels selected by the greedy scheme of section 2.2.
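The analytic expression for $A(z)$ above follows from a Gaussian integral over $q$ and can be verified by one-dimensional quadrature: $-\beta^{-1}\log\int \exp(-\beta V(q;z))\,dq$ should differ from $\cos(2\pi z) - d_1^2\cos^2(2\pi z)/(4d_2)$ only by the additive constant $c$. A quick numerical check (grid choices are illustrative):

```python
import numpy as np

d1, d2, beta = 2.0, 30.0, 10.0
q = np.linspace(-2.0, 2.0, 4001)
dq = q[1] - q[0]

def A_quad(z):
    """Quadrature evaluation of -log(int exp(-beta V(q; z)) dq) / beta."""
    V = np.cos(2 * np.pi * z) * (1 + d1 * q) + d2 * q**2
    return -np.log(np.sum(np.exp(-beta * V)) * dq) / beta

def A_exact(z):
    """Closed form up to the additive constant c."""
    c2 = np.cos(2 * np.pi * z)
    return c2 - d1**2 * c2**2 / (4 * d2)

zs = np.linspace(-0.5, 0.5, 11)
diff = np.array([A_quad(zz) - A_exact(zz) for zz in zs])
# diff is numerically constant in z: that constant is c (before normalization).
```

The integrand is a narrow Gaussian in $q$ (standard deviation $1/\sqrt{2\beta d_2} \approx 0.04$), so the simple sum over a fine grid is already extremely accurate.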
Figure 4(b) depicts the log-values of the kernel weights $\{\theta_j\}_{j=1}^{K}$, which clearly demonstrate the ability of the proposed approach to provide sparse approximations. The first kernel selected has the greatest weight and hence it contains the majority of the information about the free energy curve. The rest of the kernels are progressive corrections of the estimate given by the first kernel. This conclusion is also supported by the result of Figure 5, which shows the evolution of the reduction in the KL divergence with respect to the total number of iterations, as quantified by adding the $\Delta_{K+1}$ in Equation (20). Clearly the first kernel offers the greatest KL gain ($\Delta_1$) and further kernel additions offer progressively smaller reductions in the KL divergence.

Fig. 1. Potential energy and pdf for the toy example of section 3.1. (a) The potential energy $V(q_1, q_2)$ for $d_1 = 2$, $d_2 = 30$. (b) The probability distribution $p(q_1, q_2 \mid \beta)$ for $d_1 = 2$, $d_2 = 30$, $\beta = 10$.

3.2 WCA Dimer

We consider $N = 16$ atoms in a two-dimensional fully periodic box of side $l$ which interact with a purely repulsive WCA pair potential [34]:
$$
V_{WCA}(r) =
\begin{cases}
4\epsilon \left[ \left(\dfrac{\sigma}{r}\right)^{12} - \left(\dfrac{\sigma}{r}\right)^{6} \right] + \epsilon, & \text{if } r \le r_0 \\
0, & \text{otherwise}
\end{cases}
$$
where $\sigma$ and $\epsilon$ give the length and energy scales respectively, and $r_0 = 2^{1/6}\sigma$ is the cutoff at which the potential vanishes. Two of these atoms (say atoms 1 and 2) are assumed to form a dimer.

Fig. 2 [panels (a) K=1, (b) K=2, (c) K=3, (d) K=8; axes $z$ vs. $A(z)$, against the reference solution].
Evolution of free energy estimates at various kernel numbers $K$ when using $n = 100$ particles in the adaptive SMC scheme.

Their interaction is described instead with a double well potential:
$$
V_S(r) = h \left[ 1 - \frac{(r - r_0 - w)^2}{w^2} \right]^2
$$
where $h$, $w$, $r_0$ are fixed parameters and $r$ the distance between the two atoms. This potential has two equilibrium points, $r_0$ and $r_0 + 2w$. The barrier separating the two equilibria is $h$. Let $\mathbf{q} = (q_1, q_2, \ldots, q_N)$ with $q_i \in \mathbb{R}^2$ denoting the position of atom $i$. The potential energy of the system is:
$$
V(\mathbf{q}) = V_S(|q_1 - q_2|) + \sum_{i=1}^{2} \sum_{j=3}^{N} V_{WCA}(|q_i - q_j|) + \sum_{2 < i < j \le N} V_{WCA}(|q_i - q_j|)
$$
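The two pair potentials above can be sketched directly, assuming the standard WCA cutoff $r_0 = 2^{1/6}\sigma$; the parameter values below ($\sigma$, $\epsilon$, $h$, $w$) are illustrative, not taken from the paper:

```python
import numpy as np

sigma, eps = 1.0, 1.0
r0 = 2.0 ** (1.0 / 6.0) * sigma             # WCA cutoff (assumed 2^(1/6) sigma)
h, w = 5.0, 0.3                             # barrier height and well half-width

def V_wca(r):
    """Purely repulsive WCA potential: truncated-shifted LJ, zero beyond r0."""
    if r > r0:
        return 0.0
    sr6 = (sigma / r) ** 6
    return 4.0 * eps * (sr6**2 - sr6) + eps

def V_s(r):
    """Double-well dimer potential: minima at r0 and r0 + 2w, barrier h."""
    u = (r - r0 - w) ** 2 / w**2
    return h * (1.0 - u) ** 2

# Sanity checks: V_wca vanishes continuously at the cutoff; V_s has equal
# minima at r0 and r0 + 2w separated by a barrier of height h at r0 + w.
print(V_wca(r0), V_s(r0), V_s(r0 + 2 * w), V_s(r0 + w))
```

The shift by $+\epsilon$ and the cutoff at the Lennard-Jones minimum are what make $V_{WCA}$ both continuous and purely repulsive.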
