The Geometry of Hamiltonian Monte Carlo

The Geometry of Hamiltonian Mon te Carlo Michael Betancourt MIT Dep a rtment of Physics and L ab or atory for Nucle ar Scienc e, Cam bridge, MA 02139, USA Leo C. Stein ∗ MIT Dep artment of Physics and Kav li Institut e, Cambridge, MA 02139, USA (Dated: Novem ber 27, 2024) With its systematic exploration of probabilit y distributions, Hamiltonian Monte Carl o is a p otent Mark o v Chain Mon te Carlo technique; it is an approac h, h o w ever, ultimately contingent on the choi ce of a suitable Hamiltonian function. By examining b oth t he symp lectic geometry u nderlying Hamiltonian dynamics and the requirements of Marko v Chain Monte Carlo, we construct the gen- eral form of admissible Hamiltonians and propose a p articular choice with p otential app lication in Ba yes ian inference. P ACS n um bers: 02.70.Tt,02.40.Hw Since its introduction by Duane, et a l. [1] and devel- opment b y Neal [2 ], Ha milto nia n Monte Carlo (HMC) has proven to be a powerful Mar ko v Cha in Monte Carlo metho dology . By utilizing Hamiltonian dyna mics a s a Marko v transition kernel, HMC coherently explore s the space of a tar get dis tribution a nd re sults in r apidly mix- ing Mar ko v chains. Such dyna mics, how ever, are depen- dent on a particular Hamiltonia n function and existing applications ha ve b een limited mostly to for ms co mmon in the study of physical s y stems. The feature of Hamiltonian dynamics that furnishes such eﬃcient Marko v chains is a co nsequence of not an y particular Ha milto nian function, but ra ther an implicit symplectic geo metr y o n the under lying parameter spa ce. By ﬁrst understanding the prop erties of this g e ometry , its relation to probability measures, and then the constraints of a well-p osed tra nsition kernel, we determine the most general form of HMC a nd discuss immediate ex tens ions to current work. SYMPLECTIC GEOMETR Y Symplectic g e ometry is the study of an intrinsic struc- ture of cer tain diﬀerentiable manifolds having applica- tions spanning the ph ysical sciences [3]. After in tro duc- ing the tangent and cotangent bundles that der ive from a given base ma nifold, we s how how the latter is natu- rally equipp ed with a symplectic g eometry , and how that structure admits a ﬂow along the cotangent bundle. The T angen t and Cotangen t B undles A smo o th, n -dimensional ma nifold M is deﬁned as a top ological space where the lo cal neig hborho o d o f any ∗ leostein@mit.edu given p oint P ∈ M ‘lo o ks like’ R n . F ormally , a home- omorphism ψ α : O α → U α m ust exist from any ope n subset O α ⊂ M to a n op en subset U α ⊂ R n such that the map ψ β ◦ ψ − 1 α is contin uously diﬀerentiable for any ov erlapping subsets O α ∩ O β 6 = ∅ . Of the many p ossible diﬀerentiable mapping s b etw een diﬀerentiable manifolds, tw o of the most imp o rtant ar e curves, which map open sets of R to neighbor ing po ints on the manifold, and functions, which map e a ch p oint P to R . The co llection of all smo o th curves and all s mo oth functions on M form diﬀeren tiable m anifolds in their own right, and these t wo manifolds naturally equip each point P ∈ M with t w o lo cal vector spaces. T angent vectors of all curves pas s ing thro ugh P form an n -dimensional vector s pa ce known as the tangent s pace, T P ; the gra di- ent s o f all manifold functions in a neighbor ho o d of p oint P span another n -dimensional v ector space denoted the cotangent space, T ∗ P . These tw o vector spaces are dual in the sense that a n elemen t o f T P serves a s a linear transformatio n mapping an element of T ∗ P to R , and vice versa. A set of n functions,  q i ( P )  , 1 are co ordina tes on the manifold if they unique ly identify eac h p oint, and any choice of co o r dinate functions induces natural bases fo r the tangent and co tangent spaces acro s s M : the dir e c- tional deriv ativ e v ectors, ~ e i ≡ ∂ ∂ q i , and gr adients, f d q i , resp ectively . These co or dina te ba s es provide an expan- sion of any element ~ v ∈ T P or ˜ p ∈ T ∗ P , ~ v = n X i =1 v i ~ e i ˜ p = n X i =1 p i f d q i , where the comp onents v i (resp. p i ) a re manifold func- tions dep ending on ~ v (resp. ˜ p ) and the q i . Note tha t the 1 Note tha t the index here, i ∈ { 1 , . . . , n } , is sim ply a label and need not, for example, b e considered the comp onen ts of a vect or. 2 comp onents uniquely identify each element o f the vector spaces: when T P and T ∗ P are trea ted a s manifolds , the comp onent functions  v i  (resp. { p i } ) serv e as proper co ordinate functions on T P (resp. T ∗ P ). Collecting all o f the tang ent spaces acro ss the man- ifold yields a nother 2 n -dimensional manifold called the tangent bund le , T M . A p oint in T M cor resp onds to a sp eciﬁc vector in a par ticular vector space T P and lo - cally the bundle factors into a trivial product space of the base ma nifo ld M and the tangent space T P . Co- ordinates on T M then decomp ose into the direct sum of co o rdinates on the t wo manifolds, { q i , v j } . Likewise the cotangent spaces can b e collected into the co tangent bundle T ∗ M where each p oint is ident iﬁed by the co or di- nates { q i , p j } . In analog y with physical systems, the t wo bundles T M a nd T ∗ M are also known as co nﬁguration space and phas e space, r esp ectively (T able I ). The Symplectic F orm A change of coo rdinates on the base ma nifold M ,  q i  →  Q i  , induces a unique linea r transformation o n the bas is one - forms f d q i on M , f d Q i = n X j =1 ∂ Q i ∂ q j f d q j , (1) and on the co ordinate decomp osition o f a o ne-form ˜ p , P i = n X j =1 ∂ q j ∂ Q i p j . (2) Note that the J acobian and inv erse Jacobia n of the co - ordinate transformation, ∂ q i /∂ Q j and ∂ Q i /∂ q j resp ec- tively , are matr ix inv erses, n X j =1 ∂ Q i ∂ q j ∂ q j ∂ Q k = δ i k . Co ordinate trans formations on M also induce a fam- ily o f transfor mations of the ﬁelds on T ∗ M ; these p oint tr ansforma tions do not mix the  q i  and { p i } co ordi- nates and pr eserve the trivial ﬁb er structure o f the man- ifold. Spec iﬁc a lly , the one-forms d q i ov er T ∗ M (whic h are just the vertic al lifts of the one-for m ﬁelds f d q i ov er M ) tr a nsform as in E q . ( 1 ), while the co ordina te func- tions p i transform a s in Eq. ( 2 ). All of the co ordinate functions,  q i , p j  , a nd their asso ciated basis one-fo r ms  d q i , d p j  are clea rly dep endent on the co o rdinates of the base manifold. Because the tw o transfor mations in Eq. ( 1 ) a nd Eq. ( 2 ) are in v erses, how ever, there are s ome na tural o b jects on T ∗ M indep endent of the manifold coor dinates. In par- ticular, the one-form θ , 2 θ ≡ n X i =1 − p i d q i . is co o rdinate indep endent by cons tr uction. The exterior deriv a tive dθ , ω ≡ d θ = X i d q i ∧ d p i , (3) is more subtle: a lthough the transforma tion of d p i in- volv es deriv ativ es of the Jacobian, ω is co or dinate inde- pendent, ω ′ = X j d Q j ∧ d P j = X i d q i ∧ d p i = ω . This can b e seen either b y explicit calcula tion [3], o r more simply b y noting that, b ecause θ is co ordina te indepen- dent , the exterior deriv a tive d θ must b e a s well. This symple ctic form , ω , is a tw o-form, or anti- symmetric tensor of rank (0 , 2), which maps a given vec- tor ﬁeld ~ X to a one-form ˜ X , ˜ X ≡ ω ( ~ X , · ) . Because ω is non-degenera te, this linear transforma tio n is inv ertible, and any one-form ˜ Y o v er phase spac e also deﬁnes a unique vector ﬁeld ~ Y such that ω ( ~ Y , · ) ≡ ˜ Y . Starting from any function H ( q , p ) on T ∗ M one c an con- struct a one- form via the exterior der iv a tive, d H , and, from this, deﬁne the H amiltonian ve ctor ﬁeld ~ X H asso- ciated with H such that ω ( ~ X H , · ) = d H . Hamiltonian Flo w By sp ecifying a direction at each p oint P ∈ T ∗ M , the vector ﬁeld ~ X H deﬁnes cur ves thro ugh the manifold. These curves cov er the entire manifold witho ut intersect- ing, mapping every po int P to another a long the lo ca l curve and generating a ﬂow of the ent ire cota ng ent bun- dle. In particula r, the v ector ﬁeld diﬀerentiates any func- tion f along the deﬁned v ector at each point, ~ X H ( f ) =  ~ X H  i ∂ ∂ x i ( f ), where x i =  q i , p j  are the co ordinates o f T ∗ M . If the cur ves ar e par ametrized by t ∈ R , then the action of the v ector ﬁeld is simply ~ X H ( f ) = d dt ( f ). Ap- plying the vector ﬁeld ~ X H to the co ordina tes functions 2 The resemblanc e of θ and an expansion of a one-form ˜ p on M i s a result of ambiguous notation. Here p i are co ordinate functions on T ∗ M , and d q i are one-forms on T ∗ M . This one-form θ (or its negation − θ ) i s sometimes called the tautological form. 3 Manifold Manifold Name Coordinate F unctions Coordinate Names M P osition Sp ace q i P osition T M Conﬁguration Space q i , v j P osition, V elo city T ∗ M Phase Space q i , p j P osition, Momenta T ABLE I. The tangent and cotangent b undles, and their co ordinates, are often referenced in analogy to physical systems in classical mechanics. themselves y ields ordinary diﬀerential equations of each co ordinate along the in tegral curves, ˙ q i ( q , p ) ≡ ~ X H ( q i ) = d q i d t = + ∂ H ∂ p i ˙ p i ( q , p ) ≡ ~ X H ( p i ) = d p i d t = − ∂ H ∂ q i , known as Hamilton ’s e quations in classica l mechanics. 3 , 4 Similarly , the ra te of change of a n arbitrary scalar func- tion, f , alo ng the in tegral curves of ~ X H is given by ∂ f ∂ t = ~ X H ( f ) = d f ( ~ X H ) = ω ( ~ X f , ~ X H ) = X i ∂ f ∂ q i ∂ H ∂ p i − ∂ f ∂ p i ∂ H ∂ q i ≡ { f , H } , where { f , H } is known as the Poisson brack et. Note that the function f is inv ar iant along integral curves if the Poisson bracket v anishes; in particular the initia l function H , or Hamiltonia n, is a lways preser ved, ∂ H ∂ t = ~ X H ( H ) = { H , H } = 0 . The rate of change of a genera l geometr ic o b ject along the integral curves of ~ X H is g iven b y using the Lie deriv a- tive, L ~ X H . 5 By deﬁnition the symplectic fo rm is closed, d ω = 0 , a nd as a r esult the Lie der iv ative along ~ X H of ω v anis he s , L ~ X H ω = 0 . 3 The resulting H amiltonian ﬂow is often denoted Hamiltonian ev olution, or H amiltonian dynamics. 4 Local Hamiltonian ﬂo w is not alwa ys adequate: constrain ts of the co ordinate functions, for example, r equire discon tin uous jumps in momentum otherwise kno wn as specular reﬂection (Ap- pendix A). 5 The scalar funct ions considered abov e are geometric ob jects in their own right and, indeed , the Lie deriv ative of a scalar ﬁeld agrees with the action of the vec tor ﬁeld. Consequently , the Lie deriv ativ e of the top-r ank diﬀeren- tial volume for m Ω Ω ≡ ω n ≡ ω ∧ ω ∧ . . . ∧ ω | {z } n times also v a nishes, L ~ X H Ω = 0 , implying tha t diﬀerential v olume e lement s o f phase s pa ce are preserved under the ev olution a lo ng the integral curves. Note that, with b oth the Hamiltonia n and the diﬀeren- tial phase space v olume pr e s erved, the phase space den- sity F ( H )Ω for any smo oth scalar function F : R → R is also inv ariant alo ng in tegral curves, 6 L ~ X H F ( H )Ω =  L ~ X H F ( H )  Ω + F ( H ) L ~ X H Ω = F ′ ( H )  L ~ X H H  Ω = 0 . This prop erty is inherent to the s y mplectic geometry of the cotangent bundle a nd holds for a ny choice of Hamil- tonian and scalar function F . THE IDENTIFICA TION OF F ORMS AND MEASURES A top rank form on a n -dimensional manifold is a ten- sor, indep endent of the ma nifo ld co o r dinates. Given a particular choice of co ordinates  x i  deﬁned a c r oss the ent ire manifold, how ever, any such top-r ank for m can b e decomp osed as φ = f x ( x ) d n x , where f x ( x ) is a scalar function determined by the co- ordinates a nd d n x is the volume form of the co or dinates constructed by wedging the gr adients of ea ch co ordinate function together , d n x = d x 1 ∧ · · · ∧ d x n . 6 This geometric statemen t is equiv alen t to Liouville’s theorem i n statistical m ec hanics [4]. 4 Int ro ducing the ob ject f which ev a lua tes to the appropri- ate f x ( x ) for a ny g iven s et of co ordinate functions  x i  , the decomp ositio n b ecomes φ = f ( x ) d n x . By constr uction, the ob jects f ( x ) a nd d n x in the de- comp osition depend on the co o rdinates: under a c hange of co or dinates from  x i  →  X i  the tw o terms trans- form by acquir ing a deter mina nt of the Jacobian matrix ∂ x /∂ X , f ( X ) =     ∂ x ∂ X     +1 f ( x ) d n X =     ∂ x ∂ X     − 1 d n x . These ob jects are not tensor s but rather tensor densities . In gener al a tensor dens it y of weight w transforms as 7 ρ ( X ) =     ∂ x ∂ X     w ρ ( x ) , with f ( x ) and d n x immediately identiﬁ ed as tensor densi- ties of weigh t +1 and − 1, re sp e ctively . Note that, when the tw o ob jects are co mbin ed, the additiona l Jacobia n factors cancel to give a co o rdinate indep endent ob ject as exp ected of the original tensor. The top ologica l spa ce that forms the ma nifold M ca n also b e endow ed with a c o ordinate indep endent mea- sure that, given co ordina tes, can b e decomp os ed in to tw o co ordinate-de p endent ob jects [5]. Any Borel measur e µ can b e wr itten as d µ = f ( x ) d n x , where d n x is no w the Lebesgue measure on the chosen co ordinates and f ( x ) is the Rado n- Nikodym deriv ativ e of µ with r esp ect to d n x . Under a c o ordinate transfo rma- tion, these tw o ob jects acquire a facto r of the Jacobia n just as a bove, f ( X ) =     ∂ x ∂ X     +1 f ( x ) d n X =     ∂ x ∂ X     − 1 d n x . Indeed the similarity of the tw o systems is no coincidence: provided that the manifold ca n be o riented, the Riesz representation theorem guarantees that the spa ce of top- rank for ms is lo cally isomorphic to the space of mea sures [5], providing an identiﬁcation b etw een the form φ and measure µ as well as the terms in their a bove deco mp o s i- tions. If the form is everywhere p ositive and the integral 7 The deﬁnition of the sign of the weigh t can v ary within the lit- erature; here we use the conv en tion of [3]. ov er all of M is ﬁnite then the corres po nding measure bec omes a normaliza ble Borel measure fro m w hich one can build a probabilistic space. Note that ob jects on the cotang ent bundle ca n b e densities with re s pe ct to co o rdinate transformatio ns on T ∗ M , a cquiring factor s of | ∂ ( q , p ) /∂ ( Q , P ) | , or with re- sp ect to transforma tio ns on M , picking up o nly factors of | ∂ q /∂ Q | . The point tr ansformatio ns intro duced ab ov e are a s p ecia l case of symple ctomorphisms , which preserve the symplectic form ω and hav e Jacobia n determinant | ∂ ( q , p ) /∂ ( Q , P ) | = +1. Therefore, any o b ject tr a ns- forming as a densit y with resp ect to T ∗ M , such as those in the ab ov e decomp os iton, ar e ac tually in v aria nt under transformatio ns that preserve the structure of the cotan- gent bundle. INTEGRAL CUR VES AS MARKO V TRANSITION KERNELS Because the v olume form Ω is nowhere zero, a symplec- tic manifold can alw ays b e oriented a nd, conse quently , the form F ( H ) Ω can alwa ys b e iden tiﬁed with a mea- sure. Moreov er, if F is restricted to p o sitive, in tegrable functions, F : R → R + s . t . Z M F ( H ) Ω ∈ R + , then the corr esp onding measure will b e a Bor el measur e. T a k ing F ( H ) = exp ( C − H ), 8 a g iven Hamiltonian uniquely deﬁnes a pro bability measur e d µ = π ( q , p ) d q d p = exp [ C − H ( q , p )] Ω , where C ∈ R is determined by the normalization. Hamil- tonian ﬂo w maps µ into itself and th us deﬁnes a pr op er Marko v tr ansition k ernel for the distribution π ( q , p ). Note that, in practical applica tions the dynamics must be p erfor med n umerically and the mea sure will not b e conserved exactly [6]. An y subseq uent bias, howev er, can be avoided by treating the ﬂow as a Metrop o lis pr op osal function instead o f the tr ansition itself [1, 2 ]. Now if the gradients of the Hamiltonian satisfy ˙ q i ( q , − p ) = − ˙ q i ( q , p ) ˙ p i ( q , − p ) = + ˙ p i ( q , p ) then the resulting ﬂow is rev ersible: reﬂecting the mo- men ta a nd evolving the sa me dista nce a lo ng the integral curves r ecov ers the initial conﬁgura tion. By adding such a reﬂection to the end o f e a ch ev olution, the resulting 8 Other c hoices f or F are p ossible, but this exp onen tial form has a computational adv antage when considering the ubiquitous di s- tributions fr om the exp onen tial fami ly . 5 transition has detailed ba lance with resp ect to π ( q , p ), which then becomes the unique stationary distribution. The augmented transitions, how ev er, do not pro duce a conv ergent Ma r ko v chain beca use the ev olution explor es only level s e ts of pr o bability density , π ( q , p ) = π ( q 0 , p 0 ) ∝ exp ( − H ( q 0 , p 0 )) . Mov emen t across the proba bility contours can b e intro- duced by adopting a Gibbs strateg y and alterna ting ea ch Hamiltonian transition with transitions from any condi- tional distribution spanning the contours. The combined transitions guarantee ergo dicity and, provided that the transitions are tuned to av oid cycles in phase spa ce and subsequent p erio dicity , the resulting Mark ov chain will conv erge to π ( q , p ). While any r e versible Ha milto nian deﬁnes b oth a pr ob- ability distribution and r esp ective Mar ko v chain, practi- cal applica tions demand the r everse construction: what Hamiltonians, and hence well-behaved Markov chains, are cons istent with a giv en target distribution π ( x )? Provided a deco mp os ition of the random v ar iables into po sitions and moment a, x = { q , p } , the supp ort of the ta rget dis tribution co uld b e identiﬁed with the full cotangent bundle and π ( x = { q , p } ) would fully deﬁne a Hamiltonian. T her e is no natural motiv ation for such a decomp osition, howev er, and one resulting in a reversible Hamiltonian system need no t ev en exist. 9 If we app eal to the fa ctorization of the cotangent bun- dle int o  q i  and { p i } co ordina tes and instead ide ntify the random v ar iables with the pos itio n manifold M , how- ever, then x = q and the tar get distribution c onstrains only the ma rginal distribution of π ( q , p ) , π ( x ) = Z d n p π ( q , p )     q = x . An y joint distribution deﬁned ov er the entire cotangent bundle factor s , π ( q , p ) = π ( p | q ) π ( q ) , and the residual freedom in the choice of co nditional dis- tribution, π ( p | q ), ca n b e enginee red to not only guara n- tee reversibility but also be ea sily sampled to ensure that the en tire pro cedur e rema ins computationally eﬃcient. Once the latent v ariables p are ma rginalized out, the Marko v chain g enerates the desir ed sa mples fr o m π ( q ). This ﬁnal iden tiﬁcation completes the constr uction of a gener al MCMC pro cedur e : a given targ et distribu- tion π ( q ) a nd the selection of a n appropr iate conditiona l distribution, π ( p | q ), deﬁnes Hamiltonian dy namics and a well-behav ed transition kernel. Because the dynam- ics incor p orate g r adients of the targ et distribution, the 9 A trivial example being a distribution deﬁned on an odd- dimensional space. transitions systema tica lly explor e the targ e t distribution and av oid the random walk b ehavior typical of other MCMC tec hniques [2]. Moreov er, the freedom in the conditional distribution oﬀer s the p otential for includ- ing more informa tion about the target dis tribution and, consequently , further improving the p erfor ma nce o f the resulting Mar ko v chain. ADMISSIBLE HAMIL TONIANS With the ab ov e cons ide r ations, the most g eneral Hamiltonian yielding the desired Mar ko v chain is H = − lo g π ( q , p ) + C = − log π ( p | q ) π ( q ) + C = − log π ( p | q ) − lo g π ( q ) + C ≡ T ( q , p ) + V ( q ) + C. Note that T and V are neither scala rs nor scalar den- sities. While the joint distribution π ( q , p ) is in v ar iant under p o int transfor ma tions (and indeed all symplecto- morphisms), the distributions in the decomp osition ar e densities of oppos ite w eight with res p ect to co or dina te transformatio ns on the po sition manifold M , π ( P | Q ) =     ∂ q ∂ Q     − 1 π ( p | q ) π ( Q ) =     ∂ q ∂ Q     +1 π ( q ) , and the terms in the Hamiltonian m ust tr ansform as T ( Q , P ) = T ( q , p ) + log     ∂ q ∂ Q     (4) V ( Q ) = V ( q ) − log     ∂ q ∂ Q     . (5) When added together the additional fac tors cancel to yield the sca lar Hamiltonian. While the potential ener gy V ( q ) is fully sp eciﬁed b y the targ et distribution, the kinetic energ y is constr a ined by only the de ﬁning normalization of a conditiona l dis- tribution Z d n p exp ( − T ( q , p )) ∈ R + , i.e. a ﬁnite p ositive num ber indep endent of q , and the demands of detailed bala nce, T ( q , − p ) = T ( q , p ) ∂ T ∂ p i ( q , − p ) = − ∂ T ∂ p i ( q , p ) . Reversibilit y is ass ured for an y k ine tic energy of the form T ( q , p ) = τ 1 ( q , p ) + τ 1 ( q , − p ) + τ 2 ( q ) , 6 where the ﬁr st tw o terms can, in genera l, b e decomp osed int o a sum of co mpletely symmetric tenso rs T ( n ) of type ( n, 0) with n even a nd p ositive, 10 τ 1 ( q , p ) + τ 1 ( q , − p ) = X n =2 , 4 ,... X j 1 ...j n p j 1 . . . p j n T j 1 ...j n ( n ) ( q ) . A particula rly useful c hoice is any scala r function of a non-degenera te qua dratic form in the momen ta, T ( q , p ) = τ   X ij p i p j A ij ( q )   + t 2 ( q ) . Here the tensor A ( q ), or mor e appropr iately its in v erse, eﬀectively serves as a spatially-dep endent linear trans- formation o f the co o rdinates q — if the tr ansformation simpliﬁes the s tructure o f the target distribution then the re s ulting Markov chain promises to explo re the spa ce m uch more eﬃciently . In particula r , the incor p o ration of the deriv atives of the p o tential can help to loc a lly stan- dardize the distribution, av oiding complica tions due to the narr ow v alleys characteristic of stro ng correla tions. T a k ing τ to be the iden tit y gives T ( q , p ) = 1 2 X ij p i p j Λ ij ( q ) − 1 2 log | Λ ( q ) | , (6) which re sults in a gaussian conditional distribution, π ( p | q ) = N ( p | 0 , Λ ) . Notice how the nor malization of the co nditional gaussia n int ro duces the determinant o f Λ , which tra nsforms as a tensor densit y of w eight − 2 with resp ect to coor dina te transformatio ns on M . 11 Under a po int transformation the norma lization b e c omes 1 2 log | Λ ( Q ) | = 1 2 log | ∂ q /∂ Q | − 2 | Λ ( q ) | = 1 2 log | Λ ( q ) | + 1 2 log | ∂ q /∂ Q | − 2 = 1 2 log | Λ ( q ) | − log | ∂ q /∂ Q | , int ro ducing exactly the necessary fa c tor o f the Jaco bian to ensure the pr op er transfor mation of T . The identiﬁca- tion of meas ur es with forms ensures co ns istency be tw een the tw o per sp ectives. If the cov ariance Σ = Λ − 1 is pr op ortional to the iden- tit y then the Hamiltonian reduces to that of classic a l me- chanics and the form used in the ﬁr st implemen tations 10 In order to satisfy the deﬁning transform ation prop erty (4), the con tribution from τ 1 ( q , p ) + τ 1 ( q , − p ) m ust b e a scalar and , consequen tly , the T ( n ) must b e tensors. The second term τ 2 ( q ) is ultimately r esponsibl e for the in troduction of the necessary factor of log | ∂ q /∂ Q | . 11 Provided that k + l is ev en, the determinant of a rank ( k , l ) tensor exists and transforms as a scalar densit y of weigh t l − k . of HMC [1]. A non trivial but cons tant cov a riance ma- trix allows for a g lobal res caling and r otation of the tar- get distribution [2], and a spatially dep endent cov ariance transforms the target distribution lo cally [7]. Because integrability requir e s it to b e p os itive-deﬁnite and non-deg enerate, the spatially -v a rying cov aria nce can also b e interpreted as a Riemannian metric and, from this pe r sp ective, the Hamiltonian evolution lo cally par- allels geo de s ics o f a cur ved manifold [8]. This additional geometric structure e m be dded within the symplectic ge- ometry crea tes the potential to further improv e p erfor- mance with the application of more to o ls from diﬀerential geometry . Utilizing any s uch poss ibilit y , howev er, ﬁrst requires a choice of the s patially-v arying cov aria nce. Choices of the Cov ariance Matrix The Fisher-Rao metric ubiquitous in information ge- ometry , Σ ij = E y  ∂ log π ( x | y ) ∂ x i ∂ log π ( x | y ) ∂ x j  , or given par ticular co o rdinate functions Σ ij = E y  ∂ V ( x | y ) ∂ x i ∂ V ( x | y ) ∂ x j  , provides an o bvious candidate for the cov ar iance matr ix , and its use in HMC ha s proven successful when the ex - pec tation ca n b e perfor med ana ly tically [7]. While the Fisher-Rao metric do es inco rp orate deriv ativ es of the tar- get dis tribution, howev er, it r equires integrating ov er the y to ensure po sitive-deﬁniteness. In a Bayesian a ppli- cation this necess ita tes exp ectation o f the pos ter ior ov er the ensemble of p ossible da ta sets; no t only is the exp ec- tation contrary to the Bayesian philosophy of inference based solely on the meas ured data, it also washes out the lo cal structure par ticular to a given data s et that can prove imp or tant in p os ter ior explora tion. The geometric persp ective, ho w ever, suggests another po ssibility free fro m the unw an ted marg inalization. A “background” Riemannian metric σ deﬁned on M in- duces a metric on the gr a ph of the p otential V , 12 Σ ij = σ ij ( q ) + ∂ V ( q ) ∂ q i ∂ V ( q ) ∂ q j . Unfortunately , Σ does not transfor m as a prop er tensor bec ause V is no t a prop er sca lar function. The additiona l structure aﬀorded by σ , ho w ever, admits the nec essary correctio n: b ecause the deter minant | σ | is a scalar den- sity of weigh t + 2 with resp ect to tra nsformations on M , 12 The graph is an n -dimensional submanifold in the ( n + 1)- dimensional manif old with coordinates ( q , V ( q )). 7 the quantit y ¯ V ( q ) ≡ V ( q ) + 1 2 log | σ ( q ) | is a true sca lar and the modiﬁed metric ¯ Σ ij ( q ) ≡ σ ij ( q ) + ∂ ¯ V ( q ) ∂ q i ∂ ¯ V ( q ) ∂ q j a true tensor. Note the simila rity o f the outer pro duct ( ∂ i ¯ V )( ∂ j ¯ V ) to the argument of the exp ectation v alue in the Fisher - Rao metric: this induced metric features a similar struc- ture without the undesired expec ta tion. More over, ¯ Σ is simply a rank-1 up date of the background geometry and the metric can b e e ﬃcie ntly inv erted with the Sherman- Morriso n-W o o dbury formula [9], ¯ Λ ij = λ ij − ( ∂ i ¯ V )( ∂ j ¯ V ) 1 + ( ∂ l ¯ V )( ∂ l ¯ V ) , where λ is the inv erse of σ (satisfying λ ik σ kj = δ i j ), and ∂ i = λ ij ∂ j . If σ is homoge neous (i.e. indep endent of po s ition) then the inv ersions necessar y a t each iteration o f the Hamilto- nian evolution c an b e computed a t order O  n 2  , signiﬁ- cantly faster than the O  n 3  required for the inv ersion of the dense Fisher-Ra o metric. The Chris toﬀel co eﬃcients of the metric ¯ Σ , which s pec ify the Hamiltonia n ﬂow, are straightforward to calculate in this case, 13 Γ i j k = ( ∂ i V )( ∂ j ∂ k V ) 1 + ( ∂ l V )( ∂ l V ) , also r equiring only O ( n 2 ) op er ations and allowing the en- tire Ha miltonian ﬂow to b e calcula ted at the same order . Note the appeara nce of ∂ j ∂ k V in the Christoﬀel coeﬃ- cients: while ¯ Σ app ears to inco rp orate only the outer pro duct approximation to the Hes s ian of V , the full Hes- sian, and a ll the infor mation it contains, is included the evolution. Constructed fro m deriv ativ es of ¯ Σ , the Riemann cur - v atur e tensor and its co ntraction, the r ank-2 Ricci ten- sor, oﬀer the p otential to include higher-o rder deriv a- tives and, co ns equently , more informa tion into the evo- lution. An explicit calculation, how ev er, shows that, sur- prisingly , only second deriv atives of V co ntribute to the Riemann tenso r. F urther in v estigation is necessary to de- termine the full utility of including these tens ors into the kinetic ener gy . 13 Because the determinant of the homogeneous metric | σ | is also posi tion independent , ∂ i ¯ V = ∂ i V and the metrics coincide, ¯ Σ = Σ . CONCLUSIONS AND FUTURE W ORK By consider ing the geometry of Hamiltonian dynamics and the basic constraints of Marko v chain Monte Car lo, we ha v e constructed the most genera l approach to Hamil- tonian Monte Ca r lo (HMC). The simplest admissible Hamiltonian g iven in Eq. ( 6 ) repro duces exis ting appro aches to Hamiltonian Monte Carlo, but kno wing the mos t general form b egs further extensions. Girolami, et al., for example, hav e considered the kinetic energy T ( q , p ) = ν + n 2 log 1 + P n ij =1 p i p j Λ ij ( q ) ν ! − 1 2 log | Λ ( q ) | , which gives Studen t’s t-dis tr ibution in place of the ga us- sian, but with only mixed r esults. Other choices of the ki- netic energy may dr amatically improv e HMC for certain distributions, and there may still b e means to improv e the p erfor ma nce universally . Generalizing the symplectic manifold o f HMC to a Poisson manifold [10] oﬀer s even mor e po ssibilities. Without the restr iction of non-deg enerate forms, the Poisson ma nifold can acco mmo da te distributions with spatially v arying dimensio na lity and p er haps even admit trans-dimensiona l Mo nte Carlo. Given the unexplor ed p ossibilities of this geometric per sp ective, the future of HMC is pro mising. ACKNO WLEDGMENTS W e thank Katherine Deck and Ila V arma for v aluable discussions a nd comments on a preliminary draft of this manuscript, and Mark Girolami for reviewing an in ter- mediate draft. LCS ackno wledges s uppo rt from MIT’s Solomon Buchsbaum fund. APPENDIX A: SPECULAR REFLECT ION F OR GENERAL HAMIL TONIANS The explicit incorp ora tion of inequality constraints is often adv antageous, and sometimes even nec e s sary , in the construction of certain Mar ko v chains. Arbitrary in- equality cons tr aints, C ( q ) > 0 , can be incorpor ated into the Hamiltonian framework with the intro duction of an inﬁnite potential [11], V ( q ) =  ∞ , C ( q ) ≤ 0 0 , else . A na ive implementation of Hamiltonian dyna mics with such a potential, how ev er, immediately fails because the gradient d V ( q ) alo ng the bounda r y is undeﬁned. 8 When c o nsidering the classica l mec hanics o f a p o int particle, H = 1 2 P ij p i p j δ ij + V ( q ), the diﬃculties around the p otential barrie r c an b e avoided by a pp e a ling to the exact result: the compo nent s of momentum pe r p endicu- lar to the constraint sur face reﬂect while preserv ing the v alue of the Hamiltonian. This sp e cular r eﬂe ction is given by ∆ p = p ′ − p = − 2 ( p , ˆ n ) ˆ n , (7) where the unit normal along the surface of equality ( C ( q ) = 0) is ˆ n = d C p (d C, d C ) and the inner pro duct of one- forms is induced by the symmetric (2 , 0 ) tensor δ , ( a , b ) ≡ X ij a i b j δ ij = X i a i b i . The r eﬂection dep ends only o n the direction o f d C and not the undeﬁned d V . Also no te that only the co m- po nents of the momen tum p er p e ndicular to le vel s e ts of C ( q ) trans fo rm; those para llel to level se ts are unaﬀected. Generalizing sp ecular reﬂectio n to a general Hamilto- nian system is straig ht forward. The gra dient n ≡ d C deﬁnes a unique one-form a t the surfa c e of c o nstraint equality C ( q ) = 0, and forms a basis with the addition of ( n − 1) one-forms ω i , i ∈ { 1 , . . . , n − 1 } , each linear ly independent of one another and n . With this ba s is, the momentum p a t a p o int of reﬂection q on the co nstraint bo undary may b e decomp osed as p = p ⊥ n + n − 1 X i =1 p k i ω i . Because the reﬂection is determined en tirely by the sur- face C ( q ) = 0, the momen tum can transform o nly a lo ng the one- dimens io nal subspace spanned by n . Conse- quently , the p k should b e inv ar ia nt and a genera l reﬂec - tion is g iven by p = p ⊥ n + n − 1 X i =1 p k i ω i → p ′ = αp ⊥ n + n − 1 X i =1 p k i ω i , along with the c o ndition that the Hamiltonian r emains inv a riant, H ( q , p ′ ) = H ( q , p ). In the case of a quadr a tic kinetic ener gy , H ( q , p ) = 1 2 X ij p i p j Λ ij ( q ) − 1 2 log | Λ ( q ) | + V ( q ) , conserv ation of the Hamiltonian requir es 0 = H ( q , p ′ ) − H ( q , p ) = ( α − 1) p ⊥   1 2 ( α + 1) p ⊥ X ij (d C ) i (d C ) j Λ ij + n − 1 X k =1 p k k X ij (d C ) i ( ω k ) j Λ ij   , where (d C ) i and  ω k  j are the compo nents o f the resp ec- tive one-for ms in an arbitr a ry bas is. Ignoring the α = 1 solution where the momen tum is unch anged, α = − 2 P n − 1 k =1 p k k P ij (d C ) i ( ω k ) j Λ ij p ⊥ P ij (d C ) i (d C ) j Λ ij − 1 , or with a bit of manipula tion α − 1 = − 2 P ij (d C ) i p j Λ ij p ⊥ P ij (d C ) i (d C ) j Λ ij , implying that the c hange in mo ment um is ∆ p = ( α − 1) p ⊥ n = − 2 P ij (d C ) i p j Λ ij P ij (d C ) i (d C ) j Λ ij n . Deﬁning an inner product induced by the tensor Λ , ( a , b ) Λ ≡ X ij a i b j Λ ij , and a unit one-form under this inner pro duct ˆ a ≡ a p ( a , a ) Λ , the gener alized sp ecular reﬂe c tion b ecomes ∆ p = − 2 ( ˆ n , p ) Λ ˆ n . (8) The resem blance o f E q. ( 8 ) to Eq. ( 7 ) is welcome. As befo re, the dep endence of the result on only the dire c tion of n ensur es that the inﬁnite magnitude of the potential barrier pr ovides no diﬃculties. Moreo ver, the general result no t only reduces to the cla ssical ca se for Λ ij = δ ij , it also agr ees w ith the generalization that would b e exp ected given the interpretation of the quadra tic kinetic energy resulting from a Riemannian (cov ariant) metric Σ ij on the ba s e manifold (wher e as b efor e Σ = Λ − 1 ). A straightforward calcula tion v eriﬁes that Eq. ( 8 ) is also a suﬃcien t reﬂection solution for a ny kinetic ener gy of the for m T ( q , p ) = τ   X ij p i p j Λ ij ( q )   + τ 2 ( q ) . 9 [1] S. Duane, A . Kennedy , B. J. Pendleton, and D. Ro w eth, Physics Letters B 195 , 216 (1987) . [2] R. N eal, in Handb o ok of Markov Chain Monte C arlo , edited by S. Brooks, A. Gelman, G. L. Jones, and X .-L. Meng (CRC Press, N ew Y ork, 2011). [3] B. Sc hutz, Ge ometric al Metho ds of Mathematic al Physics (Cam bridge Un iversit y Press, New Y ork , 1980). [4] M. Kardar, Statistic al Me chanics of Particles (Cam- bridge Un ivers ity Press, New Y ork, 2007). [5] G. B. F olland, R e al Analysis: Mo der n T e chniques and Their Applic ations (1999). [6] B. Leimkuhler and S. Reich, Simulating Hamiltonian Dy- namics (Cambridge Universit y Press, N ew Y ork, 2004). [7] M. Girolami and B. Calderhead, Journal of the Roy al Statistical So ciet y: Series B (Statistical Method o l o g y ) [8] O. Calin and D. C. Chang, Ge ometric Me chanics on R eimannian Manifolds (Birkh¨ auser, New Y ork, 2004). [9] G. H. Golub and C. F. V an Loan, Matrix Computations (Johns Hopk ins Universit y Press, Baltimore, 1996). [10] A. W einstein, J. Diﬀerential Geom. 18 , 523 ( 1983). [11] M. Betancourt, in Maximum Entr opy and Bayesian metho ds in scienc e and engine ering , American In- stitute of P hysics Conference Series, V ol. 1305, edited by A. Mohammad-Djafari, J.-F. Berc her, & P . Bessi´ ere (AIP Press, 2011) pp. 165–172, arXiv:1005.01 57 [ph ysics.data-an] .

The Geometry of Hamiltonian Monte Carlo

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment