Adaptive Importance Tempering: A flexible approach to improve computational efficiency of Metropolis Coupled Markov Chain Monte Carlo algorithms on binary spaces


Authors: R. Alexander Valencia-Sanchez, Jeffrey S. Rosenthal

R. Alexander Valencia-Sanchez¹, Jeffrey S. Rosenthal¹, Yasuhiro Watanabe², Hirotaka Tamura³ & Ali Sheikholeslami⁴

Abstract

Based on the algorithm Informed Importance Tempering (IIT) proposed by Li et al. (2023), we propose an algorithm that uses an adaptive bounded balancing function. We argue why implementing parallel tempering where each replica uses a rejection-free MCMC algorithm can be inefficient in high-dimensional spaces, and show how the proposed adaptive algorithm can overcome these computational inefficiencies. We present two equivalent versions of the adaptive algorithm, A-IIT and SS-IIT, and establish that both have the same limiting distribution, making either suitable for use within a parallel tempering framework. To evaluate performance, we benchmark the adaptive algorithm against several MCMC methods: IIT, Rejection-Free Metropolis-Hastings (RF-MH) and RF-MH using a multiplicity list. Simulation results demonstrate that Adaptive IIT identifies high-probability states more efficiently than these competing algorithms in high-dimensional binary spaces with multiple modes.

Keywords: Markov chain Monte Carlo, adaptive Markov chain, computational methods, informed proposals, parallel tempering.

1 Introduction

Consider a probability mass function (PMF) π over a finite discrete state space S. Given a measurable function g, we want to compute an estimate of E_π(g(X)). Another way to look at this problem is to consider that we have a non-negative function π defined over a finite discrete space S and we want to obtain samples from the probability distribution whose PMF is proportional to π, for which we do not know the normalizing constant.
A standard approach to approximate the desired value is to use a Markov Chain Monte Carlo (MCMC) algorithm, which builds a Markov chain {X_n} that converges to the target distribution π, collects N samples from the chain and uses them to create a Monte Carlo estimate.

¹ Department of Statistical Sciences, University of Toronto, Canada
² Fujitsu Limited, Kanagawa, Japan
³ DXR Laboratory Inc., Kanagawa, Japan
⁴ Department of Electrical and Computer Engineering, University of Toronto, Canada

An example of an MCMC algorithm is the Metropolis-Hastings (MH) algorithm (Metropolis et al., 1953; Hastings, 1970), which defines a Markov chain that converges to the desired target distribution using only the un-normalized function π(x) and a proposal distribution Q(y|x). Rejection of some proposed states is an important part of this algorithm; in some settings it has been shown that an optimal implementation of the algorithm will reject a bit more than 75% of the proposals (Gelman et al., 1997; Roberts & Rosenthal, 2001; Neal et al., 2012). The rejection rate can also be very high in high-dimensional state spaces, which results in low efficiency for the algorithm (Li et al., 2023).

These rejections might be seen as an inefficiency of the algorithm, since we are performing computations that do not translate into new information for the Markov chain. To address this situation, it is possible to define a Rejection-Free version of the MH algorithm (Rosenthal et al., 2020). This algorithm defines a new Markov chain that always accepts a move to a proposed state and still allows inference on the target distribution π. The algorithm identifies the set of neighbor states of the current state, evaluates the target distribution at each neighbor state and modifies the proposal probability with the computed information.
This Rejection-Free MH algorithm is one instance of the class of Informed Importance Tempering (IIT) algorithms proposed by Zanella (2020), as it uses information from the local neighbors of the current state to modify the proposal distribution. The advantages of IIT become clear when we have access to parallel computing. It has been shown that this class of algorithms performs well in high-dimensional settings (Zhou & Smith, 2022) and that it is very simple to implement an IIT version of different MCMC algorithms (Li et al., 2023).

In this work we explore how to create an algorithm that combines IIT with Parallel Tempering (PT), and we show its effectiveness using simulations in binary multimodal state spaces. The most direct way of combining the two algorithms was already discussed by Rosenthal et al. (2020, sec. 6), who implement parallel tempering where each replica uses a rejection-free Metropolis-Hastings algorithm and modify the replica-swap probability to correct the bias generated by the rejection-free chains. This implementation, although straightforward, increases the number of computations needed to attempt a replica swap by a factor proportional to the dimension of the space.

To address this computational inefficiency we propose an algorithm called Adaptive IIT (A-IIT). This algorithm creates a chain with Markovian adaptations (Roberts & Rosenthal, 2007) that uses the information of the neighbors of the current state to propose a rejection-free move to the next state, but makes the computation of the replica-swap probability more efficient, as it does not need additional factors to compute the replica-swap probability.

1.1 Overview of the paper

In section 2 we present a brief summary of MCMC, rejection-free algorithms and Informed Importance Tempering that will be useful for the definition of the adaptive algorithm.
In section 3 we present the adaptive algorithm (A-IIT), prove that the Markov chain defined by this algorithm converges to a probability distribution, and show how to use this adaptive Markov chain to obtain an estimator for E_π(f(X)).

In section 4 we briefly discuss why using a rejection-free Markov chain with parallel tempering creates additional computations for the replica-swap probability, and how we can avoid these extra computations without affecting the convergence of the chain.

In section 5 we present more details on the implementation of the algorithms in binary multimodal spaces. We present the second version of the algorithm, Single-step IIT (SS-IIT), explain why it is equivalent to A-IIT, and show how we can use both in an implementation with Parallel Tempering to reduce the number of computations needed to update the Markov chain. We also present a discussion of the chosen balancing function h(r) = √r.

In section 6 we present the details of the simulations performed and show the results in binary multimodal state spaces. We compare the performance of A-IIT with other MCMC algorithms. We show that in low-dimensional state spaces MH still performs better than A-IIT, in terms of convergence in Total Variation Distance (TVD) and in terms of how many iterations it takes for the chains to visit the modes. We present results in a binary space of dimension 3000 where A-IIT performs better than MH and IIT, and we discuss how this improvement can be attributed both to the use of the balancing function h(r) = √r and to the use of a multiplicity list to reduce the number of computations needed to update the Markov chain. Additional details of these simulations, and results on state spaces of other dimensions (1000, 5000, 7000), can be found in the appendix.
2 MCMC algorithms

To build an estimate of E_π(f(X)), Markov Chain Monte Carlo proposes to build a Markov chain {X_n} that converges to the target distribution π, collect N samples from that chain, and then use the Monte Carlo method to estimate the desired value as follows:

    E_π(f(X)) ≈ (1/N) Σ_{i=1}^N f(X_i).

In general we can write this estimator in the following form:

    Ê_π(f(X)) = [ Σ_{i=1}^N w_i(X_i) f(X_i) ] / [ Σ_{i=1}^N w_i(X_i) ]    (1)

where w_i(X_i) is the weight assigned to the state X_i ∈ S when the Markov chain visits that state at iteration i. To build this estimator we need two components: first, a process to explore the state space, and second, while exploring, a function to define the weight of each visited state.

The Metropolis-Hastings algorithm, for example, assigns the weight w(X_i) = 1 every time the chain visits the state X_i. The exploration of the state space depends on the proposal distribution, and the estimation of the weight depends on the rejections happening during that exploration. If we modify the exploration process, then the weight estimation also has to be modified so that the chain still converges to the desired target distribution.

2.1 Rejection-free algorithms

As stated before, an optimal implementation of the Metropolis-Hastings algorithm will consistently reject proposed states. Rosenthal et al. (2020) present a Rejection-Free version of the MH algorithm. Consider a Markov chain {X_n} with transition probabilities P(y|x) = P(X_{n+1} = y | X_n = x). Then we can build the corresponding jump Markov chain {J_n} with transition probabilities given by

    P̂(y|x) = P(J_{n+1} = y | J_n = x) = P(y|x) / Z(x)

where Z(x) represents the probability that the chain "escapes" state x in a single iteration:
    Z(x) = Σ_{y ≠ x} P(y|x) = 1 − P(x|x).

The jump chain inherits important properties from the original chain, such as being irreducible and reversible. Moreover, if the original chain's stationary distribution is π, then the stationary distribution of the jump chain is given by π̂(x) = Z(x)π(x).

Along with the jump chain, the authors define the multiplicity list {M_k}, which records the number of iterations the original chain stayed in its current state before accepting a move to a different state. The distribution of M_k given J_k is equal to the distribution of 1 + G, where G ~ Geometric(Z(x)).

The authors present two possible ways to define the weight of a state when using a rejection-free algorithm. In particular, theorems 8 and 12 of Rosenthal et al. (2020) prove that using w_k(X_k) = M_k or w_k(X_k) = 1/Z(X_k) in equation (1) creates a consistent estimator for E_π(f(X)).

The main advantage of using the multiplicity list is that it allows us to recover samples of the original Markov chain. This aspect of the multiplicity list is an important part of the implementation of A-IIT that will be presented in the next section.

2.2 Informed Importance Tempering

A generalization of rejection-free algorithms is presented by Zanella (2020), introducing the concept of a balancing function. This function measures the importance of neighbor states and uses this information to modify the way the chain explores the state space. To build a Markov chain that follows IIT, we start with a proposal distribution Q(y|x) that defines the set of neighbor states of a state x as N_x = {y : Q(y|x) > 0}. Then we consider a balancing function, which is a function h : (0, ∞) → (0, ∞) that satisfies h(r) = r h(1/r).
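Before moving on to balancing functions, the jump-chain move and multiplicity list of section 2.1, together with the weighted estimator (1), can be sketched in code. This is a minimal illustration, not the authors' implementation; the dictionary representation of a transition row is an assumption made here for concreteness.

```python
import numpy as np

def jump_step(x, P_row, rng):
    """One rejection-free (jump-chain) move from state x.

    P_row maps each state y to P(y | x), including the self-transition
    P(x | x).  Returns (next_state, multiplicity), where the multiplicity
    is the holding time of the original chain at x: 1 + a Geometric
    number of failures with success probability Z(x).
    """
    Z = 1.0 - P_row[x]                             # escape probability Z(x)
    ys = [y for y in P_row if y != x]
    probs = np.array([P_row[y] for y in ys]) / Z   # P_hat(y|x) = P(y|x)/Z(x)
    y = ys[rng.choice(len(ys), p=probs)]
    M = int(rng.geometric(Z))                      # "trials" geometric = 1 + failures
    return y, M

def weighted_estimate(samples, f):
    """Estimator (1): sum_k w_k f(x_k) / sum_k w_k, e.g. with w_k = M_k."""
    num = sum(w * f(x) for x, w in samples)
    den = sum(w for _, w in samples)
    return num / den

rng = np.random.default_rng(1)
P_row = {"a": 0.5, "b": 0.3, "c": 0.2}             # current state x = "a", Z("a") = 0.5
y, M = jump_step("a", P_row, rng)
```

Note that NumPy's "geometric" distribution counts trials until the first success, so `rng.geometric(Z)` directly returns 1 + G with G the number of failures, matching the definition of M_k above.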
For the states y in the neighborhood of x, we consider a function α(x, y) ≥ 0 that represents the preference for choosing y, which is then used to modify the proposal weight η(y|x) of that state:

    η_h(y|x) = Q(y|x) α_h(x, y),    α_h(x, y) = h( π(y)Q(x|y) / (π(x)Q(y|x)) )    (2)

Algorithm 1 shows the implementation of IIT.

Algorithm 1 Informed importance tempering (Naive IIT)
  Initialize X_0
  for k in 0 to K do
      For each Y ∈ N_{X_k}, calculate η_h(Y|X_k) = Q(Y|X_k) h( π(Y)Q(X_k|Y) / (π(X_k)Q(Y|X_k)) )
      Calculate Z_h(X_k) = Σ_{Y ∈ N_{X_k}} η_h(Y|X_k)
      Set w_k(X_k) ← 1/Z_h(X_k)
      Choose the next state Y ∈ N_{X_k} such that P(X_{k+1} = Y | X_k) ∝ η_h(Y|X_k)
  end for

This is very similar to how the rejection-free Metropolis-Hastings algorithm is implemented, but we can choose to use a different function: as long as the function satisfies the balancing equality, the resulting chain is reversible and hence converges to the desired target distribution. Choosing a different balancing function allows the algorithm to modify the way the Markov chain explores the state space. If we use the balancing function h(r) = min{1, r}, then IIT is equivalent to Rejection-Free Metropolis-Hastings (Rosenthal et al., 2020), but there is no need to use a bounded balancing function; the authors mention other valid balancing functions such as h(r) = √r (square root), h(r) = 1 ∧ r (min), h(r) = 1 ∨ r (max), and h(r) = (1 ∧ re^{−c}) ∨ (r ∧ e^{−c}). Choosing a different balancing function modifies the proposal weights η_h, which in turn modifies the exploration of the state space. In figure 1 we compare how these balancing functions measure the importance of neighbor states.
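One iteration of Algorithm 1 can be sketched as follows, specializing to the binary cube with the uniform single-flip proposal, so that the proposal ratio inside α_h cancels. The target `log_pi` and the choice h(r) = √r are illustrative assumptions, not part of the paper's experiments.

```python
import numpy as np

def iit_step(x, log_pi, h, rng):
    """One Naive IIT iteration (Algorithm 1) on {0,1}^p.

    Uses the uniform single-flip proposal Q(y|x) = 1/p, so the proposal
    ratio in alpha_h cancels.  Returns the next state and the importance
    weight w_k = 1/Z_h(x) of the current state.
    """
    p = len(x)
    ratios = np.empty(p)
    for i in range(p):                       # evaluate pi at every neighbor
        y = x.copy(); y[i] ^= 1
        ratios[i] = np.exp(log_pi(y) - log_pi(x))
    eta = h(ratios)                          # eta_h(y|x) up to the 1/p factor
    Z_h = eta.sum() / p                      # Z_h(x) = sum_y Q(y|x) h(...)
    w = 1.0 / Z_h
    i = rng.choice(p, p=eta / eta.sum())     # P(X_{k+1} = y) proportional to eta
    y = x.copy(); y[i] ^= 1
    return y, w

rng = np.random.default_rng(0)
x = np.zeros(4, dtype=int)
y, w = iit_step(x, lambda s: float(s.sum()), np.sqrt, rng)   # toy pi(x) ∝ e^{|x|}
```

With this toy target every neighbor of the all-zeros state has ratio e, so the move is uniform over the p bit flips and w = e^{−1/2}.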
Consider a uniform proposal distribution Q and two neighbor states y_1 and y_2 such that π(y_1)/π(x) = 2 and π(y_2)/π(x) = 3. Using the min balancing function, both of these states have the same importance, so the probability that the Markov chain chooses either of them in the next iteration is the same. However, using either of the other two balancing functions, the difference in probability translates into one of the states being more likely to be chosen. In this example, using the max balancing function makes y_2 50% more likely to be chosen than y_1, while using the square-root balancing function makes y_2 about 22% more likely to be chosen than y_1.

[Figure 1: Comparison of balancing functions, showing the possible values that the expression h(π(y)/π(x)) may take for two neighbors of x: y_1 with lower probability and y_2 with higher probability.]

3 Adaptive IIT

In general, given a non-decreasing function f(r) that we want to use to measure the importance of the states, we can create a version that is bounded and satisfies the balancing property, following what is presented by Zanella (2020). For a fixed bounding constant γ > 1, we first take the bounded function f_γ(r) = (1/γ) min{γ, f(r)}, which takes values in (0, 1), and define:

    h_γ(r) = min{ f_γ(r), r f_γ(1/r) }    (3)

Then h_γ is a balancing function taking values in (0, 1). The choice of bounding constant impacts the measurement of importance: using a small constant may result in not fully using the information of the neighbors, while using a big constant may cause numerical errors during the implementation.
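The construction of equation (3) can be sketched directly, with a numerical check of the balancing identity h(r) = r h(1/r) that makes the resulting chain reversible. This is an illustrative sketch; the choice f = √r and the tested values of r are assumptions.

```python
import numpy as np

def make_bounded_balancing(f, gamma):
    """Build h_gamma from a non-decreasing f as in equation (3):
    f_gamma(r) = min(gamma, f(r)) / gamma,
    h_gamma(r) = min(f_gamma(r), r * f_gamma(1/r))."""
    def f_gamma(r):
        return min(gamma, f(r)) / gamma
    def h_gamma(r):
        return min(f_gamma(r), r * f_gamma(1.0 / r))
    return h_gamma

h = make_bounded_balancing(np.sqrt, gamma=10.0)
for r in (0.003, 0.7, 7.0, 250.0):
    assert abs(h(r) - r * h(1.0 / r)) < 1e-12   # balancing identity
    assert 0.0 < h(r) <= 1.0                    # h_gamma is bounded
```

The identity holds for any f, since h_γ(r) = min{f_γ(r), r f_γ(1/r)} and r h_γ(1/r) = min{r f_γ(1/r), f_γ(r)} are the same minimum.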
For a given state space S, target probability distribution π and non-decreasing function f, we define the minimal fully informative constant as γ* = max{ f(π(y)/π(x)) : x ∈ S, y ∈ N(x) }, since this is the minimum constant that assigns a different weight to all states with different probabilities. Without previous knowledge of the problem, finding this constant would be unfeasible when the dimension of S is big. To overcome this problem, we propose an adaptive algorithm that updates the bounding constant as it explores the state space.

For the adaptive algorithm, we first consider the basis function f that will be used to measure the importance of the states. Following equation (3) and considering a constant γ, we can create a bounded balancing function that can be used to define the Markov chain kernel P_γ as follows:

    P_γ(Y|X) = Q(Y|X) h_γ( π(Y)Q(X|Y) / (π(X)Q(Y|X)) )  for Y ≠ X,
    P_γ(X|X) = 1 − Σ_{y ∈ N_X} P_γ(y|X).    (4)

With an adequate choice of proposal distribution Q, this kernel is irreducible and aperiodic. For any fixed γ > 0, since h_γ is a balancing function, the chain is reversible and π(x) is the stationary distribution.

Proposition 1. Let π be the target probability distribution, let γ ≥ 1 be fixed, and consider the Markov chain kernel P_γ defined in equation (4), where Q is a proposal distribution that makes the kernel irreducible and aperiodic and h_γ is a balancing function taking values in (0, 1). Then P_γ has π as a stationary distribution and is ergodic for π.

Proof. Since h_γ is a balancing function, h_γ(r) = r h_γ(1/r).
Using this property we obtain

    π(X) P_γ(Y|X) = π(X) Q(Y|X) h_γ( π(Y)Q(X|Y) / (π(X)Q(Y|X)) )
                  = π(X) Q(Y|X) · [π(Y)Q(X|Y) / (π(X)Q(Y|X))] · h_γ( π(X)Q(Y|X) / (π(Y)Q(X|Y)) )
                  = π(Y) Q(X|Y) h_γ( π(X)Q(Y|X) / (π(Y)Q(X|Y)) )
                  = π(Y) P_γ(X|Y).

So P_γ has π as a stationary distribution, and since the kernel is also aperiodic and irreducible over the finite state space S, P_γ is ergodic for π. ∎

We can then define the rejection-free kernel equivalent to equation (4) as follows:

    P̂_γ(Y|X) = (1/Z_h(X)) Q(Y|X) h_γ( π(Y)Q(X|Y) / (π(X)Q(Y|X)) )  for Y ≠ X,
    P̂_γ(X|X) = 0,    (5)

where Z_h(X) is the probability of escaping state X, defined as:

    Z_h(X) = Σ_{Y ∈ N_X} Q(Y|X) h_γ( π(Y)Q(X|Y) / (π(X)Q(Y|X)) )    (6)

Proposition 2. Let π(·) be the target probability distribution, let γ ≥ 1 be fixed, and consider the rejection-free Markov chain kernel P̂_γ defined in equation (5), where Q is a proposal distribution that makes the kernel irreducible and aperiodic and h_γ is a bounded balancing function. Then P̂_γ has π*(·) = π(·)Z_h(·) as a stationary distribution and is ergodic for π*.

Proof. The proof is very similar to the previous one and relies on the property of a balancing function:

    π(X) Z_h(X) P̂_γ(Y|X) = π(X) Z_h(X) (1/Z_h(X)) Q(Y|X) h_γ( π(Y)Q(X|Y) / (π(X)Q(Y|X)) )
                          = π(X) Q(Y|X) · [π(Y)Q(X|Y) / (π(X)Q(Y|X))] · h_γ( π(X)Q(Y|X) / (π(Y)Q(X|Y)) )
                          = π(Y) Q(X|Y) h_γ( π(X)Q(Y|X) / (π(Y)Q(X|Y)) )
                          = π(Y) Z_h(Y) (1/Z_h(Y)) Q(X|Y) h_γ( π(X)Q(Y|X) / (π(Y)Q(X|Y)) )
                          = π(Y) Z_h(Y) P̂_γ(X|Y).

So P̂_γ has π* as a stationary distribution, and since the kernel is also aperiodic and irreducible over the finite state space S, P̂_γ is ergodic for π*. ∎
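Propositions 1 and 2 can be checked numerically on a toy target. The following sketch (not the authors' code) builds the kernel of equation (4) on {0,1}^3 with uniform single-flip proposals, using the bounded square-root balancing function h_γ(r) = min(1, r, √r/γ) discussed later in section 5.3 as an illustrative choice, and verifies that πP_γ = π.

```python
import itertools
import numpy as np

def build_kernel(log_pi, gamma, p=3):
    """Transition matrix of equation (4) on {0,1}^p with Q(y|x) = 1/p and
    the bounded balancing function h_gamma(r) = min(1, r, sqrt(r)/gamma)."""
    states = list(itertools.product([0, 1], repeat=p))
    idx = {s: i for i, s in enumerate(states)}
    P = np.zeros((len(states), len(states)))
    for s in states:
        for j in range(p):                     # off-diagonal entries
            y = list(s); y[j] ^= 1; y = tuple(y)
            r = np.exp(log_pi(y) - log_pi(s))
            P[idx[s], idx[y]] = min(1.0, r, np.sqrt(r) / gamma) / p
        P[idx[s], idx[s]] = 1.0 - P[idx[s]].sum()   # diagonal term of eq. (4)
    return states, P

log_pi = lambda s: 1.5 * sum(s)                # toy target pi(x) ∝ e^{1.5|x|}
states, P = build_kernel(log_pi, gamma=2.0)
pi = np.exp([log_pi(s) for s in states]); pi /= pi.sum()
assert np.allclose(pi @ P, pi)                 # pi is stationary (Proposition 1)
```

The check succeeds because the balancing identity gives detailed balance, π(x)P_γ(y|x) = π(y)P_γ(x|y), exactly as in the proof above.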
Following the notation presented by Roberts & Rosenthal (2007), we consider a chain with Markovian adaptations {(X_n, γ_n)}, where at time n we use the Markov kernel defined by γ_n. To define how the kernels will be adapted, we consider a function M : S → R defined as:

    M(X) = max_{Y ∈ N(X)} max{ f( π(Y)Q(X|Y) / (π(X)Q(Y|X)) ), f( π(X)Q(Y|X) / (π(Y)Q(X|Y)) ) }    (7)

This function checks the possible values that the ratio of probabilities can take when evaluated using the function f. With this function we consider the set Y = {M(X) | X ∈ S}, and we consider all the Markov chain kernels P_γ for each γ ∈ Y. Using these kernels, the transition probabilities of the adaptive algorithm are given by

    P(X_{n+1} = y | X_n = x, Γ_n = γ) = P_γ(y|x)    (8)

We initialize the bounding constant at γ_0 = 1, and every time we visit a new state X_n ∈ S we update the constant using the information of the neighbor states, γ_n = max{M(X_n), γ_{n−1}}, and use the corresponding adapted Markov kernel to choose the next state. Algorithm 2 shows the steps to implement this adaptive Markov chain.

Although the algorithm allows adaptations to happen at every iteration, since we are considering a finite state space S, the set of bounding constants Y is finite, so this becomes a finite adaptation algorithm. Finite adaptations are one of the safe ways to create adaptive Markov chains that are ergodic (Roberts & Rosenthal, 2007).
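The adaptive update γ_n = max{M(X_n), γ_{n−1}} built on equation (7) can be sketched as follows, again on the binary cube with uniform single-flip proposals so that the Q-ratios cancel; the basis function f and the toy target are illustrative assumptions.

```python
import numpy as np

def update_gamma(x, log_pi, f, gamma_prev):
    """Fold M(x) of equation (7), taken over the single-flip neighborhood,
    into the running bound: returns gamma_n = max(M(x), gamma_{n-1})."""
    gamma = gamma_prev
    for i in range(len(x)):
        y = x.copy(); y[i] ^= 1
        r = np.exp(log_pi(y) - log_pi(x))
        gamma = max(gamma, f(r), f(1.0 / r))   # both ratio directions
    return gamma

x = np.zeros(3, dtype=int)
g = update_gamma(x, lambda s: 2.0 * s.sum(), np.sqrt, gamma_prev=1.0)
# here every neighbor ratio is e^2, so M(x) = sqrt(e^2) = e and g = e
```

Because the update only ever takes maxima, the sequence γ_n is non-decreasing, which is what makes the adaptation finite on a finite state space.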
Algorithm 2 Adaptive IIT
  Initialize X_0 and γ_0 = 1
  for k in 0 to K do
      ▷ Step 1: Update the bounding constant
      For each Y ∈ N_{X_k}, calculate f( π(Y)Q(X_k|Y) / (π(X_k)Q(Y|X_k)) ) and f( π(X_k)Q(Y|X_k) / (π(Y)Q(X_k|Y)) )
      Compute M(X_k) as shown in equation (7)
      Update γ_{k+1} = max{γ_k, M(X_k)}
      Define h_γ using the updated constant γ_{k+1} following equation (3)
      ▷ Step 2: Use the adapted Markov kernel to update the chain
      Calculate Z_h(X_k) = Σ_{Y ∈ N_{X_k}} Q(Y|X_k) h_γ( π(Y)Q(X_k|Y) / (π(X_k)Q(Y|X_k)) )
      Set w_k(X_k) ← 1 + G, where G ~ Geometric(Z_h(X_k))
      Choose the next state Y ∈ N_{X_k} such that P(X_{k+1} = Y | X_k) = (1/Z_h(X_k)) Q(Y|X_k) h_γ( π(Y)Q(X_k|Y) / (π(X_k)Q(Y|X_k)) )
  end for

Proposition 3. For a function f : (0, ∞) → (0, ∞), consider the chain with Markovian adaptations {X_n, γ_n} defined over a finite state space S that follows algorithm 2, where Y is defined above and, for each γ ∈ Y, P_γ is the Markov chain kernel defined in equation (4) using the balancing function h_γ defined in equation (3). Assume that for each fixed γ ∈ Y the kernel P_γ is ergodic,

    lim_{n→∞} || P_γ^n(x, ·) − π(·) || = 0  for all γ ∈ Y and x ∈ S.

Then the adaptive algorithm is ergodic.

Proof. Let M := max_{γ ∈ Y} {γ}; there exists X′ ∈ S such that M(X′) = M. For the adaptive chain, define the stopping time τ = inf{n ∈ ℕ : X_n = X′}, which identifies the time of first visit to this state; this is also the time after which the Markov chain kernel stops changing, i.e. Γ_n = Γ_τ for all n ≥ τ. Since the chain is irreducible and S is finite, the states are positive recurrent, so P(τ < ∞ | X_0 = x) = 1 for any x ∈ S. Hence this is a finite adaptation algorithm, and using Proposition 2 from Roberts & Rosenthal (2007) we know the adaptive algorithm converges to the target distribution π. ∎
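One full iteration of Algorithm 2 can be sketched as follows, specializing to {0,1}^p, the uniform single-flip proposal, and the basis function f(r) = √r with its bounded balancing function h_γ(r) = min(1, r, √r/γ) from section 5.3. The target `log_pi` is an illustrative assumption, not the paper's experimental target.

```python
import numpy as np

def aiit_step(x, log_pi, gamma, rng):
    """One Adaptive IIT iteration (Algorithm 2).

    Returns (next_state, multiplicity_weight, updated_gamma).  The weight
    w_k = 1 + Geometric(Z_h) is the multiplicity-list entry of state x.
    """
    p = len(x)
    r = np.empty(p)
    for i in range(p):                           # visit all neighbors once
        y = x.copy(); y[i] ^= 1
        r[i] = np.exp(log_pi(y) - log_pi(x))
    # Step 1: update the bounding constant with both ratio directions
    gamma = max(gamma, np.sqrt(r).max(), np.sqrt(1.0 / r).max())
    # Step 2: rejection-free move under the adapted kernel
    h = np.minimum(1.0, np.minimum(r, np.sqrt(r) / gamma))
    Z_h = h.mean()                               # Q(y|x) = 1/p
    w = int(rng.geometric(Z_h))                  # 1 + geometric number of failures
    i = rng.choice(p, p=h / h.sum())
    y = x.copy(); y[i] ^= 1
    return y, w, gamma

rng = np.random.default_rng(0)
x = np.zeros(4, dtype=int)
y, w, gamma = aiit_step(x, lambda s: float(s.sum()), 1.0, rng)
```

With this toy target, every neighbor has ratio e, the adapted constant becomes √e, and every h value equals 1, so Z_h = 1 and the multiplicity is 1, i.e. the chain behaves like the un-tempered jump chain at a state it escapes immediately.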
Algorithm 2 assumes that the bounding constant can be updated at every step, so the proof of convergence to the target distribution relies on the fact that the state space S is finite. If we want to extend this to an infinite discrete state space S*, we will need to modify the adaptive step of the algorithm. We can consider a reduction of the bounding constant towards 1; this reduction can be random, defining at each step a probability of reducing the constant, or deterministic, reducing the constant after a specific number of steps. We can also consider running step 1 only for a deterministic number of steps K, and after that using the balancing function h_γ defined with the biggest bounding constant found.

Proposition 3 proves that the Markov chain that follows algorithm 2 is ergodic for a distribution related to the target distribution π. Using theorem 8 from Rosenthal et al. (2020), we can see that if {X_k} is the Markov chain constructed using A-IIT and {M_k} is the corresponding multiplicity list, then, for a suitable function g, the estimator

    [ Σ_{k=1}^N M_k g(X_k) ] / [ Σ_{k=1}^N M_k ]    (9)

is a consistent estimator of E_π(g(X)).

Adaptive IIT explores the state space S in the same way that IIT would, but it changes the way the weights are estimated. A-IIT can use a multiplicity list because the "escape probability" Z_h(X) is bounded in (0, 1), which was not the case in general for IIT. Using the multiplicity list will be useful when implementing the algorithm with Parallel Tempering, as it allows us to define a specific number of samples to obtain from the original (non-rejection-free) Markov chain.

4 Rejection-free algorithms with Parallel Tempering

When implementing MCMC algorithms with multimodal target distributions, it is often useful to use Metropolis-Coupled chains, or Parallel Tempering (Geyer, 1991).
To implement Parallel Tempering, we consider R independent copies (or replicas) of the chain X, each on the state space S, and a set of inverse temperatures β_1 > β_2 > ... > β_R ≥ 0. Replica i has π_i ∝ π^{β_i} as its stationary distribution. Parallel tempering consists of two steps. First, each replica explores the state space for a fixed number of iterations. The second step is to propose a swap between pairs of adjacent replicas, X^{(β_i)} ↔ X^{(β_{i+1})}. This swap is accepted using the usual Metropolis probability

    min{ 1, [ π_2(X^{(β_1)}) π_1(X^{(β_2)}) ] / [ π_1(X^{(β_1)}) π_2(X^{(β_2)}) ] }    (10)

This replica-swap probability preserves the product target measure of the original chains, ∏_i π_i(·).

If we would like to implement parallel tempering with each replica following a rejection-free algorithm, then the swap probability needs to be adjusted (Rosenthal et al., 2020), considering that the rejection-free chains have a different target distribution. The swap probability needs to be modified as follows:

    min{ 1, [ Z_2(X^{(β_1)}) π_2(X^{(β_1)}) Z_1(X^{(β_2)}) π_1(X^{(β_2)}) ] / [ Z_1(X^{(β_1)}) π_1(X^{(β_1)}) Z_2(X^{(β_2)}) π_2(X^{(β_2)}) ] }    (11)

This preserves the modified product measure ∏_i Z_i(·)π_i(·), which corresponds to the target distribution where each replica is a rejection-free chain. Not adjusting the probability would induce a bias in the chain that affects the rate of convergence.

As stated before, each of these Z factors requires evaluating π at every neighbor of the state and applying the corresponding balancing function. In high-dimensional spaces, N(X) can become very large, and although rejection-free chains always move to a different state, the replica swap may be rejected, so the additional computations will not translate into new information for the Markov chain.
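The adjusted swap acceptance of equation (11) can be sketched in log space, which avoids over/underflow for large inverse temperatures. All function arguments here are illustrative assumptions: `log_pi1`, `log_pi2` return log π_i(·) and `logZ1`, `logZ2` return log Z_i(·) for the two replicas involved.

```python
def swap_log_prob(log_pi1, log_pi2, logZ1, logZ2, x1, x2):
    """Log of the swap probability (11), where x1 is the current state of
    replica 1 and x2 of replica 2.  With logZ1 = logZ2 = 0 this reduces to
    the unadjusted Metropolis swap probability (10)."""
    log_ratio = (logZ2(x1) + log_pi2(x1) + logZ1(x2) + log_pi1(x2)
                 - logZ1(x1) - log_pi1(x1) - logZ2(x2) - log_pi2(x2))
    return min(0.0, log_ratio)

# Toy check with pi(x) ∝ e^{-|x|} tempered at beta = 2 and beta = 1,
# and no Z adjustment (so this is equation (10)).
log_pi1 = lambda x: -2.0 * abs(x)
log_pi2 = lambda x: -1.0 * abs(x)
zero = lambda x: 0.0
lp = swap_log_prob(log_pi1, log_pi2, zero, zero, 0, 1)
```

In the toy check, moving the worse state (x = 1) to the colder replica is penalized (lp = −1), while the reverse swap is always accepted.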
To avoid these additional computations, we use a rejection-free algorithm with a multiplicity list, so that we can alternate between the independent replica updates and the replica swaps without inducing bias. As explained by Rosenthal et al. (2020), we define a number of samples L_0 to obtain from the original chain (with target distribution π). Then we proceed as follows.

1. Set the number of remaining repetitions to L = L_0.
2. Find the next state for replica i, X^{(i)}_{n+1}, and the value of the multiplicity list M^{(i)}_n corresponding to the current state X^{(i)}_n.
3. If M^{(i)}_n ≥ L, then replace M^{(i)}_n by L, and the chain stays in the current state, X^{(i)}_{n+1} = X^{(i)}_n. Then return to step 1 for the next replica.
4. If M^{(i)}_n < L, then update L = L − M^{(i)}_n and return to step 2 for the same replica.

Using a rejection-free chain with a multiplicity list allows us to transform each single rejection-free iteration into M_n samples from the original chain. So we can use the steps described above to obtain L_0 samples from the original chain using a multiplicity list in all the replicas. This then allows us to use parallel tempering without creating bias and without additional computations when attempting a replica swap.

It is straightforward to implement Parallel Tempering where each replica uses A-IIT following the four steps above. This is more efficient than using IIT in the replicas, as we avoid the computation of the factors Z_i(X^{(β_j)}) every time we try a replica swap.

5 Implementation of Adaptive IIT in Parallel Tempering

5.1 Single-step IIT

Algorithm 2 requires a lot of computations per iteration. First, the update of the bounding constant requires the computation of π(·) for each of the |N(X)| neighbors of the current state X.
Then we need to apply the bounded balancing function to each of the probability ratios to compute the weight Z_h(X_k), and then choose one of the proposed states.

When implementing A-IIT with parallel tempering, it is possible to use a non-rejection-free algorithm for some of the replicas, as long as the target distribution is not modified. To achieve this, we can consider an algorithm that uses a bounded balancing function h and is equivalent to Adaptive IIT but creates just one sample of the original chain per iteration.

Algorithm 3 Single-step IIT (SS-IIT)
  Initialize X_0 and γ_0 = 1
  for k in 0 to K do
      ▷ Step 1: Update the bounding constant
      Propose a new state Y ∈ N(X_k) using the proposal distribution Q(·|X_k)
      Calculate B_k = max{ f( π(Y)Q(X_k|Y) / (π(X_k)Q(Y|X_k)) ), f( π(X_k)Q(Y|X_k) / (π(Y)Q(X_k|Y)) ) }
      Update γ_{k+1} = max{γ_k, B_k}
      Define h_γ using the updated constant γ_{k+1} following equation (3)
      ▷ Step 2: Use the adapted Markov kernel to update the chain
      Set w_k(X_k) ← 1
      Let U ~ Unif(0, 1)
      if U < h_γ( π(Y)Q(X_k|Y) / (π(X_k)Q(Y|X_k)) ) then
          X_{k+1} ← Y
      else
          X_{k+1} ← X_k
      end if
  end for

As discussed in the previous section, when implementing parallel tempering with Adaptive IIT we define a number L_0 of samples of the original chain to obtain before trying a replica swap. We can also use Single-step IIT to obtain the desired number of samples. The advantage of A-IIT is that, since it is rejection-free, after every iteration the chain moves to a different state; however, this requires additional computations. The advantage of SS-IIT is that it does not require as many computations to propose a jump; however, we may reject the proposed state, and the chain may not explore the state space as fast as A-IIT. When implementing Parallel Tempering, we can choose some replicas to use A-IIT and others to use SS-IIT.
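One iteration of Algorithm 3 can be sketched as follows, again on {0,1}^p with uniform single-flip proposals and basis function f(r) = √r, so that h_γ(r) = min(1, r, √r/γ); the target `log_pi` is an illustrative assumption.

```python
import numpy as np

def ssiit_step(x, log_pi, gamma, rng):
    """One Single-step IIT iteration (Algorithm 3): a single proposal,
    accepted with probability h_gamma(r).  Each iteration contributes one
    sample of the original chain (w_k = 1)."""
    p = len(x)
    i = int(rng.integers(p))                 # propose Y: flip one random bit
    y = x.copy(); y[i] ^= 1
    r = np.exp(log_pi(y) - log_pi(x))
    # Step 1: update the bounding constant with both ratio directions
    gamma = max(gamma, np.sqrt(r), np.sqrt(1.0 / r))
    # Step 2: accept Y with probability h_gamma(r) = min(1, r, sqrt(r)/gamma)
    if rng.random() < min(1.0, r, np.sqrt(r) / gamma):
        x = y
    return x, gamma

rng = np.random.default_rng(0)
x = np.zeros(3, dtype=int)
x, gamma = ssiit_step(x, lambda s: 10.0 * s.sum(), 1.0, rng)
# the uphill ratio is e^10, giving acceptance probability min(1, e^10, 1) = 1
```

Note that only one evaluation of π (beyond the current state) is needed per iteration, which is exactly the computational trade-off against A-IIT discussed above.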
For replicas with high values of β, we expect more rejections whenever the chain moves to a state of high probability, so we would prefer A-IIT. For a value of β close to zero, we do not expect as many rejections, so SS-IIT might be a better choice.

5.2 Choosing between A-IIT and SS-IIT

We now have two adaptive algorithms that can generate samples from the target distribution π while using any balancing function h. To choose which of them to use for the different replicas in parallel tempering, we compare the number of computations needed for the chain to generate L_0 samples from the original chain. Here we consider that each evaluation of π at a neighbor state counts as one computation. SS-IIT always generates one sample of the original chain per iteration of the algorithm, which requires evaluating π at the proposed neighbor state. A-IIT generates one sample from the rejection-free chain, which is equivalent to M_n (a random number) samples from the original chain, but it requires evaluating π for each state in the set N_X.

Consider the state space S = {0, 1}^p with p = 5000 and a target distribution with 6 modes, defined in a similar way to the target distributions presented in the next section, i.e., π(X) ∝ Σ_{i=1}^6 exp(−θ ||X − X^{(i)}||_1). We run parallel tempering with 13 replicas (X^{(β_1)}, ..., X^{(β_13)}). Each replica uses Adaptive IIT until the multiplicity list reaches a value of L_0 = 800, then tries a replica swap. We run 25 simulations, and in each simulation each replica performs 100 swaps before stopping. We then compute the average number of rejection-free steps each replica performed in-between replica swaps. Table 1 contains the values of the inverse temperatures used and the average swap rate between β_i and β_{i+1}.
Table 1: Inverse temperatures used

id     inverse temperature   swap rate
β_1    20000                 0.412
β_2    19820                 0.382
β_3    19539                 0.426
β_4    19409                 0.401
β_5    19282                 0.384
β_6    19169                 0.414
β_7    19112                 0.333
β_8    18935                 0.272
β_9    18724                 0.341
β_10   18663                 0.318
β_11   18586                 0.256
β_12   18507                 0.297
β_13   18485                 -

In figure 2 we plot the number of rejection-free iterations each replica with inverse temperature β_i needs for its multiplicity list to reach the value L_0 = 800, i.e., before we try a replica swap. We can observe that as the value of β decreases, the number of iterations increases. For β_1 the chain performs, on average, 3 rejection-free steps before the replica swap. This means that the single-step IIT algorithm would need, on average, 267 rejections before accepting a move. On the other hand, for β_13 the chain performs, on average, 26 rejection-free steps before the replica swap, so the single-step IIT algorithm would need, on average, 30 rejections before accepting a move. The size of the neighbor set is the dimension of the state space, |N(X)| = 5000; this means that each rejection-free iteration needs to evaluate π 5000 times before moving to a new state, no matter the value of β, while the number of computations needed for SS-IIT to move to a new state depends on the value of β. Table 2 summarizes the number of computations needed for the chain to move to a different state under SS-IIT in comparison with A-IIT.

Figure 2: Number of rejection-free iterations for replicas tempered at different temperatures

Table 2: Comparison of number of computations needed for the Markov chain to move to a different state

Algorithm   β_1     β_13
A-IIT       5000    5000
SS-IIT      267     30
%           5.3%    0.6%

In this example the number of computations per move needed for the largest β is roughly 10 times that needed for the smallest β.
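The per-move cost comparison in Table 2 is simple arithmetic; the short sketch below reproduces it from the averages quoted above (3 and 26 rejection-free steps for β_1 and β_13, L_0 = 800, p = 5000). The variable names are ours.

```python
L0, p = 800, 5000  # samples per swap attempt; dimension = neighborhood size

for beta, rf_steps in [("beta_1", 3.0), ("beta_13", 26.0)]:
    ss_evals = L0 / rf_steps        # avg. pi-evaluations SS-IIT spends per accepted move
    pct = 100.0 * ss_evals / p      # relative to the p evaluations A-IIT needs per move
    print(f"{beta}: about {ss_evals:.0f} evaluations, {pct:.1f}% of A-IIT's cost")
```

These ratios recover the 5.3% and 0.6% figures reported in Table 2.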
When implementing Adaptive IIT with parallel tempering we can consider running a first simulation to measure the average number of rejection-free steps that algorithm 2 would need in between replica swaps and compare this with the number of computations needed. With this information we can decide which algorithm to use for each replica in such a way that we reduce the number of computations needed for a replica to visit a new state, and in this way implement an algorithm that efficiently explores the state space.

5.3 General discussion of the balancing function

The proposed adaptive algorithms are defined in general and can be applied to any increasing function f. In their work, Zhou & Smith (2022) give particular emphasis to the balancing function h(r) = √r, so in this work we implement a bounded version of this function to use in the adaptive algorithm. In this section we present the process to create a bounded version of this balancing function. For a fixed γ > 0, taking the function f(r) = √r as a starting point, the bounded balancing function version of f takes the following form:

h_γ(r) = min{1, r, √r/γ} =
  r       for r ∈ [0, 1/γ²)
  √r/γ    for r ∈ [1/γ², γ²)
  1       for r ∈ [γ², ∞)          (12)

Figure 3 shows the behaviour of this bounded balancing function for some values of γ.

Figure 3: Bounded balancing function based on h(r) = √r, for different values of the bounding constant γ.

Since we want to use h_γ to compute probabilities, there is no use in a bound smaller than 1. We can observe that for γ = 1, h_1(r) = min{1, r}: the balancing function is the same as the function used in the Metropolis-Hastings algorithm. The definition of algorithm 2 requires a double visit to the neighbor states. First we compute the ratios of probabilities of the neighbor states to update the bounding constant.
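Equation (12) is a one-liner in code. The sketch below is illustrative (the function name is ours); the checks exercise its three regimes, including the reduction to the Metropolis-Hastings function min{1, r} at γ = 1.

```python
import math

def h_gamma(r, gamma):
    """Bounded balancing function of equation (12), built from f(r) = sqrt(r).

    min{1, r, sqrt(r)/gamma} equals r below 1/gamma^2, sqrt(r)/gamma in the
    middle range, and 1 above gamma^2."""
    return min(1.0, r, math.sqrt(r) / gamma)
```

A bound γ < 1 would only shrink acceptance probabilities further, which is why the algorithm never uses one.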
Then we apply the updated balancing function to the target distribution evaluated at each neighbor state. This ensures that we are always using the latest information available (the latest bounding constant found). To avoid this double visit to the neighbor states, we could swap steps 1 and 2 in the algorithm. So first we compute the ratios of probabilities π(y)/π(x) for all the neighbor states and store these values, then apply the bounded balancing function (using a previously found bound) to compute the factor Z_h(x). After this computation we can use the stored values of the ratios, apply the function f, and compute the new balancing function. Changing the order of steps 1 and 2 in the algorithm avoids the double visit to the neighbor states, but we need extra storage capacity.

If we use the square-root balancing function we can avoid this double visit in a more straightforward way. Note from the definition of the bounded square-root balancing function (3) that, for a fixed γ > 0, h_γ(r) = √r/γ as long as r ∈ [1/γ², γ²]. So we can simplify the definition of the function as long as the input is between these two values. For simplicity, consider that we are using a uniform proposal distribution Q, and let R_i = π(Y_i)/π(X). Note that the way we update the bounding constant guarantees that γ ≥ R_i^{1/2} and γ ≥ R_i^{−1/2}, which implies that R_i ∈ [1/γ², γ²] for all the neighbors of X, and this holds in every iteration of the algorithm. Then we do not need to evaluate the full bounded balancing function at each √R_i; it suffices to divide all the values by γ, i.e., h_γ(R) = √R/γ. Algorithm 4 shows how we can implement this more efficient version of Adaptive IIT when we use the balancing function h(r) = √r.
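The claim that the update rule keeps every ratio inside [1/γ², γ²], so that h_γ collapses to √R/γ, is easy to check numerically. The following sketch is illustrative only: it draws random ratios, forms γ with the update rule, and verifies the simplification.

```python
import math
import random

random.seed(1)
ratios = [math.exp(random.uniform(-4.0, 4.0)) for _ in range(1000)]

# Update rule: gamma dominates both sqrt(R_i) and 1/sqrt(R_i) for every neighbor
gamma = max(max(math.sqrt(r), 1.0 / math.sqrt(r)) for r in ratios)

# Equivalently, every R_i lies in [1/gamma^2, gamma^2]
in_range = all(math.sqrt(r) <= gamma and 1.0 / math.sqrt(r) <= gamma for r in ratios)

# On that range, min{1, r, sqrt(r)/gamma} is just sqrt(r)/gamma
simplifies = all(abs(min(1.0, r, math.sqrt(r) / gamma) - math.sqrt(r) / gamma) < 1e-9
                 for r in ratios)
print(in_range, simplifies)
```

This is why Algorithm 4 can skip the min entirely and divide by γ once.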
This does not change the way the chain behaves, but it reduces the number of computations in each step of the algorithm.

Algorithm 4 Adaptive IIT (computationally efficient for f(r) = √r)
  Initialize X_0 and γ_0 = 1. Consider a uniform proposal distribution Q(Y|X) ∝ 1
  for k in 1 to K do
    For each Y_i ∈ N_{X_{k−1}}, calculate R_i(X_{k−1})^{1/2} = ( π(Y_i)/π(X_{k−1}) )^{1/2}
    γ_c = max_i { R_i^{1/2}, R_i^{−1/2} }
    γ_k = max{γ_{k−1}, γ_c}
    Calculate Z_h(X_{k−1}) = (1/|N(X_{k−1})|) Σ_{Y_i ∈ N_{X_{k−1}}} (1/γ_k) ( π(Y_i)/π(X_{k−1}) )^{1/2}
    Set w_k(X_{k−1}) ← 1 + Geometric(Z_h(X_{k−1}))
    Choose the next state Y ∈ N_{X_{k−1}} such that P(X_k = Y | X_{k−1}) ∝ ( π(Y)/π(X_{k−1}) )^{1/2}
  end for

6 Simulations

6.1 General framework

For the simulations we consider a state space S = {0,1}^p and a multimodal target distribution π defined as

π(X) = (1/C_θ) Σ_{i=1}^{m} exp(−θ ||X − X^(i)||_1)          (13)

where m is the number of modes, X^(i) represents mode i, θ > 0, ||·||_1 is the L_1 distance, and C_θ is the normalizing constant.

We implement non-reversible parallel tempering with R replicas (Syed et al., 2021), alternating between swapping even- and odd-indexed replicas. We compare the performance of Adaptive IIT with 3 other algorithms. All of them use non-reversible parallel tempering, but the replicas in each implementation use one of 4 different algorithms. For all of the algorithms we use a uniform proposal distribution (Q(y|x) ∝ 1 for all y ∈ N_x) and the neighbor set is defined by all the states at L_1 distance 1: N_x = { y ∈ S : ||x − y||_1 = 1 }.
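For reference, the target in equation (13) can be evaluated as follows. This is an illustrative helper (the names are ours, not the authors'), working in log space via the log-sum-exp trick for numerical stability in high dimensions.

```python
import math

def log_pi(x, modes, theta):
    """Unnormalized log-density of equation (13):
    pi(x) proportional to sum_i exp(-theta * ||x - x_i||_1)."""
    terms = [-theta * sum(abs(a - b) for a, b in zip(x, m)) for m in modes]
    mx = max(terms)                  # log-sum-exp: factor out the dominant term
    return mx + math.log(sum(math.exp(t - mx) for t in terms))
```

At a mode, the density is dominated by that mode's own term, so the unnormalized log-density is close to 0 there.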
Table 3: Summary of compared algorithms.

#   ID        Name                                          h(r)        w(x)            Algorithm(s)
1   A-IIT     Adaptive IIT                                  √r          1 + G(Z(x))     2 / 3
2   IIT       Naive IIT                                     √r          1/Z(x)          1
3   MH-mult   Metropolis-Hastings with multiplicity list    min{1, r}   1 + G(Z(x))     5 / 7
4   RF-MH     Rejection-free Metropolis-Hastings            min{1, r}   1/Z(x)          6

Table 3 summarizes the algorithms compared. IIT and RF-MH use direct weight estimation, so they require additional computations for the replica swap probability. The algorithms that we label A-IIT and MH-mult are each implementations of 2 different algorithms: for replicas with high values of β we use a rejection-free algorithm with a multiplicity list (algorithms 2 and 5), while for replicas with low values of β we use algorithms that allow rejections (algorithms 3 and 7); the number of replicas using a rejection-free algorithm is the same for both in any example.

Traditionally one would run the algorithms for the same number of iterations to make a fair comparison of their performance. For this work it is not straightforward to define the number of iterations to run each algorithm, since algorithms (2) and (4) perform a fixed number of rejection-free iterations while algorithms (1) and (3) perform a random number of rejection-free iterations, because they use a multiplicity list. To define equivalent numbers of iterations between replica swaps for all algorithms, we first define a number of rejection-free iterations for algorithms (2) and (4) and then, for algorithms (1) and (3), we define a value of L_0 (the number of samples from the original chain to obtain before a replica swap) such that, for the replica with the highest β, the average number of rejection-free steps between replica swaps is as close as possible across all the algorithms.
In this way the replicas get the same number of steps to explore the state space and the comparison is fairer. For example, for algorithms (2) and (4) we define 2 rejection-free iterations between replica swaps and run the algorithms for a total of 2 million iterations, which translates into 1 million attempted swaps. For algorithms (1) and (3), defining L_0 = 800 corresponds to 1.64 and 1.2 average rejection-free steps, respectively. So we use that L_0 and the algorithms try a total of 1 million replica swaps. In this way we make the comparison as fair as possible.

The number of replicas and the spacing between the β's is defined such that the proportion of accepted swaps between adjacent replicas is close to 0.234 (Atchadé et al., 2011).

We measure the time it takes the replica with the highest β to reach all of the modes, as we consider this a measurement of how good the chain's exploration capabilities are at finding states with high probability. This is presented with step-function ("ladder") graphs where the x-axis is time and the y-axis is the percentage of simulations that found all the modes of the target distribution by that time. Each jump in the plot represents one simulation finding all the modes of the target distribution; the faster the function reaches 100%, the faster the algorithm is at finding all the modes.

For the low-dimensional examples we also measure the evolution of the total variation distance (TVD). We present this in a chart where the x-axis is time and the y-axis is the average TVD over all the simulations run. The faster the plot approaches 0, the better the algorithm is at converging to the desired target distribution.

The simulation results for the low-dimensional examples show that the rejection-free algorithms perform better both in terms of TVD and in speed to find the modes.
This is expected, as the rejection-free algorithms efficiently use the information of neighbor states to identify direct paths to the modes.

In these low-dimensional examples the additional computations needed by the algorithms using a multiplicity list (A-IIT and MH-mult) far exceed the computations needed by the algorithms using direct weight estimation (IIT and RF-MH). This is due to the dimension of the state space being small and L_0 being large in comparison. So the number of computations between replica swaps for the algorithms with a multiplicity list is always larger than for the other two algorithms, and the additional computations needed during the replica swap process are not significant.

It is in the simulation results for the high-dimensional example that we observe A-IIT performing better than the other algorithms. Since the dimension of the space is p = 3000, the computation of the probability of swapping adjacent replicas requires an additional 4 × 3000 = 12000 evaluations of the function π, and since we are performing non-reversible replica swaps, each round of replica swaps requires 12000 × 6 = 72000 additional computations; this impacts the speed at which the parallel tempering algorithm explores the state space. In addition, we observe that the value of L_0 does not scale at the same rate as the dimension of the state space, so, for the algorithms using a multiplicity list, we are able to find an adequate choice of replicas that use the equivalent algorithm that permits rejections, in such a way that we minimize the number of computations needed per replica swap.

In the simulation results we can also see the difference in speed between A-IIT and MH-mult, which can be attributed to the different balancing function used. Hence we observe the advantage of using A-IIT over Metropolis-Hastings or IIT when using parallel tempering.
6.2 Low-dimensional bimodal example

For this example we consider a space of dimension p = 16 with two modes:

• x^(1) = (1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0)
• x^(2) = (0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1)

The state space is S = {0,1}^16 and we use θ = 6 to define the target distribution. With these parameters the normalizing constant is C_θ = 2(1 + e^{−θ})^16, and each mode accumulates 48% of the total probability mass. Since this space is small enough, we can compute the true distribution π and use it to compute the total variation distance.

We run all algorithms for a total of 1 million replica swaps; the algorithms using rejection-free iterations try a replica swap after every iteration. Algorithm A-IIT uses L_0 = 1000 and algorithm MH-mult uses L_0 = 100. We run 100 simulations of each algorithm to obtain the results presented below. For these simulations, all the replicas of A-IIT and MH-mult use the rejection-free algorithm. In table 4 we see the average number of rejection-free steps each replica performs between replica swaps with the defined L_0.

Table 4: Average number of iterations between replica swaps

replica   A-IIT     MH-mult
β_1       1.246     1.489
β_2       24.067    10.942
β_3       91.075    25.027
β_4       218.88    42.84

In this particular problem we see no benefit from using A-IIT, as using IIT is faster both in finding the modes and in converging to the target distribution. We attribute this to the small size of the state space: the number of additional computations needed by IIT, both for each iteration and for trying replica swaps, does not hinder its performance.

Figure 4: Comparison of performance of the four algorithms on the low-dimensional bimodal problem. On the left, time to visit all the modes of the target distribution.
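The total variation distance used here can be computed by full enumeration in these small spaces. The sketch below is illustrative only (it uses p = 4 rather than 16 to stay short, and the names are ours); it also confirms the form of the normalizing constant C_θ = 2(1 + e^{−θ})^p for the bimodal target.

```python
import math
from itertools import product

def tvd(p, q):
    """Total variation distance between two PMFs over the same finite space."""
    return 0.5 * sum(abs(p[s] - q[s]) for s in p)

theta = 6.0
modes = [(1, 0, 1, 0), (0, 1, 0, 1)]
states = list(product((0, 1), repeat=4))

# Unnormalized weights of the bimodal target of equation (13)
weights = {x: sum(math.exp(-theta * sum(abs(a - b) for a, b in zip(x, m)))
                  for m in modes) for x in states}
Z = sum(weights.values())
pi = {x: w / Z for x, w in weights.items()}
```

Since each coordinate contributes a factor (1 + e^{−θ}) independently, the normalizer factorizes per mode, giving C_θ = 2(1 + e^{−θ})^p.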
On the right, evolution of total variation distance.

6.3 Low-dimensional multimodal example

For this example we consider a space of dimension p = 16 with seven modes:

• x^(1) = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
• x^(2) = (1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0)
• x^(3) = (0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1)
• x^(4) = (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
• x^(5) = (0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1)
• x^(6) = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)
• x^(7) = (0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0)

The state space is S = {0,1}^16 and we use θ = 10 to define the target distribution. With these parameters the normalizing constant is C_θ = 7(1 + e^{−θ})^16, and the modes together accumulate 99.9% of the total probability mass. Since this space is small enough, we can compute the true distribution π and use it to compute the total variation distance.

We run all algorithms for a total of 1 million replica swaps; the algorithms using rejection-free iterations try a replica swap after every 2 iterations, while the algorithms using a multiplicity list use L_0 = 1000. We run 100 simulations of each algorithm to obtain the results presented below. For these simulations, all the replicas of A-IIT and MH-mult use the rejection-free algorithm. In table 5 we see the average number of rejection-free steps each replica performs between replica swaps with the defined L_0. Here we observe the effect of the adaptive constant: a bounding constant bigger than 1 means that the escape probability Z_h(X) is smaller, which translates into bigger values in the multiplicity list, reducing the number of rejection-free iterations needed to reach the defined L_0.
Table 5: Average number of iterations between replica swaps

replica   A-IIT     MH-mult
β_1       1.001     1.091
β_2       19.306    87.393
β_3       77.602    221.185

We observe similar behavior as in the bimodal problem: using parallel tempering with IIT gives a faster algorithm both for finding all the modes and for estimating the target distribution. We consider that the benefits of using A-IIT depend on the dimension of the state space and not on the number of modes the target distribution has. We do not observe any significant differences in performance between IIT and rejection-free Metropolis-Hastings; as the dimension of the state space is small, there is no significant advantage to using a different balancing function when choosing a state to move to.

Figure 5: Comparison of performance of the four algorithms on the low-dimensional multimodal problem. On the left, time to visit all the modes of the target distribution. On the right, evolution of total variation distance.

6.4 High-dimensional example

For this example we consider a space of dimension p = 3000 with 6 modes:

• x^(1) = (1, 0, 1, 0, …, 1, 0, 1, 0)
• x^(2) = (0, 1, 0, 1, …, 0, 1, 0, 1)
• x^(3) = (1, 1, 1, 1, …, 0, 0, 0, 0)
• x^(4) = (0, 0, 0, 0, …, 1, 1, 1, 1)
• x^(5) = (0, 0, 0, 0, …, 0, 0, 1, 1, …, 1, 1, 0, 0, …, 0, 0, 0, 0)
• x^(6) = (1, 1, 1, 1, …, 1, 1, 0, 0, …, 0, 0, 1, 1, …, 1, 1, 1, 1)

All modes satisfy ||x^(i)||_1 = p/2 and are defined in such a way that ||x^(i) − x^(j)||_1 ≥ p/2. The state space is S = {0,1}^3000 and we use θ = 0.001 to define the target distribution. We run all algorithms for a total of 1 million replica swaps; algorithms (2) and (4) try a replica swap after 2 iterations, while the algorithms using a multiplicity list use L_0 = 800.
For algorithms (1) and (3), 8 out of the 13 chains use the rejection-free algorithm and the remaining 5 use the algorithm that allows rejections. We run 100 simulations of each algorithm to obtain the results presented below. In table 6 we see the average number of iterations each replica performs between replica swaps with the defined L_0. For the 8 replicas using the rejection-free algorithm this number corresponds to the average number of rejection-free steps performed before the multiplicity list reaches the value of L_0; for the other 5 replicas the number is fixed, as the chain always performs L_0 iterations of the algorithm before trying a replica swap.

Table 6: Average number of iterations between replica swaps

replica   A-IIT    MH-mult
β_1       1.614    1.162
β_2       1.809    1.352
β_3       2.001    1.788
β_4       2.262    2.587
β_5       2.678    3.916
β_6       3.235    5.346
β_7       4.052    6.816
β_8       5.209    8.588
β_9       800      800
β_10      800      800
β_11      800      800
β_12      800      800
β_13      800      800

For algorithms (1) and (3) the burn-in period lasts until the multiplicity list reaches the value 8000. For algorithms (2) and (4) the burn-in period lasts 50 iterations. For algorithm (1) the bounding constant is updated only during the burn-in period; after the burn-in period we stop the adaptation and use the balancing function with the latest bounding constant found.

Figure 6: Comparison of time to visit all the modes of the target distribution in a space of dimension p = 3000. On the right is the same plot focusing only on algorithms A-IIT and MH-mult.

The left plot of figure 6 presents the time each algorithm took to find all the modes. We can observe that A-IIT and MH-mult, the algorithms that use a multiplicity list, are faster than IIT and RF-MH, the rejection-free algorithms. The time for the algorithms using a multiplicity list could be improved in this case by using SS-IIT in more chains.
In the right plot of figure 6 we show the plot only for algorithms (1) and (3). Here we can better appreciate the difference in time: Adaptive IIT is 3 times faster than rejection-free Metropolis-Hastings. We can attribute the difference in performance to the different balancing function used, since that is the main difference between the algorithms.

In figure 7 we can see the plot of the different bounds found for each replica in the simulations. As we can see, even with the short burn-in period the algorithm was able to find different bounding constants for each replica. When choosing a state for the chain to move to, algorithm (3) assigns the same probability to all the states with higher probability than the current state, while algorithm (1) may assign different probabilities. This means that algorithm (1) will move quicker to states with high probability; in this setting, where the target distribution is multimodal and each mode has a path of high probability leading to it, this makes algorithm (1) reach the modes faster.

Figure 7: Summary of bounds used for each chain after the burn-in period for the 100 simulations.

In figure 8 we show similar simulations for algorithms (1) and (3), keeping track of the time it takes the algorithms to find all the modes, but with different numbers of replicas using the rejection-free algorithm. On the x-axis is the number of replicas that used the rejection-free algorithm; on the y-axis is the time in seconds it took the simulations to find the modes. If we increase the number of replicas using a rejection-free algorithm, the time to find the modes increases, because each iteration requires more computations. However, we can see that in all cases A-IIT is faster than MH-mult. This shows the advantage of using h(r) = √r as the balancing function.
Figure 8: Comparison of time to find all the modes for different numbers of rejection-free replicas in 100 simulations.

7 Summary and future work

This work presented an adaptive MCMC algorithm for discrete state spaces that uses information from the neighbors of the current state to "temper" the choice of the next state to visit. We presented both a rejection-free version and what we call a single-step version of the algorithm. We proved the convergence of these algorithms to the desired target distribution, and we showed that they are equivalent in the sense that both converge to the same target distribution. This equivalence gives flexibility to the implementation with parallel tempering, as we can choose the algorithm that more efficiently uses the available computational resources, although the choice will depend on the specific problem to solve.

There are aspects of the algorithm that we do not explore in this work and that can be topics for future research:

• The choice of importance function f to define the bounded balancing function: as was mentioned in the original work of Li et al. (2023), the choice of balancing function is problem specific.
• The schedule of adaptations: we mentioned the possibility of reducing the bounding constant towards 1, so we can consider a first stage of constantly updating the constant to improve the exploration of the state space, and later reducing it to decrease the variance of the Monte Carlo estimator Ê_π(X).
• The optimal choice of rejection-free or single-step replicas: in this work we chose the number of rejection-free replicas for the simulations aiming for roughly half and half, but there is an opportunity to determine the optimal number of replicas for each algorithm.
• Extensions to other algorithms: in the work of Li et al. (2023) it is mentioned that IIT allows for other modifications, for example considering a smaller neighbor set at each iteration.
We consider that A-IIT can also be adapted in a similar way.

Acknowledgments

Thanks to Norma Garcia-Salazar and Marco Gallegos-Herrada for helpful discussions and remarks. This research was enabled in part by support provided by Compute Ontario (computeontario.ca) and the Digital Research Alliance of Canada (alliancecan.ca).

Funding

The authors thank Fujitsu Limited and Fujitsu Consulting (Canada) Inc. for providing financial support.

Conflict of Interest

The authors declare no competing interests.

References

Atchadé, Y. F., Roberts, G. O., & Rosenthal, J. S. (2011). Towards optimal scaling of Metropolis-coupled Markov chain Monte Carlo. Statistics and Computing, 21(4), 555–568. doi: 10.1007/s11222-010-9192-1

Gelman, A., Gilks, W. R., & Roberts, G. O. (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms. The Annals of Applied Probability, 7(1), 110–120. doi: 10.1214/aoap/1034625254

Geyer, C. J. (1991). Markov chain Monte Carlo maximum likelihood. In Computing science and statistics: Proceedings of the 23rd symposium on the interface (pp. 156–163).

Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1), 97–109. doi: 10.1093/biomet/57.1.97

Li, G., Smith, A., & Zhou, Q. (2023). Importance is important: A guide to informed importance tempering methods.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6), 1087–1092. doi: 10.1063/1.1699114

Neal, P., Roberts, G., & Yuen, W. K. (2012).
Optimal scaling of random walk Metropolis algorithms with discontinuous target densities. The Annals of Applied Probability, 22(5), 1880–1927. doi: 10.1214/11-AAP817

Roberts, G. O., & Rosenthal, J. S. (2001). Optimal scaling for various Metropolis-Hastings algorithms. Statistical Science, 16(4), 351–367. doi: 10.1214/ss/1015346320

Roberts, G. O., & Rosenthal, J. S. (2007). Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms. Journal of Applied Probability, 44(2), 458–475. http://www.jstor.org/stable/27595854

Rosenthal, J. S., Dote, A., Dabiri, K., Tamura, H., Chen, S., & Sheikholeslami, A. (2020). Jump Markov chains and rejection-free Metropolis algorithms.

Syed, S., Bouchard-Côté, A., Deligiannidis, G., & Doucet, A. (2021). Non-reversible parallel tempering: a scalable highly parallel MCMC scheme. Journal of the Royal Statistical Society, Series B.

Zanella, G. (2020). Informed proposals for local MCMC in discrete spaces. Journal of the American Statistical Association, 115(530), 852–865. doi: 10.1080/01621459.2019.1585255

Zhou, Q., & Smith, A. (2022). Rapid convergence of informed importance tempering. In G. Camps-Valls, F. J. R. Ruiz, & I. Valera (Eds.), Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (Vol. 151, pp. 10939–10965). PMLR. https://proceedings.mlr.press/v151/zhou22e.html

Appendix

A Details on the simulations

Here we present more details on the simulations: specifically, the values of β used and the average swap rates, and we present other results obtained in spaces of dimension 1000, 5000 and 7000.
A.1 Low-dimensional examples

Table 7: Inverse temperatures used in the bimodal model

temp id   A-IIT   MH-mult   IIT    RF-MH
β_1       1       1         1      1
β_2       0.49    0.49      0.26   0.26
β_3       0.33    0.33      0.06   0.06
β_4       0.22    0.22      NA     NA

Table 8: Average swap rate between replica β_i and β_{i+1} for the bimodal model

replica   A-IIT    IIT      MH-mult   RF-MH
β_1       0.2352   0.2382   0.2352    0.2382
β_2       0.2423   0.2327   0.2423    0.2326
β_3       0.2521   NA       0.2521    NA

Table 9: Inverse temperatures used in the 7-mode model

temp id   A-IIT   MH-mult   IIT     RF-MH
β_1       1       1         1       1
β_2       0.31    0.31      0.15    0.155
β_3       0.21    0.21      0.002   0.002

Table 10: Average swap rate between replica β_i and β_{i+1} for the 7-mode model

replica   A-IIT    IIT      MH-mult   RF-MH
β_1       0.2478   0.2327   0.2477    0.2506
β_2       0.2521   0.2539   0.2521    0.2391

A.2 High-dimensional example

Table 11: Inverse temperatures used in the problem of dimension 3k

temp id   A-IIT   IIT     MH-mult   RF-MH
β_1       20000   20000   20000     20000
β_2       19517   15046   17899     15005
β_3       19029   13509   15895     13611
β_4       18535   12162   14353     12684
β_5       17744   11215   13057     12182
β_6       17125   10570   12234     11600
β_7       16568   10087   11631     11377
β_8       16037   9716    11093     11185
β_9       15571   9396    10578     11090
β_10      15273   9166    10109     10986
β_11      15075   9001    9409      10892
β_12      14786   8827    8951      10802
β_13      14595   8691    8417      10735

Table 12: Average swap rate between replica β_i and β_{i+1} in the space of dimension 3000

replica   A-IIT    IIT      MH-mult   RF-MH
β_1       0.4489   0.2509   0.3759    0.2161
β_2       0.4582   0.342    0.2893    0.3295
β_3       0.4333   0.2239   0.2381    0.3212
β_4       0.4009   0.222    0.1943    0.4041
β_5       0.3953   0.2743   0.2374    0.2964
β_6       0.3859   0.3078   0.2604    0.4368
β_7       0.3664   0.3336   0.2586    0.4125
β_8       0.3505   0.3144   0.2383    0.4716
β_9       0.3575   0.3507   0.2287    0.386
β_10      0.3942   0.4006   0.0984    0.3795
β_11      0.3503   0.3268   0.1639    0.3212
β_12      0.3703   0.4098   0.0946    0.4413

B Other results in high dimensions

We chose to present the results of the simulations in a state space of dimension 3000, as these showed most clearly the advantages of A-IIT over the other algorithms.
Here we share some other results obtained from simulations run with similar target distributions π, with the same number of modes but in other dimensions. Although the results presented for the state space of dimension 3000 are not replicated in dimension 5000, we observe similar results for dimension 7000. We consider there is an opportunity to optimize some parameters in dimension 5000 so that we can show that A-IIT performs better than Metropolis-Hastings.

B.1 Dimension 1000

Figure 9: Comparison of time to visit all the modes of the target distribution in a space of dimension p = 1000.

Table 13: Inverse temperatures used in the problem of dimension 1k

temp id   A-IIT   IIT     MH-mult   RF-MH
β_1       20000   20000   20000     20000
β_2       19209   18092   19175     18202
β_3       18544   16632   18496     16829
β_4       18030   15661   17962     15769
β_5       17441   14723   17333     14902
β_6       16904   13981   16728     14075
β_7       16480   13544   16322     13582
β_8       15969   12819   15876     12999
β_9       15483   12279   15397     12419
β_10      15115   11806   15052     11994
β_11      14768   11372   14716     11472
β_12      14376   10970   14393     11043
β_13      13910   10484   13968     10691

Table 14: Average swap rate between replica β_i and β_{i+1} in the space of dimension 1000

replica   A-IIT    IIT      MH-mult   RF-MH
β_1       0.333    0.2655   0.3258    0.2971
β_2       0.345    0.2824   0.3408    0.3201
β_3       0.369    0.3951   0.3631    0.3636
β_4       0.3397   0.3454   0.3275    0.3922
β_5       0.3417   0.3968   0.3199    0.3537
β_6       0.3651   0.5849   0.3678    0.5423
β_7       0.3287   0.3205   0.3468    0.4315
β_8       0.3254   0.4131   0.3255    0.3909
β_9       0.3576   0.4338   0.3648    0.4932
β_10      0.3583   0.4378   0.3617    0.3612
β_11      0.3324   0.4397   0.3603    0.4164
β_12      0.2914   0.3085   0.3092    0.4715

B.2 Dimension 5000

Figure 10: Comparison of time to visit all the modes of the target distribution in a space of dimension p = 5000. On the right is the same plot focusing only on A-IIT and MH-mult.
Table 15: Inverse temperatures used in the problem of dimension 5k

  temp id   A-IIT   IIT     MH-mult   RF-MH
  β1        20000   20000   20000     20000
  β2        19813   11152   19199     10929
  β3        19524   9798    18256     9985
  β4        19387   8698    17010     9369
  β5        19254   7903    16246     8908
  β6        19135   7290    15526     8538
  β7        19075   6819    14787     8233
  β8        18891   6610    14075     7963
  β9        18680   6427    13404     7723
  β10       18616   6274    12786     7523
  β11       18536   6125    12388     7422
  β12       18454   6007    11783     7299
  β13       18421   5889    11230     7208

Table 16: Average swap rate between replicas β_i and β_{i+1} in the space of dimension 5000

  replica   A-IIT    IIT      MH-mult   RF-MH
  β1        0.4168   0.352    0.3636    0.2386
  β2        0.3777   0.3035   0.3773    0.3558
  β3        0.4079   0.171    0.3478    0.3633
  β4        0.3869   0.1459   0.3838    0.3457
  β5        0.3624   0.1368   0.3722    0.3313
  β6        0.3943   0.147    0.3776    0.3219
  β7        0.3077   0.3699   0.3593    0.269
  β8        0.2868   0.3774   0.3591    0.2445
  β9        0.3337   0.3816   0.3461    0.2021
  β10       0.3391   0.3427   0.3786    0.244
  β11       0.3133   0.3729   0.306     0.1981
  β12       0.3232   0.3278   0.294     0.2217

B.3 Dimension 7000

Figure 11: Comparison of time to visit all the modes of the target distribution in a space of dimension p = 7000. On the right is the same plot, focusing only on A-IIT and MH-mult.
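The swap rates tabulated above are averages over replica-exchange attempts between adjacent temperatures. As a point of reference, here is a minimal sketch of the standard Metropolis-coupled (parallel tempering) swap rule; this assumes the usual acceptance ratio for targets of the form π^β, and the paper's exact bookkeeping may differ.

```python
import math

def swap_accept_prob(log_pi_x_i, log_pi_x_j, beta_i, beta_j):
    """Standard parallel-tempering swap rule: accept an exchange of the
    states of adjacent replicas at inverse temperatures beta_i and beta_j
    with probability
        min(1, exp((beta_i - beta_j) * (log pi(x_j) - log pi(x_i)))),
    where log_pi_x_i and log_pi_x_j are unnormalized log-target values
    of the two replicas' current states."""
    return min(1.0, math.exp((beta_i - beta_j) * (log_pi_x_j - log_pi_x_i)))
```

A reported "average swap rate" is then the mean of these acceptance decisions over all attempted swaps between replicas i and i+1 during a run.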
Table 17: Inverse temperatures used in the problem of dimension 7k

  temp id   A-IIT   IIT     MH-mult   RF-MH
  β1        20000   20000   20000     20000
  β2        11710   10008   11457     9951
  β3        9796    8637    9845      8781
  β4        9154    7937    9005      8063
  β5        8692    7426    8348      7557
  β6        8327    7024    7813      7135
  β7        8023    6751    7377      6870
  β8        7738    6521    7049      6649
  β9        7490    6310    6741      6483
  β10       7274    6185    6472      6371
  β11       7069    6110    6239      6301
  β12       6878    6032    6029      6252
  β13       6713    5996    5858      6214

Table 18: Average swap rate between replicas β_i and β_{i+1} in the space of dimension 7000

  replica   A-IIT    IIT      MH-mult   RF-MH
  β1        0.3965   0.363    0.3854    0.336
  β2        0.2348   0.206    0.2628    0.256
  β3        0.3199   0.2414   0.2614    0.2394
  β4        0.3341   0.2253   0.2418    0.2287
  β5        0.329    0.217    0.2207    0.1984
  β6        0.3428   0.2423   0.2226    0.2348
  β7        0.3314   0.2242   0.2412    0.2136
  β8        0.3336   0.1926   0.2274    0.2091
  β9        0.3264   0.2245   0.2272    0.2153
  β10       0.3182   0.257    0.2362    0.2371
  β11       0.3081   0.221    0.23      0.233
  β12       0.3226   0.295    0.2623    0.2666

C Algorithms

Algorithm 5: Metropolis-Hastings with a multiplicity list

  initialize X_0
  for k in 1 to K do
    compute Z(X_{k-1}) = Σ_{Y ∈ N_{X_{k-1}}} Q(Y|X_{k-1}) min(1, π(Y)Q(X_{k-1}|Y) / (π(X_{k-1})Q(Y|X_{k-1})))
    simulate the multiplicity M_{k-1} ← 1 + G, where G ∼ Geometric(Z(X_{k-1}))
    set w_{k-1}(X_{k-1}) ← M_{k-1}
    choose the next state X_k ∈ N_{X_{k-1}} such that P̂(X_k = Y | X_{k-1}) ∝ Q(Y|X_{k-1}) min(1, π(Y)Q(X_{k-1}|Y) / (π(X_{k-1})Q(Y|X_{k-1})))
  end for

Algorithm 6: Rejection-Free Metropolis-Hastings

  initialize X_0
  for k in 1 to K do
    compute Z(X_{k-1}) = Σ_{Y ∈ N_{X_{k-1}}} Q(Y|X_{k-1}) min(1, π(Y)Q(X_{k-1}|Y) / (π(X_{k-1})Q(Y|X_{k-1})))
    set w_{k-1}(X_{k-1}) ← 1 / Z(X_{k-1})
    choose the next state X_k ∈ N_{X_{k-1}} such that P̂(X_k = Y | X_{k-1}) ∝ Q(Y|X_{k-1}) min(1, π(Y)Q(X_{k-1}|Y) / (π(X_{k-1})Q(Y|X_{k-1})))
  end for

Algorithm 7: Metropolis-Hastings

  initialize X_0
  for k in 1 to K do
    draw Y ∈ N_{X_{k-1}} from Q(·|X_{k-1})
    set w_{k-1}(X_{k-1}) ← 1
    X_k ← Y with probability min(1, π(Y)Q(X_{k-1}|Y) / (π(X_{k-1})Q(Y|X_{k-1})))    ▷ accept the proposed move to Y
    otherwise X_k ← X_{k-1}    ▷ reject and stay at X_{k-1}
  end for
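To make Algorithms 5 and 6 concrete, here is a minimal Python sketch on the binary cube {0,1}^p with the uniform single-flip proposal Q(Y|X) = 1/p, an illustrative assumption under which the proposal ratio cancels in the acceptance probability. The Geometric variable G is interpreted as counting failures before the first success, so the multiplicity 1 + G has mean 1/Z(X) and matches the deterministic RF-MH weight in expectation.

```python
import numpy as np

def neighbors(x):
    """All states at Hamming distance 1 from the binary vector x."""
    out = []
    for i in range(len(x)):
        y = x.copy()
        y[i] = 1 - y[i]
        out.append(y)
    return out

def rf_mh_step(x, log_pi, rng, multiplicity=False):
    """One step of RF-MH (Algorithm 6) or MH-mult (Algorithm 5) on {0,1}^p
    with the uniform single-flip proposal Q(Y|X) = 1/p.
    Returns (next_state, weight)."""
    p = len(x)
    nbrs = neighbors(x)
    # Acceptance probability of each neighbour: min(1, pi(Y)/pi(X)).
    acc = np.minimum(1.0, np.exp(np.array([log_pi(y) for y in nbrs]) - log_pi(x)))
    z = acc.sum() / p  # Z(X) = sum_Y Q(Y|X) min(1, ...); here 0 < z <= 1
    if multiplicity:
        # Algorithm 5: M = 1 + G with G ~ Geometric(z) counting failures.
        # NumPy's geometric has support {1, 2, ...}, so it already equals
        # 1 + G, with E[M] = 1/z.
        w = float(rng.geometric(z))
    else:
        # Algorithm 6: deterministic weight 1/Z(X), the expected value of M.
        w = 1.0 / z
    j = rng.choice(p, p=acc / acc.sum())  # P(X_k = Y | X) ∝ Q(Y|X) * acceptance
    return nbrs[j], w

# Toy target: pi(x) ∝ exp(sum(x)) on {0,1}^3, so E_pi[sum(X)] = 3e/(1+e).
rng = np.random.default_rng(0)
log_pi = lambda x: float(x.sum())
x, num, den = np.zeros(3, dtype=int), 0.0, 0.0
for _ in range(20000):
    x, w = rf_mh_step(x, log_pi, rng)
    num += w * x.sum()
    den += w
est = num / den  # importance-weighted estimate of E_pi[sum(X)]
```

Because the chain never rejects, the importance weights w are essential: averaging g(X_k) weighted by w_k corrects for the fact that the jump chain spends exactly one step in states where an ordinary MH chain would linger.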
