MCMC with Strings and Branes: The Suburban Algorithm (Extended Version)
Jonathan J. Heckman^{1,2}, Jeffrey G. Bernstein^{3}, Ben Vigoda^{4}

^{1} Department of Physics and Astronomy, University of Pennsylvania, Philadelphia, PA 19104, USA
^{2} Department of Physics, University of North Carolina, Chapel Hill, NC 27599, USA
^{3} Analog Devices | Lyric Labs, One Broadway, Cambridge, MA 02142, USA
^{4} Gamalon Labs, One Broadway, Cambridge, MA 02142, USA

e-mail: jheckman@sas.upenn.edu, jeff.bernstein@analog.com, ben.vigoda@gamalon.com

May 2016

Abstract

Motivated by the physics of strings and branes, we develop a class of Markov chain Monte Carlo (MCMC) algorithms involving extended objects. Starting from a collection of parallel Metropolis-Hastings (MH) samplers, we place them on an auxiliary grid, and couple them together via nearest neighbor interactions. This leads to a class of “suburban samplers” (i.e., spread out Metropolis). Coupling the samplers in this way modifies the mixing rate and speed of convergence for the Markov chain, and can in many cases allow a sampler to more easily overcome free energy barriers in a target distribution. We test these general theoretical considerations by performing several numerical experiments. For suburban samplers with a fluctuating grid topology, performance is strongly correlated with the average number of neighbors. Increasing the average number of neighbors above zero initially leads to an increase in performance, though there is a critical connectivity with effective dimension $d_{\mathrm{eff}} \sim 1$, above which “groupthink” takes over, and the performance of the sampler declines.

Contents

1 Introduction
2 Statistical Inference with Strings and Branes
3 MCMC with Strings and Branes
  3.1 Path Integral for Point Particles
  3.2 Path Integral for Extended Objects
    3.2.1 Splitting and Joining
  3.3 Dimensions and Correlations
    3.3.1 Effective Dimension and Fluctuating Worldvolumes
4 The Suburban Algorithm
  4.1 Implementation
  4.2 Hyperparameters
5 Overview of Numerical Experiments
  5.1 Performance Metrics
  5.2 A Sampling of Samplers
  5.3 Example Targets
6 Effective Connectivity and Symmetric Mixtures
7 Random Landscapes
8 Banana Distribution
9 Free Energy Barriers
10 Conclusions
A Comparison with Slice Sampler

1 Introduction

Markov chain Monte Carlo (MCMC) methods are a remarkably robust way to sample from complex probability distributions. In this class of algorithms, the Metropolis-Hastings (MH) algorithm [1, 2] stands out as an important benchmark. One of the appealing features of the original Metropolis algorithm is the simple physical picture which underlies the general method. Roughly speaking, the idea is that the thermal fluctuations of a particle moving in an energy landscape provide a conceptually elegant way to sample from a target distribution.
Recall that for $X$, a continuous random variable with outcome $x$, we have a probability density $\pi(x)$, and a proposal kernel $q(x'|x)$. In the MH algorithm, a new value $x_{\mathrm{new}}$ is drawn from the distribution $q$ and is then accepted with probability:

$$a(x_{\mathrm{new}} \mid x_{\mathrm{old}}) = \min\left(1,\ \frac{q(x_{\mathrm{old}} \mid x_{\mathrm{new}})}{q(x_{\mathrm{new}} \mid x_{\mathrm{old}})}\, \frac{\pi(x_{\mathrm{new}})}{\pi(x_{\mathrm{old}})}\right). \qquad (1.1)$$

On the other hand, there are also well known drawbacks to MCMC methods. For example, though in many cases there is an expectation that sampling will converge to the correct posterior distribution, the actual speed at which this can occur is often unknown. Along these lines, it is possible for a sampler to remain trapped in a metastable equilibrium for a long period of time. A related concern is that once a sampler becomes trapped, a large free energy barrier can obstruct an accurate determination of the global structure of the distribution. Some of these issues can be overcome by sufficient tuning of the proposal kernel, or by comparing the performance of different samplers. It is therefore natural to ask whether further inspiration from physics can lead to new examples of samplers.

Now, although the physics of point particles underlies much of our modern understanding of natural phenomena, it has proven fruitful, especially in the context of high energy theoretical physics, to consider objects such as strings and more generally $p$-branes with finite extent in $p$ spatial dimensions (a string being the case of a 1-brane). One of the main features of branes is that the number of spatial dimensions strongly affects how a localized perturbation propagates across the worldvolume. Viewing a brane as a collective of point particles that interact with one another (see figure 1), this suggests applications to questions in statistical inference [3].

Figure 1: Depiction of how parallel MH samplers (left) and a suburban sampler (right) evolve as a function of time. In the suburban sampler, nearest neighbors on a grid can have correlated inferences (depicted by dashed lines), leading to faster mixing rates. The absence of a dashed line in a given time step indicates a splitting of the extended object.

Motivated by these physical considerations, our aim in this work will be to study generalizations of the MH algorithm for such extended objects. For an ensemble of $M$ parallel MH samplers of $\pi(x)$, we can alternatively view this as a single particle sampling from $M$ variables $x_1, ..., x_M$ with density:

$$\pi(x_1, ..., x_M) = \pi(x_1) \cdots \pi(x_M), \qquad (1.2)$$

where the proposal kernel is simply:

$$q_{\mathrm{parallel}}(x_1^{\mathrm{new}}, ..., x_M^{\mathrm{new}} \mid x_1^{\mathrm{old}}, ..., x_M^{\mathrm{old}}) = q(x_1^{\mathrm{new}} \mid x_1^{\mathrm{old}}) \cdots q(x_M^{\mathrm{new}} \mid x_M^{\mathrm{old}}). \qquad (1.3)$$

To realize an MCMC algorithm for an extended object, we shall keep the same target $\pi(x_1, ..., x_M)$, but we will now change the proposal kernel by interpreting the index $\sigma$ on $x_\sigma$ as specifying the location of a statistical agent in a network. Depending on the connectivity of this network, an agent may interact with several neighboring agents (if each agent communicates with no neighbors, this is equivalent to parallel MH samplers). Schematically then, MCMC with an extended object involves modifying the proposal kernel to the form:

$$q_{\mathrm{extend}}(x_1^{\mathrm{new}}, ..., x_M^{\mathrm{new}} \mid x_1^{\mathrm{old}}, ..., x_M^{\mathrm{old}}) = \prod_{\sigma=1}^{M} q_\sigma(x_\sigma^{\mathrm{new}} \mid \text{Neighbors of } x_\sigma^{\mathrm{old}}). \qquad (1.4)$$

In the above, the connectivity of the extended object specifies its overall topology. For example, in the case of a string, i.e., a one-dimensional extended object, the neighbors of $x_i$ are $x_{i-1}$, $x_i$, and $x_{i+1}$. Figure 1 depicts the time evolution of parallel MH samplers compared with the suburban sampler.
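For reference in what follows, here is a minimal sketch of the MH update of equation (1.1), run as $M$ independent parallel chains as in equation (1.3). This is an illustration only, not the implementation used in our experiments; the symmetric Gaussian proposal, the function names, and the step size are placeholder choices.

```python
import numpy as np

def parallel_mh(log_pi, x0, n_steps, step=0.5, rng=None):
    """M independent Metropolis-Hastings chains, as in eq. (1.3).

    log_pi : callable returning log pi(x) elementwise
    x0     : array of shape (M,), initial state of each chain
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    samples = np.empty((n_steps, x.size))
    for t in range(n_steps):
        # Symmetric Gaussian proposal, so the q-ratio in eq. (1.1) is 1.
        x_new = x + step * rng.standard_normal(x.size)
        log_a = np.minimum(0.0, log_pi(x_new) - log_pi(x))
        accept = np.log(rng.random(x.size)) < log_a
        x = np.where(accept, x_new, x)
        samples[t] = x
    return samples

# Example: a 1D standard normal target, pi(x) = exp(-x^2/2).
samples = parallel_mh(lambda x: -0.5 * x**2, x0=np.zeros(8), n_steps=1000)
```

Because this proposal is symmetric, the ratio of proposal kernels in equation (1.1) drops out; the suburban samplers below use asymmetric, neighbor-dependent proposals, so the full ratio must be kept.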
From this perspective, the suburban algorithm is a particular choice of ensemble MCMC. Ensemble samplers have been considered previously in the MCMC literature (see e.g., [4–14]), though as far as we are aware, the physical interpretation as well as the specific suite of algorithms we propose is new. These methods fall generally into two categories: those that, like the suburban algorithm, operate over identical copies of the target distribution; and those that operate over a parameterized family of related but not identical distributions. The former includes reference [5], which uses a population of samples to adaptively choose a proposal direction; reference [9], which uses a population of samples to generate proposals that are invariant under affine transformations of the underlying space; and reference [10], which uses subsets of the sample populations to estimate parameters of an approximate distribution used to generate proposals in an elliptical slice sampler. The second category is exemplified by parallel tempering [7], in which parallel chains operate over a family of distributions parameterized by temperature, where proposals include both local transitions and exchanges of state between pairs of chains. This category of methods uses distributions that mix better than the target distribution, but are similar enough to each other that exchanges will be accepted with reasonable probability. There are many MCMC variations in the literature that follow this general approach, including references [11–14].

Returning to the case of suburban samplers, there are potentially many consistent ways to connect together the inferences of statistical agents. From the perspective of physics, this amounts to a notion of distance/proximity between nearest neighbors in a brane. A physically well-motivated way to eliminate this arbitrary feature is to allow the notion of proximity itself to fluctuate. From the perspective of equation (1.4), we treat the placement of nearest neighbors as specifying a collection of random graphs, and by allowing possible fluctuations, various agents reach a collective inference differently. Indeed, from this perspective, it is also natural to allow the brane to split into smaller constituent parts, or to join them back up (see figure 1). In contrast to the case of a grid with a fixed topology, the physics of general splitting and joining is less tractable analytically (except in special limits where perturbation theory via a small expansion parameter is available).

Turning the discussion around, the general considerations presented here appear to have consequences for our understanding of quantum fields and strings. As noted in reference [3], one way to form approximate observables in a theory of quantum gravity is to consider inference by an ensemble of agents pooling their (approximate) local observations. From this perspective, the present paper can be viewed as a concrete implementation of this general proposal using the framework of Markov chain Monte Carlo sampling.
In particular, the appearance of a preferred role for an effective one-dimensional connectivity, as predicted in [3], suggests a central role for such objects in any formulation of quantum gravity.

We view MCMC with extended strings and branes as a novel class of ensemble samplers in which there is some random degree of connectivity between multiple statistical agents. By correlating the inferences of nearest neighbors in this way, we can expect there to be some impact on performance. For example, the degree of connectivity impacts the mixing rate for obtaining independent samples. Another important feature is that because we are dealing with an extended object, different statistical agents may become localized in different high density regions. Provided the connectivity with neighbors is sufficiently low, coupling these agents then has the potential to provide a more accurate global characterization of a target distribution. Conversely, connecting too many agents together may cause the entire collective to suffer from “groupthink” in the sense of [3], namely, once an initial erroneous inference is reached it can become difficult to correct. In the statistical mechanical interpretation of statistical inference developed in references [3, 15, 16], this can be viewed as the standard tradeoff in thermodynamics between minimizing the energy (i.e., obtaining an accurate inference) and maximizing entropy (i.e., exploring a broader class of configurations). In particular, we shall present some general arguments that the optimal connectivity for a network of agents on a grid arranged as a hypercubic lattice with some percolation (i.e., we allow for broken links) occurs at a critical effective dimension:

$$d_{\mathrm{eff}} \sim 1 \qquad (1.5)$$

where $2 d_{\mathrm{eff}}$ is the average number of neighbors. To summarize: with too few friends one drifts into oblivion, but with too many friends one becomes a boring conformist.

To test these general theoretical considerations, we perform a number of numerical experiments for a variety of simple target distributions. One of the simple features of this class of proposal kernels is that there is a hyperparameter available (the average degree of connectivity) which allows us to smoothly interpolate from the case of an extended object to a collection of independent parallel MH samplers. Overall, we find that some level of connectivity leads to a generic improvement over parallel MH. We also address the extent to which the extended nature of a brane impacts its performance. Holding fixed the average effective dimension $d_{\mathrm{eff}}$ but varying the overall topology of the extended object from a 1d, to 2d, to 4d grid, as well as an Erdős–Rényi ensemble of random graphs, leads to comparable performance for the different samplers. In all of the cases we have encountered, the mixing rate is indeed fastest at a critical effective dimension as dictated by line (1.5). In some cases, however, the clumping effects of a higher dimensional grid are helpful, especially when there is a landscape of local maxima in the target distribution.

The rest of this paper is organized as follows. We first begin in section 2 with some general qualitative considerations on the potential links between statistical inference and extended objects such as those which arise in string theory. We then turn in section 3 to a general discussion of the physics of extended objects and its relation to MCMC.
Readers not interested in the theoretical underpinnings of the algorithm can bypass most of section 3. In section 4 we present the “suburban algorithm.” In section 5 we turn to an overview of our numerical experiments. Section 6 highlights the dependence of the algorithm on the various hyperparameters, and in particular the average degree of connectivity with neighbors. In sections 7 and 8 we study particular examples of target distributions, and in section 9 we study some controlled examples where we increase the free energy barrier between centers of a mixture model of two normal distributions, showing that as the barrier separation increases, the performance of parallel MH degrades more quickly than that of a $d_{\mathrm{eff}} \sim 1$ suburban sampler. Section 10 contains our conclusions and potential directions for future work. In Appendix A we discuss in more detail the relative performance with slice sampling. For a condensed account of our results, we refer the interested reader to reference [17].

Figure 2: Depiction of how an extended object such as a string can overcome a free energy barrier.

Finally, a standalone copy of the Java libraries for the suburban sampler, and its interface with the Dimple libraries, is available at the publicly available GitLab repository https://gitlab.com/suburban/suburban. We have also included a short Matlab demo for the suburban sampler.

2 Statistical Inference with Strings and Branes

To frame the results to follow, in this section we discuss both the physical motivation and applications connected with statistical inference with extended objects such as strings and branes.

The essential point is that in the context of a quantum theory of gravity such as string theory, it is not entirely clear whether there is a completely well-defined notion of a local observable. Along these lines, it is fruitful to ask whether a collective of observers can agree in some approximate way on data measured by an ensemble. With this in mind, reference [3] proposed to study the observations of a collective of statistical agents pooling their resources to reach a final inference scheme. A concrete way to pose this question is to ask the sense in which the collective can accurately reconstruct a joint probability distribution such as:

$$\pi(x_1, ..., x_M) = \pi(x_1) \cdots \pi(x_M), \qquad (2.1)$$

for $M$ statistical agents. Following the general ideas presented in references [15, 16] for individual agents, statistical inference can be understood in terms of a statistical mechanics problem in which the relative entropy between an agent's proposed probability distribution and the true distribution provides a notion of energy. This suggests a natural application to quantum gravity, where an individual observer may only have access to their individual “worldview” which is then improved by further samples of an actual data set.

Now, in quantum gravity there is a well-known issue with the use of point particles which stems from the fact that there is, strictly speaking, no notion of a gauge invariant local observable. Rather, it is generally expected that some notion of locality must give way, and must also be accompanied by the appearance of spread out or extended objects. Along these lines, it is natural to ask whether an inference scheme adopted by an extended object can lead to different conclusions from those obtained by independent point particles.
This question was studied in reference [3], where general considerations led to the conclusion that the standard conditions of quantum strings suggest a privileged role for one-dimensional objects. The main idea of [3] is that when statistical agents share data along a discretized worldvolume lattice, new inference schemes can be achieved which are unavailable to an individual agent. Additionally, there is a privileged role for $1+1$ dimensional objects because in this case, the two-point function for a scalar field exhibits a late-time logarithmic divergence. This is milder than the power law divergence present for a free point particle, suggesting a more stable inference scheme relative to this case. Coupling this system to worldsheet gravity can also be understood at an abstract level as an additional layer of inference by a meta-agent, namely, one where the connectivity between nearest neighbors can be rearranged.

These general considerations naturally suggest a number of important followup questions, especially in the context of string theory. For one, the privileged role of one-dimensional objects appears to be at odds with some of the general lessons reached from the study of non-perturbative dualities, where various extended objects are in some sense on an “equal footing” with quantum strings. This in turn raises the question of whether the connection between strings and inference in quantum gravity is only an artifact of working with a particular geometric connectivity for agents in the collective. At a more concrete level, there is also the question of the precise mechanism by which a collective actually “shares” information, namely how the pooling of resources in the entire collective actually takes place.

One of the aims of the present paper will be to address these issues by showing how inference can actually be implemented for an extended object. Along these lines, we focus on the case of Markov chain Monte Carlo sampling methods. The appearance of worldvolume gravity for the extended object will also be crudely characterized in terms of a statistical ensemble of random graphs, which act to define a time-dependent notion of locality for agents in the collective. In a certain sense this is a cruder notion of gravity than is present in the physical superstring, but it has the advantage of being discretized and fully non-perturbative. From this perspective, one should view the results of this paper as a concrete way to implement a non-perturbative formulation of strings and branes making observations in a target space. The average degree of connectivity will provide us with a notion of an effective dimension. While this is admittedly less refined than the standard notions used in much of the high energy theory literature, it has the definite advantage of being completely well-defined, so that we can implement and test it numerically. Indeed, even though it is crude, the remarkable fact that there is an effective dimension which appears to govern the main elements of the inference scheme is highly non-trivial, and provides further evidence of the crucial role of effectively one-dimensional objects.
Finally, though we shall be implementing a Markov chain Monte Carlo sampling algorithm, the aim here is to better understand how the topology and dimension of a fluctuating lattice itself influences the overall speed and accuracy of an inference scheme. This is rather different from the standard approach in lattice quantum field theory, where it is typically assumed that the lattice is fixed, and moreover, the structure of the target distribution $\pi(x)$ is assumed to take a relatively simple canonical form.

With these physical considerations in mind, we now turn to the implementation of MCMC with strings and branes.

3 MCMC with Strings and Branes

One of the main ideas we shall develop in this paper is MCMC methods for extended objects. In this section we begin with the theoretical elements of this proposal, giving a path integral formulation of MCMC for point particles and branes. Some of this material is likely familiar to some physicists as the “Feynman–Kac” path integral formulation of stochastic processes, though as far as we are aware, the specific application to MCMC methods we focus on here has not appeared before in the literature. For earlier related work on the statistical mechanics of statistical inference, see [15, 16] and [3]. For a relatively concise review of some aspects of string theory and the physics of branes, we refer the interested reader to [18, 19], and references therein. For additional background on details of quantum field theory, we refer the interested reader to [20, 21], and references therein.

Suppose then, that we have a target distribution $\pi(x)$. In an MCMC algorithm we produce a sequence of “timesteps” $x^{(1)}, ..., x^{(t)}, ..., x^{(N)}$ which can be viewed as the motion of a point particle exploring a target space $\Omega$. More formally, this sequence of points defines the “worldline” for a particle, and consequently a map from time to the target:

$$x: \text{Worldline} \to \text{Target} \quad \text{with} \quad t \mapsto x(t). \qquad (3.1)$$

In the case of a string, we extend the notion of a “worldline” to a “worldsheet,” i.e., we have both a temporal and a spatial extent with respective coordinates $t$ and $\sigma$:

$$x: \text{Worldsheet} \to \text{Target} \quad \text{with} \quad (t, \sigma) \mapsto x(t, \sigma). \qquad (3.2)$$

More generally, if we have an extended object with $d$ spatial directions, we get a map from a “worldvolume” to the target:

$$x: \text{Worldvolume} \to \text{Target} \quad \text{with} \quad (t, \sigma_1, ..., \sigma_d) \mapsto x(t, \sigma_1, ..., \sigma_d). \qquad (3.3)$$

The cases $d = 0$ and $d = 1$ respectively denote a point particle and a string. To make the analysis of these maps computationally tractable, we will have to discretize these worldvolumes. So, in addition to making finite timesteps, we will also have to work with a finite number of statistical agents spanning the spatial directions of the worldvolume.
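Concretely, after discretization these maps are nothing more than arrays of samples. A minimal sketch (the array shapes and names are ours, for illustration):

```python
import numpy as np

N, m, d = 1000, 8, 2    # timesteps, sites per spatial direction, spatial dimensions
x_particle = np.zeros(N)                 # worldline,   eq. (3.1): t -> x(t)
x_string   = np.zeros((N, m))            # worldsheet,  eq. (3.2): (t, sigma) -> x(t, sigma)
x_brane    = np.zeros((N,) + (m,) * d)   # worldvolume, eq. (3.3), with M = m**d agents
```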
Using this formulation, we shall extract some basic properties such as the correlation between samples as a function of time. In particular, we will see that the overall connectivity, i.e., the number of nearest neighbor interactions, strongly influences both spatial as well as temporal correlations. This spatial connectivity also affects the motion of the extended object on a fixed target. Compared with the case of independent point particles, this can allow an extended object to more easily explore global aspects of a target.

The rest of this section is organized as follows. First, we give a path integral formulation of MCMC for a point particle exploring a fixed target distribution. We then turn to the generalization for strings and branes, and introduce the notion of splitting and joining as well. After introducing the general formalism, we then turn to an analysis of how the average degree of connectivity for statistical agents in an ensemble impacts the resulting inference scheme.

3.1 Path Integral for Point Particles

To frame the discussion to follow, in this subsection we introduce some background formalism on path integrals. Our aim will be to gear up for the case of extended objects.

In what follows, we denote the random variable as $X$ with outcome $x$ on a target space $\Omega$ with measure $dx$. We consider sampling from a probability density $\pi(x)$. In accord with physical intuition, we view $-\log \pi(x)$ as a potential energy, i.e., we write:

$$\pi(x) = \exp(-V(x)). \qquad (3.4)$$

In general, our aim is to discover the structure of $\pi(x)$ by using some sampling algorithm to produce a sequence of values $x^{(1)}, ..., x^{(N)}$. A quantity of interest is the expected value of $\pi(x)$ with respect to a given probability distribution of paths. This helps in telling us the relative speed of convergence and the mixing rate. To study this, it is helpful to evaluate the expectation value of the quantity:

$$\prod_{i=1}^{N} \exp(-\beta^{(i)} V(x^{(i)})) \qquad (3.5)$$

with respect to a given path generated by our sampler. We can then differentiate with respect to the $\beta^{(i)}$'s to study the rate at which our sampler explores the target distribution. In more general terms, the reason to be interested in this expectation value comes from the statistical mechanical interpretation of statistical inference [3, 15, 16]: there is a natural competition between staying in high likelihood regions (minimizing the potential), and exploring more of the distribution (maximizing entropy). The tradeoff between the two is neatly captured by the path integral formalism. Indeed, in the special case $\beta^{(i)} = 1$ we have an especially transparent interpretation: it tells us about a particle moving in a potential $V(x)$, subject to a thermal background, as specified by the choice of probability measure over possible paths. Indeed, we will view this probability measure as defining a “kinetic energy” in the sense that at each time step, we apply a random kick to the trajectory of the particle, as dictated by its contact with the thermal reservoir.

Along these lines, if we have an MCMC sampler with transition probabilities $T(x^{(i)} \to x^{(i+1)})$, the expected value depends on:

$$Z_{\mathrm{path}}(\beta^{(i)}) = T(x^{(0)} \to x^{(1)})\, e^{-\beta^{(1)} V(x^{(1)})} \times ... \times T(x^{(N-1)} \to x^{(N)})\, e^{-\beta^{(N)} V(x^{(N)})}. \qquad (3.6)$$

Marginalizing over the intermediate values, we get:

$$Z = \int [dx] \left( \prod_{i=0}^{N-1} T(x^{(i)} \to x^{(i+1)})\, e^{-\beta^{(i+1)} V(x^{(i+1)})} \right) \qquad (3.7)$$

where we have introduced the measure factor $[dx] = dx^{(1)} \cdots dx^{(N)}$. We would like to interpret $V(x)$ as a potential energy and $-\log T(x^{(i)} \to x^{(i+1)})$ as a kinetic energy. So, we shall write:

$$V(x) = -\log \pi(x) \quad \text{and} \quad K(x^{(i)}, x^{(i+1)}) = -\log T(x^{(i)} \to x^{(i+1)}). \qquad (3.8)$$

We now observe that our expectation value has the form of a well-known object in physics: a path integral (albeit one in Euclidean signature; see below for details)!
For example, with all $\beta^{(i)} = 1$, we have:

$$Z(x_{\mathrm{begin}} \to x_{\mathrm{end}}) = \int_{\mathrm{begin}}^{\mathrm{end}} [dx]\, \exp\Big(-\sum_t L^{(E)}[x^{(t)}]\Big) \qquad (3.9)$$

where we have introduced the Euclidean signature Lagrangian:

$$L^{(E)}[x^{(t)}] = K + V. \qquad (3.10)$$

Since we shall also be taking the number of timesteps to be very large, we make the Riemann sum approximation and introduce the rescaled Lagrangian density:

$$\frac{1}{N} \sum_t \mapsto \int dt, \qquad N L^{(E)} \mapsto \mathcal{L}^{(E)} \qquad (3.11)$$

so that we can write our process as:

$$Z(x_{\mathrm{begin}} \to x_{\mathrm{end}}) = \int [dx]\, \exp\left(-\int dt\, \mathcal{L}^{(E)}[x(t)]\right), \qquad (3.12)$$

where by abuse of notation, we use the same variable $t$ to reference both the discretized timestep as well as its continuum counterpart.

A few comments are in order here. Readers familiar with the Lagrangian formulation of classical mechanics and quantum mechanics will note that we have introduced $K + V$ rather than $K - V$ as our Lagrangian. In physical terms, this has important consequences, particularly in the interpretation of the time evolution of a saddle point solution (i.e., one that is solved by the Euler–Lagrange equations of motion). As an illustrative example, we see that for a quadratic potential, we do not obtain the familiar behavior of a harmonic oscillator with trajectory $x(t) \sim \exp(i\omega t)$, but rather $x(t) \sim \exp(-\omega t)$. Formally, this amounts to the substitution $t \mapsto it$, which is often referred to as a “Wick rotation” or passing to “Euclidean signature.” Physically, what it means is that rather than getting oscillatory behavior, we instead get a diffusion or spread in the location of the particle. For further discussion on Euclidean signature quantum field theory, i.e., statistical field theory, see for example [20, 21].

To give further justification for this terminology, consider now the specific case of the Metropolis-Hastings algorithm. In this case, we have a proposal kernel $q(x'|x)$, and acceptance probability:

$$a(x'|x) = \min\left(1,\ \frac{q(x|x')}{q(x'|x)}\, \frac{\pi(x')}{\pi(x)}\right). \qquad (3.13)$$

The total transition probability is then given by a sum of two terms. One is given by $a(x'|x)\, q(x'|x)$, i.e., we accept the new sample. We also sometimes reject the sample, i.e., we keep the same value as before:

$$T(x \to x') = r \times \delta(x - x') + a(x'|x)\, q(x'|x), \qquad (3.14)$$

where $\delta(x - x')$ is the Dirac delta function, and we have introduced an averaged rejection rate:

$$r \equiv 1 - \int dx'\, a(x'|x)\, q(x'|x). \qquad (3.15)$$

The specific optimal value depends on the target distribution and the proposal kernel. (Footnote 2: For example, under the assumption that the limiting diffusion approximation is valid, the optimal acceptance rate is 0.234 [22].)

For illustrative purposes, suppose that we work in the special limit where the acceptance rate is close to one, and that we have a Gaussian proposal kernel so that $-\log q(x^{(t+1)} | x^{(t)}) \sim \alpha (x^{(t+1)} - x^{(t)})^2$. In this case, the path integral takes a rather pleasing form which has a simple physical interpretation. We have:

$$\text{High Acceptance:} \qquad L^{(E)}[x^{(t)}] = K + V \simeq \alpha\, (x^{(t+1)} - x^{(t)})^2 + V(x^{(t)}), \qquad (3.16)$$

where we interpret the finite difference between time steps as a time derivative:

$$D_t x \equiv x^{(t+1)} - x^{(t)}. \qquad (3.17)$$
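As a concrete illustration of the discretization, the following sketch evaluates the discretized Euclidean action of equations (3.16)–(3.17) along a stored trajectory, in the high-acceptance approximation; the quadratic potential and the value of $\alpha$ are placeholder choices, and the function name is ours.

```python
import numpy as np

def euclidean_action(path, V, alpha=1.0):
    """Discretized action of eq. (3.16): sum_t [ alpha*(D_t x)^2 + V(x_t) ].

    path : 1D array of sampled positions x^(1), ..., x^(N)
    V    : callable potential, V(x) = -log pi(x)
    """
    kinetic = alpha * np.sum(np.diff(path) ** 2)   # sum of (D_t x)^2, eq. (3.17)
    potential = np.sum(V(path[:-1]))               # potential evaluated along the path
    return kinetic + potential

# Example: action of a random-walk path in the quadratic potential V(x) = x^2 / 2.
rng = np.random.default_rng(0)
path = np.cumsum(0.1 * rng.standard_normal(500))
S = euclidean_action(path, lambda x: 0.5 * x**2)
```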
More generally, we can ask what happens for intermediate values of $a$. In general, this is a challenging question, so we do not expect to have as simple a form for the Euclidean signature Lagrangian. Nevertheless, we shall see that much of the structure already found persists in this case as well. Along these lines, we shall attempt to approximate the mixture model $T(x \to x')$ by a normal distribution $q_{\mathrm{eff}}(x^{(t+1)} | x^{(t)})$ such that $-\log q_{\mathrm{eff}}(x^{(t+1)} | x^{(t)}) \sim \alpha_{\mathrm{eff}} (x^{(t+1)} - x^{(t)})^2$. To this end, we match the first and second moments of the putative normal distribution with our net transition rate, in the approximation that we can use the average acceptance rate $a$:

$$\alpha_{\mathrm{eff}} = \frac{1}{a} \times \alpha. \qquad (3.18)$$

So in this more general case, we get the effective Lagrangian:

$$L^{(E)}[x^{(t)}] \simeq \alpha_{\mathrm{eff}}\, (x^{(t+1)} - x^{(t)})^2 + V(x^{(t)}) + ..., \qquad (3.19)$$

where here, the “...” denotes additional correction terms coming from:

$$\text{Correction Term:} \qquad -\log\big( T(x \to x') / q_{\mathrm{eff}}(x^{(t+1)} | x^{(t)}) \big). \qquad (3.20)$$

At a large number of samples, we expect that contributions given by higher order powers of time derivatives are suppressed by powers of $1/N$, a point we discuss in more detail in subsection 3.3. Observe that as the acceptance rate decreases, $\alpha_{\mathrm{eff}}$ increases and the sampled values all concentrate together.

Our plan in the following sections will be to assume the structure of a kinetic term with quadratic time derivatives, but a general potential. The overall strength of the kinetic term will depend on details such as the average acceptance rate. As we discuss in subsection 3.3, the correction terms to this general structure will, for a broad class of models, be suppressed by powers of $1/N$.

3.2 Path Integral for Extended Objects

We now turn to the generalization of the above concepts for strings and branes, i.e., extended objects. To cover this more general class of possibilities, we first introduce $M$ copies of the original distribution, and consider the related joint distribution:

$$\pi(x_1, ..., x_M) = \pi(x_1) \cdots \pi(x_M). \qquad (3.21)$$

If we keep the proposal kernel unchanged, we can simply describe the evolution of $M$ independent point particles exploring an enlarged target space:

$$\Omega_{\mathrm{enlarged}} = \Omega^M = \underbrace{\Omega \times ... \times \Omega}_{M}. \qquad (3.22)$$

If we also view the individual statistical agents on the worldvolume as indistinguishable, we can also consider quotienting by the symmetric group on $M$ letters, $S_M$:

$$\Omega^S_{\mathrm{enlarged}} = \Omega^M / S_M. \qquad (3.23)$$

Of course, we are also free to consider a more general proposal kernel in which we correlate these values. Viewed in this way, an extended object is a single point particle, but on an enlarged target space. The precise way in which we correlate entries across a grid will in turn dictate the type of extended object.

We begin with some general definitions, and later specialize to more tractable cases. Along these lines, suppose we have a graph consisting of $M$ nodes, and a corresponding undirected adjacency matrix $A$, which consists of ones on the diagonal, and just zeroes and ones off the diagonal. We also use $\sigma$ as a general index holder for “spatial position” on a grid. We say that two nodes $\sigma$ and $\sigma'$ are “neighbors” if $A_{\sigma,\sigma'} = 1$. Denote by $\mathrm{Nb}(\sigma)$ the set of neighbors for site $\sigma$. Holding fixed the adjacency matrix $A$, we define the proposal kernel:

$$q(x_1, ..., x_M \mid y_1, ..., y_M, A) \equiv \prod_{\sigma=1}^{M} q_\sigma(x_\sigma \mid y_{\mathrm{Nb}(\sigma)}), \qquad (3.24)$$

where $q_\sigma$ is some choice of proposal kernel for a single point particle. We can therefore adopt two different perspectives on this procedure.
On the one hand, we can view an extended object as propagating over the enlarged target. On the other hand, we can view this extended object as one collective moving on the original target space.

Indeed, much of the path integral formalism carries over unchanged. The only difference is that now, we must also keep track of the spatial extent of our object. So, we again introduce a potential energy $V$ and a kinetic energy $K$:

$$V = -\log \pi \quad \text{and} \quad K = -\log T, \qquad (3.25)$$

and a Euclidean signature Lagrangian density:

$$L^{(E)}[x(t, \sigma_A)] = K + V, \qquad (3.26)$$

where here, $\sigma_A$ indexes locations on the extended object, and the subscript $A$ makes implicit reference to the adjacency on the graph. The transition probability is:

$$Z(x_{\mathrm{begin}} \to x_{\mathrm{end}} \mid A) = \int [dx]\, \exp\Big(-\sum_t \sum_\sigma L^{(E)}[x(t, \sigma_A)]\Big), \qquad (3.27)$$

where now the measure factor $[dx]$ involves a product over $dx^{(t)}_\sigma$. Since we shall also be taking the number of time steps and agents to be large, we again make the Riemann sum approximation and introduce the rescaled Lagrangian density:

$$\frac{1}{N} \sum_t \mapsto \int dt, \qquad \frac{1}{M} \sum_\sigma \mapsto \int d\sigma_A, \qquad N M\, L^{(E)} \mapsto \mathcal{L}^{(E)} \qquad (3.28)$$

so that the expectation value has the continuum description:

$$Z(x_{\mathrm{begin}} \to x_{\mathrm{end}} \mid A) = \int [dx]\, \exp\left(-\int dt\, d\sigma_A\, \mathcal{L}^{(E)}[x(t, \sigma_A)]\right), \qquad (3.29)$$

in the obvious notation. Strictly speaking, the integral with measure $d\sigma_A$ may fail to have a smooth continuum limit (i.e., when $M \to \infty$), so when it does not, no continuum approximation is available and we should view this as merely a shorthand for the discretized answer.

3.2.1 Splitting and Joining

In the above discussion, we held fixed a particular choice of adjacency matrix. This choice is somewhat arbitrary, and physical considerations suggest a natural generalization where we sum over a statistical ensemble of choices. We shall loosely refer to this splitting and joining of connectivity as “incorporating gravity” into the dynamics of the extended object, because it can change the notion of which statistical agents are nearest neighbors. (Footnote 3: It is not quite gravity in the worldvolume theory, because there is a priori no guarantee that our sum over different graph topologies will have a smooth semi-classical limit. Nevertheless, summing over different ways to connect the statistical agents conveys the main point that the proximity of any two agents can change. For additional discussion, see for example reference [3].)

Along these lines, we incorporate an ensemble $\mathcal{A}$ of possible adjacency matrices, with some prescribed probability to draw a given adjacency matrix. Since we evolve forward in discretized time steps, we can in principle have a sequence of such matrices $A^{(1)}, ..., A^{(N)}$, one for each timestep. For each draw of an adjacency matrix, the notion of nearest neighbor will change, which we denote by writing $\sigma_{A(t)}$, that is, we make implicit reference to the connectivity of nearest neighbors. Marginalizing over the choice of adjacency matrix, we get:

$$Z(x_{\mathrm{begin}} \to x_{\mathrm{end}}) = \int [dx][dA]\, \exp\Big(-\sum_t \sum_\sigma L^{(E)}[x(t, \sigma_{A(t)})]\Big), \qquad (3.30)$$

where now the integral involves summing over multiple ensembles: the spatial and temporal values with measure factor $dx^{(t)}_\sigma$, as well as the choice of a random matrix from the ensemble with measure factor $dA^{(t)}$ (one such integral for each timestep). At a very general level, one can view the adjacency matrix as adding additional auxiliary random variables to the process. So in this sense, it is simply part of the definition of the proposal kernel.

The topology of an extended object dictates a choice of statistical ensemble $\mathcal{A}$. We illustrate this by giving some particular examples which we study in more detail later on.
For a collection of $M$ independent, but indistinguishable point particles, the ensemble of adjacency matrices is given by:

$$\mathcal{A}_{\mathrm{particles}} = \big\{ S A S^{-1} \ \big|\ A \text{ is the } M \times M \text{ identity and } S \in S_M \big\}. \qquad (3.31)$$

For an ensemble of strings, we have a notion of a nearest neighbor interaction, and so we also introduce a split/join probability $p_{\mathrm{join}}$:

$$\mathcal{A}_{\mathrm{string}}(p_{\mathrm{join}}) = \big\{ S A S^{-1} \ \text{with:}\ S \in S_M;\ A_{\sigma\sigma} = 1;\ A_{\sigma,\sigma+1} = A_{\sigma+1,\sigma} = 1 \text{ with probability } p_{\mathrm{join}};\ A_{\sigma\sigma'} = 0 \text{ otherwise} \big\}, \qquad (3.32)$$

where in the above, the index $\sigma = M + 1$ is identified with $\sigma = 1$. That is, we have a circulant matrix: geometrically, we view $1, ..., M$ as arranged along a circle, with each link either on or off.

More generally, we can consider the case of a $d$-dimensional hypercubic lattice, i.e., an extended object in $d$ spatial dimensions. In this case, it is somewhat simpler to first introduce an $\underbrace{m \times ... \times m}_{d} \times \underbrace{m \times ... \times m}_{d}$ array with $m^d = M$, which we then repackage in terms of an $M \times M$ matrix. For a hypercubic lattice in $d$ dimensions, we introduce $A_{\sigma_1,...,\sigma_d;\, \sigma'_1,...,\sigma'_d}$, and define the ensemble of arrays for a brane as:

$$\mathcal{A}_{\mathrm{brane}}(p_{\mathrm{join}}) = \big\{ S A S^{-1} \ \text{with:}\ S \in S_M;\ A_{\sigma_1,...,\sigma_d;\, \sigma_1,...,\sigma_d} = 1;\ A_{\sigma_1,...,\sigma_k,...,\sigma_d;\, \sigma_1,...,\sigma_k+1,...,\sigma_d} = A_{\sigma_1,...,\sigma_k+1,...,\sigma_d;\, \sigma_1,...,\sigma_k,...,\sigma_d} = 1 \text{ with probability } p_{\mathrm{join}};\ A_{\sigma\sigma'} = 0 \text{ otherwise} \big\}. \qquad (3.33)$$

We can repackage this as an $M \times M$ adjacency matrix by replacing the multi-index $\sigma_1, ..., \sigma_d$ by a single base $m$ index:

$$i = 1 + (\sigma_1 - 1) + (\sigma_2 - 1)\, m + ... + (\sigma_d - 1)\, m^{d-1}. \qquad (3.34)$$

Of course, in addition to these geometrically well-motivated choices, we can consider more general ensembles of adjacency matrices. For example, a configuration of random graphs with well studied properties is the Erdős–Rényi ensemble:

$$\mathcal{A}_{\mathrm{ER}}(p_{\mathrm{join}}) = \big\{ A_{\sigma\sigma} = 1;\ A_{\sigma\sigma'} = A_{\sigma'\sigma} = 1 \text{ with probability } p_{\mathrm{join}}\ (\sigma \neq \sigma') \big\}. \qquad (3.35)$$
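To make these ensembles concrete, the sketch below draws adjacency matrices from the string ensemble of equation (3.32) and the Erdős–Rényi ensemble of equation (3.35), with self-loops on the diagonal as in our conventions. It is illustrative only; the function names are ours, and the conjugation by a random permutation implements the “shuffling” of agent locations discussed in section 4.

```python
import numpy as np

def draw_string_adjacency(M, p_join, rng):
    """Circulant string ensemble, eq. (3.32): each circle link on w.p. p_join."""
    A = np.eye(M, dtype=int)                      # A_{sigma,sigma} = 1 by convention
    for s in range(M):
        if rng.random() < p_join:                 # link sigma <-> sigma+1 (mod M)
            A[s, (s + 1) % M] = A[(s + 1) % M, s] = 1
    perm = rng.permutation(M)                     # conjugation by S in S_M ("shuffling")
    return A[np.ix_(perm, perm)]

def draw_er_adjacency(M, p_join, rng):
    """Erdos-Renyi ensemble, eq. (3.35): each off-diagonal pair linked w.p. p_join."""
    upper = np.triu(rng.random((M, M)) < p_join, k=1).astype(int)
    return upper + upper.T + np.eye(M, dtype=int)

rng = np.random.default_rng(0)
A_string = draw_string_adjacency(16, p_join=0.5, rng=rng)
A_er = draw_er_adjacency(16, p_join=2.0 / 15, rng=rng)  # about 2 neighbors per node
```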
3.3 Dimensions and Correlations

In the previous section we presented some general features of strings and branes, and their generalization to Markov chains. Following some of the general considerations outlined in reference [3], in this section we discuss the extent to which the extended nature of such objects plays a role in statistical inference, and in particular in MCMC.

To keep our discussion from becoming overly general, we shall initially specialize to the case of a hypercubic lattice of agents in $d$ spatial dimensions arranged on a torus, and we denote a location on the grid by a $d$-component vector $\sigma$. We shall later relax these considerations to allow for the possibility of a fluctuating worldvolume. For a fixed grid, each grid site has precisely $2d$ neighbors. In what follows, we shall find it convenient to introduce a set of $d$ unit vectors:

$$e_1 = (1, 0, ..., 0), \quad e_2 = (0, 1, ..., 0), \quad ..., \quad e_d = (0, 0, ..., 1). \qquad (3.36)\text{--}(3.39)$$

We also specialize the form of the proposal kernel:

$$q_\sigma(x_\sigma(t+1) \mid \mathrm{Nb}(x_\sigma(t))) \propto \exp\left( -\alpha\, (x_\sigma(t+1) - x_\sigma(t))^2 - \sum_{k=1}^{d} \beta\, (x_\sigma(t+1) - x_{\sigma + e_k}(t))^2 - \sum_{k=1}^{d} \beta\, (x_\sigma(t+1) - x_{\sigma - e_k}(t))^2 \right). \qquad (3.40)$$

This has a recognizable form, consisting of finite differences in both the time direction and the spatial directions of our brane. Along these lines, we introduce the notation:

$$D_t x_\sigma = x_\sigma(t+1) - x_\sigma(t), \qquad (3.41)$$
$$D^+_k x_\sigma = x_{\sigma + e_k}(t) - x_\sigma(t), \qquad (3.42)$$
$$D^-_k x_\sigma = x_{\sigma - e_k}(t) - x_\sigma(t), \qquad (3.43)$$

so that the proposal kernel is given by:

$$q_\sigma(x_\sigma \mid \mathrm{Nb}(x_\sigma)) \propto \exp\left( -\alpha\, (D_t x_\sigma)^2 - \sum_{k=1}^{d} \beta\, (D_t x_\sigma - D^+_k x_\sigma)^2 - \sum_{k=1}^{d} \beta\, (D_t x_\sigma - D^-_k x_\sigma)^2 \right). \qquad (3.44)$$

To proceed further, we observe that in a large lattice, the finite differences are well-approximated by derivatives of continuous functions. In this case, we can also write $D^+_k x_\sigma = -D^-_k x_\sigma$, up to higher order derivatives, which as we explain later in this subsection make a subleading contribution to the inference problem. Expanding in this limit, various cross-terms cancel and we get:

$$q_\sigma(x_\sigma \mid \mathrm{Nb}(x_\sigma)) \propto \exp\left( -(\alpha + 2d\beta)\, (D_t x_\sigma)^2 - \sum_{k=1}^{d} 2\beta\, (D^+_k x_\sigma)^2 \right). \qquad (3.45)$$

That is, we see the expected kinetic term for a $(d+1)$-dimensional quantum field theory in Euclidean signature. So far, we have kept our analysis rather general. Now, we would also like to be able to take a canonical limit in which the strength of timelike jumps remains comparable in passing from the completely disconnected grid to the maximally connected grid. To this end, we now further specialize the choice of $\alpha$ as:

$$\alpha = 2\beta - 2d\beta. \qquad (3.46)$$

The full proposal kernel now takes the form:

$$\prod_\sigma q_\sigma(x_\sigma \mid \mathrm{Nb}(x_\sigma)) \propto \exp\left( -2\beta \sum_\sigma \left( (D_t x_\sigma)^2 + \sum_{k=1}^{d} (D^+_k x_\sigma)^2 \right) \right). \qquad (3.47)$$

Now, just as in the case of the point particle path integral, we again see that the effective transition rate defines a kinetic energy term, with an effective strength dictated by the overall acceptance rate. The general form of this kinetic term is given by a form recognizable to physicists:

$$\mathcal{L}^{(E)}[x(t, \sigma)] = 2\beta_{\mathrm{eff}} \sum_\sigma \left( (D_t x_\sigma)^2 + \sum_{k=1}^{d} (D^+_k x_\sigma)^2 \right) + V + ... \qquad (3.48)$$

where $\beta_{\mathrm{eff}}$ sets the effective tension of the brane, and the correction terms “...” indicate that we are again working to quadratic order in the derivatives. So to summarize, we have arrived at a $(d+1)$-dimensional statistical field theory with a kinetic term quadratic in derivatives and a general potential. (Footnote 4: As the astute reader will no doubt notice, the structure of the kinetic term we consider here is not the most general one we could consider. More generally, we can introduce a vector of temporal and spatial derivatives $D_K x$, with $K = 0, 1, ..., d$, and introduce the kinetic term $\frac{1}{2} (D_K x)\, \Sigma^{-1}_{KL}\, (D_L x)$, with $\Sigma$ a positive definite matrix. In the physics literature, this defines a metric on the brane system. An even further generalization is to allow some $x$ dependence in $\Sigma$ itself. Some aspects of this more general case were considered in [3]. We leave a detailed study of the application to MCMC for future work.)
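Since each factor of the kernel is Gaussian in $x_\sigma(t+1)$, drawing a proposal just amounts to completing the square. The sketch below does this for a single site with $n$ active neighbors, using the adaptive choice $\alpha_\sigma = 2\beta - n\beta$; this anticipates equation (3.68) of subsection 3.3.1, and reduces to equation (3.46) on a fixed grid with $n = 2d$. The function name and defaults are ours, for illustration.

```python
import numpy as np

def suburban_proposal(x_sigma, x_neighbors, beta, rng):
    """Draw x_sigma(t+1) from the Gaussian kernel of eq. (3.40).

    The exponent -alpha*(x'-x)^2 - beta*sum_n (x'-x_n)^2 completes the
    square to a normal with precision 2*c, where c = alpha + n*beta.
    """
    n = len(x_neighbors)
    alpha = 2.0 * beta - n * beta             # adaptive time step, cf. eq. (3.68)
    c = alpha + n * beta                      # = 2*beta, independent of n
    mean = (alpha * x_sigma + beta * np.sum(x_neighbors)) / c
    return rng.normal(mean, np.sqrt(1.0 / (2.0 * c)))
```

Note that the resulting precision $2c = 4\beta$ is independent of $n$, which is precisely the point of the canonical choice of $\alpha$: the strength of the timelike jump stays fixed as the connectivity varies.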
One of the things we would most like to understand is the extent to which an extended object with $d$ spatial dimensions can explore the hills and valleys of $V$. We perform a perturbative analysis, at first viewing $V$ as a small correction to the propagation of our extended object. Starting from some fixed position $x_*$, we can then consider the expansion of $V$ around this point:

$$V(x) = V(x_*) + V'(x_*)(x - x_*) + \frac{V''(x_*)}{2}(x - x_*)^2 + ..., \qquad (3.49)$$

and study the impact on the correlation of samples as a function of time. Each of the derivatives of $V(x)$ reveals another characteristic feature length of $V(x)$. These feature lengths are specified by the values of the moments for the distribution $\pi(x)$. Alternatively, we can simply use the various derivatives of $V(x)$ to extract this set:

$$\ell_n \sim \frac{1}{|V^{(n)}(x_*)|^{1/n}}. \qquad (3.50)$$

An infinite value for the feature length simply means there is no new feature length. Let us refer to the set of finite characteristic length scales as $\{\ell_i\}$. Now, there is a clear sense in which we can also view each of these length scales as defining a unit of time on the brane, i.e., how fast we expect our sampler to explore such a feature length. Using our Lagrangian interpretation, these time scales are set by both $\ell_i$ and the strength of the kinetic term:

$$\tau_i \sim \ell_i \times \sqrt{\beta}. \qquad (3.51)$$

We refer to “early” and “late” time behavior as specified by:

$$t_{\mathrm{early}} \ll \tau_i \ll t_{\mathrm{late}}. \qquad (3.52)$$

Since space and time on the worldvolume are on a similar footing, this also defines a notion of “close” and “far” for agents on the grid. By abuse of terminology, we shall lump all of these notions together.

Now, in the limit where $V = 0$, there is a well-known behavior for correlation functions:

$$\langle x(t, \sigma)\, x(0, 0) \rangle \equiv \frac{1}{Z} \int [dx]\, x(t, \sigma)\, x(0, 0)\, \exp\Big(-\sum_t \sum_\sigma L^{(E)}[x(t, \sigma_A)]\Big) \qquad (3.53)$$

which for $(t, \sigma) \in \mathbb{R}^{d+1}$ is given by:

$$\langle x(t, \sigma)\, x(0, 0) \rangle \sim \left( \frac{1}{\sqrt{t^2 + \sum_{i=1}^{d} (\sigma^i)^2}} \right)^{d-1}. \qquad (3.54)$$

(Footnote 5: One way to obtain this scaling relation is to observe that the Fourier transform of $1/k^2$ in $d+1$ dimensions exhibits the requisite power law behavior.)

There is thus a rather sharp change in the behavior of the extended object for $d < 1$ and $d > 1$. For $d = 1$, we have a logarithm rather than a constant. So, for low enough values of $d$, the extended object can wander around at late times, while for larger values, the overall spread in values is suppressed. The crossover between the two behaviors occurs at $d = 1$, i.e., the case of a string.

We would now like to understand the impact that adding a non-trivial potential energy will have on the structure of our correlation functions. In general, this is a challenging problem which has no closed form solution. We can, however, develop a picture for whether we expect these perturbations to impact the early and late time behavior of our sampler. Along these lines, we can introduce the notion of a “scaling dimension” for $x(t, \sigma)$ and its derivatives. The basic idea is that just as we assign a notion of proximity in space and time to agents on a grid, we can also ask how rescaling all distances on the grid via:

$$N \mapsto \lambda N, \qquad M \mapsto \lambda^d M \qquad (3.55)$$

impacts the structure of our continuum theory Lagrangian. The key point is that provided $N$ and $M$ have been taken sufficiently large, or alternatively we take $\lambda$ sufficiently large, we do not expect there to be any impact on the physical interpretation. Unpacking this statement naturally leads us to the notion of a scaling dimension for $x(t, \sigma)$ itself. Observe that rescaling the number of samples and number of agents in line (3.55) can be interpreted equivalently as holding fixed $N$ and $M$, but rescaling $t$ and $\sigma$:

$$(t, \sigma) \mapsto (\lambda t, \lambda \sigma). \qquad (3.56)$$
Now, for our kinetic term to remain invariant, we need to also rescale $x(t, \sigma)$:

$$x(t, \sigma) \mapsto \lambda^{-\Delta}\, x(\lambda t, \lambda \sigma). \qquad (3.57)$$

The exponent $\Delta$ is often referred to as the “scaling dimension” for $x$ obtained from “naive dimensional analysis,” or NDA. It is “naive” in the sense that when the potential $V \neq 0$ and we have strong coupling, the notion of a scaling dimension may only emerge at sufficiently long distance scales. For additional discussion on scaling dimensions and their role in statistical field theory, we refer the interested reader to reference [23]. Note that because we are uniformly rescaling the spatial and temporal pieces of the grid, we get the same answer for the scaling dimension if we consider spatial derivatives along the grid. This assumption can also be relaxed in more general physical systems.

To illustrate, let us now extract the scaling dimension of $x$ for the case of a kinetic term quadratic in derivatives. We take $Dx$ as a placeholder for any choice of derivative, either in space or in time. Under a rescaling, we have:

$$\int dt\, d^d\sigma\, (Dx)^2 \mapsto \lambda^{-2\Delta + d - 1} \int dt\, d^d\sigma\, (Dx)^2. \qquad (3.58)$$

So, invariance of the action requires the exponent of $\lambda$ to vanish, namely:

$$\Delta = \frac{d - 1}{2}. \qquad (3.59)$$

Using this general sort of scaling analysis allows us to characterize possible effects of perturbations, and whether we expect them to drastically impact our inference scheme as we take $N$ and $M$ to be very large. As a first example, consider the effects of the “Correction Terms” in line (3.20). We expect that such contributions will take the form of higher powers in $Dx$, possibly multiplied by powers of $x$ as well. The latter possibility mainly occurs when we have a proposal kernel which cannot be written in terms of temporal and spatial derivatives on a grid, i.e., it plays less of a role in the considerations that follow. So, with this in mind, we can consider the behavior of a perturbation of the form $(x)^\mu (Dx)^\nu$. Applying our NDA prescription, we see that under a rescaling, the contribution such a term makes to the action is:

$$\int dt\, d^d\sigma\, (x)^\mu (Dx)^\nu \mapsto \lambda^{-\mu\Delta - \nu(\Delta + 1) + d + 1} \int dt\, d^d\sigma\, (x)^\mu (Dx)^\nu. \qquad (3.60)$$

However, using (3.59), we see that the overall exponent on the righthand side is:

$$-\mu\Delta - \nu(\Delta + 1) + d + 1 = \frac{(2 - \nu)(d + 1) - \mu(d - 1)}{2}. \qquad (3.61)$$

So in other words, terms of the form $(Dx)^\nu$ for $\nu > 2$ die off as we take $N \to \infty$, i.e., $\lambda \to \infty$. Additionally, we see that when $d \leq 1$, we can in principle expect more general contributions of the form $(x)^\mu (Dx)^\nu$. The presence of such terms will not affect our general conclusions. For additional discussion on the interpretation of such contributions, see reference [3].

Consider next possible perturbations to the potential energy. Again, the impact these higher order terms can have on the early time behavior of the correlation functions of line (3.53) depends on the number of dimensions for the brane. The main point follows from NDA: in general, we are integrating over a $(d+1)$-dimensional spacetime, so since a derivative carries one unit of inverse length, the scaling dimension of $x$ (around the $V = 0$ limit) is just $(d-1)/2$. Each successive interaction term in the potential is of the form $x^n$, with scaling dimension $n(d-1)/2$.
As follows from a perturbative analysis, when these higher order terms have low scaling dimension, their impact on long distance correlations is strong, while conversely, when their scaling dimension is high, their impact on long distance correlations is small. The dividing line is set by whether the scaling dimension of the interaction term is smaller than $d + 1$ (i.e., the number of spacetime directions we integrate over):

$$\frac{n(d-1)}{2} \leq d + 1. \qquad (3.62)$$

So, for $d \leq 1$, all higher order terms can impact the long distance behavior of the correlation functions, while for $d > 1$, the most relevant term is bounded above by:

$$n \leq \frac{2d + 2}{d - 1}. \qquad (3.63)$$

Now, in the context of MCMC, we would like for our extended object to be able to explore different contours of the energy landscape. This in turn means that if our brane has settled near a critical point, it is potentially sensitive to the higher order derivatives in $V(x)$ as in equation (3.49). So, a priori, if $V(x)$ possesses many non-trivial derivatives, taking $d \leq 1$ provides a way to explore more of this landscape. More precisely, we can see that for sufficiently large $d$ we cannot probe much of the global structure of the potential. For example, if we set $n = 3$, we see that $d \leq 5$, i.e., six spacetime dimensions for the worldvolume.

On the other hand, there is also a strong argument to avoid taking $d$ too small. The fact that the time dependence of the two-point function of a free Gaussian field goes as $1/t^{d-1}$ means that there can be significant spread in the fluctuations of a low-dimensional object. This in turn means that such an object may execute a very long random walk before finding anything of interest (wandering in the desert). So to summarize, for $d$ sufficiently small (i.e., close to zero), we can expect to wander for a long time before finding anything of interest, while conversely, if $d$ is bigger than one, “groupthink” takes over in the collective and it is impossible to move away from an initial inference. Clearly, the value of $d$ which is optimal will depend on the precise shape of the potential $V(x)$. Nevertheless, we can already see that there is potentially a significant advantage to correlating the behavior of nearest neighbor interactions.

3.3.1 Effective Dimension and Fluctuating Worldvolumes

In the preceding discussion, we assumed that we had a fixed spatial grid of dimension $d$, where the number of nearest neighbor interactions is always fixed. There are a few drawbacks to this from the perspective of inference. For example, an extended object may become trapped more easily if all of its agents clump in one local minimum of $V(x)$. On the other hand, one of the advantages of an extended object is that there is a natural pull to nearby minima, so it can also potentially explore a landscape more efficiently than parallel point particles. To address this issue, we consider a fluctuating worldvolume, i.e., we take an ensemble of nearest neighbors which actually fluctuates as a function of time.

Since we are now dealing with a fluctuating number of nearest neighbors, we will need to modify our proposal kernel. We again introduce a set of finite differences, but now we specifically indicate the neighbor as $n(\sigma)$:

$$D_t x_\sigma = x_\sigma(t+1) - x_\sigma(t), \qquad (3.64)$$
$$D_{n(\sigma)} x_\sigma = x_{n(\sigma)}(t) - x_\sigma(t), \qquad (3.65)$$

in the obvious notation.
We now introduce a modified proposal kernel where the size of the time step $\alpha_\sigma$ now depends on the number of neighbors:

$$q_\sigma(x_\sigma \mid \mathrm{Nb}(x_\sigma)) \propto \exp\Big( -\alpha_\sigma\, (D_t x_\sigma)^2 - \sum_{n(\sigma)} \beta\, \big( D_t x_\sigma - D_{n(\sigma)} x_\sigma \big)^2 \Big) \qquad (3.66)$$

$$= \exp\Big( -(\alpha_\sigma + n^{\mathrm{tot}}_\sigma \beta)\, (D_t x_\sigma)^2 - \sum_{n(\sigma)} \beta\, \big( D_{n(\sigma)} x_\sigma \big)^2 + \sum_{n(\sigma)} 2\beta\, D_t x_\sigma\, D_{n(\sigma)} x_\sigma \Big), \qquad (3.67)$$

where $n^{\mathrm{tot}}_\sigma$ denotes the total number of nearest neighbors to the site $\sigma$, and the parameter $\alpha_\sigma$ also depends on the total number of nearest neighbors:

$$\alpha_\sigma = 2\beta - n^{\mathrm{tot}}_\sigma \beta. \qquad (3.68)$$

Due to the fluctuating topology, an analysis of the correlation functions is now more challenging. However, there are various approximation schemes available which provide a way to cover this case as well. One crude approximation we shall adopt is to consider the typical random graph chosen from a particular ensemble, and to then further assume that this is well-approximated by just the average degree of connectivity between an agent and its neighbors. For the ensembles introduced earlier, i.e., for a $d$-dimensional hypercubic lattice with some percolation, the average number of neighbors is:

$$\text{Hypercubic Lattice:} \qquad n_{\mathrm{avg}} = 2d \times p_{\mathrm{join}}, \qquad (3.69)$$

while for the Erdős–Rényi ensemble, the average number of neighbors is:

$$\text{Erdős–Rényi:} \qquad n_{\mathrm{avg}} = (M - 1) \times p_{\mathrm{join}}, \qquad (3.70)$$

where $M$ is the total number of agents, i.e., nodes in the graph. For hypercubic lattices, we can also introduce the notion of an effective dimension:

$$d_{\mathrm{eff}} = n_{\mathrm{avg}} / 2, \qquad (3.71)$$

a notion we shall also use (by abuse of terminology) for the Erdős–Rényi ensemble as well. With this in mind, we can reuse our previous analysis with a fixed connectivity, where we replace all occurrences of $d$ by $d_{\mathrm{eff}}$. In this case, there is no need to confine our discussion to integer $d$. When we turn to our numerical experiments, we will indeed see that this approximation provides a reasonable leading order characterization of the dynamics of branes.
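Since $d_{\mathrm{eff}}$ is the organizing quantity for our numerical experiments, it is easy to check equations (3.69)–(3.71) empirically. The sketch below estimates $n_{\mathrm{avg}}$ from sampled graphs; it reuses the illustrative draw_er_adjacency helper from the sketch in subsection 3.2.1, and is again ours rather than part of the library described in section 4.1.

```python
import numpy as np

def effective_dimension(draw_adj, n_draws=2000, rng=None):
    """Monte Carlo estimate of d_eff = n_avg / 2, eq. (3.71)."""
    rng = np.random.default_rng() if rng is None else rng
    n_avg = 0.0
    for _ in range(n_draws):
        A = draw_adj(rng)
        M = A.shape[0]
        # Off-diagonal row sums give the number of neighbors per site.
        n_avg += (A.sum() - M) / M / n_draws
    return n_avg / 2.0

# Erdos-Renyi with p_join = 2/(M-1) should give d_eff close to 1, per eq. (3.70).
d_eff = effective_dimension(lambda rng: draw_er_adjacency(16, 2.0 / 15, rng))
```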
The MH within Gibbs update is then given by sampling from just the univariate distribution:

Algorithm 1 MH within Gibbs
  Introduce Γ = {1, ..., D}
  for i = 1 to D do
    j ← draw from Γ
    x^(j) ← sample from p(x^(j) | x^(1), ..., x^(ĵ), ..., x^(D)) using a 1D MH update
  return (x^(1), ..., x^(D))

Both types of samplers have their relative merits, and we will study examples of both.

Algorithm 2 Suburban Sampler
  Randomly initialize X^(0) and A^(0)
  for t = 0 to N − 1 do
    X^(∗) ← sample from q(X | X^(t), A^(t))
    accept with probability a(X^(∗) | X^(t), A^(t))
    if accept = true then
      X^(t+1) ← X^(∗)
    else
      X^(t+1) ← X^(t)
    A^(t+1) ← draw from 𝒜
  return X^(1), ..., X^(N)

We shall also refer to the proposal kernel as:

\[
q(x_1, \ldots, x_M \mid y_1, \ldots, y_M, A) = \prod_{\sigma=1}^{M} q_\sigma(x_\sigma \mid \mathrm{Nb}(y_\sigma)), \tag{4.2}
\]

where A is the adjacency matrix of the grid. To avoid overloading the notation, we shall write X^(t) ≡ {x_1^(t), ..., x_M^(t)} for the current state of the grid. In what follows, we write the MH acceptance probability as:

\[
a\big(X_{\mathrm{new}} \mid X_{\mathrm{old}}, A\big) = \min\left(1,\; \frac{q(X_{\mathrm{old}} \mid X_{\mathrm{new}}, A)}{q(X_{\mathrm{new}} \mid X_{\mathrm{old}}, A)}\, \frac{\pi(X_{\mathrm{new}})}{\pi(X_{\mathrm{old}})}\right). \tag{4.3}
\]

We now introduce algorithm 2, the suburban algorithm. An important feature of the suburban algorithm is that some of these steps can be parallelized whilst retaining detailed balance. For example, we can pick a coloring of a graph and then perform an update for all nodes of a particular color whilst holding the rest fixed.

There are of course many variations on the above algorithm. For example, in practice for each time step we shall perform a Gibbs update over our M agents. For Gibbs sampling over the target, we then have a Gibbs update schedule with D × M steps, and for the joint sampler, it is over just M steps. We can also choose to not draw a new random graph A^(t) at each step, but rather only every T_draw steps. Other possibilities include stochastic time evolution for A^(t). To keep the analysis tractable, however, we will indeed stick to the simplest possibility, performing an update on the graph topology at each sampling time step.

Now, having collected a sequence of values X^(1), ..., X^(N), we can interpret this as N × M samples of the original distribution π(x). As standard for MCMC methods, we can then calculate quantities of interest such as the mean and covariance for the distribution π(x) by performing an appropriate sum over the observables:

\[
\langle x \rangle_\pi \simeq \frac{1}{NM} \sum_{\sigma,t} x_\sigma^{(t)}, \tag{4.4}
\]
\[
\Big\langle \big(x - \langle x \rangle_\pi\big)^2 \Big\rangle_\pi \simeq \frac{1}{NM - 1} \sum_{\sigma,t} \Big(x_\sigma^{(t)} - \langle x \rangle_\pi\Big)^2, \tag{4.5}
\]

as well as higher order moments.

Let us discuss the reason we expect our sampler to converge to the correct posterior distribution. First of all, we note that although we are modifying the proposal kernel at each time step (i.e., by introducing a different adjacency matrix A ∈ 𝒜), this modification is independent of the current state of the system, so it cannot impact the eventual posterior distribution we obtain. Second, we observe that since we are just performing a specific kind of MH sampling routine for the distribution π(x_1, ..., x_M), we expect to converge to the correct posterior distribution. But since the variables x_1, ..., x_M are all independent, this is tantamount to having also sampled multiple times from π(x).
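Since the paper's implementation lives in Dimple [24] and is not reproduced here, the following is a minimal self-contained Python sketch of algorithm 2 in its joint-update variant; the function names, the choice of an Erdős–Rényi topology draw, and the standard-normal example target are all illustrative assumptions on our part. The proposal uses the Gaussian form of (3.66)–(3.68) (equivalently, the completed square noted above), and acceptance follows (4.3).

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_graph(M, p_join):
    """Erdos-Renyi adjacency matrix: each of the M(M-1)/2 possible
    links is active with probability p_join (one of the ensembles
    considered in the text; illustrative choice here)."""
    A = np.triu((rng.random((M, M)) < p_join).astype(float), 1)
    return A + A.T

def proposal_mean(y, A):
    """Mean of the Gaussian proposal implied by eqs. (3.66)-(3.68):
    each agent drifts by half the summed displacement toward its
    current neighbors."""
    S = A @ y - A.sum(axis=1) * y   # sum over neighbors of (y_n - y_sigma)
    return y + 0.5 * S

def log_q(x_new, y, A, beta):
    """log q(x_new | y, A) up to a constant; the variance 1/(4*beta)
    is independent of the neighbor count, so normalizations cancel
    in the acceptance ratio (4.3)."""
    mu = proposal_mean(y, A)
    return -2.0 * beta * np.sum((x_new - mu) ** 2)

def suburban(log_pi, M, N, beta, p_join):
    """Joint-update suburban sampler (algorithm 2), minimal sketch.
    log_pi maps an array of agent states to per-agent log densities."""
    X = rng.normal(size=M)        # X^(0)
    A = draw_graph(M, p_join)     # A^(0)
    chain = np.empty((N, M))
    for t in range(N):
        mu = proposal_mean(X, A)
        X_star = mu + rng.normal(size=M) / np.sqrt(4.0 * beta)
        # acceptance probability (4.3), with pi(X) = prod_sigma pi(x_sigma)
        log_a = (log_pi(X_star).sum() - log_pi(X).sum()
                 + log_q(X, X_star, A, beta) - log_q(X_star, X, A, beta))
        if np.log(rng.random()) < log_a:
            X = X_star
        chain[t] = X
        A = draw_graph(M, p_join)  # redraw the topology each step
    return chain

# Example: standard normal target; p_join chosen so that
# d_eff = (M-1)*p_join/2 is close to one.
samples = suburban(lambda x: -0.5 * x**2, M=16, N=5000,
                   beta=0.01, p_join=0.125)
```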
The caveat is that we need the sampler to actually wander around during its random walk; d ≤ 1 is typically necessary to prevent "groupthink."

4.1 Implementation

We now turn to the implementation of the suburban algorithm we shall consider in subsequent sections. To accommodate a flexible framework for prototyping, we have implemented the suburban algorithm in the probabilistic programming language Dimple [24]. This consists of a set of Java libraries with a Matlab wrapper. We have found this interface to be quite helpful in reaching the form of the algorithm presented in this work, as well as in performing different types of numerical experiments.

In the actual implementation, we have found it helpful to exclude some initial fraction of the samples, i.e., the process known as "burn-in." We do this more for practical considerations connected with the diagnostics we perform than for any theoretical reason, since a sufficiently well-behaved MCMC sampler run for long enough will eventually converge anyway to the correct posterior distribution. In practice, we take a fairly large burn-in cut, discarding the first 10% of samples from a run, i.e., we only keep 90% of the samples.

We always perform Gibbs sampling over the M agents. If we also perform Gibbs sampling over a D-dimensional target, we thus get a Gibbs schedule with D × M updates for each time step. For a joint sampler, the Gibbs schedule consists of just M updates.

The specific choice of proposal kernel we take is motivated by the physical considerations outlined in section 3:

\[
q_\sigma(x_\sigma \mid \mathrm{Nb}(x_\sigma)) \propto \exp\Big(-\alpha_\sigma (D_t x_\sigma)^2 - \sum_{n(\sigma)} \beta \big(D_t x_\sigma - D_{n(\sigma)} x_\sigma\big)^2\Big) \quad \text{with} \quad \alpha_\sigma = 2\beta - n^{\mathrm{tot}}_\sigma \beta, \tag{4.6}
\]

that is, we take an adaptive value for the parameter α specified by the number of nearest neighbors joined to x_σ. As already mentioned in section 3, the main point is to ensure that the overall strength of the kinetic term, i.e., the quadratic terms involving the temporal derivatives, does not dominate over the spatial derivatives.

In addition, we also implement the different choices of graph ensembles outlined in subsection 3.2.1. We also include the option to not permute or "shuffle" the indexing of the agents. As a general rule of thumb, we find that switching off shuffling always leads to worse performance.

4.2 Hyperparameters

Let us now formalize the total list of hyperparameters for the suburban algorithm. The total number of timesteps is N, and the total number of agents is M. In addition to the total number of samples collected in a run, we have a choice of ensemble of random graphs, i.e., how we connect the agents together. There is a coarse parameter given by the overall topology of graphs on which we perform percolation. Additionally, we have introduced a class of ensembles where we permute the locations of agents on the grid. For a collective of parallel MH samplers, this has no effect (since there is no correlation between agents anyway), but for more general collectives, this can clearly have an impact. Indeed, we find that if we consider related ensembles in which shuffling is turned off, the performance suffers. We shall therefore confine our experiments to cases where shuffling is switched on. Finally, there is also a continuous parameter p_join which dictates the probability of a given link in a graph being active.
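For concreteness, here is a minimal sketch of the burn-in cut and the estimators (4.4)–(4.5), assuming a chain array of shape (N, M) such as the one returned by the sketch after algorithm 2 above (names illustrative):

```python
import numpy as np

def estimate_moments(chain, burn_in_frac=0.10):
    """Discard the first 10% of samples (the burn-in cut used in the
    text), then estimate the mean (4.4) and variance (4.5) of pi(x)
    by averaging over both agents and time."""
    N = chain.shape[0]
    kept = chain[int(burn_in_frac * N):]   # keep the last 90%
    mean = kept.mean()                     # eq. (4.4)
    var = kept.var(ddof=1)                 # eq. (4.5), divides by NM - 1
    return mean, var
```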
This in turn translates into the effective worldvolume dimension experienced by an agent in the collective. Of course, the specific choice of ensemble of random graphs will also affect how much variance there is in the average degree of connectivity, though surprisingly, this seems to be a subleading effect in the tests we perform.

There are also many hyperparameters lurking in the proposal kernel. For the most part, we will focus on the case of equation (4.6), where there is just one tunable parameter β. For a sampler in D dimensions, this naturally extends to a symmetric positive definite matrix β_{IJ}, in the obvious notation. The overall parameter β sets the "stiffness" or tension of the brane. For β large, the coupling to nearest neighbors is strongest, and the relative size of jumps in the target space is smaller. For small β, the brane is "floppy," and each agent in the collective will execute larger movements. In this limit, the overall behavior of the proposal kernel approaches the uniform distribution, and the effects of grid topology are expected to become weaker.

5 Overview of Numerical Experiments

Our emphasis up to this point has been on various theoretical aspects of the suburban algorithm, in particular, how to understand MCMC with extended objects. We now switch gears from theory to experiment, and ask how well such algorithms do in practice. Our plan in this section will be to give a list of the various metrics we shall use to gauge performance. We then discuss the class of samplers we shall study, and then give a brief overview of the target distributions we consider. In subsequent sections we turn to examples.

For simplicity, we focus on the specific case where the q_σ of equation (1.4) are all normal distributions in which the means and covariance matrix are dictated by the choice of nearest neighbors. In most cases, we consider MH within Gibbs sampling, though we also consider the case where joint variables are sampled, that is, pure MH. For target distributions we focus on low-dimensional examples such as various mixture models of normal distributions, as well as the Rosenbrock "banana distribution," which has most of its mass concentrated on a lower dimensional subspace.

Rather than perform error analysis within a single long MCMC run, we opt to take multiple independent trials of each MCMC run in which we vary the hyperparameters of the sampler, such as the overall topology and average degree of connectivity. Though this leads to less efficient statistical estimators for our MCMC runs, it has the virtue of allowing us to easily compare the performance of different algorithms, i.e., as we vary the continuous and discrete hyperparameters of the suburban algorithm.

To gauge performance of the different runs, we focus on examples where we can analytically compute various statistics, such as the mean and covariance matrix of the target distribution, comparing with the values obtained from our MCMC samplers. We also calculate the expected number of samples on a tail to see whether the sampler spends the correct amount of time searching for "rare events." We also collect the rejection rate and the integrated auto-correlation time (i.e., mixing rate) for the MCMC sampler.
5.1 Performance Metrics

In general, gauging the performance of an MCMC algorithm can be difficult, so we shall adopt a few different performance metrics. To keep the size of the tests manageable (i.e., on the order of a few weeks rather than months or years), we also limit the dimension of the target space distributions we consider.

The performance metrics we adopt can roughly be split into tests of how well the algorithm converges to the correct posterior distribution, i.e., external comparisons, as well as internal comparisons such as the mixing rate and rejection rate for the Markov chain. Let us now discuss each of these performance metrics in more detail.

As a partial characterization of convergence to the correct posterior distribution, we focus on target distributions where we can calculate the various moments of the probability distribution analytically. In particular, for a D-dimensional target, we obtain a sample value for the mean and covariance matrix, which we denote as μ_inf and Σ_inf respectively, i.e., the inferred values. We then compute the distance to the true mean and covariance using the metrics:

\[
d_{\mathrm{mean}} \equiv \lVert \mu_{\mathrm{inf}} - \mu_{\mathrm{true}} \rVert \quad \text{and} \quad d_{\mathrm{cov}} \equiv \Big[\mathrm{Tr}\,\big(\Sigma_{\mathrm{inf}} - \Sigma_{\mathrm{true}}\big)\cdot\big(\Sigma_{\mathrm{inf}} - \Sigma_{\mathrm{true}}\big)^{T}\Big]^{1/2}, \tag{5.1}
\]

in the obvious notation. In addition to these simple tests, we also divide up the distribution into various regions of high and low probability mass, and verify that we obtain the appropriate number of events in each of these regions. In practice, we always consider drawing a box of size L^D centered at the origin such that there is a 68% chance of falling inside the box. We then also calculate the related 2σ and 3σ boxes (respectively 95% and 99.7%), and verify that a similar number of counts falls in the appropriate bin. In practice, we actually concentrate on the number of counts in the 0σ−1σ region, the 1σ−2σ region, the 2σ−3σ region, and events which fall outside the 3σ region. For each such region, we compare the expected number of events with the observed number, and obtain a corresponding fraction:

\[
f_{\mathrm{region}} \equiv \frac{N_{\mathrm{inf}} - N_{\mathrm{true}}}{N_{\mathrm{total}}}, \tag{5.2}
\]

where N_total denotes the total number of samples (after taking burn-in into account).

In addition to these external metrics, i.e., metrics based on comparison with the actual distribution, we also use diagnostics that are available from the MCMC runs themselves. These are important in most actual applications of MCMC, since we do not usually know the analytic form of the target distribution. Rather, we must depend on internal diagnostics such as the rejection rate and the integrated auto-correlation time, i.e., the mixing rate. A typical rule of thumb is that for targets with no large free energy barriers, a rejection rate of somewhere between 50%−80% is acceptable (see e.g. [22]). Indeed, if the rejection rate is too low, then the sampler is wandering aimlessly, and if the rejection rate is too high, it is an indication that too little of the target is being explored. Let us also note, however, that in situations where there is a large free energy barrier (i.e., multiple high mass regions separated by large low mass regions), the rejection rate can turn out to be rather high. This is just a symptom of the fact that most proposals will land in a low mass region.

Finally, we also collect the value of the integrated auto-correlation time for the "energy" of the distribution.
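The external metrics are simple to compute once the inferred moments and region counts are in hand; the following minimal sketch implements (5.1) and (5.2), with the box sizes for the 1σ, 2σ, and 3σ regions supplied by the caller, since these are computed analytically per target in the text (all names here are illustrative):

```python
import numpy as np

def d_mean(mu_inf, mu_true):
    """Distance to the true mean, eq. (5.1)."""
    return np.linalg.norm(mu_inf - mu_true)

def d_cov(sigma_inf, sigma_true):
    """Frobenius distance to the true covariance, eq. (5.1)."""
    diff = sigma_inf - sigma_true
    return np.sqrt(np.trace(diff @ diff.T))

def f_region(n_inf, n_true, n_total):
    """Fractional excess of observed over expected counts in a
    given sigma region, eq. (5.2)."""
    return (n_inf - n_true) / n_total

def region_counts(samples, box_sizes):
    """Count samples (shape (n, D)) falling in each shell between the
    nested boxes [-L/2, L/2]^D for L in increasing box_sizes (the 1, 2,
    3 sigma boxes, chosen per target to enclose 68%, 95%, 99.7% of the
    mass), plus the overflow region outside the largest box."""
    inside_prev = np.zeros(len(samples), dtype=bool)
    counts = []
    for L in box_sizes:
        inside = np.all(np.abs(samples) <= L / 2.0, axis=1)
        counts.append(int(np.sum(inside & ~inside_prev)))
        inside_prev = inside
    counts.append(int(np.sum(~inside_prev)))  # outside the 3-sigma box
    return counts
```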
This observable provides a way of quantifying the correlation between samples drawn at different times from the MCMC run. This observable in particular has been argued to provide a preferred diagnostic for evaluating the performance of an MCMC run (for a recent discussion, see e.g. [25]). Along these lines, we introduce:

\[
V = -\log \pi(x_1, \ldots, x_M), \tag{5.3}
\]

and collect the values V^(1), ..., V^(N). We evaluate the covariance c(k) for −N < k < N,

\[
c(k) \equiv
\begin{cases}
\displaystyle \frac{1}{N} \sum_{t=1}^{N-k} \big(V^{(t)} - \overline{V}\big)\big(V^{(t+k)} - \overline{V}\big) & \text{for } k \geq 0, \\[2ex]
\displaystyle \frac{1}{N} \sum_{t=1}^{N+k} \big(V^{(t)} - \overline{V}\big)\big(V^{(t-k)} - \overline{V}\big) & \text{for } k < 0,
\end{cases} \tag{5.4}
\]

and then extract the cross correlation:

\[
\widehat{c}(k) = c(k)/c(0). \tag{5.5}
\]

From this, we extract an estimate for the integrated auto-correlation time:

\[
\tau_{\mathrm{dec}} \equiv \sum_{k=-N+1}^{N-1} \widehat{c}(k).
\]

A Comparison with Slice Sampler

Let us briefly review the interval procedure of the slice sampler [27], as implemented in Dimple. Starting from the current point x_∗ and an initial interval (x_L, x_R) containing it: if π(x_L) > π(x_∗) or π(x_R) > π(x_∗), we continue to double the size of the interval by moving the left or right side of the interval further out (randomly picking one of the two sides) until π(x_L) < π(x_∗) and π(x_R) < π(x_∗). Next, we proceed to pick a new point x'_∗ on the x-axis using a shrinking procedure. First, designate by x_M the midpoint of (x_L, x_R). Draw a new value x'_∗ from the uniform distribution on (x_L, x_R). We reject the new value if x_∗ and x'_∗ are not on opposite sides of the midpoint. If they are on opposite sides of the midpoint, then we only accept the new value if π(x'_∗) > π(x_L) and π(x'_∗) > π(x_R). If a sample is rejected, we shrink the interval by setting x_R to x_M when x'_∗ < x_M, and otherwise we move x_L to x_M. The above steps are then repeated until a new sample is accepted. Finally, we repeat all these steps.

In practice, we mainly use the default implementation in Dimple, so that the initial width along the x-axis is an interval of length one containing x_∗, and the maximum number of doublings is 10. In some cases, especially for the free energy barrier tests, this choice of initialization width can lead to erratic behavior of the sampler. In that case, we find that taking an initial width of 100 (i.e., much bigger than the separation between the local high density regions) leads to better performance.

Now, as we have already mentioned in section 5, it is subtle to directly compare the suburban and slice samplers, since in the former there is a clear accept/reject choice, while in the latter everything boils down to the overall size of the intervals and the halting of the "stepping out" and "stepping in" loops. In practice, we find that when we collect some fixed number N of samples, the slice sampler typically makes several more queries to the target distribution than the suburban sampler, roughly a factor of ∼5−10. Though this makes a direct comparison of the two algorithms less straightforward, we include the results of these tests as a simple way to gauge performance.

To compare the relative performance, we mainly focus on the suburban sampler obtained from a 2d membrane grid topology, but with effective dimension d_eff = 1. We also focus on the case with brane tension β = 0.01. As a first example, we return to the case of the random landscape model with random seed 40 studied in section 7. For illustrative purposes, in figure 12 we compare the tuned suburban sampler with parallel slice samplers. By inspection, we see that we have comparable mixing rates and convergence. We also compare with parallel MH, which fares much worse.
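Since the paraphrase above compresses several details, the following is a minimal Python sketch of a univariate slice sampler with the doubling procedure in Neal's formulation [27], which Dimple's default follows in outline; note that in Neal's version the interval comparisons are made against an auxiliary height y drawn uniformly from (0, π(x_∗)) rather than against π(x_∗) itself, and all names below are ours. The `acceptable` test implements the detailed balance requirement associated with the doubling procedure, which is the ingredient blamed below for the erratic small-width behavior.

```python
import random

def slice_sample_doubling(logpi, x0, w=1.0, p=10, rng=random):
    """One update of a univariate slice sampler with doubling (Neal [27]).
    w is the initial interval width and p the maximum number of
    doublings (Dimple's defaults per the text: w = 1, p = 10)."""
    # Auxiliary height: log y = log pi(x0) - Exp(1), i.e. y ~ U(0, pi(x0)).
    logy = logpi(x0) - rng.expovariate(1.0)

    # Doubling: randomly position the initial interval around x0, then
    # double it on a randomly chosen side until both ends leave the
    # slice or the doubling budget is exhausted.
    u = rng.random()
    L, R = x0 - w * u, x0 + w * (1.0 - u)
    k = p
    while k > 0 and (logy < logpi(L) or logy < logpi(R)):
        if rng.random() < 0.5:
            L -= (R - L)
        else:
            R += (R - L)
        k -= 1

    def acceptable(x1):
        # Detailed-balance test for doubling: reject x1 if the sequence
        # of intervals that doubling would trace back from x1 could not
        # also have produced (L, R).
        lhat, rhat, differ = L, R, False
        while rhat - lhat > 1.1 * w:
            m = 0.5 * (lhat + rhat)
            if (x0 < m) != (x1 < m):
                differ = True
            if x1 < m:
                rhat = m
            else:
                lhat = m
            if differ and logy >= logpi(lhat) and logy >= logpi(rhat):
                return False
        return True

    # Shrinking: sample uniformly from the interval, shrinking toward
    # x0 on each rejection.
    lbar, rbar = L, R
    while True:
        x1 = lbar + rng.random() * (rbar - lbar)
        if logy < logpi(x1) and acceptable(x1):
            return x1
        if x1 < x0:
            lbar = x1
        else:
            rbar = x1
```

Note that with w = 100 the initial interval already spans widely separated modes, so the doubling loop (and hence the acceptability test) rarely triggers, which plausibly accounts for the smoother behavior at the larger initialization width reported below.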
As a second example, we consider again the free energy barrier test studied in section 9. Here, we again focus on the suburban sampler since it has the fastest mixing rate. We observe a curious feature: for parallel slice samplers with an initialization width of 1, we observe erratic behavior in the sampler. This appears to be due to the detailed balance requirement associated with the doubling procedure. Indeed, we find that there is no erratic behavior when we take a larger initialization width of 100 for the slice sampler. Figure 13 displays the relative performance of the different samplers for the same 2D free energy barrier test.

References

[1] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller, "Equation of State Calculations by Fast Computing Machines," J. Chem. Phys. 21 (1953) 1087–1092.

[2] W. K. Hastings, "Monte Carlo Sampling Methods Using Markov Chains and their Applications," Biometrika 57 no. 1, (1970) 97–109.

[3] J. J. Heckman, "Statistical Inference and String Theory," Int. J. Mod. Phys. A30 no. 26, (2015) 1550160, arXiv:1305.3621 [hep-th].

[4] R. H. Swendsen and J.-S. Wang, "Replica Monte Carlo Simulation of Spin-Glasses," Phys. Rev. Lett. 57 no. 21, (1986) 2607–2609.

[5] C. J. Geyer, "Markov Chain Monte Carlo Maximum Likelihood," in Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, E. M. Keramidas, ed., pp. 156–163. Interface Foundation, 1991.

[6] W. R. Gilks, G. O. Roberts, and E. I. George, "Adaptive Direction Sampling," Journal of the Royal Statistical Society. Series D (The Statistician) 43 no. 1, (1997) 179–189.

[7] D. J. Earl and M. W. Deem, "Parallel tempering: Theory, applications, and new perspectives," Phys. Chem. Chem. Phys. 7 (2005) 3910–3916.

[8] R. M. Neal, "MCMC Using Ensembles of States for Problems with Fast and Slow Variables such as Gaussian Process Regression," arXiv:1101.0387 [stat].

[9] J. Goodman and J. Weare, "Ensemble Samplers with Affine Invariance," Comm. in Appl. Math. and Comp. Sci. 5 no. 1, (2010) 65–80.

[10] R. Nishihara, I. Murray, and R. P. Adams, "Parallel MCMC with Generalized Elliptical Slice Sampling," J. Mach. Learn. Res. 15 no. 1, (Jan., 2014) 2087–2112. http://dl.acm.org/citation.cfm?id=2627435.2670318.

[11] K. Hukushima and K. Nemoto, "Exchange Monte Carlo Method and Application to Spin Glass Simulations," Journal of the Physical Society of Japan 65 (June, 1996) 1604, cond-mat/9512035.

[12] S. C. Kou, Q. Zhou, and W. H. Wong, "Equi-energy sampler with applications in statistical inference and statistical mechanics," Ann. Statist. 34 no. 4, (08, 2006) 1581–1619. http://dx.doi.org/10.1214/009053606000000515.

[13] F. Liang and W. H. Wong, "Evolutionary Monte Carlo: Applications to Cp model sampling and change point problem," Statistica Sinica (2000) 317–342.

[14] J. S. Liu, F. Liang, and W. H. Wong, "The Multiple-Try Method and Local Optimization in Metropolis Sampling," Journal of the American Statistical Association 95 no. 449, (2000) 121–134. http://www.jstor.org/stable/2669532.

[15] V. Balasubramanian, "A Geometric Formulation of Occam's Razor For Inference of Parametric Distributions," arXiv:adap-org/9601001.

[16] V. Balasubramanian, "Statistical Inference, Occam's Razor and Statistical Mechanics on the Space of Probability Distributions," Neural Comp. 9(2) (1997) 349–368, arXiv:cond-mat/9601030.
[17] J. J. Heckman, J. G. Bernstein, and B. Vigoda, "MCMC with Strings and Branes: The Suburban Algorithm," arXiv:1605.06122 [stat.CO].

[18] B. Zwiebach, A First Course in String Theory. Cambridge University Press, Cambridge, UK, 2006.

[19] C. V. Johnson, D-Branes. Cambridge University Press, Cambridge, UK, 2005.

[20] M. E. Peskin and D. V. Schroeder, An Introduction to Quantum Field Theory. Addison-Wesley, Reading, USA, 1995.

[21] A. Wipf, "Statistical approach to quantum field theory," Lect. Notes Phys. 864 (2013) 390.

[22] A. Gelman, W. R. Gilks, and G. O. Roberts, "Weak convergence and optimal scaling of random walk Metropolis algorithms," Ann. Appl. Prob. 7 no. 1, (1997) 110–120.

[23] P. Di Francesco, P. Mathieu, and D. Senechal, Conformal Field Theory. Graduate Texts in Contemporary Physics. Springer-Verlag, New York, 1997.

[24] S. Hershey, J. Bernstein, B. Bradley, A. Schweitzer, N. Stein, T. Weber, and B. Vigoda, "Accelerating Inference: towards a full Language, Compiler and Hardware Stack," arXiv:1212.2991 [cs.SE].

[25] D. Foreman-Mackey, D. W. Hogg, D. Lang, and J. Goodman, "emcee: The MCMC Hammer," Pub. of the Ast. Soc. of the Pacific 125 no. 925, (2013) 306–312, arXiv:1202.3665 [astro-ph.IM].

[26] A. E. Raftery and S. M. Lewis, "Comment: One Long Run with Diagnostics: Implementation Strategies for Markov Chain Monte Carlo," Statist. Sci. 7 no. 4, (1992) 493–497.

[27] R. M. Neal, "Slice sampling," The Ann. of Stat. 31 no. 3, (2003) 705–767.

[28] Y. Chen, M. Welling, and A. J. Smola, "Super-Samples from Kernel Herding," arXiv:1203.3472 [cs.LG].