MCMC with Strings and Branes: The Suburban Algorithm


Jonathan J. Heckman (Department of Physics, University of North Carolina at Chapel Hill, NC), Jeffrey G. Bernstein (Analog Devices | Lyric Labs, Cambridge, MA), and Ben Vigoda (Gamalon Labs, Cambridge, MA)

Motivated by the physics of strings and branes, we introduce a general suite of Markov chain Monte Carlo (MCMC) "suburban samplers" (i.e., spread-out Metropolis). The suburban algorithm involves an ensemble of statistical agents connected together by a random network. Performance of the collective in reaching a fast and accurate inference depends primarily on the average number of nearest neighbor connections. Increasing the average number of neighbors above zero initially leads to an increase in performance, though there is a critical connectivity with effective dimension $d_{\text{eff}} \sim 1$, above which "groupthink" takes over, and the performance of the sampler declines.

I. INTRODUCTION

Markov chain Monte Carlo (MCMC) methods are a remarkably robust way to sample from complex probability distributions. Metropolis-Hastings (MH) sampling [1, 2] stands out as an important benchmark. An appealing feature of MH sampling is the simple physical picture which underlies the general method. Roughly speaking, the idea is that the thermal fluctuations of a particle moving in an energy landscape provide a conceptually elegant way to sample from a target distribution.

But there are also potential drawbacks to MCMC methods. For example, the speed of convergence to the correct posterior is often unknown, since a sampler can become trapped in a metastable equilibrium for a long period of time. Once a sampler becomes trapped, a large free energy barrier can obstruct an accurate determination of the distribution. From this perspective it is natural to ask whether further inspiration from physics can lead to new examples of samplers.

Now, although the physics of point particles underlies much of our modern understanding of natural phenomena, it has proven fruitful to consider objects such as strings and branes with finite extent in $p$ spatial dimensions (a string being the case of a 1-brane). One of the main features of branes is that the number of spatial dimensions strongly affects how a localized perturbation propagates across the worldvolume. Viewing a brane as a collective of point particles that interact with one another (see fig. 1), this suggests applications to questions in statistical inference [3].

Motivated by these physical considerations, our aim in this work is to study generalizations of the MH algorithm for such extended objects. An ensemble of $M$ parallel MH samplers of a distribution $\pi(x)$ can alternatively be viewed as a single particle sampling from $M$ variables $x_1, \ldots, x_M$ with density:

$\pi(x_1, \ldots, x_M) = \pi(x_1) \cdots \pi(x_M)$,   (1)

where the proposal kernel is simply:

$q_{\text{par}}(x_1^{\text{new}}, \ldots, x_M^{\text{new}} \mid x_1^{\text{old}}, \ldots, x_M^{\text{old}}) = \prod_{\sigma=1}^{M} q(x_\sigma^{\text{new}} \mid x_\sigma^{\text{old}})$.   (2)

FIG. 1: Depiction of how parallel MH samplers (left) and a suburban sampler (right) evolve as a function of time.
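As a point of reference, here is a minimal sketch of eqns (1)-(2): $M$ independent random-walk Metropolis chains viewed as a single particle on the product target. This is our own illustrative code (not the authors' Dimple implementation); the function name, step size, and Gaussian target are assumptions for the example.

```python
import numpy as np

# M independent random-walk Metropolis chains, viewed as one particle
# on the product target of eqn (1) with the factorized proposal of
# eqn (2). Each copy accepts or rejects independently.
def parallel_mh_step(X, log_pi, step, rng):
    X_prop = X + step * rng.normal(size=X.shape)   # q_par: independent kicks
    log_a = log_pi(X_prop) - log_pi(X)             # symmetric proposal
    accept = np.log(rng.random(size=X.shape)) < log_a
    return np.where(accept, X_prop, X)

rng = np.random.default_rng(0)
log_pi = lambda x: -0.5 * x**2                     # standard normal target
X = rng.normal(size=81)                            # M = 81 agents
for _ in range(1000):
    X = parallel_mh_step(X, log_pi, step=1.0, rng=rng)
print(X.mean(), X.var())                           # should approach 0 and 1
```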
To realize MCMC with strings and branes, we keep the same target $\pi(x_1, \ldots, x_M)$, but we change the proposal kernel by interpreting the index $\sigma$ on $x_\sigma$ as specifying the location of a statistical agent in a network. Depending on the connectivity of this network, an agent may interact with several neighboring agents $\text{Nb}(x_\sigma)$, so we introduce a proposal kernel:

$q_{\text{brane}}(x_1^{\text{new}}, \ldots, x_M^{\text{new}} \mid x_1^{\text{old}}, \ldots, x_M^{\text{old}}) = \prod_{\sigma=1}^{M} q_\sigma(x_\sigma^{\text{new}} \mid \text{Nb}(x_\sigma^{\text{old}}))$.   (3)

(Footnote: From this perspective, the suburban algorithm is a particular choice of ensemble MCMC; see e.g. [4-9].)

In the above, the connectivity of the extended object specifies its overall topology. For example, in the case of a string, i.e., a one-dimensional extended object, the neighbors of $x_i$ are $x_{i-1}$, $x_i$, and $x_{i+1}$. Fig. 1 depicts the time evolution of parallel MH samplers compared with the suburban sampler.

However, there are potentially many consistent ways to connect together the inferences of statistical agents. From the perspective of physics, this amounts to a notion of distance/proximity between nearest neighbors in a brane. A physically well-motivated way to eliminate this arbitrary feature is to allow the notion of proximity itself to fluctuate, and for the brane to split and join.

We view MCMC with extended strings and branes as a novel class of ensemble samplers. By correlating the inferences of nearest neighbors, we can expect some impact on performance. For example, the degree of connectivity impacts the mixing rate for obtaining independent samples. Another important feature is that, because we are dealing with an extended object, different statistical agents may become localized in different high density regions. Provided the connectivity with neighbors is sufficiently low, coupling these agents has the potential to provide a more accurate global characterization of a target distribution. Conversely, connecting too many agents together may cause the entire collective to suffer from "groupthink" in the sense of [3]. In particular, we shall present evidence that the optimal connectivity for a network of agents arranged as a hypercubic lattice with some percolation (i.e., we allow for broken links) occurs at a critical effective dimension:

$d_{\text{eff}} \sim 1$,   (4)

where $2 d_{\text{eff}}$ is the average number of neighbors. To summarize: with too few friends one drifts into oblivion, but with too many friends one becomes a boring conformist.

Turning our discussion around, one can view this paper as providing a concrete way to study the physics of branes with a strongly fluctuating worldvolume, that is, the non-perturbative regime of string theory. The Appendices provide some additional details (see also [10]). The suburban code is available at https://gitlab.com/suburban/suburban .

II. MCMC WITH STRINGS AND BRANES

One of the main ideas we develop in this paper is MCMC methods for extended objects. In an MCMC algorithm we produce a sequence of "timesteps" $x^{(1)}, \ldots, x^{(t)}, \ldots, x^{(N)}$, which can be viewed as the motion of a point particle exploring a target space $\Omega$. More formally, this sequence of points defines the "worldline" for a particle, and consequently a map:

$t \mapsto x(t)$.   (5)

For an extended object with $d$ spatial directions, we get a map from a "worldvolume" to the target:

$(t, \sigma_1, \ldots, \sigma_d) \mapsto x(t, \sigma_1, \ldots, \sigma_d)$.   (6)

The special cases $d = 0$ and $d = 1$ respectively denote a point particle and a string.
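To make the index structure concrete, here is a small sketch (our notation, not from the paper's code) of the neighbor set $\text{Nb}(x_\sigma)$ entering eqn (3) for a string: a ring of $M$ agents with periodic boundary conditions.

```python
import numpy as np

# Neighbor structure for a "string" (1-brane): agent sigma on a ring of
# M sites interacts with sites sigma - 1 and sigma + 1 (periodic wrap).
def string_neighbors(sigma, M):
    return [(sigma - 1) % M, (sigma + 1) % M]

M = 81
X = np.random.randn(M)                   # current state x_1, ..., x_M
nb_values = X[string_neighbors(5, M)]    # Nb(x_5) entering q_5 in eqn (3)
```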
The general physical intuition is that minus the log of the target distribution $\pi(x) = \exp(-V(x))$ defines a potential energy, and minus the log of the Markov chain transition probability $T(x \rightarrow x') = \exp(-K(x, x'))$ defines a kinetic energy. The key point is that statistical field theory in $d+1$ Euclidean dimensions depends strongly on the number of dimensions. For example, the two-point function for a free Gaussian field $x(\sigma)$ with $\sigma \in \mathbb{R}^{d+1}$ is:

$\mathbb{E}(x(\sigma) x(0)) \sim 1 / \|\sigma\|^{d-1}$,   (7)

where in the case of $d = 1$, the two-point function is $\log \|\sigma\|$. For $d \leq 1$, a random field explores its surroundings at large $\sigma$, but the overall variance decreases as $d \rightarrow 1$. For $d > 1$, however, "groupthink" sets in and the ensemble explores its surroundings less quickly. This suggests a special role for stringlike objects [3].

In a theory of quantum gravity (such as string theory) it is also physically natural to let the proximity of nearest neighbors fluctuate. So, we introduce an ensemble of random graphs $\mathcal{A}$. For example, for a $d$-dimensional toroidal hypercubic lattice, introduce $m$ lattice sites along each spatial direction so that $m^d = M$ is the total number of agents. For a hypercubic lattice in $d$ dimensions, we define the ensemble of random graphs for a brane $\mathcal{A}_{\text{brane}}(p_{\text{join}})$ as one in which we have a random shuffling of the agents, and in which a given link in the $d$-dimensional hypercubic lattice is active with probability $p_{\text{join}}$. We can also consider more general ensembles of adjacency matrices. For example, the Erdős–Rényi ensemble $\mathcal{A}_{\text{ER}}(p_{\text{join}})$ has an edge between any two nodes with probability $p_{\text{join}}$. We also introduce the notion of an effective dimension which depends on the average number of neighbors:

$d_{\text{eff}} = n_{\text{avg}} / 2$,   (8)

which need not be an integer.
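The following sketch builds the two graph ensembles just described and the effective dimension of eqn (8). The function names are ours; for brevity we omit the random shuffling of agent labels that the $\mathcal{A}_{\text{brane}}$ ensemble also includes.

```python
import numpy as np

# Sketch of the A_brane(p_join) and A_ER(p_join) ensembles. Sites of a
# d-dimensional toroidal hypercubic lattice with m sites per direction,
# M = m**d agents; each lattice link is active with probability p_join.
# (The paper's A_brane also shuffles agent labels; omitted here.)
def hypercubic_adjacency(m, d, p_join, rng):
    M = m ** d
    A = np.zeros((M, M), dtype=bool)
    for site in range(M):
        coords = np.unravel_index(site, (m,) * d)
        for k in range(d):                    # one forward link per direction
            nb = list(coords)
            nb[k] = (nb[k] + 1) % m           # periodic (toroidal) wrap
            j = np.ravel_multi_index(nb, (m,) * d)
            if rng.random() < p_join:
                A[site, j] = A[j, site] = True
    return A

def erdos_renyi_adjacency(M, p_join, rng):
    U = rng.random((M, M)) < p_join
    A = np.triu(U, 1)                         # keep each pair once
    return A | A.T

def effective_dimension(A):
    # d_eff = n_avg / 2, with n_avg the average number of neighbors (eqn (8))
    return A.sum(axis=1).mean() / 2.0

rng = np.random.default_rng(0)
A = hypercubic_adjacency(m=9, d=2, p_join=0.5, rng=rng)   # M = 81 agents
print(effective_dimension(A))   # ~1 for a 2d lattice with half the links cut
```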
III. THE SUBURBAN ALGORITHM

We now present the suburban algorithm. For ease of exposition, we present the case of a 1D target. The generalization to a $D$-dimensional target is straightforward: we can take MH within a Gibbs sampler, or a sampler with joint variables in which all $D$ dimensions update simultaneously.

To avoid overloading the notation, we write $\mathcal{X}^{(t)} \equiv \{x_1^{(t)}, \ldots, x_M^{(t)}\}$ for the current state of the grid. Instead of directly sampling from $\pi(x)$, we introduce multiple copies of the target and sample from the joint distribution $\pi(\mathcal{X}) = \pi(x_1) \cdots \pi(x_M)$ using MH sampling with proposal kernel $q_{\text{brane}}(x_1^{\text{new}}, \ldots, x_M^{\text{new}} \mid x_1^{\text{old}}, \ldots, x_M^{\text{old}}) = q(\mathcal{X}^{(\text{new})} \mid \mathcal{X}^{(\text{old})}, A)$, where $A$ denotes the adjacency matrix. If the system is in a state $\mathcal{X}^{(t)}$ with adjacency matrix $A^{(t)}$, we pick a new state according to the MH update rule with a proposal kernel which depends on both these inputs. The MH acceptance probability is:

$a(\mathcal{X}^{\text{new}} \mid \mathcal{X}^{\text{old}}, A) = \min\left(1, \frac{q(\mathcal{X}^{\text{old}} \mid \mathcal{X}^{\text{new}}, A)}{q(\mathcal{X}^{\text{new}} \mid \mathcal{X}^{\text{old}}, A)} \frac{\pi(\mathcal{X}^{\text{new}})}{\pi(\mathcal{X}^{\text{old}})}\right)$.   (9)

This leads us to Algorithm 1.

Algorithm 1: Suburban Sampler
  Randomly initialize $\mathcal{X}^{(0)}$ and $A^{(0)}$
  for $t = 0$ to $N - 1$ do
    $\mathcal{X}^{(*)} \leftarrow$ sample from $q(\mathcal{X} \mid \mathcal{X}^{(t)}, A^{(t)})$
    accept with probability $a(\mathcal{X}^{(*)} \mid \mathcal{X}^{(t)}, A^{(t)})$
    if accept then $\mathcal{X}^{(t+1)} \leftarrow \mathcal{X}^{(*)}$ else $\mathcal{X}^{(t+1)} \leftarrow \mathcal{X}^{(t)}$
    $A^{(t+1)} \leftarrow$ draw from $\mathcal{A}$
  return $\mathcal{X}^{(1)}, \ldots, \mathcal{X}^{(N)}$

Some of these steps can be parallelized while retaining detailed balance. For example, we could pick a coloring of a graph and then perform an update for all nodes of a particular color while holding the rest fixed. We can also stochastically evolve the adjacency matrices.

Having collected a sequence of values $\mathcal{X}^{(1)}, \ldots, \mathcal{X}^{(N)}$, we can interpret this as $N \times M$ samples of the original distribution $\pi(x)$. As is standard for MCMC methods, we can then calculate quantities of interest such as the mean:

$\bar{x} \simeq \frac{1}{NM} \sum_{\sigma, t} x_\sigma^{(t)}$,   (10)

as well as higher order moments.

Let us discuss the reason we expect our sampler to converge to the correct posterior distribution. First, although we are modifying the proposal kernel at each time step (i.e., by introducing a different adjacency matrix $A \in \mathcal{A}$), this modification is independent of the current state of the system, so it cannot impact the eventual posterior distribution we obtain. Second, since we are just performing a specific kind of MH sampling routine for the distribution $\pi(x_1, \ldots, x_M)$, we expect to converge to the correct posterior distribution. But since the variables $x_1, \ldots, x_M$ are all independent, this is tantamount to having also sampled multiple times from $\pi(x)$. The caveat is that we need the sampler to actually wander around during its random walk; $d \leq 1$ is typically necessary to prevent "groupthink."

A. Implementation

To accommodate a flexible framework for prototyping, we have implemented the suburban algorithm in the probabilistic programming language Dimple [11]. For practical purposes we take a fairly large burn-in cut, discarding the first 10% of samples from a run. We always perform Gibbs sampling over the $M$ agents. For MH within Gibbs sampling over a $D$-dimensional target, we thus get a Gibbs schedule with $D \times M$ updates for each time step. For a joint sampler, the Gibbs schedule consists of just $M$ updates.

The specific choice of $q_\sigma(x_\sigma \mid \text{Nb}(x_\sigma))$ for eqn. (3) is motivated by having a free Gaussian field on a fluctuating graph topology:

$q_\sigma \propto \exp\left(-\alpha_\sigma (D_t x_\sigma)^2 - \sum_{n(\sigma)} \beta \left(D_t x_\sigma - D_{n(\sigma)} x_\sigma\right)^2\right)$,   (11)

with:

$D_t x_\sigma = x_\sigma^{(t+1)} - x_\sigma^{(t)}$,   (12)

$D_{n(\sigma)} x_\sigma = x_{n(\sigma)}^{(t)} - x_\sigma^{(t)}$,   (13)

for $n(\sigma)$ a neighbor of $\sigma$ on the graph defined by the adjacency matrix. Additionally, we set the hyperparameters for the kernel as:

$\alpha_\sigma = 2\beta - n_\sigma^{\text{tot}} \beta$,   (14)

that is, we take an adaptive value for $\alpha_\sigma$ specified by the number $n_\sigma^{\text{tot}}$ of nearest neighbors joined to $x_\sigma$. This condition leads to a well-behaved continuum limit on a fully connected hypercubic lattice.
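Completing the square in eqn (11) with the choice (14) gives $\alpha_\sigma + n_\sigma^{\text{tot}} \beta = 2\beta$, so the per-agent proposal is Gaussian with a drift of half the summed neighbor differences and variance $1/(4\beta)$. Below is a minimal sketch of one MH-within-Gibbs sweep with this kernel for a 1D target; this is our own code under those assumptions, not the authors' Dimple implementation, and the function names are ours.

```python
import numpy as np

# One MH-within-Gibbs sweep of the suburban sampler for a 1D target,
# using the free-field kernel of eqns (11)-(14). With
# alpha_sigma = (2 - n_tot) * beta, completing the square in eqn (11)
# gives a Gaussian proposal: drift of half the summed neighbor
# differences, variance 1/(4*beta).
def suburban_sweep(X, A, log_pi, beta, rng):
    for s in rng.permutation(len(X)):
        nb = np.flatnonzero(A[s])                 # Nb(x_s) under adjacency A
        drift_fwd = 0.5 * np.sum(X[nb] - X[s])    # (1/2) * sum_n D_n x_s
        x_new = X[s] + drift_fwd + rng.normal() / np.sqrt(4.0 * beta)
        # Hastings ratio: the reverse move drifts toward the neighbors
        # from x_new, so the kernel is asymmetric.
        drift_rev = 0.5 * np.sum(X[nb] - x_new)
        log_q_fwd = -2.0 * beta * (x_new - X[s] - drift_fwd) ** 2
        log_q_rev = -2.0 * beta * (X[s] - x_new - drift_rev) ** 2
        log_a = log_pi(x_new) - log_pi(X[s]) + log_q_rev - log_q_fwd
        if np.log(rng.random()) < log_a:
            X[s] = x_new
    return X
```

Running repeated sweeps while redrawing $A$ from the ensemble between timesteps (Algorithm 1), and pooling all $M$ coordinates, yields the $N \times M$ samples used in eqn (10).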
IV. NUMERICAL EXPERIMENTS

In most cases, we consider MH within Gibbs sampling, though we also consider the case where joint variables are sampled, that is, pure MH. Rather than perform error analysis within a single long MCMC run, we opt to take multiple independent trials of each MCMC run in which we vary the hyperparameters of the sampler, such as the overall topology and average degree of connectivity. Though this leads to less efficient statistical estimators, it has the virtue of allowing us to easily compare the performance of different algorithms, i.e., as we vary the continuous and discrete hyperparameters of the suburban algorithm. We take $M = 81 = 9^2 = 3^4$ to compare different grid topologies. We have also compared performance with parallel slice (within Gibbs) samplers [12] to ensure that our performance is comparable to other benchmarks.

To gauge accuracy, we collect the inferred mean and covariance matrix. We then compute the distance to the true values:

$d_{\text{mean}} \equiv \|\mu_{\text{inf}} - \mu_{\text{true}}\|$,   (15)

$d_{\text{cov}} \equiv \left[\mathrm{Tr}\left((\Sigma_{\text{inf}} - \Sigma_{\text{true}}) \cdot (\Sigma_{\text{inf}} - \Sigma_{\text{true}})^T\right)\right]^{1/2}$.   (16)

We also collect performance metrics from the MCMC runs such as the rejection rate. A typical rule of thumb is that for targets with no large free energy barriers, a rejection rate somewhere between 50% and 80% is acceptable (see e.g. [13]). We also collect the integrated auto-correlation time for the "energy" of the distribution:

$V = -\log \pi(x_1, \ldots, x_M)$,   (17)

by collecting the values $V^{(1)}, \ldots, V^{(N)}$. For $-N < k < N$, we evaluate:

$c(k) \equiv \begin{cases} \dfrac{1}{N} \displaystyle\sum_{t=1}^{N-k} \left(V^{(t)} - \bar{V}\right)\left(V^{(t+k)} - \bar{V}\right) & k \geq 0 \\[2ex] \dfrac{1}{N} \displaystyle\sum_{t=1}^{N+k} \left(V^{(t)} - \bar{V}\right)\left(V^{(t-k)} - \bar{V}\right) & k < 0 \end{cases}$   (18)

and then extract the integrated auto-correlation time:

$\tau_{\text{dec}} \equiv \sum_{-N < k < N} c(k) / c(0)$.   (19)

Both parallel samplers ($d_{\text{eff}} = 0$) and highly connected samplers ($d_{\text{eff}} > 1$) fare worse than a stringlike sampler.
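A minimal sketch (our naming) of the diagnostics in eqns (15)-(19). In practice the sum defining $\tau_{\text{dec}}$ is noisy at large lags and estimators often truncate it at a window; we keep the full sum for fidelity to eqn (19).

```python
import numpy as np

# Accuracy and mixing diagnostics of eqns (15)-(19), computed from the
# inferred moments and the energy trace V_t = -log pi(X^(t)).
def d_mean(mu_inf, mu_true):
    return np.linalg.norm(mu_inf - mu_true)                          # eqn (15)

def d_cov(S_inf, S_true):
    E = S_inf - S_true
    return np.sqrt(np.trace(E @ E.T))                                # eqn (16)

def tau_dec(V):
    # Integrated auto-correlation time: sum of c(k)/c(0) over -N < k < N,
    # using c(-k) = c(k). A practical estimator would truncate the sum.
    N = len(V)
    dV = V - V.mean()
    c = np.array([np.dot(dV[:N - k], dV[k:]) / N for k in range(N)])  # eqn (18)
    return (c[0] + 2.0 * c[1:].sum()) / c[0]                          # eqn (19)
```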
D. Free Energy Barriers

The extended nature of the suburban sampler also suggests that for target distributions with various disconnected "deep pockets," different pieces of the ensemble can wander over to different regions. We consider a mixture model with two Gaussian components:

$\pi_{\text{GMM}}(x) = \frac{3}{4} \mathcal{N}(x \mid \mu^{(+)}, \Sigma) + \frac{1}{4} \mathcal{N}(x \mid \mu^{(-)}, \Sigma)$,   (22)

with:

$\mu^{(\pm)} = (\pm L_{\text{barrier}}, 0, \ldots, 0), \qquad \Sigma = \sigma^2 \times I_{D \times D}$,   (23)

where we vary $L_{\text{barrier}}$ and hold fixed $\sigma = 0.25$. We take $N_{\text{samples}} = 1000$ samples with $M = 81$ agents on a 2d grid with $\beta = 0.01$, performing $T = 1000$ independent trials. Since we use MH within Gibbs, we do not find much decrease in performance in comparing the $D = 2$ and $D = 10$ free energy barrier tests. Fig. 7 shows that parallel samplers fare worse than the extended objects. The $d_{\text{eff}} = 2$ runs are sometimes more accurate, but mix more slowly than for $d_{\text{eff}} = 1$; after thinning samples, the former runs will be less accurate.

Acknowledgements

JJH thanks D. Krohn for collaboration at an early stage. We thank J.A. Barandes, C. Barber, C. Freer, M. Freytsis, J.J. Heckman Sr., A. Murugan, P. Oreto, R. Yang, and J. Yedidia for helpful discussions. The work of JJH is supported by NSF CAREER grant PHY-1452037. JJH also acknowledges support from the Bahnson Fund as well as the R. J. Reynolds Industries, Inc. Junior Faculty Development Award at UNC Chapel Hill.

FIG. 7: Plots of the 2D free energy barrier tests ($d_{\text{mean}}$, $d_{\text{cov}}$, $\tau_{\text{dec}}$, and $R_{\text{rej}}$ versus $L_{\text{barrier}}$, for $d_{\text{eff}} = 0, 1, 2$).

References

[1] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller, "Equation of State Calculations by Fast Computing Machines," J. Chem. Phys. 21 (1953) 1087-1092.
[2] W. K. Hastings, "Monte Carlo Sampling Methods Using Markov Chains and their Applications," Biometrika 57 no. 1, (1970) 97-109.
[3] J. J. Heckman, "Statistical Inference and String Theory," Int. J. Mod. Phys. A30 no. 26, (2015) 1550160, arXiv:1305.3621 [hep-th].
[4] R. H. Swendsen and J.-S. Wang, "Replica Monte Carlo Simulation of Spin-Glasses," Phys. Rev. Lett. 57 no. 21, (1986) 2607-2609.
[5] C. J. Geyer, "Markov Chain Monte Carlo Maximum Likelihood," in Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, E. M. Keramidas, ed., pp. 156-163. Interface Foundation, 1991.
[6] W. R. Gilks, G. O. Roberts, and E. I. George, "Adaptive Direction Sampling," Journal of the Royal Statistical Society, Series D (The Statistician) 43 no. 1, (1997) 179-189.
[7] D. J. Earl and M. W. Deem, "Parallel tempering: Theory, applications, and new perspectives," Phys. Chem. Chem. Phys. 7 (2005) 3910-3916.
[8] R. M. Neal, "MCMC Using Ensembles of States for Problems with Fast and Slow Variables such as Gaussian Process Regression," arXiv:1101.0387 [stat].
[9] J. Goodman and J. Weare, "Ensemble Samplers with Affine Invariance," Comm. in Appl. Math. and Comp. Sci. 5 no. 1, (2010) 65-80.
[10] J. J. Heckman, J. G. Bernstein, and B. Vigoda, "MCMC with Strings and Branes: The Suburban Algorithm (Extended Version)," [physics.comp-ph].
[11] S. Hershey, J. Bernstein, B. Bradley, A. Schweitzer, N. Stein, T. Weber, and B. Vigoda, "Accelerating Inference: towards a full Language, Compiler and Hardware Stack," arXiv:1212.2991 [cs.SE].
[12] R. M. Neal, "Slice sampling," The Ann. of Stat. 31 no. 3, (2003) 705-767.
[13] A. Gelman, W. R. Gilks, and G. O. Roberts, "Weak convergence and optimal scaling of random walk Metropolis algorithms," Ann. Appl. Prob. 7 no. 1, (1997) 110-120.
[14] Y. Chen, M. Welling, and A. J. Smola, "Super-Samples from Kernel Herding," arXiv:1203.3472 [cs.LG].
[15] M. E. Peskin and D. V. Schroeder, An Introduction to Quantum Field Theory. Addison-Wesley, Reading, USA, 1995.
[16] A. Wipf, "Statistical approach to quantum field theory," Lect. Notes Phys. 864 (2013).
[17] V. Balasubramanian, "Statistical Inference, Occam's Razor and Statistical Mechanics on the Space of Probability Distributions," Neural Comp. 9(2) (1997) 349-368, arXiv:cond-mat/9601030.
[18] P. Di Francesco, P. Mathieu, and D. Senechal, Conformal Field Theory. Springer-Verlag, New York, USA, 1997.

Appendix A: Path Integrals for MCMC

In this Appendix we give a path integral formulation for MCMC with extended objects. For additional background on path integrals in statistical field theory, see [15, 16]. In what follows, we denote the random variable as $X$ with outcome $x$ on a target space $\Omega$ with measure $dx$. We consider sampling from a probability density $\pi(x)$. In accord with physical intuition, we view $-\log \pi(x) = V(x)$ as a potential energy.

In general, our aim is to discover the structure of $\pi(x)$ by using some sampling algorithm to produce a sequence of values $x^{(1)}, \ldots, x^{(N)}$. A quantity of interest is the expected value of $\pi(x)$ with respect to a given probability distribution of paths. This helps in telling us the relative speed of convergence and the mixing rate. To study this, it is helpful to evaluate the expectation value of the quantity:

$\prod_{i=1}^{N} \exp\left(-V(x^{(i)})\right)$   (A1)

with respect to a given path generated by our sampler.
In more general terms, the reason to be interested in this expectation value comes from the statistical mechanical interpretation of statistical inference [3, 17]: there is a competition between staying in high likelihood regions (minimizing the potential) and exploring more of the distribution (maximizing entropy). The tradeoff between the two is neatly captured by the path integral formalism: it tells us about a particle moving in a potential $V(x)$, subject to a thermal background, as specified by the choice of probability measure over possible paths. Indeed, we will view this probability measure as defining a "kinetic energy" in the sense that at each time step, we apply a random kick to the trajectory of the particle, as dictated by its contact with a thermal reservoir.

Along these lines, if we have an MCMC sampler with transition probabilities $T(x^{(i)} \rightarrow x^{(i+1)})$, marginalizing over the intermediate values yields the expected value of line (A1):

$Z = \int [dx(t)] \prod_{i=0}^{N-1} T(x^{(i)} \rightarrow x^{(i+1)}) \, e^{-V(x^{(i+1)})}$,   (A2)

where we have introduced the measure factor $[dx(t)] = dx^{(1)} \cdots dx^{(N)}$. We would like to interpret $V(x)$ as a potential energy and $-\log T(x^{(i)} \rightarrow x^{(i+1)})$ as a kinetic energy:

$V = -\log \pi \quad \text{and} \quad K = -\log T$.   (A3)

We now observe that our expectation value has the form of a well-known object in physics:

$Z(x_{\text{begin}} \rightarrow x_{\text{end}}) = \int_{\text{begin}}^{\text{end}} [dx(t)] \, e^{-\sum_t L^{(E)}[x(t)]}$,   (A4)

a path integral! Here the Euclidean signature Lagrangian is:

$L^{(E)}[x(t)] = K + V$.   (A5)

Since we shall also be taking the number of timesteps to be very large, we make the Riemann sum approximation and introduce the rescaled Lagrangian density:

$\frac{1}{N} \sum_t \mapsto \int dt, \qquad N L^{(E)} \mapsto \mathcal{L}^{(E)}$,   (A6)

so that we can write our process as:

$Z(x_{\text{begin}} \rightarrow x_{\text{end}}) = \int [dx] \, e^{-\int dt \, \mathcal{L}^{(E)}[x(t)]}$,   (A7)

where by abuse of notation, we use the same variable $t$ to reference both the discretized timestep and its continuum counterpart.

To give further justification for this terminology, consider now the specific case of the Metropolis-Hastings algorithm. In this case, we have a proposal kernel $q(x' \mid x)$ and acceptance probability:

$a(x' \mid x) = \min\left(1, \frac{q(x \mid x')}{q(x' \mid x)} \frac{\pi(x')}{\pi(x)}\right)$.   (A8)

The total transmission probability is then given by a sum of two terms. One is given by $a(x' \mid x) q(x' \mid x)$, i.e., we accept the new sample. We also sometimes reject the sample, i.e., we keep the same value as before:

$T(x \rightarrow x') = r \times \delta(x - x') + a(x' \mid x) q(x' \mid x)$,   (A9)

where $\delta(x - x')$ is the Dirac delta function, and we have introduced an averaged rejection rate:

$r \equiv 1 - \int dx' \, a(x' \mid x) q(x' \mid x)$.   (A10)

To gain further insight, we now approximate the mixture model $T(x \rightarrow x')$ by a normal distribution $q_{\text{eff}}(x^{(t+1)} \mid x^{(t)})$ such that $-\log q_{\text{eff}}(x^{(t+1)} \mid x^{(t)}) \sim \alpha_{\text{eff}} (x^{(t+1)} - x^{(t)})^2$. Hence,

$L^{(E)}[x(t)] \simeq \alpha_{\text{eff}} \left(x^{(t+1)} - x^{(t)}\right)^2 + V(x^{(t)}) + \ldots$,   (A11)

where the "..." denotes additional correction terms which are typically suppressed by powers of $1/N$. Our plan will be to assume a kinetic term with quadratic time derivatives, but a general potential.
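For intuition, the averaged rejection rate $r$ of eqn (A10) can be estimated by Monte Carlo, since $r(x) = 1 - \mathbb{E}_{x' \sim q(\cdot \mid x)}[a(x' \mid x)]$. A sketch for a symmetric Gaussian proposal on a 1D target follows; the function names and the example target are ours.

```python
import numpy as np

# Monte Carlo estimate of the averaged rejection rate r of eqn (A10)
# for random-walk Metropolis with Gaussian proposal q(x'|x) = N(x, s^2).
def rejection_rate(log_pi, x, s, n=100_000, rng=None):
    rng = rng or np.random.default_rng()
    x_prop = x + s * rng.normal(size=n)            # draws from q(.|x)
    # symmetric proposal: a(x'|x) = min(1, pi(x')/pi(x)), eqn (A8)
    accept = np.minimum(1.0, np.exp(log_pi(x_prop) - log_pi(x)))
    return 1.0 - accept.mean()                     # r = 1 - E_q[a]

log_pi = lambda x: -0.5 * x**2                     # standard normal target
print(rejection_rate(log_pi, x=0.0, s=2.5))        # larger s, more rejection
```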
The overall strength of the kinetic term will depend on details such as the average acceptance rate. (Footnote: For example, for a Gaussian proposal kernel with $-\log q(x^{(t+1)} \mid x^{(t)}) \sim \alpha (x^{(t+1)} - x^{(t)})^2$, matching the first and second moments to $q_{\text{eff}}$ requires $\alpha_{\text{eff}} = \alpha / a$, with $a$ the average acceptance rate.) As the acceptance rate decreases, $\alpha_{\text{eff}}$ increases and the sampled values all concentrate together.

We now turn to the generalization of the above concepts for strings and branes, i.e., extended objects. Introduce $M$ copies of the original distribution, and consider the related joint distribution:

$\pi(x_1, \ldots, x_M) = \pi(x_1) \cdots \pi(x_M)$.   (A12)

If we keep the proposal kernel unchanged, we can simply describe the evolution of $M$ independent point particles exploring an enlarged target space:

$\Omega_{\text{enlarged}} = \Omega^M = \underbrace{\Omega \times \cdots \times \Omega}_{M}$.   (A13)

If we also view the individual statistical agents on the worldvolume as indistinguishable, we can also consider quotienting by the symmetric group on $M$ letters, $S_M$:

$\Omega^S_{\text{enlarged}} = \Omega^M / S_M$.   (A14)

Of course, we are also free to consider a more general proposal kernel in which we correlate these values. Viewed in this way, an extended object is a single point particle, but on an enlarged target space. The precise way in which we correlate entries across a grid will in turn dictate the type of extended object.

Indeed, much of the path integral formalism carries over unchanged. The only difference is that now we must also keep track of the spatial extent of our object. So, we again introduce a potential energy $V$ and a kinetic energy $K$:

$V = -\log \pi \quad \text{and} \quad K = -\log T$,   (A15)

and a Euclidean signature Lagrangian density:

$L^{(E)}[x(t, \sigma_A)] = K + V$,   (A16)

where $\sigma_A$ indexes locations on the extended object, and the subscript $A$ makes implicit reference to the adjacency on the graph. In a similar notation, the expected value is now:

$Z(x_{\text{begin}} \rightarrow x_{\text{end}} \mid A) = \int [dx] \, e^{-\sum_t \sum_\sigma L^{(E)}[x(t, \sigma_A)]}$.   (A17)

Since we shall also be taking the number of time steps and agents to be large, we again make the Riemann sum approximation:

$\frac{1}{N} \sum_t \mapsto \int dt, \qquad \frac{1}{M} \sum_\sigma \mapsto \int d\sigma_A, \qquad N M L^{(E)} \mapsto \mathcal{L}^{(E)}$,   (A18)

so that:

$Z(x_{\text{begin}} \rightarrow x_{\text{end}} \mid A) = \int [dx] \, e^{-\int dt \, d\sigma_A \, \mathcal{L}^{(E)}[x(t, \sigma_A)]}$,   (A19)

in the obvious notation.

So far, we have held fixed a particular adjacency matrix. This is somewhat arbitrary, and physical considerations suggest a natural generalization where we sum over a statistical ensemble of choices. One can loosely refer to this splitting and joining of connectivity as "incorporating gravity" into the dynamics of the extended object, because it can change the notion of which statistical agents are nearest neighbors. (Footnote: It is not quite gravity in the worldvolume theory, because there is a priori no guarantee that our sum over different graph topologies will have a smooth semi-classical limit. Nevertheless, summing over different ways to connect the statistical agents conveys the main point that the proximity of any two agents can change.) Along these lines, we incorporate an ensemble $\mathcal{A}$ of possible adjacency matrices, with some prescribed probability to draw a given adjacency matrix. The topology of an extended object dictates a choice of statistical ensemble $\mathcal{A}$. Since we evolve forward in discretized time steps, we can in principle have a sequence of such matrices $A^{(1)}, \ldots, A^{(N)}$, one for each timestep. For each draw of an adjacency matrix, the notion of nearest neighbor will change, which we denote by writing $\sigma_{A^{(t)}}$; that is, we make implicit reference to the connectivity of nearest neighbors.
Marginalizing over the choice of adjacency matrix, we get:

$Z(x_{\text{begin}} \rightarrow x_{\text{end}}) = \int [dx][dA] \, e^{-\sum_t \sum_\sigma L^{(E)}[x(t, \sigma_{A^{(t)}})]}$,   (A20)

where now the integral involves summing over multiple ensembles: the spatial and temporal values with measure factor $dx_\sigma^{(t)}$, as well as the choice of a random matrix from the ensemble $dA^{(t)}$ (one such integral for each timestep). At a very general level, one can view the adjacency matrix as adding additional auxiliary random variables to the process. So in this sense, it is simply part of the definition of the proposal kernel.

1. Dimensions and Correlations

Following some of the general considerations outlined in reference [3], we now discuss the extent to which the extended nature of such objects plays a role in statistical inference and in particular MCMC.

To keep our discussion from becoming overly general, we specialize to the case of a hypercubic lattice of agents in $d$ spatial dimensions arranged on a torus, and we denote a location on the grid by a $d$-component vector $\sigma$. We can allow for the possibility of a fluctuating worldvolume by making the crude substitution $d \mapsto d_{\text{eff}}$.

Consider the Gaussian proposal kernel of line (11). In a large lattice, we approximate the finite differences in each of the $d$ spatial directions by derivatives of continuous functions. Expanding in this limit, various cross-terms cancel and we get for the proposal kernel:

$\propto \exp\left(-2\beta \sum_\sigma \left((D_t x_\sigma)^2 + \sum_{k=1}^{d} (D_k x_\sigma)^2\right)\right)$,   (A21)

where $D_k$ denotes a finite difference in the $k$th spatial component of the $d$-dimensional lattice. Just as in the case of the point particle, the transition rate defines a kinetic energy quadratic in derivatives (to leading order), with an effective strength dictated by the overall acceptance rate.
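To spell out the cancellation (this intermediate step is ours, using the hyperparameter choice of eqn (14), $\alpha_\sigma = (2 - n_\sigma^{\text{tot}})\beta$):

$\alpha_\sigma (D_t x_\sigma)^2 + \beta \sum_{n(\sigma)} \left(D_t x_\sigma - D_{n(\sigma)} x_\sigma\right)^2 = 2\beta (D_t x_\sigma)^2 - 2\beta \, D_t x_\sigma \sum_{n(\sigma)} D_{n(\sigma)} x_\sigma + \beta \sum_{n(\sigma)} \left(D_{n(\sigma)} x_\sigma\right)^2$.

Summed over $\sigma$ on a large lattice, the middle term is $D_t x_\sigma$ times the lattice Laplacian, which becomes a total time derivative after summation by parts and drops from the bulk action, while the last term counts each active link twice, producing the coefficient $2\beta$ of the spatial gradients in (A21).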
One of the things we would most like to understand is the extent to which an extended object can explore the hills and valleys of $V$. We perform a perturbative analysis, at first viewing $V$ as a small correction to the Lagrangian. Starting from some fixed position $x_*$, consider the expansion of $V$ around this point:

$V(x) = V(x_*) + V'(x_*)(x - x_*) + \frac{V''(x_*)}{2} (x - x_*)^2 + \ldots$   (A22)

Each of the derivatives of $V(x)$ reveals another characteristic feature length of $V(x)$. These feature lengths are specified by the values of the moments of the distribution $\pi(x)$.

When $V = 0$, there is a well-known behavior of correlation functions, given by eqn (7). (Footnote: One way to obtain this scaling relation is to observe that the Fourier transform of $1/k^2$ in $d+1$ dimensions exhibits the requisite power law behavior.) There is thus a rather sharp change in the inferential powers of an extended object above and below $d_{\text{eff}} \sim 1$.

To understand the impact of a non-trivial potential, we introduce the notion of a "scaling dimension" for $x(t, \sigma)$ and its derivatives. This is a well-known notion; see [18] for a review. Just as we assign a notion of proximity in space and time to agents on a grid, we can also ask how rescaling all distances on the grid via:

$N \mapsto \lambda N, \qquad M \mapsto \lambda^d M$   (A23)

impacts the structure of our continuum theory Lagrangian. The key point is that provided $N$ and $M$ have been taken sufficiently large, or alternatively we take $\lambda$ sufficiently large, we do not expect there to be any impact on the physical interpretation.

Unpacking this statement naturally leads us to the notion of a scaling dimension for $x(t, \sigma)$ itself. Observe that rescaling the number of samples and number of agents in line (A23) can be interpreted equivalently as holding fixed $N$ and $M$, but rescaling $t$ and $\sigma$:

$(t, \sigma) \mapsto (\lambda t, \lambda \sigma)$.   (A24)

Now, for our kinetic term to remain invariant, we need to also rescale $x(t, \sigma)$:

$x(t, \sigma) \mapsto \lambda^{-\Delta} x(\lambda t, \lambda \sigma)$.   (A25)

The exponent $\Delta$ is often referred to as the "scaling dimension" for $x$ obtained from "naive dimensional analysis" or NDA. It is "naive" in the sense that when the potential $V \neq 0$ and we have strong coupling, the notion of a scaling dimension may only emerge at sufficiently long distance scales. Note that because we are uniformly rescaling the spatial and temporal pieces of the grid, we get the same answer for the scaling dimension if we consider spatial derivatives along the grid. This assumption can also be relaxed in more general physical systems. To illustrate, invariance of the free field action requires:

$\Delta = \frac{d - 1}{2}$.   (A26)

We can also consider the behavior of a perturbation of the form $(x)^\mu (Dx)^\nu$. Applying our NDA prescription, we see that under a rescaling, the contribution such a term makes to the action is:

$\int dt \, d^d\sigma \, (x)^\mu (Dx)^\nu \mapsto \lambda^{-\mu \Delta - \nu(\Delta + 1) + d + 1} \int dt \, d^d\sigma \, (x)^\mu (Dx)^\nu$,   (A27)

so terms of the form $(Dx)^\nu$ for $\nu > 2$ die off as we take $N \rightarrow \infty$, i.e., $\lambda \rightarrow \infty$. Additionally, we see that when $d \leq 1$, we can in principle expect more general contributions of the form $(x)^\mu (Dx)^\nu$. For additional discussion on the interpretation of such contributions, see reference [3].

Consider next possible perturbations to the potential energy. Each successive interaction term in the potential is of the form $x^n$, with scaling dimension $n(d-1)/2$. So, for $d \leq 1$, all higher order terms can impact the long distance behavior of the correlation functions, while for $d > 1$, the most relevant term is bounded above, and the global structure of the potential will be missed.

FIG. 8: Plots of the convergence to the true tail statistics for suburban samplers of the random seed 40 landscape ($f_{0\sigma\text{-}1\sigma}$, $f_{1\sigma\text{-}2\sigma}$, $f_{2\sigma\text{-}3\sigma}$, $f_{3\sigma}$ versus $N_{\text{samples}}$, for $d_{\text{eff}} = 0, 1, 2$).

FIG. 9: Plots of the convergence to the true tail statistics for suburban samplers of the banana distribution ($f_{0\sigma\text{-}1\sigma}$, $f_{1\sigma\text{-}2\sigma}$, $f_{2\sigma\text{-}3\sigma}$, $f_{3\sigma}$ versus $N_{\text{samples}}$, for $d_{\text{eff}} = 0, 1, 1.6$).
FIG. 10: Comparison between slice and suburban for the random seed 40 landscape ($d_{\text{mean}}$, $d_{\text{cov}}$, $\tau_{\text{dec}}$, and $R_{\text{rej}}$ versus $N_{\text{samples}}$, for $d_{\text{eff}} = 0$, $d_{\text{eff}} = 1$, and slice).

Appendix B: Tail Statistics Tests

In figs. 8 and 9 we display some tests of how well a sampler collects "rare events," i.e., tail statistics. After taking a burn-in cut with $N_{\text{total}}$ remaining samples, we compute the number of counts in the $0\sigma$-$1\sigma$ region, the $1\sigma$-$2\sigma$ region, the $2\sigma$-$3\sigma$ region, and events which fall outside the $3\sigma$ region. For each such region, we compute the difference between the inferred and true counts and return the fraction (a code sketch of this metric appears after Appendix C below):

$f_{\text{region}} \equiv \frac{N_{\text{inf}} - N_{\text{true}}}{N_{\text{total}}}$.   (B1)

Appendix C: Comparison with Slice

Figs. 10 and 11 compare the performance of a suburban sampler with 2d grid topology (with $d_{\text{eff}} = 1$ and $\beta = 0.01$) against parallel slice within Gibbs sampling. We use the default implementation in Dimple, so that for a 1D target distribution the initial size of the $x$-axis width is an interval of length one containing $x_*$, and the maximum number of doublings is 10. A direct comparison with suburban is subtle because in slice sampling the halting of the "stepping out" and "stepping in" loops is not fixed ahead of time. In practice we find that for a fixed number of samples, slice typically makes several more queries to the target distribution than suburban, roughly a factor of $\sim 5$-$10$. For large free energy barriers, it is also sometimes helpful to enlarge the initialization width from 1 to 100 (see fig. 11).

FIG. 11: Comparison between slice and suburban for the free energy barrier test. We show two different initialization widths for the slice sampler.
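Returning to the tail-statistics test of Appendix B, here is a minimal sketch of the metric in eqn (B1) for a 1D normal target. Taking the true counts from the Gaussian CDF is our illustrative shortcut; the paper compares against the true counts of each benchmark target, and the function name is ours.

```python
import numpy as np
from scipy import stats

# Tail-statistics metric of eqn (B1): bin samples by how many true
# standard deviations they sit from the true mean, then compare the
# inferred counts against the expected counts in each region.
def tail_fractions(samples, mu, sigma):
    z = np.abs(samples - mu) / sigma
    N_total = len(samples)
    edges = [0.0, 1.0, 2.0, 3.0, np.inf]
    f = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        N_inf = np.count_nonzero((z >= lo) & (z < hi))
        # expected counts in the region, from the two-sided Gaussian CDF
        p_true = 2.0 * (stats.norm.cdf(hi) - stats.norm.cdf(lo))
        f.append((N_inf - p_true * N_total) / N_total)     # eqn (B1)
    return f   # [f_{0-1 sigma}, f_{1-2 sigma}, f_{2-3 sigma}, f_{>3 sigma}]
```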
