Stratification as a general variance reduction method for Markov chain Monte Carlo

Aaron R. Dinner†, Erik H. Thiede‡, Brian Van Koten§, and Jonathan Weare¶

Abstract. The Eigenvector Method for Umbrella Sampling (EMUS) [48] belongs to a popular class of methods in statistical mechanics which adapt the principle of stratified survey sampling to the computation of free energies. We develop a detailed theoretical analysis of EMUS. Based on this analysis, we show that EMUS is an efficient general method for computing averages over arbitrary target distributions. In particular, we show that EMUS can be dramatically more efficient than direct MCMC when the target distribution is multimodal or when the goal is to compute tail probabilities. To illustrate these theoretical results, we present a tutorial application of the method to a problem from Bayesian statistics.

1. Introduction. Markov chain Monte Carlo (MCMC) methods have been widely used with great success throughout statistics, engineering, and the natural sciences. However, when estimating tail probabilities or when sampling from multimodal distributions, accurate MCMC estimates often require a prohibitively large number of samples. In this article, we analyze the Eigenvector Method for Umbrella Sampling (EMUS) [48]. We first proposed EMUS as a method for computing free energies, and we demonstrated that it was useful for treating the multimodality that typically arises in that context. Here, we demonstrate that EMUS is an effective general means of addressing the challenges posed not just by multimodality but also tail events, with potential applications to a broad range of problems in statistics, engineering, and the natural sciences.
EMUS was inspired by Umbrella Sampling [49] and other methods such as the Weighted Histogram Analysis Method (WHAM) [27] and the Multistate Bennett Acceptance Ratio (MBAR) [44] for computing potentials of mean force and free energies in statistical mechanics.¹ We call these stratified MCMC methods since they each adapt the principle of stratified survey sampling to MCMC simulation. That is, they each draw samples concentrated in a collection of subregions of state space and combine the resulting averages in a consistent manner. Stratified MCMC methods are among the most powerful, most successful, and most widely used tools in molecular simulation. (However, in contrast to our presentation here, they are not typically used in molecular simulation to compute averages of general observables.)

∗ BVK was supported by NSF RTG: Computational and Applied Mathematics in Statistical Science, number 1547396. ARD, EHT, and JW were supported by National Institutes of Health (NIH) Grant Number R01GM109455. JW was also supported by the Advanced Scientific Computing Research Program within the DOE Office of Science through award DE-SC0020427. Computing resources were provided by the University of Chicago Research Computing Center. We wish to thank Jonathan Mattingly, Jeremy Tempkin, and Charlie Matthews for helpful discussions.
† Department of Chemistry and James Franck Institute, the University of Chicago, Chicago, IL 60637
‡ Department of Chemistry and James Franck Institute, the University of Chicago, Chicago, IL 60637
§ Department of Mathematics and Statistics, the University of Massachusetts, Amherst, MA 01003
¶ Courant Institute of Mathematical Sciences, New York University, New York, NY 10012
¹ A potential of mean force is the logarithm of a marginal density. A free energy is the logarithm of a normalization constant.
Both quantities play fundamental roles in statistical mechanics, e.g., in theories of rates of chemical reactions.

WHAM, for example, has been instrumental for treating biomolecular processes ranging from protein folding [7] to conductance by ion channels [3]. While the practical utility of stratification has been established in many applications, the advantages and disadvantages of the method have remained poorly understood; cf. [48]. Motivated by the substantial gap between theory and application of stratified MCMC within statistical mechanics, and also by the general challenges posed by multimodality and tail probabilities, the goal of this paper is to develop a clear theoretical explanation of the advantages of EMUS. Our theory suggests new applications of stratified MCMC (and EMUS in particular) to broad classes of sampling problems arising in statistics and statistical mechanics. For example, very recently EMUS was successfully applied to a parameter estimation problem in cosmology [34].

Summary of Main Results. Our most general results are a central limit theorem (CLT) for the EMUS method and a convenient upper bound on the asymptotic variance, cf. Theorem 3.3 and Theorem 3.5. We note that the proof of the upper bound relies on a new class of perturbation estimates for Markov chains which we derived in [47]. These estimates are substantially more detailed than previous results [9]. After proving the CLT, we address the dependence of the sampling error on the choice of strata. In particular, for a representative MCMC method, we estimate the asymptotic variances of trajectory averages sampling the biased distributions, cf. Theorem 3.7. Our estimate shows how factors such as the diameters of the strata influence the asymptotic variances.
In Section 4, we apply the general theory developed in Section 3 to case studies involving tail probabilities and multimodality. Our results concern two limits: a small probability limit and a low-temperature limit. In the small probability limit, we consider estimation of probabilities of the form $p_M := \mathbb{P}[X \ge M]$. For a broad class of random variables $X$, we show that while the cost of computing $p_M$ with relative precision by direct MCMC increases exponentially with $M$, the cost by EMUS increases only polynomially; cf. Section 4.2. In the low-temperature limit, a parameter of the target distribution decreases, intensifying the effects of multimodality on the efficiency of MCMC sampling. We show that the cost of computing an average to fixed precision by direct MCMC increases exponentially in this limit, whereas the cost by EMUS increases only polynomially; cf. Section 4.1. We conclude that EMUS may be dramatically more efficient than direct MCMC sampling when the target distribution is multimodal or when the goal is to compute a small tail probability.

To illustrate our theoretical results, we present a tutorial numerical study applying EMUS to a problem in Bayesian statistics in Section 5. In addition to illustrating the theory, our numerical study demonstrates the problems that may occur when EMUS and other similar stratified MCMC methods are used carelessly. It also addresses practical issues such as the choice of strata and the computation of error bars for averages estimated by EMUS.

The results in this article significantly extend and generalize the ideas in [48]. We first proposed the EMUS method with the goal of analyzing and improving umbrella sampling approaches in free energy calculations.
Here, our goal is to establish EMUS as a general variance reduction technique, and we present entirely new results, including an upper bound on the asymptotic variance of EMUS (Theorem 3.5), a condition to guide some aspects of the choice of strata (Remark 3.10), a theoretical argument demonstrating the benefits of EMUS for computing tail probabilities (Section 4.2), numerical results applying EMUS to Bayesian inference (Section 5), a method of correcting problems related to poorly chosen strata (Section 5.3), and a greatly improved numerical method for estimating the standard deviations of quantities computed by EMUS (Appendix G). In addition, we give complete justifications of some results that were stated without proof in [48], including Theorem 3.7 concerning the dependence of the sampling error on the choice of strata. Finally, we note that our results concerning multimodal distributions and the low-temperature limit generalize and clarify the results given in [48]; in particular, our Theorem 4.1 covers periodic boundary conditions and stratification in more than one variable.

2. The Eigenvector Method for Umbrella Sampling. In this section, we present the Eigenvector Method for Umbrella Sampling (EMUS), and we prove that it is consistent. A detailed derivation of the estimator can be found in [48]. We also review a related method, iterative EMUS, and we compare iterative EMUS with the MBAR method from statistical mechanics [44].

2.1. The EMUS estimator. The objective of EMUS is to compute the average

$$\pi[g] := \int_\Omega g(x)\,\pi(dx)$$

of a function $g$ with respect to a measure $\pi$ defined on a set $\Omega$. In EMUS, instead of sampling directly from $\pi$, we sample from biased distributions analogous to the strata in stratified survey sampling methods. We then weight the samples from the biased distributions to estimate $\pi[g]$.
We assume that the biased distributions take the form

$$\pi_i(dx) := \frac{\psi_i(x)\,\pi(dx)}{\pi[\psi_i]}$$

for some set $\{\psi_i\}_{i=1}^L$ of non-negative bias functions defined on $\Omega$. We assume that

$$\sum_{i=1}^L \psi_i(x) > 0 \quad \text{for all } x \in \Omega.$$

We call the support of $\psi_i$ the $i$'th stratum to make an analogy between the biased distributions of EMUS and the strata of stratified survey sampling.

The EMUS estimator is based on two observations. First, one may write $\pi[g]$ in terms of weighted sums of averages over the biased distributions. Let $u \in (0,\infty)^L$ be arbitrary, and for any function $h : \Omega \to \mathbb{R}$, define

$$h^*(x) := \frac{h(x)}{\sum_{k=1}^L \psi_k(x)/u_k}.$$

By [48, Equation 16], we have

(2.1) $$\pi[g] = \frac{\sum_{i=1}^L z_i\,\pi_i[g^*]/u_i}{\sum_{i=1}^L z_i\,\pi_i[\mathbf{1}^*]/u_i},$$

where $\mathbf{1}$ denotes the function equal to one everywhere and

(2.2) $$z_i = \frac{\pi[\psi_i]}{\sum_{k=1}^L \pi[\psi_k]}.$$

(For the reader's convenience, we present complete derivations of equations (2.1) and (2.3) in Appendix A.) The parameter $u$ above can be thought of as an initial guess of the unknown weight vector $z \in \mathbb{R}^L$ in (2.2). Its explicit presence here will simplify the description of an iterative version of EMUS in Section 2.2. Outside of this section and Section 2.2, the choice of $u$ is absorbed in the definition of the $\psi_i$, i.e. $u_i = 1$.

Second, the weight vector $z$ can be found by solving a certain eigenproblem. Let $w \in \mathbb{R}^L$ be defined by $z_i = u_i w_i$. By [48, Equation 17], we have

(2.3) $$w^t F = w^t, \quad \text{where } F_{ij} := \pi_i[\psi_j^*]/u_j.$$

We call $F$ the overlap matrix. We observe that equations (2.1) and (2.3) combine to express $\pi[g]$ as a function of averages over the biased distributions, namely $\pi_i[g^*]$, $\pi_i[\mathbf{1}^*]$, and $F$. To see this, it suffices to recognize that $F$ is stochastic and $w$ is a probability vector. Therefore, whenever $F$ is irreducible, the solution of the eigenproblem is unique by the Perron–Frobenius theorem.
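The identities (2.1)–(2.3) can be checked exactly on a finite state space, where every average is a finite sum. The following sketch (ours, not from the paper) does so in NumPy with $u_i = 1$, confirming both the reweighting identity and the stationarity of $z$ for the overlap matrix.

```python
import numpy as np

# Finite-state check of identities (2.1)-(2.3) with u_i = 1.
rng = np.random.default_rng(0)
n, L = 50, 3
pi = rng.random(n); pi /= pi.sum()        # target distribution on {0,...,n-1}
psi = rng.random((L, n))                  # positive bias functions
g = rng.standard_normal(n)                # observable

z = psi @ pi
z /= z.sum()                              # weight vector (2.2)
pii = psi * pi / (psi * pi).sum(axis=1, keepdims=True)   # biased distributions
S = psi.sum(axis=0)                       # sum_k psi_k(x)
gstar, onestar = g / S, 1.0 / S           # g* and 1*

# (2.1): pi[g] recovered from averages over the biased distributions.
lhs = pi @ g
rhs = sum(z[i] * (pii[i] @ gstar) for i in range(L)) / \
      sum(z[i] * (pii[i] @ onestar) for i in range(L))
assert np.isclose(lhs, rhs)

# (2.3): z solves the eigenproblem z^t F = z^t for the overlap matrix.
F = np.array([[pii[i] @ (psi[j] / S) for j in range(L)] for i in range(L)])
assert np.allclose(z @ F, z)
```

Both assertions hold to floating-point precision for any positive choice of $\pi$ and $\psi_i$, which is a useful check when implementing the estimator.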
We will assume throughout the remainder of this work that $F$ is irreducible, which amounts to requiring some overlap between neighboring strata; see Lemma 2.1. In general, for any irreducible, stochastic matrix $G \in \mathbb{R}^{L \times L}$, we will let $w(G) \in \mathbb{R}^L$ denote the unique solution of

$$w(G)^t G = w(G)^t \quad \text{with} \quad \sum_{i=1}^L w_i(G) = 1.$$

With this notation, by (2.1) and (2.3) we have

(2.4) $$\pi[g] = \frac{\sum_{i=1}^L w_i(F)\,\pi_i[g^*]}{\sum_{i=1}^L w_i(F)\,\pi_i[\mathbf{1}^*]}.$$

In EMUS, we substitute MCMC estimates for the averages over biased distributions on the right hand side of (2.4) to estimate $\pi[g]$. To be precise, let $X_t^i$ be a Markov process ergodic for $\pi_i$. We call $X_t^i$ the biased process sampling the biased distribution $\pi_i$. The EMUS algorithm proceeds as follows:

1. For each $i = 1, \ldots, L$, compute $N_i$ steps of the process $X_t^i$.
2. Compute the averages

(2.5) $$\bar g_i^* := \frac{1}{N_i}\sum_{t=1}^{N_i} g^*(X_t^i), \quad \bar{\mathbf{1}}_i^* := \frac{1}{N_i}\sum_{t=1}^{N_i} \mathbf{1}^*(X_t^i), \quad \text{and} \quad \bar F_{ij} := \frac{1}{N_i}\sum_{t=1}^{N_i} \psi_j^*(X_t^i)/u_j.$$

3. Compute $w(\bar F)$ numerically, for example from the QR factorization of $I - \bar F$ [19].
4. Compute the estimate

$$\pi^{US}[g] := \frac{\sum_{i=1}^L w_i(\bar F)\,\bar g_i^*}{\sum_{i=1}^L w_i(\bar F)\,\bar{\mathbf{1}}_i^*}$$

of $\pi[g]$.

Recall that $w(\bar F)$ is defined only if $\bar F$ is irreducible. In the following lemma, whose proof is in Appendix B, we state simple criteria for the irreducibility of $F$ and $\bar F$:

Lemma 2.1. The overlap matrix $F$ is irreducible if and only if for every $A \subset \{1, 2, \ldots, L\}$, we have

(2.6) $$\pi\!\left[\Bigl(\sum_{i \in A} \psi_i\Bigr)\Bigl(\sum_{j \notin A} \psi_j\Bigr)\right] > 0.$$

The approximate overlap matrix $\bar F$ is irreducible if and only if for every $A \subset \{1, 2, \ldots, L\}$, the set $\cup_{i \in A}\{x : \psi_i(x) > 0\}$ contains at least one sample point generated from one of the biased processes $X_t^j$ with $j \notin A$.
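To make steps 1–4 concrete, here is a minimal NumPy sketch of the estimator for $u_i = 1$. It is a simplification under two stated assumptions: the biased "processes" are i.i.d. draws rather than Markov chains, and the eigenproblem in step 3 is solved by a least-squares solve rather than the QR factorization of $I - \bar F$ mentioned above. All function names are ours.

```python
import numpy as np

def stationary_vector(F):
    """Solve w^t F = w^t with sum(w) = 1 for an irreducible stochastic F
    (step 3; a least-squares solve stands in for the QR approach of [19])."""
    L = F.shape[0]
    A = np.vstack([F.T - np.eye(L), np.ones(L)])
    b = np.zeros(L + 1)
    b[-1] = 1.0
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

def emus_estimate(psis, samples, g):
    """EMUS estimate pi_US[g] from per-stratum samples, with u_i = 1."""
    L = len(psis)
    Fbar = np.empty((L, L))
    gbar = np.empty(L)
    onebar = np.empty(L)
    for i, x in enumerate(samples):
        denom = sum(psi(x) for psi in psis)                  # sum_k psi_k(x)
        Fbar[i] = [np.mean(psi(x) / denom) for psi in psis]  # row of (2.5)
        gbar[i] = np.mean(g(x) / denom)                      # bar g*_i
        onebar[i] = np.mean(1.0 / denom)                     # bar 1*_i
    w = stationary_vector(Fbar)                              # step 3
    return (w @ gbar) / (w @ onebar)                         # step 4

# Toy check: pi = Uniform[0, 2] with two overlapping indicator strata,
# so pi_1 = Uniform[0, 1.2] and pi_2 = Uniform[0.8, 2].
rng = np.random.default_rng(1)
psis = [lambda x: (x <= 1.2).astype(float),
        lambda x: (x >= 0.8).astype(float)]
samples = [rng.uniform(0.0, 1.2, 40_000),
           rng.uniform(0.8, 2.0, 40_000)]
est = emus_estimate(psis, samples, g=lambda x: x)   # pi[g] = 1 exactly
```

The overlap region $[0.8, 1.2]$ guarantees the irreducibility condition of Lemma 2.1, and the estimate should agree with $\pi[g] = 1$ up to sampling error.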
We claim that the EMUS estimator is consistent; that is, $\pi^{US}[g]$ converges almost surely to $\pi[g]$ as the total number of samples tends to infinity. To make this precise, we require the following assumption on the growth of $N_i$ with the total number of samples:

Assumption 2.2. Let $N = \sum_{i=1}^L N_i$ be the total number of samples from all biased distributions. Assume that for each $i$,

$$\lim_{N \to \infty} N_i/N = \kappa_i > 0.$$

That is, assume that when $N$ is large, the proportion of samples drawn from the $i$'th biased distribution is fixed and greater than zero.

We now prove that EMUS is consistent:

Lemma 2.3. Under Assumption 2.2 and the irreducibility condition (2.6), $\pi^{US}[g]$ converges almost surely to $\pi[g]$ as the total number of samples $N$ tends to infinity.

Proof. Since the processes $X_t^i$ are ergodic,

(2.7) $$\bar F \xrightarrow{\text{a.s.}} F, \quad \bar g_i^* \xrightarrow{\text{a.s.}} \pi_i[g^*], \quad \text{and} \quad \bar{\mathbf{1}}_i^* \xrightarrow{\text{a.s.}} \pi_i[\mathbf{1}^*] \quad \text{as } N \to \infty.$$

Moreover, by Lemma D.1 in Appendix D, $w(G)$ is continuous at $F$. (Technically, $w(G)$ admits an extension to the set of all $L \times L$ matrices, which is continuous at $F$.) Therefore, as a function of $\bar F$, $\bar g_i^*$, and $\bar{\mathbf{1}}_i^*$, $\pi^{US}[g]$ is continuous at $F$, $\pi_i[g^*]$, and $\pi_i[\mathbf{1}^*]$. It follows by the continuous mapping theorem and equation (2.4) that $\pi^{US}[g] \xrightarrow{\text{a.s.}} \pi[g]$.

2.2. Iterative EMUS and Vardi's Estimator for Selection Bias Models. EMUS resembles certain methods for computing normalization constants of families of probability densities [17, 51, 35, 26]. The resemblance arises because the weights in the third step of EMUS are the normalization constants of the biased distributions. These methods have been used, for example, to compute Bayes factors in model selection problems [17] and for computations related to selection bias models [51].
In this section, we explain how EMUS relates to Vardi's estimator for selection bias models [51] and its descendants such as the popular Multistate Bennett Acceptance Ratio (MBAR) method [44]. In addition to comparing EMUS with these methods, we review a method, iterative EMUS, for solving the nonlinear system of equations defining Vardi's estimator [48]. The first iterate of this method is exactly the EMUS estimator described in Section 2.1. We note that our analysis and calculations in Sections 3, 4, and 5 pertain only to the EMUS method and not iterative EMUS or MBAR. However, the similarity between the methods suggests that our results may generalize.

Given the notation developed in Section 2.1, we can express Vardi's estimate $z^V$ of the weight vector as the vector with entries $z_i^V = u_i N_i$ where $u$ solves the equation

(2.8) $$N_j = \sum_{i=1}^L N_i \bar F_{ij}(u), \qquad \sum_{i=1}^L u_i N_i = 1,$$

and where we have now made the dependence of the matrix $\bar F$ in (2.5) on the choice of $u \in \mathbb{R}^L$ explicit. By [51, Theorem 1], this nonlinear equation determines $z^V$ uniquely whenever the irreducibility criterion of Lemma 2.1 holds.

Vardi's estimator was originally derived assuming that the samples $X_t^i$ from the biased distributions were i.i.d. In that case, it is the nonparametric maximum likelihood estimator of the target distribution $\pi$ given samples from the biased distributions $\pi_i$ [51], and it has certain optimality properties [18]. Several adjustments to the estimator have been proposed for the case of samples from Markov processes. In the Multistate Bennett Acceptance Ratio (MBAR) method, one replaces the factors $N_i$ appearing in the summand in (2.5) with effective sample sizes $n_i$, which are computed from estimates of the integrated autocovariance of a family of functions [44].
(In some versions of MBAR, the sample average over all $N_i$ points is replaced with a sample average over the $n_i$ points obtained by including only every $N_i/n_i$'th point along the trajectory $X_t^i$.) Another recent work proposes different effective sample sizes computed by minimizing an estimate of the asymptotic variance of the estimator [12]. In fact, the estimator is consistent with $N_i$ replaced by any fixed positive number [12]. We have found that our numerical results do not depend sensitively on the choice of effective sample size, so we use $N_i$ for simplicity.

We now review iterative EMUS, which we introduced in [48]. Iterative EMUS may be understood as a fixed point iteration for solving equation (2.8). The iteration proceeds as follows:

1. As an initial guess for $z^V$, choose a positive vector $z^0 \in \mathbb{R}^L$. Set $m = 0$. Choose a tolerance $\tau > 0$.
2. Compute $\bar F_{ij}(u)$ for $u_i = z_i^m/N_i$. Solve the eigenvector equation

(2.9) $$w_j = \sum_{i=1}^L w_i \bar F_{ij}(u), \qquad \sum_{i=1}^L w_i = 1$$

for $w \in \mathbb{R}^L$ to obtain the updated estimate $z^{m+1}$ of $z^V$ with entries

$$z_i^{m+1} = \frac{u_i w_i}{\sum_{k=1}^L u_k w_k}.$$

3. If $\max_i \left| w_i - \frac{N_i}{N} \right| > \tau$, then increment $m$ and repeat step 2.

In [48] we show that the eigenvector equation (2.9) has a unique solution for every $m$, and we suggest a numerical method for finding the solution. We also discuss the convergence of iterative EMUS, and we show that for every fixed $m$, $z^m$ is a consistent estimator of the weight vector $z$. If one chooses $z_i^0 = N_i/N$, then $z^1$ is the EMUS estimate of $z$, $w(\bar F)$.

2.3. Related MCMC methods. EMUS belongs to a large class of MCMC methods that by various mechanisms promote a more uniform sampling of space.
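The iteration above can be sketched in a few lines of NumPy. The sketch below is self-contained and makes two simplifications we should flag: the biased samples are i.i.d. draws rather than Markov chains, and the eigenproblem (2.9) is solved by a least-squares solve; function names are ours. Note that $\bar F(u)$ must be re-evaluated from the stored samples on every pass, since $u$ changes.

```python
import numpy as np

def stationary_vector(F):
    """Probability vector w with w^t F = w^t, via a least-squares solve."""
    L = F.shape[0]
    A = np.vstack([F.T - np.eye(L), np.ones(L)])
    b = np.zeros(L + 1)
    b[-1] = 1.0
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

def Fbar(psis, samples, u):
    """Entries (2.5), with the dependence on u made explicit as in (2.8)."""
    L = len(psis)
    F = np.empty((L, L))
    for i, x in enumerate(samples):
        denom = sum(psi(x) / uk for psi, uk in zip(psis, u))  # sum_k psi_k/u_k
        F[i] = [np.mean(psi(x) / denom) / uj for psi, uj in zip(psis, u)]
    return F

def iterative_emus(psis, samples, tol=1e-8, max_iter=50):
    """Fixed point iteration for Vardi's estimate z^V of the weight vector."""
    N = np.array([len(x) for x in samples], dtype=float)
    z = N / N.sum()                     # z^0_i = N_i/N: first iterate is EMUS
    for _ in range(max_iter):
        u = z / N                                            # step 2
        w = stationary_vector(Fbar(psis, samples, u))
        z = u * w / np.sum(u * w)
        if np.max(np.abs(w - N / N.sum())) <= tol:           # step 3
            break
    return z

# Toy check: pi = Uniform[0, 2], psi_1 = 1_[0,1], psi_2 = 1_[0.5,2], so the
# exact weight vector is z = (0.5, 0.75) / 1.25 = (0.4, 0.6).
rng = np.random.default_rng(2)
psis = [lambda x: (x <= 1.0).astype(float),
        lambda x: (x >= 0.5).astype(float)]
samples = [rng.uniform(0.0, 1.0, 40_000), rng.uniform(0.5, 2.0, 40_000)]
z = iterative_emus(psis, samples)
```

With this choice of $z^0$, the first pass reproduces the plain EMUS weights, and subsequent passes refine them toward Vardi's estimate.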
For example, in parallel tempering [46, 16], one uses MCMC samples drawn from a distribution or sequence of distributions close to the uniform distribution to speed sampling of the target distribution. The bias introduced by the choice of distributions is corrected either by reweighting the samples or by a replica exchange strategy [16]. The Wang–Landau [52] and Metadynamics [28] methods adaptively construct a biased distribution to achieve uniform sampling in certain coordinates. The temperature accelerated molecular dynamics method [33] is also designed to achieve uniform sampling in a given coordinate, but it works by entirely different means. In EMUS and other stratified MCMC methods, one achieves more uniform sampling by ensuring that each stratum contains points from at least one MCMC simulation. EMUS is perhaps most similar in spirit to the parallel Markov chain Monte Carlo method [50].

3. Error Analysis of EMUS. Here, we develop tools for analyzing the error of EMUS. First, in Section 3.1, we prove a CLT for EMUS, and we derive a convenient upper bound on the asymptotic variance. Then, in Section 3.2, we analyze the dependence of the asymptotic variance of EMUS on the choice of biased distributions. We use these tools in Section 4 to prove limiting results demonstrating the advantages of EMUS for multimodal distributions and tail probabilities. In addition, our CLT for EMUS is the basis for both practical error estimates (Section 5 and Appendix G) and also a method of optimizing the allocation $\kappa_i$ of the samples among the different biased distributions [48].

3.1. A CLT for EMUS and an Estimate of the Asymptotic Variance. In this section, we prove a Central Limit Theorem (CLT) for EMUS, and we derive an upper bound on the asymptotic variance $\sigma^2_{US}(g)$ of $\pi^{US}[g]$.
To prove the CLT for EMUS, we must assume that a CLT holds for trajectory averages over the biased processes:

Assumption 3.1. For any matrix $H$, let $H_{i:}$ denote the $i$'th row of $H$. Define $\bar G \in \mathbb{R}^{L \times 2}$ by $\bar G_{i:} = (\bar g_i^*, \bar{\mathbf{1}}_i^*)$. Assume that

(3.1) $$\sqrt{N_i}\left[(\bar F_{i:}, \bar G_{i:}) - (F_{i:}, \pi_i[g^*], \pi_i[\mathbf{1}^*])\right] \xrightarrow{d} \mathcal{N}(0, \Sigma_i)$$

for some asymptotic covariance matrix $\Sigma_i \in \mathbb{R}^{(L+2) \times (L+2)}$ of the form

(3.2) $$\Sigma_i = \begin{pmatrix} \sigma_i & \rho_i \\ \rho_i^t & \tau_i \end{pmatrix},$$

where $\sigma_i \in \mathbb{R}^{L \times L}$ denotes the asymptotic covariance of $\bar F_{i:}$ with itself, $\rho_i \in \mathbb{R}^{L \times 2}$ denotes the asymptotic covariance of $\bar F_{i:}$ with $\bar G_{i:}$, and $\tau_i \in \mathbb{R}^{2 \times 2}$ denotes the asymptotic covariance of $\bar G_{i:}$ with itself.

We expect a CLT to hold for most MCMC methods, target distributions, and target functions of interest in statistics and statistical mechanics. We refer to [41] for a comprehensive review of conditions guaranteeing a CLT. In Theorem 3.7 of Section 3.2, we prove a CLT and an estimate of the asymptotic variance for a simple family of processes which one might use to sample the biased distributions in an application of EMUS.

We now prove a CLT for $\pi^{US}[g]$, and we give a formula expressing the asymptotic variance $\sigma^2_{US}(g)$ of $\pi^{US}[g]$ in terms of the asymptotic variances $\Sigma_i$ of the trajectory averages. In this formula, $(I-F)^\#$ denotes the group generalized inverse of $I-F$; the group inverse $A^\#$ of a matrix $A$ is characterized by the properties

$$A A^\# A = A, \quad A^\# A A^\# = A^\#, \quad \text{and} \quad A A^\# = A^\# A.$$

We refer to [19] for a detailed explanation of the properties of the group inverse, a proof that $(I-F)^\#$ exists whenever $F$ is stochastic and irreducible, and an algorithm for computing $(I-F)^\#$. In Theorem 3.3 and below we impose the following assumption:

Assumption 3.2. The processes $X_t^i$ sampling the biased distributions are independent.

This assumption does not hold for all stratified MCMC methods.
For example, in replica exchange umbrella sampling one periodically allows configuration exchanges between neighboring processes; see [32] for a general discussion of replica exchange strategies and [45] for an application of replica exchange in a method similar to EMUS. The result is a single process taking values in $\mathbb{R}^{L \times d}$ and sampling the product distribution $\Pi(x^1, x^2, \ldots, x^L) = \pi_1(x^1)\,\pi_2(x^2) \cdots \pi_L(x^L)$. In this case, a CLT would still hold for EMUS, but the asymptotic variance would take a different form.

We also assume that $u_i = 1$ for all $i$. As already mentioned, we can equivalently absorb the choice of $u_i$ into the choice of the bias function $\psi_i$.

Theorem 3.3. Let Assumptions 2.2, 3.1, and 3.2 and the irreducibility condition (2.6) hold. Let $g$ be square integrable over $\pi$, so $\pi[g^2] < \infty$. Define

$$\Psi = \frac{1}{\sum_{i=1}^L \pi[\psi_i]} \quad \text{and} \quad \ell := \Psi\,(1, -\pi[g])^t \in \mathbb{R}^2,$$

and let $\mathbf{g} \in \mathbb{R}^L$ be the vector with $\mathbf{g}_i := \ell \cdot (\pi_i[g^*], \pi_i[\mathbf{1}^*])$. We have

(3.3) $$\sqrt{N}\,(\pi^{US}[g] - \pi[g]) \xrightarrow{d} \mathcal{N}(0, \sigma^2_{US}(g)),$$

where

(3.4) $$\sigma^2_{US}(g) = \sum_{i=1}^L \frac{z_i^2}{\kappa_i}\left[(I-F)^\#\mathbf{g} \cdot \sigma_i (I-F)^\#\mathbf{g} + 2\,(I-F)^\#\mathbf{g} \cdot \rho_i \ell + \ell^t \tau_i \ell\right].$$

Proof. The result follows using the delta method and a formula expressing $w'(F)$ in terms of $(I-F)^\#$; we give the details in Appendix D.

We now derive a convenient upper bound on the asymptotic variance $\sigma^2_{US}(g)$. In Section 4, we use this bound to analyze the efficiency of EMUS in the low-temperature limit and in the limit of small tail probabilities. Our bound is based on the probability $P_i[t_j < t_i]$ defined below:

Definition 3.4. Let $Y_n$ be the Markov chain with state space $\{1, 2, \ldots, L\}$ and transition matrix $F$. Let $P_i[t_j < t_i]$ denote the probability that $Y_n$ hits $j$ before returning to $i$, conditioned on $Y_0 = i$.

Theorem 3.5.
Let Assumptions 2.2, 3.1, and 3.2 and the irreducibility condition (2.6) hold. Let $g$ be square integrable over $\pi$, so $\pi[g^2] < \infty$. Let $\sigma^2_{US}(g)$ be the asymptotic variance of $\pi^{US}[g]$, and for any measure $\nu$ and function $f$ let $\operatorname{var}_\nu(f)$ be the variance of $f$ over $\nu$. Define the function $h = g^* - \pi[g]\,\mathbf{1}^*$, and let $C(\bar h_i)$ be the asymptotic variance of the trajectory average of $h$ over the biased process $X_t^i$. We have

(3.5) $$\sigma^2_{US}(g) \le 2 \sum_{i=1}^L \frac{1}{\kappa_i}\left\{ z_i^2 \Psi^2 C(\bar h_i) + \operatorname{tr}(R_i)\,\pi[|h|]^2 \sum_{\substack{j \neq i \\ F_{ij} > 0}} \frac{\operatorname{var}_{\pi_i}(\psi_j^*)}{P_i[t_j < t_i]^2} \right\},$$

where $R_i \in \mathbb{R}^{L \times L}$ with

$$R_{i,jk} := \frac{\sigma_{i,jk}}{\sqrt{\operatorname{var}_{\pi_i}(\psi_j^*)}\,\sqrt{\operatorname{var}_{\pi_i}(\psi_k^*)}}.$$

Proof. The result follows from Theorem 3.3, using the perturbation bounds which we derived in [47]. Details appear in Appendix D.

3.2. Dependence of the Asymptotic Variance on the Choice of Strata. In this section, we consider how the choice of strata influences the factors in the upper bound (3.5) on $\sigma^2_{US}(g)$. Roughly, the asymptotic variances $C(\bar h_i)$ and $\operatorname{tr}(R_i)$ characterize the sampling error, and for each $i$ the factor

(3.6) $$\sum_{\substack{j \neq i \\ F_{ij} > 0}} \frac{\operatorname{var}_{\pi_i}(\psi_j^*)}{P_i[t_j < t_i]^2}$$

measures the sensitivity of the EMUS estimator to sampling errors associated with $\pi_i$. We show in Section 3.2.1 that the factors $C(\bar h_i)$ and $\operatorname{tr}(R_i)$ characterizing the sampling error may be controlled by decreasing the diameters of the strata. We show in Section 3.2.2 that $P_i[t_j < t_i]$ may be controlled by ensuring sufficient overlap between neighboring strata. This last observation leads to a practical condition guiding the choice of strata; see Remark 3.10 and (5.8). Our theorems in this section apply to a specific class of strata and Markov processes that are broadly representative of those employed in practical applications.
Thus, the assumptions made here are much stronger than those made in proving the CLT for EMUS, for example. We discuss how our results might extend to more general implementations of stratified MCMC after the statement of Theorem 3.7 in Section 3.2.1 and in Remark 3.10.

3.2.1. Asymptotic Variances of MCMC Averages. Here, we consider the effect of the choice of strata on the asymptotic variances $C(\bar h_i)$ and $\operatorname{tr}(R_i)$. Because such a diverse variety of biased processes and distributions could in principle be used, it is futile in our opinion to try for a completely general result. Instead, motivated by the efficiency analysis undertaken in Section 4, we introduce a simple parametric family of bias functions, and for this family we state Assumption 3.6 relating the diameters of the strata with the asymptotic variances. In Theorem 3.7, we verify Assumption 3.6 for one representative class of biased processes. Finally, at the end of this section, we explain why we expect the assumption to hold for other choices of biased processes and distributions.

Consider the following representative class of bias functions: Given a family of sets $\{U_i : i = 1, \ldots, L\}$ with $\cup_{i=1}^L U_i = \Omega$, define

(3.7) $$\psi_i := \mathbf{1}_{U_i} \quad \text{and} \quad \pi_i(dx) := \frac{\mathbf{1}_{U_i}(x)\,\pi(dx)}{\pi[\mathbf{1}_{U_i}]} \quad \text{for } i = 1, \ldots, L,$$

where $\mathbf{1}_{U_i}$ denotes the characteristic function of $U_i$. Assume that the sets $U_i$ are chosen so that the irreducibility criterion of Lemma 2.1 holds. For example, suppose that $\Omega = [0,1]^d$ is the $d$-dimensional unit cube. One might choose $K \in \mathbb{N}$, set $h := 1/K$, and define

(3.8) $$U_i = (h[-1,1]^d + h i) \cap \Omega \quad \text{for } i \in \{0, 1, \ldots, K\}^d,$$

covering $\Omega$ uniformly by a grid of strata having diameters proportional to $h$. We use this uniform grid as a device when analyzing the effect of the stratum size on the efficiency of EMUS in Section 4.
However, while such a naïve choice may suffice for small $d$, it is not practical for large $d$. We discuss appropriate bias functions for high-dimensional problems later in this section and again in Section 5.1.

Since we wish to study grids like (3.8) as $h = 1/K$ varies, we state our assumption on asymptotic variances in terms of the following parametric family of strata: Let $x_0 \in \Omega$, and let $Z \subset \mathbb{R}^d$ be a bounded set containing 0. For each $h > 0$, define a stratum and a biased distribution by

(3.9) $$Z_h = x_0 + hZ \quad \text{and} \quad \pi_h(dx) = \frac{\mathbf{1}_{Z_h}(x)\,\pi(dx)}{\pi[\mathbf{1}_{Z_h}]}.$$

To make the connection with (3.8), one should imagine that $Z = [-1,1]^d$ and that $x_0$ is one of the grid points $h i$.

Assumption 3.6 characterizes the dependence of the asymptotic variance of MCMC averages over $\pi_h$ on the parameter $h$:

Assumption 3.6. Assume that $f : \Omega \to \mathbb{R}$ has finite variance $\operatorname{var}_{\pi_h}(f)$ over $\pi_h$, and define $\sigma_h^2(f)$ to be the asymptotic variance of an MCMC trajectory average approximating $\pi_h[f]$. Write

$$\pi(x) = \frac{\exp(-\beta V(x))}{\int \exp(-\beta V(y))\,dy}$$

for some potential $V : \Omega \to \mathbb{R}$ and inverse temperature $\beta > 0$. We assume

$$\frac{\sigma_h^2(f)}{\operatorname{var}_{\pi_h}(f)} \le C h^a \beta^b \exp\!\left(\beta\left[\max_{Z_h} V - \min_{Z_h} V\right]\right) \le C h^a \beta^b \exp\!\left(\beta\, h \operatorname{diam}(Z)\, \|\nabla V\|_\infty\right)$$

for some $C, a, b \ge 0$ independent of $h$, $Z$, and $f$.

To motivate Assumption 3.6, we prove that a special case holds for a representative class of processes sampling the biased distributions, cf. Theorem 3.7. Assume that the potential $V$ appearing in the assumption is continuously differentiable. Let $Z \subset \mathbb{R}^d$ be either a convex polyhedron or a set with $C^3$ boundary.² Now let $X_t^h$ be the overdamped Langevin process with reflecting boundary conditions on $Z_h$.
This process is defined by the Fokker–Planck equation

(3.10) $$\begin{aligned} \frac{\partial u}{\partial t}(x,t) &= \operatorname{div}\!\left(\beta^{-1}\nabla u(x,t) + u(x,t)\nabla V(x)\right) && \text{for } x \in Z_h,\ t > 0, \\ \left(\beta^{-1}\nabla u(x,t) + u(x,t)\nabla V(x)\right)\cdot n(x) &= 0 && \text{for } x \in \partial Z_h,\ t \ge 0, \\ u(x,0) &= p(x) && \text{for } x \in Z_h, \end{aligned}$$

where $\beta$ and $V$ are the inverse temperature and potential defined in Assumption 3.6 and $n(x)$ denotes the inward unit normal to $\partial Z_h$ at $x$. That is, $X_t^h$ is the unique Markov process so that if $X_0^h$ has density $p(x)$, then $X_t^h$ has density $u(x,t)$. The existence of the reflected process is established in [53, 2] when $Z$ is a convex polyhedron and in [13, Chapter 8] when $Z$ has $C^3$ boundary. A simple introduction to the reflected process and its properties appears in [36, Chapter 4]. We show in Theorem 3.7 that $X_t^h$ is ergodic for $\pi_h$, at least when $Z$ is bounded. The reflected process $X_t^h$ shares many features with the processes used in practical stratified MCMC methods. In particular, it is closely related to the (unreflected) overdamped Langevin process $Y_t$ [43, 40]. We now verify Assumption 3.6 for the reflected process:

Theorem 3.7. Assume that $f : \Omega \to \mathbb{R}$ has finite variance $\operatorname{var}_{\pi_h}(f)$ over $\pi_h$. Let $Z$ either have $C^3$ boundary or be convex. Assume that $V$ is continuously differentiable. Suppose that $X_t^h$ is stationary; that is, $X_0^h$ has distribution $\pi_h$. Let $\bar f_h := \frac{1}{T}\int_0^T f(X_t^h)\,dt$ be the continuous time trajectory average of $f$. We have

$$\sqrt{T}\,(\bar f_h - \pi_h[f]) \xrightarrow{d} \mathcal{N}(0, \sigma_h^2(f)),$$

where

(3.11) $$\sigma_h^2(f) \le \Lambda\, h^2 \beta \exp\!\left(\beta\left[\max_{Z_h} V - \min_{Z_h} V\right]\right) \operatorname{var}_{\pi_h}(f).$$

The constant $\Lambda$ depends only on $Z$, not on $h$, $\beta$, $V$, or $f$.

Proof. See Appendix E.

² The boundary of a set is $C^3$ if in a neighborhood of each point on the boundary, the boundary is the graph of a three times continuously differentiable function.
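As a one-dimensional illustration of the process behind Theorem 3.7, the sketch below runs an Euler–Maruyama discretization of the overdamped Langevin dynamics and mirrors any excursion back across the boundary of $Z_h = [a, b]$. Two caveats: the theorem itself concerns the continuous-time process (3.10), not this discretization, and the function names are ours.

```python
import numpy as np

def reflected_langevin_avg(grad_V, beta, a, b, f, dt, n_steps, rng, x0=None):
    """Trajectory average of f over a discretized overdamped Langevin process
    with mirror reflection at the endpoints of Z_h = [a, b] (a sketch; the
    paper's analysis concerns the continuous-time process)."""
    x = 0.5 * (a + b) if x0 is None else x0
    noise_scale = np.sqrt(2.0 * dt / beta)   # diffusion coefficient 1/beta
    total = 0.0
    for _ in range(n_steps):
        x += -grad_V(x) * dt + noise_scale * rng.standard_normal()
        while x < a or x > b:                # reflect back into [a, b]
            if x < a:
                x = 2.0 * a - x
            if x > b:
                x = 2.0 * b - x
        total += f(x)
    return total / n_steps

# Toy check: V(x) = x^2/2 with beta = 1 on the stratum [0, 1]. Then pi_h is
# a standard normal truncated to [0, 1], whose mean is about 0.46.
rng = np.random.default_rng(3)
m = reflected_langevin_avg(grad_V=lambda x: x, beta=1.0, a=0.0, b=1.0,
                           f=lambda x: x, dt=1e-3, n_steps=200_000, rng=rng)
```

For small $dt$ the trajectory average should approach $\pi_h[f]$ up to sampling and discretization error, consistent with the ergodicity asserted in Theorem 3.7.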
There are three major differences between the choice of strata and sampling scheme specified in this subsection and those typical of practical applications: First, in molecular simulations, one typically chooses Gaussian bias functions instead of piecewise constant ones. Second, practical methods must be discrete in time; e.g., one might use a discretization of the continuous time process $X_t^h$. Third, for high-dimensional problems, one typically stratifies only a certain low-dimensional reaction coordinate or collective variable.

In the first case, for Gaussian bias functions, a version of Theorem 3.7 holds with minor adjustments; we omit the exact statement and proof for simplicity. In the second case, for discretizations of Langevin dynamics, the asymptotic variances of trajectory averages are closely related to the corresponding averages for the continuous time dynamics: In fact, under some conditions on the potential $V$,

(3.12)
$$\lim_{\Delta t \to 0} \Delta t\, C^{\Delta t}(f) = C(f),$$

where $C^{\Delta t}(f)$ is the asymptotic variance of the trajectory average of $f$ for the discretization with time step $\Delta t$ and $C(f)$ is the asymptotic variance for the continuous time process [31, Section 3.2]. For other discrete time processes, we expect Assumption 3.6 to hold with different exponents $a$ and $b$. For example, the affine invariance property of the affine invariant ensemble sampler [20] suggests $a = 0$.

The third case is subtle. When $d$ is large, one typically stratifies only in a function $\theta : \Omega \subset \mathbb{R}^d \to \mathbb{R}^\ell$ with $\ell$ much smaller than $d$. To be precise, one might choose a uniform grid of nonnegative functions $\eta_i : \mathbb{R}^\ell \to \mathbb{R}$ defined as in (3.8), but with supports covering $\theta(\Omega) \subset \mathbb{R}^\ell$ instead of $\Omega \subset \mathbb{R}^d$. One would then define the bias functions

(3.13)
$$\psi_i(x) := \eta_i(\theta(x)).$$

(We make a similar choice in our calculations in Section 5; cf. the natural stratification (5.1).)
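As a minimal illustration of (3.13) with $\ell = 1$, the sketch below builds bias functions $\psi_i = \eta_i \circ \theta$ from a uniform grid of piecewise linear "hat" functions, which form a partition of unity on the covered interval. The choice $\theta(x) = x_1$ and all names here are hypothetical, standing in for a problem-specific collective variable:

```python
import numpy as np

def make_bias_functions(theta, centers, h):
    """Bias functions psi_i(x) = eta_i(theta(x)), with eta_i the hat function
    eta_i(t) = max(0, 1 - |t - c_i|/h) on a uniform grid of centers c_i with
    spacing h.  Adjacent hats sum to one between consecutive centers."""
    def psi(x):
        t = theta(x)                                   # collective variable values
        return np.maximum(0.0, 1.0 - np.abs(t - centers[:, None]) / h)
    return psi

# Illustrative example: stratify a 3-dimensional state in theta(x) = x_1.
theta = lambda x: np.atleast_2d(x)[:, 0]
centers = np.linspace(-1.0, 1.0, 11)                   # spacing h = 0.2
psi = make_bias_functions(theta, centers, h=0.2)
```

Evaluating `psi` at a batch of states returns one row per stratum; summing the rows recovers the partition-of-unity property wherever $\theta(x)$ lies in the covered range.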
For a clever choice of $\theta$, these biased distributions may be much easier to sample than the target distribution. For example, suppose that the marginal $\pi_\theta$ of $\pi$ in $\theta$ were multimodal, but that the conditional distributions $\pi(\,\cdot \mid \theta = \theta_0)$ were unimodal or otherwise easy to sample for each fixed $\theta_0$. In that case, for $h$ sufficiently small, each biased distribution would be unimodal, hence easy to sample. (Recall that $h$ sets the diameters of the strata for the grid of bias functions defined in (3.8), so $h$ small means that the diameter of the support of $\eta_i$ is small.) In free energy calculations, it is often possible to choose such a $\theta$ based on intuition or scientific principles; see [48, 30] for discussion. When computing tails or marginals, the problem itself typically suggests a particular $\theta$; cf. the natural stratification in Section 5.1.

The reader will notice that bias functions of the form (3.13) will typically have infinite support, rendering the bound in Assumption 3.6 useless. In this case, one might hope for a similar bound with the potential function $V$ replaced by the free energy
$$F(\theta) := -\beta^{-1}\log(\pi_\theta(\theta)),$$
where $\pi_\theta$ is the marginal density of $\pi$ in $\theta$. Roughly, this replacement will be valid when, under the dynamics of the MCMC processes sampling $\pi$, the distribution of any variable (any function of the process) converges rapidly to its conditional distribution under $\pi$ given the current value of the $\theta$ variable. This will occur, for example, when the marginal in $\theta$ is multimodal or otherwise difficult to sample, but the conditional distributions are easy to sample. In general, an effective choice of $\theta$ will be one for which conditional equilibration given $\theta$ occurs much more rapidly than the overall time to convergence of the process. More on the effective dynamics of low-dimensional variables can be found in [37] or [29].

3.2.2.
Controlling the Probabilities $P_i[t_j < t_i]$. Here, we examine the effect of the choice of strata on the factors in display (3.6) that appear in our upper bound (3.5) on $\sigma^2_{US}(g)$. We begin with a lemma estimating $\mathrm{var}_{\pi_i}(\psi_j^*)/P_i[t_j < t_i]^2$ in terms of $F_{ij}$:

Lemma 3.8. We have
$$\frac{\mathrm{var}_{\pi_i}(\psi_j^*)}{P_i[t_j < t_i]^2} \le \frac{1}{F_{ij}}.$$

Proof. We have $P_i[t_j < t_i] \ge P[X_1 = j \mid X_0 = i] = F_{ij}$, where $X_t$ denotes the Markov chain with transition matrix $F$. Therefore, since $\psi_j^*(x) \in [0,1]$,
$$\frac{\mathrm{var}_{\pi_i}(\psi_j^*)}{P_i[t_j < t_i]^2} \le \frac{\mathrm{var}_{\pi_i}(\psi_j^*)}{F_{ij}^2} \le \frac{\pi_i[(\psi_j^*)^2]}{F_{ij}^2} \le \frac{\pi_i[\psi_j^*]}{F_{ij}^2} = \frac{1}{F_{ij}}.$$

We now estimate the size of $F_{ij}$ for piecewise constant bias functions such as the uniform grid (3.8):

Lemma 3.9. Assume as in (3.7) that the bias functions are piecewise constant, and write $\pi(x) \propto \exp(-\beta V(x))$. We have
$$F_{ij} \ge \frac{|U_i \cap U_j|}{|U_i|}\,\Bigl\|\sum_{k=1}^L \mathbf{1}_{U_k}\Bigr\|_\infty^{-1} \exp\Bigl(\beta\bigl(\min_{U_i} V - \max_{U_i} V\bigr)\Bigr).$$
In particular, for the uniform grid of strata (3.8), we have

(3.14)
$$F_{ij} \ge \frac{1}{4^d} \exp\Bigl(\beta\bigl(\min_{U_i} V - \max_{U_i} V\bigr)\Bigr) \ge \frac{1}{4^d} \exp\bigl(-2\beta h \sqrt{d}\,\|\nabla V\|_\infty\bigr)$$

for any $i, j \in \mathbb{Z}^d$ so that $F_{ij} > 0$.

Proof. We have
$$F_{ij} = \pi_i[\psi_j^*] = \pi_i\Bigl[\frac{\mathbf{1}_{U_j}}{\sum_{k=1}^L \mathbf{1}_{U_k}}\Bigr] \ge \frac{\pi[U_i \cap U_j]}{\pi[U_i]}\,\Bigl\|\sum_{k=1}^L \mathbf{1}_{U_k}\Bigr\|_\infty^{-1} \ge \frac{|U_i \cap U_j|}{|U_i|}\,\Bigl\|\sum_{k=1}^L \mathbf{1}_{U_k}\Bigr\|_\infty^{-1} \exp\Bigl(\beta\bigl(\min_{U_i} V - \max_{U_i} V\bigr)\Bigr),$$
which proves the first claim made in the statement of the lemma. Now, for the uniform grid of strata (3.8), the minimum nonzero value of $|U_i \cap U_j|/|U_i|$ is $1/2^d$, attained when $j = (1, 1, \dots, 1) + i$. Moreover, except for a set of measure zero, each $x \in \mathbb{R}^d$ lies within $2^d$ strata, so $\bigl\|\sum_{k=1}^L \mathbf{1}_{U_k}\bigr\|_\infty = 2^d$. Finally, we have
$$\max_{U_i} V - \min_{U_i} V \le \mathrm{diam}(U_i)\,\|\nabla V\|_\infty = 2\sqrt{d}\,h\,\|\nabla V\|_\infty,$$
and the result follows.

Remark 3.10 (A Condition to Guide the Choice of Strata).
Lemmas 3.8 and 3.9 suggest a practical constraint on the choice of strata: To ensure that the calculation of the weights is not too sensitive to sampling errors, it will suffice to choose strata so that the nonzero entries of $F$ are not too small. We let this condition guide the choice of strata in Section 5; cf. (5.8). However, the condition is only sufficient, not necessary. For example, consider a uniform grid of Gaussian bias functions similar to (3.8), but with Gaussian densities having mean $\mu = hi$ and variance $\sigma^2 = h^2$ replacing the characteristic functions $\mathbf{1}_{U_i}$. In that case, even though $F$ will be dense and may have some extremely small nonzero entries, one can still control (3.6) by decreasing $h$, under some conditions on $\pi$. We omit the exact statement and proof for simplicity.

Despite the exponential dependence on $d$ in (3.14), EMUS and other stratified MCMC methods are advantageous for high-dimensional problems because it often suffices to stratify only a low-dimensional collective variable. In such cases, the dimension of the grid of strata is much smaller than the dimension of the state space $\Omega$; see our discussion of collective variables in Section 3.2.1 and our computations in Section 5. It is important to keep this in mind when reading our results below. Also, one may define a uniform grid of strata so that (3.6) increases only as $d^2$ with dimension, not exponentially. We construct such a grid in Appendix C.

4. Limiting Results as a Rationale for EMUS. In this section, we analyze the efficiency of EMUS in two limits: First, we consider a low temperature limit, where we write $\pi(x) \propto \exp(-\beta V(x))$ and let the inverse temperature $\beta$ increase, concentrating the target distribution at its modes and intensifying the effects of multimodality on the efficiency of MCMC sampling. Second, we consider the estimation of increasingly small tail probabilities.
Our goal in each case is to elucidate the advantages and disadvantages of EMUS for a broad class of problems, providing a rationale for the use of the method. We hope that others will use the tools of Section 3 in similar fashion to develop their own novel applications of EMUS.

4.1. Limit of Low Temperature. Let the target distribution take the form
$$\pi_\beta(x) = \frac{\exp(-\beta V(x))}{\int \exp(-\beta V(y))\,dy}$$
for some potential $V$ and inverse temperature $\beta > 0$, as in Section 3.2. In this section, we analyze the efficiency of EMUS in the low temperature limit as $\beta$ tends to infinity with $V$ fixed. We observe that $\pi_\beta$ concentrates at its modes (the minima of $V$) in this limit. As a consequence, MCMC methods for sampling $\pi_\beta$ undergo transitions between modes only rarely, which makes direct MCMC sampling increasingly inefficient. To be precise, we show that the asymptotic variance of a trajectory average of the overdamped Langevin dynamics increases exponentially with $\beta$ in the worst case. On the other hand, we show that the asymptotic variance of the EMUS estimate of the same average increases only polynomially. Therefore, EMUS is dramatically more efficient than direct sampling in the low temperature limit.

We consider the low temperature limit because it provides a convenient sequence of increasingly difficult to sample multimodal distributions: By analyzing EMUS in the low temperature limit, we hope to clarify its advantages for multimodal problems in general. We have no other interest in low temperature.

We now examine the overdamped Langevin dynamics

(4.1)
$$dX_t^\beta = -\nabla V(X_t^\beta)\,dt + \sqrt{2\beta^{-1}}\,dB_t$$

in the low temperature limit. (The overdamped Langevin dynamics is ergodic for $\pi_\beta$ under certain conditions on $V$; see [42] for example.)
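The rare-transition phenomenon is easy to see numerically. The sketch below runs an Euler–Maruyama discretization of (4.1) for the double-well potential $V(x) = (x^2 - 1)^2$ (our own illustrative choice of potential, step size, and trajectory length, not an example from the text) and counts transitions between the two wells at two temperatures:

```python
import numpy as np

def count_well_transitions(beta, n_steps=20000, dt=0.01, seed=0):
    """Euler-Maruyama for dX = -V'(X) dt + sqrt(2/beta) dB with V(x) = (x^2 - 1)^2,
    counting how often the trajectory moves from one well (near x = +1 or -1)
    past the edge of the opposite basin."""
    rng = np.random.default_rng(seed)
    x, well, count = 1.0, 1, 0
    for _ in range(n_steps):
        drift = -4.0 * x * (x * x - 1.0)       # -V'(x)
        x += drift * dt + np.sqrt(2.0 * dt / beta) * rng.standard_normal()
        if x * well < -0.5:                    # entered the opposite well
            well, count = -well, count + 1
    return count
```

At $\beta = 1$ the barrier is crossed many times, while already at $\beta = 8$ transitions all but vanish over the same trajectory length; this is the mechanism behind the exponential growth of the asymptotic variance quantified in the spectral estimates of this section.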
For typical potentials $V$, the generator $L := \beta^{-1}\Delta - \nabla V \cdot \nabla$ of (4.1) has a spectral gap that shrinks exponentially with $\beta$; that is, for some $c > 0$,

(4.2)
$$-\exp(-c\beta) \le \lambda_1 < 0,$$

where $\lambda_1$ is the greatest nonzero eigenvalue of $L$. We refer to [31, Section 2.5] for a review of results on the spectrum of $L$, and we refer to [21] for precise conditions on $V$ which guarantee (4.2). Now let $v_1$ be an eigenfunction corresponding to $\lambda_1$ normalized so that $\pi_\beta[v_1^2] = 1$. By formula (E.1), the asymptotic variance $\sigma_\beta^2(v_1)$ of the trajectory average of $v_1$ satisfies
$$\sigma_\beta^2(v_1) = -\pi_\beta[v_1 L^{-1} v_1] = -\lambda_1^{-1}\pi_\beta[v_1^2] = -\lambda_1^{-1} \ge \exp(c\beta),$$
indicating that the cost of estimating $\pi_\beta[v_1]$ by direct MCMC grows exponentially with $\beta$.

Having analyzed the overdamped Langevin dynamics, we now examine EMUS in the low temperature limit. For convenience, we assume that $\Omega$ is the unit cube $[0,1]^d \subset \mathbb{R}^d$ with periodic boundary conditions; to be more precise, we let $\Omega = \mathbb{R}^d/\mathbb{Z}^d$ be the set of all points in $\mathbb{R}^d$ with $x$ and $y$ identified if and only if $x - y \in \mathbb{Z}^d$. Periodic boundary conditions are typical of problems in chemistry and computational statistical mechanics. We do not see any difficulties in generalizing our results to other types of domains.

As $\beta$ increases, we must make the supports of the bias functions smaller. We accomplish this by adjusting the parameter $h$ in a uniform grid of bias functions similar to those defined in (3.8). To be precise, we fix $K \in \mathbb{N}$, set $h := 1/K$, and define

(4.3)
$$\psi_i(x) := \frac{1}{2^d}\,\mathbf{1}_{[-1,1]^d}\bigl(K(x - hi)\bigr) \quad \text{for } i \in \{0, 1, \dots, K-1\}^d.$$

This family of $K^d$ bias functions is a partition of unity over $\Omega$, and the support of the $i$'th bias function is $U_i := h[-1,1]^d + hi$.
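The grid (4.3) can be checked directly in code: with the periodic distance on $\Omega = \mathbb{R}^d/\mathbb{Z}^d$ made explicit, almost every point is covered by exactly $2^d$ of the supports $U_i$, each contributing $2^{-d}$, so the functions sum to one. The dictionary representation and parameter values below are our own illustrative choices:

```python
import numpy as np
from itertools import product

def psi_grid(x, K):
    """Evaluate the K^d bias functions of (4.3) on the torus R^d / Z^d at x.
    Returns a dict mapping each multi-index i to psi_i(x)."""
    d = len(x)
    h = 1.0 / K
    out = {}
    for i in product(range(K), repeat=d):
        # periodic distance from x to the grid point h*i, per coordinate
        dist = np.abs(np.mod(np.asarray(x) - h * np.asarray(i) + 0.5, 1.0) - 0.5)
        # psi_i(x) = 2^{-d} when K * (x - h*i) lies in [-1, 1]^d, else 0
        out[i] = 2.0 ** (-d) if np.all(K * dist < 1.0) else 0.0
    return out
```

For generic points on the torus, exactly $2^d$ entries of the returned dictionary are nonzero, and their sum is one.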
For convenience, we treat the index $i$ as an element of $\mathbb{Z}^d/K\mathbb{Z}^d$; that is, we let $i$ be periodic with period $K$ in each of its components, identifying $(0, i_2, \dots, i_d)$ with $(K, i_2, \dots, i_d)$, for example. Figure 1 illustrates such a family of bias functions, and it demonstrates the appropriate relationship between $\beta$ and $h$.

Figure 1: Bias functions and target distributions in the low temperature limit. In the upper two plots ($\beta = 1$, $h = 1$ and $\beta = 3$, $h = 1/3$), the black curves are the densities of the target distributions for two different values of $\beta$. Observe that $\pi$ concentrates at the minima of $V$ as $\beta$ increases. The red bands each lie above a single stratum chosen from a family of strata for which $h \propto \beta^{-1}$. In the lower two plots, the blue curve is $\beta V(x)$ and the $x$-axis covers the bottom of the red band in the plot immediately above. Observe that the range of $\beta V(x)$ over the red band is the same for each of the two values of $\beta$.

By Theorem 3.7 and the ensuing discussion in Section 3.2, this implies that the cost of sampling a single biased distribution increases at most polynomially with $\beta$ when $h \propto \beta^{-1}$. We now show that the asymptotic variance of EMUS increases at most polynomially with $\beta$ when $K$ is chosen appropriately. In light of the above discussion, this means that EMUS may be dramatically more efficient than direct sampling for multimodal problems. We note that despite the exponential dependence on $d$ in (4.4) below, EMUS and other stratified MCMC methods are often advantageous for high-dimensional multimodal problems; see our discussion of low-dimensional collective variables in Section 3.2 and also our computations in Section 5.

Theorem 4.1.
For any bounded continuous function $g$, let $\sigma^2_{\beta,US}(g)$ denote the asymptotic variance of $\pi_{\beta,US}[g]$. Let the bias functions be defined by (4.3) with $K$ equal to the least integer greater than $\beta$; that is, $K = \lceil \beta \rceil$. Take $\kappa_i = 1/K^d$. Let Assumption 3.6 hold. We have

(4.4)
$$\frac{\sigma^2_{\beta,US}(g)}{\mathrm{var}_{\pi_\beta}(g)} \le C(1+\beta)^{qd}$$

for constants $C, q > 0$ independent of $g$ and $\beta$, but depending on $V$ and the constants in Assumption 3.6.

Proof. The proof is a straightforward application of the theory developed in Section 3; we present the details in Appendix F.

Our proof of Theorem 4.1 relies on the perturbation bounds which we derived in [47]. These bounds allow one to estimate the sensitivity of $w(F)$ to small perturbations of $F$. Most perturbation bounds in the literature predict that $w(F)$ is highly sensitive when the spectral gap of $F$ is small, but ours show that this is not always the case. (The spectral gap is $1 - |\lambda_2|$, where $\lambda_2$ is the eigenvalue of $F$ with second largest absolute value.) In the low temperature limit, the spectral gap of $F$ decreases exponentially with $\beta$; see [47] for a simple example of this phenomenon. Nonetheless, using our bounds, we show that the cost to compute averages by EMUS increases only polynomially in $\beta$.

4.2. Limit of Small Probability. In this section, we assess the performance of EMUS for computing tail probabilities. To be precise, we let $\Omega = [0, \infty)$, and we consider estimation of probabilities of the form
$$p_M := \pi([M, \infty)).$$
We show that for a broad class of distributions $\pi$, the cost of computing $p_M$ with fixed relative precision by direct MCMC increases exponentially with $M$, whereas the cost by EMUS increases only polynomially. Thus, EMUS is dramatically more efficient than direct sampling for computing the probabilities of tail events. In Assumption 4.2 below, we state the conditions which we will impose on $\pi$ in our analysis.
These conditions specify a simple class of problems for which strong conclusions may be drawn. Similar results hold more generally. For example, in Section 5, we report the results of a computational experiment demonstrating the advantages of EMUS for computing tails of a marginal density.

Assumption 4.2. Write $\pi(x) = \exp(-V(x))$ for some potential function $V : [0, \infty) \to \mathbb{R}$. Assume that for some $M_0 \ge 0$:
1. Whenever $x \ge M_0$,
(4.5)    $0 \le V''(x)$ and $0 < V'(x)$.
2. For some $\alpha \in (0, 1)$ and $c > 0$, whenever $x \ge M_0$,
(4.6)    $\alpha V'(x)^2 - V''(x) \ge c > 0$.

For example, we might have $\pi(x) \propto \exp(-|x|^r)$ for any $r \ge 1$. Condition (4.6) in Assumption 4.2 implies geometric ergodicity of the overdamped Langevin dynamics with potential $V$ [41]. We rely on this fact to motivate Assumption 4.3 concerning the convergence of MCMC processes sampling biased distributions with unbounded support. Interestingly, we use the same condition to prove lower bounds on some of the entries of the overlap matrix; cf. Lemma F.1.

Condition (4.5) in Assumption 4.2 implies $p_M \le D\exp(-\gamma M)$ whenever $M \ge M_0$ for some $D, \gamma > 0$. Therefore, the relative variance $\rho_M^2$ of $\mathbf{1}_{[M,\infty)}$ over $\pi$ satisfies
$$\rho_M^2 = \frac{p_M - p_M^2}{p_M^2} \ge D^{-1}\exp(\gamma M) - 1.$$
We conclude that estimating $p_M$ with fixed relative accuracy by a direct MCMC method (or even Monte Carlo with independent samples) requires a number of samples increasing exponentially with $M$. By contrast, we show that for an appropriate choice of bias functions, the cost to estimate $p_M$ by EMUS increases only polynomially in $M$.

For each $M > 0$ and $K \in \mathbb{N}$, let $h := M/K$, and define the family of $K + 2$ bias functions

(4.7)
$$\psi_i(x) := \begin{cases} \tfrac12\,\mathbf{1}_{[0,h]}(x) & \text{for } i = 0, \\ \tfrac12\,\mathbf{1}_{[(i-1)h,\,(i+1)h]}(x) & \text{for } i = 1, \dots, K-1, \\ \tfrac12\,\mathbf{1}_{[M-h,\,\infty)}(x) & \text{for } i = K, \text{ and} \\ \tfrac12\,\mathbf{1}_{[M,\,\infty)}(x) & \text{for } i = K+1. \end{cases}$$
Figure 2: The bias functions $\{\psi_i : i = 0, \dots, K+1\}$ defined in (4.7) and a potential function $V$ satisfying Assumption 4.2. Observe that the bias functions $\psi_K$ and $\psi_{K+1}$ have unbounded support.

As in Section 4.1, let $U_i$ denote the support of $\psi_i$. This family of bias functions is a partition of unity on $[0, \infty)$; see Figure 2.

We now address the cost of estimating $p_M$ by EMUS. First, we observe that Assumption 3.6 on the asymptotic variances of MCMC averages does not cover the sampling of $\pi_K$ and $\pi_{K+1}$, since the supports of these distributions are unbounded. Thus, we require the following assumption.

Assumption 4.3. Let $f : [0, \infty) \to \mathbb{R}$, and define $\sigma_i^2(f)$ to be the asymptotic variance of an MCMC trajectory average approximating $\pi_i[f]$ for $i = K, K+1$. We assume
$$\frac{\sigma_i^2(f)}{\mathrm{var}_{\pi_i}(f)} \le D$$
for some $D$ independent of $M$ and $f$.

In fact, since Assumption 4.2 implies that the overdamped Langevin dynamics is ergodic for $\pi(x) = \exp(-V(x))$ on the unbounded domain $\Omega = [0, \infty)$, we fully expect (but do not prove here) that under Assumption 4.2, Assumption 4.3 holds for overdamped Langevin dynamics constrained (by reflection as in (3.10)) to remain in the support of $\pi_K$ or $\pi_{K+1}$. Alternatively, the reader may simply assume that we draw i.i.d. samples from the biased distributions. All our results hold in that case.

We show in Theorem 4.4 that the relative asymptotic variance of the EMUS estimate of $p_M$ grows only polynomially with $M$ for a broad class of target distributions $\pi$. Therefore, EMUS may be dramatically more efficient than direct MCMC sampling when the goal is to compute tail probabilities.
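Under the i.i.d.-sampling simplification just mentioned, the whole pipeline for $p_M$ can be sketched in a few lines. We take $\pi = \mathrm{Exp}(1)$, i.e. $V(x) = x$, so that $p_M = e^{-M}$ is known exactly; each biased distribution of (4.7) is then a truncated exponential, sampled here by inverse CDF. The overlap matrix $\bar F$ is estimated from the samples, weights $z$ are obtained from the eigenproblem $z\bar F = z$ (our phrasing of the eigenvector step; the per-stratum averages are then recombined as $\hat p_M = \sum_i z_i\,\overline{\mathbf{1}_{[M,\infty)}}_i$), and all parameter values are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
M, K = 10.0, 20
h = M / K
L = K + 2

# Supports U_i of the bias functions (4.7); the last two are unbounded.
supports = [(0.0, h)] + [((i - 1) * h, (i + 1) * h) for i in range(1, K)]
supports += [(M - h, np.inf), (M, np.inf)]

def psi(j, x):
    a, b = supports[j]
    return 0.5 * ((x >= a) & (x <= b))

def sample_truncated_exp(a, b, n):
    """i.i.d. draws from Exp(1) conditioned on [a, b], by inverse CDF."""
    u = rng.uniform(size=n)
    if np.isinf(b):
        return a - np.log1p(-u)
    return -np.log(np.exp(-a) - u * (np.exp(-a) - np.exp(-b)))

n = 2000
Fbar = np.zeros((L, L))   # overlap matrix estimate: Fbar[i, j] = <psi_j> over pi_i
gbar = np.zeros(L)        # per-stratum averages of the indicator of [M, infinity)
for i, (a, b) in enumerate(supports):
    xs = sample_truncated_exp(a, b, n)
    for j in range(L):
        Fbar[i, j] = psi(j, xs).mean()
    gbar[i] = np.mean(xs >= M)

# EMUS weights: left eigenvector z = z Fbar, normalized to sum to one.
eigvals, eigvecs = np.linalg.eig(Fbar.T)
z = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
z = z / z.sum()
p_emus = float(z @ gbar)
```

With these (illustrative) choices the estimate should agree with $e^{-10} \approx 4.5 \times 10^{-5}$ up to statistical error, even though a direct sample of $\pi$ would essentially never land in the tail.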
We observe that while the hypotheses of the theorem are somewhat restrictive, similar results hold more generally; for example, see Section 5, where we compute tails of a marginal density.

Theorem 4.4. Let Assumptions 3.6, 4.2, and 4.3 hold. Set $K = \lceil M \max_{x \le M} |V'(x)| \rceil$. Define a family of $K + 2$ bias functions $\psi_i$ by (4.7). Take $\kappa_i = 1/(K+2)$. Let $\sigma^2_{M,US}$ denote the asymptotic variance of the EMUS estimate of $p_M$. We have
$$\frac{\sigma^2_{M,US}}{p_M^2} \le C K^2$$
for some constant $C > 0$ depending on $V$ but not on $M$. For example, suppose that $V(x) = \tilde V(x) + x^r$, where $\tilde V$ has bounded support and $r \ge 1$. Then $|V'(x)| \le C(1 + M^{r-1})$ for $x \le M$, and so
$$\frac{\sigma^2_{M,US}}{p_M^2} \le C M^2 (1 + M^{r-1})^2.$$

Proof. The proof is similar to that of the low temperature limit, Theorem 4.1, but with complications arising because not all strata are bounded and because here we consider the relative variance instead of the variance; see Appendix F. In particular, we require Assumption 4.2 to show that one can in fact choose $h$ so that all nonzero entries of $F$ are bounded away from zero uniformly as $M$ increases; cf. Lemma F.1. This is the only part of the proof relying on Assumption 4.2.

5. EMUS for tails: An example from Bayesian inference. We demonstrate the use of EMUS for efficiently exploring and visualizing distributions. In particular, we show how EMUS may be used to efficiently compute both marginal densities and tail probabilities of the form $P[\eta(Z) \ge \epsilon^{-1}]$, where $\eta(Z)$ is a real valued function of a high-dimensional random variable $Z$. For both tails and marginals, there is a natural and easy to implement choice of strata, which we describe in Section 5.1.

In Section 5.3, we calculate two different one-dimensional marginals of the posterior distribution of the hierarchical Bayesian mixture model described in Section 5.2. For one marginal, the natural stratification suffices.
For the other, it does not, but a preliminary computation made with the natural stratification suggests a better choice of strata. We use this example to explain how to diagnose and correct problems related to poorly chosen strata: Our results will serve to guide the practice of stratified MCMC.

5.1. The natural stratification for tails and marginals. Here, we briefly explain how EMUS can be used to estimate tail probabilities and low-dimensional marginals of high-dimensional distributions. Let $\Omega \subset \mathbb{R}^d$; let $\pi$ be a probability distribution on $\Omega$; and let $\eta : \Omega \to \mathbb{R}$. Suppose that one wishes to estimate the very small tail probability $P[\eta(Z) \ge \epsilon^{-1}]$. In this case, it is natural to stratify in $\eta$ only. That is, one may choose a partition of unity $\{\phi_i\}_{i=1}^L$ on $\mathbb{R}$ and define bias functions

(5.1)
$$\psi_i(x) = \phi_i(\eta(x)) \quad \text{for } i = 1, \dots, L$$

depending only on $\eta$. For a partition of unity, one might choose the regular grid of piecewise constant functions defined in Section 4.2. We refer to (5.1) as the natural stratification. To compute the tail probability, one uses EMUS to estimate $\pi(\mathbf{1}_{[\epsilon^{-1},\infty)} \circ \eta)$.

Computing marginal densities is similar; in fact, computing tails may be understood as a special case of computing a marginal density. Suppose now that $\eta : \Omega \to \mathbb{R}^\ell$. To estimate the marginal $\pi_\eta$ of $\pi$ in $\eta$, one chooses a partition of unity $\{\phi_i\}_{i=1}^L$ on $\mathbb{R}^\ell$, again defining bias functions by (5.1). One then uses EMUS to compute averages of histogram bins, which are functions of the form

(5.2)
$$b_{\eta_0}(\eta(x)) = \mathbf{1}_{\eta_0 + h[-1,1]^\ell}(\eta(x)).$$

We have
$$\lim_{h \to 0} \frac{1}{(2h)^\ell}\,\pi[b_{\eta_0}] = \pi_\eta(\eta_0),$$
so for small $h$ the averages of the histogram bins approximate $\pi_\eta$.
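The limit above is easy to verify in a case where both sides are available in closed form. Here we take $\pi = \mathrm{N}(0,1)$ and $\eta(x) = x$ with $\ell = 1$ (purely an illustrative choice of ours), computing $\pi[b_{\eta_0}]$ exactly from the normal CDF rather than by sampling:

```python
import math

def bin_average_density(eta0, h):
    """(2h)^{-1} * pi[b_eta0] for pi = N(0, 1) and eta(x) = x, via the exact CDF."""
    Phi = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    return (Phi(eta0 + h) - Phi(eta0 - h)) / (2.0 * h)

def normal_density(t):
    """The marginal density pi_eta(t) in this example: the standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
```

As $h$ shrinks, `bin_average_density(eta0, h)` converges to `normal_density(eta0)`, in line with the displayed limit.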
By the argument in Section 4.2, EMUS with the natural stratification will be dramatically more efficient than direct sampling as long as the biased distributions are no harder to sample than the target distribution $\pi$. Essentially, this is because, with the natural stratification, very small averages like $P[\eta(Z) \ge \epsilon^{-1}]$ over the target distribution $\pi$ are expressed as functions of much larger averages over the biased distributions $\pi_i$. Unfortunately, however, for general functions $\eta$, the biased distributions of the natural stratification need not be easy to sample. In Section 5.3, we give one example where the natural stratification works and one where it does not. In the case where it does not, we explain how to make a better choice of strata.

5.2. A hierarchical Bayesian mixture model. Here, we review the hierarchical Bayesian mixture model proposed in [39], and we discuss the difficulties which complicate inference under this model. As a tutorial in the use of EMUS, we present a numerical investigation of these difficulties in Section 5.3.

In the hierarchical mixture model, the data vector $y = (y_1, \dots, y_n) \in \mathbb{R}^n$ consists of independent identically distributed samples drawn from a mixture distribution of the form
$$p(y_i \mid \phi) = \sum_{k=1}^K q_k\,\nu(y_i;\, \mu_k, \lambda_k^{-1}),$$
where $K$ is the number of mixture components, $q_k$ is the weight of the $k$'th mixture component, $\nu(\,\cdot\,;\, \mu_k, \lambda_k^{-1})$ is the normal density with mean $\mu_k$ and variance $\lambda_k^{-1}$, and $\phi$ is the vector of parameters
$$\phi = (\mu_1, \dots, \mu_K,\, \lambda_1, \dots, \lambda_K,\, q_1, \dots, q_{K-1}).$$
(Since $p(y_i \mid \phi)$ is a probability distribution, $q_1 + \cdots + q_K = 1$, and $q_1, \dots, q_{K-1}$ determine $q_K$.) The following prior distribution is imposed on $\phi$:
$$\mu_k \sim \mathrm{N}(m, \kappa^{-1}), \quad \lambda_k \sim \mathrm{Gamma}(\alpha, \beta), \quad \beta \sim \mathrm{Gamma}(g, h), \quad (q_1, \dots, q_{K-1}) \sim \mathrm{Dirichlet}_K(1, \dots, 1).$$
As in [23, 10], we choose $m = M$, $\kappa = 4/R^2$, $\alpha = 2$, $g = 0.2$, and $h = 100g/(\alpha R^2)$, where $R$ and $M$ are the range and the mean of the observed data, respectively. The posterior density is
$$p(\theta \mid y) = \frac{\kappa^{K/2}\, h^g\, \beta^{K\alpha + g - 1}}{Z_K\, \Gamma(\alpha)^K\, \Gamma(g)\, (2\pi)^{\frac{n+K}{2}}} \left(\prod_{k=1}^K \lambda_k\right)^{\alpha - 1} \times \exp\left\{-\frac{\kappa}{2}\sum_{k=1}^K (\mu_k - M)^2 - \beta\left(h + \sum_{k=1}^K \lambda_k\right)\right\} \times \prod_{i=1}^n \left(\sum_{k=1}^K q_k\, \lambda_k^{\frac12} \exp\left(-\frac{\lambda_k}{2}(y_i - \mu_k)^2\right)\right),$$
where $\theta = (\phi, \beta)$ denotes the vector of all parameters to be inferred, including the hyperparameter $\beta$, and $Z_K$ is a normalizing constant.

Several factors complicate inference based on this model. First, the mixture components are not identifiable; that is, the posterior distribution is invariant under permutation of the labels of the mixture components. Consequences of non-identifiability are discussed at length in [23, 10]. In our computations in Section 5.3, we impose the constraint $\mu_1 \le \mu_2 \le \cdots \le \mu_K$ to ensure that the components are identifiable. Second, in Lemma 5.1, we show that the posterior density may be unbounded, introducing spurious modes with infinite density. Finally, even with identifiability constraints, the posterior distribution may have multiple modes of finite posterior density. For example, see the modes reported in [10]. In Section 5.3, we use EMUS to efficiently visualize the posterior, assessing the effects of multimodality and unboundedness.

We suspect that the unboundedness of the posterior for this model is well known. However, we are unable to find a reference, so we now explain. It is certainly well known that the likelihood of a Gaussian mixture model is unbounded: Roughly speaking, the likelihood is infinite when any mixture component is collapsed on a single data point [1]. Nonetheless, one might expect the posterior density $p(\theta \mid y)$ to be bounded, since the prior penalizes large values of the precisions $\lambda_i$.
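For concreteness, the posterior density above can be evaluated in log space, up to additive constants independent of $\theta$ (including $\log Z_K$). The function signature and the log-sum-exp stabilization below are our own choices, not part of the model specification:

```python
import numpy as np

def log_posterior(mu, lam, q, beta, y, m, kappa, alpha, g, h):
    """Log of p(theta | y) up to an additive constant independent of theta.
    mu, lam, q are length-K arrays; beta is the hyperparameter."""
    if np.any(lam <= 0) or np.any(q <= 0) or beta <= 0 or abs(q.sum() - 1.0) > 1e-12:
        return -np.inf
    K = len(mu)
    # log prior: mu_k ~ N(m, 1/kappa), lam_k ~ Gamma(alpha, beta), beta ~ Gamma(g, h)
    lp = -0.5 * kappa * np.sum((mu - m) ** 2)
    lp += (alpha - 1.0) * np.log(lam).sum() + K * alpha * np.log(beta) - beta * (h + lam.sum())
    lp += (g - 1.0) * np.log(beta)
    # log likelihood: sum_i log sum_k q_k sqrt(lam_k) exp(-lam_k (y_i - mu_k)^2 / 2),
    # computed with a log-sum-exp to avoid underflow
    a = (np.log(q) + 0.5 * np.log(lam))[None, :] \
        - 0.5 * lam[None, :] * (y[:, None] - mu[None, :]) ** 2
    amax = a.max(axis=1, keepdims=True)
    lp += np.sum(amax[:, 0] + np.log(np.exp(a - amax).sum(axis=1)))
    return lp
```

The permutation symmetry responsible for non-identifiability is immediate from the sums and products over $k$: permuting $(\mu_k, \lambda_k, q_k)$ jointly leaves the value unchanged.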
This is not always the case when the data vector contains repeated entries:

Lemma 5.1. If any datum $y_i$ has frequency $N_i$ greater than $2g + 2(K-1)\alpha$, then the posterior density $p(\theta \mid y)$ is unbounded.

Proof. Take the limit of $p(\theta \mid y)$ as $\lambda_1 \to \infty$ with $\mu_1 = y_i$, $\beta = \lambda_1^{-1}$, and all other variables held fixed.

The reader will observe that under the model, the set of data vectors with repeated entries has probability zero. However, in practice, the data consist of measurements with finite precision, and therefore repeated entries occur commonly; cf. the Hidalgo stamp data used in Section 5.3.

5.3. Numerical experiments: Choosing strata, computing tails, diagnosis of problems. In this section, we explain how to recognize and correct problems related to poor choices of strata, and we demonstrate the use of EMUS to investigate the multimodality and unboundedness of the posterior in the mixture model. We first compute two one-dimensional marginals of the high-dimensional posterior density $p(\theta \mid y)$ using the natural stratification (5.1). The natural stratification works in one case but not the other. In the case where the natural stratification does not work, preliminary calculations based on the natural stratification suggest a better choice of strata.

Here, we let $y$ be the Hidalgo stamp data set first studied in [22], consisting of the thicknesses of 485 stamps, ranging between 60 µm and 130 µm. We let there be three mixture components ($K = 3$), following previous computational studies [10, 23]. In our first calculation, we estimated the marginal in $\mu_2$ using the natural stratification with a grid of 201 bias functions covering the range $[7, 11]$, with the supports of the leftmost and rightmost bias functions reaching to $-\infty$ and $\infty$, respectively. For the middle strata, define $\phi_1 : \mathbb{R} \to \mathbb{R}$ by

(5.3)
$$\phi_1(x) := \max\{0, 1 - |x|\}.$$
We used the bias functions

(5.4)
$$\psi_i(\theta) = \phi_1\left(\frac{\mu_2 - (7 + (i-1)h)}{h}\right), \quad \text{where } h := 0.02, \text{ for } i = 2, \dots, 200.$$

Now, define $\phi_2 : \mathbb{R} \to \mathbb{R}$ by

(5.5)
$$\phi_2(x) := \min\{\max\{0, 1 - x\}, 1\}.$$

The first and last bias functions were

(5.6)    $\psi_1(\theta) = \phi_2\left(\dfrac{\mu_2 - 7}{h}\right)$,

(5.7)    $\psi_{201}(\theta) = \phi_2\left(\dfrac{(7 + 200h) - \mu_2}{h}\right)$,

where $h = 0.02$ as before.

We chose the total number of bias functions based on the sizes of the off-diagonal entries of the overlap matrix. For any bias functions of the form (5.4), the overlap matrix is tridiagonal. Thus, by Remark 3.10, if the superdiagonal and subdiagonal entries $F_{i,i+1}$ and $F_{i,i-1}$ are sufficiently large, then the EMUS estimator is not too sensitive to statistical errors in $\bar F$. For our choice of bias functions,

(5.8)
$$\min\{F_{i,i+1} : i = 1, \dots, 200\} \ge 0.01 \quad \text{and} \quad \min\{F_{i,i-1} : i = 2, \dots, 201\} \ge 0.004.$$

We sampled the biased distributions using the affine invariant ensemble sampler with 100 walkers, as implemented in the emcee package [15]. Due to computational restrictions on memory, only every tenth sample point was saved. As a check on the sampling, the average acceptance probability over all walkers in the ensemble sampler was calculated for each biased distribution. Averaging over biased distributions gave a total average acceptance probability of 0.31. The minimum acceptance probability over all distributions was 0.12.

To initialize sampling, we computed an unbiased test trajectory, that is, a trajectory having ergodic distribution $\pi$. We then started by sampling a single biased distribution $\pi_k$, initializing with points drawn randomly from the unbiased trajectory. We sampled the other biased distributions in sequence, initializing with points drawn randomly from samples of adjacent biased distributions. Thus, we sampled $\pi_k$ first, then $\pi_{k-1}$ and $\pi_{k+1}$, then $\pi_{k-2}$ and $\pi_{k+2}$, etc.
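The family (5.3)–(5.7) is easy to assemble and sanity-check in code (the helper below is our own phrasing of those definitions): the 201 functions sum to one for every value of $\mu_2$, and the supports of non-adjacent functions are essentially disjoint, which is why the overlap matrix is tridiagonal:

```python
import numpy as np

phi1 = lambda x: np.maximum(0.0, 1.0 - np.abs(x))             # (5.3)
phi2 = lambda x: np.minimum(np.maximum(0.0, 1.0 - x), 1.0)    # (5.5)
h, L = 0.02, 201

def psi(i, mu2):
    """psi_i of (5.4), (5.6), (5.7), viewed as a function of mu_2 alone."""
    if i == 1:
        return phi2((mu2 - 7.0) / h)                  # support (-inf, 7 + h]
    if i == L:
        return phi2(((7.0 + 200 * h) - mu2) / h)      # support [11 - h, inf)
    return phi1((mu2 - (7.0 + (i - 1) * h)) / h)      # hat centered at 7 + (i-1)h
```

Evaluating all 201 functions on a grid of $\mu_2$ values confirms the partition-of-unity property and the tridiagonal overlap structure.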
We equilibrated the sampler in each $\pi_i$ for 3000 Monte Carlo steps and collected data for an additional 100000 Monte Carlo steps. Each step of the ensemble sampler involves perturbing the positions of each of the 100 walkers. We computed the marginal in $\mu_2$ using a grid of 200 histogram bins covering the region $[7, 11]$; this corresponds to taking $h = 0.01$ in (5.2). The result is the curve labeled EMUS in Figure 3a. The marginal in $\mu_2$ has two modes, labeled 1 and 2 in Figure 3a. We plot the mixture distributions corresponding to these modes in Figure 4. (To be precise, the distributions in Figure 4 correspond to means over histogram bins centered at the labeled points.)

For comparison, we also estimated the marginal in $\mu_2$ from multiple long, unbiased trajectories. We computed 100 unbiased trajectories of the affine invariant ensemble sampler in parallel. For each trajectory, the ensembles were first equilibrated for 10000 Monte Carlo steps, and then data were collected for 100000 steps. These trajectories were combined and binned to produce the density labeled Unbiased in Figure 3a. We estimated the relative asymptotic variance of the marginal density for the unbiased calculation using ACOR [14], and we estimated the relative asymptotic variance for the EMUS calculation using the method outlined in Appendix G. We present the results in Figure 3b. Note that near the mode, unbiased MCMC performs slightly better than EMUS, but in the tails, EMUS performs dramatically better.

After computing the marginal in $\mu_2$, we tried computing the marginal in $\log_{10}\lambda_1$.
We used the natural stratification with a grid of 50 bias functions with maxima equally spaced between $-1$ and $3.2$, constructed as
\[
\psi_i(\theta) = \phi\!\left(\frac{-1 + h(i-1) - \log_{10}\lambda_1}{h}\right),
\]
where $h = \frac{3.2 - (-1)}{49}$.

Figure 3: Estimates of the logarithm of the marginal density in $\mu_2$ and the asymptotic variances of those estimates. Figure 3a displays estimates of the marginal in $\mu_2$ computed by EMUS and by an unbiased trajectory of the ensemble sampler. Figure 3b displays the asymptotic variances of these two estimates of the marginal density. We note that while the unbiased calculation has greater accuracy near the mode, the EMUS calculation has greater accuracy in the tails. The relative errors in this figure were estimated using the method described in Appendix G.

We used the same initialization scheme as for the marginal in $\mu_2$, beginning with a single biased distribution initialized from an unbiased test trajectory. We call this the center sample. The result of this calculation was the density labeled "1D Center" in Figure 5a. When we tried to compute the asymptotic variance of this density estimate, we noticed very slow convergence of the sampler for some biased distributions. To investigate, we performed another EMUS calculation using a similar initialization procedure, but starting from $\pi_1$, the biased distribution at the extreme left, covering the lowest values of $\lambda_1$. We call this the left sample. The result of this second calculation was the density labeled "1D Left" in Figure 5a. For both the center and left samples, the strata were equilibrated for 3000 steps and sampled for another 200000. We observe that the two densities differ significantly in the region $-1 \leq \log_{10}\lambda_1 \leq 0.5$. They should be the same up to sampling errors; for example, we observe that different initializations have no effect on the calculation of the marginal in $\mu_2$, cf. Figure 3a.

Figure 4: Gaussian mixtures corresponding to modes of the marginal in $\mu_2$. Mixtures 1 and 2 correspond to the labeled points in Figure 3a. To be precise, the blue curve in each plot is the mixture distribution corresponding to the mean of a histogram bin centered at the point labeled in Figure 3a. The green curves are the individual mixture components. The black bars are a histogram of the Hidalgo stamp data.

Figure 6 explains the problem and suggests a solution: In the region $0.2 \leq \log_{10}\lambda_1 \leq 0.7$, the center and left samples cover entirely different ranges of $\log_{10}\lambda_2$. This suggests that the biased distributions corresponding to the range $0.2 \leq \log_{10}\lambda_1 \leq 0.7$ are multimodal, with barriers in $\lambda_2$ impeding sampling.
To confirm the hypothesis that barriers in $\lambda_2$ were responsible for the poor convergence observed in the center and left samples, we performed a third calculation, stratifying in both $\log_{10}\lambda_1$ and $\log_{10}\lambda_2$. We used a $50 \times 50$ grid of bilinear bias functions, with maxima equally spaced between $-1$ and $3.2$. To be precise, for $i, j = 1, \dots, 50$, we defined the bias functions
\[
\psi_{ij}(\theta) = \phi\!\left(\frac{-1 + h(i-1) - \log_{10}\lambda_1}{h}\right) \times \phi\!\left(\frac{-1 + h(j-1) - \log_{10}\lambda_2}{h}\right),
\]
with $h$ as before. Let $\eta_{ij}$ denote the biased distribution corresponding to $\psi_{ij}$. We performed the two-dimensional EMUS calculation twice, initializing from the center and left samples drawn from the natural stratification in $\log_{10}\lambda_1$. For each $i = 1, \dots, L$, to sample the row $\{\eta_{ij} : j = 1, \dots, 50\}$ of biased distributions, we began by initializing sampling of a single biased distribution $\eta_{ik}$ with points from either the center or left sample of $\pi_i$. We then sampled the other distributions $\eta_{ij}$ for $j \neq k$ in sequence, again initializing with points from samples of adjacent distributions, either $\eta_{i,j+1}$ or $\eta_{i,j-1}$ in this case. If no samples were found inside the support of a biased distribution, that distribution was ignored. For each biased distribution, sampling was burned in for 4500 steps, and samples were collected for an additional 2500 steps. Ultimately, 1397 of the 2500 biased distributions were sampled; the unsampled distributions correspond to the white space in Figure 7a.
We computed the marginal in $\log_{10}\lambda_1$ and $\log_{10}\lambda_2$ using a $200 \times 200$ grid of histogram bins, covering the region $-1 \leq \log_{10}\lambda_1 \leq 3.2$ and $-1 \leq \log_{10}\lambda_2 \leq 3.2$; this corresponds to taking
$h = \frac{3.2 - (-1)}{200}$ in (5.2); the result from the center calculation appears in Figure 7a. In Figure 8, we show the mixture distributions corresponding to the modes of the two-dimensional marginal in Figure 7a. The two-dimensional marginals were essentially the same for the center and left initializations; see Figure 9. We also estimated the one-dimensional marginal in $\log_{10}\lambda_1$ using the two-dimensional stratification; see the results labeled "2D Center" and "2D Left" in Figure 5a. Finally, we estimated the relative asymptotic variance of the marginal in $\log_{10}\lambda_1$ computed by two-dimensional stratification. Again, we observe that EMUS performs much better than unbiased sampling in the tails, cf. Figure 5b.

Figure 5: Estimates of the logarithm of the marginal density in $\log_{10}\lambda_1$ and the asymptotic variances of those estimates. Figure 5a displays the estimates of the marginal in $\log_{10}\lambda_1$ computed by various methods. The error bars are twice the estimated asymptotic standard deviation in each histogram bin. For the two-dimensional EMUS calculations, standard deviations were estimated using the method described in Appendix G. For the unbiased calculation, asymptotic variances were estimated using ACOR [14]. No error bars are given for the two one-dimensional calculations, as the barrier depicted in Figure 10 makes accurate estimation of the asymptotic variance impossible. A clear error is visible in the two one-dimensional umbrella sampling calculations, due to initialization on either side of the barrier in Figure 10. Figure 5b displays the asymptotic variance of the marginal density in $\log_{10}\lambda_1$ for the unbiased and the two-dimensional EMUS calculations. We note that while the unbiased calculation achieves greater accuracy near the mode, the EMUS calculation achieves greater accuracy in the tails.

The marginal in $\log_{10}\lambda_1$ and $\log_{10}\lambda_2$ confirms that barriers in $\lambda_2$ caused the problems observed in calculating the marginal in $\log_{10}\lambda_1$ using the natural stratification. In fact, we see that computing the marginal in either $\lambda_1$ or $\lambda_2$ requires stratifying both variables, as stratifying only one leads to barriers that impede sampling in the other. In particular, there are barriers in $\lambda_2$ along the line $\log_{10}\lambda_1 = 0.45$ and a barrier in $\lambda_1$ along $\log_{10}\lambda_2 = 0.6$: In Figure 10, we plot an estimate of the conditional distribution of $\log_{10}\lambda_2$ with $\log_{10}\lambda_1 = 0.45$ fixed. This distribution is multimodal with a region of very low probability separating the modes, which explains the poor sampling depicted in Figure 6.

Figure 6: To generate Figure 6, we binned the samples for the one-dimensional left and center EMUS calculations, and we plotted the difference in the histograms. The contour lines are contours of the log marginal density, as in Figure 7a. Figure 6 shows that while the two calculations largely sample the same regions, near $\log_{10}\lambda_1 = 0.45$ they become trapped on opposite sides of a barrier. This leads to poor sampling, causing a slowly decaying error in the estimates of the marginal density, cf. Figure 5a.

To conclude, we have confirmed that EMUS can be extremely efficient for computing tails. However, one must exercise care in the choice of strata. The natural stratification often suffices, but in some cases, like computing the marginal in $\log_{10}\lambda_1$, the biased distributions of the natural stratification may be very difficult to sample. We propose the use of different initializations, like the center and left samples, as a method of identifying problems related to poorly chosen strata. Careful inspection of simulations performed with these different initializations can identify problems and suggest better strata.

Figure 7: Logarithm of the marginal density in $\log_{10}\lambda_1$ and $\log_{10}\lambda_2$ as estimated by EMUS and unbiased MCMC. Contour lines in both figures are every unit change in the estimated $\log_{10}$ marginal density. Figure 7a is the EMUS estimate. The numbers 1, 2, and 3 on this figure correspond to the mixture densities in Figure 8. Note that at values of $\log_{10}\lambda$ near 3.0 we begin to see the modes corresponding to singularities of the posterior. Figure 7b is the marginal density estimated from a long unbiased trajectory of the ensemble sampler. Note that the entire trajectory lies in a small neighborhood of the mode labeled 1 in Figure 7a.

6. Conclusions. We have analyzed the Eigenvector Method for Umbrella Sampling (EMUS), an especially simple and effective stratified MCMC method sharing many features with the popular WHAM [27] and MBAR [44] methods of computational chemistry. We have demonstrated the advantages of EMUS for sampling from multimodal distributions and computing tail probabilities, and we have explained how to identify and resolve the problems which may occur if the method is implemented poorly. We have also given a tutorial intended to explain how to diagnose and correct problems related to poorly chosen strata.
Our purpose was to explain the benefits of stratified MCMC analytically, with the ultimate goal of introducing stratified MCMC to a diverse audience of statisticians, engineers, and scientists. Since stratified MCMC had previously been applied only to a particular class of statistical mechanics calculations without any general justification, we began by developing a general theory. We hope that our theory will serve as the basis for further developments. For example, it may now be possible to undertake a comparison of EMUS and other so-called reaction coordinate methods such as Wang–Landau sampling [52] or Metadynamics [28]. Despite some similarities with EMUS, these methods work by a substantially different mechanism, and understanding the relative advantages of the two approaches is non-trivial.
We also note that there is potential to apply stratification to problems that lie outside the scope of the present work. For example, we present a stratification method capable of computing dynamical quantities such as mean first passage times in [11].

Figure 8: Gaussian mixtures corresponding to means of histogram bins. Mixtures one through three correspond to the labeled points on Figure 7a; mixture four corresponds to a distribution near a singularity of the posterior, with $\log_{10}\lambda_1 = 4.34$ and $\log_{10}\lambda_2 = 0.79$. To be precise, the blue curve in each plot is the mixture distribution corresponding to the mean of a histogram bin centered at the point labeled in Figure 7a. The green curves are the individual mixture components. The black bars are a histogram of the Hidalgo stamp data.

Figure 9: The difference between the free energy surfaces of the two-dimensional umbrella sampling runs. The center calculation was initialized from the center one-dimensional calculation, and the left calculation from the left one-dimensional calculation. In general the difference is small, roughly a tenth of an order of magnitude in the log marginal.

Appendix A. Derivation of (2.1) and (2.3).
To see that (2.1) holds, observe that
\[
\sum_{i=1}^L z_i\, \pi_i[g^*]/u_i = \sum_{i=1}^L \frac{1}{\sum_{k=1}^L \pi[\psi_k]}\, \pi\!\left[\frac{g\,\psi_i/u_i}{\sum_{k=1}^L \psi_k/u_k}\right] = \frac{\pi[g]}{\sum_{k=1}^L \pi[\psi_k]}.
\]
Therefore, we have
\[
\frac{\sum_{i=1}^L z_i\, \pi_i[g^*]/u_i}{\sum_{i=1}^L z_i\, \pi_i[\mathbf{1}^*]/u_i} = \frac{\pi[g]\big/\!\left(\sum_{k=1}^L \pi[\psi_k]\right)}{\pi[\mathbf{1}]\big/\!\left(\sum_{k=1}^L \pi[\psi_k]\right)} = \pi[g].
\]
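The identities of Appendix A can be checked numerically in the partition-of-unity case ($u_i = 1$, so $f^* = f$). The sketch below is illustrative only, not the EMUS implementation; the density, observable, and strata are arbitrary choices. It builds $F_{ij} = \pi_i[\psi_j]$ by quadrature, verifies that $z$ with $z_i = \pi[\psi_i]$ is an invariant vector of $F$, recovers the weights as the leading eigenvector of $F$ (the "eigenvector" step of the method), and reconstructs $\pi[g]$:

```python
import numpy as np

# Quadrature check of the Appendix A identities in the partition-of-unity
# case (u_i = 1, so f* = f). Illustrative only; not the example from the paper.

x = np.linspace(0.0, 1.0, 20001)
dx = x[1] - x[0]

# a bimodal target density pi on [0, 1]
dens = np.exp(-(x - 0.25) ** 2 / 0.02) + 0.5 * np.exp(-(x - 0.75) ** 2 / 0.02)
dens /= dens.sum() * dx

# tent-function bias functions: a partition of unity on [0, 1]
L = 6
centers = np.linspace(0.0, 1.0, L)
width = centers[1] - centers[0]
psi = np.maximum(0.0, 1.0 - np.abs(x[None, :] - centers[:, None]) / width)

g = np.sin(3.0 * x)                           # an arbitrary observable

z = (psi * dens).sum(axis=1) * dx             # z_i = pi[psi_i]
F = ((psi * dens) @ psi.T) * dx / z[:, None]  # F_ij = pi_i[psi_j]

# z is an invariant vector of the overlap matrix: z F = z
print(np.allclose(z @ F, z))                  # True

# recover the weights as the leading left eigenvector of F
vals, vecs = np.linalg.eig(F.T)
w = np.abs(vecs[:, np.argmax(vals.real)].real)
w /= w.sum()
print(np.allclose(w, z))                      # True

# the stratified average reconstructs pi[g]
pi_i_g = ((psi * dens) @ g) * dx / z          # pi_i[g]
print(np.isclose((z * pi_i_g).sum(), (g * dens).sum() * dx))  # True
```

In the method itself, $F$ and the $\pi_i[g]$ are estimated by MCMC within each stratum rather than by quadrature, and the weights are computed from the sampled $\bar F$ exactly as in the eigenvector step above.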
To prove (2.3), observe that
\[
\sum_{i=1}^L w_i F_{ij} = \sum_{i=1}^L \frac{z_i}{u_i}\, \frac{\pi_i[\psi_j^*]}{u_j} = \sum_{i=1}^L \frac{1}{u_j}\, \pi\!\left[\frac{\psi_j\,(\psi_i/u_i)}{\sum_{k=1}^L \psi_k/u_k}\right] = \frac{\pi[\psi_j]}{u_j} = w_j.
\]

Appendix B. Proof of Lemma 2.1.
Proof. We prove only the second statement; the proof of the first is similar. By definition, a non-negative matrix $M \in \mathbb{R}^{L \times L}$ is irreducible if and only if for every subset $A \subset \{1, 2, \dots, L\}$ of the indices, there exist indices $i \in A$ and $j \notin A$ so that $M_{ji} > 0$. Now assume that for every $A \subset \{1, 2, \dots, L\}$, there exist $j \notin A$ and $t \geq 0$ so that
\[
X_t^j \in \bigcup_{k \in A} \{x : \psi_k(x) > 0\}.
\]
Then for some $i \in A$, $\psi_i(X_t^j) > 0$, so $\bar F_{ji} > 0$, hence $\bar F$ is irreducible.

Figure 10: Here we give an estimate of the conditional distribution of $\log_{10}\lambda_2$ with $\log_{10}\lambda_1 = 0.45$ calculated from the two-dimensional marginal seen in Figure 7a. The conditional distribution is multimodal. The mode on the left corresponds to mixtures with the data from thicknesses of 60 to 85 µm covered by a single Gaussian similar to mode 2 in Figure 8. The mode on the right corresponds to mixtures with these data covered by two Gaussians similar to mode 1 in Figure 8.

Appendix C. Sparse Grid of Strata. One may define a uniform grid of strata so that (3.6) increases only as $d^2$ with dimension, not exponentially: For any $i \in \mathbb{Z}^d$, let
\[
V_i^0 := h i + h \left( -\tfrac{1}{2}, \tfrac{1}{2} \right)^d.
\]
For $i \neq j$, define
\[
W_{ij} := \left\{ x \in V_j^0 : \min_{y \in V_i^0} \|x - y\| \leq \min_{y \in V_k^0} \|x - y\| \text{ for any } k \in \mathbb{Z}^d \setminus \{j\} \right\}
\]
to be the $d$-dimensional pyramid consisting of all points in $V_j^0$ closer to $V_i^0$ than to any other cube $V_k^0$. Now let $e_n$ denote the $n$'th standard basis vector in $\mathbb{R}^d$, and define
\[
V_i := \bigcup_{n=1}^d \left( W_{i, i+e_n} \cup W_{i, i-e_n} \right) \cup V_i^0
\]
to be the cube $V_i^0$ enlarged by all the neighboring pyramids $W_{ij}$.
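The geometry of this construction can be checked numerically: for $x$ uniform in a unit cube $V_j^0$ (taking $h = 1$), the nearest neighboring cube lies in the direction of the largest-magnitude coordinate of $x$, so each cube is partitioned into $2d$ pyramids $W_{ij}$ of volume $1/(2d)$, and each pairwise union $W_{ij} \cup W_{ji}$ has volume $1/d$. A Monte Carlo sketch of this claim, illustrative and not from the paper:

```python
import numpy as np

# Monte Carlo check of the pyramid volumes in Appendix C (with h = 1):
# for x uniform in the unit cube (-1/2, 1/2)^d, the nearest face-neighboring
# cube lies in the direction of x's largest-magnitude coordinate, so each of
# the 2d pyramids W_ij has volume 1/(2d).

rng = np.random.default_rng(1)
d, n = 3, 200000
x = rng.uniform(-0.5, 0.5, size=(n, d))

# nearest neighboring cube: signed direction of the largest |x_k|
k = np.argmax(np.abs(x), axis=1)
sign = np.sign(x[np.arange(n), k])

# fraction of the cube lying in the pyramid toward +e_1
frac = np.mean((k == 0) & (sign > 0))
print(abs(frac - 1.0 / (2 * d)) < 0.01)  # True: volume is about 1/(2d)
```

Diagonal neighbors are never strictly closer than the best face neighbor, which is why only the $2d$ face directions enter the decomposition.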
The strata $V_i$ are convex, and the corresponding bias functions $\psi_i = \frac{1}{2}\mathbf{1}_{V_i}$ are a partition of unity. Each stratum $V_i$ intersects only the $2d$ neighboring strata $V_{i \pm e_n}$ for $n = 1, \dots, d$. Moreover, each intersection between neighboring strata $V_i$ and $V_j$ consists of the pair of pyramids $W_{ij}$ and $W_{ji}$, and it has volume $1/d$. Therefore, by Lemma 3.9, for this choice of bias functions, the nonzero entries of $F$ decrease as $1/d$. It follows that (3.6) increases as $d^2$.

Appendix D. Proofs of Theorem 3.3 and Theorem 3.5. Our proof of Theorem 3.3 (the CLT for EMUS) is based on the delta method. To apply the delta method, we require the following result ensuring the differentiability of $w(G)$:

Lemma D.1. The function $w(G)$ admits an extension $\tilde w : \mathbb{R}^{L \times L} \to \mathbb{R}^L$ which is differentiable on the set of irreducible stochastic matrices.

Proof. By [47, Lemma 3.1], $w(G)$ admits a continuously differentiable extension to an open set $U \subset \mathbb{R}^{L \times L}$. We further extend the domain of $w(G)$ to $\mathbb{R}^{L \times L}$ by arbitrarily defining $w(G) = 0$ whenever $G \in \mathbb{R}^{L \times L} \setminus U$.

The extension in Lemma D.1 resolves two technicalities: First, the set of stochastic matrices is not a vector space but a compact, convex subset of $\mathbb{R}^{L \times L}$ with empty interior. Therefore, the derivative of $w$ is undefined. Second, $\bar F$ may be reducible for some values of $N$ and some realizations of the processes sampling the biased distributions. In that case, the invariant distribution of $\bar F$ is not unique, so $w(\bar F)$ is undefined. Throughout the remainder of this work, $w(G)$ will denote the extension guaranteed by the lemma.
We now prove the CLT for EMUS.

Proof of Theorem 3.3. The proof is based on the delta method [6, Proposition 6.2] and a formula for $w'(\bar F)$ given in [19].
By Lemma D.1, $w(\bar F)$ is differentiable at $F$, so the function
\[
B\!\left(\bar F, \{\bar g_i^*\}_{i=1}^L, \{\bar{\mathbf{1}}_i^*\}_{i=1}^L\right) := \pi^{\mathrm{US}}[g] = \frac{\sum_{i=1}^L w_i(\bar F)\, \bar g_i^*}{\sum_{i=1}^L w_i(\bar F)\, \bar{\mathbf{1}}_i^*}
\]
is differentiable at $\left(F, \{\pi_i[g^*]\}_{i=1}^L, \{\pi_i[\mathbf{1}^*]\}_{i=1}^L\right)$. Let $\partial_i B \in \mathbb{R}^{L+2}$ be the derivative of $B$ with respect to those quantities computed from $X_t^i$: That is,
\[
\partial_i B := \left( \frac{\partial B}{\partial \bar F_{i:}}, \frac{\partial B}{\partial \bar G_{i:}} \right) \in \mathbb{R}^{L+2}, \tag{D.1}
\]
where $\frac{\partial B}{\partial \bar F_{i:}} \in \mathbb{R}^L$ denotes the partial derivative of $B$ with respect to the $i$'th row of $\bar F$ and $\frac{\partial B}{\partial \bar G_{i:}} = \left( \frac{\partial B}{\partial \bar g_i^*}, \frac{\partial B}{\partial \bar{\mathbf{1}}_i^*} \right) \in \mathbb{R}^2$.
To simplify notation, we will assume throughout the remainder of this argument that all derivatives are evaluated at $\left(F, \{\pi_i[g^*]\}_{i=1}^L, \{\pi_i[\mathbf{1}^*]\}_{i=1}^L\right)$. In formulas involving matrix multiplication, we will treat $\partial_i B$, $\frac{\partial B}{\partial \bar F_{i:}}$, and $\frac{\partial B}{\partial \bar G_{i:}}$ as row vectors.
Since we assume that the processes $X_t^i$ sampling the different measures $\pi_i$ are independent, [5, Chapter 1, Theorem 2.8] implies that
\[
\sqrt{M}\left( \left( \bar F_{1:}, \bar g_1^*, \bar{\mathbf{1}}_1^*, \dots, \bar F_{L:}, \bar g_L^*, \bar{\mathbf{1}}_L^* \right) - \left( F_{1:}, \pi_1[g^*], \pi_1[\mathbf{1}^*], \dots, F_{L:}, \pi_L[g^*], \pi_L[\mathbf{1}^*] \right) \right) \xrightarrow{d} \mathcal{N}(0, \Sigma), \tag{D.2}
\]
where $\Sigma$ is the covariance matrix of the product of the distributions $\mathcal{N}\!\left(0, \kappa_i^{-1}\Sigma_i\right)$. (That is, $\Sigma \in \mathbb{R}^{L(L+2) \times L(L+2)}$ is the block diagonal matrix with the matrices $\kappa_i^{-1}\Sigma_i$ along the diagonal.) Therefore, by the delta method,
\[
\sqrt{M}\left( \pi^{\mathrm{US}}[g] - \pi[g] \right) \xrightarrow{d} \mathcal{N}(0, \sigma^2),
\]
where
\[
\sigma^2 = (\partial_1 B, \dots, \partial_L B)\, \Sigma\, (\partial_1 B, \dots, \partial_L B)^t = \sum_{i=1}^L \kappa_i^{-1}\, \partial_i B\, \Sigma_i\, \partial_i B^t. \tag{D.3}
\]
Now we observe that for any column vector $v \in \mathbb{R}^L$ having mean zero,
\[
\left. \frac{d}{d\varepsilon}\, w_k(F + \varepsilon e_i v^t) \right|_{\varepsilon=0} = \frac{\partial w_k}{\partial \bar F_{i:}}\, v = z_i\, v^t (I - F)^{\#} e_k,
\]
by [19, Theorem 3.1]. (In the formula above, $e_i \in \mathbb{R}^L$ denotes the $i$'th standard basis vector.)
Therefore, we have
\[
\begin{aligned}
\frac{\partial B}{\partial \bar F_{i:}}\, v
&= \frac{\sum_{k=1}^L \frac{\partial w_k}{\partial \bar F_{i:}} v\, \pi_k[g^*]}{\sum_{k=1}^L z_k \pi_k[\mathbf{1}^*]} - \frac{\sum_{k=1}^L \frac{\partial w_k}{\partial \bar F_{i:}} v\, \pi_k[\mathbf{1}^*]}{\sum_{k=1}^L z_k \pi_k[\mathbf{1}^*]}\, \frac{\sum_{k=1}^L z_k \pi_k[g^*]}{\sum_{k=1}^L z_k \pi_k[\mathbf{1}^*]} \\
&= \sum_{k=1}^L \frac{\partial w_k}{\partial \bar F_{i:}} v\, \Psi\left( \pi_k[g^*] - \pi[g]\, \pi_k[\mathbf{1}^*] \right) \tag{D.4} \\
&= z_i\, v^t (I - F)^{\#} g,
\end{aligned}
\]
where $g_k = \Psi\, \pi_k[g^* - \pi[g]\mathbf{1}^*] = \ell \cdot (\pi_k[g^*], \pi_k[\mathbf{1}^*])$. (Equality (D.4) above follows from (2.1) and the definition of $\Psi$.) Also,
\[
\frac{\partial B}{\partial \bar G_i} = z_i\, \Psi(1, -\pi[g]) = z_i \ell. \tag{D.5}
\]
Thus,
\[
\begin{aligned}
\partial_i B\, \Sigma_i\, \partial_i B^t
&= \frac{\partial B}{\partial \bar F_{i:}}\, \sigma^i \frac{\partial B}{\partial \bar F_{i:}}^t + 2\, \frac{\partial B}{\partial \bar F_{i:}}\, \rho^i \frac{\partial B}{\partial \bar G_{i:}}^t + \frac{\partial B}{\partial \bar G_{i:}}\, \tau^i \frac{\partial B}{\partial \bar G_{i:}}^t \\
&= z_i^2 \left\{ (I - F)^{\#} g \cdot \sigma^i (I - F)^{\#} g + 2\, (I - F)^{\#} g \cdot \rho^i \ell + \ell^t \tau^i \ell \right\},
\end{aligned}
\]
and the result follows by (D.3).
We now prove Theorem 3.5. To begin, we give some upper bounds on the partial derivatives of the weight vector $w(F)$ with respect to the entries of the overlap matrix $F$.

Definition D.2. Let $e_i \in \mathbb{R}^L$ denote the $i$'th standard basis vector. For $i, j \in \{1, 2, \dots, L\}$ with $i \neq j$, define the logarithmic partial derivatives
\[
\frac{\partial \log w_k}{\partial F_{ij}}(F) := \left. \frac{d}{d\varepsilon} \right|_{\varepsilon=0} \log w_k\!\left( F + \varepsilon \left( e_i e_j^t - e_i e_i^t \right) \right). \tag{D.6}
\]
(These partial derivatives must be understood as derivatives of the extension guaranteed by Lemma D.1; otherwise, they are defined only when $F_{ij} > 0$ and $F_{ii} > 0$.)

Our definition of logarithmic partial derivatives in (D.6) is not standard. However, we observe that a version of the standard formula relating the total and partial derivatives of $\log w$ holds: For all matrices $H$ whose rows sum to zero,
\[
\left. \frac{d}{d\varepsilon} \right|_{\varepsilon=0} \log w_k(F + \varepsilon H) = \sum_{i \neq j} \frac{\partial \log w_k}{\partial F_{ij}}(F)\, H_{ij}. \tag{D.7}
\]
We need only consider matrices whose rows sum to zero, since these are the only perturbations for which $F + \varepsilon H$ can be stochastic.
The following result appears in [47, Theorem 3.6]. It is crucial in our proof of Theorem 3.5.

Lemma D.3. Recall $P_i[t_j < t_i]$ and $\frac{\partial \log w_k}{\partial F_{ij}}$ from Definitions 3.4 and D.2. For all stochastic and irreducible matrices $F$, we have
\[
\frac{1}{2}\, \frac{1}{P_i[t_j < t_i]} \leq \max_k \left| \frac{\partial \log w_k}{\partial F_{ij}}(F) \right| \leq \frac{1}{P_i[t_j < t_i]}.
\]

We also require the following lemma in the proof of Theorem 3.5.

Lemma D.4. The asymptotic covariance matrix $\sigma^i$ has the properties:
1. The rows and columns of $\sigma^i$ sum to zero. That is, for $e \in \mathbb{R}^L$ the vector of all ones, $\sigma^i e = 0$ and $e^t \sigma^i = 0$.
2. For all $j = 1, \dots, L$, $\sigma^i_{jk} = \sigma^i_{kj} = 0$ whenever $F_{ik} = 0$.

Proof. Since the rows of $\bar F$ sum to one with probability one, we have $\mathrm{var}(\bar F_{i:} e) = 0$ for any fixed number of samples $N_i$. Therefore, the asymptotic variance $\sigma^i$ has $e^t \sigma^i e = 0$, and it follows that $e^t \sigma^i = \sigma^i e = 0$ since $\sigma^i$ is symmetric and positive semidefinite. Let $k$ be such that $F_{ik} = 0$. Since $\bar F_{ik} = 0$ with probability one, we have $\mathrm{cov}(\bar F_{ik}, \bar F_{ij}) = 0$ for any $j = 1, \dots, L$, and therefore $\sigma^i_{jk} = 0$.

We now prove Theorem 3.5.

Proof of Theorem 3.5. We begin with formula (D.3):
\[
\sigma^2 = \sum_{i=1}^L \kappa_i^{-1}\, \partial_i B\, \Sigma_i\, \partial_i B^t. \tag{D.8}
\]
Since the asymptotic covariance matrix $\Sigma_i$ is symmetric and positive semidefinite, the Cauchy inequality holds:
\[
a^t \Sigma_i b \leq \tfrac{1}{2} a^t \Sigma_i a + \tfrac{1}{2} b^t \Sigma_i b, \quad \text{for all } a, b \in \mathbb{R}^{L+2}.
\]
Therefore,
\[
\begin{aligned}
\partial_i B\, \Sigma_i\, \partial_i B^t
&= \left( \frac{\partial B}{\partial \bar F_{i:}}, \frac{\partial B}{\partial \bar G_{i:}} \right) \Sigma_i \left( \frac{\partial B}{\partial \bar F_{i:}}, \frac{\partial B}{\partial \bar G_{i:}} \right)^t \\
&\leq 2 \left( \frac{\partial B}{\partial \bar F_{i:}}, 0 \right) \Sigma_i \left( \frac{\partial B}{\partial \bar F_{i:}}, 0 \right)^t + 2 \left( 0^t, \frac{\partial B}{\partial \bar G_{i:}} \right) \Sigma_i \left( 0^t, \frac{\partial B}{\partial \bar G_{i:}} \right)^t \\
&= 2\, \frac{\partial B}{\partial \bar F_{i:}}\, \sigma^i \frac{\partial B}{\partial \bar F_{i:}}^t + 2\, \frac{\partial B}{\partial \bar G_{i:}}\, \tau^i \frac{\partial B}{\partial \bar G_{i:}}^t \\
&=: 2 A_0 + 2 A_1. \tag{D.9}
\end{aligned}
\]
(Here, $0$ denotes the zero vector in $\mathbb{R}^L$, interpreted as a column vector.)
We now estimate the term $A_0$ defined above. By (D.4), we have
\[
\begin{aligned}
A_0 &= \frac{\partial B}{\partial \bar F_{i:}}\, \sigma^i \frac{\partial B}{\partial \bar F_{i:}}^t
= \sum_{j,k,\ell,m=1}^L g_\ell\, \frac{\partial w_\ell}{\partial \bar F_{ij}}\, \sigma^i_{jk}\, g_m\, \frac{\partial w_m}{\partial \bar F_{ik}} \\
&= \sum_{\ell,m=1}^L z_\ell g_\ell\, z_m g_m \sum_{\substack{j \neq i \\ F_{ij} > 0}} \sum_{\substack{k \neq i \\ F_{ik} > 0}} \frac{\partial \log w_\ell}{\partial \bar F_{ij}}\, \sigma^i_{jk}\, \frac{\partial \log w_m}{\partial \bar F_{ik}} \\
&= \sum_{\ell,m=1}^L z_\ell g_\ell\, z_m g_m \sum_{\substack{j \neq i \\ F_{ij} > 0}} \sum_{\substack{k \neq i \\ F_{ik} > 0}} \sqrt{\mathrm{var}_{\pi_i}(\psi_j^*)}\, \frac{\partial \log w_\ell}{\partial \bar F_{ij}}\, R^i_{jk}\, \sqrt{\mathrm{var}_{\pi_i}(\psi_k^*)}\, \frac{\partial \log w_m}{\partial \bar F_{ik}}, \tag{D.10}
\end{aligned}
\]
where
\[
R^i_{jk} := \frac{\sigma^i_{jk}}{\sqrt{\mathrm{var}_{\pi_i}(\psi_j^*)}\, \sqrt{\mathrm{var}_{\pi_i}(\psi_k^*)}}.
\]
(The third equality above follows from formula (D.7) relating the total and partial derivatives of $\log w$, since the rows and columns of $\sigma^i$ sum to zero by Lemma D.4.) We claim that
\[
\sum_{\substack{j \neq i \\ F_{ij} > 0}} \sum_{\substack{k \neq i \\ F_{ik} > 0}} \sqrt{\mathrm{var}_{\pi_i}(\psi_j^*)}\, \frac{\partial \log w_\ell}{\partial \bar F_{ij}}\, R^i_{jk}\, \sqrt{\mathrm{var}_{\pi_i}(\psi_k^*)}\, \frac{\partial \log w_m}{\partial \bar F_{ik}}
\leq \mathrm{tr}(R^i) \sum_{\substack{j \neq i \\ F_{ij} > 0}} \mathrm{var}_{\pi_i}(\psi_j^*) \left( \frac{\partial \log w_\ell}{\partial \bar F_{ij}} \right)^2. \tag{D.11}
\]
To prove this, we observe that $R^i$ is symmetric and positive semidefinite since $\sigma^i$ is symmetric and positive semidefinite. Therefore, $R^i$ has the spectral decomposition
\[
R^i = \sum_{j=1}^L \lambda_{i,j}\, v^{i,j} (v^{i,j})^t
\]
with eigenvalues $\lambda_{i,j} \geq 0$ and corresponding eigenvectors $v^{i,j}$ such that $\|v^{i,j}\| = 1$. Thus, for any $a \in \mathbb{R}^L$,
\[
a^t R^i a = \sum_{j=1}^L \lambda_{i,j}\, |v^{i,j} \cdot a|^2 \leq \left( \sum_{j=1}^L \lambda_{i,j} \right) \|a\|^2 = \mathrm{tr}(R^i)\, \|a\|^2. \tag{D.12}
\]
Inequality (D.11) follows from (D.12) by setting
\[
a_j = \begin{cases} \sqrt{\mathrm{var}_{\pi_i}(\psi_j^*)}\, \dfrac{\partial \log w_\ell}{\partial \bar F_{ij}} & \text{if } j \neq i \text{ and } F_{ij} > 0, \\[4pt] 0 & \text{otherwise.} \end{cases}
\]
Finally, combining (D.10), (D.11), and Lemma D.3 yields
\[
A_0 \leq \mathrm{tr}(R^i) \left( \sum_{\ell=1}^L z_\ell |g_\ell| \right)^2 \sum_{\substack{j \neq i \\ F_{ij} > 0}} \frac{\mathrm{var}_{\pi_i}(\psi_j^*)}{P_i[t_j < t_i]^2}.
\]
Moreover, we have
\[
\sum_{\ell=1}^L z_\ell |g_\ell| = \Psi \sum_{\ell=1}^L z_\ell \left| \pi_\ell[g^* - \pi[g]\mathbf{1}^*] \right| \leq \Psi \sum_{\ell=1}^L z_\ell\, \pi_\ell[|h|^*] = \pi[|h|],
\]
by (2.1), and therefore
\[
A_0 \leq \mathrm{tr}(R^i)\, \pi[|h|]^2 \sum_{\substack{j \neq i \\ F_{ij} > 0}} \frac{\mathrm{var}_{\pi_i}(\psi_j^*)}{P_i[t_j < t_i]^2}. \tag{D.13}
\]
We now observe that by (D.5),
\[
A_1 = z_i^2\, \ell^t \tau^i \ell = \Psi^2 z_i^2\, \mathrm{C}(\bar h_i),
\]
where $\mathrm{C}(\bar h_i)$ denotes the asymptotic variance of the trajectory average $\bar h_i$ of $h$ over the biased process $X_t^i$. Therefore, combining (D.8) and (D.13), we find
\[
\sigma^2 \leq 2 \sum_{i=1}^L \frac{1}{\kappa_i} \left\{ z_i^2 \Psi^2\, \mathrm{C}(\bar h_i) + \mathrm{tr}(R^i)\, \pi[|h|]^2 \sum_{\substack{j \neq i \\ F_{ij} > 0}} \frac{\mathrm{var}_{\pi_i}(\psi_j^*)}{P_i[t_j < t_i]^2} \right\},
\]
as desired.

When the bias functions are a partition of unity, both the EMUS method and the statements of Theorems 3.3 and 3.5 simplify considerably. (The bias functions are a partition of unity if and only if $\sum_{i=1}^L \psi_i(x) = 1$ for all $x$.) In this case, $f^* = f$ for all functions $f$, and the EMUS method reduces to
\[
\pi^{\mathrm{US}}[g] = \sum_{i=1}^L w_i(\bar F)\, \bar g_i, \quad \text{where} \quad \bar F_{ij} = N_i^{-1} \sum_{t=1}^{N_i} \psi_j(X_t^i) \quad \text{and} \quad \bar g_i = N_i^{-1} \sum_{t=1}^{N_i} g(X_t^i).
\]

Corollary D.5. Suppose that the bias functions are a partition of unity. In that case, Theorem 3.5 holds with either $\mathrm{var}_\pi(g)$ or $\pi[|g|]^2$ in place of $\pi[|h|]^2$. In addition, $\Psi = 1$, and one can replace $\mathrm{C}(\bar h_i)$ with the asymptotic variance $\mathrm{C}(\bar g_i)$ of $\bar g_i$.

Proof. When the bias functions are a partition of unity,
\[
\pi[|h|]^2 = \pi[|g - \pi[g]|]^2 \leq \pi[|g - \pi[g]|^2] = \mathrm{var}_\pi(g),
\]
and so we may replace $\pi[|h|]^2$ with $\mathrm{var}_\pi(g)$. In addition, equation (D.4) holds with $g_k = \pi_k[g]$. Thus, following the argument above, one may verify that the result also holds with $\pi[|g|]^2$ in place of $\pi[|h|]^2$.

Appendix E. Proof of Theorem 3.7.
In the arguments below, for any probability measure $\nu$ on a set $\Omega$, we let $L^2(\nu) := \{u : \Omega \to \mathbb{R} : \nu[u^2] < \infty\}$, and we define the $L^2(\nu)$ inner product $\langle f, g \rangle_\nu = \nu[fg]$ with the corresponding norm $\|f\|_{L^2(\nu)} := \sqrt{\langle f, f \rangle_\nu}$. Given a set $U \subset \mathbb{R}^d$, we define $L^2(U)$, $\|\cdot\|_{L^2(U)}$, and $\langle \cdot, \cdot \rangle_U$ to be the analogous function space, norm, and inner product for Lebesgue measure on $U$.
Our proof of Theorem 3.7 requires a Poincaré inequality, Lemma E.1. We refer to [31, Section 3] for an introduction to Poincaré inequalities and their role in the theory of diffusion processes.

Lemma E.1. Assume that the Poincaré inequality holds for $U$ with constant $\Lambda$; that is, assume that for all weakly differentiable $f : U \to \mathbb{R}$ so that $\nabla f \in L^2(U)$,
\[
\left\| f - \int_U f\, dx \right\|_{L^2(U)} \leq \Lambda(U)\, \|\nabla f\|_{L^2(U)}.
\]
We have a similar Poincaré inequality for $\pi_h$:
\[
\|f - \pi_h(f)\|_{L^2(\pi_h)} \leq h\, \Lambda(U) \exp\!\left( \frac{\beta}{2} \left( \sup_{U_h} V - \inf_{U_h} V \right) \right) \|\nabla f\|_{L^2(\pi_h)}.
\]

Proof. By a standard scaling argument, the Poincaré inequality holds for $U_h$ with constant $h\Lambda$. To see this, let $A_h : U \to U_h$ be the affine transformation $A_h x = x_0 + h(x - x_0)$. For any $f : U_h \to \mathbb{R}$ with $\nabla f \in L^2(U_h)$, using the change of variable formula and the chain rule, we have
\[
\left\| f - \int_{U_h} f \right\|_{L^2(U_h)}^2 = h^d \left\| f \circ A_h - \int_U f \circ A_h \right\|_{L^2(U)}^2 \leq h^d \Lambda^2\, \|\nabla (f \circ A_h)\|_{L^2(U)}^2 = h^d h^2 \Lambda^2\, \|(\nabla f) \circ A_h\|_{L^2(U)}^2 = h^2 \Lambda^2\, \|\nabla f\|_{L^2(U_h)}^2.
\]
Now observe that for any $f \in L^2(\pi_h)$,
\[
\|f - \pi_h[f]\|_{L^2(\pi_h)} = \min_{c \in \mathbb{R}} \|f - c\|_{L^2(\pi_h)},
\]
since $\pi_h[f]$ is the $L^2(\pi_h)$ orthogonal projection of $f$ onto the space of constant functions. Therefore, we have
\[
\|f - \pi_h[f]\|_{L^2(\pi_h)}^2 \leq \left\| f - \int_{U_h} f \right\|_{L^2(\pi_h)}^2 \leq \left( \sup_{x \in U_h} \pi_h(x) \right) \left\| f - \int_{U_h} f \right\|_{L^2(U_h)}^2 \leq h^2 \Lambda^2 \left( \sup_{x \in U_h} \pi_h(x) \right) \|\nabla f\|_{L^2(U_h)}^2 \leq h^2 \Lambda^2\, \frac{\sup_{x \in U_h} \pi_h(x)}{\inf_{x \in U_h} \pi_h(x)}\, \|\nabla f\|_{L^2(\pi_h)}^2,
\]
and the result follows.

Remark E.2. The Poincaré inequality for Lebesgue measure on a set $U$ holds under very weak conditions on $U$. For example, when $U$ is convex, the Poincaré inequality holds with constant $\Lambda(U) = D/\pi$, where $D$ is the diameter of the domain [38].

We now prove Theorem 3.7:

Proof of Theorem 3.7. We begin by stating a simple consequence of the functional central limit theorem for reversible, continuous-time Markov processes: Let $Y_t$ be a reversible, stationary Markov process with ergodic distribution $\pi$ and generator $\mathcal{L}$. Let $g \in L^2(\pi)$, and define
\[
\bar g := T^{-1} \int_{s=0}^T g(Y_s)\, ds.
\]
By [25, Corollary 1.9],
\[
\sqrt{T}\,(\bar g - \pi[g]) \xrightarrow{d} \mathcal{N}(0, \sigma^2(g)),
\]
where
\[
\sigma^2(g) = \langle g - \pi[g],\, \mathcal{L}^{-1}(g - \pi[g]) \rangle_\pi. \tag{E.1}
\]
Here, $\mathcal{L}^{-1}(g - \pi[g])$ denotes any function in the domain of $\mathcal{L}$ with $\mathcal{L}(\mathcal{L}^{-1}(g - \pi[g])) = g - \pi[g]$ and $\pi[\mathcal{L}^{-1}(g - \pi[g])] = 0$. Such a function must exist when $g \in L^2(\pi)$ and $Y_t$ is reversible [25].
We now show that the process $X_t^h$ meets the conditions above for the central limit theorem. First, we recall that the generator of $X_t^h$ is the operator $\mathcal{L}_h = \beta^{-1} \Delta - \nabla V \cdot \nabla$ with domain
\[
\mathcal{D}(\mathcal{L}_h) := \{ g \in C^2(U_h) : \nabla g(x) \cdot n(x) = 0 \text{ for all } x \in \partial U_h \};
\]
see [2, Proposition 3.2] for the case of a convex polyhedron or [13, Chapter 8] for a domain with $C^3$ boundary.
By [24, Theorem 4.3.3], a process $Y_t$ with invariant distribution $\pi$ is reversible if its generator is symmetric and it has the strong continuity property
\[
\lim_{t \to 0^+} \|T_t f - f\|_\pi = 0 \quad \text{for all } f \in L^2(\pi), \tag{E.2}
\]
where $T_t f(x) := E_x[f(Y_t)]$ denotes the backwards semigroup associated with $Y_t$.
The generator $\mathcal{L}_h$ of $X_t^h$ is symmetric, since for all $f, g \in \mathcal{D}(\mathcal{L}_h)$, using integration by parts, we have
\[
\begin{aligned}
-\beta^{-1} \langle \nabla f, \nabla g \rangle_{\pi_h}
&= -\beta^{-1} \int_{U_h} \nabla f \cdot \nabla g\, z_h^{-1} \exp(-\beta V)\, dx \\
&= \beta^{-1} \int_{U_h} f\, \mathrm{div}\!\left( z_h^{-1} \exp(-\beta V)\, \nabla g \right) dx - \beta^{-1} \int_{\partial U_h} f\, z_h^{-1} \exp(-\beta V)\, \nabla g \cdot n\, dS \\
&= \int_{U_h} \left( \beta^{-1} \Delta g - \nabla V \cdot \nabla g \right) f\, z_h^{-1} \exp(-\beta V)\, dx = \langle f, \mathcal{L}_h g \rangle_{\pi_h}. \tag{E.3}
\end{aligned}
\]
(Here, $z_h := \int_{U_h} \exp(-\beta V)\, dx$ is the normalizing constant for $\pi_h$.) Since $\langle \nabla f, \nabla g \rangle_{\pi_h}$ is invariant under exchanging $f$ and $g$, $\langle f, \mathcal{L}_h g \rangle_{\pi_h} = \langle \mathcal{L}_h f, g \rangle_{\pi_h}$ and $\mathcal{L}_h$ is symmetric. We postpone discussion of the strong continuity of $X_t^h$ to the end of the proof.
We now use the Poincaré inequality (Lemma E.1) and (E.3) to prove that $X_t^h$ is ergodic and to estimate the term $\mathcal{L}_h^{-1}(g - \pi_h[g])$ appearing in the formula for $\sigma_h^2(g)$; in essence, we adapt the approach outlined in [31, Section 3] to the family of reflected processes $X_t^h$.
We prove ergodicity first. By [4, Proposition 2.2], a process is ergodic if and only if 0 is a simple eigenvalue of its generator. By the Poincaré inequality (Lemma E.1) and (E.3), for all $u \in \mathcal{D}(\mathcal{L}_h)$,
\[
\|u - \pi_h[u]\|_{L^2(\pi_h)}^2 \leq C_h^2\, \|\nabla u\|_{L^2(\pi_h)}^2 = C_h^2 \beta\, \langle u, -\mathcal{L}_h u \rangle_{\pi_h} \leq C_h^2 \beta\, \|u\|_{L^2(\pi_h)}\, \|\mathcal{L}_h u\|_{L^2(\pi_h)}, \tag{E.4}
\]
where
\[
C_h = h\, \Lambda(U) \exp\!\left( \frac{\beta}{2} \left( \sup_{U_h} V - \inf_{U_h} V \right) \right).
\]
Now if $u$ is not constant, $\|u - \pi_h[u]\|_{L^2(\pi_h)}^2 > 0$, so $\|\mathcal{L}_h u\|_{L^2(\pi_h)} > 0$ and $u$ is not an eigenvector with eigenvalue 0. Hence, 0 is a simple eigenvalue of $\mathcal{L}_h$, and $X_t^h$ is ergodic.
Finally, we estimate $\sigma_h^2(g)$. By (E.4), for all $u \in \mathcal{D}(\mathcal{L}_h)$ with $\pi_h[u] = 0$, we have
\[
\|u\|_{L^2(\pi_h)} \leq C_h^2 \beta\, \|\mathcal{L}_h u\|_{L^2(\pi_h)}.
\]
Taking $u = L_h^{-1}(g - \pi_h[g])$ in the above yields

$$\|L_h^{-1}(g - \pi_h[g])\|_{L^2(\pi_h)} \le C_h^2 \beta \|g - \pi_h[g]\|_{L^2(\pi_h)},$$

which, using the Cauchy–Schwarz inequality, implies

$$\sigma_h^2(g) = \langle g - \pi_h[g],\, L_h^{-1}(g - \pi_h[g]) \rangle_{\pi_h} \le C_h^2 \beta \,\mathrm{var}_{\pi_h}(g).$$

It remains to show that the process $X^h_t$ has the strong continuity property (E.2). We only sketch an argument, since the basic ideas are standard. First, one can use the Lipschitz continuity of strong solutions of the reflected process [2, Lemma 4.1] to show that $X^h_t$ has the Feller property. (That is, one can show that $T_t u$ is continuous whenever $u$ is continuous.) In addition, since the process $X^h_t$ has an infinitesimal generator, we have the pointwise continuity property

(E.5) $\lim_{t \to 0^+} T_t u(x) = u(x)$ for all $x \in U_h$

and all $u \in D(L_h)$. Now we have $\|T_t\|_\infty \le 1$ for all $t \ge 0$, where $\|T_t\|_\infty$ is the operator norm of $T_t$ on the space of continuous functions with the sup-norm, and therefore by a density argument the limit (E.5) holds for all continuous $u$. Hence, by [8, Lemma 1.4], we have

$$\lim_{t \to 0^+} \sup_{x \in U_h} |T_t u(x) - u(x)| = 0$$

for all continuous $u$. The strong continuity property (E.2) then follows by another density argument, using that $\|T_t\|_{L^2(\pi_h)} \le 1$ for all $t \ge 0$.

Appendix F. Proof of Theorems 4.1 and 4.2.

Proof of Theorem 4.1. By Corollary D.5, since the bias functions are a partition of unity, we have

(F.1)
$$\sigma^2(g) \le 2 \sum_{i \in \mathbb{Z}^d/K\mathbb{Z}^d} \kappa_i^{-1} \Bigg( C(\bar g_i) z_i^2 + \mathrm{var}_\pi(g)\, \mathrm{tr}(R_i) \sum_{\substack{j \ne i \\ F_{ij} > 0}} \frac{1}{F_{ij}} \Bigg).$$

To prove the desired upper bound, we substitute estimates of $C(\bar g_i)$, $R_i$, and $F_{ij}$ into the inequality above. First, we consider the asymptotic covariances $R_i$ and $C(\bar g_i)$. Let $h = 1/K$.
The diameter of $U_i$ is $2\sqrt{d}\,h$, so by Assumption 3.6,

(F.2)
$$R_{i,jj} \le C h^a \beta^b \exp\!\left( 2\sqrt{d}\, h\beta \|\nabla V\|_{L^\infty} \right) \le C h^{a-b} \exp\!\left( 2\sqrt{d}\, \|\nabla V\|_{L^\infty} \right).$$

(The second inequality follows since $h\beta \le 1$ by definition.) Similarly,

(F.3) $C(\bar g_i) \le C h^{a-b} \exp\!\left( 2\sqrt{d}\, \|\nabla V\|_{L^\infty} \right) \mathrm{var}_{\pi_i}(g).$

Second, by Lemma 3.9, the nonzero entries of the overlap matrix $F$ are bounded below as $\beta$ tends to infinity:

(F.4) $F_{ij} \ge \dfrac{\exp\!\left( -2\sqrt{d}\, \|\nabla V\|_{L^\infty} \right)}{4^d}$ for all $i, j$ so that $F_{ij} > 0$.

We also observe that each row of $F$ has $3^d$ nonzero entries, since $F_{i,i+k} > 0$ only when all entries of $k$ belong to $\{-1, 0, 1\}$.

We now estimate the term involving $C(\bar g_i)$ in (F.1). By (F.3), we have

(F.5)
$$\sum_{i \in \mathbb{Z}^d/K\mathbb{Z}^d} z_i^2\, C(\bar g_i) \le C h^{a-b} \exp\!\left( 2\sqrt{d}\, \|\nabla V\|_{L^\infty} \right) \sum_{i \in \mathbb{Z}^d/K\mathbb{Z}^d} z_i^2 \,\mathrm{var}_{\pi_i}(g).$$

Now we have $\mathrm{var}_{\pi_i}(g) = \pi_i\!\left[ |g - \pi_i[g]|^2 \right] \le \pi_i\!\left[ |g - \pi[g]|^2 \right]$. Therefore,

(F.6)
$$\sum_{i \in \mathbb{Z}^d/K\mathbb{Z}^d} z_i^2 \,\mathrm{var}_{\pi_i}(g) \le \sum_{i \in \mathbb{Z}^d/K\mathbb{Z}^d} z_i\, \pi_i\!\left[ |g - \pi[g]|^2 \right] = \pi\!\left[ |g - \pi[g]|^2 \right] = \mathrm{var}_\pi(g).$$

(The inequality follows since $0 \le z_i \le 1$ for all $i$; the second-to-last equality follows using (2.4) and that $\{\psi_i\}_{i \in \mathbb{Z}^d/K\mathbb{Z}^d}$ is a partition of unity.) Thus,

(F.7) $\sum_{i \in \mathbb{Z}^d/K\mathbb{Z}^d} z_i^2\, C(\bar g_i) \le C h^{a-b} \exp\!\left( 2\sqrt{d}\, \|\nabla V\|_{L^\infty} \right) \mathrm{var}_\pi(g).$

It remains to address the term involving $R_i$ in (F.1): Using (F.2), (F.4), and that each row of $F$ has $3^d$ nonzero entries, we have

(F.8) $\mathrm{tr}(R_i) \sum_{\substack{j \ne i \\ F_{ij} > 0}} \dfrac{1}{F_{ij}} \le C\, 6^{2d} h^{a-b} \exp\!\left( 4\sqrt{d}\, \|\nabla V\|_{L^\infty} \right)$

for every $i \in \mathbb{Z}^d/K\mathbb{Z}^d$. Finally, using (F.7), (F.8), and $\kappa_i^{-1} = K^d = \lceil \beta \rceil^d$, we conclude

$$\sigma^2(g) \le 2 C h^{a-b} \left( K^d \exp\!\left( 2\sqrt{d}\, \|\nabla V\|_{L^\infty} \right) + 6^{2d} K^{2d} \exp\!\left( 4\sqrt{d}\, \|\nabla V\|_{L^\infty} \right) \right) \mathrm{var}_\pi(g) \le \left( D \lceil \beta \rceil^{d+b-a} + E \lceil \beta \rceil^{2d+b-a} \right) \mathrm{var}_\pi(g),$$

where the constants $D$ and $E$ depend on $d$ and $V$, but not on $g$ or $\beta$.
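Two structural properties of the overlap matrix recur throughout these proofs: when the bias functions form a partition of unity, $F$ is stochastic (each row of $F_{ij} = \pi_i[\psi_j]$ sums to one), and $z_i F_{ij} = \pi[\psi_i \psi_j] = z_j F_{ji}$, so $F$ is in detailed balance with its weight vector. A small quadrature sketch makes both properties concrete; the Gaussian target and the tent-shaped bias functions are illustrative choices, not the paper's.

```python
import numpy as np

# Overlap matrix F_ij = pi_i[psi_j], where pi_i is proportional to psi_i * pi
# and z_i = pi[psi_i].  The target pi and tent bias functions are illustrative.
centers = np.linspace(-2.0, 2.0, 5)
width = centers[1] - centers[0]
x = np.linspace(-6.0, 6.0, 20001)
dx = x[1] - x[0]

def psi(i):
    # Tent functions, clamped at the two ends so that sum_i psi_i(x) = 1.
    t = np.clip(1.0 - np.abs(x - centers[i]) / width, 0.0, 1.0)
    if i == 0:
        t[x < centers[0]] = 1.0
    if i == len(centers) - 1:
        t[x > centers[-1]] = 1.0
    return t

pi_x = np.exp(-0.5 * x**2)
pi_x /= pi_x.sum() * dx                                    # normalize the target
L = len(centers)
z = np.array([(psi(i) * pi_x).sum() * dx for i in range(L)])
F = np.array([[(psi(i) * psi(j) * pi_x).sum() * dx / z[i]
               for j in range(L)] for i in range(L)])

assert np.allclose(F.sum(axis=1), 1.0)                     # F is stochastic
ZF = z[:, None] * F                                        # entries z_i F_ij
assert np.allclose(ZF, ZF.T)                               # detailed balance
```

The stochasticity is exact because $\sum_j \psi_j \equiv 1$ holds pointwise, so it survives any quadrature rule; the detailed-balance identity is the one used in Appendix G to argue that the sign pattern of $(I - F)^\#$ is symmetric.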
We note that if one uses the bias functions proposed in Appendix C, then the constants $D$ and $E$ in the proof of Theorem 4.1 grow only polynomially with the dimension $d$, not exponentially. However, we do not claim that those bias functions perform better than the uniform grid (3.8) or the bias functions of Section 5.3 in practice.

We now prove Theorem 4.4:

Proof of Theorem 4.4. Take $g := \mathbb{1}_{x \ge M}$. Since the bias functions are a partition of unity, by Corollary D.5, we have

(F.9)
$$\sigma_M^2 \le 2 \sum_{i=0}^{K+1} \kappa_i^{-1} \Bigg( C(\bar g_i) z_i^2 + p_M^2 \,\mathrm{tr}(R_i) \sum_{\substack{j \ne i \\ F_{ij} > 0}} \frac{1}{F_{ij}} \Bigg).$$

First, we estimate

$$R_{i,jj} \le C h^a \exp\!\left( h \max_{x \le M} |V'(x)| \right) \le C e h^a$$

for all $i = 1, \ldots, K-1$ by Assumption 3.6. By Assumption 4.3, $R_{K,jj} \le D$ for $j = K-1, K$, and $R_{K,jj} = 0$ for $j \ne K-1, K$, since $\psi_{K+1}$ is constant over the support of $\pi_K$. In addition, $R_{K+1,jj} = 0$ for all $j = 1, \ldots, L$, since all bias functions $\psi_i$ take a constant value over the support of $\pi_{K+1}$. Likewise, $C(\bar g_K) \le C e h^a \,\mathrm{var}_{\pi_K}(g) \le C e h^a$, and $C(\bar g_i) = 0$ for all $i \ne K$.

We now show that the nonzero entries of the overlap matrix are bounded below independent of $M$. First, we estimate the entries which are averages over the biased distributions with bounded support. By Lemma 3.9, we have

$$F_{ij} \ge \frac{1}{2\exp(2)} > 0$$

for all $i = 0, \ldots, K-1$ and $j$ so that $F_{ij} > 0$. It remains to address those entries related to biased distributions with unbounded support, so with $i = K, K+1$. By Lemma F.1, $F_{K,K+1}$ and $F_{K,K-1}$ are bounded below by some $\theta > 0$ independent of $M$, for this choice of bias functions. (Lemma F.1 and its proof appear at the end of this appendix; Lemma F.1 is the only part of the proof which relies on Assumption 4.2.) In addition, for any $i = 0, \ldots$
, $K+1$, we have $F_{ii} = \frac{1}{2}$, which implies $F_{K+1,K} = 1 - F_{K+1,K+1} = \frac{1}{2}$, since $F$ is stochastic when the bias functions are a partition of unity.

Finally, we substitute the above estimates of the overlap matrix and the variances into (F.9). Let $c = \min\{\theta, 1/(2\exp(2))\}$. Observe that $h$ decreases with $M$, so $C e h^a \le E$ for some constant $E$, uniformly in $M$. Let $F = \max\{D, E\}$. We have

$$\frac{\sigma^2}{p_M^2} \le \frac{2}{p_M^2} \sum_{i=0}^{K+1} (K+2) \left( C(\bar g_i) z_i^2 + p_M^2\, \frac{2F}{c^2} \right) \le 2(K+2) F \frac{z_K^2}{p_M^2} + \frac{4F(K+2)^2}{c^2}.$$

We now observe that

$$\frac{z_K}{p_M} = \frac{z_K}{z_{K+1}} = \frac{F_{K+1,K}}{F_{K,K+1}} \le \frac{1}{c}.$$

Therefore,

$$\frac{\sigma^2}{p_M^2} \le \frac{2F(K+2) + 4F(K+2)^2}{c^2},$$

which proves the result.

We now prove Lemma F.1, which is used in the proof of Theorem 4.4.

Lemma F.1. Under the hypotheses of Theorem 4.4, there exist constants $M_1, \theta_+, \theta_- > 0$ depending on $V$ but not on $M$ so that $F_{K,K+1} \ge \theta_+ > 0$ and $F_{K,K-1} \ge \theta_- > 0$ whenever $M \ge M_1$.

Proof. We consider $F_{K,K-1}$ first. We have

$$F_{K,K-1} = \frac{1}{2}\, \frac{\pi([M-h, M))}{\pi([M-h, \infty))} = \frac{1}{2}\, \frac{\int_{M-h}^M \exp(-V(x))\, dx}{\int_{M-h}^\infty \exp(-V(x))\, dx}.$$

By the integral mean value theorem,

$$\int_{M-h}^M \exp(-V(x))\, dx = h \exp(-V(\xi_{M-h,M}))$$

for some $\xi_{M-h,M} \in [M-h, M]$. Moreover, by (4.5), we have $V(x) \ge V(M) + V'(M)(x - M)$ for all $x \ge M \ge M_0$. Therefore, when $M - h \ge M_0$,

$$\int_{M-h}^\infty \exp(-V(x))\, dx \le \int_{M-h}^\infty \exp\!\left(-V(M-h) - V'(M-h)(x - M + h)\right) dx = \frac{\exp(-V(M-h))}{V'(M-h)}.$$

It follows that

(F.10)
$$F_{K,K-1} \ge \frac{1}{2} h V'(M-h) \exp\!\left(V(M-h) - V(\xi_{M-h,M})\right) \ge \frac{1}{2} h V'(M-h) \exp\!\left( -h \max_{x \le M} |V'(x)| \right) \ge \frac{1}{2} h V'(M-h) \exp(-1) = \frac{V'(M-h)}{2 \left\lceil \max_{x \le M} |V'(x)| \right\rceil} \exp(-1),$$

using the definition $h = M/K$.
To estimate the quotient in expression (F.10), we distinguish two cases: By (4.5), $V'$ is nondecreasing on $[M_0, \infty)$, so either $\lim_{x \to \infty} V'(x) = C_1 < \infty$ or $\lim_{x \to \infty} V'(x) = \infty$. In the first case, $V'$ is bounded, and we have

(F.11)
$$\frac{V'(M-h)}{\left\lceil \max_{x \le M} |V'(x)| \right\rceil} \ge \frac{V'(M_0)}{\left\lceil \max_{x \in [0,\infty)} |V'(x)| \right\rceil} > 0,$$

whenever $M - h \ge M_0$. In the second case, for $M$ sufficiently large, $\max_{x \le M} |V'(x)| = V'(M)$. Therefore, applying in succession the mean value theorem, the monotonicity of $V'$, assumption (4.6), and the hypothesis $\lim_{x \to \infty} V'(x) = \infty$, we have that for all $M$ sufficiently large,

(F.12)
$$\frac{V'(M-h)}{\left\lceil \max_{x \le M} |V'(x)| \right\rceil} = \frac{V'(M)}{\lceil V'(M) \rceil} - \frac{V'(M) - V'(M-h)}{\lceil V'(M) \rceil} \ge \frac{V'(M)}{\lceil V'(M) \rceil} - \frac{h V''(\eta_{M-h,M})}{V'(M)} \ge \frac{V'(M)}{\lceil V'(M) \rceil} - \frac{V''(\eta_{M-h,M})}{V'(\eta_{M-h,M})^2} \ge \frac{V'(M)}{\lceil V'(M) \rceil} - \alpha \ge \frac{1 - \alpha}{2} > 0.$$

(In the second and third lines above, $\eta_{M-h,M} \in [M-h, M]$ denotes the point guaranteed by the mean value theorem so that $V'(M) - V'(M-h) = h V''(\eta_{M-h,M})$.) It follows from (F.10), (F.11), and (F.12) that there exist $M_-, \theta_- > 0$ so that

(F.13) $F_{K,K-1} \ge \theta_- > 0$ whenever $M \ge M_-$.

Now we prove that $F_{K,K+1}$ is bounded below. We have

(F.14)
$$F_{K,K+1} = \frac{1}{2}\, \frac{\int_M^\infty \exp(-V(x))\, dx}{\int_{M-h}^\infty \exp(-V(x))\, dx} = F_{K,K-1}\, \frac{\int_M^\infty \exp(-V(x))\, dx}{\int_{M-h}^M \exp(-V(x))\, dx} \ge \theta_-\, \frac{\int_M^{M+h} \exp(-V(x))\, dx}{\int_{M-h}^M \exp(-V(x))\, dx} = \theta_-\, \frac{\int_M^{M+h} \exp(V(x-h) - V(x)) \exp(-V(x-h))\, dx}{\int_{M-h}^M \exp(-V(x))\, dx} \ge \theta_- \exp\!\left( \min_{[M-h,M+h]} V - \max_{[M-h,M+h]} V \right) \ge \theta_- \exp\!\left( -2h \max_{[M-h,M+h]} |V'| \right).$$

As above, to bound the quantity appearing in the exponent in (F.14), we distinguish the two cases $\lim_{x \to \infty} V'(x) = C_1 < \infty$ and $\lim_{x \to \infty} V'(x) = \infty$.
In the first case, for $M$ sufficiently large that $2C_1 \ge |V'(x)| \ge C_1/2$ whenever $x \ge M - h$, we have

(F.15)
$$h \max_{[M-h,M+h]} |V'| = \frac{\max_{[M-h,M+h]} |V'|}{\left\lceil \max_{[0,M]} |V'| \right\rceil} \le \frac{2C_1}{C_1/2} = 4.$$

In the second case, for $M$ sufficiently large,

(F.16)
$$h \max_{[M-h,M+h]} |V'| = \frac{\max_{[M-h,M+h]} |V'|}{\left\lceil \max_{[0,M]} |V'| \right\rceil} \le \frac{V'(M+h)}{V'(M)}.$$

By (4.6), we have the differential inequality $V'' < \alpha |V'|^2$. This implies $V'(M+s) \le y(s)$ for

$$y(s) = \frac{1}{V'(M)^{-1} - \alpha s},$$

the solution of the initial value problem $y' = \alpha y^2$, $y(0) = V'(M)$. Therefore,

$$V'(M+h) \le \frac{1}{V'(M)^{-1} - \alpha h} = \frac{1}{V'(M)^{-1} - \alpha \lceil V'(M) \rceil^{-1}} \le \frac{V'(M)}{1 - \alpha},$$

so by (F.16),

(F.17) $h \max_{[M-h,M+h]} |V'| \le \dfrac{1}{1 - \alpha}.$

It follows from (F.14), (F.15), and (F.17) that there exist $M_+, \theta_+ > 0$ so that

(F.18) $F_{K,K+1} \ge \theta_+ > 0$ whenever $M \ge M_+$.

Appendix G. Improved Method of Computing Error Bars. In [48, Section VII.B.1], we proposed a practical method of estimating the asymptotic standard deviations (error bars) of averages computed by EMUS. Using the notation established in Appendix D, our method proceeds as follows:

1. Compute $\bar F$, $\{\bar g^*_i\}_{i=1}^L$, and $\{\bar 1^*_i\}_{i=1}^L$.
2. Compute $w(\bar F)$ and the group inverse $(I - \bar F)^\#$.
3. Evaluate $\partial_i B$ at $\bar F$, $\{\bar g^*_i\}_{i=1}^L$, and $\{\bar 1^*_i\}_{i=1}^L$.
4. Compute the time series
$$\bar\zeta^i_t = \partial_i B \cdot \left( \left( \psi_1(X^i_t), \ldots, \psi_L(X^i_t), g^*(X^i_t), 1^*(X^i_t) \right) - \left( \bar F_{i1}, \ldots, \bar F_{iL}, \bar g^*_i, \bar 1^*_i \right) \right).$$
5. Compute an estimate $\bar\chi^2_i$ of the integrated autocovariance of $\bar\zeta^i_t$ using an algorithm such as ACOR [14].
6. Compute as an estimate of $\sigma^2$ the quantity

(G.1) $\bar\sigma^2 := \sum_{i=1}^L \dfrac{\bar\chi^2_i}{\kappa_i}.$

We originally proposed computing the group inverse $(I - \bar F)^\#$ using the method of [19] based on the QR factorization.
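Step 5 above requires an estimate of the integrated autocovariance of each projected time series $\bar\zeta^i_t$. As a minimal, self-contained stand-in for ACOR [14], the sketch below uses the batch means estimator on a synthetic AR(1) series; the series, its parameters, and the batch count are illustrative choices, not part of the method. (For an AR(1) process with coefficient $a$ and unit innovation variance, the integrated autocovariance is $1/(1-a)^2$.)

```python
import numpy as np

def integrated_autocovariance(ts, n_batches=50):
    """Batch means estimate of the sum of cov(ts_0, ts_k) over all integer lags k."""
    b = len(ts) // n_batches                       # samples per batch
    means = ts[:b * n_batches].reshape(n_batches, b).mean(axis=1)
    return b * means.var(ddof=1)

# A synthetic AR(1) series standing in for one projected series zeta_i; its
# integrated autocovariance is 1/(1 - a)^2 = 4 for a = 0.5.
rng = np.random.default_rng(0)
a, n = 0.5, 400_000
noise = rng.standard_normal(n)
ts = np.empty(n)
ts[0] = noise[0]
for t in range(1, n):
    ts[t] = a * ts[t - 1] + noise[t]

chi2 = integrated_autocovariance(ts)
```

One such estimate $\bar\chi^2_i$ per window, divided by the sampling fraction $\kappa_i$ and summed, gives the variance estimate (G.1). Batch means trades the spectral machinery of ACOR for simplicity; the batch length must be long compared with the correlation time of the series.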
We have since discovered that this method does not always yield sufficiently accurate results. For example, when computing error bars for the marginal in $\mu_2$ in Section 5.3, we observed a highly oscillatory numerical error affecting some entries of $(I - \bar F)^\#$. That the sign pattern in Figure 11a fails to be symmetric is evidence of this numerical error. We note that since the exact overlap matrix $F$ is in detailed balance with $w(F)$, we have

$$\mathrm{diag}(w(F))\, F \,\mathrm{diag}(w(F))^{-1} = F^t.$$

(Here, $\mathrm{diag}(w(F))$ denotes the diagonal matrix with $w(F)$ along the diagonal.) Therefore,

$$\left((I - F)^\#\right)^t = \mathrm{diag}(w(F))\, (I - F)^\# \,\mathrm{diag}(w(F))^{-1},$$

which implies that the sign pattern of $(I - F)^\#$ is symmetric, since $w(F)$ is positive. As a result of these numerical errors, we were unable to accurately compute error bars for the EMUS estimate of the marginal density.

Figure 11: Sign pattern of the group inverse $(I - \bar F)^\#$ computed by the method of [19] (Figure 11a, "Sign pattern for QR computed group inverse") and using power iteration (Figure 11b, "Sign pattern for iterative inverse"). Yellow indicates an entry with positive sign, blue a negative sign. Here, we consider the overlap matrix $\bar F$ computed to estimate the marginal density of $\mu_2$ in Section 5.3. The oscillations in sign observed in the upper right corner of Figure 11a are evidence of numerical error.

We therefore propose computing the group inverse by a new method combining QR factorization with power iteration. We first compute an estimate $G_0$ of $(I - \bar F)^\#$ by the method of [19]. We then iterate

(G.2) $G_{n+1} = \mathcal{I}(G_n) = \tilde F G_n + I - e\, w(\bar F)^t,$

where $e \in \mathbb{R}^L$ denotes the column vector of all ones and $\tilde F := (I - e\, w(\bar F)^t) \bar F$.
We observe that $(I - \bar F)^\#$ is a fixed point of this iteration, since

$$\mathcal{I}\!\left((I - \bar F)^\#\right) = (I - e\, w(\bar F)^t) \bar F (I - \bar F)^\# + (I - e\, w(\bar F)^t) = (\bar F - I)(I - \bar F)^\# + (I - e\, w(\bar F)^t) + (I - \bar F)^\# = (I - \bar F)^\#.$$

Above, we use well-known properties of the group inverse, including that the spectral projector $I - e\, w(\bar F)^t$ commutes with $\bar F$, that $(I - e\, w(\bar F)^t)(I - \bar F)^\# = (I - \bar F)^\#$, and that $(I - \bar F)(I - \bar F)^\# = I - e\, w(\bar F)^t$.

Moreover, when $\bar F$ is irreducible, $\mathcal{I}^K$ is a contraction for $K$ sufficiently large. By the Perron–Frobenius theorem, the spectral radius of $\tilde F$ is smaller than $1 - \varepsilon$ for some $\varepsilon > 0$. Therefore, by Gelfand's formula, for any matrix norm $\|\cdot\|$, we have $\lim_{k \to \infty} \|\tilde F^k\|^{1/k} < 1 - \varepsilon/2$, and so for some $K$, $\|\tilde F^k\| < (1 - \varepsilon/2)^k$ whenever $k \ge K$. Now

$$\mathcal{I}^K(G) = \tilde F^K G + (I - e\, w(\bar F)^t) \sum_{j=0}^{K-1} \tilde F^j.$$

Thus, assuming that the norm $\|\cdot\|$ is submultiplicative,

$$\left\|\mathcal{I}^K(G) - \mathcal{I}^K(H)\right\| = \left\|\tilde F^K (G - H)\right\| \le \left\|\tilde F^K\right\| \|G - H\| \le (1 - \varepsilon/2)^K \|G - H\|.$$

Therefore, the power iteration converges, and its limit is the group inverse $(I - \bar F)^\#$.

Using this new method, we computed $(I - \bar F)^\#$ for $\bar F$ the overlap matrix involved in estimating the marginal in $\mu_2$ in Section 5.3. We performed $10^6$ power method iterates. Observe that the sign pattern of the group inverse computed with power iteration is symmetric; see Figure 11b. The power iteration (G.2) converges slowly when the spectral gap of $\bar F$ is small. We have shown in [47] that the spectral gap may be very small: it decreases exponentially with a temperature parameter in a limit similar to the one analyzed in Section 4.1 above.
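The fixed-point and contraction arguments above can be checked directly on a small example. The sketch below runs the iteration (G.2) and verifies the defining properties of the group inverse, together with the sign-symmetry identity from detailed balance; the $4 \times 4$ reversible stochastic matrix is an illustrative stand-in for $\bar F$, and we start from $G_0 = 0$ rather than a QR-based estimate.

```python
import numpy as np

# Illustrative 4-state reversible (birth-death) stochastic matrix standing in
# for the overlap matrix F-bar; it is irreducible and aperiodic.
F = np.array([[0.50, 0.50, 0.00, 0.00],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.00, 0.00, 0.50, 0.50]])
L = F.shape[0]

# Stationary vector w(F): the left eigenvector of F with eigenvalue 1.
vals, vecs = np.linalg.eig(F.T)
w = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
w = w / w.sum()

e = np.ones((L, 1))
P = np.eye(L) - e @ w[None, :]      # spectral projector I - e w^t
Ft = P @ F                          # F-tilde = (I - e w^t) F

# Power iteration (G.2): G_{n+1} = F-tilde G_n + (I - e w^t).
G = np.zeros((L, L))                # G_0 = 0 here instead of a QR estimate
for _ in range(500):
    G = Ft @ G + P

A = np.eye(L) - F
assert np.allclose(A @ G, P)        # (I - F)(I - F)^# = I - e w^t
assert np.allclose(G @ A, P)
assert np.allclose(G @ A @ G, G)    # defining property of the group inverse

# Detailed balance gives diag(w) (I - F)^# diag(w)^{-1} = ((I - F)^#)^t,
# so the sign pattern of the group inverse is symmetric.
assert np.allclose(np.diag(w) @ G @ np.diag(1.0 / w), G.T)
```

For this well-conditioned example a few hundred iterates suffice; as noted above, the number of iterations needed grows as the spectral gap of $\bar F$ shrinks.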
However, even when the spectral gap is small, we conjecture that a modest number of power iterations will significantly reduce the numerical error in the group inverse, since the error in the initial calculation seems to be highly oscillatory and the power iteration has a smoothing effect.

REFERENCES

[1] Aitkin, M.: Likelihood and Bayesian analysis of mixtures. Statistical Modelling 1(4), 287–304 (2001)
[2] Andres, S.: Pathwise differentiability for SDEs in a convex polyhedron with oblique reflection. Ann. Inst. Henri Poincaré Probab. Stat. 45(1), 104–116 (2009)
[3] Bernèche, S., Roux, B.: Energetics of ion conduction through the K+ channel. Nature 414(6859), 73 (2001)
[4] Bhattacharya, R.N.: On the functional central limit theorem and the law of the iterated logarithm for Markov processes. Z. Wahrsch. Verw. Gebiete 60(2), 185–201 (1982)
[5] Billingsley, P.: Convergence of Probability Measures, second edn. Wiley Series in Probability and Statistics. Wiley-Interscience, New York (1999)
[6] Bilodeau, M., Brenner, D.: Theory of Multivariate Statistics. Springer Texts in Statistics. Springer, New York (1999)
[7] Boczko, E.M., Brooks, C.L.: First-principles calculation of the folding free energy of a three-helix bundle protein. Science 269(5222), 393–396 (1995)
[8] Böttcher, B., Schilling, R.L., Wang, J.: A primer on Feller processes. In: Lévy Matters III: Lévy-type processes: construction, approximation and sample path properties, Lecture Notes in Mathematics, chap. 1, pp. 1–30. Springer (2013)
[9] Cho, G.E., Meyer, C.D.: Comparison of perturbation bounds for the stationary distribution of a Markov chain. Linear Algebra Appl. 335, 137–150 (2001)
[10] Chopin, N., Lelièvre, T., Stoltz, G.: Free energy methods for Bayesian inference: efficient exploration of univariate Gaussian mixture posteriors.
Statistics and Computing 22(4), 897–916 (2012)
[11] Dinner, A.R., Mattingly, J.C., Tempkin, J.O.B., Van Koten, B., Weare, J.: Trajectory stratification of stochastic dynamics. SIAM Review 60(4), 909–938 (2018). DOI 10.1137/16M1104329. URL https://epubs.siam.org/doi/10.1137/16M1104329
[12] Doss, H., Tan, A.: Estimates and standard errors for ratios of normalizing constants from multiple Markov chains via regeneration. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76(4), 683–712 (2014)
[13] Ethier, S.N., Kurtz, T.G.: Markov Processes: Characterization and Convergence. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, Inc., New York (1986)
[14] Foreman-Mackey, D., Goodman, J.: ACOR 1.1.1. https://pypi.python.org/pypi/acor/1.1.1 (2014)
[15] Foreman-Mackey, D., Hogg, D.W., Lang, D., Goodman, J.: emcee: The MCMC hammer. Publications of the Astronomical Society of the Pacific 125(925), 306 (2013)
[16] Geyer, C.J.: Markov chain Monte Carlo maximum likelihood. In: Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface. American Statistical Association (1991)
[17] Geyer, C.J.: Estimating normalizing constants and reweighting mixtures (1994). Technical Report No. 568. Retrieved from the University of Minnesota Digital Conservancy
[18] Gill, R.D., Vardi, Y., Wellner, J.A.: Large sample theory of empirical distributions in biased sampling models. Ann. Statist. 16(3), 1069–1112 (1988)
[19] Golub, G.H., Meyer Jr., C.D.: Using the QR factorization and group inversion to compute, differentiate, and estimate the sensitivity of stationary probabilities for Markov chains. SIAM J. Algebraic Discrete Methods 7(2), 273–281 (1986)
[20] Goodman, J., Weare, J.: Ensemble samplers with affine invariance. Commun. Appl. Math. Comput. Sci.
5(1), 65–80 (2010)
[21] Helffer, B., Klein, M., Nier, F.: Quantitative analysis of metastability in reversible diffusion processes via a Witten complex approach. Mat. Contemp. 26, 41–85 (2004)
[22] Izenman, A.J., Sommer, C.J.: Philatelic mixtures and multimodal densities. Journal of the American Statistical Association 83(404), 941–953 (1988)
[23] Jasra, A., Holmes, C.C., Stephens, D.A.: Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statist. Sci. 20(1), 50–67 (2005)
[24] Jiang, D.Q., Qian, M., Qian, M.P.: Mathematical Theory of Nonequilibrium Steady States: On the Frontier of Probability and Dynamical Systems. Lecture Notes in Mathematics. Springer, Berlin; New York (2004)
[25] Kipnis, C., Varadhan, S.R.S.: Central limit theorem for additive functionals of reversible Markov processes and applications to simple exclusions. Comm. Math. Phys. 104(1), 1–19 (1986)
[26] Kong, A., McCullagh, P., Meng, X.L., Nicolae, D., Tan, Z.: A theory of statistical models for Monte Carlo integration. Journal of the Royal Statistical Society. Series B: Statistical Methodology 65(3), 585–604 (2003)
[27] Kumar, S., Rosenberg, J.M., Bouzida, D., Swendsen, R.H., Kollman, P.A.: The weighted histogram analysis method for free-energy calculations on biomolecules. I. The method. J. Comput. Chem. 13(8), 1011–1021 (1992)
[28] Laio, A., Parrinello, M.: Escaping free-energy minima. Proceedings of the National Academy of Sciences 99(20), 12562–12566 (2002)
[29] Legoll, F., Lelièvre, T.: Effective dynamics using conditional expectations. Nonlinearity 23(9), 2131–2163 (2010)
[30] Lelièvre, T., Rousset, M., Stoltz, G.: Free Energy Computations: A Mathematical Perspective. Imperial College Press, London (2010)
[31] Lelièvre, T., Stoltz, G.: Partial differential equations and stochastic methods in molecular dynamics.
Acta Numerica 25, 681 (2016)
[32] Liu, J.S.: Monte Carlo Strategies in Scientific Computing. Springer Series in Statistics. Springer-Verlag, New York (2001)
[33] Maragliano, L., Vanden-Eijnden, E.: A temperature accelerated method for sampling free energy and determining reaction pathways in rare events simulations. Chemical Physics Letters 426(1-3), 168–175 (2006)
[34] Matthews, C., Weare, J., Kravtsov, A., Jennings, E.: Umbrella sampling: a powerful method to sample tails of distributions (2017)
[35] Meng, X.L., Wong, W.H.: Simulating ratios of normalizing constants via a simple identity: a theoretical exploration. Statistica Sinica 6(4), 831–860 (1996)
[36] Pavliotis, G.A.: Stochastic Processes and Applications: Diffusion Processes, the Fokker-Planck and Langevin Equations. Texts in Applied Mathematics, vol. 60. Springer, New York (2014)
[37] Pavliotis, G.A., Stuart, A.M.: Multiscale Methods: Averaging and Homogenization. Texts in Applied Mathematics, vol. 53. Springer, New York (2008)
[38] Payne, L.E., Weinberger, H.F.: An optimal Poincaré inequality for convex domains. Archive for Rational Mechanics and Analysis 5(1), 286–292 (1960)
[39] Richardson, S., Green, P.J.: On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59(4), 731–792 (1997)
[40] Roberts, G., Tweedie, R.: Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms. Biometrika 83(1), 95 (1996). DOI 10.1093/biomet/83.1.95
[41] Roberts, G.O., Rosenthal, J.S.: Geometric ergodicity and hybrid Markov chains. Electron. Comm. Probab. 2, no. 2, 13–25 (1997)
[42] Roberts, G.O., Tweedie, R.L.: Exponential convergence of Langevin distributions and their discrete approximations.
Bernoulli 2(4), 341–363 (1996)
[43] Rossky, P.J., Doll, J.D., Friedman, H.L.: Brownian dynamics as smart Monte Carlo simulation. The Journal of Chemical Physics 69(10), 4628–4633 (1978). DOI 10.1063/1.436415
[44] Shirts, M.R., Chodera, J.D.: Statistically optimal analysis of samples from multiple equilibrium states. The Journal of Chemical Physics 129(12), 124105 (2008)
[45] Sugita, Y., Kitao, A., Okamoto, Y.: Multidimensional replica-exchange method for free-energy calculations. J. Chem. Phys. 113(15), 11 (2000)
[46] Swendsen, R.H., Wang, J.S.: Replica Monte Carlo simulation of spin-glasses. Physical Review Letters 57(21), 2607 (1986)
[47] Thiede, E., Van Koten, B., Weare, J.: Sharp entrywise perturbation bounds for Markov chains. SIAM Journal on Matrix Analysis and Applications 36(3), 917–941 (2015)
[48] Thiede, E.H., Van Koten, B., Weare, J., Dinner, A.R.: Eigenvector method for umbrella sampling enables error analysis. The Journal of Chemical Physics 145(8), 084115 (2016)
[49] Torrie, G.M., Valleau, J.P.: Nonphysical sampling distributions in Monte Carlo free-energy estimation: Umbrella sampling. Journal of Computational Physics 23(2), 187 (1977)
[50] VanDerwerken, D.N., Schmidler, S.C.: Parallel Markov chain Monte Carlo (2013)
[51] Vardi, Y.: Empirical distributions in selection bias models. The Annals of Statistics 13(1), 178–203 (1985)
[52] Wang, F., Landau, D.P.: Efficient, multiple-range random walk algorithm to calculate the density of states. Phys. Rev. Lett. 86, 2050–2053 (2001)
[53] Wang, F.Y., Yan, L.: Gradient estimate on convex domains and applications. Proc. Amer. Math. Soc. 141(3), 1067–1081 (2013)
