Random construction of interpolating sets for high dimensional integration

Version: November 10, 2018

Mark L. Huber, Claremont McKenna College, mhuber@cmc.edu
Sarah Schott, Duke University, schott@math.duke.edu

Abstract

Many high dimensional integrals can be reduced to the problem of finding the relative measures of two sets. Often one set will be exponentially larger than the other, making it difficult to compare the sizes. A standard method of dealing with this problem is to interpolate between the sets with a sequence of nested sets where neighboring sets have relative measures bounded above by a constant. Choosing such a well balanced sequence can be very difficult in practice. Here a new approach that automatically creates such sets is presented. These well balanced sets allow for faster approximation algorithms for integrals and sums, and better tempering and annealing Markov chains for generating random samples. Applications such as finding the partition function of the Ising model and normalizing constants for posterior distributions in Bayesian methods are discussed.

1 Introduction

Monte Carlo methods for numerical integration can have enormous variance for the types of high dimensional problems that arise in statistics and combinatorial optimization applications. Consider a state space Ω with measure µ, and B ⊂ Ω with finite measure. Then the problem considered here is approximating

    Z = ∫_{x ∈ B} dµ(x).                                        (1.1)

The classical Monte Carlo approach is to create a random variable X such that E(X) = Z where X has variance as small as possible. Unfortunately, it is often not possible to know the variance of X ahead of time, and this must be estimated as well. How good the estimate of the variance is depends on even higher moments which are even more difficult to estimate.
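As a concrete illustration of this variance problem, here is a minimal sketch (the target set B = [0, 1/2]^n and the sample sizes are illustrative assumptions, not from the paper): hit-or-miss Monte Carlo for a target exponentially smaller than the sampling region almost always returns 0, and the relative standard deviation of a single sample is roughly 2^{n/2}.

```python
import random

# Hit-or-miss Monte Carlo for Z = mu(B) when B is exponentially smaller
# than the sampling region.  Here B = [0, 1/2]^n inside Omega = [0, 1]^n
# (an illustrative choice, not from the paper), so Z = 2^-n.  With n = 20
# and 10^5 samples the sample mean is usually 0, and the relative
# standard deviation of a single sample is about 2^(n/2), i.e. over 1000.
rng = random.Random(0)
n, N = 20, 100_000
hits = sum(all(rng.random() < 0.5 for _ in range(n)) for _ in range(N))
print(hits / N, 2.0 ** -n)
```

With these sample sizes the expected number of hits is below 0.1, so the naive estimator carries essentially no information about Z.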
The method presented here creates an estimate of Z of the form e^{X/k}, where k is a known constant and X is a Poisson random variable with mean k ln(Z). Because the mean and variance for a Poisson random variable are the same, we simultaneously obtain our estimate of Z and knowledge of the variance of our estimate. In fact, the output from our method does the following:

• Estimates Z to within a specified relative error with a specified failure probability in time O(ln(Z)²).

• Creates a well balanced sequence of nested sets useful in building annealing and tempering Markov chains that can be used to generate Monte Carlo samples.

• Develops an omnithermal approximation for partition functions arising from spatial point processes and Gibbs distributions.

Previous work

The new method presented here follows a long line of work using interpolating sets. For instance, Valleau and Card [18] introduced what they called multistage sampling, where an intermediate distribution was added to make estimation more effective. Jerrum, Valiant and Vazirani [8] used a similar idea of self-reducibility, and carefully analyzed the computational complexity of the resulting approximation method. Suppose we are given two finite sets B′ and B such that B′ ⊂ B and #B (the number of elements of B) is known. One way of viewing self-reducibility is that it effectively requires a sequence of sets B = B_0 ⊃ B_1 ⊃ B_2 ⊃ ··· ⊃ B_ℓ = B′ such that the relative sizes of the sets satisfy #B_{i+1}/#B_i ≥ α for a fixed constant α ∈ (0, 1). Then an unbiased estimate b̂_i of #B_{i+1}/#B_i is created for each i. The product of these estimates will then be an unbiased estimator for #B′/#B, and multiplying by #B gives the final estimate of #B′.
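A minimal sketch of this product estimator follows. The nested sets here are intervals B_i = [0, 0.2^i], so each true ratio is exactly α = 0.2; both the sets and the sample sizes are illustrative assumptions, not from the paper.

```python
import random

# Self-reducibility product estimator on nested intervals B_i = [0, 0.2**i]
# (illustrative choice: every true ratio is exactly alpha = 0.2).
# Each ratio is estimated by uniform sampling, then the estimates multiply.
rng = random.Random(4)
alpha, levels, m = 0.2, 5, 50_000
prod = 1.0
for i in range(levels):
    upper = alpha ** i                # B_i = [0, upper]
    hits = sum(rng.uniform(0.0, upper) <= alpha ** (i + 1) for _ in range(m))
    prod *= hits / m                  # unbiased estimate of the i-th ratio
print(prod, alpha ** levels)          # estimate vs true ratio mu(B_5)/mu(B_0)
```

Because each factor is unbiased and the levels are independent, the product is an unbiased estimate of the overall ratio, here 0.2⁵.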
For fixed α ∈ (0, 1), it is easy to estimate #B_{i+1}/#B_i with small relative error simply by drawing samples from B_i and counting the percentage that fall in B_{i+1}. The relative standard deviation of a Bernoulli random variable with parameter α is √((1 − α)/α), so it is important not to make α too small. On the other hand, if α is too large, then the nested sets are not shrinking much at each step, and it will require a lengthy sequence of such sets. To be precise, the number of sets ℓ must satisfy ℓ ≥ log_α(#B′/#B) = ln(#B/#B′)/ln(α^{−1}), which goes to infinity as α goes to 1. Balancing these two considerations leads to an optimal α value of about 0.2031.

The difficulty in applying self-reducibility is finding a sequence of sets such that #B_{i+1}/#B_i is provably at least α, but not so close to 1 that the sequence of sets is too long. Ideally, #B_{i+1}/#B_i would equal α for every i, or at least be very close. For fixed constants α_1 and α_2, refer to a sequence of sets where the ratios #B_{i+1}/#B_i fall in [α_1, α_2] for all i as well-balanced.

Well-balanced sequences have other uses as well. Methods of designing Markov chains such as simulated annealing [10], simulated tempering, and parallel tempering [16, 5, 11] all require such a sequence of well-balanced sets in order to mix rapidly (see [19, 20]).

Now consider the special case of (1.1) where Z is the normalizing constant of a posterior distribution of a Bayesian analysis. Skilling [15] introduced nested sampling as a way of generating a random sequence of nested sets. The advantage this method has over self-reducibility is that there is no need to have the sequence of sets in hand ahead of time. Instead, it builds up sets from scratch at random according to a well-defined procedure.
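Returning to the constant α above: the optimum near 0.2031 can be recovered numerically. The cost model used here is our assumption, filling in the balancing argument: for a fixed final accuracy, the total number of samples scales like (number of levels)² times the per-level relative variance, giving g(α) = (1 − α)/(α · ln(α)²).

```python
import math

# Numerical check of the alpha ~ 0.2031 optimum.  Assumed cost model:
# per-level Bernoulli relative variance (1 - a)/a, number of levels
# proportional to 1/ln(1/a), and total work for fixed accuracy scaling
# like (levels)^2 * variance, i.e. g(a) = (1 - a) / (a * ln(a)**2).
def g(a):
    return (1 - a) / (a * math.log(a) ** 2)

lo, hi = 0.01, 0.99
phi = (math.sqrt(5) - 1) / 2
for _ in range(100):                  # golden-section search on (0.01, 0.99)
    m1, m2 = hi - phi * (hi - lo), lo + phi * (hi - lo)
    if g(m1) < g(m2):
        hi = m2
    else:
        lo = m1
print((lo + hi) / 2)                  # close to 0.2031
```

Setting the derivative of g to zero gives ln α = −2(1 − α), whose interior root is α ≈ 0.2032, matching the value quoted in the text.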
The disadvantage is that it loses the property of self-reducible algorithms that the variance of the output can be bounded prior to running the algorithm. Because deterministic numerical integration was used in the method, the variance can be determined only up to a factor that depends upon the derivatives of a function that is difficult to compute. Therefore, nested sampling falls in the class of methods where the variance must be estimated, rather than bounded ahead of time as with self-reducibility.

The Tootsie Pop Algorithm

The method presented here is called the Tootsie Pop Algorithm (TPA), and combines the tight analysis of self-reducibility with features similar to nested sampling. Like self-reducibility, it is very general, working over a wide variety of problems. This includes the nested sampling domain of Bayesian posterior normalization, but also includes many other problems where self-reducibility has been applied, such as the Ising model. Portions of this work were presented at the Ninth Valencia International Meetings on Bayesian Statistics, and also appear in the conference proceedings [7] with a discussion.

The name is somewhat unusual, and references an advertising campaign run for Tootsie Pop candies. A Tootsie Pop is a chocolate chewy center surrounded by a candy shell. The ad campaign asked "How many licks does it take to get to the center of a Tootsie Pop?" Our algorithm operates in a similar fashion. Our set B is slowly whittled away until the center B′ is reached. The number of steps taken to move from B to B′ will be Poisson with mean ln(µ(B)/µ(B′)), thereby allowing approximation of µ(B)/µ(B′). Therefore, the "number of licks" is exactly what is needed to form our estimate!

1.1 Organization

Section 2 describes the TPA procedure in detail, then Section 3 shows some applications.
Section 4 then analyzes the expected running time of the method, and introduces a two phase approach to TPA. Section 5 describes how TPA can be used to build well-balanced nested sets for tempering. Section 6 shows how to create an approximation that simultaneously works for all members of a continuous family of sets at once. Finally, Section 7 discusses further areas of exploration with TPA techniques.

2 The Tootsie Pop Algorithm

The TPA method has four general ingredients:

1. A measure space (Ω, F, µ).

2. Two finite measurable sets B and B′ satisfying B′ ⊂ B and µ(B′) > 0. The set B′ is the center and B is the shell.

3. A family of nested sets {A(β) : β ∈ R ∪ {∞}} such that β′ < β implies A(β′) ⊆ A(β), µ(A(β)) is a continuous function of β, and the limit of µ(A(β)) as β goes to −∞ is 0.

4. Special values β_B and β_{B′} that satisfy A(β_B) = B and A(β_{B′}) = B′.

With these ingredients, the TPA method is very simple to describe.

1. Start with i = 0 and β_i = β_B.
2. Draw a random sample Y from µ conditioned to lie in A(β_i).
3. Let β_{i+1} = inf{β : Y ∈ A(β)}.
4. If Y ∈ B′, stop and output i.
5. Else set i to be i + 1 and go back to step 2.

Another way of describing the draw in step 2 is that for measurable D,

    P(Y ∈ D) = µ(D ∩ A(β_i))/µ(A(β_i)).

At each step, the set A(β_i) shrinks with probability 1, and so is slowly worn away until the sample falls into the region B′.

Step 2 above deserves special attention. Drawing a random sample Y from µ conditioned to lie in A(β_i) is in general a very difficult problem. The good news is that the importance of this problem means that a vast literature for solving it exists. Markov chain Monte Carlo (MCMC) methods are critical to obtaining these samples, and variations on the early methods have blossomed over the last fifty years.
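The five steps can be sketched on a toy instance where step 2 is trivial: take µ to be Lebesgue measure on [0, β_B] with A(β) = [0, β] (this concrete family is an illustrative assumption, not from the paper), so that sampling from µ restricted to A(β_i) is a single uniform draw and β_{i+1} is just Y itself.

```python
import math, random

# Toy instance of TPA: mu is Lebesgue measure on [0, beta_B] and
# A(beta) = [0, beta] (illustrative family), so step 2 is one uniform
# draw.  The output of one run is Poisson with mean ln(mu(B)/mu(B')).
def tpa_run(beta_shell, beta_center, rng):
    i, beta = 0, beta_shell
    while True:
        beta = rng.uniform(0.0, beta)   # steps 2 and 3 combined
        if beta <= beta_center:         # step 4: Y landed in the center
            return i
        i += 1                          # step 5

rng = random.Random(1)
k = 20_000
avg = sum(tpa_run(math.e ** 2, 1.0, rng) for _ in range(k)) / k
print(avg)   # estimates ln(mu(B)/mu(B')) = ln(e^2 / 1) = 2
```

Averaging the output over many runs recovers ln(µ(B)/µ(B′)), here ln(e²/1) = 2, as the theory below explains.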
Readers are referred to [14, 13, 3] and the references therein for more information. Of course, any other method for turning samples into approximations either implicitly or explicitly depends on the ability to execute some variant of step 2 as well, so our algorithm is not actually demanding anything above and beyond what others require. The algorithm is easily modified to handle different methods of simulating random variables. For instance, nested sampling [15] draws several such Y variables at once, and TPA can be written to do so as well.

The key fact about this process is the following:

Theorem 2.1. At any step of the algorithm, let E_i = ln(µ(A(β_i))) − ln(µ(A(β_{i+1}))). Then the E_i are independent, identically distributed exponential random variables with mean 1.

Proof. To simplify the notation, let m(b) = µ(A(b)). Begin by showing that each U_i = m(β_{i+1})/m(β_i) is uniform over [0, 1]. Fix β_i ≥ β_{B′} and let a ∈ (0, 1). Then since m(b) is a continuous function in b with lim_{b→−∞} m(b) = 0, there exists a b ∈ (−∞, β_i] such that m(b)/m(β_i) = a. Call this value β_a. Let 0 < ε < m(β_B)/m(β_i) − a. Then by the same reasoning there is a value β_{a+ε} ≤ β_B such that m(β_{a+ε})/m(β_i) = a + ε.

Now consider Y drawn from µ conditioned to lie in A(β_i). Then β_{i+1} = inf{b : Y ∈ A(b)}. Moreover, {Y ∈ A(β_a)} ⇒ {U_i ≤ a}, and the former event occurs with probability m(β_a)/m(β_i) = a. So P(U_i ≤ a) ≥ a. On the other hand, Y ∉ A(β_{a+ε}) implies β_{i+1} ≥ β_{a+ε}, which means that U_i = m(β_{i+1})/m(β_i) ≥ a + ε. In other words, P(U_i < a + ε) ≤ P(Y ∈ A(β_{a+ε})) = a + ε. This holds for ε arbitrarily small, hence taking ε to 0 gives P(U_i ≤ a) ≤ a. Therefore, P(U_i ≤ a) is at least and at most a, so P(U_i ≤ a) = a, which shows that U_i is uniform on [0, 1].
Observing that the negative of the natural log of a uniform random variable on [0, 1] is exponential with mean 1 completes the proof. □

For t(b) = ln(µ(A(b))), the theorem says the points t(β_0), t(β_1), ..., t(β_i) in a run of TPA are separated by exponential random variables of mean 1; in other words, these points form a homogeneous Poisson point process on [t(β_{B′}), t(β_B)] of rate 1.

2.1 Taking advantage of Poisson point processes

The first application of this is to describe the total number of points used by a run of TPA, that is, the value of i at the end of the algorithm. Because the t(β_i) values form a Poisson point process, the distribution of i is Poisson with mean

    t(β_B) − t(β_{B′}) = ln(µ(B)/µ(B′)).

Furthermore, the union of k independent Poisson point processes of rate 1 is also a Poisson point process, of rate k. That means that after k runs of TPA, the distribution of the total number of samples used is Poisson with mean k ln(µ(B)/µ(B′)).

3 Applications

The following examples illustrate some of the uses for TPA.

3.1 The Ising model

The Ising model is an example of a Gibbs distribution, where a function H : Ω → R gives rise to a distribution on Ω:

    π(x) = (1/Z(β)) exp(−βH(x)).                                (3.1)

From applications in statistical physics, β ≥ 0 is known as the inverse temperature, and Z(β) is called the partition function.

In the Ising model, each node of a graph G = (V, E) is assigned one of two values. There are many ways to represent the model. In the form considered here, each node is either 0 or 1, and for x ∈ {0, 1}^V, −H(x) is one plus the number of edges e ∈ E such that the endpoints of the edge have the same value in x. This can be written as

    H(x) = −[1 + Σ_{{i,j} ∈ E} (1 − x(i) − x(j) + 2x(i)x(j))].
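The closed form above can be checked against the edge-count description of −H on a tiny graph (the 4-cycle here is an illustrative choice, not from the paper): each term 1 − x(i) − x(j) + 2x(i)x(j) equals 1 exactly when x(i) = x(j).

```python
import itertools

# Check that the closed form for H agrees with the edge-count description
# of -H(x) on a small graph (a 4-cycle; the graph choice is illustrative).
EDGES = [(0, 1), (1, 2), (2, 3), (3, 0)]

def neg_H_count(x):
    return 1 + sum(x[i] == x[j] for i, j in EDGES)

def neg_H_formula(x):
    return 1 + sum(1 - x[i] - x[j] + 2 * x[i] * x[j] for i, j in EDGES)

assert all(neg_H_count(x) == neg_H_formula(x)
           for x in itertools.product([0, 1], repeat=4))
print("closed form matches on all 16 configurations")
```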
In order to embed this problem in the framework of TPA, add an auxiliary dimension to the configuration x. The auxiliary state space is

    Ω_aux(β) = {(x, y) : x ∈ {0, 1}^V, y ∈ [0, exp(−βH(x))]}.

Some notes on Ω_aux(β):

• The total length of the line segments in Ω_aux(β) is just Z(β). That is to say, µ(Ω_aux(β)) = Z(β), where µ is the one dimensional Lebesgue measure of the union of the line segments.

• Let β′ < β. Then since −H(x) > 0, Ω_aux(β′) ⊂ Ω_aux(β). Moreover, Z(β) is a continuous function that goes to 0 as β → −∞. Therefore Condition 3 of the TPA ingredients is satisfied.

• For β = 0, y ∈ [0, 1] for all x ∈ {0, 1}^V. That means Z(0) = 2^{#V}.

• Let β > 0. Then Ω_aux(β) is the shell, and Ω_aux(0) is the center.

With this in mind, the TPA algorithm works as follows.

1. Start with i = 0 and β_i = β.
2. Draw a random sample X from π_{β_i}, then draw Y (given X) uniformly from [0, exp(−β_i H(X))].
3. Let β_{i+1} = ln(Y)/(−H(X)).
4. If β_{i+1} ≤ 0, stop and output i.
5. Else set i to be i + 1 and go back to step 2.

One run of TPA will require on average 1 + ln(Z(β)/Z(0)) = 1 + ln(Z(β)) − #V ln(2) samples from various values of β, where #V is the number of vertices of the graph. This method of adding an auxiliary variable allows TPA to be used on a variety of discrete distributions by changing the measure to one that varies continuously in the index.

3.2 Posterior distributions

In Bayesian analysis, often it is necessary to find the normalizing constant of a posterior distribution. This is known as the evidence for a model, and can be written

    Z = ∫_{x ∈ Ω} f(x) dx,

where f(x) is a nonnegative density (the product of the prior density and the likelihood of the data) and Ω ⊆ R^n. For a point c ∈ Ω and ε > 0, let B_ε^1(c) be the set of points within L1 distance ε of c.
Suppose that for a particular c and ε, B_ε^1(c) ⊂ Ω and there is a known M such that (1/2)M ≤ f(x) ≤ M for all x ∈ B_ε^1(c). Then to estimate Z(ε) = ∫_{x ∈ B_ε^1(c)} f(x) dx, draw N iid samples X_1, ..., X_N uniformly from B_ε^1(c), and let the estimate be

    Ẑ(ε) = vol(B_ε^1(c)) · (1/N) Σ_{i=1}^N f(X_i),

where vol(B_ε^1(c)) = (2ε)^n/n! is the Lebesgue volume of the L1 ball. Then Ẑ(ε) is an unbiased estimate for Z(ε) with standard deviation bounded above by Z(ε)/√N.

Now the connection to TPA can be made. The family of sets will be {A(β) = B_β^1(c) ∩ Ω}, and the measure is µ(A(β)) = ∫_{x ∈ A(β)} f(x) dx. The shell will be A(∞) (so Z = µ(A(∞))) and the center A(ε) (with measure Z(ε)). TPA can then be used to estimate Z/Z(ε), and the estimate of Z(ε) can then finish the job.

4 Running time of TPA

Suppose that TPA is run k times, and the k values of the i variable at the end of each run are summed together. Call this sum N. Then N has a Poisson distribution with mean k ln(µ(B)/µ(B′)). This makes N/k an unbiased estimate of ln(µ(B)/µ(B′)). The variance of N/k is ln(µ(B)/µ(B′))/k.

Let W be a normal random variable of mean 0 and variance 1, and for α ∈ (0, 1) let W_α be the value satisfying P(W ≥ W_α) = α. Then the normal approximation to the Poisson gives

    [ N/k − W_{α/2}(√N)/k,  N/k + W_{α/2}(√N)/k ]              (4.1)

as an approximately 1 − α level confidence interval for ln(µ(B)/µ(B′)). Exponentiating then gives the 1 − α level interval for µ(B)/µ(B′). For a specific output, it is also possible to build an exact confidence interval for µ(B)/µ(B′), since the distribution of the output is known exactly. Similarly, it is easy to perform a Bayesian analysis and find a credible interval given a prior on ln(µ(B)/µ(B′)).
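The point estimate and the interval (4.1) can be sketched on a toy family A(β) = [0, β] with Lebesgue measure (an illustrative assumption, not from the paper), chosen so that the true value ln(µ(B)/µ(B′)) = ln(e²) = 2 is known and can be checked.

```python
import math, random

# Sketch of the Poisson estimate N/k and interval (4.1) on the toy
# family A(beta) = [0, beta] with Lebesgue measure (illustrative), where
# the true value ln(mu(B)/mu(B')) = 2 is known.
def tpa_run(beta_shell, beta_center, rng):
    i, beta = 0, beta_shell
    while True:
        beta = rng.uniform(0.0, beta)
        if beta <= beta_center:
            return i
        i += 1

rng = random.Random(6)
k = 4000
N = sum(tpa_run(math.e ** 2, 1.0, rng) for _ in range(k))  # Poisson, mean 2k
z = 1.959963985                       # W_{alpha/2} for a 95% interval
half = z * math.sqrt(N) / k           # estimated standard deviation of N/k
lo, hi = N / k - half, N / k + half   # interval (4.1) for ln(mu(B)/mu(B'))
print(N / k, (lo, hi), math.exp(N / k))
```

Exponentiating the endpoints gives the corresponding interval for µ(B)/µ(B′) itself.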
Lastly, consider how to build an (ε, δ) randomized approximation scheme (RAS) whose output Â satisfies

    P( (1 + ε)^{−1} ≤ Â/(µ(B)/µ(B′)) ≤ 1 + ε ) > 1 − δ.

For simplicity, assume that µ(B)/µ(B′) ≥ e. Note that when µ(B)/µ(B′) < e, simple acceptance/rejection can be used to obtain an (ε, δ)-RAS in Θ(ε^{−2} ln(δ^{−1})) time; see [4] for a description of this method. The following lemma gives a bound on the tails of the Poisson distribution.

Lemma 4.1. Let ε̃ > 0 and N be a Poisson random variable with mean kλ, where ε̃/λ ≤ 2.3. Then

    P( |N/k − λ| ≥ ε̃ ) ≤ 2 exp( −(kε̃²/(2λ))(1 − ε̃/λ) ).

(This result is a special case of Theorem 6.1 shown later.)

To obtain our (ε, δ)-RAS, it is sufficient to make ε̃ = ln(1 + ε), and to choose k so that 2 exp(−kε̃²(1 − ε̃/λ)/(2λ)) ≤ δ, where λ = ln(µ(B)/µ(B′)). This is made more difficult by the fact that λ is unknown at the start of the algorithm! There are many ways around this difficulty; perhaps the simplest is to use a two phase method. First get a rough estimate of λ, then refine this estimate to the level demanded by ε.

Phase I. Let ε_a = ln(1 + ε) and k_1 = 2ε_a^{−2}(1 − ε_a)^{−1} ln(2δ^{−1}). Then let N_1 be the sum of the outputs from k_1 runs of TPA.

Phase II. Set k_2 = N_1(1 − ε_a)^{−1}. Let N_2 be the sum of the outputs from k_2 runs of TPA. The final estimate is exp(N_2/k_2).

Phase I estimates λ to within an additive error ε_a λ. Phase II uses the Phase I estimate of λ to create a better estimate of λ to within an additive error of ε_a. Note that ε_a ≈ ε in the sense that lim_{ε→0} ε_a/ε = 1.

Theorem 4.1. The output Â of the above procedure is an (ε, δ) randomized approximation scheme for µ(B)/µ(B′). The running time is random, with an expected running time that is Θ((ln(µ(B)/µ(B′)))² ε^{−2} ln(δ^{−1})).

Proof.
Call Phase I a success if N_1/k_1 is within distance ε_a λ of λ. From Lemma 4.1 with ε̃ = ε_a λ:

    P( |N_1/k_1 − λ| ≥ ε_a λ ) ≤ 2 exp( −k_1 λ ε_a²(1 − ε_a)/2 ) ≤ δ/2,

since λ ≥ 1 and k_1 = 2ε_a^{−2}(1 − ε_a)^{−1} ln(2δ^{−1}). Therefore, the probability that Phase I is a failure is at most δ/2. When Phase I is a success, (1 − ε_a)λk_1 ≤ N_1. In this event

    k_2 = N_1(1 − ε_a)^{−1} ≥ λk_1 = 2λε_a^{−2}(1 − ε_a)^{−1} ln(2δ^{−1}).

Plugging this into Lemma 4.1 yields:

    P( |N_2/k_2 − λ| ≥ ε_a ) ≤ 2 exp( −2λε_a^{−2}(1 − ε_a)^{−1} ln(2δ^{−1}) · (ε_a²/(2λ)) · (1 − ε_a/λ) ).

Using λ ≥ 1, the right hand side is at most δ/2. The chance of failure in either phase is at most δ/2 + δ/2 = δ, so altogether |(N_2/k_2) − λ| ≤ ε_a with probability at least 1 − δ. Exponentiating then gives

    (1 + ε)^{−1} = e^{−ε_a} ≤ exp(N_2/k_2)/exp(λ) ≤ e^{ε_a} = 1 + ε

with probability at least 1 − δ.

The expected number of samples needed in Phase I is k_1 λ, while the expected number needed in Phase II is

    E(N_2) = E(E(N_2 | N_1)) = E((1 − ε_a)^{−1} N_1 λ) = 2(1 − ε_a)^{−2} ε_a^{−2} ln(2δ^{−1}) λ².

Since ε_a = Θ(ε), the proof is complete. □

5 Well-balanced nested sets

Consider running TPA k times, and collecting all the values of β_i generated during these runs. Let P denote this set of values; then the transformed values t(β) for β ∈ P, where t(b) = ln(µ(A(b))) as before, form a Poisson point process of rate k on [t(β_{B′}), t(β_B)].

Call β_B = α_0 > α_1 > ··· > α_ℓ = β_{B′} a well-balanced cooling schedule if µ(A(α_{i+1}))/µ(A(α_i)) is close to 1/e for all i from 0 to ℓ − 1. Given P, finding such a well-balanced schedule is easy: simply order the β values in P in decreasing order, and set α_i = β_{(ik)}, the (ik)-th largest value. The value of ln(µ(A(α_i))/µ(A(α_{i+1}))) will have distribution equal to the sum of k iid exponential random variables with mean 1/k, so it will be gamma distributed with mean 1 and standard deviation 1/√k.
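The cooling-schedule construction can be sketched on the toy family A(β) = [0, β] with Lebesgue measure (an illustrative assumption, not from the paper): pool the β values from k runs, keep every k-th order statistic, and check that consecutive measure ratios concentrate near 1/e.

```python
import math, random

# Well-balanced cooling schedule on the illustrative family
# A(beta) = [0, beta]: pool beta values from k TPA runs, then take
# every k-th order statistic, bracketed by beta_B and beta_B'.
def tpa_betas(beta_shell, beta_center, rng):
    betas, beta = [], beta_shell
    while True:
        beta = rng.uniform(0.0, beta)
        if beta <= beta_center:
            return betas
        betas.append(beta)

rng = random.Random(11)
k, beta_B, beta_Bp = 200, math.e ** 4, 1.0
pool = sorted((b for _ in range(k) for b in tpa_betas(beta_B, beta_Bp, rng)),
              reverse=True)
schedule = [beta_B] + pool[k - 1::k] + [beta_Bp]   # alpha_i = beta_{(ik)}
ratios = [schedule[j + 1] / schedule[j] for j in range(len(schedule) - 1)]
print([round(r, 3) for r in ratios])   # interior ratios concentrate near 1/e
```

Here µ(A(β)) = β, so each printed ratio is the measure ratio of consecutive sets; by the gamma concentration above, the interior ratios sit within a few multiples of 1/√k of 1/e ≈ 0.368 (the final ratio, ending at the center, is truncated).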
6 Omnithermal approximation

Suppose instead of just a single value of interest µ(B)/µ(B′), it is necessary to create an approximation of µ(A(β))/µ(B′) that is valid for all values β ∈ [β_{B′}, β_B] simultaneously. Call this an omnithermal approximation. These problems appear in what are called doubly intractable posterior distributions arising in Bayesian analyses involving spatial point processes. They are usually dealt with indirectly using Markov chain Monte Carlo with auxiliary variables [12], but omnithermal approximation allows for a more direct approach.

In the last section, the point process P formed from the β_i values collected from k runs of TPA was introduced; recall that the transformed values t(b) = ln(µ(A(b))) for b ∈ P form a Poisson point process of rate k. To move from this point process to a Poisson counting process, set

    N_P(s) = #{b ∈ P : t(b) ≥ t(β_B) − s}.

As s advances from 0 to t(β_B) − t(β_{B′}) = ln(µ(B)/µ(B′)), N_P(s) increases by 1 whenever it passes one of the points. By the theory of Poisson point processes, this happens at intervals that are independent exponential random variables with rate k. Note that since t is nondecreasing, the count at s = t(β_B) − t(β) can be computed without knowing t: it is just the number of points of P that are at least β. Given N_P, approximate µ(B)/µ(A(β)) by exp(#{b ∈ P : b ≥ β}/k). When β = β_{B′}, this is just the approximation given earlier, so this generalizes the description of TPA from before.

The key fact is that N_P(s) − ks is a right continuous martingale. To bound the error in the approximation, it is necessary to bound the probability that N_P(s) − ks has drifted too far away from 0.

Theorem 6.1. Let ε̃ > 0. Then for N_P(·) a rate k Poisson process on [0, λ], where ε̃/λ ≤ 2.3:

    P( sup_{s ∈ [0,λ]} |N_P(s)/k − s| ≥ ε̃ ) ≤ 2 exp( −(kε̃²/(2λ))(1 − ε̃/λ) ).

Proof. The approach will be similar to finding a Chernoff bound [2]. Since exp(αx) is increasing and convex for any positive constant α, and N_P(s) − ks is a right continuous martingale, exp(α(N_P(s) − ks)) is a right continuous submartingale.
Let A_U denote the event that (N_P(s)/k) − s > ε̃ for some s ∈ [0, λ]. Then for all α > 0:

    P(A_U) = P( sup_{s ∈ [0,λ]} exp(α(N_P(s) − ks)) > exp(αkε̃) ).

It follows from basic Markov-type inequalities on right continuous submartingales (p. 13 of [9]) that this probability can be upper bounded as

    P(A_U) ≤ E( exp(α(N_P(λ) − kλ)) )/exp(αkε̃).

Using the moment generating function for a Poisson with parameter kλ, E[exp(αN_P(λ))] = exp(kλ(exp(α) − 1)), which means

    P(A_U) ≤ [ exp( λ(e^α − 1 − α) − αε̃ ) ]^k.

A Taylor series expansion shows that e^α − 1 − α ≤ (α²/2)(1 + α) as long as α ∈ [0, 2.31858...]. Set α = ε̃/λ. Simplifying the resulting upper bound yields

    P(A_U) ≤ exp( −(kε̃²/(2λ))(1 − ε̃/λ) ).

The other tail can be dealt with in a similar fashion, yielding the bound

    P( inf_{s ∈ [0,λ]} [N_P(s)/k − s] ≤ −ε̃ ) ≤ exp( −kε̃²/(2λ) ).

The union bound on the two tails then yields the theorem. □

Corollary 6.1. For ε ∈ (0, 0.3), δ ∈ (0, 1), and ln(µ(B)/µ(B′)) > 1, after

    k = 2 ln(µ(B)/µ(B′))(3ε^{−1} + ε^{−2}) ln(2/δ)

runs of TPA, the points obtained can be used to build an (ε, δ) omnithermal approximation.

Proof. In order for the final approximation to be within a multiplicative factor of 1 + ε of the true result, the log of the approximation must be accurate to an additive term of ln(1 + ε). Let λ = ln(µ(B)/µ(B′)), so k = 2λ(3ε^{−1} + ε^{−2}) ln(2/δ). To prove the corollary from the theorem, it suffices to show that

    2 exp( −2λ(3ε^{−1} + ε^{−2}) ln(2/δ)[ln(1 + ε)]²(1 − ε/λ)/(2λ) ) < δ.

After canceling the factors of λ, and noting that when λ > 1 we have 1 − ε/λ > 1 − ε, it suffices to show that

    (3ε^{−1} + ε^{−2})(1 − ε)[ln(1 + ε)]² > 1.

This can be shown for ε ∈ (0, 0.3) by a Taylor series expansion. □

6.1 Example: Omnithermal approximation for the Ising model

Consider the following model.
The value of β is drawn from a prior density f_prior(·) on [0, ∞), and then the data (conditioned on β) is drawn from the Ising model. This was used by Besag [1] as a model for agriculture, wherein soil quality of adjacent plots was more likely to be similar. Given the data X, the posterior in the Bayesian analysis is the following density on β:

    f_post(b) ∝ f_prior(b) exp(−bH(X))/Z(b).                    (6.1)

The evidence for the model is the integral of the right hand side of (6.1) as b runs from 0 to ∞. This is only a one-dimensional integration, and so should be straightforward from a numerical perspective, except that Z(b) is unknown.

Here is where the omnithermal approximation comes in: it gives an approximation for Z(b) that is valid for all values of b at once. Any numerical integration technique can be used, and the final value for the evidence (not including error arising from the numerical method) will be within a factor of 1 + ε of the true answer.

Figure 1 presents two omnithermal approximations for ln(Z_β) generated using this method on a small 4 × 4 square lattice. The top graph is the result of a single run of TPA from β = 2 down to β = 0. At each β value returned by TPA, the approximation drops by 1. The bottom graph is the result of ⌈ln(4 · 10⁶)⌉ = 16 runs of TPA. This run told us that Z_2 ≤ 217 with confidence 1 − 10⁻⁶/2. Therefore, using ε = 0.1 and δ = 10⁻⁶/2 in Corollary 6.1 shows that r = 330000 samples suffice for a (0.1, 10⁻⁶) omnithermal approximation.

7 Conclusions and further work

The strength of TPA is the generality of the procedure, but that same generality means that it is possible to do better in restricted circumstances. For instance, when f(x) falls into the class of Gibbs distributions, Štefankovič et al.
[17] were able to give an Õ(ln(Z)) algorithm for approximating Z, but the high constants involved in their algorithm make it solely of theoretical interest. (Here the Õ notation hides logarithmic factors.) TPA can be used in conjunction with their algorithm [6] to build an O(ln(Z) ln(ln(Z))) algorithm, and work continues to bring this running time down to O(ln(Z)).

[Figure 1: Omnithermal approximations for the partition function of the Ising model on a 4 × 4 lattice. Top panel: one run of TPA; bottom panel: sixteen runs. Each panel plots ln(Z_β) for β between 0 and 2.]

Acknowledgments

Both authors were supported in this work by NSF CAREER grant DMS-05-48153. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

References

[1] J. Besag. Spatial interaction and the statistical analysis of lattice systems (with discussion). J. R. Statist. Soc. Ser. B Stat. Methodol., 36:192–236, 1974.

[2] H. Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. of Math. Stat., 23:493–509, 1952.

[3] G. S. Fishman. Choosing sample path length and number of sample paths when starting in the steady state. Oper. Res. Letters, 16:209–219, 1994.

[4] G. S. Fishman. Monte Carlo: concepts, algorithms, and applications. Springer-Verlag, 1996.

[5] C. J. Geyer. Markov chain Monte Carlo maximum likelihood. In Proceedings of the 23rd Symposium on the Interface: Computing Science and Statistics, pages 156–163, 1991.

[6] M. Huber. The paired product estimator approach to approximating Gibbs partition functions. Preprint, 2011.

[7] M. L. Huber and S. Schott. Using TPA for Bayesian inference. Bayesian Statistics 9, pages 257–282, 2010.

[8] M. Jerrum, L. Valiant, and V. Vazirani.
Random generation of combinatorial structures from a uniform distribution. Theoret. Comput. Sci., 43:169–188, 1986.

[9] I. Karatzas and S. E. Shreve. Brownian Motion and Stochastic Calculus (2nd ed.). Springer, 1991.

[10] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.

[11] E. Marinari and G. Parisi. Simulated tempering: a new Monte Carlo scheme. Europhys. Lett., 19(6):451–458, 1992.

[12] I. Murray, Z. Ghahramani, and D. J. C. MacKay. MCMC for doubly-intractable distributions. In Proc. of the 22nd Conf. on Uncertainty in Artificial Intelligence (UAI), 2006.

[13] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer, 2010.

[14] R. W. Shonkwiler and F. Mendivil. Explorations in Monte Carlo Methods. Springer, 2009.

[15] J. Skilling. Nested sampling for general Bayesian computation. Bayesian Anal., 1(4):833–860, 2006.

[16] R. Swendsen and J.-S. Wang. Replica Monte Carlo simulation of spin glasses. Phys. Rev. Lett., 57:2607–2609, 1986.

[17] D. Štefankovič, S. Vempala, and E. Vigoda. Adaptive simulated annealing: a near-optimal connection between sampling and counting. J. of the ACM, 56(3):1–36, 2009.

[18] J. P. Valleau and D. N. Card. Monte Carlo estimation of the free energy by multistage sampling. J. Chem. Phys., 57(12):5457–5462, 1972.

[19] D. B. Woodard, S. C. Schmidler, and M. Huber. Conditions for rapid mixing of parallel and simulated tempering on multimodal distributions. Ann. of Appl. Prob., 19(2):617–640, 2009.

[20] D. B. Woodard, S. C. Schmidler, and M. Huber. Sufficient conditions for torpid mixing of parallel and simulated tempering. Electron. J. Probab., 14:780–804, 2009.