Random construction of interpolating sets for high dimensional integration

Version: November 10, 2018

Mark L. Huber, Claremont McKenna College, mhuber@cmc.edu
Sarah Schott, Duke University, schott@math.duke.edu

Abstract

Many high dimensional integrals can be reduced to the problem of finding the relative measures of two sets. Often one set will be exponentially larger than the other, making it difficult to compare the sizes. A standard method of dealing with this problem is to interpolate between the sets with a sequence of nested sets where neighboring sets have relative measures bounded above by a constant. Choosing such a well balanced sequence can be very difficult in practice. Here a new approach that automatically creates such sets is presented. These well balanced sets allow for faster approximation algorithms for integrals and sums, and better tempering and annealing Markov chains for generating random samples. Applications such as finding the partition function of the Ising model and normalizing constants for posterior distributions in Bayesian methods are discussed.

1 Introduction

Monte Carlo methods for numerical integration can have enormous variance for the types of high dimensional problems that arise in statistics and combinatorial optimization applications. Consider a state space Ω with measure µ, and B ⊂ Ω with finite measure. Then the problem considered here is approximating

    Z = ∫_{x ∈ B} dµ(x).                                        (1.1)

The classical Monte Carlo approach is to create a random variable X such that E(X) = Z where X has variance as small as possible. Unfortunately, it is often not possible to know the variance of X ahead of time, and this must be estimated as well. How good the estimate of the variance is depends on even higher moments which are even more difficult to estimate.
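As a concrete illustration of this variance problem, here is a minimal sketch (the target set B = [0, 1/2]^n and the sample sizes are illustrative assumptions, not from the paper): hit-or-miss Monte Carlo for a target exponentially smaller than the sampling region almost always returns 0, and the relative standard deviation of a single sample is roughly 2^{n/2}.

```python
import random

# Hit-or-miss Monte Carlo for Z = mu(B) when B is exponentially smaller
# than the sampling region.  Here B = [0, 1/2]^n inside Omega = [0, 1]^n
# (an illustrative choice, not from the paper), so Z = 2^-n.  With n = 20
# and 10^5 samples the sample mean is usually 0, and the relative
# standard deviation of a single sample is about 2^(n/2), i.e. over 1000.
rng = random.Random(0)
n, N = 20, 100_000
hits = sum(all(rng.random() < 0.5 for _ in range(n)) for _ in range(N))
print(hits / N, 2.0 ** -n)
```

With these sample sizes the expected number of hits is below 0.1, so the naive estimator carries essentially no information about Z.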
The method presented here creates an estimate of Z of the form e^{X/k}, where k is a known constant and X is a Poisson random variable with mean k ln(Z). Because the mean and variance for a Poisson random variable are the same, we simultaneously obtain our estimate of Z and knowledge of the variance of our estimate. In fact, the output from our method does the following:

• Estimates Z to within a specified relative error with a specified failure probability in time O(ln(Z)²).

• Creates a well balanced sequence of nested sets useful in building annealing and tempering Markov chains that can be used to generate Monte Carlo samples.

• Develops an omnithermal approximation for partition functions arising from spatial point processes and Gibbs distributions.

Previous work

The new method presented here follows a long line of work using interpolating sets. For instance, Valleau and Card [18] introduced what they called multistage sampling, where an intermediate distribution was added to make estimation more effective. Jerrum, Valiant and Vazirani [8] used a similar idea of self-reducibility, and carefully analyzed the computational complexity of the resulting approximation method. Suppose we are given two finite sets B′ and B such that B′ ⊂ B and #B (the number of elements of B) is known. One way of viewing self-reducibility is that it effectively requires a sequence of sets B = B_0 ⊃ B_1 ⊃ B_2 ⊃ ··· ⊃ B_ℓ = B′ such that the relative sizes of the sets satisfy #B_{i+1}/#B_i ≥ α for a fixed constant α ∈ (0, 1). Then an unbiased estimate b̂_i of #B_{i+1}/#B_i is created for each i. The product of these estimates will then be an unbiased estimator for #B′/#B, and multiplying by #B gives the final estimate of #B′.
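A minimal sketch of this product estimator follows. The nested sets here are intervals B_i = [0, 0.2^i], so each true ratio is exactly α = 0.2; both the sets and the sample sizes are illustrative assumptions, not from the paper.

```python
import random

# Self-reducibility product estimator on nested intervals B_i = [0, 0.2**i]
# (illustrative choice: every true ratio is exactly alpha = 0.2).
# Each ratio is estimated by uniform sampling, then the estimates multiply.
rng = random.Random(4)
alpha, levels, m = 0.2, 5, 50_000
prod = 1.0
for i in range(levels):
    upper = alpha ** i                # B_i = [0, upper]
    hits = sum(rng.uniform(0.0, upper) <= alpha ** (i + 1) for _ in range(m))
    prod *= hits / m                  # unbiased estimate of the i-th ratio
print(prod, alpha ** levels)          # estimate vs true ratio mu(B_5)/mu(B_0)
```

Because each factor is unbiased and the levels are independent, the product is an unbiased estimate of the overall ratio, here 0.2⁵.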
For fixed α ∈ (0, 1), it is easy to estimate #B_{i+1}/#B_i with small relative error simply by drawing samples from B_i and counting the percentage that fall in B_{i+1}. The relative standard deviation of a Bernoulli random variable with parameter α is √((1 − α)/α), so it is important not to make α too small. On the other hand, if α is too large, then the nested sets are not shrinking much at each step, and it will require a lengthy sequence of such sets. To be precise, the number of sets ℓ must satisfy ℓ ≥ log_α(#B′/#B) = ln(#B/#B′)/ln(α^{−1}), which goes to infinity as α goes to 1. Balancing these two considerations leads to an optimal α value of about 0.2031.

The difficulty in applying self-reducibility is finding a sequence of sets such that #B_{i+1}/#B_i is provably at least α, but not so close to 1 that the sequence of sets is too long. Ideally, #B_{i+1}/#B_i would equal α for every i, or at least be very close. For fixed constants α_1 and α_2, refer to a sequence of sets where the ratios #B_{i+1}/#B_i fall in [α_1, α_2] for all i as well-balanced.

Well-balanced sequences have other uses as well. Methods of designing Markov chains such as simulated annealing [10], simulated tempering, and parallel tempering [16, 5, 11] all require such a sequence of well-balanced sets in order to mix rapidly (see [19, 20]).

Now consider the special case of (1.1) where Z is the normalizing constant of a posterior distribution of a Bayesian analysis. Skilling [15] introduced nested sampling as a way of generating a random sequence of nested sets. The advantage this method has over self-reducibility is that there is no need to have the sequence of sets in hand ahead of time. Instead, it builds up sets from scratch at random according to a well-defined procedure.
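Returning to the constant α above: the optimum near 0.2031 can be recovered numerically. The cost model used here is our assumption, filling in the balancing argument: for a fixed final accuracy, the total number of samples scales like (number of levels)² times the per-level relative variance, giving g(α) = (1 − α)/(α · ln(α)²).

```python
import math

# Numerical check of the alpha ~ 0.2031 optimum.  Assumed cost model:
# per-level Bernoulli relative variance (1 - a)/a, number of levels
# proportional to 1/ln(1/a), and total work for fixed accuracy scaling
# like (levels)^2 * variance, i.e. g(a) = (1 - a) / (a * ln(a)**2).
def g(a):
    return (1 - a) / (a * math.log(a) ** 2)

lo, hi = 0.01, 0.99
phi = (math.sqrt(5) - 1) / 2
for _ in range(100):                  # golden-section search on (0.01, 0.99)
    m1, m2 = hi - phi * (hi - lo), lo + phi * (hi - lo)
    if g(m1) < g(m2):
        hi = m2
    else:
        lo = m1
print((lo + hi) / 2)                  # close to 0.2031
```

Setting the derivative of g to zero gives ln α = −2(1 − α), whose interior root is α ≈ 0.2032, matching the value quoted in the text.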
The disadvantage is that it loses the property of self-reducible algorithms that the variance of the output can be bounded prior to running the algorithm. Because deterministic numerical integration was used in the method, the variance can be determined only up to a factor that depends upon the derivatives of a function that is difficult to compute. Therefore, nested sampling falls in the class of methods where the variance must be estimated, rather than bounded ahead of time as with self-reducibility.

The Tootsie Pop Algorithm

The method presented here is called the Tootsie Pop Algorithm (TPA), and combines the tight analysis of self-reducibility with features similar to nested sampling. Like self-reducibility, it is very general, working over a wide variety of problems. This includes the nested sampling domain of Bayesian posterior normalization, but also includes many other problems where self-reducibility has been applied, such as the Ising model. Portions of this work were presented at the Ninth Valencia International Meetings on Bayesian Statistics, and also appear in the conference proceedings [7] with a discussion.

The name is somewhat unusual, and references an advertising campaign run for Tootsie Pop candies. A Tootsie Pop is a chocolate chewy center surrounded by a candy shell. The ad campaign asked "How many licks does it take to get to the center of a Tootsie Pop?" Our algorithm operates in a similar fashion. Our set B is slowly whittled away until the center B′ is reached. The number of steps taken to move from B to B′ will be Poisson with mean ln(µ(B)/µ(B′)), thereby allowing approximation of µ(B)/µ(B′). Therefore, the "number of licks" is exactly what is needed to form our estimate!

1.1 Organization

Section 2 describes the TPA procedure in detail, then Section 3 shows some applications.
Section 4 then analyzes the expected running time of the method, and introduces a two phase approach to TPA. Section 5 describes how TPA can be used to build well-balanced nested sets for tempering. Section 6 shows how to create an approximation that simultaneously works for all members of a continuous family of sets at once. Finally, Section 7 discusses further areas of exploration with TPA techniques.

2 The Tootsie Pop Algorithm

The TPA method has four general ingredients:

1. A measure space (Ω, F, µ).

2. Two finite measurable sets B and B′ satisfying B′ ⊂ B and µ(B′) > 0. The set B′ is the center and B is the shell.

3. A family of nested sets {A(β) : β ∈ R ∪ {∞}} such that β′ < β implies A(β′) ⊆ A(β), µ(A(β)) is a continuous function of β, and the limit of µ(A(β)) as β goes to −∞ is 0.

4. Special values β_B and β_{B′} that satisfy A(β_B) = B and A(β_{B′}) = B′.

With these ingredients, the TPA method is very simple to describe.

1. Start with i = 0 and β_i = β_B.
2. Draw a random sample Y from µ conditioned to lie in A(β_i).
3. Let β_{i+1} = inf{β : Y ∈ A(β)}.
4. If Y ∈ B′, stop and output i.
5. Else set i to be i + 1 and go back to step 2.

Another way of describing the draw in step 2 is that for measurable D,

    P(Y ∈ D) = µ(D ∩ A(β_i))/µ(A(β_i)).

At each step, the set A(β_i) shrinks with probability 1, and so is slowly worn away until the sample falls into the region B′.

Step 2 above deserves special attention. Drawing a random sample Y from µ conditioned to lie in A(β_i) is in general a very difficult problem. The good news is that the importance of this problem means that a vast literature for solving it exists. Markov chain Monte Carlo (MCMC) methods are critical to obtaining these samples, and variations on the early methods have blossomed over the last fifty years.
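The five steps can be sketched on a toy instance where step 2 is trivial: take µ to be Lebesgue measure on [0, β_B] with A(β) = [0, β] (this concrete family is an illustrative assumption, not from the paper), so that sampling from µ restricted to A(β_i) is a single uniform draw and β_{i+1} is just Y itself.

```python
import math, random

# Toy instance of TPA: mu is Lebesgue measure on [0, beta_B] and
# A(beta) = [0, beta] (illustrative family), so step 2 is one uniform
# draw.  The output of one run is Poisson with mean ln(mu(B)/mu(B')).
def tpa_run(beta_shell, beta_center, rng):
    i, beta = 0, beta_shell
    while True:
        beta = rng.uniform(0.0, beta)   # steps 2 and 3 combined
        if beta <= beta_center:         # step 4: Y landed in the center
            return i
        i += 1                          # step 5

rng = random.Random(1)
k = 20_000
avg = sum(tpa_run(math.e ** 2, 1.0, rng) for _ in range(k)) / k
print(avg)   # estimates ln(mu(B)/mu(B')) = ln(e^2 / 1) = 2
```

Averaging the output over many runs recovers ln(µ(B)/µ(B′)), here ln(e²/1) = 2, as the theory below explains.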
Readers are referred to [14, 13, 3] and the references therein for more information. Of course, any other method for turning samples into approximations either implicitly or explicitly depends on the ability to execute some variant of step 2 as well, so our algorithm is not actually demanding anything above and beyond what others require. The algorithm is easily modified to handle different methods of simulating random variables. For instance, nested sampling [15] draws several such Y variables at once, and TPA can be written to do so as well.

The key fact about this process is the following:

Theorem 2.1. At any step of the algorithm, let E_i = ln(µ(A(β_i))) − ln(µ(A(β_{i+1}))). Then the E_i are independent, identically distributed exponential random variables with mean 1.

Proof. To simplify the notation, let m(b) = µ(A(b)). Begin by showing that each U_i = m(β_{i+1})/m(β_i) is uniform over [0, 1]. Fix β_i ≥ β_{B′} and let a ∈ (0, 1). Then since m(b) is a continuous function in b with lim_{b→−∞} m(b) = 0, there exists a b ∈ (−∞, β_i] such that m(b)/m(β_i) = a. Call this value β_a. Let 0 < ε < m(β_B)/m(β_i) − a. Then by the same reasoning there is a value β_{a+ε} ≤ β_B such that m(β_{a+ε})/m(β_i) = a + ε.

Now consider Y drawn from µ conditioned to lie in A(β_i). Then β_{i+1} = inf{b : Y ∈ A(b)}. Moreover, {Y ∈ A(β_a)} ⇒ {U_i ≤ a}, and the former event occurs with probability m(β_a)/m(β_i) = a. So P(U_i ≤ a) ≥ a. On the other hand, Y ∉ A(β_{a+ε}) implies β_{i+1} ≥ β_{a+ε}, which means that U_i = m(β_{i+1})/m(β_i) ≥ a + ε. In other words, P(U_i < a + ε) ≤ P(Y ∈ A(β_{a+ε})) = a + ε. This holds for ε arbitrarily small, hence taking ε to 0 gives P(U_i ≤ a) ≤ a. Therefore, P(U_i ≤ a) is at least and at most a, so P(U_i ≤ a) = a, which shows that U_i is uniform on [0, 1].
Observing that the negative of the natural log of a uniform random variable on [0, 1] is exponential with mean 1 completes the proof. □

For t(b) = ln(µ(A(b))), the theorem says the points t(β_0), t(β_1), ..., t(β_i) in a run of TPA are separated by exponential random variables of mean 1; in other words, these points form a homogeneous Poisson point process on [t(β_{B′}), t(β_B)] of rate 1.

2.1 Taking advantage of Poisson point processes

The first application of this is to describe the total number of points used by a run of TPA, that is, the value of i at the end of the algorithm. Because the t(β_i) values form a Poisson point process, the distribution of i is Poisson with mean

    t(β_B) − t(β_{B′}) = ln(µ(B)/µ(B′)).

Furthermore, the union of k independent Poisson point processes of rate 1 is also a Poisson point process, of rate k. That means that after k runs of TPA, the distribution of the total number of samples used is Poisson with mean k ln(µ(B)/µ(B′)).

3 Applications

The following examples illustrate some of the uses for TPA.

3.1 The Ising model

The Ising model is an example of a Gibbs distribution, where a function H : Ω → R gives rise to a distribution on Ω:

    π(x) = (1/Z(β)) exp(−βH(x)).                                (3.1)

From applications in statistical physics, β ≥ 0 is known as the inverse temperature, and Z(β) is called the partition function.

In the Ising model, each node of a graph G = (V, E) is assigned one of two values. There are many ways to represent the model. In the form considered here, each node is either 0 or 1, and for x ∈ {0, 1}^V, −H(x) is one plus the number of edges e ∈ E such that the endpoints of the edge have the same value in x. This can be written as

    H(x) = −[1 + Σ_{{i,j} ∈ E} (1 − x(i) − x(j) + 2x(i)x(j))].
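The closed form above can be checked against the edge-count description of −H on a tiny graph (the 4-cycle here is an illustrative choice, not from the paper): each term 1 − x(i) − x(j) + 2x(i)x(j) equals 1 exactly when x(i) = x(j).

```python
import itertools

# Check that the closed form for H agrees with the edge-count description
# of -H(x) on a small graph (a 4-cycle; the graph choice is illustrative).
EDGES = [(0, 1), (1, 2), (2, 3), (3, 0)]

def neg_H_count(x):
    return 1 + sum(x[i] == x[j] for i, j in EDGES)

def neg_H_formula(x):
    return 1 + sum(1 - x[i] - x[j] + 2 * x[i] * x[j] for i, j in EDGES)

assert all(neg_H_count(x) == neg_H_formula(x)
           for x in itertools.product([0, 1], repeat=4))
print("closed form matches on all 16 configurations")
```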
In order to embed this problem in the framework of TPA, add an auxiliary dimension to the configuration x. The auxiliary state space is

    Ω_aux(β) = {(x, y) : x ∈ {0, 1}^V, y ∈ [0, exp(−βH(x))]}.

Some notes on Ω_aux(β):

• The total length of the line segments in Ω_aux(β) is just Z(β). That is to say, µ(Ω_aux(β)) = Z(β), where µ is the one dimensional Lebesgue measure of the union of the line segments.

• Let β′ < β. Then since −H(x) > 0, Ω_aux(β′) ⊂ Ω_aux(β). Moreover, Z(β) is a continuous function that goes to 0 as β → −∞. Therefore Condition 3 of the TPA ingredients is satisfied.

• For β = 0, y ∈ [0, 1] for all x ∈ {0, 1}^V. That means Z(0) = 2^{#V}.

• Let β > 0. Then Ω_aux(β) is the shell, and Ω_aux(0) is the center.

With this in mind, the TPA algorithm works as follows.

1. Start with i = 0 and β_i = β.
2. Draw a random sample X from π_{β_i}, then draw Y (given X) uniformly from [0, exp(−β_i H(X))].
3. Let β_{i+1} = ln(Y)/(−H(X)).
4. If β_{i+1} ≤ 0, stop and output i.
5. Else set i to be i + 1 and go back to step 2.

One run of TPA will require on average 1 + ln(Z(β)/Z(0)) = 1 + ln(Z(β)) − #V ln(2) samples from various values of β, where #V is the number of vertices of the graph. This method of adding an auxiliary variable allows TPA to be used on a variety of discrete distributions by changing the measure to one that varies continuously in the index.

3.2 Posterior distributions

In Bayesian analysis, often it is necessary to find the normalizing constant of a posterior distribution. This is known as the evidence for a model, and can be written

    Z = ∫_{x ∈ Ω} f(x) dx,

where f(x) is a nonnegative density (the product of the prior density and the likelihood of the data) and Ω ⊆ R^n. For a point c ∈ Ω and ε > 0, let B_ε^1(c) be the set of points within L1 distance ε of c.
Suppose that for a particular c and ε, B_ε^1(c) ⊂ Ω and there is a known M such that (1/2)M ≤ f(x) ≤ M for all x ∈ B_ε^1(c). Then to estimate Z(ε) = ∫_{x ∈ B_ε^1(c)} f(x) dx, draw N iid samples X_1, ..., X_N uniformly from B_ε^1(c), and let the estimate be

    Ẑ(ε) = vol(B_ε^1(c)) · (1/N) Σ_{i=1}^N f(X_i),

where vol(B_ε^1(c)) = (2ε)^n/n! is the Lebesgue volume of the L1 ball. Then Ẑ(ε) is an unbiased estimate for Z(ε) with standard deviation bounded above by Z(ε)/√N.

Now the connection to TPA can be made. The family of sets will be {A(β) = B_β^1(c) ∩ Ω}, and the measure is µ(A(β)) = ∫_{x ∈ A(β)} f(x) dx. The shell will be A(∞) (so Z = µ(A(∞))) and the center A(ε) (with measure Z(ε)). TPA can then be used to estimate Z/Z(ε), and the estimate of Z(ε) can then finish the job.

4 Running time of TPA

Suppose that TPA is run k times, and the k values of the i variable at the end of each run are summed together. Call this sum N. Then N has a Poisson distribution with mean k ln(µ(B)/µ(B′)). This makes N/k an unbiased estimate of ln(µ(B)/µ(B′)). The variance of N/k is ln(µ(B)/µ(B′))/k.

Let W be a normal random variable of mean 0 and variance 1, and for α ∈ (0, 1) let W_α be the value satisfying P(W ≥ W_α) = α. Then the normal approximation to the Poisson gives

    [ N/k − W_{α/2}(√N)/k,  N/k + W_{α/2}(√N)/k ]              (4.1)

as an approximately 1 − α level confidence interval for ln(µ(B)/µ(B′)). Exponentiating then gives the 1 − α level interval for µ(B)/µ(B′). For a specific output, it is also possible to build an exact confidence interval for µ(B)/µ(B′), since the distribution of the output is known exactly. Similarly, it is easy to perform a Bayesian analysis and find a credible interval given a prior on ln(µ(B)/µ(B′)).
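The point estimate and the interval (4.1) can be sketched on a toy family A(β) = [0, β] with Lebesgue measure (an illustrative assumption, not from the paper), chosen so that the true value ln(µ(B)/µ(B′)) = ln(e²) = 2 is known and can be checked.

```python
import math, random

# Sketch of the Poisson estimate N/k and interval (4.1) on the toy
# family A(beta) = [0, beta] with Lebesgue measure (illustrative), where
# the true value ln(mu(B)/mu(B')) = 2 is known.
def tpa_run(beta_shell, beta_center, rng):
    i, beta = 0, beta_shell
    while True:
        beta = rng.uniform(0.0, beta)
        if beta <= beta_center:
            return i
        i += 1

rng = random.Random(6)
k = 4000
N = sum(tpa_run(math.e ** 2, 1.0, rng) for _ in range(k))  # Poisson, mean 2k
z = 1.959963985                       # W_{alpha/2} for a 95% interval
half = z * math.sqrt(N) / k           # estimated standard deviation of N/k
lo, hi = N / k - half, N / k + half   # interval (4.1) for ln(mu(B)/mu(B'))
print(N / k, (lo, hi), math.exp(N / k))
```

Exponentiating the endpoints gives the corresponding interval for µ(B)/µ(B′) itself.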
Lastly, consider how to build an (ε, δ) randomized approximation scheme (RAS) whose output Â satisfies

    P( (1 + ε)^{−1} ≤ Â/(µ(B)/µ(B′)) ≤ 1 + ε ) > 1 − δ.

For simplicity, assume that µ(B)/µ(B′) ≥ e. Note that when µ(B)/µ(B′) < e, simple acceptance/rejection can be used to obtain an (ε, δ)-RAS in Θ(ε^{−2} ln(δ^{−1})) time; see [4] for a description of this method. The following lemma gives a bound on the tails of the Poisson distribution.

Lemma 4.1. Let ε̃ > 0 and N be a Poisson random variable with mean kλ, where ε̃/λ ≤ 2.3. Then

    P( |N/k − λ| ≥ ε̃ ) ≤ 2 exp( −(kε̃²/(2λ))(1 − ε̃/λ) ).

(This result is a special case of Theorem 6.1 shown later.)

To obtain our (ε, δ)-RAS, it is sufficient to make ε̃ = ln(1 + ε), and to choose k so that 2 exp(−kε̃²(1 − ε̃/λ)/(2λ)) ≤ δ, where λ = ln(µ(B)/µ(B′)). This is made more difficult by the fact that λ is unknown at the start of the algorithm! There are many ways around this difficulty; perhaps the simplest is to use a two phase method. First get a rough estimate of λ, then refine this estimate to the level demanded by ε.

Phase I. Let ε_a = ln(1 + ε) and k_1 = 2ε_a^{−2}(1 − ε_a)^{−1} ln(2δ^{−1}). Then let N_1 be the sum of the outputs from k_1 runs of TPA.

Phase II. Set k_2 = N_1(1 − ε_a)^{−1}. Let N_2 be the sum of the outputs from k_2 runs of TPA. The final estimate is exp(N_2/k_2).

Phase I estimates λ to within an additive error ε_a λ. Phase II uses the Phase I estimate of λ to create a better estimate of λ to within an additive error of ε_a. Note that ε_a ≈ ε in the sense that lim_{ε→0} ε_a/ε = 1.

Theorem 4.1. The output Â of the above procedure is an (ε, δ) randomized approximation scheme for µ(B)/µ(B′). The running time is random, with an expected running time that is Θ((ln(µ(B)/µ(B′)))² ε^{−2} ln(δ^{−1})).

Proof.
Call Phase I a success if N_1/k_1 is within distance ε_a λ of λ. From Lemma 4.1 with ε̃ = ε_a λ:

    P( |N_1/k_1 − λ| ≥ ε_a λ ) ≤ 2 exp( −k_1 λ ε_a²(1 − ε_a)/2 ) ≤ δ/2,

since λ ≥ 1 and k_1 = 2ε_a^{−2}(1 − ε_a)^{−1} ln(2δ^{−1}). Therefore, the probability that Phase I is a failure is at most δ/2. When Phase I is a success, (1 − ε_a)λk_1 ≤ N_1. In this event

    k_2 = N_1(1 − ε_a)^{−1} ≥ λk_1 = 2λε_a^{−2}(1 − ε_a)^{−1} ln(2δ^{−1}).

Plugging this into Lemma 4.1 yields:

    P( |N_2/k_2 − λ| ≥ ε_a ) ≤ 2 exp( −2λε_a^{−2}(1 − ε_a)^{−1} ln(2δ^{−1}) · (ε_a²/(2λ)) · (1 − ε_a/λ) ).

Using λ ≥ 1, the right hand side is at most δ/2. The chance of failure in either phase is at most δ/2 + δ/2 = δ, so altogether |(N_2/k_2) − λ| ≤ ε_a with probability at least 1 − δ. Exponentiating then gives

    (1 + ε)^{−1} = e^{−ε_a} ≤ exp(N_2/k_2)/exp(λ) ≤ e^{ε_a} = 1 + ε

with probability at least 1 − δ.

The expected number of samples needed in Phase I is k_1 λ, while the expected number needed in Phase II is

    E(N_2) = E(E(N_2 | N_1)) = E((1 − ε_a)^{−1} N_1 λ) = 2(1 − ε_a)^{−2} ε_a^{−2} ln(2δ^{−1}) λ².

Since ε_a = Θ(ε), the proof is complete. □

5 Well-balanced nested sets

Consider running TPA k times, and collecting all the values of β_i generated during these runs. Let P denote this set of values; then the transformed values t(β) for β ∈ P, where t(b) = ln(µ(A(b))) as before, form a Poisson point process of rate k on [t(β_{B′}), t(β_B)].

Call β_B = α_0 > α_1 > ··· > α_ℓ = β_{B′} a well-balanced cooling schedule if µ(A(α_{i+1}))/µ(A(α_i)) is close to 1/e for all i from 0 to ℓ − 1. Given P, finding such a well-balanced schedule is easy: simply order the β values in P in decreasing order, and set α_i = β_{(ik)}, the (ik)-th largest value. The value of ln(µ(A(α_i))/µ(A(α_{i+1}))) will have distribution equal to the sum of k iid exponential random variables with mean 1/k, so it will be gamma distributed with mean 1 and standard deviation 1/√k.
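The cooling-schedule construction can be sketched on the toy family A(β) = [0, β] with Lebesgue measure (an illustrative assumption, not from the paper): pool the β values from k runs, keep every k-th order statistic, and check that consecutive measure ratios concentrate near 1/e.

```python
import math, random

# Well-balanced cooling schedule on the illustrative family
# A(beta) = [0, beta]: pool beta values from k TPA runs, then take
# every k-th order statistic, bracketed by beta_B and beta_B'.
def tpa_betas(beta_shell, beta_center, rng):
    betas, beta = [], beta_shell
    while True:
        beta = rng.uniform(0.0, beta)
        if beta <= beta_center:
            return betas
        betas.append(beta)

rng = random.Random(11)
k, beta_B, beta_Bp = 200, math.e ** 4, 1.0
pool = sorted((b for _ in range(k) for b in tpa_betas(beta_B, beta_Bp, rng)),
              reverse=True)
schedule = [beta_B] + pool[k - 1::k] + [beta_Bp]   # alpha_i = beta_{(ik)}
ratios = [schedule[j + 1] / schedule[j] for j in range(len(schedule) - 1)]
print([round(r, 3) for r in ratios])   # interior ratios concentrate near 1/e
```

Here µ(A(β)) = β, so each printed ratio is the measure ratio of consecutive sets; by the gamma concentration above, the interior ratios sit within a few multiples of 1/√k of 1/e ≈ 0.368 (the final ratio, ending at the center, is truncated).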
6 Omnithermal approximation

Suppose instead of just a single value of interest µ(B)/µ(B′), it is necessary to create an approximation of µ(A(β))/µ(B′) that is valid for all values β ∈ [β_{B′}, β_B] simultaneously. Call this an omnithermal approximation. These problems appear in what are called doubly intractable posterior distributions arising in Bayesian analyses involving spatial point processes. They are usually dealt with indirectly using Markov chain Monte Carlo with auxiliary variables [12], but omnithermal approximation allows for a more direct approach.

In the last section, the point process P formed from the β_i values collected from k runs of TPA was introduced; recall that the transformed values t(b) = ln(µ(A(b))) for b ∈ P form a Poisson point process of rate k. To move from this point process to a Poisson counting process, set

    N_P(s) = #{b ∈ P : t(b) ≥ t(β_B) − s}.

As s advances from 0 to t(β_B) − t(β_{B′}) = ln(µ(B)/µ(B′)), N_P(s) increases by 1 whenever it passes one of the points. By the theory of Poisson point processes, this happens at intervals that are independent exponential random variables with rate k. Note that since t is nondecreasing, the count at s = t(β_B) − t(β) can be computed without knowing t: it is just the number of points of P that are at least β. Given N_P, approximate µ(B)/µ(A(β)) by exp(#{b ∈ P : b ≥ β}/k). When β = β_{B′}, this is just the approximation given earlier, so this generalizes the description of TPA from before.

The key fact is that N_P(s) − ks is a right continuous martingale. To bound the error in the approximation, it is necessary to bound the probability that N_P(s) − ks has drifted too far away from 0.

Theorem 6.1. Let ε̃ > 0. Then for N_P(·) a rate k Poisson process on [0, λ], where ε̃/λ ≤ 2.3:

    P( sup_{s ∈ [0,λ]} |N_P(s)/k − s| ≥ ε̃ ) ≤ 2 exp( −(kε̃²/(2λ))(1 − ε̃/λ) ).

Proof. The approach will be similar to finding a Chernoff bound [2]. Since exp(αx) is increasing and convex for any positive constant α, and N_P(s) − ks is a right continuous martingale, exp(α(N_P(s) − ks)) is a right continuous submartingale.
Let A_U denote the event that (N_P(s)/k) − s > ε̃ for some s ∈ [0, λ]. Then for all α > 0:

    P(A_U) = P( sup_{s ∈ [0,λ]} exp(α(N_P(s) − ks)) > exp(αkε̃) ).

It follows from basic Markov-type inequalities on right continuous submartingales (p. 13 of [9]) that this probability can be upper bounded as

    P(A_U) ≤ E( exp(α(N_P(λ) − kλ)) )/exp(αkε̃).

Using the moment generating function for a Poisson with parameter kλ, E[exp(αN_P(λ))] = exp(kλ(exp(α) − 1)), which means

    P(A_U) ≤ [ exp( λ(e^α − 1 − α) − αε̃ ) ]^k.

A Taylor series expansion shows that e^α − 1 − α ≤ (α²/2)(1 + α) as long as α ∈ [0, 2.31858...]. Set α = ε̃/λ. Simplifying the resulting upper bound yields

    P(A_U) ≤ exp( −(kε̃²/(2λ))(1 − ε̃/λ) ).

The other tail can be dealt with in a similar fashion, yielding the bound

    P( inf_{s ∈ [0,λ]} [N_P(s)/k − s] ≤ −ε̃ ) ≤ exp( −kε̃²/(2λ) ).

The union bound on the two tails then yields the theorem. □

Corollary 6.1. For ε ∈ (0, 0.3), δ ∈ (0, 1), and ln(µ(B)/µ(B′)) > 1, after

    k = 2 ln(µ(B)/µ(B′))(3ε^{−1} + ε^{−2}) ln(2/δ)

runs of TPA, the points obtained can be used to build an (ε, δ) omnithermal approximation.

Proof. In order for the final approximation to be within a multiplicative factor of 1 + ε of the true result, the log of the approximation must be accurate to an additive term of ln(1 + ε). Let λ = ln(µ(B)/µ(B′)), so k = 2λ(3ε^{−1} + ε^{−2}) ln(2/δ). To prove the corollary from the theorem, it suffices to show that

    2 exp( −2λ(3ε^{−1} + ε^{−2}) ln(2/δ)[ln(1 + ε)]²(1 − ε/λ)/(2λ) ) < δ.

After canceling the factors of λ, and noting that when λ > 1 we have 1 − ε/λ > 1 − ε, it suffices to show that

    (3ε^{−1} + ε^{−2})(1 − ε)[ln(1 + ε)]² > 1.

This can be shown for ε ∈ (0, 0.3) by a Taylor series expansion. □

6.1 Example: Omnithermal approximation for the Ising model

Consider the following model.
The value of β is drawn from a prior density f_prior(·) on [0, ∞), and then the data (conditioned on β) is drawn from the Ising model. This was used by Besag [1] as a model for agriculture, wherein soil quality of adjacent plots was more likely to be similar. Given the data X, the posterior in the Bayesian analysis is the following density on β:

    f_post(b) ∝ f_prior(b) exp(−bH(X))/Z(b).                    (6.1)

The evidence for the model is the integral of the right hand side of (6.1) as b runs from 0 to ∞. This is only a one-dimensional integration, and so should be straightforward from a numerical perspective, except that Z(b) is unknown.

Here is where the omnithermal approximation comes in: it gives an approximation for Z(b) that is valid for all values of b at once. Any numerical integration technique can be used, and the final value for the evidence (not including error arising from the numerical method) will be within a factor of 1 + ε of the true answer.

Figure 1 presents two omnithermal approximations for ln(Z_β) generated using this method on a small 4 × 4 square lattice. The top graph is the result of a single run of TPA from β = 2 down to β = 0. At each β value returned by TPA, the approximation drops by 1. The bottom graph is the result of ⌈ln(4 · 10⁶)⌉ = 16 runs of TPA. This run told us that Z_2 ≤ 217 with confidence 1 − 10⁻⁶/2. Therefore, using ε = 0.1 and δ = 10⁻⁶/2 in Corollary 6.1 shows that r = 330000 samples suffice for a (0.1, 10⁻⁶) omnithermal approximation.

7 Conclusions and further work

The strength of TPA is the generality of the procedure, but that same generality means that it is possible to do better in restricted circumstances. For instance, when f(x) falls into the class of Gibbs distributions, Štefankovič et al.
[17] were able to give an Õ(ln(Z)) algorithm for approximating Z, but the high constants involved in their algorithm make it solely of theoretical interest. (Here the Õ notation hides logarithmic factors.) TPA can be used in conjunction with their algorithm [6] to build an O(ln(Z) ln(ln(Z))) algorithm, and work continues to bring this running time down to O(ln(Z)).

[Figure 1: Omnithermal approximations for the partition function of the Ising model on a 4 × 4 lattice. Top panel: one run of TPA; bottom panel: sixteen runs. Each panel plots ln(Z_β) for β between 0 and 2.]

Acknowledgments

Both authors were supported in this work by NSF CAREER grant DMS-05-48153. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

References

[1] J. Besag. Spatial interaction and the statistical analysis of lattice systems (with discussion). J. R. Statist. Soc. Ser. B Stat. Methodol., 36:192–236, 1974.

[2] H. Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. of Math. Stat., 23:493–509, 1952.

[3] G. S. Fishman. Choosing sample path length and number of sample paths when starting in the steady state. Oper. Res. Letters, 16:209–219, 1994.

[4] G. S. Fishman. Monte Carlo: concepts, algorithms, and applications. Springer-Verlag, 1996.

[5] C. J. Geyer. Markov chain Monte Carlo maximum likelihood. In Proceedings of the 23rd Symposium on the Interface: Computing Science and Statistics, pages 156–163, 1991.

[6] M. Huber. The paired product estimator approach to approximating Gibbs partition functions. Preprint, 2011.

[7] M. L. Huber and S. Schott. Using TPA for Bayesian inference. Bayesian Statistics 9, pages 257–282, 2010.

[8] M. Jerrum, L. Valiant, and V. Vazirani.
Random generation of combinatorial structures from a uniform distribution. Theoret. Comput. Sci., 43:169–188, 1986.

[9] I. Karatzas and S. E. Shreve. Brownian Motion and Stochastic Calculus (2nd ed.). Springer, 1991.

[10] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.

[11] E. Marinari and G. Parisi. Simulated tempering: a new Monte Carlo scheme. Europhys. Lett., 19(6):451–458, 1992.

[12] I. Murray, Z. Ghahramani, and D. J. C. MacKay. MCMC for doubly-intractable distributions. In Proc. of the 22nd Conf. on Uncertainty in Artificial Intelligence (UAI), 2006.

[13] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer, 2010.

[14] R. W. Shonkwiler and F. Mendivil. Explorations in Monte Carlo Methods. Springer, 2009.

[15] J. Skilling. Nested sampling for general Bayesian computation. Bayesian Anal., 1(4):833–860, 2006.

[16] R. Swendsen and J.-S. Wang. Replica Monte Carlo simulation of spin glasses. Phys. Rev. Lett., 57:2607–2609, 1986.

[17] D. Štefankovič, S. Vempala, and E. Vigoda. Adaptive simulated annealing: a near-optimal connection between sampling and counting. J. of the ACM, 56(3):1–36, 2009.

[18] J. P. Valleau and D. N. Card. Monte Carlo estimation of the free energy by multistage sampling. J. Chem. Phys., 57(12):5457–5462, 1972.

[19] D. B. Woodard, S. C. Schmidler, and M. Huber. Conditions for rapid mixing of parallel and simulated tempering on multimodal distributions. Ann. of Appl. Prob., 19(2):617–640, 2009.

[20] D. B. Woodard, S. C. Schmidler, and M. Huber. Sufficient conditions for torpid mixing of parallel and simulated tempering. Electron. J. Probab., 14:780–804, 2009.