Simulation from endpoint-conditioned, continuous-time Markov chains on a finite state space, with applications to molecular evolution


Authors: Asger Hobolth, Eric A. Stone

The Annals of Applied Statistics 2009, Vol. 3, No. 3, 1204–1231. DOI: 10.1214/09-AOAS247. © Institute of Mathematical Statistics, 2009.

By Asger Hobolth (Aarhus University) and Eric A. Stone (North Carolina State University)

Abstract. Analyses of serially-sampled data often begin with the assumption that the observations represent discrete samples from a latent continuous-time stochastic process. The continuous-time Markov chain (CTMC) is one such generative model whose popularity extends to a variety of disciplines ranging from computational finance to human genetics and genomics. A common theme among these diverse applications is the need to simulate sample paths of a CTMC conditional on realized data that is discretely observed. Here we present a general solution to this sampling problem when the CTMC is defined on a discrete and finite state space. Specifically, we consider the generation of sample paths, including intermediate states and times of transition, from a CTMC whose beginning and ending states are known across a time interval of length $T$. We first unify the literature through a discussion of the three predominant approaches: (1) modified rejection sampling, (2) direct sampling, and (3) uniformization. We then give analytical results for the complexity and efficiency of each method in terms of the instantaneous transition rate matrix $Q$ of the CTMC, its beginning and ending states, and the length of sampling time $T$. In doing so, we show that no method dominates the others across all model specifications, and we give explicit proof of which method prevails for any given $Q$, $T$, and endpoints. Finally, we introduce and compare three applications of CTMCs to demonstrate the pitfalls of choosing an inefficient sampler.

1. Introduction.
This paper considers the problem of conditional sampling from a continuous-time Markov chain (CTMC) defined on a discrete and finite state space. In the ideal case, given continuously-observed sample paths, statistical inference is straightforward: the sufficient statistics are simply the number of transitions between any two states and the total time spent in each state [e.g., Guttorp (1995), Section 3.7]. In most applications of CTMCs, however, the stochastic process $\{X(t) : 0 \le t \le T\}$ serves as a continuous-time model for data sampled at discrete points in time $T_0 = 0 < T_1 < \cdots < T_{N-1} < T_N = T$. Sometimes this discretized problem remains amenable to closed-form analysis, and in such cases sampling is unnecessary. Examples from the literature include Holmes and Rubin (2002) and Hobolth and Jensen (2005), both of whom motivated EM algorithms by showing how to calculate the number of transitions between any two states and the time spent in a state for an endpoint-conditioned CTMC. More recently, Siepel, Pollard and Haussler (2006) showed how to calculate the probability mass function for the number of transitions, and Minin and Suchard (2008) derived analytically tractable results for the moments of the number of transitions between any two states.

[Received December 2007; revised December 2008. Supported by the US National Institutes of Health (Grant R01 GM070806). Key words and phrases: continuous-time Markov chains, simulation, molecular evolution.]
Our work complements these approaches, focusing on the case where parametric inference relies upon the simulation of continuous sample paths from the CTMC conditional on $X(T_0), \ldots, X(T_N)$. As a consequence of the Markov assumption, knowledge of the data $X(T_0), \ldots, X(T_N)$ effectively partitions the process into independent components $\{X(t) : T_k \le t \le T_{k+1}\}$ whose endpoints $X(T_k)$ and $X(T_{k+1})$ are known. Thus, sampling a realization from $\{X(t) : 0 \le t \le T\}$ given the observed data amounts to sampling from $N$ independent and identical CTMCs, each conditioned on its endpoints $X(T_k)$ and $X(T_{k+1})$ and spanning a fixed interval of time $T_{k+1} - T_k$. In what follows, we specialize to one of these $N$ components, focusing on how to simulate sample paths from a CTMC when only its endpoints are known. Crucially, while calculating the sufficient statistics from the simulated data remains easy, the simulations themselves may be prohibitively time-consuming without an efficient strategy for generating sample paths.

The wide applicability of CTMCs to serially-sampled data has accelerated their entry into the interdisciplinary literature, and several recent papers have specifically considered the problem of sampling paths from an endpoint-conditioned process. Blackwell (2003), for example, discusses a naive rejection sampling method as it applies to the analysis of radio-tracking data. A similar approach is considered in Bladt and Sørensen (2005), the motivation there coming from the field of mathematical finance. For our purposes, a naive rejection sampling method is one that simulates sample paths forward in time according to the specified process: as a consequence of endpoint conditioning, rejection of a sample path occurs whenever there is disagreement between the simulated and observed ending states.
Nielsen (2002) improves upon the naive rejection sampling approach in an application to molecular evolution. His method, which we call "modified rejection sampling," conditions on the event that at least one state change must have occurred in cases where the observed endpoints of the process are not the same. Nevertheless, despite Nielsen's improvement, sampling forward in time without specific regard to the ending state may lead to a prohibitively low rate of path acceptance. As an alternative, Hobolth (2008) suggests a direct sampling procedure based on analytical expressions for the probabilities of state transitions and their waiting times. A final approach, often called uniformization, originates with the work of Jensen (1953). The idea is to construct a related process that permits virtual transitions (in which the state does not change) so that the number of state transitions in a time interval can be seen as Poisson distributed and the state transitions themselves constitute a Markov chain. Sampling from this related process is equivalent to sampling from the target CTMC provided that the virtual changes are ignored. Recent applications can be found in Fearnhead and Sherlock (2006), Lartillot (2006), Mateiu and Rannala (2006), and Rodrigue, Philippe and Lartillot (2008).

Though developed for distinct applications, each of the aforementioned path-sampling procedures simulates from the same conditional distribution and thus solves the same problem. In light of this interchangeability, and because of the importance of path sampling to statistical inference on endpoint-conditioned CTMCs, it is imperative to ask whether one procedure should be preferred for its computational efficiency. The remainder of the paper seeks an exhaustive answer to this question.
We first give a thorough discussion in Section 2 of the three sampling strategies: (1) modified rejection sampling, (2) direct sampling, and (3) uniformization. In Section 3 we introduce three CTMCs that highlight the heterogeneous relationship between parameterization of the stochastic process and the efficiency of each sampling strategy. Section 4 then generalizes the results of Section 3 by providing analytical expressions for the efficiency of each sampler for arbitrary parameterizations of the CTMC. We conclude by summarizing the results into recommendations on how to best simulate sample paths from any given CTMC. In doing so, we also identify cases for which one or more of the potential strategies is guaranteed to fail.

2. Sampling strategies for endpoint-conditioned chains.

In this section we review three strategies for simulating a realization of a finite-state CTMC $\{X(t) : 0 \le t \le T\}$ conditional on its beginning state $X(0) = a$ and ending state $X(T) = b$. The chain is defined by its instantaneous rate matrix $Q$ with off-diagonal entries $Q_{ab} \ge 0$ and diagonal entries $Q_{aa} = -\sum_{b \ne a} Q_{ab} = -Q_a < 0$. We make the further assumption that $X(t)$ is irreducible and positive recurrent so that a stationary distribution $\pi$ exists. Finally, unless otherwise noted, we set $\sum_c \pi_c Q_c = 1$ so that one state change is expected per unit time.

To understand the sampling difficulties associated with conditioning a CTMC on its endpoints, it is useful to first review how one proceeds when the ending state $X(T)$ is unobserved. Simulating a sample path of $\{X(t) : 0 \le t \le T\}$ that begins at $X(0) = a$ can be accomplished by a simple iterative procedure. The key observation is that the waiting time $\tau$ to the first state change is exponentially distributed with mean $1/Q_a$.
If $\tau > T$, there is no state change in the interval $[0, T]$, and the corresponding sample path is constant; otherwise, a new state $c$ is drawn from the discrete probability distribution with probability masses $Q_{ac}/Q_a$, and the procedure is iterated for the shorter time interval $[\tau, T]$ (or, equivalently, for $[0, T - \tau]$). For reference, we present this forward sampling algorithm below:

Algorithm 1 (Forward sampling).
1. Sample $\tau \sim \text{Exponential}(Q_a)$. If $\tau \ge T$, we are done: $X(t) = a$ for all $t \in [0, T]$.
2. If $\tau < T$, choose a new state $c \ne a$ from a discrete probability distribution with probability masses $Q_{ac}/Q_a$. Repeat the procedure with new beginning state $c$ and new time interval $[\tau, T]$.

Under the assumption that the ending state $X(T) = b$ is observed, conditioning excludes all paths sampled from the preceding algorithm that fail to end in state $b$. This is the essence of the rejection sampling approach, whose modification by Nielsen we discuss in the next subsection.

2.1. Rejection sampling.

As implemented in Blackwell (2003) and Bladt and Sørensen (2005), naive rejection sampling uses forward sampling to generate candidate sample paths of an endpoint-conditioned CTMC. From these, the acceptable candidates are those for which the simulated ending state and the observed ending state are the same. In particular, when sampling forward, the probability of hitting the observed ending state $b$ is $P_{ab}(T) = [\exp(QT)]_{ab}$. Thus, if $T$ is large, this probability approximately equals the stationary probability $\pi_b$ of $b$. Conversely, if $T$ is small and $a \ne b$, the probability is approximately $Q_{ab} T$. It follows that in the case of (i) large time $T$ and small stationary probability $\pi_b$, or (ii) different states $a \ne b$ and small time $T$, naive rejection sampling is bound to fail.
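Algorithm 1 and the naive rejection scheme can be sketched as follows. This is a minimal Python sketch (the paper's supplementary programs are in R); the 2-state rate matrix and the function names are ours, chosen purely for illustration.

```python
import random

def forward_sample(Q, a, T, rng=random):
    """Algorithm 1: simulate a CTMC path starting in state `a` over [0, T].
    Q is the rate matrix as a list of lists; returns (states, jump_times)."""
    states, times = [a], [0.0]
    t, state = 0.0, a
    while True:
        rate = -Q[state][state]            # total exit rate Q_a
        t += rng.expovariate(rate)         # waiting time ~ Exponential(Q_a)
        if t >= T:
            return states, times           # no further change before T
        # choose the next state c != state with probability Q_{ac} / Q_a
        u, cum = rng.random() * rate, 0.0
        for c, q in enumerate(Q[state]):
            if c == state:
                continue
            cum += q
            if u <= cum:
                state = c
                break
        states.append(state)
        times.append(t)

def naive_rejection_sample(Q, a, b, T, rng=random):
    """Naive rejection: forward-sample until the path happens to end in b."""
    while True:
        states, times = forward_sample(Q, a, T, rng)
        if states[-1] == b:
            return states, times

# Illustrative 2-state chain (the rates are assumptions, not from the paper)
Q = [[-1.0, 1.0],
     [2.0, -2.0]]
states, times = naive_rejection_sample(Q, a=0, b=1, T=1.0)
```

Note that the acceptance check touches only the final state, which is why the expected number of wasted paths grows as $P_{ab}(T)$ shrinks.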
Nielsen's modification improves upon naive rejection sampling in the latter case [Nielsen (2002)]. By a conditioning argument, the time $\tau$ to the first state change, given that at least one state change occurs before $T$ and $X(0) = a$, has density

$$f(\tau) = \frac{Q_a e^{-\tau Q_a}}{1 - e^{-T Q_a}}, \qquad 0 \le \tau \le T. \tag{2.1}$$

The corresponding cumulative distribution function is

$$F(\tau) = \frac{1 - e^{-\tau Q_a}}{1 - e^{-T Q_a}}, \qquad 0 \le \tau \le T,$$

with explicit inverse $F^{-1}(u) = -\log[1 - u(1 - e^{-T Q_a})]/Q_a$. Thus, upon sampling $u$ from a Uniform(0, 1) distribution, transformation yields the sampled waiting time $F^{-1}(u)$ to the first state change of the CTMC.

Algorithm 2 (Modified rejection sampling).
If $a = b$:
1. Simulate from $\{X(t) : 0 \le t \le T\}$ using the forward sampling algorithm.
2. Accept the simulated path if $X(T) = a$; otherwise, return to step 1 and begin anew.
If $a \ne b$:
1. Sample $\tau$ from the density (2.1) using the inverse transformation method, and choose a new state $c \ne a$ from a discrete probability distribution with probability masses $Q_{ac}/Q_a$.
2. Simulate the remainder $\{X(t) : \tau \le t \le T\}$ using the forward sampling algorithm from the beginning state $X(\tau) = c$.
3. Accept the simulated path if $X(T) = b$; otherwise, return to step 1 and begin anew.

In short, modified rejection sampling explicitly avoids simulating constant sample paths when it is known that at least one state change must take place. This is particularly beneficial when $T$ is small, as the naive approach will be dominated by wasted constant sample paths whose ending state remains $a$ [which occurs with probability approximately $(1 - Q_a T)$]. Nevertheless, if the transition from $a$ to $b$ is unlikely so that $Q_{ab}/Q_a$ is small, then essentially every sample path will still be rejected. In such a setting, either direct sampling or uniformization is required.
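Because $F^{-1}$ is available in closed form, the conditional waiting time in (2.1) is a one-line transformation of a uniform draw. A sketch under the same caveats (Python rather than the paper's R; the parameter values below are arbitrary):

```python
import math
import random

def first_jump_time(Qa, T, u=None):
    """Sample tau, the time of the first state change, conditional on at
    least one change occurring in [0, T], via the explicit inverse CDF
    F^{-1}(u) = -log(1 - u*(1 - exp(-T*Qa))) / Qa   (inverse of eq. 2.1)."""
    if u is None:
        u = random.random()
    return -math.log(1.0 - u * (1.0 - math.exp(-T * Qa))) / Qa

# Sanity check: F(F^{-1}(u)) == u with F(t) = (1 - e^{-t Qa}) / (1 - e^{-T Qa})
Qa, T = 2.0, 0.5     # arbitrary exit rate and interval length
u = 0.3
tau = first_jump_time(Qa, T, u)
F = (1.0 - math.exp(-tau * Qa)) / (1.0 - math.exp(-T * Qa))
```

The inverse transform guarantees $0 < \tau < T$ for every $u \in (0, 1)$, so no draw is ever wasted on a constant path.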
2.2. Direct sampling.

The direct sampling procedure of Hobolth (2008) requires that the instantaneous rate matrix $Q$ admits an eigenvalue decomposition. Under that assumption, let $U$ be a matrix with the eigenvectors of $Q$ as columns and let $D_\lambda$ be the diagonal matrix of corresponding eigenvalues, such that $Q = U D_\lambda U^{-1}$. Then, for any time $t$, the transition probability matrix of the CTMC can be calculated as

$$P(t) = e^{Qt} = U e^{t D_\lambda} U^{-1} \quad \text{and} \quad P_{ab}(t) = \sum_j U_{aj} U^{-1}_{jb} e^{t \lambda_j}. \tag{2.2}$$

Consider first the case where the endpoints of the CTMC are identical, so that $X(0) = X(T) = a$. The probability that there are no state changes in the time interval $[0, T]$, conditional on $X(0) = a$ and $X(T) = a$, is given by

$$p_a = \frac{e^{-Q_a T}}{P_{aa}(T)}. \tag{2.3}$$

Thus, when $X(0) = X(T) = a$, the sample path is the constant $X(t) = a$ with probability $p_a$, and has at least one state change with probability $(1 - p_a)$.

Next consider the case where $X(0) = a$ and $X(T) = b$, with $a \ne b$. Let $\tau$ denote the waiting time until the first state change. The conditional probability that the first state change is to $i$ at a time smaller than $t$ is

$$P(\tau \le t, X(\tau) = i \mid X(0) = a, X(T) = b) = \frac{P(\tau \le t, X(\tau) = i, X(T) = b \mid X(0) = a)}{P(X(T) = b \mid X(0) = a)} = \int_0^t Q_a e^{-Q_a z} \frac{Q_{ai}}{Q_a} \frac{P_{ib}(T - z)}{P_{ab}(T)} \, dz = \int_0^t f_i(z) \, dz,$$

where $f_i(z)$ is the integrand. Specifically, conditional on the endpoints $X(0) = a$ and $X(T) = b$, the probability $p_i$ that the first state change is to $i$ is

$$p_i = \int_0^T f_i(t) \, dt, \qquad i \ne a, \ a \ne b. \tag{2.4}$$

Using (2.2), we can rewrite the integrand as

$$f_i(t) = Q_{ai} e^{-Q_a t} \frac{P_{ib}(T - t)}{P_{ab}(T)} = \frac{Q_{ai}}{P_{ab}(T)} \sum_j U_{ij} U^{-1}_{jb} e^{T \lambda_j} e^{-t(\lambda_j + Q_a)}, \tag{2.5}$$

which renders the integral in (2.4) straightforward. We get

$$p_i = \frac{Q_{ai}}{P_{ab}(T)} \sum_j U_{ij} U^{-1}_{jb} J_{aj}, \tag{2.6}$$

where

$$J_{aj} = \begin{cases} T e^{T \lambda_j}, & \text{if } \lambda_j + Q_a = 0, \\[4pt] \dfrac{e^{T \lambda_j} - e^{-Q_a T}}{\lambda_j + Q_a}, & \text{if } \lambda_j + Q_a \ne 0. \end{cases}$$

We now have a procedure for simulating the next state and the waiting time before the state change occurs. Iterating the procedure allows us to simulate a sample path $\{X(t) : 0 \le t \le T\}$ that begins in $X(0) = a$ and ends in $X(T) = b$.

Algorithm 3 (Direct sampling).
1. If $a = b$, sample $Z \sim \text{Bernoulli}(p_a)$, where $p_a$ is given by (2.3). If $Z = 1$, we are done: $X(t) = a$, $0 \le t \le T$.
2. If $a \ne b$ or $Z = 0$, then at least one state change occurs. Calculate $p_i$ for all $i \ne a$ from (2.6). Sample $i \ne a$ from the discrete probability distribution with probability masses $p_i/p_{-a}$, $i \ne a$, where $p_{-a} = \sum_{j \ne a} p_j$. [Note that $p_{-a} = 1$ when $a \ne b$ and $p_{-a} = (1 - p_a)$ when $a = b$.]
3. Sample the waiting time $\tau$ in state $a$ according to the continuous density $f_i(t)/p_i$, $0 \le t \le T$, where $f_i(t)$ is given by (2.5). Set $X(t) = a$, $0 \le t < \tau$.
4. Repeat the procedure with new starting value $i$ and new time interval of length $T - \tau$.

Remark 4. In step 3 above, we simulate from the scaled density (2.5) by finding the cumulative distribution function and then using the inverse transformation method. To calculate the cumulative distribution function, note that

$$\int_0^t e^{T \lambda_j} e^{-s(\lambda_j + Q_a)} \, ds = \begin{cases} t e^{T \lambda_j}, & \text{if } \lambda_j + Q_a = 0, \\[4pt] e^{T \lambda_j} \dfrac{1 - e^{-t(\lambda_j + Q_a)}}{\lambda_j + Q_a}, & \text{if } \lambda_j + Q_a \ne 0. \end{cases}$$
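The computation of the first-change probabilities $p_i$ in (2.6) via the eigenvalue decomposition can be sketched as follows (Python rather than the paper's R; the 3-state rate matrix and the function name are ours, and $Q$ is assumed diagonalizable with real eigenvalues):

```python
import numpy as np

def first_change_probs(Q, a, b, T):
    """p_i from eq. (2.6): probability that the first state change of the
    endpoint-conditioned chain (X(0)=a, X(T)=b, a != b) is to state i."""
    lam, U = np.linalg.eig(Q)                  # Q = U D_lambda U^{-1}
    Uinv = np.linalg.inv(U)
    Qa = -Q[a, a]
    Pab = (U @ np.diag(np.exp(T * lam)) @ Uinv)[a, b]
    # J_{aj}: integral of e^{T lam_j} e^{-t (lam_j + Q_a)} over [0, T];
    # errstate guards the unused branch of np.where when lam_j + Q_a = 0
    with np.errstate(divide="ignore", invalid="ignore"):
        J = np.where(np.isclose(lam + Qa, 0.0),
                     T * np.exp(T * lam),
                     (np.exp(T * lam) - np.exp(-Qa * T)) / (lam + Qa))
    p = np.array([Q[a, i] / Pab * np.dot(U[i, :] * Uinv[:, b], J)
                  for i in range(Q.shape[0])])
    p[a] = 0.0                                 # the first change cannot be to a
    return p.real

# Illustrative 3-state rate matrix (the values are assumptions)
Q = np.array([[-2.0, 1.0, 1.0],
              [1.0, -3.0, 2.0],
              [1.0, 1.0, -2.0]])
p = first_change_probs(Q, a=0, b=2, T=0.5)
```

Since $a \ne b$ forces at least one change before $T$, the entries of $p$ should sum to one, which gives a useful numerical check on the decomposition.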
To use the inverse transformation method, we must find the time $t$ such that $F(t) - u = 0$, where $F(t)$ is the cumulative distribution function and $0 < u < 1$. In subsequent sections, we have used a (numerical) root finder for this purpose.

2.3. Uniformization.

The final strategy that we consider permits sampling from $X(t)$ through construction of an auxiliary stochastic process $Y(t)$. Let $\mu = \max_c Q_c$ and define the process $Y(t)$ by letting the state changes be determined by a discrete-time Markov process with transition matrix

$$R = I + \frac{1}{\mu} Q. \tag{2.7}$$

Note that, by construction, we allow virtual state changes in which a jump occurs but the state does not change. Indeed, virtual state changes for state $a$ are possible if $R_{aa} > 0$. Next, let the epochs of state changes be determined by an independent Poisson process with rate $\mu$. The stochastic process $Y(t)$ is called a Markov chain subordinated to a Poisson process and is equivalent to the original continuous-time Markov chain $X(t)$, as the following calculation shows:

$$P(t) = e^{Qt} = e^{\mu(R - I)t} = e^{-\mu t} \sum_{n=0}^{\infty} \frac{(\mu t R)^n}{n!} = \sum_{n=0}^{\infty} \frac{e^{-\mu t} (\mu t)^n}{n!} R^n. \tag{2.8}$$

This approach is commonly referred to as uniformization, and we adopt that language here. In what follows, we describe how uniformization can be used to construct an algorithm for exact sampling from $X(t)$, conditional on the beginning and ending states.

It follows directly from (2.8) that the transition function of the Markov chain subordinated to a Poisson process is given by

$$P_{ab}(t) = P(X(t) = b \mid X(0) = a) = e^{-\mu t} 1(a = b) + \sum_{n=1}^{\infty} \frac{e^{-\mu t} (\mu t)^n}{n!} R^n_{ab}.$$

Thus, the distribution of the number of state changes $N$ (including the virtual ones) for the conditional process that starts in $X(0) = a$ and ends in $X(T) = b$ is given by

$$P(N = n \mid X(0) = a, X(T) = b) = \frac{e^{-\mu T} (\mu T)^n}{n!} \cdot \frac{R^n_{ab}}{P_{ab}(T)}. \tag{2.9}$$

Given the number of state changes $N = n$, the times $t_1, \ldots, t_n$ at which those state changes occur are uniformly distributed in the time interval $[0, T]$. Furthermore, the state changes $X(t_1), \ldots, X(t_{n-1})$ are determined by a Markov chain with transition matrix $R$, conditional on the beginning state $X(0) = a$ and ending state $X(t_n) = b$.

Putting these things together, we have the following algorithm for simulating a continuous-time Markov chain $\{X(t) : 0 \le t \le T\}$ conditional on the starting state $X(0) = a$ and ending state $X(T) = b$.

Algorithm 5 (Uniformization).
1. Simulate the number of state changes $n$ from the distribution (2.9).
2. If the number of state changes is 0, we are done: $X(t) = a$, $0 \le t \le T$.
3. If the number of state changes is 1 and $a = b$, we are done: $X(t) = a$, $0 \le t \le T$.
4. If the number of state changes is 1 and $a \ne b$, simulate $t_1$ uniformly at random in $[0, T]$, and we are done: $X(t) = a$ for $t < t_1$, and $X(t) = b$ for $t_1 \le t \le T$.
5. When the number of state changes $n$ is at least 2, simulate $n$ independent uniform random numbers in $[0, T]$ and sort the numbers in increasing order to obtain the times of state changes $0 < t_1 < \cdots < t_n < T$. Simulate $X(t_1), \ldots, X(t_{n-1})$ from a discrete-time Markov chain with transition matrix $R$, conditional on starting state $X(0) = a$ and ending state $X(t_n) = b$. Determine which state changes are virtual, and return the remaining changes and corresponding times of change.

Remark 6. In Step 1 above, we find the number of state changes $n$ by simulating $u$ from a Uniform(0, 1) distribution and letting $n$ be the first time the cumulative sum of (2.9) exceeds $u$. When calculating the cumulative sum we need to raise $R$ to powers 1 through $n$. These powers of $R$ are stored because they are required in Step 5 of the algorithm.
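Algorithm 5 can be condensed into a short routine. A Python sketch (the paper's code is in R; the 3-state rate matrix is an illustrative assumption, and $P_{ab}(T)$ is computed from the eigendecomposition (2.2), so $Q$ is assumed diagonalizable):

```python
import numpy as np

def transition_matrix(Q, t):
    """P(t) = exp(Qt) via the eigendecomposition (2.2)."""
    lam, U = np.linalg.eig(Q)
    return (U @ np.diag(np.exp(t * lam)) @ np.linalg.inv(U)).real

def uniformization_sample(Q, a, b, T, rng=None):
    """Algorithm 5: sample a CTMC path conditional on X(0)=a and X(T)=b.
    Returns (states, times) of the actual (non-virtual) state changes."""
    rng = rng if rng is not None else np.random.default_rng()
    m = Q.shape[0]
    mu = np.max(-np.diag(Q))
    R = np.eye(m) + Q / mu                        # eq. (2.7)
    Pab = transition_matrix(Q, T)[a, b]
    # Step 1: draw n by inverting the cumulative sum of (2.9);
    # powers of R are stored for reuse in Step 5 (Remark 6)
    u, cum, n = rng.random(), 0.0, 0
    Rpow = [np.eye(m)]
    poisson = np.exp(-mu * T)                     # e^{-mu T} (mu T)^n / n!
    while True:
        cum += poisson * Rpow[n][a, b] / Pab
        if cum > u:
            break
        n += 1
        poisson *= mu * T / n
        Rpow.append(Rpow[-1] @ R)
    if n == 0 or (n == 1 and a == b):
        return [a], [0.0]                         # constant path (Steps 2-3)
    times = np.sort(rng.uniform(0.0, T, size=n))  # jump epochs (Steps 4-5)
    path = [a]
    for i in range(1, n):                         # bridge X(t_1..t_{n-1}) via R
        w = R[path[-1], :] * Rpow[n - i][:, b]    # Remark 7, up to normalization
        path.append(int(rng.choice(m, p=w / w.sum())))
    path.append(b)
    states, jumps = [a], [0.0]                    # discard virtual changes
    for s, t in zip(path[1:], times):
        if s != states[-1]:
            states.append(s)
            jumps.append(float(t))
    return states, jumps

# Illustrative 3-state rate matrix (the values are assumptions)
Q = np.array([[-2.0, 1.0, 1.0],
              [1.0, -3.0, 2.0],
              [1.0, 1.0, -2.0]])
states, times = uniformization_sample(Q, a=0, b=2, T=1.0)
```

The bridge weights in the loop anticipate the normalizing constant $(R^{n-i+1})_{x_{i-1},b}$, which is absorbed by dividing by `w.sum()`.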
We use the eigenvalue decomposition (2.2) of $Q$ to calculate $P_{ab}(t)$.

Remark 7. In Step 5 above we simulate $X(t_i)$, $i = 1, \ldots, n-1$, from the discrete distribution with probability masses

$$P(X(t_i) = x_i \mid X(t_{i-1}) = x_{i-1}, X(t_n) = b) = \frac{R_{x_{i-1}, x_i} (R^{n-i})_{x_i, b}}{(R^{n-i+1})_{x_{i-1}, b}}.$$

Thus far we have outlined three competing strategies for simulating sample paths from an endpoint-conditioned CTMC. Though our discussion has been agnostic to the number of desired sample paths, this quantity has a direct and varied impact on the computational efficiency of each sampler. For example, while both direct sampling and uniformization require a possibly time-consuming eigendecomposition of $Q$, it is clear that one such computation will suffice even when multiple sample paths are desired. The number of sample paths desired from an endpoint-conditioned CTMC is application driven: estimation of some quantity, such as the expected number of visits to a given state, may require many sample paths, whereas the updating step in a Bayesian computation may require as few as one. Rather than exhaust potential applications, we have chosen to formally analyze the static and dynamic costs associated with each sampling strategy. We defer this discussion to Section 4, using the next section to demonstrate by example that no one strategy dominates the others when only one sample path is required.

3. Comparison by example.

To illustrate the strategies detailed above, in this section we introduce three explicit examples of CTMCs for which the performance of each sampler can be directly compared. We begin with a pair of CTMCs in common use for molecular evolutionary studies; each provides a unique stochastic description of how DNA sequences evolve over time.
For these and the remaining example, we compare the computational demands (measured as CPU time) of modified rejection sampling, of direct sampling, and of uniformization. For each example, the computational demands are accumulated over 100 independent samples.

In what follows, note that while rejection sampling and uniformization do not require any numerical approximations, direct sampling requires a root finder. The numerical approximation of the root finder can be made arbitrarily precise, but the choice of precision affects the running time. Without loss of generality, we have chosen the default settings of the root finder in the statistical programming language R (www.r-project.org, Version 2.0.0). The root finder typically converges in 4 to 8 iterations. The programs are run on an Intel 2.40 GHz Pentium 4 processor and are available in the supplementary material [Hobolth and Stone (2009)].

3.1. Example 1: Molecular evolution on the nucleotide level.

We first consider a popular model of DNA sequence evolution at the nucleotide level. The state space for a particular site in a DNA sequence is of size 4, corresponding to the DNA building blocks adenine (A), guanine (G), cytosine (C), and thymine (T). The HKY model of Hasegawa, Kishino and Yano (1985) describes the evolution of one site in a DNA sequence through an instantaneous rate matrix of the form

$$Q = (1/s) \begin{pmatrix} \cdot & \kappa \pi_G & \pi_C & \pi_T \\ \kappa \pi_A & \cdot & \pi_C & \pi_T \\ \pi_A & \pi_G & \cdot & \kappa \pi_T \\ \pi_A & \pi_G & \kappa \pi_C & \cdot \end{pmatrix},$$

where the states appear in the order A, G, C, T and the diagonal elements of $Q$ are such that the rows sum to zero. Note that state changes of the CTMC are called "substitutions" in this context to reflect that the nucleotide in a particular site has been substituted by another. The HKY model is reversible and has stationary distribution $\pi = (\pi_A, \pi_G, \pi_C, \pi_T)$.
The ts/tv rate ratio parameter $\kappa$ is used to distinguish between transitions [substitutions between purines (A, G) or between pyrimidines (C, T)] and transversions (substitutions between a purine and a pyrimidine). The scaling parameter $s = s(\kappa, \pi)$ is chosen such that $\sum_{a=1}^4 Q_a \pi_a = 1$, implying that $t$ substitutions are expected in $t$ time units. In this application, we use the parameter values $\kappa = 2$ and $\pi = (0.2, 0.3, 0.3, 0.2)$. The scaling parameter calibrates the intensity of substitutions per unit time; for context, note that the expected number of substitutions per site between humans and chimpanzees is roughly 0.01 [The Chimpanzee Sequencing and Analysis Consortium (2005)] and between humans and mice is roughly 0.50 [Mouse Genome Sequencing Consortium (2002)].

Figure 1 plots the computational demands of each sampling strategy against evolutionary distance, measured equivalently as the expected number of substitutions (CTMC state changes) or as units of time. The plot on the left demonstrates the case where the beginning and ending states are both A; by contrast, on the right the beginning state remains A while the ending state is G. The figure reveals rejection sampling to be by far the most efficient algorithm here. Moreover, direct sampling is more efficient than uniformization when the endpoints are the same and the evolutionary distance is shorter than one expected substitution per site. When the endpoints are different, uniformization is more efficient than direct sampling.

3.2. Example 2: Molecular evolution on the codon level.

For protein-coding DNA sequences, the natural state space consists of nucleotide triplets (called codons). There are $4^3 = 64$ possible nucleotide triplets, but the three stop codons TGA, TAG, and TAA do not appear within a protein.
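The HKY specification above can be assembled and checked numerically. A sketch (Python; the helper name is ours, while $\kappa = 2$ and $\pi = (0.2, 0.3, 0.3, 0.2)$ are the values used in this example):

```python
import numpy as np

def hky_rate_matrix(kappa, pi):
    """Build the HKY rate matrix (states in order A, G, C, T), scaled so
    that one substitution is expected per unit time: sum_a pi_a Q_a = 1."""
    A, G, C, T = range(4)
    purine = {A, G}
    Q = np.zeros((4, 4))
    for a in range(4):
        for b in range(4):
            if a == b:
                continue
            # transition = both purines (A<->G) or both pyrimidines (C<->T)
            is_transition = (a in purine) == (b in purine)
            Q[a, b] = (kappa if is_transition else 1.0) * pi[b]
    np.fill_diagonal(Q, -Q.sum(axis=1))
    s = -np.dot(pi, np.diag(Q))        # s = sum_a pi_a Q_a before scaling
    return Q / s

kappa = 2.0
pi = np.array([0.2, 0.3, 0.3, 0.2])   # (pi_A, pi_G, pi_C, pi_T) from the text
Q = hky_rate_matrix(kappa, pi)
```

Rows summing to zero and $\pi Q = 0$ confirm, respectively, that $Q$ is a valid rate matrix and that $\pi$ is its stationary distribution, as claimed for the reversible HKY model.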
The $64 - 3 = 61$ remaining codons constitute the state space and are called the sense codons. Each of the 61 sense codons deterministically translates into one of 20 amino acids, and thus distinct codons may translate into the same amino acid. Substitutions between codons that translate into the same amino acid are called synonymous (or silent), while substitutions between codons that translate into different amino acids are called nonsynonymous (or nonsilent). Goldman and Yang (1994) formulated a model on the space of sense codons that is still in common use today. The GY model, a natural extension of the HKY model described above, is reversible with stationary distribution $\pi = (\pi_1, \ldots, \pi_{61})$ and incorporates a ts/tv rate ratio $\kappa$. The GY model also distinguishes between synonymous and nonsynonymous substitutions through a parameter $\omega$.

Fig. 1. CPU time versus evolutionary distance for the HKY model. In both plots the beginning state is A. In the left plot the ending state is also A, and in the right plot the ending state is G. Rejection sampling requires less CPU time than direct sampling and uniformization. The solid thick lines show predicted CPU times when the cost of initialization and recursion is fitted to the observed CPU times (see Sections 4.1–4.3). The solid thin lines show predicted CPU times when the cost of initialization and recursion is estimated from a simulation study of reversible rate matrices (see Section 4.4). Here and in Figures 2 and 3 the expected number of recursion steps was calculated analytically using the formulas in the text.
The off-diagonal entries in the instantaneous GY rate matrix are given by

$$Q_{ab} = (1/s) \begin{cases} 0, & \text{if } a \text{ and } b \text{ differ at more than one position}, \\ \pi_b, & \text{for synonymous transversions}, \\ \kappa \pi_b, & \text{for synonymous transitions}, \\ \omega \pi_b, & \text{for nonsynonymous transversions}, \\ \omega \kappa \pi_b, & \text{for nonsynonymous transitions}, \end{cases}$$

where $s = s(\omega, \kappa, \pi)$ is again chosen such that $t$ substitutions are expected in $t$ time units (i.e., $\sum_a Q_a \pi_a = 1$).

In our application, we choose $\kappa = 2$ so that transitions are favored over transversions, and take $\omega = 0.01$ so that synonymous changes are far more likely than nonsynonymous changes. We choose $\pi$ based on established patterns of codon usage, and note that these frequencies are quite heterogeneous: the smallest entry is GGG ($\pi_{GGG} = 0.0042$) and the largest entry is GAG ($\pi_{GAG} = 0.0426$).

For these specifications, Figure 2 plots the computational demands of each sampling strategy against evolutionary distance. In each plot, the starting state is AAA. The leftmost plot compares performance when the ending state is AAG. Note that the substitution from AAA to AAG is a synonymous transition (both AAA and AAG code for the amino acid lysine) and that the frequency of AAG is 0.0396. Because synonymous transitions are very likely, the plot confirms that rejection sampling will be very efficient. Contrast this observation with the middle plot, in which the ending state is AAC. The codon AAC translates to asparagine, so the substitution from the beginning to the ending state is a less likely nonsynonymous transversion. This is reflected in the poor performance of the rejection sampling algorithm. Finally, the rightmost plot demonstrates what occurs when the final state is TTT. In this case, rejection sampling is not feasible because the probability of ending in the final state is effectively zero.
Example 3: Molecular evolution on the sequence level. The examples above seem to indicate that rejection sampling and uniformization have the most utility, but it is easy to conceive of an application for which direct sampling is most efficient. As the two previous examples show, the efficiency of rejection sampling is tied to its acceptance probability; if the observed ending state is unlikely, a large fraction of sample paths will be destined for rejection. Uniformization, on the other hand, can be inefficient when many virtual substitutions are required. With that background, we consider the extension of the HKY model described below. Recall the HKY rate matrix (with ν instead of π)

\[
Q = \begin{pmatrix}
\cdot & \kappa\nu_G & \nu_C & \nu_T\\
\kappa\nu_A & \cdot & \nu_C & \nu_T\\
\nu_A & \nu_G & \cdot & \kappa\nu_T\\
\nu_A & \nu_G & \kappa\nu_C & \cdot
\end{pmatrix}. \tag{3.3}
\]

Jensen and Pedersen (2000) consider so-called neighbor-dependent models, in which the instantaneous rate at a site depends on the neighbors of that site. Jensen and Pedersen (2000) are particularly interested in CG avoidance, where the rate away from a C is particularly high if its right neighbor is a G. Such a model implies CG deficiency in a single sequence, which is an often observed phenomenon for mammalian sequences due to the process of CpG methylation-deamination.

Fig. 2. CPU time versus evolutionary distance for the GY model. In all plots the beginning state is AAA. In the left plot the ending state is AAG, in the middle plot the ending state is AAC, and in the right plot the ending state is TTT. Rejection sampling is most efficient in the situation depicted on the left, but it enters an infinite while loop on the right (and is therefore not shown). Direct sampling and uniformization have similar running times, with uniformization being slightly faster. The solid thick lines show predicted CPU times when the cost of initialization and recursion is fitted to the observed CPU times (see Sections 4.1–4.3). The solid thin lines show predicted CPU times when the cost of initialization and recursion is estimated from a simulation study of reversible rate matrices (see Section 4.4). Finally, the dashed lines show predicted CPU times when the initialization and recursion costs are estimated from a simulation study of sparse rate matrices (see Section 4.4).

Neighbor-dependent nucleotide models are also considered in Hwang and Green (2004) and Hobolth (2008). In these two papers, a Gibbs sampling scheme is used to estimate the parameters of the model while taking the uncertainty of the neighbors into account. In particular, each single site is updated conditionally on the current values of the complete evolutionary history of the neighboring nucleotides.

Consider the evolution at a single site and assume for simplicity that the evolutionary history of the left neighbor is never a C, and the evolutionary history of the right neighbor is always a G. In this situation, a CG dinucleotide is present when the site that we consider is a C. Jensen and Pedersen (2000) model the CpG effect by increasing the rate away from CG dinucleotides, multiplying each entry in the HKY rate matrix (3.3) corresponding to C with a parameter γ > 1. When the left neighbor is not a C and the right neighbor is a G, the rate matrix thus becomes

\[
Q_{\mathrm{HKY+CpG}} = \begin{pmatrix}
\cdot & \kappa\nu_G & \nu_C & \nu_T\\
\kappa\nu_A & \cdot & \nu_C & \nu_T\\
\gamma\nu_A & \gamma\nu_G & \cdot & \gamma\kappa\nu_T\\
\nu_A & \nu_G & \kappa\nu_C & \cdot
\end{pmatrix}. \tag{3.1}
\]

The stationary distribution π of Q_HKY+CpG is given by

\[
(\pi_A, \pi_G, \pi_C, \pi_T) = (\nu_A, \nu_G, \nu_C/\gamma, \nu_T) / (\nu_A + \nu_G + \nu_C/\gamma + \nu_T).
\]

If the parameters are (ν_A, ν_G, ν_C, ν_T) = (0.3, 0.3, 0.2, 0.2) and γ = 20, we obtain the stationary distribution (π_A, π_G, π_C, π_T) = (0.3, 0.3, 0.01, 0.2)/0.81. Note that the stationary probability of a C nucleotide is now 0.01/0.81 = 0.012.

The left-hand plot of Figure 3 illustrates the performance of the three samplers when the CTMC begins in state T and ends in state C. Here the most efficient sampler depends on the time between the states: if T < 0.3, rejection sampling is the most efficient; if 0.3 < T < 0.9, uniformization is the most efficient; and if T > 0.9, direct sampling is the most efficient. For large times, rejection sampling is inefficient because it is unlikely to end in state C, and direct sampling becomes more efficient than uniformization because many virtual changes are required in the uniformization procedure. The right-hand plot of Figure 3 shows the case when the beginning state is C and the ending state is T. Under this scenario, rejection sampling is the most efficient sampling algorithm because the acceptance probability is high.

Fig. 3. CPU usage versus time for the HKY + CpG rate matrix (3.1). In the left plot, the beginning state is T and the ending state is C. In the right plot, the beginning state is C and the ending state is T. Rejection sampling is very fast in the situation depicted on the right, but it is slow for large evolutionary distances on the left. Direct sampling and uniformization have similar running times, but direct sampling is faster for large evolutionary distances. The solid thick lines show predicted CPU times when the cost of initialization and recursion is fitted to the observed CPU times (see Sections 4.1–4.3). The solid thin lines show predicted CPU times when the cost of initialization and recursion is estimated from a simulation study of reversible rate matrices (see Section 4.4).

4. Complexity of samplers.
The examples in the previous section were chosen to demonstrate the heterogeneous dependence of each sampling strategy upon the characteristics of the endpoint-conditioned CTMC. In particular, efficiency was shown to be impacted by each aspect of the process: the instantaneous rate matrix Q, the sampling time T, and the beginning and ending states a and b. This section translates the qualitative observations above into quantitative proof of which sampler will be most efficient for any specification of CTMC. To accomplish this, we rely on the algorithmic descriptions of the three sampling strategies as given in Section 2. Note that the algorithms are schematically consistent, with each progressing through (1) initialization, (2) recursion, and (3) termination. Our approach is to define the fixed computational costs for the initialization and recursion steps, which we call α and β, respectively.

As shown in Section 2, the number of recursion steps required to generate an entire sample path is stochastic, and we capture this in a random variable L. Thus, the computational cost of generating one sample path is α + βL, and the mean cost is α + βE[L]. In the case of rejection sampling, note that only a certain fraction of the generated sample paths will be consistent with the observed ending state and hence accepted. Ultimately, the results of this section demonstrate our ability to accurately predict the CPU time needed to produce one valid sample path from an endpoint-conditioned process. Such analysis is of great practical importance, as it allows the researcher to choose the most efficient sampler in advance.

The rest of this section is organized as follows. In Sections 4.1–4.3 we discuss complexity and derive the mean number of recursions E[L] for each sampler.
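The cost model α + βE[L] already determines the comparison rule used throughout this section: predict the mean CPU time of each sampler, divide by the acceptance probability in the case of rejection sampling (as quantified in Section 4.1), and take the minimum. The following minimal sketch is our own illustration; the numeric inputs are placeholders, not measured costs.

```python
# Sketch of the cost model alpha + beta * E[L]: predict the mean CPU time of
# each sampler and pick the cheapest.  Rejection sampling pays its cost once
# per attempt, so its mean time is divided by the acceptance probability.
def predicted_cpu_times(costs, mean_recursions, p_acc):
    """costs: sampler -> (alpha, beta); mean_recursions: sampler -> E[L]."""
    times = {name: a + b * mean_recursions[name] for name, (a, b) in costs.items()}
    times["rejection"] /= p_acc        # only accepted paths count
    return times

def choose_sampler(costs, mean_recursions, p_acc):
    times = predicted_cpu_times(costs, mean_recursions, p_acc)
    return min(times, key=times.get)

# Placeholder numbers, loosely shaped like the estimates reported later:
# rejection is cheap per attempt; the other two pay for an eigendecomposition.
costs = {"rejection": (0.02, 0.01), "direct": (0.22, 0.13), "uniformization": (0.23, 0.01)}
EL = {"rejection": 2.0, "direct": 2.0, "uniformization": 4.0}

best_likely = choose_sampler(costs, EL, p_acc=0.5)     # likely ending state
best_unlikely = choose_sampler(costs, EL, p_acc=1e-4)  # nearly impossible ending state
```

With a likely ending state the rejection sampler wins; when almost every forward path must be discarded, its 1/p_acc factor dominates and uniformization takes over.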
In Section 4.4 we first demonstrate that α and β can be estimated from the size of the state space and the structure of the rate matrix alone. Thus, determining the values of α and β is a one-time calculation. Second, we provide a strategy for choosing the most efficient sampler. The strategy depends on the mean number of recursions and the estimated values of α and β; in the case of rejection sampling, it also depends on the acceptance probability. Third, we give further insight into the sampling strategies by analyzing the proposed strategy in detail for moderately large time intervals.

4.1. Rejection sampling complexity. Let p_acc be the acceptance probability for the rejection sampling algorithm first described in Section 2.1. Then the expected number of samples before acceptance is its reciprocal, 1/p_acc. In the notation described above, the mean CPU time required to simulate one sample path is thus

\[
(\alpha + \beta E[L]) / p_{\mathrm{acc}}. \tag{4.1}
\]

When the beginning and ending states take the same value, say a, the acceptance probability is simply P_aa(T). In particular, for small T we have p_acc ≈ 1 − Q_a T, and for large T we have p_acc ≈ π_a. Furthermore, the expected number of recursion steps required to generate one sample path is given by

\[
E[L] = \sum_i \sum_{j \ne i} E[N_{ij}(T) \mid X(0) = a], \tag{4.2}
\]

where N_ij(T) is the number of state changes from i to j in the time interval [0, T]. This expectation is given by [e.g., Proposition 3.6 of Guttorp (1995)]

\[
E[N_{ij}(T) \mid X(0) = a] = Q_{ij} \int_0^T P(X(t) = i \mid X(0) = a) \, dt.
\]

Analytical expressions for the integral can be found by appealing to an eigendecomposition of Q (see Section 2.2). Figure 4 provides an illustration of the above considerations, using the HKY model from Section 3.1 as an example.
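For a concrete illustration of these formulas (our own sketch, not the authors' code), the integral \(\int_0^T P(X(t)=i \mid X(0)=a)\,dt\) can be evaluated termwise from the eigendecomposition Q = U diag(λ) U^{-1}. The toy two-state chain below is calibrated so that \(\sum_c \pi_c Q_c = 1\), in which case the unconditional mean number of jumps, and hence E[L] in (4.2), equals T exactly.

```python
import numpy as np

# E[N_ij(T) | X(0) = a] = Q_ij * int_0^T P_ai(t) dt, with the integral
# evaluated through an eigendecomposition of Q (a sketch of the approach
# referenced in the text, not code from the paper).
def expected_forward_recursions(Q, a, T, tol=1e-12):
    lam, U = np.linalg.eig(Q)
    Uinv = np.linalg.inv(U)
    # J_k = int_0^T exp(lam_k * t) dt; the zero eigenvalue contributes T.
    safe = np.where(np.abs(lam) < tol, 1.0, lam)
    J = np.where(np.abs(lam) < tol, T, (np.exp(lam * T) - 1.0) / safe)
    I = (U @ np.diag(J) @ Uinv).real          # I[a, i] = int_0^T P_ai(t) dt
    n = Q.shape[0]
    return sum(Q[i, j] * I[a, i] for i in range(n) for j in range(n) if j != i)

def acceptance_probability_same_endpoints(Q, a, T):
    lam, U = np.linalg.eig(Q)
    P = (U @ np.diag(np.exp(lam * T)) @ np.linalg.inv(U)).real
    return P[a, a]                            # p_acc = P_aa(T) when a = b

# Calibrated two-state chain: pi = (1/2, 1/2) and sum_c pi_c Q_c = 1.
Q2 = np.array([[-1.0, 1.0], [1.0, -1.0]])
EL = expected_forward_recursions(Q2, a=0, T=2.0)           # equals T = 2
p_acc = acceptance_probability_same_endpoints(Q2, 0, 2.0)  # (1 + e^{-4})/2
```

Plugging these quantities into (4.1) gives the mean CPU time (α + βE[L])/p_acc once α and β have been estimated.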
The top panel of the figure details the case when the beginning and ending states are the same (specifically, the nucleotide A). From the left, the first column plots the acceptance probability exp(QT)_AA against the time T, showing a nonlinear decrease from p_acc ≈ 1 when T is small to p_acc ≈ π_A when T is large. The sloped dashed line plots the first-order Taylor approximation 1 − Q_A T of p_acc that is valid for small T; the horizontal dashed line indicates the stationary probability π_A that is the limit of p_acc as T grows large.

The second column plots the CPU time spent on initialization against T for a collection of simulated sample paths. A linear regression was used to estimate the initialization cost. More specifically, we generated 500 independent samples from the modified rejection sampler and recorded the time spent on initialization and recursion, respectively. The CPU time spent on initialization is proportional to 1/p_acc [recall (4.1)]; we estimated α using linear regression and obtained \(\hat\alpha = 0.0509\). The third column shows the expected number of state changes E[L], calculated from (4.2), as a function of time. In the fourth column we show the CPU time spent on sampling. The CPU time spent on sampling is proportional to E[L]/p_acc [recall (4.1)]; we estimated β using linear regression and obtained \(\hat\beta = 0.0365\). Adding the CPU time spent on initialization and sampling gives the total CPU time spent on producing a sample path. The total CPU time and the predicted CPU time are shown in the left plot of Figure 1.

Fig. 4. Summary statistics for the HKY model from Section 3.1. The top row shows the case when the beginning state is A and the ending state is A. In the bottom row, the beginning state is A and the ending state is G. The first column shows the probability of ending in the correct state (the acceptance probability), and the second column shows the CPU time spent on initialization. The third column shows the expected number of recursions required in each forward sample, and the fourth column shows the CPU time spent on sampling. Summing the CPU times spent on initialization and on sampling gives the total CPU time spent to produce a sample path. This total time is shown in the left-hand plot of Figure 1.

When the beginning and ending states are different, the calculations are only slightly more complicated. To compute the acceptance probability in the case a ≠ b, let N(t) be the number of state changes of X(t) in the interval [0, t]. We have

\[
\begin{aligned}
P_{ab}(T) &= \Pr(X(T) = b \mid X(0) = a)\\
&= \Pr(X(T) = b, N(T) > 0 \mid X(0) = a)\\
&= \Pr(X(T) = b \mid N(T) > 0, X(0) = a)\,\Pr(N(T) > 0)\\
&= p_{\mathrm{acc}}\,\Pr(N(T) > 0),
\end{aligned} \tag{4.3}
\]

from which it is clear that

\[
p_{\mathrm{acc}} = \frac{P_{ab}(T)}{1 - \Pr(N(T) = 0)} = \frac{P_{ab}(T)}{1 - e^{-T Q_a}}. \tag{4.4}
\]

For small T we have the first-order approximation

\[
p_{\mathrm{acc}} \approx \frac{Q_{ab}}{Q_a}\left(1 - \frac{Q_b T}{2}\right) + \sum_{i \ne a, b} \frac{Q_{ai}}{Q_a} \frac{T}{2} Q_{ib}, \tag{4.5}
\]

and for large T it is clear that p_acc ≈ π_b.

Next we consider the number of recursion steps L. We know that the number of state changes is at least one because we have assumed that the beginning and ending states a and b are different. The probability of the first change being to state k (k ≠ a) is Q_ak/Q_a, and the density of the time to this change is given by (2.1). Let the number of state changes from i to j (j ≠ i) when the first substitution is to k be denoted N_ij,k.
The expected number of such changes in a time interval [0, T] is given by

\[
E[N_{ij,k}(T)] = \int_{t=0}^{T} \frac{Q_a e^{-t Q_a}}{1 - e^{-T Q_a}}\, \frac{Q_{ak}}{Q_a}\, Q_{ij} \int_{s=t}^{T} P_{ki}(s) \, ds \, dt.
\]

Again, this integral can be calculated analytically using an eigenvalue decomposition of Q. The expected value of L is given by

\[
E[L] = 1 + \sum_i \sum_{j \ne i} \sum_{k \ne a} E[N_{ij,k}(T)].
\]

The bottom row of Figure 4 mirrors the top row, except that here the ending state G has been chosen to be distinct from the beginning state A. As before, the first plot from the left shows the acceptance probability (4.4) against the time T. The sloped dashed line now shows the linear approximation (4.5), while the horizontal dashed line indicates the stationary probability π_G of the ending state G. In the second plot, the CPU time spent on initialization is explained by the reciprocal of the acceptance probability. In the third plot we show the expected number of substitutions, and in the last plot the CPU time spent on sampling is explained by the expected number of substitutions divided by the acceptance probability. The regression coefficient for initialization is 0.0509 and for sampling 0.0366. Note that these coefficients are very similar to what was observed in the case of equal beginning and ending states.

To complement the observations of Figure 4, recall the GY model introduced in Section 3.2. The three plots in Figure 5 mirror those in Figure 2, with each showing how the acceptance probability scales with time in the previously depicted scenario. In all cases, the beginning state is the codon AAA: from the left, the first plot considers the ending state AAG (a synonymous transition away from AAA), the second plot considers the ending state AAC (a nonsynonymous transversion away from AAA), and the third plot considers the ending state TTT (a minimum of three state changes away from AAA). In the first case, the acceptance probability is high and rejection sampling is efficient. In the second case, the acceptance probability is low, particularly for small T, and rejection sampling is less efficient. In the third case, the probability of ending up in the desired state TTT is smaller than 1/10^6, and rejection sampling cannot be used. In this final case, one must use direct sampling or uniformization.

Fig. 5. Acceptance probabilities for the GY model from Section 3.2. In all cases the beginning state is AAA. In the left-hand plot, the ending state is AAG, in the middle plot the ending state is AAC, and in the right-hand plot the ending state is TTT. Rejection sampling is very efficient in the situation depicted on the left, less efficient in the middle, and not practical in the right.

4.2. Direct sampling complexity. The computational costs for direct sampling are dependent upon its initialization and the CPU time spent on sampling a new state and its corresponding waiting time. As before, the cost of generating one sample path can be written as α + βL, but the initial cost α is much more expensive than for rejection sampling because an eigendecomposition of Q is required. The expected number of recursion steps is equivalent to the number of state changes N and can be found through

\[
E[L] = E[N(T) \mid X(0) = a, X(T) = b] = \sum_i \sum_{j \ne i} E[N_{ij}(T) \mid X(0) = a, X(T) = b],
\]

where N_ij(T) is the number of state changes from i to j in the time interval [0, T]. These expectations can be calculated using formulas in Hobolth and Jensen (2005).

Fig. 6. CPU time spent on direct sampling in the HKY model from Section 3.1. Two cases are considered: beginning state A and ending state A (first two plots from the left) and beginning state A but ending state G (last two plots). In both cases, the initialization CPU time is constant (first and third plot). The sampling CPU time is proportional to the expected number of substitutions (second and fourth plot).

In Figure 6 we illustrate the above considerations for the HKY model from Section 3.1. The initialization cost is constant, and the number of state changes explains the cost of sampling. The initialization cost α is around 0.85, which is much larger than the 0.05 observed when doing rejection sampling. Moreover, the cost β for each recursion step is 0.56, as compared to 0.04 for rejection sampling. This may seem an unfavorable comparison, but recall that rejection sampling does not guarantee that the endpoint conditions are met by its generated sample paths; if the probability of acceptance p_acc is low, then the cost of rejection sampling given by (4.1) will be dominated by 1/p_acc. This illustrates the tradeoff that distinguishes rejection sampling from the two remaining approaches: the computational costs of rejection sampling are comparatively inexpensive, but only a fraction of the simulated sample paths from that method will be viable.

4.3. Uniformization complexity. The computational costs for uniformization are similar in structure to those of direct sampling. Initialization requires an eigendecomposition of Q and construction of the auxiliary transition matrix R in order to carry out Step 1 of the algorithm (recall Remark 6). Each recursion step consists of sampling a new state and its corresponding waiting time; examination of the uniformization algorithm reveals that the number of recursion steps L is equal to the number of state changes N(T) accumulated by the auxiliary chain.
Thus,

\[
E[L] = E[N(T) \mid X(0) = a, X(T) = b]
= \frac{1}{P_{ab}(T)} \sum_{n=0}^{\infty} n\, e^{-\mu T} \frac{(\mu T)^n}{n!} (R^n)_{ab}
= \frac{\mu T}{P_{ab}(T)} (R e^{QT})_{ab}
= \frac{\mu T}{P_{ab}(T)} \sum_c R_{ac} P_{cb}(T).
\]

In particular, when T is large we get E[L] ≈ μT.

Fig. 7. CPU time spent on uniformization in the HKY model from Section 3.1. Two cases are considered: beginning state A and ending state A (first two plots) and beginning state A but ending state G (last two plots). In both cases, the initialization CPU time is constant (first and third plot). The sampling CPU time is proportional to the expected number of substitutions (second and fourth plot).

Figure 7 illustrates the above considerations for the HKY model from Section 3.1. As with direct sampling, the initialization cost is constant, and the number of state changes (both real and virtual) explains the cost of sampling. We find \(\hat\alpha = 1.05\), which is about the same magnitude as the initialization cost for direct sampling. In uniform sampling, the recursion step is immediate if we enter Steps 2–4. Each recursion in Step 5 is also very fast because we just have to simulate from a discrete-state Markov chain with transition probability matrix R given the endpoints, and where all the relevant powers of the transition matrix R are already calculated. The recursion cost \(\hat\beta = 0.09\) is around 1/6 of the recursion cost for direct sampling and twice as much as the recursion cost of rejection sampling.

4.4. Comparison and recommendation.

4.4.1. Comparison and recommendation for general T. The preceding results explicitly relate the computational complexity of each sampling strategy to characteristics of the CTMC. This permits the three strategies to be compared to each other, but only after reliable values for α and β have been obtained.
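The closed form for the uniformization recursion count in Section 4.3 is easy to evaluate; the sketch below (our own illustration) computes E[L] = μT (R e^{QT})_ab / P_ab(T) for a toy two-state chain and confirms that E[L] approaches μT for large T.

```python
import numpy as np

# Mean number of uniformization recursions,
# E[L] = mu*T * (R e^{QT})_ab / P_ab(T),
# where mu = max_c Q_c and R = I + Q/mu is the auxiliary transition matrix.
def uniformization_mean_recursions(Q, a, b, T):
    mu = np.max(-np.diag(Q))                  # uniformization rate
    R = np.eye(Q.shape[0]) + Q / mu           # auxiliary chain (virtual jumps allowed)
    lam, U = np.linalg.eig(Q)
    P = (U @ np.diag(np.exp(lam * T)) @ np.linalg.inv(U)).real
    return mu * T * (R @ P)[a, b] / P[a, b]

# Two-state chain: mu = 1 and R simply swaps the two states, so for a != b
# the formula reduces to T * P_bb(T) / P_ab(T).
Q2 = np.array([[-1.0, 1.0], [1.0, -1.0]])
EL_short = uniformization_mean_recursions(Q2, a=0, b=1, T=2.0)
EL_long = uniformization_mean_recursions(Q2, a=0, b=1, T=50.0)  # close to mu*T = 50
```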
We also note that the values of α and β depend on the choice of computer language. Running simulations as we have to estimate these parameters is not practical, as it compromises the gains in choosing an efficient sampler. For that reason, and to establish the generality of our observations, we sought to relate α and β to the size of the state space of the CTMC.

Fig. 8. Values of α (left) and β (right) for the three algorithms and four examples. Both x-axes show the size of the state space considered in each example, with α and β on the y-axes for the left and right plot, respectively. A constant fit is shown for rejection sampling, and a quadratic fit between β and the size of the state space is shown for direct sampling; the remaining relationships in the figure are described as a linear dependency on the size of the state space to the power 2.5. Results for randomly generated 61 × 61 rate matrices, with nonzero off-diagonal entries when the entry corresponds to two codons that are exactly one point substitution away from each other, are shown in the left plot. These sparse rate matrices have a lower value of α in the case of direct sampling and uniformization.

For each of the three sampling algorithms, we estimated the values of α and β for randomly simulated reversible rate matrices. Specifically, we first generated a symmetric matrix S with exponentially distributed off-diagonal entries S_ij ∼ Exp(1), i > j. Second, we generated the stationary distribution π of the CTMC by sampling from a Dirichlet distribution Dir(1, ..., 1), that is, with a parameter vector of ones. The off-diagonal entries in the rate matrix become Q_ij = S_ij π_j for i > j and Q_ij = S_ji π_j for i < j. We considered seven different state space sizes (5, 10, 20, 40, 60, 80, 100) and repeated each of the simulations five times.
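The construction just described can be written down in a few lines; the following sketch (our own code, with the same Exp(1)/Dirichlet choices) also calibrates the chain to one expected substitution per time unit and checks reversibility through detailed balance.

```python
import numpy as np

# Random reversible rate matrix as described in the text: a symmetric matrix S
# with Exp(1) off-diagonal entries, a Dirichlet(1, ..., 1) stationary
# distribution pi, and off-diagonal rates Q_ij = S_ij * pi_j.
def random_reversible_rate_matrix(n, rng):
    lower = np.tril(rng.exponential(1.0, size=(n, n)), -1)
    S = lower + lower.T                        # symmetric, zero diagonal
    pi = rng.dirichlet(np.ones(n))
    Q = S * pi[None, :]                        # Q_ij = S_ij * pi_j for i != j
    np.fill_diagonal(Q, -Q.sum(axis=1))        # rows must sum to zero
    Q /= -(pi @ np.diag(Q))                    # calibrate: sum_c pi_c Q_c = 1
    return Q, pi

rng = np.random.default_rng(1)
Q, pi = random_reversible_rate_matrix(10, rng)
flux = pi[:, None] * Q                         # detailed balance: flux is symmetric
```

Detailed balance, π_i Q_ij = π_j Q_ji, holds by construction since S is symmetric, which is what makes the simulated matrices reversible.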
The results are summarized in Figure 8. Figure 8 supports our previous observations on the three sampling strategies. The costs of initialization, as quantified by α, are as expected, with direct sampling and uniformization slowed relative to rejection sampling by their dependence on an eigenvalue decomposition of the rate matrix Q. Indeed, whereas the initialization cost of rejection sampling remains essentially unchanged as the state space grows, the other methods increase in run time nonlinearly. Theory suggests that the eigenvalue decomposition that dominates direct sampling and uniformization should depend on the cube of the size of the state space, and we find in our limited sample that the relationship is best explained by an exponent of 2.5. In any case, direct sampling and uniformization are comparable in their initialization costs, with uniformization always slightly slower because it requires powers of the transition probability matrix R to also be calculated. By contrast, for the state space sizes that we considered, uniformization has a substantially smaller value of β. Compare these observations to the results shown in Figures 4, 6, and 7 for the HKY model: what was true in that example with a state space of size four appears to hold consistently across our sample of simulated rate matrices of various sizes.

The most encouraging feature of Figure 8 is the apparent ease with which α and β can be predicted solely from the size of the state space. As illustrated in the figure, we fit six simple models to predict α and β for each of the three sampling strategies from the state space size alone. The utility of these predictions becomes clear in reference to the four-state HKY model. Previously we found α and β by simulating endpoint-conditioned samples from the model of interest itself.
For rejection sampling, these values were \(\hat\alpha = 0.0149\) and \(\hat\beta = 0.0094\) (Figure 4), which we compare to α = 0.0165 and β = 0.0109 predicted for a state space of size four. Similarly, as shown in Figure 6, for direct sampling \((\hat\alpha, \hat\beta)\) ranges from (0.2274, 0.1342), when the beginning and ending states were the same, to (0.2258, 0.1370), when they were different. By comparison, the predicted values for direct sampling complexity are α = 0.2155 and β = 0.1285. Last, recall from Figure 7 that for uniformization \((\hat\alpha, \hat\beta)\) ranges from (0.2503, 0.0253) for identical beginning and ending states to (0.2419, 0.0206) when different. These values agree with our predictions of α = 0.2286 and β = 0.0143, although the predicted value of β is somewhat lower than the fit from the model itself.

It should be emphasized that the goal here is not to perfectly predict α and β for any particular CTMC. Rather, the purpose of the simple models illustrated in Figure 8 is to obtain values accurate enough to decide which sampling strategy will be most efficient. With that in mind, reconsider the CPU times observed for the three examples in Section 3. In Figures 1–3, respectively, CPU time estimates for the HKY, GY, and HKY + CpG models are shown as thin lines obtained from the new predictions of α and β. It is clear that a practitioner, armed with only these predictions derived from the state space size, would choose the most efficient sampler in each example for virtually any combination of time and specific endpoint conditions. On the other hand, despite that success, the CPU times we predict are not uniformly accurate; the predictions for the GY model shown in Figure 2, for example, do not fit the observed data well at all.
We speculated that this observed lack of fit might be a result of the structure of the GY model being too different from that of the randomly generated rate matrices used to establish our models. To pursue this hypothesis, we generated 61 × 61 rate matrices with the same structure as the GY model. In particular, the only nonzero off-diagonal entries in the random rate matrix are those entries where the two codons are exactly one point substitution away from each other. These sparse rate matrices result in smaller values for α in the cases of direct sampling and uniformization, as indicated in Figure 8. As expected from Figure 2, the value of α for rejection sampling and the β values for all samplers are largely unaffected. The predicted CPU times from the new values of α and β are shown as dashed lines and present a satisfactory fit. Thus, while our models for α and β appear to be of sufficient quality to guide the correct choice of sampler, it is clear that differently structured rate matrices of the same size may yield substantially different values of α and β and hence CPU times.

The preceding discussion motivates the following guidelines for choosing the most efficient sampling strategy:

1. Estimate α and β for the three sampling strategies. As discussed, α and β can be estimated reliably from the size of the rate matrix, allowing for some variability due to its structure.
2. Predict the CPU times for rejection sampling, (α_R + β_R E_R[L])/p_acc; for direct sampling, α_D + β_D E_D[L]; and for uniformization, α_U + β_U E_U[L].
3. Choose the sampler with the lowest predicted CPU time.

4.4.2. Comparison for moderately large T. We end the comparison by considering the special case when T is at least moderately large. In this case some useful rules of thumb emerge.
To begin, note that the expected number of iterations required for rejection sampling and for direct sampling should be approximately equal. For moderately large T, we can make the substitution of \((\sum_c \pi_c Q_c) T\) for E[L], or just T, provided that the chain has been calibrated such that \(\sum_c \pi_c Q_c = 1\). For uniformization, E[L] is larger because of virtual state changes; under the same assumptions, here E[L] can be roughly approximated by μT = (max_c Q_c) T. Under these assumptions, virtual changes increase the number of iterations required in uniformization by a factor of

\[
\nu = \max_c Q_c \Big/ \sum_c \pi_c Q_c = \max_c Q_c,
\]

again assuming that the chain has been calibrated. In other words, the inflation factor ν is the ratio of the maximum diagonal entry of the rate matrix Q to its (weighted) average diagonal entry. We obtain ν = (1.12, 2.22, 16.2) for the rate matrices in Examples 1–3, respectively. As expected, the inflation factor is very high for the rate matrix in Example 3 and explains the observations in Figure 3.

Finally, with E[L] = T for both rejection sampling and direct sampling, the approximate complexities can be expressed as follows:

\[
\text{Rejection sampling: } (\alpha_R + \beta_R T)/p_{\mathrm{acc}}, \qquad
\text{Direct sampling: } \alpha_D + \beta_D T, \qquad
\text{Uniformization: } \alpha_U + \beta_U T \nu. \tag{4.6}
\]

To see the utility of these formulas, recall the results for the HKY model shown in Figure 1 and consider the moderately large time T = 2. Noting that for direct sampling and uniformization our approximations are not endpoint-dependent, and using estimates of (0.2155, 0.1285) for (α, β) for direct sampling and (0.2286, 0.0143) for uniformization, the formulas predict their CPU times to be 0.472 and 0.261, respectively, in both panels of Figure 1.
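The inflation factors quoted above are cheap to verify numerically. The sketch below (our own code) rebuilds the HKY + CpG rate matrix (3.1) with the parameter values from Section 3.3 and recovers ν = 16.2; note that ν is invariant to the overall calibration of Q, so the 1/s scaling can be ignored.

```python
import numpy as np

# Inflation factor nu = max_c Q_c / sum_c pi_c Q_c for the HKY + CpG
# rate matrix (3.1) with (nu_A, nu_G, nu_C, nu_T) = (0.3, 0.3, 0.2, 0.2),
# kappa = 2, gamma = 20.  State order: A, G, C, T.
def inflation_factor(Q, pi):
    exit_rates = -np.diag(Q)            # Q_c = -Q_cc
    return exit_rates.max() / (pi @ exit_rates)

nu_A, nu_G, nu_C, nu_T = 0.3, 0.3, 0.2, 0.2
kappa, gamma = 2.0, 20.0
Q = np.array([
    [0.0,          kappa * nu_G, nu_C,         nu_T],
    [kappa * nu_A, 0.0,          nu_C,         nu_T],
    [gamma * nu_A, gamma * nu_G, 0.0,          gamma * kappa * nu_T],
    [nu_A,         nu_G,         kappa * nu_C, 0.0],
])
np.fill_diagonal(Q, -Q.sum(axis=1))
pi = np.array([nu_A, nu_G, nu_C / gamma, nu_T])
pi /= pi.sum()                          # (0.3, 0.3, 0.01, 0.2) / 0.81
nu_factor = inflation_factor(Q, pi)     # 16.2, matching the value in the text
```

The single fast-leaving state C (exit rate 20 against a weighted average near 1.23) is what drives the large number of virtual jumps for this model.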
Rejection sampling, of course, is dependent on the ending state, and thus the complexities for that method illustrated in the left and right plots differ. In this case, the difference is subtle because (1) the chain is nearly mixed and (2) the stationary probabilities that govern the acceptance probabilities are similar. Using 0.2 and 0.3 as the respective acceptance probabilities for A and G, and (0.016, 0.010) for (α, β), we obtain 0.19 for the left plot and 0.13 for the right plot. Inspection of Figure 1 gives validity to our approximations, showing all of the predictions to be highly accurate.

For the HKY + CpG model in Figure 3 we can make similar predictions. We predict the CPU times for direct sampling and uniformization to be 0.472 and 0.693, again in good agreement with both figures. Using the stationary probabilities 0.012 and 0.246 for C and T, we obtain 3.112 for the left plot and 0.155 for the right plot. These predictions are again very accurate.

In the particular case of moderately large T, the guideline for choosing the most efficient sampling strategy can be made even more explicit. It follows immediately from (4.6) that uniformization is more efficient than direct sampling if

    ν < (α_D + β_D T − α_U) / (β_U T) = ν_critical.

Recall the transition matrix (2.7) of the auxiliary process. It is evident that if the inflation factor ν is large, then the transition matrix has one or more states where virtual state changes are very likely. In the uniformization sampling procedure, these virtual state changes have to be simulated, although they are eliminated in the final sample path. Many invisible virtual jumps thus make uniformization less efficient. For the state space of size 4 with (α_D, β_D, α_U, β_U) = (0.2155, 0.1285, 0.2286, 0.0143), we obtain ν_critical = 8.5. For the HKY model we have ν_HKY = 1.12 and for the HKY + CpG model we have ν_HKY+CpG = 16.2, and, thus, we predict uniformization to be more efficient for large T for the HKY model, while direct sampling is more efficient for the HKY + CpG model.

Similarly, it follows from (4.6) that rejection sampling is more efficient than uniformization if

    p_acc > (α_R + β_R T) / (α_U + β_U ν T) = p^U_critical,    (4.7)

and more efficient than direct sampling if

    p_acc > (α_R + β_R T) / (α_D + β_D T) = p^D_critical.    (4.8)

For the HKY model, we get p^U_critical = 0.147. In the case of beginning state A, ending state A and T = 2, we get from (4.4) that p_acc = 0.254. If the beginning state is A, the ending state is G and T = 2, we get p_acc = 0.347. Both acceptance probabilities are larger than p^U_critical, and we correctly predict (recall Figure 1) that rejection sampling is the most efficient algorithm in both cases. For the HKY + CpG model we get p^D_critical = 0.081. In the case where the beginning state is T, the ending state is C and T = 2, we obtain p_acc = 0.017, and with beginning state C, ending state T and T = 2, we get p_acc = 0.272. We thus correctly predict (recall Figure 3) that direct sampling is the most efficient algorithm in the first situation, while rejection sampling is more efficient in the second situation.

The approximations in (4.6) are less precise for the GY model because the larger state space increases the dependence between the beginning and ending states. However, we still get reliable predictions when applying the moderately large T approximations. The predicted values in Figure 8 for the sparse codon rate matrices of size 61 are (α_R, β_R, α_D, β_D, α_U, β_U) = (0.017, 0.011, 1.072, 0.305, 1.124, 0.105). For T = 2 we get ν_critical = 2.66, and since ν_GY = 2.22, we correctly predict uniformization to be more efficient than direct sampling. We get p^U_critical = 0.024, meaning that uniformization is also more efficient than rejection sampling if the acceptance probability is smaller than 2.4%. With T = 2 and beginning state AAA, we get acceptance probabilities 0.65, 0.01, and 1/10^5 for ending states AAG, AAC, and TTT, respectively (recall Figure 5). We thus correctly predict rejection sampling to be faster than uniformization when the ending state is AAG, and slower when the ending state is AAC or TTT.

To summarize, this section shows that when the cost of initialization α and the cost of a recursion step β are known, we can accurately predict the time it takes to produce a single sample from any of the three simulation procedures. We have thus demonstrated that choosing among the simulation procedures is an objective task that can be automated. In our analysis, we have demonstrated that it is straightforward and inexpensive to estimate α and β reliably. An alternative, which we have not addressed, would be to obtain these directly by translating the necessary calculations for each sampling strategy into floating point operations. In practice, our derivations serve well even without quantification of the initialization and recursion costs; for reasonable values of α and β, the acceptance probability p_acc and the inflation factor ν can inform which of the three sampling strategies works best.

5. Conclusion. The prevalence of endpoint-conditioned CTMCs as an inferential tool in interdisciplinary studies has led to the development of several path-sampling algorithms. As the scope of application continues to grow, so too will the need for computationally efficient approaches, and yet this aspect has, to our knowledge, not previously been considered.
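The automated selection rule described above can be sketched in a few lines. The following Python fragment (our own illustration, not the paper's R code; the function name `choose_sampler` and the dictionary layout are our choices) compares the moderately-large-T cost approximations cost_D ≈ α_D + β_D T, cost_U ≈ α_U + β_U ν T, and cost_R ≈ (α_R + β_R T)/p_acc, using the sparse 61-state codon-model constants quoted in the text:

```python
def choose_sampler(alpha, beta, T, nu, p_acc):
    """Pick the cheapest sampler under the moderately-large-T cost
    approximations. `alpha` and `beta` map the method tags 'R' (rejection),
    'D' (direct), 'U' (uniformization) to the measured initialization and
    per-step recursion costs; `nu` is the inflation factor and `p_acc`
    the path acceptance probability."""
    costs = {
        "direct": alpha["D"] + beta["D"] * T,
        "uniformization": alpha["U"] + beta["U"] * nu * T,
        "rejection": (alpha["R"] + beta["R"] * T) / p_acc,
    }
    return min(costs, key=costs.get), costs

# Constants (alpha_R, beta_R, alpha_D, beta_D, alpha_U, beta_U) for the
# sparse codon rate matrices of size 61, as quoted in the text.
alpha = {"R": 0.017, "D": 1.072, "U": 1.124}
beta = {"R": 0.011, "D": 0.305, "U": 0.105}
T = 2.0

# nu_critical from (4.6): uniformization beats direct sampling iff nu < nu_critical.
nu_critical = (alpha["D"] + beta["D"] * T - alpha["U"]) / (beta["U"] * T)
print(round(nu_critical, 2))  # 2.66, matching the value in the text

# Ending state AAG has p_acc = 0.65, so rejection sampling should win.
best, _ = choose_sampler(alpha, beta, T, nu=2.22, p_acc=0.65)
print(best)  # rejection
```

Lowering `p_acc` to 0.01 (ending state AAC) flips the decision to uniformization, reproducing the prediction made above.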
To that end, we have presented a formal comparison of three sampling strategies: (1) modified rejection sampling, (2) direct sampling, and (3) uniformization. Significantly, we show that efficiency is a relative measure that depends heavily on the specification of the conditioned stochastic process; indeed, as demonstrated in Section 3, the computational requirements for each algorithm depend on the rate matrix Q, the time interval T, and the endpoints a and b. We have shown that no one algorithm dominates the other two, and that each algorithm has its specific strengths and weaknesses. The previous section served to demystify those strengths and weaknesses by completely quantifying the computational costs associated with each sampling strategy.

We have concentrated our efforts on one specific application, namely, the simulation of a single sample path provided the rate matrix Q, the time interval T, and the endpoints a and b upon which the process is conditioned. We framed each of the three path-sampling algorithms as a progression through (1) initialization, (2) recursion, and (3) termination, and our discussion was based on an in-depth analysis of the computational requirements of each step. It should be noted that our theory is easily amenable to application-specific situations where these requirements vary; for example, it is reasonable in some cases to expect that an eigenvalue decomposition of the rate matrix has already been provided, and it is clear from the previous section how this impacts each algorithmic step.

Perhaps the most important variant to consider is the extension to the simulation of multiple sample paths. When simulating k sample paths from the same endpoint-conditioned CTMC using any of the aforementioned strategies, the initialization step need only be done once.
On the other hand, the iterations required for each sample path cannot in general be consolidated, and, thus, k affects complexity as a scale factor of β. It follows that for large enough k the initialization cost is of negligible concern, and because our examples have shown that α and β are somewhat comparable, in practice k need not be that large. In such cases, complexity is determined by kβE[L] for direct sampling and uniformization, and by kβE[L]/p_acc for rejection sampling. As a result of the virtual state changes that occur when sampling by uniformization, E[L] for that method will typically be somewhat larger than for the other two. For direct sampling, this is offset by a larger β, and, thus, the decision between direct sampling and uniformization rests upon the number of virtual state changes required. Rejection sampling, by contrast, completes each iteration quickly without the use of virtual transitions; it is once again the path acceptance probability that determines whether or not rejection sampling is viable.

In the direct sampling algorithm we use a root finder to simulate the waiting time before the next state change. If multiple sample paths are required, it would be beneficial to completely characterize (or very accurately approximate) the cumulative distribution function for the waiting time. As soon as this task is done, drawing from the conditional waiting time distribution would be almost instantaneous. Similarly, one could, in the case of uniformization, store the values of the probability masses in Remark 7 for i ≤ 3, say. Storing these calculations allows for very fast generation of state changes from the discrete Markov chain determined by the transition probability matrix R.
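The amortized comparison for k paths can be written down directly. The sketch below is our own illustration of the cost model just described (the function `total_cost` and the numeric constants are hypothetical, chosen only to exhibit the qualitative trade-offs: uniformization pays for extra virtual jumps through a larger E[L] but has cheap steps, direct sampling has fewer steps but a larger per-step cost, and rejection sampling divides by p_acc):

```python
def total_cost(alpha, beta, k, expected_steps, p_acc=1.0):
    """Approximate cost of k endpoint-conditioned sample paths:
    initialization (alpha) is paid once; each accepted path needs
    expected_steps recursion steps at cost beta each, and rejection
    sampling also pays for the discarded attempts via 1/p_acc."""
    return alpha + k * beta * expected_steps / p_acc

# Hypothetical per-method constants for illustration (not fitted values).
k = 1000
direct = total_cost(alpha=1.0, beta=0.30, k=k, expected_steps=3.0)
uniform = total_cost(alpha=1.1, beta=0.10, k=k, expected_steps=6.0)
reject = total_cost(alpha=0.02, beta=0.01, k=k, expected_steps=3.0,
                    p_acc=0.25)

# For large k the one-off alpha terms wash out and the ranking is
# governed by beta * E[L] / p_acc, exactly as described in the text.
print(direct, uniform, reject)
```

With these illustrative numbers the rejection sampler wins despite its 25% acceptance rate, because its per-step cost is so small; shrinking p_acc further eventually reverses the ordering.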
In short, the primary distinction when multiple sample paths are required is that the front-loaded procedures, direct sampling and uniformization, become comparatively more desirable, the reason being that more knowledge about the particular CTMC under consideration can be taken into account.

Finally, it should be noted that the efficiency of rejection sampling increases as the space of valid endpoint conditions is enlarged. As an example, consider the case of a CTMC observed at equidistant time points, so that the goal is to simulate sample paths using the same rate matrix Q and time interval T for a set of endpoint pairs {(a_i, b_i) : i = 1, ..., n}. In this case we can first use (unmodified) rejection sampling and assign each sample path to a pair (a_i, b_i) that matches the beginning and ending state of the simulated path. If the model is appropriate, this procedure could easily account for the majority of the needed sample paths, and very few rejections would be required. The remaining sample paths can subsequently be simulated using one of the three endpoint-conditioned samplers described in this paper.

Acknowledgments. We thank Jeff Thorne, Ben Redelings, and Ole F. Christensen for fruitful discussions and valuable comments on the manuscript. We are very grateful to the AE and the two reviewers for many constructive comments and suggestions.

SUPPLEMENTARY MATERIAL

Efficient simulation from finite-state, continuous-time Markov chains with incomplete observations (DOI: 10.1214/09-AOAS247SUPP; .zip). We accompany our paper with R code (www.r-project.org) that can reproduce the figures in the manuscript [Hobolth and Stone (2009)]. A description of how the code is organized is included in the supplementary material.

REFERENCES

Blackwell, P. (2003). Bayesian inference for Markov processes with diffusion and discrete components. Biometrika 90 613–627. MR2006839
Bladt, M. and Sørensen, M. (2005). Statistical inference for discretely observed Markov jump processes. J. Roy. Statist. Soc. Ser. B 67 395–410. MR2155345
Fearnhead, P. and Sherlock, C. (2006). An exact Gibbs sampler for the Markov modulated Poisson processes. J. Roy. Statist. Soc. Ser. B 68 767–784. MR2301294
Goldman, N. and Yang, Z. (1994). A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11 725–736.
Guttorp, P. (1995). Stochastic Modeling of Scientific Data. Chapman and Hall, Suffolk, Great Britain. MR1358359
Hasegawa, M., Kishino, H. and Yano, T. (1985). Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22 160–174.
Hobolth, A. (2008). A Markov chain Monte Carlo expectation maximization algorithm for statistical analysis of DNA sequence evolution with neighbor-dependent substitution rates. J. Comput. Graph. Statist. 17 138–162.
Hobolth, A. and Jensen, J. (2005). Statistical inference in evolutionary models of DNA sequences via the EM algorithm. Statist. Appl. Genet. Mol. Biol. 4 18. MR2170434
Hobolth, A. and Stone, E. (2009). Supplement to "Efficient simulation from finite-state, continuous-time Markov chains with incomplete observations." DOI: 10.1214/09-AOAS247SUPP.
Holmes, I. and Rubin, G. (2002). An expectation maximization algorithm for training hidden substitution models. J. Mol. Biol. 317 757–768.
Hwang, D. and Green, P. (2004). Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution. PNAS 101 13994–14001.
Jensen, A. (1953). Markoff chains as an aid in the study of Markoff processes. Skand. Aktuarietidsskr. 36 87–91. MR0057488
Jensen, J. and Pedersen, A. (2000). Probabilistic models of DNA sequence evolution with context dependent rates of substitution. Adv. in Appl. Probab. 32 499–517. MR1778577
Lartillot, N. (2006). Conjugate Gibbs sampling for Bayesian phylogenetic models. J. Comput. Biol. 13 1701–1722. MR2287736
Mateiu, L. and Rannala, B. (2006). Inferring complex DNA substitution processes on phylogenies using uniformization and data augmentation. Syst. Biol. 55 259–269.
Minin, V. and Suchard, M. (2008). Counting labeled transitions in continuous-time Markov models of evolution. J. Math. Biol. 56 391–412. MR2358440
Mouse Genome Sequencing Consortium (2002). Initial sequencing and comparative analysis of the mouse genome. Nature 420 520–562.
Nielsen, R. (2002). Mapping mutations on phylogenies. Syst. Biol. 51 729–739.
Rodrigue, N., Philippe, H. and Lartillot, N. (2008). Uniformization for sampling realizations of Markov processes: Applications to Bayesian implementations of codon substitution models. Bioinformatics 24 56–62.
Siepel, A., Pollard, K. and Haussler, D. (2006). New methods for detecting lineage-specific selection. In Proceedings of the 10th International Conference on Research in Computational Molecular Biology 190–205. Springer, Berlin.
The Chimpanzee Sequencing and Analysis Consortium (2005). Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437 69–87.

Department of Mathematical Sciences
Aarhus University
Denmark
E-mail: asger@daimi.au.dk

Department of Statistics
North Carolina State University
Raleigh, North Carolina 27695
USA
E-mail: eric_stone@ncsu.edu
