Simulation from endpoint-conditioned, continuous-time Markov chains on a finite state space, with applications to molecular evolution


Authors: Asger Hobolth, Eric A. Stone

The Annals of Applied Statistics 2009, Vol. 3, No. 3, 1204–1231. DOI: 10.1214/09-AOAS247. © Institute of Mathematical Statistics, 2009.

By Asger Hobolth (Aarhus University) and Eric A. Stone (North Carolina State University)

Abstract. Analyses of serially-sampled data often begin with the assumption that the observations represent discrete samples from a latent continuous-time stochastic process. The continuous-time Markov chain (CTMC) is one such generative model whose popularity extends to a variety of disciplines ranging from computational finance to human genetics and genomics. A common theme among these diverse applications is the need to simulate sample paths of a CTMC conditional on realized data that is discretely observed. Here we present a general solution to this sampling problem when the CTMC is defined on a discrete and finite state space. Specifically, we consider the generation of sample paths, including intermediate states and times of transition, from a CTMC whose beginning and ending states are known across a time interval of length $T$. We first unify the literature through a discussion of the three predominant approaches: (1) modified rejection sampling, (2) direct sampling, and (3) uniformization. We then give analytical results for the complexity and efficiency of each method in terms of the instantaneous transition rate matrix $Q$ of the CTMC, its beginning and ending states, and the length of sampling time $T$. In doing so, we show that no method dominates the others across all model specifications, and we give explicit proof of which method prevails for any given $Q$, $T$, and endpoints. Finally, we introduce and compare three applications of CTMCs to demonstrate the pitfalls of choosing an inefficient sampler.

1. Introduction.
This paper considers the problem of conditional sampling from a continuous-time Markov chain (CTMC) defined on a discrete and finite state space. In the ideal case, given continuously-observed sample paths, statistical inference is straightforward: the sufficient statistics are simply the number of transitions between any two states and the total time spent in each state [e.g., Guttorp (1995), Section 3.7]. In most applications of CTMCs, however, the stochastic process $\{X(t) : 0 \le t \le T\}$ serves as a continuous-time model for data sampled at discrete points in time $T_0 = 0 < T_1 < \cdots < T_{N-1} < T_N = T$. Sometimes this discretized problem remains amenable to closed-form analysis, and in such cases sampling is unnecessary. Examples from the literature include Holmes and Rubin (2002) and Hobolth and Jensen (2005), both of whom motivated EM algorithms by showing how to calculate the number of transitions between any two states and the time spent in a state for an endpoint-conditioned CTMC. More recently, Siepel, Pollard and Haussler (2006) showed how to calculate the probability mass function for the number of transitions, and Minin and Suchard (2008) derived analytically tractable results for the moments of the number of transitions between any two states.

[Received December 2007; revised December 2008. Supported by the US National Institutes of Health (Grant R01 GM070806). Key words and phrases: continuous-time Markov chains, simulation, molecular evolution.]
Our work complements these approaches, focusing on the case where parametric inference relies upon the simulation of continuous sample paths from the CTMC conditional on $X(T_0), \ldots, X(T_N)$. As a consequence of the Markov assumption, knowledge of the data $X(T_0), \ldots, X(T_N)$ effectively partitions the process into independent components $\{X(t) : T_k \le t \le T_{k+1}\}$ whose endpoints $X(T_k)$ and $X(T_{k+1})$ are known. Thus, sampling a realization from $\{X(t) : 0 \le t \le T\}$ given the observed data amounts to sampling from $N$ independent and identical CTMCs, each conditioned on its endpoints $X(T_k)$ and $X(T_{k+1})$ and spanning a fixed interval of time $T_{k+1} - T_k$. In what follows, we specialize to one of these $N$ components, focusing on how to simulate sample paths from a CTMC when only its endpoints are known. Crucially, while calculating the sufficient statistics from the simulated data remains easy, the simulations themselves may be prohibitively time-consuming without an efficient strategy for generating sample paths.

The wide applicability of CTMCs to serially-sampled data has accelerated their entry into the interdisciplinary literature, and several recent papers have specifically considered the problem of sampling paths from an endpoint-conditioned process. Blackwell (2003), for example, discusses a naive rejection sampling method as it applies to the analysis of radio-tracking data. A similar approach is considered in Bladt and Sørensen (2005), the motivation there coming from the field of mathematical finance. For our purposes, a naive rejection sampling method is one that simulates sample paths forward in time according to the specified process: as a consequence of endpoint conditioning, rejection of a sample path occurs whenever there is disagreement between the simulated and observed ending states.
Nielsen (2002) improves upon the naive rejection sampling approach in an application to molecular evolution. His method, which we call "modified rejection sampling," conditions on the event that at least one state change must have occurred in cases where the observed endpoints of the process are not the same. Nevertheless, despite Nielsen's improvement, sampling forward in time without specific regard to the ending state may lead to a prohibitively low rate of path acceptance. As an alternative, Hobolth (2008) suggests a direct sampling procedure based on analytical expressions for the probabilities of state transitions and their waiting times. A final approach, often called uniformization, originates with the work of Jensen (1953). The idea is to construct a related process that permits virtual transitions (in which the state does not change) so that the number of state transitions in a time interval can be seen as Poisson distributed and the state transitions themselves constitute a Markov chain. Sampling from this related process is equivalent to sampling from the target CTMC provided that the virtual changes are ignored. Recent applications can be found in Fearnhead and Sherlock (2006), Lartillot (2006), Mateiu and Rannala (2006), and Rodrigue, Philippe and Lartillot (2008).

Though developed for distinct applications, each of the aforementioned path-sampling procedures simulates from the same conditional distribution and thus solves the same problem. In light of this interchangeability, and because of the importance of path sampling to statistical inference on endpoint-conditioned CTMCs, it is imperative to ask whether one procedure should be preferred for its computational efficiency. The remainder of the paper seeks an exhaustive answer to this question.
We first give a thorough discussion in Section 2 of the three sampling strategies: (1) modified rejection sampling, (2) direct sampling, and (3) uniformization. In Section 3 we introduce three CTMCs that highlight the heterogeneous relationship between parameterization of the stochastic process and the efficiency of each sampling strategy. Section 4 then generalizes the results of Section 3 by providing analytical expressions for the efficiency of each sampler for arbitrary parameterizations of the CTMC. We conclude by summarizing the results into recommendations on how to best simulate sample paths from any given CTMC. In doing so, we also identify cases for which one or more of the potential strategies is guaranteed to fail.

2. Sampling strategies for endpoint-conditioned chains.

In this section we review three strategies for simulating a realization of a finite-state CTMC $\{X(t) : 0 \le t \le T\}$ conditional on its beginning state $X(0) = a$ and ending state $X(T) = b$. The chain is defined by its instantaneous rate matrix $Q$ with off-diagonal entries $Q_{ab} \ge 0$ and diagonal entries $Q_{aa} = -\sum_{b \ne a} Q_{ab} = -Q_a < 0$. We make the further assumption that $X(t)$ is irreducible and positive recurrent so that a stationary distribution $\pi$ exists. Finally, unless otherwise noted, we set $\sum_c \pi_c Q_c = 1$ so that one state change is expected per unit time.

To understand the sampling difficulties associated with conditioning a CTMC on its endpoints, it is useful to first review how one proceeds when the ending state $X(T)$ is unobserved. Simulating a sample path of $\{X(t) : 0 \le t \le T\}$ that begins at $X(0) = a$ can be accomplished by a simple iterative procedure. The key observation is that the waiting time $\tau$ to the first state change is exponentially distributed with mean $1/Q_a$.
If $\tau > T$, there is no state change in the interval $[0, T]$, and the corresponding sample path is constant; otherwise, a new state $c$ is drawn from the discrete probability distribution with probability masses $Q_{ac}/Q_a$, and the procedure is iterated for the shorter time interval $[\tau, T]$ (or, equivalently, for $[0, T - \tau]$). For reference, we present this forward sampling algorithm below:

Algorithm 1 (Forward sampling).
1. Sample $\tau \sim \text{Exponential}(Q_a)$. If $\tau \ge T$, we are done: $X(t) = a$ for all $t \in [0, T]$.
2. If $\tau < T$, choose a new state $c \ne a$ from a discrete probability distribution with probability masses $Q_{ac}/Q_a$. Repeat the procedure with new beginning state $c$ and new time interval $[\tau, T]$.

Under the assumption that the ending state $X(T) = b$ is observed, conditioning excludes all paths sampled from the preceding algorithm that fail to end in state $b$. This is the essence of the rejection sampling approach, whose modification by Nielsen we discuss in the next subsection.

2.1. Rejection sampling.

As implemented in Blackwell (2003) and Bladt and Sørensen (2005), naive rejection sampling uses forward sampling to generate candidate sample paths of an endpoint-conditioned CTMC. From these, the acceptable candidates are those for which the simulated ending state and the observed ending state are the same. In particular, when sampling forward, the probability of hitting the observed ending state $b$ is $P_{ab}(T) = [\exp(QT)]_{ab}$. Thus, if $T$ is large, this probability approximately equals the stationary probability $\pi_b$ of $b$. Conversely, if $T$ is small and $a \ne b$, the probability is approximately $Q_{ab} T$. It follows that in the case of (i) large time $T$ and small stationary probability $\pi_b$, or (ii) different states $a \ne b$ and small time $T$, naive rejection sampling is bound to fail.
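Algorithm 1 and the naive rejection scheme can be sketched as follows. This is a minimal Python sketch (the paper's supplementary programs are in R); the 2-state rate matrix and the function names are ours, chosen purely for illustration.

```python
import random

def forward_sample(Q, a, T, rng=random):
    """Algorithm 1: simulate a CTMC path starting in state `a` over [0, T].
    Q is the rate matrix as a list of lists; returns (states, jump_times)."""
    states, times = [a], [0.0]
    t, state = 0.0, a
    while True:
        rate = -Q[state][state]            # total exit rate Q_a
        t += rng.expovariate(rate)         # waiting time ~ Exponential(Q_a)
        if t >= T:
            return states, times           # no further change before T
        # choose the next state c != state with probability Q_{ac} / Q_a
        u, cum = rng.random() * rate, 0.0
        for c, q in enumerate(Q[state]):
            if c == state:
                continue
            cum += q
            if u <= cum:
                state = c
                break
        states.append(state)
        times.append(t)

def naive_rejection_sample(Q, a, b, T, rng=random):
    """Naive rejection: forward-sample until the path happens to end in b."""
    while True:
        states, times = forward_sample(Q, a, T, rng)
        if states[-1] == b:
            return states, times

# Illustrative 2-state chain (the rates are assumptions, not from the paper)
Q = [[-1.0, 1.0],
     [2.0, -2.0]]
states, times = naive_rejection_sample(Q, a=0, b=1, T=1.0)
```

Note that the acceptance check touches only the final state, which is why the expected number of wasted paths grows as $P_{ab}(T)$ shrinks.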
Nielsen's modification improves upon naive rejection sampling in the latter case [Nielsen (2002)]. By a conditioning argument, the time $\tau$ to the first state change, given that at least one state change occurs before $T$ and $X(0) = a$, has density

$$f(\tau) = \frac{Q_a e^{-\tau Q_a}}{1 - e^{-T Q_a}}, \qquad 0 \le \tau \le T. \tag{2.1}$$

The corresponding cumulative distribution function is

$$F(\tau) = \frac{1 - e^{-\tau Q_a}}{1 - e^{-T Q_a}}, \qquad 0 \le \tau \le T,$$

with explicit inverse $F^{-1}(u) = -\log[1 - u(1 - e^{-T Q_a})]/Q_a$. Thus, upon sampling $u$ from a Uniform(0, 1) distribution, transformation yields the sampled waiting time $F^{-1}(u)$ to the first state change of the CTMC.

Algorithm 2 (Modified rejection sampling).
If $a = b$:
1. Simulate from $\{X(t) : 0 \le t \le T\}$ using the forward sampling algorithm.
2. Accept the simulated path if $X(T) = a$; otherwise, return to step 1 and begin anew.
If $a \ne b$:
1. Sample $\tau$ from the density (2.1) using the inverse transformation method, and choose a new state $c \ne a$ from a discrete probability distribution with probability masses $Q_{ac}/Q_a$.
2. Simulate the remainder $\{X(t) : \tau \le t \le T\}$ using the forward sampling algorithm from the beginning state $X(\tau) = c$.
3. Accept the simulated path if $X(T) = b$; otherwise, return to step 1 and begin anew.

In short, modified rejection sampling explicitly avoids simulating constant sample paths when it is known that at least one state change must take place. This is particularly beneficial when $T$ is small, as the naive approach will be dominated by wasted constant sample paths whose ending state remains $a$ [which occurs with probability approximately $(1 - Q_a T)$]. Nevertheless, if the transition from $a$ to $b$ is unlikely so that $Q_{ab}/Q_a$ is small, then essentially every sample path will still be rejected. In such a setting, either direct sampling or uniformization is required.
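Because $F^{-1}$ is available in closed form, the conditional waiting time in (2.1) is a one-line transformation of a uniform draw. A sketch under the same caveats (Python rather than the paper's R; the parameter values below are arbitrary):

```python
import math
import random

def first_jump_time(Qa, T, u=None):
    """Sample tau, the time of the first state change, conditional on at
    least one change occurring in [0, T], via the explicit inverse CDF
    F^{-1}(u) = -log(1 - u*(1 - exp(-T*Qa))) / Qa   (inverse of eq. 2.1)."""
    if u is None:
        u = random.random()
    return -math.log(1.0 - u * (1.0 - math.exp(-T * Qa))) / Qa

# Sanity check: F(F^{-1}(u)) == u with F(t) = (1 - e^{-t Qa}) / (1 - e^{-T Qa})
Qa, T = 2.0, 0.5     # arbitrary exit rate and interval length
u = 0.3
tau = first_jump_time(Qa, T, u)
F = (1.0 - math.exp(-tau * Qa)) / (1.0 - math.exp(-T * Qa))
```

The inverse transform guarantees $0 < \tau < T$ for every $u \in (0, 1)$, so no draw is ever wasted on a constant path.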
2.2. Direct sampling.

The direct sampling procedure of Hobolth (2008) requires that the instantaneous rate matrix $Q$ admits an eigenvalue decomposition. Under that assumption, let $U$ be a matrix with the eigenvectors of $Q$ as columns and let $D_\lambda$ be the diagonal matrix of corresponding eigenvalues, such that $Q = U D_\lambda U^{-1}$. Then, for any time $t$, the transition probability matrix of the CTMC can be calculated as

$$P(t) = e^{Qt} = U e^{t D_\lambda} U^{-1} \quad \text{and} \quad P_{ab}(t) = \sum_j U_{aj} U^{-1}_{jb} e^{t \lambda_j}. \tag{2.2}$$

Consider first the case where the endpoints of the CTMC are identical, so that $X(0) = X(T) = a$. The probability that there are no state changes in the time interval $[0, T]$, conditional on $X(0) = a$ and $X(T) = a$, is given by

$$p_a = \frac{e^{-Q_a T}}{P_{aa}(T)}. \tag{2.3}$$

Thus, when $X(0) = X(T) = a$, the sample path is the constant $X(t) = a$ with probability $p_a$, and has at least one state change with probability $(1 - p_a)$.

Next consider the case where $X(0) = a$ and $X(T) = b$, with $a \ne b$. Let $\tau$ denote the waiting time until the first state change. The conditional probability that the first state change is to $i$ at a time smaller than $t$ is

$$P(\tau \le t, X(\tau) = i \mid X(0) = a, X(T) = b) = \frac{P(\tau \le t, X(\tau) = i, X(T) = b \mid X(0) = a)}{P(X(T) = b \mid X(0) = a)} = \int_0^t Q_a e^{-Q_a z} \frac{Q_{ai}}{Q_a} \frac{P_{ib}(T - z)}{P_{ab}(T)} \, dz = \int_0^t f_i(z) \, dz,$$

where $f_i(z)$ is the integrand. Specifically, conditional on the endpoints $X(0) = a$ and $X(T) = b$, the probability $p_i$ that the first state change is to $i$ is

$$p_i = \int_0^T f_i(t) \, dt, \qquad i \ne a, \ a \ne b. \tag{2.4}$$

Using (2.2), we can rewrite the integrand as

$$f_i(t) = Q_{ai} e^{-Q_a t} \frac{P_{ib}(T - t)}{P_{ab}(T)} = \frac{Q_{ai}}{P_{ab}(T)} \sum_j U_{ij} U^{-1}_{jb} e^{T \lambda_j} e^{-t(\lambda_j + Q_a)}, \tag{2.5}$$

which renders the integral in (2.4) straightforward. We get

$$p_i = \frac{Q_{ai}}{P_{ab}(T)} \sum_j U_{ij} U^{-1}_{jb} J_{aj}, \tag{2.6}$$

where

$$J_{aj} = \begin{cases} T e^{T \lambda_j}, & \text{if } \lambda_j + Q_a = 0, \\[4pt] \dfrac{e^{T \lambda_j} - e^{-Q_a T}}{\lambda_j + Q_a}, & \text{if } \lambda_j + Q_a \ne 0. \end{cases}$$

We now have a procedure for simulating the next state and the waiting time before the state change occurs. Iterating the procedure allows us to simulate a sample path $\{X(t) : 0 \le t \le T\}$ that begins in $X(0) = a$ and ends in $X(T) = b$.

Algorithm 3 (Direct sampling).
1. If $a = b$, sample $Z \sim \text{Bernoulli}(p_a)$, where $p_a$ is given by (2.3). If $Z = 1$, we are done: $X(t) = a$, $0 \le t \le T$.
2. If $a \ne b$ or $Z = 0$, then at least one state change occurs. Calculate $p_i$ for all $i \ne a$ from (2.6). Sample $i \ne a$ from the discrete probability distribution with probability masses $p_i/p_{-a}$, $i \ne a$, where $p_{-a} = \sum_{j \ne a} p_j$. [Note that $p_{-a} = 1$ when $a \ne b$ and $p_{-a} = (1 - p_a)$ when $a = b$.]
3. Sample the waiting time $\tau$ in state $a$ according to the continuous density $f_i(t)/p_i$, $0 \le t \le T$, where $f_i(t)$ is given by (2.5). Set $X(t) = a$, $0 \le t < \tau$.
4. Repeat the procedure with new starting value $i$ and new time interval of length $T - \tau$.

Remark 4. In step 3 above, we simulate from the scaled density (2.5) by finding the cumulative distribution function and then using the inverse transformation method. To calculate the cumulative distribution function, note that

$$\int_0^t e^{T \lambda_j} e^{-s(\lambda_j + Q_a)} \, ds = \begin{cases} t e^{T \lambda_j}, & \text{if } \lambda_j + Q_a = 0, \\[4pt] e^{T \lambda_j} \dfrac{1 - e^{-t(\lambda_j + Q_a)}}{\lambda_j + Q_a}, & \text{if } \lambda_j + Q_a \ne 0. \end{cases}$$
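The computation of the first-change probabilities $p_i$ in (2.6) via the eigenvalue decomposition can be sketched as follows (Python rather than the paper's R; the 3-state rate matrix and the function name are ours, and $Q$ is assumed diagonalizable with real eigenvalues):

```python
import numpy as np

def first_change_probs(Q, a, b, T):
    """p_i from eq. (2.6): probability that the first state change of the
    endpoint-conditioned chain (X(0)=a, X(T)=b, a != b) is to state i."""
    lam, U = np.linalg.eig(Q)                  # Q = U D_lambda U^{-1}
    Uinv = np.linalg.inv(U)
    Qa = -Q[a, a]
    Pab = (U @ np.diag(np.exp(T * lam)) @ Uinv)[a, b]
    # J_{aj}: integral of e^{T lam_j} e^{-t (lam_j + Q_a)} over [0, T];
    # errstate guards the unused branch of np.where when lam_j + Q_a = 0
    with np.errstate(divide="ignore", invalid="ignore"):
        J = np.where(np.isclose(lam + Qa, 0.0),
                     T * np.exp(T * lam),
                     (np.exp(T * lam) - np.exp(-Qa * T)) / (lam + Qa))
    p = np.array([Q[a, i] / Pab * np.dot(U[i, :] * Uinv[:, b], J)
                  for i in range(Q.shape[0])])
    p[a] = 0.0                                 # the first change cannot be to a
    return p.real

# Illustrative 3-state rate matrix (the values are assumptions)
Q = np.array([[-2.0, 1.0, 1.0],
              [1.0, -3.0, 2.0],
              [1.0, 1.0, -2.0]])
p = first_change_probs(Q, a=0, b=2, T=0.5)
```

Since $a \ne b$ forces at least one change before $T$, the entries of $p$ should sum to one, which gives a useful numerical check on the decomposition.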
To use the inverse transformation method, we must find the time $t$ such that $F(t) - u = 0$, where $F(t)$ is the cumulative distribution function and $0 < u < 1$. In subsequent sections, we have used a (numerical) root finder for this purpose.

2.3. Uniformization.

The final strategy that we consider permits sampling from $X(t)$ through construction of an auxiliary stochastic process $Y(t)$. Let $\mu = \max_c Q_c$ and define the process $Y(t)$ by letting the state changes be determined by a discrete-time Markov process with transition matrix

$$R = I + \frac{1}{\mu} Q. \tag{2.7}$$

Note that, by construction, we allow virtual state changes in which a jump occurs but the state does not change. Indeed, virtual state changes for state $a$ are possible if $R_{aa} > 0$. Next, let the epochs of state changes be determined by an independent Poisson process with rate $\mu$. The stochastic process $Y(t)$ is called a Markov chain subordinated to a Poisson process and is equivalent to the original continuous-time Markov chain $X(t)$, as the following calculation shows:

$$P(t) = e^{Qt} = e^{\mu(R - I)t} = e^{-\mu t} \sum_{n=0}^{\infty} \frac{(\mu t R)^n}{n!} = \sum_{n=0}^{\infty} \frac{e^{-\mu t} (\mu t)^n}{n!} R^n. \tag{2.8}$$

This approach is commonly referred to as uniformization, and we adopt that language here. In what follows, we describe how uniformization can be used to construct an algorithm for exact sampling from $X(t)$, conditional on the beginning and ending states.

It follows directly from (2.8) that the transition function of the Markov chain subordinated to a Poisson process is given by

$$P_{ab}(t) = P(X(t) = b \mid X(0) = a) = e^{-\mu t} 1(a = b) + \sum_{n=1}^{\infty} \frac{e^{-\mu t} (\mu t)^n}{n!} R^n_{ab}.$$

Thus, the distribution of the number of state changes $N$ (including the virtual ones) for the conditional process that starts in $X(0) = a$ and ends in $X(T) = b$ is given by

$$P(N = n \mid X(0) = a, X(T) = b) = \frac{e^{-\mu T} (\mu T)^n}{n!} \cdot \frac{R^n_{ab}}{P_{ab}(T)}. \tag{2.9}$$

Given the number of state changes $N = n$, the times $t_1, \ldots, t_n$ at which those state changes occur are uniformly distributed in the time interval $[0, T]$. Furthermore, the state changes $X(t_1), \ldots, X(t_{n-1})$ are determined by a Markov chain with transition matrix $R$, conditional on the beginning state $X(0) = a$ and ending state $X(t_n) = b$.

Putting these things together, we have the following algorithm for simulating a continuous-time Markov chain $\{X(t) : 0 \le t \le T\}$ conditional on the starting state $X(0) = a$ and ending state $X(T) = b$.

Algorithm 5 (Uniformization).
1. Simulate the number of state changes $n$ from the distribution (2.9).
2. If the number of state changes is 0, we are done: $X(t) = a$, $0 \le t \le T$.
3. If the number of state changes is 1 and $a = b$, we are done: $X(t) = a$, $0 \le t \le T$.
4. If the number of state changes is 1 and $a \ne b$, simulate $t_1$ uniformly at random in $[0, T]$, and we are done: $X(t) = a$ for $t < t_1$, and $X(t) = b$ for $t_1 \le t \le T$.
5. When the number of state changes $n$ is at least 2, simulate $n$ independent uniform random numbers in $[0, T]$ and sort the numbers in increasing order to obtain the times of state changes $0 < t_1 < \cdots < t_n < T$. Simulate $X(t_1), \ldots, X(t_{n-1})$ from a discrete-time Markov chain with transition matrix $R$, conditional on starting state $X(0) = a$ and ending state $X(t_n) = b$. Determine which state changes are virtual, and return the remaining changes and corresponding times of change.

Remark 6. In Step 1 above, we find the number of state changes $n$ by simulating $u$ from a Uniform(0, 1) distribution and letting $n$ be the first time the cumulative sum of (2.9) exceeds $u$. When calculating the cumulative sum we need to raise $R$ to powers 1 through $n$. These powers of $R$ are stored because they are required in Step 5 of the algorithm.
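Algorithm 5 can be condensed into a short routine. A Python sketch (the paper's code is in R; the 3-state rate matrix is an illustrative assumption, and $P_{ab}(T)$ is computed from the eigendecomposition (2.2), so $Q$ is assumed diagonalizable):

```python
import numpy as np

def transition_matrix(Q, t):
    """P(t) = exp(Qt) via the eigendecomposition (2.2)."""
    lam, U = np.linalg.eig(Q)
    return (U @ np.diag(np.exp(t * lam)) @ np.linalg.inv(U)).real

def uniformization_sample(Q, a, b, T, rng=None):
    """Algorithm 5: sample a CTMC path conditional on X(0)=a and X(T)=b.
    Returns (states, times) of the actual (non-virtual) state changes."""
    rng = rng if rng is not None else np.random.default_rng()
    m = Q.shape[0]
    mu = np.max(-np.diag(Q))
    R = np.eye(m) + Q / mu                        # eq. (2.7)
    Pab = transition_matrix(Q, T)[a, b]
    # Step 1: draw n by inverting the cumulative sum of (2.9);
    # powers of R are stored for reuse in Step 5 (Remark 6)
    u, cum, n = rng.random(), 0.0, 0
    Rpow = [np.eye(m)]
    poisson = np.exp(-mu * T)                     # e^{-mu T} (mu T)^n / n!
    while True:
        cum += poisson * Rpow[n][a, b] / Pab
        if cum > u:
            break
        n += 1
        poisson *= mu * T / n
        Rpow.append(Rpow[-1] @ R)
    if n == 0 or (n == 1 and a == b):
        return [a], [0.0]                         # constant path (Steps 2-3)
    times = np.sort(rng.uniform(0.0, T, size=n))  # jump epochs (Steps 4-5)
    path = [a]
    for i in range(1, n):                         # bridge X(t_1..t_{n-1}) via R
        w = R[path[-1], :] * Rpow[n - i][:, b]    # Remark 7, up to normalization
        path.append(int(rng.choice(m, p=w / w.sum())))
    path.append(b)
    states, jumps = [a], [0.0]                    # discard virtual changes
    for s, t in zip(path[1:], times):
        if s != states[-1]:
            states.append(s)
            jumps.append(float(t))
    return states, jumps

# Illustrative 3-state rate matrix (the values are assumptions)
Q = np.array([[-2.0, 1.0, 1.0],
              [1.0, -3.0, 2.0],
              [1.0, 1.0, -2.0]])
states, times = uniformization_sample(Q, a=0, b=2, T=1.0)
```

The bridge weights in the loop anticipate the normalizing constant $(R^{n-i+1})_{x_{i-1},b}$, which is absorbed by dividing by `w.sum()`.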
We use the eigenvalue decomposition (2.2) of $Q$ to calculate $P_{ab}(t)$.

Remark 7. In Step 5 above we simulate $X(t_i)$, $i = 1, \ldots, n-1$, from the discrete distribution with probability masses

$$P(X(t_i) = x_i \mid X(t_{i-1}) = x_{i-1}, X(t_n) = b) = \frac{R_{x_{i-1}, x_i} (R^{n-i})_{x_i, b}}{(R^{n-i+1})_{x_{i-1}, b}}.$$

Thus far we have outlined three competing strategies for simulating sample paths from an endpoint-conditioned CTMC. Though our discussion has been agnostic to the number of desired sample paths, this quantity has a direct and varied impact on the computational efficiency of each sampler. For example, while both direct sampling and uniformization require a possibly time-consuming eigendecomposition of $Q$, it is clear that one such computation will suffice even when multiple sample paths are desired. The number of sample paths desired from an endpoint-conditioned CTMC is application driven: estimation of some quantity, such as the expected number of visits to a given state, may require many sample paths, whereas the updating step in a Bayesian computation may require as few as one. Rather than exhaust potential applications, we have chosen to formally analyze the static and dynamic costs associated with each sampling strategy. We defer this discussion to Section 4, using the next section to demonstrate by example that no one strategy dominates the others when only one sample path is required.

3. Comparison by example.

To illustrate the strategies detailed above, in this section we introduce three explicit examples of CTMCs for which the performance of each sampler can be directly compared. We begin with a pair of CTMCs in common use for molecular evolutionary studies; each provides a unique stochastic description of how DNA sequences evolve over time.
For these and the remaining example, we compare the computational demands (measured as CPU time) of modified rejection sampling, of direct sampling, and of uniformization. For each example, the computational demands are accumulated over 100 independent samples.

In what follows, note that while rejection sampling and uniformization do not require any numerical approximations, direct sampling requires a root finder. The numerical approximation of the root finder can be made arbitrarily precise, but the choice of precision affects the running time. Without loss of generality, we have chosen the default settings of the root finder in the statistical programming language R (www.r-project.org, Version 2.0.0). The root finder typically converges in 4 to 8 iterations. The programs are run on an Intel 2.40 GHz Pentium 4 processor and are available in the supplementary material [Hobolth and Stone (2009)].

3.1. Example 1: Molecular evolution on the nucleotide level.

We first consider a popular model of DNA sequence evolution at the nucleotide level. The state space for a particular site in a DNA sequence is of size 4, corresponding to the DNA building blocks adenine (A), guanine (G), cytosine (C), and thymine (T). The HKY model of Hasegawa, Kishino and Yano (1985) describes the evolution of one site in a DNA sequence through an instantaneous rate matrix of the form

$$Q = (1/s) \begin{pmatrix} \cdot & \kappa \pi_G & \pi_C & \pi_T \\ \kappa \pi_A & \cdot & \pi_C & \pi_T \\ \pi_A & \pi_G & \cdot & \kappa \pi_T \\ \pi_A & \pi_G & \kappa \pi_C & \cdot \end{pmatrix},$$

where the states appear in the order A, G, C, T and the diagonal elements of $Q$ are such that the rows sum to zero. Note that state changes of the CTMC are called "substitutions" in this context to reflect that the nucleotide in a particular site has been substituted by another. The HKY model is reversible and has stationary distribution $\pi = (\pi_A, \pi_G, \pi_C, \pi_T)$.
The ts/tv rate ratio parameter $\kappa$ is used to distinguish between transitions [substitutions between purines (A, G) or between pyrimidines (C, T)] and transversions (substitutions between a purine and a pyrimidine). The scaling parameter $s = s(\kappa, \pi)$ is chosen such that $\sum_{a=1}^4 Q_a \pi_a = 1$, implying that $t$ substitutions are expected in $t$ time units. In this application, we use the parameter values $\kappa = 2$ and $\pi = (0.2, 0.3, 0.3, 0.2)$. The scaling parameter calibrates the intensity of substitutions per unit time; for context, note that the expected number of substitutions per site between humans and chimpanzees is roughly 0.01 [The Chimpanzee Sequencing and Analysis Consortium (2005)] and between humans and mice is roughly 0.50 [Mouse Genome Sequencing Consortium (2002)].

Figure 1 plots the computational demands of each sampling strategy against evolutionary distance, measured equivalently as the expected number of substitutions (CTMC state changes) or as units of time. The plot on the left demonstrates the case where the beginning and ending states are both A; by contrast, on the right the beginning state remains A while the ending state is G. The figure reveals rejection sampling to be by far the most efficient algorithm here. Moreover, direct sampling is more efficient than uniformization when the endpoints are the same and the evolutionary distance is shorter than one expected substitution per site. When the endpoints are different, uniformization is more efficient than direct sampling.

3.2. Example 2: Molecular evolution on the codon level.

For protein-coding DNA sequences, the natural state space consists of nucleotide triplets (called codons). There are $4^3 = 64$ possible nucleotide triplets, but the three stop codons TGA, TAG, and TAA do not appear within a protein.
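The HKY specification above can be assembled and checked numerically. A sketch (Python; the helper name is ours, while $\kappa = 2$ and $\pi = (0.2, 0.3, 0.3, 0.2)$ are the values used in this example):

```python
import numpy as np

def hky_rate_matrix(kappa, pi):
    """Build the HKY rate matrix (states in order A, G, C, T), scaled so
    that one substitution is expected per unit time: sum_a pi_a Q_a = 1."""
    A, G, C, T = range(4)
    purine = {A, G}
    Q = np.zeros((4, 4))
    for a in range(4):
        for b in range(4):
            if a == b:
                continue
            # transition = both purines (A<->G) or both pyrimidines (C<->T)
            is_transition = (a in purine) == (b in purine)
            Q[a, b] = (kappa if is_transition else 1.0) * pi[b]
    np.fill_diagonal(Q, -Q.sum(axis=1))
    s = -np.dot(pi, np.diag(Q))        # s = sum_a pi_a Q_a before scaling
    return Q / s

kappa = 2.0
pi = np.array([0.2, 0.3, 0.3, 0.2])   # (pi_A, pi_G, pi_C, pi_T) from the text
Q = hky_rate_matrix(kappa, pi)
```

Rows summing to zero and $\pi Q = 0$ confirm, respectively, that $Q$ is a valid rate matrix and that $\pi$ is its stationary distribution, as claimed for the reversible HKY model.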
The $64 - 3 = 61$ remaining codons constitute the state space and are called the sense codons. Each of the 61 sense codons deterministically translates into one of 20 amino acids, and thus distinct codons may translate into the same amino acid. Substitutions between codons that translate into the same amino acid are called synonymous (or silent), while substitutions between codons that translate into different amino acids are called nonsynonymous (or nonsilent). Goldman and Yang (1994) formulated a model on the space of sense codons that is still in common use today. The GY model, a natural extension of the HKY model described above, is reversible with stationary distribution $\pi = (\pi_1, \ldots, \pi_{61})$ and incorporates a ts/tv rate ratio $\kappa$. The GY model also distinguishes between synonymous and nonsynonymous substitutions through a parameter $\omega$.

Fig. 1. CPU time versus evolutionary distance for the HKY model. In both plots the beginning state is A. In the left plot the ending state is also A, and in the right plot the ending state is G. Rejection sampling requires less CPU time than direct sampling and uniformization. The solid thick lines show predicted CPU times when the cost of initialization and recursion is fitted to the observed CPU times (see Sections 4.1–4.3). The solid thin lines show predicted CPU times when the cost of initialization and recursion is estimated from a simulation study of reversible rate matrices (see Section 4.4). Here and in Figures 2 and 3 the expected number of recursion steps was calculated analytically using the formulas in the text.
The off-diagonal entries in the instantaneous GY rate matrix are given by

$$Q_{ab} = (1/s) \begin{cases} 0, & \text{if } a \text{ and } b \text{ differ at more than one position}, \\ \pi_b, & \text{for synonymous transversions}, \\ \kappa \pi_b, & \text{for synonymous transitions}, \\ \omega \pi_b, & \text{for nonsynonymous transversions}, \\ \omega \kappa \pi_b, & \text{for nonsynonymous transitions}, \end{cases}$$

where $s = s(\omega, \kappa, \pi)$ is again chosen such that $t$ substitutions are expected in $t$ time units (i.e., $\sum_a Q_a \pi_a = 1$).

In our application, we choose $\kappa = 2$ so that transitions are favored over transversions, and take $\omega = 0.01$ so that synonymous changes are far more likely than nonsynonymous changes. We choose $\pi$ based on established patterns of codon usage, and note that these frequencies are quite heterogeneous: the smallest entry is GGG ($\pi_{GGG} = 0.0042$) and the largest entry is GAG ($\pi_{GAG} = 0.0426$).

For these specifications, Figure 2 plots the computational demands of each sampling strategy against evolutionary distance. In each plot, the starting state is AAA. The leftmost plot compares performance when the ending state is AAG. Note that the substitution from AAA to AAG is a synonymous transition (both AAA and AAG code for the amino acid lysine) and that the frequency of AAG is 0.0396. Because synonymous transitions are very likely, the plot confirms that rejection sampling will be very efficient. Contrast this observation with the middle plot, in which the ending state is AAC. The codon AAC translates to asparagine, so the substitution from the beginning to the ending state is a less likely nonsynonymous transversion. This is reflected in the poor performance of the rejection sampling algorithm. Finally, the rightmost plot demonstrates what occurs when the final state is TTT. In this case, rejection sampling is not feasible because the probability of ending in the final state is effectively zero.
Example 3: Molecular evolution on the sequence level. The examples above seem to indicate that rejection sampling and uniformization have the most utility, but it is easy to conceive of an application for which direct sampling is most efficient. As the two previous examples show, the efficiency of rejection sampling is tied to its acceptance probability; if the observed ending state is unlikely, a large fraction of sample paths will be destined for rejection. Uniformization, on the other hand, can be inefficient when many virtual substitutions are required. With that background, we consider the extension of the HKY model described below. Recall the HKY rate matrix (with ν instead of π)

\[
Q = \begin{pmatrix}
\cdot & \kappa\nu_G & \nu_C & \nu_T\\
\kappa\nu_A & \cdot & \nu_C & \nu_T\\
\nu_A & \nu_G & \cdot & \kappa\nu_T\\
\nu_A & \nu_G & \kappa\nu_C & \cdot
\end{pmatrix}. \tag{3.3}
\]

Jensen and Pedersen (2000) consider so-called neighbor-dependent models, in which the instantaneous rate at a site depends on the neighbors of that site. Jensen and Pedersen (2000) are particularly interested in CG avoidance, where the rate away from a C is particularly high if its right neighbor is a G. Such a model implies CG deficiency in a single sequence, which is an often observed phenomenon for mammalian sequences due to the process of CpG methylation-deamination.

Fig. 2. CPU time versus evolutionary distance for the GY model. In all plots the beginning state is AAA. In the left plot the ending state is AAG, in the middle plot the ending state is AAC, and in the right plot the ending state is TTT. Rejection sampling is most efficient in the situation depicted on the left, but it enters an infinite while loop on the right (and is therefore not shown). Direct sampling and uniformization have similar running times, with uniformization being slightly faster. The solid thick lines show predicted CPU times when the cost of initialization and recursion is fitted to the observed CPU times (see Sections 4.1–4.3). The solid thin lines show predicted CPU times when the cost of initialization and recursion is estimated from a simulation study of reversible rate matrices (see Section 4.4). Finally, the dashed lines show predicted CPU times when the initialization and recursion costs are estimated from a simulation study of sparse rate matrices (see Section 4.4).

Neighbor-dependent nucleotide models are also considered in Hwang and Green (2004) and Hobolth (2008). In these two papers, a Gibbs sampling scheme is used to estimate the parameters of the model while taking the uncertainty of the neighbors into account. In particular, each single site is updated conditionally on the current values of the complete evolutionary history of the neighboring nucleotides.

Consider the evolution at a single site and assume for simplicity that the evolutionary history of the left neighbor is never a C, and the evolutionary history of the right neighbor is always a G. In this situation, a CG dinucleotide is present when the site that we consider is a C. Jensen and Pedersen (2000) model the CpG effect by increasing the rate away from CG dinucleotides, multiplying each entry in the HKY rate matrix (3.3) corresponding to C with a parameter γ > 1. When the left neighbor is not a C and the right neighbor is a G, the rate matrix thus becomes

\[
Q_{\mathrm{HKY+CpG}} = \begin{pmatrix}
\cdot & \kappa\nu_G & \nu_C & \nu_T\\
\kappa\nu_A & \cdot & \nu_C & \nu_T\\
\gamma\nu_A & \gamma\nu_G & \cdot & \gamma\kappa\nu_T\\
\nu_A & \nu_G & \kappa\nu_C & \cdot
\end{pmatrix}. \tag{3.1}
\]

The stationary distribution π of Q_HKY+CpG is given by

\[
(\pi_A, \pi_G, \pi_C, \pi_T) = (\nu_A, \nu_G, \nu_C/\gamma, \nu_T) / (\nu_A + \nu_G + \nu_C/\gamma + \nu_T).
\]

If the parameters are (ν_A, ν_G, ν_C, ν_T) = (0.3, 0.3, 0.2, 0.2) and γ = 20, we obtain the stationary distribution (π_A, π_G, π_C, π_T) = (0.3, 0.3, 0.01, 0.2)/0.81. Note that the stationary probability of a C nucleotide is now 0.01/0.81 = 0.012.

The left-hand plot of Figure 3 illustrates the performance of the three samplers when the CTMC begins in state T and ends in state C. Here the most efficient sampler depends on the time between the states: if T < 0.3, rejection sampling is the most efficient; if 0.3 < T < 0.9, uniformization is the most efficient; and if T > 0.9, direct sampling is the most efficient. For large times, rejection sampling is inefficient because it is unlikely to end in state C, and direct sampling becomes more efficient than uniformization because many virtual changes are required in the uniformization procedure. The right-hand plot of Figure 3 shows the case when the beginning state is C and the ending state is T. Under this scenario, rejection sampling is the most efficient sampling algorithm because the acceptance probability is high.

Fig. 3. CPU usage versus time for the HKY + CpG rate matrix (3.1). In the left plot, the beginning state is T and the ending state is C. In the right plot, the beginning state is C and the ending state is T. Rejection sampling is very fast in the situation depicted on the right, but it is slow for large evolutionary distances on the left. Direct sampling and uniformization have similar running times, but direct sampling is faster for large evolutionary distances. The solid thick lines show predicted CPU times when the cost of initialization and recursion is fitted to the observed CPU times (see Sections 4.1–4.3). The solid thin lines show predicted CPU times when the cost of initialization and recursion is estimated from a simulation study of reversible rate matrices (see Section 4.4).

4. Complexity of samplers.
The examples in the previous section were chosen to demonstrate the heterogeneous dependence of each sampling strategy upon the characteristics of the endpoint-conditioned CTMC. In particular, efficiency was shown to be impacted by each aspect of the process: the instantaneous rate matrix Q, the sampling time T, and the beginning and ending states a and b. This section translates the qualitative observations above into quantitative proof of which sampler will be most efficient for any specification of CTMC. To accomplish this, we rely on the algorithmic descriptions of the three sampling strategies as given in Section 2. Note that the algorithms are schematically consistent, with each progressing through (1) initialization, (2) recursion, and (3) termination. Our approach is to define the fixed computational costs for the initialization and recursion steps, which we call α and β, respectively.

As shown in Section 2, the number of recursion steps required to generate an entire sample path is stochastic, and we capture this in a random variable L. Thus, the computational cost of generating one sample path is α + βL, and the mean cost is α + βE[L]. In the case of rejection sampling, note that only a certain fraction of the generated sample paths will be consistent with the observed ending state and hence accepted. Ultimately, the results of this section demonstrate our ability to accurately predict the CPU time needed to produce one valid sample path from an endpoint-conditioned process. Such analysis is of great practical importance, as it allows the researcher to choose the most efficient sampler in advance.

The rest of this section is organized as follows. In Sections 4.1–4.3 we discuss complexity and derive the mean number of recursions E[L] for each sampler.
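The cost model α + βE[L] already determines the comparison rule used throughout this section: predict the mean CPU time of each sampler, divide by the acceptance probability in the case of rejection sampling (as quantified in Section 4.1), and take the minimum. The following minimal sketch is our own illustration; the numeric inputs are placeholders, not measured costs.

```python
# Sketch of the cost model alpha + beta * E[L]: predict the mean CPU time of
# each sampler and pick the cheapest.  Rejection sampling pays its cost once
# per attempt, so its mean time is divided by the acceptance probability.
def predicted_cpu_times(costs, mean_recursions, p_acc):
    """costs: sampler -> (alpha, beta); mean_recursions: sampler -> E[L]."""
    times = {name: a + b * mean_recursions[name] for name, (a, b) in costs.items()}
    times["rejection"] /= p_acc        # only accepted paths count
    return times

def choose_sampler(costs, mean_recursions, p_acc):
    times = predicted_cpu_times(costs, mean_recursions, p_acc)
    return min(times, key=times.get)

# Placeholder numbers, loosely shaped like the estimates reported later:
# rejection is cheap per attempt; the other two pay for an eigendecomposition.
costs = {"rejection": (0.02, 0.01), "direct": (0.22, 0.13), "uniformization": (0.23, 0.01)}
EL = {"rejection": 2.0, "direct": 2.0, "uniformization": 4.0}

best_likely = choose_sampler(costs, EL, p_acc=0.5)     # likely ending state
best_unlikely = choose_sampler(costs, EL, p_acc=1e-4)  # nearly impossible ending state
```

With a likely ending state the rejection sampler wins; when almost every forward path must be discarded, its 1/p_acc factor dominates and uniformization takes over.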
In Section 4.4 we first demonstrate that α and β can be estimated from the size of the state space and the structure of the rate matrix alone. Thus, determining the values of α and β is a one-time calculation. Second, we provide a strategy for choosing the most efficient sampler. The strategy depends on the mean number of recursions and the estimated values of α and β; in the case of rejection sampling, it also depends on the acceptance probability. Third, we give further insight into the sampling strategies by analyzing the proposed strategy in detail for moderately large time intervals.

4.1. Rejection sampling complexity. Let p_acc be the acceptance probability for the rejection sampling algorithm first described in Section 2.1. Then the expected number of samples before acceptance is its reciprocal, 1/p_acc. In the notation described above, the mean CPU time required to simulate one sample path is thus

\[
(\alpha + \beta E[L]) / p_{\mathrm{acc}}. \tag{4.1}
\]

When the beginning and ending states take the same value, say a, the acceptance probability is simply P_aa(T). In particular, for small T we have p_acc ≈ 1 − Q_a T, and for large T we have p_acc ≈ π_a. Furthermore, the expected number of recursion steps required to generate one sample path is given by

\[
E[L] = \sum_i \sum_{j \ne i} E[N_{ij}(T) \mid X(0) = a], \tag{4.2}
\]

where N_ij(T) is the number of state changes from i to j in the time interval [0, T]. This expectation is given by [e.g., Proposition 3.6 of Guttorp (1995)]

\[
E[N_{ij}(T) \mid X(0) = a] = Q_{ij} \int_0^T P(X(t) = i \mid X(0) = a) \, dt.
\]

Analytical expressions for the integral can be found by appealing to an eigendecomposition of Q (see Section 2.2). Figure 4 provides an illustration of the above considerations, using the HKY model from Section 3.1 as an example.
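For a concrete illustration of these formulas (our own sketch, not the authors' code), the integral \(\int_0^T P(X(t)=i \mid X(0)=a)\,dt\) can be evaluated termwise from the eigendecomposition Q = U diag(λ) U^{-1}. The toy two-state chain below is calibrated so that \(\sum_c \pi_c Q_c = 1\), in which case the unconditional mean number of jumps, and hence E[L] in (4.2), equals T exactly.

```python
import numpy as np

# E[N_ij(T) | X(0) = a] = Q_ij * int_0^T P_ai(t) dt, with the integral
# evaluated through an eigendecomposition of Q (a sketch of the approach
# referenced in the text, not code from the paper).
def expected_forward_recursions(Q, a, T, tol=1e-12):
    lam, U = np.linalg.eig(Q)
    Uinv = np.linalg.inv(U)
    # J_k = int_0^T exp(lam_k * t) dt; the zero eigenvalue contributes T.
    safe = np.where(np.abs(lam) < tol, 1.0, lam)
    J = np.where(np.abs(lam) < tol, T, (np.exp(lam * T) - 1.0) / safe)
    I = (U @ np.diag(J) @ Uinv).real          # I[a, i] = int_0^T P_ai(t) dt
    n = Q.shape[0]
    return sum(Q[i, j] * I[a, i] for i in range(n) for j in range(n) if j != i)

def acceptance_probability_same_endpoints(Q, a, T):
    lam, U = np.linalg.eig(Q)
    P = (U @ np.diag(np.exp(lam * T)) @ np.linalg.inv(U)).real
    return P[a, a]                            # p_acc = P_aa(T) when a = b

# Calibrated two-state chain: pi = (1/2, 1/2) and sum_c pi_c Q_c = 1.
Q2 = np.array([[-1.0, 1.0], [1.0, -1.0]])
EL = expected_forward_recursions(Q2, a=0, T=2.0)           # equals T = 2
p_acc = acceptance_probability_same_endpoints(Q2, 0, 2.0)  # (1 + e^{-4})/2
```

Plugging these quantities into (4.1) gives the mean CPU time (α + βE[L])/p_acc once α and β have been estimated.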
The top panel of the figure details the case when the beginning and ending states are the same (specifically, the nucleotide A). From the left, the first column plots the acceptance probability exp(QT)_AA against the time T, showing a nonlinear decrease from p_acc ≈ 1 when T is small to p_acc ≈ π_A when T is large. The sloped dashed line plots the first-order Taylor approximation 1 − Q_A T of p_acc that is valid for small T; the horizontal dashed line indicates the stationary probability π_A that is the limit of p_acc as T grows large.

The second column plots the CPU time spent on initialization against T for a collection of simulated sample paths. A linear regression was used to estimate the initialization cost. More specifically, we generated 500 independent samples from the modified rejection sampler and recorded the time spent on initialization and recursion, respectively. The CPU time spent on initialization is proportional to 1/p_acc [recall (4.1)]; we estimated α using linear regression and obtained \(\hat\alpha = 0.0509\). The third column shows the expected number of state changes E[L], calculated from (4.2), as a function of time. In the fourth column we show the CPU time spent on sampling. The CPU time spent on sampling is proportional to E[L]/p_acc [recall (4.1)]; we estimated β using linear regression and obtained \(\hat\beta = 0.0365\). Adding the CPU time spent on initialization and sampling gives the total CPU time spent on producing a sample path. The total CPU time and the predicted CPU time are shown in the left plot of Figure 1.

Fig. 4. Summary statistics for the HKY model from Section 3.1. The top row shows the case when the beginning state is A and the ending state is A. In the bottom row, the beginning state is A and the ending state is G. The first column shows the probability of ending in the correct state (the acceptance probability), and the second column shows the CPU time spent on initialization. The third column shows the expected number of recursions required in each forward sample, and the fourth column shows the CPU time spent on sampling. Summing the CPU times spent on initialization and on sampling gives the total CPU time spent to produce a sample path. This total time is shown in the left-hand plot of Figure 1.

When the beginning and ending states are different, the calculations are only slightly more complicated. To compute the acceptance probability in the case a ≠ b, let N(t) be the number of state changes of X(t) in the interval [0, t]. We have

\[
\begin{aligned}
P_{ab}(T) &= \Pr(X(T) = b \mid X(0) = a)\\
&= \Pr(X(T) = b, N(T) > 0 \mid X(0) = a)\\
&= \Pr(X(T) = b \mid N(T) > 0, X(0) = a)\,\Pr(N(T) > 0)\\
&= p_{\mathrm{acc}}\,\Pr(N(T) > 0),
\end{aligned} \tag{4.3}
\]

from which it is clear that

\[
p_{\mathrm{acc}} = \frac{P_{ab}(T)}{1 - \Pr(N(T) = 0)} = \frac{P_{ab}(T)}{1 - e^{-T Q_a}}. \tag{4.4}
\]

For small T we have the first-order approximation

\[
p_{\mathrm{acc}} \approx \frac{Q_{ab}}{Q_a}\left(1 - \frac{Q_b T}{2}\right) + \sum_{i \ne a, b} \frac{Q_{ai}}{Q_a} \frac{T}{2} Q_{ib}, \tag{4.5}
\]

and for large T it is clear that p_acc ≈ π_b.

Next we consider the number of recursion steps L. We know that the number of state changes is at least one because we have assumed that the beginning and ending states a and b are different. The probability of the first change being to state k (k ≠ a) is Q_ak/Q_a, and the density of the time to this change is given by (2.1). Let the number of state changes from i to j (j ≠ i) when the first substitution is to k be denoted N_ij,k.
The expected number of such changes in a time interval [0, T] is given by

\[
E[N_{ij,k}(T)] = \int_{t=0}^{T} \frac{Q_a e^{-t Q_a}}{1 - e^{-T Q_a}}\, \frac{Q_{ak}}{Q_a}\, Q_{ij} \int_{s=t}^{T} P_{ki}(s) \, ds \, dt.
\]

Again, this integral can be calculated analytically using an eigenvalue decomposition of Q. The expected value of L is given by

\[
E[L] = 1 + \sum_i \sum_{j \ne i} \sum_{k \ne a} E[N_{ij,k}(T)].
\]

The bottom row of Figure 4 mirrors the top row, except that here the ending state G has been chosen to be distinct from the beginning state A. As before, the first plot from the left shows the acceptance probability (4.4) against the time T. The sloped dashed line now shows the linear approximation (4.5), while the horizontal dashed line indicates the stationary probability π_G of the ending state G. In the second plot, the CPU time spent on initialization is explained by the reciprocal of the acceptance probability. In the third plot we show the expected number of substitutions, and in the last plot the CPU time spent on sampling is explained by the expected number of substitutions divided by the acceptance probability. The regression coefficient for initialization is 0.0509 and for sampling 0.0366. Note that these coefficients are very similar to what was observed in the case of equal beginning and ending states.

To complement the observations of Figure 4, recall the GY model introduced in Section 3.2. The three plots in Figure 5 mirror those in Figure 2, with each showing how the acceptance probability scales with time in the previously depicted scenario. In all cases, the beginning state is the codon AAA: from the left, the first plot considers the ending state AAG (a synonymous transition away from AAA), the second plot considers the ending state AAC (a nonsynonymous transversion away from AAA), and the third plot considers the ending state TTT (a minimum of three state changes away from AAA). In the first case, the acceptance probability is high and rejection sampling is efficient. In the second case, the acceptance probability is low, particularly for small T, and rejection sampling is less efficient. In the third case, the probability of ending up in the desired state TTT is smaller than 1/10^6, and rejection sampling cannot be used. In this final case, one must use direct sampling or uniformization.

Fig. 5. Acceptance probabilities for the GY model from Section 3.2. In all cases the beginning state is AAA. In the left-hand plot, the ending state is AAG, in the middle plot the ending state is AAC, and in the right-hand plot the ending state is TTT. Rejection sampling is very efficient in the situation depicted on the left, less efficient in the middle, and not practical in the right.

4.2. Direct sampling complexity. The computational costs for direct sampling are dependent upon its initialization and the CPU time spent on sampling a new state and its corresponding waiting time. As before, the cost of generating one sample path can be written as α + βL, but the initial cost α is much more expensive than for rejection sampling because an eigendecomposition of Q is required. The expected number of recursion steps is equivalent to the number of state changes N and can be found through

\[
E[L] = E[N(T) \mid X(0) = a, X(T) = b] = \sum_i \sum_{j \ne i} E[N_{ij}(T) \mid X(0) = a, X(T) = b],
\]

where N_ij(T) is the number of state changes from i to j in the time interval [0, T]. These expectations can be calculated using formulas in Hobolth and Jensen (2005).

Fig. 6. CPU time spent on direct sampling in the HKY model from Section 3.1. Two cases are considered: beginning state A and ending state A (first two plots from the left) and beginning state A but ending state G (last two plots). In both cases, the initialization CPU time is constant (first and third plot). The sampling CPU time is proportional to the expected number of substitutions (second and fourth plot).

In Figure 6 we illustrate the above considerations for the HKY model from Section 3.1. The initialization cost is constant, and the number of state changes explains the cost of sampling. The initialization cost α is around 0.85, which is much larger than the 0.05 observed when doing rejection sampling. Moreover, the cost β for each recursion step is 0.56, as compared to 0.04 for rejection sampling. This may seem an unfavorable comparison, but recall that rejection sampling does not guarantee that the endpoint conditions are met by its generated sample paths; if the probability of acceptance p_acc is low, then the cost of rejection sampling given by (4.1) will be dominated by 1/p_acc. This illustrates the tradeoff that distinguishes rejection sampling from the two remaining approaches: the computational costs of rejection sampling are comparatively inexpensive, but only a fraction of the simulated sample paths from that method will be viable.

4.3. Uniformization complexity. The computational costs for uniformization are similar in structure to those of direct sampling. Initialization requires an eigendecomposition of Q and construction of the auxiliary transition matrix R in order to carry out Step 1 of the algorithm (recall Remark 6). Each recursion step consists of sampling a new state and its corresponding waiting time; examination of the uniformization algorithm reveals that the number of recursion steps L is equal to the number of state changes N(T) accumulated by the auxiliary chain.
Thus,

\[
E[L] = E[N(T) \mid X(0) = a, X(T) = b]
= \frac{1}{P_{ab}(T)} \sum_{n=0}^{\infty} n\, e^{-\mu T} \frac{(\mu T)^n}{n!} (R^n)_{ab}
= \frac{\mu T}{P_{ab}(T)} (R e^{QT})_{ab}
= \frac{\mu T}{P_{ab}(T)} \sum_c R_{ac} P_{cb}(T).
\]

In particular, when T is large we get E[L] ≈ μT.

Fig. 7. CPU time spent on uniformization in the HKY model from Section 3.1. Two cases are considered: beginning state A and ending state A (first two plots) and beginning state A but ending state G (last two plots). In both cases, the initialization CPU time is constant (first and third plot). The sampling CPU time is proportional to the expected number of substitutions (second and fourth plot).

Figure 7 illustrates the above considerations for the HKY model from Section 3.1. As with direct sampling, the initialization cost is constant, and the number of state changes (both real and virtual) explains the cost of sampling. We find \(\hat\alpha = 1.05\), which is about the same magnitude as the initialization cost for direct sampling. In uniform sampling, the recursion step is immediate if we enter Steps 2–4. Each recursion in Step 5 is also very fast because we just have to simulate from a discrete-state Markov chain with transition probability matrix R given the endpoints, and where all the relevant powers of the transition matrix R are already calculated. The recursion cost \(\hat\beta = 0.09\) is around 1/6 of the recursion cost for direct sampling and twice as much as the recursion cost of rejection sampling.

4.4. Comparison and recommendation.

4.4.1. Comparison and recommendation for general T. The preceding results explicitly relate the computational complexity of each sampling strategy to characteristics of the CTMC. This permits the three strategies to be compared to each other, but only after reliable values for α and β have been obtained.
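The closed form for the uniformization recursion count in Section 4.3 is easy to evaluate; the sketch below (our own illustration) computes E[L] = μT (R e^{QT})_ab / P_ab(T) for a toy two-state chain and confirms that E[L] approaches μT for large T.

```python
import numpy as np

# Mean number of uniformization recursions,
# E[L] = mu*T * (R e^{QT})_ab / P_ab(T),
# where mu = max_c Q_c and R = I + Q/mu is the auxiliary transition matrix.
def uniformization_mean_recursions(Q, a, b, T):
    mu = np.max(-np.diag(Q))                  # uniformization rate
    R = np.eye(Q.shape[0]) + Q / mu           # auxiliary chain (virtual jumps allowed)
    lam, U = np.linalg.eig(Q)
    P = (U @ np.diag(np.exp(lam * T)) @ np.linalg.inv(U)).real
    return mu * T * (R @ P)[a, b] / P[a, b]

# Two-state chain: mu = 1 and R simply swaps the two states, so for a != b
# the formula reduces to T * P_bb(T) / P_ab(T).
Q2 = np.array([[-1.0, 1.0], [1.0, -1.0]])
EL_short = uniformization_mean_recursions(Q2, a=0, b=1, T=2.0)
EL_long = uniformization_mean_recursions(Q2, a=0, b=1, T=50.0)  # close to mu*T = 50
```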
We also note that the values of α and β depend on the choice of computer language. Running simulations as we have to estimate these parameters is not practical, as it compromises the gains in choosing an efficient sampler. For that reason, and to establish the generality of our observations, we sought to relate α and β to the size of the state space of the CTMC.

Fig. 8. Values of α (left) and β (right) for the three algorithms and four examples. Both x-axes show the size of the state space considered in each example, with α and β on the y-axes for the left and right plot, respectively. A constant fit is shown for rejection sampling, and a quadratic fit between β and the size of the state space is shown for direct sampling; the remaining relationships in the figure are described as a linear dependency on the size of the state space to the power 2.5. Results for randomly generated 61 × 61 rate matrices, with nonzero off-diagonal entries when the entry corresponds to two codons that are exactly one point substitution away from each other, are shown in the left plot. These sparse rate matrices have a lower value of α in the case of direct sampling and uniformization.

For each of the three sampling algorithms, we estimated the values of α and β for randomly simulated reversible rate matrices. Specifically, we first generated a symmetric matrix S with exponentially distributed off-diagonal entries S_ij ∼ Exp(1), i > j. Second, we generated the stationary distribution π of the CTMC by sampling from a Dirichlet distribution Dir(1, ..., 1), that is, with a parameter vector of ones. The off-diagonal entries in the rate matrix become Q_ij = S_ij π_j for i > j and Q_ij = S_ji π_j for i < j. We considered seven different state space sizes (5, 10, 20, 40, 60, 80, 100) and repeated each of the simulations five times.
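The construction just described can be written down in a few lines; the following sketch (our own code, with the same Exp(1)/Dirichlet choices) also calibrates the chain to one expected substitution per time unit and checks reversibility through detailed balance.

```python
import numpy as np

# Random reversible rate matrix as described in the text: a symmetric matrix S
# with Exp(1) off-diagonal entries, a Dirichlet(1, ..., 1) stationary
# distribution pi, and off-diagonal rates Q_ij = S_ij * pi_j.
def random_reversible_rate_matrix(n, rng):
    lower = np.tril(rng.exponential(1.0, size=(n, n)), -1)
    S = lower + lower.T                        # symmetric, zero diagonal
    pi = rng.dirichlet(np.ones(n))
    Q = S * pi[None, :]                        # Q_ij = S_ij * pi_j for i != j
    np.fill_diagonal(Q, -Q.sum(axis=1))        # rows must sum to zero
    Q /= -(pi @ np.diag(Q))                    # calibrate: sum_c pi_c Q_c = 1
    return Q, pi

rng = np.random.default_rng(1)
Q, pi = random_reversible_rate_matrix(10, rng)
flux = pi[:, None] * Q                         # detailed balance: flux is symmetric
```

Detailed balance, π_i Q_ij = π_j Q_ji, holds by construction since S is symmetric, which is what makes the simulated matrices reversible.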
The results are summarized in Figure 8. Figure 8 supports our previous observations on the three sampling strategies. The costs of initialization, as quantified by α, are as expected, with direct sampling and uniformization slowed relative to rejection sampling by their dependence on an eigenvalue decomposition of the rate matrix Q. Indeed, whereas the initialization cost of rejection sampling remains essentially unchanged as the state space grows, the other methods increase in run time nonlinearly. Theory suggests that the eigenvalue decomposition that dominates direct sampling and uniformization should depend on the cube of the size of the state space, and we find in our limited sample that the relationship is best explained by an exponent of 2.5. In any case, direct sampling and uniformization are comparable in their initialization costs, with uniformization always slightly slower because it requires powers of the transition probability matrix R to also be calculated. By contrast, for the state space sizes that we considered, uniformization has a substantially smaller value of β. Compare these observations to the results shown in Figures 4, 6, and 7 for the HKY model: what was true in that example with a state space of size four appears to hold consistently across our sample of simulated rate matrices of various sizes.

The most encouraging feature of Figure 8 is the apparent ease with which α and β can be predicted solely from the size of the state space. As illustrated in the figure, we fit six simple models to predict α and β for each of the three sampling strategies from the state space size alone. The utility of these predictions becomes clear in reference to the four-state HKY model. Previously we found α and β by simulating endpoint-conditioned samples from the model of interest itself.
For rejection sampling, these values were \(\hat\alpha = 0.0149\) and \(\hat\beta = 0.0094\) (Figure 4), which we compare to α = 0.0165 and β = 0.0109 predicted for a state space of size four. Similarly, as shown in Figure 6, for direct sampling \((\hat\alpha, \hat\beta)\) ranges from (0.2274, 0.1342), when the beginning and ending states were the same, to (0.2258, 0.1370), when they were different. By comparison, the predicted values for direct sampling complexity are α = 0.2155 and β = 0.1285. Last, recall from Figure 7 that for uniformization \((\hat\alpha, \hat\beta)\) ranges from (0.2503, 0.0253) for identical beginning and ending states to (0.2419, 0.0206) when different. These values agree with our predictions of α = 0.2286 and β = 0.0143, although the predicted value of β is somewhat lower than the fit from the model itself.

It should be emphasized that the goal here is not to perfectly predict α and β for any particular CTMC. Rather, the purpose of the simple models illustrated in Figure 8 is to obtain values accurate enough to decide which sampling strategy will be most efficient. With that in mind, reconsider the CPU times observed for the three examples in Section 3. In Figures 1–3, respectively, CPU time estimates for the HKY, GY, and HKY + CpG models are shown as thin lines obtained from the new predictions of α and β. It is clear that a practitioner, armed with only these predictions derived from the state space size, would choose the most efficient sampler in each example for virtually any combination of time and specific endpoint conditions. On the other hand, despite that success, the CPU times we predict are not uniformly accurate; the predictions for the GY model shown in Figure 2, for example, do not fit the observed data well at all.
We speculated that this observed lack of fit might be a result of the structure of the GY model being too different from that of the randomly generated rate matrices used to establish our models. To pursue this hypothesis, we generated 61 × 61 rate matrices with the same structure as the GY model. In particular, the only nonzero off-diagonal entries in the random rate matrix are those entries where the two codons are exactly one point substitution away from each other. These sparse rate matrices result in smaller values for α in the cases of direct sampling and uniformization, as indicated in Figure 8. As expected from Figure 2, the value of α for rejection sampling and the β values for all samplers are largely unaffected. The predicted CPU times from the new values of α and β are shown as dashed lines and present a satisfactory fit. Thus, while our models for α and β appear to be of sufficient quality to guide the correct choice of sampler, it is clear that differently structured rate matrices of the same size may yield substantially different values of α and β and hence CPU times.

The preceding discussion motivates the following guidelines for choosing the most efficient sampling strategy:

1. Estimate α and β for the three sampling strategies. As discussed, α and β can be estimated reliably from the size of the rate matrix, allowing for some variability due to its structure.
2. Predict the CPU times for rejection sampling, (α_R + β_R E_R[L])/p_acc; for direct sampling, α_D + β_D E_D[L]; and for uniformization, α_U + β_U E_U[L].
3. Choose the sampler with the lowest predicted CPU time.

4.4.2. Comparison for moderately large T. We end the comparison by considering the special case when T is at least moderately large. In this case some useful rules of thumb emerge.
To begin, note that the expected number of iterations required for rejection sampling and for direct sampling should be approximately equal. For moderately large T, we can make the substitution of \((\sum_c \pi_c Q_c) T\) for E[L], or just T, provided that the chain has been calibrated such that \(\sum_c \pi_c Q_c = 1\). For uniformization, E[L] is larger because of virtual state changes; under the same assumptions, here E[L] can be roughly approximated by μT = (max_c Q_c) T. Under these assumptions, virtual changes increase the number of iterations required in uniformization by a factor of

\[
\nu = \max_c Q_c \Big/ \sum_c \pi_c Q_c = \max_c Q_c,
\]

again assuming that the chain has been calibrated. In other words, the inflation factor ν is the ratio of the maximum diagonal entry of the rate matrix Q to its (weighted) average diagonal entry. We obtain ν = (1.12, 2.22, 16.2) for the rate matrices in Examples 1–3, respectively. As expected, the inflation factor is very high for the rate matrix in Example 3 and explains the observations in Figure 3.

Finally, with E[L] = T for both rejection sampling and direct sampling, the approximate complexities can be expressed as follows:

\[
\text{Rejection sampling: } (\alpha_R + \beta_R T)/p_{\mathrm{acc}}, \qquad
\text{Direct sampling: } \alpha_D + \beta_D T, \qquad
\text{Uniformization: } \alpha_U + \beta_U T \nu. \tag{4.6}
\]

To see the utility of these formulas, recall the results for the HKY model shown in Figure 1 and consider the moderately large time T = 2. Noting that for direct sampling and uniformization our approximations are not endpoint-dependent, and using estimates of (0.2155, 0.1285) for (α, β) for direct sampling and (0.2286, 0.0143) for uniformization, the formulas predict their CPU times to be 0.472 and 0.261, respectively, in both panels of Figure 1.
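The inflation factors quoted above are cheap to verify numerically. The sketch below (our own code) rebuilds the HKY + CpG rate matrix (3.1) with the parameter values from Section 3.3 and recovers ν = 16.2; note that ν is invariant to the overall calibration of Q, so the 1/s scaling can be ignored.

```python
import numpy as np

# Inflation factor nu = max_c Q_c / sum_c pi_c Q_c for the HKY + CpG
# rate matrix (3.1) with (nu_A, nu_G, nu_C, nu_T) = (0.3, 0.3, 0.2, 0.2),
# kappa = 2, gamma = 20.  State order: A, G, C, T.
def inflation_factor(Q, pi):
    exit_rates = -np.diag(Q)            # Q_c = -Q_cc
    return exit_rates.max() / (pi @ exit_rates)

nu_A, nu_G, nu_C, nu_T = 0.3, 0.3, 0.2, 0.2
kappa, gamma = 2.0, 20.0
Q = np.array([
    [0.0,          kappa * nu_G, nu_C,         nu_T],
    [kappa * nu_A, 0.0,          nu_C,         nu_T],
    [gamma * nu_A, gamma * nu_G, 0.0,          gamma * kappa * nu_T],
    [nu_A,         nu_G,         kappa * nu_C, 0.0],
])
np.fill_diagonal(Q, -Q.sum(axis=1))
pi = np.array([nu_A, nu_G, nu_C / gamma, nu_T])
pi /= pi.sum()                          # (0.3, 0.3, 0.01, 0.2) / 0.81
nu_factor = inflation_factor(Q, pi)     # 16.2, matching the value in the text
```

The single fast-leaving state C (exit rate 20 against a weighted average near 1.23) is what drives the large number of virtual jumps for this model.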
Rejection sampling, of course, is dependent on the ending state, and thus the complexities for that method illustrated in the left and right plots differ. In this case, the difference is subtle because (1) the chain is nearly mixed and (2) the stationary probabilities that govern the acceptance probabilities are similar. Using 0.2 and 0.3 as the respective acceptance probabilities for A and G, and (0.016, 0.010) for (α, β), we obtain 0.19 for the left plot and 0.13 for the right plot. Inspection of Figure 1 gives validity to our approximations, showing all of the predictions to be highly accurate.

For the HKY + CpG model in Figure 3 we can make similar predictions. We predict the CPU times for direct sampling and uniformization to be 0.472 and 0.693, again in good agreement with both figures. Using the stationary probabilities 0.012 and 0.246 for C and T, we obtain 3.112 for the left plot and 0.155 for the right plot. These predictions are again very accurate.

In the particular case of moderately large T, the guideline for choosing the most efficient sampling strategy can be made even more explicit. It follows immediately from (4.6) that uniformization is more efficient than direct sampling if

    ν < (α_D + β_D T − α_U) / (β_U T) = ν_critical.

Recall the transition matrix (2.7) of the auxiliary process. It is evident that if the inflation factor ν is large, then the transition matrix has one or more states where virtual state changes are very likely. In the uniformization sampling procedure, these virtual state changes have to be simulated, although they are eliminated in the final sample path. Many invisible virtual jumps thus make uniformization less efficient. For the state space of size 4 with (α_D, β_D, α_U, β_U) = (0.2155, 0.1285, 0.2286, 0.0143), we obtain ν_critical = 8.5. For the HKY model we have ν_HKY = 1.12 and for the HKY + CpG model we have ν_HKY+CpG = 16.2, and, thus, we predict uniformization to be more efficient for large T for the HKY model, while direct sampling is more efficient for the HKY + CpG model.

Similarly, it follows from (4.6) that rejection sampling is more efficient than uniformization if

    p_acc > (α_R + β_R T) / (α_U + β_U ν T) = p^U_critical,    (4.7)

and more efficient than direct sampling if

    p_acc > (α_R + β_R T) / (α_D + β_D T) = p^D_critical.    (4.8)

For the HKY model, we get p^U_critical = 0.147. In the case of beginning state A, ending state A and T = 2, we get from (4.4) that p_acc = 0.254. If the beginning state is A, the ending state is G and T = 2, we get p_acc = 0.347. Both acceptance probabilities are larger than p^U_critical, and we correctly predict (recall Figure 1) that rejection sampling is the most efficient algorithm in both cases. For the HKY + CpG model we get p^D_critical = 0.081. In the case where the beginning state is T, the ending state is C and T = 2, we obtain p_acc = 0.017, and with beginning state C, ending state T and T = 2, we get p_acc = 0.272. We thus correctly predict (recall Figure 3) that direct sampling is the most efficient algorithm in the first situation, while rejection sampling is more efficient in the second situation.

The approximations in (4.6) are less precise for the GY model because the larger state space increases the dependence between the beginning and ending states. However, we still get reliable predictions when applying the moderately large T approximations. The predicted values in Figure 8 for the sparse codon rate matrices of size 61 are (α_R, β_R, α_D, β_D, α_U, β_U) = (0.017, 0.011, 1.072, 0.305, 1.124, 0.105). For T = 2 we get ν_critical = 2.66, and since ν_GY = 2.22, we correctly predict uniformization to be more efficient than direct sampling. We get p^U_critical = 0.024, meaning that uniformization is also more efficient than rejection sampling if the acceptance probability is smaller than 2.4%. With T = 2 and beginning state AAA, we get acceptance probabilities 0.65, 0.01, and 1/10^5 for ending states AAG, AAC, and TTT, respectively (recall Figure 5). We thus correctly predict rejection sampling to be faster than uniformization when the ending state is AAG, and slower when the ending state is AAC or TTT.

To summarize, this section shows that when the cost of initialization α and the cost of a recursion step β are known, we can accurately predict the time it takes to produce a single sample from any of the three simulation procedures. We have thus demonstrated that choosing among the simulation procedures is an objective task that can be automated. In our analysis, we have demonstrated that it is straightforward and inexpensive to estimate α and β reliably. An alternative, which we have not addressed, would be to obtain these directly by translating the necessary calculations for each sampling strategy into floating point operations. In practice, our derivations serve well even without quantification of the initialization and recursion costs; for reasonable values of α and β, the acceptance probability p_acc and the inflation factor ν can inform which of the three sampling strategies works best.

5. Conclusion. The prevalence of endpoint-conditioned CTMCs as an inferential tool in interdisciplinary studies has led to the development of several path-sampling algorithms. As the scope of application continues to grow, so too will the need for computationally efficient approaches, and yet this aspect has, to our knowledge, not previously been considered.
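The automated selection rule described above can be sketched in a few lines. The following Python fragment (our own illustration, not the paper's R code; the function name `choose_sampler` and the dictionary layout are our choices) compares the moderately-large-T cost approximations cost_D ≈ α_D + β_D T, cost_U ≈ α_U + β_U ν T, and cost_R ≈ (α_R + β_R T)/p_acc, using the sparse 61-state codon-model constants quoted in the text:

```python
def choose_sampler(alpha, beta, T, nu, p_acc):
    """Pick the cheapest sampler under the moderately-large-T cost
    approximations. `alpha` and `beta` map the method tags 'R' (rejection),
    'D' (direct), 'U' (uniformization) to the measured initialization and
    per-step recursion costs; `nu` is the inflation factor and `p_acc`
    the path acceptance probability."""
    costs = {
        "direct": alpha["D"] + beta["D"] * T,
        "uniformization": alpha["U"] + beta["U"] * nu * T,
        "rejection": (alpha["R"] + beta["R"] * T) / p_acc,
    }
    return min(costs, key=costs.get), costs

# Constants (alpha_R, beta_R, alpha_D, beta_D, alpha_U, beta_U) for the
# sparse codon rate matrices of size 61, as quoted in the text.
alpha = {"R": 0.017, "D": 1.072, "U": 1.124}
beta = {"R": 0.011, "D": 0.305, "U": 0.105}
T = 2.0

# nu_critical from (4.6): uniformization beats direct sampling iff nu < nu_critical.
nu_critical = (alpha["D"] + beta["D"] * T - alpha["U"]) / (beta["U"] * T)
print(round(nu_critical, 2))  # 2.66, matching the value in the text

# Ending state AAG has p_acc = 0.65, so rejection sampling should win.
best, _ = choose_sampler(alpha, beta, T, nu=2.22, p_acc=0.65)
print(best)  # rejection
```

Lowering `p_acc` to 0.01 (ending state AAC) flips the decision to uniformization, reproducing the prediction made above.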
To that end, we have presented a formal comparison of three sampling strategies: (1) modified rejection sampling, (2) direct sampling, and (3) uniformization. Significantly, we show that efficiency is a relative measure that depends heavily on the specification of the conditioned stochastic process; indeed, as demonstrated in Section 3, the computational requirements for each algorithm depend on the rate matrix Q, the time interval T, and the endpoints a and b. We have shown that no one algorithm dominates the other two, and that each algorithm has its specific strengths and weaknesses. The previous section served to demystify those strengths and weaknesses by completely quantifying the computational costs associated with each sampling strategy.

We have concentrated our efforts on one specific application, namely, the simulation of a single sample path provided the rate matrix Q, the time interval T, and the endpoints a and b upon which the process is conditioned. We framed each of the three path-sampling algorithms as a progression through (1) initialization, (2) recursion, and (3) termination, and our discussion was based on an in-depth analysis of the computational requirements of each step. It should be noted that our theory is easily amenable to application-specific situations where these requirements vary; for example, it is reasonable in some cases to expect that an eigenvalue decomposition of the rate matrix has already been provided, and it is clear from the previous section how this impacts each algorithmic step.

Perhaps the most important variant to consider is the extension to the simulation of multiple sample paths. When simulating k sample paths from the same endpoint-conditioned CTMC using any of the aforementioned strategies, the initialization step need only be done once.
On the other hand, the iterations required for each sample path cannot in general be consolidated, and, thus, k affects complexity as a scale factor of β. It follows that for large enough k the initialization cost is of negligible concern, and because our examples have shown that α and β are somewhat comparable, in practice k need not be that large. In such cases, complexity is determined by kβE[L] for direct sampling and uniformization, and by kβE[L]/p_acc for rejection sampling. As a result of the virtual state changes that occur when sampling by uniformization, E[L] for that method will typically be somewhat larger than for the other two. For direct sampling, this is offset by a larger β, and, thus, the decision between direct sampling and uniformization rests upon the number of virtual state changes required. Rejection sampling, by contrast, completes each iteration quickly without the use of virtual transitions; it is once again the path acceptance probability that determines whether or not rejection sampling is viable.

In the direct sampling algorithm we use a root finder to simulate the waiting time before the next state change. If multiple sample paths are required, it would be beneficial to completely characterize (or very accurately approximate) the cumulative distribution function for the waiting time. As soon as this task is done, drawing from the conditional waiting time distribution would be almost instantaneous. Similarly, one could, in the case of uniformization, store the values of the probability masses in Remark 7 for i ≤ 3, say. Storing these calculations allows for very fast generation of state changes from the discrete Markov chain determined by the transition probability matrix R.
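The amortized comparison for k paths can be written down directly. The sketch below is our own illustration of the cost model just described (the function `total_cost` and the numeric constants are hypothetical, chosen only to exhibit the qualitative trade-offs: uniformization pays for extra virtual jumps through a larger E[L] but has cheap steps, direct sampling has fewer steps but a larger per-step cost, and rejection sampling divides by p_acc):

```python
def total_cost(alpha, beta, k, expected_steps, p_acc=1.0):
    """Approximate cost of k endpoint-conditioned sample paths:
    initialization (alpha) is paid once; each accepted path needs
    expected_steps recursion steps at cost beta each, and rejection
    sampling also pays for the discarded attempts via 1/p_acc."""
    return alpha + k * beta * expected_steps / p_acc

# Hypothetical per-method constants for illustration (not fitted values).
k = 1000
direct = total_cost(alpha=1.0, beta=0.30, k=k, expected_steps=3.0)
uniform = total_cost(alpha=1.1, beta=0.10, k=k, expected_steps=6.0)
reject = total_cost(alpha=0.02, beta=0.01, k=k, expected_steps=3.0,
                    p_acc=0.25)

# For large k the one-off alpha terms wash out and the ranking is
# governed by beta * E[L] / p_acc, exactly as described in the text.
print(direct, uniform, reject)
```

With these illustrative numbers the rejection sampler wins despite its 25% acceptance rate, because its per-step cost is so small; shrinking p_acc further eventually reverses the ordering.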
In short, the primary distinction when multiple sample paths are required is that the front-loaded procedures, direct sampling and uniformization, become comparatively more desirable, the reason being that more knowledge about the particular CTMC under consideration can be taken into account.

Finally, it should be noted that the efficiency of rejection sampling increases as the space of valid endpoint conditions is enlarged. As an example, consider the case of a CTMC observed at equidistant time points, so that the goal is to simulate sample paths using the same rate matrix Q and time interval T for a set of endpoint pairs {(a_i, b_i) : i = 1, ..., n}. In this case we can first use (unmodified) rejection sampling and assign each sample path to a pair (a_i, b_i) that matches the beginning and ending state of the simulated path. If the model is appropriate, this procedure could easily account for the majority of the needed sample paths, and very few rejections would be required. The remaining sample paths can subsequently be simulated using one of the three endpoint-conditioned samplers described in this paper.

Acknowledgments. We thank Jeff Thorne, Ben Redelings, and Ole F. Christensen for fruitful discussions and valuable comments on the manuscript. We are very grateful to the AE and the two reviewers for many constructive comments and suggestions.

SUPPLEMENTARY MATERIAL

Efficient simulation from finite-state, continuous-time Markov chains with incomplete observations (DOI: 10.1214/09-AOAS247SUPP; .zip). We accompany our paper with R code (www.r-project.org) that can reproduce the figures in the manuscript [Hobolth and Stone (2009)]. A description of how the code is organized is included in the supplementary material.

REFERENCES

Blackwell, P. (2003). Bayesian inference for Markov processes with diffusion and discrete components. Biometrika 90 613–627. MR2006839
Bladt, M. and Sørensen, M. (2005). Statistical inference for discretely observed Markov jump processes. J. Roy. Statist. Soc. Ser. B 67 395–410. MR2155345
Fearnhead, P. and Sherlock, C. (2006). An exact Gibbs sampler for the Markov modulated Poisson processes. J. Roy. Statist. Soc. Ser. B 68 767–784. MR2301294
Goldman, N. and Yang, Z. (1994). A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11 725–736.
Guttorp, P. (1995). Stochastic Modeling of Scientific Data. Chapman and Hall, Suffolk, Great Britain. MR1358359
Hasegawa, M., Kishino, H. and Yano, T. (1985). Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22 160–174.
Hobolth, A. (2008). A Markov chain Monte Carlo expectation maximization algorithm for statistical analysis of DNA sequence evolution with neighbor-dependent substitution rates. J. Comput. Graph. Statist. 17 138–162.
Hobolth, A. and Jensen, J. (2005). Statistical inference in evolutionary models of DNA sequences via the EM algorithm. Statist. Appl. Genet. Mol. Biol. 4 18. MR2170434
Hobolth, A. and Stone, E. (2009). Supplement to "Efficient simulation from finite-state, continuous-time Markov chains with incomplete observations." DOI: 10.1214/09-AOAS247SUPP.
Holmes, I. and Rubin, G. (2002). An expectation maximization algorithm for training hidden substitution models. J. Mol. Biol. 317 757–768.
Hwang, D. and Green, P. (2004). Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution. PNAS 101 13994–14001.
Jensen, A. (1953). Markoff chains as an aid in the study of Markoff processes. Skand. Aktuarietidsskr. 36 87–91. MR0057488
Jensen, J. and Pedersen, A. (2000). Probabilistic models of DNA sequence evolution with context dependent rates of substitution. Adv. in Appl. Probab. 32 499–517. MR1778577
Lartillot, N. (2006). Conjugate Gibbs sampling for Bayesian phylogenetic models. J. Comput. Biol. 13 1701–1722. MR2287736
Mateiu, L. and Rannala, B. (2006). Inferring complex DNA substitution processes on phylogenies using uniformization and data augmentation. Syst. Biol. 55 259–269.
Minin, V. and Suchard, M. (2008). Counting labeled transitions in continuous-time Markov models of evolution. J. Math. Biol. 56 391–412. MR2358440
Mouse Genome Sequencing Consortium (2002). Initial sequencing and comparative analysis of the mouse genome. Nature 420 520–562.
Nielsen, R. (2002). Mapping mutations on phylogenies. Syst. Biol. 51 729–739.
Rodrigue, N., Philippe, H. and Lartillot, N. (2008). Uniformization for sampling realizations of Markov processes: Applications to Bayesian implementations of codon substitution models. Bioinformatics 24 56–62.
Siepel, A., Pollard, K. and Haussler, D. (2006). New methods for detecting lineage-specific selection. In Proceedings of the 10th International Conference on Research in Computational Molecular Biology 190–205. Springer, Berlin.
The Chimpanzee Sequencing and Analysis Consortium (2005). Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437 69–87.

Department of Mathematical Sciences
Aarhus University
Denmark
E-mail: asger@daimi.au.dk

Department of Statistics
North Carolina State University
Raleigh, North Carolina 27695
USA
E-mail: eric_stone@ncsu.edu
