Estimation for general birth-death processes

Estimation for general birth-death pro cesses F orrest W. Cra wford 1 , Vladimir N. Minin 2 , and Marc A. Suc hard 3 , 4 , 5 T yp eset No v em ber 27, 2024 1. Department of Biomathematics Univ ersity of California Los Angeles CHS A V-611 Los Angeles, CA 90095-1766 USA fcra wford@ucla.edu 2. Department of Statistics Univ ersity of W ashington P adelford Hall C-315, Box 354322 Seattle, W A 98195-4322 USA vminin@u.w ashington.edu 3. Departments of Biomathematics, 4. Human Genetics, and 5. Biostatistics Univ ersity of California Los Angeles 6558 Gonda Building, Los Angeles, CA 90095-1766 USA msuc hard@ucla.edu Abstract Birth-death pro cesses (BDPs) are con tin uous-time Mark o v chains that trac k the n um ber of “par- ticles” in a system ov er time. While widely used in p opulation biology , genetics and ecology , statistical inference of the instan taneous particle birth and death rates remains largely limited to restrictive linear BDPs in which p er-particle birth and death rates are constan t. Researchers often observe the num b er of particles at discrete times, necessitating data augmentation pro- cedures such as exp ectation-maximization (EM) to ﬁnd maximum lik eliho od estimates. The E-step in the EM algorithm is av ailable in closed-form for some linear BDPs, but otherwise previous work has resorted to approximation or simulation. Remark ably , the E-step conditional exp ectations can also b e expressed as con v olutions of computable transition probabilities for any general BDP with arbitrary rates. This imp ortan t observ ation, along with a conv enient con- tin ued fraction representation of the Laplace transforms of the transition probabilities, allo ws no vel and eﬃcient computation of the conditional exp ectations for all BDPs, eliminating the need for approximation or costly simulation. W e use this insight to derive EM algorithms that yield maxim um likelihoo d estimation for general BDPs c haracterized by v arious rate mo dels, in- cluding generalized linear mo dels. W e sho w that our Laplace con volution tec hnique outperforms comp eting metho ds when a v ailable and demonstrate a tec hnique to accelerate EM algorithm con vergence. Finally , we v alidate our approac h using syn thetic data and then apply our methods to estimation of mutation parameters in microsatellite ev olution. Keyw ords : Birth-death pro cess, EM algorithm, MM algorithm, maxim um lik eliho od estimation, con tin uous-time Marko v chain, microsatellite evolu tion 1 1 In tro duction A birth-death process (BDP) is a con tin uous-time Mark o v c hain that mo dels a non-negativ e in teger n um b er of particles in a system (F eller, 1971). The state of the system at a giv en time is the n um b er of particles in existence. At any momen t in time, one of the particles may “give birth” to a new particle, increasing the count b y one, or one particle may “die”, decreasing the coun t b y one. BDPs are p opular mo deling to ols in a wide v ariety of quan titative disciplines, suc h as p opulation biology , genetics, and ecology (Thorne et al, 1991; Krone and Neuhauser, 1997; Nov ozhilov et al, 2006). F or example, BDPs can c haracterize epidemic dynamics, (Bailey, 1964; Andersson and Britton, 2000), sp eciation and extinction (Nee et al, 1994; Nee, 2006), evolution of gene families (Cotton and Page, 2005; Dem uth et al, 2006), and the insertion and deletion ev ents for probabilistic alignmen t of DNA sequences (Thorne et al, 1991; Holmes and Bruno, 2001). T raditionally , most modeling applications hav e used the “simple linear” BDP with constant p er- particle birth and death rates, which arises from an assumption of indep endence among particles and no background birth and death rates. When individual birth and death rates instead depend on the size of the population as a whole, the mo del is called a “general” BDP . Previous statistical estimation in BDPs has fo cused mainly on estimating the constant per-particle birth and death rates of the simple linear BDP based on observ ations of the n umber of particles o ver time. How ever, the simple linear BDP is often unrealistic, and nonlinear dependence of the birth and death rates on the curren t n um b er of particles pro vides the means to mo del more sophisticated and re alistic patterns of sto c hastic p opulation dynamics in a wide v ariet y of biological disciplines. F or example, p opulations sometimes exhibit logistic-lik e gro wth as their n umber approaches the carrying capacit y of their en vironmen t (T an and Pian tadosi, 1991). In genetic mo dels, the rate of new oﬀspring carrying an allele often dep ends on the prop ortions of b oth individuals already carrying the allele and those who do not (Moran, 1958). In coalescent theory , the rate of coalescence c hanges with the square of the n um b er of lineages (Kingman, 1982). In addition, researchers may wish to assess the inﬂuence of cov ariates on birth and death rates b y ﬁtting a regression mo del (Kalbﬂeisch and Lawless, 1985; Liu et al, 2007). Progress in estimating birth and death rates in BDPs has also t ypically b een limited to contin- uous observ ation of the pro cess (Moran, 1951, 1953; Anscombe, 1953; Darwin, 1956; W olﬀ, 1965; 2 Reynolds, 1973; Keiding, 1975). How ever, in practice researc hers ma y observ e data from BDPs only at discrete times through longitudinal observ ations. Estimating transition rates in contin uous-time Mark o v pro cesses using discrete observ ations is diﬃcult since the state path b et ween observ ations is not observ ed. F urthermore, direct analytic maximization of the likelihoo d for general BDPs remains infeasible for partially observed samples since the lik eliho od usually cannot b e written in closed-form. Despite these c hallenges, sev eral researchers hav e made progress in estimating pa- rameters of the simple linear BDP under discrete observ ation (Keiding, 1974; Thorne et al, 1991; Holmes and Bruno, 2001; Rosenberg et al, 2003; Dauxois, 2004). How ev er, none of these develop- men ts pro vides a robust method to ﬁnd exact maximum lik eliho o d estimates (MLEs) of parameters in discretely observed general BDPs with arbitrary birth and death rates. A ma jor insigh t comes from the fact that the lik eliho od of the contin uously observ ed proc ess has a simple form which easily yields expressions for estimation of rate parameters. This fact is the basis for exp ectation-maximization (EM) algorithms for maxim um likelihoo d estimation in missing data problems (Dempster et al, 1977). In ﬁnite state-space Mark ov chains, the relev ant conditional exp ectations (the E-step of the EM algorithm) can often b e computed eﬃcien tly , and several researc hers ha v e derived EM algorithms for estimating transition rates in this con text (Lange, 1995a; Holmes and Rubin, 2002; Hob olth and Jensen, 2005; Bladt and Sorensen, 2005; Metzner et al, 2007). Unfortunately , ﬁnding these conditional exp ectations for general BDPs p oses c hallenges since the join t distribution of the states and w aiting times (or its generating function) is usually not a v ailable in closed-form. Notably , Holmes and Bruno (2001); Holmes and Rubin (2002) and Doss et al (2010) are able to ﬁnd analytic expressions or n umerical appro ximations for these expectations in EM algorithms for certain BDPs whose rates dep end linearly on the current n um b er of particles. While these developmen ts are promising, there remains a great need for estimation tec hniques that can b e applied to more sophisticated BDPs under a v ariet y of sampling scenarios. Indeed, more complex and realistic mo dels like those reviewed b y No v ozhilov et al (2006) may b e of little use to applied research ers if no practical metho d exists to estimate their parameters. Here we seek to ﬁll this apparent void by pro viding a framework for deriving EM algorithms for estimating rate parameters of a general BDP . W e ﬁrst formally deﬁne the general BDP and giv e an exact expression for the Laplace transform of the transition probabilities in the form of a con tinued fraction. W e then give the lik eliho od for contin uously-observ ed BDPs and outline 3 the EM algorithm. Next, we describ e a nov el metho d to eﬃcien tly compute the exp ectations of the E-step for BDPs with arbitrary rates. Since these exp ectations are conv olutions of transition probabilities, we perform the conv olution in the Laplace domain, and then in v ert the Laplace transformed expressions to obtain the desired conditional expectation. This technique obviates the costly numerical integration or rep eated simulation that has plagued previous approaches. W e pro vide examples of the maximization step for several diﬀeren t classes of BDPs and demonstrate a technique for accelerating con v ergence of the EM algorithm. W e sho w that our method is faster than comp eting techniques and v alidate it using sim ulated data. Finally , we conclude with an application that analyzes microsatellite ev olution and answers an op en question in ev olutionary genomics. 2 General BDPs and their EM algorithms 2.1 F ormal description and transition probabilities Consider a general BDP X ( τ ) coun ting the num b er of particles k in existence at times τ ≥ 0. F rom state X ( τ ) = k , transitions to state k + 1 happ en with instantaneous rate λ k , and transitions to state k − 1 happ en with instantaneous rate µ k . The transition rates λ k and µ k ma y dep end on k but are time-homogeneous. As we sho w b elo w, it is often necessary to ev aluate ﬁnite-time transition probabilities to derive eﬃcien t EM algorithms for estimation of arbitrary birth and death rates in general BDPs. This prov es useful b oth in completing the E-step of the EM algorithm and in computing incomplete data likelihoo ds for v alidation of our EM estimates. F or a starting state i ≥ 0, the ﬁnite-time transition probabilities P i,j ( τ ) = Pr( X ( τ ) = j | X (0) = i ) ob ey the system of ordinary diﬀeren tial equations d P i, 0 ( τ ) d τ = µ 1 P i, 1 ( τ ) − λ 0 P i, 0 ( τ ), and d P i,j ( τ ) d τ = λ j − 1 P i,j − 1 ( τ ) + µ j +1 P i,j +1 ( τ ) − ( λ j + µ j ) P i,j ( τ ) , (1) for j ≥ 1 with P i,i (0) = 1 and P i,j (0) = 0 for i 6 = j (F eller, 1971). F or some simple parameterizations of λ k and µ k , closed-form solutions exist for the transition probabilities P i,j ( τ ), but this is not p ossible for most mo dels. Karlin and McGregor (1957) show 4 that for an y parameterization of λ k and µ k , it is p ossible to express the transition probabilities in terms of orthogonal p olynomials. Ho w ev er, in practice these sp ecial p olynomials are diﬃcult to ﬁnd, and ev en when they are a v ailable, they rarely yield solutions in closed-form or expressions that are amenable to computation (Nov ozhilov et al, 2006; Rensha w, 2011). In con trast, the con- tin ued fraction metho d w e outline b elow do es not require additional mo del-speciﬁc insight b eyond sp eciﬁcation of λ k and µ k . T o solv e for the transition probabilities, it is adv antageous to work in the Laplace domain (Karlin and McGregor, 1957). This transformation also pro v es essen tial in maintaining numerical stabilit y of transition probabilities in general BDPs and in computing the conditional exp ectations necessary for the EM algorithm derived in a subsequent section. Laplace transforming equation (1) yields sf i, 0 ( s ) − δ i 0 = µ 1 f i, 1 ( s ) − λ 0 f i, 0 ( s ) , sf i,j ( s ) − δ ij = λ j − 1 f i,j − 1 ( s ) + µ j +1 f i,j +1 ( s ) − ( λ j + µ j ) f i,j ( s ) , (2) where f i,j ( s ) is the Laplace transform of P i,j ( τ ) and δ ij = 1 if i = j and zero otherwise. Letting i = 0 and rearranging (2), we obtain the recurrence relations f 0 , 0 ( s ) = 1 s + λ 0 − µ 1  f 0 , 1 ( s ) f 0 , 0 ( s )  , and f 0 ,j ( s ) f 0 ,j − 1 ( s ) = λ j − 1 s + µ j + λ j − µ j +1  f 0 ,j +1 ( s ) f 0 ,j ( s )  . (3) W e can inductively combine these expressions for j = 1 , 2 , 3 , . . . to arriv e at the well-kno wn gener- alized contin ued fraction f 0 , 0 ( s ) = 1 s + λ 0 − λ 0 µ 1 s + λ 1 + µ 1 − λ 1 µ 2 s + λ 2 + µ 2 − · · · . (4) This is an exact expression for the Laplace transform of the transition probabilit y P 0 , 0 ( τ ). In (4), let a 1 = 1 and a j = − λ j − 2 µ j − 1 , and let b 1 = s + λ 0 and b j = s + λ j − 1 + µ j − 1 for j ≥ 2. Then (4) 5 b ecomes f 0 , 0 ( s ) = a 1 b 1 + a 2 b 2 + a 3 b 3 + · · · . (5) W e can write this more compactly as f 0 , 0 ( s ) = a 1 b 1 + a 2 b 2 + a 3 b 3 + · · · . (6) The k th con v ergent of f 0 , 0 ( s ) is f ( k ) 0 , 0 ( s ) = a 1 b 1 + a 2 b 2 + · · · a k b k = A k ( s ) B k ( s ) , (7) where A k ( s ) and B k ( s ) are the n umerator and denominator of the rational function f ( k ) 0 , 0 . The transition probabilities P i,j ( τ ) for i, j > 0 can b e derived in con tin ued fraction form by com bining (2) and (4) to obtain f i,j ( s ) =                        i Y k = j +1 µ k   B j ( s ) B i +1 ( s )+ B i ( s ) a i +2 b i +2 + a i +3 b i +3 + · · · for j ≤ i, j − 1 Y k = i λ k ! B i ( s ) B j +1 ( s )+ B j ( s ) a j +2 b j +2 + a j +3 b j +3 + · · · for i ≤ j , (8) (Murph y and O’Donoho e, 1975; Crawford and Suchard, 2011). Although the Laplace transforms of the transition probabilities are generally still not av ailable in closed-form, a contin ued fraction representation is desirable for several reasons: 1) contin ued fraction represen tations of functions often conv erge muc h faster than equiv alen t p ow er series; 2) there are eﬃcien t algorithms for ev aluating them to a ﬁnite depth; and 3) there exist metho ds for b ounding the error of truncated contin ued fractions (Bankier and Leigh ton, 1942; W all, 1948; Blanc h, 1964; Lorentzen and W aadeland, 1992; Cra viotto et al, 1993; Abate and Whitt, 1999; Cuyt et al, 2008). F or an arbitrary BDP , w e recov er the transition probabilities through numerical in v ersion of the Laplace-transformed expressions. W e ev aluate the con tin ued fraction to a moni- 6 tored depth that controls the ov erall error and generates stable appro ximations to the transition probabilities unattainable by previous methods (Murphy and O’Donohoe, 1975; Parthasarath y and Sudhesh, 2006; Crawford and Suchard, 2011). The abilit y to compute transition probabilities for general BDPs with arbitrary rate parame- terizations prov es useful in tw o w ays. First, if we interpret ﬁnite-time transition probabilities as functions of an unknown parameter vector θ , then P a,b ( t ) given θ returns the likeliho o d of a dis- crete observ ation from a BDP such that X (0) = a and X ( t ) = b , where the tra jectory in time t b et ween a and b is unobserv ed. Second, transition probabilities pla y an imp ortan t role in computing conditional exp ectations of suﬃcien t statistics, as we shall see b elo w. 2.2 Lik eliho o d expressions and surrogate functions [Figure 1 ab out here.] With a formal description of a general BDP and the ﬁnite-time transition probabilities in hand, w e no w pro ceed with our task of estimating the parameters of a general BDP using discrete observ ations. Giv en one or more indep enden t observ ations of the form Y = ( X (0) = a, X ( t ) = b ) from a general BDP , w e wish to ﬁnd maximum likelihoo d estimates of the rate parameters λ k and µ k for k = 0 , 1 , 2 , . . . . W e will assume that the birth and death rates at state k dep end on b oth k and a ﬁnite-dimensional parameter vector θ , so that the form of λ k ( θ ) and µ k ( θ ) is kno wn for all k . F or a single realization of the pro cess starting at X (0) = a and ending at X ( t ) = b , let T k b e the total time sp en t in state k . Let U k b e the num b er of “up” steps (births) from state k , and let D k b e the num b er of “down” steps (deaths) from state k . Let the total num b er of up and do wn steps in a realization of the pro cess b e denoted by U = P ∞ k =0 U k and D = P ∞ k =0 D k resp ectiv ely . W e also deﬁne the total particle time, T particle = Z t 0 X ( τ ) d τ = ∞ X k =0 k T k , (9) that counts the amount of time lived b y each particle since time τ = 0. Of course, the total elapsed time is t = P ∞ k =0 T k . W e demonstrate these concepts schematically in Figure 1. 7 The log-likelihoo d for a contin uously observ ed pro cess takes a simple form when w e sum ov er all p ossible states k (W olﬀ, 1965): ` ( θ ) = ∞ X k =0 U k log  λ k ( θ )  + D k log  µ k ( θ )  −  λ k ( θ ) + µ k ( θ )  T k . (10) Ho w ever, when a BDP is sampled discretely suc h that only X (0) = a and X ( t ) = b are observ ed, the quan tities U k , D k , and T k are unkno wn for ev ery state k , and w e cannot maximize the log-likelihoo d (10) without them. W e therefore app eal to the EM algorithm for iterative maximum likelihoo d estimation with missing data (Dempster et al, 1977). In the EM algorithm, w e deﬁne a surrogate ob jective function Q by taking the exp ectation of the complete data log-likelihoo d (10), conditional on the observed data Y and the parameter v alues θ ( m ) from the previous iteration of the EM algorithm (the E-step). Then w e ﬁnd the parameter v alues θ ( m +1) that maximize this surrogate function (the M-step). This t w o-step pro cess is rep eated until conv ergence to the maxim um likelihoo d estimate of θ . T aking the exp ectation of (10) conditional on Y and θ ( m ) , we form the surrogate function Q : Q  θ | θ ( m )  = E  ` ( θ ) | Y , θ ( m )  = ∞ X k =0 E ( U k | Y ) log  λ k ( θ )  + E ( D k | Y ) log  µ k ( θ )  − E ( T k | Y )  λ k ( θ ) + µ k ( θ )  , (11) where for clarit y we hav e omitted the dep endence of the exp ectations on the parameter v alue θ ( m ) from the m th iterate. In general, we assume that the maxim um lik eliho od estimator exists; see Bladt and Sorensen (2005) for a discussion of the issues of identiﬁabilit y , existence, and uniqueness. 2.3 Computing the exp ectations of the E-step Computing the exp ectations of U k , D k , and T k in the E-step is diﬃcult in birth-death estimation since the unobserv ed state path and waiting times are not indep enden t conditional on the observed data Y . Dos s et al (2010) adopt an approach for linear BDPs that com bines analytic results with sim ulations. F or some mo dels, these authors are able to derive the generating function for the joint distribution of U , D , T particle , and the state path conditional on X (0) = a and can manipulate this generating function to complete the E-step. F or a more complicated linear mo del, Doss et al resort 8 to approximating the relev an t conditional exp ectations by simulating sample paths, conditional on Y (Hob olth, 2008). Our solution is to recognize that we do not need to kno w very muc h ab out the missing data to ﬁnd the conditional exp ectations used in the suﬃcient statistics ab o v e. In fact, the transition probabilities are all that we require. The following in tegral representations of the conditional exp ectations in the EM algorithm will prov e useful: E ( U k | Y ) = Z t 0 P a,k ( τ ) λ k P k +1 ,b ( t − τ ) d τ P a,b ( t ) , (12a) E ( D k | Y ) = Z t 0 P a,k ( τ ) µ k P k − 1 ,b ( t − τ ) d τ P a,b ( t ) , and (12b) E ( T k | Y ) = Z t 0 P a,k ( τ ) P k,b ( t − τ ) d τ P a,b ( t ) . (12c) These form ulas hav e app eared in many types of studies related to EM estimation for contin uous- time Marko v chains (Lange, 1995a; Holmes and Rubin, 2002; Bladt and Sorensen, 2005; Hob olth and Jensen, 2005; Metzner et al, 2007). F or general BDPs whose transition probabilities must b e computed numerically , numerical in tegration o v er the pro duct of the densities can b e computation- ally prohibitiv e. Ho w ever, the n umerators in (12) a-c are con v olutions of integrable time-domain functions. Since the Laplace transforms f a,b ( s ) of these transition probabilities are av ailable and easy to compute, w e take adv antage of the Laplace conv olution property , arriving at the representations E ( U k | Y ) = λ k L − 1 h f a,k ( s ) f k +1 ,b ( s ) i ( t ) P a,b ( t ) , (13a) E ( D k | Y ) = µ k L − 1 h f a,k ( s ) f k − 1 ,b ( s ) i ( t ) P a,b ( t ) , and (13b) E ( T k | Y ) = L − 1 h f a,k ( s ) f k,b ( s ) i ( t ) P a,b ( t ) . (13c) where L − 1 denotes in v erse Laplace transformation. Although these formulas are equiv alent to (12), they oﬀer substantial time savings o v er computing the integral directly , and render tractable the 9 computation of exp ectations in the EM algorithm for arbitrary general BDPs. T o calculate the numerators of (13), we use the Laplace in version method p opularized by Abat e and Whitt (1992, 1995). This inv olves a Riemann sum approximation of the inv erse transform that stabilizes the discretization error and is amenable to series acceleration metho ds (Abate and Whitt, 1999; Press, 2007). T o ev aluate the contin ued fraction Laplace transforms f a,b ( s ), we use the mo diﬁed Len tz metho d (Len tz, 1976; Thompson and Barnett, 1986; Press, 2007). 2.4 Maximization tec hniques for v arious BDPs In con trast to the generic technique outlined ab o v e for computing the exp ectations of the E-step, the M-step dep ends explicitly on the functional form of the birth and death rates λ k ( θ ) and µ k ( θ ). Here w e give several representativ e examples of BDPs and techniques for completing the M-step of the EM algorithm, suc h as analytic maximization, minorize-maximize (MM), and Newton’s metho d. 2.4.1 Simple linear BDP In the simple linear BDP , births and deaths happ en at constant p er-capita rates, so λ k = k λ and µ k = k µ . The unkno wn parameter vector is θ = ( λ, µ ), and the surrogate function b ecomes Q ( θ ) = ∞ X k =0 E ( U k | Y ) log[ k λ ] + E ( D k | Y ) log[ k µ ] − E ( T k | Y ) k ( λ + µ ) . (14) T aking the deriv ativ e of (14) with resp ect to the unkno wn parameters, setting the result to zero, and solving for λ and µ giv es the M-step up dates λ ( m +1) = E ( U | Y ) E ( T particle | Y ) , and (15a) µ ( m +1) = E ( D | Y ) E ( T particle | Y ) . (15b) These up dates corresp ond to the usual maximum likelihoo d estimators in the contin uously observ ed pro cess (Reynolds, 1973). Note that the transition probabilities P a,b ( t ) in the denominators of the exp ectations in (12) cancel out in (15a) and (15b). When this is the case, transition probabilities are not necessary to derive an EM algorithm. 10 2.4.2 Linear BDP with immigration Sometimes p opulations are not closed, and new individuals can en ter; we call this action “immigra- tion.” Another interpretation arises in mo dels of p oin t mutations in DNA sequences. Supp ose new m utations arise in a DNA sequence via tw o distinct pro cesses: one inserts new mutan ts at a rate prop ortional to the n um b er already presen t, and the other creates new mutations at a constan t rate, regardless of ho w many already exist. T o mo del this behavior, we augment the simple linear BDP ab o v e with a constant term ν represen ting immigration, so that λ k = k λ + ν and µ k = k µ . The log-likelih o od b ecomes ` ( θ ) = ∞ X k =0 U k log( k λ + ν ) + D k log( µ ) − T k [ k ( λ + µ ) + ν ] . (16) Unfortunately , if we take the deriv ativ e of the log-likelihoo d with resp ect to λ or ν , the unkno wn app ears in the denominator of the terms of the inﬁnite sum. How ever, since each summand is a conca v e function of the unknown parameters, we can separate them in a minorizing function H suc h that for all θ , H  θ | θ ( m )  ≤ ` ( θ ) and H  θ ( m ) | θ ( m )  = `  θ ( m )  as follows: ` ( θ ) ≥ H  θ | θ ( m )  = ∞ X k =0 U k  p k log  p k λ  + (1 − p k ) log  (1 − p k ) ν  + D k log( µ ) −  k ( λ + µ ) + ν  T k , (17) where p k = k λ ( m ) k λ ( m ) + ν ( m ) . (18) Then letting Q  θ | θ ( m )  = E  H ( θ ) | Y , θ ( m )  b e the surrogate function, this minorization forms the basis for an EM algorithm in which a step of the minorize-maximize (MM) algorithm takes the place of the M-step, and the ascent prop ert y of the EM algorithm is preserved (Lange, 2010). 11 Maximizing Q with resp ect to λ and ν yields the up dates λ ( m +1) = ∞ X k =0 p k E ( U k | Y ) E ( T particle | Y ) , and (19a) ν ( m +1) = ∞ X k =0 (1 − p k ) E ( U k | Y ) t . (19b) Expression (19a) is similar to (15a), the up date for λ in the simple BDP . The diﬀerence lies in that eac h E ( U k | Y ) in this case is w eigh ted b y the prop ortion of additions at state k due to births, not immigrations. The up date for µ is the same as (15b). 2.4.3 Logistic/restricted gro wth T o illustrate an EM algorithm for more complicated rate sp eciﬁcations in which no MM up date is evident and the rates no longer dep end on the curren t state k in a linear wa y , w e examine a mo del for restricted p opulation growth. Typical deterministic p opulation mo dels often incorp orate limitations on p opulation size due to the carrying capacity K of the environmen t. One famous example is the logistic mo del of p opulation gro wth (Murray, 2002). Contin uous-time sto c hastic analogs hav e previously required a ﬁnite cap on p opulation size (T an and Piantadosi, 1991). These sto c hastic mo dels roughly mimic the b eha vior of the deterministic model for p opulation sizes b elo w K , but are limited b ecause they do not allow gro wth b ey ond K . Here we present a mo del which supp orts transient gro wth b eyond the carrying capacity , but where the p opulation size tends to a balance b et w een restricted growth and death. Supp ose births are co operative, requiring t wo parents, but fecundit y decays as the num b er of extan t particles increases, and death remains an indep enden t pro cess such that λ k = λk 2 e − β k and µ k = k µ . Here, w e can interpret the carrying capacity roughly as the p opulation size k > 0 at whic h λ k ≈ µ k . Ignoring irrelev ant terms, the surrogate function b ecomes Q  θ | θ ( m )  = ∞ X k =0 E ( U k | Y )[log( λ ) − β k ] + E ( D k | Y ) log( µ ) − E ( T k | Y )[ λk 2 e − β k + k µ ] . (20) Since λ and β app ear together, w e opt for a n umerical Newton step. The gradien t of Q with resp ect 12 to these parameters is F =       E ( U | Y ) λ − ∞ X k =0 k 2 e − β k E ( T k | Y ) − ∞ X k =0 h k E ( U k | Y ) + λk 3 e − β k E ( T k | Y ) i       , (21) and the Hessian is H =       − E ( U | Y ) λ 2 − ∞ X k =0 k 3 e − β k E ( T k | Y ) − ∞ X k =0 k 3 e − β k E ( T k | Y ) λ ∞ X k =0 k 4 e − β k E ( T k | Y ) .       . (22) Then we up date these parameters by    λ ( m +1) β ( m +1)    =    λ ( m ) β ( m )    − H − 1 F . (23) The ascent prop ert y is preserved when a Newton step is use d in place of an exact M-step (Lange, 1995a). The up date for µ is the same as (15b). 2.4.4 SIS epidemic mo dels Under a v ery common epidemic mo del, mem b ers of a ﬁnite p opulation of size N are classiﬁed as either “susceptible” to a given disease or “infected” (Bailey, 1964; Andersson and Britton, 2000). Susceptibles b ecome infected in prop ortion to the num b er of currently infected in the p opulation, and infecteds revert to susceptible status with a certain rate indep enden t of ho w many infecteds there are. This idealized susceptible-infectious-susceptible (SIS) infectious disease mo del sp eciﬁes a general birth-death pro cess in which we trac k the num b er of infecteds. Let λ k = β k ( N − k ) / N b e the rate of new infections when there are already k infected in the p opulation. Let µ k = γ k / N b e the rate of reco very of infecteds to susceptibles. Then if θ = ( β , γ ), we ha v e Q  θ | θ ( m )  = N X k =0 E ( U k | Y ) log( β ) + E ( D k | Y ) log( γ ) − E ( T k | Y )( k ( N − k ) β + k γ ) / N , (24) 13 and the up date for β is β ( m +1) = N E ( U | Y ) N X k =0 ( N − k ) k E ( T k | Y ) . (25) The up date for γ is γ ( m +1) = N E ( D | Y ) E ( T particle | Y ) . (26) 2.4.5 Generalized linear mo dels Our general framew ork allo ws assessment of the inﬂuence of co v ariates on the rates of a general BDP in a nov el wa y . Supp ose we sample observ ations from indep enden t pro cesses X i ( τ ), i = 1 , . . . , N and observ e Y i = ( X i (0) , X i ( t i )) asso ciated with d co v ariates z i = ( z i 1 , . . . , z id ) t . These pro cesses ma y represen t diﬀerent sub jects in a study . W e mo del the birth and death rates λ ik and µ ik for eac h pro cess/sub ject X i as functions of z i and unknown d -dimensional regression co eﬃcien ts θ λ and θ µ in a generalized linear mo del (GLM) framework. W e link log( λ ik ) = g ( k , z t i θ λ ) and log ( µ ik ) = h ( k , z t i θ µ ) , (27) where g ( · ) and h ( · ) are scalar-v alued functions. W e note the p ossibilit y that co v ariates ma y diﬀer b et ween θ λ and θ µ through trivial mo diﬁcation; to ease notation, w e do not explore this direction. Giv en N indep endent pro cesses, we sum log-lik eliho ods to arriv e at the multiple-sub ject surrogate function: Q  θ | θ ( m )  = N X i =1 ∞ X k =0 h E ( U k | Y i ) g ( k , z t i θ λ ) + E ( D k | Y i ) h ( k , z t i θ µ ) − E ( T k | Y i )  e g ( k, z t i θ λ ) + e h ( k, z t i θ µ )  i . (28) Although we cannot usually maximize this surrogate function for all elements of ( θ λ , θ µ ) simulta- neously , a Newton step is often straightforw ard to deriv e. As an example, consider generalized linear mo del extension of the simple linear BDP in whic h log( λ ik ) = log( k ) + z t i θ λ , and log( µ ik ) = log( k ) + z t i θ µ . (29) 14 T aking the gradien t of the corresp onding surrogate function Q with resp ect to the parameters θ λ yields ∇ θ λ Q = N X i =1 E ( U | Y i ) z i − e z t i θ λ E ( T particle | Y i ) z i (30) and the second diﬀerential (Hessian) of Q is d 2 θ λ Q = − N X i =1 e z t i θ λ E ( T particle | Y i ) z i z t i . (31) Com bining these, we arrive at the Newton step for the parameter v ector θ λ : θ ( m +1) λ = θ ( m ) λ −  d 2 θ λ Q  − 1 ∇ θ λ Q. (32) A similar up date can b e found for θ µ . These up dates are examples of the gradient EM algorithm for regression in Marko v pro cesses describ ed by W anek et al (1993) and Lange (1995a). It is w orth noting that the Hessian matrix d 2 θ λ Q can b ecome ill-conditioned, making it diﬃcult to inv ert for the Newton step in (32) for some problems. Unfortunately there is no quasi-Newton option since in general E ( T particle | Y ) e z t i θ λ is unbounded. An alternativ e to inv ersion of the Hessian matrix is cyclic co ordinate descen t in which a Newton step is p erformed for eac h co ordinate θ j individually . This carries the adv antage of a v oiding matrix inv ersion, but conv ergence is slow er and the ascent prop ert y m ust b e c hec k ed at each Newton step. 2.5 Implemen tation Before presenting sim ulation results and our application to microsatellite evolution, we brieﬂy outline some implementation details that ease our subsequent analyses. 2.5.1 E-step acceleration The E-step in these EM algorithms for BDP estimation usually in v olves inﬁnite weigh ted sums of the conditional exp ectations E ( U k | Y ), E ( D k | Y ), and E ( T k | Y ). F or example, when estimating λ in 15 the simple linear BDP , w e m ust ev aluate E ( U | Y ) = ∞ X k =0 E ( U k | Y ) = ∞ X k =0 λ k L − 1 h f a,k ( s ) f k +1 ,b ( s ) i ( t ) P a,b ( t ) . (33) F ortunately , the conditional exp ectations of U k , D k , and T k are usually small for k  min( a, b ) and k  max( a, b ), so it is p ossible to replace the inﬁnite sum in (33) b y a ﬁnite one. W e ﬁnd an additional increase in computational eﬃciency b y exchanging the order of Laplace inv ersion and summation. Then (33) b ecomes E ( U | Y ) ≈ L − 1   k max X k = k min λ k f a,k ( s ) f k +1 ,b ( s )   ( t ) P a,b ( t ) , (34) where we c ho ose k min to b e the largest k < min( a, b ) suc h that λ k | f a,k ( s ) − f k +1 ,a | < 10 − 8 and k max to b e the ﬁrst k > max( a, b ) suc h that λ k | f a,k ( s ) f k +1 ,b ( s ) | < 10 − 8 . In practice, w e rarely need to compute exp ectations for k less than min( a, b ) − 10 or greater than max( a, b ) + 10. 2.5.2 Quasi-Newton acceleration of EM iterates EM algorithms are notorious for slo w conv ergence, esp ecially near optima. When appropriate, we exploit the quasi-Newton acceleration metho d in tro duced by Lange (1995b) in our implementations. Other acceleration metho ds exist, and ma y give b etter results, depending on the problem (Lange, 1995a; Louis, 1982; Meilijson, 1989; Jamshidian and Jennrich, 1993). Figure 2 sho ws the log- lik eliho od function and iterates for the basic EM and accelerated EM metho ds in the simple linear mo del. Since the quasi-Newton acceleration metho d do es not gu arantee that the likelihoo d increases at each step, “step-halving” is o ccasionally necessary to achiev e ascent. Note that this requires lik eliho od ev aluation at least once p er iteration. Our approach is adv an tageous in that we can eﬃcien tly calculate this lik elihoo d (transition probabilit y) for an y general BDP (Crawford and Suc hard, 2011). [Figure 2 ab out here.] 16 2.5.3 Asymptotic v ariance of EM estimates Finding the observed information matrix for an EM estimate can b e c hallenging. Louis (1982) giv es form ulae for the observed information, whic h Doss et al (2010) use to derive analytic expres- sions for the observ ed information for v ery simple BDPs. Ho w ever, analytic expressions for the asymptotic v ariance are generally hard to ﬁnd for more complicated mo dels. W e instead turn to the supplemented EM (SEM) algorithm of Meng and Rubin (1991), which computes the informa- tion matrix of the EM estimate of θ after the MLE ˆ θ has b een found. The observed information is I ( ˆ θ ) = − d 2 Q ( ˆ θ | ˆ θ )( I − d M ( ˆ θ )), where M ( θ ) is the EM algorithm map suc h that θ ( m +1) = M ( θ ( m ) ). W e numerically approximate the diﬀeren tial d M at the termination of the EM algorithm. W e note also that since we are able to calculate transition probabilities directly , the observed data log-lik eliho od is easily computed as ` ( θ ) = N X i =1 log P a i ,b i ( t i ) , (35) where a i = X i (0) and b i = X i ( t i ). As an alternative to the approaches outlined ab ov e, we can calculate the Hessian using purely n umerical techniques. If H ( ˆ θ ) = d 2 ` ( ˆ θ ) is the numerical Hessian ev aluated at the estimated v alue ˆ θ , then ˆ I ≈ − H ( ˆ θ ). 3 Results 3.1 Laplace con v olution E-step comparison T o illustrate the computational sp eedup that the Laplace conv olution formulae (13) and their acceleration in section 2.5.1 achiev e o ver existing metho ds, we calculate conditional exp ectations for v arious BDP models for p erforming the E-step and rep ort computing times in T able 2. The ﬁrst metho d in the table emplo ys rejection sampling of tra jectories where we condition on the starting state, and reject based on the ending state (Bladt and Sorensen, 2005). The second metho d adapts an endp oin t-conditioned simulation algorithm (Hobolth, 2008; Hob olth and Stone, 2009). The third considers na ¨ ıv e time-domain conv olution (Equation (12)) using the integrate function in R . Finally , w e compute the same quantities via the Laplace-domain con v olution metho d outlined in section 2.3. In our implementations, we ha v e made every eﬀort to reuse as muc h shared R co de as p ossible, 17 with the aim of making the routines comparable. W e consider four diﬀerent BDPs. F or a simple linear BDP and a linear BDP with immigration, we use the data Y = ( X (0) = 19 , X (2) = 27). Under a logistic mo del, the data are Y = ( X (0) = 10 , X (2) = 16), and for the SIS model the data are Y = ( X (0) = 10 , X (2) = 31). W e list all mo del parameter v alues in T able 2. As seen in T able 2, the Laplace conv olution metho d is often more than 10 times faster than the other metho ds. In terms of time-p erformance, the endp oin t-conditioned simulation stands as second b est, achieving almost comparable sp eed in the logistic BDP . T o interpret this ﬁnding, we recall that Hobolth (2008) constructs an endp oin t-conditioned sim ulation for p erforming the E-step in ﬁnite state-space Mark o v chains. Therefore, to adapt this metho d we approximate each BDP b y a Mark o v c hain with a ﬁnite transition rate matrix. T o c ho ose the arbitrary dimension of this matrix we truncate the pro cess at the ﬁrst state k > max( a, b ) suc h that P a,k ( t ) < 10 − 5 , and the resulting estimates agree substantially with the other metho ds. W e are a ware that the size of the rate matrix aﬀects the sp eed of the sim ulation routine, so we wish to k eep the matrix as small as p ossible. On the other hand, the matrix m ust remain large enough to include states that ma y b e visited with high probabilit y in a path from a to b o v er time t . F or the logistic mo del, suc h a stringen t upp er b ound lies just ab o v e the relatively small carrying capacity . How ev er, endp oin t- conditioned simulation completely fails for the SIS mo del, an issue w e discuss later. Finally , and quite naturally , the tw o conv olution metho ds arriv ed at nearly the same answer for eac h mo del; the diﬀerence is largely due to very diﬀeren t sources of numerical error, but at disparate computational costs. [T able 1 ab out here.] 3.2 Syn thetic examples T o ev aluate the p erformance of our EM algorithms, we simulate discrete observ ations from several of the BDPs outlined ab o ve. F or each sample, we dra w starting p oin ts X i (0) uniformly from the in tegers 0 to 20, and times t i uniformly from 0 . 1 to 3. W e then sim ulate a tra jectory of the BDP and record the state X i ( t i ). F or the generalized linear mo del (GLM), w e employ the simple linear parameterization with a log link with d = 2 cov ariates. W e sp ecify the cov ariates z i = ( z i, 1 , z i, 2 ) as follo ws: z i, 1 ∼ N (1 , σ 2 ), z i, 2 ∼ N (2 , σ 2 ) for i = 1 , . . . , N/ 2, z i, 1 ∼ N (2 , σ 2 ) and z i, 2 ∼ N (1 , σ 2 ) for 18 i = N / 2 + 1 , . . . , N , where σ 2 = 0 . 1. T able 3 rep orts the n um b er of sim ulated observ ations, true parameter v alues, p oint-estimates, asymptotic standard error estimates for all mo del parameters. It is imp ortan t to note that the MLEs can diﬀer substantially from the parameter v alues used to p erform the simulation, regardless of the algorithm used to ﬁnd the MLEs. This is due to several factors, including: 1) missing state paths; 2) sto c hasticity of the BDP generating the state paths; 3) arbitrary choice of starting states X i (0); and 4) ﬁnite sample sizes. Despite these limitations inheren t in learning from partially observ ed sto c hastic pro cesses, the p oin t-estimates matc h the true parameter v alues rather w ell. [T able 2 ab out here.] 3.3 Application to microsatellite ev olution Microsatellites are short tandem rep eats of c haracters in a DNA sequence (Schl¨ otterer, 2000; El- legren, 2004; Ric hard et al, 2008). The num b er of repeated “motifs” in a microsatellite often c hanges ov er evolutionary timescales. The molecular mechanism resp onsible for changes in rep eat n um b ers is known as “p olymerase slippage” (Sc hl¨ otterer, 2000). Several researc hers hav e prop osed linear BDPs for use in analyzing ev olution of microsatellite rep eat num bers (Whittaker et al, 2003; Calabrese and Durrett, 2003; Sainudiin et al, 2004). Ho w ev er, many inv estigations demonstrate that microsatellite mutabilit y dep ends on the n um b er of rep eats already presen t, motif size, and motif n ucleotide comp osition (Chakraborty et al, 1997; Ec k ert and Hile, 2009; Kelk ar et al, 2008; Amos, 2010). Exactly ho w these factors aﬀect addition and deletion rates remains an open question (Bharga v a and F uen tes, 2010). T o our kno wledge, no previous study form ulates or ﬁts a general BDP in whic h motif size and composition are treated as a cov ariates in a generalized regression framew ork, despite the scien tiﬁc interest in examining such eﬀects on microsatellite evolution. W ebster et al (2002) study the evolution of 2467 microsatellites common (orthologous) to both h umans and c himpanzees, pro viding an ideal dataset for studying the inﬂuence of rep eat num b er and motif size on addition and deletion rates. F or eac h of these observed microsatellites, W ebster et al (2002) record the motif n ucleotide pattern and the num b er of rep eats of this motif found in chimpanzees and humans, and estimate a mutabilit y parameter that con trols the rate of addition and deletion. 19 W e now presen t an extended application of our BDP inference tec hnique to c himpanzee-human microsatellite evolution, dra wing on the data in T able 6 of the supplementary information in W eb- ster et al (2002). W e in tro duce several nov el mo deling and inferential tec hniques relev ant to the study of microsatellites, and deduce the eﬀect of motif size and comp osition on microsatellite ad- dition and deletion rates. While the likelihoo d takes a sligh tly more complicated form, our BDP regression tec hnique is straigh tforward to implemen t and yields insigh t in to the complicated pro cess of microsatellite evolution. 3.3.1 Ev olutionary mo del T o analyze the data as realizations from a BDP , w e m ust ac kno wledge the ev olutionary relationship b et ween c himpanzees and h umans. Supp ose the most recent common ancestor of chimpanzees and h umans lived at time t in the past, so that an ev olutionary time of 2 t separates con temp orary hu- mans and c himpanzees. W e note that under mild conditions, general BDPs are reversible Marko v c hains (Rensha w, 2011). Therefore, assuming stationarity of the c himpanzee microsatellite length distributions, we stand justiﬁed in reversing the evolutionary pro cess from the ancestor to c him- panzee, so that for estimation purp oses we ma y regard h umans as direct descendants of mo dern c himpanzees (or vice-versa) o ver an ev olutionary time of 2 t . If C is the n umber of rep eats in a chim- panzee microsatellite and H is the num b er of rep eats in the corresp onding human microsatellite, then the likelihoo d of the observ ation Y = ( C, H ) is Pr( Y ) = ∞ X k =0 π k P k,C ( t ) P k,H ( t ) = π C ∞ X k =0 P C,k ( t ) P k,H ( t ) = π C P C,H (2 t ) , (36) where π k is the equilibrium probabilit y of the microsatellite ha ving k repeats. The second line follo ws b y rev ersibility and the third b y the Chapman-Kolmogoro v equalit y . Therefore, the log- lik eliho od of the observ ation Y is now log π C + ` ( θ ; Y ). Figure 3 sho ws a schematic represen tation of this reversibilit y argument. [Figure 3 ab out here.] 20 3.3.2 BDP rates and equilibrium distribution The observ ed data for microsatellite i are Y i = ( X i (0) , X i (1)), where X i (0) is the n umber of rep eats observ ed in chimpanzees, X i (1) is the num ber of rep eats observ ed in humans, and the evolution- ary time separating humans and c himpanzees is scaled to unity . In addition to the evolutionary relationship explained ab o v e, there are other complications: in the W ebster et al (2002) dataset, it is evident that microsatellites with small n umbers of repeats are not detected. Rose and F alush (1998) argue that there is a minimum num b er of rep eats necessary for microsatellite m utation via p olymerase slippage. Sainudiin et al (2004) interpret this ﬁnding as justiﬁcation for truncating the state-space of BDP at x min , so that X ( τ ) ≥ x min . T o av oid questions of ascertainmen t bias (see e.g. V o wles and Amos (2006)), and to mak e our results comparable to those of past researchers, w e deﬁne a microsatellite to b e a collection of more than x min rep eated motifs, where x min is 9 for rep eats of size 1, 5 for rep eats of size 3 and 4, and 2 for rep eats of size 5. Researc hers hav e also observ ed that microsatellites do not tend to gro w indeﬁnitely (Kruglyak et al, 1998). The maxim um num b er of rep eats in the W ebster et al dataset is 47. This suggests a ﬁnite nonzero equilibrium distribution of microsatellite lengths. T o achiev e such an equilibrium distribution, we preliminarily view the ev olution as a linear BDP with immigration on a state-space that is truncated b elow x min . It is reasonable to assume that rates of addition and deletion dep end linearly on ho w many rep eats are already present. Then for a microsatellite that curren tly has k rep eats, the birth and death rates are λ k =        k λ + λ k ≥ x min 0 k < x min and µ k =        k µ k > x min 0 k ≤ x min . (37) This gives a geometric equilibrium distribution for the num b er of rep eats: π k =         1 − λ µ   λ µ  k − x min − 1 k ≥ x min 0 k < x min , (38) when λ < µ (Renshaw, 2011). W e c ho ose this simple mo del so that the BDP has a simple closed- form nonzero equilibrium solution that is easy to incorp orate into the log-likelihoo d. Note that the 21 constrain t λ < µ do es not mean that the rate of microsatellite rep eat addition is alw a ys less than the rate of deletion, since it is p ossible that λ k > µ k for small k . Additionally , λ < µ do es not mean that the num ber of rep eats in a microsatellite tends to zero o v er long ev olutionary times — the equilibrium distribution (38) assigns p ositiv e probability to all rep eat n umbers greater than or equal to x min . 3.3.3 Lik eliho o d and surrogate function No w we augment the log-lik eliho od with the log-equilibrium probability of observing X i (0) ch im- panzee rep eats F ( θ ) = N X i =1 log π X i (0) + ` ( θ ; Y i ) , (39) where ` ( θ ; Y i ) is equiv alent to (10). Including the inﬂuence of the equilibrium distribution is similar to imp osing a prior distribution on λ and µ . T o ensure the existence of the equilibrium distribution (38), w e must also incorp orate the constraint λ < µ . T o achiev e maximization of the augmented log-lik eliho od (39) under this constraint, we imp ose a barrier term of the form γ log( µ − λ ). By iterativ ely maximizing and sending the barrier p enalty γ → 0, w e can achiev e maximization under the inequalit y constraint. More formally , if we let H ( θ ) = N X i =1  log π X i (0) + ` ( θ ; Y i )  + γ log ( µ − λ ) , (40) then argmax θ H ( θ ) → argmax θ F ( θ ) (41) under the constraint λ < µ as γ → 0. T o incorp orate and ev aluate the inﬂuence of motif size and comp osition heterogeneit y , we no w treat λ and µ in the i th observ ation as functions of the co v ariate v ector z i in a general BDP . Supp ose microsatellite i has motif size r i . W e co de the vectors z i as follo ws: z i =                (1 , 0 , 0 , p a , p c , p t ) t r i = 1 (1 , 1 , 0 , p a , p c , p t ) t r i = 2 (1 , 0 , 1 , p a , p c , p t ) t r i ≥ 3 (42) 22 where p x is the proportion of x n ucleotides per rep eat. W e deﬁne a single parameter α that controls the diﬀerence b et w een λ and µ . Then in the i th microsatellite, the complete mo del b ecomes log( λ k,i ) = log( k + 1) + α + z t i θ and log ( µ k,i ) = log( k ) + z t i θ . (43) Therefore ( α, θ ) t is the 7 × 1 v ector of unkno wn parameters. Putting all this together, the surrogate function b ecomes Q  θ | θ ( m )  ∝ N X i =1 X i (0) α + log (1 − e α ) + " ∞ X k =0 E ( U k | Y i )( α + z t i θ ) + E ( D k | Y i ) z t i θ − E ( T k | Y i )  ( k + 1) e α + z t i θ + k e z t i θ  #! + γ log ( − α ) , (44) where α < 0 since λ < µ , and we send the p enalt y γ → 0 as the algorithm con verges. W e use a gradien t EM algorithm to ﬁnd the MLE of ( α, θ ). T able 4 rep orts the parameter estimates, along with asymptotic standard errors. F rom these results, w e infer that motifs of diﬀeren t sizes and comp osition ha v e diﬀeren t c haracteristics under our evolutionary mo del. Sp eciﬁcally , λ and µ are greatest for din ucleotide rep eats, as compared to motifs with one or at least three rep eats. Motifs consisting mostly of A and T nucleotides also give rise to higher λ and µ . T able 1 shows the estimated λ and µ for each unique motif pattern in the dataset. These conclusions are largely consisten t with the descriptiv e results obtained by W ebster et al (2002). Our analysis also provides a natural probabilistic justiﬁcation for the existence of a ﬁnite nonzero equilibrium distribution of microsatellite rep eat n umbers and a formal statistical framew ork for deducing the eﬀec t of motif size and rep eat n um b er on m utation rates. [T able 3 ab out here.] 4 Discussion Application of sto c hastic mo dels in statistics requires a ﬂexible and general approach to parameter estimation, without whic h ev en the most realistic mo del b ecomes unapp ealing to researc hers who wish to learn from the data they hav e collected. Estimation for contin uously observ ed BDPs is 23 Motif λ µ Motif λ µ Motif λ µ Motif λ µ Motif λ µ A 0.3605 0.3969 AGGA 0.025 0.0276 C C AT 0.0051 0.0056 GGC G 0.0133 0.0147 T C T T 0.0085 0.0094 AAAAC 0.0128 0.0141 AGGG 0.0266 0.0293 C C AT C 0.004 0.0044 GGC GG 0.0155 0.0171 T C T T T 0.0096 0.0106 AAAAG 0.0233 0.0257 AGGGA 0.0256 0.0282 CC T 0.0031 0.0035 GGGA 0.0266 0.0293 T G 0.6094 0.6708 AAAAT 0.0207 0.0228 AGT C 0.0108 0.0119 C C T C 0.0026 0.0028 GGGAA 0.0256 0.0282 T GA 0.0214 0.0235 AAAC 0.0111 0.0123 AGT G 0.0229 0.0252 C C T C C 0.0023 0.0025 GGGGA 0.0269 0.0296 T GAA 0.0216 0.0237 AAAC A 0.0128 0.0141 AT 0.5407 0.5952 C C T G 0.0054 0.006 GGT 0.0231 0.0255 T GAGT 0.0212 0.0233 AAAC C 0.0074 0.0081 AT A 0.0197 0.0217 C C T T 0.0047 0.0052 GGT A 0.0229 0.0252 T GAT 0.0197 0.0217 AAAG 0.0236 0.026 AT AA 0.0203 0.0224 C C T T T 0.006 0.0066 GGT G 0.0243 0.0268 T GC 0.0085 0.0094 AAAGA 0.0233 0.0257 AT AAA 0.0207 0.0228 C G 0.1835 0.202 GGT GT 0.0222 0.0245 T GC C 0.0054 0.006 AAAGG 0.0244 0.0269 AT AC 0.0102 0.0112 C GC 0.0038 0.0042 GT 0.6094 0.6708 T GG 0.0231 0.0255 AAAT 0.0203 0.0224 AT AG 0.0216 0.0237 C GG 0.0104 0.0114 GT AC A 0.0125 0.0138 T GGA 0.0229 0.0252 AAAT A 0.0207 0.0228 AT AT G 0.0202 0.0222 C T 0.1362 0.1499 GT AT 0.0197 0.0217 T GGG 0.0243 0.0268 AAAT G 0.0217 0.0239 AT C 0.0079 0.0087 C T C 0.0031 0.0035 GT G 0.0231 0.0255 T GT 0.019 0.0209 AAAT T 0.0193 0.0212 AT C T 0.0093 0.0103 C T C C 0.0026 0.0028 GT GAG 0.0239 0.0263 T GT A 0.0197 0.0217 AAC 0.0089 0.0098 AT G 0.0214 0.0235 C T C C T 0.0037 0.0041 GT GG 0.0243 0.0268 T GT C 0.0099 0.0109 AAC A 0.0111 0.0123 AT GA 0.0216 0.0237 C T G 0.0085 0.0094 GT T 0.019 0.0209 T GT T 0.018 0.0199 AAC AA 0.0128 0.0141 AT GAC 0.0125 0.0138 C T GGG 0.0138 0.0151 GT T G 0.0209 0.0231 T GT T T 0.0175 0.0193 AAC C 0.0056 0.0062 AT GAT 0.0202 0.0222 C T T 0.007 0.0077 GT T T 0.018 0.0199 T T A 0.0175 0.0193 AAC T 0.0102 0.0112 AT T 0.0175 0.0193 C T T C 0.0047 0.0052 GT T T A 0.0188 0.0207 T T AA 0.0186 0.0205 AAG 0.0241 0.0265 AT T A 0.0186 0.0205 C T T T 0.0085 0.0094 GT T T G 0.0197 0.0217 T T AAT 0.0179 0.0197 AAGA 0.0236 0.026 AT T C 0.0093 0.0103 C T T T C 0.006 0.0066 GT T T T 0.0175 0.0193 T T AG 0.0197 0.0217 AAGC 0.0118 0.013 AT T G 0.0197 0.0217 C T T T T 0.0096 0.0106 T 0.2524 0.2778 T T AT 0.017 0.0187 AAGG 0.025 0.0276 AT T T 0.017 0.0187 G 0.4579 0.5041 T A 0.5407 0.5952 T T AT T 0.0167 0.0184 AAGGG 0.0256 0.0282 AT T T A 0.0179 0.0197 GA 0.7283 0.8017 T AA 0.0197 0.0217 T T C 0.007 0.0077 AAGT 0.0216 0.0237 AT T T C 0.0103 0.0114 GAA 0.0241 0.0265 T AAA 0.0203 0.0224 T T C A 0.0093 0.0103 AAGT G 0.0228 0.0251 AT T T G 0.0188 0.0207 GAAA 0.0236 0.026 T AAAA 0.0207 0.0228 T T C C 0.0047 0.0052 AAT 0.0197 0.0217 AT T T T 0.0167 0.0184 GAAAA 0.0233 0.0257 T AAAT 0.0193 0.0212 T T C T 0.0085 0.0094 AAT A 0.0203 0.0224 C 0.0229 0.0252 GAAAG 0.0244 0.0269 T AAT 0.0186 0.0205 T T C T C 0.006 0.0066 AAT AA 0.0207 0.0228 C A 0.1628 0.1792 GAAG 0.025 0.0276 T AAT G 0.0202 0.0222 T T C T G 0.0108 0.0119 AAT AG 0.0217 0.0239 C AA 0.0089 0.0098 GAAGG 0.0256 0.0282 T AAT T 0.0179 0.0197 T T C T T 0.0096 0.0106 AAT G 0.0216 0.0237 C AAA 0.0111 0.0123 GAAT 0.0216 0.0237 T AC 0.0079 0.0087 T T G 0.019 0.0209 AAT T 0.0186 0.0205 C AAAA 0.0128 0.0141 GAAT T 0.0202 0.0222 T AC T A 0.0111 0.0122 T T GAA 0.0202 0.0222 AC 0.1628 0.1792 C AAAC 0.0074 0.0081 GAC AG 0.0141 0.0155 T AGA 0.0216 0.0237 T T GT 0.018 0.0199 AC A 0.0089 0.0098 C AAAG 0.0134 0.0148 GAG 0.0261 0.0287 T AT 0.0175 0.0193 T T GT T 0.0175 0.0193 AC AA 0.0111 0.0123 C AC 0.0035 0.0039 GAGAA 0.0244 0.0269 T AT C 0.0093 0.0103 T T T A 0.017 0.0187 AC AAA 0.0128 0.0141 C AC AC 0.0042 0.0047 GAGG 0.0266 0.0293 T AT G 0.0197 0.0217 T T T AA 0.0179 0.0197 AC AG 0.0118 0.013 C AC C 0.0028 0.0031 GAT 0.0214 0.0235 T AT T 0.017 0.0187 T T T AG 0.0188 0.0207 AC C 0.0035 0.0039 C AC C A 0.0042 0.0047 GAT A 0.0216 0.0237 T AT T T 0.0167 0.0184 T T T AT 0.0167 0.0184 AC C A 0.0056 0.0062 C AG 0.0096 0.0106 GAT T 0.0197 0.0217 T C 0.1362 0.1499 T T T C 0.0085 0.0094 AG 0.7283 0.8017 C AGA 0.0118 0.013 GC AC A 0.0077 0.0085 T C AAA 0.0119 0.0131 T T T C C 0.006 0.0066 AGAA 0.0236 0.026 C AGAG 0.0141 0.0155 GC C GC 0.0047 0.0051 T C AC 0.0051 0.0056 T T T C T 0.0096 0.0106 AGAAA 0.0233 0.0257 C AGG 0.0126 0.0138 GC T 0.0085 0.0094 T C AT 0.0093 0.0103 T T T G 0.018 0.0199 AGAC 0.0118 0.013 C AT 0.0079 0.0087 GC T GT 0.0122 0.0134 T C AT T 0.0103 0.0114 T T T GG 0.0197 0.0217 AGAGG 0.0256 0.0282 C AT A 0.0102 0.0112 GGA 0.0261 0.0287 T C C 0.0031 0.0035 T T T GT 0.0175 0.0193 AGAT 0.0216 0.0237 C AT C 0.0051 0.0056 GGAA 0.025 0.0276 T C C A 0.0051 0.0056 T T T T A 0.0167 0.0184 AGC AA 0.0134 0.0148 C AT G 0.0108 0.0119 GGAAG 0.0256 0.0282 T C C C 0.0026 0.0028 T T T T C 0.0096 0.0106 AGC C 0.0059 0.0065 C AT T 0.0093 0.0103 GGAG 0.0266 0.0293 T C C T 0.0047 0.0052 T T T T G 0.0175 0.0193 AGC T C 0.0072 0.0079 CC A 0.0035 0.0039 GGAGG 0.0269 0.0296 T C T 0.007 0.0077 AGG 0.0261 0.0287 C C AAC 0.0042 0.0047 GGC C A 0.0081 0.0089 T C T G 0.0099 0.0109 T able 1: Estimates of birth and death rates for each unique motif in the human-c himpanzee dataset of W ebster et al (2002). Under our mo del of microsatellite m utation, AT rep eats ha v e the highest asso ciated rates and C T C C rep eats ha v e the low est. 24 straigh tforw ard and well-established. F or partially observed BDPs, our approac h is unique b ecause it requires only t w o simple ingredients: the functional form of the birth and death rates λ k ( θ ) and µ k ( θ ) for all k , and an exact or approximate M-step. A third ingredient is optional: the Hessian of the surrogate function is useful when asymptotic standard errors are desired. How ever, this matrix can often b e approximated numerically up on con v ergence of the EM algorithm, since the observ ed-data likelihoo d is av ailable numerically via (35). With these ingredients in hand, ev en elusiv e general BDPs b ecome tractable. In previous w ork on estimation for BDPs, completion of the E-step typically relies on time- domain numerical integration or simulation of BDP tra jectories. As w e sho w in T able 2, b oth rejection sampling and endp oint-conditioned simulation can o ccasionally p erform satisfactorily , es- p ecially in comparison to time-domain con v olution. How ev er, endp oin t-conditioning is designed for ﬁnite state-space Marko v chains, and it relies on a matrix eigendecomp osition to calculate tran- sition probabilities. As we show for the SIS mo del, this matrix b ecomes nearly singular, causing the sim ulation algorithm to fail, ev en when w e choose parameter v alues that are not biologically unreasonable. The Laplace conv olution in the E-step of our algorithm is more generic with equiv- alen t or b etter p erformance. F or this reason, a v ariation on our Laplace conv olution metho d for computing the E-step ma y oﬀer further use in estimation for non-BDP ﬁnite Marko v chains as w ell, suc h as n ucleotide or co don substitution mo dels. F or some linear BDPs, the a v ailability of a generating function furnishes analytic E- and M-steps yielding v ery fast parameter up dates in closed-form (Doss et al, 2010). F or some mo dels, these to ols pro vide the asymptotic v ariance of the MLE in closed-form. How ever, for the ma jorit y of BDPs, we must return to the Laplace con volution metho d outlined in this pap er. If one cannot ﬁnd analytic parameter up dates in the M-step, several options remain av ailable. With a minorizing function as in section 2.4.2, an EM-MM algorithm is viable. F urther, one or more n umerical Newton steps oﬀers an alternative, as in sections 2.4.3 and 2.4.5. One may employ other gradien t-based metho ds as w ell. Although the MM up date derived for the BDP with immigration (section 2.4.2) is app ealing in its simplicity , multiple minorizations of the lik eliho od can result in v ery slow conv ergence, since the surrogate function lies far from the true lik eliho od for most v alues of θ . In addition, Newton steps that require matrix inv ersion ma y suﬀer since the Hessian of the surrogate can b ecome ill-conditioned. 25 Ev en with the substantial sp eedup oﬀered b y our Laplace con v olution metho d for p erforming the E-step and quasi-Newton acceleration of the EM iterates, our algorithms can mo v e slo wly tow ard the MLE. Here, na ¨ ıve numerical optimization of the incomplete data lik eliho o d can sometimes run computationally faster. Ho w ev er, suc h techniques p erform v ery p o orly when the num b er of parameters increases and they often require sp eciﬁcation of tuning constants in order to reach the global optimum. F or BDP estimation problems, EM algorithms oﬀer several other adv antages ov er na ¨ ıv e numerical optimization, and these b eneﬁts are esp ecially stark when the M-step is a v ailable in closed-form. First, when the log-likelihoo d is lo cally conv ex, the EM algorithm is robust with resp ect to the initial parameter v alues near the maxim um, and EM algorithms generally do not need tuning parameters. F urther, the ascent prop erty ensures the iterates will approach a maxim um. P erhaps the most imp ortan t reason to consider EM algorithms is that they can accommo date high- dimensional parameter spaces without substan tially increasing the computational complexit y of the algorithm. This is esp ecially useful in mo dels with man y unknown parameters when p erforming regression with cov ariates (section 2.4.5), or our microsatellite example. W e also note the p oten tial for substan tial computational sp eedup b y parallelizing the E-step. When discrete observ ations from a BDP are indep enden t, the E-step ma y b e p erformed in parallel for every observ ation. F or example, E ( U | Y i ) can be computed simultaneously for i = 1 , . . . , N . When speed is an issue, graphics pro cessing units may prov e useful in reducing the computational cost of EM algorithms (Zhou et al, 2010). With regard to our example, we presen t a nov el wa y of studying the evolution of microsatellite rep eats using a generalized linear mo del. Previous eﬀorts often ignore the evolutionary relation- ship b etw een organisms, use incomplete or equilibrium mo dels of rep eat num b ers, or ﬁt separate mo dels to motifs of diﬀerent sizes. W e treat motif size as a categorical v ariable and incorp orate motif nucleotide comp osition, allowing us to ﬁt a single mo del to all the microsatellite observ ations sim ultaneously . Though our rate sp eciﬁcation (37) and resulting equilibrium distribution (38) are in tended to be somewhat simplistic, more sophisticated models that are informed b y biological considerations ma y b e fruitful. The only requirement in our setup is that the gradient and Hessian of λ k , µ k , and π k b e a v ailable for an y rep eat num b er k . Although our microsatellite example is limited in scop e, it is easy to imagine a more comprehensiv e study . F or example, incorp orating more sophisticated motif n ucleotide comp osition co v ariates and lo cation of the microsatellite on the 26 c hromosome might provide additional insigh t into the ev olutionary pro cess. Our EM framework is nearly ideal for these types of studies, since the num b er of unkno wn parameters do es not substan- tially increase the computational burden of the M-step, and the E-step is completely unaﬀected by the nu mber of parameters. In terestingly , we attempted to use the generic nonlinear regression R function nlm to v alidate the MLEs obtained b y our EM algorithm for the microsatellite ev olution problem, starting at a v ariety of initial v alues, including the MLE found b y our EM algorithm. This na ¨ ıve optimizer failed to conv erge in ev ery case. W e sp eculate that this is b ecause the small numerical errors in the lik eliho od ev aluation ha v e similar order of magnitude as the curv ature of the likelihoo d function near the maximum. Our EM aglorithms take adv an tage of analytic deriv atives of the surrogate function instead of the likelihoo d, and hence are less susceptible to small errors in the n umerical gradien t. 5 Conclusion Previous w ork on parameter estimation in BDPs almost exclusiv ely conﬁnes itself to inference of birth and death rates under the simple linear mo del. T o rectify this situation, w e presen t a ﬂexible and robust framew ork for deriving EM algorithms to estimate parameters in an y general BDP , using discrete observ ations. W e hop e that this con tribution encourages developmen t of more sophisticated and realistic birth-death mo dels in applied work, since researc hers can now e stimate parameters using more complicated rate structures, even when the data are observed at discrete times. Soft ware A soft ware implementation of the EM algorithms for general BDPs used in this pap er is currently a v ailable from FW C (b y request through the Editor to maintain reviewer anonymit y) and will b e dep osited in CRAN (2011) b efore publication. 27 Ac knowledgemen ts W e are grateful to Kenneth Lange, Hua Zhou, and Gabriela Cybis for helpful comments. This w ork was supp orted by NIH grants R01 GM086887, HG006139, T32GM008185, and NSF grant DMS-0856099. References Abate J, Whitt W (1992) Numerical in v ersion of probability generating functions. Op er Res Lett 12:245–251 Abate J, Whitt W (1995) Numerical inv ersion of Laplace transforms of probabilit y distributions. ORS J Comput 7(1):36–43 Abate J, Whitt W (1999) Computing Laplace transforms for numerical inv ersion via con tin ued fractions. INF ORMS J Comput 11(4):394–405 Amos W (2010) Mutation biases and mutation rate v ariation around very short h uman microsatel- lites rev ealed b y h uman-chimpanzee-orangutan genomic sequence alignments. J Mol Ev ol 71:192– 201 Andersson H, Britton T (2000) Sto chastic Epidemic Mo dels and their Statistical Analysis. Lecture notes in statistics, Springer New Y ork Anscom b e FJ (1953) Sequen tial estimation. J Roy Stat So c B 15(1):1–29 Bailey NTJ (1964) The Elemen ts of Sto c hastic Processes with Applications to the Natural Sciences. Wiley New Y ork Bankier JD, Leighton W (1942) Numerical con tin ued fractions. Am J Math 64(1):653–668 Bharga v a A, F uen tes F (2010) Mutational dynamics of microsatellites. Mol Biotec hnol 44:250–266 Bladt M, Sorensen M (2005) Statistical inference for discretely observed Marko v jump pro cesses. J Ro y Stat So c B Met 67(3):395–410 Blanc h G (1964) Numerical ev aluation of contin ued fractions. SIAM Rev 6(4):383–421 28 Calabrese P , Durrett R (2003) Dinucleotide rep eats in the drosophila and h uman genomes hav e complex, length-dep enden t m utation pro cesses. Mol Biol Evol 20(5):715–725 Chakrab ort y R, Kimmel M, Stivers D, Davison L, Dek a R (1997) Relative m utation rates at di-, tri-, and tetranucleotide microsatellite lo ci. P Natl Acad Sci USA 94(3):1041–1046 Cotton JA, Page RDM (2005) Rates and patterns of gene duplication and loss in the human genome. Pro c R Soc B 272:277–283 CRAN (2011) The comprehensive R arc hiv e netw ork. URL http://cran.r- project.org Cra viotto C, Jones WB, Thron WJ (1993) A survey of truncation error analysis for Pad´ e and con tin ued fraction approximan ts. Acta Appl Math 33:211–272 Cra wford FW, Suchard MA (2011) T ransition probabilities for general birth-death pro cesses with applications in ecology , genetics, and ev olution. J Math Biol Cuyt A, P etersen V, V erdonk B, W aadeland H, Jones W (2008) Handb ook of Contin ued F ractions for Sp ecial F unctions. Springer Berlin / Heidelb erg Darwin JH (1956) The b eha viour of an estimator for a simple birth and death pro cess. Biometrik a 43(1):23–31 Dauxois J (2004) Ba y esian inference for linear gro wth birth and death pro cesses. J Stat Plan Infer 121(1):1–19 Dempster AP , Laird NM, Rubin DB (1977) Maxim um likelihoo d from incomplete data via the EM algorithm. J Roy Stat So c B 39(1):1–38 Dem uth JP , Bie TD, Sta jich JE, Cristianini N, Hahn MW (2006) The evolution of mammalian gene families. PLoS ONE 1(1):e85 Doss CR, Suchard MA, Holmes I, Kato-Maeda M, Minin VN (2010) Great exp ectations: EM algorithms for discretely observ ed linear birth-death-immigration pro cesses. ArXiv e-prin ts 1009. 0893 Ec k ert KA, Hile SE (2009) Every microsatellite is diﬀeren t: In trinsic DNA features dictate muta- genesis of common microsatellites present in the human genome. Mol Carcinogen 48(4):379–388 29 Ellegren H (2004) Microsatellites: simple sequences with complex ev olution. Nat Rev Genet 5(6):435–445 F eller W (1971) An In tro duction to Probabilit y Theory and its Applications. Wiley series in prob- abilit y and mathematical statistics, Wiley New Y ork Hob olth A (2008) A Marko v chain Monte Carlo exp ectation maximization algorithm for statistical analysis of DNA sequence evolution with neigh b or-dep ending substitution rates. J Comput Graph Stat 17(1):1–25 Hob olth A, Jensen JL (2005) Statistical inference in evolutionary models of DNA sequences via the EM algorithm. Stat Appl Genet Mol 4(1):1–19 Hob olth A, Stone EA (2009) Sim ulation from endp oin t-conditioned, contin uous-time Mark ov chains on a ﬁnite state space, with applications to molecular evolution. Ann Appl Stat 3(3):1024–1231 Holmes I, Bruno WJ (2001) Ev olutionary HMMs: a Ba y esian approac h to m ultiple alignmen t. Bioinformatics 17(9):803–820 Holmes I, Rubin G (2002) An exp ectation maximization algorithm for training hidden substitution mo dels. J Mol Biol 317(5):753–764 Jamshidian M, Jennrich RI (1993) Conjugate gradient acceleration of the EM algorithm. J Am Stat Asso c 88(421):221–228 Kalbﬂeisc h JD, Lawless JF (1985) The analysis of panel data under a Marko v assumption. J Am Stat Asso c 80(392):863–871 Karlin S, McGregor J (1957) The diﬀeren tial equations of birth-and-death processes, and the Stielt- jes moment problem. T rans Am Math So c 85(2):589–646 Keiding N (1974) Estimation in the birth pro cess. Biometrik a 61(1) Keiding N (1975) Maximum likelihoo d estimation in the birth-and-death process. Ann Stat 3(2):363–372 Kelk ar YD, Ty ekuc hev a S, Chiaromonte F, Mako v a KD (2008) The genome-wide determinants of h uman and chimpanzee microsatellite ev olution. Genome Res 18(1):30–38 30 Kingman JF C (1982) On the genealogy of large p opulations. J Appl Probab 19:27–43 Krone SM, Neuhauser C (1997) Ancestral pro cesses with selection. Theor P opul Biol 51:210–237 Krugly ak S, Durrett R T, Sc hug MD, Aquadro CF (1998) Equilibrium distributions of microsatellite rep eat length resulting from a balance b et ween slippage even ts and p oint mutations. P Natl Acad Sci USA 95(18):10,774–10,778 Lange K (1995a) A gradien t algorithm locally equiv alen t to the EM algorithm. J Roy Stat So c B Met 57(2):425–437 Lange K (1995b) A quasi-Newton acceleration of the EM algorithm. Stat Sinica 5:1–18 Lange K (2010) Numerical Analysis for Statisticians (Statistics and Computing), 2nd edn. Springer New Y ork Len tz WJ (1976) Generating Bessel functions in Mie scattering calculations using contin ued frac- tions. Appl Opt 15(3):668–671 Liu H, Beck ett LA, DeNardo GL (2007) On the analysis of count data of birth-and-death process t yp e: with application to molecularly targeted cancer therap y . Statist Med 26:11141135 Loren tzen L, W aadeland H (1992) Contin ued F ractions with Applications. North-Holland, Amster- dam Louis T A (1982) Finding the observed information matrix when using the EM algorithm. J Roy Stat So c B Met 44(2):226–233 Meilijson I (1989) A fast improv emen t to the EM algorithm on its o wn terms. J Roy Stat So c B Met 51(1):127–138 Meng XL, Rubin DB (1991) Using EM to obtain asymptotic v ariance-cov ariance matrices: The SEM algorithm. J Am Stat Assoc 86(416):899–909 Metzner P , Dittmer E, Jahnk e T, Sch¨ utte C (2007) Generator estimation of Mark ov jump processes. J Comput Phys 227:353–375 Moran P AP (1951) Estimation metho ds for evolutiv e pro cesses. J Ro y Stat So c B Met 13(1):141–146 31 Moran P AP (1953) The estimation of the parameters of a birth and death pro cess. J Ro y Stat So c B Met 15(2):241–245 Moran P AP (1958) Random pro cesses in genetics. Math Pro c Cambridge 54(01):60–71 Murph y JA, O’Donoho e MR (1975) Some prop erties of contin ued fractions with applications in Mark o v pro cesses. IMA J Appl Math 16(1):57–71 Murra y J (2002) Mathematical Biology: An In tro duction, Interdisciplinary applied mathematics, v ol 1. Springer, New Y ork Nee S (2006) Birth-death mo dels in macro ev olution. Annu Rev Ecol Ev ol S 37:1–17 Nee S, Ma y RM, Harvey PH (1994) The reconstructed ev olutionary pro cess. Philos T Ro y So c B 344(1309):305–311 No v ozhilov AS, Karev GP , Ko onin EV (2006) Biological applications of the theory of birth-and- death pro cesses. Brief Bioinform 7(1):70–85 P arthasarath y PR, Sudhesh R (2006) Exact transien t solution of a state-dep enden t birth-death pro cess. J Appl Math Sto c h Anal 82(6):1–16 Press WH (2007) Numerical Recip es: the Art of Scientiﬁc Computing. Cambridge Univ ersit y Press New Y ork Rensha w E (2011) Stochastic P opulation Pro cesses: Analysis, Approximations, Sim ulations. Oxord Univ ersit y Press Reynolds JF (1973) On estimating the parameters of a birth-death pro cess. Aust J Stat 15(1):35–43 Ric hard GF, Kerrest A, Dujon B (2008) Comparative genomics and molecular dynamics of DNA rep eats in euk ary otes. Microbiol Mol Biol Rev 72(4):686–727 Rose O, F alush D (1998) A threshold size for microsatellite expansion. Mol Biol Evol 15(5):613–615 Rosen b erg NA, Tsolaki AG, T anak a MM (2003) Estimating change rates of genetic markers using serial samples: applications to the transp oson IS6110 in Mycobacterium tub erculosis. Theor P opul Biol 63(4):347–363 32 Sain udiin R, Durrett R T, Aquadro CF, Nielsen R (2004) Microsatellite m utation mo dels. Genetics 168(1):383–395 Sc hl¨ otterer C (2000) Ev olutionary dynamics of microsatellite DNA. Chromosoma 109:365–371 T an WY, Pian tadosi S (1991) On stochastic gro wth pro cesses with application to sto c hastic logistic gro wth. Stat Sinica 1:527–540 Thompson IJ, Barnett AR (1986) Coulomb and Bessel functions of complex arguments and order. J Comput Phys 64:490–509 Thorne J, Kishino H, F elsenstein J (1991) An evolutionary mo del for maxim um likelihoo d alignment of DNA sequences. J Mol Ev ol 33(2):114–124 V owles EJ, Amos W (2006) Quantifying ascertainment bias and sp ecies-speciﬁc length diﬀerences in h uman and chimpanzee microsatellites using genome sequences. Mol Biol Ev ol 23(3):598–607 W all HS (1948) Analytic Theory of Contin ued F ractions. Universit y Series in Higher Mathematics, D. V an Nostrand Company , Inc. New Y ork W anek LA, Goradia TM, Elashoﬀ RM, Morton DL (1993) Multi-stage Marko v analysis of progres- siv e disease applied to melanoma. Biom J 35(8):967–983 W ebster MT, Smith NGC, Ellegren H (2002) Microsatellite evolution inferred from human and c himpanzee genomic sequence alignments. P Natl Acad Sci USA 99(13):8748–8753 Whittak er JC, Harb ord RM, Boxall N, Mac k ay I, Da wson G, Sibly RM (2003) Likelihoo d-based estimation of microsatellite mutation rates. Genetics 164(2):781–787 W olﬀ R W (1965) Problems of statistical inference for birth and death queuing mo dels. Op er Res 13(3):343–357 Zhou H, Lange K, Suchard M (2010) Graphics pro cessing units and high-dimensional optimization. Stat Sci 25(3):311–324 33 τ X ( τ ) 0 t 0 1 2 3 4 5 6 7 T 1 = T 2 = T 3 = T 4 = T 5 = T 6 = U 1 = U 2 = U 3 = U 4 = U 5 = D 2 = D 4 = D 5 = Figure 1: A sample path from a birth-death pro cess (BDP) X ( τ ). The pro cess starts at state X (0) = 1 and is at state X ( t ) = 4 at time t . At right are sc hematic represen tations of the time sp en t in each state T k , the num ber of up steps U k , and the num b er of down steps D k . These quantities are the suﬃcient statistics for estimators of rate parameters in general birth-death pro cesses. 34 0.0 0.1 0.2 0.3 0.4 0.5 0.00 0.10 0.20 0.30 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + λ µ 0.0 0.1 0.2 0.3 0.4 0.5 0.00 0.10 0.20 0.30 + + + + + + + + + + + + + + + µ Figure 2: Eﬀect of quasi-Newton acceleration on iterates of the exp ectation-maximization (EM) algorithm for a simple linear BDP with birth rate λ and death rate µ . Contour lines sketc h the log-lik eliho od from N = 50 discrete samples. Iterates are shown with the “+” symbol. On the left, ordinary EM iterates conv erge very slo wly in the neigh b orhoo d of the maximum, for a total of 36 iterations. On the right, EM iterates using quasi-Newton acceleration mak e large jumps and con v erge rapidly in 15 iterations. 35 ? Ancestor ( AAC ) 2 Chimp T ( AAC ) 3 Human T ( AAC ) 2 Chimp ( AAC ) 3 Human 2 T Figure 3: Reversibilit y of the BDP implies that the ev olutionary relationship betw een con temp orary c himpanzees and the most recen t common ancestor can b e inv erted. On the left, the most recen t common ancestor of chimpanzees and humans liv ed at time T in the past. At a certain locus, c himpanzees hav e a microsatellite consisting of 2 rep eats of the motif AAC , and at an orthologous lo cus, humans hav e 3 rep eats of the motif. The num ber of rep eats in the ancestor is unknown. On the righ t, using a probabilistic justiﬁcation explained in the text, we ma y in terpret the ev olutionary relationship b et w een chimpanzees and humans as unidirectional, while “integrating out” the num b er of rep eats at the ancestral lo cus. 36 Endp oin t- Rejection conditioned Time- Laplace- Mo del Quan tit y sampling sim ulation conv olution con v olution Simple linear (2.4.1) E ( U | Y ) 1.449 0.741 19.606 0.084 λ = 0 . 5, µ = 0 . 3 E ( D | Y ) 1.375 0.743 21.224 0.086 E ( T particle | Y ) 1.432 0.636 16.488 0.087 Immigration (2.4.2) P k p k E ( U | Y ) 1.192 0.697 15.669 0.085 λ = 0 . 5, ν = 0 . 2 E ( D | Y ) 1.324 0.689 21.058 0.086 µ = 0 . 3 E ( T particle | Y ) 1.319 0.703 14.961 0.089 Logistic (2.4.3) E ( U | Y ) 50.810 0.162 21.907 0.102 λ = 0 . 5, α = 0 . 2 E ( D | Y ) 56.957 0.180 20.851 0.102 µ = 0 . 3 P k k 2 e − kα E ( T k | Y ) 50.764 0.168 21.623 0.107 SIS (2.4.4) E ( U | Y ) 7.880 * 5.295 0.059 β = 0 . 5, γ = 0 . 3 E ( D | Y ) 8.886 * 2.749 0.048 P k ( N − k ) k E ( T k | Y ) 8.456 * 4.269 0.053 T able 2: Compute times (seconds) to p erform v arious E-steps for four diﬀeren t BDP models. W e rep ort text section num b ers in whic h the mo dels are describ ed in parentheses. F or eac h E- step, we consider several metho ds. In all cases, the Laplace metho d takes substantially less time. The endp oin t-conditioned simulation metho d fails for the susceptible-infectious-susceptible (SIS) infectious disease mo del. 37 Mo del P arameter T rue Estimate SE Simple linear ( N = 500) λ 0.5 0.5039 0.0269 (2.4.1) µ 0.2 0.1981 0.0254 Immigration ( N = 800) λ 0.2 0.2182 0.0129 (2.4.2) ν 0.1 0.1016 0.0213 µ 0.25 0.2488 0.0231 Logistic ( N = 1500) λ 0.3 0.2917 0.0035 (2.4.3) α 0.5 0.4942 0.0397 µ 0.05 0.0456 0.0633 SIS ( N = 1000) β 0.1 0.1025 0.0048 (2.4.4) γ 2.0 2.1374 0.0367 GLM ( N = 1000) θ λ, 1 0.25 0.2585 0.0393 (2.4.5) θ λ, 2 0.1 0.1143 0.0402 θ µ, 1 0.2 0.1973 0.0457 θ µ, 2 0.05 0.0877 0.0457 T able 3: Poin t-estimates and their standard errors (SE) for simulated observ ations under v arious BDPs. W e rep ort the text section describing each of the mo dels in paren theses. The metho d for generating the rates in the generalized linear mo del (GLM) BDP is describ ed in the text. 38 P arameter Co v ariate Estimate SE θ 1 In tercept -1.3105 0.1236 θ 2 r i = 2 0.2854 0.0983 θ 3 r i ≥ 3 -1.5405 0.1079 θ 4 p a 0.2207 0.1725 θ 5 p c -0.3822 0.0577 θ 6 p t 0.0477 0.0002 α birth -0.0889 0.0039 T able 4: Maxim um likelihoo d estimates of parameters in the microsatellite mo del and their asymp- totic standard errors. The ﬁrst three elemen ts of θ corresp ond to the motif size r i , and the last three corresp ond to the motif n ucleotide composition. The parameter α con trols the diﬀerence b et ween the birth and death rates. The i th microsatellite birth rate is then λ = exp( α + z t i θ ) and the death rate is µ = exp( z t i θ ). Estimated birth and death rates are higher for dinucleotide rep eats than for monon ucleotide rep eats or microsatellites whose motifs ha v e 3, 4, or 5 n ucleotides. Mircrosatellites whose motif consists, for example, of A nucleotides hav e higher birth and death rates compared to G n ucleotides. 39

Estimation for general birth-death processes

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment