An EM algorithm for estimation in the Mixture Transition Distribution model


Authors: Sophie Lèbre, Pierre-Yves Bourguignon (Laboratoire Statistique et Génome)

Sophie Lèbre*, Pierre-Yves Bourguignon

Laboratoire Statistique et Génome, Université d'Évry Val d'Essonne/CNRS UMR 8071/INRA 1152, 523 place des Terrasses de l'Agora, 91000 Évry, France.

Abstract

The Mixture Transition Distribution (MTD) model was introduced by Raftery to face the need for parsimony in the modeling of high-order Markov chains in discrete time. The particularity of this model comes from the fact that the effect of each lag upon the present is considered separately and additively, so that the number of parameters required is drastically reduced. However, the estimation procedures proposed to date for the MTD parameters remain problematic on account of the large number of constraints on the parameters. In this paper, an iterative procedure, commonly known as the Expectation-Maximization (EM) algorithm, is developed to carry out Maximum Likelihood Estimation (MLE) of the MTD parameters. Applications of MTD modeling show that the proposed EM algorithm is easier to use than the algorithm developed by Berchtold. Moreover, the EM estimates of high-order MTD models fitted on DNA sequences outperform the corresponding fully parametrized Markov chains in terms of the Bayesian Information Criterion. A software implementation of our algorithm is available in the library seq++ at http://stat.genopole.cnrs.fr/seqpp.

* To whom correspondence should be addressed.
Keywords: Markov chain; mixture transition distribution (MTD); parsimony; maximum likelihood; EM algorithm

1 Introduction

While providing a useful framework for discrete-time sequence modeling, higher-order Markov chains suffer from the exponential growth of the parameter space dimension with respect to the order of the model, which results in inefficient parameter estimation when a limited amount of data is available. This fact motivates the development of approximate versions of higher-order Markov chains, such as the Mixture Transition Distribution (MTD) model [11, 3] and variable length Markov chains [4]. Thanks to a simple structure, where each lag contributes to the prediction of the current letter in a separate and additive way, the dimension of the MTD model parameter space grows only linearly with respect to the order of the model.

Nevertheless, Maximum Likelihood Estimation (MLE) in the MTD model is subject to such constraints that analytical solutions are beyond the reach of present methods. One thus has to resort to numerical optimization procedures. The most powerful method proposed to this day is due to Berchtold [2], and relies on an ad-hoc optimization method. In this paper, we propose to fit the MTD model into the general framework of hidden variable models, and derive a version of the classical EM algorithm for the estimation of its parameters.

In this first section, we define the MTD model and recall its main features and some of its variants. Parametrization of the model is discussed in section 2, where we establish that under the most general definition, it is not identifiable. Then we shed light on an identifiable set of parameters. Derivations of the update formulas involved in the EM algorithm are detailed in section 3.
We finally illustrate our method by some applications to biological sequence modeling.

Need for parsimony

Markov models are pertinent to analyze the $m$-letter word composition of a sequence of random variables [7, 6]. Nevertheless, the length $m$ of the words the model accounts for has to be chosen by the statistician. On the one hand, a high order is always preferred since it can capture strictly more information. On the other hand, since the parameter dimension increases exponentially fast with respect to the order of the model, higher order models cannot be accurately estimated. Thus, a trade-off has to be drawn to optimize the amount of information extracted from the data.

We illustrate this phenomenon by running a simple experiment: using a randomly chosen Markov chain transition matrix of order 5, we sample 1000 sequences of length 5000. Each of them is then used to estimate a Markov model transition matrix of order varying from 2 to 6. For each of these estimates, we have plotted the total variation distance with respect to the generating model (see Figure 1), computed as the quantity

$D_{TV}(P, Q) = \sum_{x \in \mathcal{Y}^n} |P(x) - Q(x)|$

for distributions $P$ and $Q$. It turns out that the optimal estimation in terms of total variation distance between genuine and estimated distributions is obtained with a model of order 2, whereas the generating model is of order 5.

Figure 1: Total variation distance between distributions estimated from randomly generated sequences and the generating distribution. The generating model is of order 5, and the random sequences are 5000 letters long.
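The total variation distance above can be computed by brute-force enumeration when the alphabet and the word length are small. The following is a minimal sketch (not the paper's seq++ code; function names are ours), for a first-order chain with 0-indexed states:

```python
from itertools import product

def word_prob(word, init, trans):
    """Probability of a word under a first-order Markov chain."""
    p = init[word[0]]
    for a, b in zip(word, word[1:]):
        p *= trans[a][b]
    return p

def tv_distance(init_p, trans_p, init_q, trans_q, n, q):
    """D_TV(P, Q) = sum over all length-n words x of |P(x) - Q(x)|."""
    return sum(
        abs(word_prob(w, init_p, trans_p) - word_prob(w, init_q, trans_q))
        for w in product(range(q), repeat=n)
    )
```

The enumeration over $\mathcal{Y}^n$ grows as $q^n$, so this is only practical for illustration; the paper's experiment (order 5, length 5000) requires a different evaluation strategy.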
Mixture Transition Distributions aim at providing a model accounting for the number of occurrences of $m$-letter words, while avoiding the exponential increase with respect to $m$ of the full Markov model parameter dimension (see Table 1 for a comparison of the models' dimensions).

MTD modeling

Let $Y = (Y_1, ..., Y_n)$ be a sequence of random variables taking values in the finite set $\mathcal{Y} = \{1, ..., q\}$. We use the notation $Y_{t_1}^{t_2} = (Y_{t_1}, Y_{t_1+1}, ..., Y_{t_2})$ to refer to the subsequence of the $t_2 - t_1 + 1$ successive variables. In the whole paper, vectors and matrices are denoted by bold letters.

Definition 1 The random sequence $Y$ is said to be an $m$-th order MTD sequence if, for all $t > m$ and all $y_1, ..., y_t \in \mathcal{Y}$,

$P(Y_t = y_t \mid Y_1^{t-1} = y_1^{t-1}) = \sum_{g=1}^{m} \varphi_g \, P(Y_t = y_t \mid Y_{t-g} = y_{t-g}) = \sum_{g=1}^{m} \varphi_g \, \pi_g(y_{t-g}, y_t),$  (1)

where the vector $\varphi = (\varphi_1, ..., \varphi_m)$ is subject to the constraints

$\forall g \in \{1, ..., m\}, \quad \varphi_g \ge 0,$  (2)

$\sum_{g=1}^{m} \varphi_g = 1,$  (3)

and the matrices $\{\pi_g = [P(Y_t = j \mid Y_{t-g} = i)]_{i,j \in \mathcal{Y}};\ 1 \le g \le m\}$ are $q \times q$ stochastic matrices.

An $m$-th order MTD model is thus defined by a vector parameter $\theta = \big(\varphi_g, (\pi_g(i,j))_{i,j \in \mathcal{Y}}\big)_{1 \le g \le m}$, which belongs to the space

$\Theta = \Big\{ \theta;\ \forall\, 1 \le g \le m,\ 0 \le \varphi_g \le 1;\ \sum_{g=1}^{m} \varphi_g = 1;\ \forall i, j \in \mathcal{Y},\ 0 \le \pi_g(i,j) \le 1 \text{ and } \sum_{j \in \mathcal{Y}} \pi_g(i,j) = 1 \Big\}.$

It is obvious from the first equality in equation (1) that the MTD model fulfills the Markov property. Thus, MTD models are Markov models with the particularity that each lag $Y_{t-1}, Y_{t-2}, ...$ contributes additively to the distribution of the random variable $Y_t$. Berchtold and Raftery [3] published a complete review of the MTD model. They recall theoretical results on the limiting behavior of the model and on its autocorrelation structure.
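Equation (1) translates directly into code. The following is a small sketch under our own conventions (0-indexed states; index `g` in the lists stands for lag $g+1$):

```python
def mtd1_prob(past, y, phi, pi):
    """P(Y_t = y | past), per equation (1): `past` holds at least the m
    previous letters with the most recent last, and pi[g] is the q x q
    stochastic matrix attached to lag g+1."""
    m = len(phi)
    return sum(phi[g] * pi[g][past[-(g + 1)]][y] for g in range(m))
```

Since each $\pi_g$ row and the vector $\varphi$ sum to one, the mixture automatically defines a probability distribution over the next letter.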
Details are given about several extensions of this model, such as infinite-lag models, or countably infinite and continuous state spaces. We have to point out that Raftery [11] defined the original model with the same transition matrix $\pi$ for each lag $\{Y_{t-g}\}_{g=1,...,m}$. In the sequel, we refer to this model as the single matrix MTD model. Later, Berchtold [1] introduced a more general definition of the MTD models as a mixture of transitions from different subsets of lagged variables $\{Y_{t-m}, ..., Y_{t-1}\}$ to the present one $Y_t$, possibly discarding some of the dependencies. In this paper, we focus on a slightly more restricted model having a specific but same-order transition matrix $\pi_g$ for each lag $Y_{t-g}$. We denote by MTD$_l$ the MTD model which has an $l$-th order transition matrix for each lag (Definition 2). From now on, the MTD model defined by (1) is accordingly denoted MTD$_1$.

Definition 2 The random sequence $Y$ is an $m$-th order MTD$_l$ sequence if, for all $l, m \in \mathbb{N}$ such that $l < m$, and all $y_1^t \in \mathcal{Y}^t$:

$P(Y_t = y_t \mid Y_1^{t-1} = y_1^{t-1}) = P(Y_t = y_t \mid Y_{t-m}^{t-1} = y_{t-m}^{t-1}) = \sum_{g=1}^{m-l+1} \varphi_g \, P(Y_t = y_t \mid Y_{t-g-l+1}^{t-g} = y_{t-g-l+1}^{t-g}) = \sum_{g=1}^{m-l+1} \varphi_g \, \pi_g(y_{t-g-l+1}^{t-g}, y_t),$

where $\pi_g$ is a $q^l \times q$ transition matrix.

Trade-off between dimension and maximal likelihood

Even though MTD models involve a restricted number of parameters compared to Markov chains, increasing the order $l$ of the model may decrease the efficiency of the MLE. The quality of the trade-off between goodness-of-fit and generalization error a model achieves can be assessed against classical model selection criteria, such as the Bayesian Information Criterion (see illustrations in section 4.2). However, computing the BIC requires knowledge of the dimension of the model.
This dimension is usually computed as the dimension of the parameter space for a bijective parametrization. In the specific case of the MTD models, the original single-matrix model is parametrized in a bijective way, whereas its generalized version with specific transition matrices for each lag is over-parametrized: appendix A gives an example of two distinct values of the parameters $(\varphi, \pi)$ which both define the same MTD$_1$ distribution. The dimension of the model is thus lower than the dimension of the parameter space, and computing the BIC using the parameter space dimension would over-penalize the models. A tighter upper bound on the dimension of the MTD$_l$ model is derived in section 2, a bound which is used later to compute the BIC.

The question of estimation

As a counterpart for their parsimony, MTD parameters are difficult to estimate due to the constraints that the transition probabilities $\{P(i_m ... i_1; i_0);\ i_m, ..., i_0 \in \mathcal{Y}\}$ have to comply with. There is indeed no analytical solution to the maximization of the log-likelihood $L_y(\theta) = \log P_\theta(Y = y)$ of the MTD models under the constraints the vector $\varphi$ and the stochastic matrices $\pi_g$ have to fulfill. For a given sequence $y = y_1, ..., y_n$ of length $n$, we recall that the log-likelihood of the sequence $y$ under the MTD$_1$ model writes

$L_y(\theta) = \log P_\theta(Y_1^n = y_1^n) = \log \Big\{ P(Y_1^m = y_1^m) \prod_{t=m+1}^{n} \sum_{g=1}^{m} \varphi_g \, \pi_g(y_{t-g}, y_t) \Big\}.$

The estimation of the original single matrix MTD model has already aroused a lot of interest. Although any distribution from this model is defined by a unique parameter $\theta$, the maximum likelihood cannot be analytically determined. Li and Kwok [10] propose an interesting alternative to maximum likelihood with a minimum chi-square method.
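The log-likelihood above can be sketched in a few lines (our own naming; the probability of the first $m$ letters is passed in, since the MTD model leaves it unspecified):

```python
import math

def mtd1_loglik(y, phi, pi, init_prob=1.0):
    """Log-likelihood of the sequence y under an MTD_1 model, conditioning
    on the first m letters (their joint probability is init_prob)."""
    m = len(phi)
    ll = math.log(init_prob)
    for t in range(m, len(y)):
        # lag g+1 looks back at y[t - 1 - g], matching equation (1)
        ll += math.log(sum(phi[g] * pi[g][y[t - 1 - g]][y[t]] for g in range(m)))
    return ll
```

This is the objective that both Berchtold's algorithm and the EM algorithm of section 3 seek to maximize.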
Nevertheless, they carry out estimations using a non-linear optimization algorithm that is not explicitly described. Raftery and Tavaré [12] obtain approximations of both maximum likelihood and minimum chi-square estimates with numerical procedures from the NAG library, which is not freely available. They also show that the MTD model can be estimated using GLIM (Generalized Linear Interactive Modeling) in the specific case where the state space size $q$ equals 2. Finally, Berchtold [2] developed an ad hoc iterative method implementing a constrained gradient descent optimization. This algorithm is based on the assumption that the vector $\varphi$ and each row of the matrix $\pi$ are independent. It consists in successively updating each of these vectors, constrained to have components summing to 1, as follows.

Berchtold's Algorithm
• Compute partial derivatives of the log-likelihood with respect to each element of the vector,
• choose a value $\delta$ in $[0, 1]$,
• add $\delta$ to the component with the largest derivative, and subtract $\delta$ from the one with the smallest derivative.

This algorithm has been shown to perform at least better than the previous methods, and it can be extended to the case of the MTD$_l$ models. In this latter case, it estimates one of the parameter vectors $\{(\varphi_g, \pi_g);\ 1 \le g \le m\}$ describing the maximum-likelihood MTD distribution. Nevertheless, the choice of the alteration parameter $\delta$ remains an issue of the method. An in-depth discussion of the strategy used to update the alteration parameter $\delta$ can be found in [2].

We propose to approximate the maximum likelihood estimate of the MTD model $\big\{\hat{P}_{ML}(i_m ... i_1; i_0);\ i_m, ..., i_0 \in \mathcal{Y}\big\}$ by coming down to a better known problem: estimation from incomplete data with an Expectation-Maximization (EM) algorithm [5].
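The reallocation step in the box above can be sketched as follows. This is our own illustrative reading of the three bullets, not Berchtold's actual code: the derivative computation and the $\delta$-update schedule (the delicate parts discussed in [2]) are left abstract, and the clipping that keeps probabilities non-negative is our addition:

```python
def berchtold_step(v, grad, delta):
    """One reallocation step: move mass delta from the component of v with
    the smallest partial derivative to the one with the largest, keeping v
    on the probability simplex (components >= 0, summing to 1)."""
    v = list(v)
    hi = max(range(len(v)), key=lambda k: grad[k])
    lo = min(range(len(v)), key=lambda k: grad[k])
    move = min(delta, v[lo])   # our safeguard: never make a probability negative
    v[hi] += move
    v[lo] -= move
    return v
```

The step is applied in turn to $\varphi$ and to each row of the matrices $\pi_g$, which is precisely the bookkeeping the EM algorithm of section 3 avoids.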
We introduce a simple estimation method which allows to approximate one parameter vector $\theta = \{(\varphi_g, \pi_g);\ 1 \le g \le m\}$ maximizing the log-likelihood.

2 Upper bound on the MTD model dimension

The MTD$_1$ model is over-parametrized. We provide an example of two distinct parameter values $(\varphi, \pi)$ defining the same 2nd-order MTD$_1$ model in appendix A. Moreover, we propose a new parameter set whose dimension is lower. It stems from the straightforward remark that the $m$-th order MTD$_1$ model satisfies the following proposition:

Proposition 1 Transition probabilities of an $m$-th order MTD$_1$ model satisfy: $\forall\, i_m, ..., i_g, ..., i_0, i'_g \in \mathcal{Y}$,

$P(i_m ... i_g ... i_1; i_0) - P(i_m ... i'_g ... i_1; i_0) = \varphi_g \big[ \pi_g(i_g, i_0) - \pi_g(i'_g, i_0) \big].$  (4)

This simply means that the left-hand side of equation (4) only depends on the parameter components associated with lag $g$. Consider a given distribution from MTD$_1$ with parameter $(\varphi_g, \pi_g)_{1 \le g \le m}$, and let $u$ be an arbitrary element of $\mathcal{Y}$. Each transition probability $P(i_m ... i_1; i_0)$ may be written:

$P(i_m ... i_1; i_0) = \sum_{g=1}^{m} \varphi_g \big[ \pi_g(i_g, i_0) - \pi_g(u, i_0) \big] + \sum_{g=1}^{m} \varphi_g \, \pi_g(u, i_0).$  (5)

From Proposition 1, it follows that each term of the first sum, $\varphi_g[\pi_g(i_g, i_0) - \pi_g(u, i_0)]$, equals the difference of probabilities $P(u...u\, i_g\, u...u; i_0) - P(u...u; i_0)$. The second sum is trivially the transition probability from the $m$-letter word $u...u$ to $i_0$. Let us denote the transition probabilities from $m$-letter words to the letter $j$, restricting to words differing from $u...u$ by at most one letter:

$p_u(g; i, j) := P(u...u\, i\, u...u; j),$  (6)

where $u...u\, i\, u...u$ is the $m$-letter word whose letter in position $g$ (from right to left) is $i$. The quantities in (6) are sufficient to describe the model, as stated in the following proposition.
Proposition 2 The transition probabilities of an $m$-th order MTD$_1$ model satisfy: $\forall u \in \mathcal{Y}$, $\forall\, i_m, ..., i_g, ..., i_0 \in \mathcal{Y}$,

$P(i_m, ..., i_1; i_0) = \sum_{g=1}^{m} \Big[ p_u(g; i_g, i_0) - \frac{m-1}{m} \, p_u(i_0) \Big],$

where $p_u(j)$ denotes the value of $p_u(g; u, j)$, whatever the value of $g$.

For any arbitrary element $u$ of $\mathcal{Y}$, an MTD$_1$ distribution can be parametrized by a vector $\theta_u$ from the $(q-1)[1 + m(q-1)]$-dimensional set $\bar{\Theta}_u$,

$\bar{\Theta}_u = \Big\{ (p_u(g; i, j))_{1 \le g \le m,\, i,j \in \mathcal{Y}} \text{ such that } \forall g \in \{1, ..., m\}, \forall i \in \mathcal{Y}, \sum_{j \in \mathcal{Y}} p_u(g; i, j) = 1, \text{ and } \forall g, g' \in \{1, ..., m\},\ p_u(g; u, j) = p_u(g'; u, j) \Big\}.$  (7)

Note that not all $\theta_u$ in $\bar{\Theta}_u$ define an MTD$_1$ distribution: the sum $\sum_{g=1}^{m} \big[ p_u(g; i_g, i_0) - \frac{m-1}{m} p_u(i_0) \big]$ may indeed fall outside the interval $[0, 1]$. For this reason, we can only claim that some subset $\Theta_u$ of $\bar{\Theta}_u$ is a parameter space for the MTD$_1$ model. However, as the components of a parameter $\theta_u \in \Theta_u$ are transition probabilities, two different parameter values cannot define the same MTD distribution. The mapping of $\Theta_u$ onto the MTD$_1$ model is thus bijective, which results in the dimension of $\bar{\Theta}_u$ being an upper bound on the dimension of the MTD model.

Whereas the original definition of the MTD$_1$ model (1) involves an $[m - 1 + mq(q-1)]$-dimensional parameter set, this new parametrization lies in a smaller dimensional space, dropping $q(m-1)$ parameters.

An equivalent parametrization can be set for MTD models having a higher order transition matrix for each lag. For any $l \ge 1$, an MTD$_l$ model can be described by a vector composed of the transition probabilities $p^l_u(g; i_l ... i_1, j) = P(u...u\, i_l ... i_1\, u...u; j)$ for all $l$-letter words $i_l ... i_1$.
Denoting by $\Theta^l_u$ the corresponding parameter space, its dimension

$|\Theta^l_u| = \sum_{k=2}^{l} \big[ q^{k-2} (q-1)^3 (m-k+1) \big] + (1 + m(q-1))(q-1)$

is again much smaller than the number of parameters originally required to describe the MTD$_l$ model (see [8], section 2.2, for the counting details). A comparison of the dimensions according to both parametrizations appears in Table 1. We will now make use of the upper bound $|\Theta^l_u|$ of the model's dimension to penalize the likelihood in the assessment of MTD models' goodness-of-fit (see section 4.2).

Table 1: Number of independent parameters required to describe full Markov and MTD$_l$ models (state space size: $q = 4$). Except for the single matrix MTD model, MTD models originally defined with parameters $(\varphi, \pi)$ are over-parametrized: the parameter $\theta^l_u$, introduced in section 2, requires far fewer independent parameters. Note that the 1st order MTD$_1$ model (resp. 2nd order MTD$_2$ model) is equivalent to the 1st order (resp. 2nd order) full Markov model.

Order m   Full Markov   MTD_1 |(phi, pi)|   MTD_1 |theta^1_u|   MTD_2 |(phi, pi)|   MTD_2 |theta^2_u|
1         12            12                  12                  —                   —
2         48            25                  21                  48                  48
3         192           38                  30                  97                  84
4         768           51                  39                  146                 120
5         3072          64                  48                  195                 156

3 Estimation

In this section, we expose an EM algorithm for the estimation of MTD models. Firstly, this procedure allows to maximize the likelihood without assuming the independence of the parameters $\varphi$ and $\pi$, and offers the convergence properties of an EM algorithm. Secondly, from a technical point of view, the EM algorithm does not require any trick to fulfill the constraints holding on the $(\varphi, \pi)$ parameters, as Berchtold's algorithm does. We expose here our estimation method for the MTD$_1$ model (1), having a specific 1st order transition matrix for each lag.
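The parameter counts of Table 1 follow directly from the dimension formulas of section 2, and can be reproduced numerically (a sketch with our own function names):

```python
def dim_full_markov(m, q):
    """Full m-th order Markov chain: q^m rows with q - 1 free entries each."""
    return q ** m * (q - 1)

def dim_mtd_phi_pi(m, l, q):
    """Original (phi, pi) parameter count for an m-th order MTD_l model:
    m - l free mixture weights plus m - l + 1 stochastic q^l x q matrices."""
    return (m - l) + (m - l + 1) * q ** l * (q - 1)

def dim_theta_u(m, l, q):
    """Upper bound |Theta^l_u| on the model dimension, from section 2."""
    return (sum(q ** (k - 2) * (q - 1) ** 3 * (m - k + 1) for k in range(2, l + 1))
            + (1 + m * (q - 1)) * (q - 1))
```

For $q = 4$ these three functions reproduce every entry of Table 1, including the $q(m-1)$-parameter saving of the $\theta_u$ parametrization over $(\varphi, \pi)$ for MTD$_1$.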
The method can easily be adapted to single matrix MTD models as well as to MTD models having different types of transition matrices for each lag. Detailed derivations of the formulas for identical matrix MTD and MTD$_l$ models are presented in appendix B. To estimate the transition probabilities $\{P(i_m ... i_1; i_0);\ i_m, ..., i_0 \in \mathcal{Y}\}$ of an $m$-th order MTD$_1$ model, we propose to compute an approximation of one set of parameters $\theta = (\varphi_g, \pi_g)_{1 \le g \le m}$ which maximizes the likelihood.

3.1 Introduction of a hidden process

Our approach relies on a particular interpretation of the model. The definition of the MTD$_1$ model (1) is equivalent to a mixture of $m$ hidden models where the random variable $Y_t$ is predicted by one of the $m$ Markov chains $\pi_g$ with the corresponding probability $\varphi_g$. Indeed, the coefficients $(\varphi_g)_{g=1,...,m}$ define a probability measure on the finite set $\{1, ..., m\}$, since they satisfy the constraints (2) and (3). From now on, we consider a hidden state process $S_1, ..., S_n$ that determines the way the prediction is carried out. The hidden state variables $\{S_t\}$, taking values in the finite set $S = \{1, ..., m\}$, are independent and identically distributed, with distribution

$\forall t \le n, \forall g \in S, \quad P(S_t = g) = \varphi_g.$

The MTD$_1$ model is then defined as a hidden variable model. The observed variable $Y_t$ depends on the current hidden state $S_t$ and on the $m$ previous variables $Y_{t-1}, ..., Y_{t-m}$. This dependency structure of the model is represented as a Directed Acyclic Graph (DAG) in Figure 2.

Figure 2: DAG dependency structure of a 2nd order MTD$_1$ model.
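The mixture interpretation gives a direct way to simulate an MTD$_1$ sequence: draw the hidden lag, then draw the letter from the corresponding Markov chain. A sketch under our own conventions (0-indexed states, an arbitrary uniform start for the first $m$ letters):

```python
import random

def sample_mtd1(n, phi, pi, q, seed=0):
    """Sample an MTD_1 sequence of length n via the hidden process: draw
    S_t from phi, then Y_t from row y[t - S_t] of the matrix pi[S_t]."""
    rng = random.Random(seed)
    m = len(phi)
    y = [rng.randrange(q) for _ in range(m)]      # arbitrary first m letters
    for t in range(m, n):
        g = rng.choices(range(m), weights=phi)[0]  # hidden state: lag g+1
        row = pi[g][y[t - 1 - g]]                  # Markov chain for that lag
        y.append(rng.choices(range(q), weights=row)[0])
    return y
```

This is exactly the generative reading of Figure 2: conditional on $S_t$, the letter $Y_t$ depends on a single past variable.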
The hidden value at one position indicates which of the previous variables and transition matrices are to be used to draw the current letter: conditional on the state $S_t$, the random variable $Y_t$ only depends on the variable $Y_{t-S_t}$:

$\forall t > m, \forall g \in S, \quad P(Y_t = y_t \mid Y_{t-m}^{t-1} = y_{t-m}^{t-1}, S_t = g) = \pi_g(y_{t-g}, y_t).$

So we carry out estimation in the MTD$_1$ models as estimation in a mixture model where the components of the mixture are $m$ Markov chains, each one predicting the variable $Y_t$ from one of the $m$ previous variables.

3.2 EM algorithm

By considering a hidden variable model, we want to compute the maximum likelihood estimate from incomplete data. The EM algorithm introduced by Dempster et al. [5] is a very classical framework for achieving such a task. It has proved to be particularly efficient at estimating various classes of hidden variable models. We make it entirely explicit in the case of the MTD models.

The purpose of the EM algorithm is to approximate the maximum of the log-likelihood of the incomplete data $L_y(\theta) = \log P_\theta(Y = y)$ over $\theta \in \Theta$ using the relationship

$\forall \theta, \theta' \in \Theta, \quad L_y(\theta) = Q(\theta \mid \theta') - H(\theta \mid \theta'),$

where the quantities $Q$ and $H$ are defined as follows:

$Q(\theta \mid \theta') = E[\log P_\theta(Y, S) \mid Y = y, \theta']$

$H(\theta \mid \theta') = E[\log P_\theta(S \mid Y = y) \mid Y = y, \theta']$

The EM algorithm is divided into two steps, the E-step (Expectation) and the M-step (Maximization), which consist respectively in computing and maximizing the function $Q(\theta \mid \theta^{(k)})$, that is, the expected log-likelihood of the complete model conditional on the observed sequence $y$ and on the current parameter $\theta^{(k)}$. Using the fact that the function $\theta \to H(\theta \mid \theta^{(k)})$ is maximal at $\theta^{(k)}$, Dempster et al. proved that this procedure necessarily increases the log-likelihood $L_y(\theta)$.
See [14] for a detailed study of the convergence properties of the EM algorithm. We now derive analytical expressions for both the E-step and the M-step. In this particular case, the log-likelihood of the complete data $(Y_{m+1}^n, S_{m+1}^n)$ conditional on the first $m$ observations $Y_1^m$ writes:

$\log P_\theta(Y_{m+1}^n, S_{m+1}^n \mid Y_1^m) = \sum_{t=m+1}^{n} \sum_{g=1}^{m} \sum_{i \in \mathcal{Y}} \sum_{j \in \mathcal{Y}} \mathbb{1}_{\{Y_{t-g}=i,\, Y_t=j,\, S_t=g\}} \log \pi_g(i, j) + \sum_{t=m+1}^{n} \sum_{g=1}^{m} \mathbb{1}_{\{S_t=g\}} \log \varphi_g.$  (8)

E-step. The Expectation step computes the expectation of the function (8) conditional on the observed data $y$ and the current parameter $\theta^{(k)}$, that is, calculating, for all $t > m$ and all elements $g$ in $\{1, ..., m\}$, the quantity

$E(\mathbb{1}_{\{S_t=g\}} \mid y, \theta^{(k)}) = P(S_t = g \mid y, \theta^{(k)}).$  (9)

Then, the function $Q$ writes:

$Q(\theta \mid \theta^{(k)}) = \sum_{t=m+1}^{n} \sum_{g=1}^{m} \sum_{i \in \mathcal{Y}} \sum_{j \in \mathcal{Y}} \big[ P(S_t = g \mid y, \theta^{(k)}) \log \pi_g(i, j) \big] \mathbb{1}_{\{y_{t-g}=i,\, y_t=j\}} + \sum_{t=m+1}^{n} \sum_{g=1}^{m} P(S_t = g \mid y, \theta^{(k)}) \log \varphi_g.$  (10)

So the E-step reduces to computing the probabilities (9), for which we derive an explicit expression by using the theory of graphical models in the particular case of DAG-structured dependencies [9].

Figure 3: Moral graph of a 2nd order MTD$_1$ model.

First, remark that the state variable $S_t$ depends on the sequence $Y$ only through the $m+1$ variables $\{Y_{t-m}, ..., Y_{t-1}, Y_t\}$:

$\forall t \le n, \forall g \in \{1, ..., m\}, \quad P(S_t = g \mid y, \theta) = P(S_t = g \mid Y_{t-m}^t = y_{t-m}^t, \theta).$  (11)

Indeed, independence properties can be derived from the moral graph (Fig. 3), which is obtained from the DAG structure of the dependencies (Fig. 2) by "marrying" the parents, that is, adding an edge between the common parents of each variable and then deleting directions.
In this moral graph, the set $\{Y_{t-m}, ..., Y_t\}$ separates the variable $S_t$ from the rest of the sequence $\{Y_1, ..., Y_{t-m-1}\}$, so that applying Corollary 3.23 from [9] yields:

$S_t \perp\!\!\!\perp (Y_1^{t-m-1}, Y_{t+1}^n) \mid Y_{t-m}^t.$

From now on, we denote by $i_m^0 = i_m i_{m-1} ... i_1 i_0$ any $(m+1)$-letter word composed of elements of $\mathcal{Y}$. For all $g$ in $\{1, ..., m\}$ and all words $i_m^0$ over $\mathcal{Y}$, Bayes' theorem gives:

$P(S_t = g \mid Y_{t-m}^t = i_m^0, \theta) = \frac{P(S_t = g, Y_t = i_0 \mid Y_{t-m}^{t-1} = i_m^1, \theta)}{P(Y_t = i_0 \mid Y_{t-m}^{t-1} = i_m^1, \theta)} = \frac{P(Y_t = i_0 \mid S_t = g, Y_{t-m}^{t-1} = i_m^1, \theta)\, P(S_t = g \mid Y_{t-m}^{t-1} = i_m^1, \theta)}{\sum_{l=1}^{m} P(Y_t = i_0 \mid S_t = l, Y_{t-m}^{t-1} = i_m^1, \theta)\, P(S_t = l \mid Y_{t-m}^{t-1} = i_m^1, \theta)}.$  (12)

We show below that the probabilities $P(Y_t = i_0 \mid S_t = g, Y_{t-m}^{t-1} = i_m^1, \theta)$ and $P(S_t = g \mid Y_{t-m}^{t-1} = i_m^1, \theta)$ in expression (12) are entirely explicit. First, conditional on $\theta$, the state $S_t$ and the variables $Y_{t-m}^{t-1}$, the distribution of $Y_t$ writes:

$P(Y_t = i_0 \mid S_t = g, Y_{t-m}^{t-1} = i_m^1, \theta) = \pi_g(i_g, i_0).$

Second, although the state $S_t$ depends on the $(m+1)$-letter word $Y_{t-m}^t$, which brings information about the probability of transition from $Y_{t-m}^{t-1}$ to $Y_t$, it does not depend on the $m$-letter word formed by the variables $Y_{t-m}^{t-1}$ alone. This again follows from the same corollary in [9]. The independence of the variables $S_t$ and $Y_{t-m}^{t-1}$ is derived from the graph of the smallest ancestral set containing these variables, that is, the subgraph containing $S_t$, $Y_{t-m}^{t-1}$ and the whole line of their ancestors (see Figure 4 for an illustration with $m = 2$). It turns out that, when considering the moralization of this subgraph (Figure 5), there is no path between $S_t$ and the set $Y_{t-m}^{t-1}$. This establishes $S_t \perp\!\!\!\perp Y_{t-m}^{t-1}$, and we have

$P(S_t = g \mid Y_{t-m}^{t-1} = i_m^1, \theta) = P(S_t = g \mid \theta) = \varphi_g.$

Figure 4: In black: graph of the smallest ancestral set containing $S_t$ and the two variables $(Y_{t-2}, Y_{t-1})$ in the particular case of a 2nd order MTD$_1$ model. (The part of the dependency structure DAG that is excluded from the smallest ancestral set appears here in light blue.)

Figure 5: Moral graph of the smallest ancestral set in Figure 4. There is no path between $S_t$ and the subset of two variables $\{Y_{t-2}, Y_{t-1}\}$.

Finally, the probability (12) is entirely determined by the current parameter $\theta$ and does not depend on the time $t$. As a result, the $k$-th iteration of the E-step consists in calculating, for all $g$ in $\{1, ..., m\}$ and all $(m+1)$-letter words $i_m^0$ over $\mathcal{Y}$:

$P_S^{(k)}(g \mid i_m^0) = P(S_t = g \mid Y_{t-m}^t = i_m^0, \theta^{(k)}) = \frac{\varphi_g^{(k)} \pi_g^{(k)}(i_g, i_0)}{\sum_{l=1}^{m} \varphi_l^{(k)} \pi_l^{(k)}(i_l, i_0)}.$  (13)

M-step. Maximization of the function $Q(\theta \mid \theta^{(k)})$ with respect to the constraints imposed on the vector $\varphi$ and on the elements of the transition matrices $\pi_1, ..., \pi_m$ is easily achieved using the Lagrange method: $\forall g \in \{1, ..., m\}, \forall i, j \in \mathcal{Y}$,

$\varphi_g^{(k+1)} = \frac{1}{n-m} \sum_{i_m ... i_0} P^{(k)}(g \mid i_m^0)\, N(i_m^0),$  (14)

$\pi_g^{(k+1)}(i, j) = \frac{\sum_{i_m ... i_{g+1} i_{g-1} ... i_1} P^{(k)}(g \mid i_m ... i_{g+1}\, i\, i_{g-1} ... i_1\, j)\, N(i_m ... i_{g+1}\, i\, i_{g-1} ... i_1\, j)}{\sum_{i_m ... i_{g+1} i_{g-1} ... i_1 i_0} P^{(k)}(g \mid i_m ... i_{g+1}\, i\, i_{g-1} ... i_1\, i_0)\, N(i_m ... i_{g+1}\, i\, i_{g-1} ... i_1\, i_0)},$  (15)

where the sums are carried out over the variables $i_m, ..., i_{g+1}, i_{g-1}, ..., i_1, i_0$ taking values in $\mathcal{Y}$, $n$ is the length of the observed sequence $y$, and $N(i_m^0)$ is the number of occurrences of the word $i_m^0$ in this sequence.
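Because the posterior (13) depends on the data only through the $(m+1)$-letter word counts $N(i_m^0)$, the whole iteration can be carried out on counts. The sketch below is our own illustrative implementation of equations (13)-(15) (0-indexed states; words are tuples $(i_m, ..., i_1, i_0)$; it assumes strictly positive current parameters, as guaranteed by the initializations discussed next, and the uniform fallback for unseen rows is our addition):

```python
import math
from collections import Counter

def count_words(y, m):
    """N(i_m ... i_0): occurrences of each (m+1)-letter word in the sequence."""
    return Counter(tuple(y[t - m:t + 1]) for t in range(m, len(y)))

def em_step(counts, phi, pi, m, q):
    """One EM iteration on word counts, per equations (13), (14) and (15)."""
    # E-step (13): posterior of the hidden lag for each observed word
    post = {}
    for w in counts:
        weights = [phi[g] * pi[g][w[m - 1 - g]][w[m]] for g in range(m)]
        total = sum(weights)
        post[w] = [x / total for x in weights]
    n_eff = sum(counts.values())  # equals n - m
    # M-step (14): mixture weights
    new_phi = [sum(post[w][g] * c for w, c in counts.items()) / n_eff
               for g in range(m)]
    # M-step (15): transition matrices, renormalized row by row
    new_pi = [[[0.0] * q for _ in range(q)] for _ in range(m)]
    for g in range(m):
        for w, c in counts.items():
            new_pi[g][w[m - 1 - g]][w[m]] += post[w][g] * c
        for i in range(q):
            s = sum(new_pi[g][i])
            if s > 0:
                new_pi[g][i] = [x / s for x in new_pi[g][i]]
            else:
                new_pi[g][i] = [1.0 / q] * q  # unseen context (our fallback)
    return new_phi, new_pi

def loglik(counts, phi, pi, m):
    """Conditional log-likelihood computed from the word counts."""
    return sum(c * math.log(sum(phi[g] * pi[g][w[m - 1 - g]][w[m]]
                                for g in range(m)))
               for w, c in counts.items())

def em_fit(y, m, q, phi0, pi0, eps=1e-6, max_iter=500):
    """EM loop with the log-likelihood stopping rule."""
    counts = count_words(y, m)
    phi, pi = phi0, pi0
    ll = loglik(counts, phi, pi, m)
    for _ in range(max_iter):
        phi, pi = em_step(counts, phi, pi, m, q)
        new_ll = loglik(counts, phi, pi, m)
        if new_ll - ll < eps:
            break
        ll = new_ll
    return phi, pi, new_ll
```

Each iteration is guaranteed not to decrease the conditional log-likelihood, which is the convergence property inherited from the general EM framework.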
Initialization. To maximize the chance of reaching the global maximum, we run the algorithm from various starting points. One initialization is derived from contingency tables between each lag $y_{t-g}$ and the present $y_t$, as proposed by Berchtold [2], and several others are randomly drawn from the uniform distribution.

EM Algorithm for MTD models
• Compute the number of occurrences $N(i_m^0)$ of each $(m+1)$-letter word,
• initialize the parameters $(\varphi^{(0)}, \pi^{(0)})$,
• choose a stopping rule, i.e. an upper threshold $\varepsilon$ on the increase of the log-likelihood,
• iterate the E and M steps given by equations (13), (14), (15),
• stop when $L_y(\theta^{(k+1)}) - L_y(\theta^{(k)}) < \varepsilon$.

A software implementation of our algorithm is available in the library seq++ at http://stat.genopole.cnrs.fr/seqpp.

4 Applications

4.1 Comparison with Berchtold's estimation

In this paper, we focus on estimation of the MTD$_l$ model (see Definition 2), which has a specific but same-order transition matrix for each lag. We evaluate the performance of the EM algorithm in comparison with the best algorithm to date, developed by Berchtold [2]. Among others, Berchtold estimates the parameters of MTD$_l$ models on two sequences analyzed in previous articles: a time series of the twilight song of the wood pewee and the mouse αA-crystallin gene sequence (the complete sequences appear in [12], Tables 7 and 12). The song of the wood pewee is a sequence composed of 3 distinct phrases (referred to as 1, 2, 3), whereas the αA-crystallin gene is composed of 4 nucleotides: a, c, g, t.

Table 2: Maximum log-likelihood of MTD$_1$ models estimated by EM and Berchtold's algorithm (see [2], sections 5.1 and 6.2).

Order m   q = |Y|   Berchtold   EM        Sequence
2         3         -486.4      -481.8    Pewee
2         4         -1720.1     -1718.5   αA-crystallin
3         3         -484.0      -480.0    Pewee
3         4         -1710.6     -1707.9   αA-crystallin
W e apply our estimation metho d to these sequences and obtain comparable or higher v a lue o f the log-lik eliho o d for b oth (see T ab. 2). Since the original parametrizatio n of the MTD 1 mo del is not injectiv e, it is not reasonable to compare their v alues. T o o v ercome this problem, w e computed the parameters from the set ¯ Θ u defined in (7). The estimated parameters (using a precision parameter ε = 0 . 001) o f the 2 nd order MTD 1 mo del on the song o f w o o d P ew ee (first line of t he T able 2) a re exp osed in Figure 6. Complete results app ear in app endix C, namely estimated parameters ˆ ϕ , ˆ π 1 , ˆ π 2 and their corresp onding f ull 2 nd order transition matrices ˆ Π . F or b o t h sequences under study , P ewe e a nd α A-crystallin, EM and Berch told algorithms lead to comparable estimations. The EM algorithm prov es here to b e an effectiv e metho d t o maximize the log- lik eliho od of MTD mo dels. Nev ertheless, EM algo r it hm o ffers the adv an ta ge to b e v ery easy to use. Whereas Berc htold’s algor it hm requires to set and up date a parameter δ to alter the v ector ϕ and eac h row of the matrices π g , running the EM a lgorithm only requires the c hoice of the threshold ε in t he stopping rule. 4.2 Estimation on DN A co ding sequences DNA co ding regions are tr a nslated into proteins with resp ect to the genetic co de, whic h is defined on blo c ks o f three n ucleotides called c o dons . Hence, the n ucleotides in these regio ns are constrained in differen t wa ys a ccording to their p osition in the co don. It is common in bioinformatics to use three differen t transition matrices to predict the n ucleotides in the three p ositions of the co dons. This mo del is called the phase d Marko v mo del. Since we a im at comparing the go o dness -of-fits of mo dels with differen t dimensions, the max imal v alue o f a p enalized likelihoo d 15 Figure 6: Estimation o f a 2 nd order MTD 1 mo del on the song of the woo d pewee. 
We use u = 1 (song phrase 1) as the reference letter to express the parameters defined in (7). Estimates obtained with:

• Berchtold's algorithm (L_y(θ̂) = −486.4):

[p̂_1(1; i, j)]_{1≤i,j≤3} =
  0.754169  0.198791  0.073356
  0.991696  0.        0.03462
  0.993579  0.003497  0.02924

[p̂_1(2; i, j)]_{1≤i,j≤3} =
  0.754169  0.198791  0.073356
  0.137205  0.213411  0.649384
  0.048023  0.927598  0.044116

• EM algorithm (L_y(θ̂) = −481.8):

[p̂_1(1; i, j)]_{1≤i,j≤3} =
  0.75305   0.200475  0.046475
  0.991475  0.        0.008525
  0.996425  0.003575  0.

[p̂_1(2; i, j)]_{1≤i,j≤3} =
  0.75305   0.200475  0.046475
  0.137525  0.21135   0.651125
  0.02805   0.925475  0.046475

Figure 7: Difference according to the BIC criterion between MTD models and the corresponding fully parametrized Markov model.

function against the dimension of the parameter space will be used to assess each model. The Bayesian Information Criterion [13] for this evaluation is defined as

BIC(M) = −2 L_y(θ̂_M) + d(M) log n,

where θ̂_M stands for the maximum likelihood estimate of model M and d(M) for its dimension. The lower the BIC a model achieves, the more pertinent the model is. The BIC evaluation has been computed on DNA coding sequence sets from bacterial genomes, each set having a length ranging from 1 500 000 to 5 000 000. The values displayed in Figure 7 are averages, over the 15 sequence sets, of the difference between the BIC value achieved by the full Markov model and the one achieved by the MTD model of the same order. Whenever this difference is positive, the MTD model is to be preferred to the full Markov model. The full Markov model turns out to outperform the MTD_1 model when the order is less than 4.
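The BIC comparison is mechanical once the maximized log-likelihood and a dimension count are fixed. The sketch below uses one common convention for the free-parameter counts; this convention is our assumption, and since the MTD parametrization is not injective, the naive MTD count overstates the effective dimension. All function names are ours.

```python
import math

def bic(loglik, dim, n):
    # BIC(M) = -2 L_y(theta_hat_M) + d(M) log n; lower is better.
    return -2.0 * loglik + dim * math.log(n)

def dim_full_markov(m, q):
    # Full order-m Markov chain: q^m rows, each with q - 1 free entries.
    return q**m * (q - 1)

def dim_mtd1(m, q):
    # Naive MTD_1 count: the weight vector phi (m - 1 free entries) plus
    # m stochastic q x q matrices (q - 1 free entries per row).
    return (m - 1) + m * q * (q - 1)
```

For instance, with m = 2 and q = 4 the full chain carries 48 free parameters against 25 for the naive MTD_1 count, which is where the parsimony of the MTD model shows up in the penalty term.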
This is not surprising, since the estimation is computed over large datasets that provide a sufficient amount of information with respect to the number of parameters of the full model. However, the 5th-order MTD_1 model and the full Markov model have comparable performances, and the MTD_1 model outperforms the full Markov model for higher orders. This is evidence that, although MTD_1 models only approximate the full Markov models, their estimation accuracy decreases more slowly with the order. Even more striking is the comparison of the MTD_2 model with the full Markov model: whatever the order of the model, its goodness-of-fit is at least equivalent to the one achieved by the full Markov model. The MTD_l model turns out to be a satisfactory trade-off between dimension and estimation accuracy.

5 Acknowledgments

We thank Bernard Prum and Catherine Matias for their very constructive suggestions, and Vincent Miele for his implementation of the EM algorithm in the seq++ library. Moreover, we thank the referees for their comments and suggestions, which improved this paper.

A Example of equivalent parameters defining the same MTD_1 model

Let the state space be of size 4, as for DNA sequences: Y = {a, c, g, t}, and consider the two following 2nd-order MTD_1 model parameters θ and θ′.

ϕ = (0.3, 0.7)

π_1 =
  0.1  0.2  0.3  0.4
  0.4  0.3  0.2  0.1
  0.2  0.2  0.2  0.4
  0.4  0.2  0.2  0.2

π_2 =
  0.1  0.1  0.1  0.7
  0.2  0.2  0.4  0.2
  0.3  0.3  0.3  0.1
  0.3  0.2  0.3  0.2

ϕ′ = (0.2, 0.8)

π′_1 =
  0.2   0.1   0.2   0.5
  0.65  0.25  0.05  0.05
  0.35  0.1   0.05  0.5
  0.65  0.1   0.05  0.2

π′_2 =
  0.075   0.1375  0.15    0.6375
  0.1625  0.225   0.4125  0.2
  0.25    0.3125  0.325   0.1125
  0.25    0.225   0.325   0.2
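The claim of Appendix A, that θ and θ′ induce the same high-order transition matrix, can be verified numerically. A minimal sketch (the function name is ours), with the letters a, c, g, t coded as 0 to 3:

```python
import numpy as np

def mtd1_to_markov(phi, pis):
    """Full order-m transition matrix induced by MTD_1 parameters:
    P(j | i_m ... i_1) = sum_g phi_g * pi_g(i_g, j).
    Rows are indexed by the past (i_m, ..., i_1), oldest letter first."""
    m, q = len(phi), pis[0].shape[0]
    rows = []
    for past in np.ndindex(*(q,) * m):
        # past[m - g] is the letter at lag g (g = 1 is the most recent).
        rows.append(sum(phi[g - 1] * pis[g - 1][past[m - g]]
                        for g in range(1, m + 1)))
    return np.array(rows)

# theta
phi = np.array([0.3, 0.7])
pi1 = np.array([[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1],
                [0.2, 0.2, 0.2, 0.4], [0.4, 0.2, 0.2, 0.2]])
pi2 = np.array([[0.1, 0.1, 0.1, 0.7], [0.2, 0.2, 0.4, 0.2],
                [0.3, 0.3, 0.3, 0.1], [0.3, 0.2, 0.3, 0.2]])
# theta'
phi_p = np.array([0.2, 0.8])
pi1_p = np.array([[0.2, 0.1, 0.2, 0.5], [0.65, 0.25, 0.05, 0.05],
                  [0.35, 0.1, 0.05, 0.5], [0.65, 0.1, 0.05, 0.2]])
pi2_p = np.array([[0.075, 0.1375, 0.15, 0.6375], [0.1625, 0.225, 0.4125, 0.2],
                  [0.25, 0.3125, 0.325, 0.1125], [0.25, 0.225, 0.325, 0.2]])

# Both parameter sets induce the same 16 x 4 transition matrix.
assert np.allclose(mtd1_to_markov(phi, [pi1, pi2]),
                   mtd1_to_markov(phi_p, [pi1_p, pi2_p]))
```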
Both parameters define the same 2nd-order Markov transition matrix Π (rows indexed by the past (y_{t−2}, y_{t−1}), columns by a, c, g, t):

       a     c     g     t
  aa   0.1   0.13  0.16  0.61
  ac   0.19  0.16  0.13  0.52
  ag   0.13  0.13  0.13  0.61
  at   0.19  0.13  0.13  0.55
  ca   0.17  0.2   0.37  0.26
  cc   0.26  0.23  0.34  0.17
  cg   0.2   0.2   0.34  0.26
  ct   0.26  0.2   0.34  0.2
  ga   0.24  0.27  0.3   0.19
  gc   0.33  0.3   0.27  0.1
  gg   0.27  0.27  0.27  0.19
  gt   0.33  0.27  0.27  0.13
  ta   0.24  0.2   0.3   0.26
  tc   0.33  0.23  0.27  0.17
  tg   0.27  0.2   0.27  0.26
  tt   0.33  0.2   0.27  0.2

B EM algorithm for other MTD models

B.1 Single-matrix MTD model: iteration k

E-step. For all g ∈ {1, ..., m} and all i_m, ..., i_1, i_0 ∈ {1, ..., q},

$$P_S^{(k)}(g \mid i_m^0) = \frac{\varphi_g^{(k)} \, \pi^{(k)}(i_g, i_0)}{\sum_{l=1}^{m} \varphi_l^{(k)} \, \pi^{(k)}(i_l, i_0)}.$$

M-step. For all g ∈ {1, ..., m} and all i, j ∈ {1, ..., q},

$$\varphi_g^{(k+1)} = \frac{1}{n-m} \sum_{i_m \ldots i_0} P_S^{(k)}(g \mid i_m^0)\, N(i_m^0),$$

$$\pi^{(k+1)}(i, j) = \frac{\sum_{g=1}^{m} \sum_{i_m \ldots i_{g+1} i_{g-1} \ldots i_1} P_S^{(k)}(g \mid i_m^{g+1}\, i\, i_{g-1}^{1}\, j)\, N(i_m^{g+1}\, i\, i_{g-1}^{1}\, j)}{\sum_{g=1}^{m} \sum_{i_m \ldots i_{g+1} i_{g-1} \ldots i_1 i_0} P_S^{(k)}(g \mid i_m^{g+1}\, i\, i_{g-1}^{0})\, N(i_m^{g+1}\, i\, i_{g-1}^{0})},$$

where the sums are carried out over the variables i_m, ..., i_{g+1}, i_{g-1}, ..., i_1, i_0 varying from 1 to q, n is the length of the observed sequence y and N(i_m^0) is the number of occurrences of the word i_m^0 in this sequence.

B.2 MTD_l model: iteration k

E-step. For all g ∈ {1, ..., m − l + 1} and all i_m, ..., i_1, i_0 ∈ {1, ..., q},

$$P_S^{(k)}(g \mid i_m^0) = \frac{\varphi_g^{(k)} \, \pi_g^{(k)}(i_{g+l-1}^{g}, i_0)}{\sum_{h=1}^{m-l+1} \varphi_h^{(k)} \, \pi_h^{(k)}(i_{h+l-1}^{h}, i_0)}.$$
M-step. For all g ∈ {1, ..., m − l + 1} and all i_l, ..., i_1, j ∈ {1, ..., q},

$$\varphi_g^{(k+1)} = \frac{1}{n-m} \sum_{u_m \ldots u_0} P_S^{(k)}(g \mid u_m^0)\, N(u_m^0),$$

$$\pi_g^{(k+1)}(i_l i_{l-1} \ldots i_1, j) = \frac{\sum_{u_m \ldots u_{g+l} u_{g-1} \ldots u_1} P_S^{(k)}(g \mid u_m^{g+l}\, i_l^1\, u_{g-1}^{1}\, j)\, N(u_m^{g+l}\, i_l^1\, u_{g-1}^{1}\, j)}{\sum_{u_m \ldots u_{g+l} u_{g-1} \ldots u_1 u_0} P_S^{(k)}(g \mid u_m^{g+l}\, i_l^1\, u_{g-1}^{0})\, N(u_m^{g+l}\, i_l^1\, u_{g-1}^{0})},$$

where the sums are carried out over the variables u_m, ..., u_{g+l}, u_{g-1}, ..., u_1, u_0 varying from 1 to q, n is the length of the observed sequence y and N(u_m^0) is the number of occurrences of the word u_m^0 in this sequence.

C 2nd-order MTD_1 estimates obtained on both the song of the wood pewee and the mouse αA-Crystallin Gene sequence (Section 4.1)

1. Song of the wood pewee

Berchtold's algorithm (see [2], Section 5.1): L_y(θ̂) = −486.4.

ϕ̂ = (0.269, 0.731)

π̂_1 =
  0.097  0.739  0.164
  0.980  0      0.020
  0.987  0.013  0

π̂_2 =
  0.996  0      0.004
  0.152  0.020  0.828
  0.003  0.997  0

EM algorithm: L_y(θ̂) = −481.8.

ϕ̂ = (0.275, 0.725)

π̂_1 =
  0.102  0.729  0.169
  0.969  0      0.031
  0.987  0.013  0

π̂_2 =
  1      0      0
  0.151  0.015  0.834
  0      1      0

These estimated parameters define respectively the following 2nd-order Markov transition matrices Π̂_B and Π̂_EM.

Π̂_B =
  0.754169  0.198791  0.047040
  0.991696  0.        0.008304
  0.993579  0.003497  0.002924
  0.137205  0.213411  0.649384
  0.374732  0.014620  0.610648
  0.376615  0.018117  0.605268
  0.028286  0.927598  0.044116
  0.265813  0.728807  0.005380
  0.267696  0.732304  0.

Π̂_EM =
  0.753050  0.200475  0.046475
  0.991475  0.        0.008525
  0.996425  0.003575  0.
  0.137525  0.211350  0.651125
  0.375950  0.010875  0.613175
  0.380900  0.014450  0.604650
  0.028050  0.925475  0.046475
  0.266475  0.725000  0.008525
  0.271425  0.728575  0.

2. Mouse αA-Crystallin Gene sequence

EM algorithm: L_y(θ̂) = −1718.5.

ϕ̂ = (0.562, 0.438)

π̂_1 =
  0.225  0.140  0.506  0.129
  0.354  0.300  0.008  0.338
  0.271  0.123  0.456  0.150
  0.166  0.191  0.430  0.213

π̂_2 =
  0.094  0.600  0.149  0.157
  0.335  0.271  0.153  0.241
  0.185  0.415  0.099  0.301
  0.192  0.370  0.129  0.309

These estimated parameters define the following 2nd-order Markov transition matrix Π̂_EM.

Π̂_EM =
  0.167622  0.341480  0.349634  0.141264
  0.240120  0.431400  0.069758  0.258722
  0.193474  0.331926  0.321534  0.153066
  0.134464  0.370142  0.306922  0.188472
  0.273180  0.197378  0.351386  0.178056
  0.345678  0.287298  0.071510  0.295514
  0.299032  0.187824  0.323286  0.189858
  0.240022  0.226040  0.308674  0.225264
  0.207480  0.260450  0.327734  0.204336
  0.279978  0.350370  0.047858  0.321794
  0.233332  0.250896  0.299634  0.216138
  0.174322  0.289112  0.285022  0.251544
  0.210546  0.240740  0.340874  0.207840
  0.283044  0.330660  0.060998  0.325298
  0.236398  0.231186  0.312774  0.219642
  0.177388  0.269402  0.298162  0.255048

No detail on the 2nd-order MTD_1 estimates from the mouse αA-Crystallin Gene sequence is given in [2].

References

[1] Berchtold, A. (1995). Autoregressive modeling of Markov chains. In Statistical Modelling: Proceedings of the 10th International Workshop on Statistical Modelling, pages 19–26. Springer-Verlag.
[2] Berchtold, A. (2001). Estimation in the mixture transition distribution model. Journal of Time Series Analysis, 22(4), 379–397.
[3] Berchtold, A. and Raftery, A. E. (2002). The mixture transition distribution model for high-order Markov chains and non-Gaussian time series. Statistical Science, 17, 328–356.
[4] Bühlmann, P. and Wyner, A. (1999). Variable length Markov chains. Annals of Statistics, 27.
[5] Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.
[6] Durbin, R., Eddy, S. R., Krogh, A., and Mitchison, G. (1999). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.
[7] Fichant, G. and Gautier, C. (1987). Statistical method for predicting protein coding regions in nucleic acid sequences. Computer Applications in the Biosciences (CABIOS), 3, 287–295.
[8] Grelot, A. (2005). Estimation Bayésienne d'un modèle MTD. MSc report, available at http://stat.genopole.cnrs.fr/sg/Members/slebre/rapportAdeline.pdf/view .
[9] Lauritzen, S. L. (1998). Graphical Models. Reprint, Oxford Statistical Science Series 17.
[10] Li, W. and Kwok, M. C. (1990). Some results on the estimation of a higher order Markov chain. Communications in Statistics, Simulation and Computation, 19(1), 363–380.
[11] Raftery, A. E. (1985). A model for high-order Markov chains. Journal of the Royal Statistical Society, Series B, 47(3), 528–539.
[12] Raftery, A. E. and Tavaré, S. (1994). Estimation and modelling of repeated patterns in high order Markov chains with the mixture transition distribution model. Journal of the Royal Statistical Society, Series C (Applied Statistics), 43(1), 179–199.
[13] Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464.
[14] Wu, C. (1983). On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1), 95–103.
