Investigating some attributes of periodicity in DNA sequences via semi-Markov modelling

A Preprint

Pavlos Kolias
Department of Mathematics, Aristotle University of Thessaloniki
pakolias@math.auth.gr

Alexandra Papadopoulou
Department of Mathematics, Aristotle University of Thessaloniki
apapado@math.auth.gr

November 16, 2021

Abstract

DNA segments and sequences have been studied thoroughly during the past decades. One of the main problems in computational biology is the identification of exon-intron structures inside genes using mathematical techniques. Previous studies have used different methods, such as Fourier analysis and hidden Markov models, to predict which parts of a gene correspond to a protein-encoding area. In this paper, a semi-Markov model is applied to 3-base periodic sequences, which characterize the protein-coding regions of the gene. Analytic forms of the related probabilities and the corresponding indexes are provided, which yield a description of the underlying periodic pattern. Last, the theoretical results are illustrated with DNA sequences of synthetic and real data.

Keywords: DNA sequences · Periodicity · Semi-Markov chain

1 Introduction

Periodicity is a structural property of DNA sequences. It is expressed as either nucleotides or words of nucleotides that appear at specific fixed distances from each other. Two main types of periodic behaviour have been observed in DNA. The first was introduced by Trifonov in 1980 [5] and concerns chromatin, which is a basic element of the cell nucleus. Trifonov observed that certain di-nucleotides in the DNA of chromatin tend to appear approximately every 10 to 11 bases. Subsequent studies suggested that the period of chromatin sequences converges to 10.4 bases [2]. A more recent study [4], which investigated the genomes of three organisms, A. thaliana, C. elegans and H. sapiens, suggested that the di-nucleotide AA has almost perfect 10.5-base periodic behaviour in those organisms. One explanation for this type of periodicity is that the distance of 10.5 bases is exactly the "step" of the double strand, which curves the DNA chain and allows these long sequences to be compressed into the small area of the nucleus.

The second type of periodicity has been observed in areas of the genome that are transcribed and later translated into proteins, also called coding regions. Previous studies have used methods from mathematical analysis, such as the spectral density, and have shown that in coding regions certain nucleotides tend to reappear every 3 bases [6]. This type of periodicity has only been observed in coding regions; no similar periodic behaviour was found in non-coding regions. As each amino acid is encoded by a triplet of nucleotides (a codon) and some amino acids are more abundant than others, the authors concluded that the periodic behaviour exists because of this higher frequency of certain amino acids, and that the period of 3 bases is due to the triplet nature of the genetic code.

As the whole genome of an organism frequently comprises several billion bases, information about the periodic behaviour of the coding regions would be helpful in detecting those regions and in distinguishing between protein-encoding and non-coding regions. Some algorithmic techniques have already been implemented using this information, based on similar methods such as the Fourier transform [8]. Other well-known algorithms use hidden Markov models to classify different regions of DNA [1].

In this paper we assume that a DNA sequence can be described by a semi-Markov chain (SMC) $X_t$ with state space $S = \{A, C, G, T\}$, where $t$ denotes the index position and $C(m) = \{c_{i,j}(m)\}$ is the core matrix of the SMC. We propose a recursive formula, based on the basic parameters of the model, that could potentially identify regions with "strong" or "weak" periodic behaviour. Finally, we apply the model to both synthetic sequences and DNA sequences of several organisms.

2 The semi-Markov model

We assume that the DNA sequence is a realization of a semi-Markov chain $X_n$ whose state space consists of the four nucleotides, $S = \{A, C, G, T\}$. The semi-Markov chain is described by a sequence of Markov transition matrices $\{P(t)\}_{t=0}^{\infty}$ and a sequence of conditional holding time matrices $\{H(m)\}_{m=1}^{\infty}$, such that

$$P(t) = \{p_{i,j}(t)\}, \tag{1}$$

where

$p_{i,j}(t)$ = Prob[the SMC will make its next transition to state $j$ | the SMC entered state $i$ at time $t$],

with $p_{i,j}(t) \geq 0$ for all $i, j \in S$, $t \in \mathbb{N}$, and $\sum_{j \in S} p_{i,j}(t) = 1$ for all $i \in S$, $t \in \mathbb{N}$, and

$$H(m) = \{h_{i,j}(m)\}, \tag{2}$$

where

$h_{i,j}(m)$ = Prob[the SMC will stay in state $i$ for $m$ positions before moving to state $j$].

We define the waiting time probabilities $w_i(m)$, which are the probabilities that the SMC holds for $m$ time units in state $i$ before making its next transition:

$$w_i(m) = \sum_{j \in S} p_{i,j}\, h_{i,j}(m). \tag{3}$$

The survival function of the waiting time is

$${}^{>}w_i(n) = \sum_{m=n+1}^{\infty} w_i(m) = \sum_{j \in S} p_{i,j}\, {}^{>}h_{i,j}(n). \tag{4}$$

The basic parameter of the SMC is the core matrix, defined as

$$C(t, m) = \{c_{i,j}(t, m)\}_{i,j \in S} = P(t) \circ H(m), \tag{5}$$

where the operator $\circ$ denotes the element-wise product of matrices (Hadamard product). We assume that DNA sequences do not contain virtual transitions, therefore

$$p_{i,i}(t) = 0, \quad \forall i \in S,\ t \in \mathbb{N}.$$

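As a minimal sketch of definitions (3) to (5) in the homogeneous case (where $P$ does not depend on $t$), the core matrices and the waiting-time quantities could be computed with NumPy as follows. The function names, the array layout `H[m-1] = H(m)` and the toy parameters are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def core_matrices(P, H):
    """Return C with C[m-1] = P o H(m) (Hadamard product), m = 1..M."""
    P = np.asarray(P, dtype=float)
    H = np.asarray(H, dtype=float)   # shape (M, N, N), H[m-1] = H(m)
    return P[None, :, :] * H

def waiting_pmf(C):
    """w[m-1, i] = w_i(m) = sum_j c_ij(m), as in eq. (3)."""
    return C.sum(axis=2)

def waiting_survival(w, n):
    """>w_i(n) = sum_{m > n} w_i(m), as in eq. (4), for n >= 0."""
    return w[n:].sum(axis=0)

# Hypothetical toy parameters: uniform jumps to the other three letters,
# holding time 1 with probability 0.75 and 2 with probability 0.25.
N = 4
P = (np.ones((N, N)) - np.eye(N)) / (N - 1)   # zero diagonal: no virtual transitions
H = np.stack([np.full((N, N), 0.75), np.full((N, N), 0.25)])

C = core_matrices(P, H)
w = waiting_pmf(C)
print(w)                        # waiting-time probabilities per state
print(waiting_survival(w, 1))   # probability of holding longer than 1 position
```

Storing the holding-time matrices as one three-dimensional array keeps the Hadamard product a single broadcasted multiplication; any truncation length `M` is a modelling choice of this sketch.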
We also define the interval transition probability $q_{i,j}(n)$, which is the probability that the SMC is in state $j$ after $n$ time units, given that it entered state $i$ at time $t = 0$ [3]:

$$Q(n) = \{q_{i,j}(n)\}_{i,j \in S} = {}^{>}W(n) + \sum_{m=0}^{n} [P \circ H(m)]\, Q(n - m), \tag{6}$$

where ${}^{>}W(n) = \mathrm{diag}\{{}^{>}w_i(n)\}$.

2.1 The homogeneous case

In the following, the time parameter is replaced by position, since the evolution of a DNA sequence depends on the index position of every letter in the sequence. In order to study the $d$-periodic behaviour of a DNA sequence, we examine the probability that a letter appears every $d$ steps. Thus, we define the following probability:

$$p_i(d) = \text{Prob[the sequence will be in state } i \text{ in position } d \mid \text{it was in state } i \text{ in the initial position]}. \tag{7}$$

It is important to note that for a given DNA sequence we do not know whether the initial position is due to a letter transition or to a reappearance of the same letter, therefore both cases have to be included when calculating the above probability. We now present all the possible instances in which the DNA sequence is in state $i$ after $d$ steps, given that it was in state $i$ in the starting position. Let

$$S_x = \underbrace{i\, i\, i\, i \cdots i}_{x\text{ times}}\; j\, u\, u \cdots u\, i \tag{8}$$

be the sequence of letters of length $d$, where $x = 1, 2, \ldots, d$, $j$ denotes any letter different from $i$, and $u$ denotes any letter from the state space $S = \{A, C, G, T\}$:

$$\begin{aligned}
S_1 &= i\, j\, u\, u \cdots u\, i \\
S_2 &= i\, i\, j\, u\, u \cdots u\, i \\
S_3 &= i\, i\, i\, j\, u\, u \cdots u\, i \\
&\;\;\vdots \\
S_{d-2} &= i\, i \cdots i\, j\, u\, i \\
S_{d-1} &= i\, i\, i \cdots i\, j\, i \\
S_d &= i\, i\, i\, i\, i \cdots i
\end{aligned}$$

The different instances $S_x$ are mutually exclusive and exhaustive events, so a probabilistic argument leads to the following equation for the probability $p_i(d)$:

$$p_i(d) = {}^{>}w_i(d) + \sum_{j \neq i}^{N} \sum_{k=1}^{d} {}^{\geq}c_{i,j}(k)\, q_{j,i}(d - k), \tag{9}$$

where ${}^{\geq}c_{i,j}(k) = p_{i,j} \cdot {}^{\geq}h_{i,j}(k)$, ${}^{\geq}h_{i,j}(k)$ denotes the survival function of the conditional holding times of the states, and $q_{j,i}(d - k)$ is expressed in terms of the basic parameters of the semi-Markov chain through

$$q_{i,j}(n) = \delta_{i,j}\, {}^{>}w_i(n) + \sum_{k \in S} p_{i,k} \sum_{m=0}^{n} h_{i,k}(m)\, q_{k,j}(n - m), \quad i, j \in S,\ n \in \mathbb{N}, \qquad \delta_{i,j} = \begin{cases} 1, & i = j \\ 0, & i \neq j. \end{cases} \tag{10}$$

Equation (9) in matrix form is the following:

$$P(d) = {}^{>}W(d) + \sum_{k=1}^{d} I \circ \Big\{ [{}^{\geq}C(k)\, Q(d - k)]\, [U - I] \Big\}, \tag{11}$$

where $U \in M_{N \times N}$ is a square matrix with all elements equal to 1, ${}^{\geq}C(k) = P \circ {}^{\geq}H(k)$ and $Q(n) = \{q_{i,j}(n)\}$. For the interval transition probability matrix $Q(n)$, instead of using the recursive formula, we can apply the closed analytic form proposed by Vassiliou and Papadopoulou [7]:

$$Q(n) = {}^{>}W(n) + C(n) + \sum_{j=2}^{n} \Big\{ C(j - 1) + \sum_{k=1}^{j-2} S_j(k, m_k) \Big\} \times \Big\{ {}^{>}W(n - j + 1) + C(n - j + 1) \Big\}, \tag{12}$$

where

$$S_j(k, m_k) = \sum_{m_k = 2}^{j-k}\ \sum_{m_{k-1} = 1 + m_k}^{j-k+1} \cdots \sum_{m_1 = 1 + m_2}^{j-1}\ \prod_{r=-1}^{k-1} C(m_{k-r-1} - m_{k-r}) \tag{13}$$

for $j > k + 2$, while $S_j(k, m_k) = 0$ if $j \leq k + 2$.

For a "strongly" periodic chain with period $d$, it is expected that for every periodic state the frequency of appearances of the state every $k \times d$ positions is high. So an interesting question is whether the chain is in the same state not only after the first cycle of length $d$, but also after a number $k$ of successive cycles of the same length.

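The recursion for the interval transition probabilities and the one-cycle return probability of eq. (9) can be sketched numerically as below. This is only an illustration under stated assumptions: holding times are truncated at `M` positions, ${}^{\geq}c_{i,j}(k)$ is taken as $p_{i,j}\,{}^{\geq}h_{i,j}(k)$, and the function names and toy parameters are hypothetical rather than the authors' implementation.

```python
import numpy as np

def interval_transition(P, H, n_max):
    """Q[n] for n = 0..n_max via the recursion (10); H[m-1] = H(m)."""
    M, N, _ = H.shape
    C = P[None, :, :] * H                  # core matrices C(m)
    w = C.sum(axis=2)                      # w[m-1, i] = w_i(m)
    Q = [np.eye(N)]                        # Q(0) = I
    for n in range(1, n_max + 1):
        acc = np.diag(w[n:].sum(axis=0))   # >W(n); zero once n >= M
        for m in range(1, min(n, M) + 1):
            acc = acc + C[m - 1] @ Q[n - m]
        Q.append(acc)
    return Q

def return_probability(P, H, d):
    """p_i(d) following eq. (9), returned as a vector over the states."""
    M, N, _ = H.shape
    Q = interval_transition(P, H, d)
    w = (P[None, :, :] * H).sum(axis=2)
    p = w[d:].sum(axis=0)                        # >w_i(d)
    for k in range(1, min(d, M) + 1):
        C_ge = P * H[k - 1:].sum(axis=0)         # >=C(k) = P o >=H(k)
        # diagonal entry i is sum_j >=c_ij(k) q_ji(d-k); p_ii = 0 enforces j != i
        p = p + np.diag(C_ge @ Q[d - k])
    return p

# Hypothetical toy parameters (not from the paper)
N = 4
P = (np.ones((N, N)) - np.eye(N)) / (N - 1)
H = np.stack([np.full((N, N), 0.75), np.full((N, N), 0.25)])
Q = interval_transition(P, H, 6)
p3 = return_probability(P, H, 3)
print(p3)
```

A simple sanity check is the case in which every holding time equals 1: the chain is then an ordinary Markov chain, $Q(n) = P^n$, and $p_i(d)$ reduces to the diagonal of $P^d$.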
Now, let $P(n, d)$ be a column matrix whose $i$-th element is the probability

$$p_i(n, d) = \text{Prob[the SMC is in state } i \text{ every } d \text{ positions for } n \text{ cycles} \mid \text{the initial state was } i]. \tag{14}$$

Using a probabilistic argument and applying equation (9), we can prove the following equation:

$$P(n, d) = P(n - 1, d) \circ \Big[ {}^{>}W(d) + \sum_{k=1}^{d} I \circ \big[ [{}^{\geq}C(k)\, Q(d - k)]\, [U - I] \big] \Big], \tag{15}$$

where $P(n, d) = \{p_i(n, d)\}$. The initial condition is

$$P(1, d) = {}^{>}W(d) + \sum_{k=1}^{d} I \circ \big[ [{}^{\geq}C(k)\, Q(d - k)]\, [U - I] \big]. \tag{16}$$

Let us define the ratio $R(n)$:

$$R(n) = \big\{ [P(n - 1, d)\, \mathbf{1}] \circ I \big\}^{-1} \cdot P(n, d), \tag{17}$$

where $\mathbf{1} = [1, 1, \ldots, 1]$. The $i$-th element of the matrix $R(n)$ is the ratio of the probability $p_i(n, d)$ over $p_i(n - 1, d)$ for every $n$. It illustrates the variation between the probabilities $p_i(n, d)$ and $p_i(n - 1, d)$ and is used to investigate the periodicity over a number of cycles.

2.2 The case of partial non-homogeneity

The partial non-homogeneous semi-Markov chain (NHSMC) is constructed based on the fact that every amino acid is encoded by three nucleotides (a codon). Using this information we can define three discrete coding positions $k \in \{1, 2, 3\}$, and for the NHSMC we have three stochastic matrices $P(k)$, $k = 1, 2, 3$, for the embedded Markov chain. In order to investigate the periodic behaviour, we define the following probability:

$$p_i(k, d) = \text{Prob[the NHSMC will be in state } i \text{ in position } d \mid \text{initially the NHSMC was in state } i \text{ in coding position } k]. \tag{18}$$

We now present all the possible and mutually exclusive events that realize the event of the probability $p_i(k, d)$:

$$\begin{aligned}
S_1 &= i(k)\, j(k+1)\, u(k+2)\, u(k+3)\, u(k+4) \cdots u((k+d-1) \bmod s)\, i((k+d) \bmod s) \\
S_2 &= i(k)\, i(k+1)\, j(k+2)\, u(k+3)\, u(k+4) \cdots u((k+d-1) \bmod s)\, i((k+d) \bmod s) \\
S_3 &= i(k)\, i(k+1)\, i(k+2)\, j(k+3)\, u(k+4) \cdots u((k+d-1) \bmod s)\, i((k+d) \bmod s) \\
&\;\;\vdots \\
S_{d-2} &= i(k)\, i(k+1)\, i(k+2) \cdots j((k+d-2) \bmod s)\, u((k+d-1) \bmod s)\, i((k+d) \bmod s) \\
S_{d-1} &= i(k)\, i(k+1)\, i(k+2)\, i(k+3) \cdots j((k+d-1) \bmod s)\, i((k+d) \bmod s) \\
S_d &= i(k)\, i(k+1)\, i(k+2)\, i(k+3)\, i(k+4)\, i(k+5) \cdots i((k+d) \bmod s),
\end{aligned}$$

where $j(\cdot) \neq i(\cdot)$ and $u(\cdot)$ denotes a letter from the state space $S$. It is easy to show that the different events $S_x$ are mutually exclusive and cover the whole sample space, thus we obtain the following equation for the probability $p_i(k, d)$:

$$p_i(k, d) = {}^{>}w_i(k, d) + \sum_{j \neq i}^{N} \sum_{x=1}^{d} {}^{\geq}c_{i,j}(k, x)\, q_{j,i}((k + x) \bmod s,\ d - x). \tag{19}$$

The quantities ${}^{>}w_i(\cdot)$, ${}^{\geq}c_{i,j}(\cdot)$ and $q_{j,i}(\cdot)$ are functions of the basic parameters $P(k)$ and $H(m)$ of the NHSMC. The interval transition probabilities $q_{j,i}(\cdot)$ are expressed by the following equation:

$$q_{i,j}(k, n) = \delta_{i,j}\, {}^{>}w_i(k, n) + \sum_{x \in S} p_{i,x}(k) \sum_{m=0}^{n} h_{i,x}(m)\, q_{x,j}(k + m,\ n - m), \quad i, j \in S,\ n \in \mathbb{N}, \tag{20}$$

where ${}^{>}w_i(k, n)$ denotes the survival function of the unconditional holding times for the state $i$.

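A numerical sketch of eqs. (19) and (20) is given below, under assumptions that are not part of the paper: coding positions are indexed from 0 to $s-1$ so that the modulo operation maps directly to array indices, the shifted position $k+m$ in eq. (20) is reduced modulo $s$, holding times are truncated at `M` positions, and all names and toy matrices are hypothetical.

```python
import numpy as np

def nh_interval_transition(P_list, H, n_max):
    """Q[k, n] = Q(k, n) for coding positions k = 0..s-1 and n = 0..n_max."""
    P_list = [np.asarray(Pk, dtype=float) for Pk in P_list]
    s = len(P_list)
    M, N, _ = H.shape
    Q = np.zeros((s, n_max + 1, N, N))
    Q[:, 0] = np.eye(N)                                # Q(k, 0) = I
    for n in range(1, n_max + 1):
        tail = H[n:].sum(axis=0)                       # >H(n)
        for k in range(s):
            Pk = P_list[k]
            acc = np.diag((Pk * tail).sum(axis=1))     # >W(k, n)
            for m in range(1, min(n, M) + 1):
                acc = acc + (Pk * H[m - 1]) @ Q[(k + m) % s, n - m]
            Q[k, n] = acc
    return Q

def nh_return_probability(P_list, H, k, d):
    """p_i(k, d) following eq. (19), as a vector over the states."""
    s = len(P_list)
    M = H.shape[0]
    Q = nh_interval_transition(P_list, H, d)
    Pk = np.asarray(P_list[k], dtype=float)
    p = (Pk * H[d:].sum(axis=0)).sum(axis=1)           # >w_i(k, d)
    for x in range(1, min(d, M) + 1):
        C_ge = Pk * H[x - 1:].sum(axis=0)              # >=C(k, x)
        p = p + np.diag(C_ge @ Q[(k + x) % s, d - x])
    return p

# Hypothetical toy parameters: three embedded matrices, one per coding position
P0 = (np.ones((4, 4)) - np.eye(4)) / 3
P1 = np.array([[0.0, 0.5, 0.5, 0.0],
               [0.5, 0.0, 0.0, 0.5],
               [0.25, 0.25, 0.0, 0.5],
               [0.1, 0.2, 0.7, 0.0]])
P_list = [P0, P1, P0]
H = np.stack([np.full((4, 4), 0.75), np.full((4, 4), 0.25)])
Q = nh_interval_transition(P_list, H, 6)
print(nh_return_probability(P_list, H, k=0, d=3))
```

When the three matrices coincide, the sketch collapses to the homogeneous recursion, which provides a convenient consistency check.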
Using matrix notation, we can write equation (19) as

$$P(k, d) = {}^{>}W(k, d) + \sum_{x=1}^{d} I \circ \big[ [{}^{\geq}C(k, x)\, Q((k + x) \bmod s,\ d - x)]\, [U - I] \big]. \tag{21}$$

The elements of the matrix $Q((k + x) \bmod s,\ d - x)$ are the interval transition probabilities of the NHSMC, which can be expressed by the following recursive formula:

$$Q(s, n) = {}^{>}W(s, n) + \sum_{m=1}^{n} C(s, m)\, Q(s + m,\ n - m). \tag{22}$$

For the recursive equation (22) of the interval transition probabilities of the NHSMC, we also have the closed analytic form [7]:

$$Q(k, n) = {}^{>}W(k, n) + C(k, n) + \sum_{j=2}^{n} \Big\{ C(k, j - 1) + \sum_{x=1}^{j-2} S_j(x, k, m_x) \Big\} \times \Big\{ {}^{>}W(k + j - 1,\ n - j + 1) + C(k + j - 1,\ n - j + 1) \Big\}, \tag{23}$$

where

$$S_j(x, k, m_x) = \sum_{m_x = 2}^{j-x}\ \sum_{m_{x-1} = 1 + m_x}^{j-x+1} \cdots \sum_{m_1 = 1 + m_2}^{j-1}\ \prod_{r=-1}^{x-1} C(k + m_{x-r} - 1,\ m_{x-r-1} - m_{x-r}) \tag{24}$$

for $j > x + 2$, while $S_j(x, k, m_x) = 0$ if $j \leq x + 2$.

As in the homogeneous case, we are interested in the sequence being in the same state not only after $d$ steps, but also after a number $n$ of successive cycles of length $d$, given that its initial coding position was $k$. Let $P(k, n, d)$ be a column matrix whose $i$-th element is the probability

$$p_i(k, n, d) = \text{Prob[the NHSMC will be in state } i \text{ every } d \text{ positions for } n \text{ cycles} \mid \text{the initial state was } i \text{ in coding position } k]. \tag{25}$$

Using a probabilistic argument and equation (21), we can prove the following equation:

$$P(k, n, d) = P(k, n - 1, d) \circ \Big[ {}^{>}W(k, d) + \sum_{x=1}^{d} I \circ \big[ [{}^{\geq}C(k, x)\, Q((k + x) \bmod s,\ d - x)]\, [U - I] \big] \Big], \tag{26}$$

where $U = \{u_{i,j}\}_{i,j \in S}$ with $u_{i,j} = 1$ for all $i, j$, ${}^{>}W(k, d) = \mathrm{diag}\{{}^{>}w_i(k, d)\}$, ${}^{\geq}C(k, m) = P(k) \circ {}^{\geq}H(m)$ and $Q(k, n) = \{q_{i,j}(k, n)\}_{i,j \in S}$.

The initial condition is

$$P(k, 1, d) = {}^{>}W(k, d) + \sum_{x=1}^{d} I \circ \big[ [{}^{\geq}C(k, x)\, Q((k + x) \bmod s,\ d - x)]\, [U - I] \big]. \tag{27}$$

We define the ratio $R(k, n)$:

$$R(k, n) = \big\{ [P(k, n - 1, d)\, \mathbf{1}] \circ I \big\}^{-1} \cdot P(k, n, d), \tag{28}$$

where $\mathbf{1} = [1, 1, \ldots, 1]$. The $i$-th element of the matrix $R(k, n)$ is the ratio of the probability $p_i(k, n, d)$ over $p_i(k, n - 1, d)$ for every $n$. It illustrates the variation between the probabilities $p_i(k, n, d)$ and $p_i(k, n - 1, d)$ and is used to investigate the periodicity over a number of cycles for a specific coding position $k$.

3 Illustrations of real and synthetic data

For the illustrations of the homogeneous semi-Markov model, synthetic DNA sequences as well as real genomic and mRNA sequences were used. The coding sequence was the human dystrophin mRNA, and the non-coding sequence used for comparison was the human b-nerve growth factor gene (BNGF). We assumed that each of the sequences could be described by a homogeneous semi-Markov chain $\{X_t\}_{t=0}^{\infty}$ with state space $S = \{A, C, G, T\}$, where the index $t$ denotes the position of each nucleotide inside the sequence. The basic parameters $p_{i,j}(k)$ and $h_{i,j}(m)$ of the SMC were estimated using the empirical estimators

$$\hat{p}_{i,j}(k) = \frac{N(i(k) \to j)}{\sum_{x \in S} N(i(k) \to x)} \quad \text{and} \quad \hat{h}_{i,j}(m) = \frac{N(i \to j, m)}{\sum_{x \in S} N(i \to x, m)}, \tag{29}$$

where $N(i(k) \to j)$ denotes the number of transitions from state $i$ to state $j$ starting from coding position $k$, and $N(i \to j, m)$ denotes the number of transitions from state $i$ to state $j$ after the SMC remained in state $i$ for $m$ positions. In order to estimate the initial condition, that is, the probabilities of the matrix $P(k, 1, d)$, the first 10 cycles of length 3 were used to estimate the basic parameters $P(k)$ and $H(m)$.

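Counting-based estimation of this kind can be sketched as follows. The sketch deliberately differs from eq. (29) in one respect, which is an assumption of this example: it estimates the embedded probabilities and the core entries directly as $\hat{p}_{i,j} = N(i \to j) / N(i \to \cdot)$ and $\hat{c}_{i,j}(m) = N(i \to j, m) / N(i \to \cdot)$ for the homogeneous case, ignoring the coding position. The function name is hypothetical; the first run is counted although its true start is unknown, and the last run is dropped because no transition follows it.

```python
from collections import Counter
from itertools import groupby

def estimate_smc(seq):
    """Empirical embedded probabilities and core entries from one sequence.

    Returns (p_hat, c_hat) with p_hat[(i, j)] and c_hat[(i, j, m)].
    """
    # Collapse the sequence into runs: (letter, run length)
    runs = [(letter, len(list(group))) for letter, group in groupby(seq)]
    n_ij, n_ijm, n_i = Counter(), Counter(), Counter()
    for (i, m), (j, _) in zip(runs, runs[1:]):
        n_ij[i, j] += 1        # transition i -> j
        n_ijm[i, j, m] += 1    # transition i -> j after holding m positions
        n_i[i] += 1            # all transitions out of i
    p_hat = {(i, j): n / n_i[i] for (i, j), n in n_ij.items()}
    c_hat = {(i, j, m): n / n_i[i] for (i, j, m), n in n_ijm.items()}
    return p_hat, c_hat

p_hat, c_hat = estimate_smc("AACAGAAC")
print(p_hat)
print(c_hat)
```

With these estimates, a conditional holding-time estimate can be recovered as $\hat{c}_{i,j}(m) / \hat{p}_{i,j}$ whenever $\hat{p}_{i,j} > 0$.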
After that, for each cycle $n$, the core matrix $C(k, m)$ was estimated using the letters of the sequence up to position $n \cdot d + k$. This cumulative re-estimation corrects the estimates, because in the current application the period is short ($d = 3$) and a single cycle does not provide an adequate sample size. Finally, the probability that the chain is in the same state every $n \cdot d$ positions was calculated using

$$P(k, n, d) = P(k, n - 1, d) \circ \Big[ {}^{>}W(k, d) + \sum_{x=1}^{d} I \circ \big[ [{}^{\geq}C(k, x)\, Q((k + x) \bmod s,\ d - x)]\, [U - I] \big] \Big]. \tag{30}$$

3.1 DNA sequences of synthetic data

Example 1: Comparison between random and periodic DNA sequences

Let $L$ be a DNA sequence of length $N = 1000$ of the form

$$L = \{U, U, U, U, U, U, U, U, U, U, U, U, U, \ldots\},$$

where each letter $U$ is a nucleotide drawn from the uniform distribution, Prob[$U = A$] = Prob[$U = C$] = Prob[$U = G$] = Prob[$U = T$] = $\frac{1}{4}$. A sequence of this kind is not expected to exhibit any periodic behaviour; the probability matrix $P(n, d)$ for $d = 3$ is nevertheless estimated for comparison. The estimate of the embedded Markov matrix $P$ is

$$P = \begin{pmatrix} 0 & 0.2 & 0.8 & 0 \\ 0.375 & 0 & 0.5 & 0.125 \\ 0.125 & 0.5 & 0 & 0.375 \\ 0.25 & 0.75 & 0 & 0 \end{pmatrix}$$

and the core matrix $C(m)$ is

$$C(1) = \begin{pmatrix} 0 & 0 & 0.8 & 0 \\ 0.375 & 0 & 0.5 & 0.125 \\ 0.125 & 0.375 & 0 & 0.375 \\ 0.25 & 0.5 & 0 & 0 \end{pmatrix} \quad \text{and} \quad C(2) = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0.125 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix},$$

while the only non-zero element of $C(3)$ is $c_{4,2}(3) = 0.25$. The initial condition $P(1, 3)$ is

$$P(1, 3) = \begin{pmatrix} 0.32 \\ 0.34 \\ 0.42 \\ 0.27 \end{pmatrix}.$$

Figure 1: R(n) for the synthetic DNA sequence of a uniform distribution

Now let $L$ be a DNA sequence of length $N = 1000$ of the form

$$L = \{A, U, U, A, U, U, A, U, U, A, U, U, A, \ldots\},$$

where the letter $A$ corresponds to adenine, while the letter $U$ corresponds to any nucleotide drawn from the uniform distribution, Prob[$U = A$] = Prob[$U = C$] = Prob[$U = G$] = Prob[$U = T$] = $\frac{1}{4}$. We investigate the periodic behaviour of period $d = 3$. One can notice that the letter $A$ can have a positive waiting time probability $w_A(m)$ for every $m$. On the other hand, for the other three letters $C$, $G$, $T$, the waiting time probabilities are zero if $m$ exceeds two, since among any 3 consecutive letters there is always a letter $A$. The estimated embedded Markov matrix $P$ is

$$P = \begin{pmatrix} 0 & 0.30 & 0.30 & 0.40 \\ 0.73 & 0 & 0.15 & 0.12 \\ 0.69 & 0.17 & 0 & 0.14 \\ 0.70 & 0.14 & 0.16 & 0 \end{pmatrix}$$

and the core matrix is

$$C(1) = \begin{pmatrix} 0 & 0.19 & 0.16 & 0.27 \\ 0.60 & 0 & 0.15 & 0.13 \\ 0.56 & 0.17 & 0 & 0.15 \\ 0.50 & 0.14 & 0.16 & 0 \end{pmatrix} \quad \text{and} \quad C(2) = \begin{pmatrix} 0 & 0.08 & 0.11 & 0.09 \\ 0.13 & 0 & 0 & 0 \\ 0.13 & 0 & 0 & 0 \\ 0.20 & 0 & 0 & 0 \end{pmatrix},$$

while the other matrices $C(m)$ for $m > 2$ have non-zero elements only in the first row. The initial condition $P(1, 3)$ is

$$P(1, 3) = \begin{pmatrix} 0.83 \\ 0.18 \\ 0.20 \\ 0.25 \end{pmatrix}.$$

The probability that the chain is in state $A$ every $d = 3$ positions, starting from state $A$, is greater than the corresponding probability for the other three states, as expected. However, the probability $p_A(n, 3)$ is lower than 1, because the SMC is also allowed to be in state $A$ in between, within a periodic cycle.

Figure 2: R(n) for the synthetic DNA sequence with 3-base periodicity of adenine

Example 2: Detection of periodic regions inside a sequence

Let $L$ be a DNA sequence of length $N = 5000$ of the form

$$L = \{U, U, U, U, U, U, U, U, U, U, U, U, U, \ldots\},$$

where the letter $U$ corresponds to any random nucleotide. In the intervals 1500–2000 and 3000–3500, which correspond to the cycles 500–666 and 1000–1166, the letter $U$ has been substituted with the letter $A$, starting from the first position of each interval and then every 3 positions.

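A synthetic sequence of the kind described in Example 2 could be generated as in the sketch below. The function name, the fixed random seed, the 0-based indexing and the half-open interval convention are assumptions made for illustration; the paper does not specify how its sequences were generated.

```python
import random

def synthetic_sequence(length, periodic_intervals=(), period=3, letter="A", seed=0):
    """Uniform random DNA sequence with `letter` planted every `period`
    positions inside each half-open interval [start, stop)."""
    rng = random.Random(seed)
    seq = [rng.choice("ACGT") for _ in range(length)]
    for start, stop in periodic_intervals:
        for pos in range(start, stop, period):
            seq[pos] = letter
    return "".join(seq)

# Two periodic regions inside a random background, as in Example 2
seq = synthetic_sequence(5000, [(1500, 2000), (3000, 3500)])
print(seq[1500:1530])
```

Planting the letter only at every third position leaves the two positions in between random, so adenine may still appear there by chance.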
Figure 3 shows the values of the ratio $R(n)$ for the letter $A$, where the green regions are the cycles in which the sequence $R(n)$ is increasing, while the red regions are the cycles in which $R(n)$ decreases. It can be observed from the figure that the regions in which periodic behaviour was synthetically added are clearly coloured green.

Figure 3: R(n) of the letter A of the synthetic sequence with periodicity in the cycles 500-666 and 1000-1166

3.2 DNA sequences of real data

The information about the periodic behaviour of the coding regions of the genome could possibly be used to distinguish these regions within a very long DNA sequence. For the coding sequences of real DNA, the human dystrophin mRNA was used, while for the non-coding region, the human b-nerve growth factor was used. These sequences have a length greater than 5000 bases and have already been studied for periodic behaviour [6].

Figure 4: R(n) for the human dystrophin mRNA sequence

Figure 5: R(n) for the human b-nerve growth factor sequence

The probabilities $p_i(k, n, d)$ converge to zero, as they are a product of $n$ probabilities. The most important quantities in the investigation are the initial probability $P(k, 1, d)$, which contains the probabilities that the chain is in the same state after $d$ positions, and the ratio $R(k, n)$, which measures the relationship between the probabilities of the current cycle and the previous one. If the values of $R(k, n)$ are high, then the probabilities $p_i(k, n, d)$ decrease at a slow rate, while if the values of $R(k, n)$ are low, then the probabilities $p_i(k, n, d)$ decrease at a fast rate.
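The relationship between the cumulative probabilities and the ratio can be illustrated with a short sketch. By the recursion for $P(k, n, d)$, each cycle multiplies the previous probability element-wise by a per-cycle factor, so the ratio for cycle $n$ is exactly that factor. The function name and the constant factors used below are hypothetical and only serve to show the slow and fast decay regimes; factors are assumed to be strictly positive so that the division is defined.

```python
import numpy as np

def cumulative_and_ratio(cycle_factors):
    """cycle_factors[n-1, i]: per-cycle factor of state i in cycle n.

    Returns P with P[n-1, i] = p_i(n, d) and R with
    R[n-2, i] = p_i(n, d) / p_i(n-1, d) for n >= 2.
    """
    f = np.asarray(cycle_factors, dtype=float)
    P = np.cumprod(f, axis=0)    # product of the first n factors
    R = P[1:] / P[:-1]           # element-wise ratio of successive cycles
    return P, R

# Hypothetical example: state 0 has a high per-cycle factor, state 1 a low one
factors = np.tile([0.8, 0.3], (20, 1))
P, R = cumulative_and_ratio(factors)
print(P[-1])    # both entries shrink towards zero, at different speeds
print(R[0])     # the ratio recovers the per-cycle factor
```

In practice the per-cycle factor changes from cycle to cycle because the parameters are re-estimated, which is what makes the ratio informative along the sequence.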
One can notice that for the human dystrophin mRNA sequence, the nucleotide A has a higher chance of appearing every 3 positions, while all the other nucleotides behave alike. On the other hand, for the human b-nerve growth factor, all the nucleotides have approximately the same probability of appearing every 3 positions.

4 Conclusion

In the present paper, a method was developed to investigate the periodicity of DNA sequences. The model was built on a semi-Markov chain, and the basic parameters were calculated using recursive equations for a number of cycles of a specified length. The development of this method was motivated by a central problem in computational biology, namely the identification of coding and non-coding regions within a long DNA sequence. From previous studies, it is known that the coding regions of a gene have a different structure from the non-coding regions, as they exhibit a characteristic tendency of some nucleotides to repeat every 3 positions. Using this fact and modelling a DNA sequence as a semi-Markov chain, the probabilities that the chain is in the same state every $d$ positions over the entire length were calculated. The numerical results of applying the model to actual data agreed with the previous studies: periodic behaviour was apparent in the coding segments, unlike the non-coding segments, which did not show similar behaviour. For the estimation of the parameters, a correction procedure was applied, due to the short duration of the period ($d = 3$) in the specific application. The algorithm could potentially be used as an initial method for investigating periodicity in any DNA sequence, and it could also be used to separate two different DNA segments in terms of periodic behaviour.
Although the examples produced satisfactory results, they should be interpreted with caution, due to the complexity of the structure of DNA and its various peculiarities. For example, additional parameters could be included in the model, such as the sequence length, the frequencies of each nucleotide, the open reading frames (ORFs), the species of the organism, the mutations, and others. Also, because the characteristic of periodicity in DNA sequences persists even when there are small perturbations in the cycle of the period, such as a shift in the position of a letter, an interesting question for general modelling would be to study the problem under such perturbations.

References

[1] Chris Burge and Samuel Karlin, Prediction of complete gene structures in human genomic DNA, Journal of Molecular Biology 268 (1997), no. 1, 78–94.
[2] Amir B. Cohanim, Edward N. Trifonov, and Yechezkel Kashi, Specific selection pressure at the third codon positions: contribution to 10- to 11-base periodicity in prokaryotic genomes, Journal of Molecular Evolution 63 (2006), no. 3, 393–400.
[3] Ronald A. Howard, Dynamic probabilistic systems: Markov models, vol. 2, Courier Corporation, 2012.
[4] Bilal Salih, Vijay Tripathi, and Edward N. Trifonov, Visible periodicity of strong nucleosome DNA sequences, Journal of Biomolecular Structure and Dynamics 33 (2015), no. 1, 1–9.
[5] Edward N. Trifonov and Joel L. Sussman, The pitch of chromatin DNA is reflected in its nucleotide sequence, Proceedings of the National Academy of Sciences 77 (1980), no. 7, 3816–3820.
[6] Anastasios A. Tsonis, James B. Elsner, and Panagiotis A. Tsonis, Periodicity in DNA coding sequences: Implications in gene evolution, Journal of Theoretical Biology 151 (1991), no. 3, 323–331.
[7] P.-C. G. Vassiliou and Aleka Papadopoulou, Non-homogeneous semi-Markov systems and maintainability of the state sizes, Journal of Applied Probability 29 (1992), no. 3, 519–534.
[8] Changchuan Yin and Stephen S.-T. Yau, Prediction of protein coding regions by the 3-base periodicity analysis of a DNA sequence, Journal of Theoretical Biology 247 (2007), no. 4, 687–694.
