Poisson approximation for search of rare words in DNA sequences
Authors: Nicolas Vergne (Université d'Évry Val d'Essonne, France), Miguel Abadi (Universidade de Campinas, Brazil)
Nicolas Vergne$^{a,*}$, Miguel Abadi$^{b}$

$^a$ Université d'Évry Val d'Essonne, Département Mathématiques, Laboratoire Statistique et Génome, 91000 Évry, France
$^b$ Universidade de Campinas, Brasil

Abstract

Using recent results on the occurrence times of a string of symbols in a stochastic process with mixing properties, we present a new method for the search of rare words in biological sequences generally modelled by a Markov chain. We obtain a bound on the error between the distribution of the number of occurrences of a word in a sequence (under a Markov model) and its Poisson approximation. A global bound is already given by a Chen-Stein method. Our approach, the $\psi$-mixing method, gives local bounds. Since we only need the error in the tails of the distribution, the global uniform bound of Chen-Stein is too large, and it is better to consider local bounds. We search for two thresholds on the number of occurrences from which we can regard the studied word as an over-represented or an under-represented one. A biological role is suggested for these over- or under-represented words. Our method gives such thresholds for a panel of words much broader than the Chen-Stein method. Comparing the methods, we observe a better accuracy for the $\psi$-mixing method for the bound of the tails of the distribution. We also present the software PANOW$^1$, dedicated to the computation of the error term and the thresholds for a studied word.

Key words: Poisson approximation, Chen-Stein method, mixing processes, Markov chains, rare words, DNA sequences

$^1$ available at http://stat.genopole.cnrs.fr/software/panowdir/
$^*$ Corresponding author.
Email addresses: nicolas.vergne@genopole.cnrs.fr (Nicolas Vergne), abadi@ime.unicamp.br (Miguel Abadi).
Preprint submitted to Elsevier, February 28, 2022

1 Introduction

Modelling DNA sequences with stochastic models and developing statistical methods to analyse the enormous set of data that results from the multiple projects of DNA sequencing are challenging questions for statisticians and biologists. Many DNA sequence analyses are based on the distribution of the occurrences of patterns having some special biological function. The most popular model in this domain is the Markov chain model, which gives a description of the local behaviour of the sequence (see Almagor [6], Blaisdell [10], Phillips et al. [23], Gelfand et al. [16]). An important problem is to determine the statistical significance of a word frequency in a DNA sequence. Nicodème et al. [21] discuss the relevance of finding over- or under-represented words. The naive idea is the following: a word may have a significantly low frequency in a DNA sequence because it disrupts replication or gene expression, whereas a significantly frequent word may have a fundamental activity with regard to genome stability. Well-known examples of words with exceptional frequencies in DNA sequences are biological palindromes corresponding to restriction sites avoided for instance in E. coli (Karlin et al. [18]), the Cross-over Hotspot Instigator sites in several bacteria, again in E. coli for example (Smith et al. [29], El Karoui et al. [14]), and uptake sequences (Smith et al. [30]) or polyadenylation signals (van Helden et al. [33]). The exact distribution of the number of occurrences of a word under the Markovian model is known and some software packages are available (Robin and Daudin [28], Régnier [25]) but, because of numerical complexity, they are often used to compute only the expectation and variance of a given count (and thus use, in fact, Gaussian approximations for the distribution).
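As an illustration of this modelling step, the following sketch (with hypothetical helper names, and the stationary law approximated by empirical letter frequencies) fits a first-order Markov chain to a sequence and evaluates the probability that a word occurs at a fixed position:

```python
from collections import Counter

def fit_transitions(seq):
    """Maximum-likelihood transition probabilities of a first-order Markov chain."""
    pair_counts = Counter(zip(seq, seq[1:]))
    row_counts = Counter(seq[:-1])
    return {(x, y): c / row_counts[x] for (x, y), c in pair_counts.items()}

def word_probability(word, seq):
    """Probability that `word` starts at a fixed position, with the stationary
    law of the first letter estimated by its empirical frequency."""
    trans = fit_transitions(seq)
    p = seq.count(word[0]) / len(seq)      # stationary weight of the first letter
    for x, y in zip(word, word[1:]):
        p *= trans.get((x, y), 0.0)        # chain rule along the word
    return p

# On a periodic toy sequence, every 'a' is followed by 'c':
p = word_probability("ac", "acgt" * 100)   # 0.25 * 1.0
```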
In fact, these methods are not efficient for long sequences or if the Markov model order is larger than 2 or 3. For such cases, several approximations are possible: Gaussian approximations (Prum et al. [24]), binomial (or Poisson) approximations (van Helden et al. [32], Godbole [17]), compound Poisson approximations (Reinert and Schbath [26]), or a large deviations approach (Nuel [22]). In this paper we only focus on the Poisson approximation. We approximate $P(N(A) = k)$ by $e^{-tP(A)} [tP(A)]^k (k!)^{-1}$, where $P(N(A) = k)$ is the stationary probability under the Markov model that the number of occurrences $N(A)$ of word $A$ is equal to $k$, $P(A)$ is the probability that word $A$ occurs at a given position, and $t$ is the length of the sequence. Intuitively, a binomial distribution could be used to approximate the distribution of occurrences of a particular word: the length $t$ of the sequence is large, $P(A)$ is small, and $tP(A)$ is almost constant. Thus, we use the more numerically convenient Poisson approximation. Our aim is to bound the error between the distribution of the number of occurrences of word $A$ and its Poisson approximation. In Reinert and Schbath [26], the authors prove an upper bound for a compound Poisson approximation. They use a Chen-Stein method, which is the usual method for this purpose. This method has been developed by Chen for Poisson approximations (Chen [12]) after a work of Stein on normal approximations (Stein [31]). Its principle is to bound the difference between the two distributions in total variation distance over all subsets of the definition domain. Since we are interested in under- or over-represented words, we are only interested in this difference for the tails of the distributions. Hence, the uniform bound given by the Chen-Stein method is too large for our purpose.
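The quality of this approximation is easy to check numerically in an independence caricature of the model, where occurrences at distinct positions are treated as i.i.d. Bernoulli trials (an assumption the Markov model does not satisfy, which is precisely why error bounds are needed):

```python
from math import comb, exp, factorial

def binomial_pmf(t, p, k):
    return comb(t, k) * p**k * (1 - p) ** (t - k)

def poisson_pmf(lam, k):
    return exp(-lam) * lam**k / factorial(k)

# A rare word in a long sequence: t is large, P(A) is small, t*P(A) is moderate.
t, p = 100_000, 2e-5
lam = t * p                                # expected count = 2
worst = max(abs(binomial_pmf(t, p, k) - poisson_pmf(lam, k)) for k in range(20))
```

The worst pointwise gap is of the order $t P(A)^2$ here, far smaller than the probabilities themselves.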
We present here a new method, based on the properties of mixing processes. Our method has the useful particularity of giving a bound on the error at each point of the distribution. More precisely, it offers an error term $\epsilon(A, k)$, for the number of occurrences $k$ of word $A$:
$$\left| P(N(A) = k) - e^{-tP(A)} \frac{(tP(A))^k}{k!} \right| \le \epsilon(A, k).$$
Moreover, $\epsilon(A, k)$ decays factorially fast with respect to $k$. Abadi [1,2] presents lower and upper bounds for the exponential approximation of the first occurrence time of a rare event, also called hitting time, in a stationary stochastic process on a finite alphabet with $\alpha$- or $\phi$-mixing property. Abadi and Vergne [4] describe the statistics of return times of a string of symbols in such a process. In Abadi and Vergne [5], the authors prove a Poisson approximation for the distribution of occurrence times of a string of symbols in a $\phi$-mixing process. The first part of our work is to determine some constants not explicitly computed in the results of the above-mentioned articles but necessary for the proof of our theorem. Our work is complementary to all these articles, in the sense that it relies on them for preliminary results and adapts them to $\psi$-mixing processes. Since Markov chains are mixing processes, all these results established for mixing processes also apply to the Markov chains which model biological sequences. This paper is organised in the following way. In Section 2, we introduce the Chen-Stein method. In Section 3, we define a $\psi$-mixing process and state some preliminary notations, mostly on the properties of a word. We also present in this section the principal result of our work: the Poisson approximation (Theorem 5). In Section 4, we state preliminary results.
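The two thresholds mentioned in the abstract can be read off such a bound. As a simplified illustration (ignoring the error term $\epsilon(A, k)$ and keeping only the Poisson term), one can scan the Poisson tail for the smallest count that is significant at level $\alpha$:

```python
from math import exp

def poisson_sf(lam, k):
    """P(N >= k) for N ~ Poisson(lam), by summing the pmf up to k - 1."""
    if k == 0:
        return 1.0
    term, cdf = exp(-lam), exp(-lam)
    for i in range(1, k):
        term *= lam / i
        cdf += term
    return 1.0 - cdf

def over_representation_threshold(lam, alpha):
    """Smallest k with P(N >= k) <= alpha under the plain Poisson
    approximation (the error term epsilon(A, k) is ignored in this sketch)."""
    k = 0
    while poisson_sf(lam, k) > alpha:
        k += 1
    return k
```

For example, with an expected count of 2, observing 6 or more occurrences is already significant at the 5% level.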
Mainly, we recall results of Abadi [2], computing all the necessary constants, and we present lemmas and propositions necessary for the proof of Theorem 5. In Section 5, we establish the proof of our main result: Theorem 5 on Poisson approximation. Using $\psi$-mixing properties and preliminary results, we prove an upper bound for the difference between the exact distribution of the number of occurrences of word $A$ and the Poisson distribution of parameter $tP(A)$. Section 6 is dedicated to numerical results. For the search of over-represented words, we compare our method to the Chen-Stein method on both synthetic and biological data. In this section, we also present results obtained by a similar method, the $\phi$-mixing method. We end the paper presenting some examples of biological applications, and some conclusions and perspectives of future works.

2 The Chen-Stein method

2.1 Total variation distance

Definition 1 For any two random variables $X$ and $Y$ with values in the same discrete space $E$, the total variation distance between their probability distributions is defined by
$$d_{TV}(\mathcal{L}(X), \mathcal{L}(Y)) = \frac{1}{2} \sum_{i \in E} |P(X = i) - P(Y = i)|.$$
We remark that for any subset $S$ of $E$,
$$|P(X \in S) - P(Y \in S)| \le d_{TV}(\mathcal{L}(X), \mathcal{L}(Y)).$$

2.2 The Chen-Stein method

The Chen-Stein method is used to bound the error between the distribution of the number of occurrences of a word $A$ in a sequence $X$ and the Poisson distribution with parameter $tP(A)$, where $t$ is the length of the sequence and $P(A)$ the stationary measure of $A$. The Chen-Stein method for Poisson approximation has been developed by Chen [12]; a friendly exposition is in Arratia et al. [7] and a description with many examples can be found in Arratia et al. [8] and Barbour et al. [9]. We will use Theorem 1 in Arratia et al. [8] with an improved bound by Barbour et al.
[9] (Theorem 1.A and Theorem 10.A). First, we fix a few notations. Let $\mathcal{A}$ be a finite set (for example, in the DNA case $\mathcal{A} = \{a, c, g, t\}$). Put $\Omega = \mathcal{A}^{\mathbb{Z}}$. For each $x = (x_m)_{m \in \mathbb{Z}} \in \Omega$, we denote by $X_m$ the $m$-th coordinate of the sequence $x$: $X_m(x) = x_m$. We denote by $T : \Omega \to \Omega$ the one-step-left shift operator, so that $(T(x))_m = x_{m+1}$. We denote by $\mathcal{F}$ the $\sigma$-algebra over $\Omega$ generated by strings, and by $\mathcal{F}_I$ the $\sigma$-algebra generated by strings with coordinates in $I$, with $I \subseteq \mathbb{Z}$. We consider an invariant probability measure $P$ over $\mathcal{F}$. Consider a stationary Markov chain $X = (X_i)_{i \in \mathbb{Z}}$ on the finite alphabet $\mathcal{A}$. Let us fix a word $A = (a_1, \dots, a_n)$. For $i \in \{1, 2, \dots, t - n + 1\}$, let $Y_i$ be the following random variable:
$$Y_i = Y_i(A) = 1\{\text{word } A \text{ appears at position } i \text{ in the sequence}\} = 1\{(X_i, \dots, X_{i+n-1}) = (a_1, \dots, a_n)\},$$
where $1\{F\}$ denotes the indicator function of the set $F$. We put $Y = \sum_{i=1}^{t-n+1} Y_i$, the random variable corresponding to the number of occurrences of a word, $E(Y_i) = m_i$ and $\sum_{i=1}^{t-n+1} m_i = m$. Then $E(Y) = m$. Let $Z$ be a Poisson random variable with parameter $m$: $Z \sim \mathcal{P}(m)$. For each $i$, we arbitrarily define a set $V(i) \subset \{1, 2, \dots, t - n + 1\}$ containing the point $i$. The set $V(i)$ will play the role of a neighbourhood of $i$.

Theorem 2 (Arratia et al. [8], Barbour et al. [9]) Let $I$ be an index set. For each $i \in I$, let $Y_i$ be a Bernoulli random variable with $p_i = P(Y_i = 1) > 0$. Suppose that, for each $i \in I$, we have chosen $V(i) \subset I$ with $i \in V(i)$. Let $Z_i$, $i \in I$, be independent Poisson variables with mean $p_i$.
The total variation distance between the dependent Bernoulli process $Y = \{Y_i, i \in I\}$ and the Poisson process $Z = \{Z_i, i \in I\}$ satisfies
$$d_{TV}(\mathcal{L}(Y), \mathcal{L}(Z)) \le b_1 + b_2 + b_3,$$
where
$$b_1 = \sum_i \sum_{j \in V(i)} E(Y_i) E(Y_j), \qquad b_2 = \sum_i \sum_{j \in V(i), j \ne i} E(Y_i Y_j), \qquad b_3 = \sum_i E\left| E\left( Y_i - p_i \mid Y_j,\ j \notin V(i) \right) \right|.$$
Moreover, if $W = \sum_{i \in I} Y_i$ and $\lambda = \sum_{i \in I} p_i < \infty$, then
$$d_{TV}(\mathcal{L}(W), \mathcal{P}(\lambda)) \le \frac{1 - e^{-\lambda}}{\lambda}(b_1 + b_2) + \min\left(1, \sqrt{\frac{2}{\lambda e}}\right) b_3.$$
We think of $V(i)$ as a neighbourhood of strong dependence of $Y_i$. Intuitively, $b_1$ describes the contribution related to the size of the neighbourhood and the weights of the random variables in that neighbourhood; if all $Y_i$ had the same probability of success, then $b_1$ would be directly proportional to the neighbourhood size. The term $b_2$ accounts for the strength of the dependence inside the neighbourhood; as it depends on the second moments, it can be viewed as a "second order interaction" term. Finally, $b_3$ is related to the strength of dependence of $Y_i$ on the random variables outside its neighbourhood. In particular, note that $b_3 = 0$ if $Y_i$ is independent of $\{Y_j \mid j \notin V(i)\}$. One consequence of this theorem is that for any indicator function of an event, i.e. for any measurable functional $h$ from $\Omega$ to $[0, 1]$, there is an error bound of the form
$$|E h(Y) - E h(Z)| \le d_{TV}(\mathcal{L}(Y), \mathcal{L}(Z)).$$
Thus, if $S(Y)$ is a test statistic, then for all $t \in \mathbb{R}$,
$$\left| P(S(Y) \ge t) - P(S(Z) \ge t) \right| \le b_1 + b_2 + b_3,$$
which can be used to construct confidence intervals and to find p-values for tests based on this statistic.
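For a toy model with i.i.d. uniform letters and the natural neighbourhood $V(i) = \{j : |j - i| < n\}$ (so that $b_3 = 0$, since variables outside $V(i)$ use disjoint letters), the terms $b_1$ and $b_2$ of Theorem 2 can be computed exactly. This is a sketch under that simplifying assumption, not the Markov setting of the paper:

```python
def chen_stein_b(word, t, q=0.25):
    """b1 and b2 of Theorem 2 for i.i.d. letters with P(letter) = q."""
    n = len(word)
    p = q ** n                       # P(word at a fixed position)
    I = t - n + 1                    # number of positions in the index set
    # b1: sum of p_i * p_j over all pairs with j in V(i), clipped at the edges
    b1 = sum(p * p * (min(I - 1, i + n - 1) - max(0, i - n + 1) + 1)
             for i in range(I))
    # b2: E(Y_i Y_{i+d}) is nonzero only at self-overlap shifts d of the word,
    # where it equals q**(n + d); count ordered pairs in both directions
    overlaps = [d for d in range(1, n) if word[d:] == word[:n - d]]
    b2 = sum(2 * (I - d) * q ** (n + d) for d in overlaps)
    return b1, b2
```

For a self-overlapping word such as "aa", $b_2$ dominates, while for a non-overlapping word of the same length it vanishes.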
3 Preliminary notations and Poisson approximation

3.1 Preliminary notations

We focus on Markov processes in our biological applications (see Section 6), but the theorem given in the following subsection is established for more general mixing processes: the so-called $\psi$-mixing processes.

Definition 3 Let $\psi = (\psi(\ell))_{\ell \ge 0}$ be a sequence of real numbers decreasing to zero. We say that $(X_m)_{m \in \mathbb{Z}}$ is a $\psi$-mixing process if for all integers $\ell \ge 0$, the following holds:
$$\sup_{n \in \mathbb{N},\ B \in \mathcal{F}_{\{0,\dots,n\}},\ C \in \mathcal{F}_{\{n \ge 0\}}} \frac{\left| P(B \cap T^{-(n+\ell+1)}(C)) - P(B)P(C) \right|}{P(B)P(C)} = \psi(\ell),$$
where the supremum is taken over the sets $B$ and $C$ such that $P(B)P(C) > 0$.

For a word $A$ of $\Omega$, that is to say a measurable subset of $\Omega$, we say that $A \in \mathcal{C}_n$ if and only if $A = \{X_0 = a_0, \dots, X_{n-1} = a_{n-1}\}$, with $a_i \in \mathcal{A}$, $i = 0, \dots, n-1$. Then the integer $n$ is the length of word $A$. For $A \in \mathcal{C}_n$, we define the hitting time $\tau_A : \Omega \to \mathbb{N} \cup \{\infty\}$ as the random variable defined on the probability space $(\Omega, \mathcal{F}, P)$:
$$\forall x \in \Omega, \quad \tau_A(x) = \inf\{k \ge 1 : T^k(x) \in A\}.$$
$\tau_A$ is the first time that the process hits a given measurable $A$. We also use the classical probabilistic shorthand notations. We write $\{\tau_A = m\}$ instead of $\{x \in \Omega : \tau_A(x) = m\}$, $T^{-k}(A)$ instead of $\{x \in \Omega : T^k(x) \in A\}$, and $\{X_r^s = x_r^s\}$ instead of $\{X_r = x_r, \dots, X_s = x_s\}$. Also, for two measurable subsets $A$ and $B$ of $\Omega$, we write the conditional probability of $B$ given $A$ as $P(B|A) = P_A(B) = P(B \cap A)/P(A)$, and the probability of the intersection of $A$ and $B$ as $P(A \cap B)$ or $P(A; B)$. For $A = \{X_0^{n-1} = x_0^{n-1}\}$ and $1 \le w \le n$, we write $A^{(w)} = \{X_{n-w}^{n-1} = x_{n-w}^{n-1}\}$ for the event consisting of the last $w$ symbols of $A$. We also write $a \vee b$ for the supremum of two real numbers $a$ and $b$.
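On a finite sequence, the hitting time can be computed by direct scanning. A minimal sketch, where the infinite shift space is replaced by a finite string and, as in the definition, occurrences are sought at coordinates $k \ge 1$:

```python
def hitting_time(x, A):
    """tau_A(x) = inf{ k >= 1 : T^k(x) in A }, i.e. the first k >= 1 such
    that word A starts at coordinate k of the (finite) sequence x."""
    n = len(A)
    for k in range(1, len(x) - n + 2):
        if x[k:k + n] == A:
            return k
    return float("inf")    # the word never appears after coordinate 0
```

Note that an occurrence at coordinate 0 does not count, in accordance with the infimum over $k \ge 1$.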
We define the periodicity $p_A$ of $A \in \mathcal{C}_n$ as follows:
$$p_A = \inf\{k \in \mathbb{N}^* \mid A \cap T^{-k}(A) \ne \emptyset\}.$$
$p_A$ is called the principal period of word $A$. Then, we denote by $\mathcal{R}_p = \mathcal{R}_p(n)$ the set of words $A \in \mathcal{C}_n$ with periodicity $p$, and we also define $\mathcal{B}_n$ as the set of words $A \in \mathcal{C}_n$ with periodicity less than $[n/2]$, where $[\,\cdot\,]$ denotes the integer part of a real number:
$$\mathcal{R}_p = \{A \in \mathcal{C}_n \mid p_A = p\}, \qquad \mathcal{B}_n = \bigcup_{p=1}^{[n/2]} \mathcal{R}_p.$$
$\mathcal{B}_n$ is the set of words which are self-overlapping before half their length (see Example 4). We define $\mathcal{R}(A)$, the set of return times of $A$ which are not a multiple of its periodicity $p_A$:
$$\mathcal{R}(A) = \left\{ k \in \{[n/p_A]p_A + 1, \dots, n-1\} \mid A \cap T^{-k}(A) \ne \emptyset \right\}.$$
Let us denote by $r_A = \#\mathcal{R}(A)$ the cardinality of the set $\mathcal{R}(A)$. Define also $n_A = \min \mathcal{R}(A)$ if $\mathcal{R}(A) \ne \emptyset$, and $n_A = n$ otherwise. $\mathcal{R}(A)$ is called the set of secondary periods of $A$, and $n_A$ is the smallest secondary period of $A$. Finally, we introduce the following notation. For an integer $s \in \{0, \dots, t-1\}$, let $N_s^t = \sum_{i=s}^{t} 1\{T^{-i}(A)\}$. The random variable $N_s^t$ counts the number of occurrences of $A$ between $s$ and $t$ (we omit the dependence on $A$). For the sake of simplicity, we also put $N^t = N_0^t$.

Example 4 Consider the word $A = aaataaataaa$. Since $p_A = 4$, we have $A \in \mathcal{B}_n$, where $n = 11$. The shifted copies of $A$ below show that $\mathcal{R}(A) = \{9, 10\}$, $r_A = 2$ and $n_A = 9$:

    position: 0 1 2 3 4 5 6 7 8 9 10
    A:        a a a t a a a t a a a
    shift 4:          a a a t a a a t a a a
    shift 8:                  a a a t a a a t a a a
    shift 9:                    a a a t a a a t a a a
    shift 10:                     a a a t a a a t a a a

Shifts 4 and 8 are multiples of $p_A = 4$; shifts 9 and 10 give the secondary periods.

3.2 The mixing method

We present a theorem that gives an error bound for the Poisson approximation. Compared to the Chen-Stein method, it has the advantage of presenting non-uniform bounds that strongly control the decay of the tail distribution of $N^t$.

Theorem 5 ($\psi$-mixing approximation) Let $(X_m)_{m \in \mathbb{Z}}$ be a $\psi$-mixing process.
There exists a constant $C_\psi = 254$ such that for all $A \in \mathcal{C}_n \setminus \mathcal{B}_n$ and all non-negative integers $k$ and $t$, the following inequality holds:
$$\left| P(N^t = k) - e^{-tP(A)} \frac{(tP(A))^k}{k!} \right| \le C_\psi\, e_\psi(A)\, e^{-(t-(3k+1)n)P(A)}\, g_\psi(A, k),$$
where
$$g_\psi(A, k) = \begin{cases} \dfrac{(2\lambda)^{k-1}}{(k-1)!} & k \notin \left\{ \dfrac{\lambda}{e_\psi(A)}, \dots, \dfrac{2t}{n} \right\}, \\[2ex] \dfrac{(2\lambda)^{k-1}}{\left( \frac{\lambda}{e_\psi(A)} \right)!} \left( \dfrac{1}{e_\psi(A)} \right)^{k - \frac{\lambda}{e_\psi(A)} - 1} & k \in \left\{ \dfrac{\lambda}{e_\psi(A)}, \dots, \dfrac{2t}{n} \right\}, \end{cases}$$
$$e_\psi(A) = \inf_{1 \le w \le n_A} \left[ (r_A + n)\, P(A^{(w)})\, (1 + \psi(n_A - w)) \right], \qquad \lambda = tP(A)(1 + \psi(n)).$$
This result is at the core of our study. It shows an upper bound for the difference between the distribution of the number of occurrences of word $A$ in a sequence of length $t$ and the Poisson distribution of parameter $tP(A)$. The proof is postponed to Section 5.

4 Calculation of the constants

Our goal is to compute a bound as small as possible to control the error between the Poisson distribution and the distribution of the number of occurrences of a word. Thus, we determine the global constant $C_\psi$ appearing in Theorem 5 by means of intermediary bounds appearing in the proof. General bounds are interesting asymptotically in $n$, but for biological applications $n$ is approximately between 10 and 20, which is too small. Therefore, along the proof, we will indicate the intermediary bounds that we compute. Before establishing the proof of Theorem 5, we point out here, for easy reference, some results of Abadi [2], and some other useful results. In Abadi [2], these results
are given only in the $\phi$-mixing context. Moreover, exact values of the constants are not given, while these are necessary for practical use of these methods. We provide the values of all the constants appearing in the proofs of these results.

Proposition 6 (Proposition 11 in Abadi [2]) Let $(X_m)_{m \in \mathbb{Z}}$ be a $\psi$-mixing process.
There exist two finite constants $C_a > 0$ and $C_b > 0$ such that for any $n$, any word $A \in \mathcal{C}_n$, and any $c \in \left[ 4n, \frac{1}{2P(A)} \right]$ satisfying
$$\psi(c/4) \le P\left( \{\tau_A \le c/4\} \cap \{\tau_A \circ T^{c/4} > c/2\} \right),$$
there exists $\Delta$, with $n < \Delta \le c/4$, such that for all positive integers $k$ the following inequalities hold:
$$\left| P(\tau_A > kc) - P(\tau_A > c - 2\Delta)^k \right| \le C_a\, \varepsilon(A)\, k\, P(\tau_A > c - 2\Delta)^k, \quad (1)$$
$$\left| P(\tau_A > kc) - P(\tau_A > c)^k \right| \le C_b\, \varepsilon(A)\, k\, P(\tau_A > c - 2\Delta)^k, \quad (2)$$
with
$$\varepsilon(A) = \inf_{n \le \ell \le \frac{1}{P(A)}} \left[ \ell P(A) + \psi(\ell) \right].$$
Both inequalities provide an approximation of the hitting time distribution by a geometric distribution at any point $t$ of the form $t = kc$. The difference between them is that in (1) the geometric term inside the modulus is the same as in the upper bound, while in (2) the geometric term inside the modulus is larger than the one in the upper bound. That is, the second bound gives a larger error. We will use both in the proof of Theorem 8.

Proposition 7 We have $C_a = 24$ and $C_b = 25$.

PROOF. For the details of the proof of Proposition 6, we refer to Proposition 11 in Abadi [2]. For any $c \in \left[ 4n, \frac{1}{2P(A)} \right]$ and $\Delta \in [n, c/4]$, we denote $N_j^i = \left\{ \tau_A \circ T^{ic + j\Delta} > c - j\Delta \right\}$ and $N = \{\tau_A > c - 2\Delta\}$ for the sake of simplicity. Abadi [2] obtains the following bound: for all $k \ge 2$,
$$\left| P(\tau_A > kc) - P(N)^k \right| \le (a) + (b) + (c),$$
with
$$(a) = \sum_{j=0}^{k-2} P(N)^j \left| P(\tau_A > (k-j)c) - P\left( \tau_A > (k-j-1)c;\ N_2^{k-j-1} \right) \right|,$$
$$(b) = \sum_{j=0}^{k-2} P(N)^j \left| P\left( \tau_A > (k-j-1)c;\ N_2^{k-j-1} \right) - P(\tau_A > (k-j-1)c)\, P(N_2^0) \right|,$$
$$(c) = P(N)^{k-1} \left| P(\tau_A > c) - P(N) \right|.$$
First, for any measurable $B \in \mathcal{F}_{\{(\ell+1)c, (\ell+2)c + n - 1\}}$, we have $P(B) + \psi(\Delta) \le 3\psi(\Delta) \le \frac{3}{2}\varepsilon(A)$. We can also remark that $P(N) \ge 1/2$. Then, by iteration of the mixing property, we have the following inequality for all $\ell \in \mathbb{N}$:
$$P\left( \bigcap_{i=0}^{\ell} N_1^i;\ B \right)$$
$$\le 6\, P(N)^{\ell+1}\, \varepsilon(A).$$
We apply this bound in inequalities (14) and (15) of Abadi [2] to get
$$(a) \le \sum_{j=0}^{k-2} P(N)^j\, 6\, P(N)^{k-j-1}\, \varepsilon(A) = 6(k-1)\, \varepsilon(A)\, P(N)^{k-1},$$
$$(b) \le \sum_{j=0}^{k-2} P(N)^j\, 6\, P(N)^{k-j-1}\, \varepsilon(A) = 6(k-1)\, \varepsilon(A)\, P(N)^{k-1}.$$
We also have
$$(c) \le P(N)^{k-1}\, P\left( N;\ \tau_A \circ T^{c - 2\Delta} \le 2\Delta \right) \le \varepsilon(A)\, P(N)^{k-1}.$$
We obtain (1):
$$\left| P(\tau_A > kc) - P(N)^k \right| \le 24\, k\, \varepsilon(A)\, P(N)^k.$$
We deduce (2):
$$\left| P(\tau_A > kc) - P(\tau_A > c)^k \right| \le 25\, k\, \varepsilon(A)\, P(N)^k.$$
Then $C_a = 24$ and $C_b = 25$.

Theorem 8 (Theorem 1 in Abadi [2]) Let $(X_m)_{m \in \mathbb{Z}}$ be a $\psi$-mixing process. Then there exist constants $C_h > 0$ and $0 < \Xi_1 < 1 \le \Xi_2 < \infty$ such that for all $n \in \mathbb{N}$ and any $A \in \mathcal{C}_n$, there exists $\xi_A \in [\Xi_1, \Xi_2]$ for which the following inequality holds for all $t > 0$:
$$\left| P\left( \tau_A > \frac{t}{\xi_A} \right) - e^{-tP(A)} \right| \le C_h\, \varepsilon(A)\, f_1(A, t),$$
with $\varepsilon(A) = \inf_{n \le \ell \le \frac{1}{P(A)}} [\ell P(A) + \psi(\ell)]$ and $f_1(A, t) = (tP(A) \vee 1)\, e^{-tP(A)}$.

We prove an upper bound for the distance between the rescaled hitting time and the exponential law of expectation equal to one. The factor $\varepsilon(A)$ in the upper bound shows that the rate of convergence to the exponential law is given by a trade-off between the length of this time and the velocity with which the process loses memory.

Proposition 9 We have $C_h = 105$.

PROOF. We fix $c = \frac{1}{2P(A)}$ and $\Delta$ given by Proposition 6. We define
$$\xi_A = -\frac{\log P(\tau_A > c - 2\Delta)}{c\, P(A)}.$$
There are three steps in the proof of the theorem. First, we consider $t$ of the form $t = kc$ with $k$ a positive integer. Secondly, we prove the theorem for any $t$ of the form $t = (k + p/q)c$ with $k$, $p$ positive integers and $1 \le p \le q$, with $q = \frac{1}{2\varepsilon(A)}$. We also put $r = (p/q)c$. Finally, we consider the remaining cases.
Here, for the sake of simplicity, we do not detail the first two steps (for that, see Abadi [2]), but only the last one. Let $t$ be any positive real number. We write $t = kc + r$, with $k$ a positive integer and $r$ such that $0 \le r < c$. We can choose $\bar{t}$ such that $\bar{t} < t$ and $\bar{t} = (k + p/q)c$ with $p$, $q$ as before. Abadi [2] obtains the following bound:
$$\left| P(\tau_A > t) - e^{-\xi_A P(A) t} \right| \le \left| P(\tau_A > t) - P(\tau_A > \bar{t}) \right| + \left| P(\tau_A > \bar{t}) - e^{-\xi_A P(A) \bar{t}} \right| + \left| e^{-\xi_A P(A) \bar{t}} - e^{-\xi_A P(A) t} \right|.$$
The first term in the triangular inequality is bounded in the following way:
$$\left| P(\tau_A > t) - P(\tau_A > \bar{t}) \right| = P\left( \tau_A > \bar{t};\ \tau_A \circ T^{\bar{t}} \le t - \bar{t} \right) \le P\left( \tau_A > kc;\ \tau_A \circ T^{\bar{t}} \le \Delta \right) \le P(N)^{k-2} \left( \Delta P(A) + \psi(\Delta) \right) \le 4\, P(N)^k\, \varepsilon(A) \le 4\, \varepsilon(A)\, e^{-\xi_A P(A) t}.$$
The second term is bounded as in the first two steps of the proof in Abadi [2]. We apply inequalities (1) and (2) to obtain
$$\left| P(\tau_A > \bar{t}) - e^{-\xi_A P(A) \bar{t}} \right| \le \left( 3 + C_a\, tP(A) + C_a + 2C_b \right) \varepsilon(A)\, e^{-\xi_A P(A) t}.$$
Finally, the third term is bounded using the Mean Value Theorem (see for example Douglass [13]):
$$\left| e^{-\xi_A P(A) \bar{t}} - e^{-\xi_A P(A) t} \right| \le \xi_A P(A) \left( r - \frac{p}{q}c \right) e^{-\xi_A P(A) \bar{t}} \le \varepsilon(A)\, e^{-\xi_A P(A) t}.$$
Thus we have
$$\left| P(\tau_A > t) - e^{-\xi_A P(A) t} \right| \le 105\, \varepsilon(A)\, f_1(A, \xi_A t),$$
and the theorem follows by the change of variables $\tilde{t} = \xi_A t$. Then $C_h = 105$.

Lemma 10 Let $(X_m)_{m \in \mathbb{Z}}$ be a $\psi$-mixing process. Suppose that $B \subseteq A \in \mathcal{F}_{\{0,\dots,b\}}$ and $C \in \mathcal{F}_{\{b+g,\dots,\infty\}}$ with $b, g \in \mathbb{N}$. The following inequality holds:
$$P_A(B \cap C) \le P_A(B)\, P(C)\, (1 + \psi(g)).$$
PROOF. Since $B \subseteq A$, obviously $P(A \cap B \cap C) = P(B \cap C)$. By the $\psi$-mixing property, $P(B \cap C) \le P(B)\, P(C)\, (1 + \psi(g))$. We divide the above inequality by $P(A)$ and the lemma follows.
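The overlap quantities $p_A$, $\mathcal{R}(A)$, $r_A$ and $n_A$ defined in Section 3.1 enter the bounds below through $e_\psi(A)$; they can be computed mechanically, as in this sketch, which reproduces the values of Example 4:

```python
def principal_period(A):
    """p_A: the smallest shift k >= 1 at which A can overlap itself."""
    n = len(A)
    for k in range(1, n + 1):
        if A[k:] == A[:n - k]:     # suffix of length n-k matches the prefix
            return k

def secondary_periods(A):
    """R(A): overlap shifts in {[n/p_A]p_A + 1, ..., n - 1}."""
    n, p = len(A), principal_period(A)
    return [k for k in range((n // p) * p + 1, n) if A[k:] == A[:n - k]]

A = "aaataaataaa"                          # the word of Example 4
p_A = principal_period(A)                  # 4
R = secondary_periods(A)                   # [9, 10]
r_A, n_A = len(R), (min(R) if R else len(A))
```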
For all the following propositions and lemmas, we recall that
$$e_\psi(A) = \inf_{1 \le w \le n_A} \left[ (r_A + n)\, P(A^{(w)})\, (1 + \psi(n_A - w)) \right].$$

Proposition 11 Let $(X_m)_{m \in \mathbb{Z}}$ be a $\psi$-mixing process. Let $A \in \mathcal{R}_p(n)$. Then the following holds:

(a) For all $M, M' \ge g \ge n$,
$$\left| P_A(\tau_A > M + M') - P_A(\tau_A > M)\, P(\tau_A > M') \right| \le P_A(\tau_A > M - g)\, 2g P(A)\, [1 + \psi(g)],$$
and similarly
$$\left| P_A(\tau_A > M + M') - P_A(\tau_A > M)\, P(\tau_A > M' - g) \right| \le P_A(\tau_A > M - g)\, [gP(A) + 2\psi(g)].$$

(b) For all $t \ge p \in \mathbb{N}$, with $\zeta_A = P_A(\tau_A > p_A)$,
$$\left| P_A(\tau_A > t) - \zeta_A\, P(\tau_A > t) \right| \le 2\, e_\psi(A).$$

The above proposition establishes a relation between hitting and return times, with an error bound uniform with respect to $t$. In particular, (b) says that these times coincide if and only if $\zeta_A = 1$, namely, the string $A$ is non-self-overlapping.

PROOF. In order to simplify notation, for $t \in \mathbb{Z}$, $\tau_A^{[t]}$ stands for $\tau_A \circ T^t$. We introduce a gap of length $g$ after coordinate $M$ to construct the following triangular inequality:
$$\left| P_A(\tau_A > M + M') - P_A(\tau_A > M)\, P(\tau_A > M') \right| \le \left| P_A(\tau_A > M + M') - P_A\left( \tau_A > M;\ \tau_A^{[M+g]} > M' - g \right) \right| \quad (3)$$
$$+ \left| P_A\left( \tau_A > M;\ \tau_A^{[M+g]} > M' - g \right) - P_A(\tau_A > M)\, P(\tau_A > M' - g) \right| \quad (4)$$
$$+ P_A(\tau_A > M) \left| P(\tau_A > M' - g) - P(\tau_A > M') \right|. \quad (5)$$
Term (3) is bounded with Lemma 10 by
$$P_A\left( \tau_A > M;\ \tau_A^{[M]} \le g \right) \le P_A(\tau_A > M - g)\, gP(A)\, [1 + \psi(g)].$$
Term (4) is bounded using the $\psi$-mixing property by $P_A(\tau_A > M)\, \psi(g)$. The modulus in (5) is bounded using stationarity by $P(\tau_A \le g) \le gP(A)$. This ends the proof of both inequalities of item (a). Item (b) for $t \ge 2n$ is proven similarly to item (a), with $t = M + M'$, $M = p$, and $g = w$ with $1 \le w \le n_A$. Consider now $p \le t < 2n$:
$$\zeta_A - P_A(\tau_A > t) = P_A(p < \tau_A \le t) = P_A\left( \tau_A \in \mathcal{R}(A) \cup (n \le \tau_A \le t) \right) \le e_\psi(A).$$
The first and second equalities follow by definition of $\tau_A$ and $\mathcal{R}(A)$. The inequality follows by Lemma 10.

Let $\zeta_A = P_A(\tau_A > p_A)$ and $h = 1/(2P(A)) - 2\Delta$; then $\xi_A = -2 \log P(\tau_A > h)$.

Lemma 12 Let $(X_m)_{m \in \mathbb{Z}}$ be a $\psi$-mixing process. Then the following inequality holds:
$$|\xi_A - \zeta_A| \le 11\, e_\psi(A).$$
Hence, we have $\zeta_A - 11\, e_\psi(A) \le \xi_A \le \zeta_A + 11\, e_\psi(A)$.

PROOF.
$$P(\tau_A > h) = \prod_{i=1}^{h} P(\tau_A > i \mid \tau_A > i-1) = \prod_{i=1}^{h} \left( 1 - P\left( T^{-i}(A) \mid \tau_A > i-1 \right) \right) = \prod_{i=1}^{h} (1 - \rho_i P(A)),$$
where
$$\rho_i \overset{\mathrm{def}}{=} \frac{P_A(\tau_A > i-1)}{P(\tau_A > i-1)}.$$
Therefore
$$\left| \xi_A + 2 \sum_{i=1}^{p_A} \log(1 - \rho_i P(A)) - 2 \sum_{i=p_A+1}^{h} \zeta_A P(A) \right| \le 2 \sum_{i=p_A+1}^{h} \left| -\log(1 - \rho_i P(A)) - \zeta_A P(A) \right|.$$
The above modulus is bounded by
$$\left| -\log(1 - \rho_i P(A)) - \rho_i P(A) \right| + |\rho_i - \zeta_A|\, P(A).$$
Now note that $|y - (1 - e^{-y})| \le (1 - e^{-y})^2$ for $y > 0$ small enough. Apply it with $y = -\log(1 - \rho_i P(A))$ to bound the leftmost term of the above expression by $(\rho_i P(A))^2$. Further, by Proposition 11 (b) and the fact that $P(\tau_A > h) \ge 1/2$, we have
$$|\rho_i - \zeta_A| \le \frac{2\, e_\psi(A)}{P(\tau_A > h)} \le 4\, e_\psi(A)$$
for all $i = p_A + 1, \dots, h$. Yet as before,
$$-\sum_{i=1}^{p_A} \log(1 - \rho_i P(A)) \le p_A \left[ \rho_i P(A) + (\rho_i P(A))^2 \right] \le e_\psi(A).$$
Finally, by definition of $h$,
$$\left| 2 \sum_{i=p_A+1}^{h} \zeta_A P(A) - \zeta_A \right| \le 4\Delta P(A) + 2 p_A P(A) \le 6\, e_\psi(A).$$
This ends the proof of the lemma.

Proposition 13 Let $(X_m)_{m \in \mathbb{Z}}$ be a $\psi$-mixing process. Then the following inequality holds:
$$\left| P(\tau_A > t) - e^{-tP(A)} \right| \le C_p\, e_\psi(A)\, (tP(A) \vee 1)\, e^{-(\zeta_A - 11 e_\psi(A))\, tP(A)}.$$
PROOF.
We bound the first term with Theorem 8 and the second with Lemma 12:
$$\left| P(\tau_A > t) - e^{-tP(A)} \right| \le \left| P(\tau_A > t) - e^{-\xi_A tP(A)} \right| + \left| e^{-\xi_A tP(A)} - e^{-tP(A)} \right|,$$
$$\left| P(\tau_A > t) - e^{-\xi_A tP(A)} \right| \le C_h\, \varepsilon(A)\, e^{-\xi_A tP(A)} \le C_h\, e_\psi(A)\, e^{-(\zeta_A - 11 e_\psi(A))\, tP(A)},$$
$$\left| e^{-\xi_A tP(A)} - e^{-tP(A)} \right| \le tP(A)\, |\xi_A - 1|\, e^{-\min\{1, \xi_A\}\, tP(A)} \le 11\, tP(A)\, e_\psi(A)\, e^{-(\zeta_A - 11 e_\psi(A))\, tP(A)}.$$
This ends the proof of the proposition, with $C_p = C_h + 11$.

Definition 14 Given $A \in \mathcal{C}_n$, we define for $j \in \mathbb{N}$ the $j$-th occurrence time of $A$ as the random variable $\tau_A^{(j)} : \Omega \to \mathbb{N} \cup \{\infty\}$, defined on the probability space $(\Omega, \mathcal{F}, P)$ as follows: for any $x \in \Omega$, $\tau_A^{(1)}(x) = \tau_A(x)$ and, for $j \ge 2$,
$$\tau_A^{(j)}(x) = \inf\left\{ k > \tau_A^{(j-1)}(x) : T^k(x) \in A \right\}.$$

Proposition 15 Let $(X_m)_{m \in \mathbb{Z}}$ be a $\psi$-mixing process. Then, for all $A \notin \mathcal{B}_n$, all $k \in \mathbb{N}$, and all $0 \le t_1 < t_2 < \dots < t_k \le t$ for which $\min_{2 \le j \le k} \{t_j - t_{j-1}\} > 2n$, there exists a positive constant $C_1$, independent of $A$, $n$, $t$ and $k$, such that
$$\left| P\left( \bigcap_{j=1}^{k} \tau_A^{(j)} = t_j;\ \tau_A^{(k+1)} > t \right) - P(A)^k \prod_{j=1}^{k+1} P_j \right| \le C_1\, k\, \left( P(A)(1 + \psi(n)) \right)^k e_\psi(A)\, e^{-(t-(3k+1)n)P(A)},$$
where $P_j = P(\tau_A > (t_j - t_{j-1}) - 2n)$.

PROOF. We will show this proposition by induction on $k$. We put $\Delta_j = t_j - t_{j-1}$ for $j = 2, \dots, k$, $\Delta_1 = t_1$ and $\Delta_{k+1} = t - t_k$. Firstly, we note that by stationarity $P(\tau_A = t) = P(A;\ \tau_A > t - 1)$. For $k = 1$, by a triangular inequality we obtain
$$\left| P\left( \tau_A = t_1;\ \tau_A^{(2)} > t \right) - P(A) \prod_{j=1}^{2} P_j \right| \le \left| P\left( \tau_A = t_1;\ \tau_A^{(2)} > t \right) - P\left( \tau_A = t_1;\ N_{t_1+2n}^{t} = 0 \right) \right| \quad (6)$$
$$+ \left| P\left( \tau_A = t_1;\ N_{t_1+2n}^{t} = 0 \right) - P(\tau_A = t_1)\, P_2 \right| \quad (7)$$
$$+ \left| P(A;\ \tau_A > t_1 - 1) - P\left( A;\ N_{2n}^{t_1-1} = 0 \right) \right| P_2 \quad (8)$$
$$+ \left| P\left( A;\ N_{2n}^{t_1-1} = 0 \right) P_2 - P(A) \prod_{j=1}^{2} P_j \right|. \quad (9)$$
Term (6) is equal to $P\left( \tau_A = t_1;\ \bigcup_{i=t_1+1}^{t_1+2n} T^{-i}(A);\ N_{t_1+2n}^{t} = 0 \right)$, and then
$$(6) = P\left( A;\ \bigcup_{i \in \mathcal{R}(A) \cup \{1, \dots, 2n\}} T^{-i}(A);\ N_{2n}^{t} = 0 \right).$$
Since $A \notin \mathcal{B}_n$, for $1 \le i < p_A$ the above probability is zero. Thus, using the mixing property,
$$(6) \le P\left( A;\ \bigcup_{i \in \mathcal{R}(A) \cup \{p_A, \dots, 2n\}} T^{-i}(A);\ N_{2n}^{t} = 0 \right) \le 2\, P(A)\, P(A)(r_A + n)(1 + \psi(n))\, P\left( N_{2n}^{t} = 0 \right) \le 2\, P(A)\, e_\psi(A)\, e^{-(t-(3k+1)n)P(A)}.$$
Term (7) is bounded using the $\psi$-mixing property:
$$(7) \le \psi(n)(1 + \psi(n))\, P(A)\, P_1 P_2 \le \psi(n)\, P(A)\, e_\psi(A)\, e^{-(t-(3k+1)n)P(A)}.$$
Analogous computations are used to bound terms (8) and (9). Now, let us suppose that the proposition holds for $k - 1$, and let us prove it for $k$. We put $S_i = \{\tau_A^{(i)} = t_i\}$. We use a triangular inequality again to bound the term on the left hand side of the inequality of the proposition by a sum of five terms:
$$\left| P\left( \bigcap_{j=1}^{k} \tau_A^{(j)} = t_j;\ \tau_A^{(k+1)} > t \right) - P(A)^k \prod_{j=1}^{k+1} P_j \right| \le I + II + III + IV + V.$$
$$I = \left| P\left( \bigcap_{j=1}^{k} S_j;\ \tau_A^{(k+1)} > t \right) - P\left( \bigcap_{j=1}^{k-1} S_j;\ N_{t_{k-1}+1}^{t_k-2n} = 0;\ T^{-t_k}(A);\ N_{t_k+1}^{t} = 0 \right) \right|$$
$$= P\left( \bigcap_{j=1}^{k-1} S_j;\ N_{t_{k-1}+1}^{t_k-2n} = 0;\ \bigcup_{i=t_k-2n+1}^{t_k-1} T^{-i}(A);\ T^{-t_k}(A);\ N_{t_k+1}^{t} = 0 \right)$$
$$\le \left( P(A)(1 + \psi(n)) \right)^k (1 - \psi(n)) \left[ n p_A + (r_A + n) P(A^{(w)}) \right] e^{-(t-(3k+1)n)P(A)},$$
$$II = \left| P\left( \bigcap_{j=1}^{k-1} S_j;\ N_{t_{k-1}+1}^{t_k-2n} = 0;\ T^{-t_k}(A);\ N_{t_k+1}^{t} = 0 \right) - P\left( \bigcap_{j=1}^{k-1} S_j;\ N_{t_{k-1}+1}^{t_k-2n} = 0 \right) P\left( A;\ N_1^{t-t_k} = 0 \right) \right|$$
$$\le P\left( \bigcap_{j=1}^{k-1} S_j;\ N_{t_{k-1}+1}^{t_k-2n} = 0 \right) P\left( A;\ N_1^{t-t_k} = 0 \right) \psi(n) \le \left( P(A)(1 + \psi(n)) \right)^k \psi(n)\, e^{-(t-(3k+1)n)P(A)},$$
$$III = \left| P\left( \bigcap_{j=1}^{k-1} S_j;\ N_{t_{k-1}+1}^{t_k-2n} = 0 \right) - P\left( \bigcap_{j=1}^{k-1} S_j;\ N_{t_{k-1}+1}^{t_k-1} = 0 \right) \right| P\left( A;\ N_1^{t-t_k} = 0 \right)$$
$$\le P\left( \bigcap_{j=1}^{k-1} S_j;\ N_{t_{k-1}+1}^{t_k-2n} = 0;\ \bigcup_{i=t_k-2n+1}^{t_k-1} T^{-i}(A) \right) P(A) \le 2\, P(A) \left( P(A)(1 + \psi(n)) \right)^k e^{-(t-(3k+1)n)P(A)}.$$
We use the inductive hypothesis for term $IV$ and the case $k = 1$ for term $V$:

\[
IV = \Bigl| P\Bigl( \bigcap_{j=1}^{k-1} S_j;\ N_{t_{k-1}+1}^{t_k-1} = 0 \Bigr) - P(A)^{k-1} \prod_{j=1}^{k} P_j \Bigr| P\bigl(A;\ N_{1}^{t-t_k} = 0\bigr)
\le C_1 (k-1) \bigl(P(A)(1+\psi(n))\bigr)^k e_\psi(A)\, e^{-(t-(3k+1)n)P(A)},
\]
\[
V = P(A)^{k-1} \prod_{j=1}^{k} P_j \Bigl| P\bigl(A;\ N_{1}^{t-t_k} = 0\bigr) - P(A)\, P_{k+1} \Bigr|
\le 2 \bigl(P(A)(1+\psi(n))\bigr)^k e_\psi(A)\, e^{-(t-(3k+1)n)P(A)}.
\]

Finally, we obtain

\[
I + II + III + IV + V \le \bigl(3 + C_1(k-1) + 2\bigr) \bigl(P(A)(1+\psi(n))\bigr)^k e_\psi(A)\, e^{-(t-(3k+1)n)P(A)}.
\]

To conclude the proof, it is sufficient that $C_1 k = 3 + C_1(k-1) + 2$, hence $C_1 = 5$. This ends the proof of the proposition.

5 Proof of Theorem 5

In this section we prove the main result of this work (see Section 3.2): an upper bound on the difference between the exact distribution of the number of occurrences of word $A$ and the Poisson distribution with parameter $tP(A)$. Throughout the proof, the terms computed by our software PANOW (see Section 6.1) are noted in italics.

PROOF. For $k = 0$, the result comes from Proposition 13 ($P(N_t = 0) = P(\tau_A > t)$). For $k > 2t/n$, since $A \notin \mathcal{B}_n$, we have $P(N_t = k) = 0$. Hence

\[
\Bigl| P(N_t = k) - e^{-tP(A)} \frac{(tP(A))^k}{k!} \Bigr| = e^{-tP(A)} \frac{(tP(A))^k}{k!}
\le \frac{(tP(A))^{k-1}}{(k-1)!}\, \frac{tP(A)}{k}
\le \frac{1}{2}\, \frac{(tP(A))^{k-1}}{(k-1)!}\, e_\psi(A).
\]

Indeed, since $t/k < n/2$, we have $tP(A)/k < nP(A)/2 \le e_\psi(A)/2$.

Now consider $1 \le k \le 2t/n$, and a sequence containing exactly $k$ occurrences of $A$. These occurrences can be isolated or can fall in clumps. We define the set

\[
\mathcal{T} = \mathcal{T}(t_1, t_2, \dots, t_k) = \Bigl\{ \bigcap_{j=1}^{k} \{\tau_A^{(j)} = t_j\};\ \tau_A^{(k+1)} > t \Bigr\}.
\]

We recall that $P_j = P(\tau_A > (t_j - t_{j-1}) - 2n)$, $\Delta_j = t_j - t_{j-1}$ for $j = 2, \dots, k$, $\Delta_1 = t_1$ and $\Delta_{k+1} = t - t_k$. Define $I(\mathcal{T}) = \min_{2\le j\le k}\{\Delta_j\}$.
We say that the occurrences of $A$ are isolated if $I(\mathcal{T}) \ge 2n$, and that there is at least one clump if $I(\mathcal{T}) < 2n$. We also write $B_k = \{\mathcal{T} \mid I(\mathcal{T}) < 2n\}$ and $G_k = \{\mathcal{T} \mid I(\mathcal{T}) \ge 2n\}$. The set $\{N_t = k\}$ is the disjoint union of $B_k$ and $G_k$, so $P(N_t = k) = P(B_k) + P(G_k)$ and

\[
\Bigl| P(N_t = k) - e^{-tP(A)} \frac{(tP(A))^k}{k!} \Bigr| \le P(B_k) + \Bigl| P(G_k) - e^{-tP(A)} \frac{(tP(A))^k}{k!} \Bigr|.
\]

We prove an upper bound for each of the two quantities on the right-hand side of the above inequality to conclude the proof of the theorem.

We prove an upper bound for $P(B_k)$. Define $C(\mathcal{T}) = \sum_{j=2}^{k} \mathbf{1}_{\{\Delta_j > 2n\}} + 1$; $C(\mathcal{T})$ counts how many clusters there are in a given $\mathcal{T}$. Suppose first that $\mathcal{T}$ is such that $C(\mathcal{T}) = 1$ and fix the position $t_1$ of the first occurrence of $A$. Each occurrence inside the cluster (except the leftmost one, which is fixed at $t_1$) can appear at distance $d$ from the previous one, with $p_A \le d \le 2n$. Therefore the $\psi$-mixing property leads to the bound

\[
P\Bigl( \bigcup_{t_2,\dots,t_k} \mathcal{T}(t_1, t_2, \dots, t_k) \Bigr)
\le P\Bigl( \bigcap_{j=1}^{k} \bigcup_{\substack{p_A \le t_{i+1}-t_i \le 2n \\ i=2,\dots,k}} T^{-t_j}(A) \Bigr) \tag{10}
\]
\[
\le P(A)\, e_\psi(A)^{k-1}\, e_\psi(A)\, e^{-(t-(3k+1)n)P(A)}.
\]

Suppose now that $\mathcal{T}$ is such that $C(\mathcal{T}) = i$, and that the leftmost occurrences of the $i$ clusters of $\mathcal{T}$ occur at $t(1), \dots, t(i)$, with $1 \le t(1) < \dots < t(i) \le t$ fixed. By the same argument as above, we have

\[
P\Bigl( \bigcup_{\{t_1,\dots,t_k\}\setminus\{t(1),\dots,t(i)\}} \mathcal{T}(t_1, \dots, t_k) \Bigr)
\le \bigl(P(A)(1+\psi(n))\bigr)^{i-1}\, e_\psi(A)^{k-i}\, e^{-(t-(3k+1)n)P(A)}.
\]

To obtain an upper bound for $P(B_k)$, we must sum the above bound over all $\mathcal{T}$ such that $C(\mathcal{T}) = i$, with $i$ running from 1 to $k - 1$. For fixed $C(\mathcal{T}) = i$, the locations of the leftmost occurrences of $A$ of each of the $i$ clusters can be chosen in at most $\binom{t}{i}$ ways.
The cardinalities of the $i$ clusters can be arranged in $\binom{k-1}{i-1}$ ways. (This corresponds to breaking the interval $(1/2, k+1/2)$ into $i$ intervals at points chosen from $\{1+1/2, \dots, k-1/2\}$.) Collecting this information, $P(B_k)$ is bounded by

\[
\sum_{i=1}^{k-1} \binom{t}{i} \binom{k-1}{i-1} \bigl(P(A)(1+\psi(n))\bigr)^i\, e_\psi(A)^{k-i}\, e^{-(t-(3k+1)n)P(A)}
\]
\[
\le e^{-(t-(3k+1)n)P(A)}\, e_\psi(A)^k \max_{1\le i\le k-1} \frac{(\lambda/e_\psi(A))^i}{i!} \sum_{i=1}^{k-1} \binom{k-1}{i-1}
\]
\[
\le e^{-(t-(3k+1)n)P(A)}\, e_\psi(A)\,(2\lambda)^{k-1} \times
\begin{cases}
\dfrac{1}{(k-1)!} & \text{if } k < \lambda/e_\psi(A), \\[6pt]
\dfrac{(\lambda/e_\psi(A))^{\lambda/e_\psi(A)-(k-1)}}{\lfloor \lambda/e_\psi(A) \rfloor!} & \text{if } k \ge \lambda/e_\psi(A).
\end{cases}
\]

This ends the proof of the bound for $P(B_k)$. We compute

\[
P(B_k) \le \sum_{i=1}^{k-1} \binom{t}{i} \binom{k-1}{i-1} \bigl(P(A)(1+\psi(n))\bigr)^i\, e_\psi(A)^{k-i}\, e^{-(t-(3k+1)n)P(A)}.
\]

We now prove an upper bound for $\bigl| P(G_k) - e^{-tP(A)} (tP(A))^k / k! \bigr|$. It is bounded by four terms by the triangular inequality:

\[
\sum_{\mathcal{T}\in G_k} \Bigl| P\Bigl( \bigcap_{j=1}^{k} \{\tau_A^{(j)} = t_j\};\ \tau_A^{(k+1)} > t \Bigr) - P(A)^k \prod_{j=1}^{k+1} P_j \Bigr| \tag{11}
\]
\[
+ \sum_{\mathcal{T}\in G_k} P(A)^k \Bigl| \prod_{j=1}^{k+1} P_j - \prod_{j=1}^{k+1} e^{-(\Delta_j-2n)P(A)} \Bigr| \tag{12}
\]
\[
+ \sum_{\mathcal{T}\in G_k} P(A)^k \Bigl| e^{-(t-2(k+1)n)P(A)} - e^{-tP(A)} \Bigr| \tag{13}
\]
\[
+ \Bigl| \#G_k\, \frac{k!}{t^k}\, e^{-tP(A)} \frac{(tP(A))^k}{k!} - e^{-tP(A)} \frac{(tP(A))^k}{k!} \Bigr|. \tag{14}
\]

We bound these terms to obtain Theorem 5. First, we bound the cardinality of $G_k$:

\[
\#G_k \le \binom{t}{k} \le \frac{t^k}{k!}.
\]

Term (11) is bounded with Proposition 15:

\[
(11) \le C_1\, \frac{t^k}{(k-1)!}\, \bigl(P(A)(1+\psi(n))\bigr)^k\, e_\psi(A)\, e^{-(t-(3k+1)n)P(A)}.
\]

Term (12) is bounded with Proposition 13:

\[
(12) \le \frac{t^k}{k!}\, P(A)^k \sum_{j=1}^{k+1} \prod_{i=1}^{j-1} P_i\, \Bigl| P_j - e^{-(\Delta_j-2n)P(A)} \Bigr| \prod_{i=j+1}^{k+1} e^{-(\Delta_i-2n)P(A)}
\]
\[
\le \frac{t^k}{k!}\, P(A)^k (k+1)\, C_p\, e_\psi(A)\, e^{-(\zeta_A-11e_\psi(A))tP(A)}
\le 2 C_p\, \frac{(tP(A))^k}{(k-1)!}\, e_\psi(A)\, e^{-(\zeta_A-11e_\psi(A))tP(A)},
\]

where $C_p$ is defined in Proposition 13.
We compute

\[
(12) \le \frac{(tP(A))^k}{(k-1)!}\, \frac{k+1}{k}\, \bigl[ (8 + C_a\, tP(A) + C_a + 2C_b)\, \varepsilon(A) + 11\, tP(A)\, e_\psi(A) \bigr]\, e^{-(\zeta_A-11e_\psi(A))tP(A)}.
\]

Term (13) is bounded by

\[
(13) \le \frac{t^k}{k!}\, P(A)^k (k+1)\, 2n P(A)\, e^{-tP(A)}\, e^{2(k+1)nP(A)}.
\]

To bound term (14), we bound the following difference:

\[
\Bigl| \#G_k\, \frac{k!}{t^k} - 1 \Bigr| \le \Bigl| \frac{(t-4nk)^k}{t^k} - 1 \Bigr| \le \frac{k(k+4n)}{t}.
\]

Then we have

\[
(14) \le \frac{k(k+4n)}{t}\, e^{-tP(A)}\, \frac{(tP(A))^k}{k!}.
\]

Now we just have to add the five bounds to obtain the theorem, with the constant $C_\psi = 1 + C_1 + 2C_p + 8 + 8$. Proposition 15 shows that $C_1 = 5$, and Proposition 13 together with Theorem 8 that $C_p = 116$. We therefore prove the theorem with $C_\psi = 254$.

6 Biological applications

With the explicit value of the constant $C_\psi$ of Theorem 5, and more particularly thanks to all the intermediate bounds given in the proof of this theorem, we can develop an algorithm applying this formula to the study of rare words in biological sequences. In order to compare different methods, we also compute the bounds corresponding to a $\phi$-mixing process, for which a proof of Poisson approximation is given in Abadi and Vergne [5]. Let us recall the definition of such a mixing process.

Definition 16 Let $\phi = (\phi(\ell))_{\ell\ge 0}$ be a sequence decreasing to zero. We say that $(X_m)_{m\in\mathbb{Z}}$ is a $\phi$-mixing process if for all integers $\ell \ge 0$ the following holds:

\[
\sup_{n\in\mathbb{N},\ B\in\mathcal{F}_{\{0,\dots,n\}},\ C\in\mathcal{F}_{\{m\ge 0\}}} \frac{\bigl| P(B \cap T^{-(n+\ell+1)}(C)) - P(B)P(C) \bigr|}{P(B)} = \phi(\ell),
\]

where the supremum is taken over the sets $B$ and $C$ such that $P(B) > 0$.

Note that, obviously, $\psi$-mixing implies $\phi$-mixing. We thus obtain two new methods for the detection of over- or under-represented words in biological sequences, and we compare them to the Chen-Stein method. We recall that Markov models are $\psi$-mixing processes, and therefore also $\phi$-mixing processes.
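For a Markov chain the mixing sequences decay geometrically, $\psi(\ell) = \phi(\ell) = K\nu^\ell$, with $\nu$ the second eigenvalue of the transition matrix and $K$ derived from the stationary distribution. The following is a minimal sketch of this estimation, assuming NumPy; taking $K$ as the inverse of the smallest stationary probability is our reading of the choice $K = (\inf_j \mu_j)^{-1}$, and the paper's own implementation in PANOW may differ:

```python
import numpy as np

def mixing_parameters(P):
    """Estimate (K, nu) so that psi(l) ~ K * nu**l for an ergodic
    transition matrix P (rows sum to 1). nu is the modulus of the
    second-largest eigenvalue; K is the inverse of the smallest
    stationary probability (an assumption of this sketch)."""
    # eigenvalue moduli in decreasing order; the first is always 1
    moduli = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
    nu = moduli[1]
    # stationary distribution: left eigenvector for eigenvalue 1
    w, v = np.linalg.eig(P.T)
    mu = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    mu = mu / mu.sum()
    K = 1.0 / mu.min()
    return K, nu

# toy 2-state chain: eigenvalues are 1 and 0.7, stationary law (2/3, 1/3)
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
K, nu = mixing_parameters(P)
```

For this toy chain the sketch returns $\nu = 0.7$ and $K = 3$.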
We therefore first need to know the functions $\psi$ and $\phi$ for a Markov model. It turns out that we can use $\psi(\ell) = \phi(\ell) = K\nu^\ell$ with $K > 0$ and $0 < \nu < 1$, where $K$ and $\nu$ have to be estimated (see Meyn and Tweedie [19]). There are several possible estimations of $K$ and $\nu$. We choose $\nu$ equal to the second eigenvalue of the transition matrix of the model, and

\[
K = \Bigl( \inf_{j\in\{1,\dots,|\mathcal{A}|^k\}} \mu_j \Bigr)^{-1},
\]

where $|\mathcal{A}|$ is the alphabet size, $k$ the order of the Markov model, and $\mu$ the stationary distribution of the Markov model.

We recall that we aim at guessing a relevant biological role of a word in a sequence using its number of occurrences. Thus we compare the number of occurrences expected under the Markov chain that models the sequence with the observed number of occurrences. We choose a degree of significance $s$ to quantify this relevance: we fix $s$ arbitrarily and we want to compute the smallest number of occurrences $u$ such that $P(N > u) < s$, where $N$ is the number of occurrences of the studied word. If the number of occurrences counted in the sequence is larger than this $u$, we can consider the word to be relevant with a degree of significance $s$. We have

\[
P(N > u) \le \sum_{k=u}^{+\infty} \bigl( P_P(N = k) + Error(k) \bigr),
\]

where $P_P(N = k)$ is the probability under the Poisson model that $N$ equals $k$, and $Error(k)$ is the error between the exact distribution and its Poisson approximation, bounded using Theorem 5. We then search for the smallest threshold $u$ such that

\[
\sum_{k=u}^{+\infty} \bigl( P_P(N = k) + Error(k) \bigr) < s. \tag{15}
\]

Then $P(N > u) < s$, and we consider the word relevant with a degree of significance $s$ if it appears more than $u$ times in the sequence. In order to compare the different methods, we compare the thresholds that they give.
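The search for the smallest $u$ satisfying (15) is a simple downward scan over the Poisson tail. A minimal sketch, where `error_term` is a stand-in for the bound of Theorem 5 (any nonnegative summable function can be supplied; the truncation at `k_max` is an assumption of this sketch, whereas PANOW instead freezes the error once it drops below $10^{-300}$):

```python
import math

def poisson_pmf(lam, k):
    # e^{-lam} * lam^k / k!, evaluated in log space for stability
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def threshold(lam, s, error_term, k_max=10000):
    """Smallest u with sum_{k >= u} (P_P(N = k) + Error(k)) < s,
    the series being truncated at k_max."""
    tail = sum(poisson_pmf(lam, k) + error_term(k) for k in range(k_max + 1))
    u = 0
    while u <= k_max and tail >= s:
        # remove the term at u from the tail and move the threshold up
        tail -= poisson_pmf(lam, u) + error_term(u)
        u += 1
    return u

# with a zero error term this reduces to a pure Poisson tail threshold
u = threshold(10.0, 0.01, lambda k: 0.0)
```

For the parameter $tP(A) = 10$ and $s = 0.01$, the pure Poisson threshold is $u = 19$, since $P(N \ge 19) \approx 0.0072 < 0.01$ while $P(N \ge 18) \approx 0.0143$.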
Obviously, the smaller the degree of significance, the more relevant the studied word. But for a fixed degree of significance, the best method is the one which gives the smallest threshold $u$. Indeed, giving the smallest $u$ is equivalent to giving the smallest error, in the tail of the distribution, between the exact distribution of the number of occurrences of word $A$ and the Poisson distribution with parameter $tP(A)$.

6.1 Software availability

We developed PANOW, dedicated to the determination of the threshold $u$ for given words. This software is written in ANSI C++ and developed on x86 GNU/Linux systems with GCC 3.4, and successfully tested with the latest GCC versions on Sun and Apple Mac OS X systems. It relies on the seq++ library (Miele et al. [20]). Compilation and installation are compliant with the GNU standard procedure. It is available at http://stat.genopole.cnrs.fr/software/panowdir/, where on-line documentation is also available. PANOW is licensed under the GNU General Public License (http://www.gnu.org/licenses/licenses.html).

6.2 Comparisons between the three different methods

6.2.1 Comparisons using synthetic data.

We can compare the mixing methods and the Chen-Stein method through the values of the threshold $u$ obtained with PANOW, using Abadi and Vergne [4] in the first case and Reinert and Schbath [26] in the second one. We recall that, for a fixed degree of significance, the method which gives the smallest threshold $u$ is the best one. Table 1 offers a good outline of the possibilities and limits of each method. It displays some results for different, randomly selected words (none of these words has a biological meaning). Table 1 was obtained with an order one Markov model using a random transition matrix, for degrees of significance of 0.1 and 0.01. IMP means that the method cannot return a result; the reasons are explained in the following paragraphs.

Table 1
Thresholds u obtained by the three methods (sequence length t equal to 10^6). For each method and each word, we compute the threshold from which the word can be considered as over-represented, for a degree of significance s equal to 0.1 or 0.01. IMP means that the method cannot return a result.

                    s = 0.1                 s = 0.01
Words         CS      phi     psi      CS      phi     psi
cccg          IMP     IMP     IMP      IMP     IMP     IMP
aagcgc        IMP     1301    378      IMP     1304    392
cgagcttc      18      38      18       IMP     40      22
ttgggctg      14      27      14       18      29      17
gtgcggag      16      32      16       22      34      20
agcaaata      19      39      19       IMP     41      23

Analysing many results, we notice some differences between the methods. Firstly, none of the methods gives a result in all cases. We recall that the Chen-Stein method gives a bound ($CS$) using the total variation distance. If the chosen degree of significance $s$ is smaller than the Chen-Stein bound, we never find a threshold $u$ such that

\[
CS + \sum_{k=u}^{+\infty} P_P(N = k) < s.
\]

Hence, each time the given bound is higher than the significance degree, the Chen-Stein method cannot be used, and there are many examples that we cannot study with it. Obviously, it is interesting to have a small degree of significance $s$, and that may be impossible because of this restriction of the Chen-Stein method. For example, this problem appears for the words aagcgc and cgagcttc in Table 1. For the second word, the Chen-Stein bound is equal to 0.0107954; hence we can use this method for a significance degree $s$ equal to 0.1 but not for a significance degree of 0.01. The same phenomenon appears for the word agcaaata (the Chen-Stein bound is equal to 0.0120193).
The $\phi$- and $\psi$-mixing methods are not based on the total variation distance. Hence, whatever the degree of significance $s$, provided the studied word satisfies the three following weak properties, we always obtain a threshold $u$, contrary to the Chen-Stein method. In spite of these three conditions, our methods enable us to study a much broader panel of words than the Chen-Stein method. Indeed, for these two methods, the only problematic cases arise either when the function $e_\psi$ (see Theorem 5) is larger than 1, or for a "high" parameter of the Poisson distribution ("high" meaning larger than 500), or when the word periodicity is smaller than half its length (see the assumptions of Theorem 5: $A \notin \mathcal{B}_n$). In fact, the first case does not occur very frequently (in no case in Table 1). The reason why the function $e_\psi$ (or a similar function in the $\phi$-mixing case) has to be smaller than 1 is that, for numerical reasons, the error term has to decrease with the number of occurrences $k$, and without this condition on $e_\psi$ we cannot ensure this decrease. We have to compute error terms for a finite number of values of $k$ but, in order to reduce the computation time, when the error term becomes smaller than a certain value (we chose $10^{-300}$), we set all the following error terms equal to this value. That is why the error term has to be decreasing. The second problem, a "high" parameter of the Poisson distribution, is just a computational difficulty, and once again it does not occur very frequently (only for the word cccg in Table 1, for instance). We would like to insist on the main advantage of our methods: we can fix any significance degree $s$ and, except in the very rare cases mentioned above, we will find a threshold $u$, contrary to the Chen-Stein method. Also, we can use our methods for any Markov chain order.
Indeed, PANOW runs fast enough, contrary to the R program used to compute the Chen-Stein bound of Reinert and Schbath [26]. Note that PANOW provides another method to compute the Chen-Stein bound (see Abadi [3]), and this method gives approximately the same Chen-Stein bound.

The second main observation is that, when it works, the Chen-Stein method gives either a threshold $u$ similar to that of the $\psi$-mixing method, or a larger one. This means that the $\psi$-mixing method outperforms the Chen-Stein method.

Thirdly, we notice that the $\psi$-mixing method is always better than the $\phi$-mixing one. Obviously, this result was expected from the definitions of these mixing processes, and also from the theorems, because of the extra factor $e^{-(t-(3k+1)n)P(A)}$ (see Theorem 5 and Theorem 2 in Abadi and Vergne [5]). We were interested in the real impact of this factor on the threshold $u$: it is significantly better in the case of a $\psi$-mixing process.

6.2.2 Biological comparisons.

Now we present a few results obtained on real biological examples with order one Markov models. There are many categories of words which have relevant biological functions (promoters, terminators, repeat sequences, Chi sites, uptake sequences, bend sites, signal peptides, binding sites, restriction sites, ...). Some of them are highly present in the sequence, some others are almost absent. It therefore turns out to be interesting to consider the over- or under-representation of words in order to find biologically relevant words. In this section, we test our methods on words already known to be relevant. We focus our study on Chi sites and uptake sequences. Chi sites of bacteria protect the genome by stopping its degradation by a particular enzyme. The function of this enzyme is to destroy viruses which could appear in the bacterium.
Viruses do not contain Chi sites and so are exterminated. It turns out that Chi sites are highly present in the bacterial genome. Uptake sequences are abundant sequence motifs, often located downstream of ORFs, that are used to facilitate the within-species horizontal transfer of DNA.

Example 1

First, we consider the Chi of Escherichia coli, gctggtgg (see Table 2), for different degrees of significance. We use the complete sequence of Escherichia coli K12 (Blattner et al. [11]); the sequence length is equal to 4639221. We recall that, for a fixed significance degree, the smaller the threshold $u$, the better the method. We can therefore conclude that the $\psi$-mixing method gives the most interesting results. The Chi of E. coli can be considered as over-represented from 99 occurrences on, for a significance degree $s$ of 0.0001. Because the Chen-Stein bound is equal to 0.067726, the Chen-Stein method does not permit any conclusion for significance degrees of 0.01 and 0.0001. Moreover, it is well known that the Chi of E. coli is a very relevant word in this bacterium, so we expect a very small significance degree for this word. Unfortunately, the minimal significance degree which can be obtained by the Chen-Stein method is, in fact, the Chen-Stein bound: 0.067726. Our method allows very small significance degrees: the minimal significance degree for which the Chi of E. coli is considered as over-represented by the $\psi$-mixing method is given in the last line of Table 2; it is equal to $10^{-239}$. Note also that the thresholds $u$ increase as the significance degree $s$ decreases. To understand this fact, it is sufficient to look at inequality (15). But they increase slowly as $s$ decreases; this could be surprising, but it is due to the error term, which decreases very fast from a certain number of occurrences on.

Table 2
Thresholds u obtained by the three methods for the Chi of Escherichia coli, gctggtgg (sequence length t equal to 4639221). For each method we compute the threshold from which the word can be considered as over-represented, for a degree of significance s. IMP means that the method cannot return a result. "counts" is the number of occurrences observed in the sequence.

s           Chen-Stein   phi-mixing   psi-mixing   counts
0.1         87           193          83           499
0.01        IMP          195          92           499
0.0001      IMP          197          99           499
10^{-239}   IMP          549          498          499

Example 2

Second, we consider the Chi of Haemophilus influenzae and its uptake sequence (see Table 3), for a significance degree $s$ equal to 0.01. We use the complete sequence of Haemophilus influenzae (Fleischmann et al. [15]); the sequence length is equal to 1830138. We observe that in all cases the $\psi$-mixing method is the best one, because it gives the smallest $u$, except for the word ggtggtgg, which has a periodicity smaller than $\lfloor n/2 \rfloor$ (and which we therefore cannot study: see the assumptions of Theorem 5).

Table 3
Thresholds u obtained by the three methods for the Chi and the uptake sequence of Haemophilus influenzae (sequence length t equal to 1830138). For each method and each word, we compute the threshold from which the word can be considered as over-represented, for a degree of significance equal to 0.01. IMP means that the method cannot return a result. "counts" is the number of occurrences observed in the sequence.

Words                 Chen-Stein   phi-mixing   psi-mixing   counts
gatggtgg (Chi)        23           36           22           20
gctggtgg (Chi)        21           32           20           44
ggtggtgg (Chi)        16           IMP          IMP          57
gttggtgg (Chi)        30           45           26           37
aagtgcggt (uptake)    13           17           13           737
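The "counts" columns in these tables are plain overlapping-occurrence counts, to be compared with the count $tP(A)$ expected under the Markov model. A minimal sketch (order one, maximum-likelihood transition estimates from the sequence itself; the function names are ours, not PANOW's, and the model fitting is a simple stand-in for seq++):

```python
from collections import Counter

def count_overlapping(seq, word):
    """Number of possibly overlapping occurrences of `word` in `seq`."""
    count, start = 0, seq.find(word)
    while start != -1:
        count += 1
        start = seq.find(word, start + 1)
    return count

def markov1_word_probability(seq, word):
    """P(A) for `word` under an order one Markov model estimated from
    `seq` by maximum likelihood (an assumption of this sketch)."""
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    singles = Counter(seq[:-1])
    p = singles[word[0]] / (len(seq) - 1)        # stationary frequency
    for a, b in zip(word, word[1:]):
        p *= pairs[a + b] / singles[a]           # transition probability
    return p

seq = "gctggtggacgtgctggtgg"                     # toy sequence
obs = count_overlapping(seq, "gg")
positions = len(seq) - len("gg") + 1             # possible start positions
expected = positions * markov1_word_probability(seq, "gg")
```

On this toy sequence the word gg occurs 4 times; the expected count under the fitted model equals the observed count here only because the model is estimated from the same short sequence.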
We cannot confirm the significance of the first Chi (gatggtgg), because we count only 20 occurrences in the sequence, whereas 23 occurrences are necessary to consider this word as exceptional. On the other hand, the uptake sequence is very significant (and hence very relevant). Indeed, we could fix a significance degree equal to $10^{-224}$ and consider it as over-represented from 736 occurrences on with the $\psi$-mixing method. As aagtgcggt is counted 737 times in the sequence, we recover the well-known fact that this word is biologically relevant.

7 Conclusions and perspectives

To conclude this paper, we recall the advantages of our new methods. We give an error bound valid for all values $k$ of the random variable $N_t$, the number of occurrences of word $A$ in a sequence of length $t$. We can then find a minimal number of occurrences from which a word can be considered biologically relevant, for a very large number of words and for all degrees of significance. That is the main advantage of our methods over the Chen-Stein one, which is based on the total variation distance and for which small degrees of significance cannot be reached. The results of our $\psi$-mixing method and of the Chen-Stein method remain similar, but our method has fewer limitations. Note that our methods provide useful results for general modelling processes such as Markov chains, as well as for every $\phi$- and $\psi$-mixing process.

In terms of perspectives, as we expect more significant results, we hope to improve these methods by adapting them directly to Markov chains instead of $\psi$- or $\phi$-mixing processes. Moreover, it is well known that a compound Poisson approximation is better for self-overlapping words (see Reinert et al. [27] and Reinert and Schbath [26]). An error term for the compound Poisson approximation for self-overlapping words can easily be derived from our results.
Acknowledgements

The authors would like to thank Bernard Prum for his support and his useful comments, Sophie Schbath for her program, Vincent Miele for his very relevant help in the conception of the software, and Catherine Matias for her invaluable advice.

References

[1] M. Abadi. Exponential approximation for hitting times in mixing processes. Mathematical Physics Electronic Journal, 7, 2001.
[2] M. Abadi. Sharp error terms and necessary conditions for exponential hitting times in mixing processes. Annals of Probability, 32:243-264, 2004.
[3] M. Abadi. Instantes de ocorrência de eventos raros em processos misturadores. PhD thesis, Universidade de São Paulo, 2001. Available at http://www.ime.unicamp.br/~miguel.
[4] M. Abadi and N. Vergne. Sharp error terms for return time statistics under mixing conditions. Submitted, 2006.
[5] M. Abadi and N. Vergne. Sharp error terms for Poisson statistics under mixing conditions: a new approach. Submitted, 2006.
[6] H. Almagor. A Markov analysis of DNA sequences. J. Theor. Biol., 104:633-645, 1983.
[7] R. Arratia, L. Goldstein, and L. Gordon. Two moments suffice for Poisson approximations: the Chen-Stein method. Ann. Prob., 17:9-25, 1989.
[8] R. Arratia, L. Goldstein, and L. Gordon. Poisson approximation and the Chen-Stein method. Statist. Sci., 5:403-434, 1990.
[9] A.D. Barbour, L.H.Y. Chen, and W.L. Loh. Compound Poisson approximation for nonnegative random variables via Stein's method. Ann. Prob., 20:1843-1866, 1992.
[10] B.E. Blaisdell. Markov chain analysis finds a significant influence of neighboring bases on the occurrence of a base in eucaryotic nuclear DNA sequences both protein-coding and noncoding. J. Mol. Evol., 21:278-288, 1985.
[11] F.R. Blattner, G. Plunkett 3rd, C.A. Bloch, N.T. Perna, V. Burland, M. Riley, J. Collado-Vides, J.D. Glasner, C.K. Rode, G.F. Mayhew, J. Gregor, N.W. Davis, H.A. Kirkpatrick, M.A. Goeden, D.J. Rose, B. Mau, and Y. Shao. The complete genome sequence of Escherichia coli K-12. Science, 277:1453-74, 1997.
[12] L.H.Y. Chen. Poisson approximation for dependent trials. Ann. Prob., 3:534-545, 1975.
[13] S.A. Douglass. Introduction to Mathematical Analysis, chapter 8. Addison-Wesley, Boston, 1996.
[14] M. El Karoui, V. Biaudet, S. Schbath, and A. Gruss. Characteristics of Chi distribution on different bacterial genomes. Res. Microbiol., 150:579-587, 1999.
[15] R.D. Fleischmann, M.D. Adams, O. White, and R.A. Clayton. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science, 269:496-512, 1995.
[16] M.S. Gelfand, C.G. Kozhukhin, and P.A. Pevzner. Extendable words in nucleotide sequences. Bioinformatics, 8:129-135, 1992.
[17] A.P. Godbole. Poisson approximations for runs and patterns of rare events. Adv. Appl. Prob., 23:851-865, 1991.
[18] S. Karlin, C. Burge, and A.M. Campbell. Statistical analyses of counts and distributions of restriction sites in DNA sequences. Nucl. Acids Res., 20:1363-1370, 1992.
[19] S.P. Meyn and R.L. Tweedie. Markov Chains and Stochastic Stability. Springer-Verlag, Heidelberg, 1993.
[20] V. Miele, P.Y. Bourguignon, D. Robelin, G. Nuel, and H. Richard. seq++: analyzing biological sequences with a range of Markov-related models. Bioinformatics, 21:2783-2784, 2005.
[21] P. Nicodème, T. Doerks, and M. Vingron. Proteome analysis based on motif statistics. Bioinformatics, 18(Suppl. 2):S161-S171, 2002.
[22] G. Nuel. LD-SPatt: Large Deviations Statistics for Patterns on Markov chains. J. Comp. Biol., 11:1023-1033, 2004.
[23] G.J. Phillips, J. Arnold, and R. Ivarie. The effect of codon usage on the oligonucleotide composition of the E. coli genome and identification of over- and under-represented sequences by Markov chain analysis. Nucl. Acids Res., 15:2627-2638, 1987.
[24] B. Prum, F. Rodolphe, and E. de Turckheim. Finding words with unexpected frequencies in DNA sequences. J. R. Statist. Soc. B, 11:190-192, 1995.
[25] M. Régnier. A unified approach to word occurrence probabilities. Discr. Appl. Math., 104:259-280, 2000.
[26] G. Reinert and S. Schbath. Compound Poisson and Poisson process approximations for occurrences of multiple words in Markov chains. J. Comput. Biol., 5:223-253, 1998.
[27] G. Reinert, S. Schbath, and M.S. Waterman. Probabilistic and statistical properties of words: an overview. J. Comput. Biol., 7, 2000.
[28] S. Robin and J.J. Daudin. Exact distribution of word occurrences in a random sequence of letters. J. Appl. Prob., 36, 1999.
[29] G.R. Smith, S.M. Kunes, D.W. Schultz, A. Taylor, and K.L. Triman. Structure of chi hotspots of generalized recombination. Cell, 24:429-36, 1981.
[30] H.O. Smith, M.L. Gwinn, and S.L. Salzberg. DNA uptake signal sequences in naturally transformable bacteria. Res. Microbiol., 150:603-616, 1999.
[31] C. Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. Proc. Sixth Berkeley Symp. Math. Statist. Probab., 2:583-602, 1972. University of California Press.
[32] J. van Helden, B. André, and J. Collado-Vides. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J. Mol. Biol., 281:827-842, 1998.
[33] J. van Helden, M. del Olmo, and J.E. Pérez-Ortín. Statistical analysis of yeast genomic downstream sequences reveals putative polyadenylation signals. Nucl. Acids Res., 28:1000-1010, 2000.