On the maximal number of highly periodic runs in a string
A run is a maximal occurrence of a repetition $v$ with a period $p$ such that $2p \le |v|$. The maximal number of runs in a string of length $n$ was studied by several authors and it is known to be between $0.944 n$ and $1.029 n$. We investigate high…
Authors: Maxime Crochemore, Costas Iliopoulos, Marcin Kubica
On the maximal number of highly periodic runs in a string ⋆

Maxime Crochemore (1,3), Costas Iliopoulos (1,4), Marcin Kubica (2), Jakub Radoszewski (2), Wojciech Rytter ⋆⋆ (2,5), and Tomasz Waleń (2)

(1) Dept. of Computer Science, King's College London, London WC2R 2LS, UK, [maxime.crochemore,csi]@kcl.ac.uk
(2) Dept. of Mathematics, Computer Science and Mechanics, University of Warsaw, Warsaw, Poland, [kubica,jrad,rytter,walen]@mimuw.edu.pl
(3) Université Paris-Est, France
(4) Digital Ecosystems & Business Intelligence Institute, Curtin University of Technology, Perth WA 6845, Australia
(5) Dept. of Math. and Informatics, Copernicus University, Toruń, Poland

Abstract. A run is a maximal occurrence of a repetition $v$ with a period $p$ such that $2p \le |v|$. The maximal number of runs in a string of length $n$ was studied by several authors and it is known to be between $0.944n$ and $1.029n$. We investigate highly periodic runs, in which the shortest period $p$ satisfies $3p \le |v|$. We show the upper bound $0.5n$ on the maximal number of such runs in a string of length $n$ and construct a sequence of words for which we obtain the lower bound $0.406n$.

1 Introduction

Repetitions and periodicities in strings are one of the fundamental topics in combinatorics on words [2, 13]. They are also important in other areas: lossless compression, word representation, computational biology, etc. Repetitions are studied from different directions: classification of words not containing repetitions of a given exponent, efficient identification of factors being repetitions of different types and, finally, computing bounds on the number of repetitions of a given exponent that a string may contain, which we consider in this paper. Both the known results in the topic and a deeper description of the motivation can be found in the survey by Crochemore et al. [5].
The concept of runs (also called maximal repetitions) has been introduced to represent all repetitions in a string in a succinct manner. The crucial property of runs is that their maximal number in a string of length $n$ (denoted as $\mathit{runs}(n)$) is $O(n)$ [10]. Due to the work of many people, much better bounds on $\mathit{runs}(n)$ have been obtained. The lower bound $0.927n$ was first proved in [8]. Afterwards it was improved by Kusano et al. [12] to $0.944n$ employing computer experiments, and very recently by Simpson [18] to $0.944575712n$. On the other hand, the first explicit upper bound $5n$ was settled in [15]; afterwards it was systematically improved to $3.44n$ [17], $1.6n$ [3, 4] and $1.52n$ [9]. The best known result $\mathit{runs}(n) \le 1.029n$ is due to Crochemore et al. [6], but it is conjectured [10] that $\mathit{runs}(n) < n$. The maximal number of runs was also studied for special types of strings, and tight bounds were established for Fibonacci strings [10, 16] and more generally Sturmian strings [1].

⋆ Research supported in part by the Royal Society, UK.
⋆⋆ Supported by grant N206 004 32/0806 of the Polish Ministry of Science and Higher Education.

The combinatorial analysis of runs in strings is strongly related to the problem of estimating the maximal number of occurrences of squares in a string. In the latter the gap between the upper and lower bound is much larger than for runs [5, 7]. However, a recent paper [11] by some of the authors shows that the introduction of exponents larger than 2 can lead to tighter bounds on the number of corresponding occurrences.

In this paper we introduce and study the concept of highly periodic runs (hp-runs), in which the period is at least three times shorter than the run. We show the following bounds on the number $\text{hp-runs}(n)$ of such runs in a string of length $n$:

$$0.406n \le \text{hp-runs}(n) \le \frac{n-1}{2}$$

The upper bound is achieved by analyzing prime words (i.e. words that are primitive and minimal/maximal in the class of their cyclic equivalents) that appear as periods of hp-runs. As for the lower bound, we give a simple argument that leads to the $0.4n$ bound and then describe a family of words that improves this bound to $0.406n$.

2 Definitions

We consider words over a finite alphabet $A$, $u \in A^*$; by $\varepsilon$ we denote the empty word; the positions in a word $u$ are numbered from 1 to $|u|$. By $\mathit{Alph}(u)$ we denote the set of all letters of $u$. For $u = u_1 u_2 \ldots u_m$, by $u[i..j]$ we denote the factor of $u$ equal to $u_i \ldots u_j$ (in particular $u[i] = u[i..i]$). Words $u[1..i]$ are called prefixes of $u$, and words $u[i..m]$ are called suffixes of $u$.

We say that a positive integer $p$ is the (shortest) period of a word $u = u_1 \ldots u_m$ (notation: $p = \mathit{per}(u)$) if $p$ is the smallest number such that $u_i = u_{i+p}$ holds for all $1 \le i \le m - p$.

If $w^k = u$ ($k$ is a non-negative integer) then we say that $u$ is the $k$-th power of the word $w$. A square is the 2nd power of some word. The primitive root of a word $u$, denoted $\mathit{root}(u)$, is the shortest word $w$ such that $w^k = u$ for some positive $k$. We call a word $u$ primitive if $\mathit{root}(u) = u$; otherwise it is called nonprimitive. We say that words $u$ and $v$ are cyclically equivalent (or that one of them is a cyclic rotation of the other) if $u = xy$ and $v = yx$ for some $x, y \in A^*$. It is a simple observation that if $u$ and $v$ are cyclically equivalent then $|\mathit{root}(u)| = |\mathit{root}(v)|$; in particular, $u$ is primitive if and only if $v$ is.

Let us assume that $A$ is totally ordered by $\le$, which induces a lexicographical order on $A^*$, also denoted by $\le$. We say that $u \in A^*$ is a prime word if it is primitive and minimal or maximal in the class of words that are cyclically equivalent to it. It can be proved [13] that a prime word $u$ cannot have a proper (i.e. non-empty and different from $u$) prefix that would also be its suffix.

A run (also called a maximal repetition) in a string $u$ is an interval $[i..j]$ such that the associated factor $u[i..j]$ has period $p$ with $2p \le j - i + 1$, and this property cannot be extended to the right nor to the left: $u[i-1] \ne u[i+p-1]$ and $u[j-p+1] \ne u[j+1]$ when the letters are defined. A highly periodic run (hp-run) is a run $[i..j]$ for which the shortest period $p$ satisfies $3p \le j - i + 1$.

For simplicity, in the further text we sometimes refer to runs or hp-runs as occurrences of the corresponding factors of $u$.

3 Upper bound

Let $u \in A^*$ be a word of length $n$. By $P = \{p_1, p_2, \ldots, p_{n-1}\}$ we denote the set of inter-positions of $u$, that is, the positions located between pairs of consecutive letters of $u$. We define a function $F$ that assigns to each hp-run $v$ in a string a set of handles chosen among the inter-positions within $v$. Hence, $F$ is a mapping from the set of hp-runs occurring in $u$ to the set $2^P$ of subsets of $P$.

Let $v$ be an hp-run with period $p$ and let $w$ be the prefix of $v$ of length $p$. By $w_{\min}$ and $w_{\max}$ we denote the words cyclically equivalent to $w$ that are minimal and maximal in lexicographical order. We define $F(v)$ as follows:

a) if $w_{\min} \ne w_{\max}$ then $F(v)$ contains the inter-positions between consecutive occurrences of $w_{\min}$ and between consecutive occurrences of $w_{\max}$ within $v$;
b) if $w_{\min} = w_{\max}$ then $F(v)$ contains all inter-positions within $v$.

Lemma 1. $w_{\min}$ and $w_{\max}$ are prime words.

Proof. By the definition of $w_{\min}$ and $w_{\max}$, it suffices to show that both words are primitive. This follows from the fact that, due to the minimality of $p$, $w$ is primitive, and that $w_{\min}$ and $w_{\max}$ are cyclically equivalent to $w$. □

Lemma 2. Case b) from the above definition implies that $|w_{\min}| = 1$.

Proof.
$w_{\min}$ is primitive, therefore if $|w_{\min}| \ge 2$ then $w_{\min}$ would contain at least two distinct letters, $a = w_{\min}[1]$ and $b = w_{\min}[i] \ne a$. If $b < a$ ($b > a$) then the cyclic rotation of $w_{\min}$ by $i - 1$ letters would be lexicographically smaller (greater) than $w_{\min}$, which contradicts $w_{\min} = w_{\max}$ being both minimal and maximal. □

Note that in case b) of the definition of $F$, obviously $F(v)$ contains at least two distinct handles. The following lemma shows that the same property also holds in case a).

Lemma 3. Each of the words $w_{\min}^2$ and $w_{\max}^2$ is a factor of $v$.

Proof. Recall that $3p \le |v|$, where $p = \mathit{per}(v)$. By Lemma 2, this concludes the proof in case b). As for the proof in case a), it suffices to note that the first occurrences of each of the words $w_{\min}$, $w_{\max}$ within $v$ start no further than $p$ positions from the beginning of $v$. □

Fig. 1. Illustration of the definition of $F$ and Lemma 3, showing a run $v$ in case a), with the occurrences of $w_{\min}$ and $w_{\max}$ marked, and a run $v$ in case b). The arrows in the figure point to positions from the set of handles $F(v)$.

We now show a crucial property of $F$.

Lemma 4. $F(v_1) \cap F(v_2) = \emptyset$ for every two distinct hp-runs $v_1$, $v_2$ in $u$.

Proof. Assume to the contrary that $p_i \in F(v_1) \cap F(v_2)$ is a handle of two different hp-runs $v_1$ and $v_2$. By Lemmas 1 and 3, $p_i$ is located in the middle of two squares $w_1^2$ and $w_2^2$ of prime words, where $|w_1| = \mathit{per}(v_1)$ and $|w_2| = \mathit{per}(v_2)$. We have $w_1 \ne w_2$, since otherwise the runs $v_1$ and $v_2$ would be the same. W.l.o.g. assume that $|w_1| < |w_2|$. Then the word $w_1$ is both a prefix and a suffix of $w_2$ (see Fig. 2), which contradicts the primality of $w_2$. □

Fig. 2. A situation where $p_i$ is in the middle of two different squares $w_1^2$ and $w_2^2$.

The following theorem concludes the analysis of the upper bound.

Theorem 1. A word $u \in A^*$ of length $n$ may contain at most $\frac{n-1}{2}$ hp-runs.

Proof.
Due to Lemma 3, for each hp-run $v$ within $u$, $|F(v)| \ge 2$. By Lemma 4 the sets $F(v)$ are pairwise disjoint subsets of $P$. Since $|P| = n - 1$, the number of hp-runs is at most $\frac{n-1}{2}$, which is the conclusion of the theorem. □

4 Lower bound

Lemma 5. Let $s$ be a word and denote $r = \text{hp-runs}(s)$, $\ell = |s|$. There exists a sequence of words $(s_n)_{n=0}^{\infty}$, $s_0 = s$, such that for $r_n = \text{hp-runs}(s_n)$ and $\ell_n = |s_n|$ we have

$$\lim_{n \to \infty} \frac{r_n}{\ell_n} = \frac{r}{\ell} + \frac{1}{5\ell}$$

Proof. We define the sequence $s_n$ recursively. Denote $A = \mathit{Alph}(s_n)$ and let $\bar{A}$ be a disjoint copy of $A$. By $\bar{s}_n$ we denote the word obtained from $s_n$ by substituting letters from $A$ with the corresponding letters from $\bar{A}$. We define $s_{n+1} = (s_n \bar{s}_n)^3$. Recall that $\ell_0 = \ell$, $r_0 = r$ and note that for $n \ge 1$

$$\ell_n = 6\,\ell_{n-1}, \qquad r_n = 6\,r_{n-1} + 1$$

By simple induction we obtain

$$\frac{r_n}{\ell_n} = \frac{r}{\ell} + \frac{1}{\ell} \sum_{i=1}^{n} \frac{1}{6^i} = \frac{r}{\ell} + \frac{1}{5\ell}\left(1 - \frac{1}{6^{n}}\right)$$

Taking $n \to \infty$ in the above formula we obtain the conclusion of the lemma. □

Starting with the 3-letter word $s = a^3$, for which $r/\ell = 1/3$, from Lemma 5 we obtain the bound $0.4n$. This bound is, however, not optimal: we will show an example of a sequence of words for which we obtain the bound $0.406n$.

Let $A = \{a, b\}$. We denote:

$$X = (a^3 b^3)^3, \quad Y = a^4 b^3 a, \quad α = XY, \quad β = Xa$$

Lemma 6. A couple of important properties of the words $α$ and $β$:

– $XYX$ introduces a new hp-run with the period 7. Hence, each of the pairs $αα$ and $αβ$ introduces a new hp-run.
– $β$ is a prefix of $α$. Hence, $αβαβαα$ introduces the hp-run $(αβ)^3$.
– $Y$ is a prefix of $aX$, therefore $α$ is a prefix of $βα$. Hence, $ααβα$ introduces the hp-run $α^3$.

Now we will also be dealing with a new alphabet $A' = \{α, β\}$. We define the Fibonacci morphism $h$ as:

$$h(α) = αβ, \qquad h(β) = α$$

Let $f_n = h^n(α)$, $r_n = \text{hp-runs}(f_n)$, $\ell_n = |f_n|$.

| $n$ | $r_n$ | $\ell_n$ | $r_n/\ell_n$ | $f_n$ |
|---|---|---|---|---|
| 0 | 9 | 26 | 0.3462 | α |
| 1 | 17 | 45 | 0.3778 | αβ |
| 2 | 26 | 71 | 0.3662 | αβα |
| 3 | 45 | 116 | 0.3879 | αβααβ |
| 4 | 71 | 187 | 0.3796 | αβααβαβα |
| 5 | 119 | 303 | 0.3927 | αβααβαβααβααβ |
| 6 | 192 | 490 | 0.3918 | αβααβαβααβααβαβααβαβα |

Table 1: The first few words of the sequence $f_n$ with the corresponding terms of the sequences $r_n$ and $\ell_n$.

Theorem 2.

$$\lim_{n \to \infty} \frac{r_n}{\ell_n} > 0.406$$

In particular,

$$\frac{r_{19}}{\ell_{19}} \ge \frac{103664}{255329} > 0.406$$

Proof. We start with the values $\ell_n$, $r_n$ for $n \le 4$ that are precomputed in Table 1 and show that for $n \ge 5$ the following recursive formulas hold:

$$\ell_n = \ell_{n-1} + \ell_{n-2} \qquad (1)$$
$$r_n \ge r_{n-1} + r_{n-2} + n - 4 \quad \text{if } 2 \mid n \qquad (2)$$
$$r_n \ge r_{n-1} + r_{n-2} + n - 2 \quad \text{if } 2 \nmid n \qquad (3)$$

The "in particular" part of the theorem is a straightforward consequence of these formulas.

(1) is obvious, therefore we concentrate on the inequalities for $r_n$. The recursive part of each of them ($r_{n-1} + r_{n-2}$) is a consequence of the formula $f_n = f_{n-1} f_{n-2}$ and the fact that Fibonacci words contain repetitions of exponent at most $2 + \Phi < 4$, see [14]. Due to Lemma 6, for even values of $n$ a new hp-run is introduced upon concatenation; see the example for $n = 6$, where the bar marks the boundary between $f_{n-1}$ and $f_{n-2}$:

$$αβααβαβααβα\,\underbrace{αβ \mid αβαα}\,βαβα$$

and for odd values of $n$, three more hp-runs appear, as in the following example for $n = 5$:

$$αβααβαβ\,\underbrace{α \mid α}\,βααβ$$
$$αβααβαβ\,\underbrace{α \mid αβα}\,αβ$$
$$αβα\,\underbrace{αβαβα \mid α}\,βααβ$$

Apart from that, since

$$h(αβαβαα) = \underbrace{αβααβααβα}\,β$$

contains the hp-run $f_2^3$, the word $f_n$ introduces $n - 5$ new hp-runs composed from $f_2^3, f_3^3, \ldots, f_{n-4}^3$, each created by iterating $h^i(αβαβαα)$; see the example for $n = 7$:

$$αβααβαβααβααβαβααβ\,\underbrace{αβα \mid αβααβα}\,βααβααβ$$
$$αβααβαβα\,\underbrace{αβααβαβααβαβα \mid αβ}\,ααβαβααβααβ$$

In total, we obtain $n - 4$ new hp-runs for even $n$ and $n - 2$ for odd $n$, which concludes the proof of the inequalities. □

References

1. P. Baturo, M. Piatkowski, and W. Rytter. The number of runs in Sturmian words. In O. H. Ibarra and B. Ravikumar, editors, CIAA, volume 5148 of Lecture Notes in Computer Science, pages 252–261. Springer, 2008.
2. J. Berstel and J. Karhumaki. Combinatorics on words: a tutorial. Bulletin of the EATCS, 79:178–228, 2003.
3. M. Crochemore and L. Ilie. Analysis of maximal repetitions in strings. In L. Kucera and A. Kucera, editors, MFCS, volume 4708 of Lecture Notes in Computer Science, pages 465–476. Springer, 2007.
4. M. Crochemore and L. Ilie. Maximal repetitions in strings. J. Comput. Syst. Sci., 74(5):796–807, 2008.
5. M. Crochemore, L. Ilie, and W. Rytter. Repetitions in strings: algorithms and combinatorics. Theoret. Comput. Sci. (to appear).
6. M. Crochemore, L. Ilie, and L. Tinta. Towards a solution to the "runs" conjecture. In P. Ferragina and G. M. Landau, editors, CPM, volume 5029 of Lecture Notes in Computer Science, pages 290–302. Springer, 2008.
7. M. Crochemore and W. Rytter. Squares, cubes, and time-space efficient string searching. Algorithmica, 13(5):405–425, 1995.
8. F. Franek and Q. Yang. An asymptotic lower bound for the maximal number of runs in a string. Int. J. Found. Comput. Sci., 19(1):195–203, 2008.
9. M. Giraud. Not so many runs in strings. In C. Martín-Vide, F. Otto, and H. Fernau, editors, LATA, volume 5196 of Lecture Notes in Computer Science, pages 232–239. Springer, 2008.
10. R. M. Kolpakov and G. Kucherov. Finding maximal repetitions in a word in linear time. In Proceedings of the 40th Symposium on Foundations of Computer Science, pages 596–604, 1999.
11. M. Kubica, J. Radoszewski, W. Rytter, and T. Walen. On the maximal number of cubic subwords in a string. In Proceedings of the 20th International Workshop on Combinatorial Algorithms (to appear), 2009.
12. K. Kusano, W. Matsubara, A. Ishino, H. Bannai, and A. Shinohara. New lower bounds for the maximum number of runs in a string. CoRR, abs/0804.1214, 2008.
13. M. Lothaire. Combinatorics on Words. Addison-Wesley, Reading, MA., U.S.A., 1983.
14. F. Mignosi and G. Pirillo. Repetitions in the Fibonacci infinite word. ITA, 26:199–204, 1992.
15. W. Rytter. The number of runs in a string: Improved analysis of the linear upper bound. In B. Durand and W. Thomas, editors, STACS, volume 3884 of Lecture Notes in Computer Science, pages 184–195. Springer, 2006.
16. W. Rytter. The structure of subword graphs and suffix trees in Fibonacci words. Theor. Comput. Sci., 363(2):211–223, 2006.
17. W. Rytter. The number of runs in a string. Inf. Comput., 205(9):1459–1469, 2007.
18. J. Simpson. Modified Padovan words and the maximum number of runs in a word. Australasian Journal of Combinatorics (to appear).
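
The definitions of Section 2 and the small cases of Section 4 can be cross-checked by brute force. The following is a minimal sketch, not the authors' method: it enumerates runs with a quadratic scan over all candidate periods (a linear-time algorithm is the subject of [10]) and counts those that are highly periodic. The function names `runs`, `hp_runs`, `lemma5_step`, `fib_word` and `theorem2_bound` are illustrative assumptions, and positions are 0-based here rather than 1-based as in the paper.

```python
def runs(s):
    """Map each run (i, j) of s (0-based, inclusive ends) to its shortest period.

    Brute force: for every candidate period p, find maximal blocks of
    positions k with s[k] == s[k + p]. Periods are tried in ascending
    order, so the first period recorded for an interval is the shortest.
    """
    n = len(s)
    found = {}
    for p in range(1, n // 2 + 1):
        k = 0
        while k + p < n:
            if s[k] != s[k + p]:
                k += 1
                continue
            start = k
            while k + p < n and s[k] == s[k + p]:
                k += 1
            end = k - 1 + p  # s[start..end] is a maximal factor with period p
            if end - start + 1 >= 2 * p:
                found.setdefault((start, end), p)
    return found


def hp_runs(s):
    """Number of runs whose shortest period p satisfies 3p <= run length."""
    return sum(1 for (i, j), p in runs(s).items() if j - i + 1 >= 3 * p)


def lemma5_step(s):
    """One step s -> (s s_bar)^3 of Lemma 5; words are tuples of ints."""
    shift = max(s) + 1  # s_bar is written over a disjoint copy of the alphabet
    s_bar = tuple(c + shift for c in s)
    return (s + s_bar) * 3


X = "aaabbb" * 3           # X = (a^3 b^3)^3
ALPHA = X + "aaaabbba"     # alpha = X Y with Y = a^4 b^3 a
BETA = X + "a"             # beta = X a


def fib_word(n):
    """f_n = h^n(alpha), expanded to a word over {a, b}."""
    w = "A"  # 'A' stands for alpha, 'B' for beta
    for _ in range(n):
        w = "".join("AB" if c == "A" else "A" for c in w)
    return "".join(ALPHA if c == "A" else BETA for c in w)


def theorem2_bound(n):
    """Lower bound on r_n (n >= 4) from recurrences (2) and (3),
    seeded with r_3 = 45 and r_4 = 71 from Table 1."""
    r_prev, r_cur = 45, 71
    for m in range(5, n + 1):
        extra = m - 4 if m % 2 == 0 else m - 2
        r_prev, r_cur = r_cur, r_cur + r_prev + extra
    return r_cur


if __name__ == "__main__":
    for n in range(5):
        f = fib_word(n)
        print(n, hp_runs(f), len(f))
```

Under these assumptions, `hp_runs(fib_word(n))` can be compared against the $r_n$ column of Table 1 for small $n$, `lemma5_step` applied to `(0, 0, 0)` reproduces the construction that starts from $s = a^3$, and `theorem2_bound(19)` evaluates the recurrences used in the proof of Theorem 2. The quadratic scan is chosen for transparency only and is not practical for words as long as $f_{19}$.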