New Lower Bounds for the Maximum Number of Runs in a String
We show a new lower bound for the maximum number of runs in a string. We prove that for any ε > 0, (α − ε)n is an asymptotic lower bound, where α = 56733/60064 ≈ 0.944542. It is superior to the previous bound 0.927 given by Franěk et al. Moreover, o…
Authors: Kazuhiko Kusano, Wataru Matsubara, Akira Ishino, Hideo Bannai, Ayumi Shinohara
Kazuhiko Kusano¹, Wataru Matsubara¹, Akira Ishino¹, Hideo Bannai², and Ayumi Shinohara¹

¹ Graduate School of Information Science, Tohoku University, Aramaki aza Aoba 6-6-05, Aoba-ku, Sendai 980-8579, Japan
{kusano@shino., matsubara@shino., ishino@, ayumi@}ecei.tohoku.ac.jp
² Department of Informatics, Kyushu University, 744 Motooka, Nishiku, Fukuoka 819-0395 Japan.
bannai@i.kyushu-u.ac.jp

Abstract. We show a new lower bound for the maximum number of runs in a string. We prove that for any $\varepsilon > 0$, $(\alpha - \varepsilon)n$ is an asymptotic lower bound, where $\alpha = 56733/60064 \approx 0.944542$. It is superior to the previous bound $3/(1+\sqrt{5}) \approx 0.927$ given by Franěk et al. [1, 2]. Moreover, our construction of the strings and the proof is much simpler than theirs.

1 Introduction

Repetitions in strings are an important element in the analysis and processing of strings. It was shown in [3] that when considering maximal repetitions, or runs, the maximum number of runs $\rho(n)$ in any string of length $n$ is $O(n)$, leading to a linear time algorithm for computing all the runs in a string. Although they were not able to give bounds for the constant factor, there have been several works to this end [4–6]. The currently known best upper bound³ is $\rho(n) \le 1.048n$, obtained by calculations based on the proof technique of [6]. The technique bounds the number of runs for each string by considering runs in two parts: runs with long periods, and runs with short periods. The former is more sparse and easier to bound, while the latter is bounded by an exhaustive calculation concerning how runs of different periods can overlap in an interval of some length.

On the other hand, an asymptotic lower bound on $\rho(n)$ is presented in [2], where it is shown that for any $\varepsilon > 0$, there exists an integer $N > 0$ such that for any $n > N$, $\rho(n) \ge (\alpha - \varepsilon)n$, where $\alpha = \frac{3}{1+\sqrt{5}} \approx 0.927$. It was conjectured in [1] that this bound is optimal.

In this paper, we prove that the conjecture was false, by showing a new lower bound $\alpha = 56733/60064 \approx 0.944542$. First we show a concrete string $\tau$ of length 60064, which contains 56714 runs. It immediately disproves the conjecture, since $56714/60064 \approx 0.944226$ is already higher than the previous bound 0.927. Then we prove that the string $\tau^k$, which is the string obtained by concatenating $k$ copies of $\tau$, contains $56733k - 18$ runs for any $k \ge 2$. Since $|\tau^k| = 60064k$, it yields the new lower bound $56733/60064$ as $k \to \infty$.

³ Presented on the website http://www.csd.uwo.ca/faculty/ilie/runs.html

2 Preliminaries

Let $\Sigma$ be a finite set of symbols, called an alphabet. Strings $x$, $y$ and $z$ are said to be a prefix, substring, and suffix of the string $w = xyz$, respectively. The length of a string $w$ is denoted by $|w|$. The $i$-th symbol of a string $w$ is denoted by $w[i]$ for $1 \le i \le |w|$, and the substring of $w$ that begins at position $i$ and ends at position $j$ is denoted by $w[i:j]$ for $1 \le i \le j \le |w|$. A string $w$ has period $p$ if $w[i] = w[i+p]$ for $1 \le i \le |w| - p$. A string $w$ is called primitive if $w$ cannot be written as $u^k$, where $k$ is a positive integer, $k \ge 2$.

A string $u$ is a run if it is periodic with (minimum) period $p \le |u|/2$. A substring $u = w[i:j]$ of $w$ is a run in $w$ if it is a run of period $p$ and neither $w[i-1:j]$ nor $w[i:j+1]$ is a run of period $p$; that is, the run is maximal. We denote the run $u = w[i:j]$ in $w$ by the triple $\langle i, j-i+1, p \rangle$ consisting of the begin position $i$, the length $|u|$, and the minimum period $p$ of $u$.
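
The definitions above translate directly into a brute-force enumeration. The following is a minimal Python sketch, not the authors' program: the names `min_period` and `runs` are illustrative, and the method (scan each candidate period $p$ for maximal stretches where $w[i] = w[i+p]$, then keep those of length at least $2p$ whose minimum period is exactly $p$) is a naive approach chosen here for clarity rather than speed. Runs are reported as 1-based triples (begin position, length, minimum period).

```python
def min_period(u):
    """Smallest q >= 1 such that u[i] == u[i + q] wherever both are defined."""
    n = len(u)
    for q in range(1, n + 1):
        if u[q:] == u[:n - q]:
            return q
    return n  # only reached for the empty string


def runs(w):
    """Return all runs of w as sorted triples (begin, length, min_period)."""
    n = len(w)
    found = []
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            if w[i] != w[i + p]:
                i += 1
                continue
            start = i
            # Extend the stretch of positions satisfying w[i] == w[i + p].
            while i + p < n and w[i] == w[i + p]:
                i += 1
            # w[start : i + p] is a maximal substring having period p.
            length = (i - start) + p
            if length >= 2 * p and min_period(w[start:i + p]) == p:
                found.append((start + 1, length, p))
    return sorted(found)


if __name__ == "__main__":
    for triple in runs("aabaabaaaacaacac"):
        print(triple)
```

On the string aabaabaaaacaacac this sketch reports seven triples, which agrees with the count of 7 stated for that string in the text.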

A run of $w$ which is a prefix (resp. suffix) of $w$ is called a prefix (resp. suffix) run of $w$. For a string $w$, we denote by $run(w)$ the number of runs in $w$.

For example, the string aabaabaaaacaacac contains the following 7 runs: $\langle 1, 2, 1 \rangle = a^2$, $\langle 4, 2, 1 \rangle = a^2$, $\langle 7, 4, 1 \rangle = a^4$, $\langle 12, 2, 1 \rangle = a^2$, $\langle 13, 4, 2 \rangle = (ac)^2$, $\langle 1, 8, 3 \rangle = (aab)^{8/3}$, and $\langle 9, 7, 3 \rangle = (aac)^{7/3}$. Thus $run(aabaabaaaacaacac) = 7$.

We are interested in the behavior of the maxrun function defined by
$$\rho(n) = \max\{\, run(w) \mid w \text{ is a string of length } n \,\}.$$
Franěk, Simpson and Smyth [1] showed a beautiful construction of a series of strings which contain many runs, and later Franěk and Qian Yang [2] formally proved a family of true asymptotic lower bounds arbitrarily close to $\frac{3}{1+\sqrt{5}}\,n$ as follows.

Theorem 1 ([2]). For any $\varepsilon > 0$ there exists a positive integer $N$ so that $\rho(n) \ge \left(\frac{3}{1+\sqrt{5}} - \varepsilon\right) n$ for any $n \ge N$.

3 Basic Properties

In this section, we summarize some basic properties concerning periods and repetitions in strings, which will be utilized in the sequel. The next lemma, given by Fine and Wilf [7], provides an important property on periods of a string.

Lemma 1 (Periodicity Lemma (see [8, 9])). Let $p$ and $q$ be two periods of a string $w$. If $p + q - \gcd(p, q) \le |w|$, then $\gcd(p, q)$ is also a period of $w$.

For a string $w$, let us consider a series of strings $w, w^2, w^3, w^4, \ldots$, and observe all runs contained in these strings. There are many cases, which complicate the task of counting the number of runs in these strings.

1. A run in $w^k$ which is neither a suffix nor prefix run of $w^k$ is also a run in $w^{k+1}$.
2. A suffix run in $w^k$ and a prefix run in $w$ may be merged into one run in $w^{k+1}$.
3. A suffix run in $w^k$ may be extended to a run in $w^{k+1}$.
4. A new run may be newly created at the border between $w^{k+1}$ and $w$.

Concerning case 4, note that a new run that did not appear in $w$ or $w^2$ may be created in $w^3$. For example, consider strings $w$ = abcacabc and $r = (cabca)^2$. We can verify that $r$ is a run $\langle 8, 10, 5 \rangle$ of $w^3$ = abcacabcabcacabcabcacabc, while $r$ does not appear in $w^2$ = abcacabcabcacabc. Moreover, the same argument also holds for the binary alphabet {0, 1}: replace a, b, c with 01, 10, 00, respectively, in the above example.

However, the following lemma shows that the length of such new runs can be bounded.

Lemma 2. Let $w$ be a string of length $n$. For any $k \ge 3$, let $r = \langle i, l, p \rangle$ be a run in $w^k$. If $l \ge 2n$, then $i = 1$ and $l = kn$, that is, $r = w^k$.

Proof. We assume that $n > 1$, since it is trivial for the case $n = 1$. Since $p$ is the minimum period of the run $r$, we know $|r| = l \ge 2p$ and $l \ge 2n$. Let $u$ be a primitive string of length $m$ where $w = u^t$ for some integer $t \ge 1$. Then, $|u| = m \le n$ is also a period of run $r$. Since $p + m \le l$, Lemma 1 claims that $\gcd(p, m)$ is also a period of run $r$. If $p > m$, then $\gcd(p, m) < p$, which contradicts the assumption that $p$ is the minimum period of $r$. If $p < m$, then it contradicts the assumption that $u$ is primitive. Therefore we have $p = m$. Since $m$ is a period of $w^k$, we have $r = \langle 1, kn, m \rangle = w^k$.

This lets us prove the following lemma which gives a formula for $run(w^k)$.

Lemma 3. Let $w$ be a string of length $n$. For any $k \ge 2$, $run(w^k) = Ak - B$, where $A = run(w^3) - run(w^2)$ and $B = 2\,run(w^3) - 3\,run(w^2)$.

Proof. We think about the increase in the number of runs when concatenating $w^k$ and $w$. Let $r = \langle i, l, p \rangle$ be a run of $w^{k+1}$ such that $i + l > nk + 1$, that is, $r$ ends somewhere in the last $w$ of $w^{k+1}$. By Lemma 2, if $i \le (k-2)n$ then $r = w^{k+1}$. In such a case, $r$ does not increase the number of runs since the run will have already been considered in $w^2$.
Therefore, the increase in runs can be considered by restricting our attention to runs with $i > (k-2)n$, that is, the increase in runs for the last 3 $w$'s of $w^{k+1}$ when concatenating $w$ to the last 2 $w$'s of $w^k$. This gives us $run(w^{k+1}) - run(w^k) = run(w^3) - run(w^2)$. Hence
$$\begin{aligned}
run(w^k) &= run(w^{k-1}) + run(w^3) - run(w^2) \\
&= run(w^{k-2}) + 2\,(run(w^3) - run(w^2)) \\
&= run(w^2) + (k-2)(run(w^3) - run(w^2)) \\
&= k\,(run(w^3) - run(w^2)) - (2\,run(w^3) - 3\,run(w^2))
\end{aligned}$$
for $k \ge 3$. It is easy to see that the equation also holds for $k = 2$.

Theorem 2. For any string $w$ and any $\varepsilon > 0$, there exists a positive integer $N$ such that for any $n \ge N$,
$$\frac{\rho(n)}{n} > \frac{run(w^3) - run(w^2)}{|w|} - \varepsilon.$$

Proof. By Lemma 3, $run(w^k) = Ak - B$, where $A = run(w^3) - run(w^2)$ and $B = 2\,run(w^3) - 3\,run(w^2)$. For any given $\varepsilon > 0$, we choose $N > \frac{A + B}{\varepsilon}$. For any $n \ge N$, let $k$ be the integer satisfying $|w|(k-1) \le n < |w|k$. Notice that $k > \frac{n}{|w|} \ge \frac{N}{|w|} \ge \frac{A + B}{|w|\,\varepsilon}$. Since $\rho(i+1) \ge \rho(i)$ for any $i$, and $|w^{k-1}| = |w|(k-1)$,
$$\frac{\rho(n)}{n} \ge \frac{\rho(|w|(k-1))}{|w|k} \ge \frac{run(w^{k-1})}{|w|k} = \frac{A(k-1) - B}{|w|k} = \frac{Ak - A - B}{|w|k} = \frac{A}{|w|} - \frac{A + B}{|w|k} > \frac{A}{|w|} - \varepsilon.$$
□

4 New Lower Bounds

We found some strings which contain many runs, by running a computer program which utilizes a simple heuristic search for run-rich binary strings. Given a buffer size, the search first starts with the single string 0 in the buffer. At each round, two new strings are created from each string in the buffer by appending 0 or 1 to the string. The new strings are then sorted in order of $run(w^3) - run(w^2)$, and only those that fit in the buffer are retained for the next round. Strings that give a high ratio of runs are recorded. We tried several variations of the algorithm, and found many run-rich strings.
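
The search described above can be sketched as a small beam search. This is a hedged illustration under stated assumptions, not the authors' program: the names `count_runs`, `gain` and `beam_search` are hypothetical, the sort is assumed to be descending (highest $run(w^3) - run(w^2)$ kept), the recorded "ratio" is assumed to be $(run(w^3) - run(w^2))/|w|$, and runs are counted by brute force, which is only practical for short strings.

```python
from functools import lru_cache


def count_runs(w):
    """Count runs by brute force (adequate for the short strings used here)."""
    n, total = len(w), 0
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            start = i
            while i + p < n and w[i] == w[i + p]:
                i += 1
            u = w[start:i + p]  # maximal stretch having period p
            # Count it if it is long enough and has no period smaller than p.
            if len(u) >= 2 * p and all(u[q:] != u[:len(u) - q] for q in range(1, p)):
                total += 1
            i += 1  # step past the position that ended the stretch
    return total


@lru_cache(maxsize=None)
def gain(w):
    """Score used for sorting: run(w^3) - run(w^2)."""
    return count_runs(w * 3) - count_runs(w * 2)


def beam_search(rounds, buffer_size):
    """Grow binary strings one symbol per round, keeping the best-scoring ones."""
    buffer = ["0"]
    best = "0"
    for _ in range(rounds):
        candidates = [w + c for w in buffer for c in "01"]
        candidates.sort(key=gain, reverse=True)  # assumed: highest score first
        buffer = candidates[:buffer_size]
        for w in buffer:
            # Compare gain(w)/len(w) with gain(best)/len(best) without division.
            if gain(w) * len(best) > gain(best) * len(w):
                best = w
    return best, gain(best) / len(best)


if __name__ == "__main__":
    string, ratio = beam_search(rounds=12, buffer_size=8)
    print(string, ratio)
```

With such small parameters the sketch only illustrates the mechanics; reaching strings of the length reported in this section would need a much faster run-counting routine and a far larger buffer.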

Among these strings found so far, the string $\tau$ lets us prove the currently best lower bound on the maximum number of runs in a string. Since $\tau$ is too long to include in the paper, we will make $\tau$ available on our web site⁴. Once we have $\tau$, it is straightforward to confirm that the following lemma holds. Any naïve program to count runs in a string would be sufficient.

⁴ http://www.shino.ecei.tohoku.ac.jp/runs/

Lemma 4. There exists a string $\tau$ such that $|\tau| = 60064$, $run(\tau) = 56714$, $run(\tau^2) = 113448$, and $run(\tau^3) = 170181$.

It immediately disproves the conjecture, since $56714/60064 \approx 0.944226$ is already higher than the previous bound $\frac{3}{1+\sqrt{5}} \approx 0.927$. We now show the main result of this paper.

Theorem 3. For any $\varepsilon > 0$ there exists a positive integer $N$ so that $\rho(n) > (\alpha - \varepsilon)n$ for any $n \ge N$, where $\alpha = \frac{56733}{60064} \approx 0.944542$.

Proof. From Theorem 2 and Lemma 4, we have
$$\frac{\rho(n)}{n} > \frac{170181 - 113448}{60064} - \varepsilon = \frac{56733}{60064} - \varepsilon.$$
□

For proof of concept, we present in the Appendix a shorter string $\tau_{1558}$ with $|\tau_{1558}| = 1558$, $run(\tau_{1558}) = 1445$, $run(\tau_{1558}^2) = 2915$, $run(\tau_{1558}^3) = 4374$ that gives a smaller bound $(4374 - 2915)/1558 \approx 0.93645$ compared to $\tau$, but is still better than previously known.

5 Conclusion

We presented a new lower bound $56733/60064 \approx 0.944542$ for the maximum number of runs in a string. The proof was very simple, once we had verified that the number of runs in the string $\tau$ is 56714 and noticed some trivial properties of the string. We do not think that the bound is optimal. We believe that our work will revive interest in pushing the lower bound higher, since the previous bound $3/(1+\sqrt{5}) \approx 0.927$ had been conjectured to be optimal since 2003.

References

1. Franěk, F., Simpson, R., Smyth, W.: The maximum number of runs in a string. In: Proc. 14th Australasian Workshop on Combinatorial Algorithms (AWOCA2003). (2003) 26–35
2. Franěk, F., Yang, Q.: An asymptotic lower bound for the maximal-number-of-runs function. In: Proc. Prague Stringology Conference (PSC'06). (2006) 3–8
3. Kolpakov, R., Kucherov, G.: Finding maximal repetitions in a word in linear time. In: Proc. 40th Annual Symposium on Foundations of Computer Science (FOCS'99). (1999) 596–604
4. Rytter, W.: The number of runs in a string: Improved analysis of the linear upper bound. In: Proc. 23rd Annual Symposium on Theoretical Aspects of Computer Science (STACS 2006). Volume 3884 of LNCS. (2006) 184–195
5. Rytter, W.: The number of runs in a string. Inf. Comput. 205(9) (2007) 1459–1469
6. Crochemore, M., Ilie, L.: Maximal repetitions in strings. J. Comput. Syst. Sci. (2007) in press.
7. Fine, N., Wilf, H.: Uniqueness Theorems for Periodic Functions. Proceedings of the American Mathematical Society 16(1) (1965) 109–114
8. Lothaire, M.: Algebraic combinatorics on words. Cambridge University Press, New York (2002)
9. Crochemore, M., Rytter, W.: Jewels of Stringology. World Scientific (2002)

Appendix

The binary string $\tau_{1558}$ with $|\tau_{1558}| = 1558$, $run(\tau_{1558}) = 1445$, $run(\tau_{1558}^2) = 2915$, $run(\tau_{1558}^3) = 4374$, giving lower bound $(4374 - 2915)/1558 \approx 0.93645 > 0.927$.

110101 10100 1011010110100101101011001101011010010110101101001011010
110010 11010 1101001011010110100101101011001101011010010110101101001
011010 11001 0110101101001011010110010110100101101011010010110101100
101101 01101 0010110101101001011010110010110100101101011010010110101
100101 10101 1010010110101100101101001011010110100101101011001011010
110100 10110 1011010010110101100101101011010010110101100101101001011
010110 10010 1101011001011010110100101101011010010110101100101101001
011010 11010 0101101011001011010110100101101011001011010010110101101
001011 01011 0010110101101001011010110100101101011001011010110100101
101011 00101 1010010110101101001011010110010110101101001011010110010
110100 10110 1011010010110101100101101011010010110101101001011010110
010110 10010 1101011010010110101100101101011010010110101100101101001
011010 11010 0101101011001011010110100101101011010010110101100101101
011010 01011 0101100101101001011010110100101101011001011010110100101
101011 01001 0110101100101101001011010110100101101011001011010110100
101101 01100 1011010010110101101001011010110010110101101001011010110
100101 10101 1001011010110100101101011001011010010110101101001011010
110010 11010 1101001011010110010110100101101011010010110101100101101
011010 01011 0101101001011010110010110100101101011010010110101100101
101011 01001 0110101100101101001011010110100101101011001011010110100
101101 01101 0010110101100101101011010010110101100101101001011010110
100101 10101 1001011010110100101101011010010110101100101101001011010
110100 10110 1011001011010110100101101011001011010010110101101001011
010110 01011 01011010010110101101001011010

By interpreting $\tau_{1558}$ as a binary representation of an integer, it can be expressed in hexadecimal representation by:

0x35A5 AD2D6 6B4B5A5ACB5A5AD2D66B4B5A5ACB5A5ACB4B5A5ACB5A5AD2D65A5AD
2D65AD 2D65A 5AD2D65AD2D696B2D696B2D2D696B2D696B4B59696B4B596B4B5969
6B4B59 6B4B5 A5ACB5A5ACB4B5A5ACB5A5ACB4B5A5ACB5A5AD2D65A5AD2D65AD2D6
5A5AD2 D65AD 2D696B2D696B2D2D696B2D696B4B59696B4B596B4B59696B4B596B4
B5A5AC B5A5A CB4B5A5ACB5A5ACB4B5A5ACB5A5AD2D65A5AD2D65AD2D65A5AD2D65
AD2D69 6B2D6 96B2D2D696B2D696B4B59696B4B596B4B59696B4B596B4B5A5A
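
The appendix data can be sanity-checked in two small ways; the sketch below is illustrative only, and the function names `hex_to_bits` and `lemma3_coefficients` are hypothetical. The first helper converts a hexadecimal rendering (ignoring any whitespace) back to the binary string of the same integer, so the hexadecimal form can be compared with the binary form. The second applies the formulas of Lemma 3 to the reported counts, giving the coefficients $A$ and $B$ in $run(w^k) = Ak - B$.

```python
def hex_to_bits(hex_text):
    """Binary digits of the integer written in hex_text (whitespace ignored)."""
    digits = "".join(hex_text.split())
    return bin(int(digits, 16))[2:]


def lemma3_coefficients(run_w2, run_w3):
    """Coefficients (A, B) of run(w^k) = A*k - B from run(w^2) and run(w^3)."""
    a = run_w3 - run_w2
    b = 2 * run_w3 - 3 * run_w2
    return a, b


if __name__ == "__main__":
    # The first hexadecimal digits decode to the first binary digits.
    print(hex_to_bits("0x35A5 AD2D6"))
    # Counts reported for the appendix string: gives (1459, 3).
    print(lemma3_coefficients(2915, 4374))
    # Counts reported in Lemma 4: gives (56733, 18).
    print(lemma3_coefficients(113448, 170181))
```

The second and third calls reproduce the ratio $1459/1558 \approx 0.93645$ for the appendix string and the count $56733k - 18$ stated in the introduction.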