Uncommon Suffix Tries
Common assumptions on the source producing the words inserted in a suffix trie with $n$ leaves lead to a $\log n$ height and saturation level. We provide an example of a suffix trie whose height increases faster than a power of $n$ and another one wh…
Authors: Peggy Cenac (IMB), Brigitte Chauvin (LM-Versailles), Frederic Paccaut (LAMFA)
Peggy Cénac (1), Brigitte Chauvin (2), Frédéric Paccaut (3), Nicolas Pouyanne (4)

December 14th, 2011

(1) Université de Bourgogne, Institut de Mathématiques de Bourgogne, IMB UMR 5584 CNRS, 9 rue Alain Savary - BP 47870, 21078 DIJON CEDEX, France.
(2) Université de Versailles-St-Quentin, Laboratoire de Mathématiques de Versailles, CNRS, UMR 8100, 45, avenue des Etats-Unis, 78035 Versailles CEDEX, France.
(3) LAMFA, CNRS, UMR 6140, Université de Picardie Jules Verne, 33, rue Saint-Leu, 80039 Amiens, France.
(4) Université de Versailles-St-Quentin, Laboratoire de Mathématiques de Versailles, CNRS, UMR 8100, 45, avenue des Etats-Unis, 78035 Versailles CEDEX, France.

Abstract

Common assumptions on the source producing the words inserted in a suffix trie with $n$ leaves lead to a $\log n$ height and saturation level. We provide an example of a suffix trie whose height increases faster than a power of $n$ and another one whose saturation level is negligible with respect to $\log n$. Both are built from VLMC (Variable Length Markov Chain) probabilistic sources and are easily extended to families of tries having the same properties. The first example corresponds to a "logarithmic infinite comb" and enjoys a non uniform polynomial mixing. The second one corresponds to a "factorial infinite comb" for which mixing is uniform and exponential.

MSC 2010: 60J05, 37E05.

Keywords: variable length Markov chain, probabilistic source, mixing properties, suffix trie

1 Introduction

Trie (abbreviation of retrieval) is a natural data structure, efficient for searching words in a given set and used in many algorithms such as data compression, spell checking or IP address lookup. A trie is a digital tree in which words are
inserted in external nodes. The trie process grows up by successively inserting words according to their prefixes. A precise definition will be given in Section 4.1.

As soon as a set of words is given, the way they are inserted in the trie is deterministic. Nevertheless, a trie becomes random when the words are randomly drawn: each word is produced by a probabilistic source and $n$ words are chosen (usually independently) to be inserted in a trie. A suffix trie is a trie built on the suffixes of one infinite word. The randomness then comes from the source producing such an infinite word, and the successive words inserted in the tree are far from being independent: they are strongly correlated.

As a principal application of suffix tries one can cite the lossless compression algorithm Lempel-Ziv 77 (LZ77). The first results on the average size of suffix tries when the infinite word is given by a symmetrical memoryless source are due to Blumer et al. [1], and those on the height of the tree to Devroye [4]. Using analytic combinatorics, Fayolle [6] has obtained the average size and total path length of the tree for a binary word issued from a memoryless source (with some restriction on the probability of each letter).

Here we are interested in the height $H_n$ and the saturation level $\ell_n$ of a suffix trie $\mathcal{T}_n$ containing the first $n$ suffixes of an infinite word produced by a source associated with a so-called Variable Length Markov Chain (VLMC) (see Rissanen [11] for the seminal work, Galves-Löcherbach [8] for an overview, and [2] for a probabilistic frame). One deals with a particular VLMC source associated with an infinite comb, described hereafter. This particular model has the double advantage of going beyond the cases of memoryless or Markov sources and of providing concrete computable properties.
The analysis of the height and the saturation level is usually motivated by optimization of the memory cost. Height is clearly relevant to this point; saturation level is algorithmically relevant as well because internal nodes below the saturation level are often replaced by a less expensive table.

All the tries or suffix tries considered so far in the literature have a height and a saturation level both growing logarithmically with the number of words inserted, to the best of our knowledge. For plain tries, when the inserted words are independent, the results due to Pittel [10] rely on two assumptions on the source producing the words: first, the source is uniformly mixing; second, the probability of any word decays exponentially with its length. Let us also mention the general analysis of tries by Clément-Flajolet-Vallée [3] for dynamical sources. For suffix tries, Szpankowski [12] obtains the same result, with a weaker mixing assumption (still uniform though) and the same hypothesis on the measure of the words.

Our aim is to exhibit two cases when these behaviours are no longer the same. The first example is the "logarithmic comb", for which we show that the mixing is slow in some sense, namely non uniformly polynomial (see Section 3.2 for a precise statement), and the measure of some increasing sequence of words decays polynomially. We prove in Theorem 4.8 that the height of this trie is larger than a power of $n$ (where $n$ is the number of suffixes inserted in the tree).

The second example is the "factorial comb", which has a uniformly exponential mixing, thus fulfilling the mixing hypothesis of Szpankowski [12], but the measure of some increasing sequence of words decays faster than any exponential.
In this case we prove in Theorem 4.9 that the saturation level is negligible with respect to $\log n$. We prove more precisely that, almost surely,
$$\ell_n \in o\left(\frac{\log n}{(\log\log n)^{\delta}}\right)$$
for any $\delta > 1$.

The paper is organised as follows. In Section 2, we define a VLMC source associated with an infinite comb. In Section 3, we give results on the mixing properties of these sources by explicitly computing the suitable generating functions in terms of the source data. In Section 4, the associated suffix tries are built, and the two uncommon behaviours are stated and shown. The methods are based on two key tools concerning pattern return time: a duality property and the computation of generating functions. The relation between the mixing of the source and the asymptotic behaviour of the trie is highlighted by the proof of Proposition 4.7.

2 Infinite combs as sources

In this section, a VLMC probabilistic source associated with an infinite comb is defined. Moreover, we introduce the two examples given in the introduction: the logarithmic and the factorial combs. We begin with the definition of a general variable length Markov chain associated with a probabilized infinite comb.

The following presentation comes from [2]. Let $\mathcal{A}$ be the alphabet $\{0, 1\}$ and $\mathcal{L} = \mathcal{A}^{-\mathbb{N}}$ be the set of left-infinite words. Consider the binary tree (represented in Figure 1) whose finite leaves are the words $1, 01, \dots, 0^k1, \dots$ and with an infinite leaf $0^\infty$ as well. Each leaf is labelled with a Bernoulli distribution, respectively denoted by $q_{0^k1}$, $k \geq 0$, and $q_{0^\infty}$. This probabilized tree is called the infinite comb.
The VLMC (Variable Length Markov Chain) associated with an infinite comb is the $\mathcal{L}$-valued Markov chain $(V_n)_{n \geq 0}$ defined by the transitions
$$P(V_{n+1} = V_n\alpha \mid V_n) = q_{\overleftarrow{\mathrm{pref}}(V_n)}(\alpha)$$
where $\alpha \in \mathcal{A}$ is any letter and $\overleftarrow{\mathrm{pref}}(V_n)$ denotes the first suffix of $V_n$ (reading from right to left) appearing as a leaf of the infinite comb. For instance, if $V_n = \dots 1000$, then $\overleftarrow{\mathrm{pref}}(V_n) = 0001$. Notice that the VLMC is entirely determined by the data $q_{0^\infty}$, $q_{0^k1}$, $k \geq 0$.

Figure 1: An infinite comb

From now on, denote $c_0 = 1$ and for $n \geq 1$,
$$c_n := \prod_{k=0}^{n-1} q_{0^k1}(0).$$
It is proved in [2] that in the irreducible case, i.e. when $q_{0^\infty}(0) \neq 1$, there exists a unique stationary probability measure $\pi$ on $\mathcal{L}$ for $(V_n)_n$ if and only if the series $\sum c_n$ converges. From now on, we assume that this condition is fulfilled and we call
$$S(x) := \sum_{n \geq 0} c_n x^n \tag{1}$$
its generating function, so that $S(1) = \sum_{n \geq 0} c_n$. For any finite word $w$, we denote $\pi(w) := \pi(\mathcal{L}w)$. Computations performed in [2] show that for any $n \geq 0$,
$$\pi(10^n) = \frac{c_n}{S(1)} \quad\text{and}\quad \pi(0^n) = \frac{\sum_{k \geq n} c_k}{S(1)}. \tag{2}$$
Notice that, by stationarity, $\pi(0^n) = \pi(0^{n+1}) + \pi(10^n)$ and, by disjointness of events, $\pi(0^n) = \pi(0^{n+1}) + \pi(0^n1)$ for all $n \geq 1$, so that
$$\pi(10^n) = \pi(0^n1). \tag{3}$$

If $U_n$ denotes the final letter of $V_n$, the random sequence $W = U_0U_1U_2\dots$ is a right-infinite random word. We define in this way a probabilistic source in the sense of information theory, i.e. a mechanism that produces random words. This VLMC probabilistic source is characterized by
$$p_w := P(W \text{ has } w \text{ as a prefix}) = \pi(w)$$
for every finite word $w$. Both particular suffix tries the article deals with are built from such sources, defined by the data given in Examples 1 and 2.
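Before turning to the two examples, the transition rule above can be illustrated by a minimal simulation sketch. It is not taken from the paper: the function names `sample_comb_source`, `c` and `factorial_q0` are hypothetical, the chain is started right after a letter 1 (so the sample is not exactly stationary), and the state is reduced to the number of trailing zeros since the last 1, which is all the comb needs to read off the leaf $0^k1$.

```python
import math
import random
from typing import Callable, List


def sample_comb_source(q0: Callable[[int], float], length: int,
                       rng: random.Random) -> List[int]:
    """Draw `length` letters from a comb source, started right after a 1.

    q0(k) plays the role of q_{0^k 1}(0): the probability that the next
    letter is 0 when the past ends with a 1 followed by exactly k zeros.
    (Assumption: starting after a 1 is a convenience, not the stationary law.)
    """
    letters = []
    zeros = 0  # number of trailing 0s since the last 1
    for _ in range(length):
        if rng.random() < q0(zeros):
            letters.append(0)
            zeros += 1
        else:
            letters.append(1)
            zeros = 0
    return letters


def c(q0: Callable[[int], float], n: int) -> float:
    """c_n = prod_{k=0}^{n-1} q_{0^k 1}(0), with c_0 = 1."""
    prod = 1.0
    for k in range(n):
        prod *= q0(k)
    return prod


def factorial_q0(k: int) -> float:
    """Leaf probabilities q_{0^k 1}(0) = 1/(k+2), giving c_n = 1/(n+1)!."""
    return 1.0 / (k + 2)


word = sample_comb_source(factorial_q0, 100_000, random.Random(0))
freq_of_ones = sum(word) / len(word)
```

With these leaf probabilities, $S(1) = e - 1$, so by (2) the empirical frequency of the letter 1 should be close to $\pi(1) = 1/S(1)$, up to sampling noise.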
Example 1: the logarithmic comb

The logarithmic comb is defined by $c_0 = 1$ and for $n \geq 1$,
$$c_n = \frac{1}{n(n+1)(n+2)(n+3)}.$$
The corresponding conditional probabilities on the leaves of the tree are
$$q_1(0) = \frac{1}{24} \quad\text{and for } n \geq 1,\quad q_{0^n1}(0) = 1 - \frac{4}{n+4}.$$
The expression of $c_n$ was chosen to make the computations as simple as possible and also because the square-integrability of the waiting time of some pattern will be needed (see end of Section 4.3), guaranteed by
$$\sum_{n \geq 0} n^2 c_n < +\infty.$$

Example 2: the factorial comb

The conditional probabilities on the leaves are defined by
$$q_{0^n1}(0) = \frac{1}{n+2} \quad\text{for } n \geq 0,$$
so that
$$c_n = \frac{1}{(n+1)!}.$$

3 Mixing properties of infinite combs

In this section, we first make precise what we mean by mixing properties of a random sequence. We refer to Doukhan [5], especially for the notion of $\psi$-mixing defined in that book. We state in Proposition 3.2 a general result that provides the mixing coefficient for an infinite comb defined by $(c_n)_{n \geq 0}$ or, equivalently, by its generating function $S$. This result is then applied to our two examples. The mixing of the logarithmic comb is polynomial but not uniform: it is a very weak mixing. The mixing of the factorial comb is uniform and exponential: it is a very strong mixing. Notice that mixing properties of some infinite combs have already been investigated by Isola [9], although in a slightly different language.

3.1 Mixing properties of general infinite combs

For a stationary sequence $(U_n)_{n \geq 0}$ with stationary measure $\pi$, we want to measure by means of a suitable coefficient the independence between two words $A$ and $B$ separated by $n$ letters. The sequence is said to be "mixing" when this coefficient vanishes as $n$ goes to $+\infty$. Among all types of mixing, we focus on one of the strongest types: $\psi$-mixing.
More precisely, for $0 \leq m \leq +\infty$, denote by $\mathcal{F}_0^m$ the $\sigma$-algebra generated by $\{U_k, 0 \leq k \leq m\}$ and introduce for $A \in \mathcal{F}_0^m$ and $B \in \mathcal{F}_0^\infty$ the mixing coefficient
$$\psi(n, A, B) := \frac{\pi(A \cap T^{-(m+1)-n}B) - \pi(A)\pi(B)}{\pi(A)\pi(B)} = \frac{\sum_{|w|=n} \pi(AwB) - \pi(A)\pi(B)}{\pi(A)\pi(B)}, \tag{4}$$
where $T$ is the shift map and where the sum runs over the finite words $w$ with length $|w| = n$. A sequence $(U_n)_{n \geq 0}$ is called $\psi$-mixing whenever
$$\lim_{n \to \infty}\ \sup_{m \geq 0,\ A \in \mathcal{F}_0^m,\ B \in \mathcal{F}_0^\infty} |\psi(n, A, B)| = 0.$$
In this definition, the convergence to zero is uniform over all words $A$ and $B$. This is not going to be the case in our first example. As in Isola [9], we widely use the renewal properties of infinite combs (see Lemma 3.1), but more detailed results are needed; in particular, we investigate the lack of uniformity for the logarithmic comb.

Notations and generating functions

• For a comb, recall that $S$ is the generating function of the nonincreasing sequence $(c_n)_{n \geq 0}$ defined by (1).

• Set $\rho_0 = 0$ and for $n \geq 1$, $\rho_n := c_{n-1} - c_n$, with generating function
$$P(x) := \sum_{n \geq 1} \rho_n x^n.$$

• Define the sequence $(u_n)_{n \geq 0}$ by $u_0 = 1$ and for $n \geq 1$,
$$u_n := \frac{\pi(U_0 = 1, U_n = 1)}{\pi(1)} = \frac{1}{\pi(1)} \sum_{|w| = n-1} \pi(1w1), \tag{5}$$
and let
$$U(x) := \sum_{n \geq 0} u_n x^n$$
denote its generating function.

Hereunder is stated a key lemma that will be widely used in Proposition 3.2. In some sense, this kind of relation (sometimes called Renewal Equation) reflects the renewal properties of the infinite comb.

Lemma 3.1. The sequences $(u_n)_{n \geq 0}$ and $(\rho_n)_{n \geq 0}$ are connected by the relations
$$\forall n \geq 1, \quad u_n = \rho_n + u_1\rho_{n-1} + \dots + u_{n-1}\rho_1$$
and (consequently)
$$U(x) = \sum_{n=0}^{\infty} u_n x^n = \frac{1}{1 - P(x)} = \frac{1}{(1-x)S(x)}.$$

Proof. For a finite word $w = \alpha_1 \dots \alpha_m$ such that $w \neq 0^m$, let $l(w)$ denote the position of the last 1 in $w$, that is $l(w) := \max\{1 \leq i \leq m,\ \alpha_i = 1\}$. Then, the sum in the expression (5) of $u_n$ can be decomposed as follows:
$$\sum_{|w| = n-1} \pi(1w1) = \pi(10^{n-1}1) + \sum_{i=1}^{n-1} \sum_{\substack{|w| = n-1 \\ l(w) = i}} \pi(1w1).$$
Now, by disjoint union, $\pi(10^{n-1}) = \pi(10^{n-1}1) + \pi(10^n)$, so that
$$\pi(10^{n-1}1) = \pi(1)(c_{n-1} - c_n) = \pi(1)\rho_n.$$
In the same way, for $w = \alpha_1 \dots \alpha_{n-1}$, if $l(w) = i$ then $\pi(1w1) = \pi(1\alpha_1 \dots \alpha_{i-1}1)\rho_{n-i}$, so that
$$u_n = \rho_n + \sum_{i=1}^{n-1} \rho_{n-i} \frac{1}{\pi(1)} \sum_{|w| = i-1} \pi(1w1) = \rho_n + \sum_{i=1}^{n-1} \rho_{n-i} u_i,$$
which leads to $U(x) = (1 - P(x))^{-1}$ by summation. □

Mixing coefficients

The mixing coefficients $\psi(n, A, B)$ are expressed as the $n$-th coefficient in the series expansion of an analytic function $M_{A,B}$ which is given in terms of $S$ and $U$. The notation $[x^n]A(x)$ means the coefficient of $x^n$ in the power expansion of $A(x)$ at the origin. Denote the remainders associated with the series $S(x)$ by
$$r_n := \sum_{k \geq n} c_k, \qquad R_n(x) := \sum_{k \geq n} c_k x^k,$$
and for $a \geq 0$, define the "shifted" generating function
$$P_a(x) := \frac{1}{c_a} \sum_{n \geq 1} \rho_{a+n} x^n = x + \frac{x-1}{c_a x^a} R_{a+1}(x). \tag{6}$$

Proposition 3.2. For any finite word $A$ and any word $B$, the identity
$$\psi(n, A, B) = [x^{n+1}] M_{A,B}(x)$$
holds for the generating functions $M_{A,B}$ respectively defined by:

i) if $A = A'1$ and $B = 1B'$ where $A'$ and $B'$ are any finite words, then
$$M_{A,B}(x) = M(x) := \frac{S(x) - S(1)}{(x-1)S(x)};$$

ii) if $A = A'10^a$ and $B = 0^b1B'$ where $A'$ and $B'$ are any finite words and $a + b \geq 1$, then
$$M_{A,B}(x) := S(1)\frac{c_{a+b}}{c_a c_b} P_{a+b}(x) + U(x)\left[S(1)P_a(x)P_b(x) - S(x)\right];$$

iii) if $A = 0^a$ and $B = 0^b$ with $a, b \geq 1$, then
$$M_{A,B}(x) := S(1)\frac{1}{r_a r_b} \sum_{n \geq 1} r_{a+b+n-1} x^n + U(x)\left[S(1)\frac{R_a(x)R_b(x)}{r_a r_b x^{a+b-2}} - S(x)\right];$$

iv) if $A = A'10^a$ and $B = 0^b$ where $A'$ is any finite word and $a, b \geq 0$, then
$$M_{A,B}(x) := S(1)\frac{1}{c_a r_b x^{a+b-1}} R_{a+b}(x) + U(x)\left[S(1)\frac{P_a(x)R_b(x)}{c_a r_b x^{b-1}} - S(x)\right];$$

v) if $A = 0^a$ and $B = 0^b1B'$ where $B'$ is any finite word and $a, b \geq 0$, then
$$M_{A,B}(x) := S(1)\frac{1}{r_a c_b x^{a+b-1}} R_{a+b}(x) + U(x)\left[S(1)\frac{R_a(x)P_b(x)}{r_a c_b x^{a-1}} - S(x)\right].$$

Remark 3.3. It is worth noticing that the asymptotics of $\psi(n, A, B)$ may not be uniform in all words $A$ and $B$. We call this kind of system non-uniformly $\psi$-mixing. It may happen that $\psi(n, A, B)$ goes to zero for any fixed $A$ and $B$, but (for example, in case iii)) the larger $a$ or $b$, the slower the convergence, preventing it from being uniform.

Proof. The following identity has been established in [2] (see formula (17) in that paper) and will be used many times in the sequel. For any two finite words $w$ and $w'$,
$$\pi(w1w')\pi(1) = \pi(w1)\pi(1w'). \tag{7}$$

i) If $A = A'1$ and $B = 1B'$, then (7) yields
$$\pi(AwB) = \pi(A'1w1B') = \frac{\pi(A'1)}{\pi(1)}\pi(1w1B') = S(1)\pi(A)\pi(B)\frac{\pi(1w1)}{\pi(1)}.$$
So $\psi(n, A, B) = S(1)u_{n+1} - 1$ and, by Lemma 3.1, the result follows.

ii) Let $A = A'10^a$ and $B = 0^b1B'$ with $a, b \geq 0$ and $a + b \neq 0$. To begin with,
$$\pi(AwB) = \frac{1}{\pi(1)}\pi(A'1)\pi(10^aw0^b1B') = \frac{1}{\pi(1)^2}\pi(A'1)\pi(10^aw0^b1)\pi(1B').$$
Furthermore, $\pi(A) = c_a\pi(A'1)$ and by (3), $\pi(0^b1) = \pi(10^b)$, so that
$$\pi(B) = \frac{1}{\pi(1)}\pi(0^b1)\pi(1B') = \frac{\pi(10^b)}{\pi(1)}\pi(1B') = c_b\pi(1B').$$
Therefore,
$$\pi(AwB) = \frac{\pi(A)\pi(B)}{c_a c_b \pi(1)^2}\pi(10^aw0^b1).$$
Using $\pi(1)S(1) = 1$, this proves
$$\psi(n, A, B) = S(1)\frac{v^{a,b}_{n+1}}{c_a c_b} - 1$$
where
$$v^{a,b}_n := \frac{1}{\pi(1)} \sum_{|w| = n-1} \pi(10^aw0^b1).$$
As in the proof of the previous lemma, if $w = \alpha_1 \dots \alpha_m$ is any finite word different from $0^m$, we call $f(w) := \min\{1 \leq i \leq m,\ \alpha_i = 1\}$ the first place where 1 can be seen in $w$ and recall that $l(w)$ denotes the last place where 1 can be seen in $w$. One has
$$\sum_{|w| = n-1} \pi(10^aw0^b1) = \pi(10^{a+n-1+b}1) + \sum_{1 \leq i \leq j \leq n-1}\ \sum_{\substack{|w| = n-1 \\ f(w) = i,\ l(w) = j}} \pi(10^aw0^b1).$$
If $i = j$ then $w$ is the word $0^{i-1}10^{n-i-1}$, else $w$ is of the form $0^{i-1}1w'10^{n-1-j}$, with $|w'| = j - i - 1$. Hence, the previous sum can be rewritten as
$$\sum_{|w| = n-1} \pi(10^aw0^b1) = \pi(1)\rho_{a+b+n} + \pi(1)\sum_{i=1}^{n-1} \rho_{a+i}\rho_{n-i+b} + \sum_{1 \leq i < j \leq n-1} \cdots$$
[…]

iii) Let $A = 0^a$ and $B = 0^b$ with $a, b \geq 1$. Set
$$v^{a,b}_n := \frac{1}{\pi(1)} \sum_{|w| = n-1} \pi(0^aw0^b).$$
First, recall that, due to (2), $\pi(A) = \pi(1)r_a$ and $\pi(B) = \pi(1)r_b$. Consequently,
$$\psi(n, A, B) = \frac{\pi(1)v^{a,b}_{n+1} - \pi(A)\pi(B)}{\pi(A)\pi(B)} = S(1)\frac{v^{a,b}_{n+1}}{r_a r_b} - 1.$$
Let $w$ be a finite word with $|w| = n - 1$. If $w = 0^{n-1}$, then
$$\pi(AwB) = \pi(0^{a+n-1+b}) = \pi(1)r_{a+b+n-1}.$$
If not, let $f(w)$ denote as before the first position of 1 in $w$ and $l(w)$ the last one in $w$.
If $f(w) = l(w)$, then
$$\pi(AwB) = \pi(0^{a+f(w)-1}10^{n-1-f(w)+b}) = \frac{1}{\pi(1)}\pi(0^{a+f(w)-1}1)\pi(10^{n-1-f(w)+b}) = \pi(1)c_{a+f(w)-1}c_{n-1-f(w)+b}.$$
If $f(w) < l(w)$, then writing $w = w_1 \dots w_{n-1}$,
$$\pi(AwB) = \pi(0^{a+f(w)-1}1w_{f(w)+1} \dots w_{l(w)-1}10^{n-1-l(w)+b}) = \frac{1}{\pi(1)^2}\pi(0^{a+f(w)-1}1)\pi(1w_{f(w)+1} \dots w_{l(w)-1}1)\pi(10^{n-1-l(w)+b}).$$
Summing yields
$$v^{a,b}_n = r_{a+b+n-1} + \sum_{i=1}^{n-1} c_{a+i-1}c_{n-1+b-i} + \sum_{\substack{i,j=1 \\ i < j}}^{n-1} \cdots$$
[…] $n \geq 1$ by
$$c_n = \frac{1}{n(n+1)(n+2)(n+3)}.$$
When $|x| < 1$, the series $S(x)$ writes as follows:
$$S(x) = \frac{47}{36} - \frac{5}{12x} + \frac{1}{6x^2} + \frac{(1-x)^3\log(1-x)}{6x^3} \tag{8}$$
and
$$S(1) = \frac{19}{18}.$$
With Proposition 3.2, the asymptotics of the mixing coefficient comes from singularity analysis of the generating functions $M_{A,B}$.

Proposition 3.4. The VLMC defined by the logarithmic infinite comb has a non-uniform polynomial mixing of the following form: for any finite words $A$ and $B$, there exists a positive constant $C_{A,B}$ such that for any $n \geq 1$,
$$|\psi(n, A, B)| \leq \frac{C_{A,B}}{n^3}.$$

Remark 3.5. The $C_{A,B}$ cannot be bounded above by some constant that does not depend on $A$ and $B$, as can be seen hereunder in the proof. Indeed, we show that if $a$ and $b$ are positive integers,
$$\psi(n, 0^a, 0^b) \sim \frac{1}{3}\left(\frac{S(1)}{r_a r_b} - \frac{1}{r_a} - \frac{1}{r_b} + \frac{1}{S(1)}\right)\frac{1}{n^3}$$
as $n$ goes to infinity. In particular, $\psi(n, 0, 0^n)$ tends to the positive constant $\frac{13}{6}$.

Proof of Proposition 3.4. For any finite words $A$ and $B$ in case i) of Proposition 3.2, one deals with $U(x) = ((1-x)S(x))^{-1}$, which has 1 as a unique dominant singularity. Indeed, 1 is the unique dominant singularity of $S$, so that the dominant singularities of $U$ are 1 or zeroes of $S$ contained in the closed unit disc.
But $S$ does not vanish on the closed unit disc, because for any $z$ such that $|z| \leq 1$,
$$|S(z)| \geq 1 - \sum_{n \geq 1} \frac{1}{n(n+1)(n+2)(n+3)} = 1 - (S(1) - 1) = \frac{17}{18}.$$
Since
$$M(x) = \frac{S(x) - S(1)}{(x-1)S(x)} = S(1)U(x) - \frac{1}{1-x},$$
the unique dominant singularity of $M$ is 1, and when $x$ tends to 1 in the unit disc, (8) leads to
$$M(x) = A(x) - \frac{1}{6S(1)}(1-x)^2\log(1-x) + O\left((1-x)^3\log(1-x)\right)$$
where $A(x)$ is a polynomial of degree 2. Using the classical transfer theorem (see Flajolet and Sedgewick [7, section VI]) based on the analysis of the singularities of $M$, we get
$$\psi(n-1, w1, 1w') = [x^n]M(x) = \frac{1}{3S(1)}\frac{1}{n^3} + o\left(\frac{1}{n^3}\right).$$
The cases ii), iii), iv) and v) of Proposition 3.2 are of the same kind, and we completely deal with case iii).

Case iii): words of the form $A = 0^a$ and $B = 0^b$, $a, b \geq 1$. As shown in Proposition 3.2, one has to compute the asymptotics of the $n$-th coefficient of the Taylor series of the function
$$M_{a,b}(x) := S(1)\frac{1}{r_a r_b}\sum_{n \geq 1} r_{a+b+n-1}x^n + U(x)\left[S(1)\frac{R_a(x)R_b(x)}{r_a r_b x^{a+b-2}} - S(x)\right]. \tag{9}$$
The contribution of the left-hand term of this sum is directly given by the asymptotics of the remainder
$$r_n = \sum_{k \geq n} c_k = \frac{1}{3n(n+1)(n+2)} = \frac{1}{3n^3} + O\left(\frac{1}{n^4}\right).$$
By means of singularity analysis, we deal with the right-hand term
$$N_{a,b}(x) := U(x)\left[S(1)\frac{R_a(x)R_b(x)}{r_a r_b x^{a+b-2}} - S(x)\right].$$
Since 1 is the only dominant singularity of $S$ and $U$ and consequently of any $R_a$, it suffices to compute an expansion of $N_{a,b}(x)$ at $x = 1$.
It follows from (8) that $U$, $S$ and $R_a$ admit expansions near 1 of the forms
$$U(x) = \frac{1}{S(1)(1-x)} + \text{polynomial} - \frac{1}{6S(1)^2}(1-x)^2\log(1-x) + O\left((1-x)^2\right),$$
$$S(x) = \text{polynomial} + \frac{1}{6}(1-x)^3\log(1-x) + O\left((1-x)^3\right),$$
and
$$R_a(x) = \text{polynomial} + \frac{1}{6}(1-x)^3\log(1-x) + O\left((1-x)^3\right).$$
Consequently,
$$N_{a,b}(x) = \frac{1}{6}\left(\frac{1}{r_a} + \frac{1}{r_b} - \frac{1}{S(1)}\right)(1-x)^2\log(1-x) + O\left((1-x)^2\right)$$
in a neighbourhood of 1 in the unit disc so that, by singularity analysis,
$$[x^n]N_{a,b}(x) = -\frac{1}{3}\left(\frac{1}{r_a} + \frac{1}{r_b} - \frac{1}{S(1)}\right)\frac{1}{n^3} + o\left(\frac{1}{n^3}\right).$$
Consequently (9) leads to
$$\psi(n-1, 0^a, 0^b) = [x^n]M_{a,b}(x) \sim \frac{1}{3}\left(\frac{S(1)}{r_a r_b} - \frac{1}{r_a} - \frac{1}{r_b} + \frac{1}{S(1)}\right)\frac{1}{n^3}$$
as $n$ tends to infinity, showing the mixing inequality and the non uniformity. The remaining cases ii), iv) and v) are of the same flavour. □

3.3 Mixing of the factorial infinite comb

Consider now the second example in Section 2, that is the probabilized infinite comb defined by
$$\forall n \in \mathbb{N}, \quad c_n = \frac{1}{(n+1)!}.$$
With previous notations, one gets
$$S(x) = \frac{e^x - 1}{x} \quad\text{and}\quad U(x) = \frac{x}{(1-x)(e^x - 1)}.$$

Proposition 3.6. The VLMC defined by the factorial infinite comb has a uniform exponential mixing of the following form: there exists a positive constant $C$ such that for any $n \geq 1$ and for any finite words $A$ and $B$,
$$|\psi(n, A, B)| \leq \frac{C}{(2\pi)^n}.$$

Proof.

i) First case of mixing in Proposition 3.2: $A = A'1$ and $B = 1B'$. Because of Proposition 3.2, the proof consists in computing the asymptotics of $[x^n]M(x)$. We make use of singularity analysis. The dominant singularities of
$$M(x) = \frac{S(x) - S(1)}{(x-1)S(x)}$$
are readily seen to be $2i\pi$ and $-2i\pi$, and
$$M(x) \underset{2i\pi}{\sim} \frac{1-e}{1-2i\pi} \cdot \frac{1}{1 - \frac{x}{2i\pi}}.$$
The behaviour of $M$ in a neighbourhood of $-2i\pi$ is obtained by complex conjugacy.
Singularity analysis via the transfer theorem thus provides that
$$[x^n]M(x) \underset{n \to +\infty}{\sim} \frac{2(e-1)}{1+4\pi^2}\left(\frac{1}{2\pi}\right)^n \epsilon_n \quad\text{where}\quad \epsilon_n = \begin{cases} 1 & \text{if } n \text{ is even} \\ 2\pi & \text{if } n \text{ is odd.} \end{cases}$$

ii) Second case of mixing: $A = A'10^a$ and $B = 0^b1B'$. Because of Proposition 3.2, one has to compute $[x^n]M_{a,b}(x)$ with
$$M_{a,b}(x) := S(1)\frac{c_{a+b}}{c_a c_b}P_{a+b}(x) + \frac{1}{S(x)} \cdot \frac{1}{1-x}\left[S(1)P_a(x)P_b(x) - S(x)\right],$$
where $P_{a+b}$ is an entire function. In this last formula, the brackets contain an entire function that vanishes at 1, so that the dominant singularities of $M_{a,b}$ are again those of $S^{-1}$, namely $\pm 2i\pi$. The expansion of $M_{a,b}(x)$ at $2i\pi$ thus writes
$$M_{a,b}(x) \underset{2i\pi}{\sim} \frac{-S(1)P_a(2i\pi)P_b(2i\pi)}{1-2i\pi} \cdot \frac{1}{1 - \frac{x}{2i\pi}},$$
which implies, by singularity analysis, that
$$[x^n]M_{a,b}(x) \underset{n \to +\infty}{\sim} 2\Re\left(\frac{1-e}{1-2i\pi} \cdot \frac{P_a(2i\pi)P_b(2i\pi)}{(2i\pi)^n}\right).$$
Besides, the remainder of the exponential series satisfies
$$\sum_{n \geq a} \frac{x^n}{n!} = \frac{x^a}{a!}\left(1 + \frac{x}{a} + O\left(\frac{1}{a}\right)\right) \tag{10}$$
when $a$ tends to infinity. Consequently, by Formula (6), $P_a(2i\pi)$ tends to $2i\pi$ as $a$ tends to infinity, so that one gets a positive constant $C_1$ that does not depend on $a$ and $b$ such that for any $n \geq 1$,
$$|\psi(n, A, B)| \leq \frac{C_1}{(2\pi)^n}.$$

iii) Third case of mixing: $A = 0^a$ and $B = 0^b$. This time, one has to compute $[x^n]M_{a,b}(x)$ with
$$M_{a,b}(x) := S(1)\frac{1}{r_a r_b}\sum_{n \geq 1} r_{a+b+n-1}x^n + U(x)\left[S(1)\frac{R_a(x)R_b(x)}{r_a r_b x^{a+b-2}} - S(x)\right],$$
the first term being an entire function. Here again, the dominant singularities of $M_{a,b}$ are located at $\pm 2i\pi$ and
$$M_{a,b}(x) \underset{2i\pi}{\sim} \frac{-S(1)R_a(2i\pi)R_b(2i\pi)}{(1-2i\pi)\,r_a r_b\,(2i\pi)^{a+b-2}} \cdot \frac{1}{1 - \frac{x}{2i\pi}},$$
which implies, by singularity analysis, that
$$\psi(n-1, A, B) \underset{n \to +\infty}{\sim} 2\Re\left(\frac{1-e}{1-2i\pi} \cdot \frac{R_a(2i\pi)R_b(2i\pi)}{r_a r_b (2i\pi)^{a+b-2}} \cdot \frac{1}{(2i\pi)^n}\right).$$
Once more, because of (10), this implies that there is a positive constant $C_2$ independent of $a$ and $b$ and such that for any $n \geq 1$,
$$|\psi(n, A, B)| \leq \frac{C_2}{(2\pi)^n}.$$

iv) and v): both remaining cases of mixing, which respectively correspond to words of the form $A = A'10^a$, $B = 0^b$ and $A = 0^a$, $B = 0^b1B'$, are of the same vein and lead to similar results. □

4 Height and saturation level of suffix tries

In this section, we consider a suffix trie process $(\mathcal{T}_n)_n$ associated with an infinite random word generated by an infinite comb. A precise definition of tries and suffix tries is given in Section 4.1. We are interested in the height and the saturation level of such a suffix trie.

Our method to study these two parameters uses a duality property à la Pittel developed in Section 4.2, together with a careful and explicit calculation of the generating function of the second occurrence of a word (in Section 4.3), which can be achieved for any infinite comb. These calculations are not so intricate because they are strongly related to the mixing coefficient and the mixing properties detailed in Section 3.

More specifically, we look at our two favourite examples, the logarithmic comb and the factorial comb. We prove in Section 4.5 that the height of the first one is not logarithmic but polynomial, and in Section 4.6 that the saturation level of the second one is not logarithmic either but negligibly smaller. Remark that despite the very particular form of the comb in the wide family of variable length Markov models, the comb sources provide a spectrum of asymptotic behaviours for the suffix tries.

4.1 Suffix tries

Let $(\mathcal{Y}_n)_{n \geq 1}$ be an increasing sequence of sets. Each set $\mathcal{Y}_n$ contains exactly $n$ infinite words.
A trie process $(\mathcal{T}_n)_{n \geq 1}$ is a planar tree increasing process associated with $(\mathcal{Y}_n)_{n \geq 1}$. The trie $\mathcal{T}_n$ contains the words of $\mathcal{Y}_n$ in its leaves. It is obtained by a sequential construction, inserting the words of $\mathcal{Y}_n$ successively.

At the beginning, $\mathcal{T}_1$ is the tree containing the root and the leaf $0\dots$ (resp. the leaf $1\dots$) if the word in $\mathcal{Y}_1$ begins with 0 (resp. with 1). For $n \geq 2$, knowing the tree $\mathcal{T}_{n-1}$, the $n$-th word $m$ is inserted as follows. We go through the tree along the branch whose nodes are encoded by the successive prefixes of $m$; when the branch ends, if an internal node is reached, then the word is inserted at the free leaf, else we make the branch grow comparing the next letters of both words until they can be inserted in two different leaves. As one can clearly see in Figure 2, a trie is not a complete tree and the insertion of a word can make a branch grow by more than one level. Notice that an internal node exists within the trie if there are at least two words in the set starting with the prefix associated with this node. This indicates why the second occurrence of a word is prominent.

Figure 2: Last steps of the construction of a trie built from the set (000…, 10…, 1101…, 001…, 01110…, 1100…, 01111…).

Let $m := a_1a_2a_3\dots$ be an infinite word on $\mathcal{A} = \{0, 1\}$. The suffix trie $\mathcal{T}_n$ (with $n$ leaves) associated with $m$ is the trie built from the set of the $n$ first suffixes of $m$, that is
$$\mathcal{Y}_n = \{m,\ a_2a_3\dots,\ a_3a_4\dots,\ \dots,\ a_na_{n+1}\dots\}.$$
For a given trie $\mathcal{T}_n$, we are mainly interested in the height $H_n$, which is the maximal depth of an internal node of $\mathcal{T}_n$, and the saturation level $\ell_n$, which is the maximal depth up to which all the internal nodes are present in $\mathcal{T}_n$. Formally, if $\partial\mathcal{T}_n$ denotes the set of leaves of $\mathcal{T}_n$,
$$H_n = \max_{u \in \mathcal{T}_n \setminus \partial\mathcal{T}_n} |u|, \qquad \ell_n = \max\left\{ j \in \mathbb{N} \ \middle|\ \#\{u \in \mathcal{T}_n \setminus \partial\mathcal{T}_n,\ |u| = j\} = 2^j \right\}.$$
See Figure 3 for an example.

Figure 3: Suffix trie $\mathcal{T}_{10}$ associated with the word $1001011001110\dots$. Here, $H_{10} = 4$ and $\ell_{10} = 2$.

4.2 Duality

Let $(U_n)_{n \geq 1}$ be an infinite random word generated by some infinite comb and $(\mathcal{T}_n)_{n \geq 1}$ be the associated suffix trie process. We denote by $\mathcal{R}$ the set of right-infinite words. Besides, we define hereunder two random variables having a key role in the proof of Theorem 4.8 and Theorem 4.9. This method goes back to Pittel [10].

Let $s \in \mathcal{R}$ be a deterministic infinite sequence and $s^{(k)}$ its prefix of length $k$. For $n \geq 1$,
$$X_n(s) := \begin{cases} 0 & \text{if } s^{(1)} \text{ is not in } \mathcal{T}_n \\ \max\{k \geq 1 \mid \text{the word } s^{(k)} \text{ is already in } \mathcal{T}_n \setminus \partial\mathcal{T}_n\} & \text{otherwise,} \end{cases}$$
$$T_k(s) := \min\{n \geq 1 \mid X_n(s) = k\},$$
where "$s^{(k)}$ is in $\mathcal{T}_n \setminus \partial\mathcal{T}_n$" stands for: there exists an internal node $v$ in $\mathcal{T}_n$ such that $s^{(k)}$ encodes $v$. For any $k \geq 1$, $T_k(s)$ denotes the number of leaves of the first tree "containing" $s^{(k)}$. See Figure 4 for an example.

Thus, the saturation level $\ell_n$ and the height $H_n$ can be described using $X_n(s)$:
$$\ell_n = \min_{s \in \mathcal{R}} X_n(s) \quad\text{and}\quad H_n = \max_{s \in \mathcal{R}} X_n(s). \tag{11}$$
Moreover, $X_n(s)$ and $T_k(s)$ are in duality in the following sense: for all positive integers $k$ and $n$, one has the equality of the events
$$\{X_n(s) \geq k\} = \{T_k(s) \leq n\}. \tag{12}$$
The random variable $T_k(s)$ (if $k \geq 2$) also represents the waiting time of the second occurrence of the deterministic word $s^{(k)}$ in the random sequence $(U_n)_{n \geq 1}$, i.e. one has to wait $T_k(s)$ for the source to create a prefix containing exactly two occurrences of $s^{(k)}$.
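The definitions of $H_n$, $\ell_n$ and $T_k(s)$ can be checked on a finite word with a short sketch. It relies on the remark of Section 4.1 that a node is internal exactly when at least two of the inserted suffixes share the corresponding prefix, so no tree is actually built. The function names are hypothetical, and the sketch assumes the finite word is long enough: suffixes with fewer than $k$ known letters are simply skipped at depth $k$.

```python
from collections import Counter
from itertools import product
from typing import Optional


def prefix_counts(word: str, n: int, k: int) -> Counter:
    """Count the length-k prefixes of the first n suffixes of `word`.

    Assumption: suffixes with fewer than k known letters are skipped.
    """
    return Counter(word[i:i + k] for i in range(n) if i + k <= len(word))


def height(word: str, n: int) -> int:
    """H_n: largest k such that some length-k word is shared by two suffixes."""
    k = 0
    while any(v >= 2 for v in prefix_counts(word, n, k + 1).values()):
        k += 1
    return k


def saturation_level(word: str, n: int) -> int:
    """l_n: largest k such that all 2^k binary words are internal nodes."""
    k = 0
    while True:
        counts = prefix_counts(word, n, k + 1)
        if all(counts["".join(u)] >= 2 for u in product("01", repeat=k + 1)):
            k += 1
        else:
            return k


def waiting_time(word: str, s_k: str) -> Optional[int]:
    """T_k(s): 1-based start of the second occurrence of s_k, if it is seen."""
    first = word.find(s_k)
    if first < 0:
        return None
    second = word.find(s_k, first + 1)  # overlapping occurrences allowed
    return second + 1 if second >= 0 else None


w = "1001011001110"  # the word of Figure 3
h10, l10 = height(w, 10), saturation_level(w, 10)
```

On the word of Figure 3, the sketch gives back $H_{10} = 4$ and $\ell_{10} = 2$; the word $1001$ starts its second occurrence at position 7, which is consistent with the duality (12) since $7 \leq 10$.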
More precisely, for $k \ge 2$, $T_k(s)$ can be rewritten as
$$T_k(s) = \min \left\{ n \ge 1 \ \middle|\ U_n U_{n+1} \ldots U_{n+k-1} = s^{(k)} \text{ and } \exists !\, j < n \text{ such that } U_j U_{j+1} \ldots U_{j+k-1} = s^{(k)} \right\}.$$

Figure 4: Example of suffix trie with $n = 20$ words. The saturation level is reached for any sequence having 1000 as prefix (in red); $\ell_{20} = X_{20}(s) = 3$ and thus $T_3(s) \le 20$. The height (related to the maximum of $X_{20}$) is realized for any sequence of the form 110101... (in blue) and $H_{20} = 6$. Remark that the shortest branch has length 4 whereas the saturation level $\ell_n$ is equal to 3.

Notice that $T_k(s)$ denotes the beginning of the second occurrence of $s^{(k)}$ whereas in [2], $\tau^{(2)}(s^{(k)})$ denotes the end of the second occurrence of $s^{(k)}$, so that
$$\tau^{(2)}(s^{(k)}) = T_k(s) + k. \tag{13}$$
More generally, in [2], for any $r \ge 1$, the random return time $\tau^{(r)}(w)$ is defined as the end of the $r$-th occurrence of $w$ in the sequence $(U_n)_{n \ge 1}$ and the generating function of the $\tau^{(r)}$ is calculated. We go over these calculations in the sequel.

4.3 Return time generating functions

Proposition 4.7 Let $k \ge 1$. Let also $w = 10^{k-1}$ and $\tau^{(2)}(w)$ be the end of the second occurrence of $w$ in a sequence generated by a comb defined by $(c_n)_{n \ge 0}$. Let $S$ and $U$ be the ordinary generating functions defined in Section 3.1. The probability generating function of $\tau^{(2)}(w)$ is
$$\Phi^{(2)}_w(x) = \frac{c_{k-1}^2\, x^{2k-1} \left( U(x) - 1 \right)}{S(1)(1-x) \left[ 1 + c_{k-1} x^{k-1} \left( U(x) - 1 \right) \right]^2}.$$
Furthermore, as soon as $\sum_{n \ge 1} n^2 c_n < \infty$, the random variable $\tau^{(2)}(w)$ is square-integrable and
$$E(\tau^{(2)}(w)) = \frac{2 S(1)}{c_{k-1}} + o\left( \frac{1}{c_{k-1}} \right), \qquad \mathrm{Var}(\tau^{(2)}(w)) = \frac{2 S(1)^2}{c_{k-1}^2} + o\left( \frac{1}{c_{k-1}^2} \right). \tag{14}$$

Proof.
For any $r \ge 1$, let $\tau^{(r)}(w)$ denote the end of the $r$-th occurrence of $w$ in a random sequence generated by a comb and $\Phi^{(r)}_w$ its probability generating function. The reversed word of $c = \alpha_1 \ldots \alpha_N$ will be denoted by the overline $\overline{c} := \alpha_N \ldots \alpha_1$.

We use a result of [2] that computes these generating functions in terms of stationary probabilities $q_c^{(n)}$. These probabilities measure the occurrence of a finite word after $n$ steps, conditioned to start from the word $\overline{c}$. More precisely, for any finite words $u$ and $c$ and for any $n \ge 0$, let
$$q_c^{(n)}(u) := \pi \left( U_{n-|u|+|c|+1} \ldots U_{n+|c|} = u \ \middle|\ U_1 \ldots U_{|c|} = \overline{c} \right).$$
It is shown in [2] that, for $|x| < 1$,
$$\Phi^{(1)}_w(x) = \frac{x^k \pi(w)}{(1-x) S_w(x)}$$
and for $r \ge 1$,
$$\Phi^{(r)}_w(x) = \Phi^{(1)}_w(x) \left( 1 - \frac{1}{S_w(x)} \right)^{r-1}$$
where
$$S_w(x) := C_w(x) + \sum_{n=k}^{\infty} q^{(n)}_{\overleftarrow{\mathrm{pref}}(w)}(w)\, x^n,$$
$$C_w(x) := 1 + \sum_{j=1}^{k-1} \mathbf{1}_{\{ w_{j+1} \ldots w_k = w_1 \ldots w_{k-j} \}}\, q^{(j)}_{\overleftarrow{\mathrm{pref}}(w)}(w_{k-j+1} \ldots w_k)\, x^j.$$
In the particular case when $w = 10^{k-1}$, then $\overleftarrow{\mathrm{pref}}(w) = \overline{w} = 0^{k-1} 1$ and $\pi(w) = \frac{c_{k-1}}{S(1)}$. Moreover, Definition (4) of the mixing coefficient and Proposition 3.2 i) imply successively that
$$q^{(n)}_{\overleftarrow{\mathrm{pref}}(w)}(w) = \pi \left( U_{n+k-|w|+1} \ldots U_{n+k} = w \ \middle|\ U_1 \ldots U_k = \overline{\overleftarrow{\mathrm{pref}}(w)} \right) = \pi(w) \left[ \psi \left( n-k, \overleftarrow{\mathrm{pref}}(w), w \right) + 1 \right] = \pi(w) S(1) u_{n-k+1} = c_{k-1} u_{n-k+1}.$$
This relation makes more explicit the link between return times and mixing. This leads to
$$\sum_{n \ge k} q^{(n)}_{\overleftarrow{\mathrm{pref}}(w)}(w)\, x^n = c_{k-1} x^{k-1} \sum_{n \ge 1} u_n x^n = c_{k-1} x^{k-1} \left( U(x) - 1 \right).$$
Furthermore, there is no auto-correlation structure inside $w$ so that $C_w(x) = 1$ and
$$S_w(x) = 1 + c_{k-1} x^{k-1} \left( U(x) - 1 \right).$$
This entails
$$\Phi^{(1)}_w(x) = \frac{c_{k-1} x^k}{S(1)(1-x) \left[ 1 + c_{k-1} x^{k-1} \left( U(x) - 1 \right) \right]}$$
and
$$\Phi^{(2)}_w(x) = \Phi^{(1)}_w(x) \left( 1 - \frac{1}{S_w(x)} \right) = \frac{c_{k-1}^2\, x^{2k-1} \left( U(x) - 1 \right)}{S(1)(1-x) \left[ 1 + c_{k-1} x^{k-1} \left( U(x) - 1 \right) \right]^2},$$
which is the announced result. The assumption $\sum_{n \ge 1} n^2 c_n < \infty$ makes $U$ twice differentiable and elementary calculations lead to
$$(\Phi^{(1)}_w)'(1) = \frac{S(1)}{c_{k-1}} - S(1) + 1 + \frac{S'(1)}{S(1)}, \qquad (\Phi^{(2)}_w)'(1) = (\Phi^{(1)}_w)'(1) + \frac{S(1)}{c_{k-1}},$$
$$(\Phi^{(1)}_w)''(1) = \frac{2 S(1)^2}{c_{k-1}^2} + o\left( \frac{1}{c_{k-1}^2} \right) \quad \text{and} \quad (\Phi^{(2)}_w)''(1) = \frac{6 S(1)^2}{c_{k-1}^2} + o\left( \frac{1}{c_{k-1}^2} \right),$$
and finally to (14). □

4.4 Logarithmic comb and factorial comb

Let $h_+$ and $h_-$ be the constants in $[0, +\infty]$ defined by
$$h_+ := \lim_{n \to +\infty} \frac{1}{n} \max \left\{ \ln \frac{1}{\pi(w)} \right\} \quad \text{and} \quad h_- := \lim_{n \to +\infty} \frac{1}{n} \min \left\{ \ln \frac{1}{\pi(w)} \right\}, \tag{15}$$
where the maximum and the minimum range over the words $w$ of length $n$ with $\pi(w) > 0$. In their papers, Pittel [10] and Szpankowski [12] only deal with the cases $h_+ < +\infty$ and $h_- > 0$, which amounts to saying that the probability of any word is exponentially decreasing with its length. Here, we focus on our two examples for which these assumptions are not fulfilled. More precisely, for the logarithmic infinite comb, (2) implies that $\pi(10^n)$ is of order $n^{-4}$, so that
$$h_- \le \lim_{n \to +\infty} \frac{1}{n} \ln \frac{1}{\pi(10^{n-1})} = 4 \lim_{n \to +\infty} \frac{\ln n}{n} = 0.$$
Besides, for the factorial infinite comb, $\pi(10^n)$ is of order $\frac{1}{(n+1)!}$ so that
$$h_+ \ge \lim_{n \to +\infty} \frac{1}{n} \ln \frac{1}{\pi(10^{n-1})} = \lim_{n \to +\infty} \frac{\ln n!}{n} = +\infty.$$
For these two models, the asymptotic behaviour of the lengths of the branches is not always logarithmic, as can be seen in the two following theorems, shown in Sections 4.5 and 4.6.

Theorem 4.8 (Height of the logarithmic infinite comb) Let $\mathcal{T}_n$ be the suffix trie built from the $n$ first suffixes of a sequence generated by a logarithmic infinite comb.
Then, the height $H_n$ of $\mathcal{T}_n$ satisfies
$$\forall \delta > 0, \quad \frac{H_n}{n^{\frac{1}{4} - \delta}} \xrightarrow[n \to \infty]{} +\infty \quad \text{in probability.}$$

Theorem 4.9 (Saturation level of the factorial infinite comb) Let $\mathcal{T}_n$ be the suffix trie built from the $n$ first suffixes of the sequence generated by a factorial infinite comb. Then, the saturation level $\ell_n$ of $\mathcal{T}_n$ satisfies: for any $\delta > 1$, almost surely, when $n$ tends to infinity,
$$\ell_n \in o\left( \frac{\log n}{(\log \log n)^{\delta}} \right).$$

The dynamic asymptotics of the height and of the saturation level can be visualized on Figure 5. The number $n$ of leaves of the suffix trie is put on the $x$-axis while heights or saturation levels of tries are put on the $y$-axis. Plain lines represent a logarithmic comb while long dashed lines are those of a factorial comb (mean values of 25 simulations). Short dashed lines represent a third infinite comb defined by the data
$$c_n = \frac{1}{3} \prod_{k=1}^{n-1} \left( \frac{1}{3} + \frac{1}{(1+k)^2} \right) \quad \text{for } n \ge 1.$$
Such a process has a uniform exponential mixing, a finite $h_+$ and a positive $h_-$, as can be elementarily checked. As a consequence, it satisfies all assumptions of Pittel [10] and Szpankowski [12], implying that the height and the saturation level are both of order $\log n$. Such assumptions will always be fulfilled as soon as the data $(c_n)_n$ satisfy $\lim_n c_n^{1/n} < 1$; the proof of this result is left to the reader.

One can notice the height of the logarithmic comb that grows as a power of $n$. The saturation level of the factorial comb, negligible with respect to $\log n$, is more difficult to highlight because of the very slow growth of logarithms. These asymptotic behaviours, all coming from the same model, the infinite comb, stress its surprising richness.

4.5 Height for the logarithmic comb

In this subsection, we prove Theorem 4.8. Consider the right-infinite sequence $s = 10^{\infty}$.
Then, $T_k(s)$ is the second occurrence time of $w = 10^{k-1}$. It is a nondecreasing (random) function of $k$. Moreover, $X_n(s)$ is the maximum of all $k$ such that $s^{(k)} \in \mathcal{T}_n$. It is nondecreasing in $n$. So, by definition of $X_n(s)$ and $T_k(s)$, the duality can be written
$$\forall n,\ \forall \omega,\ \exists k_n, \quad k_n \le X_n(s) < k_n + 1 \quad \text{and} \quad T_{k_n}(s) \le n < T_{k_n + 1}(s). \tag{16}$$
Claim:
$$\lim_{n \to +\infty} X_n(s) = +\infty \quad \text{a.s.} \tag{17}$$
Indeed, if $X_n(s)$ were bounded above, by $K$ say, then take $w = 10^K$ and consider $T_{K+1}(s)$, which is the time of the second occurrence of $10^K$. The choice of the $c_n$ in the definition of the logarithmic comb implies the convergence of the series $\sum_n n^2 c_n$. Thus (14) holds and $E[T_{K+1}(s)] < \infty$ so that $T_{K+1}(s)$ is almost surely finite. This means that for $n \ge T_{K+1}(s)$, the word $10^K$ has been seen twice, leading to $X_n(s) \ge K + 1$, which is a contradiction.

We make use of the following lemma that is proven hereunder.

Lemma 4.10 For $s = 10^{\infty}$,
$$\forall \eta > 0, \quad \frac{T_k(s)}{k^{4+\eta}} \xrightarrow[k \to \infty]{} 0 \quad \text{in probability,}$$
and
$$\forall \eta > \frac{1}{2}, \quad \frac{T_k(s)}{k^{4+\eta}} \xrightarrow[k \to \infty]{} 0 \quad \text{a.s.} \tag{18}$$

With notations (16), because of (17), the sequence $(k_n)$ tends to infinity, so that $(T_{k_n}(s))$ is a subsequence of $(T_k(s))$. Thus, (18) implies that
$$\forall \eta > \frac{1}{2}, \quad \frac{T_{k_n}}{k_n^{4+\eta}} \xrightarrow[n \to \infty]{} 0 \quad \text{a.s.} \qquad \text{and} \qquad \forall \eta > 0, \quad \frac{T_{k_n}}{k_n^{4+\eta}} \xrightarrow[n \to \infty]{P} 0.$$
Using duality (16) again leads to
$$\forall \eta > 0, \quad \frac{X_n(s)}{n^{1/(4+\eta)}} \xrightarrow[n \to \infty]{P} +\infty.$$
In other words
$$\forall \delta > 0, \quad \frac{X_n(s)}{n^{\frac{1}{4} - \delta}} \xrightarrow[n \to \infty]{P} +\infty$$
so that, since the height of the suffix trie is larger than $X_n(s)$,
$$\forall \delta > 0, \quad \frac{H_n}{n^{\frac{1}{4} - \delta}} \xrightarrow[n \to \infty]{P} +\infty.$$
This ends the proof of Theorem 4.8. □

Proof of Lemma 4.10. Combining (13) and (14) shows that
$$E(T_k(s)) = E(\tau^{(2)}(w)) - k = \frac{19}{9} k^4 + o(k^4) \tag{19}$$
and
$$\mathrm{Var}(T_k(s)) = \mathrm{Var}(\tau^{(2)}(w)) = \frac{361}{162} k^8 + o(k^8).$$
(20)

For all $\eta > 0$, write
$$\frac{T_k(s)}{k^{4+\eta}} = \frac{T_k(s) - E(T_k(s))}{k^{4+\eta}} + \frac{E(T_k(s))}{k^{4+\eta}}.$$
The deterministic second term on the right-hand side goes to 0 as $k$ tends to infinity thanks to (19), so that we focus on the term $\frac{T_k(s) - E(T_k(s))}{k^{4+\eta}}$. For any $\varepsilon > 0$, because of Bienaymé-Tchebychev inequality,
$$P\left( \frac{|T_k(s) - E[T_k(s)]|}{k^{4+\eta}} > \varepsilon \right) \le \frac{\mathrm{Var}(T_k(s))}{\varepsilon^2 k^{8+2\eta}} = O\left( \frac{1}{k^{2\eta}} \right).$$
This shows the convergence in probability in Lemma 4.10. Moreover, Borel-Cantelli Lemma ensures the almost sure convergence as soon as $\eta > \frac{1}{2}$. □

Remark 4.11 Notice that our proof shows actually that the convergence to $+\infty$ in Theorem 4.8 is valid a.s. (and not only in probability) as soon as $\delta > \frac{1}{36}$.

4.6 Saturation level for the factorial comb

In this subsection, we prove Theorem 4.9. Consider the probabilized infinite factorial comb defined in Section 2 by
$$\forall n \in \mathbb{N}, \quad c_n = \frac{1}{(n+1)!}.$$
The proof hereunder shows actually that $\left( \ell_n \frac{\log \log n}{\log n} \right)_n$ is an almost surely bounded sequence, which implies the result. Recall that $\mathcal{R}$ denotes the set of all right-infinite sequences. By characterization of the saturation level as a function of $X_n$ (see (11)), $P(\ell_n \le k) = P(\exists s \in \mathcal{R},\ X_n(s) \le k)$ for all positive integers $n, k$. Duality formula (12) then provides
$$P(\ell_n \le k) = P(\exists s \in \mathcal{R},\ T_k(s) \ge n) \ge P(T_k(\tilde{s}) \ge n)$$
where $\tilde{s}$ denotes any infinite word having $10^{k-1}$ as a prefix. Markov inequality implies
$$\forall x \in\, ]0, 1[, \quad P(\ell_n \ge k + 1) \le P\left( \tau^{(2)}(10^{k-1}) < n + k \right) \le \frac{\Phi^{(2)}_{10^{k-1}}(x)}{x^{n+k}} \tag{21}$$
where $\Phi^{(2)}_{10^{k-1}}(x)$ denotes as above the generating function of the rank of the final letter of the second occurrence of $10^{k-1}$ in the infinite random word $(U_n)_{n \ge 1}$.
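For the reader's convenience, the two inequalities of (21) can be unpacked as follows. This is a routine expansion that only uses the lower bound on $P(\ell_n \le k)$ obtained from the duality, relation (13) and Markov inequality. Since $\tilde{s}$ has $10^{k-1}$ as a prefix, (13) reads $\tau^{(2)}(10^{k-1}) = T_k(\tilde{s}) + k$, hence
$$P(\ell_n \ge k + 1) = 1 - P(\ell_n \le k) \le 1 - P(T_k(\tilde{s}) \ge n) = P(T_k(\tilde{s}) < n) = P\left( \tau^{(2)}(10^{k-1}) < n + k \right).$$
For $x \in\, ]0, 1[$, the map $t \mapsto x^t$ is decreasing, so that $x^{\tau} \ge x^{n+k}$ on the event $\{ \tau < n + k \}$, where $\tau$ stands for $\tau^{(2)}(10^{k-1})$. Markov inequality applied to the nonnegative random variable $x^{\tau}$ then gives
$$P(\tau < n + k) \le P\left( x^{\tau} \ge x^{n+k} \right) \le \frac{E\left( x^{\tau} \right)}{x^{n+k}} = \frac{\Phi^{(2)}_{10^{k-1}}(x)}{x^{n+k}}.$$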
The simple form of the factorial comb leads to the explicit expression
$$U(x) = \frac{x}{(1-x)(e^x - 1)}$$
and, after computation,
$$\Phi^{(2)}_{10^{k-1}}(x) = \frac{e^x - 1}{e - 1} \cdot \frac{x^{2k-1} \left( 1 - e^x (1-x) \right)}{\left[ k!\, (e^x - 1)(1-x) + x^{k-1} \left( 1 - e^x (1-x) \right) \right]^2}. \tag{22}$$
In particular, combining (21) and (22) with $n = (k-1)!$ and $x = 1 - \frac{1}{(k-1)!}$ implies that for any $k > 1$,
$$P\left( \ell_{(k-1)!} \ge k + 1 \right) \le \frac{\left( 1 - \frac{1}{(k-1)!} \right)^{2k-1}}{\left[ k!\, (e - 1) \frac{1}{(k-1)!} \right]^2} \cdot \frac{1}{\left( 1 - \frac{1}{(k-1)!} \right)^{(k-1)! + k}}.$$
Consequently, $P\left( \ell_{(k-1)!} \ge k + 1 \right) = O(k^{-2})$ is the general term of a convergent series. Thanks to Borel-Cantelli Lemma, one gets almost surely
$$\limsup_{n \to +\infty} \frac{\ell_{n!}}{n} \le 1.$$
Let $\Gamma^{-1}$ denote the inverse of Euler's Gamma function, defined and increasing on the real interval $[2, +\infty[$. If $n$ and $k$ are integers such that $(k+1)! \le n \le (k+2)!$, then
$$\frac{\ell_n}{\Gamma^{-1}(n)} \le \frac{\ell_{(k+2)!}}{\Gamma^{-1}((k+1)!)} = \frac{\ell_{(k+2)!}}{k + 2},$$
which implies that, almost surely,
$$\limsup_{n \to \infty} \frac{\ell_n}{\Gamma^{-1}(n)} \le 1.$$
Inverting Stirling Formula, namely
$$\Gamma(x) = \sqrt{\frac{2\pi}{x}}\, e^{x \log x - x} \left( 1 + O\left( \frac{1}{x} \right) \right)$$
when $x$ goes to infinity, leads to the equivalent
$$\Gamma^{-1}(x) \underset{+\infty}{\sim} \frac{\log x}{\log \log x},$$
which implies the result. □

Acknowledgements

The authors are very grateful to Eng. Maxence Guesdon for providing simulations with great talent and an infinite patience. They would like to thank also all people managing two very important tools for French mathematicians: first the Institut Henri Poincaré, where a large part of this work was done, and second Mathrice, which provides a large number of services.

References

[1] A. Blumer, A. Ehrenfeucht and D. Haussler. Average sizes of suffix trees and dawgs. Discrete Appl. Math., 24:37–45, 1989.
[2] P. Cénac, B. Chauvin, F. Paccaut, and N. Pouyanne.
Context trees, variable length Markov chains and dynamical sources. Séminaire de Probabilités, 2011. arXiv:1007.2986.
[3] J. Clément, P. Flajolet, and B. Vallée. Dynamical sources in information theory: a general analysis of trie structures. Algorithmica, 29:307–369, 2001.
[4] L. Devroye, W. Szpankowski, and B. Rais. A note on the height of suffix trees. SIAM J. Comput., 21(1):48–53, 1992.
[5] P. Doukhan. Mixing: properties and examples. Lecture Notes in Stat. 85. Springer-Verlag, 1994.
[6] J. Fayolle. Compression de données sans perte et combinatoire analytique. PhD thesis, Université Paris VI, 2006.
[7] P. Flajolet and R. Sedgewick. Analytic Combinatorics. Cambridge University Press, Cambridge, 2009.
[8] A. Galves and E. Löcherbach. Stochastic chains with memory of variable length. TICSP Series, 38:117–133, 2008.
[9] S. Isola. Renewal sequences and intermittency. J. Statist. Phys., 97(1-2):263–280, 1999.
[10] B. Pittel. Asymptotic growth of a class of random trees. Annals Probab., 13:414–427, 1985.
[11] J. Rissanen. A universal data compression system. IEEE Trans. Inform. Theory, 29(5):656–664, 1983.
[12] W. Szpankowski. Asymptotic properties of data compression and suffix trees. IEEE Trans. Information Theory, 39(5):1647–1659, 1993.

Figure 5: Respective heights and saturation levels for a logarithmic comb (plain lines), a factorial comb (long dashed lines) and a $\log n$-comb (short dashed lines).
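Curves such as those of Figure 5 can be explored with a short simulation. The Python sketch below is an illustration under explicit assumptions and is not the code used for the figure. It assumes that, in a comb, after a 1 followed by $j$ zeros the next letter is 0 with probability $c_{j+1}/c_j$, which equals $1/(j+2)$ for the factorial comb $c_n = 1/(n+1)!$. It starts the word just after a 1 rather than from the stationary measure, and it reads the height and the saturation level off the prefixes shared by at least two of the first $n$ suffixes. All function names are hypothetical.

```python
import random
from collections import Counter


def factorial_comb_word(length, rng):
    """Sample `length` letters of a factorial comb, started with a 1.

    Assumption: after a 1 followed by j zeros, the next letter is 0 with
    probability c_{j+1} / c_j = 1 / (j + 2), where c_j = 1 / (j + 1)!.
    """
    letters = ["1"]
    zeros = 0  # number of zeros seen since the last 1
    while len(letters) < length:
        if rng.random() < 1.0 / (zeros + 2):
            letters.append("0")
            zeros += 1
        else:
            letters.append("1")
            zeros = 0
    return "".join(letters)


def repeated_prefixes(word, n, depth):
    """Words of length `depth` starting at least two of the first n suffixes
    (suffixes too short for this depth are skipped)."""
    counts = Counter(
        word[i:i + depth] for i in range(n) if i + depth <= len(word)
    )
    return [u for u, c in counts.items() if c >= 2]


def height_and_saturation(word, n):
    """Return (H_n, l_n) computed from the known finite prefix `word`."""
    height = 0
    while repeated_prefixes(word, n, height + 1):
        height += 1
    level = 0
    while len(repeated_prefixes(word, n, level + 1)) == 2 ** (level + 1):
        level += 1
    return height, level


rng = random.Random(2011)  # fixed seed, so that runs are reproducible
n = 2000
word = factorial_comb_word(n + 200, rng)  # margin so that suffixes stay comparable
print(height_and_saturation(word, n))
```

Averaging the two returned values over several seeds and several values of `n` gives curves that can be compared qualitatively with the long dashed lines of Figure 5. Replacing the transition probability `1.0 / (zeros + 2)` by the ratio $c_{j+1}/c_j$ of another comb gives the corresponding sketch for that comb.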