Uncommon Suffix Tries

Cénac, Chauvin, Paccaut, Pouyanne: uncommon suffix tries

Peggy Cénac (1), Brigitte Chauvin (2), Frédéric Paccaut (3), Nicolas Pouyanne (4)

December 14th 2011

(1) Université de Bourgogne, Institut de Mathématiques de Bourgogne, IMB UMR 5584 CNRS, 9 rue Alain Savary - BP 47870, 21078 DIJON CEDEX, France.
(2) Université de Versailles-St-Quentin, Laboratoire de Mathématiques de Versailles, CNRS, UMR 8100, 45, avenue des Etats-Unis, 78035 Versailles CEDEX, France.
(3) LAMFA, CNRS, UMR 6140, Université de Picardie Jules Verne, 33, rue Saint-Leu, 80039 Amiens, France.
(4) Université de Versailles-St-Quentin, Laboratoire de Mathématiques de Versailles, CNRS, UMR 8100, 45, avenue des Etats-Unis, 78035 Versailles CEDEX, France.

Abstract

Common assumptions on the source producing the words inserted in a suffix trie with $n$ leaves lead to a $\log n$ height and saturation level. We provide an example of a suffix trie whose height increases faster than a power of $n$ and another one whose saturation level is negligible with respect to $\log n$. Both are built from VLMC (Variable Length Markov Chain) probabilistic sources and are easily extended to families of tries having the same properties. The first example corresponds to a "logarithmic infinite comb" and enjoys a non uniform polynomial mixing. The second one corresponds to a "factorial infinite comb" for which mixing is uniform and exponential.

MSC 2010: 60J05, 37E05.

Keywords: variable length Markov chain, probabilistic source, mixing properties, suffix trie

1 Introduction

Trie (abbreviation of retrieval) is a natural data structure, efficient for searching words in a given set and used in many algorithms as data compression, spell checking or IP addresses lookup. A trie is a digital tree in which words are
inserted in external nodes. The trie process grows up by successively inserting words according to their prefixes. A precise definition will be given in Section 4.1.

As soon as a set of words is given, the way they are inserted in the trie is deterministic. Nevertheless, a trie becomes random when the words are randomly drawn: each word is produced by a probabilistic source and $n$ words are chosen (usually independently) to be inserted in a trie. A suffix trie is a trie built on the suffixes of one infinite word. The randomness then comes from the source producing such an infinite word and the successive words inserted in the tree are far from being independent, they are strongly correlated.

As a principal application of suffix tries one can cite the lossless compression algorithm Lempel-Ziv 77 (LZ77). The first results on the average size of suffix tries when the infinite word is given by a symmetrical memoryless source are due to Blumer et al. [1] and those on the height of the tree to Devroye [4]. Using analytic combinatorics, Fayolle [6] has obtained the average size and total path length of the tree for a binary word issued from a memoryless source (with some restriction on the probability of each letter).

Here we are interested in the height $H_n$ and the saturation level $\ell_n$ of a suffix trie $\mathcal{T}_n$ containing the first $n$ suffixes of an infinite word produced by a source associated with a so-called Variable Length Markov Chain (VLMC) (see Rissanen [11] for the seminal work, Galves-Löcherbach [8] for an overview, and [2] for a probabilistic frame). One deals with a particular VLMC source associated with an infinite comb, described hereafter. This particular model has the double advantage to go beyond the cases of memoryless or Markov sources and to provide concrete computable properties.
The analysis of the height and the saturation level is usually motivated by optimization of the memory cost. Height is clearly relevant to this point; saturation level is algorithmically relevant as well because internal nodes below the saturation level are often replaced by a less expansive table.

All the tries or suffix tries considered so far in the literature have a height and a saturation level both growing logarithmically with the number of words inserted, to the best of our knowledge. For plain tries, when the inserted words are independent, the results due to Pittel [10] rely on two assumptions on the source producing the words: first, the source is uniformly mixing, second, the probability of any word decays exponentially with its length. Let us also mention the general analysis of tries by Clément-Flajolet-Vallée [3] for dynamical sources. For suffix tries, Szpankowski [12] obtains the same result, with a weaker mixing assumption (still uniform though) and the same hypothesis on the measure of the words.

Our aim is to exhibit two cases when these behaviours are no longer the same. The first example is the "logarithmic comb", for which we show that the mixing is slow in some sense, namely non uniformly polynomial (see Section 3.2 for a precise statement) and the measure of some increasing sequence of words decays polynomially. We prove in Theorem 4.8 that the height of this trie is larger than a power of $n$ (when $n$ is the number of inserted suffixes in the tree).

The second example is the "factorial comb", which has a uniformly exponential mixing, thus fulfilling the mixing hypothesis of Szpankowski [12], but the measure of some increasing sequence of words decays faster than any exponential.
In this case we prove in Theorem 4.9 that the saturation level is negligible with respect to $\log n$. We prove more precisely that, almost surely,
$$\ell_n \in o\left(\frac{\log n}{(\log\log n)^{\delta}}\right), \quad \text{for any } \delta > 1.$$

The paper is organised as follows. In Section 2, we define a VLMC source associated with an infinite comb. In Section 3, we give results on the mixing properties of these sources by explicitly computing the suitable generating functions in terms of the source data. In Section 4, the associated suffix tries are built, and the two uncommon behaviours are stated and shown. The methods are based on two key tools concerning pattern return time: a duality property and the computation of generating functions. The relation between the mixing of the source and the asymptotic behaviour of the trie is highlighted by the proof of Proposition 4.7.

2 Infinite combs as sources

In this section, a VLMC probabilistic source associated with an infinite comb is defined. Moreover, we introduce the two examples given in the introduction: the logarithmic and the factorial combs. We begin with the definition of a general variable length Markov chain associated with a probabilized infinite comb.

The following presentation comes from [2]. Let $\mathcal{A}$ be the alphabet $\{0,1\}$ and $\mathcal{L} = \mathcal{A}^{-\mathbb{N}}$ be the set of left-infinite words. Consider the binary tree (represented in Figure 1) whose finite leaves are the words $1, 01, \dots, 0^k1, \dots$ and with an infinite leaf $0^\infty$ as well. Each leaf is labelled with a Bernoulli distribution, respectively denoted by $q_{0^k1}$, $k \ge 0$, and $q_{0^\infty}$. This probabilized tree is called the infinite comb.
The VLMC (Variable Length Markov Chain) associated with an infinite comb is the $\mathcal{L}$-valued Markov chain $(V_n)_{n\ge0}$ defined by the transitions
$$P(V_{n+1} = V_n\alpha \mid V_n) = q_{\overleftarrow{\mathrm{pref}}(V_n)}(\alpha)$$
where $\alpha \in \mathcal{A}$ is any letter and $\overleftarrow{\mathrm{pref}}(V_n)$ denotes the first suffix of $V_n$ (reading from right to left) appearing as a leaf of the infinite comb. For instance, if $V_n = \dots1000$, then $\overleftarrow{\mathrm{pref}}(V_n) = 0001$. Notice that the VLMC is entirely determined by the data $q_{0^\infty}$, $q_{0^k1}$, $k \ge 0$.

Figure 1: An infinite comb

From now on, denote $c_0 = 1$ and for $n \ge 1$,
$$c_n := \prod_{k=0}^{n-1} q_{0^k1}(0).$$
It is proved in [2] that in the irreducible case, i.e. when $q_{0^\infty}(0) \ne 1$, there exists a unique stationary probability measure $\pi$ on $\mathcal{L}$ for $(V_n)_n$ if and only if the series $\sum c_n$ converges. From now on, we assume that this condition is fulfilled and we call
$$S(x) := \sum_{n\ge0} c_n x^n \tag{1}$$
its generating function so that $S(1) = \sum_{n\ge0} c_n$. For any finite word $w$, we denote $\pi(w) := \pi(\mathcal{L}w)$. Computations performed in [2] show that for any $n \ge 0$,
$$\pi(10^n) = \frac{c_n}{S(1)} \quad\text{and}\quad \pi(0^n) = \frac{\sum_{k\ge n} c_k}{S(1)}. \tag{2}$$
Notice that, by stationarity, $\pi(0^n) = \pi(0^{n+1}) + \pi(10^n)$ and by disjointness of events, $\pi(0^n) = \pi(0^{n+1}) + \pi(0^n1)$ for all $n \ge 1$ so that
$$\pi(10^n) = \pi(0^n1). \tag{3}$$

If $U_n$ denotes the final letter of $V_n$, the random sequence $W = U_0U_1U_2\dots$ is a right-infinite random word. We define in this way a probabilistic source in the sense of information theory, i.e. a mechanism that produces random words. This VLMC probabilistic source is characterized by:
$$p_w := P(W \text{ has } w \text{ as a prefix}) = \pi(w),$$
for every finite word $w$. Both particular suffix tries the article deals with are built from such sources, defined by the following data.
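Before turning to the two examples, the mechanism described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the paper: the function names `comb_coefficients` and `sample_word` are hypothetical, the conditional probabilities $q_{0^k1}(0)$ are passed as a callable `q0(k)`, and the sampler starts just after a letter 1 rather than from the stationary measure $\pi$ used in the text. The only state that matters for the next letter is the number of 0s emitted since the last 1, which is exactly the leaf $0^k1$ of the comb.

```python
import random
from fractions import Fraction
from typing import Callable, List


def comb_coefficients(q0: Callable[[int], Fraction], n_max: int) -> List[Fraction]:
    """Return [c_0, ..., c_{n_max}] with c_0 = 1 and c_n = prod_{k<n} q_{0^k 1}(0)."""
    c = [Fraction(1)]
    for k in range(n_max):
        c.append(c[-1] * q0(k))
    return c


def sample_word(q0: Callable[[int], float], length: int, rng: random.Random) -> str:
    """Sample `length` letters of a comb source.

    Assumption (not the paper's setting): the chain starts just after a 1,
    so the first letter is drawn from q_1; the stationary start is not sampled.
    """
    zeros = 0  # number of 0s emitted since the last 1, i.e. the leaf 0^zeros 1
    letters = []
    for _ in range(length):
        if rng.random() < q0(zeros):  # next letter is 0 with probability q_{0^zeros 1}(0)
            letters.append("0")
            zeros += 1
        else:
            letters.append("1")
            zeros = 0
    return "".join(letters)
```

For instance, with `q0 = lambda k: Fraction(1, k + 2)`, which is the choice made for the factorial comb of Example 2 below, `comb_coefficients(q0, 3)` returns the values $1, 1/2, 1/6, 1/24$, in agreement with $c_n = 1/(n+1)!$.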
Example 1: the logarithmic comb

The logarithmic comb is defined by $c_0 = 1$ and for $n \ge 1$,
$$c_n = \frac{1}{n(n+1)(n+2)(n+3)}.$$
The corresponding conditional probabilities on the leaves of the tree are
$$q_1(0) = \frac{1}{24} \quad\text{and for } n \ge 1, \quad q_{0^n1}(0) = 1 - \frac{4}{n+4}.$$
The expression of $c_n$ was chosen to make the computations as simple as possible and also because the square-integrability of the waiting time of some pattern will be needed (see end of Section 4.3), guaranteed by
$$\sum_{n\ge0} n^2 c_n < +\infty.$$

Example 2: the factorial comb

The conditional probabilities on the leaves are defined by
$$q_{0^n1}(0) = \frac{1}{n+2} \quad\text{for } n \ge 0,$$
so that
$$c_n = \frac{1}{(n+1)!}.$$

3 Mixing properties of infinite combs

In this section, we first make precise what we mean by mixing properties of a random sequence. We refer to Doukhan [5], especially for the notion of $\psi$-mixing defined in that book. We state in Proposition 3.2 a general result that provides the mixing coefficient for an infinite comb defined by $(c_n)_{n\ge0}$ or equivalently by its generating function $S$. This result is then applied to our two examples. The mixing of the logarithmic comb is polynomial but not uniform, it is a very weak mixing; the mixing of the factorial comb is uniform and exponential, it is a very strong mixing. Notice that mixing properties of some infinite combs have already been investigated by Isola [9], although with a slightly different language.

3.1 Mixing properties of general infinite combs

For a stationary sequence $(U_n)_{n\ge0}$ with stationary measure $\pi$, we want to measure by means of a suitable coefficient the independence between two words $A$ and $B$ separated by $n$ letters. The sequence is said to be "mixing" when this coefficient vanishes when $n$ goes to $+\infty$. Among all types of mixing, we focus on one of the strongest types: $\psi$-mixing.
More precisely, for $0 \le m \le +\infty$, denote by $\mathcal{F}_0^m$ the $\sigma$-algebra generated by $\{U_k, 0 \le k \le m\}$ and introduce for $A \in \mathcal{F}_0^m$ and $B \in \mathcal{F}_0^\infty$ the mixing coefficient
$$\psi(n,A,B) := \frac{\pi(A \cap T^{-(m+1)-n}B) - \pi(A)\pi(B)}{\pi(A)\pi(B)} = \frac{\sum_{|w|=n}\pi(AwB) - \pi(A)\pi(B)}{\pi(A)\pi(B)}, \tag{4}$$
where $T$ is the shift map and where the sum runs over the finite words $w$ with length $|w| = n$. A sequence $(U_n)_{n\ge0}$ is called $\psi$-mixing whenever
$$\lim_{n\to\infty}\ \sup_{m\ge0,\ A\in\mathcal{F}_0^m,\ B\in\mathcal{F}_0^\infty} |\psi(n,A,B)| = 0.$$
In this definition, the convergence to zero is uniform over all words $A$ and $B$. This is not going to be the case in our first example. As in Isola [9], we widely use the renewal properties of infinite combs (see Lemma 3.1) but more detailed results are needed, in particular we investigate the lack of uniformity for the logarithmic comb.

Notations and Generating functions

• For a comb, recall that $S$ is the generating function of the nonincreasing sequence $(c_n)_{n\ge0}$ defined by (1).
• Set $\rho_0 = 0$ and for $n \ge 1$, $\rho_n := c_{n-1} - c_n$, with generating function
$$P(x) := \sum_{n\ge1} \rho_n x^n.$$
• Define the sequence $(u_n)_{n\ge0}$ by $u_0 = 1$ and for $n \ge 1$,
$$u_n := \frac{\pi(U_0 = 1, U_n = 1)}{\pi(1)} = \frac{1}{\pi(1)}\sum_{|w|=n-1}\pi(1w1), \tag{5}$$
and let
$$U(x) := \sum_{n\ge0} u_n x^n$$
denote its generating function.

Hereunder is stated a key lemma that will be widely used in Proposition 3.2. In some sense, this kind of relation (sometimes called Renewal Equation) reflects the renewal properties of the infinite comb.

Lemma 3.1 The sequences $(u_n)_{n\ge0}$ and $(\rho_n)_{n\ge0}$ are connected by the relations:
$$\forall n \ge 1, \quad u_n = \rho_n + u_1\rho_{n-1} + \dots + u_{n-1}\rho_1$$
and (consequently)
$$U(x) = \sum_{n=0}^{\infty} u_n x^n = \frac{1}{1-P(x)} = \frac{1}{(1-x)S(x)}.$$

Proof. For a finite word $w = \alpha_1 \dots$
$\alpha_m$ such that $w \ne 0^m$, let $l(w)$ denote the position of the last 1 in $w$, that is $l(w) := \max\{1 \le i \le m,\ \alpha_i = 1\}$. Then, the sum in the expression (5) of $u_n$ can be decomposed as follows:
$$\sum_{|w|=n-1}\pi(1w1) = \pi(10^{n-1}1) + \sum_{i=1}^{n-1}\ \sum_{\substack{|w|=n-1 \\ l(w)=i}}\pi(1w1).$$
Now, by disjoint union, $\pi(10^{n-1}) = \pi(10^{n-1}1) + \pi(10^n)$, so that
$$\pi(10^{n-1}1) = \pi(1)(c_{n-1} - c_n) = \pi(1)\rho_n.$$
In the same way, for $w = \alpha_1\dots\alpha_{n-1}$, if $l(w) = i$ then $\pi(1w1) = \pi(1\alpha_1\dots\alpha_{i-1}1)\,\rho_{n-i}$, so that
$$u_n = \rho_n + \sum_{i=1}^{n-1}\rho_{n-i}\,\frac{1}{\pi(1)}\sum_{|w|=i-1}\pi(1w1) = \rho_n + \sum_{i=1}^{n-1}\rho_{n-i}u_i,$$
which leads to $U(x) = (1-P(x))^{-1}$ by summation. □

Mixing coefficients

The mixing coefficients $\psi(n,A,B)$ are expressed as the $n$-th coefficient in the series expansion of an analytic function $M_{A,B}$ which is given in terms of $S$ and $U$. The notation $[x^n]A(x)$ means the coefficient of $x^n$ in the power expansion of $A(x)$ at the origin. Denote the remainders associated with the series $S(x)$ by
$$r_n := \sum_{k\ge n} c_k, \qquad R_n(x) := \sum_{k\ge n} c_k x^k$$
and for $a \ge 0$, define the "shifted" generating function
$$P_a(x) := \frac{1}{c_a}\sum_{n\ge1}\rho_{a+n}x^n = x + \frac{x-1}{c_a x^a}R_{a+1}(x).$$
(6)

Proposition 3.2 For any finite word $A$ and any word $B$, the identity
$$\psi(n,A,B) = [x^{n+1}]M_{A,B}(x)$$
holds for the generating functions $M_{A,B}$ respectively defined by:

i) if $A = A'1$ and $B = 1B'$ where $A'$ and $B'$ are any finite words, then
$$M_{A,B}(x) = M(x) := \frac{S(x) - S(1)}{(x-1)S(x)};$$

ii) if $A = A'10^a$ and $B = 0^b1B'$ where $A'$ and $B'$ are any finite words and $a + b \ge 1$, then
$$M_{A,B}(x) := S(1)\frac{c_{a+b}}{c_a c_b}P_{a+b}(x) + U(x)\bigl[S(1)P_a(x)P_b(x) - S(x)\bigr];$$

iii) if $A = 0^a$ and $B = 0^b$ with $a, b \ge 1$, then
$$M_{A,B}(x) := S(1)\frac{1}{r_a r_b}\sum_{n\ge1} r_{a+b+n-1}x^n + U(x)\left[S(1)\frac{R_a(x)R_b(x)}{r_a r_b x^{a+b-2}} - S(x)\right];$$

iv) if $A = A'10^a$ and $B = 0^b$ where $A'$ is any finite word and $a, b \ge 0$, then
$$M_{A,B}(x) := S(1)\frac{1}{c_a r_b x^{a+b-1}}R_{a+b}(x) + U(x)\left[S(1)\frac{P_a(x)R_b(x)}{c_a r_b x^{b-1}} - S(x)\right];$$

v) if $A = 0^a$ and $B = 0^b1B'$ where $B'$ is any finite word and $a, b \ge 0$, then
$$M_{A,B}(x) := S(1)\frac{1}{r_a c_b x^{a+b-1}}R_{a+b}(x) + U(x)\left[S(1)\frac{R_a(x)P_b(x)}{r_a c_b x^{a-1}} - S(x)\right].$$

Remark 3.3 It is worth noticing that the asymptotics of $\psi(n,A,B)$ may not be uniform in all words $A$ and $B$. We call this kind of system non-uniformly $\psi$-mixing. It may happen that $\psi(n,A,B)$ goes to zero for any fixed $A$ and $B$, but (for example, in case iii)) the larger $a$ or $b$, the slower the convergence, preventing it from being uniform.

Proof. The following identity has been established in [2] (see formula (17) in that paper) and will be used many times in the sequel. For any two finite words $w$ and $w'$,
$$\pi(w1w')\,\pi(1) = \pi(w1)\,\pi(1w'). \tag{7}$$

i) If $A = A'1$ and $B = 1B'$, then (7) yields
$$\pi(AwB) = \pi(A'1w1B') = \frac{\pi(A'1)}{\pi(1)}\pi(1w1B') = S(1)\,\pi(A)\pi(B)\,\frac{\pi(1w1)}{\pi(1)}.$$
So $\psi(n,A,B) = S(1)u_{n+1} - 1$ and by Lemma 3.1, the result follows.

ii) Let $A = A'10^a$ and $B = 0^b1B'$ with $a, b \ge 0$ and $a + b \ne 0$. To begin with,
$$\pi(AwB) = \frac{1}{\pi(1)}\pi(A'1)\,\pi(10^aw0^b1B') = \frac{1}{\pi(1)^2}\pi(A'1)\,\pi(10^aw0^b1)\,\pi(1B').$$
Furthermore, $\pi(A) = c_a\pi(A'1)$ and by (3), $\pi(0^b1) = \pi(10^b)$, so it comes
$$\pi(B) = \frac{1}{\pi(1)}\pi(0^b1)\,\pi(1B') = \frac{\pi(10^b)}{\pi(1)}\pi(1B') = c_b\,\pi(1B').$$
Therefore,
$$\pi(AwB) = \frac{\pi(A)\pi(B)}{c_a c_b\,\pi(1)^2}\,\pi(10^aw0^b1).$$
Using $\pi(1)S(1) = 1$, this proves
$$\psi(n,A,B) = S(1)\frac{v^{a,b}_{n+1}}{c_a c_b} - 1 \quad\text{where}\quad v^{a,b}_n := \frac{1}{\pi(1)}\sum_{|w|=n-1}\pi(10^aw0^b1).$$
As in the proof of the previous lemma, if $w = \alpha_1\dots\alpha_m$ is any finite word different from $0^m$, we call $f(w) := \min\{1 \le i \le m,\ \alpha_i = 1\}$ the first place where 1 can be seen in $w$ and recall that $l(w)$ denotes the last place where 1 can be seen in $w$. One has
$$\sum_{|w|=n-1}\pi(10^aw0^b1) = \pi(10^{a+n-1+b}1) + \sum_{1\le i\le j\le n-1}\ \sum_{\substack{|w|=n-1 \\ f(w)=i,\ l(w)=j}}\pi(10^aw0^b1).$$
If $i = j$ then $w$ is the word $0^{i-1}10^{n-i-1}$, else $w$ is of the form $0^{i-1}1w'10^{n-1-j}$, with $|w'| = j-i-1$. Hence, the previous sum can be rewritten as
$$\sum_{|w|=n-1}\pi(10^aw0^b1) = \pi(1)\rho_{a+b+n} + \pi(1)\sum_{i=1}^{n-1}\rho_{a+i}\rho_{n-i+b} + \sum_{1\le i\,\dots}\ [\dots]$$

[…] $1$. Set
$$v^{a,b}_n := \frac{1}{\pi(1)}\sum_{|w|=n-1}\pi(0^aw0^b).$$
First, recall that, due to (2), $\pi(A) = \pi(1)r_a$ and $\pi(B) = \pi(1)r_b$. Consequently,
$$\psi(n,A,B) = \frac{\pi(1)v^{a,b}_{n+1} - \pi(A)\pi(B)}{\pi(A)\pi(B)} = S(1)\frac{v^{a,b}_{n+1}}{r_a r_b} - 1.$$
Let $w$ be a finite word with $|w| = n-1$. If $w = 0^{n-1}$, then
$$\pi(AwB) = \pi(0^{a+n-1+b}) = \pi(1)\,r_{a+b+n-1}.$$
If not, let $f(w)$ denote as before the first position of 1 in $w$ and $l(w)$ the last one in $w$.
If $f(w) = l(w)$, then
$$\pi(AwB) = \pi(0^{a+f(w)-1}10^{n-1-f(w)+b}) = \frac{1}{\pi(1)}\pi(0^{a+f(w)-1}1)\,\pi(10^{n-1-f(w)+b}) = \pi(1)\,c_{a+f(w)-1}\,c_{n-1-f(w)+b}.$$
If $f(w) < l(w)$, then writing $w = w_1\dots w_{n-1}$,
$$\pi(AwB) = \pi(0^{a+f(w)-1}1w_{f(w)+1}\dots w_{l(w)-1}10^{n-1-l(w)+b}) = \frac{1}{\pi(1)^2}\pi(0^{a+f(w)-1}1)\,\pi(1w_{f(w)+1}\dots w_{l(w)-1}1)\,\pi(10^{n-1-l(w)+b}).$$
Summing yields
$$v^{a,b}_n = r_{a+b+n-1} + \sum_{i=1}^{n-1}c_{a+i-1}c_{n-1+b-i} + \sum_{i,j=1,\ i\,\dots}^{n-1}\ [\dots]$$

[…] $1$ by
$$c_n = \frac{1}{n(n+1)(n+2)(n+3)}.$$
When $|x| < 1$, the series $S(x)$ writes as follows
$$S(x) = \frac{47}{36} - \frac{5}{12x} + \frac{1}{6x^2} + \frac{(1-x)^3\log(1-x)}{6x^3} \tag{8}$$
and
$$S(1) = \frac{19}{18}.$$
With Proposition 3.2, the asymptotics of the mixing coefficient comes from singularity analysis of the generating functions $M_{A,B}$.

Proposition 3.4 The VLMC defined by the logarithmic infinite comb has a non-uniform polynomial mixing of the following form: for any finite words $A$ and $B$, there exists a positive constant $C_{A,B}$ such that for any $n \ge 1$,
$$|\psi(n,A,B)| \le \frac{C_{A,B}}{n^3}.$$

Remark 3.5 The $C_{A,B}$ cannot be bounded above by some constant that does not depend on $A$ and $B$, as can be seen hereunder in the proof. Indeed, we show that if $a$ and $b$ are positive integers,
$$\psi(n,0^a,0^b) \sim \frac{1}{3}\left(\frac{S(1)}{r_a r_b} - \frac{1}{r_a} - \frac{1}{r_b} + \frac{1}{S(1)}\right)\frac{1}{n^3}$$
as $n$ goes to infinity. In particular, $\psi(n,0,0^n)$ tends to the positive constant $\frac{13}{6}$.

Proof of Proposition 3.4. For any finite words $A$ and $B$ in case i) of Proposition 3.2, one deals with $U(x) = ((1-x)S(x))^{-1}$ which has 1 as a unique dominant singularity. Indeed, 1 is the unique dominant singularity of $S$, so that the dominant singularities of $U$ are 1 or zeroes of $S$ contained in the closed unit disc.
But $S$ does not vanish on the closed unit disc, because for any $z$ such that $|z| \le 1$,
$$|S(z)| \ge 1 - \sum_{n\ge1}\frac{1}{n(n+1)(n+2)(n+3)} = 1 - (S(1) - 1) = \frac{17}{18}.$$
Since
$$M(x) = \frac{S(x) - S(1)}{(x-1)S(x)} = S(1)U(x) - \frac{1}{1-x},$$
the unique dominant singularity of $M$ is 1, and when $x$ tends to 1 in the unit disc, (8) leads to
$$M(x) = A(x) - \frac{1}{6S(1)}(1-x)^2\log(1-x) + O\bigl((1-x)^3\log(1-x)\bigr)$$
where $A(x)$ is a polynomial of degree 2. Using the classical transfer theorem (see Flajolet and Sedgewick [7, section VI]) based on the analysis of the singularities of $M$, we get
$$\psi(n-1,w1,1w') = [x^n]M(x) = \frac{1}{3S(1)}\frac{1}{n^3} + o\left(\frac{1}{n^3}\right).$$

The cases ii), iii), iv) and v) of Proposition 3.2 are of the same kind, and we completely deal with case iii).

Case iii): words of the form $A = 0^a$ and $B = 0^b$, $a, b \ge 1$. As shown in Proposition 3.2, one has to compute the asymptotics of the $n$-th coefficient of the Taylor series of the function
$$M_{a,b}(x) := S(1)\frac{1}{r_a r_b}\sum_{n\ge1}r_{a+b+n-1}x^n + U(x)\left[S(1)\frac{R_a(x)R_b(x)}{r_a r_b x^{a+b-2}} - S(x)\right]. \tag{9}$$
The contribution of the left-hand term of this sum is directly given by the asymptotics of the remainder
$$r_n = \sum_{k\ge n}c_k = \frac{1}{3n(n+1)(n+2)} = \frac{1}{3n^3} + O\left(\frac{1}{n^4}\right).$$
By means of singularity analysis, we deal with the right-hand term
$$N_{a,b}(x) := U(x)\left[S(1)\frac{R_a(x)R_b(x)}{r_a r_b x^{a+b-2}} - S(x)\right].$$
Since 1 is the only dominant singularity of $S$ and $U$ and consequently of any $R_a$, it suffices to compute an expansion of $N_{a,b}(x)$ at $x = 1$.
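The closed forms used so far for the logarithmic comb (the remainder $r_n$, the value $S(1) = 19/18$ and the expression (8) of $S(x)$) can be checked numerically. The following Python sketch is only a sanity check written for this purpose, not part of the original argument; the helper names `c_log`, `r_log`, `S_closed` and `S_series` are hypothetical. The remainder identity is tested exactly through the telescoping relation $c_k = r_k - r_{k+1}$, while (8) is compared with a truncated power series in floating point.

```python
import math
from fractions import Fraction


def c_log(n: int) -> Fraction:
    """c_n for the logarithmic comb: c_0 = 1, c_n = 1/(n(n+1)(n+2)(n+3)) for n >= 1."""
    return Fraction(1) if n == 0 else Fraction(1, n * (n + 1) * (n + 2) * (n + 3))


def r_log(n: int) -> Fraction:
    """Closed form of the remainder r_n = sum_{k >= n} c_k, valid for n >= 1."""
    return Fraction(1, 3 * n * (n + 1) * (n + 2))


def S_closed(x: float) -> float:
    """Right-hand side of (8), for 0 < |x| < 1."""
    return (47 / 36 - 5 / (12 * x) + 1 / (6 * x**2)
            + (1 - x) ** 3 * math.log(1 - x) / (6 * x**3))


def S_series(x: float, terms: int = 200) -> float:
    """Truncated power series sum_{n < terms} c_n x^n."""
    return sum(float(c_log(n)) * x**n for n in range(terms))
```

With these helpers, `1 + r_log(1)` gives the exact fraction $19/18$, and `S_closed(0.5)` agrees with `S_series(0.5)` up to rounding error.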
It follows from (8) that $U$, $S$ and $R_a$ admit expansions near 1 of the forms
$$U(x) = \frac{1}{S(1)(1-x)} + \text{polynomial} - \frac{1}{6S(1)^2}(1-x)^2\log(1-x) + O\bigl((1-x)^2\bigr),$$
$$S(x) = \text{polynomial} + \frac{1}{6}(1-x)^3\log(1-x) + O\bigl((1-x)^3\bigr),$$
and
$$R_a(x) = \text{polynomial} + \frac{1}{6}(1-x)^3\log(1-x) + O\bigl((1-x)^3\bigr).$$
(The sign of the logarithmic term of $U$ is the one forced by $M(x) = S(1)U(x) - \frac{1}{1-x}$ and the expansion of $M$ given above.) Consequently,
$$N_{a,b}(x) = \frac{1}{6}\left(\frac{1}{r_a} + \frac{1}{r_b} - \frac{1}{S(1)}\right)(1-x)^2\log(1-x) + O\bigl((1-x)^2\bigr)$$
in a neighbourhood of 1 in the unit disc so that, by singularity analysis,
$$[x^n]N_{a,b}(x) = -\frac{1}{3}\left(\frac{1}{r_a} + \frac{1}{r_b} - \frac{1}{S(1)}\right)\frac{1}{n^3} + o\left(\frac{1}{n^3}\right).$$
Consequently (9) leads to
$$\psi(n-1,0^a,0^b) = [x^n]M_{a,b}(x) \sim \frac{1}{3}\left(\frac{S(1)}{r_a r_b} - \frac{1}{r_a} - \frac{1}{r_b} + \frac{1}{S(1)}\right)\frac{1}{n^3}$$
as $n$ tends to infinity, showing the mixing inequality and the non uniformity. The remaining cases ii), iv) and v) are of the same flavour. □

3.3 Mixing of the factorial infinite comb

Consider now the second Example in Section 2, that is the probabilized infinite comb defined by
$$\forall n \in \mathbb{N}, \quad c_n = \frac{1}{(n+1)!}.$$
With previous notations, one gets
$$S(x) = \frac{e^x - 1}{x} \quad\text{and}\quad U(x) = \frac{x}{(1-x)(e^x - 1)}.$$

Proposition 3.6 The VLMC defined by the factorial infinite comb has a uniform exponential mixing of the following form: there exists a positive constant $C$ such that for any $n \ge 1$ and for any finite words $A$ and $B$,
$$|\psi(n,A,B)| \le \frac{C}{(2\pi)^n}.$$

Proof.

i) First case of mixing in Proposition 3.2: $A = A'1$ and $B = 1B'$. Because of Proposition 3.2, the proof consists in computing the asymptotics of $[x^n]M(x)$. We make use of singularity analysis. The dominant singularities of
$$M(x) = \frac{S(x) - S(1)}{(x-1)S(x)}$$
are readily seen to be $2i\pi$ and $-2i\pi$, and
$$M(x) \underset{2i\pi}{\sim} \frac{1-e}{1-2i\pi}\cdot\frac{1}{1 - \frac{x}{2i\pi}}.$$
The behaviour of $M$ in a neighbourhood of $-2i\pi$ is obtained by complex conjugacy.
Singularity analysis via transfer theorem provides thus that
$$[x^n]M(x) \underset{n\to+\infty}{\sim} \frac{2(e-1)}{1+4\pi^2}\left(\frac{1}{2\pi}\right)^n\epsilon_n \quad\text{where}\quad \epsilon_n = \begin{cases} 1 & \text{if } n \text{ is even} \\ 2\pi & \text{if } n \text{ is odd.} \end{cases}$$

ii) Second case of mixing: $A = A'10^a$ and $B = 0^b1B'$. Because of Proposition 3.2, one has to compute $[x^n]M_{a,b}(x)$ with
$$M_{a,b}(x) := S(1)\frac{c_{a+b}}{c_a c_b}P_{a+b}(x) + \frac{1}{S(x)}\cdot\frac{1}{1-x}\Bigl[S(1)P_a(x)P_b(x) - S(x)\Bigr],$$
where $P_{a+b}$ is an entire function. In this last formula, the brackets contain an entire function that vanishes at 1 so that the dominant singularities of $M_{a,b}$ are again those of $S^{-1}$, namely $\pm2i\pi$. The expansion of $M_{a,b}(x)$ at $2i\pi$ writes thus
$$M_{a,b}(x) \underset{2i\pi}{\sim} \frac{-S(1)P_a(2i\pi)P_b(2i\pi)}{1-2i\pi}\cdot\frac{1}{1 - \frac{x}{2i\pi}}$$
which implies, by singularity analysis, that
$$[x^n]M_{a,b}(x) \underset{n\to+\infty}{\sim} 2\Re\left(\frac{1-e}{1-2i\pi}\cdot\frac{P_a(2i\pi)P_b(2i\pi)}{(2i\pi)^n}\right).$$
Besides, the remainder of the exponential series satisfies
$$\sum_{n\ge a}\frac{x^n}{n!} = \frac{x^a}{a!}\left(1 + \frac{x}{a} + O\left(\frac{1}{a}\right)\right) \tag{10}$$
when $a$ tends to infinity. Consequently, by Formula (6), $P_a(2i\pi)$ tends to $2i\pi$ as $a$ tends to infinity so that one gets a positive constant $C_1$ that does not depend on $a$ and $b$ such that for any $n \ge 1$,
$$|\psi(n,A,B)| \le \frac{C_1}{(2\pi)^n}.$$

iii) Third case of mixing: $A = 0^a$ and $B = 0^b$. This time, one has to compute $[x^n]M_{a,b}(x)$ with
$$M_{a,b}(x) := S(1)\frac{1}{r_a r_b}\sum_{n\ge1}r_{a+b+n-1}x^n + U(x)\left[S(1)\frac{R_a(x)R_b(x)}{r_a r_b x^{a+b-2}} - S(x)\right],$$
the first term being an entire function. Here again, the dominant singularities of $M_{a,b}$ are located at $\pm2i\pi$ and
$$M_{a,b}(x) \underset{2i\pi}{\sim} \frac{-S(1)R_a(2i\pi)R_b(2i\pi)}{(1-2i\pi)\,r_a r_b\,(2i\pi)^{a+b-2}}\cdot\frac{1}{1 - \frac{x}{2i\pi}}$$
which implies, by singularity analysis, that
$$\psi(n-1,A,B) \underset{n\to+\infty}{\sim} 2\Re\left(\frac{1-e}{1-2i\pi}\cdot\frac{R_a(2i\pi)R_b(2i\pi)}{r_a r_b\,(2i\pi)^{a+b-2}}\cdot\frac{1}{(2i\pi)^n}\right).$$
Once more, because of (10), this implies that there is a positive constant $C_2$ independent of $a$ and $b$ and such that for any $n \ge 1$,
$$|\psi(n,A,B)| \le \frac{C_2}{(2\pi)^n}.$$

iv) and v): both remaining cases of mixing that respectively correspond to words of the form $A = A'10^a$, $B = 0^b$ and $A = 0^a$, $B = 0^b1B'$ are of the same vein and lead to similar results. □

4 Height and saturation level of suffix tries

In this section, we consider a suffix trie process $(\mathcal{T}_n)_n$ associated with an infinite random word generated by an infinite comb. A precise definition of tries and suffix tries is given in Section 4.1. We are interested in the height and the saturation level of such a suffix trie.

Our method to study these two parameters uses a duality property à la Pittel developed in Section 4.2, together with a careful and explicit calculation of the generating function of the second occurrence of a word (in Section 4.3) which can be achieved for any infinite comb. These calculations are not so intricate because they are strongly related to the mixing coefficient and the mixing properties detailed in Section 3.

More specifically, we look at our two favourite examples, the logarithmic comb and the factorial comb. We prove in Section 4.5 that the height of the first one is not logarithmic but polynomial and in Section 4.6 that the saturation level of the second one is not logarithmic either but negligibly smaller. Remark that despite the very particular form of the comb in the wide family of variable length Markov models, the comb sources provide a spectrum of asymptotic behaviours for the suffix tries.

4.1 Suffix tries

Let $(Y_n)_{n\ge1}$ be an increasing sequence of sets. Each set $Y_n$ contains exactly $n$ infinite words. A trie process $(\mathcal{T}_n)_{n\ge1}$ is a planar tree increasing process associated with $(Y_n)_{n\ge1}$.
The trie $\mathcal{T}_n$ contains the words of $Y_n$ in its leaves. It is obtained by a sequential construction, inserting the words of $Y_n$ successively. At the beginning, $\mathcal{T}_1$ is the tree containing the root and the leaf $0\dots$ (resp. the leaf $1\dots$) if the word in $Y_1$ begins with 0 (resp. with 1). For $n \ge 2$, knowing the tree $\mathcal{T}_{n-1}$, the $n$-th word $m$ is inserted as follows. We go through the tree along the branch whose nodes are encoded by the successive prefixes of $m$; when the branch ends, if an internal node is reached, then the word is inserted at the free leaf, else we make the branch grow comparing the next letters of both words until they can be inserted in two different leaves. As one can clearly see on Figure 2, a trie is not a complete tree and the insertion of a word can make a branch grow by more than one level. Notice that an internal node exists within the trie if there are at least two words in the set starting by the prefix associated to this node. This indicates why the second occurrence of a word is prominent.

Figure 2: Last steps of the construction of a trie built from the set $(000\dots, 10\dots, 1101\dots, 001\dots, 01110\dots, 1100\dots, 01111\dots)$.

Let $m := a_1a_2a_3\dots$ be an infinite word on $\mathcal{A} = \{0,1\}$. The suffix trie $\mathcal{T}_n$ (with $n$ leaves) associated with $m$ is the trie built from the set of the first $n$ suffixes of $m$, that is
$$Y_n = \{m,\ a_2a_3\dots,\ a_3a_4\dots,\ \dots,\ a_na_{n+1}\dots\}.$$
For a given trie $\mathcal{T}_n$, we are mainly interested in the height $H_n$ which is the maximal depth of an internal node of $\mathcal{T}_n$ and the saturation level $\ell_n$ which is the maximal depth up to which all the internal nodes are present in $\mathcal{T}_n$. Formally, if $\partial\mathcal{T}_n$ denotes the set of leaves of $\mathcal{T}_n$,
$$H_n = \max_{u\in\mathcal{T}_n\setminus\partial\mathcal{T}_n}\{|u|\},$$
$$\ell_n = \max\bigl\{j \in \mathbb{N} \mid \#\{u \in \mathcal{T}_n\setminus\partial\mathcal{T}_n,\ |u| = j\} = 2^j\bigr\}.$$
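The two definitions above translate directly into a small computation. The Python sketch below is an illustration only, not the authors' construction: it does not build the tree explicitly but relies on the remark made above that a word encodes an internal node exactly when at least two of the $n$ suffixes start with it. The function name `height_and_saturation` is hypothetical, and the infinite word is replaced by a finite prefix, assumed long enough to separate all suffixes (an error is raised otherwise).

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def height_and_saturation(word: str, n: int) -> Tuple[int, int]:
    """Return (H_n, l_n) for the suffix trie of the first n suffixes of `word`.

    A word u encodes an internal node iff at least two of the n suffixes
    start with u, so internal nodes are found by grouping the start
    positions of the suffixes by common prefix, one depth at a time.
    """
    if n < 2:
        raise ValueError("need at least two suffixes to have an internal node")
    groups: Dict[str, List[int]] = {"": list(range(n))}
    depth, height, saturation, full = 0, 0, 0, True
    while True:
        # Internal nodes at the current depth: prefixes shared by >= 2 suffixes.
        internal = {u: pos for u, pos in groups.items() if len(pos) >= 2}
        if not internal:
            return height, saturation
        height = depth
        # The level is saturated when all 2^depth possible nodes are internal.
        full = full and len(internal) == 2 ** depth
        if full:
            saturation = depth
        # Split every internal node according to the next letter of each suffix.
        groups = defaultdict(list)
        for u, pos in internal.items():
            for i in pos:
                if i + depth >= len(word):
                    raise ValueError("finite prefix too short to separate suffixes")
                groups[u + word[i + depth]].append(i)
        depth += 1
```

Applied to the word $1001011001110\dots$ of Figure 3 with $n = 10$, that is `height_and_saturation("1001011001110", 10)`, the function returns `(4, 2)`, which matches the values $H_{10} = 4$ and $\ell_{10} = 2$ reported in the caption of that figure.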
See Figure 3 for an example.

Figure 3: Suffix trie $\mathcal{T}_{10}$ associated with the word $1001011001110\dots$. Here, $H_{10} = 4$ and $\ell_{10} = 2$.

4.2 Duality

Let $(U_n)_{n\ge1}$ be an infinite random word generated by some infinite comb and $(\mathcal{T}_n)_{n\ge1}$ be the associated suffix trie process. We denote by $\mathcal{R}$ the set of right-infinite words. Besides, we define hereunder two random variables having a key role in the proof of Theorem 4.8 and Theorem 4.9. This method goes back to Pittel [10].

Let $s \in \mathcal{R}$ be a deterministic infinite sequence and $s^{(k)}$ its prefix of length $k$. For $n \ge 1$,
$$X_n(s) := \begin{cases} 0 & \text{if } s^{(1)} \text{ is not in } \mathcal{T}_n \\ \max\{k \ge 1 \mid \text{the word } s^{(k)} \text{ is already in } \mathcal{T}_n\setminus\partial\mathcal{T}_n\} & \text{otherwise,} \end{cases}$$
$$T_k(s) := \min\{n \ge 1 \mid X_n(s) = k\},$$
where "$s^{(k)}$ is in $\mathcal{T}_n\setminus\partial\mathcal{T}_n$" stands for: there exists an internal node $v$ in $\mathcal{T}_n$ such that $s^{(k)}$ encodes $v$. For any $k \ge 1$, $T_k(s)$ denotes the number of leaves of the first tree "containing" $s^{(k)}$. See Figure 4 for an example.

Thus, the saturation level $\ell_n$ and the height $H_n$ can be described using $X_n(s)$:
$$\ell_n = \min_{s\in\mathcal{R}}X_n(s) \quad\text{and}\quad H_n = \max_{s\in\mathcal{R}}X_n(s). \tag{11}$$
Moreover, $X_n(s)$ and $T_k(s)$ are in duality in the following sense: for all positive integers $k$ and $n$, one has the equality of the events
$$\{X_n(s) \ge k\} = \{T_k(s) \le n\}. \tag{12}$$
The random variable $T_k(s)$ (if $k \ge 2$) also represents the waiting time of the second occurrence of the deterministic word $s^{(k)}$ in the random sequence $(U_n)_{n\ge1}$, i.e. one has to wait $T_k(s)$ for the source to create a prefix containing exactly two occurrences of $s^{(k)}$. More precisely, for $k \ge 2$, $T_k(s)$ can be rewritten as $T_k(s) = \min\{n \ge 1 \mid U_nU_{n+1}\dots U_{n+k-1} = s^{(k)}$ and $\exists!\, j < n$ such that $U_jU_{j+1}\dots$
U j + k − 1 = s ( k ) o . 19 C ´ enac, Chauvin, P accaut, Pouyanne: uncom m on suffix tries P S f r a g r e p la c e m e n t s 0 1 s ˆ s Figure 4: Example of suﬃx trie with n = 20 w ords. The saturation lev el is reac hed for an y seque nce having 1000 as preﬁx (in r ed); ℓ 20 = X 20 ( s ) = 3 and th us T 3 ( s ) 6 20. The heigh t (r elated to the maxim um of X 20 ) is realized for an y sequence of the form 11 0101 . . . (in blue) and H 20 = 6. Remark that the shortest branc h has length 4 whereas the saturat io n lev el ℓ n is equal to 3 . Notice that T k ( s ) denotes the b e gin ning o f the second o ccurrenc e of s ( k ) whereas in [2], τ (2)  s ( k )  denotes the end of the second o ccurrence of s ( k ) , so that τ (2)  s ( k )  = T k ( s ) + k . (13) More generally , in [2] , for an y r > 1, the random return times τ ( r ) ( w ) is deﬁned as the end of the r -th o ccurrence of w in the sequence ( U n ) n > 1 and the generating function of the τ ( r ) is calculated. W e go ov er these calculations in the sequel. 4.3 Return time generating functions Prop osition 4.7 L et k > 1 . L et also w = 10 k − 1 and τ (2) ( w ) b e the end of the se c ond o c curr enc e o f w in a se quenc e gener ate d by a c omb deﬁne d by ( c n ) n > 0 . L et S and U b e the or dina ry gen er ating functions d e ﬁ ne d in Se ction 3 . 1. The 20 C ´ enac, Chauvin, P accaut, Pouyanne: uncom m on suffix tries pr ob ability gener ating function of τ (2) ( w ) is Φ (2) w ( x ) = c 2 k − 1 x 2 k − 1  U ( x ) − 1  S (1)(1 − x )  1 + c k − 1 x k − 1 ( U ( x ) − 1)  2 . F urthermor e, as so on as P n > 1 n 2 c n < ∞ , the r andom variable τ (2) ( w ) is squar e- inte gr able and E ( τ (2) ( w )) = 2 S (1) c k − 1 + o  1 c k − 1  , V ar( τ (2) ( w )) = 2 S (1) 2 c 2 k − 1 + o  1 c 2 k − 1  . (14) Proo f. F or an y r > 1, let τ ( r ) ( w ) denote the end of the r -th o ccurrence of w in a random sequence generated by a comb and Φ ( r ) w its pro ba bilit y g enerating function. 
The reversed word of $c = \alpha_1 \dots \alpha_N$ will be denoted by the overline $\overline{c} := \alpha_N \dots \alpha_1$. We use a result of [2] that computes these generating functions in terms of stationary probabilities $q^{(n)}_c$. These probabilities measure the occurrence of a finite word after $n$ steps, conditioned to start from the word $c$. More precisely, for any finite words $u$ and $c$ and for any $n \ge 0$, let
\[
q^{(n)}_c(u) := \pi \bigl( U_{n - |u| + |c| + 1} \dots U_{n + |c|} = u \bigm| U_1 \dots U_{|c|} = c \bigr).
\]
It is shown in [2] that, for $|x| < 1$,
\[
\Phi^{(1)}_w(x) = \frac{x^k \pi(w)}{(1 - x) S_w(x)}
\]
and for $r \ge 1$,
\[
\Phi^{(r)}_w(x) = \Phi^{(1)}_w(x) \left( 1 - \frac{1}{S_w(x)} \right)^{r-1}
\]
where
\[
S_w(x) := C_w(x) + \sum_{n = k}^{\infty} q^{(n)}_{\overleftarrow{\mathrm{pref}}(w)}(w)\, x^n,
\]
\[
C_w(x) := 1 + \sum_{j = 1}^{k-1} \mathbf{1}_{\{ w_{j+1} \dots w_k = w_1 \dots w_{k-j} \}}\, q^{(j)}_{\overleftarrow{\mathrm{pref}}(w)}(w_{k-j+1} \dots w_k)\, x^j.
\]
In the particular case when $w = 10^{k-1}$, then $\overleftarrow{\mathrm{pref}}(w) = \overline{w} = 0^{k-1}1$ and $\pi(w) = \frac{c_{k-1}}{S(1)}$. Moreover, Definition (4) of the mixing coefficient and Proposition 3.2 i) imply successively that
\[
\begin{aligned}
q^{(n)}_{\overleftarrow{\mathrm{pref}}(w)}(w)
&= \pi \bigl( U_{n + k - |w| + 1} \dots U_{n + k} = w \bigm| U_1 \dots U_k = \overleftarrow{\mathrm{pref}}(w) \bigr) \\
&= \pi(w) \Bigl[ \psi \bigl( n - k, \overleftarrow{\mathrm{pref}}(w), w \bigr) + 1 \Bigr] \\
&= \pi(w)\, S(1)\, u_{n-k+1} \\
&= c_{k-1}\, u_{n-k+1}.
\end{aligned}
\]
This relation makes more explicit the link between return times and mixing. This leads to
\[
\sum_{n \ge k} q^{(n)}_{\overleftarrow{\mathrm{pref}}(w)}(w)\, x^n = c_{k-1} x^{k-1} \sum_{n \ge 1} u_n x^n = c_{k-1} x^{k-1} \bigl( U(x) - 1 \bigr).
\]
Furthermore, there is no auto-correlation structure inside $w$ so that $C_w(x) = 1$ and
\[
S_w(x) = 1 + c_{k-1} x^{k-1} \bigl( U(x) - 1 \bigr).
\]
This entails
\[
\Phi^{(1)}_w(x) = \frac{c_{k-1} x^k}{S(1)(1 - x) \bigl[ 1 + c_{k-1} x^{k-1} ( U(x) - 1 ) \bigr]}
\]
and
\[
\Phi^{(2)}_w(x) = \Phi^{(1)}_w(x) \left( 1 - \frac{1}{S_w(x)} \right) = \frac{c_{k-1}^2\, x^{2k-1} \bigl( U(x) - 1 \bigr)}{S(1)(1 - x) \bigl[ 1 + c_{k-1} x^{k-1} ( U(x) - 1 ) \bigr]^2},
\]
which is the announced result.
The assumption $\sum_{n \ge 1} n^2 c_n < \infty$ makes $U$ twice differentiable and elementary calculations lead to
\[
(\Phi^{(1)}_w)'(1) = \frac{S(1)}{c_{k-1}} - S(1) + 1 + \frac{S'(1)}{S(1)},
\qquad
(\Phi^{(2)}_w)'(1) = (\Phi^{(1)}_w)'(1) + \frac{S(1)}{c_{k-1}},
\]
\[
(\Phi^{(1)}_w)''(1) = \frac{2 S(1)^2}{c_{k-1}^2} + o\!\left(\frac{1}{c_{k-1}^2}\right)
\quad \text{and} \quad
(\Phi^{(2)}_w)''(1) = \frac{6 S(1)^2}{c_{k-1}^2} + o\!\left(\frac{1}{c_{k-1}^2}\right),
\]
and finally to (14). □

4.4 Logarithmic comb and factorial comb

Let $h_+$ and $h_-$ be the constants in $[0, +\infty]$ defined by
\[
h_+ := \lim_{n \to +\infty} \frac{1}{n} \max \left\{ \ln \frac{1}{\pi(w)} \right\}
\quad \text{and} \quad
h_- := \lim_{n \to +\infty} \frac{1}{n} \min \left\{ \ln \frac{1}{\pi(w)} \right\}, \tag{15}
\]
where the maximum and the minimum range over the words $w$ of length $n$ with $\pi(w) > 0$. In their papers, Pittel [10] and Szpankowski [12] only deal with the cases $h_+ < +\infty$ and $h_- > 0$, which amounts to saying that the probability of any word is exponentially decreasing with its length. Here, we focus on our two examples for which these assumptions are not fulfilled. More precisely, for the logarithmic infinite comb, (2) implies that $\pi(10^n)$ is of order $n^{-4}$, so that
\[
h_- \le \lim_{n \to +\infty} \frac{1}{n} \ln \frac{1}{\pi(10^{n-1})} = 4 \lim_{n \to +\infty} \frac{\ln n}{n} = 0.
\]
Besides, for the factorial infinite comb, $\pi(10^n)$ is of order $\frac{1}{(n+1)!}$ so that
\[
h_+ \ge \lim_{n \to +\infty} \frac{1}{n} \ln \frac{1}{\pi(10^{n-1})} = \lim_{n \to +\infty} \frac{\ln n!}{n} = +\infty.
\]
For these two models, the asymptotic behaviour of the lengths of the branches is not always logarithmic, as can be seen in the two following theorems, shown in Sections 4.5 and 4.6.

Theorem 4.8 (Height of the logarithmic infinite comb) Let $\mathcal{T}_n$ be the suffix trie built from the $n$ first suffixes of a sequence generated by a logarithmic infinite comb. Then, the height $H_n$ of $\mathcal{T}_n$ satisfies
\[
\forall \delta > 0, \quad \frac{H_n}{n^{\frac{1}{4} - \delta}} \xrightarrow[n \to \infty]{} +\infty \quad \text{in probability.}
\]
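The two limits computed before Theorem 4.8 can be observed numerically. The Python sketch below is a rough illustration and is not taken from the paper: it replaces $\pi(10^{n-1})$ by its order of magnitude only ($n^{-4}$ for the logarithmic comb, $1/n!$ for the factorial comb, multiplicative constants dropped), and the function names are hypothetical.

```python
import math


def rate_logarithmic(n):
    """(1/n) * ln(1 / pi(10^(n-1))) with pi(10^(n-1)) replaced by n**-4.

    Assumption: only the order of magnitude quoted in the text is used.
    """
    return 4.0 * math.log(n) / n


def rate_factorial(n):
    """(1/n) * ln(1 / pi(10^(n-1))) with pi(10^(n-1)) replaced by 1/n!.

    math.lgamma(n + 1) equals ln(n!), which avoids huge integers.
    """
    return math.lgamma(n + 1) / n


if __name__ == "__main__":
    for n in (10, 100, 1000, 10**6):
        print(n, rate_logarithmic(n), rate_factorial(n))
```

The first column of rates decreases to 0, which is the reason why $h_- = 0$ for the logarithmic comb, while the second one grows like $\ln n$, which is the reason why $h_+ = +\infty$ for the factorial comb.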
Theorem 4.9 (Saturation level of the factorial infinite comb) Let $\mathcal{T}_n$ be the suffix trie built from the $n$ first suffixes of the sequence generated by a factorial infinite comb. Then, the saturation level $\ell_n$ of $\mathcal{T}_n$ satisfies: for any $\delta > 1$, almost surely, when $n$ tends to infinity,
\[
\ell_n \in o\!\left( \frac{\log n}{(\log \log n)^{\delta}} \right).
\]

The dynamic asymptotics of the height and of the saturation level can be visualized on Figure 5. The number $n$ of leaves of the suffix trie is put on the $x$-axis while heights or saturation levels of tries are put on the $y$-axis. Plain lines represent a logarithmic comb while long dashed lines are those of a factorial comb (mean values of 25 simulations). Short dashed lines represent a third infinite comb defined by the data
\[
c_n = \frac{1}{3} \prod_{k=1}^{n-1} \left( \frac{1}{3} + \frac{1}{(1 + k)^2} \right) \quad \text{for } n \ge 1.
\]
Such a process has a uniform exponential mixing, a finite $h_+$ and a positive $h_-$, as can be elementarily checked. As a consequence, it satisfies all assumptions of Pittel [10] and Szpankowski [12], implying that the height and the saturation level are both of order $\log n$. Such assumptions will always be fulfilled as soon as the data $(c_n)_n$ satisfy $\lim_n c_n^{1/n} < 1$; the proof of this result is left to the reader.

One can notice the height of the logarithmic comb, which grows as a power of $n$. The saturation level of the factorial comb, negligible with respect to $\log n$, is more difficult to highlight because of the very slow growth of logarithms. These asymptotic behaviours, all coming from the same model, the infinite comb, stress its surprising richness.

4.5 Height for the logarithmic comb

In this subsection, we prove Theorem 4.8.

Consider the right-infinite sequence $s = 10^{\infty}$. Then, $T_k(s)$ is the second occurrence time of $w = 10^{k-1}$. It is a nondecreasing (random) function of $k$.
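The second occurrence time and the duality (12) can be illustrated on a finite realization. The Python sketch below is only an illustration under our own assumptions: the random sequence $(U_n)$ is replaced by a fixed finite string, positions are 1-indexed as in the text, occurrences may overlap, $s$ is given through a finite prefix, and the helper names `second_occurrence_start` and `x_n` are hypothetical.

```python
def second_occurrence_start(text, word):
    """1-indexed position where the second occurrence of `word` begins
    in `text`, or None if `word` occurs fewer than two times."""
    first = text.find(word)
    if first == -1:
        return None
    second = text.find(word, first + 1)  # overlapping occurrences allowed
    return None if second == -1 else second + 1


def x_n(text, n, s_prefix):
    """Length of the longest prefix of `s_prefix` that starts at least two
    of the first n suffixes of `text` (0 if even the first letter fails)."""
    k = 0
    while k < len(s_prefix):
        prefix = s_prefix[:k + 1]
        starts = sum(1 for i in range(n) if text.startswith(prefix, i))
        if starts < 2:
            break
        k += 1
    return k


if __name__ == "__main__":
    text = "1001011001110"
    for k in (2, 3):
        print(k, second_occurrence_start(text, "1" + "0" * (k - 1)))
    print([x_n(text, n, "1000") for n in range(1, 11)])
```

With the word of Figure 3 and $s = 10^{\infty}$, one finds $T_2(s) = 4$ and $T_3(s) = 7$, and $X_n(s) \ge k$ holds exactly when $T_k(s) \le n$ for $n \le 10$ and $k \in \{2, 3\}$.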
Moreover, $X_n(s)$ is the maximum of all $k$ such that $s^{(k)} \in \mathcal{T}_n$. It is nondecreasing in $n$. So, by definition of $X_n(s)$ and $T_k(s)$, the duality can be written
\[
\forall n,\ \forall \omega,\ \exists k_n, \quad k_n \le X_n(s) < k_n + 1 \quad \text{and} \quad T_{k_n}(s) \le n < T_{k_n + 1}(s). \tag{16}
\]
Claim:
\[
\lim_{n \to +\infty} X_n(s) = +\infty \quad \text{a.s.} \tag{17}
\]
Indeed, if $X_n(s)$ were bounded above, by $K$ say, then take $w = 10^K$ and consider $T_{K+1}(s)$, which is the time of the second occurrence of $10^K$. The choice of the $c_n$ in the definition of the logarithmic comb implies the convergence of the series $\sum_n n^2 c_n$. Thus (14) holds and $\mathbb{E}[T_{K+1}(s)] < \infty$, so that $T_{K+1}(s)$ is almost surely finite. This means that for $n \ge T_{K+1}(s)$, the word $10^K$ has been seen twice, leading to $X_n(s) \ge K + 1$, which is a contradiction.

We make use of the following lemma, which is proven hereunder.

Lemma 4.10 For $s = 10^{\infty}$,
\[
\forall \eta > 0, \quad \frac{T_k(s)}{k^{4 + \eta}} \xrightarrow[k \to \infty]{} 0 \quad \text{in probability,}
\]
and
\[
\forall \eta > \frac{1}{2}, \quad \frac{T_k(s)}{k^{4 + \eta}} \xrightarrow[k \to \infty]{} 0 \quad \text{a.s.} \tag{18}
\]

With notations (16), because of (17), the sequence $(k_n)$ tends to infinity, so that $(T_{k_n}(s))$ is a subsequence of $(T_k(s))$. Thus, (18) implies that
\[
\forall \eta > \frac{1}{2}, \quad \frac{T_{k_n}}{k_n^{4 + \eta}} \xrightarrow[n \to \infty]{} 0 \quad \text{a.s.}
\qquad \text{and} \qquad
\forall \eta > 0, \quad \frac{T_{k_n}}{k_n^{4 + \eta}} \xrightarrow[n \to \infty]{P} 0.
\]
Using duality (16) again leads to
\[
\forall \eta > 0, \quad \frac{X_n(s)}{n^{1/(4 + \eta)}} \xrightarrow[n \to \infty]{P} +\infty.
\]
In other words,
\[
\forall \delta > 0, \quad \frac{X_n(s)}{n^{\frac{1}{4} - \delta}} \xrightarrow[n \to \infty]{P} +\infty
\]
so that, since the height of the suffix trie is larger than $X_n(s)$,
\[
\forall \delta > 0, \quad \frac{H_n}{n^{\frac{1}{4} - \delta}} \xrightarrow[n \to \infty]{P} +\infty.
\]
This ends the proof of Theorem 4.8. □

Proof of Lemma 4.10. Combining (13) and (14) shows that
\[
\mathbb{E}(T_k(s)) = \mathbb{E}(\tau^{(2)}(w)) - k = \frac{19}{9} k^4 + o(k^4) \tag{19}
\]
and
\[
\operatorname{Var}(T_k(s)) = \operatorname{Var}(\tau^{(2)}(w)) = \frac{361}{162} k^8 + o(k^8). \tag{20}
\]
For all $\eta > 0$, write
\[
\frac{T_k(s)}{k^{4 + \eta}} = \frac{T_k(s) - \mathbb{E}(T_k(s))}{k^{4 + \eta}} + \frac{\mathbb{E}(T_k(s))}{k^{4 + \eta}}.
\]
The deterministic second term on the right-hand side goes to 0 with $k$ thanks to (19), so that we focus on the term $\frac{T_k(s) - \mathbb{E}(T_k(s))}{k^{4 + \eta}}$. For any $\varepsilon > 0$, because of the Bienaymé-Tchebychev inequality,
\[
\mathbb{P} \left( \frac{| T_k(s) - \mathbb{E}[T_k(s)] |}{k^{4 + \eta}} > \varepsilon \right) \le \frac{\operatorname{Var}(T_k(s))}{\varepsilon^2 k^{8 + 2\eta}} = O\!\left( \frac{1}{k^{2\eta}} \right).
\]
This shows the convergence in probability in Lemma 4.10. Moreover, the Borel-Cantelli Lemma ensures the almost sure convergence as soon as $\eta > \frac{1}{2}$. □

Remark 4.11 Notice that our proof actually shows that the convergence to $+\infty$ in Theorem 4.8 is valid a.s. (and not only in probability) as soon as $\delta > \frac{1}{36}$.

4.6 Saturation level for the factorial comb

In this subsection, we prove Theorem 4.9.

Consider the probabilized infinite factorial comb defined in Section 2 by
\[
\forall n \in \mathbb{N}, \quad c_n = \frac{1}{(n + 1)!}.
\]
The proof hereunder actually shows that $\left( \frac{\ell_n \log \log n}{\log n} \right)_n$ is an almost surely bounded sequence, which implies the result. Recall that $\mathcal{R}$ denotes the set of all right-infinite sequences. By characterization of the saturation level as a function of $X_n$ (see (11)), $\mathbb{P}(\ell_n \le k) = \mathbb{P}(\exists s \in \mathcal{R},\ X_n(s) \le k)$ for all positive integers $n, k$. Duality formula (12) then provides
\[
\mathbb{P}(\ell_n \le k) = \mathbb{P}(\exists s \in \mathcal{R},\ T_k(s) \ge n) \ge \mathbb{P}(T_k(\tilde{s}) \ge n)
\]
where $\tilde{s}$ denotes any infinite word having $10^{k-1}$ as a prefix. Markov inequality implies
\[
\forall x \in\, ]0, 1[, \quad \mathbb{P}(\ell_n \ge k + 1) \le \mathbb{P}\bigl( \tau^{(2)}(10^{k-1}) < n + k \bigr) \le \frac{\Phi^{(2)}_{10^{k-1}}(x)}{x^{n + k}} \tag{21}
\]
where $\Phi^{(2)}_{10^{k-1}}(x)$ denotes as above the generating function of the rank of the final letter of the second occurrence of $10^{k-1}$ in the infinite random word $(U_n)_{n \ge 1}$. The simple form of the factorial comb leads to the explicit expression $U(x) = \frac{x}{(1 - x)(e^x - 1)}$ and, after computation,
\[
\Phi^{(2)}_{10^{k-1}}(x) = \frac{e^x - 1}{e - 1} \cdot \frac{x^{2k-1} \bigl[ 1 - e^x (1 - x) \bigr]}{\Bigl[ k!\, (e^x - 1)(1 - x) + x^{k-1} \bigl( 1 - e^x (1 - x) \bigr) \Bigr]^2}. \tag{22}
\]
In particular, applying Formula (22) with $n = (k - 1)!$ and $x = 1 - \frac{1}{(k - 1)!}$ implies that for any $k \ge 1$,
\[
\mathbb{P}\bigl( \ell_{(k-1)!} \ge k + 1 \bigr) \le \frac{\left( 1 - \frac{1}{(k-1)!} \right)^{2k-1}}{\left[ k!\, (e - 1) \frac{1}{(k-1)!} \right]^2} \cdot \frac{1}{\left( 1 - \frac{1}{(k-1)!} \right)^{(k-1)! + k}}.
\]
Consequently, $\mathbb{P}\bigl( \ell_{(k-1)!} \ge k + 1 \bigr) = O(k^{-2})$ is the general term of a convergent series. Thanks to the Borel-Cantelli Lemma, one gets almost surely
\[
\limsup_{n \to +\infty} \frac{\ell_{n!}}{n} \le 1.
\]
Let $\Gamma^{-1}$ denote the inverse of Euler's Gamma function, defined and increasing on the real interval $[2, +\infty[$. If $n$ and $k$ are integers such that $(k + 1)! \le n \le (k + 2)!$, then
\[
\frac{\ell_n}{\Gamma^{-1}(n)} \le \frac{\ell_{(k+2)!}}{\Gamma^{-1}((k + 1)!)} = \frac{\ell_{(k+2)!}}{k + 2},
\]
which implies that, almost surely,
\[
\limsup_{n \to \infty} \frac{\ell_n}{\Gamma^{-1}(n)} \le 1.
\]
Inverting Stirling's formula, namely
\[
\Gamma(x) = \sqrt{\frac{2\pi}{x}}\, e^{x \log x - x} \left( 1 + O\!\left( \frac{1}{x} \right) \right)
\]
when $x$ goes to infinity, leads to the equivalent
\[
\Gamma^{-1}(x) \underset{+\infty}{\sim} \frac{\log x}{\log \log x},
\]
which implies the result. □

Acknowledgements

The authors are very grateful to Eng. Maxence Guesdon for providing simulations with great talent and infinite patience. They would also like to thank all the people managing two very important tools for French mathematicians: first the Institut Henri Poincaré, where a large part of this work was done, and second Mathrice, which provides a large number of services.

References

[1] A. Blumer, A. Ehrenfeucht and D. Haussler. Average sizes of suffix trees and dawgs. Discrete Appl. Math., 24:37–45, 1989.
[2] P. Cénac, B. Chauvin, F. Paccaut, and N. Pouyanne. Context trees, variable length Markov chains and dynamical sources. Séminaire de Probabilités, 2011. arXiv:1007.2986.
[3] J. Clément, P. Flajolet, and B. Vallée.
Dynamical sources in information theory: a general analysis of trie structures. Algorithmica, 29:307–369, 2001.
[4] L. Devroye, W. Szpankowski, and B. Rais. A note on the height of suffix trees. SIAM J. Comput., 21(1):48–53, 1992.
[5] P. Doukhan. Mixing: properties and examples. Lecture Notes in Stat. 85. Springer-Verlag, 1994.
[6] J. Fayolle. Compression de données sans perte et combinatoire analytique. PhD thesis, Université Paris VI, 2006.
[7] P. Flajolet and R. Sedgewick. Analytic Combinatorics. Cambridge University Press, Cambridge, 2009.
[8] A. Galves and E. Löcherbach. Stochastic chains with memory of variable length. TICSP Series, 38:117–133, 2008.
[9] S. Isola. Renewal sequences and intermittency. J. Statist. Phys., 97(1-2):263–280, 1999.
[10] B. Pittel. Asymptotic growth of a class of random trees. Annals Probab., 13:414–427, 1985.
[11] J. Rissanen. A universal data compression system. IEEE Trans. Inform. Theory, 29(5):656–664, 1983.
[12] W. Szpankowski. Asymptotic properties of data compression and suffix trees. IEEE Trans. Information Theory, 39(5):1647–1659, 1993.

Figure 5: Respective heights and saturation levels for a logarithmic comb (plain lines), a factorial comb (long dashed lines) and a $\log n$-comb (short dashed lines).
