Universal Coding on Infinite Alphabets: Exponentially Decreasing Envelopes


Authors: Dominique Bontemps (LM-Orsay)

Abstract—This paper deals with the problem of universal lossless coding on a countably infinite alphabet. It focuses on classes of sources defined by an envelope condition on the marginal distribution, namely exponentially decreasing envelope classes with exponent $\alpha$. The minimax redundancy of exponentially decreasing envelope classes is proved to be equivalent to $\frac{1}{4\alpha \log e} \log^2 n$. Then, an adaptive algorithm is proposed, whose maximum redundancy is equivalent to the minimax redundancy.

Index Terms—Data compression, universal coding, infinite countable alphabets, redundancy, Bayes mixture, adaptive compression.

I. INTRODUCTION

Compression of data is broadly used in our daily life: from the movies we watch to the office documents we produce. In this article, we are interested in lossless data compression on an unknown alphabet. This has applications in areas such as language modeling or lossless multimedia codecs.

First, we briefly present the problem of data compression. More details are available in general textbooks, like [1]. Then we give a short review of previous results, in which we situate the topic of this article, exponentially decreasing envelope classes, and we announce our results.

A. Lossless data compression

Consider a finite or countably infinite alphabet $\mathcal{X}$. A source on $\mathcal{X}$ is a probability distribution $P$ on the set $\mathcal{X}^{\mathbb{N}}$ of infinite sequences of symbols from $\mathcal{X}$. Its marginal distributions are denoted by $P^n$, $n \ge 1$ (for $n = 1$, we simply write $P$). The goal of lossless data compression is to encode a sequence of symbols $X_{1:n}$, generated according to $P^n$, into a sequence of bits that is as short as possible. The algorithm has to be uniquely decodable.
The binary entropy $H(P^n) = \mathbb{E}_{P^n}[-\log_2 P^n(X_{1:n})]$ is known to be a lower bound on the expected codelength of $X_{1:n}$. From now on, $\log$ denotes the logarithm to base 2, while $\ln$ denotes the natural logarithm. Since arithmetic coding based on $P^n$ encodes a message $x_{1:n}$ with $\lceil -\log P^n(x_{1:n}) \rceil + 1$ bits, this lower bound can be achieved within two bits. The expected redundancy then measures the mean number of extra bits, in addition to the entropy, that a coding strategy uses to encode $X_{1:n}$. In the sequel, we use the word redundancy instead of expected redundancy.

(D. Bontemps holds a PhD fellowship at Laboratoire de Mathématiques, Univ. Paris-Sud 11, Orsay, 91405 France.)

Furthermore, together with the Kraft-McMillan inequality, arithmetic coding provides an almost perfect correspondence between coding algorithms and probability distributions on $\mathcal{X}^n$. In this setting, if an algorithm is associated with the probability distribution $Q^n$, its expected redundancy reduces to the Kullback-Leibler divergence between $P^n$ and $Q^n$:
$$D(P^n; Q^n) = \mathbb{E}_{P^n}\left[ \log \frac{P^n(X_{1:n})}{Q^n(X_{1:n})} \right].$$
We call this quantity the (expected) redundancy of the distribution $Q^n$ (with respect to $P^n$).

Unfortunately, the true statistics of the source are not known in general, but $P^n$ is supposed to belong to some large class of sources $\Lambda$ (for instance, the class of all iid sources, or the class of Markov sources). In this paper, the maximum redundancy
$$R_n(Q^n; \Lambda) = \sup_{P \in \Lambda} R_n(Q^n; P^n)$$
measures how well a coding probability $Q^n$ behaves on an entire class $\Lambda$. With this point of view, the best coding probability is a minimax coding probability, which achieves the minimax redundancy
$$R_n(\Lambda) = \inf_{Q^n} R_n(Q^n; \Lambda).$$
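As a toy illustration of this correspondence (not taken from the paper), the redundancy of a mismatched coding distribution can be computed directly as a Kullback-Leibler divergence; the distributions below are hypothetical.

```python
import math

def kl_divergence_bits(p, q):
    """Kullback-Leibler divergence D(P; Q) in bits, for distributions given
    as dicts mapping symbols to probabilities (every symbol with P-mass
    must also have Q-mass)."""
    return sum(pk * math.log2(pk / q[k]) for k, pk in p.items() if pk > 0)

# Hypothetical source P and coding distribution Q on the alphabet {1, 2, 3}.
P = {1: 0.5, 2: 0.25, 3: 0.25}
Q = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}  # a uniform coder, mismatched to P

redundancy = kl_divergence_bits(P, Q)  # average extra bits per symbol
```

Coding with this uniform $Q$ instead of $P$ costs about $0.085$ extra bit per symbol; coding with $P$ itself would cost none.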
Another way to measure the ability of a class of sources to be efficiently encoded is the Bayes redundancy
$$R_{n,\mu}(\Lambda) = \inf_{Q^n} \int_\Lambda R_n(Q^n; P^n)\, d\mu(P),$$
where $\mu$ is a prior distribution on $\Lambda$, endowed with the topology of weak convergence and the Borel $\sigma$-field. Only one coding strategy achieves the Bayes redundancy: the Bayes mixture
$$M_{n,\mu}(x_{1:n}) = \int_\Lambda P^n(x_{1:n})\, d\mu(P).$$

When $\Lambda$ is a class of iid sources on the set $\mathcal{X} = \mathbb{N}^* = \mathbb{N} \setminus \{0\}$, there is a natural parametrization of $\Lambda$ by $P_\theta(j) = \theta_j$, with $\theta = (\theta_1, \theta_2, \ldots) \in \Theta_\Lambda$. $\Theta_\Lambda$ is then a subset of
$$\Theta = \Big\{ \theta = (\theta_1, \theta_2, \ldots) \in [0,1]^{\mathbb{N}} : \sum_{i \ge 1} \theta_i = 1 \Big\},$$
and it is endowed with the topology of pointwise convergence. In this case we regard $\mu$ as a prior on $\Theta_\Lambda$.

Minimax redundancy and Bayes redundancy are linked by an important relation [2], [3]; it is stated here in the context of iid sources on a finite or countably infinite alphabet, but Haussler [4] has shown that it can be generalized to all classes of stationary ergodic processes on a complete separable metric space.

Theorem 1: Let $\Lambda$ be a class of iid sources, such that the parameter set $\Theta_\Lambda$ is a measurable subset of $\Theta$. Let $n \ge 1$. Then
$$R_n(\Lambda) = \sup_\mu R_{n,\mu}(\Lambda),$$
where the supremum is taken over all (Borel) probability measures on $\Theta_\Lambda$.

The quantity $\sup_\mu R_{n,\mu}(\Lambda)$ is called the maximin redundancy. A prior whose Bayes redundancy equals the maximin redundancy is said to be maximin, or least favorable. Theorem 1 says that maximin redundancy and minimax redundancy coincide. It provides a tool to compute the minimax redundancy.

Before discussing known results, let us mention two other notions. From an asymptotic point of view, a sequence of coding probabilities $(Q^n)_{n \ge 1}$ is said to be weakly universal if the per-symbol redundancy tends to 0 on $\Lambda$:
$$\sup_{P \in \Lambda} \lim_{n \to \infty} \frac{1}{n} D(P^n; Q^n) = 0.$$
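The Bayes mixture $M_{n,\mu}$ defined above can be made concrete on the simplest possible case: iid Bernoulli sources on a binary alphabet (not the countable alphabet studied in this paper) with a uniform prior on the parameter, where the mixture has a closed form. This is a standard textbook illustration, not part of the paper.

```python
from itertools import product
from math import comb

def bayes_mixture_uniform(x):
    """Bayes mixture probability M_{n,mu}(x) of a binary sequence x under a
    uniform prior mu on the Bernoulli parameter theta: integrating
    theta^k (1 - theta)^(n - k) over [0, 1] gives k!(n-k)!/(n+1)!,
    i.e. 1 / ((n + 1) * C(n, k)) where k is the number of ones."""
    n, k = len(x), sum(x)
    return 1.0 / ((n + 1) * comb(n, k))

# The mixture is a genuine coding distribution: masses sum to 1 over {0,1}^n.
total = sum(bayes_mixture_uniform(x) for x in product((0, 1), repeat=4))
```

Every sequence with the same number of ones receives the same mass, as the exchangeability of the mixture requires.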
Instead of the expected redundancy, many authors consider individual sequences. In this case, the minimax regret
$$R_n^*(\Lambda) = \inf_{Q^n} \sup_{P \in \Lambda} \sup_{x_{1:n} \in \mathcal{X}^n} \log \frac{P^n(x_{1:n})}{Q^n(x_{1:n})}$$
plays the role that the minimax redundancy plays with the expected redundancy.

B. Exponentially decreasing envelope classes

In the case of a finite alphabet of size $k$, many classes of sources have been studied in the literature, for which estimates of the redundancy have been provided. In particular we have the class of all iid sources (see [5]–[10], and references therein), whose minimax redundancy is
$$\frac{k-1}{2} \log \frac{n}{2\pi e} + \log \frac{\Gamma(1/2)^k}{\Gamma(k/2)} + o(1).$$
This class can be seen as a particular case of a $(k-1)$-dimensional class of iid sources on a (possibly) bigger alphabet, for which a similar result holds under certain conditions (see [11]–[13]). Similar results are also available for classes of Markov processes and finite memory tree sources on a finite alphabet (see [5], [14]–[16]), and for $k$-dimensional classes of even non-iid sources on an arbitrary alphabet (see [17]).

The results become less precise when one considers infinite dimensional classes on a finite alphabet. A typical example is the class of renewal processes, for which no equivalent of the expected redundancy is known, but which is lower and upper bounded by a constant times $\sqrt{n}$ (see [18], [19]). Finally, it is well known that the class of stationary ergodic sources on a finite alphabet is weakly universal (see [1]). However, Shields [20] showed that this class does not admit non-trivial universal redundancy rates.

In the case of a countably infinite alphabet, the situation is significantly different. Even the class of all iid sources is not weakly universal (see [21], [22]).
Kieffer characterized weakly universal classes in [21] (see also [22], [23]):

Proposition 1: A class $\Lambda$ of stationary sources on $\mathbb{N}^*$ is weakly universal if and only if there exists a probability distribution $Q$ on $\mathbb{N}^*$ such that for every $P \in \Lambda$, $D(P; Q) < \infty$.

In the literature, we find two main ways to deal with infinite alphabets. The first one [24]–[32] separates the message into two parts: a description of the symbols appearing in the message, and the pattern they form. Then the compression of patterns is studied. A second approach [23], [33]–[36] studies collections of sources satisfying Kieffer's condition, and proposes compression algorithms for these classes. A result from [36] points in this direction:

Proposition 2: Let $\Lambda$ be a class of iid sources over $\mathbb{N}^*$. Let the envelope function $f$ be defined by $f(x) = \sup_{P \in \Lambda} P(x)$. Then the minimax regret satisfies
$$R_n^*(\Lambda) < \infty \iff \sum_{x \in \mathbb{N}^*} f(x) < \infty.$$

It is therefore quite natural to consider classes of iid sources with envelope conditions on the marginal distribution. In this article we study specific classes of iid sources introduced by [36], called exponentially decreasing envelope classes.

Definition 1: Let $C$ and $\alpha$ be positive numbers satisfying $C > e^{2\alpha}$. The exponentially decreasing envelope class $\Lambda_{C e^{-\alpha \cdot}}$ is the class of sources defined by
$$\Lambda_{C e^{-\alpha \cdot}} = \{ P : \forall k \ge 1,\ P(k) \le C e^{-\alpha k},\ \text{and } P \text{ is stationary and memoryless} \}.$$

The first condition mainly concerns the tail of the distribution of $X_1$; it means that large symbols must be rare enough. It does not mean that the distribution is geometric: if $C$ is big enough, many other distributions are possible. Furthermore, we will see that the exact value of $C$, unlike that of $\alpha$, does not significantly change the minimax redundancy.
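The envelope condition of Definition 1 is easy to test numerically. The sketch below (with illustrative parameters, not from the paper) checks membership for a geometric-type distribution dominated by the envelope, and for a distribution that puts too much mass on a large symbol.

```python
import math

def in_envelope_class(p, C, alpha, tol=1e-12):
    """Check the envelope condition of Definition 1: P(k) <= C e^{-alpha k}
    for every symbol k carrying mass (p is a dict on the positive integers)."""
    return all(pk <= C * math.exp(-alpha * k) + tol for k, pk in p.items())

# Illustrative parameters with C > e^{2 alpha}, as Definition 1 requires.
C, alpha = 10.0, 0.5

# A geometric-type distribution dominated by the envelope: with
# q = e^{-alpha}/2, P(k) = (1-q) q^{k-1} <= (1-q) e^{alpha} e^{-alpha k}.
q = math.exp(-alpha) / 2
geom = {k: (1 - q) * q ** (k - 1) for k in range(1, 200)}

# A distribution violating the condition: 0.5 > 10 e^{-10} at symbol 20.
bad = {1: 0.5, 20: 0.5}

member = in_envelope_class(geom, C, alpha)   # True
violator = in_envelope_class(bad, C, alpha)  # False
```

This also illustrates the remark above: many non-geometric distributions fit under the envelope once $C$ is large.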
Since this paper deals only with exponentially decreasing envelope classes, we simplify the notations $R_n(Q^n; \Lambda_{C e^{-\alpha \cdot}})$, $R_n(\Lambda_{C e^{-\alpha \cdot}})$, and $R_{n,\mu}(\Lambda_{C e^{-\alpha \cdot}})$ into $R_n(Q^n; C, \alpha)$, $R_n(C, \alpha)$, and $R_{n,\mu}(C, \alpha)$ respectively. The subset of $\Theta$ corresponding to $\Lambda_{C e^{-\alpha \cdot}}$ is denoted by
$$\Theta_{C,\alpha} = \Big\{ \theta = (\theta_1, \theta_2, \ldots) \in [0,1]^{\mathbb{N}} : \sum_{i \ge 1} \theta_i = 1 \text{ and } \forall i \ge 1,\ \theta_i \le C e^{-\alpha i} \Big\}. \quad (1)$$

We present two main results about these classes. In Section II we calculate the minimax redundancy of exponentially decreasing envelope classes, and we find that it is equivalent to $\frac{1}{4\alpha \log e} \log^2 n$ as $n$ tends to infinity. This rate is interesting for two main reasons. To our knowledge, exponentially decreasing envelope classes are the first family of classes on an infinite alphabet for which an equivalent of the minimax redundancy is known. Moreover, even the rate is new: until now only rates in $\log n$ or $\sqrt{n}$ had been obtained.

Once the minimax redundancy of a class of sources is known, we are interested in finding a minimax coding algorithm. Section III proposes a new adaptive coding algorithm, and we show that its maximum redundancy is equivalent to the minimax redundancy of exponentially decreasing envelope classes. Finally, the Appendix contains some proofs and some auxiliary results used in the main analysis.

II. MINIMAX REDUNDANCY

In this section we state our main result. Theorem 2 below gives an equivalent of the minimax redundancy of exponentially decreasing envelope classes. To obtain it, we use a result due to Haussler and Opper [37].

Theorem 2: Let $C$ and $\alpha$ be positive numbers satisfying $C > e^{2\alpha}$. The minimax redundancy of the exponentially decreasing envelope class $\Lambda_{C e^{-\alpha \cdot}}$ satisfies
$$R_n(C, \alpha) \underset{n \to \infty}{\sim} \frac{1}{4\alpha \log e} \log^2 n.$$

Theorem 2 improves on a previous result of [36, Theorem 7].
In that article the following bounds on the minimax redundancy of exponentially decreasing envelope classes are given:
$$\frac{1}{8\alpha \log e} \log^2 n\, (1 + o(1)) \le R_n(C, \alpha) \le \frac{1}{2\alpha \log e} \log^2 n + O(1).$$

In Subsection II-A we outline the work done in [37], and then we use it in Subsection II-B to prove Theorem 2. Finally, we discuss in Subsection II-C the adaptation of this method to other envelope classes.

A. From the metric entropy to the minimax redundancy

To study the redundancy of a class of sources, [37] considers the Hellinger distance between the first marginal distributions of each source. Bounds on the minimax redundancy are provided in terms of the metric entropy of the set of first marginal distributions, with respect to the Hellinger distance. As a consequence, this method applies only to iid sources. However, it is very efficient in the case of exponentially decreasing envelope classes.

First, we need to define the Hellinger distance and the metric entropy. In the case of sources on a countably infinite alphabet, the Hellinger distance can be defined in the following way:

Definition 2: Let $P$ and $Q$ be two probability distributions on $\mathbb{N}^*$. The Hellinger distance between $P$ and $Q$ is defined by
$$h(P, Q) = \sqrt{\sum_{k \ge 1} \left( \sqrt{P(k)} - \sqrt{Q(k)} \right)^2}.$$
A related metric can be defined on the parameter set $\Theta$:
$$d(\theta, \theta') = h(P_\theta, P_{\theta'}) = \sqrt{\sum_{k \ge 1} \left( \sqrt{\theta_k} - \sqrt{\theta'_k} \right)^2}.$$

From a metric we can define the metric entropy. We first need to define a few quantities.

Definition 3: Let $S$ be a subset of $\Theta$, and let $\epsilon$ be a positive number.
1) We denote by $D_\epsilon(S, d)$ the cardinality of the smallest finite partition of $S$ into sets of diameter at most $\epsilon$, or we set $D_\epsilon(S, d) = \infty$ if no such finite partition exists.
2) The metric entropy of $(S, d)$ is defined by $H_\epsilon(S, d) = \ln D_\epsilon(S, d)$.
3) An $\epsilon$-cover of $S$ is a subset $A \subset S$ such that, for all $x$ in $S$, there is an element $y$ of $A$ with $d(x, y) < \epsilon$. The covering number $N_\epsilon(S, d)$ is the cardinality of the smallest finite $\epsilon$-cover of $S$, or we set $N_\epsilon(S, d) = \infty$ if no finite $\epsilon$-cover exists.
4) An $\epsilon$-separated subset of $S$ is a subset $A \subset S$ such that, for all distinct $x, y$ in $A$, $d(x, y) > \epsilon$. The packing number $M_\epsilon(S, d)$ is the cardinality of the largest finite $\epsilon$-separated subset of $S$, or we set $M_\epsilon(S, d) = \infty$ if arbitrarily large $\epsilon$-separated subsets exist.

The following lemma explains how these numbers are linked. It is a classical result that can be found for instance in [38].

Lemma 1: Let $S$ be a subset of $\Theta$. For all $\epsilon > 0$,
$$M_{2\epsilon}(S, d) \le D_{2\epsilon}(S, d) \le N_\epsilon(S, d) \le M_\epsilon(S, d).$$

Lemma 1 enables us to choose the most convenient of these quantities when calculating the metric entropy. From the metric entropy one can define the notion of metric dimension, which generalizes the classical notion of dimension. More generally, the metric entropy tells us in some way how densely the elements fill a set, even an infinite dimensional one.

Another quantity that [37] uses is the minimax risk for the $(1+\lambda)$-affinity,
$$R_\lambda(\Lambda) = \inf_Q \sup_{\theta \in \Theta_\Lambda} \sum_{k \ge 1} P_\theta(k)^{1+\lambda}\, Q(k)^{-\lambda},$$
defined for all $\lambda > 0$. More details about the $(1+\lambda)$-affinity are given in [37]. See also [39], where special attention is paid to envelope classes. In the case of an envelope class $\Lambda_f$ defined by an integrable envelope function $f$, it is easy to see that $R_\lambda < \infty$ for all $\lambda > 0$. Indeed the choice
$$Q(k) = \frac{f(k)}{\sum_{l \ge 1} f(l)}$$
leads to the relation
$$R_\lambda \le \Big( \sum_{k \ge 1} f(k) \Big)^\lambda.$$

We can now state a slightly modified version of Theorem 5 of [37] in the context of data compression on an infinite alphabet.

Theorem 3: Let $\Lambda$ be a class of iid sources on $\mathbb{N}^*$, such that the parameter set $\Theta_\Lambda$ is a measurable subset of $\Theta$.
Assume that there exists $\lambda > 0$ such that $R_\lambda < \infty$. Let $h(x)$ be a continuous, non-decreasing function defined on the positive reals such that, for all $\gamma \ge 0$ and $C > 0$,
1) $\lim_{x \to \infty} \frac{h(C x (h(x))^\gamma)}{h(x)} = 1$, and
2) $\lim_{x \to \infty} \frac{h(C x (\ln x)^\gamma)}{h(x)} = 1$.
Then:
1) If $H_\epsilon(\Theta_\Lambda, d) \underset{\epsilon \to 0}{\sim} h\big( \frac{1}{\epsilon} \big)$, then
$$R_n(\Lambda) \underset{n \to \infty}{\sim} (\log e)\, h(\sqrt{n}).$$
2) If, for some $\alpha > 0$ and $c > 0$,
$$\liminf_{\epsilon \to 0} \frac{H_\epsilon(\Theta_\Lambda, d)}{(1/\epsilon)^\alpha\, h(1/\epsilon)} \ge c,$$
then
$$\liminf_{n \to \infty} \frac{R_n(\Lambda)}{n^{\alpha/(\alpha+2)} \left( h(n^{1/(\alpha+2)}) \right)^{2/(\alpha+2)}} > 0.$$
3) If, for some $\alpha > 0$ and $C > 0$,
$$\limsup_{\epsilon \to 0} \frac{H_\epsilon(\Theta_\Lambda, d)}{(1/\epsilon)^\alpha\, h(1/\epsilon)} \le C,$$
then
$$\limsup_{n \to \infty} \frac{R_n(\Lambda)}{(n \ln n)^{\alpha/(\alpha+2)} \left( h(n^{1/(\alpha+2)}) \right)^{2/(\alpha+2)}} < \infty.$$

The conditions on the function $h$ mean that $h$ cannot grow too fast. For instance, $h$ can grow like $C (\ln x)^\beta$, with $\beta \ge 0$. The first case in the theorem is the one we use for exponentially decreasing envelope classes: there, the fast decreasing envelope produces a "not too big" metric entropy. Theorem 3 then gives an equivalent of the minimax redundancy of the class of sources as $n$ goes to infinity. This turns out to be very useful, as it improves a previous result of [36]. However, it is only an asymptotic result, without any convergence rate.

The second and third items correspond to bigger classes of sources. In these cases the result is a bit less precise: it gives the growth rate of the redundancy, but without the associated constant factor.

(Footnote 1: We follow [37] in this definition of the metric entropy. Several authors use a slightly different definition, based on the covering number or the packing number.)
(Footnote 2: The separation of the upper and lower bounds has no effect on the proof given by Haussler and Opper. A complete justification is available in [39].)
Furthermore, there is a gap of $(\ln n)^{\alpha/(\alpha+2)}$ between the lower bound of point 2 and the upper bound of point 3. Nevertheless, it allows us to more or less recover a result of [36] for another type of envelope classes. We now develop these applications.

(Footnote 3: The $(\log e)$ factor comes from the use of the logarithm to base 2 in the definition of $R_n$.)

B. The minimax redundancy of exponentially decreasing envelope classes

Here we want to prove Theorem 2 by applying Theorem 3. Thus we have to calculate the metric entropy of exponentially decreasing envelope classes. This is done by Proposition 3:

Proposition 3: Let $C$ and $\alpha$ be positive numbers satisfying $C > e^{2\alpha}$. The metric entropy of the parameter set $\Theta_{C,\alpha}$ satisfies
$$H_\epsilon(\Theta_{C,\alpha}, d) = (1 + o(1)) \frac{1}{\alpha} \ln^2(1/\epsilon),$$
where $o(1)$ is a function $g(\epsilon)$ such that $g(\epsilon) \to 0$ as $\epsilon \to 0$.

Proof of Theorem 2: Just apply Theorem 3, with $h(x) = \frac{1}{\alpha} \ln^2(x)$, to get the result.

Proof of Proposition 3: Outlines of the proofs of the next lemmas can be found in Appendix A. The corollaries are simple applications, and their proofs are skipped.

We start with general considerations. Let $\Lambda_f$ be the envelope class defined by the integrable envelope function $f$. Let $\Theta_f$ be the corresponding parameter set
$$\Theta_f = \Big\{ \theta = (\theta_1, \theta_2, \ldots) \in [0,1]^{\mathbb{N}} : \sum_{i \ge 1} \theta_i = 1 \text{ and } \forall i \ge 1,\ \theta_i \le f(i) \Big\}.$$
The function $\theta \mapsto (\sqrt{\theta_1}, \sqrt{\theta_2}, \ldots)$ is an isometry between the metric space $(\Theta_f, d)$ and the subset $A_f \cap \{ \|x\| = 1 \}$ of $\ell^2$, equipped with the classical Euclidean norm $\| \cdot \|$, where $A_f$ is defined by
$$A_f = \{ (x_k)_{k \in \mathbb{N}^*} \in \ell^2 : \forall k \in \mathbb{N}^*,\ 0 \le x_k \le \sqrt{f(k)} \}. \quad (2)$$
The metric entropy of $(\Theta_f, d)$ can be calculated in this space. Next we truncate some coordinates, so as to work in a finite dimensional space instead of $\ell^2$.
Together with an adequate use of Lemma 1, this helps us to obtain upper and lower bounds for the metric entropy of $(\Theta_f, d)$. We start with the upper bound.

Lemma 2: Let $\Lambda_f$ be the envelope class defined by the integrable envelope function $f$, and let $\epsilon$ be a positive number. Let $N_\epsilon$ denote the integer
$$N_\epsilon = \inf\Big\{ n \ge 1 : \sum_{k \ge n+1} f(k) \le \frac{\epsilon^2}{16} \Big\}.$$
For $U \in \mathbb{R}^N$ and $a > 0$, let $B_{\mathbb{R}^N}(U, a)$ denote the ball in $\mathbb{R}^N$ with center $U$ and radius $a$. Then
$$H_\epsilon(\Theta_f, d) \le N_\epsilon \ln(1/\epsilon) + 3 N_\epsilon \ln 2 + A(N_\epsilon) + B(\epsilon),$$
where
$$A(N) = -\ln \mathrm{Vol}\big( B_{\mathbb{R}^N}(0, 1) \big) = \ln \frac{\Gamma(\frac{N}{2} + 1)}{\pi^{N/2}} \quad \text{and} \quad B(\epsilon) = \sum_{k=1}^{N_\epsilon} \ln\Big( \sqrt{f(k)} + \frac{\epsilon}{4} \Big).$$
Furthermore, $A(N_\epsilon) \underset{\epsilon \to 0}{\sim} \frac{N_\epsilon}{2} \ln N_\epsilon$.

Note that
$$-N_\epsilon \ln(1/\epsilon) - 2 N_\epsilon \ln 2 \le B(\epsilon) \le \frac{\epsilon}{4} N_\epsilon.$$
These bounds on $B(\epsilon)$ show that $B(\epsilon)$ tends to decrease the upper bound, while $A(N_\epsilon)$ contributes to its growth. If $\ln N_\epsilon$ behaves like $\ln(1/\epsilon)$ up to a constant factor, then the upper bound of Lemma 2 is of the order of a constant times $N_\epsilon \ln N_\epsilon$, and we are concerned with point 3 of Theorem 3.

If we apply Lemma 2 to the case of exponentially decreasing envelope classes, we obtain the upper bounds
$$N_\epsilon \le \frac{2}{\alpha} \ln(1/\epsilon) + \frac{1}{\alpha} \ln \frac{16 C}{1 - e^{-\alpha}},$$
$$B(\epsilon) \le -\frac{\alpha}{4} N_\epsilon^2 + \left( \frac{\ln C}{2} + \frac{1}{\sqrt{1 - e^{-\alpha}}} \right) N_\epsilon.$$
This leads to

Corollary 1: Let $C$ and $\alpha$ be positive numbers satisfying $C > e^{2\alpha}$. The metric entropy of the parameter set $\Theta_{C,\alpha}$ defined by (1) satisfies
$$H_\epsilon(\Theta_{C,\alpha}, d) \le (1 + o(1)) \frac{1}{\alpha} \ln^2(1/\epsilon),$$
where $o(1)$ is a function $g(\epsilon)$ such that $g(\epsilon) \to 0$ as $\epsilon \to 0$.

Now we need a lower bound on the metric entropy. In this case too, we truncate some coordinates so as to work in a smaller finite dimensional space; this time we truncate the first coordinates. Let us consider the number
$$l_f = \min\Big\{ l \ge 0 : \sum_{k \ge l+1} f(k) \le 1 \Big\}.$$
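Two ingredients of this computation can be checked numerically: the Hellinger-type metric of Definition 2, and the truncation threshold $N_\epsilon$ of Lemma 2 together with its upper bound for an exponential envelope. The sketch below uses illustrative parameters and a brute-force tail sum, not the paper's derivation.

```python
import math

def hellinger(p, q):
    """Hellinger distance of Definition 2 between distributions on N*,
    given as dicts {symbol: probability}."""
    support = set(p) | set(q)
    return math.sqrt(sum((math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2
                         for k in support))

def n_eps(f, eps, cutoff=100_000):
    """Smallest n >= 1 with sum_{k >= n+1} f(k) <= eps^2 / 16 (Lemma 2),
    computed by brute force with the tail truncated at `cutoff`."""
    tail = sum(f(k) for k in range(2, cutoff))   # tail for n = 1
    n = 1
    while tail > eps ** 2 / 16:
        n += 1
        tail -= f(n)
    return n

# The Hellinger distance is bounded by sqrt(2), attained on disjoint supports.
assert abs(hellinger({1: 1.0}, {2: 1.0}) - math.sqrt(2)) < 1e-12

# Exponential envelope f(k) = min(1, C e^{-alpha k}), illustrative C and alpha.
C, alpha = 10.0, 0.5
f = lambda k: min(1.0, C * math.exp(-alpha * k))
eps = 0.01
n = n_eps(f, eps)
# Upper bound on N_eps quoted above for this envelope:
bound = (2 / alpha) * math.log(1 / eps) \
    + (1 / alpha) * math.log(16 * C / (1 - math.exp(-alpha)))
```

The brute-force value of $N_\epsilon$ lands just below the analytic bound, as expected since the bound solves the tail inequality exactly up to integer rounding.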
Lemma 3: Let $\Lambda_f$ be the envelope class defined by an integrable envelope function $f$ satisfying $\sum_{k \ge 1} f(k) \ge 2$. Let $\epsilon > 0$, and let $m \ge 1$ be an integer. Then
$$H_\epsilon(\Theta_f, d) \ge \frac{1}{2} \sum_{k = l_f + 1}^{l_f + m} \ln f(k) + m \ln\Big( \frac{1}{\epsilon} \Big) + A(m),$$
where $A(m)$ is defined as in Lemma 2:
$$A(m) = -\ln \mathrm{Vol}\big( B_{\mathbb{R}^m}(0, 1) \big) \underset{m \to \infty}{\sim} \frac{m}{2} \ln m.$$

Note that exponentially decreasing envelopes satisfy the condition $\sum_{k \ge 1} f(k) \ge 2$. Indeed the envelope of exponentially decreasing envelope classes is $f(k) = \min(1, C e^{-\alpha k})$, and the condition $C > e^{2\alpha}$ entails $f(1) = f(2) = 1$. From Lemma 3 we can infer the following corollary, with the choice $m = \big\lceil \frac{2}{\alpha} \ln\big( \frac{1}{\epsilon} \big) \big\rceil$.

Corollary 2: Let $C$ and $\alpha$ be positive numbers satisfying $C > e^{2\alpha}$. The metric entropy of the parameter set $\Theta_{C,\alpha}$ satisfies
$$H_\epsilon(\Theta_{C,\alpha}, d) \ge (1 + o(1)) \frac{1}{\alpha} \ln^2(1/\epsilon),$$
where $o(1)$ is a function $g(\epsilon)$ such that $g(\epsilon) \to 0$ as $\epsilon \to 0$.

Note that this lower bound matches the upper bound of Corollary 1, which concludes the proof of Proposition 3.

C. What about other envelope classes?

In [36] the redundancy of another type of envelope classes is also studied. The power-law envelope class $\Lambda_{C \cdot^{-\alpha}}$ is defined, for $C > 1$ and $\alpha > 1$, by the envelope function $f_{\alpha,C}(x) = \min(1, C x^{-\alpha})$. The bounds obtained in [36, Theorem 6] are
$$A(\alpha)\, n^{1/\alpha} \log \lfloor C \zeta(\alpha) \rfloor \le R_n(\Lambda_{C \cdot^{-\alpha}}) \le \left( \frac{2 C n}{\alpha - 1} \right)^{1/\alpha} (\log n)^{1 - 1/\alpha} + O(1), \quad (3)$$
where
$$A(\alpha) = \frac{1}{\alpha} \int_1^\infty \frac{1 - e^{-1/(\zeta(\alpha) u)}}{u^{1 - 1/\alpha}}\, du,$$
and $\zeta$ denotes the classical function $\zeta(\alpha) = \sum_{k \ge 1} \frac{1}{k^\alpha}$, for $\alpha > 1$.

If one adapts the computation made earlier to power-law envelope classes, one obtains the following upper and lower bounds: there are two (computable) constants $K_1, K_2 > 0$ such that, for all $\epsilon > 0$,
$$K_1 \Big( \frac{1}{\epsilon} \Big)^{\frac{2}{\alpha - 1}} \le H_\epsilon \le K_2 (1 + o(1)) \Big( \frac{1}{\epsilon} \Big)^{\frac{2}{\alpha - 1}} \ln\Big( \frac{1}{\epsilon} \Big).$$
Unfortunately this formula leaves a gap between the lower bound and the upper bound, and the application of Theorem 3 makes the gap worse. Indeed the polynomial part $(1/\epsilon)^{2/(\alpha-1)}$ of the metric entropy causes an additional gap of $\log^{1/\alpha} n$. In practice the bounds are the following: there are two (unknown) constants $C, c > 0$ such that, for all $n \ge 1$,
$$c\,(1 + o(1))\, n^{1/\alpha} \le R_n(\Lambda_{C \cdot^{-\alpha}}) \le C\,(1 + o(1))\, n^{1/\alpha} \log n. \quad (4)$$
These inequalities do not improve the result of [36] in any way. Might a better calculation of the metric entropy improve either their lower bound or their upper bound? In any case, the metric entropy of power-law envelope classes is "too big" for Theorem 3 to be applied efficiently: it leaves no hope for an equivalence, as was obtained for exponentially decreasing envelope classes. To summarize, the strategy based on the metric entropy and Theorem 3 turns out to be efficient for "small" classes of sources.

III. AUTOCENSURING CODE

This section presents a new algorithm called AutoCensuring Code (ACcode). It is in fact a modification of the Censuring Code proposed by Boucheron, Garivier and Gassiat in [36]. We keep the idea that big symbols are very few, and must be encoded differently, with an Elias code. Smaller symbols are encoded by arithmetic coding based on Krichevsky-Trofimov mixtures, which are known to be effective for finite alphabets. Our innovation is a data-driven cutoff $M_i = \sup_{1 \le k \le i} X_k$ used to encode $X_{i+1}$: with this choice we do not need to know the exact parameters of the exponentially decreasing envelope. ACcode is a prefix code on the set of all finite length messages, and it works online. Its maximum redundancy on an exponentially decreasing envelope class $\Lambda_{C e^{-\alpha \cdot}}$ is equivalent to the minimax redundancy of this class of sources.
Furthermore, ACcode is adaptive, as the same algorithm achieves this property on all exponentially decreasing envelope classes. This is formulated in the following theorem, proved in Appendix B. Let $\mathrm{ACcode}(x_{1:n})$ denote the binary string produced by ACcode when it encodes the message $x_{1:n}$, and let $l(\cdot)$ denote the length of a string.

Theorem 4: For any positive numbers $C$ and $\alpha$ satisfying $C > e^{2\alpha}$,
$$\sup_{P \in \Lambda_{C e^{-\alpha \cdot}}} \mathbb{E}_{P^n}\big[ l(\mathrm{ACcode}(X_{1:n})) - H(P^n) \big] \underset{n \to \infty}{\sim} R_n(C, \alpha).$$

The difference between the redundancy of ACcode and the minimax redundancy is not necessarily bounded: there may exist codes whose redundancy is smaller than the redundancy of ACcode, but the benefit is negligible with respect to $\log^2 n$. Additionally, Theorem 4 enables us to recover the upper bound on the minimax redundancy obtained in Section II.

Let us now define ACcode. Let $n \ge 1$ be some positive integer, and let $x_{1:n} = x_1 x_2 \ldots x_n$ be a string from $\mathbb{N}_*^n$ to be encoded. We define the sequence of maxima
$$m_0 = 0 \quad \text{and} \quad m_i = \sup_{1 \le k \le i} x_k, \quad \text{for all } 1 \le i \le n.$$
The sequence $(m_i)_{1 \le i \le n}$ is non-decreasing and piecewise constant. For $1 \le i \le n$, let $n^0_i = \sum_{j=1}^i 1_{m_j > m_{j-1}}$ be the number of plateaus between 1 and $i$. For $1 \le k \le n^0_n$, let $\widetilde{m}_k$ be the $k$th new maximum:
$$m_i = \widetilde{m}_{n^0_i}. \quad (5)$$
We also define $\widetilde{m}_0 = 0$. Let the string $\widetilde{m}$ be the sequence
$$(\widetilde{m}_1 - \widetilde{m}_0 + 1), \ldots, (\widetilde{m}_{n^0_n} - \widetilde{m}_{n^0_n - 1} + 1), 1.$$
$\widetilde{m}$ is encoded into a binary string C2 by applying the Elias penultimate code (see [33]) to each number in $\widetilde{m}$. It is a prefix code which uses $l_E(x)$ bits to encode a positive integer $x$, with
$$l_E(1) = 1, \qquad l_E(x) = 1 + \lfloor \log x \rfloor + 2 \lfloor \log(\lfloor \log x \rfloor + 1) \rfloor \quad \text{if } x \ge 2. \quad (6)$$
Meanwhile, the sequence of censored symbols is encoded using side information from $\widetilde{m}$. Consider the censored sequence $\widetilde{x}_{1:n} = \widetilde{x}_1 \widetilde{x}_2 \ldots$
$\widetilde{x}_n$, defined by
$$\widetilde{x}_i = x_i 1_{x_i \le m_{i-1}} = \begin{cases} x_i & \text{if } x_i \le m_{i-1}, \\ 0 & \text{otherwise.} \end{cases}$$
All symbols greater than $m_{i-1}$ are encoded together: they are replaced by the extra symbol 0, and this extra symbol is encoded instead. The symbol 0 has a special use in our setting: it lets the decoder know when $m_i$ changes, and that the new value has to be read in C2. We add at the end of $\widetilde{x}_{1:n}$ an additional 0, which acts as a termination signal together with the last 1 in $\widetilde{m}$. This makes our code a prefix code on the set of all finite length messages (whatever $n$). We then produce the binary string C1 by arithmetic coding of $\widetilde{x}_{1:n} 0$. The conditional coding probabilities are defined by
$$Q_{i+1}(\widetilde{X}_{i+1} = j \mid X_{1:i} = x_{1:i}) = \frac{n_i^j + \frac{1}{2}}{i + \frac{m_i + 1}{2}} \quad \text{if } 1 \le j \le m_i,$$
$$Q_{i+1}(\widetilde{X}_{i+1} = 0 \mid X_{1:i} = x_{1:i}) = \frac{1/2}{i + \frac{m_i + 1}{2}},$$
where for $j \ge 1$ and $i \ge 0$, $n_i^j$ is the number of occurrences of symbol $j$ in $x_{1:i}$ (with the convention $n_0^j = 0$ for all $j \ge 1$). If $i \le n - 1$, the event $\{ \widetilde{X}_{i+1} = 0 \}$ is equal to $\{ X_{i+1} > M_i \}$. If $x_{i+1} = j > m_i$, then $n_i^j = 0$, and we still have
$$Q_{i+1}(\widetilde{X}_{i+1} = 0 \mid X_{1:i} = x_{1:i}) = \frac{n_i^j + \frac{1}{2}}{i + \frac{m_i + 1}{2}}.$$
In the sequel we denote the coding probability used to encode the entire string $\widetilde{x}_{1:n} 0$ by
$$Q^{n+1}(\widetilde{x}_{1:n} 0) = Q_{n+1}(0 \mid x_{1:n}) \prod_{i=0}^{n-1} Q_{i+1}(\widetilde{x}_{i+1} \mid x_{1:i}).$$
Note that the symbol 0 is always considered as new: when $x_{i+1} > m_i$, we encode 0 but we increment the counter $n_i^{x_{i+1}}$. (This choice was made to simplify the calculation of the redundancy of ACcode, but we suspect that changing this behavior could improve the performance.)

Now that we have defined C1 and C2, we have to describe how they are transmitted. To keep our code online, we interleave these two strings in the following way. Arithmetic coding needs a certain number of bits, say $l_i$, to send the first $i$ symbols of $\widetilde{x}_{1:n}$.
Unfortunately, $l_i$ depends on whether $i = n + 1$ or not. In the former case, $l_{n+1} = \lceil -\log Q^{n+1}(\widetilde{x}_{1:n} 0) \rceil + 1$; in the latter, $l_i$ depends on the following symbols and has to be computed.

ACcode begins with C2, by the transmission of the Elias code of $\widetilde{m}_1 + 1$. Then the transmission of C1 is initiated. Suppose that $\widetilde{x}_i = 0$ and $n^0_i = k$. As soon as $l_i$ bits of C1 have been sent, the ACcode algorithm sends the Elias code of $\widetilde{m}_k - \widetilde{m}_{k-1} + 1$. Then C1 is transmitted again, from the next bit.

To decode the $i$th symbol in C1, the knowledge of the current maximum $m_{i-1}$ is needed; it is obtained from the beginning of the string C2. The decoder also needs the counters $(n_{i-1}^j)_{j \ge 1}$, which can be computed from the first $i - 1$ decoded symbols. As soon as $l_i$ bits of C1 have been received, $\widetilde{x}_i$ can be decoded. When the decoder meets a 0 at the $i$th position, it knows that the next bits are the Elias code of the next symbol in $\widetilde{m}$, and deduces $m_i$ via (5). Since the Elias code is prefix-free, the decoder knows when it receives C1 again. Then the $(i+1)$th symbol can be processed.

Fig. 1 shows an illustration of the transmission process. In this example, the initial message is $x_{1:4} = 5, 3, 2, 7$. The message encoded in C1 is then $\widetilde{x}_{1:4} 0 = 0, 3, 2, 0, 0$. 13 bits are needed to transmit the second 0, and 15 bits for the last one. In C2 we transmit $\widetilde{m} = 6, 3, 1$.

[Fig. 1. Example of ACcode: the bits of C1 are interleaved with the Elias codewords of $\widetilde{m}_1 + 1$, $\widetilde{m}_2 - \widetilde{m}_1 + 1$, and of the terminal 1.]

In the previous example, exact calculations have been performed, but this is not sensible for a practical implementation of arithmetic coding. Some rule is needed to set the precision of the computations, and it must be used by both coder and decoder.
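The numbers in the Fig. 1 example can be reproduced with a short sketch (a naive re-implementation for checking purposes, not the paper's code): the total C1 length $\lceil -\log Q^{n+1}(\widetilde{x}_{1:n} 0) \rceil + 1$ for $x_{1:4} = 5, 3, 2, 7$, and the Elias codeword lengths $l_E$ of (6).

```python
import math

def elias_len(x):
    """Codeword length l_E(x) of the Elias penultimate code, Eq. (6)."""
    if x == 1:
        return 1
    lg = x.bit_length() - 1                     # floor(log2 x)
    return 1 + lg + 2 * ((lg + 1).bit_length() - 1)

def accode_c1_probability(x):
    """Coding probability Q^{n+1}(x~_{1:n} 0) that ACcode assigns to the
    censored sequence followed by the terminal 0."""
    counts = {}   # n_i^j: occurrences of symbol j among the first i symbols
    m = 0         # running maximum m_i
    prob = 1.0
    for i, xi in enumerate(x):                  # i symbols already seen
        denom = i + (m + 1) / 2
        if xi <= m:                             # uncensored symbol
            prob *= (counts.get(xi, 0) + 0.5) / denom
        else:                                   # censored: escape symbol 0
            prob *= 0.5 / denom
        counts[xi] = counts.get(xi, 0) + 1      # 0 is "always new": count xi anyway
        m = max(m, xi)
    prob *= 0.5 / (len(x) + (m + 1) / 2)        # terminal 0
    return prob

x = [5, 3, 2, 7]                                # the message of Fig. 1
q = accode_c1_probability(x)
c1_bits = math.ceil(-math.log2(q)) + 1          # total C1 length
c2_bits = elias_len(6) + elias_len(3) + elias_len(1)  # Elias codes of m~ = 6, 3, 1
```

With these definitions, `c1_bits` matches the 15 bits quoted above, and the first escape symbol is coded with probability 1, since $m_0 = 0$ makes the denominator $1/2$.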
To avoid too large an extra redundancy caused by approximations, the precision can grow with $n$. For instance, calculations can be made in memory with a further precision of $2 \lceil \log i \rceil$ bits, in addition to the $\lceil -\log Q^i(\widetilde{x}_{1:i}) \rceil + 1$ bits needed to encode $x_{1:i}$; this ensures that the extra redundancy is bounded.

APPENDIX A
METRIC ENTROPY OF EXPONENTIALLY DECREASING ENVELOPE CLASSES

We give here the outlines of the proofs of the lemmas stated in Subsection II-B.

Proof of Lemma 2: $N_\epsilon$ denotes the threshold from which we truncate the coordinates. If $y = (y_n)_{n \ge 1}$ is an element of $A_f$, its truncated version is $\widetilde{y} = (y_n 1_{n \le N_\epsilon})_{n \ge 1}$. One can check that $\| y - \widetilde{y} \| \le \frac{\epsilon}{4}$. Suppose now that $S$ is an $\epsilon/4$-cover of $\{ y \in A_f : \forall n > N_\epsilon,\ y_n = 0 \}$. Let $z$ denote an element of $A_f$. Then there exists some $y \in S$ such that $\| \widetilde{z} - y \| \le \epsilon/4$. Thus $\| z - y \| \le \epsilon/2$, and $S$ is an $\epsilon/2$-cover of $A_f$. This leads to
$$D_\epsilon(\Theta_f, d) \le M_{\epsilon/4}\Big( \prod_{1 \le k \le N_\epsilon} \big[ 0, \sqrt{f(k)} \big],\ \| \cdot \|_{\mathbb{R}^{N_\epsilon}} \Big) \le \frac{\mathrm{Vol}\Big( \prod_{1 \le k \le N_\epsilon} \big[ -\frac{\epsilon}{8}, \sqrt{f(k)} + \frac{\epsilon}{8} \big] \Big)}{\mathrm{Vol}\big( B_{\mathbb{R}^{N_\epsilon}}(0, \frac{\epsilon}{8}) \big)}.$$
A first consequence of this computation is that $D_\epsilon(\Theta_f, d)$ is finite for all $\epsilon > 0$. The first assertion of Lemma 2 is then obtained by applying the logarithm. The rest of Lemma 2 follows from the Feller bounds, in the version proposed by [40, ch. XII]:
$$\Gamma(x) = \sqrt{2\pi}\, x^{x - 1/2} e^{-x} e^{\frac{\beta}{12 x}}, \quad \text{with } \beta \in [0, 1]. \quad (7)$$

Proof of Lemma 3: Let $m \ge 1$ be an integer. We project the set $A_f \cap \{ \|x\| = 1 \}$ onto the $m$-dimensional space
$$E_m = \{0\}^{l_f} \times \mathbb{R}^m \times \{0\}^{\{ k : k \ge l_f + m + 1 \}}$$
generated by the coordinates from $l_f + 1$ to $l_f + m$. This leads to
$$D_\epsilon(\Theta_f, d) \ge N_\epsilon\Big( \prod_{k = l_f + 1}^{l_f + m} \big[ 0, \sqrt{f(k)} \big],\ \| \cdot \|_{\mathbb{R}^m} \Big) \ge \frac{\mathrm{Vol}\Big( \prod_{k = l_f + 1}^{l_f + m} \big[ 0, \sqrt{f(k)} \big] \Big)}{\mathrm{Vol}\big( B_{\mathbb{R}^m}(0, \epsilon) \big)}.$$
It only remains to apply the logarithm.
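The Feller version (7) of Stirling's bounds can be checked numerically: solving (7) for $\beta$ must give a value in $[0, 1]$. The sketch below uses only the standard library `math.lgamma`.

```python
import math

def feller_beta(x):
    """Solve Eq. (7) for beta: ln Gamma(x) = ln sqrt(2 pi) + (x - 1/2) ln x
    - x + beta/(12 x), hence beta = 12 x (ln Gamma(x) - main term)."""
    main = 0.5 * math.log(2 * math.pi) + (x - 0.5) * math.log(x) - x
    return 12 * x * (math.lgamma(x) - main)

# beta stays in [0, 1]; the Stirling series suggests beta is close to
# 1 - 1/(30 x^2) for large x.
betas = [feller_beta(x) for x in (1.0, 2.0, 5.0, 10.0, 100.0)]
```

For instance $\beta \approx 0.97$ already at $x = 1$, and it approaches 1 as $x$ grows.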
APPENDIX B
REDUNDANCY OF ACcode

A. Moments of M_n

We first need a lemma which contains several useful results about the moments of M_n.

Lemma 4: Let C and α be positive numbers satisfying C > e^{2α}. Then, for all n ≥ 1,
1) sup_{P∈Λ_{Ce^{−α·}}} E_P[M_n] ≤ (1/α)( ln n + ln(C/(1−e^{−α})) + 1 ).
2) sup_{P∈Λ_{Ce^{−α·}}} E_P[ M_n 1_{M_n > (1/α) ln(Cn²/(1−e^{−α}))} ] = O(ln n / n).
3) sup_{P∈Λ_{Ce^{−α·}}} E_P[M_n ln M_n] = o(ln² n).

Proof: Let F denote the distribution function associated with P. For t ≥ 0, we have

  P(X_1 > t) = ∑_{k≥⌊t⌋+1} P(k) ≤ (C/(1−e^{−α})) e^{−α(⌊t⌋+1)} ≤ e^{−α(t−β)},

where β = (1/α) ln(C/(1−e^{−α})). Therefore F(t) ≥ G(t) for all t ∈ R, where G(t) = 1_{t≥β} (1 − e^{−α(t−β)}). We can identify in G the distribution function of the random variable β + Y, where Y follows the exponential distribution with parameter α.

Let U_1, ..., U_n be n i.i.d. random variables following the uniform distribution on [0, 1]. For 1 ≤ i ≤ n, let us define X'_i = F^{−1}(U_i) and Y_i = G^{−1}(U_i) − β, where F^{−1} and G^{−1} denote the pseudo-inverses of F and G:

  ∀t ∈ [0, 1], F^{−1}(t) = inf{ x ∈ R : F(x) ≥ t }.

Then the n-dimensional vector X'_{1:n} = (X'_1, ..., X'_n) has the same distribution as X_{1:n}, and the maxima M'_n = sup_{1≤i≤n} X'_i and M_n follow the same distribution. On the other hand, the relation F ≥ G entails X'_i ≤ β + Y_i for all 1 ≤ i ≤ n. As a consequence, if M''_n = sup_{1≤i≤n} Y_i denotes the maximum of all the Y_i, we have M'_n ≤ β + M''_n.

Since the random variables Y_i are independent, the probability distribution of M''_n is easy to calculate. Indeed, for all t > 0,

  P(M''_n ≤ t) = P(∀ 1 ≤ i ≤ n, Y_i ≤ t) = (1 − e^{−αt})^n.

We can write down the density function of M''_n:

  f(t) = nα e^{−αt} (1 − e^{−αt})^{n−1} if t > 0, and f(t) = 0 otherwise.
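The quantile coupling above can be illustrated numerically. In the sketch below, α, C, and the geometric source are example choices (not taken from the paper); it checks the pointwise domination F^{−1}(u) ≤ G^{−1}(u) = β + Y, and then, by Monte Carlo, the bound of point 1 of Lemma 4 for this particular source.

```python
import math
import random

# Example values (not from the paper): alpha, and C with C > e^{2 alpha}.
alpha = 0.7
C = math.exp(2 * alpha) + 1.0
beta = math.log(C / (1 - math.exp(-alpha))) / alpha

# A concrete source in the class: geometric on {1, 2, ...} with ratio q = e^{-alpha},
# so that P(X_1 > t) = q^floor(t) <= e^{-alpha (t - beta)} (beta > 1 here).
q = math.exp(-alpha)

def F_inv(u):
    """Pseudo-inverse of the geometric c.d.f. F(k) = 1 - q^k."""
    return max(1, math.ceil(math.log(1 - u) / math.log(q)))

def G_inv(u):
    """Inverse of G(t) = 1_{t >= beta} (1 - e^{-alpha (t - beta)})."""
    return beta - math.log(1 - u) / alpha

random.seed(0)
us = [random.random() for _ in range(10_000)]

# F >= G pointwise, hence F_inv(u) <= G_inv(u): X'_i <= beta + Y_i.
assert all(F_inv(u) <= G_inv(u) for u in us)

# Monte Carlo check of point 1 of Lemma 4 for this source, with n = 1000.
n, trials = 1000, 500
bound = (math.log(n) + 1 + math.log(C / (1 - math.exp(-alpha)))) / alpha
est = sum(max(F_inv(random.random()) for _ in range(n)) for _ in range(trials)) / trials
assert est <= bound
```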
Now we look for an upper bound on E[M_n], taking advantage of the knowledge of that distribution:

  E[M_n] = E[M'_n] ≤ E[β + M''_n] = β + ∫_0^∞ t nα e^{−αt} (1 − e^{−αt})^{n−1} dt = β + ∫_0^∞ ( 1 − (1 − e^{−αt})^n ) dt,

integrating by parts. Use now the change of variables u = 1 − e^{−αt}, t = −ln(1−u)/α:

  E[M_n] ≤ β + (1/α) ∫_0^1 (1 − u^n)/(1 − u) du ≤ (1/α)( ln n + 1 + ln(C/(1−e^{−α})) ).

Since the upper bound does not depend on P, this achieves the proof of point 1.

We can handle point 2 in the same way. For all t > 0, we have

  E[ M_n 1_{M_n > β+t} ] ≤ E[ (β + M''_n) 1_{M''_n > t} ] ≤ ∫_t^∞ (β + u) nα e^{−αu} du = n e^{−αt} ( t + 1/α + β ).

With t = (2/α) ln n, we get the second point of Lemma 4.

The third item is similar. Since the function x ↦ x ln x is increasing on [1, +∞) and 1 ≤ M'_n ≤ β + M''_n, we have

  E[M_n ln M_n] ≤ E[ (β + M''_n) ln(β + M''_n) ]
    = E[ 1_{M''_n≤β} (β + M''_n) ln(β + M''_n) ] + E[ 1_{M''_n>β} 1_{M''_n≤(2/α) ln n} (β + M''_n) ln(β + M''_n) ] + E[ 1_{M''_n>β} 1_{M''_n>(2/α) ln n} (β + M''_n) ln(β + M''_n) ]
    ≤ 2β ln(2β) + (4/α)(ln n) ln( (4/α) ln n ) + E[ 2 M''_n ln(2 M''_n) 1_{M''_n > (2/α) ln n} ]
    ≤ 2β ln(2β) + ( (4/α) ln(4/α) ) ln n + (4/α)(ln n)(ln ln n) + E[ 4 M''_n² 1_{M''_n > (2/α) ln n} ].

Let us define γ(n) = 2β ln(2β) + ( (4/α) ln(4/α) ) ln n + (4/α)(ln n)(ln ln n). Note that γ(n) = o(ln² n). Then

  E[M_n ln M_n] ≤ γ(n) + ∫_{(2/α) ln n}^∞ 4u² nα e^{−αu} du = γ(n) + (4n e^{−2 ln n}/α²)(4 ln² n + 4 ln n + 2).

Taking the supremum over P, we get

  sup_{P∈Λ_{Ce^{−α·}}} E_P[M_n ln M_n] ≤ γ(n) + (16 ln² n + 16 ln n + 8)/(α² n) = o(ln² n).

B. Contribution of C1

Proposition 4: Let C and α be positive numbers satisfying C > e^{2α}. Then

  sup_{P∈Λ_{Ce^{−α·}}} E_{P^n}[ −log Q^n(X̃_{1:n}) − H(P^n) ] ≤ (1 + o(1)) log² n / (4α log e).
Proof: We give here the sketch of the proof, and we delay the proofs of (8), (9), (10), and (11). Here we deal with the quantity

  (A) = sup_{P∈Λ_{Ce^{−α·}}} E_{P^n}[ −log Q^n(X̃_{1:n}) − H(P^n) ],

which corresponds to the contribution of C1. As we saw in Section III, the coding probability Q^n is based on Krichevsky–Trofimov mixtures. For k ≥ 1, let KT_k denote the usual Krichevsky–Trofimov mixture on the alphabet {1, ..., k}, whose conditional probabilities are, for all 0 ≤ i ≤ n−1 and for all 1 ≤ j ≤ k,

  KT_k( X_{i+1} = j | X_{1:i} = x_{1:i} ) = ( n^j_i + 1/2 ) / ( i + k/2 ).

Let us choose k = m_n + 1. In this case, there is a simple relation between KT_{m_n+1} and Q^n. For any sequence of n positive integers x_{1:n} ∈ N_*^n,

  Q^{i+1}( X̃_{i+1} = x̃_{i+1} | X_{1:i} = x_{1:i} ) = ( (2i + 1 + m_n) / (2i + 1 + m_i) ) KT_{m_n+1}( X_{i+1} = x_{i+1} | X_{1:i} = x_{1:i} ).

As a consequence, we can link the redundancy of Q^n to the redundancy of KT_{m_n+1}:

  log Q^n(X̃_{1:n}) = log KT_{M_n+1}(X_{1:n}) + ∑_{i=0}^{n−1} log( (2i + 1 + M_n) / (2i + 1 + M_i) ),

and therefore (A) = sup_{P∈Λ_{Ce^{−α·}}} ( (A1) − (A2) ), where

  (A1) = E_{P^n}[ −log KT_{M_n+1}(X_{1:n}) − H(P^n) ],
  (A2) = E_{P^n}[ ∑_{i=0}^{n−1} log( (2i + 1 + M_n) / (2i + 1 + M_i) ) ].

Note that (A2) corresponds to the gain in redundancy of Q^n with respect to KT_{M_n+1}. It illustrates the benefit of taking M_i instead of M_n as cutoff to encode X_{i+1}. On the one hand, we have

  (A1) ≤ (E[M_n]/2) log n + E[log(M_n + 1)].   (8)

Since E[log M_n] ≤ E[M_n], Lemma 4 entails sup_{P∈Λ_{Ce^{−α·}}} E[log(M_n + 1)] = o(log² n). Applying Lemma 4 again, we see that (A1) produces a redundancy equivalent to log² n / (2α log e), which is twice the minimax redundancy obtained in Theorem 2. So we expect the corrective term (A2) to be about log² n / (4α log e).
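The Krichevsky–Trofimov conditionals above translate directly into code. A minimal sketch (the alphabet size k and the string length n below are arbitrary example choices); the brute-force check confirms the sanity property that KT_k is a probability distribution on {1, ..., k}^n:

```python
from collections import Counter
from itertools import product

def kt_prob(x, k):
    """KT_k(x_{1:n}) on {1, ..., k}, computed as the product of the
    conditionals (n_i^j + 1/2) / (i + k/2), where n_i^j counts the
    occurrences of symbol j among the first i symbols."""
    counts = Counter()
    p = 1.0
    for i, sym in enumerate(x):          # i symbols seen so far
        p *= (counts[sym] + 0.5) / (i + k / 2)
        counts[sym] += 1
    return p

# Sanity check: the KT masses of all strings of length n sum to 1.
k, n = 3, 4
total = sum(kt_prob(x, k) for x in product(range(1, k + 1), repeat=n))
assert abs(total - 1.0) < 1e-12
```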
To deal with (A2), we use the concavity of the log function, and we group the terms of the sum into bundles of size M_n. Let m = ⌊(n−1)/M_n⌋ be the number of bundles. To simplify the expression, we also neglect a few terms at the beginning of the sum. Let (h_n)_{n≥1} be a non-decreasing sequence of positive integers such that h_n → ∞ as n → ∞, and let us define λ_n = 2h_n log( 1 + 1/(2h_n) ). Then

  (A2) ≥ λ_n E_{P^n}[ ∑_{k=h_n+1}^m (M_n − M_{kM_n}) / (2(k+1)) ].   (9)

It is easy to verify that the function x ↦ x log(1 + 1/x) is non-decreasing and tends to log e as x tends to infinity; therefore (λ_n) tends to log e. We can now write

  (A) ≤ sup_{P∈Λ_{Ce^{−α·}}} ( (E[M_n]/2) log n − λ_n E[ ∑_{k=h_n+1}^m (M_n − M_{kM_n}) / (2(k+1)) ] ) + o(log² n)
     ≤ (1/2) (A3) + (λ_n/2) (A4) + o(log² n),

where

  (A3) = sup_{P∈Λ_{Ce^{−α·}}} E_{P^n}[ M_n log n − λ_n M_n ∑_{k=h_n+1}^m 1/(k+1) ],
  (A4) = sup_{P∈Λ_{Ce^{−α·}}} E_{P^n}[ ∑_{k=h_n+1}^m M_{kM_n} / (k+1) ].

Let us choose h_n = max{ 1, ⌊ln n − 2⌋ }. Then

  (A3) = o(log² n),   (10)
  (A4) ≤ log² n / (2α log² e) + o(log² n).   (11)

Therefore we have (A) ≤ (1 + o(1)) log² n / (4α log e), which concludes the proof of Proposition 4.

Proof of (8): Let

  P̂^n(x_{1:n}) = sup_{P^n} P^n(x_{1:n}) = ∏_{j∈{x_1,...,x_n}} ( n^j_n / n )^{n^j_n}

be the maximum likelihood of the string x_{1:n} over all i.i.d. distributions on N. Then

  (A1) ≤ E_{P^n}[ log( P̂^n(X_{1:n}) / KT_{M_n+1}(X_{1:n}) ) ] ≤ E_{P^n}[ sup_{x_{1:n}∈{1,...,M_n+1}^n} log( P̂^n(x_{1:n}) / KT_{M_n+1}(x_{1:n}) ) ].

Now we can apply a result of Catoni [9, Prop. 1.4.1]:

Lemma 5: For all k ≥ 1 and for all x_{1:n} ∈ {1, ..., k}^n,

  −log KT_k(x_{1:n}) + log P̂^n(x_{1:n}) ≤ ((k−1)/2) log n + log k.

Proof of (9): We group the terms of (A2) into bundles of size M_n:

  (A2) ≥ E_{P^n}[ ∑_{k=1}^{m−1} ∑_{i=kM_n+1}^{(k+1)M_n} log( 1 + (M_n − M_i)/(2i + M_i + 1) ) ].
From the relation M_k ≤ M_{k'} for all k' ≥ k ≥ 1, we can infer, for all i ≥ kM_n,

  (M_n − M_i) / (2i + M_i + 1) ≤ M_n / (2kM_n) = 1/(2k).

Since log is a concave function, we have log(1 + x) ≥ x log(1 + a)/a for all a > 0 and 0 ≤ x ≤ a. Consequently, if we choose a = 1/(2k),

  (A2) ≥ E_{P^n}[ ∑_{k=1}^{m−1} ∑_{i=kM_n+1}^{(k+1)M_n} 2k log(1 + 1/(2k)) (M_n − M_i)/(2i + M_i + 1) ] ≥ E_{P^n}[ ∑_{k=h_n+1}^m λ_n (M_n − M_{kM_n})/(2k + 2) ].

Proof of (10): We have

  (A3) = sup_{P∈Λ_{Ce^{−α·}}} [ ∑_{j≥1} P^n(M_n = j) × ( j log n − λ_n j ∑_{k=h_n+1}^{⌊(n−1)/j⌋} 1/(k+1) ) ].

Then we plug in h_n = ⌊ln n − 2⌋. For n large enough, h_n ≥ 1, and we have

  j ∑_{k=h_n+1}^{⌊(n−1)/j⌋} 1/(k+1) ≥ j ∫_{ln n − 1}^{⌊(n−1)/j⌋+1} dx/(x+1) = j ( ln( ⌊(n−1)/j⌋ + 2 ) − ln(ln n) ) ≥ j ln(n−1) − j ln j − j ln(ln n),

and therefore

  (A3) ≤ sup_{P∈Λ_{Ce^{−α·}}} [ (log e − λ_n) E[M_n] ln n + λ_n E[M_n] ln( n/(n−1) ) + λ_n E[M_n ln M_n] + λ_n E[M_n] ln(ln n) ].

Then, if we use Lemma 4 and the fact that λ_n tends to log e, we get (10).

Proof of (11): We want to commute the expected value and the sum in (A4). To do so, we need to get rid of m. We can note that the condition k ≤ m = ⌊(n−1)/M_n⌋ entails kM_n ≤ n−1. Consequently, for n big enough,

  (A4) ≤ sup_{P∈Λ_{Ce^{−α·}}} E_{P^n}[ ∑_{k=3}^m M_{kM_n}/(k+1) ]
      ≤ sup_{P∈Λ_{Ce^{−α·}}} E_{P^n}[ ∑_{k=3}^{n−1} M_{kM_n} 1_{kM_n≤n−1}/(k+1) ]
      ≤ ∑_{k=3}^{n−1} sup_{P∈Λ_{Ce^{−α·}}} E_{P^n}[ M_{kM_n} 1_{M_n≤l_n} ]/(k+1) + sup_{P∈Λ_{Ce^{−α·}}} E_{P^n}[ M_n 1_{M_n>l_n} ] ∑_{k=3}^{n−1} 1/(k+1),

where l_n = ⌊ (1/α)( 2 ln n + ln(C/(1−e^{−α})) ) ⌋. We can now plug in the results of Lemma 4:

  (A4) ≤ ∑_{k=3}^{n−1} ( ln(k l_n) + 1 + ln(C/(1−e^{−α})) ) / ((k+1)α) + o(1) ∑_{k=3}^{n−1} 1/(k+1)
      ≤ (1/α) ∑_{k=3}^{n−1} ln k/(k+1) + (1/α)( ln l_n + O(1) ) ∑_{k=3}^{n−1} 1/(k+1).

Note that l_n = O(ln n), and consequently ln l_n = O(ln ln n). So

  (A4) ≤ (1/α) ∫_3^n (ln x / x) dx + O(ln ln n) ∫_3^n dx/x ≤ (1/(2α)) ln² n + o(ln² n).

C.
Contribution of C2

Proposition 5: Let C and α be positive numbers satisfying C > e^{2α}. Then

  sup_{P∈Λ_{Ce^{−α·}}} E_{P^n}[ l(C2) ] = o(log² n).

Proof: As in the previous subsection, we give first the sketch of the proof, and we delay several technical lemmas.

  sup_{P∈Λ_{Ce^{−α·}}} E_{P^n}[ l(C2) ] ≤ 1 + sup_{P∈Λ_{Ce^{−α·}}} ∑_{i=1}^n E_{P^n}[ 1_{X_i>M_{i−1}} l_E(X_i + 1) ].

We deal with this sum thanks to the following lemma:

Lemma 6: Let g be a positive and non-decreasing function on [1, ∞). Let (K_n)_{n≥1} be a non-decreasing sequence of positive integers. Then, for all n ≥ 1,

  sup_{P∈Λ_{Ce^{−α·}}} ∑_{i=1}^n E_{P^n}[ 1_{X_i>M_{i−1}} g(X_i) ] ≤ (1 + o(1)) g(K_n) (1/α) ln n + Cn ∫_{K_n}^∞ g(x+1) e^{−αx} dx.

To apply Lemma 6, we extend the definition of l_E to [1, ∞) by

  l_E(x) = 1 if x ∈ [1, 2),
  l_E(x) = 1 + ⌊log x⌋ + 2 ⌊log( ⌊log x⌋ + 1 )⌋ if x ≥ 2.

We get

  sup_{P∈Λ_{Ce^{−α·}}} E_{P^n}[ l(C2) ] ≤ (1 + o(1)) (l_E(K_n + 1)/α) ln n + Cn ∫_{K_n}^∞ l_E(x+2) e^{−αx} dx.

Then we can choose K_n = max{ 1, ⌊(1/α) ln n⌋ }. This entails l_E(K_n + 1) ∼ log K_n ∼ log log n = o(log n) as n → ∞, and therefore (1/α) l_E(K_n + 1) ln n = o(log² n). The remaining term is treated by Lemma 7, which achieves the proof of Proposition 5:

Lemma 7: Let α > 0 be a real number, and let K_n = max{ 1, ⌊(1/α) ln n⌋ }. Then

  n ∫_{K_n}^∞ l_E(x+2) e^{−αx} dx = o(log n).

Proof of Lemma 6: Let P be an element of Λ_{Ce^{−α·}}. Let us define, for all k ≥ 0,

  p̄(k) = P(X_1 > k) = ∑_{j≥k+1} P(j),

and

  (B1) = ∑_{i=1}^n E_{P^n}[ 1_{X_i>M_{i−1}} g(X_i) ].

Note that, for all 1 ≤ i ≤ n, X_i and M_{i−1} are independent random variables, and

  P^n(M_i ≤ k) = P^n(∀ 1 ≤ j ≤ i, X_j ≤ k) = (1 − p̄(k))^i.
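These two observations yield the closed form for (B1) derived in the computation that follows. It can be verified exactly on a small finite alphabet by brute-force enumeration; in the sketch below, the source P and the function g are arbitrary example choices, and the convention M_0 = 0 is assumed.

```python
from itertools import product

K, n = 4, 5
P = [0.4, 0.3, 0.2, 0.1]       # example source on {1, ..., K}

def g(x):
    """An arbitrary positive, non-decreasing g."""
    return float(x * x)

# Left-hand side: sum_i E[1_{X_i > M_{i-1}} g(X_i)], over all K^n strings.
lhs = 0.0
for seq in product(range(1, K + 1), repeat=n):
    prob = 1.0
    for s in seq:
        prob *= P[s - 1]
    M = 0                       # running maximum, with M_0 = 0
    for s in seq:
        if s > M:               # symbol exceeds the current maximum: a record
            lhs += prob * g(s)
            M = s

# Right-hand side: sum_m P(m) g(m) (1 - (1 - pbar(m-1))^n) / pbar(m-1).
def pbar(k):
    """P(X_1 > k)."""
    return sum(P[k:])

rhs = sum(P[m - 1] * g(m) * (1 - (1 - pbar(m - 1)) ** n) / pbar(m - 1)
          for m in range(1, K + 1))

assert abs(lhs - rhs) < 1e-9
```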
Then we can write

  (B1) = ∑_{i=1}^n ∑_{k≥0} P^n(M_{i−1} = k) ∑_{m≥k+1} P(m) g(m)
       = ∑_{m≥1} P(m) g(m) ∑_{i=1}^n ∑_{k=0}^{m−1} P(M_{i−1} = k)
       = P(1) g(1) + ∑_{m≥2} P(m) g(m) ∑_{i=1}^n (1 − p̄(m−1))^{i−1}
       = ∑_{m≥1} P(m) g(m) ( 1 − (1 − p̄(m−1))^n ) / p̄(m−1).

If we take g(x) = 1 for all x, we get

  ∑_{m≥1} P(m) ( 1 − (1 − p̄(m−1))^n ) / p̄(m−1) = E[ ∑_{i=1}^n 1_{X_i>M_{i−1}} ] ≤ E[M_n].

In the general case, we can split the sum at K_n, and we get

  (B1) = ∑_{m=1}^{K_n} P(m) g(m) ( 1 − (1 − p̄(m−1))^n ) / p̄(m−1) + ∑_{m≥K_n+1} P(m) g(m) ( 1 − (1 − p̄(m−1))^n ) / p̄(m−1)
       ≤ g(K_n) ∑_{m≥1} P(m) ( 1 − (1 − p̄(m−1))^n ) / p̄(m−1) + ∑_{m≥K_n+1} n P(m) g(m)
       ≤ g(K_n) E[M_n] + Cn ∑_{m≥K_n+1} g(m) e^{−αm}.

At this point, we can take the supremum over all sources P in Λ_{Ce^{−α·}}:

  sup_{P∈Λ_{Ce^{−α·}}} ∑_{i=1}^n E_{P^n}[ 1_{X_i>M_{i−1}} g(X_i) ] ≤ (1 + o(1)) g(K_n) (1/α) ln n + Cn ∫_{K_n}^∞ g(x+1) e^{−αx} dx.

Proof of Lemma 7:

  n ∫_{K_n}^∞ l_E(x+2) e^{−αx} dx
    ≤ n ∫_{K_n}^∞ ( log(x+2) + 2 log log(x+3) + 1 ) e^{−αx} dx
    ≤ n e^{−αK_n} log(K_n + 3) ∫_{K_n+3}^∞ ( (log x + 2 log log x + 1) / log(K_n + 3) ) e^{−α(x−K_n−3)} dx
    ≤ e^α log(K_n + 3) ( sup_{x≥K_n+3} (log x + 2 log log x + 1)/log x ) ∫_{K_n+3}^∞ ( log x / log(K_n + 3) ) e^{−α(x−K_n−3)} dx
    = O(log K_n) ∫_0^∞ ( 1 + log(1 + x/(K_n+3)) / log(K_n + 3) ) e^{−αx} dx
    = o(log n).

The supremum is correctly defined and bounded, because the function x ↦ (log x + 2 log log x + 1)/log x is continuous and tends to 1 as x tends to infinity.

D. Proof of Theorem 4

The message sent by the ACcode algorithm is composed of two strings C1 and C2. C1 corresponds to the part of the message encoded by the arithmetic code, with coding probability Q^{n+1}. The arithmetic code encodes a message x̃_{1:n} 0 with ⌈−log Q^{n+1}(x̃_{1:n} 0)⌉ + 1 bits.
We have

  E_{P^n}[ −log Q^{n+1}(0 | X_{1:n}) ] = E_{P^n}[ log(M_n + 1 + 2n) ] ≤ log(2n) + E_{P^n}[M_n + 1]/(2n) = O(log n),

thanks to Lemma 4. Therefore the redundancy of ACcode can be upper bounded, for all n ≥ 2, by

  sup_{P∈Λ_{Ce^{−α·}}} ( E_{P^n}[ l(C1) + l(C2) ] − H(P^n) ) ≤ sup_{P∈Λ_{Ce^{−α·}}} E_{P^n}[ −log Q^n(X̃_{1:n}) − H(P^n) ] + sup_{P∈Λ_{Ce^{−α·}}} E_{P^n}[ l(C2) ] + O(log n).

We conclude thanks to Propositions 4 and 5.

ACKNOWLEDGMENT

I wish to thank Elisabeth Gassiat for her helpful advice, for the many ideas she suggested for this paper, and for her constant availability to answer my questions. Thanks also to Aurélien Garivier and to Stéphane Boucheron for the useful discussions we had. I also owe ideas to my classmates, especially Rafik Imekraz.

REFERENCES

[1] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.
[2] R. G. Gallager, Information Theory and Reliable Communication. New York: Wiley, 1968.
[3] L. D. Davisson and A. Leon-Garcia, "A source matching approach to finding minimax codes," IEEE Trans. Inf. Theory, vol. 26, pp. 166–174, 1980.
[4] D. Haussler, "A general minimax result for relative entropy," University of California, UC Santa Cruz, CA 96064, Tech. Rep. UCSC-CRL-96-26, 1996.
[5] R. E. Krichevsky and V. K. Trofimov, "The performance of universal encoding," IEEE Trans. Inf. Theory, vol. 27, no. 2, pp. 199–207, 1981.
[6] Q. Xie and A. R. Barron, "Minimax redundancy for the class of memoryless sources," IEEE Trans. Inf. Theory, vol. 43, no. 2, pp. 646–657, 1997.
[7] ——, "Asymptotic minimax regret for data compression, gambling, and prediction," IEEE Trans. Inf. Theory, vol. 46, no. 2, pp. 431–445, 2000.
[8] A. R. Barron, J. Rissanen, and B. Yu, "The minimum description length principle in coding and modeling," IEEE Trans. Inf. Theory, vol. 44, no. 6, pp.
2743–2760, 1998.
[9] O. Catoni, Statistical Learning Theory and Stochastic Optimization, ser. Lecture Notes in Mathematics. Springer-Verlag, 2001, vol. 1851, École d'Été de Probabilités de Saint-Flour XXXI.
[10] M. Drmota and W. Szpankowski, "Precise minimax redundancy and regret," IEEE Trans. Inf. Theory, vol. 50, no. 11, pp. 2686–2707, 2004.
[11] B. S. Clarke and A. R. Barron, "Information-theoretic asymptotics of Bayes methods," IEEE Trans. Inf. Theory, vol. 36, no. 3, pp. 453–471, 1990.
[12] ——, "Jeffreys' prior is asymptotically least favorable under entropy risk," J. Statist. Plann. Inference, vol. 41, pp. 37–60, 1994.
[13] A. R. Barron, "Information-theoretic characterization of Bayes performance and the choice of priors in parametric and nonparametric problems," in Bayesian Statistics, J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Eds. Oxford Univ. Press, 1998, vol. 6, pp. 27–52.
[14] L. D. Davisson, "Minimax noiseless universal coding for Markov sources," IEEE Trans. Inf. Theory, vol. 29, no. 2, pp. 211–214, 1983.
[15] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens, "The context-tree weighting method: Basic properties," IEEE Trans. Inf. Theory, vol. 41, no. 3, pp. 653–664, 1995.
[16] K. Atteson, "The asymptotic redundancy of Bayes rules for Markov chains," IEEE Trans. Inf. Theory, vol. 45, no. 6, pp. 2104–2109, 1999.
[17] J. Rissanen, "Universal coding, information, prediction, and estimation," IEEE Trans. Inf. Theory, vol. 30, no. 4, pp. 629–636, 1984.
[18] I. Csiszár and P. C. Shields, "Redundancy rates for renewal and other processes," IEEE Trans. Inf. Theory, vol. 42, no. 6, pp. 2065–2072, 1996.
[19] P. Flajolet and W. Szpankowski, "Analytic variations on redundancy rates of renewal processes," IEEE Trans. Inf. Theory, vol. 48, no. 11, pp. 2911–2921, 2002.
[20] P. C.
Shields, "Universal redundancy rates do not exist," IEEE Trans. Inf. Theory, vol. 39, no. 2, pp. 520–524, 1993.
[21] J. C. Kieffer, "A unified approach to weak universal source coding," IEEE Trans. Inf. Theory, vol. 24, no. 6, pp. 674–682, 1978.
[22] L. Györfi, I. Páli, and E. C. van der Meulen, "There is no universal source code for an infinite source alphabet," IEEE Trans. Inf. Theory, vol. 40, no. 1, pp. 267–271, 1994.
[23] ——, "On universal noiseless source coding for infinite source alphabets," Eur. Trans. Telecom., vol. 4, pp. 4–16, 1993.
[24] A. Orlitsky, N. P. Santhanam, and J. Zhang, "Universal compression of memoryless sources over unknown alphabets," IEEE Trans. Inf. Theory, vol. 50, no. 7, pp. 1469–1481, 2004.
[25] ——, "Speaking of infinity [i.i.d. strings]," IEEE Trans. Inf. Theory, vol. 50, no. 10, pp. 2215–2230, 2004.
[26] N. Jevtic, A. Orlitsky, and N. P. Santhanam, "A lower bound on compression of unknown alphabets," Theor. Comput. Sci., vol. 332, no. 1-3, pp. 293–311, 2005.
[27] A. Orlitsky, N. P. Santhanam, K. Viswanathan, and J. Zhang, "Limit results on pattern entropy," IEEE Trans. Inf. Theory, vol. 52, no. 7, pp. 2954–2964, 2006.
[28] G. I. Shamir and D. J. J. Costello, "On the entropy rate of pattern processes," IEEE Trans. Inf. Theory, vol. 50, no. 8, pp. 1620–1635, 2004.
[29] G. I. Shamir, "Sequential universal lossless techniques for compression of patterns and their description length," in Data Compression Conference, 2004, pp. 419–428.
[30] G. M. Gemelos and T. Weissman, "On the entropy rate of pattern processes," IEEE Trans. Inf. Theory, vol. 52, no. 9, pp. 3994–4007, 2006.
[31] A. Garivier, "A lower bound for the maximin redundancy in pattern coding," Entropy, vol. 11, no. 4, pp. 634–642, 2009.
[32] Y. Choi and W. Szpankowski, "Pattern matching in constrained sequences," in 2007 Int. Symp.
Information Theory, Nice, 2007, pp. 2606–2610.
[33] P. Elias, "Universal codeword sets and representations of the integers," IEEE Trans. Inf. Theory, vol. 21, no. 2, pp. 194–203, 1975.
[34] D.-k. He and E.-h. Yang, "The universality of grammar-based codes for sources with countably infinite alphabets," IEEE Trans. Inf. Theory, vol. 51, no. 11, pp. 3753–3765, 2005.
[35] D. P. Foster, R. A. Stine, and A. J. Wyner, "Universal codes for finite sequences of integers drawn from a monotone distribution," IEEE Trans. Inf. Theory, vol. 48, no. 6, pp. 1713–1720, 2002.
[36] S. Boucheron, A. Garivier, and E. Gassiat, "Coding on countably infinite alphabets," IEEE Trans. Inf. Theory, vol. 55, no. 1, pp. 358–373, 2009.
[37] D. Haussler and M. Opper, "Mutual information, metric entropy and cumulative relative entropy risk," Ann. Statist., vol. 25, no. 6, pp. 2451–2492, 1997.
[38] A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes, ser. Springer Series in Statistics. Springer-Verlag, 1996.
[39] D. Bontemps, "Redondance bayésienne et minimax, sources stationnaires sans mémoire en alphabet infini," Master's thesis, Paris-Sud 11 Univ., Sep. 2007. [Online]. Available: http://www.math.u-psud.fr/~bontemps/Documents/rapportM2.pdf
[40] E. T. Whittaker and G. N. Watson, A Course of Modern Analysis, 4th ed., ser. Cambridge Mathematical Library. Cambridge: Cambridge Univ. Press, 1927, reprinted 1990.
