Universal Coding on Infinite Alphabets: Exponentially Decreasing Envelopes


Authors: Dominique Bontemps (LM-Orsay)

Abstract—This paper deals with the problem of universal lossless coding on a countably infinite alphabet. It focuses on classes of sources defined by an envelope condition on the marginal distribution, namely exponentially decreasing envelope classes with exponent $\alpha$. The minimax redundancy of exponentially decreasing envelope classes is proved to be equivalent to $\frac{1}{4\alpha \log e} \log^2 n$. Then, an adaptive algorithm is proposed, whose maximum redundancy is equivalent to the minimax redundancy.

Index Terms—Data compression, universal coding, infinite countable alphabets, redundancy, Bayes mixture, adaptive compression.

I. INTRODUCTION

Compression of data is broadly used in our daily life: from the movies we watch to the office documents we produce. In this article, we are interested in lossless data compression on an unknown alphabet. This has applications in areas such as language modeling or lossless multimedia codecs.

First, we briefly present the problem of data compression. More details are available in general textbooks, like [1]. Then we give a short review of previous results, in which we situate the topic of this article, exponentially decreasing envelope classes, and we announce our results.

A. Lossless data compression

Consider a finite or countably infinite alphabet $\mathcal{X}$. A source on $\mathcal{X}$ is a probability distribution $P$ on the set $\mathcal{X}^{\mathbb{N}}$ of infinite sequences of symbols from $\mathcal{X}$. Its marginal distributions are denoted by $P^n$, $n \ge 1$ (for $n = 1$, we simply write $P$). The goal of lossless data compression is to encode a sequence of symbols $X_{1:n}$, generated according to $P^n$, into a sequence of bits that is as short as possible. The algorithm has to be uniquely decodable.
The binary entropy $H(P^n) = \mathbb{E}_{P^n}[-\log_2 P^n(X_{1:n})]$ is known to be a lower bound on the expected codelength of $X_{1:n}$. From now on, $\log$ denotes the logarithm to base 2, while $\ln$ denotes the natural logarithm. Since arithmetic coding based on $P^n$ encodes a message $x_{1:n}$ with $\lceil -\log P^n(x_{1:n}) \rceil + 1$ bits, this lower bound can be achieved within two bits. The expected redundancy then measures the mean number of extra bits, in addition to the entropy, that a coding strategy uses to encode $X_{1:n}$. In the sequel, we use the word redundancy instead of expected redundancy.

(D. Bontemps holds a PhD fellowship at Laboratoire de Mathématiques, Univ. Paris-Sud 11, Orsay, 91405 France.)

Furthermore, together with the Kraft-McMillan inequality, arithmetic coding provides an almost perfect correspondence between coding algorithms and probability distributions on $\mathcal{X}^n$. In this setting, if an algorithm is associated with the probability distribution $Q^n$, its expected redundancy reduces to the Kullback-Leibler divergence between $P^n$ and $Q^n$:
$$D(P^n; Q^n) = \mathbb{E}_{P^n}\left[ \log \frac{P^n(X_{1:n})}{Q^n(X_{1:n})} \right].$$
We call this quantity the (expected) redundancy of the distribution $Q^n$ (with respect to $P^n$).

Unfortunately, the true statistics of the source are not known in general, but $P^n$ is supposed to belong to some large class of sources $\Lambda$ (for instance, the class of all iid sources, or the class of Markov sources). In this paper, the maximum redundancy
$$R_n(Q^n; \Lambda) = \sup_{P \in \Lambda} R_n(Q^n; P^n)$$
measures how well a coding probability $Q^n$ behaves on an entire class $\Lambda$. With this point of view, the best coding probability is a minimax coding probability, which achieves the minimax redundancy
$$R_n(\Lambda) = \inf_{Q^n} R_n(Q^n; \Lambda).$$
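As a toy illustration of this correspondence (not taken from the paper), the redundancy of a mismatched coding distribution can be computed directly as a Kullback-Leibler divergence; the distributions below are hypothetical.

```python
import math

def kl_divergence_bits(p, q):
    """Kullback-Leibler divergence D(P; Q) in bits, for distributions given
    as dicts mapping symbols to probabilities (every symbol with P-mass
    must also have Q-mass)."""
    return sum(pk * math.log2(pk / q[k]) for k, pk in p.items() if pk > 0)

# Hypothetical source P and coding distribution Q on the alphabet {1, 2, 3}.
P = {1: 0.5, 2: 0.25, 3: 0.25}
Q = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}  # a uniform coder, mismatched to P

redundancy = kl_divergence_bits(P, Q)  # average extra bits per symbol
```

Coding with this uniform $Q$ instead of $P$ costs about $0.085$ extra bit per symbol; coding with $P$ itself would cost none.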
Another way to measure the ability of a class of sources to be efficiently encoded is the Bayes redundancy
$$R_{n,\mu}(\Lambda) = \inf_{Q^n} \int_\Lambda R_n(Q^n; P^n)\, d\mu(P),$$
where $\mu$ is a prior distribution on $\Lambda$, endowed with the topology of weak convergence and the Borel $\sigma$-field. Only one coding strategy achieves the Bayes redundancy: the Bayes mixture
$$M_{n,\mu}(x_{1:n}) = \int_\Lambda P^n(x_{1:n})\, d\mu(P).$$

When $\Lambda$ is a class of iid sources on the set $\mathcal{X} = \mathbb{N}^* = \mathbb{N} \setminus \{0\}$, there is a natural parametrization of $\Lambda$ by $P_\theta(j) = \theta_j$, with $\theta = (\theta_1, \theta_2, \ldots) \in \Theta_\Lambda$. $\Theta_\Lambda$ is then a subset of
$$\Theta = \Big\{ \theta = (\theta_1, \theta_2, \ldots) \in [0,1]^{\mathbb{N}} : \sum_{i \ge 1} \theta_i = 1 \Big\},$$
and it is endowed with the topology of pointwise convergence. In this case we regard $\mu$ as a prior on $\Theta_\Lambda$.

Minimax redundancy and Bayes redundancy are linked by an important relation [2], [3]; it is stated here in the context of iid sources on a finite or countably infinite alphabet, but Haussler [4] has shown that it can be generalized to all classes of stationary ergodic processes on a complete separable metric space.

Theorem 1: Let $\Lambda$ be a class of iid sources, such that the parameter set $\Theta_\Lambda$ is a measurable subset of $\Theta$. Let $n \ge 1$. Then
$$R_n(\Lambda) = \sup_\mu R_{n,\mu}(\Lambda),$$
where the supremum is taken over all (Borel) probability measures on $\Theta_\Lambda$.

The quantity $\sup_\mu R_{n,\mu}(\Lambda)$ is called the maximin redundancy. A prior whose Bayes redundancy equals the maximin redundancy is said to be maximin, or least favorable. Theorem 1 says that maximin redundancy and minimax redundancy coincide. It provides a tool to compute the minimax redundancy.

Before discussing known results, let us mention two other notions. From an asymptotic point of view, a sequence of coding probabilities $(Q^n)_{n \ge 1}$ is said to be weakly universal if the per-symbol redundancy tends to 0 on $\Lambda$:
$$\sup_{P \in \Lambda} \lim_{n \to \infty} \frac{1}{n} D(P^n; Q^n) = 0.$$
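The Bayes mixture $M_{n,\mu}$ defined above can be made concrete on the simplest possible case: iid Bernoulli sources on a binary alphabet (not the countable alphabet studied in this paper) with a uniform prior on the parameter, where the mixture has a closed form. This is a standard textbook illustration, not part of the paper.

```python
from itertools import product
from math import comb

def bayes_mixture_uniform(x):
    """Bayes mixture probability M_{n,mu}(x) of a binary sequence x under a
    uniform prior mu on the Bernoulli parameter theta: integrating
    theta^k (1 - theta)^(n - k) over [0, 1] gives k!(n-k)!/(n+1)!,
    i.e. 1 / ((n + 1) * C(n, k)) where k is the number of ones."""
    n, k = len(x), sum(x)
    return 1.0 / ((n + 1) * comb(n, k))

# The mixture is a genuine coding distribution: masses sum to 1 over {0,1}^n.
total = sum(bayes_mixture_uniform(x) for x in product((0, 1), repeat=4))
```

Every sequence with the same number of ones receives the same mass, as the exchangeability of the mixture requires.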
Instead of the expected redundancy, many authors consider individual sequences. In this case, the minimax regret
$$R_n^*(\Lambda) = \inf_{Q^n} \sup_{P \in \Lambda} \sup_{x_{1:n} \in \mathcal{X}^n} \log \frac{P^n(x_{1:n})}{Q^n(x_{1:n})}$$
plays the role that the minimax redundancy plays with the expected redundancy.

B. Exponentially decreasing envelope classes

In the case of a finite alphabet of size $k$, many classes of sources have been studied in the literature, for which estimates of the redundancy have been provided. In particular we have the class of all iid sources (see [5]–[10], and references therein), whose minimax redundancy is
$$\frac{k-1}{2} \log \frac{n}{2\pi e} + \log \frac{\Gamma(1/2)^k}{\Gamma(k/2)} + o(1).$$
This class can be seen as a particular case of a $(k-1)$-dimensional class of iid sources on a (possibly) bigger alphabet, for which a similar result holds under certain conditions (see [11]–[13]). Similar results are also available for classes of Markov processes and finite memory tree sources on a finite alphabet (see [5], [14]–[16]), and for $k$-dimensional classes of even non-iid sources on an arbitrary alphabet (see [17]).

The results become less precise when one considers infinite dimensional classes on a finite alphabet. A typical example is the class of renewal processes, for which no equivalent of the expected redundancy is known, but which is lower and upper bounded by a constant times $\sqrt{n}$ (see [18], [19]). Finally, it is well known that the class of stationary ergodic sources on a finite alphabet is weakly universal (see [1]). However, Shields [20] showed that this class does not admit non-trivial universal redundancy rates.

In the case of a countably infinite alphabet, the situation is significantly different. Even the class of all iid sources is not weakly universal (see [21], [22]).
Kieffer characterized weakly universal classes in [21] (see also [22], [23]):

Proposition 1: A class $\Lambda$ of stationary sources on $\mathbb{N}^*$ is weakly universal if and only if there exists a probability distribution $Q$ on $\mathbb{N}^*$ such that for every $P \in \Lambda$, $D(P; Q) < \infty$.

In the literature, we find two main ways to deal with infinite alphabets. The first one [24]–[32] separates the message into two parts: a description of the symbols appearing in the message, and the pattern they form. Then the compression of patterns is studied. A second approach [23], [33]–[36] studies collections of sources satisfying Kieffer's condition, and proposes compression algorithms for these classes. A result from [36] points in this direction:

Proposition 2: Let $\Lambda$ be a class of iid sources over $\mathbb{N}^*$. Let the envelope function $f$ be defined by $f(x) = \sup_{P \in \Lambda} P(x)$. Then the minimax regret satisfies
$$R_n^*(\Lambda) < \infty \iff \sum_{x \in \mathbb{N}^*} f(x) < \infty.$$

It is therefore quite natural to consider classes of iid sources with envelope conditions on the marginal distribution. In this article we study specific classes of iid sources introduced by [36], called exponentially decreasing envelope classes.

Definition 1: Let $C$ and $\alpha$ be positive numbers satisfying $C > e^{2\alpha}$. The exponentially decreasing envelope class $\Lambda_{C e^{-\alpha \cdot}}$ is the class of sources defined by
$$\Lambda_{C e^{-\alpha \cdot}} = \{ P : \forall k \ge 1,\ P(k) \le C e^{-\alpha k},\ \text{and } P \text{ is stationary and memoryless} \}.$$

The first condition mainly concerns the tail of the distribution of $X_1$; it means that large symbols must be rare enough. It does not mean that the distribution is geometric: if $C$ is big enough, many other distributions are possible. Furthermore, we will see that the exact value of $C$, unlike that of $\alpha$, does not significantly change the minimax redundancy.
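The envelope condition of Definition 1 is easy to test numerically. The sketch below (with illustrative parameters, not from the paper) checks membership for a geometric-type distribution dominated by the envelope, and for a distribution that puts too much mass on a large symbol.

```python
import math

def in_envelope_class(p, C, alpha, tol=1e-12):
    """Check the envelope condition of Definition 1: P(k) <= C e^{-alpha k}
    for every symbol k carrying mass (p is a dict on the positive integers)."""
    return all(pk <= C * math.exp(-alpha * k) + tol for k, pk in p.items())

# Illustrative parameters with C > e^{2 alpha}, as Definition 1 requires.
C, alpha = 10.0, 0.5

# A geometric-type distribution dominated by the envelope: with
# q = e^{-alpha}/2, P(k) = (1-q) q^{k-1} <= (1-q) e^{alpha} e^{-alpha k}.
q = math.exp(-alpha) / 2
geom = {k: (1 - q) * q ** (k - 1) for k in range(1, 200)}

# A distribution violating the condition: 0.5 > 10 e^{-10} at symbol 20.
bad = {1: 0.5, 20: 0.5}

member = in_envelope_class(geom, C, alpha)   # True
violator = in_envelope_class(bad, C, alpha)  # False
```

This also illustrates the remark above: many non-geometric distributions fit under the envelope once $C$ is large.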
Since this paper deals only with exponentially decreasing envelope classes, we simplify the notations $R_n(Q^n; \Lambda_{C e^{-\alpha \cdot}})$, $R_n(\Lambda_{C e^{-\alpha \cdot}})$, and $R_{n,\mu}(\Lambda_{C e^{-\alpha \cdot}})$ into $R_n(Q^n; C, \alpha)$, $R_n(C, \alpha)$, and $R_{n,\mu}(C, \alpha)$ respectively. The subset of $\Theta$ corresponding to $\Lambda_{C e^{-\alpha \cdot}}$ is denoted by
$$\Theta_{C,\alpha} = \Big\{ \theta = (\theta_1, \theta_2, \ldots) \in [0,1]^{\mathbb{N}} : \sum_{i \ge 1} \theta_i = 1 \text{ and } \forall i \ge 1,\ \theta_i \le C e^{-\alpha i} \Big\}. \quad (1)$$

We present two main results about these classes. In Section II we calculate the minimax redundancy of exponentially decreasing envelope classes, and we find that it is equivalent to $\frac{1}{4\alpha \log e} \log^2 n$ as $n$ tends to infinity. This rate is interesting for two main reasons. To our knowledge, exponentially decreasing envelope classes are the first family of classes on an infinite alphabet for which an equivalent of the minimax redundancy is known. Moreover, even the rate is new: until now only rates in $\log n$ or $\sqrt{n}$ had been obtained.

Once the minimax redundancy of a class of sources is known, we are interested in finding a minimax coding algorithm. Section III proposes a new adaptive coding algorithm, and we show that its maximum redundancy is equivalent to the minimax redundancy of exponentially decreasing envelope classes. Finally, the Appendix contains some proofs and some auxiliary results used in the main analysis.

II. MINIMAX REDUNDANCY

In this section we state our main result. Theorem 2 below gives an equivalent of the minimax redundancy of exponentially decreasing envelope classes. To obtain it, we use a result due to Haussler and Opper [37].

Theorem 2: Let $C$ and $\alpha$ be positive numbers satisfying $C > e^{2\alpha}$. The minimax redundancy of the exponentially decreasing envelope class $\Lambda_{C e^{-\alpha \cdot}}$ satisfies
$$R_n(C, \alpha) \underset{n \to \infty}{\sim} \frac{1}{4\alpha \log e} \log^2 n.$$

Theorem 2 improves on a previous result of [36, Theorem 7].
In that article the following bounds on the minimax redundancy of exponentially decreasing envelope classes are given:
$$\frac{1}{8\alpha \log e} \log^2 n\, (1 + o(1)) \le R_n(C, \alpha) \le \frac{1}{2\alpha \log e} \log^2 n + O(1).$$

In Subsection II-A we outline the work done in [37], and then we use it in Subsection II-B to prove Theorem 2. Finally, we discuss in Subsection II-C the adaptation of this method to other envelope classes.

A. From the metric entropy to the minimax redundancy

To study the redundancy of a class of sources, [37] considers the Hellinger distance between the first marginal distributions of each source. Bounds on the minimax redundancy are provided in terms of the metric entropy of the set of first marginal distributions, with respect to the Hellinger distance. As a consequence, this method applies only to iid sources. However, it is very efficient in the case of exponentially decreasing envelope classes.

First, we need to define the Hellinger distance and the metric entropy. In the case of sources on a countably infinite alphabet, the Hellinger distance can be defined in the following way:

Definition 2: Let $P$ and $Q$ be two probability distributions on $\mathbb{N}^*$. The Hellinger distance between $P$ and $Q$ is defined by
$$h(P, Q) = \sqrt{\sum_{k \ge 1} \left( \sqrt{P(k)} - \sqrt{Q(k)} \right)^2}.$$
A related metric can be defined on the parameter set $\Theta$:
$$d(\theta, \theta') = h(P_\theta, P_{\theta'}) = \sqrt{\sum_{k \ge 1} \left( \sqrt{\theta_k} - \sqrt{\theta'_k} \right)^2}.$$

From a metric we can define the metric entropy. We first need to define a few quantities.

Definition 3: Let $S$ be a subset of $\Theta$, and let $\epsilon$ be a positive number.
1) We denote by $D_\epsilon(S, d)$ the cardinality of the smallest finite partition of $S$ into sets of diameter at most $\epsilon$, or we set $D_\epsilon(S, d) = \infty$ if no such finite partition exists.
2) The metric entropy of $(S, d)$ is defined by $H_\epsilon(S, d) = \ln D_\epsilon(S, d)$.
3) An $\epsilon$-cover of $S$ is a subset $A \subset S$ such that, for all $x$ in $S$, there is an element $y$ of $A$ with $d(x, y) < \epsilon$. The covering number $N_\epsilon(S, d)$ is the cardinality of the smallest finite $\epsilon$-cover of $S$, or we set $N_\epsilon(S, d) = \infty$ if no finite $\epsilon$-cover exists.
4) An $\epsilon$-separated subset of $S$ is a subset $A \subset S$ such that, for all distinct $x, y$ in $A$, $d(x, y) > \epsilon$. The packing number $M_\epsilon(S, d)$ is the cardinality of the largest finite $\epsilon$-separated subset of $S$, or we set $M_\epsilon(S, d) = \infty$ if arbitrarily large $\epsilon$-separated subsets exist.

The following lemma explains how these numbers are linked. It is a classical result that can be found for instance in [38].

Lemma 1: Let $S$ be a subset of $\Theta$. For all $\epsilon > 0$,
$$M_{2\epsilon}(S, d) \le D_{2\epsilon}(S, d) \le N_\epsilon(S, d) \le M_\epsilon(S, d).$$

Lemma 1 enables us to choose the most convenient of these quantities when calculating the metric entropy. From the metric entropy one can define the notion of metric dimension, which generalizes the classical notion of dimension. More generally, the metric entropy tells us in some way how densely the elements fill a set, even an infinite dimensional one.

Another quantity that [37] uses is the minimax risk for the $(1+\lambda)$-affinity,
$$R_\lambda(\Lambda) = \inf_Q \sup_{\theta \in \Theta_\Lambda} \sum_{k \ge 1} P_\theta(k)^{1+\lambda}\, Q(k)^{-\lambda},$$
defined for all $\lambda > 0$. More details about the $(1+\lambda)$-affinity are given in [37]. See also [39], where special attention is paid to envelope classes. In the case of an envelope class $\Lambda_f$ defined by an integrable envelope function $f$, it is easy to see that $R_\lambda < \infty$ for all $\lambda > 0$. Indeed the choice
$$Q(k) = \frac{f(k)}{\sum_{l \ge 1} f(l)}$$
leads to the relation
$$R_\lambda \le \Big( \sum_{k \ge 1} f(k) \Big)^\lambda.$$

We can now state a slightly modified version of Theorem 5 of [37] in the context of data compression on an infinite alphabet.

Theorem 3: Let $\Lambda$ be a class of iid sources on $\mathbb{N}^*$, such that the parameter set $\Theta_\Lambda$ is a measurable subset of $\Theta$.
Assume that there exists $\lambda > 0$ such that $R_\lambda < \infty$. Let $h(x)$ be a continuous, non-decreasing function defined on the positive reals such that, for all $\gamma \ge 0$ and $C > 0$,
1) $\lim_{x \to \infty} \frac{h(C x (h(x))^\gamma)}{h(x)} = 1$, and
2) $\lim_{x \to \infty} \frac{h(C x (\ln x)^\gamma)}{h(x)} = 1$.
Then:
1) If $H_\epsilon(\Theta_\Lambda, d) \underset{\epsilon \to 0}{\sim} h\big( \frac{1}{\epsilon} \big)$, then
$$R_n(\Lambda) \underset{n \to \infty}{\sim} (\log e)\, h(\sqrt{n}).$$
2) If, for some $\alpha > 0$ and $c > 0$,
$$\liminf_{\epsilon \to 0} \frac{H_\epsilon(\Theta_\Lambda, d)}{(1/\epsilon)^\alpha\, h(1/\epsilon)} \ge c,$$
then
$$\liminf_{n \to \infty} \frac{R_n(\Lambda)}{n^{\alpha/(\alpha+2)} \left( h(n^{1/(\alpha+2)}) \right)^{2/(\alpha+2)}} > 0.$$
3) If, for some $\alpha > 0$ and $C > 0$,
$$\limsup_{\epsilon \to 0} \frac{H_\epsilon(\Theta_\Lambda, d)}{(1/\epsilon)^\alpha\, h(1/\epsilon)} \le C,$$
then
$$\limsup_{n \to \infty} \frac{R_n(\Lambda)}{(n \ln n)^{\alpha/(\alpha+2)} \left( h(n^{1/(\alpha+2)}) \right)^{2/(\alpha+2)}} < \infty.$$

The conditions on the function $h$ mean that $h$ cannot grow too fast. For instance, $h$ can grow like $C (\ln x)^\beta$, with $\beta \ge 0$. The first case in the theorem is the one we use for exponentially decreasing envelope classes: there, the fast decreasing envelope produces a "not too big" metric entropy. Theorem 3 then gives an equivalent of the minimax redundancy of the class of sources as $n$ goes to infinity. This turns out to be very useful, as it improves a previous result of [36]. However, it is only an asymptotic result, without any convergence rate.

The second and third items correspond to bigger classes of sources. In these cases the result is a bit less precise: it gives the growth rate of the redundancy, but without the associated constant factor.

(Footnote 1: We follow [37] in this definition of the metric entropy. Several authors use a slightly different definition, based on the covering number or the packing number.)
(Footnote 2: The separation of the upper and lower bounds has no effect on the proof given by Haussler and Opper. A complete justification is available in [39].)
Furthermore, there is a gap of $(\ln n)^{\alpha/(\alpha+2)}$ between the lower bound of point 2 and the upper bound of point 3. Nevertheless, it allows us to more or less recover a result of [36] for another type of envelope classes. We now develop these applications.

(Footnote 3: The $(\log e)$ factor comes from the use of the logarithm to base 2 in the definition of $R_n$.)

B. The minimax redundancy of exponentially decreasing envelope classes

Here we want to prove Theorem 2 by applying Theorem 3. Thus we have to calculate the metric entropy of exponentially decreasing envelope classes. This is done by Proposition 3:

Proposition 3: Let $C$ and $\alpha$ be positive numbers satisfying $C > e^{2\alpha}$. The metric entropy of the parameter set $\Theta_{C,\alpha}$ satisfies
$$H_\epsilon(\Theta_{C,\alpha}, d) = (1 + o(1)) \frac{1}{\alpha} \ln^2(1/\epsilon),$$
where $o(1)$ is a function $g(\epsilon)$ such that $g(\epsilon) \to 0$ as $\epsilon \to 0$.

Proof of Theorem 2: Just apply Theorem 3, with $h(x) = \frac{1}{\alpha} \ln^2(x)$, to get the result.

Proof of Proposition 3: Outlines of the proofs of the next lemmas can be found in Appendix A. The corollaries are simple applications, and their proofs are skipped.

We start with general considerations. Let $\Lambda_f$ be the envelope class defined by the integrable envelope function $f$. Let $\Theta_f$ be the corresponding parameter set
$$\Theta_f = \Big\{ \theta = (\theta_1, \theta_2, \ldots) \in [0,1]^{\mathbb{N}} : \sum_{i \ge 1} \theta_i = 1 \text{ and } \forall i \ge 1,\ \theta_i \le f(i) \Big\}.$$
The function $\theta \mapsto (\sqrt{\theta_1}, \sqrt{\theta_2}, \ldots)$ is an isometry between the metric space $(\Theta_f, d)$ and the subset $A_f \cap \{ \|x\| = 1 \}$ of $\ell^2$, equipped with the classical Euclidean norm $\| \cdot \|$, where $A_f$ is defined by
$$A_f = \{ (x_k)_{k \in \mathbb{N}^*} \in \ell^2 : \forall k \in \mathbb{N}^*,\ 0 \le x_k \le \sqrt{f(k)} \}. \quad (2)$$
The metric entropy of $(\Theta_f, d)$ can be calculated in this space. Next we truncate some coordinates, so as to work in a finite dimensional space instead of $\ell^2$.
Together with an adequate use of Lemma 1, this helps us to obtain upper and lower bounds for the metric entropy of $(\Theta_f, d)$. We start with the upper bound.

Lemma 2: Let $\Lambda_f$ be the envelope class defined by the integrable envelope function $f$, and let $\epsilon$ be a positive number. Let $N_\epsilon$ denote the integer
$$N_\epsilon = \inf\Big\{ n \ge 1 : \sum_{k \ge n+1} f(k) \le \frac{\epsilon^2}{16} \Big\}.$$
For $U \in \mathbb{R}^N$ and $a > 0$, let $B_{\mathbb{R}^N}(U, a)$ denote the ball in $\mathbb{R}^N$ with center $U$ and radius $a$. Then
$$H_\epsilon(\Theta_f, d) \le N_\epsilon \ln(1/\epsilon) + 3 N_\epsilon \ln 2 + A(N_\epsilon) + B(\epsilon),$$
where
$$A(N) = -\ln \mathrm{Vol}\big( B_{\mathbb{R}^N}(0, 1) \big) = \ln \frac{\Gamma(\frac{N}{2} + 1)}{\pi^{N/2}} \quad \text{and} \quad B(\epsilon) = \sum_{k=1}^{N_\epsilon} \ln\Big( \sqrt{f(k)} + \frac{\epsilon}{4} \Big).$$
Furthermore, $A(N_\epsilon) \underset{\epsilon \to 0}{\sim} \frac{N_\epsilon}{2} \ln N_\epsilon$.

Note that
$$-N_\epsilon \ln(1/\epsilon) - 2 N_\epsilon \ln 2 \le B(\epsilon) \le \frac{\epsilon}{4} N_\epsilon.$$
These bounds on $B(\epsilon)$ show that $B(\epsilon)$ tends to decrease the upper bound, while $A(N_\epsilon)$ contributes to its growth. If $\ln N_\epsilon$ behaves like $\ln(1/\epsilon)$ up to a constant factor, then the upper bound of Lemma 2 is of the order of a constant times $N_\epsilon \ln N_\epsilon$, and we are concerned with point 3 of Theorem 3.

If we apply Lemma 2 to the case of exponentially decreasing envelope classes, we obtain the upper bounds
$$N_\epsilon \le \frac{2}{\alpha} \ln(1/\epsilon) + \frac{1}{\alpha} \ln \frac{16 C}{1 - e^{-\alpha}},$$
$$B(\epsilon) \le -\frac{\alpha}{4} N_\epsilon^2 + \left( \frac{\ln C}{2} + \frac{1}{\sqrt{1 - e^{-\alpha}}} \right) N_\epsilon.$$
This leads to

Corollary 1: Let $C$ and $\alpha$ be positive numbers satisfying $C > e^{2\alpha}$. The metric entropy of the parameter set $\Theta_{C,\alpha}$ defined by (1) satisfies
$$H_\epsilon(\Theta_{C,\alpha}, d) \le (1 + o(1)) \frac{1}{\alpha} \ln^2(1/\epsilon),$$
where $o(1)$ is a function $g(\epsilon)$ such that $g(\epsilon) \to 0$ as $\epsilon \to 0$.

Now we need a lower bound on the metric entropy. In this case too, we truncate some coordinates so as to work in a smaller finite dimensional space; this time we truncate the first coordinates. Let us consider the number
$$l_f = \min\Big\{ l \ge 0 : \sum_{k \ge l+1} f(k) \le 1 \Big\}.$$
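Two ingredients of this computation can be checked numerically: the Hellinger-type metric of Definition 2, and the truncation threshold $N_\epsilon$ of Lemma 2 together with its upper bound for an exponential envelope. The sketch below uses illustrative parameters and a brute-force tail sum, not the paper's derivation.

```python
import math

def hellinger(p, q):
    """Hellinger distance of Definition 2 between distributions on N*,
    given as dicts {symbol: probability}."""
    support = set(p) | set(q)
    return math.sqrt(sum((math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2
                         for k in support))

def n_eps(f, eps, cutoff=100_000):
    """Smallest n >= 1 with sum_{k >= n+1} f(k) <= eps^2 / 16 (Lemma 2),
    computed by brute force with the tail truncated at `cutoff`."""
    tail = sum(f(k) for k in range(2, cutoff))   # tail for n = 1
    n = 1
    while tail > eps ** 2 / 16:
        n += 1
        tail -= f(n)
    return n

# The Hellinger distance is bounded by sqrt(2), attained on disjoint supports.
assert abs(hellinger({1: 1.0}, {2: 1.0}) - math.sqrt(2)) < 1e-12

# Exponential envelope f(k) = min(1, C e^{-alpha k}), illustrative C and alpha.
C, alpha = 10.0, 0.5
f = lambda k: min(1.0, C * math.exp(-alpha * k))
eps = 0.01
n = n_eps(f, eps)
# Upper bound on N_eps quoted above for this envelope:
bound = (2 / alpha) * math.log(1 / eps) \
    + (1 / alpha) * math.log(16 * C / (1 - math.exp(-alpha)))
```

The brute-force value of $N_\epsilon$ lands just below the analytic bound, as expected since the bound solves the tail inequality exactly up to integer rounding.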
Lemma 3: Let $\Lambda_f$ be the envelope class defined by an integrable envelope function $f$ satisfying $\sum_{k \ge 1} f(k) \ge 2$. Let $\epsilon > 0$, and let $m \ge 1$ be an integer. Then
$$H_\epsilon(\Theta_f, d) \ge \frac{1}{2} \sum_{k = l_f + 1}^{l_f + m} \ln f(k) + m \ln\Big( \frac{1}{\epsilon} \Big) + A(m),$$
where $A(m)$ is defined as in Lemma 2:
$$A(m) = -\ln \mathrm{Vol}\big( B_{\mathbb{R}^m}(0, 1) \big) \underset{m \to \infty}{\sim} \frac{m}{2} \ln m.$$

Note that exponentially decreasing envelopes satisfy the condition $\sum_{k \ge 1} f(k) \ge 2$. Indeed the envelope of exponentially decreasing envelope classes is $f(k) = \min(1, C e^{-\alpha k})$, and the condition $C > e^{2\alpha}$ entails $f(1) = f(2) = 1$. From Lemma 3 we can infer the following corollary, with the choice $m = \big\lceil \frac{2}{\alpha} \ln\big( \frac{1}{\epsilon} \big) \big\rceil$.

Corollary 2: Let $C$ and $\alpha$ be positive numbers satisfying $C > e^{2\alpha}$. The metric entropy of the parameter set $\Theta_{C,\alpha}$ satisfies
$$H_\epsilon(\Theta_{C,\alpha}, d) \ge (1 + o(1)) \frac{1}{\alpha} \ln^2(1/\epsilon),$$
where $o(1)$ is a function $g(\epsilon)$ such that $g(\epsilon) \to 0$ as $\epsilon \to 0$.

Note that this lower bound matches the upper bound of Corollary 1, which concludes the proof of Proposition 3.

C. What about other envelope classes?

In [36] the redundancy of another type of envelope classes is also studied. The power-law envelope class $\Lambda_{C \cdot^{-\alpha}}$ is defined, for $C > 1$ and $\alpha > 1$, by the envelope function $f_{\alpha,C}(x) = \min(1, C x^{-\alpha})$. The bounds obtained in [36, Theorem 6] are
$$A(\alpha)\, n^{1/\alpha} \log \lfloor C \zeta(\alpha) \rfloor \le R_n(\Lambda_{C \cdot^{-\alpha}}) \le \left( \frac{2 C n}{\alpha - 1} \right)^{1/\alpha} (\log n)^{1 - 1/\alpha} + O(1), \quad (3)$$
where
$$A(\alpha) = \frac{1}{\alpha} \int_1^\infty \frac{1 - e^{-1/(\zeta(\alpha) u)}}{u^{1 - 1/\alpha}}\, du,$$
and $\zeta$ denotes the classical function $\zeta(\alpha) = \sum_{k \ge 1} \frac{1}{k^\alpha}$, for $\alpha > 1$.

If one adapts the computation made earlier to power-law envelope classes, one obtains the following upper and lower bounds: there are two (computable) constants $K_1, K_2 > 0$ such that, for all $\epsilon > 0$,
$$K_1 \Big( \frac{1}{\epsilon} \Big)^{\frac{2}{\alpha - 1}} \le H_\epsilon \le K_2 (1 + o(1)) \Big( \frac{1}{\epsilon} \Big)^{\frac{2}{\alpha - 1}} \ln\Big( \frac{1}{\epsilon} \Big).$$
Unfortunately this formula leaves a gap between the lower bound and the upper bound, and the application of Theorem 3 makes the gap worse. Indeed the polynomial part $(1/\epsilon)^{2/(\alpha-1)}$ of the metric entropy causes an additional gap of $\log^{1/\alpha} n$. In practice the bounds are the following: there are two (unknown) constants $C, c > 0$ such that, for all $n \ge 1$,
$$c\,(1 + o(1))\, n^{1/\alpha} \le R_n(\Lambda_{C \cdot^{-\alpha}}) \le C\,(1 + o(1))\, n^{1/\alpha} \log n. \quad (4)$$
These inequalities do not improve the result of [36] in any way. Might a better calculation of the metric entropy improve either their lower bound or their upper bound? In any case, the metric entropy of power-law envelope classes is "too big" for Theorem 3 to be applied efficiently: it leaves no hope for an equivalence, as was obtained for exponentially decreasing envelope classes. To summarize, the strategy based on the metric entropy and Theorem 3 turns out to be efficient for "small" classes of sources.

III. AUTOCENSURING CODE

This section presents a new algorithm called AutoCensuring Code (ACcode). It is in fact a modification of the Censuring Code proposed by Boucheron, Garivier and Gassiat in [36]. We keep the idea that big symbols are very few, and must be encoded differently, with an Elias code. Smaller symbols are encoded by arithmetic coding based on Krichevsky-Trofimov mixtures, which are known to be effective for finite alphabets. Our innovation is a data-driven cutoff $M_i = \sup_{1 \le k \le i} X_k$ used to encode $X_{i+1}$: with this choice we do not need to know the exact parameters of the exponentially decreasing envelope. ACcode is a prefix code on the set of all finite length messages, and it works online. Its maximum redundancy on an exponentially decreasing envelope class $\Lambda_{C e^{-\alpha \cdot}}$ is equivalent to the minimax redundancy of this class of sources.
Furthermore, ACcode is adaptive, as the same algorithm achieves this property on all exponentially decreasing envelope classes. This is formulated in the following theorem, proved in Appendix B. Let $\mathrm{ACcode}(x_{1:n})$ denote the binary string produced by ACcode when it encodes the message $x_{1:n}$, and let $l(\cdot)$ denote the length of a string.

Theorem 4: For any positive numbers $C$ and $\alpha$ satisfying $C > e^{2\alpha}$,
$$\sup_{P \in \Lambda_{C e^{-\alpha \cdot}}} \mathbb{E}_{P^n}\big[ l(\mathrm{ACcode}(X_{1:n})) - H(P^n) \big] \underset{n \to \infty}{\sim} R_n(C, \alpha).$$

The difference between the redundancy of ACcode and the minimax redundancy is not necessarily bounded: there may exist codes whose redundancy is smaller than the redundancy of ACcode, but the benefit is negligible with respect to $\log^2 n$. Additionally, Theorem 4 enables us to recover the upper bound on the minimax redundancy obtained in Section II.

Let us now define ACcode. Let $n \ge 1$ be some positive integer, and let $x_{1:n} = x_1 x_2 \ldots x_n$ be a string from $\mathbb{N}_*^n$ to be encoded. We define the sequence of maxima
$$m_0 = 0 \quad \text{and} \quad m_i = \sup_{1 \le k \le i} x_k, \quad \text{for all } 1 \le i \le n.$$
The sequence $(m_i)_{1 \le i \le n}$ is non-decreasing and piecewise constant. For $1 \le i \le n$, let $n^0_i = \sum_{j=1}^i 1_{m_j > m_{j-1}}$ be the number of plateaus between 1 and $i$. For $1 \le k \le n^0_n$, let $\widetilde{m}_k$ be the $k$th new maximum:
$$m_i = \widetilde{m}_{n^0_i}. \quad (5)$$
We also define $\widetilde{m}_0 = 0$. Let the string $\widetilde{m}$ be the sequence
$$(\widetilde{m}_1 - \widetilde{m}_0 + 1), \ldots, (\widetilde{m}_{n^0_n} - \widetilde{m}_{n^0_n - 1} + 1), 1.$$
$\widetilde{m}$ is encoded into a binary string C2 by applying the Elias penultimate code (see [33]) to each number in $\widetilde{m}$. It is a prefix code which uses $l_E(x)$ bits to encode a positive integer $x$, with
$$l_E(1) = 1, \qquad l_E(x) = 1 + \lfloor \log x \rfloor + 2 \lfloor \log(\lfloor \log x \rfloor + 1) \rfloor \quad \text{if } x \ge 2. \quad (6)$$
Meanwhile, the sequence of censored symbols is encoded using side information from $\widetilde{m}$. Consider the censored sequence $\widetilde{x}_{1:n} = \widetilde{x}_1 \widetilde{x}_2 \ldots$
$\widetilde{x}_n$, defined by
$$\widetilde{x}_i = x_i 1_{x_i \le m_{i-1}} = \begin{cases} x_i & \text{if } x_i \le m_{i-1}, \\ 0 & \text{otherwise.} \end{cases}$$
All symbols greater than $m_{i-1}$ are encoded together: they are replaced by the extra symbol 0, and this extra symbol is encoded instead. The symbol 0 has a special use in our setting: it lets the decoder know when $m_i$ changes, and that the new value has to be read in C2. We add at the end of $\widetilde{x}_{1:n}$ an additional 0, which acts as a termination signal together with the last 1 in $\widetilde{m}$. This makes our code a prefix code on the set of all finite length messages (whatever $n$). We then produce the binary string C1 by arithmetic coding of $\widetilde{x}_{1:n} 0$. The conditional coding probabilities are defined by
$$Q_{i+1}(\widetilde{X}_{i+1} = j \mid X_{1:i} = x_{1:i}) = \frac{n_i^j + \frac{1}{2}}{i + \frac{m_i + 1}{2}} \quad \text{if } 1 \le j \le m_i,$$
$$Q_{i+1}(\widetilde{X}_{i+1} = 0 \mid X_{1:i} = x_{1:i}) = \frac{1/2}{i + \frac{m_i + 1}{2}},$$
where for $j \ge 1$ and $i \ge 0$, $n_i^j$ is the number of occurrences of symbol $j$ in $x_{1:i}$ (with the convention $n_0^j = 0$ for all $j \ge 1$). If $i \le n - 1$, the event $\{ \widetilde{X}_{i+1} = 0 \}$ is equal to $\{ X_{i+1} > M_i \}$. If $x_{i+1} = j > m_i$, then $n_i^j = 0$, and we still have
$$Q_{i+1}(\widetilde{X}_{i+1} = 0 \mid X_{1:i} = x_{1:i}) = \frac{n_i^j + \frac{1}{2}}{i + \frac{m_i + 1}{2}}.$$
In the sequel we denote the coding probability used to encode the entire string $\widetilde{x}_{1:n} 0$ by
$$Q^{n+1}(\widetilde{x}_{1:n} 0) = Q_{n+1}(0 \mid x_{1:n}) \prod_{i=0}^{n-1} Q_{i+1}(\widetilde{x}_{i+1} \mid x_{1:i}).$$
Note that the symbol 0 is always considered as new: when $x_{i+1} > m_i$, we encode 0 but we increment the counter $n_i^{x_{i+1}}$. (This choice was made to simplify the calculation of the redundancy of ACcode, but we suspect that changing this behavior could improve the performance.)

Now that we have defined C1 and C2, we have to describe how they are transmitted. To keep our code online, we interleave these two strings in the following way. Arithmetic coding needs a certain number of bits, say $l_i$, to send the first $i$ symbols of $\widetilde{x}_{1:n}$.
Unfortunately, $l_i$ depends on whether $i = n + 1$ or not. In the former case, $l_{n+1} = \lceil -\log Q^{n+1}(\widetilde{x}_{1:n} 0) \rceil + 1$; in the latter, $l_i$ depends on the following symbols and has to be computed.

ACcode begins with C2, by the transmission of the Elias code of $\widetilde{m}_1 + 1$. Then the transmission of C1 is initiated. Suppose that $\widetilde{x}_i = 0$ and $n^0_i = k$. As soon as $l_i$ bits of C1 have been sent, the ACcode algorithm sends the Elias code of $\widetilde{m}_k - \widetilde{m}_{k-1} + 1$. Then C1 is transmitted again, from the next bit.

To decode the $i$th symbol in C1, the knowledge of the current maximum $m_{i-1}$ is needed; it is obtained from the beginning of the string C2. The decoder also needs the counters $(n_{i-1}^j)_{j \ge 1}$, which can be computed from the first $i - 1$ decoded symbols. As soon as $l_i$ bits of C1 have been received, $\widetilde{x}_i$ can be decoded. When the decoder meets a 0 at the $i$th position, it knows that the next bits are the Elias code of the next symbol in $\widetilde{m}$, and deduces $m_i$ via (5). Since the Elias code is prefix-free, the decoder knows when it receives C1 again. Then the $(i+1)$th symbol can be processed.

Fig. 1 shows an illustration of the transmission process. In this example, the initial message is $x_{1:4} = 5, 3, 2, 7$. The message encoded in C1 is then $\widetilde{x}_{1:4} 0 = 0, 3, 2, 0, 0$. 13 bits are needed to transmit the second 0, and 15 bits for the last one. In C2 we transmit $\widetilde{m} = 6, 3, 1$.

[Fig. 1. Example of ACcode: the bits of C1 are interleaved with the Elias codewords of $\widetilde{m}_1 + 1$, $\widetilde{m}_2 - \widetilde{m}_1 + 1$, and of the terminal 1.]

In the previous example, exact calculations have been performed, but this is not sensible for a practical implementation of arithmetic coding. Some rule is needed to set the precision of the computations, and it must be used by both coder and decoder.
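The numbers in the Fig. 1 example can be reproduced with a short sketch (a naive re-implementation for checking purposes, not the paper's code): the total C1 length $\lceil -\log Q^{n+1}(\widetilde{x}_{1:n} 0) \rceil + 1$ for $x_{1:4} = 5, 3, 2, 7$, and the Elias codeword lengths $l_E$ of (6).

```python
import math

def elias_len(x):
    """Codeword length l_E(x) of the Elias penultimate code, Eq. (6)."""
    if x == 1:
        return 1
    lg = x.bit_length() - 1                     # floor(log2 x)
    return 1 + lg + 2 * ((lg + 1).bit_length() - 1)

def accode_c1_probability(x):
    """Coding probability Q^{n+1}(x~_{1:n} 0) that ACcode assigns to the
    censored sequence followed by the terminal 0."""
    counts = {}   # n_i^j: occurrences of symbol j among the first i symbols
    m = 0         # running maximum m_i
    prob = 1.0
    for i, xi in enumerate(x):                  # i symbols already seen
        denom = i + (m + 1) / 2
        if xi <= m:                             # uncensored symbol
            prob *= (counts.get(xi, 0) + 0.5) / denom
        else:                                   # censored: escape symbol 0
            prob *= 0.5 / denom
        counts[xi] = counts.get(xi, 0) + 1      # 0 is "always new": count xi anyway
        m = max(m, xi)
    prob *= 0.5 / (len(x) + (m + 1) / 2)        # terminal 0
    return prob

x = [5, 3, 2, 7]                                # the message of Fig. 1
q = accode_c1_probability(x)
c1_bits = math.ceil(-math.log2(q)) + 1          # total C1 length
c2_bits = elias_len(6) + elias_len(3) + elias_len(1)  # Elias codes of m~ = 6, 3, 1
```

With these definitions, `c1_bits` matches the 15 bits quoted above, and the first escape symbol is coded with probability 1, since $m_0 = 0$ makes the denominator $1/2$.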
To avoid too large an extra redundancy caused by approximations, the precision can grow with $n$. For instance, calculations can be made in memory with a further precision of $2 \lceil \log i \rceil$ bits, in addition to the $\lceil -\log Q^i(\widetilde{x}_{1:i}) \rceil + 1$ bits needed to encode $x_{1:i}$; this ensures that the extra redundancy is bounded.

APPENDIX A
METRIC ENTROPY OF EXPONENTIALLY DECREASING ENVELOPE CLASSES

We give here the outlines of the proofs of the lemmas stated in Subsection II-B.

Proof of Lemma 2: $N_\epsilon$ denotes the threshold from which we truncate the coordinates. If $y = (y_n)_{n \ge 1}$ is an element of $A_f$, its truncated version is $\widetilde{y} = (y_n 1_{n \le N_\epsilon})_{n \ge 1}$. One can check that $\| y - \widetilde{y} \| \le \frac{\epsilon}{4}$. Suppose now that $S$ is an $\epsilon/4$-cover of $\{ y \in A_f : \forall n > N_\epsilon,\ y_n = 0 \}$. Let $z$ denote an element of $A_f$. Then there exists some $y \in S$ such that $\| \widetilde{z} - y \| \le \epsilon/4$. Thus $\| z - y \| \le \epsilon/2$, and $S$ is an $\epsilon/2$-cover of $A_f$. This leads to
$$D_\epsilon(\Theta_f, d) \le M_{\epsilon/4}\Big( \prod_{1 \le k \le N_\epsilon} \big[ 0, \sqrt{f(k)} \big],\ \| \cdot \|_{\mathbb{R}^{N_\epsilon}} \Big) \le \frac{\mathrm{Vol}\Big( \prod_{1 \le k \le N_\epsilon} \big[ -\frac{\epsilon}{8}, \sqrt{f(k)} + \frac{\epsilon}{8} \big] \Big)}{\mathrm{Vol}\big( B_{\mathbb{R}^{N_\epsilon}}(0, \frac{\epsilon}{8}) \big)}.$$
A first consequence of this computation is that $D_\epsilon(\Theta_f, d)$ is finite for all $\epsilon > 0$. The first assertion of Lemma 2 is then obtained by applying the logarithm. The rest of Lemma 2 follows from the Feller bounds, in the version proposed by [40, ch. XII]:
$$\Gamma(x) = \sqrt{2\pi}\, x^{x - 1/2} e^{-x} e^{\frac{\beta}{12 x}}, \quad \text{with } \beta \in [0, 1]. \quad (7)$$

Proof of Lemma 3: Let $m \ge 1$ be an integer. We project the set $A_f \cap \{ \|x\| = 1 \}$ onto the $m$-dimensional space
$$E_m = \{0\}^{l_f} \times \mathbb{R}^m \times \{0\}^{\{ k : k \ge l_f + m + 1 \}}$$
generated by the coordinates from $l_f + 1$ to $l_f + m$. This leads to
$$D_\epsilon(\Theta_f, d) \ge N_\epsilon\Big( \prod_{k = l_f + 1}^{l_f + m} \big[ 0, \sqrt{f(k)} \big],\ \| \cdot \|_{\mathbb{R}^m} \Big) \ge \frac{\mathrm{Vol}\Big( \prod_{k = l_f + 1}^{l_f + m} \big[ 0, \sqrt{f(k)} \big] \Big)}{\mathrm{Vol}\big( B_{\mathbb{R}^m}(0, \epsilon) \big)}.$$
It only remains to apply the logarithm.
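The Feller version (7) of Stirling's bounds can be checked numerically: solving (7) for $\beta$ must give a value in $[0, 1]$. The sketch below uses only the standard library `math.lgamma`.

```python
import math

def feller_beta(x):
    """Solve Eq. (7) for beta: ln Gamma(x) = ln sqrt(2 pi) + (x - 1/2) ln x
    - x + beta/(12 x), hence beta = 12 x (ln Gamma(x) - main term)."""
    main = 0.5 * math.log(2 * math.pi) + (x - 0.5) * math.log(x) - x
    return 12 * x * (math.lgamma(x) - main)

# beta stays in [0, 1]; the Stirling series suggests beta is close to
# 1 - 1/(30 x^2) for large x.
betas = [feller_beta(x) for x in (1.0, 2.0, 5.0, 10.0, 100.0)]
```

For instance $\beta \approx 0.97$ already at $x = 1$, and it approaches 1 as $x$ grows.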
APPENDIX B
REDUNDANCY OF ACcode

A. Moments of M_n

We first need a lemma which contains several useful results about the moments of M_n.

Lemma 4: Let C and α be positive numbers satisfying C > e^{2α}. Then, for all n ≥ 1,
1) sup_{P∈Λ_{Ce^{−α·}}} E_P[M_n] ≤ (1/α)( ln n + ln(C/(1−e^{−α})) + 1 ).
2) sup_{P∈Λ_{Ce^{−α·}}} E_P[ M_n 1_{M_n > (1/α) ln(Cn²/(1−e^{−α}))} ] = O(ln n / n).
3) sup_{P∈Λ_{Ce^{−α·}}} E_P[M_n ln M_n] = o(ln² n).

Proof: Let F denote the distribution function associated with P. For t ≥ 0, we have

  P(X_1 > t) = ∑_{k≥⌊t⌋+1} P(k) ≤ (C/(1−e^{−α})) e^{−α(⌊t⌋+1)} ≤ e^{−α(t−β)},

where β = (1/α) ln(C/(1−e^{−α})). Therefore F(t) ≥ G(t) for all t ∈ R, where G(t) = 1_{t≥β} (1 − e^{−α(t−β)}). We can identify in G the distribution function of the random variable β + Y, where Y follows the exponential distribution with parameter α.

Let U_1, ..., U_n be n i.i.d. random variables following the uniform distribution on [0, 1]. For 1 ≤ i ≤ n, let us define X'_i = F^{−1}(U_i) and Y_i = G^{−1}(U_i) − β, where F^{−1} and G^{−1} denote the pseudo-inverses of F and G:

  ∀t ∈ [0, 1], F^{−1}(t) = inf{ x ∈ R : F(x) ≥ t }.

Then the n-dimensional vector X'_{1:n} = (X'_1, ..., X'_n) has the same distribution as X_{1:n}, and the maxima M'_n = sup_{1≤i≤n} X'_i and M_n follow the same distribution. On the other hand, the relation F ≥ G entails X'_i ≤ β + Y_i for all 1 ≤ i ≤ n. As a consequence, if M''_n = sup_{1≤i≤n} Y_i denotes the maximum of all the Y_i, we have M'_n ≤ β + M''_n.

Since the random variables Y_i are independent, the probability distribution of M''_n is easy to calculate. Indeed, for all t > 0,

  P(M''_n ≤ t) = P(∀ 1 ≤ i ≤ n, Y_i ≤ t) = (1 − e^{−αt})^n.

We can write down the density function of M''_n:

  f(t) = nα e^{−αt} (1 − e^{−αt})^{n−1} if t > 0, and f(t) = 0 otherwise.
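The quantile coupling above can be illustrated numerically. In the sketch below, α, C, and the geometric source are example choices (not taken from the paper); it checks the pointwise domination F^{−1}(u) ≤ G^{−1}(u) = β + Y, and then, by Monte Carlo, the bound of point 1 of Lemma 4 for this particular source.

```python
import math
import random

# Example values (not from the paper): alpha, and C with C > e^{2 alpha}.
alpha = 0.7
C = math.exp(2 * alpha) + 1.0
beta = math.log(C / (1 - math.exp(-alpha))) / alpha

# A concrete source in the class: geometric on {1, 2, ...} with ratio q = e^{-alpha},
# so that P(X_1 > t) = q^floor(t) <= e^{-alpha (t - beta)} (beta > 1 here).
q = math.exp(-alpha)

def F_inv(u):
    """Pseudo-inverse of the geometric c.d.f. F(k) = 1 - q^k."""
    return max(1, math.ceil(math.log(1 - u) / math.log(q)))

def G_inv(u):
    """Inverse of G(t) = 1_{t >= beta} (1 - e^{-alpha (t - beta)})."""
    return beta - math.log(1 - u) / alpha

random.seed(0)
us = [random.random() for _ in range(10_000)]

# F >= G pointwise, hence F_inv(u) <= G_inv(u): X'_i <= beta + Y_i.
assert all(F_inv(u) <= G_inv(u) for u in us)

# Monte Carlo check of point 1 of Lemma 4 for this source, with n = 1000.
n, trials = 1000, 500
bound = (math.log(n) + 1 + math.log(C / (1 - math.exp(-alpha)))) / alpha
est = sum(max(F_inv(random.random()) for _ in range(n)) for _ in range(trials)) / trials
assert est <= bound
```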
Now we look for an upper bound on E[M_n], taking advantage of the knowledge of that distribution:

  E[M_n] = E[M'_n] ≤ E[β + M''_n] = β + ∫_0^∞ t nα e^{−αt} (1 − e^{−αt})^{n−1} dt = β + ∫_0^∞ ( 1 − (1 − e^{−αt})^n ) dt,

integrating by parts. Use now the change of variables u = 1 − e^{−αt}, t = −ln(1−u)/α:

  E[M_n] ≤ β + (1/α) ∫_0^1 (1 − u^n)/(1 − u) du ≤ (1/α)( ln n + 1 + ln(C/(1−e^{−α})) ).

Since the upper bound does not depend on P, this achieves the proof of point 1.

We can handle point 2 in the same way. For all t > 0, we have

  E[ M_n 1_{M_n > β+t} ] ≤ E[ (β + M''_n) 1_{M''_n > t} ] ≤ ∫_t^∞ (β + u) nα e^{−αu} du = n e^{−αt} ( t + 1/α + β ).

With t = (2/α) ln n, we get the second point of Lemma 4.

The third item is similar. Since the function x ↦ x ln x is increasing on [1, +∞) and 1 ≤ M'_n ≤ β + M''_n, we have

  E[M_n ln M_n] ≤ E[ (β + M''_n) ln(β + M''_n) ]
    = E[ 1_{M''_n≤β} (β + M''_n) ln(β + M''_n) ] + E[ 1_{M''_n>β} 1_{M''_n≤(2/α) ln n} (β + M''_n) ln(β + M''_n) ] + E[ 1_{M''_n>β} 1_{M''_n>(2/α) ln n} (β + M''_n) ln(β + M''_n) ]
    ≤ 2β ln(2β) + (4/α)(ln n) ln( (4/α) ln n ) + E[ 2 M''_n ln(2 M''_n) 1_{M''_n > (2/α) ln n} ]
    ≤ 2β ln(2β) + ( (4/α) ln(4/α) ) ln n + (4/α)(ln n)(ln ln n) + E[ 4 M''_n² 1_{M''_n > (2/α) ln n} ].

Let us define γ(n) = 2β ln(2β) + ( (4/α) ln(4/α) ) ln n + (4/α)(ln n)(ln ln n). Note that γ(n) = o(ln² n). Then

  E[M_n ln M_n] ≤ γ(n) + ∫_{(2/α) ln n}^∞ 4u² nα e^{−αu} du = γ(n) + (4n e^{−2 ln n}/α²)(4 ln² n + 4 ln n + 2).

Taking the supremum over P, we get

  sup_{P∈Λ_{Ce^{−α·}}} E_P[M_n ln M_n] ≤ γ(n) + (16 ln² n + 16 ln n + 8)/(α² n) = o(ln² n).

B. Contribution of C1

Proposition 4: Let C and α be positive numbers satisfying C > e^{2α}. Then

  sup_{P∈Λ_{Ce^{−α·}}} E_{P^n}[ −log Q^n(X̃_{1:n}) − H(P^n) ] ≤ (1 + o(1)) log² n / (4α log e).
Proof: We give here the sketch of the proof, and we delay the proofs of (8), (9), (10), and (11). Here we deal with the quantity

  (A) = sup_{P∈Λ_{Ce^{−α·}}} E_{P^n}[ −log Q^n(X̃_{1:n}) − H(P^n) ],

which corresponds to the contribution of C1. As we saw in Section III, the coding probability Q^n is based on Krichevsky–Trofimov mixtures. For k ≥ 1, let KT_k denote the usual Krichevsky–Trofimov mixture on the alphabet {1, ..., k}, whose conditional probabilities are, for all 0 ≤ i ≤ n−1 and for all 1 ≤ j ≤ k,

  KT_k( X_{i+1} = j | X_{1:i} = x_{1:i} ) = ( n^j_i + 1/2 ) / ( i + k/2 ).

Let us choose k = m_n + 1. In this case, there is a simple relation between KT_{m_n+1} and Q^n. For any sequence of n positive integers x_{1:n} ∈ N_*^n,

  Q^{i+1}( X̃_{i+1} = x̃_{i+1} | X_{1:i} = x_{1:i} ) = ( (2i + 1 + m_n) / (2i + 1 + m_i) ) KT_{m_n+1}( X_{i+1} = x_{i+1} | X_{1:i} = x_{1:i} ).

As a consequence, we can link the redundancy of Q^n to the redundancy of KT_{m_n+1}:

  log Q^n(X̃_{1:n}) = log KT_{M_n+1}(X_{1:n}) + ∑_{i=0}^{n−1} log( (2i + 1 + M_n) / (2i + 1 + M_i) ),

and therefore (A) = sup_{P∈Λ_{Ce^{−α·}}} ( (A1) − (A2) ), where

  (A1) = E_{P^n}[ −log KT_{M_n+1}(X_{1:n}) − H(P^n) ],
  (A2) = E_{P^n}[ ∑_{i=0}^{n−1} log( (2i + 1 + M_n) / (2i + 1 + M_i) ) ].

Note that (A2) corresponds to the gain in redundancy of Q^n with respect to KT_{M_n+1}. It illustrates the benefit of taking M_i instead of M_n as cutoff to encode X_{i+1}. On the one hand, we have

  (A1) ≤ (E[M_n]/2) log n + E[log(M_n + 1)].   (8)

Since E[log M_n] ≤ E[M_n], Lemma 4 entails sup_{P∈Λ_{Ce^{−α·}}} E[log(M_n + 1)] = o(log² n). Applying Lemma 4 again, we see that (A1) produces a redundancy equivalent to log² n / (2α log e), which is twice the minimax redundancy obtained in Theorem 2. So we expect the corrective term (A2) to be about log² n / (4α log e).
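The Krichevsky–Trofimov conditionals above translate directly into code. A minimal sketch (the alphabet size k and the string length n below are arbitrary example choices); the brute-force check confirms the sanity property that KT_k is a probability distribution on {1, ..., k}^n:

```python
from collections import Counter
from itertools import product

def kt_prob(x, k):
    """KT_k(x_{1:n}) on {1, ..., k}, computed as the product of the
    conditionals (n_i^j + 1/2) / (i + k/2), where n_i^j counts the
    occurrences of symbol j among the first i symbols."""
    counts = Counter()
    p = 1.0
    for i, sym in enumerate(x):          # i symbols seen so far
        p *= (counts[sym] + 0.5) / (i + k / 2)
        counts[sym] += 1
    return p

# Sanity check: the KT masses of all strings of length n sum to 1.
k, n = 3, 4
total = sum(kt_prob(x, k) for x in product(range(1, k + 1), repeat=n))
assert abs(total - 1.0) < 1e-12
```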
To deal with (A2), we use the concavity of the log function, and we group the terms of the sum into bundles of size M_n. Let m = ⌊(n−1)/M_n⌋ be the number of bundles. To simplify the expression, we also neglect a few terms at the beginning of the sum. Let (h_n)_{n≥1} be a non-decreasing sequence of positive integers such that h_n → ∞ as n → ∞, and let us define λ_n = 2h_n log( 1 + 1/(2h_n) ). Then

  (A2) ≥ λ_n E_{P^n}[ ∑_{k=h_n+1}^m (M_n − M_{kM_n}) / (2(k+1)) ].   (9)

It is easy to verify that the function x ↦ x log(1 + 1/x) is non-decreasing and tends to log e as x tends to infinity; therefore (λ_n) tends to log e. We can now write

  (A) ≤ sup_{P∈Λ_{Ce^{−α·}}} ( (E[M_n]/2) log n − λ_n E[ ∑_{k=h_n+1}^m (M_n − M_{kM_n}) / (2(k+1)) ] ) + o(log² n)
     ≤ (1/2) (A3) + (λ_n/2) (A4) + o(log² n),

where

  (A3) = sup_{P∈Λ_{Ce^{−α·}}} E_{P^n}[ M_n log n − λ_n M_n ∑_{k=h_n+1}^m 1/(k+1) ],
  (A4) = sup_{P∈Λ_{Ce^{−α·}}} E_{P^n}[ ∑_{k=h_n+1}^m M_{kM_n} / (k+1) ].

Let us choose h_n = max{ 1, ⌊ln n − 2⌋ }. Then

  (A3) = o(log² n),   (10)
  (A4) ≤ log² n / (2α log² e) + o(log² n).   (11)

Therefore we have (A) ≤ (1 + o(1)) log² n / (4α log e), which concludes the proof of Proposition 4.

Proof of (8): Let

  P̂^n(x_{1:n}) = sup_{P^n} P^n(x_{1:n}) = ∏_{j∈{x_1,...,x_n}} ( n^j_n / n )^{n^j_n}

be the maximum likelihood of the string x_{1:n} over all i.i.d. distributions on N. Then

  (A1) ≤ E_{P^n}[ log( P̂^n(X_{1:n}) / KT_{M_n+1}(X_{1:n}) ) ] ≤ E_{P^n}[ sup_{x_{1:n}∈{1,...,M_n+1}^n} log( P̂^n(x_{1:n}) / KT_{M_n+1}(x_{1:n}) ) ].

Now we can apply a result of Catoni [9, Prop. 1.4.1]:

Lemma 5: For all k ≥ 1 and for all x_{1:n} ∈ {1, ..., k}^n,

  −log KT_k(x_{1:n}) + log P̂^n(x_{1:n}) ≤ ((k−1)/2) log n + log k.

Proof of (9): We group the terms of (A2) into bundles of size M_n:

  (A2) ≥ E_{P^n}[ ∑_{k=1}^{m−1} ∑_{i=kM_n+1}^{(k+1)M_n} log( 1 + (M_n − M_i)/(2i + M_i + 1) ) ].
From the relation M_k ≤ M_{k'} for all k' ≥ k ≥ 1, we can infer, for all i ≥ kM_n,

  (M_n − M_i) / (2i + M_i + 1) ≤ M_n / (2kM_n) = 1/(2k).

Since log is a concave function, we have log(1 + x) ≥ x log(1 + a)/a for all a > 0 and 0 ≤ x ≤ a. Consequently, if we choose a = 1/(2k),

  (A2) ≥ E_{P^n}[ ∑_{k=1}^{m−1} ∑_{i=kM_n+1}^{(k+1)M_n} 2k log(1 + 1/(2k)) (M_n − M_i)/(2i + M_i + 1) ] ≥ E_{P^n}[ ∑_{k=h_n+1}^m λ_n (M_n − M_{kM_n})/(2k + 2) ].

Proof of (10): We have

  (A3) = sup_{P∈Λ_{Ce^{−α·}}} [ ∑_{j≥1} P^n(M_n = j) × ( j log n − λ_n j ∑_{k=h_n+1}^{⌊(n−1)/j⌋} 1/(k+1) ) ].

Then we plug in h_n = ⌊ln n − 2⌋. For n large enough, h_n ≥ 1, and we have

  j ∑_{k=h_n+1}^{⌊(n−1)/j⌋} 1/(k+1) ≥ j ∫_{ln n − 1}^{⌊(n−1)/j⌋+1} dx/(x+1) = j ( ln( ⌊(n−1)/j⌋ + 2 ) − ln(ln n) ) ≥ j ln(n−1) − j ln j − j ln(ln n),

and therefore

  (A3) ≤ sup_{P∈Λ_{Ce^{−α·}}} [ (log e − λ_n) E[M_n] ln n + λ_n E[M_n] ln( n/(n−1) ) + λ_n E[M_n ln M_n] + λ_n E[M_n] ln(ln n) ].

Then, if we use Lemma 4 and the fact that λ_n tends to log e, we get (10).

Proof of (11): We want to commute the expected value and the sum in (A4). To do so, we need to get rid of m. We can note that the condition k ≤ m = ⌊(n−1)/M_n⌋ entails kM_n ≤ n−1. Consequently, for n big enough,

  (A4) ≤ sup_{P∈Λ_{Ce^{−α·}}} E_{P^n}[ ∑_{k=3}^m M_{kM_n}/(k+1) ]
      ≤ sup_{P∈Λ_{Ce^{−α·}}} E_{P^n}[ ∑_{k=3}^{n−1} M_{kM_n} 1_{kM_n≤n−1}/(k+1) ]
      ≤ ∑_{k=3}^{n−1} sup_{P∈Λ_{Ce^{−α·}}} E_{P^n}[ M_{kM_n} 1_{M_n≤l_n} ]/(k+1) + sup_{P∈Λ_{Ce^{−α·}}} E_{P^n}[ M_n 1_{M_n>l_n} ] ∑_{k=3}^{n−1} 1/(k+1),

where l_n = ⌊ (1/α)( 2 ln n + ln(C/(1−e^{−α})) ) ⌋. We can now plug in the results of Lemma 4:

  (A4) ≤ ∑_{k=3}^{n−1} ( ln(k l_n) + 1 + ln(C/(1−e^{−α})) ) / ((k+1)α) + o(1) ∑_{k=3}^{n−1} 1/(k+1)
      ≤ (1/α) ∑_{k=3}^{n−1} ln k/(k+1) + (1/α)( ln l_n + O(1) ) ∑_{k=3}^{n−1} 1/(k+1).

Note that l_n = O(ln n), and consequently ln l_n = O(ln ln n). So

  (A4) ≤ (1/α) ∫_3^n (ln x / x) dx + O(ln ln n) ∫_3^n dx/x ≤ (1/(2α)) ln² n + o(ln² n).

C.
Contribution of C2

Proposition 5: Let C and α be positive numbers satisfying C > e^{2α}. Then

  sup_{P∈Λ_{Ce^{−α·}}} E_{P^n}[ l(C2) ] = o(log² n).

Proof: As in the previous subsection, we give first the sketch of the proof, and we delay several technical lemmas.

  sup_{P∈Λ_{Ce^{−α·}}} E_{P^n}[ l(C2) ] ≤ 1 + sup_{P∈Λ_{Ce^{−α·}}} ∑_{i=1}^n E_{P^n}[ 1_{X_i>M_{i−1}} l_E(X_i + 1) ].

We deal with this sum thanks to the following lemma:

Lemma 6: Let g be a positive and non-decreasing function on [1, ∞). Let (K_n)_{n≥1} be a non-decreasing sequence of positive integers. Then, for all n ≥ 1,

  sup_{P∈Λ_{Ce^{−α·}}} ∑_{i=1}^n E_{P^n}[ 1_{X_i>M_{i−1}} g(X_i) ] ≤ (1 + o(1)) g(K_n) (1/α) ln n + Cn ∫_{K_n}^∞ g(x+1) e^{−αx} dx.

To apply Lemma 6, we extend the definition of l_E to [1, ∞) by

  l_E(x) = 1 if x ∈ [1, 2),
  l_E(x) = 1 + ⌊log x⌋ + 2 ⌊log( ⌊log x⌋ + 1 )⌋ if x ≥ 2.

We get

  sup_{P∈Λ_{Ce^{−α·}}} E_{P^n}[ l(C2) ] ≤ (1 + o(1)) (l_E(K_n + 1)/α) ln n + Cn ∫_{K_n}^∞ l_E(x+2) e^{−αx} dx.

Then we can choose K_n = max{ 1, ⌊(1/α) ln n⌋ }. This entails l_E(K_n + 1) ∼ log K_n ∼ log log n = o(log n) as n → ∞, and therefore (1/α) l_E(K_n + 1) ln n = o(log² n). The remaining term is treated by Lemma 7, which achieves the proof of Proposition 5:

Lemma 7: Let α > 0 be a real number, and let K_n = max{ 1, ⌊(1/α) ln n⌋ }. Then

  n ∫_{K_n}^∞ l_E(x+2) e^{−αx} dx = o(log n).

Proof of Lemma 6: Let P be an element of Λ_{Ce^{−α·}}. Let us define, for all k ≥ 0,

  p̄(k) = P(X_1 > k) = ∑_{j≥k+1} P(j),

and

  (B1) = ∑_{i=1}^n E_{P^n}[ 1_{X_i>M_{i−1}} g(X_i) ].

Note that, for all 1 ≤ i ≤ n, X_i and M_{i−1} are independent random variables, and

  P^n(M_i ≤ k) = P^n(∀ 1 ≤ j ≤ i, X_j ≤ k) = (1 − p̄(k))^i.
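These two observations yield the closed form for (B1) derived in the computation that follows. It can be verified exactly on a small finite alphabet by brute-force enumeration; in the sketch below, the source P and the function g are arbitrary example choices, and the convention M_0 = 0 is assumed.

```python
from itertools import product

K, n = 4, 5
P = [0.4, 0.3, 0.2, 0.1]       # example source on {1, ..., K}

def g(x):
    """An arbitrary positive, non-decreasing g."""
    return float(x * x)

# Left-hand side: sum_i E[1_{X_i > M_{i-1}} g(X_i)], over all K^n strings.
lhs = 0.0
for seq in product(range(1, K + 1), repeat=n):
    prob = 1.0
    for s in seq:
        prob *= P[s - 1]
    M = 0                       # running maximum, with M_0 = 0
    for s in seq:
        if s > M:               # symbol exceeds the current maximum: a record
            lhs += prob * g(s)
            M = s

# Right-hand side: sum_m P(m) g(m) (1 - (1 - pbar(m-1))^n) / pbar(m-1).
def pbar(k):
    """P(X_1 > k)."""
    return sum(P[k:])

rhs = sum(P[m - 1] * g(m) * (1 - (1 - pbar(m - 1)) ** n) / pbar(m - 1)
          for m in range(1, K + 1))

assert abs(lhs - rhs) < 1e-9
```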
Then we can write

  (B1) = ∑_{i=1}^n ∑_{k≥0} P^n(M_{i−1} = k) ∑_{m≥k+1} P(m) g(m)
       = ∑_{m≥1} P(m) g(m) ∑_{i=1}^n ∑_{k=0}^{m−1} P(M_{i−1} = k)
       = P(1) g(1) + ∑_{m≥2} P(m) g(m) ∑_{i=1}^n (1 − p̄(m−1))^{i−1}
       = ∑_{m≥1} P(m) g(m) ( 1 − (1 − p̄(m−1))^n ) / p̄(m−1).

If we take g(x) = 1 for all x, we get

  ∑_{m≥1} P(m) ( 1 − (1 − p̄(m−1))^n ) / p̄(m−1) = E[ ∑_{i=1}^n 1_{X_i>M_{i−1}} ] ≤ E[M_n].

In the general case, we can split the sum at K_n, and we get

  (B1) = ∑_{m=1}^{K_n} P(m) g(m) ( 1 − (1 − p̄(m−1))^n ) / p̄(m−1) + ∑_{m≥K_n+1} P(m) g(m) ( 1 − (1 − p̄(m−1))^n ) / p̄(m−1)
       ≤ g(K_n) ∑_{m≥1} P(m) ( 1 − (1 − p̄(m−1))^n ) / p̄(m−1) + ∑_{m≥K_n+1} n P(m) g(m)
       ≤ g(K_n) E[M_n] + Cn ∑_{m≥K_n+1} g(m) e^{−αm}.

At this point, we can take the supremum over all sources P in Λ_{Ce^{−α·}}:

  sup_{P∈Λ_{Ce^{−α·}}} ∑_{i=1}^n E_{P^n}[ 1_{X_i>M_{i−1}} g(X_i) ] ≤ (1 + o(1)) g(K_n) (1/α) ln n + Cn ∫_{K_n}^∞ g(x+1) e^{−αx} dx.

Proof of Lemma 7:

  n ∫_{K_n}^∞ l_E(x+2) e^{−αx} dx
    ≤ n ∫_{K_n}^∞ ( log(x+2) + 2 log log(x+3) + 1 ) e^{−αx} dx
    ≤ n e^{−αK_n} log(K_n + 3) ∫_{K_n+3}^∞ ( (log x + 2 log log x + 1) / log(K_n + 3) ) e^{−α(x−K_n−3)} dx
    ≤ e^α log(K_n + 3) ( sup_{x≥K_n+3} (log x + 2 log log x + 1)/log x ) ∫_{K_n+3}^∞ ( log x / log(K_n + 3) ) e^{−α(x−K_n−3)} dx
    = O(log K_n) ∫_0^∞ ( 1 + log(1 + x/(K_n+3)) / log(K_n + 3) ) e^{−αx} dx
    = o(log n).

The supremum is correctly defined and bounded, because the function x ↦ (log x + 2 log log x + 1)/log x is continuous and tends to 1 as x tends to infinity.

D. Proof of Theorem 4

The message sent by the ACcode algorithm is composed of two strings C1 and C2. C1 corresponds to the part of the message encoded by the arithmetic code, with coding probability Q^{n+1}. The arithmetic code encodes a message x̃_{1:n} 0 with ⌈−log Q^{n+1}(x̃_{1:n} 0)⌉ + 1 bits.
We have

  E_{P^n}[ −log Q^{n+1}(0 | X_{1:n}) ] = E_{P^n}[ log(M_n + 1 + 2n) ] ≤ log(2n) + E_{P^n}[M_n + 1]/(2n) = O(log n),

thanks to Lemma 4. Therefore the redundancy of ACcode can be upper bounded, for all n ≥ 2, by

  sup_{P∈Λ_{Ce^{−α·}}} ( E_{P^n}[ l(C1) + l(C2) ] − H(P^n) ) ≤ sup_{P∈Λ_{Ce^{−α·}}} E_{P^n}[ −log Q^n(X̃_{1:n}) − H(P^n) ] + sup_{P∈Λ_{Ce^{−α·}}} E_{P^n}[ l(C2) ] + O(log n).

We conclude thanks to Propositions 4 and 5.

ACKNOWLEDGMENT

I wish to thank Elisabeth Gassiat for her helpful advice, for the many ideas she suggested for this paper, and for her constant availability to answer my questions. Thanks also to Aurélien Garivier and to Stéphane Boucheron for the useful discussions we had. I also owe ideas to my classmates, especially Rafik Imekraz.

REFERENCES

[1] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.
[2] R. G. Gallager, Information Theory and Reliable Communication. New York: Wiley, 1968.
[3] L. D. Davisson and A. Leon-Garcia, "A source matching approach to finding minimax codes," IEEE Trans. Inf. Theory, vol. 26, pp. 166–174, 1980.
[4] D. Haussler, "A general minimax result for relative entropy," University of California, UC Santa Cruz, CA 96064, Tech. Rep. UCSC-CRL-96-26, 1996.
[5] R. E. Krichevsky and V. K. Trofimov, "The performance of universal encoding," IEEE Trans. Inf. Theory, vol. 27, no. 2, pp. 199–207, 1981.
[6] Q. Xie and A. R. Barron, "Minimax redundancy for the class of memoryless sources," IEEE Trans. Inf. Theory, vol. 43, no. 2, pp. 646–657, 1997.
[7] ——, "Asymptotic minimax regret for data compression, gambling, and prediction," IEEE Trans. Inf. Theory, vol. 46, no. 2, pp. 431–445, 2000.
[8] A. R. Barron, J. Rissanen, and B. Yu, "The minimum description length principle in coding and modeling," IEEE Trans. Inf. Theory, vol. 44, no. 6, pp.
2743–2760, 1998.
[9] O. Catoni, Statistical Learning Theory and Stochastic Optimization, ser. Lecture Notes in Mathematics. Springer-Verlag, 2001, vol. 1851, École d'Été de Probabilités de Saint-Flour XXXI.
[10] M. Drmota and W. Szpankowski, "Precise minimax redundancy and regret," IEEE Trans. Inf. Theory, vol. 50, no. 11, pp. 2686–2707, 2004.
[11] B. S. Clarke and A. R. Barron, "Information-theoretic asymptotics of Bayes methods," IEEE Trans. Inf. Theory, vol. 36, no. 3, pp. 453–471, 1990.
[12] ——, "Jeffreys' prior is asymptotically least favorable under entropy risk," J. Statist. Plann. Inference, vol. 41, pp. 37–60, 1994.
[13] A. R. Barron, "Information-theoretic characterization of Bayes performance and the choice of priors in parametric and nonparametric problems," in Bayesian Statistics, J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Eds. Oxford Univ. Press, 1998, vol. 6, pp. 27–52.
[14] L. D. Davisson, "Minimax noiseless universal coding for Markov sources," IEEE Trans. Inf. Theory, vol. 29, no. 2, pp. 211–214, 1983.
[15] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens, "The context-tree weighting method: Basic properties," IEEE Trans. Inf. Theory, vol. 41, no. 3, pp. 653–664, 1995.
[16] K. Atteson, "The asymptotic redundancy of Bayes rules for Markov chains," IEEE Trans. Inf. Theory, vol. 45, no. 6, pp. 2104–2109, 1999.
[17] J. Rissanen, "Universal coding, information, prediction, and estimation," IEEE Trans. Inf. Theory, vol. 30, no. 4, pp. 629–636, 1984.
[18] I. Csiszár and P. C. Shields, "Redundancy rates for renewal and other processes," IEEE Trans. Inf. Theory, vol. 42, no. 6, pp. 2065–2072, 1996.
[19] P. Flajolet and W. Szpankowski, "Analytic variations on redundancy rates of renewal processes," IEEE Trans. Inf. Theory, vol. 48, no. 11, pp. 2911–2921, 2002.
[20] P. C.
Shields, "Universal redundancy rates do not exist," IEEE Trans. Inf. Theory, vol. 39, no. 2, pp. 520–524, 1993.
[21] J. C. Kieffer, "A unified approach to weak universal source coding," IEEE Trans. Inf. Theory, vol. 24, no. 6, pp. 674–682, 1978.
[22] L. Györfi, I. Páli, and E. C. van der Meulen, "There is no universal source code for an infinite source alphabet," IEEE Trans. Inf. Theory, vol. 40, no. 1, pp. 267–271, 1994.
[23] ——, "On universal noiseless source coding for infinite source alphabets," Eur. Trans. Telecom., vol. 4, pp. 4–16, 1993.
[24] A. Orlitsky, N. P. Santhanam, and J. Zhang, "Universal compression of memoryless sources over unknown alphabets," IEEE Trans. Inf. Theory, vol. 50, no. 7, pp. 1469–1481, 2004.
[25] ——, "Speaking of infinity [i.i.d. strings]," IEEE Trans. Inf. Theory, vol. 50, no. 10, pp. 2215–2230, 2004.
[26] N. Jevtic, A. Orlitsky, and N. P. Santhanam, "A lower bound on compression of unknown alphabets," Theor. Comput. Sci., vol. 332, no. 1-3, pp. 293–311, 2005.
[27] A. Orlitsky, N. P. Santhanam, K. Viswanathan, and J. Zhang, "Limit results on pattern entropy," IEEE Trans. Inf. Theory, vol. 52, no. 7, pp. 2954–2964, 2006.
[28] G. I. Shamir and D. J. J. Costello, "On the entropy rate of pattern processes," IEEE Trans. Inf. Theory, vol. 50, no. 8, pp. 1620–1635, 2004.
[29] G. I. Shamir, "Sequential universal lossless techniques for compression of patterns and their description length," in Data Compression Conference, 2004, pp. 419–428.
[30] G. M. Gemelos and T. Weissman, "On the entropy rate of pattern processes," IEEE Trans. Inf. Theory, vol. 52, no. 9, pp. 3994–4007, 2006.
[31] A. Garivier, "A lower bound for the maximin redundancy in pattern coding," Entropy, vol. 11, no. 4, pp. 634–642, 2009.
[32] Y. Choi and W. Szpankowski, "Pattern matching in constrained sequences," in 2007 Int. Symp.
Information Theory, Nice, 2007, pp. 2606–2610.
[33] P. Elias, "Universal codeword sets and representations of the integers," IEEE Trans. Inf. Theory, vol. 21, no. 2, pp. 194–203, 1975.
[34] D.-k. He and E.-h. Yang, "The universality of grammar-based codes for sources with countably infinite alphabets," IEEE Trans. Inf. Theory, vol. 51, no. 11, pp. 3753–3765, 2005.
[35] D. P. Foster, R. A. Stine, and A. J. Wyner, "Universal codes for finite sequences of integers drawn from a monotone distribution," IEEE Trans. Inf. Theory, vol. 48, no. 6, pp. 1713–1720, 2002.
[36] S. Boucheron, A. Garivier, and E. Gassiat, "Coding on countably infinite alphabets," IEEE Trans. Inf. Theory, vol. 55, no. 1, pp. 358–373, 2009.
[37] D. Haussler and M. Opper, "Mutual information, metric entropy and cumulative relative entropy risk," Ann. Statist., vol. 25, no. 6, pp. 2451–2492, 1997.
[38] A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes, ser. Springer Series in Statistics. Springer-Verlag, 1996.
[39] D. Bontemps, "Redondance bayésienne et minimax, sources stationnaires sans mémoire en alphabet infini," Master's thesis, Paris-Sud 11 Univ., Sep. 2007. [Online]. Available: http://www.math.u-psud.fr/~bontemps/Documents/rapportM2.pdf
[40] E. T. Whittaker and G. N. Watson, A Course of Modern Analysis, 4th ed., ser. Cambridge Mathematical Library. Cambridge: Cambridge Univ. Press, 1927, reprinted 1990.
