Entropy Concentration and the Empirical Coding Game

We give a characterization of Maximum Entropy/Minimum Relative Entropy inference by providing two 'strong entropy concentration' theorems. These theorems unify and generalize Jaynes' 'concentration phenomenon' and Van Campenhout and Cover's 'conditional limit theorem'. The theorems characterize exactly in what sense a prior distribution $Q$ conditioned on a given constraint and the distribution $\tilde{P}$ minimizing $D(P\|Q)$ over all $P$ satisfying the constraint are 'close' to each other. We then apply our theorems to establish the relationship between entropy concentration and a game-theoretic characterization of Maximum Entropy Inference due to Topsøe and others.

Authors: **Peter Grünwald** (Centrum Wiskunde & Informatica, P.O. Box 94079, 1090 GB Amsterdam, the Netherlands)

Author's note: Also research fellow at EURANDOM, P.O. Box 513, 5600 MB Eindhoven, The Netherlands. This is a slightly modified version of the paper with the same title that appeared in Statistica Neerlandica 62(3), 2008, pages 374–392, on the occasion of the 10th anniversary of EURANDOM. Some of the results presented here have already appeared in the conference paper [Grünwald, 2001a] and the technical report [Grünwald, 2001b]. Theorems 5.3 and 5.4 of Section 5 are new and have not been published before. The paper benefited enormously from various discussions with Richard Gill, Phil Dawid and Franz Merkl. This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the author's views.

1 Introduction

Jaynes' Maximum Entropy (MaxEnt) Principle is a well-known principle for inductive inference [Csiszár, 1975, 1991, Topsøe, 1979, van Campenhout and Cover, 1981, Cover and Thomas, 1991, Grünwald and Dawid, 2004]. It has been applied to statistical and machine learning problems ranging from protein modeling to stock market prediction [Kapur and Kesavan, 1992]. One of its characterizations (some would say 'justifications') is the so-called concentration phenomenon [Jaynes, 1978, 1982]. Here is an informal version of this phenomenon, in the words of Jaynes [2003]:

"If the information incorporated into the maximum-entropy analysis includes all the constraints actually operating in the random experiment, then the distribution predicted by maximum entropy is overwhelmingly the most likely to be observed experimentally."

For the case in which a prior distribution over the domain at hand is available, van Campenhout and Cover [1981] have proven the related conditional limit theorem. In Sections 2–4, we provide a strong generalization of both the concentration phenomenon and the conditional limit theorem. In Section 5, the results of Section 4 are used to extend an existing game-theoretic characterization (again, some would say "justification") of Maximum Entropy due to Topsøe [1979]. In this way, we provide sharper results on two of the most frequently cited characterizations of the maximum entropy principle.
2 Informal Overview

**Maximum Entropy.** Let $X$ be a random variable taking values in some set $\mathcal{X}$, which (only for the time being!) we assume to be finite: $\mathcal{X} = \{1, \ldots, m\}$. Let $P$, $Q$ be distributions for $X$ with probability mass functions $p$ and $q$. We define $H_Q(P)$, the $Q$-entropy of $P$, as

$$H_Q(P) = -E_P\left[\log \frac{p(X)}{q(X)}\right] = -D(P\|Q), \qquad (1)$$

where $D(\cdot\|\cdot)$ is the Kullback-Leibler (KL) divergence between $P$ and $Q$ [Cover and Thomas, 1991]. In the usual MaxEnt setting, we are given a 'prior' distribution $Q$ and a moment constraint:

$$E[T(X)] = \tilde{t}, \qquad (2)$$

where $T$ is some function $T: \mathcal{X} \to \mathbb{R}^k$ for some $k > 0$. (More general formulations with arbitrary convex constraints exist [Csiszár, 1975], but here we stick to constraints of form (2).) We define, if it exists, $\tilde{P}$ to be the unique distribution over $\mathcal{X}$ that maximizes the $Q$-entropy over all distributions (over $\mathcal{X}$) satisfying (2):

$$\tilde{P} = \arg\max_{\{P : E_P[T(X)] = \tilde{t}\}} H_Q(P) = \arg\min_{\{P : E_P[T(X)] = \tilde{t}\}} D(P\|Q). \qquad (3)$$

The MaxEnt Principle then tells us that, in the absence of any further knowledge about the 'true' or 'posterior' distribution according to which data are distributed, our best guess for it is $\tilde{P}$.

In practical problems we are usually not given a constraint of form (2). Rather, we are given an empirical constraint of the form

$$\frac{1}{n}\sum_{i=1}^n T(X_i) = \tilde{t}, \quad \text{which we always abbreviate to } \overline{T}^{(n)} = \tilde{t}. \qquad (4)$$

The MaxEnt Principle is then usually applied as follows: suppose we are given an empirical constraint of form (4). We then have to make predictions about new data coming from the same source. In the absence of knowledge of any 'true' distribution generating this data, we should make our predictions based on the MaxEnt distribution $\tilde{P}$ for the moment constraint (2) corresponding to the empirical constraint (4). $\tilde{P}$ is extended to several outcomes by taking the product distribution.

**The Concentration Phenomenon and the Conditional Limit Theorem.** Why should this procedure make any sense? Here is one justification. If $\mathcal{X}$ is finite, and in the absence of any prior knowledge besides the constraint, one usually picks the uniform distribution for $Q$. In this case, Jaynes' 'concentration phenomenon' applies. (We are referring here to the version employed by Jaynes [1978]. The theorem of Jaynes [1982] extends this in a direction different from the one we consider here.) It says that for all $\varepsilon > 0$,

$$Q^n\left(\sup_{j\in\mathcal{X}}\left|\frac{1}{n}\sum_{i=1}^n I_j(X_i) - \tilde{P}(X=j)\right| > \varepsilon \;\middle|\; \overline{T}^{(n)} = \tilde{t}\right) = O(e^{-cn}) \qquad (5)$$

for some constant $c$ depending on $\varepsilon$. Here $Q^n$ is the $n$-fold product distribution of $Q$, and $I$ is the indicator function: $I_j(x) = 1$ if $x = j$ and $0$ otherwise. In words, for the overwhelming majority among the sequences satisfying the constraint, the empirical frequencies are close to the maximum entropy probabilities. It turns out that (5) still holds if $Q$ is non-uniform. For an illustration we refer to Example 4.2.

A closely related result (Theorem 1 of van Campenhout and Cover [1981]) is the conditional limit theorem. (This theorem too has later been extended in several directions different from the one considered here; see the discussion at the end of Section 4.) It says that

$$\lim_{n\to\infty,\; n\tilde{t}\in\mathbb{N}} Q_1(\cdot \mid \overline{T}^{(n)} = \tilde{t}) = \tilde{P}_1(\cdot), \qquad (6)$$

where $Q_1(\cdot \mid \overline{T}^{(n)} = \tilde{t})$ and $\tilde{P}_1(\cdot)$ refer to the marginal distribution of $X_1$ under $Q(\cdot \mid \overline{T}^{(n)} = \tilde{t})$ and $\tilde{P}$, respectively.
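The convergence in (6) is easy to observe numerically when $\mathcal{X}$ is small, since the conditional distribution can be computed by exhaustive enumeration. The following sketch is not from the paper; the three-valued sample space, the uniform prior and the constraint value 2.5 are assumptions chosen purely for illustration.

```python
import itertools
from collections import Counter

# Numeric illustration of the conditional limit theorem (6): for a uniform
# prior Q on X = {1,2,3} and the empirical constraint mean(x_1..x_n) = 2.5,
# the conditional marginal of X_1 approaches the MaxEnt distribution P~.

X = [1, 2, 3]
t = 2.5  # constraint value; n*t must be an integer for the event to be nonempty

for n in [2, 4, 8, 12]:
    counts = Counter()
    total = 0
    for seq in itertools.product(X, repeat=n):
        if sum(seq) == t * n:          # sequence satisfies the constraint
            counts[seq[0]] += 1        # uniform Q: conditioning = counting
            total += 1
    marginal = {j: counts[j] / total for j in X}
    print(n, marginal)
# As n grows, the printed marginals converge to the MaxEnt probabilities
# p~(j) proportional to exp(-beta*j), with beta chosen so that E[X] = 2.5
# (numerically about (0.116, 0.268, 0.616)).
```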
**Our Results.** Both theorems above say that, for some sets $A$,

$$Q^n(A \mid \overline{T}^{(n)} = \tilde{t}) \approx \tilde{P}^n(A). \qquad (7)$$

In the concentration phenomenon, the set $A \subseteq \mathcal{X}^n$ is about the frequencies of individual outcomes in the sample. In the conditional limit theorem, $A \subseteq \mathcal{X}^1$ only concerns the first outcome. One might conjecture that (7) holds asymptotically in a much wider sense, namely for just about any set whose probability one may be interested in. For examples of such sets see Example 4.2. In Theorems 4.1 and 4.3 we show that (7) indeed holds for a very large class of sets; moreover, we give an explicit indication of the error one makes if one approximates $Q(A \mid \overline{T}^{(n)} = \tilde{t})$ by $\tilde{P}(A)$. In this way we unify and strengthen both the concentration phenomenon and the conditional limit theorem.

To be more precise, let $\{A_n\}$, with $A_i \subseteq \mathcal{X}^i$, be a sequence of 'typical' sets for $\tilde{P}$ in the sense that $\tilde{P}^n(A_n)$ goes to 1 sufficiently fast. Then, broadly speaking, Theorem 4.1 shows that $Q^n(A_n \mid \overline{T}^{(n)} = \tilde{t})$ goes to 1 too, 'almost' as fast as $\tilde{P}^n(A_n)$. Theorem 4.3, our main theorem, says that, if $m_n$ is an arbitrary increasing sequence with $\lim_{n\to\infty} m_n/n = 0$, then for every (measurable) sequence $A_{m_1}, A_{m_2}, \ldots$ (i.e. not just the typical ones), with $A_{m_n} \subseteq \mathcal{X}^{m_n}$, we have $\tilde{P}^n(A_{m_n}) \to Q^n(A_{m_n} \mid \overline{T}^{(n)} = \tilde{t})$.

In Section 5, we first give an interpretation of our strong concentration results in terms of data compression. We then show (Theorem 5.2) that our concentration phenomenon implies that the MaxEnt distribution $\tilde{P}$ achieves the best minimax time-averaged logarithmic loss (codelength) achievable for sequential prediction of samples satisfying the constraint. We also characterize (Theorems 5.3 and 5.4) the precise conditions under which $\tilde{P}$ also achieves the total (non-time-averaged) minimax logarithmic loss. Surprisingly, the answer depends crucially on the dimensionality $k$ of the constraint random vector $T$: for $k \leq 2$, $\tilde{P}$ is also best in the total sense. For $k \geq 3$, there exist distributions which consistently outperform $\tilde{P}$. This is related to the well-known fact that random walks in $\mathbb{R}^k$ are transient if $k \geq 3$.

3 Mathematical Preliminaries

**The Sample Space.** From now on we assume a sample space $\mathcal{X} \subseteq \mathbb{R}^l$ for some $l > 0$ and let $X$ be the random vector with $X(x) = x$ for all $x \in \mathcal{X}$. We reserve the symbol $Q$ to refer to a distribution for $X$ called the prior distribution (formally, $Q$ is a distribution on $(\mathcal{X}, \sigma(X))$, where $\sigma(X)$ is the Borel $\sigma$-algebra generated by $X$). We will be interested in sequences of i.i.d. random variables $X_1, X_2, \ldots$, all distributed according to $Q$. Whenever no confusion can arise, we use $Q$ also to refer to the joint (product) distribution on $\times_{i\in\mathbb{N}} \mathcal{X}_i$. Otherwise, we use $Q^m$ to denote the $m$-fold product distribution of $Q$. The sample $(X_1, \ldots, X_m)$ will also be written as $X^{(m)}$.

**The Constraint Functions $T$.** Let $T = (T_{[1]}, \ldots, T_{[k]})$ be a $k$-dimensional random vector that is $\sigma(X)$-measurable. We refer to the event $\{x \in \mathcal{X} \mid T(x) = t\}$ both as '$T(X) = t$' and as '$T = t$'. Similarly, we write $T_i = t$ as an abbreviation of $T(X_i) = t$, and $T^{(n)}$ as short for $(T(X_1), \ldots, T(X_n))$. The average of $n$ observations of $T$ will be denoted by $\overline{T}^{(n)} := n^{-1}\sum_{i=1}^n T(X_i)$.
We assume that the support of $X$ is either countable (in which case the prior distribution $Q$ admits a probability mass function) or that it is a connected subset of $\mathbb{R}^l$ for some $l \geq 1$ (in which case we assume that $Q$ has a bounded continuous density with respect to Lebesgue measure). In both cases, we denote the probability mass function/density by $q$. If $\mathcal{X}$ is countable, we shall further assume that $T$ is of the lattice form (which it will be in most applications):

**Definition 3.1** [Feller, 1968, page 490] A $k$-dimensional lattice random vector $T = (T_{[1]}, \ldots, T_{[k]})$ is a random vector for which there exist real-valued $b_1, \ldots, b_k$ and $h_1, \ldots, h_k$ such that, for $1 \leq j \leq k$, $\forall x \in \mathcal{X}: T_{[j]}(x) \in \{b_j + s h_j \mid s \in \mathbb{N}\}$. We call the largest $h_j$ for which this holds the span of $T_{[j]}$.

If $\mathcal{X}$ is continuous, we shall assume that $T$ is 'regular':

**Definition 3.2** We say a $k$-dimensional random vector is of regular continuous form if its distribution admits a bounded continuous density with respect to Lebesgue measure.

**Maximum Entropy.** Throughout the paper, $\log$ is used to denote logarithm to base 2. Let $P$, $Q$ be distributions for $X$. We define $H_Q(P)$, the $Q$-entropy of $P$, as

$$H_Q(P) = -D(P\|Q), \qquad (8)$$

where $D$ is the KL divergence between $P$ and $Q$. This is defined even if $P$ or $Q$ have no densities [Csiszár, 1975]. Assume we are given a constraint of form (2), i.e. $E_P[T(X)] = \tilde{t}$, where $T = (T_{[1]}, \ldots, T_{[k]})$ and $\tilde{t} = (\tilde{t}_{[1]}, \ldots, \tilde{t}_{[k]})$. We define, if it exists, $\tilde{P}$ to be the unique distribution on $\mathcal{X}$ that maximizes the $Q$-entropy over all distributions on $\mathcal{X}$ satisfying (2). That is, $\tilde{P}$ is given by (3). If Condition 1 below holds, then $\tilde{P}$ exists and is given by the exponential form (9), as expressed in Proposition 3.3 below. In the condition, the notation $a^{\mathsf{T}} b$ refers to the dot product between $a$ and $b$.

Condition 1: There exists $\tilde{\beta} \in \mathbb{R}^k$ such that $Z(\tilde{\beta}) = \int_{x\in\mathcal{X}} \exp(-\tilde{\beta}^{\mathsf{T}} T(x))\, dQ(x)$ is finite and the distribution $\tilde{P}$ with density (with respect to $Q$)

$$\tilde{p}(x) := \frac{1}{Z(\tilde{\beta})}\, e^{-\tilde{\beta}^{\mathsf{T}} T(x)} \qquad (9)$$

satisfies $E_{\tilde{P}}[T(X)] = \tilde{t}$.

In our theorems, we shall simply assume that Condition 1 holds. A sufficient (by no means necessary!) requirement for Condition 1 is, for example, that $Q$ has bounded support; Csiszár [1975] gives a more precise characterization. We will also assume in our theorems the following natural condition:

Condition 2: The '$T$-covariance matrix' $\Sigma$ with $\Sigma_{ij} = E_{\tilde{P}}[T_{[i]} T_{[j]}] - E_{\tilde{P}}[T_{[i]}]\, E_{\tilde{P}}[T_{[j]}]$ is invertible.

$\Sigma$ is guaranteed to exist by Condition 1 (see any book with a treatment of exponential families, for example [Grünwald, 2007]) and will be singular only if either $\tilde{t}_{[j]}$ lies at the boundary of the range of $T_{[j]}$ for some $j$, or if some of the $T_{[j]}$ are affine combinations of the others. In the first case, the constraint $T_{[j]} = \tilde{t}_{[j]}$ can be replaced by restricting the sample space to $\{x \in \mathcal{X} \mid T_{[j]}(x) = \tilde{t}_{[j]}\}$ and considering the remaining constraints for the new sample space. In the second case, we can remove some of the $T_{[i]}$ from the constraint without changing the set of distributions satisfying it, making $\Sigma$ once again invertible.
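Condition 1 also suggests a practical recipe: for a one-dimensional constraint on a finite sample space, $\tilde{\beta}$ can be found by root-finding on the map $\beta \mapsto E_{\tilde{P}}[T]$, which is monotone in $\beta$. The sketch below is not from the paper; it assumes the Brandeis dice setting of Example 4.2 below (uniform prior, $E[X] = 4.5$) and uses SciPy's `brentq` solver.

```python
import numpy as np
from scipy.optimize import brentq

# Minimal numerical sketch of Condition 1 for a finite sample space and a
# one-dimensional constraint E[T(X)] = t~. Instance: X = {1,...,6}, T(x) = x,
# uniform prior q, t~ = 4.5 (the Brandeis dice setting of Example 4.2).

x = np.arange(1, 7)
q = np.full(6, 1 / 6)
t_tilde = 4.5

def maxent_mean(beta):
    """E_{P~}[T] under p~(x) proportional to q(x) * exp(-beta * T(x)), cf. (9)."""
    w = q * np.exp(-beta * x)
    return np.dot(x, w) / w.sum()

# Root-finding gives the beta~ of Condition 1; the bracket [-10, 10] is wide
# enough because the mean is monotone in beta and spans (1, 6) on it.
beta_tilde = brentq(lambda b: maxent_mean(b) - t_tilde, -10, 10)
p_tilde = q * np.exp(-beta_tilde * x)
p_tilde /= p_tilde.sum()
print(np.round(p_tilde, 5))
# -> [0.05435 0.07877 0.11416 0.16545 0.23977 0.34749], matching (13) below.
```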
**Proposition 3.3** (Csiszár [1975]) Assume Condition 1 holds for constraint (2). Then $\inf\{D(P\|Q) \mid P: E_P[T(X)] = \tilde{t}\}$ is attained by a $\tilde{P}$ of the form (9). If, additionally, Condition 2 holds, then Condition 1 holds for only one $\tilde{\beta} \in \mathbb{R}^k$ and the infimum is uniquely attained by the unique $\tilde{P}$ satisfying (9). If Condition 1 holds, then $\tilde{t}$ determines both $\tilde{\beta}$ and $\tilde{P}$.

4 The Concentration Theorems

**Theorem 4.1** (the concentration phenomenon for typical sets, lattice case) Assume we are given a constraint of form (2) such that $T$ is of the lattice form, $h = (h_1, \ldots, h_k)$ is the span of $T$, and Conditions 1 and 2 hold. Then there exists a sequence $\{c_i\}$ satisfying

$$\lim_{n\to\infty} c_n = \frac{\prod_{j=1}^k h_j}{\sqrt{(2\pi)^k \det\Sigma}}$$

such that:

1. Let $A_1, A_2, \ldots$ be an arbitrary sequence of sets with $A_i \subseteq \mathcal{X}^i$. For all $n$ with $Q(\overline{T}^{(n)} = \tilde{t}) > 0$, we have
$$\tilde{P}(A_n) \geq n^{-k/2}\, c_n\, Q(A_n \mid \overline{T}^{(n)} = \tilde{t}). \qquad (10)$$
Hence if $B_1, B_2, \ldots$ is a sequence of sets with $B_i \subseteq \mathcal{X}^i$ whose probability tends to 1 under $\tilde{P}$ in the sense that $1 - \tilde{P}(B_n) = O(f(n)\, n^{-k/2})$ for some function $f: \mathbb{N} \to \mathbb{R}$ with $f(n) = o(1)$, then $Q(B_n \mid \overline{T}^{(n)} = \tilde{t})$ tends to 1 in the sense that $1 - Q(B_n \mid \overline{T}^{(n)} = \tilde{t}) = O(f(n))$.

2. If for all $n$, $A_n \subseteq \{x^{(n)} \mid n^{-1}\sum_{i=1}^n T(x_i) = \tilde{t}\}$, then (10) holds with equality.

As discussed in Section 5, Theorem 4.1 has applications for data compression. The relation of the theorem to Jaynes' original concentration phenomenon is discussed at the end of the present section.

Proof. We need the following theorem:

**Theorem** ('local central limit theorem for lattice random variables', Feller [1968], page 490) Let $T = (T_{[1]}, \ldots, T_{[k]})$ be a lattice random vector and $h_1, \ldots, h_k$ be the corresponding spans as in Definition 3.1; let $E_P[T(X)] = t$ and suppose that $P$ satisfies Condition 2 with $T$-covariance matrix $\Sigma$. Let $X_1, X_2, \ldots$ be i.i.d. with common distribution $P$. Let $V$ be a closed and bounded set in $\mathbb{R}^k$. Let $v_1, v_2, \ldots$ be a sequence in $V$ such that for all $n$, $P\left(\sum_{i=1}^n (T_i - t)/\sqrt{n} = v_n\right) > 0$. Then as $n \to \infty$,

$$\frac{n^{k/2}}{\prod_{j=1}^k h_j}\; P\left(\frac{\sum_{i=1}^n (T_i - t)}{\sqrt{n}} = v_n\right) - \aleph(v_n) \to 0.$$

Here $\aleph$ is the density of a $k$-dimensional normal distribution with mean vector $\mu = 0$ and covariance matrix $\Sigma$.

Feller gives the local central limit theorem only for 1-dimensional lattice random variables with $E[T] = 0$ and $\mathrm{var}[T] = 1$; extending the proof to $k$-dimensional random vectors with arbitrary means and covariances is, however, completely straightforward: see XV.7 (page 494) of [Feller, 1968].
The theorem shows that there exists a sequence $d_1, d_2, \ldots$ with $\lim_{n\to\infty} d_n = 1$ such that, for all $n$ with $P\left(\sum_{i=1}^n (T_i - t) = 0\right) > 0$,

$$\frac{n^{k/2}}{\prod_{j=1}^k h_j} \cdot \frac{P\left(\frac{\sum_{i=1}^n (T_i - t)}{\sqrt{n}} = 0\right)}{\aleph(0)} = \frac{\sqrt{(2\pi n)^k \det\Sigma}}{\prod_{j=1}^k h_j}\; P\left(\frac{1}{n}\sum_{i=1}^n T_i = t\right) = d_n. \qquad (11)$$

The proof now becomes very simple. First note that $\tilde{P}(A_n \mid \overline{T}^{(n)} = \tilde{t}) = Q(A_n \mid \overline{T}^{(n)} = \tilde{t})$ (write out the definition of conditional probability and realize that $\exp(-\tilde{\beta}^{\mathsf{T}} T(x)) = \exp(-\tilde{\beta}^{\mathsf{T}} \tilde{t}) = \text{constant}$ for all $x$ with $T(x) = \tilde{t}$). Use this to show that

$$\tilde{P}(A_n) \geq \tilde{P}(A_n, \overline{T}^{(n)} = \tilde{t}) = \tilde{P}(A_n \mid \overline{T}^{(n)} = \tilde{t})\, \tilde{P}(\overline{T}^{(n)} = \tilde{t}) = Q(A_n \mid \overline{T}^{(n)} = \tilde{t})\, \tilde{P}(\overline{T}^{(n)} = \tilde{t}). \qquad (12)$$

Clearly, with $\tilde{P}$ in the rôle of $P$, the local central limit theorem is applicable to the random vector $T$. Then, by (11),

$$\tilde{P}(\overline{T}^{(n)} = \tilde{t}) = \frac{\prod_{j=1}^k h_j}{\sqrt{(2\pi n)^k \det\Sigma}}\, d_n.$$

Defining $c_n := \tilde{P}(\overline{T}^{(n)} = \tilde{t})\, n^{k/2}$ finishes the proof of item 1. For item 2, notice that in this case (12) holds with equality; the rest of the proof remains unchanged. ∎

**Example 4.2** The 'Brandeis dice example' is a toy example frequently used by Jaynes and others in discussions of the MaxEnt formalism [Jaynes, 1978]. Let $\mathcal{X} = \{1, \ldots, 6\}$ and let $X$ be the outcome of one throw of some given die. We initially believe (e.g. for reasons of symmetry) that the distribution of $X$ is uniform. Then $Q(X = j) = 1/6$ for all $j$ and $E_Q[X] = 3.5$. We are then told that the average number of spots is $E[X] = 4.5$ rather than $3.5$. As calculated by Jaynes, the MaxEnt distribution $\tilde{P}$ given this constraint is given by

$$(\tilde{p}(1), \ldots, \tilde{p}(6)) = (0.05435, 0.07877, 0.11416, 0.16545, 0.23977, 0.34749). \qquad (13)$$

By the Chernoff/Hoeffding bound, for every $j \in \mathcal{X}$ and every $\varepsilon > 0$, $\tilde{P}(|n^{-1}\sum_{i=1}^n I_j(X_i) - \tilde{p}(j)| > \varepsilon) < 2\exp(-nc)$ for some constant $c > 0$ depending on $\varepsilon$; here $I_j(X)$ is the indicator function for $X = j$. Theorem 4.1 then implies that $Q(|n^{-1}\sum_{i=1}^n I_j(X_i) - \tilde{p}(j)| > \varepsilon \mid \overline{T}^{(n)} = \tilde{t}) = O(\sqrt{n}\, e^{-nc}) = O(e^{-nc'})$ for some $c' > 0$. In this way we recover Jaynes' original concentration phenomenon (5): the fraction of sequences satisfying the constraint with frequencies close to the MaxEnt probabilities $\tilde{p}$ is overwhelmingly large.

Suppose now we receive new information about an additional constraint: $P(X = 4) = P(X = 5) = 1/2$. This can be expressed as a moment constraint by $E[(I_4(X), I_5(X))^{\mathsf{T}}] = (0.5, 0.5)^{\mathsf{T}}$, where $I_j$ is the indicator function of the event $X = j$. We can now either use $\tilde{P}$ defined as in (13) in the rôle of prior $Q$ and impose the new constraint $E[(I_4(X), I_5(X))^{\mathsf{T}}] = (0.5, 0.5)^{\mathsf{T}}$, or use uniform $Q$ and impose the combined constraint $E[T] = E[(T_{[1]}, T_{[2]}, T_{[3]})^{\mathsf{T}}] = (4.5, 0.5, 0.5)^{\mathsf{T}}$, with $T_{[1]} = X$, $T_{[2]} = I_4(X)$, $T_{[3]} = I_5(X)$. In both cases we end up with a new MaxEnt distribution $\tilde{\tilde{p}}$ with $\tilde{\tilde{p}}(4) = \tilde{\tilde{p}}(5) = 1/2$. This distribution, while still consistent with the original constraint $E[X] = 4.5$, rules out the vast majority of sequences satisfying it. However, we can apply our concentration phenomenon again to the new MaxEnt distribution $\tilde{\tilde{P}}$. Let $I_{j,j',\varepsilon}$ denote the event that

$$\left|\frac{1}{n}\sum_{i=1}^n I_j(X_i) - \frac{\sum_{i=1}^{n-1} I_{j'}(X_i)\, I_j(X_{i+1})}{\sum_{i=1}^{n-1} I_{j'}(X_i)}\right| > \varepsilon.$$

According to $\tilde{\tilde{P}}$, we still have that $X_1, X_2, \ldots$ are i.i.d. Then by the Chernoff/Hoeffding bound, for each $\varepsilon > 0$ and $j, j' \in \{4, 5\}$, $\tilde{\tilde{P}}(I_{j,j',\varepsilon})$ is exponentially small. Theorem 4.1 then implies that $Q^n(I_{j,j',\varepsilon} \mid \overline{T}^{(n)} = (4.5, 0.5, 0.5)^{\mathsf{T}})$ is exponentially small too: for the overwhelming majority of samples satisfying the combined constraint, the sample will look just as if it had been generated by an i.i.d. process, even though $X_1, \ldots, X_n$ are obviously not completely independent under $Q^n(\cdot \mid \overline{T}^{(n)} = (4.5, 0.5, 0.5)^{\mathsf{T}})$.

There also exists a version of Theorem 4.1 for continuous-valued random vectors. This is given, along with the proof, in the technical report [Grünwald, 2001b].
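To see the concentration phenomenon 'in action', one can sample constrained sequences and inspect their empirical frequencies. Direct rejection sampling from $Q$ is hopeless here, because under $Q$ the constraint is exponentially rare; the identity $\tilde{P}(\cdot \mid \overline{T}^{(n)} = \tilde{t}) = Q(\cdot \mid \overline{T}^{(n)} = \tilde{t})$ from the proof of Theorem 4.1 gives a practical workaround. The following Monte Carlo sketch is not from the paper; the sample size and repetition count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo sketch of the concentration phenomenon in the Brandeis dice
# example. We need draws from Q^n( . | mean = 4.5) with Q uniform; instead of
# rejecting from Q (exponentially rare acceptance), we sample from p~, under
# which the constraint has probability of order n^{-1/2}, and use the identity
# P~(.|constraint) = Q(.|constraint) from the proof of Theorem 4.1.

p_tilde = np.array([0.05435, 0.07877, 0.11416, 0.16545, 0.23977, 0.34749])
n = 30                          # sample size; n*4.5 = 135 must be an integer
target_sum = int(n * 4.5)

freqs = []
while len(freqs) < 200:         # collect 200 constrained sequences
    x = rng.choice(np.arange(1, 7), size=n, p=p_tilde)
    if x.sum() == target_sum:   # accepted draws are Q^n(.|constraint) samples
        counts = np.bincount(x, minlength=7)[1:7]
        freqs.append(counts / n)

print(np.round(np.mean(freqs, axis=0), 3))   # close to p~ from (13)
print(np.round(p_tilde, 3))
```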
There are a few limitations to Theorem 4.1: (1) we must require that $\tilde{P}(A_n)$ goes to 0 or 1 as $n \to \infty$; (2) the continuous case needed a separate statement, which is caused by the more fundamental limitation that (3) the proof technique used cannot be adapted to point-wise conditioning on $\overline{T}^{(n)} = \tilde{t}$ in the continuous case [Grünwald, 2001b]. Theorem 4.3 overcomes all these problems. The price we pay is that, when conditioning on $\overline{T}^{(n)} = \tilde{t}$, the sets $A_m$ must only refer to $X_1, \ldots, X_m$, where $m$ is such that $m/n \to 0$; for example, $m = \lceil n/\log n \rceil$ will work.

Whenever in the case of continuous-valued $T$ we write $Q(\cdot \mid \overline{T}^{(n)} = t)$ or $\tilde{P}(\cdot \mid \overline{T}^{(n)} = t)$, we refer to the continuous version of these quantities. These are easily shown to exist [Grünwald, 2001b]. Recall that (for $m < n$) $Q^m(\cdot \mid \overline{T}^{(n)} = \tilde{t})$ refers to the marginal distribution of $X_1, \ldots, X_m$ conditioned on $\overline{T}^{(n)} = \tilde{t}$. It is implicitly understood in the theorem that in the lattice case, $n$ ranges only over those values for which $Q(\overline{T}^{(n)} = \tilde{t}) > 0$.

**Theorem 4.3** (Main Theorem: the Strong Concentration Phenomenon / Strong Conditional Limit Theorem) Let $\{m_i\}$ be an increasing sequence with $m_i \in \mathbb{N}$, such that $\lim_{n\to\infty} m_n/n = 0$. Assume we are given a constraint of form (2) such that $T$ is of the regular continuous form or of the lattice form, and suppose that Conditions 1 and 2 are satisfied. Then as $n \to \infty$, $Q^{m_n}(\cdot \mid \overline{T}^{(n)} = \tilde{t})$ converges weakly to $\tilde{P}^{m_n}(\cdot)$.

Discussion of "weak convergence", as well as the proof (using the same key idea, but involving much more work than the proof of Theorem 4.1), is in the technical report [Grünwald, 2001b].

**Related Results.** Theorem 4.1 is related to Jaynes' original concentration phenomenon, the proof of which is based on Stirling's approximation of the factorial. Another closely related result (also based on Stirling's approximation) is in Example 5.5.8 of Li and Vitányi [1997]. Both results can easily be extended to prove the following weaker version of Theorem 4.1, item 1: $\tilde{P}(A_n) \geq n^{-|\mathcal{X}|}\, c_n\, Q(A_n \mid \overline{T}^{(n)} = \tilde{t})$, where $c_n$ tends to some constant. Note that in this form, the theorem is void for infinite sample spaces. Jaynes [1982] extends the original concentration phenomenon in a direction somewhat different from Theorem 4.1; it would be interesting to study the relations.

Theorem 4.3 is similar to the original 'conditional limit theorems' (Theorems 1 and 2) of van Campenhout and Cover [1981]. We note that the preconditions for our theorem to hold are weaker and the conclusion is stronger than for the original conditional limit theorems, the main novelty being that Theorem 4.3 supplies us with an explicit bound on how fast $m$ can grow as $n$ tends to infinity. The conditional limit theorem was later extended by Csiszár [1984]. His setting is considerably more general than ours (e.g. allowing for general convex constraints rather than just moment constraints), but his results also lack an explicit estimate of the rate at which $m$ can increase with $n$. Csiszár [1984] and Cover and Thomas [1991] (where a simplified version of the conditional limit theorem is proved) both make the connection to large deviation results, in particular Sanov's theorem. As shown in the latter reference, weak versions of the conditional limit theorem can be interpreted as immediate consequences of Sanov's theorem.
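Both Theorem 4.1 and these Sanov-style arguments hinge on the polynomial (here $n^{-k/2}$) price of conditioning, which comes from the local central limit approximation (11). That rate is easy to check numerically: for the dice constraint ($k = 1$, span $h = 1$), the exact probability $\tilde{P}(\overline{T}^{(n)} = \tilde{t})$ can be computed by convolving the distribution of the sum. A sketch, not from the paper:

```python
import numpy as np

# Numeric check of the local central limit approximation (11) behind
# Theorem 4.1: for the Brandeis dice constraint, P~(T-bar(n) = 4.5) should
# behave like h / sqrt(2*pi*n*Sigma) with span h = 1 and Sigma = var of T.

p_tilde = np.array([0.05435, 0.07877, 0.11416, 0.16545, 0.23977, 0.34749])
x = np.arange(1, 7)
mean = np.dot(x, p_tilde)                    # = 4.5
sigma2 = np.dot(x**2, p_tilde) - mean**2     # T-covariance (a scalar here)

def prob_sum(n, s):
    """Exact P~(X_1 + ... + X_n = s) by dynamic programming over partial sums."""
    dist = np.zeros(6 * n + 1)
    dist[0] = 1.0
    for _ in range(n):
        new = np.zeros_like(dist)
        for v, pv in zip(x, p_tilde):
            new[v:] += pv * dist[:len(dist) - v]
        dist = new
    return dist[s]

for n in [10, 20, 40, 80]:                   # even n, so that 4.5*n is integer
    exact = prob_sum(n, int(4.5 * n))
    approx = 1.0 / np.sqrt(2 * np.pi * n * sigma2)
    print(n, exact, approx, exact / approx)  # the ratio d_n tends to 1
```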
5 Consequences for Data Compression Games

For simplicity we restrict ourselves in this section to countable sample spaces $\mathcal{X}$, and we identify probability mass functions with probability distributions. Below we make frequent use of coding-theoretic concepts, which we first briefly review.

5.1 Theorem 4.1 and Data Compression

Recall that by the Kraft inequality [Cover and Thomas, 1991], for every prefix code with lengths $L$ over symbols from a countable alphabet $\mathcal{X}^n$, there exists a (possibly sub-additive) probability mass function $p$ over $\mathcal{X}^n$ such that for all $x^{(n)} \in \mathcal{X}^n$, $L(x^{(n)}) = -\log p(x^{(n)})$. We will call this $p$ the 'probability (mass) function corresponding to $L$'. Similarly, for every probability mass function $p$ over $\mathcal{X}^n$ there exists a (prefix) code with lengths $L(x^{(n)}) = \lceil -\log p(x^{(n)}) \rceil$. Neglecting the round-off error, we will simply say that for every $p$, there exists a code with lengths $L(x^{(n)}) = -\log p(x^{(n)})$. We call the code with these lengths 'the code corresponding to $p$'. By the information inequality [Cover and Thomas, 1991], this is also the most efficient code to use if the data $X^{(n)}$ were actually distributed according to $p$.

We can now see that Theorem 4.1, item 2, has important implications for coding. Consider the following special case of Theorem 4.1, which obtains by taking $A_n = \{x^{(n)}\}$ and logarithms:

**Corollary 5.1** (the concentration phenomenon, coding-theoretic formulation) Assume we are given a constraint of form (2) such that $T$ is of the lattice form, $h = (h_1, \ldots, h_k)$ is the span of $T$, and Conditions 1 and 2 hold. For all $n$, all $x^{(n)}$ with $n^{-1}\sum_{i=1}^n T(x_i) = \tilde{t}$, we have

$$-\log \tilde{p}(x^{(n)}) = -\log q\left(x^{(n)} \,\middle|\, \overline{T}^{(n)} = \tilde{t}\right) + \frac{k}{2}\log 2\pi n + \log\sqrt{\det\Sigma} - \sum_{j=1}^k \log h_j + o(1) = -\log q\left(x^{(n)} \,\middle|\, \overline{T}^{(n)} = \tilde{t}\right) + \frac{k}{2}\log n + O(1). \qquad (14)$$

In words, this means the following. Let $x^{(n)}$ be a sample distributed according to $Q$, and suppose we are given the information that $n^{-1}\sum_{i=1}^n T(x_i) = \tilde{t}$. Then, by the information inequality, the most efficient code to encode $x^{(n)}$ is the one based on $q(\cdot \mid \overline{T}^{(n)} = \tilde{t})$, with lengths $-\log q(x^{(n)} \mid \overline{T}^{(n)} = \tilde{t})$. Yet if we encode $x^{(n)}$ using the code with lengths $-\log \tilde{p}(\cdot)$ (which would be the most efficient had $x^{(n)}$ been generated by $\tilde{p}$), then the number of extra bits we need is only of the order $(k/2)\log n$. That means, for example, that the number of additional bits we need per outcome goes to 0 as $n$ increases. Grünwald [2001a] used Corollary 5.1 to establish a formal connection between the concentration phenomenon and universal coding, a central concept of information theory; this is worked out in more detail by Grünwald [2007], Chapter 10, Section 2.2. In the present paper, we focus on the game-theoretic consequences of Corollary 5.1.
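The overhead in (14) can be verified numerically in the Brandeis dice example. Since the prior is uniform there, every constrained sequence receives the same probability under both $\tilde{p}$ and $q(\cdot \mid \mathcal{C}_n)$, so a single representative sequence suffices. A sketch under these assumptions, not from the paper:

```python
import numpy as np

# Checking the (k/2) log n codelength overhead of Corollary 5.1 for the
# Brandeis dice constraint (k = 1, span h = 1). For even n we use the
# representative sequence with n/2 fours and n/2 fives (sum = 4.5*n).

p_tilde = np.array([0.05435, 0.07877, 0.11416, 0.16545, 0.23977, 0.34749])

def log2_Q_Cn(n):
    """log2 Q(C_n) for uniform q, via exact convolution of the sum of n dice."""
    dist = np.zeros(6 * n + 1)
    dist[0] = 1.0
    for _ in range(n):
        new = np.zeros_like(dist)
        for v in range(1, 7):
            new[v:] += (1 / 6) * dist[:len(dist) - v]
        dist = new
    return np.log2(dist[int(4.5 * n)])

for n in [10, 20, 40, 80, 160]:
    code_maxent = -(n / 2) * (np.log2(p_tilde[3]) + np.log2(p_tilde[4]))
    code_cond = n * np.log2(6) + log2_Q_Cn(n)   # -log2 q(x^(n) | C_n)
    overhead = code_maxent - code_cond
    print(n, round(overhead, 3), round(overhead - 0.5 * np.log2(n), 3))
# The last column stabilizes: the overhead is (1/2) log n + O(1), as in (14).
```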
5.2 Empirical Constraints and Game Theory

Recall that we assume countable $\mathcal{X}$. The $\sigma$-algebra of such $\mathcal{X}$ is tacitly taken to be the power set of $\mathcal{X}$. The $\sigma$-algebra thus being implicitly understood, we can define $\mathcal{P}(\mathcal{X})$ to be the set of all probability distributions over $\mathcal{X}$. For a product $\mathcal{X}^\infty = \times_{i\in\mathbb{N}} \mathcal{X}$ of a countable sample space $\mathcal{X}$, we define $\mathcal{P}(\mathcal{X}^\infty)$ to be the set of all distributions over the product space with the associated product $\sigma$-algebra.

Topsøe [1979] and Grünwald and Dawid [2004] provided characterizations of Maximum Entropy distributions quite different from the present one. It was shown that, under regularity conditions,

$$H_q(\tilde{p}) = \sup_{p^*:\, E_{p^*}[T] = \tilde{t}}\ \inf_p\ E_{p^*}\left[-\log\frac{p(X)}{q(X)}\right] = \inf_p\ \sup_{p^*:\, E_{p^*}[T] = \tilde{t}}\ E_{p^*}\left[-\log\frac{p(X)}{q(X)}\right], \qquad (15)$$

where both $p$ and $p^*$ are understood to be members of $\mathcal{P}(\mathcal{X})$ and $H_q(\tilde{p})$ is defined as in (1). By this result, the MaxEnt setting can be thought of as a game between Nature, who can choose any $p^*$ satisfying the constraint, and Statistician, who only knows that Nature will choose a $p^*$ satisfying the constraint. Statistician wants to minimize his worst-case expected codelength (relative to $q$), where the worst case is over all choices for Nature. In such game-theoretic contexts, the codelength is usually called "logarithmic score" or "logarithmic loss" [Grünwald and Dawid, 2004]. It turns out that the minimax strategy for Statistician in (15) is given by $\tilde{p}$. That is,

$$\tilde{p} = \arg\inf_p\ \sup_{p^*:\, E_{p^*}[T] = \tilde{t}}\ E_{p^*}\left[-\log\frac{p(X)}{q(X)}\right]. \qquad (16)$$

Thus, $\tilde{p}$ is both the optimal strategy for Nature and for Statistician. This gives a decision-theoretic justification of using MaxEnt probabilities which seems quite different from our concentration phenomenon. Or is it? Realizing that in practical situations we deal with empirical constraints of form (4) rather than (2), we may wonder what distribution $\hat{p}$ is minimax in the empirical version of problem (16). In this version, Nature gets to choose an individual sequence rather than a distribution. To our knowledge, we are the first to analyze this 'empirical' game. To make it more precise, let

$$\mathcal{C}_n = \left\{x^{(n)} \in \mathcal{X}^n \;\middle|\; n^{-1}\sum_{i=1}^n T(x_i) = \tilde{t}\right\}. \qquad (17)$$

Then, for $n$ with $\mathcal{C}_n \neq \emptyset$, $\hat{p}_n$ (if it exists) is defined by

$$\hat{p}_n := \arg\inf_{p\in\mathcal{P}(\mathcal{X}^n)}\ \sup_{x^{(n)}\in\mathcal{C}_n}\ -\log\frac{p(x_1, \ldots, x_n)}{q(x_1, \ldots, x_n)} = \arg\sup_{p\in\mathcal{P}(\mathcal{X}^n)}\ \inf_{x^{(n)}\in\mathcal{C}_n}\ \frac{p(x^{(n)})}{q(x^{(n)})}. \qquad (18)$$

$\hat{p}_n$ can be interpreted in two ways: (1) it is the distribution that assigns 'maximum probability' (relative to $q$) to all sequences satisfying the constraint; (2) since $-\log\left(\hat{p}(x^{(n)})/q(x^{(n)})\right) = \sum_{i=1}^n\left(-\log\hat{p}(x_i \mid x_1, \ldots, x_{i-1}) + \log q(x_i \mid x_1, \ldots, x_{i-1})\right)$, it is also the $p$ that minimizes cumulative worst-case logarithmic loss relative to $q$ when used for sequentially predicting $x_1, \ldots, x_n$.

One immediately verifies that $\hat{p}_n = q^n(\cdot \mid \overline{T}^{(n)} = \tilde{t})$: the solution to the empirical minimax problem is just the conditioned prior, which we know by Theorems 4.1 and 4.3 is in some sense very close to $\tilde{p}$. However, for no single $n$ is $\tilde{p}$ exactly equal to $q^n(\cdot \mid \overline{T}^{(n)} = \tilde{t})$. Indeed, $q^n(\cdot \mid \overline{T}^{(n)} = \tilde{t})$ assigns zero probability to any sequence of length $n$ not satisfying the constraint. This means that using $q$ in prediction tasks against the logarithmic loss will be problematic if the constraint only holds approximately and/or if $n$ is unknown in advance to the Statistician. In the latter case, it is impossible to use $q(\cdot \mid \overline{T}^{(n)} = \tilde{t})$ for prediction without modification.
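On a small instance, the minimax property of the conditioned prior, and its gap to $\tilde{p}$, can be checked by exhaustive enumeration. The following sketch is not from the paper; the three-valued sample space and $n = 6$ are arbitrary illustrative choices.

```python
import itertools
import numpy as np

# Brute-force check of the claim that the conditioned prior q^n(.|C_n) solves
# the empirical minimax game (18). Instance: X = {1,2,3}, uniform q,
# empirical mean 2.5, n = 6.

X = [1, 2, 3]
n, t = 6, 2.5
qn = (1 / 3) ** n                                    # q^n of any sequence
C_n = [s for s in itertools.product(X, repeat=n) if sum(s) == t * n]

r = (1 + np.sqrt(13)) / 2                            # p~(j) ~ r^j gives E[X] = 2.5
p1 = {j: r ** j / (r + r ** 2 + r ** 3) for j in X}

def worst(p_of_seq):
    """Worst-case regret sup over C_n of -log2 p(x^(n))/q^n(x^(n)), cf. (18)."""
    return max(-np.log2(p_of_seq(s) / qn) for s in C_n)

p_hat = lambda s: 1 / len(C_n)                       # = q^n(s | C_n) for uniform q
p_maxent = lambda s: np.prod([p1[j] for j in s])

print("conditioned prior:", worst(p_hat))            # the minimax value
print("MaxEnt product   :", worst(p_maxent))         # strictly larger
# Any p must put mass <= 1/|C_n| on some sequence in C_n, so no p can achieve
# a smaller worst case than p_hat: the conditioned prior is exactly minimax.
```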
To see why the conditioned prior breaks down when $n$ is unknown in advance, suppose that the Statistician guesses that the sample will have length $n_1$ for some $n_1$ with $\mathcal{C}_{n_1} \neq \emptyset$. There exist sequences $x^{(n_2)} = x_1, \ldots, x_{n_1}, \ldots, x_{n_2}$ of length $n_2 > n_1$ satisfying the constraint such that $x^{(n_1)}$ does not satisfy the constraint, and therefore $q(x^{(n_2)} \mid x^{(n_1)} \in \mathcal{C}_{n_1}) = 0$; so $q(\cdot \mid x^{(n_1)} \in \mathcal{C}_{n_1})$ cannot be used for prediction if the actual sequence length turns out to exceed $n_1$. We may guess that in this case ($n$ not known in advance), the MaxEnt distribution $\tilde{p}$, rather than $q(\cdot \mid \overline{T}^{(n)} = \tilde{t})$, is actually a better distribution to use for prediction. The following theorem shows that in some sense this is indeed so:

**Theorem 5.2** Let $\mathcal{X}$ be a countable sample space. Assume we are given a constraint of form (2) such that $T$ is of the lattice form, and such that Conditions 1 and 2 are satisfied. Let $\mathcal{C}_n$ be as in (17). Then the infimum in

$$\inf_{p\in\mathcal{P}(\mathcal{X}^\infty)}\ \sup_{\{n :\, \mathcal{C}_n \neq \emptyset\}}\ \sup_{x^{(n)}\in\mathcal{C}_n}\ -\frac{1}{n}\log\frac{p(x_1, \ldots, x_n)}{q(x_1, \ldots, x_n)} \qquad (19)$$

is achieved by the Maximum Entropy distribution $\tilde{p}$, and is equal to $H_q(\tilde{p})$.

Proof. Let $\mathcal{C} = \cup_{i=1}^\infty \mathcal{C}_i$. We need to show that for all $n$, for all $x^{(n)} \in \mathcal{C}$,

$$H_q(\tilde{p}) = -\frac{1}{n}\log\frac{\tilde{p}(x^{(n)})}{q(x^{(n)})} = \inf_{p\in\mathcal{P}(\mathcal{X}^\infty)}\ \sup_{\{n :\, \mathcal{C}_n \neq \emptyset\}}\ \sup_{x^{(n)}\in\mathcal{C}_n}\ -\frac{1}{n}\log\frac{p(x^{(n)})}{q(x^{(n)})}. \qquad (20)$$

Equation (20) implies that $\tilde{p}$ reaches the infimum in (19) and that the infimum is equal to $H_q(\tilde{p})$. The leftmost equality in (20) is a standard result about exponential families of form (9); see, for example, [Grünwald, 2007]. To prove the rightmost equality in (20), let $x^{(n)} \in \mathcal{C}_n$. Consider the conditional distribution $q(\cdot \mid x^{(n)} \in \mathcal{C}_n)$. Note that, for every distribution $p_0$ over $\mathcal{X}^n$, $p_0(x^{(n)}) \leq q(x^{(n)} \mid x^{(n)} \in \mathcal{C}_n)$ for at least one $x^{(n)} \in \mathcal{C}_n$. By Theorem 4.1 (or rather Corollary 5.1), for this $x^{(n)}$ we have

$$-\frac{1}{n}\log\frac{p_0(x^{(n)})}{q(x^{(n)})} \geq -\frac{1}{n}\log\frac{\tilde{p}(x^{(n)})}{q(x^{(n)})} - \frac{k}{2n}\log n - O\left(\frac{1}{n}\right),$$

and we see that for every distribution $p_0$ over $\mathcal{X}^\infty$,

$$\sup_{\{n :\, \mathcal{C}_n \neq \emptyset\}}\ \sup_{x^{(n)}\in\mathcal{C}_n}\ -\frac{1}{n}\log\frac{p_0(x^{(n)})}{q(x^{(n)})} \geq \sup_{\{n :\, \mathcal{C}_n \neq \emptyset\}}\ \sup_{x^{(n)}\in\mathcal{C}_n}\ -\frac{1}{n}\log\frac{\tilde{p}(x^{(n)})}{q(x^{(n)})},$$

which shows the rightmost equality in (20). ∎
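The leftmost equality in (20), on which the proof rests, can be observed exactly (not just asymptotically) in a toy instance: every constrained sequence has per-outcome log-ratio exactly $H_q(\tilde{p})$. A sketch, not from the paper, reusing the small instance from the previous code block:

```python
import itertools
import numpy as np

# Numeric check of the leftmost identity in (20): for every sequence
# satisfying the constraint, -(1/n) log2( p~(x^(n)) / q(x^(n)) ) equals
# H_q(p~) = -D(p~||q) exactly, whatever n is. Instance: X = {1,2,3}, mean 2.5.

X = np.array([1, 2, 3])
q1 = np.full(3, 1 / 3)
r = (1 + np.sqrt(13)) / 2
p1 = r ** X / np.sum(r ** X)                 # MaxEnt p~ for E[X] = 2.5

H_q = -np.sum(p1 * np.log2(p1 / q1))         # H_q(p~) = -D(p~||q)
print("H_q(p~) =", H_q)

for n in [2, 4, 6]:
    for seq in itertools.product([1, 2, 3], repeat=n):
        if sum(seq) == 2.5 * n:              # seq lies in C_n
            logratio = sum(np.log2(p1[x - 1] / q1[x - 1]) for x in seq)
            assert np.isclose(-logratio / n, H_q)
print("identity holds on every constrained sequence")
```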
Theorem 5.2 shows that, among all distributions on $\mathcal{X}^\infty$, the minimax codelength per outcome for sequences satisfying the constraints is achieved by the maximum entropy $\tilde{p}$. We may now ask whether it is also achieved by any different distribution $p'$, and if so, whether that distribution may even be "better" in the sense that it achieves strictly smaller codelengths on all sequences of all lengths that satisfy the constraints. Surprisingly, the answer depends on the number of constraints $k$: for $k > 2$, there exists such a $p'$. For $k \leq 2$, there does not:

**Theorem 5.3** Assume we are given a constraint such that Conditions 1 and 2 both hold, $\mathcal{X}$ is finite, $T$ is of the lattice form, and $T = (T_{[1]}, \ldots, T_{[k]})$ for some $k > 2$. Then (a), there exists a distribution $p'$ and a constant $c' > 0$ such that, for all large enough $n$ with $\mathcal{C}_n \neq \emptyset$, for all $x^{(n)} \in \mathcal{C}_n$,

$$\log\frac{p'(x_1, \ldots, x_n)}{\tilde{p}(x_1, \ldots, x_n)} > c'\log n.$$

Moreover, (b), there exists a distribution $p''$ and a constant $c'' > 0$ such that for all $n$ (and not just all large $n$) with $\mathcal{C}_n \neq \emptyset$, for all $x^{(n)} \in \mathcal{C}_n$,

$$\log\frac{p''(x_1, \ldots, x_n)}{\tilde{p}(x_1, \ldots, x_n)} > c''.$$

**Theorem 5.4** Assume we are given a constraint such that Conditions 1 and 2 both hold, $\mathcal{X}$ is finite, $T$ is of the lattice form, and $T = (T_{[1]}, \ldots, T_{[k]})$ for some $k \leq 2$. Then there exists no distribution $p'$ such that, for all large $n$ with $\mathcal{C}_n \neq \emptyset$, for all $x^{(n)} \in \mathcal{C}_n$,

$$\log\frac{p'(x_1, \ldots, x_n)}{\tilde{p}(x_1, \ldots, x_n)} > 0.$$

The upshot of these theorems is that, if it is known that the sample satisfies the constraint, but the sample size is not known, then, if $k \geq 3$, there exist distributions which are guaranteed to compress the data more than $\tilde{p}$, so that the game-theoretic justification for predicting/coding with $\tilde{p}$ is, to some extent, challenged. The proofs of both theorems make use of the following lemma, which we state and prove first:

**Lemma 5.5** Under the conditions of Theorems 5.3 and 5.4, suppose there exists a distribution $p'$ such that, for all large enough $n$ with $\mathcal{C}_n \neq \emptyset$, for all $x^{(n)} \in \mathcal{C}_n$,

$$\log\frac{p'(x_1, \ldots, x_n)}{\tilde{p}(x_1, \ldots, x_n)} > 0. \qquad (21)$$

Then there also exists a distribution $p''$ and a fixed $c'' > 0$ such that for all $n$ (and not just for all large $n$) with $\mathcal{C}_n \neq \emptyset$, for all $x^{(n)} \in \mathcal{C}_n$, $\log\frac{p''(x_1, \ldots, x_n)}{\tilde{p}(x_1, \ldots, x_n)} > c''$.

Proof. Let $n^*$ be the smallest $n$ such that for all $x^{(n)} \in \mathcal{C}_n$, condition (21) holds. Let $n_1$ be the smallest $n$ such that $\mathcal{C}_n$ is nonempty. Note that $n^* \geq n_1$. Let $a = \lceil n^*/n_1 \rceil$, i.e. $n^*/n_1$ rounded up to the nearest integer. Then $n_1 \cdot a \geq n^*$, so that, with $n_2 := n_1(a-1)$, we have $0 \leq n_2 < n^*$. We only consider the case $0 < n_2$; the case $n_2 = 0$ is completely analogous, but easier. So assume $n_2 > 0$. Then $\mathcal{C}_{n_2}$ is nonempty (to see this, take any $x^{(n_1)} \in \mathcal{C}_{n_1}$ and note that the sequence consisting of $(a-1)$ repetitions of $x^{(n_1)}$ must satisfy the constraint and therefore be in $\mathcal{C}_{n_2}$). We may assume that

$$\inf_{x^{(n_2)}\in\mathcal{C}_{n_2}} \frac{p'(x^{(n_2)})}{\tilde{p}(x^{(n_2)})} \leq 1 \qquad (22)$$

(otherwise it would follow that $n_2$ rather than $n^*$ is the smallest $n$ for which condition (21) holds, and we would have a contradiction). Now, let $y^{(n_2)} \in \mathcal{C}_{n_2}$ be any sequence that achieves the infimum in (22). Since by our condition, for all $x^{(a \cdot n_1)} \in \mathcal{C}_{a \cdot n_1}$, $p'(x^{(a \cdot n_1)})/\tilde{p}(x^{(a \cdot n_1)}) > 1$, and also $p'(x^{(a n_1)}) = p'(x_{n_2+1}, \ldots, x_{a n_1} \mid x^{(n_2)}) \cdot p'(x^{(n_2)})$, and also $\mathcal{C}_{n_1}$ is finite, it follows that

$$\inf_{z^{(n_1)}\in\mathcal{C}_{n_1}} \frac{P'(X_{n_2+1} = z_1, \ldots, X_{a n_1} = z_{n_1} \mid X^{(n_2)} = y^{(n_2)})}{\tilde{p}(z^{(n_1)})} > 1. \qquad (23)$$

We may now define a probability mass function $p^\circ$ on $\mathcal{X}^{n_1}$ such that, for $z^{(n_1)} \in \mathcal{C}_{n_1}$, $p^\circ(z^{(n_1)})$ is slightly smaller than $P'(X_{n_2+1} = z_1, \ldots, X_{a n_1} = z_{n_1} \mid X^{(n_2)} = y^{(n_2)})$, whereas for $z^{(n_1)} \in \mathcal{X}^{n_1} \setminus \mathcal{C}_{n_1}$, $p^\circ(z^{(n_1)})$ is slightly larger than $P'(X_{n_2+1} = z_1, \ldots, X_{a n_1} = z_{n_1} \mid X^{(n_2)} = y^{(n_2)})$, where we increase and decrease the probability in such a way that the total probability on $\mathcal{X}^{n_1}$ remains 1.
Since $\mathcal{X}^{n_1}$ is finite, using (23), we can do this in such a way that for some $\varepsilon > 0$, for all $z^{(n_1)} \in \mathcal{C}_{n_1}$,

$$\frac{p^\circ(z^{(n_1)})}{\tilde{p}(z^{(n_1)})} \geq 1 + \varepsilon, \qquad (24)$$

whereas for all $z^{(n_1)} \in \mathcal{X}^{n_1} \setminus \mathcal{C}_{n_1}$,

$$\frac{p^\circ(z^{(n_1)})}{P'(X_{n_2+1} = z_1, \ldots, X_{a n_1} = z_{n_1} \mid X^{(n_2)} = y^{(n_2)})} \geq 1 + \varepsilon. \qquad (25)$$

We now extend $p^\circ$ to a probability mass function on $\mathcal{X}^\infty$ by defining, for all $z^{(n_1)} \in \mathcal{X}^{n_1}$, for any $m \geq 1$ and $y^{(m)} \in \mathcal{X}^m$, $p^\circ(z_1, \ldots, z_{n_1}, y_1, \ldots, y_m) := p^\circ(z^{(n_1)})\, P'(X_{n_1+1} = y_1, \ldots, X_{n_1+m} = y_m \mid X^{(n_1)} = z^{(n_1)})$.

Now, for any $n \geq 1$ and $x^{(n)} \in \mathcal{X}^n$, let $m$ be the number of distinct initial segments $x_1, \ldots, x_{n'}$ of $x^{(n)}$ with $n' < n$ that satisfy the constraint, i.e. $x^{(n')} \in \mathcal{C}_{n'}$. Notice that we may have $m = 0$. We set $s_0 = 0$, $s_{m+1} = n$, and, for $j \in \{1, \ldots, m\}$, $s_j$ is set such that $x^{(s_j)} \in \mathcal{C}_{s_j}$ and $s_1 < s_2 < \ldots < s_m < n$. We define

$$p''(x^{(n)}) := \prod_{j=0}^{m} p^\circ(x_{s_j+1}, \ldots, x_{s_{j+1}}).$$

One easily verifies by induction on $n$ that $p''$ defines a probability mass function on $\mathcal{X}^\infty$, and, using (22) and (25), that for all $n$ with $\mathcal{C}_n \neq \emptyset$, all $x^{(n)} \in \mathcal{C}_n$, $p''(x^{(n)})/\tilde{p}(x^{(n)}) \geq 1 + \varepsilon$. The result follows. ∎

Proof (of Theorem 5.3). Let $n_1 < n_2 < \ldots$ be the sequence of all $n$ such that $\mathcal{C}_n \neq \emptyset$. Define, for $j = 1, 2, \ldots$, $q_j := q(\cdot \mid \overline{T}^{(n_j)} = \tilde{t})$. Let $\pi$ be any distribution on the natural numbers such that, for all $j \in \{1, 2, \ldots\}$, $-\log \pi(j) = \log j + O(\log\log j)$. For example, we may take Rissanen's universal prior for the integers [Rissanen, 1989], $\pi(j) \propto 1/(j(\log j)^2)$. Now set, for all $n$ and $x^{(n)} \in \mathcal{X}^n$, $p'(x^{(n)}) := \sum_{j=1,2,\ldots} \pi(j)\, q_j(x^{(n)})$. Then $p'$ uniquely induces a distribution $P'$ on $\mathcal{X}^\infty$ that is actually a mixture of all conditional distributions of $\mathcal{X}^\infty$ given that the constraint holds at some sample size for which it can hold at all. For each $j \in \{1, 2, \ldots\}$ and each $x^{(n_j)} \in \mathcal{C}_{n_j}$, we have

$$-\log p'(x^{(n)}) = -\log\sum_i \pi(i)\, q_i(x^{(n)}) \leq -\log\pi(j) - \log q(x^{(n)} \mid \overline{T}^{(n_j)} = \tilde{t}) \leq \log j + O(\log\log j) - \frac{k}{2}\log n - \log\tilde{p}(x^{(n)}) + O(1),$$

where the final inequality follows by Corollary 5.1. Since $j \leq n$, part (a) of the theorem follows. Part (b) is now an immediate consequence of Lemma 5.5. ∎
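The role of the dimension $k$ in Theorems 5.3 and 5.4 comes from the recurrence/transience dichotomy for $k$-dimensional lattice random walks, which the proof of Theorem 5.4 below exploits. A Monte Carlo sketch, not from the paper; the independent $\pm 1$ steps are a stand-in for the increments $T_i - \tilde{t}$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Sketch of the mechanism behind Theorems 5.3/5.4: "the constraint holds at
# time n" behaves like a k-dimensional lattice random walk returning to the
# origin. Returns are frequent (recurrent) for k <= 2 but rare (transient)
# for k >= 3.

N = 20000          # walk length
reps = 200         # independent walks per dimension

for k in [1, 2, 3]:
    returns = 0
    for _ in range(reps):
        steps = rng.choice([-1, 1], size=(N, k))
        walk = steps.cumsum(axis=0)
        returns += np.count_nonzero(np.all(walk == 0, axis=1))
    print(k, returns / reps)   # average number of returns to 0 up to time N
# Typical behavior: the k=1 count grows like sqrt(N) and the k=2 count like
# log N, while for k=3 the count stays bounded no matter how large N is.
```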
Proof (of Theorem 5.4). Assume, by means of contradiction, that a $p'$ as mentioned in the theorem does exist. Then by Lemma 5.5 there also exists a $p''$ such that for all $n$ with $\mathcal{C}_n \neq \emptyset$, for all $x^{(n)} \in \mathcal{C}_n$, $\log\frac{p''(x_1, \ldots, x_n)}{\tilde{p}(x_1, \ldots, x_n)} > c''$ for some $c'' > 0$. For $0 < \alpha < 1$, define a probability distribution on $\mathcal{X}^\infty$, in terms of its mass function $p_\alpha$, by $p_\alpha(x^{(n)}) := \alpha\, p''(x^{(n)}) + (1-\alpha)\, \tilde{p}(x^{(n)})$. Note that

$$p_\alpha(x^{(n)}) \geq \max\{\alpha\, p''(x^{(n)}),\ (1-\alpha)\, \tilde{p}(x^{(n)})\}. \qquad (26)$$

Now, for any $n \geq 1$ and $x^{(n)} \in \mathcal{X}^n$, let $m$ be the number of distinct initial segments $x_1, \ldots, x_{n'}$ of $x^{(n)}$ with $n' < n$ that satisfy the constraint, i.e. $x^{(n')} \in \mathcal{C}_{n'}$. Notice that we may have $m = 0$. We set $s_0 = 0$, $s_{m+1} = n$, and, for $j \in \{1, \ldots, m\}$, $s_j$ is set such that $x^{(s_j)} \in \mathcal{C}_{s_j}$ and $s_1 < s_2 < \ldots < s_m < n$. We define

$$p^\circ(x^{(n)}) := \prod_{j=0}^{m} p_\alpha(x_{s_j+1}, \ldots, x_{s_{j+1}}).$$

One easily verifies by induction on $n$ that (a) $p^\circ$ is the mass function of some probability distribution $P^\circ$ on $\mathcal{X}^\infty$ and, using (26), that (b) for all $n$, all $x^{(n)} \in \mathcal{X}^n$,

$$p^\circ(x^{(n)}) \geq \left(\prod_{j=0}^{m-1} \alpha\, p''(x_{s_j+1}, \ldots, x_{s_{j+1}})\right)(1-\alpha)\, \tilde{p}(x_{s_m+1}, \ldots, x_n) \geq \alpha^m 2^{m c''}\, \tilde{p}(x_1, \ldots, x_{s_m})\, (1-\alpha)\, \tilde{p}(x_{s_m+1}, \ldots, x_n) = \alpha^m (1-\alpha)\, 2^{m c''}\, \tilde{p}(x^{(n)}), \qquad (27)$$

where $c''$ is as in Lemma 5.5.

Now suppose that $X_1, X_2, \ldots$ are i.i.d. $\sim \tilde{P}$, i.e. data are sampled from the MaxEnt distribution $\tilde{P}$. We may view $U_n := n^{-1}\sum_{i=1}^n T(X_i) - \tilde{t}$ as specifying a Markov chain, where the state at time $n$ is given by the value of $U_n$, the transition probabilities are given by $\tilde{P}(U_{n+1} = \cdot \mid U_n = u)$ for each realizable value of $u$, and the starting state is $U_0 := 0$. By the local central limit theorem (Section 4), the probability of being in state "0" at time $n$ is of order $1/\sqrt{n}$ (if $k = 1$) or $1/n$ (if $k = 2$). In both cases, this probability is not summable, so it follows by basic Markov chain theory [Feller, 1968] that state "0" is recurrent and, with probability 1, $U_n = 0$ will hold for infinitely many $n$. But, for $n > 0$, $U_n = 0$ is equivalent to $x^{(n)} \in \mathcal{C}_n$, i.e. the constraint holds. It follows that the constraint will hold infinitely often, almost surely under $\tilde{P}$. Yet if we decide to encode the sequence $x_1, \ldots, x_n$ with the code corresponding to $p^\circ$ rather than $\tilde{p}$, then by (27), if we select a value of $\alpha < 1$ such that $\alpha\, 2^{c''} > 1$, we will $\tilde{P}$-almost surely compress the data significantly better than if we use the code with lengths $-\log\tilde{p}$ itself. More precisely, with $\tilde{P}$-probability 1, $-\log\frac{\tilde{p}(X^{(n)})}{p^\circ(X^{(n)})} \to \infty$. But this contradicts the no-hypercompression inequality [Grünwald, 2007] (also known as the "competitive optimality of the Shannon-Fano code" [Cover and Thomas, 1991]), an easy consequence of Markov's inequality, which states that for all $K > 0$ and any two distributions $P$ and $Q$ with mass functions $p$ and $q$, for all $n$, $P(-\log p(X^{(n)}) \geq -\log q(X^{(n)}) + K) \leq 2^{-K}$. The theorem is proved. ∎

References

T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, New York, 1991.

I. Csiszár. I-divergence geometry of probability distributions and minimization problems. Annals of Probability, 3(1):146–158, 1975.

I. Csiszár. Sanov property, generalized I-projection and a conditional limit theorem. Annals of Probability, 12(3):768–793, 1984.

I. Csiszár. Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems. Annals of Statistics, 19(4):2032–2066, 1991.

W. Feller. An Introduction to Probability Theory and Its Applications, volume 2. Wiley, New York, 3rd edition, 1968.

P. Grünwald. The Minimum Description Length Principle. MIT Press, Cambridge, MA, 2007.

P. D. Grünwald. Strong entropy concentration, game theory and algorithmic randomness. In Proceedings of the Fourteenth Annual Conference on Computational Learning Theory (COLT '01), pages 320–336, New York, 2001a. Springer-Verlag.

P. D. Grünwald. Strong entropy concentration, coding, game theory and randomness. Technical Report 10, EURANDOM, 2001b.
P. D. Grünwald and A. P. Dawid. Game theory, maximum entropy, minimum discrepancy, and robust Bayesian decision theory. Annals of Statistics, 32(4):1367–1433, 2004.

E. T. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, Cambridge, UK, 2003. Edited by G. Larry Bretthorst.

E. T. Jaynes. Where do we stand on maximum entropy? In R. D. Levine and M. Tribus, editors, The Maximum Entropy Formalism, pages 15–118. MIT Press, Cambridge, MA, 1978.

E. T. Jaynes. On the rationale of maximum-entropy methods. Proceedings of the IEEE, 70:939–951, 1982.

J. N. Kapur and H. K. Kesavan. Entropy Optimization Principles with Applications. Academic Press, San Diego, 1992.

M. Li and P. M. B. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Springer-Verlag, New York, revised and expanded 2nd edition, 1997.

J. Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific, Hackensack, NJ, 1989.

F. Topsøe. Information-theoretical optimization techniques. Kybernetika, 15(1):8–27, 1979.

J. van Campenhout and T. Cover. Maximum entropy and conditional probability. IEEE Transactions on Information Theory, IT-27(4):483–489, 1981.
