Minimax redundancy for Markov chains with large state space


Authors: Kedar Shriram Tatwawadi, Jiantao Jiao, Tsachy Weissman

Minimax redundancy for Markov chains with large state space
Kedar Shriram Tatwawadi, Jiantao Jiao, Tsachy Weissman
May 8, 2018

Abstract

For any Markov source, there exist universal codes whose normalized codelength approaches the Shannon limit asymptotically as the number of samples goes to infinity. This paper investigates how fast the gap between the normalized codelength of the "best" universal compressor and the Shannon limit (i.e., the compression redundancy) vanishes non-asymptotically in terms of the alphabet size and mixing time of the Markov source. We show that, for Markov sources whose relaxation time is at least $1 + \frac{2+c}{\sqrt{k}}$, where $k$ is the state space size (and $c > 0$ is a constant), the phase transition for the number of samples required to achieve vanishing compression redundancy is precisely $\Theta(k^2)$.

1 Introduction

For any data source that can be modeled as a stationary ergodic stochastic process, it is well known in the literature of universal compression that there exist compression algorithms which, without any knowledge of the source distribution, approach the fundamental limit of the source, also known as the Shannon entropy, as the number of observations tends to infinity. The existence of universal data compressors has spurred a huge wave of research around it. A large fraction of practical lossless compressors are based on the Lempel–Ziv algorithms [ZL77, ZL78] and their variants, and the normalized codelength of a universal source code is also widely used to measure the compressibility of the source, based on the idea that the normalized codelength is "close" to the true entropy rate given a moderate number of samples. There has been considerable effort to quantify how fast the codelength of a universal code approaches the Shannon entropy rate.
One of the general statements pertaining to distributions parametrized by a finite-dimensional vector is due to Rissanen [Ris84]. Let $X^n$ be a sequence of random variables generated from some stationary distribution $p_\theta(x^n)$ with parameters $\theta$. A compressor $L$ for the $X^n$ sequence is characterized by its length function $L(x^n)$, which is the length (in bits) of the code corresponding to every realization $x^n$ of $X^n$. The entropy $H_\theta(X^n)$ quantifies the fundamental limit of compression under model $p_\theta$, and is given by

$$H_\theta(X^n) = \sum_{x^n} p_\theta(x^n) \log_2 \frac{1}{p_\theta(x^n)} \quad (1)$$

The redundancy of a compressor with length function $L(X^n)$ is defined as

$$r_n(L, \theta) = \frac{1}{n}\left( \mathbb{E}[L(X^n) \mid \theta] - H_\theta(X^n) \right) \quad (2)$$

Throughout the paper, we work with $\log \equiv \log_2$.

Rissanen [Ris84] states that if $\theta \in \Theta \subset \mathbb{R}^d$, and if the parameter $\theta$ can be estimated at the "parametric" rate asymptotically (with $d, \theta$ fixed), then there exists some compressor $L$ such that

$$r_n(L, \theta) = \frac{d \log n}{2n} + O\left(\frac{1}{n}\right) \quad (3)$$

as $n \to \infty$. Moreover, fixing $d, \epsilon > 0$, for any uniquely decodable code $L$, its redundancy satisfies

$$r_n(L, \theta) \geq (1-\epsilon) \frac{d \log n}{2n} \quad (4)$$

as $n \to \infty$ for all values of $\theta$ except for a set whose volume vanishes as $n \to \infty$ while other parameters are fixed. The focus of Rissanen [Ris84] was asymptotic, i.e., the characterization of the redundancy as the number of samples $n \to \infty$ while other parameters remain fixed. There have been considerable generalizations in the asymptotic realm, such as [Att99, MF95, FM96, XB97, XB00]. In modern applications, the parameter dimension $d$ may be comparable to, or even larger than, the number of samples $n$.
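The tension between the two regimes can be seen by simply evaluating the leading term $\frac{d \log n}{2n}$ of Rissanen's benchmark (3). A minimal Python sketch (the function name is ours, for illustration only):

```python
import math

def rissanen_redundancy(d: int, n: int) -> float:
    """Leading term d*log2(n)/(2n) of Rissanen's redundancy benchmark (3)."""
    return d * math.log2(n) / (2 * n)

# Asymptotic regime: d fixed, n -> infinity, so the benchmark vanishes.
r_asym = rissanen_redundancy(10, 10**6)

# Large-alphabet regime: with d comparable to n the benchmark stays bounded
# away from zero, so (3)-(4) cannot be read non-asymptotically at face value.
d = 10**6
r_small = rissanen_redundancy(d, 2 * d)                        # n ~ d
r_large = rissanen_redundancy(d, 10 * d * int(math.log2(d)))   # n >> d log d
```

Here `r_small` is a constant of order $\log d$, while `r_large` is small, consistent with the $n \gg d \log d$ reading discussed next.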
For example, in the Google One Billion Word dataset (1BW) [CMS+13], the number of distinct words is more than 2 million, and the data distribution is also not i.i.d., which makes us wonder whether we are operating in the asymptotic regime when any universal code is applied. We emphasize that the implications of (4) may not be correct in the non-asymptotic setting (i.e., when the parameter dimension $d$ is comparable to the number of samples $n$). Indeed, interpreting (4) in the non-asymptotic way, it implies that at least $n \gg d \log d$ samples are required to achieve vanishing redundancy. However, when the data source is i.i.d. with alphabet size $d+1$, a precise non-asymptotic computation shows that the phase transition between vanishing and non-vanishing redundancy is at $n \asymp d$ [JHFHW17].

There exists extensive literature on quantifying the redundancy in the non-asymptotic regime. Davisson [Dav83] considered the case of memoryless sources and $m$-Markov sources, and obtained non-asymptotic upper and lower bounds (i.e., bounds that are explicit in all the parameters involved) on the average-case minimax redundancy, which is defined by

$$\inf_L \sup_{\theta \in \Theta} r_n(L, \theta), \quad (5)$$

where the infimum is taken over all uniquely decodable codes [CT12] (Section 5.1). However, the lower bound for Markov sources with alphabet size $k$ in [Dav83] is non-zero only when $n \gg k^2 \log k$ (see Appendix A), and the bounds are not tight in the sense that the upper and lower bounds do not match in scaling in the large-alphabet regime. The works [OS04, DS04, SW12] mainly considered a variant called the worst-case minimax redundancy, and showed that for i.i.d. sources with alphabet size $k$, the worst-case minimax redundancy² vanishes if and only if the number of samples satisfies $n \gg k$ non-asymptotically.
The problem of worst-case minimax redundancy for Markov sources was considered in [JS04]. The focus of this paper is on the average-case minimax redundancy for Markov chains. We refine the minimax redundancy in (5) and categorize different Markov chains by how fast they "mix". Informally, we ask the following question:

Question 1. How does the minimum number of samples required to achieve vanishing redundancy depend on the state space size and mixing time?

²Precisely, the minimax regret with respect to a coding oracle that only uses codes corresponding to i.i.d. distributions.

2 Preliminaries

Consider a first-order Markov chain $X_1, X_2, \ldots$ on a finite state space $\mathcal{X} = \{1, 2, \ldots, k\} \triangleq [k]$ with transition kernel $K$. We denote the entries of $K$ as $K_{ij}$, that is, $K_{ij} = P_{X_2|X_1}(j \mid i)$ for $i, j \in \mathcal{X}$. We say that a Markov chain is stationary if $P_{X_1}$, the distribution of $X_1$, satisfies

$$\sum_{i=1}^{k} P_{X_1}(i) K_{ij} = P_{X_1}(j) \quad \text{for all } j \in \mathcal{X}. \quad (6)$$

We say that a Markov chain is reversible if there exists a distribution $\pi$ on $\mathcal{X}$ which satisfies the detailed balance equations:

$$\pi_i K_{ij} = \pi_j K_{ji} \quad \text{for all } i, j \in \mathcal{X}. \quad (7)$$

In this case, $\pi$ is called the stationary distribution of the Markov chain. For a reversible Markov chain, the (left) spectrum of the operator $K$ consists of $k$ real eigenvalues $1 = \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_k \geq -1$. We define the spectral gap of a reversible Markov chain as

$$\gamma(K) = 1 - \lambda_2. \quad (8)$$

The absolute spectral gap of $K$ is defined as

$$\gamma^*(K) = 1 - \max_{i \geq 2} |\lambda_i|, \quad (9)$$

and it clearly follows that, for any reversible Markov chain,

$$\gamma(K) \geq \gamma^*(K). \quad (10)$$

The relaxation time of a Markov chain is defined as

$$\tau_{\mathrm{rel}}(K) = \frac{1}{\gamma^*(K)}. \quad (11)$$

The relaxation time of a reversible Markov chain (approximately) captures its mixing time, which informally is the smallest $n$ for which the marginal distribution of $X_n$ is very close to the Markov chain's stationary distribution. We refer to [MT06] for a survey. Intuitively speaking, the shorter the relaxation time $\tau_{\mathrm{rel}}$, the faster the Markov chain "mixes": that is, the shorter its "memory", or the sooner evolutions of the Markov chain from different starting states begin to look similar.

The multiplicative reversibilization of the transition matrix $K$ is defined as:

$$K^*_{ji} = \frac{\pi_i K_{ij}}{\pi_j} \quad (12)$$

$K^*$ is in fact the transition matrix of the reversed Markov chain $X_n \to X_{n-1} \to \cdots \to X_1$. Note that for reversible chains, $K^* = K$. The pseudo-spectral gap of a non-reversible chain (with transition matrix $K$) is defined as:

$$\gamma_{\mathrm{ps}}(K) = \max_{r \geq 1} \frac{\gamma((K^*)^r K^r)}{r} \quad (13)$$

The pseudo-spectral gap of a non-reversible chain is related to the mixing time of the non-reversible Markov chain.

We denote by $\mathcal{M}_1(k)$ the set of all discrete distributions with alphabet size $k$ (i.e., the $(k-1)$-probability simplex), and by $\mathcal{M}_2(k)$ the set of all Markov chain transition matrices on a state space of size $k$. Let $\mathcal{M}_{2,\mathrm{rev}}(k) \subset \mathcal{M}_2(k)$ be the set of transition matrices of all stationary reversible Markov chains on a state space of size $k$. We define a class of stationary Markov chains $\mathcal{M}_{2,\mathrm{rev}}(k, \tau_{\mathrm{rel}}) \subset \mathcal{M}_{2,\mathrm{rev}}(k)$ as follows:

$$\mathcal{M}_{2,\mathrm{rev}}(k, \tau_{\mathrm{rel}}) = \{ K \in \mathcal{M}_{2,\mathrm{rev}}(k) : \tau_{\mathrm{rel}}(K) \leq \tau_{\mathrm{rel}} \}. \quad (14)$$

In other words, we consider stationary reversible Markov chains whose relaxation time is upper-bounded by $\tau_{\mathrm{rel}}$.

Another probabilistic representation of reversible Markov chains is via random walks on undirected graphs. Consider an undirected graph (without multi-edges) on $k$ vertices.
Let the weight on the undirected edge $\{i, j\}$ be denoted $w_{ij} \geq 0$. Due to the undirected nature of the graph, $w_{ij} = w_{ji} \geq 0$ for all $i, j \in [k]$. We also define $\rho_i$ and $\rho$ as:

$$\rho_i = \sum_{j=1}^{k} w_{ij}, \quad \forall i \in [k], \qquad \rho = \sum_{i,j} w_{ij}$$

Here, $\rho_i$ corresponds to the row sums of the weight matrix $W$, with entries $w_{ij}$. We can now consider a Markov chain corresponding to a random walk on this graph. The transition probabilities and the stationary distribution corresponding to a random walk on this weighted undirected graph are given by:

$$K_{ij} = \frac{w_{ij}}{\rho_i} \quad (15)$$

$$\pi_i = \frac{\rho_i}{\rho} \quad (16)$$

We can verify that the transition matrix $K$ corresponds to a reversible Markov chain (i.e., $K \in \mathcal{M}_{2,\mathrm{rev}}(k)$), since:

$$\pi_i K_{ij} = \frac{w_{ij}}{\rho} \quad (17)$$
$$= \frac{w_{ji}}{\rho} \quad (18)$$
$$= \pi_j K_{ji} \quad (19)$$

Conversely, we can view any reversible Markov chain $\hat{K} \in \mathcal{M}_{2,\mathrm{rev}}(k)$ with stationary distribution $\hat{\pi}$ as a random walk on an undirected graph with weights $\hat{w}_{ij}$:

$$\hat{w}_{ij} = \hat{\pi}_i \hat{K}_{ij} \quad (20)$$

The quantity of interest in this paper is

$$R_n(k, \tau_{\mathrm{rel}}) = \inf_L \sup_{K \in \mathcal{M}_{2,\mathrm{rev}}(k, \tau_{\mathrm{rel}})} r_n(L, \theta), \quad (21)$$

where the infimum is taken over all uniquely decodable codes [CT12] (Section 5.1), and the supremum is taken over all stationary reversible Markov chains whose relaxation time is upper-bounded by $\tau_{\mathrm{rel}}$. We define the quantity $n^*(k, \tau_{\mathrm{rel}}, \epsilon)$ as follows:

$$n^*(k, \tau_{\mathrm{rel}}, \epsilon) \triangleq \min \{ n : R_n(k, \tau_{\mathrm{rel}}) \leq \epsilon \}. \quad (22)$$

Notation

The quantity $h(X)$ denotes the differential entropy of a continuous random variable $X$ with density function $f_X(x)$, and is given by:

$$h(X) = \int f_X(x) \log_2 \frac{1}{f_X(x)} \, dx \quad (23)$$

We define the KL divergence $D(p_X \| q_X)$ between two discrete distributions $p_X(x)$ and $q_X(x)$ as:

$$D(p_X \| q_X) = \sum_x p_X(x) \log_2 \frac{p_X(x)}{q_X(x)} \quad (24)$$

Throughout the paper, we use the notation $o(\cdot)$ and $O(\cdot)$ to denote the asymptotic growth of a function.
Let $f(k)$ and $g(k)$ be non-negative functions. We say that $f(k) = O(g(k))$ if $f(k) \leq C g(k)$ for some $C > 0$ and all $k > C$. We say that $f(k) = o(g(k))$ if the asymptotic growth of $f(k)$ is strictly slower than that of $g(k)$, i.e.,

$$\lim_{k \to \infty} \frac{f(k)}{g(k)} = 0$$

3 Main Results

The main theorems of this paper are:

Theorem 1. For $\tau_{\mathrm{rel}} \geq 1 + \frac{2+c}{\sqrt{k}}$,

$$R_n(k, \tau_{\mathrm{rel}}) \geq \frac{k(k-1)}{4n} \log \frac{2(n-1)}{k(k-1)} + \frac{k(k-1)}{4n} \log \frac{e}{16\pi \left(1 + \frac{2+c}{\sqrt{k}}\right)} - \frac{\log k}{n} \quad (25)$$

for $k \geq k_c$, where $c > 0$ is a constant and $k_c$ is a constant depending only on $c$.

Theorem 1 is proved in Section 4.

Theorem 2. The average-case minimax redundancy $R_n(k)$ for Markov sources is defined as:

$$R_n(k) = \inf_L \sup_{\theta \in \mathcal{M}_2(k)} r_n(L, \theta)$$

Then the following upper bound holds:

$$R_n(k) \leq \frac{2k^2}{n} \log_2 \left( \frac{n}{k^2} + 1 \right) + \frac{k^2}{n} + \frac{\log_2 k + 3}{n}$$

Note that since $R_n(k) \geq R_n(k, \tau_{\mathrm{rel}})$, the upper bound in Theorem 2 is also valid for $R_n(k, \tau_{\mathrm{rel}})$, and the lower bound in Theorem 1 is also valid for $R_n(k)$. Theorem 2 is proved in Section 5. The following corollary is immediate.

Corollary 1. If $n \gg k^2$, then $R_n(k, \tau_{\mathrm{rel}}) \to 0$ uniformly over $\tau_{\mathrm{rel}}$. For $\tau_{\mathrm{rel}} \geq 1 + \frac{2+c}{\sqrt{k}}$, where $c > 0$ is a positive constant, there exists a constant $c_1$ such that if $n = c_1 k^2$, then $R_n(k, \tau_{\mathrm{rel}})$ is bounded away from zero as $k \to \infty$.

Analyzing $R_n(k, \tau_{\mathrm{rel}})$ over reversible Markov chains gives us a more refined understanding of the compression redundancy. From Theorem 2, we observe that for any Markov distribution, we can achieve $\epsilon$ redundancy (for any constant $\epsilon > 0$) using $n \propto k^2$ samples. On the other hand, Theorem 1 tells us that, even for the small family of fast-mixing reversible chains, in the worst case at least $\Omega(k^2)$ samples are necessary to attain $\epsilon$ redundancy.
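The two bounds above are explicit in all parameters and can simply be evaluated numerically. A small Python sketch (the function names, the choice $c = 1$, and the sample sizes are our illustrative choices, not values from the paper; logs are base 2, matching the paper's convention):

```python
import math

def lower_bound(k: int, n: int, c: float = 1.0) -> float:
    """Theorem 1 lower bound on R_n(k, tau_rel), for tau_rel >= 1 + (2+c)/sqrt(k)."""
    tau0 = 1 + (2 + c) / math.sqrt(k)
    kk = k * (k - 1)
    return (kk / (4 * n)) * math.log2(2 * (n - 1) / kk) \
         + (kk / (4 * n)) * math.log2(math.e / (16 * math.pi * tau0)) \
         - math.log2(k) / n

def upper_bound(k: int, n: int) -> float:
    """Theorem 2 upper bound on R_n(k)."""
    return (2 * k**2 / n) * math.log2(n / k**2 + 1) + k**2 / n \
         + (math.log2(k) + 3) / n

k = 1000
# n of order k^2: the lower bound is a positive constant (redundancy does
# not vanish), while for n >> k^2 the upper bound is already small.
assert lower_bound(k, 20 * k**2) > 0
assert upper_bound(k, 1000 * k**2) < 0.05
```

This mirrors Corollary 1: at $n = c_1 k^2$ the redundancy stays bounded away from zero, while for $n \gg k^2$ it provably vanishes.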
Figure 1 provides a pictorial illustration of $n^*(k, \tau_{\mathrm{rel}}, \epsilon)$ when $\epsilon$ is a small constant. The case $\tau_{\mathrm{rel}} = 1$ corresponds to i.i.d. distributions, and it follows from [JHFHW17] that $n^*(k, \tau_{\mathrm{rel}}, \epsilon) = \Theta(k)$ for small constant $\epsilon$. Interestingly, when the Markov chain becomes slightly "non-i.i.d.", the required sample size immediately jumps to $\Theta(k^2)$ and remains there no matter how large $\tau_{\mathrm{rel}}$ is. Similar phenomena exist in the literature on entropy rate estimation for Markov chains [HJL+18], where the phase transition for consistent entropy rate estimation happens at $\frac{k}{\log k}$ for i.i.d. data, and at $\frac{k^2}{\log k}$ when the relaxation time is above $1 + \Omega\left(\frac{\log^2 k}{\sqrt{k}}\right)$. In other words, even if we use the codelength of the "best" universal code to estimate the entropy rate of the Markov source, it still requires considerably more samples than the information-theoretically optimal entropy rate estimator, which does not go through the construction of a code.

Figure 1: The figure plots $n^*(k, \tau_{\mathrm{rel}}, \epsilon)$ for a fixed small enough $\epsilon > 0$ against the relaxation-time constraint $\tau_{\mathrm{rel}} = (\gamma^*)^{-1}$. Note that $\tau_{\mathrm{rel}} = 1$ corresponds to i.i.d. data, and hence $n^*(k, 1, \epsilon) = \Theta(k)$ [JHFHW17]; for $\tau_{\mathrm{rel}} \geq 1 + \frac{2+c}{\sqrt{k}}$, the value jumps to $\Theta(k^2)$.

4 Theorem 1 Proof

Roadmap

We first conduct the continuous approximation of the redundancy, which is given by the following lemma.

Lemma 1. For any uniquely decodable code $L$, we have

$$r_n(L, \theta) \geq \frac{1}{n} D(p_\theta(x^n) \| q_L(x^n)), \quad (26)$$

where

$$q_L(x^n) = \frac{2^{-L(x^n)}}{\sum_{x^n} 2^{-L(x^n)}}.$$

Proof.
Consider the redundancy $r_n(L, \theta)$:

$$r_n(L, \theta) = \frac{1}{n} \left[ \mathbb{E}(L(X^n) \mid \theta) - H_\theta(X^n) \right] \quad (27)$$

$$= \frac{1}{n} \left[ \sum_{x^n} p_\theta(x^n) L(x^n) - \sum_{x^n} p_\theta(x^n) \log \frac{1}{p_\theta(x^n)} \right] \quad (28)$$

$$= \frac{1}{n} \left[ \sum_{x^n} p_\theta(x^n) \log \frac{p_\theta(x^n) \sum_{x^n} 2^{-L(x^n)}}{2^{-L(x^n)}} + \log \frac{1}{\sum_{x^n} 2^{-L(x^n)}} \right] \quad (29)$$

As $L$ is a uniquely decodable code [CT12] (Section 5.1), we can now use the Kraft inequality [CT12] (Theorem 5.5.1) on the lengths $L(x^n)$ to obtain:

$$r_n(L, \theta) \geq \frac{1}{n} \left[ \sum_{x^n} p_\theta(x^n) \log \frac{p_\theta(x^n) \sum_{x^n} 2^{-L(x^n)}}{2^{-L(x^n)}} \right] \quad (30)$$

$$= \frac{1}{n} \left[ \sum_{x^n} p_\theta(x^n) \log \frac{p_\theta(x^n)}{q_L(x^n)} \right] \quad (31)$$

$$= \frac{1}{n} D(p_\theta(x^n) \| q_L(x^n)). \quad (32)$$

We then use the strategy of lower-bounding the minimax risk by the Bayes risk, which is given by the following lemma.

Lemma 2. For any prior distribution $\Phi(\theta)$ supported on the parameter space $\mathcal{M}_{2,\mathrm{rev}}(k, \tau_{\mathrm{rel}})$, we have

$$R_n(k, \tau_{\mathrm{rel}}) \geq \frac{1}{n} I(\theta; X^n). \quad (33)$$

Proof. Let $p(x^n) = \int_\theta \Phi(\theta) p_\theta(x^n) \, d\theta$. Then:

$$R_n(k, \tau_{\mathrm{rel}}) = \inf_L \sup_{K \in \mathcal{M}_{2,\mathrm{rev}}(k, \tau_{\mathrm{rel}})} r_n(L, \theta) \quad (34)$$

$$\geq \inf_L \int_\theta \Phi(\theta) r_n(L, \theta) \, d\theta \quad (35)$$

Equation (35) holds because the average is always lower than the supremum. We next use Lemma 1 to obtain:

$$R_n(k, \tau_{\mathrm{rel}}) \geq \inf_{q_L(x^n)} \frac{1}{n} \int_\theta \Phi(\theta) D(p_\theta(x^n) \| q_L(x^n)) \, d\theta$$
$$= \inf_{q_L(x^n)} \frac{1}{n} \left[ \int_\theta \Phi(\theta) \sum_{x^n} p_\theta(x^n) \log \frac{p_\theta(x^n)}{q_L(x^n)} \, d\theta \right]$$
$$= \inf_{q_L(x^n)} \frac{1}{n} \left[ \int_\theta \Phi(\theta) \sum_{x^n} p_\theta(x^n) \log \left( \frac{p_\theta(x^n)}{p(x^n)} \cdot \frac{p(x^n)}{q_L(x^n)} \right) d\theta \right]$$
$$= \inf_{q_L(x^n)} \frac{1}{n} \left[ \int_\theta \Phi(\theta) D(p_\theta(x^n) \| p(x^n)) \, d\theta + D(p(x^n) \| q_L(x^n)) \right]$$

Finally, the non-negativity of the KL divergence completes the proof:

$$R_n(k, \tau_{\mathrm{rel}}) \geq \frac{1}{n} \int_\theta \Phi(\theta) D(p_\theta(x^n) \| p(x^n)) \, d\theta = \frac{1}{n} I(\theta; X^n)$$

Lemma 2 implies that, for any prior distribution $\Phi(\theta)$:

$$R_n(k, \tau_{\mathrm{rel}}) \geq \frac{1}{n} I(\theta; X^n) \quad (36)$$
$$= \frac{1}{n} \left[ h(\theta) - h(\theta \mid X^n) \right]. \quad (37)$$

In order to obtain a tight lower bound on $R_n(k, \tau_{\mathrm{rel}})$, it thus suffices to choose a prior on $\theta$ such that $h(\theta)$ is as large as possible, while $h(\theta \mid X^n)$, which quantifies how well we can estimate $\theta$ based on $X^n$, is as small as possible.

The transition matrix has about $k^2$ degrees of freedom. In order to prove the lower bound corresponding to $n^*(k, \tau_{\mathrm{rel}}, \epsilon) \approx k^2$, we need nearly $k^2$ degrees of freedom in the prior construction, but we would also like the Markov chain to mix fast under this prior. In other words, we want the Markov chain to be similar to the memoryless scenario. This naturally motivates a prior construction using random matrix theory. Indeed, if the transition matrix can be viewed as a combination of the rank-one matrix corresponding to the stationary distribution and a "noise" matrix with nearly i.i.d. entries, it would be expected from random matrix theory that the second-largest eigenvalue is close to zero as the matrix size increases. However, the technical difficulty lies in constructing a prior which is completely supported on $\mathcal{M}_{2,\mathrm{rev}}(k, \tau_{\mathrm{rel}})$ with desirable spectral properties, while also ensuring that the prior has large enough differential entropy. The concrete construction is below.

4.1 Prior Construction

Let $\tilde{\mathcal{M}}_{2,\mathrm{rev}}(k) \subset \mathcal{M}_{2,\mathrm{rev}}(k)$ be the space of Markov distributions with the following property:

$$\tilde{\mathcal{M}}_{2,\mathrm{rev}}(k) = \{ K \in \mathcal{M}_{2,\mathrm{rev}}(k) : K_{ii} = 0, \ \forall i \in [k] \} \quad (38)$$

The space $\tilde{\mathcal{M}}_{2,\mathrm{rev}}(k)$ corresponds to transition matrices of random walks over undirected graphs that do not have self-loops. We also define a class of stationary Markov chains $\tilde{\mathcal{M}}_{2,\mathrm{rev}}(k, \tau_{\mathrm{rel}}) \subset \tilde{\mathcal{M}}_{2,\mathrm{rev}}(k)$ as follows:

$$\tilde{\mathcal{M}}_{2,\mathrm{rev}}(k, \tau_{\mathrm{rel}}) = \{ K \in \tilde{\mathcal{M}}_{2,\mathrm{rev}}(k) : \tau_{\mathrm{rel}}(K) \leq \tau_{\mathrm{rel}} \}. \quad (39)$$
In other words, we consider stationary reversible Markov chains in $\tilde{\mathcal{M}}_{2,\mathrm{rev}}(k)$ whose relaxation time is upper-bounded by $\tau_{\mathrm{rel}}$.

Definition 1. Let $\pi(i, j) = \pi_i K_{ij}$ denote the stationary distribution over the tuples $(X_1, X_2)$. Then we can consider a parametrization $\theta$ for $\tilde{\mathcal{M}}_{2,\mathrm{rev}}(k)$ as:

$$\theta = (2\pi(1,2), \ldots, 2\pi(1,k), 2\pi(2,3), \ldots, 2\pi(k-1,k)) \equiv (\theta_{1,2}, \theta_{1,3}, \ldots, \theta_{1,k}, \theta_{2,3}, \ldots, \theta_{2,k}, \ldots, \theta_{k-1,k})$$

The scaling by a factor of 2 (e.g., $\theta_{1,2} = 2\pi(1,2)$) ensures that the parameters sum to 1. Extend the parameters symmetrically as

$$\tilde{\theta}_{i,j} = \tilde{\theta}_{j,i} = \theta_{i,j} \text{ for } i < j, \qquad \tilde{\theta}_{i,i} = 0. \quad (41)$$

Then the transition matrix $K$ can be obtained as:

$$K_{ij} = \frac{\tilde{\theta}_{i,j}}{\sum_{j'} \tilde{\theta}_{i,j'}} \quad (42)$$

We also define priors $\tilde{\Phi}_u(\theta)$ and $\tilde{\Phi}_u(\theta; \tau_{\mathrm{rel}})$ as the uniform distributions on the spaces $\tilde{\mathcal{M}}_{2,\mathrm{rev}}(k)$ and $\tilde{\mathcal{M}}_{2,\mathrm{rev}}(k, \tau_{\mathrm{rel}})$, respectively, under the parametrization $\theta$. We can obtain the distribution $\tilde{\Phi}_u(\theta)$ over the space $\tilde{\mathcal{M}}_{2,\mathrm{rev}}(k)$ by considering transition matrices corresponding to random walks over undirected graphs with random weights.

Lemma 3. Consider a simple complete graph (complete graph without self-loops) on $k$ vertices, with random weights $w_{ij}$ distributed i.i.d. as $w_{ij} = w_{ji} \sim \mathrm{Exp}(1)$ ($w_{ii} = 0$). Then the corresponding transition matrix $K$ is distributed as $\tilde{\Phi}_u(\theta)$, i.e., uniformly over the space $\tilde{\mathcal{M}}_{2,\mathrm{rev}}(k)$.

Proof. We recall a well-known property [FKG10] (Section 2.3) of exponential distributions: let $U = \{u_1, u_2, \ldots, u_r\}$ be such that every $u_i \sim \mathrm{Exp}(\lambda)$ i.i.d. Then, for $u = \sum_i u_i$ and $v_i = u_i / u$, the vector $V = \{v_1, v_2, \ldots, v_r\}$ is uniformly distributed over the probability simplex $\sum_{i=1}^r v_i = 1$. Lemma 3 is a special case, and can be proved by considering:

$$U = \{2w_{12}, 2w_{13}, \ldots, 2w_{1k}, 2w_{23}, \ldots, 2w_{2k}, \ldots, 2w_{k-1,k}\}$$
$$V = \theta \equiv \{\theta_{1,2}, \theta_{1,3}, \ldots, \theta_{1,k}, \theta_{2,3}, \ldots, \theta_{2,k}, \ldots, \theta_{k-1,k}\}$$

Then $V = \theta$ is uniformly distributed over $\tilde{\mathcal{M}}_{2,\mathrm{rev}}(k)$, the probability simplex of dimension $\frac{k(k-1)}{2}$.

Lemma 3 provides a nice gateway to tools from random matrix theory. We will first understand some properties of the weight matrix $W$, consisting of random weights $w_{ij} = w_{ji} \sim \mathrm{Exp}(1)$ for all $i \neq j$ and $w_{ii} = 0$ for all $i \in [k]$. Recall the following definition of a Wigner matrix [Tao]:

Definition 2. We say a random symmetric matrix $A$ is a Wigner matrix if the upper-triangular entries $A_{ij}$, $i < j$, are distributed i.i.d. with zero mean and unit variance, while the diagonal entries $A_{ii}$ are i.i.d. real variables with bounded mean and variance, distributed independently of the upper-triangular entries.

Consider the matrix $\hat{W}$, where $\hat{w}_{ij} = w_{ij} - 1$. $\hat{W}$ is a symmetric random matrix whose off-diagonal entries are i.i.d. with zero mean, while the diagonal entries are constants. This implies that the matrix $\hat{W}$ is a Wigner random matrix. Then the strong Bai–Yin theorem, upper bound (Theorem 2.3.24, Exercise 2.3.15 of [Tao]), implies that the eigenvalues of $\hat{W}$ are bounded as:

$$|\lambda_i(\hat{W})| \leq 2\sqrt{k} + o(\sqrt{k}) \quad \text{a.s.}, \ \forall i \in [k] \quad (43)$$

(here, by a.s. we mean that the sequence of events is true infinitely often as $k \to \infty$)

We can also bound the row sums $\rho_i$ of the matrix $W$:

Lemma 4. The following properties are true for the weight matrix $W$:

$$\max_{1 \leq i \leq k} \left| \frac{\rho_i}{k} - 1 \right| = o(1) \quad \text{a.s.} \quad (44)$$

$$\sum_{1 \leq i \leq k} \left( \frac{\rho_i}{k} - 1 \right)^2 = O(1) \quad \text{a.s.} \quad (45)$$

Proof. Let $V$ be a matrix such that $v_{ij} = w_{ij}$ for all $i \neq j$ and $v_{ii} \sim \mathrm{Exp}(1)$ for all $i$. Then, using Lemma 2.3 of [BCC10], we get:

$$\max_{1 \leq i \leq k} \left| \frac{\rho_i}{k} - 1 \right| \leq o(1) + \frac{1}{k} \max_{1 \leq i \leq k} v_{ii} \quad \text{a.s.} \quad (46)$$
Let $A_{k,\epsilon}$ be the event

$$A_{k,\epsilon} = \left\{ \frac{1}{k} \max_{1 \leq i \leq k} v_{ii} \leq \epsilon \right\} \quad (47)$$

As the $v_{ii}$ are independent exponential random variables, we obtain:

$$P(A_{k,\epsilon}^c) = 1 - P\left( \frac{1}{k} \max_{1 \leq i \leq k} v_{ii} \leq \epsilon \right) = 1 - \prod_{i=1}^{k} P(v_{ii} \leq k\epsilon) = 1 - (1 - e^{-k\epsilon})^k \leq k e^{-k\epsilon}$$

We can now bound the sum of the probabilities of these events:

$$\sum_{k=1}^{\infty} P(A_{k,\epsilon}^c) \leq \sum_{k=1}^{\infty} k e^{-k\epsilon} < \infty$$

Applying the Borel–Cantelli lemma then proves equation (44). From equation (43), we know that the largest eigenvalue of $\hat{W}$ is bounded as:

$$|\lambda_1(\hat{W})| \leq 2\sqrt{k} + o(\sqrt{k}) \quad \text{a.s.} \quad (48)$$

Thus:

$$\sum_{1 \leq i \leq k} \left( \frac{\rho_i}{k} - 1 \right)^2 = \frac{\langle \hat{W}\mathbf{1}, \hat{W}\mathbf{1} \rangle}{k^2} \leq \frac{\lambda_1(\hat{W})^2}{k} \leq \frac{4k}{k} + o(1) = O(1) \quad \text{a.s.}$$

This proves equation (45).

Lemma 4 essentially says that every $\rho_i$ is close to $k$, i.e.:

$$\rho_i \leq k(1 + \delta), \quad \text{where } \delta = o(1) \quad (49)$$

Equation (49) then implies that the entries of the matrix $K$ are proportional to those of the matrix $W$:

$$K_{ij} = \frac{w_{ij}}{\rho_i} = \frac{w_{ij}}{k}(1 + o(1)) \quad (50)$$

This suggests that the eigenvalues of the matrix $K$ are close to those of $W/k$.

Lemma 5. Let $W$ be a random weight matrix on $k$ vertices, with random weights $w_{ij}$ distributed i.i.d. as $w_{ij} = w_{ji} \sim \mathrm{Exp}(1)$ ($w_{ii} = 0$). Then the corresponding transition matrix $K$ has the following spectral properties:

$$\lambda_1(K) = 1 \quad (51)$$
$$\max_{2 \leq i \leq k} |\lambda_i(K)| \leq \frac{2 + c}{\sqrt{k}} \quad \text{a.s.} \quad (52)$$

for some constant $c > 0$ (here, by "a.s." we mean that the sequence of events is true infinitely often as $k \to \infty$).

The proof of Lemma 5 follows from Lemma 4, and is given in Appendix B. The following corollary is immediate.

Corollary 2. Let $\theta \in \tilde{\mathcal{M}}_{2,\mathrm{rev}}(k)$ be distributed according to the prior $\tilde{\Phi}_u(\theta)$. Also, let $\tau_{\mathrm{rel}}^0 = 1 + \frac{2+c}{\sqrt{k}}$, where $c > 0$ is a positive constant. Then

$$P(\theta \in \tilde{\mathcal{M}}_{2,\mathrm{rev}}(k, \tau_{\mathrm{rel}}^0)) \to 1 \quad (53)$$

as $k \to \infty$. From now on, we denote $\tau_{\mathrm{rel}}^0 = 1 + \frac{2+c}{\sqrt{k}}$.
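The construction of Lemma 3 and the spectral behavior in Lemma 5 are easy to check by simulation. A small numerical sketch using numpy (the state-space size, random seed, and the slack factor in the eigenvalue check are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 300
# Symmetric Exp(1) weights on the complete graph, no self-loops (Lemma 3).
upper = np.triu(rng.exponential(1.0, size=(k, k)), 1)
W = upper + upper.T                   # w_ij = w_ji, w_ii = 0

rho = W.sum(axis=1)                   # row sums rho_i
K = W / rho[:, None]                  # K_ij = w_ij / rho_i   (15)
pi = rho / rho.sum()                  # pi_i = rho_i / rho    (16)

# Detailed balance pi_i K_ij = pi_j K_ji holds exactly, so K is reversible.
flow = pi[:, None] * K
assert np.allclose(flow, flow.T)

# Spectrum: lambda_1 = 1, and the remaining eigenvalues are small,
# on the order of (2 + c)/sqrt(k) as in Lemma 5.
eig = np.sort(np.linalg.eigvals(K).real)[::-1]
assert np.isclose(eig[0], 1.0)
assert np.max(np.abs(eig[1:])) < 4 / np.sqrt(k)
```

The check illustrates why the prior is supported on fast-mixing chains with high probability (Corollary 2): the spectral gap of a typical draw is already close to 1 at moderate $k$.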
We next analyze $h(\theta)$ and $h(\theta \mid X^n)$ under the prior $\tilde{\Phi}_u(\theta; \tau_{\mathrm{rel}}^0)$.

Lemma 6. Let $\theta \sim \tilde{\Phi}_u(\theta; \tau_{\mathrm{rel}}^0)$ and $\tau_{\mathrm{rel}}^0 = 1 + \frac{2+c}{\sqrt{k}}$. Then the differential entropy $h(\theta)$ is lower-bounded as

$$h(\theta) \geq \frac{k(k-1)}{2} \log \frac{2}{k(k-1)} + \frac{k(k-1)}{2} \log e - \log k$$

for $k \geq k_c$, where $k_c$ only depends on $c > 0$.

Proof. Corollary 2 implies that there exists some $k_c$ such that for $k \geq k_c$,

$$\mathrm{Vol}(\tilde{\mathcal{M}}_{2,\mathrm{rev}}(k, \tau_{\mathrm{rel}}^0)) \geq \frac{1}{2} \mathrm{Vol}(\tilde{\mathcal{M}}_{2,\mathrm{rev}}(k)) = \frac{1}{2} \cdot \frac{1}{\left[\frac{k(k-1)}{2}\right]!}$$

As the distribution $\tilde{\Phi}_u(\theta; \tau_{\mathrm{rel}}^0)$ is uniform, we know

$$h(\theta) = \log \mathrm{Vol}(\tilde{\mathcal{M}}_{2,\mathrm{rev}}(k, \tau_{\mathrm{rel}}^0)) \geq \log \mathrm{Vol}(\tilde{\mathcal{M}}_{2,\mathrm{rev}}(k)) - 1 \geq \frac{k(k-1)}{2} \log \frac{2}{k(k-1)} + \frac{k(k-1)}{2} \log e - \log k$$

where we used the Stirling approximation of the factorial to simplify the bound on the entropy.

The next step in the proof is to upper-bound the term $h(\theta \mid X^n)$, which quantifies how well we can estimate the parameter $\theta$ from $X^n$. Let $\hat{\theta} = \hat{\theta}(X^n)$ be a deterministic estimator of the parameter $\theta$. Then

$$h(\theta \mid X^n) = h(\theta - \hat{\theta} \mid X^n) \quad (54)$$
$$\leq h(\theta - \hat{\theta}) \quad (55)$$

Utilizing the fact that the Gaussian distribution maximizes differential entropy under a variance constraint, we have

$$h(\theta - \hat{\theta}) \leq \sum_{i,j} h(\theta_{i,j} - \hat{\theta}_{i,j}) \quad (56)$$
$$\leq \frac{1}{2} \sum_{i,j} \log \left[ 2\pi e \, \mathrm{Var}(\hat{\theta}_{i,j}) \right] \quad (57)$$
$$\leq \frac{1}{2} \sum_{i,j} \log \left[ 2\pi e \int_\theta \tilde{\Phi}_u(\theta; \tau_{\mathrm{rel}}^0) \, \mathrm{Var}(\hat{\theta}_{i,j} \mid \theta) \, d\theta \right] \quad (58)$$

Let $N(i, j) = \sum_{r=1}^{n-1} \mathbb{1}[(X_r, X_{r+1}) = (i, j)]$ denote the number of occurrences of the tuple $(i, j)$ in the $X^n$ sequence. Then a natural estimator for the parameter $\theta_{i,j} = 2\pi(i,j) = \pi(i,j) + \pi(j,i)$ is the empirical estimator $\hat{\theta}_{i,j}$:

$$\hat{\theta}_{i,j} = \frac{N(i,j) + N(j,i)}{n-1} \quad (59)$$

We prove the following bound on the variance of the empirical estimator $\hat{\theta}_{i,j}$.

Lemma 7. Let $\theta \sim \tilde{\Phi}_u(\theta; \tau_{\mathrm{rel}}^0)$. Then the variance of the estimator $\hat{\theta}_{i,j} = \frac{N(i,j)+N(j,i)}{n-1}$ can be bounded as:

$$\mathrm{Var}(\hat{\theta}_{i,j} \mid \theta) \leq \frac{8 \theta_{i,j} \tau_{\mathrm{rel}}^0}{n-1} \quad (60)$$

Proof. Let $X_1 \to X_2 \to \cdots \to X_n$ be a reversible Markov chain with transition matrix $K \in \tilde{\mathcal{M}}_{2,\mathrm{rev}}(k, \tau_{\mathrm{rel}}^0)$. Consider the Markov chain over the tuples $(X_1, X_2) \to (X_2, X_3) \to \cdots \to (X_{n-1}, X_n)$, and let $\tilde{K}$ be the corresponding transition matrix. It is interesting to note that, although the original Markov chain is reversible, the chain over the tuples is generally not reversible. Let

$$f_{i,j}(X_r, X_{r+1}) = \frac{\mathbb{1}[(X_r, X_{r+1}) = (i,j)] + \mathbb{1}[(X_r, X_{r+1}) = (j,i)]}{n-1}$$

so that the estimator $\hat{\theta}_{i,j}$ can be written as:

$$\hat{\theta}_{i,j} = \sum_{r=1}^{n-1} f_{i,j}(X_r, X_{r+1})$$

We can now apply Theorem 3.7 of [P+15] to the function $f_{i,j}$ and the Markov chain over the tuples $(X_1, X_2) \to (X_2, X_3) \to \cdots \to (X_{n-1}, X_n)$, to obtain:

$$\mathrm{Var}(\hat{\theta}_{i,j} \mid \theta) \leq \frac{4 \theta_{i,j}}{\gamma_{\mathrm{ps}}(\tilde{K})(n-1)}$$

We next use Lemma 8 (proved in Appendix C) to bound $\gamma_{\mathrm{ps}}(\tilde{K})$ in terms of $\tau_{\mathrm{rel}}(K)$:

Lemma 8. Let $X_1 \to X_2 \to \cdots \to X_n$ be a reversible Markov chain with transition matrix $K$. Then the Markov chain over the tuples, $(X_1, X_2) \to (X_2, X_3) \to \cdots \to (X_{n-1}, X_n)$, has pseudo-spectral gap $\gamma_{\mathrm{ps}}(\tilde{K})$ satisfying:

$$\gamma_{\mathrm{ps}}(\tilde{K}) \geq \frac{\gamma^*(K)}{2} \quad (61)$$

Using Lemma 8, we obtain the variance bound:

$$\mathrm{Var}(\hat{\theta}_{i,j} \mid \theta) \leq \frac{8 \theta_{i,j}}{\gamma^*(K)(n-1)} = \frac{8 \theta_{i,j} \tau_{\mathrm{rel}}(K)}{n-1} \leq \frac{8 \theta_{i,j} \tau_{\mathrm{rel}}^0}{n-1}$$

which proves the lemma.

We next use the variance bound on the estimator to obtain an upper bound on $h(\theta \mid X^n)$.

Lemma 9. Let $\theta \sim \tilde{\Phi}_u(\theta; \tau_{\mathrm{rel}}^0)$. Then the conditional differential entropy $h(\theta \mid X^n)$ is upper-bounded by:

$$h(\theta \mid X^n) \leq \frac{k(k-1)}{4} \log \frac{16\pi e \tau_{\mathrm{rel}}^0}{n-1} + \frac{k(k-1)}{4} \log \frac{2}{k(k-1)} \quad (62)$$

Proof.
From equation (58) and Lemma 7, we obtain:

$$h(\theta \mid X^n) \leq \frac{1}{2} \sum_{i,j} \log \left[ 2\pi e \int_\theta \tilde{\Phi}_u(\theta; \tau_{\mathrm{rel}}^0) \, \mathrm{Var}(\hat{\theta}_{i,j} \mid \theta) \, d\theta \right] \quad (63)$$
$$\leq \frac{1}{2} \sum_{i,j} \log \left[ 2\pi e \int_\theta \tilde{\Phi}_u(\theta; \tau_{\mathrm{rel}}^0) \frac{8 \theta_{i,j} \tau_{\mathrm{rel}}^0}{n-1} \, d\theta \right] \quad (64)$$
$$= \frac{k(k-1)}{4} \log \frac{16\pi e \tau_{\mathrm{rel}}^0}{n-1} + \frac{1}{2} \sum_{i,j} \log \int_\theta \tilde{\Phi}_u(\theta; \tau_{\mathrm{rel}}^0) \, \theta_{i,j} \, d\theta \quad (65)$$
$$\leq \frac{k(k-1)}{4} \log \frac{16\pi e \tau_{\mathrm{rel}}^0}{n-1} \quad (66)$$
$$\quad + \frac{k(k-1)}{4} \log \left[ \sum_{i,j} \frac{2}{k(k-1)} \int_\theta \tilde{\Phi}_u(\theta; \tau_{\mathrm{rel}}^0) \, \theta_{i,j} \, d\theta \right] \quad (67)$$
$$= \frac{k(k-1)}{4} \log \frac{16\pi e \tau_{\mathrm{rel}}^0}{n-1} + \frac{k(k-1)}{4} \log \frac{2}{k(k-1)} \quad (68)$$

The step to (67) uses the concavity of the logarithm (Jensen's inequality), and equation (68) holds because $\sum_{i,j} \theta_{i,j} = 1$. This proves the upper bound on the term $h(\theta \mid X^n)$.

4.2 Proof of Theorem 1

Using Lemma 2 for the prior $\tilde{\Phi}_u(\theta; \tau_{\mathrm{rel}}^0)$, together with Lemmas 6 and 9, we have

$$R_n(k, \tau_{\mathrm{rel}}) \geq \frac{1}{n} I(\theta; X^n) \quad (69)$$
$$= \frac{1}{n} \left[ h(\theta) - h(\theta \mid X^n) \right] \quad (70)$$
$$\geq \frac{k(k-1)}{4n} \log \frac{2(n-1)}{k(k-1)} + \frac{k(k-1)}{4n} \log \frac{e}{16\pi \tau_{\mathrm{rel}}^0} - \frac{\log k}{n} \quad (71)$$

for $k \geq k_c$, where $\tau_{\mathrm{rel}}^0 = 1 + \frac{2+c}{\sqrt{k}}$. This completes the proof of the lower bound.

5 Theorem 2 Proof

For any sequence $x^n$ over the alphabet $\mathcal{X} = [k]$, let $N(a)$ and $N(a, b)$ be defined as:

$$N(a) = \sum_{i=1}^{n-1} \mathbb{1}[x_i = a] \quad (72)$$
$$N(a, b) = \sum_{i=1}^{n-1} \mathbb{1}[(x_i, x_{i+1}) = (a, b)] \quad (73)$$

Before we prove the theorem, we consider some simple lemmas.

Lemma 10. There exists a prefix code [CT12] (Section 5.1) on the non-negative integers $\mathbb{N} \cup \{0\} = \{0, 1, 2, \ldots\}$ such that every integer $m$ has a codeword of length $l_m \leq 2\log_2(m+1) + 1$.

Proof. Let $q$ be such that $2^q \leq m+1 < 2^{q+1}$. Thus, $m+1$ can be written as

$$m + 1 = 2^q + r \quad (74)$$

where $0 \leq r < 2^q$. Let $U_q = 00\ldots0$ be a unary code consisting of $q$ zeros, and let $B_r$ be the binary representation of $r$ using $q$ bits. Then the following code $C_m$ is prefix-free:

$$C_m = U_q \, 1 \, B_r \quad (75)$$

Thus, the length of the code $l(C_m)$ satisfies:

$$l(C_m) = 2q + 1 \quad (76)$$
$$\leq 2\log_2(m+1) + 1 \quad (77)$$

This completes the proof.

Lemma 11. We can store the parameters $N(a)$, $N(a, b)$ for all $a, b \in [k]$ of a sequence $x^n$ using $L_{\mathrm{param}}$ bits, which is upper-bounded as:

$$L_{\mathrm{param}}(x^n) \leq 2k^2 \log_2 \left( \frac{n}{k^2} + 1 \right) + k^2 \quad (78)$$

Proof. Note that we only need to store $N(a, b)$ for all $a, b \in [k]$, as the parameters $N(a)$ can be derived from them. Using the prefix code from Lemma 10 for the parameters $N(a, b)$:

$$L_{\mathrm{param}}(x^n) \leq \sum_{a,b} \left[ 2\log_2(N(a,b) + 1) + 1 \right] \quad (79)$$
$$= 2k^2 \sum_{a,b} \frac{1}{k^2} \log_2(N(a,b) + 1) + k^2 \quad (80)$$
$$\leq 2k^2 \log_2 \left[ \frac{1}{k^2} \sum_{a,b} (N(a,b) + 1) \right] + k^2 \quad (81)$$
$$= 2k^2 \log_2 \left( \frac{n}{k^2} + 1 \right) + k^2 \quad (82)$$

Equation (81) is true due to the concavity of the log function and Jensen's inequality.

Lemma 12. We can use arithmetic coding [WNC87] to encode a sequence $x^n$ using $L_{\mathrm{seq}}(x^n)$ bits, which is bounded as:

$$L_{\mathrm{seq}}(x^n) \leq \log_2 k + (n-1) H_1(x^n) + 3 \quad (83)$$

where $H_1(x^n)$ is the first-order empirical entropy of the sequence $x^n$:

$$H_1(x^n) = \sum_{a=1}^{k} \sum_{b=1}^{k} \frac{N(a,b)}{n-1} \log_2 \frac{N(a)}{N(a,b)} \quad (84)$$

Proof. We first encode the symbol $x_1$ using a fixed $\lceil \log_2 k \rceil$ bits. Next, we encode the remaining $n-1$ symbols using arithmetic coding [WST95] (Section IV) with the first-order model distribution $q(b \mid a) = N(a,b)/N(a)$. Using Theorem 1 of [WST95], the codelength of $x^n$ satisfies:

$$L_{\mathrm{seq}}(x^n) \leq \lceil \log_2 k \rceil + \sum_{i=1}^{n-1} \log_2 \frac{1}{q(x_{i+1} \mid x_i)} + 2 \quad (85)$$
$$= \log_2 k + 1 + \sum_{a=1}^{k} \sum_{b=1}^{k} N(a,b) \log_2 \frac{N(a)}{N(a,b)} + 2 \quad (86)$$
$$= \log_2 k + (n-1) H_1(x^n) + 3 \quad (87)$$

This completes the proof.

Let $x^n$ be a given sequence over the alphabet $\mathcal{X} = [k]$. Consider the following compressor:

1. Store all the parameters $N(a, b)$, $\forall a, b \in [k]$, using the universal prefix-free code of Lemma 10.
2. Use the parameters $N(a, b)$ to compress $x^n$ using first-order Markov arithmetic coding as in Lemma 12.
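The codelength accounting of the two-part compressor above can be sketched in a few lines of Python. This is our illustrative tally of the bounds in Lemmas 10–12 only (no actual arithmetic coder is implemented, and $\lceil \log_2 k \rceil$ is simplified to $\log_2 k$):

```python
import math
from collections import Counter

def prefix_len(m: int) -> int:
    """Length of the Lemma-10 prefix code for integer m >= 0: 2q + 1 bits,
    where 2^q <= m + 1 < 2^(q+1); always <= 2*log2(m+1) + 1."""
    q = (m + 1).bit_length() - 1
    return 2 * q + 1

def codelength(x, k: int) -> float:
    """Two-part codelength (bits): store all pair counts N(a,b) with the
    prefix code, then charge the arithmetic-coding bound of Lemma 12,
    log2(k) + (n-1)*H1(x^n) + 3."""
    N2 = Counter(zip(x, x[1:]))    # N(a, b), equation (73)
    N1 = Counter(x[:-1])           # N(a),    equation (72)
    L_param = sum(prefix_len(N2.get((a, b), 0))
                  for a in range(k) for b in range(k))
    nH1 = sum(c * math.log2(N1[a] / c) for (a, _), c in N2.items())
    return L_param + math.log2(k) + nH1 + 3

# Periodic sequence over k = 4 symbols: its empirical entropy H1 is 0, so
# the codelength is dominated by the O(k^2 log n) parameter cost.
x = [0, 1, 2, 3] * 250
n, k = len(x), 4
L = codelength(x, k)
assert L <= 2 * k**2 * math.log2(n / k**2 + 1) + k**2 + math.log2(k) + 3
```

The final assertion is exactly the bound (88)-(89) derived next, specialized to a sequence with zero empirical entropy.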
Then the codelength $\hat{L}(x^n)$ is bounded as:

$$\hat{L}(x^n) = L_{\mathrm{param}}(x^n) + L_{\mathrm{seq}}(x^n) \tag{88}$$
$$\le \left[2k^2\log_2\left(\frac{n}{k^2}+1\right) + k^2\right] + \left[\log_2 k + (n-1)H_1(x^n) + 3\right] \tag{89}$$

We now take a look at the redundancy $R_n(k)$:

$$R_n(k) = \inf_L \sup_{\theta\in\mathcal{M}_2(k)} r_n(L,\theta) \tag{90}$$
$$\le \sup_{\theta\in\mathcal{M}_2(k)} \frac{1}{n}\left(\mathbb{E}_\theta[\hat{L}(X^n)] - H_\theta(X^n)\right) \tag{91}$$
$$= \sup_{\theta\in\mathcal{M}_2(k)} \frac{1}{n}\left(\mathbb{E}_\theta[\hat{L}(X^n)] - H_\theta(X_1) - (n-1)H_\theta(X_2\mid X_1)\right) \tag{92}$$
$$\le \sup_{\theta\in\mathcal{M}_2(k)} \frac{1}{n}\left(\mathbb{E}_\theta[\hat{L}(X^n)] - (n-1)H_\theta(X_2\mid X_1)\right) \tag{93}$$
$$\le \frac{2k^2}{n}\log_2\left(\frac{n}{k^2}+1\right) + \frac{k^2}{n} + \sup_{\theta\in\mathcal{M}_2(k)} \frac{n-1}{n}\left(\mathbb{E}_\theta[H_1(X^n)] - H_\theta(X_2\mid X_1)\right) + \frac{\log_2 k + 3}{n} \tag{94}$$
$$\le \frac{2k^2}{n}\log_2\left(\frac{n}{k^2}+1\right) + \frac{k^2}{n} + \frac{\log_2 k + 3}{n} \tag{95}$$

where equation (95) is true because of the concavity of entropy. This completes the proof of the upper bound.

References

[Att99] Kevin Atteson. The asymptotic redundancy of Bayes rules for Markov chains. IEEE Transactions on Information Theory, 45(6):2104-2109, 1999.

[BCC10] Charles Bordenave, Pietro Caputo, and Djalil Chafai. Spectrum of large random reversible Markov chains: two examples. ALEA: Latin American Journal of Probability and Mathematical Statistics, 7:41-64, 2010.

[CMS+13] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013.

[CT12] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.

[Dav83] L Davisson. Minimax noiseless universal coding for Markov sources. IEEE Transactions on Information Theory, 29(2):211-215, 1983.

[DS04] Michael Drmota and Wojciech Szpankowski. Precise minimax redundancy and regret. IEEE Transactions on Information Theory, 50(11):2686-2707, 2004.

[FKG10] Bela A Frigyik, Amol Kapila, and Maya R Gupta.
Introduction to the Dirichlet distribution and related processes. Department of Electrical Engineering, University of Washington, UWEE TR-2010-0006, 2010.

[FM96] Meir Feder and Neri Merhav. Hierarchical universal coding. IEEE Transactions on Information Theory, 42(5):1354-1364, 1996.

[HJL+18] Yanjun Han, Jiantao Jiao, Chuan-Zheng Lee, Tsachy Weissman, Yihong Wu, and Tiancheng Yu. Entropy rate estimation for Markov chains with large state space. arXiv preprint arXiv:1802.07889, 2018.

[JHFHW17] Jiantao Jiao, Yanjun Han, Irena Fischer-Hwang, and Tsachy Weissman. Estimating the fundamental limits is easier than achieving the fundamental limits. arXiv preprint arXiv:1707.01203, 2017.

[JS04] Philippe Jacquet and Wojciech Szpankowski. Markov types and minimax redundancy for Markov sources. IEEE Transactions on Information Theory, 50(7):1393-1402, 2004.

[MF95] Neri Merhav and Meir Feder. A strong version of the redundancy-capacity theorem of universal coding. IEEE Transactions on Information Theory, 41(3):714-722, 1995.

[MT06] Ravi Montenegro and Prasad Tetali. Mathematical aspects of mixing times in Markov chains. Foundations and Trends in Theoretical Computer Science, 1(3):237-354, 2006.

[OS04] Alon Orlitsky and Narayana P Santhanam. Speaking of infinity. IEEE Transactions on Information Theory, 50(10):2215-2230, 2004.

[P+15] Daniel Paulin et al. Concentration inequalities for Markov chains by Marton couplings and spectral methods. Electronic Journal of Probability, 20, 2015.

[Ris84] Jorma Rissanen. Universal coding, information, prediction, and estimation. IEEE Transactions on Information Theory, 30(4):629-636, 1984.

[SW12] Wojciech Szpankowski and Marcelo J Weinberger. Minimax pointwise redundancy for memoryless models over large alphabets.
IEEE Transactions on Information Theory, 58(7):4094-4104, 2012.

[Tao] Terence Tao. Topics in random matrix theory, volume 132.

[WNC87] Ian H Witten, Radford M Neal, and John G Cleary. Arithmetic coding for data compression. Communications of the ACM, 30(6):520-540, 1987.

[WST95] Frans MJ Willems, Yuri M Shtarkov, and Tjalling J Tjalkens. The context-tree weighting method: basic properties. IEEE Transactions on Information Theory, 41(3):653-664, 1995.

[XB97] Qun Xie and Andrew R Barron. Minimax redundancy for the class of memoryless sources. IEEE Transactions on Information Theory, 43(2):646-657, 1997.

[XB00] Qun Xie and Andrew R Barron. Asymptotic minimax regret for data compression, gambling, and prediction. IEEE Transactions on Information Theory, 46(2):431-445, 2000.

[ZL77] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337-343, 1977.

[ZL78] Jacob Ziv and Abraham Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5):530-536, 1978.

A Existing Minimax Redundancy Lower Bounds

We analyze the existing lower bound by [Dav83]:

$$R_n(k) \ge g(k,n) = \frac{k(k-1)}{2n}\log n + \frac{k(k-1)}{n}\log\frac{1}{k^4} - \frac{k(k-1)}{2n}\log\left[C\left(1 - \left(1 - \frac{1}{4k^4}\right)^{\frac{1}{2}}\right)\right] \tag{96}$$

We can simplify $g(k,n)$ to get the lower bound:

$$g(k,n) = \frac{k(k-1)}{2n}\log n + \frac{k(k-1)}{n}\log\frac{1}{k^4} - \frac{k(k-1)}{2n}\log\left[C\left(1 - \left(1 - \frac{1}{4k^4}\right)^{\frac{1}{2}}\right)\right] \tag{97}$$
$$= \frac{k(k-1)}{2n}\log\frac{n}{k^2} - \frac{5k(k-1)}{2n}\log k^2 - \frac{k(k-1)}{2n}\log C + o\left(\frac{k(k-1)}{n}\right) \tag{98}$$

Thus, the effective lower bound on $R_n(k)$ is:

$$R_n(k) \ge \frac{k(k-1)}{2n}\log\frac{n}{k^2} - \frac{5k(k-1)}{2n}\log k^2 - \frac{k(k-1)}{2n}\log C + o\left(\frac{k(k-1)}{n}\right) \tag{99}$$

We observe that the lower bound on redundancy $R_n(k)$ is non-zero only when $n \gg k^2 \log k$.
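To see concretely why this bound says nothing in the $n \asymp k^2$ regime, the two leading terms of the simplification can be evaluated numerically. This is a rough sketch of ours: the $-\log C$ and $o(\cdot)$ terms are deliberately omitted, since the constant $C$ from [Dav83] is not specified here.

```python
import math

def g_leading_terms(k, n):
    """Leading terms of the simplified Davisson bound:
    (k(k-1)/2n) log(n/k^2) - (5 k(k-1)/2n) log(k^2).
    The -log(C) and o(k(k-1)/n) terms are intentionally dropped."""
    coeff = k * (k - 1) / (2 * n)
    return coeff * math.log(n / k**2) - 5 * coeff * math.log(k**2)

# At n = k^2 the dominant term log(n/k^2) vanishes, so the leading
# part of the bound is negative: the bound is vacuous at n ~ k^2.
k = 100
print(g_leading_terms(k, k**2))
```

The bound only becomes positive once $\log(n/k^2)$ outgrows the negative correction terms, i.e., far beyond $n \asymp k^2$, which motivates the new lower bound of Theorem 1.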
We aim to improve the lower bound when $n \asymp k^2$.

B Proof of Lemma 5

To analyze the spectrum of the transition matrix $K$, we construct a symmetric matrix $S$ which has the same spectrum as $K$ almost surely.

Lemma 13. (Spectral equivalence) Almost surely, for large $k$, the spectrum of the transition matrix $K$ coincides with the spectrum of the symmetric matrix $S$ defined as:

$$S_{ij} = \sqrt{\frac{\rho_i}{\rho_j}}\,K_{ij} = \frac{w_{ij}}{\sqrt{\rho_i\rho_j}} \tag{100}$$

The lemma is proved in Lemma 2.1 of [BCC10]. We now use Lemma 4, which allows us to estimate $\rho_i = k(1+o(1))$, to compare the spectrum of the matrix $\sqrt{k}K$ with that of the matrix $W/\sqrt{k}$.

Lemma 14. (Bulk behavior) The ESD (empirical spectral density) of $\sqrt{k}K$ weakly converges to Wigner's semicircle law $\mathcal{W}_2$:

$$\mu_{\sqrt{k}K} \xrightarrow[k\to\infty]{w} \mathcal{W}_2 \tag{101}$$

where Wigner's semicircle law $\mathcal{W}_2$ is given by:

$$x \mapsto \frac{1}{2\pi}\sqrt{4 - x^2}\;\mathbb{1}_{[-2,2]}(x)$$

Proof. First of all, from Lemma 13, the spectrum of $S$ is equivalent to that of $K$ a.s. (for large $k$). Thus, it is sufficient to analyze the spectrum of $\sqrt{k}S$. To show the weak convergence, we bound the Levy distance between the cumulative distributions corresponding to the ESDs of the matrices $\sqrt{k}S$ and $W/\sqrt{k}$. Let $F^{\sqrt{k}S}$ and $F^{W/\sqrt{k}}$ be these cumulative distributions; then:

$$L^3\!\left(F^{\sqrt{k}S}, F^{W/\sqrt{k}}\right) \le \frac{1}{k}\,\mathrm{Tr}\!\left(\left(\sqrt{k}S - W/\sqrt{k}\right)^2\right) \tag{102}$$
$$= \frac{1}{k}\sum_{i,j}^{k}\frac{w_{ij}^2}{k}\left(\frac{k}{\sqrt{\rho_i\rho_j}} - 1\right)^2 \tag{103}$$
$$\le O(\delta^2)\left(\frac{1}{k^2}\sum_{i,j}^{k} w_{ij}^2\right) \tag{104}$$
$$\to 2\,O(\delta^2) \quad \text{a.s., as } k\to\infty \tag{105}$$

This proves the weak convergence of $\mu_{\sqrt{k}K}$ to the Wigner semicircle law $\mathcal{W}_2$. Note that even though $\lambda_1(\sqrt{k}K) = \sqrt{k} \to \infty$ as $k\to\infty$, the weak limit of $\mu_{\sqrt{k}K}$ is not affected, since $\lambda_1(\sqrt{k}K)$ has weight $1/k$. The theorem thus implies that the bulk of the spectrum $\sigma(K)$ collapses as $k^{-1/2}$, but it does not give a characterization of $\lambda_2(\sqrt{k}K)$, which is what is required.
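The spectral equivalence of Lemma 13 and the bulk scaling of Lemma 14 are easy to probe numerically. The following is a small sketch of ours, using i.i.d. Exponential(1) symmetric weights $w_{ij}$ in the spirit of the [BCC10] model (all variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 500

# Symmetric i.i.d. weights w_ij, row sums rho_i, transition matrix K_ij = w_ij / rho_i
U = rng.exponential(scale=1.0, size=(k, k))
W = np.triu(U) + np.triu(U, 1).T          # symmetric: w_ij = w_ji
rho = W.sum(axis=1)
K = W / rho[:, None]

# Lemma 13: S_ij = sqrt(rho_i / rho_j) K_ij = w_ij / sqrt(rho_i rho_j)
# is symmetric and shares its spectrum with K.
S = W / np.sqrt(np.outer(rho, rho))
eig_S = np.sort(np.linalg.eigvalsh(S))
eig_K = np.sort(np.linalg.eigvals(K).real)
assert np.allclose(eig_S, eig_K, atol=1e-6)
assert abs(eig_S[-1] - 1.0) < 1e-8        # Perron eigenvalue of a stochastic matrix

# Lemma 14: apart from the top eigenvalue, the spectrum of sqrt(k) K
# concentrates on the semicircle support [-2, 2] (up to edge fluctuations).
bulk = np.sqrt(k) * eig_S[:-1]
print(round(bulk.min(), 2), round(bulk.max(), 2))
```

At $k = 500$ the rescaled bulk already sits close to $[-2, 2]$, consistent with the $k^{-1/2}$ collapse of $\sigma(K)$ described above.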
To prove Lemma 5, we represent the symmetric matrix $S$ as a combination of a rank-one matrix $P$ corresponding to the stationary distribution, and a "noise" matrix $S - P$ with nearly i.i.d. entries. Bounding the spectral norm of the "noise" matrix $S - P$ gives us the result.

Proof. (Lemma 5) Since $K$ is almost surely irreducible for large enough $k$, the eigenspace of $S$ for the eigenvalue 1 is a.s. of dimension 1, and is the span of the vector $[\sqrt{\rho_1}, \sqrt{\rho_2}, \ldots, \sqrt{\rho_k}]$. Consider the symmetric matrix $P$:

$$P_{ij} = \frac{\sqrt{\rho_i\rho_j}}{\rho} \tag{106}$$

By removing $P$ from $S$, we are essentially removing the largest eigenvalue 1, without touching the other eigenvalues. Thus, the spectrum of the matrix $S - P$ is given by:

$$\{\lambda_2(S), \lambda_3(S), \ldots, \lambda_k(S)\} \cup \{0\} \tag{107}$$

To bound $\sqrt{k}\,\lambda_2(S)$, we now bound the spectral norm of the matrix $A = \sqrt{k}(S - P)$. Lemma 2.4 of [BCC10], along with Lemma 4, gives us:

$$\max_{1\le i\le k} \sqrt{k}\,|\lambda_i(S - P)| \le 2 + o(1) \quad \text{a.s.} \tag{108}$$

Equations (107) and (108), together with Lemma 13, imply that:

$$\max_{2\le i\le k} \sqrt{k}\,|\lambda_i(S)| \le 2 + o(1) \quad \text{a.s.} \tag{109}$$
$$\max_{2\le i\le k} \sqrt{k}\,|\lambda_i(K)| \le 2 + o(1) \quad \text{a.s.} \tag{110}$$
$$\max_{2\le i\le k} |\lambda_i(K)| \le \frac{2+c}{\sqrt{k}} \quad \text{a.s.} \tag{111}$$

for some constant $c \ge 0$. This completes the proof.

C Proof of Lemma 8

Consider the Markov chain over the tuples $(X_1, X_2) \to (X_2, X_3) \to \cdots \to (X_{n-1}, X_n)$ with transition matrix $\tilde{K}$. We first analyze some properties of the transition matrix $\tilde{K}$.

Lemma 15. Let $X_1 \to X_2 \to \cdots \to X_n$ be a reversible Markov chain with transition matrix $K$. Then the transition matrix $\tilde{K}$ for the chain $(X_1, X_2) \to (X_2, X_3) \to \cdots \to (X_{n-1}, X_n)$ and its matrix reversibilization $\tilde{K}^*$ are:

$$\tilde{K}((a,b),(c,d)) = \mathbb{1}[b = c]\,K(c,d)$$
$$\tilde{K}^*((a,b),(c,d)) = \mathbb{1}[a = d]\,K(d,c)$$

Proof. The transition matrix for the chain $(X_1, X_2) \to (X_2, X_3) \to \cdots \to (X_{n-1}, X_n)$ is given by:

$$\tilde{K}((a,b),(c,d)) = P(X_2 = c, X_3 = d \mid X_1 = a, X_2 = b) \tag{112}$$
$$= P(X_2 = c \mid X_1 = a, X_2 = b)\,P(X_3 = d \mid X_1 = a, X_2 = b, X_2 = c) \tag{113}$$
$$= \mathbb{1}[b = c]\,P(X_3 = d \mid X_2 = c) \tag{114}$$
$$= \mathbb{1}[b = c]\,K(c,d) \tag{115}$$

where equation (114) holds by the Markov property. As the Markov chain $(X_1, X_2) \to (X_2, X_3) \to \cdots \to (X_{n-1}, X_n)$ is in general non-reversible, the multiplicative reversibilization of $\tilde{K}$ is:

$$\tilde{K}^*((c,d),(a,b)) = \tilde{K}((a,b),(c,d))\,\frac{\pi(a,b)}{\pi(c,d)} \tag{116}$$
$$= \mathbb{1}[b = c]\,K(c,d)\,\frac{\pi(a)K(a,b)}{\pi(c)K(c,d)} \tag{117}$$
$$= \mathbb{1}[b = c]\,\frac{\pi(b)K(b,a)}{\pi(c)} \tag{118}$$
$$= \mathbb{1}[b = c]\,\frac{\pi(b)K(b,a)}{\pi(b)} \tag{119}$$
$$\tilde{K}^*((c,d),(a,b)) = \mathbb{1}[b = c]\,K(b,a) \tag{120}$$
$$\tilde{K}^*((a,b),(c,d)) = \mathbb{1}[a = d]\,K(d,c) \tag{121}$$

This proves the lemma.

Lemma 16. Let $T$ be the $k^2 \times k^2$ matrix corresponding to the transformation:

$$(TM)((a,b),(c,d)) = M((b,a),(c,d)) \tag{122}$$

Then $\tilde{K}$ and $\tilde{K}^*$ have the property:

$$(\tilde{K}^*)^r \tilde{K}^r = (T\tilde{K}^r)^2 \tag{123}$$

Proof. The matrix $T$ also has the properties:

$$(MT)((a,b),(c,d)) = M((a,b),(d,c)) \tag{124}$$
$$T^2 = I \tag{125}$$

Then we can show that:

$$T\tilde{K}((a,b),(c,d))\,T = T\,\mathbb{1}[b=c]\,K(c,d)\,T \tag{126}$$
$$= \mathbb{1}[a=c]\,K(c,d)\,T \tag{127}$$
$$= \mathbb{1}[a=d]\,K(d,c) \tag{128}$$
$$= \tilde{K}^*((a,b),(c,d)) \tag{129}$$

Using equation (129), we can show that for any $r$:

$$(\tilde{K}^*)^r\tilde{K}^r = (T\tilde{K}T)^r\tilde{K}^r \tag{130}$$
$$= T\tilde{K}^r T\tilde{K}^r \tag{131}$$
$$= (T\tilde{K}^r)^2 \tag{132}$$

This completes the proof.

Lemma 17. The matrices $\tilde{K}^2$ and $T\tilde{K}^2$ have the form:

$$\tilde{K}^2((a,b),(c,d)) = K(b,c)\,K(c,d) \tag{133}$$
$$T\tilde{K}^2((a,b),(c,d)) = K(a,c)\,K(c,d) \tag{134}$$

Proof.
Using Lemma 15, we obtain:

$$\tilde{K}^2((a,b),(c,d)) = \sum_{e,f}\tilde{K}((a,b),(e,f))\,\tilde{K}((e,f),(c,d)) \tag{135}$$
$$= \sum_{e,f}\mathbb{1}[b = e]\,K(e,f)\,\mathbb{1}[f = c]\,K(c,d) \tag{136}$$
$$= K(b,c)\,K(c,d) \tag{137}$$

This proves equation (133). We now use the definition of $T$ to obtain:

$$T\tilde{K}^2((a,b),(c,d)) = \tilde{K}^2((b,a),(c,d)) \tag{138}$$
$$= K(a,c)\,K(c,d) \tag{139}$$

This proves the lemma.

Lemma 18. The matrices $K$ and $T\tilde{K}^2$ have identical non-zero eigenvalues.

Proof. Let $V$ be an eigenvector of the matrix $T\tilde{K}^2$ with a non-zero eigenvalue $\eta$. This implies that:

$$\eta\,V((a,b)) = \sum_{c,d} T\tilde{K}^2((a,b),(c,d))\,V((c,d)) \tag{140}$$
$$= \sum_{c,d} K(a,c)\,K(c,d)\,V((c,d)) \tag{141}$$

Since $\eta \ne 0$ and the right-hand side of (141) does not depend on $b$, the entries $V((a,b))$ depend only on $a$; write $V((a,b)) = v(a)$. Then:

$$\eta\,v(a) = \sum_{c} K(a,c)\,v(c)\sum_{d} K(c,d) \tag{142}$$
$$= \sum_{c} K(a,c)\,v(c) \tag{143}$$

Thus, the vector $v = [v(1), \ldots, v(k)]^T$ is an eigenvector of the matrix $K$ with eigenvalue $\eta$. Conversely, let $v = [v_1, v_2, \ldots, v_k]^T$ be an eigenvector of the matrix $K$ with non-zero eigenvalue $\eta$. Then the vector $V$ with entries $V((a,b)) = v_a$ is an eigenvector of the matrix $T\tilde{K}^2$ with the same eigenvalue. Together, this implies that the non-zero eigenvalues of the matrices $K$ and $T\tilde{K}^2$ are identical.

We now come to the proof of Lemma 8.

Proof. (Lemma 8) Using Lemma 18 and Lemma 16, we get:

$$\lambda_2\big((T\tilde{K}^2)^2\big) = \max\big(\lambda_2(K)^2, \lambda_k(K)^2\big)$$
$$\gamma\big((\tilde{K}^*)^2\tilde{K}^2\big) = 1 - \max\big(\lambda_2(K)^2, \lambda_k(K)^2\big) \ge 1 - \max\big(|\lambda_2(K)|, |\lambda_k(K)|\big) = \gamma^*(K)$$

Now, using the definition of the pseudo-spectral gap $\gamma_{\mathrm{ps}}(\tilde{K})$, we obtain:

$$\gamma_{\mathrm{ps}}(\tilde{K}) \ge \frac{\gamma\big((\tilde{K}^*)^2\tilde{K}^2\big)}{2} \ge \frac{\gamma^*(K)}{2}$$

This proves the lemma.
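The identities in Lemmas 15, 17, and 18 are finite-dimensional and can be checked by direct computation. A small sketch of ours for a random 4-state chain (all names are ours):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
k = 4
K = rng.random((k, k))
K /= K.sum(axis=1, keepdims=True)          # random row-stochastic K

pairs = list(product(range(k), repeat=2))  # state space of the tuple chain

# Lemma 15: Kt((a,b),(c,d)) = 1[b = c] K(c,d)
Kt = np.zeros((k * k, k * k))
for i, (a, b) in enumerate(pairs):
    for j, (c, d) in enumerate(pairs):
        if b == c:
            Kt[i, j] = K[c, d]
assert np.allclose(Kt.sum(axis=1), 1.0)    # Kt is again row-stochastic

# T swaps the incoming pair: (T M)((a,b),(c,d)) = M((b,a),(c,d)), with T^2 = I
T = np.zeros((k * k, k * k))
for i, (a, b) in enumerate(pairs):
    T[i, pairs.index((b, a))] = 1.0
assert np.allclose(T @ T, np.eye(k * k))

# Lemma 17: (T Kt^2)((a,b),(c,d)) = K(a,c) K(c,d)
TK2 = T @ Kt @ Kt
for i, (a, b) in enumerate(pairs):
    for j, (c, d) in enumerate(pairs):
        assert np.isclose(TK2[i, j], K[a, c] * K[c, d])

# Lemma 18: the non-zero eigenvalues of T Kt^2 are exactly those of K
ev_K = np.sort_complex(np.linalg.eigvals(K))
ev = np.linalg.eigvals(TK2)
ev_nonzero = np.sort_complex(ev[np.abs(ev) > 1e-9])
assert np.allclose(ev_K, ev_nonzero, atol=1e-8)
```

The remaining $k^2 - k$ eigenvalues of $T\tilde{K}^2$ are zero, matching the rank-$k$ structure used in the proof of Lemma 18.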
