Minimax redundancy for Markov chains with large state space


Authors: Kedar Shriram Tatwawadi, Jiantao Jiao, Tsachy Weissman

Minimax redundancy for Markov chains with large state space
Kedar Shriram Tatwawadi, Jiantao Jiao, Tsachy Weissman
May 8, 2018

Abstract

For any Markov source, there exist universal codes whose normalized codelength approaches the Shannon limit asymptotically as the number of samples goes to infinity. This paper investigates how fast the gap between the normalized codelength of the "best" universal compressor and the Shannon limit (i.e., the compression redundancy) vanishes non-asymptotically in terms of the alphabet size and mixing time of the Markov source. We show that, for Markov sources whose relaxation time is at least $1 + \frac{2+c}{\sqrt{k}}$, where $k$ is the state space size (and $c > 0$ is a constant), the phase transition for the number of samples required to achieve vanishing compression redundancy is precisely $\Theta(k^2)$.

1 Introduction

For any data source that can be modeled as a stationary ergodic stochastic process, it is well known in the literature of universal compression that there exist compression algorithms which, without any knowledge of the source distribution, approach the fundamental limit of the source, also known as the Shannon entropy, as the number of observations tends to infinity. The existence of universal data compressors has spurred a huge wave of research around it. A large fraction of practical lossless compressors are based on the Lempel–Ziv algorithms [ZL77, ZL78] and their variants, and the normalized codelength of a universal source code is also widely used to measure the compressibility of the source, based on the idea that the normalized codelength is "close" to the true entropy rate given a moderate number of samples. There has been considerable effort to quantify how fast the codelength of a universal code approaches the Shannon entropy rate.
One of the general statements pertaining to distributions parametrized by a finite-dimensional vector is due to Rissanen [Ris84]. Let $X^n$ be a sequence of random variables generated from some stationary distribution $p_\theta(x^n)$ with parameters $\theta$. A compressor $L$ for the $X^n$ sequence is characterized by its length function $L(x^n)$, which is the length (in bits) of the code corresponding to every realization $x^n$ of $X^n$. The entropy $H_\theta(X^n)$ quantifies the fundamental limit of compression under model $p_\theta$, and is given by

$$H_\theta(X^n) = \sum_{x^n} p_\theta(x^n) \log_2 \frac{1}{p_\theta(x^n)} \quad (1)$$

The redundancy of a compressor with length function $L(X^n)$ is defined as

$$r_n(L, \theta) = \frac{1}{n}\left( \mathbb{E}[L(X^n) \mid \theta] - H_\theta(X^n) \right) \quad (2)$$

Throughout the paper, we work with $\log \equiv \log_2$.

Rissanen [Ris84] states that if $\theta \in \Theta \subset \mathbb{R}^d$, and if the parameter $\theta$ can be estimated at the "parametric" rate asymptotically (with $d, \theta$ fixed), then there exists some compressor $L$ such that

$$r_n(L, \theta) = \frac{d \log n}{2n} + O\left(\frac{1}{n}\right) \quad (3)$$

as $n \to \infty$. Moreover, fixing $d, \epsilon > 0$, for any uniquely decodable code $L$, its redundancy satisfies

$$r_n(L, \theta) \geq (1-\epsilon) \frac{d \log n}{2n} \quad (4)$$

as $n \to \infty$ for all values of $\theta$ except for a set whose volume vanishes as $n \to \infty$ while other parameters are fixed. The focus of Rissanen [Ris84] was asymptotic, i.e., the characterization of the redundancy as the number of samples $n \to \infty$ while other parameters remain fixed. There have been considerable generalizations in the asymptotic realm, such as [Att99, MF95, FM96, XB97, XB00]. In modern applications, the parameter dimension $d$ may be comparable to, or even larger than, the number of samples $n$.
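The tension between the two regimes can be seen by simply evaluating the leading term $\frac{d \log n}{2n}$ of Rissanen's benchmark (3). A minimal Python sketch (the function name is ours, for illustration only):

```python
import math

def rissanen_redundancy(d: int, n: int) -> float:
    """Leading term d*log2(n)/(2n) of Rissanen's redundancy benchmark (3)."""
    return d * math.log2(n) / (2 * n)

# Asymptotic regime: d fixed, n -> infinity, so the benchmark vanishes.
r_asym = rissanen_redundancy(10, 10**6)

# Large-alphabet regime: with d comparable to n the benchmark stays bounded
# away from zero, so (3)-(4) cannot be read non-asymptotically at face value.
d = 10**6
r_small = rissanen_redundancy(d, 2 * d)                        # n ~ d
r_large = rissanen_redundancy(d, 10 * d * int(math.log2(d)))   # n >> d log d
```

Here `r_small` is a constant of order $\log d$, while `r_large` is small, consistent with the $n \gg d \log d$ reading discussed next.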
For example, in the Google One Billion Word dataset (1BW) [CMS+13], the number of distinct words is more than 2 million, and the data distribution is also not i.i.d., which makes us wonder whether we are operating in the asymptotic regime when any universal code is applied. We emphasize that the implications of (4) may not be correct in the non-asymptotic setting (i.e., when the parameter dimension $d$ is comparable to the number of samples $n$). Indeed, interpreting (4) in the non-asymptotic way, it implies that at least $n \gg d \log d$ samples are required to achieve vanishing redundancy. However, when the data source is i.i.d. with alphabet size $d+1$, a precise non-asymptotic computation shows that the phase transition between vanishing and non-vanishing redundancy is at $n \asymp d$ [JHFHW17].

There exists extensive literature on quantifying the redundancy in the non-asymptotic regime. Davisson [Dav83] considered the case of memoryless sources and $m$-Markov sources, and obtained non-asymptotic upper and lower bounds (i.e., bounds that are explicit in all the parameters involved) on the average-case minimax redundancy, which is defined by

$$\inf_L \sup_{\theta \in \Theta} r_n(L, \theta), \quad (5)$$

where the infimum is taken over all uniquely decodable codes [CT12] (Section 5.1). However, the lower bound for Markov sources with alphabet size $k$ in [Dav83] is non-zero only when $n \gg k^2 \log k$ (see Appendix A), and the bounds are not tight in the sense that the upper and lower bounds do not match in scaling in the large-alphabet regime. The works [OS04, DS04, SW12] mainly considered a variant called the worst-case minimax redundancy, and showed that for i.i.d. sources with alphabet size $k$, the worst-case minimax redundancy² vanishes if and only if the number of samples satisfies $n \gg k$ non-asymptotically.
The problem of worst-case minimax redundancy for Markov sources was considered in [JS04]. The focus of this paper is on the average-case minimax redundancy for Markov chains. We refine the minimax redundancy in (5) and categorize different Markov chains by how fast they "mix". Informally, we ask the following question:

Question 1. How does the minimum number of samples required to achieve vanishing redundancy depend on the state space size and mixing time?

²Precisely, the minimax regret with respect to a coding oracle that only uses codes corresponding to i.i.d. distributions.

2 Preliminaries

Consider a first-order Markov chain $X_1, X_2, \ldots$ on a finite state space $\mathcal{X} = \{1, 2, \ldots, k\} \triangleq [k]$ with transition kernel $K$. We denote the entries of $K$ as $K_{ij}$, that is, $K_{ij} = P_{X_2|X_1}(j \mid i)$ for $i, j \in \mathcal{X}$. We say that a Markov chain is stationary if $P_{X_1}$, the distribution of $X_1$, satisfies

$$\sum_{i=1}^{k} P_{X_1}(i) K_{ij} = P_{X_1}(j) \quad \text{for all } j \in \mathcal{X}. \quad (6)$$

We say that a Markov chain is reversible if there exists a distribution $\pi$ on $\mathcal{X}$ which satisfies the detailed balance equations:

$$\pi_i K_{ij} = \pi_j K_{ji} \quad \text{for all } i, j \in \mathcal{X}. \quad (7)$$

In this case, $\pi$ is called the stationary distribution of the Markov chain. For a reversible Markov chain, the (left) spectrum of the operator $K$ consists of $k$ real eigenvalues $1 = \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_k \geq -1$. We define the spectral gap of a reversible Markov chain as

$$\gamma(K) = 1 - \lambda_2. \quad (8)$$

The absolute spectral gap of $K$ is defined as

$$\gamma^*(K) = 1 - \max_{i \geq 2} |\lambda_i|, \quad (9)$$

and it clearly follows that, for any reversible Markov chain,

$$\gamma(K) \geq \gamma^*(K). \quad (10)$$

The relaxation time of a Markov chain is defined as

$$\tau_{\mathrm{rel}}(K) = \frac{1}{\gamma^*(K)}. \quad (11)$$

The relaxation time of a reversible Markov chain (approximately) captures its mixing time, which informally is the smallest $n$ for which the marginal distribution of $X_n$ is very close to the Markov chain's stationary distribution. We refer to [MT06] for a survey. Intuitively speaking, the shorter the relaxation time $\tau_{\mathrm{rel}}$, the faster the Markov chain "mixes": that is, the shorter its "memory", or the sooner evolutions of the Markov chain from different starting states begin to look similar.

The multiplicative reversibilization of the transition matrix $K$ is defined as:

$$K^*_{ji} = \frac{\pi_i K_{ij}}{\pi_j} \quad (12)$$

$K^*$ is in fact the transition matrix of the reversed Markov chain $X_n \to X_{n-1} \to \cdots \to X_1$. Note that for reversible chains, $K^* = K$. The pseudo-spectral gap of a non-reversible chain (with transition matrix $K$) is defined as:

$$\gamma_{\mathrm{ps}}(K) = \max_{r \geq 1} \frac{\gamma((K^*)^r K^r)}{r} \quad (13)$$

The pseudo-spectral gap of a non-reversible chain is related to the mixing time of the non-reversible Markov chain.

We denote by $\mathcal{M}_1(k)$ the set of all discrete distributions with alphabet size $k$ (i.e., the $(k-1)$-probability simplex), and by $\mathcal{M}_2(k)$ the set of all Markov chain transition matrices on a state space of size $k$. Let $\mathcal{M}_{2,\mathrm{rev}}(k) \subset \mathcal{M}_2(k)$ be the set of transition matrices of all stationary reversible Markov chains on a state space of size $k$. We define a class of stationary Markov chains $\mathcal{M}_{2,\mathrm{rev}}(k, \tau_{\mathrm{rel}}) \subset \mathcal{M}_{2,\mathrm{rev}}(k)$ as follows:

$$\mathcal{M}_{2,\mathrm{rev}}(k, \tau_{\mathrm{rel}}) = \{ K \in \mathcal{M}_{2,\mathrm{rev}}(k) : \tau_{\mathrm{rel}}(K) \leq \tau_{\mathrm{rel}} \}. \quad (14)$$

In other words, we consider stationary reversible Markov chains whose relaxation time is upper-bounded by $\tau_{\mathrm{rel}}$.

Another probabilistic representation of reversible Markov chains is via random walks on undirected graphs. Consider an undirected graph (without multi-edges) on $k$ vertices.
Let the weight on the undirected edge $\{i, j\}$ be denoted $w_{ij} \geq 0$. Due to the undirected nature of the graph, $w_{ij} = w_{ji} \geq 0$ for all $i, j \in [k]$. We also define $\rho_i$ and $\rho$ as:

$$\rho_i = \sum_{j=1}^{k} w_{ij}, \quad \forall i \in [k], \qquad \rho = \sum_{i,j} w_{ij}$$

Here, $\rho_i$ corresponds to the row sums of the weight matrix $W$, with entries $w_{ij}$. We can now consider a Markov chain corresponding to a random walk on this graph. The transition probabilities and the stationary distribution corresponding to a random walk on this weighted undirected graph are given by:

$$K_{ij} = \frac{w_{ij}}{\rho_i} \quad (15)$$

$$\pi_i = \frac{\rho_i}{\rho} \quad (16)$$

We can verify that the transition matrix $K$ corresponds to a reversible Markov chain (i.e., $K \in \mathcal{M}_{2,\mathrm{rev}}(k)$), since:

$$\pi_i K_{ij} = \frac{w_{ij}}{\rho} \quad (17)$$
$$= \frac{w_{ji}}{\rho} \quad (18)$$
$$= \pi_j K_{ji} \quad (19)$$

Conversely, we can view any reversible Markov chain $\hat{K} \in \mathcal{M}_{2,\mathrm{rev}}(k)$ with stationary distribution $\hat{\pi}$ as a random walk on an undirected graph with weights $\hat{w}_{ij}$:

$$\hat{w}_{ij} = \hat{\pi}_i \hat{K}_{ij} \quad (20)$$

The quantity of interest in this paper is

$$R_n(k, \tau_{\mathrm{rel}}) = \inf_L \sup_{K \in \mathcal{M}_{2,\mathrm{rev}}(k, \tau_{\mathrm{rel}})} r_n(L, \theta), \quad (21)$$

where the infimum is taken over all uniquely decodable codes [CT12] (Section 5.1), and the supremum is taken over all stationary reversible Markov chains whose relaxation time is upper-bounded by $\tau_{\mathrm{rel}}$. We define the quantity $n^*(k, \tau_{\mathrm{rel}}, \epsilon)$ as follows:

$$n^*(k, \tau_{\mathrm{rel}}, \epsilon) \triangleq \min \{ n : R_n(k, \tau_{\mathrm{rel}}) \leq \epsilon \}. \quad (22)$$

Notation

The quantity $h(X)$ denotes the differential entropy of a continuous random variable $X$ with density function $f_X(x)$, and is given by:

$$h(X) = \int f_X(x) \log_2 \frac{1}{f_X(x)} \, dx \quad (23)$$

We define the KL divergence $D(p_X \| q_X)$ between two discrete distributions $p_X(x)$ and $q_X(x)$ as:

$$D(p_X \| q_X) = \sum_x p_X(x) \log_2 \frac{p_X(x)}{q_X(x)} \quad (24)$$

Throughout the paper, we use the notation $o(\cdot)$ and $O(\cdot)$ to denote the asymptotic growth of a function.
Let $f(k)$ and $g(k)$ be non-negative functions. We say that $f(k) = O(g(k))$ if $f(k) \leq C g(k)$ for some $C > 0$ and all $k > C$. We say that $f(k) = o(g(k))$ if the asymptotic growth of $f(k)$ is strictly slower than that of $g(k)$, i.e.,

$$\lim_{k \to \infty} \frac{f(k)}{g(k)} = 0$$

3 Main Results

The main theorems of this paper are:

Theorem 1. For $\tau_{\mathrm{rel}} \geq 1 + \frac{2+c}{\sqrt{k}}$,

$$R_n(k, \tau_{\mathrm{rel}}) \geq \frac{k(k-1)}{4n} \log \frac{2(n-1)}{k(k-1)} + \frac{k(k-1)}{4n} \log \frac{e}{16\pi \left(1 + \frac{2+c}{\sqrt{k}}\right)} - \frac{\log k}{n} \quad (25)$$

for $k \geq k_c$, where $c > 0$ is a constant and $k_c$ is a constant depending only on $c$.

Theorem 1 is proved in Section 4.

Theorem 2. The average-case minimax redundancy $R_n(k)$ for Markov sources is defined as:

$$R_n(k) = \inf_L \sup_{\theta \in \mathcal{M}_2(k)} r_n(L, \theta)$$

Then the following upper bound holds:

$$R_n(k) \leq \frac{2k^2}{n} \log_2 \left( \frac{n}{k^2} + 1 \right) + \frac{k^2}{n} + \frac{\log_2 k + 3}{n}$$

Note that since $R_n(k) \geq R_n(k, \tau_{\mathrm{rel}})$, the upper bound in Theorem 2 is also valid for $R_n(k, \tau_{\mathrm{rel}})$, and the lower bound in Theorem 1 is also valid for $R_n(k)$. Theorem 2 is proved in Section 5. The following corollary is immediate.

Corollary 1. If $n \gg k^2$, then $R_n(k, \tau_{\mathrm{rel}}) \to 0$ uniformly over $\tau_{\mathrm{rel}}$. For $\tau_{\mathrm{rel}} \geq 1 + \frac{2+c}{\sqrt{k}}$, where $c > 0$ is a positive constant, there exists a constant $c_1$ such that if $n = c_1 k^2$, then $R_n(k, \tau_{\mathrm{rel}})$ is bounded away from zero as $k \to \infty$.

Analyzing $R_n(k, \tau_{\mathrm{rel}})$ over reversible Markov chains gives us a more refined understanding of the compression redundancy. From Theorem 2, we observe that for any Markov distribution, we can achieve $\epsilon$ redundancy (for any constant $\epsilon > 0$) using $n \propto k^2$ samples. On the other hand, Theorem 1 tells us that, even for the small family of fast-mixing reversible chains, in the worst case at least $\Omega(k^2)$ samples are necessary to attain $\epsilon$ redundancy.
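The two bounds above are explicit in all parameters and can simply be evaluated numerically. A small Python sketch (the function names, the choice $c = 1$, and the sample sizes are our illustrative choices, not values from the paper; logs are base 2, matching the paper's convention):

```python
import math

def lower_bound(k: int, n: int, c: float = 1.0) -> float:
    """Theorem 1 lower bound on R_n(k, tau_rel), for tau_rel >= 1 + (2+c)/sqrt(k)."""
    tau0 = 1 + (2 + c) / math.sqrt(k)
    kk = k * (k - 1)
    return (kk / (4 * n)) * math.log2(2 * (n - 1) / kk) \
         + (kk / (4 * n)) * math.log2(math.e / (16 * math.pi * tau0)) \
         - math.log2(k) / n

def upper_bound(k: int, n: int) -> float:
    """Theorem 2 upper bound on R_n(k)."""
    return (2 * k**2 / n) * math.log2(n / k**2 + 1) + k**2 / n \
         + (math.log2(k) + 3) / n

k = 1000
# n of order k^2: the lower bound is a positive constant (redundancy does
# not vanish), while for n >> k^2 the upper bound is already small.
assert lower_bound(k, 20 * k**2) > 0
assert upper_bound(k, 1000 * k**2) < 0.05
```

This mirrors Corollary 1: at $n = c_1 k^2$ the redundancy stays bounded away from zero, while for $n \gg k^2$ it provably vanishes.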
Figure 1 provides a pictorial illustration of $n^*(k, \tau_{\mathrm{rel}}, \epsilon)$ when $\epsilon$ is a small constant. The case $\tau_{\mathrm{rel}} = 1$ corresponds to i.i.d. distributions, and it follows from [JHFHW17] that $n^*(k, \tau_{\mathrm{rel}}, \epsilon) = \Theta(k)$ for small constant $\epsilon$. Interestingly, when the Markov chain becomes slightly "non-i.i.d.", the required sample size immediately jumps to $\Theta(k^2)$ and remains there no matter how large $\tau_{\mathrm{rel}}$ is. Similar phenomena exist in the literature on entropy rate estimation for Markov chains [HJL+18], where the phase transition for consistent entropy rate estimation happens at $\frac{k}{\log k}$ for i.i.d. data, and at $\frac{k^2}{\log k}$ when the relaxation time is above $1 + \Omega\left(\frac{\log^2 k}{\sqrt{k}}\right)$. In other words, even if we use the codelength of the "best" universal code to estimate the entropy rate of the Markov source, it still requires considerably more samples than the information-theoretically optimal entropy rate estimator, which does not go through the construction of a code.

Figure 1: The figure plots $n^*(k, \tau_{\mathrm{rel}}, \epsilon)$ for a fixed small enough $\epsilon > 0$ against the relaxation-time constraint $\tau_{\mathrm{rel}} = (\gamma^*)^{-1}$. Note that $\tau_{\mathrm{rel}} = 1$ corresponds to i.i.d. data, and hence $n^*(k, 1, \epsilon) = \Theta(k)$ [JHFHW17]; for $\tau_{\mathrm{rel}} \geq 1 + \frac{2+c}{\sqrt{k}}$, the value jumps to $\Theta(k^2)$.

4 Theorem 1 Proof

Roadmap

We first conduct the continuous approximation of the redundancy, which is given by the following lemma.

Lemma 1. For any uniquely decodable code $L$, we have

$$r_n(L, \theta) \geq \frac{1}{n} D(p_\theta(x^n) \| q_L(x^n)), \quad (26)$$

where

$$q_L(x^n) = \frac{2^{-L(x^n)}}{\sum_{x^n} 2^{-L(x^n)}}.$$

Proof.
Consider the redundancy $r_n(L, \theta)$:

$$r_n(L, \theta) = \frac{1}{n} \left[ \mathbb{E}(L(X^n) \mid \theta) - H_\theta(X^n) \right] \quad (27)$$

$$= \frac{1}{n} \left[ \sum_{x^n} p_\theta(x^n) L(x^n) - \sum_{x^n} p_\theta(x^n) \log \frac{1}{p_\theta(x^n)} \right] \quad (28)$$

$$= \frac{1}{n} \left[ \sum_{x^n} p_\theta(x^n) \log \frac{p_\theta(x^n) \sum_{x^n} 2^{-L(x^n)}}{2^{-L(x^n)}} + \log \frac{1}{\sum_{x^n} 2^{-L(x^n)}} \right] \quad (29)$$

As $L$ is a uniquely decodable code [CT12] (Section 5.1), we can now use the Kraft inequality [CT12] (Theorem 5.5.1) on the lengths $L(x^n)$ to obtain:

$$r_n(L, \theta) \geq \frac{1}{n} \left[ \sum_{x^n} p_\theta(x^n) \log \frac{p_\theta(x^n) \sum_{x^n} 2^{-L(x^n)}}{2^{-L(x^n)}} \right] \quad (30)$$

$$= \frac{1}{n} \left[ \sum_{x^n} p_\theta(x^n) \log \frac{p_\theta(x^n)}{q_L(x^n)} \right] \quad (31)$$

$$= \frac{1}{n} D(p_\theta(x^n) \| q_L(x^n)). \quad (32)$$

We then use the strategy of lower-bounding the minimax risk by the Bayes risk, which is given by the following lemma.

Lemma 2. For any prior distribution $\Phi(\theta)$ supported on the parameter space $\mathcal{M}_{2,\mathrm{rev}}(k, \tau_{\mathrm{rel}})$, we have

$$R_n(k, \tau_{\mathrm{rel}}) \geq \frac{1}{n} I(\theta; X^n). \quad (33)$$

Proof. Let $p(x^n) = \int_\theta \Phi(\theta) p_\theta(x^n) \, d\theta$. Then:

$$R_n(k, \tau_{\mathrm{rel}}) = \inf_L \sup_{K \in \mathcal{M}_{2,\mathrm{rev}}(k, \tau_{\mathrm{rel}})} r_n(L, \theta) \quad (34)$$

$$\geq \inf_L \int_\theta \Phi(\theta) r_n(L, \theta) \, d\theta \quad (35)$$

Equation (35) holds because the average is always lower than the supremum. We next use Lemma 1 to obtain:

$$R_n(k, \tau_{\mathrm{rel}}) \geq \inf_{q_L(x^n)} \frac{1}{n} \int_\theta \Phi(\theta) D(p_\theta(x^n) \| q_L(x^n)) \, d\theta$$
$$= \inf_{q_L(x^n)} \frac{1}{n} \left[ \int_\theta \Phi(\theta) \sum_{x^n} p_\theta(x^n) \log \frac{p_\theta(x^n)}{q_L(x^n)} \, d\theta \right]$$
$$= \inf_{q_L(x^n)} \frac{1}{n} \left[ \int_\theta \Phi(\theta) \sum_{x^n} p_\theta(x^n) \log \left( \frac{p_\theta(x^n)}{p(x^n)} \cdot \frac{p(x^n)}{q_L(x^n)} \right) d\theta \right]$$
$$= \inf_{q_L(x^n)} \frac{1}{n} \left[ \int_\theta \Phi(\theta) D(p_\theta(x^n) \| p(x^n)) \, d\theta + D(p(x^n) \| q_L(x^n)) \right]$$

Finally, the non-negativity of the KL divergence completes the proof:

$$R_n(k, \tau_{\mathrm{rel}}) \geq \frac{1}{n} \int_\theta \Phi(\theta) D(p_\theta(x^n) \| p(x^n)) \, d\theta = \frac{1}{n} I(\theta; X^n)$$

Lemma 2 implies that, for any prior distribution $\Phi(\theta)$:

$$R_n(k, \tau_{\mathrm{rel}}) \geq \frac{1}{n} I(\theta; X^n) \quad (36)$$
$$= \frac{1}{n} \left[ h(\theta) - h(\theta \mid X^n) \right]. \quad (37)$$

In order to obtain a tight lower bound on $R_n(k, \tau_{\mathrm{rel}})$, it thus suffices to choose a prior on $\theta$ such that $h(\theta)$ is as large as possible, while $h(\theta \mid X^n)$, which quantifies how well we can estimate $\theta$ based on $X^n$, is as small as possible.

The transition matrix has about $k^2$ degrees of freedom. In order to prove the lower bound corresponding to $n^*(k, \tau_{\mathrm{rel}}, \epsilon) \approx k^2$, we need nearly $k^2$ degrees of freedom in the prior construction, but we would also like the Markov chain to mix fast under this prior. In other words, we want the Markov chain to be similar to the memoryless scenario. This naturally motivates a prior construction using random matrix theory. Indeed, if the transition matrix can be viewed as a combination of the rank-one matrix corresponding to the stationary distribution and a "noise" matrix with nearly i.i.d. entries, it would be expected from random matrix theory that the second-largest eigenvalue is close to zero as the matrix size increases. However, the technical difficulty lies in constructing a prior which is completely supported on $\mathcal{M}_{2,\mathrm{rev}}(k, \tau_{\mathrm{rel}})$ with desirable spectral properties, while also ensuring that the prior has large enough differential entropy. The concrete construction is below.

4.1 Prior Construction

Let $\tilde{\mathcal{M}}_{2,\mathrm{rev}}(k) \subset \mathcal{M}_{2,\mathrm{rev}}(k)$ be the space of Markov distributions with the following property:

$$\tilde{\mathcal{M}}_{2,\mathrm{rev}}(k) = \{ K \in \mathcal{M}_{2,\mathrm{rev}}(k) : K_{ii} = 0, \ \forall i \in [k] \} \quad (38)$$

The space $\tilde{\mathcal{M}}_{2,\mathrm{rev}}(k)$ corresponds to transition matrices of random walks over undirected graphs that do not have self-loops. We also define a class of stationary Markov chains $\tilde{\mathcal{M}}_{2,\mathrm{rev}}(k, \tau_{\mathrm{rel}}) \subset \tilde{\mathcal{M}}_{2,\mathrm{rev}}(k)$ as follows:

$$\tilde{\mathcal{M}}_{2,\mathrm{rev}}(k, \tau_{\mathrm{rel}}) = \{ K \in \tilde{\mathcal{M}}_{2,\mathrm{rev}}(k) : \tau_{\mathrm{rel}}(K) \leq \tau_{\mathrm{rel}} \}. \quad (39)$$
In other words, we consider stationary reversible Markov chains in $\tilde{\mathcal{M}}_{2,\mathrm{rev}}(k)$ whose relaxation time is upper-bounded by $\tau_{\mathrm{rel}}$.

Definition 1. Let $\pi(i, j) = \pi_i K_{ij}$ denote the stationary distribution over the tuples $(X_1, X_2)$. Then we can consider a parametrization $\theta$ for $\tilde{\mathcal{M}}_{2,\mathrm{rev}}(k)$ as:

$$\theta = (2\pi(1,2), \ldots, 2\pi(1,k), 2\pi(2,3), \ldots, 2\pi(k-1,k)) \equiv (\theta_{1,2}, \theta_{1,3}, \ldots, \theta_{1,k}, \theta_{2,3}, \ldots, \theta_{2,k}, \ldots, \theta_{k-1,k})$$

The scaling by a factor of 2 (e.g., $\theta_{1,2} = 2\pi(1,2)$) ensures that the parameters sum to 1. Extend the parameters symmetrically as

$$\tilde{\theta}_{i,j} = \tilde{\theta}_{j,i} = \theta_{i,j} \text{ for } i < j, \qquad \tilde{\theta}_{i,i} = 0. \quad (41)$$

Then the transition matrix $K$ can be obtained as:

$$K_{ij} = \frac{\tilde{\theta}_{i,j}}{\sum_{j'} \tilde{\theta}_{i,j'}} \quad (42)$$

We also define priors $\tilde{\Phi}_u(\theta)$ and $\tilde{\Phi}_u(\theta; \tau_{\mathrm{rel}})$ as the uniform distributions on the spaces $\tilde{\mathcal{M}}_{2,\mathrm{rev}}(k)$ and $\tilde{\mathcal{M}}_{2,\mathrm{rev}}(k, \tau_{\mathrm{rel}})$, respectively, under the parametrization $\theta$. We can obtain the distribution $\tilde{\Phi}_u(\theta)$ over the space $\tilde{\mathcal{M}}_{2,\mathrm{rev}}(k)$ by considering transition matrices corresponding to random walks over undirected graphs with random weights.

Lemma 3. Consider a simple complete graph (complete graph without self-loops) on $k$ vertices, with random weights $w_{ij}$ distributed i.i.d. as $w_{ij} = w_{ji} \sim \mathrm{Exp}(1)$ ($w_{ii} = 0$). Then the corresponding transition matrix $K$ is distributed as $\tilde{\Phi}_u(\theta)$, i.e., uniformly over the space $\tilde{\mathcal{M}}_{2,\mathrm{rev}}(k)$.

Proof. We recall a well-known property [FKG10] (Section 2.3) of exponential distributions: let $U = \{u_1, u_2, \ldots, u_r\}$ be such that every $u_i \sim \mathrm{Exp}(\lambda)$ i.i.d. Then, for $u = \sum_i u_i$ and $v_i = u_i / u$, the vector $V = \{v_1, v_2, \ldots, v_r\}$ is uniformly distributed over the probability simplex $\sum_{i=1}^r v_i = 1$. Lemma 3 is a special case, and can be proved by considering:

$$U = \{2w_{12}, 2w_{13}, \ldots, 2w_{1k}, 2w_{23}, \ldots, 2w_{2k}, \ldots, 2w_{k-1,k}\}$$
$$V = \theta \equiv \{\theta_{1,2}, \theta_{1,3}, \ldots, \theta_{1,k}, \theta_{2,3}, \ldots, \theta_{2,k}, \ldots, \theta_{k-1,k}\}$$

Then $V = \theta$ is uniformly distributed over $\tilde{\mathcal{M}}_{2,\mathrm{rev}}(k)$, the probability simplex of dimension $\frac{k(k-1)}{2}$.

Lemma 3 provides a nice gateway to tools from random matrix theory. We will first understand some properties of the weight matrix $W$, consisting of random weights $w_{ij} = w_{ji} \sim \mathrm{Exp}(1)$ for all $i \neq j$ and $w_{ii} = 0$ for all $i \in [k]$. Recall the following definition of a Wigner matrix [Tao]:

Definition 2. We say a random symmetric matrix $A$ is a Wigner matrix if the upper-triangular entries $A_{ij}$, $i < j$, are distributed i.i.d. with zero mean and unit variance, while the diagonal entries $A_{ii}$ are i.i.d. real variables with bounded mean and variance, distributed independently of the upper-triangular entries.

Consider the matrix $\hat{W}$, where $\hat{w}_{ij} = w_{ij} - 1$. $\hat{W}$ is a symmetric random matrix whose off-diagonal entries are i.i.d. with zero mean, while the diagonal entries are constants. This implies that the matrix $\hat{W}$ is a Wigner random matrix. Then the strong Bai–Yin theorem, upper bound (Theorem 2.3.24, Exercise 2.3.15 of [Tao]), implies that the eigenvalues of $\hat{W}$ are bounded as:

$$|\lambda_i(\hat{W})| \leq 2\sqrt{k} + o(\sqrt{k}) \quad \text{a.s.}, \ \forall i \in [k] \quad (43)$$

(here, by a.s. we mean that the sequence of events is true infinitely often as $k \to \infty$)

We can also bound the row sums $\rho_i$ of the matrix $W$:

Lemma 4. The following properties are true for the weight matrix $W$:

$$\max_{1 \leq i \leq k} \left| \frac{\rho_i}{k} - 1 \right| = o(1) \quad \text{a.s.} \quad (44)$$

$$\sum_{1 \leq i \leq k} \left( \frac{\rho_i}{k} - 1 \right)^2 = O(1) \quad \text{a.s.} \quad (45)$$

Proof. Let $V$ be a matrix such that $v_{ij} = w_{ij}$ for all $i \neq j$ and $v_{ii} \sim \mathrm{Exp}(1)$ for all $i$. Then, using Lemma 2.3 of [BCC10], we get:

$$\max_{1 \leq i \leq k} \left| \frac{\rho_i}{k} - 1 \right| \leq o(1) + \frac{1}{k} \max_{1 \leq i \leq k} v_{ii} \quad \text{a.s.} \quad (46)$$
Let $A_{k,\epsilon}$ be the event

$$A_{k,\epsilon} = \left\{ \frac{1}{k} \max_{1 \leq i \leq k} v_{ii} \leq \epsilon \right\} \quad (47)$$

As the $v_{ii}$ are independent exponential random variables, we obtain:

$$P(A_{k,\epsilon}^c) = 1 - P\left( \frac{1}{k} \max_{1 \leq i \leq k} v_{ii} \leq \epsilon \right) = 1 - \prod_{i=1}^{k} P(v_{ii} \leq k\epsilon) = 1 - (1 - e^{-k\epsilon})^k \leq k e^{-k\epsilon}$$

We can now bound the sum of the probabilities of these events:

$$\sum_{k=1}^{\infty} P(A_{k,\epsilon}^c) \leq \sum_{k=1}^{\infty} k e^{-k\epsilon} < \infty$$

Applying the Borel–Cantelli lemma then proves equation (44). From equation (43), we know that the largest eigenvalue of $\hat{W}$ is bounded as:

$$|\lambda_1(\hat{W})| \leq 2\sqrt{k} + o(\sqrt{k}) \quad \text{a.s.} \quad (48)$$

Thus:

$$\sum_{1 \leq i \leq k} \left( \frac{\rho_i}{k} - 1 \right)^2 = \frac{\langle \hat{W}\mathbf{1}, \hat{W}\mathbf{1} \rangle}{k^2} \leq \frac{\lambda_1(\hat{W})^2}{k} \leq \frac{4k}{k} + o(1) = O(1) \quad \text{a.s.}$$

This proves equation (45).

Lemma 4 essentially says that every $\rho_i$ is close to $k$, i.e.:

$$\rho_i \leq k(1 + \delta), \quad \text{where } \delta = o(1) \quad (49)$$

Equation (49) then implies that the entries of the matrix $K$ are proportional to those of the matrix $W$:

$$K_{ij} = \frac{w_{ij}}{\rho_i} = \frac{w_{ij}}{k}(1 + o(1)) \quad (50)$$

This suggests that the eigenvalues of the matrix $K$ are close to those of $W/k$.

Lemma 5. Let $W$ be a random weight matrix on $k$ vertices, with random weights $w_{ij}$ distributed i.i.d. as $w_{ij} = w_{ji} \sim \mathrm{Exp}(1)$ ($w_{ii} = 0$). Then the corresponding transition matrix $K$ has the following spectral properties:

$$\lambda_1(K) = 1 \quad (51)$$
$$\max_{2 \leq i \leq k} |\lambda_i(K)| \leq \frac{2 + c}{\sqrt{k}} \quad \text{a.s.} \quad (52)$$

for some constant $c > 0$ (here, by "a.s." we mean that the sequence of events is true infinitely often as $k \to \infty$).

The proof of Lemma 5 follows from Lemma 4, and is given in Appendix B. The following corollary is immediate.

Corollary 2. Let $\theta \in \tilde{\mathcal{M}}_{2,\mathrm{rev}}(k)$ be distributed according to the prior $\tilde{\Phi}_u(\theta)$. Also, let $\tau_{\mathrm{rel}}^0 = 1 + \frac{2+c}{\sqrt{k}}$, where $c > 0$ is a positive constant. Then

$$P(\theta \in \tilde{\mathcal{M}}_{2,\mathrm{rev}}(k, \tau_{\mathrm{rel}}^0)) \to 1 \quad (53)$$

as $k \to \infty$. From now on, we denote $\tau_{\mathrm{rel}}^0 = 1 + \frac{2+c}{\sqrt{k}}$.
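The construction of Lemma 3 and the spectral behavior in Lemma 5 are easy to check by simulation. A small numerical sketch using numpy (the state-space size, random seed, and the slack factor in the eigenvalue check are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 300
# Symmetric Exp(1) weights on the complete graph, no self-loops (Lemma 3).
upper = np.triu(rng.exponential(1.0, size=(k, k)), 1)
W = upper + upper.T                   # w_ij = w_ji, w_ii = 0

rho = W.sum(axis=1)                   # row sums rho_i
K = W / rho[:, None]                  # K_ij = w_ij / rho_i   (15)
pi = rho / rho.sum()                  # pi_i = rho_i / rho    (16)

# Detailed balance pi_i K_ij = pi_j K_ji holds exactly, so K is reversible.
flow = pi[:, None] * K
assert np.allclose(flow, flow.T)

# Spectrum: lambda_1 = 1, and the remaining eigenvalues are small,
# on the order of (2 + c)/sqrt(k) as in Lemma 5.
eig = np.sort(np.linalg.eigvals(K).real)[::-1]
assert np.isclose(eig[0], 1.0)
assert np.max(np.abs(eig[1:])) < 4 / np.sqrt(k)
```

The check illustrates why the prior is supported on fast-mixing chains with high probability (Corollary 2): the spectral gap of a typical draw is already close to 1 at moderate $k$.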
We next analyze $h(\theta)$ and $h(\theta \mid X^n)$ under the prior $\tilde{\Phi}_u(\theta; \tau_{\mathrm{rel}}^0)$.

Lemma 6. Let $\theta \sim \tilde{\Phi}_u(\theta; \tau_{\mathrm{rel}}^0)$ and $\tau_{\mathrm{rel}}^0 = 1 + \frac{2+c}{\sqrt{k}}$. Then the differential entropy $h(\theta)$ is lower-bounded as

$$h(\theta) \geq \frac{k(k-1)}{2} \log \frac{2}{k(k-1)} + \frac{k(k-1)}{2} \log e - \log k$$

for $k \geq k_c$, where $k_c$ only depends on $c > 0$.

Proof. Corollary 2 implies that there exists some $k_c$ such that for $k \geq k_c$,

$$\mathrm{Vol}(\tilde{\mathcal{M}}_{2,\mathrm{rev}}(k, \tau_{\mathrm{rel}}^0)) \geq \frac{1}{2} \mathrm{Vol}(\tilde{\mathcal{M}}_{2,\mathrm{rev}}(k)) = \frac{1}{2} \cdot \frac{1}{\left[\frac{k(k-1)}{2}\right]!}$$

As the distribution $\tilde{\Phi}_u(\theta; \tau_{\mathrm{rel}}^0)$ is uniform, we know

$$h(\theta) = \log \mathrm{Vol}(\tilde{\mathcal{M}}_{2,\mathrm{rev}}(k, \tau_{\mathrm{rel}}^0)) \geq \log \mathrm{Vol}(\tilde{\mathcal{M}}_{2,\mathrm{rev}}(k)) - 1 \geq \frac{k(k-1)}{2} \log \frac{2}{k(k-1)} + \frac{k(k-1)}{2} \log e - \log k$$

where we used the Stirling approximation of the factorial to simplify the bound on the entropy.

The next step in the proof is to upper-bound the term $h(\theta \mid X^n)$, which quantifies how well we can estimate the parameter $\theta$ from $X^n$. Let $\hat{\theta} = \hat{\theta}(X^n)$ be a deterministic estimator of the parameter $\theta$. Then

$$h(\theta \mid X^n) = h(\theta - \hat{\theta} \mid X^n) \quad (54)$$
$$\leq h(\theta - \hat{\theta}) \quad (55)$$

Utilizing the fact that the Gaussian distribution maximizes differential entropy under a variance constraint, we have

$$h(\theta - \hat{\theta}) \leq \sum_{i,j} h(\theta_{i,j} - \hat{\theta}_{i,j}) \quad (56)$$
$$\leq \frac{1}{2} \sum_{i,j} \log \left[ 2\pi e \, \mathrm{Var}(\hat{\theta}_{i,j}) \right] \quad (57)$$
$$\leq \frac{1}{2} \sum_{i,j} \log \left[ 2\pi e \int_\theta \tilde{\Phi}_u(\theta; \tau_{\mathrm{rel}}^0) \, \mathrm{Var}(\hat{\theta}_{i,j} \mid \theta) \, d\theta \right] \quad (58)$$

Let $N(i, j) = \sum_{r=1}^{n-1} \mathbb{1}[(X_r, X_{r+1}) = (i, j)]$ denote the number of occurrences of the tuple $(i, j)$ in the $X^n$ sequence. Then a natural estimator for the parameter $\theta_{i,j} = 2\pi(i,j) = \pi(i,j) + \pi(j,i)$ is the empirical estimator $\hat{\theta}_{i,j}$:

$$\hat{\theta}_{i,j} = \frac{N(i,j) + N(j,i)}{n-1} \quad (59)$$

We prove the following bound on the variance of the empirical estimator $\hat{\theta}_{i,j}$.

Lemma 7. Let $\theta \sim \tilde{\Phi}_u(\theta; \tau_{\mathrm{rel}}^0)$. Then the variance of the estimator $\hat{\theta}_{i,j} = \frac{N(i,j)+N(j,i)}{n-1}$ can be bounded as:

$$\mathrm{Var}(\hat{\theta}_{i,j} \mid \theta) \leq \frac{8 \theta_{i,j} \tau_{\mathrm{rel}}^0}{n-1} \quad (60)$$

Proof. Let $X_1 \to X_2 \to \cdots \to X_n$ be a reversible Markov chain with transition matrix $K \in \tilde{\mathcal{M}}_{2,\mathrm{rev}}(k, \tau_{\mathrm{rel}}^0)$. Consider the Markov chain over the tuples $(X_1, X_2) \to (X_2, X_3) \to \cdots \to (X_{n-1}, X_n)$, and let $\tilde{K}$ be the corresponding transition matrix. It is interesting to note that, although the original Markov chain is reversible, the chain over the tuples is generally not reversible. Let

$$f_{i,j}(X_r, X_{r+1}) = \frac{\mathbb{1}[(X_r, X_{r+1}) = (i,j)] + \mathbb{1}[(X_r, X_{r+1}) = (j,i)]}{n-1}$$

so that the estimator $\hat{\theta}_{i,j}$ can be written as:

$$\hat{\theta}_{i,j} = \sum_{r=1}^{n-1} f_{i,j}(X_r, X_{r+1})$$

We can now apply Theorem 3.7 of [P+15] to the function $f_{i,j}$ and the Markov chain over the tuples $(X_1, X_2) \to (X_2, X_3) \to \cdots \to (X_{n-1}, X_n)$, to obtain:

$$\mathrm{Var}(\hat{\theta}_{i,j} \mid \theta) \leq \frac{4 \theta_{i,j}}{\gamma_{\mathrm{ps}}(\tilde{K})(n-1)}$$

We next use Lemma 8 (proved in Appendix C) to bound $\gamma_{\mathrm{ps}}(\tilde{K})$ in terms of $\tau_{\mathrm{rel}}(K)$:

Lemma 8. Let $X_1 \to X_2 \to \cdots \to X_n$ be a reversible Markov chain with transition matrix $K$. Then the Markov chain over the tuples, $(X_1, X_2) \to (X_2, X_3) \to \cdots \to (X_{n-1}, X_n)$, has pseudo-spectral gap $\gamma_{\mathrm{ps}}(\tilde{K})$ satisfying:

$$\gamma_{\mathrm{ps}}(\tilde{K}) \geq \frac{\gamma^*(K)}{2} \quad (61)$$

Using Lemma 8, we obtain the variance bound:

$$\mathrm{Var}(\hat{\theta}_{i,j} \mid \theta) \leq \frac{8 \theta_{i,j}}{\gamma^*(K)(n-1)} = \frac{8 \theta_{i,j} \tau_{\mathrm{rel}}(K)}{n-1} \leq \frac{8 \theta_{i,j} \tau_{\mathrm{rel}}^0}{n-1}$$

which proves the lemma.

We next use the variance bound on the estimator to obtain an upper bound on $h(\theta \mid X^n)$.

Lemma 9. Let $\theta \sim \tilde{\Phi}_u(\theta; \tau_{\mathrm{rel}}^0)$. Then the conditional differential entropy $h(\theta \mid X^n)$ is upper-bounded by:

$$h(\theta \mid X^n) \leq \frac{k(k-1)}{4} \log \frac{16\pi e \tau_{\mathrm{rel}}^0}{n-1} + \frac{k(k-1)}{4} \log \frac{2}{k(k-1)} \quad (62)$$

Proof.
From equation (58) and Lemma 7, we obtain:

$$h(\theta \mid X^n) \leq \frac{1}{2} \sum_{i,j} \log \left[ 2\pi e \int_\theta \tilde{\Phi}_u(\theta; \tau_{\mathrm{rel}}^0) \, \mathrm{Var}(\hat{\theta}_{i,j} \mid \theta) \, d\theta \right] \quad (63)$$
$$\leq \frac{1}{2} \sum_{i,j} \log \left[ 2\pi e \int_\theta \tilde{\Phi}_u(\theta; \tau_{\mathrm{rel}}^0) \frac{8 \theta_{i,j} \tau_{\mathrm{rel}}^0}{n-1} \, d\theta \right] \quad (64)$$
$$= \frac{k(k-1)}{4} \log \frac{16\pi e \tau_{\mathrm{rel}}^0}{n-1} + \frac{1}{2} \sum_{i,j} \log \int_\theta \tilde{\Phi}_u(\theta; \tau_{\mathrm{rel}}^0) \, \theta_{i,j} \, d\theta \quad (65)$$
$$\leq \frac{k(k-1)}{4} \log \frac{16\pi e \tau_{\mathrm{rel}}^0}{n-1} \quad (66)$$
$$\quad + \frac{k(k-1)}{4} \log \left[ \sum_{i,j} \frac{2}{k(k-1)} \int_\theta \tilde{\Phi}_u(\theta; \tau_{\mathrm{rel}}^0) \, \theta_{i,j} \, d\theta \right] \quad (67)$$
$$= \frac{k(k-1)}{4} \log \frac{16\pi e \tau_{\mathrm{rel}}^0}{n-1} + \frac{k(k-1)}{4} \log \frac{2}{k(k-1)} \quad (68)$$

The step to (67) uses the concavity of the logarithm (Jensen's inequality), and equation (68) holds because $\sum_{i,j} \theta_{i,j} = 1$. This proves the upper bound on the term $h(\theta \mid X^n)$.

4.2 Proof of Theorem 1

Using Lemma 2 for the prior $\tilde{\Phi}_u(\theta; \tau_{\mathrm{rel}}^0)$, together with Lemmas 6 and 9, we have

$$R_n(k, \tau_{\mathrm{rel}}) \geq \frac{1}{n} I(\theta; X^n) \quad (69)$$
$$= \frac{1}{n} \left[ h(\theta) - h(\theta \mid X^n) \right] \quad (70)$$
$$\geq \frac{k(k-1)}{4n} \log \frac{2(n-1)}{k(k-1)} + \frac{k(k-1)}{4n} \log \frac{e}{16\pi \tau_{\mathrm{rel}}^0} - \frac{\log k}{n} \quad (71)$$

for $k \geq k_c$, where $\tau_{\mathrm{rel}}^0 = 1 + \frac{2+c}{\sqrt{k}}$. This completes the proof of the lower bound.

5 Theorem 2 Proof

For any sequence $x^n$ over the alphabet $\mathcal{X} = [k]$, let $N(a)$ and $N(a, b)$ be defined as:

$$N(a) = \sum_{i=1}^{n-1} \mathbb{1}[x_i = a] \quad (72)$$
$$N(a, b) = \sum_{i=1}^{n-1} \mathbb{1}[(x_i, x_{i+1}) = (a, b)] \quad (73)$$

Before we prove the theorem, we consider some simple lemmas.

Lemma 10. There exists a prefix code [CT12] (Section 5.1) on the non-negative integers $\mathbb{N} \cup \{0\} = \{0, 1, 2, \ldots\}$ such that every integer $m$ has a codeword of length $l_m \leq 2\log_2(m+1) + 1$.

Proof. Let $q$ be such that $2^q \leq m+1 < 2^{q+1}$. Thus, $m+1$ can be written as

$$m + 1 = 2^q + r \quad (74)$$

where $0 \leq r < 2^q$. Let $U_q = 00\ldots0$ be a unary code consisting of $q$ zeros, and let $B_r$ be the binary representation of $r$ using $q$ bits. Then the following code $C_m$ is prefix-free:

$$C_m = U_q \, 1 \, B_r \quad (75)$$

Thus, the length of the code $l(C_m)$ satisfies:

$$l(C_m) = 2q + 1 \quad (76)$$
$$\leq 2\log_2(m+1) + 1 \quad (77)$$

This completes the proof.

Lemma 11. We can store the parameters $N(a)$, $N(a, b)$ for all $a, b \in [k]$ of a sequence $x^n$ using $L_{\mathrm{param}}$ bits, which is upper-bounded as:

$$L_{\mathrm{param}}(x^n) \leq 2k^2 \log_2 \left( \frac{n}{k^2} + 1 \right) + k^2 \quad (78)$$

Proof. Note that we only need to store $N(a, b)$ for all $a, b \in [k]$, as the parameters $N(a)$ can be derived from them. Using the prefix code from Lemma 10 for the parameters $N(a, b)$:

$$L_{\mathrm{param}}(x^n) \leq \sum_{a,b} \left[ 2\log_2(N(a,b) + 1) + 1 \right] \quad (79)$$
$$= 2k^2 \sum_{a,b} \frac{1}{k^2} \log_2(N(a,b) + 1) + k^2 \quad (80)$$
$$\leq 2k^2 \log_2 \left[ \frac{1}{k^2} \sum_{a,b} (N(a,b) + 1) \right] + k^2 \quad (81)$$
$$= 2k^2 \log_2 \left( \frac{n}{k^2} + 1 \right) + k^2 \quad (82)$$

Equation (81) is true due to the concavity of the log function and Jensen's inequality.

Lemma 12. We can use arithmetic coding [WNC87] to encode a sequence $x^n$ using $L_{\mathrm{seq}}(x^n)$ bits, which is bounded as:

$$L_{\mathrm{seq}}(x^n) \leq \log_2 k + (n-1) H_1(x^n) + 3 \quad (83)$$

where $H_1(x^n)$ is the first-order empirical entropy of the sequence $x^n$:

$$H_1(x^n) = \sum_{a=1}^{k} \sum_{b=1}^{k} \frac{N(a,b)}{n-1} \log_2 \frac{N(a)}{N(a,b)} \quad (84)$$

Proof. We first encode the symbol $x_1$ using a fixed $\lceil \log_2 k \rceil$ bits. Next, we encode the remaining $n-1$ symbols using arithmetic coding [WST95] (Section IV) with the first-order model distribution $q(b \mid a) = N(a,b)/N(a)$. Using Theorem 1 of [WST95], the codelength of $x^n$ satisfies:

$$L_{\mathrm{seq}}(x^n) \leq \lceil \log_2 k \rceil + \sum_{i=1}^{n-1} \log_2 \frac{1}{q(x_{i+1} \mid x_i)} + 2 \quad (85)$$
$$= \log_2 k + 1 + \sum_{a=1}^{k} \sum_{b=1}^{k} N(a,b) \log_2 \frac{N(a)}{N(a,b)} + 2 \quad (86)$$
$$= \log_2 k + (n-1) H_1(x^n) + 3 \quad (87)$$

This completes the proof.

Let $x^n$ be a given sequence over the alphabet $\mathcal{X} = [k]$. Consider the following compressor:

1. Store all the parameters $N(a, b)$, $\forall a, b \in [k]$, using the universal prefix-free code of Lemma 10.
2. Use the parameters $N(a, b)$ to compress $x^n$ using first-order Markov arithmetic coding as in Lemma 12.
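The codelength accounting of the two-part compressor above can be sketched in a few lines of Python. This is our illustrative tally of the bounds in Lemmas 10–12 only (no actual arithmetic coder is implemented, and $\lceil \log_2 k \rceil$ is simplified to $\log_2 k$):

```python
import math
from collections import Counter

def prefix_len(m: int) -> int:
    """Length of the Lemma-10 prefix code for integer m >= 0: 2q + 1 bits,
    where 2^q <= m + 1 < 2^(q+1); always <= 2*log2(m+1) + 1."""
    q = (m + 1).bit_length() - 1
    return 2 * q + 1

def codelength(x, k: int) -> float:
    """Two-part codelength (bits): store all pair counts N(a,b) with the
    prefix code, then charge the arithmetic-coding bound of Lemma 12,
    log2(k) + (n-1)*H1(x^n) + 3."""
    N2 = Counter(zip(x, x[1:]))    # N(a, b), equation (73)
    N1 = Counter(x[:-1])           # N(a),    equation (72)
    L_param = sum(prefix_len(N2.get((a, b), 0))
                  for a in range(k) for b in range(k))
    nH1 = sum(c * math.log2(N1[a] / c) for (a, _), c in N2.items())
    return L_param + math.log2(k) + nH1 + 3

# Periodic sequence over k = 4 symbols: its empirical entropy H1 is 0, so
# the codelength is dominated by the O(k^2 log n) parameter cost.
x = [0, 1, 2, 3] * 250
n, k = len(x), 4
L = codelength(x, k)
assert L <= 2 * k**2 * math.log2(n / k**2 + 1) + k**2 + math.log2(k) + 3
```

The final assertion is exactly the bound (88)-(89) derived next, specialized to a sequence with zero empirical entropy.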
Then the codelength $\hat{L}(x^n)$ is bounded as:

$$\hat{L}(x^n) = L_{\mathrm{param}}(x^n) + L_{\mathrm{seq}}(x^n) \tag{88}$$
$$\le \left[2k^2\log_2\left(\frac{n}{k^2}+1\right) + k^2\right] + \left[\log_2 k + (n-1)H_1(x^n) + 3\right] \tag{89}$$

We now take a look at the redundancy $R_n(k)$:

$$R_n(k) = \inf_L \sup_{\theta\in\mathcal{M}_2(k)} r_n(L,\theta) \tag{90}$$
$$\le \sup_{\theta\in\mathcal{M}_2(k)} \frac{1}{n}\left(\mathbb{E}_\theta[\hat{L}(X^n)] - H_\theta(X^n)\right) \tag{91}$$
$$= \sup_{\theta\in\mathcal{M}_2(k)} \frac{1}{n}\left(\mathbb{E}_\theta[\hat{L}(X^n)] - H_\theta(X_1) - (n-1)H_\theta(X_2\mid X_1)\right) \tag{92}$$
$$\le \sup_{\theta\in\mathcal{M}_2(k)} \frac{1}{n}\left(\mathbb{E}_\theta[\hat{L}(X^n)] - (n-1)H_\theta(X_2\mid X_1)\right) \tag{93}$$
$$\le \frac{2k^2}{n}\log_2\left(\frac{n}{k^2}+1\right) + \frac{k^2}{n} + \sup_{\theta\in\mathcal{M}_2(k)} \frac{n-1}{n}\left(\mathbb{E}_\theta[H_1(X^n)] - H_\theta(X_2\mid X_1)\right) + \frac{\log_2 k + 3}{n} \tag{94}$$
$$\le \frac{2k^2}{n}\log_2\left(\frac{n}{k^2}+1\right) + \frac{k^2}{n} + \frac{\log_2 k + 3}{n} \tag{95}$$

where equation (95) is true because of the concavity of entropy. This completes the proof of the upper bound.

References

[Att99] Kevin Atteson. The asymptotic redundancy of Bayes rules for Markov chains. IEEE Transactions on Information Theory, 45(6):2104-2109, 1999.

[BCC10] Charles Bordenave, Pietro Caputo, and Djalil Chafai. Spectrum of large random reversible Markov chains: two examples. ALEA: Latin American Journal of Probability and Mathematical Statistics, 7:41-64, 2010.

[CMS+13] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013.

[CT12] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.

[Dav83] L Davisson. Minimax noiseless universal coding for Markov sources. IEEE Transactions on Information Theory, 29(2):211-215, 1983.

[DS04] Michael Drmota and Wojciech Szpankowski. Precise minimax redundancy and regret. IEEE Transactions on Information Theory, 50(11):2686-2707, 2004.

[FKG10] Bela A Frigyik, Amol Kapila, and Maya R Gupta.
Introduction to the Dirichlet distribution and related processes. Department of Electrical Engineering, University of Washington, UWEE TR-2010-0006, 2010.

[FM96] Meir Feder and Neri Merhav. Hierarchical universal coding. IEEE Transactions on Information Theory, 42(5):1354-1364, 1996.

[HJL+18] Yanjun Han, Jiantao Jiao, Chuan-Zheng Lee, Tsachy Weissman, Yihong Wu, and Tiancheng Yu. Entropy rate estimation for Markov chains with large state space. arXiv preprint arXiv:1802.07889, 2018.

[JHFHW17] Jiantao Jiao, Yanjun Han, Irena Fischer-Hwang, and Tsachy Weissman. Estimating the fundamental limits is easier than achieving the fundamental limits. arXiv preprint arXiv:1707.01203, 2017.

[JS04] Philippe Jacquet and Wojciech Szpankowski. Markov types and minimax redundancy for Markov sources. IEEE Transactions on Information Theory, 50(7):1393-1402, 2004.

[MF95] Neri Merhav and Meir Feder. A strong version of the redundancy-capacity theorem of universal coding. IEEE Transactions on Information Theory, 41(3):714-722, 1995.

[MT06] Ravi Montenegro and Prasad Tetali. Mathematical aspects of mixing times in Markov chains. Foundations and Trends in Theoretical Computer Science, 1(3):237-354, 2006.

[OS04] Alon Orlitsky and Narayana P Santhanam. Speaking of infinity. IEEE Transactions on Information Theory, 50(10):2215-2230, 2004.

[P+15] Daniel Paulin et al. Concentration inequalities for Markov chains by Marton couplings and spectral methods. Electronic Journal of Probability, 20, 2015.

[Ris84] Jorma Rissanen. Universal coding, information, prediction, and estimation. IEEE Transactions on Information Theory, 30(4):629-636, 1984.

[SW12] Wojciech Szpankowski and Marcelo J Weinberger. Minimax pointwise redundancy for memoryless models over large alphabets.
IEEE Transactions on Information Theory, 58(7):4094-4104, 2012.

[Tao] Terence Tao. Topics in random matrix theory, volume 132.

[WNC87] Ian H Witten, Radford M Neal, and John G Cleary. Arithmetic coding for data compression. Communications of the ACM, 30(6):520-540, 1987.

[WST95] Frans MJ Willems, Yuri M Shtarkov, and Tjalling J Tjalkens. The context-tree weighting method: basic properties. IEEE Transactions on Information Theory, 41(3):653-664, 1995.

[XB97] Qun Xie and Andrew R Barron. Minimax redundancy for the class of memoryless sources. IEEE Transactions on Information Theory, 43(2):646-657, 1997.

[XB00] Qun Xie and Andrew R Barron. Asymptotic minimax regret for data compression, gambling, and prediction. IEEE Transactions on Information Theory, 46(2):431-445, 2000.

[ZL77] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337-343, 1977.

[ZL78] Jacob Ziv and Abraham Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5):530-536, 1978.

A Existing Minimax Redundancy Lower Bounds

We analyze the existing lower bound by [Dav83]:

$$R_n(k) \ge g(k,n) = \frac{k(k-1)}{2n}\log n + \frac{k(k-1)}{n}\log\frac{1}{k^4} - \frac{k(k-1)}{2n}\log\left[C\left(1 - \left(1 - \frac{1}{4k^4}\right)^{\frac{1}{2}}\right)\right] \tag{96}$$

We can simplify $g(k,n)$ to get the lower bound:

$$g(k,n) = \frac{k(k-1)}{2n}\log n + \frac{k(k-1)}{n}\log\frac{1}{k^4} - \frac{k(k-1)}{2n}\log\left[C\left(1 - \left(1 - \frac{1}{4k^4}\right)^{\frac{1}{2}}\right)\right] \tag{97}$$
$$= \frac{k(k-1)}{2n}\log\frac{n}{k^2} - \frac{5k(k-1)}{2n}\log k^2 - \frac{k(k-1)}{2n}\log C + o\left(\frac{k(k-1)}{n}\right) \tag{98}$$

Thus, the effective lower bound on $R_n(k)$ is:

$$R_n(k) \ge \frac{k(k-1)}{2n}\log\frac{n}{k^2} - \frac{5k(k-1)}{2n}\log k^2 - \frac{k(k-1)}{2n}\log C + o\left(\frac{k(k-1)}{n}\right) \tag{99}$$

We observe that the lower bound on redundancy $R_n(k)$ is non-zero only when $n \gg k^2 \log k$.
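To see concretely why this bound says nothing in the $n \asymp k^2$ regime, the two leading terms of the simplification can be evaluated numerically. This is a rough sketch of ours: the $-\log C$ and $o(\cdot)$ terms are deliberately omitted, since the constant $C$ from [Dav83] is not specified here.

```python
import math

def g_leading_terms(k, n):
    """Leading terms of the simplified Davisson bound:
    (k(k-1)/2n) log(n/k^2) - (5 k(k-1)/2n) log(k^2).
    The -log(C) and o(k(k-1)/n) terms are intentionally dropped."""
    coeff = k * (k - 1) / (2 * n)
    return coeff * math.log(n / k**2) - 5 * coeff * math.log(k**2)

# At n = k^2 the dominant term log(n/k^2) vanishes, so the leading
# part of the bound is negative: the bound is vacuous at n ~ k^2.
k = 100
print(g_leading_terms(k, k**2))
```

The bound only becomes positive once $\log(n/k^2)$ outgrows the negative correction terms, i.e., far beyond $n \asymp k^2$, which motivates the new lower bound of Theorem 1.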
We aim to improve the lower bound when $n \asymp k^2$.

B Proof of Lemma 5

To analyze the spectrum of the transition matrix $K$, we construct a symmetric matrix $S$ which has the same spectrum as $K$ almost surely.

Lemma 13. (Spectral equivalence) Almost surely, for large $k$, the spectrum of the transition matrix $K$ coincides with the spectrum of the symmetric matrix $S$ defined as:

$$S_{ij} = \sqrt{\frac{\rho_i}{\rho_j}}\,K_{ij} = \frac{w_{ij}}{\sqrt{\rho_i\rho_j}} \tag{100}$$

The lemma is proved in Lemma 2.1 of [BCC10]. We now use Lemma 4, which allows us to estimate $\rho_i = k(1+o(1))$, to compare the spectrum of the matrix $\sqrt{k}K$ with that of the matrix $W/\sqrt{k}$.

Lemma 14. (Bulk behavior) The ESD (empirical spectral density) of $\sqrt{k}K$ weakly converges to Wigner's semicircle law $\mathcal{W}_2$:

$$\mu_{\sqrt{k}K} \xrightarrow[k\to\infty]{w} \mathcal{W}_2 \tag{101}$$

where Wigner's semicircle law $\mathcal{W}_2$ is given by:

$$x \mapsto \frac{1}{2\pi}\sqrt{4 - x^2}\;\mathbb{1}_{[-2,2]}(x)$$

Proof. First of all, from Lemma 13, the spectrum of $S$ is equivalent to that of $K$ a.s. (for large $k$). Thus, it is sufficient to analyze the spectrum of $\sqrt{k}S$. To show the weak convergence, we bound the Levy distance between the cumulative distributions corresponding to the ESDs of the matrices $\sqrt{k}S$ and $W/\sqrt{k}$. Let $F^{\sqrt{k}S}$ and $F^{W/\sqrt{k}}$ be these cumulative distributions; then:

$$L^3\!\left(F^{\sqrt{k}S}, F^{W/\sqrt{k}}\right) \le \frac{1}{k}\,\mathrm{Tr}\!\left(\left(\sqrt{k}S - W/\sqrt{k}\right)^2\right) \tag{102}$$
$$= \frac{1}{k}\sum_{i,j}^{k}\frac{w_{ij}^2}{k}\left(\frac{k}{\sqrt{\rho_i\rho_j}} - 1\right)^2 \tag{103}$$
$$\le O(\delta^2)\left(\frac{1}{k^2}\sum_{i,j}^{k} w_{ij}^2\right) \tag{104}$$
$$\to 2\,O(\delta^2) \quad \text{a.s., as } k\to\infty \tag{105}$$

This proves the weak convergence of $\mu_{\sqrt{k}K}$ to the Wigner semicircle law $\mathcal{W}_2$. Note that even though $\lambda_1(\sqrt{k}K) = \sqrt{k} \to \infty$ as $k\to\infty$, the weak limit of $\mu_{\sqrt{k}K}$ is not affected, since $\lambda_1(\sqrt{k}K)$ has weight $1/k$. The theorem thus implies that the bulk of the spectrum $\sigma(K)$ collapses as $k^{-1/2}$, but it does not give a characterization of $\lambda_2(\sqrt{k}K)$, which is what is required.
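The spectral equivalence of Lemma 13 and the bulk scaling of Lemma 14 are easy to probe numerically. The following is a small sketch of ours, using i.i.d. Exponential(1) symmetric weights $w_{ij}$ in the spirit of the [BCC10] model (all variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 500

# Symmetric i.i.d. weights w_ij, row sums rho_i, transition matrix K_ij = w_ij / rho_i
U = rng.exponential(scale=1.0, size=(k, k))
W = np.triu(U) + np.triu(U, 1).T          # symmetric: w_ij = w_ji
rho = W.sum(axis=1)
K = W / rho[:, None]

# Lemma 13: S_ij = sqrt(rho_i / rho_j) K_ij = w_ij / sqrt(rho_i rho_j)
# is symmetric and shares its spectrum with K.
S = W / np.sqrt(np.outer(rho, rho))
eig_S = np.sort(np.linalg.eigvalsh(S))
eig_K = np.sort(np.linalg.eigvals(K).real)
assert np.allclose(eig_S, eig_K, atol=1e-6)
assert abs(eig_S[-1] - 1.0) < 1e-8        # Perron eigenvalue of a stochastic matrix

# Lemma 14: apart from the top eigenvalue, the spectrum of sqrt(k) K
# concentrates on the semicircle support [-2, 2] (up to edge fluctuations).
bulk = np.sqrt(k) * eig_S[:-1]
print(round(bulk.min(), 2), round(bulk.max(), 2))
```

At $k = 500$ the rescaled bulk already sits close to $[-2, 2]$, consistent with the $k^{-1/2}$ collapse of $\sigma(K)$ described above.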
To prove Lemma 5, we represent the symmetric matrix $S$ as a combination of a rank-one matrix $P$ corresponding to the stationary distribution, and a "noise" matrix $S - P$ with nearly i.i.d. entries. Bounding the spectral norm of the "noise" matrix $S - P$ gives us the result.

Proof. (Lemma 5) Since $K$ is almost surely irreducible for large enough $k$, the eigenspace of $S$ for the eigenvalue 1 is a.s. of dimension 1, and is the span of the vector $[\sqrt{\rho_1}, \sqrt{\rho_2}, \ldots, \sqrt{\rho_k}]$. Consider the symmetric matrix $P$:

$$P_{ij} = \frac{\sqrt{\rho_i\rho_j}}{\rho} \tag{106}$$

By removing $P$ from $S$, we are essentially removing the largest eigenvalue 1, without touching the other eigenvalues. Thus, the spectrum of the matrix $S - P$ is given by:

$$\{\lambda_2(S), \lambda_3(S), \ldots, \lambda_k(S)\} \cup \{0\} \tag{107}$$

To bound $\sqrt{k}\,\lambda_2(S)$, we now bound the spectral norm of the matrix $A = \sqrt{k}(S - P)$. Lemma 2.4 of [BCC10], along with Lemma 4, gives us:

$$\max_{1\le i\le k} \sqrt{k}\,|\lambda_i(S - P)| \le 2 + o(1) \quad \text{a.s.} \tag{108}$$

Equations (107) and (108), together with Lemma 13, imply that:

$$\max_{2\le i\le k} \sqrt{k}\,|\lambda_i(S)| \le 2 + o(1) \quad \text{a.s.} \tag{109}$$
$$\max_{2\le i\le k} \sqrt{k}\,|\lambda_i(K)| \le 2 + o(1) \quad \text{a.s.} \tag{110}$$
$$\max_{2\le i\le k} |\lambda_i(K)| \le \frac{2+c}{\sqrt{k}} \quad \text{a.s.} \tag{111}$$

for some constant $c \ge 0$. This completes the proof.

C Proof of Lemma 8

Consider the Markov chain over the tuples $(X_1, X_2) \to (X_2, X_3) \to \cdots \to (X_{n-1}, X_n)$ with transition matrix $\tilde{K}$. We first analyze some properties of the transition matrix $\tilde{K}$.

Lemma 15. Let $X_1 \to X_2 \to \cdots \to X_n$ be a reversible Markov chain with transition matrix $K$. Then the transition matrix $\tilde{K}$ for the chain $(X_1, X_2) \to (X_2, X_3) \to \cdots \to (X_{n-1}, X_n)$ and its matrix reversibilization $\tilde{K}^*$ are:

$$\tilde{K}((a,b),(c,d)) = \mathbb{1}[b = c]\,K(c,d)$$
$$\tilde{K}^*((a,b),(c,d)) = \mathbb{1}[a = d]\,K(d,c)$$

Proof. The transition matrix for the chain $(X_1, X_2) \to (X_2, X_3) \to \cdots \to (X_{n-1}, X_n)$ is given by:

$$\tilde{K}((a,b),(c,d)) = P(X_2 = c, X_3 = d \mid X_1 = a, X_2 = b) \tag{112}$$
$$= P(X_2 = c \mid X_1 = a, X_2 = b)\,P(X_3 = d \mid X_1 = a, X_2 = b, X_2 = c) \tag{113}$$
$$= \mathbb{1}[b = c]\,P(X_3 = d \mid X_2 = c) \tag{114}$$
$$= \mathbb{1}[b = c]\,K(c,d) \tag{115}$$

where equation (114) holds by the Markov property. As the Markov chain $(X_1, X_2) \to (X_2, X_3) \to \cdots \to (X_{n-1}, X_n)$ is in general non-reversible, the multiplicative reversibilization of $\tilde{K}$ is:

$$\tilde{K}^*((c,d),(a,b)) = \tilde{K}((a,b),(c,d))\,\frac{\pi(a,b)}{\pi(c,d)} \tag{116}$$
$$= \mathbb{1}[b = c]\,K(c,d)\,\frac{\pi(a)K(a,b)}{\pi(c)K(c,d)} \tag{117}$$
$$= \mathbb{1}[b = c]\,\frac{\pi(b)K(b,a)}{\pi(c)} \tag{118}$$
$$= \mathbb{1}[b = c]\,\frac{\pi(b)K(b,a)}{\pi(b)} \tag{119}$$
$$\tilde{K}^*((c,d),(a,b)) = \mathbb{1}[b = c]\,K(b,a) \tag{120}$$
$$\tilde{K}^*((a,b),(c,d)) = \mathbb{1}[a = d]\,K(d,c) \tag{121}$$

This proves the lemma.

Lemma 16. Let $T$ be the $k^2 \times k^2$ matrix corresponding to the transformation:

$$(TM)((a,b),(c,d)) = M((b,a),(c,d)) \tag{122}$$

Then $\tilde{K}$ and $\tilde{K}^*$ have the property:

$$(\tilde{K}^*)^r \tilde{K}^r = (T\tilde{K}^r)^2 \tag{123}$$

Proof. The matrix $T$ also has the properties:

$$(MT)((a,b),(c,d)) = M((a,b),(d,c)) \tag{124}$$
$$T^2 = I \tag{125}$$

Then we can show that:

$$T\tilde{K}((a,b),(c,d))\,T = T\,\mathbb{1}[b=c]\,K(c,d)\,T \tag{126}$$
$$= \mathbb{1}[a=c]\,K(c,d)\,T \tag{127}$$
$$= \mathbb{1}[a=d]\,K(d,c) \tag{128}$$
$$= \tilde{K}^*((a,b),(c,d)) \tag{129}$$

Using equation (129), we can show that for any $r$:

$$(\tilde{K}^*)^r\tilde{K}^r = (T\tilde{K}T)^r\tilde{K}^r \tag{130}$$
$$= T\tilde{K}^r T\tilde{K}^r \tag{131}$$
$$= (T\tilde{K}^r)^2 \tag{132}$$

This completes the proof.

Lemma 17. The matrices $\tilde{K}^2$ and $T\tilde{K}^2$ have the form:

$$\tilde{K}^2((a,b),(c,d)) = K(b,c)\,K(c,d) \tag{133}$$
$$T\tilde{K}^2((a,b),(c,d)) = K(a,c)\,K(c,d) \tag{134}$$

Proof.
Using Lemma 15, we obtain:

$$\tilde{K}^2((a,b),(c,d)) = \sum_{e,f}\tilde{K}((a,b),(e,f))\,\tilde{K}((e,f),(c,d)) \tag{135}$$
$$= \sum_{e,f}\mathbb{1}[b = e]\,K(e,f)\,\mathbb{1}[f = c]\,K(c,d) \tag{136}$$
$$= K(b,c)\,K(c,d) \tag{137}$$

This proves equation (133). We now use the definition of $T$ to obtain:

$$T\tilde{K}^2((a,b),(c,d)) = \tilde{K}^2((b,a),(c,d)) \tag{138}$$
$$= K(a,c)\,K(c,d) \tag{139}$$

This proves the lemma.

Lemma 18. The matrices $K$ and $T\tilde{K}^2$ have identical non-zero eigenvalues.

Proof. Let $V$ be an eigenvector of the matrix $T\tilde{K}^2$ with a non-zero eigenvalue $\eta$. This implies that:

$$\eta\,V((a,b)) = \sum_{c,d} T\tilde{K}^2((a,b),(c,d))\,V((c,d)) \tag{140}$$
$$= \sum_{c,d} K(a,c)\,K(c,d)\,V((c,d)) \tag{141}$$

Since $\eta \ne 0$ and the right-hand side of (141) does not depend on $b$, the entries $V((a,b))$ depend only on $a$; write $V((a,b)) = v(a)$. Then:

$$\eta\,v(a) = \sum_{c} K(a,c)\,v(c)\sum_{d} K(c,d) \tag{142}$$
$$= \sum_{c} K(a,c)\,v(c) \tag{143}$$

Thus, the vector $v = [v(1), \ldots, v(k)]^T$ is an eigenvector of the matrix $K$ with eigenvalue $\eta$. Conversely, let $v = [v_1, v_2, \ldots, v_k]^T$ be an eigenvector of the matrix $K$ with non-zero eigenvalue $\eta$. Then the vector $V$ with entries $V((a,b)) = v_a$ is an eigenvector of the matrix $T\tilde{K}^2$ with the same eigenvalue. Together, this implies that the non-zero eigenvalues of the matrices $K$ and $T\tilde{K}^2$ are identical.

We now come to the proof of Lemma 8.

Proof. (Lemma 8) Using Lemma 18 and Lemma 16, we get:

$$\lambda_2\big((T\tilde{K}^2)^2\big) = \max\big(\lambda_2(K)^2, \lambda_k(K)^2\big)$$
$$\gamma\big((\tilde{K}^*)^2\tilde{K}^2\big) = 1 - \max\big(\lambda_2(K)^2, \lambda_k(K)^2\big) \ge 1 - \max\big(|\lambda_2(K)|, |\lambda_k(K)|\big) = \gamma^*(K)$$

Now, using the definition of the pseudo-spectral gap $\gamma_{\mathrm{ps}}(\tilde{K})$, we obtain:

$$\gamma_{\mathrm{ps}}(\tilde{K}) \ge \frac{\gamma\big((\tilde{K}^*)^2\tilde{K}^2\big)}{2} \ge \frac{\gamma^*(K)}{2}$$

This proves the lemma.
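The identities in Lemmas 15, 17, and 18 are finite-dimensional and can be checked by direct computation. A small sketch of ours for a random 4-state chain (all names are ours):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
k = 4
K = rng.random((k, k))
K /= K.sum(axis=1, keepdims=True)          # random row-stochastic K

pairs = list(product(range(k), repeat=2))  # state space of the tuple chain

# Lemma 15: Kt((a,b),(c,d)) = 1[b = c] K(c,d)
Kt = np.zeros((k * k, k * k))
for i, (a, b) in enumerate(pairs):
    for j, (c, d) in enumerate(pairs):
        if b == c:
            Kt[i, j] = K[c, d]
assert np.allclose(Kt.sum(axis=1), 1.0)    # Kt is again row-stochastic

# T swaps the incoming pair: (T M)((a,b),(c,d)) = M((b,a),(c,d)), with T^2 = I
T = np.zeros((k * k, k * k))
for i, (a, b) in enumerate(pairs):
    T[i, pairs.index((b, a))] = 1.0
assert np.allclose(T @ T, np.eye(k * k))

# Lemma 17: (T Kt^2)((a,b),(c,d)) = K(a,c) K(c,d)
TK2 = T @ Kt @ Kt
for i, (a, b) in enumerate(pairs):
    for j, (c, d) in enumerate(pairs):
        assert np.isclose(TK2[i, j], K[a, c] * K[c, d])

# Lemma 18: the non-zero eigenvalues of T Kt^2 are exactly those of K
ev_K = np.sort_complex(np.linalg.eigvals(K))
ev = np.linalg.eigvals(TK2)
ev_nonzero = np.sort_complex(ev[np.abs(ev) > 1e-9])
assert np.allclose(ev_K, ev_nonzero, atol=1e-8)
```

The remaining $k^2 - k$ eigenvalues of $T\tilde{K}^2$ are zero, matching the rank-$k$ structure used in the proof of Lemma 18.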
