Empirical Lossless Compression Bound of a Data Sequence



Author: Lei M Li 1,2,*

Affiliations: 1 Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China. 2 University of Chinese Academy of Sciences, Beijing 100049, China.
* Correspondence should be addressed to Lei M Li (lilei@amss.ac.cn). Telephone: +8610-82541585. Fax: +8610-6265-8364.

Key words: lossless compression, entropy, Kolmogorov complexity, normalized maximum likelihood, local asymptotic normality

Abstract

We consider the lossless compression bound of any individual data sequence. Conceptually, its Kolmogorov complexity is such a bound yet is uncomputable. The Shannon source coding theorem states that the average compression bound is $nH$, where $n$ is the number of words and $H$ is the entropy of an oracle probability distribution characterizing the data source. The quantity $nH(\hat\theta_n)$ obtained by plugging in the maximum likelihood estimate is an underestimate of the bound. Shtarkov showed that the normalized maximum likelihood (NML) distribution or code length is optimal in a minimax sense for any parametric family. In this article, we consider the exponential families, the only models that admit sufficient statistics whose dimensions remain bounded as the sample size grows. We show by the local asymptotic normality that the NML code length is $nH(\hat\theta_n) + \frac{d}{2}\log\frac{n}{2\pi} + \log\int_\Theta |I(\theta)|^{1/2}\,d\theta + o(1)$, where $d$ is the model dimension or dictionary size, and $|I(\theta)|$ is the determinant of the Fisher information matrix. We also demonstrate that sequentially predicting the optimal code length for the next word via a Bayesian mechanism leads to the mixture code, whose pathwise length is given by $nH(\hat\theta_n) + \frac{d}{2}\log\frac{n}{2\pi} + \log\frac{|I(\hat\theta_n)|^{1/2}}{w(\hat\theta_n)} + o(1)$, where $w(\theta)$ is a prior. If we take the Jeffreys prior when it is proper, the expression agrees with the NML code length. The asymptotics apply not only to discrete symbols but also to continuous data if the code length for the former is replaced by the description length for the latter. The analytical result is exemplified by calculating compression bounds of protein-encoding DNA sequences under different parsing models. Typically, the highest compression is achieved when the parsing is in the phase of the amino acid codons. On the other hand, the compression rates of pseudo-random sequences are larger than 1 regardless of parsing models. These model-based results are consistent with the assertion of Kolmogorov complexity theory that random sequences are incompressible. The empirical lossless compression bound is particularly more accurate when the dictionary size is relatively large.

1 Introduction

The computation of the compression bound of any individual sequence is both a philosophical and a practical problem. It touches on the fundamentals of human intelligence. After several decades of effort, many insights have been gained by experts from different disciplines. In essence, the bound is the length of the shortest program that prints the sequence on a Turing machine, referred to as the Solomonoff-Kolmogorov-Chaitin algorithmic complexity. Under this setting, if a sequence cannot be compressed by any computer program, it is random.
On the other hand, if we can compress the sequence by a certain program or coding scheme, it is then not random, and we learn some pattern or knowledge in the sequence. Nevertheless, this Kolmogorov complexity is not computable. Along another line, the source coding theorem proposed by Shannon [30] claimed that the optimal coding, or the average shortest code length, is no less than $nH$, where $n$ is the number of words and $H$ is the entropy of the source if its distribution can be specified. Although Shannon's probability framework has inspired the invention of some ingenious compression methods, $nH$ is an oracle bound. Some further questions need to be addressed. First, where does the probability distribution come from? A straightforward solution is one inferred from the data themselves. However, in the case of discrete symbols, plugging in the word frequencies $\hat\theta_n$ observed in the sequence results in $nH(\hat\theta_n)$, which is, as can be shown, an underestimate of the bound. Second, the word frequencies are counted according to a dictionary. Different dictionaries would lead to different distributions or codes. What is the criterion for selecting a good dictionary? Third, the practice of some compression algorithms such as the Lempel-Ziv coding shows that as the length of a sequence gets longer, the size of the dictionary gets larger. What is the exact impact of the dictionary size on the compression? Fourth, can we achieve the compression limit by a predictive code that goes through the data in only one round? Fifth, how is the bound derived from the probability framework, if possible, connected to the conclusions drawn from algorithmic complexity?

In this article, we review the key ideas of lossless compression and present some new mathematical results relevant to the above problems. Besides the algorithmic complexity and the Shannon source coding theorem, the technical tools center around the normalized maximum likelihood (NML) coding [31, 26] and predictive coding [7, 16]. The expansions of these code lengths lead to an empirical compression bound, which is indeed sequence-specific and thus has a natural link to algorithmic complexity. Although the primary theme is the pathwise asymptotics, the related average results are discussed as well for the sake of comparison. The analytical results apply not only to discrete symbols but also to continuous data if the code length for the former is replaced by the description length for the latter [3]. Other than theoretical justification, the empirical bound is exemplified by protein-coding DNA sequences and pseudo-random sequences.

2 A brief review of the key concepts

Data compression
The basic concepts of lossless coding can be found in the textbook [9]. Before we proceed, it is helpful to clarify some jargon used in this paper: strings, symbols, and words. We illustrate them by an example. The following, "studydnasequencefromthedatacompressionpointofviewforexampleabcdefghijklmnopqrstuvwxyz", is a string. The 26 distinct lower-case English letters appearing in the string are called symbols, and they form an alphabet. If we parse the string into "study", "dnasequence", "fromthedata", "compressionpointofview", "forexample", "abcdefg", "hijklmnopq", "rstuvwxyz", these substrings are called words. The implementation of data compression includes an encoder and a decoder.
The encoder parses the string to be compressed into words and replaces each word by its codeword. Consequently, this produces a new string, which is hopefully shorter than the original one in terms of its length. The decoder, conversely, parses the new string into codewords, and interprets each codeword back to a word of the original symbols. The collection of all distinct words in a parsing is called a dictionary. In the notion of data compression, two issues arise naturally. First, is there a lower bound? Second, how do we compute this bound, or is it computable at all?

Prefix code
A basic idea in lossless compression is the prefix code or instantaneous code. A code is called a prefix one if no codeword is a prefix of any other codeword. The prefix constraint has a very close relationship to the metaphor of the Turing machine, by which the algorithmic complexity is defined. Given a prefix code over an alphabet of $\alpha$ symbols, the codeword lengths $l_1, l_2, \dots, l_m$, where $m$ is the dictionary size, must satisfy the Kraft inequality: $\sum_{i=1}^m \alpha^{-l_i} \le 1$. Conversely, given a set of code lengths that satisfy this inequality, there exists a prefix code with these code lengths. Please notice that the dictionary size in a prefix code could be either finite or countably infinite. The class of prefix codes is smaller than the more general class of uniquely decodable codes, and one may expect that some uniquely decodable codes could be advantageous over prefix codes in terms of data compression. However, this is not exactly true, for it can be shown that the codeword lengths of any uniquely decodable code must satisfy the Kraft inequality. Thus we can construct a prefix code to match the codeword lengths of any given uniquely decodable code. A prefix code has an attractive self-punctuating feature: it can be decoded without reference to future codewords, since the end of a codeword is immediately recognizable. For these reasons, people stick to prefix coding in practice.

A conceptual yet convenient generalization of the Kraft inequality is to drop the integer requirement for code lengths and ignore the effect of rounding up. A general set of code lengths can be implemented by the arithmetic coding [21, 27]. This generalization leads to a correspondence between probability distributions and prefix code lengths: for every distribution $P$ on the dictionary, there exists a prefix code $C$ whose length $L_C(x)$ is equal to $-\log P(x)$ for all words $x$. Conversely, for every prefix code $C$ on the dictionary, there exists a probability measure $P$ such that $-\log P(x)$ is equal to the code length $L_C(x)$ for all words $x$.

Shannon's probability-based coding
In his seminal work [30], Shannon proposed the source coding theorem based on a probability framework. If we assume a finite number of words $A_1, A_2, \dots, A_m$ are generated from a probabilistic source denoted by a random variable $X$ with frequencies $p_i$, $i = 1, \dots, m$, then the expected length of any prefix code is no shorter than the entropy of this source, defined as $H(X) = -\sum_{i=1}^m p_i \log p_i$. Throughout this paper, we take 2 as the base of the logarithm operation, and thereby the bit is the unit of code lengths. This result offers a lower bound of data compression if a probabilistic model can be assumed.
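As a quick numerical illustration of the correspondence between probabilities and prefix code lengths, the sketch below (a minimal Python example, not part of the original experiments; the word frequencies are made up) computes the ideal code lengths $-\log_2 P(x)$, checks the Kraft inequality, and compares the expected code length with the entropy $H(X)$.

```python
import math

# Hypothetical word frequencies on a small dictionary (illustrative only).
freqs = {"study": 0.4, "dna": 0.3, "sequence": 0.2, "compress": 0.1}

# Ideal prefix code lengths: L(x) = -log2 P(x) (integer rounding ignored,
# as with arithmetic coding).
lengths = {w: -math.log2(p) for w, p in freqs.items()}

# Kraft inequality: the sum over words of 2^{-L(x)} must not exceed 1.
kraft_sum = sum(2 ** -l for l in lengths.values())

# Expected code length equals the entropy H(X) when L(x) = -log2 P(x).
entropy = -sum(p * math.log2(p) for p in freqs.values())
expected_len = sum(p * lengths[w] for w, p in freqs.items())

print(f"Kraft sum = {kraft_sum:.6f}")          # exactly 1 for ideal lengths
print(f"H(X) = {entropy:.4f} bits, E[L] = {expected_len:.4f} bits")
```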
The Huffman code is such an optimal code that reaches the expected code length. Its codewords are defined by a binary tree built from word frequencies. The Shannon-Fano-Elias code is another one that uses at most two bits more than the lower bound. The code length of $A_i$ in the Shannon-Fano-Elias code is approximately equal to $-\log p_i$.

Kolmogorov complexity and algorithm-based coding
Kolmogorov, who laid out the foundation of probability theory, interestingly put away probabilistic models, and along with other researchers including Solomonoff and Chaitin, pursued another path to understand the information structure of data based on the notion of the universal Turing machine. Kolmogorov [8] expressed the following: "information theory must precede probability theory, and not be based on it." We give a brief account of some facts about Kolmogorov complexity relevant to our study, and refer readers to Li and Vitányi [17], Vitányi and Li [34] for details. A Turing machine is a computer with finite states operating on a finite symbol set, and is essentially the abstraction of any concrete computer that has CPUs, memory, and input and output devices. At each unit of time, the machine reads in one operation command from the program tape, writes some symbols on a work tape, and changes its state according to a transition table. Two important features need more explanation. First, the program is linear, namely, the machine reads the tape from left to right and never goes back. Second, the program is prefix-free, namely, no program leading to a halting computation can be the prefix of another such program. This feature is an analog of the prefix-coding idea. A universal Turing machine can reproduce the results of other machines. The Kolmogorov complexity of a word $x$ with respect to a universal computer $U$, denoted by $K_U(x)$, is defined as the minimum length over all programs that print $x$ and halt. The Kolmogorov complexities of all words satisfy the Kraft inequality due to their natural connection to prefix coding. In fact, for a fixed machine $U$, we can encode $x$ by the minimum-length program that prints $x$ and halts. Given a long string, if we define a way to parse it into words, then we encode each word by the above program. Consequently, we encode the string by concatenating the programs one after another. The decoding can easily be carried out by inputting the concatenated program into $U$. One obvious way of parsing is to take the string itself as the only word. Thus how much we can compress the string depends on the complexity of this string. At this point, we see the connection between data compression and the Kolmogorov complexity, which is defined for each string on an implementable type of computational machine, the Turing machine.

Next, we highlight some theoretical results about Kolmogorov complexity. First, it is not machine specific except for a machine-specific constant. Second, the Kolmogorov complexity is unfortunately not computable. Third, there exists a universal probability $P_U(x)$ with respect to a universal machine such that $2^{-K(x)} \le P_U(x) \le c\, 2^{-K(x)}$ for all strings, where $c$ is a constant independent of $x$.
This means that $K(x)$ is equivalent to $-\log P_U(x)$ up to a constant, which can be viewed as the code length of a prefix code in light of the Shannon-Fano-Elias code. Because of the non-computability of Kolmogorov complexity, the universal probability is not computable either. The study of Kolmogorov complexity tells us that the assessment of exact compression bounds of strings is beyond the ability of any specific Turing machine. However, any program on a Turing machine offers, up to a constant, an upper bound.

Correspondence between probability models and string parsing
A critical question that remains to be answered in the Shannon source coding theorem is: where does the model that defines the probabilities come from? According to the theorem, the optimal code lengths are proportional to the negative logarithm of the word frequencies. Once the dictionary is defined, the word frequencies can be counted for any individual string to be compressed. Equivalently, a dictionary can be induced by the way we parse a string. It is noted that the term "letter" instead of "word" was used in Shannon's original paper [30], which did not discuss how to parse strings into words at all.

Fixed-length and variable-length parsing
The words generated from the parsing process could be either of the same length or of variable lengths. For example, we can encode Shakespeare's works letter by letter, or encode them by natural words of different lengths. A choice made at this point leads to two quite different coding schemes. If we decompose a string into words of the same number of symbols, this is a fixed-length parsing. The up to two extra bits per word (e.g., from the Shannon-Fano-Elias code) are a big deal if the number of symbols in each word is small. As the word length gets longer and longer, the two extra bits are relatively negligible for each block. An effective alternative to get around the issue of extra bits is the arithmetic coding, which integrates the codes of successive words at the cost of more computation.

Variable-length parsing decomposes a string into words of variable numbers of symbols. The popular Lempel-Ziv coding is such a scheme. Although the complexity of a string $x$ is not computable, the complexity of '$x1$' relative to '$x$' is small. To concatenate a '1' to the end of '$x$', we can simply use the program of printing $x$ followed by printing '1'. A recursive implementation of this idea leads to the Lempel-Ziv coding, which concatenates the address of '$x$' and the code of '1'. Please notice that as the data length increases, the dictionary size resulting from the parsing scheme of the Lempel-Ziv coding increases as well if we do not impose an upper limit. Along the process of encoding, each word occurs only once, because down the road either it will not be a prefix of any other word or a new word concatenating it with a certain suffix symbol will be found. To a good approximation, all the words encountered up to a point are equally likely. If we use the same number of bits to store the addresses of these words, their code lengths are equal. Approximately, it obeys Shannon's source coding theorem too.
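The incremental parsing idea can be made concrete with a few lines of code. The sketch below (an illustrative LZ78-style parser, not the exact variant discussed above) splits a string into words that are each a previously seen word extended by one new symbol, and reports how the dictionary grows with the input length.

```python
def lz78_parse(s: str):
    """Parse s into words, each being a previously seen word plus one symbol."""
    dictionary = {}          # word -> index of its first occurrence
    words = []
    current = ""
    for ch in s:
        current += ch
        if current not in dictionary:      # a new word ends here
            dictionary[current] = len(dictionary) + 1
            words.append(current)
            current = ""
    if current:                            # leftover suffix at the end
        words.append(current)
    return words

text = "abababbababaabbbabab" * 20
words = lz78_parse(text)
# Each word costs roughly log2(dictionary size) bits for the prefix address
# plus the bits for its last symbol, in line with the discussion above.
print(f"input length = {len(text)}, number of words = {len(words)}")
```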
Parametric models and complexity
Hereafter we use parametric probabilistic models to count prefix code lengths. The specification of a parametric model includes three aspects: a model class, a model dimension, and parameter values. Suppose we restrict our attention to some hypothetical model classes. Each of these model classes is indexed by a set of parameters, and we call the number of parameters in each model its dimension. We also assume the identifiability of the parameterization, that is, different parameter values correspond to different models. Let us denote one such model class by a probability measure $\{P_\theta : \theta \in \text{an open set } \Theta \subset R^d\}$, and the corresponding frequency functions by $\{p(x; \theta)\}$. The model class is usually defined by a parsing scheme. For example, if we parse a string symbol by symbol, then the number of words equals the number of symbols appearing in the string. We denote the number of symbols by $\alpha$; then $d = \alpha - 1$. If we parse the string by every two symbols, then the number of distinct words increases to $\alpha^2$ and $d = \alpha^2 - 1$, and so on. From the above review of Kolmogorov complexity, it is clear that strings themselves do not admit probability models in the first place. Nevertheless, we can fit a string by a parametric model. By doing so, we need to pay extra bits to describe the model, as observed by Rissanen. He termed them the stochastic complexity or parametric complexity. The total code length by a model includes both the data description and the parametric complexity.

Two references for code length evaluation
The evaluation of the redundancy of a given code needs a reference. Two such references are discussed in the literature. In the first scenario, we assume that the words $X^{(n)} = \{X_1, X_2, \dots, X_n\}$ are generated according to $P_{\theta_0}$ as independent and identically distributed (iid) random variables, whose outcomes are denoted by $\{x_i\}$. Then the optimal code length is given by $L_0 = -\sum_{i=1}^n \log p(X_i; \theta_0)$. As $n$ goes large, its average code length is given by $E L_0 = nH(\theta_0)$. In general, the code length corresponding to any distribution $Q(x)$ is given by $L_Q = -\sum_{i=1}^n \log q(X_i)$, and its redundancy is $R_Q = L_Q - L_0$. The expected redundancy is the Kullback-Leibler divergence between the two distributions:
$$E_{P_{\theta_0}}(L_Q - L_0) = E_{P_{\theta_0}} \log\frac{P_{\theta_0}(X^{(n)})}{Q(X^{(n)})} = D(P_{\theta_0}\,\|\,Q) \ge 0.$$
It can be shown that the minimax and maximin values of the average redundancy are equal [14]:
$$\inf_Q \sup_\theta E_{P_\theta} \log\frac{P_\theta(X^{(n)})}{Q(X^{(n)})} = \sup_\theta \inf_Q E_{P_\theta} \log\frac{P_\theta(X^{(n)})}{Q(X^{(n)})} = I(\Theta; X^{(n)}).$$
Historically, a key progress on redundancy [15, 23] is that for each positive number $\epsilon$ and for all $\theta_0 \in \Theta$ except in a set whose volume goes to zero, as $n \to \infty$,
$$E_{P_{\theta_0}}(L_Q - L_0) \ge \frac{d-\epsilon}{2}\log n. \quad (1)$$
All these results are about the average code length over all possible strings.

Another reference with which any code can be compared is obtained by replacing $\theta_0$ by the maximum likelihood estimate $\hat\theta_n$ in $L_0$, that is, $L_{\hat\theta_n} = -\sum_{i=1}^n \log p(X_i; \hat\theta_n)$. Please notice that $L_{\hat\theta_n}$ does not satisfy the Kraft inequality. This perspective is a practical one, since in reality $x^{(n)}$ is simply data without any probability measure. Given a parametric model class $\{P_\theta\}$, we fit the data by one surrogate model that maximizes the likelihood. Then we consider
$$L_Q - L_{\hat\theta_n} = \log\frac{p(x^{(n)}; \hat\theta(x^{(n)}))}{q(x^{(n)})}.$$
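The claim that $L_{\hat\theta_n}$ does not satisfy the Kraft inequality is easy to check numerically. The sketch below (a minimal illustration for the Bernoulli model, not from the paper's experiments) sums the maximized likelihoods $p(x^{(n)}; \hat\theta(x^{(n)}))$ over all binary strings of length $n$ by grouping them by their sufficient statistic; the sum exceeds 1, so $-\log p(x^{(n)}; \hat\theta(x^{(n)}))$ cannot be a legitimate prefix code length, and the excess is exactly the NML normalizer introduced next.

```python
import math

def plugin_likelihood_sum(n: int) -> float:
    """Sum of max_theta p(x; theta) over all binary strings of length n."""
    total = 0.0
    for k in range(n + 1):                 # k = number of ones (sufficient statistic)
        p_hat = k / n                      # MLE for strings with k ones
        # maximized likelihood of one such string
        like = (p_hat ** k) * ((1 - p_hat) ** (n - k)) if 0 < k < n else 1.0
        total += math.comb(n, k) * like    # number of strings with k ones
    return total

for n in (10, 100, 1000):
    s = plugin_likelihood_sum(n)
    print(f"n={n:5d}: sum of plug-in 'probabilities' = {s:.3f} (> 1)")
```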
Optimality of the normalized maximum likelihood code length
Minimizing the above quantity leads to the normalized maximum likelihood (NML) distribution:
$$\hat p(x^{(n)}) = \frac{p(x^{(n)}; \hat\theta(x^{(n)}))}{\sum_{x^{(n)}} p(x^{(n)}; \hat\theta(x^{(n)}))}.$$
The NML code length is thus given by
$$L_{NML} = -\log p(x^{(n)}; \hat\theta(x^{(n)})) + \log \sum_{x^{(n)}} p(x^{(n)}; \hat\theta(x^{(n)})). \quad (2)$$
Shtarkov [31] proved the optimality of the NML code by showing it solves
$$\min_q \max_{x^{(n)}} \log \frac{p(x^{(n)}; \hat\theta(x^{(n)}))}{q(x^{(n)})}, \quad (3)$$
where $q$ ranges over the set of virtually all distributions. Later Rissanen [26] further proved that the NML code solves
$$\min_q \max_g E_g\left[\log \frac{p(X^{(n)}; \hat\theta(X^{(n)}))}{q(X^{(n)})}\right],$$
where $q$ and $g$ range over the set of virtually all distributions. This result states that the NML code is still optimal even if the data are generated from outside the parametric model family. Namely, regardless of the source nature in practice, we can always find the optimal code length from a distribution family.

3 Empirical code lengths based on exponential family distributions

In this section, we fit the data from a source, either discrete or continuous, by an exponential family due to the following considerations. First, the multinomial distribution, which is used to encode discrete symbols, is an exponential family. Second, according to the Pitman-Koopman-Darmois theorem, exponential families are, under certain regularity conditions, the only models that admit sufficient statistics whose dimensions remain bounded as the sample size grows. On one hand, this property is most desirable in data compression. On the other hand, the results would be valid in the more general statistical learning beyond source coding. Third, as we will show, the first term in the code length expansion is nothing but the empirical entropy for exponential families, which is a straightforward extension of Shannon's source coding theorem.

Exponential families
Consider a canonical exponential family of distributions $\{P_\theta : \theta \in \Theta\}$, where the natural parameter space $\Theta$ is an open set of $R^d$. The density function is given by
$$p(x; \theta) = \exp\{\theta^T S(x) - A(\theta)\}, \quad (4)$$
with respect to some measure $\mu(dx)$ on the support of the data. The transposition of a matrix (or vector) $V$ is represented by $V^T$ here and throughout the paper. $S(\cdot)$ is the sufficient statistic for the parameter $\theta$. We denote the first and second derivatives of $A(\theta)$ respectively by $\dot A(\theta)$ and $\ddot A(\theta)$. The entropy or differential entropy of $P_\theta$ is $H(\theta) = A(\theta) - \theta^T \dot A(\theta)$. The following result is an empirical and pathwise version of Shannon's source coding theorem.

Theorem 1 (Empirical optimal source code length) If we fit an individual data sequence by an exponential family distribution, the NML code length is given by
$$L_{NML} = nH(\hat\theta_n) + \frac{d}{2}\log\frac{n}{2\pi} + \log\int_\Theta |I(\theta)|^{1/2}\, d\theta + o(1), \quad (5)$$
where $H(\hat\theta_n)$ is the entropy evaluated at the maximum likelihood estimate (MLE) $\hat\theta_n = \hat\theta(x^{(n)})$, and $|I(\theta)|$ is the determinant of the Fisher information
$$I(\theta) = \Big[-E\Big(\frac{\partial^2 \log p(X;\theta)}{\partial\theta_j\,\partial\theta_k}\Big)\Big]_{j,k=1,\dots,d}.$$
The integral in the expression is assumed to be finite.
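For the Bernoulli family ($d = 1$) the expansion (5) can be checked directly: in the mean parameterization, the parametric-complexity term is $\log\int_0^1 |I(p)|^{1/2}dp = \log\int_0^1 p^{-1/2}(1-p)^{-1/2}dp = \log\pi$, and the exact NML normalizer can be summed over the sufficient statistic. The sketch below (illustrative code, with all logarithms in base 2) compares the exact value of $\log\sum_{x^{(n)}} p(x^{(n)};\hat\theta)$ with $\frac{1}{2}\log\frac{n}{2\pi}+\log\pi$.

```python
import math

def log2_nml_normalizer(n: int) -> float:
    """Exact log2 of the sum over binary strings of the maximized likelihood."""
    total = 0.0
    for k in range(n + 1):
        p_hat = k / n
        like = (p_hat ** k) * ((1 - p_hat) ** (n - k)) if 0 < k < n else 1.0
        total += math.comb(n, k) * like
    return math.log2(total)

for n in (100, 1000, 10000):
    exact = log2_nml_normalizer(n)
    asym = 0.5 * math.log2(n / (2 * math.pi)) + math.log2(math.pi)
    print(f"n={n:6d}: exact = {exact:.4f}, (d/2)log(n/2pi)+log(pi) = {asym:.4f}")
```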
The first term in (2) is $nA(\hat\theta_n) - [\sum_{i=1}^n S(x_i)]^T\hat\theta_n = nA(\hat\theta_n) - n\dot A(\hat\theta_n)^T\hat\theta_n = nH(\hat\theta_n)$, namely, the entropy in Shannon's theorem except that the model parameter is replaced by the MLE. The second term has a close relationship to the BIC work of Akaike [2] and Schwarz [29], and the third term involves the Fisher information, which characterizes the local property of a distribution family. Surprisingly and interestingly, this empirical version of the lossless coding theorem puts together the three pieces of fundamental work respectively by Shannon, Akaike-Schwarz, and Fisher.

Next, we give a heuristic proof of (5) by the local asymptotic normality (LAN) [6], though a complete proof can be found in the Appendix. In the definition of the NML code length (2), the first term becomes the empirical entropy for exponential families. Namely,
$$L_{NML} = nH(\hat\theta_n) + \log\sum_{x^{(n)}} p(x^{(n)}; \hat\theta_n). \quad (6)$$
The remaining difficulty is the computation of the summation. In a general problem of data description length, Rissanen [25] derived an analytical expansion requiring five assumptions, which were hard to verify. Here we show that for sources from exponential families, the expansion is valid as long as the integral is finite.

Let $U(\theta, \frac{r}{\sqrt n})$ be a cube of side length $\frac{r}{\sqrt n}$ centered at $\theta$, where $r$ is a constant. LAN states that we can expand the probability density in each neighborhood $U(\theta, \frac{r}{\sqrt n})$ as follows:
$$\log\frac{p(x^{(n)}; \theta+h)}{p(x^{(n)}; \theta)} = h^T\Big[\sum_{i=1}^n S(x_i) - n\dot A(\theta)\Big] - \frac{1}{2} h^T [nI(\theta)] h + o(h),$$
where $I(\theta) = \ddot A(\theta)$. Maximizing the likelihood in $U(\theta, \frac{r}{\sqrt n})$ with respect to $h$ leads to
$$\max_h \log\frac{p(x^{(n)}; \theta+h)}{p(x^{(n)}; \theta)} = \frac{1}{2}\Big[\sum_{i=1}^n S(x_i) - n\dot A(\theta)\Big]^T [nI(\theta)]^{-1} \Big[\sum_{i=1}^n S(x_i) - n\dot A(\theta)\Big] + o\Big(\frac{r}{\sqrt n}\Big).$$
Consequently, if $\hat\theta_n(x^{(n)})$ falls into the neighborhood $U(\theta, \frac{r}{\sqrt n})$, we have
$$p(x^{(n)}; \hat\theta_n) = e^{\frac{1}{2}[\sum_{i=1}^n S(x_i) - n\dot A(\theta)]^T [nI(\theta)]^{-1}[\sum_{i=1}^n S(x_i) - n\dot A(\theta)] + o(r/\sqrt n)}\; p(x^{(n)}; \theta), \quad (7)$$
where $\hat\theta_n$ solves $\sum_{i=1}^n S(x_i) = n\dot A(\hat\theta_n)$. Applying the Taylor expansion, we get
$$\sum_{i=1}^n S(x_i) - n\dot A(\theta) = n\dot A(\hat\theta_n) - n\dot A(\theta) = [n\ddot A(\theta)](\hat\theta_n - \theta) + o\Big(\frac{r}{\sqrt n}\Big).$$
Plugging it into (7) leads to
$$p(x^{(n)}; \hat\theta_n) = e^{\frac{1}{2}(\hat\theta_n - \theta)^T [n\ddot A(\theta)](\hat\theta_n - \theta) + o(r/\sqrt n)}\; p(x^{(n)}; \theta). \quad (8)$$
If we consider i.i.d. random variables $Y_1, \dots, Y_n$ sampled from the exponential distribution (4), then the MLE $\hat\theta(Y^{(n)})$ is a random variable. The summation of the quantity (8) over the neighborhood $U(\theta, \frac{r}{\sqrt n})$ can be expressed as the following expectation of $\hat\theta(Y^{(n)})$:
$$E\Big[e^{\frac{1}{2}(\hat\theta_n - \theta)^T [n\ddot A(\theta)](\hat\theta_n - \theta)}\, 1\big(\hat\theta_n \in U(\theta, \tfrac{r}{\sqrt n})\big)\Big]. \quad (9)$$
Due to the asymptotic normality of the MLE $\hat\theta(Y^{(n)})$, namely $\hat\theta_n - \theta \xrightarrow{d} N(0, [nI(\theta)]^{-1})$, the density of $\hat\theta(Y^{(n)})$ is approximated by
$$\frac{|nI(\theta)|^{1/2}}{(2\pi)^{d/2}}\, e^{-\frac{1}{2}(\hat\theta_n - \theta)^T [n\ddot A(\theta)](\hat\theta_n - \theta)}\, d\hat\theta_n.$$
Applying this density to the expectation in (9), we find that the two exponential terms cancel out, leaving $\frac{|nI(\theta)|^{1/2}}{(2\pi)^{d/2}}$ times the volume of the neighborhood. Summing these quantities over all the neighborhoods $U(\theta, \frac{r}{\sqrt n})$ gives approximately $(\frac{n}{2\pi})^{d/2}\int_\Theta |I(\theta)|^{1/2}\,d\theta$, whose logarithm yields the remaining terms in (5).
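The quadratic approximation (8) is easy to visualize for the Bernoulli family. The sketch below (illustrative code; the data are simulated and not from the paper) draws a sample, computes the exact log-likelihood ratio $\log p(x^{(n)}; \hat\theta_n) - \log p(x^{(n)}; \theta)$, and compares it with the quadratic form $\frac{1}{2} n(\hat\theta_n - \theta)^2 I(\theta)$, where $I(\theta) = 1/(\theta(1-\theta))$ in the mean parameterization.

```python
import math
import random

random.seed(0)
n, theta = 2000, 0.4                      # true Bernoulli parameter
x = [1 if random.random() < theta else 0 for _ in range(n)]
theta_hat = sum(x) / n                    # MLE
k = sum(x)

# Exact log-likelihood ratio: log p(x; theta_hat) - log p(x; theta)
exact = (k * math.log(theta_hat / theta)
         + (n - k) * math.log((1 - theta_hat) / (1 - theta)))

# LAN-style quadratic approximation: 0.5 * n * (theta_hat - theta)^2 * I(theta),
# with Fisher information I(theta) = 1 / (theta * (1 - theta)).
fisher = 1.0 / (theta * (1 - theta))
quad = 0.5 * n * (theta_hat - theta) ** 2 * fisher

print(f"exact log-LR = {exact:.4f}, quadratic approximation = {quad:.4f}")
```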
Predictive coding
The optimality of the NML code is established in the minimax setting. Yet its implementation requires going through the data two rounds: one for the word counting of a dictionary, and one more for encoding. It is natural to ask whether there exists a scheme that goes through the data only once and still can compress the data equally well. It turns out that predictive coding is such a scheme for a given dictionary. The idea of predictive coding is to sequentially make inferences about the parameters in the probability function $p(x; \theta)$, which are then used to update the code book. That is, after obtaining observations $x_1, \dots, x_i$, we calculate the MLE $\hat\theta_i$, and in turn encode the next observation according to the current estimated distribution. Its code length is thus $L_{\rm predictive} = -\sum_{i=1}^{n-1} \log p(X_{i+1} \mid \hat\theta_i)$. This procedure (Rissanen [22, 23]) has intimate connections with the prequential approach to statistical inference as advocated by Dawid [10, 11]. Predictive coding is intuitively optimal due to two important fundamental results. First, the MLE $\hat\theta_i$ is asymptotically most accurate, since it gathers all the information in $X_1, \dots, X_i$ for the inference of the parametric model $p(x; \theta)$. Second, the code length $-\log p(X_{i+1} \mid \hat\theta_i)$ is optimal as dictated by the Shannon source coding theorem. In the case of exponential families, Proposition 2.2 in [16] showed that $L_{\rm predictive}$ can be expanded as follows:
$$L_{\rm predictive} = nH(\hat\theta_n) + \frac{d}{2}\log n + \tilde D_n(\omega),$$
where the sequence of random variables $\{\tilde D_n(\omega)\}$ converges to an almost surely finite random variable $\tilde D(\omega)$.

Alternatively, we can use Bayesian estimates in the predictive coding. Starting from a prior distribution $w(\theta)$, we encode $x_1$ by the marginal distribution $q_1(x_1) = \int_\Theta p(x_1\mid\theta)\, w(\theta)\, d\theta$ resulting from $w(\cdot)$. The posterior is given by $w_1(\theta) = p(x_1\mid\theta)w(\theta) / \int_\Theta p(x_1\mid\theta)w(\theta)\,d\theta$. We then use this posterior as the updated prior to encode the next word $x_2$. Using induction, we can show that the marginal distribution used to encode the $k$-th word is
$$q_k(x_k) = \frac{\int_\Theta [\prod_{i=1}^{k} p(x_i\mid\theta)]\, w(\theta)\, d\theta}{\int_\Theta [\prod_{i=1}^{k-1} p(x_i\mid\theta)]\, w(\theta)\, d\theta}.$$
Meanwhile, the updated posterior, also the prior for the next round of encoding, becomes
$$w_k(\theta) = \frac{[\prod_{i=1}^{k} p(x_i\mid\theta)]\, w(\theta)}{\int_\Theta [\prod_{i=1}^{k} p(x_i\mid\theta)]\, w(\theta)\, d\theta}.$$

Proposition 1 (Bayesian predictive code length) The total Bayesian predictive code length for a string of $n$ words is
$$L_{\rm mixture} = -\sum_{k=1}^n \log q_k(x_k) = -\log \int_\Theta \Big[\prod_{i=1}^n p(x_i\mid\theta)\Big] w(\theta)\, d\theta.$$
Thus the Bayesian predictive code is nothing but the mixture code referred to in the literature [7].

Theorem 2 (Expansion of the Bayesian predictive code length) If we fit a data sequence by an exponential family distribution, the mixture code length has the expansion
$$L_{\rm mixture} = nH(\hat\theta_n) + \frac{d}{2}\log\frac{n}{2\pi} + \log\frac{|I(\hat\theta_n)|^{1/2}}{w(\hat\theta_n)} + o(1), \quad (10)$$
where $w(\theta)$ is any mixture of conjugate prior distributions.

The result is valid for general priors that can be approximated by a mixture of conjugate ones. In the case of multinomial distributions, the conjugate prior is the Dirichlet distribution. Any prior $w(\theta)$ continuous on the $d$-dimensional simplex in the $[0,1]^{m+1}$ cube can be uniformly approximated by the Bernstein polynomials of $m$ variables, each term of which is a Dirichlet distribution [19, 20].
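A minimal sketch of the Bayesian predictive (mixture) code for discrete symbols follows, assuming the multinomial model with the conjugate Dirichlet(1/2, ..., 1/2) prior (the Jeffreys prior discussed below); the code and the alphabet are illustrative, not from the paper. Each symbol is encoded with the current posterior predictive probability, and the accumulated length equals $-\log_2$ of the Dirichlet-multinomial marginal, as stated in Proposition 1.

```python
import math
import random

def mixture_code_length(seq, alphabet, a=0.5):
    """Sequential Bayesian predictive code length (bits) under a
    Dirichlet(a, ..., a) prior on the multinomial word probabilities."""
    counts = {s: 0.0 for s in alphabet}
    total_bits = 0.0
    for sym in seq:
        # posterior predictive probability of the next symbol
        prob = (counts[sym] + a) / (sum(counts.values()) + a * len(alphabet))
        total_bits += -math.log2(prob)
        counts[sym] += 1.0               # Bayesian update of the "code book"
    return total_bits

random.seed(1)
alphabet = ["A", "C", "G", "T"]
seq = random.choices(alphabet, weights=[4, 3, 2, 1], k=5000)

n = len(seq)
p_hat = [seq.count(s) / n for s in alphabet]
empirical_entropy = -sum(p * math.log2(p) for p in p_hat if p > 0)

print(f"L_mixture            = {mixture_code_length(seq, alphabet):.1f} bits")
print(f"nH(theta_hat) alone  = {n * empirical_entropy:.1f} bits")
```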
It is noted that in the current setting, the source is not assumed to be i.i.d. samples from an exponential family distribution, as in Theorem 2.2 and Proposition 2.3 in [16]. When $\int_\Theta |I(\theta)|^{1/2}\, d\theta$ is finite, we can take the Jeffreys prior,
$$w(\theta) = \frac{|I(\theta)|^{1/2}}{\int_\Theta |I(\theta)|^{1/2}\, d\theta},$$
and then (10) becomes (5). Putting things together, we have shown that the optimal code length can be achieved by the Bayesian predictive coding scheme.

Redundancy
Now we examine the empirical code length under Shannon's setting. That is, we evaluate the redundancy of the code length assuming the source follows a hypothetical distribution.

Proposition 2 If we assume that a source follows an exponential family distribution, then
$$nH(\hat\theta_n) - nH(\theta) = -C_n(\omega)(\log\log n) + o(1), \quad (11)$$
where the sequence of nonnegative random variables $\{C_n(\omega)\}$ has the property $\lim_{n\to\infty} C_n(\omega) \le d$ for almost all paths $\omega$. If we further assume that $\int_\Theta |I(\theta)|^{1/2}\, d\theta < \infty$, then
$$R_{NML} = L_{NML} - nH(\theta) = \frac{d}{2}\log\frac{n}{2\pi} - C_n(\omega)(\log\log n) + \log\int_\Theta |I(\theta)|^{1/2}\,d\theta + o(1), \quad (12)$$
where $\{C_n(\omega)\}$ is the same as above.

The difference in the first part is $(-\sum_{i=1}^n \log p(X_i\mid\hat\theta_n)) - (-\sum_{i=1}^n \log p(X_i\mid\theta_0))$, and the rest is true according to the proof of Proposition 2.2 in [16], Equation (18). The NML code is a special case of the mixture code, whose redundancy is given by Theorem 2.2 in [16]. We note that $\lim_{n\to\infty} C_n(\omega)$ is bounded below by 1. This proposition confirms that $nH(\hat\theta_n)$ is an underestimate of the compression bound. Although $\log\log n$ grows slowly, the term, as shown by the example in Table 1, gets large as the model dimension $d$ increases.

Coding of discrete symbols and the multinomial model
For compressing strings of discrete symbols, it is sufficient to consider the discrete distribution specified by a probability vector $\theta = (p_1, p_2, \dots, p_d, p_{d+1})$. Its frequency function is $P(X = k) = \prod_{j=1}^{d+1} p_j^{1(X=j)}$. The Fisher information matrix $I(p_1, \dots, p_d)$ can be shown to be
$$-\Big[E\,\frac{\partial^2\log P(X=k)}{\partial p_j\,\partial p_k}\Big]_{j,k=1,\dots,d} = \begin{pmatrix} \frac{1}{p_1}+\frac{1}{p_{d+1}} & \frac{1}{p_{d+1}} & \cdots & \frac{1}{p_{d+1}} \\ \frac{1}{p_{d+1}} & \frac{1}{p_2}+\frac{1}{p_{d+1}} & \cdots & \frac{1}{p_{d+1}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{1}{p_{d+1}} & \frac{1}{p_{d+1}} & \cdots & \frac{1}{p_d}+\frac{1}{p_{d+1}} \end{pmatrix}.$$
Thus $|I(p_1, \dots, p_d)| = 1/\prod_{k=1}^{d+1} p_k$.

Suppose $X_1, \dots, X_n$ are i.i.d. random variables obeying the above discrete distribution. Then the vector of counts of the $d+1$ categories follows a multinomial distribution $\mathrm{Multi}(n; p_1, p_2, \dots, p_d, p_{d+1})$. Its conjugate prior distribution is the Dirichlet distribution with parameters $(\alpha_1, \alpha_2, \dots, \alpha_{d+1})$, whose density function is
$$\frac{\Gamma(\sum_{k=1}^{d+1}\alpha_k)}{\prod_{k=1}^{d+1}\Gamma(\alpha_k)}\prod_{k=1}^{d+1} p_k^{\alpha_k - 1},$$
where $\Gamma(t) = \int_0^{\infty} u^{t-1} e^{-u}\, du$. The Jeffreys prior is proportional to $|I(p_1, \dots, p_d)|^{1/2}$; in this case, it equals Dirichlet(1/2, 1/2, ..., 1/2), whose density is
$$\frac{\Gamma((d+1)/2)}{\Gamma(1/2)^{d+1}}\prod_{k=1}^{d+1} p_k^{-1/2}.$$
The Jeffreys prior was also used by Krichevsky [15] to derive optimal universal codes. It is noticed that $\Gamma(1/2) = \sqrt\pi$. Plugging it into Equation (10), we have the following specific form of the NML code length for the multinomial distribution. Remember that the distribution or word frequencies are specific to a given dictionary $\Phi$, and we thus term it $L_{NML@\Phi}$. If we change the dictionary, the code length changes accordingly.
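The determinant formula $|I(p_1,\dots,p_d)| = 1/\prod_{k=1}^{d+1} p_k$ can be verified numerically. The sketch below (illustrative code using NumPy, not from the paper) builds the matrix for a randomly chosen probability vector and compares its determinant with the closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(5))            # probability vector, d + 1 = 5 categories
d = len(p) - 1

# Fisher information matrix for the free parameters p_1, ..., p_d:
# I_jk = delta_jk / p_j + 1 / p_{d+1}
I = np.diag(1.0 / p[:d]) + 1.0 / p[d]

print(f"det(I)        = {np.linalg.det(I):.6f}")
print(f"1 / prod(p_k) = {1.0 / np.prod(p):.6f}")
```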
Proposition 3 (Optimal code length for a multinomial distribution)
$$L_{NML@\Phi} = nH(\hat\theta_n) + \frac{d}{2}\log n - \frac{d}{2} - \log\Gamma\Big(\frac{d+1}{2}\Big) + \frac{1}{2}\log\pi, \quad (13)$$
where $nH(\hat\theta_n) = -n\sum_{k=1}^{d+1}\hat p_k\log\hat p_k$ and $\hat p_k = n_k/n$ is the frequency of the $k$-th word appearing in the string.

4 Compression of random sequences and DNA sequences

Lossless compression bound and description length
Given a dictionary of words, we parse a string into words and count their frequencies $\hat p_k = n_k/n$, the total number of words $n$, and the number of distinct words $d$. Plugging them into expression (13), we obtain the lossless compression bound for this dictionary or parsing. If a different parsing is tried, the three quantities, namely the word frequencies, the number of words, and the dictionary size (the number of distinct words), would change, and the resulting bound would change accordingly. In the general situation where the data are not necessarily discrete symbols, we replace the code length with the description length (10), as termed by Rissanen. Since each parsing corresponds to a probabilistic model, the code length is model-dependent. The comparison of two or more coding schemes is exactly the selection of models, with the expression (13) as the target function.

Rissanen's principle of minimum description length and model selection
Rissanen, in his works [23, 24, 25], etc., proposed the principle of minimum description length (MDL) as a more general modeling rule than that of maximum likelihood, which was recommended, analyzed, and popularized by R. A. Fisher. From the information-theoretic point of view, when we encode data from a source by prefix coding, the optimal code is the one that achieves the minimum description length. Because of the equivalence between a prefix code length and the negative logarithm of the corresponding probability distribution, via Kraft's inequality, this in turn gives us a modeling principle, namely, the MDL principle: choose the model or prefix coding algorithm that gives the minimal description of the data; see Hansen and Yu [13] for a review on this topic. We also refer readers to [3, 12] for a more complete account of MDL. MDL is a mathematical formulation of the general principle known as Occam's razor: choose the simplest explanation consistent with the observed data [9].

We make one remark about the significance of MDL. On the one hand, Shannon's work establishes the connection between optimal coding and probabilistic models. On the other hand, Kolmogorov's algorithmic theory says that the complexity, or the absolute optimal coding, cannot be proved by any Turing machine. MDL offers a practical principle: it allows us to make choices among possible models and coding algorithms without the desire to prove optimality. As more and more candidate models are evaluated over time, human understanding progresses.
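In the examples below, expression (13) is evaluated for each candidate parsing and used as the target function. A minimal implementation is sketched here (illustrative Python, not the authors' code; the helper names are ours): it parses a symbol string into fixed-length words, counts their frequencies, and returns the bound (13) in bits.

```python
import math
from collections import Counter

def l_nml_bits(counts):
    """Expression (13): NML code length in bits for observed word counts."""
    n = sum(counts)
    d = len(counts) - 1                       # model dimension = dictionary size - 1
    nH = -sum(c * math.log2(c / n) for c in counts)
    return (nH + 0.5 * d * math.log2(n) - 0.5 * d
            - math.lgamma((d + 1) / 2) / math.log(2) + 0.5 * math.log2(math.pi))

def fixed_length_bound(s, w):
    """Parse s into non-overlapping words of length w and evaluate (13)."""
    words = [s[i:i + w] for i in range(0, len(s) - w + 1, w)]
    return l_nml_bits(list(Counter(words).values()))

s = "ACGT" * 750                              # a toy 3000-symbol string
for w in (1, 2, 3, 4):
    print(f"word length {w}: L_NML = {fixed_length_bound(s, w):.1f} bits")
```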
Compression bounds of random sequences
A random sequence is non-compressible by any model-based or algorithmic prefix coding, as indicated by the complexity results [17, 34]. Thus a legitimate compression bound of a random sequence should be no less than 1 up to certain variations. Conversely, if the compression rate of a sequence using $L_{NML@\Phi}$ as the compression bound is no less than 1 under all dictionaries $\Phi$, namely,
$$\min_{\Phi:\ \mathrm{dictionaries}} \frac{L_{NML@\Phi}}{L_{RAW}} = 1 + \frac{L_{NML@\Phi} - L_{RAW}}{L_{RAW}} \ge 1,$$
where $L_{RAW}$ is the length of the raw sequence in bits, then the sequence is random. If we assume the source is from a uniform distribution, $L_{RAW} = nH$, and the difference $L_{NML@\Phi} - L_{RAW}$ is essentially the redundancy of $L_{NML@\Phi}$, which can be calculated by (13). Although it is challenging to test all dictionaries, we can try some, particularly those suggested by domain experience.

A simulation study: compression bounds of pseudo-random sequences
Simulations were carried out to test the theoretical bounds. First, a pseudo-random binary string of size 3000 was simulated in R according to Bernoulli trials with a probability of 0.5. In Table 1, the first column shows the word length used for parsing the data; the second column shows the word number; the third column shows the number of distinct words. We group the terms in (13) into three parts: the term involving $n$, the term involving $\log n$, and the others. The bounds by $nH(\hat\theta_n)$, $nH(\hat\theta_n) + \frac{d}{2}\log n$, and $L_{NML}$ in (13) are respectively shown in the next three columns. When the word length increases, $d$ increases, and the bounds by $nH(\hat\theta_n)$ show a decreasing trend. A bound smaller than 1 indicates the sequence can be compressed, contradicting the assertion that random sequences cannot be so. When the word length is 8, the dictionary size is 196, and the bound by $nH(\hat\theta_n)$ is only 0.929. The incompressibility of random sequences falsifies $nH(\hat\theta_n)$ as a legitimate compression bound. If the $\log n$ term is included, the bounds are always larger than 1. The bounds by $L_{NML}$ in (13) are tighter while remaining larger than 1, except for the case at the bottom row, where the number of distinct words approaches the total number of words. Since $L_{NML}$ is an achievable bound, $nH(\hat\theta_n) + \frac{d}{2}\log n$ is an overestimate.

Table 1: The data compression rates of a binary string of size 3000 under different parsing models. The data were simulated in R according to Bernoulli trials with probability 0.5.

| word length | word number | dictionary size | $nH(\hat\theta_n)$ | $nH(\hat\theta_n) + \frac{d}{2}\log n$ | $L_{NML}$ (13) |
|---|---|---|---|---|---|
| 1 | 3000 | 2 | 0.999918 | 1.001843 | 1.001952 |
| 2 | 1500 | 4 | 0.998591 | 1.003866 | 1.003641 |
| 3 | 1000 | 8 | 0.998961 | 1.010587 | 1.008834 |
| 4 | 750 | 16 | 0.995422 | 1.019299 | 1.012974 |
| 5 | 600 | 32 | 0.989603 | 1.037285 | 1.018977 |
| 6 | 500 | 64 | 0.981597 | 1.075738 | 1.027959 |
| 7 | 428 | 124 | 0.964885 | 1.144324 | 1.031261 |
| 8 | 375 | 196 | 0.928930 | 1.206829 | 1.006312 |
| 9 | 333 | 252 | 0.872958 | 1.223846 | 0.950283 |
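A sketch of this simulation is given below (illustrative Python rather than the original R script; the random seed and helper names are ours). It generates a Bernoulli(0.5) string of length 3000, parses it with word lengths 1 through 9, and reports the three bounds per input bit, in the spirit of Table 1.

```python
import math
import random
from collections import Counter

random.seed(2)
bits = "".join(random.choice("01") for _ in range(3000))

def bounds_per_bit(s, w):
    words = [s[i:i + w] for i in range(0, len(s) - w + 1, w)]
    counts = list(Counter(words).values())
    n, d = len(words), len(counts) - 1
    nH = -sum(c * math.log2(c / n) for c in counts)
    bic = nH + 0.5 * d * math.log2(n)
    nml = (nH + 0.5 * d * math.log2(n) - 0.5 * d
           - math.lgamma((d + 1) / 2) / math.log(2) + 0.5 * math.log2(math.pi))
    raw = n * w                              # raw length in bits
    return nH / raw, bic / raw, nml / raw

print("w   nH/raw   +(d/2)log n   L_NML (13)")
for w in range(1, 10):
    h, b, m = bounds_per_bit(bits, w)
    print(f"{w}   {h:.4f}   {b:.4f}        {m:.4f}")
```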
Knowledge discovery by data compression
On the other hand, if we can compress a sequence by a certain prefix coding scheme, then this sequence is not random. In the meantime, this coding scheme presents a clue to understanding the information structure hidden in the sequence. Data compression is one general learning mechanism, among others, to discover knowledge from nature and other sources. Ryabko, Astola and Gammerman [28] applied the idea of Kolmogorov complexity to the statistical testing of some typical hypotheses. This approach was used to analyze DNA sequences in [33].

DNA sequences of proteins
The information carried by the DNA double helix is two long complementary strings of the letters A, G, C, and T. It is interesting to see if we can compress DNA sequences at all. Next, we carried out lossless compression experiments on a couple of protein-encoding DNA sequences.

Rediscovery of the codon structure
In Table 2 we show the result of applying the NML code length $L_{NML}$ in (13) to an E. Coli protein gene sequence labeled b0059 [1], which has 2907 nucleotides. Each row corresponds to one model used for coding. All the models being tested are listed in the first column. In the first model, we encode the DNA nucleotides one by one and name it Model 1. In the second or third model, we parse the DNA sequence by pairs and then encode the resulting bi-nucleotide sequence according to their frequencies. Different starting positions lead to two different phases, and we denote them by 2.0 and 2.1 respectively. Other models are understood in the same fashion. Note that all these models are generated by fixed-length parsing. The last model, "a.a.", means we first translate DNA triplets into amino acids and then encode the resulting amino acid sequence. The second column shows the total number of words in each parsed sequence. The third column shows the number of different words in each parsed sequence, or the dictionary size. The fourth column is the empirical entropy estimated from the observed frequencies. The next column is the first term in expression (13), which is the product of the second and fourth columns. Then we calculate the rest of the terms in (13). The total bits are then calculated, and the compression rates are the ratios $L_{NML}/(2907 \times 2)$. The last column shows the compression rates under different models.

All the compression rates are around 1 except that obtained from Model 3.0, which represents the correct codon pattern and the correct phase. Thus the comparison of compression bounds rediscovers the codon structure of this protein-encoding DNA sequence and the phase of the open reading frame. It is somewhat surprising that the optimal code length $L_{NML}$ enables us to mathematically identify the triplet coding system using only the sequence of one gene. Historically, the system was discovered by Francis Crick and his colleagues in the early 1960s using frame-shift mutations of bacteriophage T4.

Next, we have a closer look at the results. The compression rate of the four-nucleotide word coding is closest to 1, and thus it behaves more like "random". For example, it is 0.9947 for Model 4.2. The first term of empirical entropy contributes 5431 bits, while the rest of the terms contribute 346 bits. If we use $\frac{d}{2}\log n$ instead, the rest term is $0.5\times(219-1)\times\log(726) \approx 1036$ bits, and the compression rate becomes 1.11, which is less tight. If the Ziv-Lempel algorithm is applied to the b0059 sequence, 635 words are generated along the way. Each word needs $\log(635)$ bits for keeping the address of its prefix and 2 bits for the last nucleotide. In total, it needs $635\times\log(635) \approx 5912$ bits for storing addresses, which correspond to the first term in (13), and $635\times 2 = 1270$ bits for storing the words' last letters, which correspond to the rest of the terms in (13). The compression rate of the Ziv-Lempel coding is 1.24.

Table 2: The data compression rates of E. Coli ORF b0059 calculated by (13) under different parsing models.

| model | # words n | dictionary size d | empirical entropy | 1st term | rest terms | $L_{NML@\Phi}$ total bits | compression rate |
|---|---|---|---|---|---|---|---|
| 1 | 2907 | 4 | 1.9924 | 5792.00 | 16.58 | 5808.58 | 0.9991 |
| 2.0 | 1453 | 16 | 3.9570 | 5749.49 | 59.81 | 5809.31 | 0.9995 |
| 2.1 | 1453 | 16 | 3.9425 | 5728.44 | 59.81 | 5788.25 | 0.9959 |
| 3.0 | 969 | 58 | 5.2842 | 5120.39 | 157.11 | 5277.51 | 0.9077 |
| 3.1 | 968 | 63 | 5.5905 | 5411.63 | 167.13 | 5578.76 | 0.9605 |
| 3.2 | 968 | 64 | 5.6706 | 5489.10 | 169.11 | 5658.21 | 0.9742 |
| 4.0 | 726 | 218 | 7.4507 | 5409.24 | 345.07 | 5754.31 | 0.9908 |
| 4.1 | 726 | 217 | 7.4337 | 5396.87 | 344.20 | 5741.07 | 0.9885 |
| 4.2 | 726 | 219 | 7.4814 | 5431.49 | 345.94 | 5777.43 | 0.9947 |
| 4.3 | 726 | 221 | 7.4678 | 5421.64 | 347.67 | 5769.31 | 0.9933 |
| a.a. | 969 | 21 | 4.1056 | 3978.31 | 69.92 | 4048.22 | 0.6963 |
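The phase comparison in Table 2 can be reproduced schematically. The sketch below (illustrative Python; the sequence here is a synthetic stand-in, since the b0059 sequence itself is not reproduced in this article) parses a nucleotide string into triplets at the three possible phases, evaluates (13) for each, and reports the compression rate relative to 2 bits per nucleotide.

```python
import math
import random
from collections import Counter

def l_nml_bits(counts):
    """Expression (13) in bits, given the word counts of one parsing."""
    n, d = sum(counts), len(counts) - 1
    nH = -sum(c * math.log2(c / n) for c in counts)
    return (nH + 0.5 * d * math.log2(n) - 0.5 * d
            - math.lgamma((d + 1) / 2) / math.log(2) + 0.5 * math.log2(math.pi))

def triplet_rate(seq, phase):
    """Compression rate of codon (triplet) parsing starting at a given phase."""
    words = [seq[i:i + 3] for i in range(phase, len(seq) - 2, 3)]
    return l_nml_bits(list(Counter(words).values())) / (2 * len(seq))

# Synthetic "gene": a biased codon usage makes phase 0 the most compressible.
random.seed(3)
codons = random.choices(["ATG", "GGC", "AAA", "GAT", "CTG", "TTT"],
                        weights=[1, 5, 4, 3, 2, 1], k=1000)
seq = "".join(codons)

for phase in (0, 1, 2):
    print(f"phase {phase}: compression rate = {triplet_rate(seq, phase):.4f}")
```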
Redundant information in protein gene sequences
It is known that the $4^3 = 64$ triplets correspond to only 20 amino acids plus the stop codons. Thus redundancy does exist in protein gene sequences. Most of the redundancy lies in the third position of a codon. For example, GGA, GGC, GGT, and GGG all correspond to glycine. According to Table 2, there are 4048.22 bits of information in the amino acid sequence while there are 5277.51 bits of information in Model 3.0. Thus the redundancy in this sequence is estimated to be (5277.51 - 4048.22)/4048.22 = 0.30.

Randomization
To evaluate the accuracy or significance of the compression rates of a DNA sequence, we need a reference distribution for comparison. A typical method is to consider the randomness obtained by permutations. That is, given a DNA sequence, we permute the nucleotide bases and re-calculate the compression rates. If we repeat this permutation procedure, then a reference distribution is generated.

In Table 3, we consider the compression rates for E. Coli ORF b0060, which has 2352 nucleotides. First, the optimal compression rate of 0.958 is achieved at Model 3.0. Second, we further carried out the calculations for permuted sequences. The averages, standard deviations, and 1% lower quantiles of compression rates under different models are shown in Table 3 as well. Except for Model 1, all the compression rates, in terms of either averages or lower quantiles, are significantly above 1. Third, the results by the single term $nH(\hat\theta_n)$ are about 0.996, 0.994, 0.986, and 0.952 respectively for the one-, two-, three-, and four-nucleotide models. The 99% quantiles of $nH(\hat\theta_n)$ for the four-nucleotide models are no larger than 0.961. Thus the difference between $nH(\hat\theta_n)$ and $nH(\theta)$ as shown in (11) increases as the dictionary size goes up. Fourth, the results of $nH(\hat\theta_n) + \frac{d}{2}\log n$ show extra bits compared to those of $L_{NML}$, and the compression ratios go from 1.02 to 1.17, suggesting the rest of the terms in (13) are not negligible. It is noted that Models 3.1 and 3.2 are obtained by phase-shifting from the correct Model 3.0. Other models are obtained by incorrect parsing. These models can serve as references for Model 3.0. The incorrect parsing and phase-shifting have a flavor of the linear congruential pseudo-random number generator, and play the role of randomization.

Table 3: The data compression rates of E. Coli ORF b0060 and statistics from permutations. The protein gene sequence has 2352 nucleotide bases.

| Model | 1.0 | 2.0 | 2.1 | 3.0 | 3.1 | 3.2 | 4.0 | 4.1 | 4.2 | 4.3 |
|---|---|---|---|---|---|---|---|---|---|---|
| Original $L_{NML}$ (13) | 0.999 | 1.000 | 1.001 | 0.958 | 0.980 | 0.989 | 1.001 | 1.002 | 0.997 | 0.999 |
| $L_{NML}$ (13): average | 0.999 | 1.006 | 1.006 | 1.020 | 1.020 | 1.020 | 1.020 | 1.020 | 1.020 | 1.020 |
| SD (x 10^-3) | 0.00 | 0.74 | 0.76 | 1.69 | 1.77 | 1.71 | 4.22 | 4.33 | 4.15 | 3.92 |
| 1%-quantile | 0.999 | 1.004 | 1.004 | 1.016 | 1.016 | 1.016 | 1.011 | 1.010 | 1.011 | 1.011 |
| $nH(\hat\theta_n)$: average | 0.996 | 0.994 | 0.994 | 0.986 | 0.987 | 0.986 | 0.952 | 0.952 | 0.952 | 0.952 |
| SD (x 10^-3) | 0.00 | 0.74 | 0.76 | 1.69 | 1.77 | 1.71 | 3.69 | 3.79 | 3.63 | 3.42 |
| 99%-quantile | 0.996 | 0.995 | 0.995 | 0.990 | 0.990 | 0.990 | 0.961 | 0.961 | 0.960 | 0.960 |
| $nH(\hat\theta_n) + \frac{d}{2}\log n$: average | 0.999 | 1.010 | 1.010 | 1.051 | 1.051 | 1.051 | 1.174 | 1.174 | 1.174 | 1.174 |
| SD (x 10^-3) | 0.00 | 0.74 | 0.76 | 1.70 | 1.78 | 1.71 | 7.49 | 7.66 | 7.37 | 7.01 |
| 1%-quantile | 0.999 | 1.008 | 1.008 | 1.046 | 1.047 | 1.047 | 1.157 | 1.156 | 1.157 | 1.157 |
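A permutation reference distribution of the kind summarized in Table 3 can be generated along the following lines (illustrative Python on a synthetic sequence; the real analysis used the b0060 gene). The nucleotides are shuffled repeatedly, the bound (13) is recomputed for the codon parsing, and the average and a lower quantile of the resulting rates are reported.

```python
import math
import random
from collections import Counter

def l_nml_rate(seq, w):
    """Compression rate under fixed-length parsing of word length w, by (13)."""
    words = [seq[i:i + w] for i in range(0, len(seq) - w + 1, w)]
    counts = list(Counter(words).values())
    n, d = len(words), len(counts) - 1
    nH = -sum(c * math.log2(c / n) for c in counts)
    bits = (nH + 0.5 * d * math.log2(n) - 0.5 * d
            - math.lgamma((d + 1) / 2) / math.log(2) + 0.5 * math.log2(math.pi))
    return bits / (2 * len(seq))

random.seed(4)
seq = "".join(random.choices("ACGT", weights=[5, 3, 3, 1], k=2352))

rates = []
for _ in range(200):                       # permutation replicates
    perm = list(seq)
    random.shuffle(perm)
    rates.append(l_nml_rate("".join(perm), 3))

rates.sort()
print(f"original rate      = {l_nml_rate(seq, 3):.4f}")
print(f"permutation mean   = {sum(rates) / len(rates):.4f}")
print(f"1% lower quantile  = {rates[len(rates) // 100]:.4f}")
```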
5 Discussion

Putting together the analytical results and numerical examples, we show that the compression bound of a data sequence using an exponential family is the code length derived from the NML distribution (5). The empirical bound can be implemented by the Bayesian predictive coding for any given dictionary or model. Different models are then compared by their empirical compression bounds.

The examples of DNA sequences indicate that the compression rates by any dictionaries are indeed larger than 1 for random sequences, in consistency with the assertions of the Kolmogorov complexity theory. Conversely, if significant compression is achieved by a specific model, certain knowledge is gained. The codon structure is such an instance.

Unlike the algorithmic complexity, which contains a constant, the results based on probability distributions give the exact bits of code lengths. All three terms in (5) are important for the compression bound. Using only the first term $nH(\hat\theta_n)$ can lead to bounds of random sequences smaller than 1. The gap gets larger as the dictionary size increases, as seen from Tables 1 and 3. The bound obtained by adding the second term $\frac{d}{2}\log n$ had been proposed by the two-part coding or the Kolmogorov complexity. It is equivalent to BIC, widely used in model selection. However, it overestimates the influence of the dictionary size, as shown by the examples of simulations and DNA sequences. The inclusion of the Fisher information in the third term gives a tighter bound. The terms other than $nH(\hat\theta_n)$ get larger as the dictionary size increases in Tables 1 and 3. The observation that the compression bounds from all terms in (5) kept slightly above 1 for all tested dictionaries meets our expectation of the incompressibility of random sequences.

Although the empirical compression bound is obtained under the i.i.d. model, the word length can be set rather large to describe the local dependence between symbols. Indeed, as shown in the examples of DNA sequences, the empirical entropy term in (5) can get smaller, for either the original sequences or the permuted ones. Meanwhile, the second term can get larger. For a specific sequence, a better dictionary is selected by trading off the entropy part and the model complexity part.

Rissanen [25] obtained an expansion of the NML code length, in which the first term is the negative log-likelihood of the data with the parameters plugged in by the MLE. In this article, we show it is exactly the empirical entropy if the parametric model takes any exponential family. According to this formula, the NML code length is an empirical version or a direct extension of Shannon's source coding theorem. Furthermore, the asymptotics in [25] requires five assumptions, which are hard to examine.
Suzuki and Yamanishi proposed a Fourier approach to calculate the NML code length [32] for continuous random variables under certain assumptions. Instead, we show (5) is valid for exponential families, as long as $\int_\Theta |I(\theta)|^{1/2}\, d\theta < \infty$, without any other assumptions. If the Jeffreys prior is improper in the interior of the full parameter space, we can restrict the parameter to a compact subset. The exponential families include not only distributions of discrete symbols such as multinomials but also continuous signals such as those from the normal distribution.

The mathematics underlying the expansion of NML is the structure of local asymptotic normality proposed by Le Cam [6]. LAN has been used to show the optimality of certain statistical estimates. This article connects LAN to the compression bound. We have shown that as long as LAN is valid, a similar expansion of (2) can be obtained.

6 Appendix

This section contains the proofs of the results in Sections 2 and 3. The following basic facts about the exponential family (4) are needed; see [4].
1. $E(S(X)) = \dot A(\theta)$, and $\mathrm{Var}(S(X)) = \ddot A(\theta)$.
2. $\dot A(\cdot)$ is one to one on the natural parameter space.
3. The MLE $\hat\theta_n$ based on $(X_1, \dots, X_n)$ is given by $\hat\theta_n = \dot A^{-1}(\bar S_n)$, where $\bar S_n = \frac{1}{n}\sum_{i=1}^n S(X_i)$.
4. The Fisher information matrix $I(\theta) = \ddot A(\theta)$.

Proof of Theorem 1. In the canonical exponential family, the natural parameter space is open and convex. Since $\int_\Theta |I(\theta)|^{1/2}\, d\theta < \infty$, we can find a series of bounded sets $\{\Theta^{(k)}, k = 1, 2, \dots\}$ such that
$$\log\Big[\int_\Theta |I(\theta)|^{1/2}\, d\theta\Big] - \log\Big[\int_{\Theta^{(k)}} |I(\theta)|^{1/2}\, d\theta\Big] = \epsilon_k,$$
where $\epsilon_k \to 0$. Furthermore, we can select each bounded set $\Theta^{(k)}$ so that it can be partitioned into disjoint cubes, each of which is denoted by $U(\theta_j^{(k)}, \frac{r}{\sqrt n})$ with $\theta_j^{(k)}$ as its center and $\frac{r}{\sqrt n}$ as its side length. Namely, $\Theta^{(k)} = \bigcup_j U(\theta_j^{(k)}, \frac{r}{\sqrt n})$, and $U(\theta_{j_1}^{(k)}, \frac{r}{\sqrt n}) \cap U(\theta_{j_2}^{(k)}, \frac{r}{\sqrt n}) = \emptyset$ for $j_1 \ne j_2$. The normalizing constant in equation (6) can be summed (integrated in the case of continuous variables) by the sufficient statistic $\sum_{i=1}^n S(x_i)$, and in turn by the MLE $\hat\theta_n$:
$$\sum_{\{x^{(n)}:\ \hat\theta_n\in\Theta^{(k)}\}} p(x^{(n)}; \hat\theta_n) = \sum_{U(\theta_j^{(k)}, r/\sqrt n)}\ \sum_{\{x^{(n)}:\ \hat\theta_n\in U(\theta_j^{(k)}, r/\sqrt n)\}} p(x^{(n)}; \hat\theta_n), \quad (14)$$
$$p(x^{(n)}; \hat\theta_n) = e^{\hat\theta_n^T\sum_{i=1}^n S(x_i) - nA(\hat\theta_n)}\,\mu_n(dx^n) = e^{n\hat\theta_n^T \bar S_n - nA(\hat\theta_n)}\,\mu_n(dx^n). \quad (15)$$
Now expand $n[\theta^T\bar S_n - A(\theta)]$ around $\hat\theta_n$ within the neighborhood $U(\theta_j^{(k)}, \frac{r}{\sqrt n})$:
$$n[\theta^T\bar S_n - A(\theta)] = n[\hat\theta_n^T\bar S_n - A(\hat\theta_n)] + (\theta - \hat\theta_n)^T[n\bar S_n - n\dot A(\hat\theta_n)] - \frac{1}{2}(\theta - \hat\theta_n)^T[n\ddot A(\hat\theta_n)](\theta - \hat\theta_n) + M_1 n\|\theta - \hat\theta_n\|^3.$$
Since the MLE $\hat\theta_n = \dot A^{-1}(\bar S_n)$, the second term is zero. Furthermore, we expand $\ddot A(\hat\theta_n)$ around $\ddot A(\theta)$ and rearrange the terms in the equation; then we have
$$n[\hat\theta_n^T\bar S_n - A(\hat\theta_n)] = n[\theta^T\bar S_n - A(\theta)] + \frac{1}{2}(\hat\theta_n - \theta)^T[n\ddot A(\theta)](\hat\theta_n - \theta) + M_2 n\|\theta - \hat\theta_n\|^3,$$
where the constant $M_2$ involves the third-order derivatives of $A(\theta)$, which are continuous in the canonical exponential family and thus bounded in the bounded set $\Theta^{(k)}$. In other words, $M_2$ is bounded uniformly across all $\{U(\theta_j^{(k)}, \frac{r}{\sqrt n})\}$. Similar bounded constants will be used repeatedly hereafter.
Then equation (15) becomes
$$p(x^{(n)}; \hat\theta_n) = e^{\frac{1}{2}(\theta - \hat\theta_n)^T[n\ddot A(\theta)](\theta - \hat\theta_n) + M_2 n\|\theta - \hat\theta_n\|^3}\; e^{n[\theta^T\bar S_n - A(\theta)]}\,\mu_n(dx^n).$$
Notice that the exponential form $e^{n[\theta^T\bar S_n - A(\theta)]}$ is the density of $n\bar S_n$. If we consider i.i.d. random variables $Y_1, \dots, Y_n$ sampled from the exponential distribution (4), the MLE $\hat\theta(Y^{(n)})$ is a random variable. Take $\theta = \theta_j^{(k)}$; then the sum of the above quantity over the neighborhood $U(\theta_j^{(k)}, \frac{r}{\sqrt n})$ is nothing but the expectation of $\hat\theta(Y^{(n)}) = \dot A^{-1}(\bar S_n)$ with respect to the distribution of $n\bar S_n$, evaluated at the parameter $\theta_j^{(k)}$:
$$\sum_{\{x^{(n)}:\ \hat\theta_n\in U(\theta_j^{(k)}, r/\sqrt n)\}} p(x^{(n)}; \hat\theta_n) = E_{\theta_j^{(k)}}\Big[1\big[\hat\theta_n\in U(\theta_j^{(k)}, \tfrac{r}{\sqrt n})\big]\, e^{\frac{1}{2}(\hat\theta_n-\theta_j^{(k)})^T[n\ddot A(\theta_j^{(k)})](\hat\theta_n-\theta_j^{(k)}) + M_2 n\|\hat\theta_n-\theta_j^{(k)}\|^3}\Big].$$
Let $\xi_n = \sqrt n\,(\hat\theta_n - \theta_j^{(k)})$. Now $\hat\theta_n \in U(\theta_j^{(k)}, \frac{r}{\sqrt n})$ if and only if $\xi_n \in U(0, r)$, where $U(0, r)$ is the $d$-dimensional cube centered at zero with side length $r$. Next, expanding $e^{M_2 n\|\hat\theta_n - \theta_j^{(k)}\|^3}$ in the neighborhood, the above becomes
$$\sum_{\{\hat\theta_n\in U(\theta_j^{(k)}, r/\sqrt n)\}} p(x^{(n)}; \hat\theta_n) = E\Big[1[\xi_n\in U(0,r)]\, e^{\frac{1}{2}\xi_n^T I(\theta_j^{(k)})\xi_n}\,(1 + M_3 n^{-\frac{1}{2}})\Big].$$
According to the central limit theorem, $n^{-\frac{1}{2}}\big[\sum_{i=1}^n S(Y_i) - n\dot A(\theta_j^{(k)})\big] \xrightarrow{d} N(0, \ddot A(\theta_j^{(k)}))$. Moreover, the approximation error has the Berry-Esseen bound $O(n^{-\frac{1}{2}})$, where the constant is determined by the bound on the third-order derivatives of $A(\theta)$. Similarly, we have the asymptotic normality of the MLE, $\xi_n(Y^{(n)}) \xrightarrow{d} N(0, I(\theta_j^{(k)})^{-1})$, where the Berry-Esseen bound is valid for the convergence; see [18]. Therefore, the expectation converges as follows:
$$\begin{aligned}
E\Big\{1[\xi_n\in U(0,r)]\, e^{\frac{1}{2}\xi_n^T I(\theta_j^{(k)})\xi_n}(1+M_3 n^{-\frac{1}{2}})\Big\}
&= \int 1[\xi_n\in U(0,r)]\, e^{\frac{1}{2}\xi_n^T I(\theta_j^{(k)})\xi_n}(1+M_3 n^{-\frac{1}{2}})\, \frac{|I(\theta_j^{(k)})|^{1/2}}{(2\pi)^{d/2}}\, e^{-\frac{1}{2}\xi_n^T I(\theta_j^{(k)})\xi_n}\, d\xi_n + M_4 n^{-\frac{1}{2}} \\
&= (2\pi)^{-\frac{d}{2}} |I(\theta_j^{(k)})|^{\frac{1}{2}} \int 1[\xi_n\in U(0,r)](1+M_3 n^{-\frac{1}{2}})\, d\xi_n + M_4 n^{-\frac{1}{2}} \\
&= (2\pi)^{-\frac{d}{2}} |I(\theta_j^{(k)})|^{\frac{1}{2}}\, r^d (1+M_3 n^{-\frac{1}{2}}) + M_4 n^{-\frac{1}{2}} \\
&= n^{\frac{d}{2}} (2\pi)^{-\frac{d}{2}} |I(\theta_j^{(k)})|^{1/2}\, (r^d n^{-\frac{d}{2}})(1 + M_3' n^{-\frac{1}{2}}).
\end{aligned}$$
Plugging this into the sum (14), we obtain
$$\sum_{\{x^{(n)}:\ \hat\theta_n\in\Theta^{(k)}\}} p(x^{(n)}; \hat\theta_n) = n^{\frac{d}{2}}(2\pi)^{-\frac{d}{2}}\Big[\sum_{U(\theta_j^{(k)}, r/\sqrt n)} |I(\theta_j^{(k)})|^{1/2}(r^d n^{-\frac{d}{2}})\Big](1+M_3' n^{-\frac{1}{2}}) \longrightarrow n^{\frac{d}{2}}(2\pi)^{-\frac{d}{2}}\Big[\int_{\Theta^{(k)}} |I(\theta)|^{1/2}\, d\theta\Big](1+M_3' n^{-\frac{1}{2}}),$$
and hence
$$\begin{aligned}
\log\Big[\sum_{\{x^{(n)}:\ \hat\theta_n\in\Theta^{(k)}\}} p(x^{(n)}; \hat\theta_n)\Big]
&= \frac{d}{2}\log\frac{n}{2\pi} + \log\int_{\Theta^{(k)}} |I(\theta)|^{1/2}\, d\theta + M_3'' n^{-\frac{1}{2}} \\
&= \frac{d}{2}\log\frac{n}{2\pi} + \log\int_{\Theta} |I(\theta)|^{1/2}\, d\theta + \Big[\log\int_{\Theta^{(k)}} |I(\theta)|^{1/2}\, d\theta - \log\int_{\Theta} |I(\theta)|^{1/2}\, d\theta\Big] + M_3'' n^{-\frac{1}{2}} \\
&= \frac{d}{2}\log\frac{n}{2\pi} + \log\int_{\Theta} |I(\theta)|^{1/2}\, d\theta - \epsilon_k + M_3'' n^{-\frac{1}{2}}.
\end{aligned}$$
Note that the bound $M_3''$ of the last term relies solely on $\Theta^{(k)}$. For a given $k$, we select $n$ such that the last term is sufficiently small. This completes the proof.

Proof of Theorem 2. First we consider the conjugate prior of (4), which takes the form
$$u(\theta) = \exp\{\alpha'\theta - \beta A(\theta) - B(\alpha, \beta)\}, \quad (16)$$
where $\alpha$ is a vector in $R^d$, $\beta$ is a scalar, and $B(\alpha, \beta) = \log\int_\Theta \exp\{\alpha'\theta - \beta A(\theta)\}\, d\theta$.
Proof of Theorem 2. First we consider the conjugate prior of (4), which takes the form
\[
u(\theta) = \exp\{\alpha^T\theta - \beta A(\theta) - B(\alpha, \beta)\}, \tag{16}
\]
where $\alpha$ is a vector in $R^d$, $\beta$ is a scalar, and $B(\alpha, \beta) = \log\int_\Theta \exp\{\alpha^T\theta - \beta A(\theta)\}\, d\theta$. Then the marginal density is
\[
m(x^{(n)}) = \exp\Big\{B\Big(\sum_{i=1}^n S(x_i) + \alpha,\ n + \beta\Big) - B(\alpha, \beta)\Big\}, \tag{17}
\]
according to the definition of $B(\cdot, \cdot)$. Therefore
\[
L_{mixture} = B(\alpha, \beta) - B\Big(\sum_{i=1}^n S(x_i) + \alpha,\ n + \beta\Big) = B(\alpha, \beta) - \log\Big(\int_\Theta \exp\{nL_n(t)\}\, dt\Big),
\]
where $nL_n(t) = [\sum_{i=1}^n S(x_i) + \alpha]^T t - (n + \beta)A(t)$. The maximum of $L_n(t)$ is achieved at
\[
\tilde\theta_n = \tilde\theta(x^{(n)}) = \dot A^{-1}\Big(\frac{\sum_{i=1}^n S(x_i) + \alpha}{n + \beta}\Big).
\]
Notice that
\[
\dot A(\tilde\theta_n) = \frac{\sum_{i=1}^n S(x_i) + \alpha}{n + \beta} = \frac{\sum_{i=1}^n S(x_i)}{n} + O\Big(\frac1n\Big) = \dot A(\hat\theta_n) + O\Big(\frac1n\Big).
\]
Through Taylor's expansion, it can be shown that
\[
\tilde\theta_n = \hat\theta_n + O\Big(\frac1n\Big). \tag{18}
\]
Notice that $-\ddot L_n(t) = \frac{n+\beta}{n}\ddot A(t)$. Let $\Sigma = -\ddot L_n^{-1}(\tilde\theta_n) = \frac{n}{n+\beta}\ddot A^{-1}(\tilde\theta_n)$. By expanding $L_n(t)$ at the maximizer $\tilde\theta_n$ and applying the Laplace method (see [5]), we have
\[
\log\Big(\int_\Theta \exp\{nL_n(t)\}\, dt\Big) = -\frac d2\log\frac{n}{2\pi} + \frac12\log(\det\Sigma) + nL_n(\tilde\theta_n) + O\Big(\frac1n\Big)
\longrightarrow -\frac d2\log\frac{n}{2\pi} - \frac12\log(\det I(\tilde\theta_n)) + nL_n(\tilde\theta_n).
\]
Next,
\[
\begin{aligned}
L_{mixture} - nH(\hat\theta_n)
&\longrightarrow B(\alpha, \beta) + \frac d2\log\frac{n}{2\pi} + \frac12\log(\det I(\tilde\theta_n)) - nL_n(\tilde\theta_n) - nH(\hat\theta_n) \\
&= B(\alpha, \beta) + \frac d2\log\frac{n}{2\pi} + \frac12\log(\det I(\tilde\theta_n)) - \Big[\sum_{i=1}^n S(x_i) + \alpha\Big]^T\tilde\theta_n + (n+\beta)A(\tilde\theta_n) - nA(\hat\theta_n) + \Big[\sum_{i=1}^n S(x_i)\Big]^T\hat\theta_n \\
&= \frac d2\log\frac{n}{2\pi} + \frac12\log(\det I(\tilde\theta_n)) - [\alpha^T\tilde\theta_n - \beta A(\tilde\theta_n) - B(\alpha, \beta)] - \Big[\sum_{i=1}^n S(x_i)\Big]^T(\tilde\theta_n - \hat\theta_n) + n[A(\tilde\theta_n) - A(\hat\theta_n)] \\
&= \frac d2\log\frac{n}{2\pi} + \frac12\log(\det I(\tilde\theta_n)) - \log w(\tilde\theta_n) - \Big[\sum_{i=1}^n S(x_i)\Big]^T(\tilde\theta_n - \hat\theta_n) + n\dot A(\hat\theta_n)^T(\tilde\theta_n - \hat\theta_n) + \frac n2(\tilde\theta_n - \hat\theta_n)^T\ddot A(\hat\theta_n)(\tilde\theta_n - \hat\theta_n) + O\Big(\frac1n\Big) \\
&= \frac d2\log\frac{n}{2\pi} + \frac12\log(\det I(\tilde\theta_n)) - \log w(\tilde\theta_n) - n[\bar S_n - \dot A(\hat\theta_n)]^T(\tilde\theta_n - \hat\theta_n) + \frac n2(\tilde\theta_n - \hat\theta_n)^T\ddot A(\hat\theta_n)(\tilde\theta_n - \hat\theta_n) + O\Big(\frac1n\Big) \\
&\longrightarrow \frac d2\log\frac{n}{2\pi} + \frac12\log(\det I(\hat\theta_n)) - \log w(\hat\theta_n).
\end{aligned}
\]
The last step is valid because $\hat\theta_n = \dot A^{-1}(\bar S_n)$ and (18). This proves the case of the prior $u(\theta)$ in (16). Meanwhile, we obtain the expansion of (17):
\[
\begin{aligned}
m(x^{(n)}) = \exp\{-L_{mixture}\} &= \exp\Big\{B\Big(\sum_{i=1}^n S(x_i) + \alpha,\ n+\beta\Big) - B(\alpha, \beta)\Big\} \\
&= \exp\Big\{-nH(\hat\theta_n) - \frac d2\log\frac{n}{2\pi} - \frac12\log(\det I(\hat\theta_n)) + \log w(\hat\theta_n) + o(1)\Big\} \\
&= [w(\hat\theta_n) + o(1)]\exp\Big\{-nH(\hat\theta_n) - \frac d2\log\frac{n}{2\pi} - \frac12\log(\det I(\hat\theta_n))\Big\}. \tag{19}
\end{aligned}
\]
Suppose the prior of $\theta$ takes the form of a finite mixture of the conjugate distributions (16):
\[
w(\theta) = \sum_{j=1}^J \lambda_j \exp\{\alpha_j^T\theta - \beta_j A(\theta) - B(\alpha_j, \beta_j)\} = \sum_{j=1}^J \lambda_j u_j(\theta), \tag{20}
\]
where $\sum_{j=1}^J \lambda_j = 1$ and $0 < \lambda_j < 1$, $j = 1, \cdots, J$. Then the marginal density is given by
\[
\begin{aligned}
m(x^{(n)}) &= \sum_{j=1}^J \lambda_j \exp\Big\{B\Big(\sum_{i=1}^n S(x_i) + \alpha_j,\ n+\beta_j\Big) - B(\alpha_j, \beta_j)\Big\} \\
&= \sum_{j=1}^J \lambda_j [u_j(\hat\theta_n) + o(1)]\exp\Big\{-nH(\hat\theta_n) - \frac d2\log\frac{n}{2\pi} - \frac12\log(\det I(\hat\theta_n))\Big\} \\
&= \exp\Big\{-nH(\hat\theta_n) - \frac d2\log\frac{n}{2\pi} - \frac12\log(\det I(\hat\theta_n))\Big\}\Big[\sum_{j=1}^J \lambda_j u_j(\hat\theta_n) + o(1)\Big] \\
&= \exp\Big\{-nH(\hat\theta_n) - \frac d2\log\frac{n}{2\pi} - \frac12\log(\det I(\hat\theta_n))\Big\}[w(\hat\theta_n) + o(1)] \\
&= \exp\Big\{-nH(\hat\theta_n) - \frac d2\log\frac{n}{2\pi} - \frac12\log(\det I(\hat\theta_n)) + \log w(\hat\theta_n) + o(1)\Big\},
\end{aligned}
\]
where each summand was approximated by (19). This completes the proof because $L_{mixture} = -\log\{m(x^{(n)})\}$.
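As a numerical companion to Theorem 2 (again an illustrative sketch under hypothetical choices, not the paper's own computation), the Bernoulli family with a conjugate Beta$(a,b)$ prior admits a closed-form marginal, so the exact pathwise mixture code length can be compared with $nH(\hat\theta_n) + \frac d2\log\frac{n}{2\pi} + \log\frac{|I(\hat\theta_n)|^{1/2}}{w(\hat\theta_n)}$. The expression is evaluated below in the mean parametrization, where $I(p) = 1/[p(1-p)]$; the ratio $|I|^{1/2}/w$ is unchanged by the reparametrization because the square-root information and the prior density pick up the same Jacobian.
\begin{verbatim}
from math import lgamma, log, pi

# Minimal sketch (illustration only): Bernoulli family with a Beta(a, b) prior.
# The sequence (k ones out of n) and the priors are hypothetical; lengths in nats.
def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def mixture_code_length(k, n, a, b):
    """Exact -log m(x^(n)) for a sequence with k ones under a Beta(a, b) prior."""
    return -(log_beta(a + k, b + n - k) - log_beta(a, b))

def asymptotic_code_length(k, n, a, b):
    p = k / n
    nH = -k * log(p) - (n - k) * log(1 - p)                            # n * H(p_hat)
    log_w = (a - 1) * log(p) + (b - 1) * log(1 - p) - log_beta(a, b)   # Beta(a, b) density at p_hat
    log_sqrt_I = -0.5 * log(p * (1 - p))                               # log |I(p_hat)|^{1/2}
    return nH + 0.5 * log(n / (2 * pi)) + log_sqrt_I - log_w

n, k = 10000, 3000
for a, b in [(0.5, 0.5), (1.0, 1.0), (2.0, 5.0)]:   # Jeffreys, uniform, informative
    print(f"Beta({a}, {b}):  exact = {mixture_code_length(k, n, a, b):.3f}"
          f"   asymptotic = {asymptotic_code_length(k, n, a, b):.3f}")
\end{verbatim}
With the Jeffreys prior Beta$(1/2, 1/2)$, the term $\log[|I(\hat\theta_n)|^{1/2}/w(\hat\theta_n)]$ reduces to $\log\pi$, matching the constant in the NML expansion of Theorem 1.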
7 Acknowledgement

The author is grateful to Prof. Bin Yu and Dr. Jorma Rissanen for their guidance in learning the topic. This research is supported by the National Key Research and Development Program of China (2022YFA1004801), the National Natural Science Foundation of China (Grant No. 32170679, 11871462, 91530105), the National Center for Mathematics and Interdisciplinary Sciences of the Chinese Academy of Sciences, and the Key Laboratory of Systems and Control of the CAS.

References

[1] E. coli genome and protein genes. https://www.ncbi.nlm.nih.gov/genome.
[2] H. Akaike. On entropy maximisation principle. In P. R. Krishnaiah, editor, Applications of Statistics, pages 27–41. Amsterdam: North Holland, 1970.
[3] A. Barron, J. Rissanen, and B. Yu. The minimum description length principle in coding and modeling. IEEE Trans. Inform. Theory, pages 2743–2760, 1998.
[4] L. D. Brown. Fundamentals of Statistical Exponential Families: With Applications in Statistical Decision Theory. Institute of Mathematical Statistics, USA, 1986.
[5] N. G. De Bruijn. Asymptotic Methods in Analysis. North-Holland: Amsterdam, 1958.
[6] L. Le Cam and G. Yang. Asymptotics in Statistics: Some Basic Concepts. Springer, 2000.
[7] B. S. Clarke and A. R. Barron. Jeffreys' prior is asymptotically least favorable under entropy risk. Journal of Statistical Planning and Inference, 41:37–64, 1994.
[8] T. M. Cover, P. Gacs, and R. M. Gray. Kolmogorov's contributions to information theory and algorithmic complexity. The Annals of Probability, 17(3):840–865, 1989.
[9] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 1991.
[10] A. P. Dawid. Present position and potential developments: some personal views, statistical theory, the prequential approach. J. R. Stat. Soc. Ser. B, 147, 1984.
[11] A. P. Dawid. Prequential analysis, stochastic complexity and Bayesian inference. In Fourth Valencia International Meeting on Bayesian Statistics, pages 15–20, 1992.
[12] P. Grünwald. The Minimum Description Length Principle. MIT Press, 2007.
[13] M. Hansen and B. Yu. Model selection and minimum description length principle. JASA, 96, 2001.
[14] D. Haussler. A general minimax result for relative entropy. IEEE Transactions on Information Theory, 43:1276–1280, 1997.
[15] R. E. Krichevsky. The connection between the redundancy and reliability of information about the source. Probl. Inform. Trans., 4:48–57, 1968.
[16] L. M. Li and B. Yu. Iterated logarithmic expansions of the pathwise code lengths for exponential families. IEEE Transactions on Information Theory, 46:2683–2689, 2000.
[17] M. Li and P. Vitányi. An Introduction to Kolmogorov Complexity and its Applications. Springer-Verlag, New York, 1996.
[18] I. Pinelis and R. Molzon. Optimal-order bounds on the rate of convergence to normality in the multivariate delta method. Electronic Journal of Statistics, 10(1):1001–1063, 2016.
[19] M. J. D. Powell. Approximation Theory and Methods. Cambridge University Press: Cambridge, 1981.
[20] J. B. Prolla. A generalized Bernstein approximation theorem. Mathematical Proceedings of the Cambridge Philosophical Society, 104(2):317–330, 1988.
[21] J. Rissanen. Generalized Kraft inequality and arithmetic coding. IBM Journal of Research and Development, 20(3):199–203, May 1976.
[22] J. Rissanen. A predictive least squares principle. IMA Journal of Mathematical Control and Information, 3:211–222, 1986.
[23] J. Rissanen. Stochastic complexity and modeling. Annals of Statistics, 14:1080–1100, 1986.
[24] J. Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific: Singapore, 1989.
[25] J. Rissanen. Fisher information and stochastic complexity. IEEE Trans. Inform. Theory, 42:40–47, 1996.
[26] J. Rissanen. Strong optimality of the normalized ML models as universal codes and information in data. IEEE Transactions on Information Theory, 47(5):1712–1717, 2001.
[27] J. J. Rissanen and G. G. Langdon. Arithmetic coding. IBM Journal of Research and Development, 23(2):149–162, March 1979.
[28] B. Ryabko, J. Astola, and A. Gammerman. Application of Kolmogorov complexity and universal codes to identity testing and nonparametric testing of serial independence for time series. Theoretical Computer Science, pages 440–448, 2006.
[29] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461–464, 1978.
[30] C. E. Shannon. A mathematical theory of communication. Bell Sys. Tech. Journal, 27:379–423, 623–656, 1948.
[31] Y. M. Shtarkov. Universal sequential coding of single messages. Problems of Information Transmission, 23(3):3–17, July–September 1987.
[32] A. Suzuki and K. Yamanishi. Exact calculation of normalized maximum likelihood code length using Fourier analysis. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 1211–1215, 2018.
[33] N. Usotskaya and B. Ryabko. Applications of information-theoretic tests for analysis of DNA sequences based on Markov chain models. Computational Statistics and Data Analysis, pages 1861–1872, 2009.
[34] P. Vitányi and M. Li. Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inform. Theory, 46:446–464, 2000.
