Empirical Lossless Compression Bound of a Data Sequence



Author: Lei M Li 1,2,*

Affiliations: 1 Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China. 2 University of Chinese Academy of Sciences, Beijing 100049, China.
* Correspondence should be addressed to Lei M Li (lilei@amss.ac.cn). Telephone: +8610-82541585. Fax: +8610-6265-8364.

Key words: lossless compression, entropy, Kolmogorov complexity, normalized maximum likelihood, local asymptotic normality

Abstract

We consider the lossless compression bound of any individual data sequence. Conceptually, its Kolmogorov complexity is such a bound yet is uncomputable. The Shannon source coding theorem states that the average compression bound is $nH$, where $n$ is the number of words and $H$ is the entropy of an oracle probability distribution characterizing the data source. The quantity $nH(\hat\theta_n)$ obtained by plugging in the maximum likelihood estimate is an underestimate of the bound. Shtarkov showed that the normalized maximum likelihood (NML) distribution or code length is optimal in a minimax sense for any parametric family. In this article, we consider the exponential families, the only models that admit sufficient statistics whose dimensions remain bounded as the sample size grows. We show by the local asymptotic normality that the NML code length is $nH(\hat\theta_n) + \frac{d}{2}\log\frac{n}{2\pi} + \log\int_\Theta |I(\theta)|^{1/2}\,d\theta + o(1)$, where $d$ is the model dimension or dictionary size, and $|I(\theta)|$ is the determinant of the Fisher information matrix. We also demonstrate that sequentially predicting the optimal code length for the next word via a Bayesian mechanism leads to the mixture code, whose pathwise length is given by $nH(\hat\theta_n) + \frac{d}{2}\log\frac{n}{2\pi} + \log\frac{|I(\hat\theta_n)|^{1/2}}{w(\hat\theta_n)} + o(1)$, where $w(\theta)$ is a prior. If we take the Jeffreys prior when it is proper, the expression agrees with the NML code length. The asymptotics apply not only to discrete symbols but also to continuous data if the code length for the former is replaced by the description length for the latter. The analytical result is exemplified by calculating compression bounds of protein-encoding DNA sequences under different parsing models. Typically, the highest compression is achieved when the parsing is in the phase of the amino acid codons. On the other hand, the compression rates of pseudo-random sequences are larger than 1 regardless of parsing models. These model-based results are consistent with the assertion of Kolmogorov complexity theory that random sequences are incompressible. The empirical lossless compression bound is particularly more accurate when the dictionary size is relatively large.

1 Introduction

The computation of the compression bound of any individual sequence is both a philosophical and a practical problem. It touches on the fundamentals of human intelligence. After several decades of effort, many insights have been gained by experts from different disciplines. In essence, the bound is the length of the shortest program that prints the sequence on a Turing machine, referred to as the Solomonoff-Kolmogorov-Chaitin algorithmic complexity. Under this setting, if a sequence cannot be compressed by any computer program, it is random.
On the other hand, if we can compress the sequence by a certain program or coding scheme, it is then not random, and we learn some pattern or knowledge in the sequence. Nevertheless, this Kolmogorov complexity is not computable. Along another line, the source coding theorem proposed by Shannon [30] claimed that the optimal coding, or the average shortest code length, is no less than $nH$, where $n$ is the number of words and $H$ is the entropy of the source if its distribution can be specified. Although Shannon's probability framework has inspired the invention of some ingenious compression methods, $nH$ is an oracle bound. Some further questions need to be addressed. First, where does the probability distribution come from? A straightforward solution is one inferred from the data themselves. However, in the case of discrete symbols, plugging in the word frequencies $\hat\theta_n$ observed in the sequence results in $nH(\hat\theta_n)$, which is, as can be shown, an underestimate of the bound. Second, the word frequencies are counted according to a dictionary. Different dictionaries would lead to different distributions or codes. What is the criterion for selecting a good dictionary? Third, the practice of some compression algorithms such as the Lempel-Ziv coding shows that as the length of a sequence gets longer, the size of the dictionary gets larger. What is the exact impact of the dictionary size on the compression? Fourth, can we achieve the compression limit by a predictive code that goes through the data in only one round? Fifth, how is the bound derived from the probability framework, if possible, connected to the conclusions drawn from algorithmic complexity?

In this article, we review the key ideas of lossless compression and present some new mathematical results relevant to the above problems. Besides the algorithmic complexity and the Shannon source coding theorem, the technical tools center around the normalized maximum likelihood (NML) coding [31, 26] and predictive coding [7, 16]. The expansions of these code lengths lead to an empirical compression bound, which is indeed sequence-specific and thus has a natural link to algorithmic complexity. Although the primary theme is the pathwise asymptotics, the related average results are discussed as well for the sake of comparison. The analytical results apply not only to discrete symbols but also to continuous data if the code length for the former is replaced by the description length for the latter [3]. Other than theoretical justification, the empirical bound is exemplified by protein-coding DNA sequences and pseudo-random sequences.

2 A brief review of the key concepts

Data compression
The basic concepts of lossless coding can be found in the textbook [9]. Before we proceed, it is helpful to clarify some jargon used in this paper: strings, symbols, and words. We illustrate them by an example. The following, "studydnasequencefromthedatacompressionpointofviewforexampleabcdefghijklmnopqrstuvwxyz", is a string. The 26 distinct lower-case English letters appearing in the string are called symbols, and they form an alphabet. If we parse the string into "study", "dnasequence", "fromthedata", "compressionpointofview", "forexample", "abcdefg", "hijklmnopq", "rstuvwxyz", these substrings are called words. The implementation of data compression includes an encoder and a decoder.
The encoder parses the string to be compressed into words and replaces each word by its codeword. Consequently, this produces a new string, which is hopefully shorter than the original one in terms of its length. The decoder, conversely, parses the new string into codewords, and interprets each codeword back to a word of the original symbols. The collection of all distinct words in a parsing is called a dictionary. In the notion of data compression, two issues arise naturally. First, is there a lower bound? Second, how do we compute this bound, or is it computable at all?

Prefix code
A basic idea in lossless compression is the prefix code or instantaneous code. A code is called a prefix one if no codeword is a prefix of any other codeword. The prefix constraint has a very close relationship to the metaphor of the Turing machine, by which the algorithmic complexity is defined. Given a prefix code over an alphabet of $\alpha$ symbols, the codeword lengths $l_1, l_2, \dots, l_m$, where $m$ is the dictionary size, must satisfy the Kraft inequality: $\sum_{i=1}^m \alpha^{-l_i} \le 1$. Conversely, given a set of code lengths that satisfy this inequality, there exists a prefix code with these code lengths. Please notice that the dictionary size in a prefix code could be either finite or countably infinite. The class of prefix codes is smaller than the more general class of uniquely decodable codes, and one may expect that some uniquely decodable codes could be advantageous over prefix codes in terms of data compression. However, this is not exactly true, for it can be shown that the codeword lengths of any uniquely decodable code must satisfy the Kraft inequality. Thus we can construct a prefix code to match the codeword lengths of any given uniquely decodable code. A prefix code has an attractive self-punctuating feature: it can be decoded without reference to future codewords, since the end of a codeword is immediately recognizable. For these reasons, people stick to prefix coding in practice.

A conceptual yet convenient generalization of the Kraft inequality is to drop the integer requirement for code lengths and ignore the effect of rounding up. A general set of code lengths can be implemented by the arithmetic coding [21, 27]. This generalization leads to a correspondence between probability distributions and prefix code lengths: for every distribution $P$ on the dictionary, there exists a prefix code $C$ whose length $L_C(x)$ is equal to $-\log P(x)$ for all words $x$. Conversely, for every prefix code $C$ on the dictionary, there exists a probability measure $P$ such that $-\log P(x)$ is equal to the code length $L_C(x)$ for all words $x$.

Shannon's probability-based coding
In his seminal work [30], Shannon proposed the source coding theorem based on a probability framework. If we assume a finite number of words $A_1, A_2, \dots, A_m$ are generated from a probabilistic source denoted by a random variable $X$ with frequencies $p_i$, $i = 1, \dots, m$, then the expected length of any prefix code is no shorter than the entropy of this source, defined as $H(X) = -\sum_{i=1}^m p_i \log p_i$. Throughout this paper, we take 2 as the base of the logarithm operation, and thereby the bit is the unit of code lengths. This result offers a lower bound of data compression if a probabilistic model can be assumed.
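As a quick numerical illustration of the correspondence between probabilities and prefix code lengths, the sketch below (a minimal Python example, not part of the original experiments; the word frequencies are made up) computes the ideal code lengths $-\log_2 P(x)$, checks the Kraft inequality, and compares the expected code length with the entropy $H(X)$.

```python
import math

# Hypothetical word frequencies on a small dictionary (illustrative only).
freqs = {"study": 0.4, "dna": 0.3, "sequence": 0.2, "compress": 0.1}

# Ideal prefix code lengths: L(x) = -log2 P(x) (integer rounding ignored,
# as with arithmetic coding).
lengths = {w: -math.log2(p) for w, p in freqs.items()}

# Kraft inequality: the sum over words of 2^{-L(x)} must not exceed 1.
kraft_sum = sum(2 ** -l for l in lengths.values())

# Expected code length equals the entropy H(X) when L(x) = -log2 P(x).
entropy = -sum(p * math.log2(p) for p in freqs.values())
expected_len = sum(p * lengths[w] for w, p in freqs.items())

print(f"Kraft sum = {kraft_sum:.6f}")          # exactly 1 for ideal lengths
print(f"H(X) = {entropy:.4f} bits, E[L] = {expected_len:.4f} bits")
```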
The Huffman code is such an optimal code that reaches the expected code length. Its codewords are defined by a binary tree built from word frequencies. The Shannon-Fano-Elias code is another one that uses at most two bits more than the lower bound. The code length of $A_i$ in the Shannon-Fano-Elias code is approximately equal to $-\log p_i$.

Kolmogorov complexity and algorithm-based coding
Kolmogorov, who laid out the foundation of probability theory, interestingly put away probabilistic models, and along with other researchers including Solomonoff and Chaitin, pursued another path to understand the information structure of data based on the notion of the universal Turing machine. Kolmogorov [8] expressed the following: "information theory must precede probability theory, and not be based on it." We give a brief account of some facts about Kolmogorov complexity relevant to our study, and refer readers to Li and Vitányi [17], Vitányi and Li [34] for details. A Turing machine is a computer with finite states operating on a finite symbol set, and is essentially the abstraction of any concrete computer that has CPUs, memory, and input and output devices. At each unit of time, the machine reads in one operation command from the program tape, writes some symbols on a work tape, and changes its state according to a transition table. Two important features need more explanation. First, the program is linear, namely, the machine reads the tape from left to right and never goes back. Second, the program is prefix-free, namely, no program leading to a halting computation can be the prefix of another such program. This feature is an analog of the prefix-coding idea. A universal Turing machine can reproduce the results of other machines. The Kolmogorov complexity of a word $x$ with respect to a universal computer $U$, denoted by $K_U(x)$, is defined as the minimum length over all programs that print $x$ and halt. The Kolmogorov complexities of all words satisfy the Kraft inequality due to their natural connection to prefix coding. In fact, for a fixed machine $U$, we can encode $x$ by the minimum-length program that prints $x$ and halts. Given a long string, if we define a way to parse it into words, then we encode each word by the above program. Consequently, we encode the string by concatenating the programs one after another. The decoding can easily be carried out by inputting the concatenated program into $U$. One obvious way of parsing is to take the string itself as the only word. Thus how much we can compress the string depends on the complexity of this string. At this point, we see the connection between data compression and the Kolmogorov complexity, which is defined for each string on an implementable type of computational machine, the Turing machine.

Next, we highlight some theoretical results about Kolmogorov complexity. First, it is not machine specific except for a machine-specific constant. Second, the Kolmogorov complexity is unfortunately not computable. Third, there exists a universal probability $P_U(x)$ with respect to a universal machine such that $2^{-K(x)} \le P_U(x) \le c\, 2^{-K(x)}$ for all strings, where $c$ is a constant independent of $x$.
This means that $K(x)$ is equivalent to $-\log P_U(x)$ up to a constant, which can be viewed as the code length of a prefix code in light of the Shannon-Fano-Elias code. Because of the non-computability of Kolmogorov complexity, the universal probability is not computable either. The study of Kolmogorov complexity tells us that the assessment of exact compression bounds of strings is beyond the ability of any specific Turing machine. However, any program on a Turing machine offers, up to a constant, an upper bound.

Correspondence between probability models and string parsing
A critical question that remains to be answered in the Shannon source coding theorem is: where does the model that defines the probabilities come from? According to the theorem, the optimal code lengths are proportional to the negative logarithm of the word frequencies. Once the dictionary is defined, the word frequencies can be counted for any individual string to be compressed. Equivalently, a dictionary can be induced by the way we parse a string. It is noted that the term "letter" instead of "word" was used in Shannon's original paper [30], which did not discuss how to parse strings into words at all.

Fixed-length and variable-length parsing
The words generated from the parsing process could be either of the same length or of variable lengths. For example, we can encode Shakespeare's works letter by letter, or encode them by natural words of different lengths. A choice made at this point leads to two quite different coding schemes. If we decompose a string into words of the same number of symbols, this is a fixed-length parsing. The up to two extra bits per word (e.g., from the Shannon-Fano-Elias code) are a big deal if the number of symbols in each word is small. As the word length gets longer and longer, the two extra bits are relatively negligible for each block. An effective alternative to get around the issue of extra bits is the arithmetic coding, which integrates the codes of successive words at the cost of more computation.

Variable-length parsing decomposes a string into words of variable numbers of symbols. The popular Lempel-Ziv coding is such a scheme. Although the complexity of a string $x$ is not computable, the complexity of '$x1$' relative to '$x$' is small. To concatenate a '1' to the end of '$x$', we can simply use the program of printing $x$ followed by printing '1'. A recursive implementation of this idea leads to the Lempel-Ziv coding, which concatenates the address of '$x$' and the code of '1'. Please notice that as the data length increases, the dictionary size resulting from the parsing scheme of the Lempel-Ziv coding increases as well if we do not impose an upper limit. Along the process of encoding, each word occurs only once, because down the road either it will not be a prefix of any other word or a new word concatenating it with a certain suffix symbol will be found. To a good approximation, all the words encountered up to a point are equally likely. If we use the same number of bits to store the addresses of these words, their code lengths are equal. Approximately, it obeys Shannon's source coding theorem too.
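The incremental parsing idea can be made concrete with a few lines of code. The sketch below (an illustrative LZ78-style parser, not the exact variant discussed above) splits a string into words that are each a previously seen word extended by one new symbol, and reports how the dictionary grows with the input length.

```python
def lz78_parse(s: str):
    """Parse s into words, each being a previously seen word plus one symbol."""
    dictionary = {}          # word -> index of its first occurrence
    words = []
    current = ""
    for ch in s:
        current += ch
        if current not in dictionary:      # a new word ends here
            dictionary[current] = len(dictionary) + 1
            words.append(current)
            current = ""
    if current:                            # leftover suffix at the end
        words.append(current)
    return words

text = "abababbababaabbbabab" * 20
words = lz78_parse(text)
# Each word costs roughly log2(dictionary size) bits for the prefix address
# plus the bits for its last symbol, in line with the discussion above.
print(f"input length = {len(text)}, number of words = {len(words)}")
```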
Parametric models and complexity
Hereafter we use parametric probabilistic models to count prefix code lengths. The specification of a parametric model includes three aspects: a model class, a model dimension, and parameter values. Suppose we restrict our attention to some hypothetical model classes. Each of these model classes is indexed by a set of parameters, and we call the number of parameters in each model its dimension. We also assume the identifiability of the parameterization, that is, different parameter values correspond to different models. Let us denote one such model class by a probability measure $\{P_\theta : \theta \in \text{an open set } \Theta \subset R^d\}$, and the corresponding frequency functions by $\{p(x; \theta)\}$. The model class is usually defined by a parsing scheme. For example, if we parse a string symbol by symbol, then the number of words equals the number of symbols appearing in the string. We denote the number of symbols by $\alpha$; then $d = \alpha - 1$. If we parse the string by every two symbols, then the number of distinct words increases to $\alpha^2$ and $d = \alpha^2 - 1$, and so on. From the above review of Kolmogorov complexity, it is clear that strings themselves do not admit probability models in the first place. Nevertheless, we can fit a string by a parametric model. By doing so, we need to pay extra bits to describe the model, as observed by Rissanen. He termed them the stochastic complexity or parametric complexity. The total code length by a model includes both the data description and the parametric complexity.

Two references for code length evaluation
The evaluation of the redundancy of a given code needs a reference. Two such references are discussed in the literature. In the first scenario, we assume that the words $X^{(n)} = \{X_1, X_2, \dots, X_n\}$ are generated according to $P_{\theta_0}$ as independent and identically distributed (iid) random variables, whose outcomes are denoted by $\{x_i\}$. Then the optimal code length is given by $L_0 = -\sum_{i=1}^n \log p(X_i; \theta_0)$. As $n$ goes large, its average code length is given by $E L_0 = nH(\theta_0)$. In general, the code length corresponding to any distribution $Q(x)$ is given by $L_Q = -\sum_{i=1}^n \log q(X_i)$, and its redundancy is $R_Q = L_Q - L_0$. The expected redundancy is the Kullback-Leibler divergence between the two distributions:
$$E_{P_{\theta_0}}(L_Q - L_0) = E_{P_{\theta_0}} \log\frac{P_{\theta_0}(X^{(n)})}{Q(X^{(n)})} = D(P_{\theta_0}\,\|\,Q) \ge 0.$$
It can be shown that the minimax and maximin values of the average redundancy are equal [14]:
$$\inf_Q \sup_\theta E_{P_\theta} \log\frac{P_\theta(X^{(n)})}{Q(X^{(n)})} = \sup_\theta \inf_Q E_{P_\theta} \log\frac{P_\theta(X^{(n)})}{Q(X^{(n)})} = I(\Theta; X^{(n)}).$$
Historically, a key progress on redundancy [15, 23] is that for each positive number $\epsilon$ and for all $\theta_0 \in \Theta$ except in a set whose volume goes to zero, as $n \to \infty$,
$$E_{P_{\theta_0}}(L_Q - L_0) \ge \frac{d-\epsilon}{2}\log n. \quad (1)$$
All these results are about the average code length over all possible strings.

Another reference with which any code can be compared is obtained by replacing $\theta_0$ by the maximum likelihood estimate $\hat\theta_n$ in $L_0$, that is, $L_{\hat\theta_n} = -\sum_{i=1}^n \log p(X_i; \hat\theta_n)$. Please notice that $L_{\hat\theta_n}$ does not satisfy the Kraft inequality. This perspective is a practical one, since in reality $x^{(n)}$ is simply data without any probability measure. Given a parametric model class $\{P_\theta\}$, we fit the data by one surrogate model that maximizes the likelihood. Then we consider
$$L_Q - L_{\hat\theta_n} = \log\frac{p(x^{(n)}; \hat\theta(x^{(n)}))}{q(x^{(n)})}.$$
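The claim that $L_{\hat\theta_n}$ does not satisfy the Kraft inequality is easy to check numerically. The sketch below (a minimal illustration for the Bernoulli model, not from the paper's experiments) sums the maximized likelihoods $p(x^{(n)}; \hat\theta(x^{(n)}))$ over all binary strings of length $n$ by grouping them by their sufficient statistic; the sum exceeds 1, so $-\log p(x^{(n)}; \hat\theta(x^{(n)}))$ cannot be a legitimate prefix code length, and the excess is exactly the NML normalizer introduced next.

```python
import math

def plugin_likelihood_sum(n: int) -> float:
    """Sum of max_theta p(x; theta) over all binary strings of length n."""
    total = 0.0
    for k in range(n + 1):                 # k = number of ones (sufficient statistic)
        p_hat = k / n                      # MLE for strings with k ones
        # maximized likelihood of one such string
        like = (p_hat ** k) * ((1 - p_hat) ** (n - k)) if 0 < k < n else 1.0
        total += math.comb(n, k) * like    # number of strings with k ones
    return total

for n in (10, 100, 1000):
    s = plugin_likelihood_sum(n)
    print(f"n={n:5d}: sum of plug-in 'probabilities' = {s:.3f} (> 1)")
```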
Optimality of the normalized maximum likelihood code length
Minimizing the above quantity leads to the normalized maximum likelihood (NML) distribution:
$$\hat p(x^{(n)}) = \frac{p(x^{(n)}; \hat\theta(x^{(n)}))}{\sum_{x^{(n)}} p(x^{(n)}; \hat\theta(x^{(n)}))}.$$
The NML code length is thus given by
$$L_{NML} = -\log p(x^{(n)}; \hat\theta(x^{(n)})) + \log \sum_{x^{(n)}} p(x^{(n)}; \hat\theta(x^{(n)})). \quad (2)$$
Shtarkov [31] proved the optimality of the NML code by showing it solves
$$\min_q \max_{x^{(n)}} \log \frac{p(x^{(n)}; \hat\theta(x^{(n)}))}{q(x^{(n)})}, \quad (3)$$
where $q$ ranges over the set of virtually all distributions. Later Rissanen [26] further proved that the NML code solves
$$\min_q \max_g E_g\left[\log \frac{p(X^{(n)}; \hat\theta(X^{(n)}))}{q(X^{(n)})}\right],$$
where $q$ and $g$ range over the set of virtually all distributions. This result states that the NML code is still optimal even if the data are generated from outside the parametric model family. Namely, regardless of the source nature in practice, we can always find the optimal code length from a distribution family.

3 Empirical code lengths based on exponential family distributions

In this section, we fit the data from a source, either discrete or continuous, by an exponential family due to the following considerations. First, the multinomial distribution, which is used to encode discrete symbols, is an exponential family. Second, according to the Pitman-Koopman-Darmois theorem, exponential families are, under certain regularity conditions, the only models that admit sufficient statistics whose dimensions remain bounded as the sample size grows. On one hand, this property is most desirable in data compression. On the other hand, the results would be valid in the more general statistical learning beyond source coding. Third, as we will show, the first term in the code length expansion is nothing but the empirical entropy for exponential families, which is a straightforward extension of Shannon's source coding theorem.

Exponential families
Consider a canonical exponential family of distributions $\{P_\theta : \theta \in \Theta\}$, where the natural parameter space $\Theta$ is an open set of $R^d$. The density function is given by
$$p(x; \theta) = \exp\{\theta^T S(x) - A(\theta)\}, \quad (4)$$
with respect to some measure $\mu(dx)$ on the support of the data. The transposition of a matrix (or vector) $V$ is represented by $V^T$ here and throughout the paper. $S(\cdot)$ is the sufficient statistic for the parameter $\theta$. We denote the first and second derivatives of $A(\theta)$ respectively by $\dot A(\theta)$ and $\ddot A(\theta)$. The entropy or differential entropy of $P_\theta$ is $H(\theta) = A(\theta) - \theta^T \dot A(\theta)$. The following result is an empirical and pathwise version of Shannon's source coding theorem.

Theorem 1 (Empirical optimal source code length) If we fit an individual data sequence by an exponential family distribution, the NML code length is given by
$$L_{NML} = nH(\hat\theta_n) + \frac{d}{2}\log\frac{n}{2\pi} + \log\int_\Theta |I(\theta)|^{1/2}\, d\theta + o(1), \quad (5)$$
where $H(\hat\theta_n)$ is the entropy evaluated at the maximum likelihood estimate (MLE) $\hat\theta_n = \hat\theta(x^{(n)})$, and $|I(\theta)|$ is the determinant of the Fisher information
$$I(\theta) = \Big[-E\Big(\frac{\partial^2 \log p(X;\theta)}{\partial\theta_j\,\partial\theta_k}\Big)\Big]_{j,k=1,\dots,d}.$$
The integral in the expression is assumed to be finite.
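For the Bernoulli family ($d = 1$) the expansion (5) can be checked directly: in the mean parameterization, the parametric-complexity term is $\log\int_0^1 |I(p)|^{1/2}dp = \log\int_0^1 p^{-1/2}(1-p)^{-1/2}dp = \log\pi$, and the exact NML normalizer can be summed over the sufficient statistic. The sketch below (illustrative code, with all logarithms in base 2) compares the exact value of $\log\sum_{x^{(n)}} p(x^{(n)};\hat\theta)$ with $\frac{1}{2}\log\frac{n}{2\pi}+\log\pi$.

```python
import math

def log2_nml_normalizer(n: int) -> float:
    """Exact log2 of the sum over binary strings of the maximized likelihood."""
    total = 0.0
    for k in range(n + 1):
        p_hat = k / n
        like = (p_hat ** k) * ((1 - p_hat) ** (n - k)) if 0 < k < n else 1.0
        total += math.comb(n, k) * like
    return math.log2(total)

for n in (100, 1000, 10000):
    exact = log2_nml_normalizer(n)
    asym = 0.5 * math.log2(n / (2 * math.pi)) + math.log2(math.pi)
    print(f"n={n:6d}: exact = {exact:.4f}, (d/2)log(n/2pi)+log(pi) = {asym:.4f}")
```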
The first term in (2) is $nA(\hat\theta_n) - [\sum_{i=1}^n S(x_i)]^T\hat\theta_n = nA(\hat\theta_n) - n\dot A(\hat\theta_n)^T\hat\theta_n = nH(\hat\theta_n)$, namely, the entropy in Shannon's theorem except that the model parameter is replaced by the MLE. The second term has a close relationship to the BIC work of Akaike [2] and Schwarz [29], and the third term involves the Fisher information, which characterizes the local property of a distribution family. Surprisingly and interestingly, this empirical version of the lossless coding theorem puts together the three pieces of fundamental work respectively by Shannon, Akaike-Schwarz, and Fisher.

Next, we give a heuristic proof of (5) by the local asymptotic normality (LAN) [6], though a complete proof can be found in the Appendix. In the definition of the NML code length (2), the first term becomes the empirical entropy for exponential families. Namely,
$$L_{NML} = nH(\hat\theta_n) + \log\sum_{x^{(n)}} p(x^{(n)}; \hat\theta_n). \quad (6)$$
The remaining difficulty is the computation of the summation. In a general problem of data description length, Rissanen [25] derived an analytical expansion requiring five assumptions, which were hard to verify. Here we show that for sources from exponential families, the expansion is valid as long as the integral is finite.

Let $U(\theta, \frac{r}{\sqrt n})$ be a cube of side length $\frac{r}{\sqrt n}$ centered at $\theta$, where $r$ is a constant. LAN states that we can expand the probability density in each neighborhood $U(\theta, \frac{r}{\sqrt n})$ as follows:
$$\log\frac{p(x^{(n)}; \theta+h)}{p(x^{(n)}; \theta)} = h^T\Big[\sum_{i=1}^n S(x_i) - n\dot A(\theta)\Big] - \frac{1}{2} h^T [nI(\theta)] h + o(h),$$
where $I(\theta) = \ddot A(\theta)$. Maximizing the likelihood in $U(\theta, \frac{r}{\sqrt n})$ with respect to $h$ leads to
$$\max_h \log\frac{p(x^{(n)}; \theta+h)}{p(x^{(n)}; \theta)} = \frac{1}{2}\Big[\sum_{i=1}^n S(x_i) - n\dot A(\theta)\Big]^T [nI(\theta)]^{-1} \Big[\sum_{i=1}^n S(x_i) - n\dot A(\theta)\Big] + o\Big(\frac{r}{\sqrt n}\Big).$$
Consequently, if $\hat\theta_n(x^{(n)})$ falls into the neighborhood $U(\theta, \frac{r}{\sqrt n})$, we have
$$p(x^{(n)}; \hat\theta_n) = e^{\frac{1}{2}[\sum_{i=1}^n S(x_i) - n\dot A(\theta)]^T [nI(\theta)]^{-1}[\sum_{i=1}^n S(x_i) - n\dot A(\theta)] + o(r/\sqrt n)}\; p(x^{(n)}; \theta), \quad (7)$$
where $\hat\theta_n$ solves $\sum_{i=1}^n S(x_i) = n\dot A(\hat\theta_n)$. Applying the Taylor expansion, we get
$$\sum_{i=1}^n S(x_i) - n\dot A(\theta) = n\dot A(\hat\theta_n) - n\dot A(\theta) = [n\ddot A(\theta)](\hat\theta_n - \theta) + o\Big(\frac{r}{\sqrt n}\Big).$$
Plugging it into (7) leads to
$$p(x^{(n)}; \hat\theta_n) = e^{\frac{1}{2}(\hat\theta_n - \theta)^T [n\ddot A(\theta)](\hat\theta_n - \theta) + o(r/\sqrt n)}\; p(x^{(n)}; \theta). \quad (8)$$
If we consider i.i.d. random variables $Y_1, \dots, Y_n$ sampled from the exponential distribution (4), then the MLE $\hat\theta(Y^{(n)})$ is a random variable. The summation of the quantity (8) over the neighborhood $U(\theta, \frac{r}{\sqrt n})$ can be expressed as the following expectation of $\hat\theta(Y^{(n)})$:
$$E\Big[e^{\frac{1}{2}(\hat\theta_n - \theta)^T [n\ddot A(\theta)](\hat\theta_n - \theta)}\, 1\big(\hat\theta_n \in U(\theta, \tfrac{r}{\sqrt n})\big)\Big]. \quad (9)$$
Due to the asymptotic normality of the MLE $\hat\theta(Y^{(n)})$, namely $\hat\theta_n - \theta \xrightarrow{d} N(0, [nI(\theta)]^{-1})$, the density of $\hat\theta(Y^{(n)})$ is approximated by
$$\frac{|nI(\theta)|^{1/2}}{(2\pi)^{d/2}}\, e^{-\frac{1}{2}(\hat\theta_n - \theta)^T [n\ddot A(\theta)](\hat\theta_n - \theta)}\, d\hat\theta_n.$$
Applying this density to the expectation in (9), we find that the two exponential terms cancel out, leaving $\frac{|nI(\theta)|^{1/2}}{(2\pi)^{d/2}}$ times the volume of the neighborhood. Summing these quantities over all the neighborhoods $U(\theta, \frac{r}{\sqrt n})$ gives approximately $(\frac{n}{2\pi})^{d/2}\int_\Theta |I(\theta)|^{1/2}\,d\theta$, whose logarithm yields the remaining terms in (5).
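The quadratic approximation (8) is easy to visualize for the Bernoulli family. The sketch below (illustrative code; the data are simulated and not from the paper) draws a sample, computes the exact log-likelihood ratio $\log p(x^{(n)}; \hat\theta_n) - \log p(x^{(n)}; \theta)$, and compares it with the quadratic form $\frac{1}{2} n(\hat\theta_n - \theta)^2 I(\theta)$, where $I(\theta) = 1/(\theta(1-\theta))$ in the mean parameterization.

```python
import math
import random

random.seed(0)
n, theta = 2000, 0.4                      # true Bernoulli parameter
x = [1 if random.random() < theta else 0 for _ in range(n)]
theta_hat = sum(x) / n                    # MLE
k = sum(x)

# Exact log-likelihood ratio: log p(x; theta_hat) - log p(x; theta)
exact = (k * math.log(theta_hat / theta)
         + (n - k) * math.log((1 - theta_hat) / (1 - theta)))

# LAN-style quadratic approximation: 0.5 * n * (theta_hat - theta)^2 * I(theta),
# with Fisher information I(theta) = 1 / (theta * (1 - theta)).
fisher = 1.0 / (theta * (1 - theta))
quad = 0.5 * n * (theta_hat - theta) ** 2 * fisher

print(f"exact log-LR = {exact:.4f}, quadratic approximation = {quad:.4f}")
```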
Predictive coding
The optimality of the NML code is established in the minimax setting. Yet its implementation requires going through the data two rounds: one for the word counting of a dictionary, and one more for encoding. It is natural to ask whether there exists a scheme that goes through the data only once and still can compress the data equally well. It turns out that predictive coding is such a scheme for a given dictionary. The idea of predictive coding is to sequentially make inferences about the parameters in the probability function $p(x; \theta)$, which are then used to update the code book. That is, after obtaining observations $x_1, \dots, x_i$, we calculate the MLE $\hat\theta_i$, and in turn encode the next observation according to the current estimated distribution. Its code length is thus $L_{\rm predictive} = -\sum_{i=1}^{n-1} \log p(X_{i+1} \mid \hat\theta_i)$. This procedure (Rissanen [22, 23]) has intimate connections with the prequential approach to statistical inference as advocated by Dawid [10, 11]. Predictive coding is intuitively optimal due to two important fundamental results. First, the MLE $\hat\theta_i$ is asymptotically most accurate, since it gathers all the information in $X_1, \dots, X_i$ for the inference of the parametric model $p(x; \theta)$. Second, the code length $-\log p(X_{i+1} \mid \hat\theta_i)$ is optimal as dictated by the Shannon source coding theorem. In the case of exponential families, Proposition 2.2 in [16] showed that $L_{\rm predictive}$ can be expanded as follows:
$$L_{\rm predictive} = nH(\hat\theta_n) + \frac{d}{2}\log n + \tilde D_n(\omega),$$
where the sequence of random variables $\{\tilde D_n(\omega)\}$ converges to an almost surely finite random variable $\tilde D(\omega)$.

Alternatively, we can use Bayesian estimates in the predictive coding. Starting from a prior distribution $w(\theta)$, we encode $x_1$ by the marginal distribution $q_1(x_1) = \int_\Theta p(x_1\mid\theta)\, w(\theta)\, d\theta$ resulting from $w(\cdot)$. The posterior is given by $w_1(\theta) = p(x_1\mid\theta)w(\theta) / \int_\Theta p(x_1\mid\theta)w(\theta)\,d\theta$. We then use this posterior as the updated prior to encode the next word $x_2$. Using induction, we can show that the marginal distribution used to encode the $k$-th word is
$$q_k(x_k) = \frac{\int_\Theta [\prod_{i=1}^{k} p(x_i\mid\theta)]\, w(\theta)\, d\theta}{\int_\Theta [\prod_{i=1}^{k-1} p(x_i\mid\theta)]\, w(\theta)\, d\theta}.$$
Meanwhile, the updated posterior, also the prior for the next round of encoding, becomes
$$w_k(\theta) = \frac{[\prod_{i=1}^{k} p(x_i\mid\theta)]\, w(\theta)}{\int_\Theta [\prod_{i=1}^{k} p(x_i\mid\theta)]\, w(\theta)\, d\theta}.$$

Proposition 1 (Bayesian predictive code length) The total Bayesian predictive code length for a string of $n$ words is
$$L_{\rm mixture} = -\sum_{k=1}^n \log q_k(x_k) = -\log \int_\Theta \Big[\prod_{i=1}^n p(x_i\mid\theta)\Big] w(\theta)\, d\theta.$$
Thus the Bayesian predictive code is nothing but the mixture code referred to in the literature [7].

Theorem 2 (Expansion of the Bayesian predictive code length) If we fit a data sequence by an exponential family distribution, the mixture code length has the expansion
$$L_{\rm mixture} = nH(\hat\theta_n) + \frac{d}{2}\log\frac{n}{2\pi} + \log\frac{|I(\hat\theta_n)|^{1/2}}{w(\hat\theta_n)} + o(1), \quad (10)$$
where $w(\theta)$ is any mixture of conjugate prior distributions.

The result is valid for general priors that can be approximated by a mixture of conjugate ones. In the case of multinomial distributions, the conjugate prior is the Dirichlet distribution. Any prior $w(\theta)$ continuous on the $d$-dimensional simplex in the $[0,1]^{m+1}$ cube can be uniformly approximated by the Bernstein polynomials of $m$ variables, each term of which is a Dirichlet distribution [19, 20].
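A minimal sketch of the Bayesian predictive (mixture) code for discrete symbols follows, assuming the multinomial model with the conjugate Dirichlet(1/2, ..., 1/2) prior (the Jeffreys prior discussed below); the code and the alphabet are illustrative, not from the paper. Each symbol is encoded with the current posterior predictive probability, and the accumulated length equals $-\log_2$ of the Dirichlet-multinomial marginal, as stated in Proposition 1.

```python
import math
import random

def mixture_code_length(seq, alphabet, a=0.5):
    """Sequential Bayesian predictive code length (bits) under a
    Dirichlet(a, ..., a) prior on the multinomial word probabilities."""
    counts = {s: 0.0 for s in alphabet}
    total_bits = 0.0
    for sym in seq:
        # posterior predictive probability of the next symbol
        prob = (counts[sym] + a) / (sum(counts.values()) + a * len(alphabet))
        total_bits += -math.log2(prob)
        counts[sym] += 1.0               # Bayesian update of the "code book"
    return total_bits

random.seed(1)
alphabet = ["A", "C", "G", "T"]
seq = random.choices(alphabet, weights=[4, 3, 2, 1], k=5000)

n = len(seq)
p_hat = [seq.count(s) / n for s in alphabet]
empirical_entropy = -sum(p * math.log2(p) for p in p_hat if p > 0)

print(f"L_mixture            = {mixture_code_length(seq, alphabet):.1f} bits")
print(f"nH(theta_hat) alone  = {n * empirical_entropy:.1f} bits")
```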
It is noted that in the current setting, the source is not assumed to be i.i.d. samples from an exponential family distribution, as in Theorem 2.2 and Proposition 2.3 in [16]. When $\int_\Theta |I(\theta)|^{1/2}\, d\theta$ is finite, we can take the Jeffreys prior,
$$w(\theta) = \frac{|I(\theta)|^{1/2}}{\int_\Theta |I(\theta)|^{1/2}\, d\theta},$$
and then (10) becomes (5). Putting things together, we have shown that the optimal code length can be achieved by the Bayesian predictive coding scheme.

Redundancy
Now we examine the empirical code length under Shannon's setting. That is, we evaluate the redundancy of the code length assuming the source follows a hypothetical distribution.

Proposition 2 If we assume that a source follows an exponential family distribution, then
$$nH(\hat\theta_n) - nH(\theta) = -C_n(\omega)(\log\log n) + o(1), \quad (11)$$
where the sequence of nonnegative random variables $\{C_n(\omega)\}$ has the property $\lim_{n\to\infty} C_n(\omega) \le d$ for almost all paths $\omega$. If we further assume that $\int_\Theta |I(\theta)|^{1/2}\, d\theta < \infty$, then
$$R_{NML} = L_{NML} - nH(\theta) = \frac{d}{2}\log\frac{n}{2\pi} - C_n(\omega)(\log\log n) + \log\int_\Theta |I(\theta)|^{1/2}\,d\theta + o(1), \quad (12)$$
where $\{C_n(\omega)\}$ is the same as above.

The difference in the first part is $(-\sum_{i=1}^n \log p(X_i\mid\hat\theta_n)) - (-\sum_{i=1}^n \log p(X_i\mid\theta_0))$, and the rest is true according to the proof of Proposition 2.2 in [16], Equation (18). The NML code is a special case of the mixture code, whose redundancy is given by Theorem 2.2 in [16]. We note that $\lim_{n\to\infty} C_n(\omega)$ is bounded below by 1. This proposition confirms that $nH(\hat\theta_n)$ is an underestimate of the compression bound. Although $\log\log n$ grows slowly, the term, as shown by the example in Table 1, gets large as the model dimension $d$ increases.

Coding of discrete symbols and the multinomial model
For compressing strings of discrete symbols, it is sufficient to consider the discrete distribution specified by a probability vector $\theta = (p_1, p_2, \dots, p_d, p_{d+1})$. Its frequency function is $P(X = k) = \prod_{j=1}^{d+1} p_j^{1(X=j)}$. The Fisher information matrix $I(p_1, \dots, p_d)$ can be shown to be
$$-\Big[E\,\frac{\partial^2\log P(X=k)}{\partial p_j\,\partial p_k}\Big]_{j,k=1,\dots,d} = \begin{pmatrix} \frac{1}{p_1}+\frac{1}{p_{d+1}} & \frac{1}{p_{d+1}} & \cdots & \frac{1}{p_{d+1}} \\ \frac{1}{p_{d+1}} & \frac{1}{p_2}+\frac{1}{p_{d+1}} & \cdots & \frac{1}{p_{d+1}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{1}{p_{d+1}} & \frac{1}{p_{d+1}} & \cdots & \frac{1}{p_d}+\frac{1}{p_{d+1}} \end{pmatrix}.$$
Thus $|I(p_1, \dots, p_d)| = 1/\prod_{k=1}^{d+1} p_k$.

Suppose $X_1, \dots, X_n$ are i.i.d. random variables obeying the above discrete distribution. Then the vector of counts of the $d+1$ categories follows a multinomial distribution $\mathrm{Multi}(n; p_1, p_2, \dots, p_d, p_{d+1})$. Its conjugate prior distribution is the Dirichlet distribution with parameters $(\alpha_1, \alpha_2, \dots, \alpha_{d+1})$, whose density function is
$$\frac{\Gamma(\sum_{k=1}^{d+1}\alpha_k)}{\prod_{k=1}^{d+1}\Gamma(\alpha_k)}\prod_{k=1}^{d+1} p_k^{\alpha_k - 1},$$
where $\Gamma(t) = \int_0^{\infty} u^{t-1} e^{-u}\, du$. The Jeffreys prior is proportional to $|I(p_1, \dots, p_d)|^{1/2}$; in this case, it equals Dirichlet(1/2, 1/2, ..., 1/2), whose density is
$$\frac{\Gamma((d+1)/2)}{\Gamma(1/2)^{d+1}}\prod_{k=1}^{d+1} p_k^{-1/2}.$$
The Jeffreys prior was also used by Krichevsky [15] to derive optimal universal codes. It is noticed that $\Gamma(1/2) = \sqrt\pi$. Plugging it into Equation (10), we have the following specific form of the NML code length for the multinomial distribution. Remember that the distribution or word frequencies are specific to a given dictionary $\Phi$, and we thus term it $L_{NML@\Phi}$. If we change the dictionary, the code length changes accordingly.
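The determinant formula $|I(p_1,\dots,p_d)| = 1/\prod_{k=1}^{d+1} p_k$ can be verified numerically. The sketch below (illustrative code using NumPy, not from the paper) builds the matrix for a randomly chosen probability vector and compares its determinant with the closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(5))            # probability vector, d + 1 = 5 categories
d = len(p) - 1

# Fisher information matrix for the free parameters p_1, ..., p_d:
# I_jk = delta_jk / p_j + 1 / p_{d+1}
I = np.diag(1.0 / p[:d]) + 1.0 / p[d]

print(f"det(I)        = {np.linalg.det(I):.6f}")
print(f"1 / prod(p_k) = {1.0 / np.prod(p):.6f}")
```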
Proposition 3 (Optimal code length for a multinomial distribution)
$$L_{NML@\Phi} = nH(\hat\theta_n) + \frac{d}{2}\log n - \frac{d}{2} - \log\Gamma\Big(\frac{d+1}{2}\Big) + \frac{1}{2}\log\pi, \quad (13)$$
where $nH(\hat\theta_n) = -n\sum_{k=1}^{d+1}\hat p_k\log\hat p_k$ and $\hat p_k = n_k/n$ is the frequency of the $k$-th word appearing in the string.

4 Compression of random sequences and DNA sequences

Lossless compression bound and description length
Given a dictionary of words, we parse a string into words and count their frequencies $\hat p_k = n_k/n$, the total number of words $n$, and the number of distinct words $d$. Plugging them into expression (13), we obtain the lossless compression bound for this dictionary or parsing. If a different parsing is tried, the three quantities, namely the word frequencies, the number of words, and the dictionary size (the number of distinct words), would change, and the resulting bound would change accordingly. In the general situation where the data are not necessarily discrete symbols, we replace the code length with the description length (10), as termed by Rissanen. Since each parsing corresponds to a probabilistic model, the code length is model-dependent. The comparison of two or more coding schemes is exactly the selection of models, with the expression (13) as the target function.

Rissanen's principle of minimum description length and model selection
Rissanen, in his works [23, 24, 25], etc., proposed the principle of minimum description length (MDL) as a more general modeling rule than that of maximum likelihood, which was recommended, analyzed, and popularized by R. A. Fisher. From the information-theoretic point of view, when we encode data from a source by prefix coding, the optimal code is the one that achieves the minimum description length. Because of the equivalence between a prefix code length and the negative logarithm of the corresponding probability distribution, via Kraft's inequality, this in turn gives us a modeling principle, namely, the MDL principle: choose the model or prefix coding algorithm that gives the minimal description of the data; see Hansen and Yu [13] for a review on this topic. We also refer readers to [3, 12] for a more complete account of MDL. MDL is a mathematical formulation of the general principle known as Occam's razor: choose the simplest explanation consistent with the observed data [9].

We make one remark about the significance of MDL. On the one hand, Shannon's work establishes the connection between optimal coding and probabilistic models. On the other hand, Kolmogorov's algorithmic theory says that the complexity, or the absolute optimal coding, cannot be proved by any Turing machine. MDL offers a practical principle: it allows us to make choices among possible models and coding algorithms without the desire to prove optimality. As more and more candidate models are evaluated over time, human understanding progresses.
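In the examples below, expression (13) is evaluated for each candidate parsing and used as the target function. A minimal implementation is sketched here (illustrative Python, not the authors' code; the helper names are ours): it parses a symbol string into fixed-length words, counts their frequencies, and returns the bound (13) in bits.

```python
import math
from collections import Counter

def l_nml_bits(counts):
    """Expression (13): NML code length in bits for observed word counts."""
    n = sum(counts)
    d = len(counts) - 1                       # model dimension = dictionary size - 1
    nH = -sum(c * math.log2(c / n) for c in counts)
    return (nH + 0.5 * d * math.log2(n) - 0.5 * d
            - math.lgamma((d + 1) / 2) / math.log(2) + 0.5 * math.log2(math.pi))

def fixed_length_bound(s, w):
    """Parse s into non-overlapping words of length w and evaluate (13)."""
    words = [s[i:i + w] for i in range(0, len(s) - w + 1, w)]
    return l_nml_bits(list(Counter(words).values()))

s = "ACGT" * 750                              # a toy 3000-symbol string
for w in (1, 2, 3, 4):
    print(f"word length {w}: L_NML = {fixed_length_bound(s, w):.1f} bits")
```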
Compression bounds of random sequences
A random sequence is non-compressible by any model-based or algorithmic prefix coding, as indicated by the complexity results [17, 34]. Thus a legitimate compression bound of a random sequence should be no less than 1 up to certain variations. Conversely, if the compression rate of a sequence using $L_{NML@\Phi}$ as the compression bound is no less than 1 under all dictionaries $\Phi$, namely,
$$\min_{\Phi:\ \mathrm{dictionaries}} \frac{L_{NML@\Phi}}{L_{RAW}} = 1 + \frac{L_{NML@\Phi} - L_{RAW}}{L_{RAW}} \ge 1,$$
where $L_{RAW}$ is the length of the raw sequence in bits, then the sequence is random. If we assume the source is from a uniform distribution, $L_{RAW} = nH$, and the difference $L_{NML@\Phi} - L_{RAW}$ is essentially the redundancy of $L_{NML@\Phi}$, which can be calculated by (13). Although it is challenging to test all dictionaries, we can try some, particularly those suggested by domain experience.

A simulation study: compression bounds of pseudo-random sequences
Simulations were carried out to test the theoretical bounds. First, a pseudo-random binary string of size 3000 was simulated in R according to Bernoulli trials with a probability of 0.5. In Table 1, the first column shows the word length used for parsing the data; the second column shows the word number; the third column shows the number of distinct words. We group the terms in (13) into three parts: the term involving $n$, the term involving $\log n$, and the others. The bounds by $nH(\hat\theta_n)$, $nH(\hat\theta_n) + \frac{d}{2}\log n$, and $L_{NML}$ in (13) are respectively shown in the next three columns. When the word length increases, $d$ increases, and the bounds by $nH(\hat\theta_n)$ show a decreasing trend. A bound smaller than 1 indicates the sequence can be compressed, contradicting the assertion that random sequences cannot be so. When the word length is 8, the dictionary size is 196, and the bound by $nH(\hat\theta_n)$ is only 0.929. The incompressibility of random sequences falsifies $nH(\hat\theta_n)$ as a legitimate compression bound. If the $\log n$ term is included, the bounds are always larger than 1. The bounds by $L_{NML}$ in (13) are tighter while remaining larger than 1, except for the case at the bottom row, where the number of distinct words approaches the total number of words. Since $L_{NML}$ is an achievable bound, $nH(\hat\theta_n) + \frac{d}{2}\log n$ is an overestimate.

Table 1: The data compression rates of a binary string of size 3000 under different parsing models. The data were simulated in R according to Bernoulli trials with probability 0.5.

| word length | word number | dictionary size | $nH(\hat\theta_n)$ | $nH(\hat\theta_n) + \frac{d}{2}\log n$ | $L_{NML}$ (13) |
|---|---|---|---|---|---|
| 1 | 3000 | 2 | 0.999918 | 1.001843 | 1.001952 |
| 2 | 1500 | 4 | 0.998591 | 1.003866 | 1.003641 |
| 3 | 1000 | 8 | 0.998961 | 1.010587 | 1.008834 |
| 4 | 750 | 16 | 0.995422 | 1.019299 | 1.012974 |
| 5 | 600 | 32 | 0.989603 | 1.037285 | 1.018977 |
| 6 | 500 | 64 | 0.981597 | 1.075738 | 1.027959 |
| 7 | 428 | 124 | 0.964885 | 1.144324 | 1.031261 |
| 8 | 375 | 196 | 0.928930 | 1.206829 | 1.006312 |
| 9 | 333 | 252 | 0.872958 | 1.223846 | 0.950283 |
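A sketch of this simulation is given below (illustrative Python rather than the original R script; the random seed and helper names are ours). It generates a Bernoulli(0.5) string of length 3000, parses it with word lengths 1 through 9, and reports the three bounds per input bit, in the spirit of Table 1.

```python
import math
import random
from collections import Counter

random.seed(2)
bits = "".join(random.choice("01") for _ in range(3000))

def bounds_per_bit(s, w):
    words = [s[i:i + w] for i in range(0, len(s) - w + 1, w)]
    counts = list(Counter(words).values())
    n, d = len(words), len(counts) - 1
    nH = -sum(c * math.log2(c / n) for c in counts)
    bic = nH + 0.5 * d * math.log2(n)
    nml = (nH + 0.5 * d * math.log2(n) - 0.5 * d
           - math.lgamma((d + 1) / 2) / math.log(2) + 0.5 * math.log2(math.pi))
    raw = n * w                              # raw length in bits
    return nH / raw, bic / raw, nml / raw

print("w   nH/raw   +(d/2)log n   L_NML (13)")
for w in range(1, 10):
    h, b, m = bounds_per_bit(bits, w)
    print(f"{w}   {h:.4f}   {b:.4f}        {m:.4f}")
```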
Knowledge discovery by data compression
On the other hand, if we can compress a sequence by a certain prefix coding scheme, then this sequence is not random. In the meantime, this coding scheme presents a clue to understanding the information structure hidden in the sequence. Data compression is one general learning mechanism, among others, to discover knowledge from nature and other sources. Ryabko, Astola and Gammerman [28] applied the idea of Kolmogorov complexity to the statistical testing of some typical hypotheses. This approach was used to analyze DNA sequences in [33].

DNA sequences of proteins
The information carried by the DNA double helix is two long complementary strings of the letters A, G, C, and T. It is interesting to see if we can compress DNA sequences at all. Next, we carried out lossless compression experiments on a couple of protein-encoding DNA sequences.

Rediscovery of the codon structure
In Table 2 we show the result of applying the NML code length $L_{NML}$ in (13) to an E. Coli protein gene sequence labeled b0059 [1], which has 2907 nucleotides. Each row corresponds to one model used for coding. All the models being tested are listed in the first column. In the first model, we encode the DNA nucleotides one by one and name it Model 1. In the second or third model, we parse the DNA sequence by pairs and then encode the resulting bi-nucleotide sequence according to their frequencies. Different starting positions lead to two different phases, and we denote them by 2.0 and 2.1 respectively. Other models are understood in the same fashion. Note that all these models are generated by fixed-length parsing. The last model, "a.a.", means we first translate DNA triplets into amino acids and then encode the resulting amino acid sequence. The second column shows the total number of words in each parsed sequence. The third column shows the number of different words in each parsed sequence, or the dictionary size. The fourth column is the empirical entropy estimated from the observed frequencies. The next column is the first term in expression (13), which is the product of the second and fourth columns. Then we calculate the rest of the terms in (13). The total bits are then calculated, and the compression rates are the ratios $L_{NML}/(2907 \times 2)$. The last column shows the compression rates under different models.

All the compression rates are around 1 except that obtained from Model 3.0, which represents the correct codon pattern and the correct phase. Thus the comparison of compression bounds rediscovers the codon structure of this protein-encoding DNA sequence and the phase of the open reading frame. It is somewhat surprising that the optimal code length $L_{NML}$ enables us to mathematically identify the triplet coding system using only the sequence of one gene. Historically, the system was discovered by Francis Crick and his colleagues in the early 1960s using frame-shift mutations of bacteriophage T4.

Next, we have a closer look at the results. The compression rate of the four-nucleotide word coding is closest to 1, and thus it behaves more like "random". For example, it is 0.9947 for Model 4.2. The first term of empirical entropy contributes 5431 bits, while the rest of the terms contribute 346 bits. If we use $\frac{d}{2}\log n$ instead, the rest term is $0.5\times(219-1)\times\log(726) \approx 1036$ bits, and the compression rate becomes 1.11, which is less tight. If the Ziv-Lempel algorithm is applied to the b0059 sequence, 635 words are generated along the way. Each word needs $\log(635)$ bits for keeping the address of its prefix and 2 bits for the last nucleotide. In total, it needs $635\times\log(635) \approx 5912$ bits for storing addresses, which correspond to the first term in (13), and $635\times 2 = 1270$ bits for storing the words' last letters, which correspond to the rest of the terms in (13). The compression rate of the Ziv-Lempel coding is 1.24.

Table 2: The data compression rates of E. Coli ORF b0059 calculated by (13) under different parsing models.

| model | # words n | dictionary size d | empirical entropy | 1st term | rest terms | $L_{NML@\Phi}$ total bits | compression rate |
|---|---|---|---|---|---|---|---|
| 1 | 2907 | 4 | 1.9924 | 5792.00 | 16.58 | 5808.58 | 0.9991 |
| 2.0 | 1453 | 16 | 3.9570 | 5749.49 | 59.81 | 5809.31 | 0.9995 |
| 2.1 | 1453 | 16 | 3.9425 | 5728.44 | 59.81 | 5788.25 | 0.9959 |
| 3.0 | 969 | 58 | 5.2842 | 5120.39 | 157.11 | 5277.51 | 0.9077 |
| 3.1 | 968 | 63 | 5.5905 | 5411.63 | 167.13 | 5578.76 | 0.9605 |
| 3.2 | 968 | 64 | 5.6706 | 5489.10 | 169.11 | 5658.21 | 0.9742 |
| 4.0 | 726 | 218 | 7.4507 | 5409.24 | 345.07 | 5754.31 | 0.9908 |
| 4.1 | 726 | 217 | 7.4337 | 5396.87 | 344.20 | 5741.07 | 0.9885 |
| 4.2 | 726 | 219 | 7.4814 | 5431.49 | 345.94 | 5777.43 | 0.9947 |
| 4.3 | 726 | 221 | 7.4678 | 5421.64 | 347.67 | 5769.31 | 0.9933 |
| a.a. | 969 | 21 | 4.1056 | 3978.31 | 69.92 | 4048.22 | 0.6963 |
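The phase comparison in Table 2 can be reproduced schematically. The sketch below (illustrative Python; the sequence here is a synthetic stand-in, since the b0059 sequence itself is not reproduced in this article) parses a nucleotide string into triplets at the three possible phases, evaluates (13) for each, and reports the compression rate relative to 2 bits per nucleotide.

```python
import math
import random
from collections import Counter

def l_nml_bits(counts):
    """Expression (13) in bits, given the word counts of one parsing."""
    n, d = sum(counts), len(counts) - 1
    nH = -sum(c * math.log2(c / n) for c in counts)
    return (nH + 0.5 * d * math.log2(n) - 0.5 * d
            - math.lgamma((d + 1) / 2) / math.log(2) + 0.5 * math.log2(math.pi))

def triplet_rate(seq, phase):
    """Compression rate of codon (triplet) parsing starting at a given phase."""
    words = [seq[i:i + 3] for i in range(phase, len(seq) - 2, 3)]
    return l_nml_bits(list(Counter(words).values())) / (2 * len(seq))

# Synthetic "gene": a biased codon usage makes phase 0 the most compressible.
random.seed(3)
codons = random.choices(["ATG", "GGC", "AAA", "GAT", "CTG", "TTT"],
                        weights=[1, 5, 4, 3, 2, 1], k=1000)
seq = "".join(codons)

for phase in (0, 1, 2):
    print(f"phase {phase}: compression rate = {triplet_rate(seq, phase):.4f}")
```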
Redundant information in protein gene sequences
It is known that the $4^3 = 64$ triplets correspond to only 20 amino acids plus the stop codons. Thus redundancy does exist in protein gene sequences. Most of the redundancy lies in the third position of a codon. For example, GGA, GGC, GGT, and GGG all correspond to glycine. According to Table 2, there are 4048.22 bits of information in the amino acid sequence while there are 5277.51 bits of information in Model 3.0. Thus the redundancy in this sequence is estimated to be (5277.51 - 4048.22)/4048.22 = 0.30.

Randomization
To evaluate the accuracy or significance of the compression rates of a DNA sequence, we need a reference distribution for comparison. A typical method is to consider the randomness obtained by permutations. That is, given a DNA sequence, we permute the nucleotide bases and re-calculate the compression rates. If we repeat this permutation procedure, then a reference distribution is generated.

In Table 3, we consider the compression rates for E. Coli ORF b0060, which has 2352 nucleotides. First, the optimal compression rate of 0.958 is achieved at Model 3.0. Second, we further carried out the calculations for permuted sequences. The averages, standard deviations, and 1% lower quantiles of compression rates under different models are shown in Table 3 as well. Except for Model 1, all the compression rates, in terms of either averages or lower quantiles, are significantly above 1. Third, the results by the single term $nH(\hat\theta_n)$ are about 0.996, 0.994, 0.986, and 0.952 respectively for the one-, two-, three-, and four-nucleotide models. The 99% quantiles of $nH(\hat\theta_n)$ for the four-nucleotide models are no larger than 0.961. Thus the difference between $nH(\hat\theta_n)$ and $nH(\theta)$ as shown in (11) increases as the dictionary size goes up. Fourth, the results of $nH(\hat\theta_n) + \frac{d}{2}\log n$ show extra bits compared to those of $L_{NML}$, and the compression ratios go from 1.02 to 1.17, suggesting the rest of the terms in (13) are not negligible. It is noted that Models 3.1 and 3.2 are obtained by phase-shifting from the correct Model 3.0. Other models are obtained by incorrect parsing. These models can serve as references for Model 3.0. The incorrect parsing and phase-shifting have a flavor of the linear congruential pseudo-random number generator, and play the role of randomization.

Table 3: The data compression rates of E. Coli ORF b0060 and statistics from permutations. The protein gene sequence has 2352 nucleotide bases.

| Model | 1.0 | 2.0 | 2.1 | 3.0 | 3.1 | 3.2 | 4.0 | 4.1 | 4.2 | 4.3 |
|---|---|---|---|---|---|---|---|---|---|---|
| Original $L_{NML}$ (13) | 0.999 | 1.000 | 1.001 | 0.958 | 0.980 | 0.989 | 1.001 | 1.002 | 0.997 | 0.999 |
| $L_{NML}$ (13): average | 0.999 | 1.006 | 1.006 | 1.020 | 1.020 | 1.020 | 1.020 | 1.020 | 1.020 | 1.020 |
| SD (x 10^-3) | 0.00 | 0.74 | 0.76 | 1.69 | 1.77 | 1.71 | 4.22 | 4.33 | 4.15 | 3.92 |
| 1%-quantile | 0.999 | 1.004 | 1.004 | 1.016 | 1.016 | 1.016 | 1.011 | 1.010 | 1.011 | 1.011 |
| $nH(\hat\theta_n)$: average | 0.996 | 0.994 | 0.994 | 0.986 | 0.987 | 0.986 | 0.952 | 0.952 | 0.952 | 0.952 |
| SD (x 10^-3) | 0.00 | 0.74 | 0.76 | 1.69 | 1.77 | 1.71 | 3.69 | 3.79 | 3.63 | 3.42 |
| 99%-quantile | 0.996 | 0.995 | 0.995 | 0.990 | 0.990 | 0.990 | 0.961 | 0.961 | 0.960 | 0.960 |
| $nH(\hat\theta_n) + \frac{d}{2}\log n$: average | 0.999 | 1.010 | 1.010 | 1.051 | 1.051 | 1.051 | 1.174 | 1.174 | 1.174 | 1.174 |
| SD (x 10^-3) | 0.00 | 0.74 | 0.76 | 1.70 | 1.78 | 1.71 | 7.49 | 7.66 | 7.37 | 7.01 |
| 1%-quantile | 0.999 | 1.008 | 1.008 | 1.046 | 1.047 | 1.047 | 1.157 | 1.156 | 1.157 | 1.157 |
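A permutation reference distribution of the kind summarized in Table 3 can be generated along the following lines (illustrative Python on a synthetic sequence; the real analysis used the b0060 gene). The nucleotides are shuffled repeatedly, the bound (13) is recomputed for the codon parsing, and the average and a lower quantile of the resulting rates are reported.

```python
import math
import random
from collections import Counter

def l_nml_rate(seq, w):
    """Compression rate under fixed-length parsing of word length w, by (13)."""
    words = [seq[i:i + w] for i in range(0, len(seq) - w + 1, w)]
    counts = list(Counter(words).values())
    n, d = len(words), len(counts) - 1
    nH = -sum(c * math.log2(c / n) for c in counts)
    bits = (nH + 0.5 * d * math.log2(n) - 0.5 * d
            - math.lgamma((d + 1) / 2) / math.log(2) + 0.5 * math.log2(math.pi))
    return bits / (2 * len(seq))

random.seed(4)
seq = "".join(random.choices("ACGT", weights=[5, 3, 3, 1], k=2352))

rates = []
for _ in range(200):                       # permutation replicates
    perm = list(seq)
    random.shuffle(perm)
    rates.append(l_nml_rate("".join(perm), 3))

rates.sort()
print(f"original rate      = {l_nml_rate(seq, 3):.4f}")
print(f"permutation mean   = {sum(rates) / len(rates):.4f}")
print(f"1% lower quantile  = {rates[len(rates) // 100]:.4f}")
```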
5 Discussion

Putting together the analytical results and numerical examples, we show that the compression bound of a data sequence using an exponential family is the code length derived from the NML distribution (5). The empirical bound can be implemented by the Bayesian predictive coding for any given dictionary or model. Different models are then compared by their empirical compression bounds.

The examples of DNA sequences indicate that the compression rates by any dictionaries are indeed larger than 1 for random sequences, in consistency with the assertions of the Kolmogorov complexity theory. Conversely, if significant compression is achieved by a specific model, certain knowledge is gained. The codon structure is such an instance.

Unlike the algorithmic complexity, which contains a constant, the results based on probability distributions give the exact bits of code lengths. All three terms in (5) are important for the compression bound. Using only the first term $nH(\hat\theta_n)$ can lead to bounds of random sequences smaller than 1. The gap gets larger as the dictionary size increases, as seen from Tables 1 and 3. The bound obtained by adding the second term $\frac{d}{2}\log n$ had been proposed by the two-part coding or the Kolmogorov complexity. It is equivalent to BIC, widely used in model selection. However, it overestimates the influence of the dictionary size, as shown by the examples of simulations and DNA sequences. The inclusion of the Fisher information in the third term gives a tighter bound. The terms other than $nH(\hat\theta_n)$ get larger as the dictionary size increases in Tables 1 and 3. The observation that the compression bounds from all terms in (5) kept slightly above 1 for all tested dictionaries meets our expectation of the incompressibility of random sequences.

Although the empirical compression bound is obtained under the i.i.d. model, the word length can be set rather large to describe the local dependence between symbols. Indeed, as shown in the examples of DNA sequences, the empirical entropy term in (5) can get smaller, for either the original sequences or the permuted ones. Meanwhile, the second term can get larger. For a specific sequence, a better dictionary is selected by trading off the entropy part and the model complexity part.

Rissanen [25] obtained an expansion of the NML code length, in which the first term is the negative log-likelihood of the data with the parameters plugged in by the MLE. In this article, we show it is exactly the empirical entropy if the parametric model takes any exponential family. According to this formula, the NML code length is an empirical version or a direct extension of Shannon's source coding theorem. Furthermore, the asymptotics in [25] requires five assumptions, which are hard to examine.
Suzuki and Yamanishi proposed a Fourier approach to calculate the NML code length [32] for continuous random variables under certain assumptions. Instead, we show (5) is valid for exponential families, as long as $\int_\Theta |I(\theta)|^{1/2}\, d\theta < \infty$, without any other assumptions. If the Jeffreys prior is improper in the interior of the full parameter space, we can restrict the parameter to a compact subset. The exponential families include not only distributions of discrete symbols such as multinomials but also continuous signals such as those from the normal distribution.

The mathematics underlying the expansion of NML is the structure of local asymptotic normality proposed by Le Cam [6]. LAN has been used to show the optimality of certain statistical estimates. This article connects LAN to the compression bound. We have shown that as long as LAN is valid, a similar expansion of (2) can be obtained.

6 Appendix

This section contains the proofs of the results in Sections 2 and 3. The following basic facts about the exponential family (4) are needed; see [4].
1. $E(S(X)) = \dot A(\theta)$, and $\mathrm{Var}(S(X)) = \ddot A(\theta)$.
2. $\dot A(\cdot)$ is one to one on the natural parameter space.
3. The MLE $\hat\theta_n$ based on $(X_1, \dots, X_n)$ is given by $\hat\theta_n = \dot A^{-1}(\bar S_n)$, where $\bar S_n = \frac{1}{n}\sum_{i=1}^n S(X_i)$.
4. The Fisher information matrix $I(\theta) = \ddot A(\theta)$.

Proof of Theorem 1. In the canonical exponential family, the natural parameter space is open and convex. Since $\int_\Theta |I(\theta)|^{1/2}\, d\theta < \infty$, we can find a series of bounded sets $\{\Theta^{(k)}, k = 1, 2, \dots\}$ such that
$$\log\Big[\int_\Theta |I(\theta)|^{1/2}\, d\theta\Big] - \log\Big[\int_{\Theta^{(k)}} |I(\theta)|^{1/2}\, d\theta\Big] = \epsilon_k,$$
where $\epsilon_k \to 0$. Furthermore, we can select each bounded set $\Theta^{(k)}$ so that it can be partitioned into disjoint cubes, each of which is denoted by $U(\theta_j^{(k)}, \frac{r}{\sqrt n})$ with $\theta_j^{(k)}$ as its center and $\frac{r}{\sqrt n}$ as its side length. Namely, $\Theta^{(k)} = \bigcup_j U(\theta_j^{(k)}, \frac{r}{\sqrt n})$, and $U(\theta_{j_1}^{(k)}, \frac{r}{\sqrt n}) \cap U(\theta_{j_2}^{(k)}, \frac{r}{\sqrt n}) = \emptyset$ for $j_1 \ne j_2$. The normalizing constant in equation (6) can be summed (integrated in the case of continuous variables) by the sufficient statistic $\sum_{i=1}^n S(x_i)$, and in turn by the MLE $\hat\theta_n$:
$$\sum_{\{x^{(n)}:\ \hat\theta_n\in\Theta^{(k)}\}} p(x^{(n)}; \hat\theta_n) = \sum_{U(\theta_j^{(k)}, r/\sqrt n)}\ \sum_{\{x^{(n)}:\ \hat\theta_n\in U(\theta_j^{(k)}, r/\sqrt n)\}} p(x^{(n)}; \hat\theta_n), \quad (14)$$
$$p(x^{(n)}; \hat\theta_n) = e^{\hat\theta_n^T\sum_{i=1}^n S(x_i) - nA(\hat\theta_n)}\,\mu_n(dx^n) = e^{n\hat\theta_n^T \bar S_n - nA(\hat\theta_n)}\,\mu_n(dx^n). \quad (15)$$
Now expand $n[\theta^T\bar S_n - A(\theta)]$ around $\hat\theta_n$ within the neighborhood $U(\theta_j^{(k)}, \frac{r}{\sqrt n})$:
$$n[\theta^T\bar S_n - A(\theta)] = n[\hat\theta_n^T\bar S_n - A(\hat\theta_n)] + (\theta - \hat\theta_n)^T[n\bar S_n - n\dot A(\hat\theta_n)] - \frac{1}{2}(\theta - \hat\theta_n)^T[n\ddot A(\hat\theta_n)](\theta - \hat\theta_n) + M_1 n\|\theta - \hat\theta_n\|^3.$$
Since the MLE $\hat\theta_n = \dot A^{-1}(\bar S_n)$, the second term is zero. Furthermore, we expand $\ddot A(\hat\theta_n)$ around $\ddot A(\theta)$ and rearrange the terms in the equation; then we have
$$n[\hat\theta_n^T\bar S_n - A(\hat\theta_n)] = n[\theta^T\bar S_n - A(\theta)] + \frac{1}{2}(\hat\theta_n - \theta)^T[n\ddot A(\theta)](\hat\theta_n - \theta) + M_2 n\|\theta - \hat\theta_n\|^3,$$
where the constant $M_2$ involves the third-order derivatives of $A(\theta)$, which are continuous in the canonical exponential family and thus bounded in the bounded set $\Theta^{(k)}$. In other words, $M_2$ is bounded uniformly across all $\{U(\theta_j^{(k)}, \frac{r}{\sqrt n})\}$. Similar bounded constants will be used repeatedly hereafter.
Then equation (15) becomes
$$p(x^{(n)}; \hat\theta_n) = e^{\frac{1}{2}(\theta - \hat\theta_n)^T[n\ddot A(\theta)](\theta - \hat\theta_n) + M_2 n\|\theta - \hat\theta_n\|^3}\; e^{n[\theta^T\bar S_n - A(\theta)]}\,\mu_n(dx^n).$$
Notice that the exponential form $e^{n[\theta^T\bar S_n - A(\theta)]}$ is the density of $n\bar S_n$. If we consider i.i.d. random variables $Y_1, \dots, Y_n$ sampled from the exponential distribution (4), the MLE $\hat\theta(Y^{(n)})$ is a random variable. Take $\theta = \theta_j^{(k)}$; then the sum of the above quantity over the neighborhood $U(\theta_j^{(k)}, \frac{r}{\sqrt n})$ is nothing but the expectation of $\hat\theta(Y^{(n)}) = \dot A^{-1}(\bar S_n)$ with respect to the distribution of $n\bar S_n$, evaluated at the parameter $\theta_j^{(k)}$:
$$\sum_{\{x^{(n)}:\ \hat\theta_n\in U(\theta_j^{(k)}, r/\sqrt n)\}} p(x^{(n)}; \hat\theta_n) = E_{\theta_j^{(k)}}\Big[1\big[\hat\theta_n\in U(\theta_j^{(k)}, \tfrac{r}{\sqrt n})\big]\, e^{\frac{1}{2}(\hat\theta_n-\theta_j^{(k)})^T[n\ddot A(\theta_j^{(k)})](\hat\theta_n-\theta_j^{(k)}) + M_2 n\|\hat\theta_n-\theta_j^{(k)}\|^3}\Big].$$
Let $\xi_n = \sqrt n\,(\hat\theta_n - \theta_j^{(k)})$. Now $\hat\theta_n \in U(\theta_j^{(k)}, \frac{r}{\sqrt n})$ if and only if $\xi_n \in U(0, r)$, where $U(0, r)$ is the $d$-dimensional cube centered at zero with side length $r$. Next, expanding $e^{M_2 n\|\hat\theta_n - \theta_j^{(k)}\|^3}$ in the neighborhood, the above becomes
$$\sum_{\{\hat\theta_n\in U(\theta_j^{(k)}, r/\sqrt n)\}} p(x^{(n)}; \hat\theta_n) = E\Big[1[\xi_n\in U(0,r)]\, e^{\frac{1}{2}\xi_n^T I(\theta_j^{(k)})\xi_n}\,(1 + M_3 n^{-\frac{1}{2}})\Big].$$
According to the central limit theorem, $n^{-\frac{1}{2}}\big[\sum_{i=1}^n S(Y_i) - n\dot A(\theta_j^{(k)})\big] \xrightarrow{d} N(0, \ddot A(\theta_j^{(k)}))$. Moreover, the approximation error has the Berry-Esseen bound $O(n^{-\frac{1}{2}})$, where the constant is determined by the bound on the third-order derivatives of $A(\theta)$. Similarly, we have the asymptotic normality of the MLE, $\xi_n(Y^{(n)}) \xrightarrow{d} N(0, I(\theta_j^{(k)})^{-1})$, where the Berry-Esseen bound is valid for the convergence; see [18]. Therefore, the expectation converges as follows:
$$\begin{aligned}
E\Big\{1[\xi_n\in U(0,r)]\, e^{\frac{1}{2}\xi_n^T I(\theta_j^{(k)})\xi_n}(1+M_3 n^{-\frac{1}{2}})\Big\}
&= \int 1[\xi_n\in U(0,r)]\, e^{\frac{1}{2}\xi_n^T I(\theta_j^{(k)})\xi_n}(1+M_3 n^{-\frac{1}{2}})\, \frac{|I(\theta_j^{(k)})|^{1/2}}{(2\pi)^{d/2}}\, e^{-\frac{1}{2}\xi_n^T I(\theta_j^{(k)})\xi_n}\, d\xi_n + M_4 n^{-\frac{1}{2}} \\
&= (2\pi)^{-\frac{d}{2}} |I(\theta_j^{(k)})|^{\frac{1}{2}} \int 1[\xi_n\in U(0,r)](1+M_3 n^{-\frac{1}{2}})\, d\xi_n + M_4 n^{-\frac{1}{2}} \\
&= (2\pi)^{-\frac{d}{2}} |I(\theta_j^{(k)})|^{\frac{1}{2}}\, r^d (1+M_3 n^{-\frac{1}{2}}) + M_4 n^{-\frac{1}{2}} \\
&= n^{\frac{d}{2}} (2\pi)^{-\frac{d}{2}} |I(\theta_j^{(k)})|^{1/2}\, (r^d n^{-\frac{d}{2}})(1 + M_3' n^{-\frac{1}{2}}).
\end{aligned}$$
Plugging this into the sum (14), we obtain
$$\sum_{\{x^{(n)}:\ \hat\theta_n\in\Theta^{(k)}\}} p(x^{(n)}; \hat\theta_n) = n^{\frac{d}{2}}(2\pi)^{-\frac{d}{2}}\Big[\sum_{U(\theta_j^{(k)}, r/\sqrt n)} |I(\theta_j^{(k)})|^{1/2}(r^d n^{-\frac{d}{2}})\Big](1+M_3' n^{-\frac{1}{2}}) \longrightarrow n^{\frac{d}{2}}(2\pi)^{-\frac{d}{2}}\Big[\int_{\Theta^{(k)}} |I(\theta)|^{1/2}\, d\theta\Big](1+M_3' n^{-\frac{1}{2}}),$$
and hence
$$\begin{aligned}
\log\Big[\sum_{\{x^{(n)}:\ \hat\theta_n\in\Theta^{(k)}\}} p(x^{(n)}; \hat\theta_n)\Big]
&= \frac{d}{2}\log\frac{n}{2\pi} + \log\int_{\Theta^{(k)}} |I(\theta)|^{1/2}\, d\theta + M_3'' n^{-\frac{1}{2}} \\
&= \frac{d}{2}\log\frac{n}{2\pi} + \log\int_{\Theta} |I(\theta)|^{1/2}\, d\theta + \Big[\log\int_{\Theta^{(k)}} |I(\theta)|^{1/2}\, d\theta - \log\int_{\Theta} |I(\theta)|^{1/2}\, d\theta\Big] + M_3'' n^{-\frac{1}{2}} \\
&= \frac{d}{2}\log\frac{n}{2\pi} + \log\int_{\Theta} |I(\theta)|^{1/2}\, d\theta - \epsilon_k + M_3'' n^{-\frac{1}{2}}.
\end{aligned}$$
Note that the bound $M_3''$ of the last term relies solely on $\Theta^{(k)}$. For a given $k$, we select $n$ such that the last term is sufficiently small. This completes the proof.

Proof of Theorem 2. First we consider the conjugate prior of (4), which takes the form
$$u(\theta) = \exp\{\alpha'\theta - \beta A(\theta) - B(\alpha, \beta)\}, \quad (16)$$
where $\alpha$ is a vector in $R^d$, $\beta$ is a scalar, and $B(\alpha, \beta) = \log\int_\Theta \exp\{\alpha'\theta - \beta A(\theta)\}\, d\theta$.
Proof of Theorem 2. First we consider the conjugate prior of (4), which takes the form
\[
u(\theta) = \exp\{\alpha^T\theta - \beta A(\theta) - B(\alpha, \beta)\}, \tag{16}
\]
where $\alpha$ is a vector in $R^d$, $\beta$ is a scalar, and $B(\alpha, \beta) = \log\int_\Theta \exp\{\alpha^T\theta - \beta A(\theta)\}\, d\theta$. Then the marginal density is
\[
m(x^{(n)}) = \exp\Big\{B\Big(\sum_{i=1}^n S(x_i) + \alpha,\ n + \beta\Big) - B(\alpha, \beta)\Big\}, \tag{17}
\]
according to the definition of $B(\cdot, \cdot)$. Therefore
\[
L_{mixture} = B(\alpha, \beta) - B\Big(\sum_{i=1}^n S(x_i) + \alpha,\ n + \beta\Big) = B(\alpha, \beta) - \log\Big(\int_\Theta \exp\{nL_n(t)\}\, dt\Big),
\]
where $nL_n(t) = [\sum_{i=1}^n S(x_i) + \alpha]^T t - (n + \beta)A(t)$. The maximum of $L_n(t)$ is achieved at
\[
\tilde\theta_n = \tilde\theta(x^{(n)}) = \dot A^{-1}\Big(\frac{\sum_{i=1}^n S(x_i) + \alpha}{n + \beta}\Big).
\]
Notice that
\[
\dot A(\tilde\theta_n) = \frac{\sum_{i=1}^n S(x_i) + \alpha}{n + \beta} = \frac{\sum_{i=1}^n S(x_i)}{n} + O\Big(\frac1n\Big) = \dot A(\hat\theta_n) + O\Big(\frac1n\Big).
\]
Through Taylor's expansion, it can be shown that
\[
\tilde\theta_n = \hat\theta_n + O\Big(\frac1n\Big). \tag{18}
\]
Notice that $-\ddot L_n(t) = \frac{n+\beta}{n}\ddot A(t)$. Let $\Sigma = -\ddot L_n^{-1}(\tilde\theta_n) = \frac{n}{n+\beta}\ddot A^{-1}(\tilde\theta_n)$. By expanding $L_n(t)$ at the maximizer $\tilde\theta_n$ and applying the Laplace method (see [5]), we have
\[
\log\Big(\int_\Theta \exp\{nL_n(t)\}\, dt\Big) = -\frac d2\log\frac{n}{2\pi} + \frac12\log(\det\Sigma) + nL_n(\tilde\theta_n) + O\Big(\frac1n\Big)
\longrightarrow -\frac d2\log\frac{n}{2\pi} - \frac12\log(\det I(\tilde\theta_n)) + nL_n(\tilde\theta_n).
\]
Next,
\[
\begin{aligned}
L_{mixture} - nH(\hat\theta_n)
&\longrightarrow B(\alpha, \beta) + \frac d2\log\frac{n}{2\pi} + \frac12\log(\det I(\tilde\theta_n)) - nL_n(\tilde\theta_n) - nH(\hat\theta_n) \\
&= B(\alpha, \beta) + \frac d2\log\frac{n}{2\pi} + \frac12\log(\det I(\tilde\theta_n)) - \Big[\sum_{i=1}^n S(x_i) + \alpha\Big]^T\tilde\theta_n + (n+\beta)A(\tilde\theta_n) - nA(\hat\theta_n) + \Big[\sum_{i=1}^n S(x_i)\Big]^T\hat\theta_n \\
&= \frac d2\log\frac{n}{2\pi} + \frac12\log(\det I(\tilde\theta_n)) - [\alpha^T\tilde\theta_n - \beta A(\tilde\theta_n) - B(\alpha, \beta)] - \Big[\sum_{i=1}^n S(x_i)\Big]^T(\tilde\theta_n - \hat\theta_n) + n[A(\tilde\theta_n) - A(\hat\theta_n)] \\
&= \frac d2\log\frac{n}{2\pi} + \frac12\log(\det I(\tilde\theta_n)) - \log w(\tilde\theta_n) - \Big[\sum_{i=1}^n S(x_i)\Big]^T(\tilde\theta_n - \hat\theta_n) + n\dot A(\hat\theta_n)^T(\tilde\theta_n - \hat\theta_n) + \frac n2(\tilde\theta_n - \hat\theta_n)^T\ddot A(\hat\theta_n)(\tilde\theta_n - \hat\theta_n) + O\Big(\frac1n\Big) \\
&= \frac d2\log\frac{n}{2\pi} + \frac12\log(\det I(\tilde\theta_n)) - \log w(\tilde\theta_n) - n[\bar S_n - \dot A(\hat\theta_n)]^T(\tilde\theta_n - \hat\theta_n) + \frac n2(\tilde\theta_n - \hat\theta_n)^T\ddot A(\hat\theta_n)(\tilde\theta_n - \hat\theta_n) + O\Big(\frac1n\Big) \\
&\longrightarrow \frac d2\log\frac{n}{2\pi} + \frac12\log(\det I(\hat\theta_n)) - \log w(\hat\theta_n).
\end{aligned}
\]
The last step is valid because $\hat\theta_n = \dot A^{-1}(\bar S_n)$ and (18). This proves the case of the prior $u(\theta)$ in (16). Meanwhile, we obtain the expansion of (17):
\[
\begin{aligned}
m(x^{(n)}) = \exp\{-L_{mixture}\} &= \exp\Big\{B\Big(\sum_{i=1}^n S(x_i) + \alpha,\ n+\beta\Big) - B(\alpha, \beta)\Big\} \\
&= \exp\Big\{-nH(\hat\theta_n) - \frac d2\log\frac{n}{2\pi} - \frac12\log(\det I(\hat\theta_n)) + \log w(\hat\theta_n) + o(1)\Big\} \\
&= [w(\hat\theta_n) + o(1)]\exp\Big\{-nH(\hat\theta_n) - \frac d2\log\frac{n}{2\pi} - \frac12\log(\det I(\hat\theta_n))\Big\}. \tag{19}
\end{aligned}
\]
Suppose the prior of $\theta$ takes the form of a finite mixture of the conjugate distributions (16):
\[
w(\theta) = \sum_{j=1}^J \lambda_j \exp\{\alpha_j^T\theta - \beta_j A(\theta) - B(\alpha_j, \beta_j)\} = \sum_{j=1}^J \lambda_j u_j(\theta), \tag{20}
\]
where $\sum_{j=1}^J \lambda_j = 1$ and $0 < \lambda_j < 1$, $j = 1, \cdots, J$. Then the marginal density is given by
\[
\begin{aligned}
m(x^{(n)}) &= \sum_{j=1}^J \lambda_j \exp\Big\{B\Big(\sum_{i=1}^n S(x_i) + \alpha_j,\ n+\beta_j\Big) - B(\alpha_j, \beta_j)\Big\} \\
&= \sum_{j=1}^J \lambda_j [u_j(\hat\theta_n) + o(1)]\exp\Big\{-nH(\hat\theta_n) - \frac d2\log\frac{n}{2\pi} - \frac12\log(\det I(\hat\theta_n))\Big\} \\
&= \exp\Big\{-nH(\hat\theta_n) - \frac d2\log\frac{n}{2\pi} - \frac12\log(\det I(\hat\theta_n))\Big\}\Big[\sum_{j=1}^J \lambda_j u_j(\hat\theta_n) + o(1)\Big] \\
&= \exp\Big\{-nH(\hat\theta_n) - \frac d2\log\frac{n}{2\pi} - \frac12\log(\det I(\hat\theta_n))\Big\}[w(\hat\theta_n) + o(1)] \\
&= \exp\Big\{-nH(\hat\theta_n) - \frac d2\log\frac{n}{2\pi} - \frac12\log(\det I(\hat\theta_n)) + \log w(\hat\theta_n) + o(1)\Big\},
\end{aligned}
\]
where each summand was approximated by (19). This completes the proof because $L_{mixture} = -\log\{m(x^{(n)})\}$.
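As a numerical companion to Theorem 2 (again an illustrative sketch under hypothetical choices, not the paper's own computation), the Bernoulli family with a conjugate Beta$(a,b)$ prior admits a closed-form marginal, so the exact pathwise mixture code length can be compared with $nH(\hat\theta_n) + \frac d2\log\frac{n}{2\pi} + \log\frac{|I(\hat\theta_n)|^{1/2}}{w(\hat\theta_n)}$. The expression is evaluated below in the mean parametrization, where $I(p) = 1/[p(1-p)]$; the ratio $|I|^{1/2}/w$ is unchanged by the reparametrization because the square-root information and the prior density pick up the same Jacobian.
\begin{verbatim}
from math import lgamma, log, pi

# Minimal sketch (illustration only): Bernoulli family with a Beta(a, b) prior.
# The sequence (k ones out of n) and the priors are hypothetical; lengths in nats.
def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def mixture_code_length(k, n, a, b):
    """Exact -log m(x^(n)) for a sequence with k ones under a Beta(a, b) prior."""
    return -(log_beta(a + k, b + n - k) - log_beta(a, b))

def asymptotic_code_length(k, n, a, b):
    p = k / n
    nH = -k * log(p) - (n - k) * log(1 - p)                            # n * H(p_hat)
    log_w = (a - 1) * log(p) + (b - 1) * log(1 - p) - log_beta(a, b)   # Beta(a, b) density at p_hat
    log_sqrt_I = -0.5 * log(p * (1 - p))                               # log |I(p_hat)|^{1/2}
    return nH + 0.5 * log(n / (2 * pi)) + log_sqrt_I - log_w

n, k = 10000, 3000
for a, b in [(0.5, 0.5), (1.0, 1.0), (2.0, 5.0)]:   # Jeffreys, uniform, informative
    print(f"Beta({a}, {b}):  exact = {mixture_code_length(k, n, a, b):.3f}"
          f"   asymptotic = {asymptotic_code_length(k, n, a, b):.3f}")
\end{verbatim}
With the Jeffreys prior Beta$(1/2, 1/2)$, the term $\log[|I(\hat\theta_n)|^{1/2}/w(\hat\theta_n)]$ reduces to $\log\pi$, matching the constant in the NML expansion of Theorem 1.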
7 Acknowledgement

The author is grateful to Prof. Bin Yu and Dr. Jorma Rissanen for their guidance in learning the topic. This research is supported by the National Key Research and Development Program of China (2022YFA1004801), the National Natural Science Foundation of China (Grant No. 32170679, 11871462, 91530105), the National Center for Mathematics and Interdisciplinary Sciences of the Chinese Academy of Sciences, and the Key Laboratory of Systems and Control of the CAS.

References

[1] E. coli genome and protein genes. https://www.ncbi.nlm.nih.gov/genome.
[2] H. Akaike. On entropy maximisation principle. In P. R. Krishnaiah, editor, Applications of Statistics, pages 27–41. Amsterdam: North Holland, 1970.
[3] A. Barron, J. Rissanen, and B. Yu. The minimum description length principle in coding and modeling. IEEE Trans. Inform. Theory, pages 2743–2760, 1998.
[4] L. D. Brown. Fundamentals of Statistical Exponential Families: With Applications in Statistical Decision Theory. Institute of Mathematical Statistics, USA, 1986.
[5] N. G. De Bruijn. Asymptotic Methods in Analysis. North-Holland: Amsterdam, 1958.
[6] L. Le Cam and G. Yang. Asymptotics in Statistics: Some Basic Concepts. Springer, 2000.
[7] B. S. Clarke and A. R. Barron. Jeffreys' prior is asymptotically least favorable under entropy risk. Journal of Statistical Planning and Inference, 41:37–64, 1994.
[8] T. M. Cover, P. Gacs, and R. M. Gray. Kolmogorov's contributions to information theory and algorithmic complexity. The Annals of Probability, 17(3):840–865, 1989.
[9] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 1991.
[10] A. P. Dawid. Present position and potential developments: some personal views, statistical theory, the prequential approach. J. R. Stat. Soc. Ser. B, 147, 1984.
[11] A. P. Dawid. Prequential analysis, stochastic complexity and Bayesian inference. In Fourth Valencia International Meeting on Bayesian Statistics, pages 15–20, 1992.
[12] P. Grünwald. The Minimum Description Length Principle. MIT Press, 2007.
[13] M. Hansen and B. Yu. Model selection and minimum description length principle. JASA, 96, 2001.
[14] D. Haussler. A general minimax result for relative entropy. IEEE Transactions on Information Theory, 43:1276–1280, 1997.
[15] R. E. Krichevsky. The connection between the redundancy and reliability of information about the source. Probl. Inform. Trans., 4:48–57, 1968.
[16] L. M. Li and B. Yu. Iterated logarithmic expansions of the pathwise code lengths for exponential families. IEEE Transactions on Information Theory, 46:2683–2689, 2000.
[17] M. Li and P. Vitányi. An Introduction to Kolmogorov Complexity and its Applications. Springer-Verlag, New York, 1996.
[18] I. Pinelis and R. Molzon. Optimal-order bounds on the rate of convergence to normality in the multivariate delta method. Electronic Journal of Statistics, 10(1):1001–1063, 2016.
[19] M. J. D. Powell. Approximation Theory and Methods. Cambridge University Press: Cambridge, 1981.
[20] J. B. Prolla. A generalized Bernstein approximation theorem. Mathematical Proceedings of the Cambridge Philosophical Society, 104(2):317–330, 1988.
[21] J. Rissanen. Generalized Kraft inequality and arithmetic coding. IBM Journal of Research and Development, 20(3):199–203, May 1976.
[22] J. Rissanen. A predictive least squares principle. IMA Journal of Mathematical Control and Information, 3:211–222, 1986.
[23] J. Rissanen. Stochastic complexity and modeling. Annals of Statistics, 14:1080–1100, 1986.
[24] J. Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific: Singapore, 1989.
[25] J. Rissanen. Fisher information and stochastic complexity. IEEE Trans. Inform. Theory, 42:40–47, 1996.
[26] J. Rissanen. Strong optimality of the normalized ML models as universal codes and information in data. IEEE Transactions on Information Theory, 47(5):1712–1717, 2001.
[27] J. J. Rissanen and G. G. Langdon. Arithmetic coding. IBM Journal of Research and Development, 23(2):149–162, March 1979.
[28] B. Ryabko, J. Astola, and A. Gammerman. Application of Kolmogorov complexity and universal codes to identity testing and nonparametric testing of serial independence for time series. Theoretical Computer Science, pages 440–448, 2006.
[29] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461–464, 1978.
[30] C. E. Shannon. A mathematical theory of communication. Bell Sys. Tech. Journal, 27:379–423, 623–656, 1948.
[31] Y. M. Shtarkov. Universal sequential coding of single messages. Problems of Information Transmission, 23(3):3–17, July–September 1987.
[32] A. Suzuki and K. Yamanishi. Exact calculation of normalized maximum likelihood code length using Fourier analysis. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 1211–1215, 2018.
[33] N. Usotskaya and B. Ryabko. Applications of information-theoretic tests for analysis of DNA sequences based on Markov chain models. Computational Statistics and Data Analysis, pages 1861–1872, 2009.
[34] P. Vitányi and M. Li. Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inform. Theory, 46:446–464, 2000.
