Algorithmic Information Theory


Peter D. Grünwald, CWI, P.O. Box 94079, NL-1090 GB Amsterdam, The Netherlands. E-mail: pdg@cwi.nl
Paul M.B. Vitányi, CWI, P.O. Box 94079, NL-1090 GB Amsterdam, The Netherlands. E-mail: paulv@cwi.nl

November 26, 2024

Abstract

We introduce algorithmic information theory, also known as the theory of Kolmogorov complexity. We explain the main concepts of this quantitative approach to defining 'information'. We discuss the extent to which Kolmogorov's and Shannon's information theory have a common purpose, and where they are fundamentally different. We indicate how recent developments within the theory allow one to formally distinguish between 'structural' (meaningful) and 'random' information as measured by the Kolmogorov structure function, which leads to a mathematical formalization of Occam's razor in inductive inference. We end by discussing some of the philosophical implications of the theory.

Keywords: Kolmogorov complexity, algorithmic information theory, Shannon information theory, mutual information, data compression, Kolmogorov structure function, Minimum Description Length Principle.

1 Introduction

How should we measure the amount of information about a phenomenon that is given to us by an observation concerning the phenomenon? Both 'classical' (Shannon) information theory (see the chapter by Harremoës and Topsøe [2007]) and algorithmic information theory start with the idea that this amount can be measured by the minimum number of bits needed to describe the observation. But whereas Shannon's theory considers description methods that are optimal relative to some given probability distribution, Kolmogorov's algorithmic theory takes a different, nonprobabilistic approach: any computer program that first computes (prints) the string representing the observation, and then terminates, is viewed as a valid description. The amount of information in the string is then defined as the size (measured in bits) of the shortest computer program that outputs the string and then terminates. A similar definition can be given for infinite strings, but in this case the program produces element after element forever. Thus, a long sequence of 1s such as

    11...1 (10000 times)    (1)

contains little information because a program of size about log 10000 bits outputs it:

    for i := 1 to 10000; print 1.

Likewise, the transcendental number π = 3.1415..., an infinite sequence of seemingly 'random' decimal digits, contains but a few bits of information (there is a short program that produces the consecutive digits of π forever). Such a definition would appear to make the amount of information in a string (or other object) depend on the particular programming language used. Fortunately, it can be shown that all reasonable choices of programming languages lead to quantification of the amount of 'absolute' information in individual objects that is invariant up to an additive constant. We call this quantity the 'Kolmogorov complexity' of the object. While regular strings have small Kolmogorov complexity, random strings have Kolmogorov complexity about equal to their own length.
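To see the size difference concretely, here is a minimal Python sketch (Python standing in for the pseudocode above; the byte counts are rough, since the exact constants depend on the language chosen):

    import math

    n = 10000
    x = "1" * n                      # the string of (1): 10000 ones

    # A literal description must spell out all n characters.
    literal_bits = 8 * len(x)        # one byte per character

    # A generating program only needs the loop bound n, which takes
    # about log2(n) bits to write down, plus a constant-size loop body.
    program = "print('1' * 10000)"   # analogue of: for i := 1 to 10000; print 1
    program_bits = 8 * len(program)

    print(literal_bits)              # 80000 bits
    print(program_bits)              # 144 bits: mostly constant overhead
    print(math.log2(n))              # ~13.3 bits actually needed for '10000'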
Measuring complexity and information in terms of program size has turned out to be a very powerful idea with applications in areas such as theoretical computer science, logic, probability theory, statistics and physics.

This Chapter

Kolmogorov complexity was introduced independently and with different motivations by R.J. Solomonoff (born 1926), A.N. Kolmogorov (1903–1987) and G. Chaitin (born 1943) in 1960/1964, 1965 and 1966 respectively [Solomonoff 1964; Kolmogorov 1965; Chaitin 1966]. During the last forty years, the subject has developed into a major and mature area of research. Here, we give a brief overview of the subject geared towards an audience specifically interested in the philosophy of information. With the exception of the recent work on the Kolmogorov structure function and parts of the discussion on philosophical implications, all material we discuss here can also be found in the standard textbook [Li and Vitányi 1997]. The chapter is structured as follows: we start with an introductory section in which we define Kolmogorov complexity and list its most important properties. We do this in a much simplified (yet formally correct) manner, avoiding both technicalities and all questions of motivation (why this definition and not another one?). This is followed by Section 3, which provides an informal overview of the more technical topics discussed later in this chapter, in Sections 4–6. The final Section 7, which discusses the theory's philosophical implications, as well as Section 6.3, which discusses the connection to inductive inference, are less technical again, and should perhaps be glossed over before delving into the technicalities of Sections 4–6.

2 Kolmogorov Complexity: Essentials

The aim of this section is to introduce our main notion in the fastest and simplest possible manner, avoiding, to the extent that this is possible, all technical and motivational issues. Section 2.1 provides a simple definition of Kolmogorov complexity. We list some of its key properties in Section 2.2. Knowledge of these key properties is an essential prerequisite for understanding the advanced topics treated in later sections.

2.1 Definition

The Kolmogorov complexity K will be defined as a function from finite binary strings of arbitrary length to the natural numbers N. Thus, K : {0,1}* → N is a function defined on 'objects' represented by binary strings. Later the definition will be extended to other types of objects such as numbers (Example 3), sets, functions and probability distributions (Example 7).

As a first approximation, K(x) may be thought of as the length of the shortest computer program that prints x and then halts. This computer program may be written in Fortran, Java, LISP or any other universal programming language. By this we mean a general-purpose programming language in which a universal Turing machine can be implemented. Most languages encountered in practice have this property. For concreteness, let us fix some universal language (say, LISP) and define Kolmogorov complexity with respect to it. The invariance theorem discussed below implies that it does not really matter which one we pick.

Computer programs often make use of data. Such data are sometimes listed inside the program.
An example is the bitstring "010110..." in the program

    print "01011010101000110...010"    (2)

In other cases, such data are given as additional input to the program. To prepare for later extensions such as conditional Kolmogorov complexity, we should allow for this possibility as well. We thus extend our initial definition of Kolmogorov complexity by considering computer programs with a very simple input-output interface: programs are provided a stream of bits, which, while running, they can read one bit at a time. There are no end-markers in the bit stream, so that, if a program p halts on input y and outputs x, then it will also halt on any input yz, where z is a continuation of y, and still output x. We write p(y) = x if, on input y, p prints x and then halts. We define the Kolmogorov complexity relative to a given language as the length of the shortest program p plus input y, such that, when given input y, p computes (outputs) x and then halts. Thus:

    K(x) := min_{p,y : p(y)=x} l(p) + l(y),    (3)

where l(p) denotes the length of program p, and l(y) denotes the length of input y, both expressed in bits. To make this definition formally entirely correct, we need to assume that the program p runs on a computer with unlimited memory, and that the language in use has access to all this memory. Thus, while the definition (3) can be made formally correct, it does obscure some technical details which need not concern us now. We return to these in Section 4.
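The following Python sketch illustrates definition (3) on a toy scale. The two 'programs' below are hypothetical stand-ins for programs in a universal language, and the bit sizes charged for them are made-up constants; real Kolmogorov complexity minimizes over all programs of a universal machine, which is not feasible here.

    # Toy illustration of K(x) = min over (p, y) with p(y) = x of l(p) + l(y).

    def literal(y: str) -> str:
        # reads its input bits and prints them unchanged
        return y

    def repeater(y: str) -> str:
        # reads one bit b, then a count in binary, and prints b count times
        b, count = y[0], int(y[1:], 2)
        return b * count

    PROGRAMS = [(literal, 20), (repeater, 40)]   # (function, assumed l(p) in bits)

    def toy_K(x: str) -> int:
        """Minimize l(p) + l(y) over the toy program family."""
        best = 20 + len(x)                 # the literal program always works: y = x
        for ly in range(1, 17):            # brute-force only over short inputs y
            for i in range(2 ** ly):
                y = format(i, f'0{ly}b')
                for p, lp in PROGRAMS:
                    try:
                        if p(y) == x:
                            best = min(best, lp + ly)
                    except (ValueError, IndexError):
                        pass               # y is not well-formed for p
        return best

    print(toy_K("1" * 64))   # repeater wins: 40 + 8 input bits, far below 20 + 64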
2.2 Key Properties of Kolmogorov Complexity

To gain further intuition about K(x), we now list five of its key properties. Three of these concern the size of K(x) for commonly encountered types of strings. The fourth is the invariance theorem, and the fifth is the fact that K(x) is uncomputable in general. Henceforth, we use x to denote finite bitstrings. We abbreviate l(x), the length of a given bitstring x, to n. We use boldface x to denote an infinite binary string. In that case, x[1:n] is used to denote the initial n-bit segment of x.

1(a). Very Simple Objects: K(x) = O(log n).

K(x) must be small for 'simple' or 'regular' objects x. For example, there exists a fixed-size program that, when input n, outputs the first n bits of π and then halts. As is easy to see (Section 4.2), specification of n takes O(log n) bits. Thus, when x consists of the first n bits of π, its complexity is O(log n). Similarly, we have K(x) = O(log n) if x represents the first n bits of a sequence like (1) consisting of only 1s. We also have K(x) = O(log n) for the first n bits of e, written in binary; or even for the first n bits of a sequence whose i-th bit is the i-th bit of e^{2.3} if the (i−1)-st bit was a one, and the i-th bit of 1/π if the (i−1)-st bit was a zero. For certain 'special' lengths n, we may have K(x) even substantially smaller than O(log n). For example, suppose n = 2^m for some m ∈ N. Then we can describe n by first describing m and then describing a program implementing the function f(z) = 2^z. The description of m takes O(log m) bits, and the description of the program takes a constant number of bits not depending on n. Therefore, for such values of n, we get K(x) = O(log m) = O(log log n).

1(b). Completely Random Objects: K(x) = n + O(log n).

A code or description method is a binary relation between source words – strings to be encoded – and code words – encoded versions of these strings. Without loss of generality, we can take the set of code words to be finite binary strings [Cover and Thomas 1991]. In this chapter we only consider uniquely decodable codes where the relation is one-to-one or one-to-many, indicating that given an encoding E(x) of string x, we can always reconstruct the original x. The Kolmogorov complexity of x can be viewed as the code length of x that results from using the Kolmogorov code E*(x): this is the code that encodes x by the shortest program that prints x and halts.

The following crucial insight will be applied to the Kolmogorov code, but it is important to realize that in fact it holds for every uniquely decodable code: for any uniquely decodable code, there are no more than 2^m strings x which can be described by m bits. The reason is quite simply that there are no more than 2^m binary strings of length m. Thus, the number of strings that can be described by less than m bits can be at most 2^{m−1} + 2^{m−2} + ... + 1 < 2^m. In particular, this holds for the code E* whose length function is K(x). Thus, the fraction of strings x of length n with K(x) < n − k is less than 2^{−k}: the overwhelming majority of sequences cannot be compressed by more than a constant. Specifically, if x is determined by n independent tosses of a fair coin, then all sequences of length n have the same probability 2^{−n}, so that with probability at least 1 − 2^{−k}, K(x) ≥ n − k.

On the other hand, for arbitrary x, there exists a program 'print x; halt'. This program seems to have length n + O(1), where O(1) is a small constant accounting for the 'print' and 'halt' symbols. We have to be careful though: computer programs are usually represented as a sequence of bytes. Then in the program above x cannot be an arbitrary sequence of bytes, because we somehow have to mark the end of x. Although we represent both the program and the string x as bits rather than bytes, the same problem remains. To avoid it, we have to encode x in a prefix-free manner (Section 4.2), which takes n + O(log n) bits, rather than n + O(1). Therefore, for all x of length n, K(x) ≤ n + O(log n). Except for a fraction of 2^{−c} of these, K(x) ≥ n − c, so that for the overwhelming majority of x,

    K(x) = n + O(log n).    (4)

Similarly, if x is determined by independent tosses of a fair coin, then (4) holds with overwhelming probability. Thus, while for very regular strings the Kolmogorov complexity is small (sublinear in the length of the string), most strings have Kolmogorov complexity about equal to their own length. Such strings are called (Kolmogorov) random: they do not exhibit any discernible pattern. A more precise definition follows in Example 4.
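The counting argument can be checked numerically. The sketch below verifies that fewer than 2^m strings have descriptions shorter than m bits and then, using zlib as a computable stand-in for an upper bound on K (an assumption in the spirit of the approximation discussed under property 3 below), shows empirically that a real compressor almost never shrinks random strings:

    import os, zlib

    # Counting: the number of binary descriptions of length < m is
    # 2^0 + 2^1 + ... + 2^(m-1) = 2^m - 1 < 2^m.
    m = 20
    assert sum(2 ** i for i in range(m)) == 2 ** m - 1

    # Empirical incompressibility: compress 1000 random 1000-byte strings.
    # A compressed length below the original means "compressible".
    trials, n, compressible = 1000, 1000, 0
    for _ in range(trials):
        x = os.urandom(n)
        if len(zlib.compress(x, 9)) < n:
            compressible += 1

    # For random data the compressed output is typically slightly LONGER,
    # matching the theory: all but a fraction 2^(-k) of strings have
    # complexity at least n - k.
    print(f"{compressible} of {trials} random strings were compressible")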
1(c). Stochastic Objects: K(x) = αn + o(n).

Suppose x = x_1 x_2 ..., where the individual x_i are realizations of some random variable X_i, distributed according to some distribution P. For example, we may have that all outcomes X_1, X_2, ... are independently identically distributed (i.i.d.) with, for all i, P(X_i = 1) = p for some p ∈ [0,1]. In that case, as will be seen in Section 5.3, Theorem 10,

    K(x[1:n]) = n · H(p) + o(n),    (5)

where log is logarithm to base 2, and H(p) = −p log p − (1−p) log(1−p) is the binary entropy, defined in Section 5.1. For now the important thing to note is that 0 ≤ H(p) ≤ 1, with H(p) achieving its maximum 1 for p = 1/2. Thus, if data are generated by independent tosses of a fair coin, (5) is consistent with (4). If data are generated by a biased coin, then the Kolmogorov complexity will still increase linearly in n, but with a factor less than 1 in front: the data can be compressed by a linear amount. This still holds if the data are distributed according to some P under which the different outcomes are dependent, as long as this P is 'nondegenerate', meaning that there exists an ε > 0 such that, for all n ≥ 0, all x^n ∈ {0,1}^n and a ∈ {0,1}, P(x_{n+1} = a | x_1, ..., x_n) > ε. An example is a k-th order Markov chain, where the probability of the i-th bit being a 1 depends on the value of the previous k bits, but nothing else. If none of the 2^k probabilities needed to specify such a chain are either 0 or 1, then the chain will be 'nondegenerate' in our sense, implying that, with P-probability 1, K(x_1, ..., x_n) grows linearly in n.
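The linear growth rate n·H(p) in (5) can be observed with an off-the-shelf compressor. The following rough empirical check uses zlib as a crude upper bound on K; an off-the-shelf compressor will not reach H(p) exactly, but the compressed rate drops well below 1 bit per flip as the coin gets more biased:

    import random, zlib, math

    def H(p: float) -> float:
        """Binary entropy in bits: H(p) = -p log p - (1-p) log(1-p)."""
        if p in (0.0, 1.0):
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    random.seed(1)
    n = 100_000
    for p in (0.5, 0.9, 0.99):
        # n independent coin flips with bias p, packed 8 bits per byte
        bits = [1 if random.random() < p else 0 for _ in range(n)]
        packed = bytes(
            sum(b << j for j, b in enumerate(bits[i:i + 8]))
            for i in range(0, n, 8)
        )
        compressed_bits = 8 * len(zlib.compress(packed, 9))
        print(f"p={p:4}: zlib ~ {compressed_bits / n:.3f} bits/flip, "
              f"H(p) = {H(p):.3f}")
    # For p = 0.5 the data are incompressible (~1 bit per flip); for biased
    # coins the rate drops towards H(p), in line with (5).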
2. Invariance

It would seem that K(x) depends strongly on what programming language we used in our definition of K. However, it turns out that, for any two universal languages L_1 and L_2, letting K_1 and K_2 denote the respective complexities, for all x,

    |K_1(x) − K_2(x)| ≤ C,    (6)

where C is a constant that depends on L_1 and L_2 but not on x or its length. Since we allow any universal language in the definition of K, K(x) is only defined up to an additive constant. This means that the theory is inherently asymptotic: it can make meaningful statements pertaining to strings of increasing length, such as K(x[1:n]) = f(n) + O(1) in the three examples 1(a), 1(b) and 1(c) above. A statement such as K(a) = b is not very meaningful.

It is actually very easy to show (6). It is known from the theory of computation that for any two universal languages L_1 and L_2, there exists a compiler, written in L_1, translating programs written in L_2 into equivalent programs written in L_1. Thus, let L_1 and L_2 be two universal languages, and let Λ be a program in L_1 implementing a compiler translating from L_2 to L_1. For concreteness, assume L_1 is LISP and L_2 is Java. Let (p, y) be the shortest combination of Java program plus input that prints a given string x. Then the LISP program Λ, when given input p followed by y, will also print x and halt. (To formalize this argument we need to set up the compiler in such a way that p and y can be fed to the compiler without any symbols in between, but this can be done; see Example 2.) It follows that K_LISP(x) ≤ l(Λ) + l(p) + l(y) ≤ K_Java(x) + O(1), where O(1) is the size of Λ. By symmetry, we also obtain the opposite inequality. Repeating the argument for general universal L_1 and L_2, (6) follows.

3. Uncomputability

Unfortunately, K(x) is not a recursive function: the Kolmogorov complexity is not computable in general. This means that there exists no computer program that, when input an arbitrary string, outputs the Kolmogorov complexity of that string and then halts. We prove this fact in Section 4, Example 3. Kolmogorov complexity can be computably approximated (technically speaking, it is upper semicomputable [Li and Vitányi 1997]), but not in a practically useful way: while the approximating algorithm with input x successively outputs better and better approximations t_1 ≥ t_2 ≥ t_3 ≥ ... to K(x), it is (a) excessively slow, and (b) it is in general impossible to determine whether the current approximation t_i is already a good one or not. In the words of Barron and Cover [1991], (eventually) "You know, but you do not know you know".

Do these properties make the theory irrelevant for practical applications? Certainly not. The reason is that it is possible to approximate Kolmogorov complexity after all, in the following, weaker sense: we take some existing data compression program C (for example, gzip) that allows every string x to be encoded and decoded computably and even efficiently. We then approximate K(x) as the number of bits it takes to encode x using compressor C. For many compressors, one can show that for "most" strings x in the set of all strings of interest, C(x) ≈ K(x). Both universal coding [Cover and Thomas 1991] and the Minimum Description Length (MDL) Principle (Section 6.3) are, to some extent, based on such ideas. Universal coding forms the basis of most practical lossless data compression algorithms, and MDL is a practically successful method for statistical inference. There is an even closer connection to the normalized compression distance method, a practical tool for data similarity analysis that can explicitly be understood as an approximation of an "ideal" but uncomputable method based on Kolmogorov complexity [Cilibrasi and Vitányi 2005].
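The normalized compression distance just mentioned is easy to sketch. Following the form of the measure in [Cilibrasi and Vitányi 2005], with a real compressor substituted for K, it judges two objects similar when each helps compress the other. This is only a sketch: practical applications use stronger compressors and careful preprocessing.

    import zlib

    def C(b: bytes) -> int:
        """Compressed length in bytes: a computable stand-in for K."""
        return len(zlib.compress(b, 9))

    def ncd(x: bytes, y: bytes) -> float:
        """Normalized compression distance:
        NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
        cx, cy, cxy = C(x), C(y), C(x + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

    a = b"the quick brown fox jumps over the lazy dog " * 40
    b = b"the quick brown fox leaps over the lazy dog " * 40
    c = bytes(range(256)) * 8
    print(ncd(a, b))   # well below 1: the two texts share almost all structure
    print(ncd(a, c))   # close to 1: unrelated objects do not help each other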
3 Overview and Summary

Now that we have introduced our main concept, we are ready to give a summary of the remainder of the chapter.

Section 4: Kolmogorov Complexity – Details. We motivate our definition of Kolmogorov complexity in terms of the theory of computation: the Church–Turing thesis implies that our choice of description method, based on universal computers, is essentially the only reasonable one. We then introduce some basic coding-theoretic concepts, most notably the so-called prefix-free codes that form the basis for our version of Kolmogorov complexity. Based on these notions, we give a precise definition of Kolmogorov complexity and we fill in some details that were left open in the introduction.

Section 5: Shannon vs. Kolmogorov. Here we outline the similarities and differences in aim and scope of Shannon's and Kolmogorov's information theories. Section 5.1 reviews the entropy, the central concept in Shannon's theory. Although their primary aim is quite different, and they are functions defined on different spaces, there is a close relation between entropy and Kolmogorov complexity (Section 5.3): if data are distributed according to some computable distribution then, roughly, entropy is expected Kolmogorov complexity. Entropy and Kolmogorov complexity are concerned with information in a single object: a random variable (Shannon) or an individual sequence (Kolmogorov). Both theories provide a (distinct) notion of mutual information that measures the information that one object gives about another object. We introduce and compare the two notions in Section 5.4. Entropy, Kolmogorov complexity and mutual information are concerned with lossless description or compression: messages must be described in such a way that from the description, the original message can be completely reconstructed. Extending the theories to lossy description or compression enables the formalization of more sophisticated concepts, such as 'meaningful information' and 'useful information'.

Section 6: Meaningful Information, Structure Function and Learning. The idea of the Kolmogorov Structure Function is to encode objects (strings) in two parts: a structural and a random part. Intuitively, the 'meaning' of the string resides in the structural part and the size of the structural part quantifies the 'meaningful' information in the message. The structural part defines a 'model' for the string. Kolmogorov's structure function approach shows that the meaningful information is summarized by the simplest model such that the corresponding two-part description is not larger than the Kolmogorov complexity of the original string. Kolmogorov's structure function is closely related to J. Rissanen's minimum description length principle, which we briefly discuss. This is a practical theory of learning from data that can be viewed as a mathematical formalization of Occam's Razor.

Section 7: Philosophical Implications. Kolmogorov complexity has implications for the foundations of several fields, including the foundations of mathematics. The consequences are particularly profound for the foundations of probability and statistics. For example, it allows us to discern between different forms of randomness, which is impossible using standard probability theory. It provides a precise prescription for and justification of the use of Occam's Razor in statistics, and leads to the distinction between epistemological and metaphysical forms of Occam's Razor. We discuss these and other implications for the philosophy of information in Section 7, which may be read without deep knowledge of the technicalities described in Sections 4–6.

4 Kolmogorov Complexity: Details

In Section 2 we introduced Kolmogorov complexity and its main features without paying much attention to either (a) underlying motivation (why is Kolmogorov complexity a useful measure of information?) or (b) technical details. In this section, we first provide a detailed such motivation (Section 4.1). We then (Section 4.2) provide the technical background knowledge needed for a proper understanding of the concept. Based on this background knowledge, in Section 4.3 we provide a definition of Kolmogorov complexity directly in terms of Turing machines, equivalent to, but at the same time more complicated and insightful than, the definition we gave in Section 2.1. With the help of this new definition, we then fill in the gaps left open in Section 2.

4.1 Motivation

Suppose we want to describe a given object by a finite binary string. We do not care whether the object has many descriptions; however, each description should describe but one object.
From among all descriptions of an object we can take the length of the shortest description as a measure of the object's complexity. It is natural to call an object "simple" if it has at least one short description, and to call it "complex" if all of its descriptions are long. But now we are in danger of falling into the trap so eloquently described in the Richard-Berry paradox, where we define a natural number as "the least natural number that cannot be described in less than twenty words." If this number does exist, we have just described it in thirteen words, contradicting its definitional statement. If such a number does not exist, then all natural numbers can be described in fewer than twenty words.

We need to look very carefully at what kind of descriptions (codes) D we may allow. If D is known to both a sender and receiver, then a message x can be transmitted from sender to receiver by transmitting the description y with D(y) = x. We may define the descriptional complexity of x under specification method D as the length of the shortest y such that D(y) = x. Obviously, this descriptional complexity of x depends crucially on D: the syntactic framework of the description language determines the succinctness of description. Yet in order to objectively compare descriptional complexities of objects, to be able to say "x is more complex than z," the descriptional complexity of x should depend on x alone. This complexity can be viewed as related to a universal description method that is a priori assumed by all senders and receivers. This complexity is optimal if no other description method assigns a lower complexity to any object.

We are not really interested in optimality with respect to all description methods. For specifications to be useful at all it is necessary that the mapping from y to D(y) can be executed in an effective manner: that is, it can at least in principle be performed by humans or machines. This notion has been formalized as that of "partial recursive functions", also known simply as computable functions. According to generally accepted mathematical viewpoints – the so-called 'Church-Turing thesis' – it coincides with the intuitive notion of effective computation [Li and Vitányi 1997].

The set of partial recursive functions contains an optimal function that minimizes the description length of every other such function. We denote this function by D_0. Namely, for any other recursive function D, for all objects x, there is a description y of x under D_0 that is shorter than any description z of x under D. (That is, shorter up to an additive constant that is independent of x.) Complexity with respect to D_0 minorizes the complexities with respect to all partial recursive functions (this is just the invariance result (6) again).

We identify the length of the description of x with respect to a fixed specification function D_0 with the "algorithmic (descriptional) complexity" of x. The optimality of D_0 in the sense above means that the complexity of an object x is invariant (up to an additive constant independent of x) under transition from one optimal specification function to another. Its complexity is an objective attribute of the described object alone: it is an intrinsic property of that object, and it does not depend on the description formalism.
This complexity can be viewed as "absolute information content": the amount of information that needs to be transmitted between all senders and receivers when they communicate the message in absence of any other a priori knowledge that restricts the domain of the message. This motivates the program for a general theory of algorithmic complexity and information. The four major innovations are as follows:

1. In restricting ourselves to formally effective descriptions, our definition covers every form of description that is intuitively acceptable as being effective according to general viewpoints in mathematics and logic.

2. The restriction to effective descriptions entails that there is a universal description method that minorizes the description length or complexity with respect to any other effective description method. Significantly, this implies Item 3.

3. The description length or complexity of an object is an intrinsic attribute of the object, independent of the particular description method or formalizations thereof.

4. The disturbing Richard-Berry paradox above does not disappear, but resurfaces in the form of an alternative approach to proving Gödel's famous result that not every true mathematical statement is provable in mathematics (Example 4 below).

4.2 Coding Preliminaries

Strings and Natural Numbers

Let 𝒳 be some finite or countable set. We use the notation 𝒳* to denote the set of finite strings or sequences over 𝒳. For example, {0,1}* = {ε, 0, 1, 00, 01, 10, 11, 000, ...}, with ε denoting the empty word '' with no letters. We identify the natural numbers N and {0,1}* according to the correspondence

    (0, ε), (1, 0), (2, 1), (3, 00), (4, 01), ...    (7)

The length l(x) of x is the number of bits in the binary string x. For example, l(010) = 3 and l(ε) = 0. If x is interpreted as an integer, we get l(x) = ⌊log(x+1)⌋ and, for x ≥ 2,

    ⌊log x⌋ ≤ l(x) ≤ ⌈log x⌉.    (8)

Here, as in the sequel, ⌈x⌉ is the smallest integer larger than or equal to x, ⌊x⌋ is the largest integer smaller than or equal to x, and log denotes logarithm to base two. We shall typically be concerned with encoding finite-length binary strings by other finite-length binary strings. The emphasis is on binary strings only for convenience; observations in any alphabet can be so encoded in a way that is 'theory neutral'.
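The correspondence (7) and the length formula (8) translate directly into code; in this small Python sketch, int_to_str and str_to_int are hypothetical helper names for the two directions of the bijection:

    import math

    def int_to_str(n: int) -> str:
        """Map n to a binary string via (7): 0->'', 1->'0', 2->'1', 3->'00', ..."""
        s = bin(n + 1)[2:]      # the binary expansion of n+1 always starts with '1'
        return s[1:]            # dropping that leading '1' yields the bijection

    def str_to_int(x: str) -> int:
        """Inverse of int_to_str."""
        return int('1' + x, 2) - 1

    # Check the start of the correspondence (7) and the length formula in (8).
    for n in range(8):
        x = int_to_str(n)
        assert str_to_int(x) == n
        assert len(x) == math.floor(math.log2(n + 1))
        print(n, repr(x))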
Codes

We repeatedly consider the following scenario: a sender (say, A) wants to communicate or transmit some information to a receiver (say, B). The information to be transmitted is an element from some set 𝒳. It will be communicated by sending a binary string, called the message. When B receives the message, he can decode it again and (hopefully) reconstruct the element of 𝒳 that was sent. To achieve this, A and B need to agree on a code or description method before communicating. Intuitively, this is a binary relation between source words and associated code words. The relation is fully characterized by the decoding function. Such a decoding function D can be any function D : {0,1}* → 𝒳. The domain of D is the set of code words and the range of D is the set of source words. D(y) = x is interpreted as "y is a code word for the source word x". The set of all code words for source word x is the set D^{−1}(x) = {y : D(y) = x}. Hence, E = D^{−1} can be called the encoding substitution (E is not necessarily a function). With each code D we can associate a length function L_D : 𝒳 → N such that, for each source word x, L_D(x) is the length of the shortest encoding of x:

    L_D(x) = min {l(y) : D(y) = x}.

We denote by x* the shortest y such that D(y) = x; if there is more than one such y, then x* is defined to be the first such y in lexicographical order.

In coding theory attention is often restricted to the case where the source word set is finite, say 𝒳 = {1, 2, ..., N}. If there is a constant l_0 such that l(y) = l_0 for all code words y (equivalently, L(x) = l_0 for all source words x), then we call D a fixed-length code. It is easy to see that l_0 ≥ log N. For instance, in teletype transmissions the source has an alphabet of N = 32 letters, consisting of the 26 letters in the Latin alphabet plus 6 special characters. Hence, we need l_0 = 5 binary digits per source letter. In electronic computers we often use the fixed-length ASCII code with l_0 = 8.

Prefix-free code

In general we cannot uniquely recover x and y from E(xy). Let E be the identity mapping. Then we have E(00)E(00) = 0000 = E(0)E(000). We now introduce prefix-free codes, which do not suffer from this defect. A binary string x is a proper prefix of a binary string y if we can write y = xz for z ≠ ε. A set {x, y, ...} ⊆ {0,1}* is prefix-free if for any pair of distinct elements in the set neither is a proper prefix of the other. A function D : {0,1}* → N defines a prefix-free code if its domain is prefix-free. (The standard terminology [Cover and Thomas 1991] for such codes is 'prefix codes'; following Harremoës and Topsøe [2007], we use the more informative 'prefix-free codes'.) In order to decode a code sequence of a prefix-free code, we simply start at the beginning and decode one code word at a time. When we come to the end of a code word, we know it is the end, since no code word is the prefix of any other code word in a prefix-free code. Clearly, prefix-free codes are uniquely decodable: we can always unambiguously reconstruct an outcome from its encoding. Prefix-free codes are not the only codes with this property; there are uniquely decodable codes which are not prefix-free.

In the next section, we will define Kolmogorov complexity in terms of prefix-free codes. One may wonder why we did not opt for general uniquely decodable codes. There is a good reason for this: it turns out that every uniquely decodable code can be replaced by a prefix-free code without changing the set of code-word lengths. This follows from a sophisticated version of the Kraft inequality [Cover and Thomas 1991, Kraft-McMillan inequality, Theorem 5.5.1]; the basic Kraft inequality is found in [Harremoës and Topsøe 2007], Equation 1.1. In Shannon's and Kolmogorov's theories, we are only interested in code word lengths of uniquely decodable codes rather than actual encodings. The Kraft-McMillan inequality shows that without loss of generality, we may restrict the set of codes we work with to prefix-free codes, which are much easier to handle.
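Both the prefix-free property and the Kraft inequality it implies are easy to check mechanically; a small sketch (note that the Kraft sum being at most 1 is necessary but not by itself sufficient for prefix-freeness):

    from itertools import combinations

    def is_prefix_free(codewords: list[str]) -> bool:
        """No codeword is a proper prefix of another."""
        return not any(
            a.startswith(b) or b.startswith(a)
            for a, b in combinations(codewords, 2)
        )

    def kraft_sum(codewords: list[str]) -> float:
        """Kraft inequality: a prefix-free code satisfies sum 2^-l(y) <= 1."""
        return sum(2.0 ** -len(y) for y in codewords)

    good = ["0", "10", "110", "111"]    # prefix-free; decodes left to right
    bad  = ["0", "01", "11"]            # "0" is a proper prefix of "01"
    print(is_prefix_free(good), kraft_sum(good))   # True 1.0
    print(is_prefix_free(bad),  kraft_sum(bad))    # False 1.0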
Codes for the integers; Pairing Functions

Suppose we encode each binary string x = x_1 x_2 ... x_n as

    x̄ = 11...1 (n times) 0 x_1 x_2 ... x_n.

The resulting code is prefix-free because we can determine where the code word x̄ ends by reading it from left to right without backing up. Note l(x̄) = 2n + 1; thus, we have encoded strings in {0,1}* in a prefix-free manner at the price of doubling their length. We can get a much more efficient code by applying the construction above to the length l(x) of x rather than x itself: define x' = l(x)̄ x, where l(x) is interpreted as a binary string according to the correspondence (7). Then the code that maps x to x' is a prefix-free code satisfying, for all x ∈ {0,1}*, l(x') = n + 2 log n + 1 (here we ignore the 'rounding error' in (8)). We call this code the standard prefix-free code for the natural numbers and use L_N(x) as notation for the codelength of x under this code: L_N(x) = l(x'). When x is interpreted as a number (using the correspondence (7) and (8)), we see that L_N(x) = log x + 2 log log x + 1.

We are often interested in representing a pair of natural numbers (or binary strings) as a single natural number (binary string). To this end, we define the standard 1-1 pairing function ⟨·,·⟩ : N × N → N as ⟨x, y⟩ = x'y (in this definition x and y are interpreted as strings).
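These constructions translate directly into code. In the sketch below, bar implements x̄, prime implements the standard prefix-free code x', and pair implements ⟨x, y⟩ = x'y; the helper names are ours:

    def bar(x: str) -> str:
        """x-bar: l(x) ones, a zero, then x itself; prefix-free, length 2*l(x)+1."""
        return "1" * len(x) + "0" + x

    def length_str(x: str) -> str:
        """l(x) written as a binary string via the correspondence (7)."""
        return bin(len(x) + 1)[3:]   # drop '0b' and the leading '1'

    def prime(x: str) -> str:
        """x' = bar(l(x)) followed by x: the standard prefix-free code."""
        return bar(length_str(x)) + x

    def pair(x: str, y: str) -> str:
        """Standard pairing <x, y> = x' y: x' is self-delimiting, y follows."""
        return prime(x) + y

    x = "10110"
    print(bar(x))        # '111110' + '10110': length 2*5 + 1 = 11
    print(prime(x))      # length 10 here, i.e. about n + 2 log n + 1
    print(pair(x, "01")) # decodable: read off x' first; the rest is y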
4.3 Formal Definition of Kolmogorov Complexity

In this subsection we provide a formal definition of Kolmogorov complexity in terms of Turing machines. This will allow us to fill in some details left open in Section 2. Let T_1, T_2, ... be a standard enumeration of all Turing machines [Li and Vitányi 1997]. The functions implemented by T_i are called the partial recursive or computable functions. For technical reasons, mainly because it simplifies the connection to Shannon's information theory, we are interested in the so-called prefix complexity, which is associated with Turing machines for which the set of programs (inputs) resulting in a halting computation is prefix-free. (There exists a version of Kolmogorov complexity corresponding to programs that are not necessarily prefix-free, but we will not go into it here.) We can realize this by equipping the Turing machine with a one-way input tape, a separate work tape, and a one-way output tape. Such Turing machines are called prefix machines, since the halting programs for any one of them form a prefix-free set.

We first define K_{T_i}(x), the prefix Kolmogorov complexity of x relative to a given prefix machine T_i, where T_i is the i-th prefix machine in a standard enumeration of them. K_{T_i}(x) is defined as the length of the shortest input sequence y such that T_i(y) = x; that is, the i-th Turing machine, when run with input y, produces x on its output tape and then halts. If no such input sequence exists, K_{T_i}(x) remains undefined. Of course, this preliminary definition is still highly sensitive to the particular prefix machine T_i that we use. But now the 'universal prefix machine' comes to our rescue. Just as there exist universal ordinary Turing machines, there also exist universal prefix machines. These have the remarkable property that they can simulate every other prefix machine. More specifically, there exists a prefix machine U such that, with as input the concatenation i'y (where i' is the standard encoding of the integer i, Section 4.2), U outputs T_i(y) and then halts. If U gets any other input then it does not halt.

Definition 1 Let U be our reference prefix machine, i.e. for all i ∈ N, y ∈ {0,1}*, U(⟨i, y⟩) = U(i'y) = T_i(y). The prefix Kolmogorov complexity of x is defined as K(x) := K_U(x), or equivalently:

    K(x) = min_z { l(z) : U(z) = x, z ∈ {0,1}* }
         = min_{i,y} { l(i') + l(y) : T_i(y) = x, y ∈ {0,1}*, i ∈ N }.    (9)

We can alternatively think of z as a program that prints x and then halts, or as z = i'y, where y is a program such that, when T_i is input program y, it prints x and then halts. Thus, by definition K(x) = l(x*), where x* is the lexicographically first shortest self-delimiting (prefix-free) program for x with respect to the reference prefix machine. Consider the mapping E* defined by E*(x) = x*. This may be viewed as the encoding function of a prefix-free code (decoding function) D* with D*(x*) = x. By its definition, D* is a very parsimonious code.

Example 2 In Section 2, we defined K(x) as the length of the shortest program for x in some standard programming language such as LISP or Java. We now show that this definition is equivalent to the prefix Turing machine Definition 1. Let L_1 be a universal language; for concreteness, say it is LISP. Denote the corresponding Kolmogorov complexity defined as in (3) by K_LISP. For the universal prefix machine U of Definition 1, there exists a program p in LISP that simulates it [Li and Vitányi 1997]. By this we mean that, for all z ∈ {0,1}*, either p(z) = U(z), or neither p nor U ever halts on input z. Run with this program, our LISP computer computes the same function as U on its input, so that

    K_LISP(x) ≤ l(p) + K_U(x) = K_U(x) + O(1).

On the other hand, LISP, when equipped with the simple input/output interface described in Section 2, is a language such that for all programs p, the set of inputs y for which p(y) is well-defined forms a prefix-free set. Also, as is easy to check, the set of syntactically correct LISP programs is prefix-free. Therefore, the set of strings py, where p is a syntactically correct LISP program and y is an input on which p halts, is prefix-free. Thus we can construct a prefix Turing machine with some index i_0 such that T_{i_0}(py) = p(y) for all y ∈ {0,1}*. Therefore, the universal machine U satisfies, for all y ∈ {0,1}*, U(i_0'py) = T_{i_0}(py) = p(y), so that

    K_U(x) ≤ K_LISP(x) + l(i_0') = K_LISP(x) + O(1).

We are therefore justified in calling K_LISP(x) a version of (prefix) Kolmogorov complexity. The same holds for any other universal language, as long as its set of syntactically correct programs is prefix-free. This is the case for every programming language we know of.

Example 3 [K(x) as an integer function; uncomputability] The correspondence between binary strings and integers established in (7) shows that Kolmogorov complexity may equivalently be thought of as a function K : N → N, where N are the nonnegative integers. This interpretation is useful to prove that Kolmogorov complexity is uncomputable.
Indeed, let us assume by means of contradiction that K is computable. Then the function ψ(m) := min_{x ∈ N} {x : K(x) ≥ m} must be computable as well (note that x is interpreted as an integer in the definition of ψ). The definition of ψ immediately implies K(ψ(m)) ≥ m. On the other hand, since ψ is computable, there exists a computer program of some fixed size c such that, on input m, the program outputs ψ(m) and halts. Therefore, since K(ψ(m)) is the length of the shortest program plus input that prints ψ(m), we must have that K(ψ(m)) ≤ L_N(m) + c ≤ 2 log m + c. Thus, we have m ≤ 2 log m + c, which must be false from some m onwards: contradiction.

Example 4 [Gödel's incompleteness theorem and randomness] We say that a formal system (definitions, axioms, rules of inference) is consistent if no statement which can be expressed in the system can be proved to be both true and false in the system. A formal system is sound if only true statements can be proved to be true in the system. (Hence, a sound formal system is consistent.) Let x be a finite binary string of length n. We write 'x is c-random' if K(x) > n − c. That is, the shortest binary description of x has length not much smaller than that of x itself. We recall from Section 2.2 that the fraction of sequences that can be compressed by more than c bits is bounded by 2^{−c}. This shows that there are sequences which are c-random for every c ≥ 1, and justifies the terminology: the smaller c, the more random x.

Now fix any sound formal system F that is powerful enough to express the statement 'x is c-random'. Suppose F can be described in f bits. By this we mean that there is a fixed-size program of length f such that, when input the number i, it outputs a list of all valid proofs in F of length (number of symbols) i. We claim that, for all but finitely many random strings x and c ≥ 1, the sentence 'x is c-random' is not provable in F. Suppose the contrary. Then given F, we can start to exhaustively search for a proof that some string of length n ≫ f is random, and print it when we find such a string x. This procedure to print x of length n uses only log n + f + O(1) bits of data, which is much less than n. But x is random by the proof and the fact that F is sound. Hence F is not consistent, which is a contradiction.

Pushing the idea of Example 4 much further, Chaitin [1987] proved a particularly strong variation of Gödel's theorem, using Kolmogorov complexity but in a more sophisticated way, based on the number Ω defined below. Roughly, it says the following: there exists an exponential Diophantine equation,

    A(n, x_1, ..., x_m) = 0    (10)

for some finite m, such that the following holds: let F be a formal theory of arithmetic. Then for all F that are sound and consistent, there is only a finite number of values of n for which the theory determines whether (10) has finitely or infinitely many solutions (x_1, ..., x_m) (n is to be considered a parameter rather than a variable). For all other, infinitely many, values of n, the statement '(10) has a finite number of solutions' is logically independent of F.
Chaitin's Number of Wisdom Ω

An axiom system that can be effectively described by a finite string has limited information content – this was the basis for our proof of Gödel's theorem above. On the other hand, there exist quite short strings which are mathematically well-defined but uncomputable, and which have an astounding amount of information in them about the truth of mathematical statements. Following Chaitin [1975], we define the halting probability Ω as the real number defined by

    Ω = Σ_{p : U(p) < ∞} 2^{−l(p)},

the sum taken over all inputs p for which the reference machine U halts. We call Ω the halting probability because it is the probability that U halts if its program is provided by a sequence of fair coin flips. It turns out that Ω represents the halting problem very compactly. The following theorem is proved in [Li and Vitányi 1997]:

Theorem 5 Let y be a binary string of length at most n. There exists an algorithm A which, given the first n bits of Ω, decides whether the universal machine U halts on input y; i.e. A outputs 1 if U halts on y; A outputs 0 if U does not halt on y; and A is guaranteed to run in finite time.

The halting problem is a prime example of a problem that is undecidable [Li and Vitányi 1997], from which it follows that Ω must be uncomputable.

Knowing the first 10000 bits of Ω enables us to solve the halting of all programs of less than 10000 bits. This includes programs looking for counterexamples to Goldbach's Conjecture, Riemann's Hypothesis, and most other conjectures in mathematics which can be refuted by a single finite counterexample. Moreover, for all axiomatic mathematical theories which can be expressed compactly enough to be conceivably interesting to human beings, say in less than 10000 bits, Ω[1:10000] can be used to decide for every statement in the theory whether it is true, false, or independent. Thus, Ω is truly the number of Wisdom, and 'can be known of, but not known, through human reason' [C.H. Bennett and M. Gardner, Scientific American, 241:11 (1979), 20–34].

4.4 Conditional Kolmogorov Complexity

In order to fully develop the theory, we also need a notion of conditional Kolmogorov complexity. Intuitively, the conditional Kolmogorov complexity K(x|y) of x given y can be interpreted as the length of the shortest program p such that, when y is given to the program p as input 'for free', the program prints x and then halts. Based on conditional Kolmogorov complexity, we can then further define Kolmogorov complexities of more complicated objects such as functions and so on (Example 7). The idea of providing p with an input y is realized by putting ⟨y, p⟩ rather than just p on the input tape of a universal conditional prefix machine U. This is a prefix machine U such that for all y, i, q, U(⟨y, ⟨i, q⟩⟩) = T_i(⟨y, q⟩), whereas for any input not of this form, U does not halt. Here T_1, T_2, ... is some effective enumeration of prefix machines. It is easy to show that such a universal conditional prefix machine U exists [Li and Vitányi 1997].
We now fix a reference conditional universal prefix machine U and define K(x|y) as follows:

Definition 6 [Conditional and Joint Kolmogorov Complexity] The conditional prefix Kolmogorov complexity of x given y (for free) is

    K(x|y) = min_p { l(p) : U(⟨y, p⟩) = x, p ∈ {0,1}* }    (11)
           = min_{q,i} { l(⟨i, q⟩) : U(⟨y, ⟨i, q⟩⟩) = x, q ∈ {0,1}*, i ∈ N }    (12)
           = min_{q,i} { l(i') + l(q) : T_i(y'q) = x, q ∈ {0,1}*, i ∈ N }.    (13)

We define the unconditional complexity K(x) as K(x) = K(x|ε). We define the joint complexity K(x, y) as K(x, y) = K(⟨x, y⟩).

Note that we just redefined K(x) so that the unconditional Kolmogorov complexity is exactly equal to the conditional Kolmogorov complexity with empty input. This does not contradict our earlier definition: having chosen some reference conditional prefix machine U, we can always find an effective enumeration T'_1, T'_2, ... and a corresponding unconditional universal prefix machine U' such that for all p, U(⟨ε, p⟩) = U'(p). Then we automatically have, for all x, K_{U'}(x) = K_U(x|ε).

Example 7 [K for general objects: functions, distributions, sets, ...] We have defined the Kolmogorov complexity K of binary strings and natural numbers, which we identified with each other. It is straightforward to extend the definition to objects such as real-valued functions, probability distributions and sets. We briefly indicate how to do this. Intuitively, the Kolmogorov complexity of a function f : N → R is the length of the shortest prefix-free program that computes (outputs) f(x) to precision 1/q on input x'q' for q ∈ {1, 2, ...}. In terms of conditional universal prefix machines:

    K(f) = min_{p ∈ {0,1}*} { l(p) : for all q ∈ {1, 2, ...}, x ∈ N : |U(⟨x, ⟨q, p⟩⟩) − f(x)| ≤ 1/q }.    (14)

The Kolmogorov complexity of a function f : N × N → R is defined analogously, with ⟨x, ⟨q, p⟩⟩ replaced by ⟨x, ⟨y, ⟨q, p⟩⟩⟩, and f(x) replaced by f(x, y); similarly for functions f : N^k × N → R for general k ∈ N. As a special case of (14), the Kolmogorov complexity of a probability distribution P is the length of the shortest program that outputs P(x) to precision 1/q on input ⟨x, q⟩. We will encounter K(P) in Section 5.

The Kolmogorov complexity of sets can be defined in various manners [Gács, Tromp, and Vitányi 2001]. In this chapter we only consider finite sets S consisting of finite strings. One reasonable method of defining their complexity K(S) is as the length of the shortest program that sequentially outputs the elements of S (in an arbitrary order) and then halts. Let S = {x_1, ..., x_n}, and assume that x_1, x_2, ..., x_n reflects the lexicographical order of the elements of S. In terms of conditional prefix machines, K(S) is the length of the shortest binary program p such that U(⟨ε, p⟩) = z, where

    z = ⟨x_1, ⟨x_2, ..., ⟨x_{n−1}, x_n⟩ ... ⟩⟩.    (15)

This definition of K(S) will be used in Section 6. There we also need the notion of the Kolmogorov complexity of a string x given that x ∈ S, denoted as K(x|S). This is defined as the length of the shortest binary program p from which the (conditional universal) U computes x from input S given literally, in the form of (15).
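Using the standard pairing function of Section 4.2, the nested encoding (15) of a finite set is straightforward to sketch; the helper names below are ours, and the pairing helpers are repeated so the sketch is self-contained:

    def bar(x: str) -> str:
        """1^l(x) 0 x: the simple prefix-free code of Section 4.2."""
        return "1" * len(x) + "0" + x

    def prime(x: str) -> str:
        """x' = bar(l(x)) x: self-delimiting encoding of x."""
        return bar(bin(len(x) + 1)[3:]) + x

    def pair(x: str, y: str) -> str:
        """<x, y> = x' y."""
        return prime(x) + y

    def encode_set(S: set[str]) -> str:
        """The nested encoding (15): <x1, <x2, ..., <x_{n-1}, x_n>...>>
        with x1, ..., xn the elements of S in lexicographical order."""
        elems = sorted(S)
        z = elems[-1]
        for x in reversed(elems[:-1]):
            z = pair(x, z)
        return z

    print(encode_set({"00", "1", "010"}))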
This concludes our treatment of the basic concepts of Kolmogorov complexity theory. In the next section we compare these to the basic concepts of Shannon's information theory.

5 Shannon and Kolmogorov

In this section we compare Kolmogorov complexity to Shannon's [1948] information theory, more commonly known simply as 'information theory'. Shannon's theory predates Kolmogorov's by about 25 years. Both theories measure the amount of information in an object as the length of a description of the object. In the Shannon approach, however, the method of encoding objects is based on the presupposition that the objects to be encoded are outcomes of a known random source: it is only the characteristics of that random source that determine the encoding, not the characteristics of the objects that are its outcomes. In the Kolmogorov complexity approach we consider the individual objects themselves, in isolation so-to-speak, and the encoding of an object is a computer program that generates it. In the Shannon approach we are interested in the minimum expected number of bits to transmit a message from a random source of known characteristics through an error-free channel. In Kolmogorov complexity we are interested in the minimum number of bits from which a particular message can effectively be reconstructed. A little reflection reveals that this is a great difference: for every source emitting but two messages the Shannon information is at most 1 bit, but we can choose both messages concerned to be of arbitrarily high Kolmogorov complexity. Shannon stresses in his founding article that his notion is only concerned with communication, while Kolmogorov stresses in his founding article that his notion aims at supplementing the gap left by Shannon theory concerning the information in individual objects. To be sure, both notions are natural: Shannon ignores the object itself but considers only the characteristics of the random source of which the object is one of the possible outcomes, while Kolmogorov considers only the object itself to determine the number of bits in the ultimate compressed version, irrespective of the manner in which the object arose.

These differences notwithstanding, there exist very strong connections between both theories. In this section we give an overview of these. In Section 5.1 we recall the relation between probability distributions and codes, and we review Shannon's fundamental notion, the entropy. We then (Section 5.2) indicate how Kolmogorov complexity resolves a lacuna in the Shannon theory, namely its inability to deal with information in individual objects. In Section 5.3 we make precise and explain the important relation

    Entropy ≈ expected Kolmogorov complexity.

Section 5.4 deals with Shannon and algorithmic mutual information, the second fundamental concept in both theories.

5.1 Probabilities, Codelengths, Entropy

We now briefly recall the two fundamental relations between probability distributions and codelength functions, and indicate their connection to the entropy, the fundamental concept in Shannon's theory. These relations are essential for understanding the connection between Kolmogorov's and Shannon's theory.
For (much) more details, we refer to Harremoës and Topsøe [2007]'s chapter in this handbook and, in a Kolmogorov complexity context, to [Grünwald and Vitányi 2003]. We use the following notation: let P be a probability distribution defined on a finite or countable set 𝒳. In the remainder of the chapter, we denote by X the random variable that takes values in 𝒳; thus P(X = x) = P({x}) is the probability that the event {x} obtains. We write P(x) as an abbreviation of P(X = x), and we write E_P[f(X)] to denote the expectation of a function f : 𝒳 → R, so that E_P[f(X)] = Σ_{x ∈ 𝒳} P(x) f(x).

The Two Relations between probabilities and codelengths

1. For every distribution P defined on a finite or countable set 𝒳, there exists a code with lengths L_P(x) satisfying, for all x ∈ 𝒳, L_P(x) = ⌈−log P(x)⌉. This is the so-called Shannon-Fano code corresponding to P. The result follows directly from the Kraft inequality [Harremoës and Topsøe 2007, Section 1.2].

2. If X is distributed according to P, then the Shannon-Fano code corresponding to P is (essentially) the optimal code to use in an expected sense. Of course, we may choose to encode outcomes of X using a code corresponding to a distribution Q, with lengths ⌈−log Q(x)⌉, whereas the outcomes are actually distributed according to P ≠ Q. But, as expressed in the noiseless coding theorem or, more abstractly, in [Harremoës and Topsøe 2007, Section 1.3] as the First main theorem of information theory, such a code cannot be significantly better, and may in fact be much worse, than the code with lengths ⌈−log P(X)⌉: the noiseless coding theorem says that

    E_P[−log P(X)] ≤ min_{C : C is a prefix-free code} E_P[L_C(X)] ≤ E_P[−log P(X)] + 1,    (16)

so that it follows in particular that the expected length of the Shannon-Fano code satisfies

    E_P[⌈−log P(X)⌉] ≤ E_P[−log P(X)] + 1 ≤ min_{C : C is a prefix-free code} E_P[L_C(X)] + 1,

and is thus always within just one bit of the code that is optimal in expectation.

In his 1948 paper, Shannon proposed a measure of information in a distribution, which he called the 'entropy', a concept discussed at length in the chapter by Harremoës and Topsøe [2007] in this handbook. It is equal to the quantity appearing on the left and on the right in (16):

Definition 8 [Entropy] Let 𝒳 be a finite or countable set, and let X be a random variable taking values in 𝒳 with distribution P. Then the (Shannon-) entropy of random variable X is given by

    H(P) = −Σ_{x ∈ 𝒳} P(x) log P(x).    (17)

Entropy is defined here as a functional mapping a distribution on 𝒳 to real numbers. In practice, we often deal with a pair of random variables (X, Y) defined on a joint space 𝒳 × 𝒴. Then P is the joint distribution of (X, Y), and P_X is its corresponding marginal distribution on 𝒳, P_X(x) = Σ_y P(x, y). In that case, rather than writing H(P_X) it is customary to write H(X); we shall follow this convention below.
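A small numerical check of both relations: build the Shannon-Fano code lengths ⌈−log P(x)⌉ for a distribution P, verify the Kraft inequality (so a prefix-free code with these lengths exists), and compare the expected length with the entropy, as in (16):

    import math

    P = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}

    # Relation 1: Shannon-Fano code lengths, L_P(x) = ceil(-log2 P(x)).
    L = {x: math.ceil(-math.log2(p)) for x, p in P.items()}
    assert sum(2.0 ** -l for l in L.values()) <= 1.0   # Kraft: such a code exists

    # Relation 2: expected length vs entropy.
    H = -sum(p * math.log2(p) for p in P.values())
    expected_L = sum(P[x] * L[x] for x in P)
    print(f"H(P)   = {H:.3f} bits")
    print(f"E[L_P] = {expected_L:.3f} bits")
    assert H <= expected_L <= H + 1    # Shannon-Fano is within one bit of H(P)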
The noiseless coding theorem (16) gives a precise coding-theoretic interpretation: it shows that the entropy of $P$ is essentially equal to the average code length when encoding an outcome of $P$, if outcomes are encoded using the optimal code (the code that minimizes this average code length).

5.2 A Lacuna in Shannon's Theory

Example 9. Assuming that $x$ is emitted by a random source $X$ with probability $P(x)$, we can transmit $x$ using the Shannon-Fano code. This uses (up to rounding) $-\log P(x)$ bits. By Shannon's noiseless coding theorem this is optimal on average, the average taken over the probability distribution of outcomes from the source. Thus, if $x = 00\ldots0$ ($n$ zeros), and the random source emits $n$-bit messages with equal probability $1/2^n$ each, then we require $n$ bits to transmit $x$ (the same as transmitting $x$ literally). However, we can transmit $x$ in about $\log n$ bits if we ignore probabilities and just describe $x$ individually. Thus, optimality with respect to the average may be very sub-optimal in individual cases.

In Shannon's theory 'information' is fully determined by the probability distribution on the set of possible messages, and is unrelated to the meaning, structure or content of individual messages. In many cases this is problematic, since the distribution generating outcomes may be unknown to the observer or, worse, may not exist at all. (Even if we adopt a Bayesian, subjective interpretation of probability, this problem remains [Grünwald 2007].) For example, can we answer a question like "what is the information in this book" by viewing it as an element of a set of possible books with a probability distribution on it? This seems unlikely. Kolmogorov complexity provides a measure of information that, unlike Shannon's, does not rely on (often untenable) probabilistic assumptions, and that takes into account the phenomenon that 'regular' strings are compressible. Thus, it measures the information content of an individual finite object. The fact that such a measure exists is surprising, and indeed, it comes at a price: unlike Shannon's, Kolmogorov's measure is asymptotic in nature, and not computable in general. Still, the resulting theory is closely related to Shannon's, as we now discuss.

5.3 Entropy and Expected Kolmogorov Complexity

We call a distribution $P$ computable if it can be computed by a finite-size program, i.e. if it has finite Kolmogorov complexity $K(P)$ (Example 7). The set of computable distributions is very large: it contains, for example, all Markov chains of each order with rational-valued parameters. In the following discussion we shall restrict ourselves to computable distributions; extensions to the uncomputable case are discussed by Grünwald and Vitányi [2003].

If $X$ is distributed according to some distribution $P$, then the optimal (in the average sense) code to use is the Shannon-Fano code. But now suppose it is only known that $P \in \mathcal{P}$, where $\mathcal{P}$ is a large set of computable distributions, perhaps even the set of all computable distributions. Now it is not clear what code is optimal. We may try the Shannon-Fano code for a particular $P \in \mathcal{P}$, but such a code will typically lead to very large expected code lengths if $X$ turns out to be distributed according to some $Q \in \mathcal{P}$, $Q \neq P$. We may ask whether there exists another code that is 'almost' as good as the Shannon-Fano code for $P$, no matter what $P \in \mathcal{P}$ actually generates the sequence?
We now show that, perhaps surprisingly, the answer is yes. Let $X$ be a random variable taking values in the set $\{0,1\}^*$ of binary strings of arbitrary length, and let $P$ be the distribution of $X$. $K(x)$ is fixed for each $x$ and gives the shortest code word length (but only up to a fixed constant). It is independent of the probability distribution $P$. Nevertheless, if we weigh each individual code word length for $x$ with its probability $P(x)$, then the resulting $P$-expected code word length $\sum_x P(x) K(x)$ almost achieves the minimal average code word length $H(P) = -\sum_x P(x) \log P(x)$. This is expressed in the following theorem (taken from [Li and Vitányi 1997]):

Theorem 10. Let $P$ be a computable probability distribution on $\{0,1\}^*$. Then

$$ 0 \;\le\; \Big( \sum_x P(x) K(x) \Big) - H(P) \;\le\; K(P) + O(1). $$

The theorem becomes interesting if we consider sequences of $P$ that assign mass to binary strings of increasing length. For example, let $P_n$ be the distribution on $\{0,1\}^n$ that corresponds to $n$ independent tosses of a coin with bias $q$, where $q$ is computable (e.g., a rational number). We have $K(P_n) = O(\log n)$, since we can compute $P_n$ with a program of constant size and input $n, q$ with length $l(n') + l(q') = O(\log n)$. On the other hand, $H(P_n) = n H(P_1)$ increases linearly in $n$ (see, e.g., the chapter by Harremoës and Topsøe [2007] in this handbook; see also paragraph 1(c) in Section 2.2 of this chapter). So for large $n$, the optimal code for $P_n$ requires on average $n H(P_1)$ bits, and the Kolmogorov code $E^*$ requires only $O(\log n)$ bits extra. Dividing by $n$, we see that the additional number of bits needed per outcome using the Kolmogorov code goes to 0. Thus, remarkably, whereas the entropy is the expected codelength according to $P$ under the optimal code for $P$ (a code that will be wildly different for different $P$), there exists a single code (the Kolmogorov code) which is asymptotically almost optimal for all computable $P$.
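Since $K$ is uncomputable, Theorem 10 cannot be checked directly, but its flavor can be illustrated by letting a general-purpose compressor stand in for $K$: any real compressor upper-bounds $K$ up to an additive constant (roughly, the size of its decompressor), though it also adds format overhead. The sketch below is our own illustration, not part of the theory; it samples $n$-bit strings from a Bernoulli($q$) source and compares the average compressed length with $n H(P_1)$:

```python
import bz2, math, random

def bernoulli_bits(n, q, rng):
    # n independent bits with P(1) = q, packed 8 to a byte.
    return bytes(sum((rng.random() < q) << j for j in range(8))
                 for _ in range(n // 8))

n, q, trials = 80_000, 0.1, 20
rng = random.Random(0)
H1 = -(q * math.log2(q) + (1 - q) * math.log2(1 - q))   # H(P_1), bits/symbol
avg = sum(8 * len(bz2.compress(bernoulli_bits(n, q, rng)))
          for _ in range(trials)) / trials
print(f"n*H(P_1) = {n * H1:.0f} bits,  compressor average = {avg:.0f} bits")
# The compressed length stays within a modest overhead of n*H(P_1), in the
# spirit of 0 <= (sum_x P(x)K(x)) - H(P) <= K(P) + O(1); the compressor
# only upper-bounds K, so the gap here also includes its format overhead.
```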
5.4 Mutual Information

Apart from entropy, the mutual information is perhaps the most important concept in Shannon's theory. Similarly, apart from Kolmogorov complexity itself, the algorithmic mutual information is one of the most important concepts in Kolmogorov's theory. In this section we review Shannon's notion, we introduce Kolmogorov's notion, and then we provide an analogue of Theorem 10 which says that, essentially, Shannon mutual information is averaged algorithmic mutual information.

Shannon Mutual Information. How much information can a random variable $X$ convey about a random variable $Y$? This is determined by the (Shannon) mutual information between $X$ and $Y$. Formally, it is defined as

$$ I(X; Y) := H(X) - H(X \mid Y) = H(X) + H(Y) - H(X, Y), \tag{18} $$

where $H(X \mid Y)$ is the conditional entropy of $X$ given $Y$, and $H(X, Y)$ is the joint entropy of $X$ and $Y$; the definitions of $H(X, Y)$ and $H(X \mid Y)$, as well as an alternative but equivalent definition of $I(X; Y)$, can be found in [Harremoës and Topsøe 2007]. The equality between the two expressions follows by straightforward rewriting.

The mutual information can be thought of as the expected (average) reduction in the number of bits needed to encode $X$, when an outcome of $Y$ is given for free. In accord with intuition, it is easy to show that $I(X; Y) \ge 0$, with equality if and only if $X$ and $Y$ are independent, i.e. $X$ provides no information about $Y$. Moreover, and less intuitively, a straightforward calculation shows that this information is symmetric: $I(X; Y) = I(Y; X)$.
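For finite joint distributions, (18) is a short computation. A sketch (ours; the joint probability table is an arbitrary example):

```python
import math

def H(p):
    # Entropy of a distribution given as a dict {outcome: probability}.
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px, py = {}, {}
for (a, b), v in joint.items():
    px[a] = px.get(a, 0) + v      # marginal of X
    py[b] = py.get(b, 0) + v      # marginal of Y

I = H(px) + H(py) - H(joint)      # I(X;Y) = H(X) + H(Y) - H(X,Y), eq. (18)
print(round(I, 4))                # 0.2781 bits; 0 iff X and Y independent
```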
Algorithmic Mutual Information. In order to define algorithmic mutual information, it will be convenient to introduce some new notation: we will denote by $\stackrel{+}{<}$ an inequality to within an additive constant. More precisely, let $f, g$ be functions from $\{0,1\}^*$ to $\mathbb{R}$. Then by '$f(x) \stackrel{+}{<} g(x)$' we mean that there exists a $c$ such that for all $x \in \{0,1\}^*$, $f(x) < g(x) + c$. We write '$f(x) \stackrel{+}{>} g(x)$' if $g(x) \stackrel{+}{<} f(x)$. We denote by $\stackrel{+}{=}$ the situation when both $\stackrel{+}{<}$ and $\stackrel{+}{>}$ hold.

Since $K(x, y) = K(x'y)$ (Section 4.4), trivially, the symmetry property holds: $K(x, y) \stackrel{+}{=} K(y, x)$. An interesting property is the "additivity of complexity" property

$$ K(x, y) \;\stackrel{+}{=}\; K(x) + K(y \mid x^*) \;\stackrel{+}{=}\; K(y) + K(x \mid y^*), \tag{19} $$

where $x^*$ is the first (in standard enumeration order) shortest prefix program that generates $x$ and then halts. (19) is the Kolmogorov complexity equivalent of the entropy equality $H(X, Y) = H(X) + H(Y \mid X)$ (see Section I.5 in the chapter by Harremoës and Topsøe [2007]). That this latter equality holds is true by simply rewriting both sides of the equation according to the definitions of averages of joint and marginal probabilities. In fact, potential individual differences are averaged out. But in the Kolmogorov complexity case we do nothing like that: it is quite remarkable that additivity of complexity also holds for individual objects. The result (19) is due to Gács [1974], can be found as Theorem 3.9.1 in [Li and Vitányi 1997], and has a difficult proof. It is perhaps instructive to point out that the version with just $x$ and $y$ in the conditionals does not hold with $\stackrel{+}{=}$, but holds up to additive logarithmic terms that cannot be eliminated.

To define the algorithmic mutual information between two individual objects $x$ and $y$, with no probabilities involved, it is instructive to first recall the probabilistic notion (18). The algorithmic definition is, in fact, entirely analogous, with $H$ replaced by $K$ and random variables replaced by individual sequences or their generating programs: the information in $y$ about $x$ is defined as

$$ I(y : x) = K(x) - K(x \mid y^*) \;\stackrel{+}{=}\; K(x) + K(y) - K(x, y), \tag{20} $$

where the second equality is a consequence of (19) and states that this information is symmetric, $I(x : y) \stackrel{+}{=} I(y : x)$, and therefore we can talk about mutual information. (The notation of the algorithmic, individual notion $I(x : y)$ distinguishes it from the probabilistic, average notion $I(X; Y)$. We deviate slightly from Li and Vitányi [1997], where $I(y : x)$ is defined as $K(x) - K(x \mid y)$.)

Theorem 10 showed that the entropy of distribution $P$ is approximately equal to the expected (under $P$) Kolmogorov complexity. Theorem 11 gives the analogous result for the mutual information.

Theorem 11. Let $P$ be a computable probability distribution on $\{0,1\}^* \times \{0,1\}^*$. Then

$$ I(X; Y) - K(P) \;\stackrel{+}{<}\; \sum_x \sum_y P(x, y)\, I(x : y) \;\stackrel{+}{<}\; I(X; Y) + 2 K(P). $$

Thus, analogously to Theorem 10, we see that the expectation of the algorithmic mutual information $I(x : y)$ is close to the probabilistic mutual information $I(X; Y)$. Theorems 10 and 11 do not stand on their own: it turns out that just about every concept in Shannon's theory has an analogue in Kolmogorov's theory, and in all such cases, these concepts can be related by theorems saying that if data are generated probabilistically, then the Shannon concept is close to the expectation of the corresponding Kolmogorov concept. Examples are the probabilistic vs. the algorithmic sufficient statistics, and the probabilistic rate-distortion function [Cover and Thomas 1991] vs. the algorithmic Kolmogorov structure function. The algorithmic sufficient statistic and structure function are discussed in the next section. For a comparison to their counterparts in Shannon's theory, we refer to [Grünwald and Vitányi 2004].

6 Meaningful Information

The information contained in an individual finite object (like a finite binary string) is measured by its Kolmogorov complexity: the length of the shortest binary program that computes the object. Such a shortest program contains no redundancy: every bit is information; but is it meaningful information? If we flip a fair coin to obtain a finite binary string, then with overwhelming probability that string constitutes its own shortest program. However, also with overwhelming probability all the bits in the string are meaningless information, random noise. On the other hand, let an object $x$ be a sequence of observations of heavenly bodies. Then $x$ can be described by the binary string $pd$, where $p$ is the description of the laws of gravity and the observational parameter setting, while $d$ accounts for the measurement errors: we can divide the information in $x$ into meaningful information $p$ and accidental information $d$. The main task for statistical inference and learning theory is to distill the meaningful information present in the data. The question arises whether it is possible to separate meaningful information from accidental information, and if so, how.

The essence of the solution to this problem is revealed as follows. As shown by Vereshchagin and Vitányi [2004], for all $x \in \{0,1\}^*$, we have

$$ K(x) = \min_{i, p} \{ K(i) + l(p) : T_i(p) = x \} + O(1), \tag{21} $$

where the minimum is taken over $p \in \{0,1\}^*$ and $i \in \{1, 2, \ldots\}$. To get some intuition why (21) holds, note that the original definition (1) expresses that $K(x)$ is the sum of the description length $L_{\mathbb{N}}(i)$ of some Turing machine $i$ when encoded using the standard code for the integers, plus the length of a program $p$ such that $T_i(p) = x$. (21) expresses that the first term in this sum may be replaced by $K(i)$, i.e. the shortest effective description of $i$. It is clear that (21) is never larger than (9) plus some constant (the size of a computer program implementing the standard encoding/decoding of integers). The reason why (21) is also never smaller than (9) minus some constant is that there exists a Turing machine $T_k$ such that, for all $i, p$, $T_k(i^* p) = T_i(p)$, where $i^*$ is the shortest program that prints $i$ and then halts; i.e.,
for all $i, p$, $U(\langle k, i^* p \rangle) = T_i(p)$, where $U$ is the reference machine used in Definition 1. Thus, $K(x)$ is bounded by the constant length $l(k')$ describing $k$, plus $l(i^*) = K(i)$, plus $l(p)$.

The expression (21) shows that we can think of Kolmogorov complexity as the length of a two-part code. This way, $K(x)$ is viewed as the shortest length of a two-part code for $x$, one part describing a Turing machine $T$, or model, for the regular aspects of $x$, and the second part describing the irregular aspects of $x$ in the form of a program $p$ to be interpreted by $T$. The regular, or "valuable," information in $x$ is constituted by the bits in the "model", while the random or "useless" information of $x$ constitutes the remainder. This leaves open the crucial question: how to choose $T$ and $p$ that together describe $x$? In general, many combinations of $T$ and $p$ are possible, but we want to find a $T$ that describes the meaningful aspects of $x$. Below we show that this can be achieved using the Algorithmic Minimal Sufficient Statistic. This theory, built on top of Kolmogorov complexity so to speak, has its roots in two talks by Kolmogorov [1974a, 1974b]. Based on Kolmogorov's remarks, the theory has been further developed by several authors, culminating in Vereshchagin and Vitányi [2004], some of the key ideas of which we outline below.

Data and Model. We restrict attention to the following setting: we observe data $x$ in the form of a finite binary string of some length $n$. As models for the data, we consider finite sets $S$ that contain $x$. In statistics and machine learning, the use of finite sets is nonstandard: one usually models the data using probability distributions or functions. However, the restriction to sets is just a matter of convenience: the theory we are about to present generalizes straightforwardly to the case where the models are arbitrary computable probability density functions and, in fact, other model classes such as computable functions [Vereshchagin and Vitányi 2004]; see also Section 6.3. The intuition behind the idea of a set as a model is the following: informally, '$S$ is a good model for $x$' or, equivalently, '$S$ captures all structure in $x$', if, in a sense to be made precise further below, it summarizes all simple properties of $x$. In Section 6.1 below, we work towards the definition of the algorithmic minimal sufficient statistic (AMSS) via the fundamental notions of 'typicality' of data and 'optimality' of a set. Section 6.2 investigates the AMSS further in terms of the important Kolmogorov structure function. In Section 6.3, we relate the AMSS to the better-known Minimum Description Length Principle.

6.1 Algorithmic Sufficient Statistic

We are now about to formulate the central notions '$x$ is typical for $S$' and '$S$ is optimal for $x$'. Both are necessary, but not sufficient, requirements for $S$ to precisely capture the 'meaningful information' in $x$. After having introduced optimal sets, we investigate what further requirements we need. The development will make heavy use of the Kolmogorov complexity of sets, and conditioned on sets; these notions, written as $K(S)$ and $K(x \mid S)$, were defined in Example 7.

6.1.1 Typical Elements

Consider a string $x$ of length $n$ and prefix complexity $K(x) = k$.
We look for the structure or regularity in $x$ that is to be summarized with a set $S$ of which $x$ is a random or typical member: given $S$ containing $x$, the element $x$ cannot be described significantly more shortly than by its maximal-length index in $S$, that is, $K(x \mid S) \ge \log |S| + O(1)$. Formally,

Definition 12. Let $\beta \ge 0$ be an agreed-upon, fixed constant. A finite binary string $x$ is a typical or random element of a set $S$ of finite binary strings if $x \in S$ and

$$ K(x \mid S) \;\ge\; \log |S| - \beta. \tag{22} $$

We will not indicate the dependence on $\beta$ explicitly, but the constants in all our inequalities ($O(1)$) will be allowed to be functions of this $\beta$. This definition requires a finite $S$. Note that the notion of typicality is not absolute but depends on fixing the constant implicit in the $O$-notation.

Example 13. Consider the set $S$ of binary strings of length $n$ whose every odd position is 0. Let $x$ be an element of this set in which the subsequence of bits in even positions is an incompressible string. Then $x$ is a typical element of $S$. But $x$ is also a typical element of the set $\{x\}$.

Note that, if $x$ is not a typical element of $S$, then $S$ is certainly not a 'good model' for $x$ in the intuitive sense described above: $S$ does not capture all regularity in $x$. However, the example above ($S = \{x\}$) shows that even if $x$ is typical for $S$, $S$ may still not capture 'all meaningful information in $x$'.

Example 14. If $y$ is not a typical element of $S$, this means that it has some simple special property that singles it out from the vast majority of elements in $S$. This can actually be proven formally [Vitányi 2005]. Here we merely give an example. Let $S$ be as in Example 13. Let $y$ be an element of $S$ in which the subsequence of bits in even positions contains twice as many 1s as 0s. Then $y$ is not a typical element of $S$: the overwhelming majority of elements of $S$ have about equally many 0s and 1s in even positions (this follows by simple combinatorics). As shown in [Vitányi 2005], this implies that $K(y \mid S) \ll \log |S|$, so that $y$ is not typical.
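The combinatorics behind Example 14 is easy to make explicit. For the $S$ of Example 13 we have $\log |S| = n/2$; an element whose even positions contain a fraction $f$ of 1s lies inside a subset of $S$ of size about $\binom{n/2}{f n/2}$, and its index in that subset is a short description. The sketch below (our arithmetic illustration; it manipulates description-length bounds, not the uncomputable $K$) compares the two cases:

```python
import math

n = 1000          # length of y; S fixes the n/2 odd positions to 0
m = n // 2        # free even positions, so log2 |S| = m = 500
for ones in (250, 333):               # balanced vs. two-thirds ones
    # An element of S with `ones` 1s in its even positions lies in a subset
    # of S of size C(m, ones); giving its index in that subset takes about
    # log2 C(m, ones) + O(log n) bits instead of log2 |S| = m bits.
    sub = math.log2(math.comb(m, ones))
    print(f"ones = {ones}: log2|S| = {m}, index in subset ~ {sub:.0f} bits")
# ones = 250 -> ~495 bits: no real saving; such a y can be typical.
# ones = 333 -> ~455 bits: K(y|S) is forced ~45 bits below log2|S|,
# so (for reasonable beta) y is not a typical element of S.
```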
6.1.2 Optimal Sets

Let $x$ be a binary data string of length $n$. For every finite set $S \ni x$, we have $K(x) \le K(S) + \log |S| + O(1)$, since we can describe $x$ by giving $S$ and the index of $x$ in a standard enumeration of $S$. Clearly this can be implemented by a Turing machine computing the finite set $S$ and a program $p$ giving the index of $x$ in $S$. The size of a set containing $x$ measures intuitively the number of properties of $x$ that are represented: the largest set is $\{0,1\}^n$ and represents only one property of $x$, namely, being of length $n$. It clearly "underfits" as explanation or model for $x$. The smallest set containing $x$ is the singleton set $\{x\}$ and represents all conceivable properties of $x$. It clearly "overfits" as explanation or model for $x$.

There are two natural measures of suitability of such a set as a model for $x$. We might prefer either (a) the simplest set, or (b) the smallest set, as corresponding to the most likely structure 'explaining' $x$. Both the largest set $\{0,1\}^n$ [having low complexity of about $K(n)$] and the singleton set $\{x\}$ [having high complexity of about $K(x)$], while certainly statistics for $x$, would indeed be considered poor explanations. We would like to balance simplicity of model vs. size of model.

Both measures relate to the optimality of a two-stage description of $x$ using a finite set $S$ that contains it. Elaborating on the two-part code described above,

$$ K(x) \;\le\; K(S) + K(x \mid S) + O(1) \;\le\; K(S) + \log |S| + O(1), \tag{23} $$

where the first inequality follows because there exists a program $p$ producing $x$ that first computes $S$ and then computes $x$ based on $S$; if $p$ is not the shortest program generating $x$, then the inequality is strict. The second inequality substitutes $\log |S| + O(1)$ for $K(x \mid S)$, using the fact that $x$ is an element of $S$. The closer the right-hand side of (23) gets to the left-hand side, the better the two-stage description of $x$ is. This implies a trade-off between meaningful model information, $K(S)$, and meaningless "noise" $\log |S|$. A set $S$ (containing $x$) for which (23) holds with equality,

$$ K(x) = K(S) + \log |S| + O(1), \tag{24} $$

is called optimal. The first inequality of (23) implies that if a set $S$ is optimal for $x$, then $x$ must be a typical element of $S$. However, the converse does not hold: a data string $x$ can be typical for a set $S$ without that set $S$ being optimal for $x$.

Example 15. It can be shown that the set $S$ of Example 13 is also optimal, and so is $\{x\}$. Sets for which $x$ is typical form a much wider class than optimal sets for $x$: the set $\{x, y\}$ is still typical for $x$, but for most $y$ it will be too complex to be optimal for $x$. A less artificial example can be found in [Vereshchagin and Vitányi 2004]. While 'optimality' is a refinement of 'typicality', the fact that $\{x\}$ is still an optimal set for $x$ shows that optimality is still not sufficient by itself to capture the notion of 'meaningful information'. In order to discuss the necessary refinement, we first need to connect optimal sets to the notion of a 'sufficient statistic', which, as its name suggests, has its roots in the statistical literature.

6.1.3 Algorithmic Sufficient Statistic

A statistic of the data $x = x_1 \ldots x_n$ is a function $f(x)$. Essentially, every function will do. For example, $f_1(x) = n$, $f_2(x) = \sum_{i=1}^n x_i$, $f_3(x) = n - f_2(x)$, and $f_4(x) = f_2(x)/n$ are statistics. A "sufficient" statistic of the data contains all information in the data about the model. In introducing the notion of sufficiency in classical statistics, Fisher [1922] stated: "The statistic chosen should summarise the whole of the relevant information supplied by the sample. This may be called the Criterion of Sufficiency ... In the case of the normal distributions it is evident that the second moment is a sufficient statistic for estimating the standard deviation." For example, in the Bernoulli model (repeated coin flips with outcomes 0 and 1 according to a fixed bias), the statistic $f_4$ is sufficient. It gives the mean of the outcomes and estimates the bias of the Bernoulli process, which is the only relevant model information. For the classic (probabilistic) theory see, for example, [Cover and Thomas 1991]. Gács, Tromp, and Vitányi [2001] develop an algorithmic theory of sufficient statistics (relating individual data to individual models) and establish its relation to the probabilistic version; this work is extended by Grünwald and Vitányi [2004].
The algorithmic basics are as follows: intuitively, a model expresses the essence of the data if the two-part code describing the data, consisting of the model and the data-to-model code, is as concise as the best one-part description. In other words, we call a shortest program for an optimal set with respect to $x$ an algorithmic sufficient statistic for $x$.

Example 16 (Sufficient Statistic). Let us look at a coin toss example. Let $k$ be a number in the range $0, 1, \ldots, n$ of complexity $\log n + O(1)$ given $n$, and let $x$ be a string of length $n$ having $k$ 1s, of complexity $K(x \mid n, k) \ge \log \binom{n}{k}$ given $n, k$. This $x$ can be viewed as a typical result of tossing a coin with a bias about $p = k/n$. A two-part description of $x$ is given by first specifying the number $k$ of 1s in $x$, followed by the index $j \le \log |S|$ of $x$ in the set $S$ of strings of length $n$ with $k$ 1s. This set is optimal, since, to within $O(1)$,

$$ K(x) = K(x, \langle n, k \rangle) = K(n, k) + K(x \mid n, k) = K(S) + \log |S|. $$

The shortest program for $S$, which amounts to an encoding of $n$ and then $k$ given $n$, is an algorithmic sufficient statistic for $x$.

The optimal set that admits the shortest possible program (or rather that shortest program) is called the algorithmic minimal sufficient statistic of $x$. In general there can be more than one such set and corresponding program:

Definition 17 (Algorithmic minimal sufficient statistic). An algorithmic sufficient statistic of $x$ is a shortest program for a set $S$ containing $x$ that is optimal, i.e. it satisfies (24). An algorithmic sufficient statistic with optimal set $S$ is minimal if there exists no optimal set $S'$ with $K(S') < K(S)$.

The algorithmic minimal sufficient statistic (AMSS) divides the information in $x$ into a relevant structure expressed by the set $S$, and the remaining randomness with respect to that structure, expressed by $x$'s index in $S$ of $\log |S|$ bits. The shortest program for $S$ is by itself an algorithmic definition of structure, without a probabilistic interpretation.

Example 18 (Example 13, continued). The shortest program for the set $S$ of Example 13 is a minimal sufficient statistic for the string $x$ mentioned in that example. The program generating the set $\{x\}$, while still an algorithmic sufficient statistic, is not a minimal sufficient statistic.

Example 19 (Example 16, continued). The $S$ of Example 16 encodes the number of 1s in $x$. The shortest program for $S$ is an algorithmic minimal sufficient statistic for most $x$ of length $n$ with $k$ 1s, since only a fraction of at most $2^{-m}$ of the $x$'s of length $n$ with $k$ 1s can have $K(x) < \log |S| - m$ (Section 4). But of course there exist $x$'s with $k$ ones which have much more regularity. An example is the string starting with $k$ 1s followed by $n - k$ 0s. For such strings, $S$ is not optimal anymore, nor is $S$ an algorithmic sufficient statistic.
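The two-part code of Example 16 is concrete enough to evaluate numerically. A sketch (ours; it charges $\log(n+1)$ bits for transmitting $k$ given $n$ and ignores $O(1)$ terms, which is a crude simplification):

```python
import math

n, k = 1000, 250                          # x: a typical string with k ones
model_bits = math.log2(n + 1)             # transmit k in {0,...,n}, given n
index_bits = math.log2(math.comb(n, k))   # log2|S|: index of x within S
total = model_bits + index_bits
print(f"{model_bits:.1f} + {index_bits:.1f} = {total:.1f} bits"
      f"  (vs. n = {n} literal bits)")
# For typical x, K(x|n,k) >= log2 C(n,k), so this ~816-bit two-part code
# matches K(x) up to O(1): the ~10-bit model part is a sufficient statistic.
```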
To analyze the minimal sufficient statistic further, it is useful to place a constraint on the maximum complexity $K(S)$ of the set, say $K(S) \le \alpha$, and to investigate what happens if we vary $\alpha$. The result is the Kolmogorov structure function, which we now discuss.

6.2 The Kolmogorov Structure Function

The Kolmogorov structure function [Kolmogorov 1974a; Kolmogorov 1974b; Vereshchagin and Vitányi 2004] $h_x$ of given data $x$ is defined by

$$ h_x(\alpha) = \min_S \{ \log |S| : S \ni x,\ K(S) \le \alpha \}, \tag{25} $$

where $S \ni x$ is a contemplated model for $x$, and $\alpha$ is a non-negative integer value bounding the complexity of the contemplated $S$'s. Clearly, the Kolmogorov structure function is nonincreasing and reaches $\log |\{x\}| = 0$ for $\alpha = K(x) + c_1$, where $c_1$ is the number of bits required to change $x$ into $\{x\}$. For every $S \ni x$ we have (23), and hence $K(x) \le \alpha + h_x(\alpha) + O(1)$; that is, the function $h_x(\alpha)$ never decreases more than a fixed independent constant below the diagonal sufficiency line $L$ defined by $L(\alpha) + \alpha = K(x)$, which is a lower bound on $h_x(\alpha)$ and is approached to within a constant distance by the graph of $h_x$ for certain $\alpha$'s (e.g., for $\alpha = K(x) + c_1$). For these $\alpha$'s we thus have $\alpha + h_x(\alpha) = K(x) + O(1)$; a model corresponding to such an $\alpha$ (a witness for $h_x(\alpha)$) is a sufficient statistic, and it is minimal for the least such $\alpha$ [Cover and Thomas 1991; Gács, Tromp, and Vitányi 2001]. This is depicted in Figure 1.

[Figure 1: The structure functions $h_x(\alpha)$, $\beta_x(\alpha)$, $\lambda_x(\alpha)$, and the minimal sufficient statistic.]

Note once again that the structure function is defined relative to given data (a single sequence $x$); different sequences result in different structure functions. Yet, all these different functions share some properties: for all $x$, the function $h_x(\alpha)$ will lie above the diagonal sufficiency line for all $\alpha \le \alpha_x$. Here $\alpha_x$ is the complexity $K(S)$ of the AMSS for $x$. For $\alpha \ge \alpha_x$, the function $h_x(\alpha)$ remains within a constant of the diagonal. For stochastic strings generated by a simple computable distribution (finite $K(P)$), the sufficiency line will typically be first hit for $\alpha$ close to 0, since the AMSS will grow as $O(\log n)$. For example, if $x$ is generated by independent fair coin flips, then, with probability 1, one AMSS will be $S = \{0,1\}^n$ with complexity $K(S) = K(n) = O(\log n)$. One may suspect that all intuitively 'random' sequences have a small sufficient statistic of order $O(\log n)$ or smaller. Surprisingly, this turns out not to be the case, as we show in Example 21.
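Before turning to examples, it may help to tabulate a toy version of (25). In the sketch below (entirely our own construction), $x$ is a typical string with $k$ ones, only three candidate models are considered, and the uncomputable $K(S)$ is replaced by rough description-length proxies; the numbers merely mimic the shape of Figure 1:

```python
import math

n, k = 1000, 250
Kx = math.log2(n + 1) + math.log2(math.comb(n, k))  # proxy for K(x), typical x
# Candidate models as pairs (proxy for K(S), log2|S|); all O(1) terms dropped.
models = [
    (math.log2(n), n),                            # S = {0,1}^n
    (math.log2(n) + math.log2(n + 1),             # S = {length-n strings, k ones}
     math.log2(math.comb(n, k))),
    (Kx, 0.0),                                    # S = {x}
]
for alpha in (10, 25, 850):
    hx = min((logS for KS, logS in models if KS <= alpha), default=float("inf"))
    print(f"alpha={alpha:3d}:  h_x ~ {hx:.0f},  alpha + h_x ~ {alpha + hx:.0f},"
          f"  K(x) ~ {Kx:.0f}")
# Near alpha ~ 20 the sum alpha + h_x(alpha) already comes within the proxy
# slop of K(x): the k-ones model is the (minimal) sufficient statistic, and
# raising alpha further buys nothing until the singleton {x} is affordable.
```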
Example 20 (Lossy Compression). The Kolmogorov structure function $h_x(\alpha)$ is relevant to lossy compression (used, e.g., to compress images). Assume we need to compress $x$ to $\alpha$ bits where $\alpha \ll K(x)$. Of course this implies some loss of information present in $x$. One way to select redundant information to discard is as follows: let $S_0$ be the set generated by the algorithmic minimal sufficient statistic $S_0^*$ ($S_0^*$ is a shortest program that prints $S_0$ and halts). Assume that $l(S_0^*) = K(S_0) \le \alpha$. Since $S_0$ is an optimal set, it is also a typical set, so that $K(x \mid S_0) \approx \log |S_0|$. We compress $x$ by $S_0^*$, taking $\alpha$ bits. To reconstruct an $x'$ close to $x$, a decompressor can first reconstruct the set $S_0$, and then select an element $x'$ of $S_0$ uniformly at random. This ensures that with very high probability $x'$ is itself also a typical element of $S_0$, so it has the same properties that $x$ has. Therefore, $x'$ should serve the purpose of the message $x$ as well as does $x$ itself. However, if $l(S_0^*) > \alpha$, then it is not possible to compress all meaningful information of $x$ into $\alpha$ bits. We may instead encode, among all sets $S$ with $K(S) \le \alpha$, the one with the smallest $\log |S|$, achieving $h_x(\alpha)$. But inevitably, this set will not capture all the structural properties of $x$.

Let us look at an example. To transmit a picture of "rain" through a channel with limited capacity $\alpha$, one can transmit the indication that this is a picture of the rain, and the particular drops may be chosen by the receiver at random. In this interpretation, the complexity constraint $\alpha$ determines how "random" or "typical" $x$ will be with respect to the chosen set $S$, and hence how "indistinguishable" from the original $x$ the randomly reconstructed $x'$ can be expected to be.
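A toy version of this scheme is easy to simulate: the 'compressor' transmits only the model "length-$n$ string with $k$ ones", here described naively by the pair $(n, k)$ rather than by a shortest program, and the receiver samples a uniform member of that set. A sketch (our illustration):

```python
import random

def compress(x: str):
    # Keep only the model: the set S of strings with this length and this
    # number of ones, described here naively by the pair (n, k).
    return len(x), x.count("1")

def decompress(n: int, k: int, rng) -> str:
    # Draw a uniform element of S: k ones in random positions.
    pos = set(rng.sample(range(n), k))
    return "".join("1" if i in pos else "0" for i in range(n))

rng = random.Random(1)
x = "".join("1" if rng.random() < 0.25 else "0" for _ in range(1000))
n, k = compress(x)
x2 = decompress(n, k, rng)
print(k, x2.count("1"))    # same gross statistics, different 'raindrops'
```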
We end this section with an example of a strange consequence of Kolmogorov's theory:

Example 21 ("Positive" and "Negative" Individual Randomness). Gács, Tromp, and Vitányi [2001] showed the existence of strings for which essentially the singleton set consisting of the string itself is a minimal sufficient statistic (Section 6.1). While a sufficient statistic of an object yields a two-part code that is as short as the shortest one-part code, restricting the complexity of the allowed statistic may yield two-part codes that are considerably longer than the best one-part code (so that the statistic is insufficient). In fact, for every object there is a complexity bound below which this happens; this is just the point where the Kolmogorov structure function hits the diagonal. If that bound is small (logarithmic) we call the object "stochastic", since it has a simple satisfactory explanation (sufficient statistic). Thus, Kolmogorov [1974a] makes the important distinction between an object being random in the "negative" sense, by having this bound high (it has high complexity and is not a typical element of a low-complexity model), and an object being random in the "positive, probabilistic" sense, by both having this bound small and itself having complexity considerably exceeding this bound (like a string $x$ of length $n$ with $K(x) \ge n$, being typical for the set $\{0,1\}^n$, or the uniform probability distribution over that set, while this set or probability distribution has complexity $K(n) + O(1) = O(\log n)$). We depict the distinction in Figure 2.

[Figure 2: Data string $x$ is "positive random" or "stochastic", and data string $y$ is just "negative random" or "non-stochastic".]

6.3 The Minimum Description Length Principle

Learning. The main goal of statistics and machine learning is to learn from data. One common way of interpreting 'learning' is as a search for the structural, regular properties of the data: all the patterns that occur in it. On a very abstract level, this is just what is achieved by the AMSS, which can thus be related to learning or, more generally, inductive inference. There is, however, another, much better-known method for learning based on data compression. This is the Minimum Description Length (MDL) Principle, mostly developed by J. Rissanen [1978, 1989]; see [Grünwald 2007] for a recent introduction, and see also [Wallace 2005] for the related MML Principle. Rissanen took Kolmogorov complexity as an informal starting point, but was not aware of the AMSS when he developed the first and, with hindsight, somewhat crude version of MDL [Rissanen 1978], which roughly says that the best theory to explain given data $x$ is the one that minimizes the sum of

1. the length, in bits, of the description of the theory, plus

2. the length, in bits, of the description of the data $x$ when the data is described with the help of the theory.

Thus, data is encoded by first encoding a theory (constituting the 'structural' part of the data) and then encoding the data using the properties of the data that are prescribed by the theory. Picking the theory minimizing the total description length leads to an automatic trade-off between complexity of the chosen theory and its goodness of fit on the data. This provides a principle of inductive inference that may be viewed as a mathematical formalization of 'Occam's Razor'. It automatically protects against overfitting, a central concern of statistics: when allowing models of arbitrary complexity, we are always in danger that we model random fluctuations rather than the trend in the data [Grünwald 2007].

The MDL Principle has been designed so as to be practically useful. This means that the codes used to describe a 'theory' are not based on Kolmogorov complexity. However, there exists an 'ideal' version of MDL [Li and Vitányi 1997; Barron and Cover 1991] which does rely on Kolmogorov complexity. Within our framework (binary data, models as sets), it becomes [Vereshchagin and Vitányi 2004; Vitányi 2005]: pick a set $S \ni x$ minimizing the two-part codelength

$$ K(S) + \log |S|. \tag{26} $$

In other words: any "optimal set" (as defined in Section 6.1.2) is regarded as a good explanation of the data. It follows that every set $S$ that is an AMSS also minimizes the two-part codelength to within $O(1)$. However, as we already indicated, there exist optimal sets $S$ (that, because of their optimality, may be selected by MDL) that are not minimal sufficient statistics. As explained by Vitányi [2005], these do not capture the idea of 'summarizing all structure in $x$'. Thus, the AMSS may be considered a refinement of the idealized MDL approach.

Practical MDL. The practical MDL approach uses probability distributions rather than sets as models. Typically, one restricts attention to distributions in some model class, such as the set of all Markov chain distributions of each order, or the set of all polynomials $f$ of each degree, where $f$ expresses that $Y = f(X) + Z$ and $Z$ is some normally distributed noise variable (this makes $f$ a 'probabilistic' hypothesis). These model classes are still 'large' in that they cannot be described by a finite number of parameters; but they are simple enough to admit efficiently computable versions of MDL, unlike the ideal version above which, because it involves Kolmogorov complexity, is uncomputable. The Kolmogorov complexity, set-based theory has to be adjusted at various places to deal with such practical models, one reason being that they have uncountably many elements.
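A minimal sketch of practical MDL under these conventions (ours; it scores binary Markov chains of each order by the common asymptotic two-part codelength, (number of parameters)/2 times $\log n$, plus the data encoded with the fitted conditional probabilities):

```python
import math, random

def mdl_codelength(x: str, order: int) -> float:
    # Two-part codelength, in bits, for a binary Markov chain of this order:
    # parameter part ~ (#parameters)/2 * log2 n (the usual asymptotic cost
    # of encoding fitted parameters), plus the data encoded with the fitted
    # conditional probabilities (= minus the maximized log-likelihood).
    n = len(x)
    counts = {}
    for i in range(order, n):
        c = counts.setdefault(x[i - order:i], [0, 0])
        c[int(x[i])] += 1
    data_bits = sum(-c * math.log2(c / (c0 + c1))
                    for c0, c1 in counts.values() for c in (c0, c1) if c)
    return (2 ** order) / 2 * math.log2(n) + data_bits

rng = random.Random(2)
bits, b = [], 0
for _ in range(5000):        # an order-1 source: P(next bit != current) = 0.1
    if rng.random() < 0.1:
        b ^= 1
    bits.append(str(b))
x = "".join(bits)
for m in range(4):
    print(m, round(mdl_codelength(x, m)))
# Order 1 minimizes the total codelength: higher orders fit no better but
# pay exponentially more parameter bits; order 0 badly misfits the data.
```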
MDL has been successful in practical statistical and machine learning problems where overfitting is a real concern [Grünwald 2007]. Technically, MDL algorithms are very similar to the popular Bayesian methods, but the underlying philosophy is very different: MDL is based on finding structure in individual data sequences; distributions (models) are viewed as representation languages for expressing useful properties of the data; they are neither viewed as objectively existing but unobservable objects according to which data are 'generated', nor are they viewed as representing subjective degrees of belief, as in a mainstream Bayesian interpretation.

In recent years, ever more sophisticated refinements of the original MDL have been developed [Rissanen 1996; Rissanen and Tabus 2005; Grünwald 2007]. For example, in modern MDL approaches, one uses universal codes, which may be two-part but in practice are often one-part codes.

7 Philosophical Implications and Conclusion

We have given an overview of algorithmic information theory, focusing on some of its most important aspects: Kolmogorov complexity, algorithmic mutual information, their relations to entropy and Shannon mutual information, the algorithmic minimal sufficient statistic and the Kolmogorov structure function, and their relation to 'meaningful information'. Throughout the chapter we emphasized insights that, in our view, are 'philosophical' in nature. It is now time to harvest and make the philosophical connections explicit. Below we first discuss some implications of algorithmic information theory for the philosophy of (general) mathematics, probability theory and statistics. We then end the chapter by discussing the philosophical implications for 'information' itself. As we shall see, it turns out that nearly all of these philosophical implications are somehow related to randomness.

Philosophy of Mathematics: Randomness in Mathematics. In and after Example 4 we indicated that the ideas behind Kolmogorov complexity are intimately related to Gödel's incompleteness theorem. The finite Kolmogorov complexity of any effective axiom system implied the existence of bizarre equations like (10), whose full solution is, in a sense, random: no effective axiom system can fully determine the solutions of this single equation. In this context, Chaitin writes: "This is a region in which mathematical truth has no discernible structure or pattern and appears to be completely random [...] Quantum physics has shown that there is randomness in nature. I believe that we have demonstrated [...] that randomness is already present in pure Mathematics. This does not mean that the universe and Mathematics are completely lawless, it means that laws of a different kind apply: statistical laws. [...] Perhaps number theory should be pursued more openly in the spirit of an experimental science!"

Philosophy of Probability: Individual Randomness. The statement '$x$ is a random sequence' is essentially meaningless in classical probability theory, which can only make statements that hold for ensembles, such as 'relative frequencies converge to probabilities with high probability, or with probability 1'. But in reality we only observe one sequence.
What then does the statement 'this sequence is a typical outcome of distribution $P$' or, equivalently, 'this sequence is random with respect to $P$', tell us about the sequence? We might think that it means that the sequence satisfies all properties that hold with $P$-probability 1. But this will not work: if we identify a 'property' with the set of sequences satisfying it, then it is easy to see that the intersection of all sets corresponding to properties that hold 'with probability 1' is empty. The Martin-Löf theory of randomness [Li and Vitányi 1997] essentially resolves this issue. Martin-Löf's notion of randomness turns out to be, roughly, equivalent to Kolmogorov randomness: a sequence $x$ is random if $K(x) \approx l(x)$, i.e. it cannot be effectively compressed. This theory allows us to speak of the randomness of single, individual sequences, which is inherently impossible for probabilistic theories. Yet, as shown by Martin-Löf, his notion of randomness is entirely consistent with probabilistic ideas. Identifying the randomness of an individual sequence with its incompressibility opens up a whole new area, which is illustrated by Example 21, in which we made distinctions between different types of random sequences ('positive' and 'negative') that cannot be expressed in, let alone understood from, a traditional probabilistic perspective.
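Incompressibility itself is uncomputable, but the idea behind this characterization can be illustrated by letting a real compressor stand in for $K$; since a compressor only upper-bounds $K$, it can expose non-randomness but can never certify randomness. A sketch (ours):

```python
import zlib, random

rng = random.Random(3)
samples = [
    ("coin flips", bytes(rng.getrandbits(8) for _ in range(10_000))),
    ("periodic",   b"01" * 5_000),
]
for name, s in samples:
    print(f"{name:10s} {8 * len(s):6d} bits -> {8 * len(zlib.compress(s, 9))} bits")
# The coin-flip bytes do not compress (their length is already close to K),
# while the periodic string collapses to a few hundred bits: only the former
# is a plausible candidate for Martin-Löf randomness.
```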
Philosophy of Statistics/Inductive Inference: Epistemological Occam's Razor. There exist two close connections between algorithmic information theory and inductive inference: one via the algorithmic sufficient statistic and the MDL Principle, the other via Solomonoff's induction theory, which there was no space to discuss here [Li and Vitányi 1997]. The former deals with finding structure in data; the latter is concerned with sequential prediction. Both of these theories implicitly employ a form of Occam's Razor: when two hypotheses fit the data equally well, they prefer the simpler one (the one with the shorter description). Both the MDL and the Solomonoff approach are theoretically quite well-behaved: there exist several convergence theorems for both approaches. Let us give an example of such a theorem for the MDL framework: Barron and Cover [1991] and Barron [1985] show that, if data are distributed according to some distribution in a contemplated model class (set of candidate distributions) $\mathcal{M}$, then two-part MDL will eventually find this distribution; it will even do so based on a reasonably small sample. This holds both for practical versions of MDL (with restricted model classes) and for versions based on Kolmogorov complexity, where $\mathcal{M}$ consists of the huge class of all distributions which can be arbitrarily well approximated by finite computer programs. Such theorems provide a justification for MDL. Looking at the proofs, one finds that the preference for simple models is crucial: the convergence occurs precisely because the complexity of each probabilistic hypothesis $P$ is measured by its codelength $L(P)$, under some prefix code that allows one to encode all $P$ under consideration. If a complexity measure $L'(P)$ is used that does not correspond to any prefix code, then, as is easy to show, in some situations MDL will not converge at all and, no matter how many data are observed, will keep selecting overly complex, suboptimal hypotheses for the data. In fact, even if the world is such that data are generated by a very complex (high $K(P)$) distribution, it is wise to prefer simple models at small sample sizes [Grünwald 2007]! This provides a justification for the use of MDL's version of Occam's razor in inductive inference. It should be stressed that this is an epistemological rather than a (meta-)physical form of Occam's Razor: it is used as an effective strategy, which is something very different from a belief that 'the true state of the world is likely to have a short description'. This issue, as well as the related question to what extent Occam's Razor can be made representation-independent, is discussed in great detail in [Grünwald 2007].

A further difference between statistical inference based on algorithmic information theory and almost all other approaches to statistics and learning is that the algorithmic approach focuses on individual data sequences: there is no need for the (often untenable) assumption of classical statistics that there is some distribution $P$ according to which the data are distributed. In the Bayesian approach to statistics, probability is often interpreted subjectively, as a degree of belief. Still, in many Bayesian approaches there is an underlying assumption that there exist 'states of the world' which are viewed as probability distributions. Again, such assumptions need not be made in the present theories, neither in the form which explicitly uses Kolmogorov complexity, nor in the restricted practical form. In both cases, the goal is to find regular patterns in the data, no more. All this is discussed in detail in [Grünwald 2007].

Philosophy of Information. On the first page of the chapter on Shannon information theory in this handbook [Harremoës and Topsøe 2007], we read "information is always information about something." This is certainly the case for Shannon information theory, where a string $x$ is always used to communicate some state of the world, or of those aspects of the world that we care about. But if we identify 'amount of information in $x$' with $K(x)$, then it is not so clear anymore what this 'information' is about. $K(x)$, the algorithmic information in $x$, looks at the information in $x$ itself, independently of anything outside. For example, if $x$ consists of the first billion bits of the binary expansion of $\pi$, then its information content is the size of the smallest program which prints these bits. This sequence does not describe any state of the world that is to be communicated. Therefore, one may argue that it is meaningless to say that '$x$ carries information', let alone to measure its amount.
At a workshop where many of the contributors to this handbook were present, there was a long discussion about this question, with several participants insisting that "algorithmic information misses 'aboutness' (sic), and is therefore not really information". In the end the question whether algorithmic information should really count as "information" is, of course, a matter of definition. Nevertheless, we would like to argue that there exist situations where, intuitively, the word "information" seems exactly the right word to describe what is being measured, while nevertheless "aboutness" is missing. For example, $K(y \mid x)$ is supposed to describe the amount of "information" in $y$ that is not already present in $x$. Now suppose $y$ is equal to $3x$, expressed in binary, and $x$ is a random string of length $n$, so that $K(x) \approx K(y) \approx n$. Then $K(y \mid x) = O(1)$ is much smaller than $K(x)$ or $K(y)$. The way an algorithmic information theorist would phrase this is "$x$ provides nearly all the information needed to generate $y$." To us, this seems an eminently reasonable use of the word information. Still, this "information" does not refer to any outside state of the world. (We may of course say that $x$ carries information "about" $y$. The point, however, is that $y$ is not a state of any imagined external world, so here "about" does not refer to anything external. Thus, one cannot say that $x$ contains information about some external state of the world.)
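The 'constant-size program' implicit in this example can be exhibited directly; its length does not depend on $n$, which is exactly what $K(y \mid x) = O(1)$ asserts. A sketch (ours):

```python
def y_from_x(x: str) -> str:
    # A fixed program, independent of x and of n: read x as a binary
    # numeral and print the binary numeral of 3x. Its length is O(1),
    # so K(y | x) = O(1) even though K(x) and K(y) are both about n.
    return format(3 * int(x, 2), "b")

print(y_from_x("1011"))    # 3 * 11 = 33 -> '100001'
```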
Let us assume then that the terminology "algorithmic information theory" is justified. What lessons can we draw from the theory for the philosophy of information? First, we should emphasize that the amount of 'absolute, inherent' information in a sequence is only well-defined asymptotically and is in general uncomputable. If we want a nonasymptotic and efficiently computable measure, we are forced to use a restricted class of description methods. Such restrictions naturally lead one to universal coding and practical MDL. The resulting notion of information is always defined relative to a class of description methods and can make no claims to objectivity or absoluteness. Interestingly though, unlike Shannon's notion, it is still meaningful for individual sequences of data, and is not dependent on any outside probabilistic assumptions: this is an aspect of the general theory that can be retained in the restricted forms [Grünwald 2007].

Second, the algorithmic theory allows us to formalize the notion of 'meaningful information' in a distinctly novel manner. It leads to a separation of the meaningful information from the noise in a sequence, once again without making any probabilistic assumptions. Since learning can be seen as an attempt to find the meaningful information in data, this connects the theory to inductive inference.

Third, the theory re-emphasizes the connection between measuring amounts of information and data compression, which was also the basis of Shannon's theory. In fact, algorithmic information has close connections to Shannon information after all, and if the data $x$ are generated by some probabilistic process $P$, so that the information in $x$ is actually really 'about' something, then the algorithmic information in $x$ behaves very similarly to the Shannon entropy of $P$, as explained in Section 5.3.

Further Reading. Kolmogorov complexity has many applications which we could not discuss here. It has implications for aspects of physics such as the second law of thermodynamics; it provides a novel mathematical proof technique called the incompressibility method; and so on. These and many other topics in Kolmogorov complexity are thoroughly discussed and explained in the standard reference [Li and Vitányi 1997]. Additional (and more recent) material on the relation to Shannon's theory can be found in Grünwald and Vitányi [2003, 2004]; additional material on the structure function is in [Vereshchagin and Vitányi 2004; Vitányi 2005]; and additional material on MDL can be found in [Grünwald 2007].

8 Acknowledgments

Paul Vitányi was supported in part by the EU project RESQ, IST-2001-37559, the NoE QIPROCONE IST-1999-29064 and the ESF QiT Programme. Both Vitányi and Grünwald were supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors' views.

References

Barron, A. and T. Cover (1991). Minimum complexity density estimation. IEEE Transactions on Information Theory 37(4), 1034–1054.

Barron, A. R. (1985). Logically Smooth Density Estimation. Ph.D. thesis, Department of EE, Stanford University, Stanford, CA.

Chaitin, G. (1966). On the length of programs for computing finite binary sequences. Journal of the ACM 13, 547–569.

Chaitin, G. (1975). A theory of program size formally identical to information theory. Journal of the ACM 22, 329–340.

Chaitin, G. (1987). Algorithmic Information Theory. Cambridge University Press.

Cilibrasi, R. and P. Vitányi (2005). Clustering by compression. IEEE Transactions on Information Theory 51(4), 1523–1545.

Cover, T. and J. Thomas (1991). Elements of Information Theory. New York: Wiley Interscience.

Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, Series A 222, 309–368.

Gács, P. (1974). On the symmetry of algorithmic information. Soviet Math. Dokl. 15, 1477–1480. Correction, ibid., 15:1480, 1974.

Gács, P., J. Tromp, and P. Vitányi (2001). Algorithmic statistics. IEEE Transactions on Information Theory 47(6), 2464–2479.

Grünwald, P. and P. Vitányi (2003). Kolmogorov complexity and information theory; with an interpretation in terms of questions and answers. Journal of Logic, Language and Information 12, 497–529.

Grünwald, P. D. (2007). Prediction is coding. Manuscript in preparation.

Grünwald, P. D. and P. M. Vitányi (2004). Shannon information and Kolmogorov complexity. Submitted for publication. Available at the Computer Science CoRR arXiv as http://de.arxiv.org/abs/cs.IT/0410002.

Harremoës, P. and F. Topsøe (2007). The quantitative theory of information. In J. van Benthem and P. Adriaans (Eds.), Handbook of the Philosophy of Information, Chapter 6. Elsevier.

Kolmogorov, A. (1965). Three approaches to the quantitative definition of information. Problems Inform. Transmission 1(1), 1–7.

Kolmogorov, A. (1974a). Talk at the Information Theory Symposium in Tallinn, Estonia, 1974.

Kolmogorov, A. (1974b). Complexity of algorithms and objective definition of randomness. A talk at Moscow Math. Soc. meeting 4/16/1974. A 4-line abstract available in Uspekhi Mat. Nauk 29:4 (1974), 155 (in Russian).

Li, M. and P. Vitányi (1997). An Introduction to Kolmogorov Complexity and Its Applications (revised and expanded second ed.). New York: Springer-Verlag.

Rissanen, J. (1978). Modeling by the shortest data description. Automatica 14, 465–471.

Rissanen, J. (1989). Stochastic Complexity in Statistical Inquiry. World Scientific Publishing Company.

Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Transactions on Information Theory 42(1), 40–47.
Rissanen, J. and I. Tabus (2005). Kolmogorov's structure function in MDL theory and lossy data compression. In P. D. Grünwald, I. J. Myung, and M. A. Pitt (Eds.), Advances in Minimum Description Length: Theory and Applications. MIT Press.

Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal 27, 379–423, 623–656.

Solomonoff, R. (1964). A formal theory of inductive inference, part 1 and part 2. Information and Control 7, 1–22, 224–254.

Vereshchagin, N. and P. Vitányi (2004). Kolmogorov's structure functions and model selection. IEEE Transactions on Information Theory 50(12), 3265–3290.

Vitányi, P. M. (2005). Algorithmic statistics and Kolmogorov's structure function. In P. D. Grünwald, I. J. Myung, and M. A. Pitt (Eds.), Advances in Minimum Description Length: Theory and Applications. MIT Press.

Wallace, C. (2005). Statistical and Inductive Inference by Minimum Message Length. New York: Springer.
