Algorithmic statistics: forty years later


Authors: Nikolai Vereshchagin, Alexander Shen

Abstract

Algorithmic statistics has two different (and almost orthogonal) motivations. From the philosophical point of view, it tries to formalize how statistics works and why some statistical models are better than others. After this notion of a "good model" is introduced, a natural question arises: is it possible that for some piece of data there is no good model? If yes, how often do such bad (non-stochastic) data appear "in real life"?

Another, more technical motivation comes from algorithmic information theory. This theory introduces a notion of complexity of a finite object (= the amount of information in this object); it assigns to every object some number, called its algorithmic complexity (or Kolmogorov complexity). Algorithmic statistics provides a more fine-grained classification: for each finite object some curve is defined that characterizes its behavior. It turns out that several different definitions give (approximately) the same curve.¹

In this survey we try to provide an exposition of the main results in the field (including full proofs for the most important ones), as well as some historical comments. We assume that the reader is familiar with the main notions of algorithmic information theory (Kolmogorov complexity). An exposition can be found in [44, chapters 1, 3, 4] or [22, chapters 2, 3]; see also the survey [37].

∗ The work was in part funded by RFBR according to the research project grant 16-01-00362-a (N.V.) and by the RaCAF ANR-15-CE40-0016-01 grant (A.S.)
† N. Vereshchagin is with Moscow State University, National Research University Higher School of Economics, and Yandex
‡ A. Shen is with LIRMM, CNRS & Univ.
Montpellier, 161 rue Ada, 34095, France

Road-map: Section 2 considers the notion of (α, β)-stochasticity; Section 3 considers two-part descriptions and the so-called "minimal description length principle"; Section 4 gives one more approach: we consider the list of objects of bounded complexity and measure how far some object is from the end of the list, getting some natural class of "standard descriptions" as a by-product; finally, Section 5 establishes a connection between these notions and resource-bounded complexity. The rest of the paper deals with attempts to bring the theory closer to practice by considering restricted classes of descriptions (Section 6) and strong models (Section 7).

¹ A short survey of the main results of algorithmic statistics was given in [43] (without proofs); see also the last chapter of the book [44].

Contents

1 Statistical models
2 (α, β)-stochasticity
  2.1 Prefix complexity, a priori probability and randomness deficiency
  2.2 Definition of stochasticity
  2.3 Stochasticity conservation
  2.4 Non-stochastic objects
3 Two-part descriptions
  3.1 Optimality deficiency
  3.2 Optimality and randomness deficiencies
  3.3 Trade-off between complexity and size of a model
  3.4 Optimality and randomness deficiency
  3.5 Historical remarks
4 Bounded complexity lists
  4.1 Enumerating strings of complexity at most m
  4.2 Ω-like numbers
  4.3 Position in the list is well defined
  4.4 The relation to P_x
  4.5 Standard descriptions
  4.6 Non-stochastic objects revisited
  4.7 Historical comments
5 Computational and logical depth
  5.1 Bounded-time Kolmogorov complexity
  5.2 Trade-off between time and complexity
  5.3 Historical comments
  5.4 Why so many equivalent definitions?
6 Descriptions of restricted type
  6.1 Families of descriptions
  6.2 Possible shapes of the boundary curve
  6.3 Randomness and optimality deficiencies: restricted case
7 Strong models
  7.1 Information in minimal descriptions
  7.2 An attempt to separate "good" models from "bad" ones
  7.3 Properties of strong models
  7.4 Strange strings
  7.5 Strong sufficient statistics
  7.6 Normal strings and standard descriptions
  7.7 A strong sufficient minimal model for a normal string is normal
  7.8 The number of strings with a given profile
  7.9 Open questions
8 Acknowledgments

1 Statistical models

Let us start with a (very rough) scheme. Imagine an experiment that produces some bit string x. We know nothing about the device that produced this data, and cannot repeat the experiment.
Still we want to suggest some statistical model that fits the data ("explains" x in a plausible way). This model is a probability distribution on some finite set of binary strings containing x. What do we expect from a reasonable model? There are, informally speaking, two main properties of a good model.

First, the model should be "simple". If a model contains so many parameters that it is more complicated than the data itself, we would not take it seriously. To make this requirement more formal, one can use the notion of Kolmogorov complexity.² Let us assume that the measure P (used as a model) has finite support and rational values. Then P can be considered as a finite (constructive) object, so we can speak about the Kolmogorov complexity of P. The requirement then says that the complexity of P should be much smaller than the complexity of the data string x itself. For example, if a data string x contains n bits, we may consider a model that corresponds to n independent fair coin tosses, i.e., the uniform distribution P on the set of all n-bit strings. Such a distribution is a constructive object that is completely determined by the value of n, so its complexity is O(log n), while the complexity of most n-bit strings is close to n (and therefore much larger than the complexity of P, if n is large enough).

² We assume that the reader is familiar with basic notions of algorithmic information theory (complexity, a priori probability). See [37] for a concise introduction, and [22, 44] for more details.

Still this simple model looks unacceptable if, for example, the sequence x consists of n zeros, or, more generally, if the frequency of ones in x deviates significantly from 1/2, or if zeros and ones alternate.
This feeling was one of the motivations for the development of algorithmic randomness notions: why do some bit sequences of length n look plausible as outcomes of n fair coin tosses while others do not, although all n-bit sequences have the same probability 2^{−n} according to the model? This question does not have a clear answer in classical probability theory, but the algorithmic approach to randomness says that plausible strings should be incompressible: the complexity of such a string (the minimal length of a program producing it) should be close to its length. This answer works for the uniform distribution on n-bit strings; for arbitrary P it should be modified. It turns out that for arbitrary P we should compare the complexity of x not with its length but with the value −log P(x) (all logarithms are binary); if P is the uniform distribution on n-bit strings, the value of −log P(x) is n for all n-bit strings x. Namely, we consider the difference between −log P(x) and the complexity of x as the randomness deficiency of x with respect to P. We discuss the exact definition in the next section, but let us note here that this approach looks natural: different data strings require different models.

Disclaimer. The scheme above is oversimplified in many aspects. First, it rarely happens that we have no a priori information about the experiment that produced the data. Second, in many cases the experiment can be repeated (the same experimental device can be used again, or a similar device can be constructed). Also, we often deal with a data stream: we are more interested, say, in a good prediction of oil prices for the next month than in the construction of a model that fits the past prices well. All these aspects are ignored in our simplistic model; still it may serve as an example for more complicated cases.
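Kolmogorov complexity is not computable, but any real compressor gives an upper bound on it, and hence a lower bound on the randomness deficiency. As an illustration (not part of the original argument), the following Python sketch uses zlib as such a proxy for the uniform model; the function name and the test strings are our own invention.

```python
import os
import zlib

def deficiency_proxy(x: bytes) -> int:
    """Crude proxy for d(x | P) = n - K(x | n) under the uniform model on
    n-bit strings: zlib's output length is an upper bound on complexity
    (up to the compressor's constant overhead), so this proxy is a lower
    bound on the true deficiency."""
    n_bits = 8 * len(x)
    return n_bits - 8 * len(zlib.compress(x, 9))

fair_coins = os.urandom(1024)             # plausible outcome of 8192 tosses
alternating = bytes([0b01010101]) * 1024  # "0101...": implausible outcome

# A typical random string has proxy deficiency near zero (it may even be
# slightly negative because of zlib's header), while the regular string
# "0101..." has deficiency in the thousands of bits.
print(deficiency_proxy(fair_coins), deficiency_proxy(alternating))
```

Note the asymmetry: such a proxy can certify that the deficiency is large (when the string compresses well), but it can never certify randomness, in line with the non-computability of K.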
One should also stress that algorithmic statistics is more theoretical than practical: one of the reasons is that complexity is a non-computable function and is defined only asymptotically, up to a bounded additive term. Still the notions and results of this theory can be useful not only as philosophical foundations of statistics but also as guidelines when comparing statistical models in practice (see, for example, [32]). A more practical approach to the same question is provided by machine learning, which deals with the same problem (finding a good model for some data set) in the "real world". Unfortunately, there is currently a big gap between algorithmic statistics and machine learning: the former provides nice results about mathematical models that are quite far from practice (see the discussion about "standard models" below), while machine learning is a tool that sometimes works well without any theoretical reasons. There are some attempts to close this gap (by considering models from some class, or resource-bounded versions of the notions), but much more remains to be done.

A historical remark. The principles of algorithmic statistics are often traced back to Occam's razor principle, often stated as "Don't multiply postulations beyond necessity" or in a similar way. Poincaré writes in his Science and Method (Chapter 1, The choice of facts) that "this economy of thought, this economy of effort, which is, according to Mach, the constant tendency of science, is at the same time a source of beauty and a practical advantage". Still the mathematical analysis of these ideas became possible only after a definition of algorithmic complexity was given in the 1960s (by Solomonoff, Kolmogorov and then Chaitin): after that the connection between randomness and incompressibility (high complexity) became clear.
The formal definition of (α, β)-stochasticity (see the next section) was given by Kolmogorov (the authors learned it from his talk given in 1981 [17], but most probably it was formulated earlier in the 1970s; the definition appeared in print in [35]). For the other related approaches (the notions of logical depth and sophistication, the minimal description length principle) see the discussion in the corresponding sections (see also [22, Chapter 5]).

2 (α, β)-stochasticity

2.1 Prefix complexity, a priori probability and randomness deficiency

Preparing for the precise definition of (α, β)-stochasticity, we need to fix the version of complexity used in this definition. There are several versions (plain and prefix complexities, different types of conditions), see [44, Chapter 6]. For most of the results the choice between these versions is not important, since the difference between them is small (at most O(log n) for strings of length n), and we usually allow errors of logarithmic size in the statements.

We will use the notion of conditional prefix complexity, usually denoted by K(x | c). Here x and c are finite objects; we measure the complexity of x when c is given. This complexity is defined as the length of the minimal prefix-free program that, given c, computes x.³ The advantage of this definition is that it has an equivalent formulation in terms of a priori probability [44, Chapter 4]: if m(x | c) is the conditional a priori probability, i.e., the maximal lower semicomputable function of two arguments x and c such that Σ_x m(x | c) ≤ 1 for every c, then K(x | c) = −log m(x | c) + O(1).

³ We do not go into details here, but let us mention one common misunderstanding: the set of programs should be prefix-free for each c, but these sets may differ for different c, and their union is not required to be prefix-free.
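The normalization Σ_x m(x | c) ≤ 1, like the inequality K(x | P) ≤ −log P(x) below, ultimately rests on the Kraft inequality for prefix-free codes: if no program is a proper prefix of another, the total weight Σ 2^{−length} is at most 1. A minimal sanity check (with an invented four-word code) can be carried out mechanically:

```python
def is_prefix_free(codes):
    """True if no codeword is a proper prefix of another codeword."""
    return not any(a != b and b.startswith(a) for a in codes for b in codes)

codes = ["0", "10", "110", "111"]   # hypothetical prefix-free "programs"
assert is_prefix_free(codes)

# Kraft inequality: for a prefix-free set, sum of 2^{-length} is <= 1.
kraft_sum = sum(2.0 ** (-len(c)) for c in codes)
assert kraft_sum <= 1.0
print(kraft_sum)
```

For this particular (complete) code the sum equals 1; dropping a codeword makes it strictly smaller, and adding any further codeword would destroy prefix-freeness.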
In particular, if a probability distribution P with finite support and rational values (we consider only distributions of this type) is considered as a condition, we may compare m with the function (x, P) ↦ P(x) and conclude that m(x | P) ≥ P(x) up to an O(1)-factor, so K(x | P) ≤ −log P(x). So if we define the randomness deficiency as

  d(x | P) = −log P(x) − K(x | P),

we get a non-negative (up to an O(1) additive term) function. One may also explain in a different way why K(x | P) ≤ −log P(x): this inequality is a reformulation of a standard result of information theory (the Shannon–Fano code, Kraft's inequality).

Why do we define the deficiency in this way? The following proposition provides some additional motivation.

Proposition 1. The function d(x | P) is (up to an O(1) additive term) the maximal lower semicomputable function of two arguments x and P such that

  Σ_x 2^{d(x|P)} · P(x) ≤ 1   (∗)

for every P. Here x is a binary string, and P is a probability distribution on binary strings with finite support and rational values.

By lower semicomputable functions we mean functions that can be approximated from below by some algorithm (given x and P, the algorithm produces an increasing sequence of rational numbers that converges to d(x | P); no bounds on the convergence speed are required). For a given P, the function x ↦ 2^{d(x|P)} can be considered as a random variable on the probability space with distribution P. The requirement (∗) says that its expectation is at most 1. In this way we guarantee (by Markov's inequality) that only a P-small fraction of strings have large deficiency: the P-probability of the event d(x | P) > c is at most 2^{−c}. It turns out that there exists a maximal function d satisfying (∗) up to an O(1) additive term, and our formula gives an expression for this function in terms of prefix complexity.

Proof.
The proof uses standard arguments from Kolmogorov complexity theory. The function K(x | P) is upper semicomputable, so d(x | P) is lower semicomputable. We can also note that

  Σ_x 2^{d(x|P)} · P(x) = Σ_x (m(x | P)/P(x)) · P(x) = Σ_x m(x | P) ≤ 1,

so the deficiency function satisfies (∗).

To prove the maximality, consider an arbitrary function d′(x | P) that is lower semicomputable and satisfies (∗). Consider the function m′(x | P) = 2^{d′(x|P)} · P(x) (this function equals 0 if x is not in the support of P). Then m′ is lower semicomputable and Σ_x m′(x | P) ≤ 1 for every P, so m′(x | P) ≤ m(x | P) up to an O(1)-factor; this implies that d′(x | P) ≤ d(x | P) + O(1).

For the case where P is the uniform distribution on n-bit strings, using P as a condition is equivalent to using n as the condition, so d(x | P) = n − K(x | n) in this case, and small deficiency means that the complexity K(x | n) is close to the length n, i.e., that x is incompressible.⁴

2.2 Definition of stochasticity

Definition 1. A string x is called (α, β)-stochastic if there exists some probability distribution P (with rational values and finite support) such that K(P) ≤ α and d(x | P) ≤ β.

By definition, every (α, β)-stochastic string is (α′, β′)-stochastic for α′ ≥ α, β′ ≥ β. Sometimes we say informally that a string is "stochastic", meaning that it is (α, β)-stochastic for some reasonably small thresholds α and β (for example, one can consider α, β = O(log n) for n-bit strings). Let us start with some simple remarks.

• Every simple string is stochastic. Indeed, if P is concentrated on x (singleton support), then K(P) ≤ K(x) and d(x | P) = 0 (in both cases with O(1)-precision), so x is always (K(x) + O(1), O(1))-stochastic.
• On the other end of the spectrum: if P is the uniform distribution on n-bit strings, then K(P) = O(log n), and most strings of length n have d(x | P) = O(1), so most strings of length n are (O(log n), O(1))-stochastic. The same distribution also witnesses that every n-bit string is (O(log n), n + O(1))-stochastic.

• It is easy to construct stochastic strings between these two extreme cases. Let x be an incompressible string of length n. Consider the string x0ⁿ (the first half is x, the second half is the zero string). It is (O(log n), O(1))-stochastic: let P be the uniform distribution on all strings of length 2n whose second half contains only zeros.

• For every distribution P (with finite support and rational values, as usual), random sampling according to P gives us a (K(P), c)-stochastic string with probability at least 1 − 2^{−c}. Indeed, the probability of getting a string with deficiency greater than c is at most 2^{−c} (Markov's inequality, see above).

After these observations one may ask whether non-stochastic strings exist at all, and how they can be constructed. A non-stochastic string should have non-negligible complexity (our first observation), but a standard way to obtain strings of high complexity, by coin tossing or another random experiment, can give only stochastic strings (our last observation).

⁴ Initially Kolmogorov suggested considering n − C(x) as the "randomness deficiency" in this case, where C stands for plain (not prefix) complexity. One may also consider n − C(x | n). But all three deficiency functions mentioned are close to each other for strings x of length n: one can show that the difference between them is bounded by O(log d), where d is any of these three functions. The proof works by comparing the expectation-bounded and probability-bounded characterizations, as explained in [9].
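The sampling observation above uses only condition (∗) and Markov's inequality, so it can be verified exhaustively at a toy scale. In the sketch below the semimeasure m is invented for the demonstration; the bound holds for any m of total mass at most 1.

```python
import math

# Check: if sum_x 2^{d(x|P)} P(x) <= 1 (condition (*)), then the
# P-probability of the event d(x|P) >= c is at most 2^{-c}.
N = 1024
P = [1.0 / N] * N                               # the uniform "model"
Z = sum(1.0 / (i + 1) for i in range(N))
m = [(1.0 / (i + 1)) / Z for i in range(N)]     # toy semimeasure, mass 1

# Deficiency-style function built from m, as in Proposition 1's proof.
d = [math.log2(m[i] / P[i]) for i in range(N)]

# Condition (*): the P-expectation of 2^d equals the total mass of m.
assert sum(2.0 ** d[i] * P[i] for i in range(N)) <= 1.0 + 1e-9

# Markov's inequality, exactly as stated in the text.
for c in range(1, 10):
    assert sum(P[i] for i in range(N) if d[i] >= c) <= 2.0 ** (-c)
print("Markov bound holds for c = 1..9")
```

The choice of m here (weights proportional to 1/(i+1)) is arbitrary; any lower bound on the a priori probability would serve equally well.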
We will see that non-stochastic strings do exist in the mathematical sense; however, the question of whether they appear in the "real world" is philosophical. We will discuss both questions soon, but let us start with some mathematical results. First of all, let us note that with logarithmic precision we may restrict ourselves to uniform distributions on finite sets.

Proposition 2. Let x be an (α, β)-stochastic string of length n. Then there exists a finite set A containing x such that K(A) ≤ α + O(log n) and d(x | U_A) ≤ β + O(log n), where U_A is the uniform distribution on A.

Since K(A) = K(U_A) (with O(1)-precision, as usual), this proposition means that we may consider only uniform distributions in the definition of stochasticity and get an equivalent (up to a logarithmic change in the parameters) definition. According to this modified definition, a string x is (α, β)-stochastic if there exists a finite set A such that K(A) ≤ α and d(x | A) ≤ β, where d(x | A) is now defined as log #A − K(x | A). Kolmogorov originally proposed the definition in this form (but used plain complexity).

Proof. Let P be the (finite) distribution that exists due to the definition of (α, β)-stochasticity of x. We may assume without loss of generality that β ≤ n (as we have seen, all strings of length n are (O(log n), n + O(1))-stochastic, so for β > n the statement is trivial). Consider the set A formed by all strings that have sufficiently large P-probability. Namely, let us choose the minimal k such that 2^{−k} ≤ P(x), and consider the set A of all strings y such that P(y) ≥ 2^{−k}. By construction, A contains x. The size of A is at most 2^k, and −log P(x) = k with O(1)-precision. According to our assumption, d(x | P) = k − K(x | P) ≤ n, so k = d(x | P) + K(x | P) ≤ O(n).
Then K(x | A) ≥ K(x | P, k) ≥ K(x | P) − O(log n), since A is determined by P and k, and the additional information in k is O(log k) = O(log n) because k = O(n) by our assumption. So the deficiency may increase only by O(log n) when we replace P by U_A, and K(A) ≤ K(P, k) ≤ K(P) + O(log n) for the same reasons.

Remark 1. A similar argument can be applied if P is a computable distribution (maybe with infinite support) computed by some program p, and we require K(p) ≤ α and −log P(x) − K(x | p) ≤ β. So in this way we also get the same notion (with logarithmic precision). It is important, however, that the program p computes the distribution P (given some point x and some precision ε > 0, it computes the probability of x with error at most ε). It is not enough for P to be the output distribution of a randomized algorithm p (in this case P is called the semimeasure lower semicomputed by p; note that the sum of the probabilities may be strictly less than 1, since the computation may diverge with positive probability). Similarly, it is very important in the version with finite sets A (and uniform distributions on them) that the set A is considered as a finite object: A is simple if there is a short program that prints the list of all elements of A. If we allowed the set A to be presented by an algorithm that enumerates A (but never says explicitly that no more elements will appear), then the situation would change drastically: for every string x of complexity k, the finite set S_k of strings that have complexity at most k would be a good explanation for x, so all objects would become stochastic.

2.3 Stochasticity conservation

We have defined stochasticity for binary strings. However, the same definition can be used for arbitrary finite (constructive) objects: pairs of strings, tuples of strings, finite sets of strings, graphs, etc.
Indeed, complexity can be defined for all these objects as the complexity of their encodings; note that the difference in complexities for different encodings is at most O(1). The same can be done for finite sets of these objects (or probability distributions), so the definition of (α, β)-stochasticity makes sense.

One can also note that a computable bijection preserves stochasticity (up to a constant that depends on the bijection, but not on the object). In fact, a stronger statement is true: every total computable mapping preserves stochasticity. For example, consider a stochastic pair of strings (x, y). Does it imply that x (or y) is stochastic? It is indeed the case: if P is a distribution on pairs that is a reasonable model for (x, y), then its projection (the marginal distribution on the first components) should be a reasonable model for x. In fact, the projection can be replaced by any total computable mapping.

Proposition 3. Let F be a total computable mapping whose arguments and values are strings. If x is (α, β)-stochastic, then F(x) is (α + O(1), β + O(1))-stochastic. Here the constant in O(1) depends on F but not on x, α, β.

Proof. Let P be the distribution such that K(P) ≤ α and d(x | P) ≤ β; it exists according to the definition of stochasticity. Let Q = F(P) be the image distribution. In other words, if ξ is a random variable with distribution P, then F(ξ) has distribution Q. It is easy to see that K(Q) ≤ K(P) + O(1), where the constant depends only on F. Indeed, Q is determined by P and F in a computable way. It remains to show that d(F(x) | Q) ≤ d(x | P) + O(1). The easiest way to show this is to recall the characterization of the deficiency as the maximal lower semicomputable function such that

  Σ_u 2^{d(u|S)} · S(u) ≤ 1

for every distribution S.
We may consider another function d′ defined as

  d′(u | S) = d(F(u) | F(S)).

It is easy to see that

  Σ_u 2^{d′(u|S)} · S(u) = Σ_u 2^{d(F(u)|F(S))} · S(u) = Σ_v 2^{d(v|F(S))} · [F(S)](v) ≤ 1

(in the second equality we group all the values of u with the same v = F(u)). Therefore the maximality of d guarantees that d′(u | S) ≤ d(u | S) + O(1), and we get the required inequality.

This proof can also be rephrased using the definition of stochasticity with a priori probability. We need to show that for y = F(x) and Q = F(P) we have

  m(y | Q)/Q(y) ≤ O(1) · m(x | P)/P(x),

or

  m(F(x) | F(P)) · P(x)/Q(F(x)) ≤ O(m(x | P)).

It remains to note that the left-hand side is a lower semicomputable function of x and P whose sum over all x (for every P) is at most 1. Indeed, if we group all terms with the same F(x), we get the sum Σ_y m(y | F(P)) ≤ 1, since the sum of P(x) over all x with F(x) = y equals Q(y).

Remark 2. In this proof it is important that we use the definition with distributions. If we replace it with the definition with finite sets, the result remains true with logarithmic precision, but the argument becomes more complicated, since the image of a uniform distribution may not be a uniform distribution. So if a set A is a good model for x, we should not use F(A) as a model for F(x). Instead, we should look at the maximal k such that 2^k ≤ #F^{−1}(y), and consider the set of all y′ that have at least 2^k preimages in A.

Remark 3. It is important in Proposition 3 that F is a total function. If x is some non-stochastic object and x∗ is the shortest program for x, then x∗ is incompressible and therefore stochastic. Still the interpreter (decompressor) maps x∗ to x. We discuss the case of non-total F below, see Section 5.4.

Remark 4.
A similar argument shows that d(F(x) | F(P)) ≤ d(x | P) + K(F) + O(1) (for total F), so both O(1) bounds in Proposition 3 may be replaced by K(F) + O(1), where the O(1) constant does not depend on F anymore.

2.4 Non-stochastic objects

Note that up to now we have not shown that non-stochastic objects exist at all. It is easy to show that they exist for rather large values of α and β (growing linearly with n).

Proposition 4 ([35]). For some c and all n:

(1) if 2α + β < n − c log n, then there exist n-bit strings that are not (α, β)-stochastic;

(2) however, if α + β > n + c log n, then every n-bit string is (α, β)-stochastic.

Note that the term c log n allows us to use the definition with finite sets (i.e., uniform distributions on finite sets) instead of arbitrary finite distributions, since both versions are equivalent with O(log n)-precision.

Proof. The second part is obvious (and is added just for comparison): if α + β = n, then all n-bit strings can be split into 2^α groups of size 2^β each. The complexity of each group is α + O(log n), and the randomness deficiency of every string with respect to the corresponding group is at most β + O(1). This is slightly bigger than the bounds we need, but we have a reserve of c log n, and α and β can be decreased, say, by (c/2) log n before using this argument.

The first part: Consider all finite sets A of strings that have complexity at most α and size at most 2^{α+β}. Since α + (α + β) < n, they cannot cover all n-bit strings. Consider then the first (say, in lexicographic order) n-bit string u not covered by any of these sets. What is the complexity of u? To specify u, it is enough to give n, α, β and the program of size at most α (from the definition of Kolmogorov complexity) that has the maximal running time among programs of that size.
Then we can wait until this program terminates, look at the outputs of all programs of size at most α after the same number of steps, select those outputs that are sets of at most 2^{α+β} strings, and take the first u not covered by these sets. So the complexity of u is at most α + O(log n) (the last term is needed to specify n, α, β). The same is true for the conditional complexity with an arbitrary condition, since it is bounded by the unconditional complexity. Now, every set A of complexity at most α that contains u must have size greater than 2^{α+β} (otherwise u would be covered), so its deficiency d(u | A) = log #A − K(u | A) is at least (α + β) − (α + O(log n)) = β − O(log n). We see that u is not (α, β − O(log n))-stochastic. Again, the O(log n) term can be compensated by an O(log n) change in β (we have the c log n reserve for that).

Remark 5. There is a gap between the lower and upper bounds provided by Proposition 4. As we will see later, the upper bound (2) is tight with O(log n)-precision, but we need a more advanced technique (properties of two-part descriptions, Section 3) to prove this.

Proposition 4 shows that non-stochastic objects exist for rather large values of α and β (proportional to n). This, of course, is a mathematical existence result; it does not say anything about the possibility of observing non-stochastic objects in the "real world". As we have discussed, random sampling (from a simple distribution) may produce a non-stochastic object only with negligible probability; total algorithmic transformations (defined by programs of small complexity) also cannot create non-stochastic objects from stochastic ones. What about non-total algorithmic transformations? As we have discussed in Remark 3, a non-total computable transformation may transform a stochastic object into a non-stochastic one, but does it happen with non-negligible probability?

Consider a randomized algorithm that outputs some string.
It can be considered as a deterministic algorithm applied to a random bit sequence (generated by the internal coin of the algorithm). This deterministic algorithm may be non-total, so we cannot apply the previous result. Still, as the following result shows, randomized algorithms also generate non-stochastic objects only with small probability.

To make this statement formal, we consider the sum of m(x) over all non-stochastic x of length n. Since the a priori probability m(x) is an upper bound for the output distribution of any randomized algorithm, this implies the same bound (up to an O(1)-factor) for every randomized algorithm. The following proposition gives an upper bound for this sum.

Proposition 5 (see [30], Section 10).

  Σ { m(x) : x is an n-bit string that is not (α, α)-stochastic } ≤ 2^{−α+O(log n)}

for every n and α.

Proof. Consider the sum of m(x) over all strings of length n. This sum is some real number ω ≤ 1. Let ω̃ be the number represented by the first α bits of the binary representation of ω, minus 2^{−α}. We may assume that α ≤ O(n); otherwise all strings of length n are (α, α)-stochastic. Now we construct a probability distribution as follows. All terms in the sum for ω are lower semicomputable, so we can enumerate increasing lower bounds for them. When the sum of these lower bounds exceeds ω̃, we stop and get some measure P with finite support and rational values. Note that we have a measure, not a distribution, since the sum of P(x) over all x is less than 1 (it does not exceed ω). So we normalize P (by some factor) to get a distribution P̃ proportional to P. The complexity of P̃ is bounded by α + O(log n) (since P̃ is determined by ω̃ and n). Note that the difference between P (without the normalization factor) and the a priori probability m (the sum of the differences over all strings of length n) is bounded by O(2^{−α}).
It remains to show that for m-most strings the distribution P̃ is a good model. Let us prove that the sum of the a priori probabilities of all n-bit strings x with d(x|P̃) > α + c·log n is bounded by O(2^(−α)), if c is large enough. Indeed, for those strings we have
−log P̃(x) − K(x|P̃) > α + c·log n.
The complexity of P̃ is bounded by α + O(log n), and therefore K(x) exceeds K(x|P̃) by at most α + O(log n), so −log P̃(x) − K(x) > 1 for those strings, if c is large enough (it should exceed the constants hidden in the O(log n) notation). The difference 1 is enough for the estimate below, but we could achieve an arbitrary constant or even a logarithmic difference by choosing a larger value of c.

Prefix complexity can be defined in terms of a priori probability, so we get log(m(x)/P̃(x)) > 1 for all x whose deficiency with respect to P̃ exceeds α + c·log n. The same inequality is true with P instead of P̃, since P is smaller. So for all those x we have P(x) < m(x)/2, or m(x) − P(x) > m(x)/2. Recalling that the sum of m(x) − P(x) over all x of length n does not exceed O(2^(−α)) by the construction of ω̃, we conclude that the sum of m(x) over all strings whose randomness deficiency (with respect to P̃) exceeds α + c·log n is at most O(2^(−α)).

So we have shown that the sum of m(x) over all x of length n that are not (α + O(log n), α + O(log n))-stochastic does not exceed O(2^(−α)). This differs from our claim only by an O(log n) change in α.

Bruno Bauwens noted that this argument can be modified to obtain a stronger result where (α, α)-stochasticity is replaced by (α + O(log n), O(log n))-stochasticity. Instead of one measure P, one should consider a family of measures.
Let us approximate ω and look when the approximations cross the thresholds corresponding to the first k bits of the binary expansion of ω. In this way we get P = P_1 + P_2 + … + P_α, where P_i has total weight at most 2^(−i) and complexity at most i + O(log n). Let us show that all strings x where P(x) is close to m(x) (say, P(x) > m(x)/2) are (α + O(log n), O(log n))-stochastic; namely, one of the measures P_i multiplied by 2^i is a good explanation for them. Indeed, for such x and some i the value P_i(x) coincides with m(x) up to a polynomial (in n) factor, since the sum of all P_i is at least m(x)/2. On the other hand, m(x | 2^i P_i) ≤ 2^i m(x) ≈ 2^i P_i(x), since the complexity of 2^i P_i is at most i + O(log n). Therefore the ratio m(x | 2^i P_i)/(2^i P_i(x)) is polynomially bounded, and the model 2^i P_i has deficiency O(log n). This better bound also follows from Levin’s explanation, see below.

This result shows that non-stochastic objects rarely appear as outputs of randomized algorithms. There is an explanation of this phenomenon (going back to Levin): non-stochastic objects provide a lot of information about the halting problem, and the probability of the appearance of an object that has a lot of information about some fixed sequence α is small (for any fixed α). We discuss this argument below, see Section 4.6.

It is natural to ask the following general question. For a given string x, we may consider the set of all pairs (α, β) such that x is (α, β)-stochastic. By definition, this set is upwards closed: a point in this set remains in it if we increase α or β, so there is some boundary curve that describes the trade-off between α and β. Which curves can appear in this way? To get an answer (to characterize all these curves with O(log n) precision), we need some other technique, explained in the next section.
3 Two-part descriptions

Now we switch to another measure of the quality of a statistical model. It is important both for philosophical and for technical reasons. The philosophical reason is that it corresponds to the so-called “minimum description length principle”. The technical reason is that it is easier to deal with; in particular, we will use it to answer the question asked at the end of the previous section.

3.1 Optimality deficiency

Consider again some statistical model. Let P be a probability distribution (with finite support and rational values) on strings. Then we have
K(x) ≤ K(P) + K(x|P) ≤ K(P) + (−log P(x))
for an arbitrary string x (with O(1) precision). Here we use that (with O(1) precision):
• K(x|P) ≤ −log P(x), as we have mentioned;
• the complexity of a pair is bounded by the sum of complexities: K(u, v) ≤ K(u) + K(v);
• K(v) ≤ K(u, v) (in our case, K(x) ≤ K(x, P)).

If P is a uniform distribution on some finite set A, this inequality can be explained as follows. We can specify x in two steps:
• first, we specify A;
• then we specify the ordinal number of x in A (in some natural ordering, say, the lexicographic one).
In this way we get
K(x) ≤ K(A) + log #A
for every element x of an arbitrary finite set A. This inequality holds with O(1) precision. If we replace prefix complexity by the plain version, we can say that C(x) ≤ C(A) + log #A with precision O(log n) for every string x of length at most n: we may assume without loss of generality that both terms on the right-hand side are at most n, otherwise the inequality is trivial.

The “quality” of a statistical model P for a string x can be measured by the difference between the two sides of this inequality: for a good model the “two-part description” should be almost minimal. We come to the following definition:

Definition 2.
The optimality deficiency of a distribution P, considered as a model for a string x, is the difference
δ(x, P) = (K(P) + (−log P(x))) − K(x).
As we have seen, δ(x, P) ≥ 0 with O(1) precision.

If P is a uniform distribution on a set A, the optimality deficiency δ(x, P) will also be denoted by δ(x, A), and
δ(x, A) = (K(A) + log #A) − K(x).

The following proposition shows that we may restrict our attention to finite sets as models (with O(log n) precision):

Proposition 6. Let P be a distribution considered as a model for some string x of length n. Then there exists a finite set A containing x such that
K(A) ≤ K(P) + O(log n);  log #A ≤ −log P(x) + O(1).  (∗)

This proposition will be used in many arguments, since it is often easier to deal with sets as statistical models (instead of distributions). Note that the inequalities (∗) evidently imply that δ(x, A) ≤ δ(x, P) + O(log n), so an arbitrary distribution P may be replaced by a uniform one (U_A) with only a logarithmic change in the optimality deficiency.

Proof. We use the same construction as in Proposition 2. Let 2^(−k) be the maximal power of 2 such that 2^(−k) ≤ P(x), and let A = { y | P(y) ≥ 2^(−k) }. Then k = −log P(x) + O(1). We may assume that k = O(n): if k is much bigger than n, then δ(x, P) is also bigger than n (since the complexity of x is bounded by n + O(log n)), and in this case the statement is trivial (let A be the set of all n-bit strings). Now we see that A is determined by P and k, so K(A) ≤ K(P) + K(k) ≤ K(P) + O(log n). Note also that #A ≤ 2^k, so log #A ≤ −log P(x) + O(1).
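The construction in this proof is concrete enough to sketch in code. The following toy Python (our illustration, not from the survey; distributions are represented simply as dicts with float probabilities, and all complexity bookkeeping is ignored) extracts the set A from a distribution P and a string x:

```python
import math

def set_from_distribution(P, x):
    """Sketch of the Proposition 6 construction: given a finite distribution P
    (dict: string -> probability) and a string x with P[x] > 0, take the largest
    power of two 2**-k not exceeding P(x) and return A = { y : P(y) >= 2**-k }.
    Then #A <= 2**k, since the probabilities sum to at most 1."""
    k = math.ceil(-math.log2(P[x]))      # smallest k with 2**-k <= P(x)
    A = {y for y, p in P.items() if p >= 2.0 ** (-k)}
    return A, k

# Toy example: x = "10" has P(x) = 1/8, so k = 3 and A collects every string
# with probability at least 1/8 -- here all four strings, and #A <= 2**3.
P = {"00": 0.5, "01": 0.25, "10": 0.125, "11": 0.125}
A, k = set_from_distribution(P, "10")
print(A, k, len(A) <= 2 ** k)
```

The point of the construction is visible in the toy run: log #A stays within O(1) of −log P(x), while A is described by P plus the single number k.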
Let us note that in a more general setting [25], where we consider several strings as outcomes of a repeated experiment (with independent trials) and look for a model that explains all of them, a similar result is not true: not every probability distribution can be transformed into a uniform one.

3.2 Optimality and randomness deficiencies

Now we have two “quality measures” for a statistical model P: the randomness deficiency d(x|P) and the optimality deficiency δ(x, P). They are related:

Proposition 7. d(x|P) ≤ δ(x, P) with O(1) precision.

Proof. By definition,
d(x|P) = −log P(x) − K(x|P);
δ(x, P) = −log P(x) + K(P) − K(x).
It remains to note that K(x) ≤ K(x, P) ≤ K(P) + K(x|P) with O(1) precision.

Could δ(x, P) be significantly larger than d(x|P)? Look at the proof above: the second inequality, K(x, P) ≤ K(P) + K(x|P), is actually an equality with logarithmic precision. Indeed, the exact formula (the Levin–Gács formula for the complexity of a pair, with O(1) precision) is K(x, P) = K(P) + K(x | P, K(P)). Here the term K(P) in the condition changes the complexity by O(log K(P)), and we may ignore models P whose complexity is much greater than the complexity of x. On the other hand, in the first inequality the difference between K(x, P) and K(x) may be significant. This difference equals K(P|x) with logarithmic accuracy, and if it is large, then δ(x, P) is much bigger than d(x|P). The following example shows that this is possible; in this example we deal with sets as models.

Example 1. Consider an incompressible string x of length n, so K(x) = n (all equalities in this example hold with logarithmic precision). A good model for this string is the set A of all n-bit strings. For this model we have #A = 2^n, K(A) = 0, and δ(x, A) = n + 0 − n = 0. So d(x|A) = 0, too.
Now we change the model by excluding some other n-bit string. Consider an n-bit string y that is incompressible and independent of x: this means that K(x, y) = 2n. Let A′ be A \ {y}. The set A′ contains x (since x and y are independent, y differs from x). Its complexity is n (since it determines y). The optimality deficiency is then n + n − n = n, but the randomness deficiency is still small:
d(x|A′) = log #A′ − K(x|A′) = n − n = 0
(with logarithmic precision). To see why K(A′|x) = n, note that x and y are independent, and the set A′ contains the same information as the pair (n, y).

One of the main results of this section (Theorem 3) clarifies the situation: it implies that if the optimality deficiency of a model is significantly larger than its randomness deficiency, then this model can be improved, and another model with better parameters can be found. More specifically, the complexity of the new model is smaller than the complexity of the original one, while both the randomness deficiency and the optimality deficiency of the new model are not worse than the randomness deficiency of the original one. This is one of the main results of algorithmic statistics, but first let us explore systematically the properties of two-part descriptions.

3.3 Trade-off between complexity and size of a model

It is convenient to consider only models that are sets (= uniform distributions on sets). We will call them descriptions. Note that by Propositions 2 and 6 this restriction does not matter much, since we ignore logarithmic terms. For a given string x there are many different descriptions: we can have a simple large set containing x, and at the same time some more complicated, but smaller, one. In this section we study the trade-off between these two parameters (complexity and size).

Definition 3.
A finite set A is an (i ∗ j)-description⁵ of x if x ∈ A, the complexity K(A) is at most i, and log #A ≤ j. For a given x we consider the set P_x of all pairs (i, j) such that x has some (i ∗ j)-description; this set will be called the profile of x.

Informally speaking, an (i ∗ j)-description of x consists of two parts: first we spend i bits to specify some finite set A, and then j bits to specify x as an element of A.

What can be said about P_x for a string x of length n and complexity k = K(x)? By definition, P_x is closed upwards and contains the points (0, n) and (k, 0). Here we omit O(log n) terms; more precisely, we have an (O(log n) ∗ n)-description that consists of all strings of length n, and a ((k + O(1)) ∗ 0)-description {x}. Moreover, the following proposition shows that we can move information from the second part of a description into its first part (leaving the total length almost unchanged). In this way we make the set smaller (the price we pay is that its complexity increases).

Proposition 8 ([15, 13, 36]). Let x be a string and A a finite set that contains x. Let s be a non-negative integer such that s ≤ log #A. Then there exists a finite set A′ containing x such that #A′ ≤ #A/2^s and K(A′) ≤ K(A) + s + O(log s).

⁵ This notation may look strange; however, we speak so often about finite sets of complexity at most i and cardinality at most 2^j that we decided to introduce some short name and notation for them.

Proof. List all the elements of A in some (say, lexicographic) order. Then we split the list into 2^s parts (the first #A/2^s elements, the next #A/2^s elements, etc.; we omit the evident precautions for the case when #A is not a multiple of 2^s). Let A′ be the part that contains x. It has the required size. To specify A′, it is enough to specify A and the part number; the latter takes at most s bits.
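The splitting step of this proof is purely combinatorial and can be sketched in a few lines of toy Python (our illustration; the complexity accounting of the proposition is of course not modeled):

```python
def shrink_description(A, x, s):
    """Sketch of the Proposition 8 construction: sort A, cut the sorted list into
    2**s consecutive parts, and return the part containing x together with its
    part number. The part has about #A / 2**s elements, and specifying it only
    requires A plus an s-bit part number."""
    assert x in A and 2 ** s <= len(A)
    members = sorted(A)
    parts = 2 ** s
    size = -(-len(members) // parts)        # ceil(#A / 2**s)
    idx = members.index(x) // size          # the part number (fits in s bits)
    return set(members[idx * size : (idx + 1) * size]), idx

A = {f"{i:04b}" for i in range(16)}          # all 4-bit strings, #A = 16
A1, part = shrink_description(A, "0110", 2)  # 2**2 = 4 parts of 4 elements each
print(sorted(A1), part)                      # ['0100', '0101', '0110', '0111'] 1
```

The string "0110" is the 7th string in lexicographic order, so it lands in part number 1 (the second quarter of the list), which is 4 times smaller than A.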
(The logarithmic term is needed to make the encoding of the part number self-delimiting.)

This statement can be illustrated graphically. As we have said, the set P_x is closed upwards and contains, with each point (i, j), all points to the right (with bigger i) and above (with bigger j). It contains the points (0, n) and (K(x), 0); Proposition 8 says that we can also move down-right, adding (s, −s) (with logarithmic precision). We will see that movement in the opposite direction is not always possible. So, having two two-part descriptions with the same total length, we should prefer the one with the bigger set (since it can always be converted into the other, but not vice versa).

The boundary of P_x is some curve connecting the points (0, n) and (k, 0). This curve (introduced by Kolmogorov in the 1970s, see [16]) never gets into the triangle i + j < K(x) and always goes down (when moving from left to right) with slope −1 or steeper.

Figure 1: The set P_x and its boundary curve (horizontal axis: complexity; vertical axis: log-size).

This picture raises a natural question: which boundary curves are possible and which are not? Is it possible, for example, that the boundary goes along the dotted line in Figure 1? The answer is positive: take a random string of the desired complexity and add trailing zeros to achieve the desired length. Then the point (0, K(x)) (the left end of the dotted line) corresponds to the set A of all strings of the same length having the same trailing zeros. We know that the boundary curve cannot go down more slowly than with slope −1 and that it lies above the line i + j = K(x); therefore it follows the dotted line (with logarithmic precision).

A more difficult question: is it possible that the boundary curve starts from (0, n), goes with slope −1 to the very end, and then drops rapidly to (K(x), 0) (Figure 2, the solid line)?
Such a string x, informally speaking, would have essentially only two types of statistical explanations: the set of all strings of length n (and its parts obtained by Proposition 8), and the exact description, the singleton {x}.

Figure 2: Two opposite possibilities for the boundary curve (horizontal axis: complexity; vertical axis: log-size).

It turns out that not only these two opposite cases are possible, but also all intermediate curves (provided that they decrease with slope −1 or faster and are simple enough), at least with logarithmic precision. More precisely, the following statement holds:

Theorem 1 ([45]). Let k ≤ n be two integers, and let t_0 > t_1 > … > t_k be a strictly decreasing sequence of integers such that t_0 ≤ n and t_k = 0; let m be the complexity of this sequence. Then there exists a string x of complexity k + O(log n) + O(m) and length n + O(log n) + O(m) for which the boundary curve of P_x coincides with the broken line (0, t_0)–(1, t_1)–…–(k, t_k) with O(log n) + O(m) precision: the distance between the set P_x and the set
T = { (i, j) | (i < k) ⇒ (j > t_i) }
is bounded by O(log n) + O(m).

(We say that the distance between two subsets P, Q ⊂ Z² is at most ε if P is contained in the ε-neighborhood of Q and vice versa.)

Proof. For every i in the range 0…k we list all the sets of complexity at most i and size at most 2^{t_i}. For a given i, the union of all these sets is denoted by S_i. It contains at most 2^{i+t_i} elements. (Here and later we omit constant factors and factors polynomial in n when estimating cardinalities, since they correspond to O(log n) additive terms for lengths and complexities.) Since the sequence t_i strictly decreases (this corresponds to slope −1 in the picture), the sums i + t_i do not increase; therefore each S_i has at most 2^{t_0} ≤ 2^n elements.
The union of all the S_i therefore also has at most 2^n elements (up to a polynomial factor, see above). Therefore, we can find a string of length n (actually n + O(log n)) that does not belong to any S_i. Let x be the first such string in some order (e.g., the lexicographic order). By construction, the set P_x lies above the curve determined by the t_i. So we need to estimate the complexity of x and prove that P_x follows the curve (i.e., that T is contained in a neighborhood of P_x).

Let us start with the upper bound for the complexity of x. The list of all objects of complexity at most k, together with the full table of their complexities, has complexity k + O(log k), since it is enough to know k and the number of terminating programs of length at most k. In addition to this list, to specify x we need to know n and the sequence t_0, …, t_k, whose complexity is m. The lower bound: the complexity of x cannot be less than k, since all the singletons of this complexity were excluded (via S_k).

It remains to show that for every i ≤ k we can put x into a set A of complexity i (or slightly bigger) and size 2^{t_i} (or slightly bigger). For this we enumerate a sequence of sets of the correct size and show that one of these sets has the required properties; if this sequence of sets is not very long, the complexity of its elements is bounded. Here are the details.

We start by taking the first 2^{t_i} strings of length n as our first set A. Then we start enumerating all finite sets of complexity at most j and of size at most 2^{t_j}, for all j = 0, …, k, and get an enumeration of all the sets S_j. Recall that all elements of all the S_j should be deleted (and the minimal remaining element should eventually be x). So, when a new set of complexity at most j and size at most 2^{t_j} appears, all its elements are included in S_j and deleted.
Until all elements of A are deleted, we have nothing to worry about, since A covers the minimal remaining element. If (and when) all elements of A are deleted, we replace A by a new set that consists of the first 2^{t_i} not yet deleted strings of length n. Then we again wait until all the elements of this new A are deleted; if (and when) this happens, we take the first 2^{t_i} undeleted elements as the new A, and so on. The construction guarantees the correct size of the sets and that one of them covers x (the minimal non-deleted element). It remains to estimate the complexity of the sets we construct in this way.

First, to start the process that generates these sets, we need to know the length n (actually something logarithmically close to n) and the sequence t_0, …, t_k; in total we need m + O(log n) bits. To specify each version of A, we need to add its version number. So we need to show that the number of different A’s that appear in the process is at most 2^i, or only slightly bigger.

A new set A is created when all the elements of the old A have been deleted. These changes can be split into two groups. Sometimes a new set of complexity j with j ≤ i appears. This can happen only O(2^i) times, since there are at most O(2^i) sets of complexity at most i. So we may consider the other changes (excluding the first change after each such new large set was added). For those changes, all the elements of A are gone due to elements of S_j with j > i. We have at most 2^{j+t_j} elements in S_j. Since t_j + j ≤ t_i + i, the total number of deleted elements only slightly exceeds 2^{t_i+i}, and each set A consists of 2^{t_i} elements, so we get about 2^i changes of A.

Remark 6. It is easy to modify the proof to get a string x of length exactly n.
Indeed, we may consider slightly smaller bad sets: decreasing the logarithms of their sizes by O(log n), we can guarantee that the total number of elements in all bad sets is less than 2^n. Then there exists a string of length n that does not belong to the bad sets. In this way the distance between T and P_x may increase by O(log n), which is acceptable.

Theorem 1 shows that the complexity of x does not describe the properties of x fully: different strings of the same complexity can have different boundary curves of P_x. This curve can be considered as an “infinite-dimensional” characterization of x.

Strings x with minimal possible P_x (Figure 2, the upper curve) may be called antistochastic. They have quite unexpected properties. For example, if we replace some bits of an antistochastic string x by stars (or some other symbols indicating erasures), leaving only K(x) non-erased bits, then the string x can be reconstructed from the resulting string x′ with logarithmic advice, i.e., K(x|x′) = O(log n). This and other properties of antistochastic strings were discovered in [24].

3.4 Optimality and randomness deficiency

In this section we establish the connection between the optimality and randomness deficiencies. As we have seen, the optimality deficiency can be bigger than the randomness deficiency (for the same description), and the difference is
δ(x, A) − d(x|A) = K(A) + K(x|A) − K(x).
The Levin–Gács formula for the complexity of a pair (K(u, v) = K(u) + K(v|u) with logarithmic precision; for O(1) precision one needs to add K(u) in the condition, but we ignore logarithmic-size terms anyway) shows that the difference in question can be rewritten as
δ(x, A) − d(x|A) = K(A, x) − K(x) = K(A|x).
So if the difference between the deficiencies for some (i ∗ j)-description A of x is big, then K(A|x) is big.
All the (i ∗ j)-descriptions of x can be enumerated if x, i, and j are given. So a large value of K(A|x) for some (i ∗ j)-description A means that there are many (i ∗ j)-descriptions of x: otherwise A could be reconstructed from x by specifying i and j (which requires O(log n) bits) and the ordinal number of A in the enumeration. We will prove that if there are many (i ∗ j)-descriptions of some x, then there exists a description with better parameters. Now we explain this in more detail.

Let us start with the following remark. Consider all strings that have (i ∗ j)-descriptions, for some fixed i and j. They can be enumerated in the following way: we enumerate all finite sets of complexity at most i, select those sets that have size at most 2^j, and include all elements of these sets into the enumeration. In this construction:
• the complexity of the enumerating algorithm is logarithmic (it is enough to know i and j);
• we enumerate at most 2^{i+j} elements;
• the enumeration is divided into at most 2^i “portions” of size at most 2^j.
It is easy to see that any other enumeration process with these properties enumerates only objects that have (i ∗ j)-descriptions (again with logarithmic precision). Indeed, each portion is a finite set that can be specified by its ordinal number and the enumeration algorithm; the first part requires i + O(log i) bits, and the second is of logarithmic size according to our assumption.

Remark 7. The requirement about the portion size is redundant. Indeed, we can change the algorithm by splitting large portions into pieces of size 2^j (the last piece may be incomplete). This, of course, increases the number of portions, but if the total number of enumerated elements is at most 2^{i+j}, then this splitting adds at most 2^i pieces. This observation looks (and is) trivial, but it plays an important role in the proof of the following proposition.
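As a toy illustration of the splitting in Remark 7 (our code, not from the survey; portions are represented simply as lists of elements, with no enumeration machinery), re-cutting an enumeration into bounded portions looks like this:

```python
def split_portions(portions, j):
    """Sketch of Remark 7: re-cut an enumeration, given in portions of arbitrary
    size, into portions of size at most 2**j. Each original portion of size N
    becomes ceil(N / 2**j) pieces; if at most 2**(i+j) elements are enumerated
    in total, at most 2**i extra (incomplete) pieces appear overall."""
    pieces = []
    for portion in portions:
        for start in range(0, len(portion), 2 ** j):
            pieces.append(portion[start : start + 2 ** j])
    return pieces

portions = [list(range(10)), list(range(10, 13))]  # portions of sizes 10 and 3
pieces = split_portions(portions, 2)               # re-cut into pieces of size <= 4
print([len(p) for p in pieces])                    # [4, 4, 2, 3]
```

Note that the re-cutting preserves the enumeration order, so the pieces still form a valid enumeration of the same elements, only with a controlled portion size.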
Proposition 9. If a string x of length n has at least 2^k different (i ∗ j)-descriptions, then x has some (i ∗ (j − k))-description and even some ((i − k) ∗ j)-description.

Again we omit logarithmic terms: in fact one should write ((i + O(log n)) ∗ (j − k + O(log n))), etc. The word “even” in the statement refers to Proposition 8, which shows that the second claim is indeed stronger.

Proof. Consider the enumeration of all objects having (i ∗ j)-descriptions in 2^i portions of size 2^j (we ignore logarithmic additive terms and the respective polynomial factors), as explained above. After each portion (i.e., each new (i ∗ j)-description) appears, we count the number of descriptions of each enumerated object and select the objects that have at least 2^k descriptions. Consider a new enumeration process that enumerates only these “rich” objects (rich = having many descriptions). We have at most 2^{i+j−k} rich objects (since they appear in the list of size 2^{i+j} with multiplicity 2^k), enumerated in 2^i portions (a new portion of rich objects may appear only when a new portion appears in the original enumeration). So we apply the observation above to conclude that all rich objects have (i ∗ (j − k))-descriptions.

To get the second (stronger) statement, we need to decrease the number of portions (while not increasing the number of enumerated objects too much). This can be done using the following trick: when a new rich object (having 2^k descriptions) appears, we enumerate not only the rich objects but also the “half-rich” objects, i.e., objects that currently have at least 2^k/2 descriptions. In this way we enumerate more objects, but only twice as many. At the same time, after we have dumped all half-rich objects, we are sure that the next 2^k/2 new (i ∗ j)-descriptions will not create a new rich object, so the number of portions is divided by 2^k/2, as required.
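The counting of “rich” objects in the first part of this proof can be sketched as follows (our toy code; descriptions are given explicitly as finite sets, whereas the actual argument works with enumerations and complexity bounds):

```python
from collections import Counter

def rich_objects(descriptions, k):
    """Sketch of the counting step in Proposition 9: given a list of
    (i*j)-descriptions (finite sets), return the objects covered by at least
    2**k of them. If at most 2**(i+j) elements are enumerated in total, there
    can be at most 2**(i+j-k) such rich objects."""
    count = Counter()
    for A in descriptions:
        count.update(A)
    return {x for x, c in count.items() if c >= 2 ** k}

# Toy run: "a" and "b" each occur in at least 2**1 = 2 descriptions.
descs = [{"a", "b"}, {"a", "c"}, {"a", "b"}, {"b", "d"}]
print(sorted(rich_objects(descs, 1)))   # ['a', 'b']
```

The multiplicity bound is what makes the set of rich objects small: each rich object "uses up" 2^k entries of the enumeration, so there can be at most a 2^(−k) fraction as many rich objects as enumerated elements.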
Let us say more precisely how we deal with the logarithmic terms. We may assume that i, j = O(n), since otherwise the claim is trivial. Then we allow polynomial (in n) factors and O(log n) additive terms in all our considerations.

Remark 8. If we unfold this construction, we see that the new descriptions (of smaller complexity) are not selected from the original sequence of descriptions but constructed from scratch. In Section 6 we deal with a much more complicated case where we restrict ourselves to descriptions from some class (say, Hamming balls). There the proof given above does not work, since the description we construct is not a ball even if we start with ball descriptions. Still, some other (much more ingenious) argument can be used to prove a similar result in the restricted case.

Now we are ready to prove the promised results (see the discussion after Example 1).

Theorem 2. If a string x of length n is (α, β)-stochastic, then there exists some finite set B containing x such that K(B) ≤ α + O(log n) and δ(x, B) ≤ β + O(log n).

Proof. Since x is (α, β)-stochastic, there exists some finite set A such that K(A) ≤ α and d(x|A) ≤ β. Let i = K(A) and j = log #A, so A is an (i ∗ j)-description of x. We may assume without loss of generality that both α and β (and therefore i and j) are O(n); otherwise the statement is trivial. The value δ(x, A) may exceed d(x|A), as we discussed at the beginning of this section. So we assume that k = δ(x, A) − d(x|A) > 0; if not, we can let B = A. Then, as we have seen, K(A|x) ≥ k − O(log n), and there are at least 2^{k−O(log n)} different (i ∗ j)-descriptions of x. According to Proposition 9, there exists some finite set B that is an (i ∗ (j − k + O(log n)))-description of x. Its optimality deficiency δ(x, B) is (k − O(log n)) smaller than that of A and therefore O(log n)-close to d(x|A).
In this argument we used the simple part of Proposition 9. Using the stronger statement about the complexity decrease, we get the following result:

Theorem 3 ([45]). Let A be a finite set containing a string x of length n, and let k = δ(x, A) − d(x|A). Then there is a finite set B containing x such that K(B) ≤ K(A) − k + O(log n) and δ(x, B) ≤ d(x|A) + O(log n).

Proof. Indeed, if B is an ((i − k) ∗ j)-description of x (up to logarithmic terms, as usual), then its optimality deficiency is again (k − O(log n)) smaller than that of A and therefore O(log n)-close to d(x|A).

Note that the statement of the theorem implies that d(x|B) ≤ d(x|A) + O(log n).

Theorem 2 and Proposition 7 show that we can replace the randomness deficiency in the definition of (α, β)-stochastic strings by the optimality deficiency (with logarithmic precision). More specifically, for every string x of length n consider the sets
Q_x = { (α, β) | x is (α, β)-stochastic }
and
Q̃_x = { (α, β) | there exists A ∋ x with K(A) ≤ α and δ(x, A) ≤ β }.
Then these sets are at most O(log n) apart (each is contained in the O(log n)-neighborhood of the other one).

This remark, together with the existence of antistochastic strings of given complexity and length, allows us to improve the result about the existence of non-stochastic objects (Proposition 4).

Proposition 10 ([13, Theorem IV.2]). For some c and for all n: if α + β < n − c·log n, then there exist strings of length n that are not (α, β)-stochastic.

Figure 3: Non-stochastic strings revisited. The left gray area corresponds to descriptions A with K(A) ≤ α and δ(x, A) ≤ β.

Proof. Assume that integers n, α, β are given such that α + β < n − c·log n (where the constant c will be chosen later).
Let x be an antistochastic string of length n that has complexity α + d, where d is some positive number (see below for the choice of d). More precisely, for every given d there exists a string x whose complexity is α + d + O(log n), whose length is n + O(log n), and for which the set P_x is O(log n)-close to the upper gray area (Figure 3).

Assume that x is (α, β)-stochastic. Then (by Theorem 2) the string x has an (i ∗ j)-description with i ≤ α and i + j ≤ K(x) + β (with logarithmic precision). The set of pairs (i, j) satisfying these inequalities is shown as the lower gray area. We have to choose c in such a way that for some d these two gray areas are disjoint and even separated by a gap of logarithmic size (since they are known only with O(log n) precision). Note first that for d = c′·log n with large enough c′ we guarantee the vertical gap (the vertical segments of the boundaries of the two gray areas are far apart). Then we select c large enough to guarantee that the diagonal segments of the boundaries of the two gray areas are far apart (α + β < n with a logarithmic margin).

The transition from randomness deficiency to optimality deficiency (Theorem 2) has the following geometric interpretation.

Theorem 4. The sets Q_x and P_x are related to each other via the affine transformation (α, β) ↦ (α, K(x) − α + β), as Figure 4 shows.⁶

⁶ Technically speaking, this holds only for α ≤ K(x). For α > K(x) both sets contain all pairs.

Figure 4: The set P_x and the boundary of the set Q_x (bold dotted line); on every vertical line the two intervals have the same length.

As usual, this statement is true with logarithmic accuracy: the distance between the image of the set Q_x under this transformation and the set P_x is claimed to be O(log n) for a string x of length n.

Proof.
As we have seen, we may use the optimality deficiency instead of the randomness deficiency, i.e., use the set Q̃_x in place of Q_x. The preimage of the pair (i, j) under our affine transformation is the pair (i, i + j − K(x)). Hence we have to prove that a pair (i, j) is in P_x if and only if the pair (i, i + j − K(x)) is in Q̃_x. Note that K(A) = i and log #A = j is equivalent to K(A) = i and δ(x, A) = i + j − K(x), just by the definition of δ(x, A). (See Figure 4: the optimality deficiency of a description A with K(A) = i and log #A = j is the vertical distance between (i, j) and the dotted line.)

But there is a technical problem: in the definition of P_x we used the inequalities K(A) ≤ i and log #A ≤ j, not the equalities K(A) = i and log #A = j. The same applies to the definition of Q̃_x. So we have two sets that correspond to each other, but their closures under these inequalities could be different. Obviously, K(A) ≤ i and log #A ≤ j imply K(A) ≤ i and K(A) + log #A − K(x) ≤ i + j − K(x), but not vice versa. In other words, the set of pairs (K(A), log #A) satisfying the latter inequalities (the right set in Figure 5) is bigger than the set of pairs (K(A), log #A) satisfying the former inequalities (the left set in Figure 5). Now Proposition 8 helps: we may use it to convert any set with parameters from the right region into a set with parameters from the left region.

Figure 5: The left picture shows (for given i and j) the set of all pairs (K(A), log #A) such that K(A) ≤ i and log #A ≤ j; the right picture shows the pairs (K(A), log #A) such that K(A) ≤ i and δ(x, A) ≤ i + j − K(x).

Remark 9.
Let us stress again that Theorem 2 claims only that the existence of a set A ∋ x with K(A) ≤ α and d(x | A) ≤ β is equivalent to the existence of a set B ∋ x with K(B) ≤ α and δ(x, B) ≤ β (with logarithmic accuracy). The theorem does not claim that for every set A ∋ x of complexity at most α the inequalities d(x | A) ≤ β and δ(x, A) ≤ β are equivalent (with logarithmic accuracy). Indeed, Example 1 shows that this is not true: in general, the first inequality does not imply the second one. However, Theorems 2 and 3 show that this can happen only for non-minimal descriptions (those for which a description with smaller complexity and the same optimality deficiency exists). Later we will see that all the minimal descriptions of the same (or almost the same) complexity have almost the same information. Moreover, if A and B are minimal descriptions and the complexity of A is less than that of B, then C(A | B) is small.

For people with a taste for philosophical speculation, the meaning of Theorems 2 and 3 can be advertised as follows. Imagine several scientists competing to provide a good explanation for some data x. Each explanation is a finite set A containing x, together with a program p that computes A. How should we compare different explanations? We want the randomness deficiency d(x | A) of x in A to be negligible (no features of x remain unexplained). Among these descriptions we want to find the simplest one (the one with the shortest p). That is, we look for a set A corresponding to the point where the bold dotted line in Fig. 4 touches the horizontal axis. (In fact, there is always some trade-off between the parameters, not a specific exact point where the curve touches the horizontal axis, but we want to keep the discussion simple, though imprecise.) However, this approach meets the following obstacle: we are unable to compute the randomness deficiency d(x | A).
Moreover, the inventor of the model A has no way to convince us that the deficiency is indeed negligible even when this is the case (the function d(x | A) is not even upper semicomputable). What can be done? Instead, we may look for an explanation with (almost) minimal sum log #A + |p| (the minimum description length principle). Note that this quantity is known for competing explanation proposals. Theorems 2 and 3 provide the connection between these two approaches.

Returning to mathematical language, we have seen in this section that two approaches (based on (i ∗ j)-descriptions and on (α, β)-stochasticity) produce essentially the same curve, though in different coordinates. Other ways to get the same curve will be discussed in Sections 4 and 5.

3.5 Historical remarks

The idea of considering (i ∗ j)-descriptions with optimal parameters can be traced back to Kolmogorov. There is a short record of his talk given in 1974 [16]. Here is the (full) translation of this note:

For every constructive object x we may consider a function Φ_x(k) of an integer argument k ≥ 0, defined as the logarithm of the minimal cardinality of a set of complexity at most k containing x. If x itself has a simple definition, then Φ_x(1) is equal to one [a typo: the cardinality equals 1, and the logarithm equals 0] already for small k. If such a simple definition does not exist, x is "random" in the negative sense of the word "random". But x is positively "probabilistically random" only if the function Φ has a value Φ_0 for some relatively small k and then decreases approximately as Φ(k) = Φ_0 − (k − k_0). [This corresponds to approximate (k_0, 0)-stochasticity.]

Kolmogorov also gave a talk in 1974 [15]; the content of this talk was reported by Cover [10, Section 4, page 31]. Here l(p) stands for the length of a binary string p and |S| stands for the cardinality of a set S.
4. Kolmogorov's H_k Function. Consider the function H_k : {0,1}^n → ℕ,

H_k(x) = min_{p : l(p) ≤ k} log |S|,

where the minimum is taken over all subsets S ⊆ {0,1}^n such that x ∈ S, U(p) = S, l(p) ≤ k. This definition was introduced by Kolmogorov in a talk at the Information Symposium, Tallinn, Estonia, in 1974. Thus H_k(x) is the log of the size of the smallest set containing x over all sets specifiable by a program of k or fewer bits. Of special interest is the value k*(x) = min{k : H_k(x) + k = K(x)}. Note that log |S| is the maximal number of bits necessary to describe an arbitrary element x ∈ S. Thus a program for x can be written in two stages: "Use p to print the indicator function of S; the desired sequence is the i-th sequence in a lexicographic ordering of the elements of this set". This program has length l(p) + log |S|, and k*(x) is the length of the shortest program p for which this 2-stage description is as short as the best 1-stage description p*. We observe that x must be maximally random with respect to S — otherwise the 2-stage description could be improved, contradicting the minimality of K(x). Thus k*(x) and its associated program p constitute a minimal sufficient description for x. ⟨…⟩ Arguments can be provided to establish that k*(x) and its associated set S* describe all of the "structure" of x. The remaining details about x are conditionally maximally complex. Thus p*, the program for S*, plays the role of a sufficient statistic.

In both places Kolmogorov speaks about the place where the boundary curve of P_x reaches its lower bound determined by the complexity of x.

Later the same ideas were rediscovered and popularized by many people. Koppel in [18] reformulates the definition using total algorithms.
Instead of a finite set A he considered a total program P that terminates on all strings of some length. The two-part description of some x is then formed by this program P and an input D for this program that is mapped to x. In our terminology this corresponds to the set A of all values of P on the strings of the same length as D. He then writes [18, p. 1089]:

Definition 3. The c-sophistication of a finite string S [is defined as] SOPH_c(S) = min{|P| : ∃D s.t. (P, D) is a c-minimal description of α}.

There is a typo in this paper: S should be replaced by α (in both places). Earlier, in Definition 1, a description is called c-minimal if |P| + |D| ≤ H(α) + c (here P and D are the program and its input, respectively; H stands for complexity).

Though this paper (as well as the subsequent papers [19, 20]) is not technically clear (e.g., it does not say what the requirements for the algorithm U used in the definition are, and in [19, 20] only universality is required, which is not enough: if U is not optimal, the definition does not make sense), the philosophical motivation for this notion is explained clearly [18, p. 1087]:

The total complexity of an object is defined as the size of its most concise description. The total complexity of an object can be large while its "meaningful" complexity is low; for example, a random object is by definition maximally complex but completely lacking in structure. ⟨…⟩ The "static" approach to the formalization of meaningful complexity is "sophistication", defined and discussed by Koppel and Atlan [a reference to the unpublished paper "Program-length complexity, sophistication, and induction" is given, but later a paper by the same authors [20] with a similar title appeared]. Sophistication is a generalization of the "H-function" or "minimal sufficient statistic" of Cover and Kolmogorov ⟨…⟩ The sophistication of an object is the size of that part of the object which describes its structure, i.e., the aggregate of its projectible properties.

One can also mention the formulation of the "minimal description length" principle by Rissanen [33]; the abstract of this paper says: "Estimates of both integer-valued structure parameters and real-valued system parameters may be obtained from a model based on the shortest data description principle". Here "integer-valued structure parameters" may correspond to the choice of a statistical hypothesis (description set), while "real-valued system parameters" may correspond to the choice of a specific element of this set. The author then says that "by finding the model which minimizes the description length one obtains estimates of both the integer-valued structure parameters and the real-valued system parameters".

We do not try here to follow the development of these and similar ideas. Let us mention only that traces of the same ideas (though even more vague) can be found in the 1960s in the classical papers of Solomonoff [39, 40], who tried to use shortest descriptions for inductive inference (and, as a side product, gave the definition of complexity later rediscovered by Kolmogorov [14]). One may also mention the "minimum message length" principle that goes back to [51]; the idea of a two-part description is explained in [51] as follows:

If the things are now classified then the measurements can be recorded by listing the following:
1. The class to which each thing belongs.
2. The average properties of each class.
3. The deviations of each thing from the average properties of its parent class.
If the things are found to be concentrated in a small area of the region of each class in the measurement space, then the deviations will be small, and, with reference to the average class properties, most of the information about a thing is given by naming the class to which it belongs. In this case the information may be recorded much more briefly than if a classification had not been used. We suggest that the best classification is that which results in the briefest recording of all the attribute information.

Here the "class to which a thing belongs" corresponds to a set (a statistical model, a description in our terminology); the authors say that if this set is small, then only a few bits need to be added to the description of this set to get a full description of the thing in question.

The main technical results of this section (Theorems 1, 2, and 3) are taken from [45] (where some historical account is provided).

4 Bounded complexity lists

In this section we show one more classification of strings that turns out to be equivalent (up to a coordinate change) to the previous ones: for a given string x and m ≥ C(x) we look at how close x is to the end of the enumeration of all strings of complexity at most m. For technical reasons it is more convenient to use the plain complexity C(x) instead of the prefix version K(x). As we have mentioned, the difference between them is only logarithmic, and we mostly ignore terms of that size.

4.1 Enumerating strings of complexity at most m

Consider some integer m, and all strings x of (plain) complexity at most m. Let Ω_m be the number of those strings. The following properties of Ω_m are well known and often used (see, e.g., [8]).

Proposition 11.
• Ω_m = Θ(2^m) (i.e., c₁2^m ≤ Ω_m ≤ c₂2^m for some positive constants c₁, c₂ and for all m);
• C(Ω_m) = m + O(1).

Proof.
The number of strings of complexity at most m is bounded by the total number of programs of length at most m, which is O(2^m). On the other hand, if Ω_m is an (m − d)-bit number, we can specify a string of complexity greater than m using m − d + O(log d) bits: first we specify d in a self-delimiting manner using O(log d) bits, and then append Ω_m in binary. This information allows us to reconstruct d, then m and Ω_m, then enumerate strings of complexity at most m until we have Ω_m of them (so all strings of complexity at most m have been enumerated), and then take the first string x_m that has not been enumerated. As m < C(x_m) ≤ m − d + O(log d), the value of d is bounded by a constant, and hence Ω_m is an (m − O(1))-bit number.

In this argument the binary representation of Ω_m can be replaced by its program, so C(Ω_m) ≥ m − O(1). The upper bound m + O(1) is obvious, since Ω_m = O(2^m).

Given m, we can enumerate all strings of complexity at most m. How many steps does the enumeration algorithm need to produce all of them? The answer is provided by the so-called busy beaver numbers; let us recall their definition in terms of Kolmogorov complexity (see [44, section 1.2.2] for details). By definition, the number B(m) is the maximal integer of complexity at most m. It is not hard to see that C(B(m)) = m + O(1). Indeed, C(B(m)) ≤ m by definition. On the other hand, the complexity of the next number B(m) + 1 is greater than m and at the same time is bounded by C(B(m)) + O(1).

Note that B(m) can be undefined for small m (if there are no integers of complexity at most m) and that B(m + 1) ≥ B(m) for all m. For some m this inequality may not be strict.
This happens, for example, if the optimal algorithm used to define Kolmogorov complexity is defined only on strings of, say, even lengths; this restriction does not prevent it from being optimal, but then B(2n) = B(2n + 1) for all n, since there are no objects of complexity exactly 2n + 1. However, for some constant c we have B(m + c) > B(m) for all m. Indeed, consider a program p of length at most m that prints B(m). Transform it into a program p′ that runs p and then adds 1 to the result. This program witnesses that C(B(m) + 1) ≤ m + c for some constant c. Hence B(m + c) ≥ B(m) + 1.

Now we define B′(m) as follows. As we have said, the set of all strings of complexity at most m can be enumerated given m. Fix some enumeration algorithm A (with input m) and some computation model. Then let B′(m) be the number of steps used by this algorithm to enumerate all the strings of complexity at most m.

Proposition 12. The numbers B(m) and B′(m) coincide up to an O(1)-change in m. More precisely, we have

B′(m) ≤ B(m + c),  B(m) ≤ B′(m + c)

for some c and for all m.

Proof. To find B′(m), it is enough to know the m-bit binary string that represents Ω_m (this string also determines m). Therefore C(B′(m)) ≤ m + c for some constant c. As B(m + c) is the largest number of complexity m + c or less, we have B′(m) ≤ B(m + c).

On the other hand, if some integer N exceeding both m and B′(m) is given, we can run the enumeration algorithm A for N steps on each input smaller than N. Consider the first string that has not been enumerated. Its complexity is greater than m, so C(N) > m − c for some constant c. Thus the complexity of every number N starting from max{m, B′(m)} is greater than m − c, which means that max{m, B′(m)} ≥ B(m − c).
It remains to note that for all large enough m we have m ≤ B(m − c), since the complexity of m is O(log m). Thus for all large enough m the number B′(m) (and not m) must be bigger than B(m − c). Replacing m by m + c here and increasing the constant c if needed, we conclude that B′(m + c) ≥ B(m) for all m.

A similar argument shows that B(m) coincides (up to an O(1)-change in the argument) with the maximal computation time of the universal decompressor (from the definition of plain Kolmogorov complexity) on inputs of size at most m; see [44, section 1.2.2].

The next result says how many strings require a long time to be enumerated.

Proposition 13. After B′(m − s) steps of the enumeration algorithm on input m there are 2^{s + O(log m)} strings that are not yet enumerated.

We assume that the algorithm enumerates strings (for every input m) without repetitions. Note also that here B′ can be replaced by B, since they differ by at most a constant change in the argument.

Proof. To make the notation simpler we omit O(1)- and O(log m)-terms in this argument. Given Ω_{m−s}, we can determine B′(m − s). If we also know how many strings of complexity at most m appear after B′(m − s) steps, we can wait until that many strings appear and then find a string of complexity greater than m. If the number of remaining strings is smaller than 2^{s − O(log m)}, we get a prohibitively short description of this high-complexity string.

On the other hand, let x be the last element enumerated within B′(m − s) steps. If there are significantly more than 2^s elements after x, say, at least 2^{s+d} for some d, we can split the enumeration into portions of size 2^{s+d} and wait until the portion containing x appears. By assumption this portion is full. The number N of steps needed to finish this portion is at least B′(m − s).
This number N and its successor N + 1 can be reconstructed from the portion number, which contains about m − s − d bits. Thus the complexity of N + 1 is at most m − s − d + O(log m). Hence we have

B(m − s − d + O(log m)) ≥ N ≥ B′(m − s).

By Proposition 12 we can replace B′ by B here: B(m − s − d + O(log m)) ≥ B(m − s) (with some other constant in the O-notation). Since B is a non-decreasing function, we get d = O(log m).

4.2 Ω-like numbers

G. Chaitin introduced the "Chaitin Ω-number" Ω = Σ_k m(k); it can also be defined as the probability of termination when the optimal prefix decompressor is applied to a random bit sequence (see [44, section 5.7]).⁷ The numbers Ω_n are finite versions of Chaitin's Ω-number. The information contained in Ω_n increases as n increases; moreover, the following proposition is true. In this proposition we consider Ω_n as a bit string (of length n + O(1)) identifying the number Ω_n and its binary representation.

Proposition 14. Assume that k ≤ m. Consider the string (Ω_m)_k consisting of the first k bits of Ω_m. It is O(log m)-equivalent to Ω_k: both conditional complexities C(Ω_k | (Ω_m)_k) and C((Ω_m)_k | Ω_k) are O(log m).

Proof. This is essentially a reformulation of the previous statement (Proposition 13). Run the algorithm that enumerates strings of complexity at most m. Knowing (Ω_m)_k, we can wait until fewer than 2^{m−k} strings are left in the enumeration of strings of complexity at most m; we know that this happens after more than B(k) steps, and in this time we can enumerate all strings of complexity at most k and compute Ω_k. (In this argument we ignore O(log m)-terms, as usual.)

Now the second inequality follows by the symmetry of information property. Indeed, since C(Ω_k) = k + O(1) and C((Ω_m)_k) ≤ k + O(1), the inequality C(Ω_k | (Ω_m)_k) = O(log m) implies the inequality C((Ω_m)_k | Ω_k) = O(log m).
A direct argument is also easy. Knowing Ω_k and k, we can find the list of all strings of complexity at most k and the number B′(k). Then we make B′(k) steps in the enumeration of the list of strings of complexity at most m. Proposition 13 then guarantees that at that moment Ω_m is known up to an error of about 2^{m−k}, so the first k bits of Ω_m can be reconstructed with a small advice (of logarithmic size; we omit terms of that size in the argument).

⁷ This number depends on the choice of the prefix decompressor, so it is not a specific number but a class of numbers. The elements of this class can be equivalently characterized as the random lower semicomputable reals in [0, 1]; see [44, section 5.7].

There is a more direct connection with Chaitin's Ω-number: one can show that the number Ω_m is O(log m)-equivalent to the m-bit prefix of Chaitin's Ω-number. Since in this survey we restrict ourselves to finite objects, we do not go into the details of the proof here; see [44, section 5.7.7].

4.3 The position in the list is well defined

We have discussed how much time is needed to enumerate all strings of complexity at most m and how many strings remain non-enumerated just before this time. Now we want to study which strings remain non-enumerated. More precisely, let x be some string of complexity at most m, so x appears in the enumeration of all strings of complexity at most m. How close is x to the end, that is, how many strings are enumerated after x? The answer depends on the enumeration, but only slightly, as the following proposition shows.

Proposition 15. Let A and B be algorithms that both, for any given m, enumerate (without repetitions) the set of strings of complexity at most m. Let x be some string and let a_x and b_x be the numbers of strings that appear after x in the A- and B-enumerations. Then |log a_x − log b_x| = O(log m).
We may also assume that A and B are algorithms of complexity O(log m) without input that enumerate the strings of complexity at most m.

Proof. Assume that a_x is small: log a_x ≤ k. Why can log b_x not be much larger than k? Given the first m − log b_x bits of Ω_m and B, we can compute a finite set of strings B′ that contains x and consists only of strings of complexity at most m. Then we can wait until all strings from B′ appear in the A-enumeration. After that, at most 2^k strings are left, and we need k bits to count them. In this way we can describe Ω_m by m − log b_x + k + O(log m) bits; however, Proposition 11 says that C(Ω_m) = m + O(1). Hence log b_x ≤ k + O(log m). The other inequality is proven by a symmetric argument.

In this theorem A and B enumerate exactly the same strings (though in a different order). However, the complexity function itself is essentially defined with O(1)-precision only: different optimal programming languages lead to different versions. Let C and C̃ be two (plain) complexity functions; then C̃(x) ≤ C(x) + c for some c and for all x. Then the list of all x with C(x) ≤ m is contained in the list of all x with C̃(x) ≤ m + c. The same argument shows that the number of elements after x in the first list cannot be much larger than the number of elements after x in the second list. The reverse inequality is not guaranteed, however, even for the same version of complexity (a small increase in the complexity bound may significantly increase the number of strings after x in the list). We will return to this question in Section 4.4, but let us note first that some increase is guaranteed.

Proposition 16. If for a string x there are at least 2^s elements after x in the enumeration of all strings of complexity at most m, then for every d ≥ 0 there are at least 2^{s + d − O(log m)} strings after x in the enumeration of all strings of complexity at most m + d.

Proof.
Essentially the same argument works here: if there are many fewer than 2^{s+d} strings after x in the bigger list, then this bigger list can be determined by the m − s bits needed to cover x in the smaller list and fewer than s + d bits needed to count the elements of the bigger list that follow the last covered element.

The last proposition can be restated in the following way. Let us fix some complexity function and some algorithm that, given m, enumerates all strings of complexity at most m. Then, for a given string x, consider the function that maps every m ≥ C(x) to the logarithm of the number of strings after x in the enumeration with input m. Proposition 16 says that a d-increase in the argument leads to at least a (d − O(log m))-increase of this function (but the latter increase could be much bigger). As we will see, this function is closely related to the set P_x (and therefore to Q_x): it is one more representation of the same boundary curve.

4.4 The relation to P_x

To explain the relation, consider the following procedure for a given binary string x. For every m ≥ C(x) draw the line i + j = m in the (i, j)-plane. Then draw the point on this line whose second coordinate is s, where s is the logarithm of the number of elements after x in the enumeration of all strings of complexity at most m. Mark also all points on this line to the right of (i.e., below) this point. Doing this for different m, we get a set (Figure 6). Proposition 16 guarantees that this set is upward closed with logarithmic precision: if some point (i, j) belongs to this set, then the point (i, j + d) is in the O(log(i + j))-neighborhood of this set. This implies that the point (i + d, j) is also in that neighborhood, since our set is closed by construction in the direction (1, −1).
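Since the complexity function and the enumeration order are uncomputable, this construction cannot be carried out literally; it can, however, be sketched on a toy model. In the following Python sketch the table `toy_C` is an invented stand-in for the complexity function, and `m_list` plays the role of the enumeration with input m; both are assumptions for illustration only, not part of the actual construction.

```python
import math

# Toy stand-in for the (uncomputable) complexity function: a hypothetical
# table assigning a "complexity" to a few strings.  The insertion order of
# the dict serves as the fixed enumeration order (guaranteed in Python 3.7+).
toy_C = {"0": 1, "1": 1, "00": 2, "01": 2, "10": 3, "11": 3, "000": 4, "111": 5}

def m_list(m):
    """The m-list: all strings of toy complexity at most m, in a fixed order."""
    return [s for s in toy_C if toy_C[s] <= m]

def boundary_points(x):
    """For every m >= C(x), let s = log2(number of elements after x in the
    m-list) and record the point (m - s, s) on the line i + j = m."""
    points = []
    for m in range(toy_C[x], max(toy_C.values()) + 1):
        lst = m_list(m)
        after = len(lst) - lst.index(x) - 1
        s = math.log2(after) if after > 0 else 0.0
        points.append((m - s, s))
    return points

print(boundary_points("00"))  # one point per m = 2, 3, 4, 5
```

Marking, for each produced point, all the points to its right on the same line i + j = m then gives the toy analogue of the set described above.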
Figure 6: For each m between K(x) and n (the length of x) we count the elements after x in the list of strings having complexity at most m; assuming there are about 2^s of them, we draw the point (m − s, s) and get a point on some curve. This curve turns out to be the boundary of P_x (with logarithmic precision).

It turns out that this set coincides with P_x (Definition 3) with O(log n)-precision for a string x of length n (this means, as usual, that each of the two sets is contained in the O(log n)-neighborhood of the other one):

Theorem 5. Let x be a string of length n. If x has an (i ∗ j)-description, then x is at least 2^{j − O(log n)}-far from the end of the (i + j + O(log n))-list. Conversely, if there are at least 2^j elements that follow x in the (i + j)-list, then x has an ((i + O(log n)) ∗ j)-description.

Proof. We need to verify two things. First, assuming that x has an (i ∗ j)-description, we need to show that it is at least 2^j-far from the end of the (i + j)-list. (With error terms: in the (i + j + O(log n))-list there are at least 2^{j − O(log n)} elements after x.) Indeed, knowing some (i ∗ j)-description A of x, we can wait until all the elements of A appear in the (i + j)-list (as usual, we omit the O(log n)-term: all elements of A have complexity at most i + j + O(log n), so we should consider the (i + j + O(log n))-list to be sure that it contains all elements of A). In particular, x has appeared at that moment. If there are (significantly) fewer than 2^j elements after x, then we can encode the number of remaining elements by (significantly) fewer than j bits, and together with the description of A we get fewer than i + j bits describing Ω_{i+j}, which is impossible.

Second, assume that there are at least 2^j elements that follow x in the (i + j)-list.
Then, splitting this list into 2^j-portions, we get at most 2^i full portions, and x is covered by one of them. Each portion has complexity at most i and log-size at most j, so we get an (i ∗ j)-description of x. (As usual, logarithmic terms are omitted.)

Now we can reformulate the properties of stochastic and antistochastic objects. Every object of complexity k appears in the list of objects of complexity at most k′ for all k′ ≥ k. Each stochastic object is far from the end of these lists (except, maybe, for some k′-lists with k′ very close to k). Each antistochastic object of length n is maximally close to the end of all k′-lists with k′ < n (there are about 2^{k′−k} objects after x), except, maybe, for some k′-lists with k′ very close to n. When k′ becomes greater than n, even antistochastic strings are far from the end of the k′-list. What we have said is just a description of the corresponding curves (Figure 2) using Theorem 5.

4.5 Standard descriptions

The lists of objects of bounded complexity provide a natural class of descriptions. Consider some m and the number Ω_m of strings of complexity at most m. This number can be represented in binary: Ω_m = 2^a + 2^b + …, where a > b > …. The list itself can then be split into pieces of sizes 2^a, 2^b, …, and these pieces can be considered as descriptions of the corresponding objects. In this way, for each string x and each m ≥ C(x), we get some description of x: the piece that contains x. Descriptions obtained in this way will be called standard descriptions. Note that for a given x we have many standard descriptions (depending on the choice of m). One should also keep in mind that the class of standard descriptions depends on the choice of the complexity function and the enumeration algorithm; we assume in the sequel that they are fixed.
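The splitting of the list according to the binary expansion of Ω_m is easy to make concrete. In the following Python sketch a plain list of integers stands in for the (uncomputable) enumerated list of strings of complexity at most m; the helper names are ours, introduced only for this illustration.

```python
def standard_pieces(lst):
    """Split a list of length N into consecutive pieces whose sizes are the
    powers of two in the binary expansion of N (largest power first)."""
    pieces, pos, n = [], 0, len(lst)
    for bit in range(n.bit_length() - 1, -1, -1):
        if n & (1 << bit):                       # this power of two occurs in N
            pieces.append(lst[pos:pos + (1 << bit)])
            pos += 1 << bit
    return pieces

def standard_description(lst, x):
    """The standard description of x: the piece of the splitting containing x."""
    for piece in standard_pieces(lst):
        if x in piece:
            return piece

omega_list = list(range(13))                     # pretend Omega_m = 13 = 8 + 4 + 1
print([len(p) for p in standard_pieces(omega_list)])   # piece sizes: 8, 4, 1
print(standard_description(omega_list, 9))             # the piece of size 4
```

Note that the piece sizes are strictly decreasing, so the piece containing x also determines its position in the splitting.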
The following results show that standard descriptions are in a sense universal.

First let us note that standard descriptions have parameters close to the boundary curve of P_x (more precisely, to the boundary curve of the set constructed in the previous section, which is close to P_x).⁸

⁸ In general, if two sets X and Y in ℕ² are close to each other (each is contained in a small neighborhood of the other one), this does not imply that their boundaries are close. It may happen that one set has a small "hole" and the other does not, so the boundary of the first set has points that are far from the boundary of the second one. However, in our case both sets are closed by construction in two different directions, and this implies that the boundaries are also close.

Proposition 17. Consider the standard description A of size 2^j obtained from the list of all strings of complexity at most m. Then C(A) = m − j + O(log m), and the number of elements in the list that follow the elements of A is 2^{j + O(log m)}.

This statement says that the parameters of A are close to the point on the line i + j = m considered in the previous section (Figure 6).

Proof. To specify A, it is enough to know the first m − j bits of Ω_m (and m itself). The complexity of A cannot be much smaller, since knowing A and the j least significant bits of Ω_m we can reconstruct Ω_m. The number of elements that follow A cannot exceed 2^j (it is a sum of smaller powers of 2); it cannot be significantly less, since it determines Ω_m together with the first m − j bits of Ω_m. (In other words, since Ω_m is an incompressible string of length m, it cannot have more than O(log m) zeros in a row.)

This result does not imply that every point on the boundary of P_x is close to the parameters of some standard description. If some part of the boundary has slope −1, we cannot guarantee that there are standard descriptions along this part.
For example, consider the list of strings of complexity at most m; the maximal complexity of strings in this list is m − c for some c = O(1); if we take the first string of this complexity, there are 2^{m+O(1)} strings after it, so the corresponding point is close to the vertical axis, and due to Proposition 16 all other standard descriptions of x are also close to the vertical axis.

However, descriptions with parameters close to arbitrary points on the boundary of P_x can be obtained from standard descriptions by chopping them into smaller parts, as in Proposition 8. In this chopping it is natural to use the order in which the strings were enumerated. In other words, chop the list of strings of complexity at most m into portions of size 2^j. Consider all the full portions (of size exactly 2^j) obtained in this way (they are parts of standard descriptions of bigger size). Descriptions obtained in this way are "universal" in the following sense: if a pair (i, j) is on the boundary of P_x, then there is a set A ∋ x of this type that has complexity i + O(log(i + j)) and log-cardinality j + O(log(i + j)). The following result says more: for every description A of x there is a "better" standard description that is simple given A.

Proposition 18. Let A be an (i ∗ j)-description of a string x of length n. Then there is m ≤ min{n, i + j} + O(log n) such that the parameters of the standard description B for x obtained from the list of strings of complexity at most m satisfy the inequalities

C(B) ≤ i + O(log m),  C(B) + log #B = m + O(log m).

Moreover, B is simple given A, i.e., C(B | A) = O(log m).

Proof. If i + j ≤ n, then the complexity of every element of A is at most i + j + O(log j) = min{n, i + j} + O(log n). Otherwise, remove from A all strings of length different from n.
In this way A becomes an (i ∗ j)-description of x with slightly larger i than before the removal and the same or smaller j. Now all the elements of A have complexity at most n + O(1) = min{n, i + j} + O(1). Thus w.l.o.g. we may assume that the complexity of all strings in A does not exceed some m = min{n, i + j} + O(log n).

Consider the list of all strings of complexity at most m and the standard description B of x obtained from this list. As we know from Proposition 17, the sum of the parameters of this description is m + O(log m). We need to show that the size of B is at least 2^{m−i−O(log m)}, and hence the complexity of B is at most i + O(log m). Why is this the case? Consider the elements that appear after the last element of A in the list. There are at least 2^{m−i−O(log m)} of them; otherwise the total number of elements in the list could be described in much less than m bits. Therefore there are at least 2^{m−i−O(log m)} elements in the list that appear after x. As x ∈ B, the number of elements that appear after x is less than 2·#B; therefore #B ≥ 2^{m−i−O(log m)} and C(B) ≤ i + O(log m).

Why is B simple given A? Denote the size of B by 2^{j′}. Given A and m, we can find the last element of A, call it x₀, in the list of strings of complexity at most m. Chop the list into portions of size 2^{j′}. Then B is the last complete portion. If B contains x₀, we can find B from m, j′, and x₀ as the complete portion containing x₀. Otherwise, x₀ appears in the list after all the elements of B. In this case we can find B from m and x₀ as the last complete portion before x₀. Thus in any case we are able to find B from m, j′, and x₀ plus one extra bit.

For the same reason every standard description B of some x is simple given x (and this is not a surprise, since we know that all optimal descriptions of x are simple given x, see Proposition 9).
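The chopping used in this proof is again purely combinatorial and can be sketched in code; the function names are ours, and a Python list again stands in for the enumeration:

```python
def full_portions(enumeration, j):
    """Chop the enumeration into consecutive portions of size 2^j,
    discarding the incomplete tail."""
    size = 1 << j
    return [enumeration[k:k + size]
            for k in range(0, len(enumeration) - size + 1, size)]

def portion_before_or_containing(enumeration, j, x0):
    """Find B as in the proof: the full portion containing x0 if there is
    one, and otherwise the last full portion, which precedes x0 (the two
    cases correspond to the one extra bit mentioned in the proof)."""
    portions = full_portions(enumeration, j)
    containing = enumeration.index(x0) // (1 << j)
    return portions[containing] if containing < len(portions) else portions[-1]
```

With 13 enumerated objects and j = 2 there are three full portions; for an x₀ in the incomplete tail the answer is the last full portion.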
Proposition 18 has the following corollary, which we formulate in an informal way. Let A be some (i ∗ j)-description with parameters on the boundary of P_x. Assume that to the left of this point the boundary curve decreases fast (with slope less than −1). Then in Proposition 18 the value of d = i − C(B) is small; otherwise the point (i − d, j + d) would be far from P_x. So the complexities of A and the standard description B are close to each other. We know also that B is simple given A; therefore A is also simple given B, and A and B have the same information (have small conditional complexities in both directions).

If we have two different descriptions A, A′ with approximately the same parameters on the boundary of P_x, and the curve decreases fast to the left of the corresponding boundary point, the same argument shows that A and A′ have the same information. Note that the condition about the slope is important: if the point is on a segment with slope −1, the situation changes. For example, consider a random n-bit string x and two of its descriptions. The first one consists of all n-bit strings that have the same left half as x; the second one consists of all n-bit strings that have the same right half. Both have the same parameters: complexity n/2 and log-size n/2, so they both correspond to the same point on the boundary of P_x. Still, the information in these two descriptions is different (left and right halves of a random string are independent).

These results sound like good news. Let us recall our original goal: to formalize what a good statistical model is. It seems that we are making some progress. Indeed, for a given x we consider the boundary curve of P_x and look at the place where it first touches the lower bound i + j = C(x); after that it stays near this bound.
In other terms, we consider models with negligible optimality deficiency and select among them the model with minimal complexity. To give a formal definition, we need to fix some threshold ε. Then we say that a set A is an ε-sufficient statistic if δ(x, A) < ε, and we may choose the simplest one among them and call it the minimal ε-sufficient statistic. If the curve goes down fast to the left of this point, we see that all the descriptions with parameters corresponding to the minimal sufficient statistic are equivalent to each other.

Trying to relate these notions to practice, we may consider the following example. Imagine that we have digitized some very old recording and got some bit string x. There is a lot of dust and scratches on the recording, so the originally recorded signal is distorted by some random noise. Then our string x has a two-part description: the first part specifies the original recording and the noise parameters (intensity, spectrum, etc.), and the second part specifies the noise exactly. Maybe the first part is the minimal sufficient statistic, and therefore sound restoration (and lossy compression in general) is a special case of the problem of finding a minimal sufficient statistic? The uniqueness result above (saying that all the minimal sufficient statistics contain the same information under some conditions) seems to support this view: different good models for the same object contain the same explanation.

Still, the following observation (which easily follows from what we know) destroys this impression completely.

Proposition 19. Let B be some standard description of complexity i obtained from the list of all strings of complexity at most m. Then B is O(log m)-equivalent to Ω_i.

This looks like a failure.
Imagine that we wanted to understand the nature of some data string x; finally we succeeded and found a description for x of reasonable complexity and negligible randomness and optimality deficiencies (and all the good properties we dreamed of). But Proposition 19 says that the information contained in this description is more related to computability theory than to specific properties of x. Recalling the construction, we see that the corresponding standard description is determined by some prefix of some Ω-number, and is an interval in the enumeration of objects of bounded complexity. So if we start with two old recordings, we may get the same information, which is not what we expect from a restoration procedure. Of course, there is still a chance that some Ω-number was recorded, and therefore the restoration process indeed should provide the information about it, but this looks like a very special case that hardly should happen in any practical situation.

What could we do with this? First, we could just relax and be satisfied that we now understand much better the situation with possible descriptions for x. We know that every x is characterized by some curve that has several equivalent definitions (in terms of stochasticity, randomness deficiency, position in the enumeration, as well as time-bounded complexity, see Section 5 below). We know that standard descriptions cover the parts of the curve where it goes down fast, and to cover the parts where the slope is −1 one may use standard descriptions and their pieces; all these descriptions are simple given x. When the curve goes down fast, the description is essentially unique (all the descriptions with the same parameters contain the same information, equivalent to the corresponding Ω-number); this is not true on parts with slope −1.
So, even if this curve is of no philosophical importance, we have a lot of technical information about possible models.

The other approach is to go further and consider only models from some class (Section 6), or to add some additional conditions and look for "strong models" (Section 7).

4.6 Non-stochastic objects revisited

Now we can explain in a different way why the probability of obtaining a non-stochastic object in a random process is negligible (Proposition 5). This explanation uses the notion of mutual information from algorithmic information theory. The mutual information in two strings x and y is defined as

I(x : y) = C(x) − C(x | y) = C(y) − C(y | x) = C(x) + C(y) − C(x, y);

all three expressions are O(log n)-close if x and y are strings of length n (see, e.g., [44, Chapter 2]).

Consider an arbitrary string x of length n; let k be the complexity of x. Consider the list of all objects of complexity at most k, and the standard description A for x obtained from this list. If A is large, then x is stochastic; if A is small, then x contains a lot of information about Ω_k and Ω_n. More precisely, let us assume that A has size 2^{k−s} (i.e., is 2^s times smaller than it could be). Then (recall Proposition 17) the complexity of A is s + O(log k), since we can construct A knowing k and the first s bits of Ω_k (before the bit that corresponds to A). So we get an ((s + O(log k)) ∗ (k − s))-description with optimality deficiency O(log k). On the other hand, knowing x and k, we can find the ordinal number of x in the enumeration, so we know Ω_k with error at most 2^{k−s}; hence C(Ω_k | x) ≤ k − s + O(log k), and I(x : Ω_k) ≥ s − O(log k) (recall that C(Ω_k) = k + O(1)).
In the last statement we may replace Ω_k by Ω_n (where n is the length of x): we know from Proposition 14 that Ω_k is simple given Ω_n, so if the condition Ω_k decreases the complexity of x by almost s bits, the same is true for the condition Ω_n. Comparing an arbitrary i ≤ n with this s (i can be larger or smaller than s), we get the following result:

Proposition 20. Let x be a string of length n. For every i ≤ n,

• either x is (i + O(log n), O(log n))-stochastic,
• or I(x : Ω_n) ≥ i − O(log n).

Now we may use the following (simple and general) observation: for every string u, the probability of generating (by a randomized algorithm) an object that contains a lot of information about u is negligible:

Proposition 21. For every string u and for every number d, we have

Σ { m(x) | K(x) − K(x | u) ≥ d } ≤ 2^{−d}.

In this proposition the sum is taken over all strings x that have the given property (have large mutual information with u). Note that we have chosen the representation of mutual information that makes the proposition easy (in particular, we have used prefix complexity). As we mentioned, other definitions differ only by O(log n) if we consider strings x and u of length at most n, and logarithmic accuracy is enough for our purposes.

Proof. Recall the definition of prefix complexity: K(x) = − log m(x) and K(x | u) = − log m(x | u). So K(x) − K(x | u) ≥ d implies m(x) ≤ 2^{−d} m(x | u), and it remains to note that Σ_x m(x | u) ≤ 1 for every u.

Propositions 20 and 21 immediately imply the following improved version of Proposition 5 (page 13):

Proposition 22. For every α,

Σ { m(x) | x is an n-bit string that is not (α, O(log n))-stochastic } ≤ 2^{−α + O(log n)}.

The improvement here is the better upper bound for the randomness deficiency: O(log n) instead of α + O(log n).
4.7 Historical comments

The relation between busy beaver numbers and Kolmogorov complexity was pointed out in [12] (see Section 2.1). The enumerations of all objects of bounded complexity and their relation to stochasticity were studied in [13] (see Section III, E).

5 Computational and logical depth

In this section we reformulate the results of the previous one in terms of bounded-time Kolmogorov complexity and discuss the various notions of computational and logical depth that have appeared in the literature. (The impatient reader may skip this section; it is not technically used in the sequel.)

5.1 Bounded-time Kolmogorov complexity

The usual definition of Kolmogorov complexity of x as the minimal length l(p) of a program p that produces x does not take into account the running time of the program p: it may happen that the minimal program for x requires a lot of time to produce x, while other programs produce x faster but are longer (for example, the program "print x" is rather fast). To analyze this trade-off, the following definition is used.

Definition 4. Let D be some algorithm whose inputs and outputs are binary strings. For a string x and an integer t, define

C^t_D(x) = min { l(p) : D produces x on input p in at most t steps },

the time-bounded Kolmogorov complexity of x with time bound t with respect to D.

This definition was mentioned already in the first paper by Kolmogorov [14]:

   Our approach has one important drawback: it does not take into account the efforts needed to transform the program p and object x [the description and the condition] to the object y [whose complexity is defined]. With appropriate definitions, one may prove mathematical results that could be interpreted as the existence of an object x that has simple programs (has very small complexity K(x)) but all short programs that produce x require an unrealistically long computation.
   In another paper I plan to study the dependence of the program complexity K^t(x) on the difficulty t of its transformation into x. Then the complexity K(x) (as defined earlier) reappears as the minimum value of K^t(x) if we remove the restrictions on t.

Kolmogorov never published the paper he speaks about, and this definition is less studied than the definition without time bounds, for several reasons.

First, the definition is machine-dependent: we need to decide what computation model is used to count the number of steps. For example, we may consider one-tape Turing machines, or multi-tape Turing machines, or some other computational model. The computation time depends on this choice, though not drastically (e.g., a multi-tape machine can be replaced by a one-tape machine with a quadratic increase in time, and the most popular models are polynomially related — this observation is used when we argue that the class P of polynomial-time computable functions is well defined).

Second, the basic result that makes Kolmogorov complexity theory possible is the Solomonoff–Kolmogorov theorem saying that there exists an optimal algorithm D that makes the complexity function minimal up to an O(1) additive term. Now we need to take into account the time bound, and we get the following (not so nice) result.

Proposition 23. There exists an optimal algorithm D for time-bounded complexity in the following sense: for every other algorithm D′ there exist a constant c and a polynomial q such that

C^{q(t)}_D(x) ≤ C^t_{D′}(x) + c

for all strings x and integers t.

In this result, by "algorithm" we may mean a k-tape Turing machine, where k is an arbitrary fixed number. However, the claim remains true even when k is not fixed, i.e., we may allow D′ to have more tapes than D has.
The proof remains essentially the same: we choose some simple self-delimiting encoding of binary strings p ↦ p̂ and some universal algorithm U(·,·), and then let

D(p̂x) = U(p, x).

Then the proof follows the standard scheme; the only thing we need to note is that the decoding of p̂ runs in polynomial time (which is true for most natural ways of self-delimiting encoding) and that the simulation overhead of the universal algorithm is polynomial (which is also true for most natural constructions of universal algorithms). A similar result is true for conditional decompressors, so conditional time-bounded complexity can be defined as well.

For Turing machines with a fixed number of tapes the statement is true for some linear polynomial q(n) = O(n). For the proof we need to consider a universal machine U that simulates other machines efficiently: it should move the program along the tape, so the overhead is bounded by a factor that depends on the size of the program and not on the size of the input or the computation time.⁹

Let t(n) be an arbitrary total computable function with integer arguments and values; then the function x ↦ C^{t(l(x))}_D(x) is a computable upper bound for the complexity C(x) (defined with the same D; recall that l(x) stands for the length of x). Replacing the function t(·) by a bigger function, we get a smaller computable upper bound. An easy observation: in this way we can match every computable upper bound for Kolmogorov complexity.

Proposition 24. Let C̃(x) be some total computable upper bound for the Kolmogorov complexity function based on the optimal algorithm D from Proposition 23. Then there exists a computable function t such that C^{t(l(x))}_D(x) ≤ C̃(x) for every x.

Proof.
Given a number n, we wait until every string x of length at most n gets a program of length at most C̃(x), and let t(n) be the maximal number of steps used by these programs.

So the choice of a computable time bound is essentially equivalent to the choice of a computable total upper bound for Kolmogorov complexity. In the sequel we assume that some optimal (in the sense of Proposition 23) D is fixed and omit the subscript D in C^t_D(·). Similar notation C^t(· | ·) is used for conditional time-bounded complexity.

⁹ This observation motivates Levin's version of complexity (Kt, see [21, Section 1.3, p. 21]) where the program size and the logarithm of the computation time are added: a linear overhead in computation time matches a constant overhead in program size. However, this is a different approach and we do not use Levin's notion of time-bounded complexity in this survey.

5.2 Trade-off between time and complexity

We use the extremely fast growing sequence B(0), B(1), ... as a scale for measuring time. This sequence grows faster than any computable function (since the complexity of t(n) for any computable t is at most log n + O(1), we have B(log n + O(1)) ≥ t(n)). In this scale it does not matter whether we use time or space as the resource measure: they differ at most by an exponential function, and 2^{B(n)} ≤ B(n + O(1)) (in general, f(B(n)) ≤ B(n + O(1)) for every computable f). So we are in the realm of general computability theory even if we technically speak about computational complexity, and the problems related to the unsolved P=NP question disappear.

Let x be a string of length n and complexity k. Consider the time-bounded complexity C^t(x) as a function of t. (The optimal algorithm from Proposition 23 is fixed, so we do not mention it in the notation.) It is a decreasing function of t.
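The trade-off can be made concrete on a toy machine. The decompressor below is our own invented example, not the optimal D of Proposition 23: programs of the form '0'+s print the string s literally in one step, while programs '1'+b print a run of int(b, 2) zeros at one step per zero, so for a long run of zeros the short program is slow and the long literal program is fast:

```python
from itertools import product

def toy_run(program, budget):
    """Toy decompressor D: '0'+s outputs the literal string s in one step;
    '1'+b outputs int(b, 2) zeros, one step per zero.  Returns None if the
    program is invalid or the step budget is exceeded."""
    if program.startswith('0'):
        return program[1:] if budget >= 1 else None
    if program.startswith('1') and len(program) > 1:
        n = int(program[1:], 2)
        return '0' * n if n <= budget else None
    return None

def toy_Ct(x, budget, max_len=20):
    """Brute-force time-bounded complexity C^t_D(x) for the toy D: the
    length of the shortest program producing x within the step budget."""
    for length in range(1, max_len + 1):
        for bits in product('01', repeat=length):
            if toy_run(''.join(bits), budget) == x:
                return length
    return None
```

For x consisting of 16 zeros, toy_Ct(x, 1) = 17 (only the literal program is fast enough), while toy_Ct(x, 16) = 6 (the run-length program '110000' becomes available): the time-bounded complexity decreases as the budget grows, which is the shape of the curve discussed below.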
For small values of t the complexity C^t(x) is bounded by n + O(1), where n stands for the length of x. Indeed, the program that prints x has size n + O(1) and works rather fast; formally speaking, C^t(x) ≤ n + O(1) for t = B(O(log n)). As t increases, the value of C^t(x) decreases and reaches k = C(x) as t → ∞. This is guaranteed to happen for t = B(k + O(1)), since the computation time of the shortest program for x is determined by this program.

We can draw a curve that reflects this trade-off, using the B-scale for the time axis. Namely, consider the graph of the function i ↦ C^{B(i)}(x) − C(x) and the set of points above this graph, i.e., the set

D_x = { (i, j) | C^{B(i)}(x) − C(x) ≤ j }.

Theorem 6 ([6, 2]). The set D_x coincides with the set Q_x with O(log n)-precision for a string x of length n.

Recall that the set Q_x consists of pairs (α, β) such that x is (α, β)-stochastic (see p. 25).

Proof. As we know from Theorem 4, the sets P_x and Q_x are related by an affine transformation (see Figure 4). Taking this transformation into account, we need to prove two statements:

• if there exists an (i ∗ j)-description A for x, then C^{B(i + O(log n))}(x) ≤ i + j + O(log n);

• if C^{B(i)}(x) ≤ i + j, then there exists an ((i + O(log n)) ∗ (j + O(log n)))-description for x.

Both statements are easy to prove using the tools from the previous section. Indeed, assume that x has an (i ∗ j)-description A. All elements of A have complexity at most i + j + O(log n). Knowing A and this complexity, we can find the minimal t such that C^t(x′) ≤ i + j + O(log n) for all x′ from A. This t can be computed from A, which has complexity i, and an O(log n)-bit advice (the value of the complexity). Hence t ≤ B(i + O(log n)) and C^t(x) ≤ i + j + O(log n), as required.

The converse: assume that C^{B(i)}(x) ≤ i + j.
Consider all the strings x′ that satisfy this inequality. There are at most O(2^{i+j}) such strings. Thus we only need to show that, given i and j, we are able to enumerate all those strings in at most O(2^i) portions. One can get a list of all those strings x′ if B(i) is given, but we cannot compute B(i) given i. Recall that B(i) is the maximal integer that has complexity at most i; new candidates for B(i) may appear at most 2^i times. The candidates increase with time; when this happens, we get a new portion of strings that satisfy the inequality C^{B(i)}(x) ≤ i + j. So we have at most O(2^{i+j}) objects, including x, that are enumerated in at most 2^i portions, and this implies that x has an ((i + O(log n)) ∗ j)-description. Indeed, we make all portions of size at most 2^j by splitting larger portions into pieces. The number of portions increases at most by O(2^i), so it remains O(2^i). Each portion (including the one that contains x) then has complexity at most i + O(log n), since it can be computed with logarithmic advice from its ordinal number.

This theorem shows that the results about the existence of non-stochastic objects can be considered as the "mathematical results that could be interpreted as the existence of an object x that has simple programs (has very small complexity K(x)) but all short programs that produce x require an unrealistically long computation" mentioned by Kolmogorov (see the quotation above), and algorithmic statistics can be interpreted as an implementation of Kolmogorov's plan "to study the dependence of the program complexity K^t(x) on the difficulty t of its transformation into x", at least for the simple case of (unrealistically) large values of t.

5.3 Historical comments

Section 5 has the title "Computational and logical depth", but we have not defined these notions yet. The name "logical depth" was introduced by C.
Bennett in [7]. He explains the motivation as follows:

   Some mathematical and natural objects (a random sequence, a sequence of zeros, a perfect crystal, a gas) are intuitively trivial, while others (e.g., the human body, the digits of π) contain internal evidence of a nontrivial causal history. ⟨...⟩ We propose depth as a formal measure of value. From the earliest days of information theory it has been appreciated that information per se is not a good measure of message value. For example, a typical sequence of coin tosses has high information content but little value; an ephemeris, giving the positions of the moon and the planets every day for a hundred years, has no more information than the equations of motion and initial conditions from which it was calculated, but saves its owner the effort of recalculating these positions. The value of a message thus appears to reside not in its information (its absolutely unpredictable parts), nor in its obvious redundancy (verbatim repetitions, unequal digit frequencies), but rather in what might be called its buried redundancy — parts predictable only with difficulty, things the receiver could in principle have figured out without being told, but only at considerable cost in money, time, or computation. In other words, the value of a message is the amount of mathematical or other work plausibly done by its originator, which its receiver is saved from having to repeat.

Trying to formalize this intuition, Bennett suggests the following possible definitions:

   Tentative Definition 0.1: A string's depth might be defined as the execution time of its minimal program.

This notion is not robust (it depends on the specific choice of the optimal machine used in the definition of complexity).
So Bennett considers another version:

   Tentative Definition 0.2: A string's depth at significance level s [might] be defined as the time required to compute the string by a program no more than s bits larger than the minimal program.

We see that Definition 0.2 considers the same trade-off as in Theorem 6, but in reversed coordinates (time as a function of the difference between time-bounded and limit complexities). Bennett is still not satisfied by this definition, for the following reason:

   This proposed definition solves the stability problem, but is unsatisfactory in the way it treats multiple programs of the same length. Intuitively, 2^k distinct (n + k)-bit programs that compute the same output ought to be accorded the same weight as one n-bit program ⟨...⟩

In other language, he suggests considering a priori probability instead of complexity:

   Tentative Definition 0.3: A string's depth at significance level s might be defined as the time t required for the string's time-bounded algorithmic probability P_t(x) to rise to within a factor 2^{−s} of its asymptotic time-unbounded value P(x).

Here P_t(x) is understood as the total weight of all self-delimiting programs that produce x in time at most t (each program of length s has weight 2^{−s}). For our case (when we consider busy beaver numbers as the time scale) the exponential time increase needed to switch from a priori probability to prefix complexity does not matter. Still, Bennett is interested in more reasonable time bounds (recall that in his informal explanation a polynomially computable sequence of π-digits was an example of a deep sequence!), and prefers the a priori probability approach.
Moreover, he finds a nice reformulation of this definition (an almost equivalent one) in terms of complexity:

   Although Definition 0.3 satisfactorily captures the informal notion of depth, we propose a slightly stronger definition for the technical reason that it appears to yield a stronger slow growth property ⟨...⟩

   Definition 1 (Depth of Finite Strings): Let x and w be strings [probably w is a typo: it is not mentioned later] and s a significance parameter. A string's depth at significance level s, denoted D_s(x), will be defined as

   min { T(p) : (|p| − |p*| < s) ∧ (U(p) = x) },

   the least time required to compute it by an s-incompressible program.

Here p* is a shortest self-delimiting program for p, so its length |p*| equals K(p). Actually, this Definition 1 has a different underlying intuition than all the previous ones: a string x is deep if all programs that compute x in a reasonable time are compressible. Note that before we required a different thing: that all programs that compute x in a reasonable time are much longer than the minimal one. This is a weaker requirement: one may imagine a long incompressible program that computes x fast. This intuition is explained in the abstract of the paper [7] as follows:

   [We define] an object's "logical depth" as the time required by a standard universal Turing machine to generate it from an input that is algorithmically random.

Bennett then proves a statement (called Lemma 3 in his paper) that shows that his Definition 1 is almost equivalent to Tentative Definition 0.3: the time remains exactly the same, while s changes at most logarithmically (in fact, at most by K(s)). So if we use Bennett's notion of depth (any of them, except for the first one mentioned) with the busy beaver time scale, we get the same curve as in our definition.
A natural question arises: is there a direct proof that the output of an incompressible program with not too large running time is stochastic? In fact, yes, and one can prove a more general statement: the output of a stochastic program with reasonable running time is stochastic (see Section 5.4); note that stochasticity is a weaker condition than incompressibility.

Let us mention also the notion of computational depth introduced in [4]. There are several versions mentioned in this paper; the first one exchanges coordinates in Bennett's tentative definition 0.2 (reproduced in [4] as Definition 2.5). The authors write: "The first notion of computational depth we propose is the difference between a time-bounded Kolmogorov complexity and traditional Kolmogorov complexity" (Definition 3.1, where the time bound is some function of the input length). The other notions of computational depth are more subtle (they use distinguishing complexity or Levin complexity involving the logarithm of the computation time).

The connections between computational/logical depth and sophistication were anticipated for a long time; for example, Koppel writes in [19]:

   ⟨...⟩ The "dynamic" approach to the formalization of meaningful complexity is "depth", defined and discussed by Bennett [1]. [Reference to an unpublished paper "On the logical 'depth' of sequences and their reducibilities to incompressible sequences".] The depth of an object is the running-time of its most concise description. Since it is reasonable to assume that an object has been generated by its most concise description, the depth of an object can be thought of as a measure of its evolvedness. Although sophistication is measured in integers [it is not clear what is meant here: the sophistication of S is also a function c ↦ SOPH_c(S)] and depth is measured in functions, it is not difficult to translate to a common range.
Strangely, the direct connection between the most basic versions of these notions (Theorem 6) seems to be noticed only recently in [6, Section 3] and [2].

5.4 Why so many equivalent definitions?

We have shown several equivalent (with logarithmic precision and up to affine transformation) ways to define the same curve:

• (α, β)-stochasticity (Section 2);
• two-part descriptions and optimality deficiency, the set P_x (Section 3);
• position in the enumeration of objects of bounded complexity (Section 4);
• logical/computational depth (resource-bounded complexity, Section 5).

One can add to this list a characterization in terms of split enumeration (Section 3.4): the existence of an (i ∗ j)-description for x is equivalent (with logarithmic precision) to the existence of a simple enumeration of at most 2^{i+j} objects in at most 2^i portions (see Remark 7, p. 23, and the discussion before it).

Why do we need so many equivalent definitions of the same curve? First, this shows that this curve is really fundamental: it is almost as fundamental a characterization of an object x as its complexity. As Koppel writes in [18], speaking about (some versions of) sophistication and depth:

One way of demonstrating the naturalness of a concept is by proving the equivalence of a variety of prima facie different formalizations ⟨...⟩. It is hoped that the proof of the equivalence of two approaches to meaningful complexity, one using static resources (program size) and the other using dynamic resources (time), will demonstrate not only the naturalness of the concept but also the correctness of the specifications used in each formalization to ensure robustness and generality.

Another, more technical reason: different results about stochasticity use different equivalent definitions, and a statement that looks quite mysterious for one of them may become almost obvious for another.
Let us give two examples of this type (the first one is stochasticity conservation when random noise is added, the second one is a direct proof of Bennett's characterization mentioned above). The first example is the following theorem from [42, Theorem 14] (though the proof there is different).

Theorem 7. Let x be some binary string, and let y be another string ("noise") that is conditionally random with respect to x, i.e., C(y | x) ≈ l(y). Then the pair (x, y) has the same stochasticity profile as x: the sets Q_x and Q_{(x,y)} are close to each other. More specifically, if C(y | x) > l(y) − ε, then the sets Q_x and Q_{(x,y)} are in an O(log l(x) + log l(y) + ε)-neighborhood of each other.

Recall that we can speak about profiles of arbitrary finite objects, in particular, pairs of strings, using some natural encoding (Section 2.3). Before giving a proof sketch, let us make two remarks.

Remark 10. By Theorem 4 the set P_x can be obtained by a simple transformation from the set Q_x and C(x), and the other way around. Thus Theorem 7 can be reformulated in terms of the profiles P_x and P_{(x,y)}. However, the statement becomes more involved:

Theorem 8 (Theorem 7 in terms of profiles). Let x be a binary string and let y be another string such that C(y | x) > l(y) − ε. Then the set P_{(x,y)} can be obtained from the set P_x by the following transformation φ:

φ(P_x) = {(i, j + l(y)) | i ≤ C(x), (i, j) ∈ P_x} ∪ {(i, j) | i > C(x), i + j ≥ C(x, y)}.

More accurately, the sets P_{(x,y)} and φ(P_x) are in an O(log l(x) + log l(y) + ε)-neighborhood of each other.

The transformation of P_x to P_{(x,y)} is shown on Fig. 7.
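As a sanity check, the transformation φ of Theorem 8 can be sketched for finite data. In the toy function below (an illustration, not part of the survey), a profile is an explicit finite set of integer pairs (i, j); the parameters C_x, C_xy, len_y play the roles of C(x), C(x, y) and l(y), and max_i, max_j are hypothetical bounds used to materialize the second, in principle unbounded, set of pairs.

```python
# Toy sketch of the transformation phi from Theorem 8 (illustration only).
# A profile is a finite set of integer pairs (i, j); C_x, C_xy, len_y stand
# for C(x), C(x, y) and l(y); max_i, max_j are hypothetical grid bounds.

def phi(profile, C_x, C_xy, len_y, max_i, max_j):
    # part 1: shift the profile vertically by l(y) for i <= C(x)
    shifted = {(i, j + len_y) for (i, j) in profile if i <= C_x}
    # part 2: the wedge i > C(x), i + j >= C(x, y)
    wedge = {(i, j)
             for i in range(C_x + 1, max_i + 1)
             for j in range(max_j + 1)
             if i + j >= C_xy}
    return shifted | wedge
```

For instance, with C(x) = 3, l(y) = 2 and C(x, y) = 5, the profile point (0, 5) moves to (0, 7), while the added wedge starts at the point (4, 1).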
Figure 7: The boundary of P_{(x,y)} is obtained by shifting the boundary of P_x vertically by l(y) ≈ C(x, y) − C(x) and adding the sloping segment with the endpoints (C(x), C(x, y) − C(x)) and (C(x, y), 0).

Remark 11. An interesting special case of this theorem is obtained if we consider a string u and its description X with small randomness deficiency: d(u | X) ≈ 0. Let y be the ordinal number of u in X. Then the small randomness deficiency guarantees that y is conditionally random with respect to X. Therefore the pair (X, y) has the same stochasticity profile as u. Since this pair is mapped to u by a simple total computable function, we conclude (Proposition 3) that the stochasticity profile of X is contained in the stochasticity profile of u (more precisely, in its O(log n + d(u | X))-neighborhood).

For profiles, there is a simpler and more direct proof of the inclusion of φ(P_X) into a small neighborhood of P_u: if U is an (i ∗ j)-description for X, we consider the "lifting" of U, i.e., the union of all elements of U that have approximately the same cardinality as X; in this way we obtain an (i ∗ (j + l(y)))-description for u. This shows that the set {(i, j + l(y)) | i ≤ C(X), (i, j) ∈ P_X} is included in P_u. For the set {(i, j) | i > C(X), i + j ≥ C(X, y)} the inclusion is obvious: for all i > C(X) and j ≥ C(X, y) − i, the set of all strings in X whose index has the same l(y) − j leading bits as u is an (i ∗ j)-description of u.

A proof sketch of Theorem 7. Using the depth characterization of the stochasticity profile, we need to show that

C^{B(i)}(x, y) − C(x, y) ≈ C^{B(i)}(x) − C(x).
Here "approximately" means that these two quantities may differ by a logarithmic term, and also we are allowed to add logarithmic terms to i (see below what this means). The natural idea is to rewrite this equality as

C^{B(i)}(x, y) − C^{B(i)}(x) ≈ C(x, y) − C(x).

The right-hand side is equal to C(y | x) (with logarithmic precision) due to the Kolmogorov–Levin formula for the complexity of a pair (see, e.g., [44, Chapter 2]), and C(y | x) equals l(y), as y is random and independent of x. Thus it suffices to show that the left-hand side also equals l(y). To this end we can prove a version of the Kolmogorov–Levin formula for bounded complexity and show that the left-hand side equals C^{B(i)}(y | x). Again, since y is random and independent of x, the quantity C^{B(i)}(y | x) equals l(y).

This plan needs clarification. First of all, let us explain which version of the Kolmogorov–Levin formula for bounded complexity we need. (Essentially it was published by Longpré in [23], though the statement was obscured by considering the time bound as a function of the input length.) The equality C(x, y) = C(x) + C(y | x) should be considered as two inequalities, and each one should be treated separately.

Lemma 1.
1. There exist some constant c and some polynomial p(·, ·) such that

C^{p(n,t)}(x, y) ≤ C^t(x) + C^t(y | x) + c log n

for all n and t and for all strings x and y of length at most n.

2. There exist some constant c and some polynomial p(·, ·) such that

C^{p(2^n, t)}(x) + C^{p(2^n, t)}(y | x) ≤ C^t(x, y) + c log n

for all n and t and for all strings x and y of length at most n.

Proof of the lemma. The proof of this time-bounded version is obtained by a straightforward analysis of the time requirements in the standard proof.
The first part says that if there is some program p that produces x in time t, and some program q that produces y from x in time t, then the pair (p, q) can be considered as a program that produces (x, y) in time poly(t, n) and has length l(p) + l(q) + O(log n) (we may assume without loss of generality that p and q have length O(n), otherwise we replace them by shorter fast programs).

The other direction is more complicated. Assume that C^t(x, y) = m. We have to count, for a given x, the number of strings y′ such that C^t(x, y′) ≤ m. These strings (y is one of them) can be enumerated in time poly(2^n, t), so if there are 2^s of them, then C^{poly(2^n,t)}(y | x) ≤ s + O(log n) (the program witnessing this inequality is the ordinal number of y in the enumeration plus O(log n) bits of auxiliary information). Note that we do not need to specify t in advance: we enumerate the strings y′ in order of increasing time, and y is among the first 2^s enumerated strings. On the other hand, there are at most 2^{m−s+O(1)} strings x′ for which this number (of different y′ such that C^t(x′, y′) ≤ m) is at least 2^{s−1}, and these strings also can be enumerated in time poly(2^n, t), so C^{poly(2^n,t)}(x) ≤ m − s + O(log n) (again we do not need to specify t, we just increase the time bound gradually). When these two inequalities are added, s disappears and we get the desired inequality.

Of course, the exponent in the lemma is disappointing (for space bounds it is not needed, by the way), but since we measure time in busy beaver units, it is not a problem for us: indeed, poly(2^n, B(i)) ≤ B(i + O(log n)), and we allow a logarithmic change in the argument anyway.

Now we should apply this lemma, but first we need to give a full statement of what we want to prove.
There are two parts (as in the lemma):

• for every i there exists j ≤ i + O(log n) such that

C^{B(j)}(x, y) − C(x, y) ≤ C^{B(i)}(x) − C(x) + ε + O(log n)

for all strings x and y of length at most n such that C(y | x) ≥ l(y) − ε;

• for every i there exists j ≤ i + O(log n) such that

C^{B(j)}(x) − C(x) ≤ C^{B(i)}(x, y) − C(x, y) + O(log n)

for all strings x and y of length at most n.

Both statements easily follow from the lemma. Let us start with the second statement, where the hard direction of the lemma is used. As planned, we rewrite the inequality as

C^{B(j)}(x) + C(y | x) ≤ C^{B(i)}(x, y) + O(log n)

using the unbounded formula. Our lemma guarantees that

C^{B(j)}(x) + C^{B(j)}(y | x) ≤ C^{B(i)}(x, y) + O(log n)

for some j ≤ i + O(log n), and it remains to note that C(y | x) ≤ C^{B(j)}(y | x).

For the other direction the argument is similar: we rewrite the inequality as

C^{B(j)}(x, y) ≤ C(y | x) + C^{B(i)}(x) + O(log n)

and note that C(y | x) ≥ l(y) − ε ≥ C^{B(i)}(y | x) − ε, assuming that B(i) is greater than the time needed to print y from its literal description (otherwise the statement is trivial). So the lemma again can be used (in the simple direction).

This proof used the depth representation of the stochasticity curve; in other cases some other representations are more convenient. Our second example is the change in the stochasticity profile when a simple algorithmic transformation is applied. We have seen (Section 2.3) that a total mapping with a short program preserves stochasticity, and noted that for non-total mappings this is not the case (Remark 3, p. 11). However, if the time needed to perform the transformation is bounded, we can get some bound (first proven by A. Milovanov in a different way):

Theorem 9. Let F be a partial computable mapping whose arguments and values are strings.
If some n-bit string x is (α, β)-stochastic, and F(x) is computed in time B(i) for some i, then F(x) is (max(α, i) + O(log n), β + O(log n))-stochastic. (The constant in the O(log n)-notation depends on F but not on n, x, α, β.)

Proof sketch. Let us denote F(x) by y. By assumption there exists an (α ∗ (C(x) − α + β))-description of x (recall the definition with optimality deficiency; we omit logarithmic terms as usual). So there exists a simple enumeration of at most 2^{C(x)+β} objects x′ in at most 2^α portions that includes x. Let us count the x′ in this enumeration such that F(x′) = y and the computation uses time at most B(i); assume there are 2^s of them. Then we can enumerate all strings that have at least 2^s preimages computed in time B(i), in 2^α + 2^i portions. Indeed, new portions appear in two cases: (1) a new portion appears in the original enumeration; (2) the candidate for B(i) increases. The first event happens at most 2^α times, the second at most 2^i times. The total number of enumerated strings is at most 2^{C(x)+β−s}; it remains to note that C(x) − s ≤ C(y). Indeed, C(x) ≤ C(y) + C(x | y), and C(x | y) ≤ s, since we can enumerate all the preimages of y in the order of increasing time, and x is determined by the s-bit ordinal number of x in this enumeration.

A special case of this proposition is Bennett's observation: if some d-incompressible program p produces x in time B(i), then p is (0, d)-stochastic, and p is mapped to x by the interpreter (decompressor) in time B(i), so x is (0 + i, d)-stochastic. (For simplicity we omit all the logarithmic terms in this argument, as well as in the previous proof sketch.)

Remark 12.
One can combine Remark 4 (page 11) with Theorem 9 and show that if a program F of complexity at most j is applied to an (α, β)-stochastic string x of length n and the computation terminates in time B(i), then F(x) is (max(i, α) + j + O(log n), β + j + O(log n))-stochastic, where the constant in the O(log n)-notation is absolute (does not depend on F). To show this, one may consider the pair (x, F); it is easy to show (this can be done in different ways using different characterizations of the stochasticity curve) that this pair is (α + j + O(log n), β + j + O(log n))-stochastic.

Let us note also that there are some results in algorithmic information theory that are true for stochastic objects but are false or unknown without this assumption. We will discuss (without proofs) two examples of this type. The first is the Epstein–Levin theorem saying that for a stochastic set A its total a priori probability is close to the maximal a priori probability of A's elements; see [31] for details. Here the result is (obviously) false without the stochasticity assumption. In the next example [29] the stochasticity assumption is used in the proof, and it is not known whether the statement remains true without it: for every triple of strings (x, y, z) of length at most n there exists a string z′ such that

• C(x | z) = C(x | z′) + O(log n),
• C(y | z) = C(y | z′) + O(log n),
• C(x, y | z) = C(x, y | z′) + O(log n),
• C(z′) ≤ I((x, y) : z) + O(log n),

assuming that (x, y) is (O(log n), O(log n))-stochastic.

This proposition is related to the following open question on "irrelevant oracles": assume that the mutual information between (x, y) and some z is negligible. Can an oracle z (an "irrelevant oracle") change substantially natural properties of the pair (x, y) formulated in terms of Kolmogorov complexity?
For instance, can such an oracle z allow us to extract some common information of x and y? In [29] a negative answer to the latter question is given, but only for stochastic pairs (x, y).

6 Descriptions of restricted type

6.1 Families of descriptions

In this section we consider the restricted case: the sets (considered as descriptions, or statistical hypotheses) are taken from some family A that is fixed in advance. (Elements of A are finite sets of binary strings.) Informally speaking, this means that we have some a priori information about the black box that produces a given string: this string is obtained by a random choice in one of the A-sets, but we do not know in which one. Before, we had no restrictions (the family A was the family of all finite sets). It turns out that the results obtained so far can be extended (sometimes with weaker bounds) to other families that satisfy some natural conditions. Let us formulate these conditions.

(1) The family A is enumerable. This means that there exists an algorithm that prints elements of A as lists, with some separators (saying where one element of A ends and another one begins).

(2) For every n the family A contains the set B_n of all n-bit strings.

(3) There exists some polynomial p with the following property: for every A ∈ A, for every natural n and for every natural c < #A, the set of all n-bit strings in A can be covered by at most p(n) · #A/c sets of cardinality at most c from A.

The last condition is a replacement for splitting: in general, we cannot split a set A ∈ A into pieces from A, but at least we can cover a set A ∈ A by smaller elements of A (of size at most c) with polynomial overhead in the number of pieces, compared to the required minimum #A/c (more precisely, we have to cover only the n-bit elements of A). We assume that some family A that has properties (1)–(3) is fixed.
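For tiny examples, condition (3) can be verified by brute force. The sketch below (an illustration, not part of the survey) represents a family as a list of frozensets of strings and searches exhaustively for a cover of the n-bit part of A by at most `budget` sets of cardinality at most c; the usage example takes the family of sets obtained by fixing some bits, i.e., patterns over {0, 1, ∗}.

```python
from itertools import combinations

# Toy checker for condition (3) (illustration only): can the n-bit part
# of A be covered by at most `budget` sets from `family`, each of
# cardinality at most c?  Exhaustive search, so only for tiny examples.

def coverable(family, A, n, c, budget):
    target = frozenset(x for x in A if len(x) == n)
    small = [S for S in family if len(S) <= c]
    for k in range(budget + 1):
        for combo in combinations(small, k):
            covered = set().union(*combo) if combo else set()
            if target <= covered:
                return True
    return False

def pattern_set(z):
    """All bit strings matching a pattern over the alphabet {0, 1, *}."""
    if not z:
        return {""}
    rest = pattern_set(z[1:])
    heads = ["0", "1"] if z[0] == "*" else [z[0]]
    return {h + r for h in heads for r in rest}

# The family of all length-2 patterns; "**" gives B_2.
patterns = ["**", "0*", "1*", "*0", "*1", "00", "01", "10", "11"]
family = [frozenset(pattern_set(z)) for z in patterns]
```

Here covering the 4-element set B_2 by sets of size at most c = 2 needs exactly #A/c = 2 pieces, with no polynomial overhead at all: this particular family even splits.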
For a string x we denote by P^A_x the set of pairs (i, j) such that x has an (i ∗ j)-description that belongs to A. (One can also consider some class of probability distributions, but we restrict our attention to sets, i.e., uniform distributions.) The set P^A_x is a subset of the set P_x defined earlier; the bigger A is, the bigger P^A_x is. The full set P_x is P^A_x for the family A that contains all finite sets. For every string x the set P^A_x has properties close to the properties of P_x proved earlier.

Proposition 25. For every string x of length n the following is true:

1. The set P^A_x contains a pair that is O(log n)-close to (0, n).
2. The set P^A_x contains a pair that is O(1)-close to (C(x), 0).
3. The adaptation of Proposition 8 is true: if (i, j) ∈ P^A_x, then (i + k + O(log n), j − k) also belongs to P^A_x for every k ≤ j. (Recall that n is the length of x.)

Proof. 1. Property (2) guarantees that the family A contains the set B_n, which is an (O(log n) ∗ n)-description of x.

2. Property (3) applied to c = 1 and A = B_n says that every singleton belongs to A; therefore each string has a ((C(x) + O(1)) ∗ 0)-description.

3. Assume that x has an (i ∗ j)-description A ∈ A. For a given k we enumerate A until we find a family of p(n)2^k sets of size 2^{−k}#A (or less) in A that covers all strings of length n in A. Such a family exists due to (3), where p is the polynomial from (3). The complexity of the set that covers x does not exceed i + k + O(log n + log k), since this set is determined by A, n, k and the ordinal number of the set in the cover. We may assume without loss of generality that k ≤ n, otherwise {x} can be used as an ((i + k + O(log n)) ∗ (j − k))-description of x. So the term O(log k) can be omitted.
For example, we may consider the family that consists of all "cylinders": for every n and for every string u of length at most n we consider the set of all n-bit strings that have prefix u. Obviously the family of all such sets (for all n and u) satisfies the conditions (1)–(3). We may also fix some bits of a string (not necessarily forming a prefix). That is, for every string z in the ternary alphabet {0, 1, ∗} we consider the set of all bit strings that can be obtained from z by replacing stars with some bits. This set contains 2^k strings if z has k stars. The conditions (1)–(3) are fulfilled for this larger family, too.

A more interesting example is the family A formed by all balls in the Hamming sense, i.e., the sets

B_{y,r} = {x | l(x) = l(y), d(x, y) ≤ r}.

Here l(u) is the length of a binary string u, and d(x, y) is the Hamming distance between two strings x and y of the same length. The parameter r is called the radius of the ball, and y is its center. Informally speaking, this means that the experimental data were obtained by changing at most r bits in some string y (and all possible changes are equally probable). This assumption could be reasonable if some string y is sent via an unreliable channel. Both parameters y and r are not known to us in advance.

It turns out that the family of Hamming balls satisfies the conditions (1)–(3). This is not completely obvious. For example, these conditions imply that for every n and for every r ≤ n the set B_n of n-bit strings can be covered by p(n) · 2^n/V Hamming balls of radius r for some polynomial p, where V stands for the cardinality of such a ball (i.e., V = (n choose 0) + ... + (n choose r)). This can be shown by a probabilistic argument: take N balls of radius r whose centers are randomly chosen in B_n. For a given x ∈ B_n the probability that x is not covered by any of these balls equals (1 − V/2^n)^N < e^{−VN/2^n}.
For N = n ln 2 · 2^n/V this upper bound is 2^{−n}, so for this N the probability to leave some x uncovered is less than 1. A similar argument can be used to prove (1)–(3) in the general case.

Proposition 26 ([46]). The family of all Hamming balls satisfies conditions (1)–(3) above.

Proof sketch. Let A be a ball of radius a and let c be a number less than #A. We need to cover A by balls of cardinality c or less, using an almost minimal number of balls, close to the lower bound #A/c up to a polynomial factor. Let us make some observations.

(1) The set of all n-bit strings can be covered by two balls of radius n/2. So we can assume without loss of generality that a ≤ n/2; otherwise we can apply the probabilistic argument above.

(2) Clearly the radius of the covering balls should be maximal possible (to keep the cardinality at most c); for this radius the cardinality of the ball equals c up to polynomial factors, since the size of the ball increases at most by a factor of n + 1 when its radius increases by 1.

(3) It is enough to cover spheres instead of balls (since every ball is a union of polynomially many spheres); it is also enough to consider the case when the radius a of the sphere that we want to cover is bigger than the radius b of the covering ball, otherwise one ball is enough.

(4) We will cover the a-sphere by randomly chosen b-balls whose centers are uniformly taken at some distance f from the center of the a-sphere. (See below about the choice of f.) We use the same probabilistic argument as before (for the set of all strings). It is enough to show that for a b-ball whose center is at that distance, a polynomial fraction of its points belong to the a-sphere. Instead of b-balls we may consider b-spheres; the cardinality ratio is polynomial.
(5) It remains to choose some f with the following property: if the center of a b-sphere S is at a distance f from the center of the a-sphere T, then a polynomial fraction of S-points belong to T. One can compute a suitable f explicitly. In probabilistic terms, we just change an f/n-fraction of bits and then change a random b/n-fraction of bits. The expected fraction of twice-changed bits is, therefore, about (f/n)(b/n), and the total fraction of changed bits is about f/n + b/n − 2(f/n)(b/n). So we need to write an equation saying that this expression equals a/n and then find the solution f. (Then one can perform the required estimate for binomial coefficients.) However, one can avoid computations with the following probabilistic argument: start with b changed bits, and then change all the bits one by one in a random order. At the end we get n − b changed bits, and a is somewhere in between, so there is a moment where the number of changed bits is exactly a. And if the union of n events covers the entire probability space, one of these events has probability at least 1/n.

When a family A is fixed, a natural question arises: does the restriction on models (when we consider only models in A) change the set P_x? Is it possible that a string has good models in general, but not in the restricted class? The answer is positive for the class of Hamming balls, as the following proposition shows.

Proposition 27. Consider the family A that consists of all Hamming balls. For some positive ε and for all sufficiently large n there exists a string x of length n such that the distance between P^A_x and P_x exceeds εn.

Proof sketch. Fix some α in (0, 1/2) and let V be the cardinality of the Hamming ball of radius αn. Find a set E of cardinality N = 2^n/V such that every Hamming ball of radius αn contains at most n points from E. This property is related to list decoding in coding theory.
The existence of such a set can be proved by a probabilistic argument: N randomly chosen n-bit strings have this property with positive probability. Indeed, the probability for a random point to land in a given ball is the inverse of the number of points, so the distribution of the number of E-points in a ball is close to a Poisson distribution with parameter 1, and its tails decrease much faster than the 2^{−n} needed. Since an E with this property can be found by exhaustive search, we can assume that C(E) = O(log n) and ignore the complexity of E (as well as other O(log n) terms) in the sequel.

Let x be a random element of E, i.e., a string x ∈ E of complexity about log #E. The complexity of a ball A of radius αn that contains x is at least C(x), since knowing such a ball and the ordinal number of x in A ∩ E, we can find x. Therefore x does not have (log #E, log V)-descriptions in A. On the other hand, x does have a (0, log #E)-description if we do not require the description to be in A; the set E is such a description. The point (log #E, log V) is above the line C(A) + log #A = log #E, so P^A_x is significantly smaller than P_x. This construction gives a stochastic x (E is the corresponding model) that becomes maximally non-stochastic if we restrict ourselves to Hamming balls as descriptions (Figure 8).

Figure 8: Theorem 11 can be used (together with the argument above) to show that the border of the set P^A_x (shown in gray) consists of a vertical segment C(A) = n − log V, log #A ≤ log V, and the segment of slope −1 defined by C(A) + log #A = n, log V ≤ log #A. The set P_x contains also the hatched part.

6.2 Possible shapes of the boundary curve

Our next goal is to extend some results proven for non-restricted descriptions to the restricted case. Let A be a family that has properties (1)–(3).
We prove a version of Theorem 1 where the precision (unfortunately) is significantly worse: O(√(n log n)) instead of O(log n). Note that with this precision the term O(m) (proportional to the complexity of the curve) that appeared in Theorem 1 is not needed. Indeed, if we draw the curve on cell paper with cell size √n or larger, then it touches only O(√n) cells, so it is determined by O(√n) bits with O(√n)-precision, and we may assume without loss of generality that the complexity of the curve is O(√n).

Theorem 10 ([46]). Let k ≤ n be two integers and let t_0 > t_1 > ... > t_k be a strictly decreasing sequence of integers such that t_0 ≤ n and t_k = 0. Then there exists a string x of complexity k + O(√(n log n)) and length n + O(log n) for which the distance between P^A_x and T = {(i, j) | (i ≤ k) ⇒ (j ≥ t_i)} is at most O(√(n log n)).

We will see later (Theorem 11) that for every x the boundary curve of P^A_x goes down at least with slope −1, as in the unrestricted case, so this theorem describes all possible shapes of the boundary curve.

Proof. The proof is similar to the proof of Theorem 1. Let us recall this proof first. We consider the string x that is the lexicographically first string (of suitable length n′) that is not covered by any "bad" set, i.e., by any set of complexity at most i and size at most 2^j, where the pair (i, j) is at the boundary of the set T. The length n′ is chosen in such a way that the total number of strings in all bad sets is strictly less than 2^{n′}. On the other hand, we need "good sets" that cover x. For every boundary point (i, j) we construct a set A_{i,j} that contains x, has complexity close to i and size 2^j. The set A_{i,j} is constructed in several attempts. Initially A_{i,j} is the set of the lexicographically first 2^j strings of length n′.
Then we enumerate bad sets and delete all their elements from A_{i,j}. At some step A_{i,j} may become empty; then we refill it with the 2^j lexicographically first strings that are not in the bad sets (at the moment). By construction the final A_{i,j} contains the first x that is not in bad sets (since this is the case all the time). And the set A_{i,j} can be described by the number of changes (plus some small information describing the process as a whole and the value of j). So it is crucial to have an upper bound for the number of changes.

How do we get this bound? We note that when A_{i,j} becomes empty, it is refilled again, and all the new elements should be covered by bad sets before the next change could happen. Two types of bad sets may appear: "small" ones (of size less than 2^j) and "large" ones (of size at least 2^j). The slope of the boundary line for T guarantees that the total number of elements in all small bad sets does not exceed 2^{i+j} (up to a poly(n)-factor), so they may make A_{i,j} empty only 2^i times. And the number of large bad sets is O(2^i), since the complexity of each is bounded by i. (More precisely, we count separately the number of changes of A_{i,j} that are first changes after a large bad set appears, and the number of other changes.)

Can we use the same argument in our new situation? We can generate bad sets as before and have the same bounds for their sizes and the total number of their elements. So the length n′ of x can be the same (in fact, almost the same, as we will now need that the union of all bad sets covers less than half of all strings of length n′, see below). Note that we now may enumerate only bad sets in A, since A is enumerable, but we do not even need this restriction. What we cannot do is to let A_{i,j} be the set of the first non-deleted elements: we need A_{i,j} to be a set from A. So we now go in the other direction.
Instead of choosing x first and then finding suitable "good" sets A_{i,j} that contain x, we construct sets A_{i,j} ∈ A that change in time in such a way that (1) their intersection always contains some non-deleted element (an element that is not yet covered by bad sets); (2) each A_{i,j} has not too many versions. The non-deleted element in their intersection (in the final state) is then chosen as x.

Unfortunately, we cannot do this for all points (i, j) along the boundary curve. (This explains the loss of precision in the statement of the theorem.) Instead, we construct "good" sets only for some values of j. These values go down from n to 0 with step √(n log n). We select N = √(n/log n) points (i_1, j_1), ..., (i_N, j_N) on the boundary of T; the first coordinates i_1, ..., i_N form a non-decreasing sequence, and the second coordinates j_1, ..., j_N split the range n ... 0 into (almost) equal intervals (j_1 = n, j_N = 0). Then we construct good sets of sizes at most 2^{j_1}, ..., 2^{j_N}, and denote them by A_1, ..., A_N. All these sets belong to the family A. We also let A_0 be the set of all strings of length n′ = n + O(log n); the choice of the constant in O(log n) will be discussed later.

Let us first describe the construction of A_1, ..., A_N assuming that the set of deleted elements is fixed. (Then we discuss what to do when more elements are deleted.) We construct the sets A_s inductively (first A_1, then A_2, etc.). As we have said, #A_s ≤ 2^{j_s} (in particular, A_N is a singleton), and we keep track of the ratio

(the number of non-deleted strings in A_0 ∩ A_1 ∩ ... ∩ A_s) / 2^{j_s}.

For s = 0 this ratio is at least 1/2; this is obtained by a suitable choice of n′ (the union of all bad sets should cover at most half of all n′-bit strings). When constructing the next A_s, we ensure that this ratio decreases only by a poly(n)-factor. How?
Assume that A_{s−1} is already constructed; its size is at most 2^{j_{s−1}}. Condition (3) for A guarantees that A_{s−1} can be covered by A-sets of size at most 2^{j_s}, and we need about 2^{j_{s−1} − j_s} covering sets (up to a poly(n) factor). Now we let A_s be the covering set that contains the maximal number of non-deleted elements of A_0 ∩ … ∩ A_{s−1}. The ratio can decrease only by the same poly(n) factor. In this way we get

(the number of non-deleted strings in A_0 ∩ A_1 ∩ … ∩ A_s) ≥ α^{−s} 2^{j_s} / 2,

where α stands for the poly(n) factor mentioned above.¹¹

¹¹ Note that for values of s close to N the right-hand side can be less than 1; the inequality then claims just the existence of non-deleted elements. The induction step is still possible: a non-deleted element is contained in one of the covering sets.

Up to now we assumed that the set of deleted elements is fixed. What happens when more strings are deleted? The number of non-deleted elements in A_0 ∩ … ∩ A_s can decrease, and at some point, for some s, it can become less than the declared threshold ν_s = α^{−s} 2^{j_s} / 2. Then we find the minimal s where this happens and rebuild all the sets A_s, A_{s+1}, … (for A_{s−1} the threshold is not crossed due to the minimality of s). In this way we update the sets A_s from time to time, replacing them (and all the subsequent ones) by new versions when needed.

The problem with this construction is that the number of updates (different versions of each A_s) can be too big. Imagine that after an update some element is deleted, and the threshold is crossed again. Then a new update is necessary, and after this update the next deletion can trigger a new update, etc. To keep the number of updates reasonable, we agree that after the update, for all the new sets A_l (starting from A_s), the number of non-deleted elements in A_0 ∩ …
∩ A_l is twice bigger than the threshold ν_l = α^{−l} 2^{j_l} / 2. This can be achieved if we make the factor α twice bigger: since for A_{s−1} we have not crossed the threshold, for A_s we can guarantee the inequality with an additional factor 2.

Now let us prove the bound for the number of updates of some A_s. These updates can be of two types: first, when A_s itself starts the update (being the minimal s where the threshold is crossed); second, when the update is induced by one of the previous sets. Let us estimate the number of updates of the first type. Such an update happens when the number of non-deleted elements (which was at least 2ν_s immediately after the previous update of any kind) becomes less than ν_s. This means that at least ν_s elements were deleted. How can this happen? One possibility is that a new bad set of complexity at most i_s (a "large" bad set) appears after the last update. This can happen at most O(2^{i_s}) times, since there are at most O(2^i) objects of complexity at most i. The other possibility is the accumulation of elements deleted due to "small" bad sets, of complexity at least i_s and of size at most 2^{j_s}. The total number of such elements is bounded by n·O(2^{i_s + j_s}), since the sum i_l + j_l may only decrease as l increases. So the number of updates of A_s not caused by large bad sets is bounded by

n·O(2^{i_s + j_s}) / ν_s = O(n 2^{i_s + j_s}) · α^s · 2^{−j_s} = O(n α^s 2^{i_s}) = 2^{i_s + N·O(log n)} = 2^{i_s + O(√(n log n))}

(recall that s ≤ N, α = poly(n), and N ≈ √(n / log n)). This bound remains valid if we take into account the induced updates (when the threshold is crossed for one of the preceding sets): there are at most N ≤ n such sets, and the additional factor n is absorbed by the O-notation.
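The induction step used above (among the sets covering the current intersection, keep the one that retains the most non-deleted elements, so the survival ratio drops by at most the number of covering sets) can be sketched as follows. The set sizes and names here are illustrative toy choices, not the parameters of the proof.

```python
def next_good_set(intersection, covering_sets, deleted):
    """One induction step of the construction: among the sets covering the
    current intersection A_0 ∩ ... ∩ A_{s-1}, choose the one containing the
    most non-deleted elements.  If there are m covering sets, the number of
    surviving elements drops by at most a factor m (a poly(n) factor)."""
    best = max(covering_sets, key=lambda A: len((intersection & A) - deleted))
    return best, intersection & best

# toy example: 16 elements, covered by four disjoint blocks of four
intersection = set(range(16))
blocks = [set(range(4 * b, 4 * b + 4)) for b in range(4)]
deleted = {0, 1, 2, 3, 4, 5, 8, 9, 12}
best, new_intersection = next_good_set(intersection, blocks, deleted)
print(best)  # the block keeping the most non-deleted elements
```

Here the last block keeps three of its four elements alive, so it is the one selected; in the proof the same choice guarantees the ratio bound ν_s.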
We conclude that all the versions of A_s have complexity at most i_s + O(√(n log n)), since each of them can be described by the version number plus the parameters of the generating process (we need to know n and the boundary curve, whose complexity is O(√n) according to our assumption; see the discussion before the statement of the theorem). The same is true for the final version.

It remains to take x in the intersection of the final sets A_s. (Recall that A_N is a singleton, so the final A_N is {x}.) Indeed, by construction this x has no bad (i ∗ j)-descriptions where (i, j) is on the boundary of T. On the other hand, x has good descriptions that are O(√(n log n))-close to this boundary and whose vertical coordinates are √(n log n)-apart. (Recall that the slope of the boundary guarantees that the horizontal distance is less than the vertical distance.) Therefore the position of the boundary curve for P^A_x is determined with precision O(√(n log n)), as required.¹²

Remark 13. In this proof we may use bad sets not only from A. Therefore, the set P_x is also close to T (and the same is true for every family B that contains A). It would be interesting to find out what the possible combinations of P_x and P^A_x are; as we have seen, it may happen that P_x is maximal and P^A_x is minimal, but this does not say anything about other possible combinations.

For the case of Hamming balls the statement of Theorem 10 has a natural interpretation. To find a simple ball of radius r that contains a given string x is the same as to find a simple string in a radius-r ball centered at x. So this theorem shows the possible behavior of the "approximation complexity" function

r ↦ min { C(x′) | d(x, x′) ≤ r },

where d is the Hamming distance. One should only rescale the vertical axis, replacing the log-sizes of Hamming balls by their radii.
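This rescaling between log-size and radius can be checked numerically: the log-size of a Hamming ball of radius r ≤ n/2 is close to the Shannon entropy bound n·H(r/n) (up to an additive O(log n) term). A small sketch, with arbitrary values of n and r:

```python
import math

def ball_log_size(n, r):
    """log2 of the number of points in a Hamming ball of radius r in {0,1}^n."""
    return math.log2(sum(math.comb(n, i) for i in range(r + 1)))

def entropy_bound(n, r):
    """n * H(r/n), the Shannon-entropy approximation of the log-size."""
    p = r / n
    if p in (0.0, 1.0):
        return 0.0
    return n * (-p * math.log2(p) - (1 - p) * math.log2(1 - p))

n = 1000
for r in (50, 200, 500):
    # the two values agree up to an O(log n) additive term for r <= n/2
    print(r, round(ball_log_size(n, r), 1), round(entropy_bound(n, r), 1))
```

For r close to n/2 the two values nearly coincide; for smaller r the gap is of order log n, which is absorbed by the precision of the theorems in this section.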
The connection is described by the Shannon entropy function: a ball in B^n of radius r has log-size about nH(r/n) for r ≤ n/2, and has almost full size for r > n/2. For example, error-correcting codes (in the classical sense, or with list decoding) give examples of strings for which this function is almost constant for small values of r: it is almost as easy to approximate a codeword as to give it precisely (due to the possibility of error correction).

6.3 Randomness and optimality deficiencies: restricted case

Not all the results proved for unrestricted descriptions have natural counterparts in the restricted case. For example, one can hardly relate the set P^A_x with bounded-time complexity (it is completely unclear how A could enter the picture). Still some results remain valid (but new and much more complicated proofs are needed). This is the case for Propositions 8 and 9. Let again A be a class of descriptions that satisfies requirements (1)–(3).

¹² Now we see why N was chosen to be √(n / log n): the bigger N is, the more points on the curve we have, but then the number of versions of the good sets and their complexity increases, so we have some trade-off. The chosen value of N balances these two sources of errors.

Theorem 11 ([46]).
• If a string x of length n has an (i ∗ j)-description in A, then it has an ((i + d + O(log n)) ∗ (j − d + O(log n)))-description in A for every d ≤ j.
• Assume that x is a string of length n that has at least 2^k different (i ∗ j)-descriptions in A. Then it has an ((i − k + O(log n)) ∗ (j + O(log n)))-description in A.

In fact, the second part uses only condition (1); it says that A is enumerable. The first part also uses (3). It can be combined with the second part to show that x also has an ((i + O(log n)) ∗ (j − k + O(log n)))-description in A.
Though Theorem 11 looks like a technical statement, it has important consequences; it implies that the two approaches based on randomness and optimality deficiencies remain equivalent in the case of a bounded class of descriptions. The proof technique can also be used to prove the Epstein–Levin theorem [11], as explained in [31]; a similar technique was used by A. Milovanov in [25], where a common model for several strings is considered.

Proof. The first part is easy: having some (i ∗ j)-description for x, we can search for a covering by sets of the right size that exists due to condition (3); since A is enumerable, we can do it algorithmically until we find this covering. Then we select the first set in the covering that contains x; the bound for the complexity of this set is guaranteed by the size of the covering.

The proof of the second statement is much more interesting. In fact, there are two different proofs: one uses a probabilistic existence argument, and the second is more explicit. But both of them start in the same way. Let us enumerate all (i ∗ j)-descriptions from A, i.e., all finite sets that belong to A, have cardinality at most 2^j and complexity at most i. For a fixed n, we start a selection process: some of the generated descriptions are marked (=selected) immediately after their generation. This process should satisfy the following requirements: (1) at any moment every n-bit string x that has at least 2^k descriptions (among the enumerated ones) belongs to one of the marked descriptions; (2) the total number of marked sets does not exceed 2^{i−k} p(n) for some polynomial p. Note that for i > n or j > n the statement is trivial, so we may assume that i and j (and therefore k) do not exceed n; this explains why the polynomial depends only on n.
If we have such a strategy (of logarithmic complexity), then the marked set containing x will be the required description of complexity i − k + O(log n) and log-size j. Indeed, this marked set can be specified by its ordinal number in the list of marked sets, and this ordinal number has i − k + O(log n) bits.

So we need to construct a selection strategy of logarithmic complexity. We present two proofs: a probabilistic one and an explicit construction.

Probabilistic proof. First we consider a finite game that corresponds to our situation. Two players alternate; each makes 2^i moves. At each move the first player presents some set of n-bit strings, and the second player replies, saying whether it marks this set or not. The second player loses if after some move the number of marked sets exceeds 2^{i−k+1}(n + 1) ln 2 (this specific value follows from the argument below) or if there exists a string x that belongs to 2^k sets of the first player but does not belong to any marked set. Since this is a finite game with full information, one of the players has a winning strategy. We claim that the second player can win. If this is not the case, the first player has a winning strategy. We get a contradiction by showing that the second player has a probabilistic strategy that wins with positive probability against any strategy of the first player. So we assume that some (deterministic) strategy of the first player is fixed, and consider the following simple probabilistic strategy: every set A presented by the first player is marked with probability p = 2^{−k}(n + 1) ln 2. The expected number of marked sets is p·2^i = 2^{i−k}(n + 1) ln 2. By Chebyshev's inequality, the number of marked sets exceeds the expectation by a factor of 2 with probability less than 1/2.
So it is enough to show that the second bad case (after some move there exists x that belongs to 2^k sets of the first player but does not belong to any marked set) happens with probability at most 1/2. For that, it is enough to show that for every fixed x the probability of this bad event is at most 2^{−(n+1)}, and then use the union bound. The intuitive explanation is simple: if x belongs to 2^k sets, the second player had (at least) 2^k chances to mark a set containing x (when these 2^k sets were presented by the first player), and the probability to miss all these chances is at most (1 − p)^{2^k}; the choice of p guarantees that this probability is less than 2^{−(n+1)}. Indeed, using the bound (1 − 1/x)^x < 1/e, it is easy to show that

(1 − p)^{2^k} < e^{−(n+1) ln 2} = 2^{−(n+1)}.

The pedantic reader would say that this argument is not formally correct, since the behavior of the first player (and the moment when the next set containing x is produced) depends on the moves of the second player, so we do not have independent events with probability 1 − p each (as is assumed in the computation).¹³

¹³ The same problem appears if we observe a sequence of independent coin tosses with probability of success p, select some trials (before they are actually performed, based on the information obtained so far), and ask for the probability of the event "the first t selected trials were all unsuccessful". This probability does not exceed (1 − p)^t; it can be smaller if the total number of selected trials is less than t with positive probability. This scheme was considered by von Mises when he defined random sequences using selection rules, so it should be familiar to algorithmic randomness people.
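The choice of p can be checked numerically: whenever p = 2^{−k}(n + 1) ln 2 is a genuine probability (p < 1), missing all 2^k marking chances indeed has probability below 2^{−(n+1)}. A small sketch; the parameter values are arbitrary.

```python
import math

def miss_probability(n, k):
    """Probability of missing all 2^k marking chances when every presented
    set is marked independently with probability p = 2^(-k) (n+1) ln 2."""
    p = (n + 1) * math.log(2) / 2 ** k
    return (1 - p) ** (2 ** k)

for n in (8, 16, 64):
    for k in (8, 12):
        p = (n + 1) * math.log(2) / 2 ** k
        if p < 1:  # otherwise p is not a probability and the bound is moot
            # the bound from the text: (1-p)^(2^k) < 2^-(n+1)
            assert miss_probability(n, k) < 2 ** -(n + 1)
print("bound verified")
```

The inequality holds with some slack because (1 − p)^{1/p} is strictly below 1/e for every p in (0, 1).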
The formal argument considers, for each t, the event R_t: "after some move of the second player the string x belongs to at least t sets provided by the first player, but does not belong to any marked set". Then we prove by induction (over t) that the probability of R_t does not exceed (1 − p)^t. Indeed, it is easy to see that R_t is a union of several disjoint subsets (depending on the events happening until the first player provides t + 1 sets containing x), and R_{t+1} is obtained by taking a (1 − p)-fraction of each of them.

Constructive proof. We consider the same game, but now allow more sets to be marked (replacing the bound 2^{i−k+1}(n + 1) ln 2 by a bigger bound i^2 n 2^{i−k} ln 2) and also allow the second player to mark sets that were produced earlier (not necessarily at the current move of the first player). The explicit winning strategy for the second player performs in parallel i − k + log i substrategies (indexed by the numbers log(2^k/i), …, i). The substrategy number s wakes up once in 2^s moves (when the number of moves made by the first player is a multiple of 2^s). It considers the family S that consists of the 2^s last sets produced by the first player, and the set T that consists of all strings x covered by at least 2^k/i sets from S. Then it selects and marks some elements of S in such a way that all x ∈ T are covered by one of the selected sets. It is done by a greedy algorithm: first take a set from S that covers the maximal part of T, then the set that covers the maximal number of non-covered elements, etc. How many steps do we need to cover the entire T? Let us show that (i/2^k) n 2^s ln 2 steps are enough. Indeed, every element of T is covered by at least 2^k/i sets from S. Therefore, some set from S covers at least #T·2^k/(i 2^s) elements, i.e., a 2^{k−s}/i-fraction of T.
At the next step the non-covered part is multiplied by (1 − 2^{k−s}/i) again, and after i n 2^{s−k} ln 2 steps the number of non-covered elements is bounded by

#T (1 − 2^{k−s}/i)^{i n 2^{s−k} ln 2} < 2^n (1/e)^{n ln 2} = 1,

therefore all elements of T are covered. (Instead of a greedy algorithm one may use a probabilistic argument and show that i n 2^{s−k} ln 2 randomly chosen sets from S cover T with positive probability; however, our goal is to construct an explicit strategy.) Anyway, the number of sets selected by the substrategy number s does not exceed i n 2^{s−k} (ln 2) · 2^{i−s} = i n 2^{i−k} ln 2, and we get at most i^2 n 2^{i−k} ln 2 for all substrategies.

It remains to prove that after each move of the second player every string x that belongs to 2^k or more sets of the first player also belongs to some selected set. For the t-th move we consider the binary representation of t: t = 2^{s_1} + 2^{s_2} + …, where s_1 > s_2 > … If x does not belong to the sets selected by the substrategies with numbers s_1, s_2, …, then the multiplicity of x among the first 2^{s_1} sets is less than 2^k/i, the multiplicity of x among the next 2^{s_2} sets is also less than 2^k/i, etc. For those j with 2^{s_j} < 2^k/i the multiplicity of x among the respective portion of 2^{s_j} sets is obviously less than 2^k/i. Therefore, the total multiplicity of x is less than i · 2^k/i = 2^k, so x belongs to fewer than 2^k sets of the first player, and the second player does not need to care about x. This finishes the explicit construction of the winning strategy.

Now we can assume without loss of generality that the winning strategy has complexity at most O(log(n + k + i + j)). (In the probabilistic argument we have proved the existence of a winning strategy, but then we can perform an exhaustive search until we find one; the first strategy found will have small complexity.)
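The covering step performed by each substrategy is the classical greedy set-cover argument: if every element lies in at least an f-fraction of the family, each greedy step shrinks the uncovered part by a factor (1 − f), so about (1/f) ln #T steps suffice. A sketch with toy parameters (not those of the proof):

```python
import math
import random

def greedy_cover(T, family):
    """Cover T greedily: repeatedly take the set that covers the most
    still-uncovered elements, as each substrategy does."""
    uncovered, chosen = set(T), []
    while uncovered:
        best = max(family, key=lambda A: len(A & uncovered))
        if not (best & uncovered):
            raise ValueError("family does not cover T")
        chosen.append(best)
        uncovered -= best
    return chosen

random.seed(0)
T = range(512)
num_sets, mult = 64, 8          # every element lies in mult of the 64 sets
family = [set() for _ in range(num_sets)]
for x in T:
    for idx in random.sample(range(num_sets), mult):
        family[idx].add(x)

chosen = greedy_cover(T, family)
# each step covers at least a mult/num_sets fraction of what remains,
# so at most (num_sets/mult) * ln|T| sets are ever chosen
assert len(chosen) <= int(num_sets / mult * math.log(512)) + 1
print(len(chosen), "sets chosen")
```

In the proof the fraction is 2^{k−s}/i and #T ≤ 2^n, which gives the i n 2^{s−k} ln 2 step bound above.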
Then we use this simple strategy to play against the enumeration of all A-sets of complexity less than i and size 2^j (or less). The selected sets can be described by their ordinal number (among the selected sets), so their complexity is bounded by i − k (with logarithmic precision). Every string that has 2^k different (i ∗ j)-descriptions in A will also have one among the selected sets, and that is what we need.

As before (for the unrestricted case), this result implies that descriptions with minimal parameters are simple with respect to the data string:

Theorem 12 ([46]). Let A be an enumerable family of finite sets. If a string x of length n has an (i ∗ j)-description A ∈ A such that C(A | x) ≥ k, then x has an ((i − k + O(log n)) ∗ (j + O(log n)))-description in A. If the family A satisfies condition (3), then x also has an ((i + O(log n)) ∗ (j − k + O(log n)))-description in A.

This gives us the same corollaries as in the unrestricted case:

Corollary 1. Let A be a family of finite sets that satisfies the conditions (1)–(3). Then for every string x of length n the three statements
• there exists a set A ∈ A of complexity at most α with d(x | A) ≤ β;
• there exists a set A ∈ A of complexity at most α with δ(x, A) ≤ β;
• the point (α, C(x) − α + β) belongs to P^A_x
are equivalent with logarithmic precision (the constants before the logarithms depend on the choice of the family A).

If we are interested in uniform statements true for every enumerable family A, the same arguments prove the following result:

Proposition 28. Let A be an arbitrary family of finite sets enumerated by some program p. Then for every x of length n the statements
• there exists a set A ∈ A such that d(x | A) ≤ β;
• there exists a set A ∈ A such that δ(x, A) ≤ β
are equivalent up to an O(C(p) + log C(A) + log n + log log #A)-change in the parameters.
7 Strong models

7.1 Information in minimal descriptions

A possible way to bring the theory into accordance with our intuition is to change the definition of "having the same information". Although we have not given that definition explicitly, we have adopted so far the following viewpoint: x and y have the same (or almost the same) information if both conditional complexities C(x | y), C(y | x) are small. If only one complexity, say C(x | y), is small, we said that all (or almost all) information contained in x is present in y. Now we will adopt a more restricted viewpoint and say that x and y have the same information if there are short total (everywhere defined) programs mapping x to y and vice versa. From this viewpoint we cannot say anymore that a string x and its shortest program x* have the same information: for example, x may be non-stochastic while x* is always stochastic, so there is no short total program that maps x* to x, because of Proposition 3.¹⁴ Let us mention that if x and y have the same information in this new sense, then there exists a simple computable bijection that maps x to y (so they have the same properties if the property is defined in the language of computability); see [28] for the proof.

Formally, let us define the total conditional complexity with respect to a computable function D of two arguments as

CT_D(x | y) = min { l(p) | D(p, y) = x, and D(p, y′) is defined for all y′ }.

(Note that D is not required to be total, but we consider only p such that D(p, y′) is defined for all y′.) There is a computable function D such that CT_D is minimal up to an additive constant. Fixing any such D, we obtain the total conditional complexity CT(x | y).

¹⁴ It is worth mentioning that, on the other hand, for every string x there is an almost minimal program for x that can be obtained from x by a simple total algorithm [42, Theorem 17].
In other words, we may define CT(x | y) as the minimal plain complexity of a total program that maps y to x. We will think that y has all (or almost all) the information from x if CT(x | y) is negligible. Formally, we write x →_ε y if CT(y | x) ≤ ε, and we call x and y ε-equivalent (and write x ↔_ε y) if both CT(y | x) and CT(x | y) are at most ε.

Proposition 29. If x ↔_ε y, then the sets P_x and P_y are in O(ε)-neighborhoods of each other.

Proof. Indeed, if A is an (i ∗ j)-description of x and p is a total program witnessing x →_ε y, then the set B = {D(p, x′) | x′ ∈ A} is an ((i + O(ε)) ∗ j)-description for y. (We need p to be total, as otherwise we cannot produce the list of B-elements from the list of A-elements and p.)

7.2 An attempt to separate "good" models from "bad" ones

Now we have a more fine-grained classification of descriptions and can try to distinguish between descriptions that were equivalent in the former sense. For example, consider a string xy where y is random conditional to x. Let A be a model for xy consisting of all extensions of x (of the same length). This model looks good (in particular, it has negligible optimality deficiency). On the other hand, we may consider a standard model B for xy of the same (or smaller) complexity. It also has negligible optimality deficiency but looks unnatural. In this section we are interested in the following question: how can we formally distinguish good models like A from bad models like B? We will see that at least for some strings u the value CT(A | u) can be used to distinguish between good and bad models for u. (Indeed, in our example CT(A | xy) is small, while CT(B | xy) can be large.)

Definition 5. A set A ∋ x is an ε-strong model (or statistic) for a string x if CT(A | x) ≤ ε.

For instance, the model A discussed above is an O(log n)-strong model for x.
On the other hand, we will see later that, if y is chosen appropriately, then no standard description B of the same complexity and log-cardinality as A is an ε-strong model for x, even for ε = Ω(n).

The following proposition explains the meaning of a strong model by providing an equivalent definition. A finite family A of sets is called a partition if for every A_1, A_2 ∈ A we have A_1 ∩ A_2 ≠ ∅ ⇒ A_1 = A_2. For any partition we can define its complexity. The proposition states that strong statistics are those that belong to simple partitions.

Proposition 30. Assume that A is a model for x that belongs to a partition of complexity ε. Then A is an (ε + O(1))-strong model for x. Conversely, assume that A is an ε-strong statistic for a string x of length n. Then there is a partition A of complexity at most ε + O(log n) and a model A′ ∈ A for x such that #A′ ≤ #A and both CT(A | A′) and CT(A′ | A) are at most ε + O(log n).

Proof. Assume that A is a model for x that belongs to a partition A of complexity ε. The program that maps every given string x′ to the set A′ ∈ A to which x′ belongs (if x′ belongs to no set in A, then the program maps it, say, to the empty set) is total and has length at most ε + O(1).

Conversely, assume that A is an ε-strong statistic for a string x of length n. Then there is a total program p such that p(x) = A and |p| ≤ ε. Consider the set X of all strings x′ with x′ ∈ p(x′). Obviously, x ∈ X. Partition X according to the value of p(x′): strings x′ and x′′ are in the same set of the partition if p(x′) = p(x′′). The constructed partition A has complexity at most ε + O(log n). It includes the set A′ = {x′ ∈ X | p(x′) = A}, which includes x and can be obtained from x by a total program of length at most ε + O(log n): that program maps a given string x′ to the set {x′′ ∈ X | p(x′′) = p(x′)}.
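The second half of this proof is easy to mirror in code: given any total map p (a stand-in for a total program), group the strings x′ with x′ ∈ p(x′) by the value of p. The particular map used below is an illustrative toy choice.

```python
from itertools import product

def partition_from_total_map(p, strings):
    """Build the partition of X = {x : x in p(x)} into classes with equal
    value of p, as in the proof of Proposition 30.  The blocks are disjoint
    by construction, since each x lands in the class of its own p-value."""
    classes = {}
    for x in strings:
        model = frozenset(p(x))
        if x in model:
            classes.setdefault(model, set()).add(x)
    return list(classes.values())

strings = ["".join(bits) for bits in product("01", repeat=4)]
# toy total map: x is described by the set of strings sharing its first two bits
p = lambda x: {y for y in strings if y[:2] == x[:2]}
blocks = partition_from_total_map(p, strings)
print(sorted(len(b) for b in blocks))  # four cylinder blocks of four strings
```

Here every string belongs to its own description, so X is everything and the partition is exactly the four cylinders; for a map p with x ∉ p(x) for some x, those strings simply drop out of X.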
Since A′ ⊂ A, we have #A′ ≤ #A. It remains to show that both CT(A | A′) and CT(A′ | A) are less than |p| + O(log n) = ε + O(log n). Indeed, A′ can be obtained from A by a total program of length |p| + O(log n) that maps a given set B to {x′ ∈ X | p(x′) = B}. On the other hand, A can be obtained from A′ by a total program of length |p| + O(1) that, for a given set B, picks any element x′ from B and computes p(x′). If B is empty, then that program outputs, say, the empty set. Recall that A′ is non-empty, as it includes x.

Strong models satisfy an analog of Proposition 8 (the same proof works):

Proposition 31. Let x be a string and A an ε-strong model for x. Let i be a non-negative integer such that i ≤ log #A. Then there exists an (ε + O(log i))-strong model A′ for x such that #A′ ≤ #A/2^i and C(A′) ≤ C(A) + i + O(log i).

To take into account the strength of models, we may consider the set

P_x(ε) = {(i, j) | x has an ε-strong (i ∗ j)-description}.

Obviously, we have P_x(ε) ⊂ P_x = P_x(n + O(1)) for all strings x of length n and for all ε. If the set P_x(ε) is not much smaller than P_x for a reasonably small ε, we will say that x is a "normal" string; otherwise we call x "strange". More precisely, a string x is called (ε, δ)-normal if P_x is in the δ-neighborhood of P_x(ε). Otherwise, x is called (ε, δ)-strange. It turns out that there are (√(n log n), O(log n))-normal strings with any given set P_x that satisfies the conditions of Theorem 1. On the other hand, there are (Ω(n), Ω(n))-strange strings of length n. We are going to state these facts accurately.

Theorem 13 ([26]). Let k ≤ n be two integers and let t_0 > t_1 > … > t_k be a strictly decreasing sequence of integers such that t_0 ≤ n and t_k = 0.
Then there exists a string x of complexity k + O(√(n log n)) and length n + O(log n) for which the distance between both sets P_x and P_x(O(log n)) and the set

T = {(i, j) | (i ≤ k) ⇒ (j ≥ t_i)}

is at most O(√(n log n)).

Proof. Consider the family A of all cylinders, i.e., the family of all sets of the form {ur | l(r) = m} for different strings u and natural numbers m. Sets from this family have the following feature: if A ∋ x, then A is an O(log n)-strong model for x. Hence for all strings x we have P^A_x = P^A_x(O(log n)). By Theorem 10 and Remark 13 there is a string x of length n + O(log n) and complexity k + O(√(n log n)) such that all the sets P_x, P^A_x, T are O(√(n log n))-close to each other. Hence all three sets are close to the set P^A_x(O(log n)) as well. As the set P_x(O(log n)) includes the latter set and is included in P_x, all three sets are close to the set P_x(O(log n)) as well.

The next theorem, from [42], shows that "strange" strings do exist.

Theorem 14. Assume that natural numbers k, n, ε satisfy the inequalities O(1) ≤ ε ≤ k ≤ n. Then there is a string x of length n and complexity k + O(log n) such that the sets P_x and P_x(ε) are O(log n)-close to the sets shown on Fig. 9. (The set P_x is to the right of the dashed line. The set P_x(ε) is to the right of the solid line. The difference between the sets has the shape of a parallelogram.)

We will prove this theorem later, in Section 7.4. To illustrate its statement, let k = 2n/3 and ε = n/3 in Theorem 14. Then the sets P_x and P_x(n/3) are almost n/3-apart, since the point (n/3, n/3) is in the O(log n)-neighborhood of P_x while all points from P_x(n/3) are (n/3 − O(log n))-apart from (n/3, n/3) (in the l_∞-norm). Thus the string x is (n/3, n/3 − O(log n))-strange.
Figure 9: The sets P_x and P_x(ε) for the strange string from Theorem 14, with O(log n)-precision (axes: complexity and log-cardinality). The set P_x is to the right of the dashed line. The set P_x(ε) is to the right of the solid line.

Recall that we have introduced the notion of a strong model to separate good models from bad ones. We will now present some results that justify this approach. The following theorem by Milovanov states, roughly speaking, that there exists a string x of length n and a strong model A for x such that the parameters (complexity, log-cardinality) of every strong standard model B for x are Ω(n)-far from those of A.

Theorem 15. For some positive c, for almost all k there is a string x of length n = 4k whose profile P_x is O(log n)-close to the gray set shown on Fig. 10, such that
• there is an O(log n)-strong model A for x with complexity k + O(log n) and log-cardinality 2k (that model witnesses the point (k, 2k) on the border of P_x), but
• for every m ≥ C(x) and for every simple enumeration of strings of complexity at most m, the standard model B for x obtained from that enumeration is either not strong for x or its parameters are far from the point (k, 2k). More specifically, if B is a model for x obtained from an enumeration provided by some program q, then at least one of the values CT(B | x), C(q), |C(B) − k|, |log #B − 2k| is larger than ck.

We will prove this theorem in Section 7.6.

Figure 10: The profile P_x of a string x from Theorem 15 (axes: C(A) and log #A, with corner points (k, 2k) and (k, 3k)).

7.3 Properties of strong models

Once we have decided that non-strong descriptions are bad, it is natural to restrict ourselves to strong descriptions with negligible randomness deficiency (and hence negligible optimality deficiency). Consider some n-bit string x.
Assume that A is an ε-strong description for x. Let u be the ordinal number of x in A with respect to some fixed order. Then CT(x | A, u) = O(1) and CT(A, u | x) ≤ ε + O(1) (the latter inequality holds since CT(A | x) ≤ ε and u can be easily found when x and A are known). That is, x ↔_{ε+O(1)} (A, u). As the pairs (A, u) are naturally partitioned into classes, in this way we obtain an alternative proof of Proposition 30: the image of that partition under the total mapping witnessing (A, u) →_{O(1)} x satisfies the second claim of Proposition 30.

Assume further that the randomness deficiency of x in A is at most ε. As u is random and independent of A (with precision ε; note that C(u | A) ≈ C(x | A) ≥ log #A − ε), the sets Q_{A,u} and Q_A are (ε + O(log n))-close (Theorem 7). On the other hand, the sets Q_{A,u} and Q_x are (ε + O(1))-close by Proposition 29. Thus we obtain the first property of strong models:

Proposition 32. If both CT(A | x) and log #A − C(x | A) are at most ε, then the sets Q_x and Q_A are O(ε + log l(x))-close.

Assume that A is an ε-strong model for x with negligible randomness deficiency and negligible ε; for simplicity we ignore these negligible quantities in the sequel. Assume that A is normal in the sense described above. Then the string x is normal as well. Indeed, for every pair (i, j) ∈ P_x with i ≤ C(A) the pair (i, j − log #A) is in P_A (Theorem 8; note that x is equivalent to (A, u) and u is random with condition A), and hence there is a strong (i ∗ (j − log #A))-description B for A. Consider the "lifting" of B, that is, the union of all sets from B that have approximately the same size as A. It is a strong (i ∗ j)-description for x. It remains to consider pairs (i, j) ∈ P_x with i > C(A). Then i + j ≥ C(A) + log #A = C(x).
Hence the subset of A consisting of all strings x′ whose ordinal number in A has the same i − C(A) leading bits as the ordinal number of x is a strong (i ∗ j)-description for x.

It turns out that for minimal models the converse is true as well. A model A for x is called (δ, κ)-minimal if there is no model B for x with C(B) ≤ C(A) − δ and δ(x, B) ≤ δ(x, A) + κ. The smaller δ and the larger κ are, the stronger the property of (δ, κ)-minimality is. Recall that an ε-sufficient statistic is a model whose optimality deficiency is smaller than ε.

Theorem 16 ([26]). For some value κ = O(log n) the following holds. Assume that A is an ε-strong and ε-sufficient statistic for an (ε, ε)-normal string x of length n. Assume also that A is a (δ, κ)-minimal model for x. Then A is (O(δ + (ε + log n)√n), O(δ + (ε + log n)√n))-normal.

Remark 14. In the original theorem from [26] it is claimed only that A is (O((δ + ε + log n)√n), O((δ + ε + log n)√n))-normal. However, the arguments from [26] prove the theorem as stated here.

This theorem will be proved in Section 7.7. The next theorem states that the total conditional complexity of any strong, sufficient and minimal statistic for x, conditioned on any other sufficient statistic for x, is negligible.

Theorem 17 ([41]). For some value κ = O(log n) the following holds. Assume that A, B are ε-sufficient statistics for a string x of length n. Assume also that A is a (δ, ε + κ)-minimal statistic for x. Then C(A | B) = O(δ + log n). Moreover, if, additionally, A is an ε-strong model for x, then CT(A | B) = O(ε + δ + log n).

This theorem can be interpreted as follows: assume that we have removed some noise from a given data string x by finding its description B with negligible optimality deficiency.
Let A be any "ultimately denoised" model for x, i.e., a minimal model for x with negligible optimality deficiency. Then C(A | B) is negligible (the first part of Theorem 17). Hence to obtain the "ultimately denoised" model for x we do not need x: any such model can be obtained from B by a short program. The second part of Theorem 17 shows that any such strong model A can be obtained from B by a short total program.

7.4 Strange strings

In this section we prove the existence of strange strings (Theorem 14). Then we prove that there are many such strings (Theorem 18). Let A ↦ [A] denote a computable bijection from the family of finite sets to the set of binary strings. This bijection will be used to encode finite sets of strings.

Lemma 2. Assume that A is an ε-strong statistic for a string x of length n. Let y = [A] be the code of A. Then y has an ((ε + O(log n)) ∗ n)-description.

Proof. Let p be a string of length at most ε such that p(x) = y and p(x′) is defined for all strings x′. Consider the set {p(x′) | x′ ∈ {0,1}^n}. Its cardinality is at most 2^n and its complexity is at most ε + O(log n).

Proof of Theorem 14. To prove the theorem it suffices to find a set A ⊂ {0,1}^n with

(a) C(A) ≤ ε + O(log n), log #A ≤ k − ε

which is not covered by the union of sets from the following three families:

(b) the family B consisting of all sets B ⊂ {0,1}* with C(B) ≤ ε, log #B ≤ n − ε − 4,

(c) the family C consisting of all sets M with C(M) ≤ k, log #M ≤ n − k − 4 whose code [M] has an ((ε + O(log n)) ∗ n)-description, and

(d) the family D consisting of all singleton sets {x} where C(x) < k.

Indeed, assume that we have such a set A. As x we can take any non-covered string in A. Notice that item (a) implies that the complexity of x is at most k + O(log n), and item (d) implies that it is at least k.
Thus the membership of the pair (k + O(log n), O(log n)) in P_x(ε) is witnessed by the singleton {x}, provided ε is greater than the constant from the inequality CT({x} | x) = O(1). The membership of the pair (O(log n), n) in P_x(ε) and P_x will be witnessed by the set of all strings of length n, provided ε is greater than the constant from the inequality CT({0,1}^n | x) = O(1). The membership of the pair (ε + O(log n), k − ε) in P_x will be witnessed by the set A.

The upper bounds for P_x(ε) and P_x follow from (b), (c), and Lemma 2. Indeed, item (b) implies that the pair (ε, n − ε − O(1)) is not in P_x (hence by Proposition 31, for any i ≤ ε the pair (i − O(log n), n − i) is not in P_x either). Item (c) implies that the pair (k − O(log n), n − k − O(1)) is not in P_x(ε).

Let us show that there is a set A of n-bit strings satisfying (a), (b), (c) and (d). A direct counting reveals that the family B ∪ C ∪ D covers at most

2^{ε+1} · 2^{n−ε−4} + 2^{k+1} · 2^{n−k−4} + 2^k ≤ 2^{n−3} + 2^{n−3} + 2^{n−4} < 2^{n−1}

strings, and hence at least half of all n-bit strings are non-covered. However, we cannot let A be an arbitrary 2^{k−ε}-element non-covered set of n-bit strings, as in that case C(A) could be large.

We first show how to find A, as in (a), that is not covered by B ∪ D (but may be covered by C). This may be done using the same technique as in the proof of Theorem 1. To construct A, notice that both families B and D can be enumerated given k, ε, n by running the universal machine U in parallel on all inputs. We start such an enumeration and construct A "in several attempts". During the construction we maintain the list of all strings covered by sets from B ∪ D enumerated so far. Such strings are called marked. Initially, no strings are marked and A contains the lexicographically first 2^{k−ε} strings of length n.
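As a sanity check, the counting bound above can be verified numerically over a range of parameters. The sketch below is an illustration only, not part of the proof; the parameter ranges are chosen arbitrarily, subject to the implicit constraint k ≤ n − 4 that makes the last summand at most 2^{n−4}.

```python
# Numeric sanity check of the covering bound from the proof of Theorem 14:
# the families B, C, D cover at most
#   2^(eps+1)*2^(n-eps-4) + 2^(k+1)*2^(n-k-4) + 2^k  <  2^(n-1)
# strings, so at least half of all n-bit strings remain non-covered.

def covered_upper_bound(n, k, eps):
    return 2**(eps + 1) * 2**(n - eps - 4) + 2**(k + 1) * 2**(n - k - 4) + 2**k

for n in range(8, 64):
    for k in range(4, n - 3):          # k <= n - 4, so 2^k <= 2^(n-4)
        for eps in range(0, k + 1):
            assert covered_upper_bound(n, k, eps) < 2**(n - 1)
print("bound holds on all tested (n, k, eps)")
```

The first two summands are each exactly 2^{n−3} regardless of ε and k, which is why the bound holds with room to spare.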
Each time a new set B ∈ B appears, all its elements receive a b-mark and we replace A by any set consisting of 2^{k−ε} yet non-marked n-bit strings. Each time a new set {x} in D appears, the string x receives a d-mark, but we do not immediately replace A. However, we do so when all strings in A have received a d-mark, replacing it by any set consisting of 2^{k−ε} yet non-marked n-bit strings. The above counting shows that such replacements are always possible. The last version of A (i.e., the version obtained after the last set in B ∪ D has appeared) is the sought set. Indeed, by construction #A = 2^{k−ε} and A is not covered by sets in B ∪ D. It remains to verify that C(A) ≤ ε + O(log n). This follows from the fact that A is replaced at most O(2^ε) times, and hence can be identified by the number of its replacements and ε, k, n (we run the above construction of A and wait until the given number of replacements have been made). Why is A replaced at most O(2^ε) times? The number of replacements caused by the appearance of a new set B ∈ B is at most 2^{ε+1}. The number of strings with a d-mark is at most 2^k, and hence A can be replaced at most 2^k / 2^{k−ε} = 2^ε times due to receiving d-marks.

Now we have to take into account strings covered by sets from the family C. To this end, modify the construction as follows: put a c-mark on all strings from each set C enumerated into C, and replace A each time all its elements have received c- or d-marks (or when a new set is enumerated into B). However, this modification alone is not enough. Indeed, up to Ω(2^n) strings may receive a c-mark, and hence A might be replaced up to Ω(2^{n−(k−ε)}) times due to c-marks. The crucial modification is the following: each time A is replaced, its new version is not just any set of 2^{k−ε} non-marked n-bit strings, but a carefully chosen such set.
To explain how to choose A, we first represent C as an intersection of two families, C′ and C′′. The first family C′ consists of all sets M with C(M) ≤ k, and the second family C′′ consists of all sets C with log #C ≤ n − k − 4 whose code [C] has an ((ε + O(log n)) ∗ n)-description. The first family is small (fewer than 2^{k+1} sets), while the second family contains only small sets (at most 2^{n−k−4}-element sets) and is not very large (#C′′ = 2^{O(n)}). Both families can be enumerated given ε, k, n and, moreover, the sets from C′′ appear in the enumeration in at most 2^{ε+O(log n)} portions. Due to this property of C′′ we can update A each time a new portion of sets in C′′ appears; this increases the number of replacements of A by 2^{ε+O(log n)}, which is OK. The crucial change in the construction is the following: each time A is replaced, its new version is a set of 2^{k−ε} non-marked n-bit strings that has at most O(n) common strings with every set from the part of C′′ enumerated so far. (We will show later that such a set always exists.)

Why does this solve the problem? There are two types of replacements of A: those caused by enumerating a new set in B or a new bunch of sets in C′′, and those caused by all elements of A having received c- or d-marks. The number of replacements of the first type is at most 2^{ε+O(log n)}. Replacements of the second type are caused by enumerating new singleton sets in D and by enumerating new sets C in C′ which were enumerated into C′′ at earlier steps. Due to the careful choice of A, when each such set C appears in the enumeration of C′, it can mark only O(n) strings in the current version of A. The total number of sets in C′ is at most 2^{k+1}. Therefore the total number of events "a string in the current version of A receives a c-mark" is at most O(n·2^k). The total number of d-marks is at most 2^k.
Hence the number of replacements of the second type is at most (O(n·2^k) + 2^k) / 2^{k−ε} = O(n·2^ε). Thus it remains to show that we can indeed always choose A as described above. This follows from a lemma that says that in a large universe one can always choose a large set that has a small intersection with every set from a given small family of small sets.

Lemma 3. Assume that a finite family C of subsets of a finite universe U is given and each set in C has at most s elements. If

#C · (N choose t+1) · (s / (#U − t))^{t+1} < 1,

then there is an N-element set A ⊂ U that has at most t common elements with each set in C.

Proof. To prove the lemma we use the probabilistic method. The first element a_1 of A is chosen at random among all elements of U with uniform distribution, the second element a_2 is chosen with uniform distribution among the remaining elements, and so forth. We have to show that the statement of the lemma holds with positive probability. To this end, note that for every fixed C in C and every fixed set of indices {i_1, ..., i_{t+1}} ⊂ {1, 2, ..., N}, the probability that all of a_{i_1}, ..., a_{i_{t+1}} fall into C is at most (s / (#U − t))^{t+1}. The number of sets of indices as above is (N choose t+1). By the union bound, the probability that a random set A does not satisfy the lemma is upper bounded by the left-hand side of the displayed inequality.

We apply the lemma with U consisting of all non-marked n-bit strings, N = 2^{k−ε}, and C consisting of all sets in C′′ that have appeared so far. Thus s = 2^{n−k−4}, #U ≥ 2^{n−1}, #C = 2^{O(n)}, and we need to show that for some t = O(n) it holds that

2^{O(n)} · (2^{k−ε} choose t+1) · (2^{n−k−4} / (2^{n−1} − t))^{t+1} < 1,

which easily follows from the inequality (2^{k−ε} choose t+1) ≤ 2^{(k−ε)(t+1)}.

Theorem 14 does not say anything about how rare strange strings are.
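The probabilistic choice in Lemma 3 is simple enough to simulate at toy scale. The sketch below is an illustration only, with arbitrary small parameters: it samples an N-element subset uniformly, checks the intersection condition, and also evaluates the union-bound expression from the lemma.

```python
import random
from math import comb

def small_intersection_set(universe, family, N, t, attempts=1000):
    """Try to sample an N-element subset of `universe` meeting each set
    of `family` in at most t elements (the probabilistic method of Lemma 3)."""
    universe = list(universe)
    for _ in range(attempts):
        A = set(random.sample(universe, N))
        if all(len(A & C) <= t for C in family):
            return A
    return None

# Toy parameters: universe of 512 elements, 50 sets of 16 elements each.
random.seed(0)
family = [set(random.sample(range(512), 16)) for _ in range(50)]
N, t, s = 64, 8, 16

# Lemma 3's condition: #C * (N choose t+1) * (s/(#U - t))^(t+1) < 1.
bound = len(family) * comb(N, t + 1) * (s / (512 - t)) ** (t + 1)
print("union-bound value:", bound)

A = small_intersection_set(range(512), family, N, t)
assert A is not None and len(A) == N
assert all(len(A & C) <= t for C in family)
```

For these parameters the union-bound value is well below 1, so a uniformly random 64-element set already works with high probability; the retry loop only guards against the small residual failure probability.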
Such strings are rare, since for the majority of strings x of length n the set {0,1}^n is a strong MSS for x. A more meaningful question is whether such strings might appear with high probability in a statistical experiment. More specifically, assume that we sample a string x from a given set A ⊂ {0,1}^n, where all elements are equiprobable. Might it happen that with high probability (say, with probability 99%) x is strange? An affirmative answer to this question is given by the following theorem.

Theorem 18. Assume that natural numbers k, n, ε, δ satisfy the inequalities O(1) ≤ ε ≤ k ≤ n and δ ≤ k − ε. Then there is a set A ⊂ {0,1}^n of cardinality 2^{k−ε} and complexity at most ε + O(δ + log n) such that all but 2^{k−ε−δ} of its elements x have complexity k + O(δ + log n), and the sets P_x and P_x(ε) are O(δ + log n)-close to the sets shown in Fig. 9.

Theorem 18 is proved similarly to Theorem 14. The only difference is that we change A each time at least 2^{k−ε−δ} strings in A have received c- or d-marks. As a result, the number of changes of A increases 2^δ times and the complexity of A increases by δ.

7.5 Strong sufficient statistics

In this section we prove Theorem 17.

Proof of the first claim: C(A | B) ≤ O(δ + log n). Recall the notion of a standard description and Proposition 18. By that proposition there exist standard descriptions A′, B′ ∋ x that are "better" than A, B, respectively, i.e.,

δ(x, A′) ≤ δ(x, A) + O(log n), C(A′ | A) = O(log n),
δ(x, B′) ≤ δ(x, B) + O(log n), C(B′ | B) = O(log n).

Let i, j denote the complexities of A′, B′, respectively. Ignoring O(log n) terms, we have

C(A | B) ≤ C(A | A′) + C(A′ | Ω_i) + C(Ω_i | Ω_j) + C(Ω_j | B′) + C(B′ | B).

We will show that each term on the right-hand side of this inequality is at most δ + O(log n). For the last term this directly follows from the construction of B′.
The second and fourth terms can be estimated as follows. By Proposition 19, A′ and B′ are O(log n)-equivalent to Ω_i and Ω_j, respectively; thus the second and fourth terms are O(log n).

To upper-bound the third term, we first show that i ≤ j + δ + O(log n). By the construction of B′ we have δ(x, B′) ≤ δ(x, B) + O(log n). Since B is ε-sufficient, this inequality implies that δ(x, B′) ≤ ε + O(log n). On the other hand, δ(x, A) ≥ −O(log n), and thus δ(x, B′) ≤ δ(x, A) + ε + O(log n). Choose the value κ in the statement of the theorem so that this inequality implies δ(x, B′) ≤ δ(x, A) + ε + κ. We assume that A is a (δ, ε + κ)-minimal statistic for x, and hence C(B′) ≥ C(A) − δ, that is, j ≥ C(A) − δ. On the other hand, by the construction of A′ we have i = C(A′) ≤ C(A) + O(log n), and hence i ≤ j + δ + O(log n). By Proposition 14 we have

C(Ω_i | Ω_j) ≤ C(Ω_i | Ω_{j+δ+O(log n)}) + δ + O(log n).

On the other hand, the inequality i ≤ j + δ + O(log n) and Proposition 14 imply that C(Ω_i | Ω_{j+δ+O(log n)}) = O(log n).

It remains to estimate the first term. Repeating the argument from the beginning of the last-but-one paragraph, we can show that C(A′) ≥ C(A) − δ. On the other hand, by the construction of A′ we have C(A′ | A) = O(log n). By symmetry of information this implies that C(A | A′) ≤ δ + O(log n). Indeed, ignoring O(log n) terms, we have:

C(A | A′) = C(A) + C(A′ | A) − C(A′) = C(A) − C(A′) ≤ δ.

Proof of the second claim: CT(A | B) ≤ O(ε + δ + log n) (provided A is a strong model for x). It suffices to show that CT(A | B) is close to C(A | B) provided A is a strong model for x. Recall that A ↦ [A] denotes a computable bijection from the family of finite sets to the set of binary strings, which is used to encode finite sets of strings. At the top level, the argument is as follows.
Let p be a program witnessing CT(A | x) ≤ ε. We first show that A has the following feature: there are many strings x′ ∈ B with p(x′) = [A]. More specifically, at least a 2^{−C(A|B)} fraction of the x′ from B have this property. At most 2^{C(A|B)} sets A′ can have this feature, as each such A′ can be identified by the portion of x′ ∈ B with p(x′) = [A′]. Given B and p, we are able to find a list of all such A′ by means of a short total program. Given B, the set A can then be identified by p and its index in that list.

Let us proceed to the detailed proof, in which we will ignore terms of order O(ε + log k). First we show that there are many x′ ∈ B with p(x′) = [A] (otherwise B could not be a sufficient statistic for x). Let

D = {x′ ∈ B | p(x′) = [A]}.

We have C(D | B) ≤ C(A | B), as C(D | B) ≤ C(D | A) + C(A | B) and C(D | A) = C(D | [A]) ≤ C(p) + O(1) ≤ ε + O(1). Obviously, D contains x; thus, given B and p, the string x can be identified by its index in D. Therefore

C(x | B) ≤ C(D | B) + log #D ≤ C(A | B) + log #D.

On the other hand, C(x | B) = log #B, as B is ε-sufficient. Hence log #D ≥ log #B − C(A | B). Recall that we ignored terms of order O(ε + log k); actually, we have shown that log #D ≥ l for some l = log #B − C(A | B) − O(ε + log k).

Consider now all A′ for which the set D′, defined in a similar way, satisfies the same lower bound on its cardinality. That is, consider sets A′ with

log #{x′ ∈ B | p(x′) = [A′]} ≥ l.

Each such A′ can be identified by the portion of x′ ∈ B with p(x′) = [A′]. Thus there are at most 2^{C(A|B)+O(ε+log k)} different such A′. Given B, C(A | B), p and ε, we are able to find the list of all such A′. The program that maps B to the list of such A′ is obviously total.
Therefore there is a (C(A | B) + O(ε + log k))-bit total program that maps B to A, and CT(A | B) = C(A | B) + O(ε + log k).

7.6 Normal strings and standard descriptions

Here we prove Theorem 15, i.e., we exhibit an example of a normal string x such that no standard description is a strong sufficient minimal statistic for x. Our string x will be obtained from an antistochastic string in the sense of Section 3.3.

Definition 6. A string x of length n and complexity k is called ε-antistochastic if for all (m, l) ∈ P_x either m ≥ k − ε or m + l ≥ n − ε. (The profile of an antistochastic string is shown in Fig. 11.)

Figure 11: The profile of an ε-antistochastic string x of length n and complexity k for a small ε.

By Theorem 1, for every k < n there exists an O(log n)-antistochastic string of length n and complexity k + O(log n). It is easy to see that all antistochastic strings are normal. Recall that a string x is called (ε, δ)-normal if P_x is in the δ-neighborhood of P_x(ε).

Proposition 33. Let x be an ε-antistochastic string of length n and complexity k. Then x is (O(log n), O(log n) + ε)-normal.

Proof. To prove the claim, it suffices to construct, for every point (i, j) on the boundary of the set shown in Figure 11, an O(log n)-strong ((i + O(log n)) ∗ j)-description A for x. If i ≥ k then let A = {x}. Otherwise let A be the set of all strings of length n whose first i bits are the same as those of x. By construction, C(A) ≤ i + O(log n) and log #A = n − i = j.

Proof of Theorem 15. Let y be an O(log k)-antistochastic string of complexity k + O(log k) and length 2k, which exists by Theorem 1. Let z be a string of length 2k such that C(z | y) ≥ 2k. Finally, let x = yz be the concatenation of y and z. Obviously, C(x) = 3k with accuracy O(log k).
By Theorem 8, the set P_x is O(log k)-close to the gray set in Fig. 10. From the normality of y it is not difficult to derive that x is (O(log k), O(log k))-normal. Let A = {yz′ | l(z′) = 2k}. Then A is an O(log k)-strong statistic for x. Now we need to show that every standard description whose parameters (complexity, log-cardinality) are close to those of A is not strong.

Let q be a program enumerating all strings of complexity at most m, and let B be the standard model for x obtained from this enumeration. We have to show that for some positive c, for almost all k, we have

min{CT(B | x), C(q), |C(B) − k|, |log #B − 2k|} > ck. (1)

Fix some small positive c. For the sake of contradiction, assume that for infinitely many k there are m, q, B such that inequality (1) does not hold.

At the top level, the argument is as follows. We have assumed that C(B) and log #B are close to k and 2k, respectively. The shape of P_x guarantees that in this case B is a sufficient minimal statistic for x. We have also assumed that CT(B | x) is small, that is, B is a strong statistic for x. Hence we can apply Theorem 17 to B and A and conclude that CT(B | A) is small. As A is strongly equivalent to y, the total conditional complexity CT(B | y) is small as well. On the other hand, we will show that the total complexity of any standard statistic B, obtained from an enumeration of strings of complexity at most m, conditional on any string y, is larger than min{C(B), m − l(y)}. In our case m is at least C(x) ≈ 3k, l(y) = 2k and C(B) ≈ k; thus CT(B | y) ≥ k with accuracy O(log k), which is a contradiction if c is small enough.

Let us proceed to the detailed proof. We show first that m = O(k). Recall that B is a standard description obtained from a list of strings of complexity at most m enumerated by the program q.
We have seen that in this case C(B) + log #B = m + O(C(q) + log m). Hence m ≤ (k + ck) + (2k + ck) = O(k) with accuracy O(C(q) + log m). We have assumed that C(q) < ck = O(k), and therefore m = O(k).

Now we apply Theorem 17. Recall its statement: for some value κ = O(log n) the following holds. Assume that A, B are ε-sufficient statistics for a string x of length n. Assume also that B is a (δ, ε + κ)-minimal statistic for x and B is an ε-strong model for x. Then CT(B | A) = O(ε + δ + log n). Fix such a κ = O(log n) = O(log k) (recall that n = 4k). The models A, B are ε-sufficient and B is ε-strong for

ε = max{δ(x, B), δ(x, A), CT(B | x)} ≤ 2ck + O(log k).

Besides, the shape of P_x guarantees that B is (δ, t)-minimal for

δ = |C(B) − k| + O(log k) ≤ ck + O(log k),
t = k − δ(x, B) − O(log k) ≥ k − 2ck − O(log k).

If c < 1/4 then for all large enough k we have t ≥ ε + κ; thus B is (δ, ε + κ)-minimal, and hence we can apply Theorem 17. Therefore

CT(B | A) ≤ O(ε + δ + log k) = O(ck + log k),

and hence CT(B | y) = O(ck + log k). The contradiction is obtained from the latter inequality and the following lemma.

Lemma 4. Assume that B is any standard model obtained from an enumeration of strings of complexity at most m by a program q, and y is any string. Then

CT(B | y) ≥ min{C(B), m − l(y)} − O(C(q) + log n),

where n = max{l(y), m}.

Proof. Denote by b the lexicographically first element of B. We can estimate the total complexity of b conditional on y as follows:

CT(b | y) ≤ CT(b | B) + CT(B | y) + O(log n).

The first term here is O(1), and hence CT(b | y) ≤ CT(B | y) + O(log n). Denote by p a total program of this length transforming y to b, and consider the set

D := {p(y′) | l(y′) = l(y)}.

Obviously, C(D) ≤ |p| + O(log n).
Thus it suffices to show that C(D) ≥ min{C(B), m − l(y)} − O(log n). Every element of D has complexity at most

C(D) + log #D + O(log n) ≤ C(D) + l(y) + O(log n).

If the right-hand side of this inequality is larger than m, then C(D) ≥ m − l(y) − O(log n) and the lemma follows. Therefore we may assume that all elements of D have complexity at most m. Run the program q until it prints all elements of D. Since b ∈ D and b ∈ B, there are at most 2·#B elements of complexity at most m that have not been printed yet (all the elements of B that are enumerated after b, and fewer than #B elements that are enumerated after all elements of B). So we can find the list of all strings of complexity at most m from D, q and some extra log #B + 1 bits. Since this list has complexity m − O(log m), we get

C(D) + C(q) + log #B ≥ m − O(log m).

Recall that for any standard model obtained from an enumeration of strings of complexity at most m, the sum of its complexity and log-cardinality is at most m plus the complexity of the enumerating program. As B is a standard model, we have

C(B) + log #B ≤ m + O(C(q) + log m).

Summing these inequalities, we get C(D) + O(C(q) + log m) ≥ C(B).

Lemma 4 thus implies that

min{C(B), m − 2k} − O(C(q) + log k) ≤ CT(B | y) = O(ck + log k).

Since C(B) ≥ k − ck and m ≥ C(x) = 3k − O(log k), we have k ≤ O(ck + log k). The constant hidden in this O(ck + log k) notation is an absolute constant, call it C. Therefore if c < 1/C we obtain a contradiction.

7.7 A strong sufficient minimal model for a normal string is normal

Here we prove Theorem 16. We will need two lemmas. We first state the lemmas informally and outline the proof of the theorem; then we provide rigorous formulations of the lemmas and a rigorous proof of the theorem.
By Propositions 18 and 19, for every A ∋ x there is B ∋ x (a standard model) such that C(B | Ω_{C(B)}) ≈ 0 and the parameters of B are not worse than those of A. We will need a similar result for normal strings and for strong models.

Lemma 5 (informal). Assume that A is a minimal statistic for some string. Then C(Ω_{C(A)} | A) ≈ 0.

Lemma 6 (informal). For every normal string x and every model A for x there exists a strong statistic H for x with (1) δ(x, H) ≲ δ(x, A), (2) C(H | Ω_{C(H)}) ≈ 0 and (3) C(H) ≤ C(A).

Now we sketch the proof of Theorem 16 using the lemmas.

A sketch of the proof of Theorem 16. Let x be a normal string and let A be a strong, sufficient and minimal statistic for x. We have to prove that the profile of A is close to its strong profile. First we show that w.l.o.g. we may assume that A belongs to a simple partition. Indeed, since A is a strong statistic for x, we may apply Proposition 30 to A and x. Let A be a simple partition and A_1 a model from A, both provided by Proposition 30. As the total conditional complexities CT(A_1 | A) and CT(A | A_1) are small, the profiles of A and A_1 are close to each other; this also applies to strong profiles. Therefore it suffices to show that A_1 is normal.

To this end, consider any model G (a family of sets) for A_1. Our goal is to find a strong model F for A_1 whose parameters (complexity, log-cardinality) are not worse than those of G. To do that we will find a model M_1 for x such that

CT(M_1 | A_1) ≈ 0, log #(M_1 ∩ A_1) ≈ log #A_1,
C(M_1) ≤ C(G), C(M_1) + log #M_1 ≤ C(G) + log #G + log #A_1. (2)

Then we will let

F = {A′ ∈ A | log #(A′ ∩ M_1) = log #(A_1 ∩ M_1)}.

(Here and in what follows, by log we mean the integer part of the binary logarithm.) The family F can be computed from M_1, A and log #(A_1 ∩ M_1). As A is simple, we conclude that C(F) ≤ C(M_1) ≤ C(G).
Moreover, CT(F | M_1) ≈ 0, as the mapping

M′ ↦ {A′ ∈ A | log #(A′ ∩ M′) = log #(A_1 ∩ M′)}

is total. Since CT(M_1 | A_1) ≈ 0, this implies that CT(F | A_1) ≈ 0, that is, F is a strong model for A_1. Finally, log #F ≤ log #M_1 − log #A_1, because A is a partition and thus has only about #M_1 / #A_1 sets with about #A_1 common elements with M_1. Thus the sum of the complexity and log-cardinality of F is at most

C(M_1) + (log #M_1 − log #A_1) ≤ (C(G) + log #G + log #A_1) − log #A_1 = C(G) + log #G.

Hence F is a strong model for A_1 whose parameters (complexity, complexity + log-cardinality) are not worse than required. From Proposition 31 it follows that the strong profile of A_1 includes the point (C(G), log #G).

How do we find a model M_1 for x satisfying (2)? We do that in three steps. In the first step we construct a model L for x such that C(L) ≤ C(G) and log #L ≤ log #G + log #A_1. More specifically, we let

L = ∪{A′ ∈ G | log #A′ = log #A_1}.

By construction we have C(L) ≤ C(G) and log #L ≤ log #G + log #A_1 (see Fig. 12).

Figure 12: The picture shows the parameters (complexity, log-cardinality) of the models G, F (for A_1) and L, M (for x).

In the second step we find a strong model M for x whose parameters (complexity, complexity + log-cardinality) are not worse than those of L and such that C(M | Ω_{C(M)}) ≈ 0; such a model exists by Lemma 6. In the third step we find a model M_1 with the same parameters as those of M that belongs to a simple partition M and such that CT(M_1 | M) ≈ 0; such a model exists by Proposition 30. By construction we have C(M_1) ≤ C(M) ≤ C(L) ≤ C(G). We have to show that CT(M_1 | A_1) ≈ 0 and log #(M_1 ∩ A_1) ≈ log #A_1.
To prove the first claim, we show first that C(M | A_1) is small. Indeed, from A_1 we can compute A (by a short program); from A we can compute Ω_{C(A)} (Lemma 5); from Ω_{C(A)} we can compute Ω_{C(M)} (indeed, Lemma 6 guarantees that C(M) ≤ C(L) ≤ C(G), and w.l.o.g. we may assume that C(G) ≤ C(A), as G is a model for A_1); and then we can compute M (as C(M | Ω_{C(M)}) ≈ 0 by Lemma 6). Proposition 30 guarantees that M_1 is simple given M. Since C(M_1 | M) ≈ 0 and C(M | A_1) ≈ 0, we have C(M_1 | A_1) ≈ 0 as well.

To show the stronger equality CT(M_1 | A_1) ≈ 0, consider the model A_1 ∩ M_1 for x. We claim that its cardinality cannot be much less than that of A_1. Indeed, since C(M_1 | A_1) ≈ 0, we have C(A_1 ∩ M_1) ≤ C(A_1). Obviously, #(A_1 ∩ M_1) ≤ #A_1. Therefore the parameters of M_1 ∩ A_1 are not worse than those of A_1. The model M_1 ∩ A_1 cannot have much better parameters than A_1, since A_1 is a sufficient statistic for x (recall that the parameters of A_1 are not worse than those of A, and A is assumed to be a sufficient statistic for x). Hence log #(A_1 ∩ M_1) ≈ log #A_1.

Recall that M_1 belongs to a simple partition M. The model M_1 can be computed by a total program from A_1 and its index among all M′ ∈ M with log #(A_1 ∩ M′) ≈ log #A_1. As M is a partition, there are few such sets M′ ∈ M. Hence CT(M_1 | A_1) ≈ 0.

Now we provide rigorous formulations and proofs of the lemmas used.

Lemma 5 (rigorous). For some κ = O(log n) the following holds. Assume that A is a (δ, κ)-minimal statistic for a string x of length n. Then C(Ω_{C(A)} | A) = O(δ + log n).

Proof. Let B be a standard model for x that is an improvement of A, existing by Proposition 18; hence δ(x, B) ≤ δ(x, A) + O(log n). If the function κ is chosen appropriately, then C(B) ≥ C(A) − δ.
We can estimate C(Ω_{C(A)} | A) as follows:

C(Ω_{C(A)} | A) ≤ C(Ω_{C(A)} | Ω_{C(B)}) + C(Ω_{C(B)} | B) + C(B | A).

Let us show that every term on the right-hand side of this inequality is O(δ + log n). For the third term this holds by construction. The second term is O(log n) since B is a standard model. For the first term it holds since C(A) < C(B) + δ.

Lemma 6 (rigorous). Assume that A is a model for an (ε, α)-normal string x of length n with ε ≤ n and α < √n/2. Then there is a set H such that:

1) H is an ε-strong statistic for x,
2) δ(x, H) ≤ δ(x, A) + O((α + log n)·√n),
3) C(H | Ω_{C(H)}) = O(√n),
4) C(H) ≤ C(A) + α.

Proof. Consider the sequence B_0, A_1, B_1, A_2, B_2, ... of statistics for x defined as follows. Let B_0 = A. For all i, let A_{i+1} be a strong statistic for x obtained from B_i by using the assumption that x is (ε, α)-normal:

C(A_{i+1}) ≤ C(B_i) + α, log #A_{i+1} ≤ log #B_i + α, CT(A_{i+1} | x) ≤ ε.

(See Fig. 13.) For all i, let B_i be a standard description that is the improvement of A_i obtained by Proposition 18:

δ(B_i, x) ≤ δ(A_i, x) + O(log n), C(B_i | A_i) = O(log n).

Figure 13: Parameters of the statistics A_i and B_i.

Denote by N the minimal integer such that C(A_N) − C(B_N) ≤ √n. Thus for all i < N we have C(B_i) < C(A_i) − √n. On the other hand, the complexity of A_{i+1} is larger than that of B_i by at most α < √n/2. Therefore for all i < N we have C(A_{i+1}) < C(A_i) − √n/2. Since C(A_1) = O(n) (recall that CT(A_1 | x) ≤ ε ≤ n) and C(A_N) ≥ 0, we have N = O(√n).

Let H := A_N. By construction H is strong (the first claim of the lemma) and C(H) ≤ C(A_1) ≤ C(A) + α (the last claim). From N = O(√n) it follows that the second condition is satisfied. It remains to estimate C(H | Ω_{C(H)}).
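The termination bound N = O(√n) used above is a simple quantitative fact: a nonnegative quantity that starts at O(n) and drops by at least √n/2 per step can make only O(√n) steps. A small numeric sketch of this fact follows; it is an illustration only, and the "complexities" in it are arbitrary numbers, not actual Kolmogorov complexities (which are uncomputable).

```python
import math
import random

def steps_until_stop(start, drop):
    """Count how many times we can subtract at least `drop`
    from `start` before the value falls below `drop`."""
    n_steps, value = 0, start
    while value >= drop:
        # each step of the construction loses at least drop = sqrt(n)/2,
        # and possibly more (the model may also get simpler for free)
        value -= drop + random.random() * drop
        n_steps += 1
    return n_steps

random.seed(1)
for n in [100, 10_000, 1_000_000]:
    start = 3 * n                  # stands in for C(A_1) = O(n)
    drop = math.sqrt(n) / 2        # guaranteed loss per step
    N = steps_until_stop(start, drop)
    assert N <= start / drop       # N = O(sqrt(n)): here at most 6*sqrt(n)
```

Each iteration subtracts at least `drop`, so the loop runs at most `start / drop` = 6√n times regardless of the random extra loss.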
To this end we use the following inequality:

C(A_N | Ω_{C(A_N)}) ≤ C(A_N | B_N) + C(B_N | Ω_{C(B_N)}) + C(Ω_{C(B_N)} | Ω_{C(A_N)}).

We have to show that all terms in the right-hand side are O(√n). This bound holds for the first term because N was chosen so that C(A_N) − C(B_N) ≤ √n. By construction, C(B_N | A_N) = O(log n). Therefore we can estimate C(A_N | B_N) using the symmetry of information:

C(A_N | B_N) = C(A_N) + C(B_N | A_N) − C(B_N) = C(A_N) − C(B_N) ≤ √n

(with logarithmic accuracy). The second term is O(log n) by construction (recall that B_N is a standard description). To estimate the third term, note that by construction we have C(B_N | A_N) = O(log n) and hence C(B_N) − C(A_N) = O(log n).

The proof of Theorem 16. To complete the proof of Theorem 16 we have to verify that all approximate equalities and inequalities used in the sketch of the proof hold with accuracy O(δ + (ε + log n)√n).

In the construction of A₁ we use Proposition 30. Hence the error terms in the inequalities relating A and A₁ are of order O(ε + log n). The complexity of the partition is of the same order.

Construction of L: the error term in the inequality C(L) ≤ C(G) is of order O(log log #A₁) = O(log n), and the error term in the second inequality is constant.

In the construction of M we use Lemma 6 for α = ε (we can assume that the condition ε < √n/2 of the lemma is met, as otherwise the statement of the theorem is obvious). The error terms in Lemma 6 are of order O((ε + log n)√n), and hence the parameters (complexity, complexity + log-cardinality) of M exceed those of L by at most O((ε + log n)√n). Then we use Lemma 5 to estimate C(Ω_{C(A)} | A), which is thus of order O(δ + log n). Therefore the error term in the equality C(M | A₁) ≈ 0 is at most O(δ + (ε + log n)√n).

In the construction of M₁ we use Proposition 30.
Hence both complexities CT(M₁ | M) and CT(M | M₁) are O(ε + log n). The complexity of A₁ ∩ M₁ is at most O(δ + (ε + log n)√n) larger than that of A₁, and hence the equality log #(A₁ ∩ M₁) ≈ log #A₁ holds with accuracy O(δ + (ε + log n)√n). This implies that the equality CT(M₁ | A₁) ≈ 0 also holds with this accuracy.

Construction of F: we have seen that the equality CT(M₁ | A₁) ≈ 0 holds with accuracy O(δ + (ε + log n)√n). The equality CT(F | M₁) ≈ 0 holds with accuracy O(ε + log n), since the complexity of A is O(ε + log n). The error terms in the other inequalities are those from the previous steps and hence are of order O(δ + (ε + log n)√n).

Remark 15. We have used the sufficiency of A only to show that log #(A₁ ∩ M₁) ≈ log #A₁. This conclusion can be derived from a weaker assumption about C(A) and the profile of x. Namely, we can assume that the profile of x does not drop to the right of the point (C(A), log #A). More specifically, let c be the absolute constant such that C(M₁ ∩ A₁) ≤ C(A) + c(δ + (ε + log n)√n) (see the proof). Then we can drop the assumption of ε-sufficiency of A at the expense of adding log #A₁ − log #(M₁ ∩ A₁) to the error term in the conclusion of the theorem. In this way we can prove the following version of the theorem:

For some value κ = O(log n) and some constant c the following holds. Assume that A is an ε-strong statistic for an (ε, ε)-normal string x of length n. Assume also that A is a (δ, κ)-minimal model for x. Finally, assume that there is no model A′ ∋ x with C(A′) ≤ C(A) + c(δ + (ε + log n)√n) and log #A′ ≤ log #A − ξ. Then A is (O(ξ + δ + (ε + log n)√n), O(ξ + δ + (ε + log n)√n))-normal.
If A is an ε-sufficient model for x, then the last assumption holds with ξ = O(δ + (ε + log n)√n), and hence this version implies the original one (but not the other way around).

7.8 The number of strings with a given profile

In this section we consider the following questions. Let P be a non-empty subset of ℕ². How many strings have a profile close to P? How many normal strings have a profile close to P? We assume that the set P satisfies the necessary conditions from Theorem 1, i.e., P is an upward closed set such that (a, b + c) ∈ P implies (a + b, c) ∈ P for all integers a, b, c. Thus by Theorem 1 there is at least one string whose profile is close to P, and by Theorem 13 there is at least one normal string whose profile is close to P. To estimate the number of such strings more precisely, we introduce the following three parameters:

k_P = min { t | (t, 0) ∈ P },
n_P = min { t | (0, t) ∈ P },
m_P = min { t | (t, k_P − t) ∈ P }.

The meaning of these numbers is as follows: if P is close to the profile of some string x, then k_P is close to its complexity and m_P is close to the complexity of a minimal sufficient statistic for x. The parameter n_P can be understood as "the generalized length" of x (the minimal log-cardinality of a set of negligible complexity containing x). Note that m_P ≤ k_P ≤ n_P. Indeed, by definition the pair (k_P, 0) is in P, hence the set in which m_P is the minimal element is not empty; thus m_P is well defined and is not larger than k_P. The second inequality follows from the property (a, b + c) ∈ P ⇒ (a + b, c) ∈ P applied to a = 0, b = n_P, c = 0: the pair (0, n_P) is in P, hence so is (n_P, 0).

Theorem 19.
1) There exist at least 2^{k_P − m_P − O(1)} strings whose profile is O(C(P) + log n_P)-close to P.
2) There exist at least 2^{k_P − m_P − O(1)} strings that are (O(log n_P), O(√n_P log n_P))-normal and whose profile is O(C(P) + √n_P log n_P)-close to P.

Proof.
To prove the first statement of the theorem, consider the following auxiliary set P̃:

P̃ = { (i, j) | i ≤ m_P, (i, j + k_P − m_P) ∈ P } ∪ { (i, j) | i > m_P }.

Note that the term k_P − m_P is non-negative, since k_P ≥ m_P. Our first goal is to construct a string x such that P_x is close to P̃. To this end we apply Theorem 1 to n = n_P + m_P − k_P, k = m_P, and the numbers t₀, t₁, …, t_k such that the broken line (0, t₀)–(1, t₁)–…–(k, t_k) is the boundary curve of P̃. By that theorem there is a string x of length n + O(log n) and complexity m_P + O(log n) whose profile is O(C(P) + log n)-close to P̃.

Then we consider all strings y of length k_P − m_P such that C(y | x) ≥ k_P − m_P − 1. By a counting argument there are at least 2^{k_P − m_P − 1} such strings. We claim that for each such y the profile of the pair (x, y) is close to P. Indeed, by Theorem 8 the set P_{(x,y)} can be obtained from the set P_x by the following transformation φ:

φ(P_x) = { (i, j + k_P − m_P) | i ≤ C(x), (i, j) ∈ P_x } ∪ { (i, j) | i > C(x), i + j ≥ C(x, y) }.

Note that if we replace here C(x) by m_P, C(x, y) by k_P, and P_x by P̃, we obtain exactly the original set P. It is easy to verify that the transformation P_x ↦ φ(P_x) is continuous: if sets P′ and P″ are in an ε-neighborhood of each other, then φ(P′) and φ(P″) are in an O(ε)-neighborhood of each other. Besides, if in the definition of φ we use some ε-approximations of C(x) and C(x, y) in place of C(x) and C(x, y), then the resulting set is at most O(ε)-apart from the original one. Since by construction C(x) ≈ m_P and C(x, y) ≈ k_P (with accuracy O(C(P) + log n_P)), the sets φ(P_x) and P are at most O(C(P) + log n_P)-apart. Thus, by Theorem 8, the set P_{(x,y)} is close to P.
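Before turning to the second statement, the parameters k_P, n_P, m_P defined above are easy to experiment with when a profile is given as an explicit finite set of points. The following sketch is our own toy illustration, not from the paper: the helper names and the box bound N are ours. It computes the three parameters for two model profiles, the profile of a random string (boundary on the line i + j = n) and the profile of an antistochastic string (boundary follows i + j = n until it hits the vertical line i = k).

```python
# Toy illustration (not from the paper): a profile is represented as an
# explicit finite set of pairs (complexity, log-cardinality), clipped to
# the box [0, N] x [0, N].  Helper names are ours.

N = 40

def k_P(P):
    # minimal t with (t, 0) in P: approximately the complexity of x
    return min(t for (t, j) in P if j == 0)

def n_P(P):
    # minimal t with (0, t) in P: the "generalized length" of x
    return min(t for (i, t) in P if i == 0)

def m_P(P):
    # minimal t with (t, k_P - t) in P: approximately the complexity
    # of a minimal sufficient statistic for x
    k = k_P(P)
    return min(t for t in range(k + 1) if (t, k - t) in P)

# Profile of a random string of length n: boundary is the line i + j = n,
# so the minimal sufficient statistic is trivial (m_P = 0).
n = 20
P_random = {(i, j) for i in range(N + 1) for j in range(N + 1) if i + j >= n}

# Profile of an antistochastic string of complexity k < n: the boundary
# stays on i + j = n until it reaches the vertical line i = k.
k, n = 10, 30
P_anti = {(i, j) for i in range(N + 1) for j in range(N + 1)
          if i + j >= n or i >= k}

print(k_P(P_random), n_P(P_random), m_P(P_random))  # 20 20 0
print(k_P(P_anti), n_P(P_anti), m_P(P_anti))        # 10 30 10
```

In the first case k_P − m_P = 20, so Theorem 19 promises at least 2^{20 − O(1)} strings with this profile; in the second case k_P − m_P = 0 and the bound degenerates to a constant, consistent with antistochastic strings being rare.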
The second statement is proved by a similar argument: we just use Theorem 13 (about the existence of normal strings with a given profile) instead of Theorem 1.

Now we provide an upper bound for the number of strings whose profile is close to P. In the proof we will use standard descriptions and Proposition 17. By Proposition 17, for every standard description A obtained from an enumeration of strings of complexity at most k, the sum of the complexity of A and its log-cardinality is not greater than k + c log k for some constant c. Fix such a constant c and let

m_P(ε) := min { t | (t + ε, k_P − t + c log(k_P + 2ε) + ε) ∈ P }.

The definition of m_P(ε) is similar to that of m_P. Let L(P, ε) denote the set of all strings whose profile is ε-close to P. We will assume that ε ≤ k_P.

Theorem 20. log #L(P, ε) ≤ k_P − m_P(ε) + 2ε + O(log n_P).

Proof. To simplify notation, let m = m_P(ε). We will show that for every x ∈ L(P, ε) we have C(x | Ω_m) ≤ k_P − m + ε (ignoring logarithmic terms). This bound obviously implies the theorem, as there are few short programs and each program maps Ω_m to at most one string.

First we show that C(Ω_m | x) is small for all x ∈ L(P, ε):

Lemma 7. For every x from L(P, ε) we have C(Ω_m | x) = O(log n_P).

Proof. Denote by k the complexity of x. Since x belongs to L(P, ε), the point (k_P + ε, ε) belongs to P_x, hence k = C(x) ≤ k_P + 2ε. Let A be the standard description obtained from the list of all strings of complexity at most k. By the choice of c we have C(A) + log #A ≤ k + c log k. Hence the pair (C(A), k − C(A) + c log k) is in the profile of x. This implies that the pair (C(A) + ε, k − C(A) + c log(k_P + 2ε) + ε) is in P. Recall that m is defined as the minimal value of C(A) satisfying this property; thus we have C(A) ≥ m.
By Proposition 17 we have C(Ω_{C(A)} | A) = O(log k) and hence C(Ω_m | A) = O(log k). Also C(A | x) = O(log k), as A is a sufficient statistic for x. So we have

C(Ω_m | x) ≤ C(Ω_m | A) + C(A | x) + O(log n_P) = O(log n_P).

It remains to upper-bound C(x | Ω_m) for every x ∈ L(P, ε). We will do that using the symmetry of information:

C(x | Ω_m) = C(x) + C(Ω_m | x) − C(Ω_m) + O(log n_P).

Recall that C(Ω_m) = m + O(log m) and that C(x) ≤ k_P + 2ε. So by Lemma 7 we get C(x | Ω_m) ≤ k_P + 2ε − m + O(log n_P).

7.9 Open questions

1. Is it true that for every minimal strong sufficient statistic A and every strong sufficient statistic B for x we have CT(B | A) ≈ 0? More specifically, is there a constant c such that the following holds: Assume that A, B are ε-strong, ε-sufficient statistics for a string x of length n. Assume further that there is no (ε + c log n)-strong (ε + c log n)-sufficient statistic A′ for x with C(A′) ≤ C(A) − δ. Is it true that CT(A | B) = O(ε + δ + log n) in this case?

2. The same question, but this time we further assume that B also satisfies the minimality requirement: there is no (ε + c log n)-strong (ε + c log n)-sufficient statistic B′ for x with C(B′) ≤ C(B) − δ.

Note that if, in these two questions, we replace total conditional complexity with plain conditional complexity, then the answers are positive; moreover, we do not need to assume that A, B are ε-strong (see Theorem 17).

3. (Merging strong sufficient statistics.) Assume that A, B are strong sufficient statistics for x whose intersection is small compared to the cardinality of at least one of them. Then it is natural to conjecture that there is a strong sufficient statistic D for x of larger cardinality (= of smaller complexity) that is simple given both A and B.
Formally, is it true (for some constant c) that if A, B are ε-strong ε-sufficient statistics for x, then there is a cε-strong cε-sufficient statistic D for x with

log #D ≥ log #A + log #B − log #(A ∩ B) − c(ε + log n)

and with CT(D | A), CT(D | B) at most c(ε + log n)?

(A motivating example: let x be a random string of length n, let A consist of all strings of length n that have the same prefix of length n/2 as x, and let B consist of all strings of length n that have the same bits in positions n/4 + 1, …, 3n/4 as x. In this case it is natural to let D consist of all strings of length n that have the same bits in positions n/4 + 1, …, n/2 as x, so that log #D = log #A + log #B − log #(A ∩ B).)

8 Acknowledgments

We are grateful to several people who contributed to and/or carefully read preliminary versions of this survey, in particular to B. Bauwens, P. Gács, A. Milovanov, G. Novikov, A. Romashchenko, P. Vitányi, and to all participants of the Kolmogorov seminar at Moscow State University and of the ESCAPE group in LIRMM. We are also grateful to an anonymous referee for correcting several mistakes.

References

[1] L.M. Adleman, Time, space and randomness. MIT report MIT/LCS/TM-131, March 1979.

[2] L. Antunes, B. Bauwens, A. Souto, A. Teixeira, Sophistication vs. Logical Depth, Theory of Computing Systems, First Online, DOI 10.1007/s00224-016-9672-6.

[3] L. Antunes, L. Fortnow, Sophistication revisited. Theory of Computing Systems, 45(1), 150–161 (June 2009).

[4] L. Antunes, L. Fortnow, D. van Melkebeek, Computational depth, Proceedings of the 16th IEEE Conference on Computational Complexity, 266–273, IEEE, New York, 2001. Journal version: Computational depth: Concept and applications, Theoretical Computer Science, 354(3), 391–404 (2006).

[5] L. Antunes, A. Matos, A. Souto, P.
Vitányi, Depth as Randomness Deficiency, Theory of Computing Systems, 45(4), 724–739 (2009).

[6] B. Bauwens, Computability in statistical hypotheses testing, and characterizations of independence and directed influences in time series using Kolmogorov complexity, PhD thesis, University of Gent, May 2010.

[7] C.H. Bennett, Logical Depth and Physical Complexity, in The Universal Turing Machine: a Half-Century Survey, edited by Rolf Herken, 227–257, Oxford University Press, 1988.

[8] L. Bienvenu, D. Desfontaines, A. Shen, What Percentage of Programs Halt? Proceedings of ICALP 2015, 42nd International Colloquium, Kyoto, Japan, July 6–10, 2015, Lecture Notes in Computer Science, 9134, 219–230. Extended version: Generic algorithms for halting problem and optimal machines revisited, arXiv:1505.00731.

[9] L. Bienvenu, P. Gács, M. Hoyrup, C. Rojas, A. Shen, Algorithmic tests and randomness with respect to a class of measures, Proceedings of the Steklov Institute of Mathematics, 274, 34–89 (2011). Russian version: Алгоритмические тесты и случайность относительно классов мер, Труды математического института имени В.А. Стеклова, 274, 41–102 (2011).

[10] T. Cover, Kolmogorov complexity, data compression and inference. In: The Impact of Processing Techniques on Communications, ed. J.K. Skwirzynski, Martinus Nijhoff Publishers, 1985.

[11] S. Epstein, L. Levin, Sets have simple members, arXiv:1107.1458, reposted as http://arxiv.org/abs/1403.4539.

[12] P. Gács, On the relation between descriptional complexity and algorithmic probability, Theoretical Computer Science, 22, 71–93 (1983).

[13] P. Gács, J. Tromp, P.M.B. Vitányi, Algorithmic statistics, IEEE Transactions on Information Theory, 47(6), 2443–2463 (2001).

[14] A.N.
Kolmogorov, Three Approaches to the Quantitative Definition of Information [Russian: Три подхода к определению понятия «количество информации»], Problems of Information Transmission [Проблемы передачи информации], 1(1), 4–11 (1965). English translation published in: International Journal of Computer Mathematics, 2, 157–168 (1968).

[15] A.N. Kolmogorov, Talk at the Information Theory Symposium in Tallinn, Estonia (then USSR), 1974. [As reported by Cover in his 1985 paper [10].]

[16] A.N. Kolmogorov, The complexity of algorithms and the objective definition of randomness. Summary of the talk presented April 16, 1974 at the Moscow Mathematical Society. Успехи математических наук (Uspekhi matematicheskikh nauk, Russian), 29(4[178]), 155 (1974). See http://mi.mathnet.ru/rus/umn/v29/i4/p153. A short note in Russian.

[17] A. Kolmogorov, Talk at the seminar at Moscow State University Mathematics Department (Logic Division), 26 November 1981. [The definition of (α, β)-stochasticity was given in this talk, and the question about the fraction of non-stochastic objects was posed.]

[18] M. Koppel, Complexity, Depth and Sophistication, Complex Systems, 1, 1087–1091 (1987).

[19] M. Koppel, Structure, in The Universal Turing Machine: a Half-Century Survey, edited by Rolf Herken, 435–452, Oxford University Press, 1988.

[20] M. Koppel, H. Atlan, An almost machine-independent theory of program-length complexity, sophistication, and induction, Information Sciences, 56(1–3), 23–33 (1991).

[21] L. Levin, Randomness conservation inequalities; information and independence in mathematical theories, Information and Control, 61(1), 15–37 (1984).

[22] M. Li, P.M.B. Vitányi, An Introduction to Kolmogorov Complexity and its Applications, 3rd ed., Springer, New York, 2008.

[23] L.
Longpré, Resource bounded Kolmogorov complexity, a link between computational complexity and information theory, Ph.D. Thesis, Department of Computer Science, Cornell University, TR 86-776, 1986.

[24] A. Milovanov, Some properties of antistochastic strings. In: Computer Science – Theory and Applications, 10th International Computer Science Symposium in Russia (CSR 2015), Russia, July 13–17, 2015, Lecture Notes in Computer Science, 9139, 339–349. Journal version: Theory of Computing Systems, Online First, DOI 10.1007/s00224-016-9695-z.

[25] A. Milovanov, Algorithmic statistics, prediction and machine learning, 33rd Symposium on Theoretical Aspects of Computer Science (STACS 2016), Leibniz International Proceedings in Informatics (LIPIcs), 47, 2016, 54:1–54:13, DOI 10.4230/LIPIcs.STACS.2016.54, http://drops.dagstuhl.de/opus/volltexte/2016/5755/.

[26] A. Milovanov, Algorithmic Statistics: Normal Objects and Universal Models, Computer Science – Theory and Applications, Proceedings of the CSR 2016 conference, Lecture Notes in Computer Science, 9691, 280–293.

[27] F. Mota, S. Aaronson, L. Antunes, A. Souto, Sophistication as Randomness Deficiency, Descriptional Complexity of Formal Systems, Lecture Notes in Computer Science, 8031, 172–181 (2013).

[28] An.A. Muchnik, I. Mezhirov, A. Shen, N.K. Vereshchagin, Game interpretation of Kolmogorov complexity.

[29] An.A. Muchnik, A. Romashchenko, Stability of properties of Kolmogorov complexity under relativization, Problems of Information Transmission, 46(1), 38–61 (2010).

[30] An.A. Muchnik, A.L. Semenov, V.A. Uspensky, Mathematical metaphysics of randomness, Theoretical Computer Science, 207(2), 263–317 (November 1998).

[31] An.A. Muchnik, A. Shen, M. Vyugin, Game arguments in computability theory and algorithmic information theory.

[32] S. de Rooij, P.M.B.
Vitányi, Approximating Rate-Distortion Graphs of Individual Data: Experiments in Lossy Compression and Denoising, IEEE Transactions on Computers, 61(3), March 2012, 395–407.

[33] J. Rissanen, Modeling by shortest data description, Automatica, 14, 465–471 (1978).

[34] C.P. Schnorr, Optimal enumerations and optimal Gödel numberings, Mathematical Systems Theory, 8(2), 182–191 (1975).

[35] А. Шень, Понятие (α, β)-стохастичности по Колмогорову и его свойства, Доклады Академии наук СССР, 271(6), 1337–1340 (1983). English translation: A. Shen, The concept of (α, β)-stochasticity in the Kolmogorov sense, and its properties, Soviet Math. Dokl., 28(1), 295–299 (1983).

[36] A. Shen, Discussion on Kolmogorov complexity and statistical analysis, The Computer Journal, 42(4), 340–342 (1999).

[37] A. Shen, Around Kolmogorov complexity: Basic Notions and Results, Measures of Complexity. Festschrift for Alexey Chervonenkis, Springer, 2015, 75–116.

[38] M. Sipser, A complexity theoretic approach to randomness, Proceedings of the fifteenth annual ACM symposium on Theory of computing (STOC), 1983, 330–335.

[39] R. Solomonoff, A formal theory of inductive inference. Part I, Information and Control, 7(1), 1–22 (1964).

[40] R. Solomonoff, A formal theory of inductive inference. Part II. Applications of the Systems to Various Problems in Induction, Information and Control, 7(2), 224–254 (1964).

[41] N. Vereshchagin, Algorithmic Minimal Sufficient Statistic Revisited. In: Mathematical Theory and Computational Practice, 5th Conference on Computability in Europe (CiE 2009), Heidelberg, Germany, July 19–24, 2009, Proceedings, LNCS 5635.

[42] N. Vereshchagin, Algorithmic Minimal Sufficient Statistics: a New Approach, Theory of Computing Systems, 58(3), 463–481 (2016).

[43] N. Vereshchagin, A. Shen, Algorithmic statistics revisited, Measures of Complexity.
Festschrift for Alexey Chervonenkis, Springer, 2015, 235–252. See also: arXiv:1504.04950.

[44] Н.К. Верещагин, В.А. Успенский, А. Шень, Колмогоровская сложность и алгоритмическая случайность, 576 с., Москва, МЦНМО, 2013. Electronic version: http://www.lirmm.fr/~ashen/kolmbook.pdf. Draft English translation: http://www.lirmm.fr/~ashen/kolmbook-eng.pdf.

[45] N.K. Vereshchagin, P.M.B. Vitányi, Kolmogorov's structure functions and model selection, IEEE Transactions on Information Theory, 50(12), 3265–3290 (2004).

[46] N.K. Vereshchagin, P.M.B. Vitányi, Rate Distortion and Denoising of Individual Data Using Kolmogorov Complexity, IEEE Transactions on Information Theory, 56(7), 3438–3454 (2010).

[47] P.M.B. Vitányi, Meaningful information, IEEE Transactions on Information Theory, 52(10), 4617–4626 (2006). See also: arXiv:cs/0111053.

[48] V.V. V'yugin, On the defect of randomness of a finite object with respect to measures with given complexity bounds, SIAM Theory Probab. Appl., 32(3), 508–512 (1987).

[49] V.V. V'yugin, Algorithmic complexity and stochastic properties of finite binary sequences, The Computer Journal, 42(4), 294–317 (1999).

[50] V.V. V'yugin, Does snooping help? Theoretical Computer Science, 276(1), 407–415 (2002).

[51] C.S. Wallace, D.M. Boulton, An information measure for classification, Computer Journal, 11(2), 185–194 (1968).
