On Universal Prediction and Bayesian Confirmation


Authors: Marcus Hutter

RSISE @ ANU and SML @ NICTA, Canberra, ACT 0200, Australia
marcus@hutter1.net, www.hutter1.net
11 September 2007

Abstract

The Bayesian framework is a well-studied and successful framework for inductive reasoning, which includes hypothesis testing and confirmation, parameter estimation, sequence prediction, classification, and regression. But standard statistical guidelines for choosing the model class and prior are not always available or can fail, in particular in complex situations. Solomonoff completed the Bayesian framework by providing a rigorous, unique, formal, and universal choice for the model class and the prior. I discuss in breadth how and in which sense universal (non-i.i.d.) sequence prediction solves various (philosophical) problems of traditional Bayesian sequence prediction. I show that Solomonoff's model possesses many desirable properties: strong total and future bounds, and weak instantaneous bounds, and, in contrast to most classical continuous prior densities, it has no zero p(oste)rior problem, i.e. can confirm universal hypotheses, is reparametrization and regrouping invariant, and avoids the old-evidence and updating problem. It even performs well (actually better) in non-computable environments.

Contents

1 Introduction
2 Bayesian Sequence Prediction
3 How to Choose the Prior
4 Independent Identically Distributed Data
5 Universal Sequence Prediction
6 Discussion
A Proofs of (8), (11f), and (17)

Keywords

Sequence prediction, Bayes, Solomonoff prior, Kolmogorov complexity, Occam's razor, prediction bounds, model classes, philosophical issues, symmetry principle, confirmation theory, black raven paradox, reparametrization invariance, old-evidence/updating problem, (non)computable environments.

1 Introduction

"... in spite of its incomputability, Algorithmic Probability can serve as a kind of 'Gold Standard' for induction systems" — Ray Solomonoff (1997)

Given the weather in the past, what is the probability of rain tomorrow? What is the correct answer in an IQ test asking to continue the sequence 1,4,9,16,...? Given historic stock charts, can one predict the quotes of tomorrow? Assuming the sun rose every day for 5000 years, how likely is doomsday (that the sun does not rise) tomorrow? These are instances of the important problem of induction or time-series forecasting or sequence prediction. Finding prediction rules for every particular (new) problem is possible but cumbersome and prone to disagreement or contradiction. What is desirable is a formal general theory for prediction.

The Bayesian framework is the most consistent and successful framework developed thus far [Ear93, Jay03]. A Bayesian considers a set of environments = hypotheses = models M which includes the true data-generating probability distribution µ. From one's prior belief w_ν in environment ν ∈ M and the observed data sequence x = x_1...x_n, Bayes' rule yields one's posterior confidence in ν. In a prequential [Daw84] or transductive [Vap99, Sec.9.1] setting, one directly determines the predictive probability of the next symbol x_{n+1} without the intermediate step of identifying a (true or good or causal or useful) model.
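To make the Bayesian machinery just described concrete, here is a minimal Python sketch (my illustration, not from the paper) of posterior and predictive computation over a small finite class; the two-environment coin class and all numbers are hypothetical examples.

```python
# Minimal sketch of Bayesian prediction over a finite class of
# Bernoulli environments (illustrative toy example, not from the paper).

def posterior(prior, likelihoods):
    """Bayes' rule: P[nu|x] = w_nu * nu(x) / sum_nu' w_nu' * nu'(x)."""
    evidence = sum(w * l for w, l in zip(prior, likelihoods))
    return [w * l / evidence for w, l in zip(prior, likelihoods)]

def bernoulli_likelihood(theta, x):
    """nu_theta(x) = theta^{#1s} * (1-theta)^{#0s} for a binary string x."""
    n1 = sum(x)
    return theta ** n1 * (1 - theta) ** (len(x) - n1)

thetas = [1/3, 2/3]          # two candidate environments
prior  = [1/2, 1/2]          # indifference prior
x = [1, 1, 0, 1, 1, 1]       # observed sequence

likes = [bernoulli_likelihood(t, x) for t in thetas]
post  = posterior(prior, likes)

# Predictive probability of the next symbol being 1:
#   xi(1|x) = sum_nu P[nu|x] * nu(1|x); for i.i.d. Bernoulli, nu(1|x) = theta.
pred1 = sum(p * t for p, t in zip(post, thetas))
print(post, pred1)
```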
With the exception of Section 4, this paper concentrates on prediction rather than model identification. The ultimate goal is to make "good" predictions in the sense of maximizing one's profit or minimizing one's loss. Note that classification and regression can be regarded as special sequence prediction problems, where the sequence x_1 y_1 ... x_n y_n x_{n+1} of (x,y)-pairs is given and the class label or function value y_{n+1} shall be predicted.

The Bayesian framework leaves open how to choose the model class M and prior w_ν. General guidelines are that M should be small but large enough to contain the true environment µ, and w_ν should reflect one's prior (subjective) belief in ν, or should be non-informative or neutral or objective if no prior knowledge is available. But these are informal and ambiguous considerations outside the formal Bayesian framework. Solomonoff's [Sol64] rigorous, essentially unique, formal, and universal solution to this problem is to consider a single large universal class M_U suitable for all induction problems. The corresponding universal prior w^U_ν is biased towards simple environments in such a way that it dominates (= is superior to) all other priors. This leads to an a priori probability M(x) which is equivalent to the probability that a universal Turing machine with random input tape outputs x, and the shortest program computing x produces the most likely continuation (prediction) of x.

Many interesting, important, and deep results have been proven for Solomonoff's universal distribution M [ZL70, Sol78, Gác83, LV97, Hut01, Hut04]. The motivation and goal of this paper is to provide a broad discussion of how and in which sense universal sequence prediction solves all kinds of (philosophical) problems of Bayesian sequence prediction, and to present some recent results. Many arguments and ideas could be further developed. I hope that the exposition stimulates such a future, more detailed, investigation.

In Section 2, I review the excellent predictive and decision-theoretic performance results of Bayesian sequence prediction for generic (non-i.i.d.) countable and continuous model classes. Section 3 critically reviews the classical principles (indifference, symmetry, minimax) for obtaining objective priors, and introduces the universal prior inspired by Occam's razor and quantified in terms of Kolmogorov complexity. In Section 4 (for i.i.d. M) and Section 5 (for universal M_U) I show various desirable properties of the universal prior and class (non-zero p(oste)rior, confirmation of universal hypotheses, reparametrization and regrouping invariance, no old-evidence and updating problem) in contrast to (most) classical continuous prior densities. I also complement the general total bounds of Section 2 with some universal and some i.i.d.-specific instantaneous and future bounds. Finally, I show that the universal mixture performs better than classical continuous mixtures, even in uncomputable environments. Section 6 contains critique, summary, and conclusions.

The reparametrization and regrouping invariance, the (weak) instantaneous bounds, the good performance of M in non-computable environments, and most of the discussion (zero prior and universal hypotheses, old evidence) are new or new in the light of universal sequence prediction.
Technically and mathematically non-trivial new results are the Hellinger-like loss bound (8) and the instantaneous bounds (14) and (17).

2 Bayesian Sequence Prediction

I now formally introduce the Bayesian sequence prediction setup and describe the most important results. I consider sequences over a finite alphabet, assume that the true environment is unknown but known to belong to a countable or continuous class of environments (no i.i.d. or Markov or stationarity assumption), and consider general priors. I show that the predictive distribution converges rapidly to the true sampling distribution and that the Bayes-optimal predictor performs excellently for any bounded loss function.

Notation. I use letters t,n ∈ IN for natural numbers, and denote the cardinality of a set S by #S or |S|. I write X* for the set of finite strings over some alphabet X, and X^∞ for the set of infinite sequences. For a string x ∈ X* of length ℓ(x) = n I write x_1 x_2 ... x_n with x_t ∈ X, and further abbreviate x_{t:n} := x_t x_{t+1} ... x_{n−1} x_n and x_{<n} := x_1 ... x_{n−1}. I assume that sequence ω = ω_{1:∞} ∈ X^∞ is sampled from the "true" probability measure µ, and denote expectations w.r.t. µ by E. We say that z_t converges to z_* in mean sum (i.m.s.) if c := Σ_{t=1}^∞ E[(z_t − z_*)²] < ∞; this implies that the µ-probability that z_t deviates from z_* by more than ε at more than c/(ε²δ) times t is bounded by δ. I sometimes loosely call this the number of errors.

Sequence prediction. Given a sequence x_1 x_2 ... x_{t−1}, we want to predict its likely continuation x_t. I assume that the strings which have to be continued are drawn from a "true" probability distribution µ. The maximal prior information a prediction algorithm can possess is the exact knowledge of µ, but often the true distribution is unknown. Instead, prediction is based on a guess ρ of µ. While I require µ to be a measure, I allow ρ to be a semimeasure [LV97, Hut04]: Formally, ρ : X* → [0,1] is a semimeasure if ρ(x) ≥ Σ_{a∈X} ρ(xa) ∀x ∈ X*, and a (probability) measure if equality holds and ρ(ε) = 1, where ε is the empty string. ρ(x) denotes the ρ-probability that a sequence starts with string x. Further, ρ(a|x) := ρ(xa)/ρ(x) is the "posterior" or "predictive" ρ-probability that the next symbol is a ∈ X, given sequence x ∈ X*.

Bayes mixture. We may know or assume that µ belongs to some countable class M := {ν_1, ν_2, ...} ∋ µ of semimeasures. Then we can use the weighted average on M (Bayes mixture, data evidence, marginal)

  ξ(x) := Σ_{ν∈M} w_ν · ν(x),   Σ_{ν∈M} w_ν ≤ 1,   w_ν > 0    (1)

for prediction. One may interpret w_ν = P[H_ν] as prior belief in ν and ξ(x) = P[x] as the subjective probability of x, while µ(x) = P[x|µ] is the sampling distribution or likelihood. The most important property of semimeasure ξ is its dominance

  ξ(x) ≥ w_ν ν(x) ∀x and ∀ν ∈ M,  in particular ξ(x) ≥ w_µ µ(x)    (2)

which is a strong form of absolute continuity.

Convergence for deterministic environments. In the predictive setting we are not interested in identifying the true environment, but in predicting the next symbol well. Let us consider deterministic µ first. An environment is called deterministic if µ(α_{1:n}) = 1 ∀n for some sequence α, and µ = 0 elsewhere (off-sequence). In this case we identify µ with α, and the following holds:

  Σ_{t=1}^∞ |1 − ξ(α_t|α_{<t})| ≤ ln w_α^{−1}  and  ξ(α_{t:n}|α_{<t}) → 1 for n ≥ t → ∞    (3)

where w_α > 0 is the weight of α ≙ µ ∈ M. This shows that ξ(α_t|α_{<t}) rapidly converges to 1, and the proof is elementary: ξ(α_{1:n}) converges to some c > 0, since ξ(α_{1:n}) is monotone decreasing in n and ξ(α_{1:n}) ≥ w_µ µ(α_{1:n}) = w_µ > 0. Hence ξ(α_{1:n})/ξ(α_{1:t}) → c/c = 1 for any limit sequence t,n → ∞. The bound follows from Σ_{t=1}^n (1 − ξ(α_t|α_{<t})) ≤ −Σ_{t=1}^n ln ξ(α_t|α_{<t}) = −ln ξ(α_{1:n}) ≤ ln w_α^{−1}.

Convergence in probabilistic environments. In the general probabilistic case, let h_t(ω_{<t}) := Σ_{a∈X} (√(ξ(a|ω_{<t})) − √(µ(a|ω_{<t})))² denote the Hellinger distance between the predictive distributions of ξ and µ, and D_n := E[ln(µ(ω_{1:n})/ξ(ω_{1:n}))] the relative entropy. Dominance (2) implies the total bound

  Σ_{t=1}^∞ E[h_t] ≤ D_∞ ≤ ln w_µ^{−1}    (5)

so ξ(·|ω_{<t}) converges rapidly to µ(·|ω_{<t}) in mean sum, and hence with µ-probability 1, with the number of ε-deviations bounded as in the Notation paragraph above.
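The deterministic bound (3) is easy to check numerically. The following toy sketch (my construction; the three environments and weights are arbitrary) computes ξ(α_t|α_{<t}) for a small deterministic class and verifies that the accumulated prediction error stays below ln w_α^{−1}.

```python
import math

# Toy check of bound (3): Bayes mixture over three deterministic
# environments; the "true" sequence is the alternating one.

envs = {
    "ones":  lambda t: 1,            # 111111...
    "alt":   lambda t: t % 2,        # 010101...  (true environment)
    "zeros": lambda t: 0,            # 000000...
}
w = {"ones": 0.5, "alt": 0.25, "zeros": 0.25}   # prior weights

def xi(prefix):
    """Mixture probability that a sequence starts with `prefix`:
       the total weight of environments consistent with it."""
    return sum(w[k] for k, f in envs.items()
               if all(f(t) == a for t, a in enumerate(prefix)))

alpha = [envs["alt"](t) for t in range(20)]
total = 0.0
for t in range(1, len(alpha) + 1):
    on_seq = xi(alpha[:t]) / xi(alpha[:t-1])     # xi(alpha_t | alpha_<t)
    total += 1 - on_seq
print(total, "<=", math.log(1 / w["alt"]))       # 1.0 <= ln 4 ~ 1.386
```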
One can also show multi-step lookahead convergence ξ(x_{t:n_t}|ω_{<t}) → µ(x_{t:n_t}|ω_{<t}), even for growing lookahead n_t ≥ t.

Losses. Let ℓ(x_t, y_t) ∈ [0,1] be the loss incurred when y_t is predicted but x_t turns out to be the next symbol. The Bayes-optimal prediction scheme Λ_ρ predicts the y_t that minimizes the ρ-expected loss. For binary alphabet, for instance, Λ_ρ predicts 0 iff ρ(1|ω_{<t}) < γ, where γ := (ℓ_{01}−ℓ_{00})/(ℓ_{01}−ℓ_{00}+ℓ_{10}−ℓ_{11}). The instantaneous loss at time t and the total µ(=true)-expected loss for the first n symbols are l^{Λρ}_t and l^{Λρ}_{1:n}, respectively. The Hellinger-like loss bound (8) bounds the excess loss l^{Λξ}_{1:n} − l^{Λµ}_{1:n} of Λ_ξ over the informed predictor Λ_µ in terms of D_n; together with (5) this implies that the number of times l^{Λξ}_t deviates from l^{Λµ}_t by more than ε > 0 is finite for D_∞ < ∞.

Continuous classes. For a continuously parametrized class M := {ν_θ : θ ∈ Θ ⊆ IR^d}, the sums (1) are naturally replaced by integrals:

  ξ(x_{1:n}) := ∫_Θ w(θ) · ν_θ(x_{1:n}) dθ,   ∫_Θ w(θ) dθ = 1    (9)

The most important property of ξ was the dominance (2), achieved by dropping the sum over ν. The analogous construction here is to restrict the integral over θ to a small vicinity of θ_0. Since a continuous parameter can typically be estimated to accuracy ∝ n^{−1/2} after n observations, the largest volume in which ν_θ as a function of θ is approximately flat is ∝ (n^{−1/2})^d, hence ξ(x_{1:n}) ≳ n^{−d/2} w(θ_0) µ(x_{1:n}). Under some weak regularity conditions one can prove [CB90, Hut03c]

  D_n(µ||ξ) := E[ln(µ(ω_{1:n})/ξ(ω_{1:n}))] ≤ ln w(θ_0)^{−1} + (d/2) ln(n/2π) + (1/2) ln det J̄_n(θ_0) + o(1)    (10)

where w(θ_0) is the weight density (9) of µ in ξ, o(1) tends to zero for n → ∞, and the average Fisher information matrix J̄_n(θ) = −(1/n) E[∇_θ ∇_θ^T ln ν_θ(ω_{1:n})] measures the local smoothness of ν_θ and is bounded for many reasonable classes, including all stationary (k-th order) finite-state Markov processes. (Here and in the following, w(·) denotes densities and w_ν probabilities.) See Section 4 for an application to the i.i.d. (k = 0) case. We see that in the continuous case, D_n is no longer bounded by a constant, but grows very slowly (logarithmically) with n, which still implies that ε-deviations are exponentially seldom. Hence, (10) allows one to bound (5) and (8) even in the case of continuous M.
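As a numerical aside on bound (10): for the Bernoulli class with uniform prior (treated in detail in Section 4), D_n can be computed exactly, and its logarithmic growth (d/2)·ln n with d = 1 is clearly visible. The sketch below is my illustration, not one of the paper's results.

```python
import math

# D_n = E ln(mu/xi) for the Bernoulli class with uniform prior grows
# like (1/2) ln n, consistent with bound (10) with d = 1.

def D_n(theta0: float, n: int) -> float:
    """Exact E_mu[ln mu(x_1:n)/xi(x_1:n)] for the Bayes-Laplace mixture
       xi(x) = n1! n0!/(n+1)!; both depend only on the count n1 of
       ones, so we sum over n1, working in log-space to avoid underflow."""
    total = 0.0
    for n1 in range(n + 1):
        n0 = n - n1
        log_binom = (math.lgamma(n + 1) - math.lgamma(n1 + 1)
                     - math.lgamma(n0 + 1))
        log_mu = n1 * math.log(theta0) + n0 * math.log(1 - theta0)
        log_xi = (math.lgamma(n1 + 1) + math.lgamma(n0 + 1)
                  - math.lgamma(n + 2))
        p = math.exp(log_binom + log_mu)   # mu-probability of seeing n1 ones
        total += p * (log_mu - log_xi)
    return total

for n in [10, 100, 1000, 10000]:
    print(n, round(D_n(0.3, n), 3), round(0.5 * math.log(n), 3))
```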
3 How to Choose the Prior

I showed in the last section how to predict if the true environment µ is unknown but known to belong to some class M of environments. In this section, I assume M to be given and discuss how to (universally) choose the prior w_ν. After reviewing various classical principles (indifference, symmetry, minimax) for obtaining objective priors for "small" M, I discuss large M. Occam's razor in conjunction with Epicurus' principle of multiple explanations, quantified by Kolmogorov complexity, leads us to a universal prior, which results in a better predictor than any other prior over countable M.

Classical principles. The probability axioms (implying Bayes' rule) allow one to compute posteriors and predictive distributions from prior ones, but are mute about how to choose the prior. Much has been written on the choice of priors (see [KW96] for a survey and references). A main classification is between objective and subjective priors. An objective prior w_ν is a prior constructed based on some rational principles, which ideally everyone without (relevant) extra prior knowledge should adopt. In contrast, a subjective prior aims at modelling the agent's personal (subjective) belief in environment ν prior to observation of x, but based on his past personal experience or knowledge (e.g. of related phenomena). In Section 6, I show that one way to arrive at a subjective prior is to start with an objective prior, make all past personal experience explicit, determine a "posterior", and use it as subjective prior. So I concentrate in the following on the more important objective priors.

Consider a very simple case of two environments, e.g. a biased coin with head probability 1/3 or 2/3. In absence of any extra knowledge (which I henceforth assume) there is no reason to prefer head probability θ = 1/3 over θ = 2/3 or vice versa, leaving w_{1/3} = w_{2/3} = 1/2 as the only rational choice. More generally, for finite M, the symmetry or indifference argument [Lap12] suggests to set w_ν = 1/|M| ∀ν ∈ M. Similarly, for a compact measurable parameter space Θ we may choose a uniform density w(θ) = [Vol(Θ)]^{−1}. But there is a problem: If we go to a different parametrization (e.g. θ ❀ θ′ := √θ in the Bernoulli case), the prior w(θ) ❀ w′(θ′) becomes non-uniform (see the numerical sketch at the end of this subsection). Jeffreys' [Jef46] solution is to find a symmetry group of the problem (like permutations for finite M) and require the prior to be invariant under group transformations. For instance, if θ ∈ IR is a location parameter (e.g. the mean), it is natural to require a translation-invariant prior. Problems are that there may be no obvious symmetry, the resulting prior may be improper (like for the translation group), and the result can depend on which parameters are treated as nuisance parameters.

The maximum entropy principle extends the symmetry principle by allowing certain types of constraints on the parameters. Conjugate priors are classes of priors such that the posteriors are themselves again in the class. While this can lead to interesting classes, the principle itself is not selective, since e.g. the class of all priors forms a conjugate class. Another, minimax approach by Bernardo [Ber79, CB90] is to consider bound (10), which can actually be improved within o(1) to an equality. Since we want D_n to be small, we minimize the r.h.s. for the worst µ ∈ M. The choice w(θ) ∝ √(det J̄_n(θ)) equalizes and hence minimizes (10). The problems are the same as for Jeffreys' prior (actually both priors often coincide), plus the dependence on the model class and potentially on n.

The principles above, although not unproblematic, can provide good objective priors in many cases of small discrete or compact spaces, but we will meet some more problems later. For the "large" model classes I am interested in, i.e. countably infinite, non-compact, or non-parametric spaces, the principles typically do not apply or break down.
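The reparametrization problem mentioned above (uniform w(θ) turning non-uniform under θ′ = √θ) can be seen in a short Monte Carlo experiment; this sketch is my illustration, with arbitrary sample sizes.

```python
import random

# If theta is uniform on [0,1] and theta' = sqrt(theta), then theta'
# has density w'(t') = 2 t' (CDF: P[theta' <= t] = t^2), i.e. the
# "indifferent" prior is no longer uniform in the new parametrization.

samples = [random.random() ** 0.5 for _ in range(100_000)]  # theta' = sqrt(theta)
bins = [0] * 10
for s in samples:
    bins[min(int(s * 10), 9)] += 1
for i, b in enumerate(bins):
    center = (i + 0.5) / 10     # bin width 0.1, so density = freq * 10
    print(f"bin {i}: empirical {b / len(samples) * 10:.2f}"
          f"  predicted {2 * center:.2f}")
```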
Occam's razor et al. Machine learning, the computer science branch of statistics, often deals with very large model classes. Naturally, machine learning has (re)discovered and exploited quite different principles for choosing priors, appropriate for this situation. The overarching principles put together by Solomonoff [Sol64] are: Occam's razor (choose the simplest model consistent with the data), Epicurus' principle of multiple explanations (keep all explanations consistent with the data), (universal) Turing machines (to compute, quantify, and assign codes to all quantities of interest), and Kolmogorov complexity (to define what simplicity/complexity means). I will first "derive" the so-called universal prior, and subsequently justify it by presenting various welcome theoretical properties and by examples. The idea is that a priori, i.e. before seeing the data, all models are "consistent," so a priori Epicurus would regard all models (in M) as possible, i.e. choose w_ν > 0 ∀ν ∈ M. In order to also do (some) justice to Occam's razor, we should prefer simple hypotheses, i.e. assign high (low) prior w_ν to simple (complex) hypotheses H_ν. Before I can define this prior, I need to quantify the notion of complexity.

Notation. A function f : S → IR ∪ {±∞} is said to be lower semi-computable (or enumerable) if the set {(x,y) : y < f(x), x ∈ S, y ∈ IQ} is recursively enumerable. f is upper semi-computable (or co-enumerable) if −f is enumerable. f is computable (or recursive) if f and −f are enumerable. The set of (co)enumerable functions is recursively enumerable. I write O(1) for a constant of reasonable size: for instance, 100 is reasonable, maybe even 2^30, but 2^500 is not. I write f(x) ≤+ g(x) for f(x) ≤ g(x) + O(1), and f(x) ≤× g(x) for f(x) ≤ 2^{O(1)} · g(x). Corresponding equalities =+ and =× hold if the inequalities hold in both directions. (I will ignore these additive and multiplicative fudges in the discussion until Section 6.) We say that a property A(n) ∈ {true, false} holds for most n if #{t ≤ n : A(t)}/n → 1 as n → ∞.

Kolmogorov complexity. We can now quantify the complexity of a string. Intuitively, a string is simple if it can be described in a few words, like "the string of one million ones", and is complex if there is no such short description, like for a random object whose shortest description is specifying it bit by bit. We are interested in effective descriptions, and hence restrict decoders to be Turing machines (TMs). Let us choose some universal (so-called prefix) Turing machine U with binary input=program tape, X-ary output tape, and bidirectional work tape. We can then define the prefix Kolmogorov complexity [Cha75, Gác74, Kol65, Lev74] of string x as the length ℓ of the shortest binary program p for which U outputs x:

  K(x) := min_p {ℓ(p) : U(p) = x}

Simple strings like 000...0 can be generated by short programs and hence have low Kolmogorov complexity, but irregular (e.g. random) strings are their own shortest description and hence have high Kolmogorov complexity. For non-string objects o (like numbers and functions) we define K(o) := K(⟨o⟩), where ⟨o⟩ ∈ X* is some standard code for o. In particular, if (f_i)_{i=1}^∞ is an enumeration of all (co)enumerable functions, we define K(f_i) := K(i). An important property of K is that it is nearly independent of the choice of U. More precisely, if we switch from one universal TM to another, K(x) changes at most by an additive constant independent of x. For natural universal TMs, the compiler constant is of reasonable size O(1).

A defining property of K : X* → IN is that it additively dominates all co-enumerable functions f : X* → IN that satisfy Kraft's inequality Σ_x 2^{−f(x)} ≤ 1, i.e. K(x) ≤+ f(x) for K(f) = O(1). The universal TM provides a shorter prefix code than any other effective prefix code. K shares many properties with Shannon's entropy (information measure) S, but K is superior to S in many respects. To be brief, K is an excellent universal complexity measure, suitable for quantifying Occam's razor.
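Since K is incomputable (only upper semi-computable, see (11a) below), it can at best be approximated from above. A common practical stand-in is a general-purpose compressor: the compressed length is a crude, implementation-dependent upper bound on a description length. The sketch below uses zlib purely for illustration; it is not the paper's proposal.

```python
import os
import zlib

# Compressed length as a rough proxy for description length:
# "the string of one million ones" compresses to a few thousand bits,
# while a random string of the same length barely compresses at all.

def compressed_len_bits(s: bytes) -> int:
    return 8 * len(zlib.compress(s, 9))

simple  = b"1" * 10**6      # highly regular
random_ = os.urandom(10**6) # incompressible with high probability

print(compressed_len_bits(simple))   # far below the raw 8 * 10^6 bits
print(compressed_len_bits(random_))  # close to 8 * 10^6 bits
```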
We need the following properties (11) of K:

a) K is not computable, but only upper semi-computable,
b) the upper bound K(n) ≤+ log₂n + 2log₂log₂n,
c) Kraft's inequality Σ_x 2^{−K(x)} ≤ 1, which implies 2^{−K(n)} ≤ 1/n for most n,
d) information non-increase K(f(x)) ≤+ K(x) + K(f) for recursive f : X* → X*,
e) K(x) ≤+ −log₂P(x) + K(P) if P : X* → [0,1] is enumerable and Σ_x P(x) ≤ 1,
f) Σ_{x : f(x)=y} 2^{−K(x)} =× 2^{−K(y)} if f is recursive and K(f) = O(1).

The proof of (f) can be found in Appendix A, and the proofs of (a)-(e) in [LV97].

The universal prior. We can now quantify a prior biased towards simple models. First, we quantify the complexity of an environment ν or hypothesis H_ν by its Kolmogorov complexity K(ν). The universal prior should be a decreasing function of the model's complexity, and of course sum to (less than) one. Since K satisfies Kraft's inequality (11c), this suggests the following choice:

  w_ν = w^U_ν := 2^{−K(ν)}    (12)

For this choice, the bound (5) on D_∞ (which bounds the deviations in (5) and (8)) reads

  Σ_{t=1}^∞ E[h_t] ≤ D_∞ ≤ K(µ) ln 2    (13)

i.e. the number of times ξ deviates from µ, or l^{Λξ} deviates from l^{Λµ}, by more than ε > 0 is bounded by O(K(µ)), i.e. is proportional to the complexity of the environment. Could other choices for w_ν lead to better bounds? The answer is essentially no [Hut04]: Consider any other reasonable prior w′_ν, where reasonable means (lower semi)computable with a program of size O(1). Then, MDL bound (11e) with P(·) ❀ w′_{(·)} and x ❀ ⟨µ⟩ shows K(µ) ≤+ −log₂w′_µ + K(w′_{(·)}), hence ln w′_µ^{−1} ≥+ K(µ) ln 2 leads (within an additive constant) to a weaker bound. A counting argument also shows that O(K(µ)) errors are unavoidable for most µ. So this choice of prior leads to very good prediction.

Even for continuous classes M, we can assign a (proper) universal prior (not density) w^U_θ = 2^{−K(θ)} > 0 for computable θ, and 0 for uncomputable ones. This effectively reduces M to a discrete class {ν_θ ∈ M : w^U_θ > 0} which is typically dense in M. We will see that this prior has many advantages over the classical prior densities.

4 Independent Identically Distributed Data

I now compare the classical continuous prior densities to the universal prior on classes of i.i.d. environments. I present some standard critiques of the former, illustrated on Bayes-Laplace's classical Bernoulli class with uniform prior: the problem of zero p(oste)rior, non-confirmation of universal hypotheses, and reparametrization and regrouping non-invariance. I show that the universal prior does not suffer from these problems. Finally, I complement the general total bounds of Section 2 with some i.i.d.-specific instantaneous bounds.

Laplace's rule for Bernoulli sequences. Let x = x_1 x_2 ... x_n ∈ X^n = {0,1}^n be generated by a biased coin with head=1 probability θ ∈ [0,1], i.e. the likelihood of x under hypothesis H_θ is ν_θ(x) = P[x|H_θ] = θ^{n_1}(1−θ)^{n_0}, where n_1 = x_1 + ... + x_n = n − n_0. Bayes [Bay63] assumed a uniform prior density w(θ) = 1. The evidence is ξ(x) = ∫_0^1 ν_θ(x) w(θ) dθ = n_1! n_0!/(n+1)!, and the posterior probability weight density of θ after seeing x is w(θ|x) = ν_θ(x) w(θ)/ξ(x) = ((n+1)!/(n_1! n_0!)) θ^{n_1}(1−θ)^{n_0}, which for large n is strongly peaked around the frequency estimate θ̂ = n_1/n.
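A quick numerical sanity check (mine) of the two closed forms just derived: the Bayes-Laplace evidence ξ(x) = n_1!n_0!/(n+1)! against direct numerical integration, and Laplace's rule ξ(1|x) = (n_1+1)/(n+2) as the ratio of evidences.

```python
from math import factorial

# Check: evidence xi(x) = Integral_0^1 theta^n1 (1-theta)^n0 dtheta
#        = n1! n0! / (n+1)!, and Laplace's rule as a ratio of evidences.

def evidence_numeric(n1: int, n0: int, steps: int = 100_000) -> float:
    """Midpoint Riemann-sum approximation of the evidence integral."""
    return sum(((k + 0.5) / steps) ** n1 * (1 - (k + 0.5) / steps) ** n0
               for k in range(steps)) / steps

def evidence_closed(n1: int, n0: int) -> float:
    return factorial(n1) * factorial(n0) / factorial(n1 + n0 + 1)

n1, n0 = 4, 2
print(evidence_numeric(n1, n0), evidence_closed(n1, n0))
# Laplace's rule: xi(1|x) = xi(x1)/xi(x) = (n1+1)/(n+2)
print(evidence_closed(n1 + 1, n0) / evidence_closed(n1, n0),
      (n1 + 1) / (n1 + n0 + 2))
```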
Laplace [Lap12] asked for the predictive probability ξ(1|x) of observing x_{n+1} = 1 after having seen x = x_1...x_n, which is ξ(1|x) = ξ(x1)/ξ(x) = (n_1+1)/(n+2). (Laplace believed that the sun had risen for 5000 years = 1826213 days since creation, so he concluded that the probability of doom, i.e. that the sun won't rise tomorrow, is 1/1826215.) This looks like a reasonable estimate, since it is close to the relative frequency, asymptotically consistent, symmetric, defined even for n = 0, and not overconfident (never assigns probability 1).

The problem of zero prior. But Laplace's rule is not without problems either. The appropriateness of the uniform prior has been questioned in Section 3 and will be detailed below. Here I discuss a version of the zero prior problem. If the prior is zero, then the posterior is necessarily also zero. The above example seems unproblematic, since the prior and posterior densities w(θ) and w(θ|x) are non-zero. Nevertheless it is problematic, e.g. in the context of scientific confirmation theory [Ear93]. Consider the hypothesis H that all balls in some urn, or all ravens, are black (=1). A natural model is to assume that balls (or ravens) are drawn randomly from an infinite population with fraction θ of black balls (or ravens) and to assume a uniform prior over θ, i.e. just the Bayes-Laplace model. Now we draw n objects and observe that they are all black. We may formalize H as the hypothesis H′ := {θ = 1}. Although the posterior probability of the relaxed hypothesis H_ε := {θ ≥ 1−ε}, namely P[H_ε|1^n] = ∫_{1−ε}^1 w(θ|1^n) dθ = ∫_{1−ε}^1 (n+1)θ^n dθ = 1 − (1−ε)^{n+1}, tends to 1 for n → ∞ for every fixed ε > 0, P[H′|1^n] = P[H_0|1^n] remains identically zero, i.e. no amount of evidence can confirm H′. The reason is simply that zero prior P[H′] = 0 implies zero posterior.

Note that H′ refers to the unobservable quantity θ and only demands blackness with probability 1. So maybe a better formalization of H is purely in terms of observational quantities: H″ := {ω_{1:∞} = 1^∞}. Since ξ(1^n) = 1/(n+1), the predictive probability of observing k further black objects is ξ(1^k|1^n) = ξ(1^{n+k})/ξ(1^n) = (n+1)/(n+k+1). While for fixed k this tends to 1, P[H″|1^n] = lim_{k→∞} ξ(1^k|1^n) ≡ 0 ∀n, as for H′. One may speculate that the crux is the infinite population. But for a finite population of size N and sampling with repetition (and similarly without), P[H″|1^n] = ξ(1^{N−n}|1^n) = (n+1)/(N+1) is close to one only if a large fraction of objects has been observed. This contradicts scientific practice: Although only a tiny fraction of all existing ravens have been observed, we regard this as sufficient evidence for believing strongly in H. This quantifies [Mah04, Thm.11] and shows that Maher does not solve the problem of confirmation of universal hypotheses.

There are two solutions to this problem: We may abandon strict/logical/all-quantified/universal hypotheses altogether in favor of soft hypotheses like H_ε. Although not unreasonable, this approach is unattractive for several reasons. The other solution is to assign a non-zero prior to θ = 1.
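The contrast between the soft hypothesis H_ε and the universal hypothesis H″ is stark already for moderate n; the following snippet (my illustration, with arbitrary n and ε) evaluates the closed-form expressions from the text.

```python
# The zero-prior problem in numbers, under the Bayes-Laplace model:
# the soft hypothesis H_eps gets confirmed, the universal H'' never does.

def p_H_eps(eps: float, n: int) -> float:
    """P[H_eps | 1^n] = 1 - (1-eps)^(n+1)."""
    return 1 - (1 - eps) ** (n + 1)

def p_next_k_black(k: int, n: int) -> float:
    """xi(1^k | 1^n) = (n+1)/(n+k+1) under the uniform prior."""
    return (n + 1) / (n + k + 1)

n = 1000  # 1000 black objects observed
print(p_H_eps(0.01, n))           # ~1: soft hypothesis confirmed
print(p_next_k_black(10, n))      # ~0.99: short-run prediction is fine
print(p_next_k_black(10**9, n))   # ~0: P[H''|1^n] = lim_{k} ... = 0
```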
Consider, for instance, the improper density w(θ) = (1/2)[1 + δ(1−θ)], where δ is the Dirac delta (∫f(θ)δ(θ−a)dθ = f(a)), or equivalently P[θ ≥ a] = 1 − a/2. We get ξ(x_{1:n}) = (1/2)[n_1!n_0!/(n+1)! + δ_{0n_0}], where δ_{ij} = {1 if i=j, 0 else} is Kronecker's δ. In particular, ξ(1^n) = (1/2)(n+2)/(n+1) is much larger than for the uniform prior. Since ξ(1^k|1^n) = ((n+k+2)/(n+k+1))·((n+1)/(n+2)), we get P[H″|1^n] = lim_{k→∞} ξ(1^k|1^n) = (n+1)/(n+2) → 1, i.e. H″ gets strongly confirmed by observing a reasonable number of black objects. This correct asymptotic also follows from the general result (3). Confirmation of H″ is also reflected in the fact that ξ(0|1^n) = 1/(n+2)² tends much faster to zero than for the uniform prior, i.e. the confidence that the next object is black is much higher. The power actually depends on the shape of w(θ) around θ = 1. Similarly, H′ gets confirmed: P[H′|1^n] = µ_1(1^n)P[θ=1]/ξ(1^n) = (n+1)/(n+2) → 1. On the other hand, if a single (or more) 0 is observed (n_0 > 0), then the predictive distribution ξ(·|x) and posterior w(θ|x) are the same as for the uniform prior.

The findings above remain qualitatively valid for i.i.d. processes over finite non-binary alphabet (|X| > 2) and for non-uniform prior. Surely, to get a generally working setup, we should also assign a non-zero prior to θ = 0 and to all other "special" θ, like 1/2 and 1/6, which may naturally appear in a hypothesis, like "is the coin or die fair". The natural continuation of this thought is to assign non-zero prior to all computable θ. This is another motivation for the universal prior w^U_θ = 2^{−K(θ)} (12) constructed in Section 3. It is difficult but not impossible to operate with such a prior [PH04, PH06]. One may want to mix the discrete prior w^U_ν with a continuous (e.g. uniform) prior density, so that the set of non-computable θ keeps a non-zero density. Although possible, we will see that this is actually not necessary.
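The closed forms for this mixed Dirac-plus-uniform prior are easy to check numerically; the sketch below (mine) evaluates ξ(1^n), the confirmation P[H″|1^n] = (n+1)/(n+2), and the fast decay ξ(0|1^n) = 1/(n+2)² stated above.

```python
# With prior w(theta) = (1/2)[1 + delta(1-theta)], the universal
# hypothesis H'' does get confirmed. Formulas from the text.

def xi_all_ones(n: int) -> float:
    """xi(1^n) = (1/2)[1/(n+1) + 1] = (n+2)/(2(n+1))."""
    return 0.5 * (1 / (n + 1) + 1)

def p_H_doubleprime(n: int) -> float:
    """P[H''|1^n] = lim_k xi(1^k|1^n) = (n+1)/(n+2)."""
    return (n + 1) / (n + 2)

for n in [1, 10, 100, 1000]:
    xi0 = 1 - xi_all_ones(n + 1) / xi_all_ones(n)   # xi(0|1^n)
    print(n, p_H_doubleprime(n), xi0, 1 / (n + 2) ** 2)  # last two agree
```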
Reparametrization invariance. Naively, the uniform prior is justified by the indifference principle, but as discussed in Section 3, uniformity is not reparametrization invariant. For instance, if in our Bernoulli example we introduce a new parametrization θ′ = √θ, then the θ′-density w′(θ′) = 2√θ w(θ) is no longer uniform if w(θ) = 1 is uniform. More generally, assume we have some principle which leads to some prior w(θ). Now we apply the principle to a different parametrization θ′ ∈ Θ′ and get prior w′(θ′). Assume that θ and θ′ are related via bijection θ = f(θ′). Another way to get a θ′-prior is to transform the θ-prior w(θ) ❀ w̃(θ′). The reparametrization invariance principle (RIP) states that w′ should be equal to w̃.

For discrete Θ, simply w̃_{θ′} = w_{f(θ′)}, and a uniform prior remains uniform (w′_{θ′} = w̃_{θ′} = w_θ = 1/|Θ|) in any parametrization, i.e. the indifference principle satisfies RIP in finite model classes. In case of densities, we have w̃(θ′) = w(f(θ′))·df(θ′)/dθ′, and the indifference principle violates RIP for non-linear transformations f. But Jeffreys' and Bernardo's principles satisfy RIP. For instance, in the Bernoulli case we have J̄_n(θ) = 1/θ + 1/(1−θ), hence w(θ) = (1/π)[θ(1−θ)]^{−1/2} and w′(θ′) = (1/π)[f(θ′)(1−f(θ′))]^{−1/2}·df(θ′)/dθ′ = w̃(θ′).

Does the universal prior w^U_θ = 2^{−K(θ)} satisfy RIP? If we apply the "universality principle" to a θ′-parametrization, we get w′^U_{θ′} = 2^{−K(θ′)}. On the other hand, w_θ simply transforms to w̃^U_{θ′} = w^U_{f(θ′)} = 2^{−K(f(θ′))} (w_θ is a discrete (non-density) prior, which is non-zero on a discrete subset of M). For computable f we have K(f(θ′)) ≤+ K(θ′) + K(f) by (11d), and similarly K(f^{−1}(θ)) ≤+ K(θ) + K(f) if f is invertible. Hence, for simple bijections f, i.e. for K(f) = O(1), we have K(f(θ′)) =+ K(θ′), which implies w′^U_{θ′} =× w̃^U_{θ′}, i.e. the universal prior satisfies RIP w.r.t. simple transformations f (within a multiplicative constant).

Regrouping invariance. There are important transformations f which are not bijections, which we consider in the following. A simple non-bijection is θ = f(θ′) = θ′² if we consider θ′ ∈ [−1,1]. More interesting is the following example: Assume we had decided not to record blackness versus non-blackness of objects, but their "color". For simplicity of exposition, assume we record only whether an object is black or white or colored, i.e. X′ = {B, W, C}. In analogy to the binary case we use the indifference principle to assign a uniform prior on θ′ ∈ Θ′ := Δ₃, where Δ_d := {θ′ ∈ [0,1]^d : Σ_{i=1}^d θ′_i = 1}, and ν_{θ′}(x′_{1:n}) = Π_i θ′_i^{n_i}. All inferences regarding blackness (predictive and posterior) are identical to the binomial model ν_θ(x_{1:n}) = θ^{n_1}(1−θ)^{n_0} with x′_t = B ❀ x_t = 1 and x′_t = W or C ❀ x_t = 0 and θ = f(θ′) = θ′_B and w(θ) = ∫_{Δ₃} w′(θ′)δ(θ′_B − θ)dθ′. Unfortunately, for uniform prior w′(θ′) ∝ 1, w(θ) ∝ 1−θ is not uniform, i.e. the indifference principle is not invariant under splitting/grouping, or general regrouping. Regrouping invariance is regarded as a very important and desirable property [Wal96].

I now consider general i.i.d. processes ν_θ(x) = Π_{i=1}^d θ_i^{n_i}. Dirichlet priors w(θ) ∝ Π_{i=1}^d θ_i^{α_i−1} form a natural conjugate class (w(θ|x) ∝ Π_{i=1}^d θ_i^{n_i+α_i−1}) and are the default priors for multinomial (i.i.d.) processes over finite alphabet X of size d. Note that ξ(a|x) = (n_a + α_a)/(n + α_1 + ... + α_d) generalizes Laplace's rule and coincides with Carnap's [Car52] confirmation function. Symmetry demands α_1 = ... = α_d; for instance, α ≡ 1 for the uniform and α ≡ 1/2 for the Bernardo-Jeffreys prior. Grouping two "colors" i and j results in a Dirichlet prior with α_{i&j} = α_i + α_j for the group. The only way to respect symmetry under all possible groupings is to set α ≡ 0. This is Haldane's improper prior, which results in unacceptably overconfident predictions ξ(1|1^n) = 1. Walley [Wal96] solves the problem that there is no single acceptable prior density by considering sets of priors.
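The non-uniform marginal w(θ) ∝ 1−θ induced by a uniform prior on Δ₃ can be confirmed by simulation. The sketch below (my illustration) samples the simplex uniformly and histograms θ = θ′_B; the exact marginal is the Beta(1,2) density 2(1−θ).

```python
import random

# Uniform prior on the 3-simplex induces a non-uniform marginal on
# theta = theta'_B: density 2(1-theta), not 1.

def uniform_simplex3():
    """Uniform sample from {theta' in [0,1]^3 : sum = 1}: the gaps
       between two sorted uniforms are uniform on the simplex."""
    u, v = sorted([random.random(), random.random()])
    return (u, v - u, 1 - v)

thetas = [uniform_simplex3()[0] for _ in range(200_000)]
bins = [0] * 10
for t in thetas:
    bins[min(int(t * 10), 9)] += 1
for i, b in enumerate(bins):
    center = (i + 0.5) / 10
    print(f"{center:.2f}: empirical {b / len(thetas) * 10:.2f}"
          f"  predicted {2 * (1 - center):.2f}")
```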
I now show that the universal prior w^U_θ = 2^{−K(θ)} is invariant under regrouping, and more generally under all simple (computable with complexity O(1)), even non-bijective, transformations. Consider prior w′_{θ′}. If θ = f(θ′), then w′_{θ′} transforms to w̃_θ = Σ_{θ′:f(θ′)=θ} w′_{θ′} (note that for non-bijections there is more than one w′_{θ′} consistent with w̃_θ). In the θ′-parametrization, the universal prior reads w′^U_{θ′} = 2^{−K(θ′)}. Using (11f) with x = ⟨θ′⟩ and y = ⟨θ⟩, we get

  w̃^U_θ = Σ_{θ′:f(θ′)=θ} 2^{−K(θ′)} =× 2^{−K(θ)} = w^U_θ

i.e. the universal prior is invariant under general transformations, and hence regrouping invariant (within a multiplicative constant), w.r.t. simple computable transformations f. Note that reparametrization and regrouping invariance hold for arbitrary classes M and are not limited to the i.i.d. case.

Instantaneous bounds. The cumulative bounds (5) and (10) stay valid for i.i.d. processes, but instantaneous bounds are now also possible. For i.i.d. M with continuous, discrete, and universal prior, respectively, one can show (in preparation; see [Kri98, PH04, PH06] for related bounds)

  E[h_n] ≤× (1/n) ln w(θ_0)^{−1}  and  E[h_n] ≤× (1/n) ln w_{θ_0}^{−1} = (1/n) K(θ_0) ln 2    (14)

Note that, if summed up over n, they lead to weaker cumulative bounds.
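The 1/n decay rate in the instantaneous bounds (14) can be observed in simulation. The following sketch (my illustration) estimates E[h_n] for the Bayes-Laplace predictor under a Bernoulli(θ_0) source; the constants are empirical, not those of the bound.

```python
import random

# E[h_n] for the Laplace predictor decays like c/n: the predictive
# probability concentrates at rate n^{-1/2}, and the Hellinger distance
# is quadratic in that deviation.

def hellinger(p: float, q: float) -> float:
    """h = sum_a (sqrt(p_a) - sqrt(q_a))^2 for binary distributions."""
    return (p**0.5 - q**0.5)**2 + ((1-p)**0.5 - (1-q)**0.5)**2

theta0, runs = 0.3, 2000
for n in [10, 100, 1000]:
    acc = 0.0
    for _ in range(runs):                       # average over histories
        n1 = sum(random.random() < theta0 for _ in range(n))
        pred = (n1 + 1) / (n + 2)               # Laplace predictive xi(1|x)
        acc += hellinger(pred, theta0)
    mean_h = acc / runs
    print(n, mean_h, "-> n * E[h_n] =", n * mean_h)   # roughly constant
```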
5 Universal Sequence Prediction

Section 3 derived the universal prior and Section 4 discussed i.i.d. classes. What remains, and will be done in this section, is to find a universal class of environments, namely Solomonoff-Levin's class of all (lower semi)computable (semi)measures. The resulting universal mixture is equivalent to the output distribution of a universal Turing machine with uniform input distribution. The universal prior avoids the problem of old evidence, and the universal class avoids the necessity of updating M. I discuss the general total bounds of Section 2 for the specific universal mixture, and supplement them with some weak instantaneous bounds. Finally, I show that the universal mixture performs better than classical continuous mixtures, even in uncomputable environments.

Universal choice of M. The bounds of Section 2 apply if M contains the true environment µ. The larger M, the less restrictive is this assumption. The class of all computable distributions, although only countable, is pretty large from a practical point of view. (Finding a non-computable physical system would overturn the Church-Turing thesis.) It is the largest class relevant from a computational point of view. Solomonoff [Sol64, Eq.(13)] defined and studied the mixture over this class. One problem is that this class is not enumerable, since the class of computable functions f : X* → IR is not enumerable (halting problem), nor is it decidable whether a function is a measure. Hence ξ is completely incomputable. Levin [ZL70] had the idea to "slightly" extend the class to include also lower semi-computable semimeasures. One can show that this class M_U = {ν_1, ν_2, ...} is enumerable, hence

  ξ_U(x) = Σ_{ν∈M_U} w^U_ν ν(x)    (15)

is itself lower semi-computable, i.e. ξ_U ∈ M_U, which is a convenient property in itself. Note that since 1/(n log₂²n) ≤× w^U_{ν_n} ≤ 1/n for most n by (11b) and (11c), most ν_n have prior approximately reciprocal to their index n, as advocated by Jeffreys [Jef61, p.238] and Rissanen [Ris83]. In some sense M_U is the largest class of environments for which ξ is in some sense computable [Hut03b, Hut06], but see [Sch02a] for even larger classes. Note that including non-semi-computable ν would not affect ξ_U, since w^U_ν = 0 on such environments.

The problem of old evidence. An important problem in Bayesian inference in general, and (Bayesian) confirmation theory [Ear93] in particular, is how to deal with 'old evidence', or equivalently with 'new theories'. How shall a Bayesian treat the case when some evidence E ≙ x (e.g. Mercury's perihelion advance) is known well before the correct hypothesis/theory/model H ≙ µ (Einstein's general relativity theory) is found? How shall H be added to the Bayesian machinery a posteriori? What is the prior of H? Should it be the belief in H in a hypothetical counterfactual world in which E is not known? Can old evidence E confirm H? After all, H could simply be constructed/biased/fitted towards "explaining" E.

The universal class M_U and universal prior w^U_ν formally solve this problem: The universal prior of H is 2^{−K(H)}. This is independent of M and of whether E is known or not. If we use E to construct H or fit H to explain E, this will lead to a theory which is more complex (K(H) ≥+ K(E)) than a theory from scratch (K(H) = O(1)), so cheats are automatically penalized. There is no problem of adding hypotheses to M a posteriori. Priors of old hypotheses are not affected. Finally, M_U includes all hypotheses (including yet unknown or unnamed ones) a priori. So, at least theoretically, updating M is unnecessary.

Other representations of ξ_U. Definition (15) is somewhat complex, relying on an enumeration of semimeasures and Kolmogorov complexity. I now approach ξ_U from a different perspective. Assume that our world is governed by a computable deterministic process describable in ≤ l bits. Consider a standard (not prefix) Turing machine U′ and programs p generating environments starting with x. Let us pad all programs so that they have length exactly l. Among the 2^l programs of length l there are N_l(x) := #{p ∈ {0,1}^l : U′(p) = x*} programs consistent with observation x. If we regard all environmental descriptions p ∈ {0,1}^l a priori as equally likely (Epicurus), we should adopt the relative frequency N_l(x)/2^l as our prior belief in x. Since we do not know l and we can pad every p arbitrarily, we can take the limit M(x) := lim_{l→∞} N_l(x)/2^l (which exists, since N_l(x)/2^l increases). Or equivalently: M(x) is the probability that U′ outputs a string starting with x when provided with uniform random noise on the program tape. Note that a uniform distribution is also used in the No Free Lunch theorems [WM97] to prove the impossibility of universal learners, but in our case the uniform distribution is piped through a universal Turing machine, which defeats these negative implications.

Yet another representation of M is as follows: For every q printing x* there exists a shortest prefix (called minimal) p of q printing x. p possesses 2^{l−ℓ(p)} prolongations to length l, all printing x*. Hence all prolongations of p together yield a contribution 2^{l−ℓ(p)}/2^l = 2^{−ℓ(p)} to M(x). Let U(p) = x* iff p is a minimal program printing a string starting with x. Then

  M(x) = Σ_{p:U(p)=x*} 2^{−ℓ(p)}    (16)

which may be regarded as a 2^{−ℓ(p)}-weighted mixture over all computable deterministic environments ν_p (ν_p(x) = 1 if U(p) = x* and 0 else). Now, as a positive surprise, M(x) coincides with ξ_U(x) within an irrelevant multiplicative constant. So it is actually sufficient to consider the class of deterministic semimeasures; the reason is that the probabilistic semimeasures lie in the convex hull of the deterministic ones, and so need not be taken into account separately in the mixture.
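The random-program representation of M invites a toy experiment. The decoder below is a machine of my own design (unary-coded pattern length followed by a repeating pattern), emphatically not a universal Turing machine, so the numbers only illustrate the qualitative effect: a regular string like (01)⁸ receives a far larger share of random programs than an irregular string of the same length.

```python
import random

# Monte Carlo estimate of a toy analogue of M(x): feed uniform random
# bits into a fixed toy decoder and measure how often the output
# starts with x.

def toy_decode(bits):
    """Read k in unary (k-1 ones, then a zero), then a k-bit pattern;
       the machine's output is that pattern repeated forever."""
    i, k = 0, 1
    while i < len(bits) - 1 and bits[i] == 1:
        k += 1
        i += 1
    pattern = bits[i + 1 : i + 1 + k]
    return pattern if len(pattern) == k else None   # ran off the tape

def output_starts_with(x, bits):
    pattern = toy_decode(bits)
    return pattern is not None and all(
        pattern[t % len(pattern)] == a for t, a in enumerate(x))

def M_estimate(x, trials=200_000):
    tape_len = 2 * len(x) + 2
    hits = sum(output_starts_with(x, [random.randint(0, 1)
                                      for _ in range(tape_len)])
               for _ in range(trials))
    return hits / trials

print(M_estimate([0, 1] * 8))                                 # regular: ~0.07
print(M_estimate([random.randint(0, 1) for _ in range(16)]))  # irregular: ~0
```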
One can also get an explicit enumeration of all lower semi-computable semimeasures M_U = {ν_1, ν_2, ...} by means of ν_i(x) := Σ_{p:T_i(p)=x*} 2^{−ℓ(p)}, where T_i(p) ≡ U(⟨i⟩p), i = 1,2,..., enumerates all monotone Turing machines.

Bounds for computable environments. The bound (13) is surely applicable for ξ = ξ_U, and now holds for any computable measure µ. Within an additive constant the bound is also valid for M =× ξ_U. That is, ξ_U and M are excellent predictors with the only condition that the sequence is drawn from any computable probability distribution. Bound (13) shows that the total number of prediction errors is small. Similarly to (3), one can show that Σ_{t=1}^n |1 − M(x_t|x_{<t})| is bounded in terms of the complexity of the (computable) sequence x. The key construction: let M_t be a computable and in t increasing approximation of M, and define P_t(n′) := Σ_{a≠x_{n′}} M_t(x_{<n′}a). With M_t, also P_t is computable and increasing, hence P(n′) := lim_{t→∞} P_t(n′) = sup_t P_t(n′) is lower semi-computable. Clearly P(n′) = Σ_{a≠x_{n′}} M(x_{<n′}a), the M-probability of going off-sequence at time n′. Hence Σ_{n′} P(n′) ≤ 1, since the strings {x_{<n′}a : a ≠ x_{n′}, n′ ∈ IN} form a prefix-free set.
