Predictive Hypothesis Identification


Marcus Hutter
RSISE @ ANU and SML @ NICTA
Canberra, ACT 0200, Australia
marcus@hutter1.net        www.hutter1.net

8 September 2008

Abstract

While statistics focusses on hypothesis testing and on estimating (properties of) the true sampling distribution, in machine learning the performance of learning algorithms on future data is the primary issue. In this paper we bridge the gap with a general principle (PHI) that identifies hypotheses with best predictive performance. This includes predictive point and interval estimation, simple and composite hypothesis testing, (mixture) model selection, and others as special cases. For concrete instantiations we will recover well-known methods, variations thereof, and new ones. PHI nicely justifies, reconciles, and blends (a reparametrization invariant variation of) MAP, ML, MDL, and moment estimation. One particular feature of PHI is that it can genuinely deal with nested hypotheses.

Contents
1 Introduction
2 Preliminaries
3 Predictive Hypothesis Identification Principle
4 Exact Properties of PHI
5 PHI for ∞-Batch
6 Large Sample Approximations
7 Discussion

Keywords: parameter estimation; hypothesis testing; model selection; predictive inference; composite hypotheses; MAP versus ML; moment fitting; Bayesian statistics.

1 Introduction

Consider data D sampled from some distribution p(D|θ) with unknown θ ∈ Ω. The likelihood function or the posterior contain the complete statistical information of the sample. Often this information needs to be summarized or simplified for various reasons (comprehensibility, communication, storage, computational efficiency, mathematical tractability, etc.).
Parameter estimation, hypothesis testing, and model (complexity) selection can all be regarded as ways of summarizing this information, albeit in different ways or contexts. The posterior might either be summarized by a single point Θ = {θ} (e.g. ML or MAP or mean or stochastic model selection), or by a convex set Θ ⊆ Ω (e.g. confidence or credible interval), or by a finite set of points Θ = {θ₁,...,θ_l} (mixture models) or a sample of points (particle filtering), or by the mean and covariance matrix (Gaussian approximation), or by more general density estimation, or in a few other ways [BM98, Bis06]. I have roughly sorted the methods in increasing order of complexity. This paper concentrates on set estimation, which includes (multiple) point estimation and hypothesis testing as special cases, henceforth jointly referred to as "hypothesis identification" (this nomenclature seems uncharged and naturally includes what we will do: estimation and testing of simple and complex hypotheses, but not density estimation). We will briefly comment on generalizations beyond set estimation at the end.

Desirable properties. There are many desirable properties any hypothesis identification principle ideally should satisfy. It should

• lead to good predictions (that's what models are ultimately for),
• be broadly applicable,
• be analytically and computationally tractable,
• be defined and make sense also for non-i.i.d. and non-stationary data,
• be reparametrization and representation invariant,
• work for simple and composite hypotheses,
• work for classes containing nested and overlapping hypotheses,
• work in the estimation, testing, and model selection regime,
• reduce in special cases (approximately) to existing other methods.
Here we concentrate on the first item, and will show that the resulting principle nicely satisfies many of the other items.

The main idea. We address the problem of identifying hypotheses (parameters/models) with good predictive performance head on. If θ₀ is the true parameter, then p(x|θ₀) is obviously the best prediction of the m future observations x. If we don't know θ₀ but have prior belief p(θ) about its distribution, the predictive distribution p(x|D) based on the past n observations D (which averages the likelihood p(x|θ) over θ with posterior weight p(θ|D)) is by definition the best Bayesian predictor. Often we cannot use full Bayes (for reasons discussed above) but predict with hypothesis H = {θ ∈ Θ}, i.e. use p(x|Θ) as prediction. The closer p(x|Θ) is to p(x|D) or p(x|θ₀,D)¹, the better is H's prediction (by definition), where we can measure closeness with some distance function d. Since x and θ₀ are (assumed to be) unknown, we have to sum or average over them.

Definition 1 (Predictive Loss)  The predictive Loss/L̃oss of Θ given D based on distance d for m future observations is

    Loss^m_d(Θ,D) := ∫ d(p(x|Θ), p(x|D)) dx                        (1)
    L̃oss^m_d(Θ,D) := ∫∫ d(p(x|Θ), p(x|θ,D)) p(θ|D) dx dθ

Predictive hypothesis identification (PHI) minimizes the losses w.r.t. some hypothesis class H. Our formulation is general enough to cover point and interval estimation, simple and composite hypothesis testing, (mixture) model (complexity) selection, and others.

(Un)related work. The general idea of inference by maximizing predictive performance is not new [Gei93].
Indeed, in the context of model (complexity) selection it is prevalent in machine learning and implemented primarily by empirical cross-validation procedures and variations thereof [Zuc00], or by minimizing test and/or train set (generalization) bounds; see [Lan02] and references therein. There are also a number of statistics papers on predictive inference; see [Gei93] for an overview and older references, and [BB04, MGB05] for newer references. Most of them deal with distribution-free methods based on some form of cross-validation discrepancy measure, and often focus on model selection. A notable exception is MLPD [LF82], which maximizes the predictive likelihood including future observations. The full decision-theoretic setup, in which a decision based on D leads to a loss depending on x, and minimizing the expected loss, has been studied extensively [BM98, Hut05], but scarcely in the context of hypothesis identification. On the natural progression of estimation → prediction → action, approximating the predictive distribution by minimizing (1) lies between traditional parameter estimation and optimal decision making. Formulation (1) is quite natural but I haven't seen it elsewhere. Indeed, besides ideological similarities the papers above bear no resemblance to this work.

Contents. The main purpose of this paper is to investigate the predictive losses above and in particular their minima, i.e. the best predictor in H. Section 2 introduces notation, global assumptions, and illustrates PHI on a simple example. This also shows a shortcoming of MAP and ML estimation. Section 3 formally states PHI, possible distance and loss functions, and their minima. In Section 4, I study exact properties of PHI: invariances, sufficient statistics, and equivalences. Section 5 investigates the limit m → ∞ in which PHI can be related to MAP and ML.
Section 6 derives large sample approximations n → ∞ for which PHI reduces to sequential moment fitting (SMF). The results are subsequently used for Offline PHI. Section 7 contains summary, outlook and conclusions. Throughout the paper, the Bernoulli example will illustrate the general results. The main aim of this paper is to introduce and motivate PHI, demonstrate how it can deal with the difficult problem of selecting composite and nested hypotheses, and show how PHI reduces to known principles in certain regimes. The latter provides additional justification and support of previous principles, and clarifies their range of applicability. In general, the treatment is exemplary, not exhaustive.

¹So far we tacitly assumed that given θ₀, x is independent of D. For non-i.i.d. data this is generally not the case, hence the appearance of D.

2 Preliminaries

Setup. Let D ≡ D_n ≡ (x₁,...,x_n) ≡ x_{1:n} ∈ X^n be the observed sample with observations x_i ∈ X from some measurable space X, e.g. ℝ^{d′} or ℕ or a subset thereof. Similarly let x ≡ (x_{n+1},...,x_{n+m}) ≡ x_{n+1:n+m} ∈ X^m be potential future observations. We assume that D and x are sampled from some probability distribution P[·|θ], where θ ∈ Ω is some unknown parameter. We do not assume independence of the x_i, i ∈ ℕ, unless otherwise stated. For simplicity of exposition we assume that the densities p(D|θ) w.r.t. the default (Lebesgue or counting) measure (∫dλ or Σ_x, both written henceforth as ∫dx) exist.

Bayes. Similarly, we assume a prior distribution P[Θ] with density p(θ) over parameters. From prior p(θ) and likelihood p(D|θ) we can compute the posterior p(θ|D) = p(D|θ)p(θ)/p(D), where the normalizer is p(D) = ∫ p(D|θ)p(θ) dθ.
The full Bayesian approach uses parameter averaging for prediction:

    p(x|D) = ∫ p(x|θ,D) p(θ|D) dθ = p(D,x)/p(D)

the so-called predictive distribution (or more precisely predictive density), which can be regarded as the gold standard for prediction (and there are plenty of results justifying this [BCH93, Hut05]).

Composite likelihood. Let H_θ be the simple hypothesis that x is sampled from p(x|θ), and H_Θ the composite hypothesis that x is "sampled" from p(x|Θ), where Θ ⊆ Ω. In the Bayesian framework, the "composite likelihood" p(x|Θ) is actually well defined (for measurable Θ with P[Θ] > 0) as an averaged likelihood

    p(x|Θ) = ∫ p(x|θ) p(θ|Θ) dθ,   where  p(θ|Θ) = p(θ)/P[Θ] for θ ∈ Θ, and 0 else.

MAP and ML. Let H be the (finite, countable, continuous, complete, or other) class of hypotheses H_Θ (or Θ for short) from which the "best" one shall be selected. Each Θ ∈ H is assumed to be a measurable subset of Ω. The maximum a posteriori (MAP) estimator is defined as θ_MAP = argmax_{θ∈H} p(θ|D) if H contains only simple hypotheses, and Θ_MAP = argmax_{Θ∈H} P[Θ|D] in the general case. The composite maximum likelihood estimator is defined as Θ_ML = argmax_{Θ∈H} p(D|Θ), which reduces to ordinary ML for simple hypotheses. In order not to further clutter up the text with too much mathematical gibberish, we make the following global assumptions during informal discussions:

Global Assumption 2  Wherever necessary, we assume that sets, spaces, and functions are measurable, densities exist w.r.t. some (Lebesgue or counting) base measure, observed events have non-zero probability, or densities conditioned on probability zero events are appropriately defined, in which case statements might hold with probability 1 only.
Functions and densities are sufficiently often (continuously) differentiable, and integrals exist and exchange.

Bernoulli Example. Consider a binary X = {0,1} i.i.d. process p(D|θ) = θ^{n₁}(1−θ)^{n₀} with bias θ ∈ [0,1] = Ω, and n₁ = x₁+...+x_n = n−n₀ the number of observed 1s. Let us assume a uniform prior p(θ) = 1. Here, but not generally in later continuations of the example, we also assume n₀ = n₁. Consider hypothesis class H = {H_f, H_v} containing the simple hypothesis H_f = {θ = ½} meaning "fair" and the composite vacuous alternative H_v = Ω meaning "don't know". It is easy to see that

    p(D|H_v) = p(D) = n₁! n₀!/(n+1)! < 2^{−n} = p(D|H_f)  for n > 1 (and = else),

hence Θ_ML = H_f, i.e. ML always suggests a fair coin however weak the evidence is. On the other hand, P[H_f|D] = 0 < 1 = P[H_v|D], i.e. MAP never suggests a fair coin however strong the evidence is. Now consider PHI. Let m₁ = x_{n+1}+...+x_{n+m} = m−m₀ be the number of future 1s. The probabilities of x given H_f, H_v, and D are, respectively,

    p(x|H_f) = 2^{−m},   p(x|H_v) = m₁! m₀!/(m+1)!,   p(x|D) = [(m₁+n₁)!(n₀+m₀)!/(n+m+1)!] · [(n+1)!/(n₁! n₀!)]    (2)

For m = 1 we get p(1|H_f) = ½ = p(1|H_v), so when concerned with predicting only one bit, both hypotheses are equally good. More generally, for an interval Θ = [a,b], compare p(1|Θ) = θ̄ := ½(a+b) to the full Bayesian prediction p(1|D) = (n₁+1)/(n+2) (Laplace's rule). Hence if H is a class of interval hypotheses, then PHI chooses the Θ ∈ H whose midpoint θ̄ is closest to Laplace's rule, which is reasonable. The size of the interval doesn't matter, since p(x_{n+1}|Θ) is independent of it. Things start to change for m = 2.
The following table lists p(x|D) for some D, together with p(x|H_f) and p(x|H_v), and their prediction error Err(H) := Loss^2_1(H,D) for d(p,q) = |p−q| in (1):

                        x=00   x=01|10   x=11    Err(H_f) ≷ Err(H_v)   Conclusion
    p(x|D), D={}        1/3      1/3     1/3       1/3    >    0       don't know
    p(x|D), D=01        3/10     4/10    3/10      1/5    >   2/15     don't know
    p(x|D), D=0101      2/7      3/7     2/7       1/7    <   4/21     fair
    p(x|D), D=(01)^∞    1/4      1/2     1/4        0     <   1/3      fair
    p(x|H_f)            1/4      1/2     1/4
    p(x|H_v)            1/3      1/3     1/3

The last column contains the identified best predictive hypothesis. For four or more observations, PHI says "fair", otherwise "don't know". Using (2) or our later results, one can show more generally that PHI chooses "fair" for n ≫ m and "don't know" for m ≫ n. ♦

MAP versus ML versus PHI. The conclusions of the example generalize: For Θ₁ ⊆ Θ₂, we have P[Θ₁|D] ≤ P[Θ₂|D], i.e. MAP always chooses the less specific hypothesis H_{Θ₂}. On the other hand, we have p(D|θ_ML) ≥ p(D|Θ), since the maximum can never be smaller than an average, i.e. composite ML prefers the maximally specific hypothesis. So interestingly, although MAP and ML give identical answers for uniform prior on simple hypotheses, their naive extensions to composite hypotheses are diametrically opposed. While MAP is risk averse, finding a likely true model of low predictive power, composite ML risks an (over)precise prediction. Sure, there are ways to make MAP and ML work for nested hypotheses. The Bernoulli example has also shown that PHI's answer depends not only on the past data size n but also on the future data size m. Indeed, if we make only few predictions based on a lot of data (m ≪ n), a point estimate (H_f) is typically sufficient, since there will not be enough future observations to detect any discrepancy.
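The table entries can be reproduced mechanically from (2). A minimal sketch in exact rational arithmetic (the helper names are mine, not the paper's):

```python
from fractions import Fraction as F
from math import factorial

def p_D(x, n1, n0):
    """p(x|D) from (2): probability of future sequence x given data counts (uniform prior)."""
    n, m = n1 + n0, len(x)
    m1 = sum(x); m0 = m - m1
    return (F(factorial(m1 + n1) * factorial(n0 + m0), factorial(n + m + 1))
            * F(factorial(n + 1), factorial(n1) * factorial(n0)))

def p_Hf(x):                                  # fair coin: p(x|H_f) = 2^-m
    return F(1, 2 ** len(x))

def p_Hv(x):                                  # vacuous hypothesis: p(x|H_v) = m1! m0!/(m+1)!
    m1 = sum(x); m0 = len(x) - m1
    return F(factorial(m1) * factorial(m0), factorial(len(x) + 1))

def err(p_H, n1, n0, m=2):
    """Err(H) = Loss^2_1(H,D): total absolute deviation over all m-step futures."""
    futures = [tuple((i >> j) & 1 for j in range(m)) for i in range(2 ** m)]
    return sum(abs(p_H(x) - p_D(x, n1, n0)) for x in futures)

table = {D: (err(p_Hf, n1, n0), err(p_Hv, n1, n0))
         for D, (n1, n0) in {"{}": (0, 0), "01": (1, 1), "0101": (2, 2)}.items()}
for D, (ef, ev) in table.items():
    print(D, ef, ev, "fair" if ef < ev else "don't know")   # e.g. last line: 0101 1/7 4/21 fair
```

The computed pairs (Err(H_f), Err(H_v)) match the table's (1/3, 0), (1/5, 2/15), and (1/7, 4/21), with the predicted switch from "don't know" to "fair" at D = 0101.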
On the other hand, if m ≫ n, selecting a vacuous model (H_v) that ignores past data is better than selecting a potentially wrong parameter, since there is plenty of future data to learn from. This is exactly the behavior PHI exhibited in the example.

3 Predictive Hypothesis Identification Principle

We have already defined the predictive loss functions in (1). We now formally state our predictive hypothesis identification (PHI) principle, and discuss possible distances d and major prediction scenarios related to the choice of m.

Distance functions. Throughout this work we assume that d is continuous and zero if and only if both arguments coincide. Some popular distances are: the (f) f-divergence d(p,q) = f(p/q)q for convex f with f(1) = 0, the (α) α-distance f(t) = |t^α − 1|^{1/α}, the (1) absolute deviation d(p,q) = |p−q| (α = 1), the (h) Hellinger distance d(p,q) = (√p − √q)² (α = ½), the (c) chi-square distance f(t) = (t−1)², the (k) KL-divergence f(t) = t ln t, and the (r) reverse KL-divergence f(t) = −ln t. The only distance considered here that is not an f-divergence is the (2) squared distance d(p,q) = (p−q)². The f-divergence is particularly interesting, since it contains most of the standard distances and makes Loss representation invariant (RI).

Definition 3 (Predictive hypothesis identification (PHI))  The best (resp. tilde-best) predictive hypothesis in H given D is defined as

    Θ̂^m_d := argmin_{Θ∈H} Loss^m_d(Θ,D)    ( Θ̃^m_d := argmin_{Θ∈H} L̃oss^m_d(Θ,D) )

The PHI (P̃HI) principle states to predict x with probability p(x|Θ̂^m_d) (p(x|Θ̃^m_d)), which we call PHI^m_d (P̃HI^m_d) prediction.

Prediction modes. There exist a few distinct prediction scenarios and modes.
Here are prototypes of the presumably most important ones:

Infinite batch: Assume we summarize our data D by a model/hypothesis Θ ∈ H. The model is henceforth used as background knowledge for predicting and learning from further observations essentially indefinitely. This corresponds to m → ∞.

Finite batch: Assume the scenario above, but terminate after m predictions for whatever reason. This corresponds to a finite m (often large).

Offline: The selected model Θ is used for predicting x_{k+1} for k = n,...,n+m−1 separately with p(x_{k+1}|Θ), without further learning from x_{n+1}...x_k taking place. This corresponds to repeated m = 1 with common Θ: Loss^{1m}_d(Θ,D) := E[Σ_{k=n}^{n+m−1} Loss^1_d(Θ,D_k) | D].

Online: At every step k = n,...,n+m−1 we determine a (good) hypothesis Θ_k from H based on past data D_k, and use it only once for predicting x_{k+1}. Then for k+1 we select a new hypothesis, etc. This corresponds to repeated m = 1 with different Θ: Loss = Σ_{k=n}^{n+m−1} Loss^1_d(Θ_k,D_k).

The above list is not exhaustive. Other prediction scenarios are definitely possible. In all prediction scenarios above we can use L̃oss instead of Loss equally well. Since all time steps k in Online PHI are completely independent, online PHI reduces to 1-Batch PHI, hence will not be discussed any further.

4 Exact Properties of PHI

Reparametrization and representation invariance (RI). An important sanity check of any statistical procedure is its behavior under reparametrization θ ↝ ϑ = g(θ) [KW96] and/or when changing the representation of observations x_i ↝ y_i = h(x_i) [Wal96], where g and h are bijections. If the parametrization/representation is judged irrelevant to the problem, any inference should also be independent of it. MAP and ML are both representation invariant, but (for point estimation) only ML is reparametrization invariant.
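The reparametrization (non-)invariance just mentioned is easy to verify numerically. A sketch for the Bernoulli model, assuming illustrative data n₁ = 1, n₀ = 2 and the arbitrary bijection ϑ = θ² (both choices are mine, purely for demonstration):

```python
import numpy as np

# Bernoulli with uniform prior and data counts n1=1, n0=2: posterior density ∝ θ(1-θ)².
n1, n0 = 1, 2
theta = np.linspace(1e-6, 1 - 1e-6, 400_001)
post = theta**n1 * (1 - theta)**n0            # normalization is irrelevant for the argmax
theta_map = theta[np.argmax(post)]            # MAP in the θ-parametrization: θ = 1/3

# Reparametrize ϑ = θ². The posterior density picks up the Jacobian dθ/dϑ = 1/(2√ϑ),
# which moves the mode: MAP is not reparametrization invariant.
v = np.linspace(1e-6, 1 - 1e-6, 400_001)
post_v = np.sqrt(v)**n1 * (1 - np.sqrt(v))**n0 / (2 * np.sqrt(v))
map_back = np.sqrt(v[np.argmax(post_v)])      # ≈ 0, far from 1/3

# The likelihood carries no Jacobian, so ML is reparametrization invariant:
lik_v = np.sqrt(v)**n1 * (1 - np.sqrt(v))**n0
ml_back = np.sqrt(v[np.argmax(lik_v)])        # ≈ 1/3 again

print(theta_map, map_back, ml_back)
```

The mode of the transformed posterior collapses toward ϑ = 0, while the likelihood maximizer maps back to θ = 1/3 under any bijection.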
Proposition 4 (Invariance of Loss)  Loss^m_d(Θ,D) and L̃oss^m_d(Θ,D) are invariant under reparametrization of Ω. If distance d is an f-divergence, then they are also independent of the representation of the observation space X. For continuous X, the transformations are assumed to be continuously differentiable.

RI for Loss^m_f is obvious, but we will later see some interesting consequences. Any exact inference or any specialized form of PHI_f will inherit RI. Similarly for approximations, as long as they do not break RI. For instance, PHI_h will lead to an interesting RI variation of MAP.

Sufficient statistic. For large m, the integral in Definition 1 is prohibitive. Many models (the whole exponential family) possess a sufficient statistic, which allows us to reduce the integral over X^m to an integral over the sufficient statistic. Let T: X^m → ℝ^{d′} be a sufficient statistic, i.e.

    p(x|T(x),θ) = p(x|T(x))  ∀ x,θ    (3)

which implies that there exist functions g and h such that the likelihood factorizes into

    p(x|θ) = h(x) g(T(x)|θ)    (4)

The proof is trivial for discrete X (choose h(x) = p(x|T(x)) and g(t|θ) = p(t|θ) := P[T(x) = t|θ]) and follows from Fisher's factorization theorem for continuous X. Let A be an event that is independent of x given θ. Then multiplying (4) by p(θ|A) and integrating over θ yields

    p(x|A) = ∫ p(x|θ) p(θ|A) dθ = h(x) g(T(x)|A),  where    (5)
    g(t|A) := ∫ g(t|θ) p(θ|A) dθ    (6)

For some β ∈ ℝ, let the (non-probability) measure μ_β[B] := ∫_{x: T(x)∈B} h(x)^β dx (B ⊆ ℝ^{d′}) have density h_β(t) (t ∈ ℝ^{d′}) w.r.t. the (Lebesgue or counting) base measure dt (∫dt = Σ_t in the discrete case).
Informally,

    h_β(t) := ∫ h(x)^β δ(T(x) − t) dx    (7)

where δ is the Dirac delta for continuous X (or the Kronecker delta for countable X, i.e. ∫dx δ(T(x)−t) = Σ_{x: T(x)=t} 1).

Theorem 5 (PHI for sufficient statistic)  Let T(x) be a sufficient statistic (3) for θ and assume x is independent of D given θ, i.e. p(x|θ,D) = p(x|θ). Then

    Loss^m_d(Θ,D) = ∫ d(g(t|Θ), g(t|D)) h_β(t) dt
    L̃oss^m_d(Θ,D) = ∫∫ d(g(t|Θ), g(t|θ)) h_β(t) p(θ|D) dt dθ

hold (where g and h_β have been defined in (4), (6), and (7)), provided one (or both) of the following conditions hold: (i) distance d scales with a power β ∈ ℝ, i.e. d(σp,σq) = σ^β d(p,q) for σ > 0, or (ii) any distance d, but h(x) ≡ 1 in (4). One can choose g(t|·) = p(t|·), the probability density of t, in which case h₁(t) ≡ 1.

All distances defined in Section 3 satisfy (i), the f-divergences all with β = 1 and the square loss with β = 2. The independence assumption is rather strong. In practice, it typically holds for some n only if it holds for all n. Independence of x_{n+1:n+m} from D_n given θ for all n can only be satisfied for independent (not necessarily identically distributed) x_i, i ∈ ℕ.

Theorem 6 (Equivalence of PHI^m_{2|r} and P̃HI^m_{2|r})  For the square distance (d = 2) and the RKL distance (d = r), Loss^m_d(Θ,D) differs from L̃oss^m_d(Θ,D) only by an additive constant c^m_d(D) independent of Θ, hence PHI and P̃HI select the same hypotheses: Θ̂^m_2 = Θ̃^m_2 and Θ̂^m_r = Θ̃^m_r.

Bernoulli Example. Let us continue with our Bernoulli example with uniform prior. T(x) = x₁+...+x_m = m₁ = t ∈ {0,...,m} is a sufficient statistic. Since X = {0,1} is discrete, ∫dt = Σ_{t=0}^m and ∫dx = Σ_{x∈X^m}.
In (4) we can choose g(t|θ) = p(x|θ) = θ^t(1−θ)^{m−t}, which implies h(x) ≡ 1 and h_β(t) = Σ_{x: T(x)=t} 1 = (m choose t). From (5) we see that g(t|D) = p(x|D), whose expression can be found in (2). For the RKL distance, Theorem 5 now yields Loss^m_r(Θ,D) = Σ_{t=0}^m h_β(t) g(t|D) ln[g(t|D)/g(t|Θ)]. For a point hypothesis Θ = {θ} this evaluates to a constant minus m[(n₁+1)/(n+2) ln θ + (n₀+1)/(n+2) ln(1−θ)], which is minimized for θ = (n₁+1)/(n+2). Therefore the best predictive point is θ̂_r = (n₁+1)/(n+2) = θ̃_r = Laplace's rule, where Theorem 6 gives the equality θ̂_r = θ̃_r. ♦

5 PHI for ∞-Batch

In this section we will study PHI for large m, or more precisely, the m ≫ n regime. No assumption is made on the data size n, i.e. the results are exact for any n (small or large) in the limit m → ∞. For simplicity, and partly by necessity, we assume that the x_i, i ∈ ℕ, are i.i.d. (lifting the "identical" is possible). Throughout this section we make the following assumptions.

Assumption 7  Let x_i, i ∈ ℕ, be independent and identically distributed, Ω ⊆ ℝ^d, the likelihood density p(x_i|θ) twice continuously differentiable w.r.t. θ, and the boundary of Θ of zero prior probability.

We further define x := x_i (any i) and the partial derivative ∂ := ∂/∂θ = (∂/∂θ₁,...,∂/∂θ_d)^⊤ = (∂₁,...,∂_d)^⊤. The (two representations of the) Fisher information matrix of p(x|θ),

    I₁(θ) = E[(∂ ln p(x|θ))(∂ ln p(x|θ))^⊤ | θ] = −∫ (∂∂^⊤ ln p(x|θ)) p(x|θ) dx    (8)

will play a crucial role in this section. It also occurs in Jeffreys' prior,

    p_J(θ) := √det I₁(θ)/J,    J := ∫ √det I₁(θ) dθ    (9)

a popular reparametrization invariant (objective) reference prior (when it exists) [KW96]. We call the determinant det I₁(θ) the Fisher information.
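For the running Bernoulli example, (8) and (9) can be checked numerically: the score is x/θ − (1−x)/(1−θ), so I₁(θ) = 1/(θ(1−θ)) and J = ∫₀¹ dθ/√(θ(1−θ)) = π, i.e. Jeffreys' prior for Bernoulli is the Beta(½,½) density. A quick sketch (the value θ = 0.3 is arbitrary):

```python
import numpy as np

# Monte-Carlo check of the first representation in (8) for the Bernoulli model.
rng = np.random.default_rng(0)
theta = 0.3
x = (rng.random(1_000_000) < theta).astype(float)
score = x / theta - (1 - x) / (1 - theta)     # ∂/∂θ ln[θ^x (1-θ)^(1-x)]
I1_mc = np.mean(score**2)                     # ≈ 1/(0.3·0.7) ≈ 4.762

# Normalizer of Jeffreys' prior (9): midpoint rule handles the endpoint singularities.
N = 1_000_000
t = (np.arange(N) + 0.5) / N
J = np.mean(1 / np.sqrt(t * (1 - t)))         # ≈ π
print(I1_mc, J)
```

The empirical second moment of the score matches 1/(θ(1−θ)), and the numerical normalizer comes out close to π, as the closed form predicts.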
J can be interpreted as the intrinsic size of Ω [Grü07]. Although not essential to this work, it will be instructive to occasionally plug it into our expressions. As distance we choose the Hellinger distance.

Theorem 8 (L̃oss^m_h(θ,D) for large m)  Under Assumption 7, for point estimation, the predictive Hellinger loss for large m is

    L̃oss^m_h(θ,D) = 2 − 2 (8π/m)^{d/2} [p(θ|D)/√det I₁(θ)] [1 + O(m^{−1/2})]
                  = 2 − 2 (8π/m)^{d/2} [p(D|θ)/(J p(D))] [1 + O(m^{−1/2})]    (for Jeffreys' prior)

where the first expression holds for any continuous prior density and the second expression holds for Jeffreys' prior.

IMAP. The asymptotic expression shows that minimizing L̃oss^m_h is equivalent to the following maximization:

    IMAP:  θ̃^∞_h = θ_IMAP := argmax_θ p(θ|D)/√det I₁(θ)    (10)

Without the denominator, this would just be MAP estimation. We have discussed that MAP is not reparametrization invariant, hence can be corrupted by a bad choice of parametrization. Since the square root of the Fisher information transforms like the posterior, their ratio is invariant. So PHI led us to a nice reparametrization invariant variation of MAP, immune to this problem. Invariance of the expressions in Theorem 8 is not a coincidence; it has to hold due to Proposition 4. For Jeffreys' prior (second expression in Theorem 8), minimizing L̃oss^m_h is equivalent to maximizing the likelihood, i.e. θ̃^∞_h = θ_ML. Remember that the expressions are exact even and especially for small samples D_n. No large-n approximation has been made. For small n, MAP, ML, and IMAP can lead to significantly different results. For Jeffreys' prior, IMAP and ML coincide. This is a nice reconciliation of MAP and ML: an "improved" MAP leads for Jeffreys' prior back to "simple" ML.

MDL.
We can also relate PHI to MDL by taking the logarithm of the second expression in Theorem 8:

    θ̃^∞_h = argmin_θ { −log p(D|θ) + (d/2) log(m/8π) + log J }    (for Jeffreys' prior)    (11)

For m = 4n this is the classical (large-n approximation of) MDL [Grü07]. So, presuming that (11) is a reasonable approximation of PHI even for m = 4n, MDL approximately minimizes the predictive Hellinger loss iff used for O(n) predictions. We will not expand on this, since the alluded relation to MDL stands on shaky grounds (for several reasons).

Corollary 9 (θ̃^∞_h = θ_IMAP, and = θ_ML for Jeffreys' prior)  The predictive estimator θ̃^∞_h = lim_{m→∞} argmin_θ L̃oss^m_h(θ,D) coincides with θ_IMAP, a reparametrization invariant variation of MAP. In the special case of Jeffreys' prior, it also coincides with the maximum likelihood estimator θ_ML.

Theorem 10 (L̃oss^m_h(Θ,D) for large m)  Under Assumption 7, for composite Θ, the predictive Hellinger loss for large m is

    L̃oss^m_h(Θ,D) = 2 − 2 (8π/m)^{d/4} (1/√P[Θ]) ∫_Θ p(θ|D) √(p(θ)/√det I₁(θ)) dθ + o(m^{−d/4})
                  = 2 − 2 (8π/m)^{d/4} √(p(D|Θ) P[Θ|D] / (J P[D])) + o(m^{−d/4})    (for Jeffreys' prior)

where the first expression holds for any continuous prior density and the second expression holds for Jeffreys' prior.

MAP meets ML halfway. The second expression in Theorem 10 is proportional to the geometric average of the posterior and the composite likelihood. For large Θ the likelihood gets small, since the average involves many wrong models. For small Θ, the posterior is proportional to the volume of Θ, hence tends to zero.
The product is maximal for some Θ in-between:

    ML×MAP:  √(p(D|Θ) P[Θ|D] / P[D]) = P[Θ|D]/√P[Θ] = p(D|Θ)√P[Θ]/P[D]
             → { 1 for Θ → Ω;  0 for Θ → {θ};  O(n^{d/4}) for |Θ| ∼ n^{−d/2} }    (12)

The regions where the posterior density p(θ|D) and where the (point) likelihood p(D|θ) are large are quite similar, as long as the prior is not extreme. Let Θ₀ be this region. It typically has diameter O(n^{−1/2}). Increasing Θ ⊃ Θ₀ cannot significantly increase P[Θ|D] ≤ 1, but significantly decreases the likelihood, hence the product gets smaller. Vice versa, decreasing Θ ⊂ Θ₀ cannot significantly increase p(D|Θ) ≤ p(D|θ_ML), but significantly decreases the posterior. The value at Θ₀ follows from P[Θ₀] ≈ Volume(Θ₀) ≈ O(n^{−d/2}). Together this shows that Θ₀ approximately maximizes the product of likelihood and posterior. So the best predictive Θ₀ = Θ̃^∞_h has diameter O(n^{−1/2}), which is a very reasonable answer. It covers well, but not excessively, the high posterior and high likelihood regions (provided H is sufficiently rich, of course). By multiplying the likelihood or dividing the posterior with only the square root of the prior, they meet halfway!

Bernoulli Example. A Bernoulli process with uniform prior and n₀ = n₁ has posterior variance σ²_n = 1/(4n). Hence any reasonable symmetric interval estimate Θ = [½−z, ½+z] of θ will have size 2z = O(n^{−1/2}). For PHI we get

    P[Θ|D]/√P[Θ] = (1/√(2z)) · ((n+1)!/(n₁! n₀!)) ∫_Θ θ^{n₁}(1−θ)^{n₀} dθ ≃ (1/√(2z)) erf(z/(σ_n√2))

where the equality ≃ is a large-n approximation, and erf(·) is the error function [AS74]. erf(x)/√x has a global maximum at x ≐ 1 within 1% precision. Hence PHI selects an interval of half-width z ≐ √2 σ_n.
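The selected half-width z ≐ √2 σ_n can be checked by directly maximizing P[Θ|D]/√P[Θ] over symmetric intervals. A numerical sketch with the arbitrary choice n₀ = n₁ = 50 (so σ_n = 1/(2√n) = 0.05):

```python
import numpy as np

n1 = n0 = 50
n = n1 + n0
theta = np.linspace(0, 1, 200_001)
post = theta**n1 * (1 - theta)**n0
post /= post.sum()                             # discretized posterior p(θ|D), uniform prior
cdf = np.concatenate(([0.0], np.cumsum(post)))

def score(z):
    """P[Θ|D]/√P[Θ] for Θ = [1/2 - z, 1/2 + z]; uniform prior gives P[Θ] = 2z."""
    lo = np.searchsorted(theta, 0.5 - z)
    hi = np.searchsorted(theta, 0.5 + z, side="right")
    return (cdf[hi] - cdf[lo]) / np.sqrt(2 * z)

zs = np.linspace(0.005, 0.3, 2000)
z_best = zs[np.argmax([score(z) for z in zs])]
sigma_n = 1 / (2 * np.sqrt(n))
print(z_best, np.sqrt(2) * sigma_n)            # both ≈ 0.07
```

The numerically optimal half-width agrees with √2 σ_n ≈ 0.0707 to within the large-n approximation error.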
If faced with a binary decision between the point estimate Θ_f = {½} and the vacuous estimate Θ_v = [0,1], comparing the losses in Theorems 8 and 10 we see that for large m, Θ_v is selected, despite σ_n being close to zero for large n. In Section 2 we have explained that this makes sense from a predictive point of view. ♦

Finally note that (12) does not converge to (any monotone function of) (10) for Θ → {θ}, i.e. Θ̃^∞_h ↛ θ̃^∞_h, since the limits m → ∞ and Θ → {θ} do not exchange.

Finding Θ̃^∞_h. Contrary to MAP and ML, an unrestricted maximization of (12) over all measurable Θ ⊆ Ω makes sense. The following result reduces the optimization problem to finding the level sets of the likelihood function and to a one-dimensional maximization problem.

Theorem 11 (Finding Θ̃^∞_h = Θ_{ML×MAP})  Let Θ_γ := {θ: p(D|θ) ≥ γ} be the γ-level set of p(D|θ). If P[Θ_γ] is continuous in γ, then

    Θ_{ML×MAP} := argmax_Θ P[Θ|D]/√P[Θ] = argmax_{Θ_γ: γ≥0} P[Θ_γ|D]/√P[Θ_γ]

More precisely, every global maximum of (12) differs from the maximizer Θ_γ at most on a set of measure zero.

Using posterior level sets, i.e. shortest α-credible sets/intervals, instead of likelihood level sets would not work (an indirect proof is that they are not RI). For a general prior, level sets of p(D|θ)√(p(θ)/√I₁(θ)) need to be considered. The continuity assumption on P[Θ_γ] excludes likelihoods with plateaus, which is restrictive if considering non-analytic likelihoods. The assumption can be lifted by considering all Θ_γ in-between Θ°_γ := {θ: p(D|θ) > γ} and Θ̄_γ := {θ: p(D|θ) ≥ γ}. Exploiting the special form of (12), one can show that the maximum is attained for either Θ°_γ or Θ̄_γ with γ obtained as in the theorem.

Large n.
For large n (m ≫ n ≫ 1), the likelihood usually tends to an (un-normalized) Gaussian with mean = mode ¯θ = θ^ML and covariance matrix [nI₁(¯θ)]^{−1}. Therefore the level sets are ellipsoids

    Θ_r = {θ : (θ−¯θ)^⊤ I₁(¯θ) (θ−¯θ) ≤ r²}

We know that the size r of the maximizing ellipsoid scales with O(n^{−1/2}). For such tiny ellipsoids, (12) is asymptotically proportional to

    P[Θ_r|D]/√P[Θ_r]  ∝  ∫_{Θ_r} p(D|θ) dθ / √Volume[Θ_r]  ∝  ∫_{‖z‖≤ρ} e^{−‖z‖²/2} dz / √( ∫_{‖z‖≤ρ} 1 dz )  ∝  ∫₀^{ρ²/2} t^{d/2−1} e^{−t} dt / ρ^{d/2}  =  γ(d/2, ρ²/2) / ρ^{d/2}

where z := (nI₁(¯θ))^{1/2}(θ−¯θ) ∈ ℝ^d and ρ := r√n, using the substitution t := ½‖z‖², and γ(·,·) is the incomplete Gamma function [AS74]; we dropped all factors that are independent of r. The expression also holds for a general prior in Theorem 8, since asymptotically the prior has no influence. It is maximized for the following ˜r:

    d          1      2      3      4      5      10     100    ···    ∞
    ˜r√(n/d)   1.400  1.121  1.009  0.947  0.907  0.819  0.721  ···  1/√2

i.e. for m ≫ n ≫ 1, unrestricted PHI selects the ellipsoid ˜Θ^∞_h = Θ_˜r of (linear) size O(√(d/n)).

So far we have considered L̃oss^m_h. Analogous asymptotic expressions can be derived for Loss^m_h: While Loss^m_h differs from L̃oss^m_h, for point estimation their minima ˆθ^∞_d = ˜θ^∞_d = θ^IMAP coincide. For composite Θ, the answer is qualitatively similar but differs quantitatively.

6 Large Sample Approximations

In this section we study PHI for large sample sizes n, more precisely the n ≫ m regime. For simplicity we concentrate on the univariate d = 1 case only. Data may be non-i.i.d.

Sequential moment fitting (SMF). A classical approximation of the posterior density p(θ|D) is by a Gaussian with the same mean and variance.
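The ˜r√(n/d) values in the table above can be reproduced by numerically maximizing γ(d/2, ρ²/2)/ρ^{d/2} over ρ. A sketch assuming SciPy is available; the regularized gammainc differs from γ only by the constant Γ(d/2), which does not affect the argmax:

```python
from scipy.optimize import minimize_scalar
from scipy.special import gammainc  # regularized lower incomplete gamma P(a, x)

def neg_score(rho, d):
    # objective from the text, up to the rho-independent constant Gamma(d/2)
    return -gammainc(d / 2, rho ** 2 / 2) / rho ** (d / 2)

table = {}
for d in (1, 2, 3, 4, 5):
    res = minimize_scalar(neg_score, args=(d,), bounds=(0.05, 8.0), method="bounded")
    table[d] = res.x / d ** 0.5   # compare with the tabulated r~*sqrt(n/d)
    print(f"d = {d}: rho~/sqrt(d) = {table[d]:.3f}")
```

The search bounds (0.05, 8.0) are an ad-hoc bracket; the maximum is interior for all listed d.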
In case the class of available distributions is further restricted, it is still reasonable to approximate the posterior by the distribution whose mean and variance are closest to those of p(θ|D). There might be a tradeoff between taking a distribution with good mean (low bias) or one with good variance. Often low bias is of primary importance, and variance comes second. This suggests to first fit the mean, then the variance, and possibly continue with higher-order moments.

PHI is concerned with predictive performance, not with density estimation, but of course they are related. Good density estimation in general, and sequential moment fitting (SMF) in particular, lead to good predictions, but the converse is not necessarily true. We will indeed see that PHI for n → ∞ (under certain conditions) reduces to an SMF procedure.

The SMF algorithm. In our case, the set of available distributions is given by {p(θ|Θ) : Θ ∈ H}. For some event A, let

    ¯θ_A := E[θ|A] = ∫ θ p(θ|A) dθ   and   μ^A_k := E[(θ−¯θ_A)^k | A]   (k ≥ 2)     (13)

be the mean and central moments of p(θ|A). The posterior moments μ^D_k are known and can in principle be computed. SMF sequentially "fits" μ^Θ_k to μ^D_k: Starting with H₀ := H, let H_k ⊆ H_{k−1} be the set of Θ ∈ H_{k−1} that minimize |μ^Θ_k − μ^D_k|:

    H_k := {arg min_{Θ ∈ H_{k−1}} |μ^Θ_k − μ^D_k|},   H₀ := H,   μ^A_1 := ¯θ_A

Let k* := min{k : μ^Θ_k ≠ μ^D_k, Θ ∈ H_k} be the smallest k for which there is no perfect fit anymore (or ∞ otherwise). Under some quite general conditions, in a certain sense, all and only the Θ ∈ H_{k*} minimize Loss^m_d(Θ, D_n) for large n.

Theorem 12 (PHI for large n by SMF) For some k ≤ k*, assume p(x|θ) is k times continuously differentiable w.r.t. θ at the posterior mean ¯θ_D.
Let β > 0 and assume sup_θ ∫ |p^{(k)}(x|θ)|^β dθ < ∞, μ^D_k = O(n^{−k/2}), μ^Θ_k = O(n^{−k/2}), and that d(p,q)/|p−q|^β is a bounded function. Then

    Loss^m_d(Θ, D) = O(n^{−kβ/2})   ∀ Θ ∈ H_k   (k ≤ k*)

For the α ≤ 1 distances we have β = 1, for the square distance we have β = 2 (see Section 3). For i.i.d. distributions with finite moments, the assumption μ^D_k = O(n^{−k/2}) is virtually nil. Normally, no Θ ∈ H has better loss order than O(n^{−k*β/2}), i.e. H_{k*} can be regarded as the set of all asymptotically optimal predictors. In many cases, H_{k*} contains only a single element. Note that H_{k*} depends neither on m nor on the chosen distance d, i.e. the best predictive hypothesis ˆΘ = ˆΘ^m_d is essentially the same for all m and d if n is large.

Bernoulli Example. In the Bernoulli Example in Section 2 we considered a binary decision between point estimate Θ_f = {½} and vacuous estimate Θ_v = [0;1], i.e. H₀ = {Θ_f, Θ_v}. For n₀ = n₁ we have ¯θ_{[0;1]} = ¯θ_{½} = ½ = ¯θ_D, i.e. both fit the first moment exactly, hence H₁ = H₀. For the second moments we have μ^D_2 = 1/(4n), but μ^{[0;1]}_2 = 1/12 and μ^{½}_2 = 0, hence for large n the point estimate matches the posterior variance better, so ˆΘ = {½} ∈ H₂ = {Θ_f}, which makes sense. ♦

For unrestricted (single) point estimation, i.e. H = {{θ} : θ ∈ ℝ}, one can typically estimate the mean exactly but no higher moments. More generally, finite mixture models Θ = {θ₁,...,θ_l} with l components (degrees of freedom) can fit at most l moments. For large l, the number of θ_i ∈ ˆΘ that lie in a small neighborhood of some θ (i.e. the "density" of points in ˆΘ at θ) will be proportional to the likelihood p(D|θ). Countably infinite, and even more so continuous, models are sufficient to get all moments right if otherwise unrestricted.
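The SMF recursion and the Bernoulli selection above can be sketched in code; `smf` below is a hypothetical helper for a finite hypothesis class, with each hypothesis represented by its list of moments [¯θ, μ₂, ...]:

```python
def smf(hypotheses, posterior_moments, tol=1e-12):
    """Sequential moment fitting: for k = 1, 2, ... keep the hypotheses whose
    k-th moment is closest to the posterior's; stop at the first k with no
    perfect fit (that k is k*, or None if every listed moment is matched)."""
    H = set(hypotheses)
    k_star = None
    for k, target in enumerate(posterior_moments, start=1):
        errs = {h: abs(hypotheses[h][k - 1] - target) for h in H}
        best = min(errs.values())
        H = {h for h in H if errs[h] <= best + tol}   # H_k: the minimizers
        if best > tol:                                # no perfect fit anymore
            k_star = k
            break
    return H, k_star

# Bernoulli example: n0 = n1 = n/2, uniform prior, H0 = {point, vacuous}
n = 100
posterior = [0.5, 1 / (4 * n)]            # mean and mu_2 of p(theta|D)
H0 = {"point {1/2}": [0.5, 0.0],          # point mass: zero variance
      "vacuous [0,1]": [0.5, 1 / 12]}     # uniform on [0,1]
print(smf(H0, posterior))                 # the point estimate survives, k* = 2
```

Both hypotheses tie on the first moment, so the second moment decides, matching the example's conclusion.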
If the parameter range is restricted, anything can happen (k* = ∞ or k* < ∞). For interval estimation H = {[a;b] : a,b ∈ ℝ, a ≤ b} and uniform prior, we have ¯θ_{[a;b]} = ½(a+b) and μ^{[a;b]}_2 = (b−a)²/12, hence the first two moments can be fitted exactly, and the SMF algorithm yields the unique asymptotic solution ˆΘ = [¯θ_D − √(3μ^D_2); ¯θ_D + √(3μ^D_2)]. In higher dimensions, common choices of H are convex sets, ellipsoids, and hypercubes. For ellipsoids, the mean and covariance matrix can be fitted exactly and uniquely, similarly to 1d interval estimation. While SMF can be continued beyond k*, H_k typically does not contain ˆΘ for k > k* anymore. The correct continuation beyond k* is either H_{k+1} = {arg min_{Θ ∈ H_k} μ^Θ_k} or H_{k+1} = {arg max_{Θ ∈ H_k} μ^Θ_k} (there is some criterion for the choice), but apart from exotic situations this does not improve the order O(n^{−k*β/2}) of the loss, and usually |H_{k*}| = 1 anyway. Exploiting Theorem 6, we see that SMF is also applicable for L̃oss^m_2 and L̃oss^m_r. Luckily, Offline ˜PHI can also be reduced to 1-Batch ˜PHI:

Proposition 13 (Offline = 1-Batch) If x_i ∈ ℕ are i.i.d., the Offline L̃oss is proportional to the 1-Batch L̃oss:

    L̃oss^{1m}_d(Θ, D) := Σ_{k=n}^{n+m−1} ∫ L̃oss^1_d(Θ, D_k) p(x_{n+1:k}|D) dx_{n+1:k}  =  m · L̃oss^1_d(Θ, D)

In particular, Offline ˜PHI equals 1-Batch ˜PHI: ˜Θ^{1m}_d = ˜Θ^1_d. Exploiting Theorem 6, we see that also Loss^{1m}_{2|r} = m · Loss^m_{2|r} + constant. Hence we can apply SMF also for Offline PHI_{2|r} and ˜PHI_{2|r}. For square loss, i.i.d. is not essential; independence is sufficient.

7 Discussion

Summary. If prediction is the goal, but full Bayes not feasible, one should identify (estimate/test/select) the hypothesis (parameter/model/interval) that predicts best.
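The interval fit ˆΘ = [¯θ_D − √(3μ^D_2); ¯θ_D + √(3μ^D_2)] above simply inverts the moments of a uniform distribution; a minimal sketch with illustrative numbers (the example mean and variance are not from the paper):

```python
import math

def smf_interval(mean, var):
    # invert mean = (a+b)/2 and var = (b-a)^2/12 of a uniform distribution
    half = math.sqrt(3 * var)
    return mean - half, mean + half

a, b = smf_interval(0.7, 0.0021)   # e.g. posterior mean/variance of a sample
assert abs((a + b) / 2 - 0.7) < 1e-12              # first moment matched
assert abs((b - a) ** 2 / 12 - 0.0021) < 1e-12     # second moment matched
print(f"SMF interval: [{a:.4f}, {b:.4f}]")
```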
What "best" means can depend on the problem setup: what our benchmark is (Loss, L̃oss), the distance function we use for comparison (d), how long we use the model (m) compared to how much data we have at hand (n), and whether we continue to learn or not (Batch, Offline). We have defined some reparametrization and representation invariant losses that cover many practical scenarios. Predictive hypothesis identification (PHI) aims at minimizing this loss. For m → ∞, PHI overcomes some problems of, and even reconciles, (a variation of) MAP and (composite) ML. Asymptotically, for n → ∞, PHI reduces to a sequential moment fitting (SMF) procedure, which is independent of m and d. The primary purpose of the asymptotic approximations was to gain understanding (e.g. consistency of PHI follows from it), without supposing that they are the most relevant in practice. A case where PHI can be evaluated efficiently and exactly is when a sufficient statistic is available.

Outlook. There are many open ends and possible extensions that deserve further study. Some results have only been proven for specific distance functions. For instance, we conjecture that PHI reduces to IMAP for other d (this seems true for α-distances). Definitely the behavior of PHI should next be studied for semi-parametric models and compared to existing model (complexity) selectors like AIC, LoRP [Hut07], BIC, and MDL [Grü07], and cross validation in the supervised case. Another important generalization to be done is to supervised learning (classification and regression), which (likely) requires a stochastic model of the input variables. PHI could also be generalized to predictive density estimation proper by replacing p(x|Θ) with a (parametric) class of densities q_ϑ(x). Finally, we could also go the full way to a decision-theoretic setup and loss.
Note that Theorems 8 and 12, combined with (asymptotic) frequentist properties like consistency of MAP/ML/SMF, easily yield analogous results for PHI.

Conclusion. We have shown that predictive hypothesis identification scores well on all desirable properties listed in Section 3. In particular, PHI can properly deal with nested hypotheses, and nicely justifies, reconciles, and blends MAP and ML for m ≫ n, MDL for m ≈ n, and SMF for n ≫ m.

Acknowledgements. Many thanks to Jan Poland for his help improving the clarity of the presentation.

References

[AS74] M. Abramowitz and I. A. Stegun, editors. Handbook of Mathematical Functions. Dover Publications, 1974.
[BB04] M. M. Barbieri and J. O. Berger. Optimal predictive model selection. Annals of Statistics, 32(3):870–897, 2004.
[BCH93] A. R. Barron, B. S. Clarke, and D. Haussler. Information bounds for the risk of Bayesian predictions and the redundancy of universal codes. In Proc. IEEE International Symposium on Information Theory (ISIT), page 54, 1993.
[Bis06] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[BM98] A. A. Borovkov and A. Moullagaliev. Mathematical Statistics. Gordon & Breach, 1998.
[Gei93] S. Geisser. Predictive Inference. Chapman & Hall/CRC, 1993.
[Grü07] P. D. Grünwald. The Minimum Description Length Principle. The MIT Press, Cambridge, 2007.
[Hut05] M. Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2005. 300 pages. http://www.hutter1.net/ai/uaibook.htm
[Hut07] M. Hutter. The loss rank principle for model selection. In Proc. 20th Annual Conf. on Learning Theory (COLT'07), volume 4539 of LNAI, pages 589–603, San Diego, 2007. Springer, Berlin.
[KW96] R. E. Kass and L. Wasserman. The selection of prior distributions by formal rules.
Journal of the American Statistical Association, 91(435):1343–1370, 1996.
[Lan02] J. Langford. Combining train set and test set bounds. In Proc. 19th International Conf. on Machine Learning (ICML-2002), pages 331–338. Elsevier, 2002.
[LF82] M. Lejeune and G. D. Faulkenberry. A simple predictive density function. Journal of the American Statistical Association, 77(379):654–657, 1982.
[MGB05] N. Mukhopadhyay, J. K. Ghosh, and J. O. Berger. Some Bayesian predictive approaches to model selection. Statistics & Probability Letters, 73(4), 2005.
[Wal96] P. Walley. Inferences from multinomial data: learning about a bag of marbles. Journal of the Royal Statistical Society B, 58(1):3–57, 1996.
[Zuc00] W. Zucchini. An introduction to model selection. Journal of Mathematical Psychology, 44(1):41–61, 2000.
