Information, learning and falsification
David Balduzzi
Max Planck Institute for Intelligent Systems, Tübingen, Germany. david.balduzzi@tuebingen.mpg.de

There are (at least) three approaches to quantifying information. The first, algorithmic information or Kolmogorov complexity, takes events as strings and, given a universal Turing machine, quantifies the information content of a string as the length of the shortest program producing it [1]. The second, Shannon information, takes events as belonging to ensembles and quantifies the information resulting from observing the given event in terms of the number of alternate events that have been ruled out [2]. The third, statistical learning theory, has introduced measures of capacity that control (in part) the expected risk of classifiers [3]. These capacities quantify the expectations regarding future data that learning algorithms embed into classifiers.

Solomonoff and Hutter have applied algorithmic information to prove remarkable results on universal induction. Shannon information provides the mathematical foundation for communication and coding theory. However, both approaches have shortcomings. Algorithmic information is not computable, severely limiting its practical usefulness. Shannon information refers to ensembles rather than actual events: it makes no sense to compute the Shannon information of a single string – or rather, there are many answers to this question depending on how a related ensemble is constructed. Although there are asymptotic results linking algorithmic and Shannon information, it is unsatisfying that there is such a large gap – a difference in kind – between the two measures.
This note describes a new method of quantifying information, effective information, that links algorithmic information to Shannon information, and also links both to capacities arising in statistical learning theory [4, 5]. After introducing the measure, we show that it provides a non-universal analog of Kolmogorov complexity. We then apply it to derive basic capacities in statistical learning theory: empirical VC-entropy and empirical Rademacher complexity. A nice byproduct of our approach is an interpretation of the explanatory power of a learning algorithm in terms of the number of hypotheses it falsifies [6], counted in two different ways for the two capacities. We also discuss how effective information relates to information gain, Shannon and mutual information.

Effective information

Any physical system, at any spatiotemporal scale, is an input/output device. For simplicity, we only model memoryless systems with finite input X and output Y alphabets. The probability that system m outputs y ∈ Y given input x ∈ X is encoded in Markov matrix p_m(y | x). The effective information generated when system m outputs y is computed as follows. First, let the potential repertoire p_{unif}(X) be the input set equipped with the uniform distribution. Next, compute the actual repertoire via Bayes' rule

    \hat{p}_m(x \mid y) := \frac{p_m(y \mid do(x)) \, p_{unif}(x)}{p_m(y)},    (1)

where p_m(y) = \sum_x p_m(y \mid do(x)) \, p_{unif}(x) and do(\cdot) refers to Pearl's interventional calculus [7]. Effective information is the Kullback-Leibler divergence between the two repertoires:

    ei(m, y) := D\bigl[\hat{p}_m(X \mid y) \,\big\|\, p_{unif}(X)\bigr].    (2)

For a deterministic function f : X \to Y, the actual repertoire and effective information are

    \hat{p}_f(x \mid y) = \begin{cases} 1/|f^{-1}(y)| & \text{if } f(x) = y \\ 0 & \text{else} \end{cases}
    \quad\text{and}\quad
    ei(f, y) = \log_2 |X| - \log_2 |f^{-1}(y)|.    (3)

The support of the actual repertoire is the pre-image f^{-1}(y).
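As a numerical illustration of Eqs. (1)–(2), the following sketch computes the actual repertoire and effective information for a stochastic system given as a Markov matrix. The channel values are hypothetical, chosen only for illustration; this is a minimal sketch, not code from the paper.

```python
from math import log2

def effective_information(p_y_given_x, y):
    """Effective information ei(m, y), Eqs. (1)-(2): the KL divergence
    from the uniform potential repertoire to the actual repertoire
    obtained via Bayes' rule, for a system given as a Markov matrix
    p_y_given_x[x][y] under interventions do(x)."""
    n = len(p_y_given_x)                                       # |X|
    p_unif = 1.0 / n                                           # potential repertoire
    p_m_y = sum(row[y] * p_unif for row in p_y_given_x)        # p_m(y)
    actual = [row[y] * p_unif / p_m_y for row in p_y_given_x]  # \hat{p}_m(x | y)
    return sum(q * log2(q / p_unif) for q in actual if q > 0)

# Example: a noisy 3-input, 2-output system (hypothetical numbers).
m = [[0.9, 0.1],
     [0.8, 0.2],
     [0.1, 0.9]]
print(effective_information(m, 1))  # output y=1 points fairly sharply at x=2
```

For a deterministic Markov matrix (one-hot rows) this reduces exactly to Eq. (3).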
Elements in the pre-image all have the same probability since they cannot be distinguished by the function f. Effective information quantifies the size of the pre-image relative to the input set – the smaller ("sharper") the pre-image, the higher ei.

Algorithmic information

We show that effective information is a non-universal analog of Kolmogorov complexity. Given universal Turing machine T, the (unnormalized) Solomonoff prior probability of string s is

    p_T(s) := \sum_{\{i \,\mid\, T(i) = s\}} 2^{-len(i)},    (4)

where the sum is over strings i that cause T to output s as a prefix, where no proper prefix of i outputs s, and len(i) is the length of i. Kolmogorov complexity is K(s) := -\log_2 p_T(s). Kolmogorov complexity is usually defined as the length of the shortest program on a universal prefix machine that produces s. The two definitions coincide up to an additive constant by Levin's Coding Theorem [1].

Replace universal Turing machine T with deterministic system f : X \to Y. All inputs have len(x) = \log_2 |X| in the optimal code for the uniform distribution on X. Define the effective probability of y as

    p_f(y) = \sum_{\{x \,\mid\, f(x) = y\}} 2^{-len(x)} = \begin{cases} |f^{-1}(y)| / |X| & \text{if } y \in f(X) \\ 0 & \text{else.} \end{cases}    (5)

Note that p_f(y) is a special case of p_m(y), as defined after Eq. (1). The effective distribution is thus a non-universal analog of the Solomonoff prior, since it is computed by replacing universal Turing machine T in Eq. (4) with deterministic physical system f : X \to Y.

In the deterministic case, effective information turns out to be ei(f, y) = -\log_2 p_f(y), analogously to Kolmogorov complexity. Effective information is non-universal – but computable – since it depends on the choice of f.

Statistical learning theory

This section uses a particular deterministic function, learning algorithm L_{F,D}, to connect effective information and the effective distribution to statistical learning theory.
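The deterministic identities of Eqs. (3) and (5) can be checked directly. A minimal sketch, using a hypothetical two-to-one function on a four-element input set:

```python
from math import log2

def effective_probability(f, X, y):
    """Effective probability p_f(y) = |f^{-1}(y)| / |X|, Eq. (5)."""
    preimage = [x for x in X if f(x) == y]
    return len(preimage) / len(list(X))

def ei_deterministic(f, X, y):
    """Effective information of Eq. (3): log2|X| - log2|f^{-1}(y)|."""
    preimage = [x for x in X if f(x) == y]
    return log2(len(list(X))) - log2(len(preimage))

X = list(range(4))
f = lambda x: x % 2   # hypothetical function identifying inputs mod 2
# ei(f, y) = -log2 p_f(y): the non-universal analog of K(s) = -log2 p_T(s).
print(ei_deterministic(f, X, 0), -log2(effective_probability(f, X, 0)))
```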
Given finite set X, let hypothesis space \Sigma_X = \{\sigma : X \to \{\pm 1\}\} contain all labelings of elements of X. Now, given a set of functions F \subset \Sigma_X and unlabeled data D \in X^l, define learning algorithm (empirical risk minimizer)

    L_{F,D} : \Sigma_X \to \mathbb{R} : \hat\sigma \mapsto \epsilon = \min_{f \in F} \frac{1}{l} \sum_{k=1}^{l} I\bigl[f(d_k) \neq \hat\sigma(d_k)\bigr].    (6)

The learning algorithm takes a labeling of the data as input and outputs the empirical risk of the function that best fits the data. We drop subscripts from the notation L below. Define empirical VC-entropy (as in [3]) as V(F, D) := \log_2 |q_D(F)|, where

    q_D : F \to \{\pm 1\}^l : f \mapsto \bigl(f(d_1), \ldots, f(d_l)\bigr).

Also define empirical Rademacher complexity as

    R(F, D) = \frac{1}{|\Sigma|} \sum_{\sigma \in \Sigma} \sup_{f \in F} \frac{1}{l} \sum_{k=1}^{l} \sigma(d_k) f(d_k).

These capacities can be used to bound the expected risk of classifiers; see [8, 9] for details. The following propositions are proved in [5]:

Proposition 1 (effective information "is" empirical VC-entropy).

    ei(L, 0) = -\log_2 p_L(0) = l - V(F, D).

Proposition 2 (expectation over p_L(\epsilon) "is" empirical Rademacher complexity).

    E[\epsilon \mid p_L] = \sum_{\epsilon} \epsilon \, p_L(\epsilon) = \tfrac{1}{2}\bigl(1 - R(F, D)\bigr).

Thus, replacing the universal Turing machine with learning algorithm L_{F,D}, we obtain that our analog of Kolmogorov complexity, the effective information of output \epsilon = 0, is essentially empirical VC-entropy. Moreover, the expectation of the analog of the Solomonoff distribution is essentially Rademacher complexity. The two quantities ei(L, 0) and E[\epsilon \mid p_L] are measures of explanatory power: as they increase, so expected future performance improves. By Eq. (3), the effective information generated by L is

    ei(L, 0) = \underbrace{\log_2 |\Sigma|}_{\text{total \# hypotheses}} - \underbrace{\log_2 |L^{-1}(0)|}_{\text{\# hypotheses } L \text{ fits}} = \text{\# hypotheses } L \text{ falsifies},    (7)

where hypotheses are counted after taking logarithms.
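Proposition 1 can be verified by brute force on a tiny example. The sketch below uses a hypothetical X, F and D (exhaustive enumeration of \Sigma_X, so only feasible for very small |X|), implements the learning algorithm of Eq. (6), and checks ei(L, 0) = l - V(F, D):

```python
from math import log2
from itertools import product

X = [0, 1, 2]                    # hypothetical input set
D = [0, 1, 2]                    # unlabeled data, l = 3
F = [(1, 1, 1), (1, -1, -1)]     # hypothetical function class, f : X -> {+-1}

Sigma = list(product([-1, 1], repeat=len(X)))   # all labelings of X

def L(sigma):
    """Empirical risk minimizer of Eq. (6): best achievable error on D."""
    return min(sum(f[d] != sigma[d] for d in D) / len(D) for f in F)

# Empirical VC-entropy: log2 of the number of labelings F induces on D.
V = log2(len({tuple(f[d] for d in D) for f in F}))

# Effective information of output epsilon = 0, via the pre-image L^{-1}(0).
ei = log2(len(Sigma)) - log2(sum(1 for s in Sigma if L(s) == 0))

print(ei, len(D) - V)   # Proposition 1: these coincide
```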
Effective information, which relates to VC-entropy, counts the number of hypotheses the learning algorithm falsifies when it fits labels perfectly, without taking into account how often they are wrong. Similarly (see [5] for details), the expectation is

    \sum_{\epsilon} p_L(\epsilon) \, \epsilon = \sum_{\epsilon} \bigl(\text{fraction of hypotheses } L \text{ falsifies on fraction } \epsilon \text{ of the data}\bigr).    (8)

Expected \epsilon, which relates to Rademacher complexity, looks at the average behavior of the learning algorithm, averaging over the fractions of hypotheses falsified, weighted by how much of the data they are falsified on.

The bounds proved in [3, 8, 9], which control the expected future performance of the classifier minimizing empirical risk, can therefore be rephrased in terms of the number of hypotheses falsified by the learning algorithm, Eqs. (7) and (8), suggesting a possible route towards rigorously grounding the role of falsification in science [6].

Shannon information

We relate effective information to Shannon and mutual information. Suppose we have model m that generates data d ∈ D with probability p_m(d | h) given hypothesis h ∈ H. For prior distribution p(H) on hypotheses, the information gained by observing d is

    D\bigl[p_m(H \mid d) \,\big\|\, p(H)\bigr].    (9)

Kullback-Leibler divergence D[p \| q] can be interpreted as the number of Y/N questions required to get from q to p. Thus, Eq. (9) quantifies how many Y/N questions the model answers about the hypotheses using the data. Effective information, Eq. (2), quantifies the information gained when physical system m outputs y. Rather than inferring on hypotheses, the system, by producing an output, specifies probabilistic constraints on what its input must have been. Effective information uses the uniform (maximum entropy) prior since any other prior would insert additional data not belonging to the system – the prior is something else, on top of m.
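Proposition 2 and the expectation appearing in Eq. (8) can likewise be checked by enumerating the effective distribution p_L(\epsilon). A brute-force sketch in the same style as above, again with a hypothetical X, F and D (exhaustive, small cases only):

```python
from itertools import product
from collections import Counter

X = [0, 1, 2]                    # hypothetical input set
D = [0, 1, 2]                    # unlabeled data, l = 3
F = [(1, 1, 1), (1, -1, -1)]     # hypothetical function class

Sigma = list(product([-1, 1], repeat=len(X)))

def L(sigma):
    """Empirical risk minimizer of Eq. (6)."""
    return min(sum(f[d] != sigma[d] for d in D) / len(D) for f in F)

# Effective distribution p_L(eps): fraction of labelings with best fit eps.
counts = Counter(L(s) for s in Sigma)
p_L = {eps: c / len(Sigma) for eps, c in counts.items()}

# Left-hand side of Eq. (8): expected epsilon under p_L.
expected_eps = sum(eps * p for eps, p in p_L.items())

# Empirical Rademacher complexity R(F, D), averaged over all labelings.
R = sum(max(sum(s[d] * f[d] for d in D) / len(D) for f in F)
        for s in Sigma) / len(Sigma)

# Proposition 2: E[eps | p_L] = (1 - R(F, D)) / 2.
print(expected_eps, (1 - R) / 2)
```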
However, this restriction is not essential and will be dropped for the remainder of this section.

Consider the following scenario. We have isomorphic sets X and X', and a deterministic physical system c : X \to X' that copies its inputs, mapping x_k \mapsto x'_k for example. Given prior p(X), the effective information generated is

    ei\bigl(p(X), c, x'_k\bigr) := D\bigl[p_c(X \mid x'_k) \,\big\|\, p(X)\bigr] = D\bigl[\delta_{x_k} \,\big\|\, p(X)\bigr] = -\log_2 p(x_k),

the surprise of x_k. It follows that Shannon information is expected effective information:

    H(X) = E_{p_c(X')}\bigl[ei\bigl(p(X), c, x'_k\bigr)\bigr].

More generally, if we are given noisy memoryless channel m from X to Y with distribution p(X) on X, then mutual information is the expectation

    I(X; Y) = E_{p_m(Y)}\bigl[ei\bigl(p(X), m, y\bigr)\bigr],

where p_m(y) = \sum_x p_m(y \mid do(x)) \, p(x) is the effective distribution on Y. Thus, Shannon and mutual information are simply averages of effective information, our non-universal analog of Kolmogorov complexity.

Finally, interpreting effective information as information gain, Eq. (9), and combining with results from the previous section shows that (l - \text{empirical VC-entropy}) is the information we gain about the set \Sigma_X of hypotheses when told that learning algorithm L_{F,D} fit the labeled data perfectly.

Discussion

This note starts from the observation that all physical systems classify inputs and thereby generate information. A deterministic physical system f : X \to Y implicitly categorizes its inputs by assigning them to outputs: the category assigned to output y is the set of inputs in the pre-image f^{-1}(y) \subset X. The intuition carries through in the probabilistic case after replacing pre-images with actual repertoires. Effective information then quantifies the sharpness of categories: the sharper a category, the more informative the corresponding output.
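The claim that mutual information is expected effective information can be checked numerically. A sketch with a hypothetical noisy channel and prior (pure Python; for the copy map, i.e. an identity channel, the same average recovers H(X)):

```python
from math import log2

def expected_ei(p_x, channel):
    """Average effective information E_{p_m(Y)}[ei(p(X), m, y)] for a
    memoryless channel given as channel[x][y] = p(y | x)."""
    n_y = len(channel[0])
    p_y = [sum(p_x[x] * channel[x][y] for x in range(len(p_x)))
           for y in range(n_y)]                       # effective distribution
    total = 0.0
    for y in range(n_y):
        # Actual repertoire p(x | y) via Bayes' rule, then KL to the prior.
        post = [p_x[x] * channel[x][y] / p_y[y] for x in range(len(p_x))]
        ei_y = sum(q * log2(q / p) for q, p in zip(post, p_x) if q > 0)
        total += p_y[y] * ei_y
    return total

def mutual_info(p_x, channel):
    """Standard mutual information I(X; Y), for comparison."""
    p_y = [sum(p_x[x] * channel[x][y] for x in range(len(p_x)))
           for y in range(len(channel[0]))]
    return sum(p_x[x] * channel[x][y] * log2(channel[x][y] / p_y[y])
               for x in range(len(p_x))
               for y in range(len(channel[0])) if channel[x][y] > 0)

p_x = [0.5, 0.25, 0.25]                              # hypothetical prior
channel = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]       # hypothetical channel
print(expected_ei(p_x, channel), mutual_info(p_x, channel))  # equal values
```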
Alternatively, effective information quantifies causal dependencies: outputs with high ei are extremely sensitive to changes in the input.

Effective information is a concrete, computable analog of Kolmogorov complexity. The Kolmogorov complexity of a string quantifies the "work" required to produce it; roughly, the length of the programs that output it. Since universal Turing machines require infinite storage space and are therefore impossible to construct, it is unclear how relevant they are to processes actually occurring in nature. Effective information substitutes a deterministic model of a physical system in place of the universal Turing machine, and quantifies the "work" required to produce an output as the number of Y/N decisions required to choose it.

Both Shannon and mutual information arise as expectations of effective information after tweaking to get rid of the uniform prior. The difference between Kolmogorov complexity and Shannon information reduces to: (i) replacing a universal Turing machine with a specific system (channel) and (ii) computing the average information gain over all outputs, rather than a single one.

When the physical process under consideration is empirical risk minimization, the effective information it generates contributes to bounds on expected risk. In particular, the work (the number of Y/N decisions) required to fit data D using functions in F essentially is the empirical VC-entropy. Since finding the optimal classifier in F requires computing L_{F,D} in some way or another, thereby implementing it physically, it follows that the effective information generated while fitting data has implications for the future performance of classifiers, see [3, 8, 9].
Effective information and the expected risk over the effective distribution also provide new interpretations of VC-entropy and Rademacher complexity in terms of falsifying hypotheses, see Eqs. (7) and (8) – and also [10] for a comparison of falsification with VC-dimension. Viewing empirical risk minimization as a physical process that classifies hypotheses according to fit \epsilon thus directly links VC-entropy and Rademacher complexity with Popper's proposal that the power of a scientific theory lies in how many hypotheses it rules out, rather than the amount of data it explains [6].

The links with Kolmogorov complexity, learning theory, information gain and falsification shown above suggest it is worth investigating whether the effective information generated while optimizing quantities other than empirical risk (e.g. margins) has implications for future performance.

Acknowledgements. I thank Samory Kpotufe and Pedro Ortega for useful discussions.

References

[1] Li M, Vitányi P (2008) An Introduction to Kolmogorov Complexity and Its Applications. Springer.
[2] Shannon C (1948) A mathematical theory of communication. Bell Systems Tech J 27:379–423.
[3] Vapnik V (1998) Statistical Learning Theory. John Wiley & Sons.
[4] Balduzzi D, Tononi G (2008) Integrated Information in Discrete Dynamical Systems: Motivation and Theoretical Framework. PLoS Comput Biol 4:e1000091. doi:10.1371/journal.pcbi.1000091.
[5] Balduzzi D (in press) Falsification and Future Performance. In: Proceedings of Solomonoff 85th Memorial Conference. Springer Lecture Notes in Artificial Intelligence.
[6] Popper K (1959) The Logic of Scientific Discovery. Hutchinson.
[7] Pearl J (2000) Causality: models, reasoning and inference. Cambridge University Press.
[8] Boucheron S, Lugosi G, Massart P (2000) A Sharp Concentration Inequality with Applications. Random Structures and Algorithms 16:277–292.
[9] Bousquet O, Boucheron S, Lugosi G (2004) Introduction to Statistical Learning Theory. In: Bousquet O, von Luxburg U, Rätsch G, editors, Advanced Lectures on Machine Learning, Springer. pp. 169–207.
[10] Corfield D, Schölkopf B, Vapnik V (2009) Falsification and Statistical Learning Theory: Comparing the Popper and Vapnik-Chervonenkis Dimensions. Journal for General Philosophy of Science 40:51–58.