Lectures on Probability, Entropy, and Statistical Physics
These lectures deal with the problem of inductive inference, that is, the problem of reasoning under conditions of incomplete information. Is there a general method for handling uncertainty? Or, at least, are there rules that could in principle be followed by an ideally rational agent when discussing scientific matters?
Authors: Ariel Caticha
Department of Physics, University at Albany – SUNY

Contents

Preface
1 Inductive Inference
  1.1 Probability
  1.2 Inductive reasoning
2 Probability
  2.1 Consistent reasoning: degrees of belief
  2.2 The Cox Axioms
  2.3 Regraduation: the Product Rule
    2.3.1 Cox's first theorem
    2.3.2 Proof of the Associativity Theorem
    2.3.3 Setting the range of degrees of belief
  2.4 Further regraduation: the Sum Rule
    2.4.1 Cox's second theorem
    2.4.2 Proof of the Compatibility Theorem
  2.5 Some remarks on the sum and product rules
    2.5.1 On meaning, ignorance and randomness
    2.5.2 The general sum rule
    2.5.3 Independent and mutually exclusive events
    2.5.4 Marginalization
  2.6 The expected value
  2.7 The binomial distribution
  2.8 The law of large numbers
  2.9 The Gaussian distribution
    2.9.1 The de Moivre-Laplace theorem
    2.9.2 The Central Limit Theorem
  2.10 Updating probabilities: Bayes' rule
    2.10.1 Formulating the problem
    2.10.2 Minimal updating: Bayes' rule
    2.10.3 Multiple experiments, sequential updating
    2.10.4 Remarks on priors
  2.11 Examples from data analysis
    2.11.1 Parameter estimation
    2.11.2 Curve fitting
    2.11.3 Model selection
    2.11.4 Maximum Likelihood
3 Entropy I: Carnot's Principle
  3.1 Carnot: reversible engines
  3.2 Kelvin: temperature
  3.3 Clausius: entropy
  3.4 Maxwell: probability
  3.5 Gibbs: beyond heat
  3.6 Boltzmann: entropy and probability
  3.7 Some remarks
4 Entropy II: Measuring Information
  4.1 Shannon's information measure
  4.2 Relative entropy
  4.3 Joint entropy, additivity, and subadditivity
  4.4 Conditional entropy and mutual information
  4.5 Continuous distributions
  4.6 Communication Theory
  4.7 Assigning probabilities: MaxEnt
  4.8 Canonical distributions
  4.9 On constraints and relevant information
5 Statistical Mechanics
  5.1 Liouville's theorem
  5.2 Derivation of Equal a Priori Probabilities
  5.3 The relevant constraints
  5.4 The canonical formalism
  5.5 The Second Law of Thermodynamics
  5.6 The thermodynamic limit
  5.7 Interpretation of the Second Law
  5.8 Remarks on irreversibility
  5.9 Entropies, descriptions and the Gibbs paradox
6 Entropy III: Updating Probabilities
  6.1 What is information?
  6.2 Entropy as a tool for updating probabilities
  6.3 The proofs
    6.3.1 Axiom 1: Locality
    6.3.2 Axiom 2: Coordinate invariance
    6.3.3 Axiom 1 again
    6.3.4 Axiom 3: Consistency for identical independent subsystems
    6.3.5 Axiom 3: Consistency for non-identical subsystems
    6.3.6 Axiom 3: Consistency with the law of large numbers
  6.4 Random remarks
    6.4.1 On deductive vs. inductive systems
    6.4.2 On priors
    6.4.3 Comments on other axiomatizations
  6.5 Bayes' rule as a special case of ME
  6.6 Commuting and non-commuting constraints
  6.7 Information geometry
    6.7.1 Derivation from distinguishability
    6.7.2 Derivation from a Euclidean metric
    6.7.3 Derivation from relative entropy
    6.7.4 Volume elements in curved spaces
  6.8 Deviations from maximum entropy
  6.9 An application to fluctuations
  6.10 Conclusion
References

Preface

Science consists in using information about the world for the purpose of predicting, explaining, understanding, and/or controlling phenomena of interest. The basic difficulty is that the available information is usually insufficient to attain any of those goals with certainty.

In these lectures we will be concerned with the problem of inductive inference, that is, the problem of reasoning under conditions of incomplete information. Is there a general method for handling uncertainty? Or, at least, are there rules that could in principle be followed by an ideally rational agent when discussing scientific matters? What makes one statement more plausible than another?
How much more plausible? And then, when new information is acquired, how does such an agent change its mind? Or, to put it differently, are there rules for learning? Are there rules for processing information that are objective and consistent? Are they unique? And, come to think of it, what, after all, is information? It is clear that data "contains" or "conveys" information, but what does this precisely mean? Can information be conveyed in other ways? Is information some sort of physical fluid that can be contained or transported? Is information physical? Can we measure amounts of information? Do we need to?

Our goal is to develop the main tools for inductive inference – probability and entropy – and to illustrate their use in physics. To be specific we will concentrate on examples borrowed from the foundations of classical statistical physics, but this is not meant to reflect a limitation of these inductive methods, which, as far as we can tell at present, are of universal applicability. It is just that statistical mechanics is rather special in that it provides us with the first examples of fundamental laws of physics that can be derived as examples of inductive inference. Perhaps all laws of physics can be derived in this way.

The level of these lectures is somewhat uneven. Some topics are fairly advanced – the subject of recent research – while some other topics are very elementary. I can give two related reasons for including the latter. First, the standard education of physicists includes a very limited study of probability and even of entropy – maybe just a little about errors in a laboratory course, or maybe a couple of lectures as a brief mathematical prelude to statistical mechanics. The result is a widespread misconception that these subjects are trivial and unproblematic – that the real problems of theoretical physics lie elsewhere, and that if your experimental data require analysis, then you have done the wrong experiment. Which brings me to the second reason. It would be very surprising to find that the interpretations of probability and of entropy turned out to bear no relation to our understanding of statistical mechanics and quantum mechanics. Indeed, if the only notion of probability at your disposal is that of a frequency in a large number of trials you might be led to think that the ensembles of statistical mechanics must be real, and to regard their absence as an urgent problem demanding an immediate solution – perhaps an ergodic solution. You might also be led to think that similar ensembles are needed in quantum theory and therefore that quantum theory requires the existence of an ensemble of parallel universes. Similarly, if the only notion of entropy available to you is derived from thermodynamics, you might end up thinking that entropy is a physical quantity related to heat and disorder, that it can be measured in the lab, and that it therefore has little or no relevance beyond statistical mechanics. It is very worthwhile to revisit the elementary basics not because they are easy – they are not – but because they are fundamental.

Many are the subjects that I have left out but wish I had included in these lectures.
Some relate to inference proper – the assignment of priors, information geometry, model selection, and the theory of questions or inductive inquiry – while others deal with applications to the foundations of both classical and quantum physics. As a provisional remedy, at the very end I provide a short and very biased list of suggestions for further reading.

Acknowledgements: The points of view expressed here reflect much that I have learned from discussions with many colleagues and friends: C. Cafaro, N. Caticha, V. Dose, R. Fischer, A. Garrett, A. Giffin, M. Grendar, K. Knuth, R. Preuss, C. Rodríguez, J. Skilling, and C.-Y. Tseng. I hope they will not judge these lectures by those few instances where we have not yet managed to reach agreement. I would also like to express my special thanks to Julio Stern and to the organizers of MaxEnt 2008 for their encouragement to pull these notes together into some sort of printable form.

Albany, May 2008.

Chapter 1: Inductive Inference

The process of drawing conclusions from available information is called inference. When the available information is sufficient to make unequivocal, unique assessments of truth we speak of making deductions: on the basis of a certain piece of information we deduce that a certain proposition is true. The method of reasoning leading to deductive inferences is called logic. Situations where the available information is insufficient to reach such certainty lie outside the realm of logic. In these cases we speak of making a probable inference, and the method of reasoning is probability theory. Alternative names are 'inductive inference' and 'inductive logic'. The word 'induction' refers to the process of using limited information about a few special cases to draw conclusions about more general situations.

1.1 Probability

The question of the meaning and interpretation of the concept of probability has long been controversial. Needless to say the interpretations offered by various schools are at least partially successful or else they would already have been discarded. But the different interpretations are not equivalent. They lead people to ask different questions and to pursue their research in different directions. Some questions may become essential and urgent under one interpretation while totally irrelevant under another. And perhaps even more important: under different interpretations equations can be used differently and this can lead to different predictions.

Historically the frequentist interpretation has been the most popular: the probability of a random event is given by the relative number of occurrences of the event in a sufficiently large number of identical and independent trials. The appeal of this interpretation is that it seems to provide an empirical method to estimate probabilities by counting over the set of trials – an ensemble. The magnitude of a probability is obtained solely from the observation of many repeated trials and does not depend on any feature or characteristic of the observers. Probabilities interpreted in this way have been called objective. This view dominated the fields of statistics and physics for most of the 19th and 20th centuries (see, e.g., [von Mises 57]).
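As a concrete illustration of this frequency-counting picture, here is a minimal simulation (added for illustration, not part of the original lectures; the bias value 0.3 and the use of Python's random module are arbitrary choices). The relative frequency of heads in repeated tosses of a biased coin approaches the underlying probability only as the ensemble of trials grows.

```python
import random

def relative_frequency(p_heads, n_trials, seed=0):
    """Estimate P(heads) by counting occurrences over n_trials independent tosses."""
    rng = random.Random(seed)
    heads = sum(rng.random() < p_heads for _ in range(n_trials))
    return heads / n_trials

# The frequentist estimate refers to the ensemble of trials, not to any single toss,
# and it fluctuates considerably until the ensemble becomes large.
for n in (10, 100, 10_000, 1_000_000):
    print(n, relative_frequency(p_heads=0.3, n_trials=n))
```

For small ensembles the estimate wanders noticeably, which is precisely the kind of operational question ("how large should the number of trials be?") raised in the objections that follow.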
One disadvantage of the frequentist approach has to do with matters of rigor: what precisely does one mean by 'random'? If the trials are sufficiently identical, shouldn't one always obtain the same outcome? Also, if the interpretation is to be validated on the basis of its operational, empirical value, how large should the number of trials be? Unfortunately, the answers to these questions are neither easy nor free from controversy. By the time the tentative answers have reached a moderately acceptable level of sophistication the intuitive appeal of this interpretation has long been lost. In the end, it seems the frequentist interpretation is most useful when left a bit vague.

A more serious objection is the following. In the frequentist approach the notion of an ensemble of trials is central. In cases where there is a natural ensemble (tossing a coin, or a die, spins in a lattice, etc.) the frequency interpretation seems natural enough. But for many other problems the construction of an ensemble is at best highly artificial. For example, consider the probability of there being life on Mars. Are we to imagine an ensemble of Mars planets and solar systems? In these cases the ensemble would be purely hypothetical. It offers no possibility of an empirical determination of a relative frequency and this defeats the original goal of providing an objective operational interpretation of probabilities as frequencies. In yet other problems there is no ensemble at all: consider the probability that the nth digit of the number π be 7. Are we to imagine alternative universes with different values for the number π? It is clear that there are a number of interesting problems where one suspects the notion of probability could be quite useful but which nevertheless lie outside the domain of the frequentist approach.

According to the Bayesian interpretations, which can be traced back to Bernoulli and Laplace, but have only achieved popularity in the last few decades, a probability reflects the confidence, the degree of belief of an individual in the truth of a proposition. These probabilities are said to be Bayesian because of the central role played by Bayes' theorem – a theorem which is actually due to Laplace. This approach enjoys several advantages. One is that the difficulties associated with attempting to pinpoint the precise meaning of the word 'random' can be avoided. Bayesian probabilities are not restricted to repeatable events; they allow us to reason in a consistent and rational manner about unique, singular events. Thus, in going from the frequentist to the Bayesian interpretations the domain of applicability and therefore the usefulness of the concept of probability is considerably enlarged.

The crucial aspect of Bayesian probabilities is that different individuals may have different degrees of belief in the truth of the very same proposition, a fact that is described by referring to Bayesian probabilities as being subjective. This term is somewhat misleading because there are (at least) two views on this matter: one is the so-called subjective Bayesian or personalistic view (see, e.g., [Savage 72, Howson Urbach 93, Jeffrey 04]), and the other is the objective Bayesian view (see, e.g., [Jeffreys 39, Jaynes 85, 03, Lucas 70]). For an excellent introduction with a philosophical perspective see [Hacking 01].
According to the subjective view, two reasonable individuals faced with the same evidence, the same information, can legitimately differ in their confidence in the truth of a proposition and may therefore assign different probabilities. Subjective Bayesians accept that an individual can change his or her beliefs merely on the basis of introspection, reasoning, or even revelation.

At the other end of the Bayesian spectrum, the objective Bayesian view considers the theory of probability as an extension of logic. It is said then that a probability measures a degree of rational belief. It is assumed that the objective Bayesian has thought so long and hard about how probabilities are assigned that no further reasoning will induce a revision of beliefs except when confronted with new information. In an ideal situation two different individuals will, on the basis of the same information, assign the same probabilities.

Whether Bayesian probabilities are subjective or objective is still a matter of dispute. Our position is that they lie somewhere in between. Probabilities will always retain a "subjective" element because translating information into probabilities involves judgments and different people will inevitably judge differently. On the other hand, not all probability assignments are equally useful and it is plausible that what makes some assignments better than others is that they represent or reflect some objective feature of the world. One might even say that what makes them better is that they provide a better guide to the "truth". Thus, probabilities can be characterized by both subjective and objective elements and, ultimately, it is their objectivity that makes probabilities useful. In fact we shall see that while the subjective element in probabilities can never be completely eliminated, the rules for processing information, that is, the rules for updating probabilities, are themselves quite objective. This means that the new information can be objectively processed and incorporated into our posterior probabilities. Thus, it is quite possible to continuously suppress the subjective elements while enhancing the objective elements as we process more and more information.

1.2 Inductive reasoning

We discussed how the study of macroscopic systems requires a general theory to allow us to carry out inferences on the basis of incomplete information, and our first step should be to inquire what this theory or language for inference should be. The principle of reasoning that we will follow is simple, compelling, and quite common in science [Skilling 89]:

If a general theory exists, it must apply to special cases.

If a certain special case happens to be known then this knowledge can be used to constrain the general theory: all candidate theories that fail to reproduce the known example are discarded. If a sufficient number of special cases is known then the general theory might be completely determined. The method allows us to extrapolate from a few special cases where we know what to expect to more general cases where we did not. This is a method for induction, for generalization. Of course, it may happen that there are too many constraints, in which case there is no general theory that reproduces them all.

Philosophers have a name for such a method: they call it eliminative induction [Earman 92].
On the negative side, the Principle of Eliminative Induction (PEI), like any other form of induction, is not guaranteed to work. On the positive side, the PEI adds an interesting twist to Popper's scientific methodology. According to Popper scientific theories can never be proved right, they can only be proved false; a theory is corroborated only to the extent that all attempts at falsifying it have failed. Eliminative induction is fully compatible with Popper's notions but the point of view is just the opposite. Instead of focusing on failure to falsify one focuses on success: it is the successful falsification of all rival theories that corroborates the surviving one. The advantage is that one acquires a more explicit understanding of why competing theories are eliminated.

This inductive method will be used several times. First, in chapter 2, to show that if a general theory of inference exists, then it must coincide with the usual theory of probability. In other words, we will show that degrees of belief, those measures of plausibility that we require to do inference, should be manipulated and calculated using the ordinary rules of the calculus of probabilities, and therefore that probabilities can be interpreted as degrees of belief [Cox 46, Jaynes 57a, 03].

But with this achievement, enormous as it is, we do not yet reach our final goal. The problem is that what the rules of probability theory will allow us to do is to assign probabilities to some "complex" propositions on the basis of the probabilities that have been previously assigned to other, perhaps more "elementary" propositions. The issue of how to assign probabilities to the elementary propositions is not addressed.

Historically the first partial solution to this problem was suggested by James Bernoulli (1713). The idea is simple: in those situations where there are several alternatives that can be enumerated and counted, and where one has no reason to favor one over another, the alternatives should be deemed equally probable. The equality of the degrees of belief reflects the symmetry of one's state of knowledge or, rather, of ignorance. This mode of reasoning has been called the 'Principle of Insufficient Reason' and is usually associated with the name of Laplace (1812).

The principle has been particularly successful in dealing with situations where there is some positive, sufficient reason to suspect that the various alternatives should be considered equally likely. For example, in certain games of chance the symmetry among possible outcomes is attained on purpose, by construction. These games are special because they are deliberately designed so that information about previous outcomes is irrelevant to the prediction of future outcomes and the symmetry of our state of ignorance about the future is very robust.

The range of applications of Laplace's principle is, however, limited. There are situations where it is not clear what 'equally likely' means. For example, it might not be possible to count the alternatives, or maybe the possible outcomes are distributed over continuous ranges. Also, there are situations where there is information leading one to prefer some alternatives over others; how can such information be incorporated in a systematic way? One needs a method that generalizes Laplace's principle. Progress toward this goal came from an unexpected direction.
While investigating the capacity of communication channels to transmit information, Shannon came to appreciate the need for a quantitative measure of the notion of "amount of missing information" or the "amount of uncertainty" in a probability distribution. In 1948 he succeeded in finding such a measure and thereby initiated the field of information theory [Shannon 48].

As we will see in chapter 4, Shannon's argument is a second application of the induction principle above: a general theory, if it exists at all, must apply to special cases. He argued that in order to qualify as a measure of ignorance or of missing information a quantity S would have to satisfy some reasonable conditions – the Shannon axioms – and these conditions were sufficiently constraining to determine the quantity S uniquely: there is only one way to measure the amount of uncertainty in a probability distribution. It was rather surprising that the expression that Shannon obtained for S in communication theory coincided with expressions that had previously been used by Boltzmann and by Gibbs to represent entropy in the very different context of statistical mechanics and thermodynamics. This coincidence led Shannon to choose the name 'entropy' for his quantity S. Somewhat later, however, Brillouin and Jaynes realized that the similarity of Shannon's entropy with Gibbs' entropy could not be a mere coincidence, and thus began a process that would radically alter our understanding of the thermodynamical entropy of Clausius [Brillouin 52, Jaynes 57b].

The crucial contribution of Jaynes was the insight that the Shannon derivation was not limited to information in communication channels, but that the same mathematics can be applied to information in general. It establishes a basis for a general method of inference that includes Laplace's principle of insufficient reason as a special case. In fact, it became clear that on a purely intuitive basis Boltzmann and Gibbs had already found and had made extensive use of this method in statistical mechanics. With the Boltzmann-Gibbs-Jaynes method we can revisit the question of how to assign those probabilities that will be used as the starting point for the calculation of all others. The answer is simple: among all possible probability distributions that satisfy the constraints implied by the limited available information we select that particular distribution that reflects maximum ignorance about those aspects of the problem about which nothing is known. What else could we do? It seems this is the only intellectually honest way to proceed. And the procedure is mathematically clear: since ignorance is measured by entropy, the desired probability distribution is obtained by maximizing the entropy subject to whatever conditions are known to constrain the system. This is called the Method of Maximum Entropy and it is usually abbreviated as MaxEnt.
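As a small illustrative sketch (added here, not from the original text), the code below carries out this MaxEnt prescription for a toy problem: a distribution over the six faces of a die constrained to have expected value 4.5. Both the constraint value and the use of scipy's root finder are arbitrary choices for the example; the maximizer comes out in the exponential ("canonical") form discussed later in chapters 4 and 5.

```python
import numpy as np
from scipy.optimize import brentq

# MaxEnt with a single constraint <x> = mu over the outcomes x = 1..6:
# the entropy maximizer has the canonical form p_i ∝ exp(-lam * x_i),
# with the Lagrange multiplier lam fixed by the constraint.
x = np.arange(1, 7)

def mean_for(lam):
    w = np.exp(-lam * x)
    p = w / w.sum()
    return p @ x

mu = 4.5                                        # example constraint (arbitrary choice)
lam = brentq(lambda l: mean_for(l) - mu, -5.0, 5.0)
p = np.exp(-lam * x)
p /= p.sum()

print("lambda =", lam)
print("p =", p, " mean =", p @ x)
# With no constraint at all (lam = 0) the maximizer is uniform: MaxEnt then
# reduces to Laplace's principle of insufficient reason.
```

Setting the multiplier to zero, that is, imposing no constraint beyond normalization, recovers the equal probabilities 1/6, so Laplace's principle appears as the special case of no information.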
But the procedure is not without its problems. These may, to some, seem relatively minor, but one may reasonably argue that any problem of principle is necessarily a major problem. For example, the Shannon axioms refer to discrete probability distributions rather than continuous ones, and generalizing his measure of uncertainty is not altogether straightforward. Another, perhaps more serious, problem is that the axioms themselves may be self-evident to some but not to others: do the Shannon axioms really codify what we mean by uncertainty? Are there other measures of uncertainty? Indeed, others have been proposed. Thus, despite its obvious success, in the eyes of many the MaxEnt method remains controversial and several variations on its justification have been proposed.

In chapter 6 we present an extension of the method of maximum entropy (which we will abbreviate ME to distinguish it from the older MaxEnt) which derives from the work of Shore and Johnson. They point out what is perhaps the main drawback of the Shannon-Jaynes approach: it is indirect. First one finds how to measure an amount of uncertainty and then one argues that the only unbiased way to incorporate information into a probability distribution is to maximize this measure subject to constraints. The procedure can be challenged by arguing that, even granted that entropy measures something, how sure can we be that this something is uncertainty, ignorance? Shore and Johnson argue that what one really wants is a consistent method to process information directly, without detours that invoke questionable measures of uncertainty.

A third application of the general inductive method – a general theory, if it exists at all, must apply to special cases [Skilling 88] – yields the desired procedure: there is a unique method to update from an old set of beliefs codified in a prior probability distribution into a new set of beliefs described by a new, posterior distribution when the information available is in the form of a constraint on the family of acceptable posteriors. The updated posterior distribution is that of maximum "relative" entropy. The axioms of the ME method are, hopefully, more self-evident: they reflect the conviction that what was learned in the past is important and should not be frivolously ignored. The chosen posterior distribution should coincide with the prior as closely as possible and one should only update those aspects of one's beliefs for which corrective new evidence has been supplied. Furthermore, since the new axioms do not tell us what and how to update, they merely tell us what not to update, they have the added bonus of maximizing objectivity – there are many ways to change something but only one way to keep it the same [Caticha 03, Caticha Giffin 06, Caticha 07]. This alternative justification for the method of maximum entropy turns out to be directly applicable to continuous distributions, and it establishes the value of the concept of entropy irrespective of its interpretation in terms of heat, or disorder, or uncertainty. In this approach entropy is purely a tool for consistent reasoning; strictly, it needs no interpretation. Perhaps this is the reason why the meaning of entropy has turned out to be such an elusive concept.

Chapter 2: Probability

Our goal is to establish the theory of probability as the general theory for reasoning on the basis of incomplete information. This requires us to tackle two different problems. The first problem is to figure out how to achieve a quantitative description of a state of knowledge. Once this is settled we address the second problem of how to update from one state of knowledge to another when new information becomes available.
Throughout we will assume that the subject matter – the set of statements the truth of which we want to assess – has been clearly specified. This question of what it is that we are actually talking about is much less trivial than it might appear at first sight.[1] Nevertheless, it will not be discussed further.

[1] Consider the example of quantum mechanics: Are we talking about particles, or about experimental setups, or both? Are we talking about position variables, or about momenta, or both? Or neither? Is it the position of the particles or the position of the detectors?

The first problem, that of describing or characterizing a state of knowledge, requires that we quantify the degree to which we believe each proposition in the set is true. The most basic feature of these beliefs is that they form an interconnected web that must be internally consistent. The idea is that in general the strengths of one's beliefs in some propositions are constrained by one's beliefs in other propositions; beliefs are not independent of each other. For example, the belief in the truth of a certain statement a is strongly constrained by the belief in the truth of its negation, not-a: the more I believe in one, the less I believe in the other. As we will see below, the basic desiderata for such a scheme, which are expressed in the Cox axioms [Cox 46], lead to a unique formalism in which degrees of belief are related to each other using the standard rules of probability theory. Then we explore some of the consequences. For experiments that can be repeated indefinitely one recovers standard results, such as the law of large numbers, and the connection between probability and frequency.

The second problem, that of updating from one consistent web of beliefs to another when new information becomes available, will be addressed for the special case that the information is in the form of data. The basic updating strategy reflects the conviction that what we learned in the past is valuable, that the web of beliefs should only be revised to the extent required by the data. We will see that this principle of minimal updating leads to the uniquely natural rule that is widely known as Bayes' theorem. (More general kinds of information can also be processed using the minimal updating principle but they require a more sophisticated tool, namely relative entropy. This topic will be extensively explored later.) As an illustration of the enormous power of Bayes' rule we will briefly explore its application to data analysis.

2.1 Consistent reasoning: degrees of belief

We discussed how the study of physical systems in general requires a theory of inference on the basis of incomplete information. Here we will show that a general theory of inference, if it exists at all, coincides with the usual theory of probability. We will show that the quantitative measures of plausibility or degrees of belief that we introduce as tools for reasoning should be manipulated and calculated using the ordinary rules of the calculus of probabilities. Therefore probabilities can be interpreted as degrees of belief.

The procedure we follow differs in one remarkable way from the traditional way of setting up physical theories. Normally one starts with the mathematical formalism, and then one proceeds to try to figure out what the formalism could possibly mean; one tries to append an interpretation to it.
This is a very difficult problem; historically it has affected not only statistical physics – what is the meaning of probabilities and of entropy? – but also quantum theory – what is the meaning of wave functions and amplitudes? Here we proceed in the opposite order: we first decide what we are talking about, degrees of belief or plausibility (we use the two expressions interchangeably), and then we design rules to manipulate them; we design the formalism, we construct it to suit our purposes. The advantage of this approach is that the issue of meaning, of interpretation, is settled from the start.

Before we proceed further it may be important to emphasize that the degrees of belief discussed here are those held by an idealized rational agent that would not be subject to the practical limitations under which we humans operate. We discuss degrees of rational belief and not the irrational and inconsistent beliefs that real humans seem to hold. We are concerned with the ideal optimal standard of rationality that we humans ought to attain at least when discussing scientific matters.

Any suitable measure of belief must allow us to represent the fact that, given any two statements a and b, either a is more plausible than b, or a is less plausible than b, or else a and b are equally plausible. That this is possible is implicit in what we mean by 'plausibility'. Thus we can order assertions according to increasing plausibility: if a statement a is more plausible than b, and b is itself more plausible than another statement c, then a is more plausible than c. Since any transitive ordering, such as the one just described, can be represented with real numbers, we are led to the following requirement:

Degrees of rational belief (or, as we shall later call them, probabilities) are represented by real numbers.

The next and most crucial requirement is that whenever a degree of belief can be computed in two different ways the two results must agree:

The assignment of degrees of rational belief must be consistent.

Otherwise we could get entangled in confusing paradoxes: by following one computational path we could decide that a statement a is more plausible than a statement b, but if we were to follow a different path we could conclude the opposite. Consistency is the crucial requirement that eliminates vagueness and transforms our general qualitative statements into precise quantitative ones.

Our general theory of inference is constructed using the inductive method described in the previous chapter: if a general theory exists, then it must reproduce the right answers in those special cases where the answers happen to be known; these special cases constrain the general theory; given enough such constraints, the general theory is fully determined.

Before we write down the special cases that will play the role of the axioms of probability theory we should introduce a convenient notation. A degree of plausibility is a real number that we will assign to a statement a on the basis of some information that we have, and it will obviously depend on what that information actually is. A common kind of information takes the form of another statement b which is asserted to be true. Therefore, a degree of plausibility is a real number assigned to two statements a and b, rather than just one.
Our notation should reflect this. Let P(a|b) denote the plausibility that statement a is true provided we know b to be true. P(a|b) is read 'the degree of plausibility (or, later, the probability) of a given b'. P(a|b) is commonly called a conditional probability (the probability of a given that condition b holds). When b turns out to be false, we shall regard P(a|b) as undefined. Although the notation P(a|b) is quite convenient we will not always use it; we will often just write P(a) omitting the statement b, or we might even just write P. It is, however, important to realize that degrees of belief and probabilities are always conditional on something even if that something is not explicitly stated.

More notation: for every statement a there exists its negation not-a, which will be denoted with a prime, a'. If a is true, then a' is false and vice versa. Given two statements a_1 and a_2 we can form their conjunction 'a_1 and a_2', which we will denote as a_1 a_2. The conjunction is true if and only if both a_1 and a_2 are true. Given a_1 and a_2, we can also form their disjunction 'a_1 or a_2'. The disjunction will be denoted by a_1 + a_2 and it is true when either a_1 or a_2 or both are true; it is false when both a_1 and a_2 are false.

Now we proceed to state the axioms [Cox 46, Jaynes 03].

2.2 The Cox Axioms

The degrees of belief or plausibility we assign to a statement a and to its negation a' are not independent of each other. The more plausible one is, the less plausible the other becomes; if one increases we expect the other to decrease and vice versa. This is expressed by our first axiom.

Axiom 1. The plausibility of not-a is a monotonic function of the plausibility of a,

    P(a'|b) = f(P(a|b)).    (2.1)

At this point we do not know the precise relation between P(a|b) and P(a'|b); we only know that some such function f must exist.

The second axiom expresses the fact that a measure of plausibility for a complex statement such as the conjunction "a_1 and a_2" must somehow depend on the separate plausibilities of a_1 and of a_2. We consider it "self-evident" that the plausibility that both a_1 and a_2 are simultaneously true, P(a_1 a_2|b), can be analyzed in stages: in order for a_1 a_2 to be true it must first be the case that a_1 is itself true. Thus, P(a_1 a_2|b) must depend on P(a_1|b). Furthermore, once we have established that a_1 is in fact true, in order for a_1 a_2 to be true it must be the case that a_2 is also true. Thus, P(a_1 a_2|b) must depend on P(a_2|a_1 b) as well. This argument is carried out in more detail in [Tribus 69]. Therefore, our second axiom is

Axiom 2. The plausibility P(a_1 a_2|b) of a conjunction a_1 a_2 is determined once we specify the plausibility P(a_1|b) of a_1 and the plausibility P(a_2|a_1 b) of a_2 given a_1.

What this means is that P(a_1 a_2|b) must be calculable in terms of P(a_1|b) and P(a_2|a_1 b): the second axiom asserts that there exists a function g such that

    P(a_1 a_2|b) = g(P(a_1|b),\, P(a_2|a_1 b)).    (2.2)

Remarkably, this is all we need! Note the qualitative nature of these axioms: what is being asserted is the existence of some unspecified functions f and g and not their specific quantitative mathematical forms.
Furthermore, note that the same f and g apply to any and all propositions. This reflects our desire to construct a single theory of universal applicability. It also means that the axioms represent a huge number of known special cases.

At this point the functions f and g are unknown, but they are not arbitrary. In fact, as we shall see below, the requirement of consistency is very constraining. For example, notice that since a_1 a_2 = a_2 a_1, in (2.2) the roles of a_1 and a_2 may be interchanged,

    P(a_1 a_2|b) = g(P(a_2|b),\, P(a_1|a_2 b)).    (2.3)

Consistency requires that

    g(P(a_1|b),\, P(a_2|a_1 b)) = g(P(a_2|b),\, P(a_1|a_2 b)).    (2.4)

We will have to check that this is indeed the case. As a second example, since a'' = a, it must be the case that

    P(a|b) = P(a''|b) = f(P(a'|b)) = f[f(P(a|b))].    (2.5)

The plausibility P(a|b) is just a number; calling it u, this can be written as

    f(f(u)) = u.    (2.6)

These two constraints are not at this point helpful in fixing the functions f and g. But the following one is.

2.3 Regraduation: the Product Rule

2.3.1 Cox's first theorem

A consistency constraint that follows from the associativity property of the conjunction goes a long way toward fixing the acceptable forms of the function g. The constraint is obtained by noting that since (ab)c = a(bc), we have two ways to compute P(abc|d). Starting from

    P[(ab)c|d] = P[a(bc)|d],    (2.7)

we get

    g[P(ab|d),\, P(c|abd)] = g[P(a|d),\, P(bc|ad)]    (2.8)

and

    g[g(P(a|d), P(b|ad)),\, P(c|abd)] = g[P(a|d),\, g(P(b|ad), P(c|bad))].    (2.9)

Writing P(a|d) = u, P(b|ad) = v, and P(c|abd) = w, the "associativity" constraint is

    g(g(u,v),\, w) = g(u,\, g(v,w)).    (2.10)

It is quite obvious that the functional equation (2.10) has an infinity of solutions. Indeed, by direct substitution one can easily check that functions of the form

    g(u,v) = G^{-1}[G(u)\, G(v)]    (2.11)

are solutions for any invertible (and therefore monotonic) function G(u). What is not so easy to prove is that this is the general solution.

Associativity Theorem: Given any function g(u,v) that satisfies the associativity constraint (2.10), one can construct another monotonic function G(u) such that

    G(g(u,v)) = G(u)\, G(v).    (2.12)

Cox's proof of this theorem is somewhat lengthy and is relegated to the next subsection.

The significance of this result becomes apparent when one rewrites it as

    G[P(ab|c)] = G[P(a|c)]\, G[P(b|ac)]    (2.13)

and realizes that there was nothing particularly special about the original assignment of real numbers P(a|c), P(b|ac), and so on. Their only purpose was to provide us with a ranking, an ordering of propositions according to how plausible they are. Since the function G(u) is monotonic, the same ordering can be achieved using a new set of positive numbers

    p(a|c) \equiv G[P(a|c)], \quad p(b|ac) \equiv G[P(b|ac)], \quad \ldots    (2.14)

instead of the old. The advantage of using these 'regraduated' plausibilities is that the plausibility of ab can be calculated in terms of the plausibilities of a and of b given a in a particularly simple way: it is just their product. Thus, while the new numbers are neither more nor less correct than the old, they are just considerably more convenient.
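The following numerical sketch (an added illustration, with the regraduator G(u) = u² chosen arbitrarily) verifies that any g of the form (2.11) satisfies the associativity constraint (2.10), and that applying G converts g into an ordinary product, which is the content of the regraduation step.

```python
import itertools

# A hypothetical regraduating function and its inverse (any monotonic G would do).
G = lambda u: u ** 2
G_inv = lambda t: t ** 0.5

def g(u, v):
    """A plausibility-combination rule of the form (2.11): g = G^{-1}[G(u) G(v)]."""
    return G_inv(G(u) * G(v))

grid = [0.1, 0.3, 0.5, 0.9]
for u, v, w in itertools.product(grid, repeat=3):
    # Associativity constraint (2.10): g(g(u,v), w) = g(u, g(v,w)).
    assert abs(g(g(u, v), w) - g(u, g(v, w))) < 1e-12
    # Regraduation (2.12): G(g(u,v)) = G(u) G(v), i.e. the product rule for p = G(P).
    assert abs(G(g(u, v)) - G(u) * G(v)) < 1e-12

print("associativity and product-rule regraduation verified on the grid")
```

For this particular G the rule g happens to reduce to multiplication itself; in general the raw plausibilities P need not multiply, and it is only after the regraduation p = G(P) that the product rule takes its simple form.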
The theorem can be rephrased as follows.

Cox's First Regraduation Theorem: Once a consistent representation of the ordering of propositions according to their degree of plausibility has been set up by assigning a real number P(a|b) to each pair of propositions a and b, one can always find another equivalent representation by assigning positive numbers p(a|c) that satisfy the product rule

    p(ab|c) = p(a|c)\, p(b|ac).    (2.15)

Perhaps one can make the logic behind this regraduation a little clearer by considering the somewhat analogous situation of introducing the quantity temperature as a measure of degree of "hotness". Clearly any acceptable measure of "hotness" must reflect its transitivity – if a is hotter than b and b is hotter than c then a is hotter than c; thus, temperatures are represented by real numbers. But the temperature scales are so far arbitrary. While many temperature scales may serve equally well the purpose of ordering systems according to their hotness, there is one choice – the absolute or Kelvin scale – that turns out to be considerably more convenient because it simplifies the mathematical formalism. Switching from an arbitrary temperature scale to the Kelvin scale is one instance of a convenient regraduation. (The details of temperature regraduation are given in chapter 3.)

On the basis of plain common sense one would have expected g(u,v) to be monotonic in both its arguments. Consider a change in the first argument P(a_1|b) while holding the second P(a_2|a_1 b) fixed. Since a strengthening of the belief in a_1 can only strengthen the belief in a_1 a_2, we require that a change in P(a_1|b) should yield a change in P(a_1 a_2|b) of the same sign. It is therefore a reassuring check that the product rule (2.15) behaves as expected.

2.3.2 Proof of the Associativity Theorem

Understanding the proof that (2.12) is the general solution of the associativity constraint (2.10) is not necessary for understanding other topics in this book. This section may be skipped on a first reading.

The proof given below, due to Cox, takes advantage of the fact that our interest is not just to find the most general solution but rather that we want the most general solution under the restricted circumstance that the function g is to be used for the purpose of inference. This allows us to impose additional constraints on g. We will assume that the functions g are continuous and twice differentiable. Indeed, inference is quantified common sense, and if the function g had turned out to be non-differentiable serious doubts would be cast on the legitimacy of the whole scheme. Furthermore, common sense also requires that g(u,v) be monotonic increasing in both its arguments. Consider a change in the first argument P(a_1|b) while holding the second P(a_2|a_1 b) fixed. Since a strengthening of one's belief in a_1 must be reflected in a corresponding strengthening in one's belief in a_1 a_2, we require that a change in P(a_1|b) should yield a change in P(a_1 a_2|b) of the same sign. An analogous line of reasoning leads one to impose that g(u,v) must be monotonic increasing in the second argument as well,

    \frac{\partial g(u,v)}{\partial u} \ge 0 \quad\text{and}\quad \frac{\partial g(u,v)}{\partial v} \ge 0.    (2.16)
Let

    r \equiv g(u,v) \quad\text{and}\quad s \equiv g(v,w),    (2.17)

and

    g_1(u,v) \equiv \frac{\partial g(u,v)}{\partial u} \ge 0 \quad\text{and}\quad g_2(u,v) \equiv \frac{\partial g(u,v)}{\partial v} \ge 0.    (2.18)

Then eq.(2.10) and its derivatives with respect to u and v are

    g(r,w) = g(u,s),    (2.19)

    g_1(r,w)\, g_1(u,v) = g_1(u,s),    (2.20)

and

    g_1(r,w)\, g_2(u,v) = g_2(u,s)\, g_1(v,w).    (2.21)

Eliminating g_1(r,w) from these last two equations we get

    K(u,v) = K(u,s)\, g_1(v,w),    (2.22)

where

    K(u,v) = \frac{g_2(u,v)}{g_1(u,v)}.    (2.23)

Multiplying eq.(2.22) by K(v,w) we get

    K(u,v)\, K(v,w) = K(u,s)\, g_2(v,w).    (2.24)

Differentiating the right hand side of eq.(2.24) with respect to v and comparing with the derivative of eq.(2.22) with respect to w, we have

    \frac{\partial}{\partial v}\bigl(K(u,s)\, g_2(v,w)\bigr) = \frac{\partial}{\partial w}\bigl(K(u,s)\, g_1(v,w)\bigr) = \frac{\partial}{\partial w} K(u,v) = 0.    (2.25)

Therefore

    \frac{\partial}{\partial v}\bigl(K(u,v)\, K(v,w)\bigr) = 0,    (2.26)

or

    \frac{1}{K(u,v)} \frac{\partial K(u,v)}{\partial v} = -\frac{1}{K(v,w)} \frac{\partial K(v,w)}{\partial v} \equiv h(v).    (2.27)

Integrating, and using the fact that K \ge 0 because both g_1 and g_2 are positive, we get

    K(u,v) = K(u,0)\, \exp \int_0^v h(v')\, dv',    (2.28)

and also

    K(v,w) = K(0,w)\, \exp\left(-\int_0^v h(v')\, dv'\right),    (2.29)

so that

    K(u,v) = \alpha\, \frac{H(u)}{H(v)},    (2.30)

where \alpha = K(0,0) is a constant and H(u) is the positive function

    H(u) \equiv \exp\left(-\int_0^u h(u')\, du'\right) \ge 0.    (2.31)

On substituting back into eqs.(2.22) and (2.24) we get

    g_1(v,w) = \frac{H(s)}{H(v)} \quad\text{and}\quad g_2(v,w) = \alpha\, \frac{H(s)}{H(w)}.    (2.32)

Next, use s = g(v,w), so that

    ds = g_1(v,w)\, dv + g_2(v,w)\, dw.    (2.33)

Substituting (2.32) we get

    \frac{ds}{H(s)} = \frac{dv}{H(v)} + \alpha\, \frac{dw}{H(w)}.    (2.34)

This is easily integrated. Let

    G(u) = G(0)\, \exp \int_0^u \frac{du'}{H(u')},    (2.35)

so that du/H(u) = dG(u)/G(u). Then

    G(g(v,w)) = G(v)\, G^{\alpha}(w),    (2.36)

where a multiplicative constant of integration has been absorbed into the constant G(0). Applying this function G twice in eq.(2.10) we obtain

    G(u)\, G^{\alpha}(v)\, G^{\alpha}(w) = G(u)\, G^{\alpha}(v)\, G^{\alpha^2}(w),    (2.37)

so that \alpha = 1 and

    G(g(v,w)) = G(v)\, G(w).    (2.38)

(The second possibility \alpha = 0 is discarded because it leads to g(u,v) = u, which is not useful for inference.)

This completes our proof that eq.(2.12) is the general solution of eq.(2.10): given any g(u,v) that satisfies eq.(2.10) one can construct the corresponding G(u) using eqs.(2.23), (2.27), (2.31), and (2.35). Furthermore, since G(u) is an exponential its sign is dictated by the constant G(0), which is positive because the right hand side of eq.(2.38) is positive. Finally, since H(u) \ge 0, eq.(2.31), the regraduating function G(u) is a monotonic function of its variable u.

2.3.3 Setting the range of degrees of belief

Degrees of belief range from the extreme of total certainty that an assertion is true to the opposite extreme of total certainty that it is false. What numerical values should we assign to these extremes?

Let p_T and p_F be the numerical values assigned to the (regraduated) plausibilities of propositions which are known to be true and false respectively. Notice that the extremes should be unique: there is a single p_T and a single p_F.
The possibility of assigning two different numerical values, for example p_{T1} and p_{T2}, to propositions known to be true is ruled out by our desire that degrees of plausibility be ordered.

The philosophy behind regraduation is to seek the most convenient representation of degrees of belief in terms of real numbers. In particular, we would like our regraduated plausibilities to reflect the fact that if b is known to be true then we believe in ab to precisely the same extent as we believe in a, no more and no less. This is expressed by

    p(ab|b) = p(a|b).    (2.39)

On the other hand, using the product rule eq.(2.15) we get

    p(ab|b) = p(b|b)\, p(a|bb) = p_T\, p(a|b).    (2.40)

Comparing eqs.(2.39) and (2.40) we get

    p_T = 1.    (2.41)

Thus, the value of p_T is assigned so that eq.(2.39) holds:

Belief that a is true is represented by p(a) = 1.

For the other extreme value, p_F, which represents impossibility, consider the plausibility of ab' given b. Using the product rule we have

    p(ab'|b) = p(a|b)\, p(b'|ab).    (2.42)

But p(ab'|b) = p_F and p(b'|ab) = p_F. Therefore

    p_F = p(a|b)\, p_F.    (2.43)

Again, this should hold for arbitrary a. Therefore either p_F = 0 or p_F = \infty; either value is fine. (The value -\infty is not allowed; negative values of p(a|b) would lead to an inconsistency.) We can either choose plausibilities in the range [0,1] so that a higher p reflects a higher degree of belief or, alternatively, we can choose 'implausibilities' in the range [1,\infty) so that a higher p reflects a lower degree of belief. Both alternatives are equally consistent and correct. The usual convention is to choose the former.

Belief that a is false is represented by p(a) = 0.

The numerical values assigned to p_T and p_F follow from a particularly convenient regraduation that led to the product rule. Other possibilities are, of course, legitimate. Instead of eq.(2.14) we could for example have regraduated plausibilities according to p(a|c) \equiv C\, G[P(a|c)], where C is some constant. Then the product rule would read C\, p(ab|c) = p(a|c)\, p(b|ac) and the analysis of the previous paragraphs would have led us to p_T = C and p_F = 0 or \infty. The choice C = 100 is quite common; it is implicit in many colloquial uses of the notion of probability, as for example when one says 'I am 100% sure that...'. Notice, incidentally, that within a frequentist interpretation most such statements would be meaningless.

2.4 Further regraduation: the Sum Rule

2.4.1 Cox's second theorem

Having restricted the form of g considerably, we next study the function f by requiring its compatibility with g. It is here that we make use of the constraints (2.4) and (2.6) that we had found earlier.

Consider plausibilities P that have gone through a first process of regraduation so that the product rule holds,

    P(ab|c) = P(a|c)\, P(b|ac) = P(a|c)\, f(P(b'|ac)).    (2.44)

But also P(ab'|c) = P(a|c)\, P(b'|ac), so that

    P(ab|c) = P(a|c)\, f\!\left(\frac{P(ab'|c)}{P(a|c)}\right).    (2.45)

But P(ab|c) is symmetric in ab = ba. Therefore

    P(a|c)\, f\!\left(\frac{P(ab'|c)}{P(a|c)}\right) = P(b|c)\, f\!\left(\frac{P(a'b|c)}{P(b|c)}\right).    (2.46)

This must hold irrespective of the choice of a, b, and c. In particular, suppose that b' = ad.
On the left hand side P(ab'|c) = P(b'|c) because aa = a. On the right hand side, to simplify P(a'b|c) we note that a'b' = a'ad is false and that a'b' = (a+b)'. (In order for a+b to be false it must be the case that both a is false and b is false.) Therefore a+b is true: either a is true or b is true. If b is true then a'b = a'. If a is true, both a' and a'b are false, which means that we also get a'b = a'. Therefore on the right hand side P(a'b|c) = P(a'|c) and we get

    P(a|c)\, f\!\left(\frac{f(P(b|c))}{P(a|c)}\right) = P(b|c)\, f\!\left(\frac{f(P(a|c))}{P(b|c)}\right).    (2.47)

Writing P(a|c) = u and P(b|c) = v, the "compatibility" constraint is

    u\, f\!\left(\frac{f(v)}{u}\right) = v\, f\!\left(\frac{f(u)}{v}\right).    (2.48)

We had earlier seen that certainty is represented by 1 and impossibility by 0. Note that when u = 1, using f(1) = 0 and f(0) = 1, we obtain f[f(v)] = v. Thus, eq.(2.6) is a special case of (2.48).

Compatibility Theorem: The function f(u) that satisfies the compatibility constraint eq.(2.48) is

    f(u) = (1 - u^{\alpha})^{1/\alpha} \quad\text{or}\quad u^{\alpha} + f^{\alpha}(u) = 1,    (2.49)

where \alpha is a constant.

It is easy to show that eq.(2.49) is a solution – just substitute. What is considerably more difficult is to show that it is the general solution. The proof is given in the next subsection. As a result of the first theorem we can consider both u and f(u) positive. Therefore, for \alpha > 0 impossibility must be represented by 0, while for \alpha < 0 impossibility should be represented by \infty.

The significance of the solution for f becomes clear when eq.(2.49) is rewritten as

    [P(a|b)]^{\alpha} + [P(a'|b)]^{\alpha} = 1,    (2.50)

and the product rule eq.(2.44) is raised to the same power \alpha,

    [P(ab|c)]^{\alpha} = [P(a|c)]^{\alpha}\, [P(b|ac)]^{\alpha}.    (2.51)

This shows that, having regraduated plausibilities once, we can simplify the solution (2.50) considerably by regraduating a second time, while still preserving the product rule. This second regraduation is

    p(a|b) \equiv [P(a|b)]^{\alpha}.    (2.52)

Cox's Second Regraduation Theorem: Once a consistent representation of the ordering of propositions according to their degree of plausibility has been set up in such a way that the product rule holds, one can regraduate further and find an equivalent and more convenient representation that assigns plausibilities p(a|b) satisfying both the sum rule,

    p(a|b) + p(a'|b) = 1,    (2.53)

and the product rule,

    p(ab|c) = p(a|c)\, p(b|ac).    (2.54)

These new, conveniently regraduated degrees of plausibility will be called probabilities: positive numbers in the interval [0,1], with certainty represented by 1 and impossibility by 0. From now on there is no need to refer to plausibilities again; both notations, lower case p as well as upper case P, will be used to refer to the regraduated probabilities.
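Here is a quick numerical check of these two results (added for illustration; the exponent α = 2 is an arbitrary choice): the function f(u) = (1 − u^α)^{1/α} of eq.(2.49) satisfies the compatibility constraint (2.48), and the second regraduation p = P^α yields the sum rule (2.53).

```python
import itertools

alpha = 2.0                                   # arbitrary exponent for the example

def f(u):
    """A solution (2.49) of the compatibility constraint: f(u) = (1 - u**alpha)**(1/alpha)."""
    return (1.0 - u ** alpha) ** (1.0 / alpha)

grid = [0.2, 0.4, 0.6, 0.8]
for u, v in itertools.product(grid, repeat=2):
    # Compatibility constraint (2.48): u f(f(v)/u) = v f(f(u)/v),
    # checked where the arguments of f stay inside [0, 1].
    if f(v) <= u and f(u) <= v:
        assert abs(u * f(f(v) / u) - v * f(f(u) / v)) < 1e-12

for u in grid:
    # Second regraduation (2.52)-(2.53): with p = u**alpha, p + f(u)**alpha = 1.
    assert abs(u ** alpha + f(u) ** alpha - 1.0) < 1e-12

print("compatibility constraint and sum-rule regraduation verified")
```

Since raising to the power α preserves the product rule, eq.(2.51), the regraduated numbers p obey (2.53) and (2.54) simultaneously, which is the content of the second theorem.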
Furthermore, once we have gone through the first stage of regraduation, so that plausibilities satisfy the product rule eq.(2.15), common sense also requires that the function f(u) be monotonically decreasing,

df(u)/du ≤ 0   for 0 ≤ u ≤ 1 ,

with extreme values such that f(0) = 1 and f(1) = 0.

The first step is to transform the functional equation (2.48) into an ordinary differential equation. Let

r =def f(v)/u   and   s =def f(u)/v ,    (2.55)

and substitute into eq.(2.48),

u f(r) = v f(s) .

Next differentiate this equation with respect to u, to v, and to u and v, to get (here primes denote derivatives)

f(r) − r f'(r) = f'(s) f'(u) ,    (2.56)

f(s) − s f'(s) = f'(r) f'(v) ,    (2.57)

and

(s/v) f''(s) f'(u) = (r/u) f''(r) f'(v) .    (2.58)

Multiply u f(r) = v f(s) by eq.(2.58),

s f''(s) f'(u) f(s) = r f''(r) f'(v) f(r) ,    (2.59)

and use eqs.(2.56) and (2.57) to eliminate f'(u) and f'(v). After rearranging one gets

s f''(s) f(s) / ( f'(s) [f(s) − s f'(s)] ) = r f''(r) f(r) / ( f'(r) [f(r) − r f'(r)] ) .    (2.60)

Since the left side does not depend on r, neither must the right side; both sides must actually be constant. Call this constant k. Thus, the problem is reduced to a differential equation,

r f''(r) f(r) = k f'(r) [f(r) − r f'(r)] .    (2.61)

Multiplying by dr/(r f f') gives

df'/f' = k ( dr/r − df/f ) .    (2.62)

Integrating twice gives

f(r) = (A r^α + B)^{1/α} ,    (2.63)

where A and B are integration constants and α = 1 + k. Substituting back into eq.(2.48) allows us, after some simple algebra, to determine one of the integration constants, B = A², while substituting into eq.(2.6) yields the other, A = −1. This concludes the proof.

2.5 Some remarks on the sum and product rules

2.5.1 On meaning, ignorance and randomness

The product and sum rules can be used as the starting point for a theory of probability: quite independently of what probabilities could possibly mean, we can develop a formalism of real numbers (measures) that are manipulated according to eqs.(2.53) and (2.54). This is the approach taken by Kolmogorov. The advantage is mathematical clarity and rigor. The disadvantage, of course, is that in actual applications the issue of meaning, of interpretation, turns out to be important because it affects how and why probabilities are used.

The advantage of the approach due to Cox is that the issue of meaning is clarified from the start: the theory was designed to apply to degrees of belief. Consistency requires that these numbers be manipulated according to the rules of probability theory. This is all we need. There is no reference to measures of sets, or to large ensembles of trials, or even to random variables. This is remarkable: it means that we can apply the powerful methods of probability theory to thinking and reasoning about problems where nothing random is going on, and to single events for which the notion of an ensemble is either absurd or at best highly contrived and artificial. Thus, probability theory is the method for consistent reasoning in situations where the information available might be insufficient to reach certainty: probability is the tool for dealing with uncertainty and ignorance.

This interpretation is not in conflict with the common view that probabilities are associated with randomness.
It may, of course, happen that there is an unknown influence that affects the system in unpredictable ways and that there is a good reason why this influence remains unknown, namely, it is so complicated that the information necessary to characterize it cannot be supplied. Such an influence we call 'random'. Thus, being random is just one among many possible reasons why a quantity might be uncertain or unknown.

2.5.2 The general sum rule

From the sum and product rules, eqs.(2.53) and (2.54), we can easily deduce a third one:

Theorem: The probability of a disjunction (or) is given by the sum rule

p(a+b|c) = p(a|c) + p(b|c) − p(ab|c) .    (2.64)

The proof is straightforward. Use (a+b)' = a'b' (for a+b to be false both a and b must be false); then

p(a+b|c) = 1 − p(a'b'|c) = 1 − p(a'|c) p(b'|a'c)
         = 1 − p(a'|c) (1 − p(b|a'c)) = p(a|c) + p(a'b|c)
         = p(a|c) + p(b|c) p(a'|bc) = p(a|c) + p(b|c) (1 − p(a|bc))
         = p(a|c) + p(b|c) − p(ab|c) .

These theorems are rather obvious on the basis of the interpretation of a probability as a frequency or as the measure of a set. This is conveyed graphically in a very clear way by Venn diagrams (see fig. 2.1).

2.5.3 Independent and mutually exclusive events

In special cases the sum and product rules can be rewritten in various useful ways. Two statements or events a and b are said to be independent if the probability of one is not altered by information about the truth of the other. More specifically, event a is independent of b (given c) if

p(a|bc) = p(a|c) .    (2.65)

For independent events the product rule simplifies to

p(ab|c) = p(a|c) p(b|c)   or   p(ab) = p(a) p(b) .    (2.66)

The symmetry of these expressions implies that p(b|ac) = p(b|c) as well: if a is independent of b, then b is independent of a.

[Figure 2.1: Venn diagram showing P(a), P(b), P(ab) and P(a+b).]

Two statements or events a1 and a2 are mutually exclusive given b if they cannot be true simultaneously, i.e., p(a1 a2|b) = 0. Notice that neither p(a1|b) nor p(a2|b) need vanish. For mutually exclusive events the sum rule simplifies to

p(a1 + a2|b) = p(a1|b) + p(a2|b) .    (2.67)

The generalization to many statements a1, a2, ..., an that are mutually exclusive given b is immediate,

p(a1 + a2 + ··· + an|b) = Σ_{i=1}^{n} p(ai|b) .    (2.68)

If one of the statements a1, a2, ..., an is necessarily true, i.e., they cover all possibilities, they are said to be exhaustive. Then their disjunction is necessarily true, a1 + a2 + ··· + an = ⊤, so that

p(a1 + a2 + ··· + an|b) = 1 .    (2.69)

If, in addition to being exhaustive, the statements a1, a2, ..., an are also mutually exclusive then

Σ_{i=1}^{n} p(ai) = 1 .    (2.70)

A useful generalization involving the probabilities p(ai|b) conditional on any arbitrary proposition b is

Σ_{i=1}^{n} p(ai|b) = 1 .    (2.71)

The proof is straightforward:

p(b) = p(b⊤) = Σ_{i=1}^{n} p(b ai) = p(b) Σ_{i=1}^{n} p(ai|b) .    (2.72)
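These rules are easy to check numerically. The short sketch below is an illustrative addition, not part of the text: the joint probabilities are made-up numbers, and the code simply verifies the general sum rule (2.64) and the normalization of conditionals, eq.(2.71), for two binary propositions.

```python
# Numerical check of the general sum rule, eq.(2.64), and of eq.(2.71).
# The joint probabilities below are made-up illustrative numbers.
p_joint = {                       # p(a, b) for two binary propositions a and b
    (True, True): 0.2, (True, False): 0.3,
    (False, True): 0.4, (False, False): 0.1,
}

def p(pred):
    """Probability of the event defined by a predicate on (a, b)."""
    return sum(w for (a, b), w in p_joint.items() if pred(a, b))

# General sum rule: p(a + b) = p(a) + p(b) - p(ab)
lhs = p(lambda a, b: a or b)
rhs = p(lambda a, b: a) + p(lambda a, b: b) - p(lambda a, b: a and b)
assert abs(lhs - rhs) < 1e-12

# Eq.(2.71): mutually exclusive, exhaustive a_i satisfy sum_i p(a_i | b) = 1
p_b = p(lambda a, b: b)
p_a_given_b = [p(lambda a, b: a and b) / p_b,
               p(lambda a, b: (not a) and b) / p_b]
assert abs(sum(p_a_given_b) - 1.0) < 1e-12
print(lhs, rhs, sum(p_a_given_b))
```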
2.5.4 Marginalization

Once we decide that it is legitimate to quantify degrees of belief by real numbers p, the problem becomes how to assign these numbers. The sum and product rules show how we should assign probabilities to some statements once probabilities have been assigned to others. Here is an important example of how this works.

We want to assign a probability to a particular statement b. Let a1, a2, ..., an be mutually exclusive and exhaustive statements and suppose that the probabilities of the conjunctions b aj are known. We want to calculate p(b) given the joint probabilities p(b aj). The solution is straightforward: sum p(b aj) over all aj, use the product rule and eq.(2.71), to get

Σ_j p(b aj) = p(b) Σ_j p(aj|b) = p(b) .    (2.73)

This procedure, called marginalization, is quite useful when we want to eliminate uninteresting variables a so that we can concentrate on those variables b that really matter to us. The distribution p(b) is referred to as the marginal of the joint distribution p(ab).

For a second use of formulas such as these, suppose that we happen to know the conditional probabilities p(b|a). When a is known we can make good inferences about b, but what can we tell about b when we are uncertain about the actual value of a? Then we proceed as follows. Use of the sum and product rules gives

p(b) = Σ_j p(b aj) = Σ_j p(b|aj) p(aj) .    (2.74)

This is quite reasonable: the probability of b is the probability we would assign if the value of a were precisely known, averaged over all a's. The assignment p(b) clearly depends on how uncertain we are about the value of a. In the extreme case when we are totally certain that a takes the particular value ak we have p(aj) = δ_{jk} and we recover p(b) = p(b|ak) as expected.

2.6 The expected value

Suppose we know that a quantity x can take values xi with probabilities pi. Sometimes we need an estimate for the quantity x. What should we choose? It seems reasonable that those values xi that have larger pi should make a dominant contribution to the estimate. We therefore make the following reasonable choice: the expected value of the quantity x is denoted by ⟨x⟩ and is given by

⟨x⟩ =def Σ_i pi xi .    (2.75)

The term 'expected' value is not always an appropriate one because ⟨x⟩ may not be one of the actually allowed values xi and, therefore, it is not a value we would expect. The expected value of a die toss is (1 + ··· + 6)/6 = 3.5, which is not an allowed result.

Using the average ⟨x⟩ as an estimate for x is reasonable, but it is also somewhat arbitrary. Alternative estimates are possible; for example, one could have chosen the value for which the probability is maximum – this is called the 'mode'.

This raises two questions. The first question is whether ⟨x⟩ is a good estimate. If the probability distribution is sharply peaked, all the values of x that have appreciable probabilities are close to each other and to ⟨x⟩. Then ⟨x⟩ is a good estimate. But if the distribution is broad, the actual value of x may deviate from ⟨x⟩ considerably. To describe quantitatively how large this deviation might be we need to describe how broad the probability distribution is. A convenient measure of the width of the distribution is the root mean square (rms) deviation defined by

Δx =def ⟨ (x − ⟨x⟩)² ⟩^{1/2} .    (2.76)

The quantity Δx is also called the standard deviation; its square (Δx)² is called the variance.
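As a concrete illustration of eqs.(2.74)-(2.76), here is a minimal sketch; the probabilities p(a_j) and p(b|a_j) are invented for the example, while the die is the fair die discussed above.

```python
# Marginalization, eq.(2.74), and the expected value / rms deviation, eqs.(2.75)-(2.76).
# The conditional probabilities p(b|a_j) and the prior p(a_j) are illustrative numbers.
p_a = [0.5, 0.3, 0.2]            # p(a_j), mutually exclusive and exhaustive
p_b_given_a = [0.9, 0.5, 0.1]    # p(b | a_j)

p_b = sum(pb * pa for pb, pa in zip(p_b_given_a, p_a))   # eq.(2.74)
print("p(b) =", p_b)

# Expected value and rms deviation of a fair die toss, eqs.(2.75)-(2.76)
values = range(1, 7)
probs = [1 / 6] * 6
mean = sum(p * x for p, x in zip(probs, values))
var = sum(p * (x - mean) ** 2 for p, x in zip(probs, values))
print("<x> =", mean, " dx =", var ** 0.5)   # 3.5 and about 1.71
```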
For historical reasons it is common to refer to the 'variance of x', but this is misleading because it suggests that x itself could vary; Δx refers to our knowledge about x. If Δx ≪ ⟨x⟩ then x will not deviate much from ⟨x⟩ and we expect ⟨x⟩ to be a good estimate.

The definition of Δx is somewhat arbitrary. It is dictated both by common sense and by convenience. Alternatively we could have chosen to define the width of the distribution as ⟨|x − ⟨x⟩|⟩ or ⟨(x − ⟨x⟩)⁴⟩^{1/4}, but these definitions are less convenient for calculations.

Now that we have a way of deciding whether ⟨x⟩ is a good estimate for x we may raise a second question: is there such a thing as the "best" estimate for x? Consider another estimate x'. We expect x' to be accurate provided the deviations from it are small, i.e., ⟨(x − x')²⟩ is small. The best x' is that for which this variance is a minimum,

d/dx' ⟨(x − x')²⟩ |_{x'_best} = 0 ,    (2.77)

which implies x'_best = ⟨x⟩. Conclusion: ⟨x⟩ is the best estimate for x when by "best" we mean the one with the smallest variance. But other choices are possible; for example, had we actually decided to minimize the width ⟨|x − x'|⟩ the best estimate would have been the median x_m, a value such that Prob(x < x_m) = Prob(x > x_m) = 1/2.

We conclude this section by mentioning two important identities that will be repeatedly used in what follows. The first is that the average deviation from the mean vanishes,

⟨x − ⟨x⟩⟩ = 0 ,    (2.78)

because deviations from the mean are just as likely to be positive as negative. The second useful identity is

⟨(x − ⟨x⟩)²⟩ = ⟨x²⟩ − ⟨x⟩² .    (2.79)

The proofs are trivial – just use the definition (2.75).

2.7 The binomial distribution

Suppose the probability of a certain event α is p. The probability of α not happening is 1 − p. Using the theorems discussed earlier we can obtain the probability that α happens m times in N independent trials. The probability that α happens in the first m trials and not-α (that is, α') happens in the subsequent N − m trials is, using the product rule for independent events, p^m (1 − p)^{N−m}. But this is only one particular ordering of the m α's and the N − m α''s. There are

N!/(m!(N − m)!) = \binom{N}{m}    (2.80)

such orderings. Therefore, using the sum rule for mutually exclusive events, the probability of m α's in N independent trials, irrespective of the particular order of α's and α''s, is

P(m|N, p) = \binom{N}{m} p^m (1 − p)^{N−m} .    (2.81)

This is called the binomial distribution. Using the binomial theorem (hence the name of the distribution) one can show that these probabilities are correctly normalized:

Σ_{m=0}^{N} P(m|N, p) = Σ_{m=0}^{N} \binom{N}{m} p^m (1 − p)^{N−m} = (p + (1 − p))^N = 1 .    (2.82)

The range of applicability of this distribution is enormous. Whenever trials are independent of each other (i.e., the outcome of one trial has no influence on the outcome of another, or alternatively, knowing the outcome of one trial provides us with no information about the possible outcomes of another) the distribution is binomial. Independence is the crucial feature.

The expected number of α's is

⟨m⟩ = Σ_{m=0}^{N} m P(m|N, p) = Σ_{m=0}^{N} m \binom{N}{m} p^m (1 − p)^{N−m} .

This sum over m is complicated. The following elegant trick is useful. Consider the sum

S(p, q) = Σ_{m=0}^{N} m \binom{N}{m} p^m q^{N−m} ,
where p and q are independent variables. After we calculate S we will replace q by 1 − p to obtain the desired result, ⟨m⟩ = S(p, 1 − p). The calculation of S is easy if we note that m p^m = p ∂/∂p (p^m). Then, using the binomial theorem,

S(p, q) = p ∂/∂p Σ_{m=0}^{N} \binom{N}{m} p^m q^{N−m} = p ∂/∂p (p + q)^N = N p (p + q)^{N−1} .

Replacing q by 1 − p we obtain our best estimate for the expected number of α's,

⟨m⟩ = N p .    (2.83)

This is the best estimate, but how good is it? To answer we need to calculate Δm. The variance is

(Δm)² = ⟨(m − ⟨m⟩)²⟩ = ⟨m²⟩ − ⟨m⟩² ,

which requires that we calculate ⟨m²⟩,

⟨m²⟩ = Σ_{m=0}^{N} m² P(m|N, p) = Σ_{m=0}^{N} m² \binom{N}{m} p^m (1 − p)^{N−m} .

We can use the same trick we used before to get ⟨m⟩:

S'(p, q) = Σ_{m=0}^{N} m² \binom{N}{m} p^m q^{N−m} = p ∂/∂p [ p ∂/∂p (p + q)^N ] .

Therefore,

⟨m²⟩ = (Np)² + Np(1 − p) ,    (2.84)

and the final result for the rms deviation Δm is

Δm = √(Np(1 − p)) .    (2.85)

Now we can address the question of how good an estimate ⟨m⟩ is. Notice that Δm grows with N. This might seem to suggest that our estimate of m gets worse for large N, but this is not quite true because ⟨m⟩ also grows with N. The ratio

Δm/⟨m⟩ = √( (1 − p)/(Np) ) ∝ 1/N^{1/2}    (2.86)

shows that while both the estimate ⟨m⟩ and its uncertainty Δm grow with N, the relative uncertainty decreases.

2.8 Probability vs. frequency: the law of large numbers

Notice that the "frequency" f = m/N of α's obtained in one N-trial sequence is not equal to p. For one given fixed value of p, the frequency f can take any one of the values 0/N, 1/N, 2/N, ..., N/N. What is equal to p is not the frequency itself but its expected value. Using eq.(2.83),

⟨f⟩ = ⟨m/N⟩ = p .    (2.87)

For large N the distribution is quite narrow and the probability that the observed frequency of α's differs from p tends to zero as N → ∞. Using eq.(2.85),

Δf = Δm/N = √( p(1 − p)/N ) ∝ 1/N^{1/2} .    (2.88)

The same ideas are more precisely conveyed by a theorem due to Bernoulli known as the 'weak law of large numbers'. A simple proof of the theorem involves an inequality due to Tchebyshev. Let ρ(x) dx be the probability that a variable X lies in the range between x and x + dx,

P(x < X < x + dx) = ρ(x) dx .

The variance of X satisfies

(Δx)² = ∫ (x − ⟨x⟩)² ρ(x) dx ≥ ∫_{|x−⟨x⟩| ≥ ε} (x − ⟨x⟩)² ρ(x) dx ,

where ε is an arbitrary constant. Replacing (x − ⟨x⟩)² by its least value ε² gives

(Δx)² ≥ ε² ∫_{|x−⟨x⟩| ≥ ε} ρ(x) dx = ε² P(|x − ⟨x⟩| ≥ ε) ,

which is Tchebyshev's inequality,

P(|x − ⟨x⟩| ≥ ε) ≤ (Δx)²/ε² .    (2.89)

Next we prove Bernoulli's theorem, the weak law of large numbers. First a special case. Let p be the probability of outcome α in an experiment E, P(α|E) = p. In a sequence of N independent repetitions of E the probability of m outcomes α is binomial. Substituting ⟨f⟩ = p and (Δf)² = p(1 − p)/N into Tchebyshev's inequality we get Bernoulli's theorem,

P( |f − p| ≥ ε | E^N ) ≤ p(1 − p)/(Nε²) .    (2.90)

Therefore, the probability that the observed frequency f is appreciably different from p tends to zero as N → ∞. Or, equivalently: for any small ε, the probability that the observed frequency f = m/N lies in the interval between p − ε/2 and p + ε/2 tends to unity as N → ∞.
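The content of eqs.(2.87), (2.88) and (2.90) can be seen directly in a simulation. The sketch below is an illustrative addition; the values of p, N and the number of repetitions are arbitrary choices.

```python
# Simulated check of <f> = p and (Delta f)^2 = p(1-p)/N, eqs.(2.87)-(2.88),
# and of the concentration asserted by Bernoulli's theorem, eq.(2.90).
import random

random.seed(0)
p, trials = 0.3, 500

for N in (10, 100, 1000):
    freqs = [sum(random.random() < p for _ in range(N)) / N for _ in range(trials)]
    mean_f = sum(freqs) / trials
    rms_f = (sum((f - mean_f) ** 2 for f in freqs) / trials) ** 0.5
    predicted = (p * (1 - p) / N) ** 0.5
    print(f"N={N:5d}  <f>={mean_f:.4f}  rms={rms_f:.4f}  predicted={predicted:.4f}")
```

As N grows the observed spread of the frequencies shrinks like N^{-1/2}, in agreement with eq.(2.88).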
In the mathematical/statistical literature this result is commonly stated in the form

f → p   in probability .    (2.91)

The qualifying words 'in probability' are crucial: we are not saying that the observed f tends to p for large N. What vanishes for large N is not the difference f − p itself, but rather the probability that |f − p| is larger than a certain (small) amount.

Thus, probabilities and frequencies are not the same thing, but they are related to each other. Since ⟨f⟩ = p, one might perhaps be tempted to define the probability p in terms of the expected frequency ⟨f⟩, but this does not work either. The problem is that the notion of expected value already presupposes that the concept of probability has been defined previously. The definition of a probability in terms of expected values is unsatisfactory because it is circular.

The law of large numbers is easily generalized beyond the binomial distribution. Consider the average

x̄ = (1/N) Σ_{r=1}^{N} x_r ,    (2.92)

where x_1, ..., x_N are N independent variables with the same mean ⟨x_r⟩ = μ and variance var(x_r) = (Δx_r)² = σ². (In the previous discussion leading to eq.(2.90) each variable x_r is either 1 or 0 according to whether outcome α happens or not in the r-th repetition of experiment E.)

To apply Tchebyshev's inequality, eq.(2.89), we need the mean and the variance of x̄. Clearly,

⟨x̄⟩ = (1/N) Σ_{r=1}^{N} ⟨x_r⟩ = (1/N) Nμ = μ .    (2.93)

Furthermore, since the x_r are independent, their variances are additive. For example,

var(x_1 + x_2) = var(x_1) + var(x_2) .    (2.94)

(Prove it.) Therefore,

var(x̄) = Σ_{r=1}^{N} var(x_r/N) = Nσ²/N² = σ²/N .    (2.95)

Tchebyshev's inequality now gives

P( |x̄ − μ| ≥ ε | E^N ) ≤ σ²/(Nε²) ,    (2.96)

so that for any ε > 0

lim_{N→∞} P( |x̄ − μ| ≥ ε | E^N ) = 0   or   lim_{N→∞} P( |x̄ − μ| ≤ ε | E^N ) = 1 ,    (2.97)

or

x̄ → μ   in probability .    (2.98)

Again, what vanishes for large N is not the difference x̄ − μ itself, but rather the probability that |x̄ − μ| is larger than any given small amount.

2.9 The Gaussian distribution

The Gaussian distribution is quite remarkable: it applies to a wide variety of problems such as the distribution of errors affecting experimental data, the distribution of velocities of molecules in gases and liquids, the distribution of fluctuations of thermodynamical quantities, and so on and on. One suspects that a deeply fundamental reason must exist for its wide applicability. Somehow the Gaussian distribution manages to codify the information that happens to be relevant for prediction in a wide variety of problems. The Central Limit Theorem discussed below provides an explanation.

2.9.1 The de Moivre-Laplace theorem

The Gaussian distribution turns out to be a special case of the binomial distribution. It applies to situations where the number N of trials and the expected number of α's, ⟨m⟩ = Np, are both very large (i.e., N large, p not too small).

To find an analytical expression for the Gaussian distribution we note that when N is large the binomial distribution,

P(m|N, p) = N!/(m!(N − m)!) p^m (1 − p)^{N−m} ,

is very sharply peaked: P(m|N, p) is essentially zero unless m is very close to ⟨m⟩ = Np. This suggests that to find a good approximation for P we need to pay special attention to a very small range of m, and this can be done following the usual approach of a Taylor expansion.
A problem is immediately apparent: if a small change in m produced only a small change in P we would only need to keep the first few terms, but in our case P is a very sharp function. To reproduce this kind of behavior we would need a huge number of terms in the series expansion, which is impractical. Having diagnosed the problem one can easily find a cure: instead of finding a Taylor expansion for the rapidly varying P, one finds an expansion for log P, which varies much more smoothly.

Let us therefore expand log P about its maximum at m_0, the location of which is at this point still unknown. The first few terms are

log P = log P|_{m_0} + (d log P/dm)|_{m_0} (m − m_0) + (1/2) (d² log P/dm²)|_{m_0} (m − m_0)² + ... ,

where

log P = log N! − log m! − log(N − m)! + m log p + (N − m) log(1 − p) .

What is a derivative with respect to an integer? For large m the function log m! varies so slowly (relative to the huge value of log m! itself) that we may consider m to be a continuous variable. Then

d log m!/dm ≈ [log m! − log(m − 1)!]/1 = log [m!/(m − 1)!] = log m .    (2.99)

Integrating, one obtains a very useful approximation – called the Stirling approximation – for the logarithm of a large factorial,

log m! ≈ ∫_0^m log x dx = (x log x − x)|_0^m = m log m − m .

A somewhat better expression, which includes the next term in the Stirling expansion, is

log m! ≈ m log m − m + (1/2) log 2πm + ...    (2.100)

Notice that the third term is much smaller than the first two; the first two terms are of order m while the last is of order log m. For m = 10^{23}, log m is only about 53.

The derivatives in the Taylor expansion are

d log P/dm = −log m + log(N − m) + log p − log(1 − p) = log [ p(N − m)/(m(1 − p)) ] ,

and

d² log P/dm² = −1/m − 1/(N − m) = −N/(m(N − m)) .

To find the value m_0 where P is maximum set d log P/dm = 0. This gives m_0 = Np = ⟨m⟩, and substituting into the second derivative of log P we get

(d² log P/dm²)|_{⟨m⟩} = −1/(Np(1 − p)) = −1/(Δm)² .

Therefore

log P = log P(⟨m⟩) − (m − ⟨m⟩)²/(2(Δm)²) + ...

or

P(m) = P(⟨m⟩) exp[ −(m − ⟨m⟩)²/(2(Δm)²) ] .

The remaining unknown constant P(⟨m⟩) can be evaluated by requiring that the distribution P(m) be properly normalized, that is,

1 = Σ_{m=0}^{N} P(m) ≈ ∫_0^N P(x) dx ≈ ∫_{−∞}^{∞} P(x) dx .

Using

∫_{−∞}^{∞} e^{−αx²} dx = √(π/α) ,

we get

P(⟨m⟩) = [2π(Δm)²]^{−1/2} .

Thus, the expression for the Gaussian distribution with mean ⟨m⟩ and rms deviation Δm is

P(m) = [2π(Δm)²]^{−1/2} exp[ −(m − ⟨m⟩)²/(2(Δm)²) ] .    (2.101)

It can be rewritten as a probability for the frequency f = m/N using ⟨m⟩ = Np and (Δm)² = Np(1 − p). The probability that f lies in the small range df = 1/N is

p(f) df = (2πσ_N²)^{−1/2} exp[ −(f − p)²/(2σ_N²) ] df ,    (2.102)

where σ_N² = p(1 − p)/N.
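To see how good the approximation (2.101) already is for moderate N, one can simply tabulate both distributions. The sketch below is illustrative; the values N = 100 and p = 0.3 are arbitrary choices.

```python
# Binomial distribution, eq.(2.81), versus its Gaussian approximation, eq.(2.101).
from math import comb, exp, pi, sqrt

N, p = 100, 0.3
mean, var = N * p, N * p * (1 - p)          # <m> = Np and (Delta m)^2 = Np(1-p)

for m in range(20, 41, 5):
    binom = comb(N, m) * p**m * (1 - p) ** (N - m)
    gauss = exp(-(m - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)
    print(f"m={m:3d}  binomial={binom:.5f}  gaussian={gauss:.5f}")
```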
To appreciate the significance of the theorem consider a macroscopic variable x built up by adding a large number of small contributions, x = Σ_{n=1}^{N} ξ_n, where the ξ_n are statistically independent. We assume that each ξ_n takes the value ε with probability p, and the value 0 with probability 1 − p. Then the probability that x takes the value mε is given by the binomial distribution P(m|N, p). For large N the probability that x lies in the small range mε ± dx/2, where dx = ε, is

p(x) dx = [2π(Δx)²]^{−1/2} exp[ −(x − ⟨x⟩)²/(2(Δx)²) ] dx ,    (2.103)

where ⟨x⟩ = Npε and (Δx)² = Np(1 − p)ε². Thus, the Gaussian distribution arises whenever we have a quantity that is the result of adding a large number of small independent contributions. The derivation above assumes that the microscopic contributions are discrete (binomial, either 0 or ε) and identically distributed but, as shown in the next section, both of these conditions can be relaxed.

2.9.2 The Central Limit Theorem

Consider the average

x̄ = (1/N) Σ_{r=1}^{N} x_r    (2.104)

of N independent variables x_1, ..., x_N. Our goal is to calculate the probability of x̄ in the limit of large N. Let p_r(x_r) be the probability distribution for the r-th variable, with

⟨x_r⟩ = μ_r   and   (Δx_r)² = σ_r² .    (2.105)

The probability density for x̄ is given by the integral

P(x̄) = ∫ dx_1 ... dx_N p_1(x_1) ... p_N(x_N) δ( x̄ − (1/N) Σ_{r=1}^{N} x_r ) .    (2.106)

(This is just an exercise in the sum and product rules.) To calculate P(x̄) introduce the averages

μ̄ =def (1/N) Σ_{r=1}^{N} μ_r   and   σ̄² =def (1/N) Σ_{r=1}^{N} σ_r² ,    (2.107)

and consider the distribution for the variable x̄ − μ̄, which is Pr(x̄ − μ̄) = P(x̄). Its Fourier transform,

F(k) = ∫ dx̄ Pr(x̄ − μ̄) e^{ik(x̄ − μ̄)} = ∫ dx̄ P(x̄) e^{ik(x̄ − μ̄)}
     = ∫ dx_1 ... dx_N p_1(x_1) ... p_N(x_N) exp[ (ik/N) Σ_{r=1}^{N} (x_r − μ_r) ] ,

can be rearranged into a product

F(k) = [ ∫ dx_1 p_1(x_1) e^{i(k/N)(x_1 − μ_1)} ] ··· [ ∫ dx_N p_N(x_N) e^{i(k/N)(x_N − μ_N)} ] .    (2.108)

The Fourier transform f(k) of a distribution p(ξ) has many interesting and useful properties. For example,

f(k) = ∫ dξ p(ξ) e^{ikξ} = ⟨e^{ikξ}⟩ ,    (2.109)

and the series expansion of the exponential gives

f(k) = Σ_{n=0}^{∞} ⟨(ikξ)^n⟩/n! = Σ_{n=0}^{∞} (ik)^n ⟨ξ^n⟩/n! .    (2.110)

In words, the coefficients of the Taylor expansion of f(k) give all the moments of p(ξ). The Fourier transform f(k) is called the moment generating function and also the characteristic function of the distribution.

Going back to our calculation of P(x̄), eq.(2.106), its Fourier transform, eq.(2.108), is

F(k) = Π_{r=1}^{N} f_r(k/N) ,    (2.111)

where

f_r(k/N) = ∫ dx_r p_r(x_r) e^{i(k/N)(x_r − μ_r)}
         = 1 + i(k/N)⟨x_r − μ_r⟩ − (k²/2N²)⟨(x_r − μ_r)²⟩ + ...
         = 1 − k²σ_r²/(2N²) + O(k³/N³) .    (2.112)

For sufficiently large N this can be written as

f_r(k/N) → exp[ −k²σ_r²/(2N²) ] ,    (2.113)

so that

F(k) = exp[ −(k²/2N²) Σ_{r=1}^{N} σ_r² ] = exp[ −k²σ̄²/(2N) ] .    (2.114)

Finally, taking the inverse Fourier transform, we obtain the desired result, which is called the central limit theorem,

Pr(x̄ − μ̄) = P(x̄) = (2πσ̄²/N)^{−1/2} exp[ −(x̄ − μ̄)²/(2σ̄²/N) ] .    (2.115)

To conclude we comment on its significance. We have shown that almost independently of the form of the distributions p_r(x_r) the distribution of the average x̄ is Gaussian, centered at μ̄ with variance σ̄²/N. Not only need the p_r(x_r) not be binomial, they do not even have to be equal to each other. This helps to explain the widespread applicability of the Gaussian distribution: it applies to almost any 'macro-variable' (such as x̄) that results from adding a large number of independent 'micro-variables' (such as x_r/N).
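The theorem is easy to test numerically. In the sketch below each term of the average is drawn from a different, non-Gaussian uniform distribution; the widths, the value N = 50 and the number of samples are illustrative assumptions.

```python
# Simulation of the central limit theorem, eq.(2.115): the average of N independent,
# non-identical variables is approximately Gaussian with mean mu_bar and variance
# sigma_bar^2 / N.  All distributions and parameters below are illustrative choices.
import random

random.seed(1)
N, samples = 50, 20000
widths = [1.0 + r / N for r in range(N)]         # x_r ~ Uniform(0, a_r), all different

mu_bar = sum(a / 2 for a in widths) / N          # eq.(2.107)
sigma2_bar = sum(a * a / 12 for a in widths) / N

xbars = [sum(random.uniform(0, a) for a in widths) / N for _ in range(samples)]
mean = sum(xbars) / samples
var = sum((x - mean) ** 2 for x in xbars) / samples

print("predicted:", round(mu_bar, 4), round(sigma2_bar / N, 6))
print("simulated:", round(mean, 4), round(var, 6))
```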
But there are restrictions; although very common, Gaussian distributions do not always obtain. A careful look at the derivation above shows that the crucial step was taken in eqs.(2.112) and (2.114), where we neglected the contributions of the third and higher moments. Earlier we mentioned that the success of Gaussian distributions is due to the fact that they codify the information that happens to be relevant to the particular phenomenon under consideration. Now we see what that relevant information might be: it is contained in the first two moments, the mean and the variance – Gaussian distributions are successful when third and higher moments are irrelevant. (This can be stated more precisely in terms of the so-called Lyapunov condition.) Later we shall approach this same problem from the point of view of the method of maximum entropy, and there we will show that, indeed, the Gaussian distribution can be derived as the distribution that codifies information about the mean and the variance while remaining maximally ignorant about everything else.

2.10 Updating probabilities: Bayes' rule

Now that we have solved the problem of how to represent a state of knowledge as a consistent web of interconnected beliefs, we can address the problem of updating from one consistent web of beliefs to another when new information becomes available. We will only consider those special situations where the information to be processed is in the form of data. Specifically, the problem is to update our beliefs about θ (either a single parameter or many) on the basis of data x (either a single number or several) and of a known relation between θ and x. The updating consists of replacing the prior probability distribution p(θ), which represents our beliefs before the data is processed, by a posterior distribution p_new(θ) that applies after the data has been processed.

2.10.1 Formulating the problem

We must first describe the state of our knowledge before the data has been collected or, if the data has already been collected, before we have taken it into account. At this stage of the game not only do we not know θ, we do not know x either.

As mentioned above, in order to infer θ from x we must also know how these two quantities are related to each other. Without this information one cannot proceed further. Fortunately we usually know enough about the physics of an experiment that if θ were known we would have a fairly good idea of what values of x to expect. For example, given a value θ for the charge of the electron, we can calculate the velocity x of an oil drop in Millikan's experiment, add some uncertainty in the form of Gaussian noise, and we have a very reasonable estimate of the conditional distribution p(x|θ). The distribution p(x|θ) is called the sampling distribution and also (less appropriately) the likelihood. We will assume it is known.

We should emphasize that the crucial information about how x is related to θ is contained in the functional form of the distribution p(x|θ) – say, whether it is a Gaussian or a Cauchy distribution – and not in the actual values of the arguments x and θ, which are, at this point, still unknown.

Thus, to describe the web of prior beliefs we must know the prior p(θ) and also the sampling distribution p(x|θ).
This means that we must know the full joint distribution,

p(θ, x) = p(θ) p(x|θ) .    (2.116)

This is very important: we must be clear about what we are talking about. The relevant universe of discourse is neither the space Θ of possible parameters θ nor the space X of possible data x. It is rather the product space Θ × X, and the probability distributions that concern us are the joint distributions p(θ, x).

Next we collect data: the observed value turns out to be X. Our goal is to use this information to update to a web of posterior beliefs represented by a new joint distribution p_new(θ, x). How shall we choose p_new(θ, x)? The new data tells us that the value of x is now known to be X. Therefore, the new web of beliefs must be such that

p_new(x) = ∫ dθ p_new(θ, x) = δ(x − X) .    (2.117)

(For simplicity we have here assumed that x is a continuous variable; had x been discrete the Dirac δ's would be replaced by Kronecker δ's.) This is all we know, but it is not sufficient to determine p_new(θ, x). Apart from the general requirement that the new web of beliefs must be internally consistent, there is nothing in any of our previous considerations that induces us to prefer one consistent web over another. A new principle is needed.

2.10.2 Minimal updating: Bayes' rule

The basic updating strategy that we adopt below reflects the conviction that what we have learned in the past, the prior knowledge, is a valuable resource that should not be squandered. Prior beliefs should be revised only when this is demanded by the new information; the new web of beliefs should coincide with the old one as much as possible. We propose to adopt the following

Principle of Minimal Updating (PMU): The web of beliefs needs to be revised only to the extent required by the new data.

This seems so reasonable and natural that an explicit statement may seem superfluous. The important point, however, is that it is not logically necessary. We could update in many other ways that preserve both internal consistency and consistency with the new information.

As we saw above, the new data, eq.(2.117), does not fully determine the joint distribution

p_new(θ, x) = p_new(x) p_new(θ|x) = δ(x − X) p_new(θ|x) .    (2.118)

Any distribution of the form

p_new(θ, x) = δ(x − X) p_new(θ|X) ,    (2.119)

where p_new(θ|X) remains arbitrary, is compatible with the newly acquired data. We still need to assign p_new(θ|X). It is at this point that we invoke the PMU. We stipulate that no further revision is needed and set

p_new(θ|X) = p_old(θ|X) = p(θ|X) .    (2.120)

Therefore, the web of posterior beliefs is described by

p_new(θ, x) = δ(x − X) p(θ|X) .    (2.121)

The posterior probability p_new(θ) is

p_new(θ) = ∫ dx p_new(θ, x) = ∫ dx δ(x − X) p(θ|X) ,    (2.122)

or

p_new(θ) = p(θ|X) .    (2.123)

In words, the posterior probability equals the prior conditional probability of θ given X. This result, known as Bayes' rule, is extremely reasonable: we maintain those beliefs about θ that are consistent with the data values X that turned out to be true. Data values that were not observed are discarded because they are now known to be false. 'Maintain' is the key word: it reflects the PMU in action.
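In practice eq.(2.123) is often applied on a discrete grid of θ values. The following sketch is an illustrative assumption throughout – a flat prior, a Gaussian sampling distribution with σ = 0.1, and an invented datum X = 0.37 – and is not part of the argument above.

```python
# A minimal sketch of Bayes' rule, eq.(2.123), on a discrete grid of theta values.
from math import exp, sqrt, pi

thetas = [i / 100 for i in range(101)]           # grid over theta in [0, 1]
prior = [1.0 / len(thetas)] * len(thetas)        # flat prior (illustrative choice)

def likelihood(x, theta, sigma=0.1):
    # Gaussian sampling distribution p(x|theta), an illustrative assumption
    return exp(-(x - theta) ** 2 / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

X = 0.37                                         # observed datum (made up)
unnorm = [p * likelihood(X, t) for p, t in zip(prior, thetas)]
posterior = [u / sum(unnorm) for u in unnorm]    # dividing by p(X) normalizes

best = thetas[posterior.index(max(posterior))]
print("posterior peaks at theta =", best)
```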
Using the product rule

p(θ, X) = p(θ) p(X|θ) = p(X) p(θ|X) ,    (2.124)

Bayes' rule can be written as

p_new(θ) = p(θ) p(X|θ) / p(X) .    (2.125)

Remark: Bayes' rule is usually written in the form

p(θ|X) = p(θ) p(X|θ) / p(X) ,    (2.126)

and called Bayes' theorem. This formula is very simple; perhaps it is too simple. It is just a restatement of the product rule, eq.(2.124), and therefore it is a simple consequence of the internal consistency of the prior web of beliefs. The drawback of this formula is that the left hand side is not the posterior but rather the prior conditional probability; it obscures the fact that an additional principle – the PMU – was needed for updating.

The interpretation of Bayes' rule is straightforward: according to eq.(2.125) the posterior distribution p_new(θ) gives preference to those values of θ that were previously preferred as described by the prior p(θ), but this is now modulated by the likelihood factor p(X|θ) in such a way as to enhance our preference for values of θ that make the data more likely, less surprising. The factor in the denominator, p(X), which is the prior probability of the data, is given by

p(X) = ∫ p(θ) p(X|θ) dθ ,    (2.127)

and plays the role of a normalization constant for the posterior distribution p_new(θ). It does not help to discriminate one value of θ from another because it affects all values of θ equally; it is therefore not important except, as we shall see later in this chapter, in problems of model selection.

Neither the rule, eq.(2.123), nor the theorem, eq.(2.126), was ever actually written down by Bayes. The person who first explicitly stated the theorem and, more importantly, who first realized its deep significance was Laplace.

Example: is there life on Mars?

Suppose we are interested in whether there is life on Mars or not. How is the probability that there is life on Mars altered by new data indicating the presence of water on Mars? Let θ = 'There is life on Mars'. The prior information includes the fact I = 'All known life forms require water'. The new data is that X = 'There is water on Mars'. Let us look at Bayes' rule. We cannot say much about p(X|I), but whatever its value it is definitely less than 1. On the other hand p(X|θI) ≈ 1. Therefore the factor multiplying the prior is larger than 1. Our belief in the truth of θ is strengthened by the new data X. This is just common sense, but notice that this kind of probabilistic reasoning cannot be carried out if one adheres to a strictly frequentist interpretation – there is no set of trials. The name 'Bayesian probabilities' given to 'degrees of belief' originates in the fact that it is only under this interpretation that the full power of Bayes' rule can be exploited.

Example: testing positive for a rare disease

Suppose you are tested for a disease, say cancer, and the test turns out to be positive. Suppose further that the test is said to be 99% accurate. Should you panic? It may be wise to proceed with caution. One should start by explaining that '99% accurate' means that when the test is applied to people known to have cancer the result is positive 99% of the time, and when applied to people known to be healthy the result is negative 99% of the time. We express this accuracy as p(y|c) = A = 0.99 and p(n|c̃) = A = 0.99
(y and n stand for 'positive' and 'negative', c and c̃ stand for 'cancer' and 'no cancer'). There is a 1% probability of false positives, p(y|c̃) = 1 − A, and a 1% probability of false negatives, p(n|c) = 1 − A.

On the other hand, what we really want to know is p_new(c) = p(c|y), the probability of having cancer given that you tested positive. This is not the same as the probability of testing positive given that you have cancer, p(y|c); the two probabilities are not the same thing! So there might be some hope. The connection between what we want, p(c|y), and what we know, p(y|c), is given by Bayes' theorem,

p(c|y) = p(c) p(y|c) / p(y) .

An important virtue of Bayes' rule is that it doesn't just tell you how to process information; it also tells you what information you should seek. In this case one should find p(c), the probability of having cancer irrespective of being tested positive or negative. Suppose you inquire and find that the incidence of cancer in the general population is 1%; this means that p(c) = 0.01. Thus,

p(c|y) = p(c) A / p(y) .

One also needs to know p(y), the probability of the test being positive irrespective of whether the person has cancer or not. To obtain p(y) use

p(c̃|y) = p(c̃) p(y|c̃) / p(y) = (1 − p(c))(1 − A) / p(y) ,

and p(c|y) + p(c̃|y) = 1, which leads to our final answer,

p(c|y) = p(c) A / [ p(c) A + (1 − p(c))(1 − A) ] .    (2.128)

For an accuracy A = 0.99 and an incidence p(c) = 0.01 we get p(c|y) = 50%, which is not nearly as bad as one might have originally feared. Should one dismiss the information provided by the test as misleading? No. Note that the probability of having cancer prior to the test was 1%, and on learning the test result this was raised all the way up to 50%. Note also that when the disease is really rare, p(c) → 0, we still get p(c|y) → 0 even when the test is quite accurate. This means that for rare diseases most positive tests turn out to be false positives. We conclude that both the prior and the data contain important information; neither should be neglected.

Remark: The previous discussion illustrates a mistake that is common in verbal discussions: if h denotes a hypothesis and e is some evidence, it is quite obvious that we should not confuse p(e|h) with p(h|e). However, when expressed verbally the distinction is not nearly as obvious. For example, in a criminal trial jurors might be told that if the defendant were guilty (the hypothesis) the probability of some observed evidence would be large, and the jurors might easily be misled into concluding that, given the evidence, the probability is high that the defendant is guilty. Lawyers call this the prosecutor's fallacy.
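The arithmetic of eq.(2.128) fits in a couple of lines. In the sketch below the second call uses an invented, rarer incidence to illustrate the remark about false positives.

```python
# Eq.(2.128): probability of having the disease given a positive test,
# as a function of the test accuracy A and the prior incidence p(c).
def p_c_given_y(A, p_c):
    return p_c * A / (p_c * A + (1 - p_c) * (1 - A))

print(p_c_given_y(0.99, 0.01))    # 0.5, as in the text
print(p_c_given_y(0.99, 0.001))   # ~0.09: for a rarer disease most positives are false
```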
Example: uncertain data

As before, we want to update from a prior joint distribution p(θ, x) = p(x) p(θ|x) to a posterior joint distribution p_new(θ, x) = p_new(x) p_new(θ|x) when information becomes available. When the information is data X that precisely fixes the value of x, we impose that p_new(x) = δ(x − X). The remaining unknown p_new(θ|x) is determined by invoking the PMU: no further updating is needed. This fixes p_new(θ|x) to be the old p(θ|x) and yields Bayes' rule.

It may happen, however, that there is a measurement error. The data X that was actually observed does not constrain the value of x completely. To be explicit, let us assume that the remaining uncertainty in x is well understood: the observation X constrains our beliefs about x to a distribution P_X(x) that happens to be known. P_X(x) could, for example, be a Gaussian distribution centered at X, with some known standard deviation σ.

This information is incorporated into the posterior distribution, p_new(θ, x) = p_new(x) p_new(θ|x), by imposing that p_new(x) = P_X(x). The remaining conditional distribution is, as before, determined by invoking the PMU,

p_new(θ|x) = p_old(θ|x) = p(θ|x) ,    (2.129)

and therefore the joint posterior is

p_new(θ, x) = P_X(x) p(θ|x) .    (2.130)

Marginalizing over the uncertain x yields the new posterior for θ,

p_new(θ) = ∫ dx P_X(x) p(θ|x) .    (2.131)

This generalization of Bayes' rule is sometimes called Jeffrey's conditionalization rule.

Incidentally, this is an example of updating that shows that it is not always the case that information comes purely in the form of data X. In the derivation above there clearly is some information in the observed value X and some information in the particular functional form of the distribution P_X(x), whether it is a Gaussian or some other distribution.

The common element in our previous derivation of Bayes' rule and in the present derivation of Jeffrey's rule is that in both cases the information being processed is a constraint on the allowed posterior marginal distributions p_new(x). Later we shall see (chapter 5) how the updating rules can be generalized still further to apply to even more general constraints.

2.10.3 Multiple experiments, sequential updating

The problem here is to update our beliefs about θ on the basis of data x_1, x_2, ..., x_n obtained in a sequence of experiments. The relations between θ and the variables x_i are given through known sampling distributions. We will assume that the experiments are independent, but they need not be identical. When the experiments are not independent it is more appropriate to refer to them as being performed as a single, more complex experiment the outcome of which is a set of numbers {x_1, ..., x_n}.

For simplicity we deal with just two identical experiments. The prior web of beliefs is described by the joint distribution

p(x_1, x_2, θ) = p(θ) p(x_1|θ) p(x_2|θ) = p(x_1) p(θ|x_1) p(x_2|θ) ,    (2.132)

where we have used independence, p(x_2|θ, x_1) = p(x_2|θ). The first experiment yields the data x_1 = X_1. Bayes' rule gives the updated distribution for θ as

p_1(θ) = p(θ|X_1) = p(θ) p(X_1|θ) / p(X_1) .    (2.133)

The second experiment yields the data x_2 = X_2 and requires a second application of Bayes' rule. The posterior p_1(θ) in eq.(2.133) now plays the role of the prior and the new posterior distribution for θ is

p_12(θ) = p_1(θ|X_2) = p_1(θ) p(X_2|θ) / p_1(X_2) .    (2.134)

We have explicitly followed the update from p(θ) to p_1(θ) to p_12(θ). It is straightforward to show that the same result is obtained if the data from both experiments were processed simultaneously,

p_12(θ) = p(θ|X_1, X_2) = p(θ) p(X_1, X_2|θ) / p(X_1, X_2) .    (2.135)
Indeed, using eqs.(2.132) and (2.133), this last equation can be rewritten as

p_12(θ) = p(θ) [ p(X_1|θ)/p(X_1) ] [ p(X_2|θ)/p(X_2|X_1) ] = p_1(θ) p(X_2|θ) / p(X_2|X_1) ,    (2.136)

and it remains to show that p(X_2|X_1) = p_1(X_2). This last step is straightforward; use eqs.(2.134) and (2.133):

p_1(X_2) = ∫ p_1(θ) p(X_2|θ) dθ = ∫ [ p(θ) p(X_1|θ)/p(X_1) ] p(X_2|θ) dθ
         = ∫ [ p(X_1, X_2, θ)/p(X_1) ] dθ = p(X_2|X_1) .    (2.137)

From the symmetry of eq.(2.135) it is clear that the same posterior p_12(θ) is obtained irrespective of the order in which the data X_1 and X_2 are processed. The commutativity of Bayesian updating follows from the special circumstance that the information conveyed by one experiment does not revise or render obsolete the information conveyed by the other experiment. As we generalize our methods of inference to process other kinds of information that do interfere with each other (and therefore one may render the other obsolete) we should not expect, much less demand, that commutativity will continue to hold.
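A quick numerical check of this commutativity: the sketch below updates a discrete prior with two data values, sequentially and simultaneously, and confirms that the two posteriors agree to rounding error. The flat prior, the Gaussian likelihood and the data X1, X2 are illustrative assumptions.

```python
# Sequential versus simultaneous updating, eqs.(2.133)-(2.135).
from math import exp

thetas = [i / 100 for i in range(101)]
prior = [1.0 / len(thetas)] * len(thetas)

def like(x, t, sigma=0.2):
    return exp(-(x - t) ** 2 / (2 * sigma ** 2))

def update(p, x):
    u = [pi * like(x, t) for pi, t in zip(p, thetas)]
    return [ui / sum(u) for ui in u]

X1, X2 = 0.4, 0.55                                  # invented data
sequential = update(update(prior, X1), X2)          # p -> p1 -> p12
simultaneous = [pi * like(X1, t) * like(X2, t) for pi, t in zip(prior, thetas)]
simultaneous = [s / sum(simultaneous) for s in simultaneous]

print(max(abs(a - b) for a, b in zip(sequential, simultaneous)))   # ~0
```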
2.10.4 Remarks on priors

Let us return to the question of the extent to which probabilities incorporate subjective and objective elements. We have seen that Bayes' rule allows us to update from prior to posterior distributions. The posterior distributions incorporate the presumably objective information contained in the data plus whatever earlier beliefs had been codified into the prior. To the extent that the Bayes updating rule is itself unique, one can claim that the posterior is "more objective" than the prior. As we update more and more we should expect that our probabilities will reflect more and more the input data and less and less the original subjective prior distribution. In other words, some subjectivity is unavoidable at the beginning of an inference chain, but it can be gradually suppressed as more and more information is processed.

The problem of choosing the first prior in the inference chain is a difficult one. We will tackle it in several different ways. Later in this chapter, as we introduce some elementary notions of data analysis, we will address it in the standard way: just make a "reasonable" guess – whatever that might mean. With experience and intuition this seems to work well. But when addressing new problems we have neither experience nor intuition and guessing is risky. We would like to develop more systematic ways to proceed. Indeed it can be shown that certain types of prior information (for example, symmetries and/or other constraints) can be objectively translated into a prior once we have developed the appropriate tools – entropy and geometry. (See e.g. [Caticha Preuss 04] and references therein.)

Our immediate goal here is, first, to remark on the dangerous consequences of extreme degrees of belief, and then to prove our previous intuitive assertion that the accumulation of data will swamp the original prior and render it irrelevant.

Dangerous extremes: the prejudiced mind

The consistency of Bayes' rule can be checked for the extreme cases of certainty and impossibility. Let B describe any background information. If p(θ|B) = 1, then θB = B and p(X|θB) = p(X|B), so that Bayes' rule gives

p_new(θ|B) = p(θ|B) p(X|θB) / p(X|B) = 1 .    (2.138)

A similar argument can be carried through in the case of impossibility: if p(θ|B) = 0, then p_new(θ|B) = 0. Conclusion: if we are absolutely certain about the truth of θ, acquiring data X will have absolutely no effect on our opinions; the new data is worthless. This should serve as a warning about the dangers of erroneously assigning a probability of 1 or of 0: since no amount of data could sway us from our prior beliefs, we may decide we did not need to collect the data in the first place. If you are absolutely sure that Jupiter has no moons, you may either decide that it is not necessary to look through the telescope, or, if you do look and you see some little bright spots, you will probably decide the spots are mere optical illusions. Extreme degrees of belief are dangerous: a truly prejudiced mind does not, and indeed cannot, question its own beliefs.

Lots of data overwhelms the prior

As more and more data are accumulated according to the sequential updating described earlier, one would expect that the continuous inflow of information will eventually render irrelevant whatever prior information we might have had at the start. This is indeed the case: unless we have assigned a pathological prior – all we need is a prior that is smooth where the likelihood is large – after a large number of experiments the posterior becomes essentially independent of the prior.

Consider N independent repetitions of a certain experiment E that yield the data X = {X_1, ..., X_N}. The corresponding likelihood is

p(X|θ) = Π_{r=1}^{N} p(X_r|θ) ,    (2.139)

and the posterior distribution p_new(θ) is

p(θ|X) = [ p(θ)/p(X) ] p(X|θ) = [ p(θ)/p(X) ] Π_{r=1}^{N} p(X_r|θ) .    (2.140)

To investigate the extent to which the data X supports the particular value θ_1 rather than any other value θ_2 it is convenient to study the ratio

p(θ_1|X) / p(θ_2|X) = [ p(θ_1)/p(θ_2) ] R(X) ,    (2.141)

where we introduce the likelihood ratios

R(X) =def Π_{r=1}^{N} R_r(X_r)   and   R_r(X_r) =def p(X_r|θ_1) / p(X_r|θ_2) .    (2.142)

We want to prove the following theorem: barring two trivial exceptions, for any arbitrarily large positive Λ we have

lim_{N→∞} P( R(X) > Λ | θ_1 ) = 1 ,    (2.143)

or, in other words, given θ_1,

R(X) → ∞   in probability .    (2.144)

The significance of the theorem is that as data accumulates a rational person becomes more and more convinced of the truth – in this case the true value is θ_1 – and this happens essentially irrespective of the prior p(θ). The theorem fails in two cases: first, when the prior p(θ_1) vanishes, in which case probabilities conditional on θ_1 are meaningless, and second, when p(X_r|θ_1) = p(X_r|θ_2) for all X_r, which describes an experiment E that is flawed because it cannot distinguish between θ_1 and θ_2.

The proof of the theorem is an application of the weak law of large numbers. Consider the quantity

(1/N) log R(X) = (1/N) Σ_{r=1}^{N} log R_r(X_r) .    (2.145)

Since the variables log R_r(X_r) are independent, eq.(2.97) gives

lim_{N→∞} P( |(1/N) log R(X) − K(θ_1, θ_2)| ≤ ε | θ_1 ) = 1 ,    (2.146)

where ε is any small positive number and

K(θ_1, θ_2) = ⟨ (1/N) log R(X) ⟩_{θ_1} = Σ_{X_r} p(X_r|θ_1) log R_r(X_r) .    (2.147)

In other words, given θ_1,

e^{N(K − ε)} ≤ R(X) ≤ e^{N(K + ε)}   in probability .    (2.148)
In Chapter 4 we will prove that K(θ_1, θ_2) ≥ 0, with equality if and only if the two distributions p(X_r|θ_1) and p(X_r|θ_2) are identical, which is precisely the second of the two trivial exceptions we explicitly avoid. Thus K(θ_1, θ_2) > 0, and this concludes the proof.

We see here the first appearance of a quantity,

K(θ_1, θ_2) = + Σ_{X_r} p(X_r|θ_1) log [ p(X_r|θ_1) / p(X_r|θ_2) ] ,    (2.149)

that will prove to be central in later discussions. When multiplied by −1, the quantity −K(θ_1, θ_2) is called the relative entropy, that is, the entropy of p(X_r|θ_1) relative to p(X_r|θ_2). (Other names include relative information, directed divergence, and Kullback-Leibler distance.) It can be interpreted as a measure of the extent to which the distribution p(X_r|θ_1) can be distinguished from p(X_r|θ_2).

2.11 Examples from data analysis

To illustrate the use of Bayes' theorem as a tool to process information when the information is in the form of data, we consider some elementary examples from the field of data analysis. (For detailed treatments that are friendly to physicists see e.g. [Sivia Skilling 06, Gregory 05].)

2.11.1 Parameter estimation

Suppose the probability for the quantity x depends on certain parameters θ, p = p(x|θ). Although most of the discussion here can be carried out for an arbitrary function p, it is best to be specific and focus on the important case of a Gaussian distribution,

p(x|μ, σ) = (2πσ²)^{−1/2} exp[ −(x − μ)²/(2σ²) ] .    (2.150)

The objective is to estimate the parameters θ = (μ, σ) on the basis of a set of data X = {X_1, ..., X_N}. We assume the measurements are statistically independent of each other and use Bayes' theorem to get

p(μ, σ|X) = [ p(μ, σ)/p(X) ] Π_{i=1}^{N} p(X_i|μ, σ) .    (2.151)

Independence is important in practice because it leads to considerable practical simplifications, but it is not essential: instead of N independent measurements each providing a single datum, we would have a single complex experiment that provides N non-independent data.

Looking at eq.(2.151) we see that a more precise formulation of the same problem is the following. We want to estimate certain parameters θ, in our case μ and σ, from repeated measurements of the quantity x on the basis of several pieces of information. The most obvious is

1. The information contained in the actual values of the collected data X.

Almost equally obvious (at least to those who are comfortable with the Bayesian interpretation of probabilities) is

2. The information about the parameters that is codified into the prior distribution p(θ).

Where and how this prior information was obtained is not relevant at this point; it could have resulted from previous experiments, or from other background knowledge about the problem. The only relevant part is whatever ended up being distilled into p(θ). The last piece of information is not always explicitly recognized; it is

3. The information that is codified into the functional form of the 'sampling' distribution p(X|θ).

If we are to estimate parameters θ on the basis of measurements of a quantity x, it is clear that we must know how θ and x are related to each other.
Notice that item 3 refers to the functional form – whether the distribution is Gaussian as opposed to Poisson or binomial or something else – and not to the actual values of the data X, which is what is taken into account in item 1. The nature of the relation in p(X|θ) is in general statistical, but it could also be completely deterministic. For example, when X is a known function of θ, say X = f(θ), we have p(X|θ) = δ[X − f(θ)]. In this latter case there is no need for Bayes' rule.

Eq.(2.151) is rewritten as

p(μ, σ|X) = [ p(μ, σ)/p(X) ] (2πσ²)^{−N/2} exp[ −Σ_{i=1}^{N} (X_i − μ)²/(2σ²) ] .    (2.152)

Introducing the sample average X̄ and sample variance s²,

X̄ = (1/N) Σ_{i=1}^{N} X_i   and   s² = (1/N) Σ_{i=1}^{N} (X_i − X̄)² ,    (2.153)

eq.(2.152) becomes

p(μ, σ|X) = [ p(μ, σ)/p(X) ] (2πσ²)^{−N/2} exp[ −((μ − X̄)² + s²)/(2σ²/N) ] .    (2.154)

It is interesting that the data appears here only in the particular combination in eq.(2.153) – different sets of data characterized by the same X̄ and s² lead to the same inference about μ and σ. (As discussed earlier, the factor p(X) is not relevant here since it can be absorbed into the normalization of the posterior p(μ, σ|X).)

Eq.(2.154) incorporates the information described in items 1 and 3 above. The prior distribution, item 2, remains to be specified. Let us start by considering the simple case where the value of σ is actually known. Then p(μ, σ) = p(μ) δ(σ − σ_0) and the goal is to estimate μ. Bayes' theorem is now written as

p(μ|X) = [ p(μ)/p(X) ] (2πσ_0²)^{−N/2} exp[ −Σ_{i=1}^{N} (X_i − μ)²/(2σ_0²) ]    (2.155)
       = [ p(μ)/p(X) ] (2πσ_0²)^{−N/2} exp[ −((μ − X̄)² + s²)/(2σ_0²/N) ]
       ∝ p(μ) exp[ −(μ − X̄)²/(2σ_0²/N) ] .    (2.156)

Suppose further that we know nothing about μ; it could have any value. This state of extreme ignorance is represented by a very broad distribution that we take as essentially uniform within some large range; μ is just as likely to have one value as another. For p(μ) ~ const the posterior distribution is Gaussian, with mean given by the sample average X̄ and variance σ_0²/N. The best estimate for the value of μ is the sample average and the uncertainty is the standard deviation. This is usually expressed in the form

μ = X̄ ± σ_0/√N .    (2.157)

Note that the estimate of μ from N measurements has a much smaller error than the estimate from just one measurement; the individual measurements are plagued with errors but they tend to cancel out in the sample average.

In the case of very little prior information – the uniform prior – we have recovered the same results as in the standard non-Bayesian data analysis approach. The real difference arises when prior information is available: the non-Bayesian approach can't deal with it and can only proceed by ignoring it. On the other hand, within the Bayesian approach prior information is easily taken into account. For example, if we know on the basis of other physical considerations that μ has to be positive, we assign p(μ) = 0 for μ < 0 and we calculate the estimate of μ from the truncated Gaussian in eq.(2.156).

A slightly more complicated case arises when the value of σ is not known. Let us assume again that our ignorance of both μ and σ is quite extreme and choose a uniform prior,

p(μ, σ) ∝ { C for σ > 0 ; 0 otherwise } .    (2.158)
A slightly more complicated case arises when the value of σ is not known. Let us assume again that our ignorance of both µ and σ is quite extreme and choose a uniform prior,

    p(\mu,\sigma) \propto \begin{cases} C & \text{for } \sigma>0 \\ 0 & \text{otherwise.} \end{cases}   (2.158)

Another popular choice is a prior that is uniform in µ and in log σ. When there is a considerable amount of data the two choices lead to practically the same conclusions, but we see that there is an important question here: what do we mean by the word 'uniform'? Uniform in terms of which variable? σ, or σ², or log σ? Later we shall have much more to say about this misleadingly innocuous question.

To estimate µ we return to eq.(2.152) or (2.154). For the purpose of estimating µ the variable σ is an uninteresting nuisance which, as discussed in section 2.5.4, is eliminated through marginalization,

    p(\mu|X) = \int_0^\infty d\sigma\, p(\mu,\sigma|X)   (2.159)
             \propto \int_0^\infty \frac{d\sigma}{\sigma^N}\exp\left[-\frac{(\mu-\bar X)^2+s^2}{2\sigma^2/N}\right].   (2.160)

Change variables to t = 1/σ; then

    p(\mu|X) \propto \int_0^\infty dt\, t^{N-2}\exp\left[-\frac{t^2}{2}\,N\left((\mu-\bar X)^2+s^2\right)\right].   (2.161)

Repeated integrations by parts lead to

    p(\mu|X) \propto \left[N\left((\mu-\bar X)^2+s^2\right)\right]^{-\frac{N-1}{2}},   (2.162)

which is called the Student-t distribution. Since the distribution is symmetric the estimate for µ is easy to get,

    \langle\mu\rangle = \bar X.   (2.163)

The posterior p(µ|X) is a Lorentzian-like function raised to some power. As the number of data grows, say N ≳ 10, the tails of the distribution are suppressed and p(µ|X) approaches a Gaussian.

To obtain an error bar for the estimate µ = X̄ we can estimate the variance of µ using the following trick. Note that for the Gaussian in eq.(2.150),

    \left.\frac{d^2}{dx^2}\log p(x|\mu,\sigma)\right|_{x_{\max}} = -\frac{1}{\sigma^2}.   (2.164)

Therefore, to the extent that eq.(2.162) approximates a Gaussian, we can write

    (\Delta\mu)^2 \approx \left[-\left.\frac{d^2}{d\mu^2}\log p(\mu|X)\right|_{\mu_{\max}}\right]^{-1} = \frac{s^2}{N-1}.   (2.165)

(This explains the famous factor of N − 1. As we can see it is not a particularly fundamental result; it follows from approximations that are meaningful only for large N.)

We can also estimate σ directly from the data. This requires that we marginalize over µ,

    p(\sigma|X) = \int_{-\infty}^{\infty} d\mu\, p(\mu,\sigma|X)   (2.166)
               \propto \frac{1}{\sigma^N}\exp\left(-\frac{Ns^2}{2\sigma^2}\right)\int_{-\infty}^{\infty} d\mu\,\exp\left[-\frac{(\mu-\bar X)^2}{2\sigma^2/N}\right].   (2.167)

The Gaussian integral over µ is (2πσ²/N)^{1/2} ∝ σ and therefore

    p(\sigma|X) \propto \frac{1}{\sigma^{N-1}}\exp\left(-\frac{Ns^2}{2\sigma^2}\right).   (2.168)

As an estimate for σ we can use the value where the distribution is maximized,

    \sigma_{\max} = \sqrt{\frac{N s^2}{N-1}},   (2.169)

which agrees with our previous estimate of (Δµ)²,

    \frac{\sigma_{\max}^2}{N} = \frac{s^2}{N-1}.   (2.170)

An error bar for σ itself can be obtained using the previous trick (provided N is large enough) of taking a second derivative of log p. The result is

    \sigma = \sigma_{\max} \pm \frac{\sigma_{\max}}{\sqrt{2(N-1)}}.   (2.171)
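Eqs.(2.159)–(2.165) are easy to check numerically. The sketch below (Python; the synthetic data, the grids and their cutoffs are assumptions of the illustration) builds the joint posterior (2.154) with the flat prior (2.158), marginalizes over σ, and compares the width of the resulting Student-t posterior for µ with the large-N approximation s²/(N − 1) of eq.(2.165).

```python
import numpy as np

# Hypothetical data set (illustration only)
rng = np.random.default_rng(1)
X = rng.normal(5.0, 2.0, size=20)
N, Xbar, s2 = len(X), X.mean(), X.var()     # X.var() uses 1/N, matching eq.(2.153)

# Joint posterior on a (mu, sigma) grid with the flat prior of eq.(2.158)
# (the grid truncates the flat prior at sigma = 8; the posterior is negligible there)
mu = np.linspace(Xbar - 3, Xbar + 3, 801)
sigma = np.linspace(0.2, 8.0, 801)
MU, SIG = np.meshgrid(mu, sigma, indexing="ij")
log_post = -N * np.log(SIG) - N * ((MU - Xbar)**2 + s2) / (2 * SIG**2)
post = np.exp(log_post - log_post.max())

# Marginalize over sigma -> Student-t posterior for mu, eq.(2.162)
post_mu = np.trapz(post, sigma, axis=1)
post_mu /= np.trapz(post_mu, mu)

mean = np.trapz(mu * post_mu, mu)
var = np.trapz((mu - mean)**2 * post_mu, mu)
print(f"posterior mean of mu = {mean:.3f}  (sample average {Xbar:.3f})")
print(f"posterior variance   = {var:.4f}  (large-N approximation s^2/(N-1) = {s2/(N-1):.4f})")
```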
2.11.2 Curve fitting

The problem of fitting a curve to a set of data points is a problem of parameter estimation. There are no new issues of principle to be resolved. In practice, however, it can be considerably more complicated than the simple cases discussed in the previous paragraphs. The problem is as follows. The observed data is in the form of pairs (Xi, Yi) with i = 1, ..., N and we believe that the true values y are related to the x's through a function y_i = f_θ(x_i) which depends on several parameters θ. The goal is to estimate the parameters θ and the complication is that the measured values of y are afflicted by experimental errors,

    Y_i = f_\theta(X_i) + \varepsilon_i.   (2.172)

For simplicity we assume that the probability of the error ε_i is Gaussian with mean ⟨ε_i⟩ = 0 and that the variances ⟨ε_i²⟩ = σ² are known and the same for all data pairs. We also assume that there are no errors affecting the X's. A more realistic account might have to reconsider these assumptions.

The sampling distribution is

    p(Y|\theta) = \prod_{i=1}^{N} p(Y_i|\theta),   (2.173)

where

    p(Y_i|\theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(Y_i-f_\theta(X_i))^2}{2\sigma^2}\right].   (2.174)

Bayes' theorem gives

    p(\theta|Y) \propto p(\theta)\exp\left[-\sum_{i=1}^{N}\frac{(Y_i-f_\theta(X_i))^2}{2\sigma^2}\right].   (2.175)

As an example, suppose that we are trying to fit a straight line through the data points,

    f(x) = a + bx,   (2.176)

and suppose further that, being ignorant about the values of θ = (a, b), we choose p(θ) = p(a, b) ∼ const; then

    p(a,b|Y) \propto \exp\left[-\sum_{i=1}^{N}\frac{(Y_i-a-bX_i)^2}{2\sigma^2}\right].   (2.177)

A good estimate of a and b is the value that maximizes the posterior distribution, which, as we see, is equivalent to using the method of least squares. But this Bayesian analysis, simple as it is, can already give us more: from p(a, b|Y) we can also estimate the uncertainties Δa and Δb, which lie beyond the scope of least squares.
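As a concrete instance of eq.(2.177), the following sketch (Python; the data and the "true" parameters are invented for the illustration, and the design-matrix form is just a convenient way of organizing the sums, not notation used in the text) finds the posterior maximum, which coincides with the least-squares fit, and reads off the uncertainties Δa and Δb from the covariance of the Gaussian posterior.

```python
import numpy as np

# Hypothetical data for the straight-line fit of eqs.(2.176)-(2.177) (illustration only)
rng = np.random.default_rng(2)
a_true, b_true, sigma, N = 1.0, 0.5, 0.3, 30
Xd = np.linspace(0, 10, N)
Yd = a_true + b_true * Xd + rng.normal(0, sigma, size=N)

# Design matrix; maximizing the Gaussian posterior (uniform prior) = least squares
A = np.column_stack([np.ones(N), Xd])
theta_hat, *_ = np.linalg.lstsq(A, Yd, rcond=None)      # [a_hat, b_hat]

# The posterior is Gaussian in (a, b); its covariance is sigma^2 (A^T A)^{-1}
cov = sigma**2 * np.linalg.inv(A.T @ A)
da, db = np.sqrt(np.diag(cov))

print(f"a = {theta_hat[0]:.3f} ± {da:.3f}")
print(f"b = {theta_hat[1]:.3f} ± {db:.3f}")
```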
2.11.3 Model selection

Suppose we are trying to fit a curve y = f_θ(x) through data points (Xi, Yi), i = 1, ..., N. How do we choose the function f_θ? To be specific, let f_θ be a polynomial of order n,

    f_\theta(x) = \theta_0 + \theta_1 x + \ldots + \theta_n x^n;   (2.178)

the techniques of the previous section allow us to estimate the parameters θ0, ..., θn, but how do we decide the order n? Should we fit a straight or a quadratic line? It is not obvious. Having more parameters means that we will be able to achieve a closer fit to the data, which is good, but we might also be fitting the noise, which is bad. The same problem arises when the data shows peaks and we want to estimate their location, their width, and their number; could there be an additional peak hiding in the noise? Are we just fitting noise, or does the data really support one additional peak? We say these are 'problems of model selection'.

To appreciate how important they can be, consider replacing the modestly unassuming word 'model' by the more impressive sounding word 'theory'. Given two competing theories, which one does the data support best? What is at stake is nothing less than the foundation of experimental science.

On the basis of data X we want to select one model among several competing candidates labeled by m = 1, 2, ... Suppose model m is defined in terms of some parameters θ_m = {θ_{m1}, θ_{m2}, ...} and their relation to the data X is contained in the sampling distribution p(X|m, θ_m). The extent to which the data supports model m, i.e., the probability of model m given the data, is given by Bayes' theorem,

    p(m|X) = \frac{p(m)}{p(X)}\, p(X|m),   (2.179)

where p(m) is the prior for the model. The factor p(X|m), which is the prior probability for the data given the model, plays the role of a likelihood. It is often called the 'evidence'. This is not altogether appropriate because the meaning of p(X|m) is already given as "the prior probability of the data." There is nothing more to be said about it. Calling it the 'evidence' can only mislead us by suggesting interpretations, and therefore uses, that go beyond and could conceivably be in conflict with its probability meaning. (A similar problem occurs when ⟨x⟩ is called the "expected" value: it misleads us into thinking that ⟨x⟩ is the value we should expect, which is not necessarily true.) After this warning, we follow standard practice. The "evidence" is calculated from

    p(X|m) = \int d\theta_m\, p(X,\theta_m|m) = \int d\theta_m\, p(\theta_m|m)\, p(X|m,\theta_m).   (2.180)

Therefore

    p(m|X) \propto p(m)\int d\theta_m\, p(\theta_m|m)\, p(X|m,\theta_m).   (2.181)

Thus, the problem is solved, at least in principle, once the priors p(m) and p(θ_m|m) are assigned. Of course, the practical problem of calculating the multi-dimensional integrals can still be quite formidable.

No further progress is possible without making specific choices for the various functions in eq.(2.181), but we can offer some qualitative comments. When comparing two models, m1 and m2, it is fairly common to argue that a priori we have no reason to prefer one over the other and therefore we assign the same prior probability p(m1) = p(m2). (Of course this is not always justified. Particularly in the case of theories that claim to be fundamental, people usually have very strong prior prejudices favoring one theory against the other. Be that as it may, let us proceed.)

Suppose the prior p(θ_m|m) represents a uniform distribution over the parameter space. Since ∫ dθ_m p(θ_m|m) = 1, then

    p(\theta_m|m) \approx \frac{1}{V_m},   (2.182)

where V_m is the 'volume' of the parameter space. Suppose further that p(X|m, θ_m) has a single peak of height L_max spread out over a region of 'volume' δθ_m. The value θ_m where p(X|m, θ_m) attains its maximum can be used as an estimate for θ_m, and the 'volume' δθ_m is then interpreted as an uncertainty. Then the integral of p(X|m, θ_m) can be approximated by the product L_max × δθ_m. Thus, in a very rough and qualitative way, the probability for the model given the data is

    p(m|X) \propto \frac{L_{\max}\times\delta\theta_m}{V_m}.   (2.183)

We can now interpret eq.(2.181) as follows. Our preference for a model will be dictated by how well the model fits the data; this is measured by [p(X|m, θ_m)]_max = L_max. The volume of the region of uncertainty δθ_m also contributes: if more values of the parameters are consistent with the data, then there are more ways the model agrees with the data, and the model is favored. Finally, the larger the volume V_m of possible parameter values, the more the model is penalized. Since a larger volume V_m means a more complex model, the 1/V_m factor penalizes complexity. The preference for simpler models is said to implement Occam's razor. This is a reference to the principle, stated by William of Occam, a 13th century Franciscan monk, that one should not seek a more complicated explanation when a simpler one will do. Such an interpretation is satisfying but ultimately it is quite unnecessary. Occam's principle does not need to be put in by hand: Bayes' theorem takes care of it automatically in eq.(2.181)!
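To make eqs.(2.180)–(2.181) concrete, here is a minimal sketch (Python) that compares polynomial models of order 0, 1 and 2 on synthetic straight-line data. All specific choices – the data, the noise level σ, the uniform prior box for the coefficients, and the crude Monte Carlo estimate of the evidence integral – are assumptions of the illustration, not prescriptions from the text; the estimate is noisy but sufficient to show the trend.

```python
import numpy as np

# Hypothetical data generated from a straight line (illustration only)
rng = np.random.default_rng(3)
sigma, N = 0.5, 20
Xd = np.linspace(0, 5, N)
Yd = 1.0 + 0.4 * Xd + rng.normal(0, sigma, size=N)

def log_evidence(order, lo=-5.0, hi=5.0, n_samples=500_000):
    """Crude Monte Carlo estimate of p(X|m), eq.(2.180): the likelihood averaged
    over samples drawn from a uniform prior box [lo, hi]^(order+1)."""
    theta = rng.uniform(lo, hi, size=(n_samples, order + 1))
    design = np.vander(Xd, order + 1, increasing=True)       # columns 1, x, x^2, ...
    f = theta @ design.T                                      # model predictions
    loglik = (-0.5 * np.sum((Yd - f)**2, axis=1) / sigma**2
              - N * np.log(np.sqrt(2 * np.pi) * sigma))
    m = loglik.max()                                          # log-mean-exp for stability
    return m + np.log(np.mean(np.exp(loglik - m)))

for order in (0, 1, 2):
    print(f"polynomial order {order}: log p(X|m) ≈ {log_evidence(order):.1f}")
# Typically the straight line (order 1) wins: order 0 fits poorly, while order 2
# fits marginally better but is penalized by the larger prior volume (Occam's razor).
```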
2.11.4 Maximum Likelihood

If one adopts the frequency interpretation of probabilities, then most uses of Bayes' theorem are not allowed. The reason is simple: it makes sense to assign a probability distribution p(x|θ) to the data X = {Xi} because the x are random variables, but it is absolutely meaningless to talk about probabilities for the parameters θ because they have no frequency distributions; they are not random variables, they are merely unknown. This means that many problems in science lie beyond the reach of a frequentist probability theory.

To overcome this difficulty a new subject was invented: statistics. Within the Bayesian approach the two subjects, statistics and probability theory, are unified into the single field of inductive inference. In the frequentist approach to statistics, in order to infer an unknown quantity θ on the basis of measurements of another quantity, the data x, one postulates the existence of some function, called the 'statistic', that relates the two, θ = f(x). Since data are afflicted by experimental errors they are deemed to be legitimate random variables to which frequentist probability concepts can be applied. The problem is to estimate the unknown θ when the sampling distribution p(x|θ) is known. The solution proposed by Fisher was to select that value of θ that maximizes the probability of the data that was actually obtained in the experiment. Since p(x|θ) is a function of the variable x in which θ appears as a fixed parameter, Fisher introduced a function of θ, which he called the likelihood, in which the observed data X appear as fixed parameters,

    L(\theta|X) \overset{\text{def}}{=} p(X|\theta).   (2.184)

Thus, this method of parameter estimation is called the method of 'maximum likelihood'. The likelihood function L(θ|X) is not a probability; it is not normalized in any way, and it makes no sense to use it to compute an average or a variance. But the same intuition that leads one to propose maximization of the likelihood to estimate θ also leads one to use the width of the likelihood function to estimate an error bar.

The Bayesian approach agrees with the method of maximum likelihood in the special case where the prior is uniform,

    p(\theta) = \text{const} \;\Rightarrow\; p(\theta|X) \propto p(\theta)\,p(X|\theta) \propto p(X|\theta).   (2.185)

This explains why the Bayesian discussion of this section has reproduced so many of the standard results of the 'orthodox' theory. But then there are additional advantages. Unlike the likelihood, the posterior is a true probability distribution that allows estimation not just of θ but of any one of its moments. And, most important, there is no limitation to uniform priors. If there is additional prior information that is relevant to a problem, the prior distribution provides a mechanism to take it into account.
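A minimal numerical illustration of eqs.(2.184)–(2.185) (Python; the data are synthetic, and the use of scipy's general-purpose minimizer is merely one convenient way to maximize the likelihood, not part of the text): for Gaussian data the maximum-likelihood estimates reduce to the sample average and the sample variance, reproducing the Bayesian results obtained above with uniform priors.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data (illustration only)
rng = np.random.default_rng(4)
X = rng.normal(5.0, 2.0, size=50)
N = len(X)

def neg_log_like(params):
    mu, log_sigma = params              # parametrize sigma by its log to keep it positive
    sigma = np.exp(log_sigma)
    return N * np.log(sigma) + np.sum((X - mu)**2) / (2 * sigma**2)

res = minimize(neg_log_like, x0=[0.0, 0.0], method="Nelder-Mead")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

print(f"maximum-likelihood mu    = {mu_hat:.3f}   (sample average {X.mean():.3f})")
print(f"maximum-likelihood sigma = {sigma_hat:.3f} (sqrt of sample variance {X.std():.3f})")
print(f"rough error bar on mu    = {sigma_hat / np.sqrt(N):.3f}")   # width of the likelihood
```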
Chapter 3

Entropy I: The Evolution of Carnot's Principle

An important problem that occupied the minds of many scientists in the 18th century was either to devise a perpetual motion machine, or to prove its impossibility from the established principles of mechanics. Both attempts failed. Ever since the most rudimentary understanding of the laws of thermodynamics was achieved in the 19th century, no competent scientist would waste time considering perpetual motion.[1] The other goal has also proved elusive; there exist no derivations of the Second Law from purely mechanical principles. It took a long time, and for many the subject is still controversial, but the reason has gradually become clear: entropy is not a physical quantity, it is a tool for inference, a tool for reasoning in situations of incomplete information. It is quite impossible that such a non-mechanical quantity could emerge from a combination of mechanical notions. If anything it should be the other way around.

Much of the material, including the title for this chapter, is inspired by a beautiful article by E. T. Jaynes [Jaynes 88]. I also borrowed from the historical papers [Klein 70, 73, Uffink 04].

[1] The science of thermodynamics, which led to statistical mechanics and eventually to information theory, was initially motivated by the desire to improve steam engines. There seems to exist a curious historical parallel with the modern-day development of quantum information theory, which is being driven by the desire to build quantum computers. The usefulness of thermodynamics far outgrew its original aim. It is conceivable that the same will happen to quantum information theory.

3.1 Carnot: reversible engines

Sadi Carnot was interested in improving the efficiency of steam engines, that is, in maximizing the amount of useful work that can be extracted from an engine per unit of burnt fuel. His work, published in 1824, was concerned with whether appropriate choices of a working substance other than steam, and of the operating temperatures and pressures, would improve the efficiency. Carnot was quite convinced that perpetual motion was impossible even though he had no proof. He could not have had a proof: thermodynamics had not been invented yet. His conviction derived from the long list of previous attempts that had ended in failure. His brilliant idea was to proceed anyway and to postulate what he knew was true but could not prove as the foundation from which he would draw all sorts of other conclusions about engines.[2]

[2] In his attempt to understand the undetectability of the ether, Einstein faced a similar problem: he knew that it was hopeless to seek an understanding of the constancy of the speed of light on the basis of the primitive physics of the atomic structure of solid rods that was available at the time. Inspired by Carnot he deliberately followed the same strategy – to give up and declare victory – and postulated the constancy of the speed of light as the unproven but known truth which would serve as the foundation from which other conclusions could be derived.

At the time Carnot did his work the nature of heat as a form of energy had not yet been understood. He adopted a model that was fashionable at the time – the caloric model – according to which heat is a substance that could be transferred but neither created nor destroyed. For Carnot an engine used heat to produce work in much the same way that falling water can turn a waterwheel and produce work: the caloric would "fall" from a higher temperature to a lower temperature, thereby making the engine turn. What was being transformed into work was not the caloric itself but the energy acquired in the fall. According to the caloric model the amount of heat extracted from the high temperature source should be the same as the amount of heat discarded into the low temperature sink. Later measurements showed that this was not true, but Carnot was quite lucky. Although the model was seriously wrong, it did have a great virtue: it suggested that the generation of work in a heat engine should involve not just the high temperature source from which heat is extracted (the boiler) but also a low temperature sink (the condenser) into which heat is discarded. Later, when heat was interpreted as a form of energy transfer, it was understood that for continued operation it was necessary that the excess heat be discarded into a low temperature sink so that the engine could complete each cycle by returning to the same initial state.

Carnot's caloric-waterwheel model was fortunate in yet another respect – he was not just lucky, he was very lucky – a waterwheel engine can be operated in reverse and used as a pump. This led him to consider a reversible heat engine in which work would be used to draw heat from a cold source and 'pump it up' to deliver heat to the hot reservoir.
The analysis of such reversible heat engines led Carnot to the important conclusion

Carnot's Principle: "No heat engine E can be more efficient than a reversible one E_R operating between the same temperatures."

The proof of Carnot's principle is quite straightforward, but because he used the caloric model Carnot's proof was not strictly correct – the necessary revisions were supplied by Clausius in 1850. As a side remark, it is interesting that Carnot's notebooks, which were made public by his family about 1870, long after his death, indicate that soon after 1824 Carnot came to reject the caloric model and that he achieved the modern understanding of heat as a form of energy transfer. This work – which preceded Joule's experiments by about fifteen years – was not published and therefore had no influence on the history of thermodynamics [Wilson 81].

The following is Clausius' proof. In a standard cycle (Figure 3.1a) a heat engine E extracts heat q1 from a reservoir at high temperature t1 and partially converts it to useful work w. The difference q1 − w = q2 is wasted heat that is dumped into a reservoir at a lower temperature t2. The Carnot-Clausius argument is that if an engine E_S exists that is more efficient than a reversible engine E_R, then it is possible to build perpetual motion machines. Since the latter do not exist, Carnot's principle follows: heat engines that are more efficient than reversible ones do not exist.

Figure 3.1: (a) A regular engine E operating between heat reservoirs at temperatures t1 and t2 generates work w = q1 − q2. (b) A (hypothetical) super-efficient engine E_S linked to a reversed engine E_R would be a perpetual motion engine extracting heat from the cold reservoir and converting it to work w_S − w_R = q_{2R} − q_{2S}.

Consider two engines, one super-efficient and the other reversible, E_S and E_R, operating between the same hot and cold reservoirs. The engine E_S draws heat q1 from the hot source, generates work w_S, and delivers the difference as heat q_{2S} = q1 − w_S to the cold sink (figure 3.1b). It is arranged that in its normal (forward) operation the reversible engine E_R draws the same heat q1 from the hot source, generates work w_R, and discards the difference q_{2R} = q1 − w_R to the cold sink. Since E_S is supposed to be more efficient than E_R we have w_S > w_R, so it would be possible to use a part w_R of the work produced by E_S to run E_R in reverse. The result would be to extract heat q_{2R} from the cold source and pump the total heat q_{2R} + w_R = q1 back up into the hot source. The remaining work w_S − w_R produced by E_S would then be available for any other purposes.
At the end of suc h comp osite cycle the hot reservoir is left unchanged and the net result would b e to extract heat q 2 R − q 2 > 0 from the cold reservoir and con vert it to work w S − w R without any need for fuel. The conclusion is that the existence of a sup er-efficien t heat engine would allo w the construction of a p erp etual motion engine. The blank statemen t p erp etual motion is not p ossible is a true principle but it do es not tell the whole story . It blurs the imp ortant distinction b etw een p erp etual motion engines that op erate by violating energy conserv ation, which are called mac hines of the first kind , and p erp etual motion engines that do not violate energy conserv ation, whic h are thus called machines of the se c ond kind . Carnot’s conclusion deserv es to be singled out as a new principle b ecause it is sp ecific to the second kind of mac hine. Other imp ortant conclusions obtained b y Carnot are that all reversible en- gines op erating b etw een the same temp eratures are equally efficien t; their effi- ciency is a function of the temp eratures only , e def = w q 1 = e ( t 1 , t 2 ) , (3.1) and is therefore indep enden t of any and all other details of ho w the engine is constructed and op erated; that efficiency increases with the temp erature differ- ence [see eq.(3.3) b elow]. F urthermore, the most efficient heat engine cycle, now called the Carnot cycle, is one in which all heat is absorb ed at the high t 1 and all heat is discharged at the low t 2 . Th us, the Carnot cycle is defined by tw o isotherms and t wo adiabats. The next imp ortant step, the determination of the universal function e ( t 1 , t 2 ), w as accomplished by Kelvin. 3.2 Kelvin: temp erature After Joule’s exp eriments in the 1840’s on the conv ersion of work in to heat the caloric mo del had to b e abandoned. Heat was finally recognized as a form of energy and the additional relation w = q 1 − q 2 w as the ingredien t that, in the hands of Kelvin and Clausius, allow ed Carnot’s principle to be dev elop ed in to the next stage. Supp ose t w o rev ersible engines E a and E b are link ed in series to form a single more complex reversible engine E c . The first op erates b et ween temp eratures t 1 and t 2 , and the second b et ween t 2 and t 3 . E a dra ws heat q 1 and discharges q 2 , while E b uses q 2 as input and discharges q 3 . The efficiencies of the three engines 3.2. KEL VIN: TEMPERA TURE 55 are e a = e ( t 1 , t 2 ) = w a q 1 , e b = e ( t 2 , t 3 ) = w b q 2 , (3.2) and e c = e ( t 1 , t 3 ) = w a + w b q 1 . (3.3) They are related by e c = e a + w b q 2 q 2 q 1 = e a + e b 1 − w a q 1 , (3.4) or e c = e a + e b − e a e b , (3.5) whic h is a functional equation for e = e ( t 1 , t 2 ). T o find the solution change v ariables to x = log (1 − e ), whic h transforms eq.(3.5) in to x c ( t 1 , t 3 ) = x a ( t 1 , t 2 ) + x b ( t 2 , t 3 ) , (3.6) and then differen tiate with resp ect to t 2 to get ∂ ∂ t 2 x a ( t 1 , t 2 ) = − ∂ ∂ t 2 x b ( t 2 , t 3 ) . (3.7) The left hand side is independent of t 3 while the second is independent of t 1 , therefore ∂ x a /∂ t 2 m ust b e some function g of t 2 only , ∂ ∂ t 2 x a ( t 1 , t 2 ) = g ( t 2 ) . (3.8) In tegrating gives x ( t 1 , t 2 ) = F ( t 1 ) + G ( t 2 ) where the t wo functions F and G are at this point unkno wn. The b oundary condition e ( t, t ) = 0 or equiv alen tly x ( t, t ) = 0 implies that we deal with merely one unknown function: G ( t ) = − F ( t ). 
Therefore x(t1, t2) = F(t1) − F(t2), or

    e(t_1,t_2) = 1 - \frac{f(t_2)}{f(t_1)},   (3.9)

where f = e^{−F}. From eq.(3.3) we see that the efficiency e(t1, t2) increases as the difference in temperature increases, so that f(t) must be a monotonically increasing function.

Kelvin recognized that there is nothing fundamental about the original temperature scale t. It depends on the particular materials employed to construct the thermometer. Kelvin realized that the freedom in eq.(3.9) in the choice of the function f corresponds to the freedom of changing temperature scales by using different thermometric materials. The only feature common to all thermometers that claim to rank systems according to their 'degree of hotness' is that they must agree that if A is hotter than B, and B is hotter than C, then A is hotter than C. One can therefore regraduate any old inconvenient t scale by a monotonic function to obtain a new scale T chosen purely because it leads to a more elegant formulation of the theory. From eq.(3.9) the optimal choice is quite obvious, and thus Kelvin introduced the absolute scale of temperature,

    T = C f(t),   (3.10)

where the arbitrary scale factor C reflects the still remaining freedom to choose the units. In the absolute scale the efficiency for the ideal reversible heat engine is very simple,

    e(t_1,t_2) = 1 - \frac{T_2}{T_1}.   (3.11)

Carnot's principle that any heat engine E′ must be less efficient than the reversible one, e′ ≤ e, is rewritten as

    e' = \frac{w}{q_1} = 1 - \frac{q_2}{q_1} \le e = 1 - \frac{T_2}{T_1},   (3.12)

or

    \frac{q_1}{T_1} - \frac{q_2}{T_2} \le 0.   (3.13)

It is convenient to redefine heat so that inputs are positive, Q1 = q1, and outputs are negative, Q2 = −q2. Then

    \frac{Q_1}{T_1} + \frac{Q_2}{T_2} \le 0,   (3.14)

where the equality holds when and only when the engine is reversible. The generalization to an engine, or any system, that undergoes a cyclic process in which heat is exchanged with more than two reservoirs is straightforward. If heat Q_i is absorbed from the reservoir at temperature T_i we obtain the Kelvin form (1854) of Carnot's principle,

    \sum_i \frac{Q_i}{T_i} \le 0,   (3.15)

which, in the hands of Clausius, led to the next non-trivial step, the introduction of the concept of entropy.
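As a quick consistency check (not part of the original argument), the Kelvin form (3.11) does satisfy the composition law (3.5): writing e_a = 1 − T2/T1 and e_b = 1 − T3/T2,

    e_a + e_b - e_a e_b = 1 - (1-e_a)(1-e_b) = 1 - \frac{T_2}{T_1}\,\frac{T_3}{T_2} = 1 - \frac{T_3}{T_1} = e_c,

as required.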
3.3 Clausius: entropy

By about 1850 both Kelvin and Clausius had realized that two laws were necessary as a foundation for thermodynamics. The somewhat awkward expressions for the second law that they had adopted at the time were reminiscent of Carnot's; they stated the impossibility of heat engines whose sole effect would be to transform heat from a single source into work, or of refrigerators that could pump heat from a cold to a hot reservoir without the input of external work. It took Clausius until 1865 – some fifteen years later, which indicates that the breakthrough was not at all trivial – before he came up with a new compact statement of the second law that allowed substantial further progress [Cropper 86].

Clausius rewrote Kelvin's eq.(3.15) for a cycle where the system absorbs infinitesimal (positive or negative) amounts of heat dQ from a continuous sequence of reservoirs,

    \oint \frac{dQ}{T} \le 0,   (3.16)

where T is the temperature of each reservoir. For a reversible process, which is achieved when the system is slowly taken through a sequence of equilibrium states and T is the temperature of the system as well as of the reservoirs, the equality sign implies that the integral from any state A to any other state B is independent of the path taken,

    \oint \frac{dQ}{T} = 0 \;\Rightarrow\; \int_{R_1(A,B)} \frac{dQ}{T} = \int_{R_2(A,B)} \frac{dQ}{T},   (3.17)

where R1(A, B) and R2(A, B) denote any two reversible paths linking the same initial state A and final state B. Clausius saw that this implied the existence of a function of the thermodynamic state, which he called the entropy, defined up to an additive constant by

    S_B = S_A + \int_{R(A,B)} \frac{dQ}{T}.   (3.18)

At this stage in the development this entropy is 'thermodynamic entropy', and is defined only for equilibrium states.

Eq.(3.18) seems like a mere reformulation of eqs.(3.15) and (3.16), but it represents a major advance because it allowed thermodynamics to reach beyond the study of cyclic processes. Consider a possibly irreversible process in which a system is taken from an initial state A to a final state B, and suppose the system is then returned to the initial state along a reversible path. Then the more general eq.(3.16) gives

    \int_{A,\,\text{irrev}}^{B} \frac{dQ}{T} + \int_{R(B,A)} \frac{dQ}{T} \le 0.   (3.19)

From eq.(3.18) the second integral is S_A − S_B. In the first integral, −dQ is the amount of heat absorbed by the reservoirs at temperature T and therefore it represents minus the change in the entropy of the reservoirs, which in this case represent the rest of the universe,

    (S^{\text{res}}_A - S^{\text{res}}_B) + (S_A - S_B) \le 0 \quad\text{or}\quad S^{\text{res}}_B + S_B \ge S^{\text{res}}_A + S_A.   (3.20)

Thus the second law can be stated in terms of the total entropy S^total = S^res + S as

    S^{\text{total}}_{\text{final}} \ge S^{\text{total}}_{\text{initial}},   (3.21)

and Clausius could then summarize the laws of thermodynamics as "The energy of the universe is constant. The entropy of the universe tends to a maximum." All restrictions to cyclic processes have disappeared.

Clausius was also responsible for initiating another independent line of research in this subject. His paper "On the kind of motion we call heat" (1857) was the first (failed!) attempt to deduce the second law from purely mechanical principles applied to molecules. His results referred to averages taken over all molecules, for example the kinetic energy per molecule, and involved theorems in mechanics such as the virial theorem. For him the increase of entropy was meant to be an absolute law and not just a matter of overwhelming probability.
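A standard illustration of eq.(3.21) (the numbers are arbitrary and are not taken from the text): let a quantity of heat Q = 1000 J flow directly from a reservoir at T1 = 400 K to a reservoir at T2 = 300 K, with no work extracted. The total entropy change is

    \Delta S^{\text{total}} = \frac{Q}{T_2} - \frac{Q}{T_1} = \frac{1000}{300} - \frac{1000}{400} \approx +0.83\ \text{J/K} > 0,

in accordance with the second law; the reverse flow would give ΔS^total < 0 and is therefore forbidden.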
3.4 Maxwell: probability

We owe to Maxwell the introduction of probabilistic notions into fundamental physics (1860). (Perhaps he was inspired by his earlier study of the rings of Saturn, which required reasoning about particles undergoing very complex trajectories.) He realized the impossibility of keeping track of the exact motion of all the molecules in a gas and pursued a less detailed description in terms of the distribution of velocities. Maxwell interpreted his distribution function as the fraction of molecules with velocities in a certain range, and also as the "probability" P(v⃗)d³v that a molecule has a velocity v⃗ in a certain range d³v. It would take a long time to achieve a clearer understanding of the meaning of the term 'probability'. In any case, Maxwell concluded that "velocities are distributed among the particles according to the same law as the errors are distributed in the theory of the 'method of least squares'," and on the basis of this distribution he obtained a number of significant results on the transport properties of gases.

Over the years he proposed several derivations of his velocity distribution function. The earliest one (1860) is very elegant. It involves two assumptions. The first is a symmetry requirement: the distribution should depend only on the actual magnitude |v⃗| = v of the velocity and not on its direction,

    P(\vec v)\,d^3v = P\!\left(\sqrt{v_x^2+v_y^2+v_z^2}\right) d^3v.   (3.22)

The second assumption is that velocities along orthogonal directions should be independent,

    P(\vec v)\,d^3v = p(v_x)\,p(v_y)\,p(v_z)\,d^3v.   (3.23)

Therefore

    P\!\left(\sqrt{v_x^2+v_y^2+v_z^2}\right) = p(v_x)\,p(v_y)\,p(v_z).   (3.24)

Setting v_y = v_z = 0 we get

    P(v_x) = p(v_x)\,p(0)\,p(0),   (3.25)

so that we obtain a functional equation for p,

    p\!\left(\sqrt{v_x^2+v_y^2+v_z^2}\right) p(0)\,p(0) = p(v_x)\,p(v_y)\,p(v_z),   (3.26)

or

    \log\frac{p\!\left(\sqrt{v_x^2+v_y^2+v_z^2}\right)}{p(0)} = \log\frac{p(v_x)}{p(0)} + \log\frac{p(v_y)}{p(0)} + \log\frac{p(v_z)}{p(0)},   (3.27)

or, introducing the function G,

    G\!\left(\sqrt{v_x^2+v_y^2+v_z^2}\right) = G(v_x) + G(v_y) + G(v_z).   (3.28)

The solution is straightforward. Differentiate with respect to v_x and to v_y to get

    G'\!\left(\sqrt{v_x^2+v_y^2+v_z^2}\right)\frac{v_x}{\sqrt{v_x^2+v_y^2+v_z^2}} = G'(v_x) \quad\text{and}\quad G'\!\left(\sqrt{v_x^2+v_y^2+v_z^2}\right)\frac{v_y}{\sqrt{v_x^2+v_y^2+v_z^2}} = G'(v_y),   (3.29)

or

    \frac{G'(v_x)}{v_x} = \frac{G'(v_y)}{v_y} = -2\alpha,   (3.30)

where −2α is a constant. Integrating gives

    \log\frac{p(v_x)}{p(0)} = G(v_x) = -\alpha v_x^2 + \text{const},   (3.31)

so that

    P(\vec v) = \left(\frac{\alpha}{\pi}\right)^{3/2}\exp\left[-\alpha\left(v_x^2+v_y^2+v_z^2\right)\right],   (3.32)

the same distribution as the "errors in the method of least squares".

Maxwell's distribution applies whether the molecule is part of a gas, a liquid, or a solid, and with the benefit of hindsight the reason is quite easy to see. The probability that a molecule have velocity v⃗ and position x⃗ is given by the Boltzmann distribution ∝ exp(−H/kT). For a large variety of situations the Hamiltonian for one molecule is of the form H = mv²/2 + V(x⃗), where the potential V(x⃗) includes the interactions, whether they be weak or strong, with all the other molecules. If the potential V(x⃗) is independent of v⃗, then the distribution for v⃗ and x⃗ factorizes. Velocity and position are statistically independent, and the velocity distribution is Maxwell's.

Maxwell was the first to realize that the second law is not an absolute law (this was expressed in his popular textbook 'Theory of Heat' in 1871), that it 'has only statistical certainty' and indeed, that in fluctuation phenomena 'the second law is continually being violated'. Such phenomena are not rare: just look out the window and you can see that the sky is blue – a consequence of the scattering of light by density fluctuations in the atmosphere.
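A minimal Monte Carlo check of eq.(3.32) (Python; the value α = 1 and the sample size are arbitrary choices for the illustration): drawing the three velocity components as independent Gaussians, as eqs.(3.23) and (3.31) require, reproduces the isotropic speed distribution 4π(α/π)^{3/2} v² e^{−αv²}, which follows from (3.32) upon integrating over directions.

```python
import numpy as np

rng = np.random.default_rng(5)
alpha, n = 1.0, 200_000
# p(v_i) ∝ exp(-alpha v_i^2)  =>  Gaussian with variance 1/(2 alpha)
v = rng.normal(0.0, 1.0 / np.sqrt(2 * alpha), size=(n, 3))
speed = np.linalg.norm(v, axis=1)

hist, edges = np.histogram(speed, bins=60, range=(0, 4), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
theory = 4 * np.pi * (alpha / np.pi)**1.5 * centers**2 * np.exp(-alpha * centers**2)

print("max deviation between histogram and theory:", np.abs(hist - theory).max())  # small
```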
Maxwell introduced the notion of probability, but what did he actually mean by the word 'probability'? He used his distribution function as a velocity distribution, the number of molecules with velocities in a certain range, which betrays a frequentist interpretation. These probabilities are ultimately mechanical properties of the gas. But he also used his distribution to represent the lack of information we have about the precise microstate of the gas. This latter interpretation is particularly evident in a letter he wrote in 1867, where he argues that the second law could be violated by "a finite being who knows the paths and velocities of all molecules by simple inspection but can do no work except open or close a hole." Such a "demon" could allow fast molecules to pass through a hole from a vessel containing hot gas into a vessel containing cold gas, and could allow slow molecules to pass in the opposite direction, the net effect being the transfer of heat from a low to a high temperature – a violation of the second law. All that was required was that the demon "know" the right information. [Klein 70]

3.5 Gibbs: beyond heat

Gibbs generalized the second law in two directions: to open systems and to inhomogeneous systems. With the introduction of the concept of the chemical potential, a quantity that regulates the transfer of particles in much the same way that temperature regulates the transfer of heat, he could apply the methods of thermodynamics to phase transitions, mixtures and solutions, chemical reactions, and much else. His paper "On the Equilibrium of Heterogeneous Substances" [Gibbs 1875-78] is formulated as the purest form of thermodynamics – a phenomenological theory of extremely wide applicability because its foundations do not rest on particular models about the structure and dynamics of the microscopic constituents.

And yet, Gibbs was keenly aware of the significance of the underlying molecular constitution – he was familiar with Maxwell's writings and in particular with his "Theory of Heat" (indeed, he found mistakes in it). His discussion of the process of mixing gases led him to analyze the "paradox" that bears his name. The entropy of two different gases increases when the gases are mixed; but does the entropy also increase when two gases of the same molecular species are mixed? Is this an irreversible process?

For Gibbs there never was a 'paradox', much less one that would require some esoteric new (quantum) physics for its resolution. For him it was quite clear that thermodynamics was not concerned with microscopic details, but rather with the changes from one macrostate to another. He explained that the mixing of two gases of the same molecular species cannot be reversed because the mixing does not lead to a different "thermodynamic" state: "...we do not mean a state in which each particle shall occupy more or less exactly the same position as at some previous epoch, but only a state which shall be indistinguishable from the previous one in its sensible properties. It is to states of systems thus incompletely defined that the problems of thermodynamics relate." [Gibbs 1875-78]

Gibbs' resolution of the paradox hinges on recognizing, as had Maxwell before him, that the explanation of the second law cannot rest on purely mechanical arguments, that probabilistic concepts are required. This led him to conclude: "In other words, the impossibility of an uncompensated decrease of entropy seems to be reduced to improbability," a sentence that Boltzmann adopted as the motto for the second volume of his "Lectures on the Theory of Gases." (For a modern discussion of the Gibbs paradox see section 4.12.)
Remarkably, neither Maxwell nor Gibbs established a connection between probability and entropy. Gibbs was very successful at showing what one can accomplish by maximizing entropy, but he did not address the issue of what entropy is or what it means. The crucial steps in this direction were taken by Boltzmann. But Gibbs' contributions did not end here. The ensemble theory introduced in his "Principles of Statistical Mechanics" in 1902 (it was Gibbs who coined the term 'statistical mechanics') represents a practical and conceptual step beyond Boltzmann's understanding of entropy.

3.6 Boltzmann: entropy and probability

It was Boltzmann who found the connection between entropy and probability, but his path was long and tortuous [Klein 73, Uffink 04]. Over the years he adopted several different interpretations of probability and, to add to the confusion, he was not always explicit about which one he was using, sometimes mixing them within the same paper, and even within the same equation. At first, he defined the probability of a molecule having a velocity v⃗ within a small cell d³v as being proportional to the amount of time that the particle spent within that particular cell, but he also defined that same probability as the fraction of particles within the cell.

By 1868 he had managed to generalize the Maxwell distribution for point particles and the theorem of equipartition of energy to complex molecules in the presence of an external field. The basic argument, which led him to the Boltzmann distribution, was that in equilibrium the distribution should be stationary, that it should not change as a result of collisions among particles. The collision argument only gave the distribution for individual molecules; it was also in 1868 that he first applied probability to the system as a whole rather than to the individual molecules. He identified the probability of the system being in some region of the N-particle phase space (rather than the space of molecular velocities) with the relative time the system would spend in that region – the so-called "time" ensemble. Alternatively, probability was also defined at a given instant of time as being proportional to the volume of the region. At first he did not think it was necessary to comment on whether the two definitions are equivalent or not, but eventually he realized that their 'probable' equivalence should be explicitly expressed as the hypothesis, which later came to be known as the 'ergodic hypothesis', that over a long time the trajectory of the system would cover the whole region of phase space consistent with the given value of the energy. At the time all these probabilities were still conceived as mechanical properties of the gas.

In 1871 Boltzmann achieved a significant success in establishing a connection between thermodynamic entropy and microscopic concepts such as the probability distribution in phase space. In modern notation his argument was as follows. The energy of N interacting particles is given by

    H = \sum_{i=1}^{N}\frac{p_i^2}{2m} + U(x_1,\ldots,x_N).   (3.33)

The first non-trivial decision was to specify what quantity defined in purely microscopic terms corresponds to the macroscopic internal energy.
He opted for the "average"

    E = \langle H\rangle = \int dz_N\, P_N\, H,   (3.34)

where dz_N = d^{3N}x\, d^{3N}p is the volume element in the N-particle phase space, and P_N is the N-particle distribution function,

    P_N = \frac{\exp(-\beta H)}{Z} \quad\text{where}\quad Z = \int dz_N\, e^{-\beta H},   (3.35)

and β = 1/kT, so that

    E = \frac{3}{2}NkT + \langle U\rangle.   (3.36)

The connection to the thermodynamic entropy requires a clear idea of the nature of heat and how it differs from work. One needs to express heat in purely microscopic terms, and this is quite subtle because at the molecular level there is no distinction between thermal motions and just plain motions. The distribution function is the crucial ingredient. In any infinitesimal transformation the change in the internal energy separates into two contributions,

    \delta E = \int dz_N\, H\,\delta P_N + \int dz_N\, P_N\,\delta H.   (3.37)

The second integral, which can be written as ⟨δH⟩ = ⟨δU⟩, arises purely from changes in the potential function U, which depends among other things on the volume of the vessel containing the gas. Now, a change in the potential is precisely what one means by mechanical work δW; therefore, since δE = δQ + δW, the first integral must represent the transferred heat δQ,

    \delta Q = \delta E - \langle\delta U\rangle.   (3.38)

On the other hand, substituting δE from eq.(3.36), one gets

    \delta Q = \frac{3}{2}Nk\,\delta T + \delta\langle U\rangle - \langle\delta U\rangle.   (3.39)

This is not a complete differential, but dividing by the temperature yields

    \frac{\delta Q}{T} = \delta\left[\frac{3}{2}Nk\log T + \frac{\langle U\rangle}{T} + k\log\int d^{3N}x\, e^{-\beta U} + \text{const}\right],   (3.40)

which suggests that the expression in brackets should be identified with the thermodynamic entropy S. Further rewriting leads to

    S = \frac{E}{T} + k\log Z + \text{const},   (3.41)

which is recognized as the correct modern expression.

Boltzmann's path towards understanding the second law was guided by one notion from which he never wavered: matter is an aggregate of molecules. Apart from this, the story of his progress is the story of the increasingly more important role played by probabilistic notions and, ultimately, it is the story of the evolution of his understanding of the notion of probability itself. By 1877 Boltzmann achieved his final goal and explained entropy purely in terms of probability – mechanical notions were by now reduced to the bare minimum consistent with the subject matter: we are, after all, talking about collections of molecules and their energy is conserved. His final achievement hinges on the introduction of yet another way of thinking about probabilities.
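Eq.(3.41) can be checked against the probabilistic form of the entropy encountered below. The toy sketch (Python; the discrete spectrum and the choice k = 1 are arbitrary, and a sum over levels replaces the phase-space integral) verifies that for canonical probabilities p ∝ e^{−βE} the combination E/T + k log Z coincides with −k Σ p log p.

```python
import numpy as np

# Toy check of eq.(3.41): for canonical probabilities p_i ∝ exp(-E_i/kT),
# E/T + k*log(Z) equals -k * sum_i p_i log p_i.  (k = 1, arbitrary spectrum.)
k, T = 1.0, 2.0
E_levels = np.array([0.0, 1.0, 1.5, 3.0])       # hypothetical energy levels
beta = 1.0 / (k * T)

Z = np.sum(np.exp(-beta * E_levels))
p = np.exp(-beta * E_levels) / Z
E = np.sum(p * E_levels)

S_thermo = E / T + k * np.log(Z)
S_prob = -k * np.sum(p * np.log(p))
print(S_thermo, S_prob)    # the two values agree
```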
He considered an idealized system consisting of N particles whose single-particle phase space is divided into m cells, each with energy ε_n, n = 1, ..., m. The number of particles in the nth cell is denoted w_n, and the distribution 'function' is given by the set of numbers w_1, ..., w_m. In Boltzmann's previous work the determination of the distribution function had been based on figuring out its time evolution from the mechanics of collisions. Here he used a purely combinatorial argument. A completely specified state, which he called a complexion and we call a microstate, is defined by specifying the cell of each individual molecule. A macrostate is less completely specified by the distribution function w_1, ..., w_m. The number of microstates compatible with a given macrostate, which he called the 'permutability' and we call the 'multiplicity', is

    W = \frac{N!}{w_1!\cdots w_m!}.   (3.42)

Boltzmann's assumption was that the probability of the macrostate was proportional to its multiplicity, to the number of ways in which it could be achieved, which assumes each microstate is as likely as any other – the 'equal a priori probability postulate'. The most probable macrostate is that which maximizes W subject to the constraints of a fixed total number of particles N and a fixed total energy E,

    \sum_{n=1}^{m} w_n = N \quad\text{and}\quad \sum_{n=1}^{m} w_n\varepsilon_n = E.   (3.43)

When the numbers w_n are large enough that one can use Stirling's approximation for the factorials, we have

    \log W = N\log N - N - \sum_{n=1}^{m}(w_n\log w_n - w_n)   (3.44)
           = -\sum_{n=1}^{m} w_n\log w_n + \text{const},   (3.45)

or, perhaps better,

    \log W = -N\sum_{n=1}^{m}\frac{w_n}{N}\log\frac{w_n}{N},   (3.46)

so that

    \log W = -N\sum_{n=1}^{m} f_n\log f_n,   (3.47)

where f_n = w_n/N is the fraction of molecules in the nth cell with energy ε_n or, alternatively, the probability that a molecule is in its nth state. The distribution that maximizes log W subject to the constraints (3.43) is such that

    f_n = \frac{w_n}{N} \propto e^{-\beta\varepsilon_n},   (3.48)

where β is a Lagrange multiplier determined by the total energy. When applied to a gas, the possible states of a molecule are cells in phase space. Therefore

    \log W = -N\int dz_1\, f(x,p)\log f(x,p),   (3.49)

where dz_1 = d³x d³p, and the most probable distribution is the equilibrium distribution found earlier by Maxwell and generalized by Boltzmann.

In this approach probabilities are central. The role of dynamics is minimized, but it is not eliminated. The Hamiltonian enters the discussion in two places. One is quite explicit: there is a conserved energy the value of which is imposed as a constraint. The second is much more subtle; we saw above that the probability of a macrostate could be taken proportional to the multiplicity W provided microstates are assigned equal probabilities or, equivalently, equal volumes in phase space are assigned equal a priori weights. As always, equal probabilities must be justified in terms of some form of underlying symmetry. In this case the symmetry follows from Liouville's theorem – under a Hamiltonian time evolution a region in phase space will move around and its shape will be distorted, but its volume will be conserved; Hamiltonian time evolution preserves volumes in phase space. The nearly universal applicability of the 'equal a priori postulate' can be traced to the fact that what is needed is a Hamiltonian; any Hamiltonian would do.

It is very remarkable that although Boltzmann calculated the maximized value of log W for an ideal gas and knew that it agreed with the thermodynamical entropy except for a scale factor, he never wrote the famous equation that bears his name,

    S = k\log W.   (3.50)

This equation, as well as Boltzmann's constant k, were both first written by Planck.
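The passage from the multiplicity (3.42) to the entropy-like expression (3.47) relies only on Stirling's approximation, which is easy to check numerically. The sketch below (Python; the occupation numbers are an arbitrary example) compares the exact log W with −N Σ f_n log f_n.

```python
import numpy as np
from math import lgamma

# Check of eqs.(3.42)-(3.47): for a hypothetical occupation w_n the exact
# log-multiplicity log[N!/(w_1!...w_m!)] approaches -N * sum_n f_n log f_n.
w = np.array([500, 300, 150, 50])          # occupation numbers (illustration only)
N = w.sum()
f = w / N

log_W_exact = lgamma(N + 1) - sum(lgamma(wn + 1) for wn in w)
log_W_stirling = -N * np.sum(f * np.log(f))

print(f"exact log W      = {log_W_exact:.2f}")
print(f"-N sum f log f   = {log_W_stirling:.2f}")   # close, and relatively closer as N grows
```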
There is, however, a problem with eq.(3.49): it involves the distribution function f(x, p) in the one-particle phase space and therefore it cannot take correlations into account. Indeed, eq.(3.49) gives the correct form of the entropy only for ideal gases of non-interacting particles. The expression that applies to systems of interacting particles is[3]

    \log W = -\int dz_N\, f_N\log f_N,   (3.51)

where f_N = f_N(x_1, p_1, ..., x_N, p_N) is the probability distribution in the N-particle phase space.

[3] For the moment we disregard the question of the distinguishability of the molecules. The so-called Gibbs paradox and the extra factor of 1/N! will be discussed in detail in chapter 4.

This equation is usually associated with the name of Gibbs who, in his "Principles of Statistical Mechanics" (1902), developed Boltzmann's combinatorial arguments into a very powerful theory of ensembles. The conceptual gap between eq.(3.49) and (3.51) is enormous; it goes well beyond the issue of intermolecular interactions. The probability in eq.(3.49) is the single-particle distribution; it can be interpreted as a "mechanical" property, namely, the relative number of molecules in each cell. The entropy of eq.(3.49) is a mechanical property of the individual system. In contrast, eq.(3.51) involves the N-particle distribution, which is not a property of any single individual system but a property of an ensemble of replicas of the system. Gibbs was not very explicit about his interpretation of probability. He wrote: "The states of the bodies which we handle are certainly not known to us exactly. What we know about a body can generally be described most accurately and most simply by saying that it is one taken at random from a great number (ensemble) of bodies which are completely described." [my italics, Gibbs 1902, p.163] It is clear that for Gibbs probabilities represent a state of knowledge, that the ensemble is a purely imaginary construction, just a tool for handling incomplete information. On the other hand, it is also clear that Gibbs still thinks of probabilities in terms of frequencies, and since the actual replicas of the system do not exist, he is forced to imagine them.

This brings our story of entropy up to about 1900. In the next chapter we start a more deliberate and systematic study of the connection between entropy and information.

3.7 Some remarks

I end with a disclaimer: this chapter has historical overtones, but it is not history. Lines of research such as the Boltzmann equation and the ergodic hypothesis that were historically very important have been omitted because they represent paths that diverge from the central theme of this work, namely, how laws of physics can be derived from rules for handling information and uncertainty. Our goal has been and will be to discuss thermodynamics and statistical mechanics as the first historical example of such an information physics. At first I tried to write a 'history as it should have happened'. I wanted to trace the development of the concept of entropy from its origins with Carnot in a manner that reflects the logical rather than the actual evolution. But I found that this approach would not do; it trivializes the enormous achievements of the 19th century thinkers and it misrepresents the actual nature of research. Scientific research is not a tidy business.

I mentioned that this chapter was inspired by a beautiful article by E. T. Jaynes with the same title [Jaynes 88]. I think Jaynes' article has great pedagogical value, but I disagree with him on how well Gibbs understood the logical status of thermodynamics and statistical mechanics as examples of inferential and probabilistic thinking.
My own assessment runs in quite the opposite direction: the reason why the conceptual foundations of thermodynamics and statistical mechanics have been so controversial throughout the 20th century is precisely because neither Gibbs nor Boltzmann was particularly clear on the interpretation of probability. I think that we could hardly expect them to have done much better; they did not benefit from the writings of Keynes (1921), Ramsey (1931), de Finetti (1937), Jeffreys (1939), Cox (1946), Shannon (1948), Polya (1954) and, of course, Jaynes himself (1957). Indeed, whatever clarity Jaynes attributes to Gibbs is not Gibbs'; it is the hard-won clarity that Jaynes attained through his own efforts and after absorbing much of the best the 20th century had to offer.

Chapter 4

Entropy II: Measuring Information

What is information? Our central goal is to gain insight into the nature of information, how one manipulates it, and the implications of such insights for physics. In chapter 2 we provided a first partial answer. We might not yet know precisely what information is, but we know it when we see it. For example, it is clear that experimental data contains information, that it is processed using Bayes' rule, and that this is very relevant to the empirical aspect of science, namely, to data analysis. Bayes' rule is the machinery that processes the information contained in data to update from a prior to a posterior probability distribution. This suggests the following generalization: "information" is whatever induces one to update from one state of belief to another. This is a notion worth exploring and to which we will return later. In this chapter we pursue another point of view that has turned out to be extremely fruitful.

We saw that the natural way to deal with uncertainty, that is, with lack of information, is to introduce the notion of degrees of belief, and that these measures of plausibility should be manipulated and calculated using the ordinary rules of the calculus of probabilities. But with this achievement we do not yet reach our final goal. The rules of probability theory allow us to assign probabilities to some "complex" propositions on the basis of the probabilities that have been previously assigned to other, perhaps more "elementary" propositions. In this chapter we introduce a new inference tool designed specifically for assigning those elementary probabilities. The new tool is Shannon's measure of an "amount of information" and the associated method of reasoning is Jaynes' Method of Maximum Entropy, or MaxEnt. [Shannon 48, Jaynes 57b, 83, 03]

4.1 Shannon's information measure

We appeal once more to the idea that if a general theory exists it must apply to special cases. Consider a set of mutually exclusive and exhaustive alternatives i, for example, the possible values of a variable, or the possible states of a system. The state of the system is unknown. On the basis of the incomplete information I we have, we can at best assign probabilities p(i|I) = p_i. In order to select just one among the possible states more information is required. The question we address here is: how much more? Note that we are not asking the more difficult question of which particular piece of information is missing, but merely the quantity that is missing.
It seems reasonable that the amount of information that is missing in a sharply peaked distribution is smaller than the amount missing in a broad distribution, but how much smaller? Is it possible to quantify the notion of amount of information? Can one find a unique quantity S that is a function of the p_i's, that tends to be large for broad distributions and small for narrow ones?

Consider a discrete set of n mutually exclusive and exhaustive alternatives i, each with probability p_i. According to Shannon, any measure S of the amount of information that is missing when all we know is a probability distribution must satisfy three axioms. It is quite remarkable that these conditions are sufficiently constraining to determine the quantity S uniquely. The first two axioms are deceptively simple.

Axiom 1. S is a real continuous function of the probabilities p_i, S[p] = S(p_1, ..., p_n).

Remark: It is explicitly assumed that S[p] depends only on the p_i and on nothing else. What we seek here is an absolute measure of the amount of missing information in p. If the objective were to update from a prior q to a posterior distribution p – a problem that will be tackled later in chapter 6 – then we would require a functional S[p, q] depending on both q and p. Such an S[p, q] would at best be a relative measure: the information in p relative to the reference distribution q.

Axiom 2. If all the p_i's are equal, p_i = 1/n, then S = S(1/n, ..., 1/n) = F(n), where F(n) is an increasing function of n.

Remark: This means that it takes less information to pinpoint one alternative among a few than among many, and also that knowing the number n of available states is already a valuable piece of information. Notice that the uniform distribution p_i = 1/n is singled out to play a very special role. Indeed, although no reference distribution has been explicitly mentioned, the uniform distribution will, in effect, provide the standard of complete ignorance.

The third axiom is a consistency requirement and is somewhat less intuitive. The entropy S[p] measures the amount of additional information, beyond the incomplete information I already codified in the p_i, that will be needed to pinpoint the actual state of the system. Imagine that this missing information were to be obtained not all at once, but in installments. The consistency requirement is that the particular manner in which we obtain this information should not matter. This idea can be expressed as follows.

Imagine the n states are divided into N groups labeled by g = 1, ..., N. The probability that the system is found in group g is

    P_g = \sum_{i\in g} p_i.   (4.1)

Let p_{i|g} denote the conditional probability that the system is in the state i ∈ g given that it is in group g,

    p_{i|g} = \frac{p_i}{P_g} \quad\text{for}\quad i\in g.   (4.2)

Suppose we were to obtain the desired information in two steps, the first of which would allow us to single out one of the groups g, while the second would allow us to decide on the actual i within the selected group g. The amount of information required in the first step is S_G = S[P] where P = {P_g} with g = 1, ..., N. Now suppose we did get this information, and as a result we found, for example, that the system was in group g_1. Then for the second step, to single out the state i within the group g_1, the amount of additional information needed would be S_{g_1} = S[p_{·|g_1}].
Similarly, information amounts $S_{g_2}, S_{g_3}, \ldots$ or $S_{g_N}$ would be required had the selected group turned out to be $g_2, g_3, \ldots$ or $g_N$. But at the beginning of this process we do not yet know which of the $g$'s is the correct one. The expected amount of missing information to take us from the $g$'s to the actual $i$'s is $\sum_g P_g S_g$. The point is that it should not matter whether we get the total missing information in one step, which completely determines $i$, or in two steps, the first of which has low resolution and only determines one of the groups, say $g$, while the second step provides the fine tuning that determines $i$ within the given $g$. This gives us our third axiom:

Axiom 3. For all possible groupings $g = 1, \ldots, N$ of the states $i = 1, \ldots, n$ we must have

    S = S_G + \sum_g P_g S_g .    (4.3)

This is called the "grouping" property.

Remark: Given axiom 3 it might seem more appropriate to interpret $S$ as a measure of the expected rather than the actual amount of missing information, but if $S$ is the expected value of something, it is not clear, at this point, what that something would be. We will return to this below.

The solution to Shannon's constraints is obtained in two steps. First assume that all states $i$ are equally likely, $p_i = 1/n$. Also assume that the $N$ groups $g$ all have the same number of states, $m = n/N$, so that $P_g = 1/N$ and $p_{i|g} = p_i/P_g = 1/m$. Then by axiom 2,

    S[p_i] = S(1/n, \ldots, 1/n) = F(n) ,    (4.4)
    S_G[P_g] = S(1/N, \ldots, 1/N) = F(N) ,    (4.5)

and

    S_g[p_{i|g}] = S(1/m, \ldots, 1/m) = F(m) .    (4.6)

Then axiom 3 gives

    F(mN) = F(N) + F(m) .    (4.7)

This should be true for all integers $N$ and $m$. It is easy to see that one solution of this equation is

    F(m) = k \log m ,    (4.8)

where $k$ is any positive constant, but it is also easy to see that eq.(4.7) has infinitely many other solutions. Indeed, since any integer $m$ can be uniquely decomposed as a product of prime numbers, $m = \prod_r q_r^{\alpha_r}$, where the $\alpha_r$ are integers and the $q_r$ are prime numbers, using eq.(4.7) we have

    F(m) = \sum_r \alpha_r F(q_r) ,    (4.9)

which means that eq.(4.7) can be satisfied by arbitrarily specifying $F(q_r)$ on the primes and then defining $F(m)$ for any other integer through eq.(4.9). A unique solution is obtained when we impose the additional requirement that $F(m)$ be monotonic increasing in $m$ (axiom 2). The following argument is found in [Jaynes 03].

Consider any two integers $s$ and $t$ both larger than 1. The ratio of their logarithms can be approximated arbitrarily closely by a rational number, i.e., we can find integers $\alpha$ and $\beta$ (with $\beta$ arbitrarily large) such that

    \frac{\alpha}{\beta} \leq \frac{\log s}{\log t} < \frac{\alpha+1}{\beta}
    \quad \text{or} \quad
    t^{\alpha} \leq s^{\beta} < t^{\alpha+1} .    (4.10)

But $F$ is monotonic increasing, therefore

    F(t^{\alpha}) \leq F(s^{\beta}) < F(t^{\alpha+1}) ,    (4.11)

and using eq.(4.7),

    \alpha F(t) \leq \beta F(s) < (\alpha+1) F(t)
    \quad \text{or} \quad
    \frac{\alpha}{\beta} \leq \frac{F(s)}{F(t)} < \frac{\alpha+1}{\beta} ,    (4.12)

which means that the ratio $F(s)/F(t)$ can be approximated by the same rational number $\alpha/\beta$. Indeed, comparing eqs.(4.10) and (4.12) we get

    \left| \frac{F(s)}{F(t)} - \frac{\log s}{\log t} \right| \leq \frac{1}{\beta} ,    (4.13)

or

    \left| \frac{F(s)}{\log s} - \frac{F(t)}{\log t} \right| \leq \frac{F(t)}{\beta \log s} .    (4.14)

We can make the right hand side arbitrarily small by choosing $\beta$ sufficiently large, therefore $F(s)/\log s$ must be a constant, which proves that (4.8) is the unique solution.
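To see concretely why monotonicity is what singles out the logarithm, here is a small numerical sketch in Python (the particular prime assignments and the values $s=3$, $t=2$, $\beta=1000$ are illustrative choices, not from the text). It builds a non-monotonic solution of eq.(4.7) by assigning arbitrary values on the primes as in eq.(4.9), confirms that it still satisfies the functional equation, and shows that it violates the bracket (4.12) that any monotonic solution must obey.

```python
import math

def factorize(m):
    """Prime factorization of m as a dict {prime: exponent}."""
    factors, d = {}, 2
    while m > 1:
        while m % d == 0:
            factors[d] = factors.get(d, 0) + 1
            m //= d
        d += 1
    return factors

def make_F(prime_values):
    """Extend an arbitrary assignment on the primes to all integers via eq.(4.9)."""
    def F(m):
        return sum(a * prime_values[q] for q, a in factorize(m).items())
    return F

# An assignment that is NOT monotonic: F(3) < F(2).
F_bad = make_F({2: 1.0, 3: 0.4, 5: 3.0, 7: 0.1})
F_log = lambda m: math.log(m)

# Both satisfy the functional equation (4.7): F(mN) = F(m) + F(N).
for m, N in [(2, 3), (4, 9), (6, 35)]:
    assert abs(F_bad(m * N) - (F_bad(m) + F_bad(N))) < 1e-12
    assert abs(F_log(m * N) - (F_log(m) + F_log(N))) < 1e-12

# But only the logarithm respects the bracket (4.12) implied by monotonicity.
s, t, beta = 3, 2, 1000
alpha = math.floor(beta * math.log(s) / math.log(t))
lo, hi = alpha / beta, (alpha + 1) / beta
print("log:", lo <= F_log(s) / F_log(t) < hi)   # True
print("bad:", lo <= F_bad(s) / F_bad(t) < hi)   # False: F_bad is not monotonic
```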
In the second step of our derivation we will still assume that all the $i$'s are equally likely, so that $p_i = 1/n$ and $S[p] = F(n)$. But now we assume the groups $g$ have different sizes, $m_g$, with $P_g = m_g/n$ and $p_{i|g} = 1/m_g$. Then axiom 3 becomes

    F(n) = S_G[P] + \sum_g P_g F(m_g) ,

and therefore

    S_G[P] = F(n) - \sum_g P_g F(m_g) = \sum_g P_g \left[ F(n) - F(m_g) \right] .

Substituting our previous expression for $F$ we get

    S_G[P] = \sum_g P_g \, k \log \frac{n}{m_g} = - k \sum_{g=1}^{N} P_g \log P_g .

Therefore Shannon's quantitative measure of the amount of missing information, the entropy of the probability distribution $p_1, \ldots, p_n$, is

    S[p] = - k \sum_{i=1}^{n} p_i \log p_i .    (4.15)

Comments

Notice that for discrete probability distributions we have $p_i \leq 1$ and $\log p_i \leq 0$. Therefore $S \geq 0$ for $k > 0$. As long as we interpret $S$ as the amount of uncertainty or of missing information it cannot be negative. We can also check that in cases where there is no uncertainty we get $S = 0$: if any state has probability one, all the other states have probability zero and every term in $S$ vanishes (with $0 \log 0 = 0$ understood as a limit).

The fact that entropy depends on the available information implies that there is no such thing as the entropy of a system. The same system may have many different entropies. Notice, for example, that already in the third axiom we find an explicit reference to two entropies $S[p]$ and $S_G[P]$ referring to two different descriptions of the same system. Colloquially, however, one does refer to the entropy of a system; in such cases the relevant information available about the system should be obvious from the context. In the case of thermodynamics what one means by the entropy is the particular entropy that one obtains when the only information available is specified by the known values of those few variables that specify the thermodynamic macrostate.

The choice of the constant $k$ is purely a matter of convention. A convenient choice is $k = 1$. In thermodynamics the choice is Boltzmann's constant $k_B = 1.38 \times 10^{-16}\,$erg/K, which reflects the historical choice of units of temperature. In communication theory and computer science the conventional choice is $k = 1/\ln 2 \approx 1.4427$, so that

    S[p] = - \sum_{i=1}^{n} p_i \log_2 p_i .    (4.16)

The base of the logarithm is 2, and the entropy is then said to measure information in units called 'bits'.

Now we turn to the question of interpretation. Earlier we mentioned that from axiom 3 it seems more appropriate to interpret $S$ as a measure of the expected rather than the actual amount of missing information. If one adopts this interpretation, the actual amount of information that we gain when we find that $i$ is the true alternative would have to be $\log 1/p_i$. But this is not quite satisfactory. Consider a variable that takes just two values, 0 with probability $p$ and 1 with probability $1-p$. For very small $p$, $\log 1/p$ would be very large, while the information that communicates the true alternative is conveyed by a very short one-bit message, namely "0". It appears that what $\log 1/p$ measures is not the actual amount of information but rather how unexpected or how surprising the piece of information might be. Accordingly, $\log 1/p_i$ is sometimes called the "surprise" of $i$.

It seems reasonable to expect that more information implies less uncertainty.
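As a quick numerical sanity check of eq.(4.15) and of the grouping property (4.3) from which it was derived, here is a short Python sketch; the distribution and the grouping below are arbitrary illustrative choices, not from the text.

```python
import math

def entropy(p, k=1.0):
    """Shannon entropy, eq.(4.15), with 0*log(0) = 0."""
    return -k * sum(pi * math.log(pi) for pi in p if pi > 0)

# An arbitrary distribution over n = 6 states and an arbitrary grouping.
p = [0.30, 0.05, 0.15, 0.25, 0.05, 0.20]
groups = [[0, 1], [2, 3, 4], [5]]            # g = 1, 2, 3

P = [sum(p[i] for i in g) for g in groups]                     # eq.(4.1)
cond = [[p[i] / Pg for i in g] for g, Pg in zip(groups, P)]    # eq.(4.2)

lhs = entropy(p)                                               # S
rhs = entropy(P) + sum(Pg * entropy(pg) for Pg, pg in zip(P, cond))  # S_G + sum_g P_g S_g
print(lhs, rhs)        # the two agree, as required by the grouping property (4.3)
assert abs(lhs - rhs) < 1e-9
```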
W e hav e used the word ‘uncertaint y’ as roughly synonymous to ‘lac k of infor- mation’. The following example illustrates the p otential pitfalls. I normally k eep my keys in m y p o ck et. My state of knowledge ab out the lo cation of my k eys is represen ted b y a probabilit y distribution that is sharply p eak ed at m y p o c ket and reflects a small uncertaint y . But supp ose I chec k and I find that m y po ck et is empt y . Then my k eys could b e virtually anywhere. My new state of knowledge is represented by a very broad distribution that reflects a high uncertain ty . W e ha ve here a situation where more information has increased the uncertaint y rather than decreased it. The point of these remarks is not to suggest that there is something wrong with the mathematical deriv ation – eq.(4.15) do es follow from the axioms – but to suggest caution when interpreting S . The notion of information is at this p oin t still v ague. An y attempt to find its measure will alwa ys b e open to the ob jection that it is not clear what it is that is b eing measured. Indeed, the first t wo of Shannon’s axioms seem to b e particularly intuitiv e, but the third one, the grouping prop erty , is not nearly as comp elling. Is entrop y the only w ay to measure uncertaint y? Do esn’t the v ariance also measure uncertain ty? Shannon and Jaynes b oth argued that one should not place to o m uch significance on the axiomatic deriv ation of eq.(4.15), that its use can b e ful ly justified a p osteriori b y its formal properties, for example, by the v arious inequalities it satisfies. Ho wev er, this p osition can be questioned on the grounds that it is the axioms that confer meaning to the en tropy; the disagreement is not ab out the actual equations, but about what they mean and, ultimately , ab out how they should b e used. Other measures of uncertaint y can b e in tro duced and, indeed, they ha ve b een introduced b y Renyi and by Tsallis, creating a whole industry of alternativ e theories. [Renyi 61, Tsallis 88] Whenever one can mak e an inference using Shannon’s entrop y , one can make other inferences using any one of the Ren yi’s en tropies. Which, among all those alternatives, should one choose? The t wo-state case T o gain intuition ab out S [ p ] consider the case of a v ariable that can take t w o v alues. The prov erbial example is a biased coin – for example, a b ent coin – for whic h the outcome ‘heads’ is assigned probability p and ‘tails’ probability 1 − p . 4.2. RELA TIVE ENTR OPY 73 The corresp onding en tropy is S ( p ) = − p log p − (1 − p ) log (1 − p ) , (4.17) where w e c hose k = 1. It is easy to chec k that S ≥ 0 and that the maxim um uncertain ty , attained for p = 1 / 2, is S max = log 2. An imp ortant set of prop erties of the entrop y follows from the concavit y of the entrop y whic h follows from the conca vity of the logarithm. Suppose w e can’t decide whether the actual probability of heads is p 1 or p 2 . W e may decide to assign probability q to the first alternative and probability 1 − q to the second. The actual probability of heads then is the mixture q p 1 + (1 − q ) p 2 . The corresp onding en tropies satisfy the inequalit y S ( q p 1 + (1 − q ) p 2 ) ≥ q S ( p 1 ) + (1 − q ) S ( p 2 ) , (4.18) with equality in the extreme cases where p 1 = p 2 , or q = 0, or q = 1. 
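Eq.(4.17) and the concavity inequality (4.18) are easily checked numerically; a small Python sketch (the grid of values is an arbitrary choice):

```python
import math

def S(p):
    """Two-state entropy, eq.(4.17), with k = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

# Maximum uncertainty is attained at p = 1/2 and equals log 2.
assert abs(S(0.5) - math.log(2)) < 1e-12

# Concavity, eq.(4.18): S(q*p1 + (1-q)*p2) >= q*S(p1) + (1-q)*S(p2).
grid = [i / 20 for i in range(21)]
for p1 in grid:
    for p2 in grid:
        for q in grid:
            mix = q * p1 + (1 - q) * p2
            assert S(mix) >= q * S(p1) + (1 - q) * S(p2) - 1e-12
```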
Eq.(4.18) says that however ignorant we might be when we invoke a probability distribution, an uncertainty about the probabilities themselves will introduce an even higher degree of ignorance.

4.2 Relative entropy

The following entropy-like quantity turns out to be useful,

    K[p, q] = + \sum_i p_i \log \frac{p_i}{q_i} .    (4.19)

Despite the positive sign, $K$ is sometimes read as the 'entropy of $p$ relative to $q$', and is thus called "relative entropy." It is easy to see that in the special case when $q_i$ is a uniform distribution $K$ is essentially equivalent to the Shannon entropy – they differ by a constant. Indeed, for $q_i = 1/n$, eq.(4.19) becomes

    K[p, 1/n] = \sum_i p_i (\log p_i + \log n) = \log n - S[p] .    (4.20)

The relative entropy is also known by many other names, including cross entropy, information divergence, information for discrimination, and Kullback-Leibler distance [Kullback 59] (Kullback recognized its importance for applications in statistics and studied many of its properties). However, the expression (4.19) has a much older history. It was already used by Gibbs in his Elementary Principles of Statistical Mechanics [Gibbs 1902].

It is common to interpret $K[p, q]$ as the amount of information that is gained (thus the positive sign) when one thought the distribution that applies to a random process is $q$ and one learns that the distribution is actually $p$. This interpretation suffers from the same conceptual difficulties mentioned earlier concerning the Shannon entropy. In the next chapter we will see that the relative entropy turns out to be the fundamental quantity for inference – indeed, more fundamental, more general, and therefore more useful than entropy itself – and that the interpretational difficulties that afflict the Shannon entropy can be avoided. (We will also redefine it with a negative sign, $S[p, q] \stackrel{\mathrm{def}}{=} -K[p, q]$, so that it really is a true entropy.) In this chapter we just derive some properties and consider some applications.

An important property of the relative entropy is the Gibbs inequality,

    K[p, q] \geq 0 ,    (4.21)

with equality if and only if $p_i = q_i$ for all $i$. The proof uses the concavity of the logarithm,

    \log x \leq x - 1
    \quad \text{or} \quad
    \log \frac{q_i}{p_i} \leq \frac{q_i}{p_i} - 1 ,    (4.22)

which implies

    \sum_i p_i \log \frac{q_i}{p_i} \leq \sum_i (q_i - p_i) = 0 .    (4.23)

The Gibbs inequality provides some justification for the common interpretation of $K[p, q]$ as a measure of the "distance" between the distributions $p$ and $q$. Although useful, this language is not quite correct because $K[p, q] \neq K[q, p]$, while a true distance $d$ is required to be symmetric, $d[p, q] = d[q, p]$. However, as we shall later see, if the two distributions are sufficiently close, the relative entropy $K[p + \delta p, p]$ satisfies all the requirements of a metric. Indeed, it turns out that, up to a constant factor, it is the only natural Riemannian metric on the manifold of probability distributions. It is known as the Fisher-Rao metric or, perhaps more appropriately, the information metric.

The two inequalities $S[p] \geq 0$ and $K[p, q] \geq 0$, together with eq.(4.20), imply

    0 \leq S[p] \leq \log n ,    (4.24)

which establishes the range of the entropy between the two extremes of complete certainty ($p_i = \delta_{ij}$ for some value $j$) and complete uncertainty (the uniform distribution) for a variable that takes $n$ discrete values.
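The Gibbs inequality (4.21), the asymmetry of $K$, and the relations (4.20) and (4.24) can be checked with a few lines of Python; the distributions below are arbitrary illustrative choices:

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def K(p, q):
    """Relative entropy, eq.(4.19); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.15, 0.10]
q = [0.25, 0.25, 0.25, 0.25]   # uniform, n = 4
r = [0.10, 0.20, 0.30, 0.40]

# Gibbs inequality (4.21): K >= 0, with equality only for identical distributions.
assert K(p, r) > 0 and K(r, p) > 0
assert abs(K(p, p)) < 1e-12

# Not symmetric, so not a true distance.
print(K(p, r), K(r, p))

# Eq.(4.20): relative entropy to the uniform distribution is log n - S[p] ...
assert abs(K(p, q) - (math.log(4) - entropy(p))) < 1e-12
# ... which, with S >= 0 and K >= 0, gives the range (4.24): 0 <= S[p] <= log n.
assert 0 <= entropy(p) <= math.log(4)
```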
4.3 Joint entropy, additivity, and subadditivity

The entropy $S[p_x]$ reflects the uncertainty or lack of information about the variable $x$ when our knowledge about it is codified in the probability distribution $p_x$. It is convenient to refer to $S[p_x]$ directly as the "entropy of the variable $x$" and write

    S_x \stackrel{\mathrm{def}}{=} S[p_x] = - \sum_x p_x \log p_x .    (4.25)

The virtue of this notation is its compactness, but one must keep in mind that the same symbol $x$ is used to denote both a variable $x$ and its values $x_i$. To be more explicit,

    - \sum_x p_x \log p_x = - \sum_i p_x(x_i) \log p_x(x_i) .    (4.26)

The uncertainty or lack of information about two (or more) variables $x$ and $y$ is expressed by the joint distribution $p_{xy}$, and the corresponding joint entropy is

    S_{xy} = - \sum_{xy} p_{xy} \log p_{xy} .    (4.27)

When the variables $x$ and $y$ are independent, $p_{xy} = p_x p_y$, the joint entropy is additive,

    S_{xy} = - \sum_{xy} p_x p_y \log (p_x p_y) = S_x + S_y ,    (4.28)

that is, the joint entropy of independent variables is the sum of the entropies of each variable. This additivity property also holds for the other measure of uncertainty we had introduced earlier, namely the variance,

    \mathrm{var}(x + y) = \mathrm{var}(x) + \mathrm{var}(y) .    (4.29)

In thermodynamics additivity is called extensivity: the entropy of an extended system is the sum of the entropies of its parts provided these parts are independent. The thermodynamic entropy can be extensive only when the interactions between the various subsystems are sufficiently weak that correlations between them can be neglected.

When the two variables $x$ and $y$ are not independent the equality (4.28) can be generalized into an inequality. Consider the joint distribution $p_{xy} = p_x p_{y|x} = p_y p_{x|y}$. The relative entropy or Kullback "distance" of $p_{xy}$ to the product distribution $p_x p_y$ that would represent uncorrelated variables is given by

    K[p_{xy}, p_x p_y] = \sum_{xy} p_{xy} \log \frac{p_{xy}}{p_x p_y}
                       = - S_{xy} - \sum_{xy} p_{xy} \log p_x - \sum_{xy} p_{xy} \log p_y
                       = - S_{xy} + S_x + S_y .    (4.30)

Therefore, using $K \geq 0$ we get

    S_{xy} \leq S_x + S_y ,    (4.31)

with the equality holding when the two variables $x$ and $y$ are independent. This inequality is called the subadditivity property. Its interpretation is clear: entropy increases when information about correlations is discarded.

4.4 Conditional entropy and mutual information

Consider again two variables $x$ and $y$. We want to measure the amount of uncertainty about one variable $x$ when we have some limited information about another variable $y$. This quantity, called the conditional entropy and denoted $S_{x|y}$, is obtained by calculating the entropy of $x$ as if the precise value of $y$ were known and then taking the expectation over the possible values of $y$,

    S_{x|y} = \sum_y p_y S[p_{x|y}] = - \sum_y p_y \sum_x p_{x|y} \log p_{x|y} = - \sum_{x,y} p_{xy} \log p_{x|y} ,    (4.32)

where $p_{xy}$ is the joint distribution of $x$ and $y$.

The conditional entropy is related to the entropy of $x$ and the joint entropy by the following "chain rule." Use the product rule for the joint distribution,

    \log p_{xy} = \log p_y + \log p_{x|y} ,    (4.33)

and take the expectation over $x$ and $y$ to get

    S_{xy} = S_y + S_{x|y} .    (4.34)

In words: the entropy of two variables is the entropy of one plus the conditional entropy of the other. Also, since $S_y$ is positive, we see that conditioning reduces entropy,

    S_{xy} \geq S_{x|y} .    (4.35)
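The additivity, subadditivity, chain-rule, and conditional-entropy relations above are easy to confirm numerically; a short Python sketch using an arbitrary illustrative joint distribution (the numbers are not from the text):

```python
import math

def H(dist):
    """Entropy of a list or dict of probabilities, with 0*log(0) = 0."""
    vals = dist.values() if isinstance(dist, dict) else dist
    return -sum(p * math.log(p) for p in vals if p > 0)

# An arbitrary correlated joint distribution p_xy over x in {0,1}, y in {0,1,2}.
p_xy = {(0, 0): 0.20, (0, 1): 0.25, (0, 2): 0.05,
        (1, 0): 0.05, (1, 1): 0.15, (1, 2): 0.30}

p_x = {x: sum(v for (xx, _), v in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(v for (_, yy), v in p_xy.items() if yy == y) for y in (0, 1, 2)}

S_xy, S_x, S_y = H(p_xy), H(p_x), H(p_y)

# Conditional entropy, eq.(4.32): S_{x|y} = -sum_xy p_xy log p_{x|y}.
S_x_given_y = -sum(v * math.log(v / p_y[y]) for (x, y), v in p_xy.items() if v > 0)

assert abs(S_xy - (S_y + S_x_given_y)) < 1e-12   # chain rule, eq.(4.34)
assert S_xy <= S_x + S_y + 1e-12                 # subadditivity, eq.(4.31)
assert S_xy >= S_x_given_y                       # eq.(4.35)
print(S_x, S_y, S_xy, S_x_given_y)
```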
Another useful entropy-like quantity is the so-called "mutual information" of $x$ and $y$, denoted $M_{xy}$, which "measures" how much information $x$ and $y$ have in common. It is given by the relative entropy between the joint distribution $p_{xy}$ and the product distribution $p_x p_y$ that discards all information contained in the correlations. Using eq.(4.30),

    M_{xy} \stackrel{\mathrm{def}}{=} K[p_{xy}, p_x p_y] = S_x + S_y - S_{xy} \geq 0 ,    (4.36)

which shows that it is symmetrical in $x$ and $y$. Using eq.(4.34), the mutual information is related to the conditional entropies by

    M_{xy} = S_x - S_{x|y} = S_y - S_{y|x} .    (4.37)

The relationships among these various entropies can be visualized by a figure that resembles a Venn diagram. (The diagram is usually considered a purely mnemonic aid, but recent work [Knuth 02-06] on the duality between assertions and questions, and the corresponding duality between probabilities and entropies, suggests that the resemblance between the two types of Venn diagrams is not accidental.)

4.5 Continuous distributions

Shannon's derivation of the expression for the entropy, eq.(4.15), applies to probability distributions of discrete variables. The generalization to continuous variables is not quite straightforward. The discussion will be carried out for a one-dimensional continuous variable; the generalization to more dimensions is trivial.

The starting point is to note that the expression

    - \int dx \, p(x) \log p(x)    (4.38)

is unsatisfactory. A change of variables $x \to y = y(x)$ changes the probability density $p(x)$ to $p'(y)$ but does not represent a loss or gain of information. Therefore the actual probabilities do not change, $p(x)\,dx = p'(y)\,dy$, and neither should the entropy. However, one can check that (4.38) is not invariant,

    \int dx \, p(x) \log p(x) = \int dy \, p'(y) \log \left[ p'(y) \frac{dy}{dx} \right] \neq \int dy \, p'(y) \log p'(y) .    (4.39)

We approach the continuous case as a limit from the discrete case. Consider a continuous distribution $p(x)$ defined on an interval $x_a \leq x \leq x_b$. Divide the interval into equal intervals $\Delta x = (x_b - x_a)/N$. The distribution $p(x)$ can be approximated by a discrete distribution

    p_n = p(x_n) \Delta x ,    (4.40)

where $x_n = x_a + n \Delta x$ and $n$ is an integer. The discrete entropy is

    S_N = - \sum_{n=0}^{N} \Delta x \, p(x_n) \log [ p(x_n) \Delta x ] ,    (4.41)

and as $N \to \infty$ we get

    S_N \longrightarrow \log N - \int_{x_a}^{x_b} dx \, p(x) \log \frac{p(x)}{1/(x_b - x_a)} ,    (4.42)

which diverges. This is quite to be expected: it takes a finite amount of information to identify one discrete alternative within a finite set, but it takes an infinite amount to single out one point in a continuum. The difference $S_N - \log N$ has a well defined limit and we are tempted to consider

    - \int_{x_a}^{x_b} dx \, p(x) \log \frac{p(x)}{1/(x_b - x_a)}    (4.43)

as a candidate for the continuous entropy, until we realize that, except for an additive constant, it coincides with the unacceptable expression (4.38) and should be discarded for precisely the same reason: it is not invariant under changes of variables. Had we first changed variables to $y = y(x)$ and then discretized into $N$ equal $\Delta y$ intervals we would have obtained a different limit,

    - \int_{y_a}^{y_b} dy \, p'(y) \log \frac{p'(y)}{1/(y_b - y_a)} .    (4.44)
The problem is that the limiting procedure depends on the particular choice of discretization; the limit depends on which particular set of intervals $\Delta x$ or $\Delta y$ we have arbitrarily decided to call equal. Another way to express the same idea is to note that the denominator $1/(x_b - x_a)$ in (4.43) represents a probability density that is uniform in the variable $x$, but not in $y$. Similarly, the density $1/(y_b - y_a)$ in (4.44) is uniform in $y$, but not in $x$.

Having identified the origin of the problem we can now suggest a solution. On the basis of our prior knowledge of the problem at hand we must decide on a privileged set of equal intervals or, alternatively, on one preferred probability distribution $\mu(x)$ that we are willing to define as "uniform." Then, and only then, does it make sense to propose the following definition,

    S[p, \mu] \stackrel{\mathrm{def}}{=} - \int_{x_a}^{x_b} dx \, p(x) \log \frac{p(x)}{\mu(x)} .    (4.45)

It is easy to check that this is invariant,

    \int_{x_a}^{x_b} dx \, p(x) \log \frac{p(x)}{\mu(x)} = \int_{y_a}^{y_b} dy \, p'(y) \log \frac{p'(y)}{\mu'(y)} .    (4.46)

Examples illustrating possible choices of the uniform $\mu(x)$ are the following.

1. When the variable $x$ refers to position in "physical" space, we can feel fairly comfortable with what we mean by equal volumes: use Cartesian coordinates and choose $\mu(x) = \mathrm{constant}$.

2. In a $D$-dimensional curved space with a known metric tensor $g_{ij}$, i.e., when the distance between neighboring points with coordinates $x^i$ and $x^i + dx^i$ is given by $d\ell^2 = g_{ij} dx^i dx^j$, the volume elements are given by $(\det g)^{1/2} d^D x$. In this case choose $\mu(x) \propto (\det g)^{1/2}$.

3. In classical statistical mechanics the Hamiltonian evolution in phase space is, according to Liouville's theorem, such that phase space volumes are conserved. This leads to a natural definition of equal intervals or equal volumes. The corresponding choice of uniform $\mu$ is called the postulate of "equal a priori probabilities."

Notice that the expression in eq.(4.45) is a relative entropy, $-K[p, \mu]$. This is a hint for a theme that will be fully developed in chapter 6: relative entropy is the more fundamental quantity. Strictly, there is no Shannon entropy in the continuum – not only do we have to subtract an infinite constant and spoil its (already shaky) interpretation as an information measure, but we have to appeal to prior knowledge and introduce the measure $\mu$. On the other hand, there are no difficulties in obtaining the continuum relative entropy from its discrete version. We can check that

    K_N = \sum_{n=0}^{N} p_n \log \frac{p_n}{q_n} = \sum_{n=0}^{N} \Delta x \, p(x_n) \log \frac{p(x_n) \Delta x}{q(x_n) \Delta x}    (4.47)

has a well defined limit,

    K[p, q] = \int_{x_a}^{x_b} dx \, p(x) \log \frac{p(x)}{q(x)} ,    (4.48)

which is explicitly invariant under coordinate transformations.

4.6 Communication Theory

Here we give the briefest introduction to some basic notions of communication theory as originally developed by Shannon [Shannon 48, Shannon Weaver 49]. For a more comprehensive treatment see [Cover Thomas 91].

Communication theory studies the problem of how a message that was selected at some point of origin can be best reproduced at some later destination point. The complete communication system includes an information source that generates a message composed of, say, words in English, or pixels in a picture.
A transmitter translates the message into an appropriate signal. For example, sound pressure is encoded into an electrical current, or letters into a sequence of zeros and ones. The signal is such that it can be transmitted over a communication channel, which could be electrical signals propagating in coaxial cables or radio waves through the atmosphere. Finally, a receiver reconstructs the signal back into a message that can be interpreted by an agent at the destination point.

From the engineering point of view the communication system must be designed with only limited information about the set of possible messages. In particular, it is not known which specific messages will be selected for transmission. The typical sort of questions one wishes to address concern the minimal physical requirements needed to communicate the messages that could potentially be generated by a particular information source. One wants to characterize the sources, measure the capacity of the communication channels, and learn how to control the degrading effects of noise. And after all this, it is somewhat ironic but nevertheless true that "information theory" is completely unconcerned with whether any "information" is being communicated at all. As far as the engineering goes, whether the messages convey any meaning or not is completely irrelevant.

To illustrate the basic ideas consider the problem of "data compression." A useful idealized model of an information source is a sequence of random variables $x_1, x_2, \ldots$ which take values from a finite alphabet of symbols. We will assume that the variables are independent and identically distributed. (Eliminating these limitations is both possible and important.) Suppose that we deal with a binary source in which the variables $x_i$, which are usually called 'bits', take the values zero or one with probabilities $p$ or $1-p$ respectively.

Shannon's idea was to classify the possible sequences $x_1, \ldots, x_N$ into typical and atypical according to whether they have high or low probability. For large $N$ the expected number of zeros and ones is $Np$ and $N(1-p)$ respectively. The probability of these typical sequences is

    P(x_1, \ldots, x_N) \approx p^{Np} (1-p)^{N(1-p)} ,    (4.49)

so that

    - \log P(x_1, \ldots, x_N) \approx - N [ p \log p + (1-p) \log (1-p) ] = N S(p) ,    (4.50)

where $S(p)$ is the two-state entropy, eq.(4.17), the maximum value of which is $S_{\max} = \log 2$. Therefore, the probability of typical sequences is roughly

    P(x_1, \ldots, x_N) \approx e^{-N S(p)} .    (4.51)

Since the total probability is less than one, we see that the number of typical sequences has to be less than about $e^{N S(p)}$, which for large $N$ is considerably less than the total number of possible sequences, $2^N = e^{N \log 2}$. This fact is very significant. Transmitting an arbitrary sequence, irrespective of whether it is typical or not, requires a long message of $N$ bits, but we do not have to waste resources in order to transmit all sequences. We only need to worry about the far fewer typical sequences because the atypical sequences are too rare. The number of typical sequences is about

    e^{N S(p)} = 2^{N S(p)/\log 2} = 2^{N S(p)/S_{\max}} ,    (4.52)

and therefore we only need about $N S(p)/S_{\max}$ bits to identify each one of them. Thus, it must be possible to compress the original long message into a much shorter one.
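A small simulation illustrates this counting argument. The following Python sketch (the values $p = 0.1$ and $N = 1000$ are arbitrary illustrative choices) draws a few random sequences, compares $-(1/N)\log P$ with $S(p)$ as in eq.(4.50), and estimates how small a fraction of all $2^N$ sequences the typical set occupies.

```python
import math
import random

random.seed(0)
p, N = 0.1, 1000        # bit value 0 has probability p, value 1 has probability 1 - p
S = -p * math.log(p) - (1 - p) * math.log(1 - p)   # two-state entropy S(p), eq.(4.17)

# For a few random sequences, -(1/N) log P(x_1,...,x_N) is close to S(p), eq.(4.50).
for _ in range(5):
    seq = [0 if random.random() < p else 1 for _ in range(N)]
    logP = sum(math.log(p if x == 0 else 1 - p) for x in seq)
    print(round(-logP / N, 4), "  S(p) =", round(S, 4))

# The typical set holds roughly e^{N S(p)} sequences, a vanishing fraction of the
# 2^N possible sequences whenever p != 1/2:
print("e^{N S(p)} / 2^N =", math.exp(N * (S - math.log(2))))
```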
The compression migh t imply some small probability of error because the actual message migh t conceiv ably turn out to b e at ypical but one can, if desired, av oid any such errors by using one additional bit to flag the sequence that follows as typical and short or as at ypical and long. Actual sc hemes for implementing the data compression are discussed in [Co ver Thomas 91]. Next we state these intuitiv e notions in a mathematically precise w ay . Theorem: The Asymptotic Equipartition Prop erty (AEP). If x 1 , . . . , x N are indep enden t v ariables with the same probability distribution p ( x ), then − 1 N log P ( x 1 , . . . , x N ) − → S [ p ] in probability . (4.53) Pro of: If the v ariables x i are indep endent, so are their logarithms, log p ( x i ), − 1 N log P ( x 1 , . . . , x N ) = − 1 N N P i log p ( x i ) , (4.54) and the la w of large n umbers (see section 2.8) gives lim N →∞ Prob − 1 N log P ( x 1 , . . . , x N ) + h log p ( x ) i ≤ ε = 1 , (4.55) where − h log p ( x ) i = S [ p ] . (4.56) This concludes the pro of. W e can elab orate on the AEP idea further. The typical sequences are those for whic h eq.(4.51) or (4.53) is satisfied. T o b e precise let us define the t ypical set A N ,ε as the set of sequences with probabilit y P ( x 1 , . . . , x N ) such that e − N [ S ( p )+ ε ] ≤ P ( x 1 , . . . , x N ) ≤ e − N [ S ( p ) − ε ] . (4.57) 4.6. COMMUNICA TION THEOR Y 81 Theorem of typical sequences: (1) F or N sufficiently large Prob[ A N ,ε ] > 1 − ε . (2) | A N ,ε | ≤ e N [ S ( p )+ ε ] where | A N ,ε | is the num b er of sequences in A N ,ε . (3) F or N sufficiently large | A N ,ε | ≥ (1 − ε ) e N [ S ( p ) − ε ] . In words: the t ypical set has probability near one, t ypical sequences are nearly equally probable (thus the ‘equipartition’), and there are ab out e N S ( p ) of them. T o summarize: A lmost al l events ar e almost e qual ly likely . Pro of: Eq.(4.55) states that for fixed ε , for any given δ there is an N δ suc h that for all N > N δ , we hav e Prob − 1 N log P ( x 1 , . . . , x N ) + S [ p ] ≤ ε ≥ 1 − δ . (4.58) Th us, the probability that the sequence ( x 1 , . . . , x N ) is ε -t ypical tends to one, and therefore so m ust Prob[ A N ,ε ]. Setting δ = ε yields part ( 1) . T o prov e (2) write 1 ≥ Prob[ A N ,ε ] = P ( x 1 ,...,x N ) ∈ A N,ε P ( x 1 , . . . , x N ) ≥ P ( x 1 ,...,x N ) ∈ A N,ε e − N [ S ( p )+ ε ] = e − N [ S ( p )+ ε ] | A N ,ε | . (4.59) Finally , from part (1) , 1 − ε < Prob[ A N ,ε ] = P ( x 1 ,...,x N ) ∈ A N,ε P ( x 1 , . . . , x N ) ≤ P ( x 1 ,...,x N ) ∈ A N,ε e − N [ S ( p ) − ε ] = e − N [ S ( p ) − ε ] | A N ,ε | , (4.60) whic h pro ves (3) . W e can now quan tify the exten t to whic h messages generated b y an infor- mation source of entrop y S [ p ] can b e compressed. A scheme that pro duces compressed sequences that are more than N S ( p ) /S max bits is capable of dis- tinguishing among all the typical sequences. The compressed sequences can be reliably decompressed in to the original message. Conv ersely , schemes that yield compressed sequences of fewer than N S ( p ) /S max bits cannot describ e all typi- cal sequences and are not reliable. This result is known as Shannon ’s noiseless channel c o ding the or em . 82 CHAPTER 4. ENTROPY I I: MEASURING INFORMA TION 4.7 Assigning probabilities: MaxEn t Probabilities are introduced to deal with lack of information. 
The notion that en tropy S [ p ] can b e in terpreted as a quantitativ e measure of the amount of missing information has one remark able consequence: it provides us with a metho d to assign probabilities. The idea is simple: Am ong al l p ossible pr ob ability distributions that agr e e with whatever we know sele ct that p articular distribution that r efle cts maximum ignor anc e ab out every- thing else. Sinc e ignor anc e is me asur e d by entr opy, the metho d is mathematic al ly implemente d by sele cting the distribution that maximizes entr opy subje ct to the c onstr aints imp ose d by the available information. This metho d of r e asoning is c al le d the metho d of Maximum Entr opy, and is often abbr eviate d as MaxEnt. Ultimately , the metho d of maximum en tropy is based on an ethical principle of intellectual honest y that demands that one should not assume information one does not hav e. The idea is quite comp elling but its justification relies h eavily on in terpreting entrop y as a measure of missing information and therein lies its w eakness: to what extent are we sure that entrop y is the unique measure of information or of uncertaint y? As a simple illustration of the MaxEnt metho d in action consider a v ariable x ab out whic h absolutely nothing is known except that it can take n discrete v al- ues x i with i = 1 . . . n . The distribution that represents the state of maxim um ignorance is that which maximizes the entrop y sub ject to the single constrain t that the probabilities b e normalized, P i p i = 1. In tro ducing a Lagrange multi- plier α to handle the constraint, the v ariation p i → p i + δ p i giv es 0 = δ S [ p ] − α P i p i = − n P i =1 (log p i + 1 + α ) δ p i , (4.61) so that the selected distribution is p i = e − 1 − α or p i = 1 n , (4.62) where the multiplier α has b een determined from the normalization constrain t. W e can c heck that the maxim um v alue attained by the en tropy , S max = − P i 1 n log 1 n = log n , (4.63) agrees with eq.(4.24). Remark: The distribution of maximum ignorance turns out to b e uniform. It coincides with what w e would hav e obtained using Laplace’s Principle of Insufficien t Reason. It is sometimes asserted that the MaxEnt metho d provides a pro of of Laplace’s principle but suc h a claim is questionable. As w e saw earlier, the privileged status of the uniform distribution was imp osed through the Shannon’s axioms from the very b eginning. 4.8. CANONICAL DISTRIBUTIONS 83 4.8 Canonical distributions The av ailable information constrains the p ossible probability distributions. Al- though the constrain ts can take an y form whatso ever, in this section w e develop the MaxEn t formalism for the sp ecial case of constraints that are linear in the probabilities. The most important applications are to situations of thermo- dynamic equilibrium where the relev an t information is given in terms of the exp ected v alues of those few macroscopic v ariables such as energy , volume, and n umber of particles o ver which one has some exp erimental con trol. (In the next c hapter w e revisit this problem more explicitly .) The goal is to select the distribution of maxim um en tropy from within the family of all distributions for which the exp ectations of some functions f k ( x ), k = 1 , 2 , . . . ha ve known n umerical v alues F k , f k = P i p i f k i = F k , (4.64) where we set f k ( x i ) = f k i to simplify the notation. In addition there is a normalization constraint, P p i = 1. 
In tro ducing the necessary m ultipliers, the en tropy maximization is achiev ed setting 0 = δ S [ p ] − α P i p i − λ k f k = − P i log p i + 1 + α + λ k f k i δ p i , (4.65) where we adopt the summation conv en tion that repeated upp er and low er indices are summed o ver. The solution is the so-called ‘canonical’ distribution, p i = exp − ( λ 0 + λ k f k i ) , (4.66) where we hav e set 1 + α = λ 0 . The normalization constrain t determines λ 0 , e λ 0 = P i exp( − λ k f k i ) def = Z ( λ 1 , λ 2 , . . . ) (4.67) where w e hav e introduced the partition function Z . Substituting eqs.(4.66) and (4.67) into the other constraints, eqs.(4.64), gives a set of equations that implicitly determine the remaining m ultipliers, − ∂ log Z ∂ λ k = F k , (4.68) and substituting in to S [ p ] = − P p i log p i w e obtain the maximized v alue of the en tropy , S max = P i p i ( λ 0 + λ k f k i ) = λ 0 + λ k F k . (4.69) Equations (4.66-4.68) are a generalized form of the “canonical” distributions first discov ered by Maxw ell, Boltzmann and Gibbs. Strictly , the calculation 84 CHAPTER 4. ENTROPY I I: MEASURING INFORMA TION ab o ve only shows that the entrop y is stationary , δ S = 0. T o complete the argumen t we must sho w that (4.69) is the absolute maximum rather than just a lo cal extrem um or a stationary p oint. Consider an y other distribution q i that satisfies precisely the same con- strain ts in eqs.(4.64). According to the basic Gibbs inequality for the relativ e en tropy of q and the canonical p , K ( q , p ) = P i q i log q i p i ≥ 0 , (4.70) or S [ q ] ≤ − P i q i log p i . (4.71) Substituting eq.(4.66) giv es S [ q ] ≤ P i q i ( λ 0 + λ k f k i ) = λ 0 + λ k F k . (4.72) Therefore S [ q ] ≤ S [ p ] = S max . (4.73) In w ords: within the family of all distributions q that satisfy the constraints (4.64) the distribution that achiev es the maxim um entrop y is the canonical distribution p giv en in eq.(4.66). Ha ving found the maximum entrop y distribution we can now dev elop the MaxEn t formalism along lines that closely parallel the formalism of statistical mec hanics. Each distribution within the family (4.66) can b e though t of as a p oin t in a contin uous space – the manifold of canonical distributions. Eac h sp ecific choice of exp ected v alues ( F 1 , F 2 , . . . ) determines a unique point within the space, and therefore the F k pla y the role of co ordinates. T o eac h p oint ( F 1 , F 2 , . . . ) we can asso ciate a num b er, the v alue of the maximized entrop y . Therefore, S max is a scalar field which we denote S ( F 1 , F 2 , . . . ) = S ( F ). In thermo dynamics it is conv entional to drop the suffix ‘max’ and to refer to S ( F ) as the entr opy of the system. This language is inappropriate b ecause it can b e misleading. W e should constantly remind ourselves that S ( F ) is just one out of many possible en tropies. S ( F ) is that particular en tropy that measures the amount of missing information of a sub ject whose knowledge consists of the n umerical v alues of the F s and nothing else. The m ultiplier λ 0 = log Z ( λ 1 , λ 2 , . . . ) = log Z ( λ ) (4.74) is sometimes called the “free energy” b ecause it is closely related to the ther- mo dynamic free energy , S ( F ) = log Z ( λ ) + λ k F k . (4.75) The quan tities S ( F ) and log Z ( λ ) contain the same information; the equation ab o ve shows that they are Legendre transforms of each other. 
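To see the formalism in action, here is a minimal numerical sketch in Python (with $k = 1$): a single constraint function $f(x_i) = i$ on six states and an arbitrary illustrative target value $\langle f \rangle = 4.5$ (neither is from the text). It solves eq.(4.68) for the multiplier by bisection, builds the canonical distribution (4.66), and checks that the maximized entropy agrees with the Legendre form (4.75).

```python
import math

f = [1, 2, 3, 4, 5, 6]           # a single constraint function f(x_i) = i (illustrative)
F = 4.5                          # an arbitrary target expected value <f> = F

def Z(lam):                      # partition function, eq.(4.67)
    return sum(math.exp(-lam * fi) for fi in f)

def mean_f(lam):                 # <f> = -d log Z / d lam, eq.(4.68), computed directly
    return sum(fi * math.exp(-lam * fi) for fi in f) / Z(lam)

# Solve mean_f(lam) = F by bisection; mean_f is monotonically decreasing in lam.
lo, hi = -5.0, 5.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    lo, hi = (lo, mid) if mean_f(mid) < F else (mid, hi)
lam = 0.5 * (lo + hi)

p = [math.exp(-lam * fi) / Z(lam) for fi in f]       # canonical distribution, eq.(4.66)
S_direct = -sum(pi * math.log(pi) for pi in p)
S_legendre = math.log(Z(lam)) + lam * F              # eq.(4.75) with k = 1

print(lam, S_direct, S_legendre)                     # the two entropies agree
assert abs(S_direct - S_legendre) < 1e-9
assert abs(sum(pi * fi for pi, fi in zip(p, f)) - F) < 1e-9
```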
Just as the F s are obtained from log Z ( λ ) from eq.(4.68), the λ s can b e obtained from S ( F ) ∂ S ( F ) ∂ F k = ∂ log Z ( λ ) ∂ λ j ∂ λ j ∂ F k + ∂ λ j ∂ F k F j + λ k , (4.76) 4.8. CANONICAL DISTRIBUTIONS 85 or, using eq.(4.68), ∂ S ( F ) ∂ F k = λ k , (4.77) whic h shows that the multipliers λ k are the comp onents of the gradien t of the en tropy S ( F ) on the manifold of canonical distributions. A useful extension of the formalism is the follo wing. It is common that the functions f k are not fixed but dep end on one or more parameters v that can b e externally manipulated, f k i = f k ( x i , v ). F or example f k i could refer to the energy of the i th state of the system, and the parameter v could b e the volume of the system or an externally applied magnetic field. Then a general change in the exp ected v alue F k induced by changes in b oth f k and λ k , is expressed as δ F k = δ f k = P i p i δ f k i + f k i δ p i , (4.78) The first term on the right is δ f k = P i p i ∂ f k i ∂ v δ v = ∂ f k ∂ v δ v . (4.79) When F k represen ts the internal energy then δ f k is a small energy trans- fer that can b e controlled through an external parameter v . This suggests that δ f k represen ts a kind of “generalized work,” δ W k , and the exp ectations ∂ f k /∂ v are analogues of pressure or susceptibility , δ W k def = δ f k = ∂ f k ∂ v δ v . (4.80) The second term in eq.(4.78), δ Q k def = P i f k i δ p i = δ f k − δ f k (4.81) is a kind of “generalized heat”, and δ F k = δ W k + δ Q k (4.82) is a “generalized first la w.” The corresp onding c hange in the en tropy is obtained from eq.(4.75), δ S = δ log Z ( λ ) + δ ( λ k F k ) = − 1 Z P i δ λ k f k i + λ k δ f k i e − λ k f k i + δ λ k F k + λ k δ F k = λ k δ f k − δ f k , (4.83) whic h, using eq.(4.81), giv es 86 CHAPTER 4. ENTROPY I I: MEASURING INFORMA TION δ S = λ k δ Q k . (4.84) It is easy to see that this is equiv alen t to eq.(4.77) where the partial deriv atives are deriv atives at constant v . Th us the entrop y remains constant in infinitesimal “adiabatic” pro cesses – those with δ Q k = 0. F rom the information theory p oint of view [see eq.(4.81)] this result is a triviality: the amount of information in a distribution cannot c hange when the probabilities do not change, δ p i = 0 ⇒ δ Q k = 0 ⇒ δ S = 0 . (4.85) . 4.9 On constrain ts and relev ant information The metho d of maximum en tropy has been successful in many applications, but there are cases where it has failed. Are these symptoms of irreparable flaws or mere examples of misuses of the metho d? MaxEn t is a metho d for pro cessing information: what information are w e talking ab out? The imp ortance of this issue cannot b e ov erestimated. Here we collect a few remarks; this is a topic to whic h w e will return rep eatedly . One p oin t that must b e made is that questions ab out how information is pro cessed – and this is the problem that MaxEn t is supposed to address – should not b e confused with questions ab out how the information was obtained in the first place. These are tw o separate issues. Here is an example of a common error. Once w e accept that certain con- strain ts migh t refer to the expected v alues of certain v ariables, how do w e decide their numerical magnitudes? The n umerical v alues of exp ectations are seldom kno wn and one migh t b e tempted to replace expected v alues by sample a v erages b ecause it is the latter that are directly a v ailable from exp erimen t. 
But the tw o are not the same: Sample aver ages ar e exp erimental data. Exp e cte d values ar e not exp erimental data. F or v ery large samples such a replacement can b e justified b y the la w of large n umbers – there is a high probability that sample a verages will appro ximate the exp ected v alues. How ever, for small samples using one as an appro ximation for the other can lead to incorrect inferences. It is imp ortan t to realize that these incorrect inferences do not represent an in trinsic fla w of the MaxEnt metho d; they are merely a w arning of how the MaxEnt metho d should not be used. There are many other ob jections that hav e b een raised against the logic b ehind the MaxEn t metho d. W e make no attempt to survey them all; many ha ve already received adequate answers (see, e.g., [Jaynes 83] and [Jaynes 03], particularly section 11.8). But some ob jections remain that are quite legitimate and demand our atten tion. They rev olve around the follo wing question: Once w e accept that constrain ts will b e in the form of the exp ected v alues of certain v ariables, how do we decide which v ariables to choose? 4.9. ON CONSTRAINTS AND RELEV ANT INFORMA TION 87 When using the MaxEnt metho d to obtain, say , the canonical Boltzmann distribution ( p i ∝ e − β E i ) it has b een common to adopt the following language: (A) W e seek the probabilit y distribution that co difies the information w e ac- tually hav e (say , the exp ected energy) and is maximally unbiased ( i.e. maximally ignorant or maxim um entrop y) ab out all the other information w e do not p ossess. Man y authors find this justification unsatisfactory . Indeed, they migh t argue, for example, that (B1) The observed sp ectrum of black b o dy radiation is whatever it is, inde- p enden tly of the information that happ ens to be av ailable to us. W e prefer to phrase the ob jection differently: (B2) In most realistic situations the exp ected v alue of the energy is not a quan tity we happ en to know; ho w, then, can w e justify using it as a constrain t? Alternativ ely , even when the exp ected v alues of some quan tities happen to be kno wn, according to (A) what MaxEnt provides is the b est p ossible inferences giv en the limited information that is a v ailable. This is no mean feat, but there is no guaran tee that the resulting inferences will be an y goo d at all. The pre- dictions of statistical mechanics are sp ectacularly accurate: how can we hop e to achiev e equally sp ectacular predictions in other fields? (B3) W e need some understanding of which are the “correct” quantities the exp ectation v alues of which co dify the relev ant information for the problem at hand. Merely that some particular exp ected v alue happ ens to b e known is neither an adequate nor a sufficient explanation. A partial answer to these ob jections starts with the observ ation that whether the v alue of the expected energy is kno wn or not, it is nevertheless still true that maximizing entrop y sub ject to the energy constraint leads to the indisputably correct family of thermal equilibrium distributions (e.g., the black-bo dy spectral distribution). The justification b ehind imp osing a constraint on the exp ected energy cannot b e that this is a quantit y that happ ens to b e kno wn – because of the brute fact that it is not known – but rather that the exp ected energy is the quantit y that should be known. 
Even when its actual numerical v alue is unkno wn, we recognize it as the r elevant information without whic h no suc- cessful predictions are possible. (In the next chapter w e revisit this imp ortant question.) Therefore we allow MaxEnt to pro ceed as if this crucial information were a v ailable which leads us to a family of distributions containing the temp erature as a free parameter. The actual v alue of this parameter will hav e to be inferred 88 CHAPTER 4. ENTROPY I I: MEASURING INFORMA TION from the exp eriment itself either directly , using a thermometer, or indirectly by Ba yesian analysis from other empirical data. T o summarize: It is not just what you happ en to know; you have to know the right thing. The constraints that should b e imposed are those that co dify the information that is relev ant to the problem under consideration. Bet ween one extreme of ignorance (we know neither which v ariables are relev an t nor their exp ected v alues), and the other extreme of useful knowledge (we know whic h v ariables are relev ant and we also know their exp ected v alues), there is an interme diate state of know le dge – and this is the rule rather than the exception – in whic h the relev ant v ariables hav e b een correctly identified but their actual exp ected v alues remain unknown. In this intermediate state, the information ab out which are the relev ant v ariables is tak en into account using MaxEn t to select a parametrized family of probability distributions, while the actual exp ected v alues must then b e inferred indep endently either by direct measuremen t or inferred indirectly using Ba yes’ rule from other exp erimental data. Ac hieving this ‘intermediate state of kno wledge’ is the difficult problem pre- sen ted b y (B3). Historically progress has b een achiev ed in individual cases mostly by “intuition,” that is, trial and error. Perhaps the seeds for a more systematic “theory of relev ance” can already b e seen in the statistical theories of mo del selection and of non-parametric densit y estimation. Chapter 5 Statistical Mec hanics Among the v arious theories that make up what we call ph ysics, thermo dynamics holds a very special place b ecause it provided the first example of a fundamen tal theory that could b e interpreted as a pro cedure for pro cessing relev ant infor- mation. Our goal in this chapter is to pro vide a more explicit discussion of statistical mechanics as a theory of inference. W e sho w that sev eral notoriously con trov ersial topics such as the Second La w of thermo dynamics, irreversibil- it y , repro ducibility , and the Gibbs parado x can b e considerably clarified when view ed from the information/inference p ersp ective. Since the success of any problem of inference hinges on identifying the rele- v ant information we start by providing some background on the dynamical ev o- lution of probability distributions – Liouville’s theorem – and then we justify wh y in situations of thermal equilibrium the relev an t constrain t is encapsulated in to the exp ected v alue of the energy (and/or other suc h conserved quantities). 5.1 Liouville’s theorem P erhaps the most r elevant , and therefore, most imp ortant piece of information that has to b e incorp orated into an y inference ab out physical systems is that their time ev olution is constrained b y equations of motion. 
Whether these equa- tions – those of Newton, Maxw ell, Y ang and Mills, or Einstein – can themselv es b e derived as examples of inference are questions whic h will not concern us at this p oint. T o b e sp ecific, in this section we will limit ourselves to discussing classical systems suc h as fluids. In this case there is an additional crucial piece of relev ant information: these systems are composed of molecules. F or simplicit y we will assume that the molecules ha ve no in ternal structure, that they are describ ed b y their p ositions and momenta, and that they b ehav e according to classical mec hanics. The imp ort of these remarks is that the prop er description of the micr ostate of a fluid of N particles in a v olume V is in terms of a “v ector” in the N -particle 89 90 CHAPTER 5. ST A TISTICAL MECHANICS phase space, z = ( ~ x 1 , ~ p 1 , . . . ~ x N , ~ p N ). The time evolution is given by Hamilton’s equations, d~ x i dt = ∂ H ∂ ~ p i and d~ p i dt = − ∂ H ∂ ~ x i , (5.1) where H is the Hamiltonian, H = N P i =1 p 2 i 2 m + U ( ~ x 1 , . . . ~ x N , V ) . (5.2) But the actual p ositions and momenta of the molecules are unkno wn and thus the macr ostate of the fluid is describ ed b y a probability density in phase space, f ( z , t ). When the system ev olves con tinuously according to Hamilton’s equa- tions there is no information loss and the probability flow satisfies a lo cal con- serv ation equation, ∂ ∂ t f ( z , t ) = −∇ z · J ( z , t ) , (5.3) where the probabilit y curren t J is a vector giv en b y J ( z , t ) = f ( z , t ) ˙ z = f ( z , t ) d~ x i dt , f ( z , t ) d~ p i dt . (5.4) Ev aluating the divergence explicitly using (5.1) giv es ∂ f ∂ t = − N P i =1 ∂ ∂ ~ x i · f ( z , t ) d~ x i dt + ∂ ∂ ~ p i · f ( z , t ) d~ p i dt = − N P i =1 ∂ f ∂ ~ x i · ∂ H ∂ ~ p i − ∂ f ∂ ~ p i · ∂ H ∂ ~ x i . (5.5) Th us the time deriv ative of f ( z , t ) at a fixed p oint z is given by the Poisson brac ket with the Hamiltonian H , ∂ f ∂ t = { H , f } def = N P i =1 ∂ H ∂ ~ x i · ∂ f ∂ ~ p i − ∂ H ∂ ~ p i · ∂ f ∂ ~ x i . (5.6) This is called the Liouville equation. Tw o important corollaries are the following. Instead of fo cusing on the c hange in f ( z , t ) at a fixed p oin t z we can study the c hange in f ( z ( t ) , t ) at a p oin t z ( t ) that is b eing carried along by the flow. This defines the so-called “con vectiv e” time deriv ativ e, d dt f ( z ( t ) , t ) = ∂ ∂ t f ( z , t ) + N P i =1 ∂ f ∂ ~ x i · d~ x i dt + ∂ f ∂ ~ p i · d~ p i dt . (5.7) Using Hamilton’s equations shows that the second term is −{ H , f } and cancels the first, therefore d dt f ( z ( t ) , t ) = 0 , (5.8) 5.2. DERIV A TION OF EQUAL A PRIORI PR OBABILITIES 91 whic h means that f is constant along a flow line. Explicitly , f ( z ( t ) , t ) = f ( z ( t 0 ) , t 0 ) . (5.9) Next consider a small v olume element ∆ z ( t ) that is b eing carried along b y the fluid flo w. Since tra jectories cannot cross each other (because Hamilton’s equations are first order in time) they cannot cross the boundary of the ev olving v olume ∆ z ( t ) and therefore the total probabilit y within ∆ z ( t ) is conserv ed, d dt Prob[∆ z ( t )] = d dt [∆ z ( t ) f ( z ( t ) , t )] = 0 . (5.10) But f itself is constant, eq.(5.8), therefore d dt ∆ z ( t ) = 0 , (5.11) whic h means that the shap e of a region of phase space may get deformed by time evolution but its volume remains inv ariant. 
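The volume-preserving character of Hamiltonian flow is easy to visualize numerically. The following Python sketch (a pendulum Hamiltonian and a symplectic leapfrog integrator are illustrative choices, not from the text) evolves the boundary of a small region of phase space and checks that its area, computed with the shoelace formula, stays essentially constant even though the region gets deformed, in line with eq.(5.11). The agreement is only up to the polygon discretization and the accuracy of the integrator.

```python
import math

def force(x):                      # pendulum: H = p^2/2 + (1 - cos x), so dp/dt = -sin x
    return -math.sin(x)

def leapfrog(x, p, dt, steps):     # a symplectic integrator for Hamilton's eqs.(5.1)
    for _ in range(steps):
        p += 0.5 * dt * force(x)
        x += dt * p
        p += 0.5 * dt * force(x)
    return x, p

def area(poly):                    # shoelace formula for a polygon in the (x, p) plane
    n = len(poly)
    return 0.5 * abs(sum(poly[i][0] * poly[(i + 1) % n][1]
                         - poly[(i + 1) % n][0] * poly[i][1] for i in range(n)))

# Boundary of a small phase-space region around (x, p) = (1.0, 0.0).
M = 400
boundary = [(1.0 + 0.1 * math.cos(2 * math.pi * k / M),
             0.1 * math.sin(2 * math.pi * k / M)) for k in range(M)]

print("initial area:", area(boundary))
evolved = [leapfrog(x, p, dt=0.01, steps=2000) for (x, p) in boundary]
print("evolved area:", area(evolved))   # the region is deformed, but its area is preserved
```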
5.2 Deriv ation of Equal a Priori Probabilities Earlier, in section 4.5, w e pointed out that a proper definition of entrop y in a con tinuum, eq.(4.45), requires that one specify a privileged bac kground measure µ ( z ), S [ f , µ ] = − R dz f ( z ) log f ( z ) µ ( z ) , (5.12) where dz = d 3 N xd 3 N p . The c hoice of µ ( z ) is imp ortant: it determines what we mean by a uniform or maximally ignorant distribution. It is customary to set µ ( z ) equal to a constan t whic h we migh t as w ell c ho ose to be one. This amounts to p ostulating that equal volumes of phase space are assigned the same a priori probabilities. Ever since the in tro duction of Boltzmann’s ergo dic h yp othesis there hav e b een many failed attempts to derive it from purely dynamical considerations. In this section we w ant to determine µ ( z ) b y pro ving the following theorem The Equal a Priori Probability Theorem: Sinc e Hamiltonian dynamics in volv es no loss of information, if the entr opy S [ f , µ ] is to b e interpr ete d as the me asur e of amount of information , then µ ( z ) must b e a constan t in phase space. Remark: In chapter 6 the requiremen t that the entrop y S must b e interpreted as a measure of information will b e remov ed and thus the logic of statistical mec hanics as a theory of inference will b e considerably strengthened. Pro of: The main non-dynamical hypothesis is that en tropy measures informa- tion. The information entrop y of the time-evolv ed distribution f ( z , t ) is S ( t ) = − R dz f ( z , t ) log f ( z , t ) µ ( z ) . (5.13) 92 CHAPTER 5. ST A TISTICAL MECHANICS The first input from Hamiltonian dynamics is that information is not lost and therefore we must require that S ( t ) b e constan t, d dt S ( t ) = 0 . (5.14) Therefore, d dt S ( t ) = − R dz ∂ f ( z , t ) ∂ t log f ( z , t ) µ ( z ) + ∂ f ( z , t ) ∂ t . (5.15) The second term v anishes, R dz ∂ f ( z , t ) ∂ t = d dt R dz f ( z , t ) = 0 . (5.16) A second input from Hamiltonian dynamics is that probabilities are not merely conserv ed, they are lo cally conserved, which is expressed by eqs.(5.3) and (5.4). The first term of eq.(5.15) can b e rewritten, d dt S ( t ) = R dz ∇ z · J ( z , t ) log f ( z , t ) µ ( z ) , (5.17) so that in tegrating b y parts (the surface term v anishes) giv es d dt S ( t ) = − R dz f ( z , t ) ˙ z · ∇ z log f ( z , t ) µ ( z ) = R dz [ − ˙ z · ∇ z f ( z , t ) + f ( z , t ) ˙ z · ∇ z log µ ( z )] . (5.18) Hamiltonian dynamics enters here once again: the first term v anishes by Liou- ville’s equation (5.6), − R dz ˙ z · ∇ z f ( z , t ) = R dz { H , f ( z , t ) } = R dz ∂ f ( z , t ) ∂ t = 0 , (5.19) and therefore, imp osing (5.14), d dt S ( t ) = R dz f ( z , t ) ˙ z · ∇ z log µ ( z ) = 0 . (5.20) This integral must v anish for any arbitrary choice of the distribution f ( z , t ), therefore ˙ z · ∇ z log µ ( z ) = 0 . (5.21) F urthermore, we ha ve considerable freedom ab out the particular Hamiltonian op erating on the system. W e could choose to change the volume in any arbi- trarily prescrib ed wa y by pushing on a piston, or we could c ho ose to v ary an external magnetic field. Either wa y we can c hange H ( t ) and therefore ˙ z at will. The time deriv ative dS /dt m ust still v anish irrespective of the particular choice of the v ector ˙ z . W e conclude that ∇ z log µ ( z ) = 0 or µ ( z ) = const . (5.22) 5.3. 
THE RELEV ANT CONSTRAINTS 93 T o summarize: the requiremen t that information is not lost in Hamiltonian dynamics implies that the measure of information must b e a constant of the motion, d dt S ( t ) = 0 , (5.23) and this singles out the Gibbs en tropy , S ( t ) = − R dz f ( z , t ) log f ( z , t ) , (5.24) as the correct information entrop y . It is sometimes asserted that (5.23) implies that the Gibbs entrop y cannot b e identified with the thermo dynamic en tropy b ecause this would b e in con tra- diction to the second law. As w e shall see b elow, this is not true; in fact, it is quite the opp osite. 5.3 The relev an t constraints Thermo dynamics is concerned with situations of thermal equilibrium. What is the relev ant information needed to mak e inferences that apply to these sp ecial cases? The first condition we must imp ose on f ( z , t ) to describ e equilibrium is that it b e independent of time. Th us w e require that { H, f } = 0 and f must b e a function of conserved quantities suc h as energy , momen tum, angular momentum, or num b er of particles. But w e do not w ant f to b e merely stationary , as say , for a rotating fluid, w e wan t it to b e truly static. W e wan t f to b e in v ariant under time rev ersal. F or these problems it turns out that it is not necessary to imp ose that the total momentum and total angular momentum v anish; these constrain ts will turn out to b e satisfied automatically . T o simplify the situation ev en more we will only consider problems where the n umber of particles is held fixed. Pro cesses where particles are exc hanged as in the equilibrium b etw een a liquid and its v ap or, or where particles are created and destroy ed as in c hemical reactions, constitute an imp ortant but straightforw ard extension of the theory . It th us app ears that it is sufficient to imp ose that f b e some function of the energy . According to the formalism dev elop ed in section 4.8 and the remarks in 4.9 this is easily accomplished: the constrain ts co difying the information that could b e relev ant to problems of thermal equilibrium should b e the expe cted v alues of functions φ ( E ) of the energy . F or example, h φ ( E ) i could include v arious momen ts, h E i , h E 2 i ,. . . or perhaps more perhaps complicated functions. The remaining question is whic h functions φ ( E ) and how man y of them. T o answer this question w e lo ok at thermal equilibrium from the p oint of view leading to what is kno wn as the micr o c anonic al formalism . Let us enlarge our description to include the system of in terest A and its environmen t, that is, the thermal bath B with whic h it is in equilibrium. The adv an tage of this broader view is that the composite system C = A + B can b e assumed to be isolated and we know that its ener gy E c is some fixe d c onstant . This is highly relev ant information: when the v alue of E c is known, not only do w e kno w h E c i but we kno w the exp ected v alues h φ ( E c ) i for absolutely all functions φ ( E c ) . In other 94 CHAPTER 5. ST A TISTICAL MECHANICS w ords, in this case w e ha ve succeeded in iden tifying the relev ant information and w e are finally ready to assign probabilities using the MaxEnt metho d. (When the v alue of E c is not kno wn we are in that state of “in termediate” knowledge describ ed in section 4.9.) T o simplify the notation it is conv enient to divide phase space into discrete cells of equal volume. F or system A let the (discretized) microstate z a ha ve energy E a . 
For the thermal bath B a much less detailed description is sufficient. Let the number of bath microstates with energy E_b be Ω_B(E_b). Our relevant information includes the fact that A and B interact very weakly, just barely enough to attain equilibrium, and thus the known total energy E_c constrains the allowed microstates of A + B to the subset that satisfies

E_a + E_b = E_c.    (5.25)

The total number of such microstates is

\Omega(E_c) = \sum_a \Omega_B(E_c - E_a).    (5.26)

We are in a situation where we know absolutely nothing beyond the fact that the composite system C can be in any one of its Ω(E_c) allowed microstates. This is precisely the problem tackled in section 4.7: the maximum entropy distribution is uniform, eq.(4.62), and the probability of any microstate of C is 1/Ω(E_c). More importantly, the probability that system A is in the particular microstate a when it is in thermal equilibrium with the bath B is

p_a = \frac{\Omega_B(E_c - E_a)}{\Omega(E_c)}.    (5.27)

This is the result we sought; now we need to interpret it. It is convenient to rewrite p_a in terms of the entropy of the bath, S_B = k log Ω_B,

p_a \propto \exp\left[\frac{1}{k}S_B(E_c - E_a)\right].    (5.28)

There is one final piece of relevant information we can use: the thermal bath B is much larger than system A, E_c ≫ E_a, and we can Taylor expand

S_B(E_c - E_a) = S_B(E_c) - \frac{E_a}{T} + \ldots,    (5.29)

where the temperature T of the bath has been introduced according to the standard thermodynamic definition,

\left.\frac{\partial S_B}{\partial E_b}\right|_{E_c} \overset{\text{def}}{=} \frac{1}{T}.    (5.30)

The term S_B(E_c) is a constant independent of the label a which can be absorbed into the normalization. We conclude that the distribution that codifies the relevant information about equilibrium is

p_a \propto \exp\left(-\frac{E_a}{kT}\right),    (5.31)

which we recognize as having the canonical form of eq.(4.66).

Our goal in this section was to identify the relevant variables. Here is the answer: the relevant information about thermal equilibrium can be summarized by the expected value of the energy ⟨E⟩, because someone who just knows ⟨E⟩ and is maximally ignorant about everything else is led to assign probabilities according to eq.(4.66), which coincides with (5.31).

But our analysis has also disclosed an important limitation. Eq.(5.27) shows that in general the distribution for a system in equilibrium with a bath depends in a complicated way on the properties of the bath. The information in ⟨E⟩ is adequate only when the system and the bath interact weakly and the bath is so much larger than the system that its effects can be represented by a single parameter, the temperature T. Conversely, if these conditions are not met, then more information is needed. For example, the system might be sufficiently isolated that within the time scales of interest it can only reach thermal equilibrium with the few degrees of freedom in its very immediate vicinity. Then the surrounding bath need not be large and the information contained in the expected value ⟨E⟩, while still useful and relevant, might just not be sufficient; more will be needed.

Remark: The notion of relevance is relative. A particular piece of information might be relevant to one specific question and irrelevant to another. In the discussion above the system is in equilibrium, but we have not been sufficiently explicit about what specific questions one wants to address. It is implicit in this whole approach that one refers to the typical questions addressed in thermodynamics.
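The passage from the microcanonical result (5.27) to the canonical form (5.31) can be checked numerically. The sketch below is not from the text; it assumes a toy model in which the bath is an Einstein solid of M oscillators, the system is a single oscillator, energies are counted in integer quanta, and the values of M and Q are arbitrary.

```python
from math import lgamma, exp

# Minimal numerical sketch of eqs.(5.27)-(5.31) for an assumed toy model (not
# from the text): bath B = Einstein solid of M oscillators, system A = a single
# oscillator, total energy Q quanta shared between A and B.
M, Q = 2000, 5000                      # bath size and total quanta (assumptions)

def log_omega_B(q):
    # log of the number of ways to distribute q quanta over M oscillators,
    # Omega_B(q) = C(q + M - 1, M - 1)
    return lgamma(q + M) - lgamma(q + 1) - lgamma(M)

# exact distribution of eq.(5.27): p(n) proportional to Omega_B(Q - n)
ns = range(0, 60)
w = [exp(log_omega_B(Q - n) - log_omega_B(Q)) for n in ns]
p_exact = [wi / sum(w) for wi in w]

# canonical prediction of eq.(5.31), with beta from 1/T = dS_B/dE, eq.(5.30)
beta = log_omega_B(Q) - log_omega_B(Q - 1)
p_canon = [exp(-beta * n) * (1 - exp(-beta)) for n in ns]

for n in range(8):
    print(n, round(p_exact[n], 5), round(p_canon[n], 5))   # nearly equal
```

Making the bath larger improves the agreement, since the terms neglected in the expansion (5.29) become smaller.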
5.4 The canonical formalism

We consider a system (say, a fluid) in thermal equilibrium. The energy of the (conveniently discretized) microstate z_a is E_a = E_a(V) where V is the volume of the system. We assume further that the expected value of the energy is known, ⟨E⟩ = Ē. Maximizing the (discretized) Gibbs entropy,

S[p] = -k\sum_a p_a\log p_a \quad\text{where}\quad p_a = f(z_a)\,\Delta z,    (5.32)

subject to constraints on normalization ⟨1⟩ = 1 and energy ⟨E⟩ = Ē yields, eq.(4.66),

p_a = \frac{1}{Z}e^{-\beta E_a},    (5.33)

where the Lagrange multiplier β is determined from

-\frac{\partial\log Z}{\partial\beta} = \bar E \quad\text{and}\quad Z(\beta,V) = \sum_a e^{-\beta E_a}.    (5.34)

The maximized value of the Gibbs entropy is, eq.(4.69),

S(\bar E, V) = k\log Z + k\beta\bar E.    (5.35)

Differentiating with respect to Ē we obtain the analogue of eq.(4.77),

\left(\frac{\partial S}{\partial\bar E}\right)_V = k\frac{\partial\log Z}{\partial\beta}\frac{\partial\beta}{\partial\bar E} + k\frac{\partial\beta}{\partial\bar E}\bar E + k\beta = k\beta,    (5.36)

where eq.(5.34) has been used to cancel the first two terms. In thermodynamics temperature is defined by

\left(\frac{\partial S}{\partial\bar E}\right)_V \overset{\text{def}}{=} \frac{1}{T},    (5.37)

therefore

\beta = \frac{1}{kT}.    (5.38)

The connection between the formalism above and thermodynamics hinges on a suitable identification of work and heat. A small change in the internal energy δĒ can be induced by small changes in T and V,

\delta\bar E = \sum_a p_a\,\delta E_a + \sum_a E_a\,\delta p_a.    (5.39)

Since E_a = E_a(V) the first term ⟨δE⟩ is an energy change that can be induced by small changes in volume,

\langle\delta E\rangle = \sum_a p_a\frac{\partial E_a}{\partial V}\delta V = \left\langle\frac{\partial E}{\partial V}\right\rangle\delta V;    (5.40)

this suggests that we identify it with the mechanical work,

\langle\delta E\rangle = \delta W = -P\,\delta V,    (5.41)

and therefore the pressure is given by

P = -\left\langle\frac{\partial E}{\partial V}\right\rangle.    (5.42)

This is the microscopic definition of pressure. The second term in eq.(5.39) must therefore represent heat,

\delta Q = \delta\bar E - \delta W = \delta\langle E\rangle - \langle\delta E\rangle.    (5.43)

The corresponding change in entropy is obtained from eq.(5.35),

\frac{\delta S}{k} = \delta\log Z + \delta(\beta\bar E) = -\frac{1}{Z}\sum_a e^{-\beta E_a}\left(E_a\,\delta\beta + \beta\,\delta E_a\right) + \bar E\,\delta\beta + \beta\,\delta\bar E = \beta\left(\delta\bar E - \langle\delta E\rangle\right),    (5.44)

therefore

\delta S = \frac{\delta Q}{T}.    (5.45)

This result is important. It proves that

The maximized Gibbs entropy, S(Ē, V), is identical to the thermodynamic entropy originally defined by Clausius.

Substituting into eq.(5.43) yields the fundamental thermodynamic identity,

\delta\bar E = T\,\delta S - P\,\delta V.    (5.46)

Incidentally, it shows that the "natural" variables for energy are S and V, that is, Ē = Ē(S, V). Similarly, writing

\delta S = \frac{1}{T}\delta\bar E + \frac{P}{T}\delta V    (5.47)

confirms that S = S(Ē, V). The free energy F is defined by

Z = e^{-\beta F} \quad\text{or}\quad F = -kT\log Z(T, V).    (5.48)

Eq.(5.35) then leads to

F = \bar E - TS,    (5.49)

so that

\delta F = -S\,\delta T - P\,\delta V,    (5.50)

which shows that F = F(T, V). Several useful thermodynamic relations can be easily obtained from eqs.(5.46), (5.47), and (5.50). For example, the identities

\left(\frac{\partial F}{\partial T}\right)_V = -S \quad\text{and}\quad \left(\frac{\partial F}{\partial V}\right)_T = -P    (5.51)

can be read directly from eq.(5.50).
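The chain from (5.33) to (5.38) can be illustrated in a few lines of code. The sketch below is not from the text; the five-level spectrum and the value of β are arbitrary assumptions, and k is set to one.

```python
import numpy as np

# Minimal sketch of eqs.(5.33)-(5.38) for an assumed discrete spectrum (not from
# the text): compute Z, the mean energy E_bar = -d(log Z)/d(beta), the maximized
# Gibbs entropy S = log Z + beta*E_bar (k = 1), and check numerically that
# dS/dE_bar = beta, i.e. eq.(5.36).
E = np.array([0.0, 1.0, 1.5, 2.7, 4.0])      # assumed energy levels

def canonical(beta):
    w = np.exp(-beta * E)
    Z = w.sum()
    p = w / Z
    E_bar = (p * E).sum()
    S = np.log(Z) + beta * E_bar
    return E_bar, S

beta, dbeta = 0.8, 1e-5
E1, S1 = canonical(beta - dbeta)
E2, S2 = canonical(beta + dbeta)
print((S2 - S1) / (E2 - E1), beta)   # the two numbers agree: dS/dE_bar = beta
```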
5.5 The Second Law of Thermodynamics

We saw that in 1865 Clausius summarized the two laws of thermodynamics into "The energy of the universe is constant. The entropy of the universe tends to a maximum." We can be a bit more explicit about the Second Law: "In an adiabatic non-quasi-static process that starts and ends in equilibrium the total entropy increases; if the process is adiabatic and quasi-static the total entropy remains constant." The Second Law was formulated in a somewhat stronger form by Gibbs (1878): "For irreversible processes not only does the entropy tend to increase, but it does increase to the maximum value allowed by the constraints imposed on the system."

We are now ready to prove the Second Law. The proof below, proposed by E. T. Jaynes in 1965, is mathematically very simple, but it is also conceptually subtle [Jaynes 65]. It may be useful to recall some of our previous results. The entropy mentioned in the Second Law is the "thermodynamic" entropy S_T. It is defined only for equilibrium states by the Clausius relation,

S_T(B) - S_T(A) = \int_A^B \frac{dQ}{T},    (5.52)

where the integral is along a reversible path of intermediate equilibrium states. But as we saw in the previous section, in thermal equilibrium the maximized Gibbs entropy S_G^{can} – that is, the entropy computed from the canonical distribution – satisfies the same relation, eq.(5.45),

\delta S_G^{\rm can} = \frac{\delta Q}{T} \;\Rightarrow\; S_G^{\rm can}(B) - S_G^{\rm can}(A) = \int_A^B \frac{dQ}{T}.    (5.53)

If the arbitrary additive constant is adjusted so that S_G^{can} matches S_T for one equilibrium state they will be equal for all equilibrium states. Therefore, if at any time t the system is in thermal equilibrium and its relevant macrovariables agree with expected values, say X_t, calculated using the canonical distribution, then

S_T(t) = S_G^{\rm can}(t).    (5.54)

The system, which is assumed to be thermally insulated from its environment, is allowed (or forced) to evolve according to a certain Hamiltonian. The evolution could, for example, be the free expansion of a gas into vacuum, or it could be given by the time-dependent Hamiltonian that describes some externally prescribed influence, say, a moving piston or an imposed field. Eventually a new equilibrium is reached at some later time t'. Such a process is adiabatic; no heat was exchanged with the environment. Under these circumstances the initial canonical distribution f^{can}(t), e.g. eq.(4.66) or (5.33), evolves according to Liouville's equation, eq.(5.6),

f^{\rm can}(t)\ \xrightarrow{\,H(t)\,}\ f(t'),    (5.55)

and, according to eq.(5.23), the corresponding Gibbs entropy remains constant,

S_G^{\rm can}(t) = S_G(t').    (5.56)

Since the Gibbs entropy remains constant it is sometimes argued that this contradicts the Second Law, but note that the time-evolved S_G(t') is not the thermodynamic entropy because f(t') is not necessarily of the canonical form, eq.(4.66).

From the new distribution f(t') we can, however, compute the expected values X_{t'} that apply to the state of equilibrium at t'. Of all distributions agreeing with the new values X_{t'} the canonical distribution f^{can}(t') is the one with maximum Gibbs entropy, S_G^{can}(t'). Therefore

S_G(t') \le S_G^{\rm can}(t').    (5.57)

But S_G^{can}(t') coincides with the thermodynamic entropy of the new equilibrium state,

S_G^{\rm can}(t') = S_T(t').    (5.58)

Collecting all these results, eqs.(5.54)-(5.58), we conclude that the thermodynamic entropy has increased,

S_T(t) \le S_T(t'),    (5.59)

which is the Second Law. The equality applies when the time evolution is quasistatic, so that throughout the process the distribution is always canonical; in particular, f(t') = f^{can}(t').
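The inequality of step (5.57), that among all distributions with the same expected energy the canonical one has the largest Gibbs entropy, is simple to verify numerically. The sketch below is not from the text; the spectrum and the particular non-canonical distribution (standing in for the evolved f(t')) are arbitrary assumptions, and scipy is used only to solve for the matching β.

```python
import numpy as np
from scipy.optimize import brentq

# Minimal sketch of the key inequality (5.57) for an assumed toy spectrum (not
# from the text): any distribution f with the same <E> as the canonical one has
# a Gibbs entropy that is less than or equal to the canonical entropy.
E = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

# an arbitrary non-canonical distribution (stands in for the evolved f(t'))
f = np.array([0.30, 0.05, 0.40, 0.05, 0.20])
E_bar = (f * E).sum()

# canonical distribution with the same expected energy
def mean_energy(beta):
    p = np.exp(-beta * E); p /= p.sum()
    return (p * E).sum()

beta = brentq(lambda b: mean_energy(b) - E_bar, -50, 50)
p_can = np.exp(-beta * E); p_can /= p_can.sum()

print(entropy(f), "<=", entropy(p_can))   # S_G(t') <= S_G^can(t')
```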
The argument above can be generalized considerably by allowing heat exchanges or by introducing uncertainties into the actual Hamiltonian dynamics. To summarize, the chain of steps is

S_T(t) \overset{(1)}{=} S_G^{\rm can}(t) \overset{(2)}{=} S_G(t') \overset{(3)}{\le} S_G^{\rm can}(t') \overset{(4)}{=} S_T(t').    (5.60)

Steps (1) and (4) hinge on identifying the maximized Gibbs entropy with the thermodynamic entropy – which works provided we have correctly identified the relevant macrovariables for the particular problem at hand. Step (2) follows from the constancy of the Gibbs entropy under Hamiltonian evolution – this is the least controversial step. Of course, if we did not have complete knowledge about the exact Hamiltonian H(t) acting on the system an inequality would have been introduced already at this point. The crucial inequality, however, is introduced in step (3) where information is discarded. The distribution f(t') contains information about the macrovariables X_{t'} at time t', and, since the Hamiltonian is known, it also contains information about the values X_t the macrovariables had at the initial time t. In contrast, a description in terms of the distribution f^{can}(t') contains information about the macrovariables X_{t'} at time t' and nothing else. In a thermodynamic description all memory of the history of the system is lost. The Second Law refers to thermodynamic entropies only. These entropies measure the amount of information available to someone with only macroscopic means to observe and manipulate the system. The irreversibility implicit in the Second Law arises from this restriction to thermodynamic descriptions.

It is important to emphasize what has just been proved: in an adiabatic process from one state of equilibrium to another the thermodynamic entropy increases. This is the Second Law. Many questions remain unanswered: We have assumed that the system tends towards and finally reaches an equilibrium; how do we know that this happens? What are the relaxation times, transport coefficients, etc.? There are all sorts of aspects of non-equilibrium irreversible processes that remain to be explained, but this does not detract from what Jaynes' explanation did in fact accomplish, namely, it explained the Second Law, no more and, most emphatically, no less.

5.6 The thermodynamic limit

If the Second Law "has only statistical certainty" (Maxwell, 1871) and any violation "seems to be reduced to improbability" (Gibbs, 1878), how can thermodynamic predictions attain so much certainty? Part of the answer hinges on restricting the kind of questions we are willing to ask to those concerning the few macroscopic variables over which we have some control. Most other questions are not "interesting" and thus they are never asked. For example, suppose we are given a gas in equilibrium within a cubic box, and the question is where particle #23 will be found. The answer is that we expect the particle to be at the center of the box but with a very large standard deviation – the particle can be anywhere in the box. The answer is not particularly impressive.
On the other hand, if we ask for the energy of the gas at temperature T, or how it changes as the volume is changed by δV, then the answers are truly impressive.

Consider a system in thermal equilibrium in a macrostate described by a canonical distribution f(z) assigned on the basis of constraints on the values of certain macrovariables X. For simplicity we will assume X is a single variable, the energy, X = ⟨E⟩ = Ē. The microstates z can be divided into typical and atypical microstates. The typical microstates are all contained within a "high probability" region R_ε, to be defined below, that has total probability 1 - ε, where ε is a small positive number, and within which f(z) is greater than some lower bound. The "phase" volume of the typical region is

{\rm Vol}(R_\varepsilon) = \int_{R_\varepsilon} dz = W_\varepsilon.    (5.61)

Our goal is to establish that the thermodynamic entropy and the volume of the region R_ε are related through Boltzmann's equation,

S_T \approx k\log W_\varepsilon.    (5.62)

The surprising feature is that the result is essentially independent of ε. The following theorems, which are adaptations of the Asymptotic Equipartition Property (section 4.6), state this result in a mathematically precise way.

Theorem: Let f(z) be the canonical distribution and kS = S_G = S_T the corresponding entropy,

f(z) = \frac{e^{-\beta E(z)}}{Z} \quad\text{and}\quad S = \beta\bar E + \log Z.    (5.63)

Then as N → ∞,

-\frac{1}{N}\log f(z) \longrightarrow \frac{S}{N} \quad\text{in probability},    (5.64)

provided that the system is such that the energy fluctuations increase slower than N, that is, lim_{N→∞} ΔE/N = 0. (Δ denotes the standard deviation.)

The theorem roughly means that

The accessible microstates are essentially equally likely.

Microstates z for which (-log f(z))/N differs substantially from S/N either have too low a probability and are deemed "inaccessible," or they might individually have a high probability but are too few to contribute significantly.

Remark: The word 'essentially' is tricky because f(z) may differ from e^{-S} by a huge factor, but log f(z) differs from -S by an unimportant amount that grows less rapidly than N.

Remark: Note that the theorem applies only to those systems with interparticle interactions such that the energy fluctuations are sufficiently well behaved. Typically this requires that as N and V tend to infinity with N/V constant, the spatial correlations fall sufficiently fast that distant particles are uncorrelated. Under these circumstances energy and entropy are extensive quantities.

Proof: Apply the Tchebyshev inequality (see section 2.8),

P(|x - \langle x\rangle| \ge \varepsilon) \le \left(\frac{\Delta x}{\varepsilon}\right)^2,    (5.65)

to the variable

x = -\frac{1}{N}\log f(z).    (5.66)

The mean is the entropy per particle,

\langle x\rangle = -\frac{1}{N}\langle\log f\rangle = \frac{S}{N} = \frac{1}{N}\left(\beta\bar E + \log Z\right).    (5.67)

To calculate the variance,

(\Delta x)^2 = \frac{1}{N^2}\left[\langle(\log f)^2\rangle - \langle\log f\rangle^2\right],    (5.68)

use

\langle(\log f)^2\rangle = \langle(\beta E + \log Z)^2\rangle = \beta^2\langle E^2\rangle + 2\beta\langle E\rangle\log Z + (\log Z)^2,    (5.69)

so that

(\Delta x)^2 = \frac{\beta^2}{N^2}\left(\langle E^2\rangle - \langle E\rangle^2\right) = \left(\frac{\beta\,\Delta E}{N}\right)^2.    (5.70)

Collecting these results gives

{\rm Prob}\left[\,\left|-\frac{1}{N}\log f(z) - \frac{S}{N}\right| \ge \varepsilon\right] \le \left(\frac{\beta\,\Delta E}{N\varepsilon}\right)^2.    (5.71)

For systems such that the relative energy fluctuations ΔE/Ē tend to 0 as N^{-1/2} when N → ∞, and the energy is an extensive quantity, Ē ∝ N, the limit on the right is zero, ΔE/N → 0, therefore

\lim_{N\to\infty}{\rm Prob}\left[\,\left|-\frac{1}{N}\log f(z) - \frac{S}{N}\right| \ge \varepsilon\right] = 0,    (5.72)

which concludes the proof.
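The content of (5.64) and (5.72) can be seen directly in a toy model. The sketch below is not from the text; it assumes N independent two-level units with level spacing one at β = 1, samples microstates from the canonical distribution, and shows that the spread of −(1/N) log f(z) about S/N shrinks like N^{-1/2}.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal sketch of the concentration result (5.64)/(5.72) for an assumed toy
# model (not from the text): N independent two-level units with energies 0 and 1
# at beta = 1.  For each sampled microstate z we compute x = -(1/N) log f(z)
# and compare its spread with the entropy per particle S/N (k = 1).
beta = 1.0
p1 = np.exp(-beta) / (1 + np.exp(-beta))       # probability of the excited level
logZ1 = np.log(1 + np.exp(-beta))              # one-particle log partition function
S_per_N = beta * p1 + logZ1                    # entropy per particle

for N in [100, 1000, 10000]:
    n_exc = rng.binomial(N, p1, size=2000)     # excited units in each sampled microstate
    x = (beta * n_exc + N * logZ1) / N         # -(1/N) log f(z) for each microstate
    print(N, S_per_N, x.mean(), x.std())       # the spread shrinks like 1/sqrt(N)
```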
The following theorem elaborates on these ideas further. To be precise let us define the typical region R_ε as the set of microstates with probability f(z) such that

e^{-S - N\varepsilon} \le f(z) \le e^{-S + N\varepsilon},    (5.73)

or, using eq.(5.63),

\frac{1}{Z}e^{-\beta\bar E - N\varepsilon} \le f(z) \le \frac{1}{Z}e^{-\beta\bar E + N\varepsilon}.    (5.74)

This last expression shows that typical microstates are those for which the energy per particle E(z)/N lies within a narrow interval 2εkT about the expected value Ē/N.

Remark: Even though some states z (namely those with energy E(z) < Ē) can individually be more probable than the typical states, it turns out (see below) that they are too few and their volume is negligible compared to W_ε.

Theorem of typical microstates: For N sufficiently large,
(1) Prob[R_ε] > 1 - ε;
(2) Vol(R_ε) = W_ε ≤ e^{S + Nε};
(3) W_ε ≥ (1 - ε) e^{S - Nε};
(4) lim_{N→∞} (log W_ε - S)/N = 0.

In words: The typical region has probability close to one; typical microstates are almost equally probable; the phase volume they occupy is about e^{S_T/k}, that is, S_T = k log W_ε. The Gibbs entropy is a measure of the logarithm of the phase volume of typical states and for large N it does not much matter what we mean by typical (i.e., what we choose for ε). Incidentally, note that it is the Gibbs entropy that satisfies the Boltzmann formula S_G = k log W.

Proof: Eq.(5.72) states that for fixed ε, for any given δ there is an N_δ such that for all N > N_δ we have

{\rm Prob}\left[\,\left|-\frac{1}{N}\log f(z) - \frac{S}{N}\right| \le \varepsilon\right] \ge 1 - \delta.    (5.75)

Thus, the probability that the microstate z is ε-typical tends to one, and therefore so must Prob[R_ε]. Setting δ = ε yields part (1). This also shows that the total probability of the set of states with E(z) < Ē is negligible – they must occupy a negligible volume. To prove (2) write

1 \ge {\rm Prob}[R_\varepsilon] = \int_{R_\varepsilon} dz\, f(z) \ge e^{-S - N\varepsilon}\int_{R_\varepsilon} dz = e^{-S - N\varepsilon}\,W_\varepsilon.    (5.76)

Similarly, to prove (3) use (1),

1 - \varepsilon < {\rm Prob}[R_\varepsilon] = \int_{R_\varepsilon} dz\, f(z) \le e^{-S + N\varepsilon}\int_{R_\varepsilon} dz = e^{-S + N\varepsilon}\,W_\varepsilon.    (5.77)

Finally, from (2) and (3),

(1 - \varepsilon)\,e^{S - N\varepsilon} \le W_\varepsilon \le e^{S + N\varepsilon},    (5.78)

which is the same as

\frac{S}{N} - \varepsilon + \frac{\log(1-\varepsilon)}{N} \le \frac{\log W_\varepsilon}{N} \le \frac{S}{N} + \varepsilon,    (5.79)

and proves (4).

Remark: The theorems above can be generalized to situations involving several macrovariables X^k in addition to the energy. In this case, the expected value of log f(z) is

\langle -\log f\rangle = S = \lambda_k\langle X^k\rangle + \log Z,    (5.80)

and its variance is

(\Delta\log f)^2 = \lambda_k\lambda_m\left(\langle X^k X^m\rangle - \langle X^k\rangle\langle X^m\rangle\right).    (5.81)

5.7 Interpretation of the Second Law: Reproducibility

We saw that the Gibbs entropy is a measure of the logarithm of the phase volume of typical states. In the proof of the Second Law (section 5.5) we started with a system at time t in a state of thermal equilibrium defined by the macrovariables X_t. We saw (section 5.6) that within the typical region R(t) fluctuations of the X_t are negligible: all microstates are characterized by the same values of X. Furthermore, the typical region R(t) includes essentially all possible initial states compatible with the initial X_t. The volume W(t) = e^{S_T(t)/k} of the typical region can be interpreted in two ways. On one hand it is a measure of our ignorance as to the true microstate when all we know are the macrovariables X_t.
On the other hand, the volume W(t) is also a measure of the extent to which we can control the actual microstate of the system when the X_t are the only parameters we can manipulate.

Having been prepared in equilibrium at time t, the system is then subjected to an adiabatic process and it eventually attains a new equilibrium at time t'. The Hamiltonian evolution deforms the initial region R(t) into a new region R(t') with exactly the same volume W(t) = W(t'); the macrovariables evolve from their initial values X_t to new values X_{t'}. Now suppose we adopt a thermodynamic description for the new equilibrium; the preparation history is forgotten, and all we know are the new values X_{t'}. The new typical region R'(t') has a volume W'(t') and it includes all microstates compatible with the information X_{t'}.

After these preliminaries we come to the crux of the argument: With the limited experimental means at our disposal we can guarantee that the initial microstate will be somewhere within W(t) and therefore that in due course of time it will be within W(t'). In order for the process X_t → X_{t'} to be experimentally reproducible it must be that all microstates in W(t') will also be within W'(t'), which means that W(t) = W(t') ≤ W'(t'). Conversely, if it were true that W(t) > W'(t') we would sometimes observe that an initial microstate within W(t) would evolve into a final microstate lying outside W'(t'), that is, sometimes we would observe X_t ↛ X_{t'}. Thus, when W(t) > W'(t') the experiment is not reproducible.

A new element has been introduced into the discussion of the Second Law: reproducibility. [Jaynes 65] Thus, we can express the Second Law in the somewhat tautological form:

In a reproducible adiabatic process the thermodynamic entropy cannot decrease.

We can address this question from a different angle: How do we know that the chosen constraints X are the relevant macrovariables that provide an adequate thermodynamic description? In fact, what do we mean by an adequate description? Let us rephrase these questions differently: Could there exist additional unknown physical constraints Y that significantly restrict the microstates compatible with the initial macrostate and which therefore provide an even better description? The answer is that such variables can, of course, exist but that including them in the description does not necessarily lead to an improvement. If the process X_t → X_{t'} is reproducible when no particular care has been taken to control the values of Y we can expect that, to the extent that we are only interested in the X's, the Y's are irrelevant; keeping track of them will not yield a better description.

Reproducibility is the criterion whereby we can decide whether a particular thermodynamic description is adequate or not.

5.8 Remarks on irreversibility

A considerable source of confusion on the question of reversibility is that the same word 'reversible' is used with several different meanings [Uffink 01]:

(a) Mechanical or microscopic reversibility refers to the possibility of reversing the velocities of every particle. Such reversals would allow the system not just to retrace its steps from the final macrostate to the initial macrostate but would also allow it to retrace its detailed microstate trajectory as well.
(b) Carnot or macroscopic reversibility refers to the possibility of retracing the history of macrostates of a system in the opposite direction. The required amount of control over the system can be achieved by forcing the system along a prescribed path of intermediate macroscopic equilibrium states that are infinitesimally close to each other. Such a reversible process is normally and appropriately called quasi-static. There is no implication that the trajectories of the individual particles will be retraced.

(c) Thermodynamic reversibility refers to the possibility of starting from a final macrostate and completely recovering the initial macrostate without any other external changes. There is no need to retrace the intermediate macrostates in reverse order. In fact, rather than 'reversibility' it may be more descriptive to refer to 'recoverability'. Typically a state is irrecoverable when there is friction, decay, or corruption of some kind.

Notice that when one talks about the "irreversibility" of the Second Law and about the "reversibility" of mechanics there is no inconsistency or contradiction: the word 'reversibility' is being used with two entirely different meanings.

Classical thermodynamics assumes that isolated systems approach and eventually attain a state of equilibrium. The state of equilibrium is, by definition, a state that, once attained, will not spontaneously change in the future. On the other hand, it is understood that changes might have happened in the past. Classical thermodynamics introduces a time asymmetry: it treats the past and the future differently.

The situation with statistical mechanics is, however, somewhat different. Once equilibrium has been attained fluctuations are possible. In fact, if we wait long enough large fluctuations can be expected to happen in the future, just as they might have happened in the past. The situation is quite symmetric. The interesting asymmetry arises when we realize that for a large fluctuation to happen spontaneously in the future might require an extremely long time, while we just happen to know that a similarly large "fluctuation" was observed in the very recent past. This might seem strange because the formalism of statistical mechanics does not introduce any time asymmetry. The solution to the puzzle is that the large "fluctuation" in the recent past most likely did not happen spontaneously but was quite deliberately brought about by human (or otherwise) intervention. The system was prepared in some unusual state by applying appropriate constraints which were subsequently removed – we do this all the time.

5.9 Entropies, descriptions and the Gibbs paradox

Under the generic title of "Gibbs Paradox" one usually considers a number of related questions in both phenomenological thermodynamics and in statistical mechanics: (1) The entropy change when two distinct gases are mixed happens to be independent of the nature of the gases. Is this in conflict with the idea that in the limit as the two gases become identical the entropy change should vanish? (2) Should the thermodynamic entropy of Clausius be an extensive quantity or not? (3) Should two microstates that differ only in the exchange of identical particles be counted as two or just one microstate?
The conventional wisdom asserts that the resolution of the paradox rests on quantum mechanics but this analysis is unsatisfactory; at best it is incomplete. While it is true that the exchange of identical quantum particles does not lead to a new microstate, this approach ignores the case of classical, and even non-identical, particles. For example, nanoparticles in a colloidal suspension or macromolecules in solution are both classical and non-identical. Several authors (e.g., [Grad 61, Jaynes 92]) have recognized that quantum theory has no bearing on the matter; indeed, as remarked in section 3.5, this was already clear to Gibbs.

Our purpose here is to discuss the Gibbs paradox from the point of view of information theory. The discussion follows [Tseng Caticha 01]. Our conclusion will be that the paradox is resolved once it is realized that there is no such thing as the entropy of a system, that there are many entropies. The choice of entropy is a choice between a description that treats particles as being distinguishable and a description that treats them as indistinguishable; which of these alternatives is more convenient depends on the resolution of the particular experiment being performed.

The "grouping" property of entropy, eq.(4.3),

S[p] = S_G[P] + \sum_g P_g\, S_g[p_{\cdot|g}],

plays an important role in our discussion. It establishes a relation between two different descriptions and refers to three different entropies. One can describe the system with high resolution as being in a microstate i (with probability p_i), or alternatively, with lower resolution as being in one of the groups g (with probability P_g). Since the description in terms of the groups g is less detailed we might refer to them as 'mesostates'.

A thermodynamic description, on the other hand, corresponds to an even lower resolution that merely specifies the equilibrium macrostate. For simplicity, we will define the macrostate with a single variable, the energy. Including additional variables is easy and does not modify the gist of the argument. The standard connection between the thermodynamic description in terms of macrostates and the description in terms of microstates is established in section 5.4. If the energy of microstate a is E_a, to the macrostate of energy Ē = ⟨E⟩ we associate the canonical distribution (5.33),

p_a = \frac{e^{-\beta E_a}}{Z_H},    (5.82)

where the partition function Z_H and the Lagrange multiplier β are determined from eqs.(5.34),

Z_H = \sum_i e^{-\beta E_i} \quad\text{and}\quad \frac{\partial\log Z_H}{\partial\beta} = -\bar E.    (5.83)

The corresponding entropy, eq.(5.35), is (setting k = 1)

S_H = \beta\bar E + \log Z_H;    (5.84)

it measures the amount of information required to specify the microstate when all we know is the value Ē.

Identical particles

Before we compute and interpret the probability distribution over mesostates and its corresponding entropy we must be more specific about which mesostates we are talking about. Consider a system of N classical particles that are exactly identical. The interesting question is whether these identical particles are also "distinguishable." By this we mean the following: we look at two particles now and we label them. We look at the particles later. Somebody might have switched them. Can we tell which particle is which? The answer is: it depends.
Whether we can distinguish identical particles or not depends on whether we were able and willing to follow their trajectories.

A slightly different version of the same question concerns an N-particle system in a certain state. Some particles are permuted. Does this give us a different state? As discussed earlier the answer to this question requires a careful specification of what we mean by a state. Since by a microstate we mean a point in the N-particle phase space, a permutation does indeed lead to a new microstate. On the other hand, our concern with permutations suggests that it is useful to introduce the notion of a mesostate defined as the group of those N! microstates that are obtained as permutations of each other. With this definition it is clear that a permutation of the identical particles does not lead to a new mesostate.

Now we can return to discussing the connection between the thermodynamic macrostate description and the description in terms of mesostates using, as before, the Method of Maximum Entropy. Since the particles are (sufficiently) identical, all those N! microstates i within the same mesostate g have the same energy, which we will denote by E_g (i.e., E_i = E_g for all i ∈ g). To the macrostate of energy Ē = ⟨E⟩ we associate the canonical distribution,

P_g = \frac{e^{-\beta E_g}}{Z_L},    (5.85)

where

Z_L = \sum_g e^{-\beta E_g} \quad\text{and}\quad \frac{\partial\log Z_L}{\partial\beta} = -\bar E.    (5.86)

The corresponding entropy, eq.(5.35), is (setting k = 1)

S_L = \beta\bar E + \log Z_L;    (5.87)

it measures the amount of information required to specify the mesostate when all we know is Ē.

Two different entropies S_H and S_L have been assigned to the same macrostate Ē; they measure the different amounts of additional information required to specify the state of the system to a high resolution (the microstate) or to a low resolution (the mesostate). The relation between Z_H and Z_L is obtained from

Z_H = \sum_i e^{-\beta E_i} = N!\sum_g e^{-\beta E_g} = N!\,Z_L \quad\text{or}\quad Z_L = \frac{Z_H}{N!}.    (5.88)

The relation between S_H and S_L is obtained from the "grouping" property, eq.(4.3), with S = S_H and S_G = S_L, and p_{i|g} = 1/N!. The result is

S_L = S_H - \log N!.    (5.89)

Incidentally, note that

S_H = -\sum_a p_a\log p_a = -\sum_g P_g\log\frac{P_g}{N!}.    (5.90)

Equations (5.88) and (5.89) both exhibit the Gibbs N! "corrections." Our analysis shows (1) that the justification of the N! factor is not to be found in quantum mechanics, and (2) that the N! does not correct anything. The N! is not a fudge factor that fixes a wrong (possibly nonextensive) entropy S_H into a correct (possibly extensive) entropy S_L. Both entropies S_H and S_L are correct. They differ because they measure different things: one measures the information needed to specify the microstate, the other the information needed to specify the mesostate.

An important goal of statistical mechanics is to provide a justification, an explanation of thermodynamics. Thus, we still need to ask which of the two statistical entropies, S_H or S_L, should be identified with the thermodynamic entropy of Clausius, S_T. Inspection of eqs.(5.88) and (5.89) shows that, as long as one is not concerned with experiments that involve changes in the number of particles, the same thermodynamics will follow whether we set S_H = S_T or S_L = S_T.
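The extensivity issue that usually motivates the N! "correction" can be made concrete with a short calculation. The sketch below is not from the text; it assumes an ideal-gas-like partition function Z_H = (n_Q V)^N with Ē = (3/2)N/β and k = 1, where n_Q is a hypothetical constant absorbing all units. At fixed density, S_H per particle keeps growing with N while S_L = S_H − log N! per particle approaches a constant.

```python
from math import log, lgamma

# Minimal sketch related to eqs.(5.88)-(5.89), for an assumed ideal-gas-like
# model (not from the text): Z_H = (nQ*V)**N and E_bar = 1.5*N/beta, with k = 1
# and nQ a hypothetical constant absorbing units.  S_H is not extensive;
# S_L = S_H - log N! is.
nQ = 1.0

def S_H(N, V):
    # S_H = beta*E_bar + log Z_H = 1.5*N + N*log(nQ*V)
    return 1.5 * N + N * log(nQ * V)

def S_L(N, V):
    return S_H(N, V) - lgamma(N + 1)           # log N!

for N in [100, 1000, 10000]:
    V = 1.0 * N                                 # fixed density V/N = 1
    print(N, S_H(N, V) / N, S_L(N, V) / N)      # S_H/N grows; S_L/N levels off
```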
But, of course, experiments involving changes in N are very important (for example, in the equilibrium between different phases, or in chemical reactions). Since in the usual thermodynamic experiments we only care that some number of particles has been exchanged, and we do not care which were the actual particles exchanged, we expect that the correct identification is S_L = S_T. Indeed, the quantity that regulates the equilibrium under exchanges of particles is the chemical potential defined by

\mu = -kT\left(\frac{\partial S_T}{\partial N}\right)_{E,V,\ldots}    (5.91)

The two identifications S_H = S_T or S_L = S_T lead to two different chemical potentials; using eq.(5.89) and Stirling's approximation they are related by

\mu_L = \mu_H + kT\log N.    (5.92)

It is easy to verify that, under the usual circumstances where surface effects can be neglected relative to the bulk, μ_L has the correct functional dependence on N: it is intensive and can be identified with the thermodynamic µ. On the other hand, μ_H is not an intensive quantity and cannot therefore be identified with µ.

Non-identical particles

We saw that classical identical particles can be treated, depending on the resolution of the experiment, as being distinguishable or indistinguishable. Here we go further and point out that even non-identical particles can be treated as indistinguishable. Our goal is to state explicitly in precisely what sense it is up to the observer to decide whether particles are distinguishable or not.

We defined a mesostate as a subset of N! microstates that are obtained as permutations of each other. With this definition it is clear that a permutation of particles does not lead to a new mesostate even if the exchanged particles are not identical. This is an important extension because, unlike quantum particles, classical particles cannot be expected to be exactly identical down to every minute detail. In fact in many cases the particles can be grossly different – examples might be colloidal suspensions or solutions of organic macromolecules. A high resolution device, for example an electron microscope, would reveal that no two colloidal particles or two macromolecules are exactly alike. And yet, for the purpose of modelling most of our macroscopic observations it is not necessary to take account of the myriad ways in which two particles can differ.

Consider a system of N particles. We can perform rather crude macroscopic experiments the results of which can be summarized with a simple phenomenological thermodynamics where N is one of the relevant variables that define the macrostate. Our goal is to construct a statistical foundation that will explain this macroscopic model, reduce it, so to speak, to "first principles." The particles might ultimately be non-identical, but the crude phenomenology is not sensitive to their differences and can be explained by postulating mesostates g and microstates i with energies E_i ≈ E_g, for all i ∈ g, as if the particles were identical. As in the previous section this statistical model gives

Z_L = \frac{Z_H}{N!} \quad\text{with}\quad Z_H = \sum_i e^{-\beta E_i},    (5.93)

and the connection to the thermodynamics is established by postulating

S_T = S_L = S_H - \log N!.    (5.94)

Next we consider what happens when more sophisticated experiments are performed.
The examples traditionally offered in discussions of this sort refer to the new experiments that could be made possible by the discovery of membranes that are permeable to some of the N particles but not to the others. Other, perhaps historically more realistic, examples are afforded by the availability of new experimental data, for example, more precise measurements of a heat capacity as a function of temperature, or perhaps measurements in a range of temperatures that had previously been inaccessible.

Suppose the new phenomenology can be modelled by postulating the existence of two kinds of particles. (Experiments that are even more sophisticated might allow us to detect three or more kinds, perhaps even a continuum of different particles.) What we previously thought were N identical particles we will now think of as being N_a particles of type a and N_b particles of type b. The new description is in terms of macrostates defined by N_a and N_b as the relevant variables.

To construct a statistical explanation of the new phenomenology from 'first principles' we need to revise our notion of mesostate. Each new mesostate will be a group of microstates which will include all those microstates obtained by permuting the a particles among themselves, and by permuting the b particles among themselves, but will not include those microstates obtained by permuting a particles with b particles. The new mesostates, which we will label ĝ and to which we will assign energy E_ĝ, will be composed of N_a! N_b! microstates î, each with a well defined energy E_î = E_ĝ, for all î ∈ ĝ. The new statistical model gives

\hat Z_L = \frac{\hat Z_H}{N_a!\,N_b!} \quad\text{with}\quad \hat Z_H = \sum_{\hat\imath} e^{-\beta E_{\hat\imath}},    (5.95)

and the connection to the new phenomenology is established by postulating

\hat S_T = \hat S_L = \hat S_H - \log(N_a!\,N_b!).    (5.96)

In discussions of this topic it is not unusual to find comments to the effect that in the limit as particles a and b become identical one expects that the entropy of the system with two kinds of particles tends to the entropy of a system with just one kind of particle. The fact that this expectation is not met is one manifestation of the Gibbs paradox.

From the information theory point of view the paradox does not arise because there is no such thing as the entropy of the system; there are several entropies. It is true that as a → b we will have Ẑ_H → Z_H, and accordingly Ŝ_H → S_H, but there is no reason to expect a similar relation between Ŝ_L and S_L because these two entropies refer to mesostates ĝ and g that remain different even as a and b become identical. In this limit the mesostates ĝ, which are useful for descriptions that treat particles a and b as indistinguishable among themselves but distinguishable from each other, lose their usefulness.

Conclusion

The Gibbs paradox in its various forms arises from the widespread misconception that entropy is a real physical quantity and that one is justified in talking about the entropy of the system. The thermodynamic entropy is not a property of the system. Entropy is a property of our description of the system; it is a property of the macrostate. More explicitly, it is a function of the macroscopic variables used to define the macrostate. To different macrostates reflecting different choices of variables there correspond different entropies for the very same system.
But this is not the complete story: entropy is not just a function of the macrostate. Entropies reflect a relation between two descriptions of the same system: in addition to the macrostate, we must also specify the set of microstates, or the set of mesostates, as the case might be. Then, having specified the macrostate, an entropy can be interpreted as the amount of additional information required to specify the microstate or mesostate. We have found the 'grouping' property very valuable precisely because it emphasizes the dependence of entropy on the choice of micro- or mesostates.

Chapter 6

Entropy III: Updating Probabilities

The general problem of inductive inference is to update from a prior probability distribution to a posterior distribution when new information becomes available. The challenge is to develop updating methods that are both systematic and objective. In Chapter 2 we saw that Bayes' rule is the natural way to update when the information is in the form of data. We also saw that Bayes' rule could not be derived just from the requirements of consistency implicit in the sum and product rules of probability theory. An additional Principle of Minimal Updating (PMU) was necessary: Prior information is valuable and should not be discarded; beliefs should be revised only to the extent required by the data.

A few interesting questions were just barely hinted at: How do we update when the information is not in the form of data? If the information is not data, what else could it possibly be? Indeed what, after all, is information? Then in Chapter 4 we saw that the method of maximum entropy, MaxEnt, allowed one to deal with information in the form of constraints on the allowed probability distributions. So here we have a partial answer to one of our questions: in addition to data, information can take the form of constraints. However, MaxEnt is not a method for updating; it is a method for assigning probabilities on the basis of the constraint information, but it does not allow us to take into account the information contained in prior distributions.

Thus, Bayes' rule allows for the information contained in arbitrary priors and in data, but not in arbitrary constraints,¹ while on the other hand, MaxEnt can handle arbitrary constraints but not arbitrary priors. In this chapter we bring those two methods together: by generalizing the PMU we show how the MaxEnt method can be extended beyond its original scope, as a rule to assign probabilities, to a full-fledged method for inductive inference, that is, a method for updating from arbitrary priors given information in the form of arbitrary constraints. It should not be too surprising that the extended Maximum Entropy method, which we will henceforth abbreviate as ME, includes both MaxEnt and Bayes' rule as special cases.

Historically the ME method is a direct descendant of MaxEnt. As we saw in chapter 4, within the MaxEnt method entropy is interpreted through the Shannon axioms as a measure of the amount of uncertainty or of the amount of information that is missing in a probability distribution.

¹ Bayes' rule can handle constraints when they are expressed in the form of data that can be plugged into a likelihood function. Not all constraints are of this kind.
We discussed some limitations of this approach. The Shannon axioms refer to probabilities of discrete variables; for continuous variables the entropy is not defined. But a more serious objection was raised: even if we grant that the Shannon axioms do lead to a reasonable expression for the entropy, to what extent do we believe the axioms themselves? Shannon's third axiom, the grouping property, is indeed sort of reasonable, but is it necessary? Is entropy the only consistent measure of uncertainty or of information? What is wrong with, say, the standard deviation? Indeed, there exist examples in which the Shannon entropy does not seem to reflect one's intuitive notion of information [Uffink 95]. Other entropies, justified by a different choice of axioms, can be introduced (prominent examples are [Renyi 61, Tsallis 88]).

From our point of view the real limitation is that neither Shannon nor Jaynes were concerned with updating probabilities. Shannon was analyzing the capacity of communication channels and characterizing the potential diversity of messages generated by information sources (section 4.6). His entropy makes no reference to prior distributions. On the other hand, as we already mentioned, Jaynes conceived MaxEnt as a method to assign probabilities on the basis of constraint information and a fixed underlying measure, not an arbitrary prior. He never meant to update from one probability distribution to another.

Considerations such as these motivated several attempts to develop ME directly as a method for updating probabilities without invoking questionable measures of uncertainty. Prominent among them are [Shore and Johnson 80, Skilling 88-90, Csiszar 91]. The important contribution by Shore and Johnson was the realization that one could axiomatize the updating method itself rather than the information measure. Their axioms are justified on the basis of a fundamental principle of consistency – if a problem can be solved in more than one way the results should agree – but the axioms themselves and other assumptions they make have raised some objections [Karbelkar 86, Uffink 95]. Despite such criticism Shore and Johnson's pioneering papers have had an enormous influence; they identified the correct goal to be achieved.

Another approach to entropy was proposed by Skilling. His axioms are clearly inspired by those of Shore and Johnson but his approach is different in several important aspects. In particular, Skilling did not explore the possibility of using his induction method for the purpose of inductive inference, that is, for updating from prior to posterior probabilities.

The primary goal of this chapter is to apply Skilling's method of eliminative induction to Shore and Johnson's problem of updating probabilities and, in the process, to overcome the objections that can be raised against either. The presentation below follows [Caticha 03, Caticha Giffin 06, Caticha 07].

As we argued earlier when developing the theory of degrees of belief, our general approach differs from the way in which many physical theories have been developed in the past. The more traditional approach consists of first setting up the mathematical formalism and then seeking an acceptable interpretation. The drawback of this procedure is that questions can always be raised about the uniqueness of the proposed interpretation, and about the criteria that make it acceptable or not.
In contrast, here we proceed in the opposite order: we first decide what we are talking about, what goal we want to achieve, and only then we design a suitable mathematical formalism. The advantage is that the issue of meaning and interpretation is resolved from the start. The preeminent example of this approach is Cox's algebra of probable inference (discussed in chapter 2), which clarified the meaning and use of the notion of probability: after Cox it was no longer possible to raise doubts about the legitimacy of the degree of belief interpretation. A second example is special relativity: the actual physical significance of the x and t appearing in the mathematical formalism of Lorentz and Poincaré was a matter of controversy until Einstein settled the issue by deriving the formalism, that is, the Lorentz transformations, from more basic principles. Yet a third example is the derivation of the mathematical formalism of quantum theory [Caticha 98]. In this chapter we explore a fourth example: the concept of relative entropy is introduced as a tool for reasoning which reduces to the usual entropy in the special case of uniform priors. There is no need for an interpretation in terms of heat, multiplicity of states, disorder, uncertainty, or even in terms of an amount of information. In this approach we find an explanation for why the search for the meaning of entropy has turned out to be so elusive: Entropy needs no interpretation. We do not need to know what 'entropy' means; we only need to know how to use it.

Since the PMU is the driving force behind both Bayesian and ME updating it is worthwhile to investigate the precise relation between the two. We show that Bayes' rule can be derived as a special case of the ME method.² The virtue of our derivation, which hinges on translating information in the form of data into a constraint that can be processed using ME, is that it is particularly clear. It throws light on Bayes' rule and demonstrates its complete compatibility with ME updating. A slight generalization of the same ideas shows that Jeffrey's updating rule (section 2.10.2) is also a special case of the ME method. Thus, within the ME framework maximum entropy and Bayesian methods are unified into a single consistent theory of inference.

There is a second function that the ME method must perform in order to fully qualify as a method of inductive inference: once we have decided that the distribution of maximum entropy is to be preferred over all others the following question arises immediately: the maximum of the entropy function is never infinitely sharp, so are we really confident that distributions with entropy very close to the maximum are totally ruled out? We must find a quantitative way to assess the extent to which distributions with lower entropy are ruled out. This matter is addressed following the treatment in [Caticha 00].

² This result was first obtained by Williams (see [Williams 80, Diaconis 82]) long before the logical status of the ME method – and therefore the full extent of its implications – had been sufficiently clarified.

6.1 What is information?

It is not unusual to hear that systems "carry" or "contain" information and that "information is physical". This mode of expression can perhaps be traced to the origins of information theory in Shannon's theory of communication.
We say that we have received information when, among the vast variety of messages that could conceivably have been generated by a distant source, we discover which particular message was actually sent. It is thus that the message "carries" information. The analogy with physics is straightforward: the set of all possible states of a physical system can be likened to the set of all possible messages, and the actual state of the system corresponds to the message that was actually sent. Thus, the system "conveys" a message: the system "carries" information about its own state. Sometimes the message might be difficult to read, but it is there nonetheless.

This language – information is physical – useful as it has turned out to be, does not exhaust the meaning of the word 'information'. The goal of information theory, or better, communication theory, is to characterize the sources of information, to measure the capacity of the communication channels, and to learn how to control the degrading effects of noise. It is somewhat ironic but nevertheless true that this "information" theory is unconcerned with the central Bayesian issue of how the message affects the beliefs of a rational agent.

A fully Bayesian information theory demands an explicit account of the relation between information and beliefs.

The notion that the theory for reasoning with incomplete information is the theory of degrees of rational belief led us to tackle two different problems.³ The first was to understand the conditions required to achieve consistency within a web of interconnected beliefs. This problem was completely solved: degrees of belief are consistent when they obey the rules of probability theory, which led us to conclude that rational degrees of belief are probabilities.

The second problem is that of updating probabilities when new information becomes available. The desire and need to update our beliefs is driven by the conviction that not all probability assignments are equally good. This bears on the issue of whether probabilities are subjective, objective, or somewhere in between. We argued earlier that what makes one probability assignment better than another is that it better reflects some "objective" feature of the world, that is, it provides a better guide to the "truth" – whatever this might mean. Therefore objectivity is a desirable goal. It is their (partial) objectivity that makes probabilities useful. Indeed, what we seek are updating mechanisms that allow us to process information and incorporate its objective features into our beliefs. Bayes' rule behaves precisely in this way. We saw in section 2.10 that as more and more data are taken into account the original (possibly subjective) prior becomes less and less relevant, and all rational agents become more and more convinced of the same truth. This is crucial: were it not this way Bayesian reasoning would not be deemed acceptable.

We are now ready to answer the question 'What, after all, is information?' The result of being confronted with new information should be a restriction on our options as to what we are honestly and rationally allowed to believe.

³ We mentioned earlier, and emphasize again here, that the qualifier 'rational' is crucial: we are interested in the reasoning of an idealized rational agent and not of real imperfect humans.
This, I propose, is the defining characteristic of information. By information, in its most general form, I mean a set of constraints on the family of acceptable posterior distributions. Thus,

Information is whatever constrains rational beliefs.

We can phrase this idea somewhat differently. Since our objective is to update from a prior distribution to a posterior when new information becomes available we can state that

Information is what forces a change of beliefs.

An important aspect of this notion is that for a rational agent the updating is not optional: it is a moral imperative.

Our definition captures an idea of information that is directly related to changing our minds: information is the driving force behind the process of learning. Note also that although there is no need to talk about amounts of information, whether measured in units of bits or otherwise, our notion of information allows precise quantitative calculations. Indeed, constraints on the acceptable posteriors are precisely the kind of information the method of maximum entropy (see below) is designed to handle.

The constraints that convey, or rather, that are information can take a wide variety of forms. For example, they can represent data (see section 6.5 below), or they can be in the form of expected values (as in statistical mechanics, see chapter 5). Although one cannot directly measure expected values or probabilities one can still use them to convey information. This is what we do, for example, when we specify a prior or the likelihood function – these are not things that one can measure, but by constraining our beliefs they certainly constitute valuable information. Constraints can also be specified through geometrical relations (see section 6.7 and also [Caticha 01, Caticha Cafaro 07]).

It may be worthwhile to point out an analogy with dynamics – the study of change. In Newtonian dynamics the state of motion of a system is described in terms of momentum – the "quantity" of motion – while the change from one state to another is explained in terms of an applied force. Similarly, in Bayesian inference a state of belief is described in terms of probabilities – a "quantity" of belief – and the change from one state to another is due to information. Just as a force is defined as that which induces a change from one state of motion to another, so information is that which induces a change from one state of belief to another.

What about prejudices and superstitions? What about divine revelations? Do they constitute information? Perhaps they lie outside our chosen subject of ideally rational beliefs, but to the extent that their effects are indistinguishable from those of other sorts of information, namely, that they affect beliefs, they qualify as information too. Whether the sources of such information are reliable or not is quite another matter. False information is information too and even ideally rational agents are affected by false information.

What about limitations in our computational power? They influence our inferences. Should they be considered information? No. Limited computational resources may affect the numerical approximation to the value of, say, an integral, but they do not affect the actual value of the integral.
Similarly, limited computational resources may affect the approximate, imperfect reasoning of real humans and real computers, but they do not affect the reasoning of those ideal rational agents that are the subject of our present concerns.

6.2 Entropy as a tool for updating probabilities

Consider a variable x the value of which is uncertain. The variable can be discrete or continuous, in one or in several dimensions. For example, x could represent the possible microstates of a physical system, a point in phase space, or an appropriate set of quantum numbers. The uncertainty about x is described by a probability distribution q(x). Our goal is to update from the prior distribution q(x) to a posterior distribution p(x) when new information – by which we mean a set of constraints – becomes available. The information can be given in terms of expected values but this is not necessary. The question is: of all those distributions within the family defined by the constraints, what distribution p(x) should we select?

To select the posterior one could proceed by attempting to place all candidate distributions in increasing order of preference [Skilling 88]. Irrespective of what it is that makes one distribution preferable over another it is clear that any ranking according to preference must be transitive: if distribution p_1 is preferred over distribution p_2, and p_2 is preferred over p_3, then p_1 is preferred over p_3. Such transitive rankings are implemented by assigning to each p(x) a real number S[p] in such a way that if p_1 is preferred over p_2, then S[p_1] > S[p_2]. The selected distribution (one or possibly many, for there may be several equally preferred distributions) will be that which maximizes the functional S[p], which we will call the entropy of p. We are thus led to a method of Maximum Entropy (ME) that is a variational method involving entropies which are real numbers. These are features imposed by design; they are dictated by the function that the ME method is supposed to perform.

Next, to define the ranking scheme, we must decide on the functional form of S[p].

First, the purpose of the method is to update from priors to posteriors. The ranking scheme must depend on the particular prior q and therefore the entropy S must be a functional of both p and q. The entropy S[p, q] describes a ranking of the distributions p relative to the given prior q. S[p, q] is the entropy of p relative to q, and accordingly S[p, q] is commonly called relative entropy. This is appropriate and sometimes we will follow this practice. However, as discussed in section 4.5, even the 'regular' Shannon entropy is relative: it is the entropy of p relative to an underlying uniform distribution. Since all entropies are relative to some prior, the qualifier 'relative' is redundant and can be dropped. This is somewhat analogous to the situation with energy: all energies are relative to some origin or to some reference frame but we do not feel compelled to constantly refer to the 'relative energy'. It is just taken for granted.

Second, since we deal with incomplete information the method, by its very nature, cannot be deductive: the method must be inductive.
The best we can do is generalize from those few special cases where we know what the preferred distribution should be to the much larger number of cases where we do not. In order to achieve its purpose, we must assume that S[p, q] is of universal applicability. There is no justification for this universality beyond the usual pragmatic justification of induction: in order to avoid the paralysis of not generalizing at all we must risk making wrong generalizations. An induction method must be allowed to induce.

We will apply the Principle of Eliminative Induction introduced in chapter 1: If a general theory exists it must apply to special cases. If special examples are known then all candidate theories that fail to reproduce the known examples are discarded. If a sufficient number of special examples are known then the general theory might be completely determined. The best we can do is use those special cases where we know what the preferred distribution should be to eliminate those entropy functionals S[p, q] that fail to provide the right update.

The known special cases will be called (perhaps inappropriately) the axioms of the theory. They play a crucial role: they define what makes one distribution preferable over another.

The three axioms below are chosen to reflect the moral conviction that information collected in the past and codified into the prior distribution is very valuable and should not be frivolously discarded. This attitude is radically conservative: the only aspects of one's beliefs that should be updated are those for which new evidence has been supplied. This is important and it is worthwhile to consider it from a different angle. Degrees of belief, probabilities, are said to be subjective: two different individuals might not share the same beliefs and could conceivably assign probabilities differently. But subjectivity does not mean arbitrariness. It is not a blank check allowing the rational agent to change its mind for no good reason. Valuable prior information should not be discarded until new information renders it obsolete.

Furthermore, since the axioms do not tell us what and how to update, they merely tell us what not to update, they have the added bonus of maximizing objectivity – there are many ways to change something but only one way to keep it the same. Thus, we adopt the Principle of Minimal Updating (PMU):

Beliefs should be updated only to the extent required by the new information.

The three axioms, the motivation behind them, and their consequences for the functional form of the entropy functional are given below. As will become immediately apparent the axioms do not refer to merely three cases; any induction from such a weak foundation would hardly be reliable. The reason the axioms are convincing and so constraining is that they refer to three infinitely large classes of known special cases. Detailed proofs are deferred to the next section.

Axiom 1: Locality. Local information has local effects.

Suppose the information to be processed does not refer to a particular subdomain D of the space X of xs. In the absence of any new information about D the PMU demands we do not change our minds about D. Thus, we design the inference method so that q(x|D), the prior probability of x conditional on x ∈ D, is not updated.
The selected conditional posterior is P(x|D) = q(x|D). We emphasize: the point is not that we make the unwarranted assumption that keeping q(x|D) is guaranteed to lead to correct inferences. It need not. Induction is risky. The point is, rather, that in the absence of any evidence to the contrary there is no reason to change our minds and the prior information takes priority.

The consequence of axiom 1 is that non-overlapping domains of x contribute additively to the entropy,

S[p, q] = \int dx\, F(p(x), q(x), x) ,    (6.1)

where F is some unknown function – not a functional, just a regular function of three arguments.

Axiom 2: Coordinate invariance. The system of coordinates carries no information.

The points x can be labeled using any of a variety of coordinate systems. In certain situations we might have explicit reasons to believe that a particular choice of coordinates should be preferred over others. This information might have been given to us in a variety of ways, but unless the evidence was in fact given we should not assume it: the ranking of probability distributions should not depend on the coordinates used.

To grasp the meaning of this axiom it may be useful to recall some facts about coordinate transformations. Consider a change from old coordinates x to new coordinates x' such that x = \Gamma(x'). The new volume element dx' includes the corresponding Jacobian,

dx = \gamma(x')\, dx' \quad\text{where}\quad \gamma(x') = \frac{\partial x}{\partial x'} .    (6.2)

Let m(x) be any density; the transformed density m'(x') is such that m(x)\, dx = m'(x')\, dx'. This is true, in particular, for probability densities such as p(x) and q(x), therefore

m(x) = \frac{m'(x')}{\gamma(x')} , \quad p(x) = \frac{p'(x')}{\gamma(x')} \quad\text{and}\quad q(x) = \frac{q'(x')}{\gamma(x')} .    (6.3)

The coordinate transformation gives

S[p, q] = \int dx\, F(p(x), q(x), x) = \int \gamma(x')\, dx'\, F\!\left( \frac{p'(x')}{\gamma(x')}, \frac{q'(x')}{\gamma(x')}, \Gamma(x') \right) ,    (6.4)

which is a mere change of variables. The identity above is valid always, for all \Gamma and for all F; it imposes absolutely no constraints on S[p, q]. The real constraint arises from realizing that we could have started in the x' coordinate frame, in which case we would have ranked the distributions using the entropy

S[p', q'] = \int dx'\, F(p'(x'), q'(x'), x') ,    (6.5)

but this should have no effect on our conclusions. This is the nontrivial content of axiom 2. It is not that we can change variables, we can always do that; but rather that the two rankings, the one according to S[p, q] and the other according to S[p', q'], must coincide. This requirement is satisfied if, for example, S[p, q] and S[p', q'] turn out to be numerically equal, but this is not necessary.

The consequence of axiom 2 is that S[p, q] can be written in terms of coordinate invariants such as dx\, m(x) and p(x)/m(x), and q(x)/m(x):

S[p, q] = \int dx\, m(x)\, \Phi\!\left( \frac{p(x)}{m(x)}, \frac{q(x)}{m(x)} \right) .    (6.6)

Thus the unknown function F which had three arguments has been replaced by a still unknown function \Phi with two arguments plus an unknown density m(x).

Next we determine the density m(x) by invoking the locality axiom 1 once again. A situation in which no new information is available is dealt with by allowing the domain D to cover the whole space X.
The requirement that in the absence of any new information the prior conditional probabilities q(x|D) = q(x|X) = q(x) should not be updated can be expressed as

Axiom 1 (special case): When there is no new information there is no reason to change one's mind.

When there are no constraints the selected posterior distribution should coincide with the prior distribution, that is, P(x) = q(x). The consequence of this second use of locality is that the arbitrariness in the density m(x) is removed: up to normalization m(x) must be the prior distribution q(x), and therefore at this point we have succeeded in restricting the entropy to functionals of the form

S[p, q] = \int dx\, q(x)\, \Phi\!\left( \frac{p(x)}{q(x)} \right) .    (6.7)

Axiom 3: Consistency for independent subsystems. When a system is composed of subsystems that are known to be independent it should not matter whether the inference procedure treats them separately or jointly.

This axiom is perhaps subtler than it appears at first sight. Two points must be made clear.

The first point concerns how the information about independence is to be handled as a constraint. Consider a system composed of two (or more) subsystems which we know are independent. This means that both the prior and the posterior are products. If the subsystem priors are q_1(x_1) and q_2(x_2), then the prior for the whole system is the product

q(x_1, x_2) = q_1(x_1)\, q_2(x_2) ,    (6.8)

while the joint posterior is constrained within the family

p(x_1, x_2) = p_1(x_1)\, p_2(x_2) .    (6.9)

Further suppose that new information is acquired, say constraints C_1 such that q_1(x_1) is updated to P_1(x_1), and constraints C_2 such that q_2(x_2) is updated to P_2(x_2). Axiom 3 is implemented as follows:

First we treat the two subsystems separately. For subsystem 1 we maximize

S[p_1, q_1] = \int dx_1\, q_1(x_1)\, \Phi\!\left( \frac{p_1(x_1)}{q_1(x_1)} \right) ,    (6.10)

subject to constraints C_1 on the marginal distribution p_1(x_1) = \int dx_2\, p(x_1, x_2) to select the posterior P_1(x_1). The constraints C_1 could, for example, include normalization, or they could involve the known expected value of a function f_1(x_1),

\int dx_1\, f_1(x_1)\, p_1(x_1) = \int dx_1 dx_2\, f_1(x_1)\, p(x_1, x_2) = F_1 .    (6.11)

Similarly, for subsystem 2 we maximize the corresponding S[p_2, q_2] subject to constraints C_2 on p_2(x_2) = \int dx_1\, p(x_1, x_2) to select the posterior P_2(x_2).

Next the subsystems are treated jointly. Since we are concerned with those special examples where we have the information that the subsystems are independent, we are required to search for the posterior within the restricted family of joint distributions that take the form of the product (6.9); this is an additional constraint over and above the original C_1 and C_2. The new constraint p = p_1 p_2 is easily implemented by direct substitution. Instead of maximizing the joint entropy S[p, q_1 q_2], we now maximize

S[p_1 p_2, q_1 q_2] = \int dx_1 dx_2\, q_1(x_1)\, q_2(x_2)\, \Phi\!\left( \frac{p_1(x_1)\, p_2(x_2)}{q_1(x_1)\, q_2(x_2)} \right) ,    (6.12)

under independent variations \delta p_1 and \delta p_2 subject to the same constraints C_1 and C_2. The function \Phi is then determined – or at least constrained – by demanding that the selected posterior be P_1(x_1) P_2(x_2).
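To make the "separate treatment" step concrete, here is a minimal numerical sketch (Python, with arbitrary illustrative numbers) of maximizing a discrete entropy of the form (6.10) subject to normalization and an expected-value constraint of the type (6.11). For definiteness it uses the logarithmic choice Φ(y) = −y log y, the member of the family that the axioms will eventually single out (eq.(6.13) below); the printed values of log(P_i/q_i) come out affine in a_i, the canonical-exponential signature of this kind of update.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import xlogy

# Toy discrete example of the constrained maximization described above.
q = np.array([0.1, 0.2, 0.3, 0.4])      # prior q on four states (hypothetical numbers)
a = np.array([0.0, 1.0, 2.0, 3.0])      # the function f_1 whose expectation is constrained
A = 1.2                                  # required expected value, as in eq. (6.11)

def neg_entropy(p):
    # S[p, q] = sum_i q_i Phi(p_i/q_i) with Phi(y) = -y log y, i.e. -sum_i p_i log(p_i/q_i);
    # we minimize its negative.
    return float(np.sum(xlogy(p, p) - xlogy(p, q)))

constraints = [{'type': 'eq', 'fun': lambda p: p.sum() - 1.0},
               {'type': 'eq', 'fun': lambda p: np.dot(a, p) - A}]

res = minimize(neg_entropy, x0=q.copy(), method='SLSQP',
               constraints=constraints, bounds=[(1e-12, 1.0)] * len(q))
P = res.x

print("posterior P:", np.round(P, 4))
# The maximizer has the canonical form P_i proportional to q_i exp(lambda * a_i),
# so log(P_i/q_i) is an affine function of a_i:
print("log(P/q):", np.round(np.log(P / q), 4))
```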
The second point is that the axiom applies to all instances of systems that happen to be independent – this is why it is so powerful. The axiom applies to situations where we deal with just two systems – as in the previous paragraph – and it also applies when we deal with many, whether just a few or a very large number. The axiom applies when the independent subsystems are identical, and also when they are not.

The final conclusion is that probability distributions p(x) should be ranked relative to the prior q(x) according to the relative entropy,

S[p, q] = -\int dx\, p(x) \log \frac{p(x)}{q(x)} .    (6.13)

The lengthy proof leading to (6.13) is given in the next section. It involves three steps. First we show (subsection 6.3.4) that applying axiom 3 to subsystems that happen to be identical restricts the entropy functional to a member of the one-parameter family of entropies S_\eta[p, q] parametrized by an "inference parameter" \eta,

S_\eta[p, q] = \frac{1}{\eta(\eta+1)} \left[ 1 - \int dx\, p^{\eta+1} q^{-\eta} \right] .    (6.14)

It is easy to see that there are no singularities for \eta = 0 or -1: the limits \eta \to 0 and \eta \to -1 are well behaved. In particular, to take \eta \to 0 use

y^\eta = \exp(\eta \log y) \approx 1 + \eta \log y ,    (6.15)

which leads to the usual logarithmic entropy, S_0[p, q] = S[p, q] given in eq.(6.13). Similarly, for \eta \to -1 we get S_{-1}[p, q] = S[q, p].

In the second step (subsection 6.3.5) axiom 3 is applied to two independent systems that are not identical and could in principle be described by different parameters \eta_1 and \eta_2. The consistency demanded by axiom 3 implies that the two parameters must be equal, \eta_1 = \eta_2, and since this must hold for all pairs of independent systems we conclude that \eta must be a universal constant. In the final step the value of this constant – which turns out to be \eta = 0 – is determined (subsection 6.3.6) by demanding that axiom 3 apply to N identical subsystems where N is very large.

We can now summarize our overall conclusion:

The ME method: We want to update from a prior distribution q(x) to a posterior distribution p(x) when information in the form of a constraint that specifies the allowed posteriors becomes available. The posterior selected by induction from special cases that implement locality, coordinate invariance and consistency for independent subsystems, is that which maximizes the relative entropy S[p, q] subject to the available constraints. No interpretation for S[p, q] is given and none is needed.

This extends the method of maximum entropy beyond its original purpose as a rule to assign probabilities from a given underlying measure (MaxEnt) to a method for updating probabilities from any arbitrary prior (ME). Furthermore, the logic behind the updating procedure does not rely on any particular meaning assigned to the entropy, either in terms of information, or heat, or disorder. Entropy is merely a tool for inductive inference; we do not need to know what it means; we only need to know how to use it.

The derivation above has singled out a unique S[p, q] to be used in inductive inference. Other 'entropies' could turn out to be useful for other purposes – perhaps as measures of information, or of ecological diversity, or something else – but they are not an induction from the special cases set down in the axioms.
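The limits quoted after eq.(6.14) are easy to check numerically. The following sketch (Python; discrete sums standing in for the integrals, with arbitrary illustrative distributions) evaluates S_η near η = 0 and η = −1 and compares with S[p, q] and S[q, p].

```python
import numpy as np

# Discrete analogues of eqs. (6.13) and (6.14); p and q are arbitrary illustrative choices.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.4, 0.4])

def S(p, q):                       # logarithmic relative entropy, eq. (6.13)
    return -np.sum(p * np.log(p / q))

def S_eta(p, q, eta):              # one-parameter family, eq. (6.14)
    return (1.0 - np.sum(p**(eta + 1) * q**(-eta))) / (eta * (eta + 1))

print(S(p, q), S_eta(p, q, 1e-4))        # eta -> 0 limit approaches S[p, q]
print(S(q, p), S_eta(p, q, -1 + 1e-4))   # eta -> -1 limit approaches S[q, p]
```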
6.3 The proofs

In this section we establish the consequences of the three axioms leading to the final result eq.(6.13). The details of the proofs are important not just because they lead to our final conclusions, but also because the translation of the verbal statement of the axioms into precise mathematical form is a crucial part of unambiguously specifying what the axioms actually say.

6.3.1 Axiom 1: Locality

Here we prove that axiom 1 leads to the expression eq.(6.1) for S[p, q]. The requirement that probabilities be normalized is handled by imposing normalization as one among so many other constraints that one might wish to impose. To simplify the proof we consider the case of a discrete variable, p_i with i = 1 . . . n, so that S[p, q] = S(p_1 . . . p_n, q_1 . . . q_n). The generalization to a continuum is straightforward.

Suppose the space of states X is partitioned into two non-overlapping domains D and D' with D \cup D' = X, and that the information to be processed is in the form of a constraint that refers to the domain D',

\sum_{j \in D'} a_j p_j = A .    (6.16)

Axiom 1 states that the constraint on D' does not have an influence on the conditional probabilities p_{i|D}. It may however influence the probabilities p_i within D through an overall multiplicative factor. To deal with this complication consider then a special case where the overall probabilities of D and D' are constrained too,

\sum_{i \in D} p_i = P_D \quad\text{and}\quad \sum_{j \in D'} p_j = P_{D'} ,    (6.17)

with P_D + P_{D'} = 1. Under these special circumstances constraints on D' will not influence the p_i s within D, and vice versa.

To obtain the posterior maximize S[p, q] subject to these three constraints,

0 = \delta\left[ S - \lambda\Big( \sum_{i \in D} p_i - P_D \Big) - \lambda'\Big( \sum_{j \in D'} p_j - P_{D'} \Big) - \mu\Big( \sum_{j \in D'} a_j p_j - A \Big) \right] ,

leading to

\frac{\partial S}{\partial p_i} = \lambda \quad\text{for}\quad i \in D ,    (6.18)

\frac{\partial S}{\partial p_j} = \lambda' + \mu a_j \quad\text{for}\quad j \in D' .    (6.19)

Eqs.(6.16-6.19) are n + 3 equations we must solve for the p_i s and the three Lagrange multipliers. Since S = S(p_1 . . . p_n, q_1 . . . q_n) its derivative

\frac{\partial S}{\partial p_i} = f_i(p_1 . . . p_n, q_1 . . . q_n)    (6.20)

could in principle also depend on all 2n variables. But this violates the locality axiom because any arbitrary change in a_j within D' would influence the p_i s within D. The only way that probabilities within D can be shielded from arbitrary changes in the constraints pertaining to D' is that the functions f_i with i \in D depend only on p_i s while the functions f_j depend only on p_j s. Furthermore, this must hold not just for one particular partition of X into domains D and D', it must hold for all conceivable partitions. Therefore f_i can depend only on p_i and, at this point, on any of the q s,

\frac{\partial S}{\partial p_i} = f_i(p_i, q_1 . . . q_n) .    (6.21)

But the power of the locality axiom is not exhausted yet. The information to be incorporated into the posterior can enter not just through constraints but also through the prior. Suppose that the local information about domain D' is altered by changing the prior within D'. Let q_j \to q_j + \delta q_j for j \in D'. Then (6.21) becomes

\frac{\partial S}{\partial p_i} = f_i(p_i, q_1 . . . q_j + \delta q_j . . . q_n)    (6.22)

which shows that p_i with i \in D will be influenced by information about D' unless f_i with i \in D is independent of all the q_j s for j \in D'. Again, this must hold for all partitions into D and D', and therefore,

\frac{\partial S}{\partial p_i} = f_i(p_i, q_i) \quad\text{for all}\quad i \in X .    (6.23)
Integrating, one obtains

S[p, q] = \sum_i F_i(p_i, q_i) + \text{constant}    (6.24)

for some undetermined functions F_i. The corresponding expression for a continuous variable x is obtained replacing i by x, and the sum over i by an integral over x, leading to eq.(6.1).

6.3.2 Axiom 2: Coordinate invariance

Next we prove eq.(6.6). It is convenient to introduce a function m(x) which transforms as a density and rewrite the expression (6.1) for the entropy in the form

S[p, q] = \int dx\, m(x)\, \frac{1}{m(x)}\, F\!\left( \frac{p(x)}{m(x)} m(x), \frac{q(x)}{m(x)} m(x), x \right)    (6.25)
        = \int dx\, m(x)\, \Phi\!\left( \frac{p(x)}{m(x)}, \frac{q(x)}{m(x)}, m(x), x \right) ,    (6.26)

where the function \Phi is defined by

\Phi(\alpha, \beta, m, x) \stackrel{\text{def}}{=} \frac{1}{m}\, F(\alpha m, \beta m, x) .    (6.27)

Next, we consider a special situation where the new information consists of constraints which do not favor one coordinate system over another. For example consider the constraint

\int dx\, p(x)\, a(x) = A    (6.28)

where a(x) is a scalar, i.e., invariant under coordinate changes,

a(x) \to a'(x') = a(x) .    (6.29)

The usual normalization condition \int dx\, p(x) = 1 is a simple example of a scalar constraint. Maximizing S[p, q] subject to the constraint,

\delta\left[ S[p, q] + \lambda\left( \int dx\, p(x)\, a(x) - A \right) \right] = 0 ,    (6.30)

gives

\dot\Phi\!\left( \frac{p(x)}{m(x)}, \frac{q(x)}{m(x)}, m(x), x \right) = \lambda\, a(x) ,    (6.31)

where the dot represents the derivative with respect to the first argument,

\dot\Phi(\alpha, \beta, m, x) \stackrel{\text{def}}{=} \frac{\partial \Phi(\alpha, \beta, m, x)}{\partial \alpha} .    (6.32)

But we could have started using the primed coordinates,

\dot\Phi\!\left( \frac{p'(x')}{m'(x')}, \frac{q'(x')}{m'(x')}, m'(x'), x' \right) = \lambda'\, a'(x') ,    (6.33)

or, using (6.3) and (6.29),

\dot\Phi\!\left( \frac{p(x)}{m(x)}, \frac{q(x)}{m(x)}, m(x)\gamma(x'), x' \right) = \lambda'\, a(x) .    (6.34)

Dividing (6.34) by (6.31) we get

\frac{\dot\Phi(\alpha, \beta, m\gamma, x')}{\dot\Phi(\alpha, \beta, m, x)} = \frac{\lambda'}{\lambda} .    (6.35)

This identity should hold for any transformation x = \Gamma(x'). On the right hand side the multipliers \lambda and \lambda' are just constants; the ratio \lambda'/\lambda might depend on the transformation \Gamma but it does not depend on x.

Consider the special case of a transformation \Gamma that has unit determinant everywhere, \gamma = 1, and differs from the identity transformation only within some arbitrary region D. Since for x outside this region D we have x = x', the left hand side of eq.(6.35) equals 1. Thus, for this particular \Gamma the ratio is \lambda'/\lambda = 1; but \lambda'/\lambda = constant, so \lambda'/\lambda = 1 holds within D as well. Therefore, for x within D,

\dot\Phi(\alpha, \beta, m, x') = \dot\Phi(\alpha, \beta, m, x) .    (6.36)

Since the choice of D is arbitrary we conclude that the function \dot\Phi cannot depend explicitly on x, that is, \dot\Phi = \dot\Phi(\alpha, \beta, m).

Having eliminated the explicit dependence on x, let us go back to eq.(6.35),

\frac{\dot\Phi(\alpha, \beta, m\gamma)}{\dot\Phi(\alpha, \beta, m)} = \frac{\lambda'}{\lambda} ,    (6.37)

and consider a different transformation \Gamma, one with unit determinant \gamma = 1 only outside the region D. Therefore the constant ratio \lambda'/\lambda is again equal to 1, so that

\dot\Phi(\alpha, \beta, m\gamma) = \dot\Phi(\alpha, \beta, m) .    (6.38)

But within D the transformation \Gamma is quite arbitrary; it could have any arbitrary Jacobian \gamma \neq 1. Therefore the function \dot\Phi cannot depend on its argument m either, and therefore \dot\Phi = \dot\Phi(\alpha, \beta). Integrating with respect to \alpha gives \Phi = \Phi(\alpha, \beta) + constant. The additive constant, which could depend on \beta, has no effect on the maximization and can be dropped. This completes the proof of eq.(6.6).
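The content of axiom 2 can also be illustrated numerically. The sketch below (Python, with an arbitrarily chosen transformation x = x'² and the logarithmic Φ used purely as an example of a function of the ratio p/q) evaluates the invariant form (6.6)/(6.7) in the original and in the transformed coordinates and shows that the two rankings assign the same number to a given candidate distribution.

```python
import numpy as np

def trapz(f, t):
    # simple trapezoidal quadrature on a grid t
    return np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(t))

Phi = lambda y: -y * np.log(y)         # example choice of Phi(p/q)

# Original coordinates: x in (0,1), prior q(x) = 1, candidate posterior p(x) = 2x.
x = np.linspace(1e-6, 1.0, 200001)
p = 2.0 * x
q = np.ones_like(x)
S_x = trapz(q * Phi(p / q), x)

# Primed coordinates: x = Gamma(x') = x'^2, Jacobian gamma(x') = 2x'; densities pick up the Jacobian.
xp = np.linspace(1e-3, 1.0, 200001)
gamma = 2.0 * xp
p_prime = (2.0 * xp**2) * gamma        # p'(x') = p(Gamma(x')) * gamma(x')
q_prime = 1.0 * gamma                  # q'(x') = q(Gamma(x')) * gamma(x')
S_xp = trapz(q_prime * Phi(p_prime / q_prime), xp)

print(S_x, S_xp)                       # the two values agree to numerical accuracy
```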
6.3.3 Axiom 1 again

The locality axiom implies that when there are no constraints the selected posterior distribution should coincide with the prior distribution. This provides us with an interpretation of the density m(x) that had been artificially introduced. The argument is simple: maximize S[p, q] in (6.6) subject to the single requirement of normalization,

\delta\left[ S[p, q] + \lambda\left( \int dx\, p(x) - 1 \right) \right] = 0 ,    (6.39)

to get

\dot\Phi\!\left( \frac{p(x)}{m(x)}, \frac{q(x)}{m(x)} \right) = \lambda .    (6.40)

Since \lambda is a constant, the left hand side must be independent of x for arbitrary choices of the prior q(x). This could, for example, be accomplished if the function \dot\Phi(\alpha, \beta) were itself a constant, independent of its arguments \alpha and \beta. But this gives

\Phi(\alpha, \beta) = c_1 \alpha + c_2    (6.41)

where c_1 and c_2 are constants, and leads to the unacceptable form S[p, q] \propto \int dx\, p(x) + constant.

If the dependence on x cannot be eliminated by an appropriate choice of \dot\Phi, we must secure it by a choice of m(x). Eq.(6.40) is an equation for p(x). In the absence of new information the selected posterior distribution must coincide with the prior, P(x) = q(x). The obvious way to secure that (6.40) be independent of x is to choose m(x) \propto q(x). Therefore m(x) must, except for an overall normalization, be chosen to coincide with the prior distribution.

6.3.4 Axiom 3: Consistency for identical independent subsystems

In this subsection we show that applying axiom 3 to subsystems that happen to be identical restricts the entropy functional to a member of the one-parameter family of \eta-entropies S_\eta[p, q] parametrized by \eta. For \eta = 0 one obtains the standard logarithmic entropy, eq.(6.13),

S_0[p, q] = -\int dx\, p(x) \log \frac{p(x)}{q(x)} .    (6.42)

For \eta = -1 one obtains

S_{-1}[p, q] = \int dx\, q(x) \log \frac{p(x)}{q(x)} ,    (6.43)

which coincides with S_0[q, p] with the arguments switched. Finally, for a generic value of \eta \neq -1, 0 the result is

S_\eta[p, q] = -\int dx\, p(x) \left( \frac{p(x)}{q(x)} \right)^{\eta} .    (6.44)

It is worthwhile to recall that the objective of this whole exercise is to rank probability distributions according to preference, and therefore different entropies that induce the same ranking scheme are effectively equivalent. This is very convenient as it allows considerable simplifications by an appropriate choice of additive and multiplicative constants. Taking advantage of this freedom we can, for example, combine the three expressions (6.42), (6.43), and (6.44) into the single expression

S_\eta[p, q] = \frac{1}{\eta(\eta+1)} \left[ 1 - \int dx\, p^{\eta+1} q^{-\eta} \right] ,    (6.45)

that we met earlier in eq.(6.14).

The proof below is fairly lengthy and may be skipped on a first reading. It follows the treatment in [Caticha Giffin 06] and is based upon and extends a previous proof by Karbelkar, who showed that belonging to the family of \eta-entropies is a sufficient condition to satisfy the consistency axiom for identical systems. He conjectured but did not prove that this was perhaps also a necessary condition [Karbelkar 86]. Although necessity was not essential to his argument it is crucial for ours. We show below that for identical subsystems there are no acceptable entropies outside the S_\eta family.

First we treat the subsystems separately. For subsystem 1 we maximize the entropy S[p_1, q_1] subject to normalization and the constraint C_1 in eq.(6.11).
Introduce Lagrange multipliers \alpha_1 and \lambda_1,

\delta\left[ S[p_1, q_1] - \lambda_1\left( \int dx_1\, f_1 p_1 - F_1 \right) - \alpha_1\left( \int dx_1\, p_1 - 1 \right) \right] = 0 ,    (6.46)

which gives

\Phi'\!\left( \frac{p_1(x_1)}{q_1(x_1)} \right) = \lambda_1 f_1(x_1) + \alpha_1 ,    (6.47)

where the prime indicates a derivative with respect to the argument, \Phi'(y) = d\Phi(y)/dy. For subsystem 2 we need only consider the extreme situation where the constraints C_2 determine the posterior completely: p_2(x_2) = P_2(x_2).

Next we treat the subsystems jointly. The constraints C_2 are easily implemented by direct substitution and thus we maximize the entropy S[p_1 P_2, q_1 q_2] by varying over p_1 subject to normalization and the constraint C_1 in eq.(6.11). Introduce Lagrange multipliers \alpha and \lambda,

\delta\left[ S[p_1 P_2, q_1 q_2] - \lambda\left( \int dx_1\, f_1 p_1 - F_1 \right) - \alpha\left( \int dx_1\, p_1 - 1 \right) \right] = 0 ,    (6.48)

which gives

\int dx_2\, P_2\, \Phi'\!\left( \frac{p_1 P_2}{q_1 q_2} \right) = \lambda[P_2, q_2]\, f_1(x_1) + \alpha[P_2, q_2] ,    (6.49)

where the multipliers \lambda and \alpha are independent of x_1 but could in principle be functionals of P_2 and q_2.

The consistency condition that constrains the form of \Phi is that if the solution to eq.(6.47) is P_1(x_1) then the solution to eq.(6.49) must also be P_1(x_1), and this must be true irrespective of the choice of P_2(x_2).

Let us then consider a small change P_2 \to P_2 + \delta P_2 that preserves the normalization of P_2. First introduce a Lagrange multiplier \alpha_2 and rewrite eq.(6.49) as

\int dx_2\, P_2\, \Phi'\!\left( \frac{P_1 P_2}{q_1 q_2} \right) - \alpha_2\left( \int dx_2\, P_2 - 1 \right) = \lambda[P_2, q_2]\, f_1(x_1) + \alpha[P_2, q_2] ,    (6.50)

where we have replaced p_1 by the known solution P_1 and thereby effectively transformed eqs.(6.47) and (6.49) into an equation for \Phi. The \delta P_2 variation gives

\Phi'\!\left( \frac{P_1 P_2}{q_1 q_2} \right) + \frac{P_1 P_2}{q_1 q_2}\, \Phi''\!\left( \frac{P_1 P_2}{q_1 q_2} \right) = \frac{\delta\lambda}{\delta P_2}\, f_1(x_1) + \frac{\delta\alpha}{\delta P_2} + \alpha_2 .    (6.51)

Next use eq.(6.47) to eliminate f_1(x_1),

\Phi'\!\left( \frac{P_1 P_2}{q_1 q_2} \right) + \frac{P_1 P_2}{q_1 q_2}\, \Phi''\!\left( \frac{P_1 P_2}{q_1 q_2} \right) = A[P_2, q_2]\, \Phi'\!\left( \frac{P_1}{q_1} \right) + B[P_2, q_2] ,    (6.52)

where

A[P_2, q_2] = \frac{1}{\lambda_1}\, \frac{\delta\lambda}{\delta P_2} \quad\text{and}\quad B[P_2, q_2] = -\frac{\delta\lambda}{\delta P_2}\, \frac{\alpha_1}{\lambda_1} + \frac{\delta\alpha}{\delta P_2} + \alpha_2 ,    (6.53)

are at this point unknown functionals of P_2 and q_2.

Differentiating eq.(6.52) with respect to x_1 the B term drops out and we get

A[P_2, q_2] = \left[ \frac{d}{dx_1}\, \Phi'\!\left( \frac{P_1}{q_1} \right) \right]^{-1} \frac{d}{dx_1} \left[ \Phi'\!\left( \frac{P_1 P_2}{q_1 q_2} \right) + \frac{P_1 P_2}{q_1 q_2}\, \Phi''\!\left( \frac{P_1 P_2}{q_1 q_2} \right) \right] ,    (6.54)

which shows that A is not a functional of P_2 and q_2 but a mere function of P_2/q_2. Substituting back into eq.(6.52) we see that the same is true for B. Therefore eq.(6.52) can be written as

\Phi'(y_1 y_2) + y_1 y_2\, \Phi''(y_1 y_2) = A(y_2)\, \Phi'(y_1) + B(y_2) ,    (6.55)

where y_1 = P_1/q_1, y_2 = P_2/q_2, and A(y_2), B(y_2) are unknown functions of y_2.

Now we specialize to identical subsystems. Then we can exchange the labels 1 \leftrightarrow 2, and we get

A(y_2)\, \Phi'(y_1) + B(y_2) = A(y_1)\, \Phi'(y_2) + B(y_1) .    (6.56)

To find the unknown functions A and B differentiate with respect to y_2,

A'(y_2)\, \Phi'(y_1) + B'(y_2) = A(y_1)\, \Phi''(y_2)    (6.57)

and then with respect to y_1 to get

\frac{A'(y_1)}{\Phi''(y_1)} = \frac{A'(y_2)}{\Phi''(y_2)} = a = \text{const} .    (6.58)

Integrate to get

A(y_1) = a\, \Phi'(y_1) + b ,    (6.59)

then substitute back into eq.(6.57) and integrate again to get

B'(y_2) = b\, \Phi''(y_2) \quad\text{and}\quad B(y_2) = b\, \Phi'(y_2) + c ,    (6.60)

where b and c are constants. We can check that A(y) and B(y) are indeed solutions of eq.(6.56).
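That check is a one-line computation; for completeness, here is a small symbolic verification (Python/SymPy) that the forms (6.59) and (6.60) satisfy the exchange symmetry (6.56) identically, with Φ' left as an arbitrary function.

```python
import sympy as sp

y1, y2, a, b, c = sp.symbols('y1 y2 a b c')
Phi = sp.Function('Phi')          # here Phi stands for the unspecified derivative Phi'

A = lambda y: a*Phi(y) + b        # eq. (6.59)
B = lambda y: b*Phi(y) + c        # eq. (6.60)

lhs = A(y2)*Phi(y1) + B(y2)       # left side of eq. (6.56)
rhs = A(y1)*Phi(y2) + B(y1)       # right side of eq. (6.56)

print(sp.simplify(lhs - rhs))     # prints 0: the symmetry holds identically
```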
Substituting into eq.(6.55) gives

\Phi'(y_1 y_2) + y_1 y_2\, \Phi''(y_1 y_2) = a\, \Phi'(y_1)\, \Phi'(y_2) + b\left[ \Phi'(y_1) + \Phi'(y_2) \right] + c .    (6.61)

This is a peculiar differential equation. We can think of it as one differential equation for \Phi'(y_1) for each given constant value of y_2, but there is a complication in that the various (constant) coefficients \Phi'(y_2) are themselves unknown. To solve for \Phi choose a fixed value of y_2, say y_2 = 1,

y\, \Phi''(y) - \eta\, \Phi'(y) - \kappa = 0 ,    (6.62)

where \eta = a\Phi'(1) + b - 1 and \kappa = b\Phi'(1) + c. To eliminate the constant \kappa differentiate with respect to y,

y\, \Phi''' + (1 - \eta)\, \Phi'' = 0 ,    (6.63)

which is a linear homogeneous equation and is easy to integrate. For generic values of \eta \neq -1, 0 the solution is

\Phi''(y) \propto y^{\eta - 1} \quad\Rightarrow\quad \Phi'(y) = \alpha y^{\eta} + \beta .    (6.64)

The constants \alpha and \beta are chosen so that this is a solution of eq.(6.61) for all values of y_2 (and not just for y_2 = 1). Substituting into eq.(6.61) and equating the coefficients of the various powers of y_1 y_2, y_1, and y_2 gives three conditions on the two constants \alpha and \beta,

\alpha(1 + \eta) = a\alpha^2 , \qquad 0 = a\alpha\beta + b\alpha , \qquad \beta = a\beta^2 + 2b\beta + c .    (6.65)

The nontrivial (\alpha \neq 0) solutions are \alpha = (1 + \eta)/a and \beta = -b/a, while the third equation fixes the remaining constant, c = b(b - 1)/a. We conclude that for generic values of \eta the solution of eq.(6.61) is

\Phi(y) = \frac{1}{a}\, y^{\eta + 1} - \frac{b}{a}\, y + C ,    (6.66)

where C is a new constant. Substituting into eq.(6.7) yields

S_\eta[p, q] = \frac{1}{a} \int dx\, p(x) \left( \frac{p(x)}{q(x)} \right)^{\eta} - \frac{b}{a} \int dx\, p(x) + C \int dx\, q(x) .    (6.67)

This complicated expression can be simplified considerably by exploiting the freedom to choose additive and multiplicative constants. We can drop the last two terms and choose a = -1 so that the preferred distribution is that which maximizes entropy. This reproduces eq.(6.44).

For \eta = 0 we return to eq.(6.63) and integrate twice to get

\Phi(y) = a'\, y \log y + b'\, y + c' ,    (6.68)

for some new constants a', b', and c'. Substituting into eq.(6.7) yields

S_0[p, q] = a' \int dx\, p(x) \log \frac{p(x)}{q(x)} + b' \int dx\, p(x) + c' \int dx\, q(x) .    (6.69)

Again, choosing a' = -1 and dropping the last two terms does not affect the ranking scheme. This yields the standard expression for relative entropy, eq.(6.42).

Finally, for \eta = -1 integrating eq.(6.63) twice gives

\Phi(y) = a'' \log y + b''\, y + c'' ,    (6.70)

for some new constants a'', b'', and c''. Substituting into eq.(6.7) yields

S_{-1}[p, q] = a'' \int dx\, q(x) \log \frac{p(x)}{q(x)} + b'' \int dx\, p(x) + c'' \int dx\, q(x) .    (6.71)

Again, choosing a'' = 1 and dropping the last two terms yields eq.(6.43). This completes our derivation.

6.3.5 Axiom 3: Consistency for non-identical subsystems

Let us summarize our results so far. The goal is to update probabilities by ranking the distributions according to an entropy S that is of general applicability. The allowed functional forms of the entropy S have been constrained down to a member of the one-dimensional family S_\eta. One might be tempted to conclude that there is no S of universal applicability; that inferences about different systems could be carried out with different \eta-entropies. But we have not yet exhausted the full power of the consistency axiom 3. Consistency is universally desirable; there is no reason why it should be limited to identical systems.

To proceed further we ask: What is \eta?
Is it a property of the individual carrying out the inference or of the system under investigation? The former is unacceptable; we insist that the updating must be objective in that different individuals with the same prior and with the same constraints must make the same inference. Therefore the "inference parameter" \eta can only be a property of the system.

Consider two different systems characterized by \eta_1 and \eta_2. Let us further suppose that these systems are known to be independent (perhaps system #1 lives here on Earth while system #2 lives in a distant galaxy) so that they fall under the jurisdiction of axiom 3. Separate inferences about systems #1 and #2 are carried out with S_{\eta_1}[p_1, q_1] and S_{\eta_2}[p_2, q_2] respectively. For the combined system we are also required to use an \eta-entropy, say S_\eta[p_1 p_2, q_1 q_2]. The question is which \eta to choose so that the inferences are consistent whether we treat the systems separately or jointly.

The results of the previous subsection indicate that a joint inference with S_\eta[p_1 p_2, q_1 q_2] is equivalent to separate inferences with S_\eta[p_1, q_1] and S_\eta[p_2, q_2]. Therefore we must choose \eta = \eta_1 and also \eta = \eta_2, which is possible only if we had \eta_1 = \eta_2 from the start.

But this is not all: consider a third system #3 that also lives here on Earth. We do not know whether system #3 is independent from system #1 or not, but we can confidently assert that it will certainly be independent of the system #2 living in the distant galaxy. The argument of the previous paragraph leads us to conclude that \eta_3 = \eta_2, and therefore that \eta_3 = \eta_1 even when systems #1 and #3 are not known to be independent! We conclude that all systems must be characterized by the same parameter \eta whether they are independent or not, because we can always find a common reference system that is sufficiently distant to be independent of any two of them. The inference parameter \eta is a universal constant, the value of which is at this point still unknown.

The power of a consistency argument resides in its universal applicability: if an entropy S[p, q] exists then it must be one chosen from among the S_\eta[p, q]. The remaining problem is to determine this universal constant \eta. Here we give one argument; in the next subsection we give another one.

One possibility is to regard \eta as a quantity to be determined experimentally. Are there systems for which inferences based on a known value of \eta have repeatedly led to success? The answer is yes; they are quite common. As we discussed in chapter 5, statistical mechanics and thus thermodynamics are theories of inference based on the value \eta = 0. The relevant entropy, which is the Boltzmann-Gibbs-Shannon entropy, can be interpreted as the special case of the ME when one updates from a uniform prior. It is an experimental fact without any known exceptions that inferences about all physical, chemical and biological systems that are in thermal equilibrium or close to it can be carried out by assuming that \eta = 0. Let us emphasize that this is not an obscure and rare example of purely academic interest; these systems comprise essentially all of natural science. (Included is every instance where it is useful to introduce a notion of temperature.)
In conclusion: consistency for non-identical systems requires that \eta be a universal constant and there is abundant experimental evidence for its value being \eta = 0. Other \eta-entropies may turn out to be useful for other purposes, but the logarithmic entropy S[p, q] in eq.(6.13) provides the only consistent ranking criterion for updating probabilities that can claim general applicability.

6.3.6 Axiom 3: Consistency with the law of large numbers

Here we offer a second argument, also based on a broader application of axiom 3, that the value of the universal constant \eta must be \eta = 0. We require consistency for large numbers of independent identical subsystems. In such cases the weak law of large numbers is sufficient to make the desired inferences.

Let the state for each individual system be described by a discrete variable i = 1 . . . m.

First we treat the individual systems separately. The identical priors for the individual systems are q_i and the available information is that the potential posteriors p_i are subject, for example, to an expectation value constraint such as \langle a \rangle = A, where A is some specified value and \langle a \rangle = \sum a_i p_i. The preferred posterior P_i is found maximizing the \eta-entropy S_\eta[p, q] subject to \langle a \rangle = A.

To treat the systems jointly we let the number of systems found in state i be n_i, and let f_i = n_i/N be the corresponding frequency. The two descriptions are related by the law of large numbers: for large N the frequencies f_i converge (in probability) to the desired posterior P_i while the sample average \bar{a} = \sum a_i f_i converges (also in probability) to the expected value \langle a \rangle = A.

Now we consider the set of N systems treated jointly. The probability of a particular frequency distribution f = (f_1 . . . f_m) generated by the prior q is given by the multinomial distribution,

Q_N(f \mid q) = \frac{N!}{n_1! \cdots n_m!}\, q_1^{n_1} \cdots q_m^{n_m} \quad\text{with}\quad \sum_{i=1}^{m} n_i = N .    (6.72)

When the n_i are sufficiently large we can use Stirling's approximation,

\log n! = n \log n - n + \log\sqrt{2\pi n} + O(1/n) .    (6.73)

Then

\log Q_N(f \mid q) \approx N \log N - N + \log\sqrt{2\pi N} - \sum_i \left[ n_i \log n_i - n_i + \log\sqrt{2\pi n_i} - n_i \log q_i \right]
= -N \sum_i \frac{n_i}{N} \log \frac{n_i}{N q_i} - \sum_i \log\sqrt{\frac{n_i}{N}} - (m-1) \log\sqrt{2\pi N}
= N S[f, q] - \sum_i \log\sqrt{f_i} - (m-1) \log\sqrt{2\pi N} ,    (6.74)

where S[f, q] is the \eta = 0 entropy given by eq.(6.13). Therefore for large N the multinomial probability can be written as

Q_N(f \mid q) \approx C_N \left( \prod_i f_i \right)^{-1/2} \exp\!\left( N S[f, q] \right)    (6.75)

where C_N is a normalization constant. The Gibbs inequality S[f, q] \leq 0, eq.(4.21), shows that for large N the probability Q_N(f \mid q) has an exceedingly sharp peak. The most likely frequency distribution is numerically equal to the probability distribution q_i. This is the weak law of large numbers. Equivalently, we can rewrite it as

\frac{1}{N} \log Q_N(f \mid q) \approx S[f, q] + r_N ,    (6.76)

where r_N is a correction that vanishes as N \to \infty. This means that finding the most probable frequency distribution is equivalent to maximizing the entropy S[f, q].

The most probable frequency distribution that satisfies the constraint \bar{a} = A is the distribution that maximizes Q_N(f \mid q) subject to the constraint \bar{a} = A, which is equivalent to maximizing the entropy S[f, q] subject to \bar{a} = A.
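Before taking the limit, the asymptotic formula just obtained is easy to verify numerically. The sketch below (Python, with an arbitrary prior q and frequency f) evaluates the exact multinomial log-probability of eq.(6.72) and shows that (1/N) log Q_N(f|q) approaches S[f, q] as N grows, in accordance with eqs.(6.74)-(6.76).

```python
import numpy as np
from scipy.special import gammaln

# Arbitrary illustrative prior and frequency distribution over m = 3 states.
q = np.array([0.2, 0.5, 0.3])
f = np.array([0.25, 0.45, 0.30])

def log_QN(n, q):
    # exact log of the multinomial probability, eq. (6.72), via log-Gamma functions
    return gammaln(n.sum() + 1) - np.sum(gammaln(n + 1)) + np.sum(n * np.log(q))

S_fq = -np.sum(f * np.log(f / q))           # the eta = 0 entropy, eq. (6.13)

for N in [100, 1000, 10000, 100000]:
    n = np.round(N * f)                      # integer counts realizing the frequencies
    print(N, log_QN(n, q) / n.sum(), S_fq)   # (1/N) log Q_N approaches S[f, q]
```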
In the limit of large N the frequencies f_i converge (in probability) to the desired posterior P_i while the sample average \bar{a} = \sum a_i f_i converges (also in probability) to the expected value \langle a \rangle = A. The two procedures agree only when we choose \eta = 0. Inferences carried out with \eta \neq 0 are not consistent with inferences from the law of large numbers. This is the Principle of Eliminative Induction in action: it is the successful falsification of all rival \eta-entropies that corroborates the surviving entropy with \eta = 0. The reason the competing \eta-entropies are discarded is clear: \eta \neq 0 is inconsistent with the law of large numbers.

[Csiszar 84] and [Grendar 01] have argued that the asymptotic argument above provides by itself a valid justification for the ME method of updating. An agent whose prior is q receives the information \langle a \rangle = A, which can be reasonably interpreted as a sample average \bar{a} = A over a large ensemble of N trials. The agent's beliefs are updated so that the posterior P coincides with the most probable f distribution. This is quite compelling but, of course, as a justification of the ME method it is restricted to situations where it is natural to think in terms of ensembles with large N. This justification is not nearly as compelling for singular events for which large ensembles either do not exist or are too unnatural and contrived. From our point of view the asymptotic argument above does not by itself provide a fully convincing justification for the universal validity of the ME method but it does provide considerable inductive support. It serves as a valuable consistency check that must be passed by any inductive inference procedure that claims to be of general applicability.

6.4 Random remarks

6.4.1 On deductive vs. inductive systems

In a deductive axiomatic system certain statements are chosen as axioms and other statements called theorems are derived from them. The theorems can be asserted to be true only when conditions are such that the axioms hold true. Within a deductive axiomatic system it makes no sense to make assertions that go beyond the reach of applicability of the axioms. In contrast the purpose of eliminative induction is precisely to venture into regions beyond those known special cases – the axioms – and accordingly, the truth of the resulting inferences – the theorems – is not guaranteed.

A second interesting difference is that in a deductive system there is a certain preference for minimizing the number of axioms as this clarifies the relations among various elements of the system and the structure of the whole. In contrast, when doing induction one strives to maximize the number of axioms as it is much safer to induce from many known instances than from just a few.

6.4.2 On priors

All entropies are relative entropies. In the case of a discrete variable, if one assigns equal a priori probabilities, q_i = 1, one obtains the Boltzmann-Gibbs-Shannon entropy, S[p] = -\sum_i p_i \log p_i. The notation S[p] has a serious drawback: it misleads one into thinking that S depends on p only. In particular, we emphasize that whenever S[p] is used, the prior measure q_i = 1 has been implicitly assumed.
In Shannon's axioms, for example, this choice is implicitly made in his first axiom, when he states that the entropy is a function of the probabilities, S = S(p_1 . . . p_n), and nothing else, and also in his second axiom when the uniform distribution p_i = 1/n is singled out for special treatment.

The absence of an explicit reference to a prior q_i may erroneously suggest that prior distributions have been rendered unnecessary and can be eliminated. It suggests that it is possible to transform information (i.e., constraints) directly into posterior distributions in a totally objective and unique way. This was Jaynes' hope for the MaxEnt program. If this were true the old controversy, of whether probabilities are subjective or objective, would have been resolved – probabilities would ultimately be totally objective. But the prior q_i = 1 is implicit in S[p]; the postulate of equal a priori probabilities or Laplace's "Principle of Insufficient Reason" still plays a major, though perhaps hidden, role. Any claims that probabilities assigned using maximum entropy will yield absolutely objective results are unfounded; not all subjectivity has been eliminated. Just as with Bayes' theorem, what is objective here is the manner in which information is processed to update from a prior to a posterior, and not the prior probabilities themselves.

Choosing the prior density q(x) can be tricky. Sometimes symmetry considerations can be useful in fixing the prior (three examples were given in section 4.5) but otherwise there is no fixed set of rules to translate information into a probability distribution except, of course, for Bayes' theorem and the ME method themselves.

What if the prior q(x) vanishes for some values of x? S[p, q] can be infinitely negative when q(x) vanishes within some region D. In other words, the ME method confers an overwhelming preference on those distributions p(x) that vanish whenever q(x) does. One must emphasize that this is as it should be; it is not a problem. A similar situation also arises in the context of Bayes' theorem where a vanishing prior represents a tremendously serious commitment because no amount of data to the contrary would allow us to revise it. In both ME and Bayes updating we should recognize the implications of assigning a vanishing prior. Assigning a very low but non-zero prior represents a safer and less prejudiced representation of one's beliefs. For more on the choice of priors see the review [Kass Wasserman 96]; in particular for entropic priors see [Rodriguez 90-03, Caticha Preuss 04].

6.4.3 Comments on other axiomatizations

One feature that distinguishes the axiomatizations proposed by various authors is how they justify maximizing a functional. In other words, why maximum entropy? In the approach of Shore and Johnson this question receives no answer; it is just one of the axioms. Csiszar provides a better answer. He derives the 'maximize a functional' rule from reasonable axioms of regularity and locality [Csiszar 91]. In Skilling's approach and in the one developed here the rule is not derived, but it does not go unexplained either: it is imposed by design; it is justified by the function that S is supposed to perform, namely to achieve a transitive ranking.
Both Shore and Johnson and Csiszar require, and it is not clear why, that updating from a prior must lead to a unique posterior, and accordingly, there is a restriction that the constraints define a convex set. In Skilling's approach and in the one advocated here there is no requirement of uniqueness; we are perfectly willing to entertain situations where the available information points to several equally preferable distributions.

There is another important difference between the axiomatic approach presented by Csiszar and the present one. Since our ME method is a method for induction we are justified in applying the method as if it were of universal applicability. As with all inductive procedures, any particular instance of induction can turn out to be wrong – because, for example, not all relevant information has been taken into account – but this does not change the fact that ME is still the unique inductive inference method that generalizes from the special cases chosen as axioms. Csiszar's version of the MaxEnt method is not designed to generalize beyond the axioms. His method was developed for linear constraints and therefore he does not feel justified in carrying out his deductions beyond the case of linear constraints. In our case, the application to non-linear constraints is precisely the kind of induction the ME method was designed to perform.

It is interesting that if, instead of axiomatizing the inference process, one axiomatizes the entropy itself by specifying those properties expected of a measure of separation between (possibly unnormalized) distributions, one is led to a continuum of \eta-entropies [Amari 85],

S_\eta[p, q] = \frac{1}{\eta(\eta+1)} \int dx\, \left[ (\eta+1)\, p - \eta\, q - p^{\eta+1} q^{-\eta} \right] ,    (6.77)

labelled by a parameter \eta. These entropies are equivalent, for the purpose of updating, to the relative Renyi entropies [Renyi 61, Aczel 75]. The shortcoming of this approach is that it is not clear when and how such entropies are to be used, which features of a probability distribution are being updated and which preserved, or even in what sense these entropies measure an amount of information. Remarkably, if one further requires that S_\eta be additive over independent sources of uncertainty, as any self-respecting measure ought to be, then the continuum in \eta is restricted to just the two values \eta = 0 and \eta = -1 which correspond to the entropies S[p, q] and S[q, p].

For the special case when p is normalized and a uniform prior q = 1 we get (dropping the integral over q)

S_\eta = \frac{1}{\eta} \left[ 1 - \frac{1}{\eta+1} \int dx\, p^{\eta+1} \right] .    (6.78)

A related entropy

S'_\eta = \frac{1}{\eta} \left[ 1 - \int dx\, p^{\eta+1} \right]    (6.79)

has been proposed in [Tsallis 88] and forms the foundation of his non-extensive statistical mechanics. Clearly these two entropies are equivalent in that they generate equivalent variational problems – maximizing S_\eta is equivalent to maximizing S'_\eta. To conclude our brief remarks on the entropies S_\eta we point out that, quite apart from the difficulty of achieving consistency with the law of large numbers, some of the probability distributions obtained by maximizing S_\eta may also be derived through a more standard use of MaxEnt or ME as advocated in these lectures [Plastino 94].
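The claimed equivalence of (6.78) and (6.79) as ranking criteria is easily illustrated: for normalized p the two differ only by an affine transformation of \int p^{\eta+1}, and therefore (for \eta > -1) they order any collection of distributions identically. A minimal check, with arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.7                                    # arbitrary illustrative value (eta > -1, eta != 0)

def S_eta(p):                                # discrete analogue of eq. (6.78)
    return (1.0 - np.sum(p**(eta + 1)) / (eta + 1)) / eta

def S_tsallis(p):                            # discrete analogue of eq. (6.79)
    return (1.0 - np.sum(p**(eta + 1))) / eta

# Sample a few hundred normalized distributions and compare the two rankings.
ps = rng.dirichlet(np.ones(4), size=300)
r1 = np.argsort([S_eta(p) for p in ps])
r2 = np.argsort([S_tsallis(p) for p in ps])
print(np.array_equal(r1, r2))                # True: the two entropies rank identically
```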
6.5 Bayes' rule as a special case of ME

Since the ME method and Bayes' rule are both designed for updating probabilities, and both invoke a Principle of Minimal Updating, it is important to explore the relations between them. In particular we would like to know if the two are mutually consistent or not [Caticha Giffin 06].

As described in section 2.10 the goal is to update our beliefs about \theta \in \Theta (\theta represents one or many parameters) on the basis of three pieces of information: (1) the prior information codified into a prior distribution q(\theta); (2) the data x \in X (obtained in one or many experiments); and (3) the known relation between \theta and x given by the model as defined by the sampling distribution or likelihood, q(x|\theta). The updating consists of replacing the prior probability distribution q(\theta) by a posterior distribution P(\theta) that applies after the data has been processed.

The crucial element that will allow Bayes' rule to be smoothly incorporated into the ME scheme is the realization that before the data information is available not only do we not know \theta, we do not know x either. Thus, the relevant space for inference is not \Theta but the product space \Theta \times X and the relevant joint prior is q(x, \theta) = q(\theta)\, q(x|\theta). We should emphasize that the information about how x is related to \theta is contained in the functional form of the distribution q(x|\theta) – for example, whether it is a Gaussian or a Cauchy distribution or something else – and not in the actual values of the arguments x and \theta which are, at this point, still unknown.

Next we collect data and the observed values turn out to be X. We must update to a posterior that lies within the family of distributions p(x, \theta) that reflect the fact that x is now known,

p(x) = \int d\theta\, p(\theta, x) = \delta(x - X) .    (6.80)

This data information constrains but is not sufficient to determine the joint distribution

p(x, \theta) = p(x)\, p(\theta|x) = \delta(x - X)\, p(\theta|X) .    (6.81)

Any choice of p(\theta|X) is in principle possible. So far the formulation of the problem parallels section 2.10 exactly. We are, after all, solving the same problem. Next we apply the ME method and show that we get the same answer.

According to the ME method the selected joint posterior P(x, \theta) is that which maximizes the entropy,

S[p, q] = -\int dx\, d\theta\, p(x, \theta) \log \frac{p(x, \theta)}{q(x, \theta)} ,    (6.82)

subject to the appropriate constraints. Note that the information in the data, eq.(6.80), represents an infinite number of constraints on the family p(x, \theta): for each value of x there is one constraint and one Lagrange multiplier \lambda(x). Maximizing S, (6.82), subject to (6.80) and normalization,

\delta\left[ S + \alpha\left( \int dx\, d\theta\, p(x, \theta) - 1 \right) + \int dx\, \lambda(x)\left( \int d\theta\, p(x, \theta) - \delta(x - X) \right) \right] = 0 ,    (6.83)

yields the joint posterior,

P(x, \theta) = \frac{q(x, \theta)\, e^{\lambda(x)}}{Z} ,    (6.84)

where Z is a normalization constant, and the multiplier \lambda(x) is determined from (6.80),

\int d\theta\, \frac{q(x, \theta)\, e^{\lambda(x)}}{Z} = \frac{q(x)\, e^{\lambda(x)}}{Z} = \delta(x - X) ,    (6.85)

so that the joint posterior is

P(x, \theta) = q(x, \theta)\, \frac{\delta(x - X)}{q(x)} = \delta(x - X)\, q(\theta|x) .    (6.86)

The corresponding marginal posterior probability P(\theta) is

P(\theta) = \int dx\, P(\theta, x) = q(\theta|X) = q(\theta)\, \frac{q(X|\theta)}{q(X)} ,    (6.87)

which is recognized as Bayes' rule, eq.(2.123). Thus Bayes' rule is consistent with, and indeed, is a special case of the ME method.
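A small numerical illustration of this result may be useful. In a discrete toy problem (all numbers below are hypothetical) the data constraint (6.80) forces p(x, θ) = 0 for x ≠ X, so the only remaining freedom is the distribution r(θ) ≡ p(X, θ); maximizing the entropy (6.82) over r subject to normalization then returns exactly the Bayes posterior (6.87).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import xlogy

q_theta = np.array([0.1, 0.2, 0.3, 0.4])                   # prior q(theta)
q_x_given_theta = np.array([[0.7, 0.2, 0.1],
                            [0.5, 0.3, 0.2],
                            [0.2, 0.5, 0.3],
                            [0.1, 0.3, 0.6]])               # likelihood q(x|theta)
X = 1                                                       # observed datum

q_joint_X = q_theta * q_x_given_theta[:, X]                 # q(X, theta)

def neg_S(r):
    # -S[p, q] restricted to p(X, theta) = r(theta), all other entries being zero
    return float(np.sum(xlogy(r, r) - xlogy(r, q_joint_X)))

res = minimize(neg_S, x0=np.full(4, 0.25), method='SLSQP',
               constraints=[{'type': 'eq', 'fun': lambda r: r.sum() - 1.0}],
               bounds=[(1e-12, 1.0)] * 4)

P_theta_ME = res.x
P_theta_Bayes = q_joint_X / q_joint_X.sum()                 # q(theta) q(X|theta) / q(X)
print(np.round(P_theta_ME, 4))
print(np.round(P_theta_Bayes, 4))                           # the two agree
```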
To summarize: the prior q(x, \theta) = q(x)\, q(\theta|x) is updated to the posterior P(x, \theta) = P(x)\, P(\theta|x) where P(x) = \delta(x - X) is fixed by the observed data while P(\theta|X) = q(\theta|X) remains unchanged. Note that in accordance with the philosophy that drives the ME method one only updates those aspects of one's beliefs for which corrective new evidence has been supplied.

I conclude with a few simple examples that show how the ME allows generalizations of Bayes' rule. The background for these generalized Bayes problems is the familiar one: we want to make inferences about some variables \theta on the basis of information about other variables x. As before, the prior information consists of our prior knowledge about \theta given by the distribution q(\theta), and the relation between x and \theta is given by the likelihood q(x|\theta); thus, the prior joint distribution q(x, \theta) is known. But now the information about x is much more limited.

Bayes updating with uncertain data

The data is uncertain: x is not known. The marginal posterior p(x) is no longer a sharp delta function but some other known distribution, p(x) = P_D(x). This is still an infinite number of constraints

p(x) = \int d\theta\, p(\theta, x) = P_D(x) ,    (6.88)

that are easily handled by ME. Maximizing S, (6.82), subject to (6.88) and normalization, leads to

P(x, \theta) = P_D(x)\, q(\theta|x) .    (6.89)

The corresponding marginal posterior,

P(\theta) = \int dx\, P_D(x)\, q(\theta|x) = q(\theta) \int dx\, P_D(x)\, \frac{q(x|\theta)}{q(x)} ,    (6.90)

is known as Jeffrey's rule, which we met earlier in section 2.10.

Bayes updating with information about x moments

Now we have even less information: p(x) is not known. All we know about p(x) is an expected value

\langle f \rangle = \int dx\, p(x)\, f(x) = F .    (6.91)

Maximizing S, (6.82), subject to (6.91) and normalization,

\delta\left[ S + \alpha\left( \int dx\, d\theta\, p(x, \theta) - 1 \right) + \lambda\left( \int dx\, d\theta\, p(x, \theta)\, f(x) - F \right) \right] = 0 ,    (6.92)

yields the joint posterior,

P(x, \theta) = \frac{q(x, \theta)\, e^{\lambda f(x)}}{Z} ,    (6.93)

where the normalization constant Z and the multiplier \lambda are obtained from

Z = \int dx\, q(x)\, e^{\lambda f(x)} \quad\text{and}\quad \frac{d \log Z}{d\lambda} = F .    (6.94)

The corresponding marginal posterior is

P(\theta) = q(\theta) \int dx\, \frac{e^{\lambda f(x)}}{Z}\, q(x|\theta) .    (6.95)

These two examples (6.90) and (6.95) are sufficiently intuitive that one could have written them down directly without deploying the full machinery of the ME method, but they do serve to illustrate the essential compatibility of Bayesian and Maximum Entropy methods. Next we consider a slightly less trivial example.

Updating with data and information about \theta moments

Here we follow [Giffin Caticha 07]. In addition to data about x we have additional information about \theta in the form of a constraint on the expected value of some function f(\theta),

\int dx\, d\theta\, P(x, \theta)\, f(\theta) = \langle f(\theta) \rangle = F .    (6.96)

In the standard Bayesian practice it is possible to impose constraint information at the level of the prior, but this information need not be preserved in the posterior. What we do here that differs from the standard Bayes' rule is that we can require that the constraint (6.96) be satisfied by the posterior distribution.
Maximizing the entropy (6.82) subject to normalization, the data constraint (6.80), and the moment constraint (6.96) yields the joint posterior,

    P(x, θ) = q(x, θ) e^{λ(x) + β f(θ)}/z,    (6.97)

where z is a normalization constant,

    z = ∫ dx dθ e^{λ(x) + β f(θ)} q(x, θ).    (6.98)

The Lagrange multipliers λ(x) are determined from the data constraint (6.80),

    e^{λ(x)}/z = δ(x − X)/[Z q(X)]   where   Z(β, X) = ∫ dθ e^{β f(θ)} q(θ|X),    (6.99)

so that the joint posterior becomes

    P(x, θ) = δ(x − X) q(θ|X) e^{β f(θ)}/Z.    (6.100)

The remaining Lagrange multiplier β is determined by imposing that the posterior P(x, θ) satisfy the constraint (6.96). This yields an implicit equation for β,

    ∂ log Z/∂β = F.    (6.101)

Note that since Z = Z(β, X) the resulting β will depend on the observed data X. Finally, the new marginal distribution for θ is

    P(θ) = q(θ|X) e^{β f(θ)}/Z = q(θ) [q(X|θ)/q(X)] e^{β f(θ)}/Z.    (6.102)

For β = 0 (no moment constraint) we recover Bayes' rule. For β ≠ 0 Bayes' rule is modified by a "canonical" exponential factor.

6.6 Commuting and non-commuting constraints

The ME method allows one to process information in the form of constraints. When we are confronted with several constraints we must be particularly cautious. In what order should they be processed? Or should they be processed together? The answer depends on the problem at hand. (Here we follow [Giffin Caticha 07].)

We refer to constraints as commuting when it makes no difference whether they are handled simultaneously or sequentially. The most common example is that of Bayesian updating on the basis of data collected in multiple experiments: for the purpose of inferring θ it is well known that the order in which the observed data x′ = {x′_1, x′_2, ...} is processed does not matter. (See section 2.10.3.) The proof that ME is completely compatible with Bayes' rule implies that data constraints implemented through δ functions, as in (6.80), commute. It is useful to see how this comes about.

When an experiment is repeated it is common to refer to the value of x in the first experiment and the value of x in the second experiment. This is a dangerous practice because it obscures the fact that we are actually talking about two separate variables. We do not deal with a single x but with a composite x = (x_1, x_2), and the relevant space is X_1 × X_2 × Θ. After the first experiment yields the value X_1, represented by the constraint c_1: P(x_1) = δ(x_1 − X_1), we can perform a second experiment that yields X_2, represented by a second constraint c_2: P(x_2) = δ(x_2 − X_2). These constraints c_1 and c_2 commute because they refer to different variables x_1 and x_2. An experiment, once performed and its outcome observed, cannot be un-performed and its result cannot be un-observed by a second experiment. Thus, imposing the second constraint does not imply a revision of the first.

In general constraints need not commute and when this is the case the order in which they are processed is critical. For example, suppose the prior is q and we receive information in the form of a constraint, C_1. To update we maximize the entropy S[p, q] subject to C_1, leading to the posterior P_1 as shown in Figure 6.1. Next we receive a second piece of information described by the constraint C_2.
At this point we can proceed in essentially two different ways:

(a) Sequential updating. Having processed C_1, we use P_1 as the current prior and maximize S[p, P_1] subject to the new constraint C_2. This leads us to the posterior P_a.

(b) Simultaneous updating. Use the original prior q and maximize S[p, q] subject to both constraints C_1 and C_2 simultaneously. This leads to the posterior P_b.[4]

[4] At first sight it might appear that there exists a third possibility of simultaneous updating: (c) use P_1 as the current prior and maximize S[p, P_1] subject to both constraints C_1 and C_2 simultaneously. Fortunately, and this is a valuable check of the consistency of the ME method, it is easy to show that case (c) is equivalent to case (b). Whether we update from q or from P_1 the selected posterior is P_b.

[Figure 6.1: Illustrating the difference between processing two constraints C_1 and C_2 sequentially (q → P_1 → P_a) and simultaneously (q → P_b or q → P_1 → P_b).]

To decide which path, (a) or (b), is appropriate we must be clear about how the ME method handles constraints. The ME machinery interprets a constraint such as C_1 in a very mechanical way: all distributions satisfying C_1 are in principle allowed and all distributions violating C_1 are ruled out. Updating to a posterior P_1 consists precisely in revising those aspects of the prior q that disagree with the new constraint C_1. However, there is nothing final about the distribution P_1. It is just the best we can do in our current state of knowledge, and it may happen that future information will require us to revise it further. Indeed, when new information C_2 is received we must reconsider whether the original constraint C_1 remains valid or not. Are all distributions satisfying the new C_2 really allowed, even those that violate C_1? If we decide that this is the case then the new C_2 takes over and we update from the current P_1 to the final posterior P_a. The old constraint C_1 may still exert some limited influence on the final posterior P_a through its effect on the intermediate posterior P_1, but from now on C_1 is considered obsolete and will be ignored.

Alternatively, we may decide that the old constraint C_1 retains its validity. The new C_2 is not meant to replace the old C_1 but to provide an additional refinement of the family of allowed posteriors. If this is the case, then the constraint that correctly reflects the new information is not C_2 but the more restrictive C_1 ∧ C_2. The two constraints C_1 and C_2 should be processed simultaneously to arrive at the correct posterior P_b.

To summarize: sequential updating is appropriate when old constraints become obsolete and are superseded by new information; simultaneous updating is appropriate when old constraints remain valid. The two cases refer to different states of information and therefore we expect that they will result in different inferences. These comments are meant to underscore the importance of understanding what information is being processed; failure to do so will lead to errors that do not reflect a shortcoming of the ME method but rather a misapplication of it.

6.7 Information geometry

This section provides a very brief introduction to an important subject that deserves a much more extensive treatment. [Amari 85, Amari Nagaoka 00]
Consider a family of distributions p(x|θ) labelled by a finite number of parameters θ^i, i = 1...n. It is usually possible to think of the family of distributions p(x|θ) as a manifold – an n-dimensional space that is locally isomorphic to R^n.[5] The distributions p(x|θ) are points in this "statistical manifold" with coordinates given by the parameters θ^i. We can introduce the idea of a distance between two such points – that is, a 'distance' between probability distributions. The distance dℓ between two neighboring points θ and θ + dθ is given by a generalization of Pythagoras' theorem in terms of a metric tensor g_ij,[6]

    dℓ² = g_ij dθ^i dθ^j.    (6.103)

[5] Of course it is possible to conceive of sufficiently singular families of distributions that are not smooth manifolds. This does not detract from the value of the methods of information geometry any more than the existence of spaces with complicated geometries detracts from the general value of geometry itself.

[6] The use of superscripts rather than subscripts for the indices labelling coordinates is a standard and very convenient notational convention in differential geometry. We adopt the standard Einstein convention of summing over repeated indices when one appears as a superscript and the other as a subscript.

The singular importance of the metric tensor g_ij derives from a most remarkable theorem due to Čencov that we mention without proof. [Cencov 81, Campbell 86] The theorem states that the metric g_ij on the manifold of probability distributions is unique: there is only one metric that takes into account the fact that these are not distances between simple structureless dots but between probability distributions. Up to a scale factor, which merely reflects a choice of units, the unique distance is given by the information metric which we introduce below in three independent but intuitively appealing ways.

6.7.1 Derivation from distinguishability

We seek a quantitative measure of the extent that two distributions p(x|θ) and p(x|θ + dθ) can be distinguished. The following argument is intuitively appealing. Consider the relative difference,

    [p(x|θ + dθ) − p(x|θ)]/p(x|θ) = [∂ log p(x|θ)/∂θ^i] dθ^i.    (6.104)

The expected value of the relative difference might seem a good candidate, but it does not work because it vanishes identically,

    ∫ dx p(x|θ) [∂ log p(x|θ)/∂θ^i] dθ^i = dθ^i (∂/∂θ^i) ∫ dx p(x|θ) = 0.    (6.105)

However, the variance does not vanish,

    dℓ² = ∫ dx p(x|θ) [∂ log p(x|θ)/∂θ^i][∂ log p(x|θ)/∂θ^j] dθ^i dθ^j.    (6.106)

This is the measure of distinguishability we seek; a small value of dℓ² means the points θ and θ + dθ are difficult to distinguish. It suggests introducing the matrix g_ij,

    g_ij ≡ ∫ dx p(x|θ) [∂ log p(x|θ)/∂θ^i][∂ log p(x|θ)/∂θ^j],    (6.107)

called the Fisher information matrix [Fisher 25], so that

    dℓ² = g_ij dθ^i dθ^j.    (6.108)
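As a quick sanity check of definition (6.107) – not part of the lectures – one can evaluate the integral numerically for a two-parameter Gaussian family p(x|μ, σ) and compare with the standard result g = diag(1/σ², 2/σ²). The grid, the finite-difference step, and the test values of μ and σ below are illustrative assumptions.

```python
# Numerical evaluation of the Fisher information matrix (6.107) for p(x|mu,sigma)
# Gaussian; the exact answer is diag(1/sigma^2, 2/sigma^2).
import numpy as np

def log_p(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def fisher_matrix(mu, sigma, eps=1e-5):
    # wide grid for the x-integral; scores (d log p / d theta) by central differences
    x = np.linspace(mu - 12 * sigma, mu + 12 * sigma, 200001)
    p = np.exp(log_p(x, mu, sigma))
    d_mu = (log_p(x, mu + eps, sigma) - log_p(x, mu - eps, sigma)) / (2 * eps)
    d_sigma = (log_p(x, mu, sigma + eps) - log_p(x, mu, sigma - eps)) / (2 * eps)
    scores = [d_mu, d_sigma]
    g = np.empty((2, 2))
    for i in range(2):
        for j in range(2):
            g[i, j] = np.trapz(p * scores[i] * scores[j], x)   # eq. (6.107)
    return g

mu, sigma = 1.0, 2.0
print(fisher_matrix(mu, sigma))          # approx [[0.25, 0], [0, 0.5]]
print(1 / sigma**2, 2 / sigma**2)        # exact values 0.25, 0.5
```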
Up to now no notion of distance has been introduced. Normally one says that the reason it is difficult to distinguish two points in, say, the three-dimensional space we seem to inhabit, is that they happen to be too close together. It is very tempting to invert the logic and assert that the two points θ and θ + dθ must be very close together because they are difficult to distinguish. Furthermore, note that being a variance, dℓ² is positive and vanishes only when dθ vanishes. Thus it is natural to interpret g_ij as the metric tensor of a Riemannian space [Rao 45]. It is known as the information metric. The recognition by Rao that g_ij is a metric in the space of probability distributions gave rise to the subject of information geometry [Amari 85], namely, the application of geometrical methods to problems in inference and in information theory.

A disadvantage of this heuristic argument is that it does not make explicit a crucial property mentioned above: that except for an overall multiplicative constant this metric is unique. [Cencov 81, Campbell 86]

The coordinates θ are quite arbitrary; one can freely relabel the points in the manifold. It is then easy to check that the g_ij are the components of a tensor and that the distance dℓ² is an invariant, a scalar under coordinate transformations.

6.7.2 Derivation from a Euclidean metric

Consider a discrete variable a = 1...n. The restriction to discrete variables is not a serious limitation: we can choose n sufficiently large to approximate a continuous distribution to any desired degree. The possible probability distributions of a can be labelled by the probability values themselves: a probability distribution can be specified by a point p with coordinates (p^1 ... p^n). The corresponding statistical manifold is the simplex S_{n−1} = {p = (p^1 ... p^n) : Σ_a p^a = 1}.

Next we change to new coordinates ψ^a = (p^a)^{1/2}. In these new coordinates the equation for the simplex S_{n−1} – the normalization condition – reads Σ_a (ψ^a)² = 1, which we recognize as the equation of an (n−1)-sphere embedded in an n-dimensional Euclidean space R^n, provided the ψ^a are interpreted as Cartesian coordinates. This suggests that we assign the simplest possible metric: the distance between the distribution p(ψ) and its neighbor p(ψ + dψ) is the Euclidean distance in R^n,

    dℓ² = Σ_a (dψ^a)² = δ_ab dψ^a dψ^b.    (6.109)

Distances between more distant distributions are merely angles defined on the surface of the sphere S_{n−1}.

Except for an overall constant this is the same information metric (6.108) we defined earlier! Indeed, consider an m-dimensional subspace (m ≤ n − 1) of the sphere S_{n−1} defined by ψ = ψ(θ^1, ..., θ^m). The parameters θ^i, i = 1...m, can be used as coordinates on the subspace. The Euclidean metric on R^n induces a metric on the subspace. The distance between p(θ) and p(θ + dθ) is

    dℓ² = δ_ab dψ^a dψ^b = δ_ab (∂ψ^a/∂θ^i)(∂ψ^b/∂θ^j) dθ^i dθ^j
        = (1/4) Σ_a p^a [∂ log p^a/∂θ^i][∂ log p^a/∂θ^j] dθ^i dθ^j,    (6.110)

which (except for the factor 1/4) we recognize as the discrete version of (6.107) and (6.108).

This interesting result does not constitute a "derivation." There is a priori no reason why the coordinates ψ should be singled out as special and attributed a Euclidean metric. But perhaps it helps to lift the veil of mystery that might otherwise surround the strange expression (6.107).

6.7.3 Derivation from relative entropy

The "derivation" that follows has the merit of drawing upon our intuition about relative entropy.
Consider the entropy of one distribution p(x|θ′) relative to another p(x|θ),

    S(θ′, θ) = −∫ dx p(x|θ′) log [p(x|θ′)/p(x|θ)].    (6.111)

We study how this entropy varies when θ′ = θ + dθ is in the close vicinity of a given θ. As we saw in section 4.2 – recall the Gibbs inequality S(θ′, θ) ≤ 0 with equality if and only if θ′ = θ – the entropy S(θ′, θ) attains an absolute maximum at θ′ = θ. Therefore, the first non-vanishing term in the Taylor expansion about θ is second order in dθ,

    S(θ + dθ, θ) = (1/2) [∂²S(θ′, θ)/∂θ′^i ∂θ′^j]_{θ′=θ} dθ^i dθ^j + ... ≤ 0,    (6.112)

and we use this quadratic form to define the information metric,

    g_ij ≡ −[∂²S(θ′, θ)/∂θ′^i ∂θ′^j]_{θ′=θ},    (6.113)

so that

    S(θ + dθ, θ) = −(1/2) dℓ².    (6.114)

It is straightforward to show that (6.113) coincides with (6.107).

6.7.4 Volume elements in curved spaces

Having decided on a measure of distance we can now also measure angles, areas, volumes and all sorts of other geometrical quantities. Here we only consider calculating the m-dimensional volume of the manifold of distributions p(x|θ) labelled by parameters θ^i with i = 1...m.

The parameters θ^i are coordinates for the point p, and in these coordinates it may not be obvious how to write down an expression for a volume element dV. But within a sufficiently small region – which is what a volume element is – any curved space looks flat. Curved spaces are 'locally flat'. The idea then is rather simple: within that very small region we should use Cartesian coordinates, where the metric takes a very simple form, the identity matrix δ_ab. In locally Cartesian coordinates φ^a the volume element is simply given by the product

    dV = dφ^1 dφ^2 ... dφ^m,    (6.115)

which, in terms of the old coordinates θ^i, is

    dV = |∂φ/∂θ| dθ^1 dθ^2 ... dθ^m = |∂φ/∂θ| d^mθ.    (6.116)

This is the volume we seek written in terms of the coordinates θ. Our remaining problem consists in calculating the Jacobian |∂φ/∂θ| of the transformation that takes the metric g_ij into its Euclidean form δ_ab.

Let the locally Cartesian coordinates be defined by φ^a = Φ^a(θ^1, ..., θ^m). A small change dθ corresponds to a small change dφ,

    dφ^a = X^a_i dθ^i   where   X^a_i ≡ ∂φ^a/∂θ^i,    (6.117)

and the Jacobian is given by the determinant of the matrix X^a_i,

    |∂φ/∂θ| = |det(X^a_i)|.    (6.118)

The distance between two neighboring points is the same whether we compute it in terms of the old or the new coordinates,

    dℓ² = g_ij dθ^i dθ^j = δ_ab dφ^a dφ^b.    (6.119)

Thus the relation between the old and the new metric is

    g_ij = δ_ab X^a_i X^b_j.    (6.120)

The right hand side represents the product of three matrices. Taking the determinant we get

    g ≡ det(g_ij) = [det(X^a_i)]²,    (6.121)

so that

    |det(X^a_i)| = g^{1/2}.    (6.122)

We have succeeded in expressing the volume element entirely in terms of the coordinates θ and the known metric g_ij(θ). The answer is

    dV = g^{1/2}(θ) d^mθ.    (6.123)

The volume of any extended region on the manifold is

    V = ∫ dV = ∫ g^{1/2}(θ) d^mθ.    (6.124)
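Equations (6.123)-(6.124) can be illustrated with a one-parameter example. For the Bernoulli family the Fisher metric is g(θ) = 1/[θ(1−θ)]; the sketch below checks numerically that the volume ∫ g^{1/2} dθ of a region is unchanged when the parameter is relabeled, provided g transforms as a tensor. The particular reparametrization η = θ² and the integration limits are arbitrary choices made for illustration.

```python
# Invariance of the volume element g^(1/2)(theta) d theta, eq. (6.123),
# checked for the Bernoulli family with g(theta) = 1/(theta(1-theta)).
import numpy as np

def g_theta(theta):
    return 1.0 / (theta * (1.0 - theta))

theta_lo, theta_hi = 0.05, 0.95
theta = np.linspace(theta_lo, theta_hi, 200001)
V_theta = np.trapz(np.sqrt(g_theta(theta)), theta)            # volume in coordinate theta

# Relabel the points with eta = theta**2 (an arbitrary monotone map).
# The metric transforms as a tensor: g(eta) = g(theta(eta)) * (d theta / d eta)**2.
eta = np.linspace(theta_lo**2, theta_hi**2, 200001)
theta_of_eta = np.sqrt(eta)
dtheta_deta = 0.5 / np.sqrt(eta)
g_eta = g_theta(theta_of_eta) * dtheta_deta**2
V_eta = np.trapz(np.sqrt(g_eta), eta)                         # same volume in coordinate eta

# Exact value: the antiderivative of 1/sqrt(theta(1-theta)) is 2*arcsin(sqrt(theta)).
V_exact = 2 * (np.arcsin(np.sqrt(theta_hi)) - np.arcsin(np.sqrt(theta_lo)))
print(V_theta, V_eta, V_exact)   # all approximately equal
```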
After this technical detour we are now ready to return to the main subject of this chapter – updating probabilities – and derive one last and very important feature of the ME method.

6.8 Deviations from maximum entropy

There is one last issue that must be addressed before one can claim that the design of the ME method is more or less complete. Higher entropy represents higher preference but there is nothing in the previous arguments to tell us by how much. Does twice the entropy represent twice the preference or four times as much? We can rank probability distributions p relative to a prior q according to the relative entropy S[p, q], but any monotonic function of the relative entropy will accomplish the same goal. Once we have decided that the distribution of maximum entropy is to be preferred over all others the following question arises: suppose the maximum of the entropy function is not particularly sharp, are we really confident that distributions with entropy close to the maximum are totally ruled out? Can we quantify 'preference'? We want a quantitative measure of the extent to which distributions with lower entropy are ruled out. The discussion below follows [Caticha 00].

Suppose we have maximized the entropy S[p, q] subject to certain constraints and obtain a probability distribution p_0(x). The question we now address concerns the extent to which p_0(x) should be preferred over other distributions with lower entropy. Consider a family of distributions p(x|θ) labelled by a finite number of parameters θ^i, i = 1...n. We assume that the p(x|θ) satisfy the same constraints that led us to select p_0(x) and that p_0(x) itself is included in the family. Further, we choose the parameters θ so that p_0(x) = p(x|θ = 0). The question about the extent that p(x|θ = 0) is to be preferred over p(x|θ ≠ 0) is a question about the probability p(θ) of various values of θ: what is the rational degree of belief that the selected value should be θ?

The original problem which led us to design the maximum entropy method was to assign a probability to x; we now see that the full problem is to assign probabilities to both x and θ. We are concerned not just with p(x) but rather with the joint distribution P(x, θ); the universe of discourse has been expanded from X (the space of xs) to X × Θ (Θ is the space of parameters θ).

To determine the joint distribution P(x, θ) we make use of essentially the only method at our disposal – the ME method itself – but this requires that we address the standard two preliminary questions: first, what is the prior distribution, what do we know about x and θ before we receive information about the constraints? And second, what is this new information that constrains the allowed P(x, θ)?

The first question is the more subtle one: when we know absolutely nothing about the θs we know neither their physical meaning nor whether there is any relation to the xs. A prior that reflects this lack of correlations is a product, q(x, θ) = q(x) μ(θ). We will assume that the prior over x is known – it is the prior we had used when we updated from q(x) to p_0(x). Since we are totally ignorant about θ we would like to choose μ(θ) so that it reflects a uniform distribution, but here we stumble upon a problem: uniform means that equal volumes in Θ are assigned equal probabilities, and knowing nothing about the θs we do not yet know what "equal" volumes in Θ could possibly mean. We need some additional information.
Suppose next that we are told that the θs represent probability distributions; they are parameters labeling some unspecified distributions p(x|θ). We do not yet know the functional form of p(x|θ), but if the θs derive their meaning solely from the p(x|θ) then there exists a natural measure of distance in the space Θ. It is the information metric g_ij(θ) introduced in the previous section, and the corresponding volume elements are given by g^{1/2}(θ) d^nθ, where g(θ) is the determinant of the metric. The uniform prior for θ, which assigns equal probabilities to equal volumes, is proportional to g^{1/2}(θ), and therefore we choose μ(θ) = g^{1/2}(θ).

Next we tackle the second question: what are the constraints on the allowed joint distributions P(x, θ)? Consider the space of all joint distributions. To each choice of the functional form of p(x|θ) (whether we talk about Gaussians, Boltzmann-Gibbs distributions, or something else) there corresponds a different subspace defined by distributions of the form P(x, θ) = p(θ) p(x|θ). The crucial constraint is that which specifies the subspace, that is, the particular functional form for p(x|θ). This is what gives meaning to the θs – for example, the θs could be the mean and variance in Gaussian distributions, or Lagrange multipliers in Boltzmann-Gibbs distributions. It also fixes the prior μ(θ) on the relevant subspace. Notice that the kind of constraint that we impose here is very different from those that appear in the usual applications of the maximum entropy method, which are in the form of expectation values.

To select the preferred distribution P(x, θ) we maximize the entropy S[P, g^{1/2}q] over all distributions of the form P(x, θ) = p(θ) p(x|θ) by varying with respect to p(θ) with p(x|θ) fixed. It is convenient to write the entropy as

    S[P, g^{1/2}q] = −∫ dx dθ p(θ) p(x|θ) log { p(θ) p(x|θ) / [g^{1/2}(θ) q(x)] }
                   = S[p, g^{1/2}] + ∫ dθ p(θ) S(θ),    (6.125)

where

    S[p, g^{1/2}] = −∫ dθ p(θ) log [p(θ)/g^{1/2}(θ)]    (6.126)

and

    S(θ) = −∫ dx p(x|θ) log [p(x|θ)/q(x)].    (6.127)

The notation shows that S[p, g^{1/2}] is a functional of p(θ) while S(θ) is a function of θ (it is also a functional of p(x|θ)). Maximizing (6.125) with respect to variations δp(θ) such that ∫ dθ p(θ) = 1 yields

    0 = ∫ dθ { −log [p(θ)/g^{1/2}(θ)] + S(θ) − log ζ } δp(θ),    (6.128)

where the required Lagrange multiplier has been written as 1 − log ζ. Therefore the probability that the value of θ should lie within the small volume g^{1/2}(θ) d^nθ is

    p(θ) d^nθ = (1/ζ) e^{S(θ)} g^{1/2}(θ) d^nθ   with   ζ = ∫ d^nθ g^{1/2}(θ) e^{S(θ)}.    (6.129)

Equation (6.129) is the result we seek. It tells us that, as expected, the preferred value of θ is that which maximizes the entropy S(θ), eq.(6.127), because this maximizes the scalar probability density exp S(θ). But it also tells us the degree to which values of θ away from the maximum are ruled out. For macroscopic systems the preference for the ME distribution can be overwhelming. Eq.(6.129) agrees with the Einstein thermodynamic fluctuation theory and extends it beyond the regime of small fluctuations – in the next section we deal with fluctuations as an illustration.
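A simple illustration of eq.(6.129): take the family p(x|θ) to be Gaussians of fixed width σ whose mean is shifted by θ, with the prior q(x) centered at zero, so that p_0(x) = p(x|θ = 0). Then S(θ) = −θ²/(2σ²), g^{1/2} is constant, and (6.129) says the preference for values of θ near the maximum-entropy value falls off as a Gaussian of width σ. The numerical sketch below is not part of the lectures; the value of σ, the grids, and the quadrature are illustrative assumptions.

```python
# Entropic prior p(theta) ~ exp(S(theta)) g^(1/2)(theta), eq. (6.129), for the
# one-parameter family p(x|theta) = N(theta, sigma^2) with q(x) = N(0, sigma^2).
import numpy as np

sigma = 1.5
x = np.linspace(-20, 20, 4001)
q_x = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def S(theta):
    """Entropy of p(x|theta) relative to q(x), eq. (6.127), by quadrature."""
    p = np.exp(-(x - theta)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    return -np.trapz(p * np.log(p / q_x), x)       # equals -theta^2/(2 sigma^2) here

thetas = np.linspace(-5, 5, 201)
g_half = np.full_like(thetas, 1.0 / sigma)          # sqrt of the Fisher info, constant here
w = g_half * np.exp([S(t) for t in thetas])         # unnormalized e^{S} g^{1/2}
p_theta = w / np.trapz(w, thetas)                   # the entropic prior (6.129), normalized

gauss = np.exp(-thetas**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
print(np.max(np.abs(p_theta - gauss)))              # small (~1e-3): p(theta) is N(0, sigma^2)
```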
Note also that the density exp S(θ) is a scalar function, and the presence of the Jacobian factor g^{1/2}(θ) makes eq.(6.129) manifestly invariant under changes of the coordinates θ^i in the space Θ.

We conclude this section by pointing out that there are a couple of interesting points of analogy between the pair {maximum likelihood method / Bayes' rule} on one hand and the corresponding pair {MaxEnt / ME} on the other. Note that maximizing the likelihood function L(θ|x) ≡ p(x|θ) selects a single preferred value of θ but no measure is given of the extent to which other values of θ are ruled out. The method of maximum likelihood does not provide us with a distribution for θ – the likelihood function L(θ|x) is not a probability distribution for θ. Similarly, maximizing entropy as prescribed by the MaxEnt method yields a single preferred value of the label θ but MaxEnt fails to address the question of the extent to which other values of θ are ruled out. Neither Bayes' rule nor the ME method suffers from this limitation. The second point of analogy is that neither the maximum likelihood nor the MaxEnt methods are capable of handling information contained in prior distributions, while both Bayesian and ME methods can. This limitation of maximum likelihood and MaxEnt is not surprising since neither method was designed for updating probabilities.

6.9 An application to fluctuations

The starting point for the standard formulation of the theory of fluctuations in thermodynamic systems (see [Landau 77, Callen 85]) is Einstein's inversion of Boltzmann's formula S = k log W to obtain the probability of a fluctuation in the form W ~ exp(S/k). A careful justification, however, reveals a number of approximations which, for most purposes, are legitimate and work very well. A re-examination of fluctuation theory from the point of view of ME is, however, valuable. Our general conclusion is that the ME point of view allows exact formulations; in fact, it is clear that deviations from the canonical predictions can be expected, although in general they will be negligible. Other advantages of the ME approach include the explicit covariance under changes of coordinates, the absence of restrictions to the vicinity of equilibrium or to large systems, and the conceptual ease with which one deals with fluctuations of both the extensive variables and their conjugate intensive variables. [Caticha 00]

This last point is an important one: within the canonical formalism (section 4.8) the extensive variables such as the energy are uncertain while the intensive ones such as the temperature or the Lagrange multiplier β are fixed parameters; they do not fluctuate. There are, however, several contexts in which it makes sense to talk about fluctuations of the conjugate variables. We discuss the standard scenario of an open system that can exchange, say, energy with its environment.

Consider the usual setting of a thermodynamical system with microstates labelled by x. Let m(x) dx be the number of microstates within the range dx. According to the postulate of "equal a priori probabilities" we choose a uniform prior distribution proportional to the density of states m(x).
The canonical ME distribution obtained by maximizing S[p, m] subject to constraints on the expected values ⟨f^k⟩ = F^k of the relevant variables f^k(x) is

    p(x|F) = (1/Z(λ)) m(x) e^{−λ_k f^k(x)}   with   Z(λ) = ∫ dx m(x) e^{−λ_k f^k(x)},    (6.130)

and the corresponding entropy is

    S(F) = log Z(λ) + λ_k F^k.    (6.131)

Fluctuations of the variables f^k(x), or of any other function of the microstate x, are usually computed in terms of the various moments of p(x|F). Within this context all expected values, such as the constraints ⟨f^k⟩ = F^k and the entropy S(F) itself, are fixed; they do not fluctuate. The corresponding conjugate variables, the Lagrange multipliers λ_k = ∂S/∂F^k, eq.(4.77), do not fluctuate either.

The standard way to make sense of λ fluctuations is to couple the system of interest to a second system, a bath, and allow exchanges of the quantities f^k. All quantities referring to the bath will be denoted by primes: the microstates are x′, the density of states is m′(x′), the variables are f′^k(x′), etc. Even though the overall expected value ⟨f^k + f′^k⟩ = F^k_T of the combined system plus bath is fixed, the individual expected values ⟨f^k⟩ = F^k and ⟨f′^k⟩ = F′^k = F^k_T − F^k are allowed to fluctuate. The ME distribution p_0(x, x′) that best reflects the prior information contained in m(x) and m′(x′), updated by the information on the total F^k_T, is

    p_0(x, x′) = (1/Z_0) m(x) m′(x′) e^{−λ_{0k} [f^k(x) + f′^k(x′)]}.    (6.132)

But distributions with less than maximum entropy are not totally ruled out; to explore the possibility that the quantities F^k_T are distributed between the two systems in a less than optimal way we consider distributions p(x, x′, F) constrained to the form

    P(x, x′, F) = p(F) p(x|F) p(x′|F_T − F),    (6.133)

where p(x|F) is the canonical distribution in eq.(6.130), its entropy is eq.(6.131), and analogous expressions hold for the primed quantities. We are now ready to write down the probability that the value of F fluctuates into a small volume g^{1/2}(F) dF. From eq.(6.129) we have

    p(F) dF = (1/ζ) e^{S_T(F)} g^{1/2}(F) dF,    (6.134)

where ζ is a normalization constant and the entropy S_T(F) of the system plus the bath is

    S_T(F) = S(F) + S′(F_T − F).    (6.135)

The formalism simplifies considerably when the bath is large enough that exchanges of F do not affect it, and λ′ remains fixed at λ_0. Then

    S′(F_T − F) = log Z′(λ_0) + λ_{0k} (F^k_T − F^k) = const − λ_{0k} F^k.    (6.136)

It remains to calculate the determinant g(F) of the information metric given by eq.(6.113),

    g_ij = −∂²S_T(Ḟ, F)/∂Ḟ^i ∂Ḟ^j = −(∂²/∂Ḟ^i ∂Ḟ^j)[ S(Ḟ, F) + S′(F_T − Ḟ, F_T − F) ],    (6.137)

where the dot indicates that the derivatives act on the first argument. The first term on the right is

    ∂²S(Ḟ, F)/∂Ḟ^i ∂Ḟ^j = −(∂²/∂Ḟ^i ∂Ḟ^j) ∫ dx p(x|Ḟ) log { [p(x|Ḟ)/m(x)] [m(x)/p(x|F)] }
                        = ∂²S(F)/∂F^i ∂F^j + ∫ dx [∂²p(x|F)/∂F^i ∂F^j] log [p(x|F)/m(x)].    (6.138)

To calculate the integral on the right use

    log [p(x|F)/m(x)] = −log Z(λ) − λ_k f^k(x)    (6.139)

(from eq.(6.130)), so that the integral vanishes,

    −log Z(λ) (∂²/∂F^i ∂F^j) ∫ dx p(x|F) − λ_k (∂²/∂F^i ∂F^j) ∫ dx p(x|F) f^k(x) = 0.    (6.140)
Similarly,

    (∂²/∂Ḟ^i ∂Ḟ^j) S′(F_T − Ḟ, F_T − F) = ∂²S′(F_T − F)/∂F^i ∂F^j
        + ∫ dx′ [∂²p(x′|F_T − F)/∂F^i ∂F^j] log [p(x′|F_T − F)/m′(x′)],    (6.141)

and here, using eq.(6.136), both terms vanish. Therefore

    g_ij = −∂²S(F)/∂F^i ∂F^j.    (6.142)

We conclude that the probability that the value of F fluctuates into a small volume g^{1/2}(F) dF becomes

    p(F) dF = (1/ζ) e^{S(F) − λ_{0k} F^k} g^{1/2}(F) dF.    (6.143)

This equation is exact. An important difference with the usual theory stems from the presence of the Jacobian factor g^{1/2}(F). This is required by coordinate invariance and can lead to small deviations from the canonical predictions. The quantities ⟨λ_k⟩ and F^k may be close to, but will not in general coincide with, the quantities λ_{0k} and F^k_0 at the point where the scalar probability density attains its maximum. For most thermodynamic systems, however, the maximum is very sharp. In its vicinity the Jacobian can be considered constant, and one obtains the usual results [Landau 77], namely, that the probability distribution for the fluctuations is given by the exponential of a Legendre transform of the entropy.

The remaining difficulties are purely computational and of the kind that can in general be tackled systematically using the method of steepest descent to evaluate the appropriate generating function. Since we are not interested in variables referring to the bath we can integrate eq.(6.133) over x′ and use the distribution P(x, F) = p(F) p(x|F) to compute various moments. As an example, the correlation between δλ_i = λ_i − ⟨λ_i⟩ and δf^j = f^j − ⟨f^j⟩ or δF^j = F^j − ⟨F^j⟩ is

    ⟨δλ_i δf^j⟩ = ⟨δλ_i δF^j⟩ = −∂⟨λ_i⟩/∂λ_{0j} + (λ_{0i} − ⟨λ_i⟩)(F^j_0 − ⟨F^j⟩).    (6.144)

When the differences λ_{0i} − ⟨λ_i⟩ or F^j_0 − ⟨F^j⟩ are negligible one obtains the usual expression,

    ⟨δλ_i δf^j⟩ ≈ −δ^j_i.    (6.145)

6.10 Conclusion

Any Bayesian account of the notion of information cannot ignore the fact that Bayesians are concerned with the beliefs of rational agents. The relation between information and beliefs must be clearly spelled out. The definition we have proposed – that information is that which constrains rational beliefs and therefore forces the agent to change its mind – is convenient for two reasons. First, it makes the information/belief relation very explicit, and second, the definition is ideally suited for quantitative manipulation using the ME method.

Dealing with uncertainty requires that one solve two problems. First, one must represent a state of knowledge as a consistent web of interconnected beliefs. The instrument for doing so is probability. Second, when new information becomes available the beliefs must be updated. The instrument for this is relative entropy. It is the only candidate for an updating method that is of universal applicability and obeys the moral injunction that one should not change one's mind frivolously. Prior information is valuable and should not be revised except when demanded by new evidence, in which case the revision is no longer optional but obligatory. The resulting general method – the ME method – can handle arbitrary priors and arbitrary constraints; it includes MaxEnt and Bayes' rule as special cases; and it provides its own criterion to assess the extent that non-maximum-entropy distributions are ruled out.
To conclude, I cannot help but express my continued sense of wonder and astonishment that the method for reasoning under uncertainty – which should presumably apply to the whole of science – turns out to rest upon an ethical foundation of intellectual honesty. The moral imperative is to uphold those beliefs and only those beliefs that obey very strict constraints of consistency; the allowed beliefs must be consistent among themselves, and they must be consistent with the available information. Just imagine the implications![7]

[7] For possible developments and applications of these ideas, which we hope will be the subject of future additions to these lectures, see the "Suggestions for further reading."

References

[Aczel 75] J. Aczél and Z. Daróczy, On Measures of Information and their Characterizations (Academic Press, New York, 1975).
[Amari 85] S. Amari, Differential-Geometrical Methods in Statistics (Springer-Verlag, 1985).
[Amari Nagaoka 00] S. Amari and H. Nagaoka, Methods of Information Geometry (Am. Math. Soc./Oxford U. Press, Providence, 2000).
[Brillouin 52] L. Brillouin, Science and Information Theory (Academic Press, New York, 1952).
[Callen 85] H. B. Callen, Thermodynamics and an Introduction to Thermostatistics (Wiley, New York, 1985).
[Campbell 86] L. L. Campbell, Proc. Am. Math. Soc. 98, 135 (1986).
[Caticha 03] A. Caticha, "Relative Entropy and Inductive Inference," Bayesian Inference and Maximum Entropy Methods in Science and Engineering, ed. by G. Erickson and Y. Zhai, AIP Conf. Proc. 707, 75 (2004) (arXiv.org/abs/physics/0311093).
[Caticha Giffin 06] A. Caticha and A. Giffin, "Updating Probabilities," in Bayesian Inference and Maximum Entropy Methods in Science and Engineering, ed. by A. Mohammad-Djafari, AIP Conf. Proc. 872, 31 (2006) (arXiv.org/abs/physics/0608185).
[Caticha 07] A. Caticha, "Information and Entropy," in Bayesian Inference and Maximum Entropy Methods in Science and Engineering, ed. by K. Knuth et al., AIP Conf. Proc. 954, 11 (2007).
[Caticha Preuss 04] A. Caticha and R. Preuss, "Maximum entropy and Bayesian data analysis: entropic prior distributions," Phys. Rev. E70, 046127 (2004) (arXiv.org/abs/physics/0307055).
[Cencov 81] N. N. Čencov, Statistical Decision Rules and Optimal Inference, Transl. Math. Monographs, vol. 53, Am. Math. Soc. (Providence, 1981).
[Cover Thomas 91] T. Cover and J. Thomas, Elements of Information Theory (Wiley, 1991).
[Cox 46] R. T. Cox, "Probability, Frequency and Reasonable Expectation," Am. J. Phys. 14, 1 (1946); The Algebra of Probable Inference (Johns Hopkins, Baltimore, 1961).
[Cropper 86] W. H. Cropper, "Rudolf Clausius and the road to entropy," Am. J. Phys. 54, 1068 (1986).
[Csiszar 84] I. Csiszár, "Sanov property, generalized I-projection and a conditional limit theorem," Ann. Prob. 12, 768 (1984).
[Csiszar 85] I. Csiszár, "An extended Maximum Entropy Principle and a Bayesian justification," in Bayesian Statistics 2, p. 83, ed. by J. M. Bernardo, M. H. de Groot, D. V. Lindley, and A. F. M. Smith (North Holland, 1985); "MaxEnt, mathematics and information theory," Maximum Entropy and Bayesian Methods, p. 35, ed. by K. M. Hanson and R. N. Silver (Kluwer, 1996).
[Csiszar 91] I. Csiszár, "Why least squares and maximum entropy: an axiomatic approach to inference for linear inverse problems," Ann. Stat. 19, 2032 (1991).
[Diaconis 82] P. Diaconis and S. L. Zabell, "Updating Subjective Probabilities," J. Am. Stat. Assoc. 77, 822 (1982).
[Earman 92] J. Earman, Bayes or Bust? A Critical Examination of Bayesian Confirmation Theory (MIT Press, Cambridge, 1992).
[Fisher 25] R. A. Fisher, Proc. Cambridge Philos. Soc. 22, 700 (1925).
[Gibbs 1875-78] J. W. Gibbs, "On the Equilibrium of Heterogeneous Substances," Trans. Conn. Acad. III (1875-78); reprinted in The Scientific Papers of J. W. Gibbs (Dover, NY, 1961).
[Gibbs 1902] J. W. Gibbs, Elementary Principles in Statistical Mechanics (Yale U. Press, New Haven, 1902; reprinted by Ox Bow Press, Connecticut, 1981).
[Giffin Caticha 07] A. Giffin and A. Caticha, "Updating Probabilities with Data and Moments," in Bayesian Inference and Maximum Entropy Methods in Science and Engineering, ed. by K. Knuth et al., AIP Conf. Proc. 954, 74 (2007).
[Grad 61] H. Grad, "The Many Faces of Entropy," Comm. Pure and Appl. Math. 14, 323 (1961), and "Levels of Description in Statistical Mechanics and Thermodynamics" in Delaware Seminar in the Foundations of Physics, ed. by M. Bunge (Springer-Verlag, New York, 1967).
[Gregory 05] P. C. Gregory, Bayesian Logical Data Analysis for the Physical Sciences (Cambridge U. Press, 2005).
[Grendar 03] M. Grendar, Jr. and M. Grendar, "Maximum Probability and Maximum Entropy Methods: Bayesian interpretation," in Bayesian Inference and Maximum Entropy Methods in Science and Engineering, ed. by G. Erickson and Y. Zhai, AIP Conf. Proc. 707, p. 490 (2004) (arXiv.org/abs/physics/0308005).
[Hacking 01] I. Hacking, An Introduction to Probability and Inductive Logic (Cambridge U. P., 2001).
[Howson Urbach 93] C. Howson and P. Urbach, Scientific Reasoning, the Bayesian Approach (Open Court, Chicago, 1993).
[James 1907] W. James, Pragmatism (Dover, 1995) and The Meaning of Truth (Prometheus, 1997).
[Jeffrey 04] R. Jeffrey, Subjective Probability, the Real Thing (Cambridge U. P., 2004).
[Jeffreys 39] H. Jeffreys, Theory of Probability (Oxford U. P., 1939).
[Jaynes 57a] E. T. Jaynes, "How does the Brain do Plausible Reasoning," Stanford Univ. Microwave Lab. report 421 (1957), also published in Maximum Entropy and Bayesian Methods in Science and Engineering, G. J. Erickson and C. R. Smith (eds.) (Kluwer, 1988) and online at http://bayes.wustl.edu.
[Jaynes 57b] E. T. Jaynes, "Information Theory and Statistical Mechanics," Phys. Rev. 106, 620 and 108, 171 (1957).
[Jaynes 65] E. T. Jaynes, "Gibbs vs. Boltzmann Entropies," Am. J. Phys. 33, 391 (1965) (online at http://bayes.wustl.edu).
[Jaynes 83] E. T. Jaynes, Papers on Probability, Statistics and Statistical Physics, edited by R. D. Rosenkrantz (Reidel, Dordrecht, 1983), and papers online at http://bayes.wustl.edu.
[Jaynes 85] E. T. Jaynes, "Bayesian Methods: General Background," in Maximum Entropy and Bayesian Methods in Applied Statistics, J. H. Justice (ed.) (Cambridge U. P., 1985) and at http://bayes.wustl.edu.
[Jaynes 88] E. T. Jaynes, "The Evolution of Carnot's Principle," pp. 267-281 in Maximum Entropy and Bayesian Methods in Science and Engineering, ed. by G. J. Erickson and C. R. Smith (Kluwer, 1988) (online at http://bayes.wustl.edu).
[Jaynes 92] E. T. Jaynes, "The Gibbs Paradox," in Maximum Entropy and Bayesian Methods, ed. by C. R. Smith, G. J. Erickson and P. O. Neudorfer (Kluwer, Dordrecht, 1992).
[Jaynes 03] E. T. Jaynes, Probability Theory: The Logic of Science, edited by G. L. Bretthorst (Cambridge U. Press, 2003).
[Karbelkar 86] S. N. Karbelkar, "On the axiomatic approach to the maximum entropy principle of inference," Pramana – J. Phys. 26, 301 (1986).
[Kass Wasserman 96] R. E. Kass and L. Wasserman, J. Am. Stat. Assoc. 91, 1343 (1996).
[Klein 70] M. J. Klein, "Maxwell, His Demon, and the Second Law of Thermodynamics," American Scientist 58, 84 (1970).
[Klein 73] M. J. Klein, "The Development of Boltzmann's Statistical Ideas," in The Boltzmann Equation, ed. by E. G. D. Cohen and W. Thirring (Springer Verlag, 1973).
[Kullback 59] S. Kullback, Information Theory and Statistics (Wiley, New York, 1959).
[Landau 77] L. D. Landau and E. M. Lifshitz, Statistical Physics (Pergamon, New York, 1977).
[Lucas 70] J. R. Lucas, The Concept of Probability (Clarendon Press, Oxford, 1970).
[Mehra 98] J. Mehra, "Josiah Willard Gibbs and the Foundations of Statistical Mechanics," Found. Phys. 28, 1785 (1998).
[von Mises 57] R. von Mises, Probability, Statistics and Truth (Dover, 1957).
[Plastino 94] A. R. Plastino and A. Plastino, Phys. Lett. A193, 140 (1994).
[Rao 45] C. R. Rao, Bull. Calcutta Math. Soc. 37, 81 (1945).
[Renyi 61] A. Rényi, "On measures of entropy and information," Proc. 4th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, p. 547 (U. of California Press, 1961).
[Rodriguez 90] C. C. Rodríguez, "Objective Bayesianism and geometry," in Maximum Entropy and Bayesian Methods, P. F. Fougère (ed.) (Kluwer, Dordrecht, 1990).
[Rodriguez 91] C. C. Rodríguez, "Entropic priors," in Maximum Entropy and Bayesian Methods, edited by W. T. Grandy Jr. and L. H. Schick (Kluwer, Dordrecht, 1991).
[Rodriguez 02] C. C. Rodríguez, "Entropic Priors for Discrete Probabilistic Networks and for Mixtures of Gaussian Models," in Bayesian Inference and Maximum Entropy Methods in Science and Engineering, ed. by R. L. Fry, AIP Conf. Proc. 617, 410 (2002).
[Rodriguez 03] C. C. Rodríguez, "A Geometric Theory of Ignorance" (omega.albany.edu:8008/ignorance/ignorance03.pdf).
[Savage 72] L. J. Savage, The Foundations of Statistics (Dover, 1972).
[Shannon 48] C. E. Shannon, "The Mathematical Theory of Communication," Bell Syst. Tech. J. 27, 379 (1948).
[Shannon Weaver 49] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication (U. Illinois Press, Urbana, 1949).
[Shore Johnson 80] J. E. Shore and R. W. Johnson, "Axiomatic derivation of the Principle of Maximum Entropy and the Principle of Minimum Cross-Entropy," IEEE Trans. Inf. Theory IT-26, 26 (1980); "Properties of Cross-Entropy Minimization," IEEE Trans. Inf. Theory IT-27, 26 (1981).
[Sivia Skilling 06] D. S. Sivia and J. Skilling, Data Analysis: a Bayesian Tutorial (Oxford U. Press, 2006).
[Skilling 88] J. Skilling, "The Axioms of Maximum Entropy," in Maximum-Entropy and Bayesian Methods in Science and Engineering, G. J. Erickson and C. R. Smith (eds.) (Kluwer, Dordrecht, 1988).
[Skilling 89] J. Skilling, "Classic Maximum Entropy," in Maximum Entropy and Bayesian Methods, ed. by J. Skilling (Kluwer, Dordrecht, 1989).
[Skilling 90] J. Skilling, "Quantified Maximum Entropy," in Maximum Entropy and Bayesian Methods, ed. by P. F. Fougère (Kluwer, Dordrecht, 1990).
[Stapp 72] H. P. Stapp, "The Copenhagen Interpretation," Am. J. Phys. 40, 1098 (1972).
[Tribus 69] M. Tribus, Rational Descriptions, Decisions and Designs (Pergamon, New York, 1969).
[Tsallis 88] C. Tsallis, J. Stat. Phys. 52, 479 (1988).
[Tseng Caticha 01] C.-Y. Tseng and A. Caticha, "Yet another resolution of the Gibbs paradox: an information theory approach," in Bayesian Inference and Maximum Entropy Methods in Science and Engineering, ed. by R. L. Fry, AIP Conf. Proc. Vol. 617, p. 331 (2002) (arXiv.org/abs/cond-mat/0109324).
[Uffink 95] J. Uffink, "Can the Maximum Entropy Principle be explained as a consistency requirement?" Studies in History and Philosophy of Modern Physics 26B, 223 (1995).
[Uffink 03] J. Uffink, "Irreversibility and the Second Law of Thermodynamics," in Entropy, ed. by A. Greven et al. (Princeton U. Press, 2003).
[Uffink 04] J. Uffink, "Boltzmann's Work in Statistical Physics," in The Stanford Encyclopedia of Philosophy (http://plato.stanford.edu).
[Williams 80] P. M. Williams, Brit. J. Phil. Sci. 31, 131 (1980).
[Wilson 81] S. S. Wilson, "Sadi Carnot," Scientific American, August 1981, p. 134.

Suggestions for further reading

Here is a very incomplete and very biased list of references on topics that we plan to include in future editions of these lectures. The topics range from inference proper – the assignment of priors, information geometry, model selection, inductive inquiry, evolutionary Bayes – to the applications of all these ideas to the foundations of quantum, classical, statistical, and gravitational physics.

[Caticha 98a] A. Caticha, "Consistency and Linearity in Quantum Theory," Phys. Lett. A244, 13 (1998) (arXiv.org/abs/quant-ph/9803086).
[Caticha 98b] A. Caticha, "Consistency, Amplitudes and Probabilities in Quantum Theory," Phys. Rev. A57, 1572 (1998) (arXiv.org/abs/quant-ph/9804012).
[Caticha 98c] A. Caticha, "Insufficient reason and entropy in quantum theory," Found. Phys. 30, 227 (2000).
[Caticha 00] A. Caticha, "Maximum entropy, fluctuations and priors," in Bayesian Methods and Maximum Entropy in Science and Engineering, ed. by A. Mohammad-Djafari, AIP Conf. Proc. 568, 94 (2001) (arXiv.org/abs/math-ph/0008017).
[Caticha 01] A. Caticha, "Entropic Dynamics," in Bayesian Methods and Maximum Entropy in Science and Engineering, ed. by R. L. Fry, AIP Conf. Proc. 617 (2002).
[Caticha 04] A. Caticha, "Questions, Relevance and Relative Entropy," in Bayesian Inference and Maximum Entropy Methods in Science and Engineering, ed. by R. Fischer et al., AIP Conf. Proc. Vol. 735 (2004) (arXiv.org/abs/gr-qc/0409175).
[Caticha 05] A. Caticha, "The Information Geometry of Space and Time," in Bayesian Inference and Maximum Entropy Methods in Science and Engineering, ed. by K. Knuth et al., AIP Conf. Proc. 803, 355 (2006).
[Caticha Cafaro 07] A. Caticha and C. Cafaro, "From Information Geometry to Newtonian Dynamics," in Bayesian Inference and Maximum Entropy Methods in Science and Engineering, ed. by K. Knuth et al., AIP Conf. Proc. 954, 165 (2007).
[CatichaN Kinouchi 98] N. Caticha and O. Kinouchi, "Time ordering in the evolution of information processing and modulation systems," Phil. Mag. B 77, 1565 (1998).
[CatichaN Neirotti 06] N. Caticha and J. P. Neirotti, "The evolution of learning systems: to Bayes or not to be," in Bayesian Inference and Maximum Entropy Methods in Science and Engineering, ed. by A. Mohammad-Djafari, AIP Conf. Proc. 872, 203 (2006).
[Dewar 03] R. Dewar, "Information theory explanation of the fluctuation theorem, maximum entropy production and self-organized criticality in non-equilibrium stationary states," J. Phys. A: Math. Gen. 36, 631 (2003).
[Dewar 05] R. Dewar, "Maximum entropy production and the fluctuation theorem," J. Phys. A: Math. Gen. 38, L371 (2005).
[Jaynes 79] E. T. Jaynes, "Where do we stand on maximum entropy?" in The Maximum Entropy Principle, ed. by R. D. Levine and M. Tribus (MIT Press, 1979).
[Knuth 02] K. H. Knuth, "What is a question?" in Bayesian Inference and Maximum Entropy Methods in Science and Engineering, ed. by C. Williams, AIP Conf. Proc. 659, 227 (2002).
[Knuth 03] K. H. Knuth, "Deriving laws from ordering relations," in Bayesian Inference and Maximum Entropy Methods in Science and Engineering, ed. by G. J. Erickson and Y. Zhai, AIP Conf. Proc. 707, 204 (2003).
[Knuth 05] K. H. Knuth, "Lattice duality: The origin of probability and entropy," Neurocomputing 67C, 245 (2005).
[Knuth 06] K. H. Knuth, "Valuations on lattices and their application to information theory," Proc. IEEE World Congress on Computational Intelligence (2006).
[Neirotti CatichaN 03] J. P. Neirotti and N. Caticha, "Dynamics of the evolution of learning algorithms by selection," Phys. Rev. E 67, 041912 (2003).
[Rodriguez 89] C. C. Rodríguez, "The metrics generated by the Kullback number," in Maximum Entropy and Bayesian Methods, J. Skilling (ed.) (Kluwer, Dordrecht, 1989) (omega.albany.edu:8008).
[Rodriguez 98] C. C. Rodríguez, "Are we cruising a hypothesis space?" (arxiv.org/abs/physics/9808009).
[Rodriguez 04] C. C. Rodríguez, "The Volume of Bitnets" (omega.albany.edu:8008/bitnets/bitnets.pdf).
[Rodriguez 05] C. C. Rodríguez, "The ABC of model selection: AIC, BIC and the new CIC" (omega.albany.edu:8008/CIC/me05.pdf).
[Tseng Caticha 04] C.-Y. Tseng and A. Caticha, "Maximum Entropy and the Variational Method in Statistical Mechanics: an Application to Simple Fluids"