A Thermodynamical Approach for Probability Estimation

Takashi Isozaki
Sony Computer Science Laboratories, Inc., 3-14-13 Higashigotanda, Shinagawa-ku, Tokyo 141-0022, Japan.
isozaki@csl.sony.co.jp

May 25, 2018

Abstract

The issue of discrete probability estimation for samples of small size is addressed in this study. The maximum likelihood method often suffers from overfitting when insufficient data is available. Although the Bayesian approach can avoid overfitting by using prior distributions, it still has problems with objective analysis. In response to these drawbacks, a new theoretical framework based on thermodynamics, in which energy and temperature are introduced, was developed. Entropy and likelihood are placed at the center of this method. The key principle of inference for probability mass functions is the minimum free energy, which is shown to unify the two principles of maximum likelihood and maximum entropy. Our method can robustly estimate probability functions from small-size data.

1 Introduction

A method for estimating probabilities of discrete random variables was developed. It is based on the key idea that statistical inference can be described by a combination of two frameworks, namely thermodynamics and information theory. Special attention is paid to the roles of temperature and entropy in the method; in other words, heat is introduced into statistical inference. This method, furthermore, has no free parameters, including the temperature. The proposed method makes it possible to unify the maximum likelihood principle and the maximum entropy principle for statistical inference, even from data of small sample sizes.

In recent times, the amount of available data of various kinds has been growing day by day. As a result, a large amount of data, not only for one variable but for many variables, can be obtained. Intuitively, obtaining conditional probabilities and joint probabilities can reduce the entropy of the variables of interest, which is guaranteed by information-theoretic inequalities. Note that in this paper a capital letter such as X denotes a discrete random variable, a lowercase letter such as x denotes a particular state of that variable, a bold capital letter denotes a set of variables, and a bold lowercase letter denotes a configuration of that set. Here, we adopt the Gibbs-Shannon entropy (Shannon, 1948) as entropy (we call this entropy the Shannon entropy hereafter), defined as

    H(X) = -\sum_x P(x) \log P(x),

where P is a probability mass function. The inequalities are thus given as

    H(X) \geq H(X | Y)

and

    H(\mathbf{X}) + H(\mathbf{Y}) \geq H(\mathbf{X}, \mathbf{Y}),

where equality holds under independence of the variables (Shannon, 1948). It is therefore preferable to use data generated from many variables because, obviously, highly predictive statistical inference requires low-entropy parameters. In discrete, many-variable systems, however, statistical estimation of conditional probabilities and joint probabilities needs exponentially large data because of the combinatorial explosion of events over variables.
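As a concrete illustration of these inequalities, the following sketch (a minimal numpy example; the joint distribution is made up for illustration and is not taken from this paper's experiments) computes H(X), H(X|Y), and the joint entropy for a small two-variable table and checks that H(X) >= H(X|Y) and H(X) + H(Y) >= H(X, Y).

    import numpy as np

    def entropy(p):
        """Shannon entropy in nats; zero-probability terms contribute 0."""
        p = np.asarray(p, dtype=float)
        nz = p > 0
        return -np.sum(p[nz] * np.log(p[nz]))

    # A hypothetical joint distribution P(X, Y) over 2 x 3 states.
    P_xy = np.array([[0.20, 0.10, 0.10],
                     [0.05, 0.25, 0.30]])

    P_x = P_xy.sum(axis=1)          # marginal P(X)
    P_y = P_xy.sum(axis=0)          # marginal P(Y)

    H_x = entropy(P_x)
    H_y = entropy(P_y)
    H_xy = entropy(P_xy.ravel())
    H_x_given_y = H_xy - H_y        # chain rule: H(X|Y) = H(X,Y) - H(Y)

    assert H_x >= H_x_given_y - 1e-12     # H(X) >= H(X|Y)
    assert H_x + H_y >= H_xy - 1e-12      # subadditivity
    print(H_x, H_x_given_y, H_xy)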
Although the maximum likelihood (ML) principle and its methods play significant roles in statistics and are regarded as the most general principle in statistics, it is known that ML methods often suffer from overfitting and that they are ineffective in cases such as an insufficiently large data size relative to the number of parameters.

The fact that ML methods often suffer from overfitting in multivariate statistical analysis with many parameters makes Bayesian statistics more and more attractive from the viewpoint of avoiding overfitting. Bayesian statistics incorporates background knowledge, which compensates for a shortage of data, and it has become increasingly popular in the natural sciences, including physics (Dose, 2003). It can be said that Bayesian statistics adds a prior, imaginary frequency of events to the real one. Furthermore, even in cases with no available prior knowledge, or in public analyses that must preclude prior knowledge in order to avoid generating unnecessary bias, Bayesian statistics can reduce overfitting by means of noninformative priors (e.g., Kass and Wasserman (1996)). For example, in discrete random-variable systems, Bayesian statistics usually uses Dirichlet distributions, which always have parameters. The parameters of Dirichlet prior distributions are often interpreted as prior samples. When those parameters are uniform and express no special prior knowledge, they can increase the entropy of ML estimators and thereby make the estimated probabilities more robust than those obtained from ML estimation. This feature of those parameters can be regarded as a generalization of the principle of insufficient reason proposed by Laplace.

Although noninformative priors have been widely used, they still have some problems (Kass and Wasserman, 1996; Irony and Singpurwalla, 1997; Robert, 2007). For example, Jeffreys' priors (Jeffreys, 1961), which are the most widely accepted noninformative priors, do not satisfy the axioms of probability and are thus said to be improper distributions. In addition, even the posterior can still be improper (Kass and Wasserman, 1996). As for statistical inference, it is thus probably reasonable to explore another principle or theoretical framework.

A theoretical framework for estimating probabilities of discrete variables for use in objective analysis is proposed in the following. The new framework keeps the good characteristic of Bayesian statistics of increasing the entropy obtained from ML estimators in the case with no prior knowledge. Entropy is regarded as representing the uncertainty of information. Accordingly, the entropy in the case of insufficient data size should be larger than that obtained from the hidden true distributions, because the smaller the data size is, the higher the uncertainty becomes. That is, we consider that entropy consists of uncertainty due to limited sample size and uncertainty due to the true probability distributions. It is therefore proposed that probability estimation should be regarded as searching for the optimal value of entropy according to the available data size and data properties. However, regarding limited sample size (which increases uncertainty), neither a satisfactory principle nor a method for optimally estimating entropy exists.
Jaynes (1957) proposed a method based on the maximum entropy principle, which has been used in some domains including physics (e.g., Berger et al. (1996); Huscroft et al. (2000); Caticha and Preuss (2004)). Our method utilizes frequency information, as ML methods do, in addition to the ME principle, and the two cooperate to correct biases due to small samples, whereas Jaynes' method does not. We regard this difference as an essential one between the two methods.

The theoretical framework on which the proposed probability estimation method is based consists of, and unifies, two well-known principles. The first is the maximum likelihood (ML) principle, which states that the best estimators should best reproduce the data and which is very effective in the case of sufficiently large data size. The second is the maximum entropy (ME) principle, which states that, within some constraints, no bias should be applied to particular internal states of variables as far as possible. However, each principle is contrary to the other because the expectation value of the minus-log likelihood equals the empirical entropy. In clear contrast to the ME principle, it is intuitively obvious that obtaining the estimator with the lowest entropy and the highest likelihood is preferable in the case of sufficiently large data.

Given the above-described conflict between the ML and ME principles, it is necessary to devise a method for analyzing samples of insufficient data size. It seems natural that there is a balancing value between the entropies given by the ML and ME principles. If it is assumed that such a balancing value exists, it is necessary to find the optimal point between the contrary principles in statistical inference. In a branch of natural science, namely thermodynamics, there is an analogy with the above-described trade-off between the ML and ME principles (Kittel and Kroemer, 1980; Callen, 1985). In thermodynamics, nature selects the state that achieves a balance between the minimum energy state and the maximum entropy state at a finite temperature. It is assumed here that this analogy applies to statistical inference in discrete variables. Consequently, temperature, which plays the role of a unit of measure, is introduced. This approach extends our preceding works (Isozaki et al., 2008, 2009), in which temperature was represented by an artificial model containing a free hyperparameter. In the present work, temperature is entirely redefined in a new method with no free hyperparameters. According to our proposed method, a new interpretation of probability estimated from data is presented, which is neither frequentism (Hájek, 1997) nor Bayesianism.

In the machine learning domain, some methods similar to ours have been proposed (Pereira et al., 1993; Ueda and Nakano, 1995; Hofmann, 1999; LeCun and Huang, 2005; Watanabe et al., 2009). Nevertheless, many studies that have applied free energy to statistical science have not included temperature, or have treated it as a controlled, fixed, or free parameter, apparently because of the lack of clarity of its meaning in data science. With regard to the existing research, therefore, we consider that the potential of free energies has not been fully exploited.
Similar methods in the context of robust estimation, in which a free parameter is introduced in a similar fashion, have also been investigated (Windham, 1995; Basu et al., 1998; Jones et al., 2001), but how to determine the free parameter for small samples remains an open question all the same.

This paper is organized as follows. In the next section, the basic theory based on thermodynamics is explained. The proposed probability estimation method is introduced in Section 3, where estimation methods for joint probabilities and conditional probabilities are also proposed. Section 4 presents the results of experiments using the probability estimation method. The relationships between our method and classical/Bayesian statistics are discussed in Section 5. Section 6 concludes this study.

2 Basic theory

In constructing an estimation method for finite discrete probability distributions from samples of finite size, we utilize both the Shannon entropy and the likelihood. However, a new principle is needed for combining the two concepts; accordingly, we assume that this principle lies in the thermodynamical framework. In the following, therefore, entropy, energy, temperature, and Helmholtz free energy are defined for this purpose. The postulates necessary for constructing the method, and its properties, are then described. Hereafter, multivariate random systems are treated without any prior knowledge. All probability distributions are assumed to be over discrete variables, and samples are assumed to be i.i.d. data. An extension to the case with available prior knowledge is discussed in Section 3.4.

2.1 Definitions and postulates

2.1.1 Entropy

Definition 2.1 (Entropy) The entropy H(X) of a discrete random variable X is defined, according to Shannon (1948), as

    H(X) := -\sum_x P(x) \log P(x).   (1)

If P(X) is an estimated probability function, H(X) is also an estimated function of P(X). Entropy is also denoted as H(P(X)) or H(P) in order to make it clear which distributions are used.

Joint entropy H(\mathbf{X}) of multivariate systems and conditional entropy H(X|Y) are defined as follows (Shannon, 1948; Cover and Thomas, 2006):

    H(\mathbf{X}) := -\sum_{\mathbf{x}} P(\mathbf{x}) \log P(\mathbf{x})   (2)

and

    H(X|Y) := -\sum_{x,y} P(x, y) \log P(x|y).   (3)

It should be noted that the probability mass function P(x) and the entropy H are quantities that should be estimated from data in this study. Accordingly, it should be emphasized that the entropy has two aspects of uncertainty: the first is the uncertainty that each true probability distribution peculiarly has; the second is the uncertainty that comes from the finiteness of the available data. To the author's knowledge, the second aspect of entropy has not been specifically discussed. Accordingly, we introduce a mechanism to estimate the optimal uncertainty under given finite available data.

2.1.2 Energy

The (internal) energy of a probability system is defined as follows. First, a distance-like quantity between two distributions is defined in the usual way.

Definition 2.2 (Kullback-Leibler divergence) The Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951) between two distributions of a random variable X, i.e., P(X) and Q(X), is given as follows:

    D(P(X) || Q(X)) := \sum_x P(x) \log \frac{P(x)}{Q(x)}.   (4)
For multivariate systems, D(P(\mathbf{X}) || Q(\mathbf{X})) can be defined in the same manner. The conditional KL divergence is defined as

    D(P(X|Y) || Q(X|Y)) := \sum_{x,y} P(x, y) \log \frac{P(x|y)}{Q(x|y)}.   (5)

Next, the cross entropy is defined, which is also useful for representing the energy.

Definition 2.3 (Cross entropy) The cross entropy of a discrete random variable X between probability distributions P(X) and Q(X), i.e., H(P(X), Q(X)), is defined as

    H(P(X), Q(X)) := -\sum_x P(x) \log Q(x).   (6)

The cross entropy is also denoted as H(P, Q). The following relationship between the KL divergence and the cross entropy is easily derived:

    D(P(X) || Q(X)) + H(P(X)) = H(P(X), Q(X)).   (7)

According to Jensen's inequality, D(P || Q) \geq 0 (Cover and Thomas, 2006), and hence H(P, Q) \geq H(P).

Empirical distribution functions are defined in the usual way.

Definition 2.4 (Empirical distributions) It is assumed that there are N samples of a random variable X: {y^{(1)}, ..., y^{(N)}}. An empirical distribution of X, \tilde{P}(X), is defined as

    \tilde{P}(X = x) = \frac{1}{N} \sum_{i=1}^{N} \delta(x - y^{(i)}),   (8)

where \delta(x - y) = 1 if x = y and \delta(x - y) = 0 if x \neq y.

\tilde{P}(X) is the relative frequency, that is, a maximum likelihood (ML) estimator, which is denoted by \hat{P}(X).

Definition 2.5 (Information energy) The (internal) energy is defined by using the Kullback-Leibler divergence as a distortion between the distribution of a target and an empirical distribution:

    U_0(X) := D(P_1(X) || P_2(X)),   (9)

where P_1(X) denotes a target mass function to be estimated, and P_2(X) denotes an empirical function or the ML estimator. The cross entropy can also be used as an alternative to the KL divergence:

    U(X) := H(P_1(X), P_2(X)).   (10)

U_0 and U are, hereafter, both called "information energy". It is noteworthy that minimizing U_0 for probability estimation corresponds to the ML principle.

The self-information energy of functions of X or \mathbf{X}, namely ε, is defined as

    ε(X) := -\log \tilde{P}(X),   (11)

where \tilde{P}(X) is the empirical distribution, and ε(x) can be defined for any state x of X when \tilde{P}(x) > 0. Practically, the ML estimator \hat{P}(X) can be used as an alternative to \tilde{P}(X). That is, the following equation is used:

    ε(X) := -\log \hat{P}(X).   (12)

This equation indicates that the self-information energy denotes minus the maximum log-likelihood.
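The definitions above are straightforward to compute. The following sketch (a minimal numpy example with made-up distributions) implements Definitions 2.2-2.5 and checks the identity (7), D(P||Q) + H(P) = H(P, Q).

    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float)
        nz = p > 0
        return -np.sum(p[nz] * np.log(p[nz]))

    def cross_entropy(p, q):
        p, q = np.asarray(p, float), np.asarray(q, float)
        nz = p > 0
        return -np.sum(p[nz] * np.log(q[nz]))

    def kl(p, q):
        p, q = np.asarray(p, float), np.asarray(q, float)
        nz = p > 0
        return np.sum(p[nz] * np.log(p[nz] / q[nz]))

    def empirical(samples, n_states):
        """Relative frequencies (the ML estimator) from i.i.d. integer samples."""
        counts = np.bincount(samples, minlength=n_states)
        return counts / counts.sum()

    P_target = np.array([0.5, 0.3, 0.2])                 # candidate estimate P_1
    P_hat = empirical(np.array([0, 0, 1, 2, 0]), 3)      # hypothetical sample of size 5

    U0 = kl(P_target, P_hat)                             # information energy, Eq. (9)
    U = cross_entropy(P_target, P_hat)                   # information energy, Eq. (10)
    assert np.isclose(U0 + entropy(P_target), U)         # identity (7)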
2.1.3 Temperature

The inverse temperature, β_0, is introduced as one of the most significant quantities for statistical inference with a finite data size. β_0 is used instead of the physical temperature, often denoted as T, and is simply called "temperature" hereafter. Temperature is regarded as a bridge between thermodynamics and statistical inference. As described in our preliminary work (Isozaki et al., 2008, 2009), by introducing a thermodynamical framework for statistical inference, fluctuation due to the finiteness of the available data size can be regarded as thermal fluctuation. In the following, this philosophy is applied to define the temperature used in this paper for constructing a probability estimation method.

The fluctuation carried by the data can be denoted by the distortion between the ML estimator at the currently available data size n and the probability function estimated from the new framework using n - 1 data (which do not include the n-th datum). The fluctuation is related to the temperature as a unit of measure. P_i(X) is first defined as a new estimator obtained from i data, and the averaged estimator P^G_n(X), which denotes the geometric mean for data size n (\geq 0), is defined as follows:

    P^G_n(X) := \left( \prod_{i=0}^{n} P_i(X) \right)^{\frac{1}{n+1}},   (13)

where P^G_0(X) := P_0(X) := 1/|X| is defined as a uniform function, in which |X| denotes the number of elements in the range of X. This definition for n = 0 corresponds to the ME principle. It follows that the distortion is denoted as a KL divergence, and the divergence is connected to the temperature.

Definition 2.6 (Data temperature) For a natural-number data size, i.e., n \geq 1, the inverse temperature of a random variable X, namely β_0(X), is defined as

    β_0(X) := 1 / D(P^G_{n-1}(X) || \hat{P}(X)) = 1 / \sum_{x_k} P^G_{n-1}(x_k) \log \frac{P^G_{n-1}(x_k)}{\hat{P}(x_k)}   (14)

for D(P^G_{n-1}(X) || \hat{P}(X)) \neq 0, where P^G_{n-1}(X) is defined by Equation (13). β_0(X) := 0 for data size n = 0, and β_0(\mathbf{X}) of multivariate systems, i.e., \mathbf{X}, can be defined in the same way. Note that the variables X and \mathbf{X} are often omitted if they can be clearly recognized.

We call β_0 the "data temperature". According to this definition, it can be assumed that 0 < β_0 < \infty for n > 0. It will be seen that this positivity of the temperature is consistent with a postulate described in Section 2.1.5 and with Equation (33). In general, β_0 becomes large, that is, the system approaches a low-temperature state, as the available data size grows, because the fluctuation of the data becomes small.

A normalized data temperature for statistical inference, which improves the tractability of the data temperature in mathematical formulas, is defined as follows:

Definition 2.7 (Data temperature II) For a natural-number data size, i.e., n \geq 1, the temperature of a random variable, β(X), is defined as

    β(X) := \frac{β_0(X)}{1 + β_0(X)}.   (15)

β(\mathbf{X}) is defined in the same way, and the variable name(s) is often omitted. According to this definition, 0 < β < 1 for n > 0. β := 0 for data size n = 0. Hereafter, β and β_0 are used interchangeably, and both are called the "data temperature" when there is no danger of confusion.

2.1.4 Helmholtz free energy

The Helmholtz free energy, which plays a significant role in the method we present, is introduced next.

Definition 2.8 (Helmholtz free energy) The Helmholtz free energy F for X is defined by using the information energy U_0(X), the Shannon entropy H(X), and the data temperature β_0(X), as follows:

    F(X) := U_0(X) - \frac{H(X)}{β_0(X)}.   (16)

The free energy can be equivalently rewritten using β and the cross entropy U, instead of β_0 and U_0, as

    F(X) := U(X) - \frac{H(X)}{β(X)}.   (17)

For multivariate systems, F(\mathbf{X}) can be defined in the same way.

The second terms of the right-hand sides of Equations (16) and (17) represent the thermal energy in thermodynamics. It follows that the concept of heat is explicitly introduced into statistical inference.
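As a small illustration of Definitions 2.6-2.8, the data temperature and the free energy can be computed as follows. This is a sketch only: the list of previous estimators P_0, ..., P_{n-1} is assumed to be given (for example, as produced by the estimation procedure of Section 3), and the numerical values are made up.

    import numpy as np

    def kl(p, q):
        nz = p > 0
        return np.sum(p[nz] * np.log(p[nz] / q[nz]))

    def entropy(p):
        nz = p > 0
        return -np.sum(p[nz] * np.log(p[nz]))

    def geometric_mean(estimators):
        """P^G_n: element-wise geometric mean of P_0, ..., P_n, Eq. (13)."""
        logs = np.log(np.vstack(estimators))
        return np.exp(logs.mean(axis=0))

    def data_temperature(prev_estimators, p_hat):
        """beta_0 and beta from Eqs. (14) and (15)."""
        p_g = geometric_mean(prev_estimators)      # P^G_{n-1}
        d = kl(p_g, p_hat)
        beta0 = 1.0 / d if d > 0 else np.inf
        beta = beta0 / (1.0 + beta0)
        return beta0, beta

    # Hypothetical values: uniform P_0 plus one earlier estimate, and the current ML estimator.
    prev = [np.full(3, 1 / 3), np.array([0.40, 0.35, 0.25])]
    p_hat = np.array([0.6, 0.2, 0.2])
    beta0, beta = data_temperature(prev, p_hat)

    P1 = np.array([0.50, 0.27, 0.23])                      # some candidate estimate
    F = kl(P1, p_hat) - entropy(P1) / beta0                # Helmholtz free energy, Eq. (16)
    print(beta0, beta, F)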
2.1.5 Postulates

To assure the positivity of temperature, the following postulate, which will be needed in Section 3.3.1, is first assumed.

Postulate 2.1 Entropy is differentiable and is a monotonically increasing function of energy. The coefficient of the partial derivative with respect to the energy takes positive values, as follows:

    \frac{\partial H}{\partial U_0} > 0,   (18)

where U_0 denotes the energy represented by the KL divergence, and

    \frac{\partial H}{\partial U} > 0,   (19)

where U denotes the energy represented by the cross entropy.

The key principle of our method is assumed as follows. In the large-sample limit, U_0 \to 0 is reasonable, in accordance with the ML principle, while it is reasonable that H takes its maximum value in the small-sample limit, in accordance with the maximum entropy (ME) principle. For a finite sample size, it is thus reasonable that estimators of probabilities take values that balance both principles in accordance with the data size and the true hidden intrinsic entropies. We postulate that the minimum (Helmholtz) free energy principle determines the optimal balance.

Postulate 2.2 (Minimum free energy principle) Probability mass functions that are estimated from data are such as to minimize the Helmholtz free energy.

We call this principle the MFE principle. The two above-stated postulates are all that is needed for the framework of our proposed method. It is noteworthy that these postulates are part of thermodynamics (Callen, 1985), implying that our method of statistical inference is fully based on the framework of thermodynamic theory (except for the interpretation of the entropy, for which the Shannon entropy is adopted).

Our new method selects a probability mass function that maximizes entropy as far as possible according to the minimum free energy principle, while the maximum entropy method of Jaynes (1957) maximizes the entropy subject to another constraint that he adopted. The new method even corrects the bias generated by limited-size samples, which is a major difference compared with Jaynes' method. We consider that the difference arises from the theoretical ground: that of our method is thermodynamics with the Shannon entropy, while that of Jaynes is information theory alone.

The basis of our method of inference is described by using the Shannon entropy and by introducing the "information energy", the "data temperature", and the minimum free energy (MFE) principle. When we estimate probability functions from finite data, the purpose is to extract effective information from the data for recognizing the truth and/or predicting future events. From this viewpoint, with a finite data size, it is reasonable to select a probability function that explains the data to some extent but has some additional uncertainty due to the limited samples. The MFE principle thus unifies the ML and ME principles, and thermodynamics has a similar relationship: the MFE principle unifies the minimum (internal) energy principle and the maximum entropy principle (Callen, 1985).

3 Probability estimation

A probability-estimation method based on the theory described in the previous section is formalized in the following. The estimation is based on the MFE principle. Multinomial distributions are used for discrete random variables in the usual manner.

3.1 Probability estimation method

The entropy of a variable X is first defined by Equation (1). The information energy is then defined by Equation (9), with the probability P(X) and the empirical distribution \tilde{P}(X), as

    U_0(X) := D(P(X) || \tilde{P}(X)) = \sum_x P(x) \log \frac{P(x)}{\tilde{P}(x)}.   (20)
\tilde{P}(X) is replaced by the ML estimator \hat{P}(X) because \hat{P}(X) denotes the relative frequency, which is the same as \tilde{P}(X) for binomial or multinomial distributions under the i.i.d. condition.

According to the MFE principle, the probability estimator P(X) with β_0 or β is estimated by minimizing F under the constraint \sum_x P(x) = 1. It is therefore solved by using Lagrange multipliers. The free energy F is written in the following form with the Shannon entropy and the information energy:

    F = U_0 - \frac{1}{β_0} H.   (21)

β can be used as the alternative to β_0; accordingly, the free energy F can be rewritten using the cross entropy U as

    F = U - \frac{1}{β} H.   (22)

It follows that {U, β} can be used instead of {U_0, β_0}. The Lagrangian L is expressed as

    L = F + λ \left( \sum_x P(x) - 1 \right) = \frac{1}{β} \sum_x P(x) \log P(x) - \sum_x P(x) \log \hat{P}(x) + λ \left( \sum_x P(x) - 1 \right),   (23)

where λ is the Lagrange multiplier. In relation to that expression, if β_0 \to 0, then β \to 0 (the high-temperature limit); if β_0 \to \infty, then β \to 1 (the low-temperature limit). The solution P(X) is thus derived from the equation ∂L/∂P(x) = 0. The estimated probability P(X) therefore takes the form of the canonical distribution, also called the Gibbs distribution, which is well known in statistical physics:

    P(x) = \frac{\exp(-β(-\log \hat{P}(x)))}{\sum_{x'} \exp(-β(-\log \hat{P}(x')))}   (24)
         = \frac{\exp(-β ε(x))}{\sum_{x'} \exp(-β ε(x'))},   (25)

where ε(x) as expressed in Equation (12) is used. Practically, the following equivalent form is used:

    P(x) = \frac{[\hat{P}(x)]^{β}}{\sum_{x'} [\hat{P}(x')]^{β}},   (26)

where β can be determined, without any free parameters, by using Equations (13), (14), (15), and (26). For data size n = 0, the estimator is defined such that P(x) = 1/|X|, where |X| denotes the number of elements in the range of X. Note that the proposed method is consistent with the ME principle in the high-temperature limit, where min F ≈ max(H/β), and with the ML principle in the low-temperature limit, where min F ≈ min U.

For conditional probabilities, the conditional entropy H(X|Y) and the conditional KL divergence D(P(X|Y) || \hat{P}(X|Y)), or the conditional cross entropy, are used. β is defined as β(X|Y) := β_0(X|Y) / (β_0(X|Y) + 1). The formula for estimating conditional probabilities is therefore obtained in the following form:

    P(x|y) = \frac{\exp(-β(-\log \hat{P}(x|y)))}{\sum_{x'} \exp(-β(-\log \hat{P}(x'|y)))}.   (27)

For a conditional data size n = 0 given Y = y, P(x|y) = 1/|X| for any y, in the same manner as P(x) above. Joint probabilities can be calculated by using Equations (24) and (27) and the identity P(X, Y) = P(X|Y) P(Y). In general, joint probabilities are calculated using decomposition rules such that

    P(X_1, X_2, \ldots, X_n) = P(X_n | X_{n-1}, \ldots, X_2, X_1) \cdots P(X_2 | X_1) P(X_1).

Partition functions similar to those of statistical mechanics are introduced for convenience. By using the "data temperature" β, the free energy F is expressed in the same form as that in statistical mechanics:

    F = U - \frac{1}{β} H = -\frac{1}{β} \log Z,   (28)

where Z is the partition function, well known in statistical mechanics, defined for single or multivariate probabilities as

    Z(X) = \sum_x [\hat{P}(x)]^{β},   (29)

and for conditional probabilities as

    Z(X|Y) = \sum_x [\hat{P}(x|Y)]^{β}.   (30)

Consequently, when the thermodynamical framework and the Shannon entropy are assumed, the partition-function formulation of statistics, common to statistical mechanics, can be derived.
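The estimator of Equation (26), together with the sequential construction of the data temperature through Equations (13)-(15), can be implemented as in the following sketch. It reflects one reading of the definitions: the i.i.d. sample is processed incrementally, the geometric mean P^G_{i-1} of all previous estimates (starting from the uniform P_0) fixes β for step i, and the current ML estimate is then tempered by that β. The small count smoothing mirrors the smoothing used in Section 4, and the guard for a vanishing divergence is an implementation assumption.

    import numpy as np

    def kl(p, q):
        nz = p > 0
        return np.sum(p[nz] * np.log(p[nz] / q[nz]))

    def mfee_estimate(samples, n_states, smooth=1e-4):
        """Sequential MFE estimation of P(X), Eq. (26), from i.i.d. samples in {0,...,n_states-1}.

        Counts are smoothed (as in Section 4) so that every P_i(x) stays positive."""
        p0 = np.full(n_states, 1.0 / n_states)     # P_0: uniform (the n = 0 / ME estimate)
        log_geo_sum = np.log(p0)                   # running sum of log P_i over i = 0..n-1
        estimate = p0
        counts = np.zeros(n_states)

        for i, x in enumerate(samples, start=1):
            counts[x] += 1
            p_hat = (counts + smooth) / (counts + smooth).sum()      # smoothed ML estimator
            p_geo = np.exp(log_geo_sum / i)                          # P^G_{i-1}, Eq. (13)
            d = kl(p_geo, p_hat)
            beta0 = 1.0 / d if d > 1e-12 else np.inf                 # Eq. (14), with a guard
            beta = 1.0 if np.isinf(beta0) else beta0 / (1.0 + beta0) # Eq. (15)
            estimate = p_hat ** beta
            estimate /= estimate.sum()                               # canonical form, Eq. (26)
            log_geo_sum += np.log(estimate)
        return estimate

    rng = np.random.default_rng(0)
    data = rng.choice(3, size=20, p=[0.677, 0.206, 0.117])   # one of the Section 4 distributions
    print(mfee_estimate(data, 3))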
A significant feature of the data temperature is proved as follows. The lemma needed for this proof is stated first.

Lemma 3.1 Let P_i(X) denote the canonical distribution estimated by Equation (24) from i data. For data size n \to \infty, P^G_n(x), defined by Equation (13), converges to a definite value P^G(x) when 0 < P_i(x) for integers i \geq 0 and any state x.

Proof.

    \log P^G_n(x) - \log P^G_{n-1}(x) = \frac{1}{n+1} \sum_{i=0}^{n} \log P_i(x) - \frac{1}{n} \sum_{i=0}^{n-1} \log P_i(x)
        = \left( \frac{n}{n+1} - 1 \right) \log P^G_{n-1}(x) + \frac{1}{n+1} \log P_n(x).   (31)

Because 0 < P^G_{n-1}(x) and 0 < P_n(x), and hence \log P^G_{n-1}(x) and \log P_n(x) are definite values, both terms on the right-hand side of Equation (31) converge to 0 as n \to \infty. Therefore, P^G_n(x) \to P^G(x) because log functions are single-valued functions.

Theorem 3.1 In the asymptotic (i.e., large-sample) limit, the data temperature β converges to 1 when 0 < P_i(x) for integers i \geq 0 and any state x, where P_i(x) denotes the canonical distribution estimated by Equation (24) from i data.

Proof. According to Lemma 3.1, P^G_n(x) \to P^G(x) in the limit n \to \infty, where P^G(x) is a definite value for any x. P^G(x) is also the definite value of P_n(x), where P_n(x) is the estimated value given by Equation (24) using n data, because \log P^G(x) is a mean value of the \log P_i(x). β thus converges to a definite value. Meanwhile, the ML estimator \hat{P}(x) converges to the true distribution P_t(x) due to the consistency of ML estimators. Denote P_n(x) at n \to \infty as P(x). Therefore, at n \to \infty,

    \frac{1}{β_0} = \frac{1 - β}{β} = D(P(X) || P_t(X)) = \sum_x \frac{[P_t(x)]^{β}}{Z} \log \frac{[P_t(x)]^{β - 1}}{Z}   (32)

for 0 < P(x) and any state x. This identity (32) requires β \to 0 or β \to 1 in order that [P_t(x)]^{β} or [P_t(x)]^{β - 1} be constant for any probability distribution P_t(x). However, β \to 0 does not satisfy Equation (32), while β \to 1 does satisfy the equation. Accordingly, β \to 1 is the asymptotic limit.

According to Theorem 3.1, the more data is obtained, the more β approaches 1, and the more the estimator approaches the ML estimator. Therefore, it is noticeable that the new estimator has the same preferable asymptotic properties as the ML estimator, namely consistency and asymptotic efficiency.

For insufficient data sizes, β_0 is small by definition, so β is also small. Adequate estimators, automatically adjusted to the available data, can therefore be obtained. In other words, the free energy is dominated by the second term of Equation (22) when sufficient data is not available, because the uncertainty is large due to the shortage of evidence. In contrast, it is dominated by the first term when sufficient data is available, because the uncertainty is small due to the large data size. We call the proposed method MFEE, an abbreviation of "MFE estimation".
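The convergence stated in Theorem 3.1 can also be checked numerically. The sketch below repeats the sequential estimation of Section 3.1 under the same assumptions (counts smoothed so that all P_i(x) > 0) and records the data temperature β after each observation; β typically drifts toward 1 as the sample grows.

    import numpy as np

    def kl(p, q):
        nz = p > 0
        return np.sum(p[nz] * np.log(p[nz] / q[nz]))

    def beta_trace(samples, n_states, smooth=1e-4):
        """Return the data temperature beta after each observation (same sketch as Section 3.1)."""
        p0 = np.full(n_states, 1.0 / n_states)
        log_geo_sum, counts, betas = np.log(p0), np.zeros(n_states), []
        for i, x in enumerate(samples, start=1):
            counts[x] += 1
            p_hat = (counts + smooth) / (counts + smooth).sum()
            d = kl(np.exp(log_geo_sum / i), p_hat)
            beta = 1.0 if d < 1e-12 else (1.0 / d) / (1.0 + 1.0 / d)
            p_new = p_hat ** beta
            p_new /= p_new.sum()
            log_geo_sum += np.log(p_new)
            betas.append(beta)
        return betas

    rng = np.random.default_rng(1)
    data = rng.choice(3, size=2000, p=[0.431, 0.337, 0.232])
    b = beta_trace(data, 3)
    print(b[9], b[99], b[999], b[1999])   # beta typically increases toward 1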
3.2 Interpretation of probability and estimated information

MFEE provides a new interpretation of probability, distinct from both frequentism and Bayesianism. Frequentism is based on counting occurrences of events (e.g., Hájek (1997)), and Bayesianism is based on subjectivity, or on a combination of counting prior imaginary and real occurrences of events. It can be regarded that the Bayesian approach extends frequentism to include (prior) imaginary counting of events. The thermodynamical estimation method stated in this section is instead based on the concept of optimal uncertainty, which consists of counting events and of temperature. In MFEE, a probability can be regarded as the degree of uncertainty obtained by the MFE principle, which optimizes uncertainty in reflection of the quality and quantity of the data. It can therefore be regarded as a new interpretation of probability in real-world applications.

When the optimal entropy is obtained by using the MFE principle, the optimal negative entropy represents the optimally estimated effective information, which is defined as EI (denoting "effective information") for given data as follows:

    EI(X) := \sum_x \frac{[\hat{P}(x)]^{β}}{Z} \log \frac{[\hat{P}(x)]^{β}}{Z}.

The averaged log-likelihood is therefore a large-sample approximation of EI.

3.3 Some characteristic properties of MFEE

The canonical distribution derived from the MFE principle provides some characteristic properties of MFEE. The following notation is used: probability mass functions, such as P_k, have discrete states denoted by the index k; the ML estimator is denoted by \hat{P}_k; and {β, U} is used instead of {β_0, U_0}.

3.3.1 Relation to the usual definition of temperature

In thermodynamics, the temperature β is usually defined as (Callen, 1985; Kittel and Kroemer, 1980)

    β := \frac{\partial H}{\partial U}.   (33)

If the canonical distribution, which has the form of Equation (24), is used, this usual definition of β is automatically satisfied, as follows.

Lemma 3.2 H = βU + \log Z under the MFE condition, where H, U, β, and Z denote the entropy, information energy, data temperature, and partition function.

Proof. Since the probability mass function P_k has the canonical form under the MFE condition, it follows that

    H = -\sum_k P_k \log P_k = -\sum_k \frac{\hat{P}_k^{β}}{Z} \log \frac{\hat{P}_k^{β}}{Z}
      = β \left( -\sum_k \frac{\hat{P}_k^{β}}{Z} \log \hat{P}_k \right) + \left( \sum_k \frac{\hat{P}_k^{β}}{Z} \right) \log Z
      = βU + \log Z,   (34)

where the cross entropy is used as U.

Theorem 3.2 Equation (33) is automatically satisfied under the MFE principle.

Proof. Partially differentiating both sides of Equation (34) with respect to U gives

    \frac{\partial H}{\partial U} = β + U \frac{\partial β}{\partial U} + \frac{\partial}{\partial U} \log Z,   (35)

    \frac{\partial}{\partial U} \log Z = \frac{\partial β}{\partial U} \frac{\partial}{\partial β} \log Z,   (36)

    \frac{\partial}{\partial β} \log Z = \frac{1}{Z} \frac{\partial}{\partial β} \left( \sum_k \hat{P}_k^{β} \right) = \frac{1}{Z} \sum_k \hat{P}_k^{β} \log \hat{P}_k = \sum_k \frac{\hat{P}_k^{β}}{Z} \log \hat{P}_k = -U.

It follows that \partial H / \partial U = β. The theorem is therefore proved. In the same way, it can be proved that β_0 = \partial H / \partial U_0.

3.3.2 Energy fluctuation

In statistical mechanics, the energy fluctuation \langle ε^2 \rangle - \langle ε \rangle^2 is shown to obey the following relation, where \langle \cdot \rangle denotes an expectation value with respect to the canonical distribution:

    \langle ε^2 \rangle - \langle ε \rangle^2 = -\frac{\partial U}{\partial β}.   (37)

For the energy defined in MFEE, namely Equation (12), the same relation is satisfied. We use the following equation:

    U = -\sum_k \frac{\hat{P}_k^{β}}{Z} \log \hat{P}_k,   (38)

where U denotes the cross entropy. Then

    -\frac{\partial U}{\partial β} = \sum_k (\log \hat{P}_k) \left\{ \frac{1}{Z} \hat{P}_k^{β} \log \hat{P}_k - \frac{\hat{P}_k^{β}}{Z^2} \sum_m \frac{\partial \hat{P}_m^{β}}{\partial β} \right\}
        = \sum_k (\log \hat{P}_k) \left\{ \frac{\hat{P}_k^{β}}{Z} \log \hat{P}_k - \frac{\hat{P}_k^{β}}{Z} \sum_m \frac{\hat{P}_m^{β}}{Z} \log \hat{P}_m \right\}
        = \sum_k \frac{\hat{P}_k^{β}}{Z} \log \hat{P}_k \left( \log \hat{P}_k - \sum_m \frac{\hat{P}_m^{β}}{Z} \log \hat{P}_m \right)
        = \langle ε^2 \rangle - \langle ε \rangle^2,   (39)

where Equation (12) is used. Equation (37) is therefore proved.
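These identities are easy to verify numerically. The sketch below (with a made-up \hat{P} and β) checks Lemma 3.2, H = βU + log Z, exactly, and checks Equation (37) by a central finite difference in β.

    import numpy as np

    p_hat = np.array([0.6, 0.3, 0.1])    # hypothetical ML estimator
    beta = 0.7                           # hypothetical data temperature

    def canonical(p_hat, beta):
        w = p_hat ** beta
        return w / w.sum(), w.sum()      # canonical P_k and partition function Z

    def U(p_hat, beta):
        """Cross entropy H(P, p_hat) under the canonical distribution, Eq. (38)."""
        p, _ = canonical(p_hat, beta)
        return -np.sum(p * np.log(p_hat))

    p, Z = canonical(p_hat, beta)
    H = -np.sum(p * np.log(p))
    assert np.isclose(H, beta * U(p_hat, beta) + np.log(Z))          # Lemma 3.2

    eps = -np.log(p_hat)                                             # self-information energy, Eq. (12)
    fluct = np.sum(p * eps**2) - np.sum(p * eps)**2                  # <eps^2> - <eps>^2
    h = 1e-6
    dU_dbeta = (U(p_hat, beta + h) - U(p_hat, beta - h)) / (2 * h)   # finite-difference dU/dbeta
    assert np.isclose(fluct, -dU_dbeta, rtol=1e-4)                   # Eq. (37)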
3.3.3 Pseudo-Fisher information and energy fluctuation

When β is a parameter of the probability mass function P, the Fisher information \tilde{I}(β) is defined in the usual way as

    \tilde{I}(β) := \sum_k f_k(β) \left( \frac{\partial}{\partial β} \log f_k(β) \right)^2,   (40)

where f is the likelihood function. Here, the likelihood function is replaced with the probability function, and we call the resulting \tilde{I}(β) the MFE-Fisher information I(β). When it is assumed that the probability functions can be expressed by the canonical distributions with parameter β, it is clear that I(β) = \langle ε^2 \rangle - \langle ε \rangle^2, as follows:

    I(β) = \sum_k \frac{\hat{P}_k^{β}}{Z} \left( \frac{\partial}{\partial β} \log \frac{\hat{P}_k^{β}}{Z} \right)^2
         = \sum_k \frac{\hat{P}_k^{β}}{Z} \left( \log \hat{P}_k - \frac{1}{Z} \sum_m \hat{P}_m^{β} \log \hat{P}_m \right)^2
         = \sum_k \frac{\hat{P}_k^{β}}{Z} (-\log \hat{P}_k)^2 - \left( \sum_k \frac{\hat{P}_k^{β}}{Z} (-\log \hat{P}_k) \right)^2
         = \langle ε^2 \rangle - \langle ε \rangle^2.   (41)

The MFE-Fisher information is therefore identical to the energy fluctuation defined in Equation (37).

3.3.4 Other similarities with statistical mechanics

It is noteworthy that MFEE has other similarities with thermodynamics and/or statistical mechanics; that is, the same relationships exist. We list them below:

• The following relation is easily derived from the definition of the partition function Z:

    U = -\frac{\partial}{\partial β} \log Z.   (42)

• The following relation, known as the Gibbs-Helmholtz relation, is derived from Equations (28) and (42):

    U = \frac{\partial}{\partial β}(βF).   (43)

• The following relation is simply obtained from Equations (22) and (43):

    H = β^2 \frac{\partial F}{\partial β}.   (44)

• The energy variance is represented by the second-order derivative of the partition function with respect to β:

    \langle ε^2 \rangle - \langle ε \rangle^2 = \frac{\partial^2}{\partial β^2} \log Z.   (45)

• The thermal capacity is represented by the data temperature and the energy fluctuation as

    C = β^2 \frac{\partial^2}{\partial β^2} \log Z = β^2 (\langle ε^2 \rangle - \langle ε \rangle^2).   (46)

3.4 Incorporating prior knowledge

MFEE can incorporate Bayesian subjective beliefs. It can be said that Bayesian subjective probabilities extend counts of events to the sum of those counts and prior imaginary counts of events. Our method can therefore easily include subjectivity. To achieve this extension, the ML estimators are replaced with Bayesian posterior point probabilities, such as those estimated by the maximum a posteriori (MAP) method, which leads to the Bayesian canonical distributions instead of the formula expressed by Equation (24):

    P(x) = \frac{\exp(-β(-\log P_{Bayes}(x)))}{\sum_{x'} \exp(-β(-\log P_{Bayes}(x')))},   (47)

where P_{Bayes}(x) is the Bayesian posterior point probability, and the data temperature is calculated from Equation (14) with the ML estimator replaced by P_{Bayes}(x). In this case, the stronger the subjectivity is, the larger the data temperature becomes; that is, the tempering effect caused by β becomes weaker.
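A minimal sketch of this extension is given below. It assumes, purely for illustration, a symmetric Dirichlet prior whose hyperparameter alpha encodes the prior pseudo-counts, uses the standard Dirichlet-multinomial MAP point estimate as P_Bayes, and takes the previously accumulated geometric-mean estimator as given.

    import numpy as np

    def kl(p, q):
        nz = p > 0
        return np.sum(p[nz] * np.log(p[nz] / q[nz]))

    def bayes_point(counts, alpha):
        """MAP point estimate under a symmetric Dirichlet(alpha) prior.

        Assumes alpha >= 1 or enough counts; otherwise a posterior-mean
        estimate could be substituted."""
        c = np.asarray(counts, float) + alpha - 1.0
        return c / c.sum()

    def mfee_bayes(counts, p_geo_prev, alpha=2.0):
        """Bayesian canonical distribution, Eq. (47), with P_Bayes in place of the ML estimator."""
        p_bayes = bayes_point(counts, alpha)
        d = kl(p_geo_prev, p_bayes)
        beta0 = 1.0 / d if d > 1e-12 else np.inf
        beta = 1.0 if np.isinf(beta0) else beta0 / (1.0 + beta0)
        p = p_bayes ** beta
        return p / p.sum()

    # Hypothetical counts and a previously accumulated geometric-mean estimator (here uniform).
    print(mfee_bayes(np.array([5, 2, 1]), np.full(3, 1 / 3)))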
4 Examples

Simulations demonstrating the robustness of MFEE for small samples, in comparison with the ML, ME, and Bayesian-Dirichlet estimators with Jeffreys' prior, are described here.

X is assumed to have three internal states, and four probability mass functions with a variety of entropies, denoted as H(X) in natural logarithms, are used:

1. P(x=0) = 0.431, P(x=1) = 0.337, P(x=2) = 0.232, H(X) = 1.07,
2. P(x=0) = 0.677, P(x=1) = 0.206, P(x=2) = 0.117, H(X) = 0.841,
3. P(x=0) = 0.851, P(x=1) = 0.117, P(x=2) = 0.0320, H(X) = 0.498,
4. P(x=0) = 0.9898, P(x=1) = 0.00810, P(x=2) = 0.00210, H(X) = 0.0621.

Data from each function was sampled, and probabilities were estimated from the given data sets at various data sizes. For the estimation, ML, ME, and Bayesian-Dirichlet estimation with hyperparameter α = 1/2, derived from Jeffreys' prior distribution on Dirichlet models (Robert, 2007, p. 130), were used. Maximum a posteriori (MAP) estimation was used for the Bayesian-Dirichlet estimation from the viewpoint of point estimation. As usual, we set the averaged output \langle X \rangle as the constraint in the ME method: \langle X \rangle := (1/N) \sum_{d=1}^{N} X_d, where X_d denotes the d-th sample's output and N denotes the sample size. After that, the true and estimated probabilities were compared by using the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951) as a metric, which has the following form:

    D(P(X) || P_e(X)) = \sum_x P(x) \log \frac{P(x)}{P_e(x)},   (48)

where P(X) is the true distribution and P_e(X) is the distribution estimated by ML, ME, MAP, or MFEE. To avoid zero probabilities, the probabilities were smoothed by adding 0.0001 to the counts.

The KL divergences are shown in Fig. 1, where the values are averaged over 100 sample sets at each sample size, drawn from identical distributions. It can be seen that the ML estimators are inferior to MFEE due to overfitting, except for the distribution having very small entropy. Even the degree of superiority of the ML estimation in (d) is relatively smaller than its degree of inferiority for the other distributions. The ME estimators showed the opposite behavior to the ML estimators and showed relatively poor results in the large-sample regions. ML methods tend to fit the data and can thus estimate distributions with very low entropies more accurately than the others in small-sample cases. For example, if the true entropy equals zero, the ML method can estimate the exact true distribution from only one sample. On the other hand, ME methods tend to increase entropies and can thus estimate distributions with high entropies, such as the uniform distribution, more accurately. Hence, the ML has a tendency toward overfitting and the ME toward underfitting, from the viewpoint of misestimation. Even so, MFEE showed relative stability across both sample sizes and distributions. This indicates the effectiveness of the MFEE method, which was defined so as to incorporate characteristics of both the ME and the ML.

Not all values could be estimated by the MAP at every sample size; estimation failed for the following sample sizes N: N \leq 19 in (a), N \leq 50 in (b), N \leq 103 in (c), and N > 500 in (d), where the size decreased in response to the values of the entropies of each distribution. This is because even the posterior probability distributions were improper distributions, which has been pointed out as a problem of Bayesian statistics (Kass and Wasserman, 1996). In these ranges, the averaged points for the MAP were not plotted in Fig. 1. Our method always provides values for any sample size and showed effectiveness for avoiding overfitting (at least to some extent).
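A simplified sketch of this comparison for a single distribution is shown below. It reuses the sequential MFEE estimator sketched in Section 3.1, restricts the comparison to ML, Dirichlet-smoothed, and MFEE estimates (the ME estimator of the experiments is omitted), and uses illustrative sample sizes and seeds; the Dirichlet(1/2) posterior-mean smoothing is a proper stand-in for the MAP point estimate used in the text.

    import numpy as np

    def kl(p, q):
        nz = p > 0
        return np.sum(p[nz] * np.log(p[nz] / q[nz]))

    def mfee_estimate(samples, n_states, smooth=1e-4):
        """Sequential MFE estimation (Eq. 26); same sketch as in Section 3.1."""
        p = np.full(n_states, 1.0 / n_states)
        log_geo_sum, counts = np.log(p), np.zeros(n_states)
        for i, x in enumerate(samples, start=1):
            counts[x] += 1
            p_hat = (counts + smooth) / (counts + smooth).sum()
            d = kl(np.exp(log_geo_sum / i), p_hat)
            beta = 1.0 if d < 1e-12 else (1.0 / d) / (1.0 + 1.0 / d)
            p = p_hat ** beta
            p /= p.sum()
            log_geo_sum += np.log(p)
        return p

    true_p = np.array([0.677, 0.206, 0.117])     # distribution (b) from the list above
    rng = np.random.default_rng(0)
    sizes, trials = [5, 10, 20, 40, 60], 100

    for n in sizes:
        kl_ml, kl_dir, kl_mfee = [], [], []
        for _ in range(trials):
            s = rng.choice(3, size=n, p=true_p)
            counts = np.bincount(s, minlength=3).astype(float)
            p_ml = (counts + 1e-4) / (counts + 1e-4).sum()   # smoothed ML, as in the text
            p_dir = (counts + 0.5) / (counts + 0.5).sum()    # Dirichlet(1/2) smoothing (MAP stand-in)
            kl_ml.append(kl(true_p, p_ml))
            kl_dir.append(kl(true_p, p_dir))
            kl_mfee.append(kl(true_p, mfee_estimate(s, 3)))
        print(n, np.mean(kl_ml), np.mean(kl_dir), np.mean(kl_mfee))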
5 Discussion

5.1 Relation to the classical and Jaynes' approaches

The classical approach to probability estimation based on frequentism is included in our method. This approach can be considered as a method in which β is assumed to be 1, or nearly 1, in our MFE-based method. That is, the classical approach can be said to be a zero-temperature or low-temperature approximation of our method.

MFEE suggests a new interpretation of probability, in which the sample size is not explicitly included, unlike in both the classical and Bayesian approaches. Although an equivalent sample size can be calculated from a probability estimated by MFEE, the calculated value is no more than an interpretation in the language of frequentism.

Jaynes' maximum entropy (ME) methods are well known as the least biased inference methods. However, the constraints on which ME methods are based may not be reliable for small samples and may then be biased. On the other hand, MFEE corrects even such biases by using the temperature. Moreover, ME methods seemed to fail in estimation from large-size samples in our simulations, which implies that ME methods do not take full advantage of the information in the available data.

5.2 Relation to the Bayesian approach

Our approach is quite different from Bayesian approaches when prior knowledge is not available or should be excluded. MFEE assumes that a physics-like mechanism determines the optimal estimation, while the Bayesian approach assumes noninformative prior distributions unrelated to the optimal estimation. In addition, the former puts optimal entropy at the center of the method, which seems desirable because statistical inference aims to obtain optimal useful information from data.

In hierarchical Bayesian models, the hyperparameters of prior distributions are often determined by maximizing marginal distributions; these are called empirical Bayesian methods. Our method is not suitable for these models because the hyperparameters are not those of noninformative priors and can be interpreted as additive parameters that compensate for the incompleteness of the structures of the models. A similar situation occurs in Bayesian network classifiers (BNCs) (Friedman et al., 1997), as mentioned in our previous work (Isozaki et al., 2008, 2009). BNCs are being developed in the machine learning domain for classification tasks and are generalizations of the well-known naive Bayes classifiers. In the case of BNCs, it is known that their conditional probabilities play a part in compensating for inaccuracies in the estimated network structures (Jing et al., 2005).

6 Summary

Based neither on frequentism nor on Bayesianism, a robust method of probability estimation, grounded in both thermodynamics and information theory, for discrete-random-variable systems was developed. The core of the method is the intent to obtain optimized entropy explicitly, namely, to obtain optimized information from available limited data. The theory introduces two new quantities, the information energy and the data temperature, and the free energy is defined by using these quantities. The minimum free energy principle for inference, which unifies the maximum likelihood and maximum entropy principles with the above quantities, is adopted. The theory has advantages over frequentism because it is more robust for small sample sizes, and over Bayesianism because it does not use prior/posterior distributions when no prior knowledge is available, in which case prior biases are regarded as not being completely excluded. The effectiveness of the method in terms of robustness was demonstrated by simulation studies on point estimation for single-variable systems with various entropies.
Acknowledgments

The author would like to thank Mario Tokoro and Hiroaki Kitano of Sony Computer Science Laboratories, Inc. for their support.

References

Basu, A., Harris, I. R., Hjort, N. L., and Jones, M. C. (1998). Robust and efficient estimation by minimising a density power divergence. Biometrika, 85:549-559.

Berger, A. L., Pietra, S. D., and Pietra, V. D. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39-71.

Callen, H. B. (1985). Thermodynamics and An Introduction to Thermostatistics. John Wiley & Sons, Hoboken, NJ, second edition.

Caticha, A. and Preuss, R. (2004). Maximum entropy and Bayesian data analysis: Entropic prior distributions. Physical Review E, 70:046127.

Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory. John Wiley & Sons, Hoboken, NJ, second edition.

Dose, V. (2003). Bayesian inference in physics: Case studies. Reports on Progress in Physics, 66:1421.

Friedman, N., Geiger, D., and Goldszmidt, M. (1997). Bayesian network classifiers. Machine Learning, 29(2-3):131-163.

Hájek, A. (1997). "Mises redux" - redux: Fifteen arguments against finite frequentism. Erkenntnis, 45:209-227.

Hofmann, T. (1999). Probabilistic latent semantic analysis. In Proc. of Conference on Uncertainty in Artificial Intelligence (UAI-99), pages 289-296.

Huscroft, C., Gass, R., and Jarrell, M. (2000). Maximum entropy method of obtaining thermodynamical properties from quantum Monte Carlo simulations. Physical Review B, 61:9300.

Irony, T. Z. and Singpurwalla, N. D. (1997). Non-informative priors do not exist: A dialogue with José M. Bernardo. Journal of Statistical Planning and Inference, 65:159-189.

Isozaki, T., Kato, N., and Ueno, M. (2008). Minimum free energies with "data temperature" for parameter learning of Bayesian networks. In Proc. of IEEE International Conference on Tools with Artificial Intelligence (ICTAI-08), pages 371-378.

Isozaki, T., Kato, N., and Ueno, M. (2009). "Data temperature" in minimum free energies for parameter learning of Bayesian networks. International Journal on Artificial Intelligence Tools, 18(5):653-671.

Jaynes, E. T. (1957). Information theory and statistical mechanics. Physical Review, 106(4):620-630.

Jeffreys, H. (1961). Theory of Probability. Oxford University Press, NY, third edition.

Jing, Y., Pavlović, V., and Rehg, J. M. (2005). Efficient discriminative learning of Bayesian network classifiers via boosted augmented naive Bayes. In Proc. of International Conference on Machine Learning (ICML-05), pages 369-376.

Jones, M. C., Hjort, N. L., Harris, I. R., and Basu, A. (2001). A comparison of related density-based minimum divergence estimators. Biometrika, 88(3):865-873.

Kass, R. E. and Wasserman, L. (1996). The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91(435):1343-1370.

Kittel, C. and Kroemer, H. (1980). Thermal Physics. W. H. Freeman, San Francisco, CA.

Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22(1):79-86.

LeCun, Y. and Huang, F. J. (2005). Loss functions for discriminative training of energy-based models. In Proc. of International Workshop on Artificial Intelligence and Statistics (AISTATS-05), pages 206-213.
Pereira, F., Tishby, N., and Lee, L. (1993). Distributional clustering of English words. In Proc. of Annual Meeting of the Association for Computational Linguistics (ACL-93), pages 183-190.

Robert, C. P. (2007). The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. Springer-Verlag, New York, NY, second edition.

Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27:379-423, 623-656.

Ueda, N. and Nakano, R. (1995). Deterministic annealing variant of the EM algorithm. In Proc. of Advances in Neural Information Processing Systems 7 (NIPS 7), pages 545-552.

Watanabe, K., Shiga, M., and Watanabe, S. (2009). Upper bound for variational free energy of Bayesian networks. Machine Learning, 75(2):199-215.

Windham, M. P. (1995). Robustifying model fitting. Journal of the Royal Statistical Society B, 57(3):599-609.

Figure 1: KL divergences between the true probability mass functions and the probability mass functions estimated by using maximum likelihood (ML), MFEE, maximum a posteriori with Jeffreys' prior (Bayes), and maximum entropy (ME), for panels (a) H = 1.07, (b) H = 0.841, (c) H = 0.498, and (d) H = 0.0621. The horizontal axes denote sample sizes. H denotes Shannon entropy in natural logarithms.
