On Tsallis Entropy Bias and Generalized Maximum Entropy Models

Authors: Yuexian Hou · Tingxu Yan · Peng Zhang · Dawei Song · Wenjie Li

Abstract: In density estimation tasks, the maximum entropy model (Maxent) can effectively use reliable prior information via certain constraints, i.e., linear constraints without empirical parameters. However, reliable prior information is often insufficient, and the selection of uncertain constraints becomes necessary but poses considerable implementation complexity. Improper setting of uncertain constraints can result in overfitting or underfitting. To solve this problem, a generalization of Maxent, under the Tsallis entropy framework, is proposed. The proposed method introduces a convex quadratic constraint for the correction of the (expected) Tsallis entropy bias (TEB). Specifically, we demonstrate that the expected Tsallis entropy of sampling distributions is smaller than the Tsallis entropy of the underlying real distribution. This expected entropy reduction is exactly the (expected) TEB, which can be expressed by a closed-form formula and acts as a consistent and unbiased correction. TEB indicates that the entropy of a specific sampling distribution should be increased accordingly. This entails a quantitative re-interpretation of the Maxent principle. By compensating TEB and meanwhile forcing the resulting distribution to be close to the sampling distribution, our generalized TEBC Maxent can be expected to alleviate both overfitting and underfitting. We also present a connection between TEB and the Lidstone estimator. As a result, the TEB-Lidstone estimator is developed by analytically identifying the rate of probability correction in Lidstone. Extensive empirical evaluation shows promising performance of both TEBC Maxent and TEB-Lidstone in comparison with various state-of-the-art density estimation methods.

Keywords: Density estimation · Maximum entropy · Tsallis entropy · Tsallis entropy bias · Lidstone estimator

Yuexian Hou and Tingxu Yan: School of Computer Science and Technology, Tianjin University, Tianjin, 300072, China. E-mail: yxhou@tju.edu.cn; Sunriser2008@gmail.com
Peng Zhang and Dawei Song: School of Computing, The Robert Gordon University, Aberdeen, AB25 1HG, United Kingdom. E-mail: {p.zhang1; d.song}@rgu.ac.uk
Wenjie Li: Department of Computing, The Hong Kong Polytechnic University, Hong Kong, 999077. E-mail: cswjli@comp.polyu.edu.hk

1 Introduction

The maximum entropy (Maxent) approach to density estimation was originally proposed by E. T. Jaynes [Jaynes, 1957], and has since been widely used in many areas of computer science and statistical learning, especially natural language processing [Berger et al., 1996, Pietra et al., 1997]. The Maxent principle can be traced back to Jaynes' classical description [Jaynes, 1957]:

"... the fact that a probability distribution maximizes entropy subject to certain constraints representing our incomplete information, is the fundamental property which justifies use of that distribution for inference; it agrees with everything that is known, but carefully avoids assuming anything that is not known ..."
In implementing this principle, given a sampling distribution drawn from the underlying real distribution, Maxent computes a resulting distribution whose entropy is maximized, subject to a set of selected constraints. The standard Maxent can be formulated as in Formula 1:

$$
\begin{aligned}
\max_{\bar{P}^{(m)}} \;& S\big[\bar{P}^{(m)}\big] \\
\text{s.t.}\;& \Big|\sum_{j \in S_u} \bar{p}_j - a_u\Big| \le \delta_u \quad \forall u \in U \\
& \sum_{i \in S_c} \bar{p}_i = a^*_c \;\;\text{or}\;\; \sum_{i \in S_c} \bar{p}_i \ge a^*_c \;\;\text{or}\;\; \sum_{i \in S_c} \bar{p}_i \le a^*_c \quad \forall c \in C
\end{aligned}
\tag{1}
$$

where $\bar{P}^{(m)} \equiv \langle \bar{p}_1, \dots, \bar{p}_m \rangle$ is the resulting $m$-nomial probability distribution, $S[\cdot]$ denotes the Shannon entropy of a probability distribution, $C$ and $U$ are two index sets, $a^*_c$, $c \in C$, are constants determined by reliable information, $a_u$ and $\delta_u$, $u \in U$, are parameters that need to be empirically adjusted, and $S_c$, $c \in C$, and $S_u$, $u \in U$, are subsets of $\{1, 2, \dots, m\}$.

Standard Maxent has two sets of constraints. The first set (indexed by $C$) includes all certain constraints, which are derived from reliable prior information and do not involve empirical parameters. For example, two of the most common certain constraints are $\sum_i \bar{p}_i = 1$ and $\bar{p}_i \ge 0$. The second set (indexed by $U$) includes all uncertain constraints, which come from less reliable knowledge or sample information, and hence necessarily involve empirical parameters (e.g., $a_u$ and $\delta_u$) to attain satisfying performance. Note that there could be other specific forms of constraints not listed in Formula 1, e.g., the common form of real-valued feature functions. However, these constraint forms are essentially equivalent to, and can be categorized into, certain or uncertain constraints. Moreover, all certain and uncertain constraints considered in this paper are linear. A small numerical sketch of this formulation is given below.
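To make Formula 1 concrete, the following minimal sketch (our illustration, not code from the paper) solves a toy instance with the simplex conditions as certain constraints and one box-style uncertain constraint on the first two bins; the toy values of $a_u$ and $\delta_u$ are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def shannon_entropy(p, eps=1e-12):
    return -np.sum(p * np.log(p + eps))

m = 4
a_u, delta_u = 0.7, 0.05   # hypothetical uncertain constraint on bins {0, 1}
constraints = [
    {"type": "eq",   "fun": lambda p: np.sum(p) - 1.0},               # certain: sums to 1
    {"type": "ineq", "fun": lambda p: delta_u - (p[0] + p[1] - a_u)}, # |p0+p1-a_u| <= delta_u,
    {"type": "ineq", "fun": lambda p: delta_u + (p[0] + p[1] - a_u)}, # split for smoothness
]
result = minimize(lambda p: -shannon_entropy(p),   # maximize entropy
                  x0=np.full(m, 1.0 / m),
                  bounds=[(0.0, 1.0)] * m,         # certain: p_i >= 0
                  constraints=constraints, method="SLSQP")
print(result.x)   # the resulting distribution
```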
1.1 The Problem

Although the essential idea of Maxent is concise and elegant, its implementation poses considerable practical complexity. Specifically, in a typical density estimation task, reliable prior information is often insufficient. In this case, if Maxent only involves certain constraints derived from reliable prior information, the resulting distribution will be far from the sampling distribution, and underfitting will result. Hence, Maxent usually involves a set of uncertain constraints, which force the resulting distribution to be close to the sampling distribution. The tolerable violation level of the resulting distribution against the sampling distribution is controlled by a set of threshold parameters. These constraints and parameters are essentially empirical and ad hoc. This is a dilemma: on one hand, if a large number of uncertain constraints and a set of tight threshold parameters are involved, the solution of Maxent will be close to the sampling distribution and might severely overfit the sample [Dudik et al., 2007]; on the other hand, if a small number of uncertain constraints or a set of loose threshold parameters are used, Maxent might underfit the sample and miss out useful sample information.

1.2 Existing Work

In the framework of Maxent, the main approaches to tackling overfitting or underfitting are parameter regularization and constraint relaxation. The former introduces some specific statistics (e.g., $l_1$, $l_2^2$, $l_1 + l_2^2$, etc.) as regularized terms of the objective function and removes explicit constraints [Dudik et al., 2007, Chen and Rosenfeld, 2000, Lebanon and Lafferty, 2001, Lau, 1994]. The latter aims to relax the constraints according to some theoretical considerations [Khudanpur, 1995, Kazama and Tsujii, 2003, Jedynak and Khudanpur, 2005], e.g., the Maximum Likelihood set in [Jedynak and Khudanpur, 2005].

The performance guarantee of some Maxent variants is rigorously established with respect to (w.r.t.) finite sample criteria, e.g., Probably Approximately Correct (PAC). However, to the best of our knowledge, most of these guarantees are, to some extent, self-referencing. For example, using log loss as the criterion, theoretical relations between the solution of Generalized Maxent (GME) and the best Gibbs distribution are given [Dudik et al., 2007]. However, the definition of the best Gibbs distribution intrinsically depends on the selection of feature functions. It turns out that, if the selection of feature functions is improper, the solution of GME might not be able to substantially avoid overfitting or underfitting even if it is close to the best Gibbs distribution.

1.3 Our Approach

In this paper, we propose a novel generalization of Maxent, under the framework of Tsallis entropy [Tsallis, 1988, Abe, 2000] (see Section 3.1 for more details; for the sake of analytical simplicity, we only consider the Tsallis entropy with $q = 2$ [Tsallis, 1988] in this paper).

An important motivating observation is that the expected Tsallis entropy of sampling distributions is always smaller than the Tsallis entropy of the underlying real distribution. To demonstrate this formally, we present a theoretical analysis of the expected Tsallis entropy bias (TEB) between sampling distributions and the underlying real distribution. (In this paper, the notion of "Tsallis entropy bias" has the same meaning as "expected Tsallis entropy bias"; TEB thus carries the "expected" sense in itself.) The TEB is independent of the selection of constraints (in fact, it depends only on the i.i.d. sampling assumption) and can be expressed by a simple closed-form formula of the sample size $n$ and the Tsallis entropy of the underlying real distribution. This observation naturally entails a quantitative re-interpretation and a theoretical guarantee of the Maxent principle: since the entropy of sampling distributions is smaller, in the expected sense, than the entropy of the underlying real distribution, Maxent should increase the entropy of the sampling distribution to compensate the TEB and hence approximate the underlying real distribution. The TEB is first developed in the frequentist framework, and we denote it Frequentist-TEB. In addition, by assuming a uniform Bayesian prior over all possible $m$-nomial distributions, a Bayesian-TEB is developed.

We argue that, consistent with the basic principle of Maxent, a rigorously established compensation of Frequentist-TEB or Bayesian-TEB can help alleviate the overfitting problem. On the other hand, it is natural to overcome underfitting by simply forcing the resulting distribution to be close to the sampling distribution.
By integrating these two strategies into our generalized Maxent, called Tsallis entropy bias compensation (TEBC) Maxent, it is expected that TEBC Maxent can alleviate both overfitting and underfitting. Note that it is somewhat problematic to develop a similar method in the framework of Shannon entropy, since a consistent and unbiased correction of the Shannon entropy has not been found yet in general (see Section 2 for more details on the estimation of Shannon entropy).

In implementation, the TEBC Maxent is convex and hence can be efficiently solved. More importantly, TEBC Maxent can bypass the selection of uncertain constraints as well as parameter identification by introducing a parameter-free TEB constraint, aiming at quantitative entropy compensation.

In addition to the above Maxent framework, the generality of our theoretical results can be demonstrated by a practical connection between TEB and another widely used estimator, namely the Lidstone estimator. We will show that both Frequentist-TEB and Bayesian-TEB can offer guidance to identify the adaptive rate of probability correction, which needs to be empirically set in Lidstone. Accordingly, the so-called "F-Lidstone" and "B-Lidstone" estimators are derived, respectively. Extensive experimental results on a number of synthesized and real-world datasets demonstrate promising performance of TEBC Maxent, F-Lidstone and B-Lidstone, in comparison with various state-of-the-art density estimation methods.

2 Related Work

The concept of maximum entropy has existed in the machine learning literature for a long time and has resulted in various approaches. Its constrained form has been widely applied in many contexts [Berger et al., 1996, Kazama and Tsujii, 2003, Jedynak and Khudanpur, 2005]. Recently, there have been many studies of Maxent with $l_1$-style regularization [Khudanpur, 1995, Kazama and Tsujii, 2003, Williams, Ng, 2004, Goodman, 2004, Krishnapuram et al., 2005], $l_2^2$-style regularization [Lau, 1994, Chen and Rosenfeld, 2000, Lebanon and Lafferty, 2001, Zhang, 2005], as well as some other types of regularization such as $l_1 + l_2^2$-style [Kazama and Tsujii, 2003], $l_2$-style regularization [Newman, 1977] and a smoothed version of $l_1$-style regularization [Dekel et al., 2003]. Altun and Smola [2006] derive duality and performance guarantees for settings in which the entropy is replaced by an arbitrary Bregman or Csiszar divergence. A thorough theoretical analysis of regularized Maxent can be found in [Dudik et al., 2007].

As another direction of density estimation, many smoothing methods have been proposed in various contexts, e.g., information retrieval [Zhai and Lafferty, 2001], speech recognition [Chen and Goodman, 1998] and cryptology [Good, 1953]. A typical family of general-purpose smoothing methods is the Good-Turing estimator. All of its members use the following equation to calculate the resulting frequencies of events:

$$ F_X = \frac{N_X + 1}{T} \cdot \frac{E(N_X + 1)}{E(N_X)} $$

where $X$ is an event, $N_X$ is the number of times the event $X$ has been seen within the sample of size $T$, and $E(n)$ is an estimate of how many different events happened exactly $n$ times.
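As an illustration of this recipe, here is a minimal sketch (ours, not the paper's) of the simplest Good-Turing variant, which plugs the raw counts-of-counts into $E(\cdot)$; the function name and the fallback to the maximum-likelihood frequency when $E(N_X + 1)$ vanishes are our assumptions.

```python
from collections import Counter

def simplest_good_turing(counts, T):
    """Adjusted relative frequencies F_X = (N_X+1)/T * E(N_X+1)/E(N_X),
    with E(n) taken as the raw count-of-counts (sketch)."""
    E = Counter(counts.values())   # E[n] = number of events seen exactly n times
    freqs = {}
    for event, n_x in counts.items():
        if E[n_x + 1] > 0:
            freqs[event] = (n_x + 1) / T * E[n_x + 1] / E[n_x]
        else:                      # our fallback: plain maximum-likelihood frequency
            freqs[event] = n_x / T
    return freqs

sample = "abracadabra"
counts = Counter(sample)           # event counts N_X
print(simplest_good_turing(counts, len(sample)))
```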
Different variants of the Good-Turing estimator, e.g., the simplest Good-Turing estimator, the Simple Good-Turing estimator [Gale and Sampson, 1995] and the diminishing-attenuation estimator [Orlitsky et al., 2003], are based on different calculations of $E(\cdot)$.

Another widely used smoothing method is the Lidstone estimator [Chen and Goodman, 1998]. Typical variants of the Lidstone estimator include the Expected Likelihood Estimator, the Laplace estimator and the Add-tiny estimator.

From a theoretical point of view, by defining the attenuation of a probability estimator as the largest possible ratio between the per-symbol probability assigned to an arbitrarily sized sequence by any distribution and the corresponding probability assigned by the estimator, it can be shown that the attenuation of diminishing-attenuation estimators is unity [Orlitsky et al., 2003]. Note that the attenuation analysis is an asymptotic analysis w.r.t. large-sample performance.

For entropy correction, there are some methods to estimate the Shannon entropy bias. However, to the best of our knowledge, no consistent and unbiased correction of the Shannon entropy has been developed. In principle, the "inconsistency" theorem leads to several approximations of the Shannon entropy bias [Miller, 1955, Carlton, 1969, Panzeri and Treves, 1996, Victor, 2000]. The analytical approximation of the bias given by [Paninski, 2003] can be considered more rigorous than its predecessors. However, this bias does not have a closed form and depends on the specific prior distribution and the ratio $c$ between the sample size and the number of bins [Paninski, 2003]; hence it is hard to compute in general. Regarding this, Paninski proposed an estimator which is consistent even when the ratio $c$ is bounded (provided that both $m$ and $n$ are sufficiently large) [Paninski, 2004]. However, a general and exact closed-form formula for the Shannon entropy bias is still an open problem.

The rest of this paper is organized as follows: Section 3 gives a theoretical analysis of two crucial observations; Section 4 discusses the estimation of TEBs (Frequentist-TEB and Bayesian-TEB), introduces TEBC Maxent and reveals the connection between TEBs and the Lidstone estimator; Section 5 gives two model-evaluating criteria; experiments on synthesized and real-world datasets are constructed in Section 6, and the experimental results are reported and discussed. Finally, conclusions and future work are presented in Section 7.

3 Tsallis Entropy Bias

In this section, we present two theoretical observations, which motivate and underpin the proposed TEB, TEBC Maxent and TEB-Lidstone estimator.

3.1 Notations and Definitions

We use the following notations throughout the rest of the paper:

$P^{(m)} \equiv \langle p_1, \dots, p_m \rangle$: the underlying real $m$-nomial ($m \ge 2$) probability distribution, where $\sum_{i=1}^m p_i = 1$;

$\widehat{P}^{(m)}_n \equiv \langle \hat{p}_1, \dots, \hat{p}_m \rangle$: the sampling distribution of sample size $n$ w.r.t. $P^{(m)}$, where $\sum_{i=1}^m \hat{p}_i = 1$;

$\mathcal{P}^{(m)}$: the set of all possible $m$-nomial ($m \ge 2$) probability distributions;

$T_q\big[P^{(m)}\big] \equiv k \cdot \frac{1 - \sum_{i=1}^m p_i^q}{q - 1}$: the Tsallis entropy of $P^{(m)}$ w.r.t. the index $q$, where $k$ is the Boltzmann constant. We have $\lim_{q \to 1} T_q\big[P^{(m)}\big] = k \cdot H\big[P^{(m)}\big]$, where $H\big[P^{(m)}\big]$ is the Shannon information entropy of $P^{(m)}$.
The Tsallis entropy is the simplest entropy form that extends the Shannon entropy while maintaining its basic properties but allowing, if $q \neq 1$, nonextensivity [Santos, 1997, Abe, 2000]. In this paper, for the sake of analytical convenience, we always assume $q = 2$ and neglect the subscript $q$. In addition, without loss of generality, we omit the Boltzmann constant. It then turns out that $T\big[P^{(m)}\big] = 1 - \sum_i p_i^2$.

3.2 Main Results

Proposition 1. Given an arbitrary $m$-nomial probability distribution $P^{(m)} \in \mathcal{P}^{(m)}$, let $\widehat{P}^{(m)}_n$ be the sampling distribution of sample size $n$ with respect to $P^{(m)}$, $E^{(m)}_{P,n}(T)$ be the expected Tsallis entropy of $\widehat{P}^{(m)}_n$, and $T\big[P^{(m)}\big]$ be the Tsallis entropy of $P^{(m)}$. Then

$$ E^{(m)}_{P,n}(T) = \frac{n-1}{n} \, T\big[P^{(m)}\big] \tag{2} $$

Proof. Let $P^{(m)} = \big\langle p_1, \dots, p_{m-1}, 1 - \sum_{i=1}^{m-1} p_i \big\rangle$ be an $m$-nomial probability distribution. Denote the set of all possible sampling distributions of sample size $n$ as

$$ \mathcal{S}^{(m)}_n \equiv \left\{ \widehat{P}^{(m)}_n \equiv \Big\langle \frac{x_1}{n}, \dots, \frac{x_{m-1}}{n}, \frac{n - \sum_{i=1}^{m-1} x_i}{n} \Big\rangle \;\Big|\; x_1, \dots, x_{m-1} \in \mathbb{N},\; x_1 + \cdots + x_{m-1} \le n \right\} $$

where $x_i$ is the count of the $i$th nomial. Given $P^{(m)}$, the occurrence probability of a sampling distribution $\widehat{P}^{(m)}_n \in \mathcal{S}^{(m)}_n$ is given by the following equation:

$$ \Pr\big[\widehat{P}^{(m)}_n \,\big|\, P^{(m)}\big] = \frac{n!}{\big(n - \sum_{i=1}^{m-1} x_i\big)! \, \prod_{i=1}^{m-1} x_i!} \Big(1 - \sum_{i=1}^{m-1} p_i\Big)^{n - \sum_{i=1}^{m-1} x_i} \prod_{i=1}^{m-1} p_i^{x_i} \tag{3} $$

Note that we assume $0^0 = 1$ in Formula 3. Hence, we have

$$ E^{(m)}_{P,n}(T) = \sum_{x_1=0}^{n} \sum_{x_2=0}^{n-x_1} \cdots \sum_{x_{m-1}=0}^{n - \sum_{i=1}^{m-2} x_i} T\big[\widehat{P}^{(m)}_n\big] \cdot \Pr\big[\widehat{P}^{(m)}_n \,\big|\, P^{(m)}\big] $$

By the definition of the Tsallis entropy ($q = 2$), we have

$$ T\big[\widehat{P}^{(m)}_n\big] \equiv 1 - \sum_{i=1}^m \Big(\frac{x_i}{n}\Big)^2 $$

where we denote $n - \sum_{i=1}^{m-1} x_i$ as $x_m$. It is convenient to express $E^{(m)}_{P,n}(T)$ as follows:

$$ E^{(m)}_{P,n}(T) = 1 - \sum_{i=1}^m E^{(m)}_{P,n}\Big[\Big(\frac{x_i}{n}\Big)^2\Big] $$

where

$$ E^{(m)}_{P,n}\Big[\Big(\frac{x_i}{n}\Big)^2\Big] \equiv \sum_{x_1=0}^{n} \sum_{x_2=0}^{n-x_1} \cdots \sum_{x_{m-1}=0}^{n - \sum_{i=1}^{m-2} x_i} \Big(\frac{x_i}{n}\Big)^2 \cdot \Pr\big[\widehat{P}^{(m)}_n \,\big|\, P^{(m)}\big], \quad i = 1, \dots, m $$

Note that $E^{(m)}_{P,n}\big(x_i^2\big)$ is just a moment about the origin of the multinomial distribution, given by $E^{(m)}_{P,n}\big(x_i^2\big) = n(n-1)p_i^2 + np_i$. Hence, we have $E^{(m)}_{P,n}\big[(x_i/n)^2\big] = \frac{(n-1)p_i^2 + p_i}{n}$. It turns out that

$$ E^{(m)}_{P,n}(T) = 1 - \sum_{i=1}^m \frac{(n-1)p_i^2 + p_i}{n} = \frac{n-1}{n}\Big(1 - \sum_{i=1}^m p_i^2\Big) $$

where the bracketed factor on the r.h.s. is just $T\big[P^{(m)}\big]$, which completes the proof.

Corollary 1.

$$ \lim_{n \to \infty} E^{(m)}_{P,n}(T) = T\big[P^{(m)}\big] \tag{4} $$

Proof. The corollary follows immediately from Formula 2.

The above result is developed in the frequentist framework and hence corresponds to a Frequentist-TEB; a numerical check is sketched below. In the following, a uniform Bayesian TEB (Bayesian-TEB for short) is developed by assuming a uniform Bayesian prior over all possible $m$-nomial distributions. (A code package for the numerical evaluation of Propositions 1 and 2 is provided at http://www.comp.rgu.ac.uk/staff/pz/TEBC/Proposition_Validation_Codes.zip.)
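Proposition 1 is straightforward to check numerically. The following Monte Carlo sketch is ours (the authors' own validation package is linked above); it compares the empirical mean Tsallis entropy of sampling distributions with the prediction of Formula 2:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, trials = 5, 20, 100_000

p = rng.dirichlet(np.ones(m))                      # an arbitrary real m-nomial distribution
X = rng.multinomial(n, p, size=trials) / n         # i.i.d. sampling distributions of size n
empirical = np.mean(1.0 - np.sum(X ** 2, axis=1))  # Monte Carlo estimate of E_{P,n}(T), q = 2
predicted = (n - 1) / n * (1.0 - np.sum(p ** 2))   # Formula 2
print(empirical, predicted)                        # the two values should agree closely
```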
Proposition 2. Given the uniform probability metric over $\mathcal{P}^{(m)}$, the expectation of $E^{(m)}_{P,n}(T)$, i.e., $E^{(m)}_n(T)$, is given by

$$ E^{(m)}_n(T) = \frac{(n-1)(m-1)}{n(m+1)} \tag{5} $$

Proof. By the definition of mathematical expectation, we have

$$ E^{(m)}_n(T) = \frac{1}{Z(m-1)} \int_0^1 \int_0^{1-p_1} \cdots \int_0^{1 - \sum_{i=1}^{m-2} p_i} E^{(m)}_{P,n}(T) \, dp_{m-1} \, dp_{m-2} \cdots dp_1 $$

where $Z(m-1)$ is the normalization factor determined by the $(m-1)$-order integral operator:

$$ Z(m-1) = \int_0^1 \int_0^{1-p_1} \cdots \int_0^{1 - \sum_{i=1}^{m-2} p_i} dp_{m-1} \, dp_{m-2} \cdots dp_1 = \frac{1}{(m-1)!} $$

Expanding and rewriting $E^{(m)}_{P,n}(T)$ gives:

$$ E^{(m)}_{P,n}(T) = \frac{2(n-1)}{n} \Big( \sum_{i=1}^{m-1} p_i - \sum_{i=1}^{m-1} p_i^2 - \sum_{i=1}^{m-1} \sum_{j=i+1}^{m-1} p_i \, p_j \Big) \tag{6} $$

To simplify the notation, we define the $(m-1)$-order integral operator:

$$ L^{(m-1)} \equiv \int_0^1 \int_0^{1-p_1} \cdots \int_0^{1 - \sum_{i=1}^{m-2} p_i} dp_{m-1} \, dp_{m-2} \cdots dp_1 \tag{7} $$

and denote

$$ L^{(m-1)}\big(f(p_1, \dots, p_{m-1})\big) \equiv \int_0^1 \int_0^{1-p_1} \cdots \int_0^{1 - \sum_{i=1}^{m-2} p_i} f(p_1, \dots, p_{m-1}) \, dp_{m-1} \, dp_{m-2} \cdots dp_1 $$

Then the proof of Formula 5 is reduced to solving the closed form of

$$ L^{(m-1)}\Big( \sum_{i=1}^{m-1} p_i - \sum_{i=1}^{m-1} p_i^2 - \sum_{i=1}^{m-1} \sum_{j=i+1}^{m-1} p_i \, p_j \Big) $$

Due to the symmetry of the integral domain, $L^{(m-1)}$ has the following properties:

(a) $L^{(m-1)}(p_i) = L^{(m-1)}(p_j)$, $1 \le i, j \le m-1$
(b) $L^{(m-1)}(p_i^2) = L^{(m-1)}(p_j^2)$, $1 \le i, j \le m-1$
(c) $L^{(m-1)}(p_i \cdot p_j) = L^{(m-1)}(p_k \cdot p_l)$, $1 \le i, j, k, l \le m-1$, $i \neq j$, $k \neq l$

Therefore, if the general term formulae of $L^{(m-1)}(p_i)$, $L^{(m-1)}(p_i^2)$ and $L^{(m-1)}(p_i p_j)$, $i \neq j$, and their term numbers are available, the general term formula of $E^{(m)}_n(T)$ can be obtained directly. The following general term formulae can be verified:

(a') $L^{(m-1)}(p_i) = \frac{1}{m!}$
(b') $L^{(m-1)}(p_i^2) = \frac{2}{(m+1)!}$
(c') $L^{(m-1)}(p_i p_j) = \frac{1}{(m+1)!}$

Then

$$ E^{(m)}_n(T) = \frac{2(n-1)}{Z(m-1) \cdot n} \, L^{(m-1)}\Big( \sum_{i=1}^{m-1} p_i - \sum_{i=1}^{m-1} p_i^2 - \sum_{i=1}^{m-1} \sum_{j=i+1}^{m-1} p_i \, p_j \Big) = \frac{2(n-1)}{Z(m-1) \cdot n} \Big( \frac{m-1}{m!} - \frac{2(m-1)}{(m+1)!} - \frac{(m-1)(m-2)}{2(m+1)!} \Big) $$

Recall that $Z(m-1) = \frac{1}{(m-1)!}$; after some simplification steps, Formula 5 is obtained from the above equation, which completes the proof of Proposition 2.

Corollary 2. Let

$$ E^{(m)}(T) = \frac{1}{Z(m-1)} \int_0^1 \int_0^{1-p_1} \cdots \int_0^{1 - \sum_{i=1}^{m-2} p_i} T\big[P^{(m)}\big] \, dp_{m-1} \, dp_{m-2} \cdots dp_1 $$

where $Z(m-1)$ is the normalization factor $\frac{1}{(m-1)!}$. Then we have

$$ \lim_{n \to \infty} E^{(m)}_n(T) = E^{(m)}(T) \tag{8} $$

Proof. The corollary follows directly from the fact that $E^{(m)}(T)$ is given by the r.h.s. of Formula 6, except for the multiplicative factor $\frac{n-1}{n}$.
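Proposition 2 can be checked the same way, using the fact that the uniform metric over $\mathcal{P}^{(m)}$ is the flat Dirichlet distribution; the sketch below is ours:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, trials = 5, 20, 50_000

# The uniform probability metric over all m-nomial distributions is Dirichlet(1, ..., 1).
P = rng.dirichlet(np.ones(m), size=trials)
X = np.array([rng.multinomial(n, p) for p in P]) / n
empirical = np.mean(1.0 - np.sum(X ** 2, axis=1))  # Monte Carlo estimate of E_n(T)
predicted = (n - 1) * (m - 1) / (n * (m + 1))      # Formula 5
print(empirical, predicted)
```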
4 Density Estimation based on the Tsallis Entropy Bias

Based on the above theoretical results, we first discuss the estimation of the Tsallis entropy bias, and then propose density estimation methods in the Maxent framework and the Lidstone framework, respectively.

4.1 On Estimation of the Tsallis Entropy Bias

To apply the result of Proposition 1, the frequentist Tsallis entropy bias (Frequentist-TEB) should be effectively estimated so that the Tsallis entropy of the sampling distribution $\widehat{P}^{(m)}_n$ can be compensated accordingly. According to Formula 2, the expected Tsallis entropy of $\widehat{P}^{(m)}_n$, i.e., $E^{(m)}_{P,n}(T)$, is $(n-1)/n$ of the Tsallis entropy of the underlying real distribution $P^{(m)}$, i.e., $T[P^{(m)}]$. We denote the estimate of $E^{(m)}_{P,n}(T)$ as $\widehat{E}^{(m)}_{P,n}(T)$; then $\frac{n}{n-1} \widehat{E}^{(m)}_{P,n}(T)$ can be considered an estimate of $T[P^{(m)}]$. Hence, the estimate of the frequentist Tsallis entropy bias, i.e., Frequentist-TEB, is given by

$$ \Delta T = \frac{n}{n-1} \widehat{E}^{(m)}_{P,n}(T) - T\big[\widehat{P}^{(m)}_n\big] \tag{9} $$

The simplest (and unbiased) estimate of $E^{(m)}_{P,n}(T)$ is given by $T\big[\widehat{P}^{(m)}_n\big]$, and hence the corresponding estimated TEB is

$$ \Delta T = \frac{1}{n-1} T\big[\widehat{P}^{(m)}_n\big] \tag{10} $$

Remark 1. $T\big[\widehat{P}^{(m)}_n\big]$ is an unbiased estimate of $E^{(m)}_{P,n}(T)$. Therefore, Formula 10 gives an unbiased correction of $T[P^{(m)}]$. In addition, the consistency of this correction follows from the law of large numbers (if $m$ is finite) or the central limit theorem for multinomial sums [Morris, 1975] (if $m$ is infinite).

Surprisingly, experimental results (detailed in Section 6) show that the TEBC Maxent and the TEB-Lidstone estimator based on this naive estimator can outperform all comparative density estimation models in most cases.

In many cases, using statistical re-sampling techniques, e.g., the bootstrap [Wasserman, 2006], we can achieve more accurate estimates of $T[P^{(m)}]$, and hence obtain better estimates of Frequentist-TEB. In the following, we give an estimation procedure for $T[P^{(m)}]$ which is optimal in the sense of least squared error. Let us rewrite Formula 2 so that $E^{(m)}_{P,n}(T)$ is expressed as a function of $n$:

$$ E^{(m)}_{P,n}(T) = K \cdot \frac{n-1}{n} $$

where $K$ is a constant slope determined by $T[P^{(m)}]$. We can estimate $E^{(m)}_{P,i}(T)$, $1 \le i \le n$, by re-sampling techniques and obtain $\widehat{E}^{(m)}_{P,i}(T)$, $1 \le i \le n$. Note that the re-sampling is meaningful only if $i < n$. The remaining task is to solve an unconstrained quadratic program so that the squared error

$$ \sum_{i=1}^n \Big( \widehat{E}^{(m)}_{P,i}(T) - \frac{i-1}{i} K \Big)^2 \tag{11} $$

is minimized. This minimum corresponds to the zero point of the first derivative of the cost function:

$$ 0 = \frac{\partial}{\partial K} \sum_{i=1}^n \Big( \widehat{E}^{(m)}_{P,i}(T) - \frac{i-1}{i} K \Big)^2 $$

By expanding the above equation, it turns out that the estimated slope is given by

$$ \widehat{K} = \frac{ \sum_{i=1}^n \frac{i-1}{i} \, \widehat{E}^{(m)}_{P,i}(T) }{ \sum_{i=1}^n \frac{(i-1)^2}{i^2} } $$

where $\widehat{K}$ is the estimate of $T[P^{(m)}]$. Note that, in practice, it is often sufficient to involve only an appropriate subset of $\widehat{E}^{(m)}_{P,i}(T)$, $1 \le i \le n$, in the cost function (11) to construct the estimate of $T[P^{(m)}]$.

However, even with the re-sampling technique, when the sampling is seriously inadequate it still seems difficult to estimate $E^{(m)}_{P,n}(T)$ accurately, which might in turn result in an inaccurate Frequentist-TEB. In this case, some Bayesian prior over the space of all possible real distributions might be more useful.
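A minimal sketch (ours) of this least-squared-error procedure, with bootstrap re-sampling used to approximate $\widehat{E}^{(m)}_{P,i}(T)$; the function name and the number of bootstrap replicates are hypothetical choices:

```python
import numpy as np

def estimate_T_real(sample, n_boot=100, rng=None):
    """Least-squared-error slope K-hat of the cost function (11):
    E_{P,i}(T) ~ K (i-1)/i, with E-hat_{P,i}(T) approximated by
    bootstrap re-sampling (sketch). `sample` is an array of bin labels."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(sample)
    num = den = 0.0
    for i in range(2, n + 1):                          # i = 1 carries zero weight (i-1)/i
        T_i = 0.0
        for _ in range(n_boot):
            sub = rng.choice(sample, size=i, replace=True)
            _, counts = np.unique(sub, return_counts=True)
            T_i += 1.0 - np.sum((counts / i) ** 2)     # Tsallis entropy, q = 2
        w = (i - 1) / i
        num += w * (T_i / n_boot)
        den += w * w
    return num / den                                   # the estimate of T[P^(m)]
```

Given $\widehat{K}$, the bias estimate of Formula 9 reduces to $\Delta T = \widehat{K} - T\big[\widehat{P}^{(m)}_n\big]$, since $\frac{n}{n-1}\widehat{E}^{(m)}_{P,n}(T) \approx \widehat{K}$.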
The results of Proposition 2 and Corollary 2 guide the construction of the uniform Bayesian-TEB, i.e.,

$$ \Delta T = \frac{m-1}{n(m+1)} \tag{12} $$

obtained by assuming the uniform probability metric over all possible real distributions. Note that the uniform Bayesian-TEB is directly obtained by computing the difference between the expected Tsallis entropy of all possible real distributions and the expectation of $E^{(m)}_{P,n}(T)$, i.e., $E^{(m)}_n(T)$, w.r.t. the uniform prior. Hence, it avoids the estimation of $E^{(m)}_{P,n}(T)$.

Remark 2. Corollary 2 shows that $E^{(m)}_n(T)$ is an asymptotically unbiased estimate of $E^{(m)}(T)$, and it becomes unbiased if the Bayesian-TEB $\frac{m-1}{n(m+1)}$ is compensated. In addition, this corrected estimator is also consistent with $E^{(m)}(T)$.

In summary, Remarks 1 and 2 show that the estimates of the Tsallis entropy bias, in both the frequentist and Bayesian frameworks, can be considered sound in the sense of unbiasedness and consistency. Note that the Shannon entropy estimate lacks these guarantees.

4.2 TEBC Maxents

In this subsection, we propose three Tsallis entropy bias compensation (TEBC) Maxents to compute resulting distributions. These distributions are most similar to the sampling distribution w.r.t. different similarity criteria, subject to the constraint that the estimated Frequentist/Bayesian-TEB, denoted as $\Delta T$, is forcibly compensated. This strategy can help alleviate the overfitting and underfitting problems. Given any $m$-nomial sampling distribution $\widehat{P}^{(m)}_n \equiv \langle \hat{p}_1, \dots, \hat{p}_m \rangle$ of sample size $n$ and the estimated $\Delta T$, the TEBC Maxents compute the resulting distribution $\bar{P}^{(m)} \equiv \langle \bar{p}_1, \dots, \bar{p}_m \rangle$ w.r.t. the criterion of the $l_2^2$ norm, the Jensen-Shannon (JS) divergence (see Formula 15 for details) or Maximum Likelihood, respectively.

Model 1: $l_2^2$ Tsallis Entropy Bias Compensation ($l_2^2$-TEBC)

$$ \begin{aligned} \min_{\bar{P}^{(m)}} \;& \sum_{i=1}^m (\bar{p}_i - \hat{p}_i)^2 \\ \text{s.t.}\;& T\big[\bar{P}^{(m)}\big] \ge T\big[\widehat{P}^{(m)}_n\big] + \Delta T \\ & \text{Certain Constraints} \end{aligned} \tag{13} $$

Model 2: JS-Divergence Tsallis Entropy Bias Compensation (JSD-TEBC)

$$ \begin{aligned} \min_{\bar{P}^{(m)}} \;& JSD\big[\bar{P}^{(m)} \,\big\|\, \widehat{P}^{(m)}_n\big] \\ \text{s.t.}\;& T\big[\bar{P}^{(m)}\big] \ge T\big[\widehat{P}^{(m)}_n\big] + \Delta T \\ & \text{Certain Constraints} \end{aligned} \tag{14} $$

where $JSD[\cdot\|\cdot]$ denotes the JS-divergence. Note that a common statistic to measure the divergence between two probability distributions is the Kullback-Leibler (KL) divergence. Despite the computational and theoretical advantages of the KL-divergence, it is not symmetric in its arguments: reversing the arguments of the KL-divergence function can yield substantially different results. Furthermore, $KL(P, Q)$ may be seriously underestimated if $P$ involves zero terms, since $\lim_{p_i \to 0} p_i \log \frac{p_i}{q_i} = 0$. Besides, $KL(P, Q)$ is sensitive to the penalty terms used in the case of $q_i = 0$. Hence, we apply a symmetrized variant of the KL-divergence, i.e., the JS-divergence, instead:

$$ JSD[P \| Q] \equiv \tfrac{1}{2} D[P \| M] + \tfrac{1}{2} D[Q \| M] \tag{15} $$

where $D[P \| M]$ is the KL-divergence from $P$ to $M$ and $M = (P + Q)/2$.

Model 3: Maximum Likelihood Tsallis Entropy Bias Compensation (ML-TEBC)

$$ \begin{aligned} \max_{\bar{P}^{(m)}} \;& \log \Pr\big[\widehat{P}^{(m)}_n \,\big|\, \bar{P}^{(m)}\big] \\ \text{s.t.}\;& T\big[\bar{P}^{(m)}\big] \ge T\big[\widehat{P}^{(m)}_n\big] + \Delta T \\ & \text{Certain Constraints} \end{aligned} \tag{16} $$

where $\Pr\big[\widehat{P}^{(m)}_n \,\big|\, \bar{P}^{(m)}\big]$ is given by Formula 3. A minimal implementation sketch of Model 1 is given below.
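For concreteness, here is a minimal sketch (ours, not the authors' implementation) of Model 1, using the naive Frequentist-TEB of Formula 10, only the simplex conditions as certain constraints, and a general-purpose SLSQP solver:

```python
import numpy as np
from scipy.optimize import minimize

def l22_tebc(p_hat, n):
    """Sketch of Model 1 (l2^2-TEBC) with the naive Frequentist-TEB
    of Formula 10; the simplex conditions are the only certain constraints."""
    m = len(p_hat)
    T_hat = 1.0 - np.sum(p_hat ** 2)          # Tsallis entropy (q = 2) of the sample
    dT = T_hat / (n - 1)                      # Frequentist-TEB (Formula 10)
    cons = [
        {"type": "eq",   "fun": lambda p: np.sum(p) - 1.0},
        {"type": "ineq", "fun": lambda p: (1.0 - np.sum(p ** 2)) - (T_hat + dT)},
    ]
    res = minimize(lambda p: np.sum((p - p_hat) ** 2), x0=p_hat,
                   bounds=[(0.0, 1.0)] * m, constraints=cons, method="SLSQP")
    return res.x
```

The inequality here is the convex quadratic TEB constraint; since the objective pulls the solution toward $\widehat{P}^{(m)}_n$, the constraint typically holds with equality at the optimum, in line with the analysis that follows.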
In Models 1, 2 and 3, all objective functions aim at making the resulting distribution as similar to the sampling distribution as possible. This counters the underfitting problem. Meanwhile, the TEB constraint is used to increase the entropy of the resulting distribution and thereby compensate the Tsallis entropy bias, in order to make the resulting distribution approximate the underlying real distribution. This avoids the overfitting problem. From another point of view, in the expected sense, the objective function tends to reduce the Tsallis entropy, while the TEB constraint necessarily increases it. Through this joint effort, the objective function forces the TEB constraint to hold with equality, which is consistent with our previous theoretical analysis.

In implementation, all the above objective functions and constraints are convex. Therefore, TEBC Maxents can be globally solved by efficient methods, e.g., interior-point methods [Boyd and Vandenberghe, 2004]. In specific application contexts, TEBC Maxents should include certain constraints, which are derived from reliable prior information and do not involve empirical threshold parameters. A specific example of the form of certain constraints is given in our experiments (see Section 6.2.2 for details).

TEBC Maxents can be constructed with Frequentist-TEB or Bayesian-TEB. In cases where the sampling process is seriously inadequate, the Bayesian TEBC Maxents are expected to have stable performance. The reason is that, given a uniform Bayesian prior and an inadequate sampling, e.g., $n \approx m$, the standard deviation of $E^{(m)}_{P,n}(T)$ tends to be negligible compared to the uniform Bayesian-TEB, which implies a relatively stable estimate of the uniform Bayesian-TEB. The detailed proof is given in Proposition 3 of Appendix A.

4.3 TEB-Lidstone Estimators

There is a natural connection between TEBs and the Lidstone estimator. Lidstone's law of succession suggests the family of Lidstone estimators of the following form:

$$ \bar{p}_i = \frac{x_i + f}{n + f \cdot m} $$

where $n$ is the sample size, $m$ is the number of nomials, $x_i$ is the count of the $i$th nomial, and $f$ is a parameter indicating the rate of probability correction (normally between 0 and 1). When $f = 0.5$, it turns out to be the well-known Expected Likelihood Estimator (ELE), i.e.,

$$ \bar{p}_i = \frac{x_i + 0.5}{n + 0.5 \cdot m} $$

Another two common Lidstone estimators are the add-one estimator ($f = 1$) and the add-tiny estimator ($f = 1/n$). The smaller $f$ is, the less probability mass it compensates for underestimations. There exist some explanations for the selection of the parameter $f$. For example, ELE has a Bayesian justification obtained by assuming a uniform prior for a binomially distributed variable [Box and Tiao, 1973]. However, in general cases, $f$ is empirically configured.

TEBs offer a set of criteria, either of which analytically identifies an adaptive $f$ w.r.t. a specific input sample and thereby derives the TEB-Lidstone estimator. The fundamental idea is to solve for an $f$ such that the Tsallis entropy bias of the input sample is quantitatively compensated. That is,

$$ 1 - \sum_i \Big( \frac{x_i + f}{n + f \cdot m} \Big)^2 = T\big[\widehat{P}^{(m)}_n\big] + \Delta T $$

Let $\alpha \equiv 1 - \big( T\big[\widehat{P}^{(m)}_n\big] + \Delta T \big)$. It turns out that we have the following quadratic equation in the single variable $f$:

$$ \big( \alpha m^2 - m \big) f^2 + 2n \big( \alpha m - 1 \big) f + \Big( \alpha n^2 - \sum_{i=1}^m x_i^2 \Big) = 0 \tag{17} $$

Occasionally, Formula 17 has no real roots. In this case, we can simply select the $f$ corresponding to the minimum (if $\alpha m^2 - m > 0$) or the maximum (if $\alpha m^2 - m < 0$) of the l.h.s. of Formula 17. Consequently, the so-called "F-Lidstone" and "B-Lidstone" estimators are derived from Frequentist-TEB and Bayesian-TEB, respectively; a small sketch follows.
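The following sketch (ours) solves Formula 17 for $f$ and falls back to the extremum of the l.h.s. when there is no real root; the choice between two real roots is a heuristic of ours, since the paper does not specify it:

```python
import numpy as np

def teb_lidstone_f(x, dT):
    """Identify the Lidstone correction rate f from Formula 17, given
    bin counts x and an estimated TEB dT (Frequentist or Bayesian). Sketch."""
    x = np.asarray(x, dtype=float)
    n, m = x.sum(), len(x)
    T_hat = 1.0 - np.sum((x / n) ** 2)
    alpha = 1.0 - (T_hat + dT)
    a = alpha * m * m - m
    b = 2.0 * n * (alpha * m - 1.0)
    c = alpha * n * n - np.sum(x ** 2)
    disc = b * b - 4.0 * a * c
    if disc < 0.0:                 # no real root: take the extremum of the l.h.s.,
        return -b / (2.0 * a)      # i.e. the vertex of the quadratic in f
    r1 = (-b + np.sqrt(disc)) / (2.0 * a)
    r2 = (-b - np.sqrt(disc)) / (2.0 * a)
    # heuristic (our assumption, not from the paper): prefer a non-negative root
    candidates = [r for r in (r1, r2) if r >= 0.0]
    return min(candidates) if candidates else max(r1, r2)

def teb_lidstone(x, dT):
    x = np.asarray(x, dtype=float)
    f = teb_lidstone_f(x, dT)
    return (x + f) / (x.sum() + f * len(x))   # the Lidstone family formula
```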
Note that, in principle, the Tsallis entropy bias can also serve to identify the parameters of some other estimators, e.g., the multiplicative parameter of the Good-Turing estimator. We omit the computation details here.

5 Evaluation Criteria

The performance of a density estimation method can be directly evaluated by measuring the similarity between its solution and the underlying real distribution. As defined in Formula 15, the JS-divergence can be considered a candidate criterion. In addition, we use the expected log loss (the expected negative normalized log likelihood [Dudik et al., 2007]) as another similarity criterion. The log loss of a resulting distribution $\bar{P}^{(m)} \equiv \langle \bar{p}_1, \dots, \bar{p}_m \rangle$ with respect to a sample $x \equiv \langle x_1, x_2, \dots, x_m \rangle$ is defined as:

$$ L_{\bar{P}^{(m)}}(x) \equiv -\log \bar{p}_1^{x_1} \bar{p}_2^{x_2} \cdots \bar{p}_m^{x_m} = -\sum_{i=1}^m x_i \log \bar{p}_i \tag{18} $$

Recall that, given an underlying real $m$-nomial distribution $P^{(m)} \equiv \langle p_1, \dots, p_m \rangle$, the occurrence probability of a sample $x \equiv \langle x_1, x_2, \dots, x_m \rangle$ of size $n$ can be expressed by

$$ \Pr\big[x \,\big|\, P^{(m)}\big] = \frac{n!}{x_1! \, x_2! \cdots x_m!} \, p_1^{x_1} p_2^{x_2} \cdots p_m^{x_m} \tag{19} $$

Hence, we can define the expected log loss of $\bar{P}^{(m)}$ w.r.t. $P^{(m)}$ as

$$ EL\big[\bar{P}^{(m)} \,\big|\, P^{(m)}\big] \equiv \sum_{x \in \mathcal{X}} L_{\bar{P}^{(m)}}(x) \Pr\big[x \,\big|\, P^{(m)}\big] \tag{20} $$

where $\mathcal{X}$ stands for the set of all possible samples of size $n$. By substituting Formulas 18 and 19 into the r.h.s. of Formula 20, it can be checked that

$$ \begin{aligned} EL\big[\bar{P}^{(m)} \,\big|\, P^{(m)}\big] &= \sum_{x \in \mathcal{X}} \sum_{i=1}^m -x_i \log \bar{p}_i \cdot \frac{n!}{x_1! \, x_2! \cdots x_m!} \, p_1^{x_1} p_2^{x_2} \cdots p_m^{x_m} \\ &= -\sum_{i=1}^m \log \bar{p}_i \sum_{x \in \mathcal{X}} x_i \, \frac{n!}{x_1! \, x_2! \cdots x_m!} \, p_1^{x_1} p_2^{x_2} \cdots p_m^{x_m} \\ &= -\sum_{i=1}^m \log \bar{p}_i \cdot (n p_i) = -n \sum_{i=1}^m p_i \log \bar{p}_i \end{aligned} \tag{21} $$

To measure the performance of different algorithms w.r.t. a specific evaluation criterion more systematically, we introduce the Performance Score (PS) as below:

$$ PS_{C_D}(A) = \frac{C_D(A) - C_D(Worst)}{C_D(Best) - C_D(Worst)} \tag{22} $$

where $A$ stands for an algorithm under evaluation, $C$ represents an evaluation criterion and $D$ denotes a specific dataset; the performance value $C_D(A)$ is evaluated by criterion $C$ for algorithm $A$ running on dataset $D$, and $C_D(Best)$ and $C_D(Worst)$ are the performance values of the best and worst algorithms on $D$, respectively.
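For reference, minimal implementations (ours) of the three criteria:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (Formula 15), with a small eps guarding zeros."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def expected_log_loss(p_bar, p_real, n):
    """Expected log loss of a resulting distribution w.r.t. the real one
    (Formula 21): -n * sum_i p_i log pbar_i."""
    return -n * np.sum(np.asarray(p_real) * np.log(np.asarray(p_bar)))

def performance_score(value, best, worst):
    """Performance Score (Formula 22); higher scores indicate better algorithms."""
    return (value - worst) / (best - worst)
```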
6 Experiments

In this section, we construct three sets of experiments. (The source code is available at http://www.comp.rgu.ac.uk/staff/pz/TEBC/TEB_Experiment_Codes.zip.) First, when reliable prior information is available, the performance of TEBC Maxents in the Maxent framework is evaluated in comparison with standard Maxent, GME (an $l_1$-regularized Maxent which has a PAC guarantee of performance) [Dudik et al., 2007], as well as Maxents based on the Shannon entropy bias (SEB) [Miller, 1955], described in Section 6.2.1. Second, when reliable prior information is not available, we verify the effectiveness of TEB-Lidstone by comparing its performance with comparative Lidstone and Good-Turing estimators, together with SEB-Lidstone, which is derived from the above SEB in a similar way to TEB-Lidstone (described in Section 6.3.1). In all the above experimental settings, both synthesized and real-world datasets are employed. For the TEBs, Bayesian-TEB is calculated by Formula 12, and Frequentist-TEB is estimated by the naive estimator given in Formula 10, which is more efficient in large-scale experiments and also gives satisfying performance in both the Maxent and Lidstone frameworks.

6.1 Datasets Description

First, synthesized probability distributions are generated to serve as the underlying real $m$-nomial distributions $P^{(m)}$. To this end, we adopt a simple Monte Carlo method to randomly draw $m$ positive points from a source distribution and then normalize them to form an underlying real distribution. The source distributions we used include the uniform distribution $U(0,1)$, the absolute value of the standard normal distribution $|N(0,1^2)|$, the normal distribution $N(3,1^2)$, the $\chi^2$ distribution $\chi^2(10)$, the binomial distribution $B(30, 0.2)$ and the beta distribution $\beta(3,6)$. After the underlying real distribution is generated, a sample of size $n$ is drawn from it, which is then used to calculate the sampling distribution $\widehat{P}^{(m)}$.

We also adopt four real-world datasets in order to generate underlying real distributions: UCI-Dexter (http://archive.ics.uci.edu/ml/machine-learning-databases/dexter/DEXTER/), UCI-Statlog (http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/satimage/), UCI-ISOLET (http://archive.ics.uci.edu/ml/machine-learning-databases/isolet/) and UCI-Sonar (http://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/).

Text dataset: Dexter. UCI-Dexter is a text dataset containing 2000+300 documents. Each document is represented as a 20000-term count vector. Before the text dataset is actually employed, it is preprocessed by dropping the terms that occur too frequently, i.e., the "stop words". After the preprocessing step, $m$ terms are randomly selected from the whole term set, and the frequencies of these selected terms are considered as the underlying real distribution $P^{(m)}$. Then, we randomly choose a bag (of size $n$) of words from all the documents as a sample. The frequencies of these terms in the sample are calculated and considered as the sampling distribution $\widehat{P}^{(m)}$.

Non-text datasets: Statlog, ISOLET, Sonar. The UCI-Statlog (Landsat Satellite) dataset consists of all possible 3×3 neighborhoods in an 82×100 pixel sub-area of a single scene, which is represented by four digital images in different spectral bands. A sample is then defined as the pixel values of each 3×3 neighborhood in the four spectral bands (hence 4×9 = 36 features in total). The size of the dataset is 4435+2000. The UCI-ISOLET dataset includes 150 subjects speaking the name of each letter of the alphabet twice. The speakers are divided into groups of 30 speakers. There are 617 real-valued features, including spectral coefficients, contour features, sonorant features, pre-sonorant features and post-sonorant features, in an unknown order. UCI-Sonar contains 111 patterns obtained by bouncing sonar signals off a metal cylinder at various angles and under various conditions, and 97 patterns obtained from rocks under similar conditions. The transmitted sonar signal is a frequency-modulated chirp, rising in frequency. The dataset contains signals obtained from a variety of different aspect angles, spanning 90 degrees for the cylinder and 180 degrees for the rock. Each pattern is a set of 60 numbers in the range 0.0 to 1.0, each representing the energy within a particular frequency band, integrated over a certain period of time.

For the above three non-text datasets, in order to generate the $m$-nomial underlying real distribution, we simply partition a randomly selected feature into $m$ intervals covering the whole range of this single feature. The number of instances in each interval is then counted, and the underlying real distribution $P^{(m)}$ is formed by normalizing the count vector. The sampling process first randomly chooses $n$ instances and distributes them into the corresponding intervals based on their feature values, and then forms the sampling distribution $\widehat{P}^{(m)}$ by normalizing this sampling count vector.
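A minimal sketch (ours) of the two generation procedures described above; the helper names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)

def synthesized_real_distribution(m, draw):
    """Monte Carlo generation of P^(m): draw m positive points from a
    source distribution and normalize them (sketch)."""
    points = np.abs(draw(size=m))          # abs() covers sources such as |N(0,1)|
    return points / points.sum()

def binned_real_distribution(feature, m):
    """Non-text datasets: partition one feature into m intervals covering its
    range and normalize the interval counts into P^(m)."""
    counts, _ = np.histogram(feature, bins=m)
    return counts / counts.sum()

m, n = 100, 1000                           # n = 10 * m, as in Table 1
p_real = synthesized_real_distribution(m, rng.standard_normal)
p_hat = rng.multinomial(n, p_real) / n     # the sampling distribution
```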
6.2 TEBC Maxents vs. Comparative Maxents

Maxent is widely used due to its effective use of reliable prior information. In this set of experiments, where reliable prior information is given, TEBC Maxents and other Maxents are compared in terms of their density estimation performance.

6.2.1 Maxents

Various forms of Maxents are tested, including the Frequentist TEBC Maxents (F-$l_2^2$-TEBC, F-ML-TEBC and F-JSD-TEBC), the Bayesian TEBC Maxents (B-$l_2^2$-TEBC, B-ML-TEBC and B-JSD-TEBC), standard Maxent (SME for short), and GME [Dudik et al., 2007] (implemented by $l_1$-SUMMET and $l_1$-PLUMMET, which stand for the selective-update and parallel-update algorithms for $l_1$-regularized Maxent, respectively). In addition, we also construct three Maxents based on the Shannon entropy bias (SEB) [Miller, 1955]. The SEB Maxents are similar to the TEBC Maxents (Models 1-3) except that the TEB constraint is replaced by the following SEB constraint:

$$ S\big[\bar{P}^{(m)}\big] \ge S\big[\widehat{P}^{(m)}_n\big] + \Delta S \tag{23} $$

where $S[\cdot]$ denotes the Shannon entropy of a probability distribution and $\Delta S = (m-1)/(2n)$ denotes the SEB. We use $\frac{m-1}{2n}$ as the correction since it is simple in form and frequently used. (In the original formula, an estimate of $m$ is used instead of the real one; in our context, $m$ is known in advance and hence the estimation can be avoided.) Note that it cannot be considered an unbiased correction in a strict sense [Paninski, 2003]. By substituting the SEB constraint for the TEB constraint in Models 1-3, three new Maxents are introduced in the experiment, namely $l_2^2$-SEB, ML-SEB and JSD-SEB. In the following, we adopt "Sample" to denote the method that directly uses the sampling distribution as the resulting distribution.

6.2.2 Certain and Uncertain Constraints

Two kinds of constraints are involved: certain constraints and uncertain constraints. Certain constraints are derived from reliable prior information, which is incomplete information about the underlying real distribution. Specifically, certain constraints can be represented by a set of constraints as follows:

$$ \sum_{i \in S_c} \bar{p}_i = a^*_c = \sum_{i \in S_c} p_i \quad \forall c \in C \tag{24} $$

where $S_c$ is a subset of $\{1, 2, \dots, m\}$. In order to form this subset, we randomly choose a number $|S_c|$ from $\{1, 2, \dots, m-1\}$, and then randomly select $|S_c|$ indexes from $\{1, 2, \dots, m\}$. By repeating this step $k$ times, $k$ certain constraints are derived, and the set $C$ of all certain constraints is formed (see the sketch below).
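A sketch (ours) of this constraint-generation procedure; the function name is hypothetical:

```python
import numpy as np

def generate_certain_constraints(p_real, k, rng):
    """Formula 24: k random index subsets S_c, each paired with its exact
    probability sum a*_c under the real distribution (sketch)."""
    m = len(p_real)
    constraints = []
    for _ in range(k):
        size = rng.integers(1, m)                     # |S_c| drawn from {1, ..., m-1}
        s_c = rng.choice(m, size=size, replace=False) # random index subset
        constraints.append((s_c, p_real[s_c].sum()))  # the pair (S_c, a*_c)
    return constraints
```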
Uncertain constraints are derived from sampling information, together with empirical threshold parameters to control the similarity between the resulting distribution and the sampling distribution. Specifically, for every $i$, if $\hat{p}_i \ge th$, then we construct an uncertain constraint represented by a box constraint:

$$ \text{Box Constraint:} \quad |\bar{p}_i - \hat{p}_i| \le \delta \tag{25} $$

where $\delta$ and $th$ are threshold parameters. In our experiments, we fix $th$ as $0.2/m$, and adjust $\delta$ to find relatively optimal performance for standard Maxent and GME.

6.2.3 Parametric Configuration

For all the models, the same parameters are used, including the generating times, bin number, sample size, and sampling times. The generating times is the number of times to generate the underlying real distribution. The sampling times is the number of times to draw the sampling distribution from the given underlying real distribution. Note that, in order to avoid a zero probability for any bin in the $m$-nomial underlying real distribution, the bin number $m$ should be set properly for each real-world dataset in terms of its scale. For example, the Sonar dataset has a relatively small number of data points; therefore, we let $m_{Sonar} = 30$ to avoid a degenerate $m$-nomial underlying real distribution. All Maxents involve the same certain constraints. TEBC Maxents and SEB Maxents adopt the corresponding TEB and SEB constraints, while the other three Maxents use uncertain constraints. For the uncertain constraints, we choose $\delta$ in Formula 25 from $[1e{-}4, 1.6e{-}3]$ with increment $1e{-}4$. The performance of TEBC Maxents is compared with the performance of the comparative Maxents under the optimal $\delta$, i.e., the $\delta$ under which the comparative Maxents obtain relatively optimal performance compared with other $\delta$'s. The parametric configuration is listed in Table 1.

Table 1  Parametric Configuration in the Maxent Framework

  Category                Detailed Configurations
  Generating Times        r = 10
  #Bin                    m_{Syn,Dexter,ISOLET} = 100, m_{SAT} = 50, m_{Sonar} = 30
  Sample Size             n = 10 · m
  Sampling Times          s = 20
  #Certain Constraints    k = 0.2 × m, k = 0.05 × m
  Threshold parameter     δ = 6e−4, chosen from [1e−4, 1.6e−3]

6.2.4 Results

We employ the parametric configuration in Table 1 to run every Maxent. The mean performance scores w.r.t. JS-divergence and expected log loss, averaged over the $r \times s$ sampling distributions, are summarized in Tables 2, 3, 4 and 5. In addition to the performance scores, we also give the best and worst values w.r.t. the two evaluation criteria. Finally, the overall performance of each algorithm, averaged over all synthesized and all real-world datasets, is shown in Tables 6 and 7, respectively.

When $k = 0.2 \times m$, in the experimental results on synthesized datasets, all TEBC Maxents outperform the comparative Maxents in most cases. From Table 6, we can observe that, on average, F-ML-TEBC and B-ML-TEBC are the best two models among the TEBC Maxents. The superiority of TEBC Maxents is clearly shown in Tables 2 and 6 under the JS-divergence measure. This superiority also holds under the expected log loss measure, as demonstrated in Tables 3 and 7.
As for the real-world datasets, we can still draw the same conclusion from Tables 4 and 5: TEBC Maxents show their advantages over the others. As with the results on synthesized datasets, F-ML-TEBC and B-ML-TEBC again show their robustness and are the best two models in the average sense, as seen in Tables 6 and 7.

When $k = 0.05 \times m$, the amount of prior information is actually reduced. In this case, it is also clearly demonstrated that all TEBC Maxents outperform the comparative Maxents on average. In particular, we can observe that F-$l_2^2$-TEBC and B-$l_2^2$-TEBC become the most stable and effective, somewhat differing from their performance in the case where $k = 0.2 \times m$.

We would like to mention that in our experiments, standard Maxent and $l_1$-PLUMMET did not perform well. In many cases, using the sampling distribution directly (denoted as Sample) can outperform these two Maxents, especially in the experiments with real-world datasets. One of the causes is that when we fix a unified $\delta$ for all the uncertain constraints, there is often a dilemma: if a small $\delta$ is used, the feasible region might be far from the underlying real distribution, and hence the optimization procedure can only converge to a poor solution; if a large $\delta$ is used, there might be a great chance of underfitting the sample. Compared with standard Maxent and $l_1$-PLUMMET, $l_1$-SUMMET (using the selective-update strategy) is relatively stable. However, $l_1$-SUMMET with any unified $\delta$ cannot perform as well as the TEBC Maxents.

Table 2  Performance Score (w.r.t. JS-divergence) of different Maxents on Synthesized Datasets; entries are for certain constraints' number k = 0.2×m (left) / k = 0.05×m (right)

  Method          U(0,1)          |N(0,1²)|       N(3,1²)
  Sample          0.5983/0.8088   0.7005/0.8838   0.0000/0.0000
  F-l2²-TEBC      0.8639/0.9572   0.8695/0.9966   0.9793/1.0000
  F-JSD-TEBC      0.9985/0.9999   0.9983/0.9992   0.9919/0.9863
  F-ML-TEBC       1.0000/1.0000   1.0000/1.0000   1.0000/0.9940
  B-l2²-TEBC      0.8630/0.9571   0.8690/0.9962   0.9759/0.9955
  B-JSD-TEBC      0.9978/0.9993   0.9981/0.9990   0.9875/0.9803
  B-ML-TEBC       0.9992/0.9994   0.9997/0.9997   0.9955/0.9879
  SME             0.0000/0.0622   0.0231/0.0660   0.3774/0.6114
  l1-SUMMET       0.0017/0.0000   0.0000/0.0000   0.6376/0.6428
  l1-PLUMMET      0.0773/0.0566   0.0901/0.0577   0.4280/0.6299
  l2²-SEB         0.8392/0.7936   0.7190/0.8508   0.3398/0.0889
  JSD-SEB         0.9985/0.8494   0.9309/0.9296   0.3867/0.1172
  ML-SEB          0.8410/0.8496   0.9332/0.9299   0.3921/0.1176
  Best JS Value   0.0120/0.0143   0.0133/0.0156   0.0091/0.0113
  Worst JS Value  0.0262/0.0338   0.0281/0.0340   0.0182/0.0190

  Method          χ²(10)          β(3,6)          B(30,0.2)
  Sample          0.0000/0.0000   0.0000/0.0000   0.0000/0.0000
  F-l2²-TEBC      1.0000/1.0000   1.0000/1.0000   1.0000/1.0000
  F-JSD-TEBC      0.8736/0.7449   0.9243/0.8088   0.9443/0.8759
  F-ML-TEBC       0.8830/0.7560   0.9347/0.8190   0.9534/0.8859
  B-l2²-TEBC      0.9963/0.9955   0.9967/0.9958   0.9957/0.9957
  B-JSD-TEBC      0.8704/0.7409   0.9212/0.8046   0.9396/0.8709
  B-ML-TEBC       0.8798/0.7518   0.9314/0.8147   0.9486/0.8807
  SME             0.4469/0.7508   0.3993/0.5139   0.4686/0.7009
  l1-SUMMET       0.7528/0.7562   0.6077/0.4997   0.7484/0.7088
  l1-PLUMMET      0.5371/0.7649   0.4859/0.5135   0.5623/0.7070
  l2²-SEB         0.3620/0.0802   0.3635/0.0795   0.3299/0.0951
  JSD-SEB         0.4414/0.1373   0.4634/0.1414   0.3656/0.1217
  ML-SEB          0.4450/0.1373   0.4674/0.1419   0.3689/0.1218
  Best JS Value   0.0109/0.0128   0.0111/0.0129   0.0099/0.0113
  Worst JS Value  0.0186/0.0190   0.0184/0.0189   0.0191/0.0184
Note that there is no practical guidance for determining different $\delta$'s for different uncertain constraints. Even if such guidance existed, it would often be a prohibitively complicated task to find the optimal $\delta$ for each uncertain constraint. This reflects the implementation complexity of the existing Maxents and highlights the advantage of the parameter-free characteristic of TEBC Maxents. The idea is also supported by the experimental results of the SEB Maxents, which achieve better performance than SME and GME on average, although they are still less effective than the TEBC models.

Table 3  Performance Scores (w.r.t. Expected Log Loss) of different Maxents on Synthesized Datasets; entries are for k = 0.2×m (left) / k = 0.05×m (right)

  Method           U(0,1)          |N(0,1²)|       N(3,1²)
  Sample           0.8539/0.7594   0.9179/0.8291   0.3802/0.0547
  F-l2²-TEBC       0.9331/0.9871   0.9190/1.0000   1.0000/1.0000
  F-JSD-TEBC       0.9989/0.9987   0.9993/0.9840   0.9914/0.9371
  F-ML-TEBC        1.0000/1.0000   1.0000/0.9856   0.9982/0.9471
  B-l2²-TEBC       0.9293/0.9863   0.9191/0.9993   0.9977/0.9958
  B-JSD-TEBC       0.9986/0.9977   0.9992/0.9836   0.9884/0.9313
  B-ML-TEBC        0.9996/0.9989   0.9998/0.9851   0.9952/0.9413
  SME              0.0000/0.1225   0.0000/0.0115   0.0000/0.0000
  l1-SUMMET        0.7169/0.0719   0.7764/0.0000   0.8133/0.7731
  l1-PLUMMET       0.5838/0.0000   0.6521/0.0192   0.5717/0.7539
  l2²-SEB          0.7553/0.5448   0.7943/0.4707   0.5039/0.0400
  JSD-SEB          0.9319/0.7853   0.9741/0.8590   0.6117/0.1514
  ML-SEB           0.9327/0.7855   0.9749/0.8594   0.6152/0.1518
  Best ELL Value   6.4128/6.4179   6.2935/6.3022   6.5928/6.6009
  Worst ELL Value  6.5864/6.4879   6.5226/6.3587   6.6583/6.6394

  Method           χ²(10)          β(3,6)          B(30,0.2)
  Sample           0.4724/0.0936   0.6084/0.0911   0.7054/0.0236
  F-l2²-TEBC       1.0000/1.0000   1.0000/1.0000   1.0000/1.0000
  F-JSD-TEBC       0.9239/0.7299   0.9642/0.7823   0.9770/0.8411
  F-ML-TEBC        0.9298/0.7408   0.9693/0.7930   0.9803/0.8529
  B-l2²-TEBC       0.9985/0.9967   0.9982/0.9962   0.9988/0.9959
  B-JSD-TEBC       0.9221/0.7262   0.9629/0.7784   0.9755/0.8360
  B-ML-TEBC        0.9279/0.7369   0.9679/0.7889   0.9788/0.8477
  SME              0.0000/0.0000   0.0000/0.6955   0.0000/0.7849
  l1-SUMMET        0.9061/0.8648   0.8935/0.6993   0.9371/0.7977
  l1-PLUMMET       0.6539/0.8531   0.7555/0.7008   0.8527/0.7923
  l2²-SEB          0.5335/0.0147   0.6291/0.0000   0.7546/0.0000
  JSD-SEB          0.6897/0.1989   0.7776/0.1984   0.8075/0.1278
  ML-SEB           0.6917/0.1990   0.7793/0.1989   0.8085/0.1279
  Best ELL Value   6.5460/6.5533   6.5405/6.5491   6.5916/6.5834
  Worst ELL Value  6.6129/6.5873   6.6267/6.5820   6.7347/6.6178

Table 4  Performance Score (w.r.t. JS-divergence) of different Maxents on Real-world Datasets; entries are for k = 0.2×m (left) / k = 0.05×m (right)

  Method          Dexter          Statlog         ISOLET          Sonar
  Sample          0.3852/0.4350   0.8212/0.8968   0.5107/0.5693   0.3323/0.3176
  F-l2²-TEBC      0.9415/1.0000   0.8000/0.9010   0.8261/0.9997   0.9513/1.0000
  F-JSD-TEBC      0.9955/0.7596   0.9996/0.9995   0.9966/0.9565   0.9956/0.8274
  F-ML-TEBC       1.0000/0.7631   0.9999/1.0000   1.0000/0.9598   1.0000/0.8328
  B-l2²-TEBC      0.9395/0.9983   0.7998/0.9008   0.8250/1.0000   0.9476/0.9948
  B-JSD-TEBC      0.9952/0.7591   0.9996/0.9995   0.9951/0.9550   0.9942/0.8239
  B-ML-TEBC       0.9996/0.7624   1.0000/0.9999   0.9984/0.9582   0.9984/0.8289
  SME             0.0000/0.0697   0.0000/0.0065   0.0000/0.0691   0.0000/0.0737
  l1-SUMMET       0.3694/0.0000   0.2099/0.0000   0.1384/0.0000   0.0962/0.0000
  l1-PLUMMET      0.1956/0.0819   0.2588/0.1281   0.0867/0.0842   0.1148/0.1567
  l2²-SEB         0.5313/0.4094   0.7533/0.7945   0.5821/0.5497   0.5605/0.3684
  JSD-SEB         0.8874/0.5702   0.9853/0.9716   0.8675/0.6704   0.8117/0.5095
  ML-SEB          0.8910/0.5709   0.9863/0.9719   0.8704/0.6709   0.8144/0.5092
  Best JS Value   0.0144/0.0152   0.0132/0.0148   0.0132/0.0145   0.0134/0.0142
  Worst JS Value  0.0213/0.0213   0.0337/0.0304   0.0227/0.0228   0.0201/0.0197

Table 5  Performance Scores (w.r.t. Expected Log Loss) of different Maxents on Real-world Datasets; entries are for k = 0.2×m (left) / k = 0.05×m (right)

  Method           Dexter          Statlog         ISOLET          Sonar
  Sample           0.9023/0.4710   0.9811/0.9884   0.8680/0.7702   0.9696/0.9648
  F-l2²-TEBC       0.9463/1.0000   0.9351/0.9572   0.8621/0.9833   0.9838/1.0000
  F-JSD-TEBC       0.9991/0.7678   0.9998/0.9999   0.9986/0.9974   0.9997/0.9954
  F-ML-TEBC        1.0000/0.7714   0.9999/1.0000   1.0000/1.0000   1.0000/0.9959
  B-l2²-TEBC       0.9451/0.9970   0.9342/0.9569   0.8618/0.9919   0.9823/0.9992
  B-JSD-TEBC       0.9990/0.7671   0.9998/0.9998   0.9981/0.9963   0.9995/0.9952
  B-ML-TEBC        0.9998/0.7706   1.0000/0.9999   0.9995/0.9988   0.9998/0.9955
  SME              0.0000/0.5313   0.5737/0.4980   0.0000/0.5079   0.7454/0.5920
  l1-SUMMET        0.9470/0.4908   0.8644/0.6724   0.8252/0.4840   0.8715/0.8095
  l1-PLUMMET       0.8330/0.5286   0.0000/0.0000   0.6469/0.0000   0.0000/0.0000
  l2²-SEB          0.7848/0.0000   0.9028/0.8834   0.7169/0.5678   0.9333/0.9203
  JSD-SEB          0.9755/0.5619   0.9962/0.9920   0.9557/0.8100   0.9889/0.9740
  ML-SEB           0.9761/0.5626   0.9964/0.9920   0.9567/0.8103   0.9891/0.9740
  Best ELL Value   6.3275/6.3600   5.0336/5.0862   6.0809/6.2262   4.6626/4.6838
  Worst ELL Value  6.5260/6.3918   5.7561/5.4920   6.2318/6.2896   5.3435/5.1230

Table 6  Overall Performance Score evaluated by JS-Divergence for the experimental results in Section 6.2; entries are for k = 0.2×m / k = 0.05×m

  Algorithms      Synthesized     Real-world      Average
  Sample          0.2164/0.2821   0.5123/0.5547   0.3644/0.4184
  F-l2²-TEBC      0.9521/0.9923   0.8797/0.9752   0.9159/0.9837
  F-JSD-TEBC      0.9551/0.9025   0.9968/0.8858   0.9760/0.8941
  F-ML-TEBC       0.9618/0.9091   0.9999/0.8889   0.9809/0.8990
  B-l2²-TEBC      0.9495/0.9893   0.8780/0.9735   0.9137/0.9814
  B-JSD-TEBC      0.9524/0.8992   0.9960/0.8844   0.9742/0.8918
  B-ML-TEBC       0.9590/0.9057   0.9991/0.8874   0.9791/0.8965
  SME             0.2859/0.4508   0.0000/0.0547   0.1429/0.2528
  l1-SUMMET       0.4580/0.4346   0.2035/0.0000   0.3307/0.2173
  l1-PLUMMET      0.3634/0.4549   0.1640/0.1127   0.2637/0.2838
  l2²-SEB         0.4658/0.3313   0.6068/0.5305   0.5363/0.4309
  JSD-SEB         0.5712/0.3828   0.8879/0.6804   0.7296/0.5316
  ML-SEB          0.5746/0.3830   0.8905/0.6807   0.7326/0.5319

Table 7  Overall Performance Score evaluated by Expected Log Loss for the experimental results in Section 6.2; entries are for k = 0.2×m / k = 0.05×m

  Algorithms      Synthesized     Real-world      Average
  Sample          0.6564/0.3086   0.9302/0.7986   0.7933/0.5536
  F-l2²-TEBC      0.9753/0.9978   0.9318/0.9851   0.9536/0.9915
  F-JSD-TEBC      0.9758/0.8789   0.9993/0.9401   0.9875/0.9095
  F-ML-TEBC       0.9796/0.8866   0.9999/0.9418   0.9898/0.9142
  B-l2²-TEBC      0.9736/0.9950   0.9309/0.9862   0.9522/0.9906
  B-JSD-TEBC      0.9744/0.8756   0.9991/0.9396   0.9868/0.9076
  B-ML-TEBC       0.9782/0.8831   0.9998/0.9412   0.9890/0.9122
  SME             0.0000/0.2690   0.3298/0.5323   0.1649/0.4007
  l1-SUMMET       0.8406/0.5345   0.8770/0.6142   0.8588/0.5743
  l1-PLUMMET      0.6783/0.5199   0.3699/0.1321   0.5241/0.3260
  l2²-SEB         0.6618/0.1783   0.8345/0.5929   0.7481/0.3856
  JSD-SEB         0.7988/0.3868   0.9791/0.8345   0.8889/0.6106
  ML-SEB          0.8004/0.3871   0.9796/0.8347   0.8900/0.6109

In summary, TEBC Maxents show their effectiveness and stability, as demonstrated in the result tables, especially Tables 6 and 7. We would like to stress that TEBC Maxents are also easy to implement, since only certain constraints and a single TEB constraint are involved. This helps demonstrate our previous theoretical justification of TEB. In addition, note that the SEB-based models consistently outperform SME, l1-SUMMET and l1-PLUMMET, even though the SEB correction is indeed not exact. This observation indicates that the framework of our generalized Maxent proposed in Section 4.2 has gains in itself.
6.3 TEB-Lidstone vs. Comparative Lidstone and Good-Turing Estimators

When reliable prior information is not available, Lidstone and Good-Turing estimators are often used in many applications, since they are often effective enough and more efficient than Maxent. This set of experiments is constructed to verify the advantage of TEB-Lidstone (F-Lidstone and B-Lidstone) over the involved Lidstone and Good-Turing estimators.

6.3.1 Lidstone and Good-Turing Estimators

In this subsection, we evaluate the effectiveness of the F-Lidstone and B-Lidstone estimators, which have been described in Section 4.3. Various other Lidstone models, i.e., the Laplace estimator (Laplace) and the Expected Likelihood Estimator (ELE), serve as comparative algorithms. Motivated by TEB-Lidstone, we also derive a Lidstone estimator from SEB in a similar way to TEB-Lidstone.
Motivated by TEB-Lidstone, we also derive a Lidstone estimator from SEB in a similar way. To identify the rate of probability correction f, the SEB constraint (Formula 23) is used as guidance instead of the TEB constraint, so that the Shannon entropy of the resulting distribution, $S_h\left[\bar{P}^{(m)}\right]$, is closest to the sum of the sampling Shannon entropy $S_h\left[\hat{P}^{(m)}_n\right]$ and the Shannon entropy bias $\Delta S$. The resulting Lidstone estimator is referred to as SEB-Lidstone.

| Algorithms      | Synthesized   | Real-world    | Average       |
|-----------------|---------------|---------------|---------------|
| Sample          | 0.2164/0.2821 | 0.5123/0.5547 | 0.3644/0.4184 |
| F-$l_2^2$-TEBC  | 0.9521/0.9923 | 0.8797/0.9752 | 0.9159/0.9837 |
| F-JSD-TEBC      | 0.9551/0.9025 | 0.9968/0.8858 | 0.9760/0.8941 |
| F-ML-TEBC       | 0.9618/0.9091 | 0.9999/0.8889 | 0.9809/0.8990 |
| B-$l_2^2$-TEBC  | 0.9495/0.9893 | 0.8780/0.9735 | 0.9137/0.9814 |
| B-JSD-TEBC      | 0.9524/0.8992 | 0.9960/0.8844 | 0.9742/0.8918 |
| B-ML-TEBC       | 0.9590/0.9057 | 0.9991/0.8874 | 0.9791/0.8965 |
| SME             | 0.2859/0.4508 | 0.0000/0.0547 | 0.1429/0.2528 |
| $l_1$-SUMMET    | 0.4580/0.4346 | 0.2035/0.0000 | 0.3307/0.2173 |
| $l_1$-PLUMMET   | 0.3634/0.4549 | 0.1640/0.1127 | 0.2637/0.2838 |
| $l_2^2$-SEB     | 0.4658/0.3313 | 0.6068/0.5305 | 0.5363/0.4309 |
| JSD-SEB         | 0.5712/0.3828 | 0.8879/0.6804 | 0.7296/0.5316 |
| ML-SEB          | 0.5746/0.3830 | 0.8905/0.6807 | 0.7326/0.5319 |

Table 6: Overall performance score evaluated by JS-divergence for the experimental results in Section 6.2. Each cell reports results w.r.t. the certain constraints' number k = 0.2 × m / k = 0.05 × m.

| Algorithms      | Synthesized   | Real-world    | Average       |
|-----------------|---------------|---------------|---------------|
| Sample          | 0.6564/0.3086 | 0.9302/0.7986 | 0.7933/0.5536 |
| F-$l_2^2$-TEBC  | 0.9753/0.9978 | 0.9318/0.9851 | 0.9536/0.9915 |
| F-JSD-TEBC      | 0.9758/0.8789 | 0.9993/0.9401 | 0.9875/0.9095 |
| F-ML-TEBC       | 0.9796/0.8866 | 0.9999/0.9418 | 0.9898/0.9142 |
| B-$l_2^2$-TEBC  | 0.9736/0.9950 | 0.9309/0.9862 | 0.9522/0.9906 |
| B-JSD-TEBC      | 0.9744/0.8756 | 0.9991/0.9396 | 0.9868/0.9076 |
| B-ML-TEBC       | 0.9782/0.8831 | 0.9998/0.9412 | 0.9890/0.9122 |
| SME             | 0.0000/0.2690 | 0.3298/0.5323 | 0.1649/0.4007 |
| $l_1$-SUMMET    | 0.8406/0.5345 | 0.8770/0.6142 | 0.8588/0.5743 |
| $l_1$-PLUMMET   | 0.6783/0.5199 | 0.3699/0.1321 | 0.5241/0.3260 |
| $l_2^2$-SEB     | 0.6618/0.1783 | 0.8345/0.5929 | 0.7481/0.3856 |
| JSD-SEB         | 0.7988/0.3868 | 0.9791/0.8345 | 0.8889/0.6106 |
| ML-SEB          | 0.8004/0.3871 | 0.9796/0.8347 | 0.8900/0.6109 |

Table 7: Overall performance score evaluated by Expected Log Loss for the experimental results in Section 6.2. Each cell reports results w.r.t. the certain constraints' number k = 0.2 × m / k = 0.05 × m.

In addition, some Good-Turing estimators, i.e., the simplest Good-Turing estimator (SimplestGT), Simple Good-Turing (SGT) [Gale and Sampson, 1995] and a low-complexity diminishing attenuation estimator (LC-DAE) with an asymptotic performance guarantee [Orlitsky et al., 2003], are also involved in the comparative experiments. Because Good-Turing estimators do not assume that the bin number m is given, they only assign a probability sum to all the zero bins w.r.t. the sample. In our case, in which m is provided, this sum is distributed uniformly over the zero bins, as sketched below.
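Concretely, here is a hedged sketch of one common "simplest Good-Turing" form (the paper's SimplestGT may differ in its discounting details): the total unseen mass is estimated as N1/n, where N1 is the number of bins observed exactly once; seen bins are discounted proportionally; and, since m is known here, the unseen mass is spread uniformly over the zero-count bins:

```python
import numpy as np

def simplest_gt(counts):
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    n1 = np.count_nonzero(counts == 1)   # bins seen exactly once
    p0 = n1 / n                          # Good-Turing estimate of the missing mass
    zero = counts == 0
    p = (1.0 - p0) * counts / n          # proportional discount of the seen bins
    if zero.any():
        p[zero] = p0 / zero.sum()        # spread the missing mass uniformly
    return p

print(simplest_gt([3, 2, 1, 1, 0, 0]))  # sums to 1; the two zero bins share 2/7
```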
6.3.2 Results

We employ the same parameter settings as in Table 1, but ignore the parameters in the constraints of Maxent, and then run the F-Lidstone and B-Lidstone estimators, as well as the involved comparative models, on the synthesized and real-world datasets. The mean performance scores w.r.t. JS-divergence and Expected Log Loss, over generating times r and sampling times s, are summarized for each dataset in Tables 8 to 11. The overall performance of each algorithm on the synthesized and real-world datasets is also reported in Tables 12 and 13.

| Method \ Data  | U(0,1) | N(0,1²) | N(3,1²) | χ²(10) | β(3,6) | B(30,0.2) |
|----------------|--------|---------|---------|--------|--------|-----------|
| Sample         | 0.0000 | 0.0000  | 0.0000  | 0.0000 | 0.0000 | 0.0000    |
| SimplestGT     | 0.0916 | 0.1851  | 0.0062  | 0.0446 | 0.0679 | 0.0219    |
| Laplace        | 0.9429 | 1.0000  | 0.4823  | 0.6518 | 0.6521 | 0.5082    |
| ELE            | 0.6318 | 0.7745  | 0.2741  | 0.3744 | 0.3785 | 0.2876    |
| SGT            | 0.4896 | 0.3537  | 0.3754  | 0.3235 | 0.4152 | 0.3642    |
| LC-DAE         | 0.5680 | 0.7053  | 0.2532  | 0.5771 | 0.5202 | 0.4032    |
| B-Lidstone     | 0.9999 | 0.9566  | 0.9958  | 0.9954 | 0.9961 | 0.9953    |
| F-Lidstone     | 1.0000 | 0.9573  | 1.0000  | 1.0000 | 1.0000 | 1.0000    |
| SEB-Lidstone   | 0.7768 | 0.8065  | 0.5755  | 0.6500 | 0.6363 | 0.5882    |
| Best JS Value  | 0.0151 | 0.0154  | 0.0114  | 0.0132 | 0.0131 | 0.0118    |
| Worst JS Value | 0.0181 | 0.0177  | 0.0183  | 0.0187 | 0.0186 | 0.0187    |

Table 8: Performance scores (w.r.t. JS-divergence) of different Lidstone and Good-Turing estimators on synthesized datasets.

| Method \ Data   | U(0,1) | N(0,1²) | N(3,1²) | χ²(10) | β(3,6) | B(30,0.2) |
|-----------------|--------|---------|---------|--------|--------|-----------|
| Sample          | 0.0000 | 0.0000  | 0.0000  | 0.0000 | 0.0000 | 0.0000    |
| SimplestGT      | 0.1475 | 0.2419  | 0.0228  | 0.0748 | 0.1074 | 0.0423    |
| Laplace         | 0.9042 | 1.0000  | 0.4965  | 0.6732 | 0.6715 | 0.5313    |
| ELE             | 0.5798 | 0.6886  | 0.2860  | 0.3967 | 0.3995 | 0.3076    |
| SGT             | 0.4857 | 0.3881  | 0.3861  | 0.3487 | 0.4391 | 0.3817    |
| LC-DAE          | 0.7220 | 0.8708  | 0.3367  | 0.6309 | 0.5951 | 0.4599    |
| B-Lidstone      | 0.9985 | 0.9258  | 0.9957  | 0.9957 | 0.9963 | 0.9956    |
| F-Lidstone      | 1.0000 | 0.9273  | 1.0000  | 1.0000 | 1.0000 | 1.0000    |
| SEB-Lidstone    | 0.7260 | 0.7349  | 0.5865  | 0.6689 | 0.6542 | 0.6071    |
| Best ELL Value  | 6.4366 | 6.2912  | 6.6035  | 6.5505 | 6.5494 | 6.5945    |
| Worst ELL Value | 6.4539 | 6.3051  | 6.6364  | 6.5781 | 6.5769 | 6.6275    |

Table 9: Performance scores (w.r.t. Expected Log Loss) of different Lidstone and Good-Turing estimators on synthesized datasets.

In the experimental results on the synthesized datasets, it can be observed that the F-Lidstone and B-Lidstone estimators, especially F-Lidstone, outperform the other models in most cases. Even in the worst case, they are still more effective than most of the other models. Hence, F-Lidstone and B-Lidstone are on average the best performing among all the estimators.

| Method \ Data  | Dexter | Statlog | ISOLET | Sonar  |
|----------------|--------|---------|--------|--------|
| Sample         | 0.0000 | 0.4568  | 0.0000 | 0.0000 |
| SimplestGT     | 0.2774 | 0.3649  | 0.1241 | 0.1712 |
| Laplace        | 1.0000 | 0.4363  | 0.8874 | 0.8967 |
| ELE            | 0.6277 | 0.9992  | 0.5832 | 0.5480 |
| SGT            | 0.3740 | 0.4704  | 0.3137 | 0.5455 |
| LC-DAE         | 0.8305 | 0.0000  | 0.6002 | 0.6556 |
| B-Lidstone     | 0.9099 | 1.0000  | 0.9972 | 0.9912 |
| F-Lidstone     | 0.9124 | 0.9986  | 1.0000 | 1.0000 |
| SEB-Lidstone   | 0.7658 | 0.9204  | 0.7668 | 0.7172 |
| Best JS Value  | 0.0148 | 0.0152  | 0.0150 | 0.0144 |
| Worst JS Value | 0.0184 | 0.0171  | 0.0183 | 0.0182 |

Table 10: Performance scores (w.r.t. JS-divergence) of different Lidstone and Good-Turing estimators on real-world datasets.

| Method \ Data   | Dexter | Statlog | ISOLET | Sonar  |
|-----------------|--------|---------|--------|--------|
| Sample          | 0.0000 | 0.1702  | 0.0000 | 0.0000 |
| SimplestGT      | 0.3568 | 0.0000  | 0.1795 | 0.2092 |
| Laplace         | 1.0000 | 0.8697  | 0.9138 | 0.9062 |
| ELE             | 0.6363 | 1.0000  | 0.5827 | 0.5631 |
| SGT             | 0.4422 | 0.2000  | 0.3584 | 0.5509 |
| LC-DAE          | 0.9297 | 0.3987  | 0.7471 | 0.7839 |
| B-Lidstone      | 0.9153 | 0.9195  | 0.9970 | 0.9911 |
| F-Lidstone      | 0.9177 | 0.9157  | 1.0000 | 1.0000 |
| SEB-Lidstone    | 0.7727 | 0.6445  | 0.7562 | 0.7240 |
| Best ELL Value  | 6.3450 | 4.9949  | 6.4027 | 4.7125 |
| Worst ELL Value | 6.3659 | 5.0012  | 6.4207 | 4.7338 |

Table 11: Performance scores (w.r.t. Expected Log Loss) of different Lidstone and Good-Turing estimators on real-world datasets.
| Algorithms   | Synthesized | Real-world | Average |
|--------------|-------------|------------|---------|
| SimplestGT   | 0.0695      | 0.2344     | 0.1520  |
| Laplace      | 0.7062      | 0.8051     | 0.7556  |
| ELE          | 0.4535      | 0.6896     | 0.5715  |
| SGT          | 0.3869      | 0.4259     | 0.4064  |
| LC-DAE       | 0.5045      | 0.5216     | 0.5130  |
| B-Lidstone   | 0.9899      | 0.9746     | 0.9822  |
| F-Lidstone   | 0.9928      | 0.9777     | 0.9853  |
| SEB-Lidstone | 0.6722      | 0.7925     | 0.7324  |

Table 12: Overall performance score evaluated by JS-divergence for the experimental results in Section 6.3.

| Algorithms   | Synthesized | Real-world | Average |
|--------------|-------------|------------|---------|
| SimplestGT   | 0.1061      | 0.1864     | 0.1462  |
| Laplace      | 0.7128      | 0.9224     | 0.8176  |
| ELE          | 0.4431      | 0.6955     | 0.5693  |
| SGT          | 0.4049      | 0.3879     | 0.3964  |
| LC-DAE       | 0.6025      | 0.7149     | 0.6587  |
| B-Lidstone   | 0.9846      | 0.9557     | 0.9702  |
| F-Lidstone   | 0.9878      | 0.9583     | 0.9731  |
| SEB-Lidstone | 0.6629      | 0.7243     | 0.6936  |

Table 13: Overall performance score evaluated by Expected Log Loss for the experimental results in Section 6.3.

For the real-world datasets, we can still come to the same conclusion that F-Lidstone and B-Lidstone outperform the others on average. Although Laplace and ELE can achieve the optimal performance on the Dexter and Statlog datasets, the effectiveness of F-Lidstone and B-Lidstone there is only slightly lower, and their performance on the other datasets, especially ISOLET, is much better than that of the other estimators. Further, it can be observed that in some cases the performance scores evaluated by JS-divergence and Expected Log Loss are not consistent with each other. This phenomenon is probably due to the incompleteness of these evaluation criteria, which can only partially reflect the similarity between the resulting distribution and the underlying real distribution. In this sense, the more stable an algorithm is across different similarity criteria, the more effective it can be considered to be. F-Lidstone and B-Lidstone are also optimal in this respect.

In summary, when Bayesian-TEB and Frequentist-TEB are applied to the Lidstone framework, the resulting B-Lidstone and F-Lidstone estimators achieve excellent performance in the expected sense, compared with common Lidstone and Good-Turing estimators. Hence, it can be concluded that TEBs do make sense and can benefit the Lidstone framework.

7 Conclusions and Future Work

This paper proposes closed-form formulae for the expected Tsallis entropy bias (TEB) under the Frequentist and Bayesian frameworks. TEBs quantify the difference between the expected Tsallis entropy of sampling distributions and the Tsallis entropy of the underlying real distribution. The correction is exact in the sense of unbiasedness and consistency, and hence naturally entails a quantitative re-interpretation of the Maxent principle. In other words, TEBs quantitatively answer the question of why we should choose the distribution with maximum entropy. We further apply TEBs in the Maxent and Lidstone frameworks, and both show promising results on synthesized and real-world datasets.

In using the maximum entropy approach for density estimation, a key challenge lies in the dilemma of uncertain-constraint selection: inappropriate choices may easily cause serious overfitting or underfitting. To deal with this challenge, a family of TEBC Maxents, namely $l_2^2$-TEBC, JSD-TEBC and ML-TEBC, is proposed in this paper.
Instead of using uncertain constraints selected empirically, the proposed models let the Tsallis entropy of the resulting distribution converge to that of the underlying real distribution by compensating the expected Tsallis entropy bias, while forcing the resulting distribution to resemble the sampling distribution w.r.t. the $l_2^2$ norm, JS-divergence or Maximum Likelihood. Hence, the resulting distributions are optimal in the expected sense w.r.t. these three similarity criteria.

The family of TEBC Maxents is a natural generalization of standard Maxent. The important difference between TEBC Maxents and standard Maxent is that the constraints of the former can all be derived from reliable prior information (the certain constraints) or from analytical derivation (the TEB constraint), while the latter also has to involve uncertain constraints demanding empirical parameter selection. In this sense, TEBC Maxents are parameter-free. Furthermore, the analytically established TEB constraint is effective in suppressing overfitting and underfitting, since it forces the resulting distribution to approach the real one by matching their entropies.

In addition to the Maxent framework, we also demonstrate a natural connection between TEB and another widely used estimator, the Lidstone estimator. Specifically, TEB can analytically identify the adaptive rate of probability correction in the Lidstone framework. As a result, the TEB-Lidstone estimators (F-Lidstone and B-Lidstone) have been developed.

In the future, several additional theoretical issues are worth considering. Firstly, the TEB results might be developed for other q indexes of Tsallis entropy. Unbiased and consistent results w.r.t. q = 1 are of special interest, since Tsallis entropy is equivalent to Shannon entropy in this case. It can be expected that these extended results would offer more complete criteria to further combat overfitting and underfitting. For instance, as an extreme case, if we could give m independent Frequentist-TEB results, the possibility of overfitting and underfitting could in principle be ruled out, and the task of m-nomial distribution estimation would become determined. However, other numerical procedures would have to be devised in order to globally solve the resulting model integrating TEB constraints w.r.t. q ≠ 2, since the new TEB constraints could be non-convex. Secondly, it is also interesting to develop Bayesian-TEB results w.r.t. other Bayesian priors. Finally, we have observed that the two criteria of estimation quality, i.e., JS-divergence and the expected log loss, occasionally give inconsistent evaluation results. Hence, it would be helpful to develop more sophisticated metrics to evaluate the performance of density estimation.

Acknowledgements The authors would like to thank the editors and anonymous reviewers for their constructive comments. This work is supported in part by the Natural Science Foundation of China (Grant 60603027), the HK Research Grants Council (project no. CERG PolyU 5211/05E and 5217/07E), the UK's Engineering and Physical Sciences Research Council (EPSRC, grants EP/E002145/1 and EP/F014708/1), the Natural Science Foundation project of Tianjin, China (grant 09JCYBJC00200), and the MSRA Internet Services Theme Project, 2008.
Appendix A. Standard Deviation of $E^{(m)}_{P,n}(T)$

Proposition 3 Given the uniform probability metric over $\mathcal{P}^{(m)}$ and denoting the standard deviation of $E^{(m)}_{P,n}(T)$ by $STD^{(m)}_n(T)$, we have

$$STD^{(m)}_n(T) = \frac{2}{\sqrt{(m-1)(m+2)(m+3)}}\, E^{(m)}_n(T). \qquad (26)$$

Proof Combining the definition of the standard deviation with the integral operator $L^{(m-1)}$ defined by Formula 7, we have

$$STD^{(m)}_n(T) = \sqrt{\frac{1}{Z^{(m-1)}}\, L^{(m-1)}\!\left(\left[E^{(m)}_{P,n}(T) - E^{(m)}_n(T)\right]^2\right)},$$

which can be further transformed into

$$STD^{(m)}_n(T) = \sqrt{\frac{1}{Z^{(m-1)}}\, L^{(m-1)}\!\left(\left[E^{(m)}_{P,n}(T)\right]^2\right) - \left[E^{(m)}_n(T)\right]^2}. \qquad (27)$$

Recall that $E^{(m)}_{P,n}(T)$ has been re-expressed in Formula 6, and hence it can be verified that

$$\left[E^{(m)}_{P,n}(T)\right]^2 = \frac{4(n-1)^2}{n^2}\left(\sum_{i=1}^{m-1} p_i - \sum_{i=1}^{m-1} p_i^2 - \sum_{i=1}^{m-1}\sum_{j=i+1}^{m-1} p_i p_j\right)^{\!2}.$$

Let $U^{(m-1)}$ denote the set of all product terms in $\bigl(\sum_{i=1}^{m-1} p_i - \sum_{i=1}^{m-1} p_i^2 - \sum_{i=1}^{m-1}\sum_{j=i+1}^{m-1} p_i p_j\bigr)^2$. Then the calculation of $[E^{(m)}_{P,n}(T)]^2$ reduces to finding the closed form of $U^{(m-1)}$ w.r.t. the integral operator $L^{(m-1)}$. Due to the symmetry of the integral domain, the properties of $L^{(m-1)}$ in Proposition 2 can be generalized as follows:

(a) $L^{(m-1)}(p_i^r) = L^{(m-1)}(p_j^r)$, for $1 \le i, j \le m-1$, $r \in \{2, 3, 4\}$;
(b) $L^{(m-1)}(p_i \cdot p_j^r) = L^{(m-1)}(p_k \cdot p_l^r)$, for $1 \le i, j, k, l \le m-1$, $i \ne j$, $k \ne l$, $r \in \{1, 2, 3\}$;
(c) $L^{(m-1)}(p_i^2 \cdot p_j^2) = L^{(m-1)}(p_k^2 \cdot p_l^2)$, for $1 \le i, j, k, l \le m-1$, $i \ne j$, $k \ne l$;
(d) $L^{(m-1)}(p_i \cdot p_j \cdot p_k^r) = L^{(m-1)}(p_u \cdot p_v \cdot p_w^r)$, for $1 \le i, j, k, u, v, w \le m-1$, $i \ne j \ne k$, $u \ne v \ne w$, $r \in \{1, 2\}$;
(e) $L^{(m-1)}(p_i \cdot p_j \cdot p_k \cdot p_l) = L^{(m-1)}(p_u \cdot p_v \cdot p_w \cdot p_x)$, for $1 \le i, j, k, l, u, v, w, x \le m-1$, $i \ne j \ne k \ne l$, $u \ne v \ne w \ne x$.

Based on the above properties, we can obtain a partition of $U^{(m-1)}$: first, $U^{(m-1)}$ is partitioned into five different sets subject to the formal constraints of (a)-(e); second, each set is further partitioned into subsets subject to the different values of r (if r is involved in the formal definition of the set). After these two steps of decomposition, the final partition includes 10 parts and can be represented as

$$\mathrm{Partition}\left[U^{(m-1)}\right] = \left\{\{p_i^2\}, \{p_i^3\}, \{p_i^4\}, \{p_i p_j\}, \{p_i p_j^2\}, \{p_i p_j^3\}, \{p_i^2 p_j^2\}, \{p_i p_j p_k\}, \{p_i p_j p_k^2\}, \{p_i p_j p_k p_l\}\right\}.$$

It can be checked that the terms in each part give the same result with respect to the integral operator $L^{(m-1)}$. Therefore, if the general term formula of each part and the term numbers can be worked out, the general term formula of $[E^{(m)}_{P,n}(T)]^2$ follows directly.

The following equations can be checked:

$$L^{(m-1)}(p_i^2) = \frac{2!}{(m+1)!}, \qquad L^{(m-1)}(p_i^3) = \frac{3!}{(m+2)!}, \qquad L^{(m-1)}(p_i^4) = \frac{4!}{(m+3)!},$$
$$L^{(m-1)}(p_i p_j) = \frac{1}{(m+1)!}, \qquad L^{(m-1)}(p_i p_j^2) = \frac{2!}{(m+2)!}, \qquad L^{(m-1)}(p_i p_j^3) = \frac{3!}{(m+3)!},$$
$$L^{(m-1)}(p_i^2 p_j^2) = \frac{4}{(m+3)!}, \qquad L^{(m-1)}(p_i p_j p_k) = \frac{1}{(m+2)!},$$
$$L^{(m-1)}(p_i p_j p_k^2) = \frac{2!}{(m+3)!}, \qquad L^{(m-1)}(p_i p_j p_k p_l) = \frac{1}{(m+3)!}.$$
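Each of these values is an instance of the standard Dirichlet (Liouville) integral over the simplex domain, which we state compactly here for the reader's convenience (this identity is standard and is not a restatement of the paper's Formula 7):

$$\int_{\{p_i \ge 0,\ \sum_{i=1}^{m-1} p_i \le 1\}} \prod_{i=1}^{k} p_i^{a_i}\, dp_1 \cdots dp_{m-1} = \frac{\prod_{i=1}^{k} a_i!}{\left(m-1+\sum_{i=1}^{k} a_i\right)!}, \qquad a_i \in \mathbb{N},\ k \le m-1.$$

For example, taking $a = (2, 2)$ recovers $L^{(m-1)}(p_i^2 p_j^2) = \frac{2!\,2!}{(m+3)!} = \frac{4}{(m+3)!}$ above.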
In addition, the general term formula $N^{(m-1)}(\cdot)$ for the (signed) term number in each part is given by

$$N^{(m-1)}(p_i^2) = m-1, \qquad N^{(m-1)}(p_i^3) = -2(m-1), \qquad N^{(m-1)}(p_i^4) = m-1,$$
$$N^{(m-1)}(p_i p_j) = (m-1)(m-2), \qquad N^{(m-1)}(p_i p_j^2) = -4(m-1)(m-2),$$
$$N^{(m-1)}(p_i p_j^3) = 2(m-1)(m-2), \qquad N^{(m-1)}(p_i^2 p_j^2) = \frac{3(m-1)(m-2)}{2},$$
$$N^{(m-1)}(p_i p_j p_k) = -(m-1)(m-2)(m-3), \qquad N^{(m-1)}(p_i p_j p_k^2) = 2(m-1)(m-2)(m-3),$$
$$N^{(m-1)}(p_i p_j p_k p_l) = \frac{(m-1)(m-2)(m-3)(m-4)}{4}.$$

Replacing $[E^{(m)}_{P,n}(T)]^2$ by the two sets of general term formulae, we get

$$\frac{1}{Z^{(m-1)}}\, L^{(m-1)}\!\left(\left[E^{(m)}_{P,n}(T)\right]^2\right) = \frac{1}{Z^{(m-1)}} \cdot \frac{4(n-1)^2}{n^2} \sum_{\alpha \in \mathrm{Partition}[U^{(m-1)}]} L^{(m-1)}(\alpha) \cdot N^{(m-1)}(\alpha) = \frac{(n-1)^2}{n^2} \cdot \frac{m^3 + 2m^2 - 5m + 2}{(m+1)(m+2)(m+3)}. \qquad (28)$$

Substituting Formula 28 into Formula 27, the following equation holds:

$$STD^{(m)}_n(T) = \frac{n-1}{n} \cdot \sqrt{\frac{m^3 + 2m^2 - 5m + 2}{(m+1)(m+2)(m+3)} - \frac{(m-1)^2}{(m+1)^2}}.$$

After simplification, we obtain

$$STD^{(m)}_n(T) = \frac{2}{\sqrt{(m-1)(m+2)(m+3)}} \cdot \frac{(n-1)(m-1)}{n(m+1)}.$$

Recall that $E^{(m)}_n(T) = \frac{(n-1)(m-1)}{n(m+1)}$, and hence Formula 26 is proved. $\square$

It is clear that, if $n \approx m$, then $STD^{(m)}_n(T) \approx \frac{2}{\sqrt{m}} \cdot \frac{m-1}{n(m+1)}$. Recall that $\frac{m-1}{n(m+1)}$ is the uniform Bayesian-TEB. Hence, the estimation of the uniform Bayesian-TEB is relatively stable in the case of inadequate sampling.
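Proposition 3 is easy to check numerically. The following is a verification sketch (not an experiment from the paper): draw $P$ uniformly from the simplex via Dirichlet(1, ..., 1), use the multinomial identity $E^{(m)}_{P,n}(T) = \frac{n-1}{n}\bigl(1 - \sum_i p_i^2\bigr)$, and compare the Monte Carlo mean and standard deviation against the closed forms:

```python
import numpy as np

m, n, trials = 20, 50, 200_000
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(m), size=trials)       # p ~ uniform over the simplex
E_Pn = (n - 1) / n * (1.0 - np.sum(P ** 2, axis=1))

mean_closed = (n - 1) * (m - 1) / (n * (m + 1))  # E_n^{(m)}(T)
std_closed = 2.0 / np.sqrt((m - 1) * (m + 2) * (m + 3)) * mean_closed  # Formula 26

print(E_Pn.mean(), mean_closed)  # both approximately 0.8867
print(E_Pn.std(), std_closed)    # both approximately 0.0173
```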
References

S. Abe. Axioms and uniqueness theorem for Tsallis entropy. Physics Letters A, 271:74-79, 2000.
Y. Altun and A. Smola. Unifying divergence minimization and statistical inference via convex duality. In Proceedings of the Nineteenth Annual Conference on Learning Theory, 2006.
A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39-71, 1996.
G. Box and G. Tiao. Bayesian Inference in Statistical Analysis. Addison-Wesley, Reading, Mass., 1973.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
A. G. Carlton. On the bias of information estimates. Psychological Bulletin, 71:108-109, 1969.
S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University, 1998.
S. F. Chen and R. Rosenfeld. A survey of smoothing techniques for ME models. IEEE Transactions on Speech and Audio Processing, 8(1):37-50, 2000.
O. Dekel, S. Shalev-Shwartz, and Y. Singer. Smooth ε-insensitive regression by loss symmetrization. In Proceedings of the Sixteenth Annual Conference on Computational Learning Theory, pages 433-447. Springer, 2003.
M. Dudik, S. J. Phillips, and R. E. Schapire. Maximum entropy density estimation with generalized regularization and an application to species distribution modeling. Journal of Machine Learning Research, 8, 2007.
W. A. Gale and G. Sampson. Good-Turing frequency estimation without tears. Journal of Quantitative Linguistics, 2:217-237, 1995.
I. J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40:237-264, 1953.
J. Goodman. Exponential priors for maximum entropy models. In Conference of the North American Chapter of the Association for Computational Linguistics, 2004.
E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106:620-630, 1957.
B. M. Jedynak and S. Khudanpur. Maximum likelihood set for estimating a probability mass function. Neural Computation, 17:1508-1530, 2005.
J. Kazama and J. Tsujii. Evaluation and extension of maximum entropy models with inequality constraints. In Conference on Empirical Methods in Natural Language Processing, pages 137-144, 2003.
S. P. Khudanpur. A method of maximum entropy estimation with relaxed constraints. In Proceedings of the Johns Hopkins University Language Modeling Workshop, pages 1-17, 1995.
B. Krishnapuram, L. Carin, M. A. T. Figueiredo, and A. J. Hartemink. Sparse multinomial logistic regression: Fast algorithms and generalization bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):957-968, 2005.
R. Lau. Adaptive statistical language modeling. PhD thesis, MIT Department of Electrical Engineering and Computer Science, 1994.
G. Lebanon and J. Lafferty. Boosting and maximum likelihood for exponential models. Technical Report CMU-CS-01-144, CMU School of Computer Science, 2001.
G. Miller. Note on the bias of information estimates. Information Theory in Psychology II-B, pages 95-100, 1955.
C. Morris. Central limit theorems for multinomial sums. The Annals of Statistics, 3(1):165-188, 1975.
W. I. Newman. Extension to the maximum entropy method. IEEE Transactions on Information Theory, 23(1):89-93, 1977.
A. Y. Ng. Feature selection, l1 vs. l2 regularization, and rotational invariance. In Proceedings of the Twenty-First International Conference on Machine Learning, pages 615-622, 2004.
A. Orlitsky, N. P. Santhanam, and J. Zhang. Always Good Turing: Asymptotically optimal probability estimation. Science, 302:427-431, 2003.
L. Paninski. Estimation of entropy and mutual information. Neural Computation, 15(6):1191-1253, 2003.
L. Paninski. Estimating entropy on m bins given fewer than m samples. IEEE Transactions on Information Theory, 50(9):2200-2203, 2004.
S. Panzeri and A. Treves. Analytical estimates of limited sampling biases in different information measures. Network: Computation in Neural Systems, 7(1):87-107, 1996.
S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):1-13, 1997.
R. J. V. Santos. Generalization of Shannon's theorem for Tsallis entropy. Journal of Mathematical Physics, 38:4104, 1997.
C. Tsallis. Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52:479-487, 1988.
J. D. Victor. Asymptotic bias in information estimates and the exponential (Bell) polynomials. Neural Computation, 12(12):2797-2804, 2000.
L. Wasserman. All of Nonparametric Statistics. Springer, 2006.
P. M. Williams. Bayesian regularization and pruning using a Laplace prior. Neural Computation, 7(1):117-143, 1995.
C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In SIGIR'01, 2001.
T. Zhang. Class-size independent generalization analysis of some discriminative multi-category classification. In Advances in Neural Information Processing Systems, volume 17, 2005.
