Towards algebraic methods for maximum entropy estimation

Ambedkar Dukkipati
EURANDOM, P.O. Box 513, 5600 MB Eindhoven, The Netherlands
E-mail: dukkipati@eurandom.tue.nl

Abstract. We show that various formulations (e.g., dual and Kullback-Csiszar iterations) of estimation of maximum entropy (ME) models can be transformed to solving systems of polynomial equations in several variables, for which one can use the celebrated Gröbner bases methods. Posing ME estimation as solving polynomial equations is possible in the cases where the feature functions (sufficient statistics) that provide the information about the underlying random variable in the form of expectations are integer valued.

1. Introduction

Algebra has always played an important role in statistics, a classical example being linear algebra. There are also many other instances of applying algebraic tools in statistics (e.g., (Viana & Richards 2001)). But treating statistical models as algebraic objects, and thereby using tools of computational commutative algebra and algebraic geometry in the analysis of statistical models, is very recent and has led to the still evolving field of algebraic statistics.

The use of computational algebra and algebraic geometry in statistics was initiated in the work of Diaconis and Sturmfels (Diaconis & Sturmfels 1998) on exact hypothesis tests of conditional independence in contingency tables, and in the work of Pistone et al. (Pistone et al. 2001) in experimental design. The term 'Algebraic Statistics' was first coined in the monograph by Pistone et al. (Pistone et al. 2001) and appeared recently in the title of the book by Pachter and Sturmfels (Pachter & Sturmfels 2005). To extract the underlying algebraic structures in discrete statistical models, algebraic statistics treats statistical models as affine varieties.
(An affine variety is the set of all solutions to a family of polynomial equations.) Parametric statistical models are described in terms of a polynomial (or rational) mapping from a set of parameters to distributions. One can show that many statistical models, for example independence models and Bernoulli random variables (see (Pachter & Sturmfels 2005) for more examples), can be given this algebraic formulation, and these are referred to as algebraic statistical models.

Exponential models, which form an important class of statistical models, are studied in algebraic statistics under the name 'toric' models by using maximum likelihood methods. Toric models are algebraic statistical models, and the term 'toric' comes from important algebraic objects known as 'toric ideals' in computational algebra.

In view of the well established role of information theory in statistics (Kullback 1959, Csiszár & Shields 2004), this paper attempts to describe maximum entropy models in the algebraic statistical framework. In particular, we show that maximum entropy models (and also minimum relative-entropy models) are indeed toric models when the functions that provide the information about the underlying random variable in the form of expected values are integer valued. We also show that when the information is available in the form of sample means, modifying the maximum entropy prescription reduces the calculation of the model parameters to solving a set of polynomial equations. This establishes that the sets of statistical models resulting from maximum entropy methods are indeed algebraic varieties.
A note on the results presented in this paper: due to space constraints we do not present the details of Gröbner bases theory and the related concepts used to solve the polynomial equations; we refer the reader to textbooks on computational algebra and Gröbner basis theory (Adams & Loustaunau 1994, Cox et al. 1991).

We organize our paper as follows. In § 2 we give basic notions of algebra and introduce notation, along with an introduction to algebraic statistics. § 3 describes maximum entropy (ME) prescriptions in the algebraic statistical framework by introducing important algebraic objects called toric ideals. In § 4 we show how one can transform the problem of calculating ME distributions to solving a set of polynomial equations.

2. Algebraic Statistical Models

2.1. Basic notions of Algebra

Throughout this paper $k$ represents a field. A monomial in $n$ indeterminates $x_1, \ldots, x_n$ is a power product of the form $x_1^{\alpha_1} \cdots x_n^{\alpha_n}$, where all the exponents are nonnegative integers, i.e., $\alpha_i \in \mathbb{Z}_{\geq 0}$, $i = 1, \ldots, n$. One can simplify the notation for monomials as follows: denote $\alpha = (\alpha_1, \ldots, \alpha_n) \in \mathbb{Z}^n_{\geq 0}$ and, using multi-index notation, set $x^\alpha = x_1^{\alpha_1} \cdots x_n^{\alpha_n}$ with the understanding that $x = (x_1, \ldots, x_n)$. Note that $x^\alpha = 1$ whenever $\alpha = (0, \ldots, 0)$. Once the order of the indeterminates is fixed, the monomial $x_1^{\alpha_1} \cdots x_n^{\alpha_n} = x^\alpha$ is identified by $(\alpha_1, \ldots, \alpha_n)$. Hence, the set of all monomials in the indeterminates $x_1, \ldots, x_n$ can be represented by $\mathbb{Z}^n_{\geq 0}$.

The theory of monomials is central to the celebrated Gröbner bases theory in computational algebra, which provides tools for solving sets of polynomial equations and related problems in algebraic geometry (Mishra & Yap 1989).
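
The identification of monomials with exponent vectors can be made concrete with a small sketch. The helper names `multiply` and `evaluate` below are illustrative assumptions, not notation from the paper; the sketch only shows that multiplying monomials corresponds to adding their multi-indices.

```python
def multiply(alpha, beta):
    """Product of monomials: x^alpha * x^beta = x^(alpha + beta)."""
    return tuple(a + b for a, b in zip(alpha, beta))


def evaluate(alpha, point):
    """Value of the monomial x^alpha at the point (c_1, ..., c_n)."""
    value = 1
    for c, a in zip(point, alpha):
        value *= c ** a
    return value


# Illustrative data with n = 3 indeterminates.
alpha = (2, 0, 1)                  # the monomial x1^2 * x3
beta = (0, 1, 1)                   # the monomial x2 * x3
product = multiply(alpha, beta)    # exponent vector of the product
point = (2, 3, 5)                  # an arbitrary evaluation point
```

Evaluating `product` at `point` gives the same number as multiplying the two separate evaluations, which is the sense in which $\mathbb{Z}^n_{\geq 0}$ represents the set of monomials.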
Monomial theory itself plays an important role in algebraic statistics in the representation of exponential models, where probabilities are expressed in terms of power products (Rapallo 2006).

A polynomial $f$ in $x_1, \ldots, x_n$ with coefficients in $k$ is a finite linear combination of monomials and can be written in the form
$$ f = \sum_{\alpha \in \Lambda_f} a_\alpha x^\alpha, $$
where $\Lambda_f \subset \mathbb{Z}^n_{\geq 0}$ is a finite set and $a_\alpha \in k$. The collection of all polynomials in the indeterminates $x_1, \ldots, x_n$ is the set $k[x_1, \ldots, x_n]$, and it has the structure not only of a vector space but also of a ring. Indeed, the ring structure of $k[x_1, \ldots, x_n]$ plays the main role in computational algebra and algebraic geometry.

A subset $\mathfrak{a} \subset k[x_1, \ldots, x_n]$ is said to be an ideal if it satisfies: (i) $0 \in \mathfrak{a}$; (ii) if $f, g \in \mathfrak{a}$, then $f + g \in \mathfrak{a}$; (iii) if $f \in \mathfrak{a}$ and $h \in k[x_1, \ldots, x_n]$, then $hf \in \mathfrak{a}$.

A set $V \subset k^n$ is said to be an affine variety if there exist $f_1, \ldots, f_s \in k[x_1, \ldots, x_n]$ such that
$$ V = \{ (c_1, \ldots, c_n) \in k^n : f_i(c_1, \ldots, c_n) = 0, \ 1 \le i \le s \}. $$
We use the notation $V(f_1, \ldots, f_s) = V$.

2.2. Algebraic Statistical Model

At the very core of the field of algebraic statistics lies the notion of an 'algebraic statistical model'. While this notion has the potential of serving as a unifying theme for algebraic statistics, there is no unified definition of an algebraic statistical model (Drton & Sullivant 2006). Here, we adopt the appropriate definition of statistical model from (Pachter & Sturmfels 2005, Drton & Sullivant 2006). For a recent elaborate discussion on the formal definition of algebraic statistical models one can refer to (Drton & Sullivant 2006).

Let $X$ be a discrete random variable taking finitely many values from the set $[m] = \{1, 2, \ldots, m\}$. A probability distribution $p$ of $X$ is naturally represented as a vector $p = (p_1, \ldots, p_m) \in \mathbb{R}^m$ if we fix the order on $[m]$.
Then the set of all probability mass functions (pmfs) of $X$ is called the probability simplex
$$ \Delta_{m-1} = \Big\{ p = (p_1, \ldots, p_m) \in \mathbb{R}^m_{\geq 0} : \sum_{i=1}^m p_i = 1 \Big\}. \tag{1} $$
The index $m-1$ indicates the dimension of the simplex $\Delta_{m-1}$. A statistical model $M$ is a subset of $\Delta_{m-1}$ and is said to be algebraic if there exist $f_1, \ldots, f_s \in k[p_1, \ldots, p_m]$ such that $M = V(f_1, \ldots, f_s) \cap \Delta_{m-1}$.

Now we move on to parametric statistical models and their algebraic formulations. Let $\Theta \subseteq \mathbb{R}^d$ be a parameter space and $\kappa : \Theta \to \Delta_{m-1}$ be a map. The image $\kappa(\Theta)$ is called a parametric statistical model. Given a statistical model $M \subseteq \Delta_{m-1}$, by a parametrization of $M$ we mean identifying a set $\Theta \subseteq \mathbb{R}^d$ and a function $\kappa : \Theta \to \Delta_{m-1}$ such that $M = \kappa(\Theta)$. To describe more general statistical models in the algebraic framework we need the following notion of a semi-algebraic set.

Definition 2.1. A set $\Theta \subseteq \mathbb{R}^d$ is called a semi-algebraic set if there are two finite collections of polynomials $F \subset k[x_1, \ldots, x_d]$ and $G \subset k[x_1, \ldots, x_d]$ such that
$$ \Theta = \{ \theta \in \mathbb{R}^d : f(\theta) = 0 \ \ \forall f \in F \ \text{ and } \ g(\theta) \geq 0 \ \ \forall g \in G \}. $$

Now we have the following definition of a parametric algebraic statistical model.

Definition 2.2. Let $\Delta_{m-1}$ be a probability simplex and $\Theta \subset \mathbb{R}^d$ be a semi-algebraic set. Let $\kappa : \mathbb{R}^d \to \mathbb{R}^m$ be a rational function (a rational function is a quotient of two polynomials) such that $\kappa(\Theta) \subseteq \Delta_{m-1}$. Then the image $M = \kappa(\Theta)$ is a parametric algebraic statistical model.

Conversely, a parametric statistical model $M = \kappa(\Theta) \subseteq \Delta_{m-1}$ is said to be algebraic if $\Theta$ is a semi-algebraic set and $\kappa$ is a rational function. From now on we refer to 'parametric algebraic statistical models' as 'algebraic statistical models'.

In this paper we consider the following special case of algebraic statistical models (cf. (Pachter & Sturmfels 2005, p. 7)). Consider a map
$$ \kappa : \Theta \, (\subseteq \mathbb{R}^d) \to \mathbb{R}^m, \qquad \kappa : \theta = (\theta_1, \ldots, \theta_d) \mapsto (\kappa_1(\theta), \ldots, \kappa_m(\theta)), \tag{2} $$
where $\kappa_i \in k[\theta_1, \ldots, \theta_d]$. We assume that $\Theta$ satisfies $\kappa_i(\theta) \geq 0$, $i = 1, \ldots, m$, and $\sum_{i=1}^m \kappa_i(\theta) = 1$ for any $\theta \in \Theta$. Under these conditions $\kappa(\Theta)$ is indeed an algebraic statistical model (Definition 2.2), since $\kappa(\Theta) \subset \Delta_{m-1}$, $\kappa$ is a polynomial function and $\Theta$ is a semi-algebraic set ($F = \{ \sum_{i=1}^m \kappa_i - 1 \}$ and $G = \{ \kappa_i : i = 1, \ldots, m \}$ in Definition 2.1).

Some statistical models are naturally given by a polynomial map $\kappa$ as in (2) for which the condition $\sum_{i=1}^m \kappa_i(\theta) = 1$ does not hold. If this is the case one can consider the following algebraic statistical model:
$$ \kappa : \theta = (\theta_1, \ldots, \theta_d) \mapsto \frac{1}{\sum_{i=1}^m \kappa_i(\theta)} \, (\kappa_1(\theta), \ldots, \kappa_m(\theta)), \tag{3} $$
assuming that the remaining conditions that have been specified for the model (2) are valid here too. The only difference is that instead of $\kappa$ being a polynomial map, we have it as a rational map.

3. ME in algebraic statistical setup

3.1. Toric Models

In the algebraic description of exponential models, monomials and binomials play a fundamental role. The study of relations of power products leads to the theory of toric ideals in commutative algebra (Sturmfels 1996). Here we describe the basic notions of toric ideals that are relevant to the representation and computation of discrete exponential models; for more details on the theory and computation of toric ideals one can refer to (Sturmfels 1996, Bigatti & Robbiano 2001, Bigatti et al. 1999).

Before we give the definition of a toric ideal, we describe the notion of a Laurent polynomial. If we allow negative exponents in a polynomial, i.e., a polynomial of the form $f = \sum_{\alpha \in \Lambda_f} a_\alpha x^\alpha$ where $\Lambda_f \subset \mathbb{Z}^n$ is finite, it is known as a Laurent polynomial. The set of all Laurent polynomials in the indeterminates $x_1, \ldots, x_n$ is denoted by $k[x_1^{\pm}, \ldots, x_n^{\pm}]$, and it also has the structure of a ring.
Now we define the toric ideal.

Definition 3.1. Let $A = [a_{ij}] \in \mathbb{Z}^{d \times n}$ be a matrix with rank $d$. Consider the ring homomorphism
$$ \hat{\pi} : k[x_1, \ldots, x_n] \to k[\theta_1^{\pm}, \ldots, \theta_d^{\pm}], \qquad \hat{\pi} : x_j \mapsto \theta_1^{a_{1j}} \cdots \theta_d^{a_{dj}}. \tag{4} $$
The toric ideal $\mathfrak{a}_A$ of $A$ is defined as the kernel of the map $\hat{\pi}$, i.e., $\mathfrak{a}_A = \ker \hat{\pi}$.

The mapping $\hat{\pi}$ can be viewed as a "parametrization", which can be explained by the following description of $\hat{\pi}$. Consider a map
$$ \pi : \mathbb{Z}^n_{\geq 0} \to \mathbb{Z}^d, \qquad \pi : u = (u_1, \ldots, u_n) \mapsto Au. \tag{5} $$
The map $\pi$ lifts to the ring homomorphism $\hat{\pi}$ in the sense of the action of $\hat{\pi}$ on $x^u = x_1^{u_1} \cdots x_n^{u_n} \in k[x_1, \ldots, x_n]$. That is,
$$ \hat{\pi}(x^u) = \hat{\pi}(x_1^{u_1} \cdots x_n^{u_n}) = \Big( \prod_{i=1}^d \theta_i^{a_{i1}} \Big)^{u_1} \cdots \Big( \prod_{i=1}^d \theta_i^{a_{in}} \Big)^{u_n} \tag{6} $$
$$ = \prod_{i=1}^d \theta_i^{\sum_{j=1}^n a_{ij} u_j} = \theta^{Au}. \tag{7} $$

Toric ideal theory plays an important role in applications of computational algebraic geometry such as integer programming (cf. (Sturmfels 1996)). Note that in the algebraic descriptions of exponential models and their maximum likelihood estimates only the nonnegative case of toric ideals (and hence toric models) is considered, i.e., the matrix $A = [a_{ij}]$ in Definition 3.1 is assumed to be nonnegative and the map (4) is specified as $\hat{\pi} : k[x_1, \ldots, x_n] \to k[\theta_1, \ldots, \theta_d]$ (see (Pachter & Sturmfels 2005)). As described later in this paper, in the algebraic descriptions of maximum entropy models one has to deal with Laurent polynomials, and hence one has to include the negative case in the definitions of toric ideals and toric models. This poses no problem because toric ideal theory in commutative algebra naturally includes the negative case (as in Definition 3.1) and Gröbner bases theory can be extended to the Laurent polynomial ring (Pauer & Unterkircher 1999).
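
Since $\hat{\pi}(x^u) = \theta^{Au}$, a binomial $x^u - x^v$ lies in the toric ideal $\mathfrak{a}_A$ exactly when $Au = Av$. The sketch below checks this for a small matrix with a negative entry (the Laurent case). The matrix, the exponent vectors and the function names are illustrative assumptions, not taken from the paper.

```python
from fractions import Fraction


def exponent_image(A, u):
    """Return A u, the exponent vector of pi_hat(x^u) = theta^(A u)."""
    return [sum(a * uj for a, uj in zip(row, u)) for row in A]


def binomial_in_toric_ideal(A, u, v):
    """x^u - x^v is mapped to zero by pi_hat exactly when A u == A v."""
    return exponent_image(A, u) == exponent_image(A, v)


def evaluate_image(A, theta, u):
    """Substitute x_j = prod_i theta_i^(a_ij) into x^u, in exact arithmetic."""
    value = Fraction(1)
    for j, uj in enumerate(u):
        xj = Fraction(1)
        for i, th in enumerate(theta):
            xj *= Fraction(th) ** A[i][j]   # negative exponents are allowed
        value *= xj ** uj
    return value


# Illustrative 2 x 3 integer matrix of rank 2 with one negative entry.
A = [[1, 1, 1],
     [-1, 0, 1]]
u = (1, 0, 1)        # the monomial x1 * x3
v = (0, 2, 0)        # the monomial x2^2
theta = (3, 5)       # an arbitrary positive test point
```

Here `binomial_in_toric_ideal(A, u, v)` holds because both exponent vectors map to $(2, 0)$, and evaluating both monomials at `theta` through `evaluate_image` gives the same rational number, as the computation in (6) and (7) predicts.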
The concept of toric ideals led to the description of exponential models under the name toric models in algebraic statistics, which are defined as follows.

Definition 3.2. Let $A \in \mathbb{Z}^{d \times m}_{\geq 0}$ be a matrix such that the vector $(1, \ldots, 1) \in \mathbb{Z}^m_{\geq 0}$ is in the row span of $A$. Let $h \in \mathbb{R}^m_{>0}$ be a vector of positive real numbers. Let $\Theta = \mathbb{R}^d_{>0}$ and let $\kappa^{A,h}$ be the rational parametrization
$$ \kappa^{A,h} : \Theta \to \mathbb{R}^m, \qquad \kappa^{A,h}_j : \theta \mapsto Z(\theta)^{-1} h_j \prod_{i=1}^d \theta_i^{a_{ij}}, \tag{8} $$
where $\theta = (\theta_1, \ldots, \theta_d)$ and $Z(\theta)$ is the appropriate normalizing constant. The toric model is the parametric algebraic statistical model
$$ M_{A,h} := \kappa^{A,h}(\Theta). \tag{9} $$

Independence models, exponential models, Markov chains and hidden Markov chains can be given an algebraic statistical description by means of toric models (Pachter & Sturmfels 2005). We keep the nonnegativity of $A$ in Definition 3.2 as a matter of convention.

3.2. ME in terms of Toric Models

Let $X$ be a random variable taking values from the set $[m] = \{1, 2, \ldots, m\}$. The only information we know about the pmf $p = (p_1, \ldots, p_m)$ of $X$ is in the form of expected values of the functions $t_i : [m] \to \mathbb{R}$, $i = 1, \ldots, d$ (we refer to these functions as 'constraint functions'). We therefore have
$$ \sum_{j=1}^m t_i(j) \, p_j = T_i, \qquad i = 1, \ldots, d, \tag{10} $$
where $T_i$, $i = 1, \ldots, d$, are assumed to be known. In an information theoretic approach to statistics, known as Jaynes' maximum entropy model, one would choose the pmf $p \in \Delta_{m-1}$ that maximizes the Shannon entropy functional
$$ S(p) = - \sum_{j=1}^m p_j \ln p_j \tag{11} $$
with respect to the constraints (10). The corresponding Lagrangian can be written as
$$ \Xi(p, \xi) \equiv S(p) - \xi_0 \Big( \sum_{j=1}^m p_j - 1 \Big) - \sum_{i=1}^d \xi_i \Big( \sum_{j=1}^m t_i(j) \, p_j - T_i \Big). \tag{12} $$
Holding $\xi = (\xi_1, \ldots, \xi_d)$ fixed, the unconstrained maximum of the Lagrangian $\Xi(p, \xi)$ over all $p \in \Delta_{m-1}$ is given by an exponential family (Cover & Thomas 1991)
$$ p_j(\xi) = Z(\xi)^{-1} \exp\Big( - \sum_{i=1}^d \xi_i t_i(j) \Big), \qquad j = 1, \ldots, m, \tag{13} $$
where $Z(\xi)$ is the normalizing constant given by
$$ Z(\xi) = \sum_{j=1}^m \exp\Big( - \sum_{i=1}^d \xi_i t_i(j) \Big). \tag{14} $$
For various values of $\xi \in \mathbb{R}^d$, the family (13) is known as the maximum entropy model. Now we have the following proposition.

Proposition 3.3. The maximum entropy model (13) is a toric model provided that the constraint functions are integer valued.

Proof. Set $\xi_i = - \ln \theta_i$, $i = 1, \ldots, d$. Now (13) gives us
$$ p_j = Z(\theta)^{-1} \exp\Big( \sum_{i=1}^d t_i(j) \ln \theta_i \Big) = Z(\theta)^{-1} \prod_{i=1}^d \theta_i^{t_i(j)}. \tag{15} $$
By defining the matrix $A = [t_i(j)] \in \mathbb{Z}^{d \times m}$ and setting $h = (\frac{1}{m}, \ldots, \frac{1}{m})$ we have the rational parametrization as in (8).

Note that we allowed only integer valued functions in the ME-model in the above proposition, which is necessary for its algebraic description. Here we also mention that in the above proof, by assuming $h \in \Delta_{m-1}$ (which acts as a prior), we can conclude that the minimum I-divergence model (Csiszár 1975)
$$ p_j = \widehat{Z}(\xi)^{-1} h_j \exp\Big( - \sum_{i=1}^d \xi_i t_i(j) \Big), \qquad j = 1, \ldots, m, \tag{16} $$
(with appropriate normalizing constant $\widehat{Z}(\xi)$) is indeed a toric model.

Once the specification of the statistical model is done, the task is to calculate the model parameters with the available information. In this case the available information is in the form of expected values of the functions $t_i$, $i = 1, \ldots, d$, and the Lagrange parameters $\xi_i$, $i = 1, \ldots, d$, are determined using the constraints (10).

4. Calculation of ME distributions via solving Polynomial equations

4.1. Direct method

One can show that the Lagrange parameters in the ME-model (13) can be estimated by solving the following set of equations in the partial derivatives of $\ln Z$ (Jaynes 1968)
$$ - \frac{\partial}{\partial \xi_i} \ln Z(\xi) = T_i, \qquad i = 1, \ldots, d, \tag{17} $$
which in general has no explicit analytical solution. In the literature there are several methods for estimating ME-models. One of the important methods is Darroch and Ratcliff's generalized iterative scaling algorithm (Darroch & Ratcliff 1972). Here we show that ME-models can be calculated using computational algebraic methods.

Note that the set of all distributions which satisfy (10) is known as a linear family (we denote this by $\mathcal{L}$). Now, if we represent the exponential family (13) by $\mathcal{E}$, the set of statistical models that results from the ME-principle can be written as $\mathcal{L} \cap \mathcal{E} \subset \Delta_{m-1}$. One can show that $\mathcal{L} \cap \mathcal{E} \subset \Delta_{m-1}$ is a variety. By substituting the maximum entropy distributions (15) in (10) we get
$$ \sum_{j=1}^m t_i(j) \prod_{l=1}^d \theta_l^{t_l(j)} = T_i \, Z(\theta), \qquad i = 1, \ldots, d, \tag{18} $$
which can be written as
$$ \sum_{j=1}^m t_i(j) \prod_{l=1}^d \theta_l^{t_l(j)} = T_i \sum_{j=1}^m \prod_{l=1}^d \theta_l^{t_l(j)}, \qquad i = 1, \ldots, d. \tag{19} $$
The solutions of the system of polynomial equations (19) give the maximum entropy model specified by the available information (10). We state this as a proposition.

Proposition 4.1. The maximum entropy model (13) can be specified by solving a set of polynomial equations provided that the constraint functions $t_i$, $i = 1, \ldots, d$, are integer valued.

4.2. Dual Method

Here we follow the method of the dual optimization problem. By using the Kuhn-Tucker theorem we calculate the Lagrange parameters $\xi_i$, $i = 1, \ldots, d$, in (13) by optimizing the dual of $\Xi(p, \xi)$. That is, the task is to find $\xi$ which minimizes
$$ \Psi(\xi) \equiv \Xi(p(\xi), \xi). \tag{20} $$
Note that when the constraints (10) hold, $\Psi(\xi)$ is nothing but the entropy of the ME-distribution (13). We have
$$ \Psi(\xi) = \ln Z + \sum_{i=1}^d \xi_i T_i. \tag{21} $$
This can be written as
$$ \Psi(\xi) = \ln \sum_{j=1}^m \exp\Big( - \sum_{i=1}^d \xi_i t_i(j) \Big) + \sum_{i=1}^d \xi_i T_i = \ln \sum_{j=1}^m \exp\Big( \sum_{i=1}^d \xi_i \big( T_i - t_i(j) \big) \Big). \tag{22} $$
Now minimizing $\Psi(\xi)$ is equivalent to minimizing
$$ \Psi'(\xi) = \sum_{j=1}^m \exp\Big( \sum_{i=1}^d \xi_i \big( T_i - t_i(j) \big) \Big). \tag{23} $$
By introducing $\xi_i = - \ln \theta_i$, $i = 1, \ldots, d$, we have
$$ \Psi'(\theta) = \sum_{j=1}^m \prod_{i=1}^d \theta_i^{t_i(j) - T_i}. \tag{24} $$
The solution is given by solving the following set of equations
$$ \frac{\partial \Psi'}{\partial \theta_i} = 0, \qquad i = 1, \ldots, d. \tag{25} $$
Unfortunately, $\frac{\partial \Psi'}{\partial \theta_i} \in k[\theta_1^{\pm}, \ldots, \theta_d^{\pm}]$ only if $T_i \in \mathbb{Z}$, $i = 1, \ldots, d$.

Now we consider the case where the expected values are available as sample means. In most practical problems the information in the form of expected values is available via sample or empirical means. That is, given a sequence of observations $O_1, \ldots, O_N$, the sample means $\widetilde{T}_i$, $i = 1, \ldots, d$, with respect to the functions $t_i$, $i = 1, \ldots, d$, are given by
$$ \widetilde{T}_i = \frac{1}{N} \sum_{l=1}^N t_i(O_l), \qquad i = 1, \ldots, d, \tag{26} $$
and the underlying hypothesis is $T_i \approx \widetilde{T}_i$. That is,
$$ \sum_{j=1}^m p_j \, t_i(j) \approx \frac{1}{N} \sum_{l=1}^N t_i(O_l), \qquad i = 1, \ldots, d. \tag{27} $$
Now we show that, by choosing an alternate Lagrangian in place of (12), we can transform the parameter estimation of the ME-model to a problem of solving a set of (Laurent) polynomial equations.

Proposition 4.2. Given the hypothesis (27), the problem of estimating the ME-model in the dual method amounts to solving a set of Laurent polynomial equations (assuming that the constraint functions are integer valued).

Proof. To retain integer valued exponents in our final solution we consider constraints of the form
$$ N \sum_{j=1}^m t_i(j) \, p_j = \sigma_i, \qquad i = 1, \ldots, d, \tag{28} $$
where $\sigma_i = \sum_{l=1}^N t_i(O_l)$ denotes the sample sum. In this case the Lagrangian is
$$ \widetilde{\Xi}(p, \widetilde{\xi}) \equiv S(p) - \xi_0 \Big( \sum_{j=1}^m p_j - 1 \Big) - \sum_{i=1}^d \widetilde{\xi}_i \Big( N \sum_{j=1}^m p_j \, t_i(j) - \sigma_i \Big). \tag{29} $$
This results in the ME-distribution
$$ p_j(\widetilde{\xi}) = \widetilde{Z}(\widetilde{\xi})^{-1} \exp\Big( - N \sum_{i=1}^d \widetilde{\xi}_i t_i(j) \Big), \qquad j = 1, \ldots, m, \tag{30} $$
where $\widetilde{Z}(\widetilde{\xi})$ is the normalizing constant given by
$$ \widetilde{Z}(\widetilde{\xi}) = \sum_{j=1}^m \exp\Big( - N \sum_{i=1}^d \widetilde{\xi}_i t_i(j) \Big). \tag{31} $$
To calculate the parameters we minimize the dual $\widetilde{\Psi}(\widetilde{\xi})$ of $\widetilde{\Xi}(p, \widetilde{\xi})$. That is, we minimize the functional
$$ \widetilde{\Psi}(\widetilde{\xi}) = \ln \widetilde{Z} + \sum_{i=1}^d \widetilde{\xi}_i \sigma_i. \tag{32} $$
This is equivalent to optimizing the functional
$$ \widetilde{\Psi}'(\widetilde{\xi}) = \sum_{j=1}^m \exp\Big( \sum_{i=1}^d \widetilde{\xi}_i \sigma_i - N \sum_{i=1}^d \widetilde{\xi}_i t_i(j) \Big). $$
By setting $\ln \widetilde{\theta}_i = \widetilde{\xi}_i$ we have
$$ \widetilde{\Psi}'(\widetilde{\theta}) = \sum_{j=1}^m \prod_{i=1}^d \widetilde{\theta}_i^{\, \sigma_i - N t_i(j)}. \tag{33} $$
The solution is given by solving the following set of equations
$$ \frac{\partial \widetilde{\Psi}'}{\partial \widetilde{\theta}_i} = 0, \qquad i = 1, \ldots, d. \tag{34} $$
We have
$$ \frac{\partial \widetilde{\Psi}'}{\partial \widetilde{\theta}_i} \in k[\widetilde{\theta}_1^{\pm}, \ldots, \widetilde{\theta}_d^{\pm}], \qquad i = 1, \ldots, d. \tag{35} $$

In algebraic statistics, algebraic descriptions are used to analyze the maximum likelihood estimates of exponential models (Pachter & Sturmfels 2005). Since maximum likelihood and maximum entropy are related, it will be interesting to compare these two methods from the algebraic statistical point of view.

4.3. Kullback-Csiszar Iteration

The minimum I-divergence principle is a generalization of the maximum entropy principle, which considers the cases where a prior estimate of the distribution $p$ is available. Given a prior estimate $r \in \Delta_{m-1}$ and information in the form of (10), one would choose the pmf $p \in \Delta_{m-1}$ that minimizes the Kullback-Leibler divergence
$$ I(p \,\|\, r) = \sum_{j=1}^m p_j \ln \frac{p_j}{r_j} \tag{36} $$
with respect to the constraints (10). The corresponding minimum relative-entropy distributions are of the form
$$ p_j(\xi) = Z(\xi)^{-1} r_j \exp\Big( - \sum_{i=1}^d \xi_i t_i(j) \Big), \qquad j = 1, \ldots, m, \tag{37} $$
where $Z(\xi)$ is the normalizing constant given by
$$ Z(\xi) = \sum_{j=1}^m r_j \exp\Big( - \sum_{i=1}^d \xi_i t_i(j) \Big). \tag{38} $$
It is easy to see that estimating minimum relative-entropy distributions can be translated to solving polynomial equations when the feature functions are integer valued.
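
The reduction to polynomial root finding can be illustrated in the single-constraint case $d = 1$. Substituting $\theta = \exp(-\xi)$ into the constraint gives the univariate polynomial equation $\sum_j r_j (t(j) - T) \theta^{t(j)} = 0$, and a uniform prior $r$ recovers the maximum entropy case of Section 4.1. The sketch below is only an illustration under stated assumptions: the die-face data with prescribed mean 4.5 and the function names are hypothetical, and plain bisection stands in for the Gröbner basis methods that the multivariate case $d > 1$ would call for.

```python
def solve_single_constraint(t, r, T, iterations=200):
    """Positive root theta of sum_j r[j] * (t[j] - T) * theta**t[j] = 0.

    Assumes nonnegative integer features t[j], positive prior weights r[j]
    and min(t) < T < max(t), so the coefficients change sign exactly once
    and there is exactly one positive root.
    """
    if not (min(t) < T < max(t)):
        raise ValueError("T must lie strictly between min(t) and max(t)")

    def f(theta):
        return sum(rj * (tj - T) * theta ** tj for tj, rj in zip(t, r))

    # Bracket the root: f is negative below it and positive above it.
    lo = hi = 1.0
    while f(lo) > 0:
        lo /= 2.0
    while f(hi) < 0:
        hi *= 2.0

    # Plain bisection on the bracket [lo, hi].
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def distribution(t, r, theta):
    """p_j proportional to r_j * theta**t(j), normalized to sum to one."""
    weights = [rj * theta ** tj for tj, rj in zip(t, r)]
    Z = sum(weights)
    return [w / Z for w in weights]


# Hypothetical example: faces of a die, uniform prior, prescribed mean 4.5.
t = [1, 2, 3, 4, 5, 6]
r = [1.0 / 6.0] * 6
theta = solve_single_constraint(t, r, 4.5)
p = distribution(t, r, theta)
mean = sum(tj * pj for tj, pj in zip(t, p))
```

With these inputs the computed distribution reproduces the prescribed mean up to floating-point error, and replacing `r` by a non-uniform prior gives the corresponding minimum I-divergence distribution.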
For the minimum relative-entropy distributions (37), the polynomial system one would solve is
$$ \sum_{j=1}^m r_j \big( t_i(j) - T_i \big) \prod_{l=1}^d \theta_l^{t_l(j)} = 0, \qquad i = 1, \ldots, d. \tag{39} $$
Hence we have the following proposition.

Proposition 4.3. The estimation of the minimum entropy model (??) amounts to solving a set of polynomial equations in the indeterminates $\theta_i = \exp(-\xi_i)$, $i = 1, \ldots, d$, provided that the feature functions $t_i$, $i = 1, \ldots, d$, are positive and integer valued.

Since the estimation of ME-distributions involves solving a system of nonlinear equations, which can become inefficient, one would employ an iterative method in which the distribution is estimated considering only one constraint at a time. We describe this procedure as follows. At the $N$th iteration, the algorithm computes the distribution $p^{(N)}$ which minimizes $I(p^{(N)} \,\|\, p^{(N-1)})$ with respect to the $i$th constraint, $1 \le i \le d$, where $N = ad + i$ for some nonnegative integer $a$. In this iterative procedure we have $p^{(0)} = r$, and $p^{(1)}$ is given by
$$ p^{(1)}_j = r_j \big( Z^{(1)} \big)^{-1} \zeta_1^{t_1(j)}, $$
where $Z^{(1)} = \sum_{j=1}^m r_j \zeta_1^{t_1(j)}$. Considering the first constraint in (10), $\zeta_1$ can be estimated by solving the polynomial equation
$$ \sum_{j=1}^m r_j \big( t_1(j) - T_1 \big) \zeta_1^{t_1(j)} = 0, \tag{40} $$
with indeterminate $\zeta_1$. Similarly we have
$$ p^{(2)}_j = r_j \big( Z^{(1)} \big)^{-1} \big( Z^{(2)} \big)^{-1} \zeta_1^{t_1(j)} \zeta_2^{t_2(j)}, $$
where $Z^{(2)} = \big( Z^{(1)} \big)^{-1} \sum_{j=1}^m r_j \zeta_1^{t_1(j)} \zeta_2^{t_2(j)}$. Considering the first two constraints in (10), the ME distribution can be estimated by solving
$$ \sum_{j=1}^m r_j \big( t_2(j) - T_2 \big) \zeta_1^{t_1(j)} \zeta_2^{t_2(j)} = 0, \tag{41} $$
along with (40). In general, when $N = ad + i$ for some nonnegative integer $a$, $p^{(N)}_j$, for $N = 1, 2, \ldots$, is given by
$$ p^{(N)}_j = r_j \big( Z^{(1)} \big)^{-1} \cdots \big( Z^{(N)} \big)^{-1} \zeta_1^{t_1(j)} \cdots \zeta_N^{t_i(j)} $$
and is determined by the following system of polynomial equations
$$ \left. \begin{aligned} \sum_{j=1}^m r_j \big( t_1(j) - T_1 \big) \zeta_1^{t_1(j)} &= 0, \\ \sum_{j=1}^m r_j \big( t_2(j) - T_2 \big) \zeta_1^{t_1(j)} \zeta_2^{t_2(j)} &= 0, \\ &\ \ \vdots \\ \sum_{j=1}^m r_j \big( t_i(j) - T_i \big) \zeta_1^{t_1(j)} \zeta_2^{t_2(j)} \cdots \zeta_N^{t_i(j)} &= 0. \end{aligned} \right\} \tag{42} $$

5. Conclusion and Directions for Future research

In this paper we attempted to describe maximum (and hence minimum) entropy models in the algebraic statistical framework. We showed that maximum entropy models are toric models when the constraint functions are assumed to be integer valued, and that the set of statistical models resulting from the ME-principle is indeed a variety. In dual estimation we demonstrated that when the information is in the form of empirical means, the calculation of ME-models can be transformed to solving a set of Laurent polynomial equations. Work on computational algebraic algorithms for estimating ME-models is in progress. We hope that this will also shed light on possible interesting algebraic structures in information theoretic statistics.

References

Adams W W & Loustaunau P 1994 An Introduction to Gröbner Bases Vol. 3 of Graduate Studies in Mathematics American Mathematical Society.
Bigatti A M, La Scala R & Robbiano L 1999 Journal of Symbolic Computation 27, 351–365.
Bigatti A M & Robbiano L 2001 Mat. Contemp. 21, 1–25.
Cover T M & Thomas J A 1991 Elements of Information Theory Wiley New York.
Cox D, Little J & O'Shea D 1991 Ideals, Varieties, and Algorithms 2nd edn Springer New York.
Csiszár I 1975 Ann. Prob. 3(1), 146–158.
Csiszár I & Shields P 2004 Information Theory and Statistics: A Tutorial Vol. 1 of Foundations and Trends in Communications and Information Theory Now Publications.
Darroch J N & Ratcliff D 1972 The Annals of Mathematical Statistics 43(5), 1470–1480.
Diaconis P & Sturmfels B 1998 Annals of Statistics 26, 363–397.
Drton M & Sullivant S 2006 (To be published in) Statistica Sinica.
Jaynes E T 1968 IEEE Transactions on Systems Science and Cybernetics SSC-4(3), 227–241.
Kullback S 1959 Information Theory and Statistics Wiley New York.
Mishra B & Yap C 1989 Information Sciences 48(3), 219–252.
Pachter L & Sturmfels B 2005 Algebraic Statistics and Computational Biology Cambridge University Press Cambridge.
Pauer F & Unterkircher A 1999 Applicable Algebra in Engineering, Communication and Computing 9, 271–291.
Pistone G, Riccomagno E & Wynn H 2001 Algebraic Statistics. Computational Commutative Algebra in Statistics Chapman and Hall New York.
Rapallo F 2006 Annals of the Institute of Statistical Mathematics. In press.
Sturmfels B 1996 Gröbner Bases and Convex Polytopes Vol. 8 of University Lecture Series AMS Press Providence, RI.
Viana M A G & Richards D S P, eds 2001 Algebraic Methods in Statistics and Probability Vol. 287 of Contemporary Mathematics American Mathematical Society Providence, RI.
