Observational nonidentifiability, generalized likelihood and free energy


Authors: A.E. Allahverdyan

Yerevan Physics Institute, Alikhanian Brothers Street 2, Yerevan 375036, Armenia

We study the parameter estimation problem in mixture models with observational nonidentifiability: the full model (also containing hidden variables) is identifiable, but the marginal (observed) model is not. Hence global maxima of the marginal likelihood are (infinitely) degenerate and predictions of the marginal likelihood are not unique. We show how to generalize the marginal likelihood by introducing an effective temperature, making it similar to the free energy. This generalization resolves the observational nonidentifiability, since its maximization leads to unique results that are better than a random selection of one degenerate maximum of the marginal likelihood or the averaging over many such maxima. The generalized likelihood inherits many features from the usual likelihood, e.g. it holds the conditionality principle, and its local maximum can be searched for via a suitably modified expectation-maximization method. The maximization of the generalized likelihood relates to entropy optimization.

I. INTRODUCTION

Unknown parameters of mixture models are frequently estimated via the Maximum Marginal Likelihood (MML) method, which employs the marginal probability of the observed data [1-5]. A local maximization of the marginal likelihood can be carried out via one of several computationally feasible algorithms, e.g. the Expectation-Maximization (EM) method [3-5]. There is, however, a range of problems where MML does not apply due to observational nonidentifiability: the full model (including hidden variables) is identifiable, but the observed (marginal) model is not. Hence the maxima of the marginal likelihood are (generally infinitely) degenerate, and the outcome of MML depends on the initial point of the maximization. Resolving the nonidentifiability in such situations is not hopeless, precisely because the full model is identifiable. However, the standard likelihood maximization cannot be employed, since there are hidden (not observed) variables. We emphasize that some information about unknown parameters is always lost after marginalization [2]. Observational nonidentifiability is an extreme case of this.

Nonidentifiability in mixture models is studied in [14-23]; see [17-19] for reviews. In such models even an infinitely large number of observed data samples cannot guarantee the perfect recovery of parameters (i.e. the convergence to true parameter values), because the maxima of the likelihood are infinitely degenerate [15]. There is an attitude that nonidentifiable models are in a certain sense rare and of little practical importance. This is incorrect: almost any model becomes nonidentifiable if the number of unknown parameters is sufficiently large, i.e. if the model is sufficiently realistic [17, 20]. Moreover, nonidentifiability can be present effectively, due to unresponsiveness of a many-parameter likelihood along sufficiently many directions [24, 25]; see [26] for a review. The simplest scenario of this is realized via small eigenvalues of the likelihood Hessian.
For practical purposes such an effective nonidentifiability—which is generically found in systems biology and chemistry [24-26]—is indistinguishable from the true one.

Aiming to solve the problem of observational nonidentifiability, we extend the marginal likelihood via a one-parameter generalized function $L_\beta$, which is constructed by analogy to the free energy in statistical physics. The positive parameter $\beta$ is an analogue of the inverse temperature from statistical physics, and the marginal likelihood is recovered for $\beta=1$. We show that $L_\beta$ inherits pertinent features of $L_1$; e.g. it holds the conditionality principle, concavity (for $\beta\le 1$), and the possibility to search for its local maximum via a suitably generalized expectation-maximization method. Its maximization resolves the degeneracy of $L_1$. It does have relations with the maximum entropy method (for $\beta<1$) and with entropy minimization (for $\beta>1$). For several models we found an optimal value of $\beta$ in $L_\beta$, which appears to be close to 1, but strictly smaller than 1. We also show numerically that maximizing $L_{\beta\lesssim 1}$ leads to better results than (i) a random selection of one of many results provided by maximizing the usual likelihood $L_1$; (ii) averaging over many such random selections; see section V. Both (i) and (ii) would be among the standard reactions of practitioners to (effective) nonidentifiability.

For $\beta\to\infty$ we get another known quantity: $L_\infty$ coincides with the h-likelihood [1, 3-5], i.e. the full likelihood (including both observed and hidden variables), where the value of the hidden variables is replaced by their maximum a posteriori (MAP) estimates from the observed data [8-10]. The h-likelihood is employed in Hidden Markov Models (HMM), where efficient methods of maximizing $L_\infty$ are known as Viterbi Training (VT) or k-means segmentation [3-7]. When the h-likelihood $L_\infty$ is applied to an observationally nonidentifiable situation, its results converge to boundary values of the parameters (e.g. zero or one for unknown probabilities), as was demonstrated by analyzing an exactly solvable HMM model [13]. Such results are inferior to random selection (see (i) and (ii) above) if there is no prior information that the model is indeed sparse in this sense; cf. section VI. This feature is one reason why h-likelihood maximization leads to obvious failures even in simple models [8, 11]. In particular, it cannot apply generally for solving observational nonidentifiability.

$L_\beta$ also relates to a recent trend in Bayesian statistics, where the model is raised to a certain positive power, akin to $\beta\neq 1$ in $L_\beta$ [28-32]. In this way people deal with misspecified models [28, 29, 32], facilitate the computation of Bayesian factors for model selection [31], regularize models [30], etc.; see [32] for a recent review. The raising to a power emerges from decision theory (as applied to misspecified models) [28] and presents a general method for making Bayesian models more robust. Among the actively researched issues here is the selection of the power parameter [29].

This paper is written in the style of the book by Cox and Hinkley [2]: it is example-based and informal, not least because it employs ideas of statistical physics. It is organized as follows.
Section II A recalls the definition of the observational nonidentifiability we set out to study. Section II B defines the generalized likelihood $L_\beta$ and discusses its features inherited from the usual likelihood $L_1$. Sections II C and II D study the simplest nonidentifiable examples that illustrate features of $L_\beta$. Section III defines the main model we shall focus on; it amounts to a finite mixture with unknown probabilities. Section IV studies for this model the generalized likelihood $L_{\beta<1}$. Numerical comparison with the random selection methods is discussed in section V. Section VI studies the maximization of $L_{\beta>1}$ and shows in which sense this is related to entropy minimization. We summarize in the last section.

II. FREE ENERGY AS GENERALIZED LIKELIHOOD

A. Defining observational nonidentifiability

We are given two random variables X and Y with values $x=(x_1,...,x_n)$ and $y=(y_1,...,y_m)$, respectively. We assume that X is hidden, while Y is the observed variable, i.e. we assume a mixture model. The joint probabilities of XY,

  p_\theta(x,y),    (1)

generally depend on unknown parameters $\theta$. Suppose we are given the observation data

  \{y^{[k]}\}_{k=1}^{N},    (2)

where the $y^{[k]}$ are values of Y generated independently from each other. Then $\theta$ can be estimated via the (marginal, logarithmic) likelihood

  L(\theta) = \frac{1}{N}\sum_{k=1}^{N} \ln\big[p_\theta(y^{[k]})\big] = \sum_y p(y)\,\ln\Big[\sum_x p_\theta(x,y)\Big],    (3)

where $p(y_1),...,p(y_m)$ are the frequencies of Y obtained from the data (2). If (for a fixed m) the observation data (2) is large, $N\gg 1$, the $p(y)$ converge to the true probabilities of Y. Within the maximum likelihood method, the unknown $\theta$ can be determined from $\mathrm{argmax}_\theta[L(\theta)]$.

Since X is hidden, we can easily run into the nonidentifiability problem, where (at least two) different values of $\theta$ lead to the same probability for all values of Y [14-17]:

  p_\theta(y_l) = p_{\theta'}(y_l), \qquad l=1,...,m, \quad \theta\neq\theta'.    (4)

Eqs. (4) imply that the maxima of $L(\theta)$ are degenerate; see below for examples. In addition to (4), we shall require that the full model is still identifiable, i.e. imposing equal joint probabilities for all $(x,y)$ does lead to $\theta=\theta'$ for all $(\theta,\theta')$:

  p_\theta(x_k,y_l) = p_{\theta'}(x_k,y_l), \quad k=1,...,n,\ l=1,...,m \ \Longrightarrow\ \theta=\theta'.    (5)

We shall propose a solution to this type of nonidentifiability. Below we shall focus on the most acute situation, where the marginal probability in (4) does not depend on $\theta$ and the sample length in (2) is very large: $N\gg 1$. Note that other (weaker) forms of nonidentifiability are possible and well documented in the literature: the weakest form of nonidentifiability is when it is restricted to a measure-zero subset of the parameter domain (generic identifiability) [22]. A stronger form is that of partial nonidentifiability, where some information on $\theta$ (e.g. certain bounds on $\theta$) can still be recovered from observations; see [23] for a recent discussion.

B. Generalized likelihood: definition and features

1. Definition

Instead of (3) we set out to maximize over $\theta$ its generalization, viz. the negative free energy

  L_\beta(\theta) = \frac{1}{\beta}\sum_y p(y)\,\ln\Big[\sum_x p_\theta^\beta(x,y)\Big],    (6)

where $\beta>0$ is a parameter. An obvious feature of (6) is that for $\beta=1$ we return from (6) to the (marginal) likelihood function $L(\theta)$ in (3). Hence if we apply the maximization of (6) with $\beta\approx 1$ to an identifiable model, we expect to get results that are close to those found via maximization of $L(\theta)$. The meaning of $\beta\neq 1$ in (6) is that it sums over all values of x, but does not reduce the outcome to the usual (marginal) likelihood.
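As an aside (not part of the original text), Eq. (6) is straightforward to evaluate numerically for a finite model whose joint probabilities are stored as a matrix. The sketch below is ours, with illustrative names; it only checks that $\beta=1$ reproduces the marginal likelihood (3) and that $L_\beta$ grows when $\beta$ is lowered (cf. Eq. (14) below).

```python
import numpy as np

def generalized_likelihood(p_joint, p_y, beta):
    """Eq. (6): (1/beta) * sum_y p(y) * ln( sum_x p_theta(x,y)**beta ).

    p_joint : (n, m) array with the model joint probabilities p_theta(x, y)
    p_y     : (m,) array with the observed frequencies of Y
    beta    : positive scalar; beta = 1 recovers the marginal log-likelihood (3)
    """
    inner = np.sum(p_joint ** beta, axis=0)     # sum_x p_theta^beta(x, y) for every y
    return np.sum(p_y * np.log(inner)) / beta

rng = np.random.default_rng(0)
p = rng.random((3, 4)); p /= p.sum()            # a random joint table p_theta(x, y)
p_y = p.sum(axis=0)                             # observed frequencies taken equal to the model marginal

print(generalized_likelihood(p, p_y, 1.0))      # equals sum_y p(y) ln p_theta(y) ...
print(np.sum(p_y * np.log(p.sum(axis=0))))      # ... i.e. the marginal likelihood (3)
print(generalized_likelihood(p, p_y, 0.9))      # beta < 1 gives a larger value: L_beta decreases in beta
```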
Below we discuss several features that $L_\beta(\theta)$ inherits from the usual likelihood $L_1(\theta)=L(\theta)$. These features motivate introducing $L_\beta(\theta)$ as a generalization of $L_1(\theta)$. The first such feature is apparent from the fact that $L_\beta(\theta)$ in (6) is to be maximized over the unknown parameter $\theta$ [1]. If we reparametrize $L_\beta$ via a bijective (one-to-one) function $\psi(\theta)$—i.e. if the full information on $\theta$ is retained in $\psi$—then the maximization outcomes

  \hat\psi = \mathrm{argmax}_\psi[L_\beta(\psi)] \quad\text{and}\quad \hat\theta = \mathrm{argmax}_\theta[L_\beta(\theta)]    (7)

are related via the same function: $\hat\psi=\psi(\hat\theta)$.

2. Relations of (6) to nonequilibrium free energy

Relations between statistical physics and probabilistic inference frequently proceed via the Gibbs distribution, where the minus logarithm of the probability to be inferred is interpreted as the physical energy (both these quantities are additive for independent events), while the physical temperature is taken to be 1; see [33] for a textbook presentation of this analogy and [34] for a recent review. The main point of making this analogy is that powerful approximate methods of statistical physics can be applied to inference [33, 34].

In the context of mixture models we can carry this analogy one step further. The analogy is now structural, i.e. it relates to the form of (6), and not to the applicability of any approximate method. We relate $-\ln\hat p(x,y)$ to the energy of a physical system, where X and Y are respectively fast (hidden) and slow (observed) variables. Here fast and slow connect with (resp.) hidden and observed, which agrees with the set-up of statistical physics, where only a part of the variables is observed [35]. Then (6) connects to the negative nonequilibrium free energy with inverse temperature $\beta$ [35]. Here nonequilibrium means that only one variable (i.e. X) is thermalized (i.e. its conditional probability is Gibbsian), while the free energy retains several physical meanings [35]; e.g. it is a generating function for calculating various averages and also the (physical) work done under a slow change of suitable externally driven parameters [35]. The maximization of (6) naturally relates to the physical tendency of decreasing free energy (one formulation of the second law of thermodynamics) [35]. Though formal, this correspondence with statistical physics will be instrumental in interpreting $L_\beta$. E.g. we shall see that the maximizer of $L_{\beta<1}$ is unique (in contrast to the maximizers of $L_{\beta\ge 1}$), and this fact can be related to sufficiently high temperatures that simplify the free energy landscape.

3. Relations with h-likelihood

For $\beta\to\infty$ we revert from (6) to

  L_\infty(\theta) = \sum_y p(y)\,\ln\big[\max_x p_\theta(x,y)\big],    (8)

where $\mathrm{argmax}_x\,p_\theta(x,y)$ is the MAP (maximum a posteriori) estimate of x given the data y [5, 6, 13]. The meaning of (8) is obvious in the context of (5): once we cannot employ the maximum likelihood method to $p_\theta(x,y)$—since we do not know what to take for the hidden variable x—we first estimate x from data (2) via the MAP method, and then proceed a la the usual likelihood.¹
It is known that maximizing over unobserved variables has drawbacks [1, 8, 11]. People have tried to improve on those drawbacks by looking, instead of (8), at [12]

  \frac{1}{U}\sum_{u=1}^{U}\sum_y p(y)\,\ln\big[p_\theta(x^{[u]}(y),y)\big],    (9)

where $U=2,3$, $x^{[1]}(y)$ maximizes $p(x,y)$ over x, $x^{[2]}(y)$ gives the next-to-maximal value of $p(x,y)$, etc. In contrast to (8), Eq. (9) accounts for values of x around the maximum of $p(x,y)$. Now $L_\beta(\theta)$ captures the same idea for a large but finite $\beta$.

4. Conditionality

It is known that the ordinary maximum-likelihood method has the appealing feature of conditionality, which is formulated in several related forms [2] and closely connects to other fundamental principles of statistics, e.g. to the likelihood principle [2, 44, 45]. We now find out to which extent the conditionality principle is inherited by the generalized likelihood $L_\beta$ defined in (6).

First we note that $L_\beta$ holds the weak conditionality principle [44]. To define this principle we should enlarge the original pair (X, Y) of random variables to (X, Y, J), where J assumes (for simplicity) a finite set of values $j_1,...,j_\ell$. Now X and Y are still (resp.) hidden and observed variables, while J determines the choice of the experiment and does not depend on the unknown parameter $\theta$ [44, 45]. The choice is made before observing Y, i.e. before collecting the sample (2), and the (marginal) probability $p(j)$ does not depend on $\theta$. For this extended experiment the data amounts to the sample (2) plus the indicator j for the choice of the experiment. Then the analogue of (6) is defined as

  L_\beta(j,\theta) = \frac{1}{\beta}\sum_y p(y)\,\ln\Big[\sum_x p_\theta^\beta(x,y,j)\Big] = \frac{1}{\beta}\sum_y p(y)\,\ln\Big[\sum_x p_\theta^\beta(x,y|j)\Big] + \ln p(j),    (10)

where $p_\theta(x,y,j)$ is the probability of (X, Y, J). It is seen that the inference for the extended experiment produces the same result as the inference for the partial experiment, where the value $J=j$ was fixed beforehand (i.e. the choice of $J=j$ was not a part of the data):

  \mathrm{argmax}_\theta\big(L_\beta(j,\theta)\big) = \mathrm{argmax}_\theta\Big(\frac{1}{\beta}\sum_y p(y)\,\ln\Big[\sum_x p_\theta^\beta(x,y|j)\Big]\Big).    (11)

This is the weak conditionality principle that holds for the generalized likelihood $L_\beta(\theta)$.

However, a stronger form of the conditionality principle does not hold for $L_\beta(\theta)$, because this form mixes observable and hidden variables. Define a new random variable G(X, Y) that depends on X and Y and assumes values $g_1,...,g_\ell$ [2]. Assume that the marginal probability $p(g)$ of G does not depend on $\theta$, i.e. G is an ancillary variable with respect to estimating $\theta$ [62].² Now (6) reads

  L_\beta(\theta) = \frac{1}{\beta}\sum_y p(y)\,\ln\Big[\sum_{g\in\mathcal{G}(y)} p^\beta(g)\sum_{x:\,G(x,y)=g} p_\theta^\beta(x,y|g)\Big],    (12)

where $\mathcal{G}(y)\subset(g_1,...,g_\ell)$ is the set of values assumed by $G(x,y)$ when y is fixed, while x runs over all its values $(x_1,...,x_n)$. One defines a new experiment, where it is a priori known that the value of G(X, Y) is restricted to a specific value g from $(g_1,...,g_\ell)$.

¹ Note that in (8) the maximization was carried out for a given value of y, i.e. we did not apply it to the whole sample (2). Doing so would lead to $L'_\infty(\theta)=\max_x\big[\sum_y p(y)\ln p_\theta(x,y)\big]$ instead of (8). We did not see applications of $L'_\infty(\theta)$ in the literature.
One possible reason for this is that the definition of $L'_\infty(\theta)$ makes an unwarranted (though not strictly forbidden) assumption that $X=x$ is fixed during the sample generation process. At any rate, we applied $L'_\infty(\theta)$ to models and noted that its results for parameter estimation are worse than those of $L_\infty(\theta)$. Hence we stick to (8).

² Recall that ancillary variables need not always exist for a given model [63].

The generalized likelihood for this new experiment is

  L_\beta(\theta|G=g) = \frac{1}{\beta}\sum_y p(y)\,\ln\Big[\sum_{x:\,G(x,y)=g} p_\theta^\beta(x,y|g)\Big].    (13)

It is seen that for $\beta\neq 1$ the maximization of (12) and (13) will generally produce different results, i.e. the stronger form of the conditionality principle does not hold for $L_\beta$.

5. Monotonicity and concavity

$L_\beta(\theta)$ is monotonically decreasing in $\beta$:

  \frac{\partial L_\beta(\theta)}{\partial\beta} = \frac{1}{\beta^2}\sum_y p(y)\sum_x \zeta_\theta(x|y;\beta)\,\ln\zeta_\theta(x|y;\beta) \le 0,    (14)

  \zeta_\theta(x|y;\beta) \equiv p_\theta^\beta(x,y)\Big/\sum_{\bar x} p_\theta^\beta(\bar x,y),    (15)

since $\partial L_\beta(\theta)/\partial\beta$ is a weighted sum of negative entropies.

Let $\theta\in\Omega$ be defined over a partially convex set $\Omega$, i.e. if $\theta_1\in\Omega$ and $\theta_2\in\Omega$, then for $0<\lambda<1$ there exists $\theta_3\in\Omega$ such that $p_{\theta_3}=\lambda p_{\theta_1}+(1-\lambda)p_{\theta_2}$; such a model is studied below in section III. Now for $\beta\le 1$, $L_\beta$ from (6) is a concave function, since it is a linear combination of superpositions of two strictly concave functions, $f(u)=u^\beta$ and $g(v)=\ln v$:

  L_\beta\big(\lambda p_{\theta_1}+(1-\lambda)p_{\theta_2}\big) > \lambda L_\beta(p_{\theta_1}) + (1-\lambda)L_\beta(p_{\theta_2}), \qquad \beta\le 1.    (16)

For $\beta>1$, we note that the superposition $g(f(u))$ of the strictly convex $f(u)=u^\beta$ and the monotonic $g(v)=\ln v$ is pseudo-convex [27]. Pseudo-convex functions do share many important features of convex functions, but generally $L_{\beta>1}$ is not pseudo-convex, since besides the superposition of $f(u)=u^\beta$ and $g(v)=\ln v$, (6) involves a summation over y, and the sum of two pseudo-convex functions is generally not pseudo-convex [27]. In section VI we shall show numerically that the maximizers of $L_{\beta>1}$ relate to those of a generalized Schur-convex function; see Appendix D.

6. Relations with the maximum entropy method

The maximization of the generalized likelihood (6) will now be related to the maximum entropy method [46-50]. Recall that the method addresses the problem of recovering unknown probabilities $\{q(z_k)\}_{k=1}^{n}$ of a random variable $Z=(z_1,...,z_n)$ on the ground of certain constraints on q and Z. The type and number of those constraints are not decided within the method itself [47, 49], though the method can give some recommendations for selecting relevant constraints; see Appendix C. Then $\{q(z_k)\}_{k=1}^{n}$ are determined from the constrained maximization of the entropy $-\sum_{k=1}^{n} q(z_k)\ln q(z_k)$ [46-50]. The intuitive rationale of the method is that it provides the most unbiased choice of probability compatible with the constraints.

To find this relation, we expand (6) for a small $1-\beta$ (i.e. $\beta\simeq 1$):
  L_\beta(\theta) = \sum_y p(y)\ln p_\theta(y) + \frac{1}{\beta}\sum_y p(y)\,\ln\Big[\sum_x p_\theta(x|y)\, e^{(\beta-1)\ln p_\theta(x|y)}\Big]    (17)

  = \sum_y p(y)\ln p_\theta(y) + (1-\beta)\sum_y p(y)\,S_y(\theta) + (1-\beta)^2\sum_y p(y)\,S_y(\theta)    (18)

  \;+\; \frac{(1-\beta)^2}{2}\sum_{xy} p(y)\,p_\theta(x|y)\,\big(S_y(\theta)+\ln p_\theta(x|y)\big)^2 + O\big((1-\beta)^3\big),    (19)

  S_y(\theta) \equiv -\sum_x p_\theta(x|y)\,\ln p_\theta(x|y),    (20)

where $S_y(\theta)$ is the entropy of X for a fixed observation $Y=y$ and fixed parameters $\theta$. When expanding $e^{(\beta-1)\ln p_\theta(x|y)}$ over $(\beta-1)\ln p_\theta(x|y)$ we need to assume that $p_\theta(x|y)>0$, but eventually the milder condition $p_\theta(x|y)\ge 0$ suffices, because the terms in (18-20) stay finite for $p_\theta(x|y)\to 0$.

The zero-order term in $L_\beta(\theta)$ is naturally $L(\theta)=\sum_y p(y)\ln p_\theta(y)$; see (18). But, as we explained around (4), even when $N\gg 1$ in (2), the maximization of $L_1$ does not lead to a single result if the model is not identifiable. This degeneracy will be (at least partially) lifted if the next-order term $(1-\beta)\sum_y p(y)S_y(\theta)$ in $L_\beta$ is taken into account; cf. (18). For $\beta<1$ this term will tend to lift the degeneracy by selecting those maxima which achieve the largest average entropy $\sum_y p(y)S_y(\theta)$. Hence for a small but positive $1-\beta$, the results of maximizing $\sum_y p(y)\ln p_\theta(y)$ will (effectively) serve as constraints when maximizing $\sum_y p(y)S_y(\theta)$. This is the relation between maximizing $L_\beta$ (for a small, positive $1-\beta$) and entropy maximization.³ Note that when $p(y)$ converges to the true probabilities of Y, i.e. when $N\gg 1$ in (2), and when $\theta$ is fixed to its true value, then $\sum_y p(y)S_y(\theta)$ is the conditional entropy of X given Y [47]. The appearance of the conditional entropy is reasonable given the fact that Y is an observed variable. Within the second-order term $O\big((1-\beta)^2\big)$ the fluctuations of the entropy enter into consideration: the degeneracy will be lifted by (simultaneously) maximizing the entropy variance and maximizing the entropy; see (18, 19). Likewise, for $\beta>1$ (but $\beta\simeq 1$) the term $(1-\beta)\sum_y p(y)S_y(\theta)$ in $L_\beta(\theta)$ predicts that among the degenerate maxima of $L_1(\theta)$ those of minimal entropy will be selected.

7. Q-function and generalized EM procedure

$L_\beta(\theta)$ in (6) admits a representation via a suitably generalized Q-function, i.e. its local maximum can be calculated via the (generalized) expectation-maximization (EM) algorithm. Let us define, for two different values $\theta$ and $\tilde\theta$,

  Q_\beta(\theta,\tilde\theta) = \sum_y p(y)\sum_x \zeta_\theta(x|y;\beta)\,\ln p_{\tilde\theta}(x,y),    (21)

where $\zeta_\theta(x|y;\beta)$, defined by (15), is formally a conditional probability. For $\beta=1$ we revert from (21) to the average of the usual Q-function [5, 51]: $\sum_x p_\theta(x|y)\ln p_{\tilde\theta}(x,y)$, which is the full log-likelihood $\ln p_{\tilde\theta}(x,y)$ averaged over the hidden variable X given the observed $Y=y$ and calculated at the trial values $\theta$ and $\tilde\theta$. Now the non-negativity of the relative entropy,

  \frac{1}{\beta}\sum_y p(y)\sum_x \zeta_\theta(x|y;\beta)\,\ln\frac{\zeta_\theta(x|y;\beta)}{\zeta_{\tilde\theta}(x|y;\beta)} \ge 0,    (22)

implies, after using (15, 21) and rearranging (22),

  L_\beta(\tilde\theta) - L_\beta(\theta) \ge Q_\beta(\theta,\tilde\theta) - Q_\beta(\theta,\theta).    (23)

Hence if for a fixed $\theta$ we choose $\tilde\theta$ such that $Q_\beta(\theta,\tilde\theta)>Q_\beta(\theta,\theta)$, then this will increase $L_\beta(\tilde\theta)$ over $L_\beta(\theta)$. Eq. (23) shows the main idea of EM: defining

  \theta_{k+1} = \mathrm{argmax}_{\tilde\theta}\big[Q_\beta(\theta_k,\tilde\theta)\big],    (24)

and starting from a trial value $\theta_1$, we increase $L_\beta(\theta)$ sequentially, $L_\beta(\theta_{k+1})\ge L_\beta(\theta_k)$, as (23) shows. Eq. (21) implies

  \frac{\partial Q_\beta(\theta,\tilde\theta)}{\partial\tilde\theta}\Big|_{\tilde\theta=\theta} = \frac{\partial L_\beta(\theta)}{\partial\theta} = \sum_y \frac{p(y)}{\sum_{\bar x} p_\theta^\beta(\bar x,y)}\sum_x p_\theta^{\beta-1}(x,y)\,\frac{\partial p_\theta(x,y)}{\partial\theta}.    (25)

Eq. (25) shows that if we find $\theta^*$ such that the maximum of $Q_\beta(\theta^*,\tilde\theta)$ over $\tilde\theta$ is reached at $\tilde\theta=\theta^*$, i.e.

  \frac{\partial Q_\beta(\theta,\tilde\theta)}{\partial\tilde\theta}\Big|_{\tilde\theta=\theta=\theta^*} = \frac{\partial L_\beta(\theta)}{\partial\theta}\Big|_{\theta=\theta^*} = 0,    (26)

then $\theta^*$ can be a local maximum of $L_\beta(\theta)$, or an inflection point of $L_\beta(\theta)$ (which has a direction along which it maximizes), or—for a multidimensional $\theta$—a saddle point. Eq. (26) holds if (24) converges. Thus, similarly to the usual likelihood, $L_\beta(\theta)$ can be partially (i.e. generally not globally) maximized via (21).

³ Note that the idea of lifting degeneracies of the maximum likelihood by maximizing the entropy over those degenerate solutions appeared recently in the quantum maximum likelihood method [36, 37]. But there the degeneracies of the likelihood are due to incomplete (noisy) data, i.e. they appear in an identifiable model.
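The following is a small numerical illustration (ours, with illustrative names) of the objects (15), (21) and of the EM-type inequality (23) for a model parametrized directly by its joint table; it does not implement a full M-step, which in general requires a model-specific maximization of (21).

```python
import numpy as np

rng = np.random.default_rng(1)

def L_beta(p_joint, p_y, beta):
    # Eq. (6)
    return np.sum(p_y * np.log(np.sum(p_joint ** beta, axis=0))) / beta

def zeta(p_joint, beta):
    # Eq. (15): zeta_theta(x|y; beta), formally a conditional probability over x for each y
    w = p_joint ** beta
    return w / w.sum(axis=0, keepdims=True)

def Q_beta(p_theta, p_tilde, p_y, beta):
    # Eq. (21): sum_y p(y) sum_x zeta_theta(x|y; beta) ln p_tilde(x, y)
    return np.sum(p_y * np.sum(zeta(p_theta, beta) * np.log(p_tilde), axis=0))

def random_joint(n, m):
    p = rng.random((n, m))
    return p / p.sum()

# two trial parameter values ("theta" and "theta tilde"), each given by a joint table
p_theta, p_tilde = random_joint(3, 4), random_joint(3, 4)
p_y = rng.dirichlet(np.ones(4))                  # fixed observed frequencies p(y)
beta = 0.8

lhs = L_beta(p_tilde, p_y, beta) - L_beta(p_theta, p_y, beta)
rhs = Q_beta(p_theta, p_tilde, p_y, beta) - Q_beta(p_theta, p_theta, p_y, beta)
print(lhs >= rhs)   # inequality (23): raising Q_beta never lowers L_beta
```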
C. First example (discrete random variables)

1. Definition

The following example is among the simplest ones, but it does illustrate several general points of the approach based on maximizing $L_\beta(\theta)$. A binary random variable X ($x=\pm 1$) is hidden, while its noisy version Y ($y=\pm 1$) is observed. The joint probability of XY reads

  p_{gh}(x,y) = \frac{e^{gx+hxy}}{4\cosh h\,\cosh g}, \qquad x=\pm 1,\ y=\pm 1,    (27)

where $g>0$ and $h>0$ are unknown parameters: g relates to the prior probability of the unobserved X, and h relates to the noise. Since the marginal probability of Y holds

  p_{gh}(y) = p_z(y) = \tfrac{1}{2}(1+zy), \qquad z\equiv\tanh h\,\tanh g,    (28)

even with an infinite set of Y-observations one can determine only the product $\tanh h\,\tanh g$, but not the separate factors g and h. On the other hand, the full model (27) is identifiable with respect to g and h, i.e. we have nonidentifiability in the sense of (4, 5).

Appendix A 1 discusses a Bayesian approach to solving this nonidentifiability. As expected, if good (sharp) prior probabilities for g or for h are available, then the nonidentifiability can be resolved. However, when no prior information is available, one is invited to employ noninformative priors [46], which are improper for this model and do not lead to any sensible outcomes; see Appendix A 1. To the same end, Appendix A 2 studies a decision-theoretic (maximin) approach to this model, which also does not assume any prior information on g and/or h. This approach also does not lead to sensible results. Thus, Appendices A 1 and A 2 argue that the estimation of parameters in (27) is a nontrivial problem.

We shall assume that a large ($N\gg 1$) set of observations is given in (2); hence $p(y)=p_z(y)$; see (6, 28). Omitting irrelevant constants, we get from (6, 27, 28):

  L_\beta(\hat g,\hat h) = -\ln\cosh\hat h - \ln\cosh\hat g + \frac{1+z}{2\beta}\ln\cosh(\beta\hat g+\beta\hat h) + \frac{1-z}{2\beta}\ln\cosh(\beta\hat g-\beta\hat h),    (29)

where $\hat g$ and $\hat h$ are estimates of (resp.) g and h, to be determined from maximizing (29). Recall that we assumed $\hat g>0$ and $\hat h>0$ as prior information. Eq. (29) is invariant with respect to interchanging $\hat g$ and $\hat h$: $\hat g\leftrightarrows\hat h$.
2. Solutions

Now the equations $\partial L_\beta/\partial\hat g=\partial L_\beta/\partial\hat h=0$ reduce from (29) to

  \tanh(\beta\hat g+\beta\hat h) = \frac{\tanh\hat h+\tanh\hat g}{1+z}, \qquad \tanh(\beta\hat g-\beta\hat h) = \frac{\tanh\hat g-\tanh\hat h}{1-z},    (30)

where for $\beta=1$ we obtain from (30) the expected $\tanh\hat h\,\tanh\hat g=z$. One can check that for $\beta<1$ the global maximum of (29) is given by the solution of (30) with

  \hat g = \hat h, \quad\text{hence}    (31)

  \frac{1+z}{2}\tanh(2\beta\hat g) = \tanh\hat g,    (32)

where (31) is a single maximum of the function (29), which has the $\hat g\leftrightarrows\hat h$ symmetry. For $\beta<1/2$ the only solution of (32) is $\hat g=\hat h=0$, which is far from holding $\tanh\hat h\,\tanh\hat g=z$; hence we disregard the domain $\beta<1/2$. For $\beta<1$, but $(1+z)\beta>1$, there is a non-zero solution of (32) that provides the global maximum of $L_\beta$. This solution is certainly better than the previous $\hat g=\hat h=0$, but it also does not exactly hold the constraint $\tanh\hat h\,\tanh\hat g=z$. This recovery—i.e. the convergence $\hat g=\hat h\to\mathrm{arctanh}\sqrt z$—is achieved only in the limit $\beta\to 1^-$. For any $\beta<1$ we thus have from maximizing $L_\beta$: $\hat g=\hat h<\mathrm{arctanh}\sqrt z$. Both these facts are seen from (32).

The situation is different for $\beta>1$: under the assumed $\hat g\ge 0$ and $\hat h\ge 0$, we get two maxima of $L_\beta$ related to each other by the transformation $\hat g\leftrightarrows\hat h$:

  \hat g=\infty,\ \hat h=\mathrm{arctanh}\,z \qquad\text{or}\qquad \hat h=\infty,\ \hat g=\mathrm{arctanh}\,z.    (33)

Both solutions hold $\tanh\hat h\,\tanh\hat g=z$; in a sense these are the most extreme possibilities that hold this constraint.⁴

We emphasize that one does not need to focus exclusively on maximizing $L_\beta(\hat g,\hat h)$ over $\hat g$ and $\hat h$. We note that $\int dg\,dh\,e^{L_\beta(g,h)}$ is finite, and hence we can consider $e^{L_\beta(\hat g,\hat h)}\big/\int dg\,dh\,e^{L_\beta(g,h)}$ as a joint density of $\hat g$ and $\hat h$, which is still symmetric with respect to $\hat g\leftrightarrows\hat h$.

⁴ Note that (33) can be obtained in a more artificial way, by replacing $x\to x^\circ(y;g,h)\equiv\sum_x x\,p_{gh}(x|y)=\tanh(g+hy)$ in $p_{gh}(x,y)$, and then maximizing $\sum_y p_{gh}(y)\ln p_{gh}(x^\circ(y;g,h),y)$ over g and h; cf. this procedure with (8). Replacing $x\to x^\circ(y;g,h)$ is formal, since $p_{gh}(x,y)$ is (strictly speaking) not defined for a real x. Still, for this model this formal procedure leads to (33).

3. Overconfidence

Returning to the solutions (32) and (33), let us argue that there is a sense in which (32) is better than (33). To this end, we should enlarge our consideration and ask which solution is more suitable from the viewpoint of finding an estimate $\hat x(y)$ of the hidden variable X given the observed value $Y=y$. This estimation can be done via maximizing the overlap (or the risk function) $O(y;\hat g,\hat h)=\sum_{x=\pm 1}\hat x(y)\,x\,p_{\hat g\hat h}(x|y)$ over $\hat x(y)$; see (27). The maximization produces $\hat x(y)=\mathrm{sign}[\hat g+\hat h y]$, and the quality of the estimation can be judged via the average overlap [cf. (28)]:

  \bar O(z;\hat g,\hat h) = \sum_{y=\pm 1} p_{gh}(y)\,O(y;\hat g,\hat h) = \frac{1+z}{2}\tanh\big|\hat g+\hat h\big| + \frac{1-z}{2}\tanh\big|\hat g-\hat h\big|.    (34)

If the values of g and h are known precisely, $g=\hat g$ and $h=\hat h$, then together with $z=\tanh h\,\tanh g$ we get from (34): $\bar O(z;g,h)=\max[\tanh g,\tanh h]$. Now employing in (34) the solution (33), we get $\bar O(z;\hat g,\hat h)=1>\bar O(z;g,h)$. This overconfidence is not desirable, because with approximate values of the parameters we do not expect to have a better estimation quality than with the true values. In contrast, using (32) in (34) we get a reasonable conclusion:

  \bar O(z;\hat g,\hat h) = \frac{1+z}{2}\tanh 2\hat g < \sqrt z < \bar O(z;g,h).    (35)

Hence, from this viewpoint, the best regime is $\beta\lesssim 1$, since we approximately hold the constraint $\tanh\hat h\,\tanh\hat g=\tanh h\,\tanh g$, and also $\bar O(z;\hat g,\hat h)<\bar O(z;g,h)$. Moreover, the $\beta\lesssim 1$ solution is unique, in contrast to (33).
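To make the two regimes of this example concrete, here is a short numerical sketch (ours) that maximizes Eq. (29) with a generic bounded optimizer. For $\beta\lesssim 1$ the maximizer should come out symmetric, $\hat g=\hat h$, slightly below $\mathrm{arctanh}\sqrt z$, as stated around (32); for $\beta>1$ the maximum lies at the boundary (33), so one estimate is pushed towards the imposed upper bound. The bound of 20, the true parameters and the starting point are arbitrary choices of ours.

```python
import numpy as np
from scipy.optimize import minimize

def L_beta(params, z, beta):
    # Eq. (29), up to irrelevant constants
    g, h = params
    return (-np.log(np.cosh(h)) - np.log(np.cosh(g))
            + (1 + z) / (2 * beta) * np.log(np.cosh(beta * (g + h)))
            + (1 - z) / (2 * beta) * np.log(np.cosh(beta * (g - h))))

g_true, h_true = 1.2, 0.7
z = np.tanh(g_true) * np.tanh(h_true)

for beta in (0.95, 0.99, 1.2):
    res = minimize(lambda p: -L_beta(p, z, beta), x0=[0.5, 0.4],
                   bounds=[(1e-6, 20.0), (1e-6, 20.0)])
    print(beta, res.x, np.arctanh(np.sqrt(z)))
# beta < 1: g_hat = h_hat, slightly below arctanh(sqrt(z)), approaching it as beta -> 1-
# beta > 1: the maximum is at the boundary (33); one estimate is pushed towards the bound
```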
D. Second example (continuous random variables)

While the previous example showed that the maximization of $L_{\beta<1}$ can produce reasonable results, here we discuss a continuous-variable example where a similar maximization leads nowhere without additional assumptions on the model. Consider an analogue of (27):

  p_{gh}(x,y) = g\,e^{-gx}\,h\,x\,e^{-hxy}, \qquad x\ge 0,\ y\ge 0,\ g>0,\ h>0,    (36)

where X (hidden) and Y (observed) are nonnegative, continuous random variables, while g and h are positive unknown parameters. The full model is identifiable; e.g. the maximum-likelihood estimates of g and h read (resp.) $1/x$ and $1/(xy)$, where x and y are observed values of X and Y. But the marginal model is not identifiable, since

  p_{gh}(y) = p_\chi(y) = \frac{\chi}{[y+\chi]^2}, \qquad \chi\equiv g/h,    (37)

depends only on the ratio $\chi$ of the two unknown parameters; cf. (28). Maximizing over $\hat\chi$ the marginal likelihood $\int_0^\infty dy\,p_\chi(y)\ln p_{\hat\chi}(y)$—for a large number $N\gg 1$ of observations in (2)—leads to the correct outcome $\hat\chi=\chi$. But the individual values of the unknown parameters $\hat g$ and $\hat h$ are not determined in this way.

We now employ (6, 36) with an obvious generalization of (6) to continuous random variables, and write for $L_\beta(\hat g,\hat h)$ (again assuming $N\gg 1$):

  L_\beta(\hat g,\hat h) = \frac{1}{\beta}\int_0^\infty dy\,p_\chi(y)\,\ln\int_0^\infty dx\,p_{\hat g\hat h}^\beta(x,y) = \frac{1}{\beta}\ln[\Gamma(\beta)/\beta] + \Big(1-\frac{1}{\beta}\Big)\ln\hat h    (38)

  \;+\; \ln\hat\chi - \frac{\beta+1}{\beta}\,\frac{\chi\ln\chi-\hat\chi\ln\hat\chi}{\chi-\hat\chi},    (39)

where $\Gamma(\beta)$ is Euler's Gamma function, and where $\hat\chi\equiv\hat g/\hat h$. It is seen that $L_\beta$ is expressed in terms of two unknown parameters: $\hat h$ and $\hat\chi$. Hence the maximization of $L_\beta$ can be carried out independently over $\hat h$ and $\hat\chi$. Now the maximization of (39) over $\hat\chi$ produces, for a fixed $\hat h$, a finite outcome for $\hat\chi$ (see below), while the maximization of (38) over $\hat h$ leads to $\hat h\to 0$ for $\beta>1$ and to $\hat h\to\infty$ for $\beta<1$. Hence $L_{\beta\neq 1}(\hat g,\hat h)=L_{\beta\neq 1}(\hat\chi,\hat h)$ does not have maxima for positive and finite $\hat g$ and $\hat h$, as required for having a reasonable model in (36). Note that this situation is worse than the maximization of the marginal likelihood $L_1$, because there at least the value of the ratio $\hat\chi=\chi$ was recovered correctly (in the limit of an infinite number of observations).

The situation with maximizing $L_{\beta<1}(\hat\chi,\hat h)$ in (38, 39) improves if we assume additional prior information on h:

  h \le H,    (40)

where $H>0$ is a new and known parameter. Now (38, 39) is to be maximized over $\hat\chi$ and over $\hat h$ under the constraint $\hat h\le H$. For $\beta<1$ this maximization produces reasonable results:

  \mathrm{argmax}_{\hat h,\hat\chi}\big[L_{\beta<1}(\hat\chi,\hat h)\big] = \big(\hat h=H,\ \hat\chi=f_\beta(\chi)\big),    (41)

  f_\beta(\chi)<\chi \ \text{for}\ \beta<1, \qquad f_\beta(\chi)\to\chi \ \text{for}\ \beta\to 1.    (42)

I.e. for $\beta\to 1$, but $\beta<1$, we get a unique maximization outcome: $\hat h=H$ and $\hat g=H\chi$. Note that the maximization of $L_{\beta>1}$ is still not sensible, since it leads to $\hat h\to 0$.
To conclude this continuous-variable example: here the maximization of $L_{\beta<1}$ produces unique and correct results for the unknown parameters $\hat g$ and $\hat h$ (correct in the sense of reproducing the ratio $g/h$), at the cost of the additional assumption (40). If this assumption is not made, then only the maximization of $L_{\beta=1}$, i.e. of the usual marginal likelihood, is sensible for this model. The maximization of $L_{\beta>1}$ is never sensible here.

III. MIXTURE MODEL WITH UNKNOWN PROBABILITIES

Now we focus on a sufficiently general mixture model, which will allow us to study in detail the structure of $L_\beta$ and its dependence on $\beta$. In the mixture model (1) the probabilities $p(x)$ and $p(y|x)$ are unknown. The prior information on them is introduced below. We shall skip $\theta$ and denote unknown probabilities by hats:

  \hat p(x,y) = \hat p(x)\,\hat p(y|x).    (43)

Then $L_\beta(\theta)$ reads from (6)

  L_\beta = \frac{1}{\beta}\sum_y p(y)\,\ln\Big[\sum_x \hat p^\beta(x,y)\Big].    (44)

If $N\gg 1$ in (2), and hence the frequencies $p(y)$ have converged to the probabilities of Y, the quantities in (43) have to hold

  \sum_x \hat p(x,y) = p(y),    (45)

which is also produced by the maximization of $L_1$ from (44). Eq. (45) involves $m-1$ known quantities $p(y_1),...,p(y_m)$ (note the constraint $\sum_{i=1}^m p(y_i)=1$). If all $\hat p(x)$ and $\hat p(y|x)$ are unknown (apart from holding (45)), then we have $nm-m$ unknown variables: $nm-1$ parameters $\hat p(x,y)$ minus the $m-1$ known parameters $p(y)$. Already for $n=2$, $nm-m$ is larger than the number $m-1$ of known variables. As expected, (45) will not give a unique solution, and the model is nonidentifiable; cf. (4).

Apart from (45), further constraints are also possible. Such constraints amount to various forms of prior information; e.g. $\hat p(x)$ and $\hat p(y|x)$ hold a linear constraint

  \sum_{xy} E(x,y)\,\hat p(x,y) = E,    (46)

where $E(x,y)$ is some function of x and y with a known average E. For instance, $E(x,y)=xy$ refers to the correlation between X and Y. Another example of (46) is when one of the probabilities $\hat p(x,y)$ is known precisely. Note that several linear constraints can be implemented simultaneously; this does not increase the analytical difficulty of treating the model. Constraints similar to (46) decrease the number of (effectively) unknown variables, but we shall focus on the situation where they cannot select a single solution of (45), i.e. the nonidentifiability is kept.

Once the maximization of $L_1$ does not lead to any definite outcome, we look at maximizing $L_\beta$. To this end, it will be useful to recall the concavity of $L_{\beta\le 1}$; cf. (16). The advantage of linear constraints [cf. (45, 46)] is that the unknown $\hat p(x,y)$ are defined over a convex set. Eq. (16) means that for $\beta<1$ there can be only a single internal (with respect to the convex set) point $\hat p_0$ where the gradient of $L_{\beta<1}(\hat p)$ vanishes, $\nabla L_{\beta<1}|_{\hat p=\hat p_0}=0$, and $\hat p_0$ is the global maximum of $L_{\beta<1}(\hat p)$.

IV. MAXIMIZING THE GENERALIZED LIKELIHOOD FOR β ≤ 1

A. Known probability of X

As a first exercise in maximizing $L_{\beta<1}$ for the present model, let us assume that the (prior) probabilities $p(x)$ are known. Hence

  p(x) = \sum_y \hat p(x,y).    (47)

The Lagrange function reads

  L_\beta = \frac{1}{\beta}\sum_y p(y)\,\ln\Big[\sum_x \hat p^\beta(x,y)\Big] - \sum_{xy}\gamma(x)\,\hat p(x,y),    (48)

where the $\gamma(x)$ are the Lagrange multipliers of (47). Now $\partial L_\beta/\partial\hat p(x,y)=0$ amounts to

  \frac{p(y)\,\hat p^{\beta-1}(x,y)}{\sum_{\bar x}\hat p^\beta(\bar x,y)} = \gamma(x).    (49)

Since the right-hand side of (49) does not depend on y, neither should its left-hand side, which is only possible under

  \hat p(x,y) = p(y)\,p(x).    (50)

Once (50) solves (49), it is the global maximum of $L_{\beta<1}$, since the latter is concave. Recall that the $p(y)$ are generally the observed frequencies of (2). Though (50) may not be very useful by itself, it still shows that maximizing $L_{\beta<1}$ under (47) leads to a reasonable null model in a nonidentifiable situation. Imposing other constraints on $\hat p(x,y)$ does lead to nontrivial predictions, as we now proceed to show.
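As a numerical cross-check of (50) (our sketch, not from the paper), one can maximize (44) directly over the table $\hat p(x,y)$ under the constraint (47) with a generic constrained optimizer and compare the output with the product $p(x)p(y)$. The model sizes, the value of $\beta$ and the SLSQP settings below are arbitrary choices of ours.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, m, beta = 3, 4, 0.8
p_x = rng.dirichlet(np.ones(n))          # known prior probabilities p(x), Eq. (47)
p_y = rng.dirichlet(np.ones(m))          # observed frequencies p(y)

def neg_L_beta(flat):
    p = flat.reshape(n, m)
    return -np.sum(p_y * np.log(np.sum(p ** beta, axis=0))) / beta   # minus Eq. (44)

# equality constraints: sum_y p_hat(x, y) = p(x) for every x
constraints = [{"type": "eq",
                "fun": lambda flat, i=i: flat.reshape(n, m)[i].sum() - p_x[i]}
               for i in range(n)]

x0 = np.outer(p_x, np.full(m, 1.0 / m)).ravel()   # a feasible starting table
res = minimize(neg_L_beta, x0, method="SLSQP",
               bounds=[(1e-9, 1.0)] * (n * m), constraints=constraints)

p_hat = res.x.reshape(n, m)
print(np.abs(p_hat - np.outer(p_x, p_y)).max())   # should be small: Eq. (50), p_hat = p(x) p(y)
```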
B. Known average

1. Derivation

Let us turn to maximizing $L_\beta$ under the constraint (46). The Lagrange function reads

  L_\beta = \frac{1}{\beta}\sum_y p(y)\,\ln\Big[\sum_x \hat p^\beta(x,y)\Big] - \delta\sum_{xy}\hat p(x,y) - \gamma\sum_{xy} E(x,y)\,\hat p(x,y),    (51)

where $\delta$ refers to the normalization $\sum_{xy}\hat p(x,y)=1$ and $\gamma$ enforces (46). Now $\partial L_\beta/\partial\hat p(x,y)=0$ leads to

  \frac{p(y)\,\hat p^{\beta-1}(x,y)}{\sum_{\bar x}\hat p^\beta(\bar x,y)} = \delta + \gamma E(x,y),    (52)

which is solved as

  \hat p(x,y) = \frac{p(y)\,[\delta+\gamma E(x,y)]^{\frac{1}{\beta-1}}}{\sum_{\bar x}[\delta+\gamma E(\bar x,y)]^{\frac{\beta}{\beta-1}}},    (53)

where $\gamma$ and $\delta$ are found from the normalization and from (46):

  \gamma = \frac{1-\delta}{E},    (54)

  \sum_y \hat p(y) = 1, \qquad \hat p(y) \equiv \frac{p(y)\sum_x[\delta+\gamma E(x,y)]^{\frac{1}{\beta-1}}}{\sum_{\bar x}[\delta+\gamma E(\bar x,y)]^{\frac{\beta}{\beta-1}}}.    (55)

Note that (54, 55) have a spurious solution $\delta=1$, which is to be avoided in the numerical determination of $\delta$.

2. Features of (53-55)

1. Constraint (46) is invariant with respect to multiplying $E(x,y)$ and E by a number. Hence $\hat p(x,y)$ in (53) is also invariant under this transformation, as seen from (53, 54), where $\delta$ and $\gamma E$ do not change after the multiplication. Constraint (46) is also invariant with respect to shifting $E(x,y)$ and E by a constant a: $E'(x,y)=E(x,y)+a$ and $E'=E+a$. Hence we can always choose $E(x,y)>0$ and $E>0$. Now $\hat p(x,y)$ in (53) is also invariant under this transformation, because

  \delta + (1-\delta)\frac{E(x,y)}{E} = \delta' + (1-\delta')\frac{E'(x,y)}{E'},    (56)

due to

  \gamma' = \frac{1-\delta'}{E'}, \qquad \delta' = \delta\Big(1+\frac{a}{E}\Big) - \frac{a}{E}.    (57)

2. Eq. (53) predicts independent variables X and Y if $E(x,y)$ does not depend on y; i.e. having no prior information on the dependence between X and Y leads to predicting them to be independent [46]. This feature can be generalized by showing that the $\hat p(x,y)$ predicted by (53) is not more precise than $E(x,y)$: assume that the range of y is divided into mutually exclusive domains $S_1,...,S_M$, so that $E(x,y)=E_m(x)$ whenever $y\in S_m$. Now denoting $p_m=\Pr(y\in S_m)=\sum_{y\in S_m}p(y)$ and $\hat p_m(x)=\Pr(x,\,y\in S_m)=\sum_{y\in S_m}\hat p(x,y)$, we get that the shape of (53) coarse-grains and stays invariant:

  \hat p_m(x) = \frac{p_m\,[\delta+\gamma E_m(x)]^{\frac{1}{\beta-1}}}{\sum_{\bar x}[\delta+\gamma E_m(\bar x)]^{\frac{\beta}{\beta-1}}}, \qquad m=1,...,M.    (58)
3. We emphasize that the marginal probability $\hat p(y)=\sum_x\hat p(x,y)$ from (53) is generally not equal to $p(y)$, i.e. (45) does not follow from (53). Now $\hat p(y)\neq p(y)$ is not prohibited if the $p(y)$ are finite-sample frequencies. But when $N\gg 1$ in (2), then $\hat p(y)=p(y)$ is demanded. This equality can be imposed via constraints—additional to (46)—and this will lead to a joint probability different from (53); see Appendix B for details. Instead of imposing additional constraints, we note from (53, 54) that for $\beta<1$ and $\beta\simeq 1$ (written together as $\beta\lesssim 1$) we get $\delta\to 1$, and $\hat p(x,y)$ simplifies as

  \hat p(x,y) = \frac{p(y)\,\big[1+\delta-1+\frac{1-\delta}{E}E(x,y)\big]^{\frac{1}{\beta-1}}}{\sum_{\bar x}\big[1+\delta-1+\frac{1-\delta}{E}E(\bar x,y)\big]^{\frac{\beta}{\beta-1}}} \simeq \frac{p(y)\,e^{-\Gamma E(x,y)}}{\sum_{\bar x} e^{-\Gamma E(\bar x,y)}},    (59)

  \Gamma \equiv \frac{1}{E}\,\frac{1-\delta}{1-\beta},    (60)

where $\Gamma$ stays finite in the limit $\beta\to 1-0$. It is clear from (59, 60) that in this limit (45) does follow from (53): $\hat p(y)=p(y)$; cf. section II C. For the present analytically solvable situation we are able to take the limit $\beta\to 1-0$ and deduce (59, 60). However, upon more general usage of $L_\beta$ (and its maximization) this will not be possible, since taking $\beta\approx 1$ in $L_\beta$ will run into problems inherited from $L_1$ (quasi-degeneracy of maxima, etc.). Hence it is important to know how close $\beta$ should be to 1 for recovering $\hat p(y)\simeq p(y)$. Fig. 1 illustrates this question by looking at Hellinger's distance between $\hat p(y)$ and $p(y)$. It is seen that $\beta$ in the range 0.9-0.95 is already sufficient for getting $\hat p(y)\simeq p(y)$ sufficiently precisely for almost all values of E.

[FIG. 1: Hellinger's distance $1-\sum_{k=1}^{4}\sqrt{\hat p(y_k)\,p(y_k)}$ between $\hat p(y)=\sum_x\hat p(x,y)$ from (53) and $p(y)$ for $n=m=4$, $p(y)=(0.4,\,0.01,\,0.5,\,0.09)$, $E(x_k,y_l)=kl$ ($k,l=1,...,4$) and various values of E that hold (61). From top to bottom: $\beta=0.85$ (black curve), $\beta=0.9$ (blue curve) and $\beta=0.95$ (red curve).]

4. Here, finally, are certain subsidiary but useful features. When the $p(y)$ are the true probabilities of Y, then E is supposed to hold the following constraints:

  \sum_y p(y)\,E(\tilde x(y),y) \le E \le \sum_y p(y)\,E(\hat x(y),y),    (61)

  \tilde x(y) \equiv \mathrm{argmin}_x[E(x,y)], \qquad \hat x(y) \equiv \mathrm{argmax}_x[E(x,y)].    (62)

In addition, there is a relation that can be deduced directly from (59, 60), but appears to hold more generally, i.e. also for $\beta<1$:

  \mathrm{sign}[\gamma] = \mathrm{sign}\Big[\frac{1}{n}\sum_{xy} p(y)\,E(x,y) - E\Big].    (63)
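The $\beta\to 1^-$ form (59)-(60) is convenient numerically: the conditional probability is an exponential family in $E(x,y)$, $\hat p(y)=p(y)$ holds exactly, and $\Gamma$ is fixed by the single constraint (46), whose left-hand side is monotone in $\Gamma$. Below is our sketch of this limiting construction for the same setup as in Fig. 1 (the target value $E=6$ and the root bracket are our choices); it is a stand-in for the full solution of (53)-(55), not the paper's code.

```python
import numpy as np
from scipy.optimize import brentq

# Setup mirroring Fig. 1: n = m = 4, E(x_k, y_l) = k*l, fixed p(y), target average E
n, m = 4, 4
p_y = np.array([0.4, 0.01, 0.5, 0.09])
E_mat = np.outer(np.arange(1, n + 1), np.arange(1, m + 1)).astype(float)
E_target = 6.0                                     # must respect the bounds (61)

def joint_from_gamma(Gamma):
    # Eqs. (59)-(60): p_hat(x, y) = p(y) exp(-Gamma E(x,y)) / sum_x' exp(-Gamma E(x',y))
    logits = -Gamma * E_mat
    logits -= logits.max(axis=0, keepdims=True)    # softmax trick against overflow
    w = np.exp(logits)
    return (w / w.sum(axis=0, keepdims=True)) * p_y

def constraint_gap(Gamma):
    return np.sum(E_mat * joint_from_gamma(Gamma)) - E_target   # Eq. (46)

Gamma = brentq(constraint_gap, -20.0, 20.0)        # the average of E is monotone in Gamma
p_hat = joint_from_gamma(Gamma)
print(Gamma, np.sum(E_mat * p_hat), p_hat.sum(axis=0))   # constraint holds; p_hat(y) = p(y)
```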
V. NUMERICAL COMPARISON WITH RANDOM CHOICES OF NONIDENTIFIABLE PARAMETERS

In this section we compare the predictions obtained from maximizing $L_{\beta<1}$ with the standard attitude of practitioners towards nonidentifiability: people either take a maximum of the (marginal) likelihood $L_1$, postulating that if there are many maxima, they are eventually equivalent; or, within a more careful but also more laborious approach, they average over sufficiently many such maxima. For the studied model these maxima are given by (45), and the comparison will show that maximizing $L_{\beta\lesssim 1}$ is superior to such random selection methods.

Let us assume that we know the true joint probability $\pi_k(x,y)$ of X and Y (the meaning of the integer k is specified below). Given $\pi_k(x,y)$ and $E(x,y)$, we calculate the marginal probability of Y and the constraint:

  p_k(y) = \sum_x \pi_k(x,y), \qquad E_k = \sum_{xy} E(x,y)\,\pi_k(x,y).    (64)

Using (53-55), and $p_k(y)$ and $E_k$ from (64), we recover $\hat p_k(x,y)$, which depends on $\beta<1$. Recalling the discussion around (60), we shall work with $\beta=0.95$. The quality of $\hat p_k(x,y)$—given by (53-55, 64) as a solution to the problem of estimating $\pi_k(x,y)$—can be judged from the distance $\mathrm{dist}[\pi_k,\hat p_k]$, which (for clarity) is chosen to be Hellinger's distance between two probabilities:

  \mathrm{dist}[\pi_k,\hat p_k] \equiv 1 - \sum_{xy}\sqrt{\hat p_k(x,y)\,\pi_k(x,y)}.    (65)

Now (65) depends on the choice of $\pi_k(x,y)$. To make this dependence weaker, i.e. to make the situation less subjective, we assume that the $\pi_k(x,y)$ for $k=1,...,S\gg 1$ are generated randomly and independently from each other. The simplest possible mechanism suits our purposes: we choose the $\Pi_k(x,y)$ as $n\times m\times S$ independent random variables homogeneously distributed in $[0,A]$ (the choice of A does not seriously influence the situation provided that $A\ge 1$), and then calculate

  \pi_k(x,y) = \Pi_k(x,y)\Big/\sum_{\bar x\bar y}\Pi_k(\bar x,\bar y).    (66)

Thus for $S\gg 1$ we define from (65) the averaged distance

  D_1 = \frac{1}{S}\sum_{k=1}^{S}\mathrm{dist}[\pi_k,\hat p_k],    (67)

which estimates the quality of $\hat p(x,y)$ in predicting the (known) joint probability. To comment on the above choice $\beta=0.95$: we note from our numerical results that the dependence of $D_1$ on $\beta$ is anyhow weak, e.g. it typically changes by 1% when changing $\beta$ from 0.7 to 1.

Now $D_1$ will be compared with the situation where—given $p_k(y)$ and $E_k$ from (64)—we do not employ (53-55), but instead guess the joint probability of X and Y. This is done by picking randomly—via the same mechanism as in (66)—a conditional probability $\tilde p_k(x|y)$, with the additional condition that it holds $\sum_{xy}p_k(y)\,\tilde p_k(x|y)\,E(x,y)=E_k$;⁵ see (64). Thereby we construct

  D_2 = \frac{1}{S}\sum_{k=1}^{S}\mathrm{dist}\big[\pi_k,\,p_k(y)\,\tilde p_k(x|y)\big].    (68)

Due to $S\gg 1$ in (68), $D_2$ is (almost) a sure quantity. Table I compares $D_2$ with $D_1$ for a representative set of parameters. It is seen that $D_2$ is some two times larger than $D_1$, i.e. a random solution is worse than (53-55). Table I also shows that $D_2>D_1$ holds upon using other measures of closeness, e.g. the relative entropy instead of (65).

TABLE I: The values of $D_1$ and $D_2$ given by (resp.) (67) and (68) for $x=1,...,n$ and $y=1,...,m=n$, with $E_1(x,y)=|x-y|$ and $E_2(x,y)=xy$. The averaging in (67, 68) was taken over $S=10^3$ samples. We took $\beta=0.95$. For completeness, we also present the analogues of $D_1$ and $D_2$ (denoted by $K_1$ and $K_2$, respectively), where Hellinger's distance in (65) is replaced by the relative entropy $\mathrm{dist}[\pi_k,\hat p_k]\to\sum_{xy}\pi_k(x,y)\ln\frac{\pi_k(x,y)}{\hat p_k(x,y)}$. Both choices support the same conclusion: $D_1<D_2$, $K_1<K_2$.

        |  n = m = 4                         |  n = m = 5
  E_1   |  D_1 = 0.041,  D_2 = 0.092         |  D_1 = 0.046,  D_2 = 0.098
        |  K_1 = 0.145,  K_2 = 0.410         |  K_1 = 0.153,  K_2 = 0.442
  E_2   |  D_1 = 0.041,  D_2 = 0.089         |  D_1 = 0.045,  D_2 = 0.096
        |  K_1 = 0.143,  K_2 = 0.434         |  K_1 = 0.155,  K_2 = 0.430

There is yet another quantity that can be employed for evaluating our approach.
Returning to the discussion above (68), we generate independently—following the above recipe, and for a given $\pi_k(x,y)$, $p_k(y)$ and $E_k$—many ($l=1,...,M\gg 1$) conditional probabilities $\tilde p_k^{[l]}(x|y)$ that hold $\sum_{xy}p_k(y)\,\tilde p_k^{[l]}(x|y)\,E(x,y)=E_k$. Next, we consider the average

  \tilde p_k(x|y) = \frac{1}{M}\sum_{l=1}^{M}\tilde p_k^{[l]}(x|y),    (69)

which also corresponds to the known practice of taking averages over different outcomes of the likelihood maximization. Eq. (69) is akin to the Bayesian-average estimator, because for given observations (in this case $p_k(y)$) it averages over all hidden parameters consistent with the prior information $E_k$. To understand whether $p_k(y)\,\tilde p_k(x|y)$ is a better estimate of $\pi_k(x,y)$ compared to $\hat p_k(x,y)$, we look at averages over independent $\pi_k(x,y)$ [cf. (67, 68)]:

  \Delta D_3 = \frac{1}{S}\sum_{k=1}^{S}\Delta d_k, \qquad \Delta d_k \equiv \mathrm{dist}\big[\pi_k,\,p_k(y)\,\tilde p_k(x|y)\big] - \mathrm{dist}\big[\pi_k,\,\hat p_k(x,y)\big].    (70)

Though particular values of $\Delta d_k$ can be negative, the averaged value $\Delta D_3>0$ is positive, showing that $\hat p_k(x,y)$ [given by (53)] is a better estimate than $p_k(y)\,\tilde p_k(x|y)$; see Table II. Comparing Table II with the results of Table I, we see that the averaged $p_k(y)\,\tilde p_k(x|y)$ is closer to $\pi_k(x,y)$ than a single random guess—hence the practical habit of averaging over different outcomes of the maximum-likelihood method does have a rationale in it—but it is still outperformed by $\hat p_k(x,y)$.

TABLE II: The values of $\Delta D_3$ given by (70) for $x=1,2,3$, $y=1,2,3$, $E_1(x,y)=|x-y|$, and $E_2(x,y)=xy$. The averaging in (70) was taken over $S=300$ samples, with $M=10^5$ in (69). We took $\beta=0.95$. We also present the analogues of $\Delta D_3$ (denoted by $\Delta K_3$), where Hellinger's distance in (70) is replaced by the relative entropy $\mathrm{dist}[\pi_k,\hat p_k]\to\sum_{xy}\pi_k(x,y)\ln\frac{\pi_k(x,y)}{\hat p_k(x,y)}$. Both choices support the same conclusion: $\Delta D_3>0$, $\Delta K_3>0$.

  E_1:  \Delta D_3 = 0.00278,  \Delta K_3 = 0.01224
  E_2:  \Delta D_3 = 0.00215,  \Delta K_3 = 0.01053

⁵ In more detail, this goes as follows: given $\pi_k(x,y)$ we find $E_k$ and $p_k(y)$ via (64). Next, for a fixed k we randomly generate $nm-1$ positive variables $\{\tilde\Pi_k(x,y)\}$; their number is $nm-1$, since $\tilde\Pi_k(n,m)$ is absent. Then we look at equation (64): $\sum_y p(y)\,\frac{\sum_x\tilde\Pi_k(x,y)[E(x,y)-E_k]}{\sum_{\bar x}\tilde\Pi_k(\bar x,y)}=0$, with unknown $\tilde\Pi_k(n,m)$. If this equation is solved with a nonnegative solution $\tilde\Pi_k(n,m)$, the latter is joined to $\{\tilde\Pi_k(x,y)\}$, and we take $\tilde p_k(x|y)=\tilde\Pi_k(x,y)\big/\sum_{\bar x}\tilde\Pi_k(\bar x,y)$ as the sought random conditional probability. Otherwise, if the equation is not solved with a positive $\tilde\Pi_k(n,m)$, we generate $\{\tilde\Pi_k(x,y)\}$ anew, till $\tilde\Pi_k(n,m)>0$.
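For completeness, here is a self-contained and simplified version of the Monte Carlo comparison of this section, written by us: the true joints $\pi_k$ are drawn as in (66), the $\beta\lesssim 1$ estimate is taken in the limiting form (59)-(60), and the random constrained guess is produced not by the rejection scheme of footnote 5 but by convex mixing of a random conditional with an extreme one, which also satisfies the linear constraint exactly. The numbers will not reproduce Table I exactly (different $\beta$ treatment and sampling scheme), but the qualitative ordering $D_1<D_2$ is what one should expect.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(3)
n = m = 4
E_mat = np.outer(np.arange(1, n + 1), np.arange(1, m + 1)).astype(float)

def hellinger(p, q):
    # distance of Eq. (65)
    return 1.0 - np.sum(np.sqrt(p * q))

def estimate(p_y, E_target):
    # beta -> 1- estimate, Eqs. (59)-(60), with Gamma fixed by the constraint (46)
    def joint(Gamma):
        logits = -Gamma * E_mat
        logits -= logits.max(axis=0, keepdims=True)
        w = np.exp(logits)
        return (w / w.sum(axis=0, keepdims=True)) * p_y
    Gamma = brentq(lambda G: np.sum(E_mat * joint(G)) - E_target, -60.0, 60.0)
    return joint(Gamma)

def random_constrained_guess(p_y, E_target):
    # a random conditional p(x|y) holding the constraint, obtained by convex mixing
    # (our simpler substitute for the sampling scheme of footnote 5)
    q = rng.random((n, m)); q /= q.sum(axis=0, keepdims=True)
    value = np.sum(p_y * np.sum(q * E_mat, axis=0))
    extreme = np.zeros((n, m))
    rows = E_mat.argmax(axis=0) if value < E_target else E_mat.argmin(axis=0)
    extreme[rows, np.arange(m)] = 1.0
    value_ext = np.sum(p_y * np.sum(extreme * E_mat, axis=0))
    lam = (E_target - value) / (value_ext - value)
    return ((1 - lam) * q + lam * extreme) * p_y

D1 = D2 = 0.0
S = 300
for _ in range(S):
    pi = rng.random((n, m)); pi /= pi.sum()        # random "true" joint, as in (66)
    p_y, E_k = pi.sum(axis=0), np.sum(E_mat * pi)  # Eq. (64)
    D1 += hellinger(pi, estimate(p_y, E_k))
    D2 += hellinger(pi, random_constrained_guess(p_y, E_k))
print(D1 / S, D2 / S)                              # D1 is expected to come out smaller, cf. Table I
```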
VI. MAXIMIZING THE GENERALIZED LIKELIHOOD FOR β > 1

We now turn to maximizing $L_{\beta>1}$ over the unknown probabilities $\hat p(x,y)$; cf. (44). As seen below, this leads to setting many unknown probabilities to zero, i.e. making the vector $\{\hat p(x,y)\}$ sparse. Hence the maximization of $L_{\beta>1}$ does not apply to the problem of solving observational nonidentifiability, unless this problem comes with prior information on the sparsity. Even apart from such cases, studying $\max[L_{\beta>1}]$ is relevant for those examples where the maximization of $L_{\beta<1}$ does not provide a sufficiently nontrivial result; see section IV A, where only the marginal probabilities of X and Y are known. As seen below, yet another reason for studying $\max[L_{\beta>1}]$ is that it does have close relations with entropy minimization, a technique sporadically employed in probabilistic inference [52-54] (e.g., for the feature extraction problem [53]) and recently discussed in the context of risk minimization in decision making [55].

For simplicity we assume that $N\gg 1$ in (2), i.e. $\sum_x\hat p(x,y)=p(y)$ holds. Hence we use $\hat p(x,y)=\hat p(x|y)\,p(y)$ and write (44) as

  L_\beta = \frac{1}{\beta}\sum_y p(y)\,\ln\Big[\sum_x \hat p^\beta(x|y)\Big] + \sum_y p(y)\ln p(y),    (71)

where the $\{\hat p(x|y)\}$ can be taken as the maximization variables. Besides $\sum_x\hat p(x|y)=1$ and $\hat p(x|y)\ge 0$, there can be additional conditions imposed on the maximization, e.g. condition (46). We denote such conditions by C. Without such constraints, the maximization of (71) is trivial: since $\sum_x\hat p^\beta(x|y)\le 1$ due to $\beta>1$, the global maximum of (71) is reached for $\hat p(x|y)=p_{\rm sparse}(x|y)=\delta_K(x,x')$, where $\delta_K(x,x')$ is the Kronecker delta and $x'$ is an arbitrary value of X. Note that the same $p_{\rm sparse}(x|y)$ minimizes the entropy

  S_{XY} = -\sum_{xy} p(y)\,\hat p(x|y)\,\ln\big[p(y)\,\hat p(x|y)\big]    (72)

over $\{\hat p(x|y)\}$.

In Appendix E we present numerical evidence that the maximizer of (44) for $\beta>1$ coincides with the minimizer of (72) under a nontrivial constraint C of the known marginal $p(x)$. This minimizer corresponds to the possibly majorizing (i.e. in the sense of majorization [43]) probability vector under the constraints C; see Appendix D for details. To describe it, one introduces

  \max_{p(x_k|y_l);\ l=1,...,m,\ k=1,...,n}\big[\hat p(x_k|y_l)\,p(y_l);\ C\big].    (73)

If the maximization in (73) is reached at $k=k^*$ and $l=l^*$, then the next element of $\{p(x_k|y_l)\}_{k=1,...,n;\,l=1,...,m}$ is found from

  \max_{p(x_k|y_l);\ l=1,...,m,\ k=1,...,n,\ k\neq k^*,\ l\neq l^*}\big[\hat p(x_k|y_l)\,p(y_l);\ C\big].    (74)

This process continues—taking at each step all previously found elements as constraints—till all elements of $\{p(x|y)\}$ are found. Eqs. (73, 74) emerge as maximizers of a generalized Schur-convex function; see Appendix D. We emphasize that $L_{\beta>1}$ in (71) is not generalized Schur-convex; hence the relation between the maximizer of $L_{\beta>1}$ and (73, 74) is presently an empirical (numerical) fact that needs further understanding.
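Our reading of the greedy construction (73)-(74), for the particular constraint C of both marginals being known (p(x) from prior information, p(y) from the data), is sketched below: at every step the largest joint-probability entry still compatible with the remaining marginal budgets is fixed. Appendix D is not reproduced here, so this is an illustration of the idea rather than the paper's algorithm; for comparison, the entropy (72) of the resulting table is printed against that of the product coupling.

```python
import numpy as np

def greedy_low_entropy_joint(p_x, p_y):
    """Greedy construction in the spirit of (73)-(74), for C = {row sums p(x), column sums p(y)}:
    repeatedly place the largest admissible mass min(remaining p(x), remaining p(y)) into one cell,
    which drives the joint table towards sparsity (low entropy)."""
    rx, ry = np.array(p_x, dtype=float), np.array(p_y, dtype=float)
    joint = np.zeros((len(rx), len(ry)))
    while rx.sum() > 1e-12:
        i, j = int(np.argmax(rx)), int(np.argmax(ry))
        v = min(rx[i], ry[j])                      # largest entry still compatible with C
        joint[i, j] += v
        rx[i] -= v
        ry[j] -= v
    return joint

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))                  # Eq. (72)

p_x = np.array([0.5, 0.3, 0.2])
p_y = np.array([0.6, 0.25, 0.15])
J = greedy_low_entropy_joint(p_x, p_y)

print(J)
print(entropy(J), entropy(np.outer(p_x, p_y)))     # the greedy table has the lower entropy
```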
VII. SUMMARY AND OPEN PROBLEMS

How can one resolve nonidentifiability in the parameter determination of mixture models? We proposed an answer that applies to observational nonidentifiability, where the full model (including hidden variables) is identifiable, but the observed (marginal) model is not; see section II A. Marginalizing decreases the information available about the unknown parameter(s) [2]. This general point can be illustrated by the behavior of the Fisher information, which decreases upon marginalizing [2]. Here we focused on the extreme case, when the information about the parameter is lost completely. This is the phenomenon of observational nonidentifiability, where the maxima of the marginal likelihood are (infinitely) degenerate. In contrast to the most general instances of nonidentifiability (which can e.g. follow from a trivial overparametrization), this particular form is not hopeless to solve, precisely because the full model (including the unobserved or hidden variables) is identifiable.

The presented method amounts to generalizing the marginal likelihood function via $L_\beta(\theta)$, where $\theta$ is the unknown parameter(s) and $\beta>0$ is an analogue of the inverse temperature from statistical mechanics; see section II B. For $\beta=1$ we recover the usual marginal likelihood, while $L_\infty(\theta)$ amounts to the h-likelihood, where the value of the hidden variables is replaced by its MAP (maximum a posteriori) estimate. $L_\beta(\theta)$ is constructed by analogy to the statistical-physical free energy, where $\beta$ plays the role of the inverse temperature; see section II B. The generalization is motivated by the fact that $L_\beta(\theta)$ inherits some useful features of $L_1(\theta)$; see section II B. Maximizing $L_\beta(\theta)$ instead of $L_1(\theta)$ can lead to reasonable predictions if the value of $\beta$ is chosen correctly. We treated several models and argued that the optimal value of $\beta$ is close to, but (strictly) smaller than, one. In particular, the results predicted by $L_\beta(\theta)$ are better than those obtained via what one can call a practitioner's attitude towards nonidentifiability, i.e. picking a random maximum of $L_1$, or averaging over many such (randomly selected) maxima; see section V. The check was carried out numerically by assuming that the initial data is distributed randomly in a sufficiently unbiased way.

We have shown that maximizing $L_{\beta\lesssim 1}(\theta)$ relates to the maximum entropy method; see section II B 6. Likewise, the maximization of $L_{\beta>1}(\theta)$ relates to minimizing the entropy; see section VI. There are also some analogies between $L_\beta(\theta)$ and conditional Renyi entropies [59, 60].

Several pertinent questions are left open and should motivate further research. (i) The results and methods of section V—which compares predictions of $L_\beta(\theta)$ with random selections—should be studied systematically on an analytical basis. (ii) How does $L_\beta(\theta)$ apply to effective nonidentifiability? (iii) An asymptotic analysis of $L_\beta(\theta)$ should link it to a (generalized?) Fisher information. (iv) The relation between maximizing $L_{\beta>1}(\theta)$ and entropy minimization should be clarified; see section VI. So far it is restricted to a perturbation argument (see section II B 6) and numerical examples; cf. Appendix E. (v) How does $L_\beta(\theta)$ apply to image restoration problems, which also frequently suffer from observational nonidentifiability issues [61]?

Acknowledgments

It is a pleasure to acknowledge many useful discussions with Narek Martirosyan. I thank Aram Galstyan for support and discussions and Gevorg Karyan for a useful remark. This research was supported by the ISTC Joint Research Grant Program "Parameter learning in nonidentifiable models", and by SCS of Armenia, grants No. 18RF-015 and No. 18T-1C090.

[1] Y. Pawitan, In All Likelihood (Oxford University Press, Oxford, 2000).
[2] D.R. Cox and D.V. Hinkley, Theoretical Statistics (Chapman and Hall, London, 1974).
[3] F. Jelinek, Continuous speech recognition by statistical methods, Proc. IEEE 64, 532 (1976).
[4] L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE 77, 257 (1989).
[5] Y. Ephraim and N. Merhav, Hidden Markov processes, IEEE Trans. Inf. Th. 48, 1518 (2002).
[6] B.-H. Juang and L.R. Rabiner, The segmental K-means algorithm for estimating parameters of hidden Markov models, IEEE Transactions on Acoustics, Speech, and Signal Processing 38, 1639 (1990).
[7] N. Merhav and Y. Ephraim, Maximum likelihood hidden Markov modeling using a dominant sequence of states, IEEE Transactions on Signal Processing 39, 2111-2115 (1991).
[1] Y. Pawitan, In All Likelihood (Oxford University Press, Oxford, 2000).
[2] D.R. Cox and D.V. Hinkley, Theoretical Statistics (Chapman and Hall, London, 1974).
[3] F. Jelinek, Continuous speech recognition by statistical methods, Proc. IEEE, 64, 532 (1976).
[4] L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE, 77, 257 (1989).
[5] Y. Ephraim and N. Merhav, Hidden Markov processes, IEEE Trans. Inf. Th., 48, 1518 (2002).
[6] B.-H. Juang and L.R. Rabiner, The segmental K-means algorithm for estimating parameters of hidden Markov models, IEEE Transactions on Acoustics, Speech, and Signal Processing, 38, 1639 (1990).
[7] N. Merhav and Y. Ephraim, Maximum likelihood hidden Markov modeling using a dominant sequence of states, IEEE Transactions on Signal Processing, 39, 2111-2115 (1991).
[8] Y. Lee, J.A. Nelder, and Y. Pawitan, Generalized Linear Models with Random Effects: Unified Analysis via H-likelihood (Chapman & Hall/CRC, Boca Raton, 2006).
[9] J.F. Bjornstad, On the Generalization of the Likelihood Function and the Likelihood Principle, J. Am. Stat. Ass. 91, 791-806 (1996).
[10] E.J. Bedrick and J.R. Hill, Properties and Applications of the Generalized Likelihood as a Summary Function for Prediction Problems, Scand. J. Stat. 26, 593-609 (1999).
[11] X.-L. Meng, Decoding the H-likelihood, Statistical Science, 24, 280-293 (2009).
[12] W. Byrne, An information geometric treatment of maximum likelihood criteria and generalization in hidden Markov modeling, technical report; W. Byrne, Information geometry and maximum likelihood criteria, in Proceedings of the Conference on Information Sciences and Systems (Princeton University, Princeton, USA, 1996).
[13] A. Allahverdyan and A. Galstyan, Comparative analysis of Viterbi training and Maximum-Likelihood estimation for Hidden Markov Models, in Advances in Neural Information Processing Systems (NIPS), 2011.
[14] H. Teicher, Identifiability of finite mixtures, The Annals of Mathematical Statistics, 1265 (1963).
[15] T.J. Rothenberg, Identification in parametric models, Econometrica, 39, 577 (1971).
[16] H. Ito, S. Amari, and K. Kobayashi, Identifiability of Hidden Markov Information Sources, IEEE Trans. Inf. Th. 38, 324 (1992).
[17] C. Hsiao, Identification, in Z. Griliches and M. Intriligator (eds), Handbook of Econometrics, Vol. I, Chapter 4, pp. 224-283 (Amsterdam, 1983).
[18] Z.-Y. Ran and B.-G. Hu, Parameter Identifiability in Statistical Machine Learning: A Review, Neural Computation, 29, 1 (2017).
[19] S. Wechsler, R. Izbicki, and L.G. Esteves, A Bayesian Look at Nonidentifiability: A Simple Example, The American Statistician, 67, 90-93 (2013).
[20] S. Watanabe, Almost All Learning Machines are Singular, in Proceedings of the 2007 IEEE Symposium on Foundations of Computational Intelligence (FOCI 2007).
[21] C.F. Manski, Partial Identification of Probability Distributions (Springer-Verlag, New York, 2003).
[22] E.S. Allman, C. Matias, and J.A. Rhodes, Identifiability of parameters in latent structure models with many observed variables, The Annals of Statistics, 37, 3099-3132 (2009).
[23] Y. Gu and G. Xu, Partial Identifiability of Restricted Latent Class Models, arXiv:1803.04353 (2018).
[24] J.J. Waterfall et al., Sloppy-Model Universality Class and the Vandermonde Matrix, Phys. Rev. Lett. 97, 150601 (2006).
[25] R.N. Gutenkunst et al., Universally Sloppy Parameter Sensitivities in Systems Biology Models, PLOS Comp. Biology, 3, 1871-1878 (2007).
[26] M.K. Transtrum, B.B. Machta, K.S. Brown, B.C. Daniels, C.R. Myers, and J.P. Sethna, Perspective: Sloppiness and emergent theories in physics, biology, and beyond, Journal of Chemical Physics, 143, 010901 (2015).
[27] S.K. Mishra, S.-Y. Wang and K.-K. Lai, Generalized Convexity and Vector Optimization (Springer-Verlag, Berlin, 2009).
[28] P.G. Bissiri, C.C. Holmes, and S.G. Walker, A general framework for updating belief distributions, Journal of the Royal Statistical Society B, 78, 1103-1130 (2016).
[29] C. Holmes and S. Walker, Assigning a value to a power likelihood in a general Bayesian model, Biometrika, 104, 497-503 (2017); also available at https://arxiv.org/abs/1701.08515.
[30] A. O'Hagan, Fractional Bayes factors for model comparison (with discussion), Journal of the Royal Statistical Society B, 57, 99-138 (1995).
[31] N. Friel and A.N. Pettitt, Marginal likelihood estimation via power posteriors, Journal of the Royal Statistical Society B, 70, 589-607 (2008).
[32] J.W. Miller and D.B. Dunson, Robust Bayesian Inference via Coarsening, Journal of the American Statistical Association, 114, 1113-1125 (2019).
[33] M. Mezard and A. Montanari, Information, Physics, and Computation (Oxford University Press, Oxford, 2009).
[34] C.H. LaMont and P.A. Wiggins, On the correspondence between thermodynamics and inference, Phys. Rev. E 99, 052140 (2019).
[35] A.E. Allahverdyan and N.H. Martirosyan, Free energy for non-equilibrium quasi-stationary states, EPL (Europhysics Letters) 117, 50004 (2017).
[36] Z. Hradil and J. Rehacek, Likelihood and entropy for statistical inversion, J. Phys.: Conf. Ser. 36, 55 (2006).
[37] Y.S. Teo, H. Zhu, B.-G. Englert, J. Rehacek, and Z. Hradil, Phys. Rev. Lett. 107, 020404 (2011).
[38] R.B. Nelsen, An Introduction to Copulas (Lecture Notes in Statistics, vol. 139, Springer-Verlag, Berlin, 1999).
[39] L. Cohen and Y.I. Zaparovanny, Positive quantum joint distributions, J. Math. Phys. 21, 794 (1980).
[40] P.D. Finch and R. Groblicki, Bivariate Probability Densities with Given Margins, Foundations of Physics, 14, 549 (1984).
[41] I.J. Good, Maximum entropy for hypothesis formulation, Ann. Math. Stat. 34, 911 (1963).
[42] S. Kullback, Probability densities with given marginals, Ann. Math. Stat. 39, 1236 (1968).
[43] A.W. Marshall and I. Olkin, Inequalities: Theory of Majorization and its Applications (Academic Press, New York, 1979).
[44] J.O. Berger and R.L. Wolpert, The Likelihood Principle (The Institute of Mathematical Statistics, Hayward, CA, 1988).
[45] M.J. Evans, D.S. Fraser and G. Monette, On principles and arguments to likelihood, The Canadian Journal of Statistics, 14, 181-199 (1986).
[46] E.T. Jaynes, Prior probabilities, IEEE Transactions on Systems Science and Cybernetics, 4, 227-241 (1968).
[47] E.T. Jaynes, Where do We Stand on Maximum Entropy, in The Maximum Entropy Formalism, edited by R.D. Levine and M. Tribus, pp. 15-118 (MIT Press, Cambridge, MA, 1978).
[48] B. Skyrms, Updating, supposing, and MaxEnt, Theory and Decision, 22, 225-246 (1987).
[49] S.J. van Enk, The Brandeis Dice Problem and Statistical Mechanics, Stud. Hist. Phil. Sci. B 48, 1-6 (2014).
[50] P. Cheeseman and J. Stutz, On the Relationship between Bayesian and Maximum Entropy Inference, in Bayesian Inference and Maximum Entropy Methods in Science and Engineering, edited by R. Fischer, R. Preuss, and U. von Toussaint (American Institute of Physics, Melville, NY, USA, 2004), pp. 445-461.
[51] C.F.J. Wu, On the convergence properties of the EM algorithm, The Annals of Statistics 11, 95-103 (1983).
[52] I.J. Good, Some Statistical Methods in Machine Intelligence Research, Mathematical Biosciences, 6, 185-208 (1970).
[53] R. Christensen, Entropy Minimax Multivariate Statistical Modeling I: Theory, International Journal of General Systems, 11, 231-277 (1985).
[54] S. Watanabe, Information-theoretical aspects of inductive and deductive inference, IBM Journal of Research and Development, 4, 208-231 (1960).
[55] A.E. Allahverdyan, A. Galstyan, A. Abbas, and Z. Struzik, Adaptive Decision Making via Entropy Minimization, International Journal of Approximate Reasoning, 103, 270-287 (2018).
[56] M. Kovacevic, I. Stanojevic, and V. Senk, On the Entropy of Couplings, Information and Computation, 242, 369-382 (2015).
[57] F. Cicalese, L. Gargano, and U. Vaccaro, How to find a joint probability distribution of minimum entropy (almost) given the marginals, 2017 IEEE International Symposium on Information Theory (ISIT). DOI: 10.1109/ISIT.2017.8006914
[58] L. Yu, Maximal Guessing Coupling and Its Applications, 2018 IEEE International Symposium on Information Theory (ISIT). DOI: 10.1109/ISIT.2018.8437344
[59] A. Teixeira, A. Matos, and L. Antunes, Conditional Renyi entropies, IEEE Transactions on Information Theory, 58, 4273-4277 (2012).
[60] S. Fehr and S. Berens, On the Conditional Renyi Entropy, IEEE Transactions on Information Theory, 60, 6801-6810 (2014).
[61] R. Xie, S. Deng, W. Deng, and A.E. Allahverdyan, Active image restoration, Phys. Rev. E 98, 052108 (2018).
[62] M. Ghosh, N. Reid, and D.A.S. Fraser, Ancillary statistics: A review, Statistica Sinica, 1309-1332 (2010).
[63] E.A. Pena, V.K. Rohatgi, and G.J. Szekely, On the non-existence of ancillary statistics, Statistics and Probability Letters, 15, 357-360 (1992).

Appendix A: Two alternative approaches

1. Bayesian approach to observational nonidentifiability

Here we outline a Bayesian approach to the observationally nonidentifiable model discussed in section II C. First of all, recall from (27) that p_gh(x, y) = p(x, y|g, h) is a conditional probability. Given an observation Y = y, we want to exclude the parameter h so as to gain information on the parameter g via a conditional probability p(g|y). To this end, we have to come up with prior probabilities for h and g. We make the simplest assumption that a priori they are independent, p(h, g) = p(h) p(g), and that the prior p(h) is noninformative. In the Bayesian approach this means that we have to take [46] (depending on the possible range of h)

\[ p(h)\propto 1 \quad {\rm if}\quad \infty>h>-\infty, \tag{A1} \]
\[ p(h)\propto 1/h \quad {\rm if}\quad \infty>h>0. \tag{A2} \]

Note that both prior probability densities (A1, A2) are improper, i.e. they are not normalizable. Improper priors can still lead to useful applications [46]. The first step in calculating p(g|y) is to study p(x, y|g) obtained from (27) and (A1):

\[ p(x,y|g)\ \overset{?}{\propto}\ \int {\rm d}h\ p(x,y|g,h)\,p(h). \tag{A3} \]

Now (A1, A2) and (27) show that the integral on the right-hand side of (A3) does not exist, i.e. the Bayesian approach with noninformative priors is blocked already at its first step. If a proper prior is available instead of (A1, A2), then the Bayesian approach does work. We do not dwell on this, since we assume that no prior information is available.
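The divergence of (A3) can also be seen numerically. Eq. (27) itself belongs to the main text; as an illustration one can take the binary toy model p(x, y|g, h) ∝ e^{x(g+yh)} with x, y = ±1, whose conditional and marginal structure is consistent with (A5) and (A7). This form is an assumption of the sketch below, not a quotation of (27). With the flat prior (A1), the integrand of (A3) for x = y = 1 tends to a nonzero constant as h → ∞, so the h-integral cannot converge:

```python
import numpy as np

def p_xy(x, y, g, h):
    """Toy binary model consistent with (A5) and (A7) (assumed, not quoted from (27)):
    x, y = +-1 and p(x, y | g, h) = exp[x(g + y*h)] / (4 cosh(g) cosh(h))."""
    return np.exp(x * (g + y * h)) / (4.0 * np.cosh(g) * np.cosh(h))

g = 0.7
for h in (5.0, 10.0, 20.0, 40.0):
    # under the flat prior (A1) this is the integrand of (A3) at x = y = 1;
    # it approaches exp(g) / (2 cosh(g)) != 0 as h grows, so the integral diverges
    print(h, p_xy(1, 1, g, h))
```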
2. Decision theory approach: attempts to build a maximin estimator for an observationally non-identifiable model

Following the basic tenets of the decision theory approach [2], we shall attempt to build a maximin estimator for the model discussed in section II C. The virtue of such an estimator is that its construction does not need prior probabilities for the unknown parameters [2]. Starting from (27, 28) and assuming that y is observed, we construct

\[ {\rm dist}\big[p_{\hat g(y)\hat h(y)}(x|y),\,p_{gh}(x|y)\big]=1-\sum_x\sqrt{p_{\hat g(y)\hat h(y)}(x|y)\,p_{gh}(x|y)} \tag{A4} \]
\[ =1-\frac{\cosh\big[\tfrac{g+\hat g(y)}{2}+\tfrac{y\,(h+\hat h(y))}{2}\big]}{\sqrt{\cosh[g+yh]\,\cosh[\hat g(y)+y\hat h(y)]}}, \tag{A5} \]

where we choose to work with Hellinger's distance, and where ĝ(y) and ĥ(y) are estimators of (respectively) g and h given the observation y. Together with (A4), one also employs the distance averaged over observations [2]:

\[ \sum_y p_{gh}(y)\,{\rm dist}\big[p_{\hat g(y)\hat h(y)}(x|y),\,p_{gh}(x|y)\big] \tag{A6} \]
\[ =1-\frac{\cosh u}{\cosh u+\cosh v}\,\frac{\cosh[(u+\tilde u)/2]}{\sqrt{\cosh u\,\cosh\tilde u}}-\frac{\cosh v}{\cosh u+\cosh v}\,\frac{\cosh[(v+\tilde v)/2]}{\sqrt{\cosh v\,\cosh\tilde v}}, \tag{A7} \]

where in (A7) we defined

\[ u\equiv h+g>0,\qquad v\equiv h-g,\qquad \tilde u\equiv \hat h(1)+\hat g(1)>0,\qquad \tilde v\equiv \hat h(-1)-\hat g(-1). \tag{A8} \]

Note that the constraints u > 0 and ũ > 0 come from our assumption g > 0 and h > 0. The maximin estimator takes the worst case (i.e. the maximal distance) with respect to the unknown parameters g and h, and then minimizes this worst case over the estimators ĥ(y) and ĝ(y) [2]. In principle, this procedure can be applied to either (A5) or (A7). We shall start by applying it to (A5). We note from (A5, A8):

\[ 1-{\rm dist}\big[p_{\hat g(y)\hat h(y)}(x|1),\,p_{gh}(x|1)\big]=\frac{\cosh[(u+\tilde u)/2]}{\sqrt{\cosh u\,\cosh\tilde u}}, \tag{A9} \]
\[ 1-{\rm dist}\big[p_{\hat g(y)\hat h(y)}(x|-1),\,p_{gh}(x|-1)\big]=\frac{\cosh[(v+\tilde v)/2]}{\sqrt{\cosh v\,\cosh\tilde v}}. \tag{A10} \]

The first step amounts to maximizing the distance over the unknown g and h, i.e. over u > 0 and v:

\[ \min_{u>0}\frac{\cosh[(u+\tilde u)/2]}{\sqrt{\cosh u\,\cosh\tilde u}}=\frac{e^{\tilde u/2}}{\sqrt{2\cosh\tilde u}}\quad{\rm for}\quad \tilde u<-\ln[\sqrt 2-1], \tag{A11} \]
\[ \min_{u>0}\frac{\cosh[(u+\tilde u)/2]}{\sqrt{\cosh u\,\cosh\tilde u}}=\frac{\cosh[\tilde u/2]}{\sqrt{\cosh\tilde u}}\quad{\rm for}\quad \tilde u>-\ln[\sqrt 2-1], \tag{A12} \]
\[ \min_{v}\frac{\cosh[(v+\tilde v)/2]}{\sqrt{\cosh v\,\cosh\tilde v}}=\frac{e^{-|\tilde v|/2}}{\sqrt{2\cosh\tilde v}}, \tag{A13} \]

where the last relation is deduced for v → ±∞, depending on the sign of ṽ. At the second step we should minimize the distance over the estimators, i.e. (A11, A12) are to be maximized over ũ, while (A13) is to be maximized over ṽ. This step is supposed to define those estimators. We get from (A11–A13):

\[ \tilde u=0,\ \infty, \tag{A14} \]
\[ \tilde v=0, \tag{A15} \]

where the two values 0 and ∞ for ũ in (A14) come from (respectively) (A11) and (A12). While ṽ = 0 in (A15) seems a reasonable (though incomplete) value for the estimator, neither ũ = 0 nor ũ = ∞ is meaningful. Hence the maximin strategy applied to (A5) does not lead to sensible estimators. When applying the strategy to (A7), we note that (A13) ≤ (A11) and (A13) ≤ (A12) for all allowed values of ũ and ṽ. Hence we find

\[ \min_{u>0,\,v}\Big[\frac{\cosh u}{\cosh u+\cosh v}\,\frac{\cosh[(u+\tilde u)/2]}{\sqrt{\cosh u\,\cosh\tilde u}}+\frac{\cosh v}{\cosh u+\cosh v}\,\frac{\cosh[(v+\tilde v)/2]}{\sqrt{\cosh v\,\cosh\tilde v}}\Big]=\frac{e^{-|\tilde v|/2}}{\sqrt{2\cosh\tilde v}}, \tag{A16} \]

where the last relation is again deduced for v → ±∞. The maximization of (A16) over ṽ brings us back to (A15). Again nothing reasonable is produced for ũ: the maximin method does not work for this example.
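The limits (A11)–(A13) and the resulting estimate (A15) are easy to verify numerically by evaluating the affinity at a large but finite |v|, which emulates the v → ±∞ worst case; the grid and the value |v| = 50 below are illustrative choices:

```python
import numpy as np

def affinity(v, ev):
    """1 - dist of Eq. (A10): Hellinger affinity of the y = -1 branch."""
    return np.cosh((v + ev) / 2.0) / np.sqrt(np.cosh(v) * np.cosh(ev))

ev_grid = np.linspace(-3.0, 3.0, 601)
# worst case over the unknown parameter: the infimum over v is approached as v -> +-inf
worst = np.array([min(affinity(50.0, ev), affinity(-50.0, ev)) for ev in ev_grid])
closed_form = np.exp(-np.abs(ev_grid) / 2.0) / np.sqrt(2.0 * np.cosh(ev_grid))

print(np.max(np.abs(worst - closed_form)))  # ~0 : agrees with Eq. (A13)
print(ev_grid[np.argmax(worst)])            # 0.0 : the maximin value of Eq. (A15)
```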
Appendix B: Maximization of L_β under two constraints

Consider the maximization of L_β given by (44) under the two constraints [cf. section IV B]

\[ \sum_x \hat p(x,y)=p(y),\qquad \sum_{x,y} E(x,y)\,\hat p(x,y)=E, \tag{B1} \]

where E(x, y) is a function of x and y with a known average E. The Lagrange function reads

\[ L_\beta=\frac{1}{\beta}\sum_y p(y)\ln\Big[\sum_x\hat p^{\,\beta}(x,y)\Big]-\sum_{x,y}\delta(y)\,\hat p(x,y)-\gamma\sum_{x,y}E(x,y)\,\hat p(x,y), \tag{B2} \]

where δ(y) and γ refer to (respectively) the first and the second constraint in (B1). Now ∂L_β/∂p̂(x, y) = 0 leads to

\[ \frac{p(y)\,\hat p^{\,\beta-1}(x,y)}{\sum_{\bar x}\hat p^{\,\beta}(\bar x,y)}=\delta(y)+\gamma E(x,y), \tag{B3} \]

which is solved as

\[ \hat p(x,y)=p(y)\,\frac{[\,\delta(y)+\gamma E(x,y)\,]^{\frac{1}{\beta-1}}}{\sum_{\bar x}[\,\delta(y)+\gamma E(\bar x,y)\,]^{\frac{\beta}{\beta-1}}}. \tag{B4} \]

Here δ(y) and γ are found from (respectively) the first and the second constraint in (B1). Eventually, γ can be expressed via δ(y), which is found from (B6):

\[ \gamma=\frac{1}{E}\Big(1-\sum_y p(y)\,\delta(y)\Big), \tag{B5} \]
\[ \sum_x[\,\delta(y)+\gamma E(x,y)\,]^{\frac{1}{\beta-1}}=\sum_x[\,\delta(y)+\gamma E(x,y)\,]^{\frac{\beta}{\beta-1}}. \tag{B6} \]

Appendix C: The relevance of various constraints in the maximum entropy method

The maximum entropy method addresses the problem of recovering the unknown probabilities {q(z_k)}_{k=1}^n of a random variable Z = (z_1, ..., z_n) via maximization of the entropy

\[ S[q]=-\sum_{k=1}^{n} q(z_k)\ln q(z_k), \tag{C1} \]

subject to certain constraints on q and Z [46–50]. These constraints are assumed to come as prior information; within its standard formulation the method does not determine the type and the number of those constraints. The only (obvious) requirement of the method is that the result of the maximization is unique. The intuitive rationale of the method is that it provides the most unbiased choice of probability compatible with the constraints. One way of recovering the constraints is to look at (necessarily noisy) data. If this way is followed in detail, it can give some recommendations on selecting the constraints, or at least on determining their relative relevance. Below we present some preliminary results to this effect. Since the results are preliminary, we shall not attempt to generalize them towards the likelihood L_{β<1}.

A standard way of checking an inference method is to assume that the true probabilities are known. Hence we start by assuming that we know the probabilities {q(z_k)}_{k=1}^n of Z = (z_1, ..., z_n). From {q(z_k)}_{k=1}^n we generate a finite i.i.d. sample

\[ S_M=(z_{i_1},\dots,z_{i_M}) \tag{C2} \]

of length M. Various constraints are now to be recovered from (C2). Here are several examples.

– We can apply no constraint at all and just maximize the entropy:

\[ q^{[0]}(z_k)=\frac{1}{n}. \tag{C3} \]

– After calculating the empirical mean of (C2),

\[ \mu=\frac{1}{M}\sum_{u=1}^{M} z_{i_u}, \tag{C4} \]

we can take it as an estimate of the true average Σ_{k=1}^n q(z_k) z_k, and recover approximate probabilities {q^{[1]}(z_k)}_{k=1}^n via maximizing (C1) subject to the constraint Σ_{k=1}^n q^{[1]}(z_k) z_k = µ. It is well known [46, 47] that this maximization leads to

\[ q^{[1]}(z_k)=\frac{e^{-\beta z_k}}{\sum_l e^{-\beta z_l}}, \tag{C5} \]

where β is determined from Σ_{k=1}^n q^{[1]}(z_k) z_k = µ.
– The empirical mean is certainly not the only information contained in the sample; e.g. one can estimate the second moment as well,

\[ \mu_2=\frac{1}{M}\sum_{u=1}^{M} z_{i_u}^2, \tag{C6} \]

and maximize (C1) under the two constraints (C4) and (C6):

\[ q^{[1+2]}(z_k)=\frac{e^{-\beta_1 z_k-\beta_2 z_k^2}}{\sum_l e^{-\beta_1 z_l-\beta_2 z_l^2}}, \tag{C7} \]

where β_1 and β_2 are determined from Σ_{k=1}^n q^{[1+2]}(z_k) z_k = µ and Σ_{k=1}^n q^{[1+2]}(z_k) z_k^2 = µ_2.

– It is standard statistical lore that for relatively short samples the empirical median is a better (more robust) estimator than the empirical mean. Thus we should pay attention to the median as a constraint in the entropy maximization. Recalling the definition of the median Md for given (discrete-variable) probabilities, the maximum of (C1) under a fixed median is made obvious by the following example for n = 4 (assuming for simplicity that z_1 < z_2 < z_3 < z_4):

\[ {\rm Md}=z_1:\quad \arg\max_q S[q]=\tfrac{1}{2}\Big(1+\epsilon,\ \tfrac{1-\epsilon}{3},\ \tfrac{1}{3},\ \tfrac{1}{3}\Big), \tag{C8} \]
\[ {\rm Md}=z_2:\quad \arg\max_q S[q]=\tfrac{1}{2}\Big(\tfrac{1}{2},\ \tfrac{1+\epsilon}{2},\ \tfrac{1-\epsilon}{2},\ \tfrac{1}{2}\Big), \tag{C9} \]
\[ {\rm Md}=z_3:\quad \arg\max_q S[q]=\tfrac{1}{2}\Big(\tfrac{1}{3},\ \tfrac{1}{3},\ \tfrac{1+\epsilon}{3},\ 1-\epsilon\Big), \tag{C10} \]
\[ {\rm Md}=z_4:\quad \arg\max_q S[q]=\tfrac{1}{2}\Big(\tfrac{1}{3},\ \tfrac{1}{3},\ \tfrac{1-\epsilon}{3},\ 1+\epsilon\Big), \tag{C11} \]

where ε > 0 is infinitesimally small. We keep it to confirm that the median is indeed equal to its fixed value, but ε can be neglected in actual calculations. Eqs. (C8–C11) easily generalize to an arbitrary finite n. Now the median Md is estimated from the finite sample (as an empirical median), and the maximum entropy probabilities recovered according to (C8–C11) will be denoted by {q^{[md]}(z_k)}_{k=1}^n.

We can calculate how close the above estimates are to the true probabilities q = {q(z_k)}_{k=1}^n:

\[ d_0={\rm dist}[q^{[0]},q],\quad d_1={\rm dist}[q^{[1]},q],\quad d_{1+2}={\rm dist}[q^{[1+2]},q],\quad d_{\rm md}={\rm dist}[q^{[{\rm md}]},q], \tag{C12} \]

where dist[·, ·] can be e.g. the Hellinger distance:

\[ {\rm dist}[q^{[0]},q]\equiv 1-\sum_{k=1}^{n}\sqrt{q(z_k)\,q^{[0]}(z_k)}. \tag{C13} \]

Besides d_0, the quantities defined in (C12) are random variables together with the sample (C2). Hence we average them over many (in practice 10^4; see Table III) independently generated samples, keeping the sample length M fixed. The averaged quantities will be denoted as

\[ \langle d_1\rangle,\qquad \langle d_{1+2}\rangle,\qquad \langle d_{\rm md}\rangle. \tag{C14} \]

Together with d_0, they depend on q = {q(z_k)}_{k=1}^n. Besides the quantities in (C14), we also study their averages over q = {q(z_k)}_{k=1}^n: we randomly generate N probability vectors (the mechanism for this is discussed in section V of the main text) and average ⟨d_1⟩, ⟨d_{1+2}⟩, ⟨d_md⟩, and d_0 over them. The results are denoted by

\[ \bar d_1,\qquad \bar d_{1+2},\qquad \bar d_{\rm md},\qquad \bar d_0. \tag{C15} \]

Table III presents a numerical illustration of the quantities defined in (C12–C15). It is seen that when M is larger than, but comparable to, n (M = 7 and n = 6 in Table III), the situation is so noisy that the samples do not provide information from the viewpoint of the constraints studied (this does not mean that short samples provide no information whatsoever, only that the proper mechanism of extracting information from such samples is yet to be found). This means that the no-constraint solution (C3) is always better, because in the majority of cases we get min[⟨d_1⟩, ⟨d_{1+2}⟩, ⟨d_md⟩, d_0] = d_0, and because min[d̄_1, d̄_{1+2}, d̄_md, d̄_0] = d̄_0. For such values of M, employing the above constrained solutions just amounts to an overfitting (of noise). For a larger M (M = 11 and n = 6 in Table III), we see that (C5) is the best constraint in one sense, since now the solution (C5) provides a smaller average distance from the true solution: min[d̄_1, d̄_{1+2}, d̄_md, d̄_0] = d̄_1.
However, in the second sense (C3) is still better, because the percentage of cases where min[⟨d_1⟩, ⟨d_{1+2}⟩, ⟨d_md⟩, d_0] = d_0 is still the largest. Applying the median solution or the second-order solution (C7) leads to worse results. Table III shows that the solution based on the median is always worse than some of the other solutions. The second-order solution (C7) becomes the best solution for M ≥ 21; see Table III. This holds in terms of the average distance, min[d̄_1, d̄_{1+2}, d̄_md, d̄_0] = d̄_{1+2}, and also in terms of the percentage of cases where min[⟨d_1⟩, ⟨d_{1+2}⟩, ⟨d_md⟩, d_0] = ⟨d_{1+2}⟩. Increasing M further just confirms this trend, i.e. improves the quality of (C7) in both senses. Interestingly, the percentage of cases where min[⟨d_1⟩, ⟨d_{1+2}⟩, ⟨d_md⟩, d_0] = ⟨d_1⟩ is relatively stable for larger values of M: in more than 1/6 of the cases the first-order solution (C5) is still better than the other solutions, even for M = 101; see Table III.

Our (preliminary) conclusions are summarized as follows. (i) The median is not a relevant constraint for the maximum entropy method; it is never better than the average. (ii) The average-constraint solution does overfit for short samples (M ≃ n), where having no constraints whatsoever is better than fixing the average. (iii) For sufficiently long samples, fixing the first and second moments outperforms the other solutions, but the average constraint stays reasonable even for larger sample lengths.

M     % min = ⟨d_1⟩   % min = ⟨d_{1+2}⟩   % min = ⟨d_md⟩   % min = d_0    d̄_1        d̄_{1+2}    d̄_md       d̄_0
7     18              4                   8                70             0.06194    0.07713    0.06771    0.05535*
11    25              18                  7                50             0.05548*   0.05829    0.06212    0.05656
21    24              41                  2                33             0.04731    0.04421*   0.05894    0.05350
31    27              45                  1                27             0.04583    0.04125*   0.05935    0.05520
41    29              51                  2                18             0.05091    0.04302*   0.06311    0.05970
61    24              64                  2                10             0.04628    0.03531*   0.05459    0.05430
101   18              71                  0                11             0.04296    0.03567*   0.05519    0.05179

TABLE III: For the dice setup n = 6 and z_k = k (k = 1, ..., 6) we show the various quantities defined above and below (C12). dist[·, ·] in (C12) was chosen to be the Hellinger distance. M is the sample length. The average in (C14) is taken over 10^4 independently generated samples; the average in (C15) is taken over 100 realizations of the probabilities q, generated via the method described in section V of the main text. "% min = ⟨d_1⟩" means the percentage of cases with min[⟨d_1⟩, ⟨d_{1+2}⟩, ⟨d_md⟩, d_0] = ⟨d_1⟩ among those 100 realizations; e.g. "% min = ⟨d_1⟩" equal to 18 means that in 18 cases out of 100 we got min[⟨d_1⟩, ⟨d_{1+2}⟩, ⟨d_md⟩, d_0] = ⟨d_1⟩. In each row, the minimal value among d̄_1, d̄_{1+2}, d̄_md, and d̄_0 is marked by an asterisk.
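The constrained solutions (C5) and (C7) that enter the comparison of Table III can be obtained by solving for the Lagrange multipliers numerically. A minimal sketch for the dice setup of Table III (the function names and the use of scipy's fsolve are illustrative choices):

```python
import numpy as np
from scipy.optimize import fsolve

z = np.arange(1, 7)            # dice faces, n = 6 and z_k = k as in Table III

def maxent_mean(mu):
    """q[1] of Eq. (C5): maximize the entropy (C1) subject to <Z> = mu."""
    def gap(beta):
        w = np.exp(-beta * z)
        return np.dot(w / w.sum(), z) - mu
    beta = fsolve(gap, x0=0.0)[0]
    w = np.exp(-beta * z)
    return w / w.sum()

def maxent_two_moments(mu, mu2):
    """q[1+2] of Eq. (C7): maximize (C1) subject to <Z> = mu and <Z^2> = mu2."""
    def gap(b):
        w = np.exp(-b[0] * z - b[1] * z ** 2)
        q = w / w.sum()
        return [np.dot(q, z) - mu, np.dot(q, z ** 2) - mu2]
    b = fsolve(gap, x0=[0.0, 0.0])
    w = np.exp(-b[0] * z - b[1] * z ** 2)
    return w / w.sum()

# sanity check: for an unbiased die, mu = 3.5 and mu2 = 91/6, and both
# constrained solutions reduce to the no-constraint answer (C3), q = 1/6
print(maxent_mean(3.5))
print(maxent_two_moments(3.5, 91.0 / 6.0))
```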
Appendix D: Generalized Schur-convexity

We briefly review the implications of generalized Schur-convexity [43] for maximizing functions similar to (71). Though we were not able to show that (71) is generalized Schur-convex, numerical results show that the Schur-convex maximizers provide a good description of the local maxima of (71). Let us be given a differentiable function Φ(x; u) of two vectors, x = (x_1, ..., x_M) and u = (u_1, ..., u_M). Both vary on compact subsets of R^M, and u_k ≠ 0 for all k = 1, ..., M. Let us assume that Φ(x; u) is Schur-convex [43]:

\[ (x_k-x_l)\Big(\frac{1}{u_k}\frac{\partial\Phi}{\partial x_k}-\frac{1}{u_l}\frac{\partial\Phi}{\partial x_l}\Big)\ge 0,\qquad k,l=1,\dots,M. \tag{D1} \]

Let D be the set of vectors ordered as x_1 ≥ ... ≥ x_M. Denote

\[ z_\ell\equiv\sum_{k=1}^{\ell}u_k x_k,\qquad \ell=1,\dots,M, \tag{D2} \]

and note that Φ(x; u) can be written as

\[ \Phi(x;u)=\Phi\Big(\frac{z_1}{u_1},\,\frac{z_2-z_1}{u_2},\,\dots,\,\frac{z_M-z_{M-1}}{u_M};\ u\Big)\equiv\widetilde\Phi(z_1,\dots,z_M). \tag{D3} \]

Now for x ∈ D, Φ̃(z_1, ..., z_M) is a non-decreasing function of z_1, ..., z_{M−1}, because then (D1) reduces to

\[ \frac{1}{u_k}\frac{\partial\Phi}{\partial x_k}-\frac{1}{u_{k+1}}\frac{\partial\Phi}{\partial x_{k+1}}\ge 0,\qquad k=1,\dots,M-1, \tag{D4} \]

and then (D4) implies

\[ \frac{\partial\widetilde\Phi}{\partial z_k}\ge 0\quad{\rm for}\quad k=1,\dots,M-1\ \ {\rm and}\ \ x\in D. \tag{D5} \]

Let us denote by A the set of all vectors x that satisfy

\[ \sum_{k=1}^{M}u_k x_k=1. \tag{D6} \]

Eq. (D5) shows how to maximize Φ(x; u) over x ∈ A ∩ D. First consider two vectors, x ∈ D ∩ A and y ∈ D ∩ A. Eq. (D2) and conditions (D5) imply that if

\[ \sum_{k=1}^{\ell}u_k x_k\ \ge\ \sum_{k=1}^{\ell}u_k y_k,\qquad \ell=1,\dots,M-1, \tag{D7} \]

then

\[ \Phi(x;u)\ \ge\ \Phi(y;u). \tag{D8} \]

Eqs. (D1, D7) refer to the concept of u-majorization, while Φ(x; u) in (D8) is a u-Schur-convex function [43]. Now argmax_{x∈A∩D} Φ(x; u) is found as follows: one first finds max_{x∈A∩D}[u_1 x_1]. Then, taking this maximized value as a condition, one obtains max_{x∈A∩D}[u_2 x_2]; then, under the two previous conditions, one finds max_{x∈A∩D}[u_3 x_3], and so on.

We generalize the above reasoning by taking instead of D any other ordering: x ∈ D_π means that x_{π_1} ≥ ... ≥ x_{π_M}, where π is a certain permutation of the indices 1, ..., M. Conditions (D1, D6) stay unchanged. To obtain x_* ≡ argmax_{x∈A} Φ(x; u) under (D1) and (D6) (i.e. without imposing any condition x ∈ D_π for a specific π), we optimize the above construction over all possible D_π. Thus one first finds

\[ \max_{1\le k\le M}\ \max_{x\in A}\ [\,u_k x_k\,]. \tag{D9} \]

If this maximum is reached at a certain value x_{*j_1} of x_{j_1}, then one looks at

\[ \max_{1\le k\le M,\ k\neq j_1}\ \max_{x\in A}\ [\,u_k x_k\,]. \tag{D10} \]

If the maximum in (D10) is reached at a certain value x_{*j_2} of x_{j_2}, then the next maximization excludes both j_1 and j_2, and so on, until all elements of x_* ∈ A are found.

Returning to the problem stated by the maximization of (71), we note that the index k in (D4, D6) corresponds to the double index (x, y), with M = nm, while x and u refer to {p̂(x|y)}_{x,y} and {p(y)}_{x,y}, respectively. Eq. (D6) then holds due to normalization. Likewise, A is defined by the relevant constraints, e.g. by (47). But conditions (D1) for L_{β>1} do not hold, since the left-hand side of (D1) amounts to

\[ [\hat p(x|y)-\hat p(x'|y')]\Big(\frac{\hat p^{\,\beta-1}(x|y)}{\sum_{\bar x}\hat p^{\,\beta}(\bar x|y)}-\frac{\hat p^{\,\beta-1}(x'|y')}{\sum_{\bar x}\hat p^{\,\beta}(\bar x|y')}\Big), \tag{D11} \]

which is generally not nonnegative. In contrast, the negative average entropy Σ_y p(y) Σ_x p̂(x|y) ln p̂(x|y) does satisfy (D1):

\[ [\hat p(x|y)-\hat p(x'|y')]\,[\ln\hat p(x|y)-\ln\hat p(x'|y')]\ \ge\ 0. \tag{D12} \]

Appendix E: Maximization of L_{β>1} for known marginals of X and Y, illustrated via examples

There are infinitely many joint probabilities p̂(x, y) with given marginals p(y) and p(x) [38–40]. One can ask about the simplest joint probabilities compatible with given marginals [41, 42]. Such a probability can be employed as a null hypothesis and serve as a starting point for further approximations.
It is well known that maximum-entropy reasoning leads to the factorized joint probability p̂(x, y) = p(x) p(y) [41], which we also obtained from maximizing L_{β<1}; see section IV A. Below we show numerically that the maximization of L_{β>1} leads to a different and unique prediction for p̂(x, y) that agrees with (73, 74). Hence it agrees with minimizing the joint entropy (72) under the constraint of given marginals. This is a well-known problem, because (for given marginals) it is equivalent to maximizing the mutual information between X and Y; see [56–58] for recent discussions.

Let us assume that both X and Y take 3 values, (x_1, x_2, x_3) and (y_1, y_2, y_3), respectively. Here is an example of the (global) maximizer of (71), presented in the form of (47) with the numerical values of p̂(x|y) written in bold:

\[ (p(y_1),\,p(y_2),\,p(y_3))=(0.1,\ 0.3,\ 0.6), \tag{E1} \]
\[ p(x_1)=0.55={\bf 0}\times 0.1+{\bf 0}\times 0.3+{\bf\tfrac{55}{60}}\times 0.6, \tag{E2} \]
\[ p(x_2)=0.25={\bf 0}\times 0.1+{\bf\tfrac{25}{30}}\times 0.3+{\bf 0}\times 0.6, \tag{E3} \]
\[ p(x_3)=0.20={\bf 1}\times 0.1+{\bf\tfrac{5}{30}}\times 0.3+{\bf\tfrac{5}{60}}\times 0.6. \tag{E4} \]

Eqs. (E2–E4) follow (73, 74). First one finds p̂(x_1|y_3) = 55/60, since this provides the largest possible value of the joint probability: p̂(x_1, y_3) = 0.55. Due to (E2) this already sets p̂(x_1|y_1) = p̂(x_1|y_2) = 0. Next, one finds p̂(x_2|y_2) = 25/30, since this provides the second-largest value of the joint probability, p̂(x_2, y_2) = 0.25, also enforcing p̂(x_2|y_1) = p̂(x_2|y_3) = 0. The remaining p̂(x|y) in (E4) are recovered from normalization. It is seen that the p̂(x|y) given in (E2–E4) do have the maximal number of zeroes (4 for the considered case n = m = 3) allowed by (47), i.e. the maximizers of L_{β>1} are located at vertices of the convex domain (47).

The second example is dealt with in the same way, with p̂(x_1|y_1) = 4/9 being the first step and p̂(x_3|y_2) = p̂(x_3|y_3) = 1 amounting to the last step:

\[ (p(y_1),\,p(y_2),\,p(y_3))=(0.9,\ 0.06,\ 0.04), \tag{E5} \]
\[ p(x_1)=0.4=\tfrac{4}{9}\times 0.9+0\times 0.06+0\times 0.04, \tag{E6} \]
\[ p(x_2)=0.35=\tfrac{35}{90}\times 0.9+0\times 0.06+0\times 0.04, \tag{E7} \]
\[ p(x_3)=0.25=\tfrac{15}{90}\times 0.9+1\times 0.06+1\times 0.04. \tag{E8} \]

The maximizers (but not the value of L_β) do not depend on β, provided that β > 1. However, we noted that for β → ∞ the global maximum of L_β is difficult to reach numerically, since it is plagued by many local maxima. Hence employing moderate values of β (e.g. β = 2) can be beneficial for finding the global maximum numerically. This point can be illustrated by comparing the global maximizer (E6–E8) with

\[ (p(y_1),\,p(y_2),\,p(y_3))=(0.9,\ 0.06,\ 0.04), \tag{E9} \]
\[ p(x_1)=0.4=\tfrac{4}{9}\times 0.9+0\times 0.06+0\times 0.04, \tag{E10} \]
\[ p(x_2)=0.35=\tfrac{29}{90}\times 0.9+1\times 0.06+0\times 0.04, \tag{E11} \]
\[ p(x_3)=0.25=\tfrac{21}{90}\times 0.9+0\times 0.06+1\times 0.04. \tag{E12} \]

Both (E6–E8) and (E10–E12) produce the same value of L_{β→∞}, because

\[ L_{\beta\to\infty}=\sum_y p(y)\ln\big[\hat p(\tilde x(y)|y)\big]+\sum_y p(y)\ln p(y),\qquad \tilde x(y)\equiv\arg\max_x\,[\,\hat p(x|y)\,]. \tag{E13} \]

Indeed, both (E6–E8) and (E10–E12) have the same values of p̂(x̃(y)|y).
Even though the global maximum of L_{β>1} may be difficult to reach numerically, we noted that the numerically reachable local maxima also have the same (i.e. maximal) number of zeros.
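As a small consistency check of the entropy-minimization interpretation of section VI, one can compare the joint entropy (72) of the maximizer (E2–E4) with that of the factorized table p(x)p(y) obtained from L_{β<1}; the array layout below (rows = x, columns = y) is just a convention of this sketch:

```python
import numpy as np

def joint_entropy(p):
    """Joint entropy (72) of a joint probability table, in nats."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

p_y = np.array([0.1, 0.3, 0.6])                 # Eq. (E1)
p_x = np.array([0.55, 0.25, 0.20])              # Eqs. (E2)-(E4), left-hand sides

# joint table hat_p(x, y) implied by the maximizer (E2)-(E4)
maximizer = np.array([[0.00, 0.00, 0.55],
                      [0.00, 0.25, 0.00],
                      [0.10, 0.05, 0.05]])

factorized = np.outer(p_x, p_y)                 # the L_{beta<1} / maximum-entropy answer

print(joint_entropy(maximizer))                 # ~1.21 nats
print(joint_entropy(factorized))                # ~1.90 nats = S[X] + S[Y]
```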
