On Expectation Propagation and the Probabilistic Editor in some simple mixture problems

Nils Lid Hjort¹ and Donald Michael Titterington²

¹ Department of Mathematics, University of Oslo
² School of Mathematics & Statistics, University of Glasgow

October 2010

[Footnote: Mike Titterington passed away in 2023, at the age of 77; this is the October 2010 version of a paper we collaborated on then and (still) planned to extend before submitting to a journal.]

Abstract

As for other latent-variable problems, exact Bayesian analysis is typically not practicable for mixture problems, and approximate methods have been developed. Variational Bayes tends to produce approximate posterior distributions for parameters that are too tightly concentrated, in having variances that are too small. The paper identifies a few mixture problems in which Expectation Propagation and variations thereof lead to approximate posterior distributions that asymptotically exhibit 'correct' variances and therefore stand to provide reliable interval estimates for the unknown parameter or parameters.

Some key words: Assumed Density Filtering; Expectation Propagation; Kullback-Leibler; Mixture distributions; Probabilistic Editor; Recursive estimation; Variational Bayes.

1 Introduction

Bayesian analysis is typically straightforward if the model from which the data are generated is compatible with a conjugate family of prior distributions; this scenario obtains if the model belongs to an exponential family. However, if there are latent or missing variables this ideal scenario is not available, even though the model corresponding to the case in which the latent variables are actually observed may be amenable in the sense described above. In such cases exact calculation of the relevant posterior distributions and predictive distributions is not available and some form of approximation becomes inevitable.

Suppose that we have data of the form $x^{(n)} := \{x_1, \ldots, x_n\}$, where $n$ denotes the sample size, that the parameters in the model are denoted by $\theta$ and the corresponding latent variables are denoted by $z^{(n)} := \{z_1, \ldots, z_n\}$. The posterior distribution of interest is formally given by
$$p(\theta \mid x^{(n)}) \propto p(x^{(n)} \mid \theta)\, p(\theta),$$
where $p(\theta)$ on the right-hand side is the prior density for $\theta$. In this paper we shall assume that the data are independently distributed, so that we can write
$$p(\theta \mid x^{(n)}) \propto t_0(\theta) \prod_{i=1}^n t_i(\theta),$$
in which $t_0(\theta) = p(\theta)$ and $t_i(\theta) = f(x_i \mid \theta)$, the probability density or mass function for the $i$th observation, for $i = 1, \ldots, n$.

One 'approximate' approach is to simulate, typically by some Markov chain Monte Carlo technique, a large number of realizations from the joint conditional distribution $p(\theta, z^{(n)} \mid x^{(n)})$. The resulting realizations of $\theta$ can be regarded as a sample from $p(\theta \mid x^{(n)})$ and therefore as a source of empirical approximations of features of interest of the true $p(\theta \mid x^{(n)})$. Similarly, the realizations of the latent variables provide an approximation to $p(z^{(n)} \mid x^{(n)})$. In principle, if enough realizations of $(\theta, z^{(n)})$ are generated and if the Markov chains that created them can be guaranteed to have converged, the empirical approximations should be arbitrarily close to the desired distributions.
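For concreteness, the following is a minimal sketch, our own and not taken from the paper, of such a data-augmentation Gibbs sampler for the two-known-densities mixture studied later in Section 4; it alternates draws of the latent labels given $\beta$ and of $\beta$ given the labels. The component densities and the prior parameters in the example are illustrative choices.

```python
import numpy as np
from scipy import stats

def gibbs_mixing_weight(x, f1, f2, a0=1.0, b0=1.0, sweeps=5000, seed=0):
    """Data-augmentation Gibbs sampler for beta in f = beta*f1 + (1-beta)*f2.

    Alternates z | beta, x (componentwise Bernoulli draws) and
    beta | z ~ Be(a0 + #(z=1), b0 + #(z=2)).  Illustrative sketch only.
    """
    rng = np.random.default_rng(seed)
    d1, d2 = f1(x), f2(x)                      # known densities at the data
    beta = rng.beta(a0, b0)
    draws = np.empty(sweeps)
    for s in range(sweeps):
        p1 = beta * d1 / (beta * d1 + (1 - beta) * d2)
        z1 = rng.random(x.size) < p1           # z_i = 1 with probability p1_i
        beta = rng.beta(a0 + z1.sum(), b0 + (~z1).sum())
        draws[s] = beta
    return draws

# Illustrative run: f1 = N(0,1), f2 = N(3,1), true beta = 0.3.
rng = np.random.default_rng(1)
n = 500
z = rng.random(n) < 0.3
x = np.where(z, rng.normal(0, 1, n), rng.normal(3, 1, n))
draws = gibbs_mixing_weight(x, stats.norm(0, 1).pdf, stats.norm(3, 1).pdf)
print(draws[1000:].mean(), draws[1000:].std())  # posterior mean and sd of beta
```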
However, concerns about the convergence issue and the complexity of some latent-variable scenarios have led to the development of deterministic approximations in nontrivial Bayesian contexts. One approach is to derive a so-called variational approximation, $q$, chosen to be as close as possible to the true target distribution, $p$, in terms of the Kullback-Leibler directed divergence
$$\mathrm{KL}(q, p) = \int q \log(q/p),$$
where the 'integral' is over $\theta$ and $z^{(n)}$ and has a summation component if, as is often the case, the latent variables are discrete. To make the above optimization practicable some constraints have to be imposed on $q$, typically that it factorizes into a product of a function $q_\theta$ of $\theta$ and a function $q_z$ of $z^{(n)}$. It turns out that if, were the latent variables known, the complete-data model came from an exponential family and a conjugate prior were used, then $q_\theta$, which represents the approximation to the intractable posterior distribution of $\theta$, would also belong to the conjugate family; see for example Beal and Ghahramani (2003). An EM-type algorithm is usually employed to calculate the appropriate hyperparameters that complete the specification of the approximating distribution.

While this approach leads to practicable deterministic approximations, the fact remains that the true posterior distribution is not a member of the conjugate family; rather, it is a complicated mixture of such distributions, so that the gap between the true and approximate functions cannot be made arbitrarily small, for finite sample size, unlike the case for the simulation-based method. However, there may be some hope that the variational method will behave acceptably asymptotically, that is, as the sample size $n$ tends to infinity. In many cases, in line with the asymptotic properties of maximum likelihood estimators, posterior distributions tend to be Gaussian, with means ultimately the same as the maximum likelihood estimates and with covariance matrices defined in terms of Fisher information matrices. The posterior mean defines the location of the parameters in the model, and a 'reliable' covariance matrix is crucial if interval estimates of the parameters are required. Thus, any method of approximation can be said to be asymptotically acceptable if the corresponding $q_\theta$ tends to be Gaussian with the correct mean and correct covariance matrix. A number of papers concerning mixture problems have established that all is well so far as the Gaussianity and the mean are concerned, but not so for the covariance matrix. The limiting covariance matrix tends to that which corresponds to a complete-data scenario, so that any interval estimates calculated on the basis of the variational approximation are unrealistically narrow; see for example Wang and Titterington (2005a, 2005b, 2006).

A different class of deterministic approximations is provided by the method of Expectation Propagation (Minka, 2001a, 2001b). It is assumed that the posterior distribution for $\theta$ is approximated by a product of terms
$$q_\theta(\theta) = \prod_{i=0}^n \tilde t_i(\theta),$$
where the $i = 0$ term corresponds to the prior density and, typically, if the prior comes from a conjugate family then, as functions of $\theta$, all the other terms take the same form, so that $q_\theta(\theta)$ does also. The explicit form of the approximation is calculated iteratively.
Step 1. From an initial or current proposal for $q_\theta(\theta)$ the $i$th factor $\tilde t_i$ is discarded and the resulting form is renormalised, giving $q_{\theta,\setminus i}(\theta)$.

Step 2. This $q_{\theta,\setminus i}(\theta)$ is then combined with the 'correct' $i$th factor $t_i(\theta)$ (implying that the correct posterior does consist of a product) and a new $q_\theta(\theta)$ of the conjugate form is selected that gives an 'optimal' fit with the new product. In Minka (2001b) optimality is determined by Kullback-Leibler divergence, in that, if
$$p_i(\theta) \propto q_{\theta,\setminus i}(\theta)\, t_i(\theta),$$
then the new $q_\theta(\theta)$ minimises
$$\mathrm{KL}(p_i, q) = \int p_i \log(p_i/q).$$
However, in some cases, as we shall see, simpler solutions are obtained if moment-matching is used instead; if the conjugate family is Gaussian then the two approaches are equivalent.

Step 3. Finally, from $q_{\theta,\setminus i}(\theta)$ and the new $q_\theta(\theta)$ a new factor $\tilde t_i(\theta)$ can be obtained such that
$$q_\theta(\theta) \propto q_{\theta,\setminus i}(\theta)\, \tilde t_i(\theta).$$
This procedure is iterated over choices of $i$ repeatedly until convergence is attained.

As with the variational Bayes approach, an approximate posterior is obtained that is a member of the complete-data conjugate family. The important question is to what extent the approximation improves on the variational approximation. Much empirical evidence suggests that it is indeed better, and the purpose of this paper is to investigate this issue at a deeper level. The general approach will be to see if Expectation Propagation achieves what variational Bayes achieves in terms of Gaussianity and asymptotically correct mean but in addition manages to behave appropriately in terms of asymptotic second posterior moments.

This paper takes a few first steps by discussing some very simple scenarios and treating them without complete rigour. In some cases positive results are obtained, but consideration of one example suggests that the method is not uniformly successful in the above terms.

2 Overview of the recursive-based approach to be adopted

As Minka (2001b) explains, Expectation Propagation was motivated by a recursive version of the approach known as Assumed Density Filtering (ADF) (Maybeck, 1982). In ADF an approximation to the posterior density is created by incorporating the data one by one, carrying forward a conjugate-family approximation to the true posterior and updating it observation by observation using the sort of Kullback-Leibler-based strategy described above in Step 2 of the EP method; the same approach was studied by Bernardo and Girón (1988) and Stephens (1997, Chapter 5). The difference from EP is that the data are run through only once, in a particular order, and therefore the final result is order-dependent. However, recursive procedures such as these often have desirable asymptotic properties, sometimes going under the name of stochastic approximations. There is also a close relationship with what is called the Probabilistic Editor; see for example Athans et al. (1977), Makov (1983), Titterington et al. (1985, Chapter 6) and further references in Section 4. Indeed, our analysis concentrates on the recursive step defined by Step 2 in the EP approach. The paper considers in detail normal mixtures with an unknown mean parameter, for which the conjugate prior family is Gaussian, and mixtures with an unknown mixing weight, for which the conjugate prior family is Beta. Extension to more complicated or more general scenarios is currently under investigation.
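Before setting up notation, here is a minimal numerical sketch of the recursive Step-2 update carried forward observation by observation, for a Gaussian approximating family. The brute-force quadrature-based moment matching is our own illustration, not the closed-form algebra derived in Section 3; all names are ours, and the example model anticipates Section 3.6.

```python
import numpy as np
from scipy.integrate import trapezoid

def adf_gaussian_step(a, b, loglik, width=10.0, num=2001):
    """One ADF / EP Step-2 update for a Gaussian approximating family.

    The current N(a, b) approximation is combined with a single likelihood
    factor (loglik, a function of theta), and a new N(A, B) is fitted by
    moment-matching the tilted density, here via numerical quadrature.
    """
    theta = np.linspace(a - width * np.sqrt(b), a + width * np.sqrt(b), num)
    log_tilted = -0.5 * (theta - a) ** 2 / b + loglik(theta)
    tilted = np.exp(log_tilted - log_tilted.max())
    tilted /= trapezoid(tilted, theta)
    A = trapezoid(theta * tilted, theta)             # matched mean
    B = trapezoid((theta - A) ** 2 * tilted, theta)  # matched variance
    return A, B

# Example: the symmetric two-component mixture of Section 3.6.
def loglik_factory(x):
    return lambda th: np.log(np.exp(-0.5 * (x + th) ** 2)
                             + np.exp(-0.5 * (x - th) ** 2))

rng = np.random.default_rng(0)
mu_true, (a, b) = 1.5, (0.5, 1.0)
for _ in range(2000):
    x = rng.choice([-1.0, 1.0]) * mu_true + rng.standard_normal()
    a, b = adf_gaussian_step(a, b, loglik_factory(x))
# a should drift towards mu_true (or -mu_true, by symmetry), with b of order 1/n.
```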
In this paper we shall deal mainly with one-parameter problems, although the material in Section 3, at least, can be generalised easily to the vector case, and we shall suppose that the conjugate family takes the form $q(\theta \mid a)$, where $a$ represents a set of hyperparameters. For simplicity we shall omit the subscript $\theta$ attached to $q$. Of interest will be the relationship between consecutive sets of hyperparameters, $a^{(n-1)}$ and $a^{(n)}$, corresponding to the situation before and after the $n$th observation, $x_n$, is incorporated. Thus, $q(\theta \mid a^{(n)})$ is the member of the conjugate family that is closest, in terms of Kullback-Leibler divergence, or perhaps of moment-matching, to the density that is given by
$$q(\theta \mid a^{(n-1)})\, t_n(\theta) = q(\theta \mid a^{(n-1)})\, f(x_n \mid \theta),$$
suitably normalised.

If $E_n$ and $V_n$ are the functions of $a^{(n)}$ that represent the mean and the variance of $q(\theta \mid a^{(n)})$, then we would want $E_n$ and $V_n$ asymptotically to be indistinguishable from the corresponding values for the correct posterior. We would also expect asymptotic Gaussianity. So far as $E_n$ is concerned, it is a matter of showing that it converges in some sense to the true value of $\theta$. The correct asymptotic variance is essentially given by the asymptotic variance of the maximum likelihood estimator, with the prior density having negligible effect, asymptotically. Equivalently, the increase in precision associated with the addition of $x_n$, namely $V_n^{-1} - V_{n-1}^{-1}$, should be asymptotically the same, in some sense, as the negative of the second derivative with respect to $\theta$ of $\log f(x_n \mid \theta)$, again by analogy with maximum likelihood theory.

3 A general finite normal mixture with an unknown mean parameter

3.1 Preamble

As remarked above, only the univariate case is considered, although a multivariate version will follow the same pattern. This is arguably the most general type of mixture, apart from regression versions, for which the Gaussian distribution is the complete-data conjugate prior. The assumption is that the observed data are a random sample from a univariate mixture of $J$ Gaussian distributions, with means and variances $\{c_j\mu, \sigma_j^2;\ j = 1, \ldots, J\}$ and with mixing weights $\{v_j;\ j = 1, \ldots, J\}$. The $\{c_j, \sigma_j^2, v_j\}$ are assumed known, so that $\mu$ is the only unknown parameter. For observation $x$, therefore, we have
$$f(x \mid \mu) \propto \sum_{j=1}^J \frac{v_j}{\sigma_j} \exp\Big\{-\frac{1}{2\sigma_j^2}(x - c_j\mu)^2\Big\}.$$
In the case of a Gaussian distribution the obvious hyperparameters are the mean and the variance themselves. For comparative simplicity of notation we shall write those hyperparameters before treatment of the $n$th observation as $(a, b)$, the hyperparameters afterwards as $(A, B)$ and the $n$th observation itself as $x$.

3.2 The recursive step

The hyperparameters $A$ and $B$ are chosen to match the moments of the density for $\mu$ that is proportional to
$$b^{-1/2}\exp\Big\{-\frac{1}{2b}(\mu - a)^2\Big\}\sum_{j=1}^J \frac{v_j}{\sigma_j}\exp\Big\{-\frac{1}{2\sigma_j^2}(x - c_j\mu)^2\Big\}.$$
Detailed calculation, described in the Appendix, shows that the changes in mean and precision satisfy, respectively,
$$A - a = b\sum_j R_j S_j T_j \Big/ \sum_{j'} T_{j'} + o(b), \qquad (1)$$
and
$$B^{-1} - b^{-1} = \frac{\sum_j R_j^2 T_j}{\sum_j T_j} - \frac{\sum_j T_j R_j^2 S_j^2}{\sum_j T_j} + \frac{(\sum_j T_j R_j S_j)^2}{(\sum_j T_j)^2} + o(1), \qquad (2)$$
where
$$R_j = c_j/\sigma_j, \qquad S_j = (x - c_j a)/\sigma_j \qquad \text{and} \qquad T_j = \frac{v_j}{\sigma_j}\exp\Big(-\frac{(x - a c_j)^2}{2\sigma_j^2}\Big).$$
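The updates (1) and (2) are straightforward to code directly. The following sketch (function and variable names are our own) evaluates them for given component constants, dropping the $o(b)$ and $o(1)$ remainder terms.

```python
import numpy as np

def recursive_step(a, b, x, c, sigma, v):
    """Leading-order updates (1)-(2) for the normal-mixture mean problem.

    Arrays c, sigma, v hold the known constants {c_j}, {sigma_j}, {v_j};
    (a, b) are the current Gaussian hyperparameters and x is the new datum.
    The o(b) and o(1) remainder terms are dropped.
    """
    R = c / sigma
    S = (x - c * a) / sigma
    T = (v / sigma) * np.exp(-0.5 * S ** 2)
    W = T / T.sum()                              # normalised weights T_j / sum T_j'
    A = a + b * np.sum(W * R * S)                # update (1)
    delta_prec = (np.sum(W * R ** 2)
                  - np.sum(W * R ** 2 * S ** 2)
                  + np.sum(W * R * S) ** 2)      # update (2)
    B = 1.0 / (1.0 / b + delta_prec)
    return A, B
```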
3.3 Fisher information

As explained earlier, the 'correct' asymptotic variance is given by the inverse of the Fisher information, so that the expected change in the inverse of the variance is the Fisher information corresponding to one observation, i.e. the negative of the expected second derivative of
$$\log f(x \mid \mu) = \mathrm{const.} + \log\Bigg\{\sum_{j=1}^J \frac{v_j}{\sigma_j}\exp\Big(-\frac{1}{2\sigma_j^2}(x - c_j\mu)^2\Big)\Bigg\} = \mathrm{const.} + \log\sum_j T'_j,$$
where $T'_j$ is the same as $T_j$ except that $a$ is replaced by $\mu$. In what follows, $R'_j$ and $S'_j$ are similarly related to $R_j$ and $S_j$. Then it is straightforward to show that
$$\frac{\partial}{\partial\mu}\log f(x \mid \mu) = \frac{\sum_j T'_j R'_j S'_j}{\sum_j T'_j},$$
and so the observed information is
$$-\frac{\partial^2}{\partial\mu^2}\log f(x \mid \mu) = \frac{\sum_j R'^2_j T'_j}{\sum_j T'_j} - \frac{\sum_j T'_j R'^2_j S'^2_j}{\sum_j T'_j} + \frac{(\sum_j T'_j R'_j S'_j)^2}{(\sum_j T'_j)^2}. \qquad (3)$$
The right-hand sides of (2) and (3) differ only in that (2) involves $a$ whereas (3) involves $\mu$, and (2) is correct just to $O(b)$ whereas (3) is exact. However, because of the nature of (1), stochastic approximation theory as applied by Smith and Makov (1981) will confirm that asymptotically $a$ will converge to $\mu$ and terms of $O(b)$ will be negligible.

Thus, the approximate posterior distribution derived in this section behaves as we would wish, in terms of its mean and variance; by construction it is also Gaussian.

3.4 Precision afforded by the complete-data scenario

The $\mu$-dependent part of the $n$th observation's contribution to the complete-data loglikelihood is
$$-\sum_j z_{nj}(x - c_j\mu)^2/(2\sigma_j^2),$$
in which the indicator variable $z_{nj}$ is 1 if the observation belongs to the $j$th component and is 0 otherwise. The negative second derivative with respect to $\mu$ is then
$$\sum_j z_{nj} c_j^2/\sigma_j^2 = \sum_j R_j^2 z_{nj},$$
and, given $x$, $E(z_{nj}) = T_j/\sum_{j'} T_{j'}$, so that
$$E\Big(\sum_j z_{nj} c_j^2/\sigma_j^2\Big) = \frac{\sum_j R_j^2 T_j}{\sum_j T_j}. \qquad (4)$$

3.5 Variational Bayes approximation

It is easy to show that the corresponding change in precision associated with the variational approximation would give
$$B^{-1} - b^{-1} = \frac{\sum_j R_j^2 T_j}{\sum_j T_j} + o(1) \ \geq\ \frac{\sum_j R_j^2 T_j}{\sum_j T_j} - \frac{\sum_j T_j R_j^2 S_j^2}{\sum_j T_j} + \frac{(\sum_j T_j R_j S_j)^2}{(\sum_j T_j)^2} + o(1), \qquad (5)$$
the inequality following because the extra terms can be interpreted as a negative variance. This indicates that the precision reflected in the variational approximation is too high and, as is often the case, is the same as would be achieved in the complete-data scenario; see (4).

In the next two subsections we consider two particular cases of the formulation studied so far.

3.6 A symmetric mixture of two Gaussians

In this case the model is assumed to correspond to an equally weighted mixture of $N(-\mu, 1)$ and $N(\mu, 1)$ distributions, where $\mu$ is unknown, so that
$$f(x \mid \mu) \propto \exp\{-\tfrac12(x + \mu)^2\} + \exp\{-\tfrac12(x - \mu)^2\}.$$
The prior density for $\mu$ is assumed to be $N(a^{(0)}, b^{(0)})$ and the Gaussian approximation for $\mu$ given data $x^{(n)} := \{x_1, \ldots, x_n\}$ is assumed to be the $N(a^{(n)}, b^{(n)})$ distribution. Consider the incorporation of observation $n$ and, for simplicity of notation, write $a^{(n-1)}$, $b^{(n-1)}$, $a^{(n)}$, $b^{(n)}$ and $x_n$ as $a$, $b$, $A$, $B$ and $x$, respectively. This is a particular case of the model studied so far, with $J = 2$, $c_2 = -c_1 = 1$, $\sigma_1 = \sigma_2 = 1$ and $v_1 = v_2 = 1/2$.
Direct substitution gives
$$\frac{\sum_j R_j^2 T_j}{\sum_j T_j} = 1,$$
$$\frac{\sum_j T_j R_j^2 S_j^2}{\sum_j T_j} = \frac{(x + a)^2\exp\{-\tfrac12(x+a)^2\} + (x - a)^2\exp\{-\tfrac12(x-a)^2\}}{\exp\{-\tfrac12(x+a)^2\} + \exp\{-\tfrac12(x-a)^2\}},$$
$$\frac{(\sum_j T_j R_j S_j)^2}{(\sum_j T_j)^2} = \Bigg\{\frac{(x+a)\exp\{-\tfrac12(x+a)^2\} - (x-a)\exp\{-\tfrac12(x-a)^2\}}{\exp\{-\tfrac12(x+a)^2\} + \exp\{-\tfrac12(x-a)^2\}}\Bigg\}^2,$$
so that the right-hand side of (5), that is, the precision change (2), becomes
$$B^{-1} - b^{-1} = 1 - \frac{4x^2\exp\{-\tfrac12(x+a)^2 - \tfrac12(x-a)^2\}}{[\exp\{-\tfrac12(x+a)^2\} + \exp\{-\tfrac12(x-a)^2\}]^2} + o(1). \qquad (6)$$
Likewise, the negative of the second derivative of $\log f(x \mid \mu)$ is
$$-\frac{\partial^2}{\partial\mu^2}\log f(x \mid \mu) = 1 - \frac{4x^2\exp\{-\tfrac12(x+\mu)^2 - \tfrac12(x-\mu)^2\}}{[\exp\{-\tfrac12(x+\mu)^2\} + \exp\{-\tfrac12(x-\mu)^2\}]^2}. \qquad (7)$$
As in the general case, the right-hand sides of equations (6) and (7) differ only in that, where the former has $a$, the current posterior mean for $\mu$, the latter has $\mu$ itself.

Note that the recursion for the posterior mean of the Gaussian approximation is
$$A = a + b\{(1 - w)x - wx - a\} + o(b),$$
where
$$w = \frac{\exp\{-(x+a)^2/2\}}{\exp\{-(x+a)^2/2\} + \exp\{-(x-a)^2/2\}}.$$
The above equation is similar to standard stochastic approximations and should converge to the true $\mu$, as will the maximum likelihood estimator.

3.6.1 Quasi-Bayes

The new observation is assigned to $N(-a, 1)$ with probability $w$ and to $N(a, 1)$ with probability $1 - w$, so that, for an appropriate prior,
$$A = a + (n + 1)^{-1}\{(1 - w)x - wx - a\} \qquad \text{and} \qquad B = (n + 1)^{-1}.$$
The behaviour of $b$ is the same as for the confirmed-data case considered next.

3.6.2 Confirmed-data case

In this case
$$A = a + (n + 1)^{-1}\{-x I(z_n = 1) + x I(z_n = 2) - a\}$$
and $B = (n + 1)^{-1}$. Also, the Fisher information per observation is 1. Note that the (observed) information per observation in (7) is less than 1, but approaches 1 as $\mu \to \infty$, i.e. as the mixture components become separated.

3.7 Minka's clutter problem

3.7.1 Preamble

Again we just consider the univariate version of the problem, described in Minka (2001b), so that the model consists of the mixture of $N(\mu, 1)$ and $N(0, 10)$, with known mixing weights $1 - v$ and $v$ respectively; Minka (2001b) uses $w$ for $v$ but we have used $w$ already for something else. Thus
$$f(x \mid \mu) \propto (1 - v)\exp\{-\tfrac12(x - \mu)^2\} + \frac{v}{\sqrt{10}}\exp(-x^2/20).$$
As before, we use the notation $a$, $b$, $A$, $B$ and $x$. This is another special case of the general model, with $J = 2$, $c_1 = 1$, $c_2 = 0$, $\sigma_1 = 1$, $\sigma_2 = \sqrt{10}$, $v_1 = 1 - v$ and $v_2 = v$.

Direct substitution gives
$$\frac{\sum_j R_j^2 T_j}{\sum_j T_j} = \frac{(1-v)\exp\{-\tfrac12(x-a)^2\}}{(1-v)\exp\{-\tfrac12(x-a)^2\} + \frac{v}{\sqrt{10}}\exp(-x^2/20)},$$
$$\frac{\sum_j T_j R_j^2 S_j^2}{\sum_j T_j} = \frac{(1-v)(x-a)^2\exp\{-\tfrac12(x-a)^2\}}{(1-v)\exp\{-\tfrac12(x-a)^2\} + \frac{v}{\sqrt{10}}\exp(-x^2/20)},$$
$$\frac{(\sum_j T_j R_j S_j)^2}{(\sum_j T_j)^2} = \frac{[(1-v)(x-a)\exp\{-\tfrac12(x-a)^2\}]^2}{[(1-v)\exp\{-\tfrac12(x-a)^2\} + \frac{v}{\sqrt{10}}\exp(-x^2/20)]^2},$$
so that the right-hand side of (5), that is, the precision change (2), becomes
$$B^{-1} - b^{-1} = \frac{(1-v)\exp\{-\tfrac12(x-a)^2\}}{(1-v)\exp\{-\tfrac12(x-a)^2\} + \frac{v}{\sqrt{10}}\exp(-x^2/20)} - \frac{\frac{v}{\sqrt{10}}(1-v)(x-a)^2\exp\{-\tfrac12(x-a)^2 - x^2/20\}}{[(1-v)\exp\{-\tfrac12(x-a)^2\} + \frac{v}{\sqrt{10}}\exp(-x^2/20)]^2} + o(1). \qquad (8)$$
The negative of the second derivative of $\log f(x \mid \mu)$ is
$$-\frac{\partial^2\log f(x \mid \mu)}{\partial\mu^2} = \frac{(1-v)\exp\{-\tfrac12(x-\mu)^2\}}{(1-v)\exp\{-\tfrac12(x-\mu)^2\} + \frac{v}{\sqrt{10}}\exp(-x^2/20)} - \frac{\frac{v}{\sqrt{10}}(1-v)(x-\mu)^2\exp\{-\tfrac12(x-\mu)^2 - x^2/20\}}{[(1-v)\exp\{-\tfrac12(x-\mu)^2\} + \frac{v}{\sqrt{10}}\exp(-x^2/20)]^2}. \qquad (9)$$
Again the right-hand sides of equations (8) and (9) differ only in that, where the former has $a$, the current posterior mean for $\mu$, the latter has $\mu$ itself. Since $a$ will tend to $\mu$ in the limit, it follows that the approximate approach based on a Gaussian approximation to the posterior distribution will behave appropriately asymptotically, so far as mean and variance are concerned.

Note that the recursion for the posterior mean of the Gaussian approximation is
$$A = a + bw(x - a) + o(b),$$
where now $w$ denotes the current posterior probability that the observation comes from the $N(\mu, 1)$ component, namely the first of the three ratios displayed above. This is similar to standard stochastic approximations and should converge to the true $\mu$, as will the maximum likelihood estimator.

4 Mixture of two known distributions

4.1 Preamble

This is arguably the simplest possible mixture but is one for which the conjugate family is not given by Gaussian distributions. It is assumed that data arrive independently and identically distributed from a mixture distribution with density function
$$f(x \mid \beta) = \beta f_1(x) + (1 - \beta) f_2(x),$$
in which $\beta$ is an unknown mixing weight between zero and one and $f_1$ and $f_2$ are known densities. The prior density for $\beta$ is assumed to be that of $\mathrm{Be}(a^{(0)}, b^{(0)})$, and the Beta approximation for the posterior based on $x^{(n)} := \{x_1, \ldots, x_n\}$ is assumed to be the $\mathrm{Be}(a^{(n)}, b^{(n)})$ distribution, for hyperparameters $a^{(n)}$ and $b^{(n)}$. The expectation, second moment and variance of the Beta approximation are respectively
$$E_n = a^{(n)}/(a^{(n)} + b^{(n)}),$$
$$S_n = \{a^{(n)}(a^{(n)} + 1)\}/\{(a^{(n)} + b^{(n)})(a^{(n)} + b^{(n)} + 1)\},$$
$$V_n = (a^{(n)} b^{(n)})/\{(a^{(n)} + b^{(n)})^2(a^{(n)} + b^{(n)} + 1)\} = E_n(1 - E_n)/(L_n + 1),$$
where $L_n = a^{(n)} + b^{(n)}$.

The limiting behaviour of $E_n$, $S_n$ and $V_n$ is of key interest. We would want $E_n$ to tend to the true $\beta$ in some sense and $V_n$ to tend to the variance of the correct posterior distribution of $\beta$. Asymptotic normality of the approximating distribution is also desired. If $E_n$ behaves as desired then $E_n(1 - E_n)$ tends to $\beta(1 - \beta)$ and, for $V_n$, the behaviour of $a^{(n)} + b^{(n)} + 1 = L_n + 1$, and therefore of $L_n$, is then crucial.

4.2 The case of confirmed data

In this case the component identifiers $\{z_n\}$, i.e. $z_n = 1$ or $z_n = 2$, of the observations are known, the exact posterior distribution is a Beta, and
$$a^{(n)} = a^{(n-1)} + I(z_n = 1), \qquad b^{(n)} = b^{(n-1)} + I(z_n = 2), \qquad L_n = L_{n-1} + 1 = L_0 + n \approx n,$$
for large $n$, where $I(\cdot)$ is an indicator function. In this case $E_n$ is, to first order, $\hat\beta_{\mathrm{CO}}$, the proportion of the $n$ observations that belong to the first component; it therefore does tend to $\beta$, by the Law of Large Numbers, and the limiting version of $V_n$ is $V_{\mathrm{CO}} = \beta(1 - \beta)/n$ for large $n$. Asymptotic normality of the posterior distribution of $\beta$ follows from the Bernstein-von Mises mirror result corresponding to the Central Limit result for $\hat\beta_{\mathrm{CO}}$.
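As a small concrete check on the rate just described, here is an illustrative simulation sketch (ours, not from the paper) of the confirmed-data recursion; it makes visible that $L_n$ grows by exactly one per observation, so that $V_n \approx \beta(1 - \beta)/n$.

```python
import numpy as np

def confirmed_data_update(a, b, z):
    """Exact Beta posterior update when the component label z (1 or 2) is known."""
    return (a + 1.0, b) if z == 1 else (a, b + 1.0)

rng = np.random.default_rng(2)
beta_true, (a, b) = 0.3, (1.0, 1.0)
n = 10000
for _ in range(n):
    z = 1 if rng.random() < beta_true else 2
    a, b = confirmed_data_update(a, b, z)
E_n, L_n = a / (a + b), a + b
V_n = E_n * (1 - E_n) / (L_n + 1)
print(V_n, beta_true * (1 - beta_true) / n)  # the two should be close
```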
4.3 The 'correct' results

The behaviour of the 'correct' posterior distribution will be dictated by the behaviour of the maximum likelihood estimator $\hat\beta_{\mathrm{ML}}$ which, for this problem, will be consistent for $\beta$, will be asymptotically normal provided the true $\beta$ is not 0 or 1, and will have large-sample variance defined by the Fisher information: for large $n$, approximately,
$$E(\hat\beta_{\mathrm{ML}}) = \beta, \qquad \mathrm{var}(\hat\beta_{\mathrm{ML}}) = \frac{1}{n\int \{f_1(x) - f_2(x)\}^2/f(x)\,dx} = \frac{1}{n\int (f_1 - f_2)^2/\{\beta f_1 + (1 - \beta) f_2\}} = V_{\mathrm{ML}},$$
say. Again, these properties can be transferred to the posterior distribution of $\beta$, by a Bernstein-von Mises argument. This transference will apply as a general rule.

4.4 The variational approximation

In this case the variational approximation, which is of course a Beta distribution, has hyperparameters of the form
$$a^{(n)} = a^{(0)} + \sum_{i=1}^n w_{1i}, \qquad b^{(n)} = b^{(0)} + \sum_{i=1}^n w_{2i},$$
where, for each $i$, $w_{1i}$ and $w_{2i}$ are nonzero and sum to 1; see for example Humphreys and Titterington (2000). Thus, as in the previous subsection,
$$L_n = a^{(0)} + b^{(0)} + n \approx n,$$
for large $n$. Informally, this implies that the limiting version of $V_n$ is
$$V_{\mathrm{VA}} = \beta(1 - \beta)/n.$$
This is the same as $V_{\mathrm{CO}}$ and therefore is 'smaller than it should be' and would lead to unrealistically narrow interval estimates for $\beta$. For more details see for example Humphreys and Titterington (2000) and, for a more rigorous discussion especially about the convergence of $E_n$ to $\beta$, Wang and Titterington (2005a).

4.5 The Quasi-Bayes recursive approach

For each $n$ let
$$w_{1n} = a^{(n-1)} f_{1n}/(a^{(n-1)} f_{1n} + b^{(n-1)} f_{2n}), \qquad (10)$$
where, for $j = 1, 2$, $f_{jn} = f_j(x_n)$. Then, for the Quasi-Bayes approach (Smith and Makov, 1978), which creates a sequence of Beta approximations that recursively tracks the expectation from stage to stage,
$$a^{(n)} = a^{(n-1)} + w_{1n}, \qquad b^{(n)} = b^{(n-1)} + 1 - w_{1n}, \qquad (11)$$
which implies that
$$L_n = a^{(n-1)} + b^{(n-1)} + 1 = a^{(0)} + b^{(0)} + n \approx n.$$
The recursively calculated sequence of posterior means $\{E_n\}$ satisfies
$$E_n = E_{n-1} + \frac{1}{a^{(n-1)} + b^{(n-1)} + 1}(w_{1n} - E_{n-1}). \qquad (12)$$
Smith and Makov (1978) show, by carefully establishing the credentials of (12) as a stochastic approximation, that the posterior mean is consistent, so that, for Quasi-Bayes and for large $n$,
$$V_n \approx V_{\mathrm{QB}} = \beta(1 - \beta)/n;$$
this is the same as for the confirmed-data case and the variational approximation, thereby falling foul of the same criticism as the latter as being 'too small'. Similar remarks apply to any other recursive method in which the hyperparameters are updated according to a rule like that in (11); a recursive version of the variational approximation, implemented in Humphreys and Titterington (2000), is a case in point.
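In code, the Quasi-Bayes step is a direct transcription of (10)-(11); the sketch below uses our own naming. Since the two increments always sum to one, the criticism above is visible directly: the denominator $L_n + 1$ driving $V_n$ grows at unit rate regardless of how informative $x_n$ is.

```python
import numpy as np

def quasi_bayes_update(a, b, f1n, f2n):
    """Quasi-Bayes update (10)-(11): split a unit count between the two
    Beta hyperparameters according to the posterior responsibility w1n."""
    w1n = a * f1n / (a * f1n + b * f2n)   # (10)
    return a + w1n, b + 1.0 - w1n         # (11); note (a+b) increases by 1
```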
4.6 The probabilistic editor

In Quasi-Bayes the first moment of the posterior distribution is tracked recursively. In the probabilistic editor (PE) the first two moments are tracked, which is possible because there are two hyperparameters. Makov (1983) looks at a context different from this simple mixture problem and investigates empirically updating more than one observation at a time, but for the time being we consider just the above framework. We describe how the hyperparameters are updated based on incorporation of an observation $x_n$ from the mixture distribution and a $\mathrm{Be}(a^{(n-1)}, b^{(n-1)})$ prior distribution. For simplicity of notation and in the spirit of previous sections, we shall denote the hyperparameters by $(a^{(n-1)}, b^{(n-1)}) = (a, b)$ and $(a^{(n)}, b^{(n)}) = (A, B)$. Then we have to match moments of a $\mathrm{Be}(A, B)$ distribution with those of the Beta mixture distribution
$$w_1\,\mathrm{Be}(a + 1, b) + w_2\,\mathrm{Be}(a, b + 1),$$
where $w_1 = a f_{1n}/(a f_{1n} + b f_{2n}) = 1 - w_2$.

In general, if $\beta \sim \sum_j w_j g_j$, where the distribution corresponding to $g_j$ has mean $\mu_j$ and variance $\sigma_j^2$, then
$$E\beta = \bar\mu = \sum_j w_j\mu_j, \qquad \mathrm{var}\,\beta = \sum_j w_j\sigma_j^2 + \sum_j w_j(\mu_j - \bar\mu)^2.$$
These formulae are also employed in the Appendix. Thus, matching means, we obtain
$$E_n = \{w_1(a + 1) + w_2 a\}/(a + b + 1) = (a + w_1)/(L_{n-1} + 1).$$
Matching variances gives, after some algebra,
$$\frac{E_n(1 - E_n)}{L_n + 1} = \frac{E_n(1 - E_n)}{L_{n-1} + 2} + \frac{w_1(1 - w_1)}{(L_{n-1} + 2)(L_{n-1} + 1)}.$$
Clearly, this provides an explicit formula for $L_n$ and consequently for $A = E_n L_n$ and $B = (1 - E_n)L_n$, but, from a theoretical point of view, we are interested in approximations that lead to asymptotic results. Further manipulation gives
$$L_n + 1 \approx (L_{n-1} + 2)\Big\{1 - \frac{1}{L_{n-1} + 1}\,\frac{w_1(1 - w_1)}{E_n(1 - E_n)}\Big\},$$
which then leads to
$$L_n - L_{n-1} \approx 1 - \frac{w_1(1 - w_1)}{E_n(1 - E_n)} = 1 - \epsilon_n,$$
say, the approximations relying on $L_n$ being of order $n$.

The posterior mean, $E_n$, can be calculated recursively and the relevant equation is the same as (12):
$$E_n = E_{n-1} + \frac{1}{a + b + 1}(w_1 - E_{n-1}),$$
where $w_1$ is given in (10). By the argument in Smith and Makov (1978), the sequence $\{E_n\}$ will converge (to the true $\beta$ value).

The posterior variance of the PE-based Beta approximation based on $n$ observations is $V_n = E_n(1 - E_n)/(L_n + 1)$, in which the limiting value of $E_n(1 - E_n)$ is $\beta(1 - \beta)$. For large $n$, $L_n$, or equivalently $L_n + 1$, will tend to $nt$, where $t$ is the limiting expectation of $L_n - L_{n-1}$. However, for large $n$, approximately,
$$w_1(1 - w_1) = \beta(1 - \beta) f_{1n} f_{2n}/\{\beta f_{1n} + (1 - \beta) f_{2n}\}^2,$$
and therefore
$$\frac{w_1(1 - w_1)}{E_n(1 - E_n)} \approx f_{1n} f_{2n}/\{\beta f_{1n} + (1 - \beta) f_{2n}\}^2.$$
Thus, approximately,
$$E(L_n - L_{n-1}) = 1 - \int \frac{f_1 f_2}{\beta f_1 + (1 - \beta) f_2},$$
so that, asymptotically, the variance of the PE-based approximation is
$$V_{\mathrm{PE}} = n^{-1}\beta(1 - \beta)\Big\{1 - \int \frac{f_1 f_2}{\beta f_1 + (1 - \beta) f_2}\Big\}^{-1}.$$
We now show that $V_{\mathrm{ML}} = V_{\mathrm{PE}}$. This follows immediately from the following Lemma.

Lemma.
$$\frac{1}{\beta(1 - \beta)}\Big\{1 - \int \frac{f_1 f_2}{\beta f_1 + (1 - \beta) f_2}\Big\} = \int \frac{(f_1 - f_2)^2}{\beta f_1 + (1 - \beta) f_2}. \qquad (13)$$

Proof. Denote the left-hand and right-hand sides of (13) by $I_1$ and $I_2$, respectively, and note that $\beta(1 - \beta) = \tfrac12\{1 - \beta^2 - (1 - \beta)^2\}$. Then
$$\begin{aligned}
I_2 &= \frac{1}{\beta(1 - \beta)}\int \frac{(f_1^2 - 2 f_1 f_2 + f_2^2)\,\beta(1 - \beta)}{\beta f_1 + (1 - \beta) f_2} \\
&= \frac{1}{\beta(1 - \beta)}\int \frac{\beta(1 - \beta)(f_1^2 + f_2^2) + f_1 f_2\{\beta^2 + (1 - \beta)^2\} - f_1 f_2}{\beta f_1 + (1 - \beta) f_2} \\
&= \frac{1}{\beta(1 - \beta)}\int \frac{(1 - \beta) f_1\{\beta f_1 + (1 - \beta) f_2\} + \beta f_2\{\beta f_1 + (1 - \beta) f_2\} - f_1 f_2}{\beta f_1 + (1 - \beta) f_2} \\
&= \frac{1}{\beta(1 - \beta)}\int \Big\{(1 - \beta) f_1 + \beta f_2 - \frac{f_1 f_2}{\beta f_1 + (1 - \beta) f_2}\Big\} \\
&= \frac{1}{\beta(1 - \beta)}\Big\{1 - \beta + \beta - \int \frac{f_1 f_2}{\beta f_1 + (1 - \beta) f_2}\Big\} = I_1.
\end{aligned}$$

Thus, asymptotically, the Probabilistic Editor, and by implication the moment-matching version of Expectation Propagation, get the variance right.
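The PE update admits a closed form, since the variance-matching equation can be solved exactly for $L_n$. The sketch below (our own naming) implements it, together with a quadrature check of the Lemma for one illustrative pair of Gaussian components.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def pe_update(a, b, f1n, f2n):
    """Probabilistic Editor step: match mean and variance of the mixture
    w1*Be(a+1, b) + w2*Be(a, b+1) with a single Be(A, B)."""
    L = a + b
    w1 = a * f1n / (a * f1n + b * f2n)
    E = (a + w1) / (L + 1.0)                     # matched mean
    V = E * (1.0 - E)
    # Solve V/(L_new + 1) = V/(L+2) + w1(1-w1)/((L+2)(L+1)) for L_new:
    L_new = V * (L + 2.0) * (L + 1.0) / (V * (L + 1.0) + w1 * (1.0 - w1)) - 1.0
    return E * L_new, (1.0 - E) * L_new          # (A, B)

# Numerical check of the Lemma (13), here with f1 = N(0,1), f2 = N(2,1):
beta = 0.3
f1, f2 = norm(0, 1).pdf, norm(2, 1).pdf
mix = lambda x: beta * f1(x) + (1 - beta) * f2(x)
lhs = (1 - quad(lambda x: f1(x) * f2(x) / mix(x), -10, 12)[0]) / (beta * (1 - beta))
rhs = quad(lambda x: (f1(x) - f2(x)) ** 2 / mix(x), -10, 12)[0]
print(lhs, rhs)  # agree to quadrature accuracy
```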
4.7 Results for the Kullback-Leibler update

The 'standard' EP and ADF approaches base the updates not specifically on moment-matching but on minimization of the Kullback-Leibler divergence discussed in Section 1. As indicated in Section 3.3.1 of Minka (2001a) and as implied by Section 5.6.4 of Stephens (1997), the relationships between successive sets of hyperparameters $(a, b)$ and $(A, B)$ are
$$\begin{aligned}
\Psi(A) - \Psi(A + B) &= \frac{f_{1n}}{a f_{1n} + b f_{2n}} - \frac{1}{a + b} + \Psi(a) - \Psi(a + b), \\
\Psi(B) - \Psi(A + B) &= \frac{f_{2n}}{a f_{1n} + b f_{2n}} - \frac{1}{a + b} + \Psi(b) - \Psi(a + b),
\end{aligned} \qquad (14)$$
where $\Psi(c)$ denotes the digamma function. For large $c$, the dominant term in $\Psi(c)$ is $\log(c - 1/2)$, obtainable through Stirling's approximation to $(c - 1)!$, and the corresponding approximations to equations (14) are
$$\begin{aligned}
\log(A - 1/2) - \log(A + B - 1/2) &= \frac{f_{1n}}{a f_{1n} + b f_{2n}} - \frac{1}{a + b} + \log(a - 1/2) - \log(a + b - 1/2), \\
\log(B - 1/2) - \log(A + B - 1/2) &= \frac{f_{2n}}{a f_{1n} + b f_{2n}} - \frac{1}{a + b} + \log(b - 1/2) - \log(a + b - 1/2).
\end{aligned} \qquad (15)$$
If we now define $\delta_a$ by $A = a + \delta_a$, and similarly for $\delta_b$, then we can write
$$\log(A - 1/2) = \log(a - 1/2) + \log\{1 + \delta_a/(a - 1/2)\} = \log(a - 1/2) + \delta_a/a + o(a^{-1}),$$
along with similar expressions for $\log(B - 1/2)$ and $\log(A + B - 1/2)$. Substitution in (15) gives, for the dominant terms,
$$\begin{aligned}
\delta_a/a - (\delta_a + \delta_b)/(a + b) &= \frac{f_{1n}}{a f_{1n} + b f_{2n}} - \frac{1}{a + b}, \\
\delta_b/b - (\delta_a + \delta_b)/(a + b) &= \frac{f_{2n}}{a f_{1n} + b f_{2n}} - \frac{1}{a + b}.
\end{aligned} \qquad (16)$$
Substitution of the formulae for $\delta_a$ and $\delta_b$ implicit in the moment-matching update in Section 4.6, and concentration on the lowest-order terms, lead to equations (16) being satisfied. For example, since $A = L_n E_n = L_n(a + w_{1n})/(L_{n-1} + 1)$, we have
$$\begin{aligned}
\delta_a/a &= \{(L_n - L_{n-1})a + L_n w_{1n} - a\}/\{a(L_{n-1} + 1)\} \\
&\approx (L_n - L_{n-1})/L_{n-1} + w_{1n}/a - 1/L_{n-1} \\
&= (\delta_a + \delta_b)/(a + b) + f_{1n}/(a f_{1n} + b f_{2n}) - 1/(a + b).
\end{aligned}$$
The second equation in (16) follows similarly, although in fact the two equations are linearly dependent. Thus, to this degree of approximation, the Kullback-Leibler update is equivalent to the moment-matching update and therefore performs acceptably, asymptotically.

4.8 A remark about the above calculation and the treatment of earlier examples

In the previous examples, in which the conjugate family of distributions was Gaussian, we found that the (approximate) increase in precision was 'matched' with the single-observation negative second derivative of the log-density; expectations were not carried out. The latter seems to be necessary for this example with an unknown mixing weight, as is now heuristically explained. Let $\hat\beta$ denote the estimate of $\beta$ given by the posterior mean after stage $n - 1$, i.e. what has so far been called $E_{n-1}$. Then the approximate change in precision is
$$\{\hat\beta(1 - \hat\beta)\}^{-1}(L_n - L_{n-1}) \approx \{\hat\beta(1 - \hat\beta)\}^{-1}\Big(1 - \frac{f_{1n} f_{2n}}{\{\hat\beta f_{1n} + (1 - \hat\beta) f_{2n}\}^2}\Big),$$
whereas the negative second derivative of the log-density for the $n$th observation is
$$-\frac{\partial^2}{\partial\beta^2}\log f(x_n \mid \beta) = \frac{(f_{1n} - f_{2n})^2}{\{\beta f_{1n} + (1 - \beta) f_{2n}\}^2} = \frac{1}{\beta(1 - \beta)}\Big(\frac{(1 - \beta) f_{1n} + \beta f_{2n}}{\beta f_{1n} + (1 - \beta) f_{2n}} - \frac{f_{1n} f_{2n}}{\{\beta f_{1n} + (1 - \beta) f_{2n}\}^2}\Big),$$
which does not match up without averaging.
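Returning to the exact relationships (14) of Section 4.7, they can be solved numerically for $(A, B)$ with a standard root-finder; the sketch below is our own illustration, and for large $a$ and $b$ its output should agree, to the order derived above, with the moment-matching PE update of Section 4.6.

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import fsolve

def kl_update(a, b, f1n, f2n):
    """Exact KL/ADF update (14): choose Be(A, B) whose expected sufficient
    statistics E[log beta], E[log(1 - beta)] match those of the tilted
    posterior w1*Be(a+1, b) + w2*Be(a, b+1).  Solved numerically."""
    L = a + b
    t1 = f1n / (a * f1n + b * f2n) - 1.0 / L + digamma(a) - digamma(L)
    t2 = f2n / (a * f1n + b * f2n) - 1.0 / L + digamma(b) - digamma(L)
    eqs = lambda AB: [digamma(AB[0]) - digamma(AB[0] + AB[1]) - t1,
                      digamma(AB[1]) - digamma(AB[0] + AB[1]) - t2]
    A, B = fsolve(eqs, [a, b])   # start the search at the current (a, b)
    return A, B
```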
4.9 Discussion

In a very small simulation exercise, Humphreys and Titterington (2000) compare the non-recursive variational approximation, its recursive variant, the recursive Quasi-Bayes and Probabilistic Editor, and the Gibbs sampler, the last of which can be regarded as providing a reliable estimate of the true posterior. As can be expected on the basis of the above analysis, the approximation provided by the Probabilistic Editor is very similar to that obtained from the Gibbs sampler, whereas the other approximations are 'too narrow'. Furthermore, the variances associated with the various approximations are numerically very close to the 'asymptotic' values derived above. The implication is that this is also the case for the corresponding version of the EP algorithm based on moment matching, mentioned in Section 3.3.3 of Minka (2001a), of which the PE represents an online version. Of course EP updates using KL divergence rather than (always) matching moments, but the two versions perform very similarly in Minka's empirical experiments, and Section 4.7 reflects this asymptotically. Recursive versions of the algorithm with KL update, i.e. versions of ADF, are outlined and illustrated by simulation in Chapter 5 of Stephens (1997) for mixtures of known densities, extending earlier work by Bernardo and Girón (1988), and for mixtures of Gaussian densities with all parameters unknown, including the mixing weights. For mixtures of two known densities, Stephens notes that, empirically, the KL update appears to produce an estimate of the posterior density that is indistinguishable from the MCMC estimate, and is much superior to the Quasi-Bayes estimate, which is too narrow. For a mixture of four known densities, for which the conjugate prior distributions are 4-cell Dirichlet distributions, the KL update appears clearly to be better than the Quasi-Bayes update, but somewhat more 'peaked' than it should be. This is because, in terms of the approach of the present paper, there are insufficient hyperparameters in the conjugate family to match all first and second moments. For a $J$-cell Dirichlet, with $J$ hyperparameters, there are $J - 1$ independent first moments and $J(J - 1)/2$ second moments, so that full moment-matching is not possible for $J > 2$, that is, for any case but mixtures of $J = 2$ known densities.

To see this, we proceed along the lines followed in Section 4.6. Suppose we consider updating a $\mathrm{Dir}(a_1, \ldots, a_J) = \mathrm{Dir}(a)$ distribution to $\mathrm{Dir}(A_1, \ldots, A_J) = \mathrm{Dir}(A)$ on the basis of a new observation, $x_n$. Then the moments of the new Dirichlet have to match those of the Dirichlet mixture
$$\sum_j w_j\,\mathrm{Dir}(a + \delta_j),$$
where $a + \delta_j$ denotes the set of hyperparameters that are the same as the set in $a$ except that the $j$th of them is $a_j + 1$ and where, for each $j$,
$$w_j = a_j f_{jn}\Big/\Big(\sum_k a_k f_{kn}\Big).$$
Then matching means requires that, for each $j$,
$$E_{jn} = (a_j + w_j)/(L_{n-1} + 1) = E_{j,n-1} + \frac{1}{L_{n-1} + 1}(w_j - E_{j,n-1}),$$
where $L_{n-1} = \sum_j a_j$, and then $A_j = L_n E_{jn}$, where $L_n = \sum_j A_j$. Clearly, these amount to $J - 1$ independent equations, leaving just one degree of freedom which we assign to the choice of $L_n$. Matching variances leads, for each $j$, to
$$\frac{E_{jn}(1 - E_{jn})}{L_n + 1} = \frac{E_{jn}(1 - E_{jn})}{L_{n-1} + 2} + \frac{w_j(1 - w_j)}{(L_{n-1} + 2)(L_{n-1} + 1)}, \qquad (17)$$
and then to
$$L_n - L_{n-1} \approx 1 - \frac{w_j(1 - w_j)}{E_{jn}(1 - E_{jn})}. \qquad (18)$$
Similarly, matching covariances gives, for each $j$ and $k$ with $j \neq k$,
$$-\frac{E_{jn} E_{kn}}{L_n + 1} = -\frac{E_{jn} E_{kn}}{L_{n-1} + 2} - \frac{w_j w_k}{(L_{n-1} + 2)(L_{n-1} + 1)}, \qquad (19)$$
and then
$$L_n - L_{n-1} \approx 1 - \frac{w_j w_k}{E_{jn} E_{kn}}. \qquad (20)$$
Matching of second-order moments requires $J(J - 1)/2$ equations to be satisfied, and this is clearly possible only for $J = 2$.

This is implicit in the work of Cowell et al. (1996) on recursive updating, following on from Spiegelhalter and Lauritzen (1990) and Spiegelhalter and Cowell (1992), and referred to in Section 3.3.3 of Minka (2001a) and Section 9.7.4 of Cowell et al. (1999). They chose Dirichlet hyperparameters to match first moments and the average variance of the parameters. This can clearly be done exactly, from (17), or approximately, from (18). Alternatively one could match the average of the variances and covariances, based on (19) or (20). However, the upshot is that there is no hope that a pure Dirichlet approximation will produce a totally satisfactory approximation to the 'correct' posterior for $J > 2$, whether through EP or the recursive alternatives, based on KL updating or moment-matching. However, these versions should be a distinct improvement on the Quasi-Bayes and variational approximations, in terms of variance. A possible way forward for small $J$ is to approximate the posterior by a mixture of a small number of Dirichlets. To match all first- and second-order moments of a posterior distribution of a set of $J$-cell multinomial probabilities one would need a mixture of $K$ pure $J$-cell Dirichlets, where
$$KJ + (K - 1) = (J - 1) + (J - 1) + (J - 1)(J - 2)/2,$$
i.e. Dirichlet hyperparameters + mixing weights = first moments + variances + covariances. This gives $K = J/2$. Thus, for even $J$, the match can be exact, but for odd $J$ there would be some redundancy. In fact, even for $J$ as small as 4, the algebraic details of the moment-matching become formidable. In passing we note that there is a directly analogous version of Section 4.7 for the case of $J > 2$.

Acknowledgement

This work has benefited from contact with Philip Dawid, Steffen Lauritzen, Y.W. Teh and Jinghao Xue. The paper was written in part while the authors were in residence at the Isaac Newton Institute in Cambridge, taking part in the Research Programme on Statistical Theory and Methods for Complex High-dimensional Data.

5 Appendix: The recursive step for the normal-mixtures problem

As remarked at the beginning of Section 3.2, the hyperparameters $A$ and $B$ are chosen to match the moments of the density for $\mu$ that is proportional to
$$b^{-1/2}\exp\Big\{-\frac{1}{2b}(\mu - a)^2\Big\}\sum_{j=1}^J \frac{v_j}{\sigma_j}\exp\Big\{-\frac{1}{2\sigma_j^2}(x - c_j\mu)^2\Big\}.$$
This turns out to be a Gaussian mixture,
$$\sum_j w_j N(\mu;\, m_j, s_j^2),$$
where
$$w_j \propto \frac{v_j}{\sigma_j}\Big(\frac{1}{b} + \frac{c_j^2}{\sigma_j^2}\Big)^{-1/2}\exp\Bigg\{-\frac{(x - a c_j)^2}{2 b\sigma_j^2\big(\frac{1}{b} + \frac{c_j^2}{\sigma_j^2}\big)}\Bigg\}, \qquad m_j = \frac{\frac{a}{b} + \frac{c_j x}{\sigma_j^2}}{\frac{1}{b} + \frac{c_j^2}{\sigma_j^2}}, \qquad s_j^2 = \Big(\frac{1}{b} + \frac{c_j^2}{\sigma_j^2}\Big)^{-1}.$$
Next we match the first two moments with those of a Gaussian approximation, $N(A, B)$. First, for the mean, we have
$$A - a = \sum_j w_j(m_j - a) = \sum_j w_j\{b R_j S_j + o(b)\},$$
where $R_j = c_j/\sigma_j$ and $S_j = (x - c_j a)/\sigma_j$. Since in an asymptotic scenario $b$ will be small, so that we shall neglect terms of $o(b)$, we only need the first-order, $O(1)$, term in $w_j$:
$$w_j = T_j\Big/\sum_{j'} T_{j'}, \qquad \text{where } T_j = \frac{v_j}{\sigma_j}\exp\Big(-\frac{(x - a c_j)^2}{2\sigma_j^2}\Big),$$
so that
$$A = a + b\sum_j R_j S_j T_j\Big/\sum_{j'} T_{j'} + o(b).$$
Next, for the variance,
$$B = \sum_j w_j s_j^2 + \sum_j w_j m_j^2 - \Big(\sum_j w_j m_j\Big)^2.$$
Now,
$$\sum_j w_j s_j^2 = \sum_j w_j\Big(\frac{1}{b} + \frac{c_j^2}{\sigma_j^2}\Big)^{-1} = \sum_j w_j b\Big(1 - \frac{b c_j^2}{\sigma_j^2}\Big) + o(b^2) = b - b^2\sum_j w_j(c_j^2/\sigma_j^2) + o(b^2).$$
Again it is necessary only to retain the leading term in $w_j$, so that
$$\sum_j w_j s_j^2 = b - b^2\sum_j R_j^2 T_j\Big/\sum_j T_j + o(b^2).$$
Next, if we write $m_j = a + \Delta_j$, we have that
$$\sum_j w_j m_j^2 = a^2 + 2a\sum_j w_j\Delta_j + \sum_j w_j\Delta_j^2, \qquad \Big(\sum_j w_j m_j\Big)^2 = a^2 + 2a\sum_j w_j\Delta_j + \Big(\sum_j w_j\Delta_j\Big)^2,$$
so that, since again only the leading term in $w_j$ is necessary,
$$\sum_j w_j m_j^2 - \Big(\sum_j w_j m_j\Big)^2 = \sum_j w_j\Delta_j^2 - \Big(\sum_j w_j\Delta_j\Big)^2 = \frac{\sum_j T_j\Delta_j^2}{\sum_j T_j} - \frac{(\sum_j T_j\Delta_j)^2}{(\sum_j T_j)^2} + o(b^2) = b^2\Bigg(\frac{\sum_j T_j R_j^2 S_j^2}{\sum_j T_j} - \frac{(\sum_j T_j R_j S_j)^2}{(\sum_j T_j)^2}\Bigg) + o(b^2).$$
Thus,
$$B = b\Bigg[1 - b\Bigg\{\frac{\sum_j R_j^2 T_j}{\sum_j T_j} - \frac{\sum_j T_j R_j^2 S_j^2}{\sum_j T_j} + \frac{(\sum_j T_j R_j S_j)^2}{(\sum_j T_j)^2}\Bigg\}\Bigg] + o(b^2)$$
and therefore
$$B^{-1} - b^{-1} = \frac{\sum_j R_j^2 T_j}{\sum_j T_j} - \frac{\sum_j T_j R_j^2 S_j^2}{\sum_j T_j} + \frac{(\sum_j T_j R_j S_j)^2}{(\sum_j T_j)^2} + o(1).$$

References

Athans, M., Whiting, R. and Gruber, M. (1977). A suboptimal estimation algorithm with probabilistic editing for false measurements with application to target tracking with wake phenomena. IEEE Trans. Auto. Contr. AC-22, 273–384.

Beal, M.J. and Ghahramani, Z. (2003). The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. In Bayesian Statistics 7, Ed. J.M. Bernardo, M.J. Bayarri, J.O. Berger, A.P. Dawid, D. Heckerman, A.F.M. Smith and M. West, pp. 453–464. Oxford University Press.

Bernardo, J.M. and Girón, F.J. (1988). A Bayesian analysis of simple mixture problems (with Discussion). In Bayesian Statistics 3, Ed. J.M. Bernardo, M.H. DeGroot, D.V. Lindley and A.F.M. Smith, pp. 67–78.

Cowell, R.G., Dawid, A.P. and Sebastiani, P. (1996). A comparison of sequential learning methods for incomplete data. In Bayesian Statistics 5, Ed. J.M. Bernardo, J.O. Berger, A.P. Dawid and A.F.M. Smith, pp. 533–541. Oxford: Clarendon Press.

Cowell, R.G., Dawid, A.P., Lauritzen, S.L. and Spiegelhalter, D.J. (1999). Probabilistic Networks and Expert Systems. New York: Springer.

Humphreys, K. and Titterington, D.M. (2000). Approximate Bayesian inference for simple mixtures. In COMPSTAT 2000, Ed. J.G. Bethlehem and P.G.M. van der Heijden, pp. 331–336. Heidelberg: Physica-Verlag.

Lehmann, E.L. (1991). Theory of Point Estimation. Pacific Grove, CA: Wadsworth and Brooks/Cole.

Makov, U.E. (1983). Approximate Bayesian procedures for dynamic linear models in the presence of jumps. The Statistician 32, 207–213.

Makov, U.E. and Smith, A.F.M. (1977). A quasi-Bayes unsupervised learning procedure for priors. IEEE Trans. Inform. Theory IT-23, 761–764.

Maybeck, P.S. (1982). Stochastic Models, Estimation and Control. New York: Academic Press.

Minka, T. (2001a). A Family of Algorithms for Approximate Bayesian Inference. Ph.D. dissertation, Massachusetts Institute of Technology.

Minka, T. (2001b). Expectation Propagation for approximate Bayesian inference. In Proc. Conf. Uncertainty in AI.

Owen, R.J. (1975). A Bayesian sequential procedure for quantal response in the context of adaptive mental testing. J. Am. Statist. Assoc. 70, 351–356.

Smith, A.F.M. and Makov, U.E. (1978). A quasi-Bayes sequential procedure for mixtures. J. R. Statist. Soc. B 40, 106–112.
Smith, A.F.M. and Makov, U.E. (1981). Unsupervised learning for signal versus noise. IEEE Trans. Inform. Theory IT-27, 498–500.

Spiegelhalter, D.J. and Cowell, R.G. (1992). Learning in probabilistic expert systems. In Bayesian Statistics 4, Ed. J.M. Bernardo, J.O. Berger, A.P. Dawid and A.F.M. Smith, pp. 447–465. Oxford: Clarendon Press.

Spiegelhalter, D.J. and Lauritzen, S.L. (1990). Sequential updating of conditional probabilities on directed graphical structures. Networks 20, 579–605.

Stephens, M. (1997). Bayesian Methods for Mixtures of Normal Distributions. D.Phil. dissertation, University of Oxford.

Titterington, D.M., Smith, A.F.M. and Makov, U.E. (1985). Statistical Analysis of Finite Mixture Distributions. Chichester: Wiley.

Wang, B. and Titterington, D.M. (2005a). Variational Bayes estimation of mixing coefficients. In Deterministic and Statistical Methods in Machine Learning, Lecture Notes in Artificial Intelligence Vol. 3635, Ed. J. Winkler, M. Niranjan and N. Lawrence, pp. 281–295. Springer-Verlag.

Wang, B. and Titterington, D.M. (2005b). Inadequacy of interval estimates corresponding to variational Bayesian approximations. In Proc. 10th Int. Workshop AISTATS, pp. 373–380.

Wang, B. and Titterington, D.M. (2006). Convergence properties of a general algorithm for calculating variational Bayesian estimates for a normal mixture model. Bayesian Analysis 1, 625–650.
