Convergence and asymptotic normality of variational Bayesian approximations for exponential family models with missing values

Bo Wang
Department of Statistics, University of Glasgow, Glasgow G12 8QQ, Scotland, U.K.

D. M. Titterington
Department of Statistics, University of Glasgow, Glasgow G12 8QQ, Scotland, U.K.

UAI 2004

Abstract

We study the properties of variational Bayes approximations for exponential family models with missing values. It is shown that the iterative algorithm for obtaining the variational Bayesian estimator converges locally to the true value with probability 1 as the sample size becomes indefinitely large. Moreover, the variational posterior distribution is proved to be asymptotically normal.

1 INTRODUCTION

Variational Bayes approximations have recently been applied to complex models involving incomplete data, for which computational difficulties arise with the ideal Bayesian approach. Such models include hidden Markov models and mixture models; see for example Attias (1999, 2000); Beal (2003); Ghahramani and Beal (2000); Humphreys and Titterington (2000, 2001); MacKay (1997); Penny and Roberts (2000); Wang and Titterington (2004b). In these earlier contributions, the approximations were shown empirically to be convergent and effective. However, little has been done to investigate their theoretical properties, and the purpose of this paper is to go some way towards rectifying this.

Hall, Humphreys and Titterington (2002) initiated a discussion of these issues and proved that, for certain Markov models, the parameter estimator obtained by maximising the variational lower bound function is asymptotically consistent provided the proportion of all values that are missing tends to zero.
Later we proved in Wang and Titterington (2003) that it is not always the case that a fully factorised form of variational posterior, which includes the factorisation of the joint probability function for the hidden states, provides an asymptotically consistent estimator as the 'sample size' becomes large. We demonstrated this in particular in the context of linear state space models, in which the above sufficient condition obviously does not hold. On the other hand, we showed in Wang and Titterington (2004a) that variational Bayes estimators for certain mixture models are asymptotically efficient for large sample sizes.

In this paper we study the properties of variational approximation algorithms for more general models, namely exponential family models with missing values. Exponential families include cases such as Gaussian, gamma, Poisson, Dirichlet and Wishart distributions, and exponential family models with missing values contain many models of practical interest as particular cases, such as Gaussian mixtures, hidden Markov models and linear state space models. Beal (2003) and Ghahramani and Beal (2000) applied the variational Bayesian method to these models and derived the iterative algorithm for learning the approximate posterior distributions of the latent states and the model parameters. The numerical experiments therein show empirically that this algorithm is convergent and efficient. In this paper we derive the iterative procedure for obtaining the variational Bayesian estimator, we provide analytical proofs of local convergence of the procedure as the sample size tends to infinity, and we show that the variational posterior distribution for the parameters is asymptotically normal.

2 EXPONENTIAL FAMILY MODELS WITH MISSING VALUES AND VARIATIONAL APPROXIMATIONS

We consider the following exponential family models with missing values.
Suppose that $\Theta$ is an open subset of $\mathbb{R}^m$, that $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ is a family of probability distributions on a measurable space $(\Omega, \mathcal{F})$, and that $x$ and $y$ are sampled from the natural exponential family with density
$$p(x, y \mid \theta) = f(x, y) \exp\{\theta^\top u(x, y) - \psi(\theta)\}, \qquad (1)$$
with $x$ taking values in $\mathbb{R}^d$ and $y$ in $\mathbb{R}^p$, where $\theta \in \Theta$ is the unknown parameter, and $\psi(\cdot) : \Theta \mapsto \mathbb{R}$ is six-times continuously differentiable and has positive definite Hessian matrix on $\Theta$. The parameter $\theta$ has a prior conjugate to the complete-data likelihood (1), with density
$$p(\theta \mid \alpha_0, \beta_0) = h(\alpha_0, \beta_0) \exp\{\theta^\top \beta_0 - \alpha_0 \psi(\theta)\}, \qquad (2)$$
where $h$ is a normalising constant satisfying
$$h(\alpha, \beta)^{-1} = \int_\Theta \exp\{\theta^\top \beta - \alpha \psi(\theta)\}\, d\theta, \qquad (3)$$
and $\alpha_0 \in \mathbb{R}$, $\beta_0 \in \mathbb{R}^m$ are the hyperparameters of the prior.

Remark 1. The models of the forms (1) and (2) include most latent-variable models of practical interest. A simple example is when $x$ is sampled from a univariate Gaussian distribution with mean $\theta_1$ and variance 1, and $y = x + w$, where $w$ is sampled from another Gaussian distribution, independent of $x$, with mean $\theta_2$ and variance 1. The joint probability density is
$$p(x, y \mid \theta_1, \theta_2) = \exp\bigl\{-\tfrac{1}{2}x^2 - \tfrac{1}{2}(y - x)^2 + \theta_1 x + \theta_2 (y - x) - \tfrac{1}{2}(\theta_1^2 + \theta_2^2) - \log(2\pi)\bigr\},$$
and the parameters $\theta_1$ and $\theta_2$ have independent Gaussian prior distributions with the same variance.

Suppose that only $y$ is observable whereas $x$ is latent. We have a data-set consisting of a random sample of size $n$, with $Y = (y_1, y_2, \ldots, y_n)$ and $X = (x_1, x_2, \ldots, x_n)$. In the Bayesian framework we want to infer the posteriors over both the parameters and the hidden states. Unfortunately, exact Bayesian inference is generally time-consuming, if not impossible, especially for large dimensionality $m$. Therefore approximation is usually necessary in these cases.
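The Remark 1 model is convenient for numerical checks. A minimal simulation sketch (our own illustration, not part of the paper; parameter values are arbitrary) draws the latent $x$ and the observed $y = x + w$, and confirms that marginally $y \sim N(\theta_1 + \theta_2, 2)$:

```python
import numpy as np

# Simulate the Remark 1 model (illustrative sketch, not from the paper):
# latent x ~ N(theta1, 1), independent noise w ~ N(theta2, 1), observed
# y = x + w.  Marginally y ~ N(theta1 + theta2, 2).
rng = np.random.default_rng(0)
theta1, theta2 = 1.0, -0.5
n = 100_000

x = rng.normal(theta1, 1.0, size=n)        # latent states
y = x + rng.normal(theta2, 1.0, size=n)    # the observed sample

print(y.mean(), y.var())  # should be near theta1 + theta2 = 0.5 and 2.0
```

Note that the distribution of $y$ depends on $\theta_1$ and $\theta_2$ only through their sum, which is why inference must combine the observed data with the prior.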
In the variational approach, the true posterior $p(X, \theta \mid Y)$ is approximated by the variational distribution $q(X, \theta)$, which factorises as $q(X, \theta) = q_X(X) q_\theta(\theta)$ and is chosen to maximise the negative free energy
$$\int q(X, \theta) \log \frac{p(\theta, X, Y)}{q(X, \theta)}\, d\theta\, dX, \qquad (4)$$
equivalent to minimising the Kullback-Leibler divergence between the exact and approximate distributions of $\theta$ and $X$, given $Y$.

The negative free energy (4) can be maximised using the following iterative procedure (Beal (2003); Ghahramani and Beal (2000)), in which the following two stages are performed in turn.

(i) Optimise $q_\theta(\theta)$ for fixed $\{q_{x_i}(x_i),\, i = 1, \ldots, n\}$, defined in (ii) below. This step results in
$$q_\theta(\theta) = h(\alpha, \beta) \exp\{\theta^\top \beta - \alpha \psi(\theta)\}, \qquad (5)$$
where $\alpha$ and $\beta$ are the hyperparameters of the variational posterior and are updated by
$$\alpha = n + \alpha_0, \qquad \beta = \sum_{i=1}^n r_i + \beta_0, \qquad r_i = \langle u(x_i, y_i) \rangle_{x_i}. \qquad (6)$$
Here $\langle \cdot \rangle_{x_i}$ denotes the expectation under $q_{x_i}(x_i)$.

(ii) Optimise $q_X(X)$ for fixed $q_\theta(\theta)$. This leads to the factorised form $q_X(X) = \prod_{i=1}^n q_{x_i}(x_i)$, where
$$q_{x_i}(x_i) = f(x_i, y_i)\, g(\theta, y_i) \exp\{\langle \theta \rangle_\theta^\top u(x_i, y_i) - \psi(\langle \theta \rangle_\theta)\}, \qquad (7)$$
in which $g(\theta, y_i)$ is a normalising constant satisfying
$$g(\theta, y_i)^{-1} = \int f(x_i, y_i) \exp\{\langle \theta \rangle_\theta^\top u(x_i, y_i) - \psi(\langle \theta \rangle_\theta)\}\, dx_i, \qquad (8)$$
and $\langle \cdot \rangle_\theta$ denotes the expectation under $q_\theta(\theta)$.

3 THE ITERATIVE ALGORITHM AND ITS CONVERGENCE

We define the variational Bayesian estimator $\hat{\theta}$ of the parameter $\theta$ as
$$\hat{\theta} = \int_\Theta \theta\, q_{\mathrm{pos}}(\theta)\, d\theta,$$
where $q_{\mathrm{pos}}$ is the variational posterior density of $\theta$, given by the limiting form of $q_\theta(\theta)$ that results from the above iterative procedure. For the exponential family distribution (5) the corresponding variational Bayesian estimator is
$$\hat{\theta} = \int_\Theta \theta\, q_\theta(\theta)\, d\theta = -\frac{D_\beta h(\alpha, \beta)}{h(\alpha, \beta)}.$$
(Throughout the paper, $D\Psi$ and $D^2\Psi$ denote the gradient and the Hessian of $\Psi$.
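For the Remark 1 model the two stages above have closed forms. The following derivation is our own specialization (it is not spelled out in the paper), using the sufficient statistic $u(x,y) = (x,\, y-x)^\top$ and log-partition $\psi(\theta) = \tfrac12(\theta_1^2 + \theta_2^2)$:

```latex
\text{Step (i): since } \psi(\theta)=\tfrac12\,\theta^\top\theta \text{ is quadratic, (5) is Gaussian:}
\qquad
q_\theta(\theta)\propto\exp\bigl\{\theta^\top\beta-\tfrac{\alpha}{2}\,\theta^\top\theta\bigr\}
\;\Longrightarrow\;
q_\theta=N\bigl(\beta/\alpha,\;\alpha^{-1}I_2\bigr),
\qquad
\langle\theta\rangle_\theta=\beta/\alpha .
```

```latex
\text{Step (ii): with } f(x,y)=\exp\bigl\{-\tfrac12 x^2-\tfrac12(y-x)^2-\log 2\pi\bigr\},
\text{ collecting the terms in } x_i \text{ in (7) gives}
\qquad
q_{x_i}(x_i)=N\bigl(m_i,\tfrac12\bigr),
\quad
m_i=\tfrac12\bigl(y_i+\langle\theta_1\rangle-\langle\theta_2\rangle\bigr),
\quad
r_i=\bigl(m_i,\;y_i-m_i\bigr)^\top .
```

Thus each sweep of (i) and (ii) for this model costs $O(n)$ arithmetic operations.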
When ambiguity exists, the specific variable of differentiation appears as a subscript of the symbols $D$ and $D^2$.)

Thus, the procedure in the previous section can be used to derive the following algorithm for obtaining the variational Bayesian estimate of $\theta$: starting with some initial value $\theta^{(0)}$, successive iterates are defined inductively by
$$\theta^{(k+1)} \triangleq \Phi_n(\theta^{(k)}) = -\frac{D_\beta h(\alpha, \beta)}{h(\alpha, \beta)}, \qquad (9)$$
where $\alpha$ and $\beta$ are given as in (6), and
$$q_{x_i}(x_i) = f(x_i, y_i)\, g(\theta^{(k)}, y_i) \exp\{(\theta^{(k)})^\top u(x_i, y_i) - \psi(\theta^{(k)})\},$$
$$g(\theta^{(k)}, y_i)^{-1} = \int f(x_i, y_i) \exp\{(\theta^{(k)})^\top u(x_i, y_i) - \psi(\theta^{(k)})\}\, dx_i.$$

It is of interest to investigate the questions of whether or not the algorithm (9) is convergent and, if so, what properties are possessed by the limiting value. The following theorem gives a partial answer.

Theorem 1. With probability 1 as $n$ approaches infinity, the iterative procedure (9) converges locally to the true value $\theta^*$; i.e., (9) converges to $\theta^*$ whenever the starting value is sufficiently near to $\theta^*$.

Proof. Define the norm of $\theta \in \mathbb{R}^m$ as $\|\theta\| \triangleq (\theta^\top \theta)^{1/2}$ and the norm of the real $m \times m$ matrix $A$ as $\|A\| \triangleq \sup_{\|\theta\| = 1} \|A\theta\|$. We first prove that, with probability 1 as $n$ approaches infinity, the operator $\Phi_n$ is locally contractive; that is, there exists a number $\lambda$, $0 \le \lambda < 1$, such that
$$\|\Phi_n(\bar{\theta}) - \Phi_n(\theta^*)\| \le \lambda \|\bar{\theta} - \theta^*\|, \qquad (10)$$
whenever $\bar{\theta}$ lies sufficiently near $\theta^*$. Since $\bar{\theta}$ is near $\theta^*$ we can write
$$\Phi_n(\bar{\theta}) - \Phi_n(\theta^*) = D\Phi_n(\theta^*)(\bar{\theta} - \theta^*) + O(\|\bar{\theta} - \theta^*\|^2),$$
where $D\Phi_n(\theta^*)$ denotes the gradient of $\Phi_n(\theta)$ evaluated at $\theta^*$. It follows that
$$\|\Phi_n(\bar{\theta}) - \Phi_n(\theta^*)\| \le \|D\Phi_n(\theta^*)\| \cdot \|\bar{\theta} - \theta^*\| = \sup_{\|\theta\| = 1} |\theta^\top D\Phi_n(\theta^*)\, \theta| \cdot \|\bar{\theta} - \theta^*\|.$$
Consequently, it is sufficient to show that $D\Phi_n(\theta^*)$ converges with probability 1 to a matrix which has norm less than 1.

Write $\beta$ and $r_i$ as $\beta(\theta)$ and $r_i(\theta)$ to indicate explicitly their dependence on $\theta$. From (9) one has
$$D\Phi_n(\theta^*) = \frac{D_\beta h(\alpha, \beta)\, D_\beta^\top h(\alpha, \beta) - h(\alpha, \beta)\, D_\beta^2 h(\alpha, \beta)}{h^2(\alpha, \beta)}\, D\beta(\theta^*).$$
Here $h$ and its derivatives are evaluated at $\theta^*$. For convenience we write $h(\alpha, \beta)^{-1}$ evaluated at $\theta^*$ as $\tilde{h}(\alpha, \beta)$, from which
$$D_\beta \tilde{h}(\alpha, \beta) = \int_\Theta \exp\{\theta^\top \beta - \alpha \psi(\theta)\}\, \theta\, d\theta, \qquad D_\beta^2 \tilde{h}(\alpha, \beta) = \int_\Theta \exp\{\theta^\top \beta - \alpha \psi(\theta)\}\, \theta \theta^\top\, d\theta.$$
Let $b(\cdot) : \mathbb{R}^m \mapsto \mathbb{R}$ be a four-times continuously differentiable function of $\theta$ and write
$$a_n(\theta) = \Bigl(1 + \frac{\alpha_0}{n}\Bigr) \psi(\theta) - \theta^\top \Bigl(\frac{1}{n} \sum_{i=1}^n r_i + \frac{\beta_0}{n}\Bigr), \qquad (11)$$
$$h_b = \int_\Theta b(\theta) \exp\{-n a_n(\theta)\}\, d\theta. \qquad (12)$$
Since $\psi(\theta)$ is continuously differentiable and has positive definite Hessian matrix, it is obvious that $a_n(\theta)$ is also continuously differentiable and strictly convex in $\theta$. Thus, if we let $\hat{\theta}_n$ solve the equation
$$D\psi(\theta) = \Bigl(\frac{1}{n} \sum_{i=1}^n r_i + \frac{\beta_0}{n}\Bigr) \Big/ \Bigl(1 + \frac{\alpha_0}{n}\Bigr), \qquad (13)$$
then $\hat{\theta}_n$ is also the unique global minimiser of $a_n(\theta)$ on $\Theta$. It is obvious that $D^2 a_n$ converges to $D^2 \psi$ with probability 1 as $n \to \infty$. By Lemma 1 in Appendix B, letting $b(\theta)$ be $1$, $\theta_i$ and $\theta_i \theta_j$ ($i, j = 1, \ldots, m$) correspondingly in (23), and after a straightforward calculation, we obtain that, as $n$ tends to infinity, with probability 1,
$$n\, \frac{D_\beta^{2,ij} \tilde{h}(\alpha, \beta)}{\tilde{h}(\alpha, \beta)} - n\, \frac{D_\beta^i \tilde{h}(\alpha, \beta)\, D_\beta^j \tilde{h}(\alpha, \beta)}{\tilde{h}^2(\alpha, \beta)} \;\to\; \frac{1}{2} \sigma_\infty^{ij} = \frac{1}{2} [D^2 \psi(\theta^*)]^{-1}_{ij}. \qquad (14)$$
In Appendix A we prove that, as $n \to \infty$,
$$\frac{1}{n} D\beta(\theta^*) \to D^2 \psi(\theta^*) - \mathbb{E}_{y_i}\bigl(\mathbb{E}_{x_i}[\phi]\, \mathbb{E}_{x_i}[\phi^\top]\bigr), \quad \text{a.s.},$$
where 'a.s.' means 'almost surely' and $\phi$ is defined as
$$\phi = u(x_i, y_i) - D\psi(\theta^*). \qquad (15)$$
Therefore, combining (14) with the last limiting result we obtain that, with probability 1,
$$D\Phi_n(\theta^*) \to \frac{1}{2} [D^2 \psi(\theta^*)]^{-1} \bigl(D^2 \psi(\theta^*) - \mathbb{E}_{y_i} \mathbb{E}_{x_i}[\phi]\, \mathbb{E}_{x_i}[\phi^\top]\bigr) = \frac{1}{2} I_m - \frac{1}{2} [D^2 \psi(\theta^*)]^{-1}\, \mathbb{E}_{y_i} \mathbb{E}_{x_i}[\phi]\, \mathbb{E}_{x_i}[\phi^\top],$$
where $I_m$ denotes the $m \times m$ identity matrix. Since $D^2 \psi(\theta^*)$ is positive definite and symmetric, and obviously $\mathbb{E}_{y_i} \mathbb{E}_{x_i}[\phi]\, \mathbb{E}_{x_i}[\phi^\top]$ is positive semidefinite and symmetric, $D\Phi_n(\theta^*) \le \frac{1}{2} I_m$ as $n$ tends to infinity; that is, $D\Phi_n(\theta^*) - \frac{1}{2} I_m$ is negative semidefinite in the limit.

Next we show that $[D^2 \psi(\theta^*)]^{-1}\, \mathbb{E}_{y_i} \mathbb{E}_{x_i}[\phi]\, \mathbb{E}_{x_i}[\phi^\top] \le I_m$. Since $D^2 \psi(\theta^*)$ is positive definite and symmetric, it is sufficient to prove that
$$\theta^\top\, \mathbb{E}_{y_i} \mathbb{E}_{x_i}[\phi]\, \mathbb{E}_{x_i}[\phi^\top]\, \theta \le \theta^\top D^2 \psi(\theta^*)\, \theta \qquad (16)$$
for any $\theta \in \mathbb{R}^m$. In fact, we have
$$\begin{aligned}
\theta^\top\, \mathbb{E}_{y_i} \mathbb{E}_{x_i}[\phi]\, \mathbb{E}_{x_i}[\phi^\top]\, \theta
&= \mathbb{E}_{y_i} \Bigl\{ \sum_{j,k=1}^m \theta_j \theta_k\, \mathbb{E}_{x_i}[\phi_j]\, \mathbb{E}_{x_i}[\phi_k] \Bigr\} \\
&= \mathbb{E}_{y_i} \Bigl\{ \sum_{j,k=1}^m \theta_j \theta_k\, \mathbb{E}_{x_i}[\phi_j \phi_k] - \sum_{j,k=1}^m \theta_j \theta_k\, \mathbb{E}_{x_i}\bigl[(\phi_j - \mathbb{E}_{x_i}[\phi_j])(\phi_k - \mathbb{E}_{x_i}[\phi_k])\bigr] \Bigr\} \\
&= \mathbb{E}_{y_i} \Bigl\{ \sum_{j,k=1}^m \theta_j \theta_k\, \mathbb{E}_{x_i}[\phi_j \phi_k] - \theta^\top\, \mathbb{E}_{x_i}\bigl[(\phi - \mathbb{E}_{x_i}[\phi])(\phi - \mathbb{E}_{x_i}[\phi])^\top\bigr]\, \theta \Bigr\} \\
&\le \sum_{j,k=1}^m \theta_j \theta_k\, \mathbb{E}_{y_i} \mathbb{E}_{x_i}[\phi_j \phi_k] = \theta^\top D^2 \psi(\theta^*)\, \theta,
\end{aligned}$$
where the last equality is a consequence of (22). Therefore, we obtain $0 \le D\Phi_n(\theta^*) \le \frac{1}{2} I_m$ in the limit, and consequently the inequality (10) holds with $\lambda = 1/2$. Moreover, if we use Laplace's approximation (23) it is easy to deduce that $\Phi_n(\theta^*) = -D_\beta h(\alpha, \beta)/h(\alpha, \beta) \to \theta^*$ with probability 1 as $n$ tends to infinity. Therefore, if the starting value is sufficiently near to $\theta^*$ we have
$$\|\theta^{(k+1)} - \theta^*\| \le \|\Phi_n(\theta^{(k)}) - \Phi_n(\theta^*)\| + \|\Phi_n(\theta^*) - \theta^*\| \le \lambda \|\theta^{(k)} - \theta^*\| + \|\Phi_n(\theta^*) - \theta^*\|,$$
and therefore the iterative procedure (9) converges locally to the true value $\theta^*$ with probability 1 as $n$ approaches infinity.
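To make Theorem 1 concrete, the iteration (9) can be run on the Remark 1 model, for which $\Phi_n$ reduces to a posterior-mean map with closed-form inner expectations. The sketch below is our own specialization (true values, starting point and hyperparameters are chosen arbitrarily); since the distribution of the observed $y$ depends on $\theta_1$ and $\theta_2$ only through their sum, we start on the line $\theta_1 = \theta_2$, where the true value lies:

```python
import numpy as np

# Fixed-point iteration theta^{(k+1)} = Phi_n(theta^{(k)}) of (9) for the
# Remark 1 model -- our own specialization, not code from the paper.
# Here psi(theta) = (theta1^2 + theta2^2)/2, so q_theta in (5) is
# N(beta/alpha, I/alpha) and Phi_n(theta) = beta/alpha; q_{x_i} in (7) is
# N((y_i + theta1 - theta2)/2, 1/2), giving r_i = (m_i, y_i - m_i).
rng = np.random.default_rng(1)
theta_star = np.array([1.0, 1.0])            # true value (chosen arbitrarily)
n, alpha0, beta0 = 20_000, 1.0, np.zeros(2)  # sample size, prior hyperparameters

x = rng.normal(theta_star[0], 1.0, size=n)
y = x + rng.normal(theta_star[1], 1.0, size=n)   # only y enters the algorithm

theta = np.array([0.5, 0.5])                 # starting value near theta_star
for k in range(100):
    m = 0.5 * (y + theta[0] - theta[1])      # <x_i> under q_{x_i}
    r_sum = np.array([m.sum(), (y - m).sum()])
    theta_new = (r_sum + beta0) / (n + alpha0)   # Phi_n(theta) = beta / alpha
    converged = np.linalg.norm(theta_new - theta) < 1e-10
    theta = theta_new
    if converged:
        break

print(theta)  # close to theta_star = (1, 1) for large n
```

The fixed point sits at the regularised sample estimate, and the discrepancy from $\theta^*$ is of the Monte Carlo order $O(n^{-1/2})$, consistent with the local convergence statement.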
4 ASYMPTOTIC NORMALITY OF THE VARIATIONAL POSTERIOR DISTRIBUTION

There have been a large number of contributions about the asymptotic normality of posterior distributions associated with exponential families; see for instance Walker (1969), Heyde and Johnstone (1979), Chen (1985) and Bernardo and Smith (1994). Under appropriate conditions the (true) posterior density converges in distribution to a normal density. In this section, we show that the variational posterior distribution for the parameter $\theta$ obtained by the iterative procedure also has the property of asymptotic normality. This implies that the variational posterior becomes more and more concentrated around the true parameter value as the sample size grows.

Suppose the sample size $n$ is large. We have proved that the algorithm (9) is convergent, so there exists an equilibrium point, denoted by $\tilde{\theta}_n$. It follows from (5) and (7) that, at $\tilde{\theta}_n$,
$$\tilde{\alpha}_n = n + \alpha_0, \qquad \tilde{\beta}_n = \sum_{i=1}^n r_i + \beta_0, \qquad r_i = \langle u(x_i, y_i) \rangle_{x_i},$$
$$q(x_i) = f(x_i, y_i)\, g(\tilde{\theta}_n, y_i) \exp\{\tilde{\theta}_n^\top u(x_i, y_i) - \psi(\tilde{\theta}_n)\}.$$
Therefore, the variational posterior density of $\theta$ at the equilibrium point is
$$q_n(\theta) = h(\tilde{\alpha}_n, \tilde{\beta}_n) \exp\{\theta^\top \tilde{\beta}_n - \tilde{\alpha}_n \psi(\theta)\}.$$
Let $\hat{\theta}_n$ maximise $\theta^\top \tilde{\beta}_n - \tilde{\alpha}_n \psi(\theta)$. Then we have
$$D\psi(\hat{\theta}_n) = \Bigl(\frac{1}{n} \sum_{i=1}^n r_i + \frac{\beta_0}{n}\Bigr) \Big/ \Bigl(1 + \frac{\alpha_0}{n}\Bigr).$$
By the same arguments as used in the previous section, and noting that $\tilde{\theta}_n \to \theta^*$ with probability 1 by Theorem 1, we have that $\frac{1}{n} \sum_{i=1}^n r_i$ converges to $D\psi(\theta^*)$ almost surely. Since $D\psi$ is strictly increasing and continuous, $\hat{\theta}_n \to \theta^*$ with probability 1 as $n$ tends to infinity. Define
$$L_n(\theta) \triangleq \log q_n(\theta) = \log h(\tilde{\alpha}_n, \tilde{\beta}_n) + \theta^\top \tilde{\beta}_n - \tilde{\alpha}_n \psi(\theta).$$
Then we have
$$\Sigma_n \triangleq -[D^2 L_n(\hat{\theta}_n)]^{-1} = [(n + \alpha_0)\, D^2 \psi(\hat{\theta}_n)]^{-1}.$$
Denote by $B(\theta, \varepsilon)$ the open ball of radius $\varepsilon$ centred at $\theta$.
According to Chen (1985), under the assumption of the consistency of $\hat{\theta}_n$ for $\theta^*$, the posterior density $q_n$ converges in distribution to $N(\hat{\theta}_n, \Sigma_n)$ if the following basic conditions hold.

(C1) "Steepness". $\sigma_n^2 \to 0$ with $P_{\theta^*}$-probability 1 as $n \to \infty$, where $\sigma_n^2$ is the largest eigenvalue of $\Sigma_n$.

(C2) "Smoothness". For any $\varepsilon > 0$, there exist an integer $N$ and $\delta > 0$ such that, for any $n > N$ and $\theta \in B(\hat{\theta}_n, \delta)$, $D^2 L_n(\theta)$ exists and satisfies
$$I_m - A(\varepsilon) \le D^2 L_n(\theta)\, [D^2 L_n(\hat{\theta}_n)]^{-1} \le I_m + A(\varepsilon), \quad \text{a.s.},$$
where $A(\varepsilon)$ is an $m \times m$ symmetric positive semidefinite matrix whose largest eigenvalue tends to zero with $P_{\theta^*}$-probability 1 as $\varepsilon \to 0$.

(C3) "Concentration". For any $\delta > 0$,
$$\int_{B(\hat{\theta}_n, \delta)} q_n(\theta)\, d\theta \to 1$$
with $P_{\theta^*}$-probability 1 as $n$ tends to infinity.

In fact, since $\hat{\theta}_n \to \theta^*$, the components of $D^2 \psi(\hat{\theta}_n)$ are bounded above and away from 0 almost surely if $n$ is large enough, so the largest eigenvalue of $\Sigma_n$ tends to 0 and (C1) holds. (C2) is obvious because $D^2 L_n(\theta)\, [D^2 L_n(\hat{\theta}_n)]^{-1} = D^2 \psi(\theta)\, [D^2 \psi(\hat{\theta}_n)]^{-1}$ and $\psi(\cdot)$ is continuously differentiable. From Kass et al. (1990), assumption (iii) in Appendix B is stronger than (C3). Therefore all the conditions are verified.

Acknowledgement

This work was supported by a grant from the UK Science and Engineering Research Council. The authors gratefully acknowledge the reviewers for their valuable comments.

References

Attias, H. (1999). Inferring parameters and structure of latent variable models by variational Bayes. In Prade, H. and Laskey, K., editors, Proc. 15th Conference on Uncertainty in Artificial Intelligence, pages 21–30, Stockholm, Sweden. Morgan Kaufmann Publishers.

Attias, H. (2000). A variational Bayesian framework for graphical models. In Solla, S., Leen, T., and Muller, K.-R., editors, Advances in Neural Information Processing Systems 12, pages 209–215.
MIT Press, Cambridge, MA.

Beal, M. J. (2003). Variational Algorithms for Approximate Bayesian Inference. PhD thesis, University College London.

Bernardo, J. M. and Smith, A. F. M. (1994). Bayesian Theory. John Wiley & Sons, Inc., New York.

Chen, C.-F. (1985). On asymptotic normality of limiting density functions with Bayesian implications. J. R. Statist. Soc. B, 47:540–546.

Ghahramani, Z. and Beal, M. J. (2000). Propagation algorithms for variational Bayesian learning. In Advances in Neural Information Processing Systems 12, pages 507–513. MIT Press, Cambridge, MA.

Hall, P., Humphreys, K., and Titterington, D. M. (2002). On the adequacy of variational lower bound functions for likelihood-based inference in Markovian models with missing values. Journal of the Royal Statistical Society Series B, 64:549–564.

Heyde, C. C. and Johnstone, I. M. (1979). On asymptotic posterior normality for stochastic processes. J. R. Statist. Soc. B, 41:184–189.

Humphreys, K. and Titterington, D. M. (2000). Approximate Bayesian inference for simple mixtures. In Bethlehem, J. G. and van der Heijden, P. G. M., editors, COMPSTAT2000, pages 331–336. Physica-Verlag, Heidelberg.

Humphreys, K. and Titterington, D. M. (2001). Some examples of recursive variational approximations for Bayesian inference. In Opper, M. and Saad, D., editors, Advanced Mean Field Methods: Theory and Practice, pages 179–195. MIT Press.

Kass, R. E., Tierney, L., and Kadane, J. B. (1990). The validity of posterior expansions based on Laplace's method. In Geisser, S., Hodges, J. S., Press, S. J., and Zellner, A., editors, Bayesian and Likelihood Methods in Statistics and Econometrics, pages 473–488. Elsevier Science Publishers, North-Holland.

MacKay, D. J. C. (1997). Ensemble learning for hidden Markov models. Technical report, Cavendish Laboratory, University of Cambridge.

Penny, W. D. and Roberts, S. J. (2000).
Variational Bayes for 1-dimensional mixture models. Technical Report PARG-2000-01, Oxford University.

Walker, A. M. (1969). On the asymptotic behaviour of posterior distributions. J. R. Statist. Soc. B, 31:80–88.

Wang, B. and Titterington, D. M. (2003). Lack of consistency of mean field and variational Bayes approximations for state space models. Technical Report 03-5, University of Glasgow. http://www.stats.gla.ac.uk/Research/TechRep2003/03-5.pdf.

Wang, B. and Titterington, D. M. (2004a). Convergence properties of a general algorithm for calculating variational Bayesian estimates for a normal mixture model. Technical Report 04-3, University of Glasgow. http://www.stats.gla.ac.uk/Research/TechRep2003/04-3.pdf.

Wang, B. and Titterington, D. M. (2004b). Variational Bayesian inference for partially observed diffusions. Technical Report 04-4, University of Glasgow. http://www.stats.gla.ac.uk/Research/TechRep2003/04-4.pdf.

Appendix A

In this appendix we prove that the following convergences hold:
$$\frac{1}{n} D\beta(\theta^*) \to D^2 \psi(\theta^*) - \mathbb{E}_{y_i} \mathbb{E}_{x_i}[\phi]\, \mathbb{E}_{x_i}[\phi^\top], \quad \text{a.s.}, \qquad (17)$$
$$\frac{1}{n} \beta(\theta^*) \to D\psi(\theta^*), \quad \text{a.s.} \qquad (18)$$
In fact, from (6) we have that $D\beta(\theta) = \sum_{i=1}^n D r_i(\theta)$ and
$$\begin{aligned}
D r_i(\theta^*) &= \int u(x_i, y_i)\, D_\theta^\top q_{x_i}(x_i)\, dx_i \\
&= \int u(x_i, y_i)\, f(x_i, y_i)\, D_\theta^\top g(\theta^*, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\}\, dx_i \\
&\quad + \int u(x_i, y_i)\, f(x_i, y_i)\, g(\theta^*, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\} \bigl(u^\top(x_i, y_i) - D^\top \psi(\theta^*)\bigr)\, dx_i \\
&= \int \bigl(u(x_i, y_i) - D\psi(\theta^*)\bigr) f(x_i, y_i)\, D_\theta^\top g(\theta^*, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\}\, dx_i \\
&\quad + \int f(x_i, y_i)\, g(\theta^*, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\} \bigl(u(x_i, y_i) - D\psi(\theta^*)\bigr)\bigl(u^\top(x_i, y_i) - D^\top \psi(\theta^*)\bigr)\, dx_i,
\end{aligned}$$
where in the last equality we used the fact that
$$\int f(x_i, y_i)\, D_\theta g(\theta^*, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\}\, dx_i + \int \bigl(u(x_i, y_i) - D\psi(\theta^*)\bigr) f(x_i, y_i)\, g(\theta^*, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\}\, dx_i = 0, \qquad (19)$$
which is obtained by differentiating, with respect to $\theta$,
$$\int f(x_i, y_i)\, g(\theta^*, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\}\, dx_i = 1.$$
Since it follows from (8) that
$$D_\theta g(\theta^*, y_i) = -\int f(x_i, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\} \bigl(u(x_i, y_i) - D\psi(\theta^*)\bigr)\, dx_i \cdot \Bigl\{\int f(x_i, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\}\, dx_i\Bigr\}^{-2},$$
equality (19) can be rewritten as
$$\begin{aligned}
&\int \bigl(u(x_i, y_i) - D\psi(\theta^*)\bigr) f(x_i, y_i)\, g(\theta^*, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\}\, dx_i \cdot \int f(x_i, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\}\, dx_i \\
&\qquad = \int f(x_i, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\} \bigl(u(x_i, y_i) - D\psi(\theta^*)\bigr)\, dx_i. \qquad (20)
\end{aligned}$$
Differentiating both sides of (20) with respect to $\theta^*$, we have
$$\begin{aligned}
&\Bigl\{\int \bigl(u(x_i, y_i) - D\psi(\theta^*)\bigr) f(x_i, y_i)\, D_\theta^\top g(\theta^*, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\}\, dx_i - D^2 \psi(\theta^*) \\
&\qquad + \int f(x_i, y_i)\, g(\theta^*, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\} \bigl(u(x_i, y_i) - D\psi(\theta^*)\bigr)\bigl(u^\top(x_i, y_i) - D^\top \psi(\theta^*)\bigr)\, dx_i\Bigr\} \cdot \int f(x_i, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\}\, dx_i \\
&\quad + \int \bigl(u(x_i, y_i) - D\psi(\theta^*)\bigr) f(x_i, y_i)\, g(\theta^*, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\}\, dx_i \cdot \int f(x_i, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\} \bigl(u^\top(x_i, y_i) - D^\top \psi(\theta^*)\bigr)\, dx_i \\
&= \int f(x_i, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\} \bigl(u(x_i, y_i) - D\psi(\theta^*)\bigr)\bigl(u^\top(x_i, y_i) - D^\top \psi(\theta^*)\bigr)\, dx_i - \int f(x_i, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\}\, dx_i \cdot D^2 \psi(\theta^*). \qquad (21)
\end{aligned}$$
We define $\phi$ as in (15). The marginal distribution of $y_i$ is $\int p(x_i, y_i \mid \theta^*)\, dx_i$, and therefore it follows from the strong law of large numbers that, with probability 1,
$$\begin{aligned}
\frac{1}{n} \sum_{i=1}^n D_\theta r_i &\to \int \Bigl\{D_\theta r_i \int p(x_i, y_i \mid \theta^*)\, dx_i\Bigr\}\, dy_i \\
&= D^2 \psi(\theta^*) - \int \Bigl\{\mathbb{E}_{x_i}[\phi]\, \mathbb{E}_{x_i}[\phi^\top] \int f(x_i, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\}\, dx_i\Bigr\}\, dy_i \\
&= D^2 \psi(\theta^*) - \mathbb{E}_{y_i} \mathbb{E}_{x_i}[\phi]\, \mathbb{E}_{x_i}[\phi^\top],
\end{aligned}$$
where we have used equality (21) and the fact that
$$\iint f(x_i, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\} \bigl(u(x_i, y_i) - D\psi(\theta^*)\bigr)\bigl(u^\top(x_i, y_i) - D^\top \psi(\theta^*)\bigr)\, dx_i\, dy_i = D^2 \psi(\theta^*), \qquad (22)$$
and $\mathbb{E}_{x_i}$ denotes expectation under $q_{x_i}$. Thus we obtain (17). The derivation of (18) is similar.

Appendix B

In this appendix we show that under our framework the Laplace approximation is justified. The proof consists of verifying the analytical assumptions for Laplace's method in Kass, Tierney and Kadane (1990), which are listed here for convenience.
Since $a_n$ defined in (11) is of a random nature, some minor revisions are made to adapt to our setting. Suppose that $\{a_n : n = 1, 2, \ldots\}$ is a sequence of six-times continuously differentiable real functions and that $b$ is a four-times continuously differentiable function of $\theta$. The pair $(\{a_n\}, b)$ is said to satisfy the analytical assumptions for Laplace's method if there exist positive numbers $\varepsilon$, $M$ and $\eta$, and an integer $n_0$, such that $n > n_0$ implies the following:

(i) for all $\theta \in B(\hat{\theta}_n, \varepsilon)$ and all $1 \le j_1, \ldots, j_d \le m$ with $0 \le d \le 6$, $|\partial_{j_1 \cdots j_d} a_n(\theta)| < M$ with $P_{\theta^*}$-probability 1;

(ii) $\det(D^2 a_n(\hat{\theta}_n)) > \eta$ with $P_{\theta^*}$-probability 1;

(iii) the integral $h_b$ defined in equation (12) exists and is finite, and, for all $\delta$ for which $0 < \delta < \varepsilon$ and $B(\hat{\theta}_n, \delta) \subseteq \Theta$,
$$\det(n D^2 a_n(\hat{\theta}_n))^{1/2} \int_{\Theta - B(\hat{\theta}_n, \delta)} b(\theta) \exp\{n (a_n(\hat{\theta}_n) - a_n(\theta))\}\, d\theta = O(n^{-2})$$
with $P_{\theta^*}$-probability 1; or, more strongly,

(iii') for all $\delta$ for which $0 < \delta < \varepsilon$ and $B(\hat{\theta}_n, \delta) \subseteq \Theta$,
$$\limsup_{n \to \infty}\, \sup_\theta \{a_n(\hat{\theta}_n) - a_n(\theta) : \theta \in \Theta - B(\hat{\theta}_n, \delta)\} < 0$$
with $P_{\theta^*}$-probability 1.

According to Kass et al. (1990), we have the following lemma.

Lemma 1.
If $(\{a_n\}, b)$ satisfy the analytical assumptions for Laplace's method, then
$$\begin{aligned}
\int_\Theta b(\theta) \exp\{-n a_n(\theta)\}\, d\theta
={}& (2\pi)^{m/2} [\det(n D^2 a_n)]^{-1/2} \exp\{-n a_n(\hat{\theta}_n)\} \cdot \Bigl\{ b(\hat{\theta}_n) + \frac{1}{n} \Bigl[ \frac{1}{2} \sum_{i,j=1}^m \sigma_n^{ij} b_{ij} - \frac{1}{6} \sum_{i,j,k,s=1}^m a_n^{ijk} b_s\, \mu^4_{ijks} \\
&+ \frac{1}{72}\, b(\hat{\theta}_n) \sum_{i,j,k,q,r,s=1}^m a_n^{ijk} a_n^{qrs}\, \mu^6_{ijkqrs} - \frac{1}{24}\, b(\hat{\theta}_n) \sum_{i,j,k,s=1}^m a_n^{ijks}\, \mu^4_{ijks} \Bigr] + O(n^{-2}) \Bigr\}, \quad \text{a.s.}, \qquad (23)
\end{aligned}$$
where $\mu^4_{ijks}$ and $\mu^6_{ijkqrs}$ are the fourth and sixth central moments of a multivariate normal distribution having covariance matrix $(D^2 a_n)^{-1}$; that is,
$$\mu^4_{ijks} = \sigma_n^{ij} \sigma_n^{ks} + \sigma_n^{ik} \sigma_n^{js} + \sigma_n^{is} \sigma_n^{jk},$$
$$\begin{aligned}
\mu^6_{ijkqrs} ={}& \sigma_n^{ij}\sigma_n^{kq}\sigma_n^{rs} + \sigma_n^{ij}\sigma_n^{kr}\sigma_n^{qs} + \sigma_n^{ij}\sigma_n^{ks}\sigma_n^{qr} + \sigma_n^{ik}\sigma_n^{jq}\sigma_n^{rs} + \sigma_n^{ik}\sigma_n^{jr}\sigma_n^{qs} + \sigma_n^{ik}\sigma_n^{js}\sigma_n^{qr} \\
&+ \sigma_n^{iq}\sigma_n^{jk}\sigma_n^{rs} + \sigma_n^{iq}\sigma_n^{jr}\sigma_n^{ks} + \sigma_n^{iq}\sigma_n^{js}\sigma_n^{kr} + \sigma_n^{ir}\sigma_n^{jk}\sigma_n^{qs} + \sigma_n^{ir}\sigma_n^{jq}\sigma_n^{ks} + \sigma_n^{ir}\sigma_n^{js}\sigma_n^{kq} \\
&+ \sigma_n^{is}\sigma_n^{jk}\sigma_n^{qr} + \sigma_n^{is}\sigma_n^{jq}\sigma_n^{kr} + \sigma_n^{is}\sigma_n^{jr}\sigma_n^{kq},
\end{aligned}$$
where $D^2 a_n$ denotes the Hessian of $a_n$, its $(i, j)$-component is written as $a_n^{ij}$ and the components of its inverse are written as $\sigma_n^{ij}$; moreover, $b_s$ and $b_{ij}$ denote the components of the first- and second-order derivatives of $b$, respectively. All derivatives are evaluated at $\hat{\theta}_n$.

Now we verify the assumptions (i)-(iii). Under our assumptions, it has been shown in (18) that $\frac{1}{n} \sum_{i=1}^n r_i \to D\psi(\theta^*)$ with probability 1, so, when $n$ is large enough, $\frac{1}{n} \sum_{i=1}^n r_i$ is almost surely bounded in $B(\hat{\theta}_n, \varepsilon)$. Since $\psi$ is continuously differentiable, (i) obviously holds. Condition (ii) is one of our assumptions. As $n$ tends to infinity, for any $\theta \in \Theta$, $a_n(\theta)$ converges with $P_{\theta^*}$-probability 1 to
$$a_0(\theta) = \psi(\theta) - \theta^\top D\psi(\theta^*).$$
Since $\hat{\theta}_n$ minimises $a_n$, we have
$$\hat{\theta}_n = (D\psi)^{-1}\Bigl(\Bigl(\frac{1}{n} \sum_{i=1}^n r_i + \frac{\beta_0}{n}\Bigr) \Big/ \Bigl(1 + \frac{\alpha_0}{n}\Bigr)\Bigr),$$
so it follows that, as $n$ tends to infinity, with probability 1, $\hat{\theta}_n \to (D\psi)^{-1}(D\psi(\theta^*)) = \theta^*$. Therefore, for all $\delta$ for which $0 < \delta < \varepsilon$, and for all $\varepsilon_0$ satisfying $0 < \varepsilon_0 < \delta/2$, there exists an integer $N$ such that, if $n > N$, it holds that, for all $\theta \in \Theta$,
$$|a_n(\theta) - a_0(\theta)| < \varepsilon_0, \qquad \|\hat{\theta}_n - \theta^*\| < \varepsilon_0, \quad \text{a.s.}, \qquad |a_0(\hat{\theta}_n) - a_0(\theta^*)| < \varepsilon_0, \quad \text{a.s.}$$
Thus,
$$a_n(\hat{\theta}_n) - a_n(\theta) = \bigl(a_n(\hat{\theta}_n) - a_0(\hat{\theta}_n)\bigr) + \bigl(a_0(\hat{\theta}_n) - a_0(\theta^*)\bigr) + \bigl(a_0(\theta^*) - a_0(\theta)\bigr) + \bigl(a_0(\theta) - a_n(\theta)\bigr) < a_0(\theta^*) - a_0(\theta) + 3\varepsilon_0, \quad \text{a.s.},$$
so that
$$\begin{aligned}
\sup\{a_n(\hat{\theta}_n) - a_n(\theta) : \theta \in \Theta - B(\hat{\theta}_n, \delta)\}
&\le \sup\{a_0(\theta^*) - a_0(\theta) : \theta \in \Theta - B(\hat{\theta}_n, \delta)\} + 3\varepsilon_0 \\
&\le \sup\{a_0(\theta^*) - a_0(\theta) : \theta \in \Theta - B(\theta^*, \delta - \varepsilon_0)\} + 3\varepsilon_0, \quad \text{a.s.}, \qquad (24)
\end{aligned}$$
since $B(\theta^*, \delta - \varepsilon_0) \subset B(\hat{\theta}_n, \delta)$. Since $a_0(\cdot)$ is strictly convex, for $\theta \in \Theta - B(\theta^*, \delta - \varepsilon_0)$ we have $a_0(\theta) - a_0(\theta^*) > c$, where
$$c = \inf\{a_0(\theta) - a_0(\theta^*) : \theta \text{ lies on the boundary of } B(\theta^*, \delta/2)\} > 0.$$
Consequently, we get
$$\sup\{a_0(\theta^*) - a_0(\theta) : \theta \in \Theta - B(\theta^*, \delta - \varepsilon_0)\} \le -c.$$
Combining the last estimate with (24), we have that, for all $\varepsilon_0$ satisfying $0 < \varepsilon_0 < \delta/2$, there exists an integer $N$ such that $n > N$ implies
$$\sup\{a_n(\hat{\theta}_n) - a_n(\theta) : \theta \in \Theta - B(\hat{\theta}_n, \delta)\} \le -c + 3\varepsilon_0, \quad \text{a.s.};$$
that is, (iii') holds.
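The leading term of Lemma 1 is the standard Laplace approximation $\int_\Theta b(\theta) e^{-n a_n(\theta)} d\theta \approx (2\pi)^{m/2} [\det(n D^2 a_n)]^{-1/2} e^{-n a_n(\hat{\theta}_n)}\, b(\hat{\theta}_n)$. As a one-dimensional sanity check (our own illustration, not from the paper), take the strictly convex $a(\theta) = e^\theta - \theta$ and $b \equiv 1$:

```python
import numpy as np

# One-dimensional sanity check of the leading term of Lemma 1 (our own
# illustration).  Take a(theta) = exp(theta) - theta, which is strictly
# convex with minimiser theta_hat = 0, a(theta_hat) = 1, a''(theta_hat) = 1,
# and b = 1.  Laplace's method predicts
#     int exp(-n a(theta)) dtheta  ~  sqrt(2 pi / n) * exp(-n a(theta_hat)),
# with relative error O(1/n).
n = 200.0
theta = np.linspace(-1.5, 1.5, 400_001)
dx = theta[1] - theta[0]
a = np.exp(theta) - theta

# Factor exp(n * a(theta_hat)) = exp(n) out of both sides to avoid underflow.
integral = np.sum(np.exp(-n * (a - 1.0))) * dx
laplace = np.sqrt(2.0 * np.pi / n)

print(integral / laplace)  # -> 1 as n grows; here well within 1%
```

Truncating the integral to $[-1.5, 1.5]$ is harmless here because the integrand is exponentially negligible outside any neighbourhood of the minimiser, which is exactly the content of condition (iii').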