Convergence and asymptotic normality of variational Bayesian approximations for exponential family models with missing values

Bo Wang
Department of Statistics, University of Glasgow, Glasgow G12 8QQ, Scotland, U.K.

D. M. Titterington
Department of Statistics, University of Glasgow, Glasgow G12 8QQ, Scotland, U.K.

UAI 2004

Abstract

We study the properties of variational Bayes approximations for exponential family models with missing values. It is shown that the iterative algorithm for obtaining the variational Bayesian estimator converges locally to the true value with probability 1 as the sample size becomes indefinitely large. Moreover, the variational posterior distribution is proved to be asymptotically normal.

1 INTRODUCTION

Variational Bayes approximations have recently been applied to complex models involving incomplete data, for which computational difficulties arise with the ideal Bayesian approach. Such models include hidden Markov models and mixture models; see for example Attias (1999, 2000); Beal (2003); Ghahramani and Beal (2000); Humphreys and Titterington (2000, 2001); MacKay (1997); Penny and Roberts (2000); Wang and Titterington (2004b). In these earlier contributions, the approximations were shown empirically to be convergent and effective. However, little has been done to investigate their theoretical properties, and the purpose of this paper is to go some way towards rectifying this.

Hall, Humphreys and Titterington (2002) initiated a discussion of these issues and proved that, for certain Markov models, the parameter estimator obtained by maximising the variational lower bound function is asymptotically consistent provided the proportion of all values that are missing tends to zero.
Later we proved in Wang and Titterington (2003) that it is not always the case that a fully factorised form of variational posterior, which includes the factorisation of the joint probability function for the hidden states, provides an asymptotically consistent estimator as the 'sample size' becomes large. We demonstrated this in particular in the context of linear state space models, in which the above sufficient condition obviously does not hold. On the other hand, we showed in Wang and Titterington (2004a) that variational Bayes estimators for certain mixture models are asymptotically efficient for large sample sizes.

In this paper we study the properties of variational approximation algorithms for more general models, namely exponential family models with missing values. Exponential families include cases such as Gaussian, gamma, Poisson, Dirichlet and Wishart distributions, and exponential family models with missing values contain many models of practical interest as particular cases, such as Gaussian mixtures, hidden Markov models and linear state space models. Beal (2003) and Ghahramani and Beal (2000) applied the variational Bayesian method to these models and derived the iterative algorithm for learning the approximate posterior distributions of the latent states and the model parameters. The numerical experiments therein show empirically that this algorithm is convergent and efficient. In this paper we derive the iterative procedure for obtaining the variational Bayesian estimator, we provide analytical proofs of local convergence of the procedure as the sample size tends to infinity, and we show that the variational posterior distribution for the parameters is asymptotically normal.

2 EXPONENTIAL FAMILY MODELS WITH MISSING VALUES AND VARIATIONAL APPROXIMATIONS

We consider the following exponential family models with missing values.
Suppose that $\Theta$ is an open subset of $\mathbb{R}^m$, that $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ is a family of probability distributions on a measurable space $(\Omega, \mathcal{F})$, and that $x$ and $y$ are sampled from the natural exponential family with density
$$p(x, y \mid \theta) = f(x, y) \exp\{\theta^\top u(x, y) - \psi(\theta)\}, \qquad (1)$$
with $x$ taking values in $\mathbb{R}^d$ and $y$ in $\mathbb{R}^p$, where $\theta \in \Theta$ is the unknown parameter, and $\psi(\cdot) : \Theta \mapsto \mathbb{R}$ is six-times continuously differentiable and has positive definite Hessian matrix on $\Theta$. The parameter $\theta$ has a prior conjugate to the complete-data likelihood (1), with density
$$p(\theta \mid \alpha_0, \beta_0) = h(\alpha_0, \beta_0) \exp\{\theta^\top \beta_0 - \alpha_0 \psi(\theta)\}, \qquad (2)$$
where $h$ is a normalising constant satisfying
$$h(\alpha, \beta)^{-1} = \int_\Theta \exp\{\theta^\top \beta - \alpha \psi(\theta)\}\, d\theta, \qquad (3)$$
and $\alpha_0 \in \mathbb{R}$, $\beta_0 \in \mathbb{R}^m$ are the hyperparameters of the prior.

Remark 1. The models of the forms (1) and (2) include most latent-variable models of practical interest. A simple example is when $x$ is sampled from a univariate Gaussian distribution with mean $\theta_1$ and variance 1, and $y = x + w$, where $w$ is sampled from another Gaussian distribution, independent of $x$, with mean $\theta_2$ and variance 1. The joint probability density is
$$p(x, y \mid \theta_1, \theta_2) = \exp\bigl\{-\tfrac{1}{2}x^2 - \tfrac{1}{2}(y - x)^2 + \theta_1 x + \theta_2 (y - x) - \tfrac{1}{2}(\theta_1^2 + \theta_2^2) - \log(2\pi)\bigr\},$$
and the parameters $\theta_1$ and $\theta_2$ have independent Gaussian prior distributions with the same variance.

Suppose that only $y$ is observable whereas $x$ is latent. We have a data-set consisting of a random sample of size $n$, with $Y = (y_1, y_2, \ldots, y_n)$ and $X = (x_1, x_2, \ldots, x_n)$. In the Bayesian framework we want to infer the posteriors over both the parameters and the hidden states. Unfortunately, exact Bayesian inference is generally time-consuming, if not impossible, especially for large dimensionality $m$. Therefore approximation is usually necessary in these cases.
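The Remark 1 model is convenient for numerical checks. A minimal simulation sketch (our own illustration, not part of the paper; parameter values are arbitrary) draws the latent $x$ and the observed $y = x + w$, and confirms that marginally $y \sim N(\theta_1 + \theta_2, 2)$:

```python
import numpy as np

# Simulate the Remark 1 model (illustrative sketch, not from the paper):
# latent x ~ N(theta1, 1), independent noise w ~ N(theta2, 1), observed
# y = x + w.  Marginally y ~ N(theta1 + theta2, 2).
rng = np.random.default_rng(0)
theta1, theta2 = 1.0, -0.5
n = 100_000

x = rng.normal(theta1, 1.0, size=n)        # latent states
y = x + rng.normal(theta2, 1.0, size=n)    # the observed sample

print(y.mean(), y.var())  # should be near theta1 + theta2 = 0.5 and 2.0
```

Note that the distribution of $y$ depends on $\theta_1$ and $\theta_2$ only through their sum, which is why inference must combine the observed data with the prior.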
In the variational approach, the true posterior $p(X, \theta \mid Y)$ is approximated by the variational distribution $q(X, \theta)$, which factorises as $q(X, \theta) = q_X(X) q_\theta(\theta)$ and is chosen to maximise the negative free energy
$$\int q(X, \theta) \log \frac{p(\theta, X, Y)}{q(X, \theta)}\, d\theta\, dX, \qquad (4)$$
equivalent to minimising the Kullback-Leibler divergence between the exact and approximate distributions of $\theta$ and $X$, given $Y$.

The negative free energy (4) can be maximised using the following iterative procedure (Beal (2003); Ghahramani and Beal (2000)), in which the following two stages are performed in turn.

(i) Optimise $q_\theta(\theta)$ for fixed $\{q_{x_i}(x_i),\, i = 1, \ldots, n\}$, defined in (ii) below. This step results in
$$q_\theta(\theta) = h(\alpha, \beta) \exp\{\theta^\top \beta - \alpha \psi(\theta)\}, \qquad (5)$$
where $\alpha$ and $\beta$ are the hyperparameters of the variational posterior and are updated by
$$\alpha = n + \alpha_0, \qquad \beta = \sum_{i=1}^n r_i + \beta_0, \qquad r_i = \langle u(x_i, y_i) \rangle_{x_i}. \qquad (6)$$
Here $\langle \cdot \rangle_{x_i}$ denotes the expectation under $q_{x_i}(x_i)$.

(ii) Optimise $q_X(X)$ for fixed $q_\theta(\theta)$. This leads to the factorised form $q_X(X) = \prod_{i=1}^n q_{x_i}(x_i)$, where
$$q_{x_i}(x_i) = f(x_i, y_i)\, g(\theta, y_i) \exp\{\langle \theta \rangle_\theta^\top u(x_i, y_i) - \psi(\langle \theta \rangle_\theta)\}, \qquad (7)$$
in which $g(\theta, y_i)$ is a normalising constant satisfying
$$g(\theta, y_i)^{-1} = \int f(x_i, y_i) \exp\{\langle \theta \rangle_\theta^\top u(x_i, y_i) - \psi(\langle \theta \rangle_\theta)\}\, dx_i, \qquad (8)$$
and $\langle \cdot \rangle_\theta$ denotes the expectation under $q_\theta(\theta)$.

3 THE ITERATIVE ALGORITHM AND ITS CONVERGENCE

We define the variational Bayesian estimator $\hat{\theta}$ of the parameter $\theta$ as
$$\hat{\theta} = \int_\Theta \theta\, q_{\mathrm{pos}}(\theta)\, d\theta,$$
where $q_{\mathrm{pos}}$ is the variational posterior density of $\theta$, given by the limiting form of $q_\theta(\theta)$ that results from the above iterative procedure. For the exponential family distribution (5) the corresponding variational Bayesian estimator is
$$\hat{\theta} = \int_\Theta \theta\, q_\theta(\theta)\, d\theta = -\frac{D_\beta h(\alpha, \beta)}{h(\alpha, \beta)}.$$
(Throughout the paper, $D\Psi$ and $D^2\Psi$ denote the gradient and the Hessian of $\Psi$.
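For the Remark 1 model the two stages above have closed forms. The following derivation is our own specialization (it is not spelled out in the paper), using the sufficient statistic $u(x,y) = (x,\, y-x)^\top$ and log-partition $\psi(\theta) = \tfrac12(\theta_1^2 + \theta_2^2)$:

```latex
\text{Step (i): since } \psi(\theta)=\tfrac12\,\theta^\top\theta \text{ is quadratic, (5) is Gaussian:}
\qquad
q_\theta(\theta)\propto\exp\bigl\{\theta^\top\beta-\tfrac{\alpha}{2}\,\theta^\top\theta\bigr\}
\;\Longrightarrow\;
q_\theta=N\bigl(\beta/\alpha,\;\alpha^{-1}I_2\bigr),
\qquad
\langle\theta\rangle_\theta=\beta/\alpha .
```

```latex
\text{Step (ii): with } f(x,y)=\exp\bigl\{-\tfrac12 x^2-\tfrac12(y-x)^2-\log 2\pi\bigr\},
\text{ collecting the terms in } x_i \text{ in (7) gives}
\qquad
q_{x_i}(x_i)=N\bigl(m_i,\tfrac12\bigr),
\quad
m_i=\tfrac12\bigl(y_i+\langle\theta_1\rangle-\langle\theta_2\rangle\bigr),
\quad
r_i=\bigl(m_i,\;y_i-m_i\bigr)^\top .
```

Thus each sweep of (i) and (ii) for this model costs $O(n)$ arithmetic operations.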
When ambiguity exists, the specific variable of differentiation appears as a subscript of the symbols $D$ and $D^2$.)

Thus, the procedure in the previous section can be used to derive the following algorithm for obtaining the variational Bayesian estimate of $\theta$: starting with some initial value $\theta^{(0)}$, successive iterates are defined inductively by
$$\theta^{(k+1)} \triangleq \Phi_n(\theta^{(k)}) = -\frac{D_\beta h(\alpha, \beta)}{h(\alpha, \beta)}, \qquad (9)$$
where $\alpha$ and $\beta$ are given as in (6), and
$$q_{x_i}(x_i) = f(x_i, y_i)\, g(\theta^{(k)}, y_i) \exp\{(\theta^{(k)})^\top u(x_i, y_i) - \psi(\theta^{(k)})\},$$
$$g(\theta^{(k)}, y_i)^{-1} = \int f(x_i, y_i) \exp\{(\theta^{(k)})^\top u(x_i, y_i) - \psi(\theta^{(k)})\}\, dx_i.$$

It is of interest to investigate the questions of whether or not the algorithm (9) is convergent and, if so, what properties are possessed by the limiting value. The following theorem gives a partial answer.

Theorem 1. With probability 1 as $n$ approaches infinity, the iterative procedure (9) converges locally to the true value $\theta^*$; i.e., (9) converges to $\theta^*$ whenever the starting value is sufficiently near to $\theta^*$.

Proof. Define the norm of $\theta \in \mathbb{R}^m$ as $\|\theta\| \triangleq (\theta^\top \theta)^{1/2}$ and the norm of the real $m \times m$ matrix $A$ as $\|A\| \triangleq \sup_{\|\theta\| = 1} \|A\theta\|$. We first prove that, with probability 1 as $n$ approaches infinity, the operator $\Phi_n$ is locally contractive; that is, there exists a number $\lambda$, $0 \le \lambda < 1$, such that
$$\|\Phi_n(\bar{\theta}) - \Phi_n(\theta^*)\| \le \lambda \|\bar{\theta} - \theta^*\|, \qquad (10)$$
whenever $\bar{\theta}$ lies sufficiently near $\theta^*$. Since $\bar{\theta}$ is near $\theta^*$ we can write
$$\Phi_n(\bar{\theta}) - \Phi_n(\theta^*) = D\Phi_n(\theta^*)(\bar{\theta} - \theta^*) + O(\|\bar{\theta} - \theta^*\|^2),$$
where $D\Phi_n(\theta^*)$ denotes the gradient of $\Phi_n(\theta)$ evaluated at $\theta^*$. It follows that
$$\|\Phi_n(\bar{\theta}) - \Phi_n(\theta^*)\| \le \|D\Phi_n(\theta^*)\| \cdot \|\bar{\theta} - \theta^*\| = \sup_{\|\theta\| = 1} |\theta^\top D\Phi_n(\theta^*)\, \theta| \cdot \|\bar{\theta} - \theta^*\|.$$
Consequently, it is sufficient to show that $D\Phi_n(\theta^*)$ converges with probability 1 to a matrix which has norm less than 1.

Write $\beta$ and $r_i$ as $\beta(\theta)$ and $r_i(\theta)$ to indicate explicitly their dependence on $\theta$. From (9) one has
$$D\Phi_n(\theta^*) = \frac{D_\beta h(\alpha, \beta)\, D_\beta^\top h(\alpha, \beta) - h(\alpha, \beta)\, D_\beta^2 h(\alpha, \beta)}{h^2(\alpha, \beta)}\, D\beta(\theta^*).$$
Here $h$ and its derivatives are evaluated at $\theta^*$. For convenience we write $h(\alpha, \beta)^{-1}$ evaluated at $\theta^*$ as $\tilde{h}(\alpha, \beta)$, from which
$$D_\beta \tilde{h}(\alpha, \beta) = \int_\Theta \exp\{\theta^\top \beta - \alpha \psi(\theta)\}\, \theta\, d\theta, \qquad D_\beta^2 \tilde{h}(\alpha, \beta) = \int_\Theta \exp\{\theta^\top \beta - \alpha \psi(\theta)\}\, \theta \theta^\top\, d\theta.$$
Let $b(\cdot) : \mathbb{R}^m \mapsto \mathbb{R}$ be a four-times continuously differentiable function of $\theta$ and write
$$a_n(\theta) = \Bigl(1 + \frac{\alpha_0}{n}\Bigr) \psi(\theta) - \theta^\top \Bigl(\frac{1}{n} \sum_{i=1}^n r_i + \frac{\beta_0}{n}\Bigr), \qquad (11)$$
$$h_b = \int_\Theta b(\theta) \exp\{-n a_n(\theta)\}\, d\theta. \qquad (12)$$
Since $\psi(\theta)$ is continuously differentiable and has positive definite Hessian matrix, it is obvious that $a_n(\theta)$ is also continuously differentiable and strictly convex in $\theta$. Thus, if we let $\hat{\theta}_n$ solve the equation
$$D\psi(\theta) = \Bigl(\frac{1}{n} \sum_{i=1}^n r_i + \frac{\beta_0}{n}\Bigr) \Big/ \Bigl(1 + \frac{\alpha_0}{n}\Bigr), \qquad (13)$$
then $\hat{\theta}_n$ is also the unique global minimiser of $a_n(\theta)$ on $\Theta$. It is obvious that $D^2 a_n$ converges to $D^2 \psi$ with probability 1 as $n \to \infty$. By Lemma 1 in Appendix B, letting $b(\theta)$ be $1$, $\theta_i$ and $\theta_i \theta_j$ ($i, j = 1, \ldots, m$) correspondingly in (23), and after a straightforward calculation, we obtain that, as $n$ tends to infinity, with probability 1,
$$n\, \frac{D_\beta^{2,ij} \tilde{h}(\alpha, \beta)}{\tilde{h}(\alpha, \beta)} - n\, \frac{D_\beta^i \tilde{h}(\alpha, \beta)\, D_\beta^j \tilde{h}(\alpha, \beta)}{\tilde{h}^2(\alpha, \beta)} \;\to\; \frac{1}{2} \sigma_\infty^{ij} = \frac{1}{2} [D^2 \psi(\theta^*)]^{-1}_{ij}. \qquad (14)$$
In Appendix A we prove that, as $n \to \infty$,
$$\frac{1}{n} D\beta(\theta^*) \to D^2 \psi(\theta^*) - \mathbb{E}_{y_i}\bigl(\mathbb{E}_{x_i}[\phi]\, \mathbb{E}_{x_i}[\phi^\top]\bigr), \quad \text{a.s.},$$
where 'a.s.' means 'almost surely' and $\phi$ is defined as
$$\phi = u(x_i, y_i) - D\psi(\theta^*). \qquad (15)$$
Therefore, combining (14) with the last limiting result we obtain that, with probability 1,
$$D\Phi_n(\theta^*) \to \frac{1}{2} [D^2 \psi(\theta^*)]^{-1} \bigl(D^2 \psi(\theta^*) - \mathbb{E}_{y_i} \mathbb{E}_{x_i}[\phi]\, \mathbb{E}_{x_i}[\phi^\top]\bigr) = \frac{1}{2} I_m - \frac{1}{2} [D^2 \psi(\theta^*)]^{-1}\, \mathbb{E}_{y_i} \mathbb{E}_{x_i}[\phi]\, \mathbb{E}_{x_i}[\phi^\top],$$
where $I_m$ denotes the $m \times m$ identity matrix. Since $D^2 \psi(\theta^*)$ is positive definite and symmetric, and obviously $\mathbb{E}_{y_i} \mathbb{E}_{x_i}[\phi]\, \mathbb{E}_{x_i}[\phi^\top]$ is positive semidefinite and symmetric, $D\Phi_n(\theta^*) \le \frac{1}{2} I_m$ as $n$ tends to infinity; that is, $D\Phi_n(\theta^*) - \frac{1}{2} I_m$ is negative semidefinite in the limit.

Next we show that $[D^2 \psi(\theta^*)]^{-1}\, \mathbb{E}_{y_i} \mathbb{E}_{x_i}[\phi]\, \mathbb{E}_{x_i}[\phi^\top] \le I_m$. Since $D^2 \psi(\theta^*)$ is positive definite and symmetric, it is sufficient to prove that
$$\theta^\top\, \mathbb{E}_{y_i} \mathbb{E}_{x_i}[\phi]\, \mathbb{E}_{x_i}[\phi^\top]\, \theta \le \theta^\top D^2 \psi(\theta^*)\, \theta \qquad (16)$$
for any $\theta \in \mathbb{R}^m$. In fact, we have
$$\begin{aligned}
\theta^\top\, \mathbb{E}_{y_i} \mathbb{E}_{x_i}[\phi]\, \mathbb{E}_{x_i}[\phi^\top]\, \theta
&= \mathbb{E}_{y_i} \Bigl\{ \sum_{j,k=1}^m \theta_j \theta_k\, \mathbb{E}_{x_i}[\phi_j]\, \mathbb{E}_{x_i}[\phi_k] \Bigr\} \\
&= \mathbb{E}_{y_i} \Bigl\{ \sum_{j,k=1}^m \theta_j \theta_k\, \mathbb{E}_{x_i}[\phi_j \phi_k] - \sum_{j,k=1}^m \theta_j \theta_k\, \mathbb{E}_{x_i}\bigl[(\phi_j - \mathbb{E}_{x_i}[\phi_j])(\phi_k - \mathbb{E}_{x_i}[\phi_k])\bigr] \Bigr\} \\
&= \mathbb{E}_{y_i} \Bigl\{ \sum_{j,k=1}^m \theta_j \theta_k\, \mathbb{E}_{x_i}[\phi_j \phi_k] - \theta^\top\, \mathbb{E}_{x_i}\bigl[(\phi - \mathbb{E}_{x_i}[\phi])(\phi - \mathbb{E}_{x_i}[\phi])^\top\bigr]\, \theta \Bigr\} \\
&\le \sum_{j,k=1}^m \theta_j \theta_k\, \mathbb{E}_{y_i} \mathbb{E}_{x_i}[\phi_j \phi_k] = \theta^\top D^2 \psi(\theta^*)\, \theta,
\end{aligned}$$
where the last equality is a consequence of (22). Therefore, we obtain $0 \le D\Phi_n(\theta^*) \le \frac{1}{2} I_m$ in the limit, and consequently the inequality (10) holds with $\lambda = 1/2$. Moreover, if we use Laplace's approximation (23) it is easy to deduce that $\Phi_n(\theta^*) = -D_\beta h(\alpha, \beta)/h(\alpha, \beta) \to \theta^*$ with probability 1 as $n$ tends to infinity. Therefore, if the starting value is sufficiently near to $\theta^*$ we have
$$\|\theta^{(k+1)} - \theta^*\| \le \|\Phi_n(\theta^{(k)}) - \Phi_n(\theta^*)\| + \|\Phi_n(\theta^*) - \theta^*\| \le \lambda \|\theta^{(k)} - \theta^*\| + \|\Phi_n(\theta^*) - \theta^*\|,$$
and therefore the iterative procedure (9) converges locally to the true value $\theta^*$ with probability 1 as $n$ approaches infinity.
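To make Theorem 1 concrete, the iteration (9) can be run on the Remark 1 model, for which $\Phi_n$ reduces to a posterior-mean map with closed-form inner expectations. The sketch below is our own specialization (true values, starting point and hyperparameters are chosen arbitrarily); since the distribution of the observed $y$ depends on $\theta_1$ and $\theta_2$ only through their sum, we start on the line $\theta_1 = \theta_2$, where the true value lies:

```python
import numpy as np

# Fixed-point iteration theta^{(k+1)} = Phi_n(theta^{(k)}) of (9) for the
# Remark 1 model -- our own specialization, not code from the paper.
# Here psi(theta) = (theta1^2 + theta2^2)/2, so q_theta in (5) is
# N(beta/alpha, I/alpha) and Phi_n(theta) = beta/alpha; q_{x_i} in (7) is
# N((y_i + theta1 - theta2)/2, 1/2), giving r_i = (m_i, y_i - m_i).
rng = np.random.default_rng(1)
theta_star = np.array([1.0, 1.0])            # true value (chosen arbitrarily)
n, alpha0, beta0 = 20_000, 1.0, np.zeros(2)  # sample size, prior hyperparameters

x = rng.normal(theta_star[0], 1.0, size=n)
y = x + rng.normal(theta_star[1], 1.0, size=n)   # only y enters the algorithm

theta = np.array([0.5, 0.5])                 # starting value near theta_star
for k in range(100):
    m = 0.5 * (y + theta[0] - theta[1])      # <x_i> under q_{x_i}
    r_sum = np.array([m.sum(), (y - m).sum()])
    theta_new = (r_sum + beta0) / (n + alpha0)   # Phi_n(theta) = beta / alpha
    converged = np.linalg.norm(theta_new - theta) < 1e-10
    theta = theta_new
    if converged:
        break

print(theta)  # close to theta_star = (1, 1) for large n
```

The fixed point sits at the regularised sample estimate, and the discrepancy from $\theta^*$ is of the Monte Carlo order $O(n^{-1/2})$, consistent with the local convergence statement.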
4 ASYMPTOTIC NORMALITY OF THE VARIATIONAL POSTERIOR DISTRIBUTION

There have been a large number of contributions about the asymptotic normality of posterior distributions associated with exponential families; see for instance Walker (1969), Heyde and Johnstone (1979), Chen (1985) and Bernardo and Smith (1994). Under appropriate conditions the (true) posterior density converges in distribution to a normal density. In this section, we show that the variational posterior distribution for the parameter $\theta$ obtained by the iterative procedure also has the property of asymptotic normality. This implies that the variational posterior becomes more and more concentrated around the true parameter value as the sample size grows.

Suppose the sample size $n$ is large. We have proved that the algorithm (9) is convergent, so there exists an equilibrium point, denoted by $\tilde{\theta}_n$. It follows from (5) and (7) that, at $\tilde{\theta}_n$,
$$\tilde{\alpha}_n = n + \alpha_0, \qquad \tilde{\beta}_n = \sum_{i=1}^n r_i + \beta_0, \qquad r_i = \langle u(x_i, y_i) \rangle_{x_i},$$
$$q(x_i) = f(x_i, y_i)\, g(\tilde{\theta}_n, y_i) \exp\{\tilde{\theta}_n^\top u(x_i, y_i) - \psi(\tilde{\theta}_n)\}.$$
Therefore, the variational posterior density of $\theta$ at the equilibrium point is
$$q_n(\theta) = h(\tilde{\alpha}_n, \tilde{\beta}_n) \exp\{\theta^\top \tilde{\beta}_n - \tilde{\alpha}_n \psi(\theta)\}.$$
Let $\hat{\theta}_n$ maximise $\theta^\top \tilde{\beta}_n - \tilde{\alpha}_n \psi(\theta)$. Then we have
$$D\psi(\hat{\theta}_n) = \Bigl(\frac{1}{n} \sum_{i=1}^n r_i + \frac{\beta_0}{n}\Bigr) \Big/ \Bigl(1 + \frac{\alpha_0}{n}\Bigr).$$
By the same arguments as used in the previous section, and noting that $\tilde{\theta}_n \to \theta^*$ with probability 1 by Theorem 1, we have that $\frac{1}{n} \sum_{i=1}^n r_i$ converges to $D\psi(\theta^*)$ almost surely. Since $D\psi$ is strictly increasing and continuous, $\hat{\theta}_n \to \theta^*$ with probability 1 as $n$ tends to infinity. Define
$$L_n(\theta) \triangleq \log q_n(\theta) = \log h(\tilde{\alpha}_n, \tilde{\beta}_n) + \theta^\top \tilde{\beta}_n - \tilde{\alpha}_n \psi(\theta).$$
Then we have
$$\Sigma_n \triangleq -[D^2 L_n(\hat{\theta}_n)]^{-1} = [(n + \alpha_0)\, D^2 \psi(\hat{\theta}_n)]^{-1}.$$
Denote by $B(\theta, \varepsilon)$ the open ball of radius $\varepsilon$ centred at $\theta$.
According to Chen (1985), under the assumption of the consistency of $\hat{\theta}_n$ for $\theta^*$, the posterior density $q_n$ converges in distribution to $N(\hat{\theta}_n, \Sigma_n)$ if the following basic conditions hold.

(C1) "Steepness". $\sigma_n^2 \to 0$ with $P_{\theta^*}$-probability 1 as $n \to \infty$, where $\sigma_n^2$ is the largest eigenvalue of $\Sigma_n$.

(C2) "Smoothness". For any $\varepsilon > 0$, there exist an integer $N$ and $\delta > 0$ such that, for any $n > N$ and $\theta \in B(\hat{\theta}_n, \delta)$, $D^2 L_n(\theta)$ exists and satisfies
$$I_m - A(\varepsilon) \le D^2 L_n(\theta)\, [D^2 L_n(\hat{\theta}_n)]^{-1} \le I_m + A(\varepsilon), \quad \text{a.s.},$$
where $A(\varepsilon)$ is an $m \times m$ symmetric positive semidefinite matrix whose largest eigenvalue tends to zero with $P_{\theta^*}$-probability 1 as $\varepsilon \to 0$.

(C3) "Concentration". For any $\delta > 0$,
$$\int_{B(\hat{\theta}_n, \delta)} q_n(\theta)\, d\theta \to 1$$
with $P_{\theta^*}$-probability 1 as $n$ tends to infinity.

In fact, since $\hat{\theta}_n \to \theta^*$, the components of $D^2 \psi(\hat{\theta}_n)$ are bounded above and away from 0 almost surely if $n$ is large enough, so the largest eigenvalue of $\Sigma_n$ tends to 0 and (C1) holds. (C2) is obvious because $D^2 L_n(\theta)\, [D^2 L_n(\hat{\theta}_n)]^{-1} = D^2 \psi(\theta)\, [D^2 \psi(\hat{\theta}_n)]^{-1}$ and $\psi(\cdot)$ is continuously differentiable. From Kass et al. (1990), assumption (iii) in Appendix B is stronger than (C3). Therefore all the conditions are verified.

Acknowledgement

This work was supported by a grant from the UK Science and Engineering Research Council. The authors gratefully acknowledge the reviewers for their valuable comments.

References

Attias, H. (1999). Inferring parameters and structure of latent variable models by variational Bayes. In Prade, H. and Laskey, K., editors, Proc. 15th Conference on Uncertainty in Artificial Intelligence, pages 21–30, Stockholm, Sweden. Morgan Kaufmann Publishers.

Attias, H. (2000). A variational Bayesian framework for graphical models. In Solla, S., Leen, T., and Muller, K.-R., editors, Advances in Neural Information Processing Systems 12, pages 209–215.
MIT Press, Cambridge, MA.

Beal, M. J. (2003). Variational Algorithms for Approximate Bayesian Inference. PhD thesis, University College London.

Bernardo, J. M. and Smith, A. F. M. (1994). Bayesian Theory. John Wiley & Sons, Inc., New York.

Chen, C.-F. (1985). On asymptotic normality of limiting density functions with Bayesian implications. J. R. Statist. Soc. B, 47:540–546.

Ghahramani, Z. and Beal, M. J. (2000). Propagation algorithms for variational Bayesian learning. In Advances in Neural Information Processing Systems 12, pages 507–513. MIT Press, Cambridge, MA.

Hall, P., Humphreys, K., and Titterington, D. M. (2002). On the adequacy of variational lower bound functions for likelihood-based inference in Markovian models with missing values. Journal of the Royal Statistical Society Series B, 64:549–564.

Heyde, C. C. and Johnstone, I. M. (1979). On asymptotic posterior normality for stochastic processes. J. R. Statist. Soc. B, 41:184–189.

Humphreys, K. and Titterington, D. M. (2000). Approximate Bayesian inference for simple mixtures. In Bethlehem, J. G. and van der Heijden, P. G. M., editors, COMPSTAT2000, pages 331–336. Physica-Verlag, Heidelberg.

Humphreys, K. and Titterington, D. M. (2001). Some examples of recursive variational approximations for Bayesian inference. In Opper, M. and Saad, D., editors, Advanced Mean Field Methods: Theory and Practice, pages 179–195. MIT Press.

Kass, R. E., Tierney, L., and Kadane, J. B. (1990). The validity of posterior expansions based on Laplace's method. In Geisser, S., Hodges, J. S., Press, S. J., and Zellner, A., editors, Bayesian and Likelihood Methods in Statistics and Econometrics, pages 473–488. Elsevier Science Publishers, North-Holland.

MacKay, D. J. C. (1997). Ensemble learning for hidden Markov models. Technical report, Cavendish Laboratory, University of Cambridge.

Penny, W. D. and Roberts, S. J. (2000).
Variational Bayes for 1-dimensional mixture models. Technical Report PARG-2000-01, Oxford University.

Walker, A. M. (1969). On the asymptotic behaviour of posterior distributions. J. R. Statist. Soc. B, 31:80–88.

Wang, B. and Titterington, D. M. (2003). Lack of consistency of mean field and variational Bayes approximations for state space models. Technical Report 03-5, University of Glasgow. http://www.stats.gla.ac.uk/Research/TechRep2003/03-5.pdf.

Wang, B. and Titterington, D. M. (2004a). Convergence properties of a general algorithm for calculating variational Bayesian estimates for a normal mixture model. Technical Report 04-3, University of Glasgow. http://www.stats.gla.ac.uk/Research/TechRep2003/04-3.pdf.

Wang, B. and Titterington, D. M. (2004b). Variational Bayesian inference for partially observed diffusions. Technical Report 04-4, University of Glasgow. http://www.stats.gla.ac.uk/Research/TechRep2003/04-4.pdf.

Appendix A

In this appendix we prove that the following convergences hold:
$$\frac{1}{n} D\beta(\theta^*) \to D^2 \psi(\theta^*) - \mathbb{E}_{y_i} \mathbb{E}_{x_i}[\phi]\, \mathbb{E}_{x_i}[\phi^\top], \quad \text{a.s.}, \qquad (17)$$
$$\frac{1}{n} \beta(\theta^*) \to D\psi(\theta^*), \quad \text{a.s.} \qquad (18)$$
In fact, from (6) we have that $D\beta(\theta) = \sum_{i=1}^n D r_i(\theta)$ and
$$\begin{aligned}
D r_i(\theta^*) &= \int u(x_i, y_i)\, D_\theta^\top q_{x_i}(x_i)\, dx_i \\
&= \int u(x_i, y_i)\, f(x_i, y_i)\, D_\theta^\top g(\theta^*, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\}\, dx_i \\
&\quad + \int u(x_i, y_i)\, f(x_i, y_i)\, g(\theta^*, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\} \bigl(u^\top(x_i, y_i) - D^\top \psi(\theta^*)\bigr)\, dx_i \\
&= \int \bigl(u(x_i, y_i) - D\psi(\theta^*)\bigr) f(x_i, y_i)\, D_\theta^\top g(\theta^*, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\}\, dx_i \\
&\quad + \int f(x_i, y_i)\, g(\theta^*, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\} \bigl(u(x_i, y_i) - D\psi(\theta^*)\bigr)\bigl(u^\top(x_i, y_i) - D^\top \psi(\theta^*)\bigr)\, dx_i,
\end{aligned}$$
where in the last equality we used the fact that
$$\int f(x_i, y_i)\, D_\theta g(\theta^*, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\}\, dx_i + \int \bigl(u(x_i, y_i) - D\psi(\theta^*)\bigr) f(x_i, y_i)\, g(\theta^*, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\}\, dx_i = 0, \qquad (19)$$
which is obtained by differentiating, with respect to $\theta$,
$$\int f(x_i, y_i)\, g(\theta^*, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\}\, dx_i = 1.$$
Since it follows from (8) that
$$D_\theta g(\theta^*, y_i) = -\int f(x_i, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\} \bigl(u(x_i, y_i) - D\psi(\theta^*)\bigr)\, dx_i \cdot \Bigl\{\int f(x_i, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\}\, dx_i\Bigr\}^{-2},$$
equality (19) can be rewritten as
$$\begin{aligned}
&\int \bigl(u(x_i, y_i) - D\psi(\theta^*)\bigr) f(x_i, y_i)\, g(\theta^*, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\}\, dx_i \cdot \int f(x_i, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\}\, dx_i \\
&\qquad = \int f(x_i, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\} \bigl(u(x_i, y_i) - D\psi(\theta^*)\bigr)\, dx_i. \qquad (20)
\end{aligned}$$
Differentiating both sides of (20) with respect to $\theta^*$, we have
$$\begin{aligned}
&\Bigl\{\int \bigl(u(x_i, y_i) - D\psi(\theta^*)\bigr) f(x_i, y_i)\, D_\theta^\top g(\theta^*, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\}\, dx_i - D^2 \psi(\theta^*) \\
&\qquad + \int f(x_i, y_i)\, g(\theta^*, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\} \bigl(u(x_i, y_i) - D\psi(\theta^*)\bigr)\bigl(u^\top(x_i, y_i) - D^\top \psi(\theta^*)\bigr)\, dx_i\Bigr\} \cdot \int f(x_i, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\}\, dx_i \\
&\quad + \int \bigl(u(x_i, y_i) - D\psi(\theta^*)\bigr) f(x_i, y_i)\, g(\theta^*, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\}\, dx_i \cdot \int f(x_i, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\} \bigl(u^\top(x_i, y_i) - D^\top \psi(\theta^*)\bigr)\, dx_i \\
&= \int f(x_i, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\} \bigl(u(x_i, y_i) - D\psi(\theta^*)\bigr)\bigl(u^\top(x_i, y_i) - D^\top \psi(\theta^*)\bigr)\, dx_i - \int f(x_i, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\}\, dx_i \cdot D^2 \psi(\theta^*). \qquad (21)
\end{aligned}$$
We define $\phi$ as in (15). The marginal distribution of $y_i$ is $\int p(x_i, y_i \mid \theta^*)\, dx_i$, and therefore it follows from the strong law of large numbers that, with probability 1,
$$\begin{aligned}
\frac{1}{n} \sum_{i=1}^n D_\theta r_i &\to \int \Bigl\{D_\theta r_i \int p(x_i, y_i \mid \theta^*)\, dx_i\Bigr\}\, dy_i \\
&= D^2 \psi(\theta^*) - \int \Bigl\{\mathbb{E}_{x_i}[\phi]\, \mathbb{E}_{x_i}[\phi^\top] \int f(x_i, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\}\, dx_i\Bigr\}\, dy_i \\
&= D^2 \psi(\theta^*) - \mathbb{E}_{y_i} \mathbb{E}_{x_i}[\phi]\, \mathbb{E}_{x_i}[\phi^\top],
\end{aligned}$$
where we have used equality (21) and the fact that
$$\iint f(x_i, y_i) \exp\{\theta^{*\top} u(x_i, y_i) - \psi(\theta^*)\} \bigl(u(x_i, y_i) - D\psi(\theta^*)\bigr)\bigl(u^\top(x_i, y_i) - D^\top \psi(\theta^*)\bigr)\, dx_i\, dy_i = D^2 \psi(\theta^*), \qquad (22)$$
and $\mathbb{E}_{x_i}$ denotes expectation under $q_{x_i}$. Thus we obtain (17). The derivation of (18) is similar.

Appendix B

In this appendix we show that under our framework the Laplace approximation is justified. The proof consists of verifying the analytical assumptions for Laplace's method in Kass, Tierney and Kadane (1990), which are listed here for convenience.
Since $a_n$ defined in (11) is of a random nature, some minor revisions are made to adapt to our setting. Suppose that $\{a_n : n = 1, 2, \ldots\}$ is a sequence of six-times continuously differentiable real functions and that $b$ is a four-times continuously differentiable function of $\theta$. The pair $(\{a_n\}, b)$ is said to satisfy the analytical assumptions for Laplace's method if there exist positive numbers $\varepsilon$, $M$ and $\eta$, and an integer $n_0$, such that $n > n_0$ implies the following:

(i) for all $\theta \in B(\hat{\theta}_n, \varepsilon)$ and all $1 \le j_1, \ldots, j_d \le m$ with $0 \le d \le 6$, $|\partial_{j_1 \cdots j_d} a_n(\theta)| < M$ with $P_{\theta^*}$-probability 1;

(ii) $\det(D^2 a_n(\hat{\theta}_n)) > \eta$ with $P_{\theta^*}$-probability 1;

(iii) the integral $h_b$ defined in equation (12) exists and is finite, and, for all $\delta$ for which $0 < \delta < \varepsilon$ and $B(\hat{\theta}_n, \delta) \subseteq \Theta$,
$$\det(n D^2 a_n(\hat{\theta}_n))^{1/2} \int_{\Theta - B(\hat{\theta}_n, \delta)} b(\theta) \exp\{n (a_n(\hat{\theta}_n) - a_n(\theta))\}\, d\theta = O(n^{-2})$$
with $P_{\theta^*}$-probability 1; or, more strongly,

(iii') for all $\delta$ for which $0 < \delta < \varepsilon$ and $B(\hat{\theta}_n, \delta) \subseteq \Theta$,
$$\limsup_{n \to \infty}\, \sup_\theta \{a_n(\hat{\theta}_n) - a_n(\theta) : \theta \in \Theta - B(\hat{\theta}_n, \delta)\} < 0$$
with $P_{\theta^*}$-probability 1.

According to Kass et al. (1990), we have the following lemma.

Lemma 1.
If $(\{a_n\}, b)$ satisfy the analytical assumptions for Laplace's method, then
$$\begin{aligned}
\int_\Theta b(\theta) \exp\{-n a_n(\theta)\}\, d\theta
={}& (2\pi)^{m/2} [\det(n D^2 a_n)]^{-1/2} \exp\{-n a_n(\hat{\theta}_n)\} \cdot \Bigl\{ b(\hat{\theta}_n) + \frac{1}{n} \Bigl[ \frac{1}{2} \sum_{i,j=1}^m \sigma_n^{ij} b_{ij} - \frac{1}{6} \sum_{i,j,k,s=1}^m a_n^{ijk} b_s\, \mu^4_{ijks} \\
&+ \frac{1}{72}\, b(\hat{\theta}_n) \sum_{i,j,k,q,r,s=1}^m a_n^{ijk} a_n^{qrs}\, \mu^6_{ijkqrs} - \frac{1}{24}\, b(\hat{\theta}_n) \sum_{i,j,k,s=1}^m a_n^{ijks}\, \mu^4_{ijks} \Bigr] + O(n^{-2}) \Bigr\}, \quad \text{a.s.}, \qquad (23)
\end{aligned}$$
where $\mu^4_{ijks}$ and $\mu^6_{ijkqrs}$ are the fourth and sixth central moments of a multivariate normal distribution having covariance matrix $(D^2 a_n)^{-1}$; that is,
$$\mu^4_{ijks} = \sigma_n^{ij} \sigma_n^{ks} + \sigma_n^{ik} \sigma_n^{js} + \sigma_n^{is} \sigma_n^{jk},$$
$$\begin{aligned}
\mu^6_{ijkqrs} ={}& \sigma_n^{ij}\sigma_n^{kq}\sigma_n^{rs} + \sigma_n^{ij}\sigma_n^{kr}\sigma_n^{qs} + \sigma_n^{ij}\sigma_n^{ks}\sigma_n^{qr} + \sigma_n^{ik}\sigma_n^{jq}\sigma_n^{rs} + \sigma_n^{ik}\sigma_n^{jr}\sigma_n^{qs} + \sigma_n^{ik}\sigma_n^{js}\sigma_n^{qr} \\
&+ \sigma_n^{iq}\sigma_n^{jk}\sigma_n^{rs} + \sigma_n^{iq}\sigma_n^{jr}\sigma_n^{ks} + \sigma_n^{iq}\sigma_n^{js}\sigma_n^{kr} + \sigma_n^{ir}\sigma_n^{jk}\sigma_n^{qs} + \sigma_n^{ir}\sigma_n^{jq}\sigma_n^{ks} + \sigma_n^{ir}\sigma_n^{js}\sigma_n^{kq} \\
&+ \sigma_n^{is}\sigma_n^{jk}\sigma_n^{qr} + \sigma_n^{is}\sigma_n^{jq}\sigma_n^{kr} + \sigma_n^{is}\sigma_n^{jr}\sigma_n^{kq},
\end{aligned}$$
where $D^2 a_n$ denotes the Hessian of $a_n$, its $(i, j)$-component is written as $a_n^{ij}$ and the components of its inverse are written as $\sigma_n^{ij}$; moreover, $b_s$ and $b_{ij}$ denote the components of the first- and second-order derivatives of $b$, respectively. All derivatives are evaluated at $\hat{\theta}_n$.

Now we verify the assumptions (i)-(iii). Under our assumptions, it has been shown in (18) that $\frac{1}{n} \sum_{i=1}^n r_i \to D\psi(\theta^*)$ with probability 1, so, when $n$ is large enough, $\frac{1}{n} \sum_{i=1}^n r_i$ is almost surely bounded in $B(\hat{\theta}_n, \varepsilon)$. Since $\psi$ is continuously differentiable, (i) obviously holds. Condition (ii) is one of our assumptions. As $n$ tends to infinity, for any $\theta \in \Theta$, $a_n(\theta)$ converges with $P_{\theta^*}$-probability 1 to
$$a_0(\theta) = \psi(\theta) - \theta^\top D\psi(\theta^*).$$
Since $\hat{\theta}_n$ minimises $a_n$, we have
$$\hat{\theta}_n = (D\psi)^{-1}\Bigl(\Bigl(\frac{1}{n} \sum_{i=1}^n r_i + \frac{\beta_0}{n}\Bigr) \Big/ \Bigl(1 + \frac{\alpha_0}{n}\Bigr)\Bigr),$$
so it follows that, as $n$ tends to infinity, with probability 1, $\hat{\theta}_n \to (D\psi)^{-1}(D\psi(\theta^*)) = \theta^*$. Therefore, for all $\delta$ for which $0 < \delta < \varepsilon$, and for all $\varepsilon_0$ satisfying $0 < \varepsilon_0 < \delta/2$, there exists an integer $N$ such that, if $n > N$, it holds that, for all $\theta \in \Theta$,
$$|a_n(\theta) - a_0(\theta)| < \varepsilon_0, \qquad \|\hat{\theta}_n - \theta^*\| < \varepsilon_0, \quad \text{a.s.}, \qquad |a_0(\hat{\theta}_n) - a_0(\theta^*)| < \varepsilon_0, \quad \text{a.s.}$$
Thus,
$$a_n(\hat{\theta}_n) - a_n(\theta) = \bigl(a_n(\hat{\theta}_n) - a_0(\hat{\theta}_n)\bigr) + \bigl(a_0(\hat{\theta}_n) - a_0(\theta^*)\bigr) + \bigl(a_0(\theta^*) - a_0(\theta)\bigr) + \bigl(a_0(\theta) - a_n(\theta)\bigr) < a_0(\theta^*) - a_0(\theta) + 3\varepsilon_0, \quad \text{a.s.},$$
so that
$$\begin{aligned}
\sup\{a_n(\hat{\theta}_n) - a_n(\theta) : \theta \in \Theta - B(\hat{\theta}_n, \delta)\}
&\le \sup\{a_0(\theta^*) - a_0(\theta) : \theta \in \Theta - B(\hat{\theta}_n, \delta)\} + 3\varepsilon_0 \\
&\le \sup\{a_0(\theta^*) - a_0(\theta) : \theta \in \Theta - B(\theta^*, \delta - \varepsilon_0)\} + 3\varepsilon_0, \quad \text{a.s.}, \qquad (24)
\end{aligned}$$
since $B(\theta^*, \delta - \varepsilon_0) \subset B(\hat{\theta}_n, \delta)$. Since $a_0(\cdot)$ is strictly convex, for $\theta \in \Theta - B(\theta^*, \delta - \varepsilon_0)$ we have $a_0(\theta) - a_0(\theta^*) > c$, where
$$c = \inf\{a_0(\theta) - a_0(\theta^*) : \theta \text{ lies on the boundary of } B(\theta^*, \delta/2)\} > 0.$$
Consequently, we get
$$\sup\{a_0(\theta^*) - a_0(\theta) : \theta \in \Theta - B(\theta^*, \delta - \varepsilon_0)\} \le -c.$$
Combining the last estimate with (24), we have that, for all $\varepsilon_0$ satisfying $0 < \varepsilon_0 < \delta/2$, there exists an integer $N$ such that $n > N$ implies
$$\sup\{a_n(\hat{\theta}_n) - a_n(\theta) : \theta \in \Theta - B(\hat{\theta}_n, \delta)\} \le -c + 3\varepsilon_0, \quad \text{a.s.};$$
that is, (iii') holds.
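The leading term of Lemma 1 is the standard Laplace approximation $\int_\Theta b(\theta) e^{-n a_n(\theta)} d\theta \approx (2\pi)^{m/2} [\det(n D^2 a_n)]^{-1/2} e^{-n a_n(\hat{\theta}_n)}\, b(\hat{\theta}_n)$. As a one-dimensional sanity check (our own illustration, not from the paper), take the strictly convex $a(\theta) = e^\theta - \theta$ and $b \equiv 1$:

```python
import numpy as np

# One-dimensional sanity check of the leading term of Lemma 1 (our own
# illustration).  Take a(theta) = exp(theta) - theta, which is strictly
# convex with minimiser theta_hat = 0, a(theta_hat) = 1, a''(theta_hat) = 1,
# and b = 1.  Laplace's method predicts
#     int exp(-n a(theta)) dtheta  ~  sqrt(2 pi / n) * exp(-n a(theta_hat)),
# with relative error O(1/n).
n = 200.0
theta = np.linspace(-1.5, 1.5, 400_001)
dx = theta[1] - theta[0]
a = np.exp(theta) - theta

# Factor exp(n * a(theta_hat)) = exp(n) out of both sides to avoid underflow.
integral = np.sum(np.exp(-n * (a - 1.0))) * dx
laplace = np.sqrt(2.0 * np.pi / n)

print(integral / laplace)  # -> 1 as n grows; here well within 1%
```

Truncating the integral to $[-1.5, 1.5]$ is harmless here because the integrand is exponentially negligible outside any neighbourhood of the minimiser, which is exactly the content of condition (iii').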