A Bit of Information Theory, and the Data Augmentation Algorithm Converges
The data augmentation (DA) algorithm is a simple and powerful tool in statistical computing. In this note basic information theory is used to prove a nontrivial convergence theorem for the DA algorithm.
Author: Yaming Yu
Index Terms: Gibbs sampling, information geometry, I-projection, Kullback-Leibler divergence, Markov chain Monte Carlo, Pinsker's inequality, relative entropy, reverse I-projection, total variation.

Yaming Yu is with the Department of Statistics, University of California, Irvine, CA 92697-1250, USA (e-mail: yamingy@uci.edu). This work is supported in part by a start-up fund from the Bren School of Information and Computer Sciences at UC Irvine.

I. BACKGROUND

In many statistical problems we would like to sample from a probability density $\pi(x, y)$, e.g., the joint posterior of all parameters and latent variables in a Bayesian model. When $\pi(x, y)$ is complicated, direct simulation may be impractical; however, if the conditional densities $\pi_{X|Y}(x|y)$ and $\pi_{Y|X}(y|x)$ are tractable, the following algorithm is an intuitively appealing alternative. Draw $(X, Y)$ from an initial density $p^{(0)}(x, y)$, and then alternately replace $X$ by a conditional draw given $Y$ according to $\pi_{X|Y}(x|y)$, and $Y$ by a conditional draw given $X$ according to $\pi_{Y|X}(y|x)$; this is a crude description of the data augmentation (DA) algorithm of Tanner and Wong [18] (see also [15], [20] and [22]), a powerful and widely used method in statistical computing.

It is not immediately obvious that iterates of the DA algorithm should approach the target $\pi(x, y)$. To show convergence, one usually appeals to Markov chain theory (Tierney [19]), which says, roughly, that if a Markov chain is irreducible and aperiodic, and possesses a stationary distribution, then it converges to that distribution. Such results are often stated in terms of the total variation distance, defined for two densities $p$ and $q$ as
$$V(p, q) = \int |p - q|.$$
Because iterates of the DA algorithm form a Markov chain, they converge in total variation under some regularity conditions.

Total variation, of course, is not the only discrepancy measure. There is another discrepancy measure that is natural for the problem, yet rarely explored. Recall that the relative entropy, or Kullback-Leibler divergence, between two densities $p$ and $q$ is defined as
$$D(p\,\|\,q) = \int p \log(p/q).$$
It is related to $V(p, q)$ via the well-known Pinsker's inequality
$$D(p\,\|\,q) \ge \tfrac{1}{2}\, V^2(p, q),$$
so that for a sequence of densities $p_t$, $t = 0, 1, \ldots$, $\lim_{t\to\infty} D(p_t\,\|\,p_\infty) = 0$ implies $\lim_{t\to\infty} V(p_t, p_\infty) = 0$. Other useful properties of relative entropy can be found in Cover and Thomas [3].
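As a quick numerical illustration (an addition to this note, not part of it), the following Python sketch evaluates $D(p\,\|\,q)$ and $V(p, q)$ for two made-up discrete densities and confirms Pinsker's inequality; natural logarithms are used, matching the definition of $D$ above.

```python
# Sanity check of Pinsker's inequality D(p||q) >= (1/2) * V(p, q)^2 for two
# made-up discrete densities p and q (natural logarithms throughout).
import numpy as np

rng = np.random.default_rng(0)
p = rng.random(6)
p /= p.sum()                           # arbitrary probability vector p
q = rng.random(6)
q /= q.sum()                           # arbitrary probability vector q

D = float(np.sum(p * np.log(p / q)))   # relative entropy D(p||q)
V = float(np.sum(np.abs(p - q)))       # total variation V(p, q) = sum |p - q|, in [0, 2]

print(f"D = {D:.4f}, V^2/2 = {0.5 * V**2:.4f}, Pinsker holds: {D >= 0.5 * V**2}")
```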
It is the purpose of this note to analyze the DA algorithm in terms of relative entropy and to present a short proof of a convergence result (Theorem 2.1) using simple information-theoretic techniques.

II. MAIN RESULT

Let $\mu \times \nu$ be a product measure on a joint measurable space $(\mathcal{X} \times \mathcal{Y}, \mathcal{F} \times \mathcal{G})$. Suppose the target density $\pi(x, y)$ with respect to $\mu \times \nu$ satisfies $\pi(x, y) > 0$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$ (in statistical applications $\mathcal{X}$ and $\mathcal{Y}$ are often subsets of Euclidean spaces, and each of $\mu$ and $\nu$ is either Lebesgue measure or counting measure).

Formally, given an initial density $p^{(0)}(x, y)$, the DA algorithm generates a sequence of densities $p^{(t)}(x, y)$, $t \ge 0$, where (writing $p^{(t)}_X(x) = \int_{\mathcal{Y}} p^{(t)}(x, y)\, d\nu(y)$, for example)
$$p^{(t+1)}(x, y) = \begin{cases} p^{(t)}_X(x)\, \pi_{Y|X}(y|x), & t \text{ odd};\\[2pt] p^{(t)}_Y(y)\, \pi_{X|Y}(x|y), & t \text{ even}. \end{cases} \tag{1}$$

Theorem 2.1: If $\pi(x, y) > 0$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$, and $D(p^{(0)}\,\|\,\pi) < \infty$, then iterates of the DA algorithm (1) converge in relative entropy, i.e., $\lim_{t\to\infty} D(p^{(t)}\,\|\,\pi) = 0$, and $\lim_{t\to\infty} V(p^{(t)}, \pi) = 0$ necessarily.

The condition $\pi(x, y) > 0$, $(x, y) \in \mathcal{X} \times \mathcal{Y}$, can be weakened, and the result can be generalized to the Gibbs sampler ([11], [10]); see Yu [21]. Note that the conditions of Theorem 2.1 are already weaker than those of Schervish and Carlin [17], for example (see also Liu et al. [13]), although Theorem 2.1 does not give a quantitative rate of convergence. As a general comment, the approach taken here complements the more traditional $L^2$ approach (Amit [1]), which studies the Gibbs sampler in the Hilbert space of square-integrable functions.

Section III provides a short, self-contained proof of Theorem 2.1. The main tools (Lemmas 3.1–3.3) exploit the information geometry of the DA algorithm. Although relative entropy does not define a metric, it behaves like squared Euclidean distance. See Csiszár [4], Csiszár and Shields [6], and Csiszár and Matúš [5] for the notions of I-projection and reverse I-projection, which explore such properties in broader contexts.
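To make the recursion (1) and the claim of Theorem 2.1 concrete, here is a minimal Python sketch (an illustration added here, not taken from the note) on a hypothetical finite state space with $|\mathcal{X}| = 3$ and $|\mathcal{Y}| = 4$ under counting measures; the target $\pi$ and initial density $p^{(0)}$ are arbitrary positive arrays. The printed values of $D(p^{(t)}\,\|\,\pi)$ decrease monotonically toward zero.

```python
# Minimal sketch of the density recursion (1) on an assumed 3 x 4 state space,
# illustrating Theorem 2.1: D(p^(t) || pi) decreases monotonically to 0.
import numpy as np

rng = np.random.default_rng(1)

pi = rng.random((3, 4)) + 0.1                        # target pi(x, y) > 0
pi /= pi.sum()
pi_x_given_y = pi / pi.sum(axis=0, keepdims=True)    # pi_{X|Y}(x|y): columns sum to 1
pi_y_given_x = pi / pi.sum(axis=1, keepdims=True)    # pi_{Y|X}(y|x): rows sum to 1

p = rng.random((3, 4)) + 0.1                         # initial density p^(0)(x, y) > 0
p /= p.sum()

def kl(a, b):
    """Relative entropy D(a || b) for strictly positive discrete joint densities."""
    return float(np.sum(a * np.log(a / b)))

print(0, kl(p, pi))
for t in range(10):
    if t % 2 == 0:
        # t even: p^(t+1)(x, y) = p^(t)_Y(y) * pi_{X|Y}(x|y)
        p = p.sum(axis=0, keepdims=True) * pi_x_given_y
    else:
        # t odd:  p^(t+1)(x, y) = p^(t)_X(x) * pi_{Y|X}(y|x)
        p = p.sum(axis=1, keepdims=True) * pi_y_given_x
    print(t + 1, kl(p, pi))                          # non-increasing, tends to 0
```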
III. PROOF OF THEOREM 2.1

In this section let $p^{(t)}$ be a sequence of densities generated according to (1) with $D(p^{(0)}\,\|\,\pi) < \infty$. Lemma 3.1 captures the intuition that each iteration is a projection (more precisely, a reverse I-projection) onto the set of densities with a given conditional. The proof is simple and hence omitted.

Lemma 3.1: For all $t \ge 0$,
$$D\bigl(p^{(t)}\,\|\,\pi\bigr) = D\bigl(p^{(t)}\,\|\,p^{(t+1)}\bigr) + D\bigl(p^{(t+1)}\,\|\,\pi\bigr).$$

According to Lemma 3.1, $D(p^{(t)}\,\|\,\pi)$ can only decrease in $t$ (this holds for Markov chains in general). However, it does not imply $D(p^{(t)}\,\|\,\pi) \downarrow 0$. To prove the theorem we need further analysis.

Lemma 3.2: Let $t \ge 1$ and $n \ge 1$. If $n$ is even then
$$D\bigl(p^{(t)}\,\|\,p^{(t+n)}\bigr) \le D\bigl(p^{(t)}\,\|\,p^{(t+n-1)}\bigr); \tag{2}$$
if $n$ is odd then
$$D\bigl(p^{(t)}\,\|\,p^{(t+n)}\bigr) = D\bigl(p^{(t)}\,\|\,p^{(t+1)}\bigr) + D\bigl(p^{(t+1)}\,\|\,p^{(t+n)}\bigr). \tag{3}$$

Proof: To prove (2), without loss of generality assume $t$ is odd. Since $n$ is even, $p^{(t)}$ and $p^{(t+n)}$ have the same conditional $p^{(t)}_{X|Y} = p^{(t+n)}_{X|Y} = \pi_{X|Y}$, whereas $p^{(t+n)}_Y = p^{(t+n-1)}_Y$ by (1). We have
$$D\bigl(p^{(t)}\,\|\,p^{(t+n)}\bigr) = D\bigl(p^{(t)}_Y\,\|\,p^{(t+n)}_Y\bigr) = D\bigl(p^{(t)}_Y\,\|\,p^{(t+n-1)}_Y\bigr) \le D\bigl(p^{(t)}\,\|\,p^{(t+n-1)}\bigr),$$
the last inequality being a basic property of relative entropy (Cover and Thomas [3]). The proof of (3), omitted, is the same as that of Lemma 3.1.

Lemma 3.3: For all $t \ge 1$ and $n \ge 0$ we have
$$D\bigl(p^{(t)}\,\|\,p^{(t+n)}\bigr) \le D\bigl(p^{(t)}\,\|\,\pi\bigr) - D\bigl(p^{(t+n)}\,\|\,\pi\bigr). \tag{4}$$

Proof: We use induction on $n$. The case $n = 0$ is trivial. Suppose (4) has been verified for all $n' < n$. When $n$ is even, we apply (2), the induction hypothesis, and Lemma 3.1 to obtain
$$D\bigl(p^{(t)}\,\|\,p^{(t+n)}\bigr) \le D\bigl(p^{(t)}\,\|\,p^{(t+n-1)}\bigr) \le D\bigl(p^{(t)}\,\|\,\pi\bigr) - D\bigl(p^{(t+n-1)}\,\|\,\pi\bigr) \le D\bigl(p^{(t)}\,\|\,\pi\bigr) - D\bigl(p^{(t+n)}\,\|\,\pi\bigr).$$
When $n$ is odd, by (3), the induction hypothesis, and then Lemma 3.1, we have
$$D\bigl(p^{(t)}\,\|\,p^{(t+n)}\bigr) = D\bigl(p^{(t)}\,\|\,p^{(t+1)}\bigr) + D\bigl(p^{(t+1)}\,\|\,p^{(t+n)}\bigr) \le D\bigl(p^{(t)}\,\|\,p^{(t+1)}\bigr) + D\bigl(p^{(t+1)}\,\|\,\pi\bigr) - D\bigl(p^{(t+n)}\,\|\,\pi\bigr) = D\bigl(p^{(t)}\,\|\,\pi\bigr) - D\bigl(p^{(t+n)}\,\|\,\pi\bigr).$$

Corollary 3.1: There exists some density $\pi^*$ such that $\lim_{t\to\infty} V(p^{(t)}, \pi^*) = 0$.

Proof: Pinsker's inequality and (4) imply
$$\tfrac{1}{2}\, V^2\bigl(p^{(t)}, p^{(k)}\bigr) \le D\bigl(p^{(t)}\,\|\,\pi\bigr) - D\bigl(p^{(k)}\,\|\,\pi\bigr)$$
for all $k \ge t \ge 1$. Because $D(p^{(t)}\,\|\,\pi)$ is finite and decreases monotonically in $t$, $\lim_{t,k\to\infty} V(p^{(t)}, p^{(k)}) = 0$, i.e., $p^{(t)}$ is a Cauchy sequence in $L^1(\mathcal{X} \times \mathcal{Y})$. Hence $p^{(t)}$ converges in $L^1(\mathcal{X} \times \mathcal{Y})$ to some density $\pi^*$. (Only the completeness of $L^1(\mathcal{X} \times \mathcal{Y})$ is used here. Further properties of $L^p$ spaces can be found in real analysis texts such as Royden [16].)

Proposition 3.1: In the setting of Corollary 3.1, $\pi^* = \pi$.

Proof: Since $p^{(t)}$, $t \ge 1$, has the conditional $\pi_{X|Y}$ when $t$ is odd, and $\pi_{Y|X}$ when $t$ is even, the conditionals of $\pi^*$ must match those of $\pi$, i.e.,
$$\pi^*(x, y) = \pi^*_Y(y)\, \pi_{X|Y}(x|y) = \pi^*_X(x)\, \pi_{Y|X}(y|x), \tag{5}$$
almost everywhere. Under the assumption $\pi(x, y) > 0$, (5) implies
$$\pi^*_Y(y) = \pi^*_X(x)\, \frac{\pi_{Y|X}(y|x)}{\pi_{X|Y}(x|y)} = \pi^*_X(x)\, \frac{\pi_Y(y)}{\pi_X(x)}.$$
Integration over $y$ yields $1 = \pi^*_X(x)/\pi_X(x)$, which, together with (5), proves $\pi^* = \pi$.

Finally we finish the proof of Theorem 2.1 by showing that the convergence in Corollary 3.1 also holds in relative entropy.

Lemma 3.4: $\lim_{t\to\infty} D(p^{(t)}\,\|\,\pi) = 0$.

Proof: We already have $D(p^{(t)}\,\|\,\pi) \downarrow d$, say, with $d \ge 0$. Taking $n \to \infty$ in (4) we get
$$\liminf_{n\to\infty} D\bigl(p^{(t)}\,\|\,p^{(t+n)}\bigr) \le D\bigl(p^{(t)}\,\|\,\pi\bigr) - d.$$
On the other hand, since
$$D\bigl(p^{(t)}\,\|\,p^{(t+n)}\bigr) = \int \Bigl[\, p^{(t)} \log\bigl(p^{(t)}/p^{(t+n)}\bigr) - p^{(t)} + p^{(t+n)} \Bigr]$$
and the integrand is non-negative, by Fatou's Lemma we have
$$\liminf_{n\to\infty} D\bigl(p^{(t)}\,\|\,p^{(t+n)}\bigr) \ge D\bigl(p^{(t)}\,\|\,\pi\bigr), \tag{6}$$
which forces $d = 0$. The proof is now complete. Note that (6) is a case of the more general lower semi-continuity property of relative entropy (Csiszár [4]).
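The identities above are easy to check numerically. The following Python sketch (again an illustration on an assumed finite state space, not taken from the note) iterates (1) and verifies the Pythagorean decomposition of Lemma 3.1 and the inequality (4) of Lemma 3.3.

```python
# Numerical check, on an assumed 3 x 4 state space, of Lemma 3.1,
#   D(p^(t)||pi) = D(p^(t)||p^(t+1)) + D(p^(t+1)||pi),
# and of inequality (4) in Lemma 3.3.  All densities are made-up positive arrays.
import numpy as np

rng = np.random.default_rng(2)

pi = rng.random((3, 4)) + 0.1
pi /= pi.sum()                                       # target pi(x, y) > 0

def step(p, t):
    """One application of recursion (1): returns p^(t+1) given p^(t)."""
    if t % 2 == 0:
        return p.sum(axis=0, keepdims=True) * (pi / pi.sum(axis=0, keepdims=True))
    return p.sum(axis=1, keepdims=True) * (pi / pi.sum(axis=1, keepdims=True))

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

p0 = rng.random((3, 4)) + 0.1
p0 /= p0.sum()                                       # initial density p^(0)

ps = [p0]                                            # ps[t] holds p^(t)
for t in range(8):
    ps.append(step(ps[t], t))

t = 2
# Lemma 3.1: the two printed numbers agree up to floating-point error.
print(kl(ps[t], pi), kl(ps[t], ps[t + 1]) + kl(ps[t + 1], pi))
# Lemma 3.3, inequality (4), for n = 1, ..., 5 (True expected each time).
for n in range(1, 6):
    print(n, kl(ps[t], ps[t + n]) <= kl(ps[t], pi) - kl(ps[t + n], pi) + 1e-12)
```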
IV. REMARKS

As pointed out by an anonymous reviewer, the core of Section III consists of two parts: (i) showing $\lim_{t\to\infty} V(p^{(t)}, \pi^*) = 0$ for some $\pi^*$ whose conditionals match those of $\pi$, and (ii) showing that $\pi^* = \pi$. Part (i) can be phrased more generally and is related to the results of Csiszár and Shields ([6], Theorem 5.1) on alternating I-projections. It is also related to the information-theoretic treatment of the EM algorithm ([8], [14]) due to Csiszár and Tusnády [7]. The condition $\pi(x, y) > 0$, not used in part (i), can be replaced by a weaker assumption, as long as one can show that there exists no density other than $\pi$ that possesses the two conditionals $\pi_{X|Y}$ and $\pi_{Y|X}$.

Lemma 3.1 appears in Yu [21]. Lemmas 3.2 and 3.3 are new. Generalizations of Theorem 2.1 to the Gibbs sampler with more than two components are possible ([21]), but technically more involved, because Lemmas 3.2 and 3.3 are tailored to the two-component case. The issue of the rate of convergence, not addressed here, is definitely worth investigating.

The DA algorithm has the following feature. If we let $(X^{(0)}, Y^{(0)}, X^{(1)}, Y^{(1)}, \ldots)$ be the iterates generated, i.e., the conditional distribution of $Y^{(k)} \mid X^{(k)}$ is $\pi_{Y|X}$ and that of $X^{(k+1)} \mid Y^{(k)}$ is $\pi_{X|Y}$, then each of $\{X^{(k)}\}$ and $\{Y^{(k)}\}$ forms a reversible Markov chain. Fritz [9], Barron [2], and Harremoës and Holst [12] apply information theory to prove convergence theorems for reversible Markov chains. Their results may be adapted to give an alternative (albeit less elementary) derivation of Theorem 2.1.

ACKNOWLEDGMENTS

The author would like to thank Xiao-Li Meng and David van Dyk for introducing him to the topic of data augmentation. He is also grateful to the Associate Editor and three anonymous reviewers for their valuable comments.

REFERENCES

[1] Y. Amit, "On rates of convergence of stochastic relaxation for Gaussian and non-Gaussian distributions," Journal of Multivariate Analysis, vol. 38, pp. 82–99, 1991.
[2] A. R. Barron, "Limits of information, Markov chains, and projections," in Proc. IEEE Int. Symp. Information Theory, Sorrento, Italy, Jun. 2000.
[3] T. Cover and J. Thomas, Elements of Information Theory, 2nd ed., New York: Wiley, 2006.
[4] I. Csiszár, "I-divergence geometry of probability distributions and minimization problems," Ann. Probab., vol. 3, pp. 146–158, 1975.
[5] I. Csiszár and F. Matúš, "Information projections revisited," IEEE Trans. Inf. Theory, vol. 49, pp. 1474–1490, 2003.
[6] I. Csiszár and P. Shields, "Information theory and statistics: a tutorial," Foundations and Trends in Communications and Information Theory, vol. 1, pp. 417–528, 2004.
[7] I. Csiszár and G. Tusnády, "Information geometry and alternating minimization procedures," Statistics & Decisions, Supplement Issue 1, pp. 205–237, 1984.
[8] A. P. Dempster, N. M. Laird and D. B. Rubin, "Maximum likelihood estimation from incomplete data via the EM algorithm" (with discussion), J. Roy. Statist. Soc. B, vol. 39, pp. 1–38, 1977.
[9] J. Fritz, "An information-theoretical proof of limit theorems for reversible Markov processes," Trans. Sixth Prague Conf. on Inform. Theory, Statist. Decision Functions, Random Processes, Prague, Czech Republic, 1973.
[10] A. E. Gelfand and A. F. M. Smith, "Sampling-based approaches to calculating marginal densities," J. Amer. Statist. Assoc., vol. 85, pp. 398–409, 1990.
[11] S. Geman and D. Geman, "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 6, pp. 721–741, 1984.
[12] P. Harremoës and K. K. Holst, "Convergence of Markov chains in information divergence," in press, Journal of Theoretical Probability, 2008.
[13] J. Liu, W. H. Wong and A. Kong, "Correlation structure and convergence rate of the Gibbs sampler for various scans," J. Roy. Statist. Soc. B, vol. 57, pp. 157–169, 1995.
[14] X. L. Meng and D. A. van Dyk, "The EM algorithm: an old folk song sung to a fast new tune" (with discussion), J. Roy. Statist. Soc. B, vol. 59, pp. 511–567, 1997.
[15] X. L. Meng and D. A. van Dyk, "Seeking efficient data augmentation schemes via conditional and marginal augmentation," Biometrika, vol. 86, pp. 301–320, 1999.
[16] H. L. Royden, Real Analysis, 3rd ed., New York: Macmillan, 1988.
[17] M. J. Schervish and B. P. Carlin, "On the convergence of successive substitution sampling," Journal of Computational and Graphical Statistics, vol. 1, pp. 111–127, 1992.
[18] M. A. Tanner and W. H. Wong, "The calculation of posterior distributions by data augmentation," J. Amer. Statist. Assoc., vol. 82, pp. 528–540, 1987.
[19] L. Tierney, "Markov chains for exploring posterior distributions," Annals of Statistics, vol. 22, pp. 1701–1727, 1994.
[20] D. A. van Dyk and X. L. Meng, "The art of data augmentation" (with discussion), Journal of Computational and Graphical Statistics, vol. 10, pp. 1–111, 2001.
[21] Y. Yu, "Information geometry and the Gibbs sampler," Technical Report, Dept. of Statistics, University of California, Irvine, 2008.
[22] Y. Yu and X. L. Meng, "Espousing classical statistics with modern computation: sufficiency, ancillarity and an interweaving generation of MCMC," Technical Report, Dept. of Statistics, University of California, Irvine, 2008.