Cumulants of multiinformation density in the case of a multivariate normal distribution

We consider a generalization of information density to a partitioning into $N \geq 2$ subvectors. We calculate its cumulant-generating function and its cumulants, showing that these quantities are only a function of all the regression coefficients associated with the partitioning.

Authors: Guillaume Marrelec, Alain Giron

Guillaume Marrelec¹·² and Alain Giron¹·²

¹ Laboratoire d'imagerie biomédicale (LIB), Sorbonne Université, CNRS, INSERM, F-75006, Paris, France. email: firstname.lastname@inserm.fr.
² Centre de recherches et d'études en sciences des interactions (CRÉSI), Center for Interaction Science (CIS), F-75006, Paris, France

November 14, 2019

Keywords: dependence; information; cumulant-generating function; cumulants; mutual information; multiinformation

1 Introduction

Let $X$ be a multivariate normal random variable with distribution $f(x)$, and $(X_1, X_2)$ a partitioning of $X$ into 2 subvectors with corresponding marginals $f_1(x_1)$ and $f_2(x_2)$. The information density relative to $X$ and the partitioning $(X_1, X_2)$ is the random variable defined as (Polyanskiy and Wu, 2017, § 17.1)

$$i_d(X; X_1, X_2) = \ln \frac{f(X_1, X_2)}{f_1(X_1)\, f_2(X_2)}. \quad (1)$$

One of the key features of information density is that its expectation yields mutual information (Kullback, 1968, Chap. 1, § 2; Polyanskiy and Wu, 2017, § 17.1). In the present paper, we consider a partitioning of $X$ into $N \geq 2$ subvectors $(X_1, \ldots, X_N)$ with corresponding marginals $f_1, \ldots, f_N$ and define multiinformation density as

$$i_d(X; X_1, \ldots, X_N) = \ln\left[\frac{f(X)}{\prod_{n=1}^N f_n(X_n)}\right].$$

The mean of this quantity,

$$I(X; X_1, \ldots, X_N) = \int f(x) \ln\left[\frac{f(x)}{\prod_{n=1}^N f_n(x_n)}\right] dx,$$

is itself a generalization of mutual information known under different names: total correlation (Watanabe, 1960), multivariate constraint (Garner, 1962), δ (Joe, 1989), or multiinformation (Studený and Vejnarová, 1998).

Our interest in $i_d$ is driven by its close connection with mutual independence. Indeed, when the $X_n$'s are mutually independent, multiinformation is classically equal to 0, but we also have $i_d \equiv 0$ and $\mathrm{Var}(i_d) = 0$ (see Appendix A), yielding other statistical markers of independence. By contrast, dependence between the $X_n$'s is a multivariate phenomenon that multiinformation, as a one-dimensional measure, can only partially quantify. We expect $i_d$ to give a more detailed characterization of dependence, e.g., through its moments or cumulants.

We here focus on multivariate normal distributions. The family of multivariate normal distributions with given mean $\mu$ can be parameterized by either a covariance matrix or a concentration/precision (i.e., inverse covariance) matrix. Either parameterization shows multivariate distributions according to a certain perspective and emphasizes different features (e.g., Markov properties for the concentration matrix). In this context, we wished to investigate the existence of a natural way of parameterizing dependencies, i.e., a parameter that would emphasize the dependence properties of the distribution.

The core of the present paper is the following theorem:

Theorem 1. Let $X$ be a $d$-dimensional variable following a multivariate normal distribution with mean $\mu$ and covariance matrix $\Sigma$. Partition $X$ into $N$ subvectors $(X_1, \ldots, X_N)$, and let $i_d$ be the corresponding multiinformation density. Then the cumulant-generating function of $i_d$ is given by

$$\ln \mathrm{E}\left[e^{t i_d}\right] = t\, I(X_1; \ldots; X_N) - \frac{1}{2} \ln |I_d - t \Gamma|, \quad (2)$$

where

$$\Gamma = \Sigma\, \mathrm{diag}(\Sigma_{11}, \ldots, \Sigma_{NN})^{-1} - I_d$$

is the block matrix whose diagonal blocks are equal to 0 and where each off-diagonal block $(m, n)$ is the matrix of regression coefficients of $X_m$ on $X_n$,

$$\Gamma_{m|n} = \Sigma_{mn} \Sigma_{nn}^{-1}, \quad m \neq n. \quad (3)$$

The cumulants of $i_d$ are given by $\kappa_1(i_d) = I(X_1; \ldots; X_N)$ and

$$\kappa_l(i_d) = \frac{(l-1)!}{2} \mathrm{tr}\left(\Gamma^l\right), \quad l \geq 2.$$

This theorem is proved in Section 2. In Section 3, we investigate some consequences of this result. Section 4 is devoted to the discussion.

2 Proof of theorem

2.1 Cumulant-generating function

We partition $\mu$ and $\Sigma$ in accordance with the partitioning of $X$, so that $\mu_n$ is the expectation of $X_n$ and $\Sigma_{mn}$ the matrix of covariances between $X_m$ and $X_n$. Multiinformation between the $X_n$'s yields

$$I(X_1; \ldots; X_N) = \frac{1}{2} \ln \frac{\prod_{n=1}^N |\Sigma_{nn}|}{|\Sigma|}.$$

From there, we can express multiinformation density as

$$i_d = I(X_1; \ldots; X_N) + j_d, \quad (4)$$

where $j_d$ is defined as

$$j_d = \frac{1}{2} (X - \mu)^t\, \Phi\, (X - \mu) \quad (5)$$

and $\Phi$ as

$$\Phi = \mathrm{diag}(\Sigma_{11}, \ldots, \Sigma_{NN})^{-1} - \Sigma^{-1}.$$

Here, $\mathrm{diag}(\Sigma_{11}, \ldots, \Sigma_{NN})$ stands for the block-diagonal matrix with diagonal blocks equal to the $\Sigma_{nn}$'s. The moment-generating function of $j_d$ yields

$$\mathrm{E}\left[e^{t j_d}\right] = \int (2\pi)^{-\frac{d}{2}} |\Sigma|^{-\frac{1}{2}}\, e^{-\frac{1}{2}(x-\mu)^t (\Sigma^{-1} - t\Phi)(x-\mu)}\, dx.$$

Since it can be shown that $\Sigma^{-1} - t\Phi$ is positive definite at least in a neighborhood of $t = 0$ (see Appendix B), the integrand is proportional to a multivariate normal distribution with mean $\mu$ and covariance matrix $(\Sigma^{-1} - t\Phi)^{-1}$. Integration with respect to $x$ therefore yields

$$\mathrm{E}\left[e^{t j_d}\right] = \frac{|\Sigma^{-1}|^{\frac{1}{2}}}{|\Sigma^{-1} - t\Phi|^{\frac{1}{2}}} = |I_d - t\Gamma|^{-\frac{1}{2}}$$

and

$$\ln \mathrm{E}\left[e^{t j_d}\right] = -\frac{1}{2} \ln |I_d - t\Gamma|, \quad (6)$$

where $I_d$ is the $d$-by-$d$ unit matrix and $\Gamma = \Sigma\Phi$ the block matrix whose diagonal blocks are equal to 0 and where each off-diagonal block $(m, n)$ is the matrix of regression coefficients of $X_m$ on $X_n$ given by (Anderson, 2003, Definition 2.5.1)

$$\Gamma_{m|n} = \Sigma_{mn} \Sigma_{nn}^{-1}, \quad m \neq n. \quad (7)$$
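As a numerical sanity check of Equations (2) and (6), the cumulant-generating function can be compared against a Monte Carlo estimate. The following Python sketch does this for a randomly drawn covariance matrix; the dimension, block sizes, sample size, and variable names are arbitrary illustrative choices, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random well-conditioned covariance on d = 5 variables, partitioned into
# subvectors of sizes 2 and 3 (an arbitrary illustrative choice).
d, sizes = 5, [2, 3]
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)
blocks = np.split(np.arange(d), np.cumsum(sizes)[:-1])

# D = diag(Sigma_11, ..., Sigma_NN), Gamma = Sigma D^{-1} - I_d, and
# multiinformation I = (1/2) ln(prod_n |Sigma_nn| / |Sigma|).
D = np.zeros_like(Sigma)
for b in blocks:
    D[np.ix_(b, b)] = Sigma[np.ix_(b, b)]
Gamma = Sigma @ np.linalg.inv(D) - np.eye(d)
I = 0.5 * (np.linalg.slogdet(D)[1] - np.linalg.slogdet(Sigma)[1])

# Multiinformation density i_d = I + (1/2) x^t Phi x for centered x,
# with Phi = D^{-1} - Sigma^{-1}; estimate ln E[exp(t i_d)] by Monte Carlo.
Phi = np.linalg.inv(D) - np.linalg.inv(Sigma)
x = rng.multivariate_normal(np.zeros(d), Sigma, size=500_000)
i_d = I + 0.5 * np.einsum("ij,jk,ik->i", x, Phi, x)

max_gap = 0.0
for t in (-0.2, 0.1, 0.2):
    lhs = np.log(np.mean(np.exp(t * i_d)))                           # empirical CGF
    rhs = t * I - 0.5 * np.linalg.slogdet(np.eye(d) - t * Gamma)[1]  # Equation (2)
    max_gap = max(max_gap, abs(lhs - rhs))
print(max_gap)  # small, of the order of the Monte Carlo error
```

The comparison is made only for small $t$, where the theorem guarantees that $I_d - t\Gamma$ has positive determinant.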
2.2 Cumulants

The cumulants of $i_d$ can be calculated in closed form from those of $j_d$ and Equation (4) by noting that the first cumulant, $\kappa_1(i_d)$, is shift-equivariant, while the others, $\kappa_l(i_d)$ for $l \geq 2$, are shift-invariant (Kendall, 1945, § 3.13). This leads to

$$\kappa_1(i_d) = I(X_1; \ldots; X_N) + \kappa_1(j_d), \qquad \kappa_l(i_d) = \kappa_l(j_d), \quad l \geq 2. \quad (8)$$

Now, the cumulants of $j_d$ can be easily computed as follows. Using the fact that $|A| = e^{\mathrm{tr}[\ln(A)]}$ (Higham, 2007), which, for a positive definite matrix, can be expressed as $\ln |A| = \mathrm{tr}[\ln(A)]$, we have from Equation (6)

$$\ln \mathrm{E}\left[e^{t j_d}\right] = -\frac{1}{2} \mathrm{tr}[\ln(I_d - t\Gamma)].$$

For $t$ sufficiently small, we can perform a Taylor expansion of the log function around $I_d$ (Abramowitz and Stegun, 1972, Eq. 4.1.24), leading to

$$\ln \mathrm{E}\left[e^{t j_d}\right] = \frac{1}{2} \mathrm{tr}\left[\sum_{l=1}^{\infty} \frac{(t\Gamma)^l}{l}\right] = \sum_{l=1}^{\infty} \frac{t^l}{2l} \mathrm{tr}(\Gamma^l).$$

Identification with the decomposition of the same function in terms of cumulants (Kendall, 1945, § 3.12),

$$\ln \mathrm{E}\left[e^{t j_d}\right] = \sum_{l=1}^{\infty} \kappa_l \frac{t^l}{l!},$$

yields for the cumulants of $j_d$

$$\kappa_l(j_d) = \frac{(l-1)!}{2} \mathrm{tr}(\Gamma^l). \quad (9)$$

The same result could have been reached by using the fact that $j_d$ is a quadratic function of a multidimensional normal variate $x$, as evidenced in Equation (5), together with the expression of the cumulants of such functions (Magnus, 1986, Lemma 2). The cumulants of $i_d$ therefore yield

$$\kappa_1(i_d) = I(X_1; \ldots; X_N),$$

as expected, since the first cumulant is also the mean (Kendall, 1945, § 3.14), and, for $l \geq 2$,

$$\kappa_l(i_d) = \frac{(l-1)!}{2} \mathrm{tr}\left(\Gamma^l\right).$$

In particular, the variance, which is equal to the second cumulant (Kendall, 1945, § 3.14), is given by

$$\mathrm{Var}(i_d) = \kappa_2(i_d) = \sum_{1 \leq m < n \leq N} \mathrm{tr}\left(\Gamma_{m|n}\, \Gamma_{n|m}\right).$$

[…]

B Positive definiteness of $\Sigma^{-1} - t\Phi$

[…] $\Sigma^{-1} - t\Phi$ is therefore diagonalizable as well, with eigenvalues given by $(1+t)\lambda_i^2 - t = (\lambda_i^2 - 1)t + \lambda_i^2$, which are positive in a neighborhood of $t = 0$. $\Sigma^{-1} - t\Phi$ is therefore positive definite in a neighborhood of $t = 0$.
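The variance can also be checked without any sampling: since the diagonal blocks of $\Gamma$ vanish, $\kappa_2(i_d) = \frac{1}{2}\mathrm{tr}(\Gamma^2)$ reduces to a sum over pairs of subvectors. A small deterministic check in Python (block sizes are an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)

# Random SPD covariance partitioned into N = 3 subvectors of sizes 2, 1, 3.
sizes = [2, 1, 3]
d = sum(sizes)
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)
blocks = np.split(np.arange(d), np.cumsum(sizes)[:-1])

# Gamma = Sigma diag(Sigma_11, ..., Sigma_NN)^{-1} - I_d.
D = np.zeros_like(Sigma)
for b in blocks:
    D[np.ix_(b, b)] = Sigma[np.ix_(b, b)]
Gamma = Sigma @ np.linalg.inv(D) - np.eye(d)

# Diagonal blocks of Gamma vanish; off-diagonal block (m, n) is the matrix
# of regression coefficients Sigma_mn Sigma_nn^{-1} of Equation (7).
for m, bm in enumerate(blocks):
    assert np.allclose(Gamma[np.ix_(bm, bm)], 0)
    for n, bn in enumerate(blocks):
        if m != n:
            reg = Sigma[np.ix_(bm, bn)] @ np.linalg.inv(Sigma[np.ix_(bn, bn)])
            assert np.allclose(Gamma[np.ix_(bm, bn)], reg)

# kappa_2(i_d) = (1/2) tr(Gamma^2) equals the pairwise sum over blocks m < n.
kappa2 = 0.5 * np.trace(Gamma @ Gamma)
pairwise = sum(np.trace(Gamma[np.ix_(bm, bn)] @ Gamma[np.ix_(bn, bm)])
               for m, bm in enumerate(blocks)
               for n, bn in enumerate(blocks) if m < n)
print(kappa2, pairwise)
```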
C Alternative expression of multiinformation

For a decomposition of a multidimensional normal variable into several subvectors, multiinformation reads

$$I(X_1; \ldots; X_N) = \frac{1}{2} \ln \frac{\prod_{n=1}^N |\Sigma_{nn}|}{|\Sigma|}.$$

By comparison, we calculate

$$I_d + \Gamma = I_d + \Sigma\Phi = I_d + \Sigma\left[\mathrm{diag}(\Sigma_{11}, \ldots, \Sigma_{NN})^{-1} - \Sigma^{-1}\right] = \Sigma\, \mathrm{diag}(\Sigma_{11}, \ldots, \Sigma_{NN})^{-1},$$

leading to

$$|I_d + \Gamma| = |\Sigma\, \mathrm{diag}(\Sigma_{11}, \ldots, \Sigma_{NN})^{-1}| = \frac{|\Sigma|}{\prod_{n=1}^N |\Sigma_{nn}|}$$

and, finally,

$$-\frac{1}{2} \ln |I_d + \Gamma| = \frac{1}{2} \ln \frac{\prod_{n=1}^N |\Sigma_{nn}|}{|\Sigma|}.$$

D Checking asymptotic normality

Let the correlation matrix $R_d$ be a $d$-by-$d$ homogeneous matrix with parameter $\rho$, i.e., a matrix with 1s on the diagonal and all off-diagonal elements equal to $\rho$. Such a matrix has two eigenvalues: $1 + (d-1)\rho$ with multiplicity 1 (associated with the vector composed only of 1s) and $1 - \rho$ with multiplicity $d - 1$ (associated with the subspace of vectors with a zero mean). Such a matrix is positive definite for

$$-\frac{1}{d-1} < \rho < 1.$$

The expectation of $i_d$ is given by

$$\mathrm{E}(i_d) = -\frac{1}{2}\left\{(d-1)\ln(1-\rho) + \ln[1 + (d-1)\rho]\right\}.$$

To compute the higher cumulants of $i_d$, let $U_d$ be the $d$-by-$d$ matrix with all elements equal to 1. Using the fact that $\Gamma = \rho(U_d - I_d)$ together with $U_d^l = d^{l-1} U_d$ for $l \geq 2$, we obtain

$$\Gamma^l = \rho^l\left[(-1)^l I_d + \frac{(d-1)^l - (-1)^l}{d}\, U_d\right],$$

$$\mathrm{tr}(\Gamma^l) = \rho^l\left[(-1)^l d + (d-1)^l - (-1)^l\right],$$

$$\kappa_l(i_d) = \frac{(l-1)!}{2}\, \rho^l\left[(-1)^l d + (d-1)^l - (-1)^l\right].$$

In particular, we have $\mathrm{Var}(i_d) = \rho^2 d(d-1)/2$. For large $d$, we have

$$\kappa_l(i_d) \sim \frac{(l-1)!}{2}\, \rho^l d^l$$

for $l \geq 2$ and, in particular, $\mathrm{Var}(i_d) \sim \rho^2 d^2 / 2$. To investigate the asymptotic normality of $i_d$, we classically consider $u = [i_d - \mathrm{E}(i_d)] / \sqrt{\mathrm{Var}(i_d)}$. Using the fact that the cumulant of order $l$ is homogeneous of degree $l$, we obtain

$$\kappa_l(u) \sim 2^{l/2 - 1} (l-1)! = \mathrm{const}.$$
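The closed-form results for the homogeneous case can be verified numerically. The sketch below checks the two eigenvalues of $R_d$, the expression for $\mathrm{tr}(\Gamma^l)$, the variance $\rho^2 d(d-1)/2$, and the convergence of the standardized cumulants to the nonzero constant $2^{l/2-1}(l-1)!$; the values of $d$ and $\rho$ are arbitrary illustrative choices.

```python
import numpy as np
from math import factorial

# Homogeneous d-by-d correlation matrix R_d with parameter rho.
d, rho = 6, 0.3
R = (1 - rho) * np.eye(d) + rho * np.ones((d, d))

# Eigenvalues: 1 + (d-1)rho (multiplicity 1) and 1 - rho (multiplicity d-1).
eig = np.sort(np.linalg.eigvalsh(R))
assert np.allclose(eig[:-1], 1 - rho) and np.isclose(eig[-1], 1 + (d - 1) * rho)

# Gamma = rho (U_d - I_d); check the closed form for tr(Gamma^l).
Gamma = rho * (np.ones((d, d)) - np.eye(d))
for l in range(2, 7):
    tr = np.trace(np.linalg.matrix_power(Gamma, l))
    closed = rho**l * ((-1) ** l * d + (d - 1) ** l - (-1) ** l)
    assert np.isclose(tr, closed)

# Variance rho^2 d (d-1) / 2.
var = 0.5 * np.trace(Gamma @ Gamma)

# For large d, the standardized cumulants kappa_l(u) = kappa_l / Var^{l/2}
# approach the nonzero constant 2^(l/2 - 1) (l-1)!, so u is not asymptotically normal.
big_d = 10_000
for l in (3, 4):
    kl = factorial(l - 1) / 2 * rho**l * ((-1) ** l * big_d + (big_d - 1) ** l - (-1) ** l)
    v = rho**2 * big_d * (big_d - 1) / 2
    limit = 2 ** (l / 2 - 1) * factorial(l - 1)
    assert abs(kl / v ** (l / 2) - limit) / limit < 1e-3
print(var)
```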
If $u$ were asymptotically normal, $\kappa_l(u)$ for $l \geq 3$ would tend to 0 as $d \to \infty$, which is not the case. As a consequence, $u$ is not asymptotically normal.

References

Abramowitz, M., Stegun, I.A. (Eds.), 1972. Handbook of Mathematical Functions. Number 55 in Applied Math., National Bureau of Standards.

Ali, S., Silvey, S.D., 1966. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 28, 131–142.

Anderson, T.W., 2003. An Introduction to Multivariate Statistical Analysis. Wiley Series in Probability and Mathematical Statistics. 3rd ed., John Wiley and Sons, New York.

Csiszár, I., 1963. Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. A Magyar Tudományos Akadémia Matematikai és Fizikai Tudományok Osztályának Közleményei 8, 85–108.

Csiszár, I., 1967. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica 2, 229–318.

Garner, W.R., 1962. Uncertainty and Structure as Psychological Concepts. John Wiley & Sons, New York.

Higham, N.J., 2007. Functions of matrices, in: Hogben, L. (Ed.), Handbook of Linear Algebra. Chapman & Hall/CRC Press, Boca Raton. Discrete Mathematics and its Applications. chapter 11.

Joe, H., 1989. Relative entropy measures of multivariate dependence. Journal of the American Statistical Association 84, 157–164.

Jupp, P.E., Mardia, K.V., 1980. A general correlation coefficient for directional data and related regression problems. Biometrika 67, 163–173.

Kendall, M.G., 1945. The Advanced Theory of Statistics. volume 1. 2nd ed., Charles Griffin & Co. Ltd., London.

Kotz, S., Nadarajah, S., 2004. Multivariate t Distributions and their Applications. Cambridge University Press, Cambridge, UK.

Kullback, S., 1968. Information Theory and Statistics. Dover, Mineola, NY.

Magnus, J., 1986. The exact moments of a ratio of quadratic forms. Annales d'économie et de statistique 4, 95–109.

Mardia, K.V., Jupp, P.E., 2000. Directional Statistics. Wiley Series in Probability and Statistics, Wiley, Chichester.

Polyanskiy, Y., Wu, Y., 2017. Lecture notes on information theory. http://www.stat.yale.edu/~yw562/ln.html.

Studený, M., Vejnarová, J., 1998. The multiinformation function as a tool for measuring stochastic dependence, in: Jordan, M.I. (Ed.), Proceedings of the NATO Advanced Study Institute on Learning in Graphical Models, pp. 261–298.

Vajda, I., 1972. On the f-divergence and singularity of probability measures. Periodica Mathematica Hungarica 2, 223–234.

Watanabe, S., 1960. Information theoretical analysis of multivariate correlation. IBM Journal of Research and Development 4, 66–82.
