Information Inequalities for Joint Distributions, with Interpretations and Applications
Upper and lower bounds are obtained for the joint entropy of a collection of random variables in terms of an arbitrary collection of subset joint entropies. These inequalities generalize Shannon's chain rule for entropy as well as inequalities of Han…
Authors: Mokshay Madiman, Prasad Tetali
Submitted to IEEE Transactions on Information Theory, 2007

Mokshay Madiman, Member, IEEE, and Prasad Tetali, Member, IEEE

Material in this paper was presented at the Information Theory and Applications Workshop, San Diego, CA, January 2007, and at the IEEE Symposium on Information Theory, Nice, France, June 2007. Mokshay Madiman is with the Department of Statistics, Yale University, 24 Hillhouse Avenue, New Haven, CT 06511, USA. Email: mokshay.madiman@yale.edu. Prasad Tetali is with the School of Mathematics and College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA. Email: tetali@math.gatech.edu. Supported in part by NSF grants DMS-0401239 and DMS-0701043.

Abstract: Upper and lower bounds are obtained for the joint entropy of a collection of random variables in terms of an arbitrary collection of subset joint entropies. These inequalities generalize Shannon's chain rule for entropy as well as inequalities of Han, Fujishige and Shearer. A duality between the upper and lower bounds for joint entropy is developed. All of these results are shown to be special cases of general, new results for submodular functions; thus, the inequalities presented constitute a richly structured class of Shannon-type inequalities. The new inequalities are applied to obtain new results in combinatorics, such as bounds on the number of independent sets in an arbitrary graph and the number of zero-error source-channel codes, as well as new determinantal inequalities in matrix theory. A new inequality for relative entropies is also developed, along with interpretations in terms of hypothesis testing. Finally, revealing connections of the results to literature in economics, computer science, and physics are explored.

Index Terms: Entropy inequality; inequality for minors; entropy-based counting; submodularity.

I. INTRODUCTION

Let $X_1, X_2, \ldots, X_n$ be a collection of random variables. There are the familiar two canonical cases: (a) the random variables are real-valued and possess a probability density function, in which case $h$ represents the differential entropy, or (b) they are discrete, in which case $H$ represents the discrete entropy. More generally, if the joint distribution has a density $f$ with respect to some reference product measure, the joint entropy may be defined by $-E[\log f(X_1, X_2, \ldots, X_n)]$; with this definition, $H$ corresponds to counting measure and $h$ to Lebesgue measure. The only assumption we will implicitly make throughout is that the joint entropy is finite, i.e., neither $-\infty$ nor $+\infty$.

We wish to discuss the relationship between the joint entropies of various subsets of the random variables $X_1, X_2, \ldots, X_n$. Thus we are motivated to consider an arbitrary collection $\mathcal{C}$ of subsets of $\{1, 2, \ldots, n\}$. The following conventions are useful:

• $[n]$ is the index set $\{1, 2, \ldots, n\}$. We equip this set with its natural (increasing) order, so that $1 < 2 < \ldots < n$. (Any other total order would do equally well, and indeed we use this flexibility later, but it is convenient to fix a default order.)
• For any set $s \subset [n]$, $X_s$ stands for the collection of random variables $(X_i : i \in s)$, with the indices taken in their increasing order.
• For any index $i$ in $[n]$, define the degree of $i$ in $\mathcal{C}$ as $r(i) = |\{t \in \mathcal{C} : i \in t\}|$. Let $r_-(s) = \min_{i \in s} r(i)$ denote the minimal degree in $s$, and $r_+(s) = \max_{i \in s} r(i)$ denote the maximal degree in $s$.

First we present a weak form of our main inequality.

Proposition I: [WEAK DEGREE FORM] Let $X_1, \ldots, X_n$ be arbitrary random variables jointly distributed on some discrete sets.
For any collection $\mathcal{C}$ such that each index $i$ has non-zero degree,

$$\sum_{s \in \mathcal{C}} \frac{H(X_s \mid X_{s^c})}{r_+(s)} \le H(X_{[n]}) \le \sum_{s \in \mathcal{C}} \frac{H(X_s)}{r_-(s)}, \tag{1}$$

where $r_+(s)$ and $r_-(s)$ are the maximal and minimal degrees in $s$. If $\mathcal{C}$ satisfies $r_-(s) = r_+(s)$ for each $s$ in $\mathcal{C}$, then (1) also holds for $h$ in the setting of continuous random variables.

Proposition I unifies a large number of inequalities in the literature. Indeed,

1) Applying to the class $\mathcal{C}_1$ of singletons,

$$\sum_{i=1}^{n} H(X_i \mid X_{[n] \setminus i}) \le H(X_{[n]}) \le \sum_{i=1}^{n} H(X_i). \tag{2}$$

The upper bound represents the subadditivity of entropy noticed by Shannon. The lower bound may be interpreted as the fact that the erasure entropy of a collection of random variables is not greater than their entropy; see Section VI for further comments.

2) Applying to the class $\mathcal{C}_{n-1}$ of all sets of $n-1$ elements,

$$\frac{1}{n-1} \sum_{i=1}^{n} H(X_{[n] \setminus i} \mid X_i) \le H(X_{[n]}) \le \frac{1}{n-1} \sum_{i=1}^{n} H(X_{[n] \setminus i}). \tag{3}$$

This is Han's inequality [23], [10], in its prototypical form.

3) Let $r_- = \min_{i \in [n]} r(i)$ and $r_+ = \max_{i \in [n]} r(i)$ be the minimal and maximal degrees with respect to $\mathcal{C}$. Using $r_- \le r_-(s)$ and $r_+ \ge r_+(s)$, we have

$$\frac{1}{r_+} \sum_{s \in \mathcal{C}} H(X_s \mid X_{s^c}) \le H(X_{[n]}) \le \frac{1}{r_-} \sum_{s \in \mathcal{C}} H(X_s).$$

The upper bound is Shearer's lemma [9], known in the combinatorics literature [43]. The lower bound is new.

The paper is organized as follows. First, in Section II, the notions of fractional coverings and packings using hypergraphs, which provide a useful language for the information inequalities we present, are developed. In Section III, we present the main technical result of this paper, which is a new inequality for submodular set functions.
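As a numerical sanity check of the weak degree form (1), the following Python sketch evaluates both bounds for a randomly generated joint distribution of four binary variables. The helper names (`joint_entropy`, `weak_degree_bounds`, `random_pmf`), the 0-based indexing, the binary alphabet, and the particular collection are illustrative assumptions and are not taken from the paper.

```python
import itertools
import math
import random


def joint_entropy(pmf, subset):
    """Shannon entropy (in nats) of the marginal of `pmf` on the coordinates in `subset`."""
    marginal = {}
    for outcome, p in pmf.items():
        key = tuple(outcome[i] for i in sorted(subset))
        marginal[key] = marginal.get(key, 0.0) + p
    return -sum(p * math.log(p) for p in marginal.values() if p > 0)


def weak_degree_bounds(pmf, n, collection):
    """Return (lower bound, H(X_[n]), upper bound) as in inequality (1)."""
    full = frozenset(range(n))
    degree = {i: sum(1 for s in collection if i in s) for i in range(n)}
    assert all(d > 0 for d in degree.values()), "every index needs non-zero degree"
    h_all = joint_entropy(pmf, full)
    lower = upper = 0.0
    for s in collection:
        r_minus = min(degree[i] for i in s)
        r_plus = max(degree[i] for i in s)
        # H(X_s | X_{s^c}) = H(X_[n]) - H(X_{s^c})
        lower += (h_all - joint_entropy(pmf, full - s)) / r_plus
        upper += joint_entropy(pmf, s) / r_minus
    return lower, h_all, upper


def random_pmf(n, alphabet_size, rng):
    """A random joint pmf on {0, ..., alphabet_size - 1}^n (illustrative only)."""
    outcomes = list(itertools.product(range(alphabet_size), repeat=n))
    weights = [rng.random() for _ in outcomes]
    total = sum(weights)
    return {o: w / total for o, w in zip(outcomes, weights)}


rng = random.Random(0)
n = 4
pmf = random_pmf(n, 2, rng)
# A non-regular collection on indices 0..3 (degrees 2, 3, 1, 2).
collection = [frozenset(s) for s in ([0, 1], [1, 2, 3], [0, 3], [1])]
lower, h_all, upper = weak_degree_bounds(pmf, n, collection)
print(lower <= h_all <= upper)
```

Replacing `collection` by all singletons or by all sets of size $n-1$ gives numerical instances of (2) and (3), respectively.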
Section IV presents the main entropy inequality of this paper, which strengthens Proposition I, and gives a very simple proof as a corollary of the general result for submodular functions. This entropy inequality is developed in two forms, which we call the strong fractional form and the strong degree form; Proposition I may then be thought of as the weak degree form. A different manifestation of the upper bound in this weak degree form of the inequality was recently proved (in a more involved manner) by Friedgut [15]; the relationship with his result is also further discussed in Section IV using the preliminary concepts developed in Section II.

While independent sets in graphs have always been of combinatorial and graph-theoretical interest, counting independent sets in bipartite graphs received renewed attention due to Kahn's entropy approach [26] to Dedekind's problem. Dedekind's problem involves counting the number of antichains in the Boolean lattice, or equivalently, counting the number of Boolean functions on $n$ variables that can be constructed using only AND and OR (and no NOT) gates. To handle this problem by induction on the number of levels in the lattice, Kahn first derived a tight bound on the (logarithm of the) number of independent sets in a regular bipartite graph. In Section V, we build on Kahn's work to obtain a bound on the number of independent sets in an arbitrary graph. We also generalize this to counting graph homomorphisms, with applications to graph coloring and zero-error source-channel codes.

The applications of entropy inequalities to counting typically involve discrete random variables, but the inequalities also have applications when applied to continuous random variables. In Section VI, we develop such an application by proving a new family of determinantal inequalities that provide generalizations of the classical determinantal inequalities of Hadamard, Szasz and Fischer.
Having presented two applications of our main inequalities, we move on to studying the structure of the inequalities more closely. In Section VII, we present a duality between our upper and lower bounds that generalizes a theorem of Fujishige [17]. In particular, we show that the collection of upper bounds on $H(X_{[n]})$ for all collections $\mathcal{C}$ is equivalent to the collection of lower bounds. There we also discuss interpretations of the inequality relating to sensor networks and erasure entropy, and generalize the monotonicity property for special collections of subsets discovered by Han [23].

Section VIII presents some new entropy power inequalities for joint distributions, and points out an intriguing analogy between them and the recent subset sum entropy power inequalities of Madiman and Barron [33]. In Section IX, we prove inequalities for relative entropy between joint distributions. Interpretations of the relative entropy inequality through hypothesis testing and concentration of measure are also given there.

In Section X, we note that weaker versions of our main inequality for submodular functions follow from results developed in various communities (economics, computer science, physics); this history and the consequent connections do not seem to be well known or much tapped in information theory. Finally, in Section XI, we conclude with some final remarks and a brief discussion of other applications, including to multiuser information theory.

II. ON HYPERGRAPHS AND RELATED CONCEPTS

It is appropriate here to recall some terminology from discrete mathematics. A collection $\mathcal{C}$ of subsets of $[n]$ is called a hypergraph, and each set $s$ in $\mathcal{C}$ is called a hyperedge. When each hyperedge has cardinality 2, then $\mathcal{C}$ can be thought of as the set of edges of an undirected graph on $n$ labelled vertices. Thus all the statements made above can be translated into the language of hypergraphs.
In the rest of this paper, we interchangeably use "hypergraph" and "collection" for $\mathcal{C}$, "hyperedge" and "set" for $s$ in $\mathcal{C}$, and "vertex" and "index" for $i$ in $[n]$. We have the following standard definitions.

Definition I: The collection $\mathcal{C}$ is said to be $r$-regular if each index $i$ in $[n]$ has the same degree $r$, i.e., if each vertex $i$ appears in exactly $r$ hyperedges of $\mathcal{C}$.

The following definitions extend the familiar notion of packings, coverings and partitions of sets by allowing fractional counts. The history of these notions is unclear to us, but some references can be found in the book by Scheinerman and Ullman [44].

Definition II: Given a collection $\mathcal{C}$ of subsets of $[n]$, a function $\alpha : \mathcal{C} \to \mathbb{R}_+$ is called a fractional covering if for each $i \in [n]$, we have $\sum_{s \in \mathcal{C} : i \in s} \alpha(s) \ge 1$. Given $\mathcal{C}$, a function $\beta : \mathcal{C} \to \mathbb{R}_+$ is a fractional packing if for each $i \in [n]$, we have $\sum_{s \in \mathcal{C} : i \in s} \beta(s) \le 1$. If $\gamma : \mathcal{C} \to \mathbb{R}_+$ is both a fractional covering and a fractional packing, we call $\gamma$ a fractional partition.

Note that the standard definition of a fractional packing of $[n]$ using $\mathcal{C}$ (as in [44]) would assign weights $\beta_i$ to the elements $i \in [n]$ (rather than to the sets), and require that, for each $s \in \mathcal{C}$, we have $\sum_{i \in s} \beta_i \le 1$. Our terminology can be justified if one considers the "dual hypergraph," obtained by interchanging the roles of elements and sets: consider the 0-1 incidence matrix of the set system (with rows indexed by the elements and columns by the sets), and simply switch the roles of the elements and the sets.

The following simple lemmas are useful.

Lemma I: [FRACTIONAL ADDITIVITY] Let $\{a_i : i \in [n]\}$ be an arbitrary collection of real numbers. For any $s \subset [n]$, define $a_s = \sum_{j \in s} a_j$. For any fractional partition $\gamma$ using any hypergraph $\mathcal{C}$, $a_{[n]} = \sum_{s \in \mathcal{C}} \gamma(s) a_s$.
Furthermore, if each $a_i \ge 0$, then

$$\sum_{s \in \mathcal{C}} \beta(s) a_s \le a_{[n]} \le \sum_{s \in \mathcal{C}} \alpha(s) a_s \tag{4}$$

for any fractional packing $\beta$ and any fractional covering $\alpha$ using $\mathcal{C}$.

Proof: Interchanging sums implies

$$\sum_{s \in \mathcal{C}} \alpha(s) \sum_{i \in s} a_i = \sum_{i \in [n]} a_i \sum_{s \in \mathcal{C}} \alpha(s) \mathbf{1}_{\{i \in s\}} \ge \sum_{i \in [n]} a_i,$$

using the definition of a fractional covering. The other statements are similarly obvious.

We introduce the notion of quasiregular hypergraphs.

Definition III: The hypergraph $\mathcal{C}$ is quasiregular if the degree function $r : [n] \to \mathbb{Z}_+$ defined by $r(i) = |\{s \in \mathcal{C} : s \ni i\}|$ is constant on $s$, for each $s$ in $\mathcal{C}$.

Example: One can construct simple examples of quasiregular hypergraphs using what are called bi-regular graphs in the graph theory literature. Consider a bipartite graph on vertex sets $V_1$ and $V_2$ (i.e., all edges go between $V_1$ and $V_2$), such that every vertex in $V_1$ has degree $r_1$ and every vertex in $V_2$ has degree $r_2$. Such a graph always exists if $|V_1| r_1 = |V_2| r_2$. Now consider the hypergraph on $V_1 \cup V_2$ with hyperedges being the neighborhoods of vertices in the bipartite graph. This hypergraph is quasiregular (with degrees being $r_1$ and $r_2$), and it is not regular if $r_1$ is different from $r_2$.

There is a sense in which all quasiregular hypergraphs are similar to the example above; specifically, any quasiregular hypergraph has a canonical decomposition as a disjoint union of regular subhypergraphs.

Lemma II: Suppose the hypergraph $\mathcal{C}$ on the vertex set $[n]$ is quasiregular. Then one can partition $[n]$ into disjoint subsets $\{V_m\}$, and $\mathcal{C}$ into disjoint subhypergraphs $\{\mathcal{C}_m\}$, such that each $\mathcal{C}_m$ is a regular hypergraph on vertex set $V_m$.

Proof: Consider the equivalence relation on $[n]$ induced by the degree, i.e., $i$ and $j$ are related if $r(i) = r(j)$. This relation decomposes $[n]$ into disjoint equivalence classes $\{V_m\}$.
Since $\mathcal{C}$ is quasiregular, all indices in $s$ have the same degree for each set $s \in \mathcal{C}$, and hence each $s \in \mathcal{C}$ is a subset of exactly one equivalence class $V_m$. Q.E.D.

The notion of quasiregularity is related to what we believe is an important and natural fractional covering/packing pair. As long as there is at least one set $s$ in the hypergraph $\mathcal{C}$ that contains $i$, we have

$$\sum_{s \in \mathcal{C},\, s \ni i} \frac{1}{r_-(s)} = \sum_{s \in \mathcal{C}} \frac{\mathbf{1}_{\{i \in s\}}}{r_-(s)} \ge \sum_{s \in \mathcal{C}} \frac{\mathbf{1}_{\{i \in s\}}}{r(i)} = 1,$$

so that the numbers $\alpha(s) = \frac{1}{r_-(s)}$ provide a fractional covering. Similarly, the numbers $\beta(s) = \frac{1}{r_+(s)}$ provide a fractional packing.

Definition IV: Let $\mathcal{C}$ be any hypergraph on $[n]$ such that every index appears in at least one hyperedge. The fractional covering given by $\alpha(s) = \frac{1}{r_-(s)}$ is called the degree covering, and the fractional packing given by $\beta(s) = \frac{1}{r_+(s)}$ is called the degree packing.

The following lemma is a trivial consequence of the definitions.

Lemma III: If $\mathcal{C}$ is quasiregular, the degree packing and degree covering coincide and provide a fractional partition of $[n]$ using $\mathcal{C}$. In particular, $a_{[n]} = \sum_{s \in \mathcal{C}} a_s / r_-(s)$.

One may define the weight of a fractional partition as follows.

Definition V: Let $\gamma$ be a fractional partition (or a fractional covering or packing). Then the weight of $\gamma$ is $w(\gamma) = \sum_{s \in \mathcal{C}} \gamma(s)$.

There are natural optimization problems associated with the weight function. The problem of minimizing the weight of $\alpha$ over all fractional coverings $\alpha$ is called the optimal fractional covering problem, and that of maximizing the weight of $\beta$ over all fractional packings $\beta$ is called the optimal fractional packing problem. These are linear programming relaxations of the integer programs associated with optimal covering and optimal packing, which are of course important in many applications. Much work has been done on these problems, including studies of the integrality gap (see, e.g., [44]).
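The degree covering and degree packing of Definition IV, together with the sandwich (4) of Lemma I, can be checked on a small hypergraph. The sketch below is only an illustration: the function names, the example hypergraph on $[4]$, and the non-negative numbers $a_i$ are assumptions chosen for the example, and exact rational arithmetic is used to avoid rounding issues.

```python
from fractions import Fraction


def degrees(n, collection):
    """r(i): number of hyperedges of `collection` that contain index i."""
    return {i: sum(1 for s in collection if i in s) for i in range(1, n + 1)}


def degree_covering(n, collection):
    """alpha(s) = 1 / r_-(s), with r_-(s) the minimal degree in s."""
    r = degrees(n, collection)
    return [Fraction(1, min(r[i] for i in s)) for s in collection]


def degree_packing(n, collection):
    """beta(s) = 1 / r_+(s), with r_+(s) the maximal degree in s."""
    r = degrees(n, collection)
    return [Fraction(1, max(r[i] for i in s)) for s in collection]


def vertex_loads(n, collection, weights):
    """For each index i, the total weight of the hyperedges containing i."""
    return {i: sum(w for s, w in zip(collection, weights) if i in s)
            for i in range(1, n + 1)}


def weighted_sum(a, collection, weights):
    """Sum over s of weight(s) * a_s, where a_s is the sum of a_j for j in s."""
    return sum(w * sum(a[j] for j in s) for s, w in zip(collection, weights))


n = 4
collection = [{1, 2}, {2, 3, 4}, {1, 4}, {2}]  # hypothetical example hypergraph
alpha = degree_covering(n, collection)
beta = degree_packing(n, collection)

# Covering: every index carries total weight >= 1; packing: total weight <= 1.
is_covering = all(load >= 1 for load in vertex_loads(n, collection, alpha).values())
is_packing = all(load <= 1 for load in vertex_loads(n, collection, beta).values())

a = {1: 3, 2: 1, 3: 4, 4: 2}  # arbitrary non-negative numbers a_i
a_total = sum(a.values())
lower = weighted_sum(a, collection, beta)
upper = weighted_sum(a, collection, alpha)
print(is_covering, is_packing, lower <= a_total <= upper)
```

Since this hypergraph is not quasiregular, the two weightings differ and both inequalities in (4) are strict for this choice of $a$.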
One may also define a notion of duality for fractional partitions.

Definition VI: For any hypergraph $\mathcal{C}$, define the complementary hypergraph as $\bar{\mathcal{C}} = \{s^c : s \in \mathcal{C}\}$. If $\alpha$ is a fractional covering (or packing) using $\mathcal{C}$, the dual fractional packing (respectively, covering) using $\bar{\mathcal{C}}$ is defined by

$$\bar{\alpha}(s^c) = \frac{\alpha(s)}{w(\alpha) - 1}.$$

To see that this definition makes sense (say for the case of a fractional covering $\alpha$), note that for each $i \in [n]$,

$$\sum_{s^c \in \bar{\mathcal{C}},\, s^c \ni i} \bar{\alpha}(s^c) = \sum_{s \in \mathcal{C},\, i \notin s} \frac{\alpha(s)}{w(\alpha) - 1} = \frac{\sum_{s \in \mathcal{C}} \alpha(s) - \sum_{s \in \mathcal{C},\, i \in s} \alpha(s)}{w(\alpha) - 1} \le \frac{w(\alpha) - 1}{w(\alpha) - 1} = 1.$$

III. A NEW INEQUALITY FOR SUBMODULAR FUNCTIONS

The following definitions are necessary in order to state the main technical result of this paper.

Definition VII: The set function $f : 2^{[n]} \to \mathbb{R}$ is submodular if

$$f(s) + f(t) \ge f(s \cup t) + f(s \cap t)$$

for every $s, t \subset [n]$. If $-f$ is submodular, we say that $f$ is supermodular.

Definition VIII: For any disjoint subsets $s$ and $t$ of $[n]$, define $f(s \mid t) = f(s \cup t) - f(t)$. For a fixed subset $t \subsetneq [n]$, the function $f_t : 2^{[n] \setminus t} \to \mathbb{R}$ defined by $f_t(s) = f(s \mid t)$ is called $f$ conditional on $t$.

For any $s \subset [n]$, denote by $< s$ the set of indices less than every index in $s$. Similarly, $> s$ is the set of indices greater than every index in $s$. Also, the index $i$ is identified with the set $\{i\}$; thus, for instance, $< i$ is well-defined. We also write $[i : i+k]$ for $\{i, i+1, \ldots, i+k-1, i+k\}$. Note that $[n] = [1 : n]$.

Lemma IV: Let $f : 2^{[n]} \to \mathbb{R}$ be any submodular function with $f(\phi) = 0$.

1) If $s, t, u$ are disjoint sets,

$$f(s \mid t, u) \le f(s \mid t). \tag{5}$$

2) The following "chain rule" expression holds for $f([n])$:

$$f([n]) = \sum_{i \in [n]} f(i \mid < i).$$
Proof: First note that if $s, t, u$ are disjoint sets, then submodularity implies

$$f(s \cup t \cup u) + f(t) \le f(s \cup t) + f(t \cup u),$$

which is equivalent to $f(s \mid t, u) \le f(s \mid t)$.

The "chain rule" expression for $f([n])$ is obtained by induction. Note that

$$f([2]) = f(1) + f(2 \mid 1) = f(1 \mid \phi) + f(2 \mid 1)$$

since $f(\phi) = 0$. Now assume the chain rule holds for $[n]$, and observe that

$$f([n+1]) = f([n]) + f(n+1 \mid [n]) = \sum_{i \in [n+1]} f(i \mid < i),$$

where we used the induction hypothesis for the second equality.

Theorem I: Let $f : 2^{[n]} \to \mathbb{R}$ be any submodular function with $f(\phi) = 0$. Let $\gamma$ be any fractional partition with respect to any collection $\mathcal{C}$ of subsets of $[n]$. Then

$$\sum_{s \in \mathcal{C}} \gamma(s) f(s \mid s^c \setminus > s) \le f([n]) \le \sum_{s \in \mathcal{C}} \gamma(s) f(s \mid < s).$$

Proof: The chain rule (actually a slightly extended version of it, with additional conditioning in all terms, that can be proved in exactly the same way) implies

$$f(s \mid < s) = \sum_{j \in s} f(j \mid < j \cap s, < s). \tag{6}$$

Thus

$$\begin{aligned}
\sum_{s \in \mathcal{C}} \alpha(s) f(s \mid < s) &\overset{(a)}{=} \sum_{s \in \mathcal{C}} \alpha(s) \sum_{j \in s} f(j \mid < j \cap s, < s) \\
&\overset{(b)}{\ge} \sum_{s \in \mathcal{C}} \alpha(s) \sum_{j \in s} f(j \mid < j) \\
&\overset{(c)}{=} \sum_{j \in [n]} f(j \mid < j) \sum_{s \in \mathcal{C}} \alpha(s) \mathbf{1}_{\{j \in s\}} \\
&\overset{(d)}{\ge} \sum_{j \in [n]} f(j \mid < j) \overset{(a)}{=} f([n]),
\end{aligned}$$

where (a) follows by the chain rule (6), (b) follows from (5), (c) follows by interchanging sums, and (d) follows by the definition of a fractional covering.

The lower bound may be proved in a similar fashion by a chain of inequalities. Indeed,

$$\begin{aligned}
\sum_{s \in \mathcal{C}} \beta(s) f(s \mid s^c \setminus > s) &\overset{(a)}{=} \sum_{s \in \mathcal{C}} \beta(s) \sum_{j \in s} f(j \mid < j \cap s, s^c \setminus > s) \\
&\overset{(b)}{\le} \sum_{s \in \mathcal{C}} \beta(s) \sum_{j \in s} f(j \mid < j) \\
&\overset{(c)}{=} \sum_{j \in [n]} f(j \mid < j) \sum_{s \in \mathcal{C}} \mathbf{1}_{\{j \in s\}} \beta(s) \\
&\overset{(e)}{\le} \sum_{j \in [n]} f(j \mid < j) \overset{(a)}{=} f([n]),
\end{aligned}$$

where (a), (b), (c) follow as above, and (e) follows by the definition of a fractional packing. For a fractional partition $\gamma$, which is both a covering and a packing, steps (d) and (e) hold with equality.
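To make the conditioning sets $< s$ and $s^c \setminus > s$ in Theorem I concrete, the following Python sketch evaluates both bounds for a small coverage function $f(s) = |\bigcup_{i \in s} A_i|$, which is submodular with $f(\phi) = 0$. The ground sets $A_i$, the choice of the 3-regular collection of all 2-subsets of $[4]$ with weights $1/3$, and all function names are illustrative assumptions, not part of the paper.

```python
from fractions import Fraction
from itertools import combinations


def coverage(ground_sets):
    """Return the coverage function f(s) = |union of ground_sets[i] for i in s|."""
    def f(s):
        covered = set()
        for i in s:
            covered |= ground_sets[i]
        return len(covered)
    return f


def conditional(f, s, t):
    """f(s | t) = f(s union t) - f(t) for disjoint s and t."""
    return f(s | t) - f(t)


def theorem_one_bounds(f, n, collection, gamma):
    """Return (lower bound, f([n]), upper bound) from Theorem I, natural order on [n]."""
    everything = frozenset(range(1, n + 1))
    lower = upper = Fraction(0)
    for s, g in zip(collection, gamma):
        below = frozenset(range(1, min(s)))           # < s
        above = frozenset(range(max(s) + 1, n + 1))   # > s
        complement = everything - s                   # s^c
        lower += g * conditional(f, s, complement - above)
        upper += g * conditional(f, s, below)
    return lower, f(everything), upper


# Hypothetical ground sets A_1, ..., A_4 defining the coverage function.
ground_sets = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d", "e"}, 4: {"a", "e"}}
f = coverage(ground_sets)

n = 4
collection = [frozenset(c) for c in combinations(range(1, n + 1), 2)]
gamma = [Fraction(1, 3)] * len(collection)  # each index lies in exactly 3 pairs

lower, full, upper = theorem_one_bounds(f, n, collection, gamma)
# Unconditioned upper bound sum_s gamma(s) f(s), for comparison.
weak_upper = sum(g * f(s) for s, g in zip(collection, gamma))
print(lower, full, upper, weak_upper)
```

In this example the conditioned upper bound is smaller than the unconditioned sum $\sum_s \gamma(s) f(s)$, which illustrates how inequality (5) tightens the bound.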
Remark 1: The key new element in this result is the fact that one can use, for any ordering on the ground set $[n]$, the conditional values of $f$ that appear in the upper and lower bounds for $f([n])$. Because of (5), this is an improvement over simply using $f$. The latter weaker inequality has been implicit in the cooperative game theory literature; various historical remarks explicating these connections are given in Section X.

Corollary I: Let $f : 2^{[n]} \to \mathbb{R}$ be any submodular function with $f(\phi) = 0$, such that $f([j])$ is non-decreasing in $j$ for $j \in [n]$. Then, for any collection $\mathcal{C}$ of subsets of $[n]$,

$$\sum_{s \in \mathcal{C}} \beta(s) f(s \mid s^c \setminus > s) \le f([n]) \le \sum_{s \in \mathcal{C}} \alpha(s) f(s \mid < s),$$

where $\beta$ is any fractional packing and $\alpha$ is any fractional covering of $\mathcal{C}$.

Proof: The proof is almost exactly the same as that of Theorem I, the only difference being that the validity there of (d) for fractional coverings and of (e) for fractional packings is guaranteed by the non-negativity of $f(j \mid < j)$.

Observe that if $f$ defines a polymatroid (i.e., $f$ is not only submodular but also non-decreasing in the sense that $f(s) \le f(t)$ if $s \subset t$), then the condition of Corollary I is automatically satisfied.

IV. ENTROPY INEQUALITIES

A. Strong Fractional Form

The main entropy inequality introduced in this work is the following generalization of Shannon's chain rule.

Theorem I': [STRONG FRACTIONAL FORM] For any collection $\mathcal{C}$ of subsets of $[n]$,

$$\sum_{s \in \mathcal{C}} \beta(s) H(X_s \mid X_{s^c \setminus > s}) \le H(X_{[n]}) \le \sum_{s \in \mathcal{C}} \alpha(s) H(X_s \mid X_{< s})$$

[…] $\le h(X_{[n]}) \le \sum_{s \in \mathcal{C}} \gamma(s) h(X_s \mid X_{< s})$ […] $/\, r_+(s) \le H(X_{[n]}) \le \sum_{s \in \mathcal{C}} H(X_s \mid X_{< s})$ […] in the lower bound.

Remark 4: The collections $\mathcal{C}$ for which the results in this paper hold need not consist of distinct sets.
That is, one may have multiple copies of a particular $s \subset [n]$ contained in $\mathcal{C}$, and as long as this is taken into account in counting the degrees of the indices (or checking that a set of coefficients forms a fractional packing or covering), the statements extend. We will make use of this feature when developing applications to combinatorics in Section V.

Remark 5: Using the previous remark, one may write down Theorem II with arbitrary numbers of repetitions of each set in $\mathcal{C}$. This gives a version of Theorem I' with rational coefficients, following which an approximation argument can be used to obtain Theorem I'. This proof is similar to the one alluded to by Friedgut [15] for the version without ordering. Thus Theorem II is actually equivalent to Theorem I'.

The strong degree form of the inequality generalizes Shannon's chain rule. In order to see this, simply choose the collection $\mathcal{C}$ to be $\mathcal{C}_1$, the collection of all singletons. For this collection, Theorem II says

$$\sum_{i=1}^{n} H(X_i \mid X_{[n] \setminus \ge i}) \le H(X_{[n]}) \le \sum_{i=1}^{n} H(X_i \mid X_{< i})$$

[…], which is essentially the Gaussian density. The classical determinantal inequalities of Hadamard and Fischer then follow from the subadditivity of entropy. This approach seems to have been first cast in probabilistic language by Dembo, Cover and Thomas [11], who further showed that an inequality of Szasz can be derived (and generalized) using Han's inequality. Following this well-trodden path, Proposition II yields the following general determinantal inequality.

Corollary III: [DETERMINANTAL INEQUALITIES] Let $K$ be a positive definite $n \times n$ matrix and let $\mathcal{C}$ be a hypergraph on $[n]$. Let $K(s)$ denote the submatrix corresponding to the rows and columns indexed by elements of $s$. Then, writing $|M|$ for the determinant of $M$, we have for any fractional partition $\alpha^*$,

$$\prod_{s \in \mathcal{C}} \left( \frac{|K|}{|K(s^c)|} \right)^{\alpha^*(s)} \le |K| \le \prod_{s \in \mathcal{C}} |K(s)|^{\alpha^*(s)}.$$
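A quick numerical check of Corollary III is sketched below for a $2$-regular collection with all weights $\alpha^*(s) = 1/2$; raising both inequalities to the power $r = 2$ avoids fractional exponents, so exact integer and rational arithmetic suffices. The matrix $K$ (symmetric and strictly diagonally dominant with positive diagonal, hence positive definite), the 0-based indexing, the collection, and the helper names are assumptions made for illustration only.

```python
from fractions import Fraction


def det(m):
    """Determinant by Laplace expansion along the first row (fine for small matrices)."""
    if not m:
        return 1  # convention for the empty matrix
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, entry in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * entry * det(minor)
    return total


def principal_submatrix(k, s):
    """K(s): rows and columns of k indexed by the elements of s."""
    idx = sorted(s)
    return [[k[i][j] for j in idx] for i in idx]


# Hypothetical positive definite matrix (symmetric, strictly diagonally dominant).
K = [[4, 1, 0, 1],
     [1, 3, 1, 0],
     [0, 1, 5, 2],
     [1, 0, 2, 6]]
n = len(K)
everything = set(range(n))

# A 2-regular collection on indices 0..3, so alpha*(s) = 1/2 is a fractional partition.
collection = [{0, 1}, {1, 2}, {2, 3}, {0, 3}]
r = 2

det_K = det(K)
upper = 1            # product of |K(s)|, to be compared with |K|^r
lower = Fraction(1)  # product of |K| / |K(s^c)|, to be compared with |K|^r
for s in collection:
    upper *= det(principal_submatrix(K, s))
    lower *= Fraction(det_K, det(principal_submatrix(K, everything - s)))

print(lower <= det_K ** r <= upper)
```

Taking the collection of singletons (with $r = 1$) in the same code gives a numerical instance of Hadamard's inequality $|K| \le \prod_i K_{ii}$.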
The proof follows from Proposition II via the fact that any positive definite $n \times n$ matrix $K$ can be realized as the covariance matrix of a multivariate normal distribution $N(0, K)$, whose entropy is

$$H(X_{[n]}) = \frac{1}{2} \log \left[ (2 \pi e)^n |K| \right],$$

and furthermore, that if $X_{[n]} \sim N(0, K)$, then $X_s \sim N(0, K(s))$. Note that an alternative approach to proving Corollary III would be to directly apply Theorem I to the known fact (called the Koteljanskii or sometimes the Hadamard-Fischer inequality) that the set function $f(s) = \log |K(s)|$ is submodular.

For an $r$-regular hypergraph $\mathcal{C}$, using the degree partition in Corollary III implies that

$$|K|^r \le \prod_{s \in \mathcal{C}} |K(s)|.$$

Considering the hypergraphs $\mathcal{C}_1$ and $\mathcal{C}_{n-1}$ then yields the Hadamard and prototypical Szasz inequality, while the Fischer inequality follows by considering $\mathcal{C} = \{s, s^c\}$, for an arbitrary $s \subset [n]$.

We remark that one can interpret Corollary III using the all-minors matrix-tree theorem (see, e.g., Chaiken [7] or Lewin [32]). This is a generalization of the matrix tree theorem of Kirchhoff [29], which states that the determinant of any cofactor of the Laplacian matrix of a graph is the total number of distinct spanning trees in the graph, and interprets all minors of this matrix in terms of combinatorial properties of the graph.

VII. DUALITY AND MONOTONICITY OF GAPS

Consider the weak fractional form of Theorem I, namely

$$\sum_{s \in \mathcal{C}} \gamma(s) f(s \mid s^c) \le f([n]) \le \sum_{s \in \mathcal{C}} \gamma(s) f(s).$$

We observe that there is a duality between the upper and lower bounds, relating the gaps in this inequality.

Theorem IV: [DUALITY OF GAPS] Let $f : 2^{[n]} \to \mathbb{R}$ be a submodular function with $f(\phi) = 0$. Let $\gamma$ be an arbitrary fractional partition using some hypergraph $\mathcal{C}$ on $[n]$.
Define the lower and upper gaps by

$$\mathrm{Gap}_L(f, \mathcal{C}, \gamma) = f([n]) - \sum_{s \in \mathcal{C}} \gamma(s) f(s \mid s^c) \quad \text{and} \quad \mathrm{Gap}_U(f, \mathcal{C}, \gamma) = \sum_{s \in \mathcal{C}} \gamma(s) f(s) - f([n]). \tag{12}$$

Then

$$\frac{\mathrm{Gap}_U(f, \mathcal{C}, \gamma)}{w(\gamma)} = \frac{\mathrm{Gap}_L(f, \bar{\mathcal{C}}, \bar{\gamma})}{w(\bar{\gamma})}, \tag{13}$$

where $w$ is the weight function and $\bar{\gamma}$ is the dual fractional partition defined in Section II.

Proof: This follows easily from the definitions. Indeed,

$$\begin{aligned}
f([n]) - \sum_{s^c \in \bar{\mathcal{C}}} \bar{\gamma}(s^c) f(s^c \mid s) &= f([n]) - \sum_{s \in \mathcal{C}} \frac{\gamma(s)}{w(\gamma) - 1} \left[ f([n]) - f(s) \right] \\
&= \frac{\sum_{s \in \mathcal{C}} \gamma(s) f(s)}{w(\gamma) - 1} - \left[ \frac{w(\gamma)}{w(\gamma) - 1} - 1 \right] f([n]) \\
&= \frac{1}{w(\gamma) - 1} \left[ \sum_{s \in \mathcal{C}} \gamma(s) f(s) - f([n]) \right],
\end{aligned}$$

and

$$w(\bar{\gamma}) = \sum_{s^c \in \bar{\mathcal{C}}} \bar{\gamma}(s^c) = \frac{\sum_{s \in \mathcal{C}} \gamma(s)}{w(\gamma) - 1} = \frac{w(\gamma)}{w(\gamma) - 1}.$$

Dividing the first expression by the second yields the result.

Note that the upper bound for $f([n])$ with respect to $(\mathcal{C}, \gamma)$ is equivalent to the lower bound for $f([n])$ with respect to the dual $(\bar{\mathcal{C}}, \bar{\gamma})$, implying that the collection of upper bounds for all hypergraphs and all fractional coverings is equivalent to the collection of lower bounds for all hypergraphs and all fractional packings. Also, it is clear that under the assumptions of Corollary I, one can state a duality result extending Theorem IV by replacing $\gamma$ by any fractional covering $\alpha$, and $\bar{\gamma}$ by the dual fractional packing $\bar{\alpha}$.

From Theorem IV, it is clear by symmetry that also

$$\frac{\mathrm{Gap}_L(f, \mathcal{C}, \gamma)}{w(\gamma)} = \frac{\mathrm{Gap}_U(f, \bar{\mathcal{C}}, \bar{\gamma})}{w(\bar{\gamma})}. \tag{14}$$

However, the identities (13) and (14) do not imply any relation between $\mathrm{Gap}_U(f, \mathcal{C}, \gamma)$ and $\mathrm{Gap}_L(f, \mathcal{C}, \gamma)$.

The gaps in the inequalities have especially nice structure when they are considered in the weak degree form, i.e., for the fractional partition using an $r$-regular hypergraph $\mathcal{C}$, all of whose coefficients are $1/r$. The associated gaps are

$$g_L(f, \mathcal{C}) = f([n]) - \frac{1}{r} \sum_{s \in \mathcal{C}} f(s \mid s^c) \quad \text{and} \quad g_U(f, \mathcal{C}) = \frac{1}{r} \sum_{s \in \mathcal{C}} f(s) - f([n]).$$
(15)

Corollary IV: [DUALITY FOR REGULAR COLLECTIONS] Let $f : 2^{[n]} \to \mathbb{R}$ be a submodular function with $f(\phi) = 0$. For an $r$-regular collection $\mathcal{C}$,

$$\frac{g_L(f, \bar{\mathcal{C}})}{g_U(f, \mathcal{C})} = \frac{r}{|\mathcal{C}| - r}.$$

Let us now specialize to the entropy set function $e(s)$; we use this to mean either $H(X_s)$ (if the random variables $X_i$ are discrete) or $h(X_s)$ (if the random variables $X_i$ are continuous). The special hypergraphs $\mathcal{C}_k$, $k = 1, 2, \ldots, n$, consisting of all $k$-sets, or sets of size $k$, are of particular interest, and a lot is already known about the gaps for these collections. For instance, Han's inequality [23] already implies Proposition I for these hypergraphs, and Corollary IV applied to these hypergraphs implies that

$$\frac{g_L(e, \mathcal{C}_{n-k})}{g_U(e, \mathcal{C}_k)} = \frac{k}{n-k},$$

recovering an observation made by Fujishige [17]. Indeed, Theorem IV and Corollary IV generalize what [17] interpreted using the duality of polymatroids, since our assumptions are weaker and the assertions broader.

Fujishige [17] considered these gaps important enough to merit a name: building on terminology of Han [23], he called the quantity $g_U(e, \mathcal{C}_k)$ a "total correlation", and $g_L(e, \mathcal{C}_k)$ a "dual total correlation". In two particular cases, the gaps have simple expressions as relative entropies (see Section IX for definitions). First, note that the lower gap in Han's inequality (3) is related to the dependence measure that generalizes the mutual information:

$$(n-1)\, g_L(e, \mathcal{C}_{n-1}) = g_U(e, \mathcal{C}_1) = \sum_{i \in [n]} e(\{i\}) - e([n]) = D(P_{X_{[n]}} \,\|\, P_{X_1} \times \ldots \times P_{X_n}). \tag{16}$$

It is trivial to see that the gap is zero if and only if the random variables are independent. Second, the lower gap in Proposition I with respect to the singleton class $\mathcal{C}_1$ is related to the upper gap in the prototypical form (3) of Han's inequality: $g_L(e, \mathcal{C}_1) = (n-1)\, g_U(e, \mathcal{C}_{n-1}) = \sum_{i \in [n]} D(P_{X_i \mid X_{[n] \setminus i}} \,\|\, P_{X_i \mid X \cdots}$ […] $> 0$.
As in the case of entropy, the bounds on the entropy powers associated with the hypergraphs $\mathcal{C}_m$ and the degree covering satisfy a monotonicity property. Indeed, by Theorem 16.5.2 of [10],

$$\frac{1}{\binom{n}{m}} \sum_{s \in \mathcal{C}_{n-m}} N_c(X_s)$$

is a decreasing sequence in $m$.

More interesting than entropy power inequalities for joint distributions, however, are entropy power inequalities for sums of independent random variables with densities. Introduced by Shannon [45] and Stam [48] in seminal contributions, they have proved to be extremely useful and surprisingly deep, with connections to functional analysis, central limit theorems, and to the determination of capacity and rate regions for problems in information theory. Recently the first author showed (building on work by Artstein, Ball, Barthe and Naor [2] and Madiman and Barron [33]) the following generalized entropy power inequality. For independent real-valued random variables $X_i$ with densities and finite variances,

$$N\Big( \sum_{i \in [n]} X_i \Big) \ge \sum_{s \in \mathcal{C}} \gamma(s)\, N\Big( \sum_{i \in s} X_i \Big), \tag{18}$$

for any fractional partition $\gamma$ with respect to any hypergraph $\mathcal{C}$ on $[n]$.

Inequality (18) shares an intriguing similarity of form to the inequalities of this paper, although it is much harder to prove. The formal similarity between results for joint entropy and for entropy power of sums extends further. For instance, the fact that

$$\frac{1}{\binom{n}{m}} \sum_{s \in \mathcal{C}_{n-m}} N\Big( \sum_{i \in s} X_i \Big)$$

is an increasing sequence in $m$ can be thought of as a formal dual of Han's theorem. It is an open question whether upper bounds for entropy power of sums can be obtained that are analogous to the lower bound in Theorem I'.

IX. AN INEQUALITY FOR RELATIVE ENTROPY, AND INTERPRETATIONS

Let $A$ be either a countable set, or a Polish (i.e., complete separable metric) space equipped as usual with its Borel $\sigma$-algebra of measurable sets. Let $P$ and $Q$ be probability measures on the Polish product space $A^n$.
For any nonempty subset $s$ of $[n]$, write $P_s$ for the marginal probability measure corresponding to the coordinates in $s$. Recall the definition of the relative entropy:

$$D(P_s \,\|\, Q_s) = E_P \left[ \log \frac{dP_s}{dQ_s} \right] \in [0, \infty]$$

when $P_s$ is absolutely continuous with respect to $Q_s$, and $D(P_s \,\|\, Q_s) = +\infty$ otherwise. One may also define the conditional relative entropy by

$$D(P_{s \mid t} \,\|\, Q_{s \mid t} \mid P) = E_{P_t} D(P_{s \mid t} \,\|\, Q_{s \mid t}), \tag{19}$$

where $P_{s \mid t}$ is understood to mean the conditional distribution (under $P$) of the random variables corresponding to $s$ given particular values of the random variables corresponding to $t$; then $E_{P_t}$ denotes the averaging using $P_t$ over the values that are conditioned on. With this definition, and writing $d(s) = D(P_s \,\|\, Q_s)$, it is easy to verify the chain rule

$$d(s \cup t) = D(P_{s \mid t} \,\|\, Q_{s \mid t} \mid P) + d(t)$$

for disjoint $s$ and $t$, so that following the terminology developed in Section III, we have $d(s \mid t) = D(P_{s \mid t} \,\|\, Q_{s \mid t} \mid P)$. We have freely used (regular) conditional distributions in these definitions; the existence of these is justified by the fact that we are working with Polish spaces.

Theorem V: Let $Q$ be a product probability measure on $A^n$, where $A$ is a Polish space as above. Suppose $P$ is a probability measure on $A^n$ such that the set function $d : 2^{[n]} \to [0, \infty]$ given by $d(s) = D(P_s \,\|\, Q_s)$ does not take the value $+\infty$ for any $s \subset [n]$. Then $d(s)$ is supermodular.

Proof: For any nonempty $s, t \subset [n]$, we have

$$\begin{aligned}
d(s \cup t) + d(s \cap t) - d(s) - d(t) &= [d(s \cup t) - d(t)] - [d(s) - d(s \cap t)] \\
&= d\big( (s \cup t) \setminus t \mid t \big) - d\big( s \setminus (s \cap t) \mid s \cap t \big).
\end{aligned}$$

Since $(s \cup t) \setminus t = s \setminus (s \cap t)$, it would suffice to prove for disjoint sets $s'$ and $t$ that

$$d(s' \mid t) \ge d(s' \mid t') \tag{20}$$

for any $t' \subset t$.
However, observe that, since $Q$ is a product probability measure,

$$d(s' \mid t) = E_{P_t} D(P_{s' \mid t} \,\|\, Q_{s'}) = E_{P_{t'}} E_{P_{t \setminus t'}} D(P_{s' \mid t} \,\|\, Q_{s'})$$

and

$$d(s' \mid t') = E_{P_{t'}} D(P_{s' \mid t'} \,\|\, Q_{s'}) = E_{P_{t'}} D(E_{P_{t \setminus t'}} P_{s' \mid t} \,\|\, Q_{s'}),$$

so that (20) is an immediate consequence of the convexity of relative entropy (see, e.g., [10]).

Based on the supermodularity proved in Theorem V, Theorem I applied to $-d(s)$ immediately implies the following corollary.

Corollary VII: Under the assumptions of Theorem V, $\sum_{s \in \mathcal{C}} \gamma(s)\, D(P_{s \mid s^c \setminus > s} \,\|\, Q_s \mid P) \ge D(P_{[n]} \,\|\, Q_{[n]}) \ge \sum_{s \in \mathcal{C}} \gamma(s)\, D(P_{s \mid \cdots}$ […]