Generalized Entropy Concentration for Counts

Authors: Kostas N. Oikonomou

Kostas N. Oikonomou
AT&T Labs Research, Middletown, NJ 07748, U.S.A.
Email: ko@research.att.com
May 2019

Abstract

The phenomenon of entropy concentration provides strong support for the maximum entropy method, MaxEnt, for inferring a probability vector from information in the form of constraints. Here we extend this phenomenon, in a discrete setting, to non-negative integral vectors not necessarily summing to 1. We show that linear constraints that simply bound the allowable sums suffice for concentration to occur even in this setting. This requires a new, 'generalized' entropy measure in which the sum of the vector plays a role. We measure the concentration in terms of deviation from the maximum generalized entropy value, or in terms of the distance from the maximum generalized entropy vector. We provide non-asymptotic bounds on the concentration in terms of various parameters, including a tolerance on the constraints which ensures that they are always satisfied by an integral vector. Generalized entropy maximization is not only compatible with ordinary MaxEnt, but can also be considered an extension of it, as it allows us to address problems that cannot be formulated as MaxEnt problems.

Keywords: maximum generalized entropy, counts, concentration, linear constraints, inequalities, norms, tolerances

Contents

1 Introduction
2 The generalized entropy G
  2.1 Basic properties
  2.2 Monotonicity and concavity properties
  2.3 Lower bounds
  2.4 Maximization
  2.5 A connection with I-divergence
3 Constraints, scaling, sensitivity, and the optimal count vector
  3.1 Constraints with tolerances
  3.2 Effect of tolerances on the optimality of x*
  3.3 Scaling of the data and bounds on the allowable sums
  3.4 The optimal count vector ν*
4 Concentration with respect to entropy difference
  4.1 Realizations of the optimal count vector
  4.2 Realizations of the sets with smaller entropy
  4.3 The scaling factor needed for concentration
    4.3.1 Bounds on the concentration threshold
  4.4 Examples
5 Concentration with respect to distance from the MaxGEnt vector
  5.1 Realizations of the sets far from the MaxGEnt vector
  5.2 Scaling and concentration around the MaxGEnt count vector
  5.3 Examples
6 Conclusion
A Proofs

1 Introduction

The maximum entropy method or principle, originally proposed by E.T. Jaynes in 1957, now appears in standard textbooks on engineering probability and information theory, [PP02], [CT06]. Commonly referred to as MaxEnt, the principle essentially states that if the only information available about a probability vector is in the form of linear constraints on its elements, then, among all others, the preferred probability vector is the one that maximizes the Shannon entropy under these constraints.
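As a concrete illustration of the principle (ours, not part of the paper): under a single mean constraint, the MaxEnt distribution has the standard exponential-family form p_i ∝ exp(λ v_i), and λ can be found by bisection, since the mean is increasing in λ. The sketch below does this for the classic die with prescribed mean 4.5; all function names are ours.

```python
import math

def maxent_mean(values, target_mean, lo=-50.0, hi=50.0, iters=200):
    """MaxEnt distribution p_i proportional to exp(lam * v_i) with
    sum_i v_i p_i = target_mean, found by bisection on lam
    (the constrained mean is increasing in lam)."""
    def mean(lam):
        w = [math.exp(lam * v) for v in values]
        z = sum(w)
        return sum(v * wi for v, wi in zip(values, w)) / z
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    w = [math.exp(lam * v) for v in values]
    z = sum(w)
    return [wi / z for wi in w]

# Die faces 1..6, mean constrained to 4.5: the MaxEnt probabilities
# increase geometrically with the face value.
p = maxent_mean([1, 2, 3, 4, 5, 6], 4.5)
```

Since the target mean 4.5 exceeds the uniform mean 3.5, the solution has λ > 0 and the probabilities are strictly increasing in the face value.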
Besides the great wealth and diversity of its applications, MaxEnt can be justified on a variety of theoretical grounds: axiomatic formulations ([SJ80], [Ski89], [Csi91], [Cat12]), the concentration phenomenon ([Jay83], [Gr8], [Cat12], [OG16]), decision- and game-theoretic interpretations ([Gr8] and references therein), and its unification with Bayesian inference ([GC07], [Cat12]). Among these justifications, in a discrete setting, the appeal of concentration lies in its conceptual simplicity. It is essentially a combinatorial argument, first presented by E.T. Jaynes [Jay83], who called it "concentration of distributions at entropy maxima". The concentration viewpoint was further developed in [Gr8] and [OG16], which presented generalizations, improved results, eliminated the asymptotics, and studied additional aspects.

In this paper we adopt a discrete, finite, non-probabilistic, combinatorial approach, and show that the concentration phenomenon arises in a new setting, that of non-negative vectors which are not necessarily density vectors^1.1. Among other things, this requires introducing a new, 'generalized' entropy measure. This new concentration phenomenon lends support to an extension of the MaxEnt method to what we call "maximum generalized entropy", or MaxGEnt.

The basics of entropy concentration are easiest to explain in terms of the abstract "balls and bins" paradigm ([Jay03]). There are m labelled, distinguishable bins, to which n indistinguishable balls are to be allocated one-by-one. The final content of the bins is described by a count vector ν = (ν_1, ..., ν_m) which sums to n, and a corresponding frequency vector f = ν/n, summing to 1. Suppose that the frequency vector must satisfy a set of linear equalities and inequalities, Σ_i a_ij f_i = b_j and Σ_i a_ij f_i ≤ b_j, with a_ij, b_j ∈ R.
The concentration phenomenon is that as n becomes large, the overwhelming majority of the allocations which accord with the constraints have frequency vectors that are close to the m-vector which maximizes the Shannon entropy subject to the constraints.

In our extension there is no longer a given number of balls. Therefore we cannot define a unique frequency vector, but must deal directly with count vectors ν whose sums are unknown (Example 1.1 below makes this clear). The linear constraints are now placed on the counts ν_i, again with coefficients in R. Our only assumption about the constraints is that they limit the sums of the count vectors to lie in a finite range [s_1, s_2]. With just this assumption, we show that as the counts are allowed to become larger and larger (by a process of scaling the problem, explained in §3), the vast majority of allocations that satisfy the constraints in fact have count vectors close to the non-negative m-vector x* that maximizes the generalized entropy G(x). A precise statement of this concentration phenomenon needs some additional preliminaries, and is given at the end of this section.

Our main results are: in §2, a new generalized entropy function G, defined on arbitrary non-negative vectors, which reduces to the Shannon entropy H on vectors summing to 1; its properties are studied in §2 and §3, where the scaling process is also introduced. In §4 we demonstrate the new concentration phenomenon with respect to deviations from the maximum generalized entropy value G*. Theorem 4.1 gives a lower bound on the ratio of the number of realizations of the MaxGEnt vector to that of the set of count vectors ν whose generalized entropies G(ν) are far from the maximum value G*. Then Theorem 4.2 completes the picture by deriving how large the problem must be for the above ratio to be suitably large.
In §5 we establish concentration with respect to the ℓ1 norm distance of the count vectors from the MaxGEnt vector; we present Theorems 5.1 and 5.2, which are analogous to those of §4, and also Theorem 5.3, an optimized version of Theorem 5.2. In all the theorems, 'far', 'large', etc. are defined in terms of parameters, introduced in Table 1.2 below. None of our results involve any asymptotic considerations, and we give a number of numerical illustrations.

The following example demonstrates the basic issues referred to above in a very simple setting, which highlights the differences with the usual frequency vector case. After this, we proceed to the precise statement of generalized entropy concentration.

^1.1 What we call here 'density' or 'frequency' vectors would be called "discrete probability distributions", possibly 'empirical', if we were operating in a probabilistic setting.

Example 1.1 A number of indistinguishable balls^1.2 are to be placed one-by-one in three bins, red, green, and blue. The final content (ν_r, ν_g, ν_b) of the bins must satisfy ν_r + ν_g = 4 and ν_g + ν_b ≤ 6. Thus the total number of balls that may be put in the bins cannot be too small, e.g. 3, or too large, e.g. 20. Each assignment of balls to the bins is described by a sequence made from the letters r, g, b, with a corresponding count vector ν = (ν_r, ν_g, ν_b); the sequence can be of any length n consistent with the constraints. Table 1.1 lists all the count vectors that satisfy the constraints, their sums n, and their number of realizations #ν, i.e. the number of sequences that result in these counts, given by a multinomial coefficient, e.g. 7!/(3! 1! 3!) = 140. [In the terminology of the theory of types, #ν is the size of the type class T(ν/n).] What can be said about the "most likely" final content of the bins?
    ν_r  ν_g  ν_b    n    #ν
     0    4    0     4     1
     1    3    0     4     4
     2    2    0     4     6
     3    1    0     4     4
     4    0    0     4     1
     0    4    1     5     5
     1    3    1     5    20
     2    2    1     5    30
     3    1    1     5    20
     4    0    1     5     5
     0    4    2     6    15
     1    3    2     6    60
     2    2    2     6    90
     3    1    2     6    60
     4    0    2     6    15
     1    3    3     7   140
     2    2    3     7   210
     3    1    3     7   140
     4    0    3     7    35
     2    2    4     8   420
     3    1    4     8   280
     4    0    4     8    70
     3    1    5     9   504
     4    0    5     9   126
     4    0    6    10   210

Table 1.1: The count vectors ν = (ν_r, ν_g, ν_b) satisfying ν_r + ν_g = 4, ν_g + ν_b ≤ 6, their sum n, and their number of realizations #ν. If we had the additional constraint ν_r + ν_g + ν_b = 7, only the n = 7 section of the table would apply, and we would reduce to a MaxEnt problem.

This example makes two points. First, it does not seem possible to find a single frequency vector that can be naturally associated with the problem; without that, one cannot think about maximizing the usual entropy^1.3. Second, one may think that starting with the largest possible number of balls, 10 in this case, would lead to the greatest number of realizations. But this is not so: the count vector with the most realizations sums to 9, and even vectors summing to 8 have more realizations than the one summing to 10.

^1.2 The balls don't have to be indistinguishable; we just ignore distinguishing characteristics, if they have any. However, in modelling some situations, such as in Example 3.1, indistinguishability is essential.
^1.3 In the ordinary entropy problem where we have a single n, the distinction between count and frequency vectors doesn't really matter, there is a 1-1 correspondence; but this is not true here.

Next we give a precise statement, GC below, of generalized entropy concentration.
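Table 1.1 can be reproduced mechanically. The following Python sketch (ours, not part of the paper; names are ours) enumerates all count vectors satisfying ν_r + ν_g = 4 and ν_g + ν_b ≤ 6 and computes #ν as a multinomial coefficient:

```python
from math import factorial

def realizations(nu):
    """#nu = the multinomial coefficient n! / (nu_1! ... nu_m!), n = sum(nu)."""
    n = sum(nu)
    c = factorial(n)
    for k in nu:
        c //= factorial(k)
    return c

# All (nu_r, nu_g, nu_b) with nu_r + nu_g = 4 and nu_g + nu_b <= 6.
table = []
for ng in range(5):
    nr = 4 - ng
    for nb in range(6 - ng + 1):
        nu = (nr, ng, nb)
        table.append((nu, sum(nu), realizations(nu)))

# The "most likely" final content: the count vector with the most realizations.
best = max(table, key=lambda row: row[2])
```

Running this confirms the two points above: the maximizer is (3, 1, 5), which sums to 9, not 10, and the sum-10 vector (4, 0, 6) has fewer realizations than the sum-8 vector (2, 2, 4).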
To do that we need to (a) define the generalized entropy and describe how to find the vector that maximizes it, (b) specify how to derive the bounds s_1, s_2 from the constraints, (c) describe how to ensure the existence of integral solutions (count vectors) to the constraints, and (d) introduce parameters that define the concentration.

To find the vector with the largest number of realizations in a problem like that of Example 1.1, we first assume that the problem does not admit arbitrarily large solutions. This is made precise in (1.2) below, but a necessary condition is that each element of ν appears in some constraint^1.4. Next we relax the integrality requirement on the counts, and set up a continuous maximization problem

    max_{x ∈ C} G(x),  where G(x) = −Σ_i x_i ln x_i + (Σ_i x_i) ln(Σ_i x_i)
    and C = {x ∈ R^m | A_E x = b_E, A_I x ≤ b_I, x ≥ 0}.    (1.1)

Here G(x) is the generalized entropy of the real vector x ≥ 0, and the constraints on x are expressed via the real matrices A_E, A_I and vectors b_E, b_I. We assume that the constraints (a) are satisfiable, and (b) bound the possible sums of the x ∈ C; this is equivalent to assuming that all x_i are bounded. Thus C is a non-empty polytope in R^m and (1.1) is a concave maximization problem (see e.g. [BV04]) with a solution x*. We will refer to (1.1) as the "MaxGEnt problem" and to x* as "the MaxGEnt vector" or as "the optimal relaxed count vector". Since the function G is concave but not strictly concave, see Fig. 2.1 in §2, it is not immediate that the solution x* is unique; however, we show that this is the case in §2.4.

The boundedness assumption is that Σ_i x_i lies between (finite) numbers s_1 and s_2; these are determined by solving the linear programs

    s_1 ≜ min_{x ∈ C} (x_1 + ··· + x_m),    s_2 ≜ max_{x ∈ C} (x_1 + ··· + x_m).    (1.2)

(A technicality is that the constraints may force some elements of x* to be 0; for reasons explained in §3 it is convenient to eliminate such elements, so that in the end all elements of x* can be assumed to be positive reals.) Finally, from x* we derive an integral vector ν*, to which we refer as the optimal, or MaxGEnt, count vector, by a procedure explained in §3.

Because in the end we are interested only in integral/count vectors in the set C of (1.1), we will introduce, as explained in §3, tolerances on the satisfaction of the constraints, governed by a parameter δ. This will turn C into C(δ). To describe the concentration we need two more parameters: ε, specifying the strength of the concentration, and η or ϑ, describing the size of the region in which it occurs. The parameters are summarized in Table 1.2.

    δ : relative tolerance in satisfying the constraints
    ε : concentration tolerance, on number of realizations
    η : relative tolerance in deviation from the maximum generalized entropy value G*
    ϑ : absolute tolerance in deviation (distance) from the optimal relaxed count vector x*

    Table 1.2: Parameters for the concentration results.

Lastly, when we have ordinary entropy and frequency vectors, concentration occurs by increasing the number of balls n. With count vectors, this is replaced by increasing b_E, b_I, the values of the constraints. The increase we consider here consists in multiplying these vectors by a scalar c > 1, a process which we call scaling. This scaling results in larger and larger count vectors being admissible and is described in detail in §3.

^1.4 But this is not sufficient: consider, e.g., m = 2 and ν_1, ν_2 ≥ 0, ν_1 − ν_2 = 10.
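To make (1.1) and (1.2) concrete, here is a small sketch (ours; a real implementation would solve the two linear programs in (1.2) with an LP solver) that evaluates G and scans a fine grid over the feasible set C of Example 1.1:

```python
import math

def G(x):
    """Generalized entropy of (1.1): -sum_i x_i ln x_i + (sum_i x_i) ln(sum_i x_i),
    with the convention 0 ln 0 = 0."""
    s = sum(x)
    return -sum(v * math.log(v) for v in x if v > 0) + (s * math.log(s) if s > 0 else 0.0)

# Feasible set C of Example 1.1 (x1 + x2 = 4, x2 + x3 <= 6, x >= 0),
# scanned on a 0.01 grid as a stand-in for the linear programs in (1.2).
pts = []
for i in range(401):                 # x2 = i/100 in [0, 4], so x1 = 4 - x2
    x2 = i / 100
    for j in range(601):             # x3 = j/100 in [0, 6]
        x3 = j / 100
        if x2 + x3 <= 6.0 + 1e-9:
            pts.append((4.0 - x2, x2, x3))

s1 = min(sum(p) for p in pts)        # smallest allowable sum
s2 = max(sum(p) for p in pts)        # largest allowable sum
x_star = max(pts, key=G)             # grid approximation to the MaxGEnt vector
```

For this example the grid recovers s_1 = 4, s_2 = 10, and a maximizer close to (2.61, 1.39, 4.61), the analytic solution given later in Example 2.1; note also that G coincides with the Shannon entropy H on vectors summing to 1.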
Now we can give the precise statement of the concentration phenomenon for count vectors:

GC: Theorems 4.2 and 5.2 compute a number ĉ(δ, ε, η) and ĉ(δ, ε, ϑ), respectively, called the "concentration threshold", such that if the problem data b_E, b_I is scaled by any factor c ≥ ĉ, the number of assignments/sequences that result in the optimal count vector ν* is at least 1/ε times greater than the number of all assignments that result in count vectors with entropy less than (1 − η)G* or farther than ϑ from x* by ℓ1 norm.

Significance

In a problem where the only available information is embodied in the constraints and which otherwise admits a large number of probability vectors as solutions, the concentration phenomenon provides a powerful argument for the MaxEnt method, which selects a particular solution, the one with maximum entropy, in preference to all others^1.5. Likewise, the concentration results in this paper support the maximization of generalized entropy for problems involving general non-negative vectors.

We believe that MaxGEnt can be considered to be a compatible extension of MaxEnt. The compatibility is that any MaxEnt problem over the reals with constraints A_E x = b_E, A_I x ≤ b_I can be formulated as a MaxGEnt problem of the form (1.1) with the same constraints, plus the constraint Σ_i x_i = 1; both problems will have the same solution x* ∈ R^m, and the maximum entropy H(x*) will equal the maximum generalized entropy G(x*). Also, if the constraints of the MaxGEnt problem either explicitly or implicitly fix the value of Σ_i x_i, then the problem can be reduced to a MaxEnt problem over the reals. The extension consists in the fact that MaxGEnt addresses problems involving un-normalized vectors that cannot be formulated as MaxEnt problems, as we saw in Example 1.1; more examples of such problems are given in §3, §4, and §5.
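The effect described in GC can be glimpsed numerically (our illustration, not the paper's theorems): scale Example 1.1 by c = 5, so that b becomes (20, 30), enumerate the admissible count vectors, and locate the one with the most realizations. It lands essentially on 5·x*, where x* ≈ (2.61, 1.39, 4.61) is the analytic MaxGEnt vector given later in Example 2.1.

```python
from math import lgamma, log

def ln_realizations(nu):
    """ln #nu = ln n! - sum_i ln nu_i!, computed with log-Gamma."""
    n = sum(nu)
    return lgamma(n + 1) - sum(lgamma(k + 1) for k in nu)

def G(x):
    s = sum(x)
    return -sum(v * log(v) for v in x if v > 0) + (s * log(s) if s > 0 else 0.0)

c = 5  # scale Example 1.1 to nu_r + nu_g = 20, nu_g + nu_b <= 30
vectors = [(4 * c - g, g, b) for g in range(4 * c + 1) for b in range(6 * c - g + 1)]
nu_star = max(vectors, key=ln_realizations)

# Example 2.1 gives x* ~ (2.6056, 1.3944, 4.6056) for the unscaled problem;
# by the homogeneity property of G, G(nu*)/c should be close to G(x*).
g_star = G((2.6056, 1.3944, 4.6056))
```

The maximizer is (13, 7, 23), i.e. the scaled MaxGEnt vector rounded to integers, and G(ν*)/c agrees with G(x*) up to the integrality error.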
^1.5 MaxEnt solves the inference problem, not the decision problem. It does not claim that the maximum entropy object is the one to use no matter what use one has in mind.

Related work

Our term "generalized entropy" for G is neither imaginative nor distinctive, and there are many other generalized entropy measures. The most general of these are Csiszár's f-entropies and f-divergences [Csi96], and the related Φ-entropies of [BLM13]. Any relationship of G to Φ-entropies remains to be investigated. The function G, in the form of the log of a multinomial coefficient with "variable numerator", appeared in [OS06] and [Oik12].

The problem of inferring a non-negative real vector from information in the form of linear equalities was considered by Skilling [Ski89], where such vectors were termed "positive additive distributions", and by Csiszár, [Csi91], [Csi96]. Both authors gave axiomatic justifications, which do not involve probabilities, for minimizing the I-divergence, a generalization of relative entropy to un-normalized vectors. A further generalization is the α, β divergences of [CCA11]. We discuss a connection between I-divergence and our generalized entropy in §2.5.

With respect to concentration, recent developments for the discrete, normalized case were given in [OG16]. The continuous normalized case, for relative entropy, is examined in [Cat12] from the viewpoint of information geometry. Countable spaces are also treated in [Gr8]. But these references do not provide explicit bounds such as the ones here and in [OG16]. To our knowledge, concentration for non-density vectors has not been studied before.

The structure and some of the presentation of this paper are similar to [OG16] because of the similar subject matter, entropy concentration from a combinatorial viewpoint.
Many of the results here that appear similar to those of Section III of [OG16] are generalizations of those results, insofar as G is a generalization of H. However, the main theorems here do not actually subsume the corresponding theorems of [OG16], because in both cases the theorems include optimizations specific to count or frequency vectors, respectively.

2 The generalized entropy G

In this section we introduce the generalized entropy function G, and study its properties, its relationships with other functions, and its maximization under linear constraints. Given a real vector x ≥ 0, its generalized entropy is

    G(x) ≜ H(x) + (Σ_i x_i) ln(Σ_i x_i) = (Σ_i x_i) H(χ),   x ≥ 0.    (2.1)

Here H(x) is the form −Σ_i x_i ln x_i extended to vectors in R^m_+ that are not necessarily density vectors, and χ is the density, or normalized, or probability, vector corresponding to x. (2.1) gives two ways to look at G(x): it is the (extended) entropy of x plus the sum of x times its log, or the sum of x times the ordinary entropy of the normalized x. If x is already normalized, G(x) coincides with H(x). Fig. 2.1 is a plot of G(x) for m = 2.

[Figure 2.1: G(x_1, x_2) for x_1, x_2 ∈ [1, 4]. Note that G(x, x) = (2 ln 2)x, which destroys the strict concavity of G.]

2.1 Basic properties

We list some important properties of the function G:

P1 G(x_1, ..., x_m) is the log of the multinomial coefficient (x_1 + ··· + x_m)!/(x_1! ··· x_m!) to "second Stirling order": by using the first two terms of

    ln x! = x ln x − x + ½ ln x + ln √(2π) + ϑ/(12x),   ϑ ∈ (0, 1),

we find that

    ln [(x_1 + ··· + x_m)!/(x_1! ··· x_m!)] ≈ G(x_1, ..., x_m).

This interpretation was given in [Oik12], where it was used to derive "most likely" matrices, i.e. those with the largest number of realizations, from incomplete information.
P2 G is related to the ordinary entropy (of density vectors) and the extended entropy (of arbitrary non-negative vectors) H in the two ways specified in (2.1).

P3 Unlike the entropy of normalized vectors, which is bounded by ln m, the generalized entropy G(x) increases without bound as the elements of x become larger: for any x, y, if y ≥ x then G(y) ≥ G(x). This is shown in Proposition 2.1. One consequence is that if x, y are close in norm, i.e. ‖x − y‖ ≤ ζ, |G(x) − G(y)| cannot be bounded by an expression involving only m and ζ.

P4 G(x) is positive, unless x has just one non-zero element, in which case G(x) = 0. This follows from the second form in (2.1).

P5 Given any p.d. p = (p_1, ..., p_m) and any n-sequence σ with count vector ν, the probability of σ under p can be written as

    Pr_p(σ) = e^{−(G(ν) + n D(f ‖ p))}

where D(·‖·) is the divergence, or relative entropy, between two probability vectors and f = ν/n is the frequency vector corresponding to ν. By substituting G(ν) = n H(f) we obtain the well-known expression for the same probability in terms of the ordinary entropy of a frequency vector.

P6 Like the ordinary or the extended H, G(x_1, ..., x_m) is concave over the domain x_1 ≥ 0, ..., x_m ≥ 0, but unlike H, it is not strictly concave. See Proposition 2.2 in §2.2.

P7 The maximum of G(x_1, ..., x_m) subject just to the constraint Σ_i x_i = s is s ln m. When s = 1, x is a density vector and this reduces to the maximum of H.

P8 What is the relationship between maximizing G and maximizing the extended H? Consider maximizing the first form in (2.1), subject to A_E x = b_E, A_I x ≤ b_I, by imposing the additional constraint Σ_i x_i = s and treating s as a parameter taking values in [s_1, s_2]. For a given s, there will be a unique maximum, since H is strictly concave^2.1.
Further, some s = s* will achieve max_s max_x (s ln s + H(x)) subject to A_E x = b_E, A_I x ≤ b_I, Σ_i x_i = s; this maximum value will equal G(x*). Using the second form in (2.1), we see that there is a similar relationship between maximizing G(x) and maximizing the function s H(x/s).

P9 G has a scaling (or homogeneity) property, which H does not: for any c > 0 and any x ∈ R^m_+, G(cx) = c G(x). This is most easily seen from the second form in (2.1).

P10 G has a further important scaling property: if x* maximizes G(x) under Ax ≤ b, then for any c > 0, cx* maximizes G(x) under Ax ≤ cb. We show this in §3.3, Proposition 3.2.

2.2 Monotonicity and concavity properties

As we noted in property P3 in §2.1, G is an increasing function in the sense that

Proposition 2.1 For any x, y, if y ≥ x then G(y) ≥ G(x), and if the inequality is strict in some places, then G(y) > G(x).

We will use this property in §2.3. Now we turn to concavity. The extended ordinary entropy H is strictly concave and, in addition, strongly concave with any modulus γ ≤ 1/a when defined over [0, a]^m. The generalized entropy G is also concave, but neither strictly concave, nor strongly concave for any modulus. However, −G is sublinear, whereas −H is not. These properties are collected in the following proposition:

Proposition 2.2
1. The function G(x_1, ..., x_m) is concave over R^m_+.
2. G is not strictly concave over R^m_+.
3. G is not strongly concave over R^m_+ for any modulus γ > 0.
4. If the definition of G is extended over all of R^m by setting G(x) = −∞ if any x_i is < 0, then for all α, β ≥ 0 and for all x, y ∈ R^m, G(αx + βy) ≥ αG(x) + βG(y).

The last property is stronger than (implies) concavity, since α, β are not required to sum to 1.

^2.1 One may also maximize H without the constraint Σ_i x_i = s, but what would the result mean?
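Several of the properties above are easy to check numerically. The sketch below (ours; function names are ours) verifies P1 on moderately large count vectors, the homogeneity property P9, and the linearity of G along the diagonal that defeats strict concavity (Proposition 2.2, part 2):

```python
from math import lgamma, log

def G(x):
    """Generalized entropy (2.1), with the convention 0 ln 0 = 0."""
    s = sum(x)
    return -sum(v * log(v) for v in x if v > 0) + (s * log(s) if s > 0 else 0.0)

def ln_multinomial(nu):
    """Exact ln of the multinomial coefficient, via log-Gamma."""
    n = sum(nu)
    return lgamma(n + 1) - sum(lgamma(k + 1) for k in nu)

# P1: G approximates ln of the multinomial coefficient, better at larger scale.
err_small = abs(G((30, 10, 30)) - ln_multinomial((30, 10, 30)))
err_large = abs(G((300, 100, 300)) - ln_multinomial((300, 100, 300)))

# P9: homogeneity, G(c x) = c G(x).
x, c = (2.0, 3.0, 5.0), 7.0
p9_holds = abs(G(tuple(c * v for v in x)) - c * G(x)) < 1e-9

# Proposition 2.2(2): along the diagonal, G(t, t) = (2 ln 2) t is linear,
# so the midpoint concavity inequality holds with equality.
diag_linear = abs(G((2.0, 2.0)) - 0.5 * (G((1.0, 1.0)) + G((3.0, 3.0)))) < 1e-9
```

The relative error of the P1 approximation shrinks as the vector grows, consistent with "second Stirling order", and the exact midpoint equality along the diagonal shows concretely why G cannot be strictly concave.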
The absence of strict concavity means that more care is needed with maximization; we address this in §2.4.

2.3 Lower bounds

Given a point x, if some other point y is close to it in the distance/norm sense, how much smaller than G(x) can G(y) be? We will need the answer in §4. Proposition 2.1 implies that if we have a hypercube centered at x, say ‖x − y‖_∞ ≤ ζ, then G(·) attains its maximum at the "upper right-hand" corner of the hypercube and its minimum at the "lower left-hand" corner. Specifically, for any ζ > 0, let ζ̄ denote the m-vector (ζ, ..., ζ), and let x ≥ ζ̄. Then it can be seen from Proposition 2.1 that for any y ≥ 0

    ‖x − y‖_∞ ≤ ζ ⇒ G(x − ζ̄) ≤ G(y) ≤ G(x + ζ̄).    (2.2)

Using this observation we can show that

Lemma 2.1 Given ζ > 0 and x, y ∈ R^m_+, if x ≥ ζ̄ and ‖y − x‖_∞ ≤ ζ, then

    G(y) ≥ G(x) − (Σ_i ln(‖x‖_1 / x_i)) ζ − ½ (Σ_i 1/(x_i − ζ) − m/(‖x‖_1/m − ζ)) ζ².

The coefficient of ζ² is positive unless all x_i are equal, in which case it becomes 0.

The lower bound above does not depend on y, only on x and ζ. The restriction x ≥ ζ̄ applies to the 'reference' point x, not to the 'variable' y; see also Remark 4.2. Lastly, since ‖x − y‖_1 ≤ ζ ⇒ ‖x − y‖_∞ ≤ ζ, the lemma holds also when the ℓ∞ norm is replaced by the ℓ1 norm.

We will use Lemma 2.1 in §4.1 to bound how far from the maximum G(x*) the value G(x) can be if x is close to x*. We also comment there, in Remark 4.3, on how the above bound compares to bounds obtainable from the relationship between (ordinary) entropy difference and ℓ1 norm.

2.4 Maximization

Let C(0) denote the subset of R^m defined by the constraints in (1.1)^2.2.
Here we point out that despite the fact that G is not a strictly concave function (recall Proposition 2.2, part 2), the point x* solving (1.1) is the unique optimal solution of our maximization problem, and occupies a special location in C(0):

Proposition 2.3
1. The point x* is the unique optimal solution of problem (1.1).
2. The set C(0) does not contain any x s.t. x ≥ x* with at least one strict inequality.

Figure 2.2 illustrates the first statement of the proposition.

[Figure 2.2: A 2-dimensional polytope C(0). By Proposition 2.3, x* can lie only on the heavy black line.]

Finally, we look at the form of the solution x* in terms of Lagrange multipliers. The Lagrangean for problem (1.1) is

    L(x, λ_E, λ_I) = G(x) − λ_E · (A_E x − b_E) − λ_I · (A_I x − b_I),    (2.3)

where λ_E, λ_I are the vectors of the Lagrange multipliers corresponding to the equality and inequality constraints. The solution x* will satisfy some of the inequality constraints with equality (these are called binding or active at x*), and some with strict inequality. It is known that multipliers λ_I,j corresponding to inequalities non-binding at x* will be 0, while the rest of them will be ≥ 0 (see, e.g., [HUL96], Ch. VII, §2.4). Thus, denoting the sub-vector of λ_I corresponding to binding inequalities by λ_BI and the corresponding sub-matrix of A_I by A_BI, it follows from (2.3) that x* can be written as

    x*_j = (x*_1 + ··· + x*_m) e^{−(λ_E · A_E,·j + λ_BI · A_BI,·j)}.    (2.4)

This expression determines the elements of the density vector χ* = x*/Σ_i x*_i in terms of the multipliers, but it does not determine the vector x* itself.

^2.2 The reason for the "0" will be seen in §3.1, where we discuss tolerances on constraints.

Remark 2.1 It is clear that the form (2.4) cannot express any elements of x* that are 0, if the multipliers λ are to be finite.
To avoid introducing special cases in the sequel to handle the zeros, we will assume as a convenience that any elements of the solution to problem (1.1) that are forced to be exactly 0 by the constraints are eliminated from consideration, either before or after the solution is found. We have already alluded to this after (1.2). Thus, whenever we speak of x* in what follows we will assume that all of its elements are positive reals. See Example 5.3 in §5. A more detailed discussion of the issue of 0s is in [OG16], §II.A.

Example 2.1 Returning to Example 1.1, it is possible to maximize G analytically under the given constraints. Introducing real variables x_1, x_2, x_3 corresponding to ν_r, ν_g, ν_b and letting the constraints be x_1 + x_2 = a and x_2 + x_3 ≤ b, the solution turns out to be

    x*_1 = s* − b,   x*_2 = a + b − s*,   x*_3 = s* − a,   s* = (a + b + √(a² + b²))/2.

Further, the bounds s_1, s_2 of (1.2) on the possible sums are s_1 = a and s_2 = a + b. We see that the MaxGEnt solution to the problem is never trivial, in the sense that for all a, b we have s_1 < s* < s_2. With a = 4, b = 6 we have s* = ½(1 + √13/5) s_2 ≈ 0.861 s_2, i.e. s* = 8.61, and x* = (2.61, 1.39, 4.61); compare with Table 1.1.

2.5 A connection with I-divergence

For density vectors, the relationship between ordinary entropy H(x) and divergence D(x‖y) is well known: with uniform y, D(x‖y) reduces to −H(x) to within a constant, and its minimization is equivalent to the maximization of H(x). Here we look at whether G(x) has any analogous properties.

First, if in D(x‖y) we take y to have all of its elements equal to Σ_i x_i, we obtain −G(x). However, this is merely a formal relationship^2.3. For example, minimizing D(x‖y) with respect to x when y = (Σ_i x_i, ...
, Σ_i x_i) cannot be given the same interpretation as minimizing D(x‖y) with respect to x given a fixed 'prior' y. So even if x, y summed to 1, neither the axiomatic nor the concentration justifications for cross-entropy minimization would apply.

Second, the concentration properties we establish in §4 and §5 support the maximization of G(x) as a method of inference of non-negative vectors from limited information. Another method for doing this, suggested in [Ski89], [Csi96], is based on minimizing the I-divergence (information divergence) between non-negative vectors

    D(u‖v) ≜ Σ_i u_i ln(u_i/v_i) − Σ_i u_i + Σ_i v_i,   u, v ∈ R^m_+.    (2.5)

This reduces to the ordinary divergence D(u‖v) when u, v sum to 1. The inference problem is "problem (iii)" in [Csi96]: infer a non-negative function p(z), not necessarily summing or integrating to 1, given that (a) it belongs to a certain feasible set F of functions defined by linear equality constraints, and (b) a default model q(z)^2.4. It is shown that the solution of this problem is the p* ∈ F that minimizes the I-divergence D(p‖q). (Recently, minimization of I-divergence and its generalizations to "α, β divergences" have found many applications in the area known as "non-negative matrix factorization"; see [CCA11].)

There is a relationship between minimizing I-divergence and maximizing generalized entropy:

Proposition 2.4 Let (A_E, b_E), (A_I, b_I) be linear equality and inequality constraints on a vector in R^m_+, and let x* be the solution of the MaxGEnt problem with these constraints on x. Given a prior v ∈ R^m_+, let u*(v) be the solution to the minimum I-divergence problem with the same constraints on u. Then there is a prior ṽ which makes the two solutions coincide, i.e. u*(ṽ) = x*. That prior is ṽ = (s*, ..., s*).

^2.3 This is pointed out in [BV04], Ch. 3, Example 3.19.
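Before giving the argument, here is a numerical check of Proposition 2.4 on Example 2.1 (a sketch; the grid search and all names are ours): maximizing G and minimizing the I-divergence (2.5) against the prior ṽ = (s*, s*, s*) over the same feasible set pick out essentially the same point.

```python
from math import log, sqrt

def G(x):
    s = sum(x)
    return -sum(v * log(v) for v in x if v > 0) + (s * log(s) if s > 0 else 0.0)

def idiv(u, v):
    """I-divergence (2.5) between strictly positive vectors."""
    return sum(ui * log(ui / vi) - ui + vi for ui, vi in zip(u, v))

# Example 2.1 with a = 4, b = 6: analytic s* and x*.
a, b = 4.0, 6.0
s_star = (a + b + sqrt(a * a + b * b)) / 2
x_star = (s_star - b, a + b - s_star, s_star - a)
prior = (s_star, s_star, s_star)          # the prior of Proposition 2.4

# Strictly positive feasible grid for x1 + x2 = a, x2 + x3 <= b (step 0.02).
grid = []
for i in range(1, 200):
    x2 = i / 50
    for j in range(1, 301):
        x3 = j / 50
        if x2 + x3 <= b + 1e-9:
            grid.append((a - x2, x2, x3))

best_G = max(grid, key=G)                          # MaxGEnt, on the grid
best_D = min(grid, key=lambda u: idiv(u, prior))   # min I-divergence, same grid
```

With this constant prior, the I-divergence of any feasible x is 2s* − G(x) plus the gap between Σ_i x_i and s*, so the two grid optima coincide and sit next to the analytic x*.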
This follows from the fact that the minimum I-divergence solution to a problem with prior v and constraints A_E u = b_E and A_I u ≤ b_I on u is

    u*_j = v_j e^{−(λ_E · A_E^{·j} + λ_BI · A_BI^{·j})},   (2.6)

where A^{·j} denotes the j-th column of A. If we set v_j = s*, it can be seen from expression (2.4) that u*_j = x*_j satisfies (2.6).

Inference by minimizing I-divergence under equality constraints has an axiomatic basis, but, as pointed out in §3 and §7 of [Csi96], the combinatorial, concentration rationale that we are advocating here does not seem to apply to it. Proposition 2.4 shows that the adoption of a particular prior furnishes this rationale, except that this prior cannot be properly viewed as independent of the solution (posterior) u*. This dependence may shed some light on the difficulty of finding the concentration rationale in general. [As an illustration, Example 2.1 can be solved by I-divergence minimization assuming a constant prior v = (α, α, α). An analytical solution u* is possible, and it has the same form as the MaxGEnt solution, but it is a function of α ∈ (0, ∞); the question then becomes what value to adopt for α.]

3 Constraints, scaling, sensitivity, and the optimal count vector

In §3.1 we discuss the necessity of introducing tolerances into the constraints defining the MaxGEnt problem, and in §3.2 the effect of these tolerances on the maximization of G. In §3.3 we turn to the scaling of the problem, i.e. multiplying the data vector b by some c > 0, and the important properties of this scaling. Lastly, in §3.4 we discuss the optimal, or MaxGEnt, count vector ν*, constructed from the real vector x* solving problem (1.1).

Footnote 2.4: The sense of 'default' is that if q is in F then, in the absence of any constraints, the method should infer p* = q.
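The closed form of Example 2.1 and the formal relationship noted at the start of this section are easy to check numerically. The following Python sketch is illustrative only (the function names are ours); it assumes, consistently with the identities n*H(f*) = G(ν*) and s*G(χ*) = G(x*) used later in §4, that the generalized entropy of a positive vector is G(x) = Σᵢ xᵢ ln(s/xᵢ) with s = Σᵢ xᵢ:

```python
import math

def G(x):
    # generalized entropy: G(x) = sum_i x_i ln(s / x_i), with s = sum_i x_i
    s = sum(x)
    return sum(xi * math.log(s / xi) for xi in x)

def I_div(u, v):
    # I-divergence (2.5) between non-negative vectors
    return sum(ui * math.log(ui / vi) - ui + vi for ui, vi in zip(u, v))

# Example 2.1 with a = 4, b = 6: constraints x1 + x2 = a, x2 + x3 <= b
a, b = 4.0, 6.0
s_star = (a + b + math.hypot(a, b)) / 2             # s* = (a+b+sqrt(a^2+b^2))/2
x_star = (s_star - b, a + b - s_star, s_star - a)   # approx (2.61, 1.39, 4.61)

# brute-force check that x* maximizes G over the feasible set: x1 = a - x2,
# and, since G is strictly increasing in each coordinate, x3 = b - x2
best_x2, best_G = max(((x2, G((a - x2, x2, b - x2)))
                       for x2 in (i / 1000 for i in range(1, 4000))),
                      key=lambda t: t[1])

# formal relationship: ordinary divergence D(x || (s,...,s)) equals -G(x)
x = (1.0, 2.0, 3.0)
s = sum(x)
D_ord = sum(xi * math.log(xi / s) for xi in x)

# (2.5) reduces to the ordinary divergence when u, v are probability vectors
u, v = (0.2, 0.3, 0.5), (0.25, 0.25, 0.5)
kl = sum(ui * math.log(ui / vi) for ui, vi in zip(u, v))
```

The grid search confirms that G is largest at x_2 = a + b − s* ≈ 1.39, in agreement with the closed form.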
3.1 Constraints with tolerances

We pointed out the necessity of introducing tolerances into linear constraints when establishing concentration of ordinary entropy in [OG16]. There the constraints involved real coefficients, and the solutions had to be rational (frequency) vectors with a particular denominator. Here the solutions need to be integral (count) vectors, but the equality constraints may not have any integral solution; e.g. x_1 − x_2 = 1, x_1 + x_2 = 4 are satisfied only by (x_1, x_2) = (2.5, 1.5), and likewise with inequalities, e.g. 1.3 ≤ x_1 ≤ 1.99. We therefore define the set of real m-vectors x that satisfy the constraints in (1.1) with a relative accuracy or tolerance δ > 0:

    C(δ) ≜ {x ∈ R^m : b_E − δ|β_E| ≤ A_E x ≤ b_E + δ|β_E|,  A_I x ≤ b_I + δ|β_I|},   (3.1)

where β_E, β_I are identical to b_E, b_I, except that any elements that are 0 are replaced by appropriate small positive constants. The tolerances are only on the values b of the constraints, not on their structure A. Recall that the generalized entropy is maximized over C(0), which we have assumed to be non-empty, problem (1.1).

There are three main points concerning the introduction of δ. First, it ensures the existence of integral solutions, which is elaborated in Proposition 3.1 below. Second, and related to the first, δ ensures that the concentration statement GC in §1 holds for all scalings of the problem larger than a threshold ĉ. This is analogous to having concentration for frequency (rational) vectors hold for all denominators n larger than some N, as in [OG16]. Third, δ has an effect on the maximization of G; this is the subject of §3.2.

Proposition 3.1 below gives the fundamental facts about the existence of count vectors in C(δ).
Given an x in C(0), any other vector y close enough to it is in C(δ), and, if δ is not too small, the count vector obtained by rounding x element-wise is in C(δ); in other words, for every real vector in C(0) there is an integral vector in C(δ). The "close enough" and the "not too small" depend on a number ϑ_∞:

Proposition 3.1 With β_E, β_I as in (3.1), define

    ϑ_∞ ≜ min( |β_E|_min / |||A_E|||_∞, |β_I|_min / |||A_I|||_∞ ),

or ∞ if there are no constraints [Footnote 3.1: Recall that the infinity norm |||·|||_∞ of a matrix is the maximum of the ℓ_1 norms of its rows.]. Then if x is any point in C(0):

1. Given any δ > 0, any y ∈ R^m_+ such that ‖y − x‖_∞ ≤ δϑ_∞ is in C(δ).
2. In particular, if δ ≥ 1/(2ϑ_∞), the integral/count vector [x] is in C(δ).

As we add constraints to a problem, ϑ_∞ can only decrease, or at best stay the same. This proposition is used in §4.3, eq. (4.23), and in §5.2, after (5.9).

Example 3.1 Fig. 3.1 shows a network consisting of 6 nodes and 6 links. The links are subject to a certain impairment x, and x_i is the quantity associated with link i. The impairment is additive, e.g. its value over the path AB consisting of links 4, 1, 6 is x_4 + x_1 + x_6.

[Figure 3.1: Data b on the impairment x in a 6-node, 6-link network, with nodes A, B, C; the measurements are x_4 + x_1 + x_6 = b_1, x_6 + x_3 + x_5 = b_2, x_4 + x_2 + x_5 = b_3, and x_4, x_5, x_6 ≤ b_4.]

Suppose that x is measured over the 3 paths AB, BC, CA, and it is also known that the access links 4, 5, 6 contribute no more than a certain amount, as shown in Fig. 3.1. The structure matrices A_E, A_I and data vectors b_E, b_I then are

    A_E = [1 0 0 1 0 1; 0 0 1 0 1 1; 0 1 0 1 1 0],   b_E = (b_1, b_2, b_3)^T,
    A_I = [0 0 0 1 0 0; 0 0 0 0 1 0; 0 0 0 0 0 1],   b_I = (b_4, b_4, b_4)^T.

The problem is to infer the impairment vector x from the measurement vector b.
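Proposition 3.1 and the set (3.1) can be checked numerically on this network. The Python sketch below is illustrative (the helper names are ours, and it uses the data values b = (10.5, 18.3, 8.7, 4) adopted below in this example): it computes ϑ_∞ and verifies that rounding an exact solution element-wise lands in C(δ) for the guaranteed tolerance δ = 1/(2ϑ_∞):

```python
AE = [(1,0,0,1,0,1), (0,0,1,0,1,1), (0,1,0,1,1,0)]   # paths AB, BC, CA
bE = [10.5, 18.3, 8.7]
AI = [(0,0,0,1,0,0), (0,0,0,0,1,0), (0,0,0,0,0,1)]   # access links 4, 5, 6
bI = [4.0, 4.0, 4.0]

def dot(row, x):
    return sum(r * xi for r, xi in zip(row, x))

def in_C(x, delta):
    # membership in the set C(delta) of (3.1); here beta = b (no zero elements)
    ok_eq = all(be - delta * abs(be) <= dot(a, x) <= be + delta * abs(be)
                for a, be in zip(AE, bE))
    ok_in = all(dot(a, x) <= bi + delta * abs(bi) for a, bi in zip(AI, bI))
    return ok_eq and ok_in

def inf_norm(A):
    # matrix infinity norm: maximum l1 norm of the rows
    return max(sum(abs(v) for v in row) for row in A)

theta = min(min(abs(v) for v in bE) / inf_norm(AE),
            min(abs(v) for v in bI) / inf_norm(AI))   # min(8.7/3, 4/1) = 2.9

x = [6.591, 5.326, 13.26, 1.120, 2.253, 2.789]  # satisfies the constraints exactly
x_rounded = [round(v) for v in x]               # [7, 5, 13, 1, 2, 3]
delta = 1 / (2 * theta)                         # ~= 0.172, per Proposition 3.1(2)
```

As the text notes, [x] is in C(δ) for this δ ≈ 0.172; for this particular x an even smaller tolerance would suffice, since Proposition 3.1 gives only a worst-case guarantee.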
Clearly, the values of the b_i depend on the chosen units and can change under various conditions, whereas the elements of A_E, A_I are constants defining the structure of the network, independent of any units. Suppose we take (b_1, . . . , b_4) = (10.5, 18.3, 8.7, 4). Then with β_E, β_I = b_E, b_I we have in Proposition 3.1 |b_E|_min / |||A_E|||_∞ = 8.7/3 and |b_I|_min / |||A_I|||_∞ = 4/1, so ϑ_∞ = 2.9. The vector x = (6.591, 5.326, 13.26, 1.120, 2.253, 2.789) satisfies the constraints exactly. The rounded vector [x] = (7, 5, 13, 1, 2, 3) is in the set C(δ) defined by (3.1) for any δ ≥ 0.172.

3.2 Effect of tolerances on the optimality of x*

With the constraints x_1 − x_2 = 1, x_1 + x_2 = 4, and x_1, x_2 ≥ 0, C(0) is a 0-dimensional polytope in R², the point (2.5, 1.5). However, introducing the tolerance δ = 0.05 turns the equalities into inequalities, and C(0.05) = {0.95 ≤ x_1 − x_2 ≤ 1.05, 3.8 ≤ x_1 + x_2 ≤ 4.2} becomes 2-dimensional. Apart from the change in dimension, C(0.05) also contains the point (2.55, 1.55), at which G assumes a value greater than G* = G(2.5, 1.5), its maximum over C(0). This must be taken into account, since concentration refers to the vectors in C(δ), not those in C(0). The following lemma shows that the amount by which the value of G can exceed G*, due to the widening of the domain C(0) to C(δ), is bounded by a linear function of δ; it generalizes Prop. II.2 of [OG16] for the ordinary entropy H:

Lemma 3.1 Let (λ_E, λ_BI) be the vector of Lagrange multipliers in (2.4) corresponding to the solution G* = G*(0), x* = x*(0) of the maximization problem (2.1). Define

    Λ* ≜ |λ_E| · |β_E| + λ_BI · |β_BI|,   Λ* ≥ G*.

Then with δ ≥ 0, for any ν ∈ C(δ),

    G(ν) ≤ G* + Λ*δ − nD(f‖χ*),

where n = Σᵢ νᵢ, f is the frequency vector corresponding to ν, and χ* is the density vector corresponding to x*. The upper bound on G(ν) is at least (1 + δ)G* − nD(f‖χ*).

When δ = 0 the lemma says simply that G* is the maximum of G over C(0). The D(·‖·) term is positive, and equals 0 iff ν = αx* for some α > 0 [Footnote 3.2: The only way the density vectors can be equal is if the un-normalized vectors are proportional.]. Leaving aside that this is possible only for special x* and α, Lemma 3.1 says that if the resulting ν is in C(δ), then G(ν) = αG* ≤ G* + Λ*δ, i.e. the allowable α is limited by δ. Also, if we have even one equality constraint, δ limits the size of the allowable α even further.

3.3 Scaling of the data and bounds on the allowable sums

We establish a fundamental property (2.1 in §2.1) of maximizing the generalized entropy G: if the problem data b is scaled by a factor c > 0, all aspects of the solution scale by the same factor.

Proposition 3.2 Suppose that the relaxed count vector x* maximizes G(x) under the linear constraints A_E x = b_E, A_I x ≤ b_I, which also imply that Σᵢ xᵢ is between the bounds s_1, s_2. Let c > 0 be any constant. Then the vector cx* maximizes G(x) under the scaled constraints A_E x = cb_E, A_I x ≤ cb_I, the maximum value of G is cG(x*), and the new bounds on Σᵢ xᵢ are cs_1, cs_2.

How do s_1 and s_2, defined in (1.2), depend on the structure matrices A_E, A_I and the data b_E, b_I? In general, the problem of bounding s_1 or s_2 doesn't have a simple answer: by scaling the variables, any linear program whose objective function is a positive linear combination of the variables can be converted to one whose objective function is simply the sum of the variables. But in some special cases we can derive simple bounds on s_1 and s_2:

Proposition 3.3 (Bounds on the sums s_1 and s_2.)

1. If there are some equality constraints, then s_1 ≥ ‖b_E‖_1 / |||(A_E)^T|||_∞. (This bound can only increase if there are also inequalities.)

2. Suppose all of A_E, A_I, b_E, b_I are ≥ 0, and each x_i occurs in at least one constraint. Then s_2 ≤ Σᵢ b_{E,i}/α_{E,i} + Σᵢ b_{I,i}/α_{I,i}, where α_{E,i} (resp. α_{I,i}) is the smallest non-zero element of row i of A_E (resp. A_I) if that element is < 1, and 1 otherwise.

Recall from §1 that "each x_i occurs in at least one constraint" is a necessary condition for the problem to be bounded. The proposition applies to Example 3.1: we find that s_1 ≥ (b_1 + b_2 + b_3)/2 and s_2 ≤ b_1 + b_2 + b_3 + b_4.

3.4 The optimal count vector ν*

Given the relaxed optimal count vector x*, we construct from it a count vector ν* which is a reasonable approximation to the integral vector that solves problem (1.1), in the sense that (a) its sum is close to that of x*, and (b) its distance from x* is small in ℓ_1 norm. These properties will be needed in §4 and §5. We will require ν* to sum to n*, where

    n* ≜ ⌈s*⌉,   s* ≜ Σᵢ x*ᵢ.   (3.2)

For any x ≥ 0, let [x] be the vector obtained by rounding each of the elements of x up or down to the nearest integer. ν* is obtained from x* by a process of rounding and adjusting:

Definition 3.1 ([OG16], Defn. III.1) Given x*, form the density vector χ* = x*/s* and set ν̃ = [n*χ*]. Construct ν* by adjusting ν̃ as follows. Let d = Σᵢ ν̃ᵢ − n* ∈ Z. If d = 0, set ν* = ν̃. Otherwise, if d < 0, add 1 to |d| elements of ν̃ that were rounded down, and if d > 0, subtract 1 from |d| elements that were rounded up. The resulting vector is ν*.

We will refer to ν* as "the optimal count vector" or "the MaxGEnt count vector" (even though it is not unique).
It sums to n*, and does not differ too much from x* in norm:

Proposition 3.4 The optimal count vector ν* of Definition 3.1 is such that

    Σ_{1≤i≤m} ν*ᵢ = n*,   ‖ν* − x*‖_1 ≤ 3m/4 + 1,   ‖ν* − x*‖_∞ ≤ 1,   ‖f* − χ*‖_1 ≤ 3m/(4n*).

There are other approximations to the integral solution of problem (1.1); for example, simply [x*] achieves smaller norms than ν*: ‖[x*] − x*‖_1 ≤ m/2, ‖[x*] − x*‖_∞ ≤ 1/2. ([x*] is the point of N^m that minimizes the Euclidean distance ‖ν − x*‖_2 from x*.) But [x*] does not have the required sum n*. Another, more sophisticated definition of ν* would use the solution of the integer linear program min_{ν ∈ N^m} Σ_{i=1}^m |νᵢ − x*ᵢ| subject to Σ_{i=1}^m νᵢ = n*. [This is a linear program because min_z Σᵢ |zᵢ − cᵢ| is equivalent to min_{a,z} Σᵢ aᵢ subject to aᵢ ≥ zᵢ − cᵢ, aᵢ ≥ −(zᵢ − cᵢ).] A ν* better than that of Definition 3.1 would improve the bound in (4.9) below [Footnote 3.3: No integral vector can achieve ℓ_1 norm smaller than ‖x* − [x*]‖_1; this solution to the linear program ignores the constraint, and minimizes each term of the objective function individually.].

4 Concentration with respect to entropy difference

It is not clear that concentration should occur at all in a situation like the one of Example 1.1. The fact that G has a global maximum G* over C(0) is not enough. In this section we demonstrate that concentration around G* does indeed occur, in the sense of the statement GC of §1, pertaining to entropies η-far from G*. This is done in two stages, by Theorem 4.1 in §4.2 and Theorem 4.2 in §4.3.

Consider the count vectors that sum to n and satisfy the constraints.
We divide them into two sets, A_n, B_n, according to the deviation of their generalized entropy from G*: given δ, η > 0,

    A_n(δ, η) ≜ {ν ∈ N_n ∩ C(δ) : G(ν) ≥ (1 − η)G*},
    B_n(δ, η) ≜ {ν ∈ N_n ∩ C(δ) : G(ν) < (1 − η)G*}.   (4.1)

Irrespective of the values of δ and η, A_n(δ, η) ⊎ B_n(δ, η) = N_n ∩ C(δ).

Now we discuss the possible range of n. We have assumed that the problem constraints A_E x = b_E, A_I x ≤ b_I imply that s_1 ≤ Σᵢ xᵢ ≤ s_2, where the bounds s_1, s_2 on the sum of x are found by solving the linear programs (1.2). So any integral vector that satisfies the constraints exactly, i.e. is in C(0), must have a sum n between ⌈s_1⌉ and ⌊s_2⌋. We will use a slight modification of this definition:

    n_1 ≜ ⌈s_1⌉,   n_2 ≜ ⌈s_2⌉.   (4.2)

With n* defined by (3.2), we have n_1 ≤ n* ≤ n_2. We may assume without loss of generality that n_1 ≤ n_2 − 1; otherwise all count vectors sum to a known n, and we reduce to the case of frequency vectors, which was studied in [OG16].

Remark 4.1 There is a certain degree of arbitrariness (or flexibility) in the definitions of n_1, n_2. Setting n_1 = ⌈s_1⌉, n_2 = ⌊s_2⌋ says that the allowable sums are those of count vectors which belong to C(0); it does not say that the only allowable vectors are those in C(0). Now it could be argued that after introducing the tolerance δ, the numbers n_1, n_2 should be allowed to become functions of δ. However, this would introduce significant extra complexity. Our definition makes concessions to simplicity by restricting somewhat the allowable sums, and by slightly adjusting the value of n_2 to handle the 'boundary' case ⌊s_2⌋ < s* ≤ s_2 more easily.

Having defined the range of allowable sums n as n_1 ≤ n ≤ n_2, we will use the (disjoint) unions of the sets (4.1) over n ∈ {n_1, . . . , n_2}:

    A_{n_1:n_2}(δ, η) ≜ {ν : Σᵢ νᵢ = n, n_1 ≤ n ≤ n_2, ν ∈ C(δ), G(ν) ≥ (1 − η)G*},
    B_{n_1:n_2}(δ, η) ≜ {ν : Σᵢ νᵢ = n, n_1 ≤ n ≤ n_2, ν ∈ C(δ), G(ν) < (1 − η)G*}.   (4.3)

Irrespective of δ and η, we have

    A_{n_1:n_2}(δ, η) ⊎ B_{n_1:n_2}(δ, η) = N_{n_1:n_2} ∩ C(δ).   (4.4)

We note the following relationship among the numbers of realizations of the optimal count vector ν* and those of the sets A_{n_1:n_2}(δ, η) and B_{n_1:n_2}(δ, η): if ν* ∈ A_{n_1:n_2}(δ, η), then

    #ν* / #B_{n_1:n_2}(δ, η) ≥ 1/ε
      ⇒ (#A_{n_1:n_2}(δ, η) + #B_{n_1:n_2}(δ, η)) / #A_{n_1:n_2}(δ, η) ≤ 1 + ε
      ⇒ #A_{n_1:n_2}(δ, η) / #(N_{n_1:n_2} ∩ C(δ)) ≥ 1/(1 + ε) > 1 − ε.   (4.5)

In other words, if the single vector ν* dominates the set B_{n_1:n_2} w.r.t. realizations, then the set A_{n_1:n_2} likewise dominates the set N_{n_1:n_2} ∩ C(δ).

The concentration statement GC in §1 says that given δ, ε, η > 0, there is a number ĉ = ĉ(δ, ε, η) ≥ 1 such that when the data b_E, b_I are scaled by any factor c ≥ ĉ, then

    ν* ∈ A_{n_1:n_2}(δ, η)   and   #ν* / #B_{n_1:n_2}(δ, η) ≥ 1/ε.   (4.6)

We establish the inequality in (4.6) by finding a lower bound on #ν* in §4.1 and an upper bound on #B_{n_1:n_2}(δ, η) in §4.2. Theorem 4.1 presents the ratio of these bounds. Then in §4.3 we find the concentration threshold ĉ that ensures (4.6); it is given by Theorem 4.2. Table 4.1 describes our notation for the process of scaling the problem data.

    Basic quantities                       Derived quantities
    x* ↦ cx*                               ν*
    s_1, s_2, s* ↦ cs_1, cs_2, cs*         n_1, n_2, n*
    G* ↦ cG*
    ϑ_∞ ↦ cϑ_∞

Table 4.1: The data scaling process b ↦ cb. The symbols x*, s_1, s_2, . . . on the left denote quantities before scaling. The symbols ν*, . . . on the right are quantities derived from the scaled basic quantities.
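The counting arguments of §4.1 and §4.2 rest on the classical link between the number of realizations #ν of a count vector — a multinomial coefficient — and its generalized entropy, made precise in eq. (4.7) below. As a numerical sanity check of that type of two-sided bound, here is a Python sketch (again assuming G(ν) = Σᵢ νᵢ ln(n/νᵢ) with n = Σᵢ νᵢ; function names are ours):

```python
import math

def realizations(nu):
    # number of realizations of a count vector: the multinomial coefficient
    r = math.factorial(sum(nu))
    for v in nu:
        r //= math.factorial(v)   # exact integer division at every step
    return r

def G(nu):
    n = sum(nu)
    return sum(v * math.log(n / v) for v in nu if v > 0)

def stirling_bounds(nu):
    # two-sided bound of the form (4.7):
    #   e^{-k/12} S(nu) e^{G(nu)} <= #nu <= S(nu) e^{G(nu)},
    # where k is the number of non-zero elements of nu
    k = sum(1 for v in nu if v > 0)
    n = sum(nu)
    prod = 1.0
    for v in nu:
        if v > 0:
            prod *= v
    S = math.sqrt(n) / ((2 * math.pi) ** ((k - 1) / 2) * math.sqrt(prod))
    upper = S * math.exp(G(nu))
    return math.exp(-k / 12) * upper, upper

checks = []
for nu in [(3, 2, 1), (10, 5, 5, 2), (1,), (7, 5, 14, 1, 2, 3)]:
    lo, hi = stirling_bounds(nu)
    checks.append(lo <= realizations(nu) <= hi)
```

For instance, for ν = (3, 2, 1) the exact count is 6!/(3! 2! 1!) = 60, which falls between the two bounds; note the bounds hold even in the degenerate case k = 1, #ν = 1.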
4.1 Realizations of the optimal count vector

In this section we find a lower bound on #ν* = (ν*_1 + · · · + ν*_m choose ν*_1, . . . , ν*_m), where ν* is the m-vector of Definition 3.1, in terms of quantities related to the generalized entropy. Like the number of realizations of a frequency vector and its entropy, the number of realizations #ν of a count vector ν is related to its generalized entropy. Given ν ∈ N^m, let ν_1, . . . , ν_k, k ≥ 1, w.l.o.g. be its non-zero elements; then

    e^{−k/12} S(ν) e^{G(ν)} ≤ #ν ≤ S(ν) e^{G(ν)},   S(ν) ≜ √n / ((2π)^{(k−1)/2} √(ν_1 · · · ν_k)).   (4.7)

This follows immediately from eq. (III.6) in [OG16], or Problem 2.2 in [CK11]; the bounds hold even when k = 1 and #ν = 1. Since ν* has no 0 elements (Remark 2.1), we can take k = m in (4.7), so

    #ν* ≥ e^{−m/12} S(ν*) e^{G(ν*)}.   (4.8)

Next we want to bound G(ν*) in terms of G* = G(x*). By Proposition 3.4, ‖ν* − x*‖_∞ ≤ 1. If we assume that x* > 1 (element-wise), Lemma 2.1 applies to ν* and x* and we get

    x* > 1  ⇒  G(ν*) ≥ G* − Σᵢ ln(1/χ*ᵢ) − ½ ( Σᵢ 1/(x*ᵢ − 1) − m/(s*/m − 1) ).   (4.9)

Returning to (4.8), it remains to find a convenient lower bound for S(ν*). Since ‖ν* − x*‖_∞ ≤ 1, we can use ν*ᵢ ≤ x*ᵢ + 1 in (4.7) to obtain

    S(ν*) ≥ ( √s* / (2π)^{(m−1)/2} ) Π_{1≤i≤m} 1/√(x*ᵢ + 1).   (4.10)

[Another, simpler bound is obtained by noting that ν_1 ν_2 · · · ν_m is maximal when all νᵢ are equal to n/m. The bound (4.10) is generally better, but can become slightly worse in some exceptional situations.]

Putting (4.10) and (4.9) into (4.8): if x* > 1,

    #ν* ≥ ( e^{−m/12} √s* / (2π)^{(m−1)/2} ) exp( −½ ( Σ_{i=1}^m 1/(x*ᵢ − 1) − m/(s*/m − 1) ) ) Π_{i=1}^m ( χ*ᵢ / √(x*ᵢ + 1) ) e^{G*} ≜ C_0(x*) e^{G*}.   (4.11)

The form of C_0(x*) is convenient for scaling according to Table 4.1.

Remark 4.2 (On the condition x* > 1.)
It is certainly possible to formulate MaxGEnt problems whose solutions have some elements smaller than 1 — in fact arbitrarily close to 0 — which invalidates (4.9) and (4.11). Here, however, we are dealing with 'large' problems, where x* is scaled by c > 1 for concentration to arise; see Theorem 4.2 below. So one way to deal with such problem formulations is to take as "the problem" a certain pre-scaling of the original, one might say pathological, problem. Nevertheless, if one wanted to avoid the x* > 1 issue entirely, one could use a weaker bound than (4.9), not subject to this restriction; see, for example, Remark 4.3.

Remark 4.3 We compare the bound (4.9), derived for count vectors, to one adapted from a bound for density vectors. In [OG16], proof of Proposition III.1, we derived the bound

    H(f*) ≥ H(χ*) − (3m/(8s*)) ln(m − 1) − h(3m/(8s*)),   (4.12)

where h(·) is the binary entropy function; there we had n in place of s*. [This is based on the bound |H(χ) − H(ψ)| ≤ ½‖χ − ψ‖_1 ln(m − 1) + h(½‖χ − ψ‖_1); see [CK11], Problem 3.10, or [Zha07]. An improved version, using both the ℓ_1 and ℓ_∞ norms, is in [Sas13].] By multiplying both sides of (4.12) by n* and then using the facts that n*H(f*) = G(ν*), n* ≥ s*, and s*G(χ*) = G(x*) ≜ G*, we obtain

    G(ν*) ≥ G* − (3m/8) ln(m − 1) − s* h(3m/(8s*)).   (4.13)

One way to compare the bounds (4.9) and (4.13) is to ask how their right-hand sides, apart from the G* term, behave under scaling of the problem by c (§3.3): we see that as c increases, the r.h.s. of (4.9) tends to −Σᵢ ln(1/χ*ᵢ), while the r.h.s. of (4.13) goes to −∞.

4.2 Realizations of the sets with smaller entropy

Here we derive upper bounds on the numbers of realizations of the sets B_n(δ, η) and B_{n_1:n_2}(δ, η).
By combining them with the lower bound on #ν* of §4.1, we establish our first main result, Theorem 4.1. From (4.1) and (4.7),

    #B_n(δ, η) ≤ Σ_{ν ∈ N_n ∩ C(δ), G(ν) < (1−η)G*} S(ν) e^{G(ν)} ≤ e^{(1−η)G*} Σ_{ν ∈ N_n} S(ν),

where in going from the first to the second inequality we ignored all the constraints. Using (4.7) and proceeding as in [OG16], proof of Lemma III.1,

    #B_n(δ, η) ≤ e^{(1−η)G*} Σ_{k=1}^m (m choose k) ( √n / (2π)^{(k−1)/2} ) Σ_{ν_1+···+ν_k = n} 1/√(ν_1 · · · ν_k)
              ≤ e^{(1−η)G*} Σ_{k=1}^m (m choose k) ( √n / (2π)^{(k−1)/2} ) ∫_{x_1+···+x_k = n, x_i > 0} dx_1 · · · dx_k / √(x_1 · · · x_k)
              = e^{(1−η)G*} Σ_{k=1}^m (m choose k) ( √n / (2π)^{(k−1)/2} ) ( π^{k/2} / Γ(k/2) ) n^{k/2 − 1}
              = e^{(1−η)G*} √(2π/n) Σ_{k=1}^m (m choose k) (n/2)^{k/2} / Γ(k/2).   (4.14)

We show in Appendix §A that the sum in the last line above is bounded by ( (n/2)^{m/2} / Γ(m/2) ) (1 + √(m/n))^m. This is better than the 4(1 + √(n/4))^m bound for the same sum obtained in [OG16], proof of Lemma III.1, as it is asymptotically tight (m fixed, n → ∞). Using this improved bound in (4.14),

    #B_n(δ, η) < e^{(1−η)G*} √(2π/n) ( (n/2)^{m/2} / Γ(m/2) ) (1 + √(m/n))^m.   (4.15)

We now turn to the set B_{n_1:n_2}(δ, η) defined in (4.3). By (4.15),

    #B_{n_1:n_2}(δ, η) = Σ_{n_1 ≤ n ≤ n_2} #B_n(δ, η) < ( e^{(1−η)G*} √(2π) / Γ(m/2) ) Σ_{n_1 ≤ n ≤ n_2} (1/√n) (n/2)^{m/2} (1 + √(m/n))^m.

Bounding the sum by an integral,

    (1/2^{m/2}) Σ_{n_1 ≤ n ≤ n_2} (1/√n) (√n + √m)^m ≤ (1/2^{m/2}) ∫_{s_1}^{s_2+2} (1/√y) (√y + √m)^m dy
        = ( 1 / (2^{m/2−1} (m+1)) ) ( (√(s_2+2) + √m)^{m+1} − (√s_1 + √m)^{m+1} ),

where we have widened the interval of integration from [n_1, n_2 + 1] to [s_1, s_2 + 2]; recall the definition (4.2) of n_1, n_2.
Therefore

    #B_{n_1:n_2}(δ, η) < C_1(s_1, s_2) e^{(1−η)G*},
    C_1(s_1, s_2) ≜ ( √π / ((m+1) 2^{(m−3)/2} Γ(m/2)) ) ( (√(s_2+2) + √m)^{m+1} − (√s_1 + √m)^{m+1} ),   (4.16)

where the sums s_1, s_2 have been defined in (1.2). By combining (4.11) and (4.16) we arrive at our first main result, a lower bound on the ratio of the number of realizations of the optimal count vector ν* to those of the set B_{n_1:n_2}(δ, η) of count vectors with generalized entropy η-far from G* = G(x*):

Theorem 4.1 Given structure matrices A_E, A_I and data vectors b_E, b_I, let (x*, s_1, s_2) be the optimal solution to problem (1.1), (1.2). Assume that x* > 1; recall Remark 4.2. Then for any δ, η > 0,

    #ν* / #B_{n_1:n_2}(δ, η) ≥ ( (m+1) e^{−m/12} Γ(m/2) / (2π^{m/2}) ) ( C_2(x*) C_4(x*) / C_3(s_1, s_2) ) e^{ηG*},

where the constants are

    C_2(x*) = √s* Π_{1≤i≤m} χ*ᵢ / √(x*ᵢ + 1),
    C_4(x*) = exp( −½ ( Σ_{1≤i≤m} 1/(x*ᵢ − 1) − m/(s*/m − 1) ) ),
    C_3(s_1, s_2) = (√m + √(s_2+2))^{m+1} − (√m + √s_1)^{m+1}.

One use of the theorem is when the problem is already 'large' enough and doesn't require further scaling. Then one may substitute appropriate values for δ and η and see what kind of concentration is achieved. Note that the concentration tolerance ε does not appear in the theorem.

4.3 The scaling factor needed for concentration

What happens to the lower bound of Theorem 4.1 as the size of the problem increases? In this section we establish Theorem 4.2, our first concentration result, which shows that the bound can exceed 1/ε for any given ε > 0. Introducing into the bound of Theorem 4.1 a scaling factor c ≥ 1,

    #ν* / #B_{n_1:n_2}(δ, η) ≥ ( (m+1) e^{−m/12} Γ(m/2) / (2π^{m/2}) ) ( C_2(cx*) C_4(cx*) / C_3(cs_1, cs_2) ) e^{cηG*}.   (4.17)

To facilitate scaling, we develop bounds on the functions C_2, C_3, C_4 of c appearing above.
First,

    C_2(cx*) = ( √s* / c^{(m−1)/2} ) Π_{1≤i≤m} χ*ᵢ / √(x*ᵢ + 1/c) ≥ ( √s* / c^{(m−1)/2} ) Π_{1≤i≤m} χ*ᵢ / √(x*ᵢ + 1),   c ≥ 1,   (4.18)

since χ* is invariant under scaling, and the first product above increases as c grows. Next, writing C_3 as

    C_3(cs_1, cs_2) = c^{(m+1)/2} ( (√(m/c) + √(s_2 + 2/c))^{m+1} − (√(m/c) + √s_1)^{m+1} ),

it can be shown that the function of c multiplying c^{(m+1)/2} above decreases as c grows [Footnote 4.1: After some algebra, its derivative can be shown to be negative if s_2 > s_1.], so its maximum occurs at c = 1. Thus

    C_3(cs_1, cs_2) ≤ c^{(m+1)/2} ( (√m + √(s_2+2))^{m+1} − (√m + √s_1)^{m+1} ),   c ≥ 1.   (4.19)

Finally, for C_4(cx*),

    −½ ( Σ_{1≤i≤m} 1/(cx*ᵢ − 1) − m/(cs*/m − 1) ) ≥ −½ Σ_{1≤i≤m} 1/(cx*ᵢ − 1) ≥ −½ Σ_{1≤i≤m} 1/(x*ᵢ − 1),

since x* > 1 and c ≥ 1, and so

    C_4(cx*) ≥ e^{−½ Σ_{i=1}^m 1/(x*ᵢ − 1)}.   (4.20)

Putting (4.18), (4.19), and (4.20) into (4.17): if x* > 1,

    #ν* / #B_{n_1:n_2}(δ, η) ≥ B c^{−m} e^{cηG*},   (4.21)

where the constant

    B ≜ ( (m+1) Γ(m/2) e^{−m/12} / (2π^{m/2}) ) · ( √s* Π_{1≤i≤m} χ*ᵢ/√(x*ᵢ+1) ) / ( (√m + √(s_2+2))^{m+1} − (√m + √s_1)^{m+1} ) · e^{−½ Σ_{i=1}^m 1/(x*ᵢ − 1)}

is ≪ 1. By (4.6), the scaling factor c to be applied to the original problem must be such that the r.h.s. of (4.21) is ≥ 1/ε, and also such that ν* belongs to A_{n_1:n_2}(δ, η). The first of these requirements translates into

    cηG* − m ln c ≥ −ln(εB).   (4.22)

If c_1 is the larger of the two solutions of the equality version of (4.22) [Footnote 4.2: An equation of this type generally has two roots, one small and one large. For example, e^x/x = 10 has roots 0.1118 and 3.577.], the inequality (4.22) will hold for all c ≥ c_1.

The second requirement on c, that ν* ∈ A_{n_1:n_2}(δ, η) — which is really ν* ∈ A_{n*}(δ, η) — has two parts. For the first part we need ν* ∈ C(δ); by Proposition 3.1 this is ensured by ‖ν* − cx*‖_∞ ≤ δcϑ_∞, and since the l.h.s. is ≤ 1 by Proposition 3.4, this will hold if c ≥ c_2, where

    c_2 ≜ 1/(δϑ_∞).   (4.23)

For the second part we need c to be such that G(ν*) ≥ (1 − η)cG*. By Proposition 3.2 and (4.9), this is ensured by

    cG* − Σ_{1≤i≤m} ln(1/χ*ᵢ) − ½ ( Σ_{1≤i≤m} 1/(cx*ᵢ − 1) − m/(cs*/m − 1) ) ≥ (1 − η)cG*
      ⇐ ½ Σ_{1≤i≤m} 1/(cx*ᵢ − 1) + Σ_{1≤i≤m} ln(1/χ*ᵢ) < cηG*
      ⇐ (1/(2c)) Σ_{1≤i≤m} 1/(x*ᵢ − 1) + Σ_{1≤i≤m} ln(1/χ*ᵢ) ≤ cηG*,   (4.24)

where the last implication follows from c ≥ 1 and x* > 1. So we need c ≥ c_3, the largest solution of the (quadratic) equation version of (4.24).

Given tolerances δ, ε, η, we have now established how to compute a lower bound ĉ, the concentration threshold, on the scaling factor required for concentration to occur around the point ν*, or in the set A_{n_1:n_2}, to the extent specified by δ, ε, η. This is our second main result, which establishes the statement GC of §1 concerning deviation from the value G*:

Theorem 4.2 Under the conditions of Theorem 4.1, for any δ, ε, η > 0, define the concentration threshold ĉ ≜ max(c_1, c_2, c_3), where c_1, c_2, c_3 have been defined in (4.22)–(4.24). Then when the data b_E, b_I is scaled by a factor c ≥ ĉ, the count vector ν* of Definition 3.1 belongs to the set A_{n_1:n_2}(δ, η) and we have

    #ν* / #B_{n_1:n_2}(δ, η) ≥ 1/ε   and   #A_{n_1:n_2}(δ, η) / #(N_{n_1:n_2} ∩ C(δ)) ≥ 1 − ε,

where n_1 = ⌈cs_1⌉, n_2 = ⌈cs_2⌉, and the sets A_{n_1:n_2}, B_{n_1:n_2} have been defined in (4.3).

Note that the constraint information A_E, b_E, A_I, b_I appears implicitly, via s_1, s_2, and ϑ_∞. The various sets figuring in the theorem are depicted in Figure 4.1.
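Computing c_1 (and, similarly, c_3) amounts to finding the largest root of an equation of the form c·a − m ln c = K with a, K > 0, as in (4.22). A bisection sketch in Python (illustrative, not part of the formal development), checked against footnote 4.2's example e^x/x = 10, which is the case a = 1, m = 1, K = ln 10:

```python
import math

def largest_root(a, m, K, hi=1e9, tol=1e-10):
    # f(c) = c*a - m*ln(c) is convex with its minimum at c = m/a, so
    # f(c) = K has at most two roots, one on each side of m/a.
    f = lambda c: c * a - m * math.log(c)
    lo = m / a                      # f is increasing on [m/a, hi]
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < K:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def smallest_root(a, m, K, tol=1e-10):
    f = lambda c: c * a - m * math.log(c)
    lo, hi = 1e-12, m / a           # f is decreasing on (0, m/a]
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > K:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# footnote 4.2's check: e^x / x = 10  <=>  x - ln x = ln 10
r1 = smallest_root(1.0, 1.0, math.log(10))   # ~= 0.1118
r2 = largest_root(1.0, 1.0, math.log(10))    # ~= 3.577
```

Only the larger root matters for (4.22), since the inequality then holds for all c ≥ c_1.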
[Figure 4.1: The outer ellipse, the set N_{n_1:n_2} ∩ C(δ) of count vectors that satisfy the constraints to within tolerance δ, is partitioned into B_{n_1:n_2}(δ, η), shown in gray, where G(ν) < (1 − η)G*, and A_{n_1:n_2}(δ, η), the inner white ellipse, where G(ν) ≥ (1 − η)G*; shown inside A_{n_1:n_2} are the ball ‖ν − x*‖_∞ ≤ δϑ_∞, the set A_{n*}(δ, η), and the points ν* and x*. The relationship shown between the ball ‖ν − x*‖_∞ ≤ δϑ_∞ and A_{n_1:n_2}(δ, η) is not the only one possible; likewise for x* and A_{n*}(δ, η).]

4.3.1 Bounds on the concentration threshold

It is useful to know something about how the threshold ĉ depends on the solution x*, G* of the MaxGEnt problem and on the parameters δ, ε, η, without having to solve equations. We derive some bounds on ĉ with regard to convenience, not tightness (see footnote 4.3). If c_i ≥ L_i, then ĉ = max_i c_i ≥ max_i L_i. Hence we have the lower bound

    ĉ ≥ max( −ln(εB)/(ηG*), 1/(δϑ_∞) ),   (4.25)

since c_1 must be bigger than the first term on the r.h.s., and c_2 equals the second. As intuitively expected, the bound says that the smaller δ, ε, or η are, the more scaling we need. By looking at the expression for B after (4.21), we see that the same holds the farther apart the bounds s_1, s_2 on the possible sums are from each other; this accords with intuition, and we discuss it further in Example 4.2. Next, if c_i ≤ U_i, then max_i c_i ≤ max_i U_i. So

    ĉ ≤ max( (2m/(ηG*)) ln( m − ln(εB)/(ηG*) ) − ln(εB)/(ηG*),  1/(δϑ_∞),  Σᵢ ln(1/χ*ᵢ)/(ηG*) + √( Σ_{i=1}^m 1/(x*ᵢ − 1) / (2ηG*) ) ),   (4.26)

where the expressions on the r.h.s. are upper bounds on c_1, c_2, c_3, respectively, as shown in the Appendix (see footnote 4.4). The upper bound says that the larger G* is, the less scaling we need; likewise for the elements of x*. Both of these implications agree with intuition. Further illustrations of the bounds (4.25) and (4.26) are in Example 4.1.
Footnote 4.3: The bounds still require knowing the solution x* to the MaxGEnt problem.
Footnote 4.4: Concerning the last expression, recall our assumption x* > 1 and Remark 4.2.

4.4 Examples

We give two examples. The first continues Example 3.1, illustrates the bounds on the concentration threshold, and points out an at first sight surprising behavior of the threshold. The second example illustrates an intuitively expected relationship between concentration and the bounds s_1, s_2.

Example 4.1 Returning to Example 3.1, we find

    s_1 = 21.5,   s_2 = 37.5,   x* = (6.591, 5.326, 13.26, 1.120, 2.253, 2.789),   s* = 31.34,   G* = 47.53.

Thus ν* = (7, 5, 14, 1, 2, 3) and n* = 32. Also, ϑ_∞ = 2.9. Table 4.2 shows what happens when the problem data b is scaled by the factor ĉ dictated by the given δ, ε, η. [We don't use a special notation for the quantities appearing in the unscaled vs. the scaled problem, so whenever we write x*, ν*, b_E, etc., a scaling factor, which could be 1, is implied.]

    δ ∈ [0.01, 1], ε = 10^−9
    η     ĉ      b                                n_1    n*     n_2    ν*
    0.05  34.48  (362.1, 631.0, 300.0, 137.9)     880    1081   1294   (227, 184, 457, 39, 78, 96)
    0.02  91.27  (958.3, 1670, 794.0, 365.1)      2328   2861   3423   (602, 486, 1210, 102, 206, 255)
    0.01  191.9  (2015, 3512, 1670, 767.7)        4894   6015   7197   (1265, 1022, 2545, 215, 433, 535)

    δ ∈ [0.01, 1], ε = 10^−15
    0.05  40.25  (422.7, 736.6, 350.2, 161.0)     1027   1262   1510   (266, 214, 534, 45, 91, 112)
    0.02  106.8  (1121.3, 1954.3, 929.1, 427.2)   2724   3347   4005   (703, 569, 1416, 120, 241, 298)
    0.01  222.9  (2340.2, 4078.6, 1939.0, 891.5)  5684   6985   8358   (1469, 1187, 2955, 250, 502, 622)

Table 4.2: Scaling of the problem of Example 3.1 for the given δ, ε, η.

With respect to the discrete solution, in the first row of Table 4.2, for example, we have ‖x* − ν*‖_∞ = 0.370. Further, ν* satisfies the equality constraints with tolerance ‖A_E ν* − b_E‖_∞ / min|b_E| = 0.0033 and the inequality constraints with tolerance 0.
We see that the scaling factor ĉ is quite sensitive to η and rather insensitive to ε; this can be surmised from (4.25). One way to interpret the scaling is as a change in the scale of measurement of the data b, e.g. a change in the units. Then scaling by a larger factor means choosing more refined units, and the above results show that the concentration increases, as intuitively expected.

With respect to the bounds (4.25) and (4.26) on the threshold ĉ, for the first row of the table with δ = 0.01 they yield ĉ ∈ [34.48, 41.87]. For δ ∈ [0.02, 0.05] they yield ĉ ∈ [25.1, 41.87]. For the second row, the bounds give ĉ ∈ [62.8, 116.2] for any δ ∈ [0.01, 0.05]. Now suppose that the problem data is pre-scaled by 34.5. Then for the first row the bounds say that ĉ ∈ [1.0, 1.0], i.e. no further scaling is needed. For the second row, Theorem 4.2 gives ĉ = 2.39 and the bounds give ĉ ∈ [2.23, 2.55]. So the original problem had a threshold ĉ = 91.27, but when scaled by 34.5, the threshold becomes only ĉ = 2.39 < 91.27/34.5 = 2.64. Apparently, unlike the rest of the problem (Proposition 3.2), the concentration threshold does not behave linearly with scaling: ĉ(34.5 × problem) < 34.5 ĉ(problem). The explanation for this at first sight disconcerting behavior is two-fold: first, Theorem 4.2 does not say that ĉ is the minimum required scaling factor for a given problem; second, there are many approximations involved in the derivation of ĉ, and many of them get better as the size of the problem increases.

Example 4.2 Intuition says that the bounds s1, s2 on the possible sums of the admissible count vectors have something to do with concentration: if they are wide, concentration should be more difficult to achieve.
Suppose that, somehow, the MaxGEnt vector x* from which ν* is derived remains fixed; then the wider the range s1, s2 allowed by the constraints, the larger should be the scaling factor required for ν* to dominate. The bound (4.25) agrees with this, via the expression for B after (4.21). We now give a simple situation in which the difference between s1 and s2 can increase while x* remains fixed. Consider a 2-dimensional problem with box constraints b1 ≤ x1 ≤ b2, b3 ≤ x2 ≤ b4, depicted in Fig. 4.2. Then s1 = b1 + b3, s2 = b2 + b4, and G is maximum at the upper right corner of the box (Proposition 2.3). If we reduce b1, b3 to b′1, b′3, the lower left corner of the box moves down and to the left while the upper right corner x* = (b2, b4) remains fixed, as shown in the figure. Thus we widen the bounds s1, s2 while leaving s*, G* unchanged, and the

Figure 4.2: Reducing s1 while leaving s2 and x* unchanged.

problem with the new box constraints requires more scaling than the original problem. The construction generalizes immediately to m dimensions, see §2.3.

5 Concentration with respect to distance from the MaxGEnt vector

In this section we provide results analogous to those of §4, but with the sets A, B formulated in terms of the distance of their elements from the optimal vector x*, as measured by the ℓ1 norm. This is a more intuitive measure than difference in entropy. There are three main results: Theorems 5.1 and 5.2, analogues of Theorems 4.1 and 4.2, and Theorem 5.3, an optimized version of Theorem 5.2 that does not require specifying a δ. In various places we reuse results and methods from §4, so the presentation here is more succinct.
For given n and δ > 0, we want to consider the count vectors in N_n that lie in C(δ) and whose distance from x* is no more than ϑ > 0 in ℓ1 norm, and those that lie in C(δ) but are farther from x* than ϑ in ℓ1 norm. The situation is less straightforward than with frequency/density vectors. First, given two real m-vectors, the norm of their difference can never be smaller than the difference of their norms, so it does not make sense to require that this norm be too small^{5.1}. Second, we will be considering norms that can be large numbers, especially after scaling of the problem, so it will not do to consider a fixed-size region around x*. For these reasons, we define for ϑ > 0

  A_n(δ, ϑ) ≝ { ν ∈ N_n ∩ C(δ) : ‖ν − x*‖_1 ≤ |n − s*| + min(n, s*)ϑ },
  B_n(δ, ϑ) ≝ { ν ∈ N_n ∩ C(δ) : ‖ν − x*‖_1 > |n − s*| + min(n, s*)ϑ }.    (5.1)

This is more complicated than the definition for frequency vectors in [OG16], but here ϑ is again a small number < 1. If n were equal to s*, (5.1) would say that the density vectors f and χ* are such that ‖f − χ*‖_1 is ≤ ϑ in A_n and > ϑ in B_n. In general, (5.1) says that the norm of ν − x* is close to |n − s*|: if n ≤ s*, the bound is s* − (1 − ϑ)n, and if n ≥ s* it is n − (1 − ϑ)s*.

We will consider the (disjoint) unions of the sets (5.1) over n ∈ {n1, ..., n2}, with n1, n2 given by (4.2):

  A_{n1:n2}(δ, ϑ) ≝ { ν | Σ_i ν_i = n, n1 ≤ n ≤ n2, ν ∈ C(δ), ‖ν − x*‖_1 ≤ |n − s*| + min(n, s*)ϑ },
  B_{n1:n2}(δ, ϑ) ≝ { ν | Σ_i ν_i = n, n1 ≤ n ≤ n2, ν ∈ C(δ), ‖ν − x*‖_1 > |n − s*| + min(n, s*)ϑ }.    (5.2)

For any δ, ϑ, these two sets partition N_{n1:n2} ∩ C(δ), the set of count vectors that sum to a number between n1 and n2 and lie in C(δ).
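The membership test in (5.1) is easy to state in code. The sketch below (which does not model the constraint-tolerance condition ν ∈ C(δ), only the ℓ1-distance condition) classifies the count vector ν* of Example 4.1 for two illustrative values of ϑ:

```python
# Sketch of the l1-distance condition in (5.1): a count vector nu with sum n
# is on the A side when ||nu - x*||_1 <= |n - s*| + min(n, s*)*theta.
# (The condition nu in C(delta) is not modeled here.)
def in_A(nu, x_star, theta):
    n = sum(nu)
    s_star = sum(x_star)
    dist = sum(abs(a - b) for a, b in zip(nu, x_star))
    return dist <= abs(n - s_star) + min(n, s_star) * theta

# data of Example 4.1; theta values are illustrative
x_star = [6.591, 5.326, 13.26, 1.120, 2.253, 2.789]
nu_star = [7, 5, 14, 1, 2, 3]                    # sum n* = 32
print(in_A(nu_star, x_star, theta=0.05))         # True:  2.059 <= 0.661 + 1.567
print(in_A(nu_star, x_star, theta=0.04))         # False: 2.059 >  0.661 + 1.254
```

Note how the |n − s*| offset lets ν* fall on the A side even though ‖ν* − x*‖_1 = 2.059 is much larger than min(n, s*)ϑ alone.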
With these definitions, we will establish an analogue of (4.6) in §4: given δ, ε, ϑ > 0, there is a concentration threshold ĉ = ĉ(δ, ε, ϑ) s.t. if the problem data b^E, b^I is scaled by any factor c ≥ ĉ, then the MaxGEnt count vector ν* is in the set A_{n1:n2}(δ, ϑ) and has at least 1/ε times the realizations of all vectors in the set B_{n1:n2}(δ, ϑ):

  ν* ∈ A_{n1:n2}(δ, ϑ)  and  #ν* / #B_{n1:n2}(δ, ϑ) ≥ 1/ε.    (5.3)

There is one important difference with §4: here the tolerances δ and ϑ cannot be chosen independently of one another, they must obey a certain restriction.

^{5.1} In the case of frequency vectors, this lower bound is 0. See Proposition 5.1 for more details.

Remark 5.1 G* is the maximum of G over the domain C(0), with no tolerances on the constraints. As we said in §3.1, a tolerance δ > 0 widens this domain to C(δ), may move the vector that maximizes G from x*(0) to x*(δ), and may change the maximum value from G*(0) to G*(δ). Here we are looking for concentration in a region of size ϑ around the point x*. If δ is too large, we cannot expect such a region to dominate the count vectors in C(δ) w.r.t. the number of realizations, since x*(δ) may even lie inside the set B(δ, ϑ); by Proposition 2.3, it already lies on the 'boundary' of C(0). If ϑ is given, concentration in A(δ, ϑ) requires an upper bound on the allowable δ; see (5.8) below. In the setting of §4 there is no limitation on the magnitude of δ with respect to that of η. It is perfectly fine if the set A_n(δ, η) contains ν with G(ν) > G*(0), but not if B_n(δ, η) does. But B_n(δ, η) can't contain any such ν by its definition (4.1): if there are any such ν, all of them have to be in A_n(δ, η).
5.1 Realizations of the sets far from the MaxGEnt vector

To bound the number of realizations of B_{n1:n2}(δ, ϑ) we need to show that if ν is far from x* in the ‖ν − x*‖_1 sense, then G(ν) is far from G(x*). To simplify the notation, in this section we denote x*(0), χ*(0), G*(0) simply by x*, χ*, G*.

We first need an auxiliary relationship between the norm of the difference of two real vectors and the norm of the difference of their normalized versions:

Proposition 5.1 Let ‖·‖ be any vector norm, such as ‖·‖_1, ‖·‖_2, ‖·‖_∞, etc. Then for any x, y ∈ R^m and ϑ > 0,

  ‖x − y‖ ≥ |‖x‖ − ‖y‖| + ϑ  ⇒  ‖ x/‖x‖ − y/‖y‖ ‖ ≥ ϑ / min(‖x‖, ‖y‖).

What we want to show about G(ν) and G* follows by taking Lemma 3.1, bounding the divergence term D(·‖·) in terms of the ℓ1 norm, and then using Proposition 5.1 with the ℓ1 norm and min(‖ν‖, ‖x*‖)ϑ in place of ϑ:

Lemma 5.1 Given δ > 0 and ϑ > 0, with the notation of Lemma 3.1, for any count vector ν ∈ C(δ) with sum n,

  ‖ν − x*‖_1 ≥ |n − s*| + min(n, s*)ϑ  ⇒  G(ν) ≤ G* + Λ*δ − γ*ϑ²n,

where

  γ* ≝ (1/(4(1 − 2β*))) ln((1 − β*)/β*),  β* ≝ max_{I ⊂ {1,...,m}} min( Σ_{i∈I} χ*_i, 1 − Σ_{i∈I} χ*_i ).

In general, γ* ≥ 1/2 and (1 − χ*_max)/2 ≤ β* ≤ 1/2. If β* = 1/2, γ* ≝ 1/2.

The bound on the divergence that we used above, D(p‖q) ≥ γ(q)‖p − q‖_1², is due to [OW05]. The closeness of the number β(q) to 1/2 can be thought of as measuring how far away the density vector q is from having a partition^{5.2}. [BHK14] is also relevant here, as the authors study inf_p D(p‖q) subject to ‖p − q‖_1 ≥ ℓ. They refer to 1 − β ≥ 1/2, where β is as in Lemma 5.1, as the "balance coefficient".
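For small m the quantities β* and γ* of Lemma 5.1 can be computed by brute force over all subsets of {1, ..., m}. The sketch below does this for the density χ* = x*/s* of Example 4.1, where m = 6 (only 2^6 subsets):

```python
import math
from itertools import combinations

# Brute-force beta* and gamma* of Lemma 5.1 for chi* = x*/s* of Example 4.1.
x_star = [6.591, 5.326, 13.26, 1.120, 2.253, 2.789]
s = sum(x_star)
chi = [x / s for x in x_star]

beta = max(
    min(sum(chi[i] for i in I), 1 - sum(chi[i] for i in I))
    for r in range(1, len(chi))
    for I in combinations(range(len(chi)), r)
)
# gamma* = ln((1-b)/b) / (4(1-2b)), with the limiting value 1/2 at b = 1/2
gamma = 0.5 if abs(beta - 0.5) < 1e-12 else \
    math.log((1 - beta) / beta) / (4 * (1 - 2 * beta))

print(round(beta, 3), round(gamma, 3))  # 0.495 0.5
```

The result γ* ≈ 0.5 agrees with the value used for this problem in Example 5.1 below; the enumeration is exponential in m, so for large m one would fall back on the bounds β* ≥ (1 − χ*_max)/2 and γ* ≥ 1/2.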
Their Theorem 1b provides an exact value for inf_p D(p‖q) as a function of 1 − β, q, and ℓ, valid for ℓ ≤ 4(1/2 − β); this could be used in Lemma 5.1, at the expense of an additional condition between, in our notation, ϑ and β*. They also show that β ≥ 1/2 − q_max/2, where q_max is the largest element of q, a result which we have incorporated into Lemma 5.1.

We can now proceed to find an upper bound on #B_{n1:n2}(δ, ϑ). Beginning with #B_n(δ, ϑ), by (5.1) and (4.7),

  #B_n(δ, ϑ) ≤ Σ_{ν ∈ N_n ∩ C(δ), ‖ν − x*‖_1 > |n − s*| + min(n, s*)ϑ} S(ν) e^{G(ν)}.

Applying Lemma 5.1 to G(ν) and, similarly to what we did in §4.2, ignoring the condition involving the norm in the sum as well as the intersection with C(δ),

  #B_n(δ, ϑ) ≤ e^{G* + Λ*δ − γ*ϑ²n} Σ_{ν ∈ N_n} S(ν).

The sum above is identical to that in the expression for #B_n(δ, η) given at the beginning of §4.2, so following the development that led to (4.15),

  #B_n(δ, ϑ) ≤ ( √(2π/n) (n/2)^{m/2} / Γ(m/2) ) (1 + √(m/n))^m e^{G* + Λ*δ − γ*ϑ²n}.

Compare with (4.15). Consequently,

  #B_{n1:n2}(δ, ϑ) = Σ_{n1 ≤ n ≤ n2} #B_n(δ, ϑ)
    ≤ ( √(2π) / (2^{m/2} Γ(m/2)) ) e^{G* + Λ*δ} Σ_{n1 ≤ n ≤ n2} (1/√n)(√m + √n)^m e^{−γ*ϑ²n}
    ≤ ( √(2π) / (2^{m/2} Γ(m/2)) ) e^{G* + Λ*δ} (2/(m + 1)) [ (√(s2 + 2) + √m)^{m+1} e^{−γ*ϑ²(s* + 1)} + (√(s* + 2) + √m)^{m+1} e^{−γ*ϑ²s1} ],    (5.4)

where the inequality implied in the last line is derived in the Appendix. This bound on #B_{n1:n2}(δ, ϑ) is to be compared with the bound (4.16) on #B_{n1:n2}(δ, η).

^{5.2} In the sense of the NP-complete problem Partition.

Combining (4.11) with (5.4) we obtain a lower bound on the ratio of numbers of realizations analogous to that of Theorem 4.1:

Theorem 5.1 Given structure matrices A^E, A^I and data vectors b^E, b^I, let (x*, s1, s2) be the optimal solution of problem (1.1).
Assume that x* ≥ 1; recall Remark 4.2. Then for any δ, ε, ϑ > 0,

  #ν* / #B_{n1:n2}(δ, ϑ) ≥ ( (m + 1)Γ(m/2)e^{−m/12} / (2π^{m/2}) ) ( C2(x*)C4(x*) / C′3(s*, s1, s2) ) e^{γ*ϑ²s1 − Λ*δ},

where the constants C2(x*), C4(x*) are the same as in (4.17),

  C′3(s*, s1, s2) = (√(s2 + 2) + √m)^{m+1} e^{−γ*ϑ²(s* − s1 + 1)} + (√(s* + 2) + √m)^{m+1},

and Λ*, γ* have been defined in Lemmas 3.1 and 5.1 respectively.

The lower bound will not be useful if the exponent γ*ϑ²s1 − Λ*δ is not positive. We elaborate on this in §5.2. Also, like Theorem 4.1, the theorem says nothing about how large the bound is for a given problem. This is the job of Theorem 5.2.

5.2 Scaling and concentration around the MaxGEnt count vector

As we did in §4.3, we now investigate what happens to the lower bound of Theorem 5.1 when the problem data b is scaled by a factor c > 1. The end results are the concentration Theorems 5.2 and 5.3 below. Table 4.1 described how scaling the data affects the quantities appearing in the bound, except for Λ*, which is new to §5. Scaling b has the effect x* ↦ cx*, and from (2.4) in §2.4 we see that the Lagrange multipliers remain unchanged^{5.3}. Then the definition of Λ* in Lemma 3.1 shows that the end result of scaling is Λ* ↦ cΛ*. This and s1 ↦ cs1 imply that scaling by c multiplies the exponent γ*ϑ²s1 − Λ*δ in Theorem 5.1 by c. The effect of scaling on C2 and C4 is given by (4.18) and (4.20), and finally

  C′3(cs*, cs1, cs2) < c^{(m+1)/2} [ (√(s2 + 2) + √m)^{m+1} e^{−γ*ϑ²} e^{−cγ*ϑ²(s* − s1)} + (√(s* + 2) + √m)^{m+1} ]
    ≤ c^{(m+1)/2} [ (√(s2 + 2) + √m)^{m+1} e^{−γ*ϑ²(s* − s1 + 1)} + (√(s* + 2) + √m)^{m+1} ] ≝ c^{(m+1)/2} C″3,    (5.5)

where the 2nd inequality follows from c ≥ 1.
In conclusion, when the data b is scaled by the factor c > 1, Theorem 5.1 says that if x* ≥ 1, then

  #ν* / #B_{n1:n2}(δ, ϑ) ≥ (B0 / C″3) c^{−m} e^{(2γ*ϑ²s1 − Λ*δ)c},    (5.6)

where C″3 is defined in (5.5) and

  B0 ≝ ( (m + 1)Γ(m/2)e^{−m/12} / (2π^{m/2}) ) √s* e^{−(1/2) Σ_{i=1}^m 1/(x*_i − 1)} Π_{1 ≤ i ≤ m} χ*_i / √(x*_i + 1).    (5.7)

(5.6) and (5.7) are to be compared with (4.21). Recalling Remark 5.1, an important consequence of (5.6) is that if concentration is to occur, the tolerances δ and ϑ must satisfy

  ϑ² > (Λ* / (2γ*s1)) δ.    (5.8)

This can be ensured by choosing small enough δ for the given ϑ, or large enough ϑ for the given δ. (The results of this paper do not immediately translate to the frequency vector case, but (5.8) can be compared with the similar condition in Theorem IV.2 of [OG16].)

By (5.6), the concentration statement (5.3) will hold if the scaling factor c is such that

  (2γ*ϑ²s1 − Λ*δ)c − m ln c ≥ ln( C″3 / (εB0) ).    (5.9)

This inequality is of the same form as (4.22), and will hold for all c greater than the larger of the two roots of the equality version of it. As in §4.3, we also need ν* to be in the set A_{n1:n2}(δ, ϑ) of (5.2), more specifically in A_{n*}(δ, ϑ). For this, we must first have ν* ∈ C(δ); this is ensured by c ≥ c2, with c2 as in (4.23). Second, by the definition (5.1) of A_n(δ, ϑ), we need ‖ν* − x*‖_1 ≤ |n* − s*| + min(n*, s*)ϑ; by Proposition 3.4 this will hold if ϑ ≥ (3m/4 + 1)/(cs*).

^{5.3} This also follows from expression (A.9) in the proof of Lemma 3.1 for G* in terms of the multipliers, and G* ↦ cG*.
We have now established the desired analogue of Theorem 4.2, and proved the statement GC of §1, in terms of distance from the MaxGEnt vector x*:

Theorem 5.2 With the same conditions as in Theorem 5.1, suppose that the tolerances δ, ϑ satisfy (5.8), where Λ*, γ* have been defined in Lemmas 3.1 and 5.1. Let

  c1 ≝ (3m/4 + 1)/(ϑs*),  c2 ≝ 1/(δϑ_∞),

and given ε > 0, let c3 be the largest root c of the equality version of (5.9). Finally, define the concentration threshold ĉ ≝ max(c1, c2, c3). Then when the data b^E, b^I is scaled by any c ≥ ĉ, the MaxGEnt count vector ν* of Definition 3.1 belongs to the set A_{n1:n2}(δ, ϑ) of (5.2), specifically to A_{n*}(δ, ϑ), and is such that

  #ν* / #B_{n1:n2}(δ, ϑ) ≥ 1/ε  and  #A_{n1:n2}(δ, ϑ) / #(N_{n1:n2} ∩ C(δ)) ≥ 1 − ε.

The second inequality in the claim of the theorem follows from the first by (4.5) in §4, which holds whether the sets A_{n1:n2} and B_{n1:n2} are defined as they were in §4 or as they were defined here. As in Theorem 4.2, the constraint information A^E, b^E, A^I, b^I appears implicitly in Theorem 5.2, via s1, s2, and ϑ_∞. Bounds on the concentration threshold can be derived similarly to §4.3.1. Finally, Fig. 5.1 depicts the various sets involved in the definition of the threshold ĉ appearing in the theorem.

Figure 5.1: Concentration around ν* w.r.t. the ℓ1 norm. The MaxGEnt vector ν* has 1/ε times more realizations than the entire set B_{n1:n2}(δ, ϑ), shown in gray, where ‖ν − x*‖_1 > |n − s*| + min(n, s*)ϑ; in A_{n1:n2}(δ, ϑ), ‖ν − x*‖_1 ≤ |n − s*| + min(n, s*)ϑ. The relationship we show between ‖ν − x*‖_∞ ≤ δϑ_∞ and A_{n1:n2}(δ, ϑ) is not the only one possible; likewise for x* and A_{n*}(δ, ϑ).
From the definition of ĉ in Theorem 5.2 we see that as δ increases, the constants c2 and c3 behave in opposite ways: c2 decreases but c3 increases. If one cares only about the tolerances ε and ϑ, and does not care to specify a particular δ, this opens the possibility of reducing ĉ by choosing δ so as to minimize the larger of c2, c3:

Theorem 5.3 Given ε, ϑ, suppose that Λ* > mϑ_∞ and

  ϑ² < ( Λ* √(s*/m + 1) / (γ*ϑ_∞s1) ) (1/ε)^{1/m},

where the various quantities are as in Theorem 5.2. If so, the equation for δ

  (2γ*ϑ²s1/ϑ_∞)(1/δ) + m ln δ = ln( C″3 / (εB0) ) + Λ*/ϑ_∞ − m ln ϑ_∞

has a root δ0 ∈ (0, 2γ*ϑ²s1/Λ*], and we define

  ĉ ≝ max( 1/(δ0ϑ_∞), (3m/4 + 1)/(ϑs*) ).

Then when the data b^E, b^I is scaled by any c ≥ ĉ, the MaxGEnt count vector ν* of Definition 3.1 belongs to the set A_{n*}(δ0, ϑ) of (5.1), and is such that

  #ν* / #B_{n1:n2}(δ0, ϑ) ≥ 1/ε  and  #A_{n1:n2}(δ0, ϑ) / #(N_{n1:n2} ∩ C(δ0)) ≥ 1 − ε.

In this situation a simple lower bound on the concentration threshold ĉ is

  ĉ ≥ max( (H(χ*)/(2γ*)) (1/(ϑ_∞ϑ²)), 3m/(4s*ϑ) ),    (5.10)

where for the first expression we used the upper bound on δ0 and Λ* ≥ s*H(χ*). The ratio H(χ*)/γ* is small for imbalanced distributions χ*, e.g. those with a single dominant element, in which case γ* is large, and it approaches 2 ln m for perfectly balanced ones. The bound (5.10) says that ĉ increases with ϑ^−2, and this can be seen in Example 5.2 below.

5.3 Examples

The first two examples illustrate Theorems 5.2 and 5.3, while the third illustrates the removal of 0s from the solution mentioned in §2.4 and the 'boundary' case in which the MaxGEnt vector x* sums to the maximum allowable, s* = s2.

Example 5.1 We return to Example 4.1. Recall that

  s1 = 21.5, s2 = 37.5, x* = (6.591, 5.326, 13.26, 1.120, 2.253, 2.789), s* = 31.34, G* = 47.53.

We have ϑ_∞ = 2.9, γ* = 0.5, and Λ* = G* = 47.53. The constraint (5.8) on δ and ϑ is ϑ² > 1.864δ. This means that if we want small ϑ, we must have a correspondingly small δ, as we commented after (5.8). Table 5.1 lists various values of ĉ(δ, ε, ϑ) obtained from Theorem 5.2.

δ       ϑ      ĉ (ε = 10^−9)   ĉ (ε = 10^−15)
0.001   0.08   862.7           989.2
10^−4   0.08   3448            3448
10^−5   0.08   34483           34483
0.001   0.07   1322            1511
10^−4   0.07   3448            3448
10^−5   0.07   34483           34483
0.001   0.06   2392            2722
10^−4   0.06   3448            3448
10^−5   0.06   34483           34483
10^−4   0.05   3448            3448
10^−5   0.05   34483           34483
10^−5   0.03   34483           34483
10^−5   0.01   60376           67351
10^−5   0.008  111472          123967

Table 5.1: Scaling of the problem of Example 4.1 for the given δ, ε, ϑ. The threshold ĉ does not behave smoothly because of the max() in Theorem 5.2.

Example 5.2 Consider the same data as in Table 5.1, but with only ε, ϑ specified; we don't care about a particular δ, as long as it ensures that ν* ∈ A_{n*}(δ, ϑ). With δ chosen automatically by Theorem 5.3, Table 5.2 below shows that the concentration threshold ĉ is significantly reduced.

ϑ      ĉ (ε = 10^−9)   ĉ (ε = 10^−15)
0.08   704.4           793.4
0.07   933.5           1050
0.06   1292            1450
0.05   1896            2124
0.04   3032            3387
0.03   5548            6178
0.01   55345           60991
0.008  88189           97004

Table 5.2: The threshold ĉ for given ε, ϑ with optimal selection of δ = δ0. Compare with Table 5.1. The variation of ĉ with ϑ^−2 implied by the lower bound (5.10) is evident.

Example 5.3 Fig. 5.2 shows four cities connected by road segments. We assume that vehicles travelling from one city to another follow the most direct route, and that there is no traffic from a city to itself.

Figure 5.2: Four cities connected by (bidirectional) road segments. Arrows indicate the constrained directions.

The number of vehicles in city i is known, which puts upper bounds on the number that leaves each city; also, from observations we have lower bounds on the number of vehicles on the road segments 2 → 3, 3 → 1, and 3 → 4.
From this information we want to infer how many vehicles travel from city i to city j, i.e. infer the 4 × 4 matrix of counts

  v = [ 0    v12  v13  v14
        v21  0    v23  v24
        v31  v32  0    v34
        v41  v42  v43  0   ].

So suppose the constraints on v are

  v_ii = 0,  Σ_j v_ij ≤ 100, 120, 80, 90,  v23 + v24 ≥ 80,  v31 + v41 ≥ 59,  v14 + v24 + v34 ≥ 70,

where the last three reflect the "direct route" assumption. Then we have s1 = 139, s2 = 390. We define the 12-element vector x for the MaxGEnt method as (v12, v13, v14, v21, v23, v24, ..., v43). [Note that if we knew that all vehicles in a city leave the city, then we could define a frequency matrix by dividing the matrix v by 100 + ··· + 90 and thus formulate a MaxEnt problem.] The MaxGEnt solution is

  v* = [ 0       33.333  33.333  33.333
         40.0    0       40.0    40.0
         27.765  26.118  0       26.118
         31.235  29.382  29.382  0      ]

with sum s* = s2 = 390 and maximum generalized entropy G* = 964.62, Λ* = 971.84, γ* = 0.5. So here we have the boundary case in which the sum of x* is the maximum possible. (Problems involving matrices subject to constraints of the above type, for which analytical solutions are possible, were studied in [Oik12].)

Applying Theorem 5.3 with ϑ = 0.04, ε = 10^−15, the 'optimal' δ is δ0 ≈ 2.02 · 10^−5 and yields the threshold ĉ = 837.9. Using a scaling factor c = 838 on v* results in the integral matrix

  ν* = [ 0      27394  27393  27393
         33520  0      33520  33520
         23267  21887  0      21887
         26175  24622  24622  0     ]

with sum n* = 326820 ∈ [116842, 326820]. This matrix has at least 10^15 times the number of realizations of the entire set B_{116842:326820}(2.02 · 10^−5, 0.04) defined in (5.2). To gain some appreciation of what this means: it is not easy to determine the size of this set, but just the particular subset of it B_{326819}(0, 0.04) = {ν ∈ C(0), ‖ν − x*‖_1 ≥ 13073} contains at least 2.012 · 10^37 elements^{5.4}. For comparison, the whole of C(0) has 2.394 · 10^54 elements. (We compute these numbers with the barvinok software, [VWBC05]. For B_{326819} we get a lower bound by using the stronger constraint |ν1 − x*1| ≥ 13072 in place of ‖ν − x*‖_1 ≥ 13073, which is harder to express.)

^{5.4} We have |326819 − 838 · 390| + min(326819, 838 · 390) · 0.04 = 13073.8.

6 Conclusion

We demonstrated an extension of the phenomenon of entropy concentration, hitherto known to apply to probability or frequency vectors, to the realm of count vectors, whose elements are natural numbers. This required introducing a new entropy function in which the sum of the count vector plays a role. Still, like the Shannon entropy, this generalized entropy can be viewed combinatorially as an approximation to the log of a multinomial coefficient. Our derivations are carried out in a fully discrete, finite, non-asymptotic framework, do not involve any probabilities, and all of the objects about which we make any claims are fully constructible. This discrete, combinatorial setting is an attempt to reduce the phenomenon of entropy concentration to its essence. We believe that this concentration phenomenon supports viewing the maximization of our generalized entropy as a compatible extension of the well-known MaxEnt method of inference.

Acknowledgments

Thanks to Peter Grünwald for his comments on a previous version of the manuscript, and for many useful discussions on the subject.

A Proofs

Proof of Proposition 2.1 Given a y ≥ x, y can be reached from x by a sequence of steps each of which increases a single coordinate, and the value of G increases at each step because all its partial derivatives are positive.
(The derivatives are 0 only at points x that consist of a single non-zero element; a direct proof can be given for that case.) For a more formal proof, we note that the directional derivative G′(ξ; u) of G at any point ξ is ≥ 0 in any direction u ≥ 0:

  G′(ξ; u) = ∇G(ξ) · u = Σ_i u_i ln( (ξ1 + ··· + ξ_m)/ξ_i ).

So any move away from ξ in a direction u ≥ 0 will increase G. More precisely, by the mean value theorem, for any y that can be written as x + u for some u ≥ 0, there is a ξ on the line segment from x to x + u s.t. G(x + u) − G(x) = ∇G(ξ) · u ≥ 0. Finally, if some element of u is strictly positive, then ∇G(ξ) · u > 0.

Proof of Proposition 2.2 1. To establish concavity it suffices to show that ∇²G(x), the Hessian of G, is negative semi-definite. We find

  ∇²G(x) = (1/(x1 + ··· + x_m)) U_m − diag(1/x1, ..., 1/x_m),    (A.1)

where U_m is an m × m matrix all of whose entries are 1. Given x, for an arbitrary y = (y1, ..., y_m) we must have yᵀ∇²G(x)y ≤ 0. To show this, first write ∇²G(x) as

  ∇²G(x) = (1/(x1 + ··· + x_m)) [ U_m − diag( (x1 + ··· + x_m)/x1, ..., (x1 + ··· + x_m)/x_m ) ].

Now define ξ_i = x_i/(x1 + ··· + x_m). Then yᵀ∇²G(x)y ≤ 0 is equivalent to

  (y1 + ··· + y_m)² ≤ y1²/ξ1 + ··· + y_m²/ξ_m,    (A.2)

where the ξ_i are > 0 and sum to 1. But for fixed y, y1²/ξ1 + ··· + y_m²/ξ_m is a convex function of ξ = (ξ1, ..., ξ_m) over the domain ξ ≻ 0, and its minimum under the constraint ξ1 + ··· + ξ_m = 1 occurs at ξ_i = y_i/(y1 + ··· + y_m). So the least value of the r.h.s. of (A.2) as a function of ξ1, ..., ξ_m is (y1 + ··· + y_m)², and this establishes (A.2). For a given x we see that yᵀ∇²G(x)y is 0 exactly at points y such that for all i, x_i/Σ_j x_j = y_i/Σ_j y_j, i.e. iff y = cx for some c ∈ R.
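The two claims just established, negative semi-definiteness of the Hessian (A.1) and the fact that the quadratic form vanishes exactly in the scaling direction y = cx, are easy to check numerically; a minimal sketch:

```python
import random

# Numeric check of (A.1): for H = (1/s) U_m - diag(1/x_i), the quadratic form
# y^T H y = (sum_i y_i)^2 / s - sum_i y_i^2 / x_i is never positive, and is
# exactly 0 in the scaling direction y = c*x.
random.seed(0)
m = 5
x = [random.uniform(0.5, 10.0) for _ in range(m)]
s = sum(x)

def quad_form(y):
    return sum(y) ** 2 / s - sum(yi * yi / xi for yi, xi in zip(y, x))

# random directions: the form is never positive
assert all(quad_form([random.gauss(0, 1) for _ in range(m)]) <= 1e-12
           for _ in range(1000))
# the scaling direction y = 3x gives (3s)^2/s - 9*sum_i x_i = 0 exactly
print(abs(quad_form([3.0 * xi for xi in x])) < 1e-9)  # True
```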
2. The fact that the Hessian fails to be negative definite does not imply that G is not strictly concave; negative definiteness is a sufficient, but not a necessary, condition for strict concavity. It can be seen that G is not strictly concave because of the scaling or homogeneity property 2.1 in §2.1: consider the distinct points x and y = 2x; strict concavity would require G((x + y)/2) > G(x)/2 + G(y)/2, which is not true.

3. Proposition 1.1.2 in Chapter IV of [HUL96] says that a function F(x) is strongly convex on a convex set C with modulus γ > 0 iff the modified function F(x) − (γ/2)‖x‖_2² is convex on C. Applying this to our function G, by the proof carried out in part 1, we would have to show that given any x ∈ R^m_+, for all y ∈ R^m_+

  (y1 + ··· + y_m)² − (y1²/ξ1 + ··· + y_m²/ξ_m) + γ(y1² + ··· + y_m²) ≤ 0

for the chosen modulus γ > 0. But for any x and any γ > 0, this condition is false at the point y = γx.

4. By Definition 1.1.1 in [HUL96] Ch. V, §1.1, a convex and positively homogeneous function F defined over the extended real numbers R ∪ {±∞} is sublinear. If we define G(·) over all of R^m by setting G(x1, ..., x_m) = −∞ if any x_i is negative, the above statement applies to F = −G. Finally, a sublinear function has the property F(αx + βy) ≤ αF(x) + βF(y).

Proof of Lemma 2.1 By (2.2), if ‖x − y‖_∞ ≤ ζ we have G(y) ≥ G(x − ζ). Now we expand G(x − ζ) in a Taylor series around x. Since G(·) is a twice-differentiable function on the open set x ≻ ζ, if x, x′ are two points in this set, then there is an x̃ = (1 − α)x + αx′ with α ∈ [0, 1] such that

  G(x′) = G(x) + ∇G(x) · (x′ − x) + (1/2)(x′ − x)ᵀ · ∇²G(x̃) · (x′ − x)

(Theorem 12.14 of [Apo74]). Set x′ = x − ζ, so x̃ = x − αζ. Noting that

  ∇G(x) = ( ln((x1 + ··· + x_m)/x1), ..., ln((x1 + ··· + x_m)/x_m) ),
  ∇²G(x̃) = (1/(x̃1 + ··· + x̃_m)) U_m − diag(1/x̃1, ..., 1/x̃_m),
  (x′ − x)ᵀ · ∇²G(x̃) · (x′ − x) = ζ² ( m²/(x̃1 + ··· + x̃_m) − Σ_i 1/x̃_i ),

where the second equality is (A.1) in the proof of Proposition 2.2, we find that for any x ≻ ζ

  G(x − ζ) = G(x) − ζ Σ_i ln((x1 + ··· + x_m)/x_i) − (1/2)ζ² ( Σ_i 1/(x_i − αζ) − m/(‖x‖_1/m − αζ) ),    (A.3)

where we know that the sum of the ζ and the ζ² terms on the right is negative. [We chose to expand around the point x − ζ because then the signs of the terms ∇G(x) · ζ and ζᵀ · ∇²G(x̃) · ζ are known.]

Now for fixed x define the function

  g(α, ζ) ≝ Σ_{1 ≤ i ≤ m} 1/(x_i − αζ) − m/(‖x‖_1/m − αζ),  α ∈ [0, 1], ζ ≥ 0, x ≻ ζ.    (A.4)

This function is ≥ 0 and increasing in α. To see that g(α, ζ) ≥ 0, set u_i = x_i − αζ so that ‖x‖_1/m − αζ becomes the arithmetic mean ū of the u_i; then use a fundamental property of the power means: for any u ≻ 0, and any weights w_i summing to 1,

  ( Σ_i w_i u_i^{−k} )^{−1/k} ≤ Σ_i w_i u_i,  k ≥ 1    (A.5)

(see [HLP97], Theorem 16). The desired result follows by choosing all w_i = 1/m. To show that g(α, ζ) increases with α,

  ∂g/∂α = Σ_i ζ/(x_i − αζ)² − mζ/(‖x‖_1/m − αζ)²,

and this is always ≥ 0 by the same power-means technique (A.5). [Similarly, ∂²g/∂α² ≥ 0, so g(α, ζ) is a convex function of α.] We therefore see that for any ζ ≥ 0,

  min_{α ∈ [0,1]} g(α, ζ) = g(0, ζ)  and  max_{α ∈ [0,1]} g(α, ζ) = g(1, ζ).    (A.6)

It now follows from G(y) ≥ G(x − ζ), (A.3), and (A.6) that for any x ≻ ζ and any y s.t. ‖y − x‖_∞ ≤ ζ,

  G(y) ≥ G(x) − ζ Σ_i ln(‖x‖_1/x_i) − (1/2)ζ² ( Σ_i 1/(x_i − ζ) − m/(‖x‖_1/m − ζ) ).

This establishes the lemma. The coefficient of ζ² above is ≥ 0 and equals 0 iff all elements of x are equal ([HLP97], Theorem 16).

Proof of Proposition 2.3 1. The proof is by contradiction.
Assume that u, v are two (distinct) global maximizers of G over C(0). It is not possible that both of them have the same sum s: under the condition Σ_i x_i = s, we have G(x) = s ln s − Σ_i x_i ln x_i by (2.1). But the Shannon entropy extended to all x ≥ 0 is strictly concave, so G(x) has a unique global maximizer over the convex domain C(0) ∩ {x | Σ_i x_i = s}.

Next let u and v have different sums. We will derive a condition necessary for both u and v to maximize G and show that it is contradicted by the scaling property of G. Under our assumption that G(u) = G(v) = G*, the concavity of G implies that any point on the line segment between u and v must yield the same value, G*, of G. Thus the function f(α) = G(αu + (1 − α)v), α ∈ [0, 1], must be constant for all α. Therefore f′(α) must be 0 for all α ∈ (0, 1). Rather than f′(α), it is easier to deal with the expression for f″(α). The constancy of f′(α) implies that we must have f″(α) ≡ 0:

  f″(α) = (Σ_i u_i − Σ_i v_i)² / (αΣ_i u_i + (1 − α)Σ_i v_i) − Σ_i (u_i − v_i)² / (αu_i + (1 − α)v_i).

We will consider the condition f″(1/2) = 0, and set u_i − v_i = z_i, u_i + v_i = w_i. Then we have

  f″(1/2) = (Σ_i z_i)²/Σ_i w_i − Σ_i z_i²/w_i = 0.

Further setting q_i = w_i/Σ_i w_i, since Σ_i w_i > 0 the above condition is equivalent to (Σ_i z_i)² − Σ_i (z_i²/q_i) = 0. But the l.h.s. is a strictly concave function of q, hence over the convex set q ≥ 0, Σ_i q_i = 1 it attains its global maximum of 0 at a unique point q̂, where q̂_i = z_i/Σ_j z_j.

So we have shown that f″(1/2) = 0 ⇒ w_i/Σ_j w_j = z_i/Σ_j z_j for all i. This is equivalent to

  ∀i,  (u_i + v_i)/(Σ_j u_j + Σ_j v_j) = (u_i − v_i)/(Σ_j u_j − Σ_j v_j),  or  u_i/v_i = Σ_j u_j / Σ_j v_j.    (A.7)

This condition is necessary for f′(α) to be constant, in particular 0, hence for f(α) to be constant.
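The expression for f″(α) above can be checked numerically against a finite-difference second derivative of f(α) = G(αu + (1 − α)v), using G(x) = Σ_i x_i ln(s/x_i) with s = Σ_i x_i; a minimal sketch with random u, v:

```python
import math, random

# Check the closed form of f''(alpha) used in the proof of Proposition 2.3
# against a central finite difference of f(alpha) = G(alpha*u + (1-alpha)*v).
random.seed(1)
m = 4
u = [random.uniform(1.0, 5.0) for _ in range(m)]
v = [random.uniform(1.0, 5.0) for _ in range(m)]

def G(x):
    s = sum(x)
    return sum(xi * math.log(s / xi) for xi in x)

def f(a):
    return G([a * ui + (1 - a) * vi for ui, vi in zip(u, v)])

def f2_formula(a):
    # (sum u - sum v)^2 / (a*sum u + (1-a)*sum v) - sum_i (u_i-v_i)^2 / (a*u_i + (1-a)*v_i)
    Z = sum(u) - sum(v)
    sa = a * sum(u) + (1 - a) * sum(v)
    return Z * Z / sa - sum((ui - vi) ** 2 / (a * ui + (1 - a) * vi)
                            for ui, vi in zip(u, v))

a, h = 0.5, 1e-4
f2_numeric = (f(a + h) - 2 * f(a) + f(a - h)) / (h * h)
print(abs(f2_numeric - f2_formula(a)) < 1e-3)  # True: formula matches finite differences
```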
Finally, we can assume w.l.o.g. that $u$ and $v$ are such that $\sum_i u_i > \sum_i v_i$, and then (A.7) implies that there is some $c > 1$ s.t. $u = cv$. But then the scaling property 2.1 of §2.1 says that $G(u) = cG(v) > G(v)$, contradicting our initial assumption that both $u$ and $v$ maximize $G$.

2. If $\hat x$ were such a point, we would have $G(\hat x) > G(x^*)$ by Proposition 2.1, and this would contradict that $x^*$ is the global maximum.

Proof of Proposition 3.1

Consider the equality constraints first. Writing them as $|A^E y - b^E| \leq \delta|b^E|$, we see that they will be satisfied if $\max_i |A^E y - b^E|_i \leq \delta\min_i|b^E_i|$, i.e. if $\|A^E y - b^E\|_\infty \leq \delta|b^E|_{\min}$. Now for any $y\in\mathbb{R}^m$, $A^E y - b^E = A^E(y-x)$, since $x\in\mathcal{C}(0)$. Thus $\|A^E y - b^E\|_\infty = \|A^E(y-x)\|_\infty$. But $\|A^E(y-x)\|_\infty \leq |||A^E|||_\infty\,\|y-x\|_\infty$, where the (rectangular) matrix norm $|||\cdot|||_\infty$ is defined as the largest of the $\ell_1$ norms of the rows.^{A.1} Therefore, to ensure $\|A^E y - b^E\|_\infty \leq \delta|b^E|_{\min}$ it suffices to require $\|y-x\|_\infty \leq \delta|b^E|_{\min}/|||A^E|||_\infty$, as claimed.

Turning to the inequality constraints, write them as $A^I(x + y - x) \leq b^I + \delta|b^I|$, or $A^I x - b^I \leq A^I(x-y) + \delta|b^I|$. Since $A^I x - b^I \leq 0$, this inequality will be satisfied if $A^I(y-x) \leq \delta|b^I|$. This will certainly hold if $\max_i (A^I(y-x))_i \leq \delta\min_i|b^I_i|$, which is implied by $\|A^I(y-x)\|_\infty \leq \delta|b^I|_{\min}$. In turn, this will hold if we require $|||A^I|||_\infty\,\|y-x\|_\infty \leq \delta|b^I|_{\min}$.

For both types of constraints the final condition is stronger than necessary, but more so in the case of inequalities. Finally, part 2 of the proposition follows from part 1, since $\|[x]-x\|_\infty \leq 1/2$.

Proof of Proposition 3.2

From (2.4) we can write the elements of $x^*$ in the form $x^*_j = (\sum_i x^*_i)\,E_j$, where $E_j$ is an expression involving the vectors $\lambda^E, \lambda^{BI}$ and the matrices $A^E, A^{BI}$.
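Stepping back to Proposition 3.1 for a moment: the sufficiency of the condition $\|y-x\|_\infty \leq \delta|b^E|_{\min}/|||A^E|||_\infty$ for the equality constraints can be spot-checked numerically. The script below, with randomly generated data, is our illustration:

```python
import random

def row_l1_norm(A):
    """The rectangular matrix norm |||A|||_inf: the largest l1 norm of a row of A."""
    return max(sum(abs(a) for a in row) for row in A)

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

random.seed(1)
for _ in range(500):
    m = 4                                  # number of variables
    A = [[random.uniform(-2, 2) for _ in range(m)] for _ in range(2)]
    x = [random.uniform(1, 5) for _ in range(m)]
    b = matvec(A, x)                       # x satisfies the equalities A x = b exactly
    bmin = min(abs(bi) for bi in b)
    if bmin < 1e-6:
        continue                           # skip degenerate right-hand sides
    delta = 0.05
    radius = delta * bmin / row_l1_norm(A)
    y = [xi + random.uniform(-radius, radius) for xi in x]   # ||y - x||_inf <= radius
    resid = max(abs(r - bi) for r, bi in zip(matvec(A, y), b))
    assert resid <= delta * bmin + 1e-12   # constraints hold within tolerance delta
```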
The elements of $\lambda^E, \lambda^{BI}$ are determined by substituting the $x^*_j$ into the constraints. Thus the $k$th equality constraint leads to an equation of the form

$$\Big(\sum_i x^*_i\Big)\,(\text{expression involving the } E_j) = b^E_k \tag{A.8}$$

and similarly for each binding inequality constraint. But the solution of a system of equations of the form (A.8) is unchanged if the $x^*_i$ on the l.h.s. and the $b^E$ and $b^{BI}$ on the r.h.s. are both multiplied by the same constant $c > 0$. This establishes the first claim. The claim about the maximum of $G$ follows from property 2.1 of $G$ in the list of §2.

Coming to the bounds on $x_1+\cdots+x_m$, the fact that they scale with $b$ is just a property of general linear programs: if $y$ is the solution of the linear program $\min_{x\in\mathbb{R}^m} \sum_i \alpha_i x_i$ subject to $Ax\leq b$, then $cy$ is the solution of $\min_{x\in\mathbb{R}^m} \sum_i \alpha_i x_i$ subject to $Ax\leq cb$. Similarly for the maximum.

^{A.1} For any rectangular matrix $A$ and compatible vector $x$, $\|Ax\|_\infty \leq |||A|||_\infty\|x\|_\infty$ holds because the l.h.s. is $\max_i |A_{i\cdot}\,x|$, and this is $\leq \max_i \sum_j |a_{ij}x_j| \leq \max_i \|x\|_\infty\|A_{i\cdot}\|_1 = \|x\|_\infty\,|||A|||_\infty$.

Proof of Proposition 3.3

For part 1, given $A^E x = b^E$, we have $\|A^E x\|_1 = \|b^E\|_1$. Now, omitting the superscript to simplify the notation,

$$\begin{aligned} \|Ax\|_1 &= |a_{11}x_1+\cdots+a_{1m}x_m| + |a_{21}x_1+\cdots+a_{2m}x_m| + \cdots + |a_{\ell 1}x_1+\cdots+a_{\ell m}x_m| \\ &\leq \big(|a_{11}|+|a_{21}|+\cdots+|a_{\ell 1}|\big)|x_1| + \big(|a_{12}|+|a_{22}|+\cdots+|a_{\ell 2}|\big)|x_2| + \cdots \leq |||A^T|||_\infty\,\|x\|_1. \end{aligned}$$

Hence $|||(A^E)^T|||_\infty\,\|x\|_1 \geq \|b^E\|_1$, and since $x\geq 0$, $\|x\|_1$ is simply the sum of the $x_i$.

For part 2, any $x\in\mathbb{R}^m$ satisfying $A^E x = b^E$, $A^I x \leq b^I$ will satisfy $A^E x \leq b^E$, $A^I x \leq b^I$ as well. Divide each inequality in this system by the smallest non-zero element of its l.h.s. if that element is $< 1$; otherwise leave the inequality as is. Since each $x_i$ appears in some constraint, if we add all the above inequalities side by side, the resulting l.h.s.
will be $\geq x_1+\cdots+x_m$, and the r.h.s. will be $\sum_i b^E_i/\alpha^E_i + \sum_i b^I_i/\alpha^I_i$, where the $\alpha_i$ are defined as in the Proposition.

Proof of Proposition 3.4

First, the adjustment performed on $\tilde\nu$ is always possible: if $d < 0$ there must be at least $|d|$ elements of $n^*\chi^*$ that were rounded to their floors, and if $d > 0$, to their ceilings. It is clear that the adjustment makes $\nu^*$ sum to $n^*$. Now suppose that $k\in\mathbb{N}$ and $\chi$ is an $m$-element density vector; then $k\chi$ sums to $k$, and the sum of the rounded version $[k\chi]$ differs from $k$ by no more than $m/2$. Thus $d\leq m/2$.

For the bound on $\|\nu^*-x^*\|_1$, we first show that $\|\nu^*-n^*\chi^*\|_1 \leq 3m/4$. The adjustment of $\tilde\nu$ causes $d$ of the elements of $\nu^*$ to differ from the corresponding elements of $n^*\chi^*$ by $< 1$, and the rest to differ by $\leq 1/2$, so $\|\nu^*-n^*\chi^*\|_1 \leq \max_d\,(d + (m-d)/2) \leq 3m/4$. Next,

$$\|\nu^*-x^*\|_1 = \|\nu^*-s^*\chi^*\|_1 \leq \|\nu^*-n^*\chi^*\|_1 + \|n^*\chi^*-s^*\chi^*\|_1 \leq 3m/4 + |n^*-s^*|,$$

since $\chi^*$ sums to 1, and lastly $|n^*-s^*| < 1$ by (3.2). That $\|\nu^*-x^*\|_\infty \leq 1$ follows from this last statement and the fact that $\nu^*$ sums to $n^*$. Finally, the bound on $\|f^*-\chi^*\|_1$ follows from that on $\|\nu^*-n^*\chi^*\|_1$.

Proof of Lemma 3.1

For brevity, in this proof we denote $G^*(0), x^*(0), \chi^*(0)$ simply by $G^*, x^*, \chi^*$. Given the vector $x^*$, set $s^* = x^*_1+\cdots+x^*_m$. Then from (2.4), $x^*_j = s^*\,e^{-(\lambda^E\cdot A^E_{\cdot j} + \lambda^{BI}\cdot A^{BI}_{\cdot j})}$. Therefore

$$\sum_i x^*_i \ln x^*_i = \sum_i x^*_i\big(\ln s^* - (\lambda^E\cdot A^E_{\cdot i} + \lambda^{BI}\cdot A^{BI}_{\cdot i})\big) = s^*\ln s^* - (\lambda^E\cdot b^E + \lambda^{BI}\cdot b^{BI}),$$

since $x^*$ satisfies the equalities and the binding inequalities. Substituting the above in (2.1), the maximum generalized entropy can be expressed in terms of the Lagrange multipliers and the data as

$$G^* = \lambda^E\cdot b^E + \lambda^{BI}\cdot b^{BI}.$$
(A.9)

This implies that the quantity $\Lambda^*$ is at least as large as $G^*$, as claimed.

Now if $\sigma$ is an arbitrary sequence with count vector $\nu$, its probability under $\chi^*$ is $\Pr_{\chi^*}(\sigma) = (\chi^*_1)^{\nu_1}\cdots(\chi^*_m)^{\nu_m}$, where $\chi^*_j = e^{-(\lambda^E\cdot A^E_{\cdot j} + \lambda^{BI}\cdot A^{BI}_{\cdot j})}$. Therefore $\Pr_{\chi^*}(\sigma) = e^{-\xi(\nu)}$, where

$$\xi(\nu) = \sum_i \lambda^E_i\,(A^E_{i\cdot}\cdot\nu) + \sum_i \lambda^{BI}_i\,(A^{BI}_{i\cdot}\cdot\nu). \tag{A.10}$$

The rest of the proof is analogous to that of Proposition II.2 in [OG16]. If $\nu$ is in $\mathcal{C}(\delta)$, then

$$b^E_i - \delta|\beta^E_i| \leq A^E_{i\cdot}\cdot\nu \leq b^E_i + \delta|\beta^E_i|, \qquad A^{BI}_{i\cdot}\cdot\nu \leq b^{BI}_i + \delta|\beta^{BI}_i|.$$

Therefore from (A.10), noting that $\lambda^{BI}\geq 0$ but the $\lambda^E_i$ can be positive or negative,

$$\max_{\nu\in\mathcal{C}(\delta)} \xi(\nu) \leq \lambda^E\cdot b^E + (|\lambda^E|\cdot|\beta^E|)\,\delta + \lambda^{BI}\cdot(b^{BI} + \delta|\beta^{BI}|),$$
$$\min_{\nu\in\mathcal{C}(\delta)} \xi(\nu) \geq \lambda^E\cdot b^E - (|\lambda^E|\cdot|\beta^E|)\,\delta + \min_{\nu\in\mathcal{C}(\delta)} \textstyle\sum_i \lambda^{BI}_i (A^{BI}_{i\cdot}\cdot\nu).$$

(The $|\cdot|$ around $\lambda^E$ cannot be removed.) Using (A.9) in the above,

$$\max_{\nu\in\mathcal{C}(\delta)} \xi(\nu) \leq G^* + \big(|\lambda^E|\cdot|\beta^E| + \lambda^{BI}\cdot|\beta^{BI}|\big)\,\delta,$$
$$\min_{\nu\in\mathcal{C}(\delta)} \xi(\nu) \geq G^* - (|\lambda^E|\cdot|\beta^E|)\,\delta + \min_{\nu\in\mathcal{C}(\delta)} \textstyle\sum_i \lambda^{BI}_i (A^{BI}_{i\cdot}\cdot\nu - b^{BI}_i) = G^* - (|\lambda^E|\cdot|\beta^E|)\,\delta - \Delta(\mathcal{C}(\delta)), \tag{A.11}$$

where

$$\Delta(\mathcal{C}(\delta)) \triangleq \max_{\nu\in\mathcal{C}(\delta)} \sum_i \lambda^{BI}_i\,\big(b^{BI}_i - A^{BI}_{i\cdot}\cdot\nu\big).$$

Finally, for any p.d. $p$ and any $n$-sequence $\sigma$ with count vector $\nu$, $\Pr_p(\sigma)$ is given by the expression in property 2.1 of §2.1. Comparing that with (A.10), $\xi(\nu) = G(\nu) + nD(f\|\chi^*)$, so by using (A.11),

$$G^* - (|\lambda^E|\cdot|\beta^E|)\,\delta - \Delta(\mathcal{C}(\delta)) \leq G(\nu) + nD(f\|\chi^*) \leq G^* + \big(|\lambda^E|\cdot|\beta^E| + \lambda^{BI}\cdot|\beta^{BI}|\big)\,\delta,$$

where $f = \nu/n$, and the claim of the lemma follows.

Proof of inequality (4.15)

Let $y = \sqrt{n/2}$. The sum $\sum_{k=1}^m \binom{m}{k}\,y^k/\Gamma(k/2)$ can be found in closed form by noticing that if it is split over even and odd $k$, each of the two sums is hypergeometric. However, the resulting expression is too complicated for our purposes.
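The tractable bound derived next, $(n/2)^{m/2}(1+\sqrt{m/n})^m/\Gamma(m/2)$, can be compared with the sum numerically before working through the derivation. The following check, with our choice of test values, is an addition:

```python
import math

def lhs(m, n):
    """The sum in (4.15): sum_{k=1}^m C(m,k) y^k / Gamma(k/2), with y = sqrt(n/2)."""
    y = math.sqrt(n / 2)
    return sum(math.comb(m, k) * y**k / math.gamma(k / 2) for k in range(1, m + 1))

def rhs(m, n):
    """The tractable upper bound derived in the proof."""
    return (n / 2) ** (m / 2) * (1 + math.sqrt(m / n)) ** m / math.gamma(m / 2)

for m in range(2, 8):
    for n in (10, 100, 1000, 10000):
        assert lhs(m, n) < rhs(m, n)
# for fixed m, the ratio of sum to bound tends to 1 as n grows
assert lhs(3, 10**6) / rhs(3, 10**6) > 0.99
```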
We will instead obtain a tractable bound that matches the highest power of $y$ in the sum, i.e. $y^m/\Gamma(m/2)$. We need an auxiliary fact relating $\Gamma(k/2)$ for $k < m$ to $\Gamma(m/2)$. From Gautschi's inequality for the gamma function (see [OLBC10], 5.6.4) it follows that $\Gamma((\mu-1)/2) \geq \Gamma(\mu/2)/\sqrt{\mu/2}$ for any $\mu > 1$. Applying this recursively, we find that for $k\geq 1$

$$\Gamma\Big(\frac{m-k}{2}\Big) \;\geq\; \frac{2^{k/2}\,\Gamma(m/2)}{\big(m(m-1)\cdots(m-k+1)\big)^{1/2}} \;\geq\; \frac{2^{k/2}\,\Gamma(m/2)}{m^{k/2}\,e^{-k(k-1)/(4m)}}, \tag{A.12}$$

where the second inequality follows by using $1-z < e^{-z}$, $z < 1$, in the denominator of the first. Now pulling out the last term of our sum, reversing the order of the other terms, and applying (A.12) to each term, we get

$$\begin{aligned} \sum_{k=1}^m \binom{m}{k}\frac{y^k}{\Gamma(k/2)} &< \frac{y^m}{\Gamma(m/2)} + \frac{1}{\Gamma(m/2)}\sum_{k=1}^{m-1}\binom{m}{m-k}\Big(\frac{m}{2}\Big)^{k/2} y^{m-k}\, e^{-\frac{k(k-1)}{4m}} \\ &= \frac{y^m}{\Gamma(m/2)} + \frac{y^m}{\Gamma(m/2)}\sum_{k=1}^{m-1}\binom{m}{k}\Big(\frac{m}{2y^2}\,e^{-\frac{k-1}{2m}}\Big)^{k/2} \\ &< \frac{y^m}{\Gamma(m/2)} + \frac{y^m}{\Gamma(m/2)}\Big(\big(1+\sqrt{m/(2y^2)}\big)^m - 1\Big) \;=\; \frac{(n/2)^{m/2}}{\Gamma(m/2)}\big(1+\sqrt{m/n}\big)^m, \end{aligned}$$

where in going from the 2nd to the 3rd line we ignored the exponential factor and the last term in the expansion of $\big(1+\sqrt{m/(2y^2)}\big)^m$, and in the last step we substituted $y = \sqrt{n/2}$. The ratio of the sum to this last expression tends to 1 as $n\to\infty$.

Proof of inequality (4.26)

The first term is an upper bound on $c_1$. (4.22) is an inequality of the type $x \geq \alpha\ln x + \beta$, with $\alpha,\beta > 0$. We will show that if $\alpha+\beta > 1$, this inequality is satisfied by $x = 2\alpha\ln(\alpha+\beta) + \beta$. [This expression is motivated by the method of successive substitutions: with $x_0 = \beta$ we get $x_2 = \alpha\ln(\alpha\ln\beta+\beta) + \beta$, but this satisfies the inequality only if $\beta < 1$.] Substituting into the inequality, we see that it reduces to

$$(\alpha+\beta)^2 \geq 2\alpha\ln(\alpha+\beta) + \beta \;\Longleftrightarrow\; \alpha+\beta \geq \frac{\alpha}{\alpha+\beta}\,2\ln(\alpha+\beta) + \frac{\beta}{\alpha+\beta}.$$

Therefore this will hold if $\alpha+\beta \geq \max\big(2\ln(\alpha+\beta),\,1\big)$.
Now we have assumed that $\alpha+\beta > 1$, and $x \geq 2\ln x$ is always true for $x > 0$, so our claim is established. Turning to the case $\alpha+\beta < 1$, we can suppose that $\alpha < 1$, otherwise we fall into the case $\alpha+\beta > 1$. Then it suffices to find an $x$ that satisfies $x \geq \ln x + \beta$, and that is so for $x = 1.5\beta + \ln\beta$.

The third term in (4.26) is an upper bound on $c_3$. Write (4.24) as $\frac{1}{2}\Sigma_1 - c\,\Sigma_2 \leq c^2\eta\,G^*$. This will hold if $c \geq \big(\sqrt{\Sigma_2^2 + 2\Sigma_1\eta G^*} - \Sigma_2\big)/(2\eta G^*)$, so the r.h.s. can be taken to be $c_3$. If $a, b > 0$, which is guaranteed by our assumption that $x^* > 1$, then $\sqrt{a+b} < \sqrt{a}+\sqrt{b}$, so

$$\frac{\sqrt{2\Sigma_1\eta G^*}}{2\eta G^*} = \sqrt{\frac{\sum_{i=1}^m 1/(x^*_i-1)}{2\eta G^*}}$$

is an upper bound on $c_3$.

Proof of Proposition 5.1

To ease the notation, let $\|x\| = s$, $\|y\| = t$. First we show that

$$\Big\|\frac{x}{s} - \frac{y}{t}\Big\| \leq \vartheta \;\Longrightarrow\; \big|\,\|x-y\| - |s-t|\,\big| \leq \min(s,t)\,\vartheta. \tag{A.13}$$

We have

$$\begin{aligned} \Big\|\frac{x}{s}-\frac{y}{t}\Big\| \leq \vartheta &\Longleftrightarrow \Big\|\Big(\frac{x}{s}-\frac{y}{s}\Big) + \Big(\frac{y}{s}-\frac{y}{t}\Big)\Big\| \leq \vartheta \;\Longrightarrow\; \Big|\,\Big\|\frac{x}{s}-\frac{y}{s}\Big\| - \Big\|\frac{y}{s}-\frac{y}{t}\Big\|\,\Big| \leq \vartheta \\ &\Longleftrightarrow \Big|\,\frac{1}{s}\|x-y\| - \Big\|\Big(\frac{1}{s}-\frac{1}{t}\Big)y\Big\|\,\Big| \leq \vartheta \;\Longleftrightarrow\; \Big|\,\frac{1}{s}\|x-y\| - \Big|\frac{1}{s}-\frac{1}{t}\Big|\,t\,\Big| \leq \vartheta \\ &\Longleftrightarrow \big|\,\|x-y\| - |s-t|\,\big| \leq s\vartheta. \end{aligned}$$

Exchanging $x$ with $y$ and $s$ with $t$ in this derivation, it also follows that $\|x/s - y/t\| \leq \vartheta$ implies $\big|\,\|x-y\| - |s-t|\,\big| \leq t\vartheta$, and this establishes (A.13). Now (A.13) implies that

$$\Big\|\frac{x}{\|x\|} - \frac{y}{\|y\|}\Big\| \leq \vartheta \;\Longrightarrow\; \|x-y\| \leq \min(\|x\|,\|y\|)\,\vartheta + \big|\|x\|-\|y\|\big|,$$

and taking the contrapositive of this,

$$\|x-y\| > \min(\|x\|,\|y\|)\,\vartheta + \big|\|x\|-\|y\|\big| \;\Longrightarrow\; \Big\|\frac{x}{\|x\|} - \frac{y}{\|y\|}\Big\| > \vartheta,$$

from which the claim of the proposition follows.

Proof of the inequality in (5.4)

This is an improvement over bounding the sum in the second line of (5.4) by simply pulling out $e^{-\gamma^*\vartheta^2 n_1}$ and then bounding the rest by an integral.
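As an aside, the inequality of Proposition 5.1 just proved can be exercised numerically. The proof uses only norm axioms, so any norm works; the script below, our illustration, uses the $\ell_1$ norm of the paper's setting:

```python
import random

def l1(v):
    """The l1 norm of a vector."""
    return sum(abs(c) for c in v)

random.seed(2)
for _ in range(2000):
    m = random.randint(2, 5)
    x = [random.uniform(0.1, 10) for _ in range(m)]
    y = [random.uniform(0.1, 10) for _ in range(m)]
    s, t = l1(x), l1(y)
    theta = l1([xi / s - yi / t for xi, yi in zip(x, y)])   # distance of normalizations
    dist = l1([xi - yi for xi, yi in zip(x, y)])
    # Proposition 5.1 (contrapositive form): ||x - y|| <= min(s, t) * theta + |s - t|
    assert dist <= min(s, t) * theta + abs(s - t) + 1e-9
```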
Splitting the sum around the point $n^*$,

$$\sum_{n=n_1}^{n_2} \frac{(\sqrt{n}+\sqrt{m})^m}{\sqrt{n}}\,e^{-\gamma^*\vartheta^2 n} \leq e^{-\gamma^*\vartheta^2 n_1} \sum_{n=n_1}^{n^*} \frac{(\sqrt{n}+\sqrt{m})^m}{\sqrt{n}} + e^{-\gamma^*\vartheta^2 (n^*+1)} \sum_{n=n^*+1}^{n_2} \frac{(\sqrt{n}+\sqrt{m})^m}{\sqrt{n}}$$
$$\leq 2\,e^{-\gamma^*\vartheta^2 s_1} \int_{\sqrt{s_1}}^{\sqrt{n^*+1}} \big(u+\sqrt{m}\big)^m\,du + 2\,e^{-\gamma^*\vartheta^2 (s^*+1)} \int_{\sqrt{n^*+1}}^{\sqrt{s_2+2}} \big(u+\sqrt{m}\big)^m\,du,$$

since the summand is an increasing function of $n$. The last line can be written as

$$\frac{2}{m+1}\big(\sqrt{n^*+1}+\sqrt{m}\big)^{m+1}\Big(e^{-\gamma^*\vartheta^2 s_1} - e^{-\gamma^*\vartheta^2 (s^*+1)}\Big) + \frac{2}{m+1}\Big(\big(\sqrt{s_2+2}+\sqrt{m}\big)^{m+1} e^{-\gamma^*\vartheta^2 (s^*+1)} - \big(\sqrt{s_1}+\sqrt{m}\big)^{m+1} e^{-\gamma^*\vartheta^2 s_1}\Big),$$

and the desired result follows by neglecting the second exponential in each of the two summands.

Proof of Theorem 5.3

We minimize the max of $c_2(\delta), c_3(\delta)$ by setting them equal to each other. Substituting $c_2$ for $c$ into (5.9), which defines $c_3$, we get the equation for $\delta$ in the theorem:

$$\frac{2\gamma^*\vartheta^2 s_1}{\vartheta_\infty}\,\frac{1}{\delta} + m\ln\delta = \ln\frac{C''_3}{\varepsilon B'} + \frac{\Lambda^*}{\vartheta_\infty} - m\ln\vartheta_\infty. \tag{A.14}$$

Let $f(\delta)$ stand for the function of $\delta$ on the l.h.s. This function decreases for $\delta < 2\gamma^*\vartheta^2 s_1/(m\vartheta_\infty)$. From the condition between $\vartheta$ and $\delta$ of Theorem 5.2, we must have $\delta < \delta_{\max} = 2\gamma^*\vartheta^2 s_1/\Lambda^*$. So if $2\gamma^*\vartheta^2 s_1/(m\vartheta_\infty) \geq \delta_{\max}$, which will hold if $\Lambda^* \geq m\vartheta_\infty$, then $f(\delta)$ will decrease with $\delta\in(0,\delta_{\max})$. If $f(\delta_{\max})$ is less than the r.h.s. of (A.14), then (A.14) will have a root $\delta_0\in(0,\delta_{\max}]$. This condition on $f(\delta_{\max})$ boils down to

$$\Big(\frac{2\gamma^*\vartheta_\infty\vartheta^2 s_1}{\Lambda^*}\Big)^m < \frac{C''_3}{\varepsilon B'}, \quad\text{or}\quad \varepsilon^{1/m}\vartheta^2 < \frac{\Lambda^*}{2\gamma^*\vartheta_\infty s_1}\Big(\frac{C''_3}{B'}\Big)^{1/m}. \tag{A.15}$$

To arrive at the condition of the theorem we find a simple lower bound on $(C''_3/B')^{1/m}$. From (5.5),

$$C''_3 \geq \big(\sqrt{s^*+2}+\sqrt{m}\big)^{m+1} \geq \big(\sqrt{s^*+m+2}\big)^{m+1}.$$
Therefore from (5.7),

$$\frac{C''_3}{B'} \geq \frac{2\pi^{m/2}\,e^{m/12}\,\big(\sqrt{s^*+m+2}\big)^{m+1}}{(m+1)\,\Gamma(m/2)\,\sqrt{s^*}} \geq \frac{2\pi^{m/2}\,e^{m/12}}{(m+1)\,\Gamma(m/2)}\big(\sqrt{s^*+m+2}\big)^m,$$

where in the first step we used the fact that the product of the last two factors in the expression (5.7) for $B'$ is $< 1$. It follows that

$$\Big(\frac{C''_3}{B'}\Big)^{1/m} \geq \frac{2^{1/m}\sqrt{\pi}\,e^{1/12}}{\big((m+1)\Gamma(m/2)\big)^{1/m}}\,\sqrt{s^*+m+2} = \frac{2^{1/m}\sqrt{\pi}\,e^{1/12}\sqrt{m}}{\big((m+1)\Gamma(m/2)\big)^{1/m}}\,\sqrt{s^*/m+1+2/m} \geq 2\sqrt{s^*/m+1}.$$

[To go from the 2nd to the 3rd expression, it can be shown that the first factor in the 2nd expression is an increasing function of $m$; its minimum occurs at $m = 2$ and is $\approx 2.22$.] It follows that condition (A.15) for the existence of the root $\delta_0$ will be satisfied if

$$\varepsilon^{1/m}\vartheta^2 \leq \frac{\Lambda^*}{\gamma^*\vartheta_\infty s_1}\sqrt{s^*/m+1},$$

as stated in the theorem.

Now since we have ensured $c_2(\delta_0) = c_3(\delta_0)$, we can take $\hat c = \max(c_2(\delta_0), c_1)$, where $c_1$ is as in Theorem 5.2. Finally, it is quite likely that $c_2(\delta_0) \geq c_1$, so that $\hat c = c_2(\delta_0)$. Given that $\delta_0 < \delta_{\max} = 2\gamma^*\vartheta^2 s_1/\Lambda^*$ and $\Lambda^* \geq m\vartheta_\infty$, it can be seen that this will be so if $\vartheta < s^*/(2\gamma^* s_1)$.

References

[Apo74] T.M. Apostol. Mathematical Analysis, 2nd Ed. Addison-Wesley, 1974.

[BHK14] D. Berend, P. Harremoës, and A. Kontorovich. Minimum KL-divergence on complements of L1 balls. IEEE Transactions on Information Theory, 60(6):3172–3177, 2014.

[BLM13] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

[BV04] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge, 2004.

[Cat12] A. Caticha. Entropic Inference and the Foundations of Physics. In EBEB-2012, the 11th Brazilian Meeting on Bayesian Statistics, 2012. Also http://arxiv.org/abs/1212.6967.

[CCA11] A. Cichocki, S. Cruces, and S-I. Amari.
Generalized Alpha-Beta Divergences and Their Application to Robust Nonnegative Matrix Factorization. Entropy, 13:134–170, 2011.

[CK11] I. Csiszár and J. Körner. Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge, 2nd edition, 2011.

[Csi91] I. Csiszár. Why Least Squares and Maximum Entropy? An Axiomatic Approach to Inference for Linear Inverse Problems. The Annals of Statistics, 19(4):2032–2066, 1991.

[Csi96] I. Csiszár. Maxent, Mathematics, and Information Theory. In K.M. Hanson and R.N. Silver, editors, Maximum Entropy and Bayesian Methods, 15th Int'l Workshop, Santa Fe, New Mexico, U.S.A., 1996. Kluwer Academic.

[CT06] T.M. Cover and J.A. Thomas. Elements of Information Theory. J. Wiley, 2nd edition, 2006.

[GC07] A. Giffin and A. Caticha. Updating probabilities with data and moments. In K.H. Knuth et al., editor, Bayesian Inference and Maximum Entropy Methods in Science and Engineering, 28. AIP Conf. Proc. 954, 2007.

[Grü08] P.D. Grünwald. Entropy Concentration and the Empirical Coding Game. Statistica Neerlandica, 62(3):374–392, 2008.

[HLP97] G.H. Hardy, J.E. Littlewood, and G. Pólya. Inequalities, 2nd Ed. Cambridge University Press, 1997.

[HUL96] J.B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I. Springer-Verlag, 1996.

[Jay83] E.T. Jaynes. Concentration of Distributions at Entropy Maxima. In R.D. Rosenkrantz, editor, E.T. Jaynes: Papers on Probability, Statistics, and Statistical Physics. D. Reidel, 1983.

[Jay03] E.T. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, 2003.

[OG16] K.N. Oikonomou and P.D. Grünwald. Explicit Bounds for Entropy Concentration under Linear Constraints. IEEE Transactions on Information Theory, 62:1206–1230, March 2016.

[Oik12] K.N. Oikonomou. Analytical Forms for Most Likely Matrices Derived from Incomplete Information.
International Journal of Systems Science, 43:443–458, March 2012.

[OLBC10] F.W. Olver, D.W. Lozier, R.F. Boisvert, and C.W. Clark, editors. NIST Handbook of Mathematical Functions. Cambridge University Press, 2010.

[OS06] K.N. Oikonomou and R.K. Sinha. Network Design and Cost Analysis of Optical VPNs. In Proceedings of the OFC, Anaheim, CA, U.S.A., March 2006. Optical Society of America.

[OW05] E. Ordentlich and M.J. Weinberger. A Distribution-Dependent Refinement of Pinsker's Inequality. IEEE Transactions on Information Theory, 51(5):1836–1840, May 2005.

[PP02] A. Papoulis and S.U. Pillai. Probability, Random Variables, and Stochastic Processes, 4th Ed. McGraw-Hill, 2002.

[Sas13] I. Sason. Entropy Bounds for Discrete Random Variables via Maximal Coupling. IEEE Transactions on Information Theory, 59(11):7118–7131, 2013.

[SJ80] J.E. Shore and R.W. Johnson. Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Transactions on Information Theory, 26:26–37, 1980.

[Ski89] J. Skilling. Classic Maximum Entropy. In J. Skilling, editor, Maximum Entropy and Bayesian Methods. Kluwer Academic, 1989.

[VWBC05] S. Verdoolaege, K. Woods, M. Bruynooghe, and R. Cools. Computation and Manipulation of Enumerators of Integer Projections of Parametric Polytopes. Technical Report CW 392, K.U. Leuven, March 2005.

[Zha07] Z. Zhang. Estimating mutual information via Kolmogorov distance. IEEE Transactions on Information Theory, 53:3280–3282, 2007.
