Monotonic Convergence in an Information-Theoretic Law of Small Numbers
An "entropy increasing to the maximum" result analogous to the entropic central limit theorem (Barron 1986; Artstein et al. 2004) is obtained in the discrete setting. This involves the thinning operation and a Poisson limit. Monotonic convergence in …
Authors: Yaming Yu
Index Terms—binomial thinning; convex order; logarithmic Sobolev inequality; majorization; Poisson approximation; relative entropy; Schur-concavity; ultra-log-concavity.

I. INTRODUCTION

The information-theoretic central limit theorem (CLT, [4]) states that, for a sequence of independent and identically distributed (i.i.d.) random variables $X_i$, $i = 1, 2, \ldots$, with zero mean and unit variance, the normalized partial sum $Z_n = \sum_{i=1}^n X_i/\sqrt{n}$ tends to $N(0,1)$ as $n \to \infty$ in relative entropy, as long as the relative entropy $D(Z_n \| N(0,1))$ is eventually finite. An interesting feature is that $D(Z_n \| N(0,1))$ decreases monotonically in $n$, or, equivalently, the differential entropy of $Z_n$ increases to that of the standard normal. While this monotonicity is an old problem ([24]), its full solution was obtained only recently by Artstein et al. [2]; see Tulino and Verdú [34], Madiman and Barron [26], and Shlyakhtenko [31], [32] for ramifications.
In this paper we establish analogous results for a general version of the law of small numbers, extending the parallel between the information-theoretic CLT and the information-theoretic law of small numbers explored in [14], [23], [15] and [16]. Such monotonicity results are interesting as they reveal fundamental connections between probability, information theory, and physics (the analogy with the second law of thermodynamics). Moreover, the associated inequalities are often of great practical significance. The entropic CLT, for example, is closely related to Shannon's entropy power inequality ([5], [33]), which is a valuable tool in analyzing Gaussian channels.

(Yaming Yu is with the Department of Statistics, University of California, Irvine, CA 92697-1250, USA; e-mail: yamingy@uci.edu. This work is supported in part by a start-up fund from the Bren School of Information and Computer Sciences at the University of California, Irvine.)

Informally, the law of small numbers refers to the phenomenon that, for random variables $X_i$ on $Z_+ = \{0, 1, \ldots\}$, the sum $\sum_{i=1}^n X_i$ has approximately a Poisson distribution with mean $\lambda = \sum_{i=1}^n E X_i$, as long as i) each of the $X_i$ is such that $\Pr(X_i = 0)$ is close to one, $\Pr(X_i = 1)$ is uniformly small, and $\Pr(X_i > 1)$ is negligible compared to $\Pr(X_i = 1)$; and ii) the dependence between the $X_i$'s is sufficiently weak. In the version considered by Harremoës et al. [15], [16] and in this paper, the $X_i$'s are i.i.d. random variables obtained from a common distribution through thinning. (Indeed, Harremoës et al. term their result "the law of thin numbers.") The notion of thinning was introduced by Rényi [29].

Definition 1: The $\alpha$-thinning ($\alpha \in (0,1)$) of a probability mass function (pmf) $f$ on $Z_+$, denoted as $T_\alpha(f)$, is the pmf of $\sum_{i=1}^{Y} X_i$, where $Y$ has pmf $f$ and, independent of $Y$, $X_i$, $i = 1, 2, \ldots$
, are i.i.d. Bernoulli($\alpha$) random variables, i.e., $\Pr(X_i = 1) = 1 - \Pr(X_i = 0) = \alpha$.

Thinning is closely associated with certain classical distributions such as the Poisson and the binomial. For the Poisson pmf $po(\lambda) = \{po(i;\lambda),\ i = 0, 1, \ldots\}$, with $po(i;\lambda) = \lambda^i e^{-\lambda}/i!$, we have $T_\alpha(po(\lambda)) = po(\alpha\lambda)$. For the binomial pmf $bi(n,p) = \{bi(i;n,p),\ i = 0, \ldots, n\}$, with $bi(i;n,p) = \binom{n}{i} p^i (1-p)^{n-i}$, we have $T_\alpha(bi(n,p)) = bi(n, \alpha p)$. Basic properties of thinning also include the semigroup relation ([19])

$$T_\alpha(T_\beta(f)) = T_{\alpha\beta}(f). \quad (1)$$

Thinning for discrete random variables is analogous to scaling for their continuous counterparts. The $n$-th convolution of $f$, denoted as $f^{*n}$, is the pmf of $\sum_{i=1}^n Y_i$ where the $Y_i$'s are i.i.d. with pmf $f$. It is easy to show that the thinning and convolution operations commute, i.e.,

$$T_\alpha(f^{*n}) = (T_\alpha(f))^{*n}. \quad (2)$$

Using the notions of thinning and convolution, we can state the following version of the law of small numbers considered by Harremoës et al. [15]. As usual, for two pmfs $f$ and $g$, the entropy of $f$ is defined as $H(f) = -\sum_i f_i \log(f_i)$, and the relative entropy between $f$ and $g$ is defined as $D(f\|g) = \sum_i f_i \log(f_i/g_i)$. It is understood that $D(f\|g) = \infty$ if the support of $f$, $supp(f) = \{i : f_i > 0\}$, is not a subset of $supp(g)$. We frequently consider the relative entropy between a pmf $f$ and $po(\lambda)$, where $\lambda$ is the mean of $f$; we denote $D(f) = D(f\|po(\lambda))$ for convenience.

Theorem 1: Let $f$ be a pmf on $Z_+$ with mean $\lambda < \infty$. Then, as $n \to \infty$,
1) $T_{1/n}(f^{*n})$ tends to $po(\lambda)$ pointwise;
2) $H(T_{1/n}(f^{*n})) \to H(po(\lambda))$;
3) if $D(T_{1/n}(f^{*n}))$ ever becomes finite, then it tends to zero.

Part 1) of Theorem 1 is proved by Harremoës et al. [15], who also present a proof of Part 3) assuming $D(f) < \infty$.
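As an aside (ours, not part of the original text), Definition 1 and the identity $T_\alpha(bi(n,p)) = bi(n,\alpha p)$ are easy to check numerically; the helper names `thin` and `binom_pmf` below are illustrative only:

```python
import math

def thin(f, alpha):
    # T_alpha(f): each of the Y counts survives independently w.p. alpha,
    # so (T_alpha f)_i = sum_j f_j * bi(i; j, alpha).
    g = [0.0] * len(f)
    for j, fj in enumerate(f):
        for i in range(j + 1):
            g[i] += fj * math.comb(j, i) * alpha**i * (1 - alpha)**(j - i)
    return g

def binom_pmf(n, p):
    return [math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]

f = binom_pmf(5, 0.6)
lhs = thin(f, 0.5)               # T_{1/2}(bi(5, 0.6))
rhs = binom_pmf(5, 0.3)          # bi(5, 0.3)
assert max(abs(a - b) for a, b in zip(lhs, rhs)) < 1e-12
```

The double loop is just the mixture-of-binomials representation of thinning; it is quadratic in the support size, which is fine for these small checks.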
The current, slightly more general form of Part 3) is reminiscent of Barron's work [4] on the CLT. In Section II we present a short proof of Part 3). We also note that Part 2), which is stated in [16] with a stronger assumption, can be deduced from Part 1) directly.

A major goal of this work is to establish monotonicity properties in Theorem 1. We show that, in Part 3) of Theorem 1, the relative entropy never increases (Theorem 2), and, assuming $f$ is ultra-log-concave (see Definition 2), in Part 2) of Theorem 1, the entropy never decreases (Theorem 3). Both Theorems 2 and 3 can be regarded as discrete analogues of the monotonicity of entropy in the CLT ([2]), with thinning playing the role of scaling. (Unlike the CLT case, here monotonicity of the entropy and that of the relative entropy are not equivalent.)

We begin with monotonicity of the relative entropy.

Theorem 2: If $f$ is a pmf on $Z_+$ with a finite mean, then $D(T_{1/n}(f^{*n}))$ decreases in $n = 1, 2, \ldots$.

The proof of Theorem 2 uses two lemmas, which are of interest in themselves. These deal with the behavior of relative entropy under thinning (Lemma 1) and convolution (Lemma 2), respectively. Lemma 1 is proved in Section III, where we also note its close connection with modified logarithmic Sobolev inequalities (Bobkov and Ledoux [6]; Wu [35]) for the Poisson distribution.

Lemma 1 (The Thinning Lemma): Let $f$ be a pmf on $Z_+$ with a finite mean. Then

$$D(T_\alpha(f)) \le \alpha D(f), \quad 0 < \alpha < 1.$$

An equivalent statement is that $\alpha^{-1} D(T_\alpha(f))$ increases in $\alpha \in (0, 1]$, in view of the semigroup property (1). Combined with a data processing argument, Lemma 1 can be used to show that the relative entropy is monotone along power-of-two iterates in Theorem 2. To prove Theorem 2 fully, however, we need the following convolution result, which may be seen as a "strengthened data processing inequality." Lemma 2 is proved in Section IV.
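Lemma 1 can be probed numerically before reading the proof. The sketch below (ours; the helpers `thin` and `D` are illustrative names, plain Python) checks $D(T_\alpha(f)) \le \alpha D(f)$ on a grid of $\alpha$ for an arbitrary finitely supported pmf:

```python
import math

def thin(f, alpha):
    # (T_alpha f)_i = sum_j f_j * bi(i; j, alpha)
    g = [0.0] * len(f)
    for j, fj in enumerate(f):
        for i in range(j + 1):
            g[i] += fj * math.comb(j, i) * alpha**i * (1 - alpha)**(j - i)
    return g

def D(f):
    # D(f) = D(f || po(lam)) with lam the mean of f;
    # log po(i; lam) = i*log(lam) - lam - log(i!)
    lam = sum(i * fi for i, fi in enumerate(f))
    return sum(fi * (math.log(fi) - (i * math.log(lam) - lam - math.lgamma(i + 1)))
               for i, fi in enumerate(f) if fi > 0)

f = [0.1, 0.2, 0.3, 0.25, 0.15]        # an arbitrary pmf on {0, ..., 4}
for alpha in [k / 10 for k in range(1, 10)]:
    assert D(thin(f, alpha)) <= alpha * D(f) + 1e-12
```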
Lemma 2 (The Convolution Lemma): If $f$ is a pmf on $Z_+$ with a finite mean, then $(1/n) D(f^{*n})$ decreases in $n$.

The main difference in the development here, compared with the CLT case, is that we need to consider the effect of both thinning and convolution. In the CLT case, the monotonicity of entropy can be obtained from one general convolution inequality for the Fisher information ([2], [26]). Nevertheless, the proofs of Lemmas 1 and 2 (Lemma 2 in particular) somewhat parallel the CLT case. We first express the desired divergence quantity as an integral via a de Bruijn type identity ([33], [5], [4]), and then analyze the monotonicity property of the integrand; see Sections III and IV for details. Once we have Lemmas 1 and 2, Theorem 2 is quickly established.

Proof of Theorem 2: Lemma 1 and (1) imply ($n \ge 2$)

$$\frac{n}{n-1}\, D(T_{1/n}(f^{*n})) \le D(T_{1/(n-1)}(f^{*n})).$$

Lemma 2 and (2) then yield

$$D(T_{1/(n-1)}(f^{*n})) \le \frac{n}{n-1}\, D(T_{1/(n-1)}(f^{*(n-1)}))$$

and the claim follows.

By a different analysis, we also establish the monotonicity of $H(T_{1/n}(f^{*n}))$, under the assumption that $f$ is ultra-log-concave.

Definition 2: A nonnegative sequence $u = \{u_i,\ i \in Z_+\}$ is called log-concave if the support of $u$ is an interval of consecutive integers, and $u_i^2 \ge u_{i-1} u_{i+1}$ for all $i > 0$. A pmf $f$ is ultra-log-concave, or ULC, if the sequence $i!\, f_i$, $i \in Z_+$, is log-concave. Equivalently, $f$ is ULC if $i f_i / f_{i-1}$ decreases in $i$.

It is clear that ultra-log-concavity implies log-concavity. Examples of ULC pmfs include the Poisson and the binomial. More generally, the pmf of $\sum_{i=1}^n X_i$ is ULC if the $X_i$'s are independent (not necessarily identically distributed) Bernoulli random variables. The monotonicity of entropy is stated as follows.

Theorem 3: If $f$ is ULC, then $H(T_{1/n}(f^{*n}))$ increases monotonically in $n = 1, 2, \ldots$.
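The four monotonicity statements above (Lemma 1, Lemma 2, Theorem 2, Theorem 3) can be sanity-checked on a small example. The following sketch (ours; illustrative helper names) takes $f = bi(2, 1/2)$, for which $T_{1/n}(f^{*n}) = bi(2n, 1/(2n))$ and $T_{1/n}(f) = bi(2, 1/(2n))$:

```python
import math

def binom(n, p):
    return [math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]

def D(f):
    # relative entropy D(f) = D(f || po(lam)), lam = mean of f
    lam = sum(i * fi for i, fi in enumerate(f))
    return sum(fi * (math.log(fi) - (i * math.log(lam) - lam - math.lgamma(i + 1)))
               for i, fi in enumerate(f) if fi > 0)

def H(f):
    return -sum(fi * math.log(fi) for fi in f if fi > 0)

# f = bi(2, 1/2), so f^{*n} = bi(2n, 1/2) and T_{1/n}(f^{*n}) = bi(2n, 1/(2n))
d = [D(binom(2 * n, 1 / (2 * n))) for n in range(1, 11)]
t = [n * D(binom(2, 1 / (2 * n))) for n in range(1, 11)]
r = [D(binom(2 * n, 0.5)) / n for n in range(1, 11)]
h = [H(binom(2 * n, 1 / (2 * n))) for n in range(1, 11)]
assert all(x > y for x, y in zip(d, d[1:]))   # Theorem 2: d(n) decreasing
assert all(x > y for x, y in zip(t, t[1:]))   # Lemma 1:   t(n) decreasing
assert all(x > y for x, y in zip(r, r[1:]))   # Lemma 2:   r(n) decreasing
assert all(x < y for x, y in zip(h, h[1:]))   # Theorem 3: h(n) increasing
```

These are the same four quantities displayed in Fig. 1 (natural logarithms throughout).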
An example ([14], [36]) is when $f$ is Bernoulli with parameter $p$, in which case $T_{1/n}(f^{*n}) = bi(n, p/n)$. In other words, both the entropy and the relative entropy are monotone in the classical binomial-to-Poisson convergence.

It should not be surprising that we make the ULC assumption; the situation is similar to that of a Markov chain with homogeneous transition probabilities ([10], Chapter 4): relative entropy always decreases, but entropy does not increase without additional assumptions. The ULC assumption is natural in Theorem 3 because ULC distributions with the same mean $\lambda$ form a natural class in which the $po(\lambda)$ distribution has maximum entropy [19]. In fact, if we reverse the ULC assumption (but still assume that $f$ is log-concave), then $H(T_{1/n}(f^{*n}))$ decreases monotonically (Theorem 7). Theorems 3 and 7 are proved in Section VI. The starting point in these proofs is a general result (Lemma 4) that relates entropy comparison to comparing the expectations of convex functions. This entails a rather detailed analysis of the convex order (to be defined in Section V) between the relevant distributions.

As a simple example, Fig. 1 displays the values of

$$d(n) = D(T_{1/n}(f^{*n})), \quad t(n) = n D(T_{1/n}(f)), \quad r(n) = n^{-1} D(f^{*n}), \quad h(n) = H(T_{1/n}(f^{*n}))$$

for $f = bi(2, 1/2)$ and $n = 1, \ldots, 10$. The monotone patterns of $d(n)$, $t(n)$, $r(n)$ and $h(n)$ illustrate Theorem 2, Lemma 1, Lemma 2, and Theorem 3, respectively.

Besides monotonicity, an equally interesting problem is the rate of convergence. In Section VII we show that, if $f$ is ULC or has finite support, then

$$D(T_{1/n}(f^{*n})) = O(n^{-2}), \quad n \to \infty.$$

This complements certain bounds obtained by Harremoës et al. [15], [16]. Different tools contribute to this $O(n^{-2})$ rate.
For ULC distributions we use stochastic orders as in Section VI; for distributions with finite support, we simply analyze the scaled Fisher information ([23], [27]). We conclude with a discussion of possible extensions and refinements (of Theorem 2 in particular) in Section VIII.

[Fig. 1. Values of $d(n) = D(bi(2n, 1/(2n)))$, $t(n) = n D(bi(2, 1/(2n)))$, $r(n) = (1/n) D(bi(2n, 1/2))$ and $h(n) = H(bi(2n, 1/(2n)))$ for $n = 1, \ldots, 10$.]

II. THE CONVERGENCE THEOREM

This section deals with Theorem 1. Part 1) of Theorem 1 is proved in [15]. Part 2) is stated in [16] with the assumption that $f$ is ultra-log-concave. The present form only assumes that $\lambda$, the mean of $f$, is finite. Part 2) can be quickly proved as follows. Part 1) and Fatou's lemma yield

$$\liminf_{n \to \infty} H(T_{1/n}(f^{*n})) \ge H(po(\lambda)).$$

Let $g$ denote the pmf of a geometric($p$) distribution, i.e., $g_i = p(1-p)^i$, $i = 0, 1, \ldots$, $0 < p < 1$. By the lower-semicontinuity property of relative entropy,

$$\liminf_{n \to \infty} D(T_{1/n}(f^{*n}) \| g) \ge D(po(\lambda) \| g). \quad (3)$$

Since the mean of $T_{1/n}(f^{*n})$ is $\lambda$ for all $n$ (and, $g$ being geometric, $D(\cdot\|g)$ depends on a pmf only through its mean and its entropy), (3) simplifies to

$$\limsup_{n \to \infty} H(T_{1/n}(f^{*n})) \le H(po(\lambda))$$

and Part 2) is proved.

Our proof of Part 3) uses convexity arguments that also yield some interesting intermediate results (Propositions 2 and 3). In Propositions 1–3 let $X_1, X_2, \ldots$ be i.i.d. with pmf $f$.

Proposition 1: For any $\alpha \in (0, 1]$, $D(T_\alpha(f)) < \infty$ if and only if $E X_1 \log(X_1) < \infty$ (as usual $0 \log 0 = 0$).

Proof: Let us consider $\alpha = 1$ first. Note that $H(f)$ is finite since the mean of $f$ is finite. We have

$$D(f) = \sum_{i \ge 0} f_i \log(i!) - \lambda \log(\lambda) + \lambda - H(f). \quad (4)$$

Thus $D(f) < \infty$ if and only if $\sum_{i \ge 0} f_i \log(i!)$
converges, which, by Stirling's formula, is equivalent to $E X_1 \log(X_1) < \infty$.

For general $\alpha \in (0, 1]$, let $Y \mid X_1 \sim Bi(X_1, \alpha)$. By the preceding argument $D(T_\alpha(f)) < \infty$ if and only if $E Y \log(Y) < \infty$. However,

$$E[\alpha X_1 \log(\alpha X_1)] \le E[Y \log(Y)] \le E[X_1 \log(X_1)],$$

where the lower bound holds by Jensen's inequality. Thus $E Y \log(Y) < \infty$ is also equivalent to $E X_1 \log(X_1) < \infty$.

A consequence of Proposition 1 is that, in Part 3),

$$D(T_{1/n}(f^{*n})) < \infty \iff E \bar X_n \log(\bar X_n) < \infty.$$

Here and in Propositions 2 and 3 below, $\bar X_n = (1/n) \sum_{i=1}^n X_i$.

Proposition 2: For $n \ge 1$,

$$D(T_{1/n}(f^{*n})) \le \frac{\lambda}{n} + E\left[\bar X_n \log \frac{\bar X_n}{\lambda}\right].$$

Proof: We borrow an idea of [15] used in the proof of their Proposition 8. Letting $g = f^{*n}$, we have

$$D(T_{1/n}(g)) = D\left(\sum_{k=0}^\infty g_k\, bi(k, 1/n)\right) \le \sum_{k=0}^\infty g_k\, D(bi(k, 1/n) \| po(\lambda))$$

by convexity. However,

$$D(bi(k, p) \| po(\lambda)) = D(bi(k, p)) + D(po(kp) \| po(\lambda)) \le kp^2 + kp \log \frac{kp}{\lambda} - kp + \lambda,$$

where the simple bound $D(bi(k, p)) \le kp^2$ (see [14] for its proof) is used in the inequality. Thus

$$D(T_{1/n}(g)) \le \sum_{k=0}^\infty g_k \left[\frac{k}{n^2} + \frac{k}{n} \log \frac{k}{n\lambda} - \frac{k}{n} + \lambda\right] = \frac{\lambda}{n} + E\left[\bar X_n \log \frac{\bar X_n}{\lambda}\right]$$

as required.

Proposition 3: Denote $l_n = E[\bar X_n \log(\bar X_n/\lambda)]$. Then, as $n \uparrow \infty$, $l_n$ decreases to zero if it is finite for some $n$.

Proof: By Jensen's inequality, $l_n \ge 0$. Noting $\bar X_n = E[\bar X_{n-1} \mid \bar X_n]$, we apply Jensen's inequality again to get

$$l_n \le E\left[E\left[\bar X_{n-1} \log(\bar X_{n-1}/\lambda) \mid \bar X_n\right]\right] = l_{n-1}.$$

(Essentially we are proving $\bar X_n \le_{cx} \bar X_{n-1}$, where $\le_{cx}$ denotes the convex order; see [30]. Section V contains a brief introduction to several stochastic orders.) Thus $l_n \downarrow l_\infty$, say, with $l_\infty \ge 0$. We show $l_\infty = 0$, assuming $l_k < \infty$ for some $k$. By symmetry $l_n = E[\bar X_k \log(\bar X_n/\lambda)]$, $n \ge k$. We may use this and Jensen's inequality to obtain

$$l_n \le E\left[\bar X_k \log \frac{E[\bar X_n \mid \bar X_k]}{\lambda}\right] = E\left[\bar X_k \log \frac{k \bar X_k + (n-k)\lambda}{n\lambda}\right].$$
(5)

However,

$$\bar X_k \log \frac{k \bar X_k + (n-k)\lambda}{n\lambda} \le \bar X_k \max\left\{0,\ \log \frac{\bar X_k}{\lambda}\right\},$$

and the right-hand side has a finite expectation since $l_k < \infty$. Letting $n \to \infty$ in (5) and using Fatou's lemma we obtain

$$l_\infty \le E\left[\bar X_k \log \frac{\lambda}{\lambda}\right] = 0,$$

which forces $l_\infty = 0$.

Part 3) is then a direct consequence of Propositions 1–3.

III. LEMMA 1 AND A MODIFIED LOGARITHMIC SOBOLEV INEQUALITY

For any pmfs $\tilde g$ and $g$ on $Z_+$, we have

$$D(T_\alpha(\tilde g) \| T_\alpha(g)) \le D(\tilde g \| g). \quad (6)$$

This is a special case of a general result on the decrease of relative entropy along a Markov chain (see [10], Chapter 4). It follows from (6) and the semigroup property (1) that, in the setting of Lemma 1, $D(T_\alpha(f))$ increases in $\alpha$. This is, however, not yet strong enough to prove Lemma 1.

Let us recall the size-biasing operation, which often appears in Poisson approximation problems.

Definition 3: For a pmf $f$ on $Z_+$ with mean $\lambda > 0$, the size-biased pmf, denoted by $S(f)$, is defined on $Z_+$ as $S(f) = \{(i+1) f_{i+1}/\lambda,\ i = 0, 1, \ldots\}$.

The formulas $S(po(\lambda)) = po(\lambda)$ and $S(bi(n, p)) = bi(n-1, p)$ are readily verified. Moreover, the size-biasing and thinning operations commute, i.e.,

$$T_\alpha(S(f)) = S(T_\alpha(f)). \quad (7)$$

Key to the proof of Lemma 1 is the following identity; see Johnson [19] for related calculations.

Lemma 3: Let $f = \{f_i,\ i \ge 0\}$ be a pmf on $Z_+$ with mean $\lambda \in (0, \infty)$, and assume that the support of $f$ is finite, i.e., there exists some $k$ such that $f_i = 0$ for all $i \ge k$. Then

$$\frac{dD(T_\alpha(f))}{d\alpha} = \lambda D(T_\alpha(S(f)) \| T_\alpha(f)), \quad \alpha \in (0, 1). \quad (8)$$

Proof: Write $g = T_\alpha(f)$ for convenience, i.e.,

$$g_i = \sum_{j \ge 0} f_j\, bi(i; j, \alpha).$$
By direct calculation,

$$\begin{aligned}
\frac{dD(g)}{d\alpha}
&= \sum_{i \ge 0} \frac{dg_i}{d\alpha} \log \frac{g_i}{po(i; \alpha\lambda)} \\
&= \sum_{i \ge 0,\, j \ge 1} j f_j\, [bi(i-1; j-1, \alpha) - bi(i; j-1, \alpha)]\, \log \frac{g_i}{po(i; \alpha\lambda)} \\
&= \sum_{i \ge 1,\, j \ge 1} j f_j\, bi(i-1; j-1, \alpha) \left[\log \frac{g_i}{po(i; \alpha\lambda)} - \log \frac{g_{i-1}}{po(i-1; \alpha\lambda)}\right] \\
&= \lambda \sum_{i \ge 0} \sum_{j \ge 0} (S(f))_j\, bi(i; j, \alpha)\, \log \frac{(i+1)\, g_{i+1}}{\alpha\lambda\, g_i} \\
&= \lambda D(T_\alpha(S(f)) \| T_\alpha(f)),
\end{aligned}$$

where the simple identity

$$\frac{d\, bi(i; n, p)}{dp} = n\, [bi(i-1; n-1, p) - bi(i; n-1, p)]$$

is used in the second step, and Abel's summation formula in the third. (By convention $bi(i; n, p) = 0$ if $i < 0$ or $i > n$.) All sums are finite sums since $f$ has finite support.

Remark: The assumption that $f$ has finite support does not appear to impose a serious limit on the applicability of Lemma 3. Of course, it would be good to see this assumption relaxed.

Proof of Lemma 1: Let us first assume that $f$ has finite support. Then $D(T_\alpha(f))$ is obviously continuous in $\alpha \in [0, 1]$. Lemma 3 and (6) show that $dD(T_\alpha(f))/d\alpha$ increases in $\alpha \in (0, 1)$. Thus $D(T_\alpha(f))$ is convex on $\alpha \in [0, 1]$; since $D(T_0(f)) = 0$, convexity gives $D(T_\alpha(f)) \le \alpha D(f)$, and the claim follows.

For general $f$, we construct a sequence of pmfs $f^{(k)} = \{f^{(k)}_i,\ i \ge 0\}$, $k = 1, 2, \ldots$, by truncation. In other words, let $f^{(k)}_i = c_k f_i$, $i = 0, \ldots, k$, where $c_k = (\sum_{i \le k} f_i)^{-1}$, and $f^{(k)}_i = 0$, $i > k$. Assume $D(f) < \infty$ without loss of generality. Then $T_\alpha(f^{(k)})$ tends to $T_\alpha(f)$ pointwise as $k \to \infty$. It is also easy to show $D(f^{(k)}) \to D(f)$, $k \to \infty$. Thus, by the finite-support result and the lower-semicontinuity property of the relative entropy, we have

$$D(T_\alpha(f)) \le \liminf_{k \to \infty} D(T_\alpha(f^{(k)})) \le \liminf_{k \to \infty} \alpha D(f^{(k)}) = \alpha D(f)$$

as required.

For two pmfs $f$ and $g$ on $Z_+$ with finite means, the data-processing inequality ([10]) gives ($*$ denotes convolution)

$$D(T_\alpha(f) * T_\beta(g)) \le D(T_\alpha(f)) + D(T_\beta(g)), \quad (9)$$

where $\alpha, \beta \in [0, 1]$.
By Lemma 1, we have

$$D(T_\alpha(f) * T_\beta(g)) \le \alpha D(f) + \beta D(g). \quad (10)$$

This is enough to prove Theorem 2 in the special case of power-of-two iterates, i.e., $D(T_{1/n}(f^{*n}))$ decreases along $n = 2^k$, $k = 0, 1, \ldots$. To establish Theorem 2 fully, we need a convolution inequality stronger than (9), namely Lemma 2; Section IV contains the details.

A result closely related to Lemma 1 is Theorem 4, which was proved by Wu ([35], Eqn. 0.6) using advanced stochastic calculus tools (see [6], [8], [9] for related work). Our proof of Theorem 4, based on convexity, is similar in spirit to those given by [8], [9]; the use of thinning appears new.

Theorem 4 ([35]): For a pmf $f$ on $Z_+$ with mean $\lambda \in (0, \infty)$ we have

$$D(f) \le \lambda D(S(f) \| f). \quad (11)$$

Proof: Let us assume the support of $f$ is finite. The convexity of $h(\alpha) = D(T_\alpha(f))$ implies $h'(\alpha) \ge h(\alpha)/\alpha$ for all $\alpha \in (0, 1)$. If $D(S(f) \| f) < \infty$ then $supp(f)$ is an interval of consecutive integers including zero. We may let $\alpha \to 1$ and obtain

$$\lambda D(S(f) \| f) = \lim_{\alpha \uparrow 1} h'(\alpha) \ge h(1) = D(f).$$

When the support of $f$ is not finite, an argument similar to the one for Lemma 1 applies.

Theorem 4 sharpens a modified logarithmic Sobolev inequality originally obtained by Bobkov and Ledoux [6].

Corollary 1 ([6], Corollary 4): In the setting of Theorem 4, assume that $f_i > 0$ for all $i \in Z_+$. Then

$$D(f) \le \lambda \chi^2(S(f), f), \quad (12)$$

where $\chi^2(S(f), f) = \sum_i f_i ((S(f))_i/f_i - 1)^2$.

The inequality (12) follows from Theorem 4 and the well-known inequality between the relative entropy and the $\chi^2$ distance. For an application of (12) to Poisson approximation bounds, see [23].

IV. RELATIVE ENTROPY UNDER CONVOLUTION

This section establishes Lemma 2. The starting point is an easily verified decomposition formula (Proposition 4). Proposition 4 was used by Madiman et al.
[27] to derive a convolution inequality ([27], Theorem III) for the scaled Fisher information, which is $\lambda \chi^2(S(f), f)$ as in (12). Here we obtain a monotonicity result (Corollary 2) for the relative entropy $D(S(f^{*n}) \| f^{*n})$, which is instrumental in the proof of Lemma 2.

Proposition 4 ([27], Eqn. 14): Let $q^{(i)}$ be pmfs on $Z_+$ with finite means $\lambda_i$, $i = 1, \ldots, n$, respectively ($n \ge 2$). Define $q = q^{(1)} * \cdots * q^{(n)}$ and $q^{(-i)} = q^{(1)} * \cdots * q^{(i-1)} * q^{(i+1)} * \cdots * q^{(n)}$ (i.e., $q^{(i)}$ is left out), $i = 1, \ldots, n$. Then there holds

$$S(q) = \sum_{i=1}^n \beta_i\, q^{(i)} * S(q^{(-i)}),$$

where $\beta_i = (1 - \lambda_i/\sum_{j=1}^n \lambda_j)/(n-1)$. (In statistical terms, we have a mixture representation of $S(q)$.)

Proposition 5: In the setting of Proposition 4 we have

$$D(q \| S(q)) \le \sum_{i=1}^n \beta_i D(q^{(-i)} \| S(q^{(-i)})); \quad (13)$$

$$D(S(q) \| q) \le \sum_{i=1}^n \beta_i D(S(q^{(-i)}) \| q^{(-i)}). \quad (14)$$

Proof: We prove (13); the same argument applies to (14). By convexity, Proposition 4 yields

$$D(q \| S(q)) \le \sum_{i=1}^n \beta_i D(q \| q^{(i)} * S(q^{(-i)})).$$

However, since $q = q^{(i)} * q^{(-i)}$ for each $i$, we have

$$D(q \| q^{(i)} * S(q^{(-i)})) \le D(q^{(-i)} \| S(q^{(-i)}))$$

by data processing, and the claim follows.

Corollary 2 corresponds to the case of identical $q^{(i)}$'s in Proposition 5.

Corollary 2: For any pmf $f$ on $Z_+$ with mean $\lambda \in (0, \infty)$, both $D(S(f^{*n}) \| f^{*n})$ and $D(f^{*n} \| S(f^{*n}))$ decrease in $n$.

Proof of Lemma 2: Let us assume that $f$ has finite support first. We have (8) in the integral form

$$\frac{1}{n} D(f^{*n}) = \lambda \int_0^1 D(T_\alpha(S(f^{*n})) \| T_\alpha(f^{*n}))\, d\alpha \quad (15)$$

$$= \lambda \int_0^1 D(S((T_\alpha(f))^{*n}) \| (T_\alpha(f))^{*n})\, d\alpha, \quad (16)$$

where (16) holds by the commuting relations (7) and (2). By Corollary 2, the integrand in (16) decreases in $n$ for each $\alpha$. Thus $(1/n) D(f^{*n})$ decreases in $n$ as claimed.

For general $f$, we again use truncation.
Specifically, let $f^{(k)}$ and $c_k$ be defined as in the proof of Lemma 1. For $n \ge 2$ let $g = f^{*n}$, and similarly let $g^{(k)}$ denote the $n$-th convolution of $f^{(k)}$. Then $g^{(k)}$ tends to $g$ pointwise, and the mean of $g^{(k)}$ tends to that of $g$. Assume $D(g) < \infty$, which amounts to $\sum_i g_i \log(i!) < \infty$. The argument for Part 2) of Theorem 1 shows

$$H(g^{(k)}) \to H(g), \quad k \to \infty. \quad (17)$$

We also have the simple inequality $g^{(k)}_i \le c_k^n g_i$ for all $i$. Since $c_k \to 1$ as $k \to \infty$, we may apply dominated convergence to obtain

$$\sum_i g^{(k)}_i \log(i!) \to \sum_i g_i \log(i!), \quad k \to \infty,$$

which, taken together with (17), shows $D(g^{(k)}) \to D(g)$, $k \to \infty$. The finite-support result and the lower-semicontinuity property of relative entropy then yield

$$\frac{1}{n+1} D(f^{*(n+1)}) \le \frac{1}{n} D(f^{*n})$$

as in the proof of Lemma 1.

A generalization of Lemma 2 is readily obtained if we use Proposition 5 rather than Corollary 2 in the above argument.

Theorem 5: In the setting of Proposition 4,

$$D(q) \le \frac{1}{n-1} \sum_{i=1}^n D(q^{(-i)}).$$

Theorem 5 strengthens the usual data processing inequality

$$D(q) \le \sum_{i=1}^n D(q^{(i)})$$

in the same way that the entropy power inequality of Artstein et al. [2] strengthens Shannon's classical entropy power inequality.

Remark: A by-product of Corollary 2 is that the divergence quantities

$$h_n = D(T_{1/n}(f^{*n}) \| S(T_{1/n}(f^{*n}))) \quad \text{and} \quad \tilde h_n = D(S(T_{1/n}(f^{*n})) \| T_{1/n}(f^{*n}))$$

also decrease in $n$. Indeed we have

$$\begin{aligned}
h_n &= D(T_{1/n}(f^{*n}) \| T_{1/n}(S(f^{*n}))) \\
&\le D(T_{1/(n-1)}(f^{*n}) \| T_{1/(n-1)}(S(f^{*n}))) \quad (18) \\
&= D((T_{1/(n-1)}(f))^{*n} \| S((T_{1/(n-1)}(f))^{*n})) \\
&\le h_{n-1}, \quad (19)
\end{aligned}$$

where (6) is used in (18), Corollary 2 is used in (19), and the commuting relations (7) and (2) are applied throughout. The proof for $\tilde h_n$ is the same. These monotonicity statements complement Theorem 2.
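As a numerical aside (ours), Corollary 2 can be checked directly for $f = bi(2, 1/2)$, using the fact that $S(bi(m, p)) = bi(m-1, p)$; the helper names below are illustrative:

```python
import math

def conv(f, g):
    # convolution of two pmfs given as lists
    h = [0.0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            h[i + j] += fi * gj
    return h

def size_bias(f):
    # S(f)_i = (i+1) f_{i+1} / mean(f)
    lam = sum(i * fi for i, fi in enumerate(f))
    return [(i + 1) * f[i + 1] / lam for i in range(len(f) - 1)]

def rel_ent(p, q):
    return sum(pi * math.log(pi / q[i]) for i, pi in enumerate(p) if pi > 0)

f = [0.25, 0.5, 0.25]          # bi(2, 1/2)
fn = [1.0]
vals = []
for n in range(1, 6):
    fn = conv(fn, f)           # f^{*n} = bi(2n, 1/2)
    vals.append(rel_ent(size_bias(fn), fn))   # D(S(f^{*n}) || f^{*n})
assert all(x > y for x, y in zip(vals, vals[1:]))   # Corollary 2
```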
V. STOCHASTIC ORDERS AND MAJORIZATION

The proof of the monotonicity of entropy (Theorem 3) involves several notions of stochastic orders, which we briefly introduce.

Definition 4: For two random variables $X$ and $Y$ with pmfs $f$ and $g$ respectively,
- $X$ is smaller than $Y$ in the usual stochastic order, written as $X \le_{st} Y$, if $\Pr(X > c) \le \Pr(Y > c)$ for all $c$;
- $X$ is smaller than $Y$ in the convex order, written as $X \le_{cx} Y$, if $E\phi(X) \le E\phi(Y)$ for every convex function $\phi$ such that the expectations exist;
- $X$ is log-concave relative to $Y$, written as $X \le_{lc} Y$, if i) both $supp(f)$ and $supp(g)$ are intervals of consecutive integers, ii) $supp(f) \subset supp(g)$, and iii) $\log(f_i/g_i)$ is concave on $supp(f)$.

We use $\le_{st}$, $\le_{cx}$, $\le_{lc}$ with the pmfs as well as the random variables. In general, $f \le_{st} g$ if there exist random variables $X$ and $Y$ with pmfs $f$ and $g$ respectively such that $X \le Y$ almost surely. Examples include

$$bi(n, p) \le_{st} bi(n+1, p), \qquad bi(n, p) \le_{st} bi(n, p'), \quad p \le p'.$$

In contrast, $\le_{cx}$ compares variability. A classical example (Hoeffding [18]) is

$$bi(n, \lambda/n) \le_{cx} bi(n+1, \lambda/(n+1)), \quad 0 \le \lambda \le n.$$

Another example, mentioned in Section II, is $\bar X_n \le_{cx} \bar X_{n-1}$, where $\bar X_n = (1/n) \sum_{i=1}^n X_i$ for i.i.d. $X_i$'s with a finite mean. The log-concavity order $\le_{lc}$ is also useful in our context; for example, $f$ being ULC can be written as $f \le_{lc} po(\lambda)$, $\lambda > 0$. (The actual value of $\lambda$ is irrelevant.) Further properties of these stochastic orders can be found in Shaked and Shanthikumar [30].

We also need the concepts of majorization and Schur concavity.

Definition 5: A real vector $b = (b_1, \ldots, b_n)$ is said to majorize $a = (a_1, \ldots, a_n)$, written as $a \prec b$, if
- $\sum_{i=1}^n a_i = \sum_{i=1}^n b_i$, and
- $\sum_{i=k}^n a_{(i)} \le \sum_{i=k}^n b_{(i)}$, $k = 2, \ldots, n$,

where $a_{(1)} \le \ldots \le a_{(n)}$ and $b_{(1)} \le \ldots \le b_{(n)}$ denote
$(a_1, \ldots, a_n)$ and $(b_1, \ldots, b_n)$ arranged in increasing order, respectively. A function $\phi(a)$ symmetric in the coordinates of $a = (a_1, \ldots, a_n)$ is said to be Schur concave if

$$a \prec b \implies \phi(a) \ge \phi(b).$$

As is well known, if pmfs $f$ and $g$ on $\{0, \ldots, n\}$ (viewed as vectors of the respective probabilities) satisfy $f \prec g$, then $H(f) \ge H(g)$. In other words, $H(f)$ is a Schur concave function of $f$. Further properties and various applications of these two notions can be found in Hardy et al. [13] and Marshall and Olkin [28].

VI. MONOTONICITY OF THE ENTROPY

This section proves Theorem 3. We state a key lemma that can be traced back to Karlin and Rinott [22].

Lemma 4: Let $f$ and $g$ be pmfs on $Z_+$ such that $f \le_{cx} g$ and $g$ is log-concave. Then

$$H(f) + D(f \| g) \le H(g).$$

In particular $H(f) \le H(g)$, with equality only if $f = g$.

Although Lemma 4 follows almost immediately from the definitions (hence the proof is omitted), it is a useful tool in several entropy comparison contexts ([22], [36], [38], [39]). Effectively, Lemma 4 reduces entropy comparison to two (often easier) problems: i) establishing a log-concavity result, and ii) comparing the expectations of convex functions. A modification of Lemma 4 is used by [36] to give a short and unified proof of the main theorems of [19] and [37] concerning the maximum entropy properties of the Poisson and binomial distributions. We quote Johnson's result. Further extensions to compound distributions can be found in [21], [39].

Theorem 6: If a pmf $f$ on $Z_+$ is ULC with mean $\lambda$, then $H(f) \le H(po(\lambda))$, with equality only if $f = po(\lambda)$.

To apply Lemma 4 to our problem, we show that, in the setting of Theorem 3,

$$T_{1/(n-1)}(f^{*(n-1)}) \le_{cx} T_{1/n}(f^{*n}). \quad (20)$$

In a sense, (20) means that $T_{1/n}(f^{*n})$ becomes more and more "spread out" as $n$ increases.
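Relation (20) can be tested numerically. For pmfs on $Z_+$ with equal means, $f \le_{cx} g$ is equivalent to ordered stop-loss transforms $E(X-c)_+ \le E(Y-c)_+$ for all $c$; for integer-supported pmfs it suffices to check integer $c$, since both transforms are piecewise linear with kinks at integers. A sketch (ours; illustrative helpers) for the ULC pmf $f = bi(2, 1/2)$:

```python
import math

def conv(f, g):
    h = [0.0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            h[i + j] += fi * gj
    return h

def thin(f, alpha):
    g = [0.0] * len(f)
    for j, fj in enumerate(f):
        for i in range(j + 1):
            g[i] += fj * math.comb(j, i) * alpha**i * (1 - alpha)**(j - i)
    return g

def stop_loss(f, c):
    # E (X - c)_+ ; with equal means, ordered stop-loss at all c <=> convex order
    return sum(fi * max(0, i - c) for i, fi in enumerate(f))

f = [0.25, 0.5, 0.25]                  # bi(2, 1/2), which is ULC
prev, fn = None, [1.0]
for n in range(1, 7):
    fn = conv(fn, f)                   # f^{*n}
    cur = thin(fn, 1 / n)              # T_{1/n}(f^{*n}); its mean is 1 for every n
    if prev is not None:
        # relation (20): the previous iterate is smaller in the convex order
        assert all(stop_loss(prev, c) <= stop_loss(cur, c) + 1e-12
                   for c in range(len(cur)))
    prev = cur
```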
On the other hand, it can be shown that $T_{1/n}(f^{*n})$ is log-concave for all $n$. Indeed, $f$ is ULC and hence log-concave. It is well known that convolution preserves log-concavity. That thinning preserves log-concavity is sometimes known as Brenti's criterion [7] in the combinatorics literature. Thus $T_{1/n}(f^{*n})$ remains log-concave. Actually, since $f$ is ULC, there holds the stronger relation

$$T_{1/n}(f^{*n}) \le_{lc} po(\lambda). \quad (21)$$

Relation (21) follows from i) if $f$ is ULC then so is $f^{*n}$ (Liggett [25]), and ii) if $f$ is ULC then so is $T_\alpha(f)$ (Johnson [19], Proposition 3.7).

The core of the proof of Theorem 3 is proving (20). The notions of majorization and Schur concavity briefly reviewed in Section V are helpful in formulating a more general (and easier to handle) version of (20).

Proposition 6: Let $Y_1, \ldots, Y_n$ be i.i.d. random variables on $Z_+$ with an ultra-log-concave pmf $f$. Conditional on the $Y_i$'s, let $Z_i$, $i = 1, \ldots, n$, be independent $Bi(Y_i, p_i)$ random variables, respectively, where $p_1, \ldots, p_n \in [0, 1]$. Let $\phi$ be a convex function on $Z_+$. Then $E\phi(\sum_{i=1}^n Z_i)$ is a Schur concave function of $(p_1, \ldots, p_n)$ on $[0, 1]^n$.

The proof of Proposition 6, somewhat technical, is collected in the appendix.

Proof of (20): Noting that

$$(1/n, \ldots, 1/n) \prec (1/(n-1), \ldots, 1/(n-1), 0),$$

the claim follows from Proposition 6 and the definition of Schur concavity.

Theorem 3 then follows from (20), (21) and Lemma 4.

Remark: Theorem 3 resembles the semigroup argument of Johnson [19] in that both are statements of "entropy increasing to the maximum," and both involve the convolution and thinning operations. The difference is that [19] considers convolution with a Poisson while we study the self-convolution $f^{*n}$.
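Relation (21) is also easy to verify directly for small cases: $\le_{lc}$ against $po(\lambda)$ amounts to concavity of the log-ratio $\log(f_i/po(i;\lambda))$, i.e., nonpositive second differences along the support. A sketch (ours) for $f = bi(2, 1/2)$, where $T_{1/n}(f^{*n}) = bi(2n, 1/(2n))$ and $\lambda = 1$:

```python
import math

def binom(n, p):
    return [math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]

def log_ratio_to_poisson(f, lam):
    # log(f_i / po(i; lam)) over supp(f)
    return [math.log(fi) - (i * math.log(lam) - lam - math.lgamma(i + 1))
            for i, fi in enumerate(f)]

for n in range(1, 8):
    r = log_ratio_to_poisson(binom(2 * n, 1 / (2 * n)), 1.0)
    # relation (21): the log-ratio is concave, i.e., second differences <= 0
    assert all(r[i + 1] - 2 * r[i] + r[i - 1] <= 1e-9
               for i in range(1, len(r) - 1))
```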
As mentioned in Section I, if we reverse the ULC assumption (but still assume log-concavity), then the conclusion of Theorem 3 is also reversed.

Theorem 7: Let $f$ be a pmf on $Z_+$ with mean $\lambda$. Assume $f$ is log-concave, and assume $po(\lambda) \le_{lc} f$. Then $H(T_{1/n}(f^{*n}))$ decreases in $n$.

Theorem 7 extends a minimum entropy result that parallels Theorem 6.

Proposition 7 ([36]): The $po(\lambda)$ distribution achieves minimum entropy among all pmfs $f$ with mean $\lambda$ such that $f$ is log-concave and $po(\lambda) \le_{lc} f$.

An example of Theorem 7, also noted in [36], is when $f$ is a geometric($p$) pmf, in which case $T_{1/n}(f^{*n}) = nb(n, n/(n-1+1/p))$. (Here $nb(n, p)$ denotes the negative binomial pmf with parameters $(n, p)$, i.e., $nb(n, p) = \{\binom{n+i-1}{i} p^n (1-p)^i,\ i = 0, 1, \ldots\}$.) In other words, the negative-binomial-to-Poisson convergence is monotone in entropy (as long as the first parameter of the negative binomial is at least 1).

The proof of Theorem 7 parallels that of Theorem 3. In place of (20) we have

$$T_{1/n}(f^{*n}) \le_{cx} T_{1/(n-1)}(f^{*(n-1)}), \quad (22)$$

assuming $po(\lambda) \le_{lc} f$. The proof of (20) applies after reversing the direction of $\le_{lc}$ in the relevant places. As noted before, since $f$ is log-concave, $T_{1/n}(f^{*n})$ is log-concave for all $n$. Thus Theorem 7 follows from Lemma 4 as does Theorem 3. Incidentally, we have

$$po(\lambda) \le_{lc} f \implies po(\lambda) \le_{lc} T_{1/n}(f^{*n}), \quad (23)$$

which is a reversal of (21). To prove (23), we note that, according to a result of Davenport and Pólya [11], $po(\lambda) \le_{lc} f$ implies $po(\lambda) \le_{lc} f^{*n}$. By a slight modification of the argument of Johnson ([19], Proposition 3.7), we can also show that $po(\lambda) \le_{lc} f$ implies $po(\lambda) \le_{lc} T_\alpha(f)$ (details omitted); thus (23) holds.

VII. RATE OF CONVERGENCE

Assuming that $f$ is a pmf on $Z_+$ with mean $\lambda$ and variance $\sigma^2 < \infty$, Harremoës et al.
([15], Corollary 9) show that

$$D(T_{1/n}(f^{*n})) \le \frac{\lambda}{n} + \frac{\sigma^2}{n\lambda}.$$

That is, the relative entropy converges at a rate of (at least) $O(n^{-1})$. We aim to improve this to $O(n^{-2})$ under some natural assumptions. The $O(n^{-2})$ rate is perhaps not surprising since, in the binomial case ([17]),

$$D(\mathrm{bi}(n, \lambda/n)) = O(n^{-2}), \quad n \to \infty. \tag{24}$$

We first use the stochastic orders $\le_{\mathrm{cx}}$ and $\le_{\mathrm{lc}}$ to extend (24) to ULC distributions.

Theorem 8: If $f$ is ULC on $\mathbf{Z}_+$ with mean $\lambda$, then

$$D(T_{1/n}(f^{*n})) \le \{n\lambda\}\, D(\mathrm{bi}(\lfloor n\lambda \rfloor + 1, 1/n)\,|\,\mathrm{po}(\lambda)) + (1 - \{n\lambda\})\, D(\mathrm{bi}(\lfloor n\lambda \rfloor, 1/n)\,|\,\mathrm{po}(\lambda)) \tag{25}$$

where $\{x\}$ and $\lfloor x \rfloor$ denote the fractional and integer parts of $x$, respectively.

Theorem 8 and (24) easily yield $D(T_{1/n}(f^{*n})) = O(n^{-2})$, $n \to \infty$, as long as $f$ is ULC.

To prove Theorem 8, we again adopt the strategy of Section VI. Proposition 8 is a variant of Lemma 4.

Proposition 8: Let $f$ and $g$ be pmfs on $\mathbf{Z}_+$ such that $f \le_{\mathrm{cx}} g$ and $g$ is ULC. Then $D(f) \ge D(g) + D(f\,|\,g)$.

We also have the following result, which is easily deduced from Theorem 3.A.13 of Shaked and Shanthikumar [30] (see also [39], Lemma 2). Plainly, it says that the convex order $\le_{\mathrm{cx}}$ is preserved under thinning.

Proposition 9: If $f$ and $g$ are pmfs on $\mathbf{Z}_+$ such that $f \le_{\mathrm{cx}} g$, then $T_\alpha f \le_{\mathrm{cx}} T_\alpha g$, $\alpha \in (0, 1)$.

Proof of Theorem 8: Let $g$ be the two-point pmf that assigns probability $\{n\lambda\}$ to $\lfloor n\lambda \rfloor + 1$ and the remaining probability to $\lfloor n\lambda \rfloor$. Note that the mean of $g$ is $n\lambda$. Also, the relation $g \le_{\mathrm{cx}} f^{*n}$ is intuitive and easily proven. Indeed, if $\phi$ is a convex function on $\mathbf{Z}_+$, then

$$\phi(x) \ge (x - \lfloor n\lambda \rfloor)\,\phi(\lfloor n\lambda \rfloor + 1) + (\lfloor n\lambda \rfloor + 1 - x)\,\phi(\lfloor n\lambda \rfloor).$$

The claim follows by taking the weighted average with respect to $f^{*n}$. By Proposition 9, $T_{1/n} g \le_{\mathrm{cx}} T_{1/n}(f^{*n})$. Since $f$ is ULC, so is $T_{1/n}(f^{*n})$. By Proposition 8,

$$D(T_{1/n}(f^{*n})) \le D(T_{1/n} g).$$
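The convex-order step just used, $g \le_{\mathrm{cx}} f^{*n}$, can be sanity-checked numerically through the stop-loss criterion $E\max\{X-k, 0\}$ of Proposition 10 in the appendix. A minimal sketch, with Bi(3, 0.35) as an arbitrary ULC choice for $f$ (helper names are illustrative, not from the paper):

```python
import numpy as np
from math import comb

def stop_loss(f, k):
    """E max(X - k, 0) for a pmf f supported on {0,...,len(f)-1}."""
    return sum(f[i] * (i - k) for i in range(k + 1, len(f)))

f = np.array([comb(3, j) * 0.35**j * 0.65**(3 - j) for j in range(4)])  # ULC: Bi(3, 0.35)
n = 5
fn = f.copy()
for _ in range(n - 1):
    fn = np.convolve(fn, f)                  # f^{*n}, support {0,...,15}

m = n * 3 * 0.35                             # n*lambda = 5.25
g = np.zeros(len(fn))
g[int(m) + 1] = m - int(m)                   # two-point pmf on {floor(m), floor(m)+1}
g[int(m)] = 1.0 - (m - int(m))               # ... with mean n*lambda

for k in range(len(fn)):                     # stop-loss criterion: g <=_cx f^{*n}
    assert stop_loss(g, k) <= stop_loss(fn, k) + 1e-12
```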
However, $T_{1/n} g$ is a mixture of two binomials:

$$T_{1/n} g = \{n\lambda\}\,\mathrm{bi}(\lfloor n\lambda \rfloor + 1, 1/n) + (1 - \{n\lambda\})\,\mathrm{bi}(\lfloor n\lambda \rfloor, 1/n).$$

Thus (25) holds by the convexity of the relative entropy.

Although (25) implies the right order of the convergence rate, the bound itself does not involve the variance of $f$. It is known that, if $f$ is ULC, then its variance $\sigma^2$ does not exceed its mean $\lambda$ ([19], [36]). It is intuitively reasonable that the closer $\sigma^2$ is to $\lambda$, the smaller $D(f)$ and $D(T_{1/n}(f^{*n}))$ are. Hence any bound that accounts for the variance $\sigma^2$ would be interesting. Of course, it would also be interesting to see the ULC assumption relaxed. Theorem 9 shows that the $O(n^{-2})$ rate holds under a finite support assumption. Note that, in the CLT case, an $O(n^{-1})$ rate of convergence for the relative entropy can be obtained under a "spectral gap" assumption ([1], [20]); possibly a similar assumption suffices in our case. Under the finite support assumption, however, the proof of Theorem 9 is elementary, although it does use a nontrivial subadditivity property of the scaled Fisher information ([23], [27]).

Theorem 9: Suppose $f$ is a pmf on $\mathbf{Z}_+$ with finite support and denote the mean and variance of $f$ by $\lambda$ and $\sigma^2$ respectively. Then

$$D(T_{1/n}(f^{*n})) = O(n^{-2}), \quad n \to \infty. \tag{26}$$

If $\lambda = \sigma^2$ in addition, then the right hand side of (26) can be replaced by $O(n^{-3})$.

Proof: Let us assume $\lambda > 0$ to eliminate the trivial case. For a pmf $g$ on $\mathbf{Z}_+$ with mean $\mu > 0$, define $K(g) = \mu \chi^2(S(g), g)$ as in (12). Madiman et al. ([27], Theorem III) show that $K(g^{*n})$ decreases in $n$. In particular, letting $g = T_{1/n}(f)$, and noting (12) and (2), we obtain

$$D(T_{1/n}(f^{*n})) \le K(T_{1/n}(f^{*n})) \le K(T_{1/n}(f)).$$

Thus, to prove (26), we only need $K(T_{1/n}(f)) = O(n^{-2})$.
By the definition of $K(\cdot)$ and (7), this is equivalent to

$$\chi^2(T_{1/n}(S(f)), T_{1/n}(f)) = O(n^{-1}). \tag{27}$$

However, for each $i \ge 0$ we have

$$(T_{1/n}(f))_i = \sum_{j=i}^{k} f_j\, \mathrm{bi}(i; j, 1/n) = n^{-i} \sum_{j=i}^{k} \binom{j}{i} f_j + O(n^{-i-1})$$

where $k$ is the largest integer such that $f_k \neq 0$; a similar expression holds for $T_{1/n}(S(f))$. By direct calculation, each term in the sum

$$\sum_{i=0}^{k} \frac{\big((T_{1/n}(S(f)))_i - (T_{1/n}(f))_i\big)^2}{(T_{1/n}(f))_i} \tag{28}$$

is $O(n^{-1})$, and (27) holds. If $\lambda = \sigma^2$, then each term in (28) is $O(n^{-2})$, thus proving the remaining claim.

Theorems 8 and 9 imply a corresponding rate of convergence for the total variation distance, which is defined as $V(g, \tilde g) = \sum_i |g_i - \tilde g_i|$ for any pmfs $g$ and $\tilde g$. The total variation is related to the relative entropy via Pinsker's inequality $V^2(g, \tilde g) \le 2 D(g\,|\,\tilde g)$. Hence, if $f$ is either ULC or has finite support, then $V(T_{1/n}(f^{*n}), \mathrm{po}(\lambda)) = O(n^{-1})$. An explicit upper bound, possibly via the Stein–Chen method, is of course desirable.

VIII. SUMMARY AND POSSIBLE EXTENSIONS

We have extended the monotonicity of entropy in the central limit theorem to a version of the law of small numbers, which involves the thinning operation (the discrete analogue of scaling), and a Poisson limit (the discrete counterpart of the normal). For a pmf $f$ on $\mathbf{Z}_+$ with mean $\lambda$, we show that the relative entropy $D(T_{1/n}(f^{*n})\,|\,\mathrm{po}(\lambda))$ decreases monotonically in $n$ (Theorem 2), and, if $f$ is ultra-log-concave, the entropy $H(T_{1/n}(f^{*n}))$ increases in $n$ (Theorem 3). In the process of establishing Theorem 2, inequalities are obtained for the relative entropy under thinning and convolution, and connections are made with logarithmic Sobolev inequalities and with the recent results of Kontoyiannis et al. [23] and Madiman et al. [27].
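Both Pinsker's inequality and the $O(n^{-1})$ total-variation rate of Section VII can be observed directly for a small finite-support pmf. A rough sketch; the pmf $(0.2, 0.5, 0.3)$, the test values of $n$, and the tolerance windows are arbitrary choices, not from the paper:

```python
import numpy as np
from math import comb, exp, factorial

def thin(f, alpha):
    """Binomial thinning T_alpha of a finite pmf."""
    return np.array([sum(f[j] * comb(j, k) * alpha**k * (1 - alpha)**(j - k)
                         for j in range(k, len(f))) for k in range(len(f))])

f = np.array([0.2, 0.5, 0.3])                    # finite support, mean 1.1, var 0.49
lam = float(np.dot(np.arange(3), f))

def dist_to_poisson(n):
    """Total variation V and relative entropy D between T_{1/n}(f^{*n}) and po(lam)."""
    fn = f.copy()
    for _ in range(n - 1):
        fn = np.convolve(fn, f)
    tn = thin(fn, 1.0 / n)                       # T_{1/n}(f^{*n}), support {0,...,2n}
    po = np.array([exp(-lam) * lam**i / factorial(i) for i in range(len(tn))])
    V = float(np.sum(np.abs(tn - po)) + (1.0 - po.sum()))   # include Poisson tail mass
    D = float(np.sum(tn[tn > 0] * np.log(tn[tn > 0] / po[tn > 0])))
    return V, D

for n in (8, 16, 32):
    V, D = dist_to_poisson(n)
    assert V**2 <= 2 * D + 1e-12                 # Pinsker's inequality

v8, v16, v32 = (dist_to_poisson(n)[0] for n in (8, 16, 32))
assert 1.4 < v8 / v16 < 3.0 and 1.4 < v16 / v32 < 3.0   # roughly halves as n doubles
```

Since here $\sigma^2 = 0.49 \neq \lambda = 1.1$, the leading term of the total variation is of exact order $1/n$, which is why doubling $n$ roughly halves the distance.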
Theorem 3, in contrast, is established by comparing pmfs with respect to the convex order, an idea that dates back to Karlin and Rinott [22].

This work is arguably more qualitative than quantitative, given its focus on monotonicity. When bounds are occasionally obtained, in Proposition 2 for example, we do not claim that they are always sharp. Among the large literature on Poisson approximation bounds (e.g., Barbour et al. [3]), the use of information-theoretic ideas is a relatively new development ([23], [27]). We have, however, obtained an upper bound and identified an $O(n^{-2})$ rate for the relative entropy under certain simple conditions. Such results complement those of [15] and [16].

The analogy with the CLT leads to further questions. For example, given the intimate connection between the information-theoretic CLT and Shannon's entropy power inequality (EPI), it is natural to ask whether there exists a discrete version of the EPI. By analogy with the CLT, our results seem to suggest that the answer is yes, although there is still much to be done. Certain simple formulations of the EPI do not hold in the discrete setting; see [41] for recent developments.

We may also consider extending our monotonicity results to compound Poisson limit theorems. Recently, Johnson et al. [21] (see also [39]) have shown that compound Poisson distributions admit a maximum entropy characterization similar to that of the Poisson. Such results suggest the possibility of compound Poisson limit theorems with the same appealing "entropy increasing to the maximum" interpretation.

Finally, on a more technical note, we point out a possible refinement of Theorem 2. This is analogous to the results of Yu [40], who noted that relative entropy is completely monotonic in the CLT for certain distribution families.
(A function is completely monotonic if its derivatives of all orders exist and alternate in sign; the definition is similar for discrete sequences; see Feller [12] for the precise statements.)

Theorem 10 ([40]): Let $X_i$, $i = 1, 2, \dots$, be i.i.d. random variables with distribution $F$, mean $\mu$, and variance $\sigma^2 \in (0, \infty)$. Then $D\big(\sum_{i=1}^n (X_i - \mu)/\sqrt{n\sigma^2}\,\big|\,N(0,1)\big)$ is a completely monotonic function of $n$ if $F$ is either a gamma distribution or an inverse Gaussian distribution.

Part of the reason that the gamma and inverse Gaussian distributions are considered is that they are analytically tractable. The result may conceivably hold for a wide class of distributions. We conclude with a discrete analogue based on numerical evidence.

Conjecture 1: Let $\lambda > 0$. Then
• $D(\mathrm{bi}(n, \lambda/n))$ is completely monotonic in $n$ ($n \ge \lambda$);
• $D(\mathrm{nb}(n, n/(\lambda + n)))$ is completely monotonic in $n$ ($n > 0$).

We again expect similar results for other pmfs, but are unable to prove even those for the binomial and the negative binomial.

APPENDIX
PROOF OF PROPOSITION 6

Let us recall a well-known characterization of the convex order (see [30], Theorem 3.A.1, for example).

Proposition 10: Let $X$ and $Y$ be random variables on $\mathbf{Z}_+$ such that $EX = EY < \infty$. Then $X \le_{\mathrm{cx}} Y$ if and only if

$$E\max\{X - k, 0\} \le E\max\{Y - k, 0\}, \quad k \ge 0,$$

or, equivalently,

$$\sum_{i \ge k} \Pr(X \ge i) \le \sum_{i \ge k} \Pr(Y \ge i), \quad k \ge 0.$$

Proposition 11: Fix $p \in (0, 1)$, and let $Y_1$ and $Y_2$ be i.i.d. random variables on $\mathbf{Z}_+$ with an ultra-log-concave pmf $f$. Let $Z_1, Z_2, Z'_1$ and $Z'_2$ be independent conditional on $Y_1$ and $Y_2$ and satisfy

$$Z_1|Y_1 \sim \mathrm{Bi}(Y_1, p + \delta), \quad Z_2|Y_2 \sim \mathrm{Bi}(Y_2, p - \delta),$$
$$Z'_1|Y_1 \sim \mathrm{Bi}(Y_1, p + \delta'), \quad Z'_2|Y_2 \sim \mathrm{Bi}(Y_2, p - \delta').$$

If $\delta > \delta' \ge 0$, then $Z_1 + Z_2 \le_{\mathrm{cx}} Z'_1 + Z'_2$.
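Before turning to the proof, Proposition 11 can be probed numerically: by Proposition 10, the partial sums $\sum_{i \ge k} \Pr(Z_1 + Z_2 \ge i)$ should decrease (componentwise in $k$) as $\delta$ grows. A minimal sketch, with Bi(5, 0.4) as an arbitrary ULC choice for the mixing pmf $f$ and illustrative helper names:

```python
import numpy as np
from math import comb

def bi_pmf(n, p, size):
    return np.array([comb(n, i) * p**i * (1 - p)**(n - i) if i <= n else 0.0
                     for i in range(size)])

def mixed_binomial(f, p):
    """pmf of Z where Z|Y ~ Bi(Y, p) and Y ~ f on {0,...,len(f)-1}."""
    return sum(f[y] * bi_pmf(y, p, len(f)) for y in range(len(f)))

def stop_loss_sums(f, p, delta):
    """phi_k = sum_{i>=k} Pr(Z1+Z2 >= i), the Proposition 10 criterion."""
    z = np.convolve(mixed_binomial(f, p + delta), mixed_binomial(f, p - delta))
    surv = z[::-1].cumsum()[::-1]          # surv[i] = Pr(Z1+Z2 >= i)
    return surv[::-1].cumsum()[::-1]       # phi[k] = sum_{i>=k} surv[i]

f = np.array([comb(5, k) * 0.4**k * 0.6**(5 - k) for k in range(6)])  # ULC: Bi(5, 0.4)
p = 0.5
prev = stop_loss_sums(f, p, 0.0)
for delta in (0.1, 0.2, 0.3):
    cur = stop_loss_sums(f, p, delta)
    assert np.all(cur <= prev + 1e-12)     # h(delta) decreases in delta
    prev = cur
```

Note that the mean of $Z_1 + Z_2$ is $2p\,EY$ regardless of $\delta$, so the first two partial sums stay constant while the rest strictly decrease.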
Proof: We show that, for each $k \ge 0$, $\sum_{i \ge k} \Pr(Z_1 + Z_2 \ge i)$ is a decreasing function of $\delta$ as long as $0 \le \delta \le \min\{p, 1-p\}$. The claim then follows from Proposition 10 (the assumptions imply $E(Z_1 + Z_2) = E(Z'_1 + Z'_2) < \infty$). To simplify the notation, in what follows the limits of summation, if not spelled out, are from $-\infty$ to $\infty$; also $f_i \equiv 0$ if $i < 0$. Denoting $B(i; n, p) = \sum_{j \ge i} \mathrm{bi}(j; n, p)$, and letting $h(\delta) = \sum_{i \ge k} \Pr(Z_1 + Z_2 \ge i)$, we have

$$h(\delta) = \sum_{i \ge k} \sum_j \Pr(Z_1 \ge j) \Pr(Z_2 = i - j) = \sum_j \Pr(Z_1 \ge j) \Pr(Z_2 \ge k - j) = \sum_j \Big(\sum_{s \ge 0} f_s B(j; s, p+\delta)\Big) \Big(\sum_{s \ge 0} f_s B(k-j; s, p-\delta)\Big) = \sum_{s,t \ge 0} f_s f_t\, v(s, t, \delta) \tag{29}$$

where

$$v(s, t, \delta) = \sum_j B(j; s, p+\delta)\, B(k-j; t, p-\delta).$$

Using the simple identity

$$\frac{dB(i; n, p)}{dp} = n\, \mathrm{bi}(i-1; n-1, p)$$

we get

$$\frac{dv(s, t, \delta)}{d\delta} = s\, u(s, t, \delta) - t\, u(t, s, -\delta)$$

where

$$u(s, t, \delta) = \sum_j \mathrm{bi}(j-1; s-1, p+\delta)\, B(k-j; t, p-\delta) = \sum_j \mathrm{bi}(k-j-1; s-1, p+\delta)\, B(j; t, p-\delta).$$

The quantity $u(s, t, \delta)$ has the following interpretation. If we let $V_1 \sim \mathrm{Bi}(s-1, p+\delta)$ and $V_2 \sim \mathrm{Bi}(t, p-\delta)$ independently, then $u(s, t, \delta) = \Pr(V_1 + V_2 \ge k - 1)$. Clearly

$$u(s, t, \delta) = u(t+1, s-1, -\delta). \tag{30}$$

Hence, we may take the derivative under the summation in (29) (by dominated convergence), and then apply (30) to obtain

$$\frac{dh(\delta)}{d\delta} = \sum_{s,t \ge 0} s f_s f_t\, u(s, t, \delta) - \sum_{s,t \ge 0} t f_s f_t\, u(t, s, -\delta) = \sum_{s \ge 1,\, t \ge 0} s f_s f_t\, u(s, t, \delta) - \sum_{s \ge 1,\, t \ge 0} (t+1) f_{s-1} f_{t+1}\, u(t+1, s-1, -\delta) = \sum_{s \ge 1,\, t \ge 0} [s f_s f_t - (t+1) f_{s-1} f_{t+1}]\, u(s, t, \delta). \tag{31}$$

By a change of variables $s \to t+1$ and $t \to s-1$ in (31), and by (30), we get

$$\frac{dh(\delta)}{d\delta} = \sum_{s \ge 1,\, t \ge 0} [(t+1) f_{t+1} f_{s-1} - s f_s f_t]\, u(s, t, -\delta).$$
(32)

Combining (31) and (32), and noting the symmetry, we obtain

$$\frac{dh(\delta)}{d\delta} = \sum_{1 \le s \le t} [s f_s f_t - (t+1) f_{t+1} f_{s-1}]\, [u(s, t, \delta) - u(s, t, -\delta)]. \tag{33}$$

Because $f$ is ULC, if $s \le t$, then

$$s f_s f_t \ge (t+1) f_{t+1} f_{s-1}. \tag{34}$$

We can also show ($s \le t$)

$$u(s, t, \delta) \le u(s, t, -\delta) \tag{35}$$

as follows. Let $W_1, W_2, W_3, W_4$ be independent random variables such that

$$W_1 \sim \mathrm{Bi}(s-1, p+\delta), \quad W_2 \sim \mathrm{Bi}(s-1, p-\delta),$$
$$W_3 \sim \mathrm{Bi}(t-s+1, p+\delta), \quad W_4 \sim \mathrm{Bi}(t-s+1, p-\delta).$$

Then

$$u(s, t, \delta) = \Pr(W_1 + W_2 + W_4 \ge k - 1); \quad u(s, t, -\delta) = \Pr(W_1 + W_2 + W_3 \ge k - 1).$$

Since $\delta \ge 0$, we have $W_4 \le_{\mathrm{st}} W_3$, which yields $W_1 + W_2 + W_4 \le_{\mathrm{st}} W_1 + W_2 + W_3$, and $u(s, t, \delta) \le u(s, t, -\delta)$ by the definition of $\le_{\mathrm{st}}$. Now (33), (34) and (35) give $dh(\delta)/d\delta \le 0$, i.e., $h(\delta)$ decreases in $\delta$.

Proof of Proposition 6: Given the basic properties of majorization, we only need to prove that $E\phi(\sum_{i=1}^n Z_i)$ is Schur-concave as a function of $(p_1, p_2)$ holding $p_3, \dots, p_n$ fixed. Define $\psi(z) = E\phi(z + \sum_{i=3}^n Z_i)$. Since $\phi$ is convex, so is $\psi$. (We may assume that $\psi$ is finite as the general case can be handled by a standard limiting argument.) Proposition 11, however, shows precisely that $E\psi(Z_1 + Z_2) = E\phi(\sum_{i=1}^n Z_i)$ is Schur-concave in $(p_1, p_2)$.

ACKNOWLEDGEMENT

The author would like to thank O. Johnson and D. Chafaï for stimulating discussions.

REFERENCES

[1] S. Artstein, K. M. Ball, F. Barthe, and A. Naor, "On the rate of convergence in the entropic central limit theorem," Probability Theory and Related Fields, vol. 129, no. 3, pp. 381–390, 2004.
[2] S. Artstein, K. M. Ball, F. Barthe, and A. Naor, "Solution of Shannon's problem on the monotonicity of entropy," J. Amer. Math. Soc., vol. 17, no. 4, pp. 975–982, 2004.
[3] A. D. Barbour, L. Holst, and S.
Janson, Poisson Approximation, Oxford Studies in Probability, vol. 2, Clarendon Press, Oxford, 1992.
[4] A. R. Barron, "Entropy and the central limit theorem," Ann. Probab., vol. 14, pp. 336–342, 1986.
[5] N. M. Blachman, "The convolution inequality for entropy powers," IEEE Trans. Inform. Theory, vol. 11, pp. 267–271, 1965.
[6] S. Bobkov and M. Ledoux, "On modified logarithmic Sobolev inequalities for Bernoulli and Poisson measures," J. Funct. Anal., vol. 156, no. 2, pp. 347–365, 1998.
[7] F. Brenti, "Unimodal, log-concave, and Pólya frequency sequences in combinatorics," Mem. Amer. Math. Soc., vol. 81, no. 413, 1989.
[8] D. Chafaï, "Entropies, convexity, and functional inequalities: on Φ-entropies and Φ-Sobolev inequalities," J. Math. Kyoto Univ., vol. 44, no. 2, pp. 325–363, 2004.
[9] D. Chafaï, "Binomial-Poisson entropic inequalities and the M/M/∞ queue," ESAIM Probab. Statist., vol. 10, pp. 317–339, 2006.
[10] T. Cover and J. Thomas, Elements of Information Theory, 2nd ed., New York: Wiley, 2006.
[11] H. Davenport and G. Pólya, "On the product of two power series," Canad. J. Math., vol. 1, pp. 1–5, 1949.
[12] W. Feller, An Introduction to Probability Theory and Its Applications, vol. 2, Wiley, New York, 1966.
[13] G. H. Hardy, J. E. Littlewood, and G. Pólya, Inequalities, Cambridge Univ. Press, Cambridge, U.K., 1964.
[14] P. Harremoës, "Binomial and Poisson distributions as maximum entropy distributions," IEEE Trans. Inform. Theory, vol. 47, no. 5, pp. 2039–2041, Jul. 2001.
[15] P. Harremoës, O. Johnson, and I. Kontoyiannis, "Thinning and the law of small numbers," in Proc. IEEE International Symposium on Information Theory, Nice, France, Jun. 2007.
[16] P. Harremoës, O. Johnson, and I. Kontoyiannis, "Thinning and information projections," in Proc. IEEE International Symposium on Information Theory, Toronto, Canada, Jul. 2008.
[17] P.
Harremoës and P. S. Ruzankin, "Rate of convergence to Poisson law in terms of information divergence," IEEE Trans. Inform. Theory, vol. 50, no. 9, pp. 2145–2149, 2004.
[18] W. Hoeffding, "On the distribution of the number of successes in independent trials," Ann. Math. Statist., vol. 27, pp. 713–721, 1956.
[19] O. Johnson, "Log-concavity and the maximum entropy property of the Poisson distribution," Stochastic Processes and their Applications, vol. 117, no. 6, pp. 791–802, 2007.
[20] O. Johnson and A. R. Barron, "Fisher information inequalities and the central limit theorem," Probability Theory and Related Fields, vol. 129, no. 3, pp. 391–409, 2004.
[21] O. Johnson, I. Kontoyiannis, and M. Madiman, "On the entropy and log-concavity of compound Poisson measures," Preprint, 2008, http://arxiv.org/abs/0805.4112v1
[22] S. Karlin and Y. Rinott, "Entropy inequalities for classes of probability distributions I. The univariate case," Adv. Appl. Prob., vol. 13, pp. 93–112, 1981.
[23] I. Kontoyiannis, P. Harremoës, and O. T. Johnson, "Entropy and the law of small numbers," IEEE Trans. Inform. Theory, vol. 51, no. 2, pp. 466–472, Feb. 2005.
[24] E. H. Lieb, "Proof of an entropy conjecture of Wehrl," Communications in Mathematical Physics, vol. 62, no. 1, pp. 35–41, Aug. 1978.
[25] T. M. Liggett, "Ultra logconcave sequences and negative dependence," J. Combin. Theory Ser. A, vol. 79, no. 2, pp. 315–325, 1997.
[26] M. Madiman and A. Barron, "Generalized entropy power inequalities and monotonicity properties of information," IEEE Trans. Inform. Theory, vol. 53, no. 7, pp. 2317–2329, Jul. 2007.
[27] M. Madiman, O. Johnson and I. Kontoyiannis, "Fisher information, compound Poisson approximation and the Poisson channel," in Proc. IEEE International Symposium on Information Theory, Nice, France, Jun. 2007.
[28] A. W. Marshall and I. Olkin.
Inequalities: Theory of Majorization and Its Applications, Academic Press, New York, 1979.
[29] A. Rényi, "A characterization of Poisson processes," Magyar Tud. Akad. Mat. Kutató Int. Közl., vol. 1, pp. 519–527, 1956.
[30] M. Shaked and J. G. Shanthikumar, Stochastic Orders, Springer, New York, 2007.
[31] D. Shlyakhtenko, "A free analogue of Shannon's problem on monotonicity of entropy," Adv. in Math., vol. 208, no. 2, pp. 824–833, Jan. 2007.
[32] D. Shlyakhtenko, "Shannon's monotonicity problem for free and classical entropy," Proc. Nat. Acad. Sci. USA, vol. 104, no. 39, pp. 15254–15258, Sep. 2007.
[33] A. J. Stam, "Some inequalities satisfied by the quantities of information of Fisher and Shannon," Inform. Contr., vol. 2, no. 2, pp. 101–112, Jun. 1959.
[34] A. M. Tulino and S. Verdú, "Monotonic decrease of the non-Gaussianness of the sum of independent random variables: a simple proof," IEEE Trans. Inform. Theory, vol. 52, no. 9, pp. 4295–4297, Sep. 2006.
[35] L. Wu, "A new modified logarithmic Sobolev inequality for Poisson point processes and several applications," Probab. Theory and Related Fields, vol. 118, no. 3, pp. 427–438, 2000.
[36] Y. Yu, "Relative log-concavity and a pair of triangle inequalities," Technical Report, Department of Statistics, University of California, Irvine, 2008.
[37] Y. Yu, "On the maximum entropy properties of the binomial distribution," IEEE Trans. Inform. Theory, vol. 54, no. 7, pp. 3351–3353, Jul. 2008.
[38] Y. Yu, "On an inequality of Karlin and Rinott concerning weighted sums of i.i.d. random variables," Adv. Appl. Prob., vol. 40, no. 4, pp. 1223–1226, 2008.
[39] Y. Yu, "On the entropy of compound distributions on nonnegative integers," Accepted, IEEE Trans. Inform. Theory, 2009.
[40] Y.
Yu, "Complete monotonicity of the entropy in the central limit theorem for gamma and inverse Gaussian distributions," Stat. Prob. Lett., vol. 79, pp. 270–274, 2009.
[41] Y. Yu and O. Johnson, "Concavity of entropy under thinning," Accepted, Proc. IEEE International Symposium on Information Theory, Seoul, Korea, 2009.