A note on entropy estimation

Thomas Schürmann
Jülich Supercomputing Centre, Jülich Research Centre, 52425 Jülich, Germany

8 June 2015

Abstract. We compare an entropy estimator $\hat{H}_z$ recently discussed in [10] with two estimators $\hat{H}_1$ and $\hat{H}_2$ introduced in [6][7]. We prove the identity $\hat{H}_z \equiv \hat{H}_1$, which has not been taken into account in [10]. Then, we prove that the statistical bias of $\hat{H}_1$ is less than the bias of the ordinary likelihood estimator of entropy. Finally, by numerical simulation we verify that for the most interesting regime of small-sample estimation and large event spaces, the estimator $\hat{H}_2$ has a significantly smaller statistical error than $\hat{H}_z$.

Keywords: Shannon entropy, Entropy estimation, Bias analysis, Diversity index, Probability and statistics, Data analysis

1. Introduction

Symbolic sequences are typically characterized by an alphabet $A$ of $d$ different letters. We assume statistical stationarity, i.e. any letter block (word or $n$-gram of constant length) $w_i$, $i = 1, \ldots, M$, can be expected at any chosen site to occur with a known probability $p_i = \mathrm{prob}(w_i)$ and $\sum_{i=1}^{M} p_i = 1$. In a classic paper published in 1951, Shannon considered the problem of estimating the entropy

$$H = -\sum_{i=1}^{M} p_i \log p_i, \qquad (1)$$

of ordinary English [1]. In principle, this might be done by dealing with longer and longer contexts until dependencies at the word level, phrase level, sentence level, paragraph level, chapter level, and so on, have all been taken into account in the statistical analysis. In practice, however, this is quite impractical, for as the context grows, the number $M$ of possible words explodes exponentially with $n$.

In the numerical estimation of the Shannon entropy one can do frequency counting; hence in the limit of large data sets $N$, the relative frequency distribution yields an estimate of the underlying probability distribution. We consider samples of $N$ independent observations, and let $k_i$, $i = 1, \ldots, M$, be the frequency of realization $w_i$ in the ensemble. However, with the choice $\hat{p}_i = k_i/N$, the naive (or likelihood) estimate

$$\hat{H}_0 = -\sum_{i=1}^{M} \hat{p}_i \log \hat{p}_i \qquad (2)$$

leads to a systematic underestimation of the Shannon entropy [2][3][4][5][6][7]. In particular, if $M$ is of the order of the number of data points $N$, then fluctuations increase and estimates usually become significantly biased. By bias we denote the deviation of the expectation value of an estimator from the true value. In general, the problem in estimating functions of probability distributions is to construct an estimator whose estimates both fluctuate with the smallest possible variance and are least biased.

On the other hand, there is the Bayesian approach to entropy estimation, building upon an approach introduced in [8], or a generalization recently proposed in [9]. There, the basic strategy is to place a prior over the space of probability distributions and then perform inference using the induced posterior distribution over entropy. Actually, a partial numerical comparison of the popular Bayesian entropy estimates and those discussed hereinafter can be found in [9]. Unfortunately, these simulations only consider the bias of the entropy estimates but not their mean square error, which takes into account the important trade-off between bias and variance.
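As an illustrative aside (not part of the original analysis), the plug-in estimate (2) can be computed directly from a vector of observed frequencies $k_i$. The following is a minimal sketch in Python with numpy; the function name and the use of the natural logarithm (entropy in nats) are our own choices:

```python
import numpy as np

def entropy_mle(counts):
    """Naive (likelihood) estimator H_0 of Eq. (2), computed from the counts k_i."""
    counts = np.asarray(counts, dtype=float)
    p_hat = counts[counts > 0] / counts.sum()   # relative frequencies; zero counts drop out
    return -np.sum(p_hat * np.log(p_hat))       # entropy in nats
```

For example, entropy_mle([4, 3, 2, 1]) returns the plug-in entropy of the corresponding relative frequencies.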
However, in the considerations to be discussed below, for what we intend to demonstrate, no explicit prior information on distributions is assumed, and we will confine ourselves to non-Bayesian entropy estimates only.

To start with, let us consider an estimator of the Shannon entropy which has recently been proposed and analyzed against the likelihood estimator [10]. The development of this interesting estimator starts with a generalization of the diversity index proposed by Simpson in 1949 [11] and refers to the following representation of the Shannon entropy‡

$$H = \sum_{\nu=1}^{\infty} \frac{1}{\nu} \sum_{i=1}^{M} p_i (1-p_i)^{\nu}. \qquad (3)$$

In [10], it has been mentioned that there exists an interesting estimator of each term in (3), which is unbiased up to the order $\nu = N-1$, namely $Z_\nu/\nu$, where $Z_\nu$ is explicitly given by the expression

$$Z_\nu = N^{1+\nu}\, \frac{(N-\nu-1)!}{N!} \sum_{i=1}^{M} \frac{k_i}{N} \prod_{j=0}^{\nu-1} \left(1 - \frac{k_i}{N} - \frac{j}{N}\right), \qquad (4)$$

such that

$$\hat{H}_z = \sum_{\nu=1}^{N-1} \frac{1}{\nu} Z_\nu \qquad (5)$$

is a statistically consistent entropy estimator of $H$ with (negative) bias

$$B_N = -\sum_{\nu=N}^{\infty} \frac{1}{\nu} \sum_{i=1}^{M} p_i (1-p_i)^{\nu}. \qquad (6)$$

Indeed, the estimator is notable because a uniform variance upper bound has been proven in [10] that decays at a rate of $O(\log(N)/N)$ for all distributions with finite entropy, compared to $O((\log(N))^2/N)$ for the ordinary likelihood estimator established in [13]. It should be mentioned here that the latter decay rate is an implication of the Efron-Stein inequality, whereas the former (faster) decay rate is derived within the completely different approach introduced in [10]. Actually, it seems hard to prove the same decay rate for the likelihood estimator.

In the following section, we will show that $\hat{H}_z$ is algebraically equivalent to the estimator [6][7]

$$\hat{H}_1 = \sum_{i=1}^{M} \frac{k_i}{N} \left(\psi(N) - \psi(k_i)\right), \qquad (7)$$

where the summation is defined for all $k_i > 0$ and the digamma function $\psi(k)$ is the logarithmic derivative of the Gamma function [14]. Actually, the estimator (7) is given for the choice $\xi = 1$ in [7] (Eq. (28) therein). In the asymptotic regime $k_i \gg 1$ this estimator leads to the ordinary Miller correction $\hat{H}_1 \sim \hat{H}_0 + (M-1)/2N$. This can be seen by using the asymptotic relation $\psi(x) \sim \log(x) - 1/2x$. The mathematical expression of the bias of $\hat{H}_1$ has also been derived in [7] and is explicitly given by

$$B_N^{(1)} = -\sum_{i=1}^{M} p_i \int_0^{1-p_i} \frac{t^{N-1}}{1-t}\, dt, \qquad (8)$$

with a uniform upper bound

$$\left|B_N^{(1)}\right| \le \frac{M}{N}. \qquad (9)$$

The proof of the identity $B_N \equiv B_N^{(1)}$ will be suppressed here because it is sufficient to show the equivalence of the corresponding entropy estimators in the following section. It should be mentioned that the numerical computation of the estimator $\hat{H}_1$ is significantly faster than that of $\hat{H}_z$. Actually, this improvement has not been taken into account in reference [9] (Fig. 11), where the authors still used expression (5) above.

In the third section, by numerical computation we compare the mean square error of $\hat{H}_z$ with an entropy estimator corresponding to $\xi = 1/2$ in Eq. (13) of [7] (see also Eq. (35) of [6]), which is explicitly given by the following representation

$$\hat{H}_2 = \sum_{i=1}^{M} \frac{k_i}{N} \left(\psi(N) - \psi(k_i) + \log(2) + \sum_{j=1}^{k_i-1} \frac{(-1)^j}{j}\right). \qquad (10)$$

This estimator is an extension of $\hat{H}_1$ by an oscillating term in the bracket on the right-hand side of (7).

‡ For another interpretation of this representation see [12].
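To make the later comparisons concrete, the estimators (5), (7) and (10) can be sketched as follows (Python with numpy and scipy; the function names are ours, and folding the factorial prefactor of (4) into the product is merely a numerical convenience to avoid overflow, not part of [10]):

```python
import numpy as np
from scipy.special import digamma

def entropy_z(counts):
    """H_z of [10], Eqs. (4)-(5). The prefactor N^(1+nu)(N-nu-1)!/N! is folded into
    the product, i.e. Z_nu = sum_i (k_i/N) * prod_{j=0}^{nu-1} (N - k_i - j)/(N - 1 - j)."""
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts > 0]
    N = int(counts.sum())
    h = 0.0
    for nu in range(1, N):
        j = np.arange(nu)                                   # j = 0, ..., nu-1
        factors = (N - counts[:, None] - j) / (N - 1.0 - j)
        z_nu = np.sum(counts / N * np.prod(factors, axis=1))
        h += z_nu / nu
    return h

def entropy_1(counts):
    """H_1 of Eq. (7): sum_i (k_i/N) [psi(N) - psi(k_i)], over all k_i > 0."""
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts > 0]
    N = counts.sum()
    return np.sum(counts / N * (digamma(N) - digamma(counts)))

def entropy_2(counts):
    """H_2 of Eq. (10): H_1 plus the oscillating term log(2) + sum_{j=1}^{k_i-1} (-1)^j / j."""
    counts = np.asarray(counts, dtype=int)
    counts = counts[counts > 0]
    N = counts.sum()
    osc = np.array([np.log(2.0) + np.sum((-1.0) ** np.arange(1, k) / np.arange(1, k))
                    for k in counts])
    return np.sum(counts / N * (digamma(float(N)) - digamma(counts.astype(float)) + osc))
```

All three functions take the same count vector as entropy_mle above; only symbols with $k_i > 0$ contribute to the sums.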
In both [6] and [7], this estimator has not been expressed in terms of a finite sum, but by integral expressions or infinite sum representations instead. However, it can easily be shown that the present form is equivalent to those in [6][7], but the computation is less time-consuming. The bias of the estimator (10) is [7]

$$B_N^{(2)} = -\sum_{i=1}^{M} p_i \int_0^{1-2p_i} \frac{t^{N-1}}{1-t}\, dt, \qquad (11)$$

with uniform upper bound

$$\left|B_N^{(2)}\right| \le \frac{M+1}{2N}. \qquad (12)$$

Now, when we look at the right-hand sides of (9) and (12), we see that they mainly differ by a factor 2 in the denominator. This has the implication that $|B_N^{(2)}| < |B_N^{(1)}|$ for all $N$ and $M \ge 2$. Thus, we can expect a faster convergence of $\hat{H}_2$ for sufficiently large $M$ and not very strongly peaked probability distributions. Actually, these are the distributions we are mainly interested in. The numerical comparison of the mean square error of $\hat{H}_z$ and $\hat{H}_2$ will be evaluated for the uniform probability distribution, the Zipf distribution and for the zero-entropy delta distribution.

2. Comparison of $\hat{H}_z$ and $\hat{H}_1$

In this section, we show the identity $\hat{H}_z \equiv \hat{H}_1$. Therefore, let $Z_{i,\nu}$ denote the $i$-th term of (4),

$$Z_{i,\nu} = N^{1+\nu}\, \frac{(N-\nu-1)!}{N!}\, \frac{k_i}{N} \prod_{j=0}^{\nu-1} \left(1 - \frac{k_i}{N} - \frac{j}{N}\right). \qquad (13)$$

By expanding each factor of the product with $N$, this expression can be rewritten as

$$Z_{i,\nu} = \frac{(N-\nu-1)!}{N!}\, k_i \prod_{j=0}^{\nu-1} (N - k_i - j). \qquad (14)$$

Next, the product is reformulated as a quotient of factorials, i.e.

$$\prod_{j=0}^{\nu-1} (N - k_i - j) = \frac{(N-k_i)!}{(N-k_i-\nu)!}, \qquad (15)$$

and in terms of binomial coefficients we get

$$Z_{i,\nu} = \frac{k_i}{N-\nu} \binom{N-\nu}{k_i} \bigg/ \binom{N}{k_i}. \qquad (16)$$

Now, the $i$-th term of the estimator (5) is obtained by summation over $\nu$, i.e.

$$\sum_{\nu=1}^{N-1} \frac{1}{\nu} Z_{i,\nu} = \binom{N}{k_i}^{-1} k_i \sum_{\nu=1}^{N-1} \frac{1}{\nu (N-\nu)} \binom{N-\nu}{k_i} = \frac{k_i}{N} \left(H_{N-1} - H_{k_i-1}\right), \qquad (17)$$

where $H_k = \sum_{n=1}^{k} 1/n$ is the $k$-th harmonic number [14]. Applying the identity $H_{k-1} = \psi(k) + \gamma$ (with $\gamma = 0.5772\ldots$, the Euler-Mascheroni constant) and summation over $i = 1, 2, \ldots, M$, we obtain the estimator (7), which proves the identity $\hat{H}_z \equiv \hat{H}_1$.

In addition, we have the following

Proposition. The estimator $\hat{H}_1$ is less biased than the likelihood estimator $\hat{H}_0$.

Proof. Since we know from [7] that the bias of $\hat{H}_1$ is negative, it is sufficient to prove that $\psi(N) - \psi(k) > \log \frac{N}{k}$, for $0 < k < N$. The following inequalities [14]

$$\psi(N) \ge \log\left(N - \tfrac{1}{2}\right) \qquad (18)$$

$$\psi(k) \le \log(k) - \frac{1}{2k} \qquad (19)$$

can be applied such that we only have to check that

$$N > \frac{1/2}{1 - e^{-\frac{1}{2k}}}. \qquad (20)$$

Now, for any finite $k > 0$, the inequality $1 + \frac{1}{2k} < \exp\left(\frac{1}{2k}\right)$ is satisfied. The proof is by Taylor series expansion of the exponential function. From this, by simple algebraic manipulations, it follows that the right-hand side of (20) is less than $k + \frac{1}{2}$, for any finite $k > 0$. It follows that (20) is satisfied for any $k$ with $0 < k < N$. This proves that $\hat{H}_1$ is less biased than $\hat{H}_0$. □

3. Numerical comparison of $\hat{H}_z$ and $\hat{H}_2$

In this section, we will focus on the convergence rates of the root mean square error (RMSE) of $\hat{H}_z$ and $\hat{H}_2$. Here, the RMSE is defined by

$$\mathrm{RMSE} = \sqrt{E\!\left[(\hat{H} - H)^2\right]}. \qquad (21)$$

We choose this error measure because it takes into account the trade-off between bias and variance.
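As an illustrative complement (our own sketch, not part of the original note), the identity proved above and the RMSE (21) can be checked by simulation. The sketch below assumes the functions entropy_mle, entropy_z, entropy_1 and entropy_2 from the earlier sketches; the helper name rmse, the random seed and the number of Monte Carlo runs are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Numerical spot check of the identity H_z == H_1 on random count vectors
for _ in range(5):
    k = rng.multinomial(30, np.ones(10) / 10)
    assert np.isclose(entropy_z(k), entropy_1(k))

def rmse(estimator, p, N, runs=2000):
    """Monte Carlo estimate of the RMSE (21) for samples of size N drawn from p."""
    p = np.asarray(p, dtype=float)
    H_true = -np.sum(p[p > 0] * np.log(p[p > 0]))
    values = [estimator(rng.multinomial(N, p)) for _ in range(runs)]
    return np.sqrt(np.mean((np.asarray(values) - H_true) ** 2))

# Example: uniform distribution with M = 100 and a small sample size
p_uniform = np.ones(100) / 100
for est in (entropy_mle, entropy_1, entropy_2):
    print(est.__name__, rmse(est, p_uniform, N=20))
```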
Moreover, we want to mention that there is a slightly modified version $\hat{H}_z^*$ of the estimator $\hat{H}_z$, defined in Eq. (12) of [10]. Since the bias $B_N$ of $\hat{H}_z$ is explicitly known, a correction is defined by subtracting the bias term $B_N$ with $p_i$ replaced by its estimate $\hat{p}_i$. The modified estimator is then given by $\hat{H}_z^* = \hat{H}_z - \hat{B}_N$, where $\hat{B}_N$ is the plug-in estimator of $B_N$ (a minimal numerical sketch of this correction is given after the Summary). For reasons of simplicity, we refrain from applying the same procedure of bias correction to the estimator $\hat{H}_2$.

Our first data sample is taken from the uniform probability distribution $p_i = 1/M$, for $i = 1, 2, \ldots, M$. In addition, we consider the (right-tailed) Zipf distribution with $p_i = c/i$, for $i = 1, 2, \ldots, M$ and normalization constant $c = 1/H_M$ (reciprocal of the $M$-th harmonic number). The statistical error for increasing sample size $N$ and given $M$ is shown in Fig. 1 and Fig. 2. As we can see, the RMSE of all estimators is monotonically decreasing in $N$. The convergence of the naive estimator $\hat{H}_0$ is rather slow compared to the other estimators, while the performance of $\hat{H}_z^*$ is slightly better than that of $\hat{H}_z$. On the other hand, the statistical error of $\hat{H}_2$ is significantly smaller than the statistical error of $\hat{H}_z$ and $\hat{H}_z^*$, and this behaviour seems to be representative for large $M$.

The statistical error for increasing $M$ and fixed sample size $N$ is shown in Fig. 3 and Fig. 4. For $M \gg N$, the RMSE of $\hat{H}_z$ and $\hat{H}_z^*$ is greater than that of $\hat{H}_2$. This phenomenon reflects the fact that the bias reduction becomes more and more relevant for increasing $M$, compared to the contribution of the variance. As we can see from both examples, the gap between $\hat{H}_z^*$ and $\hat{H}_2$ is slightly smaller for the peaked Zipf distribution compared to the uniform distribution. Thus, we ask for the performance in the extreme case of the delta distribution $p_i = \delta_{i,1}$, which has entropy zero. Indeed, in this special case we have $\hat{H}_0 = \hat{H}_1 = \hat{H}_z = \hat{H}_z^* = 0$ for any sample size $N$, but $\hat{H}_2 = \log(2) + \sum_{j=1}^{N-1} (-1)^j/j \to 0$ for $N \to \infty$. Actually, in this case the statistical error of the latter scales like $\sim 1/(2N)$ for large $N$.

4. Summary

In the present note, we classified the entropy estimator $\hat{H}_z$ of [10] within the family of entropy estimators originally introduced in [7]. This reveals an interesting connection between two different approaches to entropy estimation, one coming from the generalization of the diversity index of Simpson and the other one coming from the estimation of $p_i^q$ in the family of Rényi entropies. This connection is explicitly established by the identity $\hat{H}_z \equiv \hat{H}_1$. In addition, we proved that the statistical bias of $\hat{H}_1$ is smaller than the bias of the likelihood estimator $\hat{H}_0$. Furthermore, by numerical computation for various probability distributions, we found that $\hat{H}_z$ (or the heuristic estimator $\hat{H}_z^*$) can be improved upon by the estimator $\hat{H}_2$, which is an excellent member of the estimator family in [6][7]. On the other hand, there is a uniform variance upper bound of $\hat{H}_z$ (and therefore of $\hat{H}_1$) that decays at a rate of $O(\log(N)/N)$ for all distributions with finite entropy [10]. It would be interesting to know whether this variance bound also holds for the estimator $\hat{H}_0$ or $\hat{H}_2$. The answer might be found in a forthcoming publication.
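For completeness, here is a minimal sketch of the plug-in bias correction $\hat{H}_z^* = \hat{H}_z - \hat{B}_N$ described in Section 3 (Python with numpy, reusing entropy_z from the earlier sketch). Evaluating the tail of the series in (6) via $\sum_{\nu \ge 1} x^\nu/\nu = -\log(1-x)$ and the function names are our own choices; this is only an approximation of Eq. (12) of [10] as described above, not a verified reproduction of it.

```python
import numpy as np

def bias_plugin(counts):
    """Plug-in estimate of the (negative) bias B_N of Eq. (6), with p_i -> k_i/N.
    The tail sum_{nu=N}^inf x^nu/nu (x = 1 - p_hat) is evaluated as
    -log(1 - x) minus the partial sum up to nu = N - 1."""
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts > 0]
    N = int(counts.sum())
    p_hat = counts / N
    nu = np.arange(1, N)                               # nu = 1, ..., N-1
    x = 1.0 - p_hat
    partial = np.sum(x[:, None] ** nu / nu, axis=1)    # finite part of the series
    tail = -np.log(p_hat) - partial                    # -log(1 - x) = -log(p_hat)
    return -np.sum(p_hat * tail)

def entropy_z_corrected(counts):
    """Heuristic bias-corrected estimator H*_z = H_z - B_N(plug-in)."""
    return entropy_z(counts) - bias_plugin(counts)
```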
References

[1] Shannon C E and Weaver W, The Mathematical Theory of Communication (University of Illinois Press, Urbana, IL, 1949)
[2] Miller G, Note on the bias of information estimates. In Quastler H, ed., Information Theory in Psychology II-B, pp 95 (Free Press, Glencoe, IL, 1955)
[3] Harris B, The statistical estimation of entropy in the non-parametric case. In Csiszar I, ed., Topics in Information Theory, pp 323 (North-Holland, Amsterdam, 1975)
[4] Herzel H 1988 Sys. Anal. Mod. Sim. 5 435
[5] Schürmann T and Grassberger P 1996 Chaos 6 414
[6] Grassberger P 2003 arXiv:physics/0307138
[7] Schürmann T 2004 J. Phys. A: Math. Gen. 37 L295; arXiv:cond-mat/0403192
[8] Nemenman I, Shafee F and Bialek W 2002, Entropy and inference, revisited. In Advances in Neural Information Processing Systems, pp 471-478 (MIT Press, Cambridge, MA, 2002)
[9] Archer E, Park I M and Pillow J W 2014 J. Machine Learning Res. 15 2833
[10] Zhang Z 2012 Neural Computation 24 1368
[11] Simpson E H 1949 Nature 163 688
[12] Montgomery-Smith S and Schürmann T 2005 (private communication); arXiv:1410.5002 (uploaded 2014)
[13] Antos A and Kontoyiannis I 2001 Random Structures & Algorithms 19 163
[14] Abramowitz M and Stegun I A, eds., Handbook of Mathematical Functions (Dover, New York, 1965)

[Figure 1: plot of RMSE versus sample size N.]
Figure 1. Statistical error of $\hat{H}_0$, $\hat{H}_z$, $\hat{H}_z^*$ and $\hat{H}_2$, for the uniform probability distribution with M = 100 (see text). The RMSE of $\hat{H}_2$ is significantly smaller than that of $\hat{H}_z$ and $\hat{H}_z^*$. The exact value of the entropy is H = 5.3.

[Figure 2: plot of RMSE versus sample size N.]
Figure 2. Same as in Fig. 1, but for Zipf's probability distribution (see text). The exact value of the entropy is H = 3.68.

[Figure 3: plot of RMSE versus alphabet size M.]
Figure 3. Statistical error of $\hat{H}_0$, $\hat{H}_z$, $\hat{H}_z^*$ and $\hat{H}_2$, for sample size N = 10 in the instance of the uniform probability distribution. Small-sample estimation is expected when M is above the sample size N.

[Figure 4: plot of RMSE versus alphabet size M.]
Figure 4. Same as in Fig. 3, but for the Zipf distribution. There is a crossover for M ≈ N.
