Statistical computation of Boltzmann entropy and estimation of the optimal probability density function from statistical sample
Authors: Ning Sui, Min Li, Ping He
Mon. Not. R. Astron. Soc. 000, 000–000 (0000)    Printed 7 September 2021    (MN LaTeX style file v2.2)

Ning Sui (1), Min Li (2) and Ping He (1,3,4)
(1) College of Physics, Jilin University, Changchun 130012, China
(2) Changchun Artificial Satellite Observatory, Chinese Academy of Sciences, Changchun 130117, China
(3) Center for High Energy Physics, Peking University, Beijing 100871, China
(4) State Key Laboratory of Theoretical Physics, Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing 100190, China

7 September 2021

ABSTRACT
In this work, we investigate the statistical computation of the Boltzmann entropy of statistical samples. For this purpose, we use both histogram and kernel function estimates of the probability density function of the samples. We find that, due to coarse-graining, the entropy is a monotonically increasing function of the bin width for the histogram, or of the bandwidth for kernel estimation, which makes it difficult to select an optimal bin width or bandwidth for computing the entropy. Fortunately, we notice that there exists a minimum of the first derivative of the entropy for both histogram and kernel estimation, and that this minimum point of the first derivative asymptotically points to the optimal bin width or bandwidth. We have verified these findings with a large number of numerical experiments. Hence, we suggest that the minimum of the first derivative of the entropy be used as a selector for the optimal bin width or bandwidth of density estimation. Moreover, the optimal bandwidth selected by the minimum of the first derivative of the entropy is purely data-based, independent of the unknown underlying probability density distribution, and in this respect it is clearly superior to the existing estimators.
Our results are not restricted to the one-dimensional case, but can also be extended to multivariate cases. It should be emphasized, however, that we do not provide a robust mathematical proof of these findings, and we leave these issues with those who are interested in them.

Key words: methods: data analysis – methods: numerical – methods: statistical – cosmology: theory – large-scale structure of Universe.

⋆ E-mail: hep@itp.ac.cn

1 INTRODUCTION

Entropy is very important in thermodynamics and statistical mechanics. In fact, it is the key concept upon which equilibrium statistical mechanics is formulated (Landau & Lifshitz 1996; Huang 1987), and from which all the other thermodynamical quantities can be derived. Also, the increase of entropy shows the time-evolutionary direction of a thermodynamical system. The concept of entropy can even be applied to non-thermodynamical systems such as information theory, in which entropy is a measure of the uncertainty in a random variable (Shannon 1948; Shunsuke 1993), and to non-ordinary thermodynamical systems such as self-gravitating systems, in which the dominating microscopic interaction between particles is long-ranged (Lynden-Bell 1967; Campa et al. 2009). The latter is relevant to our studies, in which we formulated a framework of equilibrium statistical mechanics for self-gravitating systems (He & Kang 2010, 2011; Kang & He 2011; He 2012a,b). In these works, we demonstrated that the Boltzmann entropy

S_B[F] = -\int F(v) \ln F(v) \, \mathrm{d}^3 v,    (1)

is also valid for self-gravitating systems, in which F(v) is the system's probability density function (PDF, hereafter).

Hence, the PDF is necessary for computing a system's entropy, but often, instead of being given an analytic form of the PDF, we have to deal with a thermodynamical system that is in data form.
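When the PDF is known analytically, equation (1) can be evaluated numerically as a sanity check. The following Python sketch (our illustration, not code from the paper) does this on a grid for the one-dimensional standard normal distribution, whose exact entropy is (1/2) ln(2πe) ≈ 1.419, the value S_th used throughout the paper:

```python
import numpy as np

def entropy_from_pdf(pdf, v):
    """Evaluate S = -int F ln F dv (eq. 1, one-dimensional case)
    on the grid v by the trapezoidal rule."""
    F = pdf(v)
    g = np.zeros_like(F)
    m = F > 0
    g[m] = -F[m] * np.log(F[m])  # F ln F -> 0 as F -> 0
    return 0.5 * np.sum((g[1:] + g[:-1]) * np.diff(v))

v = np.linspace(-10.0, 10.0, 100_001)
normal_pdf = lambda x: np.exp(-x * x / 2.0) / np.sqrt(2.0 * np.pi)

S_th = entropy_from_pdf(normal_pdf, v)
print(S_th)  # close to 0.5 * ln(2 pi e) = 1.4189...
```

The statistical problem treated in the paper is precisely that `pdf` is unavailable and must itself be estimated from a sample.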
For instance, Helmi & White (1999) performed numerical simulations to study satellite galaxy disruption in a potential resembling that of the Milky Way. In their work, the coarse-grained Boltzmann entropy is used as a measure of phase-mixing, to indicate how the mixing of disrupted satellites can be quantified. The analytic PDF is unavailable, and hence they derived the coarse-grained PDF from the simulation data by histogram, that is, by taking a partition of the 6D phase-space and counting how many particles fall in each 6D cell.

So, with the analytic PDF unavailable, it is indispensable to compute the system's entropy from data-based samples. This seemingly easy computation, however, is plagued with some unexpected troubles. As we have seen in equation (1), the evaluation of entropy is actually related to the estimation of the probability density, which is one of the most important techniques in exploratory data analysis (EDA). Usually \hat{F}_{Δv}(v), the estimate of the underlying unknown PDF F(v) of a statistical sample, is derived by data binning, i.e. histogram (Gentle 2009), just as in Helmi & White (1999).

Figure 1. Illustration of the monotonic behaviour of S(Δv). The underlying distribution is the univariate standard normal distribution, from which the data set is randomly drawn, with sample size N = 10^5. Δv is the bin width of the one-dimensional histogram, and S is evaluated by equation (1), but with \hat{F}_{Δv}(v) instead of F(v). S(h), computed by kernel estimation, exhibits similar behaviour with respect to the bandwidth h. S_th = 1.419, shown with the dashed line, is directly computed using the analytical univariate standard normal PDF of equation (4).
However, in real practice, we find the following interesting phenomenon. As Figure 1 shows, the resulting entropy S depends on the bin width Δv, monotonically increasing with it, and thus it is troublesome to select an appropriate bin width of the histogram for computing the entropy.

Prior to our study, there have been many investigations of how to make optimal probability density estimates of statistical samples (Scott 1992; Gentle 2009). The histogram is the most commonly used method, and the bin width is its characteristic parameter. A larger bin width produces an over-smoothed estimate, so that the PDF looks smoother and featureless, while a smaller bin width produces an under-smoothed estimate, so that the resulting PDF has large fluctuations and spurious bumps. Hence, the selection of an optimal bin width is indeed an important task in EDA. Such efforts date back to Sturges (1926), but the famous selector for the optimal bin width is given by Scott (1979). Usually, the optimal bin width is obtained by minimizing the asymptotic mean integrated squared error (AMISE), which is a tradeoff between the squared bias and the variance. Contrary to the conventional treatment, however, Knuth (2013) proposed a straightforward data-based method of determining the optimal number of bins in a uniform bin-width histogram using Bayesian probability theory.

Another commonly used density estimation method is kernel estimation, which was first introduced by Rosenblatt (1956) and Parzen (1962) and is considered to be much superior to the histogram method (Jones et al. 1996). With kernel density estimation, we arrive at results similar to those with the histogram: the entropy increases as a monotonic function of the kernel bandwidth h.

Figure 2. The origin of the monotonicity of S(Δv) versus Δv is the coarse-graining of the PDF.
The step solid line and the dotted line indicate the fine-grained and coarse-grained PDF, respectively. The entropy evaluated with the coarse-grained PDF is larger than that with the fine-grained PDF. See Section 2.1 for detail. This explanation also applies to S(h) versus h.

Besides these two main methods, there are also other methods of density estimation, such as the average shifted histogram and orthogonal series estimators (Gentle 2009; Scott 1992, 2004). We focus on the histogram and kernel methods in this work.

Investigations of histogram and kernel estimation have indeed provided useful criteria for the optimal bin width of a histogram or bandwidth of a kernel function (Silverman 1986; Jones et al. 1996; Wand 1997), yet all these criteria need to take into account the functional form of the true PDF, which is usually unavailable. So it is difficult to select an optimal bin width or bandwidth for computing the entropy. Fortunately, through a large number of numerical experiments, we find that the minimum of the first derivative of S(Δv) with respect to Δv for histogram estimation, or of S(h) with respect to h for kernel estimation, points to the optimal bin width or bandwidth in all the cases we considered in this work.

In this paper, we describe our numerical experiments and demonstrate this finding in detail. Although we do not provide a robust mathematical proof of this finding, we suggest that the minimum of the first derivative of the entropy be regarded as an alternative selector, besides the usual AMISE, to pick out the optimal bin width of histogram or bandwidth of kernel estimation.

The paper is organized as follows. In Section 2, we introduce our methods and describe the working procedure. In Section 3, we present the results. Finally, we give our conclusions and discussions in Section 4.
2 METHODS AND PROCEDURE

2.1 Origin of the monotonicity of S(Δv) and S(h)

We first demonstrate the origin of the monotonicity of S(Δv) versus Δv. Referring to Figure 2, we design a simple one-dimensional normalized experimental PDF,

f_1(v) = \begin{cases} 2/3, & 0 < v < 1, \\ 1/3, & 1 < v < 2, \\ 0, & \text{otherwise}, \end{cases}    (2)

with which the entropy evaluated by equation (1) is S_1 = \ln 3 - \frac{2}{3}\ln 2. Next, the averaged, or coarse-grained, PDF within the interval 0 < v < 2 can be easily derived and normalized as

f_2(v) = \begin{cases} 1/2, & 0 < v < 2, \\ 0, & \text{otherwise}. \end{cases}    (3)

The entropy evaluated with this coarse-grained PDF f_2 is S_2 = \ln 2, and we can immediately see that S_2 > S_1. Hence, the monotonicity of S(Δv) versus Δv is caused by the coarse-graining of the PDF. The same explanation also applies to the monotonicity of S(h) versus h.

Beyond the monotonicity of S(Δv), however, there should be another important feature hiding in the S(Δv) curve. By scrutinizing S(Δv) in Figure 1, we speculate that the first derivative of S(Δv) with respect to Δv might take its local minimum around the crossing point of S_th and S(Δv). If this is the case, then we can use this property to construct a selector of the optimal bin width of the histogram, so that we can compute the entropy to the best extent. We will verify this speculation below.

Figure 3. \bar{S}(Δv), the averaged entropy evaluated by histogram density estimation, and its first derivative with respect to Δv, d\bar{S}/d ln(Δv), for the one-dimensional standard normal distribution of equation (4). Panels (a) to (f) correspond to sample sizes N = 10^3 to 10^8. Error bars of the \bar{S}(Δv) curves indicate the standard deviation. In every case, the mean and the standard deviation are evaluated with 50 instances. Δv is the bin width of the histogram. In every panel, the vertical axis on the left represents the entropy S, and that on the right the first derivative d\bar{S}/d ln(Δv). The entropies and their derivatives as functions of Δv are shown as solid and dotted lines, respectively. For visual clarity, the derivatives are exhibited on a logarithmic scale. The subscript 'dm' in the figure indicates 'derivative's minimum', with \bar{S}_dm ≡ S(Δv_dm). S_th, shown with dashed lines, is the entropy evaluated with the analytical PDF of equation (4).

2.2 Three experimental PDFs

Our strategy is briefly described as follows.
First, we choose some analytical PDFs and, with the Monte Carlo technique, draw N random data points from these analytical PDFs. With these statistical samples, we can construct the estimators \hat{F}_{Δv}(v) or \hat{F}_h(v) of the true PDFs, depending on whether we use histogram or kernel estimation. Secondly, we compute the entropy by equation (1), with \hat{F}_{Δv}(v) or \hat{F}_h(v) replacing the true PDFs. In this way, we derive the curves S(Δv) or S(h). Meanwhile, the entropy can be exactly evaluated with these analytical PDFs, denoted as S_th. These exact results can be used to calibrate our empirical results of S(Δv) or S(h), and to help select the optimal bin width or bandwidth.

For this purpose, we choose three analytical PDFs for the experiments. The first is the one-dimensional standard normal distribution,

F(v) = \frac{1}{\sqrt{2\pi}} e^{-v^2/2}, \quad -\infty < v < \infty,    (4)

whose S_th = 1.419. The second experimental PDF is the one-dimensional power-law function,

F(v) = 1 - \frac{16}{9} v^2, \quad -\frac{3}{4} < v < \frac{3}{4},    (5)

whose S_th = 0.2804. Unlike the first one, this power-law distribution is non-extended in its random variable v. The third one is the three-dimensional isotropic standard normal distribution,

F(\mathbf{v}) = \frac{1}{(2\pi)^{3/2}} e^{-v^2/2}, \quad -\infty < v < \infty,    (6)

in which \mathbf{v} ≡ (v_1, v_2, v_3) and S_th = 4.257. This PDF can be further reduced to a one-dimensional distribution, whose PDF is

F(v) = \sqrt{\frac{2}{\pi}} \, v^2 e^{-v^2/2}, \quad 0 < v < \infty.    (7)

This reduced one-dimensional distribution, unlike the previous two distributions, is asymmetric in the random variable v.

Table 1. Results derived from histogram estimation. Listed are the mean \bar{S}_dm and the variance σ(\bar{S}_dm) of the entropy at the minimum point, Δv_dm, of the derivative of the entropy, for the three experimental PDFs addressed in Section 2.2.
The mean and the variance are evaluated from 50 instances with different sets of random numbers. Sample sizes for the experiments range from 10^3 to 10^8. For comparison, the theoretical results S_th for the three distributions are 1.419, 0.2804, and 4.257, respectively.

            1D normal                        1D power-law                     3D normal
N           Δv_dm      S̄_dm   σ(S̄_dm)      Δv_dm      S̄_dm   σ(S̄_dm)     Δv_dm      S̄_dm   σ(S̄_dm)
10^3        2.174e-1   1.411  2.099e-2      6.250e-2   0.2756 9.919e-3     1.170e-1   4.226  4.003e-2
10^4        1.190e-1   1.418  6.643e-3      2.419e-2   0.2790 3.121e-3     6.511e-2   4.254  1.083e-2
10^5        5.051e-2   1.419  2.115e-3      8.721e-3   0.2797 4.941e-4     3.669e-2   4.256  3.697e-3
10^6        2.119e-2   1.419  8.134e-4      3.178e-3   0.2812 2.464e-5     1.546e-2   4.256  1.295e-3
10^7        1.190e-2   1.419  2.045e-4      1.786e-3   0.2805 1.326e-5     8.695e-3   4.257  4.184e-4
10^8        5.020e-3   1.419  8.081e-5      1.004e-3   0.2804 6.098e-7     1.151e-3   4.257  1.049e-4
theoretic   –          1.419  –             –          0.2804 –            –          4.257  –

These S_th values of the three cases are explicitly indicated in the figures and tables below.

3 RESULTS

3.1 Histogram estimation

As described above, the PDF of a statistical sample can be estimated mainly by the histogram and kernel methods. We first present the results of histogram estimation. Figure 3 shows the experiments with the one-dimensional standard normal distribution of equation (4). We perform experiments with the sample size N ranging from 10^3 to 10^8, and in every case, we compute S(Δv) with different sets of random numbers to give 50 instances. With these 50 instances, we can derive both the mean \bar{S} and the variance σ of the entropy as a function of Δv. These results are shown in panels (a) to (f), respectively, and we can see that the variances of all cases are very small.
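The histogram experiment just described can be sketched in a few lines of Python (an illustrative reimplementation under our own choices of grid, sample size and seed, not the authors' code): draw a sample from the standard normal, evaluate S(Δv) over a logarithmic grid of bin widths, and locate the minimum of dS/d ln Δv numerically.

```python
import numpy as np

def entropy_histogram(sample, dv):
    """S(dv) = -sum_k p_k ln(p_k) dv, where p_k = n_k / (N dv) is the
    histogram estimate of the density in bin k."""
    edges = np.arange(sample.min(), sample.max() + dv, dv)
    counts, _ = np.histogram(sample, bins=edges)
    p = counts / (sample.size * dv)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) * dv

rng = np.random.default_rng(0)
sample = rng.standard_normal(100_000)   # N = 10^5, as in Figure 1

dvs = np.logspace(-4, 0, 60)            # trial bin widths
S = np.array([entropy_histogram(sample, dv) for dv in dvs])

# locate the minimum of dS/d ln(dv); it selects the bin width dv_dm
dS = np.gradient(S, np.log(dvs))
i = int(np.argmin(dS))
dv_dm, S_dm = dvs[i], S[i]
print(dv_dm, S_dm)   # S_dm should lie close to S_th = 1.419
```

A single sample instance already reproduces the qualitative behaviour of Figure 3; the paper averages 50 such instances to suppress the residual scatter.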
The first derivative of \bar{S}(Δv) with respect to Δv, d\bar{S}(Δv)/d ln(Δv), is also numerically evaluated and shown in the corresponding panels. From this figure, we find that: (1) in all cases, the derivative has a minimum around the crossing point of the curve \bar{S}(Δv) and the straight line S_th; (2) the minimum of the derivative goes to zero with increasing sample size N; and (3) Δv_dm, at which the derivative takes its minimum, approaches the abscissa of the crossing point of \bar{S}(Δv) and S_th, such that \bar{S}(Δv_dm) asymptotically approaches S_th.

With exactly the same procedures, we perform experiments with the other two analytical PDFs, i.e. the one-dimensional power-law and the three-dimensional isotropic standard normal distribution of equations (5) and (6), respectively. The above findings also apply to these two cases. In Table 1, we give the relevant means and variances at Δv_dm for all the cases mentioned above.

Note that in Figure 3, we show the first derivative in the form d\bar{S}/d ln(Δv), rather than d\bar{S}/dΔv. We can see from the figure that at Δv_dm, \bar{S}(Δv) is nearly flat, that is, d\bar{S}/dΔv ≈ 0. Hence,

\frac{\mathrm{d}^2 \bar{S}}{\mathrm{d}(\ln \Delta v)^2}\bigg|_{\Delta v_{\rm dm}} = \Delta v \frac{\mathrm{d}\bar{S}}{\mathrm{d}\Delta v}\bigg|_{\Delta v_{\rm dm}} + \Delta v^2 \frac{\mathrm{d}^2 \bar{S}}{\mathrm{d}\Delta v^2}\bigg|_{\Delta v_{\rm dm}} \approx \Delta v^2 \frac{\mathrm{d}^2 \bar{S}}{\mathrm{d}\Delta v^2}\bigg|_{\Delta v_{\rm dm}} = 0,

so the minima of d\bar{S}/d ln(Δv) and d\bar{S}/dΔv are nearly at the same places.

3.2 Kernel estimation

A non-negative real-valued function K(u) is called a kernel function if it satisfies the following two conditions:

\int_{-\infty}^{+\infty} K(u) \, \mathrm{d}u = 1; \qquad K(u) = K(-u).

The PDF of a statistical sample x_i, with i running from 1 to the sample size N, can be estimated by using the kernel function:

\hat{F}_h(x) = \frac{1}{hN} \sum_{i=1}^{N} K\!\left(\frac{x - x_i}{h}\right),    (8)

in which h is the bandwidth of the kernel. If the kernel is a differentiable function, then \hat{F}_h(x) is also differentiable.
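Equation (8) translates directly into code. The sketch below (our illustration, using the Epanechnikov kernel adopted in this section, with our own choice of sample size and grid) shows the same monotonic growth of S(h) with the bandwidth h:

```python
import numpy as np

def epanechnikov(u):
    # K_e(u) = 3/4 (1 - u^2) for |u| < 1, zero otherwise
    return np.where(np.abs(u) < 1.0, 0.75 * (1.0 - u * u), 0.0)

def kde(sample, h, grid):
    # eq. (8): F_h(x) = (1 / (h N)) sum_i K((x - x_i) / h)
    u = (grid[:, None] - sample[None, :]) / h
    return epanechnikov(u).sum(axis=1) / (sample.size * h)

def entropy(F, grid):
    # S = -int F ln F dv by the trapezoidal rule; F ln F -> 0 where F = 0
    g = np.zeros_like(F)
    m = F > 0
    g[m] = -F[m] * np.log(F[m])
    return 0.5 * np.sum((g[1:] + g[:-1]) * np.diff(grid))

rng = np.random.default_rng(1)
sample = rng.standard_normal(5_000)
grid = np.linspace(-6.0, 6.0, 2401)

S = {h: entropy(kde(sample, h, grid), grid) for h in (0.02, 0.08, 0.32)}
print(S)   # S(h) grows monotonically with the bandwidth h
```

The brute-force double loop over grid and sample is O(N × grid) in memory; for the larger N used in the paper one would evaluate the sum in chunks, but the small case suffices to exhibit the monotonicity.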
For this reason, kernel estimation is believed to be superior to the histogram method. The kernel function we use here is the Epanechnikov (1969) function K_e(u),

K_e(u) = \frac{3}{4}(1 - u^2), \quad -1 < u < 1,    (9)

which has the highest efficiency, in contrast to other kernel functions, in yielding the optimal rate of convergence of the mean integrated squared error (Silverman 1986; Gentle 2009).

We give the results from kernel estimation for the one-dimensional normal distribution in Figure 4, and the relevant means and variances for all cases at h_dm in Table 2. These results all parallel the previous results from histograms. Again, we can see that the minimum of the first derivative of S(h) can be regarded as a selector for the optimal bandwidth h_dm. Compared with histogram estimation, the advantages of kernel estimation are obvious, in that all the curves S(h) and dS/dh are smooth and should be differentiable.

We also considered other kernel functions, such as the uniform and Gaussian kernels (Silverman 1986). The results are the same as with Epanechnikov (not shown).

3.3 Comparison with previous estimators

Prior to our results, there are some well-known estimators for the optimal bin width of histogram or bandwidth of kernel estimation. The Scott (1979) optimal bin width is one of them,

\Delta v^* = \left(\frac{6}{R(f')}\right)^{1/3} N^{-1/3},    (10)

Figure 4. \bar{S}(h), the averaged entropy evaluated with kernel estimation, and its first derivative with respect to h, d\bar{S}/d ln(h), for the one-dimensional standard normal distribution of equation (4). Panels (a) to (f) correspond to sample sizes N = 10^3 to 10^8. The kernel function is the Epanechnikov (1969) form, and h is the bandwidth of the kernel function. Error bars of the \bar{S}(h) curves indicate the standard deviation. The entropies and their derivatives as functions of h are shown as solid and dotted lines, respectively. Similar to the cases of histogram estimation, in every panel the vertical axis on the left represents \bar{S}(h), and that on the right the derivative. The mean and the standard deviation are evaluated with 50 instances. Again, the subscript 'dm' in the figure indicates 'derivative's minimum'.

Table 2. Results derived from kernel estimation. All are parallel to those of histogram estimation. For details see Table 1.
            1D normal                        1D power-law                     3D normal
N           h_dm       S̄_dm   σ(S̄_dm)      h_dm       S̄_dm   σ(S̄_dm)     h_dm       S̄_dm   σ(S̄_dm)
10^3        1.846e-1   1.410  2.128e-2      4.213e-2   0.2722 1.040e-2     1.539e-1   4.259  3.562e-2
10^4        8.090e-2   1.417  6.634e-3      1.693e-2   0.2781 3.273e-3     6.741e-2   4.255  1.122e-2
10^5        3.531e-2   1.418  2.131e-3      6.804e-3   0.2796 4.976e-4     3.531e-2   4.257  3.990e-3
10^6        1.855e-2   1.419  8.126e-4      3.281e-3   0.2812 2.454e-5     1.546e-2   4.256  1.216e-3
10^7        8.128e-3   1.419  2.040e-4      1.319e-3   0.2805 1.327e-5     5.644e-3   4.257  3.844e-4
10^8        4.274e-3   1.419  8.082e-5      7.631e-4   0.2804 5.985e-7     2.473e-3   4.257  1.240e-4
theoretic   –          1.419  –             –          0.2804 –            –          4.257  –

In equation (10), f' is the first derivative of f, and the functional R(g) is defined as the roughness of g,

R(g) = \int_{-\infty}^{+\infty} g(x)^2 \, \mathrm{d}x.

Another such estimator is the optimal bandwidth for kernel estimation (Silverman 1986; Jones et al. 1996):

h_{\rm AMISE} = \left(\frac{R(K)}{R(f'')\left(\int x^2 K(x)\,\mathrm{d}x\right)^2}\right)^{1/5} N^{-1/5}.    (11)

Note that h_AMISE scales with the sample size as N^{-1/5}, while Δv^* scales as N^{-1/3}.

Kernel estimation is believed to be superior to the histogram method, and indeed we notice that \bar{S}(h) as well as d\bar{S}/dh are much smoother than their counterparts from histogram estimation. So we concentrate on kernel estimation below. Figure 5 shows the relations of h_dm with the sample size N. We see that in all three cases, the differences between h_dm and h_AMISE are significant, and h_dm scales well as N^{-1/3}, in contrast to h_AMISE ∝ N^{-1/5}. It is interesting to note that this N^{-1/3} scaling is similar to that of Scott's formula of equation (10) for the optimal bin width of histogram estimation. Entropies evaluated at h_dm and h_AMISE for the three cases are shown in Figure 6.
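Equation (11) can be checked against the h_AMISE fits quoted in Figure 5. For the Epanechnikov kernel, R(K_e) = 3/5 and ∫u²K_e(u)du = 1/5; for the standard normal, R(f'') = 3/(8√π); and for the power-law PDF of equation (5), f'' = −32/9 on a support of length 3/2. A short numerical check (ours, not from the paper):

```python
import numpy as np

# Closed-form ingredients of eq. (11) for the Epanechnikov kernel:
R_K = 3.0 / 5.0                       # R(K_e) = int K_e(u)^2 du
mu2_K = 1.0 / 5.0                     # int u^2 K_e(u) du

# 1D standard normal: R(f'') = 3 / (8 sqrt(pi))
R_f2_normal = 3.0 / (8.0 * np.sqrt(np.pi))
C_normal = (R_K / (R_f2_normal * mu2_K**2)) ** 0.2
print(C_normal)   # about 2.345, the prefactor of N^(-1/5) in Figure 5(a)

# 1D power-law, eq. (5): f'' = -32/9, constant over a support of length 3/2
R_f2_pl = (32.0 / 9.0) ** 2 * 1.5
C_pl = (R_K / (R_f2_pl * mu2_K**2)) ** 0.2
print(C_pl)       # about 0.9542, the prefactor in Figure 5(b)
```

Both prefactors agree with the h_AMISE fits shown in panels (a) and (b) of Figure 5.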
It can be seen that \bar{S}(h_dm) is much closer to the entropy's true value S_th than S(h_AMISE), though the difference between the two vanishes asymptotically with increasing sample size. Nonetheless, as far as the statistical computation of entropy is concerned, h_AMISE, selected by minimizing the AMISE, is a less optimal bandwidth than h_dm.

Figure 5. Relation of the bandwidth h and the sample size N for the case of kernel estimation. The three panels correspond to the three experimental PDFs of equations (4), (5) and (6). The bandwidth h_AMISE derived by minimizing the AMISE is also shown for comparison. The fits are: (a) 1D normal, h_dm,fit = 1.800 N^{-1/3} versus h_AMISE = 2.345 N^{-1/5}; (b) 1D power-law, h_dm,fit = 0.3421 N^{-1/3} versus h_AMISE = 0.9542 N^{-1/5}; (c) 3D normal, h_dm,fit = 1.411 N^{-1/3} versus h_AMISE = 1.495 N^{-1/5}.

3.4 The first derivative of entropy as bandwidth selector

From the previous results, we have seen that the first derivative of S(Δv) or S(h) can be used as a selector for the optimal bin width or bandwidth. We concentrate on the kernel method here and present some analytical results. With the estimated PDF \hat{F}_h, the first derivative of the entropy is

\frac{\mathrm{d}S(h)}{\mathrm{d}h} = -\frac{\mathrm{d}}{\mathrm{d}h}\int \hat{F}_h \ln \hat{F}_h \,\mathrm{d}v = -\int \left(\frac{\partial \hat{F}_h}{\partial h} \ln \hat{F}_h + \frac{\partial \hat{F}_h}{\partial h}\right) \mathrm{d}v = -\int \frac{\partial \hat{F}_h}{\partial h} \ln \hat{F}_h \,\mathrm{d}v,    (12)

in which we use the normalization \int \hat{F}_h \,\mathrm{d}v = 1, so that \int (\partial \hat{F}_h / \partial h) \,\mathrm{d}v = (\mathrm{d}/\mathrm{d}h) \int \hat{F}_h \,\mathrm{d}v = 0.

Figure 6. Relation of the entropy S and the sample size N for the case of kernel estimation. The three panels correspond to the three experimental PDFs of equations (4), (5) and (6). In each panel, S(h_dm) and S(h_AMISE) are compared with S_th.

The second derivative of S(h) is

\frac{\mathrm{d}^2 S(h)}{\mathrm{d}h^2} = -\int \left[\frac{\partial^2 \hat{F}_h}{\partial h^2} \ln \hat{F}_h + \frac{1}{\hat{F}_h}\left(\frac{\partial \hat{F}_h}{\partial h}\right)^2\right] \mathrm{d}v = -\int \left[\frac{\partial^2 \hat{F}_h}{\partial h^2} \ln \hat{F}_h - \hat{F}_h \frac{\partial^2 \ln \hat{F}_h}{\partial h^2} + \frac{\partial^2 \hat{F}_h}{\partial h^2}\right] \mathrm{d}v = -\int \frac{\partial^2 \hat{F}_h}{\partial h^2} \ln \hat{F}_h \,\mathrm{d}v + \int \hat{F}_h \frac{\partial^2 \ln \hat{F}_h}{\partial h^2} \,\mathrm{d}v.    (13)

In deriving the last line of the above relation, we again use the normalization \int \hat{F}_h \,\mathrm{d}v = 1. Hence, the minimum of the first derivative of S(h) should be picked out by d^2S/dh^2 = 0, and from the above equation, we have

\int \frac{\partial^2 \hat{F}_h}{\partial h^2} \ln \hat{F}_h \,\mathrm{d}v = \int \hat{F}_h \frac{\partial^2 \ln \hat{F}_h}{\partial h^2} \,\mathrm{d}v.    (14)

We can make use of this relation to pick out the optimal bandwidth h_dm.

Note that both Δv^* of equation (10) and h_AMISE of equation (11) depend on the unknown true PDF f, which is usually difficult or even impossible to acquire. On the contrary, the selector of the optimal bandwidth based on the first derivative of the entropy is purely data-based, and hence our approach is much superior to the existing methods.

However, we do not provide a proof of the existence of the minimum of the first derivative of the entropy, nor do we explain why such a minimum can help pick out the optimal bandwidth. These issues are surely of great importance and deserve further investigation.

4 SUMMARY AND DISCUSSIONS

Entropy is a very important concept, and in statistical mechanics, it is computed with the probability distribution function of the system under consideration.
In our statistical-mechanical investigations of self-gravitating systems, however, we have to perform the statistical computation of entropy directly from statistical samples, without knowing the underlying analytic PDF. Thus, the evaluation of entropy is actually related to the estimation of the PDF from statistical samples in data form.

Usually, there are two approaches to estimate a PDF from a statistical sample. The first density estimation method is the histogram. The other is kernel estimation, which is considered to be superior to the histogram. We use both methods to evaluate the entropy, and we find that the entropy thus computed depends on the bin width of the histogram, or the bandwidth of the kernel method. Concretely, the entropy is a monotonically increasing function of the bin width or bandwidth. We attribute this monotonicity to the coarse-graining of the PDF. Thus, it is difficult to select an optimal bin width or bandwidth for computing the entropy.

Fortunately, we notice that there may exist a minimum of the first derivative of the entropy for both histogram and kernel estimation, and this minimum may correspond to the optimal bin width or bandwidth. We perform a large number of numerical experiments to verify this finding. First, we select three analytical PDFs, the one-dimensional standard normal, the one-dimensional power-law, and the three-dimensional standard normal distributions, and with the Monte Carlo technique, we draw N random data points from these analytical PDFs, respectively. With these statistical samples, we construct the estimators of the true PDFs for both histogram and kernel estimation. Secondly, we compute the entropy with the estimated PDFs, and in this way derive the curves S(Δv) or S(h), in which Δv and h are the bin width of the histogram and the bandwidth of the kernel, respectively. Meanwhile, the entropy can be exactly evaluated with these analytical PDFs.
These exact results can be used to calibrate our empirical results, and to help select the optimal bin width or bandwidth.

We do the same experiments for the three PDFs with the sample size ranging from 10^3 to 10^8. In all cases, whether using histogram or kernel estimation, we find that:

• The first derivative of the entropy indeed has a minimum around the crossing point of the entropy curve, S(Δv) or S(h), and the theoretical straight line S_th.
• The minimum of the derivative goes to zero with increasing sample size.
• Δv_dm or h_dm, at which the derivative takes its minimum, asymptotically approaches the abscissa of the above-mentioned crossing point with increasing sample size, such that the entropy at Δv_dm or h_dm asymptotically approaches the theoretical value S_th.
• h_dm scales with the sample size as N^{-1/3}, in contrast to h_AMISE, the bandwidth selected by minimizing the AMISE, whose scaling is ∝ N^{-1/5}.
• The entropy evaluated at h_dm is much closer to the true value S_th than that at h_AMISE, but the difference between the two vanishes asymptotically with increasing sample size.

Hence, we see that the minimum of the first derivative of S(Δv) or S(h) can be used as a selector for the optimal bin width or bandwidth of density estimation, and that h_AMISE is a less optimal bandwidth than h_dm for computing the entropy.

Note that both Scott's optimal bin width Δv^* and h_AMISE depend on the unknown underlying PDF of the system, which is usually difficult or even impossible to acquire. On the contrary, the estimator of the optimal bandwidth selected from the minimum of the first derivative of the entropy is purely data-based, and hence our method is clearly superior to the existing methods.

We emphasize that our results are by no means restricted to one dimension, but can also be extended to multivariate cases.
Finally, we acknowledge that we do not provide a rigorous mathematical proof of the existence of the minimum of the first derivative of the entropy, nor do we theoretically explain why such a minimum can help pick out the optimal bandwidth. These issues are surely of great importance and deserve further investigation; we leave them to those specialists who are interested in them.

ACKNOWLEDGEMENTS

We thank the referee very much for many constructive suggestions and comments, especially for pointing out the techniques for estimating information-theoretic quantities developed by Wolpert & Wolf (1995) and Wolpert & DeDeo (2013). This work is supported by the National Basic Research Program of China (no. 2010CB832805) and by the National Science Foundation of China (no. 11273013), and also by the Open Project Program of the State Key Laboratory of Theoretical Physics, Institute of Theoretical Physics, Chinese Academy of Sciences, China (no. Y4KF121CJ1).

REFERENCES

Campa A., Dauxois T., Ruffo S., 2009, Phys. Rep., 480, 57
Epanechnikov V. K., 1969, Theory Probab. Appl., 14, 153
Gentle J. E., 2009, Computational Statistics. Springer Science+Business Media, New York
He P., Kang D. B., 2010, MNRAS, 406, 2678
He P., Kang D. B., 2011, MNRAS, 414, L21
He P., 2012, MNRAS, 419, 1667
He P., 2012, MNRAS, 421, 2088
Helmi A., White S. D. M., 1999, MNRAS, 307, 495
Huang K., 1987, Statistical Mechanics. Wiley, New York
Jones M. C., Marron J. S., Sheather S. J., 1996, J. Am. Stat. Assoc., 91, 401
Kang D. B., He P., 2011, A&A, 526, A147
Knuth K. H., 2013, preprint (arXiv:physics/0605197v2)
Landau L. D., Lifshitz E. M., 1996, Statistical Physics, 3rd Revised edn. Butterworth-Heinemann, London
Lynden-Bell D., 1967, MNRAS, 136, 101
Parzen E., 1962, Ann. Math. Stat., 33, 1065
Rosenblatt M., 1956, Ann. Math. Stat., 27, 832
Scott D. W., 1979, Biometrika, 66, 605
Scott D. W., 1992, Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley, New York
Scott D. W., 2004, Multivariate Density Estimation and Visualization. Papers / Humboldt-Universität Berlin, Center for Applied Statistics and Economics (CASE), No. 2004, 16
Shannon C. E., 1948, Bell Syst. Tech. J., 27, 379
Shunsuke I., 1993, Information Theory for Continuous Systems. World Scientific, Singapore
Silverman B. W., 1986, Density Estimation for Statistics and Data Analysis. Chapman and Hall, London
Sturges H. A., 1926, J. Am. Stat. Assoc., 21, 65
Wand M. P., 1997, Am. Stat., 51, 59
Wolpert D. H., Wolf D. R., 1995, Phys. Rev. E, 52, 6841
Wolpert D. H., DeDeo S., 2013, Entropy, 15, 4668