A New Look at the Classical Entropy of Written English

A simple method is presented for finding the entropy and redundancy of a reasonably long sample of English text by direct computer processing, from first principles according to Shannon theory. As an example, results on the entropy of the English …

Authors: Fabio G. Guerrero

Finding a precise value for the entropy of a language is, in general, an elusive matter, mainly due to the underlying statistical nature of the problem. The entropy of English has been researched for more than sixty years using a variety of approaches, yet the reported values do not fully agree. Different assumptions account for part of the discrepancy, but most of the previous work has relied on indirect methods. A reliable estimate of the entropy of a language is important: coding methods in source coding theory aim to code as close as possible to the entropy limit, so a good estimate of the entropy indicates how well a source coding system performs. C. E. Shannon, the father of information theory, showed in [1] that the entropy of printed English lies between 0.6 and 1.3 bits/letter over 100-letter sequences of English text. His calculation used a human prediction approach and ignored punctuation and the distinction between lowercase and uppercase. Cover and King, using gambling estimates, found a value between 1.25 and 1.35 bits per character [2]. Teahan and Cleary reported an entropy of 1.46 bits per character using data compression [3]. Kontoyiannis reported 1.77 bits per character using a string-matching method, resembling universal source coding, on a one-million-character sample from a single author [4]. Stylistic language analysis based on both word frequency and entropy is also a wide field of research; a recent example of this kind of work is presented in [5]. In this paper, an approach is presented for estimating the entropy rate of several English samples using direct computer calculation of probabilities and entropy first principles. We use C. E. 
Shannon's mathematical analysis for calculating the entropy H of a natural language by a series of approximations F_N of the conditional entropy:

F_N = −∑_{i,j} p(B_i, j) log2 p(j | B_i)    (1)

In (1), B_i is a block of N−1 characters, j is the next character after B_i, p(B_i, j) is the probability of the N-character block (B_i, j), and p(j | B_i) is the conditional probability of character j after block B_i. The language entropy H, in bits/character, can then be calculated as:

H = lim_{N→∞} F_N    (2)

Section 2 of this paper discusses the methodology (software, practical methods, assumptions, etc.) adopted to obtain the entropy values reported in this work. Section 3 presents the main results. Section 4 discusses the results and, finally, Section 5 summarizes the main findings. This paper concentrates exclusively on the entropy of English at the probabilistic level; analysis of the language at any other level is beyond its scope. All the supporting material for this work is available in [6]. Twenty-one classic literary works are used as the basis for the calculations in this work. These samples were selected from the top novels listed by the British newspaper The Daily Telegraph in January 2009 in an article entitled "100 novels everyone should read" [7]. The text of the selected works, all of them in contemporary English, is freely available on the Internet at the Project Gutenberg homepage [8]. For the few works whose text was not available, a different work by the same author was chosen. The resulting selection is quite diverse in terms of authors' origins, literary styles, and so on. All the software used in this work was written in Mathematica™ 6. Computer processing was carried out on a laptop computer with an AMD 1.6 GHz dual-core processor and 1024 MB of physical memory. Table I shows the twenty-one selected samples and their basic statistics. 
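As an illustration of how the approximations F_N of (1) can be estimated directly from a sample, the following minimal Python sketch (an illustration only; the paper's own software was written in Mathematica) computes F_N from overlapping N-gram counts:

```python
from collections import Counter
from math import log2

def F(text: str, N: int) -> float:
    """Shannon's N-th order approximation
    F_N = -sum p(B_i, j) log2 p(j | B_i),
    with probabilities estimated from overlapping N-gram counts."""
    ngrams = Counter(text[i:i + N] for i in range(len(text) - N + 1))
    total = sum(ngrams.values())
    if N == 1:
        # F_1 reduces to the single-character entropy
        return -sum((c / total) * log2(c / total) for c in ngrams.values())
    # count the (N-1)-character prefixes over the same positions, so that
    # p(j | B_i) = count(B_i, j) / count(B_i)
    prefixes = Counter(text[i:i + N - 1] for i in range(len(text) - N + 1))
    return -sum((c / total) * log2(c / prefixes[g[:-1]])
                for g, c in ngrams.items())
```

For a fully predictable text such as "ababab…", F_2 is zero, since each character is completely determined by its predecessor.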
In Table I, α is the average word length, α = ∑_i p_i L_i, where L_i is the length in characters of the i-th word and p_i is its corresponding probability. In Table I, the weighted average value of α is 4.22 letters per word, the total number of words is 3,804,409, and the total number of characters is 20,306,606. The character count includes only printable characters (i.e., control characters are excluded). The word dispersion ratio, WDR, equals the number of different words divided by the total number of words. With the exception of control characters, all printable characters present in each sample text are taken into account: uppercase and lowercase characters are treated as different, punctuation is included, the space character is included, and so on. Entropy for symbols (blocks) of n = 1 to 500 characters in length is calculated using the fundamental formula H_n = −∑_i p(B_i) log2 p(B_i). The probability p(B_i) of the n-character symbol B_i is estimated using the law of large numbers, where p(B_i) ≈ number of occurrences of B_i / total number of symbols. To get the entropy for symbols longer than one character, the entropy has to be averaged over the individual shifted-symbol entropies for the same n. That is, the n-character symbol probabilities are first computed by counting non-overlapping symbols from the first character of the text (shift = 0); they are then computed for non-overlapping symbols starting from the second character (shift = 1), and so on. It is easily observed that the number of shifts that must be considered before n-character blocks start repeating is equal to n. Remarkably, the shift entropies for a given n are very similar. For example, with sample 11 (Tess of the d'Urbervilles), for n = 5 (blocks of five characters), the five entropy values (shifts 0 to 4) obtained are 14.256508, 14.257813, 14.248730, 14.256469, and 14.251314 bits/symbol. The same behavior is observed when the analysis is done using words instead of characters. 
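The shifted-average computation described above can be sketched as follows (a minimal Python illustration, not the Mathematica code used in the paper):

```python
from collections import Counter
from math import log2

def block_entropy(text: str, n: int, shift: int) -> float:
    """Entropy, in bits/symbol, of non-overlapping n-character blocks
    starting at position `shift` (0 <= shift < n)."""
    blocks = [text[i:i + n] for i in range(shift, len(text) - n + 1, n)]
    total = len(blocks)
    # law of large numbers: p(B_i) ~ occurrences of B_i / number of blocks
    return -sum((c / total) * log2(c / total)
                for c in Counter(blocks).values())

def H(text: str, n: int) -> float:
    """H_n: the block entropy averaged over all n shifts, as in the text."""
    return sum(block_entropy(text, n, s) for s in range(n)) / n
```

For n = 1 this gives the familiar single-character entropy; for the periodic string "abab…" every 2-character block is identical at each shift, so H_2 = 0.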
The same approach using n-character symbols, however, leads to a finer analysis, because it avoids the quantization in length imposed by words and also considers non-alphanumeric characters, which are part of the language. Also, for every sample, a variable named the equiprobability distance, n_aep, is calculated. The value of n_aep is such that, for any n ≥ n_aep, equiprobable symbols are obtained for all shifts of n. For reasons of space, Table II shows the values of entropy for n-character symbols for n = 1 to 15 characters only. The values in Table II have been rounded to two significant digits; the complete table (n = 1 to 500) is available at [6]. In Table II, an asterisk indicates the maximum n-character symbol entropy of each sample. The entropy values in Table II are averages of the individual shifted entropies. The total number of individual shifted entropies computed for every sample, from n = 1 to k, is k(k+1)/2; for n = 1 to 500, this number is 125,250. For ease of display, Fig. 1 shows the curves of H_n for only four samples, for n = 1 to 100. Fig. 2 shows the approximate total CPU time used on each sample to obtain the values reported in Table II, for n = 1 to 500. The algorithm employed for obtaining the symbol frequencies is an extremely simple sorting algorithm implemented with Mathematica functions. In Fig. 2, the CPU time is the sum of the individual CPU times (as reported by Mathematica's Timing function) for every entropy calculation. The work is carried out simultaneously on both cores of the computer's processor: half of the samples are processed on core 1 and the other half on core 2, balancing the load, as given by each sample's number of characters, as far as possible. Hence, the actual real time needed to process the whole set of samples is nearly half the total sum of the individual CPU times. An analysis based on computational complexity theory is beyond the scope of this paper. 
Table III shows the values of n_aep for the twenty-one samples, for n between 1 and 200. As mentioned, n_aep is the length in characters such that, for any n greater than or equal to n_aep, the n-character symbols are equiprobable; that is, every n-character symbol occurs exactly once, so p(B_i) equals one over the number of symbols, for all shifts (0 to n−1) within a given n. This definition of n_aep is quite stringent, since it requires equiprobability for every shift and for every n ≥ n_aep. For example, in sample 11 symbol equiprobability exists from n = 76 to 307 for every shift, but clearly this does not make 76 equal to n_aep. If there were repeated subsequences of text longer than 500 characters in any sample, which is very unlikely, then the values in Table III would be absolutely certain only for n ≤ 500. Some implications of n_aep will be discussed in the next section. F_1 is calculated from single-character frequencies as F_1 = −∑_i p_i log2 p_i; for N ≥ 2, F_N follows from (1), which can be written as F_N = H_N − H_{N−1}, where H_N is the entropy of N-character blocks; and so on. Table II shows values of F_N from F_1 to F_15, rounded to three significant digits. To estimate the entropy rate, a third-degree polynomial interpolation is first applied to the values of F_N. As an example, Fig. 3 shows the interpolated F_N curves for samples one and twenty-one. As shown in Fig. 3, F_N becomes negative after crossing zero and then asymptotically approaches zero as N → ∞. Therefore, the entropy rate is estimated at the zero crossing:

H_L = H_{N_Z} / N_Z    (5)

In (5), N_Z is the root of the continuous (interpolated) function F_N and is hence a real number. The values of H_n in Table II are similarly interpolated to find H_{N_Z}, the value of H_n corresponding to N_Z. The redundancy is then obtained using the classic formula R = 1 − H_L / log2 A_S, where H_L is the source's entropy rate and A_S is the alphabet size. In Table V, the weighted average of H_L is 1.58 bits/character. Together with the weighted average alphabet size from Table I, this gives a redundancy R of 74.86%. Naturally, this value of R is precise only for the set of works analyzed in this study; however, other samples of written English should have a redundancy of about 74.86%, as long as they are reasonably long. 
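The zero-crossing estimate can be sketched in Python as follows; this uses linear rather than third-degree interpolation, purely for illustration, and the lists F_vals and H_vals stand for the computed F_N and H_n series:

```python
from math import log2

def entropy_rate(F_vals, H_vals):
    """Estimate H_L by locating N_Z, the zero crossing of F_N, and
    taking H_L = H_{N_Z} / N_Z, with H_n interpolated at N_Z.
    F_vals[k] and H_vals[k] hold F_N and H_n for N = n = k + 1."""
    for k in range(1, len(F_vals)):
        f0, f1 = F_vals[k - 1], F_vals[k]
        if f0 > 0 >= f1:                    # sign change: root in (k, k+1]
            t = f0 / (f0 - f1)              # fractional position of the root
            n_z = k + t                     # N_Z as a real number
            h_nz = H_vals[k - 1] + t * (H_vals[k] - H_vals[k - 1])
            return h_nz / n_z
    raise ValueError("F_N never crosses zero in the given range")

def redundancy(h_rate: float, alphabet_size: int) -> float:
    """Classic redundancy formula R = 1 - H_L / log2(A_S)."""
    return 1 - h_rate / log2(alphabet_size)
```

For instance, redundancy(1.0, 2) is 0: a fair binary source has no redundancy.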
This method for finding the entropy can be executed in very reasonable time, since the roots of F_N occur at small values of N. Textbooks on information theory generally take 75% as the redundancy of English [9], but arrive at that value through different assumptions and different values of both H_L and A_S. In Table V it can also be observed that the sample with the lowest WDR has the highest redundancy, and vice versa; samples five and seventeen are another example of the same behavior.

V. CONCLUSION

There are innumerable contexts in which the English language is, has been, and will continue to be used, perhaps as many different contexts as there are in anyone's life. Nevertheless, several statistical properties of the language can be identified when reasonably long samples are analyzed. A simple method for finding the entropy and redundancy of a sample of English has been presented. For the nearly 20.3 million printable characters of English text analyzed in this work, an entropy rate of 1.58 bits/character and a language redundancy of 74.86% were found. For the set of samples analyzed, the probability of a typical sequence of length n would then be 2^(−1.58 n), assuming n is sufficiently large. The maximum values of entropy for n-character symbols were found to occur for n between 8.31 and 11.43 characters (approximately 1.54 and 2.17 words). Although the analysis covered symbols from n = 1 to 500 characters in length, it was found that only this maximum value of entropy is required for finding the entropy rate of a sample. Also, the entropy values for different shifts of the same n were found to be very similar. Because of the structural relationship between words imposed by language, tools such as plagiarism detection software should take the value of n_aep into consideration before reaching a verdict. 
The same applies, for example, to the limits on the number of words imposed by some social networks on the Internet for short messages. Intuitively, it seems obvious that a short text message needs enough words to convey meaning; what n_aep indicates is that, in general, this number is fairly low. However, finding the exact value of n_aep for a given sample may require considerable computing time. It is difficult to speak of a definitive and precise value for the entropy rate of a language, because the condition n → ∞ requires an extremely large sample to be analyzed, which in practice is simply not available, apart from taking a very long time to compute. In this work, however, an approach to estimating both the entropy and the redundancy of a reasonably large language sample in a simple, direct way has been presented. Computer processing proved not to be an insurmountable barrier in the analysis of this type of sample and, on the contrary, allowed us to corroborate interesting observations about the entropy of English.
