The predictability of letters in written English

Reading time: 5 minutes
...

📝 Original Info

  • Title: The predictability of letters in written English
  • ArXiv ID: 0710.4516
  • Date: 2017-04-24
  • Authors: Author information is not stated in the paper body. (The original text does not include the authors or their affiliations, so they cannot be identified.)

📝 Abstract

We show that the predictability of letters in written English texts depends strongly on their position in the word. The first letters are usually the hardest to predict. This agrees with the intuitive notion that words are well-defined subunits in written languages, with much weaker correlations across these units than within them. It implies that the average entropy of a letter deep inside a word is roughly 4 times smaller than the entropy of the first letter.


📄 Full Content

Since language is used to transmit information, one of its most important quantitative characteristics is the entropy, i.e., the average amount of information (usually measured in bits) per character.

Entropy as a measure of information was introduced by Shannon [1]. He also performed extensive experiments [2] using the ability of humans to predict continuations of printed text. This and similar experiments [3,4] led to estimates of typically ≈ 1-1.5 bits per character.

In contrast, the best computer algorithms whose prediction is based on sophisticated statistical methods reach entropies of ≈ 2-2.4 bits [5]. Even this is better than what commercial text compression packages achieve: starting from texts where each character is represented by one byte, they typically achieve compression ratios ≈ 2, corresponding to ≈ 4 bits/character. These differences result from different abilities to take into account long-range correlations which are present in all texts and whose utilization requires not only a good understanding of language but also substantial computational resources.
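As a quick sanity check on the figure quoted above, the conversion from compression ratio to bits per character is a simple division; the helper below is purely illustrative and not part of the paper:

```python
# One input character occupies 8 bits; a compression ratio r therefore leaves
# 8 / r bits per character in the compressed representation.
def bits_per_character(compression_ratio: float, bits_per_input_char: int = 8) -> float:
    return bits_per_input_char / compression_ratio

print(bits_per_character(2.0))  # 4.0 bits/character, as quoted for commercial packages
```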

Formally, the Shannon entropy h of a letter sequence (…, s_{-1}, s_0, s_1, …) over an alphabet of d letters is given by

$$
h = -\lim_{n\to\infty}\frac{1}{n+1}\sum_{s_{-n},\ldots,s_{0}} p(s_{-n},\ldots,s_{0})\,\log_{2} p(s_{-n},\ldots,s_{0})
$$
$$
\phantom{h} = \lim_{n\to\infty}\big\langle -\log_{2} p(s_{0}\mid s_{-1},\ldots,s_{-n})\big\rangle \qquad (1)
$$
where p(s_{-n}, …, s_0) is the probability for the letters at positions -n to 0 to be s_{-n} to s_0, ⟨·⟩ denotes the average over this distribution, and

$$
p(s_{0}\mid s_{-1},\ldots,s_{-n}) = \frac{p(s_{-n},\ldots,s_{0})}{p(s_{-n},\ldots,s_{-1})}. \qquad (2)
$$

The second line of Eq. (1) tells us that h can be considered as an average over the information, i.e., the number of bits, needed to specify a single letter given its context. While Eq. (1) obviously assumes stationarity, this information can also be defined for nonstationary sequences, provided they are distributed according to some probability p which satisfies the Kolmogorov consistency conditions. The information of the k-th letter when it follows the string …, s_{k-2}, s_{k-1} is thus defined as

$$
\eta_{k} = -\lim_{n\to\infty} \log_{2} p(s_{k}\mid s_{k-1},\ldots,s_{k-n}). \qquad (3)
$$
Notice that this depends both on the previous letters (or “contexts” [6]) and on s_k itself. If the sequence is only one-sided infinite (as for written texts), we extend it to the left with some arbitrary but fixed sequence, in order to make the limit in Eq. (3) well defined.
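To make Eqs. (1) and (3) concrete, the sketch below computes η_k and its average for a given conditional model. The callable `model(context, letter)` is a hypothetical stand-in for p(s_k | s_{k-1}, …, s_{k-n}); it is not from the paper, and in practice p has to be estimated, as discussed next.

```python
import math

def letter_information(model, context: str, letter: str) -> float:
    """eta_k = -log2 p(s_k | context): the bits needed to specify letter s_k."""
    return -math.log2(model(context, letter))

def average_information(model, text: str, order: int) -> float:
    """Average of eta_k over the text; approximates h if `model` is the true p."""
    etas = [letter_information(model, text[max(0, k - order):k], text[k])
            for k in range(len(text))]
    return sum(etas) / len(etas)

# Example with a context-free "model" over a 27-letter alphabet:
# uniform = lambda context, letter: 1.0 / 27
# average_information(uniform, "to be or not to be", order=0)  # log2(27) ≈ 4.75 bits
```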

When trying to evaluate η_k, the main problem is the fact that p(s_k | s_{k-1}, …, s_{k-n}) is not known. The best we can do is to obtain an estimator p̂(s_k | s_{k-1}, …, s_{k-n}), which then leads to an information estimate η̂_k and to

$$
\hat{h} = \frac{1}{N}\sum_{k=1}^{N} \hat{\eta}_{k} \qquad (4)
$$

for a text of length N. This can also be used for testing the quality of the predictor p̂(s_k | s_{k-1}, …, s_{k-n}): the best predictor is the one which leads to the smallest ĥ. This is indeed the main criterion by which p̂(s_k | s_{k-1}, …, s_{k-n}) is chosen below.
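A minimal sketch of how Eq. (4) can be used to compare predictors, assuming a simple sequentially updated, Laplace-smoothed n-gram estimator in place of the paper's actual estimator (the smoothing and the alphabet handling are illustrative choices, not taken from the paper):

```python
import math
from collections import Counter

def hhat(text: str, order: int) -> float:
    """Sequential estimate of Eq. (4): predict letter k from the letters seen so far."""
    alphabet = sorted(set(text))
    ctx_counts, joint_counts = Counter(), Counter()
    total_bits = 0.0
    for k, letter in enumerate(text):
        ctx = text[max(0, k - order):k]
        # Laplace-smoothed plug-in estimate hat-p(s_k | ctx) from past counts only.
        p_hat = (joint_counts[(ctx, letter)] + 1) / (ctx_counts[ctx] + len(alphabet))
        total_bits += -math.log2(p_hat)   # hat-eta_k
        ctx_counts[ctx] += 1
        joint_counts[(ctx, letter)] += 1
    return total_bits / len(text)          # hat-h

# The best predictor is the one with the smallest hat-h, e.g. over context lengths:
# best_order = min(range(5), key=lambda n: hhat(some_text, n))
```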

In this way we not only get an estimate ĥ of h, but we can also investigate the predictability of individual letters within specific contexts. The fact that different letters have different predictabilities is of course well known. If no context is taken into account at all, then the best predictor is based on the frequencies of letters, making the most frequent ones the easiest to predict. Studies of these frequencies exist for all major languages.
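For this context-free case, the predictor reduces to the single-letter frequencies; the snippet below (illustrative only, not from the paper) lists each letter's information content as -log_2 of its relative frequency:

```python
import math
from collections import Counter

def letter_information_table(text: str) -> dict:
    """-log2 of each letter's relative frequency: its information without any context."""
    counts = Counter(text)
    total = sum(counts.values())
    return {letter: -math.log2(count / total) for letter, count in counts.items()}

# Sorting by information lists the most frequent (most predictable) letters first:
# sorted(letter_information_table(some_english_text).items(), key=lambda kv: kv[1])
```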

Much less effort has gone into the context dependence. Of course, the next natural distributions after the single-letter probabilities are the distributions of pairs and triples, which give contexts of length 1 and 2 and which have also been studied in detail [5]. But these distributions do not directly reflect some of the most prominent features of written languages, namely, that they are composed of subunits (words, phrases) which are put together according to grammatical rules.
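As a brief illustration of how pair counts yield contexts of length 1 (a hypothetical helper, not from the paper):

```python
from collections import Counter, defaultdict

def conditional_from_pairs(text: str) -> dict:
    """p(next letter | previous letter) estimated from pair (bigram) counts."""
    pair_counts = Counter(zip(text, text[1:]))
    totals = defaultdict(int)
    for (prev, _), count in pair_counts.items():
        totals[prev] += count
    return {(prev, nxt): count / totals[prev]
            for (prev, nxt), count in pair_counts.items()}
```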

In the following, we shall study the simplest consequences of this structure. If words are indeed natural units, it should be much easier to predict letters coming late in a word, where we have already seen several letters with which they should be strongly correlated, than letters at the beginnings of words. Surprisingly, this effect has not yet been studied in the literature, maybe due to a lack of efficient estimators of the entropies of individual letters. A similar, but maybe less pronounced, effect is expected with words replaced by phrases.
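A hedged sketch of how this position effect can be measured: estimate η̂_k with any sequential predictor (here the same simple smoothed n-gram estimator as above, standing in for the paper's tree-based one) and average it separately for each position inside a word. Treating any non-alphabetic character as a word boundary is an assumption made only for this illustration.

```python
import math
from collections import Counter, defaultdict

def information_by_word_position(text: str, order: int = 3) -> dict:
    """Average estimated information (bits) of a letter vs. its position in the word."""
    alphabet = sorted(set(text))
    ctx_counts, joint_counts = Counter(), Counter()
    bits_at_pos, letters_at_pos = defaultdict(float), defaultdict(int)
    pos_in_word = 0
    for k, letter in enumerate(text):
        ctx = text[max(0, k - order):k]
        # Sequential Laplace-smoothed estimate, as in the earlier sketch.
        p_hat = (joint_counts[(ctx, letter)] + 1) / (ctx_counts[ctx] + len(alphabet))
        if letter.isalpha():
            bits_at_pos[pos_in_word] += -math.log2(p_hat)
            letters_at_pos[pos_in_word] += 1
            pos_in_word += 1
        else:                          # blanks, punctuation etc. end the current word
            pos_in_word = 0
        ctx_counts[ctx] += 1
        joint_counts[(ctx, letter)] += 1
    return {pos: bits_at_pos[pos] / letters_at_pos[pos] for pos in sorted(letters_at_pos)}

# Position 0 (the first letter of a word) is expected to show the largest value.
```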

In our investigation, we use an estimator which is based on minimizing ĥ. Technically, it builds a rooted tree with contexts represented as paths starting at some inner node and ending at the root. The tree is constructed such that each leaf corresponds to a context which is seen a certain minimal number of times (typically 2-5), while each internal node has appeared more often as a context. A heuristic rule is used for estimating p̂ for each context length, and the optimal context length is chosen such that it will most likely lead to the smallest ĥ. Details of this algorithm (which resembles those discussed in Refs. [5] and [6]) are given in [7].
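The following is a much-simplified sketch in the spirit of such a context-tree estimator; the pruning threshold, the Laplace estimates, and the depth-selection rule are illustrative choices, not the actual algorithm of Ref. [7].

```python
import math
from collections import Counter, defaultdict

class ContextTreePredictor:
    def __init__(self, alphabet, max_depth=5, min_count=3):
        self.alphabet = list(alphabet)
        self.max_depth = max_depth
        self.min_count = min_count          # contexts seen fewer times are not deepened
        self.ctx_counts = Counter()         # how often each context has occurred
        self.joint_counts = Counter()       # (context, next letter) counts
        self.ctx_bits = defaultdict(float)  # code length accumulated by each context

    def _estimate(self, ctx, letter):
        # Heuristic (Laplace) estimate of p(letter | ctx) from the counts so far.
        return (self.joint_counts[(ctx, letter)] + 1) / \
               (self.ctx_counts[ctx] + len(self.alphabet))

    def predict_prob(self, history, letter):
        # Walk from the root (empty context) towards deeper contexts, i.e. along the
        # path ..., s_{k-2}, s_{k-1}; stop where the context becomes too rare, and use
        # the depth that has so far produced the shortest average code (smallest hat-h).
        best_ctx, best_bits = "", float("inf")
        for depth in range(min(self.max_depth, len(history)) + 1):
            ctx = history[len(history) - depth:]
            if depth > 0 and self.ctx_counts[ctx] < self.min_count:
                break
            avg_bits = (self.ctx_bits[ctx] / self.ctx_counts[ctx]
                        if self.ctx_counts[ctx] else float("inf"))
            if avg_bits < best_bits:
                best_ctx, best_bits = ctx, avg_bits
        return self._estimate(best_ctx, letter), best_ctx

    def update(self, history, letter):
        # Record the observed letter under every context length up to max_depth.
        for depth in range(min(self.max_depth, len(history)) + 1):
            ctx = history[len(history) - depth:]
            self.ctx_bits[ctx] += -math.log2(self._estimate(ctx, letter))
            self.ctx_counts[ctx] += 1
            self.joint_counts[(ctx, letter)] += 1

# Usage sketch:
# tree = ContextTreePredictor(alphabet="abcdefghijklmnopqrstuvwxyz ")
# for k, letter in enumerate(text):
#     prob, _ = tree.predict_prob(text[:k], letter)   # hat-p(s_k | chosen context)
#     tree.update(text[:k], letter)
```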

The information needed to predict a letter with this algorithm consists, on the one hand, of the rules entering the algorithm, and on the other, of the structure stored in the tree. In the present application, we have first built two trees, each based on ≈ 4 × 10^6 letters from Shakespeare [8], and from the

Reference

This content is AI-processed based on open access ArXiv data.
