Information content versus word length in random typing

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Recently, it has been claimed that a linear relationship between a measure of information content and word length is expected from word length optimization and it has been shown that this linearity is supported by a strong correlation between information content and word length in many languages (Piantadosi et al. 2011, PNAS 108, 3825-3826). Here, we study in detail some connections between this measure and standard information theory. The relationship between the measure and word length is studied for the popular random typing process where a text is constructed by pressing keys at random from a keyboard containing letters and a space behaving as a word delimiter. Although this random process does not optimize word lengths according to information content, it exhibits a linear relationship between information content and word length. The exact slope and intercept are presented for three major variants of the random typing process. A strong correlation between information content and word length can simply arise from the units making a word (e.g., letters) and not necessarily from the interplay between a word and its context as proposed by Piantadosi et al. In itself, the linear relation does not entail the results of any optimization process.


💡 Research Summary

The paper revisits the claim made by Piantadosi et al. (2011) that a linear relationship between a word’s information content and its length is evidence of an optimization process in language. The hypothesis of Piantadosi et al. rests on strong empirical correlations observed across many languages, suggesting that speakers choose shorter words for more predictable (low‑information) contexts and longer words for less predictable (high‑information) contexts. The authors of the present study argue that such a correlation does not necessarily imply optimization; it can arise purely from the statistical properties of the units that compose a word, such as letters.

To investigate this, the authors adopt the classic random‑typing model. In this model, a text is generated by pressing keys on a keyboard at random, with letters and a space (the word delimiter) each assigned a fixed probability. No semantic or contextual information is considered, and the process does not attempt to minimize any cost function. Three variants of the model are examined: (1) uniform probabilities for all symbols, (2) non‑uniform probabilities where letters and the space have different likelihoods, and (3) a version that adjusts the space probability to control the average word length. For each variant, the authors analytically derive the expected information content \(I(w) = -\log P(w)\) of a word \(w\) as a function of its length \(\ell\). The derivations yield an exact linear relationship between information content and word length, with the exact slope and intercept given for each variant.
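As an illustration of why random typing produces a linear relationship, here is a minimal sketch of one simple setting. It is not the paper's derivation, and its slope and intercept should not be read as the paper's results. The assumptions are: all letters are equally likely, the space has probability `p_space`, empty words (two consecutive spaces) are allowed, and logarithms are base 2. Under these assumptions a word of length \(\ell\) has probability \(P(w) = \left(\frac{1 - p_s}{N}\right)^{\ell} p_s\), so \(I(w) = \ell \log_2 \frac{N}{1 - p_s} - \log_2 p_s\). Because keys are pressed independently, a word's probability does not depend on the preceding words, so conditioning on context changes nothing. The names `word_information` and `simulate` are hypothetical helpers introduced only for this example.

```python
import math
import random
from collections import Counter


def word_information(length, n_letters=26, p_space=0.2):
    """Return -log2 P(w) for a word of `length` letters.

    Assumes uniform letter probabilities and that empty words are allowed,
    so P(w) = ((1 - p_space) / n_letters) ** length * p_space.
    """
    p_letter = (1.0 - p_space) / n_letters
    return -(length * math.log2(p_letter) + math.log2(p_space))


def simulate(n_chars, n_letters=2, p_space=0.2, seed=0):
    """Type `n_chars` random keys and count the words that result.

    A word is the (possibly empty) run of letters typed before each space.
    Letters typed after the last space are discarded.
    """
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"[:n_letters]
    counts = Counter()
    current = []
    for _ in range(n_chars):
        if rng.random() < p_space:
            counts["".join(current)] += 1
            current = []
        else:
            current.append(rng.choice(alphabet))
    return counts


if __name__ == "__main__":
    # Compare the empirical estimate with the closed form for short words.
    counts = simulate(200_000, n_letters=2, p_space=0.2)
    total = sum(counts.values())
    for word in ["", "a", "ab", "aba"]:
        empirical = -math.log2(counts[word] / total)
        exact = word_information(len(word), n_letters=2, p_space=0.2)
        print(repr(word), len(word), round(empirical, 3), round(exact, 3))
```

In this sketch, each extra letter adds the same number of bits, \(\log_2 \frac{N}{1 - p_s}\), to the information content. This is the sense in which linearity comes from the units that make up a word, with no optimization involved.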

