Size dependent word frequencies and translational invariance of books
It is shown that a real novel shares many characteristic features with a null model in which the words are randomly distributed throughout the text. One such common feature is a certain translational invariance of the text. Another is that the functional form of the word-frequency distribution of a novel depends on the length of the text in the same way as that of the null model. This means that an approximate power-law tail ascribed to the data will have an exponent which changes with the size of the text section that is analyzed. A further consequence is that a novel cannot be described by text-evolution models like the Simon model. The size-transformation of a novel is found to be well described by a specific Random Book Transformation. This size transformation in addition enables a more precise determination of the functional form of the word-frequency distribution. The implications of the results are discussed.
💡 Research Summary
The paper investigates how word‑frequency statistics in a novel depend on the size of the text segment examined and whether the distribution exhibits translational invariance. Using “Howards End” as a representative novel, the authors first define the word‑frequency function (W_D(k)), the number of distinct words that occur exactly (k) times, and the associated probability distribution (P(k)=W_D(k)/W_D), where (W_D) is the total number of distinct words. Empirically, the full‑book distribution has a broad, approximately power‑law tail that can be fitted by (P(k)\sim e^{-bk}/k^{\gamma}) with (\gamma\approx1.73).
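The definitions of (W_D(k)) and (P(k)) can be illustrated with a minimal Python sketch. The function name `frequency_distribution` and the whitespace tokenization of the toy sentence are assumptions made for illustration; the paper's own preprocessing is not described in this summary.

```python
from collections import Counter

def frequency_distribution(words):
    """Return (w_d_k, p) for a list of word tokens.

    w_d_k[k] is the number of distinct words occurring exactly k times,
    and p[k] = w_d_k[k] / W_D, where W_D is the number of distinct words.
    """
    word_counts = Counter(words)            # word -> number of occurrences
    w_d_k = Counter(word_counts.values())   # k -> number of distinct words with count k
    w_d = len(word_counts)                  # total number of distinct words, W_D
    p = {k: n / w_d for k, n in sorted(w_d_k.items())}
    return dict(w_d_k), p

# Toy input (hypothetical); a real analysis would use the tokens of a whole novel.
tokens = "the cat saw the dog and the dog saw a bird".split()
w_d_k, p = frequency_distribution(tokens)
```

In the toy sentence, four words occur once, two occur twice, and one occurs three times, so (P(1)=4/7), (P(2)=2/7), and (P(3)=1/7).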
The authors then divide the novel into equal‑size sections (e.g., 20‑part, 200‑part partitions) and compute the average frequency distribution (P_{w_T}(k)) for each section size (w_T). They observe a systematic steepening of the tail as the section becomes smaller: the effective exponent (\gamma) increases with decreasing (w_T). This “size dependence” is a characteristic feature of word‑frequency statistics in books.
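A section-averaged distribution of this kind could be computed as in the sketch below. It assumes that the average is taken over the normalized distributions of the individual sections and that leftover tokens at the end of the text are dropped; the summary does not state which convention the authors use, and the function name is hypothetical.

```python
from collections import Counter

def section_average_distribution(words, n_sections):
    """Average P_{w_T}(k) over n_sections equal-size, consecutive sections.

    Assumption: each section's distribution is normalized by its own number
    of distinct words before averaging; trailing leftover tokens are ignored.
    """
    size = len(words) // n_sections
    totals = Counter()
    for i in range(n_sections):
        section = words[i * size:(i + 1) * size]
        counts = Counter(section)                  # word -> occurrences in this section
        w_d = len(counts)                          # distinct words in this section
        for k, n in Counter(counts.values()).items():
            totals[k] += n / w_d                   # this section's P(k)
    return {k: v / n_sections for k, v in sorted(totals.items())}

# Toy input (hypothetical): two sections of four tokens each.
toy = ["a", "b", "a", "c", "a", "b", "d", "a"]
avg = section_average_distribution(toy, n_sections=2)
```

Repeating the call for several values of `n_sections` on a full novel gives one averaged distribution per section size, which is the quantity whose tail is reported to steepen as sections shrink.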
A striking observation is that the statistical properties of any section are essentially independent of its position in the text. When the novel is split into three consecutive parts, the functions (W_D(w_T)) (the number of distinct words in a part of length (w_T)) and (P_{w_T}(k)) are virtually identical for all three parts. The authors term this property translational invariance.
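One way to probe this property, sketched under assumptions, is to compute the growth of the number of distinct words separately inside each consecutive part and compare the resulting curves. The function names and the sampling `step` are illustrative choices, not taken from the paper.

```python
def distinct_word_curve(words, step):
    """Return [(w_T, W_D(w_T)), ...] sampled every `step` tokens."""
    seen = set()
    curve = []
    for i, word in enumerate(words, start=1):
        seen.add(word)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve

def curves_by_part(words, n_parts=3, step=1000):
    """Compute the distinct-word curve independently within each consecutive part."""
    size = len(words) // n_parts
    return [
        distinct_word_curve(words[i * size:(i + 1) * size], step)
        for i in range(n_parts)
    ]

# Toy input (hypothetical): three identical parts, so the curves coincide exactly.
parts = curves_by_part(list("abcabcabc"), n_parts=3, step=1)
```

For a text with translational invariance in the sense described above, the curves from the different parts should lie close to one another; a systematic drift from the first to the last part would indicate positional dependence.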
To explain these findings, they introduce a null model called the “random book”. In this model the set of words and their overall frequencies (P(k)) are preserved, but the words are randomly permuted throughout the text. Because the permutation destroys any positional correlations, the random book naturally exhibits translational invariance. Moreover, the authors derive an exact combinatorial transformation, named the Random Book Transformation (RBT), that maps the full‑book distribution (P(k)) to the distribution (P_{w_T}(k)) for any smaller segment. The transformation uses a triangular matrix (A_{kk'}) built from binomial coefficients and a normalization constant (C). The forward transformation (Eqs. 1–2) and its inverse (Eq. 4) allow one to predict how the tail exponent changes with segment size, and to reconstruct the original distribution from segment data.
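The idea behind such a transformation can be sketched in Python under a stated assumption: in a randomly permuted text, each of the (k') occurrences of a word is taken to fall inside a segment covering a fraction (1/n) of the text independently with probability (1/n), so the number of occurrences in the segment is binomially distributed. This is an illustrative reading of "a triangular matrix built from binomial coefficients and a normalization constant", not a reproduction of the paper's Eqs. 1–2; the function name and argument names are hypothetical.

```python
from math import exp, lgamma, log

def random_book_transform(p_full, n):
    """Map a full-text P(k') to a segment distribution P_seg(k), for n > 1.

    Assumption: occurrences are thinned binomially with probability q = 1/n.
    p_full is a dict {k': P(k')}; the result is a dict {k: P_seg(k)} for k >= 1.
    """
    if n <= 1:
        raise ValueError("n must be greater than 1")
    q = 1.0 / n
    log_q, log_1mq = log(q), log(1.0 - q)
    raw = {}
    for k_full, prob in p_full.items():
        for k in range(1, k_full + 1):   # triangular structure: only k <= k' contributes
            # log of binom(k', k) * q^k * (1 - q)^(k' - k), via lgamma to avoid overflow
            log_a = (lgamma(k_full + 1) - lgamma(k + 1) - lgamma(k_full - k + 1)
                     + k * log_q + (k_full - k) * log_1mq)
            raw[k] = raw.get(k, 0.0) + exp(log_a) * prob
    # Words with zero occurrences in the segment are unobserved, so renormalize
    # over k >= 1 (the role the summary ascribes to the constant C).
    norm = sum(raw.values())
    return {k: v / norm for k, v in sorted(raw.items())}

# Toy check (hypothetical): every word occurs twice in the full text; halve the text.
p_half = random_book_transform({2: 1.0}, n=2)
```

In the toy check, a word with two occurrences appears once in the half-text with probability 1/2 and twice with probability 1/4, so after renormalizing over observed words (P(1)=2/3) and (P(2)=1/3).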
When the authors apply the RBT to the real novel, the predicted segment distributions match the empirical ones almost perfectly, confirming that the novel behaves statistically like a random book. In contrast, the classic Simon model—an evolutionary stochastic process where new words are introduced with a fixed probability and existing words are chosen proportionally to their current counts—fails to reproduce these features. Simulated Simon books show a strong positional bias: rare words tend to appear late, and common words cluster early, leading to a systematic drift of (W_D(w_T)) and (P_{w_T}(k)) across the three parts. Moreover, the Simon model predicts a length‑independent exponent (\gamma), contradicting the observed size dependence.
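The Simon process as described above can be simulated with a short Python sketch. Words are represented by integer labels, and the names `simon_book`, `alpha`, and `seed` are illustrative; the parameter values in the usage lines are arbitrary and are not those used by the authors.

```python
import random

def simon_book(length, alpha, seed=0):
    """Generate a text of `length` integer-labelled words with the Simon process.

    With probability alpha the next token is a brand-new word; otherwise it is
    a copy of a uniformly chosen earlier token, which selects an existing word
    with probability proportional to its current count.
    """
    rng = random.Random(seed)
    text = [0]            # the text starts with a single word, labelled 0
    next_word = 1
    while len(text) < length:
        if rng.random() < alpha:
            text.append(next_word)
            next_word += 1
        else:
            text.append(rng.choice(text))
    return text

# Usage sketch: count distinct words in three consecutive parts of a simulated book.
book = simon_book(length=3000, alpha=0.1, seed=1)
third = len(book) // 3
distinct_per_part = [len(set(book[i * third:(i + 1) * third])) for i in range(3)]
```

Comparing `distinct_per_part` across the three parts, and against the same counts for a shuffled copy of `book`, gives a simple way to look for the positional drift that the summary attributes to Simon books.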
The authors extend the analysis to a collection of novels and short stories (Appendix A), demonstrating that the same translational invariance and size‑dependent tail behavior hold across authors, genres, and narrative lengths. They also show that the word‑frequency distribution of a short story matches the average distribution of sections of the same size taken from longer novels, indicating that the distribution depends primarily on the total word count, not on the specific content.
In conclusion, the study establishes three key results: (1) the word‑frequency statistics of real novels are nearly indistinguishable from those of a random permutation of their words; (2) the effective tail exponent of the word‑frequency distribution varies systematically with the length of the examined text segment; (3) growth‑based stochastic models such as Simon’s cannot account for these observations. The Random Book Transformation provides an analytically tractable framework for linking full‑book statistics to segment statistics, which may be useful for quantitative text analysis, author identification, and the development of more realistic language models.