Co-occurrence of the Benford-like and Zipf Laws Arising from the Texts Representing Human and Artificial Languages

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the [Original Paper Viewer] below or the original arXiv source.

We demonstrate that large texts representing human (English, Russian, Ukrainian) and artificial (C++, Java) languages display quantitative patterns characterized by the Benford-like and Zipf laws. Under Zipf's law, the frequency of a word is inversely proportional to its rank, whereas the total counts of individual words in the text generate an uneven, Benford-like distribution of leading digits. Excluding the most popular words substantially improves the correlation of actual textual data with the Zipfian distribution, whereas the Benford distribution of leading digits (arising from the total counts of words) is insensitive to the same elimination procedure. The calculated moduli of the slopes of double logarithmic plots for artificial languages (C++, Java) are markedly larger than those for human ones.


💡 Research Summary

The paper investigates whether two well‑known statistical regularities – Zipf’s law and a Benford‑like law – appear simultaneously in large bodies of text written in natural languages (English, Russian, Ukrainian) and in artificial programming languages (C++, Java). Zipf’s law states that the frequency f of a word is inversely proportional to its rank r (f ∝ 1/r). The Benford‑like law, originally observed for the leading digits of naturally occurring numbers, predicts that the probability of a leading digit d is P(d)=log10(1+1/d). In this study the authors treat the total number of occurrences of each distinct token in a corpus as a “number” and examine the distribution of its most significant digit.
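The two regularities above can be stated in a few lines of code. The following is a minimal Python sketch (the function names are ours, not the paper's): the theoretical Benford probability of a leading digit, and extraction of the most significant digit from a token count.

```python
import math

def benford_prob(d):
    """Theoretical Benford probability of leading digit d (1-9):
    P(d) = log10(1 + 1/d)."""
    return math.log10(1 + 1 / d)

def leading_digit(n):
    """Most significant decimal digit of a positive integer,
    e.g. 4203 -> 4."""
    while n >= 10:
        n //= 10
    return n
```

Note that the nine Benford probabilities sum to 1, since the terms telescope: log10(2/1) + log10(3/2) + ... + log10(10/9) = log10(10).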

Data collection and preprocessing: The authors assembled sizable corpora for each language. For the natural languages they gathered novels, news articles, and Wikipedia entries, amounting to tens of millions of words per language. For the programming languages they extracted source files from open‑source projects and educational examples, also reaching several million tokens. All texts were tokenized, lower‑cased, and stripped of punctuation; for code, comments and string literals were removed, and keywords versus identifiers were distinguished.
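The preprocessing described above can be approximated as follows. This is a rough sketch, since the paper does not publish its pipeline; the regex rules here (word pattern, C-style comment stripping) are our assumptions.

```python
import re

def tokenize(text, strip_comments=False):
    """Lower-case the text and extract word-like tokens.
    For source code, optionally drop C-style comments first
    (a rough stand-in for the preprocessing described above)."""
    if strip_comments:
        # Remove // line comments and /* ... */ block comments.
        text = re.sub(r'//[^\n]*|/\*.*?\*/', ' ', text, flags=re.DOTALL)
    # Tokens start with a letter or underscore; digits may follow.
    return re.findall(r'[a-z_]\w*', text.lower())
```

For Russian and Ukrainian text the word pattern would need Unicode letter classes; the ASCII-only pattern here is for illustration.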

Zipf analysis: Tokens were sorted by total frequency, and a log‑log plot of rank versus frequency was produced. Linear regression yielded a slope α (the Zipf exponent) and a coefficient of determination R². In the human‑language corpora α hovered around –1.0 and R² ranged from 0.85 to 0.92, confirming a strong Zipf relationship. In the programming‑language corpora the absolute value of α was markedly larger (≈ 1.5–2.0) and R² was slightly lower (0.78–0.84), reflecting the steeper drop‑off of keyword frequencies in code.
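The rank-frequency regression can be sketched as follows. This is a minimal NumPy illustration of the procedure described above, not the authors' code.

```python
import numpy as np
from collections import Counter

def zipf_fit(tokens):
    """Fit log(freq) = alpha * log(rank) + c by least squares and
    return (alpha, r_squared)."""
    # Frequencies in descending order define the ranks 1, 2, 3, ...
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True),
                     dtype=float)
    ranks = np.arange(1, len(freqs) + 1, dtype=float)
    x, y = np.log(ranks), np.log(freqs)
    alpha, c = np.polyfit(x, y, 1)
    # Coefficient of determination of the linear fit.
    y_hat = alpha * x + c
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return alpha, 1.0 - ss_res / ss_tot
```

On a perfectly Zipfian corpus the slope alpha comes out near -1; excluding the top-ranked tokens before fitting corresponds to slicing off the first elements of `freqs`.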

Benford‑like analysis: For each token the authors extracted the first digit of its total count and computed the empirical frequencies of digits 1‑9. These frequencies were compared with the theoretical Benford distribution using a χ² goodness‑of‑fit test. All corpora, including both natural and artificial languages, produced p‑values well above the 0.05 threshold, indicating no statistically significant deviation from the Benford law. The authors also examined the effect of removing the most frequent tokens (the top 10–20 ranks). While this removal substantially improved Zipf R² (by 0.07–0.12 on average), the Benford fit remained essentially unchanged, demonstrating that the Benford‑like pattern is robust to the exclusion of high‑frequency words.
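The digit-wise goodness-of-fit test can be sketched as follows. To keep the example dependency-free, the χ² statistic is computed directly and compared against the 5% critical value for 8 degrees of freedom (≈ 15.51) rather than converted to a p-value.

```python
import math
from collections import Counter

def benford_chi2(counts):
    """Chi-square statistic comparing the leading digits of the given
    token counts with the Benford distribution (8 degrees of freedom;
    the 5% critical value is about 15.51)."""
    digits = Counter(int(str(n)[0]) for n in counts if n > 0)
    total = sum(digits.values())
    stat = 0.0
    for d in range(1, 10):
        expected = total * math.log10(1 + 1 / d)
        stat += (digits[d] - expected) ** 2 / expected
    return stat
```

A statistic below the critical value means the observed digit frequencies are statistically compatible with Benford's law, which is the outcome the paper reports for all corpora.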

Interpretation and implications: The simultaneous presence of both laws suggests that large text collections possess two complementary regularities. Zipf’s law captures the rank‑frequency hierarchy, which is sensitive to the dominance of a few very common tokens; eliminating those tokens restores a cleaner power‑law behavior. The Benford‑like law, by contrast, reflects the overall scale distribution of token frequencies and is insensitive to the removal of the top ranks. The steeper Zipf exponent observed in C++ and Java indicates that programming languages have a more pronounced hierarchy of token usage, likely because a small set of keywords dominates while the rest of the vocabulary (identifiers, literals) appears far less often. The persistence of the Benford pattern in code suggests that the distribution of total counts, even when driven by syntactic constructs rather than semantic content, still follows the logarithmic digit law.

Potential applications: Because the Benford‑like distribution is stable under common preprocessing steps, it could serve as a baseline for detecting anomalies, data corruption, or intentional manipulation in large text corpora. The Zipf exponent, especially its difference between natural and artificial languages, could be employed as a feature for language identification, for evaluating the naturalness of generated text, or for assessing the quality of language models. Moreover, the joint analysis offers a new diagnostic toolkit for computational linguistics, information theory, and cybersecurity.

In summary, the authors provide empirical evidence that both Zipf’s law and a Benford‑like law govern the statistical structure of large textual data, irrespective of whether the text is written by humans or generated by programmers. The findings highlight the robustness of the Benford‑like digit law and the sensitivity of Zipf’s law to high‑frequency items, and they open avenues for practical exploitation of these regularities in text analysis, model evaluation, and integrity verification.

