Evolution of the most common English words and phrases over the centuries
By determining which were the most common English words and phrases since the beginning of the 16th century, we obtain a unique large-scale view of the evolution of written text. We find that the most common words and phrases in any given year had a much shorter popularity lifespan in the 16th than they had in the 20th century. By measuring how their usage propagated across the years, we show that for the past two centuries the process has been governed by linear preferential attachment. Along with the steady growth of the English lexicon, this provides an empirical explanation for the ubiquity of Zipf’s law in language statistics and confirms that writing, although undoubtedly an expression of art and skill, is not immune to the same influences of self-organization that are known to regulate processes as diverse as the making of new friends and World Wide Web growth.
💡 Research Summary
The paper presents a large‑scale quantitative investigation of how the most frequently used English words and phrases have changed from the early 16th century through the 20th century. Using the Google Books N‑gram corpus, the authors extracted the top 10 000 tokens (single words or multi‑word expressions) for each year and defined two key metrics: (1) rank‑based popularity and (2) “popularity lifespan,” the number of consecutive years a token remains among the most popular.
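The lifespan metric described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `popularity_lifespans`, the input layout (a dict from year to that year's top tokens), the choice to report each token's longest consecutive run, and the toy data are all assumptions made for the example.

```python
from collections import defaultdict


def popularity_lifespans(top_by_year):
    """Longest run of consecutive years each token spends in the yearly top list.

    top_by_year: dict mapping year (int) -> iterable of top tokens for that year.
    (Hypothetical input layout, chosen for this sketch.)
    """
    longest = defaultdict(int)
    current = {}      # token -> length of the run ending at the previous year
    prev_year = None
    for year in sorted(top_by_year):
        # A run only continues if the previous processed year is exactly year - 1.
        contiguous = prev_year is not None and year == prev_year + 1
        new_current = {}
        for tok in set(top_by_year[year]):
            run = current.get(tok, 0) + 1 if contiguous else 1
            new_current[tok] = run
            longest[tok] = max(longest[tok], run)
        current = new_current
        prev_year = year
    return dict(longest)


# Toy data, invented for illustration only (note the missing year 1903).
toy = {
    1900: ["the", "of", "steam"],
    1901: ["the", "of", "steam"],
    1902: ["the", "of", "radio"],
    1904: ["the", "radio"],
}
lifespans = popularity_lifespans(toy)
```

In this sketch a gap in the data (here 1903) breaks every run, so a token must be observed in each year to extend its lifespan.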
Initial analyses reveal a striking temporal shift in lifespan. In the 1500s the average lifespan of a popular token was only 2–3 years, reflecting a highly volatile lexical environment with limited printing and regional textual production. By contrast, from the late 19th century onward the average lifespan expands dramatically, exceeding ten years and reaching over twenty years for many 20th‑century tokens. The authors attribute this to the diffusion of printing technology, the rise of mass‑market media, compulsory education, and the standardization of dictionaries and style guides, all of which increase linguistic inertia.
To uncover the dynamics governing the spread of popularity, the study models token selection as a preferential‑attachment process. For each year t, the authors measure the probability P(k) that a token with frequency k in year t is selected again in year t + 1. Empirical results show a near‑linear relationship P(k) ≈ α·k + β, with regression fits yielding R² > 0.95. This linear preferential attachment indicates that the more a word or phrase is used, the more likely it is to be used again, mirroring mechanisms observed in network growth, citation accumulation, and friendship formation.
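A linear relationship of the form P(k) ≈ α·k + β can be checked with an ordinary least-squares fit. The sketch below is one plausible way to do this and is not taken from the paper: `fit_line` is a hypothetical helper, and the `(k, P(k))` pairs are made-up numbers used only to show the mechanics.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y ≈ alpha * x + beta; returns (alpha, beta, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    beta = mean_y - alpha * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((y - (alpha * x + beta)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return alpha, beta, 1.0 - ss_res / ss_tot


# Invented (k, P(k)) pairs for illustration; not data from the study.
ks = [10, 20, 40, 80, 160]
ps = [0.012, 0.021, 0.043, 0.079, 0.162]

alpha, beta, r_squared = fit_line(ks, ps)
print(f"alpha={alpha:.5f} beta={beta:.5f} R^2={r_squared:.4f}")
```

An R² close to 1 on real measurements would support the linear form. A clearly curved residual pattern would instead point to sub- or super-linear attachment.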
The authors also examine the growth of the overall English lexicon. The number of newly appearing tokens per year rises roughly thirty‑fold from the 16th to the 20th century, corresponding to a steady annual growth rate of about 1.5–2 %. Despite this expansion, the frequency distribution continues to obey Zipf’s law (frequency ∝ rank⁻¹), suggesting a self‑organizing balance: high‑frequency items dominate usage while a growing tail of low‑frequency items enriches the vocabulary without disrupting the power‑law shape.
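Whether a frequency list follows Zipf's law is commonly checked by fitting a straight line to log(frequency) against log(rank) and reading the exponent off the slope. The sketch below assumes that approach. The helper name `zipf_exponent` and the synthetic frequencies are illustrative, and a simple log-log regression is a rough estimator rather than the method used in the paper.

```python
import math


def zipf_exponent(frequencies):
    """Estimate s in frequency ∝ rank**(-s) via least squares on log-log values.

    frequencies: positive counts; they are sorted in descending order here,
    so the largest count receives rank 1.
    """
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return -slope  # Zipf's law in its classic form corresponds to s close to 1


# Synthetic, perfectly Zipfian frequencies (f = 1000 / rank), for illustration only.
synthetic = [1000.0 / rank for rank in range(1, 101)]
exponent = zipf_exponent(synthetic)
```

On real corpora the fitted exponent depends on which rank range is included, so the estimate is best reported together with that range.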
Combining these findings, the paper draws two major conclusions. First, written English evolves under a self‑organizing dynamic where linear preferential attachment drives the reinforcement of popular forms. Second, the concurrent, gradual increase in lexicon size provides the necessary substrate for Zipf’s law to persist across centuries. The work therefore positions language evolution alongside other complex systems—such as the World Wide Web or social networks—where simple attachment rules give rise to robust statistical regularities.
In the discussion, the authors propose that this quantitative framework bridges traditional linguistic scholarship with digital humanities and complex‑systems science. They suggest future research directions including cross‑linguistic comparisons, analysis of modern digital media (e.g., social‑media streams), and exploration of how cultural or technological shocks might perturb the preferential‑attachment dynamics. Overall, the study offers a compelling empirical foundation for viewing language as a dynamic, self‑organizing system rather than solely a product of artistic or cognitive intent.