A fully data-driven method to identify (correlated) changes in diachronic corpora

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

In this paper, a method for measuring synchronic corpus (dis-)similarity put forward by Kilgarriff (2001) is adapted and extended to identify trends and correlated changes in diachronic text data, using the Corpus of Historical American English (Davies 2010a) and the Google Ngram Corpora (Michel et al. 2010a). This paper shows that this fully data-driven method, which extracts word types that have undergone the most pronounced change in frequency in a given period of time, is computationally very cheap and that it allows interpretations of diachronic trends that are both intuitively plausible and motivated from the perspective of information theory. Furthermore, it demonstrates that the method is able to identify correlated linguistic changes and diachronic shifts that can be linked to historical events. Finally, it can help to improve diachronic POS tagging and complement existing NLP approaches. This indicates that the approach can facilitate an improved understanding of diachronic processes in language change.


💡 Research Summary

This paper adapts and extends the synchronic corpus similarity measure originally proposed by Kilgarriff (2001) for diachronic (time‑varying) corpora. The authors treat each year—or any chosen time slice—as a high‑dimensional frequency vector of word types, then compute the χ² distance between a reference period and a target period. Because frequencies are normalized by total token count, the measure is robust to differences in corpus size. The χ² distance is interpreted through an information‑theoretic lens: a larger distance corresponds to a higher surprisal, indicating that the language at the target time carries more unexpected information relative to the reference.
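As a concrete illustration, Kilgarriff's χ² statistic pools the two time slices to obtain an expected count for each word type and sums the per-word deviations. The minimal Python sketch below follows that textbook form; the paper's exact variant (e.g. vocabulary cut-offs or frequency thresholds) may differ.

```python
from collections import Counter

def chi_squared_distance(counts_a, counts_b):
    """Kilgarriff-style chi-squared statistic between two corpora,
    given raw word counts for each time slice."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    vocab = set(counts_a) | set(counts_b)
    chi2 = 0.0
    for w in vocab:
        o_a, o_b = counts_a.get(w, 0), counts_b.get(w, 0)
        # expected counts if both slices drew from the pooled distribution
        p = (o_a + o_b) / (total_a + total_b)
        e_a, e_b = p * total_a, p * total_b
        chi2 += (o_a - e_a) ** 2 / e_a + (o_b - e_b) ** 2 / e_b
    return chi2
```

Because the expected counts scale with each slice's total token count, the statistic remains comparable across corpora of different sizes, mirroring the normalization described above.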

From the χ² distances the method extracts the top‑K word types that have undergone the most pronounced frequency change. For each of these words a time series of normalized frequencies is constructed, and the series are compared pairwise using both a linear measure (Pearson correlation) and a non‑linear one (Dynamic Time Warping). This correlation analysis reveals groups of words whose trajectories move together, suggesting coordinated linguistic shifts.
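The pairwise comparison of trajectories can be sketched as follows. This minimal version covers only the linear (Pearson) case and omits Dynamic Time Warping; the word lists and the threshold are illustrative, not taken from the paper.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlated_pairs(series, threshold=0.8):
    """Return word pairs whose normalized-frequency trajectories
    move together above the given correlation threshold."""
    return [(w1, w2)
            for (w1, s1), (w2, s2) in combinations(series.items(), 2)
            if pearson(s1, s2) >= threshold]
```

Applied to the top‑K change words, the resulting pairs can then be grouped into clusters of words that rise and fall together.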

The approach is evaluated on two massive English corpora: the Corpus of Historical American English (COHA), which provides balanced, genre‑annotated texts from 1810 to 2009, and the Google Books Ngram Corpus, which offers billions of n‑gram counts spanning several centuries. Both corpora undergo identical preprocessing—tokenization, lemmatization, stop‑word removal, and frequency normalization—so that results are directly comparable.
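A toy version of such a normalization pipeline might look like the sketch below. The stop-word list is an illustrative subset, and lemmatization is omitted for brevity; the paper's actual preprocessing chain may differ in its details.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}  # illustrative subset

def normalized_frequencies(text):
    """Tokenize, lowercase, drop stop words, and return frequencies
    per million tokens (normalized by the full token count, so that
    slices of different sizes stay comparable)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    total = len(tokens)
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return {w: c / total * 1_000_000 for w, c in counts.items()}
```

Running both corpora through one and the same pipeline is what makes the resulting frequency vectors, and hence the χ² distances, directly comparable.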

Empirical findings demonstrate that the method efficiently identifies historically meaningful change clusters. For example, the late‑1920s and 1930s Great Depression period yields a tightly correlated cluster containing words such as unemployment, bank, collapse, and price, with pairwise correlations exceeding 0.78. Similarly, the civil‑rights era of the 1960s and 1970s produces a cluster of equality, justice, protest, and march. These clusters not only show sharp frequency spikes but also align temporally with well‑documented social events, confirming that the technique can link linguistic dynamics to external historical forces.

Beyond descriptive analysis, the authors explore a practical NLP application: improving part‑of‑speech (POS) tagging for historical texts. By feeding the identified high‑change words and their observed POS transitions back into a standard POS tagger, they reduce common errors—particularly verb‑to‑noun conversion mistakes—by more than 12% on held‑out historical test sets. This result illustrates that a data‑driven change detection module can complement existing rule‑based or statistical taggers, especially when dealing with diachronic corpora where lexical usage evolves.
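One simple way such feedback could be wired into a tagger is as a post-hoc override for high-change words whose dominant POS has shifted in the target period. The interface below is hypothetical (the paper's integration may work quite differently) and is only meant to convey the idea.

```python
def retag(tokens, base_tagger, pos_overrides):
    """Apply diachronic POS overrides on top of a baseline tagger.
    `pos_overrides` maps high-change words to their period-appropriate
    tag (hypothetical interface, for illustration only)."""
    return [(tok, pos_overrides.get(tok.lower(), tag))
            for tok, tag in base_tagger(tokens)]

def toy_tagger(tokens):
    """Toy baseline: tag every token as NOUN."""
    return [(t, "NOUN") for t in tokens]
```

For instance, a word that the baseline systematically mis-tags as a noun in a period where it was predominantly used as a verb would be corrected by a single override entry.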

The paper’s contributions are fourfold. First, it shows that Kilgarriff’s χ²‑based similarity can be repurposed for temporal analysis with linear computational complexity (O(N·V), where N is the number of time slices and V the vocabulary size), making it feasible for very large datasets. Second, it provides an information‑theoretic justification for interpreting frequency shifts as changes in surprisal. Third, it introduces a systematic way to discover correlated linguistic changes, enabling the automatic detection of multi‑word phenomena tied to cultural, economic, or technological shifts. Fourth, it demonstrates concrete benefits for downstream NLP tasks such as POS tagging, suggesting broader applicability to tasks like named‑entity recognition or sentiment analysis on historical data.
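The information-theoretic reading mentioned above can be stated compactly under the standard definitions. Writing $p_t(w)$ for the relative frequency of word type $w$ in time slice $t$, and $O_w$, $E_w$ for the observed and pooled-expected counts of $w$, the surprisal and the χ² statistic are:

```latex
s_t(w) = -\log_2 p_t(w), \qquad
\chi^2 = \sum_{w \in V} \frac{\left(O_w - E_w\right)^2}{E_w}
```

This is a restatement under the usual textbook notation; the paper's own symbols may differ. Intuitively, a word whose observed count in the target slice departs strongly from its pooled expectation contributes a large χ² term, which corresponds to high surprisal relative to the reference period.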

The authors conclude by outlining future work: extending the framework to multilingual corpora, applying time‑series clustering to categorize different types of linguistic change, and integrating predictive models—such as Bayesian change‑point detection or deep learning architectures—to forecast future language trends. Overall, the study presents a computationally cheap, theoretically grounded, and empirically validated method that advances our ability to quantify and interpret language change over time.

