DHPLT: large-scale multilingual diachronic corpora and word representations for semantic change modelling
In this resource paper, we present DHPLT, an open collection of diachronic corpora in 41 diverse languages. DHPLT is based on the web-crawled HPLT datasets; we use web crawl timestamps as an approximate signal of document creation time. The collection covers three time periods: 2011-2015, 2020-2021 and 2024-present (1 million documents per time period for each language). We additionally provide pre-computed word type and token embeddings and lexical substitutions for our chosen target words, while leaving it open for other researchers to define their own target words using the same datasets. DHPLT aims to fill the current gap in multilingual diachronic corpora for semantic change modelling, which so far cover only about a dozen high-resource languages. It opens the way for a variety of new experimental setups in this field. All the resources described in this paper are available at https://data.hplt-project.org/three/diachronic/, sorted by language.
💡 Research Summary
The paper introduces DHPLT (Diachronic HPLT), an open, large‑scale multilingual diachronic corpus designed explicitly for lexical semantic change detection (LSCD) research. Building on the HPLT v3.0 web‑crawled datasets, the authors extract three temporally distinct subsets—2011‑2015 (early period), 2020‑2021 (COVID period), and 2024‑present (most recent crawls)—for each of 41 languages spanning twelve language families. Each subset contains one million documents per language (or 0.5 M for low‑resource languages), yielding roughly 170 GB of compressed JSONL data and about 5.9 × 10¹⁰ tokens in total.
Temporal annotation relies on web‑crawl timestamps, which serve as an upper bound on document creation dates. While this approach does not guarantee precise creation times, it provides a pragmatic, uniformly applicable signal across all languages, enabling the construction of comparable time bins with gaps of at least two years. The authors acknowledge the inherent noise (e.g., older documents appearing in later bins) but argue that the scale and multilingual coverage outweigh this limitation.
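The binning logic described above can be sketched as follows. This is a minimal illustration, not the released extraction pipeline: the function name and the exact bin boundary dates are assumptions, and real HPLT records carry richer metadata than a bare timestamp string.

```python
from datetime import datetime
from typing import Optional

# Illustrative time bins mirroring the three DHPLT periods
# (exact boundary dates are assumptions for this sketch).
BINS = [
    ("2011-2015", datetime(2011, 1, 1), datetime(2016, 1, 1)),
    ("2020-2021", datetime(2020, 1, 1), datetime(2022, 1, 1)),
    ("2024-present", datetime(2024, 1, 1), datetime(9999, 1, 1)),
]

def assign_bin(crawl_timestamp: str) -> Optional[str]:
    """Map an ISO-formatted crawl timestamp to one of the three
    periods; documents falling into the gaps between bins are
    discarded, which keeps the time slices at least two years apart."""
    ts = datetime.fromisoformat(crawl_timestamp)
    for label, start, end in BINS:
        if start <= ts < end:
            return label
    return None
```

Note that because the crawl timestamp is only an upper bound on creation time, a document assigned to a later bin may in fact have been written earlier; the gaps between bins mitigate, but do not eliminate, this noise.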
Language selection follows two criteria: (1) at least 0.5 M documents must be available in each time slice, and (2) a monolingual HPLT T5 encoder‑decoder model must exist for the language, as these models are needed to generate token‑level embeddings. This results in a curated set of 41 languages, listed with ISO codes and script information in the appendix.
For downstream LSCD experiments, the authors pre‑define a set of “target words” per language. Starting from the T5 vocabulary, they filter out sub‑word pieces, low‑frequency items, and words that are not nouns, verbs, or adjectives, retaining only words written in the language’s primary script. After lemmatization, each language ends up with an average of ~18,600 target lemmas.
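A vocabulary filter in this spirit might look like the sketch below. The function name, the frequency threshold, and the script check via Unicode character names are all assumptions for illustration; the paper's actual thresholds and POS-tagging setup may differ.

```python
import unicodedata

def is_candidate_target(piece: str, freq: int, pos: str,
                        min_freq: int = 100,
                        script: str = "LATIN") -> bool:
    """Illustrative target-word filter. In a SentencePiece
    vocabulary (as used by T5), only pieces starting with "▁"
    begin a word; anything else is a word-internal fragment and
    is dropped, along with rare items and words that are not
    nouns, verbs, or adjectives."""
    if not piece.startswith("▁"):
        return False
    word = piece[1:]
    if not word.isalpha():
        return False
    if freq < min_freq or pos not in {"NOUN", "VERB", "ADJ"}:
        return False
    # every character must belong to the language's primary script
    return all(script in unicodedata.name(ch, "") for ch in word)
```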
Four families of semantic representations are released for these targets:
- Static word embeddings – SGNS (word2vec) models trained separately on each time slice (300‑dimensional, 50 k most frequent types). Vectors from periods 1 and 2 are aligned to period 3 using Procrustes, enabling direct cosine‑similarity comparisons across time.
- Contextual token embeddings – Encoder outputs from three models: HPLT T5, XLM‑R, and HPLT GPT‑BERT. For each target, 1,000 random occurrences are embedded with T5, 100 with XLM‑R, and 100 with GPT‑BERT, providing rich contextualized representations for downstream distance‑based or clustering methods.
- Lexical substitutes – Top‑15 substitute tokens generated via masked language modeling with GPT‑BERT (and also XLM‑R where vocabularies intersect). Substitutes are collected from 100 random occurrences per target, offering an alternative, sense‑oriented representation that has proven effective in recent LSCD benchmarks.
- Frequency counts – Raw token frequencies per target across the three periods, facilitating frequency‑controlled analyses and serving as a simple baseline indicator of lexical change.
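The Procrustes step used to align the static embeddings can be written in a few lines of NumPy. This is a minimal sketch over the shared vocabulary of two time slices; the released pipeline may differ in details such as which words are used to fit the rotation.

```python
import numpy as np

def procrustes_align(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Rotate the embedding matrix X into the space of Y via
    orthogonal Procrustes, so that cosine similarities become
    comparable across time slices. Rows of X and Y are the
    vectors of the shared vocabulary, in the same order."""
    # SVD of the cross-covariance yields the optimal rotation
    U, _, Vt = np.linalg.svd(X.T @ Y)
    R = U @ Vt
    return X @ R
```

Because the rotation is orthogonal, it preserves all distances within the source space and only changes its orientation relative to the target space.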
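The substitute-extraction step can be illustrated with a small helper that ranks MLM predictions at a masked target position. To keep the sketch self-contained it operates on precomputed logits rather than calling an actual model, and excluding the target itself is an assumed post-processing detail; the paper's exact procedure with GPT‑BERT may differ.

```python
import numpy as np

def top_k_substitutes(logits: np.ndarray, vocab: list,
                      target_id: int, k: int = 15) -> list:
    """Given MLM logits at the masked position of a target word,
    return the k highest-scoring substitute tokens, skipping the
    target word itself (simplified sketch of substitute mining)."""
    order = np.argsort(-logits)  # vocabulary ids, best first
    subs = [vocab[i] for i in order if i != target_id]
    return subs[:k]
```

In practice such substitutes are aggregated over many occurrences (100 per target in DHPLT), so that the distribution of substitutes reflects the word's senses in that time period.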
The authors demonstrate the utility of DHPLT through a sanity check on the English word “AI” (and its Spanish and Russian equivalents). Static embeddings reveal a clear semantic trajectory: early 2010s usage tied to video‑game characters, 2020‑2021 usage shifting toward chatbots and machine‑learning, and 2024‑present usage dominated by large language models and generative AI. Similar patterns appear in Spanish (“IA”) and Russian, confirming that the corpus captures real‑world semantic shifts despite the temporal noise inherent in crawl‑timestamp binning. Additional quantitative analyses using T5 embeddings (average pairwise distances) show that “AI” undergoes the largest change, while legal terms like “legislative” change minimally, aligning with intuitive expectations.
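The "average pairwise distance" measure mentioned above can be sketched as follows, assuming two matrices of token embeddings for the same word drawn from two different periods. This is the standard APD formulation from the LSCD literature, not necessarily the paper's exact implementation.

```python
import numpy as np

def mean_pairwise_cosine_distance(A: np.ndarray, B: np.ndarray) -> float:
    """Average cosine distance between every token embedding of a
    word in one period (rows of A) and every one in another period
    (rows of B); a larger value suggests greater semantic change."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return float(np.mean(1.0 - A @ B.T))
```

Under this measure, a word like “AI”, whose contexts shifted from video games to generative models, scores high, while a stable term like “legislative” stays near zero.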
In conclusion, DHPLT fills a critical gap in LSCD research by providing (i) a multilingual, time‑stratified corpus of unprecedented scale, (ii) a suite of ready‑to‑use semantic representations (static, contextual, substitute‑based), and (iii) open‑source extraction pipelines and CC0‑licensed data. Researchers can immediately experiment with multilingual semantic change models, conduct long‑term dynamic studies, or adapt the pipeline to create custom time slices or target word lists. Limitations include the reliance on crawl timestamps (introducing potential temporal noise) and the exclusion of very low‑frequency or dialectal terms from the target set. Nonetheless, DHPLT represents a substantial resource that is likely to accelerate multilingual diachronic NLP and open new avenues for studying cultural and societal evolution through language.