Dynamic Bernoulli Embeddings for Language Evolution

Notice: This research summary and analysis were generated automatically with AI assistance. For full accuracy, please refer to the original arXiv source.

Word embeddings are a powerful approach for unsupervised analysis of language. Recently, Rudolph et al. (2016) developed exponential family embeddings, which cast word embeddings in a probabilistic framework. Here, we develop dynamic embeddings, building on exponential family embeddings to capture how the meanings of words change over time. We use dynamic embeddings to analyze three large collections of historical texts: U.S. Senate speeches from 1858 to 2009, the history of computer science as reflected in ACM abstracts from 1951 to 2014, and machine learning papers on the arXiv from 2007 to 2015. We find dynamic embeddings provide better fits than classical embeddings and capture interesting patterns about how language changes.


💡 Research Summary

The paper introduces Dynamic Bernoulli Embeddings, a probabilistic framework that extends exponential‑family word embeddings to capture temporal evolution of word meanings. Building on the Bernoulli version of exponential‑family embeddings (Rudolph et al., 2016), each word token is represented as a binary indicator vector and its conditional probability of occurrence is modeled with a logistic function whose natural parameter is the inner product between a word‑specific embedding vector and the sum of context vectors of neighboring words. In the static setting, these embeddings are shared across the entire corpus.
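The static conditional probability just described, a logistic function of the inner product between a word's embedding and the sum of its neighbors' context vectors, can be sketched as follows. All array and function names here are illustrative, not from the authors' code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bernoulli_word_prob(rho, alpha, target, context):
    """P(x_target = 1 | context) under a static Bernoulli embedding.

    rho     : (V, K) array of word embedding vectors
    alpha   : (V, K) array of context vectors (shared across the corpus)
    target  : vocabulary index of the center word
    context : vocabulary indices of the surrounding words
    """
    # Natural parameter: inner product of the word's embedding with
    # the sum of its neighbors' context vectors.
    eta = rho[target] @ alpha[context].sum(axis=0)
    return sigmoid(eta)
```

With a zero embedding vector the natural parameter is 0 and the probability is exactly 1/2, which is a handy sanity check when implementing the model.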

To model change over time, the authors assign a separate embedding vector ρ⁽ᵗ⁾ for each word at each time slice t (e.g., a year). The context vectors α remain shared across time, while the embeddings follow a Gaussian random walk prior: ρ⁽ᵗ⁾ ∼ N(ρ⁽ᵗ⁻¹⁾, λ⁻¹I). This prior enforces smooth drift, reflecting the linguistic intuition that meanings evolve gradually and allowing the model to borrow statistical strength from adjacent periods, especially when data are sparse.
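The Gaussian random-walk prior can be illustrated by forward simulation, where `lam` plays the role of the precision λ (a minimal sketch with hypothetical names, not the authors' implementation):

```python
import numpy as np

def sample_trajectory(rho_0, T, lam, seed=0):
    """Draw rho^(1), ..., rho^(T) from the drift prior
    rho^(t) ~ N(rho^(t-1), (1/lam) * I), starting at rho^(0) = rho_0."""
    rng = np.random.default_rng(seed)
    traj = [np.asarray(rho_0, dtype=float)]
    for _ in range(T):
        # Each step is a small Gaussian perturbation of the previous slice.
        step = rng.normal(scale=lam ** -0.5, size=traj[-1].shape)
        traj.append(traj[-1] + step)
    return np.stack(traj)  # shape (T + 1, K)
```

A larger precision λ shrinks the step size, encoding the assumption that meaning drifts gradually rather than jumping between adjacent time slices.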

Training maximizes a pseudo‑likelihood, i.e., the sum of conditional log‑likelihoods over all observed word‑context pairs. The objective is split into contributions from positive (observed) entries (L_pos) and negative (unobserved) entries (L_neg). Because enumerating all zeros is infeasible, the authors adopt negative sampling, drawing a small set of zero entries from the unigram distribution raised to the 0.75 power and approximating L_neg with this subsample. Regularization terms penalize the L2 norms of both embeddings and context vectors (λ₀) and the squared differences between consecutive time‑slice embeddings (λ). The overall loss is optimized with stochastic gradient methods (Adam/Adagrad) using automatic differentiation in the Edward probabilistic programming library.
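These pieces of the objective, the negative sampler and the penalized pseudo-log-likelihood, can be sketched as below. This is a simplified illustration with hypothetical names; it omits the actual gradient-based training loop:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def draw_negatives(word_counts, n_neg, rng):
    """Sample 'zero' word indices from the unigram distribution
    raised to the 0.75 power."""
    p = word_counts.astype(float) ** 0.75
    p /= p.sum()
    return rng.choice(len(word_counts), size=n_neg, p=p)

def objective(eta_pos, eta_neg, rho, alpha, lam0, lam):
    """Pseudo-log-likelihood plus Gaussian log-priors (up to constants).

    eta_pos : natural parameters for observed (positive) word-context pairs
    eta_neg : natural parameters for negatively sampled (zero) pairs
    rho     : (T, V, K) per-time-slice embeddings
    alpha   : (V, K) shared context vectors
    """
    l_pos = np.log(sigmoid(eta_pos)).sum()
    l_neg = np.log(sigmoid(-eta_neg)).sum()  # log(1 - sigmoid(eta)), stable form
    l2 = -0.5 * lam0 * (np.sum(rho ** 2) + np.sum(alpha ** 2))
    drift = -0.5 * lam * np.sum((rho[1:] - rho[:-1]) ** 2)
    return l_pos + l_neg + l2 + drift  # maximize with SGD (e.g., Adam)
```

Note the identity log(1 − sigmoid(x)) = log(sigmoid(−x)), which avoids catastrophic cancellation for large natural parameters.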

The method is evaluated on three large, temporally annotated corpora: (1) machine‑learning papers from the arXiv (2007‑2015, yearly slices), (2) ACM computer‑science abstracts (1951‑2014, yearly slices), and (3) U.S. Senate speeches (1858‑2009, two‑year slices). For each corpus the data are split 80/10/10 into train/validation/test within each slice. The authors compare three models: static Bernoulli embeddings (s‑emb), time‑binned embeddings trained independently per slice (t‑emb, following Hamilton et al., 2016), and their proposed dynamic embeddings (d‑emb). Evaluation uses held‑out Bernoulli probability (conditional likelihood) on validation and test sets. Across all datasets, d‑emb achieves the highest likelihood, outperforming both s‑emb and t‑emb. The advantage is most pronounced in early periods with limited data, where the random‑walk prior effectively transfers information from later, richer slices.
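The per-slice 80/10/10 split described above can be sketched as follows (function and variable names are mine, not the authors'):

```python
import numpy as np

def split_slice(n_tokens, seed=0):
    """Return train/validation/test index arrays (80/10/10)
    for one time slice containing n_tokens observations."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_tokens)
    n_tr = int(0.8 * n_tokens)
    n_va = int(0.1 * n_tokens)
    return idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]
```

Splitting within each slice, rather than across the whole corpus, ensures every time period contributes held-out data, so the comparison between static, per-slice, and dynamic models is fair even for sparse early years.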

Beyond quantitative metrics, the authors present qualitative analyses that illustrate how specific words shift meaning over decades. For example, the term “intelligence” moves from “government intelligence” (mid‑20th century) to “cognitive intelligence” (late 20th century) to “artificial intelligence” (21st century) in both the ACM and Senate corpora, as visualized by a one‑dimensional projection of the embedding trajectory and by listing nearest neighbors at selected years. Similar trajectories are shown for “iraq,” “data,” and “computer,” revealing how geopolitical events, technological advances, and domain‑specific jargon reshape semantic neighborhoods.

The paper acknowledges limitations: only the word embeddings are dynamic while context vectors remain static, potentially restricting expressiveness; and the Gaussian random walk assumes smooth, gradual drift, which may not capture abrupt semantic shifts such as the rapid emergence of slang or new technical terms. Future work could explore dynamic context vectors, non‑linear temporal priors (e.g., neural ODEs), or fully Bayesian variational inference to better model uncertainty.

In summary, Dynamic Bernoulli Embeddings provide a principled, scalable way to embed words while tracking their semantic trajectories over time. By integrating a simple stochastic process into a probabilistic embedding framework, the authors achieve both better predictive performance and richer interpretability for longitudinal text analysis, opening avenues for interdisciplinary studies of language change, cultural evolution, and domain‑specific knowledge drift.

