Correspondence Analysis and PMI-Based Word Embeddings: A Comparative Study


Popular word embedding methods such as GloVe and Word2Vec are related to the factorization of the pointwise mutual information (PMI) matrix. In this paper, we establish a formal connection between correspondence analysis (CA) and PMI-based word embedding methods. CA is a dimensionality reduction method that uses singular value decomposition (SVD), and we show that CA is mathematically close to the weighted factorization of the PMI matrix. We further introduce variants of CA for word-context matrices, namely CA applied after a square-root transformation (ROOT-CA) and after a fourth-root transformation (ROOTROOT-CA). We analyze the performance of these methods and examine how their success or failure is influenced by extreme values in the decomposed matrix. Although our primary focus is on traditional static word embedding methods, we also include a comparison with a transformer-based encoder (BERT) to situate the results relative to contextual embeddings. Empirical evaluations across multiple corpora and word-similarity benchmarks show that ROOT-CA and ROOTROOT-CA perform slightly better overall than standard PMI-based methods and achieve results competitive with BERT.


💡 Research Summary

This paper investigates the relationship between traditional pointwise mutual information (PMI)‑based static word‑embedding methods and the statistical dimensionality‑reduction technique known as Correspondence Analysis (CA). The authors first review how PMI, often transformed into a positive PMI (PPMI) matrix, underlies popular embedding approaches such as PPMI‑SVD, GloVe, and skip‑gram with negative sampling (SGNS). They then demonstrate mathematically that CA can be interpreted as a weighted factorization of a normalized PMI matrix: CA operates on the matrix of standardized residuals ((p_ij − p_i⁺p_⁺j) / √(p_i⁺p_⁺j)) and solves a weighted least‑squares problem whose weighting function is precisely p_i⁺p_⁺j. This establishes a direct theoretical bridge between CA and PMI‑based methods.
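The two decompositions described above can be made concrete with a small numerical sketch. The toy count matrix below is illustrative only (not data from the paper); the code contrasts the PPMI matrix, which PPMI‑SVD factorizes, with the standardized‑residual matrix that CA factorizes:

```python
import numpy as np

# Toy word-context co-occurrence counts (rows = words, cols = contexts).
# Values are illustrative, not taken from the paper's corpora.
N = np.array([[10., 2., 0.],
              [ 3., 8., 1.],
              [ 0., 1., 6.]])

P = N / N.sum()                      # joint probabilities p_ij
p_i = P.sum(axis=1, keepdims=True)   # row marginals p_i+
p_j = P.sum(axis=0, keepdims=True)   # column marginals p_+j

# Positive PMI: log(p_ij / (p_i+ p_+j)), negative values clipped to zero.
with np.errstate(divide="ignore"):
    pmi = np.log(P / (p_i * p_j))
ppmi = np.maximum(pmi, 0.0)

# CA instead decomposes the standardized residuals:
# S_ij = (p_ij - p_i+ p_+j) / sqrt(p_i+ p_+j)
S = (P - p_i * p_j) / np.sqrt(p_i * p_j)

# Word coordinates come from the SVD of S, scaled by the singular values.
U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
word_vectors = U * sigma
```

Note how the residual matrix is centered by the independence model p_i⁺p_⁺j and scaled by its square root, which is exactly the p_i⁺p_⁺j weighting mentioned above.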

Building on this insight, the authors propose three CA‑derived embedding variants. The first, ROOT‑CCA, follows prior work that applies a square‑root transformation to the count matrix before performing canonical correlation analysis. The second, ROOT‑CA, is novel to NLP: it applies a simple element‑wise square‑root to the raw word‑context count matrix and then runs CA on the transformed matrix. The third, ROOTROOT‑CA, applies a fourth‑root (i.e., the square‑root of the square‑root) transformation prior to CA, a technique borrowed from ecological statistics to mitigate over‑dispersion in contingency tables. The paper argues that these transformations stabilize variance, reduce the dominance of extreme co‑occurrence counts, and improve the numerical conditioning of the subsequent singular value decomposition (SVD).
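Under the assumption that CA is applied to the transformed counts exactly as described, the variants reduce to a one-line change in the input matrix. The helper below is a minimal sketch (the function name and toy data are ours, not the paper's):

```python
import numpy as np

def ca_embedding(counts, dim=2):
    """Correspondence analysis on a (possibly transformed) count matrix:
    SVD of the standardized residuals, keeping the first `dim` dimensions."""
    P = counts / counts.sum()
    p_i = P.sum(axis=1, keepdims=True)
    p_j = P.sum(axis=0, keepdims=True)
    S = (P - p_i * p_j) / np.sqrt(p_i * p_j)
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    return U[:, :dim] * sigma[:dim]

# Toy counts with one extreme cell; values are illustrative only.
counts = np.array([[500., 2., 1.],
                   [  3., 8., 1.],
                   [  1., 1., 6.]])

ca       = ca_embedding(counts)            # plain CA
root_ca  = ca_embedding(np.sqrt(counts))   # ROOT-CA: sqrt, then CA
rootroot = ca_embedding(counts ** 0.25)    # ROOTROOT-CA: fourth root, then CA
```

The square-root and fourth-root transformations shrink the extreme cell (500) far more than the small cells, which is exactly the variance-stabilizing effect the paper credits for the improved conditioning of the SVD.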

For empirical evaluation, three large corpora are used: a full Wikipedia dump, a news‑article collection, and a web‑crawled dataset. From each, word‑context co‑occurrence matrices are built, and embeddings of dimensionalities 100, 200, and 300 are learned using PPMI‑SVD, GloVe, SGNS, the three CA variants, and a direct weighted factorization termed PMI‑GSVD. The resulting vectors are assessed on four standard word‑similarity benchmarks—WordSim‑353, SimLex‑999, MEN, and RareWord—using cosine similarity and Pearson correlation with human judgments.
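The evaluation protocol (cosine similarity between embeddings, correlated against human ratings) can be sketched as follows; the embeddings and judgment scores below are made up for illustration, not benchmark data:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-d embeddings and human similarity ratings (0-10 scale).
emb = {
    "cat": np.array([0.9, 0.1, 0.3]),
    "dog": np.array([0.8, 0.2, 0.4]),
    "car": np.array([0.1, 0.9, 0.2]),
}
pairs = [("cat", "dog"), ("cat", "car"), ("dog", "car")]
human = [9.0, 1.5, 2.0]  # invented judgments, mimicking a benchmark file

model_scores = [cosine(emb[a], emb[b]) for a, b in pairs]

# Benchmark score: Pearson correlation between model and human ratings.
r = float(np.corrcoef(model_scores, human)[0, 1])
```

A real run would load the published WordSim‑353, SimLex‑999, MEN, or RareWord pair lists instead of the three pairs hard-coded here.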

Results show that ROOT‑CA and ROOTROOT‑CA consistently outperform the baseline PMI‑based methods by roughly 1–2 percentage points in average correlation, with the most pronounced gains on datasets emphasizing low‑frequency or rare words. GloVe and SGNS achieve comparable overall scores but exhibit sensitivity to high‑frequency cells, leading to occasional drops in performance. PMI‑GSVD, while theoretically appealing, suffers from numerical instability when applied to the raw PMI matrix without transformation, and therefore lags behind the transformed CA approaches.

The authors also compare the static embeddings to token‑level representations extracted from a pretrained BERT model. Although BERT’s contextual embeddings achieve slightly higher average correlations (by about 0.5 percentage points), the CA‑based static vectors require far less computation during both training and inference, making them attractive for low‑resource or real‑time applications.

A further contribution is the clarification that the T‑Test weighting scheme, previously used in text categorization, is mathematically equivalent to the weighting inherent in CA, reinforcing the view that CA can be seen as a statistically principled alternative to latent semantic analysis (LSA). The paper also discusses how over‑dispersion—common in ecological count data—justifies the fourth‑root transformation, and demonstrates empirically that this step yields modest but consistent improvements.
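The over-dispersion argument can be illustrated numerically. Count data are over-dispersed when the variance greatly exceeds the mean (the benchmark being a Poisson model, where they are equal); the sketch below uses synthetic negative-binomial draws, not the paper's data, to show how a fourth-root transformation shrinks the variance-to-mean ratio:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic over-dispersed counts: for a negative binomial the variance
# exceeds the mean, unlike a Poisson where they are equal.
counts = rng.negative_binomial(n=1, p=0.05, size=10_000).astype(float)

ratio_raw = counts.var() / counts.mean()   # >> 1: strongly over-dispersed

fourth = counts ** 0.25                    # ROOTROOT-style transformation
ratio_4th = fourth.var() / fourth.mean()   # much closer to 1
```

The transformation compresses the heavy right tail of the count distribution, which is the stabilizing effect credited above for the consistent (if modest) gains of ROOTROOT‑CA.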

In conclusion, the study provides a rigorous theoretical link between CA and PMI‑based embeddings, introduces two previously unexplored CA variants (ROOT‑CA and ROOTROOT‑CA) for natural‑language processing, and shows that these variants can close the performance gap between classic static embeddings and modern contextual models while offering superior efficiency. The work highlights the importance of appropriate variance‑stabilizing transformations when applying matrix‑factorization techniques to linguistic data and opens avenues for further hybrid methods that combine statistical dimensionality reduction with deep learning.

