Counter-fitting Word Vectors to Linguistic Constraints
In this work, we present a novel counter-fitting method which injects antonymy and synonymy constraints into vector space representations in order to improve the vectors’ capability for judging semantic similarity. Applying this method to publicly available pre-trained word vectors leads to a new state of the art performance on the SimLex-999 dataset. We also show how the method can be used to tailor the word vector space for the downstream task of dialogue state tracking, resulting in robust improvements across different dialogue domains.
💡 Research Summary
The paper introduces a lightweight post‑processing technique called “counter‑fitting” that refines pre‑trained word embeddings by explicitly incorporating linguistic constraints derived from synonymy and antonymy. Starting from any set of vectors (e.g., GloVe or Paragram‑SL999), the method defines three loss components: (1) Antonym Repel (AR) pushes known antonym pairs apart by enforcing a minimum cosine‑based distance (δ = 1, corresponding to orthogonality); (2) Synonym Attract (SA) pulls known synonym pairs together, aiming for zero distance; and (3) Vector Space Preservation (VSP) penalizes changes in the relative distances among each word and its original nearest neighbours, thereby retaining the distributional information encoded in the original space. The overall objective is a weighted sum of these three terms (weights set equal in the experiments) and is optimized with stochastic gradient descent for 20 epochs, taking less than two minutes on a standard laptop.
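The three-part objective described above can be sketched in a few dozen lines. The snippet below is a minimal illustration, not the paper's implementation: it uses toy 2-D vectors, numerical gradients instead of the paper's analytic SGD updates, and illustrative names (`counter_fit`, `neighbours`); the hinge margin δ = 1 and the equal weighting of the three terms follow the description above.

```python
import numpy as np

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def counter_fit(vectors, synonyms, antonyms, neighbours,
                lr=0.1, epochs=20, delta=1.0):
    """Minimize AR + SA + VSP by SGD with numerical gradients (clarity over speed).

    vectors:    {word: list-of-floats} initial embeddings (copied, not mutated)
    synonyms:   list of (word, word) pairs to attract
    antonyms:   list of (word, word) pairs to repel
    neighbours: {(word, word): original_cosine_distance} pairs to preserve
    """
    V = {w: np.array(v, dtype=float) for w, v in vectors.items()}

    def loss():
        ar = sum(max(0.0, delta - cosine_distance(V[a], V[b]))
                 for a, b in antonyms)                   # Antonym Repel: hinge up to orthogonality
        sa = sum(cosine_distance(V[a], V[b])
                 for a, b in synonyms)                   # Synonym Attract: pull toward distance 0
        vsp = sum((cosine_distance(V[a], V[b]) - d0) ** 2
                  for (a, b), d0 in neighbours.items())  # Vector Space Preservation
        return ar + sa + vsp                             # equal weights, as in the experiments

    eps = 1e-5
    for _ in range(epochs):
        for w in V:
            grad = np.zeros_like(V[w])
            for i in range(len(grad)):                   # central-difference gradient
                V[w][i] += eps; hi = loss()
                V[w][i] -= 2 * eps; lo = loss()
                V[w][i] += eps
                grad[i] = (hi - lo) / (2 * eps)
            V[w] -= lr * grad
    return V
```

Running this on a toy vocabulary (e.g., `cheap`/`inexpensive` as synonyms, `cheap`/`expensive` as antonyms, with the original `expensive`/`pricey` distance preserved) moves the synonym pair closer and the antonym pair apart while leaving unconstrained geometry largely intact, which is the qualitative behaviour the paper reports.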
The authors extract constraints from two lexical resources: the Paraphrase Database (PPDB 2.0) and WordNet. PPDB provides “Equivalence” (synonym) and “Exclusion” (antonym) labels, while WordNet contributes additional antonym pairs. In total, 12,802 antonym and 31,828 synonym pairs are used for a vocabulary of roughly 76 k frequent words.
Two evaluation tracks are presented. First, the refined vectors are tested on SimLex‑999, a benchmark that measures pure semantic similarity (as opposed to relatedness). Counter‑fitted GloVe vectors improve from a Spearman correlation of 0.41 to 0.58, and counter‑fitted Paragram‑SL999 vectors improve from 0.69 to 0.74, surpassing the previous state‑of‑the‑art score of 0.685 and approaching the human inter‑annotator ceiling (0.78). Detailed error analysis shows that many false synonym or false antonym pairs (e.g., “sunset‑sunrise”, “dumb‑dense”) are corrected, even when the specific pair is not directly present in the constraint set, demonstrating the indirect regularizing effect of the VSP term.
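The SimLex-999 protocol behind these numbers is straightforward: score each word pair by cosine similarity in the embedding space and report the Spearman rank correlation against the human ratings. The sketch below assumes made-up word pairs and scores (not actual SimLex items) and implements Spearman's ρ as Pearson on ranks, ignoring tie correction for brevity.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman correlation as Pearson correlation of ranks (no tie handling)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean(); ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

def evaluate(vectors, scored_pairs):
    """scored_pairs: list of ((word_a, word_b), human_similarity_score)."""
    human, model = [], []
    for (a, b), score in scored_pairs:
        u, v = vectors[a], vectors[b]
        human.append(score)
        model.append(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return spearman_rho(np.array(human), np.array(model))
```

Because the metric is rank-based, only the ordering of the model's similarities matters; an embedding space whose cosine similarities are perfectly monotone in the human ratings scores ρ = 1 regardless of the absolute similarity values.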
Second, the method is applied to Dialogue State Tracking (DST), a downstream task in which a system must infer the user's goals, expressed as slot‑value pairs. The authors inject domain‑specific ontologies (e.g., the values of slots such as price or cuisine) as additional antonym constraints (forcing distinct values of the same slot apart) and then use the refined embeddings to automatically generate semantic dictionaries (lists of paraphrases for each slot value). Experiments on two DST datasets (restaurant and tourist‑information domains) show that dictionaries built from counter‑fitted vectors yield consistent improvements over baselines without dictionaries and over dictionaries built from the original embeddings. In the restaurant domain, for example, accuracy rises from 68.6 % without a dictionary to 73.4 % with a dictionary derived from the counter‑fitted vectors, outperforming the dictionary built from the original GloVe embeddings; similar gains are observed in the tourist domain.
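The dictionary-generation step amounts to thresholded nearest-neighbour lookup in the refined space. The sketch below is an assumption-laden illustration of that idea, not the paper's exact procedure: the 0.7 similarity threshold, the toy vocabulary, and the function name `build_dictionary` are all invented for the example.

```python
import numpy as np

def build_dictionary(vectors, ontology_values, threshold=0.7):
    """For each ontology value, list vocabulary words whose cosine similarity
    in the (counter-fitted) space exceeds the threshold — candidate paraphrases."""
    dictionary = {}
    for value in ontology_values:
        v = vectors[value]
        entries = []
        for word, u in vectors.items():
            if word == value:
                continue
            sim = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
            if sim >= threshold:
                entries.append(word)
        dictionary[value] = sorted(entries)
    return dictionary
```

Because counter-fitting has already pushed rival slot values (e.g., "cheap" vs. "expensive") apart, the threshold can be set relatively high without the dictionary for one value absorbing paraphrases of a competing value — which is precisely why the counter-fitted space produces cleaner dictionaries than the original embeddings.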
The paper’s contributions are threefold: (1) it demonstrates that explicit antonym constraints substantially improve semantic similarity judgments; (2) it provides a fast, resource‑efficient way to specialize generic embeddings for specific applications without retraining on large corpora; and (3) it validates the practical benefit of such specialization in a real‑world dialogue system, where automatically generated lexical resources can replace costly manual engineering. The authors also discuss the importance of preserving the original distributional geometry (via VSP) to avoid degrading the embeddings’ generalization capabilities. Future work may explore extending the approach to multilingual settings, incorporating richer lexical relations (hypernyms, meronyms), and testing on other downstream tasks such as text classification or information retrieval.