Skip-gram Language Modeling Using Sparse Non-negative Matrix Probability Estimation

Notice: This research summary and analysis were automatically generated using AI technology; for full accuracy, please refer to the original arXiv source.

We present a novel family of language model (LM) estimation techniques named Sparse Non-negative Matrix (SNM) estimation. A first set of experiments empirically evaluating it on the One Billion Word Benchmark shows that SNM $n$-gram LMs perform almost as well as the well-established Kneser-Ney (KN) models. When using skip-gram features the models are able to match the state-of-the-art recurrent neural network (RNN) LMs; combining the two modeling techniques yields the best known result on the benchmark. The computational advantages of SNM over both maximum entropy and RNN LM estimation are probably its main strength, promising an approach that has the same flexibility in combining arbitrary features effectively and yet should scale to very large amounts of data as gracefully as $n$-gram LMs do.


💡 Research Summary

The paper introduces a novel family of language‑model estimation techniques called Sparse Non‑negative Matrix (SNM) estimation. The authors start by framing language modeling as the problem of estimating conditional word probabilities given a history, traditionally tackled with (n‑1)‑gram equivalence classes (e.g., Kneser‑Ney smoothing) or neural network models (e.g., recurrent neural networks, RNNs). While n‑gram models are computationally cheap, they struggle with long‑range dependencies; RNNs capture such dependencies but are expensive to train and require substantial memory.

SNM addresses these issues by representing each training instance as a pair of sparse binary vectors: a feature vector f that encodes any set of context features (including conventional n‑grams and skip‑grams) and a target vector t that indicates the word to be predicted. A non‑negative matrix M maps f to a dense prediction vector y = M f, which is then normalized to a probability distribution over the vocabulary. The matrix entries are defined as

 Mᵢⱼ = exp(A(i, j)) · Cᵢⱼ / Cᵢ*

where Cᵢⱼ is the raw count of feature i co‑occurring with word j, Cᵢ* is the total count of feature i, and A(i, j) is a real‑valued adjustment function.
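The prediction step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the toy counts, the matrix layout (with Mᵢⱼ indexed by feature i and word j, so yⱼ = Σᵢ fᵢ Mᵢⱼ), and the zero-valued adjustment stub are all assumptions for demonstration.

```python
import numpy as np

# Toy setup: counts[i, j] = C_ij (feature i co-occurring with word j),
# totals[i] = C_i* (total count of feature i). Values are illustrative.
V = 4                                   # toy vocabulary size
counts = np.array([[2., 1., 0., 1.],    # feature 0
                   [0., 3., 1., 0.]])   # feature 1
totals = counts.sum(axis=1)             # C_i* = sum_j C_ij

def adjust(i, j):
    # Placeholder for the metafeature-based adjustment A(i, j);
    # with A = 0 the model reduces to relative frequencies.
    return 0.0

# M_ij = exp(A(i, j)) * C_ij / C_i*
M = np.array([[np.exp(adjust(i, j)) * counts[i, j] / totals[i]
               for j in range(V)]
              for i in range(counts.shape[0])])

def snm_probs(f):
    """Score each word as y_j = sum_i f_i * M_ij, then normalize."""
    y = M.T @ f
    return y / y.sum()

f = np.array([1., 1.])                  # both features fire for this context
p = snm_probs(f)                        # probability distribution over V words
```

Because f is sparse and binary in practice, only the rows of M for active features contribute to y, which is what keeps scoring cheap.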

The adjustment function is not a simple scalar; it is a sum over a large set of metafeatures. Each metafeature is a conjunction of elementary properties such as:

  1. Feature identity (e.g., the exact skip‑gram string),
  2. Feature type (the (r, s, a) tuple describing remote, skipped, and adjacent words),
  3. Feature count Cᵢ*,
  4. Target identity (the predicted word), and
  5. Feature‑target count Cᵢⱼ.

The count‑valued metafeatures are bucketed by the floor of their log₂ values (and optionally by a ceiling bucket) to reduce sparsity; conjoining the five elementary types yields up to 2⁵ − 1 possible metafeature combinations. Each metafeature is hashed to a key in a massive hash table that stores a weight θ; the adjustment A(i, j) is the sum of the θ values for all metafeatures associated with the (i, j) pair. Hash collisions are allowed to keep memory usage modest, effectively tying together some metafeatures.
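A minimal sketch of this bucketing-and-hashing scheme follows. The table size, key format, and feature names are assumptions; for brevity only the five elementary metafeatures are keyed, whereas the full model would also enumerate their conjunctions (up to 2⁵ − 1 nonempty subsets), and only the floor bucket is shown.

```python
import math

TABLE_SIZE = 2 ** 20            # stand-in for the "massive" hash table
theta = [0.0] * TABLE_SIZE      # one weight per key; collisions tie weights

def bucket(count):
    """Bucket a count by the floor of its log2 value to reduce sparsity."""
    return int(math.floor(math.log2(count))) if count > 0 else -1

def metafeature_keys(feature_id, feature_type, c_i, target_word, c_ij):
    """The five elementary metafeatures of one (feature, target) pair."""
    return [
        ("feat", feature_id),            # 1. feature identity
        ("type", feature_type),          # 2. feature type, e.g. an (r, s, a) tuple
        ("feat_count", bucket(c_i)),     # 3. bucketed feature count C_i*
        ("target", target_word),         # 4. target identity
        ("pair_count", bucket(c_ij)),    # 5. bucketed feature-target count C_ij
    ]

def adjustment(feature_id, feature_type, c_i, target_word, c_ij):
    """A(i, j): sum of hashed weights over the pair's metafeatures."""
    keys = metafeature_keys(feature_id, feature_type, c_i, target_word, c_ij)
    return sum(theta[hash(k) % TABLE_SIZE] for k in keys)

a = adjustment("the quick", (0, 0, 2), 40, "brown", 5)
```

With all weights initialized to zero the adjustment is zero, i.e. the model starts from plain relative frequencies and SGD moves the θ values away from that baseline.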

Training proceeds by stochastic gradient descent (SGD) on a loss function. The authors consider two losses: a multinomial cross‑entropy and a Poisson loss. The multinomial loss requires a sum over the entire vocabulary for each gradient step, which is computationally prohibitive for large vocabularies. The Poisson loss, by contrast, yields a gradient that depends only on the feature vector f and the target indicator t, allowing the authors to restrict updates to positive training examples (where the feature is present and the target word occurs). This reduces the per‑example cost from O(|V|) to O(|Pos(f)|).

To avoid over‑fitting, a leave‑one‑out (LOO) scheme is employed: when computing the gradient for a particular (fᵢ, tⱼ) pair, the counts Cᵢ* and Cᵢⱼ used in the adjustment are temporarily decremented, ensuring that the model does not “see” the example it is currently updating. The resulting gradient update for a positive example is

 ∂L/∂A(i, j) = fᵢ tⱼ (Cᵢ* – Cᵢⱼ) / Cᵢ* · (1 – yⱼ)

with analogous handling for negative examples distributed uniformly across positive ones sharing the same feature.
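One positive-example update can be sketched directly from the gradient above. This is a hedged illustration under stated assumptions: the learning rate, the descent direction (treating L as a loss to minimize), and the function names are not from the paper.

```python
def loo_update(theta_val, c_i, c_ij, y_j, lr=0.1):
    """One SGD step on A(i, j) for a positive example (f_i = t_j = 1).

    The counts C_i* and C_ij are decremented first (leave-one-out),
    so the gradient does not "see" the example being updated.
    """
    c_i_loo, c_ij_loo = c_i - 1, c_ij - 1    # remove the current example
    # Gradient from the text: (C_i* - C_ij) / C_i* * (1 - y_j),
    # using the leave-one-out counts.
    grad = (c_i_loo - c_ij_loo) / c_i_loo * (1.0 - y_j)
    return theta_val - lr * grad             # descent step on the loss

w = loo_update(theta_val=0.0, c_i=11, c_ij=4, y_j=0.5)
```

Because the update touches only the metafeature weights of the (i, j) pair in the current example, the per-example cost stays proportional to the number of active feature-target pairs rather than the vocabulary size.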

A central contribution of the paper is the incorporation of skip‑gram features. A skip‑gram is defined by a triple (r, s, a): r remote context words, s skipped words, and a adjacent context words. For example, in the sentence “The quick brown fox jumps over the lazy dog”, a (1, 2, 3) skip‑gram for “dog” consists of one remote word (“brown”), two skipped words (“fox jumps”), and three adjacent words (“over the lazy”) immediately preceding the target.
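Extracting such a feature from the words preceding the target can be sketched as follows; the feature-string format (including the “[skip-s]” marker) is an assumption for illustration, not necessarily the paper's notation.

```python
def skipgram_feature(context, r, s, a):
    """Build an (r, s, a) skip-gram from the words preceding the target:
    r remote words, then s skipped words, then a adjacent words."""
    assert len(context) >= r + s + a, "context too short for this (r, s, a)"
    adjacent = context[-a:]                         # a words next to the target
    remote = context[-(a + s + r):-(a + s)]         # r words before the skip
    # The s skipped words between remote and adjacent are dropped.
    return " ".join(remote) + f" [skip-{s}] " + " ".join(adjacent)

context = "The quick brown fox jumps over the lazy".split()
feat = skipgram_feature(context, r=1, s=2, a=3)
# feat == "brown [skip-2] over the lazy"
```

Setting s = 0 recovers an ordinary n‑gram context, which is why skip‑grams subsume conventional n‑gram features in this framework.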

