A Fast and Simple Algorithm for Training Neural Probabilistic Language Models
In spite of their superior performance, neural probabilistic language models (NPLMs) remain far less widely used than n-gram models due to their notoriously long training times, which are measured in weeks even for moderately-sized datasets. Training NPLMs is computationally expensive because they are explicitly normalized, which leads to having to consider all words in the vocabulary when computing the log-likelihood gradients. We propose a fast and simple algorithm for training NPLMs based on noise-contrastive estimation, a newly introduced procedure for estimating unnormalized continuous distributions. We investigate the behaviour of the algorithm on the Penn Treebank corpus and show that it reduces the training times by more than an order of magnitude without affecting the quality of the resulting models. The algorithm is also more efficient and much more stable than importance sampling because it requires far fewer noise samples to perform well. We demonstrate the scalability of the proposed approach by training several neural language models on a 47M-word corpus with an 80K-word vocabulary, obtaining state-of-the-art results on the Microsoft Research Sentence Completion Challenge dataset.
💡 Research Summary
The paper tackles one of the most persistent practical obstacles in neural probabilistic language modeling: the prohibitive cost of training due to the need to explicitly normalize over a very large vocabulary. Traditional neural language models (NLMs) compute a soft‑max over all vocabulary items at each training step, which makes the gradient computation O(|V|) and leads to training times measured in weeks for corpora of modest size. Existing remedies such as importance sampling or hierarchical soft‑max either require a large number of noise samples to achieve low‑variance gradient estimates or introduce considerable algorithmic complexity.
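To make the O(|V|) cost concrete, here is a minimal sketch (with hypothetical logits, not the paper's actual model) of the exact per-word log-likelihood that conventional training must compute: the normalizer touches every vocabulary entry on every update.

```python
import math

def exact_log_likelihood(scores, target):
    """Exact log P(target | context) under a softmax over the whole vocabulary.

    `scores` holds one unnormalized logit per vocabulary word, so computing
    the normalizer below costs O(|V|) for every single training example.
    """
    log_z = math.log(sum(math.exp(s) for s in scores))  # sum over ALL of V
    return scores[target] - log_z
```

For a uniform score vector of length 4 this returns log(1/4); with |V| in the tens of thousands, that inner sum (and its gradient) dominates training time.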
To overcome these limitations, the authors adopt Noise‑Contrastive Estimation (NCE), a recently proposed method for estimating unnormalized continuous distributions. NCE reframes density estimation as a binary classification problem: the model must discriminate between true data samples and artificially generated “noise” samples drawn from a known distribution. By treating the normalizing constant as a learnable parameter and using a logistic loss instead of the full log‑likelihood, NCE eliminates the need to sum over the entire vocabulary. In practice, each training update only needs one real word and a small number k of noise words, reducing the per‑step computational complexity to O(k).
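The binary-classification objective can be sketched as follows. This is a didactic, pure-Python illustration under the standard NCE formulation (scores are the model's unnormalized log-probabilities with the learned normalizer folded in; `q` is the noise distribution), not code from the paper: the probability that a word came from the data rather than the noise is a sigmoid of the score gap, and the loss touches only 1 + k words.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def nce_loss(score_data, scores_noise, log_q_data, log_q_noise, k):
    """NCE objective for one true word plus k noise words.

    score_*: unnormalized model log-scores s(w, h), normalizer folded in.
    log_q_*: log-probabilities of the same words under the noise dist q.
    P(data | w) = sigmoid(s(w, h) - log(k * q(w))).
    """
    delta_data = score_data - (math.log(k) + log_q_data)
    loss = -math.log(sigmoid(delta_data))          # true word classified "data"
    for s, lq in zip(scores_noise, log_q_noise):
        delta = s - (math.log(k) + lq)
        loss -= math.log(1.0 - sigmoid(delta))     # noise words classified "noise"
    return loss
```

Raising the model's score on the observed word lowers the loss, and no sum over the vocabulary ever appears.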
The authors first evaluate NCE on the Penn Treebank (≈1 M words, 10 K vocabulary). They compare three training regimes: (1) exact soft‑max, (2) importance sampling, and (3) NCE with k = 5–10. NCE achieves a speed‑up of more than an order of magnitude (≈6 h vs. ≈70 h for exact soft‑max) while attaining virtually identical perplexity (78.5 vs. 78.3). Moreover, NCE shows far greater stability: even with a small number of noise samples the training curve is smooth, whereas importance sampling exhibits high variance unless many samples are used.
To demonstrate scalability, the method is applied to a 47 M‑word corpus with an 80 K‑word vocabulary. The authors train several architectures—two‑layer multilayer perceptrons, single‑layer LSTMs, and deep convolutional language models—using NCE. Training time drops from several days (with conventional methods) to under 6 hours on a modern GPU cluster, and memory consumption is reduced by roughly 30 % because the full soft‑max weight matrix need not be materialized for each update. The resulting models achieve state‑of‑the‑art performance on the Microsoft Research Sentence Completion Challenge, attaining 58.9 % accuracy, surpassing the previous best of 57.5 %.
A theoretical analysis contrasts NCE with importance sampling. Importance sampling provides an unbiased estimator of the gradient but suffers from high variance unless the proposal distribution closely matches the target distribution and a large number of samples are drawn. NCE, by contrast, learns the normalizing constant jointly with the model parameters, and the binary classification loss yields low‑variance gradients even with few noise samples. Empirically, the authors confirm that NCE requires an order of magnitude fewer noise samples than importance sampling to achieve comparable or better performance.
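The variance argument can be illustrated on the partition function itself. A small sketch (illustrative only, with made-up scores): the importance-sampling estimate of Z = Σ_w exp(s_w) is unbiased for any proposal q, but its variance collapses to zero only when q matches the model's own softmax, which is exactly the regime that is hard to reach in practice.

```python
import math
import random

def is_partition_estimate(scores, proposal, m, rng):
    """Importance-sampling estimate of Z = sum_w exp(s_w) using proposal q.

    Each draw w contributes exp(s_w) / q(w). The estimator is unbiased,
    but its variance grows with the mismatch between q and exp(s_w) / Z.
    """
    words = list(range(len(scores)))
    draws = rng.choices(words, weights=proposal, k=m)
    return sum(math.exp(scores[w]) / proposal[w] for w in draws) / m
```

With q equal to the model's softmax, every sample weight equals Z exactly (zero variance); with a mismatched q such as the uniform distribution, many more samples are needed for a stable estimate — mirroring why importance sampling needs a good proposal while NCE sidesteps the estimate entirely.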
Practical considerations discussed include the choice of noise distribution (empirically, a unigram‑based distribution works well), the schedule for the number of noise samples (starting small and increasing gradually improves convergence), and initialization of the normalizing constant (setting its logarithm to zero, i.e. Z = 1, stabilizes early training). The paper also notes that NCE is compatible with other speed‑up techniques such as batch normalization or adaptive learning‑rate methods, opening avenues for further acceleration.
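A unigram noise distribution is straightforward to build from corpus counts. The sketch below (function and variable names are our own, not the paper's) estimates word probabilities from a token stream and draws the k noise words used by one NCE update:

```python
import random
from collections import Counter

def build_unigram_noise(tokens):
    """Unigram noise distribution: word probabilities from corpus counts."""
    counts = Counter(tokens)
    total = sum(counts.values())
    vocab = sorted(counts)
    probs = [counts[w] / total for w in vocab]
    return vocab, probs

def draw_noise_words(vocab, probs, k, rng):
    """Draw k noise words (with replacement) for a single NCE update."""
    return rng.choices(vocab, weights=probs, k=k)
```

Because sampling is done with replacement from a fixed table, noise generation stays O(k) per update regardless of vocabulary size.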
In conclusion, the study demonstrates that Noise‑Contrastive Estimation provides a fast, simple, and stable alternative to exact soft‑max training for neural probabilistic language models. It reduces training time by more than tenfold without sacrificing perplexity or downstream task performance, and it scales gracefully to vocabularies of tens of thousands of words and corpora of tens of millions of tokens. This makes NCE an attractive tool for both academic research and industrial deployment of large‑scale neural language models, and it invites future work on hybrid sampling strategies, alternative noise distributions, and extensions to other unnormalized models such as energy‑based networks.