Parsimonious Topic Models with Salient Word Discovery
We propose a parsimonious topic model for text corpora. In related models such as Latent Dirichlet Allocation (LDA), all words are modeled topic-specifically, even though many words occur with similar frequencies across different topics. Our modeling determines salient words for each topic, which have topic-specific probabilities, with the rest explained by a universal shared model. Further, in LDA all topics are in principle present in every document. By contrast, our model gives a sparse topic representation, determining the (small) subset of relevant topics for each document. We derive a Bayesian Information Criterion (BIC), balancing model complexity and goodness of fit. Here, interestingly, we identify an effective sample size and corresponding penalty specific to each parameter type in our model. We minimize BIC to jointly determine our entire model – the topic-specific words, document-specific topics, all model parameter values, and the total number of topics – in a wholly unsupervised fashion. Results on three text corpora and an image dataset show that our model achieves higher test set likelihood and better agreement with ground-truth class labels, compared to LDA and to a model designed to incorporate sparsity.
💡 Research Summary
The paper introduces a parsimonious topic model that simultaneously enforces sparsity at the word level and at the document‑topic level, addressing two major shortcomings of Latent Dirichlet Allocation (LDA). In LDA every topic is represented by a full probability distribution over the entire vocabulary, and every document is assumed to contain all topics with non‑zero proportions. This leads to an explosion of parameters (one probability per word per topic) and a high risk of over‑fitting, especially when the vocabulary is large.
The proposed model tackles these issues by (1) distinguishing “salient” (topic‑specific) words from “shared” words and (2) allowing each document to contain only a small subset of the total topics. For each topic j a binary indicator u_{jn} marks whether word n is salient for that topic. If u_{jn}=1 the word is generated from a topic‑specific probability β_{jn}; otherwise it is generated from a universal background distribution β_{0n} shared across all topics. This dramatically reduces the number of word‑topic parameters because the majority of common words are modeled by the single shared distribution.
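The salient-vs-shared split can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the vocabulary size, the mask `u`, and all probability values are made up, and spreading the leftover probability mass over non-salient words proportionally to the shared distribution is one plausible normalization.

```python
import numpy as np

N = 8                                      # toy vocabulary size
beta0 = np.full(N, 1.0 / N)                # shared background distribution beta_{0n}
u = np.array([1, 1, 0, 0, 0, 0, 0, 0])     # u_{jn}: words 0 and 1 are salient for topic j
beta_j_salient = np.array([0.4, 0.3])      # topic-specific probabilities beta_{jn}

def topic_word_dist(u, beta_salient, beta0):
    """Build the full word distribution for one topic: salient words keep
    their topic-specific probabilities; the remaining mass is spread over
    non-salient words in proportion to the shared distribution."""
    dist = np.zeros_like(beta0)
    dist[u == 1] = beta_salient
    nonsal = (u == 0)
    rest = 1.0 - beta_salient.sum()
    dist[nonsal] = rest * beta0[nonsal] / beta0[nonsal].sum()
    return dist

dist = topic_word_dist(u, beta_j_salient, beta0)
rng = np.random.default_rng(0)
word = rng.choice(N, p=dist)               # sample one word token from topic j
```

Only the two salient entries are topic-specific parameters; the other six are determined by the single shared distribution, which is where the parameter savings come from.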
At the document level, a binary indicator v_{jd} denotes whether topic j is present in document d. When v_{jd}=1 the document‑specific proportion α_{jd} is estimated; when v_{jd}=0 the proportion is forced to zero. Consequently each document is described by only M_d ≪ M active topics, yielding a sparse representation of topic proportions.
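A minimal sketch of the document-level sparsity, using hypothetical toy values for M, the indicator vector, and the active-topic proportions:

```python
import numpy as np

M = 10                                           # total number of topics (toy value)
v_d = np.array([0, 1, 0, 0, 1, 0, 0, 0, 1, 0])   # v_{jd}: topics 1, 4, 8 active in doc d
raw = np.array([0.5, 0.3, 0.2])                  # proportions estimated for active topics

alpha_d = np.zeros(M)                            # alpha_{jd} forced to zero when v_{jd}=0
alpha_d[v_d == 1] = raw / raw.sum()

M_d = int((alpha_d > 0).sum())                   # M_d = 3 active topics, M_d << M
```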
Learning proceeds by alternating between structure learning (selecting the sets of salient words u and active topics v) and parameter learning (estimating α and β given the current structure). The key methodological contribution is a novel Bayesian Information Criterion (BIC) that assigns different penalty terms to different types of parameters based on their effective sample sizes. Topic‑specific word probabilities β_{jn} are penalized according to the total number of word tokens associated with topic j, while the shared word probabilities β_{0n} are penalized according to the overall corpus size. This differentiated penalty reflects the true amount of information supporting each parameter class and prevents over‑penalization of the shared component.
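The differentiated penalty can be illustrated with a schematic scoring function. The paper's BIC contains additional structure terms; the sketch below shows only the key idea that each parameter block is penalized with the log of its own effective sample size. All names and numbers here are ours, not the paper's.

```python
import numpy as np

def sketch_bic(neg_loglik, tokens_per_topic, salient_words_per_topic,
               total_tokens, n_shared_params):
    """Schematic BIC with parameter-type-specific penalties: each
    topic-specific word probability beta_{jn} pays 0.5 * log(N_j), where
    N_j is the token count associated with topic j (its effective sample
    size); each shared probability beta_{0n} pays 0.5 * log(N), where N
    is the total corpus token count."""
    penalty = sum(0.5 * k_j * np.log(n_j)
                  for n_j, k_j in zip(tokens_per_topic, salient_words_per_topic))
    penalty += 0.5 * n_shared_params * np.log(total_tokens)
    return neg_loglik + penalty

# Toy usage: two topics with 200 and 300 tokens, 5 and 4 salient words,
# a 500-token corpus, and 100 shared-word parameters.
score = sketch_bic(1000.0, [200, 300], [5, 4], 500, 100)
```

Because N_j is typically much smaller than N, topic-specific parameters are cheaper per unit of data than they would be under a single corpus-wide penalty, which is what prevents over-penalization of the shared component.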
Importantly, the number of topics M is not fixed a priori. The BIC objective includes a term for M, allowing the model to automatically select the optimal number of topics during learning, without any held‑out validation set or cross‑validation. This contrasts with LDA, where M must be chosen manually.
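Schematically, model-order selection just evaluates the BIC objective at each candidate M and keeps the minimizer. `fit_and_score` below is a toy stand-in (with its minimum placed arbitrarily at M = 4) for the full alternating structure/parameter learning; in the real method the score comes from the learned model.

```python
def fit_and_score(M):
    # Toy surrogate for "run structure + parameter learning with M topics
    # and return the resulting BIC". A real run returns the optimized BIC.
    return (M - 4) ** 2 + 100.0

candidates = range(1, 11)
best_M = min(candidates, key=fit_and_score)  # chosen without any held-out set
```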
The authors evaluate the model on three text corpora (including news and scientific articles) and one image dataset where images are represented as visual words. They compare against standard LDA and Sparse Topical Coding (STC), measuring test‑set log‑likelihood and agreement with external class labels (e.g., NMI, accuracy). Across all datasets the parsimonious model achieves higher log‑likelihoods and better label alignment. The advantage is especially pronounced on corpora with very large vocabularies, where the shared background model captures the bulk of common words while the salient‑word component focuses on discriminative terms. Moreover, the model reliably discovers a sensible number of topics and a compact set of active topics per document.
In summary, the paper makes four major contributions: (1) introducing a shared‑vs‑salient word representation within a topic model, reducing lexical parameter count; (2) explicitly modeling sparsity of topic proportions at the document level; (3) deriving a BIC formulation with parameter‑type‑specific penalties based on effective sample sizes; and (4) integrating model‑order selection into the unsupervised learning objective. The approach is computationally tractable, avoids hyper‑parameter tuning, and is applicable not only to text but also to other high‑dimensional data such as images.