Scalable Topical Phrase Mining from Text Corpora
While most topic modeling algorithms model text corpora with unigrams, human interpretation often relies on inherent grouping of terms into phrases. As such, we consider the problem of discovering topical phrases of mixed lengths. Existing work either performs post processing to the inference results of unigram-based topic models, or utilizes complex n-gram-discovery topic models. These methods generally produce low-quality topical phrases or suffer from poor scalability on even moderately-sized datasets. We propose a different approach that is both computationally efficient and effective. Our solution combines a novel phrase mining framework to segment a document into single and multi-word phrases, and a new topic model that operates on the induced document partition. Our approach discovers high quality topical phrases with negligible extra cost to the bag-of-words topic model in a variety of datasets including research publication titles, abstracts, reviews, and news articles.
💡 Research Summary
The paper addresses the longstanding gap between topic modeling, which traditionally relies on unigram representations, and human interpretation that often depends on multi‑word phrases. Existing approaches either post‑process unigram‑based models or embed complex n‑gram generation mechanisms directly into the topic model. Both strategies suffer from either low phrase quality or poor scalability on realistic corpora.
To overcome these limitations, the authors propose ToPMine, a two‑stage framework that first mines high‑quality phrases and segments each document into a “bag‑of‑phrases,” then performs topic inference under the constraint that all words within a phrase share the same latent topic.
Stage 1 – Phrase Mining and Segmentation. The authors introduce an efficient frequent‑phrase mining algorithm that leverages two classic pruning principles: (1) the downward‑closure lemma, which guarantees that no super‑phrase of an infrequent phrase can be frequent, and (2) data antimonotonicity, which allows a document to be dropped from further processing once it contains no frequent phrase of the current length. By maintaining a list of active indices for each document and a hash‑based counter over sliding windows, the algorithm aggregates counts for all contiguous token sequences that meet a user‑defined minimum support. The pruning steps dramatically shrink the candidate space, turning what would be a quadratic‑time operation into one that is linear in the total number of tokens, O(p·N), where N is the corpus length in tokens and p is the (small) maximum length of a frequent phrase. After counting, a statistical significance measure based on observed versus expected co‑occurrence guides an agglomerative merging process that produces the final phrase partition. This process automatically filters out spurious candidates and determines appropriate phrase lengths, satisfying the three human‑interpretability criteria of frequency, collocation, and completeness.
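The counting‑and‑pruning loop described above can be sketched as follows. This is a minimal illustration of frequent contiguous‑phrase mining with downward‑closure and data‑antimonotonicity pruning, not the authors' implementation; the function name, data layout, and parameter defaults are invented for the example.

```python
from collections import Counter

def mine_frequent_phrases(docs, min_support=2, max_len=4):
    """Frequent contiguous-phrase mining with downward-closure pruning.

    docs: list of tokenized documents (each a list of word strings).
    Returns a dict mapping phrase tuples to their corpus counts.
    Illustrative sketch only -- not the paper's code.
    """
    # Pass 1: count unigrams and keep only the frequent ones.
    counts = Counter(w for doc in docs for w in doc)
    frequent = {(w,) for w, c in counts.items() if c >= min_support}
    all_frequent = {p: counts[p[0]] for p in frequent}

    # active[d] = positions in doc d where a frequent length-n phrase starts.
    active = [[i for i, w in enumerate(doc) if (w,) in frequent]
              for doc in docs]

    n = 1
    while any(active) and n < max_len:
        # Extend each frequent length-n occurrence by one token to the right.
        cand = Counter()
        for doc, idx in zip(docs, active):
            for i in idx:
                if i + n < len(doc):  # room for one more token
                    cand[tuple(doc[i:i + n + 1])] += 1
        frequent = {p for p, c in cand.items() if c >= min_support}
        all_frequent.update({p: cand[p] for p in frequent})
        # Downward closure: only positions that began a frequent (n+1)-gram
        # stay active; documents whose active list empties are skipped
        # entirely from then on (data antimonotonicity).
        active = [[i for i in idx if tuple(doc[i:i + n + 1]) in frequent]
                  for doc, idx in zip(docs, active)]
        n += 1
    return all_frequent
```

On a toy corpus, `mine_frequent_phrases([["topic", "model"], ["topic", "model"], ["topic", "shift"]])` keeps the bigram `("topic", "model")` (support 2) and prunes `("topic", "shift")` (support 1) without ever counting longer extensions of it.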
Stage 2 – Phrase‑Constrained Topic Modeling (PhraseLDA). With the document now represented as a sequence of phrases, the authors modify the standard Latent Dirichlet Allocation (LDA) inference to enforce a hard constraint: all tokens belonging to the same phrase are assigned the same topic indicator z. This is achieved by a collapsed Gibbs sampler that samples a single topic for an entire phrase rather than for each individual word, eliminating the need for additional latent variables to capture phrase boundaries. Consequently, the model’s computational complexity remains comparable to vanilla LDA while guaranteeing intra‑phrase topic consistency.
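The phrase‑level Gibbs update can be illustrated with a minimal collapsed sampler. The sketch below makes simplifying assumptions (symmetric priors, plain Python data structures, repeated words within a phrase treated independently) and is not the paper's implementation; the point it shows is that one topic indicator z is sampled per phrase, so all of a phrase's words move between topics together.

```python
import random
from collections import defaultdict

def phrase_lda_gibbs(docs, K, iters=50, alpha=0.1, beta=0.01, seed=0):
    """Minimal PhraseLDA-style collapsed Gibbs sampler (illustrative sketch).

    docs: list of documents, each a list of phrases, each phrase a list
    of word strings. One topic z is sampled per phrase, enforcing that
    all words in a phrase share a topic. Standard collapsed-LDA counts.
    """
    rng = random.Random(seed)
    V = len({w for doc in docs for p in doc for w in p})
    n_dk = [[0] * K for _ in docs]               # phrase-topic counts per doc
    n_kw = [defaultdict(int) for _ in range(K)]  # word counts per topic
    n_k = [0] * K                                # total words per topic
    z = []                                       # one topic per phrase

    # Random initialization.
    for d, doc in enumerate(docs):
        zd = []
        for phrase in doc:
            k = rng.randrange(K)
            zd.append(k)
            n_dk[d][k] += 1
            for w in phrase:
                n_kw[k][w] += 1
                n_k[k] += 1
        z.append(zd)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, phrase in enumerate(doc):
                # Remove the whole phrase from the counts.
                k_old = z[d][i]
                n_dk[d][k_old] -= 1
                for w in phrase:
                    n_kw[k_old][w] -= 1
                    n_k[k_old] -= 1
                # Conditional for the phrase: product of per-word topic
                # likelihoods (denominator grows as words are "added"),
                # ignoring repeats within a phrase for simplicity.
                weights = []
                for k in range(K):
                    p = n_dk[d][k] + alpha
                    for j, w in enumerate(phrase):
                        p *= (n_kw[k][w] + beta) / (n_k[k] + j + V * beta)
                    weights.append(p)
                k_new = rng.choices(range(K), weights=weights)[0]
                # Add the phrase back under the sampled topic.
                z[d][i] = k_new
                n_dk[d][k_new] += 1
                for w in phrase:
                    n_kw[k_new][w] += 1
                    n_k[k_new] += 1
    return z, n_kw
```

Because the per-phrase update touches the same count tables as vanilla LDA and adds no new latent variables, each sweep costs roughly the same as a word-level Gibbs sweep, which is the source of the scalability claim above.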
Experimental Evaluation. The authors evaluate ToPMine on four diverse corpora: DBLP paper titles (20Conf), PubMed abstracts, Amazon product reviews, and New York Times articles. They compare against state‑of‑the‑art phrase‑aware topic models such as Topical N‑Grams and PD‑LDA, as well as post‑processing baselines. Metrics include topic coherence, human‑judged phrase quality, perplexity, and runtime. ToPMine achieves perplexity comparable to standard LDA, substantially higher topic coherence, and markedly better human‑rated phrase relevance, while running an order of magnitude faster than the phrase‑aware baselines, confirming its scalability.
Contributions and Impact. The paper’s primary contributions are: (1) a linear‑time, data‑driven phrase mining algorithm that requires no linguistic resources; (2) a novel phrase‑constrained topic model (PhraseLDA) that integrates phrase structure directly into the generative process; (3) extensive empirical evidence of superior scalability and interpretability across multiple domains. By treating phrases as atomic units during inference, ToPMine bridges the gap between statistical topic models and the way humans naturally understand text, enabling more intuitive visualizations and downstream applications such as document summarization, recommendation, and information retrieval.
Future work may explore hierarchical extensions that model phrase‑to‑phrase relationships, incorporate supervision for domain‑specific phrase vocabularies, or apply the framework to multimodal data where textual phrases are aligned with visual or auditory cues.