Sampled Weighted Min-Hashing for Large-Scale Topic Mining

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

We present Sampled Weighted Min-Hashing (SWMH), a randomized approach to automatically mine topics from large-scale corpora. SWMH generates multiple random partitions of the corpus vocabulary based on term co-occurrence and agglomerates highly overlapping inter-partition cells to produce the mined topics. While other approaches define a topic as a probabilistic distribution over a vocabulary, SWMH topics are ordered subsets of such vocabulary. Interestingly, the topics mined by SWMH underlie themes from the corpus at different levels of granularity. We extensively evaluate the meaningfulness of the mined topics both qualitatively and quantitatively on the NIPS (1.7 K documents), 20 Newsgroups (20 K), Reuters (800 K) and Wikipedia (4 M) corpora. Additionally, we compare the quality of SWMH with Online LDA topics for document representation in classification.


💡 Research Summary

The paper introduces Sampled Weighted Min‑Hashing (SWMH), a scalable, randomized algorithm for mining topics from very large text collections. Unlike conventional topic models such as Latent Dirichlet Allocation (LDA) or Hierarchical Dirichlet Processes, which represent topics as probability distributions over the vocabulary, SWMH defines a topic as an ordered subset of terms that frequently co‑occur across documents. The method builds on the earlier Sampled Min‑Hashing (SMH) technique, which partitions the vocabulary by applying Min‑Hashing to inverted file lists, but extends it by incorporating term‑weight information into the hashing process.

The core of SWMH consists of two stages. In the partitioning stage, each term’s inverted list (the set of documents in which the term appears) is processed with a family of Min‑Hash functions. Traditional Min‑Hashing draws random permutations uniformly; SWMH replaces this with weighted permutations as described by Chum et al., where the probability of a document being selected in the permutation is proportional to a weight (e.g., inverse document length or tf‑idf). For each term, r independent Min‑Hash values are concatenated into a tuple; l such tuples are generated, yielding l hash tables. Terms that share an identical tuple in any table are grouped into a “co‑occurring term set.”
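The partitioning stage described above can be sketched as follows, assuming the exponential trick of Chum et al. (an element's key is −log(u)/w, so more heavily weighted documents are more likely to attain the minimum). The per-(document, seed) hashing scheme and the helper names here are illustrative choices, not the paper's implementation:

```python
import math
import random
from collections import defaultdict

def weighted_minhash(elements, weights, seed):
    """One weighted Min-Hash value: each element's key is -log(u)/w,
    so elements with larger weight are more likely to attain the
    minimum (Chum et al.'s exponential trick)."""
    best, best_key = None, float("inf")
    for e in elements:
        # deterministic per-(element, seed) uniform draw; guard against u == 0
        u = random.Random(hash((e, seed))).random() or 1e-12
        key = -math.log(u) / weights[e]
        if key < best_key:
            best, best_key = e, key
    return best

def partition_terms(inverted_lists, weights, r, l, seed=0):
    """Group terms whose r-tuples of weighted Min-Hash values collide
    in any of l tables; each colliding cell with two or more terms
    becomes a co-occurring term set."""
    tables = [defaultdict(list) for _ in range(l)]
    for term, docs in inverted_lists.items():
        for t in range(l):
            tup = tuple(weighted_minhash(docs, weights, seed + t * r + i)
                        for i in range(r))
            tables[t][tup].append(term)
    return [set(cell) for tbl in tables
            for cell in tbl.values() if len(cell) > 1]
```

Terms with identical inverted lists always collide, while terms whose document sets are disjoint never do; terms in between collide with probability governed by their weighted similarity.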

The second stage clusters these co‑occurring term sets into topics. Overlap between two sets C₁ and C₂ is measured by the overlap coefficient ov(C₁, C₂) = |C₁ ∩ C₂| / min(|C₁|, |C₂|). Pairs whose overlap exceeds a user‑defined threshold ε are linked in an undirected graph whose vertices are the term sets, and the connected components of this graph become the final topics. Because the Jaccard similarity of two sets is a lower bound on their overlap coefficient, high Jaccard similarity implies a high overlap coefficient, so the algorithm can efficiently prune candidate pairs using Min‑Hashing before computing the exact overlap, avoiding a quadratic blow‑up.
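A minimal sketch of this clustering step, using a brute-force pairwise comparison with union-find (the paper's Min-Hash pruning of candidate pairs is omitted here for clarity):

```python
def overlap(c1, c2):
    """Overlap coefficient: |C1 ∩ C2| / min(|C1|, |C2|)."""
    return len(c1 & c2) / min(len(c1), len(c2))

def cluster_term_sets(term_sets, eps):
    """Link term sets whose overlap exceeds eps and return the
    connected components of the resulting graph, each merged into
    a single topic (union of its member term sets)."""
    n = len(term_sets)
    parent = list(range(n))

    def find(i):
        # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if overlap(term_sets[i], term_sets[j]) > eps:
                parent[find(i)] = find(j)

    components = {}
    for i in range(n):
        components.setdefault(find(i), set()).update(term_sets[i])
    return list(components.values())
```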

Key parameters are the similarity threshold s* (which determines how high the Jaccard similarity of a pair must be for it to collide with high probability) and the tuple size r. The number of hash tables l is derived from the desired collision probability: l = log(0.5) / log(1 − s*^r), which ensures that a pair with Jaccard similarity s* collides in at least one table with probability 0.5. Smaller s* yields fewer, more precise topics; larger r reduces the chance of accidental collisions, leading to finer granularity.
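The table-count formula can be computed directly; a minimal sketch (taking the ceiling, an assumption made here so that the probability bound is guaranteed):

```python
import math

def num_tables(s_star, r):
    """Number of hash tables l so that a pair with Jaccard similarity
    at least s* collides in some table with probability >= 0.5:
    a single r-tuple collides with probability s^r, so l tables give
    1 - (1 - s^r)^l, and solving for 0.5 yields
    l = ceil(log(0.5) / log(1 - s*^r))."""
    return math.ceil(math.log(0.5) / math.log(1.0 - s_star ** r))
```

For example, s* = 0.10 with r = 3 gives a collision probability of 0.001 per table, so roughly 693 tables are needed to reach a 0.5 overall collision probability.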

The authors evaluate SWMH on four corpora of increasing size: NIPS (≈1.7 K documents), 20 Newsgroups (≈20 K), Reuters (≈800 K) and Wikipedia (≈4 M). They compare SWMH against the original SMH and against Online LDA (the stochastic variational inference version of LDA). Experiments vary s* (0.15, 0.13, 0.10) and r (3, 4). Results show that weighting dramatically reduces the number of mined topics and the average number of terms per topic—up to 73 % reduction on NIPS and 45 % on Reuters—while preserving semantic coherence.

Scalability tests on Reuters demonstrate near‑linear growth of runtime and memory with respect to both document count and a composite “complexity” metric (vocabulary size × average term frequency). Mining the full Wikipedia dump required about 45 000 seconds (≈12.5 hours) and 1.5 GB of RAM, far faster than Online LDA, which needed three days to learn 400 topics on the same data.

For downstream utility, the mined topics are used to construct document representations: each document is represented by its similarity scores to all topics, and a linear SVM is trained for the 20 Newsgroups classification task. As the number of topics increases from 205 to 2 427, classification accuracy improves from 59.9 % to 64.1 %, comparable to or slightly better than Online LDA (59.2 % with 100 topics, 65.9 % with 400 topics) while requiring far less computational resources.
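The document representation above can be sketched as follows; the particular similarity function (fraction of a topic's terms present in the document) is an illustrative assumption, not necessarily the paper's exact scoring:

```python
def topic_features(doc_terms, topics):
    """Represent a document as a vector of similarity scores to the
    mined topics. Here similarity is the fraction of each topic's
    terms that appear in the document (illustrative choice)."""
    doc = set(doc_terms)
    return [len(doc & topic) / len(topic) for topic in topics]
```

The resulting fixed-length vectors can then be fed to any linear classifier (e.g., a linear SVM such as scikit-learn's `LinearSVC`) for the 20 Newsgroups task.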

In summary, SWMH offers a practical alternative to probabilistic topic models for massive corpora. Its advantages are: (1) incorporation of term‑weighting into Min‑Hashing, yielding more meaningful co‑occurrence groups; (2) a simple yet effective clustering step based on overlap, which naturally discovers topics at multiple granularities; (3) linear scalability in both time and memory, making it feasible for corpora with millions of documents. The paper suggests future work on parallelizing the hash table construction, exploring dynamic weighting schemes, and extending the method to streaming environments.

