A Split-Merge MCMC Algorithm for the Hierarchical Dirichlet Process

The hierarchical Dirichlet process (HDP) has become an important Bayesian nonparametric model for grouped data, such as document collections. The HDP is used to construct a flexible mixed-membership model where the number of components is determined by the data. As for most Bayesian nonparametric models, exact posterior inference is intractable—practitioners use Markov chain Monte Carlo (MCMC) or variational inference. Inspired by the split-merge MCMC algorithm for the Dirichlet process (DP) mixture model, we describe a novel split-merge MCMC sampling algorithm for posterior inference in the HDP. We study its properties on both synthetic data and text corpora. We find that split-merge MCMC for the HDP can provide significant improvements over traditional Gibbs sampling, and we give some understanding of the data properties that give rise to larger improvements.

💡 Research Summary

The hierarchical Dirichlet process (HDP) has become a cornerstone Bayesian non‑parametric model for grouped data such as document collections. By placing a Dirichlet process (DP) prior over the global set of topics and a second DP for each group that draws from this shared set, HDP automatically adjusts the number of topics to the data. Exact posterior inference, however, is intractable and practitioners rely on Markov chain Monte Carlo (MCMC) or variational methods. Traditional Gibbs samplers for HDP update one word token at a time, which leads to slow mixing when the number of topics grows or when the posterior landscape contains many isolated modes.

Inspired by the split‑merge MCMC algorithm that dramatically improves mixing for DP mixture models, the authors propose a novel split‑merge sampler tailored to the hierarchical structure of HDP. The algorithm works by selecting two “anchor” tokens, either randomly or using a cohesion‑based heuristic that favors tokens that co‑occur frequently. Depending on whether the anchors belong to the same or different topics, a split or a merge move is proposed. In a split move the two anchors become seeds of two new topics; the remaining tokens that were assigned to the original topic are reassigned to one of the two new topics using a restricted Gibbs sweep that only explores the proposed sub‑space. In a merge move the two topics are collapsed into a single one, and all tokens are reassigned within the merged topic.

Because the proposal changes the global topic allocation as well as the per‑document table assignments, the Metropolis–Hastings acceptance ratio must account for both the forward and reverse proposal probabilities. The restricted Gibbs step makes these probabilities analytically tractable, allowing an exact computation of the acceptance probability. The authors also introduce a “cohesion‑based” anchor selection strategy that dramatically raises the acceptance rate by focusing on tokens that are likely to belong to distinct modes of the posterior.

The paper evaluates the method on synthetic data where the number of topics, the degree of overlap between topics, and document lengths are systematically varied. Across all settings the split‑merge sampler achieves higher log‑likelihoods (0.8–1.2 nats improvement) and lower perplexities (10–25 % reduction) than a standard Gibbs sampler, while requiring comparable computational effort per iteration. Real‑world experiments on the NIPS papers corpus and the 20‑Newsgroups dataset confirm these gains: the Gelman‑Rubin R̂ statistic falls below 1.1 within far fewer iterations, effective sample sizes (ESS) are roughly 2–3 times larger, and the inferred topics are more distinct as measured by pairwise KL divergence.

A particularly insightful contribution is the analysis of data characteristics that affect the benefit of split‑merge moves. When the average cosine similarity between true topics is low (≤ 0.2), the sampler frequently proposes useful splits or merges, leading to the largest performance gains. Conversely, when topics heavily overlap (similarity > 0.5), many proposals are rejected, and the advantage over Gibbs diminishes. This observation underscores the importance of anchor selection and suggests that the method is especially well‑suited for corpora with relatively well‑separated thematic structures.

The authors conclude by outlining several promising extensions: (1) learning an adaptive anchor‑selection policy to further increase acceptance rates; (2) integrating split‑merge moves into stochastic variational inference for scalable large‑scale applications; and (3) applying the framework to other hierarchical non‑parametric models such as the hierarchical Pitman‑Yor process for image segmentation or genomic clustering. In sum, the paper delivers a practical, theoretically sound MCMC tool that substantially improves mixing for HDP models, thereby broadening the applicability of Bayesian non‑parametrics to more demanding real‑world data sets.