Bayesian Agglomerative Clustering with Coalescents

Bayesian Agglomerative Clustering with Coalescents
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

We introduce a new Bayesian model for hierarchical clustering based on a prior over trees called Kingman’s coalescent. We develop novel greedy and sequential Monte Carlo inferences which operate in a bottom-up agglomerative fashion. We show experimentally the superiority of our algorithms over others, and demonstrate our approach in document clustering and phylolinguistics.


💡 Research Summary

The paper proposes a novel Bayesian framework for hierarchical clustering that is built on Kingman’s coalescent—a stochastic process originally devised to model genealogical trees in population genetics. By treating a clustering tree as a realization of a coalescent, the authors obtain a prior over tree topologies that naturally encodes a probabilistic “bottom‑up” merging order. This contrasts with traditional Bayesian hierarchical models that rely on Dirichlet‑process‑based priors or fixed distance metrics, which either impose restrictive structural assumptions or lack an explicit generative story for the merging sequence.

The generative model consists of two components. First, a tree structure (T) is sampled from the coalescent with a rate parameter (\lambda). The coalescent defines a distribution over the times at which any pair of lineages merges, yielding a simple closed‑form expression for the prior probability of any particular merge. Second, given (T), observed data (\mathbf{x}_i) at the leaves are generated from a likelihood model that is conditionally independent across clusters once the tree is fixed. The authors illustrate the approach with Gaussian mixture likelihoods for continuous data and multinomial models for document word counts, but the framework is agnostic to the specific choice of leaf‑level distribution.

To perform inference on the posterior (p(T,\theta\mid\mathbf{X})), the authors develop two complementary algorithms. The first is a greedy agglomerative procedure that mirrors classic hierarchical clustering: starting from singleton clusters, it repeatedly selects the pair that maximizes a score equal to the product of the coalescent prior for that merge and the marginal likelihood improvement obtained by joining the two clusters. The score can be computed efficiently using sufficient statistics, and the algorithm yields a single deterministic tree that is a maximum‑a‑posteriori (MAP) estimate under the coalescent prior.

The second algorithm is a Sequential Monte Carlo (SMC) sampler that maintains a population of particle trees. At each iteration a particle proposes a merge according to a proposal distribution that blends the coalescent prior with a data‑driven likelihood term. Particle weights are updated by the ratio of the target joint density (prior × likelihood) to the proposal density, and systematic resampling is performed to focus computational effort on high‑weight particles while preserving diversity. This SMC scheme enables exploration of multiple high‑probability regions of the posterior, thereby capturing uncertainty over tree structures that the greedy method cannot represent.

Empirical evaluation is carried out on two distinct domains. In a document‑clustering setting (20 Newsgroups, Reuters‑21578), the authors use a multinomial likelihood with Dirichlet priors on word probabilities. The coalescent‑based methods (both greedy and SMC) achieve higher Adjusted Rand Index, Normalized Mutual Information, and lower within‑cluster variance than baselines including Bayesian hierarchical Dirichlet processes, Ward’s method, and recent variational tree models. Notably, the SMC approach consistently outperforms the greedy MAP tree by 3–5 % on these metrics, demonstrating the benefit of posterior averaging.

In a phylogenetic‑linguistics experiment, binary lexical feature vectors for a set of languages are clustered. The inferred coalescent trees closely match accepted linguistic family trees, and the SMC sampler captures subtle borrowing events that appear as asymmetric merges, which are missed by distance‑based agglomerative clustering. The learned coalescent rate (\lambda) adapts to the data, effectively controlling tree depth and the aggressiveness of merging without manual tuning.

The paper’s contributions are threefold: (1) introducing Kingman’s coalescent as a flexible, analytically tractable prior for hierarchical clustering; (2) devising both a fast greedy MAP estimator and a principled SMC sampler that exploit the coalescent’s structure; and (3) demonstrating superior performance on real‑world text and linguistic data, thereby validating the practical relevance of the approach. The authors also discuss extensions such as combining the coalescent with Pitman‑Yor processes for richer priors, applying the framework to non‑vectorial data (e.g., images, graphs), and scaling the SMC algorithm via parallelism. Overall, the work bridges a gap between population‑genetics theory and machine‑learning clustering, offering a theoretically sound yet computationally feasible alternative to existing Bayesian hierarchical models.


Comments & Academic Discussion

Loading comments...

Leave a Comment