Tree-Structured Stick Breaking Processes for Hierarchical Data

Many data are naturally modeled by an unobserved hierarchical structure. In this paper we propose a flexible nonparametric prior over unknown data hierarchies. The approach uses nested stick-breaking processes to allow for trees of unbounded width and depth, where data can live at any node and are infinitely exchangeable. One can view our model as providing infinite mixtures where the components have a dependency structure corresponding to an evolutionary diffusion down a tree. By using a stick-breaking approach, we can apply Markov chain Monte Carlo methods based on slice sampling to perform Bayesian inference and simulate from the posterior distribution on trees. We apply our method to hierarchical clustering of images and topic modeling of text data.


💡 Research Summary

The paper introduces a novel non‑parametric Bayesian prior designed to capture unknown hierarchical relationships in data. Unlike traditional Dirichlet‑process (DP) mixtures or hierarchical Dirichlet processes (HDP), which provide infinite mixtures without explicit structure among components, the proposed Tree‑Structured Stick‑Breaking Process (TSSBP) generates a random tree of unbounded width and depth. The construction interleaves two stick‑breaking steps at every node: one stick decides how much of the node's unit "stick" of probability mass stops at that node versus passing on to its descendants, and a second stick recursively partitions the pass‑through mass among the node's (countably many) children. Iterating this recursion yields a potentially infinite tree, and each observation may be assigned to any node, internal or leaf, with probability given by the product of stick‑breaking weights along the path from the root to that node.
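The nested recursion described above can be sketched in a few lines of Python. This is an illustrative truncation only: the parameter names (`alpha0`, `lam`, `gamma`) and the depth/width cutoffs are assumptions of this sketch, since the process itself defines an unbounded tree.

```python
import random

def tssb_weights(max_depth, max_children, alpha0=1.0, lam=0.5, gamma=1.0, seed=0):
    """Sample node masses for a *truncated* tree under nested stick breaking.

    nu-sticks (Beta(1, alpha0 * lam**depth)) decide how much mass stops at a
    node; psi-sticks (Beta(1, gamma)) split the remainder among its children.
    Truncation to max_depth / max_children is purely for illustration.
    """
    rng = random.Random(seed)
    weights = {}

    def descend(path, mass, depth):
        nu = rng.betavariate(1.0, alpha0 * lam ** depth)   # stop-here stick
        weights[path] = mass * nu                          # mass retained at this node
        remaining = mass * (1.0 - nu)                      # mass passed to descendants
        if depth + 1 < max_depth:
            for i in range(max_children):
                psi = rng.betavariate(1.0, gamma)          # child-partition stick
                child_mass = remaining * psi
                remaining -= child_mass
                descend(path + (i,), child_mass, depth + 1)
        # mass beyond the truncation is simply dropped in this sketch

    descend((), 1.0, 0)
    return weights
```

Each key is a path from the root (the empty tuple), and the values sum to at most one, with the shortfall being the mass the truncation discards.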

A key theoretical contribution is the proof of infinite exchangeability. The authors adopt a “family of partitions” viewpoint, showing that the joint distribution of data points depends only on the multiset of tree paths, not on ordering. Consequently, the model retains the desirable de Finetti‑type representation while endowing mixture components with a natural dependency structure: higher‑level nodes act as “ancestral” topics or clusters, and lower‑level nodes inherit characteristics from their ancestors, analogous to an evolutionary diffusion process down a phylogenetic tree.

For posterior inference, the authors develop a slice‑sampling‑based Markov chain Monte Carlo (MCMC) algorithm that avoids explicit truncation of the infinite tree. Each datum is augmented with a slice variable u that selects a region of the stick‑breaking space; only the finitely many nodes whose weight exceeds u need to be instantiated in a given iteration. This dynamic "on‑the‑fly" expansion dramatically reduces memory usage and computational cost, allowing the sampler to explore arbitrarily deep and wide portions of the tree as needed. Hyper‑parameters governing the Beta distributions of the stick‑breaking proportions are also sampled, giving the model flexibility to adapt the tree's branching behavior to the data.
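The on‑the‑fly instantiation can be sketched as a retrospective lookup: map a uniform variate u to the node whose interval of stick‑breaking mass contains it, drawing sticks and creating children only when the walk actually needs to look past them. The class and function names here (`Node`, `find_node`) and the parameterization are assumptions of this sketch, not the paper's code.

```python
import random

class Node:
    """A lazily instantiated tree node: sticks and children are only drawn
    when the sampler actually needs to look past them."""
    def __init__(self, rng, depth, alpha0=1.0, lam=0.5, gamma=1.0):
        self.rng, self.depth = rng, depth
        self.params = (alpha0, lam, gamma)
        self.nu = rng.betavariate(1.0, alpha0 * lam ** depth)  # stop-here stick
        self.psis = []       # child-partition sticks, grown on demand
        self.children = []   # child nodes, grown on demand

def find_node(node, u):
    """Map u in [0, 1) to the node whose stick-breaking interval contains u,
    instantiating tree structure on the fly."""
    if u < node.nu:
        return node                        # u falls in this node's own mass
    u = (u - node.nu) / (1.0 - node.nu)    # renormalise the pass-through mass
    cum, rest, i = 0.0, 1.0, 0
    while True:
        if i == len(node.psis):            # instantiate the next child lazily
            alpha0, lam, gamma = node.params
            node.psis.append(node.rng.betavariate(1.0, gamma))
            node.children.append(Node(node.rng, node.depth + 1, alpha0, lam, gamma))
        width = rest * node.psis[i]        # child i's share of remaining mass
        if u < cum + width:
            return find_node(node.children[i], (u - cum) / width)
        cum += width
        rest -= width
        i += 1
```

Only the nodes along the walk are ever created, which is what lets the sampler touch arbitrarily deep or wide regions of the tree without representing it in full.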

The authors evaluate the model on two real‑world tasks. In hierarchical image clustering (e.g., CIFAR‑10), the learned tree places broad categories such as “animals” or “vehicles” at upper levels, while finer subclasses (e.g., specific dog breeds, car models) appear at deeper nodes. This hierarchical organization emerges automatically without pre‑specifying the number of clusters or the depth of the hierarchy. In a topic‑modeling experiment on the 20 Newsgroups corpus, the top‑level topics correspond to major domains (politics, science, sports), and lower‑level topics capture more specific sub‑domains (election policy, quantum physics, soccer matches). The model’s ability to assign documents to any tree level yields a richer representation than flat HDP mixtures.

Quantitatively, the TSSBP achieves higher predictive log‑likelihoods and lower perplexities than HDP and other baseline non‑parametric models, demonstrating both better fit and more interpretable structure. The hierarchical priors also act as an implicit regularizer, preventing over‑fitting by allowing the tree to grow only where the data support additional branches.

In summary, the paper presents a flexible, infinitely exchangeable, tree‑based non‑parametric prior, together with an efficient slice‑sampling inference scheme. It shows that modeling data with a diffusion‑like hierarchy can improve both predictive performance and interpretability. Future directions suggested include richer priors over tree topology, extensions to non‑vectorial data such as graphs or time series, and scalable variational inference alternatives.

