Distance Dependent Chinese Restaurant Processes

Distance Dependent Chinese Restaurant Processes
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

We develop the distance dependent Chinese restaurant process (CRP), a flexible class of distributions over partitions that allows for non-exchangeability. This class can be used to model many kinds of dependencies between data in infinite clustering models, including dependencies across time or space. We examine the properties of the distance dependent CRP, discuss its connections to Bayesian nonparametric mixture models, and derive a Gibbs sampler for both observed and mixture settings. We study its performance with three text corpora. We show that relaxing the assumption of exchangeability with distance dependent CRPs can provide a better fit to sequential data. We also show its alternative formulation of the traditional CRP leads to a faster-mixing Gibbs sampling algorithm than the one based on the original formulation.


💡 Research Summary

**
The paper introduces the Distance Dependent Chinese Restaurant Process (DD‑CRP), a non‑exchangeable extension of the classic Chinese Restaurant Process (CRP) that incorporates pairwise distances between data points into the clustering prior. Traditional CRP assumes exchangeability: the probability that a new observation joins an existing cluster depends only on the cluster’s size, not on the order or location of observations. This assumption is often violated in real‑world data where temporal, spatial, or semantic dependencies exist (e.g., time‑ordered documents, geographic data, or similarity‑based networks).

Model Construction
The authors define a distance function (d(i,j)) for any two observations (i) and (j). A decay (or similarity) function (f(d)) maps this distance to a non‑negative weight, typically decreasing with distance (e.g., exponential decay (f(d)=\exp(-\lambda d)) or reciprocal decay (f(d)=1/(1+d))). For each observation (i), a “link” is drawn to another observation (j) (or to a special “new table” node) with probability proportional to (f(d(i,j))) (or to a concentration parameter (\alpha) for a new table). The resulting directed graph is undirected in practice; its connected components constitute the clusters (tables). Thus, the clustering structure is directly governed by the distance‑based link probabilities.

A key theoretical contribution is the proof that the DD‑CRP induces the same partition distribution as a CRP with appropriately defined “new table” and “existing table” probabilities. This equivalence allows the authors to reinterpret the traditional CRP in terms of pairwise links, providing a new perspective that simplifies inference.

Inference Algorithms
Two Gibbs samplers are derived:

  1. Pure partition model (no observations) – Each iteration removes observation (i) from its current component, then resamples its link target according to the distance‑based weights and the concentration term. If the sampled target belongs to an existing component, (i) joins that component; otherwise a new component is created. This procedure avoids the explicit computation of table‑selection probabilities required by the classic CRP sampler, reducing computational overhead especially when many clusters exist.

  2. Mixture model with observations – The authors embed the DD‑CRP into a Bayesian mixture (e.g., a topic model). After updating the partition as above, they sample the component‑specific parameters (e.g., Dirichlet‑distributed topic‑word distributions) conditioned on the current assignment. The conjugacy of Dirichlet priors ensures closed‑form updates, and the alternating update of partitions and parameters yields a standard Gibbs scheme.

Both samplers are shown to be ergodic and to converge to the posterior defined by the DD‑CRP prior combined with the likelihood of the observed data.

Experiments
The methodology is evaluated on three text corpora: a news‑article collection, a set of scientific abstracts, and a stream of social‑media posts. In all cases the data are ordered chronologically, and the distance is defined as the absolute time difference (|t_i - t_j|). An exponential decay with a hyper‑parameter (\lambda) (selected via cross‑validation) serves as the decay function. Baselines include the standard CRP‑based Latent Dirichlet Allocation (LDA), the Hierarchical Dirichlet Process (HDP), and other Dirichlet‑process mixture variants.

Performance metrics comprise log‑likelihood, perplexity, and the dynamics of the number of clusters over time. The DD‑CRP consistently achieves higher log‑likelihoods and lower perplexities, indicating a better fit to the sequential nature of the data. Notably, during periods where topics shift rapidly (e.g., breaking news), the distance‑dependent prior quickly creates new clusters, while older clusters fade naturally, mirroring the true evolution of the corpus.

From a computational standpoint, the DD‑CRP Gibbs sampler converges 2–3 times faster (in terms of iterations to reach a stable log‑likelihood) than the traditional CRP sampler. The speedup stems from the reduced need to enumerate all existing tables when sampling a new assignment; only the distances to currently linked observations matter.

Strengths and Limitations
The paper’s primary strength lies in providing a principled, flexible way to embed arbitrary dependency structures (temporal, spatial, similarity‑based) into a non‑parametric clustering prior while preserving the appealing properties of the CRP (e.g., an unbounded number of clusters). The equivalence proof bridges the gap between the classic “table‑selection” view and the link‑based view, enabling more efficient inference.

However, the approach also has limitations. The choice of distance and decay functions is left to the practitioner; no systematic guidance or learning mechanism for these functions is offered. In high‑dimensional settings (e.g., image embeddings), simple decay functions may cause most link probabilities to vanish, leading to overly fragmented partitions. Moreover, the current implementation assumes a fully connected graph, which scales quadratically in memory and computation, making it impractical for datasets with hundreds of thousands of points without additional sparsification strategies.

Future Directions
The authors suggest several extensions: (i) employing sparse neighbor graphs (e.g., k‑nearest‑neighbor) to reduce computational burden; (ii) learning the distance metric jointly with the clustering (e.g., via neural metric learning); and (iii) developing variational inference schemes that can handle massive datasets while preserving the distance‑dependent prior.

Conclusion
The Distance Dependent Chinese Restaurant Process enriches the Bayesian non‑parametric toolbox by allowing non‑exchangeable, distance‑driven priors over partitions. Empirical results on sequential text data demonstrate that relaxing exchangeability yields better predictive performance and more realistic cluster dynamics. The theoretical equivalence to the classic CRP and the derived Gibbs samplers make the method both conceptually elegant and practically feasible, opening avenues for modeling temporally, spatially, or otherwise dependent data in a wide range of applications.


Comments & Academic Discussion

Loading comments...

Leave a Comment