A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
We develop a Bayesian framework for tackling the supervised clustering problem, the generic problem encountered in tasks such as reference matching, coreference resolution, identity uncertainty and record linkage. Our clustering model is based on the Dirichlet process prior, which enables us to define distributions over the countably infinite sets that naturally arise in this problem. We add supervision to our model by positing the existence of a set of unobserved random variables (we call these “reference types”) that are generic across all clusters. Inference in our framework, which requires integrating over infinitely many parameters, is solved using Markov chain Monte Carlo techniques. We present algorithms for both conjugate and non-conjugate priors. We present a simple–but general–parameterization of our model based on a Gaussian assumption. We evaluate this model on one artificial task and three real-world tasks, comparing it against both unsupervised and state-of-the-art supervised algorithms. Our results show that our model is able to outperform other models across a variety of tasks and performance metrics.
💡 Research Summary
The paper tackles the problem of supervised clustering—grouping data points while exploiting label‑level or constraint information—by building a fully Bayesian non‑parametric model. The authors adopt a Dirichlet Process (DP) prior to allow an unbounded number of clusters, thereby removing the need to pre‑specify the cluster count, which is especially valuable in applications such as reference matching, coreference resolution, identity uncertainty, and record linkage where the true number of entities is unknown.
A novel element is the introduction of “reference types,” latent variables that are shared across all clusters. These variables capture systematic variations that are common to many records (e.g., spelling variants, formatting differences) and act as a conduit for supervision: data points that share the same reference type are more likely to belong to the same cluster. By jointly inferring cluster assignments and reference types, the model integrates supervision without requiring explicit pairwise constraints.
The generative process can be summarized as follows: for each observation (x_i), a cluster indicator (z_i) is drawn from the Chinese Restaurant Process induced by the DP with concentration parameter (\alpha). Conditional on (z_i), a reference type (\theta_{z_i}) is sampled from a global prior, and the observation is generated from a likelihood (p(x_i \mid \theta_{z_i}, \phi_{z_i})). The authors instantiate the likelihood with a Gaussian distribution, assuming that each cluster‑type pair has its own mean vector but shares a common covariance matrix. For conjugate priors they use a Normal‑Wishart distribution, which yields closed‑form Gibbs updates. In the non‑conjugate case they employ Metropolis‑Hastings steps to sample the means and covariances.
Inference proceeds via Markov chain Monte Carlo. The algorithm alternates between (1) sampling the reference types and cluster‑specific parameters given current assignments, and (2) resampling each data point’s cluster label using the DP predictive probabilities (including the possibility of creating a new cluster). Hyper‑parameters such as (\alpha) and the Gaussian hyper‑priors are also sampled, resulting in a fully Bayesian treatment. The authors discuss implementation details that improve mixing, such as collapsed Gibbs sampling for the DP component and adaptive proposal distributions for the Metropolis steps.
Empirical evaluation comprises one synthetic dataset—designed to have a known number of clusters and reference types—and three real‑world tasks: (a) a record linkage benchmark involving noisy customer records, (b) a coreference resolution corpus of news articles, and (c) an identity‑uncertainty dataset from medical records. Baselines include classic unsupervised methods (K‑means, DBSCAN, spectral clustering) and state‑of‑the‑art supervised clustering approaches (constrained K‑means, label propagation, semi‑supervised Gaussian mixture models). Performance is measured using precision, recall, F1, Adjusted Rand Index, and domain‑specific matching accuracy.
Results show that the DP‑based model consistently outperforms all baselines. On synthetic data it recovers the exact number of clusters and reference types with an F1 score above 0.98. In the real‑world experiments, the proposed method achieves 5–15 % higher F1 scores than the best supervised competitor and improves duplicate‑removal rates in the record linkage task by roughly 12 %. The automatic adjustment of cluster count via the DP and the explicit modeling of reference types are identified as the primary contributors to these gains.
The paper also acknowledges limitations. The Gaussian likelihood may be insufficient for highly non‑linear or multimodal data such as raw text, suggesting that mixtures of Gaussians or deep embedding models could be incorporated. MCMC inference, while exact, is computationally intensive; scaling to very large datasets would benefit from variational inference or stochastic MCMC techniques. Moreover, the current single‑level reference‑type formulation may struggle with hierarchical or multi‑aspect variations (e.g., cultural name variants, typographical errors), motivating extensions using hierarchical Dirichlet Processes.
Future work outlined by the authors includes (1) replacing the simple Gaussian observation model with more expressive alternatives (mixed Gaussians, Bernoulli‑Multinomial for categorical data, neural embeddings), (2) developing a variational inference scheme to accelerate convergence and enable online learning, (3) extending the model to a hierarchical DP that can capture multiple layers of reference types, and (4) deploying the approach in production record‑linkage pipelines to assess real‑time performance.
In summary, the paper presents a rigorous Bayesian framework that unifies non‑parametric clustering with supervised information through the novel concept of reference types. The combination of a Dirichlet Process prior and a latent supervision mechanism yields a flexible, data‑driven model that automatically discovers the appropriate number of clusters while leveraging domain‑specific regularities. Empirical results across synthetic and diverse real‑world tasks demonstrate superior accuracy over existing unsupervised and supervised methods, and the discussion of extensions points toward a promising research agenda for scalable, expressive supervised clustering.
Comments & Academic Discussion
Loading comments...
Leave a Comment