Inference of global clusters from locally distributed data

We consider the problem of analyzing the heterogeneity of clustering distributions for multiple groups of observed data, each of which is indexed by a covariate value, and inferring global clusters arising from observations aggregated over the covariate domain. We propose a novel Bayesian nonparametric method reposing on the formalism of spatial modeling and a nested hierarchy of Dirichlet processes. We provide an analysis of the model properties, relating and contrasting the notions of local and global clusters. We also provide an efficient inference algorithm, and demonstrate the utility of our method in several data examples, including the problem of object tracking and a global clustering analysis of functional data where the functional identity information is not available.


💡 Research Summary

The paper tackles the problem of characterizing heterogeneity across multiple groups of observations that are indexed by a covariate (e.g., time, space, experimental condition) and of discovering “global” clusters that emerge when the data are aggregated over the covariate domain. Traditional approaches either treat each group independently, risking over‑fitting when data are scarce, or pool all observations into a single mixture model, thereby discarding local structure. To bridge this gap, the authors propose a novel Bayesian non‑parametric framework built on a nested hierarchy of Dirichlet processes (DPs).

At the core of the model is a three‑level hierarchy. For each covariate value \(u\) we observe a set \(\{x_{ui}\}_{i=1}^{n_u}\). These observations are generated from a Dirichlet‑process mixture (DPM) with group‑specific random measure \(G_u\). Crucially, each \(G_u\) shares a common base measure \(G_0\) drawn from a higher‑level DP, i.e., \(G_u \mid G_0 \sim \mathrm{DP}(\alpha, G_0)\) and \(G_0 \mid H \sim \mathrm{DP}(\gamma, H)\). The atoms of \(G_0\) represent the global clusters; each local cluster is a draw from \(G_0\) and therefore inherits a global label. This nesting endows the model with two complementary properties: (1) local clusters can borrow statistical strength from the global pool, which is especially valuable when a group contains few observations or is noisy; (2) the posterior explicitly encodes the mapping from global to local clusters, allowing a clear interpretation of how global patterns manifest locally.
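The generative story above can be simulated with truncated stick-breaking. The sketch below is illustrative only (hyper-parameter values, the Gaussian base measure \(H\), and all variable names are assumptions, not from the paper); its point is that because \(G_0\) is discrete, every local atom is a copy of some global atom, so local clusters inherit global labels.

```python
import random

def stick_breaking(concentration, num_atoms, rng):
    """Truncated stick-breaking weights for a Dirichlet process:
    beta_k ~ Beta(1, concentration), w_k = beta_k * prod_{j<k}(1 - beta_j)."""
    weights, remaining = [], 1.0
    for _ in range(num_atoms):
        b = rng.betavariate(1.0, concentration)
        weights.append(remaining * b)
        remaining *= 1.0 - b
    weights.append(remaining)  # leftover mass assigned to a final atom
    return weights

rng = random.Random(0)
gamma_, alpha_, K = 1.0, 1.0, 20  # illustrative values, not from the paper

# Global measure G_0 ~ DP(gamma, H): atoms drawn from a base measure H
# (here a standard normal over cluster means).
global_atoms = [rng.gauss(0.0, 1.0) for _ in range(K + 1)]
global_weights = stick_breaking(gamma_, K, rng)

# Local measure G_u | G_0 ~ DP(alpha, G_0): since G_0 is discrete, every
# local atom is resampled from the global atoms, so each local cluster
# carries a global label.
local_weights = stick_breaking(alpha_, K, rng)
local_atoms = rng.choices(global_atoms, weights=global_weights, k=K + 1)
```

Running the sketch for several covariate values \(u\) would produce a family of local measures that all reuse the same global atom set, which is exactly the sharing mechanism described above.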

The authors provide a rigorous theoretical analysis. Employing the Chinese Restaurant Franchise metaphor, they derive the joint distribution of cluster assignments and establish exchangeability across observations and consistency under the addition or removal of groups. They also contrast their construction with the Hierarchical Dirichlet Process (HDP): although both couple the groups through a shared random base measure, the proposed model additionally indexes the groups by a covariate and, via the spatial-modeling formalism, captures dependence across the covariate domain, yielding a richer representation of uncertainty about the number and composition of the global clusters.
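The Chinese-restaurant predictive rule that underlies such derivations can be stated compactly: the next customer joins an existing table \(t\) with probability proportional to its occupancy \(n_t\), or opens a new table with probability proportional to the concentration parameter. A minimal single-restaurant sketch (table names and counts invented for illustration):

```python
from collections import Counter

def crf_assignment_probs(table_counts, alpha):
    """Chinese-restaurant predictive probabilities for the next customer:
    existing table t with probability n_t / (n + alpha),
    a new table with probability alpha / (n + alpha)."""
    n = sum(table_counts.values())
    probs = {t: c / (n + alpha) for t, c in table_counts.items()}
    probs["new"] = alpha / (n + alpha)
    return probs

counts = Counter({"t1": 3, "t2": 1})  # hypothetical occupancies
probs = crf_assignment_probs(counts, alpha=1.0)
# → {"t1": 0.6, "t2": 0.2, "new": 0.2}
```

In the franchise version, each new table in a group-level restaurant then orders a "dish" (a global atom) from the shared top-level process, which is how local clusters acquire global labels.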

For inference, a Gibbs sampler is designed that alternates between (i) collapsed updates of local cluster assignments using standard DPM techniques, and (ii) block updates of the global atoms and of the allocation of each local cluster to a global atom. To keep computation tractable, a stick‑breaking truncation is applied: a finite number \(K\) of global atoms is retained, with the remaining probability mass collected in a residual component. The concentration hyper‑parameters \(\alpha\) and \(\gamma\) are given vague Gamma priors and resampled within the Gibbs sweep, enabling the model to adapt its complexity automatically.
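The block update in step (ii) amounts to a discrete posterior draw over the \(K\) truncated global atoms, with probability proportional to (global weight) × (likelihood of the local cluster's data under that atom). A minimal sketch of that single step, computed in log space for stability (the function name and inputs are assumptions, not the paper's notation):

```python
import math
import random

def sample_global_label(log_liks, log_global_weights, rng):
    """Reassign one local cluster to one of K truncated global atoms.

    log_liks[k]          : log p(cluster data | global atom k)
    log_global_weights[k]: log of the k-th truncated global weight
    Posterior probability of atom k is proportional to their product.
    """
    logs = [lw + ll for lw, ll in zip(log_global_weights, log_liks)]
    m = max(logs)  # log-sum-exp trick to avoid underflow
    probs = [math.exp(v - m) for v in logs]
    total = sum(probs)
    probs = [p / total for p in probs]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Iterating this draw over all local clusters, interleaved with the collapsed local-assignment updates of step (i), yields one full Gibbs sweep.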

Empirical evaluation is performed on two distinct domains. The first concerns object tracking in video streams. Each frame yields a set of detected positions and velocities; these are modeled as local clusters, while the global clusters capture overarching motion patterns (e.g., linear drift, circular motion). The nested DP improves tracking accuracy, especially in frames with occlusions or sparse detections, by leveraging the global motion priors. The second application involves functional data such as EEG or climate time series where individual series lack identity labels. Local clusters describe the shape of each series, and the global clusters reveal common functional motifs across subjects or locations. Compared against k‑means, standard DPM, and HDP‑GMM, the proposed method achieves higher Adjusted Rand Index scores, better predictive log‑likelihood, and more interpretable cluster structures.
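The Adjusted Rand Index used in these comparisons is a standard external clustering metric (not specific to this paper): it counts pairwise agreements between two partitions and corrects for chance. A self-contained reference implementation:

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_a, labels_b):
    """ARI between two clusterings of the same n items.

    ARI = (Index - ExpectedIndex) / (MaxIndex - ExpectedIndex),
    where Index sums C(n_ij, 2) over the contingency-table cells.
    Returns 1.0 for identical partitions (up to relabeling),
    and values near 0 for random agreement.
    """
    n = len(labels_a)
    contingency = Counter(zip(labels_a, labels_b))
    index = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case: trivial partitions
        return 1.0
    return (index - expected) / (max_index - expected)
```

Because the ARI is invariant to label permutations, it is well suited to comparing nonparametric posteriors (whose cluster labels are arbitrary) against ground-truth partitions.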

Overall, the paper contributes a flexible, theoretically sound, and computationally feasible approach to simultaneous local‑global clustering. The nested Dirichlet process elegantly captures the dependence between local heterogeneity and global regularities, and the inference algorithm scales to realistic data sizes. Limitations include sensitivity to the truncation level (K) and the need for careful prior specification in very high‑dimensional settings. Future work may explore extensions such as Gaussian‑process‑based base measures for spatial continuity, or hybrid models that integrate deep neural representations with the Bayesian non‑parametric backbone. The presented framework opens new avenues for analyzing complex, covariate‑indexed datasets across a wide range of scientific and engineering disciplines.

