Stability of Density-Based Clustering

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

High-density clusters can be characterized by the connected components of a level set $L(\lambda) = \{x:\ p(x)>\lambda\}$ of the underlying probability density function $p$ generating the data, at some appropriate level $\lambda\geq 0$. The complete hierarchical clustering can be characterized by a cluster tree ${\cal T}= \bigcup_{\lambda} L(\lambda)$. In this paper, we study the behavior of a density level set estimate $\widehat L(\lambda)$ and cluster tree estimate $\widehat{\cal{T}}$ based on a kernel density estimator with kernel bandwidth $h$. We define two notions of instability to measure the variability of $\widehat L(\lambda)$ and $\widehat{\cal{T}}$ as a function of $h$, and investigate the theoretical properties of these instability measures.

💡 Research Summary

The paper investigates the stability of density‑based clustering, a non‑parametric approach that defines clusters as the connected components of a density level set $L(\lambda)=\{x:p(x)>\lambda\}$. By varying the threshold $\lambda$ one obtains a hierarchical structure called the cluster tree $\mathcal{T}=\bigcup_{\lambda}L(\lambda)$. In practice the underlying density $p$ is unknown, so the authors use a kernel density estimator (KDE) $\widehat p_h$ with bandwidth $h$ to construct empirical level sets $\widehat L_h(\lambda)=\{x:\widehat p_h(x)>\lambda\}$ and an empirical tree $\widehat{\mathcal{T}}_h$. The central question is how the choice of $h$ influences the variability of these estimates.
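The construction of an empirical level set from a KDE can be illustrated with a minimal one-dimensional sketch. It is not the authors' implementation: the Gaussian kernel, the evaluation grid, the synthetic two-bump sample, and the helper names `gaussian_kde_on_grid`, `level_set_mask`, and `count_components` are all assumptions made for illustration.

```python
import numpy as np

def gaussian_kde_on_grid(sample, grid, h):
    """Evaluate a 1-D Gaussian kernel density estimate with bandwidth h on a grid."""
    # Pairwise scaled distances between grid points (rows) and sample points (columns).
    z = (grid[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(sample) * h * np.sqrt(2 * np.pi))

def level_set_mask(density_values, lam):
    """Boolean mask of the grid points with estimated density strictly above lam."""
    return density_values > lam

def count_components(mask):
    """Count maximal runs of True, i.e. connected components of a 1-D level set on the grid."""
    padded = np.concatenate(([False], mask))
    # A component starts wherever the mask switches from False to True.
    return int(np.sum(~padded[:-1] & padded[1:]))

# Hypothetical data: two well-separated Gaussian bumps (an assumption, not from the paper).
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(-3.0, 0.5, 200), rng.normal(3.0, 0.5, 200)])
grid = np.linspace(-6.0, 6.0, 601)

p_hat = gaussian_kde_on_grid(sample, grid, h=0.4)
mask = level_set_mask(p_hat, lam=0.05)
n_clusters = count_components(mask)
print("estimated number of clusters at this level:", n_clusters)
```

In one dimension the connected components of the level set are simply intervals, so counting runs on a grid suffices. In higher dimensions a graph-based or image-labeling approach would be needed instead.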

Two instability measures are introduced. The first, level‑set instability $\Xi_n(h)=\mathbb{P}\bigl(\widehat L_h^{(1)}(\lambda)\neq\widehat L_h^{(2)}(\lambda)\bigr)$, quantifies the probability that two independent samples from the same distribution produce different estimated level sets at a fixed $\lambda$. The second, tree instability $\Upsilon_n(h)=\mathbb{P}\bigl(\widehat{\mathcal{T}}_h^{(1)}\neq\widehat{\mathcal{T}}_h^{(2)}\bigr)$, measures the probability that the full hierarchical structures disagree. Both functions of $h$ typically display a U‑shaped curve: very small bandwidths lead to high variance (over‑sensitivity to sampling noise), while very large bandwidths cause excessive smoothing and bias, again increasing disagreement.
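The idea of comparing level sets estimated from two independent samples can be sketched with a small Monte Carlo simulation. This is a rough illustration under stated assumptions, not the paper's estimator: instead of the probability that the two sets differ exactly, the sketch uses a softer proxy, the fraction of grid points lying in exactly one of the two estimated level sets. The Gaussian kernel, the data generator `draw_two_bumps`, and the names `disagreement` and `instability_curve` are hypothetical.

```python
import numpy as np

def kde(sample, grid, h):
    """1-D Gaussian kernel density estimate with bandwidth h, evaluated on a grid."""
    z = (grid[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(sample) * h * np.sqrt(2 * np.pi))

def disagreement(sample_a, sample_b, grid, h, lam):
    """Fraction of grid points lying in exactly one of the two estimated level sets."""
    in_a = kde(sample_a, grid, h) > lam
    in_b = kde(sample_b, grid, h) > lam
    return float(np.mean(in_a ^ in_b))

def instability_curve(draw_sample, n, bandwidths, grid, lam, n_pairs=20, seed=0):
    """Average disagreement over n_pairs fresh pairs of independent samples, per bandwidth."""
    rng = np.random.default_rng(seed)
    curve = []
    for h in bandwidths:
        vals = [
            disagreement(draw_sample(rng, n), draw_sample(rng, n), grid, h, lam)
            for _ in range(n_pairs)
        ]
        curve.append(float(np.mean(vals)))
    return np.array(curve)

def draw_two_bumps(rng, n):
    """Hypothetical data generator: equal mixture of N(-2, 1) and N(2, 1)."""
    centers = rng.choice([-2.0, 2.0], size=n)
    return centers + rng.normal(0.0, 1.0, size=n)

grid = np.linspace(-6.0, 6.0, 241)
bandwidths = np.array([0.05, 0.2, 0.8])
curve = instability_curve(draw_two_bumps, n=200, bandwidths=bandwidths, grid=grid, lam=0.1)
for h, value in zip(bandwidths, curve):
    print(f"h={h:.2f}  mean disagreement={value:.3f}")
```

The sketch draws fresh samples from a known distribution, which is only possible in simulation. With real data one would need some resampling or sample-splitting scheme, and the shape of the resulting curve depends on the chosen proxy.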

The theoretical analysis assumes a non‑negative, symmetric, Lipschitz kernel with finite second moment, and that the true density $p$ is $\beta$-Hölder continuous ($0<\beta\le 2$). Moreover, the boundary $\partial L(\lambda)$ is required to be a $C^2$ manifold with a strictly positive gradient norm $\delta_\lambda=\inf_{x\in\partial L(\lambda)}\|\nabla p(x)\|>0$. Under these conditions the authors derive convergence rates:

Level‑set instability
