Internal Evaluation of Density-Based Clusterings with Noise

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Being able to evaluate the quality of a clustering result even in the absence of ground truth cluster labels is fundamental for research in data mining. However, most cluster validation indices (CVIs) do not capture noise assignments by density-based clustering methods like DBSCAN or HDBSCAN, even though the ability to correctly determine noise is crucial for successful clustering. In this paper, we propose DISCO, a Density-based Internal Score for Clusterings with nOise, the first CVI to explicitly assess the quality of noise assignments rather than merely counting them. DISCO is based on the established idea of the Silhouette Coefficient, but adopts density-connectivity to evaluate clusters of arbitrary shapes, and proposes explicit noise evaluation: it rewards correctly assigned noise labels and penalizes noise labels where a cluster label would have been more appropriate. The pointwise definition of DISCO allows for the seamless integration of noise evaluation into the final clustering evaluation, while also enabling explainable evaluations of the clustered data. In contrast to most state-of-the-art CVIs, DISCO is well-defined and also covers edge cases that regularly appear as output from clustering algorithms, such as singleton clusters or a single cluster plus noise.


💡 Research Summary

The paper addresses a long‑standing gap in the evaluation of density‑based clustering algorithms such as DBSCAN and HDBSCAN: the lack of an internal validation measure that explicitly assesses the quality of noise assignments. While many internal cluster validity indices (CVIs) exist, they are typically designed for centroid‑based, roughly spherical clusters and either ignore noise points altogether or penalize them uniformly without considering whether the noise label is appropriate. The only existing density‑based CVI that handles noise, DBCV, simply scales the overall score by the fraction of non‑noise points and suffers from non‑determinism due to the removal of leaf nodes from cluster‑wise minimum spanning trees (MSTs). Consequently, DBCV can assign higher scores to poorer clusterings and its results are not reproducible.

To fill this gap, the authors propose DISCO (Density‑based Internal Score for Clusterings with nOise). DISCO builds on the well‑known Silhouette Coefficient but replaces Euclidean distance with the density‑connectivity distance (dc‑dist). The dc‑dist is a minimax path distance: consider the graph whose edges are weighted by the mutual‑reachability distance (the maximum of the two points’ core‑distances and their Euclidean distance); among all paths connecting two points in this graph, the dc‑dist is the smallest achievable value of the largest edge weight on a path. This distance captures how points are linked through dense regions rather than raw geometric proximity, making it suitable for arbitrarily shaped clusters.
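The dc‑dist described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `dc_dist_matrix`, the parameter `min_pts`, the convention that a point counts as its own first neighbor when computing the core‑distance, and the choice to set the distance of a point to itself to zero are all assumptions introduced here. The minimax path values are computed with a Floyd‑Warshall style update, which is simple but cubic in the number of points; a minimum spanning tree would be the usual choice for larger data.

```python
import numpy as np


def dc_dist_matrix(points, min_pts=3):
    """Pairwise minimax-path distances over the mutual-reachability graph.

    `min_pts` and the core-distance convention are assumptions of this sketch.
    """
    pts = np.asarray(points, dtype=float)
    n = len(pts)

    # Pairwise Euclidean distances.
    diff = pts[:, None, :] - pts[None, :, :]
    eucl = np.sqrt((diff ** 2).sum(axis=-1))

    # Core-distance: distance to the min_pts-th nearest point,
    # counting the point itself as the first neighbor (assumed convention).
    k = min(min_pts, n) - 1
    core = np.sort(eucl, axis=1)[:, k]

    # Mutual-reachability distance: max of both core-distances and the
    # Euclidean distance. The diagonal is set to zero (assumed convention).
    mreach = np.maximum(eucl, np.maximum(core[:, None], core[None, :]))
    np.fill_diagonal(mreach, 0.0)

    # Minimax path distance: for each intermediate point, a path through it
    # costs the larger of its two halves; keep the cheaper alternative.
    dc = mreach.copy()
    for mid in range(n):
        via = np.maximum(dc[:, mid][:, None], dc[mid, :][None, :])
        dc = np.minimum(dc, via)
    return dc


# Two chains of unit-spaced points separated by a gap of width 7.
pts = [[0, 0], [1, 0], [2, 0], [3, 0], [10, 0], [11, 0], [12, 0], [13, 0]]
dc = dc_dist_matrix(pts, min_pts=2)
print(dc[0, 3])  # within one chain: 1.0 (Euclidean distance is 3.0)
print(dc[0, 7])  # across the gap: 7.0 (Euclidean distance is 13.0)
```

In this toy example the two ends of a chain are at dc‑dist 1.0 although they are 3.0 apart in Euclidean terms, because every step along the chain has length 1. Any path between the two chains has to cross the gap, so all cross‑chain pairs share the value 7.0.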

DISCO provides a pointwise score ρ(x) for every data object. For a point assigned to a cluster, the score mirrors the Silhouette formula: it compares the average dc‑dist to points in its own cluster (compactness) with the average dc‑dist to the nearest other cluster (separation). The result is normalized to the interval

