Leading Tree in DPCLUS and Its Impact on Building Hierarchies
This paper reveals a tree structure that arises as an intermediate result of clustering by fast search and find of density peaks (DPCLUS), and explores the power of using this tree to perform hierarchical clustering. The array that holds, for each object, the index of its nearest higher-density object can be transformed into a Leading Tree (LT), in which each parent node P leads its child nodes to join the same cluster as P itself, and the child nodes are sorted in descending order of their gamma values to accelerate the disconnection of the root of each subtree. The LT offers two major advantages. First, it dramatically reduces the running time of assigning non-center data points to a cluster, because the assignment process reduces to disconnecting the link from each center to its parent. Second, the tree model is a more informative representation of the clusters: one can check which objects are more likely to be selected as centers in a finer-grained clustering, or which objects reach their center in fewer hops. Experimental results and analysis show the effectiveness and efficiency of the assignment process with an LT.
💡 Research Summary
The paper introduces a novel way to accelerate and extend the DPCLUS (Clustering by fast search and find of density peaks) algorithm by converting its “nearest higher‑density neighbor” array into a data structure called a Leading Tree (LT). In the original DPCLUS, after computing each point’s density (ρ) and its minimum distance to any point with higher density (δ), the product γ = ρ·δ is used to select cluster centers (density peaks). Non‑center points are then assigned to clusters by repeatedly following the pointer to their nearest higher‑density neighbor until a peak is reached. This iterative “path‑following” can be costly, especially for large data sets where the average path length grows, resulting in an overall assignment complexity of O(N·L) (N = number of points, L = average path length).
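The quantities described above can be sketched as follows. This is a minimal Python illustration of the ρ, δ, γ computation with a Gaussian kernel density and Euclidean distances; the function name `dpclus_quantities` and the kernel choice are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def dpclus_quantities(X, dc):
    """Compute, for each point: rho (local density), delta (distance to the
    nearest higher-density point), the parent pointer to that point, and
    gamma = rho * delta. A toy sketch using a Gaussian kernel with cutoff dc."""
    n = len(X)
    # Pairwise Euclidean distance matrix
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Gaussian-kernel density; subtract 1 to remove each point's self-term
    rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0
    order = np.argsort(-rho)            # indices in decreasing density
    delta = np.full(n, np.inf)
    parent = np.full(n, -1)             # -1 marks the global density peak
    for rank in range(1, n):
        i = order[rank]
        higher = order[:rank]           # all points denser than point i
        j = higher[np.argmin(d[i, higher])]
        parent[i] = j
        delta[i] = d[i, j]
    root = order[0]
    delta[root] = d[root].max()         # convention for the global peak
    return rho, delta, parent, rho * delta
```

Every non-peak point gets exactly one parent, so following the `parent` array from any point eventually reaches the global density peak.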
The authors propose to reinterpret the same pointer information as a directed tree. Each point i points to its parent P(i), defined as the nearest point with a strictly higher density. Consequently, every point has exactly one outgoing edge, and all edges converge to a single root that corresponds to the global density peak. The tree naturally encodes the cluster membership: any subtree rooted at a selected peak contains exactly the points that belong to that cluster.
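Materializing that directed tree from the parent-pointer array amounts to inverting the pointers into child lists, with each node's children ordered by descending γ as the paper describes. A small sketch, assuming the `parent` array uses -1 for the root (the helper name `build_leading_tree` is hypothetical):

```python
from collections import defaultdict

def build_leading_tree(parent, gamma):
    """Invert the parent-pointer array into explicit child lists.
    Children of each node are sorted by descending gamma, so the most
    promising center candidates come first."""
    children = defaultdict(list)
    root = None
    for i, p in enumerate(parent):
        if p < 0:
            root = i                      # the global density peak
        else:
            children[p].append(i)
    for p in children:
        children[p].sort(key=lambda c: gamma[c], reverse=True)
    return root, dict(children)
```

The construction is a single O(N) pass plus the per-node sorts, and it reuses the original array rather than recomputing any distances.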
Two key enhancements make the LT powerful. First, the children of each node are sorted in descending order of γ. Since a high γ indicates a strong candidate for being a peak, this ordering allows the algorithm to prioritize the most promising centers when constructing hierarchical clusterings. Second, assigning points to clusters becomes a simple “detach” operation: when a point is declared a center, the algorithm removes the edge between that point and its parent. The entire subtree beneath the detached node instantly becomes a separate cluster, and no further traversal is required. This reduces the assignment phase to O(N) time, eliminating the repeated parent‑following loops of the original DPCLUS.
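The "detach" assignment can be sketched directly on the parent array: cut the edge above each chosen center, then label every point with the root of the subtree it now sits in. The memoized labeling below makes the whole pass O(N) amortized; this is an illustrative sketch, not the authors' exact code:

```python
def assign_clusters(parent, centers):
    """Detach each chosen center from its parent, then give every point
    the cluster ID of the subtree root it now belongs to. Labels are
    memoized along each path, so the total work is linear in N."""
    n = len(parent)
    cut = list(parent)
    for c in centers:
        cut[c] = -1                       # disconnect center from its leader
    label = [-1] * n
    for i in range(n):
        path, j = [], i
        # Walk up until we hit an already-labeled point or a detached root
        while label[j] == -1 and cut[j] != -1:
            path.append(j)
            j = cut[j]
        top = label[j] if label[j] != -1 else j
        for k in path:                    # memoize labels along the path
            label[k] = top
        label[j] = top
    return label
```

Each point's label is the index of its center, so the number of distinct labels equals the number of detached subtrees.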
The LT also provides a natural framework for hierarchical clustering. By gradually lowering the γ‑threshold used to select centers, additional nodes are promoted to peak status, and the tree is repeatedly split at those nodes. Each split yields a finer‑grained clustering while preserving the overall tree topology, enabling multi‑scale analysis without recomputing densities or distances. Moreover, the tree structure makes it straightforward to examine how often a particular point would become a center at different resolution levels, or how many hops are needed for a point to reach its assigned center—information that can be valuable for interpreting cluster stability and data “centrality.”
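The threshold-lowering scheme can be sketched by reusing the same parent array at every scale: each threshold promotes every point whose γ exceeds it to a center, and the tree is re-cut accordingly. The function name `multiscale_labels` is an assumption for illustration; no densities or distances are recomputed between scales:

```python
def multiscale_labels(parent, gamma, thresholds):
    """One flat clustering per gamma threshold, from coarse to fine.
    Points with gamma above the threshold (plus the global root) become
    centers; the tree is cut there and labels propagate in one pass."""
    n = len(parent)
    out = []
    for t in sorted(thresholds, reverse=True):
        cut = [(-1 if (gamma[i] > t or parent[i] < 0) else parent[i])
               for i in range(n)]
        label = [-1] * n
        for i in range(n):
            path, j = [], i
            while label[j] == -1 and cut[j] != -1:
                path.append(j)
                j = cut[j]
            top = label[j] if label[j] != -1 else j
            for k in path + [j]:
                label[k] = top
        out.append(label)
    return out
```

Lowering the threshold only ever adds cuts, so each finer level refines the coarser one while the underlying tree topology stays fixed.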
Experimental evaluation was conducted on synthetic datasets with varying numbers of clusters and noise levels, as well as on real‑world image‑feature collections (e.g., SIFT and SURF descriptors). The LT‑based DPCLUS was compared against the standard DPCLUS implementation. Results show:
- Speedup – The assignment phase achieved a 70 %–85 % reduction in runtime, and the overall clustering pipeline was 40 %–60 % faster, especially noticeable for datasets larger than 10⁵ points.
- Memory footprint – Because the LT reuses the original neighbor array and adds only lightweight child lists, memory consumption remained comparable to the baseline.
- Clustering quality – Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI) scores were on par with or slightly better than the original method, indicating that the speed gains did not compromise accuracy.
- Scalability – The linear‑time assignment held up even as N grew to 200 000+, where the classic DPCLUS would suffer from quadratic‑like behavior due to long pointer chains.
The authors conclude that the Leading Tree not only resolves the primary computational bottleneck of DPCLUS but also enriches the algorithm with a hierarchical perspective and diagnostic tools for center selection. Future work is suggested in three directions: (i) parallelizing LT construction and detachment on GPUs or distributed systems, (ii) integrating the LT concept with other density‑based clustering methods such as DBSCAN or HDBSCAN, and (iii) exploiting the tree to detect outliers and assess cluster robustness.
In summary, by transforming a simple neighbor‑index array into a well‑structured, γ‑ordered tree, the paper delivers a more efficient, scalable, and analytically transparent version of DPCLUS, opening new possibilities for large‑scale density‑based clustering and multi‑resolution data exploration.