AuToMATo: An Out-Of-The-Box Persistence-Based Clustering Algorithm

We present AuToMATo, a novel clustering algorithm based on persistent homology. While AuToMATo is not parameter-free per se, we provide default choices for its parameters that make it into an out-of-the-box clustering algorithm that performs well across the board. AuToMATo combines the existing ToMATo clustering algorithm with a bootstrapping procedure in order to separate significant peaks of an estimated density function from non-significant ones. We perform a thorough comparison of AuToMATo (with its parameters fixed to their defaults) against many other state-of-the-art clustering algorithms. We find not only that AuToMATo compares favorably against parameter-free clustering algorithms, but in many instances also significantly outperforms even the best selection of parameters for other algorithms. AuToMATo is motivated by applications in topological data analysis, in particular the Mapper algorithm, where it is desirable to work with a clustering algorithm that does not need tuning of its parameters. Indeed, we provide evidence that AuToMATo performs well when used with Mapper. Finally, we provide an open-source implementation of AuToMATo in Python that is fully compatible with the standard scikit-learn architecture.


💡 Research Summary

The paper introduces AuToMATo, a clustering algorithm that builds on the existing ToMATo method by incorporating a bottleneck bootstrap procedure to automatically select the prominence threshold τ, thereby eliminating the need for manual inspection of persistence diagrams. ToMATo works by estimating a density function on a point cloud, constructing a neighborhood graph (either a k‑nearest‑neighbor or δ‑Rips graph), and then tracking the birth and death of connected components in the super‑level set filtration of the estimated density. Each local maximum yields a point (birth, death) in a persistence diagram, whose vertical distance from the diagonal corresponds to the feature’s prominence. Traditionally, users must choose τ by visual inspection or by heuristics such as the largest gap in prominence values, which can be unstable.
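The peak-tracking step described above can be sketched in plain Python. This is a simplified illustration, not the paper's implementation: the names `knn_graph` and `tomato` are hypothetical, the density values are taken as given rather than estimated, and ties and boundary cases are handled naively.

```python
# Simplified sketch of ToMATo-style hill climbing and prominence-based merging.
import math

def knn_graph(points, k):
    """Brute-force k-nearest-neighbor graph; returns adjacency lists."""
    n = len(points)
    adj = []
    for i in range(n):
        by_dist = sorted(range(n), key=lambda j: math.dist(points[i], points[j]))
        adj.append([j for j in by_dist if j != i][:k])
    return adj

def tomato(points, density, k=5, tau=float("inf")):
    """Cluster by climbing to density peaks; merge peaks with prominence < tau.
    Returns (cluster labels, (birth, death) pairs of the merged peaks)."""
    adj = knn_graph(points, k)
    order = sorted(range(len(points)), key=lambda i: -density[i])
    parent = {}          # union-find parent, keyed by processed vertices
    peak = {}            # cluster root -> density of its peak (birth value)
    pairs = []           # (birth, death) pairs recorded at merge time

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in order:                          # sweep super-level sets downward
        nbrs = [j for j in adj[i] if j in parent]   # already-processed neighbors
        if not nbrs:                         # local maximum: start a new cluster
            parent[i] = i
            peak[i] = density[i]
            continue
        # attach i to the cluster of its highest-density processed neighbor
        g = max(nbrs, key=lambda j: density[j])
        parent[i] = find(g)
        # clusters meeting at i: merge the lower peak if its prominence < tau
        for j in nbrs:
            ri, rj = find(i), find(j)
            if ri != rj:
                lo, hi = (ri, rj) if peak[ri] < peak[rj] else (rj, ri)
                if peak[lo] - density[i] < tau:
                    pairs.append((peak[lo], density[i]))   # peak dies here
                    parent[lo] = hi
    labels = [find(i) for i in range(len(points))]
    return labels, pairs
```

With `tau` finite, only peaks of prominence below `tau` are absorbed; with `tau = +∞` every merge happens and `pairs` records the full (finite) persistence diagram, which is exactly how the first stage of AuToMATo is described below.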

AuToMATo automates this step. Given a dataset X with n points, it first computes the full persistence diagram D̂ by running ToMATo with τ set to +∞. It then draws B bootstrap samples X₁, …, X_B from X with replacement, computes a persistence diagram D̂ᵢ for each, and measures the bottleneck distances W∞(D̂ᵢ, D̂). By estimating the distribution of √n·W∞ via the empirical CDF of these B distances, AuToMATo obtains a quantile q̂_α corresponding to a user-specified confidence level α (default 0.05). The algorithm then sets τ = 2·q̂_α/√n; any point of D̂ whose L∞ distance to the diagonal is below this value is considered noise and merged into its parent cluster. The result is a clustering that retains only statistically significant peaks.
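The bootstrap procedure has a simple skeleton once the diagram and bottleneck-distance computations are treated as black boxes. In the sketch below, `diagram_fn` and `bottleneck_fn` are caller-supplied stand-ins for ToMATo plus a TDA library; the function names and the exact empirical-quantile convention are assumptions for illustration, not the paper's code.

```python
# Structural sketch of the bottleneck bootstrap threshold selection.
import math
import random

def bootstrap_tau(X, diagram_fn, bottleneck_fn, B=200, alpha=0.05, seed=0):
    """Estimate the prominence threshold tau via the bottleneck bootstrap.
    diagram_fn: dataset -> persistence diagram, a list of (birth, death) pairs;
    bottleneck_fn: (diagram, diagram) -> bottleneck distance W_inf.
    Both are assumed to be supplied by the caller."""
    rng = random.Random(seed)
    n = len(X)
    D_hat = diagram_fn(X)                        # diagram of the full dataset
    dists = []
    for _ in range(B):
        Xb = [rng.choice(X) for _ in range(n)]   # resample with replacement
        dists.append(math.sqrt(n) * bottleneck_fn(diagram_fn(Xb), D_hat))
    dists.sort()
    # empirical (1 - alpha)-quantile of sqrt(n) * W_inf
    q_hat = dists[min(B - 1, math.ceil((1 - alpha) * B) - 1)]
    return 2 * q_hat / math.sqrt(n), D_hat

def significant_points(diagram, tau):
    """Keep the points whose L_inf distance to the diagonal is at least tau."""
    return [(b, d) for (b, d) in diagram if abs(d - b) / 2 >= tau]
```

In the real pipeline, the points of D̂ filtered out by `significant_points` correspond to the peaks that get merged into their parent clusters.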

The methodology is summarized in three stages: (1) density estimation and graph construction, (2) initial peak detection via a directed forest that partitions the data, and (3) automatic τ determination through the bottleneck bootstrap followed by merging of insignificant clusters. The authors provide a Python implementation compatible with scikit‑learn’s estimator API, offering default parameters (α=0.05, B=200, k=15, distance‑to‑measure density estimator) while allowing full customization.
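To illustrate what scikit-learn estimator compatibility means in practice, here is a hypothetical skeleton following that API contract: hyperparameters stored verbatim in `__init__`, `fit` setting a `labels_` attribute and returning `self`, and `fit_predict` chaining the two. The class name and the placeholder clustering body are invented; only the quoted default values come from the summary above.

```python
class AuToMAToClusterer:
    """Hypothetical skeleton of a scikit-learn-compatible clusterer;
    the real package's class and parameter names may differ."""

    def __init__(self, alpha=0.05, n_bootstrap=200, n_neighbors=15):
        # hyperparameters are stored unchanged, as the sklearn API requires
        self.alpha = alpha
        self.n_bootstrap = n_bootstrap
        self.n_neighbors = n_neighbors

    def fit(self, X, y=None):
        # placeholder body: the real estimator would run ToMATo plus the
        # bottleneck bootstrap here and store one cluster label per sample
        self.labels_ = [0] * len(X)
        return self                      # fit returns self per the sklearn API

    def fit_predict(self, X, y=None):
        return self.fit(X).labels_
```

An estimator of this shape can be dropped into any pipeline that expects a scikit-learn clusterer, which is what makes the Mapper use case below convenient.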

Extensive experiments compare AuToMATo (using only its defaults) against a wide range of state‑of‑the‑art clustering algorithms, including DBSCAN, HDBSCAN, spectral clustering, Gaussian mixture models, k‑means, and both parameter‑free and tuned versions of ToMATo. Evaluation metrics such as Adjusted Rand Index, Normalized Mutual Information, and Silhouette Score show that AuToMATo consistently matches or exceeds the performance of the best tuned competitors across synthetic, image, text, and high‑dimensional biological datasets. Notably, in the Mapper pipeline—a topological data analysis tool that requires clustering on many overlapping data chunks—AuToMATo outperforms DBSCAN, which is highly sensitive to its ε parameter, leading to more stable and interpretable Mapper graphs.
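As a concrete example of one of these metrics, the Adjusted Rand Index can be computed directly from the pair-counting contingency table. The small reference implementation below follows the standard ARI formula and is included for illustration; it is not taken from the paper.

```python
# Adjusted Rand Index via pair counting (standard formula, for illustration).
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI = (Index - Expected) / (Max - Expected), computed over pairs."""
    n = len(labels_true)
    contingency = Counter(zip(labels_true, labels_pred))
    a = Counter(labels_true)                     # row sums of the contingency table
    b = Counter(labels_pred)                     # column sums
    index = sum(comb(c, 2) for c in contingency.values())
    row = sum(comb(c, 2) for c in a.values())
    col = sum(comb(c, 2) for c in b.values())
    expected = row * col / comb(n, 2)
    max_index = (row + col) / 2
    if max_index == expected:                    # identical trivial partitions
        return 1.0
    return (index - expected) / (max_index - expected)
```

ARI is 1.0 for identical partitions (up to relabeling) and close to 0 for random ones, which is why it is a common choice for comparing clusterings against ground truth.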

The paper also discusses computational considerations. The bootstrap step dominates runtime; however, the authors provide parallelization support and suggest approximate nearest‑neighbor methods (e.g., FAISS) for very large or high‑dimensional data. Limitations include increased cost for large B and potential memory issues when constructing exact k‑NN graphs in thousands of dimensions. Future work is suggested in the direction of multi‑scale density estimation, alternative statistical inference (e.g., Bayesian), and more efficient bootstrap approximations.
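Because the bootstrap replicates are independent, the dominant step parallelizes trivially. The sketch below illustrates this with the standard library; the pool mechanism, function names, and the cheap stand-in statistic are assumptions for illustration, not the package's actual parallelization code.

```python
# Sketch: mapping independent bootstrap replicates onto a worker pool.
from concurrent.futures import ThreadPoolExecutor
import random

def replicate_statistic(sample):
    """Cheap stand-in for the expensive per-replicate work (in AuToMATo:
    a persistence diagram plus its bottleneck distance to the full diagram)."""
    return sum(sample) / len(sample)

def one_replicate(X, seed):
    rng = random.Random(seed)                # per-replicate RNG for reproducibility
    sample = [rng.choice(X) for _ in X]      # resample with replacement
    return replicate_statistic(sample)

def parallel_bootstrap(X, B=200, workers=4):
    # replicates share no state, so Executor.map distributes them directly;
    # CPU-bound diagram computations would favor a process pool instead
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda s: one_replicate(X, s), range(B)))
```

Seeding each replicate independently keeps the result deterministic regardless of the number of workers, which simplifies testing and reproducibility.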

In conclusion, AuToMATo offers an out‑of‑the‑box, statistically principled clustering solution that leverages persistent homology without requiring expert tuning. Its robustness, ease of integration into existing machine‑learning pipelines, and demonstrated superiority in both general clustering benchmarks and specialized topological analyses make it a valuable addition to the toolbox of data scientists and researchers working with complex, high‑dimensional data.

