Families of dendrograms

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

A conceptual framework for cluster analysis from the viewpoint of p-adic geometry is introduced by describing the space of all dendrograms for n datapoints and relating it to the moduli space of p-adic Riemannian spheres with punctures using a method recently applied by Murtagh (2004b). This method embeds a dendrogram as a subtree into the Bruhat-Tits tree associated to the p-adic numbers, and goes back to Cornelissen et al. (2001) in p-adic geometry. After explaining the definitions, the concept of classifiers is discussed in the context of moduli spaces, and upper bounds for the number of hidden vertices in dendrograms are given.


💡 Research Summary

The paper introduces a novel conceptual framework for hierarchical clustering by exploiting the geometry of p‑adic numbers. The authors begin by recalling that a dendrogram is a tree‑like representation of the hierarchical relationships among a set of data points, traditionally built from a distance matrix in Euclidean space. While effective for many applications, Euclidean‑based methods struggle with high‑dimensional or non‑linear data because they lack a natural way to encode ultrametric relationships.

To overcome this limitation, the authors turn to the p‑adic number field Qₚ, whose norm is ultrametric: it satisfies the strong triangle inequality d(x, z) ≤ max(d(x, y), d(y, z)), so the distance between two points never exceeds the larger of their distances to any third point. This property mirrors the hierarchical nature of dendrograms, where the height of the lowest common ancestor determines the dissimilarity of two leaves. The Bruhat‑Tits tree (BT‑tree) associated with Qₚ is a canonical infinite regular tree in which every vertex has p + 1 neighbours. Its vertices correspond to p‑adic balls, and each edge joins a ball to one of the p sub‑balls whose radius is smaller by a factor of p.
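The strong triangle inequality can be checked directly on integers, which sit inside Qₚ. The following is a minimal sketch, not code from the paper; the function names `valuation`, `padic_distance` and `is_ultrametric` are illustrative assumptions.

```python
from fractions import Fraction
from itertools import permutations


def valuation(x: int, p: int) -> int:
    """Largest k such that p**k divides the nonzero integer x."""
    if x == 0:
        raise ValueError("the valuation of 0 is infinite")
    k = 0
    while x % p == 0:
        x //= p
        k += 1
    return k


def padic_distance(a: int, b: int, p: int) -> Fraction:
    """|a - b|_p = p**(-v_p(a - b)); the distance is 0 when a == b."""
    if a == b:
        return Fraction(0)
    return Fraction(1, p ** valuation(a - b, p))


def is_ultrametric(points, p: int) -> bool:
    """Check d(x, z) <= max(d(x, y), d(y, z)) for all ordered triples."""
    return all(
        padic_distance(x, z, p)
        <= max(padic_distance(x, y, p), padic_distance(y, z, p))
        for x, y, z in permutations(points, 3)
    )


print(padic_distance(1, 9, 2))        # 1 and 9 agree modulo 8, so they are 2-adically close
print(is_ultrametric(range(20), 3))
```

Exact `Fraction` arithmetic is used so that comparisons between distances are not affected by floating-point rounding.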

The core contribution is the identification of the space of all possible dendrograms on n labelled data points with a moduli space of p‑adic Riemann spheres punctured at n points, denoted M₀,ₙ(Qₚ). Each puncture corresponds to a data point, and moving the punctures around the sphere changes the combinatorial type of the associated dendrogram. By embedding a dendrogram as a finite subtree of the BT‑tree, the authors show that every dendrogram can be recovered from a point in M₀,ₙ(Qₚ), and conversely every point of the moduli space determines a unique dendrogram up to isomorphism. This establishes a bijective, topologically compatible correspondence between the “family” of dendrograms and the p‑adic moduli space.
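To make the embedding concrete, here is a minimal sketch for the special case where the data points are distinct integers, viewed as p‑adic integers. In that case a ball of radius p^(-k) is simply a residue class modulo p^k, and the dendrogram is the tree of residue classes that actually separate the data. The function name `ball_tree` and the dictionary layout are assumptions made for illustration; they are not the authors' construction or notation.

```python
def ball_tree(points, p, level=0):
    """Nested clusters of distinct integers, viewed as p-adic integers.

    A ball of radius p**(-level) is a residue class mod p**level, so the
    children of a cluster are its nonempty residue classes one level deeper.
    Levels at which a cluster does not split are skipped.
    """
    points = sorted(set(points))
    if len(points) == 1:
        return points[0]                      # leaf: a single data point
    while True:
        classes = {}
        for x in points:
            classes.setdefault(x % p ** (level + 1), []).append(x)
        if len(classes) > 1:
            break                             # the cluster branches here
        level += 1                            # no branching: descend one level
    children = [ball_tree(c, p, level + 1) for c in classes.values()]
    return {"level": level, "children": children}


tree = ball_tree([0, 1, 2, 4], p=2)
print(tree)
```

In this sketch the `level` k stored at the lowest common ancestor of two leaves encodes their distance as p^(-k): for p = 2, the points 0 and 4 first separate modulo 8, so their common ancestor has level 2 and their 2‑adic distance is 1/4.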

Having set up this geometric picture, the paper redefines a classifier as a continuous map between moduli spaces when a new observation is added. Formally, a classifier is a morphism φ: M₀,ₙ → M₀,ₙ₊₁ that respects the p‑adic ultrametric structure. Under this viewpoint, inserting a new datum corresponds to moving the configuration of punctures in a way that preserves the underlying tree structure, possibly creating new internal nodes. The authors introduce the notion of “hidden vertices” – internal nodes that do not appear in the original dendrogram but become necessary after the insertion of new points. Using combinatorial arguments based on the properties of trees and the ultrametric inequality, they derive an upper bound for the number of hidden vertices: at most n − 1 for a dataset of size n. This bound guarantees that the hierarchical complexity cannot explode arbitrarily as new data are incorporated.
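The summary does not reproduce the authors' argument for this bound. As a point of comparison only, the following standard count shows why n − 1 is a natural ceiling for the number of internal vertices of a dendrogram. The assumption, stated here and not taken from the paper, is that every internal vertex has at least two children.

```latex
% Assumption: T is a rooted tree with n leaves and m internal vertices,
% and every internal vertex has at least two children.
\begin{align*}
\#\text{edges} &= (n + m) - 1
  && \text{(every non-root vertex has exactly one parent)} \\
\#\text{edges} &= \sum_{v \text{ internal}} \#\text{children}(v) \;\ge\; 2m \\
\Longrightarrow\quad n + m - 1 &\ge 2m
  \quad\Longrightarrow\quad m \le n - 1 .
\end{align*}
```

Equality holds exactly when the tree is binary, that is, when every internal vertex has exactly two children.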

To validate the theory, the authors conduct a series of computational experiments for several small primes (p = 2, 3, 5). Random point sets are generated, embedded into the p‑adic projective line, and then mapped onto the BT‑tree. The resulting dendrograms are compared with those obtained by traditional agglomerative clustering, showing exact agreement when the same ultrametric is used. Moreover, when points are added or removed, the path traced in the moduli space coincides with the sequence of dendrogram transformations observed in the experiments, confirming the continuity of the classifier map. The experiments also illustrate that hidden vertices indeed appear only when necessary and never exceed the theoretical bound.
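The agreement with agglomerative clustering reflects a general fact: when the input dissimilarity is already an ultrametric, single‑linkage merge heights reproduce it exactly. The sketch below illustrates this on a small set of integers with the 2‑adic distance. It is not the authors' experimental code; the point set and the helper names `padic_distance` and `single_linkage_cophenetic` are illustrative assumptions.

```python
from fractions import Fraction
from itertools import combinations


def padic_distance(a: int, b: int, p: int) -> Fraction:
    """|a - b|_p for integers a, b (0 when a == b)."""
    if a == b:
        return Fraction(0)
    d, k = abs(a - b), 0
    while d % p == 0:
        d //= p
        k += 1
    return Fraction(1, p ** k)


def single_linkage_cophenetic(points, dist):
    """Naive single linkage; returns the merge height of every pair."""
    clusters = [frozenset([x]) for x in points]
    coph = {}
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # single linkage: cluster distance is the closest cross pair
            h = min(dist(x, y) for x in a for y in b)
            if best is None or h < best[0]:
                best = (h, a, b)
        h, a, b = best
        for x in a:
            for y in b:
                coph[frozenset((x, y))] = h   # height at which x and y join
        clusters = [c for c in clusters if c != a and c != b] + [a | b]
    return coph


points = [0, 1, 2, 4, 9]
coph = single_linkage_cophenetic(points, lambda x, y: padic_distance(x, y, 2))
agrees = all(
    coph[frozenset((x, y))] == padic_distance(x, y, 2)
    for x, y in combinations(points, 2)
)
print(agrees)
```

The quadratic search for the closest pair keeps the sketch short; a library routine would be the practical choice for larger inputs.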

In summary, the paper connects hierarchical clustering with p‑adic geometry and moduli theory by describing the entire “family” of dendrograms as points in a well‑studied algebraic‑geometric object. This perspective suggests several directions: (1) clustering algorithms that operate directly on the moduli space and exploit its geometric structure; (2) a sharper account of the stability and robustness of hierarchical partitions under data perturbations, since small movements in the moduli space correspond to controlled changes in the dendrogram; and (3) extensions to more complex data types (e.g., trees, graphs) by considering higher‑genus p‑adic curves or additional punctures.

