Data analysis and data mining are concerned with unsupervised pattern finding and structure determination in data sets. "Structure" can be understood as symmetry, and a range of symmetries are expressed by hierarchy. Such symmetries point directly to invariants that pinpoint intrinsic properties of the data and of the background empirical domain of interest. We review many aspects of hierarchy here, including ultrametric topology, generalized ultrametrics, linkages with lattices and other discrete algebraic structures, and p-adic number representations. By focusing on symmetries in data we have a powerful means of structuring and analyzing massive, high dimensional data stores. We illustrate the power of hierarchical clustering in case studies in chemistry and finance, and we provide pointers to other published case studies.
Deep Dive into Hierarchical Clustering for Finding Symmetries and Other Patterns in Massive, High Dimensional Datasets.
arXiv:1005.2638v1 [stat.ML] 14 May 2010
Herbert A. Simon, Nobel Laureate in Economics, originator of "bounded rationality" and of "satisficing", believed in hierarchy at the basis of the human and social sciences, as the following quotation shows: "... my central theme is that complexity frequently takes the form of hierarchy and that hierarchic systems have some common properties independent of their specific content. Hierarchy, I shall argue, is one of the central structural schemes that the architect of complexity uses." ([74], p. 184.) Partitioning a set of observations [75,76,49] leads to some very simple symmetries. This is one approach to clustering and data mining. But such approaches, often based on optimization, are not of direct interest to us here. Instead we will pursue the theme pointed to by Simon, namely that the notion of hierarchy is fundamental for interpreting data and the complex reality which the data expresses. Our work is very different too from the marvelous view of the development of mathematical group theory (but viewed in its own right as a complex, evolving system) presented by Foote [19].
Weyl [80] makes the case for the fundamental importance of symmetry in science, engineering, architecture, art and other areas. As a “guiding principle”, “Whenever you have to do with a structure-endowed entity … try to determine its group of automorphisms, the group of those element-wise transformations which leave all structural relations undisturbed. You can expect to gain a deep insight in the constitution of [the structure-endowed entity] in this way. After that you may start to investigate symmetric configurations of elements, i.e. configurations which are invariant under a certain subgroup of the group of all automorphisms; …” ([80], p. 144).
In section 2, we describe ultrametric topology as an expression of hierarchy. This provides comprehensive background on the commonly used quadratic computational time (i.e., O(n^2), where n is the number of observations) agglomerative hierarchical clustering algorithms.
In section 3, we look at the generalized ultrametric context. This is closely linked to analysis based on lattices. We use a case study from chemical database matching to illustrate algorithms in this area.
In section 4, p-adic encoding, providing a number theory vantage point on ultrametric topology, gives rise to additional symmetries and ways to capture invariants in data.
Section 5 deals with symmetries that are part and parcel of a tree, representing a partial order on data, or equally a set of subsets of the data, some of which are embedded. One application that targets such symmetries in a dendrogram, which expresses a hierarchical embedding, is the Haar wavelet transform of the dendrogram, together with wavelet filtering based on that transform.
Section 6 deals with new and recent results relating to the remarkable symmetries of massive, and especially high dimensional data sets. An example is discussed of segmenting a financial forex (foreign exchange) trading signal.
For the reader new to data analysis, we now give a very short introduction to hierarchical clustering. As with other families of algorithms, the objective is automatic classification for the purposes of data mining or knowledge discovery. Classification, after all, is fundamental in human thinking and in machine-based decision making. We draw attention to the fact that our objective is unsupervised classification, as opposed to supervised classification, which is also known as discriminant analysis or (in a general way) machine learning. So here we are not concerned with generalizing the decision making capability of training data, nor are we concerned with fitting statistical models to data so that these models can play a role in generalizing and predicting. Instead we are concerned with having “data speak for themselves”. Unsupervised classification of data (observations, objects, events, phenomena, etc.) is unquestionably a huge task in our society. One may think, for instance, of situations where precedents are very limited.
Among families of clustering, or unsupervised classification, algorithms, we can distinguish the following: (i) array permuting and other visualization approaches; (ii) partitioning to form (discrete or overlapping) clusters through optimization, including graph-based approaches; and, of interest to us in this article, (iii) embedded clusters interrelated in a tree-based way.
For the last-mentioned family of algorithms, agglomerative building of the hierarchy from pairwise distances between objects has been the most commonly adopted approach. As comprehensive background texts, see [48,30,81,31].
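The agglomerative idea can be illustrated with a minimal sketch. It assumes the single linkage criterion, which is only one of several agglomerative criteria, and it uses a deliberately naive search rather than the quadratic-time algorithms referred to in this article. The names `single_linkage`, `is_ultrametric` and `points` are hypothetical and chosen for illustration only. The sketch also records the tree (cophenetic) distance between each pair of observations, i.e. the height at which the pair is first merged, and checks the strong triangle inequality d(x,z) <= max(d(x,y), d(y,z)) that characterizes an ultrametric.

```python
from itertools import combinations


def single_linkage(dist):
    """Naive single-linkage agglomerative clustering (illustrative only).

    dist: symmetric n x n list of lists of pairwise distances.
    Returns (merges, coph), where merges lists (members_a, members_b, height)
    in merge order and coph is the n x n matrix of cophenetic distances.
    """
    n = len(dist)
    clusters = [frozenset([i]) for i in range(n)]
    coph = [[0] * n for _ in range(n)]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters whose closest members are nearest.
        best = None
        for a, b in combinations(clusters, 2):
            d = min(dist[i][j] for i in a for j in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # The merge height is the tree distance between every
        # member of a and every member of b.
        for i in a:
            for j in b:
                coph[i][j] = coph[j][i] = d
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(a | b)
        merges.append((sorted(a), sorted(b), d))
    return merges, coph


def is_ultrametric(d, tol=1e-12):
    """Check d[i][k] <= max(d[i][j], d[j][k]) for all triples (i, j, k)."""
    n = len(d)
    return all(
        d[i][k] <= max(d[i][j], d[j][k]) + tol
        for i in range(n) for j in range(n) for k in range(n)
    )


# Hypothetical toy data: four points on the real line.
points = [0, 1, 3, 7]
dist = [[abs(a - b) for b in points] for a in points]
merges, coph = single_linkage(dist)
print(merges)
print(is_ultrametric(dist), is_ultrametric(coph))
```

On this toy input the original distances on the line violate the strong triangle inequality, while the distances read off the resulting tree satisfy it. This is the sense in which a hierarchy induces an ultrametric on the observations.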
The real number system and a p-adic number system, for a given prime p, are potentially equally useful alternatives. p-Adic numbers were introduced by Kurt Hensel in 1898.
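As a small illustration of why p-adic numbers relate to hierarchy, the following sketch computes the standard p-adic norm |x|_p = p^(-v), where v is the largest exponent such that p^v divides x, and checks numerically that the induced distance satisfies the strong triangle inequality. It is restricted to integers for simplicity, and the function names `p_adic_valuation`, `p_adic_norm` and `p_adic_distance` are hypothetical, introduced only for this example.

```python
from fractions import Fraction


def p_adic_valuation(x, p):
    """Largest exponent v such that p**v divides the nonzero integer x."""
    if x == 0:
        raise ValueError("the valuation of 0 is +infinity")
    v = 0
    x = abs(x)
    while x % p == 0:
        x //= p
        v += 1
    return v


def p_adic_norm(x, p):
    """|x|_p = p**(-v_p(x)) as an exact Fraction, with |0|_p = 0."""
    if x == 0:
        return Fraction(0)
    return Fraction(1, p ** p_adic_valuation(x, p))


def p_adic_distance(x, y, p):
    """p-adic distance between two integers x and y."""
    return p_adic_norm(x - y, p)


# Check the strong triangle inequality on a small range of integers.
p = 3
values = range(-10, 11)
strong_triangle_holds = all(
    p_adic_distance(x, z, p)
    <= max(p_adic_distance(x, y, p), p_adic_distance(y, z, p))
    for x in values for y in values for z in values
)
print(p_adic_norm(18, p), strong_triangle_holds)
```

Two integers are close in this distance when their difference is divisible by a high power of p, which is quite unlike closeness on the real line. Exact `Fraction` arithmetic is used here to avoid floating-point comparisons.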
Whether we deal with Euclidean or with non-Euclidean geometry, we are (nearly) always dealing w
…(Full text truncated)…