Principal Graphs and Manifolds
In many physical, statistical, biological and other investigations it is desirable to approximate a system of points by objects of lower dimension and/or complexity. For this purpose, Karl Pearson invented principal component analysis in 1901 and found 'lines and planes of closest fit to systems of points'. The famous k-means algorithm also solves the approximation problem, but with finite sets of points instead of lines and planes. This chapter gives a brief practical introduction to methods for constructing general principal objects, i.e. objects embedded in the 'middle' of a multidimensional data set. As a basis, the unifying framework of mean squared distance approximation of finite datasets is adopted. Principal graphs and manifolds are constructed as generalisations of principal components and k-means principal points. For this purpose, a family of expectation/maximisation algorithms with nearest generalisations is presented. Construction of principal graphs with controlled complexity is based on the graph grammar approach.
💡 Research Summary
The chapter presents a unified framework for approximating high‑dimensional data sets with lower‑dimensional or lower‑complexity objects, based on minimizing the mean squared distance (equivalently, the mean squared error, MSE) between the data points and the approximating structure. Classical principal component analysis (PCA) and k‑means clustering are recast as special cases of this framework: PCA seeks a linear subspace (a line or plane) that minimizes the MSE, while k‑means seeks a finite set of zero‑dimensional points (cluster centroids) that minimizes the same criterion. Both methods, however, are limited in their ability to capture the nonlinear, branching, or manifold‑like structures that frequently appear in physical, biological, and statistical data.
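The shared criterion can be made concrete with a small NumPy sketch (illustrative only, not the chapter's code): the same mean squared distance is evaluated for the PCA line of closest fit and for a set of k-means principal points on a toy 2-D cloud.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 2-D cloud stretched along the x-axis
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])

# PCA: the first principal component is the line of closest fit.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
along = Xc @ Vt[0]                       # coordinates along the first PC
msd_line = np.mean((Xc ** 2).sum(axis=1) - along ** 2)

# k-means: principal points (centroids) minimise the same criterion.
k = 3
centers = X[rng.choice(len(X), k, replace=False)]
for _ in range(20):                      # plain Lloyd iterations
    labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    centers = np.array([X[labels == j].mean(0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
msd_points = np.mean(((X - centers[labels]) ** 2).sum(axis=1))

print(msd_line, msd_points)   # both are mean squared distances to the object
```

Only the approximating object changes between the two halves; the objective is identical.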
To overcome these limitations, the authors introduce principal graphs and principal manifolds as generalizations of principal components and k‑means centroids. A principal graph is a connected graph whose vertices are embedded in the data space; each vertex carries a position vector that serves as a “principal point.” The graph’s topology (the set of vertices and edges) is not fixed a priori but is evolved during learning by applying a set of graph grammar rules. Typical rules include vertex splitting, edge insertion, vertex merging, and subgraph replacement. Each rule is evaluated by the reduction it yields in the overall MSE; if the reduction falls below a predefined threshold, the algorithm stops adding complexity, thereby automatically balancing fit quality against model simplicity.
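A minimal sketch of grammar-driven growth, under assumed simplifications (vertices only, a single "split vertex" rule, and a hypothetical MSE-gain threshold of 0.01; the chapter's actual grammars and energies are richer):

```python
import numpy as np

rng = np.random.default_rng(1)
# Two well-separated blobs as a toy data cloud
X = np.concatenate([rng.normal(0.0, 0.3, (100, 2)),
                    rng.normal(3.0, 0.3, (100, 2))])

def fit_and_mse(nodes, X, iters=10):
    # EM-style refinement: assign points to nearest node, move nodes to means
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - nodes[None]) ** 2).sum(-1), axis=1)
        nodes = np.array([X[labels == j].mean(0) if np.any(labels == j)
                          else nodes[j] for j in range(len(nodes))])
    labels = np.argmin(((X[:, None] - nodes[None]) ** 2).sum(-1), axis=1)
    return nodes, labels, ((X - nodes[labels]) ** 2).sum(1).mean()

nodes = X.mean(0, keepdims=True)              # start from a single vertex
nodes, labels, mse = fit_and_mse(nodes, X)
threshold = 0.01                              # assumed stopping threshold
while True:
    # Grammar rule "split vertex": replace the worst-fitting vertex by two
    # copies displaced along the principal axis of its assigned points.
    worst = int(np.argmax([((X[labels == j] - nodes[j]) ** 2).sum()
                           for j in range(len(nodes))]))
    _, _, Vt = np.linalg.svd(X[labels == worst] - nodes[worst],
                             full_matrices=False)
    delta = 0.3 * Vt[0]
    cand = np.vstack([np.delete(nodes, worst, axis=0),
                      nodes[worst] - delta, nodes[worst] + delta])
    cand, cand_labels, cand_mse = fit_and_mse(cand, X)
    if mse - cand_mse < threshold:
        break                                 # gain too small: stop growing
    nodes, labels, mse = cand, cand_labels, cand_mse

print(len(nodes), round(mse, 3))
```

Each rule application is accepted only if it buys enough MSE reduction, which is the complexity-control idea in miniature.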
Learning proceeds via an Expectation‑Maximisation (EM) scheme adapted to the graph setting. In the E‑step (projection), each data point is assigned to its nearest vertex or, more generally, to the nearest point on an edge; in the M‑step (maximisation), each vertex position is updated to the weighted mean of its assigned points. The weighting can incorporate not only Euclidean distance but also local density and curvature information, leading to a "nearest generalisation" that respects the underlying geometry of the data cloud. Graph‑grammar application forms an outer loop around this EM cycle: the current topology is modified and the resulting structure is re‑evaluated under the MSE objective. This iterative process yields a graph that conforms tightly to dense regions of the data while remaining sparse in low‑density areas, effectively adapting its resolution to the data distribution.
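The generalised assignment, projecting a data point onto the closest point of an edge rather than only onto vertices, can be sketched as follows (an illustrative fragment with a hypothetical 3-vertex polyline graph):

```python
import numpy as np

def project_to_segment(p, a, b):
    """Nearest point to p on the segment [a, b], plus squared distance."""
    ab = b - a
    t = np.clip(np.dot(p - a, ab) / np.dot(ab, ab), 0.0, 1.0)
    q = a + t * ab
    return q, float(np.sum((p - q) ** 2))

# Tiny graph: 3 embedded vertices joined by 2 edges (a polyline)
nodes = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
edges = [(0, 1), (1, 2)]

p = np.array([0.5, 0.3])
# Generalised E-step for one point: project onto every edge, keep the closest
best = min((project_to_segment(p, nodes[i], nodes[j]) for i, j in edges),
           key=lambda qd: qd[1])
print(best)   # nearest point on the graph is (0.5, 0.0), squared dist 0.09
```

Looping this over all points partitions the data among vertices and edge segments, after which vertex positions can be re-estimated from their assigned points.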
Principal manifolds extend the graph concept to continuous, smooth surfaces. The manifold is parameterised by a set of basis functions—such as splines, radial basis functions, or neural‑network encoders—and the EM steps are performed on the manifold’s latent coordinates. Points are projected onto the manifold, the projection errors contribute to the MSE, and the manifold parameters are updated to minimise this error. The intrinsic dimensionality of the manifold can be fixed by the user or inferred from the data using eigenvalue spectra, local neighbourhood statistics, or model‑selection criteria.
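This project-then-refit alternation can be sketched for a 1-D manifold, assuming a simple polynomial basis and grid-search projection (both are illustrative choices, not the chapter's parameterisation):

```python
import numpy as np

rng = np.random.default_rng(2)
t_true = rng.uniform(-1.0, 1.0, 150)
# Noisy samples from a parabola: a 1-D manifold embedded in 2-D
X = np.stack([t_true, t_true ** 2], axis=1) + rng.normal(0, 0.05, (150, 2))

def basis(t):
    # Hypothetical basis functions: a quadratic polynomial basis
    return np.stack([np.ones_like(t), t, t ** 2], axis=1)

grid = np.linspace(-1.2, 1.2, 201)        # latent-coordinate search grid
W = np.zeros((3, 2)); W[1, 0] = 1.0       # initialise as a straight line
for _ in range(10):
    # Projection step: latent coordinate = nearest point on the current curve
    curve = basis(grid) @ W               # (201, 2) samples along the curve
    t = grid[np.argmin(((X[:, None] - curve[None]) ** 2).sum(-1), axis=1)]
    # Update step: least-squares refit of the mapping parameters
    W, *_ = np.linalg.lstsq(basis(t), X, rcond=None)

resid = ((X - basis(t) @ W) ** 2).sum(1).mean()
print(round(resid, 4))                    # mean squared projection error
```

The projection errors are exactly the MSE terms described above; richer bases (splines, RBFs, encoders) change only the `basis` function and the refit step.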
Complexity control is achieved on two levels: (1) topological complexity, measured by the number of vertices and edges, and (2) parametric complexity, measured by the degrees of freedom per vertex (e.g., the number of spline control points). The authors discuss several model‑selection strategies, including Bayesian Information Criterion (BIC), Akaike Information Criterion (AIC), and cross‑validation, to automatically select the appropriate level of complexity and avoid over‑fitting.
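As an illustration of criterion-based complexity selection, the following sketch scores k-means fits of increasing vertex count with a hard-assignment spherical-Gaussian BIC (an x-means-style heuristic assumed here for illustration, not taken from the chapter) and picks the count that minimises it:

```python
import numpy as np

rng = np.random.default_rng(3)
# Three clearly separated clusters along the x-axis
X = np.concatenate([rng.normal(c, 0.2, (60, 2)) for c in (0.0, 2.0, 4.0)])
n, d = X.shape

def kmeans(X, k, iters=50):
    # Deterministic quantile initialisation along the first coordinate
    order = np.argsort(X[:, 0])
    centers = X[order[np.linspace(0, len(X) - 1, k).astype(int)]]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return centers, np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)

def bic(X, centers, labels):
    # Hard-assignment spherical-Gaussian likelihood plus a parameter penalty
    n, d = X.shape
    k = len(centers)
    var = ((X - centers[labels]) ** 2).sum() / (n * d)   # per-dim variance
    counts = np.bincount(labels, minlength=k)
    loglik = (counts[counts > 0] * np.log(counts[counts > 0] / n)).sum() \
             - 0.5 * n * d * (np.log(2 * np.pi * var) + 1)
    p = k * d + k                      # rough count: means plus mixture weights
    return -2 * loglik + p * np.log(n)

scores = {k: bic(X, *kmeans(X, k)) for k in range(1, 7)}
best_k = min(scores, key=scores.get)
print(best_k)
```

The fit term rewards lower MSE while the `p * log(n)` penalty grows with each added vertex, so the score bottoms out at an intermediate complexity; AIC or cross-validation slots into the same loop.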
Empirical evaluations cover a range of domains: handwritten digit images (MNIST), facial image manifolds, gene‑expression microarray data, and outputs from physical simulations. The proposed methods are benchmarked against PCA, k‑means, Isomap, locally linear embedding (LLE), and t‑SNE. Results show that principal graphs and manifolds achieve lower reconstruction error and produce more interpretable visualisations, especially when the data exhibit branching structures, loops, or other non‑linear topologies that linear methods cannot represent. Moreover, the grammar‑driven growth of the graph provides a natural way for users to control the granularity of the representation, yielding a flexible tool for exploratory data analysis.
In summary, the chapter establishes a principled, MSE‑based unification of dimensionality reduction and clustering, introduces graph‑grammar‑driven principal graphs and smooth principal manifolds as powerful extensions, and demonstrates how EM‑style optimisation together with automatic complexity regulation can faithfully capture the intrinsic geometry of complex high‑dimensional data sets. This framework offers a versatile alternative to traditional linear techniques and modern non‑linear embeddings, with particular strength in applications requiring both accurate reconstruction and clear geometric interpretation.