Data spectroscopy: Eigenspaces of convolution operators and clustering

Reading time: 6 minutes

📝 Original Info

  • Title: Data spectroscopy: Eigenspaces of convolution operators and clustering
  • ArXiv ID: 0807.3719
  • Date: 2009-11-20
  • Authors: Tao Shi, Mikhail Belkin and Bin Yu

📝 Abstract

This paper focuses on obtaining clustering information about a distribution from its i.i.d. samples. We develop theoretical results to understand and use clustering information contained in the eigenvectors of data adjacency matrices based on a radial kernel function with a sufficiently fast tail decay. In particular, we provide population analyses to gain insights into which eigenvectors should be used and when the clustering information for the distribution can be recovered from the sample. We learn that a fixed number of top eigenvectors might at the same time contain redundant clustering information and miss relevant clustering information. We use this insight to design the data spectroscopic clustering (DaSpec) algorithm that utilizes properly selected eigenvectors to determine the number of clusters automatically and to group the data accordingly. Our findings extend the intuitions underlying existing spectral techniques such as spectral clustering and Kernel Principal Components Analysis, and provide new understanding into their usability and modes of failure. Simulation studies and experiments on real-world data are conducted to show the potential of our algorithm. In particular, DaSpec is found to handle unbalanced groups and recover clusters of different shapes better than the competing methods.

📄 Full Content

arXiv:0807.3719v2 [stat.ML] 20 Nov 2009. The Annals of Statistics, 2009, Vol. 37, No. 6B, 3960–3984. DOI: 10.1214/09-AOS700. © Institute of Mathematical Statistics, 2009.

DATA SPECTROSCOPY: EIGENSPACES OF CONVOLUTION OPERATORS AND CLUSTERING

By Tao Shi, Mikhail Belkin and Bin Yu. Ohio State University, Ohio State University and University of California, Berkeley. Received July 2008; revised March 2009.

Shi was supported in part by NASA Grant NNG06GD31G; Belkin in part by NSF Early Career Award 0643916; Yu in part by NSF Grant DMS-06-05165, ARO Grant W911NF-05-1-0104, NSFC Grant 60628102, a grant from MSRA and a Guggenheim Fellowship in 2006. AMS 2000 subject classifications: primary 62H30; secondary 68T10. Key words and phrases: Gaussian kernel, spectral clustering, kernel principal component analysis, support vector machines, unsupervised learning.

1. Introduction. Data clustering based on eigenvectors of a proximity or affinity matrix (or its normalized versions) has become popular in machine learning, computer vision and many other areas. Given data x_1, ..., x_n ∈ R^d, this family of algorithms constructs an affinity matrix (K_n)_ij = K(x_i, x_j)/n based on a kernel function, such as a Gaussian kernel K(x, y) = e^(−‖x−y‖²/(2ω²)). Clustering information is obtained by taking eigenvectors and eigenvalues of the matrix K_n or the closely related graph Laplacian matrix L_n = D_n − K_n, where D_n is a diagonal matrix with (D_n)_ii = Σ_j (K_n)_ij. The basic intuition is that when the data come from several clusters, distances between clusters are typically far larger than the distances within the same cluster, and thus K_n and L_n are (close to) block-diagonal matrices up to a permutation of the points. Eigenvectors of such block-diagonal matrices keep the same structure. For example, the few top eigenvectors of L_n can be shown to be constant on each cluster, assuming infinite separation between clusters, allowing one to distinguish the clusters by looking for data points corresponding to the same or similar values of the eigenvectors. In particular, we note the algorithm of Scott and Longuet-Higgins [13], who proposed to embed data into the space spanned by the top eigenvectors of K_n, normalize the data in that space and group data by investigating the block structure of the inner product matrix of the normalized data.
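The K_n and L_n construction above is easy to sketch in NumPy. This is a minimal illustration, not the paper's code; the bandwidth ω = 0.5 and the two-blob toy data are assumptions chosen only to make the near block-diagonal structure visible:

```python
import numpy as np

def affinity_and_laplacian(X, omega):
    """Gaussian affinity (K_n)_ij = K(x_i, x_j)/n and Laplacian L_n = D_n - K_n."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances ||x_i - x_j||^2.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    Kn = np.exp(-sq / (2.0 * omega ** 2)) / n
    Dn = np.diag(Kn.sum(axis=1))  # (D_n)_ii = sum_j (K_n)_ij
    return Kn, Dn - Kn

# Two well-separated blobs: K_n is near block-diagonal up to a permutation.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (25, 2)), rng.normal(8.0, 0.3, (15, 2))])
Kn, Ln = affinity_and_laplacian(X, omega=0.5)
vals, vecs = np.linalg.eigh(Kn)  # ascending order; top eigenvector is vecs[:, -1]
```

Because the between-blob affinities are numerically negligible here, the top eigenvectors of K_n are each supported on essentially one blob, which is the block structure the text refers to.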
Perona and Freeman [10] suggested clustering the data into two groups by directly thresholding the top eigenvector of K_n. Another important algorithm, the normalized cut, was proposed by Shi and Malik [14] in the context of image segmentation. It separates data into two groups by thresholding the second smallest generalized eigenvector of L_n. Assuming k groups, Malik et al. [6] and Ng, Jordan and Weiss [8] suggested embedding the data into the span of the bottom k eigenvectors of the normalized graph Laplacian I_n − D_n^(−1/2) K_n D_n^(−1/2) and applying the k-means algorithm to group the data in the embedding space. For further discussions on spectral clustering, we refer the reader to Weiss [20], Dhillon, Guan and Kulis [2] and von Luxburg [18]. An empirical comparison of various methods is provided in Verma and Meila [17]. A discussion of some limitations of spectral clustering can be found in Nadler and Galun [7]. A theoretical analysis of statistical consistency of different types of spectral clustering is provided in von Luxburg, Belkin and Bousquet.
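The Ng–Jordan–Weiss variant can be sketched as follows. Note that the bottom k eigenvectors of I_n − D_n^(−1/2) K_n D_n^(−1/2) are the top k eigenvectors of D_n^(−1/2) K_n D_n^(−1/2). The bandwidth, the unbalanced toy data (25 vs. 15 points) and the plain farthest-point-initialized Lloyd's k-means are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def njw_spectral_clustering(X, k, omega, iters=50):
    """Embed into the top-k eigenspace of D^{-1/2} K D^{-1/2}, then k-means."""
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-sq / (2.0 * omega ** 2))
    dinv = 1.0 / np.sqrt(K.sum(axis=1))
    M = dinv[:, None] * K * dinv[None, :]  # D^{-1/2} K D^{-1/2}, symmetric
    _, vecs = np.linalg.eigh(M)
    U = vecs[:, -k:]                       # top k eigenvectors (columns)
    U = U / np.linalg.norm(U, axis=1, keepdims=True)  # row-normalize
    # Farthest-point initialization keeps the seeds in different clusters.
    centers = [U[0]]
    for _ in range(1, k):
        d2 = np.min(np.stack([((U - c) ** 2).sum(axis=1) for c in centers]), axis=0)
        centers.append(U[np.argmax(d2)])
    centers = np.array(centers)
    # Plain Lloyd's iterations in the embedded space.
    for _ in range(iters):
        labels = np.argmin(((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2), axis=1)
        for j in range(k):
            mask = labels == j
            if mask.any():
                centers[j] = U[mask].mean(axis=0)
    return labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (25, 2)), rng.normal(8.0, 0.3, (15, 2))])
labels = njw_spectral_clustering(X, k=2, omega=0.5)
```

On well-separated groups the row-normalized embedding collapses each cluster to roughly one point on the unit sphere, so even this crude k-means recovers the two groups.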

…(Full text truncated)…


Reference

This content is AI-processed based on ArXiv data.
