arXiv:0807.3719v2 [stat.ML] 20 Nov 2009
The Annals of Statistics
2009, Vol. 37, No. 6B, 3960–3984
DOI: 10.1214/09-AOS700
© Institute of Mathematical Statistics, 2009
DATA SPECTROSCOPY: EIGENSPACES OF CONVOLUTION
OPERATORS AND CLUSTERING
By Tao Shi1, Mikhail Belkin2 and Bin Yu3
Ohio State University, Ohio State University and
University of California, Berkeley
This paper focuses on obtaining clustering information about a distribution from its i.i.d. samples. We develop theoretical results to understand and use clustering information contained in the eigenvectors of data adjacency matrices based on a radial kernel function with a sufficiently fast tail decay. In particular, we provide population analyses to gain insights into which eigenvectors should be used and when the clustering information for the distribution can be recovered from the sample. We learn that a fixed number of top eigenvectors might at the same time contain redundant clustering information and miss relevant clustering information. We use this insight to design the data spectroscopic clustering (DaSpec) algorithm that utilizes properly selected eigenvectors to determine the number of clusters automatically and to group the data accordingly. Our findings extend the intuitions underlying existing spectral techniques such as spectral clustering and Kernel Principal Components Analysis, and provide new understanding into their usability and modes of failure. Simulation studies and experiments on real-world data are conducted to show the potential of our algorithm. In particular, DaSpec is found to handle unbalanced groups and recover clusters of different shapes better than the competing methods.
1. Introduction.
Data clustering based on eigenvectors of a proximity or affinity matrix (or its normalized versions) has become popular in machine learning, computer vision and many other areas. Given data $x_1, \ldots, x_n \in \mathbb{R}^d$,
Received July 2008; revised March 2009.
1Supported in part by NASA Grant NNG06GD31G.
2Supported in part by NSF Early Career Award 0643916.
3Supported in part by NSF Grant DMS-06-05165, ARO Grant W911NF-05-1-0104,
NSFC Grant 60628102, a grant from MSRA and a Guggenheim Fellowship in 2006.
AMS 2000 subject classifications. Primary 62H30; secondary 68T10.
Key words and phrases. Gaussian kernel, spectral clustering, kernel principal component analysis, support vector machines, unsupervised learning.
This is an electronic reprint of the original article published by the
Institute of Mathematical Statistics in The Annals of Statistics,
2009, Vol. 37, No. 6B, 3960–3984. This reprint differs from the original in
pagination and typographic detail.
this family of algorithms constructs an affinity matrix $(K_n)_{ij} = K(x_i, x_j)/n$ based on a kernel function, such as a Gaussian kernel $K(x,y) = e^{-\|x-y\|^2/(2\omega^2)}$. Clustering information is obtained by taking eigenvectors and eigenvalues of the matrix $K_n$ or the closely related graph Laplacian matrix $L_n = D_n - K_n$, where $D_n$ is a diagonal matrix with $(D_n)_{ii} = \sum_j (K_n)_{ij}$. The basic intuition is that when the data come from several clusters, distances between clusters are typically far larger than the distances within the same cluster, and thus $K_n$ and $L_n$ are (close to) block-diagonal matrices up to a permutation of the points. Eigenvectors of such block-diagonal matrices keep the same structure. For example, the few bottom eigenvectors of $L_n$ (those with the smallest eigenvalues) can be shown to be constant on each cluster, assuming infinite separation between clusters, allowing one to distinguish the clusters by looking for data points corresponding to the same or similar values of the eigenvectors.
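The construction above can be sketched in a few lines of numpy. This is an illustrative sketch, not the paper's code: the data, bandwidth $\omega$, and cluster sizes are arbitrary choices made to exhibit the near block-diagonal structure of $K_n$ for well-separated clusters.

```python
import numpy as np

def gaussian_affinity(X, omega):
    """(K_n)_{ij} = exp(-||x_i - x_j||^2 / (2*omega^2)) / n."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * omega**2)) / len(X)

# Two well-separated, tight clusters (illustrative sample).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),   # cluster 1
               rng.normal(5.0, 0.1, (20, 2))])  # cluster 2
Kn = gaussian_affinity(X, omega=0.5)
Ln = np.diag(Kn.sum(axis=1)) - Kn               # graph Laplacian L_n = D_n - K_n

# Off-diagonal blocks of K_n are essentially zero, so the top
# eigenvectors of K_n are each supported on (roughly) a single cluster.
vals, vecs = np.linalg.eigh(Kn)
top = vecs[:, -1]
```

Inspecting `top` shows it is numerically zero on one of the two clusters, which is exactly the block structure the intuition predicts.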
In particular, we note the algorithm of Scott and Longuet-Higgins [13], who proposed to embed data into the space spanned by the top eigenvectors of $K_n$, normalize the data in that space and group the data by investigating the block structure of the inner product matrix of the normalized data. Perona and Freeman [10] suggested clustering the data into two groups by directly thresholding the top eigenvector of $K_n$.
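A minimal sketch of a Perona-Freeman-style two-way split follows; the data, bandwidth, and the half-of-maximum threshold are illustrative assumptions rather than the original authors' choices.

```python
import numpy as np

# Two groups of different sizes; the top eigenvector of K_n
# concentrates on one of them, so thresholding its magnitude
# splits the data in two.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (15, 2)),   # smaller group
               rng.normal(4.0, 0.1, (25, 2))])  # larger group
d2 = ((X[:, None, :] - X[None, :, :])**2).sum(-1)
Kn = np.exp(-d2 / (2 * 0.5**2)) / len(X)

_, vecs = np.linalg.eigh(Kn)
v1 = vecs[:, -1]                                # top eigenvector of K_n
# threshold at half the maximum magnitude (sign-invariant variant)
labels = (np.abs(v1) > np.abs(v1).max() / 2).astype(int)
```

Here the thresholded labels recover the two groups exactly; with overlapping or similarly sized groups a single eigenvector may not suffice, which foreshadows the eigenvector-selection issue the paper studies.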
Another important algorithm, the normalized cut, was proposed by Shi and Malik [14] in the context of image segmentation. It separates data into two groups by thresholding the second smallest generalized eigenvector of $L_n$. Assuming $k$ groups, Malik et al. [6] and Ng, Jordan and Weiss [8] suggested embedding the data into the span of the bottom $k$ eigenvectors of the normalized graph Laplacian $I_n - D_n^{-1/2} K_n D_n^{-1/2}$ and applying the $k$-means algorithm to group the data in the embedding space.
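The Ng-Jordan-Weiss procedure just described can be sketched as below. This is a hedged, simplified rendering: the row normalization step follows [8], the farthest-point initialization and plain Lloyd iterations stand in for a production k-means, and all parameters are illustrative.

```python
import numpy as np

def njw_spectral_clustering(K, k, n_iter=50):
    """Sketch of Ng-Jordan-Weiss clustering from an affinity matrix K:
    embed into the bottom-k eigenvectors of I - D^{-1/2} K D^{-1/2}
    (equivalently the top-k eigenvectors of D^{-1/2} K D^{-1/2}),
    row-normalize, and run Lloyd's k-means in the embedding."""
    d = K.sum(axis=1)
    s = 1.0 / np.sqrt(d)
    M = s[:, None] * K * s[None, :]             # D^{-1/2} K D^{-1/2}
    _, vecs = np.linalg.eigh(M)
    U = vecs[:, -k:]                            # top-k eigenvectors of M
    U = U / np.linalg.norm(U, axis=1, keepdims=True)  # row-normalize
    # farthest-point initialization keeps this sketch deterministic
    centers = [U[0]]
    for _ in range(k - 1):
        dists = np.min([((U - c)**2).sum(axis=1) for c in centers], axis=0)
        centers.append(U[int(np.argmax(dists))])
    centers = np.array(centers)
    for _ in range(n_iter):                     # Lloyd iterations
        labels = np.argmin(((U[:, None, :] - centers[None, :, :])**2).sum(-1),
                           axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = U[labels == j].mean(axis=0)
    return labels
```

For well-separated clusters the row-normalized embedding places each cluster near a distinct point on the unit sphere, so the k-means step separates them cleanly; the paper's analysis concerns exactly when and why such embeddings carry the right clustering information.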
on spectral clustering, we refer the reader to Weiss [20], Dhillon, Guan and Kulis [2] and von Luxburg [18]. An empirical comparison of various methods is provided in Verma and Meila [17]. A discussion of some limitations of spectral clustering can be found in Nadler and Galun [7]. A theoretical analysis of statistical consistency of different types of spectral clustering is provided in von Luxburg, Belkin and Bousquet
…(Full text truncated)…