📝 Original Info
- Title: Optimal construction of k-nearest neighbor graphs for identifying noisy clusters
- ArXiv ID: 0912.3408
- Date: 2009-12-18
- Authors: Markus Maier, Matthias Hein, Ulrike von Luxburg
📝 Abstract
We study clustering algorithms based on neighborhood graphs on a random sample of data points. The question we ask is how such a graph should be constructed in order to obtain optimal clustering results. Which type of neighborhood graph should one choose, mutual k-nearest neighbor or symmetric k-nearest neighbor? What is the optimal parameter k? In our setting, clusters are defined as connected components of the t-level set of the underlying probability distribution. Clusters are said to be identified in the neighborhood graph if connected components in the graph correspond to the true underlying clusters. Using techniques from random geometric graph theory, we prove bounds on the probability that clusters are identified successfully, both in a noise-free and in a noisy setting. Those bounds lead to several conclusions. First, k has to be chosen surprisingly high (of the order n rather than of the order log n) to maximize the probability of cluster identification. Second, the major difference between the mutual and the symmetric k-nearest neighbor graph occurs when one attempts to detect the most significant cluster only.
📄 Full Content
Optimal construction of k-nearest neighbor graphs for identifying noisy clusters ⋆

Markus Maier a, Matthias Hein b, Ulrike von Luxburg a

a Max Planck Institute for Biological Cybernetics, Spemannstr. 38, 72076 Tübingen, Germany
mmaier@tuebingen.mpg.de
ulrike.luxburg@tuebingen.mpg.de

b Saarland University, P.O. Box 151150, 66041 Saarbrücken, Germany
hein@cs.uni-sb.de
Abstract

We study clustering algorithms based on neighborhood graphs on a random sample of data points. The question we ask is how such a graph should be constructed in order to obtain optimal clustering results. Which type of neighborhood graph should one choose, mutual k-nearest neighbor or symmetric k-nearest neighbor? What is the optimal parameter k? In our setting, clusters are defined as connected components of the t-level set of the underlying probability distribution. Clusters are said to be identified in the neighborhood graph if connected components in the graph correspond to the true underlying clusters. Using techniques from random geometric graph theory, we prove bounds on the probability that clusters are identified successfully, both in a noise-free and in a noisy setting. Those bounds lead to several conclusions. First, k has to be chosen surprisingly high (of the order n rather than of the order log n) to maximize the probability of cluster identification. Second, the major difference between the mutual and the symmetric k-nearest neighbor graph occurs when one attempts to detect the most significant cluster only.

Key words: clustering, neighborhood graph, random geometric graph, connected component
1 Introduction
⋆ Preprint of an article published in Theoretical Computer Science, Volume 410, Issue 19, Pages 1749–1764, Elsevier, April 2009.
arXiv:0912.3408v1 [stat.ML] 17 Dec 2009

Using graphs to model real world problems is one of the most widely used techniques in computer science. This approach usually involves two major steps: constructing an appropriate graph which represents the problem in a convenient way, and then constructing an algorithm which solves the problem on the given type of graph. While in some cases there exists an obvious natural graph structure to model the problem, in other cases one has much more choice when constructing the graph. In the latter cases it is an important question how the actual construction of the graph influences the overall result of the graph algorithm.
The kind of graphs we want to study in the current paper are neighborhood graphs. The vertices of those graphs represent certain “objects”, and vertices are connected if the corresponding objects are “close” or “similar”. The best-known families of neighborhood graphs are ε-neighborhood graphs and k-nearest neighbor graphs. Given a number of objects and their mutual distances to each other, in the first case each object will be connected to all other objects which have distance smaller than ε, whereas in the second case, each object will be connected to its k nearest neighbors (exact definitions are given below). Neighborhood graphs are used for modeling purposes in many areas of computer science: sensor networks and wireless ad-hoc networks, machine learning, data mining, percolation theory, clustering, computational geometry, modeling the spread of diseases, modeling connections in the brain, etc.
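As a concrete illustration (not taken from the paper), the three graph constructions just mentioned can be sketched in a few lines of Python. The function names and the brute-force distance computation are illustrative choices; in the symmetric ("or") k-NN graph two points are joined if either is among the other's k nearest neighbors, while the mutual ("and") graph requires both directions:

```python
import numpy as np

def knn_graphs(points, k):
    """Return boolean adjacency matrices of the symmetric and the mutual
    k-nearest neighbor graph of a (n, d) point array."""
    n = len(points)
    # pairwise Euclidean distances via broadcasting
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)      # a point is not its own neighbor
    # directed relation: row i marks the k nearest neighbors of point i
    nn = np.zeros((n, n), dtype=bool)
    for i in range(n):
        nn[i, np.argsort(d[i])[:k]] = True
    symmetric = nn | nn.T            # union of the directed edges ("or")
    mutual = nn & nn.T               # intersection of the edges ("and")
    return symmetric, mutual

def eps_graph(points, eps):
    """epsilon-neighborhood graph: connect points at distance smaller than eps."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    adj = d < eps
    np.fill_diagonal(adj, False)
    return adj
```

By construction the mutual graph is always a subgraph of the symmetric graph, which is why (as the abstract notes) the two can behave quite differently when isolating a single significant cluster.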
In all those applications one has some freedom in constructing the neighborhood graph, and a fundamental question arises: how exactly should we construct the neighborhood graph in order to obtain the best overall result in the end? Which type of neighborhood graph should we choose? How should we choose its connectivity parameter, for example the parameter k in the k-nearest neighbor graph? It is obvious that those choices will influence the results we obtain on the neighborhood graph, but often it is completely unclear how.
In this paper, we want to focus on the problem of clustering. We assume that we are given a finite set of data points and pairwise distances or similarities between them. It is very common to model the data points and their distances by a neighborhood graph. Then clustering can be reduced to standard graph algorithms. In the easiest case, one can simply define clusters as connected components of the graph. Alternatively, one can try to construct minimal graph cuts which separate the clusters from each other. An assumption often made in clustering is that the given data points are a finite sample from some larger underlying space. For example, when a company wants to cluster customers based on their shopping profiles, it is clear that the customers in the company’s data base are just a sample of a much larger set of possible customers. The customers in the data base are then considered to be a random sample.
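The easiest case described above — taking the clusters to be the connected components of the neighborhood graph — can be sketched as follows, assuming the graph is given as a boolean adjacency matrix. The breadth-first search here is a standard choice for illustration, not an algorithm from the paper:

```python
from collections import deque

def connected_components(adj):
    """Label the connected components of an undirected graph given as a
    boolean adjacency matrix; returns one component id per vertex."""
    n = len(adj)
    labels = [-1] * n                 # -1 means "not yet visited"
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        # breadth-first search from each unlabeled vertex
        queue = deque([start])
        labels[start] = current
        while queue:
            v = queue.popleft()
            for u in range(n):
                if adj[v][u] and labels[u] == -1:
                    labels[u] = current
                    queue.append(u)
        current += 1
    return labels
```

Vertices sharing a label form one cluster; the number of distinct labels is the number of clusters the graph exhibits.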
In this article, we want to make a first step towards such results in a simple setting we call “cluster identification” (see next section for details). Clusters will be represented by connected components of the level set of
…(Full text truncated)…
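For reference, the cluster definition stated in the abstract — connected components of the t-level set of the underlying probability distribution — can be written compactly as follows; the symbols p for the density and d for the ambient dimension are the usual conventions, not taken verbatim from the truncated text:

```latex
L(t) = \{\, x \in \mathbb{R}^d : p(x) \ge t \,\}, \qquad
\text{clusters at level } t \;=\; \text{connected components of } L(t).
```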
Reference
This content is AI-processed based on ArXiv data.