Optimal construction of k-nearest neighbor graphs for identifying noisy clusters

Reading time: 6 minutes

📝 Original Info

  • Title: Optimal construction of k-nearest neighbor graphs for identifying noisy clusters
  • ArXiv ID: 0912.3408
  • Date: 2009-12-18
  • Authors: Markus Maier, Matthias Hein, Ulrike von Luxburg

📝 Abstract

We study clustering algorithms based on neighborhood graphs on a random sample of data points. The question we ask is how such a graph should be constructed in order to obtain optimal clustering results. Which type of neighborhood graph should one choose, mutual k-nearest neighbor or symmetric k-nearest neighbor? What is the optimal parameter k? In our setting, clusters are defined as connected components of the t-level set of the underlying probability distribution. Clusters are said to be identified in the neighborhood graph if connected components in the graph correspond to the true underlying clusters. Using techniques from random geometric graph theory, we prove bounds on the probability that clusters are identified successfully, both in a noise-free and in a noisy setting. Those bounds lead to several conclusions. First, k has to be chosen surprisingly high (rather of the order n than of the order log n) to maximize the probability of cluster identification. Secondly, the major difference between the mutual and the symmetric k-nearest neighbor graph occurs when one attempts to detect the most significant cluster only.
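The two graph types compared in the abstract are easy to sketch concretely. In a minimal illustrative sketch (function names and the brute-force distance computation are mine, not the paper's), the mutual k-nearest neighbor graph keeps an edge {i, j} only when each point is among the other's k nearest neighbors, while the symmetric k-nearest neighbor graph keeps it when at least one direction holds:

```python
from itertools import combinations

def knn_sets(points, k):
    """Index set of the k nearest neighbors of each point (Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return [set(sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: dist(points[i], points[j]))[:k])
            for i in range(len(points))]

def knn_graph(points, k, mutual=True):
    """Edge set of the mutual (AND) or symmetric (OR) k-nearest neighbor graph."""
    nn = knn_sets(points, k)
    keep = (lambda a, b: a and b) if mutual else (lambda a, b: a or b)
    return {(i, j) for i, j in combinations(range(len(points)), 2)
            if keep(j in nn[i], i in nn[j])}
```

By construction the mutual graph's edge set is a subset of the symmetric graph's, so the mutual graph severs tenuous links between regions more readily; this asymmetry is where the abstract locates the main difference between the two graph types.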


📄 Full Content

Optimal construction of k-nearest neighbor graphs for identifying noisy clusters ⋆

Markus Maier a, Matthias Hein b, Ulrike von Luxburg a

a Max Planck Institute for Biological Cybernetics, Spemannstr. 38, 72076 Tübingen, Germany — mmaier@tuebingen.mpg.de, ulrike.luxburg@tuebingen.mpg.de
b Saarland University, P.O. Box 151150, 66041 Saarbrücken, Germany — hein@cs.uni-sb.de

⋆ Preprint of an article published in Theoretical Computer Science, Volume 410, Issue 19, Pages 1749–1764, Elsevier, April 2009. arXiv:0912.3408v1 [stat.ML] 17 Dec 2009

Abstract

We study clustering algorithms based on neighborhood graphs on a random sample of data points. The question we ask is how such a graph should be constructed in order to obtain optimal clustering results. Which type of neighborhood graph should one choose, mutual k-nearest neighbor or symmetric k-nearest neighbor? What is the optimal parameter k? In our setting, clusters are defined as connected components of the t-level set of the underlying probability distribution. Clusters are said to be identified in the neighborhood graph if connected components in the graph correspond to the true underlying clusters. Using techniques from random geometric graph theory, we prove bounds on the probability that clusters are identified successfully, both in a noise-free and in a noisy setting. Those bounds lead to several conclusions. First, k has to be chosen surprisingly high (rather of the order n than of the order log n) to maximize the probability of cluster identification. Secondly, the major difference between the mutual and the symmetric k-nearest neighbor graph occurs when one attempts to detect the most significant cluster only.

Key words: clustering, neighborhood graph, random geometric graph, connected component

1 Introduction

Using graphs to model real world problems is one of the most widely used techniques in computer science. This approach usually involves two major steps: constructing an appropriate graph which represents the problem in a convenient way, and then constructing an algorithm which solves the problem on the given type of graph. While in some cases there exists an obvious natural graph structure to model the problem, in other cases one has much more choice when constructing the graph. In the latter cases it is an important question how the actual construction of the graph influences the overall result of the graph algorithm.

The kind of graphs we want to study in the current paper are neighborhood graphs. The vertices of those graphs represent certain "objects", and vertices are connected if the corresponding objects are "close" or "similar". The best-known families of neighborhood graphs are ε-neighborhood graphs and k-nearest neighbor graphs. Given a number of objects and their mutual distances to each other, in the first case each object will be connected to all other objects which have distance smaller than ε, whereas in the second case, each object will be connected to its k nearest neighbors (exact definitions see below). Neighborhood graphs are used for modeling purposes in many areas of computer science: sensor networks and wireless ad-hoc networks, machine learning, data mining, percolation theory, clustering, computational geometry, modeling the spread of diseases, modeling connections in the brain, etc.

In all those applications one has some freedom in constructing the neighborhood graph, and a fundamental question arises: how exactly should we construct the neighborhood graph in order to obtain the best overall result in the end? Which type of neighborhood graph should we choose? How should we choose its connectivity parameter, for example the parameter k in the k-nearest neighbor graph? It is obvious that those choices will influence the results we obtain on the neighborhood graph, but often it is completely unclear how.

In this paper, we want to focus on the problem of clustering. We assume that we are given a finite set of data points and pairwise distances or similarities between them. It is very common to model the data points and their distances by a neighborhood graph. Then clustering can be reduced to standard graph algorithms. In the easiest case, one can simply define clusters as connected components of the graph. Alternatively, one can try to construct minimal graph cuts which separate the clusters from each other.

An assumption often made in clustering is that the given data points are a finite sample from some larger underlying space. For example, when a company wants to cluster customers based on their shopping profiles, it is clear that the customers in the company's data base are just a sample of a much larger set of possible customers. The customers in the data base are then considered to be a random sample.

In this article, we want to make a first step towards such results in a simple setting we call "cluster identification" (see next section for details). Clusters will be represented by connected components of the level set of
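The "easiest case" described above — reading off clusters as connected components of the neighborhood graph — can be sketched directly. The following is an illustrative sketch, not the paper's implementation: an ε-neighborhood graph builder paired with an iterative traversal that extracts the components.

```python
from itertools import combinations

def eps_graph(points, eps):
    """ε-neighborhood graph: connect every pair at Euclidean distance below eps."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return {(i, j) for i, j in combinations(range(len(points)), 2)
            if dist(points[i], points[j]) < eps}

def connected_components(n, edges):
    """Clusters of the sample = connected components of the graph."""
    adj = {i: set() for i in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    seen, comps = set(), []
    for start in range(n):
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:                      # iterative DFS over the edge lists
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps
```

With well-separated point clumps and ε chosen between the within-clump and between-clump distances, each clump comes out as its own component — the finite-sample analogue of the paper's notion that components of the graph should correspond to the true underlying clusters.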

…(Full text truncated)…

Reference

This content is AI-processed based on ArXiv data.
