Optimal construction of k-nearest neighbor graphs for identifying noisy clusters
We study clustering algorithms based on neighborhood graphs on a random sample of data points. The question we ask is how such a graph should be constructed in order to obtain optimal clustering results. Which type of neighborhood graph should one choose, mutual k-nearest neighbor or symmetric k-nearest neighbor? What is the optimal parameter k? In our setting, clusters are defined as connected components of the t-level set of the underlying probability distribution. Clusters are said to be identified in the neighborhood graph if connected components in the graph correspond to the true underlying clusters. Using techniques from random geometric graph theory, we prove bounds on the probability that clusters are identified successfully, both in a noise-free and in a noisy setting. Those bounds lead to several conclusions. First, k has to be chosen surprisingly high (rather of the order n than of the order log n) to maximize the probability of cluster identification. Secondly, the major difference between the mutual and the symmetric k-nearest neighbor graph occurs when one attempts to detect the most significant cluster only.
💡 Research Summary
The paper investigates how to construct k‑nearest‑neighbor (k‑NN) graphs for the purpose of clustering data points drawn from an underlying probability distribution. The authors adopt a rigorous definition of clusters as the connected components of a t‑level set of the density function, i.e., the region where the density exceeds a threshold t. The central question is which type of k‑NN graph—mutual or symmetric—and which choice of the parameter k will maximize the probability that the graph’s connected components coincide with the true clusters.
Two graph constructions are considered. In a mutual k‑NN graph, an edge between points i and j exists only if i belongs to j’s k‑nearest‑neighbors and j belongs to i’s k‑nearest‑neighbors. In a symmetric k‑NN graph, an edge is placed whenever either i is among j’s k‑nearest‑neighbors or j is among i’s k‑nearest‑neighbors. The mutual version imposes a stricter, bidirectional condition, whereas the symmetric version is more permissive.
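In code, the two constructions differ only in whether the directed neighbor relation is combined with a logical AND (mutual) or OR (symmetric). The following is a minimal self‑contained sketch (plain Python, brute‑force Euclidean search; the function and variable names are ours, not the paper’s):

```python
def knn_sets(points, k):
    """Index set of the k nearest neighbors of each point (Euclidean, brute force)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [
        set(sorted((j for j in range(len(points)) if j != i),
                   key=lambda j: dist2(points[i], points[j]))[:k])
        for i in range(len(points))
    ]

def knn_graph(points, k, mutual=True):
    """Undirected edge set of the mutual (AND) or symmetric (OR) k-NN graph."""
    neigh = knn_sets(points, k)
    edges = set()
    for i, nbrs in enumerate(neigh):
        for j in nbrs:
            # Mutual graph: keep the edge only if the relation holds both ways.
            if not mutual or i in neigh[j]:
                edges.add((min(i, j), max(i, j)))
    return edges

def num_components(n, edges):
    """Number of connected components of the graph, via union-find."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in edges:
        parent[find(a)] = find(b)
    return len({find(i) for i in range(n)})

# Two well-separated 1-D clusters: with k = 2, both graphs recover 2 components.
pts = [(0.0,), (0.1,), (0.2,), (10.0,), (10.1,), (10.2,)]
print(num_components(len(pts), knn_graph(pts, 2, mutual=True)))   # -> 2
print(num_components(len(pts), knn_graph(pts, 2, mutual=False)))  # -> 2
```

Note that the mutual graph’s edge set is always a subset of the symmetric graph’s, which is the source of the behavioral differences discussed below.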
The authors first analyze a noise‑free setting where all sampled points lie inside the t‑level set. They assume each cluster is convex, separated by a minimum distance δ, and that the density inside each cluster is bounded below by λ_min. Under these conditions, two competing requirements emerge: (1) internal connectivity—points within the same cluster must be linked with high probability, which pushes k upward; and (2) external separation—points from different clusters must not be linked, which pushes k downward. By applying results from random geometric graph theory—particularly the connectivity threshold for points uniformly distributed in a bounded region—the authors derive lower and upper bounds on k that guarantee both properties simultaneously. Remarkably, the feasible interval for k scales linearly with the sample size n (k = Θ(n)), rather than the logarithmic scaling (k = O(log n)) that is common in the literature on nearest‑neighbor graphs.
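The linear scaling can be made plausible with a back‑of‑the‑envelope count (an illustrative sketch with generic constants, not the paper’s exact bounds). With high probability, a ball of radius δ around a cluster point x contains roughly its expected number of sample points, so the k‑th nearest neighbor of x stays within distance δ, and no cross‑cluster edge forms, as long as

\[
k \;\le\; (1-\varepsilon)\, n\, \lambda_{\min}\, \eta_d\, \delta^d ,
\qquad 0 < \varepsilon < 1,
\]

where \(\eta_d\) denotes the volume of the unit ball in \(\mathbb{R}^d\). The separation constraint therefore caps k at a constant fraction of n rather than at log n; combined with the connectivity lower bound, this leaves a feasible window for k of width proportional to n.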
The analysis is then extended to a noisy scenario where a fraction of the points are outliers lying outside the t‑level set. Noise introduces two additional hazards: (i) cluster infiltration, where noisy points bridge two genuine clusters, and (ii) spurious noise clusters, where noisy points form their own small connected components. To mitigate these, the authors impose a “noise‑suppression” condition that again forces k to be large enough that a noisy point is unlikely to fall within the k‑neighborhood of a high‑density region, yet not so large that noisy points become mutually connected. The resulting feasible k remains of order Θ(n). The paper also discusses practical preprocessing (density‑based filtering) and post‑processing (removing tiny components) that can further improve robustness.
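The post‑processing step of removing tiny components is straightforward to implement. A minimal sketch (names are ours) that prunes every connected component below a size threshold, using union‑find over the graph’s edge list:

```python
from collections import defaultdict

def filter_small_components(n, edges, min_size):
    """Return the nodes lying in connected components with at least min_size nodes."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in edges:
        parent[find(a)] = find(b)
    # Group nodes by component root and keep only the large groups.
    comps = defaultdict(list)
    for i in range(n):
        comps[find(i)].append(i)
    return sorted(i for group in comps.values()
                  if len(group) >= min_size for i in group)

# A 4-node chain plus an isolated noise node (index 4) and a 2-node noise pair.
edges = [(0, 1), (1, 2), (2, 3), (5, 6)]
print(filter_small_components(7, edges, min_size=3))  # -> [0, 1, 2, 3]
```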
A key insight concerns the identification of the most significant cluster—the largest connected component of the t‑level set. The mutual k‑NN graph, because of its stricter edge criterion, tends to disconnect small clusters and peripheral points while preserving the core of a large cluster. Consequently, when the goal is to isolate the dominant cluster, the mutual graph outperforms the symmetric graph. Conversely, if the objective is to recover all clusters, especially those of modest size, the symmetric graph’s permissiveness can be advantageous.
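A tiny self‑contained example of this asymmetry (our own construction, not from the paper): an outlier always has k nearest neighbors of its own, so under the symmetric rule its outgoing pointer creates an edge into the bulk, while the mutual rule rejects that edge because the relation does not hold in reverse.

```python
def knn(points, k, i):
    """Indices of the k nearest neighbors of point i (Euclidean, brute force)."""
    others = sorted((j for j in range(len(points)) if j != i),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(points[i], points[j])))
    return set(others[:k])

# Three tightly packed points plus one distant outlier (index 3).
pts = [(0.0,), (0.1,), (0.2,), (5.0,)]
nbrs = [knn(pts, 1, i) for i in range(len(pts))]

# Symmetric rule: the outlier's own 1-NN pointer (to point 2) suffices for an edge.
sym_edge = 2 in nbrs[3] or 3 in nbrs[2]
# Mutual rule: point 2's 1-NN is point 1, so the edge is rejected.
mut_edge = 2 in nbrs[3] and 3 in nbrs[2]
print(sym_edge, mut_edge)  # -> True False
```

The mutual graph thus leaves the outlier disconnected, consistent with its tendency to preserve the core of a dominant cluster while shedding peripheral points.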
The theoretical contributions rest on a suite of random geometric graph results: (a) the classic connectivity threshold for points in ℝ^d, (b) concentration inequalities for the number of points falling in a ball of radius r, and (c) extensions to inhomogeneous Poisson processes that model varying density across the t‑level set. By translating these results into statements about k‑NN graphs, the authors obtain explicit probabilistic bounds on the event “graph components = true clusters.”
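As one concrete instance of ingredient (b): if a ball of radius r carries probability mass p, the number \(N_r\) of the n sample points falling into it is Binomial(n, p), and a standard multiplicative Chernoff bound gives

\[
\mathbb{P}\bigl(N_r \le (1-\varepsilon)\, n p\bigr)
\;\le\; \exp\!\Bigl(-\tfrac{\varepsilon^2}{2}\, n p\Bigr),
\qquad 0 < \varepsilon < 1 .
\]

When p is bounded below by a constant (as for a fixed ball inside the t‑level set), np grows linearly in n, so deviations decay exponentially; this is the mechanism that permits k of order n in the bounds above.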
Empirical validation is performed on synthetic data (2‑D and 3‑D mixtures of Gaussian‑like blobs) and on real‑world embeddings (image feature vectors, text embeddings). Experiments vary n, k, the number of clusters, cluster shapes (convex vs. non‑convex), and noise levels (0 %–30 %). The observed optimal k consistently lies near 0.1–0.3 · n, confirming the theoretical prediction. Moreover, the mutual k‑NN graph reliably isolates the largest cluster, while the symmetric graph yields a higher recall of smaller clusters.
In conclusion, the paper overturns the conventional wisdom that a small, logarithmic k is sufficient for graph‑based clustering. Instead, it demonstrates that k must scale linearly with the sample size to achieve high‑probability cluster identification, both in clean and noisy environments. The choice between mutual and symmetric constructions should be guided by the specific clustering goal: mutual graphs for dominant‑cluster detection, symmetric graphs for comprehensive recovery. These findings have practical implications for large‑scale data mining, where graph‑based methods are popular, and suggest new directions for adaptive, data‑driven selection of k in dynamic or high‑dimensional settings.