On the Difficulty of Nearest Neighbor Search
Fast approximate nearest neighbor (NN) search in large databases is becoming popular. Several powerful learning-based formulations have been proposed recently. However, not much attention has been paid to a more fundamental question: how difficult is (approximate) nearest neighbor search in a given data set? And which data properties affect the difficulty of nearest neighbor search and how? This paper introduces the first concrete measure called Relative Contrast that can be used to evaluate the influence of several crucial data characteristics such as dimensionality, sparsity, and database size simultaneously in arbitrary normed metric spaces. Moreover, we present a theoretical analysis to prove how the difficulty measure (relative contrast) determines/affects the complexity of Locality Sensitive Hashing, a popular approximate NN search method. Relative contrast also provides an explanation for a family of heuristic hashing algorithms with good practical performance based on PCA. Finally, we show that most of the previous works in measuring NN search meaningfulness/difficulty can be derived as special asymptotic cases for dense vectors of the proposed measure.
💡 Research Summary
The paper addresses a fundamental yet under‑explored question in similarity search: how intrinsically difficult is approximate nearest‑neighbor (ANN) retrieval on a given dataset, and which data characteristics drive that difficulty? To answer this, the authors introduce a novel, quantitative difficulty measure called Relative Contrast (RC). For any query point q, RC is defined as the ratio between the expected distance to a random database point and the distance to the true nearest neighbor (or the k‑th nearest neighbor). A large RC indicates that the nearest neighbor is significantly closer than a typical point, implying that the dataset is “easy” for NN search; an RC close to one signals distance concentration and a hard search problem.
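Under this definition, RC is straightforward to estimate from data. The sketch below is a minimal illustration, assuming Euclidean (ℓ2) distance and a toy Gaussian database; `relative_contrast` is an illustrative helper, not the paper's code:

```python
import numpy as np

def relative_contrast(X, q, k=1):
    """Estimate Relative Contrast for query q against database X:
    the mean distance to a random database point divided by the
    distance to the k-th nearest neighbor."""
    dists = np.linalg.norm(X - q, axis=1)  # l2 distances to all points
    d_mean = dists.mean()
    d_knn = np.sort(dists)[k - 1]          # k-th nearest neighbor distance
    return d_mean / d_knn

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))            # toy 16-dimensional database
q = rng.normal(size=16)
rc = relative_contrast(X, q)               # > 1: NN is closer than average
```

An RC well above 1 means the nearest neighbor clearly stands out from the bulk of the database; values near 1 signal the concentration regime described above.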
The authors derive an analytical expression for the expected RC in arbitrary ℓp normed spaces under the assumption that data vectors are i.i.d. with zero mean, variance σ², and absolute‑value mean μ per coordinate. By invoking the Central Limit Theorem, they show that in high dimensions the distribution of pairwise distances converges to a Gaussian whose mean scales with √d σ and whose variance scales with σ². Consequently, the expected RC can be approximated as a function of dimensionality d, sparsity (the fraction of zero coordinates), and database size N. The analysis reveals two opposing forces: increasing d alone drives RC toward 1 (the classic “curse of dimensionality”), while increasing sparsity effectively reduces the intrinsic dimensionality and pushes RC upward again. Thus RC simultaneously captures the effects of dimension, sparsity, and scale.
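The "curse of dimensionality" side of this trade-off is easy to check empirically. The sketch below (dense i.i.d. Gaussian data; the function name, sample sizes, and dimensions are illustrative choices, not the paper's setup) averages RC over a few random queries and shows it collapsing toward 1 as d grows:

```python
import numpy as np

def mean_rc(d, n=2000, n_queries=20, seed=0):
    """Average empirical Relative Contrast of i.i.d. Gaussian data in
    dimension d (a quick simulation, not the paper's closed form)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    rcs = []
    for _ in range(n_queries):
        q = rng.normal(size=d)
        dists = np.linalg.norm(X - q, axis=1)
        rcs.append(dists.mean() / dists.min())
    return float(np.mean(rcs))

rc_low = mean_rc(2)     # low dimension: the NN stands out sharply
rc_high = mean_rc(256)  # high dimension: distances concentrate, RC -> 1
```

Repeating the experiment with sparse vectors (most coordinates zeroed) would show the opposite pull: sparsity lowers the effective dimensionality and raises RC, matching the analysis above.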
Having a closed‑form RC, the paper proceeds to link it to the performance of Locality Sensitive Hashing (LSH), a widely used ANN method. Standard LSH theory bounds the number of hash tables by L = O(N^ρ) and the query time by O(N^ρ log N), where ρ < 1 is a data‑dependent exponent. The authors show that ρ is governed by RC: the exponent decreases as RC grows (roughly, ρ behaves like 1/RC). Consequently, the number of hash tables, the expected number of candidate points scanned, and the overall query time all shrink as RC increases. In practice, datasets with high RC need far fewer hash tables and achieve faster query times, confirming that RC predicts LSH complexity more directly than dimensionality alone.
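To get a feel for this scaling, the back‑of‑envelope below plugs the inverse relation between ρ and RC into the standard bound of roughly N^ρ hash tables. The exact relation ρ ≈ 1/RC and the numbers chosen are illustrative assumptions, not the paper's derivation:

```python
import math

N = 1_000_000  # database size

def tables_needed(rc):
    """Rough table-count estimate under the assumed relation rho ~ 1/RC
    and the standard LSH bound L ~ N**rho (illustrative only)."""
    rho = 1.0 / rc
    return math.ceil(N ** rho)

easy = tables_needed(4.0)    # high contrast: N**0.25, a few dozen tables
hard = tables_needed(1.25)   # near-concentrated: N**0.8, tens of thousands
```

Even this crude estimate shows why RC, not raw dimensionality, is the quantity that dictates whether LSH is practical on a given dataset.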
The paper also provides a theoretical justification for the empirical success of PCA‑based hashing schemes. By projecting data onto its leading principal components, PCA increases the variance of distances along the projected axes while leaving the average distance roughly unchanged. This operation inflates RC, making the nearest neighbor stand out more sharply. Consequently, hash functions aligned with principal components produce higher collision probabilities for true neighbors and lower for distant points, yielding better recall‑precision trade‑offs. Experiments on image (CIFAR‑10, GIST) and text (20 Newsgroups) benchmarks demonstrate that RC measured on each dataset predicts the observed speed‑up of PCA‑LSH over vanilla LSH, often by a factor of two to three.
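This effect can be illustrated on synthetic anisotropic data: projecting onto the leading principal components strips away near‑noise coordinates that inflate all distances roughly equally, so RC rises. The sketch below is a toy demonstration under an assumed data model (a few high‑variance signal directions plus many low‑variance ones); the `rc` helper and all parameters are illustrative:

```python
import numpy as np

def rc(X, q, k=1):
    """Relative Contrast: mean distance over k-th NN distance."""
    d = np.linalg.norm(X - q, axis=1)
    return d.mean() / np.sort(d)[k - 1]

rng = np.random.default_rng(1)
# 4 high-variance "signal" directions, 60 near-noise coordinates.
scales = np.r_[np.full(4, 5.0), np.full(60, 0.5)]
X = rng.normal(size=(2000, 64)) * scales
q = rng.normal(size=64) * scales

# Top principal directions via SVD of the centered data matrix.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:4].T                  # projection onto top-4 components

rc_orig = rc(X, q)            # RC in the full 64-d space
rc_pca = rc(X @ P, q @ P)     # RC after PCA projection: larger
```

In this toy setting the projection raises RC, mirroring the paper's explanation of why hash functions aligned with principal components separate true neighbors from distant points more cleanly.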
Finally, the authors show that many previously proposed difficulty metrics—intrinsic dimensionality, expansion constant, alternative contrast measures—are special or asymptotic cases of RC. By adjusting the assumptions on sparsity, norm, or distribution, those older measures can be derived from the general RC formula. This unification positions RC as the most comprehensive framework for quantifying NN search hardness.
In summary, the contributions are threefold: (1) a principled, analytically tractable difficulty measure that jointly accounts for dimensionality, sparsity, and dataset size; (2) a rigorous connection between this measure and the theoretical complexity of LSH, providing practitioners with a predictive tool for algorithm configuration; and (3) an explanation of why PCA‑based hashing works so well in practice, grounded in RC. The work bridges a gap between data‑centric properties and algorithmic performance, offering a solid foundation for future research on adaptive ANN methods and dataset‑aware indexing strategies.