Fast-Convergent Proximity Graphs for Approximate Nearest Neighbor Search
Approximate nearest neighbor (ANN) search in high-dimensional metric spaces is a fundamental problem with many applications. Over the past decade, proximity graph (PG)-based indexes have demonstrated superior empirical performance over alternatives. However, these methods often lack theoretical guarantees regarding the quality of query results, especially in the worst-case scenarios. In this paper, we introduce the α-convergent graph (α-CG), a new PG structure that employs a carefully designed edge pruning rule. This rule eliminates candidate neighbors for each data point p by applying the shifted-scaled triangle inequalities among p, its existing out-neighbors, and new candidates. If the distance between the query point q and its exact nearest neighbor v* is at most τ for some constant τ > 0, our α-CG finds the exact nearest neighbor in poly-logarithmic time, assuming bounded intrinsic dimensionality for the dataset; otherwise, it can find an ANN in the same time. To enhance scalability, we develop the α-convergent neighborhood graph (α-CNG), a practical variant that applies the pruning rule locally within each point’s neighbors. We also introduce optimizations to reduce the index construction time. Experimental results show that our α-CNG outperforms existing PGs on real-world datasets. For most datasets, α-CNG can reduce the number of distance computations and search steps by over 15% and 45%, respectively, when compared with the best-performing baseline.
💡 Research Summary
The paper addresses the long‑standing gap between the impressive empirical performance of proximity‑graph (PG) based indexes for high‑dimensional approximate nearest neighbor (ANN) search and the lack of rigorous worst‑case guarantees. Existing PGs such as HNSW, NSG, Vamana, and τ‑MG either provide no guarantee of finding the exact nearest neighbor (NN) when the query‑to‑NN distance is small, or guarantee only a multiplicative approximation factor that can be far from optimal. To close this gap, the authors introduce two new structures: the α‑convergent graph (α‑CG) and its practical variant, the α‑convergent neighborhood graph (α‑CNG).
The key technical innovation is a novel edge‑pruning rule that incorporates both a scaling factor α > 1 and a distance threshold τ. For a candidate neighbor u of a point p, the pruning radius is set to
r = (1/α)·(δ(p,u) − (α + 1)·τ).
Compared with earlier rules (r = δ(p,u) in MRNG/NSG, r = δ(p,u)/α in Vamana, and r = δ(p,u) − 3τ in τ‑MG), this radius yields a strictly smaller intersection of the two balls B(p, δ(p,u)) and B(u, r). Consequently, many superfluous edges are eliminated while the “shortcutable” property required for greedy routing is preserved.
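Under one plausible reading of this rule (candidates are scanned in increasing distance from p, and each kept neighbor vetoes later candidates that fall inside its shrunken ball), the pruning step can be sketched as follows; `prune_candidates` and its argument names are illustrative, not taken from the paper:

```python
import math

def prune_candidates(p, candidates, points, alpha, tau):
    # Sketch of the alpha-CG pruning rule described above. Candidates are
    # scanned in increasing distance from p; a candidate v is pruned when
    # some already-kept neighbor u lies within the shrunken radius
    # r = (delta(p, v) - (alpha + 1) * tau) / alpha of v.
    d = lambda a, b: math.dist(points[a], points[b])
    kept = []
    for v in sorted(candidates, key=lambda v: d(p, v)):
        r = (d(p, v) - (alpha + 1) * tau) / alpha
        if all(d(u, v) > r for u in kept):  # no kept neighbor covers v
            kept.append(v)
    return kept
```

With τ = 0 the rule reduces to the Vamana-style scaled check δ(u,v) ≤ δ(p,v)/α; a larger τ shrinks the radius further, so fewer candidates are pruned and more edges survive.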
Theoretical analysis shows that when the query point q satisfies δ(q, v*) ≤ τ (v* being the true NN), a greedy walk on α‑CG reduces the distance to q by at least a factor of α at each hop. The number of hops is bounded by O(log_α Δ), where Δ is the aspect ratio of the dataset, and the total time is O((α·τ)^d · log Δ · log_α Δ). Since the doubling dimension d is constant for many real‑world datasets, this yields poly‑logarithmic query time, the first such guarantee for PGs under a bounded‑τ assumption. If δ(q, v*) > τ, the same walk returns a ((α + 1)/(α − 1) + ε)‑approximate NN for any ε > 0, matching the best known guarantee of Vamana. Space usage, out‑degree, and construction cost of α‑CG are comparable to those of Vamana up to an O(τ^d) factor.
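The walk analyzed here is standard greedy routing on a directed graph; a minimal generic sketch (the paper's actual search procedure may differ, e.g. by using beam search) looks like this:

```python
import math

def greedy_walk(graph, points, start, query):
    # Minimal greedy routing: repeatedly hop to the out-neighbor closest to
    # the query and stop at a local minimum. The analysis above says that on
    # alpha-CG, when delta(q, v*) <= tau, each hop shrinks the distance to q
    # by a factor of alpha, so the walk terminates within O(log_alpha Delta) hops.
    d = lambda v: math.dist(points[v], query)
    cur = start
    while True:
        nxt = min(graph.get(cur, []), key=d, default=cur)
        if d(nxt) >= d(cur):
            return cur  # no out-neighbor is closer: local minimum reached
        cur = nxt
```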
Because building α‑CG with the full candidate set V = P \ {p} is prohibitive (quadratic in n), the authors propose α‑CNG, which builds a local candidate set for each point using an approximate K‑NN graph or another base PG. An adaptive pruning strategy adjusts α per node: starting from a small α, the algorithm prunes until the node’s out‑degree reaches a predefined limit, then incrementally raises α to admit longer‑range shortcuts if needed. Two engineering tricks—distance‑reusing (caching intermediate distances) and lazy pruning (deferring expensive checks)—reduce the overall construction time to near‑linear.
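A hypothetical sketch of the adaptive per-node strategy described above (function names, parameter defaults, and the exact control flow are assumptions, not the paper's algorithm): since raising α shrinks the pruning radius, re-pruning with a larger α admits more, longer-range edges until the degree limit is met.

```python
import math

def adaptive_prune(p, candidates, points, degree_limit,
                   alpha0=1.1, alpha_step=0.2, alpha_max=3.0, tau=0.0):
    # Hypothetical adaptive pruning: start with a small alpha; if the pruned
    # out-degree stays below the limit, raise alpha (shrinking the pruning
    # radius, hence keeping more long-range shortcuts) and re-prune, until the
    # degree limit is reached or alpha_max is exhausted.
    d = lambda a, b: math.dist(points[a], points[b])
    alpha = alpha0
    while True:
        kept = []
        for v in sorted(candidates, key=lambda v: d(p, v)):
            r = (d(p, v) - (alpha + 1) * tau) / alpha
            if all(d(u, v) > r for u in kept):
                kept.append(v)
        if len(kept) >= degree_limit or alpha >= alpha_max:
            return kept[:degree_limit]
        alpha += alpha_step
```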
Empirical evaluation on eight diverse real‑world datasets (image descriptors, text embeddings, recommendation logs, etc.) compares α‑CNG against four state‑of‑the‑art PGs: HNSW, NSG, Vamana, and τ‑MG. At equal recall levels (e.g., 0.9, 0.95), α‑CNG achieves at least a 15% reduction in distance computations and a 45% reduction in search steps on average; in the best cases it performs 2.28× fewer distance evaluations and takes 2.88× fewer steps. These gains translate into substantial I/O savings for disk‑based or distributed deployments, where each search step may involve a costly disk read. Moreover, grafting the α‑CG pruning rule onto HNSW and Vamana improves their performance on most datasets, demonstrating the rule's general applicability.
The paper concludes with a discussion of limitations and future work. The current method requires a user‑specified τ, and automatic estimation of τ or dynamic adaptation remains open. The analysis assumes Euclidean distance and bounded doubling dimension; extending the guarantees to non‑Euclidean metrics (cosine, Jaccard) or very high intrinsic dimensions is a promising direction. Finally, supporting dynamic updates (insertions/deletions) and scaling the construction to massive distributed settings are identified as next steps.
In summary, this work delivers (1) a theoretically sound pruning mechanism based on shifted‑scaled triangle inequalities, (2) the first poly‑logarithmic‑time exact NN guarantee for PGs under a realistic bounded‑τ condition, (3) a practical, adaptively pruned graph construction that retains these guarantees while being competitive in space and build time, and (4) extensive empirical validation showing clear superiority over existing PG‑based ANN indexes. These contributions bridge the gap between theory and practice in high‑dimensional similarity search.