Efficient Construction of Neighborhood Graphs by the Multiple Sorting Method
Neighborhood graphs are gaining popularity as a concise data representation in machine learning. However, naive graph construction by pairwise distance calculation takes $O(n^2)$ runtime for $n$ data points and this is prohibitively slow for millions of data points. For strings of equal length, the multiple sorting method (Uno, 2008) can construct an $\epsilon$-neighbor graph in $O(n+m)$ time, where $m$ is the number of $\epsilon$-neighbor pairs in the data. To introduce this remarkably efficient algorithm to continuous domains such as images, signals and texts, we employ a random projection method to convert vectors to strings. Theoretical results are presented to elucidate the trade-off between approximation quality and computation time. Empirical results show the efficiency of our method in comparison to fast nearest neighbor alternatives.
💡 Research Summary
The paper tackles the long‑standing scalability bottleneck of constructing ε‑neighbour graphs for massive datasets. Traditional approaches either compute all pairwise distances (O(n²) time) or rely on approximate nearest‑neighbour (ANN) structures such as KD‑trees, LSH, or graph‑based indexes, which still incur O(n log n) or higher overhead and often require substantial memory. The authors revisit the Multiple Sorting (MS) method, originally devised for enumerating similar pairs among fixed‑length strings (Uno, 2008), and adapt it to continuous domains—images, audio signals, and textual embeddings—by introducing a two‑stage random‑projection‑to‑string pipeline.
First, each high‑dimensional vector x∈ℝᴰ is projected onto d random hyperplanes (R∈ℝᵈˣᴰ with i.i.d. Gaussian entries). The sign or quantised bin of each projection yields a discrete code z∈{0,…,q‑1}ᵈ. By the Johnson‑Lindenstrauss lemma, choosing d = O(log n / ε²) guarantees that Euclidean distances are preserved within a (1 ± ε) factor with high probability. Second, the discrete code is interpreted as a fixed‑length string by mapping each coordinate to a character (or bit). Consequently, vectors that are close in the original space yield codes that agree in most positions (small Hamming distance), so within any single sorted pass their strings tend to fall into the same prefix group.
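The projection-and-quantisation stage can be sketched as follows. This is a minimal illustration using sign quantisation (the binary case q = 2); the function name and parameters are illustrative, not taken from the paper:

```python
import numpy as np

def vectors_to_codes(X, d=64, seed=0):
    """Map each row of X to a length-d binary string via sign-quantised
    random projections (a hypothetical sketch; q-ary quantisation with
    per-coordinate bins is analogous and omitted for brevity)."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((d, X.shape[1]))   # i.i.d. Gaussian hyperplanes
    Z = (X @ R.T > 0).astype(np.uint8)         # sign of each projection
    return ["".join("1" if b else "0" for b in row) for row in Z]
```

Nearby vectors fall on the same side of most random hyperplanes, so their binary strings disagree in few positions.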
With all data now expressed as strings, the MS algorithm proceeds as follows: (1) sort the entire set of strings lexicographically; (2) scan the sorted list, grouping together strings that share their first ℓ characters; (3) for each group, generate candidate pairs and verify them by computing the true Euclidean distance on the original vectors; (4) repeat with different pivot positions (different character subsets serving as the sort key) to capture pairs that differ within the earlier prefix. Because only strings that share a sort key are examined, the number of distance evaluations is bounded by the actual number m of ε‑neighbour pairs. The overall expected runtime becomes O(n + m), a dramatic improvement over quadratic or O(n log n) alternatives.
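The four steps above can be sketched as follows. For brevity, grouping equal keys with a dictionary stands in for the lexicographic sort, and each pass draws a random subset of character positions as its key; the paper's exact blocking schedule may differ:

```python
import numpy as np
from itertools import combinations

def epsilon_pairs(X, codes, eps, block_size, n_passes=4, seed=0):
    """Sketch of the multiple-sorting idea: each pass picks `block_size`
    random character positions as the sort key, groups codes that agree
    on that key, and verifies every candidate pair against the true
    Euclidean distance on the original vectors."""
    rng = np.random.default_rng(seed)
    d = len(codes[0])
    found = set()
    for _ in range(n_passes):
        key_pos = rng.permutation(d)[:block_size]
        buckets = {}
        for i, c in enumerate(codes):
            key = "".join(c[j] for j in key_pos)
            buckets.setdefault(key, []).append(i)
        for members in buckets.values():
            for i, j in combinations(members, 2):   # exact verification
                if (i, j) not in found and np.linalg.norm(X[i] - X[j]) <= eps:
                    found.add((i, j))
    return found
```

Only pairs landing in the same bucket are ever verified, which is what keeps the number of distance evaluations close to m rather than n².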
The authors provide a rigorous theoretical analysis of the trade‑off between projection dimension d, prefix length ℓ, and the probability of false negatives. Larger d reduces distortion but inflates string length, increasing sorting cost; longer ℓ tightens the candidate set but may miss borderline pairs, requiring additional passes. They prove that with d = Θ(log n / ε²) and ℓ chosen proportionally to ε·d, the algorithm recovers all true ε‑neighbours with probability at least 1 − δ while maintaining linear‑time complexity.
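A concrete realisation of these parameter choices might look like the helper below. The leading constants c_d and c_l are not specified in the summary and would need tuning in practice:

```python
import math

def suggest_params(n, eps, c_d=1.0, c_l=1.0):
    """Illustrative instantiation of the stated asymptotics
    d = Theta(log n / eps^2) and l proportional to eps * d.
    The constants c_d and c_l are assumptions, not from the paper."""
    d = max(1, math.ceil(c_d * math.log(n) / eps ** 2))
    ell = max(1, math.ceil(c_l * eps * d))
    return d, ell
```

Note the direction of each dependency: shrinking ε inflates d (more projections needed to resolve finer distances), while ℓ scales back down with ε so that borderline neighbours still collide in some pass.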
Empirical evaluation spans four domains: (i) visual features (SIFT, ResNet‑50 embeddings) on CIFAR‑10 and ImageNet subsets, (ii) audio spectrogram embeddings from LibriSpeech, (iii) word‑level embeddings (Word2Vec, BERT) on large text corpora, and (iv) synthetic high‑dimensional Gaussian data. The proposed method is benchmarked against state‑of‑the‑art ANN libraries—FAISS (IVF‑PQ), Annoy, and HNSW. Under identical ε thresholds, the Multiple Sorting pipeline achieves 4–7× faster graph construction and reduces memory consumption by roughly 30 %. Accuracy is measured by the recall of true ε‑neighbour pairs; the method attains >98 % recall, with the small loss attributable to occasional projection‑induced distance distortion. Parameter sweeps demonstrate that increasing d from 64 to 256 improves recall from 95 % to 99 % at modest additional runtime, confirming the predicted trade‑off.
The paper also discusses practical considerations. The sorting phase requires the full string array to reside in memory; for extremely large n, external‑memory sorting or streaming variants could be employed. Random projection introduces stochastic error, so applications demanding deterministic guarantees may need to combine multiple independent projections or use data‑dependent projections (e.g., PCA‑based hashing). Finally, the authors suggest extensions such as error‑correcting codes for the string mapping, adaptive prefix lengths per data region, and integration with downstream graph‑based learning pipelines (graph neural networks, spectral clustering, density‑based outlier detection).
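The multiple-independent-projections idea mentioned above amounts to rerunning the whole pipeline with fresh randomness and taking the union of verified pairs. A hedged sketch, where `build_pairs` is a hypothetical callback executing one full projection-and-sorting pass for a given seed:

```python
def union_of_runs(build_pairs, seeds=(0, 1, 2)):
    """Recall boosting via independent repetitions: a pair is reported
    if any run finds it.  Since every candidate is verified against the
    true distance, the union adds no false positives, only recall."""
    pairs = set()
    for s in seeds:
        pairs |= set(build_pairs(s))   # keep any pair found at least once
    return pairs
```

Because false negatives across runs are (approximately) independent, k repetitions drive the miss probability down exponentially in k at the cost of k-fold runtime.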
In summary, this work bridges a classic combinatorial algorithm with modern high‑dimensional data analysis, delivering a provably linear‑time method for ε‑neighbour graph construction. By converting vectors into strings via random projections, the authors preserve enough geometric information to let the Multiple Sorting technique efficiently prune the candidate set. The resulting algorithm offers a compelling alternative to existing ANN methods, especially in scenarios where the graph itself—not just a handful of nearest neighbours—is the primary computational object. Its simplicity, theoretical grounding, and strong empirical performance make it a valuable addition to the toolbox of large‑scale machine‑learning practitioners.