FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search
We present FLASH (Fast LSH Algorithm for Similarity search accelerated with HPC), a similarity search system for ultra-high dimensional datasets on a single machine that does not require similarity computations and is tailored for high-performance computing platforms. By leveraging an LSH-style randomized indexing procedure and combining it with several principled techniques, such as reservoir sampling, recent advances in one-pass minwise hashing, and count-based estimations, we reduce the computational and parallelization costs of similarity search while retaining sound theoretical guarantees. We evaluate FLASH on several real, high-dimensional datasets from different domains, including text, malicious URLs, click-through prediction, and social networks. Our experiments shed new light on the difficulties associated with datasets having several million dimensions. Current state-of-the-art implementations either fail at the presented scale or are orders of magnitude slower than FLASH. FLASH is capable of computing an approximate k-NN graph, from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than 10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam dataset using brute force (n²·D) would require a sustained rate of at least 20 teraflops. We provide CPU and GPU implementations of FLASH for replicability of our results.
💡 Research Summary
The paper introduces FLASH (Fast LSH Algorithm for Similarity search accelerated with HPC), a system designed to perform similarity search on ultra‑high‑dimensional data (millions of dimensions) using a single machine equipped with both CPU and GPU resources. Traditional brute‑force similarity computation scales as O(n²·D) and quickly becomes infeasible for large n and D, while existing LSH‑based methods still require substantial memory and computation to maintain hash tables and verify candidates. FLASH circumvents these bottlenecks by combining several algorithmic innovations and engineering optimizations.
First, FLASH adopts a classic LSH‑style random indexing scheme: each data point is hashed by L independent hash functions into low‑dimensional signatures. To keep memory bounded, each hash bucket stores only a fixed‑size reservoir of candidates (size B) using reservoir sampling. This ensures that the total memory footprint is O(L·B) regardless of the number of points that map to a bucket, while still preserving a representative sample of potential neighbors.
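The bucket-level reservoir idea above can be sketched in a few lines. This is an illustrative toy, not FLASH's actual code: the `Bucket` class, its `insert` method, and the capacity parameter are hypothetical names, but the replacement rule is the standard reservoir-sampling step that keeps each of the points seen so far with equal probability while memory stays fixed at B slots.

```python
import random

class Bucket:
    """Fixed-size reservoir of point ids that map to one hash bucket.

    Minimal sketch of the reservoir-sampling strategy described above;
    class and method names are illustrative, not from the paper.
    """
    def __init__(self, capacity):
        self.capacity = capacity   # B: fixed reservoir size
        self.items = []            # sampled point ids
        self.count = 0             # total points ever hashed to this bucket

    def insert(self, point_id):
        self.count += 1
        if len(self.items) < self.capacity:
            # Reservoir not yet full: keep every arrival.
            self.items.append(point_id)
        else:
            # Keep the new point with probability B / count by
            # overwriting a uniformly random slot; this preserves a
            # uniform sample over everything seen so far.
            j = random.randrange(self.count)
            if j < self.capacity:
                self.items[j] = point_id
```

Because the reservoir never grows past B entries, a bucket that receives millions of points still costs the same memory as one that receives a handful, which is what bounds the total index at O(L·B).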
Second, the system leverages recent advances in one‑pass minwise hashing. Traditional minwise hashing requires multiple passes over the data to compute signatures; the one‑pass variant produces the same sketch in a single streaming pass, making it practical even when D reaches several million. The resulting sketches are then used in a count‑based estimation step: for any pair of points, the number of hash tables where they co‑occur in the same bucket is counted, providing an unbiased estimator of Jaccard or cosine similarity without ever computing an inner product.
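The count-based estimation step can be illustrated with ordinary minwise hashing (for clarity; the paper's densified one-pass variant produces equivalent sketches in a single streaming pass). The function names and the affine hash construction below are assumptions for the sketch, not the paper's implementation: the point is that the fraction of signature positions where two points collide is an unbiased estimate of their Jaccard similarity, with no inner product ever computed.

```python
import random

def minhash_signature(item_set, num_hashes, prime=2_147_483_647, seed=0):
    """Minhash sketch via random affine hashes h(x) = (a*x + b) mod prime.

    Illustrative classic minhash; FLASH uses densified one-pass minwise
    hashing, which yields comparable sketches in a single data pass.
    """
    rng = random.Random(seed)
    params = [(rng.randrange(1, prime), rng.randrange(prime))
              for _ in range(num_hashes)]
    return [min((a * x + b) % prime for x in item_set) for a, b in params]

def estimate_jaccard(sig1, sig2):
    """Count-based estimate: fraction of positions where the sketches agree."""
    matches = sum(h1 == h2 for h1, h2 in zip(sig1, sig2))
    return matches / len(sig1)
```

With a few hundred hash positions the estimate concentrates tightly around the true Jaccard similarity, which is why counting bucket co-occurrences across tables suffices to rank candidates.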
Third, FLASH is implemented for both CPU and GPU. On the GPU, hash table construction, candidate sampling, and count aggregation are expressed as large matrix‑oriented kernels, allowing thousands of CUDA threads to work concurrently. The CPU version exploits SIMD vectorization, cache‑friendly data layouts, and multi‑core parallelism to avoid bandwidth bottlenecks. Both implementations are pipelined: signature generation → reservoir sampling → count estimation → final k‑nearest‑neighbor selection, with each stage overlapping in time to keep the hardware fully utilized.
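Ignoring the hardware overlap, the query-side logic of the pipeline stages can be tied together in a short sequential sketch. Everything here is illustrative (the `knn_query` name, the dict-of-lists table layout): probe each of the L tables, pool the reservoir samples, rank candidates by how many tables they collide in, and keep the top k.

```python
from collections import Counter

def knn_query(query_bucket_ids, tables, k):
    """Sequential sketch of the query pipeline (illustrative, not FLASH's code).

    tables           -- list of L dicts mapping bucket id -> reservoir
                        (list of point ids), one dict per hash table
    query_bucket_ids -- the query's bucket id in each of the L tables
    """
    counts = Counter()
    for table, bucket_id in zip(tables, query_bucket_ids):
        # Pool the reservoir sample from this table's matching bucket.
        counts.update(table.get(bucket_id, []))
    # Rank candidates by collision count (the similarity proxy) and keep top-k.
    return [pid for pid, _ in counts.most_common(k)]
```

In FLASH these stages run as overlapping batched kernels rather than one query at a time, but the data flow (probe, pool, count, select) is the same.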
The authors provide a theoretical analysis that combines the probabilistic guarantees of LSH with the sampling error bounds of reservoir sampling. They prove that, for appropriate choices of L and B, the candidate set contains an ε‑approximate k‑NN for each query with probability at least 1 – δ, where δ can be made arbitrarily small.
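The amplification bound underlying this style of guarantee can be stated in its textbook form (a generic sketch; the paper's exact constants and its correction for reservoir-sampling error are not reproduced here):

```latex
% If a single hash collides for points x, y with probability
% p = \Pr[h(x) = h(y)], then with K concatenated hashes per table
% and L independent tables, the probability that x and y collide
% in at least one table is
\[
  \Pr[\text{collision in at least one table}] \;=\; 1 - \bigl(1 - p^{K}\bigr)^{L},
\]
% which exceeds 1 - \delta for similar pairs (large p) once
% L = O\!\left(\log(1/\delta)\, / \, p^{K}\right).
```

Reservoir sampling adds a second error source on top of this, since a true neighbor may be evicted from a full bucket; the paper's analysis bounds both terms jointly through the choice of L and B.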
Empirical evaluation spans several real‑world high‑dimensional datasets: text corpora, malicious URL collections, click‑through prediction logs, and social‑network embeddings. The most striking result is on the webspam dataset, which contains 1.3 billion non‑zero entries and over a million dimensions. FLASH builds a full approximate k‑NN graph (k = 10) from scratch in under 10 seconds on a single machine equipped with a modern GPU, whereas state‑of‑the‑art CPU‑only LSH implementations require tens to hundreds of seconds, and a naïve brute‑force approach would require a sustained rate of roughly 20 teraflops. Accuracy remains competitive: average precision@10 stays above 0.90, within 1–2 % of the best existing methods.
Memory usage is also dramatically reduced; the entire index occupies less than 0.1 % of the raw data size thanks to the reservoir sampling strategy. GPU acceleration yields 5×–8× speedups in the most compute‑intensive phases, while a hybrid CPU‑GPU pipeline further balances workload and minimizes idle time.
In conclusion, FLASH demonstrates that ultra‑high‑dimensional similarity search can be performed efficiently on a single heterogeneous node by integrating LSH‑based random indexing, one‑pass minwise hashing, reservoir sampling, and high‑performance parallel execution. The authors release both CPU and GPU source code along with the datasets used, ensuring reproducibility and providing a solid foundation for future extensions such as multi‑node distributed scaling or application to non‑vector data (e.g., graph embeddings).