DET-LSH: A Locality-Sensitive Hashing Scheme with Dynamic Encoding Tree for Approximate Nearest Neighbor Search

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Locality-sensitive hashing (LSH) is a well-known solution for approximate nearest neighbor (ANN) search in high-dimensional spaces due to its robust theoretical guarantee on query accuracy. Traditional LSH-based methods mainly focus on improving the efficiency and accuracy of the query phase by designing different query strategies, but pay little attention to improving the efficiency of the indexing phase. They typically fine-tune existing data-oriented partitioning trees to index data points and support their query strategies. However, their strategy of directly partitioning the multi-dimensional space is time-consuming, and performance degrades as the space dimensionality increases. In this paper, we design an encoding-based tree called Dynamic Encoding Tree (DE-Tree) to improve indexing efficiency and support efficient range queries based on Euclidean distance. Based on DE-Tree, we propose a novel LSH scheme called DET-LSH. DET-LSH adopts a novel query strategy, which performs range queries over multiple independent DE-Tree indexes to reduce the probability of missing exact NN points, thereby improving query accuracy. Our theoretical studies show that DET-LSH enjoys probabilistic guarantees on query accuracy. Extensive experiments on real-world datasets demonstrate the superiority of DET-LSH over state-of-the-art LSH-based methods in both efficiency and accuracy. While achieving better query accuracy than its competitors, DET-LSH achieves up to 6x speedup in indexing time and 2x speedup in query time over state-of-the-art LSH-based methods. This paper was published in PVLDB 2024.


💡 Research Summary

Locality‑Sensitive Hashing (LSH) remains a cornerstone for approximate nearest‑neighbor (ANN) search in high‑dimensional Euclidean spaces because of its strong probabilistic guarantees. However, most recent LSH‑based systems focus almost exclusively on the query phase, leaving the indexing stage largely untouched. Traditional approaches typically adapt existing data‑oriented partitioning trees (R*‑Tree, PM‑Tree, etc.) to index the projected points. This direct multi‑dimensional partitioning becomes increasingly costly as the dimensionality of the projected space grows, leading to long indexing times and degraded performance in very high‑dimensional settings.

The paper “DET‑LSH: A Locality‑Sensitive Hashing Scheme with Dynamic Encoding Tree for Approximate Nearest Neighbor Search” introduces a fundamentally different indexing paradigm. Instead of partitioning the K‑dimensional projected space, the authors propose the Dynamic Encoding Tree (DE‑Tree), an encoding‑based structure that treats each dimension independently. Each projected coordinate is quantized into a symbolic representation (similar to iSAX) whose alphabet size is dynamically adapted to the data distribution. By encoding dimensions separately, DE‑Tree avoids the expensive multi‑dimensional splits, reduces the construction complexity from O(n·K·log n) to roughly O(n·log n), and enables fast computation of lower‑ and upper‑bound distances between a query and any tree node using only the per‑dimension interval boundaries.
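The per-dimension encoding and the interval-based lower bound described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `build_breakpoints`, `encode`, and `lower_bound_dist` are hypothetical names, and empirical quantiles stand in for the paper's dynamic, data-adaptive breakpoint selection.

```python
import numpy as np

def build_breakpoints(points, num_symbols=8):
    """Per-dimension breakpoints from empirical quantiles (stand-in for the
    paper's dynamic, data-adaptive alphabet)."""
    qs = np.linspace(0.0, 1.0, num_symbols + 1)[1:-1]
    # shape (K, num_symbols - 1): row j holds dimension j's breakpoints
    return np.quantile(points, qs, axis=0).T

def encode(point, breakpoints):
    """Map each coordinate to the index of its quantile interval (its symbol)."""
    return np.array([np.searchsorted(bp, x) for x, bp in zip(point, breakpoints)])

def lower_bound_dist(query, symbols, breakpoints, lo, hi):
    """Lower bound on the Euclidean distance to any point encoded as `symbols`,
    using only the per-dimension interval edges (no multi-dimensional splits)."""
    total = 0.0
    for qx, s, bp in zip(query, symbols, breakpoints):
        left = bp[s - 1] if s > 0 else lo        # interval lower edge
        right = bp[s] if s < len(bp) else hi     # interval upper edge
        if qx < left:
            total += (left - qx) ** 2
        elif qx > right:
            total += (qx - right) ** 2
        # a query coordinate inside the interval contributes 0
    return np.sqrt(total)
```

Because each dimension is encoded independently, the bound is computed with one comparison per dimension, which is what makes node-level pruning in the DE-Tree cheap.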

DET‑LSH builds L independent DE‑Trees, each based on the same family of LSH hash functions that map original points into a K‑dimensional projected space. The query algorithm proceeds in two stages. In the first stage, a range query with radius r is executed on every DE‑Tree, collecting a coarse‑grained candidate set that is guaranteed (with high probability) to contain a substantial fraction of the true nearest neighbors. Because DE‑Tree nodes store tight distance bounds, this filtering step is extremely fast. In the second stage, the algorithm computes the exact Euclidean distance in the original space for all candidates, sorts them, and returns the top‑k results. This two‑step strategy blends the strengths of boundary‑constraint (BC) methods—high recall through multiple independent structures—with the efficiency of distance‑metric (DM) methods that use actual distances for final selection.
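The two-stage query could be sketched as below. This is a simplified outline, not the paper's algorithm: `trees`, `project`, and `range_query` are hypothetical stand-ins for the L DE-Trees, their LSH projections, and the coarse-grained range filter.

```python
import heapq
import numpy as np

def det_lsh_query(query, k, trees, projections, data, radius):
    """Stage 1: range-filter in each of the L projected indexes and take the
    union of candidates. Stage 2: verify candidates with exact Euclidean
    distances in the original space and return the top-k (distance, id) pairs."""
    candidates = set()
    for tree, proj in zip(trees, projections):
        q_proj = proj(query)  # map the query into that tree's K-dim projected space
        candidates |= set(tree.range_query(q_proj, radius))
    scored = [(float(np.linalg.norm(data[i] - query)), i) for i in candidates]
    return heapq.nsmallest(k, scored)
```

The union over L independent structures is what drives recall up (a near neighbor only needs to survive the filter in one tree), while the exact-distance pass keeps the final ranking faithful to the original space.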

The authors provide a rigorous theoretical analysis. They prove that DET‑LSH answers a (c², k)‑ANN query with a constant success probability p > 0. The proof leverages the standard (r, cr, p₁, p₂)‑LSH guarantee together with the fact that DE‑Tree’s encoding ensures that any point within distance r from the query will be captured by a node whose lower‑bound distance is ≤ r. By using L independent trees, the overall success probability is amplified to 1 − (1 − p₁·(1 − ε))ᴸ, where ε accounts for the discretization error introduced by encoding.

Extensive experiments on real‑world benchmarks (SIFT1M, GIST1M, Deep1B, MNIST, etc.) demonstrate that DET‑LSH consistently outperforms state‑of‑the‑art LSH‑based systems across three major metrics: indexing time, query time, and recall. Index construction is accelerated by up to 6×, while query processing gains up to 2× speed‑up, all while achieving higher recall (typically 4–7 % improvement) compared to leading BC methods such as DB‑LSH, C2 methods like R2LSH, and DM methods like PM‑LSH. The performance gap widens as the projected dimensionality increases, confirming that DE‑Tree’s dimension‑wise encoding mitigates the curse of dimensionality that plagues traditional partitioning trees.

The paper’s contributions are threefold: (1) a novel encoding‑based tree structure that dramatically reduces indexing overhead and supports efficient Euclidean range queries; (2) the DET‑LSH framework that combines coarse‑grained filtering with fine‑grained distance verification, backed by provable ANN guarantees; (3) a comprehensive empirical evaluation showing superior scalability and accuracy. Limitations include the need to pre‑select the alphabet size (i.e., the number of symbols per dimension) and the lack of support for dynamic updates (insertions/deletions) without rebuilding the trees. The authors suggest future work on automatic parameter tuning, incremental DE‑Tree maintenance, extensions to non‑Euclidean similarity measures (e.g., cosine or Manhattan distance), and distributed implementations for massive datasets.

In summary, DET‑LSH re‑imagines the indexing component of LSH‑based ANN search by replacing costly multi‑dimensional partitioning with a dynamic, per‑dimension encoding tree. This design yields substantial speed‑ups in both indexing and query phases while preserving, and even improving, the theoretical and empirical accuracy guarantees that make LSH attractive for high‑dimensional similarity search.

