Generic LSH Families for the Angular Distance Based on Johnson-Lindenstrauss Projections and Feature Hashing LSH
In this paper we propose the construction of generic LSH families for the angular distance based on Johnson-Lindenstrauss projections. We show that feature hashing is a valid J-L projection and propose two new LSH families based on it. These new families are tested on both synthetic and real datasets with very good results and a considerable performance improvement over existing LSH families. While the theoretical analysis is done for the angular distance, these families can also be used in practice for the Euclidean distance with excellent results [2]. Our tests on real datasets show that the proposed LSH functions also work well for the Euclidean distance.
💡 Research Summary
The paper addresses the problem of designing locality‑sensitive hashing (LSH) schemes for the angular (cosine) distance that are both theoretically sound and practically efficient. It begins by reviewing the classic definition of an LSH family and the amplification technique that widens the gap between the collision probabilities of “near” and “far” points by concatenating r hash functions within each table (AND construction) and querying b independent hash tables (OR construction). The authors then reinterpret three well‑known angular‑distance LSH families, namely Hyperplane LSH (Charikar 2002), Voronoi LSH, and Cross‑Polytope LSH, as special cases of a more general construction based on Johnson‑Lindenstrauss (JL) projections. The key insight is that any random linear map satisfying the JL lemma preserves pairwise distances up to a small distortion, and therefore any simple post‑processing of the projected vectors (e.g., taking the sign of a coordinate or selecting the index of the maximum coordinate) yields a hash function whose collision probability is a monotone function of the original distance.
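The AND/OR amplification scheme described above can be sketched for the Hyperplane family. This is an illustrative implementation under our own naming (`build_tables`, `query`), not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def hyperplane_hash(x, planes):
    # Hyperplane LSH (Charikar 2002): one sign bit per random hyperplane;
    # concatenating r bits per table is the AND step of amplification.
    return tuple(np.sign(planes @ x).astype(int))

def build_tables(points, d, r, b):
    # b independent tables (OR step), each keyed by an r-bit hash.
    tables = []
    for _ in range(b):
        planes = rng.standard_normal((r, d))  # r random hyperplanes
        table = {}
        for idx, x in enumerate(points):
            table.setdefault(hyperplane_hash(x, planes), []).append(idx)
        tables.append((planes, table))
    return tables

def query(tables, q):
    # Candidate set = union, over all tables, of the bucket q falls into.
    cands = set()
    for planes, table in tables:
        cands.update(table.get(hyperplane_hash(q, planes), []))
    return cands
```

Increasing r lowers the collision probability of far points (fewer false positives), while increasing b recovers the collision probability of near points (fewer false negatives).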
The central theoretical contribution is twofold. First, Theorem 1 shows that for any JL projection p: ℝ^d → ℝ^{d′} one can define at least two distinct LSH families: (i) the “arg‑max” family h(x) = arg max_i ⟨p_i, x⟩, and (ii) the “sign” family h_i(x) = sign(⟨p_i, x⟩), where p_i denotes the i‑th column of the projection matrix. Both families inherit the distance‑sensitive collision property from the underlying JL map. Second, the authors prove that Feature Hashing, a widely used dimensionality‑reduction technique that maps each original feature to a random target coordinate with a random sign, is itself a JL projection. Representing Feature Hashing as a sparse d × d′ matrix M with exactly k non‑zero (±1) entries per row, they show (Theorem 2) that the expected squared norm of the projected vector scales by a factor of k, i.e., E[‖Mᵀx‖²] = k‖x‖², so ‖Mᵀx‖ ≈ √k ‖x‖ and pairwise distances are preserved in expectation.
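The norm-preservation claim of Theorem 2 can be checked numerically in the simplest case, k = 1 (one non-zero entry per row), where the expected squared norm is preserved exactly. The helper `feature_hash` below is our illustrative sketch, not the paper's code:

```python
import numpy as np

def feature_hash(x, d_out, seed=0):
    # Feature hashing with k = 1: each input coordinate is sent to one
    # random output coordinate with a random sign.
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, d_out, size=len(x))   # target coordinate per feature
    sgn = rng.choice([-1.0, 1.0], size=len(x))  # random sign per feature
    out = np.zeros(d_out)
    np.add.at(out, idx, sgn * np.asarray(x))    # scatter-add (handles collisions)
    return out

# For k = 1 the claim reads E[||Mx||^2] = ||x||^2; estimate the
# expectation by averaging the squared norm over many random seeds.
x = np.random.default_rng(42).standard_normal(128)
mean_sq = np.mean([np.linalg.norm(feature_hash(x, 32, seed=s))**2
                   for s in range(500)])
```

The average squared norm concentrates around ‖x‖² even though d′ = 32 is much smaller than d = 128, which is exactly the JL-style behavior the theorem establishes.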
Building on these results, the paper proposes two new LSH families:
- Feature Hashing LSH – For each hash function a random sparse matrix M (with k non‑zero ±1 entries per row) is generated. The hash value of a point x is the index of the column with the largest inner product ⟨M_i, x⟩. This is exactly the Voronoi (or “max‑index”) scheme but with a highly sparse projection, dramatically reducing the cost of the inner‑product computation.
- Directional Feature Hashing LSH – Using the same matrix M, the hash value is a d′‑bit binary vector obtained by taking the sign of each inner product ⟨M_i, x⟩. This mirrors Hyperplane LSH, yet each hyperplane involves only k non‑zero coefficients, again yielding a substantial speed‑up.
The experimental evaluation consists of two parts. In the synthetic benchmark, 128‑dimensional unit‑sphere vectors are generated, and collision probability is plotted against Euclidean distance for several LSH families. The results show that Directional Feature Hashing matches Hyperplane LSH in false‑positive rate while being faster due to sparsity; Feature Hashing LSH exhibits higher false‑positive rates but can be tuned by increasing k. Voronoi and Cross‑Polytope LSH achieve the lowest false‑positive rates but are the most computationally expensive because they rely on dense Gaussian matrices.
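Collision-probability curves of the kind plotted in the synthetic benchmark can be reproduced for Hyperplane LSH with a quick Monte Carlo against the known 1 − θ/π collision formula. This is an illustrative sanity check, not the paper's experimental code:

```python
import numpy as np

def hyperplane_collision_rate(theta, trials=20000, seed=0):
    # Fraction of random hyperplanes placing two unit vectors at angle
    # theta on the same side; theory predicts 1 - theta/pi.
    rng = np.random.default_rng(seed)
    x = np.array([1.0, 0.0])
    y = np.array([np.cos(theta), np.sin(theta)])
    g = rng.standard_normal((trials, 2))  # random normals in span(x, y)
    return float(np.mean(np.sign(g @ x) == np.sign(g @ y)))
```

Sweeping θ and plotting the estimate against distance yields the monotone collision-probability curve that the benchmark compares across families.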
The second part uses the real‑world SIFT‑1M dataset (one million 128‑dimensional descriptors). Although SIFT vectors are not normalized to the unit sphere, the proposed methods still perform well under Euclidean distance. Directional Feature Hashing attains recall comparable to Hyperplane LSH while reducing both memory footprint and query time by roughly 30%. Feature Hashing LSH, when configured with a larger k, also reaches competitive recall at significantly lower computational cost.
Overall, the paper demonstrates that any JL‑type projection can serve as a universal building block for angular‑distance LSH. Feature Hashing, being extremely simple, memory‑efficient, and amenable to streaming or online updates, emerges as a practical alternative to dense random projections. The two new families inherit the theoretical guarantees of classic LSH while offering concrete speed and space advantages, making them attractive for large‑scale nearest‑neighbor search in high‑dimensional settings.