b-Bit Minwise Hashing in Practice: Large-Scale Batch and Online Learning and Using GPUs for Fast Preprocessing with Simple Hash Functions


In this paper, we study several critical issues which must be tackled before one can apply b-bit minwise hashing to the volumes of data often used in industrial applications, especially in the context of search. 1. (b-bit) Minwise hashing requires an expensive preprocessing step that computes k (e.g., 500) minimal values after applying the corresponding permutations to each data vector. We developed a parallelization scheme using GPUs and observed that the preprocessing time can be reduced by a factor of 20-80 and becomes substantially smaller than the data loading time. 2. One major advantage of b-bit minwise hashing is that it can substantially reduce the amount of memory required for batch learning. However, as online algorithms become increasingly popular for large-scale learning in the context of search, it is not clear whether b-bit minwise hashing yields significant improvements for them. This paper demonstrates that b-bit minwise hashing provides an effective data size/dimension reduction scheme and hence can dramatically reduce the data loading time for each epoch of the online training process. This is significant because online learning often requires many (e.g., 10 to 100) epochs to reach sufficient accuracy. 3. Another critical issue is that for very large data sets it becomes impossible to store a (fully) random permutation matrix, due to its space requirements. Our paper is the first study to demonstrate that b-bit minwise hashing implemented using simple hash functions, e.g., the 2-universal (2U) and 4-universal (4U) hash families, can produce very similar learning results as using fully random permutations. Experiments on datasets of up to 200GB are presented.


💡 Research Summary

This paper tackles three practical obstacles that prevent the widespread deployment of b‑bit minwise hashing in industrial‑scale search and advertising systems. First, the preprocessing step required by minwise hashing—computing k (e.g., 500) minima after applying a permutation to each high‑dimensional sparse vector—is extremely costly. The authors design a GPU‑accelerated pipeline that streams vectors to device memory, generates pseudo‑random permutations on the fly using simple 2‑universal (2U) and 4‑universal (4U) hash families, and extracts the minima in parallel across thousands of threads. Benchmarks on real‑world datasets show a 20‑ to 80‑fold speedup over a single‑core CPU implementation, making preprocessing time smaller than data loading time.
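To make the "permutation via simple hash family" idea concrete, here is a minimal CPU sketch of the preprocessing step, with each fully random permutation replaced by a 2-universal hash h(x) = (a*x + b) mod p, as the paper proposes. The prime, parameter names, and k value are illustrative, not taken from the paper; a GPU version would evaluate the same arithmetic across thousands of threads.

```python
import numpy as np

# Illustrative sketch: k "permutations" simulated by a 2U hash family.
# P, k, and the seed are assumptions for this example, not paper values.
P = (1 << 31) - 1  # a Mersenne prime larger than the vocabulary size

def minwise_signature(nonzero_indices, k, rng):
    """Compute k minwise hash values for one sparse binary vector.

    In real use the (a, b) parameters must be drawn once and shared
    across all vectors, so that signatures are comparable.
    """
    idx = np.asarray(nonzero_indices, dtype=np.uint64)
    a = rng.integers(1, P, size=k, dtype=np.uint64)
    b = rng.integers(0, P, size=k, dtype=np.uint64)
    # Each row plays the role of one permutation; keep the minimum per row.
    hashed = (a[:, None] * idx[None, :] + b[:, None]) % P
    return hashed.min(axis=1)

rng = np.random.default_rng(0)
sig = minwise_signature([3, 17, 42, 1001], k=500, rng=rng)
print(sig.shape)  # one 500-entry signature per vector
```

Because the hash parameters occupy only O(k) memory, each GPU thread can regenerate its "permutation" on the fly instead of reading a stored permutation table, which is what makes the streaming pipeline feasible.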

Second, while b‑bit hashing is known to shrink memory footprints for batch learning, its benefit for online learning—where many epochs of stochastic gradient descent (SGD), FTRL, or similar algorithms are required—was unclear. By compressing each vector from millions of dimensions to a few thousand b‑bit signatures, the authors demonstrate that per‑epoch data loading time drops by a factor of 5–10, and total training time is dramatically reduced without measurable loss in predictive quality (AUC, log‑loss). In some cases the regularizing effect of the compression even yields a slight accuracy gain.
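The compression works by keeping only the lowest b bits of each of the k minwise hashes and one-hot expanding each truncated value, so a vector with millions of dimensions becomes a k * 2^b binary feature vector that a linear learner (SGD, FTRL) can consume directly. A small sketch, with function and variable names that are our own rather than the paper's:

```python
import numpy as np

def bbit_features(signature, b):
    """Expand k minwise hashes into a k * 2^b sparse binary feature vector
    by keeping only the lowest b bits of each hash (one-hot per hash)."""
    k = len(signature)
    low_bits = np.asarray(signature, dtype=np.uint64) & np.uint64((1 << b) - 1)
    # Index of the single nonzero entry in each of the k blocks of size 2^b.
    cols = np.arange(k) * (1 << b) + low_bits.astype(np.int64)
    feat = np.zeros(k * (1 << b), dtype=np.int8)
    feat[cols] = 1
    return feat

sig = np.array([5, 12, 7, 255], dtype=np.uint64)
x = bbit_features(sig, b=2)
print(x.size, int(x.sum()))  # 16 features, exactly k=4 of them set
```

With, say, k = 500 and b = 8, each example is 500 * 256 = 128,000 binary features with exactly 500 nonzeros, which is why per-epoch I/O shrinks so sharply compared with the raw sparse representation.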

Third, storing a fully random permutation matrix is infeasible for terabyte‑scale corpora. The paper provides the first extensive empirical evidence that simple hash families (2U, 4U) can serve as effective substitutes. Experiments on datasets up to 200 GB show that models trained with these hash‑based permutations achieve virtually identical performance to those trained with true random permutations—the AUC gap is less than 0.001. The 4U family, in particular, offers near‑zero collision probability while requiring only a handful of integer multiplications and bit‑shifts, thus keeping memory overhead negligible.
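The storage argument is easy to see in code: a 2U hash needs two random coefficients and a 4U hash (a random degree-3 polynomial mod a prime) needs four, versus a permutation table with one entry per vocabulary item. A hedged sketch, with the prime and naming chosen for illustration:

```python
import random

# Illustrative constants; the paper does not prescribe this prime.
P = (1 << 61) - 1  # Mersenne prime, comfortably above typical vocabularies

def make_2u(rng):
    """2-universal hash: h(x) = (a*x + b) mod P, two stored integers."""
    a, b = rng.randrange(1, P), rng.randrange(P)
    return lambda x: (a * x + b) % P

def make_4u(rng):
    """4-universal hash: random degree-3 polynomial mod P, four stored
    integers (evaluated with Horner's rule)."""
    c = [rng.randrange(P) for _ in range(4)]
    return lambda x: (((c[3] * x + c[2]) * x + c[1]) * x + c[0]) % P

rng = random.Random(42)
h2, h4 = make_2u(rng), make_4u(rng)
print(h2(12345), h4(12345))
```

Four 64-bit coefficients per "permutation" replace a table with potentially billions of entries, which is the entire storage saving the paper's experiments validate.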

Overall, the work establishes that (1) GPU parallelism eliminates the preprocessing bottleneck, (2) b‑bit minwise hashing substantially accelerates online learning by reducing I/O and memory bandwidth demands, and (3) lightweight universal hash functions provide a practical, storage‑efficient way to implement the technique at industrial scale. These contributions make b‑bit minwise hashing a ready‑to‑use tool for large‑scale search, click‑through‑rate prediction, and real‑time recommendation pipelines.

