Accurate Estimators for Improving Minwise Hashing and b-Bit Minwise Hashing
Minwise hashing is the standard technique in search and database applications for efficiently estimating set similarities (where sets are represented, e.g., as high-dimensional 0/1 vectors). Recently, b-bit minwise hashing was proposed, which significantly improves upon the original minwise hashing in practice by storing only the lowest b bits of each hashed value, as opposed to using 64 bits. b-bit hashing is particularly effective in applications that mainly concern sets of high similarity (e.g., resemblance > 0.5). However, there are other important applications in which not only highly similar pairs matter. For example, many learning algorithms require all pairwise similarities, and it is expected that only a small fraction of the pairs are similar. Furthermore, many applications care more about containment (e.g., how much of one object is contained in another) than about resemblance. In this paper, we show that the estimators for minwise hashing and b-bit minwise hashing used in current practice can be systematically improved, and that the improvements are most significant for set pairs of low resemblance and high containment.
💡 Research Summary
The paper revisits the fundamental problem of estimating set similarity (resemblance) and containment using minwise hashing and its recent variant, b‑bit minwise hashing. While the classic estimator for minwise hashing simply counts the fraction of permutations where the minimum hashed values of two sets coincide, this approach is optimal only when the two sets have equal cardinalities. In real‑world data, set sizes often differ dramatically, causing the standard estimator to suffer from unnecessarily large variance, especially for pairs with low resemblance but high containment.
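The classic estimator described above can be sketched in a few lines of Python. This is a hedged illustration, not the paper's implementation: seeded hash functions stand in for the random permutations, and the names `make_perms`, `minhash_signature`, and `resemblance_estimate` are hypothetical.

```python
import random

def make_perms(k, seed=0):
    """Hypothetical hash family: one seeded hash function per 'permutation'."""
    rng = random.Random(seed)
    seeds = [rng.getrandbits(64) for _ in range(k)]
    return [lambda x, s=s: hash((s, x)) for s in seeds]

def minhash_signature(s, perms):
    """Minimum hashed value of set s under each 'permutation'."""
    return [min(h(x) for x in s) for h in perms]

def resemblance_estimate(sig1, sig2):
    """Classic estimator: fraction of permutations whose minima coincide."""
    return sum(m1 == m2 for m1, m2 in zip(sig1, sig2)) / len(sig1)

perms = make_perms(500)
A = set(range(0, 80))
B = set(range(40, 120))   # |A ∩ B| = 40, |A ∪ B| = 120, true R = 1/3
R_hat = resemblance_estimate(minhash_signature(A, perms),
                             minhash_signature(B, perms))
```

With k = 500 samples the estimate should land near R = 1/3, with the binomial variance R(1−R)/k that the paper sets out to improve upon.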
To address this, the authors model the outcome of each permutation as a three‑category multinomial variable: (i) the minima are equal, (ii) the minimum of set 1 is smaller, and (iii) the minimum of set 2 is smaller. The corresponding probabilities are
P₌ = a/(f₁+f₂−a) = R,
P_< = (f₁−a)/(f₁+f₂−a), and
P_> = (f₂−a)/(f₁+f₂−a),
where a is the intersection size and f₁, f₂ are the set cardinalities. By counting the occurrences k₌, k_<, k_> over k independent permutations, unbiased estimators of a can be derived from each category, but their variances differ markedly (Equations 13‑15). Because a is unknown, the paper proposes the maximum‑likelihood estimator (MLE) obtained by solving the likelihood equation (16). The resulting estimator is asymptotically unbiased and has variance (17), which is always smaller than that of the traditional estimator. The variance reduction is most pronounced when f₂/f₁ < 0.5 and the containment T = a/f₂ is close to 1; reductions of up to two orders of magnitude are demonstrated analytically and confirmed by simulations on real‑world word‑occurrence sets.
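The MLE step can be illustrated with a small simulation. This is a sketch under stated assumptions, not the paper's solver: the counts k₌, k_<, k_> are drawn directly from the three-category multinomial model rather than from actual permutations, and a simple grid search over the log-likelihood stands in for solving the likelihood equation (16) analytically.

```python
import math
import random

def loglik(a, f1, f2, k_eq, k_lt, k_gt):
    """Multinomial log-likelihood of intersection size a, given counts
    k_eq, k_lt, k_gt and known cardinalities f1, f2."""
    union = f1 + f2 - a
    return (k_eq * math.log(a) + k_lt * math.log(f1 - a)
            + k_gt * math.log(f2 - a)
            - (k_eq + k_lt + k_gt) * math.log(union))

def mle_intersection(f1, f2, k_eq, k_lt, k_gt, steps=10000):
    """Grid-maximize the log-likelihood over a in (0, min(f1, f2))."""
    lo, hi = 1e-6, min(f1, f2) - 1e-6
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda a: loglik(a, f1, f2, k_eq, k_lt, k_gt))

# Simulated counts from the three-category model: small f2/f1 ratio and
# high containment T = a/f2, the regime where the MLE helps most.
random.seed(1)
f1, f2, a_true, k = 1000.0, 200.0, 180.0, 2000
weights = [a_true, f1 - a_true, f2 - a_true]   # proportional to P=, P<, P>
draws = random.choices([0, 1, 2], weights=weights, k=k)
k_eq, k_lt, k_gt = draws.count(0), draws.count(1), draws.count(2)
a_hat = mle_intersection(f1, f2, k_eq, k_lt, k_gt)
```

Because all three counts carry information about a, the MLE pools them, which is the source of the variance reduction over the single-count estimators.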
The second contribution extends the analysis to b‑bit minwise hashing. Prior work only used the probability that the lowest b bits of the two minima match (P_b,=) to estimate resemblance, discarding the richer information contained in the full 2ᵇ × 2ᵇ joint distribution of the b‑bit values. The authors derive explicit formulas (19‑21) for the joint probabilities P_b,(t,d) as functions of the normalized set sizes r₁ = f₁/D, r₂ = f₂/D and s = a/D (D is the universe size). This yields a multinomial model with up to 2ᵇ × 2ᵇ cells. Applying MLE to this model leads to several practical estimators:
- ˆs_{b,f}: full MLE using all cells (most accurate, computationally intensive).
- ˆs_{b,do}: MLE using the 2ᵇ diagonal cells plus two aggregated cells for “<” and “>”.
- ˆs_{b,d}: MLE using only the diagonal cells.
- ˆs_{b,=}: the original estimator based solely on P_b,=.
- ˆs_{b,≈}: a simplified estimator that combines P_b,= with the aggregated “<” and “>” counts.
Experiments with b = 4–8 show that the reduced‑complexity estimators (especially ˆs_{b,do}) achieve 2–5× lower variance than the original b‑bit estimator while retaining the same storage footprint. The gains are especially large for low‑resemblance, high‑containment pairs, mirroring the findings for the full‑size minwise case.
Overall, the paper makes three key contributions: (1) it reframes minwise and b‑bit minwise hashing as multinomial estimation problems, (2) it introduces MLE‑based estimators that systematically improve variance across all similarity regimes, and (3) it highlights the importance of containment estimation, offering practical solutions for applications such as duplicate detection, database schema matching, and large‑scale machine‑learning pipelines where many pairwise similarities must be computed under tight storage constraints. The theoretical derivations are supported by extensive simulations on synthetic and real data, establishing the proposed methods as robust, high‑performance alternatives to the traditional estimators.