ZOR filters: fast and smaller than fuse filters

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Probabilistic membership filters support fast approximate membership queries with a controlled false-positive probability $\varepsilon$ and are widely used across storage, analytics, networking, and bioinformatics \cite{chang2008bigtable,dayan2018optimalbloom,broder2004network,harris2020improved,marchet2023scalable,chikhi2025logan,hernandez2025reindeer2}. In the static setting, state-of-the-art designs such as XOR and fuse filters achieve low overhead and very fast queries, but their peeling-based construction succeeds only with high probability, which complicates deterministic builds \cite{graf2020xor,graf2022binary,ulrich2023taxor}. We introduce \emph{ZOR filters}, a deterministic continuation of XOR/fuse filters that guarantees construction termination while preserving the same XOR-based query mechanism. ZOR replaces restart-on-failure with deterministic peeling that abandons a small fraction of keys, and restores false-positive-only semantics by storing the remainder in a compact auxiliary structure. In our experiments, the abandoned fraction drops below $1\%$ for moderate arity (e.g., $N\ge 5$), so the auxiliary handles a negligible fraction of keys. As a result, ZOR filters can achieve overhead within $1\%$ of the information-theoretic lower bound $\log_2(1/\varepsilon)$ while retaining fuse-like query performance; the additional cost is concentrated on negative queries due to the auxiliary check. Our current prototype builds several-fold slower than highly optimized fuse builders because it maintains explicit incidence information during deterministic peeling; closing this optimisation gap is an engineering target.

💡 Research Summary

ZOR filters are introduced as a deterministic continuation of XOR/fuse filters that eliminates the need for restart‑on‑failure during construction while preserving the ultra‑fast XOR‑based query mechanism. In static approximate membership filters, XOR and fuse filters achieve near‑optimal space (within a few percent of the information‑theoretic bound log₂(1/ε)) and sub‑100 ns query latency, but their peeling‑based construction can get stuck when the hypergraph of cells and keys contains a core where every remaining cell has degree at least two. Traditional solutions simply restart with new hash seeds, which makes construction probabilistic and sometimes costly.

ZOR solves this by employing a deterministic peeling strategy that, when the degree‑1 queue becomes empty, selects a cell v of minimal current degree (≥ 2) and forces progress: among the d(v) keys incident to v, one key is kept (to be resolved at v) while the remaining d(v) − 1 keys are abandoned. These abandoned keys are removed from all their incident cells, instantly turning v into a degree‑1 cell and allowing the standard peeling loop to continue. Because each forced step eliminates at least one active key, the algorithm is guaranteed to terminate after at most n steps.
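The forced-progress rule described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names (`cells_for`, `fingerprint`, `deterministic_peel`, `build`, `contains`), the use of `blake2b` as the hash, and the default parameters are all assumptions made for illustration. Instead of a degree‑1 queue, the sketch rescans for the minimal-degree non-empty cell at every step, which covers both the ordinary peel (degree 1) and the forced step (degree ≥ 2) but is far slower than a real builder.

```python
import hashlib
from collections import defaultdict


def _h(tag, key, seed):
    """64-bit hash of (tag, seed, key); blake2b is only a convenient stand-in."""
    data = f"{tag}:{seed}:{key}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")


def cells_for(key, m, arity=3, seed=0):
    """Hypothetical placement: `arity` distinct cells in [0, m); needs m >= arity."""
    cells, i = [], 0
    while len(cells) < arity:
        c = _h(f"cell{i}", key, seed) % m
        if c not in cells:
            cells.append(c)
        i += 1
    return cells


def fingerprint(key, bits=8, seed=0):
    return _h("fp", key, seed) & ((1 << bits) - 1)


def deterministic_peel(keys, m, arity=3, seed=0):
    """Return (order, abandoned); order lists (key, cell) pairs as peeled."""
    incident = {k: cells_for(k, m, arity, seed) for k in keys}
    cell_keys = defaultdict(set)
    for k, cs in incident.items():
        for c in cs:
            cell_keys[c].add(k)
    active = set(incident)
    order, abandoned = [], []

    def remove(k):
        active.discard(k)
        for c in incident[k]:
            cell_keys[c].discard(k)

    while active:
        # Minimal-degree non-empty cell; degree 1 is an ordinary peel step.
        v = min((c for c, ks in cell_keys.items() if ks),
                key=lambda c: (len(cell_keys[c]), c))
        kept, *dropped = sorted(cell_keys[v])
        for k in dropped:          # forced step: abandon the other d(v) - 1 keys
            abandoned.append(k)
            remove(k)
        order.append((kept, v))    # the kept key is resolved at cell v
        remove(kept)
    return order, abandoned


def build(keys, m, arity=3, bits=8, seed=0):
    order, abandoned = deterministic_peel(keys, m, arity, seed)
    table = [0] * m
    for k, v in reversed(order):   # back-substitution, last peeled first
        x = fingerprint(k, bits, seed)
        for c in cells_for(k, m, arity, seed):
            if c != v:
                x ^= table[c]
        table[v] = x
    return table, abandoned


def contains(table, key, arity=3, bits=8, seed=0):
    x = 0
    for c in cells_for(key, len(table), arity, seed):
        x ^= table[c]
    return x == fingerprint(key, bits, seed)
```

Every loop iteration removes at least one active key, so the loop runs at most n times, matching the termination argument above. Keys returned in `abandoned` are not guaranteed to pass `contains` and would have to be handled separately.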

The abandoned set A introduces false negatives, which would break the usual “false‑positive‑only” semantics of Bloom‑style filters. To restore the standard semantics, ZOR stores a carefully selected subset A′ of the abandoned keys in an auxiliary static structure (e.g., another fuse filter, an MPHF with short fingerprints, or any compact static filter). Queries first consult the main ZOR filter; if it reports “absent,” the auxiliary is consulted. The overall false‑positive probability becomes ε_tot = 1 − (1 − ε₁)(1 − ε₂) ≈ ε₁ + ε₂, where ε₁ and ε₂ are the false‑positive rates of the main and auxiliary filters respectively.
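As a rough illustration of the two-stage lookup and the combined error rate, the following Python sketch treats both filters as opaque membership callables. The names `total_fpr` and `two_tier_contains` are hypothetical, and the product formula assumes the two filters' false positives are independent.

```python
def total_fpr(eps_main: float, eps_aux: float) -> float:
    """False-positive rate when a key is accepted if either filter accepts it.

    Assumes the two filters err independently.
    """
    return 1.0 - (1.0 - eps_main) * (1.0 - eps_aux)


def two_tier_contains(key, main_contains, aux_contains) -> bool:
    """Query the main filter first; consult the auxiliary only on 'absent'."""
    if main_contains(key):
        return True
    return aux_contains(key)


# Usage with exact sets standing in for the two filters (illustration only).
main_keys, aux_keys = {"a", "b"}, {"c"}
found = two_tier_contains("c", main_keys.__contains__, aux_keys.__contains__)
missing = two_tier_contains("z", main_keys.__contains__, aux_keys.__contains__)
```

With ε₁ = 0.01 and ε₂ = 0.001, `total_fpr` gives 1 − 0.99 × 0.999 = 0.01099, slightly below the first-order estimate ε₁ + ε₂ = 0.011.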

Memory optimisation is analysed by modelling the main filter’s fingerprint size as F bits per original key (with m = n cells) and the auxiliary’s fingerprint size as G bits per stored abandoned key. The total bits per original key are B(F,G) = F + αG, where α = |A|/n is the abandoned fraction. Using the standard approximation ε ≈ 2^(−F) for fingerprint‑based filters, the total error is ε_tot ≈ 2^(−F) + 2^(−G). Minimising the overhead ρ(F,G) = B(F,G) − log₂(1/ε_tot) yields the balance condition G* ≈ F + log₂(1/α). The optimal auxiliary fingerprint is therefore longer than the main one by about log₂(1/α) bits, but it is paid only on the abandoned fraction: when α is small (e.g., < 1 %), the auxiliary term αG stays small, keeping the overall space within 1 % of the theoretical optimum.
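One way to recover the balance condition is sketched below. It assumes the minimisation is carried out at a fixed target error ε_tot, which the summary does not state explicitly, and it treats F and G as continuous variables; the multiplier λ is introduced here only for the derivation.

```latex
\begin{align*}
&\text{minimise } B(F,G) = F + \alpha G
  \quad\text{subject to}\quad 2^{-F} + 2^{-G} = \varepsilon_{\mathrm{tot}}, \\
&\mathcal{L}(F,G,\lambda) = F + \alpha G
  + \lambda\left(2^{-F} + 2^{-G} - \varepsilon_{\mathrm{tot}}\right), \\
&\frac{\partial \mathcal{L}}{\partial F} = 1 - \lambda \ln 2 \cdot 2^{-F} = 0,
  \qquad
  \frac{\partial \mathcal{L}}{\partial G} = \alpha - \lambda \ln 2 \cdot 2^{-G} = 0, \\
&\Rightarrow\; \frac{2^{-G}}{2^{-F}} = \alpha
  \;\Rightarrow\; G^{*} = F + \log_2(1/\alpha), \\
&\varepsilon_{\mathrm{tot}} \approx 2^{-F} + 2^{-G^{*}} = 2^{-F}\,(1 + \alpha).
\end{align*}
```

Under these assumptions the auxiliary contributes only an α-fraction of the total error, so for small α the main fingerprint size F stays close to log₂(1/ε_tot).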

Empirical evaluation shows that for arities N ≥ 5 the abandoned fraction α drops below 1 %, meaning the auxiliary structure handles a negligible portion of the keys. Consequently, ZOR filters achieve an overall space overhead of less than 1 % above log₂(1/ε) while retaining fuse‑like query performance (typically < 100 ns). The extra cost is concentrated on negative queries: whenever the main filter reports “absent,” the auxiliary must also be checked. Because α is tiny, the auxiliary is small and the impact on average latency is minimal.

The current prototype builds several times slower than highly optimised fuse builders because it maintains explicit incidence lists during deterministic peeling. This overhead is an engineering issue rather than a theoretical limitation; more sophisticated data structures, batch processing, or parallelism could close the gap.

In summary, ZOR filters contribute three key advances: (1) a deterministic, always‑terminating construction algorithm that removes the probabilistic restart bottleneck of XOR/fuse filters; (2) a principled method to abandon a tiny fraction of keys and recover false‑positive‑only semantics via a compact auxiliary filter; and (3) space efficiency that approaches the information‑theoretic lower bound within 1 % while preserving sub‑100 ns query times. These properties make ZOR filters attractive for large‑scale static workloads such as database indexes, immutable storage components, network packet filters, and bio‑informatics k‑mer dictionaries, where both minimal memory footprint and ultra‑fast lookups are critical.
