Age-Partitioned Bloom Filters
Bloom Filter based approaches for duplicate detection in streams
Existing probabilistic data structures for duplicate detection on streams can be broadly divided into Bloom-filter-based and dictionary-based (even if the term “Bloom filter” is sometimes abused, at the risk of losing its meaning). Essentially, Bloom filter variants use $`k`$ hash functions to choose $`k`$ cells to update, each holding some bit, counter, or timestamp; cells from insertions of different elements end up mixed (e.g., ORed or added). Examples are Counting Bloom Filters and Generalized Bloom Filters .
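To make the distinction concrete, here is a minimal sketch of the basic Bloom filter mechanism described above: $`k`$ hash-derived positions are ORed into one shared bit array, so bits from different elements mix. All names and parameters (`m`, `k`, the BLAKE2-based position derivation) are illustrative, not from any of the cited works.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: k hash functions OR bits into one
    shared array, so bits from insertions of different elements end up
    mixed (there are no per-element cells)."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, item):
        # Derive k positions by hashing (i, item); any k independent
        # hash functions (or a double-hashing scheme) would do.
        for i in range(self.k):
            h = hashlib.blake2b(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def insert(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def query(self, item):
        # True positives and (rare) false positives; never false negatives.
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter(m=1024, k=4)
bf.insert("alice")
```

A query simply re-derives the same $`k`$ positions and tests whether all bits are set.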
Dictionary-based approaches use one (or a few) hash functions to choose one cell (possibly one of a few alternatives, e.g., one of several slots in a bucket in one of two arrays) where some content (e.g., a fingerprint and a timestamp) is stored. Each cell is kept separate, and the contents of cells relative to different elements are not mixed. Basically, they are hash-table variants, storing hashes (fingerprints) rather than the full elements themselves. Examples are Cuckoo Filters and Morton Filters .
Most approaches for duplicate detection in streams are BF-based, but the best ones tend to be dictionary-based, like in , which uses a Backyard Cuckoo Hashing dictionary, in , also based on Cuckoo hashing, or SWAMP , which combines a dictionary mapping fingerprints to counters with a circular buffer of fingerprints.
Our approach improves the state of the art of BF-based approaches and, contrary to most of them, is competitive space-wise with dictionary-based approaches, while being very simple to implement. It also provides an adjacent transition zone with exponential forgetting outside the desired window (while guaranteeing no false negatives inside it), a feature that may be interesting in itself, leading to a small probability-weighted slack.
We now briefly survey BF-based approaches, to give an understanding of the design space where our approach lies. Broadly, we can classify them as bit-decaying based, segmentation based, counter based, or timestamping based.
Bit decaying based approaches
Bit-decaying based approaches forget the past by resetting bits, either randomly or by further hashing inserted elements, to limit the BF fill rate. The big drawback is that they only tend to forget the distant past: there is no guarantee that a recent element is unaffected, leading to false negatives. They tend to cause a large variance in the age at which an inserted element stops being reported, and are therefore not well suited to the problem, in either the sliding or jumping window models.
One example is the Scope Decay Bloom Filter (SDBF) , which resets random bits, either with an exponential decay model, resetting each bit with a given probability (impractical), or with a linear decay model, resetting a few random bits each time. Another is the Generalized Bloom Filter (GBF) , which at each insertion also uses another set of $`k_0`$ hash functions to reset $`k_0`$ bits. The Reservoir Sampling based Bloom Filter (RSBF) is a partitioned BF scheme which inserts missing elements only with some probability and, when inserting, also resets one random bit from each partition. In , some Biased Sampling based Bloom Filter (BSBF) variants are described, such as a variant of RSBF which always inserts, a variant which only resets one bit from one part, and another which stores the fill rate of each part and uses it as the probability of bit resetting for the respective part.
Segmentation based approaches
Segmentation based approaches use several disjoint segments which can be individually added and retired. The most naïve and often mentioned approach is a sequence of plain BFs, one per generation, adding a new one and retiring the oldest when the one in use gets full. This is a perfect fit for the jumping window model, with one BF per sub-window. A special case of this scheme with two BFs is the Cell Bloom Filter (CEBF) . Unfortunately, to have smooth jumps, or little slack in the sliding window model, a sequence of more than two BFs is needed, and the scheme becomes slow and memory-inefficient, as a query must test each BF, and each BF needs a tighter false-positive rate. Most approaches try some more sophisticated segmentation. To avoid rough jumps and keep the more recent elements alive, current approaches write to several segments when inserting, leading to a waste of space.
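The naïve generational scheme can be sketched as follows. Inserts go only to the newest generation; a query must test every generation (the cost the text points out); once the newest generation reaches its capacity, the oldest is retired. All parameters and the insertion-count capacity criterion are illustrative assumptions, and Python's built-in `hash` stands in for proper hash functions.

```python
class GenerationalDuplicateFilter:
    """Sketch of the naive segmented scheme: one plain Bloom filter
    (bit array) per generation / sub-window of a jumping window."""

    def __init__(self, generations, per_gen_capacity, m=1 << 12, k=4):
        self.gens = [[0] * m for _ in range(generations)]
        self.m, self.k = m, k
        self.cap, self.count = per_gen_capacity, 0

    def _positions(self, item):
        return [hash((i, item)) % self.m for i in range(self.k)]

    def insert(self, item):
        newest = self.gens[-1]
        for p in self._positions(item):
            newest[p] = 1
        self.count += 1
        if self.count >= self.cap:        # jump: retire the oldest generation
            self.gens.pop(0)
            self.gens.append([0] * self.m)
            self.count = 0

    def query(self, item):
        pos = self._positions(item)
        # Every generation must be tested, multiplying query work and
        # forcing a tighter per-filter false-positive rate.
        return any(all(g[p] for p in pos) for g in self.gens)
```

With two generations this degenerates into the CEBF-like special case; more generations smooth the jumps at the cost the text describes.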
The Double Buffering concept was introduced in , using a pair of active and warm-up BFs: queries go to the active, and inserts go to both until the warm-up is half full, at which point the warm-up becomes the active, the previous active is discarded, and a new empty warm-up is added. Somewhat dually, Active-Active Buffering (A2 Buffering) , while also having two BFs, named active1 and active2, inserts only in active1 but queries both, with the nuance that if an element is found only in active2, its bits are copied to active1. Compared with Double Buffering it is more memory efficient, as active1 and active2 can store distinct data, while in Double Buffering one BF is always a subset of the other.
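A minimal sketch of the Double Buffering rotation may help; here the "half full" trigger is approximated by a simple insertion count, and all names and parameters are illustrative assumptions rather than the original paper's formulation.

```python
class DoubleBufferingFilter:
    """Sketch of Double Buffering: the active filter answers queries;
    every insert writes to both active and warm-up; when the warm-up is
    deemed half full it is promoted to active and a fresh warm-up starts."""

    def __init__(self, m=1 << 12, k=4, half_capacity=1000):
        self.m, self.k = m, k
        self.half = half_capacity
        self.active = [0] * m
        self.warmup = [0] * m
        self.warm_count = 0

    def _positions(self, item):
        return [hash((i, item)) % self.m for i in range(self.k)]

    def insert(self, item):
        for p in self._positions(item):
            self.active[p] = 1
            self.warmup[p] = 1   # duplicate write: warm-up stays a subset of active
        self.warm_count += 1
        if self.warm_count >= self.half:
            # Promote warm-up, discard the old active, start a fresh warm-up.
            self.active = self.warmup
            self.warmup = [0] * self.m
            self.warm_count = 0

    def query(self, item):
        return all(self.active[p] for p in self._positions(item))
```

The subset relationship visible in `insert` is exactly the memory inefficiency that A2 Buffering removes by writing to only one filter.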
The Forgetful Bloom Filter (FBF) uses a future, a present, and $`N \geq 1`$ past BFs. It inserts in the future and present components and queries (essentially) by testing the presence in two consecutive BFs, from the future to the oldest past. When full, the oldest past is discarded, the other segments are shifted, and a new future is added. The space wasted by duplicate insertion is somewhat compensated by the reduced false-positive rate of the two-consecutive-filters test. However, the correlation between consecutive segments caused by duplicate insertion makes the false-positive rate reduction modest and not space efficient.
The more sophisticated Segmented Aging BF (SA-BF) combines the active/warm-up approach with a partitioned scheme, each segment being a partitioned BF with $`k`$ parts. Insertions go to both the active and the warm-up, but only the active is queried. At forget time, one part is moved from the warm-up to the active, in a round-robin scheme over the $`k`$ parts. Regardless of the sophistication, the duplicated insertion causes some space inefficiency.
Counter based approaches
Many approaches are based on counting BFs , using the same representation (a vector of counters) not merely to allow element deletion in sets, but for other purposes such as representing multi-sets. One example is the above-mentioned , which for the sliding window model uses a counting BF to store a multi-set, counting the number of occurrences in the window. While counting BFs are relatively efficient for their original purpose (allowing deletion in a set), since 4-bit counters suffice, when used to store multi-sets the size of each counter renders them very inefficient compared with dictionary-based approaches.
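The multi-set usage can be sketched as follows: inserts increment $`k`$ counters, deletes decrement them, and the occurrence count is estimated as the minimum over the element's counters. This is an illustrative sketch (Python's built-in `hash` stands in for real hash functions), not the cited scheme; note the wide counters per cell that drive the space cost.

```python
class CountingBloomMultiset:
    """Sketch of a counting Bloom filter used as a multi-set: the count
    of an element is read as the minimum of its k counters, which can
    only overestimate (on collisions), never underestimate."""

    def __init__(self, m=1 << 12, k=4):
        self.m, self.k = m, k
        self.counters = [0] * m   # each cell must be wide enough for real counts

    def _positions(self, item):
        return [hash((i, item)) % self.m for i in range(self.k)]

    def insert(self, item):
        for p in self._positions(item):
            self.counters[p] += 1

    def delete(self, item):
        for p in self._positions(item):
            self.counters[p] -= 1

    def count(self, item):
        return min(self.counters[p] for p in self._positions(item))
```

For plain set deletion 4-bit cells suffice, but window-sized occurrence counts force much wider cells, which is the inefficiency the text points out.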
Moreover, several approaches, such as the one just mentioned, require knowing the elements themselves to expire them when they leave the window, using a window-sized circular queue of elements, aggravating the space consumption problem. This is the case when using a Spectral Bloom Filter , or a Floating Counter Bloom Filter (FCBF) , the latter using even more space by holding floating-point numbers, aimed at reporting existential probabilities.
Most counter based approaches avoid storing the elements themselves by storing a fixed value in each of the $`k`$ cells when inserting, and periodically decrementing counters, considering an element absent if one or more of its counters has reached zero. A basic approach, using the window size $`N`$ as the starting value and decrementing all counters in the filter per insertion, as in Decaying Bloom Filters (DBF) , is unacceptably inefficient both time- and space-wise; it is somewhat more acceptable if time-based windows are used , when many events fit in a time unit. Space consumption in DBFs is addressed in the same paper by grouping elements in generations, obtaining the block Decaying Bloom Filter (b_DBF); time complexity is improved at the cost of oversized counters, which allow less frequent periodic subtractions, and computation bursts are avoided by de-amortizing such subtractions over time. Similar approaches are used in the Temporal Counting Bloom Filter (TCBF) .
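The basic counter-decaying idea can be sketched as below: each insert writes $`N`$ into $`k`$ counters and decrements every other non-zero counter, so an element expires exactly $`N`$ insertions later. The per-insert full sweep over all $`m`$ counters makes the time inefficiency explicit. This is an illustrative sketch of the basic idea only, not the optimized b_DBF; names and parameters are assumptions.

```python
class DecayingCounterFilter:
    """Sketch of the basic DBF-style counter decay: counters start at the
    window size N and are decremented on every insertion, so an element
    is considered absent once any of its k counters reaches zero."""

    def __init__(self, window, m=1 << 10, k=4):
        self.window, self.m, self.k = window, m, k
        self.counters = [0] * m

    def _positions(self, item):
        return [hash((i, item)) % self.m for i in range(self.k)]

    def insert(self, item):
        # Decay step: O(m) work on EVERY insertion -- the inefficiency
        # that b_DBF and TCBF try to amortize away.
        self.counters = [c - 1 if c > 0 else 0 for c in self.counters]
        for p in self._positions(item):
            self.counters[p] = self.window

    def query(self, item):
        return all(self.counters[p] > 0 for p in self._positions(item))
```

Each counter must hold values up to $`N`$, which is the space cost the text contrasts with 4-bit counting BFs.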
Unlike most counter based approaches, which guarantee no false negatives, Stable Bloom Filters use counters but are more related to bit-decaying approaches. At each insertion they decrement some random counters (if non-zero) and set $`k`$ counters to some fixed value. They allow false negatives and do not provide control over the expiration age of inserted elements.
Timestamping based approaches
A slight variation on counters is to use integers that remain immutable until expiration, representing the insertion timestamp. These aim to avoid the periodic decrementing over time of counter based approaches.
Timing Bloom Filters store in each of the $`k`$ cells the insertion timestamp, and increase a current-time variable; a query compares the minimum timestamp over the $`k`$ cells with the current time. To keep the integers relatively small, time and timestamps are stored modulo some number greater than the window size. Using, e.g., twice the window size allows just a few entries to be scanned for expiration at each insert. A similar approach is used in Time Bloom Filters , which also introduces Time Interval Bloom Filters, improving the false-positive rate by storing start-end timestamp intervals.
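The timestamp mechanism can be sketched as below. For clarity this sketch stores plain (unbounded) timestamps; the actual structures store them modulo roughly twice the window size, with a few cells lazily expired per insert, as described above. Names and parameters are illustrative assumptions.

```python
class TimingBloomSketch:
    """Sketch of the Timing Bloom Filter idea: each of the k cells holds
    an insertion timestamp; an element is inside the window if the
    minimum of its k timestamps is within the last `window` ticks of a
    logical clock advanced on every insertion."""

    def __init__(self, window, m=1 << 12, k=4):
        self.window, self.m, self.k = window, m, k
        self.stamps = [None] * m   # None marks an empty cell
        self.now = 0               # logical clock, +1 per insertion

    def _positions(self, item):
        return [hash((i, item)) % self.m for i in range(self.k)]

    def insert(self, item):
        self.now += 1
        for p in self._positions(item):
            self.stamps[p] = self.now

    def query(self, item):
        pos = self._positions(item)
        if any(self.stamps[p] is None for p in pos):
            return False
        # The element's effective age is that of its oldest cell.
        return self.now - min(self.stamps[p] for p in pos) < self.window
```

Collisions can only make stored timestamps newer, so recent elements are never falsely expired (no false negatives inside the window).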
The Detached Counting Bloom filter Array (DCBA) addresses the (precise) sliding window model with a segmented architecture having one component per sub-window, using a mix of bits and timestamps. It uses a number of sub-windows on the order of the word size (e.g., 32 or 64), and devotes to each sub-window a filter with precise timestamps ranging over the window, saving bits per timestamp. For the inner sub-windows (which are not undergoing insertion or expiration) it groups the bits of all sub-windows for a given hash position in the same word, allowing efficient queries in $`k`$ iterations (a similar scheme, Group Bloom Filters, was also proposed in ).
Finally, an inferential version of the Timing Bloom Filter allows more sophisticated queries, such as inferring the most likely insertion age of a given element (and not merely whether it is a duplicate).
Discussion
Considering the Sliding Filter problem, bit-decaying approaches are clearly inappropriate, due both to false negatives and to the little control over expiration. Counter based and timestamping based approaches are not memory efficient, as they take more space than a classic counting Bloom filter; they cannot compete with dictionary-based approaches.
Current segmentation based approaches address slack by updating several segments when inserting, to create some overlap between them. But they perform duplicate insertion, using the same hashes, which causes memory inefficiency. Our approach, presented below, is the first segmentation based approach that does not perform duplicate writing, but instead uses different hash functions to write different patterns in different segments, which will be expired at different times.