Recursive n-gram hashing is pairwise independent, at best


📝 Abstract

Many applications use sequences of n consecutive symbols (n-grams). Hashing these n-grams can be a performance bottleneck. For more speed, recursive hash families compute hash values by updating previous values. We prove that recursive hash families cannot be more than pairwise independent. While hashing by irreducible polynomials is pairwise independent, our implementations either run in time O(n) or use an exponential amount of memory. As a more scalable alternative, we make hashing by cyclic polynomials pairwise independent by ignoring n-1 bits. Experimentally, we show that hashing by cyclic polynomials is twice as fast as hashing by irreducible polynomials. We also show that randomized Karp-Rabin hash families are not pairwise independent.


📄 Content

An n-gram is a consecutive sequence of n symbols from an alphabet Σ. An n-gram hash function h maps n-grams to numbers in [0, 2^L). These functions have several applications, from full-text matching [1][2][3], pattern matching [4], and language models [5][6][7][8][9][10][11] to plagiarism detection [12].

To prove that a hashing algorithm must work well, we typically need hash values to satisfy some statistical property. Indeed, a hash function that maps all n-grams to a single integer would not be useful. Yet, a single hash function is deterministic: it maps an n-gram to a single hash value. Thus, we may be able to choose the input data so that the hash values are biased. Therefore, we randomly pick a function from a family H of functions [13].

Such a family H is uniform (over L bits) if all hash values are equiprobable. That is, considering h selected uniformly at random from H, we have P(h(x) = y) = 1/2^L for all n-grams x and all hash values y. This condition is weak; the family of constant functions (h(x) = c) is uniform.

Intuitively, we would want that if an adversary knows the hash value of one n-gram, it cannot deduce anything about the hash value of another n-gram. For example, with the family of constant functions, once we know one hash value, we know them all. The family H is pairwise independent if the hash value of n-gram x_1 is independent from the hash value of any other n-gram x_2. That is, we have P(h(x_1) = y ∧ h(x_2) = z) = P(h(x_1) = y) P(h(x_2) = z) = 1/4^L for all distinct n-grams x_1, x_2 (that is, x_1 ≠ x_2) and all hash values y, z. Pairwise independence implies uniformity. We refer to a particular hash function h ∈ H as “uniform” or “a pairwise independent hash function” when the family in question can be inferred from the context.
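The two definitions above can be checked by brute force on a toy example. The following is a minimal sketch, not taken from the paper: it assumes L = 1, treats the integers 0..3 as stand-ins for n-grams, and uses a hypothetical family h(x) = parity(a AND x) XOR b alongside the family of constant functions.

```python
from fractions import Fraction
from itertools import product

L = 1                  # hash values have L bits (assumption for this toy example)
DOMAIN = range(4)      # stand-ins for n-grams: the integers 0..3

def parity(v):
    """Return 1 if v has an odd number of set bits, else 0."""
    return bin(v).count("1") & 1

# Family 1: the constant functions h(x) = c, one per hash value c.
constant_family = [lambda x, c=c: c for c in range(2 ** L)]

# Family 2 (hypothetical): h(x) = parity(a AND x) XOR b for all 2-bit a, 1-bit b.
affine_family = [lambda x, a=a, b=b: parity(a & x) ^ b
                 for a, b in product(range(4), range(2))]

def is_uniform(family):
    """P(h(x) = y) = 1/2^L for every x and y, with h drawn uniformly."""
    target = Fraction(1, 2 ** L)
    return all(
        Fraction(sum(h(x) == y for h in family), len(family)) == target
        for x in DOMAIN for y in range(2 ** L))

def is_pairwise_independent(family):
    """P(h(x1) = y and h(x2) = z) = 1/4^L for all distinct x1, x2 and all y, z."""
    target = Fraction(1, 4 ** L)
    return all(
        Fraction(sum(h(x1) == y and h(x2) == z for h in family),
                 len(family)) == target
        for x1 in DOMAIN for x2 in DOMAIN if x1 != x2
        for y in range(2 ** L) for z in range(2 ** L))

print(is_uniform(constant_family), is_pairwise_independent(constant_family))
print(is_uniform(affine_family), is_pairwise_independent(affine_family))
```

In this sketch the constant family passes the uniformity check but fails the pairwise check, which matches the remark that knowing one hash value reveals them all.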

Moreover, the idea of pairwise independence can be generalized: a family of hash functions H is k-wise independent if, given distinct x_1, ..., x_k and given h selected uniformly at random from H, then P(h(x_1) = y_1 ∧ ⋯ ∧ h(x_k) = y_k) = 1/2^(kL). Note that k-wise independence implies (k-1)-wise independence and uniformity. (Fully) independent families are k-wise independent for arbitrarily large k. For applications, nonindependent families may fare as well as fully independent families if the entropy of the data source is sufficiently high [16].
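The claim that k-wise independence implies (k-1)-wise independence can be seen by summing out the last hash value. The short derivation below is a sketch that assumes at least k distinct n-grams exist, so that some x_k distinct from x_1, ..., x_{k-1} can be chosen.

```latex
\begin{align*}
P\bigl(h(x_1)=y_1 \wedge \cdots \wedge h(x_{k-1})=y_{k-1}\bigr)
  &= \sum_{y_k=0}^{2^L-1}
     P\bigl(h(x_1)=y_1 \wedge \cdots \wedge h(x_k)=y_k\bigr) \\
  &= 2^L \cdot \frac{1}{2^{kL}}
   = \frac{1}{2^{(k-1)L}} .
\end{align*}
```

Repeating the argument down to a single n-gram gives P(h(x_1) = y_1) = 1/2^L, which is uniformity.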

A hash function h is recursive [17], or rolling [18], if there is a function F computing the hash value of the n-gram x_2 ... x_{n+1} from the hash value of the preceding n-gram (x_1 ... x_n) and the values of x_1 and x_{n+1}. That is, we have h(x_2, ..., x_{n+1}) = F(h(x_1, ..., x_n), x_1, x_{n+1}).

Ideally, we could compute function F in time O(L) and not, for example, in time O(Ln).
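To make the role of F concrete, here is a minimal sketch of a generic polynomial (Karp-Rabin-style) rolling hash. It is an illustration only, not the randomized construction analyzed in the paper: the base B, the word size L = 32, and the n-gram length N are assumptions chosen for the example.

```python
L = 32                     # number of bits in a hash value (assumption)
MASK = (1 << L) - 1
B = 31                     # hypothetical base, fixed for the illustration
N = 5                      # n-gram length (assumption)
B_POW_N = pow(B, N, 1 << L)

def direct_hash(ngram):
    """Hash an n-gram from scratch: sum of x_i * B^(n-i), modulo 2^L."""
    h = 0
    for symbol in ngram:
        h = (h * B + symbol) & MASK
    return h

def roll(h, outgoing, incoming):
    """F(h, x_1, x_{n+1}): update the hash when the window slides by one symbol."""
    return (h * B - outgoing * B_POW_N + incoming) & MASK

data = list(b"recursive hashing")
h = direct_hash(data[:N])
rolled = [h]
for i in range(len(data) - N):
    h = roll(h, data[i], data[i + N])
    rolled.append(h)

direct = [direct_hash(data[i:i + N]) for i in range(len(data) - N + 1)]
print(rolled == direct)
```

Each call to `roll` does a constant number of word operations, whereas `direct_hash` touches all n symbols of the window.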

The main contributions of this paper are:

• a proof that recursive hashing is no more than pairwise independent (§ 3);

• a proof that randomized Karp-Rabin can be uniform but never pairwise independent (§ 5);

• a proof that hashing by irreducible polynomials is pairwise independent (§ 7);

• a proof that hashing by cyclic polynomials is not even uniform (§ 9);

• a proof that hashing by cyclic polynomials is pairwise independent, after ignoring n-1 consecutive bits (§ 10).

We conclude with an experimental section where we show that hashing by cyclic polynomials is faster than hashing by irreducible polynomials. Table 1 summarizes the algorithms presented.

Table 1: A summary of the hashing functions presented and their properties. For GENERAL and CYCLIC, we require L ≥ n. To make CYCLIC pairwise independent, we need to discard some bits; the resulting scheme is not formally recursive. Randomized Karp-Rabin is uniform under some conditions.

name | cost per n-gram | independence | memory use
non-recursive 3-wise (§ 4)

Some randomized algorithms [14,15] merely require that the number of trailing zeroes be independent. For example, to estimate the number of distinct n-grams in a large document without enumerating them, we merely have to compute the maximal number of leading zeroes k among hash values [19]. Naïvely, we may estimate that if a hash value with k leading zeroes is found, we have ≈ 2^k distinct n-grams. Such estimates might be useful because the number of distinct n-grams grows large with n: Shakespeare’s First Folio [20] has over 3 million distinct 15-grams.

Formally, let zeros(x) return the number of trailing zeros (0, 1, ..., L) of x, where zeros(0) = L. We say h is k-wise trailing-zero independent if P(zeros(h(x_1))

If h is k-wise independent, it is k-wise trailing-zero independent. The converse is not true. If h is a k-wise independent function, consider g ∘ h where g makes zero all bits before the rightmost 1 (e.g., g(0101100) = 0000100). Hash g ∘ h is k-wise trailing-zero independent but not even uniform (consider that P(g(h(x)) = 0001) = 8 P(g(h(x)) = 1000)).
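The counterexample can be checked numerically. The sketch below assumes L = 4 and models a uniform h(x) by enumerating all 16 possible 4-bit values once; the bit trick `x & -x` is one way to implement g, chosen here for brevity.

```python
from collections import Counter

L = 4  # number of bits in a hash value (assumption for this example)

def zeros(x):
    """Number of trailing zero bits of x, with zeros(0) = L."""
    if x == 0:
        return L
    return (x & -x).bit_length() - 1

def g(x):
    """Keep only the rightmost 1 bit of x, clearing every other bit."""
    return x & -x

# Enumerating every L-bit value once stands in for a uniformly distributed h(x).
counts = Counter(g(value) for value in range(2 ** L))

print(format(g(0b0101100), "07b"))          # the example from the text
print(counts[0b0001], counts[0b1000])       # how often each output occurs
print(all(zeros(g(v)) == zeros(v) for v in range(2 ** L)))
```

Because g never changes the number of trailing zeros, the trailing-zero statistics of g ∘ h match those of h, while the output values themselves are far from equiprobable.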

Not only are recursive hash functions limited to pairwise independence: they cannot be 3-wise trailing-zero independent.

Proposition 1 There

This content is AI-processed based on ArXiv data.
