recursive n-gram hashing is pairwise independent, at best



Available online at www.sciencedirect.com

Computer Speech and Language 24 (2010) 698–710

www.elsevier.com/locate/csl

Recursive n-gram hashing is pairwise independent, at best

Daniel Lemire a,*, Owen Kaser b

a LICEF, Université du Québec à Montréal (UQAM), 100 Sherbrooke West, Montreal, QC, Canada H2X 3P2

b Dept. of CSAS, University of New Brunswick, 100 Tucker Park Road, Saint John, NB, Canada

Received 23 February 2009; received in revised form 19 August 2009; accepted 3 December 2009

Available online 11 December 2009

Abstract

Many applications use sequences of n consecutive symbols (n-grams). Hashing these n-grams can be a performance bottleneck. For more speed, recursive hash families compute hash values by updating previous values. We prove that recursive hash families cannot be more than pairwise independent. While hashing by irreducible polynomials is pairwise independent, our implementations either run in time $O(n)$ or use an exponential amount of memory. As a more scalable alternative, we make hashing by cyclic polynomials pairwise independent by ignoring $n - 1$ bits. Experimentally, we show that hashing by cyclic polynomials is twice as fast as hashing by irreducible polynomials. We also show that randomized Karp–Rabin hash families are not pairwise independent.

© 2009 Elsevier Ltd. All rights reserved.

Keywords: Rolling hashing; Rabin–Karp hashing; Hashing strings

1. Introduction

An n-gram is a consecutive sequence of n symbols from an alphabet $\Sigma$. An n-gram hash function h maps n-grams to numbers in $[0, 2^L)$. These functions have several applications from full-text matching (Cohen, 1998a,b, 1999), pattern matching (Tan et al., 2006), or language models (Cardenal-Lopez et al., 2002; Zhang and Zhao, 2002; Schwenk, 2007; Li and Zhao, 2007; Talbot and Osborne, 2007a,b; Talbot and Brants, 2008) to plagiarism detection (Ribler and Abrams, 2000).

To prove that a hashing algorithm must work well, we typically need hash values to satisfy some statistical property. Indeed, a hash function that maps all n-grams to a single integer would not be useful. Yet, a single hash function is deterministic: it maps an n-gram to a single hash value. Thus, we may be able to choose the input data so that the hash values are biased. Therefore, we randomly pick a function from a family $\mathcal{H}$ of functions (Carter and Wegman, 1979).

0885-2308/$ - see front matter © 2009 Elsevier Ltd. All rights reserved.
doi:10.1016/j.csl.2009.12.001

* Corresponding author. Tel.: +1 514 987 3000x2835; fax: +1 514 843 2160.
E-mail addresses: [email protected] (D. Lemire), [email protected] (O. Kaser).


Table 1
A summary of the hashing functions presented and their properties. For GENERAL and CYCLIC, we require $L \geq n$. To make CYCLIC pairwise independent, we need to discard some bits; the resulting scheme is not formally recursive. Randomized Karp–Rabin is uniform under some conditions.

Name                                 Cost per n-gram                    Independence           Memory use
Non-recursive 3-wise (Section 4)     $O(Ln)$                            3-wise                 $O(nL|\Sigma|)$
Randomized Karp–Rabin (Section 5)    $O(L \log L \, 2^{O(\log^* L)})$   Uniform                $O(L|\Sigma|)$
GENERAL (Section 7)                  $O(Ln)$                            Pairwise               $O(L|\Sigma|)$
RAM-Buffered GENERAL (Section 8)     $O(L)$                             Pairwise               $O(L|\Sigma| + L2^n)$
CYCLIC (Section 9)                   $O(L + n)$                         Pairwise (Section 10)  $O((L + n)|\Sigma|)$

Such a family $\mathcal{H}$ is uniform (over L-bits) if all hash values are equiprobable. That is, considering h selected uniformly at random from $\mathcal{H}$, we have $P(h(x) = y) = 1/2^L$ for all n-grams x and all hash values y. This condition is weak; the family of constant functions ($h(x) = c$) is uniform.¹

Intuitively, we would want that if an adversary knows the hash value of one n-gram, it cannot deduce anything about the hash value of another n-gram. For example, with the family of constant functions, once we know one hash value, we know them all. The family $\mathcal{H}$ is pairwise independent if the hash value of n-gram $x_1$ is independent from the hash value of any other n-gram $x_2$. That is, we have $P(h(x_1) = y \wedge h(x_2) = z) = P(h(x_1) = y)\,P(h(x_2) = z) = 1/4^L$ for all n-grams $x_1, x_2$, and all hash values y, z with $x_1 \neq x_2$. Pairwise independence implies uniformity. We refer to a particular hash function $h \in \mathcal{H}$ as "uniform" or "a pairwise independent hash function" when the family in question can be inferred from the context.

Moreover, the idea of pairwise independence can be generalized: a family of hash functions $\mathcal{H}$ is k-wise independent if given distinct $x_1, \ldots, x_k$ and given h selected uniformly at random from $\mathcal{H}$, then $P(h(x_1) = y_1 \wedge \cdots \wedge h(x_k) = y_k) = 1/2^{kL}$. Note that k-wise independence implies (k − 1)-wise independence and uniformity. (Fully) independent families are k-wise independent for arbitrarily large k. For applications, non-independent families may fare as well as fully independent families if the entropy of the data source is sufficiently high (Mitzenmacher and Vadhan, 2008).
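The definitions of uniformity and pairwise independence can be checked by brute force on tiny families. The following Python sketch is only an illustration: the three-symbol domain, the choice L = 2, and the helper names `is_uniform` and `is_pairwise_independent` are assumptions made for the example, not part of the paper. Each hash function is represented as a dictionary from symbols to values.

```python
from itertools import product

def is_uniform(family, domain, num_values):
    """True if P(h(x) = y) = 1/num_values for all x, y (h drawn uniformly from family)."""
    for x in domain:
        for y in range(num_values):
            count = sum(1 for h in family if h[x] == y)
            if count * num_values != len(family):
                return False
    return True

def is_pairwise_independent(family, domain, num_values):
    """True if P(h(x1) = y and h(x2) = z) = 1/num_values**2 for all distinct x1, x2."""
    for x1, x2 in product(domain, repeat=2):
        if x1 == x2:
            continue
        for y, z in product(range(num_values), repeat=2):
            count = sum(1 for h in family if h[x1] == y and h[x2] == z)
            if count * num_values ** 2 != len(family):
                return False
    return True

DOMAIN = ["a", "b", "c"]   # hypothetical tiny domain
NUM_VALUES = 4             # hash values in [0, 2^L) with L = 2

# The family of constant functions h(x) = c, one function per value c.
constant_family = [{x: c for x in DOMAIN} for c in range(NUM_VALUES)]

# The family of all functions from DOMAIN to [0, 4): fully independent.
all_functions = [dict(zip(DOMAIN, values))
                 for values in product(range(NUM_VALUES), repeat=len(DOMAIN))]
```

Under these assumptions, the constant family passes the uniformity check but fails the pairwise check, whereas the family of all functions passes both.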

A hash function h is recursive (Cohen, 1997), or rolling (Schleimer et al., 2003), if there is a function F computing the hash value of the n-gram $x_2, \ldots, x_{n+1}$ from the hash value of the preceding n-gram $(x_1, \ldots, x_n)$ and the values of $x_1$ and $x_{n+1}$. That is, we have

$$h(x_2, \ldots, x_{n+1}) = F(h(x_1, \ldots, x_n), x_1, x_{n+1}).$$

Ideally, we could compute function F in time $O(L)$ and not, for example, in time $O(Ln)$.

The main contributions of this paper are:

• a proof that recursive hashing is no more than pairwise independent (Section 3);
• a proof that randomized Karp–Rabin can be uniform but never pairwise independent (Section 5);
• a proof that hashing by irreducible polynomials is pairwise independent (Section 7);
• a proof that hashing by cyclic polynomials is not even uniform (Section 9);
• a proof that hashing by cyclic polynomials is pairwise independent, after ignoring $n - 1$ consecutive bits (Section 10).

We conclude with an experimental section where we show that hashing by cyclic polynomials is faster than hashing by irreducible polynomials. Table 1 summarizes the algorithms presented.

2. Trailing-zero independence

Some randomized algorithms (Flajolet and Martin, 1985; Gibbons and Tirthapura, 2001) merely require that the number of trailing zeroes be independent. For example, to estimate the number of distinct n-grams in a large document without enumerating them, we merely have to compute the maximal number of leading zeroes k among hash values (Durand and Flajolet, 2003). Naïvely, we may estimate that if a hash value with k leading zeroes is found, we have $\approx 2^k$ distinct n-grams. Such estimates might be useful because the number of distinct n-grams grows large with n: Shakespeare's First Folio (Project Gutenberg Literary Archive Foundation, 2009-08-0) has over 3 million distinct 15-grams.

¹ We omit families uniform over an arbitrary interval $[0, b)$ that is not of the form $[0, 2^L)$. Indeed, several applications (Flajolet and Martin, 1985; Gibbons and Tirthapura, 2001) require uniformity over L-bits.

Formally, let $\mathrm{zeros}(x)$ return the number of trailing zeros $(0, 1, \ldots, L)$ of x, where $\mathrm{zeros}(0) = L$. We say h is k-wise trailing-zero independent if

$$P(\mathrm{zeros}(h(x_1)) \geq j_1 \wedge \mathrm{zeros}(h(x_2)) \geq j_2 \wedge \cdots \wedge \mathrm{zeros}(h(x_k)) \geq j_k) = 2^{-j_1 - j_2 - \cdots - j_k}, \quad \text{for } j_i = 0, 1, \ldots, L.$$

If h is k-wise independent, it is k-wise trailing-zero independent. The converse is not true. If h is a k-wise independent function, consider $g \circ h$, where g makes zero all bits before the rightmost 1 (e.g., $g(0101100) = 0000100$). Hash $g \circ h$ is k-wise trailing-zero independent but not even uniform (consider that $P(g = 0001) = 8\,P(g = 1000)$).
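As a small illustration of the two functions just described, here is a Python sketch of $\mathrm{zeros}$ and of the map g that keeps only the rightmost 1 bit. The function names and the use of Python integers to stand for L-bit values are assumptions of this example.

```python
def zeros(x, L):
    """Number of trailing zero bits of the L-bit value x, with zeros(0) = L."""
    if x == 0:
        return L
    # x & -x isolates the rightmost 1 bit; its position is the trailing-zero count.
    return (x & -x).bit_length() - 1

def g(x):
    """Set to zero every bit except the rightmost 1 bit of x."""
    return x & -x

def count_preimages(target, L):
    """How many L-bit inputs v satisfy g(v) = target."""
    return sum(1 for v in range(1 << L) if g(v) == target)
```

For L = 4, `count_preimages(0b0001, 4)` counts the eight odd values while `count_preimages(0b1000, 4)` counts a single value, which is the factor of 8 mentioned in the text when the input of g is uniform.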

3. Recursive hash functions are no more than pairwise independent

Not only are recursive hash functions limited to pairwise independence: they cannot be 3-wise trailing-zero independent.

Proposition 1. There is no 3-wise trailing-zero independent hashing function that is recursive.

Proof. Consider the $(n + 2)$-gram $a^n bb$. Suppose h is recursive and 3-wise trailing-zero independent, then

$$\begin{aligned}
& P\big(\mathrm{zeros}(h(a, \ldots, a)) \geq L \wedge \mathrm{zeros}(h(a, \ldots, a, b)) \geq L \wedge \mathrm{zeros}(h(a, \ldots, a, b, b)) \geq L\big) \\
&= P\big(h(a, \ldots, a) = 0 \wedge F(0, a, b) = 0 \wedge F(0, a, b) = 0\big) \\
&= P\big(h(a, \ldots, a) = 0 \wedge F(0, a, b) = 0\big) \\
&= P\big(\mathrm{zeros}(h(a, \ldots, a)) \geq L \wedge \mathrm{zeros}(h(a, \ldots, a, b)) \geq L\big) \\
&= 2^{-2L} \quad \text{by trailing-zero pairwise independence} \\
&\neq 2^{-3L} \quad \text{as required by trailing-zero 3-wise independence.}
\end{aligned}$$

Hence, we have a contradiction and no such h exists. □

4. A non-recursive 3-wise independent hash function

A trivial way to generate an independent hash is to assign a random integer in $[0, 2^L)$ to each new value x. Unfortunately, this requires as much processing and storage as a complete indexing of all values.

However, in a multidimensional setting this approach can be put to good use. Suppose that we have tuples in $K_1 \times K_2 \times \cdots \times K_n$ such that $|K_i|$ is small for all i. We can construct independent hash functions $h_i : K_i \to [0, 2^L)$ for all i and combine them. The hash function $h(x_1, x_2, \ldots, x_n) = h_1(x_1) \oplus h_2(x_2) \oplus \cdots \oplus h_n(x_n)$ is then 3-wise independent ($\oplus$ is the "exclusive or" function, XOR). In time $O\left(\sum_{i=1}^{n} |K_i|\right)$, we can construct the hash function by generating $\sum_{i=1}^{n} |K_i|$ random numbers and storing them in a look-up table. With constant-time look-up, hashing an n-gram thus takes $O(Ln)$ time. Algorithm 1 is an application of this idea to n-grams.


Algorithm 1. The (non-recursive) 3-wise independent family.

Require: n L-bit hash functions $h_1, h_2, \ldots, h_n$ over $\Sigma$ from an independent hash family

    s ← empty FIFO structure
    for each character c do
        append c to s
        if length(s) = n then
            yield $h_1(s_1) \oplus h_2(s_2) \oplus \cdots \oplus h_n(s_n)$
            {The yield statement returns the value, without terminating the algorithm.}
            remove oldest character from s
        end if
    end for

This new family is not 4-wise independent for $n > 1$. Consider the n-grams ac, ad, bc, bd. The XOR of their four hash values is zero. However, the family is 3-wise independent.
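The following Python sketch illustrates Algorithm 1 and the four-n-gram counterexample. It is a minimal sketch under stated assumptions: the per-position functions $h_1, \ldots, h_n$ are modelled as random look-up tables built with Python's `random` module, and the names `make_tables` and `three_wise_hashes` are hypothetical.

```python
import random
from collections import deque

def make_tables(n, alphabet, L, rng):
    """One random L-bit look-up table per position, standing in for h_1, ..., h_n."""
    return [{c: rng.getrandbits(L) for c in alphabet} for _ in range(n)]

def three_wise_hashes(text, n, tables):
    """Yield h_1(s_1) XOR ... XOR h_n(s_n) for each n-gram s of text."""
    window = deque(maxlen=n)  # FIFO: the oldest character is dropped automatically
    for c in text:
        window.append(c)
        if len(window) == n:
            value = 0
            for table, symbol in zip(tables, window):
                value ^= table[symbol]
            yield value

def hash_of(gram, tables):
    """Hash of a single n-gram (the text is exactly one window long)."""
    return next(three_wise_hashes(gram, len(gram), tables))

rng = random.Random(2009)  # fixed seed, only for reproducibility
tables = make_tables(2, "abcd", 16, rng)
xor_of_four = (hash_of("ac", tables) ^ hash_of("ad", tables)
               ^ hash_of("bc", tables) ^ hash_of("bd", tables))
```

Whatever the random tables are, `xor_of_four` is zero because every table entry appears exactly twice in the XOR, so the fourth hash value is determined by the other three.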

Proposition 2. The family of hash functions $h(x) = h_1(x_1) \oplus h_2(x_2) \oplus \cdots \oplus h_n(x_n)$, where the L-bit hash functions $h_1, \ldots, h_n$ are taken from an independent hash family, is 3-wise independent.

Proof. Consider any three distinct n-grams: $x^{(1)} = x^{(1)}_1 \cdots x^{(1)}_n$, $x^{(2)} = x^{(2)}_1 \cdots x^{(2)}_n$, and $x^{(3)} = x^{(3)}_1 \cdots x^{(3)}_n$. Because the n-grams are distinct, at least one of two possibilities holds:

Case A. For some $i \in \{1, \ldots, n\}$, the three values $x^{(1)}_i, x^{(2)}_i, x^{(3)}_i$ are distinct. Write $v_j = h_i(x^{(j)}_i)$ for $j = 1, 2, 3$. For example, consider the three 1-grams: a, b, c.

Case B. (Up to a reordering of the three n-grams.) There are two values $i, j \in \{1, \ldots, n\}$ such that $x^{(1)}_i$ is distinct from the two identical values $x^{(2)}_i, x^{(3)}_i$, and such that $x^{(2)}_j$ is distinct from the two identical values $x^{(1)}_j, x^{(3)}_j$. Write $v_1 = h_i(x^{(1)}_i)$, $v_2 = h_j(x^{(2)}_j)$, and $v_3 = h_i(x^{(3)}_i)$. For example, consider the three 2-grams: ad, bc, bd.

Recall that the XOR operation is invertible: $a \oplus b = c$ if and only if $a = b \oplus c$.

We prove 3-wise independence for cases A and B.

Case A. Write $f^{(i)} = h(x^{(i)}) \oplus v_i$ for $i = 1, 2, 3$. We have that the values $v_1, v_2, v_3$ are mutually independent, and they are independent from the values $f^{(1)}, f^{(2)}, f^{(3)}$:²

$$P\left(\bigwedge_{i=1}^{3} v_i = y_i \wedge \bigwedge_{i=1}^{3} f^{(i)} = y'_i\right) = \prod_{i=1}^{3} P(v_i = y_i) \; P\left(\bigwedge_{i=1}^{3} f^{(i)} = y'_i\right)$$

for all values $y_i, y'_i$. Hence, we have

$$\begin{aligned}
& P\big(h(x^{(1)}) = z^{(1)} \wedge h(x^{(2)}) = z^{(2)} \wedge h(x^{(3)}) = z^{(3)}\big) \\
&= P\big(v_1 = z^{(1)} \oplus f^{(1)} \wedge v_2 = z^{(2)} \oplus f^{(2)} \wedge v_3 = z^{(3)} \oplus f^{(3)}\big) \\
&= \sum_{\eta, \eta', \eta''} P\big(v_1 = z^{(1)} \oplus \eta \wedge v_2 = z^{(2)} \oplus \eta' \wedge v_3 = z^{(3)} \oplus \eta''\big) \times P\big(f^{(1)} = \eta \wedge f^{(2)} = \eta' \wedge f^{(3)} = \eta''\big) \\
&= \sum_{\eta, \eta', \eta''} \frac{1}{2^{3L}} \, P\big(f^{(1)} = \eta \wedge f^{(2)} = \eta' \wedge f^{(3)} = \eta''\big) \\
&= \frac{1}{2^{3L}}.
\end{aligned}$$

² The values $f^{(1)}, f^{(2)}, f^{(3)}$ are not necessarily mutually independent.

Thus, in this case, the hash values are 3-wise independent.

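Case B. Write $f^{(1)} = h(x^{(1)}) \oplus v_1$, $f^{(2)} = h(x^{(2)}) \oplus v_2 \oplus v_3$, $f^{(3)} = h(x^{(3)}) \oplus v_3$. Again, the values $v_1, v_2, v_3$ are mutually independent, and independent from the values $f^{(1)}, f^{(2)}, f^{(3)}$. We have

$$\begin{aligned}
& P\big(h(x^{(1)}) = z^{(1)} \wedge h(x^{(2)}) = z^{(2)} \wedge h(x^{(3)}) = z^{(3)}\big) \\
&= P\big(v_1 = z^{(1)} \oplus f^{(1)} \wedge v_2 \oplus v_3 = z^{(2)} \oplus f^{(2)} \wedge v_3 = z^{(3)} \oplus f^{(3)}\big) \\
&= P\big(v_1 = z^{(1)} \oplus f^{(1)} \wedge v_2 = z^{(2)} \oplus f^{(2)} \oplus z^{(3)} \oplus f^{(3)} \wedge v_3 = z^{(3)} \oplus f^{(3)}\big) \\
&= \sum_{\eta, \eta', \eta''} P\big(v_1 = z^{(1)} \oplus \eta \wedge v_2 = z^{(2)} \oplus z^{(3)} \oplus \eta' \oplus \eta'' \wedge v_3 = z^{(3)} \oplus \eta''\big) \times P\big(f^{(1)} = \eta \wedge f^{(2)} = \eta' \wedge f^{(3)} = \eta''\big) \\
&= \sum_{\eta, \eta', \eta''} \frac{1}{2^{3L}} \, P\big(f^{(1)} = \eta \wedge f^{(2)} = \eta' \wedge f^{(3)} = \eta''\big) \\
&= \frac{1}{2^{3L}}.
\end{aligned}$$

This concludes the proof. □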

5. Randomized Karp–Rabin is not independent

One of the most common recursive hash functions is commonly associated with the Karp–Rabin string-matching algorithm (Karp and Rabin, 1987). Given an integer B, the hash value over the sequence of integers $x_1, x_2, \ldots, x_n$ is $\sum_{i=1}^{n} x_i B^{n-i}$. A variation of the Karp–Rabin hash method is "Hashing by Power-of-2 Integer Division" (Cohen, 1997), where $h(x_1, \ldots, x_n) = \sum_{i=1}^{n} x_i B^{n-i} \bmod 2^L$. In particular, the hashcode method of the Java String class uses this approach, with $L = 32$ and $B = 31$ (Sun Microsystems, 2004). A widely used textbook (Weiss, 1999, p. 157) recommends a similar Integer-Division hash function for strings with $B = 37$.

Since such Integer-Division hash functions are recursive, quickly computed, and widely used, it is interesting to seek a randomized version of them. Assume that $h_1$ is a random hash function over symbols uniform in $[0, 2^L)$, then define $h(x_1, \ldots, x_n) = B^{n-1} h_1(x_1) + B^{n-2} h_1(x_2) + \cdots + h_1(x_n) \bmod 2^L$ for some fixed integer B. We choose $B = 37$ (calling the resulting randomized hash "ID37"; see Algorithm 2). Our algorithm computes each hash value in time $O(M(L))$, where $M(L)$ is the cost of multiplying two L-bit integers. (We precompute the value $B^n \bmod 2^L$.) In many practical cases, L bits can fit into a single machine word and the cost of multiplication can be considered constant. In general, $M(L)$ is in $O(L \log L \, 2^{O(\log^* L)})$ (Fürer, 2007).

Algorithm 2. The recursive ID37 family (Randomized Karp–Rabin).

Require: an L-bit hash function $h_1$ over $\Sigma$ from an independent hash family

    B ← 37
    s ← empty FIFO structure
    x ← 0 (L-bit integer)
    z ← 0 (L-bit integer)
    for each character c do
        append c to s
        x ← $Bx - B^n z + h_1(c) \bmod 2^L$
        if length(s) = n then
            yield x
            remove oldest character y from s
            z ← $h_1(y)$
        end if
    end for
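A possible Python rendering of Algorithm 2 is sketched below. It assumes that $h_1$ is given as a dictionary from symbols to L-bit integers (here filled with Python's `random` module as a stand-in for an independent family); the names `id37_hashes`, `id37_direct` and `random_symbol_hash` are hypothetical. The direct version recomputes each value from scratch and serves only as a cross-check of the recursive update.

```python
import random
from collections import deque

def random_symbol_hash(alphabet, L, seed):
    """Stand-in for h_1: one random L-bit value per symbol."""
    rng = random.Random(seed)
    return {c: rng.getrandbits(L) for c in alphabet}

def id37_hashes(text, n, h1, L, B=37):
    """Yield the ID37 hash of each n-gram of text, updating the previous value."""
    mask = (1 << L) - 1
    Bn = pow(B, n, 1 << L)        # precomputed B^n mod 2^L
    window = deque()
    x = 0
    z = 0                          # h_1 of the character that just left the window
    for c in text:
        window.append(c)
        x = (B * x - Bn * z + h1[c]) & mask
        if len(window) == n:
            yield x
            z = h1[window.popleft()]

def id37_direct(gram, h1, L, B=37):
    """Non-recursive evaluation of sum_i B^(n-i) h_1(x_i) mod 2^L (Horner's rule)."""
    value = 0
    for c in gram:
        value = (B * value + h1[c]) % (1 << L)
    return value
```

For instance, with `h1 = random_symbol_hash("abcd", 32, seed=1)`, the values produced by `id37_hashes("abcdabca", 3, h1, 32)` agree with `id37_direct` applied to each 3-gram in turn.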

The randomized Integer-Division functions mapping n-grams to $[0, 2^L)$ are not pairwise independent. However, they are uniform under some conditions, as the next proposition makes precise.

Proposition 3. Randomized Integer-Division hashing with B odd is not uniform for n-grams, if n is even. Otherwise, it is uniform, but not pairwise independent.

Proof. For B odd, we see that $P(h(a^{2k}) = 0) > 2^{-L}$ since $h(a^{2k}) = h_1(a)\big(B^0(1 + B) + B^2(1 + B) + \cdots + B^{2k-2}(1 + B)\big) \bmod 2^L$ and since $(1 + B)$ is even, we have $P(h(a^{2k}) = 0) \geq P(h_1(a) = 2^{L-1} \vee h_1(a) = 0) = 1/2^{L-1}$. Hence, for B odd and n even, we do not have uniformity.

For the rest of the result, we begin with $n = 2$ and B even. If $x_1 \neq x_2$, then

$$\begin{aligned}
P(h(x_1, x_2) = y) &= P(B h_1(x_1) + h_1(x_2) = y \bmod 2^L) \\
&= \sum_z P(h_1(x_2) = y - Bz \bmod 2^L)\, P(h_1(x_1) = z) \\
&= \sum_z P(h_1(x_2) = y - Bz \bmod 2^L)/2^L = 1/2^L,
\end{aligned}$$

whereas $P(h(x_1, x_1) = y) = P((B + 1) h_1(x_1) = y \bmod 2^L) = 1/2^L$ since $(B + 1)x = y \bmod 2^L$ has a unique solution x when B is even. Therefore h is uniform. This argument can be extended for any value of n and for n odd, B even.

To show it is not pairwise independent, first suppose that B is odd. For any string $\beta$ of length $n - 2$, consider n-grams $w_1 = \beta aa$ and $w_2 = \beta bb$ for distinct $a, b \in \Sigma$. Then

$$\begin{aligned}
P(h(w_1) = h(w_2)) &= P\big(B^2 h(\beta) + B h_1(a) + h_1(a) = B^2 h(\beta) + B h_1(b) + h_1(b) \bmod 2^L\big) \\
&= P\big((1 + B)(h_1(a) - h_1(b)) \bmod 2^L = 0\big) \\
&\geq P(h_1(a) - h_1(b) = 0) + P(h_1(a) - h_1(b) = 2^{L-1}).
\end{aligned}$$

Because $h_1$ is independent, $P(h_1(a) - h_1(b) = 0) = \sum_{c \in [0, 2^L)} P(h_1(a) = c)\, P(h_1(b) = c) = \sum_{c \in [0, 2^L)} 1/4^L = 1/2^L$. Moreover, $P(h_1(a) - h_1(b) = 2^{L-1}) > 0$. Thus, we have that $P(h(w_1) = h(w_2)) > 1/2^L$, which contradicts pairwise independence.

Second, if B is even, a similar argument shows $P(h(w_3) = h(w_4)) > 1/2^L$, where $w_3 = \beta aa$ and $w_4 = \beta ba$:

$$\begin{aligned}
P(h(a, a) = h(b, a)) &= P\big(B h_1(a) + h_1(a) = B h_1(b) + h_1(a) \bmod 2^L\big) \\
&= P\big(B(h_1(a) - h_1(b)) \bmod 2^L = 0\big) \\
&\geq P(h_1(a) - h_1(b) = 0) + P(h_1(a) - h_1(b) = 2^{L-1}) > 1/2^L.
\end{aligned}$$

This argument can be extended for any value of B and n. □

A weaker condition than pairwise independence is 2-universality: a family is 2-universal if $P(h(x_1) = h(x_2)) \leq 1/2^L$ (Mitzenmacher and Vadhan, 2008). As a consequence of this proof, Randomized Integer-Division is not even 2-universal.

These results also hold for any Integer-Division hash where the modulo is by an even number, not necessarily a power of 2.
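The two counting arguments of this section can be verified exhaustively for small L. The sketch below is an illustration under assumptions: it takes $n = 2$, treats $h_1(a)$ and $h_1(b)$ as independent uniform values in $[0, 2^L)$, and uses hypothetical function names. It counts how often $h(a, a) = 0$ and how often $h(a, a) = h(b, b)$.

```python
def zero_count_repeated_pair(L, B):
    """(count, total): number of values v = h_1(a) with (B + 1) v = 0 mod 2^L."""
    total = 1 << L
    count = sum(1 for v in range(total) if ((B + 1) * v) % total == 0)
    return count, total

def collision_count(L, B):
    """(count, total): pairs (h_1(a), h_1(b)) for which h(a, a) = h(b, b) mod 2^L."""
    total = 1 << L
    count = sum(1
                for va in range(total)
                for vb in range(total)
                if ((B + 1) * (va - vb)) % total == 0)
    return count, total * total
```

With $L = 6$ and $B = 37$, the value 0 is reached by 2 of the 64 possible values of $h_1(a)$, twice what uniformity allows, and the two 2-grams collide for 128 of the 4096 pairs, a probability of $2/2^L$. With an even base such as $B = 2$, only $h_1(a) = 0$ gives $h(a, a) = 0$.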

6. Generating hash families from polynomials over Galois fields

A practical form of hashing using the binary Galois field GF(2) is called "Recursive Hashing by Polynomials" and has been attributed to Kubina by Cohen (1997). GF(2) contains only two values (1 and 0) with the addition (and hence subtraction) defined by XOR, $a + b = a \oplus b$, and the multiplication by AND, $a \times b = a \wedge b$. $GF(2)[x]$ is the vector space of all polynomials with coefficients from GF(2). Any integer in binary form (e.g., $c = 1101$) can thus be interpreted as an element of $GF(2)[x]$ (e.g., $c = x^3 + x^2 + 1$). If $p(x) \in GF(2)[x]$, then $GF(2)[x]/p(x)$ can be thought of as $GF(2)[x]$ modulo $p(x)$. As an example, if $p(x) = x^2$, then $GF(2)[x]/p(x)$ is the set of all linear polynomials. For instance, $x^3 + x^2 + x + 1 = x + 1 \bmod x^2$ since, in $GF(2)[x]$, $(x + 1) + x^2(x + 1) = x^3 + x^2 + x + 1$.

As a summary, we compute operations over $GF(2)[x]/p(x)$, where $p(x)$ is of degree L, as follows:

• the polynomial $\sum_{i=0}^{L-1} q_i x^i$ is represented as the L-bit integer $\sum_{i=0}^{L-1} q_i 2^i$;
• subtraction or addition of two polynomials is the XOR of their L-bit integers;
• multiplication of a polynomial $\sum_{i=0}^{L-1} q_i x^i$ by the monomial x is represented either as $\sum_{i=0}^{L-1} q_i x^{i+1}$ if $q_{L-1} = 0$ or as $p(x) + \sum_{i=0}^{L-1} q_i x^{i+1}$ otherwise. In other words, if the value of the last bit is 0, we merely apply a binary left shift, otherwise, we apply a binary left shift immediately followed by an XOR with the integer representing $p(x)$. In either case, we get an L-bit integer.

Hence, merely with the XOR operation, the binary left shift, and a way to evaluate the value of the last bit, we can compute all necessary operations over $GF(2)[x]/p(x)$ using integers.
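The integer representation just summarized can be sketched in a few lines of Python. This is an illustrative sketch: the function names are hypothetical, and the example modulus $p(x) = x^3 + x + 1$ (the integer `0b1011`, with $L = 3$) is chosen here only because it is small; it does not come from the paper.

```python
def poly_add(a, b):
    """Addition (and subtraction) in GF(2)[x]: XOR of the integer representations."""
    return a ^ b

def mul_by_x(q, p, L):
    """Multiply the L-bit polynomial q by x modulo p, where p has degree L.

    p is given as an (L + 1)-bit integer whose bit L is set.
    """
    q <<= 1                    # binary left shift: every x^i becomes x^(i+1)
    if (q >> L) & 1:           # the coefficient q_(L-1) was 1: a term x^L appeared
        q ^= p                 # subtracting p(x) cancels x^L and leaves L bits
    return q

P3 = 0b1011  # x^3 + x + 1, an example modulus of degree L = 3 (assumption of this sketch)
```

For example, `mul_by_x(0b100, P3, 3)` multiplies $x^2$ by x; the result $x^3$ reduces to $x + 1$, that is, `0b011`.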

Consider a hash function $h_1$ over characters taken from some independent family. Interpreting $h_1$ hash values as polynomials in $GF(2)[x]/p(x)$, and with the condition that $\mathrm{degree}(p(x)) \geq n$, we define a hash function as $h(a_1, a_2, \ldots, a_n) = h_1(a_1)x^{n-1} + h_1(a_2)x^{n-2} + \cdots + h_1(a_n)$. It is recursive over the sequence $h_1(a_i)$. The combined hash can be computed by reusing previous hash values:

$$h(a_2, a_3, \ldots, a_{n+1}) = x\,h(a_1, a_2, \ldots, a_n) - h_1(a_1)x^n + h_1(a_{n+1}).$$

Depending on the choice of the polynomial $p(x)$ we get different hashing schemes, including GENERAL and CYCLIC, which are presented in the next two sections.

7. Recursive hashing by irreducible polynomials is pairwise independent

Algorithm 3. The recursive GENERAL family.

Require: an L-bit hash function $h_1$ over $\Sigma$ from an independent hash family; an irreducible polynomial p of degree L in $GF(2)[x]$

    s ← empty FIFO structure
    x ← 0 (L-bit integer)
    z ← 0 (L-bit integer)
    for each character c do
        append c to s
        x ← shift(x)
        z ← shift^n(z)
        x ← $x \oplus z \oplus h_1(c)$
        if length(s) = n then
            yield x
            remove oldest character y from s
            z ← $h_1(y)$
        end if
    end for

    function shift
        input L-bit integer x
        shift x left by 1 bit, storing result in an (L + 1)-bit integer x'
        if leftmost bit of x' is 1 then
            x' ← $x' \oplus p$
        end if
        {leftmost bit of x' is thus always 0}
        return rightmost L bits of x'
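Algorithm 3 might be written in Python as follows. This is a sketch under assumptions: $h_1$ is passed as a dictionary from symbols to L-bit integers, the polynomial p is passed as an (L + 1)-bit integer, and the names `general_hashes` and `general_direct` are hypothetical. The constant `P10` encodes $1 + x^3 + x^{10}$, the degree-10 entry of Table 2. The direct evaluation is included only to cross-check the recursive update.

```python
from collections import deque

P10 = (1 << 10) | (1 << 3) | 1   # 1 + x^3 + x^10 as an 11-bit integer

def shift(x, p, L):
    """Multiply x by the monomial x in GF(2)[x]/p(x); p has degree L."""
    x <<= 1
    if (x >> L) & 1:     # leftmost bit of the (L + 1)-bit result is 1
        x ^= p           # reduce modulo p; the leftmost bit becomes 0
    return x

def general_hashes(text, n, h1, p, L):
    """Yield the GENERAL hash of each n-gram of text (recursive computation)."""
    window = deque()
    x = 0
    z = 0
    for c in text:
        window.append(c)
        x = shift(x, p, L)
        for _ in range(n):           # z <- shift^n(z); this loop is the O(Ln) cost
            z = shift(z, p, L)
        x ^= z ^ h1[c]
        if len(window) == n:
            yield x
            z = h1[window.popleft()]

def general_direct(gram, h1, p, L):
    """Evaluate h_1(a_1) x^(n-1) + ... + h_1(a_n) modulo p by Horner's rule."""
    value = 0
    for c in gram:
        value = shift(value, p, L) ^ h1[c]
    return value
```

The inner loop applying `shift` n times to z is what makes this version cost $O(Ln)$ per n-gram; the two functions should agree on every n-gram of the input.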

We can choose $p(x)$ to be an irreducible polynomial of degree L in $GF(2)[x]$: an irreducible polynomial cannot be factored into nontrivial polynomials (see Table 2). The resulting hash is called GENERAL (see Algorithm 3). The main benefit of setting $p(x)$ to be an irreducible polynomial is that $GF(2)[x]/p(x)$ is a field; in particular, it is impossible that $p_1(x)p_2(x) = 0 \bmod p(x)$ unless either $p_1(x) = 0$ or $p_2(x) = 0$. The field property allows us to prove that the hash function is pairwise independent.

Table 2
Some irreducible polynomials over $GF(2)[x]$.

Degree    Polynomial
10        $1 + x^3 + x^{10}$
15        $1 + x + x^{15}$
20        $1 + x^3 + x^{20}$
25        $1 + x^3 + x^{25}$
30        $1 + x + x^4 + x^6 + x^{30}$


Lemma 1. GENERAL is pairwise independent.

Proof. If $p(x)$ is irreducible, then any non-zero $q(x) \in GF(2)[x]/p(x)$ has an inverse, noted $q^{-1}(x)$, since $GF(2)[x]/p(x)$ is a field. Interpret hash values as polynomials in $GF(2)[x]/p(x)$.

Firstly, we prove that GENERAL is uniform. In fact, we show a stronger result: $P(q_1(x)h_1(a_1) + q_2(x)h_1(a_2) + \cdots + q_n(x)h_1(a_n) = y) = 1/2^L$ for any polynomials $q_i$ where at least one is different from zero. The result follows by induction on the number of non-zero polynomials: it is clearly true where there is a single non-zero polynomial $q_i(x)$, since $q_i(x)h_1(a_i) = y \iff q_i^{-1}(x)q_i(x)h_1(a_i) = q_i^{-1}(x)y$. Suppose it is true up to $k - 1$ non-zero polynomials and consider a case where we have k non-zero polynomials. Assume without loss of generality that $q_1(x) \neq 0$, we have

$$\begin{aligned}
& P(q_1(x)h_1(a_1) + q_2(x)h_1(a_2) + \cdots + q_n(x)h_1(a_n) = y) \\
&= P\big(h_1(a_1) = q_1^{-1}(x)(y - q_2(x)h_1(a_2) - \cdots - q_n(x)h_1(a_n))\big) \\
&= \sum_{y'} P\big(h_1(a_1) = q_1^{-1}(x)(y - y')\big)\, P\big(q_2(x)h_1(a_2) + \cdots + q_n(x)h_1(a_n) = y'\big) \\
&= \sum_{y'} \frac{1}{2^L} \frac{1}{2^L} = \frac{1}{2^L}
\end{aligned}$$

by the induction argument. Hence the uniformity result is shown.

Consider two distinct sequences $a_1, a_2, \ldots, a_n$ and $a'_1, a'_2, \ldots, a'_n$. Write $H_a = h(a_1, a_2, \ldots, a_n)$ and $H_{a'} = h(a'_1, a'_2, \ldots, a'_n)$. We have that $P(H_a = y \wedge H_{a'} = y') = P(H_a = y \mid H_{a'} = y')\, P(H_{a'} = y')$. Hence, to prove pairwise independence, it suffices to show that $P(H_a = y \mid H_{a'} = y') = 1/2^L$.

Suppose that $a_i = a'_j$ for some i, j; if not, the result follows since by the (full) independence of the hashing function $h_1$, the values $H_a$ and $H_{a'}$ are independent. Write

$$q(x) = -\left(\sum_{k \mid a_k = a_i} x^{n-k}\right) \left(\sum_{k \mid a'_k = a'_j} x^{n-k}\right)^{-1},$$

then $H_a + q(x)H_{a'}$ is independent from $a_i = a'_j$ (and $h_1(a_i) = h_1(a'_j)$).

In $H_a + q(x)H_{a'}$, only hashed values $h_1(a_k)$ for $a_k \neq a_i$ and $h_1(a'_k)$ for $a'_k \neq a'_j$ remain: label them $h_1(b_1), \ldots, h_1(b_m)$. The result of the substitution can be written $H_a + q(x)H_{a'} = \sum_k q_k(x)h_1(b_k)$, where $q_k(x)$ are polynomials in $GF(2)[x]/p(x)$. All $q_k(x)$ are zero if and only if $H_a + q(x)H_{a'} = 0$ for all values of $h_1(a_1), \ldots, h_1(a_n)$ and $h_1(a'_1), \ldots, h_1(a'_n)$ (but notice that the value $h_1(a_i) = h_1(a'_j)$ is irrelevant); in particular, it must be true when $h_1(a_k) = 1$ and $h_1(a'_k) = 1$ for all k, hence $(x^{n-1} + \cdots + x + 1) + q(x)(x^{n-1} + \cdots + x + 1) = 0 \Rightarrow q(x) = -1$. Thus, all $q_k(x)$ are zero if and only if $H_a = H_{a'}$ for all values of $h_1(a_1), \ldots, h_1(a_n)$ and $h_1(a'_1), \ldots, h_1(a'_n)$, which only happens if the sequences a and a' are identical. Hence, not all $q_k(x)$ are zero.

Write

$$H_{y', a'} = \left(\sum_{k \mid a'_k = a'_j} x^{n-k}\right)^{-1} \left(y' - \sum_{k \mid a'_k \neq a'_j} x^{n-k} h_1(a'_k)\right).$$

On the one hand, the condition $H_{a'} = y'$ can be rewritten as $h_1(a'_j) = H_{y', a'}$. On the other hand, $H_a + q(x)H_{a'} = y + q(x)y'$ is independent from $h_1(a'_j) = h_1(a_i)$. Because $P(h_1(a'_j) = H_{y', a'}) = 1/2^L$ irrespective of $y'$ and $h_1(a'_k)$ for $k \in \{k \mid a'_k \neq a'_j\}$, then $P(h_1(a'_j) = H_{y', a'} \mid H_a + q(x)H_{a'} = y + q(x)y') = P(h_1(a'_j) = H_{y', a'})$, which implies that $h_1(a'_j) = H_{y', a'}$ and $H_a + q(x)H_{a'} = y + q(x)y'$ are independent. Hence, we have

$$\begin{aligned}
P(H_a = y \mid H_{a'} = y') &= P\big(H_a + q(x)H_{a'} = y + q(x)y' \mid h_1(a'_j) = H_{y', a'}\big) \\
&= P\big(H_a + q(x)H_{a'} = y + q(x)y'\big) \\
&= P\left(\sum_k q_k(x)h_1(b_k) = y + q(x)y'\right)
\end{aligned}$$

and by the earlier uniformity result, this last probability is equal to $1/2^L$. This concludes the proof. □

8. Trading memory for speed: RAM-Buffered GENERAL

Unfortunately, GENERAL, as computed by Algorithm 3, requires $O(nL)$ time per n-gram. Indeed, shifting a value n times in $GF(2)[x]/p(x)$ requires $O(nL)$ time. However, if we are willing to trade memory usage for speed, we can precompute these shifts. We call the resulting scheme RAM-Buffered GENERAL.

Lemma 2. Pick any $p(x)$ of degree L in $GF(2)[x]$. Represent elements of $GF(2)[x]/p(x)$ as polynomials of degree at most $L - 1$. Given any h in $GF(2)[x]/p(x)$, we can compute $x^n h$ in $O(L)$ time given an $O(L2^n)$-bit memory buffer.

Proof. Write h as $\sum_{i=0}^{L-1} q_i x^i$. Divide h into two parts, $h^{(1)} = \sum_{i=0}^{L-n-1} q_i x^i$ and $h^{(2)} = \sum_{i=L-n}^{L-1} q_i x^i$, so that $h = h^{(1)} + h^{(2)}$. Then $x^n h = x^n h^{(1)} + x^n h^{(2)}$. The first part, $x^n h^{(1)}$, is a polynomial of degree at most $L - 1$ since the degree of $h^{(1)}$ is at most $L - 1 - n$. Hence, $x^n h^{(1)}$ as an L-bit value is just $q_{L-n-1} q_{L-n-2} \ldots q_0 0 \ldots 0$, which can be computed in time $O(L)$. So, only the computation of $x^n h^{(2)}$ is possibly more expensive than $O(L)$ time, but $h^{(2)}$ has only n terms as a polynomial (since the first $L - n$ terms are always zero). Hence, if we precompute $x^n h^{(2)}$ for all $2^n$ possible values of $h^{(2)}$, and store them in an array with $O(L)$ time look-ups, we can compute $x^n h$ as an L-bit value in $O(L)$ time.
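The construction in this proof can be sketched in Python as follows. The sketch assumes $n \leq L$, represents polynomials as integers as before, and uses hypothetical names (`build_shift_table`, `shift_n_buffered`); the table is indexed by the n high-order coefficients of h.

```python
def shift(x, p, L):
    """Multiply x by the monomial x in GF(2)[x]/p(x); p has degree L."""
    x <<= 1
    if (x >> L) & 1:
        x ^= p
    return x

def shift_n_slow(h, n, p, L):
    """Reference computation of x^n h with n successive shifts (O(nL) time)."""
    for _ in range(n):
        h = shift(h, p, L)
    return h

def build_shift_table(n, p, L):
    """table[t] = x^n * h2 mod p, where h2 has the bits of t as its n high-order coefficients."""
    return [shift_n_slow(t << (L - n), n, p, L) for t in range(1 << n)]

def shift_n_buffered(h, n, L, table):
    """Compute x^n h with one shift of the low part and one table look-up."""
    low = h & ((1 << (L - n)) - 1)   # h^(1): coefficients of x^0 .. x^(L-n-1)
    high = h >> (L - n)              # h^(2): coefficients of x^(L-n) .. x^(L-1)
    return (low << n) ^ table[high]  # x^n h^(1) needs no reduction modulo p
```

The table has $2^n$ entries of L bits each, which is the $O(L2^n)$-bit buffer of the lemma; the result should match `shift_n_slow` for every L-bit input.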


When n is large, this precomputation requires excessive space and precomputation time. Fortunately, we can trade back some speed for memory. Consider the proof of Lemma 2. Instead of precomputing the shifts of all $2^n$ possible values of $h^{(2)}$ using an array of $2^n$ entries, we can further divide $h^{(2)}$ into K parts. For simplicity, assume that the integer K divides n. The K parts $h^{(2,1)}, \ldots, h^{(2,K)}$ are made of the first $n/K$ bits, the next $n/K$ bits and so on. Because $x^n h^{(2)} = \sum_{i=1}^{K} x^n h^{(2,i)}$, we can shift $h^{(2)}$ by n in $O(KL)$ operations using K arrays of $2^{n/K}$ entries. To summarize, we have a time complexity of $O(KL)$ per n-gram using $O(L|\Sigma| + LK2^{n/K})$ bits. We implemented the case $K = 2$.

9. Recursive hashing by cyclic polynomials is not even uniform

Choosing $p(x) = x^L + 1$ for $L \geq n$, for any polynomial $q(x) = \sum_{i=0}^{L-1} q_i x^i$, we have

$$x^i q(x) = x^i(q_{L-1}x^{L-1} + \cdots + q_1 x + q_0) = q_{L-i-1}x^{L-1} + \cdots + q_{L-i+1}x + q_{L-i}.$$

Thus, we have that multiplication by $x^i$ is a bitwise rotation, a cyclic left shift, which can be computed in $O(L)$ time. The resulting hash (see Algorithm 4) is called CYCLIC. It requires only $O(L)$ time per hash value. Empirically, Cohen showed that CYCLIC is uniform (Cohen, 1997). In contrast, we show that it is not formally uniform:

Algorithm 4. The recursive CYCLIC family.

Require: an L-bit hash function $h_1$ over $\Sigma$ from an independent hash family

    s ← empty FIFO structure
    x ← 0 (L-bit integer)
    z ← 0 (L-bit integer)
    for each character c do
        append c to s
        rotate x left by 1 bit
        rotate z left by n bits
        x ← $x \oplus z \oplus h_1(c)$
        if length(s) = n then
            yield x
            remove oldest character y from s
            z ← $h_1(y)$
        end if
    end for
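Algorithm 4 could be sketched in Python as below. As with the other sketches, $h_1$ is assumed to be a dictionary from symbols to L-bit integers, and the names `rotl`, `cyclic_hashes` and `cyclic_direct` are hypothetical; the direct evaluation is only a cross-check.

```python
from collections import deque

def rotl(v, r, L):
    """Rotate the L-bit value v left by r bits (multiplication by x^r modulo x^L + 1)."""
    r %= L
    return ((v << r) | (v >> (L - r))) & ((1 << L) - 1)

def cyclic_hashes(text, n, h1, L):
    """Yield the CYCLIC hash of each n-gram of text (recursive computation)."""
    window = deque()
    x = 0
    z = 0
    for c in text:
        window.append(c)
        x = rotl(x, 1, L)
        z = rotl(z, n, L)          # the departing character's contribution, x^n h_1(y)
        x ^= z ^ h1[c]
        if len(window) == n:
            yield x
            z = h1[window.popleft()]

def cyclic_direct(gram, h1, L):
    """Evaluate h_1(a_1) x^(n-1) + ... + h_1(a_n) modulo x^L + 1 by Horner's rule."""
    value = 0
    for c in gram:
        value = rotl(value, 1, L) ^ h1[c]
    return value
```

Compared with the GENERAL sketch, the n successive shifts of z are replaced by a single rotation, which is where the speed advantage of CYCLIC comes from.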

Lemma 3. CYCLIC is not uniform for n even and never 2-universal, and thus never pairwise independent.

Proof. If n is even, use the fact that $x^{n-1} + \cdots + x + 1$ is divisible by $x + 1$ to write $x^{n-1} + \cdots + x + 1 = (x + 1)r(x)$ for some polynomial $r(x)$. Clearly, $r(x)(x + 1)(x^{L-1} + x^{L-2} + \cdots + x + 1) = 0 \bmod x^L + 1$ for any $r(x)$ and so

$$\begin{aligned}
P(h(a_1, a_1, \ldots, a_1) = 0) &= P\big((x^{n-1} + \cdots + x + 1)h_1(a_1) = 0\big) \\
&= P\big((x + 1)r(x)h_1(a_1) = 0\big) \\
&\geq P\big(h_1(a_1) = 0 \vee h_1(a_1) = x^{L-1} + x^{L-2} + \cdots + x + 1\big) = 1/2^{L-1}.
\end{aligned}$$

Therefore, CYCLIC is not uniform for n even.

To show CYCLIC is never pairwise independent, consider $n = 3$ (for simplicity), then

$$\begin{aligned}
P(h(a_1, a_1, a_2) = h(a_1, a_2, a_1)) &= P\big((x + 1)(h_1(a_1) + h_1(a_2)) = 0\big) \\
&\geq P\big(h_1(a_1) + h_1(a_2) = 0 \vee h_1(a_1) + h_1(a_2) = x^{L-1} + x^{L-2} + \cdots + x + 1\big) = 1/2^{L-1},
\end{aligned}$$

but 2-universal hash values are equal with probability at most $1/2^L$. The result is shown. □

Of the four recursive hashing functions investigated by Cohen (1997), GENERAL and CYCLIC were superior both in terms of speed and uniformity, though CYCLIC had a small edge over GENERAL. For n large, the benefits of these recursive hash functions compared to the 3-wise independent hash function presented earlier can be substantial: n table look-ups are much more expensive than a single look-up followed by binary shifts.

Table 3
CYCLIC hash for various values of $h_1(a)$ ($h(a, a) = x h_1(a) + h_1(a) \bmod x^L + 1$).

$h_1(a)$    $h(a,a)$    $h(a,a)$ (first two bits)    $h(a,a)$ (last two bits)    $h(a,a)$ (first and last bit)
000         000         00                           00                          00
100         110         11                           10                          10
010         011         01                           11                          01
110         101         10                           01                          11
001         101         10                           01                          11
101         011         01                           11                          01
011         110         11                           10                          10
111         000         00                           00                          00

10. CYCLIC is pairwise independent if you remove n − 1 consecutive bits

Because Cohen found empirically that CYCLIC had good uniformity (Cohen, 1997), it is reasonable to expect CYCLIC to be almost uniform and maybe even almost pairwise independent. To illustrate this intuition, consider Table 3, which shows that while $h(a, a)$ is not uniform ($h(a, a) = 001$ is impossible), $h(a, a)$ minus any bit is indeed uniformly distributed. We will prove that this result holds in general.
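The observation drawn from Table 3 can be reproduced by enumeration. The Python sketch below is an illustration with hypothetical names; it takes $L = 3$ and $n = 2$, lets $h_1(a)$ range over all 3-bit values, and tabulates $h(a, a)$ restricted to a chosen set of bit positions (position 0 being the coefficient of $x^0$).

```python
from collections import Counter

def rotl(v, r, L):
    """Rotate the L-bit value v left by r bits."""
    r %= L
    return ((v << r) | (v >> (L - r))) & ((1 << L) - 1)

def cyclic_pair_distribution(L, keep):
    """Count the values of h(a, a) = x h_1(a) + h_1(a), keeping only the bit positions in `keep`."""
    counts = Counter()
    for v in range(1 << L):                  # v plays the role of h_1(a)
        h = rotl(v, 1, L) ^ v
        counts[tuple((h >> i) & 1 for i in keep)] += 1
    return counts

full = cyclic_pair_distribution(3, (0, 1, 2))
without_one_bit = [cyclic_pair_distribution(3, keep)
                   for keep in [(0, 1), (1, 2), (0, 2)]]
```

In this setting `full` contains only four of the eight possible 3-bit values, whereas each distribution in `without_one_bit` assigns the same count to all four 2-bit values.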

The next lemma and the next theorem show that CYCLIC is quasi-pairwise independent in the sense that $L-n+1$ consecutive bits (e.g., the first or last $L-n+1$ bits) are pairwise independent. In other words, CYCLIC is pairwise independent if we are willing to sacrifice $n-1$ bits. (We say that $n$ bits are "consecutive modulo $L$" if the bits are located at indexes $i \bmod L$ for $n$ consecutive values of $i$ such as $i = k, k+1, \ldots, k+n-1$.)

Lemma 4. If $q(x) \in GF(2)[x]/(x^L + 1)$ (with $q(x) \neq 0$) has degree $n < L$, then

• the equation $q(x) w = y \bmod x^L + 1$ modulo the first $n$ bits³ has exactly $2^n$ solutions for all $y$;
• more generally, the equation $q(x) w = y \bmod x^L + 1$ modulo any consecutive $n$ bits (modulo $L$) has exactly $2^n$ solutions for all $y$.

Proof. Let $P$ be the set of polynomials of degree at most $L-n-1$. Take any $p(x) \in P$, then $q(x)p(x)$ has degree at most $L-n-1+n = L-1$ and thus if $q(x) \neq 0$ and $p(x) \neq 0$, then $q(x)p(x) \neq 0 \bmod x^L + 1$. Hence, for any distinct $p_1, p_2 \in P$ we have $q(x)p_1 \neq q(x)p_2 \bmod x^L + 1$.

To prove the first item, we begin by showing that there is always exactly one solution in $P$. Consider that there are $2^{L-n}$ polynomials $p(x)$ in $P$, and that all values $q(x)p(x)$ are distinct. Suppose there are $p_1, p_2 \in P$ such that $q(x)p_1 = q(x)p_2 \bmod x^L + 1$ modulo the first $n$ bits, then $q(x)(p_1 - p_2)$ is a polynomial of degree at most $n-1$ while $p_1 - p_2$ is a polynomial of degree at most $L-n-1$ and $q(x)$ is a polynomial of degree $n$, thus $p_1 - p_2 = 0$. (If $p_1 - p_2 \neq 0$ then $\operatorname{degree}(q(x)(p_1 - p_2) \bmod x^L + 1) \geq \operatorname{degree}(q(x)) = n$, a contradiction.) Hence, all $p(x)$ in $P$ are mapped to distinct values modulo the first $n$ bits, and since there are $2^{L-n}$ such distinct values, the result is shown.

Any polynomial of degree at most $L-1$ can be decomposed into the form $p(x) + x^{L-n} z(x)$, where $z(x)$ is a polynomial of degree at most $n-1$ and $p(x) \in P$. By the preceding result, for distinct $p_1, p_2 \in P$, $q(x)(x^{L-n} z(x) + p_1)$ and $q(x)(x^{L-n} z(x) + p_2)$ must be distinct modulo the first $n$ bits. In other words, the equation $q(x)(x^{L-n} z(x) + p) = y$ modulo the first $n$ bits has exactly one solution $p \in P$ for any $z(x)$ and since there are $2^n$ polynomials $z(x)$ of degree at most $n-1$, then $q(x)w = y$ (modulo the first $n$ bits) must have $2^n$ solutions.

To prove the second item, choose $j$ and use the first item to find any $w$ solving $q(x)w = y x^j \bmod x^L + 1$ modulo the first $n$ bits. Then $w x^{L-j}$ is a solution to $q(x)w = y \bmod x^L + 1$ modulo the bits in positions $j, j+1, \ldots, j+n-1 \bmod L$. □
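Lemma 4 lends itself to a brute-force check for small $L$. The Python sketch below makes the following assumptions, none of which come from the paper's code: polynomials are $L$-bit integers with bit $i$ holding the coefficient of $x^i$, "the first $n$ bits" are positions $0, \ldots, n-1$ (consistent with footnote 3), and the function names are hypothetical.

```python
from collections import Counter

def rotl(v, s, L):
    """Multiply v by x**s in GF(2)[x]/(x**L + 1): an L-bit left rotation."""
    s %= L
    return ((v << s) | (v >> (L - s))) & ((1 << L) - 1)

def mulmod(q, w, L):
    """Product q(x)*w(x) in GF(2)[x]/(x**L + 1)."""
    r = 0
    for i in range(L):
        if (q >> i) & 1:
            r ^= rotl(w, i, L)
    return r

def solution_counts(q, L, j=0):
    """Group all w by q(x)*w, ignoring the bits at positions j..j+n-1 mod L."""
    n = q.bit_length() - 1                       # degree of q(x)
    ignored = 0
    for t in range(n):
        ignored |= 1 << ((j + t) % L)
    keep = ((1 << L) - 1) & ~ignored
    return n, Counter(mulmod(q, w, L) & keep for w in range(1 << L))

def lemma4_holds(L):
    """Check both items of Lemma 4 for every non-zero q of degree below L."""
    for q in range(1, 1 << L):
        for j in range(L):
            n, counts = solution_counts(q, L, j)
            all_classes_hit = len(counts) == 2 ** (L - n)
            each_has_2n = set(counts.values()) == {2 ** n}
            if not (all_classes_hit and each_has_2n):
                return False
    return True

print(lemma4_holds(5))
```

Taking `j = 0` corresponds to the first item of the lemma; the other values of `j` cover the second item.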

³ By "equality modulo ⟨some specified set of bit positions⟩", we mean that the two quantities are bitwise identical, with exceptions permitted only at the specified positions. For our polynomials, "equality modulo the first $n$ bit positions" implies the difference of the two polynomials has degree at most $n-1$.


We have the following corollary to Lemma 4.

Corollary 1. If $w$ is chosen uniformly at random in $GF(2)[x]/(x^L + 1)$, then $P(q(x)w = y \bmod n-1 \text{ bits}) = 1/2^{L-n+1}$ where the $n-1$ bits are consecutive (modulo $L$).

Theorem 1. Consider the $L$-bit CYCLIC $n$-gram hash family. Pick any $n-1$ consecutive bit locations, then remove these bits from all hash values. The resulting $(L-n+1)$-bit hash family is pairwise independent.
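The statement of Theorem 1 can be sanity-checked exhaustively for tiny parameters. The Python sketch below enumerates every assignment of $L$-bit values to a two-symbol alphabet and tests that, for each pair of distinct $n$-grams, all pairs of truncated hash values occur equally often. The parameter choices ($L \in \{3, 4\}$, $n = 2$), the bit-position convention (bit $i$ is the coefficient of $x^i$) and all function names are assumptions made for illustration. This is not a substitute for the proof.

```python
from collections import Counter
from itertools import product

def rotl(v, s, L):
    """Multiply v by x**s in GF(2)[x]/(x**L + 1): an L-bit left rotation."""
    s %= L
    return ((v << s) | (v >> (L - s))) & ((1 << L) - 1)

def cyclic_hash(values, L):
    """CYCLIC hash: sum over k of x**(n-k) * h1(a_k) mod x**L + 1."""
    n = len(values)
    h = 0
    for k, v in enumerate(values, start=1):
        h ^= rotl(v, n - k, L)
    return h

def drop_bits(h, L, n, start):
    """Remove the n-1 consecutive bits (modulo L) beginning at position start."""
    removed = {(start + t) % L for t in range(n - 1)}
    out, pos = 0, 0
    for i in range(L):
        if i not in removed:
            out |= ((h >> i) & 1) << pos
            pos += 1
    return out

def is_pairwise_independent(L, n, alphabet_size, start=0):
    """Exhaustively test pairwise independence of the truncated family."""
    ngrams = list(product(range(alphabet_size), repeat=n))
    families = list(product(range(1 << L), repeat=alphabet_size))  # every h1
    classes = 2 ** (L - n + 1)
    expected = len(families) // classes ** 2
    for i, s in enumerate(ngrams):
        for t in ngrams[i + 1:]:
            joint = Counter()
            for h1 in families:
                hs = drop_bits(cyclic_hash([h1[c] for c in s], L), L, n, start)
                ht = drop_bits(cyclic_hash([h1[c] for c in t], L), L, n, start)
                joint[(hs, ht)] += 1
            if len(joint) != classes ** 2 or set(joint.values()) != {expected}:
                return False
    return True

print(all(is_pairwise_independent(4, 2, 2, start) for start in range(4)))
```

The cost grows as $2^{L \cdot |\Sigma|}$, so this kind of check is only practical for very small alphabets and word sizes.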

Proof. We show $P(q_1(x)h_1(a_1) + q_2(x)h_1(a_2) + \cdots + q_n(x)h_1(a_n) = y \bmod n-1 \text{ bits}) = 1/2^{L-n+1}$ for any polynomials $q_i$ where at least one is different from zero. It is true when there is a single non-zero polynomial $q_i(x)$ by Corollary 1. Suppose it is true up to $k-1$ non-zero polynomials and consider a case where we have $k$ non-zero polynomials. Assume without loss of generality that $q_1(x) \neq 0$, we have

$$
\begin{aligned}
&P(q_1(x)h_1(a_1) + q_2(x)h_1(a_2) + \cdots + q_n(x)h_1(a_n) = y \bmod n-1 \text{ bits}) \\
&\quad= P(q_1(x)h_1(a_1) = y - q_2(x)h_1(a_2) - \cdots - q_n(x)h_1(a_n) \bmod n-1 \text{ bits}) \\
&\quad= \sum_{y'} P(q_1(x)h_1(a_1) = y - y' \bmod n-1 \text{ bits}) \, P(q_2(x)h_1(a_2) + \cdots + q_n(x)h_1(a_n) = y' \bmod n-1 \text{ bits}) \\
&\quad= \sum_{y'} \frac{1}{2^{L-n+1}} \cdot \frac{1}{2^{L-n+1}} = 1/2^{L-n+1}
\end{aligned}
$$

by the induction argument, where the sum is over $2^{L-n+1}$ values of $y'$. Hence the uniformity result is shown.

Consider two distinct sequences $a_1, a_2, \ldots, a_n$ and $a'_1, a'_2, \ldots, a'_n$. Write $H_a = h(a_1, a_2, \ldots, a_n)$ and $H_{a'} = h(a'_1, a'_2, \ldots, a'_n)$. To prove pairwise independence, it suffices to show that $P(H_a = y \bmod n-1 \text{ bits} \mid H_{a'} = y' \bmod n-1 \text{ bits}) = 1/2^{L-n+1}$. Suppose that $a_i = a'_j$ for some $i, j$; if not, the result follows by the (full) independence of the hashing function $h_1$. Using Lemma 4, find $q(x)$ such that $q(x) \sum_{k \mid a'_k = a'_j} x^{n-k} = -\sum_{k \mid a_k = a_i} x^{n-k} \bmod n-1 \text{ bits}$, then $H_a + q(x)H_{a'} \bmod n-1 \text{ bits}$ is independent from $a_i = a'_j$ (and $h_1(a_i) = h_1(a'_j)$).

The hashed values $h_1(a_k)$ for $a_k \neq a_i$ and $h_1(a'_k)$ for $a'_k \neq a'_j$ are now relabelled as $h_1(b_1), \ldots, h_1(b_m)$. Write $H_a + q(x)H_{a'} = \sum_k q_k(x) h_1(b_k) \bmod n-1 \text{ bits}$, where $q_k(x)$ are polynomials in $GF(2)[x]/(x^L + 1)$ (not all $q_k(x)$ are zero). As in the proof of Lemma 1, we have that $H_{a'} = y' \bmod n-1 \text{ bits}$ and $H_a + q(x)H_{a'} = y + q(x)y' \bmod n-1 \text{ bits}$ are independent⁴: $P(H_{a'} = y' \bmod n-1 \text{ bits} \mid y', b_1, b_2, \ldots, b_m) = 1/2^{L-n+1}$ by Corollary 1 since $H_{a'} = y$ can be written as $r(x)h_1(a'_j) = y - \sum_k r_k(x) h_1(b_k)$ for some polynomials $r(x), r_1(x), \ldots, r_m(x)$. Hence, we have

$$
\begin{aligned}
&P(H_a = y \bmod n-1 \text{ bits} \mid H_{a'} = y' \bmod n-1 \text{ bits}) \\
&\quad= P(H_a + q(x)H_{a'} = y + q(x)y' \bmod n-1 \text{ bits} \mid H_{a'} = y' \bmod n-1 \text{ bits}) \\
&\quad= P(H_a + q(x)H_{a'} = y + q(x)y' \bmod n-1 \text{ bits}) \\
&\quad= P\Big(\sum_k q_k(x) h_1(b_k) = y + q(x)y' \bmod n-1 \text{ bits}\Big)
\end{aligned}
$$

and by the earlier uniformity result, this last probability is equal to $1/2^{L-n+1}$. □

⁴ We use the shorthand notation $P(f(x,y) = c \mid x, y) = b$ to mean $P(f(x,y) = c \mid x = z_1, y = z_2) = b$ for all values of $z_1, z_2$.

11. Experimental comparison

Irrespective of $p(x)$, computing hash values has complexity $\Omega(L)$. For GENERAL and CYCLIC, we require $L \geq n$. Hence, the computation of their hash values is in $\Omega(n)$. For moderate values of $L$ and $n$, this analysis is pessimistic because CPUs can process 32- or 64-bit words in one operation.

To assess their real-world performance, the various hashing algorithms⁵ were written in C++. We compiled them with the GNU GCC 4.0.1 compiler on an Apple MacBook with two Intel Core 2 Duo processors (2.4 GHz) and 4 GiB of RAM. The -O3 compiler flag was used since it provided slightly better performance for all algorithms. All hash values are stored using 32-bit integers, irrespective of the number of bits used.

All hashing functions generate 19-bit hash values, except for CYCLIC which generates $19+n$-bit hash values. We had CYCLIC generate more bits to compensate for the fact that it is only pairwise independent after removal of $n-1$ consecutive bits. For GENERAL, we used the polynomial $p(x) = x^{19} + x^{18} + x^{17} + x^{16} + x^{12} + x^7 + x^6 + x^5 + x^3 + x^2 + 1$ (Ruskey, 2006). For Randomized Karp–Rabin, we used the ID37 family. The character hash values are stored in an array for fast look-up.

⁵ http://code.google.com/p/ngramhashing/.

Fig. 1. Wall-clock running time to hash all n-grams in the King James Bible. (Plot of time in seconds against $n$ for 3-wise, General, RAM-Buffered General, Cyclic and Rand. Karp-Rabin.)

We report wall-clock time in Fig. 1 for hashing the n-grams of the King James Bible (Project Gutenberg Literary Archive Foundation, 2009), which contains 4.3 million ASCII characters. CYCLIC is twice as fast as GENERAL. As expected, the running time of the non-recursive hash function (3-wise) grows linearly with $n$: for $n = 5$, 3-wise is already seven times slower than CYCLIC. Speed-wise, Randomized Karp–Rabin (ID37) is the clear winner, being nearly twice as fast as CYCLIC. The performance of CYCLIC and ID37 is oblivious to $n$ in this test.
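One way to see why the cost of CYCLIC per n-gram need not depend on $n$ is to write out its recursive update: sliding the window multiplies the current value by $x$ (a rotation), cancels the outgoing symbol's term $x^n h_1(a_1)$, and adds the incoming symbol's value. The Python sketch below illustrates this under our own assumptions; the class name `RollingCyclic`, the choice to drop the $n-1$ low-order bits, and the parameters are hypothetical and do not reproduce the authors' C++ implementation.

```python
import random

def rotl(v, s, L):
    """Multiply v by x**s in GF(2)[x]/(x**L + 1): an L-bit left rotation."""
    s %= L
    return ((v << s) | (v >> (L - s))) & ((1 << L) - 1)

class RollingCyclic:
    """Rolling CYCLIC hash; h1 maps each symbol to a random L-bit value."""

    def __init__(self, n, L, h1):
        self.n, self.L, self.h1 = n, L, h1
        self.h = 0

    def feed(self, symbol):
        """Append a symbol while the first window is still being filled."""
        self.h = rotl(self.h, 1, self.L) ^ self.h1[symbol]

    def roll(self, outgoing, incoming):
        """Slide the window by one symbol with a constant number of operations."""
        self.h = (rotl(self.h, 1, self.L)
                  ^ rotl(self.h1[outgoing], self.n, self.L)
                  ^ self.h1[incoming])

    def value(self):
        """Hash value after discarding n-1 consecutive (low-order) bits."""
        return self.h >> (self.n - 1)

def direct_cyclic(window, L, h1):
    """Non-recursive reference: sum of x**(n-k) * h1(a_k), costing O(n)."""
    n = len(window)
    h = 0
    for k, c in enumerate(window, start=1):
        h ^= rotl(h1[c], n - k, L)
    return h

text = "recursive n-gram hashing"
n, L = 4, 16
rng = random.Random(42)                     # fixed seed so runs are repeatable
h1 = {c: rng.getrandbits(L) for c in sorted(set(text))}

roller = RollingCyclic(n, L, h1)
for c in text[:n]:
    roller.feed(c)
rolled, values = [roller.h], [roller.value()]
for i in range(n, len(text)):
    roller.roll(text[i - n], text[i])
    rolled.append(roller.h)
    values.append(roller.value())

expected = [direct_cyclic(text[i:i + n], L, h1) for i in range(len(text) - n + 1)]
print(rolled == expected)
```

Each call to `roll` performs two table look-ups, two rotations and two XORs regardless of $n$, whereas `direct_cyclic` touches all $n$ symbols of the window.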

The RAM-Buffered GENERAL timings are, as expected, independent of $n$, but they are twice as large as the CYCLIC timings. We do not show the modified version of RAM-Buffered GENERAL that uses two precomputed arrays instead of a single one. It was approximately 30% slower than ordinary RAM-Buffered GENERAL, even up to $n = 25$. However, its RAM usage was three orders of magnitude smaller: from 135 MB down to 25 kB. Overall, we cannot recommend RAM-Buffered GENERAL or its modification considering that: (1) its memory usage grows as $2^n$ and (2) it is slower than CYCLIC.

12. Conclusion

Considering speed and pairwise independence, we recommend CYCLIC, after discarding $n-1$ consecutive bits. If we require only uniformity, Randomized Integer-Division is twice as fast.

Acknowledgments

This work is supported by NSERC Grants 155967, 261437 and by FQRNT Grant 112381. The authors are grateful to the anonymous reviewers for their significant contributions.

References

Cardenal-Lopez, A., Diguez-Tirado, F.J., Garcia-Mateo, C., 2002. Fast LM look-ahead for large vocabulary continuous speech recognition using perfect hashing. In: ICASSP’02, pp. 705–708.

Carter, L., Wegman, M.N., 1979. Universal classes of hash functions. Journal of Computer and System Sciences 18 (2), 143–154.

Cohen, J.D., 1997. Recursive hashing functions for n-grams. ACM Transactions on Information Systems 15 (3), 291–320.

Cohen, J.D., 1998a. Hardware-assisted algorithm for full-text large-dictionary string matching using n-gram hashing. Information Processing and Management 34 (4), 443–464.

Cohen, J.D., 1998b. An n-gram hash and skip algorithm for finding large numbers of keywords in continuous text streams. Software – Practice Experience 28 (15), 1605–1635.

Cohen, J.D., 1999. Massive query resolution for rapid selective dissemination of information. Journal of the American Society for Information Science 50 (3), 195–206.

Durand, M., Flajolet, P., 2003. Loglog counting of large cardinalities. In: ESA’03, vol. 2832 of LNCS, pp. 605–617.

Flajolet, P., Martin, G.N., 1985. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences 31 (2), 182–209.

Furer, M., 2007. Faster integer multiplication. In: STOC ’07, pp. 57–66.

Gibbons, P.B., Tirthapura, S., 2001. Estimating simple functions on the union of data streams. In: SPAA’01, pp. 281–291.

Karp, R.M., Rabin, M.O., 1987. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development 31 (2), 249–260.

Li, X., Zhao, Y., 2007. A fast and memory-efficient N-gram language model lookup method for large vocabulary continuous speech recognition. Computer Speech & Language 21 (1), 1–25.

Mitzenmacher, M., Vadhan, S., 2008. Why simple hash functions work: exploiting the entropy in a data stream. In: SODA ’08, pp. 746–755.

Project Gutenberg Literary Archive Foundation, 2009. Project Gutenberg. <http://www.gutenberg.org/> (checked 03.08.2009).

Ribler, R.L., Abrams, M., 2000. Using visualization to detect plagiarism in computer science classes. In: INFOVIS’00. IEEE Computer Society, Washington, DC, USA, p. 173.

Ruskey, F., 2006. The (combinatorial) object server. <http://www.theory.cs.uvic.ca/~cos/cos.html> (checked 30.05.2007).

Schleimer, S., Wilkerson, D.S., Aiken, A., 2003. Winnowing: local algorithms for document fingerprinting. In: SIGMOD’2003, pp. 76–85.

Schwenk, H., 2007. Continuous space language models. Computer Speech & Language 21 (3), 492–518.

Sun Microsystems, String (Java 2 Platform SE 5.0), 2004. <http://java.sun.com/j2se/1.5.0/docs/api/index.html>.

Talbot, D., Brants, T., 2008. Randomized language models via perfect hash functions. In: ACL’08, pp. 505–513.

Talbot, D., Osborne, M., 2007. Smoothed Bloom filter language models: tera-scale LMs on the cheap. In: EMNLP’07, pp. 468–476.

Talbot, D., Osborne, M., 2007. Randomised language modelling for statistical machine translation. In: ACL’07, pp. 512–519.

Tan, T., Gould, S., Williams, D., Peltzer, E., Barrie, R., 2006. Fast pattern matching using large compressed databases. US Patent App. 11/326, p. 131.

Weiss, M., 1999. Data Structures and Algorithm Analysis in Java. Addison Wesley.

Zhang, X., Zhao, Y., 2002. Minimum perfect hashing for fast N-gram language model lookup. In: Seventh International Conference on Spoken Language Processing, ISCA, pp. 401–404.