# tutorial 9 (bloom filters)

Part of the Search Engine course given at the Technion (2011)

Bloom Filters

Kira Radinsky

Slides based on material from:

Michael Mitzenmacher and Hanoch Levy

Motivation - Cache

• Lookup questions: Does item “x” exist in a set?

• Data set may be very big or expensive to access. Filter lookup questions with negative results before accessing data.

• Allow false positive errors, as they only cost us an extra data access.

• Don’t allow false negative errors, because they result in wrong answers.

Application of Bloom Filters: Distributed Web Caches

[Figure: six cooperating web caches, Web Cache 1 to Web Cache 6.]

• Send Bloom filters of URLs.

• False positives do not hurt much.

– Get errors from cache changes anyway

Web Caching

• Summary Cache: [Fan, Cao, Almeida, & Broder]

If local caches know each other’s content...

…try local cache before going out to Web

• Sending/updating lists of URLs too expensive.

• Solution: use Bloom filters.

• False positives

– Local requests go unfulfilled.

– Small cost, big potential gain

The Problem Solved by BF: Approximate Set Membership

• Lookup Problem: Given a set S = {x1,x2,…,xn}, construct data structure to answer queries of the form “Is y in S?”

• Data structure should be:

– Fast (Faster than searching through S).

– Small (Smaller than explicit representation).

• To obtain speed and size improvements, allow some probability of error.

– False positives: y ∉ S but we report y ∈ S

– False negatives: y ∈ S but we report y ∉ S

Bloom Filters

Start with an m bit array, filled with 0s.

Hash each item xj in S k times. If Hi(xj) = a, set B[a] = 1.

B: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0

To check if y is in S, check B at Hi(y). All k values must be 1.

B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0

Possible to have a false positive; all k values are 1, but y is not in S.
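The insert and lookup steps above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the class name `BloomFilter`, the parameters m = 64 and k = 3, and the use of salted SHA-256 to simulate the k hash functions Hi are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: an m-bit array B and k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [0] * m  # B, initially filled with 0s

    def _positions(self, item):
        # Assumption: H_i(item) is simulated by salting SHA-256 with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # If H_i(item) = a, set B[a] = 1.
        for a in self._positions(item):
            self.bits[a] = 1

    def __contains__(self, item):
        # All k probed bits must be 1; otherwise the item is definitely absent.
        return all(self.bits[a] for a in self._positions(item))


bf = BloomFilter(m=64, k=3)
for url in ["a.com", "b.org", "c.net"]:
    bf.add(url)

print("a.com" in bf)  # True: an inserted item is always reported as present
print("z.io" in bf)   # usually False; True here would be a false positive
```

A lookup never returns a false negative, because insertion only ever sets bits and never clears them.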

Bloom Filter

[Figure: element x is hashed by h1(x), h2(x), h3(x), …, hk(x) into positions of a bit vector V, indexed V0 to Vm-1.]

Advantages

• No Overflow

• Union and intersection of Bloom filters

– Simple bitwise OR and AND operations

• Applications:

– Google BigTable

– The Squid Web Proxy Cache uses Bloom filters for cache digests.
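The union and intersection advantage listed above can be illustrated with a short sketch. It assumes both filters use the same m and the same hash functions; the parameters `M`, `K` and the helper names are hypothetical, and each filter is stored as a Python integer used as a bit mask.

```python
import hashlib

M, K = 64, 3  # hypothetical parameters; both filters must share them


def positions(item):
    # Assumption: the k hash functions are simulated with salted SHA-256.
    for i in range(K):
        digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % M


def build(items):
    bits = 0
    for item in items:
        for a in positions(item):
            bits |= 1 << a
    return bits


def might_contain(bits, item):
    return all((bits >> a) & 1 for a in positions(item))


f1 = build(["a.com", "b.org"])
f2 = build(["b.org", "c.net"])

union = f1 | f2         # identical to a filter built from the union of the sets
intersection = f1 & f2  # contains every bit of the true intersection's filter

print(union == build(["a.com", "b.org", "c.net"]))
print(might_contain(intersection, "b.org"))
```

The OR result is exactly the filter of the union. The AND result may have extra bits set compared with a filter built directly from the intersection, so its false positive rate can be somewhat higher.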

Bloom Errors

[Figure: bit vector V (positions V0 to Vm-1) after inserting a, b, c, d; the positions h1(x), h2(x), h3(x), …, hk(x) of a new element x are probed.]

x didn’t appear, yet its bits are already set

Example

[Figure: false positive rate (y-axis, 0 to 0.1) versus number of hash functions (x-axis, 0 to 10), for m/n = 8.]

Opt k = 8 ln 2 = 5.545...
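The curve in this example can be reproduced numerically. The sketch below assumes the standard approximation f ≈ (1 − e^(−kn/m))^k for the false positive rate, which the slides do not state explicitly; the function name `false_positive_rate` is chosen for illustration.

```python
import math


def false_positive_rate(bits_per_item, k):
    # Standard approximation: f = (1 - e^(-k n / m))^k, with m/n = bits_per_item.
    return (1.0 - math.exp(-k / bits_per_item)) ** k


bits_per_item = 8
rates = {k: false_positive_rate(bits_per_item, k) for k in range(1, 11)}
best_k = min(rates, key=rates.get)        # best integer number of hash functions
optimal_k = bits_per_item * math.log(2)   # real-valued optimum, (m/n) ln 2

for k, f in rates.items():
    print(f"k={k:2d}  f={f:.4f}")
print(f"optimal k = {optimal_k:.3f}, best integer k = {best_k}")
```

Under this approximation the real-valued optimum is 8 ln 2 ≈ 5.545, and the best integer choice is k = 6 with a false positive rate of roughly 2%.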

Tradeoffs

• Three parameters.

– Size m/n : bits per item.

• |S| = n: Number of elements to encode.

• hi: U → [1..m]: Maintain a bit vector V of size m

– Time k : number of hash functions.

• Use k hash functions (h1..hk)

– Error f : false positive probability.

Bloom Filter Tradeoffs

• Three factors: m, k, and n.

• Normally, n and m are given, and we select k.

• Small k

– Fewer computations.

– Actual number of bits accessed (nk) is smaller, so the chance of a “step over” is smaller too.

– However, fewer bits need to be stepped over to generate an error.

• For big k, the exact opposite holds.

• Not surprisingly, when k is optimal, the “hit ratio” (ratio of bits flipped in the array) is exactly 0.5
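The claim about the hit ratio can be checked with a short calculation. The sketch assumes the k·n hash writes are independent and uniform, so a given bit stays 0 with probability (1 − 1/m)^(kn); the values n = 1000 and m = 8n are arbitrary illustrative choices.

```python
import math


def expected_fill(m, n, k):
    # Probability that a given bit is 1 after n*k independent uniform writes.
    return 1.0 - (1.0 - 1.0 / m) ** (k * n)


n = 1000
m = 8 * n
k_opt = (m / n) * math.log(2)  # optimal k = (m/n) ln 2

fill_small_k = expected_fill(m, n, 2)
fill_opt = expected_fill(m, n, k_opt)
fill_big_k = expected_fill(m, n, 10)

print(f"k=2:     fill = {fill_small_k:.3f}")
print(f"k={k_opt:.3f}: fill = {fill_opt:.3f}")
print(f"k=10:    fill = {fill_big_k:.3f}")
```

With a small k the array is mostly 0s and with a large k it is mostly 1s; at the optimum about half the bits are set.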

Alternative Approach for Bloom Filters: Perfect Hashing Approach

[Figure: Element 1 to Element 5 are mapped by a perfect hash function to five cells, which hold Fingerprint(4), Fingerprint(5), Fingerprint(2), Fingerprint(1), Fingerprint(3).]

Perfect Hashing Approach

• Folklore Bloom filter construction.

– Recall: Given a set S = {x1,x2,x3,…,xn} on a universe U, we want to answer membership queries.

– Method: Find an n-cell perfect hash function for S.

• Maps set of n elements to n cells in a 1-1 manner.

– Then keep a $\lceil \log_2(1/\epsilon) \rceil$-bit fingerprint of the item in each cell. Lookups have false positive probability < $\epsilon$.

– Advantage: each bit/item reduces false positives by a factor of 1/2, vs. a factor of $(1/2)^{\ln 2} \approx 0.62$ for a standard Bloom filter.

• Negatives:

– Perfect hash functions non-trivial to find.

– Cannot handle on-line insertions.
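The space advantage of the fingerprint approach can be made concrete with a small calculation. The sketch assumes a standard Bloom filter at its optimal k, where the false positive rate is (1/2)^((m/n) ln 2), and it ignores the space needed to store the perfect hash function itself; the function names are illustrative.

```python
import math


def fingerprint_bits(eps):
    # Perfect hashing approach: ceil(log2(1/eps)) fingerprint bits per item.
    return math.ceil(math.log2(1.0 / eps))


def bloom_bits(eps):
    # Standard Bloom filter at optimal k: m/n = log2(1/eps) / ln 2.
    return math.log2(1.0 / eps) / math.log(2)


for eps in (0.1, 0.01, 0.001):
    print(f"eps={eps}: fingerprint {fingerprint_bits(eps)} bits/item, "
          f"Bloom filter {bloom_bits(eps):.2f} bits/item")
```

For a 1% false positive target this gives 7 bits per item for fingerprints versus about 9.6 bits per item for a standard Bloom filter, a ratio of roughly 1/ln 2 ≈ 1.44 before rounding.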

Bloom Filters and Deletions

• Cache contents change

– Items both inserted and deleted.

• Insertions are easy – add bits to BF

• Can Bloom filters handle deletions?

– Use Counting Bloom Filters to track insertions/deletions at hosts;

– Send Bloom filters.

Handling Deletions

• Bloom filters can handle insertions, but not deletions.

• If deleting xi means resetting 1s to 0s, then deleting xi will “delete” xj.

B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0

[Figure: xi and xj hash to overlapping positions in B.]
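The deletion problem can be seen in a tiny hand-built example. The hash positions below are hypothetical and chosen by hand so that xi and xj share one bit; no real hash function is involved.

```python
B = [0] * 16

# Hypothetical hash positions, picked so that xi and xj collide on bit 4.
positions = {"xi": [1, 4, 13], "xj": [4, 9, 10]}


def insert(item):
    for a in positions[item]:
        B[a] = 1


def lookup(item):
    return all(B[a] == 1 for a in positions[item])


def naive_delete(item):
    # Resetting 1s to 0s also clears bits that other items depend on.
    for a in positions[item]:
        B[a] = 0


insert("xi")
insert("xj")
naive_delete("xi")

print(lookup("xj"))  # False: xj was never deleted, so this is a false negative
```

Because false negatives are not allowed, a plain bit array cannot support deletion this way.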

Counting Bloom Filters

Start with an array of m counters, all set to 0.

Hash each item xj in S k times. If Hi(xj) = a, add 1 to B[a].

B: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

B: 0 3 0 0 1 0 2 0 0 3 2 1 0 2 1 0

To delete xj, decrement the corresponding counters.

B: 0 2 0 0 0 0 2 0 0 3 2 1 0 1 1 0

Can obtain a corresponding Bloom filter by reducing to 0/1.

B: 0 1 0 0 0 0 1 0 0 1 1 1 0 1 1 0
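The counting variant can be sketched as follows. As before, the class name, the parameters m = 64 and k = 3, and the salted SHA-256 hashing are assumptions for illustration; the counters are plain Python integers, so a fixed counter width and its overflow behavior are not modeled here.

```python
import hashlib


class CountingBloomFilter:
    """Minimal counting Bloom filter: m counters instead of m bits."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.counters = [0] * m

    def _positions(self, item):
        # Assumption: H_i(item) is simulated by salting SHA-256 with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for a in self._positions(item):
            self.counters[a] += 1

    def remove(self, item):
        # Assumes the item was previously added; otherwise counters are corrupted.
        for a in self._positions(item):
            self.counters[a] -= 1

    def __contains__(self, item):
        return all(self.counters[a] > 0 for a in self._positions(item))

    def to_bloom_bits(self):
        # Reduce each counter to 0/1 to obtain the plain Bloom filter to send.
        return [1 if c > 0 else 0 for c in self.counters]


cbf = CountingBloomFilter(m=64, k=3)
cbf.add("xi")
cbf.add("xj")
cbf.remove("xi")

print("xj" in cbf)          # True: deleting xi did not disturb xj
print(cbf.to_bloom_bits())  # the 0/1 filter after the deletion
```

After the removal, the counters are identical to those of a filter into which only xj was ever inserted.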

Counting Bloom Filters: Overflow

• Must choose counters large enough to avoid overflow.

• Poisson approximation suggests 4 bits/counter.

– With k = (ln 2)m/n hash functions, the average load per counter is ln 2.

– Probability a counter has load at least 16:

$\approx e^{-\ln 2} (\ln 2)^{16} / 16! \approx 6.78 \times 10^{-17}$

• Failsafes possible.
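The overflow estimate can be checked directly. The sketch assumes each counter's load is Poisson-distributed with mean ln 2, as the approximation above suggests; the tail sum is cut off at 60 terms, which is far beyond where the terms stop mattering.

```python
import math

lam = math.log(2)  # average load per counter at the optimal k


def poisson_pmf(j, lam):
    return math.exp(-lam) * lam ** j / math.factorial(j)


# Dominant term: probability that the load is exactly 16.
p_exactly_16 = poisson_pmf(16, lam)

# Full tail: probability that the load is at least 16 (overflows a 4-bit counter).
p_at_least_16 = sum(poisson_pmf(j, lam) for j in range(16, 60))

print(f"P(load = 16)  = {p_exactly_16:.3e}")
print(f"P(load >= 16) = {p_at_least_16:.3e}")
```

The first term alone is about 6.78E-17, and the remaining tail adds only a few percent, so 4-bit counters overflow with negligible probability under this model.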

Variations and Extensions

• Distance-Sensitive Bloom Filters

• Bloomier Filter

Extension: Distance-Sensitive Bloom Filters

• Instead of answering questions of the form

Is y ∈ S?

we would like to answer questions of the form

Is y ≈ x ∈ S?

• That is, is the query close to some element of the set, under some metric and some notion of close.

• Applications:

– DNA matching

– Virus/worm matching

– Databases

• Some initial results [Kirsch, Mitzenmacher]. Hard.

Extension: Bloomier Filter

• Bloom filters handle set membership.

• Counters to handle multi-set/count tracking.

• Bloomier filter [Chazelle, Kilian, Rubinfeld, Tal]:

– Extend to handle approximate functions.

– Each element of set has associated function value.

– Non-set elements should return null.

– Want to always return correct function value for set elements.

– A false positive returns a non-null function value for a non-set element.