

Space/Time Trade-offs in Hash Coding with Allowable Errors

BURTON H. BLOOM

Computer Usage Company, Newton Upper Falls, Mass.

In this paper trade-offs among certain computational factors in hash coding are analyzed. The paradigm problem considered is that of testing a series of messages one-by-one for membership in a given set of messages. Two new hash-coding methods are examined and compared with a particular conventional hash-coding method. The computational factors considered are the size of the hash area (space), the time required to identify a message as a nonmember of the given set (reject time), and an allowable error frequency.

The new methods are intended to reduce the amount of space required to contain the hash-coded information from that associated with conventional methods. The reduction in space is accomplished by exploiting the possibility that a small fraction of errors of commission may be tolerable in some applications, in particular, applications in which a large amount of data is involved and a core resident hash area is consequently not feasible using conventional methods.

In such applications, it is envisaged that overall performance could be improved by using a smaller core resident hash area in conjunction with the new methods and, when necessary, by using some secondary and perhaps time-consuming test to "catch" the small fraction of errors associated with the new methods. An example is discussed which illustrates possible areas of application for the new methods.

Analysis of the paradigm problem demonstrates that allowing a small number of test messages to be falsely identified as members of the given set will permit a much smaller hash area to be used without increasing reject time.

KEY WORDS AND PHRASES: hash coding, hash addressing, scatter storage, searching, storage layout, retrieval trade-offs, retrieval efficiency, storage efficiency

CR CATEGORIES: 3.73, 3.74, 3.79

Introduction

In conventional hash coding, a hash area is organized into cells, and an iterative pseudorandom computational process is used to generate, from the given set of messages, hash addresses of empty cells into which the messages are then stored. Messages are tested by a similar process of iteratively generating hash addresses of cells. The contents of these cells are then compared with the test messages. A match indicates the test message is a member of the set; an empty cell indicates the opposite. The reader is assumed to be familiar with this and similar conventional hash-coding methods [1, 2, 3].

Work on this paper was in part performed while the author was affiliated with Computer Corporation of America, Cambridge, Massachusetts.

The new hash-coding methods to be introduced are suggested for applications in which the great majority of messages to be tested will not belong to the given set. For these applications, it is appropriate to consider as a unit of time (called reject time) the average time required to classify a test message as a nonmember of the given set. Furthermore, since the contents of a cell can, in general, be recognized as not matching a test message by examining only part of the message, an appropriate assumption will be introduced concerning the time required to access individual bits of the hash area.

In addition to the two computational factors, reject time and space (i.e. hash area size), this paper considers a third computational factor, allowable fraction of errors. It will be shown that allowing a small number of test messages to be falsely identified as members of the given set will permit a much smaller hash area to be used without increasing the reject time. In some practical applications, this reduction in hash area size may make the difference between maintaining the hash area in core, where it can be processed quickly, or having to put it on a slow access bulk storage device such as a disk.

Two methods of reducing hash area size by allowing errors will be introduced. The trade-off between hash area size and fraction of allowable errors will be analyzed for each method, as well as their resulting space/time trade-offs.

The reader should note that the new methods are not intended as alternatives to conventional hash-coding methods in any application area in which hash coding is currently used (e.g. symbol table management [1]). Rather, they are intended to make it possible to exploit the benefits of hash-coding techniques in certain areas in which conventional error-free methods suffer from a need for hash areas too large to be core resident and consequently found to be infeasible. In order to gain substantial reductions in hash area size, without introducing excessive reject times, the error-free performance associated with conventional methods is sacrificed. In application areas where error-free performance is a necessity, these new methods are not applicable.

A Sample Application

The type of application in which allowing errors is expected to permit a useful reduction in hash area size may be characterized as follows. Suppose a program must perform a computational process for a large number of distinct cases. Further, suppose that for a great majority of these cases the process is very simple, but for a small difficult-to-identify minority the process is very complicated. Then, it might be useful to hash code identifiers for the minority set of cases so that each case to be processed could be more easily tested for membership in the minority set. If a particular case is rejected, as would happen most of the time, the simple process would be used. If a case is not rejected, it could then be subjected to a follow-up test to determine if it is actually a member of the minority set or an "allowable error." By allowing errors of this kind, the hash area may be made small enough for this procedure to be practical.

As one example of this type of application, consider a program for automatic hyphenation. Let us assume that a few simple rules could properly hyphenate 90 percent of all English words, but a dictionary lookup would be required for the other 10 percent. Suppose this dictionary is too big to fit in available core memory and is therefore kept on a disk. By allowing a few words to be falsely identified as belonging to the 10 percent, a hash area for the 10 percent might be made small enough to fit in core. When an "allowable error" occurred, the test word would not be found on the disk, and the simple rules could be used to hyphenate it. Needless disk access would be a rare occurrence with a frequency related to the size of the core resident hash area. This sample application will be analyzed in detail at the end of this paper.

A Conventional Hash-Coding Method

As a point of departure, we will review a conventional hash-coding method where no errors are permitted. Assume we are storing a set of n messages, each b bits long. First we organize the hash area into h cells of b + 1 bits each, h > n. The extra bit in each cell is used as a flag to indicate whether or not the cell is empty. For this purpose, the message is treated as a b + 1 bit object with the first bit always set to 1. The storing procedure is then as follows:

Generate a pseudorandom number called a hash address, say k (0 ≤ k ≤ h - 1), in a manner which depends on the message under consideration. Then check the kth cell to see if it is empty. If so, store the message in the kth cell. If not, continue to generate additional hash addresses until an empty cell is found, into which the message is then stored.

The method of testing a new message for membership is similar to that of storing a message. A sequence of hash addresses is generated, using the same random number generation technique as above, until one of the following occurs.

1. A cell is found which has stored in it the identical message as that being tested. In this case, the new message belongs to the set and is said to be accepted.

2. An empty cell is found. In this case the new message does not belong to the set and is said to be rejected.
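The storing and testing procedures above can be sketched in Python. This is a minimal illustration, not the paper's implementation: the class name `ConventionalHashArea`, the helper `hash_addresses`, and the use of SHA-256 as the pseudorandom address generator are all assumptions made for the example, and `None` stands in for the per-cell "empty" flag bit.

```python
import hashlib


def hash_addresses(message: bytes, h: int):
    """Yield a pseudorandom sequence of cell addresses in 0..h-1 (assumed scheme)."""
    i = 0
    while True:
        digest = hashlib.sha256(message + i.to_bytes(4, "big")).digest()
        yield int.from_bytes(digest[:8], "big") % h
        i += 1


class ConventionalHashArea:
    """Error-free hash area: each cell holds a whole message or is empty."""

    def __init__(self, h: int):
        self.cells = [None] * h  # None plays the role of the empty flag
        self.count = 0

    def store(self, message: bytes) -> None:
        # Keep at least one cell empty so that every rejection can terminate.
        if self.count >= len(self.cells) - 1:
            raise ValueError("hash area needs h > n cells")
        for k in hash_addresses(message, len(self.cells)):
            if self.cells[k] is None:
                self.cells[k] = message
                self.count += 1
                return
            if self.cells[k] == message:
                return  # already stored (a convenience not in the text)

    def test(self, message: bytes) -> bool:
        for k in hash_addresses(message, len(self.cells)):
            if self.cells[k] is None:
                return False  # empty cell found: rejected
            if self.cells[k] == message:
                return True   # identical message found: accepted
```

Because whole messages are stored and compared, this variant never accepts a nonmember; the price is b + 1 bits per cell.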

Two Hash-Coding Methods with Allowable Errors

Method 1 is derived in a natural way from the conventional error-free method. The hash area is organized into cells as before, but the cells are smaller, containing a code instead of the entire message. The code is generated from the message, and its size depends on the permissible fraction of errors. Intuitively, one can see that the cell size should increase as the allowable fraction of errors gets smaller. When the fraction of errors is sufficiently small (approximately $2^{-b}$), the cells will be large enough to contain the entire message itself, thereby resulting in no errors. If P represents the allowable fraction of errors, it is assumed that $1 \gg P \gg 2^{-b}$.

Having decided on the size of cell, say c < b, chosen so that the expected fraction of errors will be close to and smaller than P, the hash area is organized into cells of c bits each. Then each message is coded into a c-bit code (not necessarily unique), and these codes are stored and tested in a manner similar to that used in the conventional error-free method. As before, the first bit of every code is set to 1. Since the codes are not unique, as were the original messages, errors of commission may arise.
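A rough Python sketch of method 1 follows. It is an illustration under stated assumptions: the class name `Method1HashArea`, the SHA-256 based code and address generators, and the early return when an identical code is met during storing are choices made for this example, not details given in the text. A cell value of 0 means empty, which is safe because every stored code has its leading bit set to 1.

```python
import hashlib


def _digest_int(message: bytes, salt: bytes) -> int:
    """Assumed helper: derive a 64-bit pseudorandom integer from salt + message."""
    return int.from_bytes(hashlib.sha256(salt + message).digest()[:8], "big")


class Method1HashArea:
    """Cells hold c-bit codes instead of whole messages (errors of commission possible)."""

    def __init__(self, h: int, c: int):
        self.h, self.c = h, c
        self.cells = [0] * h  # 0 = empty; codes always have the top bit set
        self.count = 0

    def _code(self, message: bytes) -> int:
        low = _digest_int(message, b"code") % (1 << (self.c - 1))
        return (1 << (self.c - 1)) | low  # first bit of every code is 1

    def _addresses(self, message: bytes):
        i = 0
        while True:
            yield _digest_int(message, b"addr" + i.to_bytes(4, "big")) % self.h
            i += 1

    def store(self, message: bytes) -> None:
        if self.count >= self.h - 1:
            raise ValueError("hash area needs at least one empty cell")
        code = self._code(message)
        for k in self._addresses(message):
            if self.cells[k] == 0:
                self.cells[k] = code
                self.count += 1
                return
            if self.cells[k] == code:
                return  # same code already on this probe path; testing will accept

    def test(self, message: bytes) -> bool:
        code = self._code(message)
        for k in self._addresses(message):
            if self.cells[k] == 0:
                return False  # rejected
            if self.cells[k] == code:
                return True   # accepted, possibly an allowable error
```

Stored messages are always accepted; a nonmember is accepted only when its code happens to match a stored code before an empty cell is reached.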

Method 2 completely gets away from the conventional concept of organizing the hash area into cells. The hash area is considered as N individual addressable bits, with addresses 0 through N - 1. It is assumed that all bits in the hash area are first set to 0. Next, each message in the set to be stored is hash coded into a number of distinct bit addresses, say $a_1, a_2, \ldots, a_d$. Finally, all d bits addressed by $a_1$ through $a_d$ are set to 1.

To test a new message a sequence of d bit addresses, say $a_1', a_2', \ldots, a_d'$, is generated in the same manner as for storing a message. If all d bits are 1, the new message is accepted. If any of these bits is zero, the message is rejected.
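Method 2 is the structure now commonly called a Bloom filter. The sketch below is a minimal Python illustration; the class name `BitHashArea` and the SHA-256 based generation of d distinct bit addresses are assumptions for the example, since the text does not prescribe a particular hash function.

```python
import hashlib


class BitHashArea:
    """Hash area of n_bits individually addressable bits, d bits set per message."""

    def __init__(self, n_bits: int, d: int):
        if not 1 <= d <= n_bits:
            raise ValueError("need 1 <= d <= n_bits")
        self.n_bits, self.d = n_bits, d
        self.bits = bytearray((n_bits + 7) // 8)  # all bits start at 0

    def _addresses(self, message: bytes) -> list:
        """Return d distinct bit addresses for the message (assumed scheme)."""
        found, i = [], 0
        while len(found) < self.d:
            digest = hashlib.sha256(i.to_bytes(4, "big") + message).digest()
            a = int.from_bytes(digest[:8], "big") % self.n_bits
            if a not in found:  # skip repeats so the d addresses are distinct
                found.append(a)
            i += 1
        return found

    def store(self, message: bytes) -> None:
        for a in self._addresses(message):
            self.bits[a >> 3] |= 1 << (a & 7)

    def test(self, message: bytes) -> bool:
        # Accepted only if every addressed bit is 1; all() stops at the first 0 bit.
        return all(self.bits[a >> 3] & (1 << (a & 7)) for a in self._addresses(message))
```

In the hyphenation setting described earlier, `test` would be called for every word, and only accepted words would trigger the slower dictionary lookup.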

Intuitively, it can be seen that up to a point of diminishing returns, the larger d is, the smaller will be the expected fraction of errors. This point of diminishing returns occurs when increasing d by 1 causes the fraction of 1 bits in the hash field to grow too large. The increased a priori likelihood of each bit accessed being a 1 outweighs the effect of adding the additional bit to be tested when half the bits in the hash field are 1 and half are 0, as will be shown later in this paper. Consequently, for any given hash field size N, there is a minimum possible expected fraction of errors, and method 2 therefore precludes the error-free performance possible by modifying method 1 for a very small allowable fraction of errors.

Computational Factors

Allowable Fraction of Errors. This factor will be analyzed with respect to how the size of the hash area can be reduced by permitting a few messages to be falsely identified as members of a given set of messages. We represent the fraction of errors by

$$P = (n_a - n)/(n_t - n), \tag{1}$$

where: $n_a$ is the number of messages in the message space that would be accepted as members of the given set; $n$ is the number of messages in the given set; and $n_t$ is the total number of distinct messages in the message space.

Space. The basic space factor is the number of bits, N, in the hash area. By analyzing the effect on the time factor of changing the value of N, a suitable normalized measure of the space factor will be introduced later. Using this normalized measure will separate the effects on time due to the number of messages in the given message set and the allowable fraction of errors from the effects on time due to the size of the hash area. This separation of effects will permit a clearer picture of the space/time trade-offs to be presented.

Time. The time factor is the average time required to reject a message as a member of the given set. In measuring this factor, the unit used is the time required to calculate a single bit address in the hash area, to access the addressed bit, and to make an appropriate test of the bit's contents. For the conventional hash-coding method, the test is a comparison of the addressed bit in the hash area with the corresponding bit of the message. For method 1, the test is a comparison of the hash area bit with a corresponding bit of a code derived from the message. For method 2, the test is simply to determine the contents of the hash area bit; e.g. is it 1? For the analysis to follow, it is assumed that the unit of time is the same for all three methods and for all bits in the hash area.¹

The time factor measured in these units is called the normalized time measure, and the space/time trade-offs will be analyzed with respect to this factor. The normalized time measure is

$$T = \operatorname*{mean}_{m_i \in \bar{\alpha}} (t_i), \tag{2}$$

where: $M$ is the given set of messages; $\alpha$ is the set of messages identified (correctly or falsely) as members of $M$; $\bar{\alpha}$ is the set of messages identified as nonmembers of $M$; $m_i$ is the ith message; and $t_i$ is the time required to reject the ith message.

¹ The reader should note that this is a very strong assumption, since the time to generate the successive hash-coded bit addresses used in method 2 is assumed to be the same as the time to successively increment the single hash-coded cell address used in method 1. Furthermore, conventional computers are able to access memory and make comparisons in multibit chunks, i.e. bytes or words. The size of a chunk varies from one machine to another. In order to establish the desired mathematical comparisons, some fixed unit of accessibility and comparability is required, and it is simplest to use a single bit as this unit. If the reader wishes to analyze the comparative performance of the two methods for a machine with a multibit unit and wishes to include the effects of multiple hash codings per message, this can readily be done by following a similar method of analysis to the one used here for a single-bit unit.

Analysis of the Conventional Hash-Coding Method

The hash area has N bits and is organized into h cells of b + 1 bits each, of which n cells are filled with the n messages in M. Let $\phi$ represent the fraction of cells which are empty. Then

$$\phi = \frac{h - n}{h} = \frac{N - n(b + 1)}{N}. \tag{3}$$

Solving for N yields

$$N = \frac{n(b + 1)}{1 - \phi}. \tag{4}$$

Let us now calculate the normalized time measure, T. T represents the expected number of bits to be tested during a typical rejection procedure. T also equals the expected number of bits to be tested after a nonempty cell has been accessed and abandoned. That is, if a hash-addressed cell contains a message other than the message to be tested, on the average this will be discovered after, say, E bits are tested. Then the procedure, in effect, starts over again. Since $\phi$ represents the fraction of cells which are empty, the probability of accessing a nonempty cell is $(1 - \phi)$, and the probability of accessing an empty cell is $\phi$. If a nonempty cell is accessed, the expected number of bits to be tested is E + T, since E represents the expected number of bits to be tested in rejecting the nonempty accessed cell, and T represents the expected number of bits to be tested when the procedure is repeated. If an empty cell is accessed, then only one bit is tested to discover this fact. Therefore

$$T = (1 - \phi)(E + T) + \phi. \tag{5}$$

In order to calculate a value for E, we note that the conditional probability that the first x bits of a cell match those of a message to be tested, and the (x + 1)th bit does not match, given that the cell contains a message other than the message to be tested, is $(1/2)^x$. (The reader should remember that the first bit of a message always matches the first bit of a nonempty cell, and consequently the exponent is x rather than x + 1, as would otherwise be the case.) Thus, for $b \gg 1$, the expected value of E is approximated by the following sum:

$$E = \sum_{x=1}^{\infty} (x + 1) \left(\tfrac{1}{2}\right)^{x} = 3. \tag{6}$$

Therefore

$$T = (3/\phi) - 2, \tag{7}$$

$$N = n (b + 1) \cdot \frac{T + 2}{T - 1}. \tag{8}$$

Equation (8) represents the space/time trade-off for the conventional hash-coding method.
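As a numerical sanity check on this derivation, the short Python sketch below evaluates the sum for E, the reject time T as a function of the empty-cell fraction, and the resulting hash area size. The function names are hypothetical and the formulas are simply restated from the analysis above: E is the sum of (x + 1)(1/2)^x, T = 3/phi - 2, and N = n(b + 1)(T + 2)/(T - 1).

```python
def expected_bits_per_mismatch(terms: int = 200) -> float:
    """Partial sum of (x + 1) * (1/2)**x for x = 1..terms; converges to 3."""
    return sum((x + 1) * 0.5 ** x for x in range(1, terms + 1))


def reject_time(phi: float) -> float:
    """Normalized reject time T for a fraction phi of empty cells."""
    return 3 / phi - 2


def conventional_bits(n: int, b: int, T: float) -> float:
    """Hash area size N for n messages of b bits at normalized reject time T."""
    return n * (b + 1) * (T + 2) / (T - 1)
```

For instance, with half the cells empty the reject time is 4 bit tests, and the hash area is twice the size needed to hold the n stored messages, which agrees with N = n(b + 1)/(1 - phi).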

424 Communications of the ACM Volume 13 / Number 7 / July, 1970


Analysis of Method 1

The hash area contains N' bits and is organized into cells of c bits each. In a manner analogous to the conventional method, we establish the following equations:

$$\phi' = \frac{N' - n c}{N'} = \text{the fraction of empty cells}. \tag{9}$$

$$N' = \frac{n c}{1 - \phi'}. \tag{10}$$

$$T' = (3/\phi') - 2. \tag{11}$$

$$N' = n c \cdot \frac{T' + 2}{T' - 1}. \tag{12}$$

It remains to derive relations for the corresponding expected fraction of errors, P', in terms of c and T' and in terms of N' and T'.

A message to be tested which is not a member of the set of messages, M, will be erroneously accepted as a member of M when:

(1) one of the sequence of hash addresses generated from the test message contains the same code, say C, as that generated from the test message; and

(2) that such a hash address is generated earlier in the sequence than the hash address of some empty cell.

The expected fraction of test messages, not members of M, which are erroneously accepted is then

$$P' = \left(\tfrac{1}{2}\right)^{c - 1} / \phi'. \tag{13}$$

Therefore

$$c = -\log_2 P' + 1 + \log_2 \frac{T' + 2}{3}, \tag{14}$$

$$N' = n \cdot \left(-\log_2 P' + 1 + \log_2 \frac{T' + 2}{3}\right) \cdot \frac{T' + 2}{T' - 1}. \tag{15}$$

Equation (15) represents the trade-offs among all three computational factors for method 1.
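The following Python sketch evaluates the method 1 trade-off numerically. It assumes the reading P' = (1/2)^(c-1) / phi' for eq. (13) together with T' = 3/phi' - 2, so that 1/phi' = (T' + 2)/3; the function names are hypothetical. The cell size comes out as a real number, so a practical design would round it up to a whole number of bits.

```python
import math


def method1_cell_bits(P: float, T: float) -> float:
    """Cell size c (in bits) for error fraction P and reject time T, per eq. (14)."""
    return -math.log2(P) + 1 + math.log2((T + 2) / 3)


def method1_bits(n: int, P: float, T: float) -> float:
    """Hash area size N' for method 1, per eq. (15)."""
    return n * method1_cell_bits(P, T) * (T + 2) / (T - 1)
```

As a consistency check, P = 1/2 and T = 4 (so phi' = 1/2) give c = 3, and indeed (1/2)^(3-1) / (1/2) = 1/2.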

Analysis of Method 2

Let $\phi''$ represent the expected proportion of bits in the hash area of N'' bits still set to 0 after n messages have been hash stored, where d is the number of distinct bits set to 1 for each message in the given set.

$$\phi'' = (1 - d/N'')^{n}. \tag{16}$$

A message not in the given set will be falsely accepted if all d bits tested are 1's. The expected fraction of test messages, not in M, which result in such errors is then

$$P'' = (1 - \phi'')^{d}. \tag{17}$$
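To see the point of diminishing returns in d numerically, the Python sketch below evaluates the expected zero fraction (1 - d/N'')^n and the expected error fraction (1 - phi'')^d for a range of d. The function names and the sample sizes (1000 messages, 10000 bits) are arbitrary choices for illustration.

```python
def zero_fraction(n: int, d: int, n_bits: int) -> float:
    """Expected fraction of bits still 0 after storing n messages, eq. (16)."""
    return (1 - d / n_bits) ** n


def error_fraction(n: int, d: int, n_bits: int) -> float:
    """Expected fraction of nonmembers falsely accepted, eq. (17)."""
    return (1 - zero_fraction(n, d, n_bits)) ** d


def best_d(n: int, n_bits: int, d_max: int = 30) -> int:
    """Value of d in 1..d_max that minimizes the expected error fraction."""
    return min(range(1, d_max + 1), key=lambda d: error_fraction(n, d, n_bits))
```

With n = 1000 and 10000 bits, the minimum falls at d = 7, where roughly half of the bits are still 0, in line with the intuitive argument given for method 2.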

Assuming $d \ll N''$, as is certainly the case, we take the log base 2 of both sides of eq. (16) and obtain approximately

$$\log_2 \phi'' = n \cdot \log_2 (1 - d/N'') \approx -n \cdot (d/N'') \cdot \log_2 e. \tag{18}$$

Therefore

$$N'' = n \cdot (-\log_2 P'') \cdot \frac{\log_2 e}{\log_2 \phi'' \cdot \log_2 (1 - \phi'')}. \tag{19}$$
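The step from eq. (17) and the approximation for $\log_2 \phi''$ to eq. (19) can be spelled out as follows. This is a sketch of the intermediate algebra, not text from the original derivation: eq. (17) is solved for d, the approximation is solved for $N''$, and the first result is substituted into the second.

```latex
\begin{align*}
d &= \frac{\log_2 P''}{\log_2 (1 - \phi'')}
  && \text{from } P'' = (1 - \phi'')^{d}, \\
N'' &= \frac{-n \, d \, \log_2 e}{\log_2 \phi''}
  && \text{from } \log_2 \phi'' \approx -n \, (d/N'') \, \log_2 e, \\
N'' &= n \cdot (-\log_2 P'') \cdot
       \frac{\log_2 e}{\log_2 \phi'' \cdot \log_2 (1 - \phi'')}
  && \text{substituting } d .
\end{align*}
```

Both logarithms in the denominator are negative for $0 < \phi'' < 1$, so their product is positive and $N''$ comes out positive.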

We now derive a relationship for the normalized time measure $T''$. x bits will be tested when the first x - 1 bits tested are 1, and the xth tested bit is 0. This occurs with probability $\phi'' \cdot (1 - \phi'')^{x-1}$. For $P'' \ll 1$ and $d \gg 1$, the approximate value of $T''$ (the expected value of the number of bits tested for each rejected test message) is then

$$T'' = \sum_{x=1}^{\infty} x \cdot \phi'' \cdot (1 - \phi'')^{x-1} = 1/\phi''. \tag{20}$$
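As a numerical illustration of eq. (20), the sketch below sums the series directly and prints it next to $1/\phi''$. The function name and the truncation point `max_bits` are assumptions made for the example. In the method itself at most d bits are tested per message, so the infinite sum is the approximation that holds for $d \gg 1$.

```python
def expected_bits_tested(phi: float, max_bits: int = 10_000) -> float:
    """Partial sum of eq. (20): sum over x of x * phi * (1 - phi)**(x - 1)."""
    return sum(x * phi * (1 - phi) ** (x - 1) for x in range(1, max_bits + 1))


if __name__ == "__main__":
    for phi in (0.25, 0.5, 0.75):  # hypothetical zero-bit fractions
        print(phi, expected_bits_tested(phi), 1 / phi)
```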

Therefore

$$N'' = n \cdot (-\log_2 P'') \cdot \frac{\log_2 e}{\log_2 (1/T'') \cdot \log_2 (1 - 1/T'')}. \tag{21}$$

Equation (21) represents the trade-offs among the three computational factors for method 2.
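Equation (21) translates directly into a short calculation. The Python sketch below is one way to evaluate it. The names `method2_hash_area_bits` and `method2_bits_per_message` and the sample parameters are hypothetical, and the second function assumes $\phi'' = 1/T''$ from eq. (20) together with eq. (17).

```python
import math


def method2_hash_area_bits(n: int, p_error: float, t_reject: float) -> float:
    """Eq. (21): hash area size N'' for n messages, error fraction P'', time T''."""
    if not 0 < p_error < 1:
        raise ValueError("p_error must lie strictly between 0 and 1")
    if t_reject <= 1:
        raise ValueError("t_reject must exceed 1 so that 1 - 1/T'' is positive")
    log2_e = math.log2(math.e)
    denominator = math.log2(1 / t_reject) * math.log2(1 - 1 / t_reject)
    return n * (-math.log2(p_error)) * log2_e / denominator


def method2_bits_per_message(p_error: float, t_reject: float) -> float:
    """d from eq. (17) with phi'' = 1/T'': d = log2 P'' / log2(1 - 1/T'')."""
    return math.log2(p_error) / math.log2(1 - 1 / t_reject)


if __name__ == "__main__":
    n, p_error, t_reject = 1_000, 0.01, 2.0  # hypothetical sample values
    print(method2_hash_area_bits(n, p_error, t_reject))
    print(method2_bits_per_message(p_error, t_reject))
```

With n = 1,000, $P'' = 0.01$, and $T'' = 2$ this gives a hash area of roughly 9,585 bits, with about 6.6 bits set per stored message.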

Comparison of Methods 1 and 2

To compare the relative space/time trade-offs between methods 1 and 2, it will be useful to introduce a normalized space measure,

$$S = N / (-n \cdot \log_2 P). \tag{22}$$

S is normalized to eliminate the influence of the size of the given set of messages, n, and the allowable fraction of errors, P. Substituting the relation (22) into eqs. (15) and (21) gives

$$S' = \frac{T' + 2}{T' - 1} \cdot \left(1 + \frac{1 + \log_2 \frac{T' - 1}{3}}{-\log_2 P'}\right), \tag{23}$$

$$S'' = \frac{\log_2 e}{\log_2 (1/T'') \cdot \log_2 (1 - 1/T'')}. \tag{24}$$

We note that $S' > (T' + 2)/(T' - 1)$, and

$$\lim_{P' \to 0} S' = \frac{T' + 2}{T' - 1}. \tag{25}$$

The superiority of method 2 over method 1 can be seen directly by examining Figure 1, which is a graph of the S versus T curves for eq. (24) and the lower bound limit equation, eq. (25).² The curves in Figure 1 illustrate the space/time trade-offs for both methods assuming a fixed value for the expected fraction of errors, P, and a fixed number of messages, n.
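The two curves can also be tabulated rather than read off the graph. The Python sketch below evaluates eq. (24) and the limit in eq. (25) at a few values of T; the function names and the chosen sample points are assumptions made for illustration.

```python
import math


def method2_space(t: float) -> float:
    """Eq. (24): normalized space S'' as a function of normalized time T''."""
    return math.log2(math.e) / (math.log2(1 / t) * math.log2(1 - 1 / t))


def method1_space_bound(t: float) -> float:
    """Eq. (25): limiting value of S' as P' tends to 0."""
    return (t + 2) / (t - 1)


if __name__ == "__main__":
    for t in (1.1, 1.5, 2.0, 3.0, 4.0):  # hypothetical sample points
        print(
            f"T = {t:4.1f}   method 2: {method2_space(t):6.3f}   "
            f"method 1 bound: {method1_space_bound(t):6.3f}"
        )
```

At each of these sample points the method 2 value lies below the method 1 bound. The comparison covers only the listed values of T and inherits the constant bit access time assumption of the analysis.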

² The reader should note that the superiority of method 2 over method 1 depends on the assumption of constant time of addressing, accessing, and testing bits in the hash area. This result may no longer hold if the effects of accessing multibit "chunks" and of calculating additional hash-coded bit addresses are taken into account.

Volume 13 / Number 7 / July, 1970    Communications of the ACM    425


For method 2, an increase in $T''$ corresponds to a decrease in $\phi''$, the fraction of 0 bits in the hash field after all n messages have been hash coded. To maintain a fixed value of P, this decrease in $\phi''$ corresponds to an increase in the number of bits tested per message, d, and an appropriate adjustment to the hash area size, $N''$. $S''$ is directly proportional to the hash area size $N''$, as shown in eq. (22). As $T''$ increases, and correspondingly, $\phi''$ decreases, $S''$ and $N''$ decrease until a point of diminishing returns is reached. This point of diminishing returns is illustrated in Figure 1 where $S''$ is smallest for $1/T'' = 1 - 1/T''$, i.e. for $T'' = 2$.

FIG. 1. [Graph of S versus T for METHOD 1 and METHOD 2. Horizontal axis: T - NORMALIZED TIME MEASURE.]

Since $\phi'' = 1/T''$, this means that the smallest hash area for which method 2 can be used occurs when half the bits are 1 and half are 0. The value of $S''$ corresponding to $T'' = 2$ is $S'' = \log_2 e \approx 1.44$.
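Why the minimum falls at $T'' = 2$ can be checked from eq. (24) directly. The following is a sketch of that check, written in terms of $\phi'' = 1/T''$; it is not part of the original derivation. $S''$ is smallest where the product of the two logarithms is largest, and that product is unchanged when $\phi''$ and $1 - \phi''$ are swapped.

```latex
\begin{align*}
S''(\phi'') &= \frac{\log_2 e}{\log_2 \phi'' \cdot \log_2 (1 - \phi'')},
  \qquad 0 < \phi'' < 1, \\
\frac{d}{d\phi''} \bigl[ \ln \phi'' \cdot \ln (1 - \phi'') \bigr]
  &= \frac{\ln (1 - \phi'')}{\phi''} - \frac{\ln \phi''}{1 - \phi''}, \\
\text{which vanishes at } \phi'' &= \tfrac{1}{2}
  \quad (\text{both terms equal } 2 \ln \tfrac{1}{2}), \\
S''\bigl(\tfrac{1}{2}\bigr) &= \frac{\log_2 e}{(-1) \cdot (-1)} = \log_2 e \approx 1.4427 .
\end{align*}
```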

Analysis of Hyphenation Sample Application

In this section we will calculate a nominal size for a hash area for solving the sample automated hyphenation problem. Since we have already concluded that method 2 is better than method 1, we will compare the conventional method with method 2.

Let us assume that there are about 500,000 words to be hyphenated by the program and that 450,000 of these words can be hyphenated by application of a few simple rules. The other 50,000 words require reference to a dictionary. It is reasonable to estimate that at least 19 bits would, on the average, be required to represent each of these 50,000 words using a conventional hash-coding method. If we assume that a time factor of T = 4 is acceptable, we find from eq. (9) that the hash area would be 2,000,000 bits in size. This might very well be too large for a practical core contained hash area. By using method 2 with an allowable error frequency of, say, P = 1/16, and using the smallest possible hash area by having T = 2, we see from eq. (22) that the problem can be solved with a hash area of less than 300,000 bits, a size which would very likely be suitable for a core hash area.

TABLE I. SUMMARY OF EXPECTED PERFORMANCE OF HYPHENATION APPLICATION OF HASH CODING USING METHOD 2 FOR VARIOUS VALUES OF ALLOWABLE FRACTION OF ERRORS

P = Allowable Fraction    N = Size of Hash Area    Disk Accesses
of Errors                 (Bits)                   Saved
1/2                        72,800                  45.0%
1/4                       145,600                  67.5%
1/8                       218,400                  78.7%
1/16                      291,200                  84.4%
1/32                      364,000                  87.2%
1/64                      509,800                  88.5%
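The method 2 estimate of less than 300,000 bits can be reproduced by solving eq. (22) for N and using the minimum normalized space $S'' = \log_2 e$ at $T'' = 2$. The Python sketch below does this; the function and constant names are hypothetical, and the word count and error fraction are the ones assumed in the text.

```python
import math

WORDS_IN_DICTIONARY = 50_000  # words that need the dictionary, from the text
ALLOWABLE_ERROR = 1 / 16      # P, from the text


def hash_area_from_normalized_space(n: int, p_error: float, s: float) -> float:
    """Eq. (22) solved for N: N = S * n * (-log2 P)."""
    return s * n * (-math.log2(p_error))


if __name__ == "__main__":
    s_min = math.log2(math.e)  # S'' at T'' = 2, from eq. (24)
    bits = hash_area_from_normalized_space(
        WORDS_IN_DICTIONARY, ALLOWABLE_ERROR, s_min
    )
    print(round(bits))
```

Under these assumptions the result is a little under 289,000 bits, consistent with the "less than 300,000 bits" quoted in the text. Table I lists 291,200 bits for the same P, which corresponds to a constant of about 1.456 in place of $\log_2 e$.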

With a choice for P of 1/16, an access would be required to the disk resident dictionary for approximately 50,000 + 450,000/16 ≈ 78,000 of the 500,000 words to be hyphenated, i.e. for approximately 16 percent of the cases. This constitutes a reduction of 84 percent in the number of disk accesses from those required in a typical conventional approach using a completely disk resident hash area and dictionary.
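The disk access arithmetic above generalizes to other values of P. The Python sketch below repeats it with exact fractions; the function and constant names are hypothetical, and it assumes that the conventional baseline costs one disk access for each of the 500,000 words.

```python
from fractions import Fraction

TOTAL_WORDS = 500_000
DICTIONARY_WORDS = 50_000
RULE_WORDS = TOTAL_WORDS - DICTIONARY_WORDS  # 450,000 words handled by simple rules


def expected_disk_accesses(p_error: Fraction) -> Fraction:
    """Dictionary words always go to disk; rule words only on a false accept."""
    return DICTIONARY_WORDS + RULE_WORDS * p_error


def fraction_of_accesses_saved(p_error: Fraction) -> Fraction:
    """Saving relative to the assumed baseline of one disk access per word."""
    return 1 - expected_disk_accesses(p_error) / TOTAL_WORDS


if __name__ == "__main__":
    for k in range(1, 7):
        p_error = Fraction(1, 2**k)
        saved = fraction_of_accesses_saved(p_error)
        print(f"P = {p_error}: {float(saved):.1%} of disk accesses saved")
```

For P = 1/16 this gives 78,125 expected accesses and a saving of 27/32, or about 84.4 percent, matching the figures in the text.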

Table I shows how alternative choices for the value of P affect the size of the core resident hash area and the percentage of disk accesses saved as compared to the "typical conventional approach."

Acknowledgments. The author wishes to express his thanks to Mr. Oliver Selfridge for his many helpful suggestions in the writing of this paper.

RECEIVED OCTOBER, 1969; REVISED APRIL, 1970

REFERENCES

1. BATSON, A. The organization of symbol tables. Comm. ACM 8, 2 (Feb. 1965), 111-112.
2. MAURER, W. D. An improved hash code for scatter storage. Comm. ACM 11, 1 (Jan. 1968), 35-38.
3. MORRIS, R. Scatter storage techniques. Comm. ACM 11, 1 (Jan. 1968), 38-44.
