bloom bloomfilter orig paper
TRANSCRIPT
8/6/2019 Bloom Bloomfilter Orig Paper
http://slidepdf.com/reader/full/bloom-bloomfilter-orig-paper 1/5
S p a ce /T im e Trad e-of fs in
Hash Coding with
Al lowable Errors
BURTON H. BLOOM
Computer Usage Comp any, Newton Upp er Falls, Mass.
I n t hi s p a p e r t r a d e - o f f s a m o n g c e r t a in c o m p u t a t io n a l f a c t o r s
i n h a s h c o d i n g a r e a n a l y z e d . T h e p a r a d i g m p r o b l e m c o n -
s i d e r e d i s t h a t o f t e st in g a s e r ie s o f m e s s a g e s o n e - b y - o n e
f o r m e m b e r s h i p i n a g i v e n s e t o f m e s s ag e s . T w o n e w h a s h -
c o d i n g m et ho d s a r e e x a m i n e d a n d c o m p a r e d w i t h a p a r -
t i c u l a r c o n v e n t i o n a l h a s h - c o d i n g m e t h o d . T h e c o m p u t a t i o n a l
f a c t o r s c o n s i d e r e d a r e t h e s iz e o f th e h a s h a r e a ( s p a c e ) , th e
t im e r e q u i r e d t o id e n t i f y a m e s s a g e a s a n o n m e m b e r o f t h e
g i v e n s e t ( r e je c t ti m e ) , a n d a n a l l o w a b l e e r r o r f re q u e n c y .
T h e n e w m e t h od s a r e i n t e n d e d t o r e d u c e t h e a m o u n t o f
s p a c e r e q u i r e d t o c o nt a in t h e h a s h - c o d e d i n f o r m a t i o n f ro mt h a t a s s o c i a t e d w i t h c o n v e n t i o n a l m e t h o d s . T h e r e d u c t i o n i n
s p a c e i s a c c o m p li s h ed b y e x p l o i t i n g t h e p o s s i b i l it y th a t a
s m a l l f r a c t i o n o f e r r o r s o f c o m m i s s io n m a y b e t o l e r a b l e i n
s o m e a p p l i c a t i o n s , i n p a r t i c u l a r , a p p l i c a t i o n s i n w h i c h a l a r g e
a m o u n t o f d a t a i s i n v o l v e d a n d a c o r e r e s i d e n t h a s h a r e a i s
c o n s e q u e n t l y n o t f e a s i b l e u s i n g c o n v e n t i o n a l m e t h od s .
I n s u c h a p p l i c a t i o n s , i t is e n v i s a g e d t h a t o v e r a l l p e r f o r m a n c e
c o u ld b e i m p r o v e d b y u s i n g a s m a l l e r c o r e r e s i d e n t ha s h a r e a
i n c o n ju n c ti o n w i th t h e n e w m e t h o d s a n d , w h e n n e c e s s a r y , b y
u s i n g s o m e s e c o n d a r y a n d p e r h a p s t im e - c o n su m i n g t e s t t o
" c a t c h " t h e s m a l l fr a c t i o n o f e r r o r s a s s o c ia t e d w i t h t h e n e w
m et h o d s . A n ex am p le i s d iscu s sed wh ich i llu s tra t es p o ss ib le
a r e a s o f a p p l i c a t i o n f o r t h e n e w m e t h o d s .
A n a ly s is o f t h e p a r a d i g m p r o b l e m d e m o n s t r a te s t h a t a l -
l o w i n g a s m a l l n u m b e r o f te s t m e s s a g e s t o b e f a l s e l y i d e n t i f i e d
a s m e m b e r s o f t h e g i v e n s e t w i ll p e r m i t a m u c h s m a l l e r h a s h
a r e a t o b e u s e d w i t h o u t in c r e as i n g r e j e c t ti m e .
KEY WO RD S AND PHRASES: hash coding, hash addressing, scatter storage,
searching, storage layout, retrieval trade-ofFs, retrieval eff iciency, storageefficiency
CR CATEGORIES: 3.73, 3.74, 3.79
I n t r o d u c t i o n
I n c o n v e n t i o n a l h a s h c o d i n g , a h a s h a r e a i s o r g a n i z e d
i n t o c e l l s , a n d a n i t e r a t i v e p s e u d o r a n d o m c o m p u t a t i o n a l
p r o c e s s is u s e d t o g e n e r a t e , f r o m t h e g i v e n s e t o f m e s s a g e s ,
h a s h a d d r e s s e s o f e m p t y c e ll s i n t o w h i c h t h e m e s s a g e s
a r e t h e n s t o r e d . M e s s a g e s a r e t e s t e d b y a s i m i l a r p r o c e s s o f
Work on th is paper was in par t perform ed whi le the au tho r w a saff i l ia ted wi th Computer Corporat ion of America, Cambridge,Massachuset t s .
i t e r a t i v e l y g e n e r a t i n g h a s h a d d r e s s e s o f c e ll s. T h e c o n -
t e n t s o f th e s e c e ll s a r e t h e n c o m p a r e d w i t h t h e t e s t m e s -
s a g e s . A m a t c h i n d i c a t e s t h e t e s t m e s s a g e i s a m e m b e r o f
t h e s e t ; a n e m p t y c e ll i n d i c a te s t h e o p p o s i te . T h e r e a d e r i s
a s s u m e d t o b e f a m i l i a r w i t h t h i s a n d s i m i l a r c o n v e n t i o n a l
h a s h - c o d i n g m e t h o d s [ 1 , 2 , 3 ] .
T h e n e w h a s h - c o d in g m e t h o d s t o b e i n t r o d u c e d a r e
s u g g e s t e d f o r a p p l i c a ti o n s i n w h i c h t h e g r e a t m a j o r i t y o f
m e s s a g e s t o b e t e s t e d w i l l n o t b e l o n g t o t h e g i v e n s et . F o r
t h e s e a p p l i c a t i o n s , i t i s a p p r o p r i a t e t o c o n s i d e r a s a u n i t o f
t i m e ( c a l l e d reject time) t h e a v e r a g e t i m e r e q u i r e d t o
c l a ss i fy a t e s t m e s s a g e a s a n o n m e m b e r o f t h e g i v e n s e t .
F u r t h e r m o r e , s i n c e t h e c o n t e n t s o f a c e l l c a n , i n g e n e r a l ,
b e r e c o g n iz e d a s n o t m a t c h i n g a t e s t m e s s a g e b y e x a m i n i n g
o n l y p a r t o f t h e m e s s a g e, a n a p p r o p r i a t e a s s u m p t i o n w i ll
b e i n t r o d u c e d c o n c e r n i n g t h e t i m e r e q u i r e d t o a c c e s s i n -
d i v i d u a l b i ts o f t h e h a s h a r e a .
I n a d d i t i o n t o t h e t w o c o m p u t a t i o n a l f a c t o r s , r e j e c t
t i m e a n d s p a c e ( i. e . h a s h a r e a s i z e ) , t h i s p a p e r c o n s i d e r s a
t h i r d c o m p u t a t i o n a l f a c t o r , a l lo w a b l e f r a c t i o n o f e rr o r s .
I t w i l l b e s h o w n t h a t a l lo w i n g a s m a l l n u m b e r o f t e s t
m e s s a g e s t o b e f a l s el y i d e n ti f ie d a s m e m b e r s o f t h e g i v e ns e t w il l p e r m i t a m u c h s m a l l e r h a s h a r e a t o b e u s e d w i t h o u t
i n c r e a s i n g t h e r e j e c t t i m e . I n s o m e p r a c t i c a l a p p l i c a t i o n s ,
t h i s r e d u c t i o n i n h a s h a r e a s iz e m a y m a k e t h e d i f fe r e n c e
b e t w e e n m a i n t a i n i n g t h e h a s h a r e a i n c o r e , w h e r e i t c a n
b e p r o c e ss e d q u ic k l y , o r h a v i n g t o p u t i t o n a s l ow a c c e s s
b u l k s t o r a g e d e v i c e s u c h a s a d i s k .
T w o m e t h o d s o f r e d u c i n g h a s h a r e a s iz e b y a l lo w i n g
e r r o rs w i ll b e i n t ro d u c e d . T h e t r a d e - o f f b e t w e e n h a s h a r e a
s i z e a n d f r a c t i o n o f a l l o w a b l e e r r o r s w i l l b e a n a l y z e d f o r
e a c h m e t h o d , a s w e l l a s t h e i r r e s u l t i n g s p a c e / t i m e t r a d e -
offs .
T h e r e a d e r s ho u ld n o t e t h a t t h e n e w m e t h o d s a r e n o t
i n t e n d e d a s a l t e r n a t i v e s t o c o n v e n t i o n a l h a s h - c o d i n g m e t h -o d s in a n y a p p l i c a t i o n a r e a i n w h i c h h a s h c o d i n g is c u r -
r e n t l y u s e d ( e .g . s y m b o l t a b l e m a n a g e m e n t [ 1] ). R a t h e r ,
t h e y a r e i n t e n d e d t o m a k e i t p o s s ib l e t o e x p l o i t t h e b e n e -
f i ts o f h a s h - c o d i n g t e c h n i q u e s i n c e r t a i n a r e a s i n w h i c h c o n -
v e n t i o n a l e r r o r - f r e e m e t h o d s s u f fe r fr o m a n e e d f o r h a s h
a r e a s t o o l a r ge t o b e c o r e r e s i d e n t a n d c o n s e q u e n t l y f o u n d
t o b e i n f e as i b le . I n o r d e r t o g a i n s u b s t a n t i a l r e d u c t i o n s i n
h a s h a r e a s iz e , w i t h o u t i n t r o d u c i n g e x c e s si v e r e j e c t t i m e s ,
t h e e r r o r - f r e e p e r f o r m a n c e a s s o c i a t e d w i t h c o n v e n t i o n a l
m e t h o d s i s s a c r i f i c e d . I n a p p l i c a t i o n a r e a s w h e r e e r r o r - f r e e
p e r f o r m a n c e i s a n e c e s s i ty , t h e s e n e w m e t h o d s a r e n o t
a p p l i c a b l e .
A S a m p l e A p p l i c a t i o n
T h e t y p e o f a p p l i c a t i o n i n w h i c h a l l o w i n g e r r o r s i s
e x p e c t e d t o p e r m i t a u s e f u l r e d u c t i o n i n h a s h a r e a s i ze
m a y b e c h a r a c t e r i z e d a s f o l l o w s . S u p p o s e a p r o g r a m m u s t
p e r f o r m a c o m p u t a t i o n a l p r o c e s s fo r a l a r g e n u m b e r o f
d i s t i n c t e as e s. F u r t h e r , s u p p o s e t h a t f o r a g r e a t m a j o r i t y
o f t h e s e c a s e s t h e p r o c e s s i s v e r y s i m p l e , b u t f o r a s m a l l
4 2 2 C o m m u n i c a t i o n s o f t h e A C M V o l u m e 13 / N u m b e r 7 / J u l y , 19 70
8/6/2019 Bloom Bloomfilter Orig Paper
http://slidepdf.com/reader/full/bloom-bloomfilter-orig-paper 2/5
d i f f i c u l t - t o - i d e n t i f y m i n o r i t y t h e p r o c e s s i s v e r y c o m p l i -
c a t e d . T h e n , i t m i g h t b e u s e f u l t o h a s h c o d e i d e n ti f i er s
f o r t h e m i n o r i t y s e t o f c a s es s o t h a t e a c h c a se t o b e p r o c -
e s s ed c o u l d b e m o r e e a s i ly t e s t e d f o r m e m b e r s h i p i n t h e
m i n o r i t y s e t . I f a p a r t i c u l a r c a s e i s r e j e c t ed , a s w o u l d h a p -
p e n m o s t o f t h e t i m e , t h e s i m p l e p r o c es s w o u l d b e u s e d . I f a
c a s e i s n o t r e j e c t e d , i t c o u l d t h e n b e s u b j e c t e d t o a f o l l o w - u p
t e s t t o d e t e r m i n e i f i t is a c t u a l l y a m e m b e r o f t h e m i n o r i t y
s e t o r a n " a l l o w a b l e e r r o r . " B y a l l o w i n g e rr o r s o f t h i s
k i n d , t h e h a s h a r e a m a y b e m a d e s m a l l e n o u g h f o r t h i s
p r o c e d u r e t o b e p r a c t i c a l .
A s o n e e x a m p l e o f t h i s t y p e o f a p p l ic a t i o n , c o n s i d e r a
p r o g r a m f o r a u t o m a t i c h y p h e n a t i o n . L e t u s a ss u m e t h a t a
f e w s i m p l e r u l e s c o u l d p r o p e r l y h y p h e n a t e 9 0 p e r c e n t o f
a ll E n g l i s h w o r d s , b u t a d i c t i o n a r y l o o k u p w o u l d b e r e -
q u i r e d f o r t h e o t h e r 1 0 p e r c e n t . S u p p o s e t h i s d i c t i o n a r y i s
t o o b i g t o f i t i n a v a i l a b l e c o r e m e m o r y a n d i s t h e r e f o r e
k e p t o n a d i s k . B y a l l o w i n g a f e w w o r d s t o b e f a l s e l y
i d e n t i f i e d a s b e l o n g i n g t o t h e 1 0 p e r c e n t , a h a s h a r e a f o r
t h e 1 0 p e r c e n t m i g h t b e m a d e s m a l l e n o u g h t o f i t i n c or e .
W h e n a n " a l lo w a b l e e r r o r " o c c u rr e d , t h e t e s t w o r d w o u l d
n o t b e f o u n d o n th e d i s k , a n d t h e s i m p l e r u l e s c o u l d b e
u s e d t o h y p h e n a t e i t . N e e d l e s s d i s k a c c es s w o u l d b e a
r a r e o c c u r r e n c e w i t h a f r e q u e n c y r e l a t e d t o t h e s i ze o f t h e
c o r e re s i d e n t h a s h a r e a . T h i s s a m p l e a p p l i c a t i o n w i l l b e
a n a l y z e d i n d e t a i l a t t h e e n d o f t h i s p a p e r .
A C o n v e n t i o n a l H a s h - C o d i n g M e t h o d
A s a p o i n t o f d e p a r t u r e , w e w i ll r e v i e w a c o n v e n t i o n a l
h a s h - co d i n g m e t h o d w h e r e n o e r ro r s a r e p e r m i t t e d . A s s u m e
w e a r e s t o r i n g a s e t o f n m e s s a g e s , e a c h b b i t s l o n g . F i r s t
w e o r g a n i z e t h e h a s h a r e a i n t o h c e ll s o f b + 1 b i t s e a c h ,
h > n . T h e e x t r a b i t i n e a c h c e l l i s u s e d a s a f l a g t o i n -
d i c a t e w h e t h e r o r n o t t h e c el l i s e m p t y . F o r t h i s p u r p o s e ,
t h e m e s s a g e i s t r e a t e d a s a b + 1 b i t o b j e c t w i t h t h e f i r s t
b i t a l w a y s s e t t o 1 . T h e s t o r i n g p r o c e d u r e i s t h e n a s f o l -
l o w s :
G e n e r a t e a p s e u d o r a n d o m n u m b e r c a l l e d a hash
address, s a y k , (0 ~ /c ~ h - 1 ) i n a m a n n e r w h ic h d e -
p e n d s o n t h e m e s s a g e u n d e r c o n s i d e r a ti o n . T b e n
c h e c k t h e k t h c e ll t o s e e i f i t is e m p t y . I f s o , s t o r e t h e
m e s s a g e i n t h e k t h c el l. I f n o t , c o n t i n u e t o g e n e r a t e
a d d i t i o n a l h a s h a d d r e s s es u n t i l a n e m p t y c e ll i s f o u n d ,
i n t o w h i c h t h e m e s s a g e is t h e n s t o r e d .
T h e m e t h o d o f t e s ti n g a n e w m e s s a g e f o r m e m b e r s h i p i ss i m i l a r t o t h a t o f st o r i n g a m e s s a g e . A s e q u e n c e o f h a s h
a d d r e s s e s i s g e n e r a t e d , u s i n g t h e s a m e r a n d o m n u m b e r
g e n e r a t i o n t e c h n i q u e a s a b o v e , u n t i l o n e o f t h e f o l l o w i ng
o c c u r s .
1 . A c e l l i s f o u n d w h i c h h a s s t o r e d i n i t t h e i d e n t i c a l
m e s s a g e a s t h a t b e i n g t e s te d . I n t h i s c a se , t h e n e w m e s s a g e
b e l o n g s t o t h e s e t a n d i s s a i d t o b e accepted.
2 . A n e m p t y c e ll i s f o u n d . I n t h i s c a se t h e n e w m e s s a g e
d o e s n o t b e l o n g t o t h e s e t a n d i s sa i d t o b e rejected.
T w o H a s h - C o d i n g M e t h o d s w i t h A l l o w a b l e E r ro r s
M e t h o d 1 i s d e r i v e d in a n a t u r a l w a y f r o m t h e c o n -
v e n t i o n a l e r r o r - f r e e m e t h o d . T h e h a s h a r e a i s o r g a n -
i z e d i n t o c e l l s a s b e f o r e , b u t t h e c e l l s a r e s m a l l e r , c o n -
t a i n i n g a c o d e i n s t e a d o f t h e e n t i r e m e s s a g e . T h e c o d e
i s g e n e r a t e d f r o m t h e m e s s a g e , a n d i t s s i z e d e p e n d s o n t h e
p e r m i s s i b l e f r a c t i o n o f e r r o rs . I n t u i t i v e l y , o n e c a n s e e t h a t
t h e c e l l s i z e s h o u l d i n c r e a s e a s t h e a l l o w a b l e f r a c t i o n o f
e r r o r s g e t s s m a l l e r. W h e n t h e f r a c t i o n o f e r r o r s is s u f -f i c i e n t l y s m a l l ( a p p r o x i m a t e l y 2 -- b) , t h e c e l ls w i ll b e l a r g e
e n o u g h t o c o n t a i n t h e e n t i r e m e s s a g e i ts e lf , t h e r e b y r e -
s u l t i n g i n n o e r r o rs . I f P r e p r e s e n t s t h e a l l o w a b l e f r a c ti o n
o f e rr o r s , i t i s a s s u m e d t h a t 1 > > P > > 2 -b.
H a v i n g d e c i d e d o n t h e s i z e o f c e ll , s a y e < b , c h o s e n s o
t h a t t h e e x p e c t e d f r a c t i o n o f e r r o r s w i ll b e c l o se t o a n d
s m a l l e r t h a n P , t h e h a s h a r e a i s o r g a n iz e d i n t o c e l ls o f c
b i t s e a c h . T h e n e a c h m e s s a g e is c o d e d i n t o a c - b i t c o d e
( n o t n e c e s sa r i l y u n i q u e ) , a n d t h e s e c o d es a r e s to r e d a n d
t e s t e d i n a m a n n e r s i m i l a r t o t h a t u s e d i n t h e c o n v e n t i o n a l
e r r o r - fr e e m e t h o d . A s b e f o r e , t h e f i r s t b i t o f e v e r y c o d e i s
s e t t o 1 . S i n c e t h e c o d e s a r e n o t u n i q u e , a s w e r e t h e o r i g i n a l
m e s s a g e s , e r r o r s o f c o m m i s s i o n m a y a r i s e .
h ~ Ie th o d 2 c o m p ] e t e l y g e t s a w a y f r o m t h e c o n v e n -
t i o n a l c o n c e p t o f o r g a n i z i n g t h e h a s h a r e a i n t o c e l ls . T h e
h a s h a r e a i s c o n s i d e r e d a s N i n d i v i d u a l a d d r e s s a b l e b i t s ,
w i t h a d d r e s s e s 0 t h r o u g h N - 1 . I t is a s s u m e d t h a t
a l l b i t s i n t h e h a s h a r e a a r e f i r s t s e t t o 0 . N e x t , e a c h m e s -
s a g e i n t h e s e t t o b e s t o r e d i s h a s h c o d e d i n t o a n u m b e r o f
d i s t i n c t b i t a d d r e s s e s , s a y a l , a 2 , . . . , a d . F i n a l l y , al l d
b i t s a d d r e s s e d b y a l t h r o u g h a d a r e s e t to 1 .
T o t e s t a n e w m e s s a g e a s e q u e n c e o f d b i t a d d r e s s e s, s a yP ! t
a l , a 2 , • . . , a d , i s g e n e r a t e d i n t h e s a m e m a n n e r a s f o r
s t o r i n g a m e s s a g e . I f a ll d b i t s a r e 1 , t h e n e w m e s s a g e i s
a c c e p t e d . I f a n y o f t h e s e b i t s i s z er o , t h e m e s s a g e is r e -
j e c t e d .
I n t u i t i v e l y , i t c a n b e s e e n t h a t u p t o a p o i n t o f d i m i n i s h -
i n g r e t u r n s , t h e l a r g e r d i s, t h e s m a l l e r w i l l b e t h e e x p e c t e d
f r a c t i o n o f e r ro r s . T h i s p o i n t o f d im i n i s h i n g r e t u r n s o c c u r s
w h e n i n c r e a s i n g d b y 1 c a u s e s t h e f r a c t i o n o f 1 b i t s i n
t h e h a s h f i e l d t o g r o w t o o l a r g e . T h e i n c r e a s e d a p r i o r i
l i k e l i h o o d o f e a c h b i t a c c e s s e d b e i n g a 1 o u t w e i g h s t h e
e f f e ct o f a d d i n g t h e a d d i t i o n a l b i t t o b e t e s t e d w h e n h a l f
t h e b i t s i n t h e h a s h f i e l d a r e 1 a n d h a l f a r e 0 , a s w i l l b e
s h o w n l a t e r i n t h i s p a p e r . C o n s e q u e n t l y , f o r a n y g i v e n
h a s h f i e l d s iz e N , t h e r e i s a m i n i m u m p o s s i bl e e x p e c t e d
f r a c t i o n o f e rr o r s, a n d m e t h o d 2 t h e r e f o r e p r e c lu d e s t h e
e r r o r - f re e p e r f o r m a n c e p o s si b l e b y m o d i f y i n g m e t h o d 1 f o r
a v e r y s m a l l a l l o w a b l e f r a c t i o n o f e r r o rs
C o m p u t a t i o n a l F a c t o r s
Allowable Fraction of Errors . T h i s f a c t o r w i ll b e a n -
a l y z e d w i t h r e s p e c t t o h o w t h e s i z e o f t h e h a s h a r e a c a n
b e r e d u c e d b y p e r m i t t i n g a f e w m e s s a g e s t o b e fa l s e ly
i d e n t if i e d a s m e m b e r s o f a g i v e n s e t o f m e s s a g e s. W e r e p r e -
V o l u m e 1 3 / N u m b e r 7 / J u l y , 1 97 0 C o m m u n i c a t i o n s o f t h e A C M 4 2 3
8/6/2019 Bloom Bloomfilter Orig Paper
http://slidepdf.com/reader/full/bloom-bloomfilter-orig-paper 3/5
sen~ the fraction of errors by
P = ( n a - n ) / ( n t - n ) , (1)
where: n~ is the number of messages in the message space
that would be accepted as members of the given set; n is
the number of messages in the given set; and n t is the total
number of distinct messages in the message space.
S p a c e . The basic space factor is the number of bits, N,in the hash area. By analyzing the effect on the time factor
of changing the value of N, a suitable normalized measure
of the space factor will be introduced later. Using this
normalized measur e will separate the effects on time due to
the number of messages in the given message set and the
allowable fraction of errors from the effects on time due to
the size of the hash area. This separation of effects will
permit a clearer picture of the space/time trade-offs to be
presented.
T i m e . The time factor is the average time required to
reject a message as a member of the given set. In m easuring
this factor, the unit used is the time required to calculate asingle bit address in the hash area, to access the addressed
bit, and to make an appropriate test of the bit ' s contents.
For the conventional hash-coding method, the test is a
comparison of the addressed bit in the hash area with the
corresponding bit of the message. For method 1, the test is
a comparison of the hash area bit with a corresponding bit
of a code derived from the message. For method 2, the
test is simply to determine the contents of the hash area
bit; e.g. is it 1? For the analysis to follow, it is assumed that
the unit of time is the same for all three methods and for1
all bits in the hash area.
The time factor measured in these units is called the
normalized time measure, and the space/time trade-offswill be analyzed with respect to this factor. T he norma lized
time measure is
T = me an (t~), (2)miEa
where: M is the given set of messages; a is the set of
messages identified (correctly or falsely) as member s of M;
1 The reader should note that this is a very strong assumption,since the time to generate the successive hash-coded bit addressesused in method 2 is assumed to be the same as the time to suc-cessively increment the single hash-coded cell address used inmethod 1. Furthermore, conventional computers are able to accessmemory and make comparisons in multibit chunks, i.e. bytes orwords. The size of a chunk varies from one machine to another. Inorder to establish the desired mathematical comparisons, somefixed unit of accessibility and comparability is required, and it issimplest to use a single bit as this unit. If the reader wishes toanalyze the comparative performance of the two methods for amachine with a multibit unit and wishes to include the effects ofmultiple hash codings per message, this can readily be done byfollowing a similar method of analysis to the one used here forsingle-bit unit . ~
g is the set of messages identified as nonmembers of M;
mi is the ith message; and ti is the time required to reject
the ith message.
A n a l y s i s o f t h e C o n v e n t i o n a l H a s h - C o d i n g M e t h o d
The hash area has N bits and is organized into h cells
of b + 1 bits each, of which n cells are filled with th e n
messages in M. Let ¢ represent the fraction of cells which
are empty. Then
h-- n N-- n . ( b + 1)4' - h - N (3 )
Solving for N yiel ds
N - n- (b A- 1) (4)
Let us now calculate the normalized time measure, T.
T represents the expected numb er of bits to be tested during
a typical rejection procedure. T also equals the expected
number of bits to be tested after a nonempty cell has been
accessed and abandoned. That is, if a hash-addressed cellcontains a message other than the message to be tested, on
the average this will be discovered after, say, E bits are
tested. Then the procedure, in effect, starts over again.
Since ~b repre sents the fra ction of cells which a re em pty ,
then the probab ility of accessing a non em pty cell is
(1 - ~b), and the probabi lity of accessing an em pty cell is
~. If an nonempty cell is accessed, the expected number of
bits to be tested is E + T, since E re presents the expected
number of bits to be tested in rejecting the nonempty
accessed cell, and T represents the expected number of
bits to be tested when the procedure is repeated. If an
empty cell is accessed, then only one bit is tested to dis-
cover this fact. Therefore
T = (1 -- ¢) ( E + T) + ~. (5)
In order to calculate a value for E, we note that the con-
ditional probability that the first x bits of a cell match
those of a message to be tested, and the (x + 1)th bit
does not m atch, given that the cell contains a message other
than the message to be tested, is (½)~. (The reader should
remember that the first bit of a message always matches
the first bit of a nonempty cell, and consequently the
exponent is x rather t han x + 1, as would otherwise be the
case. ) Thus, for b >> 1, the expecte d value of E is approxi-
mated by the following sum:
(x + 1)-( ½)x = 3. (6)z~ l
Therefore
T = (3 /¢ ) - 2, (7)
N = n . ( b + 1) -T-----~ (8)T - - l "
Equation (8) represents the space/time trade-off for the
conventional hash-coding method.
424 Commu nica tio ns of th e ACM Volume 13 / Numb er 7 / July, 1970
8/6/2019 Bloom Bloomfilter Orig Paper
http://slidepdf.com/reader/full/bloom-bloomfilter-orig-paper 4/5
A n a l y s i s o f M e t h o d 1
T h e h a s h a r e a c o n t a i n s N ' b i t s a n d i s o r g a n i z e d i n t o c e ll s
o f c b i t s e a ch . I n a m a n n e r a n a l o g o u s t o t h e c o n v e n t i o n a l
m e t h o d , w e e s t a b l i s h t h e f o l l o w i n g e q u a t i o n s :
~b' = N ' -- n.c~ - t h e f r a c t i o n o f e m p t y ce l ls . ( 9 )
N '
N' - n.c ( 1 0 )
1 - ¢ "
T ' = ( 3 / ¢ ' ) - 2 . ( 1 1 )
N ' T ' A - 2 ( 1 2 )=n ' c ' T ' - - 1 "
I t r e m a i n s t o d e r i v e r e l at i o n s f o r t h e c o r r e s p o n d i n g e x -
p e c t e d f r a c t i o n o f e r r or s , P ' , i n t e r m s o f c a n d T ' a n d i n
t e r m s o f N ' a n d T'.
A m e s s a g e t o b e t e s t e d w h i c h i s n o t a m e m b e r o f t h e s e t
o f m e s s a g e s , M , w i ll b e e r r o n e o u s l y a c c e p t e d a s a m e m b e r
o f M w h e n :
( 1 ) o n e o f t h e s e q u e n c e o f h a s h a d d r e s s e s g e n e r a t e d
f r o m t h e t e s t m e s s a g e c o n t a in s t h e s a m e c o d e , s a y C ,
a s t h a t g e n e r a t e d f r o m t h e t e s t m e s s a g e ; a n d
( 2 ) t h a t s u c h a h a s h a d d r e s s is g e n e r a t e d e a r l ie r in t h e
s e q u e n c e t h a n t h e h a s h a d d r e s s o f s o m e e m p t y c el l.
T h e e x p e c t e d f r a c t i o n o f t e s t m e s s a g e s , n o t m e m b e r s o f
M , w h i c h a r e e r r o n e o u s l y a c c e p t e d i s t h e n
P ' = ( ½ ) ° -' / ~ b '. ( 1 3 )
T h e r e f o r e
T L _c = - - lo g 2 P ' + 1 - t - l o gs -4- 2 (1 4 )
3
n . - - logs + 1 d- logs 2 -4- 2- - T ' - - 1" ( 1 5 )
E q u a t i o n ( 1 5 ) r e p r e s e n t s t h e t r a d e -o f f s a m o n g a l l t h r e e
c o m p u t a t i o n a l f a c t o r s f o r m e t h o d 1 .
A n a l y s i s o f M e t h o d 2
L e t ¢ " r e p r e s e n t t h e e x p e c t e d p r o p o r t i o n o f b i ts i n t h e
h a s h a r e a o f N " b i t s s ti ll s e t t o 0 a f t e r n m e s s a g e s h a v e b e e n
h a s h s t o r e d , w h e r e d is t h e n u m b e r o f d i s t i n c t b i t s se t t o
1 f o r e a c h m e s s a g e i n t h e g i v e n s e t .
¢ " = (1 - - d/N" ) ' . ( 1 6 )
A m e s s a g e n o t i n t h e g i v e n s e t w i l l b e f a l s e l y a c c e p t e d i f al l
d b it s t e s t e d a r e l ' s . T h e e x p e c t e d f r a c t i o n o f t e s t m e s s a g e s ,n o t i n M , w h i c h r e s u l t i n s u c h e r r o r s i s t h e n
P " = ( 1 - - ¢ - ) d . ( 1 7 )
A s s u m i n g d < < N " , a s is c e r t a i n l y t h e c a s e , w e t a k e t h e
l o g b a s e 2 o f b o t h s i d e s o f e q . ( 1 6 ) a n d o b t a i n a p p r o x i -
m a t e l y
logs ¢" = log~ (1 - d/N")n.logs e
= --n. (d/N").logs e.
T h e r e f o r e
N " p , , ) l o g 2 e= n . ( - - l o g 2 l o gs ¢ " - l o g s ( 1 - - ¢ " ) " ( 1 9 )
W e n o w d e r i v e a r e l a t i o n s h i p f o r t h e n o r m a l i z e d t i m e
m e a s u r e T " . x b i t s w i l l b e t e s t e d w h e n t h e f i r s t x - - 1 b i t s
t e s t e d a r e 1 , a n d t h e x t h t e s t e d b i t i s 0 . T h i s o c c u r s w i t h
p r o b a b i l i t y ¢ " - (1 - ¢ " ) * - 1 . F o r P " < < 1 a n d d > > 1 , t h e
a p p r o x i m a t e v a l u e o f T " ( t h e e x pe c t ed v a l u e of t h e n u m -b e r o f b it s t e s t e d f o r e a c h r e j e c t e d t e s t m e s s a g e ) i s t h e n
T " = ~ x . ~ " . ( 1 - ~ , , )~ - 1 = 1 / ~ " . (20)
T h e r e f o r e
logs e . (21 )N " = n . ( - l o g s P " ) " lo gs ( 1 / T " ) . l o g s ( 1 -- l IT" )
E q u a t i o n ( 2 1 ) r e p r e s e n t s t h e t r a d e - o f f s a m o n g t h e t h r e e
c o m p u t a t i o n a l f a c t o r s f o r m e t h o d 2 .
C o m p a r i s o n o f M e t h o d s 1 a n d 2
T o c o m p a r e t h e r e l a t i v e s p a c e / t i m e t r a d e - o f f s b e t w e e n
m e t h o d s 1 a n d 2 , i t w i ll b e u se f u l t o i n t r o d u c e a n o r m a l i z e d
s p a c e m e a s u r e ,
S = g / ( -n . logs P) . ( 2 2 )
S i s n o r m a l i z e d t o e l i m i n a t e t h e i n f l u e n c e o f t h e s i z e o f t h e
g i v e n s e t o f m e s s ag e s , n , a n d t h e a l l o w a b l e f ra c t i o n o f
e r r o r s , P . S u b s t i t u t i n g t h e r e l a t i o n ( 2 2 ) i n t o e q s. ( 1 5 )
a n d ( 2 1 ) g i v e s
(' - T ' -4- 2 1 d- log2 , (2 3)
T ' - I 1 + - - i ~ /
S " = lo g s e (2 4 )l og s ( 1 / T " ) . l o g 2 ( 1 - - 1/T")"
W e n o t e t h a t S ' > ( T ' --? 2 ) / ( T ' - - 1 ) , a n d
l i m S ' - T ' + 2 ( 2 5 )~'-~0 T ' - - 1 "
T h e s u p e r i o r i t y o f m e t h o d 2 o v e r m e t h o d 1 c a n b e se e n
d i r e c t l y b y e x a m i n i n g F i g u r e 1 , w h i c h i s a g r a p h o f t h e
S v e r s u s T c u r v e s f o r e q . ( 2 4 ) a n d t h e l o w e r b o u n d l i m i t
e q u a t i o n , e q . ( 2 5) .2 T h e c u r v e s i n F i g u r e 1 i l l u s t r a t e t h es p a c e / t i m e t r a d e - o f fs f o r b o t h m e t h o d s a s s u m i n g a f ix e d
v a l u e f o r t h e e x p e c t e d f r a c t i o n o f e rr o r s , P , a n d a f i x e d
n u m b e r o f m e s s a g e s , n .
T h e r e a d e r s h o u ld n o te th a t t h e s u p e r io r i ty o f m e th o d 2 o v e rmethod 1 depends on the as sumption o f cons tan t t ime of address -ing, accessing, and testing bits in the hash area. This result m a y
no longer ho ld i f the effects o f a c ce s sin g m u l t ib i t " c h u n k s " a n d o f
ca lcu la t ing add i t iona l hash-coded b i t addresses a re taken in toaccount.
V o l u m e 13 / N u m b e r 7 / J u l y , 19 70 C o m m u n i c a t i o n s o f t h e A C M 4 2 5
8/6/2019 Bloom Bloomfilter Orig Paper
http://slidepdf.com/reader/full/bloom-bloomfilter-orig-paper 5/5
F o r m e t h o d 2 , a n i n c r e a se i n T " c o r r e s p o n d s t o a d e -
c r e a s e i n ¢ " , t h e f r a c t i o n o f 0 b i t s i n t h e h a s h f i e l d a f t e r a l l
n m e s s a g e s h a v e b e e n h a s h c o d e d . T o m a i n t a i n a f i x ed
v a l u e o f P , t h i s d e c r e a s e i n ~b" c o r r e s p o n d s t o a n i n c r e a s e i n
t h e n u m b e r o f b i t s t e s t e d p e r m e s s a g e , d , a n d a n a p p r o -
p r i a t e a d j u s t m e n t t o t h e h a s h a r e a s i ze , N " . S " i s d i r e c t l y
p r o p o r t i o n a l t o t h e h a s h a r e a s iz e N " , a s s h o w n i n eq . ( 2 2 ).
A s T " i n c re a s e s , a n d c o r r e s p o n d i n g l y , ¢ " d e c r e a s e s , S a n d
N " d e c r e a s e u n t i l a p o i n t o f d i m i n i s h i n g r e t u r n s i s r e a c h e d .T h i s p o i n t o f d i m i n is h i n g r e t u r n s i s i ll u s t r a t e d i n F i g u r e 1
w h e r e S " i s s m a l l e st f o r 1 / T " = 1 - - 1 / T " , i . e . f o r T" = 2 .
I0 1
81
6 1
• ~
'1
i
N
] ,
METHOD2 ~i ii ,i
I iI
i :' !
I ,I
i
\
[ L I L
MTHOD
% . l I I
] \i
t ~
L
\\ ,
I. I1 .01 1 .02 1 .0 5 I . I 1 .2 1 ,3 1 ,4 1 .5 2 3 4 6 7 8 I I
T - N O R M A L I Z E D T I M E M E A S U R E
FIG. i
S i n c e ¢ " = 1 / T " , t h i s m e a n s t h a t t h e s m a l l e s t h a s h a r e a
f o r w h i c h m e t h o d 2 c a n b e u s e d o c c u r s w h e n h a l f t h e b i t s
a r e 1 a n d h a l f a r e 0 . T h e v a l u e o f S " c o r r e s p o n d i n g t o
T" = 2 i s S " = l o g 2 e = 1 .4 7 .
A n a l y s i s o f H y p h e n a t i o n S a m p l e A p p l i c a t io n
I n t h i s s e c t i o n w e w i ll c a l c u l a t e a n o m i n a l s i z e f o r a h a s h
a r e a f o r s o l v in g t h e s a m p l e a u t o m a t e d h y p h e n a t i o n p r o b -
l e m . S i n ce w e h a v e a l r e a d y c o n c l u d e d t h a t m e t h o d 2 i s
b e t t e r t h a n m e t h o d 1 , w e w i ll c o m p a r e t h e c o n v e n t i o n a l
m e t h o d w i t h m e t h o d 2 .
L e t u s a s s u m e t h a t t h e r e a r e a b o u t 5 0 0 , 0 0 0 w o r d s t o b e
h y p h e n a t e d b y t h e p r o g r a m a n d t h a t 4 5 0 ,0 0 0 of t h e s e w o r d s
T A B L E I. S U M m A R r O F E X P E C T E D P E R F O R M A N C E O F H Y P H E N -
A T I O N A P P L I C A T IO N O F H A S H C O D I N G U S I N G M E T H O D 2 F O R
V A R I O U S V A L U E S O F A L L O W A B L E F R A C T I O N O F E R R O R S
P = Allowabl e Fraction N = Size of Hash Area Disho] Error s (Bits ) A ccesses Saved
½ 72,800 45.0 %145,600 67.5%
218,400 78.7%291,200 84.4%
364,000 87.2%
509,800 88.5%
c a n b e h y p h e n a t e d b y a p p l i c a t io n o f a f e w s im p l e r u l es .
T h e o t h e r 5 0 , 00 0 w o r d s r e q u i r e r e f e r e n c e t o a d i c t i o n a r y . I t
i s r e a s o n a b l e t o e s t i m a t e t h a t a t l e a s t 1 9 b i ts w o u l d , o n t h e
a v e r a g e , b e r e q u i r e d t o r e p r e s e n t e a c h o f t h e s e 5 0 , 0 00 w o r d s
u s i n g a c o n v e n t i o n a l h a sh - c o d i n g m e t h o d . I f w e a s s u m e
t h a t a t i m e f a c t o r o f T = 4 i s a c c e p t a b l e , w e f i n d f r o m e q .
( 9 ) t h a t t h e h a s h a r e a w o u l d b e 2 , 0 0 0 ,0 0 0 b i ts i n s iz e . T h i s
m i g h t v e r y w e l l b e t o o l a r g e f o r a p r a c t i c a l c o r e c o n t a i n e d
h a s h a r e a . B y u s i n g m e t h o d 2 w i t h a n a l l o w a b l e e r r o r f r e -
q u e n c y o f , s a y , P = 1 / 1 6 , a n d u s i n g t h e s m a l l e s t p o s s i b l eh a s h a r e a b y h a v i n g T = 2 , w e se e f r o m e q . ( 2 2 ) t h a t t h e
p r o b l e m c a n b e s o l v e d w i t h a h a s h a r e a o f le s s t h a n 3 0 0 ,0 0 0
b i t s , a s i z e w h i c h w o u l d v e r y l i k e l y b e s u i t a b l e f o r a c o r e
h a s h a r e a .
W i t h a c h o i c e f o r P o f 1 /1 6 , a n a c c e s s w o u l d b e r e q u i r e d
t o t h e d i s k r e s i d e n t d i c t i o n a r y f o r a p p r o x i m a t e l y 5 0 ,0 0 0
+ 4 5 0 , 0 0 0 / 1 6 ~ 7 8 , 0 0 0 o f t h e 5 0 0 ,0 0 0 w o r d s t o b e h y p h e n -
a t e d , i . e . f o r a p p r o x i m a t e l y 1 6 p e r c e n t o f t h e c a s e s . T h i s
c o n s t i t u t e s a r e d u c t i o n o f 8 4 p e r c e n t i n t h e n u m b e r o f d i s k
a c c e ss e s f r o m t h o s e r e q u i r e d i n a t y p i c a l c o n v e n t i o n a l a p -
p r o a c h u s i n g a c o m p l e t e l y d i s k r e s i d e n t h a s h a r e a a n d
d i c t i o n a r y .
T a b l e I s h o w s h o w a l t e r n a t i v e c h o i c e s f o r th e v a l u e o f Pa f f e c t t h e s iz e of t h e c o r e r e s id e n t h a s h a r e a a n d t h e p e r -
c e n t a g e o f d is k a c c e s s es s a v e d a s c o m p a r e d t o th e " t y p i c a l
c o n v e n t i o n a l a p p r o a c h . "
A c k n o w l e d g m e n t s . T h e a u t h o r w i s h e s t o e x p r e s s h i s
t h a n k s t o M r . O l i ve r S e l f ri d g e f o r h i s m a n y h e l p f u l s u g -
g e s t i o n s i n t h e w r i t i n g o f th i s p a p e r .
R E C E I V E D O C T O B E R , 1969; R E V I S E D A P R I L , 1 9 7 0
R E F E R E N C E S
1 . B A T SO N , A . T h e o r g a n i z a t i o n o f sy m b o l t a b l e s . Comm. ACM 8,2 (Feb. 1965) , 111-112.
2 . M A U R E R, W . D . A n i m p r o v e d h a sh c o d e f o r s c a t t e r s t o r a g e .
Comm. ACM 11, 1 (Jan. 1968), 35--38.3 . M O R R IS , R . S c a t t e r s t o r a g e t e c h n i q u e s . Comm. ACM 11, 1
(Jan. 1968) , 38-44.
4 2 6 C o m m u n i c a t i o n s o f t h e A C M V o l u m e 13 / N u m b e r 7 / J ul y, ,• 19 7 0