
    The Bitwise Bloom Filter

    Ashwin Lall

    University of Rochester

    Mitsunori Ogihara

    University of Rochester

University of Rochester Computer Science Department Technical Report TR-2007-927, November 29, 2007

    Abstract

We present the Bitwise Bloom Filter, a data structure for maintaining counts for a large number of items. The bitwise filter is an extension of the Bloom filter, a space-efficient data structure that stores a large set by discarding the identity of the items being held while still being able to determine, with high probability, whether an item is in the set or not. We show how this idea can be extended to maintaining counts of items by maintaining a separate Bloom filter for every position in the bit representations of all the counts. We give both a theoretical analysis of the accuracy of the bitwise filter and validation via experiments on real network data.

    1 Introduction

With the advent of the internet and developments in the capacity of storage media there has been an explosion in the amount of data used by current-day applications. While the polynomial-bounded resource model was the gold standard of the past, modern algorithms have to guarantee linear or even sub-linear bounds, usually at the cost of accuracy. One of the basic primitives required by modern algorithms is to be able to keep counts of large numbers of items using minimal storage. In this paper we will see a novel data structure that performs exactly this task.

There has been a considerable amount of work done recently on streaming algorithms. The goal of a streaming algorithm is to compute some function of the input, viewed as a stream, using as little space as possible. Notably, there has been a fairly extensive study of ways to compute the frequency moments of streaming data probabilistically and approximately (see, for example, [AMS99, BYKS02, IW03, Woo04]). Many applications, however, need more than just statistics about the data being streamed. It is becoming increasingly important to record more fine-grained information about the data, such as actual frequencies, while still being parsimonious with the space that this takes in memory.

In this paper we will specifically take a look at methods that are based upon the concept of the Bloom filter [Blo70]. Hashing has long been the answer to designing algorithms that approximate some value that we would like to maintain. The method of Bloom filters is a way to use a constant number of hash functions to get a vast improvement in the accuracy of the algorithm.

When Bloom filters were first introduced in the 1970s they were immediately applied to many existing applications where memory was a constraint. After that, the idea lay dormant and no more research was done regarding them until as recently as 1998. Now that memory has become a constraint once again, with network applications that see vast quantities of data every second, the idea has caught on anew and there has been much research done extending the idea of the filter. In this paper we will see an approach to maintaining counts of items approximately in a manner unlike any before, even among the methods that make use of Bloom filters.

The rest of this paper is laid out as follows. In Section 2, the original Bloom filter is described, and some analysis is presented with regards to optimizing its performance and bounding the error that it can have. The counting Bloom filter is introduced in Section 3, and its error is analyzed. In Section 4, some applications for approximate counting are given as motivation. Prior work done on variations of Bloom filters is described in Section 5. In Section 6, the Bitwise Bloom Filter is presented and analyzed. The empirical evaluation of the filter is presented in Section 7. This paper is finally concluded in Section 8.

    2 The Bloom Filter

The idea for what is now called a Bloom filter dates back to a paper by Burton Bloom in 1970 [Blo70]. The example with which he motivated the use of his data structure was that of automatically hyphenating words in a word processor. He demonstrated, using some reasonable numbers, that even using existing hashing techniques about 250 Kb of data storage would be required, far more than what was available in main memory! Today, memory capacities have increased but the same kind of problem persists. In fact, the problem is exacerbated by the fact that the amount of data available is growing at a remarkable rate.

The purpose of the Bloom filter is to maintain a small subset of some set. The advantage of this data structure is that it uses considerably less space than any exact method, but pays for this by introducing a small probability of error. Depending on the space available, this error can be made arbitrarily small.

The design of the Bloom filter is extremely simple. Assume that we are maintaining a subset of the set $S = [n]$.¹ The filter comprises a bit vector of m bits, with all its bits initialized to zero, and hash functions $h_1, h_2, \ldots, h_k$, for some positive integer k. Each hash function $h_i$ maps elements from the set being maintained to a location in the vector, i.e. $h_i : [n] \to [m]$. When an item x is inserted into the filter, each of the locations $h_1(x), h_2(x), \ldots, h_k(x)$ is set to 1 (see Algorithm 1).

Algorithm 1: Bloom Filter Insertion
Bloom-Insert(x)
1: for i = 1 to k do
2:   A[h_i(x)] := 1

Algorithm 2: Bloom Filter Search
Bloom-Search(x)
1: for i = 1 to k do
2:   if A[h_i(x)] = 0 then
3:     return false
4: return true

To search for an item in the filter, it only needs to be checked that each of the locations for the item has been set to one (see Algorithm 2).
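For concreteness, here is a minimal Python sketch of Algorithms 1 and 2. The class and helper names are ours, and deriving the k hash values from two base hashes is an implementation convenience, not something prescribed by the filter itself; any family of k independent hash functions would do.

import hashlib

class BloomFilter:
    """Minimal Bloom filter: an m-bit vector probed by k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [0] * m  # bit vector, initialized to all zeros

    def _locations(self, x):
        # Derive k hash values h_1(x), ..., h_k(x) from two base hashes.
        d = hashlib.sha256(str(x).encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, x):  # Algorithm 1
        for loc in self._locations(x):
            self.bits[loc] = 1

    def search(self, x):  # Algorithm 2
        return all(self.bits[loc] == 1 for loc in self._locations(x))

An inserted item is always found (no false negatives); an absent item is reported present only if all k of its locations happen to be set.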

¹As is standard in the literature, $[n]$ denotes $\{1, 2, 3, \ldots, n\}$.


Clearly, if an item is inserted into the filter it is found when searched for. Hence, there is no false negative error. The only possibility for error is when an item is not inserted into the filter, but each of the locations that the hash functions map it to are turned on. We will show that this error can be kept small while using considerably less space than any exact method.

    2.1 Lower Bounds

Before we analyze the Bloom filter, we should first convince ourselves that we need a considerable amount of space with an exact method. It is easy to see that to exactly maintain a subset of a set S of size n we need at least n bits. The proof of this is included here for completeness.

Suppose that we maintained an arbitrary subset of S ($||S|| = n$) with $n' < n$ bits. The total number of possible subsets of the set is $2^n$, but the total number of configurations possible with $n'$ bits is $2^{n'}$. Since $2^{n'} < 2^n$, some two distinct subsets must share the same configuration, and hence cannot be distinguished, a contradiction. A similar information-theoretic argument shows that if a false positive probability of $\epsilon$ is allowed, at least $N \log_2(1/\epsilon)$ bits are required; this bound holds for the case that we are interested in, i.e. when the number of items seen is much smaller than the actual universe size. We will show that the Bloom filter uses space that is close to this lower bound.

    2.2 Analysis

Let us take the number of potential items that we can see to be n, the number of bits available to us for the Bloom filter to be m, and the number of hash functions that are used to be k. The value of n will be fixed by a particular application, and m will be fixed by the memory resources available to us, so we need to compute the optimal value of k that minimizes the error of the Bloom filter.

As mentioned earlier, the only source of error in a Bloom filter is if an element is not supposed to be in the set but the insertion of other elements has caused each of its k locations in the vector to be turned on. To compute the probability of this eventuality, we consider the worst possible state of the vector: when all n elements have been hashed into the vector (note that we are not making any assumptions about what is actually in the subset).

The probability that one particular hash function maps a value to a given location in the vector is simply 1/m. Hence, after all n elements have been hashed into the vector, the probability that a given position is still zero is $(1 - 1/m)^{nk}$, and the probability that it is one is $1 - (1 - 1/m)^{nk}$. Now, if we fix an element that is not in the set, the probability that it appears to be in the set, i.e. all k locations it gets mapped to are turned on, is

$$\epsilon = (1 - (1 - 1/m)^{nk})^k.$$

The above formula can be approximated by

$$\epsilon = (1 - e^{-nk/m})^k,$$

which is a good approximation when m is sufficiently large. Instead of minimizing the error $\epsilon$ with respect to k, we minimize $\log_e(\epsilon)$, which is equivalent since the logarithm function is monotonically increasing. We set $p = e^{-nk/m}$, so that $k = -\frac{m}{n}\log_e(p)$. Hence, the value that we are trying to minimize is

$$\log_e(\epsilon) = k \log_e(1 - p) = -\frac{m}{n}\log_e(p)\log_e(1 - p).$$

By symmetry, it is easy to see that this value is minimized when $p = 1/2$, or when $k = \frac{m}{n}\log_e(2)$. Hence, we can always use $k = \frac{m}{n}\log_e(2)$ hash functions to minimize the error.

For the above value of k, the actual error is $(1/2)^k$, or approximately $(0.619)^{m/n}$. Note that if we replace n with N, the number of distinct items that are actually inserted into the filter, then the analysis is identical and we get an error of $(0.619)^{m/N}$. Hence, if we allocate 8 bits for each item that we expect then we have a false positive error of about 2%. With 16 bits per item the error goes down to under 0.05%. For any application with elements of size 32 bits or more this is a considerable saving, with only a very small false positive error.

Hence, we see that with a Bloom filter we can bring the amount of space used down from logarithmic (in n) per item inserted to constant per item inserted.
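These formulas are easy to evaluate numerically; the following sketch (ours, not part of the original analysis) reproduces the figures quoted above:

import math

def optimal_k(m, n):
    # k = (m/n) ln 2
    return (m / n) * math.log(2)

def false_positive_rate(m, n, k):
    # epsilon = (1 - e^{-nk/m})^k
    return (1 - math.exp(-n * k / m)) ** k

for bits_per_item in (8, 16):
    k = optimal_k(bits_per_item, 1)
    print(bits_per_item, false_positive_rate(bits_per_item, 1, k))
    # 8 bits/item  -> about 0.021 (2%)
    # 16 bits/item -> about 0.00046 (under 0.05%)

In practice k is rounded to the nearest integer, which changes the error only slightly.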

Finally, we compare the space used by the Bloom filter to the lower bound computed in the previous section. We wish to keep the false positive probability less than some given $\epsilon$, or

$$(1/2)^{(m/N)\log_e 2} \le \epsilon.$$

This happens when

$$m \ge \frac{N\log_2(1/\epsilon)}{\log_e 2} = (N\log_2(1/\epsilon))\log_2 e.$$


Algorithm 3: Counting Bloom Filter Insertion
Counting-Bloom-Insert(x)
1: for i = 1 to k do
2:   A[h_i(x)] := A[h_i(x)] + 1

Algorithm 4: Counting Bloom Filter Deletion
Counting-Bloom-Delete(x)
1: for i = 1 to k do
2:   A[h_i(x)] := A[h_i(x)] - 1

This demonstrates that there is a factor of $\log_2 e \approx 1.44$ between the amount of space used by a Bloom filter and the optimal amount of space that can be used. There are other data structures that use space closer to the lower bound, but they are much more complicated and do not have some of the useful properties found in the Bloom filter.

    2.3 Some Properties

Bloom filters have several properties that make them useful in various applications. Suppose we wanted to support the union of two sets. In this case, if we were to use two Bloom filters with the same hash functions and vector size, we could simply perform a bitwise disjunction to get the union of the sets. Note that the intersection of two sets represented by Bloom filters cannot be computed as trivially, though the inner product of the vectors is a good indicator of the size of the intersection [BM03].

Another useful property of a Bloom filter is the ability to halve the amount of space that it uses dynamically. This may be useful when too much space has been initially allocated to it. If we assume that the size of the vector is even, we can halve the space used by taking a bitwise disjunction of the first half of the vector with the second half, and then modifying all the hash functions to mask the highest order bit of their outputs.

From a systems standpoint Bloom filters have some nice properties as well. When an item is inserted into the filter, all the locations can be updated blindly in parallel. The reason for this is that the only operation performed is to make a bit 1, so there can never be any data race conditions. Also, for an insertion, there need to be a constant (that is, k) number of writes and no reads. Again, when searching for an item, there are at most a constant number of reads. These properties make Bloom filters very useful for high-speed applications, where memory access can be the bottleneck.
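The halving operation is a one-liner in the sketch notation used earlier; here we assume m is a power of two, so that masking the high-order bit of a hash output is the same as reducing it modulo the new, halved length:

def halve(bits):
    # OR the second half of the bit vector onto the first half; afterwards
    # every hash output must be reduced modulo the new length, e.g. by
    # taking h(x) % (len(bits) // 2) when the length is a power of two.
    half = len(bits) // 2
    return [a | b for a, b in zip(bits[:half], bits[half:])]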

    3 Counting Bloom Filters

One of the main drawbacks of the Bloom filter is that it does not allow deletions. To get around this limitation the concept of the counting Bloom filter was introduced in a paper on caching [FCAB98]. The modification of the Bloom filter to get a counting Bloom filter is quite simple: rather than using a bit vector, a small counter should be used at each location in the vector. When an element is added to the filter each location mapped to by the hash functions is incremented rather than just turned on. For deletions each location is decremented. Testing for membership in the set is then simply checking to make sure that each location for that element has an appropriately high count (see Algorithm 5).


Algorithm 5: Counting Bloom Filter Search
Counting-Bloom-Search(x)
1: for i = 1 to k do
2:   if A[h_i(x)] < ||{j | 1 ≤ j ≤ k and h_j(x) = h_i(x)}|| then
3:     return false
4: return true

Algorithm 6: Counting Bloom Filter Get-Count
Counting-Bloom-Count(x)
1: min := ∞
2: for i = 1 to k do
3:   if A[h_i(x)] < min then
4:     min := A[h_i(x)]
5: return min

Note that a counting Bloom filter also has the ability to maintain approximate counts of items. In particular, when we wish to estimate the count of an item, we can take the minimum of the counts across all the locations that that item is hashed to. This is demonstrated in Algorithm 6.
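Continuing the earlier Python sketch, the counting variant only changes the update rule (the class name is ours; BloomFilter and its _locations helper are from the sketch in Section 2):

class CountingBloomFilter(BloomFilter):
    """Counting Bloom filter: per-location counters instead of bits."""

    def insert(self, x):  # Algorithm 3
        for loc in self._locations(x):
            self.bits[loc] += 1

    def delete(self, x):  # Algorithm 4; only delete items actually inserted
        for loc in self._locations(x):
            self.bits[loc] -= 1

    def count(self, x):  # Algorithm 6: the minimum counter is the estimate
        return min(self.bits[loc] for loc in self._locations(x))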

    3.1 Analysis

It is not hard to see that certain new kinds of errors are introduced by this method. Now, false negative error is also possible. This happens when an item that erroneously appears to be in the set (due to false positive error) is deleted. By decrementing the k locations it is possible that an element that is supposed to be in the set appears not to be any more. However, if we assume that an element not in the set is never deleted (as is standard practice), then this kind of error is not an issue. Also note that the probability of this kind of error is upper bounded by the probability of false positive error, and will usually be so much smaller as to be negligible.

The other source of error is when one of the counters overflows. The simple workaround to this problem is to stop incrementing the counter when it reaches its maximum value. Then, the only way that an error could occur is if the particular counter gets decremented all the way down to zero again though it should not have been. We will show that, assuming a uniform distribution, the probability of an overflow is very small.

For each $1 \le i \le m$, let c(i) be the count of the ith counter. The probability that after n items have been inserted the counter c(i) has count exactly j is given by:

$$\Pr(c(i) = j) = \binom{nk}{j}\left(\frac{1}{m}\right)^j\left(1 - \frac{1}{m}\right)^{nk - j}.$$

We can bound the probability of an overflow as follows:

$$\Pr(c(i) \ge j) \le \binom{nk}{j}\left(\frac{1}{m}\right)^j \le \left(\frac{enk}{j}\right)^j\left(\frac{1}{m}\right)^j = \left(\frac{enk}{jm}\right)^j$$

(see, for example, [CLRS01, p. 1097] for the bound on the binomial coefficient).


Now, from the analysis for the Bloom filter, we know that the optimal value for k is $m\log_e 2/n$, so assuming that we always use $k \le m\log_e 2/n$, we get that

$$\Pr(c(i) \ge j) \le \left(\frac{e\log_e 2}{j}\right)^j.$$

Taking a simple union bound, we get that the probability that any counter will overflow is

$$\Pr(\max_i c(i) \ge j) \le m\left(\frac{e\log_e 2}{j}\right)^j.$$

Hence, if we use as little as 4 bits for the counter (so that $j = 2^4 = 16$), we get a probability of failure of at most $m \cdot (1.37 \times 10^{-15})$, which is sufficiently small for most applications. However, note that 16 insertions of a single element will cause all k of its locations to overflow.
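To make the constant concrete, the bound is easy to evaluate (a quick numeric check of ours, not part of the original analysis):

import math

def overflow_bound(m, j):
    # m * (e * ln 2 / j)^j, the union bound derived above
    return m * (math.e * math.log(2) / j) ** j

# Even with a billion 4-bit counters (j = 16), the probability that any
# counter overflows is about 1.4e-06:
print(overflow_bound(10**9, 16))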

    4 Applications

Maintaining counts of items within small amounts of space is a very useful primitive for many applications these days. The savings in space by approximation are necessary because network devices simply do not have the amount of fast memory required to process the amounts of data that current link speeds can generate. Also, when the counts have to be shared in a distributed environment, savings in transmission costs can be extremely beneficial.

    4.1 Caching of Keywords for Search Engines

The primary motivation for using Bloom filters, and the new data structure described later in this paper, is that when the number of items for which we are keeping counts is much smaller than the total number of potential items, we can save a lot of space. Hence, the use of succinct frequency maintenance data structures is ideal for search engines.

Search engines, such as Google and AskJeeves, need to maintain statistics of queries sent to them. There could be many reasons for this. Some search engines publish their most popular queries, or optimize searches based upon this data. It may also be useful to have statistics on popular queries so as to evenly balance load across machines.

Bloom filter techniques are ideal for search engines because the number of actual queries performed is quite small as compared to the potential number that could be performed (mostly meaningless search strings). Typically, a popular search engine such as Google might service 100 million queries per day. Storing and updating histograms exactly on such a vast quantity of data would be infeasible to accomplish in any fast-access memory, and hence an approximate counting mechanism would provide significant reduction in cost.

    4.2 Iceberg Queries

There has been a lot of interest of late in so-called iceberg queries. An iceberg query is one in which the goal is to identify all the items with frequency above some fixed threshold. Clearly, any data structure that maintains approximate counts can be used to compute iceberg queries by simply checking at the end which items have count above the threshold.

Previous work done on iceberg queries includes [FSGM+98, GM99, MM02]. Another thresholding algorithm, based on the Bloom filter, is described in Section 5.2. One big drawback of these approaches is that they are very dependent on the threshold. If the threshold is changed, then the iceberg query has to be recomputed over all of the data. This drawback is not present when we maintain counts of all the items.

    4.3 Proxy Caches

A proxy cache is a means of caching for a larger domain than a single computer. For instance, the webpages visited by users at the computer science department at the University of Rochester might be stored in a single proxy cache. The benefit of using a proxy cache is that webpages that are used very frequently by the members of the domain can be stored locally, decreasing the overall bandwidth consumed. Since not every page that is accessed can be cached, it is important to determine which pages are accessed most frequently within the domain.

The number of hits for each webpage accessed within a domain can be maintained with an approximate counter so as to save space. A thresholding algorithm, such as the ones described in the previous section, might also suffice for this task. However, since statistics have to be maintained for a sliding window, simple thresholding might not work.

    4.4 Traffic Statistics

Low-memory approximate histogram data structures allow internet routers to maintain statistics about flows passing through them. Depending on the application, it might be necessary to keep track of the counts for the source IPs, the destination IPs, or even the flow between each pair of source and destination IPs. Work in this direction has been done in [SX02, DLOM02, KXLW03].

Broadly, there are two reasons for maintaining statistics about the requests processed on a router. Firstly, this information could be useful in redirecting traffic when there is excessive congestion on a particular route. Secondly, maintaining approximate counts, or even iceberg queries, is a good way of detecting distributed denial of service attacks. Since a router might see tens of millions of IP addresses per hour, and the amount of fast DRAM available to it is limited, data structures with small memory footprints are crucial [KXLW03].

    4.5 Distributed Joins

The concept of the Bloomjoin was introduced by Mackert and Lohman [ML86]. Suppose that there are two members of a distributed system M1 and M2, and they wish to compute the intersection of two sets S1 and S2, where M1 has S1 and M2 has S2. Now, if M1 sends all of the elements of S1 to M2, M2 can compute the intersection and send the result back to M1. However, doing so is expensive in bandwidth. An alternative is for M1 to store S1 in a Bloom filter and send that to M2. Then, M2 computes the intersection of the elements apparently in the filter and S2 and sends back this considerably smaller set. Note that due to some false positives, the intersection might appear larger than it actually is. But this can be corrected by M1 in one final phase of the protocol, where it sends back the (even smaller) set of false positives. If the size of the join and the false positive error are sufficiently small, this method can save a lot of bandwidth while still giving an exact answer.

The concept of the Bloomjoin can be generalized to that of obtaining items over a given threshold frequency [CM03]. The succinct counting data structure is sent by M1 to M2, and M2 sends back the list of items above the threshold (and their counts). Finally, M1 sends back the values that are not above the threshold due to approximate counting. This can result in a considerable saving in the amount of data communicated, while still computing the join exactly.
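The three phases of the basic Bloomjoin are compact enough to sketch directly (using the BloomFilter class from Section 2; the function is ours, and the actual network transfers are elided):

def bloomjoin(s1, s2, m=1 << 16, k=7):
    # Phase 1: M1 builds a filter over S1 and sends it to M2.
    bf = BloomFilter(m, k)
    for x in s1:
        bf.insert(x)
    # Phase 2: M2 returns the elements of S2 that appear to be in S1
    # (a superset of the true intersection, due to false positives).
    candidates = {x for x in s2 if bf.search(x)}
    # Phase 3: M1 removes the false positives exactly.
    return candidates & set(s1)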


    5 Prior Work

At the time of their conception, Bloom filters did not receive much attention from the theory community. The reason for this was that they only guaranteed constant bounded error, and typically errors that decrease asymptotically are preferred. However, in the last several years the idea of the Bloom filter has become popular once again for a variety of reasons.

With the gain in popularity of streaming algorithms it has become increasingly important to approximate statistics over a stream using minimal space and time per data item. These problems are motivated by the fact that network devices, such as routers, are expected to handle vast quantities of data while having relatively little main memory. Since there is no strong requirement for the statistics to be exact, this is a natural application of Bloom filters and their variations.

Another reason for interest in Bloom filters is that the concept can be generalized to broader settings than simple set membership. We will take a look at some of the recent work that has been done with variations on Bloom filters in the following subsections.

    5.1 Spectral Bloom Filters

The Spectral Bloom Filter [CM03] is essentially a counting Bloom filter that is used for maintaining frequencies of items rather than just set membership. The problem of counting, however, is harder in terms of the number of bits required and the error produced, so some modifications are introduced in the paper.

The first problem that the observant reader might note is that each counter can no longer be maintained with a small number of bits. While only 4 bits per counter gave a fairly low probability of error in the counting Bloom filter, any distribution with high-probability items would have a high probability of counter overflow. To get around this problem, the authors proposed using variable-length counters that can increase in size as the need arises.

The other problem with the simple spectral Bloom filter is the accuracy of the statistics that it maintains. If the hash functions have many collisions, then the apparent count that results can be much larger than the actual one. Note that there can never be an undercount since each counter for an item is incremented every time that the item is seen. Hence, each count for an item is an upper bound of its actual value, and the minimum count is the least known upper bound. To improve the accuracy of the counts maintained, a few optimizations were proposed.

The first optimization was the idea of Minimal Increase. When an item is inserted into the SBF, only the counters with minimal count should be incremented. The motivation for doing so is that the only possibility of error is overestimating the count, and this method attempts to lessen that as much as possible. Note that this method has the minimal number of increases while still maintaining the property that each count of an item is an upper bound of its actual count, hence its name.

A significant drawback of minimal increase is that it no longer allows for deletions. In fact, if deletions were allowed, then false negative error would be introduced, which would be hard to bound. However, in tests where there were no deletions, minimal increase demonstrated a significant improvement. Finally, note that since minimal increase significantly decreases the number of increments, there is a smaller probability for overflow.

The other improvement suggested by the creators of the spectral Bloom filter was that of making use of recurring minima. They observed, and verified experimentally, that if the minimum count for a particular item was unique, then there was a considerably higher probability that the count for that item was incorrect. To improve the accuracy of the filter, they proposed using a secondary filter that would maintain counts for items without recurring minima. The details of the algorithm are given below.


Algorithm 7: Spectral Bloom Filter Insertion (using Recurring Minimum)
Spectral-Bloom-Insert(x)
1: for i = 1 to k do
2:   A[h_i(x)] := A[h_i(x)] + 1
3: if the minimum of A[h_1(x)], . . . , A[h_k(x)] is unique then
4:   if x is not in the secondary filter B then
5:     for i = 1 to k do
6:       B[h_i(x)] := B[h_i(x)] + min{A[h_1(x)], . . . , A[h_k(x)]}
7:   else
8:     for i = 1 to k do
9:       B[h_i(x)] := B[h_i(x)] + 1

Algorithm 8: Sampling Algorithm Insertion
Sampling-Insert(x)
1: if x already has an entry then
2:   increment counter for x
3: else
4:   with probability p create an entry for x

When inserting an item, it is first inserted into the spectral Bloom filter as usual. Next, it is checked whether the item has a unique minimum. If the minimum is not unique, then nothing more needs to be done. If it turns out that the minimum is unique, then the item is searched for in the secondary filter. If it is not in the secondary filter, it is inserted there with count equal to the minimum count of the primary filter, otherwise its count is simply updated in the secondary. This is shown in full detail in Algorithm 7.

Since only a fraction of the items will not have recurring minima, the size of the secondary filter can be a lot smaller. Experimentally, the authors found that using a secondary filter half the size of the primary one gives a considerable improvement in performance, even compared to a single filter that used as much space as the two filters combined.
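The Minimal Increase heuristic from earlier in this section is simple to express on top of the counting filter sketch (the function is ours; cbf is a CountingBloomFilter as sketched in Section 3):

def minimal_increase_insert(cbf, x):
    # Increment only the counters currently holding the minimal value.
    # The minimum remains an upper bound on the true count but grows as
    # slowly as possible; deletions are no longer supported.
    locs = cbf._locations(x)
    smallest = min(cbf.bits[loc] for loc in locs)
    for loc in locs:
        if cbf.bits[loc] == smallest:
            cbf.bits[loc] += 1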

5.2 Multi-stage Filters

Very often counts are not needed for all items. For many practical applications an accurate estimate of the high-frequency items is all that is required. Usually, an item is defined to be high-frequency if it accounts for more than some threshold, say 0.1%, of all the insertions. This is really an easier problem because we have a bound on the number of items that we are interested in, i.e. for a threshold of 0.1% we are guaranteed that there are no more than 1000 such items.

Sampling to maintain counts has long been known to be a good way to keep track of high-frequency items, and this method is optimized, using multi-stage filters, by Estan and Varghese [EV02].

We will first take a look at a simple sampling algorithm proposed by Estan and Varghese. When an item is received, it is sampled with some fixed probability p, and if an entry does not exist for it, one is created. After an item has been sampled, an exact count is kept for it from then on (see Algorithm 8).

Suppose that all items that take up more than 1% of the flow need to be sampled.


Algorithm 9: Multi-stage Sampling
Multistage-Sampling-Insert(x)
1: for i = 1 to k do
2:   A_i[h_i(x)] := A_i[h_i(x)] + 1
3: if all A_i[h_i(x)] ≥ T then
4:   if no entry exists for x then
5:     create an entry for x with count T
6:   else
7:     increment the count for x's entry

There are at most 100 such items, but we shall allocate space for 100F counters, where F is an oversampling factor. Since we wish to eventually sample about 100F items, the sampling probability can be set to p = 100F/N, where N is the total number of (not necessarily distinct) items seen. Now, if we take an item that appears at least 1% of the time (i.e. it appears at least 0.01N times), then the probability with which it will not have an entry is less than

$$(1 - p)^{0.01N} = (1 - 100F/N)^{0.01N} \le e^{-F}.$$

Hence, the probability that such an item is not sampled is exponentially small in the oversampling factor.

One problem with the above approach is that it is possible that more than 100F items may be sampled. However, since the distribution of the samples is binomial there is a small standard deviation, and hence adding a few standard deviations to the number of counters (105F rather than 100F) will give a very low probability of running out of memory.

Another sampling method proposed by Estan and Varghese is to use multi-stage filters. The idea behind the multi-stage filter is, once again, very similar to that of the Bloom filter.

Typically, when hashing is used for sampling, a hash function is used to map the item being inserted to a location in an array of counters and this counter is incremented. The problem lies in the fact that if we want to use a lot fewer counters than we have potential items, then multiple items will get mapped to the same location. The resulting collisions could cause two kinds of problems. Firstly, if too many small count items get mapped to the same location, then they all look like they have high counts. Secondly, it is possible for low count items to get mapped to the same location as a real, high count one, again giving a false positive. To get around this problem Estan and Varghese use multiple hash functions, as in a Bloom filter.

Each hash function corresponds to a stage for the item. The main difference between these stages and a regular counting Bloom filter is that each hash function maps to a location in its own count vector, hence avoiding collisions between different hash functions. When an item is inserted, each location mapped to by the different hash functions is incremented. If each location is above some fixed threshold, then an entry is created for that item and its exact count is maintained from then on. This is illustrated in Algorithm 9.
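A Python sketch of Algorithm 9 follows. The class and the per-stage hash derivation are ours, and an ordinary dictionary stands in for the table of exact entries:

import hashlib

class MultistageFilter:
    """Multi-stage filter: k independent stages plus exact entries."""

    def __init__(self, k, stage_size, threshold):
        self.k = k
        self.stage_size = stage_size
        self.threshold = threshold
        self.stages = [[0] * stage_size for _ in range(k)]
        self.entries = {}  # exact counts for items past the threshold

    def _hash(self, i, x):
        # Each stage hashes into its own counter vector.
        d = hashlib.sha256(f"{i}:{x}".encode()).digest()
        return int.from_bytes(d[:8], "big") % self.stage_size

    def insert(self, x):  # Algorithm 9
        counts = []
        for i in range(self.k):
            loc = self._hash(i, x)
            self.stages[i][loc] += 1
            counts.append(self.stages[i][loc])
        if all(c >= self.threshold for c in counts):
            if x not in self.entries:
                self.entries[x] = self.threshold
            else:
                self.entries[x] += 1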

Both of the sampling methods described above are good for the case when we wish to look for items with a frequency over some threshold. The ideas presented above were generalized by Muthukrishnan and Cormode in [CM05]. These kinds of methods were further generalized by these same two authors to include thresholds in variations of items (i.e., in their absolute or relative sizes) in [CM06].


    5.3 Bloomier Filters

The Bloomier filter, introduced by Rubinfeld et al. [CKRT04], is also worth mentioning, even though it is not used for the problem of maintaining frequency counts. The idea of the Bloomier filter is to generalize the use of Bloom filters from simply returning set membership (i.e. the characteristic function of the subset) to maintaining any arbitrary function.

Let f be any function from $D = [n]$ to $R = [2^r - 1] \cup \{\perp\}$, such that for some $S \subseteq D$, $||S|| = N$, $f(x) = \perp$ for all $x \in D - S$ and $f(x) \in [2^r - 1]$ for $x \in S$. The Bloomier filter allows you to get any value f(x) for $x \in S$ exactly, and will return $f(x) = \perp$ for any $x \in D - S$ with high probability.

The filter maintains its data structure with a space bound of O(Nr) which, when N is much smaller than n, is a considerable saving.

5.4 Space-Code Bloom Filters

Algorithm 10: SCBF Insert
SCBF-Insert(x)
1: i := random(1, l)
2: for j = 1 to k do
3:   A[h_ij(x)] := 1

Algorithm 11: MRSCBF Insert
MRSCBF-Insert(x)
1: for i = 1 to r do
2:   with probability p_i insert x into SCBF i

The number of groups for which the item appears (i.e. the size $||\{i \mid h_{i1}(x), h_{i2}(x), \ldots, h_{ik}(x) \text{ are all turned on}\}||$) is used to estimate the actual frequency of x.

To compute the approximate frequency of an element from the count of the number of filters that appear to have the item, some probabilistic estimation technique needs to be used. Essentially, the goal is to find the fixed value that maximizes the probability that the real frequency was this value, given the observed number of filters with positive reports. To do this, two methods were used: maximum likelihood estimation and mean value estimation. The maximum likelihood estimate computes the most likely frequency that gave the current observation, while the mean value estimation gives the frequency that is expected to give the observation on average.

One important drawback, which was not addressed in the paper, was that on average the absolute error of this method would be terrible. The reason for this is that irrespective of the estimation method used, there are only l + 1 different values that the observation can take (i.e. the number of filters that appear to have the item is in {0, 1, 2, . . . , l}), and hence any count lying between two of the possible estimates has to be mapped to the estimate above or below it.

Another problem with the simple SCBF is the choice of l, the number of groups of hash functions. If the choice of l is too small, then it is impossible to distinguish high counts, since all the filters will appear to have the item. In particular, the coupon collector's problem [MR95, p. 57] states that after $l \log_e l$ copies of an item are inserted, all the filters should have had the item inserted into them with high probability. Hence, the filter will not be able to distinguish between counts that are greater than $l \log_e l$.

One way to get around the above problem would be to increase the number of groups until $l \log_e l$ is greater than the maximum count (assumed to be known a priori). However, this method requires more space than is typically available. To get around both problems the multi-resolution space-code Bloom filter, or MRSCBF, was introduced.

The MRSCBF makes use of multiple SCBFs to keep more accurate counts. When an item is inserted into the MRSCBF, it is inserted into each of the SCBFs with a certain probability. The idea is to use some SCBF updated with high probability so as to keep an exact count of the low-frequency items, and another group with low probability that samples the high-frequency items. If there are r SCBFs, then the (different) probability of each is denoted by $p_1 > p_2 > p_3 > \cdots > p_r$. The exact algorithm for insertion is shown in Algorithm 11.

At the end of the insertion phase we can approximate the actual count of an element x from the individual counts of each SCBF. For some fixed element, the vector of counts of the r SCBFs is denoted by $\theta = (\theta_1, \ldots, \theta_r)$. The method of estimating the count of the item is outlined below.

If we use maximum likelihood estimation, then we wish to compute $\operatorname{argmax}_f \Pr(f \mid \theta)$. We first solve the problem for each individual observation $\theta_i$. Applying Bayes' rule, we get that

$$\Pr(f \mid \theta_i) = \frac{\Pr(\theta_i \mid f)\Pr(f)}{\Pr(\theta_i)}.$$

When maximizing this quantity for a particular observation, $\Pr(\theta_i)$ is constant. Hence, if we assume that $\Pr(f)$ is a constant then maximizing $\Pr(f \mid \theta_i)$ is equivalent to maximizing $\Pr(\theta_i \mid f)$. The assumption is not unreasonable since it works for uniform and slowly-varying distributions, i.e. distributions for which $|\Pr(f) - \Pr(f+1)|$ is bounded by a small constant. Hence, computing the frequency that maximizes the probability of the observation becomes a matter of computing $\operatorname{argmax}_f \Pr(\theta_i \mid f)$, which is easier, though still cumbersome, to compute.

Since the hash functions in the different SCBFs are independent of one another, the optimal value is

$$\operatorname{argmax}_f \Pr(f \mid \theta) = \operatorname{argmax}_f \prod_{i=1}^{r} \Pr(\theta_i \mid f).$$

Unfortunately, the formula becomes intractable to compute, and hence is not feasible even assuming that the lookup table is computed once and stored.

The problem becomes tractable if mean value estimation is used. With some extensive derivation, it can be shown that given a frequency f, the expected value of the count $\theta$ of an SCBF (with sampling probability p) is:

$$E[\theta] = \sum_{i=0}^{l}\sum_{q=0}^{f} \binom{f}{q} p^q (1-p)^{f-q} \binom{l}{i} (\rho^k)^i (1-\rho^k)^{l-i}\,(i + j(q, i)),$$

where $\rho$ is the fraction of bits set to 1 in the MRSCBF, and j(q, i) is the value of j that satisfies the equation

$$q \approx \frac{l}{l-i} + \frac{l}{l-(i+1)} + \cdots + \frac{l}{l-(i+j-1)}.$$

Hence, precomputing the above function and storing it in a lookup table would enable us to get an estimate for the frequency of the item, given the observation from the SCBF.

    6 The Bitwise Bloom Filter

A significant disadvantage of the spectral Bloom filter, and other variations on it, is that there is no way of pre-determining how large a particular item's count can get. Simply allocating a conservative number of bits to each counter is a waste of space, and using variable size counters adds unnecessary complexity to the implementation and increases the number of memory accesses required. Motivated by this consideration, we propose the Bitwise Bloom filter.

The simple idea behind the bitwise Bloom filter is that each count can be maintained in its binary representation, with each order of magnitude having its own counting Bloom filter. We need to use counting Bloom filters to support deletions from a particular filter when we have a carryover, though any succinct set maintenance data structure that supports insertions and deletions would suffice.

The bitwise Bloom filter consists of l counting Bloom filters $B_0, \ldots, B_{l-1}$, each corresponding to a different level of magnitude. For each level $0 \le i < l$, we will say that $B_i$ has a vector of $m_i$ bits and $k_i$ hash functions.

Inserting an item into this filter is similar to incrementing a binary counter: while the current lowest-order bit is turned on, we turn it off and carry into the next level. This is illustrated in Algorithm 12.

To get the count of an item we simply need to read its binary representation from the counters. This is shown in Algorithm 13.

Note that we expect each successive level of the filter to have fewer and fewer items, so we can decrease the size at each level.


Algorithm 12: Bitwise Insertion
Bitwise-Insert(x)
1: i := 0
2: while B_i contains x and i < l - 1 do
3:   remove x from B_i
4:   i := i + 1
5: insert x into B_i

Algorithm 13: Bitwise Count
Bitwise-Count(x)
1: count := 0
2: for i = 0 to l - 1 do
3:   if x is in B_i then
4:     count := count + 2^i
5: return count

If we drop the size of the counting filter at each level by a factor of 2 each time (or any other constant factor), then the total space used by the filter as a whole will be at most two times (or some constant multiple of) the amount of space at the bottom level. We can thus keep arbitrarily many levels (and hence keep counts of arbitrarily high frequencies) without having to pre-allocate an unnecessarily large amount of space for them.
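A Python sketch of Algorithms 12 and 13 follows. For clarity each level is modeled as an exact set; in the data structure proper each level is a counting Bloom filter, which supports the removal needed by the carry step at the cost of the per-level false positive error analyzed below.

class BitwiseFilter:
    """Bitwise filter sketch: level i holds bit 2^i of each item's count."""

    def __init__(self, levels):
        self.l = levels
        self.B = [set() for _ in range(levels)]  # stand-ins for counting filters

    def insert(self, x):  # Algorithm 12: binary increment with carries
        i = 0
        while x in self.B[i] and i < self.l - 1:
            self.B[i].remove(x)  # turn this bit off and carry upward
            i += 1
        self.B[i].add(x)

    def count(self, x):  # Algorithm 13: read off the binary representation
        return sum(2 ** i for i in range(self.l) if x in self.B[i])

bw = BitwiseFilter(levels=8)
for _ in range(11):
    bw.insert("flow-42")
print(bw.count("flow-42"))  # 11: bits 0, 1, and 3 are set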

In the following section we will take a look at some analysis of the resources required to maintain the bitwise filter, and bounds on its error.

    6.1 Analysis

The first parameter that we will fix is the amount of space that we allocate to each level of the filter. We know from the analysis in Section 2.2 that the probability of error at each level will be at most $(0.619)^{m_i/n_i}$, where $m_i$ is the amount of space used at the ith level and $n_i$ is the number of items that we expect to reach the ith level. Given a fixed distribution, we can determine how many items we expect will be inserted into the ith level. This is analyzed for the simple case of the uniform distribution below.

Assume that we have a uniform distribution of N numbers from the set [n]. We are interested in computing the number of items from the set that have frequency at least $2^i$ for each i, that is $n\Pr(\text{count} \ge 2^i)$. From an analysis very similar to the one in Section 3.1, we have that

$$\Pr(\text{count} \ge 2^i) \le \binom{N}{2^i}\left(\frac{1}{n}\right)^{2^i} \le \left(\frac{eN}{n2^i}\right)^{2^i}.$$

Hence, we can allocate $cn\left(\frac{eN}{n2^i}\right)^{2^i}$ space at the ith level, so as to guarantee an error of at most $(0.619)^c$ at that level. If we are given only a fixed amount of space, then we choose c sufficiently small so that there is enough space for each level.

A similar analysis becomes very hard for other distributions where it might not be possible to get such a tight bound. To get around this, we could make use of Markov's inequality. For example, if we redo the analysis for the uniform distribution, by Markov's inequality we have that

$$\Pr(\text{count} \ge 2^i) \le \frac{E[\text{count}]}{2^i} = \frac{N}{n2^i}.$$

Hence, by this analysis we could allocate $cN/2^i$ space at the ith level to get an error of at most $(0.619)^c$, validating our earlier claim that we should halve the amount of space at each subsequent level of the filter. Similarly, the Markov bound can be applied for any distribution, and the estimated frequency of the most probable item could be used for E[count].

In the previous section we saw that we could use an arbitrarily large number of levels with constant dropoff. From a practical standpoint, however, we cannot have an arbitrary number of levels. If we know a priori that the maximum count of any item is maxcount, then the number of levels that we would use would be $1 + \lfloor \log_2 \text{maxcount} \rfloor$. However, if the maximum count is not known, then we use $\log_d(\text{space}/m_{\min})$ levels, where d is the dropoff in size of each successive counting filter, space is the total amount of space that we have available, and $m_{\min}$ is the smallest size counting filter we might use with a reasonable amount of error.

Once we have the amount of space $m_i$ at each level, we can easily set $k_i$, the number of hash functions at each level, to be $k_i = \frac{m_i}{n_i}\log_e 2$, which we already know to be the optimal number from Section 2.2.
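This parameter selection is mechanical; the sketch below (our names, with the dropoff defaulting to d = 2) splits a space budget into geometrically shrinking levels and picks the hash count for each:

import math

def bitwise_parameters(total_space, m_min, d=2):
    # Number of levels we can afford, per the rule above.
    levels = int(math.log(total_space / m_min, d))
    # Split total_space into the geometric series m_0, m_0/d, m_0/d^2, ...
    m0 = total_space * (d - 1) / (d * (1 - d ** -levels))
    return [m0 / d ** i for i in range(levels)]

def hashes_for_level(m_i, n_i):
    # Optimal number of hash functions from Section 2.2: k = (m/n) ln 2.
    return max(1, round((m_i / n_i) * math.log(2)))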

Now that we have seen how we can choose all the parameters of the bitwise filter, we can take a look at its performance.

    6.1.1 Memory Accesses

One important measure of the usefulness of any data structure such as this is the number of memory accesses that are necessary for a single update. If we implement this algorithm on a high-speed router then we would like to have as few memory accesses as possible.

The original hope for the bitwise counter was that the updates at higher levels that would be required on carryover could be amortized across all insertions, much the same way that the amortized cost of incrementing a binary counter is O(1) (see [CLRS01, p. 408]). Unfortunately, the same analysis does not work here because at each level we have a constant probability of false positive error. However, as we will see, there will still be very few accesses.

The number of memory accesses for the bitwise counter is at most $\sum_{i=0}^{l-1} 2k_i$ ($k_i$ reads and $k_i$ writes at the ith level). However, note that if we are halving the amount of space ($m_i$) at each level, this means that the value of $k_i$ is halving at each level as well (since optimally $k_i = \frac{m_i}{n_i}\log_e 2$). Hence, the number of accesses is bounded by

$$\sum_{i=0}^{l-1} 2k_i \le \sum_{i=0}^{l-1} \frac{2k_0}{2^i}$$