concurrent expandable amqs on the basis of quotient filtersconcurrent expandable amqs on the basis...

Concurrent Expandable AMQs on the Basis of Quotient Filters

Tobias Maier, Peter Sanders, Robert Williger

Karlsruhe Institute of Technology{t.maier, sanders}@kit.edu

Abstract

A quotient filter is a cache efficient ApproximateMembership Query (AMQ) data structure. De-pending on the fill degree of the filter most inser-tions and queries only need to access one or twoconsecutive cache lines. This makes quotient fil-ters very fast compared to the more commonly usedBloom filters that incur multiple cache misses de-pending on the false positive rate. However, con-current Bloom filters are easy to implement and canbe implemented lock-free while concurrent quotientfilters are not as simple. Usually concurrent quo-tient filters work by using an external array of locks– each protecting a region of the table. Accessingthis array incurs one additional cache miss per op-eration. We propose a new locking scheme thathas no memory overhead. Using this new lockingscheme we achieve 1.8 times higher speedups thanwith the common external locking scheme.

Another advantage of quotient filters over Bloomfilters is that a quotient filter can change its sizewhen it is becoming full. We implement this grow-ing technique for our concurrent quotient filters andadapt it in a way that allows unbounded growingwhile keeping a bounded false positive rate. We callthe resulting data structure a fully expandable quo-tient filter. Its design is similar to scalable Bloomfilters, but we exploit some concepts inherent toquotient filters to improve the space efficiency andthe query speed.

Additionally, we propose several quotient filtervariants that are aimed to reduce the number of sta-tus bits (2-status-bit variant) or to simplify concur-rent implementations (linear probing quotient fil-ter). The linear probing quotient filter even leads toa lock-free concurrent filter implementation. This isespecially interesting, since we show that any lock-free implementation of another common quotientfilter variant would incur significant overheads inthe form of additional data fields or multiple passesover the accessed data.

1 Introduction

Approximate Membership Query (AMQ) datastructures offer a simple interface to represent sets.Elements can be inserted, queried, and dependingon the use case elements can also be removed. A

query returns true if the element was previously in-serted. Querying an element might return true evenif an element was not inserted. We call this a falsepositive. The probability that a non-inserted ele-ment leads to a false positive is called the false posi-tive rate of the filter. It can usually be chosen at thetime of constructing the data structure. AMQ datastructures have two main advantages over other setrepresentations they are fast and space efficient.Similar to hash tables most AMQ data structureshave to be initialized knowing a bound to their fi-nal size. This can be a problem for their spaceefficiency if no accurate bound is known.

AMQ data structures have become an integralpart of many complex data structures or data baseapplications. Their small size and fast accesses canbe used to sketch large, slow data sets. The AMQfilter is used before accessing the data base to checkwhether a slow lookup is actually necessary. Somerecent uses of AMQs include network analysis [1]and bio informatics [16]. These are two of the areaswith the largest data sets available today, but giventhe current big data revolution we expect more andmore research areas will have a need for the spaceefficiency and speedup potential of AMQs. Themost common AMQ data structure in practice isstill a Bloom filter even though a quotient filterwould usually be significantly faster. One reason forthis might be that concurrent Bloom filters are easyto implement – even lock-free and well scaling im-plementations. This is important because well scal-ing implementations have become critical in today’smulti-processor scenarios since single core perfor-mance stagnates and efficient multi-core computa-tion has become a necessity to handle growing datasets.

We present a technique for concurrent quotientfilters that uses local locking inside the table – with-out an external array of locks. Instead of traditionallocks we use the per-slot status bits that are inher-ent in quotient filters to lock local ranges of slots.As long as there is no contention on a single area ofthe table there is very little overhead to this lock-ing technique because the locks are placed in slotsthat would have been accessed anyways. An inter-esting advantage of quotient filters over Bloom fil-ters is that the capacity of a quotient filter can beincreased without access to the original elements.However, because the false positive rate grows lin-early with the number of inserted elements, this is

1

arX

iv:1

911.

0837

4v1

[cs

.DS]

19

Nov

201

9

only useful for a limited total growing factor. Weimplement this growing technique for our concur-rent quotient filters and extend it to allow largegrowing factors at a strictly bounded false positiverate. These fully expandable quotient filters com-bine growing quotient filters with the multi levelapproach of scalable Bloom filters [2].

1.1 Related Work

Since quotient filters were first described in 2012 [3]there has been a steady stream of improvements.For example Pandey et al. [17] have shown how toreduce the memory overhead of quotient filters byusing rank-select data structures. This also im-proves the performance when the table becomesfull. Additionally, they show an idea that savesmemory when insertions are skewed (some elementsare inserted many times). They also mention thepossibility for concurrent access using an externalarray of locks (see Section 5 for results). Recently,there was also a GPU-based implementation of quo-tient filters [9]. This is another indicator that thereis a lot of interest in concurrent AMQs even in thesehighly parallel scenarios.

Quotient filters are not the only AMQ data struc-tures that have received attention recently. Cuckoofilters [8, 13] and very recently Morton filters [4](based on cuckoo filters) are two other examplesof AMQ data structures. A cuckoo filter can befilled more densely than a quotient filter, thus, itcan be more space efficient. Similar to cuckoo hashtables each access to a cuckoo filter needs to accessmultiple buckets – leading to a worse performancewhen the table is not densely filled (a quotient filteronly needs one access). It is also difficult to imple-ment concurrent cuckoo filters (similar to concur-rent cuckoo hash tables [11,14]); in particular, whenadditional memory overhead is unwanted, e.g., per-bucket locks.

A scalable Bloom filter [2] allows unboundedgrowing. This works by adding additional levelsof Bloom filters once one level becomes full. Eachnew filter is initialized with a higher capacity anda lower false positive rate than the last. The querytime is dependent on the number of times the fil-ter has grown (usually logarithmic in the numberof insertions). Additionally, later filters need moreand more hash functions to achieve their false posi-tive rate making them progressively slower. In Sec-tion 4, we show a similar technique for fully expand-able quotient filters that mitigates some of theseproblems.

1.2 Overview

In Section 2 we introduce the terminology and ex-plain basic quotient filters as well as our variations

in a sequential setting. Section 3 then describesour approach to concurrent quotient filters includ-ing the two variants: the lock-free linear probingquotient filter and the locally locked quotient filter.The dynamically growing quotient filter variant isdescribed in Section 4. All presented data struc-tures are evaluated in Section 5. Afterwards, wedraw our conclusions in Section 6.

2 Sequential Quotient Filter

In this section we describe the basic sequential quo-tient filter as well as some variants to the main datastructure. Throughout this paper, we use m for thenumber of slots in our data structure and n for thenumber of inserted elements. The fill degree of anygiven table is denoted by δ = n/m. Additionally,we use p+ to denote the probability of a false posi-tive query.

2.1 Basic Quotient Filter

Quotient filters are approximate membership querydata structures that were first described by Benderet al. [3] and build on an idea for space efficienthashing originally described by Cleary [6]. Quo-tient filters replace possibly large elements by fin-gerprints. The fingerprint f(x) of an element x is anumber in a predefined range f : x 7→ {0, ..., 2k−1}(binary representation with exactly k digits). Wecommonly get a fingerprint of x by taking the k bot-tommost bits of a hash function value h(x) (usinga common hash function like xxHash [7]).

A quotient filter stores the fingerprints of all in-serted elements. When executing a query for anelement x, the filter returns true if the finger-print f(x) was previously inserted and false oth-erwise. Thus, a query looking for an element thatwas inserted always returns true. A false posi-tive occurs when x was not inserted, but its fin-gerprint f(x) matches that of a previously insertedelement. Given a fully random fingerprint function,the probability of two fingerprints being the same is2−k. Therefore, the probability of a false positive isbounded by n ·2−k where n is the number of storedfingerprints.

To achieve expected constant query times as wellas to save memory, fingerprints are stored in a spe-cial data structure that is similar to a hash tablewith open addressing (specifically it is similar torobinhood hashing [5]). During this process, thefingerprint of an element x is split into two parts:the topmost q bits called the quotient quot(x) andthe bottommost r bits called the remainder rem(x)(q + r = k). The quotient is used to address atable that consisting of m = 2q memory slots ofr+ 3 bits (it can store one remainder and three ad-ditional status bits). The quotient of each element

2

is only stored implicitly by the position of the ele-ment in the table. The remainder is stored explic-itly within one slot of the table. Similar to manyhashing techniques, we try to store each element inone designated slot which we call its canonical slot(index quot(x)). With the help of the status bits wecan reconstruct the quotient of each element evenif it is not stored in its canonical slot.

The main idea for resolving collisions is to findthe next free slot – similar to linear probing hashtables – but to reorder the elements such that theyare sorted by their fingerprints (see Figure 1). El-ements with the same quotient (the same canon-ical slot) are stored in consecutive slots, we callthem a run. The canonical run of an element is therun associated with its canonical slot. The canoni-cal run of an element does not necessarily start inits canonical slot. It can be shifted by other runs.If a run starts in its canonical slot we say that itstarts a cluster that contains all shifted runs afterit. Multiple clusters that have no free slots betweenthem form a supercluster. We use the 3 status bitsthat are part of each slot to distinguish betweenruns, clusters, and empty slots. For this we storethe following information about the contents of theslot (further described in Table 1): Were elementshashed to this slot (is its run non-empty)? Doesthe element in this slot belong to the same run asthe previous entry (used as a run-delimiter signal-ing where a new run starts)? Does the element inthis slot belong to the same cluster as the previousentry (is it shifted)?

During the query for an element x, all remaindersstored in x’s canonical run have to be compared torem(x). First we look at the first status bit of thecanonical slot. If this status bit is unset there is norun for this slot and we can return false. If there isa canonical run, we use the status bits to find it byiterating to the left until the start of a cluster (thirdstatus bit = 0). From there, we move to the rightcounting the number of non-empty runs to the leftof the canonical run (slots with first status bit = 1left of quot(x)) and the number of run starts (slotswith second status bit = 0). This way, we can easilyfind the correct run and compare the appropriateremainders.

An insert operation for element x proceeds sim-ilar to a query, until it either finds a free slot or aslot that contains a fingerprint that is ≥ f(x). Thecurrent slot is replaced with rem(x) shifting the fol-lowing slots to the right (updating the status bitsappropriately).

When the table fills up, operations become slowbecause the average cluster length increases. Whenthe table is full, no more insertions are possible. In-stead, quotient filters can be migrated into a largertable – increasing the overall capacity. To do this anew table is allocated with twice the size of the orig-

inal table and one less bit per remainder. Address-ing the new table demands an additional quotientbit, since it is twice the size. This issue is solvedby moving the uppermost bit from the remainderinto the quotient (q′ = q + 1 and r′ = r − 1). Thecapacity exchange itself does not impact the falsepositive rate of the filter (fingerprint size and num-ber of elements remain the same). But the falsepositive rate still increases linearly with the num-ber of insertions (p+ = n · 2−k). Therefore, thefalse positive rate doubles when the table is filledagain. We call this migration technique boundedgrowing, because to guarantee reasonable false pos-itive rates this method of growing should only beused a bounded number of times.

2.2 Variants

We give a short explanation of counting quotient fil-ters using rank-and-select data structures [17]. Ad-ditionally, we introduce two previously unpublishedquotient filter variants: the 2-status-bit quotient fil-ter and the linear probing quotient filter.

(Counting) Quotient Filter Using Rank-and-Select. This quotient filter variant proposed byPandey et al. [17] introduces two new features: thenumber of necessary status bits per slot is reducedto 2.125 by using a bitvector that supports rank-and-select queries and the memory footprint of mul-tisets is reduced by allowing counters to be storedin the table – together with remainders.

This quotient filter variant uses two status bitsper slot the occupied-bit (first status bit in the de-scription above) and the run-end-bit (similar to thesecond bit). These are stored in two bitvectors thatare split into 64 bit blocks (using 8 additional bitsper block). To find a run, the rank of its occupied-bit is computed and the slot with the matching run-end-bit is selected.

To reduce memory impact of repeated elementsPandey et al. use the fact that remainders arestored in increasing order. Whenever an elementis inserted for a second time instead of doubling itsremainder they propose to store a counter in theslot after its remainder this counter can be recog-nized as a counter because it does not fit into theincreasing order (the counter starts at zero). Aslong as it remains smaller than the remainder thiscounter can count the number of occurrences of thisremainder. Once the counter becomes too large an-other counter is inserted.

Two-Status-Bit Quotient Filter. Pandey etal. [17] already proposed a 2 status bit variant oftheir counting filter. Their implementation howeverhas average query and insertion times in Θ(n). Thegoal of our 2-status bit variant is to achieve running

3

1** this slot has a run000 empty slot100 cluster start*01 run start*11 continuation of a run*10 – (not used see 3.2)

Table 1: Meaning of different sta-tus bit combinations.

1 1 1 1 1 1 1 1 1 1 1 10 0 0 0 0 0 0 0 0 0 0 0 000

super cluster

status

rem(·)

clusterrun

1|a 1|b 3|c 3|d 3|e 4|f 6|g0 01: 2: 3: 4: 5: 6: 7:quot(·)| 8:0:

Figure 1: Section of the table with highlighted runs, clusters, andsuperclusters. Runs point to their canonical slot.

times close to O(supercluster length). To achievethis, we change the definition of the fingerprint(only for this variant) such that no remainder canbe zero f ′ : x 7→ {0, ..., 2q−1}×{1, ..., 2r−1}. Ob-taining a non-zero remainder can easily be achievedby rehashing an element with different hash func-tions until the remainder is non-zero. This changeto the fingerprint only has a minor impact on thefalse positive rate of the quotient filter (n/m·(2r−1)instead of n/m · 2r).

Given this change to the fingerprint we can easilydistinguish empty slots from filled ones. Each slotuses two status bits the occupied-bit (first statusbit in the description above) and the new-run-bit(run-delimiter). Using these status bits we can finda particular run by going to the left until we find afree slot and then counting the number of occupied-and new-run-bits while moving right from there(Note: a cluster start within a larger superclustercannot be recognized when moving left).

Linear Probing Quotient Filter. This quo-tient filter is a hybrid between a linear probing hashtable and a quotient filter. It uses no reordering ofstored remainders and no status bits. To offset theremoval of status bits we add three additional bitsto the remainder (leading to a longer fingerprint).Similar to the two-status-bit quotient filter abovewe ensure no element x has rem(x) = 0 (by adapt-ing the fingerprint function).

During the insertion of an element x the remain-der rem(x) is stored in the first empty slot afterits canonical slot. Without status bits and reorder-ing it is impossible to reconstruct the fingerprint ofeach inserted element. Therefore, a query lookingfor an element x compares its remainder rem(x)with every remainder stored between x’s canoni-cal slot and the next empty slot. There are po-tentially more remainders that are compared thanduring the same operation on a normal quotientfilter. The increased remainder length however re-duces the chance of a single comparison to lead toa false positive by a factor of 8. Therefore, as longas the average number of compared remainders isless than 8 times more than before, the false posi-tive remains the same or improves. In our tests, wefound this to be true for fill degrees up to ≈ 70%.

The number of compared remainders correspondsto the number of slots probed by a linear probinghash table, this gives the bound below.

Corollary 1 (p+ linear probing q.f.). The falsepositive rate of a linear probing quotient filter withδ = n/m (bound due to Knuth [10] chapter 6.4) is

p+ =E[#comparisons]

2r+3 − 1=

1

2

(1 +

1

(1− δ)2

)·

1

2r+3 − 1

It should be noted that linear probing quotientfilters cannot support deletions and the boundedgrowing technique.

3 Concurrent Quotient Filter

From a theoretical point of view using an exter-nal array of locks should lead to a good concurrentquotient filter. As long as the number of locks islarge enough to reduce contention (> p2) and thenumber of slots per lock is large enough to ensurethat clusters don’t span multiple locking regions.But there are multiple issues that complicate theimplementation in practice. In this section we de-scribe the difficulties of implementing an efficientconcurrent quotient filter, attributes of a success-ful concurrent quotient filter, and the concurrentvariants we implemented.

Problems with Concurrent Quotient Filters.The first consideration with every concurrent datastructure should be correctness. The biggest prob-lem with implementing a concurrent quotient fil-ter is that multiple data entries have to be readin a consistent state for a query to succeed. Allcommonly known variants (except the linear prob-ing quotient filter) use at least two status bits perslot, i.e., the occupied-bit and some kind of run-delimiter-bit (the run-delimiter might take differentforms). The remainders and the run-delimiter-bitthat belong to one slot cannot reliably be storedclose to their slot and its occupied-bit. Therefore,the occupied-bit and the run-delimiter-bit cannotbe updated with one atomic operation.

For this reason, implementing a lock-free quo-tient filter that uses status bits would cause signifi-cant overheads (i.e. additional memory or multiple

4

passes over the accessed data). The problem is thatreading parts of multiple different but individuallyconsistent states of the quotient filter leads to aninconsistent perceived state. To show this we as-sume that insertions happen atomically and trans-fer the table from one consistent state directly intothe new final consistent state. During a query wehave to find the canonical run and scan the remain-ders within this run. To find the canonical run wehave to compute its rank among non-empty runsusing the occupied-bits (either locally by iteratingover slots or globally using a rank-select-data struc-ture) and then find the run with the same rankusing the run-delimiter-bits. There is no way toguarantee that the overall state has not changedbetween finding the rank of the occupied-bit bitand finding the appropriate run-delimiter. Specif-ically we might find a new run that was createdby an insertion that shifted the actual canonicalrun. Similar things can happen when accessingthe remainders especially when remainders are notstored interleaved with their corresponding statusbits. Some of these issues can be fixed using multi-ple passes over the data but ABA problems mightarise in particular when deletions are possible.

To avoid these problems while maintaining per-formance we have two options: either comparing allfollowing remainders and removing the status bitsall-together (i.e. the linear probing quotient filter)or by using locks that protect the relevant range ofthe table.

Goals for Concurrent Quotient Filters. Be-sides the correctness there are two main goals forany concurrent data structure: scalability and over-all performance. One major performance advan-tage of quotient filters over other AMQ data struc-tures is their cache efficiency. Any successful con-current quotient filter variant should attempt topreserve this advantage. Especially insertions intoshort or even empty clusters and (unsuccessful)queries with an empty canonical run lead to op-erations with only one or two accessed cache linesand at most one write operation. Any variant usingexternal locks has an immediate disadvantage here.

Concurrent Slot Storage. Whenever data isaccessed by multiple threads concurrently, we haveto think about the atomicity of operations. To beable to store all data connected to one slot togetherwe alternate between status bits and remainders.The slots of a quotient filter can have arbitrarysizes that do not conform to the size of a standardatomic data type. Therefore, we have to take intoaccount the memory that is wasted by using thesebasic data types inefficiently. It would be commonto just use the smallest data type that can holdboth the remainder and status bits and waste any

excess memory. But quotient filters are designed tobe space efficient. Therefore, we need a differentmethod.

We use the largest common atomic data type(64 bit) and pack as many slots as possible intoone data element – we call this packed constructa group (of slots). This way, the waste of multi-ple slots accumulates to encompass additional ele-ments. Furthermore, using this technique we canatomically read or write multiple slots at once – al-lowing us to update in bulk and even avoid somelocking.

An alternative would have been not to waste anymemory and store slots consecutively. However,this leads to slots that are split over the bordersof the aligned common data types and might evencross the border between cache lines. While non-aligned atomic operations are possible on modernhardware, we decided against this method after afew preliminary experiments showed how inefficientthey performed in practice.

3.1 Concurrent Linear Probing Quo-tient Filter

Operations on a linear probing quotient filter (asdescribed in Section 2.2) can be executed concur-rently similar to operations of a linear probing hashtable [12]. The table is constructed out of mergedatomic slots as described above. Each insertionchanges the table atomically using a single compareand swap instruction. This implementation is evenlock free, because at least one competing write op-eration on any given cell is successful. Queries findall remainders that were inserted into their canon-ical slot, because they are stored between theircanonical slot and the next empty slot and the con-tents of a slot never change once something is storedwithin it.

3.2 Concurrent Quotient Filter withLocal Locking

In this section we introduce an easy way to imple-ment concurrent quotient filters without any mem-ory overhead and without increasing the numberof cache lines that are accessed during operations.This concurrent implementation is based on the ba-sic (3-status-bit) quotient filter. The same ideascan also be implemented with the 2-status-bit vari-ant, but there are some problems discussed later inthis section.

Using status bits for local locking. To imple-ment our locking scheme we use two combinationsof status bits that are impossible to naturally ap-pear in the table (see Table 1) – 010 and 110. Weuse 110 to implement a write lock. In the beginning

5

Algorithm 1 Concurrent Insertion

(quot, rem)← f(key)// Block A: try a trivial insertion

table section← atomically load data around slot quotif insertion into table section is trivial then

finish that insertion with a CAS and return// Block B: write lock the supercluster

scan right from it ← quotif it is write locked then

wait until released and continueif it is empty then

lock it with write lock and breakif CAS unsuccessful re-examine it

// Block C: read lock the cluster

scan left from it ← quotif it is read locked then

wait until released and retry this slotif it is cluster start then

lock it with read lock and breakif CAS unsuccessful re-examine it

// Block D: find the correct run

occ = 0; run = 0scan right from it

if it is occupied and it < canonical slot then occ++if it is run start then run++if occ = run and it ≥ canonical slot then break

// Block E: insert into the run and shift

scan right from itif it is read locked then

wait until releasedstore rem in correct slotshift content of following slots(keep groups consistent)break after overwriting the write lock

unlock the read lock

of each write operation, 110 is stored in the first freeslot after the supercluster, see Block B Algorithm 1(using a compare and swap operation). Insertionswait when encountering a write lock. This ensuresthat only one insert operation can be active persupercluster.

The combination 010 is used as a read lock. Alloperations (both insertions Block C Algorithm 1and queries Block G Algorithm 2) replace the sta-tus bits of the first element of their canonical clus-ter with 010. Additionally, inserting threads waitfor each encountered read lock while moving ele-ments in the table, see Block E Algorithm 1. Whenmoving elements during the insertion each encoun-tered cluster start is shifted and becomes part ofthe canonical cluster that is protected by the in-sertion’s original read lock. This ensures, that nooperation can change the contents of a cluster whileit is protected with a read lock.

Avoiding locks. Many instances of locking canbe avoided, e.g., when the canonical slot for an in-sertion is empty (write the elements with a singlecompare and swap operation), when the canonicalslot of a query either has no run (first status bit), orstores the wanted fingerprint. In addition to thesetrivial instances of lock elision, where the whole

Algorithm 2 Concurrent Query

(quot, rem)← f(key)// Block F: try trivial query

table section← atomically load data around slot quotif answer can be determined from table section then

return this answer// Block G: read lock the cluster

scan left from it ← quotif it is read locked then

wait until released and retry this slotif it is cluster start then

lock it with read lock and breakif CAS unsuccessful re-examine it

// Block H: find the correct run

occ = 0; run = 0scan right from it

if it is occupied and it < canonical slot then occ++if it is run start then run++if occ = run and it ≥ canonical slot then break

// Block I: search remainder within the run

scan right from itif it = rem

unlock the read lockreturn contained

if it is not continuation of this rununlock the read lockreturn not contained

operation happens in one slot, we can also profitfrom our compressed atomic storage scheme (seeSection 3). Since we store multiple slots togetherin one atomic data member, multiple slots can bechanged at once. Each operation can act withoutacquiring a lock, if the whole operation can be com-pleted without loading another data element. Thecorrectness of the algorithm is still guaranteed, be-cause the slots within one data element are alwaysin a consistent state.

Additionally, since the table is the same as itwould be for the basic non-concurrent quotient fil-ter, queries can be executed without locking if therecan be no concurrent write operations (e.g. dur-ing an algorithm that first constructs the table andthen queries it repeatedly). This fact is actuallyused for the fully expandable variant introduced inSection 4.

Deletions. Given our locking scheme deletionscould be implemented in the following way (we didnot do this since it has some impact on insertions).A thread executing a deletion first has to acquire awrite lock on the supercluster then it has to scanto the left (to the canonical slot) to guarantee thatthere was no element removed while finding thecluster end. Then the read lock is acquired and thedeletion is executed similar to the sequential case.During the deletion it is possible that new clusterstarts are created due to elements shifting to theleft into their canonical slot. Whenever this hap-pens, the cluster is created with read locked statusbits (010). After the deletion all locks are unlocked.

This implementation would impact some of the

6

other operations. During a query, after acquiringthe read lock when moving to the right, it would bepossible to hit a new cluster start or an empty cellbefore reaching the canonical slot. In this case theprevious read lock is unlocked and the operation isrestarted. During an insertion after acquiring thewrite lock it is necessary to scan to the left to checkthat no element was removed after the canonicalslot.

2-Status-Bit Concurrent Quotient Filter. Inthe 2-status-bit variant of the quotient filter thereare no unused status bit combinations that can beused as read or write locks. But zero can neverbe a remainder when using this variant, therefore,a slot with an empty remainder but non-zero sta-tus bits cannot occur. We can use such a cell torepresent a lock. Using this method, cluster startswithin a larger supercluster cannot be recognizedwhen scanning to the left ,i.e., in Blocks C and G(of Algorithms 1 and 2). Instead we can only rec-ognize supercluster starts (a non-empty cell to theright of an empty cell).

To write lock a supercluster, we store 01 in thestatus bits of an otherwise empty slot after the su-percluster. To read lock a supercluster we removethe remainder from the table and store it locallyuntil the lock is released (status bits 11). How-ever, this concurrent variant is not practical be-cause we have to ensure that the read-locked slotis still a supercluster start (the slot to its left re-mains empty). To do this we can atomically com-pare and swap both slots. However this is a problemsince both slots might be stored in different atomicdata types or even cache lines. The variant is stillinteresting from a theoretical perspective where acompare and swap operation changing two neigh-boring slots is completely reasonable (usually be-low 64 bits). However, we have implemented thisvariant when we did some preliminary testing withnon-aligned compare and swap operations and it isnon-competitive with the other implementations.

Growing concurrently. The bounded growingtechnique (described in Section 2.1) can be used toincrease the capacity of a concurrent quotient fil-ter similar to that of a sequential quotient filter.In the concurrent setting we have to consider twothings: distributing the work of the migration be-tween threads and ensuring that no new elementsare inserted into parts of the old table that werealready migrated (otherwise they might be lost).

To distribute the work of migrating elements, weuse methods similar to those in Maier et al. [12].After the migration is triggered, every thread thatstarts an operation first helps with the migrationbefore executing the operation on the new table.Reducing interactions between threads during the

migration is important for performance. There-fore, we migrate the table in blocks. Every threadacquires a block by incrementing a shared atomicvariable. Each block has a constant size of 4096slots. The migration of each block happens one su-percluster at a time. Each thread migrates all su-perclusters that begin in its block. This means thata thread does not migrate the first supercluster inits block, if it starts in the previous block. It alsomeans that the thread migrates elements from thenext block, if its last supercluster crosses the bor-der to that block. The order of elements does notchange during the migration, because they remainordered by their fingerprint. In general this meansthat most elements within one block of the originaltable are moved into one of two blocks in the tar-get table (block i is moved to 2i and 2i + 1). Byassigning clusters depending on the starting slot oftheir supercluster, we enforce that there are no twothreads accessing the same slot of the target table.Hence, no atomic operations or locks are necessaryin the target table.

As described before, we have to ensure that ongo-ing insert operations either finish correctly or helpwith the migration before inserting into the new ta-ble. Ongoing queries also have to finish to preventdeadlocks. To prevent other threads from insert-ing elements during the migration, we write lockeach empty slot and each supercluster before it ismigrated. These migration-write-locks are never re-leased. To differentiate migration-write-locks fromthe ones used in a normal insertions, we write locka slot and store a non-zero remainder (write locksare usually only stored in empty slots). This way,an ongoing insertion recognizes that the encoun-tered write lock belongs to a migration. The in-serting thread first helps with the migration be-fore restarting the insertion after the table is fullymigrated. Queries can happen concurrently withthe migration, because the migration does not needread locks.

4 Fully Expandable QFs

The goal of this fully expandable quotient filter isto offer a resizable quotient filter variant with abounded false positive rate that works well even ifthere is no known bound to the number of elementsinserted. Adding new fingerprint bits to existingentries is impossible without access to the insertedelements. We adapt a technique that was originallyintroduced for scalable Bloom filters [2]. Once aquotient filter is filled, we allocate a new additionalquotient filter. Each subsequent quotient filter in-creases the fingerprint size. Overall, this ensures abounded false positive rate.

We show that even though this is an old idea, itoffers some interesting possibilities when applied to

7

quotient filters, i.e., avoiding locks on lower levels,growing each level using the bounded growing tech-nique, higher fill degree through cascading inserts,and early rejection of queries also through cascad-ing inserts.

4.1 Architecture

This new data structure starts out with one quo-tient filter, but over time, it may contain a set ofquotient filters we call levels. At any point in time,only the newest (highest) level is active. Insertionsoperate on the active level. The data structure isinitialized with two user-defined parameters the ex-pected capacity c and the upper bound for the falsepositive rate p+. The first level table is initializedwith m0 slots where m0 = 2q0 is the first power of 2where δgrow ·m0 is larger than c, here δgrow is the fillratio where growing is triggered and ni = δgrow ·mi

is the maximum number of elements level i canhold. The number of remainder bits r0 is chosensuch that p+ > 2δgrow · 2−r0 . This first table usesk0 = q0 + r0 bits for its fingerprint.

Queries have to check each level. Within thelower levels queries do not need any locks, becausethe elements there are finalized. The query perfor-mance depends on the number of levels. To keep thenumber of levels small, we have to increase the ca-pacity of each subsequent level. To also bound thefalse positive rate, we have to reduce the false pos-itive rate of each subsequent level. To achieve bothof these goals, we increase the size of the fingerprintki by two for each subsequent level (ki = 2 + ki−1).Using the longer fingerprint we can ensure that oncethe new table holds twice as many elements as theold one (ni = ni+1/2), it still has half the false pos-itive rate (p+i = ni ·2−ki = 2p+i+1 = 2 ·ni+1 ·2−ki+1).

When one level reaches its maximum capacity niwe allocate a new level. Instead of allocating thenew level to immediately have twice the numberof slots as the old level, we allocate it with one8th of the final size, and use the bounded growingalgorithm (described in Section 2.1) to grow it to itsfinal size (three growing steps). This way, the tablehas a higher average fill rate (at least 2/3 · δgrowinstead of 1/3 · δgrow ).

Theorem 2 (Bounded p+ in expandable QF). Thefully expandable quotient filter holds the false posi-tive probability p+ set by the user independently ofthe number of inserted elements.

Proof. For the following analysis, we assume thatfingerprints can potentially have an arbitrarylength. The analysis of the overall false positiverate p+ is very similar to that of the scalable Bloomfilter. A false positive occurs if one of the ` levelshas a false positive p+ = 1 −

∏i(1 − p+i ). This

can be approximated with the Weierstrass inequal-ity p+ ≤

∑`i=1 p

+i . When we insert the shrinking

false positive rates per level (p+i+1 = p+i /2) we ob-

tain a geometric sum which is bounded by 2p+1 :∑`−1i=0 p

+i 2−i ≤ 2p+1 < p+.

Using this growing scheme the number of filtersis in O(log n/T ), therefore, the bounds for queriesare similar to those in a broad tree data structure.But due to the necessary pointers tree data struc-tures take significantly more memory. Additionally,they are difficult to implement concurrently with-out creating contention on the root node.

4.2 Cascading Inserts

The idea behind cascading inserts is to insert ele-ments on the lowest possible level. If the canonicalslot in a lower level is empty, we insert the ele-ment into that level. This can be done using a sim-ple compare and swap operation (without acquir-ing a write lock). Queries on lower levels can stillproceed without locking, because insertions cannotmove existing elements.

The main reason to grow the table before it isfull is to improve the performance by shorteningclusters. The trade-off for this is space utilization.For the space utilization it would be optimal to filleach table 100%. Using cascading inserts this canbe achieved while still having a good performanceon each level. Queries on the lower level tableshave no significant slow down due to cascading in-serts, because the average cluster length remainssmall (cascading inserts lead to one-element clus-ters). Additionally, if we use cascading inserts, wecan abort queries that encounter an empty canoni-cal slot in one of the lower level tables, because thisslot would have been filled by an insertion.

Cascading inserts do incur some overhead. Eachinsertion checks every level whether its canonicalslot is empty. However, in some applications check-ing all levels is already necessary. For example inapplications like element unification all lower levelsare checked to prevent repeated insertions of oneelement. Applications like this often use combinedquery and insert operations that only insert if theelement was not yet in the table. Here cascadinginserts do not cost any overhead.

5 Experiments

The performance evaluations described in this sec-tion consist of two parts: Some tests without grow-ing where we compare our implementation of con-current quotient filters to other AMQ data struc-tures, and some tests with dynamically growingquotient filter implementations. There we comparethree different variants of growing quotient filters:bounded growing, our fully expandable multi level

8

approach, and the multi level approach with cas-cading inserts.

All experiments were executed on a two socketIntel Xeon E5-2683 v4 machine with 16 cores persocket, each running at 2.1 GHz with 40MB cachesize, and 512 GB main memory. The tests werecompiled using gcc 7.4.0 and the operating systemis Ubuntu 18.04.02. Each test was repeated 9 timesusing 3 different sequences of keys (same distribu-tion) – 3 runs per sequence.

Non-Growing Competitors

n Linear probing quotient filter presented in Sec-tion 3.1

l Locally locked quotient filter presented in Sec-tion 3.2 (using status bits as locks)

n Externally locked quotient filter using an arraywith one lock per 4096 slots

: Counting quotient filter – implementation byPandey et al. [15, 17]

s Bloom filter. Our implementation uses thesame amount of memory as the quotient fil-ters we compared to (m · (r+ 3) bits) and only4 hash functions (⇒ better performance at afalse positive rate comparable to our quotientfilters).

Speedup. The first test presented in Figure 2shows the throughput of different data structuresunder a varying number of operating threads. Wefill the table with 24M uniformly random elementsthrough repeated insertions by all threads into apreviously empty table of 225 ≈ 33.6M slots (δ ≈72%) using r = 10 remainder bits (13 for the lin-ear probing quotient filter). Then we execute 24Mqueries with uniformly random keys (none of whichwere previously inserted) this way we measure theperformance of unsuccessful queries and the rate offalse positives (not shown since it is independentof p). At last, we execute one query for each in-serted element. In each of these three phases wemeasure the throughput dependent on the numberof processors. The base line for the speedups wasmeasured using a sequential quotient filter, execut-ing the same test.

We can see that all data structures scale closeto linearly with the number of processors. There isonly a small bend when the number of cores exceedsthe first socket and once it reaches the number ofphysical cores. But, the data structures even seemto scale relatively well when using hyperthreading.The absolute performance is quite different betweenthe data structures. In this test the linear probingquotient filter has by far the best throughput (ex-cluding the unsuccessful query performance of the

Bloom filter). Compared to the locally locking vari-ant it performs 1.5 times better on inserts and 2.4times better on successful queries (speedups of 30.6vs. 20.5 on inserts and 51.3 vs 21.0 on successfulqueries at p = 32). Both the externally lockingvariant performs far worse (speedups below 12 forinsertions). Inserting into the counting quotient fil-ter does not scale with the number of processors atall, instead, it starts out at about 2M inserts persecond and gets worse from there. Both variantsalso perform worse on queries with a speedup ofbelow 15 (p = 32). This is due to the fact thateach operation incurs multiple cache faults – onefor locking and one for the table access. In the caseof the counting quotient filter there is also the addi-tional work necessary for updating the rank-selectbitvectors. The concurrent Bloom filter has verydifferent throughputs depending on the operationtype. It has by far the best throughput on unsuc-cessful queries (speedup of 46.8 at p = 32). Thereason for this is that the described configurationwith only 4 hash functions and more memory thana usual bloom filters (same memory as quotient fil-ters) leads to a sparse filter, therefore, it most un-successful queries can be aborted after one or twocache misses.

Fill ratio. In this test we fill a table with 225 ≈33.6M slots and r = 10. Every 10% of the fill ra-tio, we execute a performance test. During this testeach table operation is executed 100k times. Theresults can be seen in Figure 3. The throughput ofall quotient filter variants decreases steadily withthe increasing fill ratio, but the relative differencesremain close to the same. Only the counting quo-tient filter and the Bloom filter have running timesthat are somewhat independent of the fill degree.On lower fill degrees the linear probing quotientfilter and the locally locked quotient filter displaytheir strengths. With about 236% and 157% higherinsertion throughputs than the external lockingquotient filter, and 358% and 210% higher success-ful query throughputs than the Bloom filter at 50%fill degree.

Growing Benchmark In the benchmark shownin Figure 4, we insert 50M elements into each of ourdynamically sized quotient filters bounded growing,fully expandable without cascading inserts (no ci),and fully expandable with cascading inserts (ci) (allbased on the quotient filter using local locking).The filter was initialized with a capacity of only219 ≈ 520k slots and a target false positive rate ofp+ = 2−10. The inserted elements are split into50 segments of 1M elements each. We measure therunning time of inserting each segment as well asthe query performance after each segment (using1M successful and unsuccessful queries).

9

12 4 8 16 24 32 48 64

threads

0

100

200

300

400

500pe

rfor

man

ce (M

OPS

)successful contains

Bloom Filter LP Q-Filter ext. locked Q-Filter loc. locked Q-Filter counting Q-Filter

12 4 8 16 24 32 48 64

threads

0

200

400

600

perf

orm

ance

(MO

PS)

unsuccessful contains

12 4 8 16 24 32 48 64

threads

0

100

200

300

400

500

perf

orm

ance

(MO

PS)

insert

0

10

20

30

40

50

spee

dup

0

20

40

60

80

spee

dup

0

20

40

60

80

spee

dup

Figure 2: Throughput over the number of threads. Strong scaling measurement inserting 24M elementsinto a table with 225 ≈ 33.6M slots (72% fill degree). The speedup is measured compared to a sequentialbasic quotient filter. Note: the x-axis does not scale linearly past 32 threads.

20 40 60 80fill degree (%)

0

100

200

300

400

500

perf

orm

ance

(MO

PS)

successful contains

20 40 60 80fill degree (%)

0

100

200

300

400

500

600

perf

orm

ance

(MO

PS)

unsuccessful contains

Bloom FilterLP Q-Filterext. locked Q-Filterloc. locked Q-Filtercounting Q-Filter

20 40 60 80fill degree (%)

0

100

200

300

400

perf

orm

ance

(MO

PS)

insert

20 40 60 80fill degree (%)

2 17

2 15

2 13

2 11

2 9

2 7

fals

e po

sitiv

e ra

te

false positive rate

Figure 3: Throughput over fill degree. Using p = 32 threads and 225 ≈ 33.6M slots. Each point of themeasurement was tested using 100k operations.

As expected, the false positive rate grows linearlywhen we only use bounded growing, but each querystill consists of only one table lookup – resulting ina query performance similar to that of the non-growing benchmark. Both fully expandable vari-ants have a bounded false positive rate. But theirquery performance suffers due to the lower levellook ups and the additional cache lines that have tobe accessed with each additional table. Cascadinginserts can improve query times by 13% for success-ful queries and 12% for unsuccessful queries (av-eraged over all segments), however, slowing downinsertions significantly.

6 Conclusion

Future Work. We already mentioned a way toimplement deletions in our concurrent table Sec-

tion 3.2. Additionally, one would probably wantto add the same counting method used by Pandeyet al. [17] this should be fairly straight forward,since the remainders in our implementation are alsostored in increasing order.

One large opportunity for improvement could beto use the local locking methods presented here asa backoff mechanism for a solution that uses hard-ware transactional memory. Transactional memorycould also improve the memory usage by remov-ing the overheads introduced by the grouped stor-age method we proposed. Such an implementationmight use unaligned compare and swap operations,but only if a previous transaction failed (this wouldhopefully be very rare).

The fully expandable quotient filter could beadapted to specific use cases, e.g., one could usethe normal insertion algorithm to fill a fully ex-pandable quotient filter. Then one could scan all

10

0 10 20 30 40 50elements (in millions)

0

100

200

300

400pe

rfor

man

ce (M

OPS

) successful contains


0

200

400

600

perf

orm

ance

(MO

PS) unsuccessful contains

bounded growingfully expandable (ci)fully expandable (no ci)


0

50

100

150

200

250

perf

orm

ance

(MO

PS) insert


2 13

2 10

2 7

2 4

2 1

fals

e po

sitiv

e ra

te

false positive rate

Figure 4: Throughput of the growing tables (all variants based on the local locking quotient filter).Using p = 32 threads 50 segments of 1M elements are inserted into a table initialized with 219 ≈ 520kslots.

tables (in parallel) to move elements from later lev-els into earlier levels, if their slots are free. Theresulting data structure could then be queried as ifit was built with cascading inserts.

Discussion. In this publication, we have showna new technique for concurrent quotient filters thatuses the status bits inherent to quotient filters forlocalized locking. Using this technique, no addi-tional cache lines are accessed (compared to a se-quential quotient filter). We were able to achievea 1.8 times increase in insert performance over theexternal locking scheme (1.4 on queries; both atp = 32). Additionally, we proposed a simple lin-ear probing based filter that does not use any sta-tus bits and is lock-free. Using the same amountof memory, this filter achieves even better falsepositive rates up to a fill degree of 70% and also1.5 times higher insertion speedups (than the locallocking variant).

We also propose to use the bounded growingtechnique available in quotient filters to refine thegrowing technique used in scalable Bloom filters.Using this combination of techniques guaranteesthat the overall data structure is always at least2/3 ·δgrow filled (where δgrow is the fill degree wheregrowing is triggered). Using cascading inserts, thiscan even be improved by filling lower level tableseven further, while also improving query times byover 12%.

Our tests show that there is no optimal AMQdata structure. Which data structure performs bestdepends on the use case and the expected workload.The linear probing quotient filter is very good, ifthe table is not densely filled. The locally locked

quotient filter is also efficient on tables below a filldegree of 70%. But, it is also more flexible for ex-ample when the table starts out empty and is filledto above 70% (i.e., constructing the filter). Ourgrowing implementations work well if the numberof inserted elements is not known prior to the ta-ble’s construction. The counting quotient filter per-forms well on query heavy workloads that operateon densely filled tables.

References

[1] Mohammad Al-hisnawi and Mahmood Ah-madi. Deep Packet Inspection using Quo-tient Filter. IEEE Communications Letters,20(11):2217–2220, Nov 2016.

[2] Paulo Sergio Almeida, Carlos Baquero, NunoPreguica, and David Hutchison. ScalableBloom Filters. Inf. Process. Lett., 101(6):255–261, March 2007.

[3] Michael A. Bender, Martin Farach-Colton,Rob Johnson, Russell Kraner, Bradley C.Kuszmaul, Dzejla Medjedovic, Pablo Montes,Pradeep Shetty, Richard P. Spillane, and ErezZadok. Don’t Thrash: How to cache your hashon flash. Proc. VLDB Endow., 5(11):1627–1637, July 2012.

[4] Alex D. Breslow and Nuwan S. Jayasena. Mor-ton filters: fast, compressed sparse cuckoo fil-ters. The VLDB Journal, Aug 2019.

[5] Pedro Celis, Per-Ake Larson, and J Ian Munro.Robin Hood Hashing. In 26th Annual Sym-

11

posium on Foundations of Computer Science(FOCS), pages 281–288. IEEE, 1985.

[6] J. G. Cleary. Compact hash tables using bidi-rectional linear probing. IEEE Transactionson Computers, C-33(9):828–834, 1984.

[7] Yan Collet. xxHash. https://github.com/

Cyan4973/xxHash. Accessed March 21, 2019.

[8] Bin Fan, Dave G. Andersen, Michael Kamin-sky, and Michael D. Mitzenmacher. CuckooFilter: Practically better than Bloom. In 10thACM International on Conference on Emerg-ing Networking Experiments and Technologies,pages 75–88, 2014.

[9] Afton Geil, Martin Farach-Colton, andJohn D. Owens. Quotient Filters: Approxi-mate membership queries on the GPU. In 2018IEEE International Parallel and DistributedProcessing Symposium (IPDPS), pages 451–462, May 2018.

[10] Donald Ervin Knuth. The art of computer pro-gramming: sorting and searching, volume 3.Pearson Education, 1997.

[11] Xiaozhou Li, David G. Andersen, MichaelKaminsky, and Michael J. Freedman. Al-gorithmic Improvements for Fast ConcurrentCuckoo Hashing. In Proceedings of the 9th Eu-ropean Conference on Computer Systems, Eu-roSys ’14, pages 27:1–27:14, New York, NY,USA, 2014. ACM.

[12] Tobias Maier, Peter Sanders, and Roman De-mentiev. Concurrent Hash Tables: Fast andgeneral(?)! ACM Trans. Parallel Comput.,5(4):16:1–16:32, February 2019.

[13] Michael Mitzenmacher, Salvatore Pontarelli,and Pedro Reviriego. Adaptive Cuckoo Filters.In 2018 Proceedings of the Twentieth Work-shop on Algorithm Engineering and Experi-ments (ALENEX), pages 36–47. SIAM, 2018.

[14] Nhan Nguyen and Philippas Tsigas. Lock-freecuckoo hashing. In 34th IEEE InternationalConference on Distributed Computing Systems(ICDCS), pages 627–636, June 2014.

[15] Prashant Pandey. Counting quotient filter.https://github.com/splatlab/cqf. Ac-cessed August 07, 2019.

[16] Prashant Pandey, Fatemeh Almodaresi,Michael A. Bender, Michael Ferdman, RobJohnson, and Rob Patro. Mantis: A fast,small, and exact large-scale sequence-searchindex. Cell Systems, 7(2):201 – 207.e4, 2018.

[17] Prashant Pandey, Michael A. Bender, RobJohnson, and Rob Patro. A General-PurposeCounting Filter: Making every bit count. InACM Conference on Management of Data(SIGMOD), pages 775–787, 2017.

12

https://github.com/Cyan4973/xxHash

https://github.com/Cyan4973/xxHash

https://github.com/splatlab/cqf

concurrent expandable amqs on the basis of quotient filtersconcurrent expandable amqs on the basis...

Documents