
Building a Bw-Tree Takes More Than Just Buzz Words

Ziqi Wang (Carnegie Mellon University), Andrew Pavlo (Carnegie Mellon University), Hyeontaek Lim (Carnegie Mellon University), Viktor Leis (TU München), Huanchen Zhang (Carnegie Mellon University), Michael Kaminsky (Intel Labs), David G. Andersen (Carnegie Mellon University)

ABSTRACT

In 2013, Microsoft Research proposed the Bw-Tree (humorously termed the "Buzz Word Tree"), a lock-free index that provides high throughput for transactional database workloads in SQL Server's Hekaton engine. The Bw-Tree avoids locks by appending delta records to tree nodes and using an indirection layer that allows it to atomically update physical pointers using compare-and-swap (CaS). Correctly implementing these techniques requires careful attention to detail. Unfortunately, the Bw-Tree papers from Microsoft are missing important details and the source code has not been released.

This paper has two contributions: First, it is the missing guide for how to build a lock-free Bw-Tree. We clarify missing points in Microsoft's original design documents and then present techniques to improve the index's performance. Although our focus here is on the Bw-Tree, many of our methods apply more broadly to designing and implementing future lock-free in-memory data structures. Our experimental evaluation shows that our optimized variant achieves 1.1–2.5× better performance than the original Microsoft proposal for highly concurrent workloads. Second, our evaluation shows that despite our improvements, the Bw-Tree still does not perform as well as other concurrent data structures that use locks.

ACM Reference Format:
Ziqi Wang, Andrew Pavlo, Hyeontaek Lim, Viktor Leis, Huanchen Zhang, Michael Kaminsky, and David G. Andersen. 2018. Building a Bw-Tree Takes More Than Just Buzz Words. In Proceedings of 2018 International Conference on Management of Data (SIGMOD'18). ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3183713.3196895

1 INTRODUCTION

Lock-free data structures are touted as being ideal for today's multi-core CPUs. They are, however, notoriously difficult to implement for several reasons [10]. First, writing efficient and robust lock-free¹ code requires the developer to figure out all possible race conditions, the interactions between which can be complex. Furthermore, the points at which concurrent threads synchronize with each other are usually not explicitly stated in the serial version of the algorithm.

¹ In this paper, we always use the term "lock" when referring to "latch".


Programmers often implement lock-free algorithms incorrectly and end up with busy-waiting loops. Another challenge is that lock-free data structures require safe memory reclamation that is delayed until all readers are finished with the data. Finally, atomic primitives can be a performance bottleneck themselves if they are used carelessly.

One example of a lock-free data structure is the Bw-Tree from Microsoft Research [29]. The high-level idea of the Bw-Tree is that it avoids locks by using an indirection layer that maps logical identifiers to physical pointers for the tree's internal components. Threads then apply concurrent updates to a tree node by appending delta records to that node's modification log. Subsequent operations on that node must replay these deltas to obtain its current state.

The indirection layer and delta records provide two benefits. First, they avoid the coherence traffic of locks by decomposing every global state change into atomic steps. Second, they incur fewer cache invalidations on a multi-core CPU because threads append delta records to make changes to the index instead of overwriting existing nodes. The original Bw-Tree paper [29] claims that this lower synchronization and cache coherence overhead provides better scalability than lock-based indexes.

To the best of our knowledge, however, there is no comprehensive evaluation of the Bw-Tree. The original paper lacks detailed descriptions of critical components and runtime operations. For example, it does not provide a scalable solution for safe memory reclamation or efficient iteration. Microsoft's Bw-Tree may support these features, but the implementation details are unknown. This paper aims to be a more thorough investigation of the Bw-Tree: to supply the missing details, propose improvements, and provide a more comprehensive evaluation of the index.

Our first contribution is a complete design for how to build an in-memory Bw-Tree. We present the missing features required for a correct implementation, including important corner cases missing from the original description of the data structure. We then present several additional enhancements and optimizations that improve the index's performance. The culmination of this effort is our open-source version called the OpenBw-Tree. Our experiments show that the OpenBw-Tree outperforms what we understand to be the original Bw-Tree design by 1.1–2.5× for insert-heavy workloads and by 1.1–1.4× for read-heavy workloads.

Our second contribution is to compare the OpenBw-Tree against four other state-of-the-art in-memory data structures: (1) SkipList [8], (2) Masstree [31], (3) a B+Tree with optimistic lock coupling [22], and (4) ART [20] with optimistic lock coupling [22].


Figure 1: Architecture Overview – An instance of a Bw-Tree with its internal logical links, Mapping Table links, and an ongoing CaS operation on the leaf Delta Chain.

Our results show that despite previous claims of lock-free indexes being superior to lock-based indexes on multi-core CPUs, the overhead of the Bw-Tree's indirection layer and delta records causes it to under-perform the lock-based indexes by 1.5–4.5×.

2 BW-TREE ESSENTIALS

Although the Bw-Tree has been covered in previous papers [25, 28–30], additional details are needed to understand our discussion in Sections 3 and 4. We assume that the Bw-Tree is deployed inside of a database management system (DBMS) with a thread pool, and has worker threads accessing the index to process queries. If non-cooperative garbage collection is used, the DBMS also launches one or more background threads periodically to perform garbage collection on the index.

The most prominent difference between the Bw-Tree and other B+Tree-based indexes is that the Bw-Tree avoids directly editing tree nodes because doing so causes cache line invalidations. Instead, it stores modifications to a node in a delta record (e.g., insert, update, delete), and maintains a chain of such records for every node in the tree. This per-node structure, called a Delta Chain, allows the Bw-Tree to perform atomic updates via CaS. The Mapping Table serves as an indirection layer that maps logical node IDs to physical pointers, making atomic updates of several references to a tree node possible.

The Mapping Table works as follows: As shown in Fig. 1, every node in the Bw-Tree has a unique 64-bit logical node ID. Instead of using pointers, nodes refer to other nodes using these IDs (logical links). When a thread needs the physical location of a node, it consults the Mapping Table to translate a node ID to its memory address. Thus, the Mapping Table allows threads to atomically change the physical location of a node without having to acquire locks: a single atomic compare-and-swap (CaS) instruction changes multiple logical links to a node throughout the index.
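To make the indirection concrete, the following is a minimal C++ sketch of such a Mapping Table; the class and member names are illustrative and not taken from the OpenBw-Tree source. A logical link is redirected by a single compare-and-swap on the slot indexed by the node ID.

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <memory>

// Sketch of a Mapping Table: an array of atomic pointers indexed by logical
// node ID. One CaS on a slot atomically redirects every logical link that
// refers to that node ID.
struct NodeBase;                 // base node or delta record (opaque here)
using NodeID = uint64_t;

class MappingTable {
 public:
  explicit MappingTable(std::size_t capacity)
      : slots_(new std::atomic<NodeBase*>[capacity]) {
    for (std::size_t i = 0; i < capacity; i++) slots_[i].store(nullptr);
  }
  // Translate a logical node ID into its current physical pointer.
  NodeBase* Get(NodeID id) const {
    return slots_[id].load(std::memory_order_acquire);
  }
  // Install `desired` only if the slot still holds `expected`. A false
  // return means another thread changed the node first; the caller aborts
  // its operation and restarts from the root.
  bool Install(NodeID id, NodeBase* expected, NodeBase* desired) {
    return slots_[id].compare_exchange_strong(expected, desired,
                                              std::memory_order_acq_rel);
  }
 private:
  std::unique_ptr<std::atomic<NodeBase*>[]> slots_;
};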

2.1 Base Nodes and Delta Chains

A logical node (called a "virtual node" in [29]) in the Bw-Tree has two components: a base node and a Delta Chain. There are two types of base nodes: an inner base node that holds a sorted (key, node ID) array, and a leaf base node that holds a sorted (key, value) array.

Figure 2: Delta Records Overview – A more detailed illustration of a logical leaf node from Fig. 1 with its base node and two delta nodes.

Attribute – Description

low-key – The smallest key stored at the logical node. In a node split, the low-key of the right sibling is set to the split key. Otherwise, it is inherited from the element's predecessor.

high-key – The smallest key of a logical node's right sibling. ∆split records use the split key for this attribute. ∆merge records use the high-key of the right branch. Otherwise, it is inherited from the element's predecessor.

right-sibling – The ID of the logical node's right sibling.

size – The number of items in the logical node. It is incremented for ∆insert records and decremented for ∆delete records.

depth – The number of records in the logical node's Delta Chain.

offset – The location of the inserted or deleted item in the base node if it were applied to the base node. Only valid for ∆insert and ∆delete records.

Table 1: Node Attributes – The list of the attributes that are stored in the logical node's elements (i.e., base node or delta records).

Initially, the Bw-Tree consists of two nodes: an empty leaf base node, and an inner base node that contains one separator item referring to the empty leaf node. Base nodes are immutable.

As shown in Fig. 2, a Delta Chain is a singly linked list that contains a chronologically-ordered history of the modifications made to the base node. The entries in the Delta Chain are connected using physical pointers, with the tail pointing to the base node. Both the base node and its delta records contain additional metadata that represents the state of the logical node at that point in time (see Table 1). That is, when a worker thread updates a logical node, it computes the latest attributes of the logical node and stores them in the delta record. Threads then use this information when navigating the tree or when performing structural modifications to avoid having to traverse (and replay) a node's Delta Chain.

As we next describe, a node's logical link in the Mapping Table points to either a Delta Chain entry or the base node. Threads append new entries to the head of a node's Delta Chain and then update the physical address in the Mapping Table.
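As a rough illustration of this append-and-publish step, the following sketch (illustrative types, not the authors' code) links a new insert delta to the current chain head and publishes it with one CaS on the node's Mapping Table slot.

#include <atomic>
#include <cstdint>

// Sketch: appending an insert delta to the head of a logical node's Delta
// Chain and publishing it with one CaS on the Mapping Table slot.
using Key = uint64_t;
using Value = uint64_t;

struct Node { /* base node or delta record header */ };
struct DeltaInsert : Node {
  Key key;
  Value value;
  Node* next = nullptr;            // physical pointer to the old chain head
};

bool AppendInsertDelta(std::atomic<Node*>& slot, Key k, Value v) {
  DeltaInsert* delta = new DeltaInsert{};
  delta->key = k;
  delta->value = v;
  Node* head = slot.load(std::memory_order_acquire);
  delta->next = head;              // new delta points at the old chain head
  if (slot.compare_exchange_strong(head, delta, std::memory_order_acq_rel)) {
    return true;                   // the delta is now visible to all threads
  }
  delete delta;                    // another thread won the race; caller retries
  return false;
}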

2.2 Mapping Table

The Bw-Tree borrows the Blink-Tree design where a node can have two inbound pointers, one from the parent and one from the left sibling [19]. Updating these pointers atomically requires either hardware support (e.g., transactional memory [30]) or complex software primitives (e.g., multi-word CaS [2, 12, 14]).

The Bw-Tree's centralized Mapping Table avoids these problems and allows a thread to update all references to a node in a single CaS instruction that is available on all modern CPUs.


If a thread's CaS fails, then it aborts its operation and restarts. This restart is transparent to the higher-level DBMS components. Threads always restart an operation by traversing again from the tree's root. Although more sophisticated restart protocols are possible (e.g., restarting from the previous level in the tree), we contend that restarting from the root simplifies the implementation. The nodes that a thread will revisit after a restart will likely be in the CPU cache anyway.

Although this paper focuses only on the in-memory behavior of the Bw-Tree, it is worth emphasizing that the Mapping Table also serves the purpose of supporting log-structured updates when the index is deployed on SSDs. Without the extra indirection provided by the Mapping Table, updates to tree nodes would otherwise propagate to all levels.

2.3 Consolidation and Garbage Collection

As we described above, worker threads update the Bw-Tree by appending new delta records to the logical nodes' Delta Chains. This means that these Delta Chains are continually growing, which in turn increases the time that it takes for threads to traverse the tree because they must replay the delta records to get the current state of a logical node. To prevent excessively long Delta Chains, worker threads will periodically consolidate a logical node's delta records into a new base node. Consolidation is triggered when a Delta Chain's length exceeds some threshold. Microsoft reported that a length of eight was a good setting [29]. Our results in Section 5.3 show that using different thresholds for inner nodes versus leaf nodes yields the best performance.

At the beginning of consolidation, the thread copies the logical node's base node contents to its private memory and then applies the Delta Chain. It then updates the node's logical link in the Mapping Table with the new base node. After consolidation, the index reclaims the old base node and Delta Chain memory after all other threads in the system are finished accessing them. The original Bw-Tree uses a centralized epoch-based garbage collection scheme to determine when it is safe to reclaim memory [25].

2.4 Structural Modification

As with a B+Tree, a Bw-Tree's logical node is subject to overflow or underflow. These cases require splitting a logical node with too many items into two separate nodes or merging together under-full nodes into a new node. We briefly describe the Bw-Tree's structural modification (SMO) protocols for handling node splits and merges without using locks. The main idea is to use special delta records to represent internal structural modifications.

The SMO operation is divided into two phases: a logical phase, which appends special deltas to notify other threads of an ongoing SMO, and a physical phase, which actually performs the SMO. In the logical phase, some thread t appends a ∆split, ∆merge, or ∆remove to the virtual node, updating its attributes such as the high key and next node ID. The updated attributes guarantee that other worker threads always observe consistent virtual node states during their traversal. In the following physical phase, t′ (not necessarily the same thread as t) will split or merge the node, and then replace the old virtual node with the new node via CaS.

Although the Bw-Tree is lock-free, threads are not guaranteed to make progress because of failed CaS operations.

One way to alleviate this starvation is for threads to cooperatively complete a multi-stage SMO, which is known as the help-along protocol [29]. Threads must help complete an unfinished SMO before the corresponding virtual node can be traversed.

3 MISSING COMPONENTS

This section introduces our design and implementation of four components that are either missing or lacking details from the Bw-Tree papers. Since we assume that the Bw-Tree will be used in a DBMS, in Section 3.1 we also describe how to support non-unique keys, followed by iterators in Section 3.2. Finally, we discuss how to enable dynamic Mapping Table expansion in Section 3.3.

3.1 Non-unique Key Support

During traversal, the Bw-Tree stops at the first leaf delta record that matches the search key, without continuing down the chain. This behavior, however, is incompatible with non-unique key support.

We handle non-unique keys in the OpenBw-Tree by having threads compute the visibility of delta records on-the-fly using two disjoint value sets for a search key, as shown in Fig. 3. The first set (Spresent) contains the values that are already known to be present. The second set (Sdeleted) contains the values that are known to be deleted. While traversing the leaf Delta Chain, if a worker thread finds an insert delta record with key K and value V where V ∉ Sdeleted, it adds V to Spresent. If the thread finds a delete delta record with key K and value V where V ∉ Spresent, it adds V to Sdeleted. Update deltas are processed as a delete delta of the old value followed by an insert delta of the updated value. Although we largely ignore update deltas in the discussion, they are posted when a worker thread modifies a key-value pair. After reaching the base node with values for K, denoted Sbase, the thread returns Spresent ∪ (Sbase − Sdeleted) as the values of K. This result contains all values whose existence has not been overridden by any earlier delta records in the Delta Chain.

Computing Spresent and Sdeleted is inexpensive, and is performed only when traversing a leaf node's Delta Chain. A delta record may insert only one value to either set. Because the max chain length is typically small and fixed (e.g., 24 [29]), the OpenBw-Tree stack-allocates Spresent and Sdeleted as two variable-length arrays.
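The replay logic can be sketched as follows; the types and helper are illustrative (the OpenBw-Tree stack-allocates the two sets rather than using std::vector), but the visibility rules follow the description above.

#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch of the visibility computation for one search key: replay the leaf
// Delta Chain, tracking values known to be present and known to be deleted.
using Value = uint64_t;

enum class DeltaType { Insert, Delete };
struct LeafDelta {
  DeltaType type;
  Value value;             // value inserted or deleted for the search key
};

template <typename T>
static bool Contains(const std::vector<T>& s, const T& v) {
  return std::find(s.begin(), s.end(), v) != s.end();
}

// `chain` is ordered newest-to-oldest and already filtered to the search key;
// `base_values` are that key's values in the base node.
std::vector<Value> CollectValues(const std::vector<LeafDelta>& chain,
                                 const std::vector<Value>& base_values) {
  std::vector<Value> present, deleted;
  for (const LeafDelta& d : chain) {
    if (d.type == DeltaType::Insert && !Contains(deleted, d.value)) {
      present.push_back(d.value);
    } else if (d.type == DeltaType::Delete && !Contains(present, d.value)) {
      deleted.push_back(d.value);
    }
  }
  for (Value v : base_values) {    // present ∪ (base − deleted)
    if (!Contains(deleted, v) && !Contains(present, v)) present.push_back(v);
  }
  return present;
}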

3.2 Iteration

To support range scans on an index, the DBMS's execution engine uses an iterator. This interface is complex to implement in the Bw-Tree since it must support concurrent operations without locks. Tracking a thread's current location in an iterator is difficult when other threads are simultaneously inserting and deleting items. Furthermore, concurrent SMOs (Section 2.4) make it more challenging when the iterator moves from one logical leaf node to its neighbor.

To overcome these problems, the OpenBw-Tree's iterator does not operate directly on tree nodes. Instead, each iterator maintains a private, read-only copy of a logical node that enables consistent and fast random access. The iterator also maintains an offset of the current item in this node, as well as information for determining the next item. As the iterator moves forward or backward, if the current private copy has been exhausted, the worker thread starts a new traversal from the root, using the current low key or high key to reach the previous or the next sibling node.


Figure 3: Non-unique Key Support – The two sets (Spresent, Sdeleted) track the visibility of ∆insert and ∆delete records in the Delta Chain.
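A minimal sketch of the private-copy iterator design from Section 3.2 is shown below; SnapshotLeafContaining is a hypothetical helper standing in for the root-to-leaf traversal that consolidates the covering logical leaf, and the types are illustrative rather than the authors' code.

#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

using Key = uint64_t;
using Value = uint64_t;

struct LeafSnapshot {
  std::vector<std::pair<Key, Value>> items;  // consolidated, sorted items
  Key high_key = 0;                          // first key of the right sibling
};

// Hypothetical helper: traverses from the root and consolidates the logical
// leaf node that covers `key` into a private copy (stubbed here).
static LeafSnapshot SnapshotLeafContaining(Key key) { (void)key; return {}; }

class ForwardIterator {
 public:
  explicit ForwardIterator(Key start) : snap_(SnapshotLeafContaining(start)) {
    while (pos_ < snap_.items.size() && snap_.items[pos_].first < start) pos_++;
  }
  // Returns the next item, refreshing the private copy from the index when
  // the current one is exhausted (a new root-to-leaf traversal).
  std::optional<std::pair<Key, Value>> Next() {
    if (pos_ == snap_.items.size()) {
      snap_ = SnapshotLeafContaining(snap_.high_key);
      pos_ = 0;
      if (snap_.items.empty()) return std::nullopt;   // end of the index
    }
    return snap_.items[pos_++];
  }
 private:
  LeafSnapshot snap_;
  std::size_t pos_ = 0;
};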

3.3 Mapping Table Expansion

Since every thread accesses the Bw-Tree's Mapping Table multiple times during traversal, it is important that it is not a bottleneck. Storing the Mapping Table as an array of physical pointers indexed by the node ID is the fastest data structure. But using a fixed-size array makes it difficult to dynamically resize the Mapping Table as the number of items in the tree grows and shrinks. This last point is the problem that we address here.

The OpenBw-Tree pre-allocates a large virtual address space for the Mapping Table without requesting backing physical pages. This allows it to leverage the OS to lazily allocate physical memory without using locks; this technique was previously used in the KISS-Tree [18]. As the index grows, a thread may attempt to access one of the Mapping Table's pages that has not yet been mapped to physical memory, incurring a page fault. The OS then allocates a new empty physical page for the virtual page. In practice, the amount of virtual address space we reserve is estimated using the total amount of physical memory and the lower bound of the virtual node size.
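On Linux, for example, such a reservation can be made with an anonymous MAP_NORESERVE mapping; the sketch below is illustrative and not the authors' code. The OS materializes zero-filled physical pages only when a slot is first touched.

#include <sys/mman.h>
#include <atomic>
#include <cstddef>
#include <cstdio>

// Reserve virtual address space for the Mapping Table without backing
// physical pages (Linux-style flags). Pages are zero-filled on first touch,
// so every slot starts out as nullptr.
struct NodeBase;

std::atomic<NodeBase*>* ReserveMappingTable(std::size_t max_nodes) {
  std::size_t bytes = max_nodes * sizeof(std::atomic<NodeBase*>);
  void* mem = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
  if (mem == MAP_FAILED) {
    std::perror("mmap");
    return nullptr;
  }
  // Strictly speaking each std::atomic slot should be constructed (e.g.,
  // with placement new); zero-filled memory is treated as nullptr here for
  // brevity, which is how mainstream implementations behave in practice.
  return static_cast<std::atomic<NodeBase*>*>(mem);
}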

Although this approach makes it easy to increase the number of entries in the Mapping Table as the index grows, it does not solve the problem of shrinking the size of the Mapping Table. To the best of our knowledge, there is no lock-free way of doing this. The only way to shrink the Mapping Table is to block all worker threads and rebuild the index.

4 COMPONENT OPTIMIZATION

A good-faith implementation of the data structure described in the original Bw-Tree paper can be improved further. We present our optimizations for the OpenBw-Tree's key components to improve its performance and scalability. As we show in Section 5, these optimizations increase the index's throughput by 1.1–2.5× in multi-threaded environments.

4.1 Delta Record Pre-allocation

As described in Section 2.1, the Delta Chain in the Bw-Tree is a linked list of delta records that is allocated on the heap. Traversing this linked list is slow because a thread can incur a cache miss for each pointer dereference. Additionally, excessive allocations of small objects create contention in the allocator, which becomes a scalability bottleneck as the number of cores increases.

Figure 4: Pre-allocated Chunk – This diagram depicts the logical view and physical view of an OpenBw-Tree node. Slots are acquired by threads using a CaS on the marker, which is part of the allocation metadata at the lower-address end of the chunk.

To avoid these problems, the OpenBw-Tree pre-allocates the delta records inside of each base node. As shown in Fig. 4, it stores the base node at the high-address end of the pre-allocated chunk and stores the delta records from high to low addresses (right-to-left in the figure). Each chain also maintains an allocation marker that points to the last delta record or the base node. When a worker thread claims a slot, it decrements this marker by the number of bytes of the new delta record using an atomic subtraction. If the pre-allocated area is full, then this triggers a node consolidation.

This reverse-growth design is optimized for efficient Delta Chain traversals. Reading delta records in new-to-old order is likely (but not guaranteed) to access memory linearly from low to high addresses, which is ideal for modern CPUs with hardware memory prefetching. But threads must traverse a node's Delta Chain by following each delta record's pointer to find the next entry, rather than just scanning from low to high addresses. This is because the logical order of delta records may not match their physical locations in memory: slot allocation and Delta Chain appending are not a single atomic step, so multiple threads can interleave them. For example, Fig. 4 shows that delta record ∆3 was logically added to the node before delta record ∆2, but ∆3 appears after ∆2 physically in memory.
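The slot-claim step can be sketched as follows (illustrative layout, not the authors' code): an atomic subtraction moves the marker toward lower addresses, and exhausting the pre-allocated area signals the caller to consolidate the node.

#include <atomic>
#include <cstddef>
#include <cstdint>

// Sketch of claiming a delta slot inside a node's pre-allocated chunk. The
// marker starts at the base node's address and moves toward lower addresses
// as slots are claimed.
struct NodeChunk {
  char* chunk_begin;                 // lowest usable address of the chunk
  std::atomic<uintptr_t> marker;     // address of the last claimed slot
                                     // (initially the base node's address)
};

// Returns a pointer to the claimed slot, or nullptr if the pre-allocated
// area is exhausted (the caller then triggers node consolidation).
void* ClaimDeltaSlot(NodeChunk& node, std::size_t delta_size) {
  uintptr_t old_top = node.marker.fetch_sub(delta_size, std::memory_order_acq_rel);
  uintptr_t new_top = old_top - delta_size;
  if (new_top < reinterpret_cast<uintptr_t>(node.chunk_begin)) {
    node.marker.fetch_add(delta_size, std::memory_order_acq_rel);  // roll back
    return nullptr;
  }
  return reinterpret_cast<void*>(new_top);
}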

4.2 Garbage Collection

The OpenBw-Tree adopts a garbage collection (GC) scheme that is similar to the one used in Silo [34] and Deuteronomy [24]. The epoch-based GC scheme of the original Bw-Tree [25] provides safe memory reclamation that prevents the index from reusing memory when there may exist a thread that is accessing it. With this approach, the index maintains a list of global epoch objects, and appends new epoch objects to the end of this list at fixed intervals (e.g., every 40 ms). Every thread must enter the epoch by enrolling itself in the current epoch object before it accesses the index's internal data structures (e.g., performing a key lookup). When the thread completes its operation, it removes itself from the epoch it has entered. Any objects that are marked for deletion by a thread are added into the garbage list of the current epoch. Once all threads exit an epoch, the index's GC component can then reclaim the objects in that epoch that are marked for deletion.

Fig. 5a illustrates the centralized GC scheme with three active epochs, three worker threads (t1, t2, t3), and a background GC thread (tgc). In this diagram, t2 adds a new node to the garbage list of epoch 103. At the same time, the GC thread tgc installs a new epoch object to the epoch list. Since the counter inside epoch 101 has reached zero, tgc will reclaim all entries in its garbage list.


Figure 5: Garbage Collection – Illustrations of (a) the centralized GC scheme using a background thread and (b) a cooperative decentralized GC scheme.

Such a centralized GC implementation has poor scalability because the worker threads enroll in an epoch by incrementing a counter that represents the number of active threads in the current epoch [33]. This becomes a bottleneck when there are many threads because of cache coherence traffic [34].

The OpenBw-Tree adopts a decentralized GC scheme where threads avoid writing to global memory [24, 34]. The index maintains a global epoch (eglobal). Each worker thread also maintains a private epoch (elocal) and a linked list of pointers to objects that the thread has marked for deletion (llocal). Using the same example as above, Fig. 5b depicts this decentralized GC scheme. t1 is the only active worker thread in the diagram, and it is adding a new garbage node, tagged with the current global epoch, into its llocal. At the beginning of a new index operation, a thread copies the current eglobal to its elocal. Whenever that thread creates garbage (e.g., by removing a logical node), it appends a pointer to that garbage object, tagged with the latest value of eglobal, to its llocal. When the thread finishes its operation, it copies the latest eglobal to elocal again and then initiates GC. The thread retrieves elocal from all the other threads and then searches its llocal for objects that are tagged with an epoch that is less than the minimum elocal. It then removes any objects that satisfy this condition from its llocal and reclaims their memory. This is safe to do because the epochs act as a watermark that guarantees that no other thread is still accessing them. To make progress, eglobal is periodically incremented by one by the DBMS in a background thread.
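The following sketch illustrates this scheme (illustrative names, not the authors' code); registering each thread's epoch in the global list and the background thread that advances eglobal are omitted, and std::free stands in for the real node deallocation.

#include <algorithm>
#include <atomic>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Decentralized epoch-based GC: a thread tags retired objects with the
// global epoch and reclaims one only after every thread's local epoch has
// advanced past the object's tag.
struct Garbage { void* object; uint64_t epoch; };

std::atomic<uint64_t> e_global{1};                    // bumped periodically by
                                                      // a background thread
std::vector<std::atomic<uint64_t>*> thread_epochs;    // registered e_local's

thread_local std::atomic<uint64_t> e_local{0};        // this thread's epoch
thread_local std::vector<Garbage> l_local;            // this thread's garbage

void BeginOperation() {                               // e_local := e_global
  e_local.store(e_global.load(std::memory_order_acquire));
}

void Retire(void* obj) {                              // e.g., a removed node
  l_local.push_back({obj, e_global.load(std::memory_order_acquire)});
}

void EndOperationAndCollect() {
  e_local.store(e_global.load(std::memory_order_acquire));
  uint64_t min_epoch = e_local.load();
  for (auto* e : thread_epochs)                       // watermark over threads
    min_epoch = std::min(min_epoch, e->load(std::memory_order_acquire));
  std::vector<Garbage> keep;
  for (const Garbage& g : l_local) {
    if (g.epoch < min_epoch) std::free(g.object);     // no thread can see it
    else keep.push_back(g);
  }
  l_local.swap(keep);
}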

Figure 6: Fast Consolidation – This diagram depicts the fast consolidation algorithm. The base node is first divided into segments using the offset attribute in the delta records. Then a two-way merge applies all valid insert deltas onto the new base node after copying live elements from the old base node.

4.3 Fast Consolidation

The node consolidation scheme described in Section 2.3 can be expensive for write-heavy workloads. For these applications, threads create many delta records and thus must frequently consolidate long Delta Chains. On consolidation, a thread has to first replay the Delta Chain to collect all (key, value) or (key, node ID) items in the logical node and then sort them. We present a faster consolidation algorithm that reduces the sorting overhead.

As depicted in Fig. 6, the consolidation algorithm has two steps. The first step (Fig. 6b) reuses the Delta Chain replay technique from the non-unique key support (see Section 3.1) to gather the effective changes to the base node. As a thread traverses a Delta Chain, it adds the delta records' keys to Spresent and Sdeleted to find which insertions and deletions are not overridden by newer delta records. At the end of this step, the old base node is divided into segments. Then, as shown in Fig. 6c, a two-way merge combines the ∆insert records and the segments from the old base node into the new base node.

To divide the old base node's item array into segments, the OpenBw-Tree uses the delta record's offset attribute (Section 2.1). This attribute is valid only for ∆insert and ∆delete records, and it stores the location of K in the current base node (not in the logical node that the delta records have modified). For an insertion, the offset stores where K would appear if this insertion were applied to the base node (ignoring any other delta records). For a deletion, the offset is the position of the existing K in the base node. If a worker thread performs a base node search to create a delta record, the thread stores the result of that key search within the base node as the offset; otherwise, it copies the previous delta record's offset to the new delta record's offset. For non-unique key indexes, however, this is more complicated. One simplification is to use the smallest offset among all items having keys equal to K as the value of the offset.


Figure 7: Node Search Shortcuts – This diagram illustrates how a thread makes use of the offset attribute. At the base level the thread searches only three elements instead of five.

The OpenBw-Tree then divides the old base node's item array into segments using the offset field in the delta records. A segment is a contiguous slice [start, end) of the base node items that does not have any insertion or deletion falling between two of its items. The thread first sorts all items in Spresent and Sdeleted by their offsets and keys; sorting both sets is fast because they have a small number of elements. The thread starts with a single segment [0, n) (i.e., the entire array with n items) and applies the following rules:

• Rule #1: For an inserted item, an existing segment [s, e) is broken into the segments [s, offset) and [offset, e).
• Rule #2: For a deleted item, an existing segment [s, e) is broken into the segments [s, offset) and [offset + 1, e).
• Rule #3: However, if a delete delta removes an item that does not exist in the base node (i.e., deleting an item that is newly inserted by an earlier delta), the thread ignores this deletion.

Node consolidation finishes with a two-way merge between the items of Spresent and the segments.
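Under the simplifying assumptions above (unique keys, deltas already reduced to the surviving inserts and deletes, sorted by offset), the merge itself can be sketched as follows; the types are illustrative and not the authors' code.

#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Two-way merge that finishes fast consolidation: copy base-node segments
// unchanged, splice in surviving inserts at their offsets, and skip items
// removed by surviving deletes.
using Key = uint64_t;
using Value = uint64_t;
using Item = std::pair<Key, Value>;

struct InsertDelta { std::size_t offset; Item item; };

std::vector<Item> Consolidate(const std::vector<Item>& base,
                              const std::vector<InsertDelta>& inserts,      // sorted by (offset, key)
                              const std::vector<std::size_t>& delete_offsets) {  // sorted ascending
  std::vector<Item> merged;
  merged.reserve(base.size() + inserts.size());
  std::size_t ins = 0, del = 0;
  for (std::size_t i = 0; i <= base.size(); i++) {
    // Emit every insert that belongs before base position i.
    while (ins < inserts.size() && inserts[ins].offset == i) {
      merged.push_back(inserts[ins++].item);
    }
    if (i == base.size()) break;
    // Skip base items that a surviving delete delta removed.
    if (del < delete_offsets.size() && delete_offsets[del] == i) {
      del++;
      continue;
    }
    merged.push_back(base[i]);      // copy the segment element unchanged
  }
  return merged;
}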

4.4 Node Search Shortcuts

As with any tree-based index, the Bw-Tree favors large nodes because they reduce the height of the tree and the number of virtual nodes. This in turn reduces the number of Mapping Table lookups and Delta Chain traversals. The downside, however, is that large base nodes reduce in-node search performance. Threads that reach the base node during a Delta Chain traversal conduct a binary search to find the target item. If the base node spans multiple cache lines, the first few probes of a binary search will likely incur cache misses.

We can optimize this last step by using the offset attribute to narrow down the range a thread must search in the base node, as shown in Fig. 7. The technique we use is known as micro-indexing [27]. When a worker thread traverses a Delta Chain, it initializes the binary search range [min, max] for search key K to [0, +inf). During the traversal, whenever the thread sees a ∆insert or ∆delete record with key K′ and an offset, it compares K with K′. If K = K′, then the range immediately converges to [offset, offset], avoiding the binary search. If offset > min and K > K′, then min is set to offset. Otherwise, if offset < max and K < K′, then max is set to offset. For non-unique keys, however, it is unclear how to interpret the offset attribute because it may point to the middle of a span consisting of items with key K. In this case, the index ignores the offset attribute.

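The range-narrowing bookkeeping can be sketched as follows (illustrative, not the authors' code); it is applied once per ∆insert or ∆delete record seen during the Delta Chain traversal, and the resulting bounds limit the final binary search in the base node.

#include <cstddef>
#include <cstdint>

// Narrow the base-node binary search range [min, max] using the offsets
// recorded in ∆insert/∆delete records for keys related to the search key.
using Key = uint64_t;

struct SearchRange {
  std::size_t min = 0;
  std::size_t max = SIZE_MAX;    // stands in for +inf until the base node is reached

  // Called for every ∆insert/∆delete record with key `delta_key` and `offset`.
  void Narrow(Key search_key, Key delta_key, std::size_t offset) {
    if (search_key == delta_key) {
      min = max = offset;                       // converged: no binary search needed
    } else if (search_key > delta_key && offset > min) {
      min = offset;                             // target lies at or after this slot
    } else if (search_key < delta_key && offset < max) {
      max = offset;                             // target lies before this slot
    }
  }
};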

5 EXPERIMENTAL EVALUATION

We evaluate the OpenBw-Tree as an in-memory OLTP index structure. We focus on understanding how the optimizations presented in Section 4 improve performance over a good-faith implementation of Microsoft's Bw-Tree. We provide a comparison with other state-of-the-art indexes in Section 6.

We created a testing framework that supports a variety of workloads and hardware configurations. All of our experiments were conducted on a system with two Intel Xeon E5-2680 v2 CPUs (10 cores with 2× HT each) and 128 GB RAM. We compiled our framework using g++ (v5.4) with tcmalloc. Unless stated otherwise, multi-threaded experiments run on a single CPU socket by pinning threads and restricting memory allocation to that same NUMA node.

5.1 Workloads

We used a set of Yahoo! Cloud Serving Benchmark (YCSB) microbenchmarks to mimic OLTP index workloads [7]. We used its default workloads A (Read/Update, 50/50), C (Read-only), and E (Scan/Insert, 95/5) with Zipfian distributions, which have the skewed access patterns common to OLTP workloads. The average length of a YCSB-E scan is 48, with a standard deviation of 30.13. We measure the initialization phase of each workload and report it as the Insert-only workload. For each workload, we tested three key types: 64-bit random integers (Rand-Int), 64-bit monotonically increasing integers (Mono-Int), and email addresses (Email). For integers, we first populate the indexes with approximately 52M keys using either the Rand-Int or Mono-Int trace, and then run the same YCSB-A/C/E workloads. For emails, we insert the keys in the order they are stored in the trace file. The emails are derived from a real-world database; we store them as fixed-length 32-byte strings. There are around 27M entries in the Email workload. All values are 64-bit integers that represent tuple pointers. We run each trial five times and report the median. If the variation is high, error bars are also drawn.

Unless otherwise stated, we configured the Bw-Tree and OpenBw-Tree index to enforce unique keys. The OpenBw-Tree uses a max inner node size of 64 entries and max leaf node size of 128 entries. The max Delta Chain length is 2 for inner nodes and 24 for leaf nodes. We set the index's GC threshold at 1024 entries in a thread's local garbage list. The OpenBw-Tree uses the cooperative GC scheme (see Section 4.2), while the Bw-Tree uses the centralized scheme with a background thread. The GC interval is set to 40 ms. We picked these numbers empirically such that the performance variances of both the Bw-Tree and the OpenBw-Tree are small.

5.2 Delta Record Pre-allocation

In this first experiment, we compare the pre-allocation optimization against the baseline approach where threads independently allocate delta records. The pre-allocated space on each node is set to the max Delta Chain length times the largest delta record size (176 bytes and 1728 bytes for inner and leaf base nodes, respectively).

Fig. 8 shows single-threaded performance numbers for the different workload configurations. Pre-allocation improves performance for all workloads by up to 22%.


Figure 8: Delta Record Pre-allocation (Single-Threaded) – Throughput with and without the pre-allocation optimization (Section 4.1): (a) Mono-Int Keys; (b) Rand-Int Keys; (c) Email Keys.

Figure 9: Fast Consolidation & Search Shortcuts (Single-Threaded) – Throughput with and without the fast consolidation (Section 4.3) and node search shortcut (Section 4.4) optimizations: (a) Mono-Int Keys; (b) Rand-Int Keys; (c) Email Keys.

Figure 10: Garbage Collection Scalability (Multi-Threaded) – All threads are pinned to NUMA node 0: (a) Mono-Int Keys – Read/Update; (b) Rand-Int Keys – Read/Update; (c) Email Keys – Read/Update.

One reason for this improvement is better locality during Delta Chain traversal. To verify this, we ran these trials again with perf to collect hardware performance counters. We observe 24% fewer L1 misses and 22% fewer L3 misses when pre-allocation is enabled. Second, pre-allocation reduces the number of (small) memory allocations, which relieves contention in the allocator and reduces memory fragmentation for insert-heavy workloads. The Mono-Int Read-only workload, on the other hand, shows little improvement. This is because the Delta Chain is short in that case, and hence there is almost no chain traversal needed on any level.

The downside of pre-allocation is wasted memory in the pre-allocated area because some slots are never used. As shown in Table 2, after the Rand-Int Insert-only workload, only 63% and 81% of the pre-allocated storage is utilized for leaf and inner nodes, respectively. In contrast, for Mono-Int Insert-only, utilization is almost 100% because of the regular insert pattern. The space under-utilization is not significantly affected by changes in the workload size; we observed only small variance (< 1%).

5.3 Fast Consolidation & Search Shortcuts

We next evaluate the fast consolidation (Section 4.3) and the node search shortcut (Section 4.4) optimizations. We compare these two optimizations against the baseline approach where threads sort the base node after replaying the delta chain. Threads also always search the entire leaf data item array using binary search. As in Section 5.2, we execute the trials with a single thread to avoid concurrency overhead.

The results in Fig. 9 show that our improved consolidation increases the performance of insert- and update-heavy workloads. For Insert-only workloads, performance increases between 19% (Email) and 28% (Rand-Int). For Read/Update and Scan/Insert workloads, the increase is 18–20%. Overall, fast consolidation is effective at reducing consolidation time, which increases the performance of insert-heavy workloads. We also see that the search shortcut optimization improves the performance of random lookups.


Name         Mono-Int   Rand-Int   Mono-HC
Avg. IDCL    0          0.46       0
Avg. LDCL    0          11.38      0.34
Avg. INS     32.04      44.18      32.00
Avg. LNS     72.00      99.84      72.34
Abort Rate   1.05%      1.44%      1078.63%
Avg. IPU     99.96%     81.29%     99.99%
Avg. LPU     100%       63.09%     97.56%

Table 2: OpenBw-Tree Statistics – Statistics from running the Insert-only workload with 20 threads (IDCL: Inner Delta Chain Length; LDCL: Leaf Delta Chain Length; INS: Inner Node Size (number of key-ID items); LNS: Leaf Node Size (number of key-value items); IPU: Inner Pre-allocation Utilization (fraction of bytes pre-allocated); LPU: Leaf Pre-allocation Utilization (fraction of bytes pre-allocated); HC: Mono-HC workload type).

Figure 11: Delta Chain Length & Node Size – OpenBw-Tree with varying node sizes (32, 64, 128) and Delta Chain length configurations (20 worker threads): (a) Mono-Int Keys – Insert-only; (b) Mono-Int Keys – Read/Update.

For example, for random integers, we observed a Read-only performance increase of 6%. For Mono-Int keys, the Read-only workload's performance does not increase because, for monotonic insertions, the Delta Chain is usually short (Table 2). Also, the delta records always contain the last few keys in the leaf node. These optimizations incur a small memory overhead (< 5%) because adding an extra offset attribute increases node and delta record sizes slightly.

5.4 Garbage Collection Scalability

We next compare the original GC scheme from the Bw-Tree paper with the decentralized scheme described in Section 4.2. In this experiment, we use the Read/Update workload for all key types. We vary the number of threads in each trial and measure the throughput. All threads are pinned to NUMA node 0.

Fig. 10 shows that under high levels of concurrency, our modified GC scheme is superior due to the elimination of the centralized epoch counter. The effect is more pronounced for Mono-Int keys (1.3× improvement with 20 threads) than for Rand-Int, because Mono-Int workloads run faster, and therefore cache line invalidation happens more frequently as worker threads increment and decrement the centralized epoch counter on entry and on exit of the index's routine.

5.5 Delta Chain Length & Node Size

We evaluate how the OpenBw-Tree's Delta Chain length and node size settings affect its performance. The former determines the number of records that can exist in the Delta Chain before it is consolidated, while the latter determines the number of key-value pairs that the base node's item array can store. Since these two parameters are highly dependent, we vary them together while executing the Insert-only and Read/Update workloads using Mono-Int keys. All experiments are run using 20 worker threads.

We see in Fig. 11b that the index's performance increases slightly for the Read/Update workload as the node size and Delta Chain length threshold increase. If the Delta Chain length threshold is too large, however, performance drops; for example, with a node size of 32 and a Delta Chain length threshold of 40, throughput decreases.

For the Insert-only workload, larger tree nodes result in faster inserts. The reason the OpenBw-Tree favors larger nodes is that each Mapping Table lookup may incur a cache miss, implying that a shallower tree can have fewer cache misses than a deeper tree with smaller nodes. Furthermore, since the OpenBw-Tree does not move node elements on every insert, larger nodes do not necessarily imply frequent cache line invalidations during Insert-only workloads. Table 3 shows that the OpenBw-Tree demonstrates better cache locality than a B+Tree for insertion.

From Fig. 11, we can conclude that the optimal Delta Chain length threshold for the OpenBw-Tree is between 32 and 40, which is higher than Microsoft's recommended value of 8 [29].

We repeated the same experiment on the Bw-Tree. The results suggest that the optimal Delta Chain length threshold is between 16 and 24 for both the Insert-only and Read/Update workloads. We believe the difference in optimal Delta Chain lengths is related to the pre-allocation optimization, because it reduces the overhead of Delta Chain traversal. Furthermore, the type of keys also has an effect on the optimal threshold. For Email Insert-only workloads on the OpenBw-Tree, the optimal Delta Chain length threshold is between 16 and 24, while for Read/Update workloads it is between 32 and 40.

5.6 Discussion

To better understand the benefits of each of these optimizations, we summarize the results from the previous experiments in Fig. 12. The results in Fig. 12a are the improvements from applying the optimizations to the OpenBw-Tree one at a time. We use Rand-Int keys for the Read/Update workload executing with both a single worker thread and 20 worker threads. When there is only a single thread, pre-allocation is the most beneficial optimization, as it increases throughput by 28%. Fast consolidation and the search shortcut are the second most effective optimizations, increasing throughput by 18%. Enabling non-unique key support shows almost no impact if there are no duplicate keys. Similar conclusions hold for the multi-threaded configuration.

Fig. 12b presents a comparison between the baseline Bw-Tree and the OpenBw-Tree for all workload types using Mono-Int keys. The numbers are measured using 20 worker threads. As shown in the diagram, the optimizations we apply to the Bw-Tree are equally effective for all workload types, speeding up operations by 27–35%.

6 IN-MEMORY INDEX COMPARISON

The results from the previous section show that the Bw-Tree performs better with our optimizations from Section 4. The next step is to understand how it compares with other data structures. To the best of our knowledge, there are no previous results comparing the Bw-Tree with state-of-the-art in-memory indexes (the original paper [25] compared only against SkipList and BerkeleyDB). We therefore ported three additional indexes to our testing framework:

SkipList: The SkipList [32] is an alternative to the B+Tree for in-memory databases. In our experiments, we adopted the lock-free No Hot Spot Non-blocking SkipList (we simply use "SkipList" below), which is optimized for multi-core CPUs [8].


Figure 12: Optimization Summary – The left diagram shows how performance increases as we apply each optimization to the Bw-Tree. The right diagram shows a comparison between the baseline Bw-Tree and the OpenBw-Tree for all four workloads: (a) Rand-Int Keys – Read/Update; (b) Mono-Int Keys (20 Worker Threads). (GC: Garbage Collection; PA: Pre-allocation; FC: Fast Consolidation; SS: Search Shortcut; NK: Non-unique Key Support)

             Bw-Tree   OpenBw-Tree   SkipList   Masstree   B+Tree   ART
L1 Miss      5.0       3.2           14         2.2        4.3      0.78
L3 Miss      2.0       1.4           6.3        0.68       1.7      0.24
IPC          0.17      0.26          0.04       0.36       0.11     0.45
Branch       16        16            11         11         7.0      6.1
Br MisPred   1.0       1.2           0.74       0.75       0.95     0.14
Inst         77        73            50         51         43       30
Cycle        443       278           1150       179        380      72.9

Table 3: Microbenchmarks for the Rand-Int Insert-only Workload – 20 worker threads. All worker threads are pinned to NUMA node 0. (L1/L3 Miss: L1/L3 data cache misses, ×10^9; IPC: Instructions per cycle; Branch: Number of branch instructions, ×10^9; Br MisPred: Number of branch mispredictions, ×10^9; Inst: Number of instructions, ×10^9; Cycle: Number of clock cycles, ×10^9.) All numbers are measured system-wide for all cores and all NUMA nodes.

This design uses a background thread to establish new pointers on each level to maintain its expected performance. Worker threads do not need frequent synchronization because the background thread exclusively writes to the upper levels of the linked-node structure.

Masstree: Masstree [31] is a hybrid trie/B+Tree data structure that is used as the index structure of Silo [34] and its derivatives [16]. Its trie-based design allows Masstree to achieve high performance and low space consumption for keys with shared prefixes. It uses a lock-based synchronization protocol with epoch-based GC.

B+Tree: Although originally designed for disk-oriented DBMSs [6], B+Trees are widely used in main-memory database systems [35]. Instead of using traditional latching [3], our B+Tree implementation uses the optimistic lock coupling (OLC) [22] method. In OLC, each node has a lock, but instead of acquiring locks eagerly, read operations validate version counters (and restart if the counter changes). Read validations across multiple nodes can be interleaved, which allows implementing the traditional lock coupling technique for synchronizing tree-based indexes. Our B+Tree has a similar node organization as the OpenBw-Tree (sorted keys). We configure the B+Tree to use a 4KB node size.
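The version-validation idea behind OLC can be sketched as follows; this is a simplified illustration of the protocol from [22], not the exact code used by our B+Tree or ART implementations.

#include <atomic>
#include <cstdint>

// Per-node optimistic lock: even version = unlocked, odd = write-locked.
// Readers sample the version, do their work, and re-check it; any change
// (or an observed lock) forces the reader to restart instead of blocking.
struct OlcLock {
  std::atomic<uint64_t> version{0};

  // Readers: sample the version; an odd value means a writer is active and
  // the reader should restart.
  uint64_t ReadLockOrRestart(bool& needs_restart) const {
    uint64_t v = version.load(std::memory_order_acquire);
    needs_restart = (v & 1) != 0;
    return v;
  }
  // Readers: after reading the node, verify nothing changed; a mismatch
  // means a concurrent writer modified the node, so the reader restarts.
  bool ReadUnlockOrRestart(uint64_t observed) const {
    return version.load(std::memory_order_acquire) == observed;
  }
  // Writers: acquire by CaS-ing from an even version to the next odd one.
  void WriteLock() {
    uint64_t v;
    do {
      v = version.load(std::memory_order_acquire);
      v &= ~uint64_t{1};                       // expect an unlocked version
    } while (!version.compare_exchange_weak(v, v + 1,
                                             std::memory_order_acquire));
  }
  // Writers: release by bumping to the next even version, invalidating
  // every reader that sampled the old version.
  void WriteUnlock() { version.fetch_add(1, std::memory_order_release); }
};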

ART: The Adaptive Radix Tree (ART) [20] is designed for in-memory DBMSs and is the default index structure of HyPer [15]. Each node represents one byte of the key, which results in a max node fanout of 256. In contrast to most tries, which implement nodes as a fixed-size (256 for ART) array of child pointers, ART uses four different node layouts depending on the number of non-null child pointers. This saves space and improves cache efficiency. Our ART implementation also uses OLC [22] for synchronization.

Masstree and ART are tries and therefore require binary keys, unlike non-trie indexes that can handle any partially ordered keys. To support key types that are not necessarily ordered in their binary representation (e.g., integers on little-endian architectures), keys must be preprocessed to have a totally ordered binary form [20].

All of our experiments in this comparison use the same hardware as in the previous section. We begin with an evaluation that uses the same workload and key configurations from Section 5. We then present a worst-case scenario workload with high contention.

6.1 YCSB Workload

We compare the indexes using the YCSB-based workloads from Section 5.1. We first run all four workloads with the three key configurations using a single worker thread. We then execute the trials again using 20 threads that are all pinned to a single CPU socket. The peak amount of memory consumed by each index during the Read/Update workload is also measured. Finally, we measured the performance counters for the 20-thread Read/Update workload using perf and Intel's Performance Counter Monitor.

The results for the single-threaded and multi-threaded experiments are shown in Fig. 13 and Fig. 14, respectively. Memory numbers are in Fig. 15. Performance counter readings are in Table 3.

Although our optimizations made the OpenBw-Tree faster than the default Bw-Tree, it is still slower than its competitors except the SkipList. For example, the ART is more than 4× faster than the OpenBw-Tree for point lookups (though the ART is slower on the Scan/Insert workload). The OpenBw-Tree is also slower than the Masstree and the B+Tree, often by a factor of ∼2×. The microbenchmark numbers show that the OpenBw-Tree in general has a higher instruction count and more cache misses per operation (and hence lower IPC). The higher instruction count is a consequence of its complicated delta chain traversal routines; the higher cache miss rate is caused by features such as the Mapping Table.

Figure 13: In-Memory Index Comparison (Single-Threaded) – The worker thread is pinned to NUMA node 0. (Throughput in M Ops/Sec for all six indexes; panels: (a) Mono-Int Keys, (b) Rand-Int Keys, (c) Email Keys.)

Figure 14: In-Memory Index Comparison (Multi-Threaded) – 20 worker threads. All worker threads are pinned to NUMA node 0. (Throughput in M Ops/Sec; panels: (a) Mono-Int Keys, (b) Rand-Int Keys, (c) Email Keys.)

Figure 15: Memory Usage – Single-threaded and multi-threaded. (Memory in GB for Mono-Int, Rand-Int, and Email keys; panels: (a) Single-Threaded – Read/Update, (b) Multi-Threaded – Read/Update.)

The SkipList shows high variation and low performance in most multi-threaded experiments. This is because its threads do not create towers as they insert elements. Instead, the SkipList uses a background thread that periodically scans the entire list and adjusts the height of the towers. As a consequence, the background thread may not process recent inserts fast enough, and worker threads iterate through the SkipList's lowest level to locate a key, causing high cache misses and cycle counts.

The Masstree has high single-threaded Mono-Int Insert-only throughput, but scales only by 3× with 20 threads. This is because Masstree avoids splitting an overflowed leaf node when items are inserted sequentially: it creates a new empty leaf node instead of copying half of the items from the previous leaf. This optimization, however, is less effective in the multi-threaded experiments, where the threads' insert operations are interleaved. In general, the Masstree is comparable to the B+Tree for integer workloads (except Insert-only). For Email keys, its performance is even comparable to the trie-based ART index, since its high-level structure is also a trie.

For integer keys, the B+Tree's Read-only and Read/Update performance is comparable to the Masstree and much faster than the OpenBw-Tree. For the Mono-Int Insert-only workload, the B+Tree without any optimizations even outperforms the Masstree and ART, and is 3.7× faster than the OpenBw-Tree. The B+Tree also achieves high throughput for the Scan/Insert workloads, where it is usually 3–5× faster than all other indexes. It has relatively poor performance for the Email workloads, however. The microbenchmarks indicate high cache misses and low IPC during Rand-Int and Email (not shown) insertions, which explains why the B+Tree is slower in these workloads.

The ART outperforms the other indexes for all workloads and key types except Scan/Insert, where its iteration requires more memory accesses than the OpenBw-Tree.

As shown in Fig. 15a, both the Bw-Tree and the OpenBw-Tree use a moderate amount of memory. The OpenBw-Tree consumes more memory than the Bw-Tree (10–31%) in all experiments due to pre-allocation and metadata. For the Mono-Int workload, the pre-allocated space utilization is lower than for the Rand-Int workload (Table 2); correspondingly, the OpenBw-Tree uses relatively more memory in the Rand-Int workload. For the multi-threaded experiments, peak memory usage also increases slightly (8–17%) because worker threads keep garbage nodes in their thread-local chains.

Figure 16: High Contention Workload (Multi-Threaded) and DRAM Access Rate – Throughput using Mono-HC keys with the Insert-only workload on 20 threads. DRAM accesses are measured as operations per second for a high-contention workload under different NUMA configurations using 20 threads. (Panels: (a) High Contention Throughput, (b) Local DRAM Access Rate, (c) Remote DRAM Access Rate.)

Figure 17: Comparison Between High Contention Workload and Normal Workload – 20 worker threads, with the normal Insert-only and high-contention Insert-only workloads. All threads are pinned to NUMA node 0.

Among all compared indexes, the ART has the lowest usage for Mono-Int and Email keys, while the B+Tree has the lowest for the Rand-Int keys due to its compact internal structure and large node size (4 KB). The SkipList consumes more memory than the B+Tree/ART due to its customized memory allocator and pre-allocation; its memory usage is comparable to the OpenBw-Tree's. The Masstree always has the highest memory usage, especially for the Email workload (2.0–5.7× higher). For integer workloads, although the Masstree still uses the most memory, the gap is smaller than for the Email workload (only 1.3–2.5× higher, except for the ART).

The high throughput and low memory usage of the ART index in both single-threaded and multi-threaded environments are attributable to its flexible way of structuring trie nodes of different sizes. Furthermore, only a single byte is compared at each level. Table 3 shows that both properties minimize CPU cycles and reduce cache misses, resulting in high IPC.

6.2 High Contention Workload
The salient aspect of the Bw-Tree's design is that it is lock-free, whereas most of the other data structures that we tested use locks (although sparingly). Lock-free data structures are often favored in high-contention environments because threads can always make global progress [30], even though that progress may be small in practice. To better understand this issue, we created a specialized workload with extreme contention. Each thread in the benchmark uses the RDTSC instruction with a unique thread ID suffix to generate monotonically increasing integers in real time as keys, mimicking multiple threads appending new records to the end of a table. To further demonstrate how the NUMA configuration affects performance, we run the evaluation under three NUMA settings: 20 worker threads on a single NUMA node, 20 worker threads on two NUMA nodes, and 40 worker threads on two NUMA nodes. The last setting uses all available hardware threads on our test system.
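A minimal sketch of the key-generation scheme described above (our reconstruction from the description, not the benchmark's actual code) is shown below; the 8-bit thread-ID width is an assumption.

#include <cstdint>
#include <x86intrin.h>  // __rdtsc() on x86-64 with GCC/Clang

// The upper bits come from the CPU timestamp counter and the lower bits from
// the worker's thread ID, so every thread keeps inserting near the right edge
// of the key space without producing duplicate keys.
uint64_t NextHighContentionKey(uint64_t thread_id) {
    constexpr unsigned kIdBits = 8;
    uint64_t tsc = __rdtsc();  // monotonically increasing timestamp counter
    return (tsc << kIdBits) | (thread_id & ((1ull << kIdBits) - 1));
}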

The results shown in Fig. 16a indicate that all five indexes degrade under high contention. Both Insert-only and Read/Update performance drops in both the one-node and two-node NUMA settings. The local and remote NUMA access rates, i.e., the number of DRAM accesses per second, are shown in Fig. 16b and Fig. 16c, respectively.

Under high contention, the Masstree has the best result, followed by the ART and then the B+Tree. The OpenBw-Tree suffers from an extremely high abort rate as threads contend for the head of the Delta Chain. Table 2 shows that the abort rate is over 1000%, i.e., on average there are more than 10 aborts for every insert.

Overall, under high contention, none of these six data structures perform well. As shown in Fig. 17, compared with their multi-threaded performance without high contention, all of them suffer from performance degradation. In particular, the lock-free indexes struggled more than any of the lock-based indexes; for example, the SkipList failed to make progress at all in this high-contention workload.²

6.3 Bw-Tree Performance Decomposition
The Bw-Tree's lock-freedom comes at a cost: worker threads must traverse a Delta Chain on each level; the Mapping Table must be consulted to obtain the physical pointer of a virtual node; and instead of updating tree nodes in place, a delta record must be allocated and appended to the node being modified. To better understand these costs, we disabled the Bw-Tree's features one by one in a single-threaded environment. We ran our benchmark on each version of the modified Bw-Tree and compared the results with the standard OpenBw-Tree and the B+Tree.

Disabling the Delta Chain: After populating the OpenBw-Tree using the Rand-Int Insert-only workload, we invoke a special routine that recursively visits every virtual node and consolidates its Delta Chain. After it returns, we run the Read-only workload under the same configuration as in Section 6. As Fig. 18 shows, eliminating delta chains increases Read-only performance by 23%. If we backport this modification to the original Bw-Tree, the improvement is even greater (45%), justifying our pre-allocation optimization in Section 4.1.

² We reported a high-contention insert problem to the authors of the SkipList implementation, but we unfortunately have received no response.

Figure 18: Performance Change as Features Are Decomposed – The leftmost number is from Fig. 13b. As we disable the OpenBw-Tree's features, performance steps up. (DC: Delta Chain; CAS: compare-and-swap; MT: Mapping Table; DU: Delta Update) N/A indicates a crucial feature for inserts or an irrelevant feature for reads. (Bars show Insert-only and Read-only throughput in M Ops/Sec.)


Disabling CaS: We instrumented the OpenBw-Tree source code with a "fake" atomic type that uses non-atomic primitives.
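The idea behind this instrumentation can be sketched as follows; the class below is our illustration with assumed names, not the actual patch.

// Sketch of a "fake" atomic wrapper: it mimics the std::atomic interface but
// performs plain, non-atomic operations, so it is only meaningful in a
// single-threaded experiment.
template <typename T>
class FakeAtomic {
 public:
    explicit FakeAtomic(T v = T{}) : value_(v) {}
    T load() const { return value_; }
    void store(T v) { value_ = v; }
    bool compare_exchange_strong(T& expected, T desired) {
        if (value_ == expected) {
            value_ = desired;   // plain store: no lock prefix, no memory barrier
            return true;
        }
        expected = value_;
        return false;
    }
 private:
    T value_;
};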

The third column of Fig. 18 shows that with CaS disabled, neither Insert-only nor Read-only operations become significantly faster. This seems to contradict the common belief that atomic operations like CaS take more cycles on some ISAs. Our experiment, however, pins the worker thread to a single core, so the CPU can perform the CaS locally with almost no cache coherence overhead [9]. While CaS does force memory barriers, the measured cost is dominated by the main query-processing work.

Disabling the Mapping Table: We eliminated the translation of node IDs to physical pointers by invoking a post-insert pass that replaces the node IDs inside inner nodes with the corresponding physical pointers. Because inserts require the Mapping Table to update multiple inbound pointers, we examine the Read-only workload only. The fourth column of Fig. 18 presents the effect of disabling the Mapping Table: read performance increases by 18% due to fewer cache misses for load instructions (L1: 32% lower; L3: 52% lower).

Disabling Delta Updates: Both insert and delete operations append delta records to modify a virtual node. This is unnecessary, however, when only a single thread modifies the tree. To illustrate the trade-off between delta updates and updating nodes in place, we rewrote the leaf node update routine so that the worker thread shifts elements within the leaf node to perform an insert instead of appending a delta. For simplicity, only leaf node modifications are done in place. We measured Insert-only numbers because the Read-only workload does not perform updates. The last column in Fig. 18 shows the trade-off between the two strategies: replacing delta updates with in-place updates makes Insert-only operations 40% faster.

Overall, after disabling these lock-free features, the Bw-Tree is still 15%–19% slower than the B+Tree with OLC synchronization. We conjecture that the simplicity of OLC bounds the number of instructions for every operation, while for the Bw-Tree even Read-only operations perform considerable bookkeeping to maintain the consistency of the tree, limiting its performance.

7 RELATED WORK
There have been many experimental studies of index performance. For instance, Gramoli examines the performance of concurrent data structure implementations (lists, queues, hash tables, etc.) [11], and Alvarez et al. compare radix trees with hash tables [1]. We note that prior work has limitations, either omitting concurrency or lacking experiments with important index designs such as the Bw-Tree [25] that are used in modern high-speed in-memory databases. In addition, previous studies often exclude realistic workloads such as string, Zipf-distributed, or monotonically increasing keys.

Traditional concurrent B-trees use lock coupling (also known as "hand-over-hand locking" or "crabbing") to provide fine-grained concurrent access by allowing a thread to hold no more than a certain number of locks [3]. This approach, however, scales poorly on modern multi-core CPUs because of the high overhead of frequent lock acquisition, which occurs even under read-only workloads [5].

Optimistic locking is an approach to improve the scalability of B-trees on modern CPUs. OLFIT maintains a per-node version counter that is incremented on every node modification [5]. Readers traverse the tree by validating these version counters (and restarting if necessary) instead of acquiring locks. Masstree is a hybrid trie/B-tree structure that uses separate version counters for node inserts and splits to reduce the chance of restarts [31]. Combining optimistic locks with lock coupling provides a general design for concurrent tree data structures [4, 22]. Optimistic variants of software transactional memory, on the other hand, are a more general design option for concurrent tree indexes [13, 26]; they achieve similar goals at the cost of a more complicated design.

New hardware technologies create new opportunities and challenges for in-memory index designs. Master-Tree is a B-tree index that can use large non-volatile memory [17]. It is built upon Masstree, but facilitates paging by maintaining only one inbound pointer per node. Hardware transactional memory (HTM) affords easy construction of concurrent data structures [21], but does not always outperform sophisticated non-HTM designs [22].

8 CONCLUSIONS
In this work, we introduced the OpenBw-Tree, our clean-slate Bw-Tree implementation. The OpenBw-Tree incorporates a number of optimizations that were not described in the original Bw-Tree papers. Experimental results show that the OpenBw-Tree outperforms the original Bw-Tree design. Nevertheless, even our optimized OpenBw-Tree is still considerably slower than other state-of-the-art in-memory index structures like the SkipList, Masstree, and ART. The OpenBw-Tree is also slower than a B+Tree that uses optimistic lock coupling, which indicates that lock-freedom does not always pay off in comparison with modern lock-based synchronization techniques.

The Bw-Tree has been proposed as an in-memory index and is used as such in Hekaton. Nevertheless, it is important to note that the Bw-Tree design also supports eviction to SSD [23]. In the future, we will investigate larger-than-main-memory indexes and compare the log-structured nature of the Bw-Tree with the in-place approach of traditional B-trees.


REFERENCES
[1] Victor Alvarez, Stefan Richter, Xiao Chen, and Jens Dittrich. 2015. A comparison of adaptive radix trees and hash tables. In ICDE. 1227–1238.
[2] James H. Anderson and Mark Moir. 1995. Universal constructions for multi-object operations. In PODC. 184–193.
[3] Rudolf Bayer and Mario Schkolnick. 1977. Concurrency of Operations on B-Trees. Acta Informatica 9 (1977), 1–21.
[4] Nathan G. Bronson, Jared Casper, Hassan Chafi, and Kunle Olukotun. 2010. A Practical Concurrent Binary Search Tree. In PPoPP. 257–268.
[5] Sang Kyun Cha, Sangyong Hwang, Kihong Kim, and Keunjoo Kwon. 2001. Cache-conscious concurrency control of main-memory indexes on shared-memory multiprocessor systems. In PVLDB. 181–190.
[6] Douglas Comer. 1979. Ubiquitous B-tree. ACM Computing Surveys (CSUR) 11, 2 (1979), 121–137.
[7] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking cloud serving systems with YCSB. In SoCC.
[8] Tyler Crain, Vincent Gramoli, and Michel Raynal. 2013. No hot spot non-blocking skip list. In Distributed Computing Systems (ICDCS), 2013 IEEE 33rd International Conference on. IEEE, 196–205.
[9] Tudor David, Rachid Guerraoui, and Vasileios Trigonakis. 2013. Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask. In SOSP. 33–48.
[10] Keir Fraser. 2004. Practical lock-freedom. Ph.D. Dissertation. University of Cambridge.
[11] Vincent Gramoli. 2015. More Than You Ever Wanted to Know about Synchronization: Synchrobench, Measuring the Impact of the Synchronization on Concurrent Algorithms. In PPoPP.
[12] Timothy L. Harris, Keir Fraser, and Ian A. Pratt. 2002. A practical multi-word compare-and-swap operation. In International Symposium on Distributed Computing. 265–279.
[13] Nathaniel Herman, Jeevana Priya Inala, Yihe Huang, Lillian Tsai, Eddie Kohler, Barbara Liskov, and Liuba Shrira. 2016. Type-aware transactions for faster concurrent code. In Proceedings of the Eleventh European Conference on Computer Systems. 31.
[14] Amos Israeli and Lihu Rappoport. 1994. Disjoint-access-parallel implementations of strong shared memory primitives. In PODC. 151–160.
[15] A. Kemper and T. Neumann. 2011. HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. In ICDE.
[16] Kangnyeon Kim, Tianzheng Wang, Ryan Johnson, and Ippokratis Pandis. 2016. ERMIA: Fast Memory-Optimized Database System for Heterogeneous Workloads. In SIGMOD. 1675–1687.
[17] Hideaki Kimura. 2015. FOEDUS: OLTP Engine for a Thousand Cores and NVRAM. In SIGMOD.
[18] Thomas Kissinger, Benjamin Schlegel, Dirk Habich, and Wolfgang Lehner. 2012. KISS-Tree: Smart latch-free in-memory indexing on modern architectures. In DaMoN.
[19] Philip L. Lehman and S. Bing Yao. 1981. Efficient Locking for Concurrent Operations on B-trees. ACM Trans. Database Syst. 6, 4 (Dec. 1981), 650–670.
[20] Viktor Leis, Alfons Kemper, and Thomas Neumann. 2013. The adaptive radix tree: ARTful indexing for main-memory databases. In ICDE. 38–49.
[21] Viktor Leis, Alfons Kemper, and Thomas Neumann. 2016. Scaling HTM-supported database transactions to many cores. IEEE Transactions on Knowledge and Data Engineering 28, 2 (2016), 297–310.
[22] Viktor Leis, Florian Scheibner, Alfons Kemper, and Thomas Neumann. 2016. The ART of practical synchronization. In DaMoN.
[23] Justin J. Levandoski, David B. Lomet, and Sudipta Sengupta. 2013. LLAMA: A Cache/Storage Subsystem for Modern Hardware. PVLDB 6, 10 (2013), 877–888.
[24] Justin J. Levandoski, David B. Lomet, Sudipta Sengupta, Ryan Stutsman, and Rui Wang. 2015. High Performance Transactions in Deuteronomy. In CIDR.
[25] Justin J. Levandoski and Sudipta Sengupta. 2013. The BW-Tree: A Latch-Free B-Tree for Log-Structured Flash Storage. IEEE Data Eng. Bull. 36, 2 (2013), 56–62.
[26] Hyeontaek Lim, Michael Kaminsky, and David G. Andersen. 2017. Cicada: Dependably Fast Multi-Core In-Memory Transactions. In SIGMOD.
[27] David Lomet. 2001. The evolution of effective B-tree: page organization and techniques: a personal account. SIGMOD Record 30, 3 (2001), 64–69.
[28] David Lomet, Justin Levandoski, Sudipta Sengupta, Adrian Birka, and Cristian Diaconu. 2014. Indexing on Modern Hardware: Hekaton and Beyond. In SIGMOD.
[29] David B. Lomet, Sudipta Sengupta, and Justin J. Levandoski. 2013. The Bw-Tree: A B-tree for New Hardware Platforms. In ICDE. 302–313.
[30] Darko Makreshanski, Justin Levandoski, and Ryan Stutsman. 2015. To lock, swap, or elide: On the interplay of hardware transactional memory and lock-free indexing. PVLDB 8, 11 (2015), 1298–1309.
[31] Yandong Mao, Eddie Kohler, and Robert Tappan Morris. 2012. Cache craftiness for fast multicore key-value storage. In EuroSys. 183–196.
[32] William Pugh. 1990. Skip lists: a probabilistic alternative to balanced trees. Commun. ACM 33, 6 (1990), 668–676.
[33] Stephen Tu. 2013. Techniques for Implementing Concurrent Data Structures on Modern Multicore Machines. https://people.eecs.berkeley.edu/~stephentu/presentations/workshop.pdf. (2013).
[34] Stephen Tu, Wenting Zheng, Eddie Kohler, Barbara Liskov, and Samuel Madden. 2013. Speedy transactions in multicore in-memory databases. In SOSP. 18–32.
[35] Huanchen Zhang, David G. Andersen, Andrew Pavlo, Michael Kaminsky, Lin Ma, and Rui Shen. 2016. Reducing the Storage Overhead of Main-Memory OLTP Databases with Hybrid Indexes. In SIGMOD. 1567–1581.


A NODE SPLIT AND MERGE
This section details the process of node splitting and merging. As observed in Section 2.4, the Bw-Tree divides an SMO into two phases: the logical phase, in which a ∆split, ∆merge, or ∆remove is appended to notify all worker threads of an ongoing SMO, and the physical phase, which actually performs the node split or merge via a successful CaS. In the following sections, we use stage to refer to an atomic step that changes the state of the Bw-Tree.

A.1 Node Split
Each logical node keeps track of the number of items it contains. When a worker thread discovers that a logical node's size is above a threshold, it initiates a node split. As shown in Fig. 19, the split occurs in three stages:

Stage I: To begin the split of node N0, the thread creates a base node N1 that is populated with the upper half of N0's items. The thread also copies some of N0's attributes to N1, such as its high-key and right-sibling, while the low-key is set to the split key (K1). N1's depth and size attributes are both set to zero. As shown in Fig. 19a, the thread then adds N1 to the Mapping Table.

Stage II: The thread adds a ∆split record to the Delta Chain of N0. As shown in Fig. 19b, this record changes N0's high-key to K1 and its right-sibling ID to N1. At this point, the split is incomplete because the split information has not yet been propagated to the parent node. This "half-split" state [29] is adopted from the Blink-Tree. The ∆split serves as a flag that informs other worker threads of an unfinished split. We discuss the other delta records for node merges and removals in Appendix A.2. Threads check their search key against the high-key attribute of every logical node on their traversal path, even in the absence of a ∆split record. If a thread discovers a logical node whose high-key is less than or equal to the search key, it moves to the node's right sibling.

Stage III: Finally, in Fig. 19c the thread appends a ∆separator record to the parent logical node. This record not only stores (K1, N1), but also the separator item (K2, N2) that is the parent's next item in sorted key order. This way, when a thread sees the ∆separator, it can immediately compare the search key with K1 and K2. Threads can thus avoid consulting the base node when the search key lies inside the interval [K1, K2).

If multiple threads initiate a node split concurrently for the same logical node, only one of them will succeed in installing the ∆split record.
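The following sketch (hypothetical types and names, not the OpenBw-Tree source) shows the shape of the Stage II install: the ∆split becomes the new Delta Chain head only if a single CaS on N0's Mapping Table slot succeeds, which is what guarantees that at most one racing thread wins.

#include <atomic>
#include <cstdint>

// Hypothetical node and delta types, only to illustrate the CaS-based install
// of a ∆split; real delta records carry more metadata.
struct NodeBase { /* keys, high-key, right-sibling, depth, size, ... */ };
struct SplitDelta : NodeBase {
    NodeBase* child = nullptr;   // previous head of N0's Delta Chain
    uint64_t  new_sibling = 0;   // node ID of the new right sibling N1
};

// Try to append a ∆split to the Mapping Table slot of N0. Returns false if a
// concurrent thread changed the Delta Chain head first, in which case the
// caller must retry or help along the other thread's SMO.
bool InstallSplitDelta(std::atomic<NodeBase*>& n0_slot, uint64_t n1_id) {
    NodeBase* old_head = n0_slot.load(std::memory_order_acquire);
    auto* delta = new SplitDelta();
    delta->child = old_head;
    delta->new_sibling = n1_id;
    if (!n0_slot.compare_exchange_strong(old_head, delta)) {
        delete delta;            // lost the race to another delta
        return false;
    }
    return true;
}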

A.2 Node Merge
Merging two nodes together in the Bw-Tree follows roughly the same process as a split. One restriction is that a thread is only allowed to merge a node with its left sibling. We walk through the three stages of the merge process using the example shown in Fig. 20, where N1 merges into N0.

Stage I: The thread first appends a ∆remove record to N1's Delta Chain. This record blocks access to the Delta Chain by forcing all threads to switch to N1's left sibling N0. For a thread to find N0, it must keep a physical pointer to the parent node when traversing down to a child node of the parent.

Stage II: Any thread that reaches the left sibling N0 without aborting appends a ∆merge record to N0's Delta Chain. This record contains N0's low-key, as well as N1's high-key and right-sibling. A ∆merge differs from a ∆split in that the former contains a physical pointer to N1 instead of a logical node ID. Therefore, after a thread adds a ∆merge to N0's chain, N0 and N1 are considered part of the same logical node. The ∆merge contains a merge key (copied from N1's low-key) that allows threads to choose between N0 and N1 during traversal.

Stage III: Any worker thread that encounters a ∆merge record will "help along" the merge process. As shown in Fig. 20c, a thread will attempt to install a ∆separator record in the parent node P. This record contains the node ID N0 as well as two (key, node ID) pairs. The first is (K0, N0) and represents the separator item before the deleted item in P. The other pair, (K2, N2), is the separator item that occurs after the deleted item. Any thread that observes this record will determine whether its search key lies in the interval [K0, K2); if it does, that thread takes the fast path to N0 and thus avoids searching the base node. After successfully appending the record to P, the thread marks the ∆remove record from Stage I as deleted. Likewise, the thread also marks node ID N1 as deleted, thereby completing the node merge process.

B CONCURRENT SPLIT AND MERGE
The SMO protocol in Section 2.4 guarantees that the index is in a valid state only if splits and merges do not interfere with each other. A concurrent split operation on a parent node and a merge operation on one of its children, however, can put the index in an inconsistent state.

Fig. 21 illustrates this problem. Consider a parent inner node P with two child nodes L and R, where R is L's right sibling. Assume that the tree uses high-keys as separators inside its inner nodes, and let K1 and K2 be L's and R's high-keys, respectively. Suppose thread t1 obtains the pointer to node P and traverses to P's child node R (Fig. 21a). Suppose another thread t2 splits P into P1 and P2 (Fig. 21b), using K2 as the split point; this works as long as R does not merge into L. But if node R's size drops below the threshold, then t1 observes the underflow on node R and initiates the merge protocol on R by first appending a ∆remove to R (Fig. 21c) and then a ∆merge to L (Fig. 21d). t1 now enters Stage III of the merge process and attempts to post a ∆delete to P. But since P has already split, the pointer is out-of-date and the CaS will fail.

The tree is now inconsistent after the interleaved split and merge. Suppose a thread t3 tries to complete the deletion after the above process. It visits logical node P1 and then L, after t1 has appended the ∆merge record to L. According to the help-along protocol described in Section 2.4, t3 searches P1 for the separator item with key K2 that is stored in the ∆merge. The search (correctly) does not find K2 because K2 is no longer within the range of P1 after the split of node P. Therefore, t3 assumes that the deletion process is finished even though the separator with key K2 is left untouched in P2. If the Delta Chain containing the ∆merge is consolidated afterwards, the link in the ∆remove becomes invalid.

Figure 19: Node Split – This diagram illustrates a node split and how attributes are derived during the split. N0 and N1 are node IDs. For clarity, not all attributes are shown. (LK: low-key; HK: high-key; RS: right-sibling) (Panels: (a) Stage I, (b) Stage II, (c) Stage III.)

Figure 20: Node Merge – This diagram illustrates a node merge and how attributes are derived during the merge. N0 and N1 are node IDs. For clarity, not all attributes are shown. (LK: low-key; HK: high-key; RS: right-sibling) (Panels: (a) Stage I, (b) Stage II, (c) Stage III.)

Figure 21: Concurrent Split and Merge – This diagram shows a concurrency bug caused by the concurrent split and merge of a parent node and its child nodes, respectively. After the merge in the last step, L and R are logically in the same node, while their parent node P has split into P1 and P2.

Our solution in the OpenBw-Tree is to use a new kind of delta record called the ∆abort that stops threads from appending to a Delta Chain. When a thread encounters one of these records during traversal, the Delta Chain is write-locked to that thread; any thread that attempts to append a new record to that Delta Chain will abort. If t1 intends to append a remove delta on R (before Fig. 21c), it first attempts to append a ∆abort on P. If the CaS on the Mapping Table fails, t1 will abort because P has been modified. If the CaS succeeds, then the node merge process from Appendix A.2 continues. After the ∆merge is appended, t1 can remove the ∆abort from P's Delta Chain, unlocking P.
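A rough sketch of this locking protocol (assumed types, not the actual OpenBw-Tree code) is shown below; the ∆abort is installed and later removed with CaS operations on the parent's Mapping Table slot.

#include <atomic>

// Hypothetical delta record: only the chain pointer matters for this sketch.
struct Delta {
    Delta* next = nullptr;
};

// "Write-lock" the parent's Delta Chain by installing an ∆abort on top of it
// with a CaS. Returns the installed delta, or nullptr if the parent's chain
// head changed concurrently, in which case the caller aborts its own SMO.
Delta* LockParentWithAbortDelta(std::atomic<Delta*>& parent_slot) {
    Delta* head = parent_slot.load(std::memory_order_acquire);
    Delta* abort_delta = new Delta();
    abort_delta->next = head;
    if (!parent_slot.compare_exchange_strong(head, abort_delta)) {
        delete abort_delta;
        return nullptr;
    }
    return abort_delta;
}

// After the ∆merge has been installed on the left sibling, unlock the parent
// by restoring its previous Delta Chain head.
void UnlockParent(std::atomic<Delta*>& parent_slot, Delta* abort_delta) {
    Delta* expected = abort_delta;
    parent_slot.compare_exchange_strong(expected, abort_delta->next);
    // In a real implementation the ∆abort is reclaimed through epoch-based GC.
}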

C FORWARD AND BACKWARD ITERATION
We give a brief overview of the implementation of forward and backward iterators. The general design of the iterators was covered in Section 3.2; here we present our algorithms for iteration.

C.1 Forward Iteration
For a given search key K, a thread traverses the tree to find the leaf item with the smallest key K′ such that K′ ≥ K. At the leaf level, the thread consolidates the Delta Chain into its leaf base node and copies the logical node (i.e., the leaf base node plus all attributes) into IC as L. If there are no delta records, then the leaf-level logical node is copied directly into IC as L. The thread then performs a binary search on L's data item array to find the item with key K′ ≥ K and sets C in IC to the offset of this item.

Figure 23: Backward Iteration – For backward iteration using K5 as the low key, the path is [(K1, P1), (K2, P3), (K3, P5), (K4, N0)]. This is achieved by always going left when a separator item with key K5 is seen during inner node search.

Figure 22: Forward Iteration with Concurrent Merge – In this example, the leaf node N0 is merged into its left sibling (N1) while the iterator scans forward. The arrow indicates the current location of the iterator. (Panels: (a), (b).)

When the thread moves the iterator forward, it increments C by one. C typically refers to a valid data item in L. If C equals the size of L after the increment, then the thread initiates a new tree traversal using L's high-key (Khigh) and builds a new IC.
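A minimal sketch of this cursor logic (hypothetical types; in the real index the items are variable-length and the snapshot comes from Delta Chain consolidation) looks as follows.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical snapshot of a consolidated leaf logical node (L) plus the
// cursor (C), mirroring the Iterator Context (IC) described above.
struct IteratorContext {
    std::vector<std::pair<uint64_t, uint64_t>> items;  // (key, value), sorted by key
    uint64_t high_key = 0;                             // Khigh of the snapshot
    std::size_t cursor = 0;                            // offset C into items
};

// Position the cursor at the first item with key K' >= key (binary search).
void Seek(IteratorContext& ic, uint64_t key) {
    auto it = std::lower_bound(
        ic.items.begin(), ic.items.end(), key,
        [](const std::pair<uint64_t, uint64_t>& item, uint64_t k) { return item.first < k; });
    ic.cursor = static_cast<std::size_t>(it - ic.items.begin());
}

// Advance by one item. Returns false when the snapshot is exhausted; the
// caller then re-traverses the tree with ic.high_key and rebuilds the context.
bool Advance(IteratorContext& ic) {
    ++ic.cursor;
    return ic.cursor < ic.items.size();
}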

Fig. 22 illustrates a case where a concurrent node merge happens while the iterator moves forward across leaf nodes. Fig. 22a shows the last state before the iterator advances from N0 to N1, and Fig. 22b shows the desired state of the iterator after it advances. Based on the algorithm above, the thread traverses the tree using N0's high-key K1. It reaches the ∆remove record and then the ∆merge, as described in Appendix A.2. After consolidating the Delta Chain, the new L contains the data items from both N0 and N1. A binary search on the new L using K1 computes the value of C (i.e., the location of the first item with key K ≥ K1), thereby correctly locating the next data item.

C.2 Backward Iteration
To begin backward iteration starting at key K, a thread traverses the tree to find the starting leaf node containing the item with key K′ ≤ K. It then instantiates IC using the same method described above. As long as the IC's counter C > 0, backward iteration decrements C by one. But when C is zero, the thread cannot apply the same strategy used in forward iteration to jump from one leaf node to the next. In the OpenBw-Tree, any logical node with low-key Klow and high-key Khigh must be reachable by threads traversing the tree using a key K where Klow ≤ K < Khigh. Therefore, if Klow is the low-key of a logical leaf node, traversing to Klow will reach that same leaf node rather than its left sibling.

There are two useful properties of a tree traversal that we can leverage to overcome this problem. For a search key Klow, let a path p be the sequence of (separator key, inner logical node) pairs generated during tree traversal. Recall that the OpenBw-Tree uses the low-key Klow of any leaf logical node N as the separator key that allows finding N in its parent node. Thus, when there are no concurrent tree updates, the first property is that there always exists a path p from the root node to N such that Klow is the key of at least one separator item on p. Furthermore, if Klow is a separator item's key on p, then the second property is that all items below it (i.e., closer to the leaf level) must also have Klow as their separator key.

A thread can find a node's left sibling based on the two observations above. When a thread on logical leaf L needs to find L's left sibling, it begins a backward tree traversal using L's low-key Klow as the search key. At each inner node that the thread visits during this backward traversal, it checks whether Klow is equal to the separator key K′ that is selected for traversing to the next level. If they are equal, then the thread uses the largest key K′′ with K′′ < K′ in the same inner logical node as the actual separator for the traversal, instead of K′. At the leaf level, the thread uses the right-sibling ID to locate the node whose low-key is smaller than Klow but larger than the low-key of every other leaf node below Klow, i.e., L's left sibling. Note that during the backward traversal, although the thread does not explicitly compute or store p, it maintains the invariant that every separator key K′′ in p is smaller than Klow by always choosing a smaller separator item in the same inner node.
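The separator-selection rule for one inner node can be sketched as follows (hypothetical node layout; the real inner nodes also carry Delta Chains and low/high keys).

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// One inner logical node as a sorted list of (separator key, child node ID)
// pairs; a simplified layout used only to illustrate the selection rule.
using SeparatorItem = std::pair<uint64_t, uint64_t>;

// Choose the child to descend into during a backward traversal with search
// key k_low: normally the last separator with key <= k_low, but when that
// separator's key equals k_low, step one item further left so every key on
// the path stays strictly smaller than k_low. Assumes the node contains at
// least one separator with key < k_low (true off the leftmost path).
uint64_t ChooseChildBackward(const std::vector<SeparatorItem>& seps, uint64_t k_low) {
    auto it = std::upper_bound(
        seps.begin(), seps.end(), k_low,
        [](uint64_t k, const SeparatorItem& s) { return k < s.first; });
    std::size_t idx = static_cast<std::size_t>(it - seps.begin());  // first key > k_low
    if (seps[idx - 1].first == k_low) {
        return seps[idx - 2].second;   // separator equals k_low: go one item left
    }
    return seps[idx - 1].second;
}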

Fig. 23 provides an example. Suppose that a thread starts a backward iteration using K5 as the search key. When the thread traverses the tree the first time, it generates the search path p as [(K1, P1), (K5, P4), (K5, P6), (K5, N1)] and reaches leaf node N1. When the iterator must move to N1's left sibling, however, the thread starts a backward traversal using N1's low-key K5, and this time it generates the path [(K1, P1), (K2, P3), (K3, P5), (K4, N0)], reaching leaf node N0. The paths differ because when the thread picks K5 at inner node P1 during the second traversal, it moves left and uses K2 instead, since the separator key K5 is equal to the search key.

The leaf node whose low-key is Klow may be merged concurrently into its left sibling during the traversal. This can cause a thread to be unable to find a separator key K′ = Klow because K′ was deleted from its parent node in Stage III of the node merge process. The thread will still reach a leaf node whose low-key is smaller than Klow, guaranteeing correctness. More complicated concurrent modifications, however, can happen during the traversal. Should the thread reach a node whose high-key is greater than or equal to Klow, it must abort its traversal and restart.

D ACKNOWLEDGEMENTS
This work was supported (in part) by the Intel Science and Technology Center for Visual Cloud Systems, and the U.S. National Science Foundation (CCF-1438955, CCF-1535821). We thank the researchers at Microsoft Research for their constructive advice.

This paper is dedicated to the memory of Leon Wrinkles.May his tortured soul rest in peace.