The Skiplist-Based LSM Tree

ARON SZANTO, Harvard University

Log-Structured Merge (LSM) Trees provide a tiered data storage and retrieval paradigm that is attractive for write-optimized data systems. Maintaining an efficient buffer in memory and deferring updates past their initial write-time, the structure provides quick operations over hot data. Because each layer of the structure is logically separate from the others, the structure is also conducive to opportunistic and granular optimization. In this paper, we introduce the Skiplist-Based LSM Tree (sLSM), a novel system in which the memory buffer of the LSM is composed of a sequence of skiplists. We develop theoretical and experimental results that demonstrate that the breadth of tuning parameters inherent to the sLSM allows it broad flexibility for excellent performance across a wide variety of workloads.

1 INTRODUCTION

As data scales, transactional updates and reads become more costly. Traditional systems do not differentiate between hot and cold data, forgoing significant optimization opportunities, since in many applications users need to access the most recent data the fastest. The LSM tree, introduced in 1996 by O'Neil et al. [2], provides a mechanism for quick updates, deletes, and writes by collecting them in a pool of active keys before pushing them to secondary storage when the pool is full. By arranging secondary storage in tiers, the cost of merging the buffer to disk is amortized, allowing for efficient writes, while the maintenance of hot keys in memory allows for performant lookups over recent data. The tiered structure of the data also provides a natural opportunity for indexing. A variety of indexing structures, including fence pointers, zone maps, and Bloom filters, are commonly used to minimize unnecessary disk accesses. In addition, compression algorithms can be used to shrink the footprint of the data both in memory and on disk. Because LSM trees have disparate and independent components, there is a large space for optimization. However, the parameters of interaction between the components are also a crucial part of good performance. In this paper, we describe a novel LSM system that uses cache-conscious skiplists in memory, along with Bloom filter and fence pointer indexing, to achieve excellent throughput. The remainder of this paper proceeds as follows: Section 2 details the design of the Skiplist-Based LSM (sLSM), including the in-memory component, the on-disk component, indexing structures, key algorithms, theoretical guarantees, and the range of design knobs; Section 3 provides extensive experimental results, including parameter tuning and performance analysis; and Section 4 discusses and concludes.

2 SLSM DESIGN

The sLSM has two macroscopic components: the in-memory buffer and the disk-based store. The in-memory section is composed of a set of data structures optimized for quick insert and lookup on the buffered data. The disk-based store is composed of tiered storage that scales by a constant factor with each tier.

2.1 Memory Buffer

The memory buffer consists of R runs, indexed by r. In the sLSM, one run corresponds to one skiplist. Only one run is active at any one time, denoted by the index ra. Each run can contain up to Rn elements. An insert occurs as follows: if the current run is full, make a new, empty run the active one. If there is no space for a new run: i) merge a proportion m of the existing runs in the buffer to secondary storage; ii) set the active run to a new, empty one. Insert the key-value pair into the active run.

A lookup is similar: starting from the newest run and moving towards the oldest, search each skiplist for the given key. Return the first value found (as it is the newest). If the key is not found, search disk storage.

2.2 Skiplists

Skiplists are probabilistic data structures that provide fast search within an ordered sequence of values. They are composed of parallel sorted lists of decreasing sparsity: a search scans one list until it finds a key greater than the desired one, then repeats the process on the next-densest list, until the correct key is found. With some careful optimization, these structures can be powerful yet leave only a small memory footprint. Two optimizations implemented in the sLSM are presented below.

2.2.1 Fast Random Levels. One of the vital steps in skiplist insertion is choosing the "level" that the inserted element will occupy. Ideally, the distribution of levels follows a geometric distribution with parameter p; in practice, p = 0.5 is standard and has nice mathematical properties regarding optimal average runtime [3]. The sLSM uses hardware support to generate random levels quickly. Rather than an iterative mechanism that increments the level with probability p at each round, stopping when the level is not incremented, we propose an O(1) solution: generate MAXLEVEL random bits, where MAXLEVEL is the maximum level for any element, and return the result of the hardware "find first set bit" builtin (standard on x86-64). Since each bit is random, the probability that the nth bit is the first one set is 2^−n, which is exactly the geometric distribution with parameter p = 0.5 that is needed. Our skiplists use MAXLEVEL = 16, which was experimentally determined to be optimal. There is a tradeoff between the probabilistic speed of retrieval and the skiplist size, including the amount of data that must be loaded into the cache for a lookup. As MAXLEVEL grows, there is a higher probability that nodes can be skipped, leading to faster lookups for higher-valued keys. However, the list of forward pointers is larger and may not fit in the cache, leading to cache misses that offset the performance gains of high-level nodes. We found that fitting the forward pointer list into two cache lines is optimal, and theorize that this is because the second cache line is accessed only about 0.5^8 ≈ 0.4% of the time. When it is accessed, the speedup from skipping large swaths of the list outweighs the infrequent cost of loading the extra cache line.
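As a concrete illustration, a constant-time level generator along these lines might look as follows in C++; this is a sketch, and the function name and RNG choice are ours (the paper specifies only the find-first-set approach and MAXLEVEL = 16):

    #include <cstdint>
    #include <random>

    constexpr int MAXLEVEL = 16;

    // O(1) geometric(p = 0.5) level draw: the 1-based index of the first
    // set bit among MAXLEVEL random bits. Forcing the top bit caps the
    // result at MAXLEVEL if all lower random bits happen to be zero.
    int randomLevel(std::mt19937& rng) {
        uint32_t bits = (rng() & ((1u << MAXLEVEL) - 1)) | (1u << (MAXLEVEL - 1));
        // __builtin_ffs compiles to a single BSF/TZCNT instruction on x86-64.
        return __builtin_ffs(static_cast<int>(bits));
    }

Each bit is set with probability 0.5, so the first set bit lands at position n with probability 2^−n, matching the geometric distribution described above.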

2.2.2 Vertical Arrays, Horizontal Pointers. The other skiplist optimization involves the way the skiplist traverses levels. While the differing densities of the levels preclude an array-based structure in the horizontal direction, it is wasteful to include links from a node with key k to another node with the same key on the next level. Instead, in our implementation a skiplist node includes one key, one value, and an array of pointers to other skiplist nodes. This array can be thought of as a vertical column of level pointers, where pointers above the node's level are null and each pointer below points to the next node on that particular level. In this way, skipping down a level is a matter of reading a value that was already loaded into the cache, rather than chasing a pointer to a random location in memory. Because we implemented the skiplist this way from the beginning, we do not report differential performance between this cache-conscious layout and the naive alternative. The following graphic summarizes this optimization nicely.¹

¹Source: http://ticki.github.io/blog/skip-lists-done-right/ (MIT License)
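To make the layout concrete, here is a minimal sketch of such a node and the search it enables; all names are ours, and the fixed MAXLEVEL-length pointer array mirrors the two-cache-line sizing discussed above:

    #include <cstdint>

    constexpr int MAXLEVEL = 16;

    // One key, one value, and a vertical column of forward pointers:
    // forward[i] is the next node at level i, or null above this node's level.
    struct Node {
        int64_t key;
        int64_t value;
        Node*   forward[MAXLEVEL];
    };

    // Walk the sparsest level first; dropping a level is an array read on
    // the already-cached node rather than a pointer chase.
    Node* find(Node* head, int64_t key) {   // head: sentinel node
        Node* cur = head;
        for (int lvl = MAXLEVEL - 1; lvl >= 0; --lvl)
            while (cur->forward[lvl] && cur->forward[lvl]->key < key)
                cur = cur->forward[lvl];
        Node* cand = cur->forward[0];
        return (cand && cand->key == key) ? cand : nullptr;
    }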

2.3 Memory Buffer Indexing

Bloom filters are space-efficient probabilistic data structures used to test whether an element is in a set. Using a series of hash functions and a bitset, the filter provides a strong probabilistic guarantee: an element will never induce a false-negative result under a test for membership, and will induce a false-positive result only up to some error probability ε, a value chosen by the user and traded off against the space occupied by the filter. Bloom filters find important use in the sLSM when paired one-to-one with runs in memory and on disk. Rather than incur a high cost by searching for a key in every run, the filter is consulted first; if it returns negative, we can safely skip that run, thanks to the filter's no-false-negative guarantee. In this way, we search only a fraction ε of the runs that do not contain the key, a significant time saving for lookups. In our implementation, each consideration of a run is paired with a filter test; if the test fails, we simply skip that run. We use the Murmur3 hash function and the mathematical technique of "double hashing", which generates the k necessary hash values from a linear combination of two hashes rather than recomputing the entire hash k times. We also keep track of the maximal and minimal key in each run for low-cost, high-granularity filtering by run.
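A minimal sketch of the double-hashing scheme, assuming two independent 64-bit base hashes are supplied (in the sLSM these would come from Murmur3; the struct and names are ours):

    #include <cstdint>
    #include <vector>

    struct BloomFilter {
        std::vector<bool> bits;  // the underlying bitset
        int k;                   // number of hash functions

        // Probe i is the linear combination h1 + i*h2, so all k probes
        // cost only two base hash computations.
        void add(uint64_t h1, uint64_t h2) {
            for (int i = 0; i < k; ++i)
                bits[(h1 + i * h2) % bits.size()] = true;
        }
        bool mayContain(uint64_t h1, uint64_t h2) const {
            for (int i = 0; i < k; ++i)
                if (!bits[(h1 + i * h2) % bits.size()]) return false;
            return true;  // "possibly present": false positives at rate eps
        }
    };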

2.4 Disk-Backed Storage

The disk-backed store provides more permanent storage and is touched only by merges from the memory buffer. There are L disk levels, with D runs on each level. A run is an immutable sequence of sorted key-value pairs, captured within one memory-mapped file on disk. Levels grow by runs until they hit the threshold D, at which point a new level is added, incrementing L. Each disk run is indexed similarly to the in-memory runs, with max/min keys and Bloom filters. Additionally, we use fence pointers to index into disk runs for lookups. These are fixed-width indices, kept in memory, that store the key of one element per logical page of the run. To look up a key in a disk run, we find the fence pointers that bound the key via binary search, then search for the key in that range on disk, also by binary search. This reduces disk accesses by a factor of $\frac{\log n}{\log \mu}$, where µ is the fence pointer page size.

2.5 Merging

One of the most important implementation details of the sLSM is the merging algorithm. When the buffer becomes full, a fraction m of the runs is flagged; their elements are collected, sorted, and written to the shallowest disk level's next available run. In this way, adjacent levels share the following relationship: the size of a level's runs is the total size of its shallower neighbor multiplied by the fraction of runs merged, m. Accordingly, the number of elements at level k is O((mD)^k). Merging is not as simple as copying from file to file, however. A disk level might be full, requiring a cascade of merges down to lower disk levels. This complex operation is quite nuanced: though the runs being merged are individually sorted, the resulting run must itself be sorted. Because runs at lower levels do not fit in memory, some optimizations are necessary to save both time and space. Moreover, when several runs from level k contain the same key, the value that remains tied to that key on level k + 1 must be the most recently written, i.e., the one that came from the newest run on level k. The naive algorithm for merging n items from k sorted lists is O(nk), where each element is compared against the minimal unwritten item from each list. We propose a heap-based merging algorithm that runs in O(n log(mD)) time and O(mD) space, where n is the number of elements being merged. Algorithm 1 demonstrates our approach, which uses a min-heap whose constituents are pairs of key-value elements and integers denoting the run from which each element was taken. The heap disgorges key-value pairs in key order; this entails that the result will be sorted, and in the case of multiple values for the same key, only the highest-ranked run's value is written into the result.

ALGORITHM 1: HeapMerge
Data: runs to merge R[1..k], min-heap H, result run S
Result: data from the runs to merge are in key order in S, with the latest value retained for each key.

    for each run r in 1..k do
        push(H, (R[r][0], r));
    end
    j := -1;
    lastKey := None;
    lastRun := None;
    Heads := Array[k];
    Heads[i] := 0 for all i;
    while size(H) > 0 do
        (e, r) := pop(H);
        if e.key == lastKey then
            if lastRun < r then
                S[j] := e;        // a newer run supersedes the recorded value
                lastRun := r;
            end
        else
            j := j + 1;
            S[j] := e;
            lastRun := r;
        end
        lastKey := e.key;
        if Heads[r] < size(R[r]) − 1 then
            Heads[r] := Heads[r] + 1;
            push(H, (R[r][Heads[r]], r));
        end
    end
    Construct-Index(S);
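A direct C++ rendering of HeapMerge using std::priority_queue may help; the struct and function names are ours, and runs are ordered oldest to newest so that on a key tie the higher run index wins, as in the pseudocode:

    #include <cstdint>
    #include <queue>
    #include <utility>
    #include <vector>

    struct KV { int64_t key; int64_t value; };

    std::vector<KV> heapMerge(const std::vector<std::vector<KV>>& runs) {
        using Entry = std::pair<KV, size_t>;  // (element, source run index)
        auto cmp = [](const Entry& a, const Entry& b) { return a.first.key > b.first.key; };
        std::priority_queue<Entry, std::vector<Entry>, decltype(cmp)> heap(cmp);  // min-heap on key
        std::vector<size_t> heads(runs.size(), 0);
        for (size_t r = 0; r < runs.size(); ++r)
            if (!runs[r].empty()) heap.push({runs[r][0], r});

        std::vector<KV> out;
        int64_t lastKey = 0;
        size_t lastRun = 0;
        while (!heap.empty()) {
            auto [e, r] = heap.top(); heap.pop();
            if (!out.empty() && e.key == lastKey) {
                if (lastRun < r) { out.back() = e; lastRun = r; }  // newer run supersedes
            } else {
                out.push_back(e);
                lastRun = r;
            }
            lastKey = e.key;
            if (++heads[r] < runs[r].size())
                heap.push({runs[r][heads[r]], r});  // refill from the same run
        }
        return out;  // Construct-Index(out) would follow here
    }

Each of the n elements is pushed and popped once on a heap of at most mD entries, giving the O(n log(mD)) time and O(mD) space bounds claimed above.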

Though we do not include it in the formal algorithm specification, our implementation knows whether the merge in progress is writing to a new level below all the others. In that case, keys flagged for delete are not written out at all. The time and space bounds follow directly from noticing that n elements are popped off the heap, and the heap is of size mD. In our implementation, we also use multithreaded merging to decrease latency. When an insertion triggers a merge, a dedicated merge thread takes ownership of the runs to merge and executes the merge in parallel, allowing the main thread to rebuild the buffer and continue to answer queries. If a lookup request arrives while the merge thread is executing, the main thread searches the memory buffer for the requested key and, if unsuccessful, waits for the merge to complete before querying the disk levels.
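The hand-off might be structured as follows; this is a minimal sketch of one possible synchronization scheme (the class and its names are ours, not the paper's):

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <thread>

    class MergeCoordinator {
        std::mutex m;
        std::condition_variable done;
        bool mergeInFlight = false;
    public:
        // Insert path: flag the runs, kick off the merge, keep serving queries.
        void startMerge(std::function<void()> mergeWork) {
            { std::lock_guard<std::mutex> g(m); mergeInFlight = true; }
            std::thread([this, work = std::move(mergeWork)] {
                work();  // e.g., HeapMerge the flagged runs into a disk run
                { std::lock_guard<std::mutex> g(m); mergeInFlight = false; }
                done.notify_all();
            }).detach();
        }
        // Lookup path: called after a memory-buffer miss, before disk search.
        void waitForMerge() {
            std::unique_lock<std::mutex> lk(m);
            done.wait(lk, [this] { return !mergeInFlight; });
        }
    };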

2.6 Insertion

The algorithm for insertion into the sLSM is given in Algorithm 2.

ALGORITHM 2: Put
Data: key k and value v to insert, runs runs[] of the sLSM, Bloom filters B[] of the sLSM, active run index ra, run capacity Rn, number of runs R
Result: the key-value pair is inserted into the sLSM

    if size(runs[ra]) == Rn then
        ra := ra + 1;
    end
    if ra == R then
        Do-Merge(1);
        ra := ra − mR;
    end
    insert(runs[ra], k, v);
    insert(B[ra], k);

The Do-Merge routine takes one parameter, which represents the disk level to merge runs to. In our implementation, it is a recursive function that merges successive levels until there is a free run to merge to, or until it creates a new level at the bottom of the sLSM. Do-Merge calls HeapMerge when it finds an empty run to merge to, or after it creates a new, empty level. It is at this point that HeapMerge receives information as to whether the merged level is the last, in which case deletes are "committed", i.e., not written to disk. Expected insertion time is calculated as follows:

With probability $\frac{R_n R m - 1}{R_n R m}$, an insertion goes into a skiplist with no merge necessary, i.e., $O(\log R_n)$ work. With probability $\frac{1}{R_n R m}$, a merge is necessary. Merging $R_n R m$ elements to disk level 1 takes $O(R_n R m (1 - \log \epsilon))$ time, given the need to copy each element in the merged runs as well as to write a Bloom filter (which requires $O(-n \log \epsilon)$ time for n elements). With probability $\frac{1}{R_n R m^2 D}$, the current insertion is the one that forces the first level to merge down to level two, an $O(R_n R m^2 D \log(Dm))$ operation. By induction, the probability that the current insert causes a merge down to level k is $\frac{1}{R_n R m^{k+1} D^k}$, so the total insertion time simplifies as

$$\frac{R_n R m - 1}{R_n R m} \, O(\log R_n) + \sum_{k=0}^{L} \frac{R_n R m^{k+1} D^k (1 - \log \epsilon) \log(Dm)}{R_n R m^{k+1} D^k} = O(\log R_n + (1 - \log \epsilon) L \log(Dm))$$

with $L = O(\log n)$, where n is the number of elements in the sLSM; this is because levels are added at exponentially increasing thresholds of n.

2.7 Lookup

Lookups all follow the same pattern: starting with the memory buffer and moving downwards through the disk levels, query runs in newest-to-oldest order. For each run queried, check whether the key lies between the min and max keys for that run. If so, query the Bloom filter. If the filter is positive, search the run. For in-memory runs, this entails searching the skiplist. For disk runs, it involves a binary search of the fence pointers to find two file locations that bound the key, if it exists; a binary search is then performed between those two locations, and the key's value, if any, is returned. The worst case for a lookup is a key that does not exist, meaning the algorithm must search every level and every run. Noting that the time to query a Bloom filter with $k = -\frac{\log \epsilon}{\log 2}$ hash functions is $O(-\log \epsilon)$, the expected runtime of a lookup is

$$O\left( (-\epsilon \log \epsilon) \left[ R \log R_n + \sum_{l}^{L} \sum_{k}^{D} \left( \log\left( \frac{R_n R m^k D^k}{\mu} \right) + \log \mu \right) \right] \right)$$

where µ is the number of elements per fence pointer. This expression simplifies to

$$O\left( (-\epsilon \log \epsilon) \left( R \log R_n + DL \log R_n + DL \log R + D^2 L \log(Dm) \right) \right)$$

In practice, $R_n \gg R$, so average lookup time is approximately

$$O\left( (-\epsilon \log \epsilon) \left( (R + DL) \log R_n + D^2 L \log(Dm) \right) \right).$$

2.8 Delete

Deletes are implemented quite simply: to delete a key, insert that key paired with a value that signifies deletion. When a lookup encounters this value for a key, it immediately returns with a failure to find the key. Finally, when the deleted key-value pair is merged down to start a new deepest level, it is omitted from the result set, as it is no longer needed to supersede any previous values.
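As a sketch, a delete is then just an insert of a reserved sentinel value; the sentinel choice and function names are ours, assuming 64-bit integer values:

    #include <cstdint>
    #include <limits>

    // Reserved value marking a key as deleted; any value outside the live
    // domain works.
    constexpr int64_t TOMBSTONE = std::numeric_limits<int64_t>::min();

    void slsm_put(int64_t key, int64_t value);  // hypothetical insert entry point

    // Lookups that find TOMBSTONE report "not found"; the pair is dropped
    // when it is merged into a newly created bottom level.
    void slsm_delete(int64_t key) { slsm_put(key, TOMBSTONE); }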

2.9 Range

Range queries involve looking up, for each run, all the elements in the range. For skiplists, this is as simple as locating the node corresponding to the smallest key greater than or equal to the first key in the range, then following the skiplist's pointers until the current node's key is greater than or equal to the second key in the range, or the end of the list has been reached. For disk-based runs, we first filter by min/max key, then perform only the one or two fence-pointer lookups needed to find the indexes in the run that frame the range. From there, we construct a hash table as follows: starting with the newest run and working backwards towards the oldest, find all elements in the range and, for each, consult the hash table: if the key is already present, or the element is a delete, skip it; otherwise, write it to the result set. In either case, insert the key into the table so that older versions of it are masked.


The hash table guarantees that only the newest non-deleted values remain in the result set. For our hash table, we again use the Murmur3 hash function. We use linear probing rather than chaining to optimize for small key-value pairs (the test workload uses integers), and we keep the key-value objects themselves in the table, rather than pointers, in order to remain cache-optimal. When the hash table is more than half full, we double its size and rehash each element, leading to amortized O(1) insertion and true constant-time probing. Because collecting the elements in the range is a linear-time operation in both skiplists and disk runs, and because hashing is an amortized constant-time operation, a range query over n keys is expected to take O(n) time.
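A compact sketch of this newest-first masking scan follows; std::unordered_map stands in for the paper's custom linear-probing table, and each inner vector is assumed already filtered to the query range via the skiplist or fence pointers:

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    struct KV { int64_t key; int64_t value; };

    std::vector<KV> rangeQuery(const std::vector<std::vector<KV>>& runsNewestFirst,
                               int64_t tombstone) {
        std::unordered_map<int64_t, int64_t> seen;  // masks older versions
        std::vector<KV> result;
        for (const auto& run : runsNewestFirst)
            for (const KV& e : run) {
                if (seen.count(e.key)) continue;     // older version: skip
                seen.emplace(e.key, e.value);        // record newest occurrence
                if (e.value != tombstone) result.push_back(e);
            }
        return result;  // newest non-deleted value per key in the range
    }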

2.10 Parameter Table

The full range of tuning parameters is given in Table 1.

Table 1. Parameters and Values

    Parm  Meaning                         Range
    R     Number of runs                  Z > 0
    Rn    Elements per run                Z > 0
    ε     Bloom filter FP rate            (0, 1)
    D     Number of disk runs per level   Z > 0
    m     Fraction of runs merged         (0, 1]
    µ     Fence pointer page size         Z > 0

2.11 Object Orientation

We designed the sLSM to be fully general; for the sake of academic experimentation, it makes sense to be able to substitute key and value data types at will, as well as to swap out run types. To this end, we chose C++ as our language, primarily for its speed and templating flexibility. With careful considerations such as a Run interface that enforces the properties of a memory buffer run, we will in the future be able to simplify complex testing that involves trying different combinations of run types (perhaps hash tables or radix trees as well as skiplists).

3 EXPERIMENTATION

We tested the sLSM on a DigitalOcean Droplet server running 64-bit Ubuntu (Linux 4.4.0) with 32 Intel Xeon E5-2650L v3 cores at 1.80 GHz, a 500 GB SSD, 224 GB of main memory, and 30 MB of L3 cache. It was clear that the choice of parameters for the sLSM would play an important role in performance optimization. Our first task was to find the combination of parameters that resulted in excellent baseline performance. Then, for various experiments, we deviated from that set of parameters in order to determine the effect of each parameter on overall performance in the face of changing workload types. To find this baseline combination, we tested the Cartesian product of the parameters in Table 1 on data sizes up to 100 MB, essentially performing a fine-mesh multidimensional grid search. Of the 1,556 parameter sets that we sampled, we chose the top 10% to move on to the next round of larger-scale testing. We then tested each of the remaining parameter sets on 500 MB, 1 GB, and 10 GB datasets, selecting the parameter set with the highest average insert-plus-lookup throughput, weighted by the number of inserts and lookups. This baseline parameter set is the basis for the rest of our experimentation: µ = 512, ε = 0.001, R = 50, Rn = 800, D = 20, m = 1.0. Unless otherwise specified, all experiments use these parameters, as well as a 100-million-key dataset with 32-bit integer keys generated uniformly at random.

3.1 Number of Runs in Memory

In determining the optimal R, we found that the smaller R is, the smaller the memory buffer is, and the more frequent merges become. Thus, lower R leads to lower insertion throughput. However, with few runs to search, lookups are very quick for small R. Conversely, higher R is linked to faster insertion but slower lookup, since more runs must be searched. With Bloom filters, it is possible to set R high enough to achieve extremely fast insertion while still enjoying significant speedup on lookups due to the filters.

More formally, R does not enter the amortized insertion time function, and there are significant constant factors hidden in that equation that correspond to the speed and frequency of merges. However, lookups depend linearly on R, as shown above. The above graph details the tradeoff between insertion time and lookup time for a number of values of R. As such, setting the number of runs intelligently allows us to tune the performance of the sLSM to the workload at hand: more runs for write-heavy workloads, and fewer for lookup-heavy ones.

3.2 Buffer Size

The choice of buffer size is tightly linked with the experimental optimization of the number of runs. However, instead of a simple size-in-bytes parameter for the memory buffer, we expose a two-dimensional knob: the user chooses both the number of runs and the size of each run. As shown above, the number of runs is crucial for performance. Another important factor for speed is the size of each skiplist: Rn is a main determinant of the insertion and lookup speed within each run. For this experiment, we take a Cartesian product over R × Rn. As expected, insert and lookup throughputs trade off as R increases. Of interest here, however, is the effect of changing Rn. For each value of R, increasing Rn raises the insertion rate while decreasing lookup throughput. This is because a larger Rn allows for fewer merges over the lifetime of the workload due to the larger memory buffer, while the increased size of each run lengthens each lookup, since skiplist queries are logarithmic in run size. For this reason, we chose a value of Rn between 500 and 1000, finding that 800 worked nicely for a variety of workloads.

3.3 Disk and Merge Parameters

We determined that if m is set under 0.5, merges happen too frequently for sizeable datasets, causing the OS to run out of file descriptors. To determine optimal values of D and m, we took their Cartesian product along the range of values that remained after the original parameter tuning. For our standard workload, there was no significant trend other than that lookup throughput tended to decrease as D increased for small m. This is likely because the OS had too many files open, leading to page churn as very small disk runs are searched in sequence. We present our chart with these results, noting that even when the experimental results are sorted by Dm, the size ratio between levels, there is no significant trend. Further experimentation will involve larger datasets over this Cartesian product, as well as the introduction of various types of workloads.


3.4 Bloom Filters

Here we describe the performance enhancements afforded by Bloom filters. Testing on a workload of a million inserts and lookups, we evaluate the following cases: no Bloom filter, and Bloom filters with ε ∈ {0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001}.

The filter provides an impressive speedup, from 3,634 lookups/sec to over 340,000/sec, with no significant difference in insertion time. The intuition behind the speedup is simple: for each run, we avoid a lookup inside it whenever the Bloom filter test fails. Since such a test is far cheaper than a skiplist lookup, we save the cost of the deep search by ruling out the possibility that the key exists in a particular run.

Under profiling, we found that 98.9% of the CPU (clock) time is spent in the skiplist lookup function without Bloom filters. This drops to a mere 13.1% when the filters are introduced. In that case, 53.2% of the CPU time is spent calculating the hash functions necessary for the filter. Perhaps there is an opportunity for further optimization here: finding quicker hash functions or making the filter structure more cache-conscious could provide significant speedup, since the filters are such "hot" structures. A more in-depth optimization might take the form of recent work by Dayan et al., in which the false positive rates of the Bloom filters are dynamically optimized across the different levels of the structure [1].

3.5 Range Queries

We present a short demonstration of the sLSM's range query performance. As derived in Section 2, range queries are linear-time operations in the size of the range. For several different range sizes, we plot the time to complete range queries over a uniform distribution of keys.

3.6 Data Size

We now show results for varying dataset sizes. As is evident, both insertions and lookups slow as the data size grows.

The reason for this effect with respect to inserts is that as the system grows, it completes more merges, which require expensive disk accesses. Lookups, too, must query more Bloom filters and disk runs as the data grow large, meaning that each lookup takes more time. Nevertheless, our system exhibits no more than a logarithmic slowdown, which is as good as can be expected given the theoretical results shown above.

3.7 Update-Lookup Ratio

For many data systems, performance depends on the ratio of updates to lookups in the workload. In this experiment, we varied this ratio between 10% and 90% lookups for a 100-million-query workload. To demonstrate the ability of the sLSM to adapt to various workloads, we show plots of completion times of the query set for two sLSM configurations: one parameterized by R = 20 and one by R = 200. As shown above, higher R leads to increased insert throughput at the cost of lookup speed. Accordingly, the graph shows that the tree with balanced parameters (R = 20) is quite forgiving with respect to the lookup ratio. In contrast, the specialized tree (R = 200) completes the low-lookup, high-insert workloads an order of magnitude faster than its balanced counterpart, at the cost of steeply decreasing performance as lookups become more prevalent in the workload.

3.8 Data Skew

3.8.1 Insertion Skew. The skewness of inserted data is a vitally important factor in performance. Our skiplists do not blindly insert elements; rather, if a key is already present in the active skiplist, its value is simply updated. This means that for data with low key variance, insertion can be extremely fast. For this experiment, we generated integer keys from a normal distribution centered at zero (rounding to the nearest integer) and manipulated the variance. As the variance increases, there is a greater variety of keys, and insert performance drops precipitously, as shown in the accompanying log-log plot.

3.8.2 Lookup Skew: Single-Threaded. Lookup skew is similar: in the same style of experiment, we find that for a set of uniformly distributed keys in the sLSM, querying for a tightly clustered set of lookup keys results in higher performance.

Performance degrades as the lookup keys become more dispersed. A small set of lookup keys requires fewer random seeks and disk page loads, resulting in better performance, as shown. Further work will involve testing on hard disk drives and on systems that allow fine-grained tracking of disk IOPS.

3.9 Concurrency

3.9.1 Lookup Skew: Multithreaded. Lookup skew also provides an opportunity to exploit concurrency in our experimentation. In this experiment, lookup skew was varied along with the number of threads performing concurrent lookups, demonstrating that with highly clustered lookups, the sLSM's per-thread scaling factor is higher than with evenly distributed lookups. This is because the disk can better optimize its seeks to service the lookup requests when the keys exhibit locality; multiple threads may even be serviced by the same disk pages, cutting down on disk operations further. The closeness of the keys allows the access pattern to behave like sequential requests rather than random ones. The graphic also highlights how the sLSM scales with the number of threads in the general (key-dispersed) case. With a correlation coefficient of R² = 0.98 and an increase in lookup throughput of 1.02M per thread, the sLSM proves its worth as a parallel data structure as well.

3.9.2 Merge Threading. One of the concurrency optimizations we implemented was dedicating a hardware thread to merging disk levels, in order to lower latency. For this test, we measured the largest time between insertions over 100M keys. We used our server's SSD as well as a Toshiba Canvio HDTB205XK3AA 500 GB external hard drive connected via USB 3.0. With this setup, we measured the reduction in tail latency afforded by merge threading for both spinning and solid-state disks.

As shown, there is a significant reduction in maximal response time for both SSD and HDD. In further testing, we also showed that merge threading allowed the system to sustain no less than 99.7% CPU utilization throughout the workload, whereas without merge threading utilization dropped at times to 53%, due to the processor waiting for disk operations to finish.

4 CONCLUSION

We presented the Skiplist-Based Log-Structured Merge Tree, a data system that makes heavy use of probabilistic data structures to provide highly performant transactional queries. Our novel approach includes a sequence of cache-aware skiplists, indexing via Bloom filters and fence pointers, a fast k-way merging algorithm, lookup and merging concurrency, and a thorough experimental evaluation that details the tradeoffs between update and lookup throughput for a wide variety of workloads. We also gave theoretical guarantees on the performance of the system at scale and corroborated them empirically, demonstrating that the system's performance is tightly bounded by the theoretical guarantees. Further, we demonstrated that the sLSM adapts to various workloads, performing well for query sets with different types of skew. In single-threaded operation and with the right tuning, the sLSM can exceed millions of queries per second for both updates and reads, far outpacing baseline results for such systems. Concurrent lookups scale nearly perfectly as well, allowing the sLSM to achieve throughput of between 7 and 11 million lookups per second on various datasets. While testing and implementation work remains, the sLSM has proven its worth as a subject of research and as a high-performance big data system.

REFERENCES

[1] Niv Dayan, Manos Athanassoulis, and Stratos Idreos. 2017. Monkey: Optimal Navigable Key-Value Store. In ACM SIGMOD International Conference on Management of Data.
[2] Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil. 1996. The Log-Structured Merge-Tree (LSM-Tree). Acta Informatica 33, 4 (1996), 351–385.
[3] William Pugh. 1990. Skip lists: a probabilistic alternative to balanced trees. Commun. ACM 33, 6 (1990), 668–676.
