

DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2016

Snapple
A distributed, fault-tolerant, in-memory key-value store using Conflict-Free Replicated Data Types

JOHAN STENBERG

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION


Snapple: A distributed, fault-tolerant, in-memory key-value store using Conflict-Free Replicated Data Types

JOHAN STENBERG

Master's Thesis in Computer Science
Written at CSC, KTH

Supervisor: Philipp Haller
Examiner: Johan Håstad


Abstract

As services grow and receive more traffic, data resilience through replication becomes increasingly important. Modern large-scale Internet services such as Facebook, Google and Twitter serve millions of users concurrently. Replication is a vital component of distributed systems. Eventual consistency and Conflict-Free Replicated Data Types (CRDTs) have been suggested as an alternative to strongly consistent systems. This thesis implements and evaluates Snapple, a distributed, fault-tolerant, in-memory key-value database based on CRDTs and running on the Java Virtual Machine. Snapple supports two kinds of CRDTs: an optimized implementation of the OR-Set, and version vectors. Performance measurements show that Snapple is significantly faster than Riak, a persistent database based on CRDTs, but has 2.5x to 5x lower throughput than Redis, a popular in-memory key-value database written in C. Snapple is a prototype implementation, but it might be a viable alternative to Redis if the user wants the consistency guarantees that CRDTs provide.


Referat

Snapple: a distributed, fault-tolerant, in-memory key-value database based on Conflict-Free Replicated Data Types

As Internet-based services grow and receive more traffic, data replication becomes increasingly important. Modern large-scale Internet services such as Facebook, Google and Twitter handle millions of concurrent user requests. Data replication is a vital component of distributed systems. Eventual consistency and Conflict-Free Replicated Data Types (CRDTs) have been suggested as alternatives to strong consistency. This thesis implements and evaluates Snapple, a distributed, fault-tolerant, in-memory key-value database based on CRDTs and running on the Java Virtual Machine. Snapple supports two kinds of CRDTs: the optimized implementation of the observed-remove set, and version vectors. Performance measurements show that Snapple is much faster than Riak, a persistent database based on CRDTs, but has 2.5x to 5x lower throughput than Redis, a popular in-memory key-value database written in C. Snapple is a prototype, but CRDT-backed systems may be a worthy alternative to Redis if the user wants the consistency guarantees that CRDTs provide.


Contents

Glossary

1 Introduction
  1.1 Background
  1.2 This Thesis
  1.3 Intended Readers
  1.4 Related Work
    1.4.1 Google Bigtable
    1.4.2 Amazon DynamoDB
    1.4.3 Facebook Cassandra
    1.4.4 Riak
    1.4.5 Redis

2 Conflict-Free Replicated Data Types
  2.1 An Introductory Example
  2.2 Mathematical Properties
  2.3 State-based and Operation-based CRDTs
  2.4 Consistency Properties
  2.5 The Observed-Remove Set CRDT

3 The Optimized Observed-Remove Set CRDT
  3.1 Shortcomings of the original OR-Set CRDT
  3.2 Version Vectors
  3.3 Introducing the Optimized OR-Set

4 Snapple: A distributed, fault-tolerant, in-memory key-value store
  4.1 Data Model
  4.2 Snapple API
  4.3 Implementation Details
    4.3.1 Synchronization Method
    4.3.2 Communication Protocol
    4.3.3 Security Model
    4.3.4 Client SDK
    4.3.5 Utility Tools

5 Performance Evaluation
  5.1 Environment and Setup
  5.2 Reproducibility
  5.3 Performance Measurement Results

6 Conclusion
  6.1 Features
  6.2 Performance
  6.3 Future Work
    6.3.1 CRDTs
    6.3.2 Snapple

Bibliography

Appendices

A File Listings


Glossary

Atomic Instructions Instructions provided by the hardware which are guaranteed not to be interrupted. An example of such an instruction is CAS (compare-and-set), commonly used in lock-free data structures.

CAP Theorem A theorem stating that it is impossible for a distributed system to provide consistency, availability and partition tolerance simultaneously.

Consistency Model A set of rules specifying how, given that the rules are not broken, data remains consistent in a system, often a distributed one.

Eventual Consistency A consistency model which offers a liveness guarantee: eventually all updates are delivered to all replicas.

LWW (last-writer-wins) An algorithm used to resolve merge conflicts between two replicas. The replica with the latest timestamp wins.

Partial Replication Replicates data at a subset of all replicas. The number of data copies in the cluster is known as the replication factor. Useful when handling large amounts of data or geographically dispersed data warehouses.

Replica A single server which runs a program that is part of a larger cluster of servers. Each replica in the cluster usually runs the same program.

Replica Cluster A set of replicas which together form a larger entity that fulfills a designated function.

Scala Future Futures in the Scala programming language represent an asynchronous computation which either finishes successfully and returns a typed value, or fails and returns an exception.

Strong Consistency If a system guarantees strong consistency, no two replicas ever differ in their state.

Strong Eventual Consistency Offers the same guarantees as eventual consistency, with the addition that if any two replicas have received the same updates, they will be in the same state.


Chapter 1

Introduction

This chapter describes the motivation for the thesis, gives a short content description and presents related work.

1.1 Background

As services grow and receive more traffic, data resilience through replication becomes increasingly important. Modern large-scale Internet services such as Facebook, Google and Twitter serve millions of users concurrently. These services started out as simple web applications but have evolved to meet the needs of their rapidly growing user bases. Scaling and distributing a service becomes more and more important as it grows. Distributed systems play an important role at large technology companies, governments and non-profit organizations today.

Replication is a vital component of distributed systems. There exist multiple approaches to implementing replication; one popular approach is to try to maintain a global order of events [19]. Many of these approaches rely on strong consistency models and are further complicated if the system should support fault handling [7].

Another approach, eventual consistency [23], has been suggested as an alternative to strong consistency. Eventual consistency allows distributed systems to execute operations without first synchronizing with their replicas; instead, data synchronization is done in the background. In distributed systems, the synchronization step is usually a major bottleneck during client requests and results in higher latency. Eventually consistent systems move this bottleneck off the critical path and into a background process, which should give them a performance advantage over strongly consistent systems, at the cost of consistency [34].

The most widely used eventually consistent system is the DNS (Domain Name System), where each update to the system is propagated according to a predefined set of rules and eventually will be seen by all clients.

This thesis tries to utilize eventual consistency in distributed in-memory databases. Distributed in-memory databases are systems which store data at multiple different replicas in order to provide resilience, scalability and locality.


Distributed data stores cannot satisfy all three traits of the CAP theorem [5]; however, it is possible to satisfy two [12].

This thesis focuses on Conflict-Free Replicated Data Types (CRDTs) [27], data structures achieving strong eventual consistency. These data structures focus on the AP (availability and partition-tolerance) traits of the CAP theorem. Eventually consistent systems usually suffer from difficulties when resolving merge conflicts: how can one know what happened first if synchronization is delayed? CRDT replicas offer conflict-free resolution mechanics; it is mathematically impossible for merge conflicts to arise in CRDTs if they are implemented correctly.

There are currently few popular implementations of CRDT-based data storage systems. One of the most prominent systems is Riak [3], a distributed, persistent, CRDT-based key-value store. Riak brands itself as a fault-tolerant NoSQL database and is used in production at Klarna, McAfee and Rovio, among others. Riak offers an ideal resilient data store. However, since it stores its data persistently on disk, its performance is limited compared to an in-memory store.

1.2 This Thesis

Riak proves that CRDTs can be used to provide distributed, fault-tolerant, persistent data stores. However, persistent data stores often have an inherent overhead due to the nature of persisting data on disk. Many applications need fault-tolerant in-memory stores; a typical example is web application session storage. Each time a request to the server is executed, the server needs to communicate with the session store. In such a scenario, a persistent data store might introduce too much overhead. This thesis aims to further broaden the usage of CRDTs by evaluating their usability as the mathematical foundation for a distributed, fault-tolerant, in-memory key-value store. The evaluation focuses on two system qualities: features and performance.

This thesis introduces Snapple, a distributed, fault-tolerant, in-memory key-value store based on CRDTs, running on the JVM and fully open source [30]. Snapple aims to reap the benefits of CRDTs while providing scaling and performance. Snapple was written to test whether CRDT-centered in-memory systems can in fact offer the same desirable resilience qualities as Riak, but with the flexibility and performance of an in-memory store.

Chapter one introduces the thesis background, its purpose and related work. Chapter two provides an in-depth theory of CRDTs and an introduction to one of the most common CRDTs, the OR-Set (observed-remove set). Chapter three presents shortcomings of the OR-Set and an alternative implementation which solves one of the biggest problems preventing the original OR-Set implementation from being used in a production-grade system. Chapter four introduces Snapple and its design decisions. The source code of Snapple can be browsed at GitHub [30]. Chapter five presents performance measurement comparisons between Snapple and two other commonly used database systems. Finally, chapter six concludes the thesis and discusses the results.


The bibliography and appendix can be found after chapter six.

1.3 Intended Readers

Any student or professional interested in distributed systems, modern data storage systems, or the theoretical computer science behind CRDTs might benefit from reading this thesis. Furthermore, individuals interested in the practical performance of Snapple can see chapter five, which presents Snapple performance measurements.

1.4 Related Work

This thesis evaluates using CRDTs to build an in-memory database; the related work therefore covers systems and theories with both similar and different logical foundations.

1.4.1 Google Bigtable

Google Bigtable [8] is presented as a multi-purpose distributed database, serving anything from URLs to webpages to satellite images. The latency requirements for these data types vary widely, ranging from real-time data streaming to batch processing. A Bigtable is a sparse, distributed, persistent, multi-dimensional sorted map. Each data entry has a row, a column and a timestamp as key, and an uninterpreted array of bytes as value. Data is organized as tables and each table is sorted and indexed by the row key. Rows are partitioned into ranges called tablets. Columns can be added dynamically or through the table definition. Columns are grouped together as column families and each family is stored together locally on disk. The timestamp key is used to represent multiple different versions of the same entry. Bigtable resolves write conflicts through timestamps, using the last-writer-wins model.

Bigtable is implemented using multiple other Google services, including GFS [11], the Google File System, and Google Chubby [6], a distributed lock service. Bigtable uses Chubby to select a master node, store Bigtable metadata and add new replicas correctly. GFS is used as the storage system. The master server assigns tablets to tablet servers (slave servers), balances tablet server loads, handles schema changes and conducts garbage collection in GFS.

GFS, not Bigtable, handles replication; each data entry is replicated at least three times throughout the cluster using a master-slave approach.

1.4.2 Amazon DynamoDB

Amazon DynamoDB [10] was created to handle the increasing load on Amazon's e-commerce platform. DynamoDB is branded as a distributed data store and has a fixed replication factor of three. Like Google Bigtable, DynamoDB uses tables. Each entry has a hash key, a range key and a value.


The hash key uniquely identifies an item and is used for building an unordered hash index. Within the unordered hash indices, the data is sorted according to the range key. Range keys and hash keys together uniquely identify an item if the hash key references another table.

Notably, DynamoDB supports a last-writer-wins conflict resolution mechanism but also allows applications to handle conflict resolution themselves. The Amazon shopping cart implementation uses a custom merge-conflict resolution algorithm which suffers from consistency issues: removal operations can themselves be removed after replicas have synchronized.

1.4.3 Facebook Cassandra

The Facebook Cassandra [18] database is inspired by Google Bigtable and was designed by Facebook. Cassandra was initially created to solve what Facebook calls the Inbox Search problem: inbox search is a feature of Facebook that allows users to search through their Facebook inbox. Like Bigtable, Cassandra organizes its data through tables, where each table is a distributed multi-dimensional map indexed by a key. It also uses columns and column families; however, the value of each entry is not uninterpreted but strictly validated. Initial releases of Cassandra supported only insert, delete and get operations, but the Apache Foundation now maintains Cassandra and it has its own query language, CQL. CQL offers functionality similar to SQL, but it does not support many SQL features which require reading from multiple tables, such as joins. The concept of foreign keys does not exist in Cassandra. As with Bigtable, the last-writer-wins model is used for conflicting writes.

Cassandra forgoes the Bigtable master-slave approach and uses a peer-to-peer architecture.

1.4.4 Riak

Riak [16] is a CRDT-based database developed by Basho Technologies, Inc. It is one of the few popular open-source CRDT-based database systems in existence. It is labeled as a distributed, persistent and fault-tolerant system. Riak consists of three products: Riak KV is the CRDT-based key-value database, Riak TS enables processing of time-series data and, finally, Riak S2 offers a distributed object storage solution. This thesis focuses on Riak KV and uses the terms Riak KV and Riak interchangeably.

Riak KV is a peer-to-peer distributed key-value store. It is branded as an eventually consistent system using partial replication. Each key is a binary value and each value is either a CRDT or a primitive data type such as a string, number or binary. To date, Riak supports five different CRDTs [2]: flags, registers, counters, sets and maps.


1.4.5 Redis

Unlike the other systems mentioned in this chapter, Redis [20] was not designed as a distributed system, but it has clustering support using a master-slave architecture. Redis stands for REmote DIctionary Server and was initially introduced in 2009. Redis was created by Salvatore Sanfilippo, who is still the main developer.

Redis is an in-memory key-value database offering string keys mapped to values of different types. Supported value types include lists, sets, maps, sorted sets and primitive data types such as strings or numbers. For each of these data structures, Redis offers instructions to atomically modify them; examples of such instructions are the insertion operation for a specified set or the sorting operation for a specified list. Though primarily an in-memory database, Redis offers options to synchronize data to disk at an interval specified in seconds. Redis runs as a single process on a single core and has limited support for transactions: it does not offer any rollback support in case an error occurs.


Chapter 2

Conflict-Free Replicated Data Types

Conflict-Free Replicated Data Types (from here on, CRDTs) are a class of data structures following certain rules, originally introduced by Marc Shapiro and Nuno Preguiça [28]. This chapter shows an introductory CRDT example, explains the mathematical properties of CRDTs, explains the different kinds of CRDTs and their consistency properties, and ultimately discusses the Observed-Remove Set CRDT.

2.1 An Introductory Example

A simple yet powerful CRDT is the increment-only counter. Each replica stores its own version of the data structure. This distributed data structure allows each replica in the cluster to provide two methods: incrementing the counter and reading the current state of the counter. A typical use case for distributed counters is statistics, for example showing how many times a certain web page has been displayed.

Example 1 Increment-Only Counter CRDT

instance members
    n ← number of replicas
    v ← array of size n

function increment()
    id ← myID()
    v[id] ← v[id] + 1

function value()
    return sum(v)

function compare(X, Y)
    return (∀i ∈ [1, n] : X.v[i] ≤ Y.v[i])

function merge(X, Y)
    ∀i ∈ [1, n] : Z.v[i] = max(X.v[i], Y.v[i])
    return Z


As the reader can see, each replica has an array of integers and each slot represents how many times the increment function has been called at that replica. The myID function returns an identifier which is unique to each replica. The compare function allows two counters to be compared: if every entry in one counter is smaller than or equal to the corresponding entry in the other counter, it is considered smaller than or equal. To synchronize one replica with another, the merge function is used.
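To make the example concrete, the following is a minimal Scala sketch of such an increment-only counter (often called a G-Counter). The names GCounter, increment, value and merge are chosen for this illustration and are not taken from Snapple's source code.

// A minimal, immutable, state-based increment-only counter.
// `replicaId` indexes into the vector of per-replica counts.
final case class GCounter(counts: Vector[Long]) {

  def increment(replicaId: Int): GCounter =
    copy(counts = counts.updated(replicaId, counts(replicaId) + 1))

  // The value of the counter is the sum of all per-replica counts.
  def value: Long = counts.sum

  // Partial order: this <= that iff every slot is <= the corresponding slot.
  def compare(that: GCounter): Boolean =
    counts.zip(that.counts).forall { case (a, b) => a <= b }

  // Merge takes the element-wise maximum, which is commutative,
  // associative and idempotent.
  def merge(that: GCounter): GCounter =
    GCounter(counts.zip(that.counts).map { case (a, b) => a max b })
}

object GCounter {
  def empty(numReplicas: Int): GCounter = GCounter(Vector.fill(numReplicas)(0L))
}

object GCounterDemo extends App {
  val a = GCounter.empty(2).increment(0).increment(0) // replica 0 incremented twice
  val b = GCounter.empty(2).increment(1)              // replica 1 incremented once
  val merged = a.merge(b)
  println(merged.value)                   // 3
  println(a.merge(b) == b.merge(a))       // merge is commutative: true
  println(merged.merge(merged) == merged) // merge is idempotent: true
}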

CRDTs are used in peer-to-peer networks. This thesis does not discuss specific protocols used to synchronize replicas, since such protocols are not specific to CRDT-based systems but to peer-to-peer systems in general; however, one of the most popular protocols used for CRDT-based systems is the gossip protocol [31].

2.2 Mathematical Properties

All CRDTs define a merge operation which must satisfy three mathematical properties: commutativity, associativity and idempotence (⊔ denotes the merge operation):

    x ⊔ y = y ⊔ x              (commutativity)
    x ⊔ (y ⊔ z) = (x ⊔ y) ⊔ z  (associativity)
    x ⊔ x = x                  (idempotence)

These three properties solve two problems for the CRDT merge function: messages can be duplicated (idempotence) and messages can be delivered out of order (commutativity and associativity).

These constraints on the CRDT model define CRDTs as join-semilattices [13] (from here on, semilattices). A semilattice is a partially ordered set which has a join operation. The join operation joins two elements of the semilattice into a single least upper bound and is by definition commutative, associative and idempotent. Any two elements in the semilattice are partially ordered according to a binary relation. A semilattice is bounded if it has a least element which is smaller than all other elements.

When discussing CRDTs, the join operation translates to the merge function and the partial order relation is defined by the compare function. Each element in the semilattice represents a system state and the set of system states is partially ordered. According to the partial order of the semilattice, the system state is always growing. The operations exposed by the CRDT all grow the system state, and the merge function joins the system state with another state into their least upper bound. CRDT-defined semilattices are called monotonically increasing because of this property. This also implies that the system state never decreases and that CRDT clients never experience rollbacks.


2.3 State-based and Operation-based CRDTs

CRDTs originally come in two different models: state-based CRDTs and operation-based CRDTs. These two models require different CRDT implementations and place different requirements on the network communication.

State-based CRDTs are known as Convergent Replicated Data Types (CvRDTs). They send their full state to their peers and expose three abstract methods: a query operation to query the system state, an update operation to update the system state, and a merge operation which is, as mentioned above, commutative, associative and idempotent.

Operation-based CRDTs are known as Commutative Replicated Data Types (CmRDTs). They propagate their state to their peers by broadcasting the update operation, which must be commutative. The peers receive the broadcast and update their state locally. Because the update operation is commutative, messages may be received out of order, but it is not idempotent, so operation-based CRDTs require the communication medium to deliver messages without duplication.

This thesis only discusses state-based CRDTs, due to the constraint that operation-based CRDTs impose on the communication medium. In this thesis, the terms CRDT and state-based CRDT are used interchangeably.
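As a sketch of what the state-based interface amounts to in code, the following Scala trait is illustrative only; the names CvRDT and CounterState are not taken from Snapple.

// Query and update operations are specific to each CRDT, so only the two
// operations shared by all state-based CRDTs are captured here. merge must
// be commutative, associative and idempotent; compare must define a
// partial order on states.
trait CvRDT[Self <: CvRDT[Self]] { self: Self =>
  def merge(that: Self): Self        // join of two states (least upper bound)
  def compare(that: Self): Boolean   // true if this state precedes or equals that state
}

// The increment-only counter from section 2.1 fits this interface directly:
final case class CounterState(counts: Vector[Long]) extends CvRDT[CounterState] {
  def merge(that: CounterState): CounterState =
    CounterState(counts.zip(that.counts).map { case (a, b) => a max b })
  def compare(that: CounterState): Boolean =
    counts.zip(that.counts).forall { case (a, b) => a <= b }
}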

2.4 Consistency Properties

CRDTs guarantee strong eventual consistency. Eventual consistency ensures that all updates written by clients are eventually delivered to all replicas. Since CRDT merge operations are commutative and idempotent, if any two replicas receive the same set of updates, no matter in what order these updates were received or whether any duplicates exist, the two replicas will be in the same state. This guarantee is known as strong eventual consistency. Note that CRDTs rely on having a sufficiently good communication layer in order to deliver these guarantees.

A more formal definition of these consistency properties first requires causal history [25], in the context of CRDTs, to be defined. Shapiro et al. [27] define the causal history C of an object x as follows, for a replica xi of x:

• Initially, C(xi) = ∅

• After executing the update operation f: C(f(xi)) = C(xi) ∪ {f}

• After executing merge against states xi, xj: C(merge(xi, xj)) = C(xi) ∪ C(xj)

The causal history is defined as the set of update operations applied to a replica xi targeting the replica's instance of x. The happens-before relation [19] is defined as f → g ⟺ C(f) ⊂ C(g).

Shapiro goes on to define eventual convergence: for any two replicas xi, xj, an object x eventually converges if the following properties hold:


• Safety: ∀i, j : C(xi) = C(xj) implies that the abstract states of i and j are equivalent.

• Liveness: ∀i, j : f ∈ C(xi) implies that, eventually, f ∈ C(xj).

The safety property states that if two replicas xi and xj have the same causal history (the same update operations have been applied to them), their abstract states are equivalent. Abstract state equality is defined as all query operations on the object instance returning the same values. The liveness property states that if one replica xi has received an update operation f, f ∈ C(xi), it is eventually propagated to another replica xj, so that f ∈ C(xj).

Finally, in order to prove that CRDTs provide eventual and strong eventual consistency, Shapiro proposes the following proof, slightly altered to fit this thesis:

    Any two replicas xi, xj will converge, as long as they can exchange states by some (direct or indirect) channel that eventually delivers, by merging their states. Since CRDT values form a monotonic semilattice, merge is always enabled, and one can make xi' := merge(xi, xj) and xj' := merge(xj, xi). By the definition of causal history listed previously, xi' and xj' have the same causal history, since C(xi) ∪ C(xj) = C(xj) ∪ C(xi). Finally, we have equivalent abstract states xi' = xj', since the merge operation of CRDTs is commutative.

2.5 The Observed-Remove Set CRDT

Naturally, CRDTs are not deterministic; race conditions can occur between insertions and removals. Still, CRDTs can represent many useful data structures where one operation is allowed precedence over another. The set is one of the most fundamental data structures used in computer science and can be implemented as a CRDT. Multiple CRDT set implementations exist, such as the Growth-Only Set, the 2P-Set and the LWW-Element-Set. All of these implementations suffer natural shortcomings: the growth-only set allows insertions only, the 2P-Set allows the insertion and removal of an element only once, and the LWW-Element-Set relies on timestamps. The OR-Set does not rely on timestamps and allows elements to be inserted and removed multiple times.


Example 2 Observed-Remove Set CRDT

instance members
    I ← set of tuples (element e, token u)
    R ← set of tuples (element e, token u)

function contains(e)
    return (∃u : (e, u) ∈ I)

function elements()
    return {e | ∃u : (e, u) ∈ I}

function add(e)
    u ← unique()
    I ← I ∪ {(e, u)} \ R

function remove(e)
    T ← {(e, u) ∈ I}
    I ← I \ T
    R ← R ∪ T

function compare(X, Y)
    return ((X.I ∪ X.R) ⊆ (Y.I ∪ Y.R)) ∧ (X.R ⊆ Y.R)

function merge(X, Y)
    Z.I ← (X.I \ Y.R) ∪ (Y.I \ X.R)
    Z.R ← X.R ∪ Y.R
    return Z

The OR-Set is defined as two sets: the insertion set I and the removal set R. Each set contains tuples; each tuple contains an element e and a unique token u. The removal set R is also known as the tombstone set, since it only contains tuples which are no longer used to compute which elements are in the set. The set mimics non-monotonic operations by allowing both insertions and removals through its interface; however, each operation results in the system state growing according to the semilattice's partial order. An element e is a member of the OR-Set if there exists a tuple containing e in the insertion set I.

The OR-Set's memory complexity is O(i), where i is the number of insertion function calls made. The OR-Set prioritizes insertions over removals, as can be seen if one replica receives an insertion update inserting an element e, another replica receives a removal update removing the element e, and the two replicas are merged: the resulting set will then contain the element e.
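The following is a minimal Scala sketch of Example 2; the class name ORSet and its representation are chosen for this illustration and are not Snapple's actual implementation. Note how the add-wins behaviour follows from merge, which only drops an insertion tuple when the other replica holds it in its tombstone set R.

import java.util.UUID

// A minimal, immutable sketch of the OR-Set from Example 2.
final case class ORSet[E](inserted: Set[(E, UUID)], removed: Set[(E, UUID)]) {

  def contains(e: E): Boolean = inserted.exists(_._1 == e)

  def elements: Set[E] = inserted.map(_._1)

  // Every add creates a fresh unique token for the element.
  def add(e: E): ORSet[E] =
    copy(inserted = inserted + ((e, UUID.randomUUID())))

  // Remove every observed tuple for e and move it to the tombstone set.
  def remove(e: E): ORSet[E] = {
    val observed = inserted.filter(_._1 == e)
    ORSet(inserted -- observed, removed ++ observed)
  }

  // An insertion tuple survives unless the other replica has tombstoned it.
  def merge(that: ORSet[E]): ORSet[E] =
    ORSet(
      (inserted -- that.removed) ++ (that.inserted -- removed),
      removed ++ that.removed)
}

object ORSet {
  def empty[E]: ORSet[E] = ORSet(Set.empty, Set.empty)
}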

This thesis focuses primarily on the OR-Set CRDT, since the set is such a fundamental data structure and since CRDT update operations are non-deterministic. Other common data structures which are ordered, such as the list, cannot be implemented correctly in a non-deterministic environment without using locks [17].


Chapter 3

The Optimized Observed-Remove Set CRDT

The original implementation of the OR-Set listed in chapter 2 falls short regarding memory complexity. The memory complexity of CRDTs is of paramount importance given that, each time a propagation occurs, the full CRDT state must be serialized, transported over the network, deserialized and merged.

There exists another implementation of the OR-Set whose memory complexity is bounded by the number of elements in the set and the number of replicas used. This implementation is known as the Optimized OR-Set (Opt-OR-Set) and was introduced by Bieniusa et al. [4].

This chapter discusses the shortcomings of the original OR-Set, version vectors and, finally, the Optimized Observed-Remove Set.

3.1 Shortcomings of the original OR-Set CRDT

One inherent weakness of the original OR-Set implementation's internal structure is that every invocation of the insertion operation adds an entry to the insertion set I. The consequence of this behavior is that the OR-Set's memory complexity is bounded not by the number of elements in the set but by the number of insertion operation invocations. Without modification, the chapter 2 implementation of the OR-Set requires additional garbage collection mechanisms to be able to operate in production environments; otherwise the user risks an out-of-memory error caused by too many insertion operation invocations.

3.2 Version Vectors

A Version Vector [22] tracks changes in a distributed system where each replica can update data independently of the others. Version vectors allow systems to determine whether one update happened before another or whether they happened concurrently and might conflict.


Version vectors are used in the Optimized OR-Set implementation in Snapple, but they are left out of the pseudo-code listed in the next section for brevity.

Example 3 Version Vector CRDT

instance members
    n ← number of replicas
    v ← array of size n

function query(replica)
    return v[replica]

function update(replica)
    v[replica] ← v[replica] + 1

function compare(X, Y)
    return (∀i ∈ [1, n] : X.v[i] ≤ Y.v[i])

function order(X, Y)
    equal ← X = Y
    before ← (∀i ∈ [1, n] : X.v[i] ≤ Y.v[i]) ∧ (∃i ∈ [1, n] : X.v[i] < Y.v[i])
    after ← (∀i ∈ [1, n] : Y.v[i] ≤ X.v[i]) ∧ (∃i ∈ [1, n] : Y.v[i] < X.v[i])
    concurrent ← (¬equal ∧ ¬before ∧ ¬after)
    return (equal, before, after, concurrent)

function merge(X, Y)
    ∀i ∈ [1, n] : Z.v[i] = max(X.v[i], Y.v[i])
    return Z

Version vector implementations contain an array of size n, where n is the number of replicas used in the distributed system. Each replica has its own unique identifier, ranging from one to n. A version vector is a type of CRDT and is very similar to the increment-only counter.

The order function can return four different states: one version vector can be before another, after another, equal to another, or concurrent with it.

The version vector CRDT has a memory complexity of O(n), where n is the number of replicas.
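A compact Scala sketch of Example 3 could look as follows; the names are illustrative and do not come from the Snapple source.

// A minimal, immutable version vector following Example 3.
final case class VersionVector(v: Vector[Long]) {

  def query(replica: Int): Long = v(replica)

  def update(replica: Int): VersionVector =
    copy(v = v.updated(replica, v(replica) + 1))

  // Returns which of the four relations holds between this vector and that one.
  def order(that: VersionVector): String = {
    val le = v.zip(that.v).forall { case (a, b) => a <= b }
    val ge = v.zip(that.v).forall { case (a, b) => a >= b }
    (le, ge) match {
      case (true, true)   => "equal"
      case (true, false)  => "before"
      case (false, true)  => "after"
      case (false, false) => "concurrent"
    }
  }

  def merge(that: VersionVector): VersionVector =
    VersionVector(v.zip(that.v).map { case (a, b) => a max b })
}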

3.3 Introducing the Optimized OR-Set

The Opt-OR-Set implementation offers improved memory complexity over the original OR-Set implementation. The Opt-OR-Set removes the use of tombstones: there is no need to store information about removed elements. Other CRDT set implementations, such as Wuu's 2P-Set [35], garbage collect tombstones after all replicas have received them, but this requires sophisticated callback mechanisms and adds complexity.

If the original OR-Set implementation is used in a system which contains only one replica, the removal set R is not necessary, since no merge operations occur: an element could be removed directly from the insertion set I, just as with a regular set.


Furthermore, no tuples would be needed; a regular set would suffice. However, this approach falls short when the system contains more than one replica and two of its replicas A and B merge: if only replica A contains an element e but replica B does not, we need to know whether e was recently removed in B and that removal has not yet propagated to A, or whether e was recently inserted into A and that insertion has not yet propagated to B.

Example 4 Optimized OR-Set CRDT

instance members
    n ← number of replicas
    v ← array of size n
    E ← set of tuples (element e, timestamp t, replica index r)

function contains(e)
    return (∃t, r : (e, t, r) ∈ E)

function elements()
    return {e | ∃t, r : (e, t, r) ∈ E}

function add(e)
    id ← myID()
    t ← v[id] + 1
    v[id] ← t
    T ← {(e, t', id) ∈ E | t' < t}
    E ← E ∪ {(e, t, id)} \ T

function remove(e)
    T ← {(e, t, r) ∈ E}
    E ← E \ T

function compare(X, Y)
    R ← {(t, r) | 0 ≤ t ≤ X.v[r] ∧ ∄e : (e, t, r) ∈ X.E}
    R' ← {(t, r) | 0 ≤ t ≤ Y.v[r] ∧ ∄e : (e, t, r) ∈ Y.E}
    return (∀i ∈ [1, n] : X.v[i] ≤ Y.v[i]) ∧ R ⊆ R'

function merge(X, Y)
    M ← X.E ∩ Y.E
    M' ← {(e, t, r) ∈ X.E \ Y.E | t > Y.v[r]}
    M'' ← {(e, t, r) ∈ Y.E \ X.E | t > X.v[r]}
    U ← M ∪ M' ∪ M''
    T ← {(e, t, r) ∈ U | ∃(e, t', r) ∈ U : t < t'}
    Z.E ← U \ T
    Z.n ← X.n
    ∀i ∈ [1, n] : Z.v[i] = max(X.v[i], Y.v[i])
    return Z

Each replica in the Opt-OR-Set maintains both a vector v and an element set E with tuples of three components: an element e, a timestamp t and a replica index r. If an element e is present in a tuple in E, it is present in the Opt-OR-Set.


When an element e is removed from the set, all tuples in the element set E containing element e are removed.

When an element e is added, the vector entry for the current replica with replica index id is incremented. The new timestamp value t in the vector for the current replica is then used to remove all entries in E with element e, a lower timestamp than t, and replica index id. Finally, a new tuple with the element e, the timestamp t and the replica index id is added to E.

When merging two Opt-OR-Sets X and Y into a new set Z, an element e should only be present in the final set Z if it is present in both X and Y, or if it exists in X and was not recently removed in Y, or vice versa. The latter two cases say that if one set has seen a replica adding the element, but this addition has not yet been merged into the other set, the element will still be present in the final set. The two vectors X.v and Y.v are merged as well.

When comparing two Opt-OR-Sets X and Y, two properties are considered. If a replica with index r has the current timestamp t, but only t − j elements exist with replica index r, it has witnessed j removals. The pairs of timestamps and replica indices which represent witnessed removals are filtered out into a set R for X and R' for Y. To compare X and Y, the two vectors X.v and Y.v and the witnessed-removal sets R and R' are compared.

The Opt-OR-Set has memory complexity O(n · |elements| + n), where n is the number of replicas and |elements| is the number of elements present in the Opt-OR-Set.
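A compact Scala sketch of Example 4's add, remove and merge operations follows; the class name OptORSet and its representation are chosen for illustration and are not Snapple's actual implementation.

// A compact, immutable sketch of the Optimized OR-Set (Example 4).
// Each tuple carries the element, the per-replica timestamp and the replica index.
final case class OptORSet[E](
    vector: Vector[Long],            // one logical clock entry per replica
    entries: Set[(E, Long, Int)]) {  // (element, timestamp, replica index)

  def elements: Set[E] = entries.map(_._1)

  def add(e: E, replicaId: Int): OptORSet[E] = {
    val t = vector(replicaId) + 1
    // Drop older tuples for e written by this replica, then add the new one.
    val kept = entries.filterNot { case (el, ts, r) => el == e && r == replicaId && ts < t }
    OptORSet(vector.updated(replicaId, t), kept + ((e, t, replicaId)))
  }

  def remove(e: E): OptORSet[E] =
    copy(entries = entries.filterNot(_._1 == e))

  def merge(that: OptORSet[E]): OptORSet[E] = {
    val common = entries intersect that.entries
    // Keep a one-sided tuple only if the other side has not yet seen it
    // (its timestamp is newer than the other side's clock for that replica).
    val onlyMine  = (entries diff that.entries).filter { case (_, t, r) => t > that.vector(r) }
    val onlyTheir = (that.entries diff entries).filter { case (_, t, r) => t > vector(r) }
    val union = common ++ onlyMine ++ onlyTheir
    // Per element and replica, keep only the newest timestamp.
    val newest = union.filter { case (el, t, r) =>
      !union.exists { case (el2, t2, r2) => el2 == el && r2 == r && t2 > t }
    }
    OptORSet(vector.zip(that.vector).map { case (a, b) => a max b }, newest)
  }
}

object OptORSet {
  def empty[E](numReplicas: Int): OptORSet[E] =
    OptORSet(Vector.fill(numReplicas)(0L), Set.empty)
}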


Chapter 4

Snapple: A distributed, fault-tolerant, in-memory key-value store

Snapple [30] offers a distributed database for storing key-value entries in memory, running on the JVM [21] (Java Virtual Machine). Each replica is identified by a hostname and a port. As soon as one replica knows about another replica, it schedules recurring synchronization of its data with that replica. Replicas can be added and removed dynamically, and the system is designed such that all replicas know all other replicas.

This chapter discusses Snapple's data model, its exposed API and its implementation details.

4.1 Data Model

Snapple is a key-value database, using Unicode-encoded strings as keys and CRDTs as values. The full data structure is represented by an in-memory map boxed into a wrapper class. Each invocation creating or removing an entry in the map results in a compare-and-swap (CAS) operation [1] on the map. The merge operation, where two different maps are merged, also results in a CAS operation on the map.


Example 5 Snapple Map Update

instance members
    atomicMap ← atomicMapReference

function update(lambda)
    success ← false
    repeat
        current ← atomicMap.current()
        modified ← lambda(current)
        success ← atomicMap.compareAndSwap(current, modified)
    until success

Two CRDTs are supported: the optimized OR-Set and the version vector. CRDTs are boxed into a wrapper class. Each modification of a CRDT results in a CAS operation on the wrapper class, not on the key-value map. This removes load from the key-value wrapper instance and moves it to the CRDT wrapper instance. Since each map holds multiple CRDTs, this reduces CAS failures and increases throughput. Snapple is extendable and more CRDTs can easily be added.
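The retry loop in Example 5 can be sketched in Scala with java.util.concurrent.atomic.AtomicReference as follows; AtomicStore and its methods are named for this illustration and are not Snapple's actual classes.

import java.util.concurrent.atomic.AtomicReference

// Sketch of the CAS retry loop from Example 5 over an immutable map.
// The loop re-reads the current map, applies the modification, and retries
// until no other thread has swapped the reference in between.
final class AtomicStore[K, V](initial: Map[K, V]) {
  private val ref = new AtomicReference[Map[K, V]](initial)

  @annotation.tailrec
  final def update(f: Map[K, V] => Map[K, V]): Map[K, V] = {
    val current = ref.get()
    val modified = f(current)
    if (ref.compareAndSet(current, modified)) modified
    else update(f) // another writer won the race; retry with the fresh state
  }

  def snapshot: Map[K, V] = ref.get()
}

// Usage: val store = new AtomicStore(Map.empty[String, Int])
//        store.update(_ + ("key" -> 1))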

Each replica knows about its peers through an internal entry containing an OR-Set of string values. The set includes all known replicas in the cluster in the form of hostname and port. If a new replica is to be added, or a malfunctioning replica is to be removed, a client can add or remove a host at a replica of their choice and the change will propagate through the system.

All data structures used are functional: there are no mutable classes except for the atomic references used in the wrapper classes. Functional data structures allow for easier concurrency since there is no need for locks. Furthermore, functional data structures can share memory, which can reduce memory usage.

4.2 Snapple API

Snapple exposes its API through a single SDK, the Snapple Scala client SDK. The SDK can be found in the source code repository. The Snapple API offers multiple methods to communicate with a Snapple instance.

Snapple clients are asynchronous; they do not block the calling thread while waiting for a response from the Snapple database. Instead, they return Scala futures [14] which hold the result of the asynchronous computation. A Scala future holds a value which becomes available at some point, given that the future does not fail. A callback can be provided to the future to process the result when it becomes available.
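As a generic illustration of this style (not the actual Snapple API), a Scala future can be consumed with a callback as follows:

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Success, Failure}

// The body stands in for a response from the database.
val pending: Future[Int] = Future {
  42
}

// The callback runs when the result becomes available or the future fails.
pending.onComplete {
  case Success(value) => println(s"got $value")
  case Failure(error) => println(s"request failed: $error")
}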


Snapple clients are initialized by specifying one or more replica addresses.

// Connect to a single host
val client = SnappleClient.singleHost("http://myhost.com", 5001)

// Disconnect the client
client.disconnect

The ping method allows clients to test their connection.

// The future fails if communication fails, otherwise it returns Unit
val res: Future[Unit] = client.ping

The createEntry method creates a new entry with the specified key. The second argument specifies which type of CRDT should be used. The third argument specifies the type the CRDT's elements should have.

// The future fails if communication fails.
// Returns true if the entry was successfully created, or false if the key
// entry already exists or the method call was malformed.
val key = "key"
val res: Future[Boolean] =
  client.createEntry(key, ORSetDataKind, LongElementKind)

The modifyEntry method modifies the entry with the specified key. The second argument specifies which kind of modification should occur, and the third argument specifies an optional element. The third argument is optional since not all modifications need an element.

// The future fails if communication fails.
// Returns true if the entry was successfully modified, or false if the key
// entry does not exist or the method call was malformed.
val key = "key"
val element = 1L
val res: Future[Boolean] =
  client.modifyEntry(key, AddOpKind, Some(element))

The entry method reads the entry with the specified key.

// The future fails if communication fails.
// Returns a filled optional value if the key entry exists,
// otherwise an empty optional value.
// The SnappleEntry contains the value and its type.
val key = "key"
val res: Future[Option[SnappleEntry]] = client.entry(key)

The removeEntry method removes the entry with the specified key.

// The future fails if communication fails.
// Returns true if the entry was successfully removed, or false if the key
// entry does not exist.
val key = "key"
val res: Future[Boolean] = client.removeEntry(key)


A longer example follows, creating an entry, modifying it and then reading it.

val key = "key"

client
  .createEntry(key, ORSetDataKind, LongElementKind)
  .onSuccess {
    case true =>
      val element = 1L

      client
        .modifyEntry(key, AddOpKind, Some(element))
        .onSuccess {
          case true =>
            client
              .entry(key)
              .onSuccess {
                case Some(entry) =>
                  println(s"Read entry $entry for key $key")
                case None =>
                  println(s"Couldn't read entry with key $key")
              }
          case false =>
            println(s"Couldn't modify entry with key $key")
        }
    case false =>
      println(s"Couldn't create entry with key $key")
  }

Further examples of API usage and the various flags and types supported can be found in the source code repository; inspecting the IO test suites in particular might be useful.

4.3 Implementation Details

Here follows an in-depth explanation of the core features of Snapple.

4.3.1 Synchronization Method

Synchronization occurs at a configurable interval measured in seconds. When synchronization occurs, each replica sends all its data to all other replicas. Replicas then merge the received data with their own. If any replica does not respond successfully to the merge, it is retried at the next cycle as long as it is still in the peer OR-Set. All replicas form a fully connected graph.
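The following sketch illustrates one way such a periodic, push-based synchronization loop could look. The Peer trait, the Map[String, String] state type and the SyncLoop class are hypothetical simplifications; the real implementation sends CRDT states over Finagle/Thrift as described in the next section.

import java.util.concurrent.{Executors, TimeUnit}

// Hypothetical peer transport; the real system sends state over Finagle/Thrift.
trait Peer {
  def sendState(state: Map[String, String]): Boolean // true if the peer accepted the merge
}

final class SyncLoop(
    intervalSeconds: Long,
    peers: () => Set[Peer],
    localState: () => Map[String, String]) {

  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  def start(): Unit =
    scheduler.scheduleAtFixedRate(new Runnable {
      def run(): Unit =
        // Push the full state to every known peer; a peer that fails simply
        // gets retried on the next cycle, as long as it is still in the peer set.
        peers().foreach { peer =>
          try peer.sendState(localState())
          catch { case _: Exception => () }
        }
    }, intervalSeconds, intervalSeconds, TimeUnit.SECONDS)

  def stop(): Unit = scheduler.shutdown()
}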

4.3.2 Communication Protocol

Snapple uses Finagle [32] as its communication system. Finagle is developed by Twitter Inc. and is, according to the Finagle website, an extensible RPC system for the JVM used to construct high-concurrency servers.


Finagle implements uniform client and server APIs for multiple protocols.

Finagle supports multiple protocols, including the Apache Thrift protocol [29], which Snapple uses. Apache Thrift allows specifying the communication protocol in an interface definition language (IDL) and provides a binary that generates client and server code for communicating through that IDL. Thrift is language-agnostic, but only Scala is used in Snapple. Thrift does not officially support Scala, but Twitter provides its own open-source tool, Scrooge [33], which does. Snapple uses non-blocking IO and communicates with a binary protocol, resulting in very little overhead.

Snapple’s Thrift specification file is listed in the appendix (figure A.1).

4.3.3 Security Model

Snapple is designed to be accessed by trusted clients inside trusted environments. This means that Snapple should not be exposed directly to the Internet or anywhere else malicious clients might reside; instead it should be run on a machine with an IP address whitelist or similar. Snapple does not provide authentication, encryption or IP whitelists.

A typical use case for Snapple is a session store. Each Snapple key is a session token and each value is an OR-Set containing data about the session. In this case the application developer might run a service responsible for user authentication management. This service is exposed to the Internet for REST API calls, but other application services also use it for user authentication. This service is the only one which needs to communicate with the Snapple session store, and hence its IP address would be the only address allowed on the Snapple host machine.

In the example above, the user authentication management service acts as a mediator between the Snapple session store and potentially dangerous clients. This security model allows Snapple to focus on simplicity and performance. Redis, the popular in-memory key-value store, uses the same security philosophy and a similar model.

4.3.4 Client SDK

Snapple provides a single client SDK (Software Development Kit), written in Scala and showcased above. Since Apache Thrift is language-agnostic, SDKs in other programming languages can be created with ease. The client SDK allows reading, manipulating, creating and removing entries in Snapple. The client SDK can either connect to a single Snapple replica or, if provided with a list of replica addresses, connect to one replica in the list and then, if that replica fails, seamlessly switch to another replica without any custom logic.

The client SDK sticks to one replica as long as it can; otherwise it might experience consistency issues, since Snapple is an eventually consistent system.


If a replica B has not received the data written to another replica A before the same data is read by a client, the client will experience consistency issues if it originally wrote the data to A.
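A hypothetical sketch of this failover behaviour is shown below; SimpleClient, FailoverClient and their methods are illustrative stand-ins, not the actual SDK classes.

// Hypothetical single-replica client; stands in for the real Scala SDK.
trait SimpleClient {
  def ping(): Boolean
  def close(): Unit
}

// Sticks to one replica as long as it responds, and only rotates to the
// next address in the list when the current replica stops answering.
final class FailoverClient(addresses: Vector[String], connect: String => SimpleClient) {
  private var index = 0
  private var current: SimpleClient = connect(addresses(index))

  def withClient[A](f: SimpleClient => A): A =
    try f(current)
    catch {
      case _: Exception =>
        current.close()
        index = (index + 1) % addresses.size // move on to the next replica
        current = connect(addresses(index))
        f(current)
    }
}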

4.3.5 Utility Tools

Many popular databases come with a set of management tools which relieve users from writing custom programs to perform menial tasks. Snapple provides two utility tools to help users manage and measure their Snapple instances: a command line interface (CLI) tool to help manage Snapple data, and a benchmarking program allowing users to benchmark Snapple performance.

The CLI tool allows adding, reading and removing data entries. It also allows users to read the current replica host list and to read all entries at once, which can be useful for debugging. The CLI tool is also located in the Snapple source repository and uses the client SDK to communicate with Snapple.

The benchmark tool is inspired by the Redis benchmark tool and allows users to measure the performance of their Snapple setups. The tool acts as one or multiple clients executing in either a parallel or a sequential fashion and supports three tests: creating new key-value entries, reading entries, and adding elements to an OR-Set entry. The size in bytes of the data sent in the add and read methods can also be modified.

The performance measurements in chapter five use the Snapple benchmark tool extensively.


Chapter 5

Performance Evaluation

Evaluating an in-memory database is nontrivial; however, two core aspects of evaluating any system are performance measurements and features. The previous chapter presented and explained Snapple's features. This chapter presents performance measurements. The environment and setup used are also discussed, as well as how to reproduce the measurement results.

Snapple is benchmarked locally against two other popular databases: Redis [24], an in-memory database, and Riak, another CRDT-based database. The performance measurements cover two distinct traits, throughput and latency. Throughput can be described as how many requests can be processed per second in parallel. Latency can be described as how fast a single request is processed.

The goal of performance measurements on any system is not to find the actual throughput or latency one would experience in a production system. Instead, the purpose is to get an idea of how the systems compare against each other and to get ballpark numbers for each system's latency and throughput.

5.1 Environment and Setup

The computer used to conduct the performance measurements was an Asus desktop running Ubuntu Linux 16.04 and Java 8, with an Intel i7-6700T quad-core CPU at 2.8 GHz and 16 GB of RAM. Each database is installed locally and uses a single replica; hence there is no synchronization between replicas to consider. Snapple, Redis and Riak provide their own measurement utilities and these are used to conduct the performance measurements.

The Snapple and Redis measurements are set up as follows: each measurement uses one of three methods, either creating a new key-value entry, reading an existing entry, or adding an element to an OR-Set entry. Each measurement also uses either 1, 10, 50 or 100 clients. The get method uses either 10 bytes or one kilobyte as payload size. The add method uses 20 bytes as payload size. Naturally, the method creating new key-value entries has no payload. The 20-byte payload size for the add method was selected due to a limitation in the Redis measurement tool.


The Riak measurements use 50 clients and a fixed payload size of 10 bytes for all three methods.

The Redis database version used was 3.0.7 and the Riak database version used was 2.1.4. Riak used the Riak protocol buffer interface driver.

The performance measurements only use one replica per database, since it is practically difficult to measure the impact of replication synchronization. All three database systems synchronize replicas in the background; hence they are all eventually consistent.

5.2 Reproducibility

This section covers how to reproduce the performance measurement results for each of the three databases. Since each tool is, in a sense, deterministic, reproductions should obtain similar ratios between the databases. The exact number of requests per second processed, however, might differ quite drastically.

To reproduce the Snapple performance measurements, start a Snapple instance on your local machine, running on port 9000. Run the following commands in the project root folder of Snapple and save the output:

# To record latency, re-run each command with the --sequential flag;
# however, use a smaller request count since each request
# runs sequentially.

# note only "ADD" sections
sbt "project snapple-benchmark" "run -c 1 -r 100000 -d 20"
sbt "project snapple-benchmark" "run -c 10 -r 100000 -d 20"
sbt "project snapple-benchmark" "run -c 50 -r 100000 -d 20"
sbt "project snapple-benchmark" "run -c 100 -r 100000 -d 20"

# note "GET" and "CREATE" sections
sbt "project snapple-benchmark" "run -c 1 -r 100000 -d 10"
sbt "project snapple-benchmark" "run -c 10 -r 100000 -d 10"
sbt "project snapple-benchmark" "run -c 50 -r 100000 -d 10"
sbt "project snapple-benchmark" "run -c 100 -r 100000 -d 10"

# note "GET" sections
sbt "project snapple-benchmark" "run -c 1 -r 100000 -d 1000"
sbt "project snapple-benchmark" "run -c 10 -r 100000 -d 1000"
sbt "project snapple-benchmark" "run -c 50 -r 100000 -d 1000"
sbt "project snapple-benchmark" "run -c 100 -r 100000 -d 1000"

To reproduce the Redis performance measurements, start a Redis instance on your local machine, running on port 6379. Use the following commands and save the output:


redis-benchmark -c 1 -n 100000 -r 100000 -d 10 -t set,get,sadd
redis-benchmark -c 10 -n 100000 -r 100000 -d 10 -t set,get,sadd
redis-benchmark -c 50 -n 100000 -r 100000 -d 10 -t set,get,sadd
redis-benchmark -c 100 -n 100000 -r 100000 -d 10 -t set,get,sadd

redis-benchmark -c 1 -n 100000 -r 100000 -d 1000 -t get
redis-benchmark -c 10 -n 100000 -r 100000 -d 1000 -t get
redis-benchmark -c 50 -n 100000 -r 100000 -d 1000 -t get
redis-benchmark -c 100 -n 100000 -r 100000 -d 1000 -t set

To reproduce the Riak performance measurements, start a Riak instance on your local machine, running on port 8087 for the protocol buffer interface. Run the following commands; the results will be saved into the bench-results folder as CSV files, ordered by date and time. The configuration files mentioned in these commands can be found in the appendix (figure A.2).

basho_bench --results-dir ~/bench-results/ ~/get.config
basho_bench --results-dir ~/bench-results/ ~/create.config
basho_bench --results-dir ~/bench-results/ ~/add.config

5.3 Performance Measurement Results

Latency performance measurement results are not displayed in any graph, since the benchmark tools for both Snapple and Redis only report latency in whole milliseconds. At least 99.5 percent of all requests during all measurements took at most one millisecond for both Redis and Snapple. Riak averaged a latency of 11.4 milliseconds per request across all requests.

Database (share of requests measured)    Snapple (99.5%)    Redis (99.5%)    Riak (100%)
Average latency                          ≤ 1 ms             ≤ 1 ms           11.4 ms

Figure 5.1. Request Latency Performance Results (milliseconds)

Throughput performance measurement results are displayed in the five graphs below.


[Figure 5.2. Databases Throughput Performance Results. Four panels plot requests per second against the number of clients (1, 10, 50, 100) for Snapple and Redis: the get method with a 10 B payload, the get method with a 1 KB payload, the add method with a 20 B payload, and the create method. A fifth panel plots requests per second per request method (get, create, add) for Riak with 50 clients and a 10 B payload.]


The first four graphs show Snapple performance in blue and Redis performance in red. The y-axis displays the number of requests processed per second and the x-axis displays the number of simultaneous clients. All four graphs show that Snapple has between 2.5 and 5 times lower throughput than Redis. It is also notable that the performance of the read operation using 10 bytes versus one kilobyte does not differ much, for either Snapple or Redis. This shows that payload size, up to one kilobyte, is not a performance bottleneck. The four graphs are quite similar and show the same scale of performance difference between Snapple and Redis.

The final graph displays Riak performance in green. The y-axis displays the number of requests processed per second and the x-axis displays which request method was used. Riak performance was not plotted together with the other two databases since it was orders of magnitude slower.


Chapter 6

Conclusion

This chapter discusses the thesis results, which are divided into two categories: features and performance. Furthermore, this chapter provides a discussion of future work, both in the field of CRDTs and for Snapple.

6.1 Features

Snapple's greatest feature is that, unlike many similar databases, it is based on CRDTs. This gives Snapple the ability to handle merge conflicts more gracefully than last-writer-wins databases. Snapple is a proof-of-concept database which can be extended with powerful features, but as of now it lacks many of the features required for production use. In the future this might change.
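To make the difference concrete, the following is a minimal sketch that contrasts a last-writer-wins register with a basic, tombstone-based OR-Set (not the optimized variant Snapple actually uses); the class and method names are illustrative only and do not correspond to Snapple's internal API.

// Illustrative sketch only; these names do not correspond to Snapple's internals.

// Last-writer-wins: whichever write carries the later timestamp survives a merge,
// so one of two concurrent updates is silently discarded.
final case class LwwRegister[A](value: A, timestamp: Long) {
  def merge(that: LwwRegister[A]): LwwRegister[A] =
    if (that.timestamp > this.timestamp) that else this
}

// Basic (non-optimized) observed-remove set: every add carries a unique tag, a
// remove tombstones only the tags it has observed, and a merge unions both maps,
// so a concurrent add survives a concurrent remove ("add wins").
final case class OrSet[A](
    added: Map[A, Set[String]] = Map.empty[A, Set[String]],
    removed: Map[A, Set[String]] = Map.empty[A, Set[String]]) {

  def add(elem: A, tag: String): OrSet[A] =
    copy(added = added.updated(elem, added.getOrElse(elem, Set.empty) + tag))

  def remove(elem: A): OrSet[A] =
    copy(removed = removed.updated(
      elem, removed.getOrElse(elem, Set.empty) ++ added.getOrElse(elem, Set.empty)))

  def merge(that: OrSet[A]): OrSet[A] =
    OrSet(union(added, that.added), union(removed, that.removed))

  def contains(elem: A): Boolean =
    (added.getOrElse(elem, Set.empty) -- removed.getOrElse(elem, Set.empty)).nonEmpty

  private def union(
      a: Map[A, Set[String]],
      b: Map[A, Set[String]]): Map[A, Set[String]] =
    (a.keySet ++ b.keySet).map { k =>
      k -> (a.getOrElse(k, Set.empty[String]) ++ b.getOrElse(k, Set.empty[String]))
    }.toMap
}

If one replica removes an element while another concurrently adds it again with a fresh tag, the merged OrSet still contains the element, whereas the LwwRegister keeps only whichever of the two writes happened to carry the later timestamp.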

6.2 Performance

Snapple has much lower throughput than Redis: the performance measurements showed a difference of 2.5x to 5x. Redis is an incredibly fast, robust and battle-tested database written in C. Snapple, including its benchmark tool, is written in Scala, which is a slower language than C [15]. Snapple does, however, offer the strengths of CRDTs, which Redis does not. It is also notable that the per-request latency for both Redis and Snapple falls under one millisecond, which is very fast.

Comparing Snapple and Riak shows that Riak has a much higher latency and a much lower throughput than Snapple; however, Riak is a rather sophisticated system and offers far more functionality than Snapple. Snapple performed 10x to 30x faster than Riak depending on the request method used. If Riak users want an in-memory database for more latency-sensitive tasks but still want the benefits of a CRDT-based database, Snapple might be an alternative.


6.3 Future Work

This section is divided into two parts, one discussing the future of CRDTs and one discussing Snapple.

6.3.1 CRDTs

Not all CRDTs are suitable for production usage. The Opt-OR-Set is complex and not as easy to understand as the standard OR-Set implementation; still, this complexity is needed to avoid out-of-memory errors. Furthermore, CRDTs need to send their full data structure to their peers when synchronizing, which is quite limiting if the CRDT holds many elements. There has been some work on sending only partial CRDTs [26] to avoid this. Another weakness of CRDTs is the absence of partial replication, an often necessary requirement when working with Internet-scale applications. Crain et al. recently published an article regarding eventual consistency and partial replication [9].
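As an illustration of what full-state synchronization means, the following is a minimal sketch (not Snapple's implementation) of a state-based version vector where every gossip round ships the entire state; the send callback is hypothetical.

// Minimal sketch, not Snapple's implementation. A state-based version vector:
// merging takes the point-wise maximum, the least upper bound in the lattice of
// version vectors, so merges are commutative, associative and idempotent.
final case class VersionVector(versions: Map[String, Long] = Map.empty) {

  def increment(replicaId: String): VersionVector =
    VersionVector(versions.updated(replicaId, versions.getOrElse(replicaId, 0L) + 1L))

  def merge(that: VersionVector): VersionVector =
    VersionVector(
      (versions.keySet ++ that.versions.keySet).map { id =>
        id -> math.max(versions.getOrElse(id, 0L), that.versions.getOrElse(id, 0L))
      }.toMap)
}

object FullStateGossipSketch {
  // Full-state gossip: the whole structure is serialized and sent every round.
  // The payload grows with the number of replicas (and, for container CRDTs such
  // as sets, with the number of elements ever added), which is exactly what
  // delta CRDTs [26] avoid by shipping only the changes since the last round.
  def gossipRound(local: VersionVector, send: VersionVector => Unit): Unit =
    send(local)
}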

6.3.2 Snapple

Snapple is first and foremost a proof-of-concept database evaluating the usage of CRDTs in an in-memory key-value database. If Snapple were developed further, it would need more resilience features, such as dumping all of its data to disk when the server shuts down. More client libraries in different programming languages would need to be developed, the Snapple server binary would need to be packaged and released to popular software repositories, and support for more CRDTs would need to be implemented. Furthermore, it would be interesting to investigate whether partial CRDTs could be sent [26] when synchronizing replicas. Partial replication might be too complex a feature for Snapple to support, since it would require a sophisticated implementation and would probably imply a large overhead.
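As a hint of what such a resilience feature could look like, the following is a minimal sketch of a JVM shutdown hook that serializes an in-memory map to disk. It is a hypothetical illustration, not part of Snapple, and assumes every stored value is serializable.

import java.io.{FileOutputStream, ObjectOutputStream}
import java.util.concurrent.ConcurrentHashMap

// Hypothetical sketch, not part of Snapple: dump the in-memory map to disk when
// the JVM shuts down. Assumes every stored value implements java.io.Serializable.
object ShutdownDumpSketch {

  val store = new ConcurrentHashMap[String, java.io.Serializable]()

  def installShutdownHook(path: String): Unit =
    Runtime.getRuntime.addShutdownHook(new Thread(new Runnable {
      override def run(): Unit = {
        val out = new ObjectOutputStream(new FileOutputStream(path))
        try out.writeObject(store) finally out.close()
      }
    }))
}

On restart, the dump could be read back with an ObjectInputStream before the server starts accepting requests.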

Furthermore, benchmarking and analyzing performance as the number of active replicas in the cluster increases would be interesting.


Bibliography

[1] D. Alistarh, K. Censor-Hillel, and N. Shavit. Are Lock-free Concurrent Algorithms Practically Wait-free? In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, STOC ’14, pages 714–723, New York, NY, USA, 2014. ACM.

[2] Basho Technologies, Inc. Riak KV Whitepaper. http://basho.com/wp-content/uploads/2015/04/RiakKV-Enterprise-Technical-Overview-6page.pdf. Accessed: 2016-05-12.

[3] Basho Technologies, Inc. Riak Website. http://basho.com/products/riak-kv/. Accessed: 2016-05-09.

[4] A. Bieniusa, M. Zawirski, N. M. Preguiça, M. Shapiro, C. Baquero, V. Balegas, and S. Duarte. An Optimized Conflict-free Replicated Set. CoRR, abs/1210.3368, 2012.

[5] E. A. Brewer. Towards Robust Distributed Systems (Abstract). In Proceedings of the Nineteenth Annual ACM Symposium on Principles of Distributed Computing, PODC ’00, pages 7–7, New York, NY, USA, 2000. ACM.

[6] M. Burrows. The Chubby Lock Service for Loosely-coupled Distributed Systems. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation, OSDI ’06, pages 335–350, Berkeley, CA, USA, 2006. USENIX Association.

[7] T. D. Chandra, V. Hadzilacos, and S. Toueg. The Weakest Failure Detector for Solving Consensus. J. ACM, 43(4):685–722, July 1996.

[8] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A Distributed Storage System for Structured Data. ACM Trans. Comput. Syst., 26(2):4:1–4:26, June 2008.

[9] T. Crain and M. Shapiro. Designing a Causally Consistent Protocol for Geo-distributed Partial Replication. In C. Baquero and M. Serafini, editors, PaPoC, co-located with EuroSys 2015, Bordeaux, France, Apr. 2015. SIGOPS, ACM.


[10] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon’s Highly Available Key-value Store. SIGOPS Oper. Syst. Rev., 41(6):205–220, Oct. 2007.

[11] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. SIGOPS Oper. Syst. Rev., 37(5):29–43, Oct. 2003.

[12] S. Gilbert and N. Lynch. Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-tolerant Web Services. SIGACT News, 33(2):51–59, June 2002.

[13] G. Grätzer. General Lattice Theory, Second Edition. Birkhäuser Basel, 2003.

[14] P. Haller, A. Prokopec, H. Miller, V. Klang, R. Kuhn, and V. Jovanovic. Futures and Promises. http://docs.scala-lang.org/overviews/core/futures.html, 2012. Accessed: 2016-04-02.

[15] R. Hundt. Loop Recognition in C++/Java/Go/Scala. In Proceedings of the 2nd Scala Workshop, 2011.

[16] R. Klophaus. Riak Core: Building Distributed Applications Without Shared State. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP ’10, pages 14:1–14:1, New York, NY, USA, 2010. ACM.

[17] L. Kuper, A. Turon, N. R. Krishnaswami, and R. R. Newton. Freeze After Writing: Quasi-deterministic Parallel Programming with LVars. SIGPLAN Not., 49(1):257–270, Jan. 2014.

[18] A. Lakshman and P. Malik. Cassandra: A Decentralized Structured Storage System. SIGOPS Oper. Syst. Rev., 44(2):35–40, Apr. 2010.

[19] L. Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. Commun. ACM, 21(7):558–565, July 1978.

[20] R. M. Lerner. At the Forge: Redis. Linux J., 2010(197), Sept. 2010.

[21] T. Lindholm and F. Yellin. Java Virtual Machine Specification. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2nd edition, 1999.

[22] D. Parker, G. Popek, G. Rudisin, A. Stoughton, B. Walker, E. Walton, J. Chow, D. Edwards, S. Kiser, and C. Kline. Detection of Mutual Inconsistency in Distributed Systems. IEEE Transactions on Software Engineering, 9(3):240–247, 1983.

[23] Y. Saito and M. Shapiro. Optimistic Replication. ACM Comput. Surv., 37(1):42–81, Mar. 2005.

[24] S. Sanfilippo. Redis website. http://redis.io. Accessed: 2016-05-09.


[25] R. Schwarz and F. Mattern. Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail. Distrib. Comput., 7(3):149–174, Mar. 1994.

[26] P. Sérgio Almeida, A. Shoker, and C. Baquero. Delta State Replicated Data Types. ArXiv e-prints, Mar. 2016.

[27] M. Shapiro, N. Preguiça, C. Baquero, and M. Zawirski. A comprehensive study of Convergent and Commutative Replicated Data Types. RR 7506, INRIA, Rocq, Jan. 2011.

[28] M. Shapiro and N. M. Preguiça. Designing a Commutative Replicated Data Type. CoRR, abs/0710.1784, 2007.

[29] M. Slee, A. Agarwal, and M. Kwiatkowski. Thrift: Scalable Cross-Language Services Implementation. Facebook, Palo Alto, CA, USA, https://thrift.apache.org/static/files/thrift-20070401.pdf edition, 2007.

[30] J. Stenberg. Snapple Open Source Repository at Github. https://github.com/johanstenberg92/snapple. Accessed: 2016-05-10.

[31] R. Subramaniyan, P. Raman, A. D. George, and M. Radlinski. GEMS: Gossip-Enabled Monitoring Service for Scalable Heterogeneous Distributed Systems. Cluster Computing, 9(1):101–120, 2006.

[32] Twitter, Inc. Twitter Finagle Open Source Repository at Github. https://github.com/twitter/finagle. Accessed: 2016-05-12.

[33] Twitter, Inc. Twitter Scrooge Open Source Repository at Github. https://github.com/twitter/scrooge. Accessed: 2016-05-12.

[34] W. Vogels. Eventually Consistent. Commun. ACM, 52(1):40–44, Jan. 2009.

[35] G. T. Wuu and A. J. Bernstein. Efficient Solutions to the Replicated Log and Dictionary Problems. In Proceedings of the Third Annual ACM Symposium on Principles of Distributed Computing, PODC ’84, pages 233–242, New York, NY, USA, 1984. ACM.


Appendix A

File Listings

# @namespace scala snapple.finagle.io

struct TVersionVector { 1: map<string, i64> versions }

struct TORSet {
  1: i32 elementKind,
  2: map<binary, TVersionVector> elements,
  4: TVersionVector versionVector
}

union TDataType {
  1: TORSet orset,
  2: TVersionVector versionVector
}

struct TOptionalDataType { 1: optional TDataType dataType }

service SnappleService {
  void ping()
  void propagate(1: map<string, TDataType> values)
  bool createEntry(1: string key,
                   2: string dataKind,
                   3: i32 elementKind),
  bool removeEntry(1: string key),
  TOptionalDataType getEntry(1: string key),
  bool modifyEntry(1: string key,
                   2: string operation,
                   3: binary element),
  map<string, TDataType> getAllEntries()
}

Figure A.1. The Snapple Thrift Specification File.
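To illustrate how the interface above might be consumed from Scala, the following is a hypothetical client sketch. It assumes figure A.1 has been compiled with Scrooge (with Finagle bindings) into the snapple.finagle.io package and that a server listens on localhost port 9000; the port, the "orset" data kind string and the element kind value 1 are assumptions made for the example and are not taken from this thesis.

import com.twitter.finagle.Thrift
import com.twitter.util.Await
import snapple.finagle.io.SnappleService

// Hypothetical usage sketch; the port and the createEntry argument values are
// assumptions and may not match the server's actual configuration or constants.
object SnappleClientSketch extends App {

  // Build a Finagle Thrift client against the Scrooge-generated service interface.
  val client = Thrift.client.newIface[SnappleService.FutureIface]("localhost:9000")

  Await.result(client.ping())                          // liveness check
  Await.result(client.createEntry("cart", "orset", 1)) // create a new OR-Set entry
  println(Await.result(client.getEntry("cart")))       // read the entry back
}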


{mode, max}.

{report_interval, 1}.

{duration, 1}.

{concurrent, 50}.

{driver, basho_bench_driver_riakc_pb}.

{key_generator, {int_to_bin_bigendian, {uniform_int, 10000}}}.

{value_generator, {fixed_bin, 10}}.

{riakc_pb_ips, [{127,0,0,1}]}.

{riakc_pb_replies, 1}.

% Change these depending on if the benchmark should
% measure get, create or add operations.
{operations, [{get, 1}]}.
%{operations, [{put, 1}]}.
%{operations, [{update, 1}]}.

{pb_connect_options, [{auto_reconnect, true}]}.

{pb_timeout_general, 30000}.
{pb_timeout_read, 5000}.
{pb_timeout_write, 5000}.
{pb_timeout_listkeys, 50000}.

Figure A.2. The Riak Benchmark Specification File(s) used.
