
Consolidating Concurrency Control and Consensus for Commits under Conflicts

Shuai Mu and Lamont Nelson, New York University; Wyatt Lloyd, University of Southern California; Jinyang Li, New York University

https://www.usenix.org/conference/osdi16/technical-sessions/presentation/mu

This paper is included in the Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI '16), November 2–4, 2016, Savannah, GA, USA. ISBN 978-1-931971-33-1. Open access to the Proceedings is sponsored by USENIX.

Abstract

Conventional fault-tolerant distributed transactions layer a traditional concurrency control protocol on top of the Paxos consensus protocol. This approach provides scalability, availability, and strong consistency. When used for wide-area storage, however, it incurs cross-data-center coordination twice, in serial: once for concurrency control, and then once for consensus. In this paper, we make the key observation that the coordination required for concurrency control and consensus is highly similar. Specifically, each tries to ensure that the serialization graph of transactions is acyclic. We exploit this insight in the design of Janus, a unified concurrency control and consensus protocol. Janus targets one-shot transactions written as stored procedures, a common, but restricted, class of transactions. Like MDCC [16] and TAPIR [51], Janus can commit unconflicted transactions in this class in one round-trip. Unlike MDCC and TAPIR, Janus avoids aborts due to contention: it commits conflicted transactions in this class in at most two round-trips as long as the network is well behaved and a majority of each server's replicas are alive.

We compare Janus with layered designs and TAPIR under a variety of workloads in this class. Our evaluation shows that Janus achieves ∼5× the throughput of a layered system and 90% of the throughput of TAPIR under a contention-free microbenchmark. When the workloads become contended, Janus provides much lower latency and higher throughput (up to 6.8×) than the baselines.

1 Introduction

Scalable, available, and strongly consistent distributed transactions are typically achieved through layering a traditional concurrency control protocol on top of shards of a data store that are replicated by the Paxos [19, 33] consensus protocol. Sharding the data into many small subsets that can be stored and served by different servers provides scalability. Replicating each shard with consensus provides availability despite server failure. Coordinating transactions across the replicated shards with concurrency control provides strong consistency despite conflicting data accesses by different transactions. This approach is commonly used in practice. For example, Spanner [9] implements two-phase locking (2PL) and two-phase commit (2PC) in which both the data and locks are replicated by Paxos. As another example, Percolator [35] implements a variant of optimistic concurrency control (OCC) with 2PC on top of BigTable [7], which relies on primary-backup replication and Paxos.

When used for wide-area storage, the layering approach incurs cross-data-center coordination twice, in serial: once by the concurrency control protocol to ensure transaction consistency (strict serializability [34, 45]), and another time by the consensus protocol to ensure replica consistency (linearizability [14]). Such double coordination is not necessary and can be eliminated by consolidating concurrency control and consensus into a unified protocol. MDCC [16] and TAPIR [51] showed how unified approaches can provide the read-committed and strict serializability isolation levels, respectively. Both protocols optimistically attempt to commit and replicate a transaction in one wide-area roundtrip. If there are conflicts among concurrent transactions, however, they abort and retry, which can significantly degrade performance under contention. This concern is more pronounced for wide-area storage: as transactions take much longer to complete, the amount of contention rises accordingly.

This paper proposes the Janus protocol for building fault-tolerant distributed transactions. Like TAPIR and MDCC, Janus can commit and replicate transactions in one cross-data-center roundtrip when there is no contention. In the face of interference by concurrent conflicting transactions, Janus takes at most one additional cross-data-center roundtrip to commit.

The key insight of Janus is to realize that strict serializability for transaction consistency and linearizability for replication consistency can both be mapped to the same underlying abstraction. In particular, both require that the execution history of transactions be equivalent to some linear total order. This equivalence can be checked by constructing a serialization graph based on the execution history.



Each vertex in the graph corresponds to a transaction. Each edge represents a dependency between two transactions due to replication or conflicting data access. Specifically, if transactions T1 and T2 make conflicting access to some data item, then there exists an edge T1 → T2 if a shard responsible for the conflicted data item executes or replicates T1 before T2. An equivalent linear total order for both strict serializability and linearizability can be found if there are no cycles in this graph.

Janus ensures an acyclic serialization graph in two ways. First, Janus captures a preliminary graph by tracking the dependencies of conflicting transactions as they arrive at the servers that replicate each shard. The coordinator of a transaction T then collects and propagates T's dependencies from sufficiently large quorums of the servers that replicate each shard. The actual execution of T is deferred until the coordinator is certain that all of T's participating shards have obtained all dependencies for T. Second, although the dependency graph based on the arrival orders of transactions may contain cycles, T's participating shards re-order transactions in cycles in the same deterministic order to ensure an acyclic serialization graph. This provides strict serializability because all servers execute conflicting transactions in the same order.

In the absence of contention, all participating shards can obtain all dependencies for T in a single cross-data-center roundtrip. In the presence of contention, two types of conflicts can appear: one among different transactions and the other during replication of the same transaction. Conflicts between different transactions can still be handled in a single cross-data-center roundtrip. They manifest as cycles in the dependency graph and are handled by ensuring deterministic execution re-ordering. Conflicts that appear during replication of the same transaction, however, require a second cross-data-center roundtrip to reach consensus among different server replicas with regard to the set of dependencies for the transaction. Neither scenario requires Janus to abort a transaction. Thus, Janus always commits with at most two cross-data-center roundtrips in the face of contention, as long as network and machine failures do not conspire to prevent a majority of each server's replicas from communicating.

Janus's basic protocol and its ability to always commit assume a common class of transactional workloads: one-shot transactions written as stored procedures. Stored procedures are frequently used [12, 13, 15, 22, 31, 40, 43, 46] and send transaction logic to servers for execution as pieces instead of having clients execute the logic while simply reading and writing data to servers. One-shot transactions preclude the execution output of one piece being used as input for a piece that executes on a different server. One-shot transactions are sufficient for many workloads, including the popular and canonical TPC-C benchmark.

While the design of Janus can be extended to support general transactions, doing so comes at the cost of additional inter-server messages and the potential for aborts.

This approach of dependency tracking and execution re-ordering has been separately applied for concurrency control [31] and consensus [20, 30]. The insight of Janus is that both concurrency control and consensus can use a common dependency graph to achieve consolidation without aborts.

We have implemented Janus in a transactional key-value storage system. To make apples-to-apples comparisons with existing protocols, we also implemented 2PL+MultiPaxos, OCC+MultiPaxos, and TAPIR [51] in the same framework. We ran experiments on Amazon EC2 across multiple availability regions using microbenchmarks and the popular TPC-C benchmark [1]. Our evaluation shows that Janus achieves ∼5× the throughput of layered systems and 90% of the throughput of TAPIR under a contention-free microbenchmark. When the workloads become contended due to skew or wide-area latency, Janus provides much lower latency and more throughput (up to 6.8×) than existing systems by eliminating aborts.

2 Overview

This section describes the system setup we target, reviews background and motivation, and then provides a high-level overview of how Janus unifies concurrency control and consensus.

2.1 System Setup

We target the wide-area setup where data is sharded across servers and each data shard is replicated in several geographically separate data centers to ensure availability. We assume each data center maintains a full replica of data, which is a common setup [4, 26]. This assumption is not strictly needed by existing protocols [9, 51] or Janus, but it simplifies the performance discussion in terms of the number of cross-data-center roundtrips needed to commit a transaction.

We adopt the execution model where transactions are represented as stored procedures, which is also commonly used [12, 13, 15, 22, 31, 40, 43, 46]. As data is sharded across machines, a transaction is made up of a series of stored procedures, referred to as pieces, each of which accesses data items belonging to a single shard. The execution of each piece is atomic with regard to other concurrent pieces through local locking. The stored-procedure execution model is crucial for Janus to achieve good performance under contention. As Janus must serialize the execution of conflicting pieces, stored procedures allow the execution to occur at the servers, which is much more efficient than if the execution were carried out by the clients over the network.
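As a concrete illustration of this execution model (a minimal Python sketch of our own, not code from the Janus implementation; all names are hypothetical), a transaction that increments two keys stored on different shards is simply a list of single-shard pieces, each a stored procedure that runs entirely at its shard's servers:

# Illustrative sketch of the one-shot, stored-procedure execution model
# (names are ours, not from the Janus implementation).
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Piece:
    shard: str                               # each piece touches a single shard
    keys: List[str]                          # data items it reads/writes
    body: Callable[[Dict[str, int]], None]   # stored procedure run at the server

@dataclass
class Transaction:
    txid: str
    pieces: List[Piece] = field(default_factory=list)

def increment(key: str) -> Callable[[Dict[str, int]], None]:
    def proc(store: Dict[str, int]) -> None:
        store[key] = store.get(key, 0) + 1
    return proc

# A transaction that increments x and y, stored on different shards, is two
# pieces; neither piece needs the other's output (it is one-shot).
t1 = Transaction("T1", [
    Piece(shard="shard-x", keys=["x"], body=increment("x")),
    Piece(shard="shard-y", keys=["y"], body=increment("y")),
])

# Each shard's server executes its piece locally and atomically.
shard_stores = {"shard-x": {"x": 0}, "shard-y": {"y": 0}}
for piece in t1.pieces:
    piece.body(shard_stores[piece.shard])
print(shard_stores)   # {'shard-x': {'x': 1}, 'shard-y': {'y': 1}}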



System                   Commit latency in    How to resolve conflicts    How to resolve conflicts
                         wide-area RTTs       during replication          during execution/commit
2PL+MultiPaxos [9]       2*                   Leader assigns ordering     Locking†
OCC+MultiPaxos [35]      2*                   Leader assigns ordering     Abort and retry
Replicated Commit [28]   2                    Leader assigns ordering     Abort and retry
TAPIR [51]               1                    Abort and retry commit      Abort and retry
MDCC [16]                1                    Retry via leader            Abort
Janus                    1                    Reorder                     Reorder

Table 1: Wide-area commit latency and conflict resolution strategies for Janus and existing systems. MDCC provides read-committed isolation; all other protocols provide strict serializability.

*Additional roundtrips are required if the coordinator logs the commit status durably before returning to clients.
†2PL also aborts and retries due to false positives in distributed deadlock detection.


2.2 Background and Motivation

The standard approach to building fault-tolerant distributed transactions is to layer a concurrency control protocol on top of a consensus protocol used for replication. The layered design addresses three paramount challenges facing distributed transactional storage: consistency, scalability, and availability.

• Consistency refers to the ability to preclude "bad" interleavings of concurrent conflicting operations. There are two kinds of conflicts: 1) transactions containing multiple pieces performing non-serializable data access at different data shards; 2) transactions replicating their writes to the same data shard in different orders at different replica servers. Concurrency control handles the first kind of conflict to achieve strict serializability [45]; consensus handles the second kind to achieve linearizability [14].

• Scalability refers to the ability to improve performance by adding more machines. Concurrency control protocols naturally achieve scalability because they operate on individual data items that can be partitioned across machines. By contrast, consensus protocols do not address scalability as their state-machine approach duplicates all operations across replica servers.

• Availability refers to the ability to survive and recover from machine and network failures, including crashes and arbitrary message delays. Solving the availability challenge is at the heart of consensus protocols, while traditional concurrency control does not address it.

Because concurrency control and consensus each solves only two out of the above three challenges, it is natural to layer one on top of the other to cover all bases.



For example, Spanner [9] layers 2PL/2PC over Paxos. Percolator [35] layers a variant of OCC over BigTable, which relies on primary-backup replication and Paxos.

The layered design is conceptually simple, but it does not provide the best performance. When one protocol is layered on top of the other, the coordination required for consistency occurs twice, serially: once at the concurrency control layer, and then once again at the consensus layer. This results in extra latency that is especially costly for wide-area storage. This observation is not new and has been made by TAPIR and MDCC [16, 51], which point out that layering suffers from over-coordination.

A unification of concurrency control and consensus can reduce the overhead of coordination. Such unification is possible because the consistency models for concurrency control (strict serializability [34, 45]) and consensus (linearizability [14]) share great similarity. Specifically, both try to satisfy the ordering constraints among conflicting operations to ensure equivalence to a linear total ordering. Table 1 classifies the techniques used by existing systems to handle conflicts that occur during transaction execution/commit and during replication. As shown, optimistic schemes rely on abort and retry, while conservative schemes aim to avoid costly aborts. Among the conservative schemes, consensus protocols rely on the (Paxos) leader to assign the ordering of replication operations while transaction protocols use per-item locking.

In order to unify concurrency control and consensus, one must use a common coordination scheme to handle both types of conflicts in the same way. This design requirement precludes leader-based coordination, since distributed transactions that rely on a single leader to assign ordering [41] do not scale well. MDCC [16] and TAPIR [51] achieve unification by relying on aborts and retries for both concurrency control and consensus. Such an optimistic approach is best suited for workloads with low contention. The goal of our work is to develop a unified approach that works well under contention by avoiding aborts.



[Figure 1: Workflows for committing a transaction that increments x and y, which are stored on different shards. (a) OCC over Multi-Paxos; (b) Janus. The execute and 2PC-prepare messages in OCC over Multi-Paxos can be combined for stored procedures.]

2.3 A Unified Approach to Concurrency Control and Consensus

Can we unify concurrency control and consensus with a common coordination strategy that minimizes costly aborts? A key observation is that the ordering constraints desired by concurrency control and replication can both be captured by the same serialization graph abstraction. In the serialization graph, vertices represent transactions and directed edges represent the dependencies among transactions due to the two types of conflicts. Both strict serializability and linearizability can be reduced to checking that this graph is free of cycles [20, 45].

[Figure 2: Replication and transaction conflicts can be represented as cycles in a common serialization graph. (a) A cycle representing replication conflicts: Sx orders T1: Write(X) before T2: Write(X), while S'x orders them the other way. (b) A cycle representing transaction conflicts: Sx orders T1: Write(X) before T2: Write(X), while Sy orders T2: Write(Y) before T1: Write(Y).]

Janus attempts to explicitly capture a preliminary serialization graph by tracking the dependencies of conflicting operations without immediately executing them. Potential replication or transaction conflicts both manifest as cycles in the preliminary graph. The cycle in Figure 2a is due to conflicts during replication: T1's write of data item X arrives before T2's write on X at server Sx, resulting in T1 → T2. The opposite arrival order happens at a different server replica S'x, resulting in T1 ← T2. The cycle in Figure 2b is due to transaction conflicts: server Sx receives T1's write on item-x before T2's write on item-x, while server Sy receives the writes on item-y in the opposite order. In order to ensure a cycle-free serialization graph and abort-free execution, Janus deterministically re-orders transactions involved in a cycle before executing them.
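The sketch below (our own illustration in Python, not the paper's code) makes this concrete: opposite arrival orders of two conflicting transactions at two replicas produce a two-node cycle in the union of the per-server dependency edges, and a deterministic tie-break such as sorting by transaction id gives every server the same execution order instead of an abort.

# Two servers receive the same pair of conflicting transactions in opposite
# orders. Unioning their per-server dependency edges produces a cycle, which
# is resolved by a deterministic order (sort by transaction id) rather than
# by aborting. Purely illustrative; names are ours.

def deps_from_arrival(order):
    """Edges T_earlier -> T_later for conflicting transactions, by arrival."""
    return {(order[i], order[j]) for i in range(len(order)) for j in range(i + 1, len(order))}

server_sx = ["T1", "T2"]        # Sx sees T1 before T2
server_sx_prime = ["T2", "T1"]  # S'x sees T2 before T1

combined = deps_from_arrival(server_sx) | deps_from_arrival(server_sx_prime)
print(combined)                      # {('T1', 'T2'), ('T2', 'T1')} -- a cycle

has_cycle = any((b, a) in combined for (a, b) in combined)
print("cycle detected:", has_cycle)  # True

# Deterministic re-ordering: every server sorts the members of the cycle the
# same way, so all replicas execute T1 then T2.
execution_order = sorted({t for edge in combined for t in edge})
print("execution order:", execution_order)   # ['T1', 'T2']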

To understand the advantages and challenges of unification, we contrast the workflow of Janus with that of OCC+MultiPaxos using a concrete example. The example transaction, T1: x++; y++;, consists of two pieces, each a stored procedure incrementing item-x or item-y.

Figure 1a shows how OCC+MultiPaxos executes and commits T1. First, the coordinator of T1 dispatches the two pieces to item-x and item-y's respective Paxos leaders, which happen to reside in different data centers. The leaders execute the pieces and buffer the writes locally. To commit T1, the coordinator sends 2PC-prepare requests to the Paxos leaders, which perform OCC validation and replicate the new values of x and y to the other replicas. Because the pieces are stored procedures, we can combine the dispatch and 2PC-prepare messages so that the coordinator can execute and commit T1 in two cross-data-center roundtrips in the absence of conflicts. Suppose a concurrent transaction T2 also tries to increment the same two counters. The dispatch messages (or 2PC-prepares) of T1 and T2 might be processed in different orders by different Paxos leaders, causing T1 and/or T2 to be aborted and retried. On the other hand, because Paxos leaders impose an ordering on replication, all replicas for item-x (or item-y) process the writes of T1 or T2 consistently, but at the cost of an additional wide-area roundtrip.

Figure 1b shows the workflow of Janus. To commit and execute T1, the coordinator moves through three phases: PreAccept, Accept, and Commit, of which the Accept phase may be skipped. In PreAccept, the coordinator dispatches T1's pieces to their corresponding replica servers.



Unlike OCC+MultiPaxos, the server does not execute the pieces immediately but simply tracks their arrival orders in its local dependency graph. Servers reply to the coordinator with T1's dependency information. The coordinator can skip the Accept phase if enough replica servers reply with identical dependencies, which happens when there is no contention among server replicas. Otherwise, the coordinator engages in another round of communication with the servers to ensure that they obtain identical dependencies for T1. In the Commit phase, each server executes transactions according to their dependencies. If there is a dependency cycle involving T1, the server adheres to a deterministic order in executing the transactions in the cycle. As soon as the coordinator hears back from the nearest server replica, which typically resides in the local data center, it can return results to the user.

As shown in Figure 1b, Janus attempts to coordinate both transaction commit and replication in one shot with the PreAccept phase. In the absence of contention, Janus completes a transaction in one cross-data-center roundtrip, which is incurred by the PreAccept phase. (The commit phase incurs only local data-center latency.) In the presence of contention, Janus incurs two wide-area roundtrips, from the PreAccept and Accept phases. These two wide-area roundtrips are sufficient for Janus to resolve replication and transaction commit conflicts. The resolution is achieved by tracking the dependencies of transactions and then re-ordering the execution of pieces to avoid inconsistent interleavings. For example, in Figure 1b, due to contention between T1 and T2, the coordinator propagates T1 ↔ T2 to all server replicas during commit. All servers detect the cycle T1 ↔ T2 and deterministically choose to execute T1 before T2.

3 Design

This section describes the design of the basic Janus protocol. Janus is designed to handle a common class of transactions with two constraints: 1) the transactions do not have any user-level aborts; 2) the transactions do not contain any piece whose input is dependent on the execution of another piece, i.e., they are one-shot transactions [15]. Both the example in Section 2 and the popular TPC-C workload belong to this class of transactions. We discuss how to extend Janus to handle general transactions and the limitations of the extension in Section 4. Additional optimizations and a full TLA+ specification are included in a technical report [32].

3.1 Basic Protocol

The Janus system has three roles: clients, coordinators, and servers. A client sends a transaction request, as a set of stored procedures, to the nearest coordinator.

Algorithm 1: Coordinator::DoTxn(T = [α1, ..., αN])
1  T's metadata is a globally unique identifier, a partition list, and an abandon flag.
2  PreAccept Phase:
3  send PreAccept(T, ballot) to all participating servers, ballot default as 0
4  if ∀ piece αi ∈ α1...αN, αi has a fast quorum of PreAccept-OKs with the identical dependency depi then
5      dep = Union(dep1, dep2, ..., depN)
6      goto commit phase
7  else if ∀ αi ∈ α1...αN, αi has a majority quorum of PreAccept-OKs then
8      foreach αi ∈ α1...αN do
9          depi ← union of the dependencies returned by the majority quorum for piece αi
10     dep = Union(dep1, dep2, ..., depN)
11     goto accept phase
12 else
13     return FailureRecovery(T)
14 Accept Phase:
15 foreach αi ∈ α1...αN in parallel do
16     send Accept(T, dep, ballot) to αi's participating servers
17 if ∀ αi has a majority quorum of Accept-OKs then
18     goto commit phase
19 else
20     return FailureRecovery(T)
21 Commit Phase:
22 send Commit(T, dep) to all participating servers
23 return to client after receiving execution results
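As a rough paraphrase of the quorum checks on lines 4-13 (a simplified sketch of our own that ignores ballots and the failure-recovery branch details), the coordinator's decision after PreAccept might look like this:

# Simplified coordinator-side decision after PreAccept (our illustration).
# replies[shard] is a list of dependency sets, one per replica that answered.

def decide(replies, n_replicas):
    fast_quorum = n_replicas              # Janus: fast quorum = all replicas
    majority = n_replicas // 2 + 1

    fast_path = all(
        len(deps) >= fast_quorum and all(d == deps[0] for d in deps)
        for deps in replies.values()
    )
    if fast_path:
        dep = set().union(*(deps[0] for deps in replies.values()))
        return "commit", dep              # skip the Accept phase

    if all(len(deps) >= majority for deps in replies.values()):
        # Union the dependencies seen by a majority for each shard.
        dep = set()
        for deps in replies.values():
            dep |= set().union(*deps)
        return "accept", dep              # one more round to agree on deps

    return "recover", None

# Example: shard-x replicas disagree, so the coordinator takes the accept phase.
replies = {
    "shard-x": [{"T0"}, {"T0", "T2"}, {"T0"}],
    "shard-y": [{"T2"}, {"T2"}, {"T2"}],
}
print(decide(replies, n_replicas=3))      # ('accept', {'T0', 'T2'}); set order may vary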

The coordinator is responsible for communicating with the servers storing the desired data shards to commit the transaction and then proxying the result to the client. Coordinators have no global or persistent state and they do not communicate with each other. In our evaluation setup, coordinators are co-located with clients.

We require that the shards a transaction T touches are known before it runs. Suppose a transaction T involves n pieces, α1, ..., αn, and let r be the number of servers replicating each shard. Thus, there are r servers processing each piece αi; we refer to them as αi's participating servers. We refer to the set of r*n servers for all pieces as transaction T's participating servers, or the servers that T involves.

There is a fast path and a standard path of execution in Janus. When there is no contention, the coordinator takes the fast path, which consists of two phases: pre-accept and commit.



Algorithm 2: Server S::PreAccept(T, ballot)
24 if highest_ballotS[T] > ballot then
25     return PreAccept-NotOK
26 highest_ballotS[T] ← ballot
27 // add T to GS if not already present
28 AddNewTx(GS, T)
29 GS[T].status ← pre-accepted#ballot
30 dep ← GS[T].dep
31 return PreAccept-OK, dep

Algorithm 3: Server S::Accept(T, dep, ballot)
32 if GS[T].status ≥ committing or
33    highest_ballotS[T] > ballot then
34     return Accept-NotOK, highest_ballotS[T]
35 highest_ballotS[T] ← ballot
36 GS[T].dep ← dep
37 GS[T].status ← accepted#ballot
38 return Accept-OK

When there is contention, the coordinator takes the standard path of three phases: pre-accept, accept, and commit, as shown in Figure 1b.

Dependency graphs. At a high level, the goal of the pre-accept and accept phases is to propagate the necessary dependency information among participating servers. To track dependencies, each server S maintains a local dependency graph, GS. This is a directed graph where each vertex represents a transaction and each edge represents the chronological order between two conflicting transactions. The incoming edges of a vertex are stored in its dep value. Additionally, each vertex contains the status of the transaction, which captures the stage of the protocol the transaction is currently in. The basic protocol requires three vertex status types with the ranked order pre-accepted < accepted < committing. Servers and coordinators exchange and merge these graphs to ensure that the required dependencies reach the relevant servers before they execute a transaction.
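A minimal sketch of this per-server state (our own naming, not the Janus code): each vertex keeps its direct dependencies and a status that only moves forward through the ranked order above.

# Per-server dependency graph sketch (illustrative only).
from dataclasses import dataclass, field

# Ranked status values: a vertex's status only increases.
PRE_ACCEPTED, ACCEPTED, COMMITTING = 1, 2, 3

@dataclass
class Vertex:
    dep: set = field(default_factory=set)   # direct ancestors (incoming edges)
    status: int = PRE_ACCEPTED
    ballot: int = 0

class DepGraph:
    def __init__(self):
        self.vertices = {}                   # txid -> Vertex

    def add_tx(self, txid, conflicts_with):
        """On PreAccept: record T and edges T' -> T for conflicting T' seen earlier."""
        v = self.vertices.setdefault(txid, Vertex())
        v.dep |= {t for t in conflicts_with if t in self.vertices and t != txid}
        return v.dep

    def upgrade(self, txid, dep, status):
        """On Accept/Commit: replace T's dependencies and raise its status."""
        v = self.vertices.setdefault(txid, Vertex())
        v.dep = set(dep)
        v.status = max(v.status, status)

g = DepGraph()
print(g.add_tx("T1", conflicts_with=[]))        # set()
print(g.add_tx("T2", conflicts_with=["T1"]))    # {'T1'}
g.upgrade("T2", {"T1"}, COMMITTING)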

In Janus, the key invariant for correctness is that all participating servers must obtain the exact same set of dependencies for a transaction T before executing T. This is challenging due to the interference from other concurrent conflicting transactions and from the failure recovery mechanism.

Next, we describe the fast path of the protocol, which is taken by the coordinator in the absence of interference from concurrent transactions. Pseudocode for the coordinator is shown in Algorithm 1.

The PreAccept phase. The coordinator sends PreAccept messages both to replicate transaction T and to establish a preliminary ordering for T among its participating servers.

Algorithm 4: Server S::Commit(T, dep)
39 GS[T].dep ← dep
40 GS[T].status ← committing
41 Wait & Inquire Phase:
42 repeat
43     choose an ancestor T′ of T with GS[T′].status < committing
44     if T′ does not involve S then
45         send Inquire(T′) to a server that T′ involves
46     wait until GS[T′].status ≥ committing
47 until ∀ ancestors T′ of T in GS: GS[T′].status ≥ committing
48 Execute Phase:
49 repeat
50     choose T′ ∈ GS: ReadyToProcess(T′)
51     scc ← StronglyConnectedComponent(GS, T′)
52     for each T″ in DeterministicSort(scc) do
53         if T″ involves S and not T″.abandon then
54             T″.result ← execute T″
55         processedS[T″] ← true
56 until processedS[T] is true
57 reply CommitOK, T.abandon, T.result

Algorithm 5: Server S::ReadyToProcess(T)
58 if processedS[T] or GS[T].status < committing then
59     return false
60 scc ← StronglyConnectedComponent(GS, T)
61 for each T′ ∉ scc such that T′ is an ancestor of T do
62     if processedS[T′] ≠ true then
63         return false
64 return true

As shown in Algorithm 2, upon receiving the pre-accept for T, server S inserts a new vertex for T if one does not already exist in the local dependency graph GS. The newly inserted vertex for T has the status pre-accepted. An edge T′ → T is inserted if an existing vertex T′ in the graph and T both intend to make conflicting access to a common data item at server S. The server then replies PreAccept-OK with T's direct dependencies GS[T].dep in the graph. The coordinator waits to collect a sufficient number of pre-accept replies from each involved shard. There are several scenarios (Algorithm 1, lines 4-11). In order to take the fast path, the coordinator must receive a fast quorum of replies containing the same dependency list for each shard. A fast quorum F in Janus contains all r replica servers.

The fast quorum concept is due to Lamport, who originally applied it in the Fast Paxos consensus protocol [21]. In consensus, a fast quorum lets one skip the Paxos leader and directly send a proposed value to the replica servers.



In Fast Paxos, a fast quorum must contain at least three quarters of the replicas. By contrast, when applying the idea to unify consensus and concurrency control, the size of the fast quorum is increased to include all replica servers. Furthermore, the fast path of Janus has the additional requirement that the dependency lists returned by the fast quorum must be the same. This ensures that the ancestor graph of transaction T on the standard path and during failure recovery always agrees with what is determined on the fast path at the time of T's execution.

The Commit phase. The coordinator aggregates the direct dependencies of every piece and sends them in a Commit message to all servers (Algorithm 1, lines 21-22). When server S receives the commit request for T (Algorithm 4), it replaces T's dependency list in its local graph GS and upgrades the status of T to committing. In order to execute T, the server must ensure that all ancestors of T have the status committing. If server S participates in an ancestor transaction T′, it can simply wait for the commit message for T′. Otherwise, the server issues an Inquire request to the nearest server S′ that participates in T′ to request the direct dependencies GS′[T′].dep after T′ has become committing on S′.

In the absence of contention, the dependency graph at every server is acyclic. Thus, after all the ancestors of T become committing at server S, the server can perform a topological sort on the graph and execute the transactions according to the sorted order. After executing transaction T, the server marks T as processed locally. After a server executes its pieces of a transaction, it returns the results back to the coordinator. The coordinator can reply to the client as soon as it receives the necessary results from the nearest server replica, which usually resides in the local data center. Thus, on the fast path, a transaction can commit and execute with only one cross-data-center round-trip, taken by the pre-accept phase.

3.2 Handling Contention Without Aborts

Contention manifests itself in two ways during normal execution (Algorithm 1). As an example, consider two concurrent transactions T1 and T2 of the form x++; y++. First, the coordinator may fail to obtain a fast quorum of identical dependencies for T1's piece x++ due to interference from T2's pre-accept messages. Second, the dependency information accumulated by servers may contain cycles, e.g., with both T1 → T2 and T2 → T1, if the pre-accept messages of T1 and T2 arrive in different orders at different servers. Janus handles these two scenarios through the additional accept phase and deterministic re-ordering.

The Accept phase. If some shard does not return a fast quorum of identical dependency lists during pre-accept, then consensus on the complete set of dependencies for T has not yet been reached. In particular, additional dependencies for T may be inserted later, or existing dependencies may be replaced (due to the failure recovery mechanism, Section 3.3).

The accept phase tries to reach consensus via a ballot accept/reject mechanism similar to the one used by Paxos. Because a higher ballot supersedes a lower one, the accepted status of a transaction includes a ballot number such that accepted#b < accepted#b′ if ballot b < b′. The coordinator aggregates the dependency information collected from a majority quorum of pre-accepts and sends an accept message to all participating servers with ballot number 0 (Algorithm 1, lines 14-17). The server handles accepts as shown in Algorithm 3: it first checks whether T's status in its local dependency graph is already committing or whether the highest ballot it has seen for T is greater than the one in the accept request. If so, the server rejects the accept request. Otherwise, the server replaces T's dependencies with the new ones, updates T's status and its highest ballot seen for T, and replies Accept-OK. If the coordinator receives a majority quorum of Accept-OKs, it moves on to the commit phase. Otherwise, the accept phase has failed and the coordinator initiates the failure recovery mechanism (Section 3.3). In the absence of active failure recovery done by some other coordinator, the accept phase in Algorithm 1 always succeeds.

For a shard αi, once the coordinator obtains a fast quorum of PreAccept-OKs with the same dependency list depi or a majority quorum of Accept-OKs accepting dependency list depi, then depi is the consensus dependency for piece αi. When there are concurrent conflicting transactions, the dependencies for different shards (dep1, ..., depN) may not be identical. The coordinator aggregates them together and sends the resulting dependencies in the commit phase.

Deterministic execution ordering. In the general case with contention, the servers can observe cyclic dependencies among T and its ancestors after waiting for all of T's ancestors to advance their status to committing. In this case, the server first computes all strongly connected components (SCCs) of GS within T's ancestor graph. It then performs a topological sort across all SCCs and executes the SCCs in sorted order. Each SCC contains one or more transactions. SCCs with multiple transactions are executed in an arbitrary, but deterministic, order (e.g., sorted by transaction ids). Because all servers observe the same dependency graph and deterministically order transactions within a cycle, they execute conflicting transactions in the same order.
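The following sketch (ours, written for clarity rather than efficiency; it is not the Janus implementation) shows the effect: the committed graph is partitioned into SCCs, the SCCs are topologically sorted so ancestors run first, and transactions inside an SCC are ordered deterministically by id.

# Deterministic execution over strongly connected components (SCCs) of a
# committed dependency graph. A toy quadratic construction -- clear rather
# than fast -- and entirely our own illustration.

def reachable(graph, src):
    """All vertices reachable from src by following dependency edges."""
    seen, stack = set(), [src]
    while stack:
        v = stack.pop()
        for w in graph.get(v, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def execution_order(graph):
    """graph[t] = set of transactions t directly depends on (its ancestors).
    Returns transactions grouped by SCC, ancestors first, deterministically
    ordered (sorted by id) inside each SCC."""
    reach = {v: reachable(graph, v) for v in graph}
    # Two transactions are in one SCC iff each can reach the other.
    sccs, assigned = [], set()
    for v in sorted(graph):
        if v in assigned:
            continue
        comp = sorted(w for w in graph if w == v or (w in reach[v] and v in reach[w]))
        sccs.append(comp)
        assigned.update(comp)
    # Topologically sort SCCs: an SCC runs only after every SCC it depends on.
    def ancestors(comp):
        return {w for v in comp for w in reach[v]} - set(comp)
    ordered, done = [], set()
    while len(done) < len(graph):
        for comp in sccs:
            if comp[0] not in done and ancestors(comp) <= done:
                ordered.append(comp)
                done.update(comp)
    return ordered

# T1 and T2 form a cycle; T0 is an ancestor of both and must run first.
g = {"T0": set(), "T1": {"T0", "T2"}, "T2": {"T1"}}
print(execution_order(g))   # [['T0'], ['T1', 'T2']]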



3.3 Handling Coordinator Failure

Algorithm 6: Coordinator C::FailureRecovery(T)
65 Prepare Phase:
66 ballot ← highest ballot number seen + 1
67 send Prepare(T, ballot) to T's participating servers
68 if ∃ Tx-Done, dep among replies then
69     // T's direct dependencies dep have been determined
70     goto commit phase
71 else if ∃ αi without a majority quorum of Prepare-OKs then
72     goto prepare phase
73 let R be the set of replies w/ the highest ballot
74 if ∃ dep′ ∈ R: dep′[T].status = accepted then
75     goto accept phase w/ dep = dep′
76 else if R contains at least F ∩ M identical dependencies depi for each shard then
77     dep = Union(dep1, dep2, ..., depN)
78     goto accept phase w/ dep
79 else if ∃ dep′ ∈ R: dep′[T].status = pre-accepted then
80     goto pre-accept phase, avoid fast path
81 else
82     T.abandon = true
83     goto accept phase w/ dep = nil

The coordinator may crash at any time during protocol execution. Consequently, a participating server for transaction T will notice that T has failed to progress to the committing status for a threshold amount of time. This triggers the failure recovery process, in which some server takes up the role of the coordinator to try to commit T.

The recovery coordinator progresses through the prepare, accept, and commit phases, as shown in Algorithm 6. Unlike in the normal execution (Algorithm 1), the recovery coordinator may repeat the pre-accept, prepare, and accept phases many times using progressively higher ballot numbers, to cope with contention that arises when multiple servers try to recover the same transaction or when the original coordinator has been falsely suspected of failure.

In the prepare phase, the recovery coordinator picks a unique ballot number b > 0 and sends it to all of T's participating servers.‡ Algorithm 7 shows how servers handle prepares. If the status of T in the local graph is committing or beyond, the server replies Tx-Done with T's dependencies. Otherwise, the server checks whether it has seen a ballot number for T that is higher than the one included in the prepare message. If so, it rejects. Otherwise, it replies Prepare-OK with T's direct dependencies.

‡Unique server ids can be appended as low-order bits to ensure unique ballot numbers.

Algorithm 7: Server S::Prepare(T, ballot)
84 dep ← GS[T].dep
85 if GS[T].status ≥ committing then
86     return Tx-Done, dep
87 else if highest_ballotS[T] > ballot then
88     return Prepare-NotOK, highest_ballotS[T]
89 highest_ballotS[T] ← ballot
90 reply Prepare-OK, GS[T].ballot, dep

If the coordinator fails to receive a majority quorum of Prepare-OKs for some shard, it retries the prepare phase with a higher ballot after a back-off.

The coordinator continues if it receives a majority quorum (M) of Prepare-OKs for each shard. Next, it must distinguish between several cases in order to decide whether to proceed from the pre-accept phase or the accept phase (lines 73-83). We point out a specific case in which the recovery coordinator has received a majority of Prepare-OKs with the same dependencies for each shard and the status of T on those servers is pre-accepted. In this case, transaction T could have succeeded on the fast path. Thus, the recovery coordinator merges the resulting dependencies and proceeds to the accept phase. This case also illustrates why Janus must use a fast quorum F containing all server replicas: for any two conflicting transactions that both require recovery, a recovery coordinator is guaranteed to observe their dependency if (F ∩ M) ∩ (F ∩ M) ≠ ∅, where M is a majority quorum.
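The requirement can be checked by brute force. The sketch below (our own back-of-the-envelope check, not from the paper) enumerates every possible F ∩ M view for r = 5 replicas per shard: with F equal to all replicas, any two such views intersect, whereas a Fast-Paxos-style three-quarters fast quorum admits two disjoint views, which is why Janus's fast quorum must include every replica.

# Brute-force check of the recovery intersection requirement for r replicas:
# any two sets of the form F ∩ M (F a fast quorum, M a majority quorum) must
# intersect. Our own illustration, using r = 5 replicas per shard.
from itertools import combinations

def guaranteed_overlap(r, fast_size):
    replicas = range(r)
    majority = r // 2 + 1
    fasts = [set(c) for c in combinations(replicas, fast_size)]
    majs = [set(c) for c in combinations(replicas, majority)]
    views = [f & m for f in fasts for m in majs]       # what a recovery may see
    return all(a & b for a in views for b in views)

print(guaranteed_overlap(5, fast_size=5))   # True  -- Janus: F = all replicas
print(guaranteed_overlap(5, fast_size=4))   # False -- a 3/4-style fast quorum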

The pre-accept and accept phases used for recovery are the same as in the normal execution (Algorithm 1) except that both phases use the ballot number chosen in the prepare phase. If the coordinator receives a majority quorum of Accept-OK replies, it proceeds to the commit phase. Otherwise, it restarts in the prepare phase using a higher ballot number.

The recovery coordinator attempts to commit the transaction if possible, but may abandon it if it cannot recover the inputs for the transaction. This is possible when some servers pass on dependencies for one transaction to another and then fail simultaneously with the first transaction's coordinator. A recovery coordinator abandons a transaction T using the normal protocol except that it sets the abandon flag for T during the accept phase and the commit phase. Abandoning a transaction this way ensures that servers reach consensus on abandoning T even if the original coordinator is falsely suspected of failure. During transaction execution, if a server encounters a transaction T with the abandon flag set, it simply skips the execution of T.



4 General Transactions

This section discusses how to extend Janus to handle dependent pieces and limited forms of user aborts. Although Janus avoids aborts completely for one-shot transactions, our proposed extensions incur aborts when transactions conflict.

Let a transaction's keys be the set of data items that the transaction needs to read or write. We classify general transactions into two categories: 1) transactions whose keys are known prior to execution, but the inputs of some pieces are dependent on the execution of other pieces; 2) transactions whose keys are not known beforehand, but are dependent on the execution of certain pieces.

Transactions with pre-determined keys. For any transaction T in this category, T's set of participating servers is known a priori. This allows the coordinator to go through the pre-accept and accept phases to fix T's position in the serialization graph while deferring piece execution. Suppose servers Si and Sj are responsible for the data shards accessed by pieces αi and αj (i < j), respectively, and αj's inputs depend on the outputs of αi. In this case, we extend the basic protocol to allow servers to communicate with each other during transaction execution. Specifically, once server Sj is ready to execute αj, it sends a message to server Si to wait for the execution of αi and fetch the corresponding outputs. We can use the same technique to handle transactions with a certain type of user-aborts, specifically when the piece αi containing user-aborts is not dependent on the execution of any other piece that performs writes. To execute any piece αj of the transaction, server Sj communicates with server Si to find out αi's execution status. If αi is aborted, then server Sj skips the execution of αj.
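A minimal sketch of this server-to-server hand-off (our own, using a thread and an event in place of real RPCs and ignoring failures): the server holding the dependent piece blocks until the producing piece has executed, then fetches its output.

# Sketch of executing a dependent piece: server Sj waits until server Si has
# executed piece alpha_i, then fetches its output (illustrative only).
import threading

class PieceResult:
    def __init__(self):
        self._done = threading.Event()
        self._value = None
    def set(self, value):
        self._value = value
        self._done.set()
    def wait(self):
        self._done.wait()
        return self._value

alpha_i_result = PieceResult()

def server_si():                       # executes alpha_i: read x
    alpha_i_result.set({"x": 42})      # publish output for dependent pieces

def server_sj():                       # executes alpha_j: write derived from x
    inputs = alpha_i_result.wait()     # "wait for alpha_i and fetch its output"
    print("alpha_j writes", inputs["x"] + 1)

tj = threading.Thread(target=server_sj)
tj.start()
server_si()
tj.join()                              # prints: alpha_j writes 43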

Transactions with dependent keys. As an example transaction of this type, consider transaction T(x): y ← Read(x); Write(y, ...). The key to be accessed by T's second piece depends on the value read by the first piece. Thus, the coordinator cannot go through the usual phases to fix T's position in the serialization graph without execution. We support such a transaction by requiring users to transform it into transactions of the first category using a known technique [41]. In particular, any transaction with dependent keys can be re-written into a combination of several read-only transactions that determine the unknown keys and a conditional-write transaction that performs the writes if the read values match the input keys. For example, the previous transaction T can be transformed into two transactions, T1(x): y ← Read(x); return y; followed by T2(x, y): y′ ← Read(x); if y′ ≠ y {abort;} else {Write(y, ...);}. This technique is optimistic in that it turns concurrency conflicts into user-level aborts. However, as observed in [41], real-life OLTP workloads seldom involve key-dependencies on frequently updated data and thus would incur few user-level aborts due to conflicts.
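For example, the rewrite above can be sketched as follows (our own toy code against an in-memory dict; it is only meant to show where the user-level abort arises):

# Rewriting a transaction with a dependent key (illustrative sketch).
store = {"x": "k1", "k1": 0}

def t1_read(x):
    """Read-only transaction: resolve the unknown key."""
    return store[x]

def t2_conditional_write(x, y, new_value):
    """Conditional-write transaction: re-read x and verify the key is unchanged."""
    if store[x] != y:
        return "user-level abort (retry with the new key)"
    store[y] = new_value
    return "committed"

y = t1_read("x")                          # first transaction: learn y = "k1"
print(t2_conditional_write("x", y, 7))    # committed, store["k1"] == 7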

5 Implementation

In order to evaluate Janus together with existing systems and enable an apples-to-apples comparison, we built a modular software framework that facilitates the efficient construction and debugging of a variety of concurrency control and consensus protocols.

Our software framework consists of 36,161 lines of C++ code, excluding comments and blank lines. The framework includes a custom RPC library, an in-memory database, as well as test suites and several benchmarks. The RPC library uses asynchronous socket I/O (epoll/kqueue). It can passively batch RPC messages to the same machine by reading or writing multiple RPC messages with a single system call whenever possible. The framework also provides common library functions shared by most concurrency control and consensus protocols, such as a multi-phase coordinator, quorum computation, logical timestamps, epochs, database locks, etc. By providing this common functionality, the library simplifies the task of implementing a new concurrency control or consensus protocol. For example, our TAPIR implementation required only 1,209 lines of new code, and the Janus implementation required only 2,433 lines of new code. In addition to TAPIR and Janus, we also implemented 2PL+MultiPaxos and OCC+MultiPaxos.

Our OCC implementation is the standard OCC with 2PC. It does not include the optimization that combines the execution phase with the 2PC-prepare phase. Our 2PL implementation differs from conventional implementations in that it dispatches pieces in parallel in order to minimize execution latency in the wide area. This increases the chance for deadlocks significantly. We use the wound-wait protocol [37] (also used by Spanner [9]) to prevent deadlocks. With a contended workload and parallel dispatch, the wound-wait mechanism results in many false positives for deadlocks. In both OCC and 2PL, the coordinator does not make a unilateral decision to abort transactions. The MultiPaxos implementation inherits the common optimization of batching many messages to and from leaders from the passive batching mechanism in the RPC library.
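For reference, the wound-wait rule our 2PL baseline relies on reduces to a single comparison of transaction timestamps; the sketch below (ours, not the actual lock manager) shows the decision taken on a lock conflict.

# Wound-wait deadlock avoidance used in the 2PL baseline (illustrative sketch).
# Each transaction carries a timestamp; older transactions never wait behind
# younger ones, which prevents deadlock at the cost of spurious aborts.

def on_lock_conflict(requester_ts, holder_ts):
    """Action taken when `requester` wants a lock that `holder` owns."""
    if requester_ts < holder_ts:      # requester is older: wound the holder
        return "abort holder (it restarts later with the same timestamp)"
    return "requester waits"          # requester is younger: it may wait

print(on_lock_conflict(requester_ts=5, holder_ts=9))   # older wounds younger
print(on_lock_conflict(requester_ts=9, holder_ts=5))   # younger waits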

Apart from the design described in Section 3, our implementation of Janus also includes a garbage collection mechanism that truncates the dependency graph as transactions finish. We have not implemented Janus's coordinator failure recovery mechanism nor the extensions to handle dependent pieces. Our implementations of 2PL/OCC+MultiPaxos and TAPIR also do not include failure recovery.



          Oregon   Ireland   Seoul
Oregon    0.9      140       122
Ireland             0.7      243
Seoul                        1.6

Table 2: Ping latency between EC2 datacenters (ms).

6 Evaluation

Our evaluation aims to understand the strengths and limitations of the consolidated approach of Janus. How does it compare against conventional layered systems? How does it compare with TAPIR's unified approach, which aborts under contention? The highlights include:

• In a single data center with no contention, Janus achieves 5× the throughput of 2PL/OCC+MultiPaxos, and 90% of TAPIR's throughput in microbenchmarks.

• All systems' throughput decreases as contention rises. However, as Janus avoids aborts, its performance under moderate or high contention is better than that of existing systems.

• Wide-area latency leads to higher contention. Thus, the relative performance difference between Janus and the other systems is higher in multi-data-center experiments than in single-data-center ones.

6.1 Experimental Setup

Testbed. We run all experiments on Amazon EC2 using m4.large instance types. Each node has 2 virtual CPU cores and 8GB of RAM. For geo-replicated experiments, we use 3 EC2 availability regions: us-west-2 (Oregon), ap-northeast-2 (Seoul), and eu-west-1 (Ireland). The ping latencies among these data centers are shown in Table 2.

Experimental parameters. We adopt the configuration used by TAPIR [51] where a separate server replica group handles each data shard. Each microbenchmark uses 3 shards with a replication level of 3, resulting in a total of 9 server machines being used. In this setting, a majority quorum contains at least 2 servers and a fast quorum must contain all 3 servers. When running geo-replicated experiments, each server replica resides in a different data center. The TPC-C benchmark uses 6 shards with a replication level of 3, for a total of 18 processes running on 9 server machines.

one transaction at a time back-to-back. Aborted trans-actions are retried for up to 20 attempts. We vary theinjected load by changing the number of clients. We runclient processes on a separate set of EC2 instances thanservers. In the geo-replicated setting, experiments of 2PLand OCC co-locate the clients in the datacenter contain-ing the underlying MultiPaxos leader. Clients of Janus

0

40

80

120

0.5 0.6 0.7 0.8 0.9 1

Thro

ughput (1

03 txn/s

)

Zipf Coefficient

2PLJanus

OCCTapir

(a) Cluster Throughput

0

0.2

0.4

0.6

0.8

1

0.5 0.6 0.7 0.8 0.9 1

Com

mit R

ate

Zipf Coefficient

2PLJanus

OCCTapir

(b) Commit Rates

Figure 3: Single datacenter performance with in-creasing contention as controlled by the Zipf coeffi-cient.

and Tapir are spread evenly across the three datacenters.We use enough EC2 machines such that clients are notbottlenecked.

Other measurement details. Each of our experiments lasts 30 seconds, with the first 7.5 and last 7.5 seconds excluded from results to avoid start-up, cool-down, and potential client synchronization artifacts. For each set of experiments, we report the throughput, 90th percentile latency, and commit rate. The system throughput is calculated as the number of committed transactions divided by the measurement duration and is expressed in transactions per second (tps). We calculate the latency of a transaction as the amount of time taken for it to commit, including retries. A transaction that is given up after 20 retries is considered to have infinite latency. We calculate the commit rate as the total number of committed transactions divided by the total number of commit attempts (including retries).
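For concreteness, these metrics reduce to the following simple calculations (a sketch of our own; the numbers are placeholders, not measured values).

# How the reported metrics are derived (sketch; the numbers are placeholders).
committed = 120_000          # transactions that committed in the measured window
attempts = 150_000           # commit attempts, including retries
window_seconds = 15.0        # 30s run minus 7.5s warm-up and 7.5s cool-down
latencies_ms = sorted([12.0, 18.5, 24.6, 30.1, 95.0])   # per committed txn

throughput_tps = committed / window_seconds
commit_rate = committed / attempts
p90 = latencies_ms[int(0.9 * (len(latencies_ms) - 1))]  # 90th percentile

print(f"{throughput_tps:.0f} tps, commit rate {commit_rate:.2f}, p90 {p90} ms")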

Calibrating TAPIR's performance. Because we re-implemented the TAPIR protocol using a different code base, we calibrated our results by running the most basic experiment (one-key read-modify-write transactions), as presented in Figure 7 of TAPIR's technical report [49]. The experiments run within a single data center with only one shard. The keys are chosen uniformly at random. Our TAPIR implementation (running on EC2) achieves a peak throughput of ∼165.83K tps, which is much higher than the amount (∼8K tps) reported for Google VMs [49].



We also ran the experiment corresponding to Figure 13 of [51], and the results show a similar abort-rate performance for TAPIR and OCC (TAPIR's abort rate is an order of magnitude lower than OCC's at lower zipf coefficients). Therefore, we believe our TAPIR implementation is representative.

6.2 Microbenchmark

Workload. In the microbenchmark, each transaction performs 3 read-write accesses on different shards by incrementing 3 randomly chosen key-value pairs. We pre-populate each shard with 1 million key-value pairs before starting the experiments. We vary the amount of contention in the system by choosing keys according to a zipf distribution with a varying zipf coefficient. The larger the zipf coefficient, the higher the contention. This type of microbenchmark is commonly used in evaluations of transactional systems [9, 51].
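The access pattern can be reproduced with a bounded Zipf sampler like the sketch below (our own illustration; it is one standard way to generate such skew and not necessarily the authors' generator, and it uses a smaller keyspace than the experiments).

# Bounded Zipf key sampler for the microbenchmark (our sketch, not the
# authors' generator): key rank k is chosen with probability proportional
# to 1 / k**s, so a larger s means more skew, i.e., more contention.
import random

def make_zipf_sampler(num_keys, s, rng=random.Random(1)):
    weights = [1.0 / (k ** s) for k in range(1, num_keys + 1)]
    def sample():
        return rng.choices(range(num_keys), weights=weights, k=1)[0]
    return sample

sample = make_zipf_sampler(num_keys=100_000, s=0.9)

def one_txn(num_shards=3):
    """One microbenchmark transaction: increment one Zipf-chosen key per shard."""
    return [(shard, sample()) for shard in range(num_shards)]

print(one_txn())   # e.g. [(0, 17), (1, 0), (2, 482)]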

Single data center (low contention). Figure 3 shows the single data center performance of the different systems when the zipf coefficient increases from 0.5 to 1.0. Zipf coefficients of 0.0∼0.5 are excluded because they have negligible contention and performance similar to zipf=0.5. In these experiments, we use 900 clients to inject load.

The number of clients is chosen to be on the "knee" of the latency-throughput curve of TAPIR for a specific zipf value (0.5). In other words, using more than 900 clients results in significantly increased latency with only small throughput improvements. With zipf coefficients of 0.5∼0.6, the system experiences negligible amounts of contention. As Figure 3b shows, the commit rates across all systems are almost 1 at zipf coefficient 0.5.

Both TAPIR and Janus achieve much higher throughput than 2PL/OCC+MultiPaxos. As a Paxos leader needs to handle much more communication load (>3×) than the non-leader server replicas, 2PL/OCC's performance is bottlenecked by the Paxos leaders, which account for only one-third of all 9 server machines.

Single data center (moderate to high contention). As the zipf value varies from 0.6 to 1.0, the amount of contention in the workload rises. As seen in Figure 3b, the commit rates of TAPIR, 2PL, and OCC decrease quickly as the zipf value increases from 0.6 to 1.0. At first glance, it is surprising that 2PL's commit rate is no higher than OCC's. This is because our 2PL implementation dispatches pieces in parallel. This, combined with the large number of false positives induced by deadlock detection, makes locking ineffective. By contrast, Janus does not incur any aborts and maintains a 100% commit rate. Increasing amounts of aborts result in significantly lower throughput for TAPIR, 2PL, and OCC. Although Janus does not have aborts, its throughput also drops because the size of the dependency graph that is being maintained and exchanged grows with the amount of contention. Nevertheless, the overhead of graph computation is much less than the cost of aborts/retries, which allows Janus to outperform the existing systems. At zipf=0.9, the throughput of Janus (55.41K tps) is 3.7× that of TAPIR (14.91K tps). The corresponding 90-percentile latency for Janus and TAPIR is 24.65ms and 28.90ms, respectively. The latter is higher due to repeated retries.

[Figure 4: Geo-replicated performance with increasing contention as controlled by the Zipf coefficient. (a) Cluster throughput (10^3 txn/s); (b) commit rates.]

Multiple data centers (moderate to high contention). We move on to experiments that replicate data across multiple data centers. In this set of experiments, we use 10,800 clients to inject load, compared to 900 in the single-data-center setting. As the wide-area communication latency is more than an order of magnitude larger, we have to use many more concurrent requests in order to achieve high throughput. Thus, the amount of contention for a given zipf value is much higher in multi-data-center experiments than in single-data-center experiments. As we can see in Figure 4b, at zipf=0.5 the commit rate is only 0.37 for TAPIR. This causes the throughput of TAPIR to be lower than Janus's at zipf=0.5 (Figure 4a). At larger zipf values, e.g., zipf=0.9, the throughput of Janus drops to 43.51K tps, compared to 3.95K tps for TAPIR and ∼1085 tps for 2PL and OCC. As Figure 4b shows, TAPIR's commit rate is slightly lower than that of 2PL/OCC. Interference during replication, apart from aborts due to transaction conflicts, may also cause TAPIR to retry the commit.



[Figure 5: Performance in the single datacenter setting for the TPC-C benchmark with increasing load and contention as controlled by the number of clients per partition. (a) Cluster throughput (txn/s); (b) 90% latency (ms); (c) commit rates.]

[Figure 6: Performance in the geo-replicated setting for the TPC-C benchmark with increasing load and contention as controlled by the number of clients per partition. (a) Cluster throughput (txn/s); (b) 90% latency (ms); (c) commit rates.]

6.3 TPC-C

         new-order   payment   order-status   delivery   stock-level
ratio    43.65%      44.05%    4.13%          3.99%      4.18%

Table 3: TPC-C mix ratio in a Janus trial.

Workload. As the most commonly accepted benchmark for testing OLTP systems, TPC-C has 9 tables and 92 columns in total. It has five transaction types, three of which are read-write transactions and two of which are read-only. Table 3 shows the committed transaction mix in one of our trials. In this test we use 6 shards, each replicated in 3 data centers. The workload is sharded by warehouse; each shard contains 1 warehouse, and each warehouse contains 10 districts, following the specification. Because each new-order transaction needs to do a read-modify-write operation on the next-order-id, which is unique within a district, this workload exhibits very high contention as the number of clients increases.

Single datacenter setting. Figure 5 shows the single data center performance of the different systems as the number of clients increases. As the number of clients increases, the throughput of Janus climbs until it saturates the servers' CPUs, and then stabilizes at 5.78K tps. In contrast, the throughput of the other systems first climbs and then drops due to the high abort rates incurred as contention increases. TAPIR's throughput peaks at 560 tps; 2PL and OCC peak at 595 tps and 324 tps, respectively. Because TAPIR can abort and retry more quickly than OCC, its commit rate becomes lower than OCC's as the contention increases. Because of the massive number of aborts, the 90-percentile latency for 2PL/OCC/TAPIR reaches several seconds as the number of clients increases; by contrast, the latency of Janus remains relatively low (<400ms). The latency of Janus increases with the number of clients because more outstanding transactions result in increased queueing.

Multi-data center setting. When replicating dataacross multiple data centers, more clients are needed tosaturate the servers. As shown in Figure 6, the throughputof Janus (5.7K tps) is significantly higher than the otherprotocols. In this setup, the other systems show the sametrend as in single data-center but the peak throughput be-

528 12th USENIX Symposium on Operating Systems Design and Implementation USENIX Association

Page 14: Consolidating Concurrency Control and Consensus for Commits

comes lower because the amount of contention increasesdramatically with increased number of clients. TAPIR’scommit rate is lower than OCC’s when there are ≥ 100clients because the wide-area setting leads to more abortsfor its inconsistent replication.Figure 6b shows the 90-percentile latency. When the

number of clients is <= 100, the experiments use onlya single client machine located in the Oregon data cen-ter. When the number of clients > 100, the clients arespread across all three data centers. As seen in Figure 6b,when the contention is low (<10 clients), the latency ofTAPIR and Janus is low (less than 150ms) because bothprotocols can commit in one cross-data center roundtripfrom Oregon. By contrast, the latency of 2PL/OCC ismuch more ( 230ms) as their commits require 2 crossdata center roundtrips. In our experiments, all leadersof the underlying MultiPaxos happen to be co-locatedin the same data center as the clients. Thus, 2PL/OCCboth require only one cross-data center roundtrip to com-plete the 2PC prepare phase and another roundtrip for thecommit phase as our implementation of 2PL/OCC onlyreturns to clients after the commit to reduce contention.As contention rises due to increased number of clients,the 90-percentile latency of 2PL/OCC/TAPIR increasesquickly to tens of seconds. The latency of Janus alsorises due to increased overhead in graph exchange andcomputation at higher levels of contention. Nevertheless,Janus achieves a decent 90-percentile latency (<900ms)as the number of clients increases to 10,000.

7 Related WorkWe review related work in fault-tolerant replication, geo-replication, and concurrency control.

Fault-tolerant replication. The replicated state ma-chine (RSM) approach [17, 38] enables multiple ma-chines to maintain the same state through deterministicexecution of the same commands and can be used tomaskthe failure of individual machines. Paxos [18] and view-stamped replication [24, 33] showed it was possible toimplement RSMs that are always safe and remain livewhen f or fewer out of 2 f + 1 total machines fail. Paxosand viewstamped replication solve consensus for the se-ries of commands the RSM executes. Fast Paxos [21]reduced the latency for consensus by having client sendcommands directly to replicas instead of a through a dis-tinguished proposer. Generalized Paxos [20] builds onFast Paxos by further enabling commands to commit outof order when they do not interfere, i.e., conflict. Janususes this insight from Generalized Paxos to avoid speci-fying an order for non-conflicting transactions.Mencius [29] showed how to reach consensus with

low latency under low load and high throughput under

high load in a geo-replicated setting by efficiently round-robining leader duties between replicas. SpeculativePaxos [36] can achieve consensus in a single round tripby exploiting a co-designed datacenter network for con-sistent ordering. EPaxos [30] is the consensus protocolmost related to our work. EPaxos builds onMencius, FastPaxos, and Generalized Paxos to achieve (near-)optimalcommit latency in the wide-area, high throughput, andtolerance to slow nodes. EPaxos’s use of dependencygraphs to dynamically order commands based on theirarrival order at different replicas inspired Janus’s depen-dency tracking for replication.

Janus addresses a different problem than RSM becauseit provides fault tolerance and scalability to many shards.RSMs are designed to replicate a single shard and whenused across shards they are still limited to the throughputof a single machine.

Geo-replication. Many recent systems have been de-signed for the geo-replicated setting with varying degreesof consistency guarantees and transaction support. Dy-namo [11], PNUTS [8], and TAO [6] provide eventualconsistency and avoid wide-area messages for most oper-ations. COPS [26] and Eiger [27] provide causal consis-tency, read-only transactions, and Eiger provides write-only transactions while always avoiding wide-area mes-sages. Walter [39] and Lynx [52] often avoid wide-areamessages and provide general transactions with parallelsnapshot isolation and serializability respectively. All ofthese systems will typically provide lower latency thanJanus because they made a different choice in the funda-mental trade off between latency and consistency [5, 23].Janus is on the other side of that divide and provides strictserializability and general transactions.

Concurrency control. Fault tolerant, scalable systemstypically layer a concurrency control protocol on top of areplication protocol. Sinfonia [3] piggybacks OCC into2PC over primary-backup replication. Percolator [35]also uses OCC over primary-backup replication. Span-ner [9] uses 2PL with wound wait over Multi-Paxos andinspired our 2PL experimental baseline.

CLOCC [2, 25] using fine-grained optimistic concur-rency control using loosely synchronized clocks overviewstamped replication. Granola [10] is optimized forsingle shard transactions, but also includes a global trans-action protocol that uses a custom 2PC over viewstampedreplication. Calvin [42] uses a sequencing layer and 2PLover Paxos. Salt [47] is a concurrency control protocol formixing acid and base transactions that inherits MySQLCluster’s chain replication [44]. Salt’s successor, Callas,introduces modular concurrency control [48] that enablesdifferent types of concurrency control for different partsof a workload.

Replicated Commit [28] executes Paxos over 2PC in-

USENIX Association 12th USENIX Symposium on Operating Systems Design and Implementation 529

Page 15: Consolidating Concurrency Control and Consensus for Commits

stead of the typical 2PC over Paxos to reduce wide-areamessages. Rococo [31] is a dependency graph based con-currency control protocol over Paxos that avoids abortsby reordering conflicting transactions. Rococo’s protocolinspired our use of dependency tracking for concurrencycontrol. All of these systems layer a concurrency con-trol protocol over a separate replication protocol, whichincurs coordination twice in serial.MDCC [16] and TAPIR [49, 50, 51] are the most re-

lated systems and we have discussed them extensivelythroughout the paper. Their fast fault tolerant transactioncommits under low contention inspired us to work onfast commits under all levels of contention, which is ourbiggest distinction from them.

8 ConclusionWe presented the design, implementation, and evalua-tion of Janus, a new protocol for fault tolerant distributedtransactions that are one-shot and written as stored pro-cedures. The key insight behind Janus is that the coordi-nation required for concurrency control and consensus ishighly similar. We exploit this insight by tracking con-flicting arrival orders of both different transactions acrossshards and the same transaction within a shard using asingle dependency graph. This enables Janus to com-mit in a single round trip when there is no contention.When there is contention Janus is able to commit by haveall shards of a transaction reach consensus on its depen-dencies and then breaking cycles through deterministicreordering before execution. This enables Janus to pro-vide higher throughput and lower latency than the stateof the art when workloads have moderate to high skew,are geo-replicated, and are realistically complex.

AcknowledgmentsThe work is supported by NSF grants CNS-1514422 andCNS-1218117 as well as AFOSR grant FA9550-15-1-0302. Lamont Nelson is also supported by an NSF grad-uate fellowship. We thankMichaelWalfish for helping usclarify the design of Janus and Frank Dabek for readingan early draft. We thank our shepherd, Dan Ports, and theanonymous reviewers of the OSDI program committeefor their helpful comments.

References[1] TPC-C Benchmark. http://www.tpc.org/tpcc/.[2] A. Adya, R. Gruber, B. Liskov, and U. Maheshwari. Ef-

ficient optimistic concurrency control using loosely syn-

chronized clocks. In Proceedings of ACM InternationalConference on Management of Data (SIGMOD), 1995.

[3] M. K. Aguilera, A. Merchant, M. Shah, A. Veitch, andC. Karamanolis. Sinfonia: a new paradigm for buildingscalable distributed systems. In Proceedings of ACM Sym-posium on Operating Systems Principles (SOSP), 2007.

[4] Amazon. Cross-Region Replication Using DynamoDBStreams. http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.CrossRegionRepl.html, 2016.

[5] H. Attiya and J. L. Welch. Sequential consistency versuslinearizability. ACM Transactions on Computer Systems(TOCS), 12(2), 1994.

[6] N. Bronson, Z. Amsden, G. Cabrera, P. Chakka, P. Di-mov, H. Ding, J. Ferris, A. Giardullo, S. Kulkarni, H. Li,M. Marchukov, D. Petrov, L. Puzar, Y. J. Song, andV. Venkataramani. TAO: Facebook’s distributed data storefor the social graph. In Proceedings of USENIX Confer-ence on Annual Technical Conference (ATC), 2013.

[7] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wal-lach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber.Bigtable: A distributed storage system for structured data.In Proceedings of USENIX Symposium on Opearting Sys-tems Design and Implementation (OSDI), 2006.

[8] B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silber-stein, P. Bohannon, H.-A. Jacobsen, N. Puz, D. Weaver,and R. Yerneni. PNUTS: Yahoo!’s hosted data servingplatform. Proceedings of International Conference onVery Large Data Bases (VLDB), 2008.

[9] J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost,J. Furman, S. Ghemawat, A. Gubarev, C. Heiser,P. Hochschild, et al. Spanner: Google’s globally dis-tributed database. In Proceedings of USENIX Sympo-sium on Opearting Systems Design and Implementation(OSDI), 2012.

[10] J. Cowling and B. Liskov. Granola: low-overhead dis-tributed transaction coordination. In Proceedings ofUSENIX Conference on Annual Technical Conference(ATC), 2012.

[11] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati,A. Lakshman, A. Pilchin, S. Sivasubramanian, P.Vosshall,and W. Vogels. Dynamo: Amazon’s highly available key-value store. In Proceedings of ACM Symposium on Oper-ating Systems Principles (SOSP), 2007.

[12] H. Garcia-Molina and K. Salem. Main memory databasesystems: An overview. Knowledge andData Engineering,IEEE Transactions on, 4(6):509–516, 1992.

[13] H. Garcia-Molina, R. J. Lipton, and J. Valdes. A massivememory machine. Computers, IEEE Transactions on, 100(5):391–399, 1984.

[14] M. P. Herlihy and J. M. Wing. Linearizability: A correct-ness condition for concurrent objects. ACM Transactionson Programming Languages and Systems (TOPLAS), 12(3):463–492, 1990.

[15] R. Kallman, H. Kimura, J. Natkins, A. Pavlo, A. Rasin,S. Zdonik, E. P. Jones, S. Madden, M. Stonebraker,

530 12th USENIX Symposium on Operating Systems Design and Implementation USENIX Association

Page 16: Consolidating Concurrency Control and Consensus for Commits

Y. Zhang, et al. H-store: a high-performance, distributedmain memory transaction processing system. In Pro-ceedings of International Conference on Very Large DataBases (VLDB), 2008.

[16] T. Kraska, G. Pang, M. J. Franklin, S. Madden, andA. Fekete. MDCC: Multi-data center consistency. InProceedings of ACM European Conference on ComputerSystems (EuroSys), 2013.

[17] L. Lamport. Time, clocks, and the ordering of events in adistributed system. Commun. ACM, 1978.

[18] L. Lamport. The part-time parliament. ACM Transactionson Computer Systems (TOCS), 16(2):133–169, 1998.

[19] L. Lamport. Paxos made simple. ACM Sigact News, 32(4):18–25, 2001.

[20] L. Lamport. Generalized consensus and Paxos. Techni-cal report, Technical Report MSR-TR-2005-33, MicrosoftResearch, 2005.

[21] L. Lamport. Fast Paxos. Distributed Computing, 19(2):79–103, October 2006.

[22] K. Li and J. F. Naughton. Multiprocessor main memorytransaction processing. In Proceedings of the first interna-tional symposium onDatabases in parallel and distributedsystems, 2000.

[23] R. J. Lipton and J. S. Sandberg. PRAM: A scalable sharedmemory. Technical Report TR-180-88, Princeton Univ.,Dept. Comp. Sci., 1988.

[24] B. Liskov and J. Cowling. Viewstamped replication revis-ited. Technical report, MIT-CSAIL-TR-2012-021, MIT,2012.

[25] B. Liskov, M. Castro, L. Shrira, and A. Adya. Providingpersistent objects in distributed systems. InECOOP, 1999.

[26] W. Lloyd, M. J. Freedman, M. Kaminsky, and D. G. An-dersen. Don’t settle for eventual: scalable causal con-sistency for wide-area storage with COPS. In Proceed-ings of ACM Symposium on Operating Systems Principles(SOSP), 2011.

[27] W. Lloyd, M. J. Freedman, M. Kaminsky, and D. G. An-dersen. Stronger semantics for low-latency geo-replicatedstorage. In Proceedings of USENIX Conference onNetworked Systems Design and Implementation (NSDI),2013.

[28] H. Mahmoud, F. Nawab, A. Pucher, D. Agrawal, andA. El Abbadi. Low-latency multi-datacenter databasesusing replicated commit. In Proceedings of InternationalConference on Very Large Data Bases (VLDB), 2013.

[29] Y.Mao, F. P. Junqueira, and K.Marzullo. Mencius: build-ing efficient replicated state machines for wans. In Pro-ceedings of USENIX Symposium on Opearting SystemsDesign and Implementation (OSDI), 2008.

[30] I. Moraru, D. G. Andersen, and M. Kaminsky. There ismore consensus in egalitarian parliaments. In Proceed-ings of ACM Symposium on Operating Systems Principles(SOSP), 2013.

[31] S. Mu, Y. Cui, Y. Zhang, W. Lloyd, and J. Li. Extractingmore concurrency from distributed transactions. In Pro-

ceedings of USENIX Symposium on Opearting SystemsDesign and Implementation (OSDI), 2014.

[32] S. Mu, L. Nelson, , W. Lloyd, and J. Li. Consolidat-ing concurrency control and consensus for commits underconflicts. Technical Report TR2016-983, New York Uni-versity, Courant Institute of Mathematical Sciences, 2016.

[33] B. M. Oki and B. H. Liskov. Viewstamped replication:A new primary copy method to support highly-availabledistributed systems. In Proceedings of ACM Symposiumon Principles of Distributed Computing (PODC), 1988.

[34] C. H. Papadimitriou. The serializability of concurrentdatabase updates. Journal of the ACM, 26(4), 1979.

[35] D. Peng and F. Dabek. Large-scale incremental processingusing distributed transactions and notifications. In Pro-ceedings of USENIX Symposium on Opearting SystemsDesign and Implementation (OSDI), 2010.

[36] D. R. Ports, J. Li, V. Liu, N. K. Sharma, and A. Krishna-murthy. Designing distributed systems using approximatesynchrony in data center networks. In Proceedings ofUSENIX Conference on Networked Systems Design andImplementation (NSDI), 2015.

[37] D. J. Rosenkrantz, R. E. Stearns, and P. M. Lewis II.System level concurrency control for distributed databasesystems. ACMTransactions onDatabase Systems (TODS),3(2):178–198, 1978.

[38] F. B. Schneider. Implementing fault-tolerant services us-ing the statemachine approach: a tutorial. ACMComputerSurveys, 22(4), Dec. 1990.

[39] Y. Sovran, R. Power, M. K. Aguilera, and J. Li. Trans-actional storage for geo-replicated systems. In Proceed-ings of ACM Symposium on Operating Systems Principles(SOSP), 2011.

[40] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos,N. Hachem, and P. Helland. The end of an architec-tural era:(it’s time for a complete rewrite). In Proceedingsof International Conference on Very Large Data Bases(VLDB), 2007.

[41] A. Thomson and D. J. Abadi. The case for determin-ism in database systems. In Proceedings of InternationalConference on Very Large Data Bases (VLDB), 2010.

[42] A. Thomson, T. Diamond, S.-C. Weng, K. Ren, P. Shao,and D. J. Abadi. Calvin: fast distributed transactionsfor partitioned database systems. In Proceedings of ACMInternational Conference on Management of Data (SIG-MOD), 2012.

[43] S. Tu, W. Zheng, E. Kohler, B. Liskov, and S. Madden.Speedy transactions in multicore in-memory databases.In Proceedings of ACM Symposium on Operating SystemsPrinciples (SOSP), 2013.

[44] R. van Renesse and F. B. Schneider. Chain replication forsupporting high throughput and availability. In Proceed-ings of USENIX Symposium on Opearting Systems Designand Implementation (OSDI), 2004.

[45] G. Weikum and G. Vossen. Transactional InformationSystems: Theory, Algorithms, and the Practice of Concur-rency Control and Recovery. Morgan Kaufmann, 2001.

USENIX Association 12th USENIX Symposium on Operating Systems Design and Implementation 531

Page 17: Consolidating Concurrency Control and Consensus for Commits

[46] A. Whitney, D. Shasha, and S. Apter. High volume trans-action processing without concurrency control, two phasecommit, sql or C++. In Seventh InternationalWorkshop onHigh Performance Transaction Systems, Asilomar, 1997.

[47] C. Xie, C. Su, M. Kapritsos, Y. Wang, N. Yaghmazadeh,L. Alvisi, and P. Mahajan. Salt: Combining ACIDand BASE in a distributed batabase. In Proceedings ofUSENIX Symposium on Opearting Systems Design andImplementation (OSDI), 2014.

[48] C. Xie, C. Su, C. Littley, L. Alvisi, M. Kapritsos, andY. Wang. High-performance ACID via modular concur-rency control. In Proceedings of ACM Symposium onOperating Systems Principles (SOSP), 2015.

[49] I. Zhang, N. K. Sharma, A. Szekeres, A. Krishnamurthy,and D. R. Ports. Building consistent transactions with in-consistent replication. Technical report, Technical Report

UW-CSE-2014-12-01, University of Washington, 2014.

[50] I. Zhang, N. K. Sharma, A. Szekeres, A. Krishnamurthy,and D. R. Ports. Building consistent transactions withinconsistent replication (extended version). Technical re-port, Technical Report UW-CSE-2014-12-01 v2, Univer-sity of Washington, 2015.

[51] I. Zhang, N. K. Sharma, A. Szekeres, A. Krishnamurthy,and D. R. K. Ports. Building consistent transactions withinconsistent replication. In Proceedings of ACM Sympo-sium on Operating Systems Principles (SOSP), 2015.

[52] Y. Zhang, R. Power, S. Zhou, Y. Sovran, M. K. Aguilera,and J. Li. Transaction chains: achieving serializabilitywith low latency in geo-distributed storage systems. InProceedings of ACM Symposium on Operating SystemsPrinciples (SOSP), 2013.

532 12th USENIX Symposium on Operating Systems Design and Implementation USENIX Association