
Coordination Avoidance in Database Systems

Peter Bailis, Alan Fekete†, Michael J. Franklin, Ali Ghodsi, Joseph M. Hellerstein, Ion Stoica
UC Berkeley and †University of Sydney

ABSTRACT

Minimizing coordination, or blocking communication between concurrently executing operations, is key to maximizing scalability, availability, and high performance in database systems. However, uninhibited coordination-free execution can compromise application correctness, or consistency. When is coordination necessary for correctness? The classic use of serializable transactions is sufficient to maintain correctness but is not necessary for all applications, sacrificing potential scalability. In this paper, we develop a formal framework, invariant confluence, that determines whether an application requires coordination for correct execution. By operating on application-level invariants over database states (e.g., integrity constraints), invariant confluence analysis provides a necessary and sufficient condition for safe, coordination-free execution. When programmers specify their application invariants, this analysis allows databases to coordinate only when anomalies that might violate invariants are possible. We analyze the invariant confluence of common invariants and operations from real-world database systems (i.e., integrity constraints) and applications and show that many are invariant confluent and therefore achievable without coordination. We apply these results to a proof-of-concept coordination-avoiding database prototype and demonstrate sizable performance gains compared to serializable execution, notably a 25-fold improvement over prior TPC-C New-Order performance on a 200 server cluster.

1. INTRODUCTION

Minimizing coordination is key in high-performance, scalable database design. Coordination—informally, the requirement that concurrently executing operations synchronously communicate or otherwise stall in order to complete—is expensive: it limits concurrency between operations and undermines the effectiveness of scale-out across servers. In the presence of partial system failures, coordinating operations may be forced to stall indefinitely, and, in the failure-free case, communication delays can increase latency [9, 26]. In contrast, coordination-free operations allow aggressive scale-out, availability [26], and low latency execution [1]. If operations are coordination-free, then adding more capacity (e.g., servers) will result in additional throughput; operations can execute on the new resources without affecting the old set of resources. Partial failures will not affect non-failed operations, and latency between any database replicas can be hidden from end-users.

Unfortunately, coordination-free execution is not always safe. Uninhibited coordination-free execution can compromise application-level correctness.1 In canonical banking application examples, concurrent, coordination-free withdrawal operations can result in undesirable and inconsistent outcomes like negative account balances—application-level anomalies that the database should prevent. To ensure correct behavior, a database system must coordinate the execution of these operations that, if otherwise executed concurrently, could result in inconsistent application state.

1 Our use of the term “consistency” refers to application-level correctness, as is traditional in the database literature [13, 19, 23, 28, 54]. As we discuss in Section 5, replicated data consistency (and also isolation [2, 9]) models like linearizability [26] can be cast as application criteria if desired.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain permission prior to any use beyond those covered by the license. Contact copyright holder by emailing [email protected]. PREPRINT of the VLDB Endowment, Vol. 8, No. 3. Copyright 2014 VLDB Endowment 2150-8097/14/09.

This tension between coordination and consistency is evidenced by the range of database concurrency control policies. In traditional database systems, serializable isolation provides concurrent operations (transactions) with the illusion of executing in some serial order [13]. As long as individual transactions maintain correct application state, serializability guarantees correctness [28]. However, each pair of concurrent operations (at least one of which is a write) can potentially compromise serializability and therefore will require coordination to execute [9, 19]. By isolating users at the level of reads and writes, serializability can be overly conservative and may in turn coordinate more than is strictly necessary for consistency [27, 37, 51, 56]. For example, hundreds of users can safely and simultaneously retweet Barack Obama on Twitter without observing a serial ordering of updates to the retweet counter. In contrast, a range of widely-deployed weaker models require less coordination to execute but surface read and write behavior that may in turn compromise consistency [2, 9, 20, 46]. With these alternative models, it is up to users to decide when weakened guarantees are acceptable for their applications [6], leading to considerable confusion regarding and substantial interest in the relationship between consistency, scalability, and availability [1, 9, 16, 19, 20, 26, 38].

In this paper, we tackle the central question inherent in this trade-off: when is coordination strictly necessary to maintain application-level consistency? To do so, we enlist the aid of applications and require programmers to specify their correctness criteria in the form of invariants. For example, a banking application writer would specify that account balances should be positive (e.g., by schema annotations), similar to constraints in modern databases today. Using these invariants, we formalize a necessary and sufficient condition for correct and coordination-free execution of an application’s operations—the first such condition we have encountered. This property—invariant confluence—captures the potential scalability and availability of an application, independent of any particular database implementation: if an application’s operations are I-confluent, a database can correctly execute them without coordination. If operations are not I-confluent, then a database must coordinate to guarantee correctness. This provides a basis for coordination avoidance: the use of coordination only when necessary.
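To make this concrete, consider the banking invariant as a predicate over database state. The following is a minimal sketch of our own (in Python, not from the paper); the state representation and names like non_negative_balances are illustrative assumptions:

    # A database state is modeled as a mapping from account ID to balance;
    # an invariant is a predicate over a whole state (Section 3 formalizes
    # this as I : D -> {true, false}).
    def non_negative_balances(state):
        return all(balance >= 0 for balance in state.values())

    def withdraw(state, account, amount):
        # Coordination-free: operates only on a local snapshot of state.
        updated = dict(state)
        updated[account] -= amount
        return updated

    initial = {"alice": 100}
    r1 = withdraw(initial, "alice", 75)  # locally valid: 25 >= 0
    r2 = withdraw(initial, "alice", 75)  # locally valid: 25 >= 0
    assert non_negative_balances(r1) and non_negative_balances(r2)
    # Both replicas approve, but the combined effect (100 - 75 - 75 = -50)
    # violates the invariant: concurrent withdrawals require coordination.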

While coordination-free execution is powerful, are any useful operations safely executable without coordination? I-confluence analysis determines when concurrent execution of specific operations can be “merged” into valid database state; we accordingly analyze invariants and operations from several real-world databases and applications. Many production databases today already support invariants in the form of primary key, uniqueness, foreign key, and row-level check constraints [9, 40]. We analyze these and show many are I-confluent, including forms of foreign key constraints, unique value generation, and some check constraints, while others, like primary key constraints, are, in general, not. We also consider entire applications and apply our analysis to the workloads of the OLTPBenchmark suite [21]: surprisingly, many are easy to express and are also I-confluent. As an extended case study, we examine the TPC-C benchmark [53], the preferred standard for evaluating new concurrency control algorithms [21, 33, 44, 50, 52]. We show that ten of twelve of TPC-C’s invariants are I-confluent and, more importantly, that compliant TPC-C can be implemented without any synchronous coordination across servers. We subsequently scale a coordination-avoiding database prototype linearly, to over 12.7M TPC-C New-Order transactions per second on 200 servers, representing a 25-fold improvement over prior results.

Overall, I-confluence offers a concrete grasp on the challenge of minimizing coordination while ensuring application-level consistency. In seeking a necessary and sufficient (i.e., “tight”) condition for safe, coordination-free execution, we require the programmer to describe her invariants. If invariants or operations are unavailable for inspection, users must fall back to using serializable transactions or, alternatively, perform the same ad-hoc analyses they use today. Moreover, to prevent several low-level isolation anomalies like non-linearizable reads and writes, it is well known that operations must coordinate [9, 26]. However, when users can correctly specify their application invariants and operations, they can maximize scalability without requiring expertise in the milieu of weak isolation models [2, 9]. As an intermediate option, we have also found that I-confluence can be a useful design tool: studying specific invariant and operation combinations (independent of applications) can indicate the existence of more scalable algorithms [16]. Application consistency does not always require coordination, and I-confluence analysis can explain both when and why this is the case.

Overview. The remainder of this paper proceeds as follows: Section 2 describes and quantifies the costs of coordination. Section 3 introduces our system model, and Section 4 contains our primary theoretical result. Readers may skip to Section 5 for practical applications of I-confluence to real-world invariant-operation combinations. Section 6 subsequently applies these combinations to real applications and presents an experimental case study of TPC-C. Section 7 describes related work, and Section 8 concludes.

2. CONFLICTS AND COORDINATION

As repositories for application state, databases are traditionally tasked with maintaining correct data on behalf of users. During concurrent access to data, a correct database must therefore decide which user operations can execute simultaneously and which, if any, must coordinate, or block. In this section, we explore the relationship between the conflicts that a database attempts to avoid and the coordination costs of doing so.

Example. As a running example, we consider a database-backed payroll application that maintains information about employees and departments within a small business. In the application, a.) each employee is assigned a unique ID number and b.) each employee belongs to exactly one department. A correct database must maintain these application-level properties, or invariants, on behalf of the application (i.e., without application-level intervention). In our payroll application, this is non-trivial: for example, if the application attempts to simultaneously create two employees, then the database must ensure that the employees are assigned distinct IDs.

Serializability and conflicts. The classic answer to maintaining application-level invariants is to use serializable isolation: execute each user’s ordered sequence of operations, or transactions, such that the end result is equivalent to some sequential execution [13, 28, 51]. If each transaction preserves correctness in isolation, composition via serializable execution ensures correctness. In our payroll example, the database would execute the two employee creation transactions such that one transaction appears to execute after the other, avoiding duplicate ID assignment inconsistencies.

While serializability is a powerful abstraction, it comes with a cost: for arbitrary transactions (and for all implementations of serializability’s more conservative variant—conflict serializability), any two operations to the same item—at least one of which is a write—will result in a conflict. These conflicts require coordination or, informally, blocking communication between concurrent transactions: to provide a serial ordering, conflicts must be totally ordered [13]. For example, given database state {x = ⊥, y = ⊥}, if transaction T1 writes x = 1 and reads from y and T2 writes y = 1 and reads from x, a database cannot both execute T1 and T2 entirely concurrently and maintain serializability [9, 19].

The steep costs of conflicts. The coordination overheads of avoiding the above conflicts incur three primary penalties: increased latency (due to stalled execution), unavailability (in the event of partial failures), and decreased throughput. If a transaction takes d seconds to execute, the maximum throughput of conflicting transactions operating on the same items under a general-purpose (i.e., interactive, non-batched) transaction model is limited by 1/d, while coordinating operations will also have to wait. On a single system, delays can be small, permitting tens to hundreds of thousands of conflicting transactions per item per second. In a partitioned database system, where different items are located on different servers, or in a replicated database system, where the same item is located (and is available for operations) on multiple servers, the cost increases: delay is lower-bounded by network latency. On a local area network, delay may vary from several microseconds (e.g., via InfiniBand or RDMA) to several milliseconds on today’s cloud infrastructure, permitting anywhere from a few hundred transactions to a few hundred thousand transactions per second. On a wide-area network, delay is lower-bounded by the speed of light (worst-case on Earth, around 75 ms, or about 13 operations per second [9]). In contrast, non-conflicting operations can execute concurrently and will not incur these penalties.
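The arithmetic behind these bounds is direct: serial throughput on a contended item cannot exceed 1/d. A short illustrative sketch of ours (the delay figures are representative, not measurements):

    # Maximum throughput of conflicting transactions on a single item is
    # 1/d, where d is the per-transaction coordination delay in seconds.
    for name, delay_s in [("RDMA LAN (~5 us)", 5e-6),
                          ("cloud LAN (~1 ms)", 1e-3),
                          ("WAN, speed of light (~75 ms)", 75e-3)]:
        print(f"{name}: at most {1.0 / delay_s:,.0f} conflicting txns/s per item")
    # The 75 ms worst case yields ~13 operations per second, as in the text.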

Quantifying coordination overhead. To further understand the costs of coordination, we performed two sets of measurements—one on a database prototype and one using traces from prior studies.

We first compared the throughput of a set of conflicting transactions with that of a conflict-free set of transactions. We partitioned a set of eight data items across eight servers and ran one set of transactions with an optimized variant of two-phase locking (providing serializability) [13] and another set of transactions without locking (Figure 1; see [10, Appendix A] for more details and results). With single-item, non-distributed transactions, the conflict-free implementation achieves, in aggregate, over 12M transactions per second and bottlenecks on physical resources—namely, CPU cycles. In contrast, the lock-based implementation achieves approximately 1.1M transactions per second: it is unable to fully utilize all multi-core processor contexts due to lock contention. For distributed transactions, conflict-free throughput decreases linearly (as an N-item transaction performs N writes), while the throughput of conflicting operations underperforms by over three orders of magnitude.

Figure 1: Microbenchmark performance of conflicting and non-conflicting transactions across eight items on eight servers. (Plot: throughput in txns/s vs. number of items per transaction, for “No Conflict” and “Conflict” workloads.)

While the above microbenchmark demonstrates the costs of a particular implementation of conflict resolution, we also studied the effect of more fundamental, implementation-independent overheads (i.e., also applicable to optimistic and scheduling-based concurrency control mechanisms). We determined the maximum attainable throughput for conflicting operations within a single datacenter (based on data from [58]) and across multiple datacenters (based on data from [9]) due to blocking coordination during atomic commitment [13]. For an N-server transaction, classic two-phase commit (C-2PC) requires N (parallel) coordinator-to-server RTTs, while decentralized two-phase commit (D-2PC) requires N (parallel) server-to-server broadcasts, or N² messages. Figure 2 shows that, in the local area, with only two servers (e.g., two replicas or two conflicting operations on items residing on different servers), throughput is bounded by 1125 transactions/s (via D-2PC; 668/s via C-2PC). Across eight servers, D-2PC throughput drops to 173 transactions/s (resp. 321 for C-2PC) due to long-tailed latency distributions. In the wide area, the effects are more stark: if coordinating from Virginia to Oregon, D-2PC message delays are 83 ms per commit, allowing 12 operations per second. If coordinating between all eight EC2 availability zones, throughput drops to slightly over 2 transactions/s in both algorithms. ([10, Appendix A] again provides more details.)
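A back-of-the-envelope model (ours, not the authors’ measurement harness) recovers the wide-area numbers: when every commit must block for a fixed total message delay, throughput is capped at the reciprocal of that delay.

    # Upper bound on serial commit throughput given the total blocking
    # message delay per commit. The text reports 83 ms of D-2PC message
    # delays per commit between Virginia and Oregon.
    def max_commit_throughput(total_delay_s):
        return 1.0 / total_delay_s

    print(f"{max_commit_throughput(0.083):.1f} commits/s")  # ~12, as in Figure 2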

These results should be unsurprising: coordinating between concurrent, conflicting operations—especially over the network—can incur serious performance penalties. In contrast, conflict-free operations can execute without incurring these costs. The costs of actual workloads can vary: if conflicts are rare, concurrency control will not be a bottleneck. For example, a serializable database executing transactions with disjoint read and write sets can perform as well as a non-serializable database without compromising correctness [32]. However, worst-case behavior (e.g., a large number of conflicts for a given data item, or conflicting operations spread across multiple servers) can result in substantial penalties. Minimizing the number of conflicts and their degree of distribution can therefore have a tangible impact on performance, latency, and availability [1, 9, 26]. While we study real applications in Section 6, these pathological access patterns highlight the worst of coordination costs.

Figure 2: Atomic commitment latency as an upper bound on throughput over LAN and WAN networks. (a.) Maximum transaction throughput over a local-area network in [58]; plot: max throughput in txns/s vs. number of servers in 2PC, for D-2PC and C-2PC. (b.) Maximum throughput over a wide-area network in [9] with transactions originating from a coordinator in Virginia (VA; OR: Oregon, CA: California, IR: Ireland, SP: São Paulo, TO: Tokyo, SI: Singapore, SY: Sydney).

Property                    Effect
Global validity             Invariants hold over committed states
Transactional availability  Non-trivial response guaranteed
Convergence                 Updates are reflected in shared state
Coordination-freedom        No synchronous coordination

Table 1: Informal utility of properties in system model.

Our goal: Minimize coordination. In this paper, we seek to minimize the amount of coordination required to correctly execute an application’s transactions. As discussed in Section 1, serializability is sufficient to maintain correctness but is not always necessary; that is, many—but not all—transactions can be executed concurrently without necessarily compromising application correctness. It is our job in the remainder of this paper to identify when this safe, concurrent execution is possible. If serializability requires coordinating between each possible pair of conflicting reads and writes, we will only coordinate between pairs of operations that might compromise application-level correctness. To do so, we must both raise the specification of correctness beyond the level of reads and writes and directly account for the process of reconciling the effects of concurrent transaction execution at the application level.

3. SYSTEM MODEL

To precisely characterize coordination avoidance, we first present a system model. We begin with an informal overview. In our model, transactions operate over independent (logical) “snapshots” of database state. Transaction writes are applied at one or more servers initially when the transaction commits and then are integrated into other servers asynchronously via a “merge” operator that incorporates those changes into the server’s state. Given a set of invariants describing valid database states, as Table 1 outlines, we seek to understand when it is possible to ensure invariants are always satisfied (global validity) while guaranteeing a response (transactional availability) and the existence of a common state (convergence), all without communication during transaction execution (coordination-freedom). This formal model need not directly correspond to a given implementation (e.g., see the database architecture in Section 6)—rather, it serves as a useful abstraction in our formalization. The remainder of this section precisely defines these concepts; readers more interested in their application should skip to Section 4.

Databases. We represent a state of the shared database as a set D of unique versions of data items located on a set of database servers. Each version is located on at least one server and possibly multiple servers, but the distinction is not important for our model. We use D to denote the set of possible database states—that is, the set of sets of versions. The database is initially populated by an initial state D0 (typically but not necessarily empty).

Transactions, Replicas, and Merging. Application clients submit requests to the database in the form of transactions, or ordered groups of operations on data items that should be executed together. Each transaction operates on a replica, or set of versions of the items mentioned in the transaction. At the beginning of the transaction, the replica contains a subset of the database state and is formed from all of the versions of the relevant items that can be found at one or more physical servers that are contacted during transaction execution. As the transaction executes, it may add versions (of items in its writeset) to its replica. Thus, we define a transaction T as a transformation on a replica: T : D → D. In our initial model, we treat transactions as opaque transformations that can contain writes (which add new versions to the replica’s set of versions) or reads (which return a specific set of versions from the replica). (Later, we will discuss transactions operating on data types such as counters.)

Upon completion, each transaction can commit, signaling success, or abort, signaling failure. Upon commit, the replica state is subsequently merged (⊔ : D × D → D) into the set of versions at one or more servers, where later transactions are able to observe its effects. Over time, the effects are propagated through the system to other servers, again through the use of the merge operator. Though not strictly necessary, we assume this merge operator is commutative, associative, and idempotent [5, 48]. In our initial model, we define merge as set union of the versions contained at different servers (Section 5 discusses additional implementations). For example, if server Rx = {v} and Ry = {w}, then Rx ⊔ Ry = {v, w}.
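A minimal sketch of ours of the initial model’s set-union merge; the algebraic assumptions fall out of set semantics:

    def merge(server_a, server_b):
        # Set union of the versions held at two servers: commutative,
        # associative, and idempotent, as the model assumes.
        return server_a | server_b

    Rx, Ry, Rz = {"v"}, {"w"}, {"u"}
    assert merge(Rx, Ry) == {"v", "w"}
    assert merge(Rx, Ry) == merge(Ry, Rx)                        # commutative
    assert merge(merge(Rx, Ry), Rz) == merge(Rx, merge(Ry, Rz))  # associative
    assert merge(Rx, Rx) == Rx                                   # idempotent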

In effect, each transaction can modify its replica state without modifying any other concurrently executing transactions’ replica state. Replicas therefore provide transactions with partial “snapshot” views of global state (that we will use to simulate concurrent executions, similar to revision diagrams [15]). Importantly, two transactions’ replicas do not necessarily correspond to two physically separate servers; rather, a replica is simply a partial “view” over the global state of the database system. For now, we assume transactions are known in advance (see also [10, Section 8]).

Invariants. To determine whether a database state is valid according to application correctness criteria, we use invariants, or predicates over replica state: I : D → {true, false} [23]. In our payroll example, we could specify an invariant that only one user in a database has a given ID. This invariant—as well as almost all invariants we consider—is naturally expressed as a part of the database schema (e.g., via DDL); however, our approach allows us to reason about invariants even if they are known to the developer but not declared to the system. Invariants directly capture the notion of ACID Consistency [13, 28], and we say that a database state is valid under an invariant I (or I-valid) if it satisfies the predicate:

Definition 1. A replica state R ∈ D is I-valid iff I(R) = true.

We require that D0 be valid under invariants. Section 4.3 provides discussion regarding our use of invariants.

Availability. To ensure each transaction receives a non-trivial response, we adopt the following definition of availability [9]:

Definition 2. A system provides transactionally available execution iff, whenever a client executing a transaction T can access servers containing one or more versions of each item in T, then T eventually commits or aborts itself, either due to an abort operation in T or because committing the transaction would violate a declared invariant over T’s replica state.

Figure 3: An example coordination-free execution of two transactions, T1 and T2, on two servers. Each transaction writes to its local replica, then, after commit, the servers asynchronously exchange state and converge to a common state (D3). (Diagram: both servers begin at D0 = {}; T1 commits on Server 1 with replica {x1}, yielding D1 = {x1}, while T2 commits on Server 2 with replica {x2}, yielding D2 = {x2}; an asynchronous merge of the divergent server states leaves both servers at D3 = {x1, x2}.)

Under the above definition, a transaction can only abort if it explicitly chooses to abort itself or if committing would violate invariants over the transaction’s replica state.2

2 This basic definition precludes fault tolerance (i.e., durability) guarantees beyond a single server failure [9]. We can relax this requirement and allow communication with a fixed number of servers (e.g., F+1 servers for F-fault tolerance; F is often small [20]) without affecting our results. This does not affect scalability because, as more replicas are added, the additional communication overhead is constant.

Convergence. Transactional availability allows replicas to maintain valid state independently, but it is vacuously possible to maintain “consistent” database states by letting replicas diverge (contain different state) forever. This guarantees safety (nothing bad happens) but not liveness (something good happens) [47]. To require state sharing, we adopt the following definition:

Definition 3. A system is convergent iff, in the absence of new update transactions and in the absence of indefinite communication delays, all pairs of servers eventually contain the same set of versions for any item that they both store.

To capture the process of reconciling divergent states, we use the previously introduced merge operator: given two divergent server states, we apply the merge operator to produce convergent state. We assume the effects of merge are atomically visible: either all effects of a merge are visible or none are. This assumption is not strictly necessary but, as it is maintainable without coordination [9, 11], does not affect our results.

Maintaining validity. To make sure that both divergent and convergent database states are valid and, therefore, that transactions never observe invalid states, we introduce the following property:

Definition 4. A system is globally I-valid iff all replicas always contain I-valid state.

Coordination. Our system model is missing one final constraint on coordination between transactions, or replicas. We adopt the following definition of coordination-freedom:

Definition 5. A system provides coordination-free execution iff any finite number of transactions can sequentially execute on a replica without communicating with any other replica.

By example. Figure 3 illustrates a coordination-free execution of two transactions T1 and T2 on two separate, fully-replicated physical servers. Each transaction commits on its local replica, and the result of each transaction is reflected in the transaction’s local server state. After the transactions have completed, the servers exchange state and, after applying the merge operator, contain the same state. Any transactions executing later on either server will be able to work on a snapshot of the converged set of versions that includes the effects of both transactions.
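The execution of Figure 3 can be replayed in a few lines; this is a sketch of ours that follows the figure’s version names:

    # Coordination-free execution of T1 and T2 on two servers (Figure 3).
    server1, server2 = set(), set()   # D0 = {} on both servers

    server1.add("x1")                 # T1 commits locally: D1 = {x1}
    server2.add("x2")                 # T2 commits locally: D2 = {x2}

    # Asynchronous merge of divergent server states via set union:
    converged = server1 | server2
    server1, server2 = set(converged), set(converged)
    assert server1 == server2 == {"x1", "x2"}   # D3 on both servers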

4. CONSISTENCY SANS COORDINATION

With a system model and goals in hand, we now address the question: when do applications require coordination for correctness? The answer depends not just on the transactions that a database may be expected to perform and not just on the integrity constraints that the database is required to maintain. Rather, the answer depends on the combination of the two under study. Our contribution in this section is to formulate a criterion that will answer this question for specific combinations in an implementation-agnostic manner.

In this section, we focus almost exclusively on providing a formal answer to this question. The remaining sections of this paper are devoted to practical interpretation and application of these results.

4.1 I-confluence: Criteria Defined

To begin, we introduce a concept that will underlie our main result: invariant confluence (hereafter, I-confluence) [22]. Applied in a transactional context, the I-confluence property informally ensures that divergent, valid database states can be merged into a valid database state. That is, if the effects of two I-valid sequences of transactions (S1, S2) operating independently on replicas of I-valid database state Ds produce valid outputs (I(S1(Ds)) and I(S2(Ds)) each hold), their effects can safely be merged into a third, valid database state (I(S1(Ds) ⊔ S2(Ds)) holds). In the next sub-section, we show that I-confluence analysis directly determines the potential for safe, coordination-free execution.

We first formalize the process of executing a sequence of transactions on a single, valid replica and maintaining invariants between each transaction (which we will use to model divergent executions). If T is a set of transactions, and Si = ti1, …, tin is a sequence of transactions from the set T, then we write Si(D) = tin(… ti1(D)):

Definition 6 (Valid Sequence). Given invariant I, a sequence Si of transactions in set T, and database state D, we say Si is an I-valid sequence from D if, ∀k ∈ [1, n], tik(… ti1(D)) is I-valid.

We can now formalize the I-confluence property:

Definition 7 (I-confluence). A set of transactions T is I-confluent with respect to invariant I if, for all I-valid database states Ds = S0(D0), where S0 is an I-valid sequence of transactions in T from D0, and for all pairs of I-valid sequences S1, S2 of transactions in T from Ds, S1(Ds) ⊔ S2(Ds) is I-valid.

Figure 4 depicts an I-confluent execution using two I-valid sequences, each starting from a shared, I-valid database state Ds. Two I-valid sequences ti1, …, tin and tj1, …, tjm each modify the shared database state Ds. Under I-confluence, the terminal states resulting from these sequences (Din and Djm) must be valid under merge.3

I-confluence holds for specific combinations of invariants and transactions. In our payroll database example from Section 2, removing a user from the database is I-confluent with respect to the invariant that user IDs are unique. However, two transactions that remove two different users from the database are not I-confluent with respect to the invariant that there exists at least one user in the database at all times. We devote Section 5 to further (and more precise) study of these combinations.

3 We require these I-valid sequences to have a common ancestor to rule out the possibility of merging states that could not have arisen from transaction execution (e.g., even if no transaction assigns IDs, merging two states that each have unique but overlapping sets of IDs might be invalid).
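For tiny, finite examples, the diamond condition of Definition 7 can be checked by brute force. The sketch below is ours; it only explores bounded-length sequences from a single I-valid ancestor, and it reproduces the uniqueness counter-example from Section 2 (names are illustrative):

    from itertools import product

    def i_confluent(transactions, invariant, merge, start, max_len=2):
        # Enumerate states reachable from `start` via I-valid sequences of
        # bounded length, then check every pairwise merge is I-valid.
        reachable = {start}
        for length in range(1, max_len + 1):
            for seq in product(transactions, repeat=length):
                state, ok = start, True
                for t in seq:
                    state = t(state)
                    if not invariant(state):
                        ok = False
                        break
                if ok:
                    reachable.add(state)
        return all(invariant(merge(a, b)) for a in reachable for b in reachable)

    insert_stan = lambda s: s | {("stan", 5)}
    insert_mary = lambda s: s | {("mary", 5)}
    unique_ids = lambda s: len({i for (_, i) in s}) == len(s)
    union = lambda a, b: a | b

    # {Stan:5} and {Mary:5} are each I-valid, but their merge is not:
    print(i_confluent([insert_stan, insert_mary], unique_ids, union,
                      frozenset()))   # False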

Figure 4: The I-confluence property illustrated via a diamond diagram. If a set of transactions T is I-confluent, then all database states (Din, Djm) produced by I-valid sequences in T starting from a common, I-valid database state (Ds) must be mergeable (⊔) into an I-valid database state. (Diagram: from Ds with I(Ds) = true, sequences ti1, ti2, …, tin and tj1, tj2, …, tjm produce states Din and Djm, each I-valid—the precondition of valid divergence from the initial state; the implication is that the merge must satisfy I(Din ⊔ Djm) = true.)

4.2 I-confluence and Coordination

We can now apply I-confluence to our goals from Section 3:

Theorem 1. A globally I-valid system can execute a set of transactions T with coordination-freedom, transactional availability, and convergence if and only if T is I-confluent with respect to I.

We provide a full proof of Theorem 1 in [10, Appendix B] (which is straightforward) but provide a sketch here. The backwards direction is by construction: if I-confluence holds, each replica can check each transaction’s modifications locally, and replicas can merge independent modifications to guarantee convergence to a valid state. The forwards direction uses a partitioning argument [26] to derive a contradiction: we construct a scenario under which a system cannot determine whether a non-I-confluent transaction should commit without violating one of our desired properties (either compromising validity or availability, diverging forever, or coordinating).

Theorem 1 establishes I-confluence as a necessary and sufficient condition for correct, coordination-free execution—the first such condition we have encountered. If I-confluence holds, there exists a correct, coordination-free execution strategy for the transactions; if not, no possible implementation can guarantee these properties for the provided invariants and transactions. That is, if I-confluence does not hold, there exists at least one execution of transactions on divergent replicas that will violate the given invariants when servers converge. To prevent invalid states from occurring, at least one of the transaction sequences will have to forego availability or coordination-freedom, or the system will have to forego convergence. I-confluence analysis is independent of any given implementation, and effectively “lifts” prior discussions of scalability, availability, and low latency [1, 9, 26] to the level of application (i.e., not “I/O” [6]) correctness. While this formalization is not difficult, it provides a useful handle on the implications of coordination-free execution without requiring reasoning about low-level properties such as physical data location and the number of servers.

4.3 Discussion and Limitations

I-confluence captures a simple (informal) rule: coordination can only be avoided if all local commit decisions are globally valid (i.e., merging all local states satisfies invariants). If two independent decisions to commit can result in an invalid converged state, then replicas must coordinate in order to ensure that only one of the decisions is to commit. Given the existence of an unsafe execution and the inability to distinguish between safe and invalid executions using only local information, a globally valid system must coordinate in order to prevent the invalid execution from arising.


Use of invariants. Our use of invariants in I-confluence is key to achieving a necessary and not simply sufficient condition. By directly capturing application-level correctness criteria via invariants, I-confluence analysis only identifies “true” conflicts. This allows I-confluence analysis to perform a more accurate assessment of whether coordination is needed compared to related conditions like commutativity (discussed in Section 7).

However, the reliance on invariants also has drawbacks. I-confluence analysis only guards against violations of any provided invariants. If invariants are incorrectly or incompletely specified, an I-confluent database system may violate application-level correctness. If users cannot guarantee the correctness and completeness of their invariants and operations, they should opt for a more conservative analysis or mechanism such as employing serializable transactions. Accordingly, our development of I-confluence analysis provides developers with a powerful option—but only if used correctly. If used incorrectly, I-confluence allows incorrect results, or, if not used at all, developers must resort to existing alternatives.

This final point raises several questions: can we specify invariants in real-world use cases? Classic database concurrency control models assume that “the [set of application invariants] is generally not known to the system but is embodied in the structure of the transaction” [23, 54]. Nevertheless, since 1976, databases have introduced support for a finite set of invariants [12, 24, 27, 30, 35] in the form of primary key, foreign key, uniqueness, and row-level “check” constraints [40]. We can (and, in this paper, do) analyze these invariants, which can—like many program analyses [16]—lead to new insights about execution strategies. We have found the process of invariant specification to be non-trivial but feasible; Section 6 is devoted to describing some of our experiences.

(Non-)determinism. I-confluence analysis effectively captures points of unsafe non-determinism [6] in transaction execution. As we have seen in many of our examples thus far, total non-determinism under concurrent execution can compromise application-level consistency [5, 34]. But not all non-determinism is bad: many desirable properties (e.g., classical distributed consensus among processes) involve forms of acceptable non-determinism (e.g., any proposed outcome is acceptable as long as all processes agree) [29]. In many cases, maximizing safe concurrency requires non-determinism.

I-confluence analysis allows this non-deterministic divergence of database states but makes two useful guarantees about those states. First, the requirement for global validity ensures safety (in the form of invariants). Second, the requirement for convergence ensures liveness. Accordingly, via its use of invariants, I-confluence allows users to scope non-determinism while permitting only those states that are acceptable.

5. FROM THEORY TO PRACTICE

As a test for coordination requirements, I-confluence exposes a trade-off between the operations a user wishes to perform and the properties she wishes to guarantee. At one extreme, if a user’s transactions do not modify database state, she can guarantee any satisfiable invariant. At the other extreme, with no invariants, a user can safely perform any operations she likes. While neither of these extremes is particularly realistic, the space in-between contains a spectrum of interesting and useful combinations.

Until now, we have been largely concerned with formalizing I-confluence for abstract operations; in this section, we begin to leverage this property. We examine a series of practical invariants by considering several features of SQL, ending with abstract data types and revisiting our payroll example along the way. We will apply these results to full applications in Section 6.

Invariant              Operation                    I-C?  Proof #
Attribute Equality     Any                          Yes   1
Attribute Inequality   Any                          Yes   2
Uniqueness             Choose specific value        No    3
Uniqueness             Choose some value            Yes   4
AUTO_INCREMENT         Insert                       No    5
Foreign Key            Insert                       Yes   6
Foreign Key            Delete                       No    7
Foreign Key            Cascading Delete             Yes   8
Secondary Indexing     Update                       Yes   9
Materialized Views     Update                       Yes   10
>                      Increment [Counter]          Yes   11
<                      Increment [Counter]          No    12
>                      Decrement [Counter]          No    13
<                      Decrement [Counter]          Yes   14
[NOT] CONTAINS         Any [Set, List, Map]         Yes   15, 16
SIZE=                  Mutation [Set, List, Map]    No    17

Table 2: Example SQL (top) and ADT invariant I-confluence along with references to formal proofs in [10, Appendix C].

In this section, we focus on providing intuition and informal explanations of our I-confluence analysis. Interested readers can find a more formal analysis in [10, Appendix C], including discussion of invariants not presented here. For convenience, we reference specific proofs from [10, Appendix C] inline.

5.1 I-confluence for Relations

We begin by considering several constraints found in SQL that are expressible via standard relational constructs.

Equality. What if an application wants to prevent a particular value from appearing in a database? For example, our payroll application from Section 2 might require that every user have a last name, marking the LNAME column with a NOT NULL invariant. While not particularly exciting, we can apply I-confluence analysis to insertions and updates of databases with (in-)equality constraints (Claims 1, 2 in [10, Appendix C]). Per-record inequality invariants are I-confluent, which we can show by contradiction: assume two database states S1 and S2 are each I-valid under per-record inequality invariant Ie but that Ie(S1 ⊔ S2) is false. Then there must be a record r ∈ S1 ⊔ S2 that violates Ie (i.e., r has the forbidden value). r must appear in S1, S2, or both. But that would imply that one of S1 or S2 is not I-valid under Ie, resulting in a contradiction.

Uniqueness. We can also consider common uniqueness invariants (e.g., in SQL, PRIMARY KEY and UNIQUE constraints). For example, in our payroll example, we wanted user IDs to be unique. In fact, our earlier discussion in Section 2 already provided a counter-example showing that arbitrary insertion of users is not I-confluent under these invariants: {Stan:5} and {Mary:5} are both valid databases that can be created by I-valid sequences of insertions (starting at S0 = {}), but their merge—{Stan:5, Mary:5}—is not I-valid. Therefore, uniqueness is not I-confluent for inserts of unique values (Claim 3). Note that although insertions are not I-confluent under uniqueness invariants, other operations are. For example, reads and deletions are both I-confluent under uniqueness invariants: reading and removing items cannot introduce duplicates.

Can the database safely choose unique values on behalf of users (e.g., assign a new user an ID)? In this case, we can achieve uniqueness without coordination—as long as we have a notion of replica membership (e.g., server or replica IDs). The difference is subtle (“grant this record this specific, unique ID” versus “grant this record some unique ID”) but, in a system model with membership (as is pragmatic in many contexts), is powerful. If replicas assign unique IDs within their respective portion of the ID namespace, then merging locally valid states will also be globally valid (Claim 4).
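One standard realization of “grant this record some unique ID” partitions the integer namespace by replica. The striping scheme below is a sketch of ours, assuming known replica membership:

    class ReplicaIdGenerator:
        # Replica r of n draws from {r, r + n, r + 2n, ...}: the slices are
        # disjoint, so merged states never contain duplicate IDs and no
        # coordination is required.
        def __init__(self, replica_id, num_replicas):
            self.next_id = replica_id
            self.stride = num_replicas

        def fresh_id(self):
            assigned = self.next_id
            self.next_id += self.stride
            return assigned

    r0, r1 = ReplicaIdGenerator(0, 2), ReplicaIdGenerator(1, 2)
    ids = [r0.fresh_id(), r0.fresh_id(), r1.fresh_id(), r1.fresh_id()]
    assert ids == [0, 2, 1, 3] and len(set(ids)) == len(ids)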

Foreign Keys. We can consider more complex invariants, such as foreign key constraints. In our payroll example, each employee belongs to a department, so the application could specify an invariant capturing this relationship (e.g., in SQL, we could declare the constraint EMP.D_ID FOREIGN KEY REFERENCES DEPT.ID).

Are foreign key constraints maintainable without coordination? Again, the answer depends on the actions of transactions modifying the data governed by the invariant. Insertions under foreign key constraints are I-confluent (Claim 6), which we again demonstrate by contradiction. We look for the existence of two I-valid states that, when merged, result in invalid state. In the case of foreign key constraints, an invalid state will contain a record with a “dangling pointer”—a record missing a corresponding record on the opposite side of the association. If we assume there exists some invalid state S1 ⊔ S2 containing a record r with an invalid foreign key to record f, but S1 and S2 are both valid, then r must appear in S1, S2, or both. But, since S1 and S2 are both valid, r must have a corresponding foreign key record (f) that “disappeared” during merge. Merge (in the current model) does not remove versions, so this is impossible.

From the perspective of I-confluence analysis, foreign key constraints concern the visibility of related updates: if individual database states maintain referential integrity, a non-destructive merge function such as set union cannot cause tuples to “disappear” and compromise the constraint. This also explains why models such as causal consistency and read committed isolation [9] are also achievable without coordination: restricting the visibility of updates does not require coordination between concurrent operations.

Deletion and modification of foreign key constraints are more challenging. Arbitrary deletion of records is unsafe: a user might be added to a department that was concurrently deleted (Claim 7). However, performing cascading deletions (e.g., SQL DELETE CASCADE), where the deletion of a record also deletes all matching records on the opposite end of the association, is I-confluent under foreign key constraints (Claim 8). We can generalize this discussion to updates (and cascading updates).
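The insert/delete asymmetry is easy to see under the set-union merge. The sketch below is ours, with emp and dept records standing in for the payroll tables:

    def referentially_intact(state):
        # Every employee record must reference an existing department.
        depts = {rec[1] for rec in state if rec[0] == "dept"}
        return all(rec[2] in depts for rec in state if rec[0] == "emp")

    base = frozenset({("dept", "sales")})
    s1 = base | {("emp", "alice", "sales")}   # replica 1 inserts Alice
    s2 = base | {("emp", "bob", "sales")}     # replica 2 inserts Bob

    # Inserts merge safely: set union never removes the referenced record.
    assert referentially_intact(s1 | s2)
    # Deletion is where danger lies: if one replica removed "sales" while
    # another inserted Alice into it, the converged state would contain a
    # dangling pointer (Claim 7); set union cannot even express that
    # removal, which is why cascading deletes need a richer merge.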

Materialized Views. As a final example within standard SQL, we can consider the problem of maintaining materialized views. Applications often pre-compute results to speed query performance via a materialized view [51] (e.g., UNREAD_CNT = SELECT COUNT(*) FROM emails WHERE read_date = NULL). We can consider a class of invariants that specify that materialized views reflect primary data; when a transaction (or merge invocation) modifies data, any relevant materialized views should be updated as well. This requires installing updates at the same time as the changes to the primary data are installed (a problem related to maintaining foreign key constraints). However, given that a view should simply reflect primary data, there are no “conflicts,” and, accordingly, updates are I-confluent under materialized view invariants (Claim 10).

5.2 I-confluence for Data Types

So far, we have considered databases that store growing sets of immutable versions. We have used this model to analyze several useful constraints, but, in practice, databases do not (often) provide these semantics, leading to a variety of interesting anomalies. For example, if we implement a user’s account balance using a “last writer wins” merge policy [48], then performing two concurrent withdrawal transactions might result in a database state reflecting only one transaction (a classic example of the Lost Update anomaly) [2, 9]. To avoid variants of these anomalies, many optimistic, coordination-free database designs have proposed the use of abstract data types (ADTs), providing merge functions for a variety of uses such as counters, sets, and maps [17, 42, 48, 56] that ensure that all updates are reflected in final database state. For example, a database can represent a simple counter ADT by recording the number of times each transaction performs an increment operation on the counter [48].

I-confluence analysis is also applicable to these ADTs and their associated invariants. For example, a row-level “>” threshold invariant is I-confluent for counter increment and assign but not decrement (Claims 11, 13), while a row-level “<” threshold invariant is I-confluent for counter decrement and assign but not increment (Claims 12, 14). This means that, in our payroll example, we can provide coordination-free support for concurrent salary increments but not concurrent salary decrements. ADTs (including lists, sets, and maps) can be combined with standard relational constraints like materialized view maintenance (e.g., the “total salary” row should contain the sum of employee salaries in the employee table). As with our generic set-union merge, I-confluence analysis requires a specification of the ADT merge behavior ([10, Appendix C] provides several examples). Additionally, while many implementations of these data types provide useful properties like convergence without compromising availability [48], they do not—on their own—guarantee that invariants are maintained; I-confluence (or stronger conditions) are required for correctness (Section 7).
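A sketch of the increment-only counter mentioned above. Per-replica increment totals with a pointwise-max merge and a summed value is one common realization [48]; the exact representation here is our choice, and it shows why a “>” threshold, once satisfied, survives merges:

    class GCounter:
        # Increment-only counter: each replica tracks its own increment
        # total; merge takes the pointwise maximum, and the counter's value
        # is the sum. Values only grow, so a row-level "> k" threshold
        # invariant, once satisfied, is preserved without coordination.
        def __init__(self):
            self.counts = {}

        def increment(self, replica_id, amount=1):
            self.counts[replica_id] = self.counts.get(replica_id, 0) + amount

        def value(self):
            return sum(self.counts.values())

        def merge(self, other):
            merged = GCounter()
            for r in set(self.counts) | set(other.counts):
                merged.counts[r] = max(self.counts.get(r, 0),
                                       other.counts.get(r, 0))
            return merged

    a, b = GCounter(), GCounter()
    a.increment("r1", 5)
    b.increment("r2", 3)
    assert a.merge(b).value() == 8   # both increments survive the merge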

5.3 Discussion and Limitations

We have analyzed a number of combinations of invariants and operations (shown in Table 2). While these results are by no means comprehensive, they are expressive for many applications (Section 6). Here, we discuss lessons from this classification process.

Analysis mechanisms. In this section, we manually analyzed particular invariant and operation combinations, proving each I-confluent or not. To study actual applications, we can apply these I-confluence labels via simple static analysis. Specifically, given invariants captured via SQL DDL and transactions expressed as stored procedures, we can iterate through all declared invariants and check each mutation within each transaction, identifying pairs that appear in our set of non-I-confluent pairs and flagging them accordingly. Pairs appearing in the set of I-confluent pairs can be marked as safe, while, for soundness (but not completeness), any operations or invariants that are not recognized can be flagged as potentially non-I-confluent (thus, the analysis is similar to that of [5]). Despite its simplicity (both conceptually and in terms of implementation), this technique—coupled with the results of Table 2—is sufficiently powerful to automatically characterize the I-confluence of the applications we consider in Section 6 when expressed in SQL (with support for multi-row aggregates like Invariant 8 in Table 3).
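This static analysis amounts to a lookup over (invariant, operation) pairs. A minimal sketch of ours, seeded with a few of Table 2’s classifications (the string keys are illustrative):

    # Pairs known I-confluent are safe; anything unrecognized is flagged as
    # requiring coordination, trading completeness for soundness.
    I_CONFLUENT = {("foreign key", "insert"),
                   ("uniqueness", "choose some value"),
                   ("materialized view", "update")}
    NOT_I_CONFLUENT = {("uniqueness", "choose specific value"),
                       ("foreign key", "delete"),
                       ("auto_increment", "insert")}

    def classify(invariant, operation):
        pair = (invariant, operation)
        if pair in I_CONFLUENT:
            return "coordination-free"
        if pair in NOT_I_CONFLUENT:
            return "coordinate (known unsafe)"
        return "coordinate (unrecognized; flagged for soundness)"

    print(classify("foreign key", "insert"))                # coordination-free
    print(classify("uniqueness", "choose specific value"))  # coordinate (known unsafe)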

By growing our recognized list of I-confluent pairs on an as-needed basis (via manual analysis of the pair), the above technique has proven useful—due in large part to the common re-use of invariants like foreign key constraints. However (and as a non-goal of this paper), we can consider more complex forms of program analysis. For example, one might analyze the I-confluence of arbitrary invariants, leaving the task of proving or disproving I-confluence to an SMT solver like Z3. While I-confluence—like monotonicity and commutativity—is undecidable for arbitrary programs, others have recently shown this alternative approach (e.g., in commutativity analysis [16, 38] and in invariant generation for view serializable transactions [45]) to be fruitful for restricted languages. We continue with the rule-based approach and manual proofs but view language design as an interesting area for more speculative future work.

Merge and abstract data types. As others note [20, 43, 51], handling concurrent updates using destructive merge operators (e.g., “last write wins,” as in the Lost Update example above) may result in inconsistent or otherwise undesirable outcomes (as an extreme example, D1 ⊔ D2 ≡ {} is a safe but unhelpful merge function). In practice, and in our analysis, we have found ADTs to be a useful workaround for avoiding anomalies due to destructive update (thus avoiding the Lost Update problem). This requires user programs to explicitly use ADT functions (e.g., “INCREMENT(x)” rather than read-modify-write “x=x+1”), but, for many invariants, destructive updates may signal a requirement for coordination anyway.
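A tiny illustration of the difference (our own, with assumed names): under a last-write-wins register, one of two concurrent increments is silently destroyed, while an increment-recording ADT preserves both:

  // Last-write-wins register: merge keeps the write with the larger timestamp.
  case class LwwRegister(value: Long, ts: Long) {
    def merge(o: LwwRegister): LwwRegister = if (o.ts > ts) o else this
  }

  val base = LwwRegister(10L, ts = 0L)
  // Two clients concurrently perform read-modify-write "x = x + 1":
  val a = LwwRegister(base.value + 1, ts = 1L) // writes 11
  val b = LwwRegister(base.value + 1, ts = 2L) // writes 11
  val lww = a.merge(b).value // 11: one increment is destroyed (Lost Update)

  // With an increment ADT (e.g., the counter sketched earlier), both
  // INCREMENT(x) operations are recorded and the merged value is 12.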

Recency and session support. Our proposed invariants are declarative, but a class of useful semantics—recency, or real-time guarantees on reads and writes—is operational (i.e., it pertains to transaction execution rather than the state(s) of the database). For example, users often wish to read data that is up-to-date as of a given point in time (e.g., “read latest” [18] or linearizable [26] semantics). While traditional isolation models do not directly address these recency guarantees [2], they are often important to programmers. Are these models I-confluent? We can attempt to simulate recency guarantees in I-confluence analysis by logging the result of all reads and any writes with a timestamp and requiring that all logged timestamps respect their recency guarantees (thus treating isolation guarantees as invariants over recorded read/write execution traces). However, this is a somewhat pointless exercise: it is well known that recency guarantees are unachievable with transactional availability [9, 19, 26]. Thus, if application reads face these requirements, coordination is required. Indeed, when application “consistency” means “recency,” systems cannot circumvent speed-of-light delays.

If users wish to “read their writes” or wish to maintain other “session” guarantees [43] (maintaining recency on a per-user or per-session basis), they must maintain affinity or “stickiness” [9] with a given (set of) replicas. These guarantees are admittedly awkward to express in our data-centric I-confluence formalism but do not require coordination between different users’ transactions.

Physical and logical replication. We have used the concept of replicas to reason about concurrent transaction execution. However, as previously noted, our use of replicas is simply a formal device and is independent of the actual concurrency control mechanisms at work. Specifically, reasoning about replicas allows us to separate the analysis of transactions from their implementation: just because a transaction is executed with (or without) coordination does not mean that all query plans or implementations require (or do not require) coordination [9]. However, in deciding on an implementation, as we will do shortly for one application, there is a range of design decisions yielding a variety of performance trade-offs: for example, a(n otherwise coordination-free) physically replicated system may underperform a single-master system for linearizable writes.

Requirements and Restrictions. Our techniques are predicated on the ability to correctly and completely specify invariants and inspect user transactions; without such a correctness specification, for arbitrary transaction schedules, it is well known that serializability is the “optimal” strategy [36]. By casting correctness in terms of application state rather than as a property of read-write schedules, we achieve a more precise statement of coordination overheads. However, as we have noted, this does not obviate the need for coordination in all cases.

6. EXPERIENCES WITH COORDINATION

If achievable, coordination-free execution enables scalability limited only by that of the available hardware. This is powerful: an I-confluent application can therefore scale out without sacrificing correctness, latency, or availability. In Section 5, we saw how many combinations of invariants and transactions were I-confluent and how

#   Informal Invariant Description              Type    Txns   I-C
1   YTD wh sales = sum(YTD district sales)      MV      P      Yes
2   Per-district order IDs are sequential       SID+FK  N, D   No
3   New order IDs are sequentially assigned     SID     N, D   No
4   Per-district, item order count = roll-up    MV      N      Yes
5   Order carrier is set iff order is pending   FK      N, D   Yes
6   Per-order item count = line item roll-up    MV      N      Yes
7   Delivery date set iff carrier ID set        FK      D      Yes
8   YTD wh = sum(historical wh)                 MV      D      Yes
9   YTD district = sum(historical district)     MV      P      Yes
10  Customer balance matches expenditures       MV      P, D   Yes
11  Orders reference New-Orders table           FK      N      Yes
12  Per-customer balance = cust. expenditures   MV      P, D   Yes

Table 3: TPC-C Declared “Consistency Conditions” (3.3.2.x) and I-confluence analysis results (Invariant type: MV: materialized view, SID: sequential ID assignment, FK: foreign key; Transactions: N: New-Order, P: Payment, D: Delivery).

others were not; in this section, we apply these combinations to real-world applications, with a focus on the TPC-C benchmark.

6.1 TPC-C Invariants and Execution

The TPC-C benchmark is the gold standard for database concurrency control [21], both in research and in industry [53], and in recent years has been used as a yardstick for distributed database performance [50, 52, 55]. How much coordination does TPC-C require for a compliant execution?

The TPC-C workload is designed to be representative of a wholesale supplier’s transaction processing requirements. The workload has a number of application-level correctness criteria that represent basic business needs (e.g., order IDs must be unique), as formulated by the TPC-C Council, and which must be maintained in a compliant run. We can interpret these well-defined “consistency criteria” as invariants and subsequently use I-confluence analysis to determine which transactions require coordination and which do not.

Table 3 summarizes the twelve invariants found in TPC-C as well as their I-confluence analysis results as determined by Table 2. We classify the invariants into three broad categories: materialized view maintenance, foreign key constraint maintenance, and unique ID assignment. As we discussed in Section 5, the first two categories are I-confluent (and therefore maintainable without coordination) because they only regulate the visibility of updates to multiple records. Because these invariants (10 of 12) are I-confluent under the workload transactions, there exists an execution strategy that does not use coordination. However, simply because these invariants are I-confluent does not mean that all execution strategies will scale well: for example, using locking to execute the transactions would not be coordination-free.

As one coordination-free execution strategy (which we implement in Section 6.2) that respects the foreign key and materialized view invariants, we can use the recently proposed RAMP transactions, which provide atomically visible transactional updates across servers without relying on coordination for correctness [11]. In brief, RAMP transactions employ limited multi-versioning and limited metadata along with writes to ensure that readers and writers can always proceed concurrently: any client whose reads overlap with another client’s writes to the same item(s) can use metadata stored in the items to fetch any “missing” writes from the respective servers. A standard RAMP transaction over data items suffices to enforce foreign key constraints, while a RAMP transaction over commutative counters as described in [11] is sufficient to enforce the TPC-C materialized view constraints.
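As a rough sketch of the metadata-based repair idea (a considerable simplification of the RAMP-Fast algorithm in [11]; the names and in-memory “servers” here are our own assumptions, not the algorithm's actual structure):

  case class Version(txnId: Long, value: String, siblings: Set[String])

  class ItemServer {
    // Limited multi-versioning: item -> (txnId -> version).
    private val store = scala.collection.mutable.Map[String, Map[Long, Version]]()
    def write(item: String, v: Version): Unit =
      store(item) = store.getOrElse(item, Map.empty[Long, Version]) + (v.txnId -> v)
    def latest(item: String): Option[Version] =
      store.get(item).flatMap(_.values.toSeq.sortBy(_.txnId).lastOption)
    def byTxn(item: String, txnId: Long): Option[Version] =
      store.get(item).flatMap(_.get(txnId))
  }

  // First-round reads may observe a "fractured" view; the sibling metadata in
  // each version tells the reader which transactions it is missing, and a
  // second round fetches them, yielding atomically visible reads without locks.
  def rampRead(servers: Map[String, ItemServer],
               items: Set[String]): Map[String, Version] = {
    val first: Map[String, Version] =
      items.flatMap(i => servers(i).latest(i).map(i -> _)).toMap
    first.map { case (item, v) =>
      // Highest transaction (seen in round one) that also wrote `item`:
      val required = first.values.filter(_.siblings.contains(item))
        .map(_.txnId).foldLeft(v.txnId)((a, b) => math.max(a, b))
      val repaired =
        if (required > v.txnId) servers(item).byTxn(item, required).getOrElse(v)
        else v
      item -> repaired
    }
  }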

Two of TPC-C’s invariants are not I-confluent with respect to the workload transactions and therefore do require coordination. On a per-district basis, order IDs should be assigned sequentially (both uniquely and sequentially, in the New-Order transaction) and orders should be processed sequentially (in the Delivery transaction). If the database is partitioned by warehouse (as is standard [50, 52, 55]), the former is a distributed transaction (by default, 10% of New-Order transactions span multiple warehouses), while the benchmark specification allows the latter to be run asynchronously and in batch mode on a per-warehouse (non-distributed) basis. Accordingly, we, like others [52, 55], focus on New-Order. Including additional transactions like the read-only Order-Status in the workload mix would increase performance due to the transactions’ lack of distributed coordination and (often considerably) smaller read/write footprints.

Avoiding New-Order Coordination. New-Order is not I-confluent with respect to the TPC-C invariants, so we can always fall back to using serializable isolation. However, the per-district ID assignment records (10 per warehouse) would become a point of contention, limiting our throughput to effectively 100W/RTT for a W-warehouse TPC-C benchmark with the expected 10% distributed transactions. Others [55] (including us, in prior work [9]) have suggested disregarding consistency criteria 3.3.2.3 and 3.3.2.4, instead opting for unique but non-sequential ID assignment: this allows inconsistency and violates the benchmark compliance criteria.
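One back-of-the-envelope reading of this bound (our own derivation; the text does not spell it out): each warehouse has 10 district “next ID” records, i.e., 10W serialization points; if roughly 10% of New-Order transactions are distributed and each such transaction holds its district’s record for about one round-trip time, the expected hold time per transaction is 0.1 · RTT, so

  \[
    \text{throughput} \;\lesssim\; 10W \cdot \frac{1}{0.1 \cdot \mathrm{RTT}} \;=\; \frac{100W}{\mathrm{RTT}}.
  \]

Illustratively, with W = 8 and a 1 ms RTT this is about 800K transactions per second, but a 50 ms wide-area RTT collapses it to roughly 16K.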

During a compliant run, New-Order transactions must coordinate. However, as discussed above, only the ID assignment operation is non-I-confluent; the remainder of the operations can execute coordination-free. Moreover, with some effort, we can avoid distributed coordination. A naïve implementation might grab a lock on the appropriate district’s “next ID” record, perform (possibly remote) remaining reads and writes, then release the lock at commit time. Instead, as a more efficient solution, New-Order can defer ID assignment until commit time by introducing a layer of indirection. New-Order transactions can generate a temporary, unique, but non-sequential ID (tmpID) and perform updates using this ID via a RAMP transaction (which, in turn, handles the foreign key constraints) [11]. Immediately prior to transaction commit, the New-Order transaction can assign a “real” ID by atomically incrementing the current district’s “next ID” record (yielding realID) and recording the [tmpID, realID] mapping in a special “ID lookup” table. Any read requests for the ID column of the Order, New-Order, or Order-Line tables can be safely satisfied (transparently to the end user) by joining with the ID lookup table on tmpID. In effect, the New-Order ID assignment can use a nested atomic transaction [42] upon commit, and all coordination between any two pairs of transactions is confined to a single server.
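A compressed sketch of this commit-time indirection (our own illustration: DistrictIds, writeOrderRows, and idLookup are assumed stand-ins for the district’s “next ID” record, the RAMP-transaction writes, and the “ID lookup” table):

  import java.util.UUID
  import java.util.concurrent.atomic.AtomicLong

  // Single-site "next ID" record for one district; the atomic increment is
  // the only coordination point, and it is confined to the district's server.
  class DistrictIds {
    private val nextId = new AtomicLong(1L)
    def assign(): Long = nextId.getAndIncrement()
  }

  def newOrder(district: DistrictIds,
               writeOrderRows: String => Unit,
               idLookup: scala.collection.concurrent.Map[String, Long]): Long = {
    val tmpId = UUID.randomUUID().toString // unique but non-sequential
    writeOrderRows(tmpId)                  // multi-server work, e.g., via RAMP
    val realId = district.assign()         // commit-time, single-site increment
    idLookup.put(tmpId, realId)            // record the [tmpID, realID] mapping
    realId                                 // ID-column reads join on tmpId
  }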

6.2 Evaluating TPC-C New-Order

We subsequently implemented the above execution strategy in a distributed database prototype to quantify the overheads associated with coordination in TPC-C New-Order. In brief, the coordination-avoiding query plan scales linearly to over 12.7M transactions per second on 200 servers while substantially outperforming distributed two-phase locking. Our goal is to demonstrate—beyond the microbenchmarks of Section 2—that safe but judicious use of coordination can have meaningful positive effects.

Implementation and Deployment. We originally ported TPC-C New-Order to our prior RAMP transaction prototype [11] but found that the larger transaction footprint was poorly suited for the threading model. Each New-Order transaction generates between 13–33 reads and 13–33 writes, so switching to an event-based RPC layer with support for batching and connection pooling delivered an approximate factor of two improvement in throughput, while additional optimization (e.g., more efficient indexing, fewer object allocations) delivered another factor of two. We employ a multi-versioned storage manager, with RAMP-Fast transactions for snapshot reads and atomically visible writes/“merge” (providing a variant of regular register semantics, with writes visible to later transactions after commit) [11], and implement the nested atomic transaction for ID assignment as a subprocedure inside RAMP-Fast’s server-side commit procedure (using spinlocks). We implement transactions as stored procedures and fulfill the TPC-C “Isolation Requirements” by using read and write buffering as proposed in [9]. As is common [33, 44, 50, 52], we disregard per-warehouse client limits and “think time” to increase load per warehouse. In all, our base prototype architecture is similar to that of [11]: a JVM-based partitioned, main-memory, mastered database.

[Figure 5: TPC-C New-Order throughput across eight servers. Three panels compare the coordination-avoiding and serializable (2PL) plans: throughput (txns/s) vs. number of clients (1–256); throughput (txns/s) vs. number of warehouses (8–64); and normalized throughput (%) vs. proportion of distributed transactions (0–100%).]

For an apples-to-apples comparison with a coordination-intensive technique within the same system architecture, we also implemented textbook two-phase locking (2PL) [13], which provides serializability but also requires distributed coordination. We totally order lock requests across servers to avoid deadlock, batching lock requests at the server granularity (e.g., lock all items on server 1, then lock all items on server 2). To avoid extra RTTs while holding locks, we piggyback read and write requests on lock request RPCs. As a validation of our implementation, our 2PL prototype achieves per-warehouse (and sometimes aggregate) throughput similar to (and often in excess of) several recent serializable database implementations (of both 2PL and other approaches) [33, 44, 50, 52].
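The deadlock-avoidance idea is simply to impose a single global acquisition order; a minimal sketch (our own, with in-process locks standing in for per-server lock tables):

  import java.util.concurrent.locks.ReentrantLock

  // Acquire locks in one global (server, key) order so that no two
  // transactions can ever wait on each other in a cycle; sorting by server
  // first also lets one RPC batch all of a server's lock requests (with
  // reads and writes piggybacked, as described above).
  def acquireAll(lockTable: Map[(String, String), ReentrantLock],
                 keys: Seq[(String, String)]): Seq[ReentrantLock] =
    keys.distinct.sorted.map { k =>
      val lock = lockTable(k)
      lock.lock()
      lock
    }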

By default, we deploy our prototype on eight EC2 cr1.8xlarge instances in the Amazon AWS us-west-2 region (with non-co-located clients), with one warehouse per server (10 “hot” district ID records per warehouse), and report the average of three 120 second runs.

Basic behavior. Figure 5 shows performance across a variety of configurations, which we detail below. Overall, the coordination-avoiding query plan far outperforms the serializable execution. The coordination-avoiding query plan performs some coordination, but, because coordination points are not distributed (unlike 2PL), physical resources (and not coordination) are the bottleneck.

Varying load. As we increase the number of clients, the coordination-avoiding query plan’s throughput increases linearly, while 2PL throughput increases to 40K transactions per second and then levels off. As in our microbenchmarks in Section 2, the former is able to utilize available hardware resources, while the latter bottlenecks on logical contention; the coordination-avoiding query plan instead bottlenecks on CPU cycles, at 640K transactions per second.

Physical resource consumption. To understand the overheads of each component in the coordination-avoiding query plan, we used JVM profiling tools to sample thread execution while running at peak throughput, attributing time spent in functions to relevant modules within the database implementation (where possible):

Code Path                                             Cycles
Storage Manager (Insert, Update, Read)                45.3%
Stored Procedure Execution                            14.4%
RPC and Networking                                    13.2%
Serialization                                         12.6%
ID Assignment Synchronization (spinlock contention)    0.19%
Other                                                 14.3%

The coordination-avoiding prototype spends a large portion of execution in the storage manager, performing B-tree modifications and lookups and result set creation, and in RPC/serialization. In contrast to 2PL, the prototype spends less than 0.2% of its time coordinating, in the form of waiting for locks in the New-Order ID assignment; the (single-site) assignment is fast (a linearizable integer increment and store, followed by a write and fence instruction on the spinlock), so this should not be surprising. We observed 40% or greater throughput penalties due to garbage collection (GC) overheads—an unfortunate cost of our compact (several thousand lines of Scala), JVM-based implementation. However, even in this current prototype, physical resources are the bottleneck—not coordination.

Varying contention. We subsequently varied the number of “hot,” or contended, items by increasing the number of warehouses on each server. Unsurprisingly, 2PL benefits from decreased contention, rising to over 87K transactions per second with 64 warehouses. In contrast, our coordination-avoiding implementation is largely unaffected (and, at 64 warehouses, is even negatively impacted by additional GC pressure). The coordination-avoiding query plan is effectively agnostic to read/write contention as it bottlenecks on CPU cycles, not coordination.

Varying distribution. We also varied the percentage of distributed transactions. The coordination-avoiding query plan incurred a 29% overhead moving from no distributed transactions to all distributed transactions due to increased serialization overheads and less efficient batching of RPC requests. However, the 2PL implementation decreased in throughput by over 90% (in line with prior results [44, 52], albeit exaggerated here due to higher contention) as more requests stalled due to coordination with remote servers (i.e., holding locks during RPCs).

Scaling out. Finally, we examined our prototype’s scalability, again deploying one warehouse per server. As Figure 6 demonstrates, our prototype scales linearly, to over 12.74 million transactions per second on 200 servers (in light of our earlier results, and for economic reasons, we do not run 2PL at this scale). Per-server throughput is largely constant after 100 servers, at which point our deployment spanned all three us-west-2 datacenters and experienced slightly degraded per-server performance. While we make use of application semantics, we are unaware of any other compliant multi-server TPC-C implementation that has achieved greater than 500K New-Order transactions per second [33, 44, 50, 52].

[Figure 6: Coordination-avoiding New-Order scalability. Two panels: total throughput (txn/s) and per-server throughput (txn/s/server), each vs. number of servers (0–200).]

Summary. We present these results as a proof of concept that executing even challenging workloads like TPC-C, which contain complex integrity constraints, is not necessarily at odds with scalability if implemented in a coordination-avoiding manner. Distributed coordination need not be a bottleneck for all applications, even if conflict serializable read/write analysis indicates otherwise.

6.3 Analyzing Additional Applications

These results begin to quantify the effects of coordination-avoiding concurrency control. When conflicts cannot be avoided, coordination can be expensive. However, when databases consider application-level invariants, they only have to pay the price of coordination when necessary. We were surprised that the “current industry standard for evaluating the performance of OLTP systems” [21] was so amenable to coordination-avoiding execution—at least for compliant execution as defined by the TPC-C standard.

For greater variety, we studied the workloads of the recently assembled OLTP-Bench suite [21], performing a similar analysis to that of Section 6.1. We found (and confirmed with an author of [21]) that, for nine of the fourteen remaining (non-TPC-C) OLTP-Bench applications, the workload transactions did not involve integrity constraints (e.g., did not modify primary key columns); one (CH-benCHmark) matched TPC-C; and two specifications implied (but did not explicitly state) a requirement for unique ID assignment (AuctionMark’s new-purchase order completion, SEATS’s NewReservation seat booking; achievable like TPC-C order IDs). The remaining two benchmarks, sibench and smallbank, were specifically designed (by an author of this paper) as research benchmarks for serializable isolation. Finally, the three “consistency conditions” required by the newer TPC-E benchmark are a proper subset of the twelve conditions from TPC-C considered here (and are all materialized counters). It is possible (even likely) that these benchmarks are underspecified, but, according to official specifications, TPC-C contains the most coordination-intensive invariants among all but two of the benchmarks we encountered.

Anecdotally, our conversations and experiences with real-world application programmers and database developers have not identified invariants that are radically different from those we have studied here. A simple thought experiment identifying the invariants required for a social networking site yields a number of invariants but none that are particularly exotic (e.g., username uniqueness, foreign key constraints between updates, privacy settings [11, 18]). Nonetheless, we view the further study of real-world invariants as a necessary area for future investigation. In the interim, these preliminary results hint at what is possible with coordination avoidance as well as the costs of coordination if applications are not I-confluent.

7. RELATED WORK

Database system designers have long sought to manage the trade-off between consistency and coordination. As we have discussed, serializability and its many implementations (including lock-based, optimistic, and pre-scheduling mechanisms) [13, 14, 23, 28, 50–52, 55] are sufficient for maintaining application correctness. However, serializability is not always necessary: as discussed in Section 1, serializable databases do not allow certain executions that are correct according to application semantics. This has led to a large class of semantic—or application-level—concurrency control mechanisms that admit greater concurrency. There are several surveys on this topic, such as [27, 51], and, in our solution, we integrate many concepts from this literature.

Commutativity. One of the most popular alternatives to serializability is to exploit commutativity: if transaction return values (e.g., of reads) and/or final database states are equivalent despite reordering, they can be executed simultaneously [16, 39, 56]. Commutativity is often sufficient for correctness but is not necessary. For example, if an analyst at a wholesaler creates a report on daily cash flows, any concurrent sale transactions will not commute with the report (the results will change depending on whether the sale completes before or after the analyst runs her queries). However, the report creation is I-confluent with respect to, say, the invariant that every sale in the report references a customer from the customers table. [16, 37] provide additional examples of safe non-commutativity.

Monotonicity and Convergence. The CALM Theorem [7] shows that monotone programs exhibit deterministic outcomes despite reordering. CRDT objects [48] similarly ensure convergent outcomes that reflect all updates made to each object. These convergence and outcome-determinism guarantees are useful liveness properties [47] (e.g., a converged CRDT OR-Set reflects all concurrent additions and removals) but do not prevent users from observing inconsistent data [38]; they are not safety guarantees (e.g., the CRDT OR-Set does not—by itself—enforce invariants, such as ensuring that no employee belongs to two departments) and are therefore not sufficient to guarantee correctness for all applications.
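A toy illustration of this gap (our own, using plain set union in place of a full OR-Set): the merge converges deterministically and reflects both updates, yet the result violates a single-department invariant:

  // Each replica independently assigns alice to a department; merging by
  // union converges and loses neither update...
  val replicaA = Set("alice" -> "engineering")
  val replicaB = Set("alice" -> "marketing")
  val merged = replicaA union replicaB
  // ...but the merged state puts alice in two departments, violating the
  // invariant: convergence (liveness) does not imply invariant safety.
  assert(merged.count(_._1 == "alice") == 2)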

Use of Invariants. A large number of database designs—including, in restricted forms, many commercial databases today—use application-supplied invariants as a correctness specification (e.g., [12, 19, 24, 27, 30, 31, 35, 38–40, 45]). We draw inspiration from this prior work but are not aware of related work that formally discusses when coordination is strictly required to enforce these invariants.

In this work, we provide a necessary and sufficient condition for safe, coordination-free execution—the first of which we are aware. Yet, in contrast with many of the above conditions, we require more information from the application in the form of invariants. When invariants are unavailable, the above, more conservative approaches may instead be applicable. Our use of analysis-as-design-tool is inspired by this literature—in particular, [16].

Coordination costs. In this work, we determine when transactions can run entirely concurrently and without coordination. In contrast, a large number of alternative models (e.g., [4, 8, 24, 31, 35, 40, 41]) assume serializable or linearizable (and therefore coordinated) updates to shared state. These assumptions are standard (but not universal [15]) in the concurrent programming literature [8, 47]. (Additionally, unlike much of this literature, we only consider a single set of invariants per database rather than per-operation invariants.) However, this can harm scalability, as coordinating updates can become a bottleneck [9, 26]. For example, transaction chopping [49] and later application-aware extensions [3, 12] decompose transactions into a (usually larger) set of smaller transactions, providing increased concurrency, but in turn require that individual transactions execute in a serializable (or, alternatively, strict serializable) manner. These alternative techniques are useful once it is established that coordination is actually required.

Term rewriting. In term rewriting systems, I-confluence guarantees that arbitrary rule application will not violate a given invariant [22], generalizing Church-Rosser confluence [34]. We adapt this concept and effectively treat transactions as rewrite rules, database states as constraint states, and the database merge operator as a special join operator (in the term-rewriting sense) defined for all states. Rewriting system concepts—including confluence [4]—have previously been integrated into active database systems [57] (e.g., in triggers, rule processing), but we are not familiar with a concept analogous to I-confluence in the existing database literature.

Coordination-free algorithms and semantics. Our work is influenced by the distributed systems literature, where coordination-free execution across replicas of a given data item has been captured as “availability” [26]. A large class of systems provides availability via “optimistic replication” (i.e., perform operations locally, then replicate) [46], which is complementary to our goal of coordination-free execution. We adopt the use of the merge operator to reconcile divergent database states from this literature and related programming models like those presented in [15]. Both traditional database systems [2] and more recent proposals [38, 39] allow the simultaneous use of “weak” and “strong” isolation; we seek to understand when strong mechanisms are needed rather than an optimal implementation of either. Unlike “tentative update” models [25], we do not require programmers to specify compensatory actions (beyond merge, which we expect to typically be generic and/or system-supplied) and do not reverse transaction commit/abort decisions. A modified merge implementation could provide this behavior if desired.

The CAP Theorem [1, 26] popularized the tension between strong semantics and coordination but pertains to a specific model (linearizability), while the relationship between serializability and coordination has also been well documented [19]. We recently classified a range of isolation models by availability [9]. This paper addresses when particular applications require coordination.

In our evaluation, we make use of our recent RAMP transaction algorithms [11], which guarantee coordination-free, atomically visible updates. RAMP transactions are an implementation of I-confluent semantics (i.e., Read Atomic isolation, used in our implementation for foreign key constraint maintenance). Our focus in this paper is when RAMP transactions (and any other coordination-free/I-confluent semantics) are appropriate for applications.

Summary. The I-confluence property is the first necessary and sufficient condition for safe, coordination-free execution that we have encountered. Sufficient conditions such as commutativity and monotonicity are useful in reducing coordination overheads but are not always necessary. Here, we explore the fundamental limits of coordination-free execution. To do so, we explicitly consider a model without synchronous communication. This is key to scalability: if, by default, operations must contact a centralized validation service, perform atomic updates to shared state, or otherwise communicate, then scalability will be compromised. Finally, we only consider a single set of invariants for the entire application, reducing programmer overhead without affecting our I-confluence results.


8. CONCLUSION

ACID transactions and associated strong isolation levels dominated the field of database concurrency control for decades, due in large part to their ease of use and ability to automatically guarantee application correctness criteria. However, this powerful abstraction comes with a hefty cost: concurrent transactions must coordinate in order to prevent read/write conflicts that could compromise equivalence to a serial execution. At large scale and, increasingly, in geo-replicated deployments, the coordination costs necessarily associated with these implementations produce significant overheads in the form of penalties to throughput, latency, and availability. In light of these trends, we developed a framework, called I-confluence, in which application invariants are used as a basis for determining if and when coordination is strictly necessary. With this framework, we demonstrated that, in fact, many—but not all—common database invariants and integrity constraints are actually achievable without coordination. By applying these results to a range of actual transactional workloads, we demonstrated an opportunity to avoid coordination in many cases where traditional serializable mechanisms would otherwise coordinate. The order-of-magnitude performance improvements we demonstrated via coordination-avoiding concurrency control strategies provide strong evidence that invariant-based coordination avoidance is a promising approach to meaningfully scaling future transaction processing systems.

Acknowledgments. The authors would like to thank Peter Alvaro, Neil Conway, Shel Finkelstein, and Josh Rosen for helpful feedback on earlier versions of this work; Dan Crankshaw, Joey Gonzalez, Nick Lanham, and Gene Pang for various engineering contributions; and Yunjing Yu for sharing the Bobtail dataset. This research is supported by NSF CISE Expeditions award CCF-1139158 and DARPA XData Award FA8750-12-2-0331, the NSF Graduate Research Fellowship (grant DGE-1106400), and gifts from Amazon Web Services, Google, SAP, Apple, Inc., Cisco, Clearstory Data, Cloudera, EMC, Ericsson, Facebook, GameOnTalis, General Electric, Hortonworks, Huawei, Intel, Microsoft, NetApp, NTT Multimedia Communications Laboratories, Oracle, Samsung, Splunk, VMware, WANdisco and Yahoo!.

9. REFERENCES
[1] D. J. Abadi. Consistency tradeoffs in modern distributed database system design: CAP is only part of the story. IEEE Computer, 45(2):37–42, 2012.
[2] A. Adya. Weak consistency: a generalized theory and optimistic implementations for distributed transactions. PhD thesis, MIT, 1999.
[3] D. Agrawal et al. Consistency and orderability: semantics-based correctness criteria for databases. ACM Trans. Database Syst., 18(3):460–486, Sept. 1993.
[4] A. Aiken, J. Widom, and J. M. Hellerstein. Behavior of database production rules: Termination, confluence, and observable determinism. In SIGMOD 1992.
[5] P. Alvaro, N. Conway, J. M. Hellerstein, and W. Marczak. Consistency analysis in Bloom: a CALM and collected approach. In CIDR 2011.
[6] P. Alvaro et al. Consistency without borders. In SoCC 2013.
[7] T. J. Ameloot, F. Neven, and J. Van Den Bussche. Relational transducers for declarative networking. J. ACM, 60(2):15:1–15:38, May 2013.
[8] H. Attiya, R. Guerraoui, D. Hendler, et al. Laws of order: Expensive synchronization in concurrent algorithms cannot be eliminated. In POPL 2011.
[9] P. Bailis, A. Davidson, A. Fekete, A. Ghodsi, J. M. Hellerstein, and I. Stoica. Highly available transactions: Virtues and limitations. In VLDB 2014.
[10] P. Bailis, A. Fekete, M. J. Franklin, A. Ghodsi, et al. Coordination avoidance in database systems (Extended version). 2014. arXiv:1402.2237.
[11] P. Bailis, A. Fekete, A. Ghodsi, J. M. Hellerstein, and I. Stoica. Scalable atomic visibility with RAMP transactions. In SIGMOD 2014.
[12] A. J. Bernstein and P. M. Lewis. Transaction decomposition using transaction semantics. Distributed and Parallel Databases, 4(1):25–47, 1996.
[13] P. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency control and recovery in database systems. Addison-Wesley, New York, 1987.
[14] P. A. Bernstein, D. W. Shipman, and J. B. Rothnie, Jr. Concurrency control in a system for distributed databases (SDD-1). ACM TODS, 5(1):18–51, Mar. 1980.
[15] S. Burckhardt, D. Leijen, M. Fähndrich, and M. Sagiv. Eventually consistent transactions. In ESOP 2012.
[16] A. T. Clements et al. The scalable commutativity rule: designing scalable software for multicore processors. In SOSP 2013.
[17] N. Conway et al. Logic and lattices for distributed programming. In SoCC 2012.
[18] B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, et al. PNUTS: Yahoo!’s hosted data serving platform. In VLDB 2008.
[19] S. Davidson, H. Garcia-Molina, and D. Skeen. Consistency in partitioned networks. ACM Computing Surveys, 17(3):341–370, 1985.
[20] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, et al. Dynamo: Amazon’s highly available key-value store. In SOSP 2007.
[21] D. E. Difallah, A. Pavlo, C. Curino, and P. Cudre-Mauroux. OLTP-Bench: An extensible testbed for benchmarking relational databases. In VLDB 2014.
[22] G. Duck, P. Stuckey, and M. Sulzmann. Observable confluence for constraint handling rules. In ICLP 2007.
[23] K. P. Eswaran et al. The notions of consistency and predicate locks in a database system. Commun. ACM, 19(11):624–633, 1976.
[24] H. Garcia-Molina. Using semantic knowledge for transaction processing in a distributed database. ACM TODS, 8(2):186–213, June 1983.
[25] H. Garcia-Molina and K. Salem. Sagas. In SIGMOD 1987.
[26] S. Gilbert and N. Lynch. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News, 33(2):51–59, 2002.
[27] P. Godfrey et al. Logics for databases and information systems, chapter Integrity constraints: Semantics and applications, pages 265–306. Springer, 1998.
[28] J. Gray. The transaction concept: Virtues and limitations. In VLDB 1981.
[29] J. Gray and L. Lamport. Consensus on transaction commit. ACM TODS, 31(1):133–160, Mar. 2006.
[30] P. W. Grefen and P. M. Apers. Integrity control in relational database systems—an overview. Data & Knowledge Engineering, 10(2):187–223, 1993.
[31] A. Gupta and J. Widom. Local verification of global integrity constraints in distributed databases. In SIGMOD 1993, pages 49–58.
[32] R. Johnson, I. Pandis, and A. Ailamaki. Eliminating unscalable communication in transaction processing. The VLDB Journal, pages 1–23, 2013.
[33] E. P. Jones, D. J. Abadi, and S. Madden. Low overhead concurrency control for partitioned main memory databases. In SIGMOD 2010.
[34] J. W. Klop. Term rewriting systems. Stichting Mathematisch Centrum, Amsterdam, 1990.
[35] H. K. Korth and G. Speegle. Formal model of correctness without serializability. In SIGMOD 1988.
[36] H.-T. Kung and C. H. Papadimitriou. An optimality theory of concurrency control for databases. In SIGMOD 1979.
[37] L. Lamport. Towards a theory of correctness for multi-user database systems. Technical report, CCA, 1976. Described in [3, 47].
[38] C. Li, J. Leitao, A. Clement, N. Preguiça, R. Rodrigues, et al. Automating the choice of consistency levels in replicated systems. In USENIX ATC 2014.
[39] C. Li, D. Porto, A. Clement, J. Gehrke, et al. Making geo-replicated systems fast as possible, consistent when necessary. In OSDI 2012.
[40] Y. Lin, B. Kemme, R. Jiménez-Peris, et al. Snapshot isolation and integrity constraints in replicated databases. ACM TODS, 34(2), July 2009.
[41] S. Lu, A. Bernstein, and P. Lewis. Correct execution of transactions at different isolation levels. IEEE TKDE, 16(9), 2004.
[42] N. A. Lynch, M. Merritt, W. Weihl, and A. Fekete. Atomic Transactions: In Concurrent and Distributed Systems. Morgan Kaufmann Publishers Inc., 1993.
[43] K. Petersen, M. J. Spreitzer, D. B. Terry, M. M. Theimer, and A. J. Demers. Flexible update propagation for weakly consistent replication. In SOSP 1997.
[44] K. Ren, A. Thomson, and D. J. Abadi. Lightweight locking for main memory database systems. In VLDB 2013.
[45] S. Roy, L. Kot, N. Foster, J. Gehrke, et al. Writes that fall in the forest and make no sound: Semantics-based adaptive data consistency, 2014. arXiv:1403.2307.
[46] Y. Saito and M. Shapiro. Optimistic replication. ACM CSUR, 37(1), Mar. 2005.
[47] F. B. Schneider. On concurrent programming. Springer, 1997.
[48] M. Shapiro et al. A comprehensive study of convergent and commutative replicated data types. Technical Report 7506, INRIA, 2011.
[49] D. Shasha, F. Llirbat, E. Simon, and P. Valduriez. Transaction chopping: algorithms and performance studies. ACM TODS, 20(3):325–363, Sept. 1995.
[50] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, et al. The end of an architectural era: (it’s time for a complete rewrite). In VLDB 2007.
[51] M. Tamer Özsu and P. Valduriez. Principles of distributed database systems. Springer, 2011.
[52] A. Thomson, T. Diamond, S. Weng, K. Ren, P. Shao, and D. Abadi. Calvin: Fast distributed transactions for partitioned database systems. In SIGMOD 2012.
[53] TPC Council. TPC Benchmark C revision 5.11, 2010.
[54] I. L. Traiger, J. Gray, C. A. Galtieri, and B. G. Lindsay. Transactions and consistency in distributed database systems. ACM TODS, 7(3):323–342, 1982.
[55] S. Tu, W. Zheng, E. Kohler, B. Liskov, and S. Madden. Speedy transactions in multicore in-memory databases. In SOSP 2013.
[56] W. Weihl. Specification and implementation of atomic data types. PhD thesis, Massachusetts Institute of Technology, 1984.
[57] J. Widom and S. Ceri. Active database systems: Triggers and rules for advanced database processing. Morgan Kaufmann, 1996.
[58] Y. Xu et al. Bobtail: avoiding long tails in the cloud. In NSDI 2013.