WACCO and LOKO: Strong Consistency at Global Scale

Darrell Bethea, University of North Carolina, [email protected]
Michael K. Reiter, University of North Carolina, [email protected]
Feng Qian, AT&T Labs – Research, [email protected]
Qiang Xu, NEC Labs America, [email protected]
Z. Morley Mao, University of Michigan, [email protected]

Abstract

Motivated by a vision for future global-scale services supporting frequent updates and widespread concurrent reads, we propose a scalable object-sharing system called WACCO offering strong consistency semantics. WACCO propagates read responses on a tree-based topology to satisfy broad demand and migrates objects dynamically to place them close to that demand. To demonstrate WACCO, we use it to develop a service called LOKO that could roughly encompass the current duties of the DNS and simultaneously support granular status updates (e.g., currently preferred routes) in a future Internet. We evaluate LOKO, including the performance impact of updates, migration, and fault tolerance, using a trace of DNS queries served by Akamai.

1 Introduction

Today’s Internet is served by infrastructures that, in general, scale remarkably well to the massive demands placed on them. Both the Domain Name System (DNS) and content-distribution networks (CDNs) are examples of dramatic feats of engineering that facilitate global and quick access to content. The power of these infrastructures, however, derives in part from the largely static nature of the data they serve. DNS scales through caching on the basis of time-to-live (TTL) values that are typically large enough to hide updates from parts of the network for minutes or hours. CDNs serve primarily static data or else data that, if updated, need not be viewed consistently by different parts of the network.

The viability of such approaches may be challenged, however, as the Internet evolves. Multiple visions for future Internet designs anticipate the need to support more dynamic information in the network (e.g., SCION’s address and path servers [41], NIRA’s NRLS [39], or rendezvous servers to support mobility in content-centric networking [21]), which may enable mobile network location, dynamic route control, or diagnosis of network anomalies, for example. Because this information can change quickly — in some cases at the granularity of seconds or less — there is a need for infrastructure services that support dynamic updates, strong consistency, and global scalability. Even for existing uses to direct clients to servers or to exercise route control, today’s DNS has limited ability to provide fine-grained control (e.g., [30, 32]), and we expect this shortcoming to become more acute in the future.

In this paper we describe a system called Wide-Area Cluster-Consistent Objects (WACCO). WACCO manages access to stateful, deterministic objects that support invocations of arbitrary types, each of which is either an update that may modify object state or a read that does not. Objects are managed on a tree-based overlay network of proxies that is arranged with respect to geography; i.e., neighbors in the tree tend to be close geographically or, more to the point, enjoy low latency between them. Each client is assigned to a nearby proxy to which it connects to access objects, and object access is managed through a protocol that offers a novel type of consistency that we dub cluster consistency. Cluster consistency is strong: it ensures sequential consistency [23] and also that clusters of concurrent reads see the most recent preceding update to the object on which the reads are performed. The resulting agreed-upon order and rapid visibility of updates facilitate a wide range of applications, e.g., network troubleshooting, trajectory tracking of mobile nodes, and content-oriented network applications.

Scalability of services implemented using WACCO is achieved through two strategies. First, WACCO uses the tree structure of the overlay to aggregate read demand, permitting the responses to some reads to answer others. As such, under high read concurrency, the vast majority of reads are not propagated to the location of the object; rather, most are paused in the tree awaiting others to complete, from which the return result can be “borrowed”. Second, WACCO employs migration to dynamically change where each object resides, permitting the object to move closer to demand as it fluctuates, e.g., due to diurnal patterns.

To demonstrate and evaluate WACCO, we use it to build a service called Low-Overhead Keyspace Objects (LOKO). LOKO permits clients to create, modify and query keyspace objects. A keyspace is identified by a public key pk, and the keyspace for pk stores (or generates) mappings, each from a query string qstr to a value val and bearing a digital signature that can be verified by pk. So, for example, querying the keyspace for pk for the string nytimes/publicKey might return the signed public key certificate that the owner of pk believes to be for nytimes. Similarly, the query www/bestRoute on the keyspace identified by pk′ might return a signed mapping indicating the currently preferred route to reach the web server representing the owner of pk′. By iterating queries to a “chain” of keyspaces, each referring the client to the next keyspace in the chain, a client could securely resolve a multipart pathname, much as is done with DNSSEC [2]. In this respect, LOKO could encompass one of the main duties of today’s DNS/DNSSEC, while supporting more dynamic mappings due to the consistency provided by WACCO.

In evaluating LOKO (and WACCO), we were handicapped in not having a global workload for such a service. So, we approximated a global workload using a trace of over 4.4 billion DNS requests served by Akamai servers over 36 hours to 83,448 clients in four geographic regions across Asia, North America and Europe. We used this trace to drive 76-proxy emulations of LOKO with network delays induced to represent a LOKO deployment across these four regions. Our emulations show that LOKO provides good latency for operations, e.g., with up to 89% of reads completing in under 100ms. We also show that our implementation can sustain the full per-proxy query rate represented by the Akamai trace, while guaranteeing cluster consistency. We illustrate the effectiveness of the components of our design using measurements from these emulations.

We begin by presenting related work in Sec. 2. We discuss our design goals in Sec. 3 and present the design of WACCO in Sec. 4. Our evaluation (including a description of LOKO) is in Sec. 5, and we conclude in Sec. 6. The appendix presents the definition of cluster consistency and a proof that our protocol implements it.

2 Related Work

The use of a tree-based topology in WACCO for object access is reminiscent of hierarchical caching, which has been studied and deployed extensively for wide-area systems such as the World-Wide Web (e.g., [7, 27, 34]). In some respects, WACCO can be viewed as using a polling-every-time cache validation strategy [6] in which the authoritative object copy is consulted before returning a cached answer in order to enforce strong consistency. To reduce the overheads and response latencies induced by such polling, WACCO employs two strategies. The first is to leverage the tree structure to aggregate polling by many concurrent reads into few messages along the tree. This aggregation also allows WACCO to reduce polling latency by using ongoing polling requests to accelerate others; this strategy has implications for the consistency offered by WACCO, which we characterize precisely. The second strategy is to migrate the authoritative object copy closer to where demand is largest, an option available to WACCO because it manages the authoritative copy of each object itself, in contrast to web caches that do not.

Many wide-area caching, edge service, and storage designs are also related to our work; space limitations preclude a comparison to all of them. That said, if a replication (or caching) scheme is to prevent conflicting object versions and to make updates available to reads immediately, it must apply reads and updates at a set (quorum) of replicas that intersects the quorum employed in another update [12, 17]. Different designs employ different quorum systems; e.g., in a read-one-update-all quorum system, every proxy (the update quorum) must be contacted on the critical path of an update. WACCO employs a quorum per object consisting of a single authoritative copy, uses a tree-based overlay to reach this copy, is optimized toward widespread concurrent read load and moderate concurrent update load, and, to our knowledge, offers a new type of consistency achieved by a novel combination of tree-based aggregation and migration.

Some designs offer stronger consistency than WACCO. For example, Scatter [14] supports linearizability [18]. However, partly due to its use of distributed hash tables, it does not offer the same benefits of request aggregation and geographic proximity that WACCO achieves through its tree structure and migration. Spanner [9] also implements linearizability, though it does so in part by relying on synchronized real-time clocks, which WACCO does not, and again does not leverage request aggregation. Other systems offer weaker consistency to improve partition-tolerance: e.g., COPS [24] implements causal consistency [1]. Here, we strive for stronger consistency and necessarily¹ presume that partitions in future Internet architectures will be negligibly rare (e.g., due to redundant routing paths [38, 35, 28]).

¹The proof by Gilbert and Lynch [13] shows that linearizability is impossible to achieve if all operations must return even when partitions occur. This proof applies equally to cluster consistency, the property that WACCO provides.

As discussed in Sec. 1, our implementation of LOKO as a demonstration of WACCO is motivated by shortcomings of the current DNS for future Internet architectures or even for serving more dynamic data in support of today’s mobility and content management (e.g., see [30, 32]). These shortcomings have led to numerous attempts to modify DNS usage (e.g., [37]), to enhance DNS operation (e.g., [8]), to replace it outright with alternative designs (e.g., [20, 10, 32]), and to understand the tradeoffs between new designs and the current DNS (e.g., [31]). CoDoNS [32] is a noteworthy design that, like LOKO, decouples namespace (or keyspace) management from the location and ownership of name servers (in our parlance, proxies) and accelerates the propagation of updates to clients. It provides fast read response via a dynamic replication technique that ensures that a large percentage of requests can be answered immediately by the first proxy to receive the request. However, as in the discussion of read-one-update-all quorum systems above, consistency then requires that all of these replicas be updated (or invalidated) when an update occurs, making updates more costly. LOKO is a different point in the design space that anticipates more frequent updates and so strikes a different balance between read and update cost — one that still favors reads particularly when read load is high but that lessens the number of proxies that updates must alter.

3 Design Considerations and Goals

We anticipate an object access workload that is generally read-dominated — maybe by orders of magnitude — but that may nevertheless involve frequent and even concurrent updates on a per-object basis. Updates to an object may be frequent due to the transient nature of the information used to update an object (e.g., the current performance characteristics of a network link), and object updates may be concurrent due to contributions from many parties (e.g., one per link, for an object that calculates preferred routes based on current characteristics of many links). Such workloads temper our willingness to trade update performance for read performance arbitrarily, e.g., as in a typical read-one-update-all system (see Sec. 2). Rather, WACCO takes a more balanced approach that favors read performance but that still limits updates to a single authoritative object copy.

The consistency implemented in WACCO implies sequential consistency [23] (and more, see below). Sequential consistency is a “strong” consistency model: it implies that clients observe update operations to objects in the same total order (cf., [3, Ch. 9]). Sequential consistency implies causal consistency [1]: updates related by potential causality [22] (e.g., a client reads an update and then performs another) will be observed by any client in order of their potential causality. But unlike causal consistency, sequential consistency also implies that all clients will observe all updates that are not related by potential causality in the same order.

Despite the strength of sequential consistency, it alone does not ensure rapid propagation of updates: in the limit, a client of a sequentially consistent (only) object store would be permitted to read the same value forever for an object, even if other clients have updated that object. As such, our goals include an enforced rapid propagation of updates, i.e., updates “take effect” (nearly) immediately. Linearizability [18] strengthens sequential consistency by mandating that an update be observed by any operation on the same object that begins after (in real time) the update operation returns to its caller. However, linearizability comes at considerable performance cost [3, Ch. 9], and so we adopt a weaker requirement that nevertheless strengthens sequential consistency to make updates take effect quickly.

The middle ground we adopt is one in which read operations on the same object can be partitioned into clusters of concurrent reads,² so that all reads in each cluster return results based on the latest update preceding the cluster in real time (or a more recent update, i.e., one concurrent with the cluster). The resulting consistency property, which we term cluster consistency, is weaker than linearizability because a read may return results on the basis of only updates that preceded the cluster containing it, rather than all updates that precede the individual read. (Updates to the same object are still ordered according to their real-time order, however.) In exchange for this weaker property, we show that cluster consistency can be implemented scalably in wide-area settings by permitting a read to carry responses to other reads in its cluster, thereby accelerating the response times of those reads and reducing load on the authoritative copy.

²More specifically, in each cluster, the union of real-time intervals, each beginning with a read invocation and ending with its return, is contiguous. See the appendix for details.
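To make the clustering in footnote 2 concrete, consider a small hypothetical timeline (our illustration, not from the paper):

    read r1 executes over the real-time interval [0, 5];
    read r2 over [3, 8]; read r3 over [7, 12].
        The union [0, 12] is contiguous, so r1, r2, r3 form one cluster.
    read r4 executes over [20, 25], disjoint from [0, 12], so it starts
        a new cluster.

    If an update u completes at time 15, then r4 must reflect u (or some
    more recent update), while r1, r2, r3 need only reflect the latest
    update completing before time 0, or any update concurrent with [0, 12].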

Beyond applications to future Internet designs mentioned in Sec. 1, we see cluster consistency as potentially useful in other, nearer-term applications of WACCO, e.g.:

• Network troubleshooting: Updates from network sensors that publish to WACCO will appear in the same order, enabling consistent diagnosis and actuation of the network by distributed analysis engines. For example, routing anomalies caused by MED oscillation [15] and BGP policy divergence [16] in today’s Internet require distributed monitoring to quickly detect and react to an anomaly, e.g., by modifying local routing policies to eliminate the divergent behavior and so to minimize its impact on traffic. A cluster-consistent view of routing updates published to WACCO will make it simpler for distributed monitors to concur on the anomaly and effect changes in policy at multiple locations to rectify the problem. Another example is real-time response to routing pollution, e.g., prefix hijacking [42]. Rapid update propagation and consistent event ordering (e.g., which networks are polluted first) could help reveal the source of pollution and enable a faster reaction to the propagation of polluted routes.

• Trajectory tracking of mobile nodes: Predicting the future location of a mobile endpoint (e.g., a train) for use in routing (e.g., [29]) would be greatly simplified with a cluster-consistent view of the endpoint’s trajectory. For example, if each network appends its name to a WACCO object representing the endpoint’s trajectory when the endpoint attaches to the network, cluster consistency implies that the trajectory will be accurate. A weaker property like causal consistency might yield incomplete and even conflicting trajectories, since appends would not be causally related (in the sense of Lamport [22]).

• Content-oriented network applications: In some content-oriented network applications such as online gaming, it can be at least as important for users to see the same content as it is that the content they see is the most up-to-date [11, 25]. Otherwise, the game may be unfair. Such applications can be simplified if built on objects that appear to all clients to be modified in the same order.

As suggested in Sec. 2 and detailed in Sec. 4, WACCO implements cluster consistency using a protocol in which each read cluster collectively polls an authoritative object copy before returning responses for the reads it contains. Prior work has generally found polling costlier than cache invalidation (e.g., [6]), and in ongoing work we are investigating scalable designs for implementing cluster consistency using cache invalidations. That said, polling serves dual purposes in WACCO; in addition to consistency, polling messages carry load information to the proxy holding the authoritative object copy, which it uses to determine if the object should be migrated. Migration enables an object to be placed closer to the predominant sources of demand and, as we will show, can significantly reduce response times for operations.

4 WACCO Design

The object-sharing protocol that underlies WACCO utilizes a logically tree-structured overlay network that spans a collection of proxies. This overlay network should be assembled in a “geographically aware” manner, i.e., so that geographically close (and so presumably well-connected) proxies are also close to one another in the tree. The manner in which a client is paired with a proxy can be decoupled from the rest of our system design; our present design simply leverages a few widely-known proxies to refer each new client to a proxy near it. We assume that each client interacts with only a single proxy at a time, awaiting the completion of any operations it issued to one proxy before switching to another.

The proxies provide clients with access to a collection of objects. A client submits a read or update invocation for an object to its proxy and awaits a response from that same proxy. Updates (potentially) modify the object state; reads do not. Our protocol description and proof presume that a read simply returns the current object state, though obviously a proxy can derive a customized read result from that state before returning the result to the client. An example of this behavior in the context of LOKO is described in Sec. 5.1.
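To fix ideas, the object model this implies can be sketched as a small Java interface. This is our illustration only; the names are hypothetical and do not reflect WACCO’s actual API:

    // A WACCO-style object is a deterministic state machine whose invocations
    // are partitioned into updates (may modify state) and reads (may not).
    // Names here are hypothetical, for illustration only.
    interface SharedObject {
        byte[] applyUpdate(byte[] invocation); // applied at the authoritative copy
        byte[] read();                         // returns the current object state;
                                               // a proxy may derive a customized
                                               // result from this state (Sec. 5.1)
    }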

4.1 Basic Protocol

WACCO maintains a single authoritative copy of each object. At any point in time, the proxy at which this copy of the object resides is said to host the object and, synonymously, to be the location of the object. Proxies implement a protocol to route client invocations toward the current location of the object over tree edges (see [33]). Once performed on the object, an operation’s response is routed back along the tree to the client that invoked it.

While all update invocations are always routed to the object itself, a read invocation will be paused in the tree if the invocation, while being routed toward the object, encounters a proxy that already forwarded a read request for the same object and has not yet received a response. The paused read will not be forwarded further in the tree; rather, it will be held by the proxy until the response to the invocation on which it paused is returned. When that response arrives, it can serve as the response for any read invocation on the same object that was paused awaiting it and that meets certain conditions described below. In this way, a single read invocation that reaches the object may, in fact, end up serving numerous read requests that are paused on it elsewhere in the tree. This effect is shown in Fig. 1, where the second and third reads are paused waiting on the first (Fig. 1(a)) and then adopt the response to the first read as their own (Fig. 1(b)).
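A minimal sketch of this pausing rule at a proxy, assuming hypothetical types ReadRequest and Response; this is our rendering of the description above, not WACCO’s code:

    import java.util.*;

    class ReadRequest {}
    class Response {}

    class PausingProxy {
        // objects for which this proxy has forwarded a read and awaits a response
        private final Set<String> outstanding = new HashSet<>();
        // reads paused here, awaiting the outstanding read's response
        private final Map<String, List<ReadRequest>> paused = new HashMap<>();

        synchronized void onRead(String objectId, ReadRequest r) {
            if (outstanding.contains(objectId)) {
                paused.computeIfAbsent(objectId, k -> new ArrayList<>()).add(r); // pause
            } else {
                outstanding.add(objectId);
                forwardTowardObject(objectId, r); // next hop on the tree path
            }
        }

        synchronized void onResponse(String objectId, Response resp) {
            outstanding.remove(objectId);
            // The response is "borrowed" by reads paused on it, subject to the
            // Lamport-timestamp condition described below (omitted here).
            for (ReadRequest r : paused.getOrDefault(objectId, new ArrayList<>())) {
                respond(r, resp);
            }
            paused.remove(objectId);
        }

        private void forwardTowardObject(String objectId, ReadRequest r) { /* tree routing, elided */ }
        private void respond(ReadRequest r, Response resp) { /* return toward client, elided */ }
    }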

Pausing read requests in this way offers at least two benefits. First, it can reduce the latency of read requests in comparison to forwarding each read request all the way to the object, since the read request on which another is paused is farther along the path to the object (and so should solicit a response sooner) than the paused read is. That is, in Fig. 1(a), the first read is at least as close to the object as the second or third read is when each is paused, and a response may even already be traversing the path back. Second, in comparison to forwarding every read request to the object and returning each read response individually, pausing reduces the bandwidth use of the protocol, the routing costs to proxies, and the computational load on the proxy hosting the object.


[Figure 1: Example of pausing some reads and resuming them later. (a) The second and third reads are paused on the first. (b) The return value for the first read is used to resume (respond to) the second and third reads. (c) A fourth read carries an aggregated count of recently paused read invocations toward the object.]

There are also challenges that arise from pausing. First, a paused read constitutes state that a proxy must store until the response for the read on which it is paused returns, possibly opening the door to resource exhaustion. That said, aside from read invocations submitted to a proxy directly by clients, the number of paused reads for an object that a proxy must maintain simultaneously is limited by the number of its neighbors. Reads submitted to a proxy directly by clients (and that are paused) still pose a denial-of-service risk, but it can be managed using any of several techniques (e.g., [19]), and moreover, dropping these read requests as needed can never interfere with other reads (since none are paused on these reads). Resource exhaustion will be discussed further in Sec. 4.4.

Second, pausing erodes the consistency of the protocol, and, indeed, to achieve cluster consistency—and specifically to achieve the sequential consistency that it implies—we must place restrictions on which read responses can be used to respond to paused reads. Intuitively, implementing cluster consistency requires that a paused read is not answered by an incoming response that is too outdated. Specifically, as we prove in the appendix, the following conditions suffice to implement cluster consistency: Each read request from a client carries the largest Lamport time [22] at which any update that the client has observed was applied, and each read response carries the Lamport time at which the response was emitted from the authoritative object. A read response that returns to a proxy can be used to satisfy a read request paused at that proxy only if the response’s timestamp exceeds the request’s timestamp. If any reads paused at the proxy remain unsatisfied due to this requirement, then the proxy unpauses one and forwards it along toward the object.
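The timestamp condition can be stated compactly. In the hypothetical sketch below (ours, extending the PausingProxy sketch above), lamport on a request is the largest Lamport time of any update the issuing client has observed, and lamport on a response is the Lamport time at which the authoritative copy emitted it:

    import java.util.*;

    class TimestampedRead { long lamport; }
    class TimestampedResponse { long lamport; }

    class ConsistencyFilter {
        // Returns the paused reads that an arriving response may satisfy;
        // any read left over must stay paused, and the proxy unpauses one
        // of them and forwards it toward the object (Sec. 4.1).
        static List<TimestampedRead> satisfiable(List<TimestampedRead> pausedReads,
                                                 TimestampedResponse resp) {
            List<TimestampedRead> ok = new ArrayList<>();
            for (TimestampedRead r : pausedReads) {
                if (resp.lamport > r.lamport) ok.add(r); // response is recent enough
            }
            return ok;
        }
    }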

4.2 Caching

Each object state has a version number (an integer, initially zero). Applying an update to the object increments that version number. WACCO uses these version numbers to optimize the protocol above as follows.

Each proxy maintains a cache holding at most one cached state per object. A proxy can unilaterally delete states from this cache and manage it using policies independent of those of other proxies. Each read request is augmented to carry a version number. If, upon receiving a read request with version number v (or upon receiving a read request from a client, in which case v defaults to −1), a proxy has a version v′ > v of the relevant object in cache, then the proxy can increase the read request’s version number to v′ when forwarding it. If it does so, the proxy is said to have taken responsibility for the request and is obligated to retain the cached object state until it has responded to this request. (Our current proxy implementation defaults toward taking responsibility; others could do so more selectively.)

When responding to a read request, the proxy hosting the authoritative copy sends the object state (as in Sec. 4.1) if the current object version is larger than the version number in the read request, and sends same otherwise. Upon receiving a response to a read for which a proxy took responsibility, the proxy identifies the latest object version it now has—either the object state in the response or, if the response was same, the version in its cache—and responds to paused reads similarly (subject also to the constraints of Sec. 4.1 on Lamport timestamps). That is, it returns same to paused reads bearing the version number of the proxy’s latest object version, and it responds with the latest object state to the rest.
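A sketch of the decision at the proxy hosting the authoritative copy, under hypothetical names (our illustration of the version-number rule above):

    // Decide how to answer a read that vouches for version reqVersion: ship the
    // full state only if the authoritative copy is newer; otherwise the token
    // "same" suffices, because some proxy en route holds that version in cache.
    class CacheResponse {
        final boolean same;   // true: the vouched-for version is current
        final byte[] state;   // full object state, if same == false
        final long version;   // version of the authoritative copy
        CacheResponse(boolean same, byte[] state, long version) {
            this.same = same; this.state = state; this.version = version;
        }
    }

    class VersionedResponder {
        // reqVersion defaults to -1 for reads arriving directly from clients;
        // a proxy with cached version v' > reqVersion may raise reqVersion to
        // v', thereby "taking responsibility" for the request (Sec. 4.2).
        static CacheResponse answer(long currentVersion, byte[] currentState,
                                    long reqVersion) {
            return (currentVersion > reqVersion)
                    ? new CacheResponse(false, currentState, currentVersion)
                    : new CacheResponse(true, null, currentVersion);
        }
    }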

A proxy that forwards a read request but that does not take responsibility for it might receive a same response, at which point it may not have the latest object version and so would be unable to respond to any read it paused bearing an older object version number. These paused reads therefore remain paused while one is unpaused and forwarded toward the authoritative object, as discussed in Sec. 4.1. Note that forwarding any read request bearing an old object version number guarantees that the response will contain an object state, and so when the proxy selects one to unpause and forward, it gives preference to those with smaller object version numbers.

4.3 Migration

A component of WACCO is migration, by which an object is migrated from one proxy to another in response to demand. For example, a proxy currently holding an object can migrate the object toward its neighbor from which a majority of the invocations on the object arrive, or a proxy can migrate an object away if the proxy is becoming too heavily loaded. In this way, migration can be used to reduce load by moving objects closer to areas of greater interest and to otherwise reposition load as needed to deal with hotspots. The former use of migration is particularly beneficial for LOKO (see Sec. 5), since migration can be used to position objects to best accommodate the time zones that are most active at a particular time of day. Moreover, many entities will be accessed with a clear geographic preference — e.g., websites in Chinese will presumably be accessed most often from China — and so migration makes sense for positioning such an object near where it is accessed most.

WACCO is not closely tied to the mechanics of migration; all that WACCO requires is that it be able to migrate an object from one proxy to a neighbor in between object invocations. So while WACCO uses the migration mechanism in Quiver [33], it would presumably work using other migration mechanisms, as well. That said, in order for migration to work effectively, two issues must be resolved. The first is how to determine from where an object is currently experiencing the most load; because of pausing reads, no single proxy observes the entire load on an object. Given an answer to this issue, we then need to determine the specific conditions under which an object should be migrated.

The first question is resolved in WACCO by amending each message carrying an object invocation to also include the number of read invocations for that same object that were recently paused along the path the message has traveled. If this invocation is paused, the proxy that does so accumulates the message’s count into a per-object, per-neighbor counter that the proxy maintains (i.e., for the object to which the invocation pertains and the neighbor from which the proxy received the invocation) and then further increments this counter by one (for the invocation that was just paused). Otherwise, the proxy accumulates its counters for this object and for all of its neighbors into the field on the invocation message and forwards the message along toward the object, subsequently setting each of these counters to zero. Fig. 1(c) shows an example where the field of a fourth read invocation, initially with value 0, is updated to 1 at the proxy where read3 was formerly paused and then to 2 as it travels through the proxy at which read2 was formerly paused. In this way, a count of paused reads trickles toward the object at all times, which the proxy holding the object can similarly incorporate into per-object, per-neighbor counts of paused invocations.³

³Though an update invocation cannot be paused, the proxy holding the object incorporates each update invocation into this count, as well, so that updates are reflected in the load from that neighbor for the object.

As described so far, this approach for conveying the numbers of paused reads to the proxy holding the object does not adjust these counters for the passage of time, but intuitively such adjustment is necessary. After all, reads paused ten minutes ago should presumably have less bearing on whether to migrate the object than reads paused within the last few seconds. For this reason, each WACCO proxy decays its per-object, per-neighbor counters to account for the passage of time, before incorporating them into an invocation message that the proxy forwards toward the object (or, in the case of the proxy holding the object, before calculating whether to migrate the object). In our present implementation, the proxy decays these counters linearly as a function of the time that passed since last unpausing (and returning values for) reads for that object, i.e., the interval between the proxy seeing the last object response and the subsequent object invocation.

Finally, this brings us to the question of how a proxy holding an object determines whether to migrate an object and, if so, to which of its neighbors. In our implementation, the proxy hosting an object periodically sums its per-neighbor counters for that object and, if one such counter accounts for more than a fraction m of this sum (for a fixed threshold m), then the proxy asks the neighboring proxy corresponding to that counter to migrate the object to it. That neighboring proxy might not do so, e.g., because it is already hosting too many other objects. If it decides to do so, however, then it initiates the object migration. Note that the threshold m can differ per object, though in our present implementation we simply set m the same for all objects.
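A sketch of this policy, with hypothetical names and our own rendering of the linear decay (the paper fixes neither a decay window nor this exact code):

    import java.util.*;

    class MigrationPolicy {
        final double m; // migration threshold, e.g., 0.75 in the experiments of Sec. 5

        MigrationPolicy(double m) { this.m = m; }

        // counters: neighbor id -> per-object count of recently paused reads
        // (and updates) arriving via that neighbor, already decayed for time.
        Optional<String> chooseTarget(Map<String, Double> counters) {
            double total = 0;
            for (double c : counters.values()) total += c;
            if (total == 0) return Optional.empty();
            for (Map.Entry<String, Double> e : counters.entrySet()) {
                if (e.getValue() > m * total) {
                    return Optional.of(e.getKey()); // ask this neighbor to take the object
                }
            }
            return Optional.empty(); // demand too dispersed; keep the object here
        }

        // One possible linear decay (an assumption of ours): scale a counter
        // down by the time elapsed since this proxy last unpaused reads for
        // the object, relative to a fixed window.
        static double decay(double counter, double elapsedMs, double windowMs) {
            return counter * Math.max(0, 1 - elapsedMs / windowMs);
        }
    }

For any m ≥ 0.5, at most one neighbor can exceed the threshold, so the first counter found above it is the only candidate; the chosen neighbor may still decline, in which case the object stays put.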

4.4 Resilience

Fault tolerance: In WACCO as described so far, a proxy failure would disconnect the tree until the proxy recovers. A generic approach to tolerate proxy failure is to locally replicate each proxy; e.g., in our implementation, each proxy can optionally have a backup to which it commits any meaningful change in internal state [5, §8.2.1] before acting on it. In WACCO, such changes include changes to an object (due to update operations) and changes to internal Quiver routing tables [33] (e.g., due to migration). In a straightforward implementation, this primary-backup configuration would double the hardware needed for the service. In practice, we expect clusters of proxies to reside in datacenters in major metropolitan areas, in which case these proxies can provide backup service for others in the same datacenter.

Denial-of-service defense: The most acute threat of denial-of-service attacks is interfering with proxy-to-proxy communication. Multi-path routing (e.g., [38, 35, 28]), using private leased lines, or other suitable defenses (e.g., [40]) can mitigate the threat of link overload. In addition, each proxy should ensure that it reserves adequate resources to retain communication with its neighbor proxies. For example, each proxy can utilize two network interfaces, one dedicated to proxy-to-proxy communication and the other for serving clients that contact it directly. Moreover, proxies can prioritize tasks for managing inter-proxy activities ahead of those responding to clients, for example, and can terminate (or refuse) client requests in favor of retaining communication with neighbor proxies.

Migration opens the possibility of a degradation of service if, e.g., a flood of read requests can cause an object to be migrated (see Sec. 4.3) to a region of the network far from legitimate demand. This risk can be mitigated by each object expressing to WACCO its preferences or requirements for where it can be hosted, if the region of legitimate demand is known in advance. (This mechanism is also useful to enforce regulatory constraints on where data can be hosted, for example.) In other cases, allowing only authorized reads to influence migration can mitigate this risk. One method for doing this is described in Sec. 5.1 in the context of LOKO.

5 WACCO Evaluation

We have implemented WACCO in Java. Our implementation consists of roughly 11,500 physical source lines of code. To evaluate WACCO, we used it to construct a service called LOKO, which we describe in Sec. 5.1. We then describe the traces that we use to induce a realistic workload on this service in Sec. 5.2. We describe our experimental setup in Sec. 5.3 and our results in Sec. 5.4.

5.1 LOKO

As discussed in Sec. 1, we have used WACCO to implement a service called LOKO that supports keyspace objects. A keyspace is identified by a public key pk and stores (or generates) mappings, each from a query string qstr to a value val. When responding to a query qstr, the keyspace sends val, along with a digital signature on the mapping that can be verified by pk. The signature could be inserted into the keyspace object through an update invocation, or the object could produce the signature itself using a private key it holds. This latter strategy might be appropriate for keyspace objects that generate responses dynamically.
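A sketch of the keyspace abstraction in Java (our illustration; LOKO’s actual interface is not specified at this level of detail):

    // A keyspace, identified by public key pk, maps query strings to values
    // and returns each mapping with a signature verifiable under pk.
    // Names here are hypothetical.
    interface Keyspace {
        SignedMapping query(String qstr);                     // e.g., "www/A"
        void put(String qstr, byte[] val, byte[] signature);  // update invocation
    }

    class SignedMapping {
        final String qstr;
        final byte[] val;
        final byte[] signature; // stored with the mapping, or generated by the
                                // object itself if it holds the private key sk
        SignedMapping(String qstr, byte[] val, byte[] signature) {
            this.qstr = qstr; this.val = val; this.signature = signature;
        }
    }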

Generating dynamic responses is useful, e.g., to support CDNs by customizing the content-server address returned in response to a read query. That is, a keyspace for pk, when queried for nytimes/www/address, could select the answer val from a set of candidate addresses based on load conditions and the address of the client who is asking. (This selection would be performed by the proxy directly returning the response to the client.) The response could carry either a previously stored signature or one that the keyspace object generates itself; in the latter case, the keyspace object state would need to include the private key sk corresponding to pk. The cluster consistency offered by LOKO would improve the responsiveness of this mapping to changing conditions over that provided by DNS today (cf., [30]). Of course, keyspaces can also be used to store static mappings, e.g., to addresses or public keys, and several keyspaces could be queried iteratively to resolve hierarchical names, analogous to DNS/DNSSEC today.

Any LOKO object can enforce its own access control by checking a signature for each invocation — possibly the same one that it will store and return in response to read invocations later. But by virtue of it having a public key, a keyspace enables the enforcement of coarse access-control policy at the first proxy to receive a request for it, even if that proxy does not host the object. That is, we could extend LOKO so that a proxy, upon receiving a read request for the keyspace identified by pk from a client, confirms that the request is accompanied by a delegation credential that is signed by the owner of pk and that authorizes the read. The proxy would do so prior to acting on the read request, dropping it if the check fails. This defense would hinder degradation-of-service attempts to migrate the keyspace away from legitimate demand by submitting unauthorized read requests (see Sec. 4.4). We have not yet implemented this extension, however.

5.2 Traces

The data we used in our evaluation of LOKO (and hence WACCO) are traces of domain-name queries received by Akamai, collected from 6am, March 9, 2011 to 6pm, March 10, 2011 (36 hours). In addition to serving DNS queries for domain names of its own, Akamai serves queries for the domain names of a number of customers, as well. The dataset we obtained from Akamai includes queries of both types and reportedly includes all queries Akamai received during that period by 357 of these (globally distributed) servers.

We emphasize that the goal of using Akamai traces was not to evaluate LOKO as a DNS replacement per se, but rather to stress our system with a workload that exhibits typical global effects, e.g., diurnal patterns and regional object affinities. As such, in using it to populate objects and generate a workload for our evaluation (see below), we strived primarily to preserve the object-access and client distributions.

5.3 Experimental Setup

Hardware: Our experiments consisted of emulations on Emulab [36]. Each node (on which we ran multiple proxies, see below) was of the “d820” variety; see https://wiki.emulab.net/Emulab/wiki/UtahHardware for its specifications. We performed our emulations with 76 proxies spread across 4 nodes, resulting in an average of between 3 and 4 vCPUs per proxy. The only exceptions were our fault-tolerance experiments, in which each proxy was accompanied by a backup, doubling the total number of proxies on the same hardware.

Proxy placement: Recall that the number of servers that Akamai dedicates for the load that our traces represent (and to provide consistency falling short of LOKO’s) is 357, and so we needed to scale down the Akamai trace to permit a realistic evaluation for 76 proxies. To do this, we selected 4 geographic regions that accounted for 72/357 = 20.2% of all queries in the original trace and allocated 72 proxies to those regions proportionally to the number of requests originating there.⁴ (The remaining 4 of the 76 proxies in our experiments are described below.) Clients at each region were then assigned to that region’s proxies to yield a roughly balanced number of queries at each proxy and, most importantly, in a manner that was oblivious to the contents of those queries. The 4 selected regions included one in Asia, one in Europe, and two in North America, and so we believe this methodology produced a reasonable approximation to a global workload. While client requests drive our experiments, clients themselves are not instantiated (or measured) in our experiments. So, the latency between a client and its proxy is not represented in our measurements, nor are client computational costs.

⁴More precisely, we first geolocated the clients in the Akamai traces using the database from IP2Location (http://ip2location.com) and truncated each one’s latitude and longitude to an integral value, yielding its “region”. We allocated a number of proxies to each selected region proportional to its queries; e.g., if one region originated 10% of the 20.2% of queries selected from the original trace, then it was allocated 10% × 72 ≈ 7 proxies.

Network latencies: To generate the tree topology for our experiments, we added an additional head proxy per region and built a minimum spanning tree covering the head proxies using geographical distance as our distance measure. Each region’s other proxies were then organized in a balanced ternary tree underneath the region’s head. So, the total number of proxies in each experiment was 72 + 4 = 76, of which only the 72 non-head proxies accepted requests from clients directly. Once the tree was fixed, we estimated latencies between neighboring proxies as a linear function of the geographical distance between them, where this function was calculated using linear regression on real distance/latency pairs.⁵ We emulated proxy-to-proxy latencies at user level, using the method implemented in the EmuSockets toolkit [4].⁶ We did not limit the bandwidth between proxies, because we do not expect LOKO to even remotely tax the capacity of future networks (or even today’s).

⁵We took round-trip latencies (ms) from AT&T (see http://ipnetwork.bgtmo.ip.att.net/pws/current_network_performance.shtml) on 9 Oct 2011 from Kansas City to 24 other cities in the continental US, as well as from San Francisco to Hong Kong, New York to London, and Washington to Frankfurt. We then obtained distance estimates (miles) for these city pairs. Using simple linear regression, the best-fit line to these distance/latency points was y = 0.019732193x + 8.712212072 with an R² of 0.96820894, indicating a strong goodness of fit. Our use of distance-based latencies from within a single provider’s network is reasonable, we believe, since our service may well be implemented by a major global service provider.

⁶This design is an artifact of our trying out several different platforms for our emulations, including some where we were restricted to user-level modifications only.

Keyspace objects: The queries selected as described above were used to populate keyspace objects as follows. Every DNS query indicates a DNS zone, the requested name in that zone, and a query type. The query type can indicate an IPv4 host (A) record, an IPv6 (AAAA) record, a name server (NS) record, etc. We created a keyspace object per zone and initialized it with a field for each name within that zone for which an A record was requested (e.g., “www/A”), since A records overwhelmingly constitute the most common form of query. The value assigned to each such field was a random 16-byte value. We made no effort to represent resource records in keyspaces more explicitly, remembering that the goal of using the Akamai traces is to induce a realistic global workload on LOKO rather than to make LOKO mimic DNS faithfully. Rather than signing each mapping individually, a Merkle tree [26] was computed over the mappings and the root signed by the private key corresponding to the public key used to label the keyspace. The Merkle tree was transient, i.e., only the signed root was sent when the keyspace was copied (to support a read) or migrated; the interior nodes were recomputed on demand.
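A sketch of computing such a root on demand, under assumptions of ours that the paper does not specify (SHA-256, leaves hashed in key order, a lone last node duplicated):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.*;

    class MerkleRoot {
        // Computes the Merkle root over a keyspace's mappings; only this root
        // is signed and shipped, and interior nodes are recomputed when needed.
        static byte[] root(SortedMap<String, byte[]> mappings) throws Exception {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            List<byte[]> level = new ArrayList<>();
            for (Map.Entry<String, byte[]> e : mappings.entrySet()) {
                sha.update(e.getKey().getBytes(StandardCharsets.UTF_8));
                level.add(sha.digest(e.getValue())); // leaf = H(qstr || val)
            }
            while (level.size() > 1) {
                List<byte[]> next = new ArrayList<>();
                for (int i = 0; i < level.size(); i += 2) {
                    sha.update(level.get(i));        // parent = H(left || right),
                    next.add(sha.digest(             // duplicating a lone last node
                            level.get(Math.min(i + 1, level.size() - 1))));
                }
                level = next;
            }
            return level.isEmpty() ? sha.digest() : level.get(0);
        }
    }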

The 20.2% of the original trace that we used included 4,460,838,100 queries spanning 1,009,689 domain names and 83,448 clients. Fig. 2(a) shows that when used to construct keyspace objects as described above, there were a few keyspaces which dominated the queries, in that requests for those keyspaces were a significant portion of the total requests. The most frequently queried keyspace object comprised over 14% of the total, and the 5 most frequently queried keyspace objects comprised over one third of all requests. The distribution of keyspace sizes was also far from uniform, as shown in Fig. 2(b). While over 88% of all keyspaces contained less than 10 keys, some contained over one million.

[Figure 2: Keyspace query and size distributions. (a) CDF of queries per keyspace; a few keyspaces comprise a large portion of requests. (b) CDF of keyspace size in fields (log-scale x-axis); most keyspaces are small, but some are quite large. (c) Dominant vs. non-dominant proxy queries (one × per keyspace); these query types are strongly correlated. (d) Keyspace size versus queries to that keyspace (one × per keyspace); the larger keyspaces tend to be accessed more frequently.]

Prior to each measurement run of LOKO, we determined the starting location of each object by executing a warmup. The warmup migrated each keyspace object to its dominant proxy, i.e., the proxy that will make the most requests of it during the run. This warmup thus implements an optimal static placement of keyspace objects for the run. Nevertheless, as shown in Fig. 2(c), the request rate by the dominant proxy for a keyspace is strongly correlated with the request rate by other, non-dominant proxies for that keyspace, implying that operation workloads will be dominated by nonlocal operations in any static placement of keyspaces.

Update operations: We introduced updates into our experiments, but since the Akamai traces include no updates, we did so artificially. Specifically, for a parameter u ∈ [0,1], each read operation for a keyspace submitted to its dominant proxy was converted to an update operation with probability u. Because the rates of requests to keyspace objects from their dominant proxies were highly skewed (see Fig. 2(c)), these update operations were not uniformly spread across keyspace objects but instead were concentrated in those that were also read most often, including read most often from non-dominant proxies (again, see Fig. 2(c)). So, these updates caused many caches to become invalid and thus many object sends, and, because the keyspaces accessed the most often tended to be larger (Fig. 2(d)), these sent objects also tended to be large.

If a query was chosen to become an update, an update was generated in its place for the relevant keyspace object, consisting of the relevant query name and query-type string (e.g., “www/CNAME”), a 16-byte value, and a 128-byte digital signature on the root of that keyspace’s new Merkle tree (i.e., the previous Merkle tree updated to reflect the newly added or modified field). The proxy to which this update was introduced verified the signature using the public key of the keyspace. Since client costs are not included in our measurements (see above), signature generation for update operations and signature verification after a read were omitted.

Time scaling: Recall that our Akamai trace was 36 hours in length. Due to the number of experiments we wished to perform with this trace, it was not possible to dedicate a full 36 hours per experiment. Simply truncating the trace would hide important trace characteristics, notably any diurnal pattern that it exhibits. As such, we employed the following methodology to “compact” the trace while retaining its characteristics. Each experiment was parameterized by a sampling rate s ∈ (0,1] and an acceleration a ≥ 1. Each query in the trace was then replayed in the experiment independently with probability s, and the trace was accelerated by a factor of a. So, in a period in which the rate of requests in the original trace was q requests per second, sampling reduced this rate to sq requests per second in expectation, and acceleration increased this to sq requests per 1/a second in expectation. This method shortens the trace replay to 1/a times the original, thereby expediting our tests; in our tests we fixed a = 48 so that each test required 45 minutes. However, we sometimes varied the sampling rate s between experiments. It is convenient to describe an experiment in terms of the product sa, which we will call its load factor. For example, an experiment with load factor sa = 0.1 has an expected request rate of 10% of the original Akamai trace’s rate.
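A sketch of this compaction, with hypothetical types (our rendering; the paper does not give code): each query survives independently with probability s, and surviving timestamps are divided by a.

    import java.util.*;

    class TraceCompactor {
        static List<Query> compact(List<Query> trace, double s, double a, long seed) {
            Random rnd = new Random(seed);
            List<Query> out = new ArrayList<>();
            for (Query q : trace) {
                if (rnd.nextDouble() < s) {                        // sample w.p. s
                    out.add(new Query(q.timestampMs / a, q.name)); // accelerate by a
                }
            }
            return out;
        }
    }

    class Query {
        final double timestampMs; // offset from trace start
        final String name;        // queried name, e.g., "www/A"
        Query(double timestampMs, String name) {
            this.timestampMs = timestampMs; this.name = name;
        }
    }

For instance, a = 48 replays the 36-hour trace in 45 minutes, and choosing s ≈ 0.0021 then yields load factor sa ≈ 0.1, i.e., an expected request rate one tenth of the trace’s.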

5.4 Experimental Results

All performance numbers in this section were produced using the Java Runtime Environment (JRE) distributed with Java SE 7. We configured the HotSpot Server Java virtual machine to use the Concurrent Mark and Sweep garbage collector to maintain responsiveness. Except when evaluating the impact of the migration threshold m below, we set m = 0.75, and except when evaluating throughput below, we set the load factor to 0.1.

[Figure 3: CDFs of latencies (ms) as u varies. (a) Reads (u = 0.0, 0.005, 0.01). (b) Updates (u = 0.005, 0.01).]

Updates: We first explore request latencies and, in particular, the impact of varying the fraction of updates in the execution on those latencies. Fig. 3 shows CDFs of operation latencies in experiments for update probabilities u ∈ {0.0, 0.005, 0.01}, where u = 0.0 implies no updates. In Fig. 3(a), we see that as updates become more common, latency tends to increase for reads, because updates cause caches to become invalidated, creating the need for more network traffic. Moreover, as discussed in Sec. 5.3, these cache invalidations tend to be focused on the larger and more frequently accessed objects, amplifying the performance impact of updates.

Despite these effects, read latency stays low, with 89.5%, 86.7% and 84.7% of reads completing in under 100ms for u = 0.0, 0.005, and 0.01, respectively. Latencies for the updates themselves appear in Fig. 3(b). These too perform well, with 67.7% and 66.0% completing in under 100ms for u = 0.005 and 0.01, respectively. This low latency is partially an artifact of our warmup method, which initially places objects at the proxy which will request them most, making many updates local (except when the object has been migrated away). Note, though, that this behavior is part of our design — migration will tend to move an object toward the proxies requesting it most.

Migration: We illustrate the impact of object migration on operation latency in Fig. 4. Recall that m represents the fraction of the total load for which a neighbor must account in order for migration in the direction of that neighbor to begin. Thus, m > 1 is impossible to satisfy and allows no migration at all. We ran experiments with various migration thresholds: m = 0.55 to 0.95 in increments of 0.1, as well as m > 1.

Fig. 4(a) shows the total number of migrations for each setting of m, and Fig. 4(b) shows the impact of these migrations on operation latencies. Without migration, 85% of operations finished in less than 120ms.

[Figure 4: Impact of varying m, with u = 0.0. (a) Migration count (thousands) for thresholds m = 0.55 to 0.95. (b) Latency CDFs for m = 0.75, m = 0.95, and m > 1; lines for some values of m are omitted for clarity.]

But even with migration enabled at a very conservative threshold (m = 0.95), that figure was reduced by 17% to 100ms. Migration at that level also reduced the total number of proxy-to-proxy messages by 19%. Objects migrated within the tree in response to demand over 110,000 times, resulting in faster response times as well as fewer and smaller network messages sent.

Reducing m further increases performance. For example, at a very liberal threshold, m = 0.55, 85% of operations finished in less than 95ms. In general, the performance differences resulting from different values of the migration threshold (e.g., m = 0.55 vs. m = 0.95) are much smaller than the differences between runs with migration and those without it (e.g., m = 0.95 vs. m > 1).

The reason for this disparity is that even a high migra-tion threshold allows objects to move quite close to theirareas of demand. If an object is far (in the tree) from thepart of the tree where demand for the object is high, thenthe proxy hosting that object will see that nearly 100%of the load for that object is coming to it from whateverneighbor is in the direction of the load; the host will thustry to migrate the object to that neighbor (see Sec. 4.3).In this way, almost any migration threshold will allowmigration of sufficiently out-of-place objects toward theparts of the tree where they are in the most demand. Theexact value ofm only becomes relevant once the objectis near enough to its demand that significant fractions ofdemand for it come from different neighbors. But by thatpoint, objects are already fairly close to the demand, andperformance has already improved substantially.Fault tolerance We measured the effect of fault toler-ance on operation latencies when usingLOKO, i.e., with abackup per proxy (see Sec. 4.4), foru= 0.01. The resultsappear in Fig. 5. As expected, the overhead of fault tol-erance is much more evident for update operations, sincecommunication with the backup is on the critical path ofeach update operation. One possible cause of the addedread latency may be that we allocated no additional hard-ware to host backups, nor did we reduce the number ofprimary proxies to make room for their backups. In-

10

Latency (ms)

% o

f Ope

ratio

ns

No backupsWith backups

0 500 1000

050

100

(a) Reads

Latency (ms)

% o

f Ope

ratio

ns

No backupsWith backups

0 500 1000

050

100

(b) Updates

Figure 5: CDFs of latencies (ms) when using backups,with u= 0.01.

[Figure 6: Throughput and messaging overhead as load factor varies, with u = 0.01. (a) Throughput (10³ ops/second); (b) Message cost (messages per op); (c) Read hops (hops per read), each vs. load factor.]

Instead, the primaries and their backups shared the same resources that, in other experiments, were available exclusively to the primaries. Despite the more thinly spread resources and the synchronization costs of the primary-backup protocol, operation latencies with backups were still reasonably close to those without.

Throughput  We next present experiments that offer insights into the achievable throughput of our system. In these tests, we increased the sampling rates and so the load factor, up to a load factor of 1.0, i.e., the same query rate per proxy as Akamai supported in the original trace. Fig. 6(a) shows the achieved throughput in operations per second with u = 0.01. This figure shows that our LOKO implementation absorbs the full per-proxy query rate of the Akamai trace. Fig. 6(b) illustrates one reason behind this throughput, namely that as the operation rate increases, the effectiveness of read pausing also increases, since more reads are concurrent. This increase in read pausing then results in a reduced number of messages needed per operation, on average (Fig. 6(b)). Finally, Fig. 6(c) shows that the average number of hops a given read request must travel before it is paused or reaches the object is stable, even as the load factor increases. When the load factor reaches 1.0, each read request travels less than 1.1 hops on average.

5.5 Limitations

The Akamai data that we employed in our experiments is the best data we have found for a realistic, global workload. That said, it is important to recognize that this dataset has limitations for the purposes it is used here. First, Akamai customers tend to be large organizations for which domain-name query activity might be heavier and more widespread than for most domain names not served by Akamai or for other objects that one might envision in a future application (e.g., a mobile device's location). This tendency might yield an overly optimistic evaluation of LOKO, since it creates more opportunities to aggregate (i.e., pause) reads in the tree, but it also might yield an overly conservative evaluation, since global demand reduces the ability to improve access latencies through migration. Second, as already noted, the Akamai dataset contains no update operations, and so it was necessary to fabricate them.

6 Conclusion

This paper describes the design and evaluation of WACCO, a system for implementing object-based services that need to support both frequent updates and widespread, massive read demand with strong consistency. A contribution of our work is a novel type of strong consistency dubbed cluster consistency, which implies both sequential consistency and rapid update propagation and, we argue, can be useful in a range of future networked applications. We used WACCO to implement a service called LOKO that supports keyspace objects and, in one style of usage, could roughly encompass the current duties of DNSSEC. Our evaluation using an emulated global topology and trace of DNS queries to Akamai shows that LOKO provides good responsiveness and can scale to large demand. Through our evaluation, we also documented the importance of object migration and read pausing (and hence cluster consistency) to the performance LOKO achieves.

References

[1] M. Ahamad, G. Neiger, J. E. Burns, P. Kohli, and P. W. Hutto. Causal memory: Definitions, implementation, and programming. Distributed Computing, 9(1), 1995.

[2] R. Arends, R. Austein, M. Larson, D. Massey, and S. Rose. DNS security introduction and requirements. RFC 4033, March 2005.

[3] H. Attiya and J. Welch. Distributed Computing: Fundamentals, Simulations and Advanced Topics. John Wiley & Sons, Inc., second edition, 2004.

[4] M. Avvenuti and A. Vecchio. Application-level network emulation: The EmuSocket toolkit. J. Netw. Comp. Appl., 29(4), 2006.

[5] N. Budhiraja, K. Marzullo, F. B. Schneider, and S. Toueg. The primary-backup approach. In S. Mullender, editor, Distributed Systems, 2nd edition, pages 199–216. Addison-Wesley, 1993.

[6] P. Cao and C. Liu. Maintaining strong cache consistency in the World Wide Web. IEEE Trans. Computers, 47(4), 1998.

[7] A. Chankhunthod, P. Danzig, C. Neerdaels, M. F. Schwartz, and K. J. Worrell. A hierarchical Internet object cache. In USENIX ATC, 1996.

[8] X. Chen, H. Wang, S. Ren, and X. Zhang. Maintaining strong cache consistency for the Domain Name System. IEEE Trans. Knowledge and Data Engineering, 19(8), 2007.

[9] J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, W. Hsieh, S. Kanthak, E. Kogan, H. Li, A. Lloyd, S. Melnik, D. Mwaura, D. Nagle, S. Quinlan, R. Rao, L. Rolig, Y. Saito, M. Szymaniak, C. Taylor, R. Wang, and D. Woodford. Spanner: Google's globally distributed database. In 10th USENIX OSDI, 2012.

[10] R. Cox, A. Muthitacharoen, and R. T. Morris. Serving DNS using a peer-to-peer lookup service. In 1st Intern. Wkshp. Peer-to-Peer Syst., 2002.

[11] E. Cronin, B. Filstrup, A. B. Kurc, and S. Jamin. An efficient synchronization mechanism for mirrored game architectures. In 1st Wkshp. Netw. Syst. Support for Games, 2002.

[12] D. K. Gifford. Weighted voting for replicated data. In 7th ACM SOSP, 1979.

[13] S. Gilbert and N. Lynch. Brewer's conjecture and the feasibility of consistent, available, and partition-tolerant web services. ACM SIGACT News, 33(2), 2002.

[14] L. Glendenning, I. Beschastnikh, A. Krishnamurthy, and T. Anderson. Scalable consistency in Scatter. In 23rd ACM SOSP, 2011.

[15] T. Griffin and G. Wilfong. Analysis of the MED oscillation problem in BGP. In IEEE ICNP, 2002.

[16] T. G. Griffin and G. Wilfong. An analysis of BGP convergence properties. In ACM SIGCOMM, 1999.

[17] M. P. Herlihy. A quorum-consensus replication method for abstract data types. ACM TOCS, 4(1), 1986.

[18] M. P. Herlihy and J. M. Wing. Linearizability: A correctness condition for concurrent objects. ACM TOPLAS, 12(3), 1990.

[19] A. Juels and J. Brainard. Client puzzles: A cryptographic defense against connection depletion attacks. In 5th ISOC NDSS, 1999.

[20] J. Kangasharju and K. W. Ross. A replicated architecture for the Domain Name System. In 19th IEEE INFOCOM, 2000.

[21] D. Kim, J. Kim, Y. Kim, H. Yoon, and I. Yeom. Mobility support in content centric networks. In 2nd Wkshp. Inform.-Centric Netw., 2012.

[22] L. Lamport. Time, clocks, and the ordering of events in a distributed system. CACM, 21, 1978.

[23] L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Trans. Computers, C-28(9), 1979.

[24] W. Lloyd, M. J. Freedman, M. Kaminsky, and D. G. Andersen. Don't settle for eventual: Scalable causal consistency for wide-area storage with COPS. In 23rd ACM SOSP, 2011.

[25] M. Mauve, J. Vogel, V. Hilt, and W. Effelsberg. Local-lag and timewarp: Providing consistency for replicated continuous applications. IEEE Trans. Multimedia, 6(1), 2004.

[26] R. C. Merkle. Secrecy, authentication, and public key systems. PhD thesis, Department of Electrical Engineering, Stanford University, 1979.

[27] S. Michel, K. Nguyen, A. Rosenstein, L. Zhang, S. Floyd, and V. Jacobson. Adaptive web caching: Towards a new global caching architecture. Comp. Netw. and ISDN Syst., 30, 1998.

[28] M. Motiwala, M. Elmore, N. Feamster, and S. Vempala. Path splicing. In ACM SIGCOMM, 2008.

[29] J. Paek, K. Kim, J. P. Singh, and R. Govindan. Energy-efficient positioning for smartphone applications using cell-ID sequence matching. In 9th MobiSys, 2011.

[30] J. Pang, A. Akella, A. Shaikh, B. Krishnamurthy, and S. Seshan. On the responsiveness of DNS-based network control. In Internet Measurement Conf., 2004.

[31] V. Pappas, D. Massey, A. Terzis, and L. Zhang. A comparative study of the DNS design with DHT-based alternatives. In 25th IEEE INFOCOM, 2006.

[32] V. Ramasubramanian and E. G. Sirer. The design and implementation of a next generation name service for the Internet. In ACM SIGCOMM, 2004.

[33] M. K. Reiter and A. Samar. Quiver: Consistent object sharing for edge services. IEEE TPDS, 19(7), 2008.

[34] P. Rodriguez, C. Spanner, and E. W. Biersack. Analysis of web caching architectures: Hierarchical and distributed caching. IEEE/ACM Trans. Netw., 9(4), 2001.

[35] X. Wang and D. Wetherall. Source selectable path diversity via routing deflections. In ACM SIGCOMM, 2006.

[36] B. White, J. Lepreau, L. Stoller, R. Ricci, S. Guruprasad, M. Newbold, M. Hibler, C. Barb, and A. Joglekar. An integrated experimental environment for distributed systems and networks. In 5th USENIX OSDI, 2002.

[37] Y. Wu, J. Tuononen, and M. Latvala. Performance analysis of DNS with TTL value 0 as location repository in mobile Internet. In IEEE Wireless Comm. and Netw. Conf., 2007.

[38] W. Xu and J. Rexford. MIRO: Multi-path Interdomain ROuting. In ACM SIGCOMM, 2006.

[39] X. Yang, D. Clark, and A. W. Berger. NIRA: A new inter-domain routing architecture. IEEE/ACM Trans. Netw., 15(4), 2007.

[40] X. Yang, D. Wetherall, and T. Anderson. TVA: A DoS-limiting network architecture. IEEE/ACM Trans. Netw., 16(6), 2008.

[41] X. Zhang, H.-C. Hsiao, G. Hasker, H. Chan, A. Perrig, and D. Andersen. SCION: Scalability, control, and isolation on next-generation networks. In IEEE Symp. Security & Privacy, 2011.

[42] Z. Zhang, Y. Zhang, Y. C. Hu, and Z. M. Mao. iSPY: Detecting IP prefix hijacking on my own. In ACM SIGCOMM, 2008.

A Cluster Consistency

Definition  Here we define cluster consistency. An object consists of state and a set of methods that can be invoked. Each invocation returns a response, and an invocation/response pair is called an operation. Correct behavior of the object is defined by its sequential specification, which specifies the return results of operations invoked sequentially on the object.

To denote a read or update operation specifically, we will often use r-op or u-op, respectively, though we will also use op to denote an operation generically. For any operation op, its invocation occurs at a distinct real time op.inv and is followed by a matching response at a distinct real time op.res > op.inv. A history H is a set of operations and an induced partial order ≺H defined as op1 ≺H op2 ⇐⇒ op1.res < op2.inv. The interval [op.inv, op.res] is denoted op.interval. H is sequential if ≺H is a total order. For an object obj, the set H|obj includes only those operations in H that are invoked on obj, and for a client c, the set H|c includes only those operations in H that are invoked by c. By convention, we assume that H|c is sequential for each client c. (In practice, each "client" is a client thread.) A serialization S of H is the set H totally ordered by a relation ≺S.
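As a concrete illustration of these definitions, the following is a minimal sketch of our own (not from the paper's implementation) that models operations and the induced partial order ≺H:

    # Minimal model of the definitions above: an operation has an invocation
    # time and a response time, and op1 precedes op2 in the history order
    # exactly when op1's response occurs before op2's invocation.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Op:
        client: str    # issuing client (thread)
        obj: str       # object the operation is invoked on
        inv: float     # real time of invocation
        res: float     # real time of response (res > inv)

    def precedes(op1: Op, op2: Op) -> bool:
        """op1 ≺H op2 iff op1.res < op2.inv."""
        return op1.res < op2.inv

    # Two overlapping operations are ordered in neither direction,
    # which is why ≺H is only a partial order.
    a = Op("c1", "x", inv=0.0, res=2.0)
    b = Op("c2", "x", inv=1.0, res=3.0)
    assert not precedes(a, b) and not precedes(b, a)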

Definition 1 (Sequential consistency [23]). A history H is sequentially consistent if there exists a serialization S of H such that the following properties hold: (i) Legality: For each object obj, S|obj is legal (i.e., is in the sequential specification of obj). (ii) Local-Order: If op1 and op2 are executed by the same client and op1 ≺H op2, then op1 ≺S op2.

The consistency implemented in WACCO, called cluster consistency, implies sequential consistency. As such, there is a well-defined order in which updates are applied to each object, and each update operation produces a new version of the object on which it operates. The version number of the new object instance is one greater than that of the object instance to which the update was applied. Let u-op.ver be the version number of the object instance produced by u-op.

Definition 2 (Read cluster). A read cluster C is a nonempty set of read operations (i) that return the same (object and) object version, and (ii) for which ⋃op∈C op.interval is a contiguous interval of time. For a read cluster C, we define

    C.inv = min_{op∈C} op.inv        C.res = min_{op∈C} op.res

It is convenient to also permit a single update operation u-op to constitute its own update cluster C. We stress, however, that an update cluster contains only a single update and no reads. For an update cluster C = {u-op}, the definitions for C.inv and C.res above simplify to C.inv = u-op.inv, C.res = u-op.res. Given these definitions, we abuse notation by using C1 ≺H C2 (where C1 and C2 are read or update clusters) to mean C1.res < C2.inv. We also assign a version number to each cluster, as follows. If C is a read cluster, then C.ver is the version of the object read by the read requests in C. If C = {u-op} is an update cluster, then C.ver = u-op.ver.


Definition 3 (Cluster consistency). A set of operations is cluster-consistent if it is sequentially consistent and satisfies Cluster-Order: There exists a partition of the operations into clusters so that if C1, C2 are performed on the same object and C1.res < C2.inv, then C1.ver ≤ C2.ver.
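To make Cluster-Order concrete, here is a small sketch (our illustration, building on the Op model above; the Cluster summary type is hypothetical) that checks the condition for a proposed partition into clusters:

    # Check Cluster-Order for a proposed partition of operations into
    # clusters. Each cluster is summarized by its object, its inv/res times
    # (minima over its members, per Definition 2), and the object version
    # it read or wrote.

    from dataclasses import dataclass
    from itertools import permutations

    @dataclass(frozen=True)
    class Cluster:
        obj: str
        inv: float   # min over members' invocation times
        res: float   # min over members' response times
        ver: int     # version read (read cluster) or produced (update cluster)

    def satisfies_cluster_order(clusters):
        """Definition 3: if C1, C2 touch the same object and C1.res < C2.inv,
        then C1.ver must not exceed C2.ver."""
        return all(c2.ver >= c1.ver
                   for c1, c2 in permutations(clusters, 2)
                   if c1.obj == c2.obj and c1.res < c2.inv)

    # Fig. 7(a)-style violation: a read cluster at version 0 strictly
    # follows the update cluster that produced version 1.
    w1 = Cluster("x", inv=1.0, res=2.0, ver=1)   # update cluster {write(1)}
    r0 = Cluster("x", inv=3.0, res=4.0, ver=0)   # later read cluster at ver 0
    assert not satisfies_cluster_order([w1, r0])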

Fig. 7(a) gives an example execution that is sequentially consistent but not cluster-consistent, and so cluster consistency is strictly stronger. However, cluster consistency is weaker than linearizability [18], which requires that for any op1 and op2, if op1.res < op2.inv then op1.ver ≤ op2.ver; i.e., history precedence must be respected at the level of all operations and not only between clusters on the same object. Fig. 7(b) shows a cluster-consistent execution may not be linearizable.

[Figure 7: Execution histories. Time increases left-to-right. Each row denotes one client. All operations are on the same object. In each history, one client performs write(0) followed by write(1), and two other clients each perform a read that returns 0. (a) A history that is sequentially consistent but not cluster-consistent. For cluster consistency, the second read must return 1 since its read cluster (itself only) occurs after the write of 1. (b) A history that is cluster-consistent, since both read operations form a read cluster, but not linearizable. For linearizability, the second read must return 1 since it occurs after the write of 1.]

Proof of Cluster Consistency  We now prove that the protocol described in Sec. 4.1 implements cluster consistency. We do not explicitly treat caching (Sec. 4.2), migration (Sec. 4.3) or proxy backups (Sec. 4.4) in the proof, as these are done in a way that does not alter the semantics of the protocol (though they do optimize it or make it more resilient). Given a history H, consider a directed graph GH with nodes the operations in H and edges of three types:

• Client order (→c): if op1 and op2 are performed by the same client and if op1 ≺H op2, then op1 →c op2.

• Reads-from order (→rf): if u-op is an update that results in an object state on which op is applied, then u-op →rf op.

• Version order (→v): Let u-op1 and u-op2 denote distinct update operations on the same object, and let op denote any other operation on that object such that u-op1 →rf op. If the object version on which u-op1 is applied is larger than the object version on which u-op2 is applied (u-op1.ver > u-op2.ver), then u-op2 →v u-op1, and otherwise op →v u-op2.

We use natural shorthands such as →c,rf = →c ∪ →rf. We also use →c+ to denote the irreflexive transitive closure of →c, and similarly for other orders.

→c and →rf naturally capture the temporal and data-flow relationships between operations that need to be respected in serializing H. The purpose of →v, moreover, is to constrain any serialization to respect the object versions observed by operations. More precisely, to prove the sequential consistency of H, we first argue that GH is acyclic (Lemma 6) and then that this implies that there is a serialization of H respecting Legality and Local-Order (Corollary 1). We will then separately argue in Lemma 7 that there must exist some such serialization that also demonstrates Cluster-Order.
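To illustrate the structure of this argument, the following is a toy sketch of ours (the operation names and edge sets are hypothetical, not WACCO code) that represents GH as an adjacency map and tests it for acyclicity by depth-first search; a topological sort, and hence a serialization, exists exactly when this test succeeds:

    # Toy G_H: nodes are operations; edges are the client-order, reads-from,
    # and version-order relations described above. The graph admits a
    # topological sort iff it is acyclic (cf. Lemma 6 and Corollary 1).

    from collections import defaultdict

    def is_acyclic(nodes, edges):
        """DFS three-color cycle check; edges maps node -> set of successors."""
        WHITE, GRAY, BLACK = 0, 1, 2
        color = {n: WHITE for n in nodes}

        def dfs(n):
            color[n] = GRAY
            for succ in edges[n]:
                if color[succ] == GRAY:        # back edge: cycle found
                    return False
                if color[succ] == WHITE and not dfs(succ):
                    return False
            color[n] = BLACK
            return True

        for n in nodes:
            if color[n] == WHITE and not dfs(n):
                return False
        return True

    # Example: w0 ->c w1 (same client), w0 ->rf r0 (r0 reads w0's write),
    # and r0 ->v w1 (r0's version precedes w1's). No cycle here.
    edges = defaultdict(set)
    edges["w0"].update({"w1", "r0"})
    edges["r0"].add("w1")
    print(is_acyclic(["w0", "w1", "r0"], edges))  # -> True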

Below we prove several lemmas which we will use in order to prove (in Lemma 6) that GH is acyclic. Our proofs below involve the following additional notation. To each operation op is associated a logical (Lamport) time [22] op.linv at which the client invoked it and another logical time op.lres at which it returned its result at that client. In addition, each update operation u-op has a (logical) effective time u-op.leff, which is the Lamport clock value assigned to the event applying u-op to the object at the proxy hosting the object. For a read operation r-op, r-op.leff is the logical time at which a response for this read operation was issued, either by the last proxy to pause r-op or, if r-op traveled all the way to the object, by the proxy hosting the object. Note that op.linv < op.leff < op.lres for all operations.
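For readers unfamiliar with Lamport clocks [22], the following minimal sketch (ours, not WACCO code) shows the standard clock rules this notation assumes, and why op.linv < op.leff < op.lres holds along a request's path:

    # Minimal Lamport clock [22]: increment on local events and sends; on
    # receive, advance to one past the max of the local counter and the
    # message's timestamp. Causally ordered events thus get increasing times.

    class LamportClock:
        def __init__(self):
            self.time = 0

        def tick(self):
            """Local event (e.g., invoking an operation): advance the clock."""
            self.time += 1
            return self.time

        def merge(self, received_time):
            """Message receipt: jump past the sender's timestamp if needed."""
            self.time = max(self.time, received_time) + 1
            return self.time

    client, proxy = LamportClock(), LamportClock()
    linv = client.tick()          # op.linv at the client
    leff = proxy.merge(linv)      # op.leff at the proxy applying the operation
    lres = client.merge(leff)     # op.lres back at the client
    assert linv < leff < lres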

Lemma 1. The subgraph of GH consisting of only edges in →c,rf is acyclic.

Proof. Since op1 →c op2 implies op1.lres < op2.linv, we see that op1 →c op2 implies op1.leff < op2.leff. Similarly, it must be that op1 →rf op2 implies op1.leff < op2.leff, since an update must have been written before it can be read from. Therefore, each edge in →c,rf represents an increase in op.leff, meaning op1 →c,rf+ op2 implies op1.leff < op2.leff. Assume for a contradiction, then, that there is a cycle consisting only of edges in →c,rf. That means that op →c,rf+ op. Therefore, we have op.leff < op.leff, a contradiction.

Lemma 2. If there is a cycle in GH, then there is a cycle in GH in which every →v edge appears in an edge sequence of the form u-op1 →c,rf+ r-op2 →v u-op3.

Proof. We prove the result by first showing that for any cycle in GH, any →v edge not already in an edge sequence of the form op1 →c,rf+ r-op2 →v u-op3 can be replaced by edges not in →v to produce a new cycle in GH. Since →v edges must point to an update, we must consider →v edges of only the forms r-op →v u-op and u-op′ →v u-op. In the first case, since op →v r-op is impossible (again, →v edges point to updates), an edge r-op →v u-op already occurs within an edge sequence of the form op1 →c,rf+ r-op2 →v u-op3 on the cycle. In the second case, because updates on each object are applied sequentially, u-op′ is applied before u-op, and so there is a chain of updates to the object such that u-op′ →rf+ u-op. Replacing the edge u-op′ →v u-op with this chain produces a cycle not containing u-op′ →v u-op.

To complete the proof, we now must argue that for any edge sequence of the form op1 →c,rf+ r-op2 →v u-op3 on the cycle, there is a corresponding edge sequence u-op1 →c,rf+ r-op2 →v u-op3 on the cycle. If op1 is an update, then setting u-op1 = op1 completes the argument. Otherwise, consider walking the cycle backward along →rf and →c edges from op1, terminating at a →v edge. Since this →v edge must point to an update, this update suffices for u-op1.

If there is a cycle in GH, then Lemma 2 guarantees the existence of a cycle in which all →v edges occur within edge sequences of a certain form. Below we refer to such a cycle as constrained.

Lemma 3. If there is a cycle in GH, then within a constrained cycle, there must be at least one edge sequence u-op1 →c,rf+ r-op2 →v u-op3 such that u-op3.leff ≤ u-op1.leff.

Proof. Consider an alternative graph G′H that includes all of the edges of GH and additionally the edge u-op1 →s u-op3 whenever u-op1 →c,rf+ r-op2 →v u-op3. From any constrained cycle in GH we can construct a cycle op →c,rf,s+ op in G′H by replacing edge sequences u-op1 →c,rf+ r-op2 →v u-op3 on the constrained cycle with the edge u-op1 →s u-op3. Recall from the proof of Lemma 1 that op′ →c,rf+ op implies op′.leff < op.leff. Moreover, if Lemma 3 were false, then u-op1.leff < u-op3.leff for every edge u-op1 →s u-op3 used in the cycle in G′H. So, from the cycle op →c,rf,s+ op we could infer op.leff < op.leff, a contradiction.

Each read cluster has exactly one read operation that reads from the authoritative object itself. For a read cluster C, we call this the "representative" read operation C.rep.

Lemma 4. If there is an edge sequence u-op′ →c,rf+ r-op →v u-op in GH where u-op.leff ≤ u-op′.leff, then r-op′.leff ≤ u-op′.leff, where C is the read cluster containing r-op and r-op′ = C.rep.

Proof. Assume for a contradiction that u-op′.leff < r-op′.leff. Then, we have u-op.leff ≤ u-op′.leff < r-op′.leff. Since r-op′ read from the object itself and u-op.leff < r-op′.leff, r-op′ must have read the value written by u-op (or possibly a later value) and so r-op must have, as well. That is, u-op →rf+ r-op, contradicting r-op →v u-op.

Lemmas 1–4 show that for GH to have a cycle, a necessary condition is an edge sequence of the form u-op′ →c,rf+ r-op →v u-op where r-op is contained in a read cluster C whose representative r-op′ = C.rep is too outdated, i.e., r-op′.leff ≤ u-op′.leff. It is for this reason that WACCO is designed to prevent this possibility. Specifically, each returning response to a read operation r-op carries with it the effective time of the representative r-op′ of the cluster containing r-op and the effective time of the update from which r-op′ and thus r-op are reading, called r-op.lueff. That is, if u-op →rf r-op, then r-op.lueff = u-op.leff. Responses to updates can also carry the effective time back to the requester, so that u-op.lueff = u-op.leff.

Each client c tracks the largest op.lueff for all operations op it has issued, denoted c.after; i.e., c.after = max_op{op.lueff}, where the maximum is taken over all operations issued by c. For each read request r-op, the outbound request r-op carries with it the current value of c.after, called r-op.after. As r-op reaches proxies along its outbound path, the proxies are allowed to pause it, as usual. When a response arrives at the proxy, the response carries with it the effective time of the read operation r-op′ that reached the authoritative object to elicit this response. The proxy will use this read response to answer a paused read operation r-op only if r-op′.leff > r-op.after; in this case, r-op is added to the cluster for which r-op′ serves as the representative, and so r-op.lueff is assigned to be r-op′.lueff, which is available in the incoming read response. Any reads r-op that were not answered by this read response (i.e., because r-op′.leff ≤ r-op.after) must still be addressed, and now no response is expected inbound. Therefore, the proxy chooses any remaining r-op to forward along to elicit another response.
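The following sketch (our illustration; the proxy state and message fields are simplified, hypothetical names) captures this freshness check at a proxy when a read response arrives:

    # Sketch of a proxy handling an inbound read response. Each paused read
    # carries `after` (the issuing client's c.after); the response carries
    # the representative's effective time (rep_leff) and the effective time
    # of the update it read from (lueff). Only sufficiently fresh responses
    # may answer a paused read (rep_leff > read["after"]).

    def handle_read_response(paused_reads, rep_leff, lueff, value):
        answered, still_paused = [], []
        for read in paused_reads:
            if rep_leff > read["after"]:
                # Join the representative's read cluster; inherit its lueff.
                read["lueff"] = lueff
                read["value"] = value
                answered.append(read)
            else:
                still_paused.append(read)
        # One remaining read (if any) is forwarded to elicit another response.
        forward = still_paused.pop() if still_paused else None
        return answered, still_paused, forward

    paused = [{"after": 5}, {"after": 9}]
    ans, still, fwd = handle_read_response(paused, rep_leff=7, lueff=6, value="v")
    # The read with after=5 is answered (7 > 5); the read with after=9 is
    # forwarded to elicit a fresher response.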


Lemma 5. There is no edge sequence u-op′ →c,rf+ r-op →v u-op in GH such that u-op.leff ≤ u-op′.leff.

Proof. By Lemma 4, the existence of an edge sequence u-op′ →c,rf+ r-op →v u-op in GH such that u-op.leff ≤ u-op′.leff implies that r-op′.leff ≤ u-op′.leff, where C is the read cluster containing r-op and r-op′ = C.rep. By construction, r-op can be answered by a read response only if the effective time of the read operation r-op′ that reached the authoritative object to elicit this response satisfies r-op′.leff > r-op.after. So, to prove the lemma, it suffices to show that u-op′.leff ≤ r-op.after.

Consider the edge sequence u-op′ →c,rf+ r-op →v u-op and let u-op′′ be the update operation on this sequence that precedes and is closest to r-op; i.e., there is no update operation between u-op′′ and r-op along this edge sequence. Either u-op′′ = u-op′ and so u-op′.leff = u-op′′.leff, or u-op′ →c,rf+ u-op′′ and so u-op′.leff < u-op′′.leff. It thus suffices to prove that u-op′′.leff ≤ r-op.after. If the chain u-op′′ →c,rf+ r-op includes no →rf edges, then the client issuing r-op is the same as the client issuing u-op′′, and so u-op′′.leff = u-op′′.lueff ≤ r-op.after because r-op.after is constructed as the maximum op.lueff for all operations op that this client has issued so far (including u-op′′ itself). If the chain u-op′′ →c,rf+ r-op includes one →rf edge, it must be the first edge. That is, we have u-op′′ →rf r-op′′ →c+ r-op →v u-op. Then, the client issuing r-op also issued r-op′′, and so u-op′′.leff = r-op′′.lueff ≤ r-op.after, again due to the construction of r-op.after.

Lemma 6. GH is acyclic.

Proof. Assume for a contradiction that there is a cycle in GH. Lemma 3 shows that the cycle contains a sequence u-op1 →c,rf+ r-op2 →v u-op3 such that u-op3.leff ≤ u-op1.leff. But Lemma 5 shows that this cannot happen, giving a contradiction.

Corollary 1. The protocol of Sec. 4.1 is sequentially consistent.

Proof. Consider any topological sort of GH. Due to the →c edges, it satisfies Local-Order. Moreover, every read and update operation appears in this serialization after the update producing the object state to which it is applied (due to →rf edges) and before any subsequent update (due to →v edges). Consequently, Legality is satisfied.

Lemma 7. The protocol of Sec. 4.1 satisfies Cluster-Order.

Proof. Consider two clusters C1, C2 ⊆ H|obj as defined above, such that C1 ≺H C2. Therefore, C1.rep was applied to the authoritative object before C2.rep (in real time), and so C1.ver ≤ C2.ver.
