A Scalable Distributed Information Management System

Praveen Yalagandula and Mike Dahlin
Department of Computer Sciences
The University of Texas at Austin

(This is an extended version of the SIGCOMM 2004 paper. Please cite the original SIGCOMM paper.)

    Abstract

We present a Scalable Distributed Information Management System (SDIMS) that aggregates information about large-scale networked systems and that can serve as a basic building block for a broad range of large-scale distributed applications by providing detailed views of nearby information and summary views of global information. To serve as a basic building block, a SDIMS should have four properties: scalability to many nodes and attributes, flexibility to accommodate a broad range of applications, administrative isolation for security and availability, and robustness to node and network failures. We design, implement and evaluate a SDIMS that (1) leverages Distributed Hash Tables (DHT) to create scalable aggregation trees, (2) provides flexibility through a simple API that lets applications control propagation of reads and writes, (3) provides administrative isolation through simple extensions to current DHT algorithms, and (4) achieves robustness to node and network reconfigurations through lazy reaggregation, on-demand reaggregation, and tunable spatial replication. Through extensive simulations and micro-benchmark experiments, we observe that our system is an order of magnitude more scalable than existing approaches, achieves isolation properties at the cost of modestly increased read latency in comparison to flat DHTs, and gracefully handles failures.

    1 Introduction

The goal of this research is to design and build a Scalable Distributed Information Management System (SDIMS) that aggregates information about large-scale networked systems and that can serve as a basic building block for a broad range of large-scale distributed applications. Monitoring, querying, and reacting to changes in the state of a distributed system are core components of applications such as system management [15, 31, 37, 42], service placement [14, 43], data sharing and caching [18, 29, 32, 35, 46], sensor monitoring and control [20, 21], multicast tree formation [8, 9, 33, 36, 38], and naming and request routing [10, 11]. We therefore speculate that a SDIMS in a networked system would provide a “distributed operating systems backbone” and facilitate the development and deployment of new distributed services.

For a large scale information system, hierarchical aggregation is a fundamental abstraction for scalability. Rather than expose all information to all nodes, hierarchical aggregation allows a node to access detailed views of nearby information and summary views of global information. In a SDIMS based on hierarchical aggregation, different nodes can therefore receive different answers to the query “find a [nearby] node with at least 1 GB of free memory” or “find a [nearby] copy of file foo.” A hierarchical system that aggregates information through reduction trees [21, 38] allows nodes to access information they care about while maintaining system scalability.

To be used as a basic building block, a SDIMS should have four properties. First, the system should be scalable: it should accommodate large numbers of participating nodes, and it should allow applications to install and monitor large numbers of data attributes. Enterprise and global scale systems today might have tens of thousands to millions of nodes and these numbers will increase over time. Similarly, we hope to support many applications, and each application may track several attributes (e.g., the load and free memory of a system’s machines) or millions of attributes (e.g., which files are stored on which machines).

Second, the system should have flexibility to accommodate a broad range of applications and attributes. For example, read-dominated attributes like numCPUs rarely change in value, while write-dominated attributes like numProcesses change quite often. An approach tuned for read-dominated attributes will consume high bandwidth when applied to write-dominated attributes. Conversely, an approach tuned for write-dominated attributes will suffer from unnecessary query latency or imprecision for read-dominated attributes. Therefore, a SDIMS should provide mechanisms to handle different types of attributes and leave the policy decision of tuning replication to the applications.

Third, a SDIMS should provide administrative isolation. In a large system, it is natural to arrange nodes in an organizational or an administrative hierarchy (e.g., Figure 1).


Figure 1: Administrative hierarchy. The example shows ROOT at the top, domains edu and com, universities univ1–univ4, departments cs, ee, and math, and leaf machines pc1–pc6.

A SDIMS should support administrative isolation in which queries about an administrative domain’s information can be satisfied within the domain so that the system can operate during disconnections from other domains, so that an external observer cannot monitor or affect intra-domain queries, and to support domain-scoped queries efficiently.

Fourth, the system must be robust to node failures and disconnections. A SDIMS should adapt to reconfigurations in a timely fashion and should also provide mechanisms so that applications can trade off the cost of adaptation with the consistency level in the aggregated results when reconfigurations occur.

We draw inspiration from two previous works: Astrolabe [38] and Distributed Hash Tables (DHTs).

Astrolabe [38] is a robust information management system. Astrolabe provides the abstraction of a single logical aggregation tree that mirrors a system’s administrative hierarchy. It provides a general interface for installing new aggregation functions and provides eventual consistency on its data. Astrolabe is robust due to its use of an unstructured gossip protocol for disseminating information and its strategy of replicating all aggregated attribute values for a subtree to all nodes in the subtree. This combination allows any communication pattern to yield eventual consistency and allows any node to answer any query using local information. This high degree of replication, however, may limit the system’s ability to accommodate large numbers of attributes. Also, although the approach works well for read-dominated attributes, an update at one node can eventually affect the state at all nodes, which may limit the system’s flexibility to support write-dominated attributes.

Recent research in peer-to-peer structured networks resulted in Distributed Hash Tables (DHTs) [18, 28, 29, 32, 35, 46]—a data structure that scales with the number of nodes and that distributes the read-write load for different queries among the participating nodes. It is interesting to note that although these systems export a global hash table abstraction, many of them internally make use of what can be viewed as a scalable system of aggregation trees to, for example, route a request for a given key to the

right DHT node. Indeed, rather than export a general DHT interface, Plaxton et al.’s [28] original application makes use of hierarchical aggregation to allow nodes to locate nearby copies of objects. It seems appealing to develop a SDIMS abstraction that exposes this internal functionality in a general way so that scalable trees for aggregation can be a basic system building block alongside the DHTs.

At first glance, it might appear obvious that simply fusing DHTs with Astrolabe’s aggregation abstraction will result in a SDIMS. However, meeting the SDIMS requirements forces a design to address four questions: (1) How to scalably map different attributes to different aggregation trees in a DHT mesh? (2) How to provide flexibility in the aggregation to accommodate different application requirements? (3) How to adapt a global, flat DHT mesh to attain the administrative isolation property? and (4) How to provide robustness without unstructured gossip and total replication?

The key contributions of this paper that form the foundation of our SDIMS design are as follows.

1. We define a new aggregation abstraction that specifies both attribute type and attribute name and that associates an aggregation function with a particular attribute type. This abstraction paves the way for utilizing the DHT system’s internal trees for aggregation and for achieving scalability with both nodes and attributes.

2. We provide a flexible API that lets applications control the propagation of reads and writes and thus trade off update cost, read latency, replication, and staleness.

3. We augment an existing DHT algorithm to ensure path convergence and path locality properties in order to achieve administrative isolation.

4. We provide robustness to node and network reconfigurations by (a) providing temporal replication through lazy reaggregation that guarantees eventual consistency and (b) ensuring that our flexible API allows demanding applications to gain additional robustness by using tunable spatial replication of data aggregates, by performing fast on-demand reaggregation to augment the underlying lazy reaggregation, or by doing both.

We have built a prototype of SDIMS. Through simulations and micro-benchmark experiments on a number of department machines and PlanetLab [27] nodes, we observe that the prototype achieves scalability with respect to both nodes and attributes through use of its flexible API, inflicts an order of magnitude lower maximum node stress than unstructured gossiping schemes, achieves isolation properties at a cost of modestly increased read latency compared to flat DHTs, and gracefully handles node failures.

This initial study discusses key aspects of an ongoing system building effort, but it does not address all issues in building a SDIMS. For example, we believe that our strategies for providing robustness will mesh well with techniques such as supernodes [22] and other ongoing efforts to improve DHTs [30] for further improving robustness. Also, although splitting aggregation among many trees improves scalability for simple queries, this approach may make complex and multi-attribute queries more expensive compared to a single tree. Additional work is needed to understand the significance of this limitation for real workloads and, if necessary, to adapt query planning techniques from DHT abstractions [16, 19] to scalable aggregation tree abstractions.

In Section 2, we explain the hierarchical aggregation abstraction that SDIMS provides to applications. In Sections 3 and 4, we describe the design of our system for achieving the flexibility, scalability, and administrative isolation requirements of a SDIMS. In Section 5, we detail the implementation of our prototype system. Section 6 addresses the issue of adaptation to topological reconfigurations. In Section 7, we present the evaluation of our system through large-scale simulations and microbenchmarks on real networks. Section 8 details the related work, and Section 9 summarizes our contribution.

    2 Aggregation Abstraction

Aggregation is a natural abstraction for a large-scale distributed information system because aggregation provides scalability by allowing a node to view detailed information about the state near it and progressively coarser-grained summaries about progressively larger subsets of a system’s data [38].

Our aggregation abstraction is defined across a tree spanning all nodes in the system. Each physical node in the system is a leaf and each subtree represents a logical group of nodes. Note that logical groups can correspond to administrative domains (e.g., department or university) or groups of nodes within a domain (e.g., 10 workstations on a LAN in a CS department). An internal non-leaf node, which we call a virtual node, is simulated by one or more physical nodes at the leaves of the subtree for which the virtual node is the root. We describe how to form such trees in a later section.

Each physical node has local data stored as a set of (attributeType, attributeName, value) tuples such as (configuration, numCPUs, 16), (mcast membership, session foo, yes), or (file stored, foo, myIPaddress). The system associates an aggregation function f_type with each attribute type, and for each level-i subtree T_i in the system, the system defines an aggregate value V_{i,type,name} for each (attributeType, attributeName) pair as follows. For a (physical) leaf node T_0 at level 0, V_{0,type,name} is the locally stored value for the attribute type and name or NULL if no matching tuple exists. Then the aggregate value for a level-i subtree T_i is the aggregation function for the type, f_type, computed across the aggregate values of each of T_i's k children: V_{i,type,name} = f_type(V^0_{i-1,type,name}, V^1_{i-1,type,name}, ..., V^{k-1}_{i-1,type,name}).

Although SDIMS allows arbitrary aggregation functions, it is often desirable that these functions satisfy the hierarchical computation property [21]: f(v_1, ..., v_n) = f(f(v_1, ..., v_{s_1}), f(v_{s_1+1}, ..., v_{s_2}), ..., f(v_{s_k+1}, ..., v_n)), where v_i is the value of an attribute at node i. For example, the average operation, defined as avg(v_1, ..., v_n) = (1/n) \sum_{i=1}^{n} v_i, does not satisfy the property. Instead, if an attribute stores values as tuples (sum, count), the attribute satisfies the hierarchical computation property while still allowing the applications to compute the average from the aggregate sum and count values.
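To make the (sum, count) idea concrete, here is a minimal sketch in Java (the prototype's implementation language) of an average aggregate that satisfies the hierarchical computation property. The SumCount class and its method names are ours, for illustration only; they are not part of the SDIMS prototype.

    // Sketch: an average aggregate that satisfies the hierarchical
    // computation property by carrying (sum, count) pairs up the tree.
    // Illustrative only; class and method names are not from the prototype.
    import java.util.List;

    final class SumCount {
        final double sum;
        final long count;

        SumCount(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        // Leaf value: a single observation contributes (v, 1).
        static SumCount leaf(double v) {
            return new SumCount(v, 1);
        }

        // Aggregation function applied at each virtual node: summing the
        // children's (sum, count) pairs commutes with regrouping, so the
        // result is the same no matter how the tree partitions the leaves.
        static SumCount aggregate(List<SumCount> children) {
            double s = 0;
            long c = 0;
            for (SumCount child : children) {
                s += child.sum;
                c += child.count;
            }
            return new SumCount(s, c);
        }

        // The application derives the average from the aggregate at any level.
        double average() {
            return count == 0 ? Double.NaN : sum / count;
        }
    }

Because addition of sums and counts is associative and commutative, combining the children's pairs in any grouping yields the same aggregate, which is exactly the hierarchical computation property.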

Finally, note that for a large-scale system, it is difficult or impossible to insist that the aggregation value returned by a probe corresponds to the function computed over the current values at the leaves at the instant of the probe. Therefore our system provides only weak consistency guarantees – specifically eventual consistency as defined in [38].

    3 Flexibility

A major innovation of our work is enabling flexible aggregate computation and propagation. The definition of the aggregation abstraction allows considerable flexibility in how, when, and where aggregate values are computed and propagated. While previous systems [15, 29, 38, 32, 35, 46] implement a single static strategy, we argue that a SDIMS should provide flexible computation and propagation to efficiently support a wide variety of applications with diverse requirements. In order to provide this flexibility, we develop a simple interface that decomposes the aggregation abstraction into three pieces of functionality: install, update, and probe.

This definition of the aggregation abstraction allows our system to provide a continuous spectrum of strategies ranging from lazy aggregate computation and propagation on reads to aggressive immediate computation and propagation on writes. In Figure 2, we illustrate both extreme strategies and an intermediate strategy. Under the lazy Update-Local computation and propagation strategy, an update (or write) only affects local state. Then, a probe (or read) that reads a level-i aggregate value is sent up the tree to the issuing node’s level-i ancestor and then down the tree to the leaves. The system then computes the desired aggregate value at each layer up the tree until the level-i ancestor that holds the desired value. Finally, the level-i ancestor sends the result down the tree to the issuing node. In the other extreme case of the aggressive Update-All immediate computation and propagation on writes [38], when an update occurs, changes are aggregated up the tree, and each new aggregate value is flooded to all of a node’s descendants. In this case, each level-i node not only maintains the aggregate values for the level-i subtree but also receives and locally stores copies of all of its ancestors’ level-j (j > i) aggregation values. Also, a leaf satisfies a probe for a level-i aggregate using purely local data. In an intermediate Update-Up strategy, the root of each subtree maintains the subtree’s current aggregate value, and when an update occurs, the leaf node updates its local state and passes the update to its parent, and then each successive enclosing subtree updates its aggregate value and passes the new value to its parent. This strategy satisfies a leaf’s probe for a level-i aggregate value by sending the probe up to the level-i ancestor of the leaf and then sending the aggregate value down to the leaf. Finally, notice that other strategies exist. For example, an Update-UpRoot-Down1 strategy (not shown) would aggregate updates up to the root of a subtree and send a subtree’s aggregate values to only the children of the root of the subtree. In general, an Update-Upk-Downj strategy aggregates up to the kth level and propagates the aggregate values of a node at level l (s.t. l ≤ k) downward for j levels.

Figure 2: Flexible API. The figure illustrates the behavior of the Update-Local, Update-Up, and Update-All strategies on an update, on a probe for the global aggregate value, and on a probe for a level-1 aggregate value.

A SDIMS must provide a wide range of flexible computation and propagation strategies to applications for it to be a general abstraction. An application should be able to choose a particular mechanism based on its read-to-write ratio that reduces the bandwidth consumption while attaining the required responsiveness and precision. Note that the read-to-write ratio of the attributes that applications install varies extensively. For example, a read-dominated attribute like numCPUs rarely changes in value, while a write-dominated attribute like numProcesses changes quite often. An aggregation strategy like Update-All works well for read-dominated attributes but suffers high bandwidth consumption when applied for write-dominated attributes. Conversely, an approach like Update-Local works well for write-dominated attributes but suffers from unnecessary query latency or imprecision for read-dominated attributes.
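As a rough illustration of this tradeoff, the following back-of-envelope model (ours, not an analysis from the paper) counts approximate messages per operation for a single attribute, assuming an aggregation tree of height h over a probed subtree of n nodes:

    % Back-of-envelope message costs per operation (our illustration, not an
    % analysis from the paper), for a tree of height h and a probed subtree
    % of n nodes, ignoring constant factors:
    \begin{align*}
      \textit{Update-Local:} &\quad \approx 0 \text{ per write}, \quad O(n) \text{ per probe} \\
      \textit{Update-Up:}    &\quad O(h) \text{ per write},      \quad O(h) \text{ per probe} \\
      \textit{Update-All:}   &\quad O(n) \text{ per write},      \quad \approx 0 \text{ per probe}
    \end{align*}

This rough model is consistent with the qualitative trend reported in Section 7, where Update-Local is cheapest at very low read-to-write ratios, Update-Up at intermediate ratios, and Update-All at high ratios.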

parameter   description                                                optional
attrType    Attribute Type
aggrfunc    Aggregation Function
up          How far upward each update is sent (default: all)          X
down        How far downward each aggregate is sent (default: none)    X
domain      Domain restriction (default: none)                         X
expTime     Expiry Time

Table 1: Arguments for the install operation

SDIMS also allows non-uniform computation and propagation across the aggregation tree with different up and down parameters in different subtrees so that applications can adapt to the spatial and temporal heterogeneity of read and write operations. With respect to spatial heterogeneity, access patterns may differ for different parts of the tree, requiring different propagation strategies for different parts of the tree. Similarly, with respect to temporal heterogeneity, access patterns may change over time, requiring different strategies over time.

    3.1 Aggregation API

We provide the flexibility described above by splitting the aggregation API into three functions: Install() installs an aggregation function that defines an operation on an attribute type and specifies the update strategy that the function will use, Update() inserts or modifies a node’s local value for an attribute, and Probe() obtains an aggregate value for a specified subtree. The install interface allows applications to specify the k and j parameters of the Update-Upk-Downj strategy along with the aggregation function. The update interface invokes the aggregation of an attribute on the tree according to the corresponding aggregation function’s aggregation strategy. The probe interface not only allows applications to obtain the aggregated value for a specified tree but also allows a probing node to continuously fetch the values for a specified time, thus enabling an application to adapt to spatial and temporal heterogeneity. The rest of the section describes these three interfaces in detail.
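The sketch below suggests what this three-function interface could look like in Java; the interface and type names (SdimsApi, AggregationFunction, ProbeCallback) and the exact signatures are our illustrative choices, not the prototype's actual API. Tables 1–3 list the authoritative arguments.

    // Illustrative sketch of the three-function aggregation API.
    // Names and types are ours; see Tables 1-3 for the actual arguments.
    import java.util.List;

    interface AggregationFunction {
        // Combine the children's values for one (attrType, attrName) key.
        Object aggregate(List<Object> childValues);
    }

    interface ProbeCallback {
        // Invoked for each (level, aggregateValue) result of a probe.
        void deliver(int level, Object aggregateValue);
    }

    interface SdimsApi {
        // Install an aggregation function for an attribute type with an
        // Update-Up(k)-Down(j) strategy, optionally restricted to a domain.
        void install(String attrType, AggregationFunction func,
                     int up, int down, String domain, long expTime);

        // Insert or modify this node's local value for an attribute.
        void update(String attrType, String attrName, Object value);

        // Obtain aggregate values for an attribute, either one-shot or
        // continuously until expTime, at the requested level.
        void probe(String attrType, String attrName, boolean continuous,
                   int level, int up, int down, long expTime,
                   ProbeCallback callback);
    }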

    3.1.1 Install

The Install operation installs an aggregation function in the system. The arguments for this operation are listed


parameter   description      optional
attrType    Attribute Type
attrName    Attribute Name
val         Value

Table 2: Arguments for the update operation

parameter   description                                                    optional
attrType    Attribute Type
attrName    Attribute Name
mode        Continuous or One-shot (default: one-shot)                     X
level       Level at which aggregate is sought (default: at all levels)    X
up          How far up to go and re-fetch the value (default: none)        X
down        How far down to go and re-aggregate (default: none)            X
expTime     Expiry Time

Table 3: Arguments for the probe operation

in Table 1. The attrType argument denotes the type of attributes on which this aggregation function is invoked. Installed functions are soft state that must be periodically renewed or they will be garbage collected at expTime.

The arguments up and down specify the aggregate computation and propagation strategy Update-Upk-Downj. The domain argument, if present, indicates that the aggregation function should be installed on all nodes in the specified domain; otherwise the function is installed on all nodes in the system.

    3.1.2 Update

The Update operation takes three arguments, attrType, attrName, and value, and creates a new (attrType, attrName, value) tuple or updates the value of an old tuple with matching attrType and attrName at a leaf node.

The update interface meshes with the installed aggregate computation and propagation strategy to provide flexibility. In particular, as outlined above and described in detail in Section 5, after a leaf applies an update locally, the update may trigger re-computation of aggregate values up the tree and may also trigger propagation of changed aggregate values down the tree. Notice that our abstraction associates an aggregation function with only an attrType but lets updates specify an attrName along with the attrType. This technique helps achieve scalability with respect to nodes and attributes as described in Section 4.

    3.1.3 Probe

The Probe operation returns the value of an attribute to an application. The complete argument set for the probe operation is shown in Table 3. Along with the attrName and the attrType arguments, a level argument specifies the level at which the answers are required for an attribute. In our implementation we choose to return results at all levels k ≤ l for a level-l probe because (i) it is inexpensive, as the nodes traversed for a level-l probe also contain level-k aggregates for k < l and as we expect the network cost of transmitting the additional information to be small for the small aggregates on which we focus, and (ii) it is useful, as applications can efficiently get several aggregates with a single probe (e.g., for domain-scoped queries as explained in Section 4.2).

Probes with mode set to continuous and with finite expTime enable applications to handle spatial and temporal heterogeneity. When node A issues a continuous probe at level l for an attribute, then regardless of the up and down parameters, updates for the attribute at any node in A’s level-l ancestor’s subtree are aggregated up to level l and the aggregated value is propagated down along the path from the ancestor to A. Note that continuous mode enables SDIMS to support a distributed sensor-actuator mechanism where a sensor monitors a level-i aggregate with a continuous mode probe and triggers an actuator upon receiving new values for the probe.

The up and down arguments enable applications to perform on-demand fast re-aggregation during reconfigurations, where a forced re-aggregation is done for the corresponding levels even if the aggregated value is available, as we discuss in Section 6. When present, the up and down arguments are interpreted as described in the install operation.
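As a hypothetical usage sketch building on the illustrative SdimsApi above, a file-location service (in the spirit of the FILELOC example of Section 4.1) might install, update, and probe as follows; all constants and helper names here are ours, not the prototype's.

    // Hypothetical usage of the illustrative SdimsApi sketched in Section 3.1.
    // A file-location service installs an Update-Up function, registers a
    // locally stored file, and later probes for a nearby copy.
    final class FileLocationExample {
        static void run(SdimsApi sdims, String localNodeId) {
            // Install: one function covers every file name under the FILELOC type.
            AggregationFunction pickAnyHolder = childValues ->
                    childValues.stream().filter(v -> v != null).findFirst().orElse(null);
            sdims.install("FILELOC", pickAnyHolder,
                          /* up      */ Integer.MAX_VALUE,  // aggregate all the way up
                          /* down    */ 0,                  // do not flood aggregates down
                          /* domain  */ null,               // install system-wide
                          /* expTime */ 3_600_000L);        // renewed periodically

            // Update: advertise that this node stores file "foo".
            sdims.update("FILELOC", "foo", localNodeId);

            // Probe: one-shot query up to level 2; the callback receives one
            // (level, value) pair per level, so a nearby holder can be preferred.
            sdims.probe("FILELOC", "foo", /* continuous */ false, /* level */ 2,
                        /* up */ 0, /* down */ 0, /* expTime */ 0L,
                        (level, value) -> System.out.println(
                                "level " + level + " holder: " + value));
        }
    }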

    3.1.4 Dynamic Adaptation

At the API level, the up and down arguments in the install API can be regarded as hints, since they suggest a computation strategy but do not affect the semantics of an aggregation function. A SDIMS implementation can dynamically adjust its up/down strategies for an attribute based on its measured read/write frequency. But a virtual intermediate node needs to know the current up and down propagation values to decide if the local aggregate is fresh in order to answer a probe. This is the key reason why up and down need to be statically defined at install time and cannot be specified in the update operation. For dynamic adaptation, we implement a lease-based mechanism where a node issues a lease to a parent or a child denoting that it will keep propagating the updates to that parent or child. We are currently evaluating different policies to decide when to issue a lease and when to revoke a lease.

    4 Scalability

Our design achieves scalability with respect to both nodes and attributes through two key ideas. First, it carefully defines the aggregation abstraction to mesh well with its underlying scalable DHT system. Second, it refines the basic


DHT abstraction to form an Autonomous DHT (ADHT) to achieve the administrative isolation properties that are crucial to scaling for large real-world systems. In this section, we describe these two ideas in detail.

    4.1 Leveraging DHTs

In contrast to previous systems [4, 15, 38, 39, 45], SDIMS’s aggregation abstraction specifies both an attribute type and attribute name and associates an aggregation function with a type rather than just specifying and associating a function with a name. Installing a single function that can operate on many different named attributes matching a type improves scalability for “sparse attribute types” with large, sparsely-filled name spaces. For example, to construct a file location service, our interface allows us to install a single function that computes an aggregate value for any named file. A subtree’s aggregate value for (FILELOC, name) would be the ID of a node in the subtree that stores the named file. Conversely, Astrolabe copes with sparse attributes by having aggregation functions compute sets or lists and suggests that scalability can be improved by representing such sets with Bloom filters [6]. Supporting sparse names within a type provides at least two advantages. First, when the value associated with a name is updated, only the state associated with that name needs to be updated and propagated to other nodes. Second, splitting values associated with different names into different aggregation values allows our system to leverage Distributed Hash Tables (DHTs) to map different names to different trees and thereby spread the function’s logical root node’s load and state across multiple physical nodes.

Given this abstraction, scalably mapping attributes to DHTs is straightforward. DHT systems assign a long, random ID to each node and define an algorithm to route a request for key k to a node rootk such that the union of paths from all nodes forms a tree DHTtreek rooted at the node rootk. Now, as illustrated in Figure 3, by aggregating an attribute along the aggregation tree corresponding to DHTtreek for k = hash(attribute type, attribute name), different attributes will be aggregated along different trees.
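A minimal sketch of this mapping, assuming a hypothetical DHT layer with a route-toward-root primitive (the Dht interface, its method name, and the use of SHA-1 as the identifier hash are our assumptions):

    // Sketch: each (attribute type, attribute name) pair is hashed to a DHT
    // key, so different attributes are aggregated along different DHT trees.
    // The Dht interface and its routeTowardsRoot method are hypothetical.
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    interface Dht {
        // Forward a message one hop along the tree DHTtree_key toward root_key.
        void routeTowardsRoot(byte[] key, byte[] message);
    }

    final class AttributeKeys {
        // k = hash(attributeType, attributeName); SHA-1 stands in for the
        // DHT's identifier hash.
        static byte[] treeKey(String attrType, String attrName) {
            try {
                MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
                sha1.update(attrType.getBytes(StandardCharsets.UTF_8));
                sha1.update((byte) 0);  // separator between type and name
                sha1.update(attrName.getBytes(StandardCharsets.UTF_8));
                return sha1.digest();
            } catch (NoSuchAlgorithmException e) {
                throw new AssertionError("SHA-1 is always available", e);
            }
        }

        // An update for (attrType, attrName) is aggregated along the tree
        // rooted at the node responsible for this key.
        static void sendUpdate(Dht dht, String attrType, String attrName,
                               byte[] encodedUpdate) {
            dht.routeTowardsRoot(treeKey(attrType, attrName), encodedUpdate);
        }
    }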

In comparison to a scheme where all attributes are aggregated along a single tree, aggregating along multiple trees incurs lower maximum node stress: whereas in a single aggregation tree approach, the root and the intermediate nodes pass around more messages than leaf nodes, in a DHT-based multi-tree, each node acts as an intermediate aggregation point for some attributes and as a leaf node for other attributes. Hence, this approach distributes the onus of aggregation across all nodes.

Figure 3: The DHT tree corresponding to key 111 (DHTtree111) and the corresponding aggregation tree, for an eight-node system with 3-bit node IDs (000–111) and tree levels L0–L3.

    4.2 Administrative Isolation

Aggregation trees should provide administrative isolation by ensuring that for each domain, the virtual node at the root of the smallest aggregation subtree containing all nodes of that domain is hosted by a node in that domain. Administrative isolation is important for three reasons: (i) for security – so that updates and probes flowing in a domain are not accessible outside the domain, (ii) for availability – so that queries for values in a domain are not affected by failures of nodes in other domains, and (iii) for efficiency – so that domain-scoped queries can be simple and efficient.

To provide administrative isolation to aggregation trees, a DHT should satisfy two properties:

1. Path Locality: Search paths should always be contained in the smallest possible domain.

2. Path Convergence: Search paths for a key from different nodes in a domain should converge at a node in that domain.

Existing DHTs support path locality [18] or can easily support it by using the domain nearness as the distance metric [7, 17], but they do not guarantee path convergence as those systems try to optimize the search path to the root to reduce response latency. For example, Pastry [32] uses prefix routing in which each node’s routing table contains one row per hexadecimal digit in the nodeId space where the ith row contains a list of nodes whose nodeIds differ from the current node’s nodeId in the ith digit with one entry for each possible digit value. Notice that for a given row and entry (viz. digit and value) a node n can choose the entry from many different alternative destination nodes, especially for small i where a destination node needs to match n’s ID in only a few digits to be a candidate for inclusion in n’s routing table. A common policy is to choose a nearby node according to a proximity metric [28] to minimize the network distance for routing a key. Under this policy, the nodes in a routing table sharing a short prefix will tend to be nearby since there are many such nodes spread roughly evenly throughout the system due to random nodeId assignment. Pastry is self-organizing—nodes come and go at will. To maintain Pastry’s locality properties, a new node must join with one that is nearby according to the proximity metric. Pastry provides a seed


discovery protocol that finds such a node given an arbitrary starting point.

Given a routing topology, to route a packet to an arbitrary destination key, a node in Pastry forwards a packet to the node with a nodeId prefix matching the key in at least one more digit than the current node. If such a node is not known, the current node uses an additional data structure, the leaf set containing L immediate higher and lower neighbors in the nodeId space, and forwards the packet to a node with an identical prefix but that is numerically closer to the destination key in the nodeId space. This process continues until the destination node appears in the leaf set, after which the message is routed directly. Pastry’s expected number of routing steps is log n, where n is the number of nodes, but as Figure 4 illustrates, this algorithm does not guarantee path convergence: if two nodes in a domain have nodeIds that match a key in the same number of bits, both of them can route to a third node outside the domain when routing for that key.

Simple modifications to Pastry’s route table construction and key-routing protocols yield an Autonomous DHT (ADHT) that satisfies the path locality and path convergence properties. As Figure 5 illustrates, whenever two nodes in a domain share the same prefix with respect to a key and no other node in the domain has a longer prefix, our algorithm introduces a virtual node at the boundary of the domain corresponding to that prefix plus the next digit of the key; such a virtual node is simulated by the existing node whose id is numerically closest to the virtual node’s id. Our ADHT’s routing table differs from Pastry’s in two ways. First, each node maintains a separate leaf set for each domain of which it is a part. Second, nodes use two proximity metrics when populating the routing tables – hierarchical domain proximity is the primary metric and network distance is secondary. Then, to route a packet to a global root for a key, the ADHT routing algorithm uses the routing table and the leaf set entries to route to each successive enclosing domain’s root (the virtual or real node in the domain matching the key in the maximum number of digits).

The routing algorithm we use for routing a key at a node with id nodeId is shown in Algorithm 1. By routing at the lowest possible domain till the root of that domain is reached, we ensure that the routing paths conform to the Path Convergence property. The routing algorithm guarantees that as long as the leafset membership is correct, the Path Convergence property is satisfied. We use the Pastry leaf set maintenance algorithm for maintaining leafsets at all levels; after reconfigurations, once the repairs are done on a particular domain level’s leafset, the autonomy properties are met at that domain level. Note that the modifications proposed for Pastry still preserve the fault-tolerance properties of the original algorithm; rather, they enhance the fault-tolerance of the algorithm but at the cost of extra maintenance overhead.

Figure 4: Example showing how the isolation property is violated with original Pastry, for a university domain univ with departments dep1 and dep2, nodes 010XX, 011XX, 100XX, 101XX, and 110XX, and key = 111XX. The corresponding aggregation tree (levels L0–L2) is also shown.

Figure 5: Autonomous DHT satisfying the isolation property for the same example; the corresponding aggregation tree is also shown.

Algorithm 1 ADHTroute(key)
1: flipNeigh ← checkRoutingTable(key)
2: l ← numDomainLevels − 1
3: while (l ≥ 0) do
4:   if (commLevels(flipNeigh, nodeId) ≥ l) then
5:     send the key to flipNeigh; return
6:   else
7:     leafNeigh ← an entry in leafset[l] closer to key than nodeId
8:     if (leafNeigh != null) then
9:       send the key to leafNeigh; return
10:    end if
11:  end if
12:  l ← l − 1
13: end while
14: this node is the root for this key
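For concreteness, here is a rough Java transcription of Algorithm 1; the abstract helpers stand in for the routing-table, leaf-set, and forwarding machinery and are placeholders of ours, not the prototype's interfaces.

    // Rough Java transcription of Algorithm 1 (ADHTroute). The abstract
    // helpers below are hypothetical placeholders standing in for the
    // prototype's routing-table, leaf-set, and forwarding machinery.
    abstract class AdhtRouter {
        protected final byte[] nodeId;
        protected final int numDomainLevels;

        AdhtRouter(byte[] nodeId, int numDomainLevels) {
            this.nodeId = nodeId;
            this.numDomainLevels = numDomainLevels;
        }

        void route(byte[] key) {
            // Candidate from the routing table that corrects the next digit.
            byte[] flipNeigh = checkRoutingTable(key);

            // Try the smallest (most specific) domain level first, then widen.
            for (int l = numDomainLevels - 1; l >= 0; l--) {
                if (flipNeigh != null && commLevels(flipNeigh, nodeId) >= l) {
                    forward(flipNeigh, key);      // stays inside the level-l domain
                    return;
                }
                // Otherwise fall back to the level-l leaf set.
                byte[] leafNeigh = closerLeafsetEntry(l, key);
                if (leafNeigh != null) {
                    forward(leafNeigh, key);
                    return;
                }
            }
            // No closer node at any level: this node is the root for the key.
            becomeRootFor(key);
        }

        // Placeholders for machinery described in Sections 4.2 and 5.
        protected abstract byte[] checkRoutingTable(byte[] key);
        protected abstract int commLevels(byte[] a, byte[] b);
        protected abstract byte[] closerLeafsetEntry(int level, byte[] key);
        protected abstract void forward(byte[] nextHop, byte[] key);
        protected abstract void becomeRootFor(byte[] key);
    }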

Properties. Maintaining a different leaf set for each administrative hierarchy level increases the number of neighbors that each node tracks to (2^b − 1) × lg_{2^b} n + c × l from (2^b − 1) × lg_{2^b} n + c in unmodified Pastry, where b is the number of bits in a digit, n is the number of nodes, c is the leaf set size, and l is the number of domain levels. Routing requires O(lg_{2^b} n + l) steps compared to O(lg_{2^b} n) steps in Pastry; also, each routing hop may be longer than in Pastry because the modified algorithm’s routing table prefers same-domain nodes over nearby nodes. We experimentally quantify the additional routing costs in Section 7.

In a large system, the ADHT topology allows domains to improve security for sensitive attribute types by installing them only within a specified domain.


Figure 6: Example for domain-scoped queries. Three machines A1, A2, and B1 each update NumMachines with per-domain (Domain, Count) tuples; for example, A1’s leaf value ((A1.A.,1),(A.,1),(.,1)) is aggregated to ((A1.A.,1),(A.,2),(.,2)) at level 1 and to ((B1.B.,1),(B.,1),(.,3)) at the level-2 root.

Then, aggregation occurs entirely within the domain and a node external to the domain can neither observe nor affect the updates and aggregation computations of the attribute type. Furthermore, though we have not implemented this feature in the prototype, the ADHT topology would also support domain-restricted probes that could ensure that no one outside of a domain can observe a probe for data stored within the domain.

The ADHT topology also enhances availability by allowing the common case of probes for data within a domain to depend only on a domain’s nodes. This, for example, allows a domain that becomes disconnected from the rest of the Internet to continue to answer queries for local data.

Aggregation trees that provide administrative isolation also enable the definition of simple and efficient domain-scoped aggregation functions to support queries like “what is the average load on machines in domain X?” For example, consider an aggregation function to count the number of machines in an example system with three machines illustrated in Figure 6. Each leaf node l updates attribute NumMachines with a value vl containing a set of tuples of form (Domain, Count) for each domain of which the node is a part. In the example, the node A1 with name A1.A. performs an update with the value ((A1.A.,1),(A.,1),(.,1)). An aggregation function at an internal virtual node hosted on node N with child set C computes the aggregate as a set of tuples: for each domain D that N is part of, form a tuple (D, Σ_{c ∈ C} count such that (D, count) ∈ v_c). This computation is illustrated in Figure 6. Now a query for NumMachines with level set to MAX will return the aggregate values at each intermediate virtual node on the path to the root as a set of tuples (tree level, aggregated value) from which it is easy to extract the count of machines at each enclosing domain. For example, A1 would receive ((2, ((B1.B.,1),(B.,1),(.,3))), (1, ((A1.A.,1),(A.,2),(.,2))), (0, ((A1.A.,1),(A.,1),(.,1)))). Note that supporting domain-scoped queries would be less convenient and less efficient if aggregation trees did not conform to the system’s administrative structure. It would be less efficient because each intermediate virtual node would have to maintain a list of all values at the leaves in its subtree along with their names, and it would be less convenient because applications that need an aggregate for a domain would have to pick values of nodes in that domain from the list returned by a probe and perform the computation themselves.
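A sketch of such a domain-scoped count function, written against the illustrative AggregationFunction interface from Section 3.1 (the map-based representation is ours; unlike the function described above, this simplified version keeps every domain reported by any child rather than only the domains of the hosting node, but the counts for the shared domains are the same):

    // Sketch: domain-scoped machine count. Each value is a map from a domain
    // name (e.g. "A1.A.", "A.", ".") to the number of machines below it.
    // Built on the illustrative AggregationFunction interface; the map-based
    // representation is ours, not the prototype's.
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    final class DomainCountFunction implements AggregationFunction {
        @Override
        public Object aggregate(List<Object> childValues) {
            // For each domain D, sum the counts reported by children that
            // carry a (D, count) tuple. (The paper's function additionally
            // restricts the result to the hosting node's own domains.)
            Map<String, Integer> result = new HashMap<>();
            for (Object child : childValues) {
                if (child == null) {
                    continue;  // child has no value for this attribute
                }
                @SuppressWarnings("unchecked")
                Map<String, Integer> tuples = (Map<String, Integer>) child;
                for (Map.Entry<String, Integer> e : tuples.entrySet()) {
                    result.merge(e.getKey(), e.getValue(), Integer::sum);
                }
            }
            return result;
        }

        // Leaf value for a node with hierarchical name such as "A1.A.":
        // one (domain, 1) tuple per enclosing domain, e.g.
        // {"A1.A." -> 1, "A." -> 1, "." -> 1}.
        static Map<String, Integer> leafValue(List<String> enclosingDomains) {
            Map<String, Integer> v = new HashMap<>();
            for (String d : enclosingDomains) {
                v.put(d, 1);
            }
            return v;
        }
    }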

Figure 7: Example illustrating the data structures and their organization at a node with ID 1001XXX: a local MIB at level 0; child MIBs, reduction MIBs, and ancestor MIBs at the virtual nodes it simulates for levels 1–3 (prefixes 100X.., 10X.., and 1X..); and the aggregation functions applied between levels, with values flowing to and from parents.

    5 Prototype Implementation

The internal design of our SDIMS prototype comprises two layers: the Autonomous DHT (ADHT) layer manages the overlay topology of the system and the Aggregation Management Layer (AML) maintains attribute tuples, performs aggregations, and stores and propagates aggregate values. Given the ADHT construction described in Section 4.2, each node implements an Aggregation Management Layer (AML) to support the flexible API described in Section 3. In this section, we describe the internal state and operation of the AML layer of a node in the system.

We refer to a store of (attribute type, attribute name, value) tuples as a Management Information Base or MIB, following the terminology from Astrolabe [38] and SNMP [34]. We refer to an (attribute type, attribute name) tuple as an attribute key.

As Figure 7 illustrates, each physical node in the system acts as several virtual nodes in the AML: a node acts as leaf for all attribute keys, as a level-1 subtree root for keys whose hash matches the node’s ID in b prefix bits (where b is the number of bits corrected in each step of the ADHT’s routing scheme), as a level-i subtree root for attribute keys whose hash matches the node’s ID in the initial i × b bits, and as the system’s global root for attribute keys whose hash matches the node’s ID in more prefix bits than any other node (in case of a tie, the first non-matching bit is ignored and the comparison is continued [46]).
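A small sketch of this rule, assuming node IDs and key hashes are byte arrays and b bits are corrected per routing step (class and method names are ours):

    // Sketch: the highest level for which a node acts as subtree root for an
    // attribute key is determined by how many leading bits of the key's hash
    // match the node's ID, measured in groups of b bits. Names are ours.
    final class VirtualNodeLevels {
        // Number of leading bits in which id and keyHash agree.
        static int matchingPrefixBits(byte[] id, byte[] keyHash) {
            int bits = 0;
            int n = Math.min(id.length, keyHash.length) * 8;
            while (bits < n) {
                int byteIdx = bits / 8;
                int mask = 0x80 >>> (bits % 8);
                if ((id[byteIdx] & mask) != (keyHash[byteIdx] & mask)) {
                    break;
                }
                bits++;
            }
            return bits;
        }

        // The node simulates virtual nodes at levels 0..maxLevel for this key
        // (level 0 is the leaf; level i requires an i*b-bit prefix match).
        static int maxLevel(byte[] id, byte[] keyHash, int b) {
            return matchingPrefixBits(id, keyHash) / b;
        }
    }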

To support hierarchical aggregation, each virtual node at the root of a level-i subtree maintains several MIBs that store (1) child MIBs containing raw aggregate values gathered from children, (2) a reduction MIB containing locally aggregated values across this raw information, and (3) an ancestor MIB containing aggregate values scattered down from ancestors. This basic strategy of maintaining child, reduction, and ancestor MIBs is based on Astrolabe [38], but our structured propagation strategy channels information that flows up according to its attribute key and our flexible propagation strategy only sends child updates up and ancestor aggregate results down as far as specified by the attribute key’s aggregation function. Note that in the discussion below, for ease of explanation, we assume that the routing protocol corrects a single bit at a time (b = 1). Our system, built upon Pastry, handles multi-bit correction (b = 4) and is a simple extension of the scheme described here.

For a given virtual node ni at level i, each child MIB contains the subset of a child’s reduction MIB that contains tuples that match ni’s node ID in i bits and whose up aggregation function attribute is at least i. These local copies make it easy for a node to recompute a level-i aggregate value when one child’s input changes. Nodes maintain their child MIBs in stable storage and use a simplified version of the Bayou log exchange protocol (sans conflict detection and resolution) for synchronization after disconnections [26].

Virtual node ni at level i maintains a reduction MIB with a tuple for each key present in any child MIB; each tuple contains the attribute type, attribute name, and the output of the attribute type’s aggregation function applied to the children’s tuples.

A virtual node ni at level i also maintains an ancestor MIB to store tuples containing an attribute key and a list of aggregate values at different levels scattered down from ancestors. Note that the list for a key might contain multiple aggregate values for the same level but aggregated at different nodes (see Figure 5). So, the aggregate values are tagged not only with level information, but also with the ID of the node that performed the aggregation.

Level 0 differs slightly from other levels. Each level-0 leaf node maintains a local MIB rather than maintaining child MIBs and a reduction MIB. This local MIB stores information about the local node’s state inserted by local applications via update() calls. We envision that various “sensor” programs and applications will insert data into the local MIB. For example, one program might monitor local configuration and perform updates with information such as total memory, free memory, etc. A distributed file system might perform an update for each file stored on the local node.

Along with these MIBs, a virtual node maintains two other tables: an aggregation function table and an outstanding probes table. The aggregation function table contains the aggregation function and installation arguments (see Table 1) associated with an attribute type or an attribute type and name. Each aggregation function is installed on all nodes in a domain’s subtree, so the aggregation function table can be thought of as a special case of the ancestor MIB with domain functions always installed up to a root within a specified domain and down to all nodes within the domain. The outstanding probes table maintains temporary information regarding in-progress probes.

Given these data structures, it is simple to support the three API functions described in Section 3.1.

Install The Install operation (see Table 1) installs on a domain an aggregation function that acts on a specified attribute type. Execution of an install operation for function aggrFunc on attribute type attrType proceeds in two phases: first the install request is passed up the ADHT tree with the attribute key (attrType, null) until it reaches the root for that key within the specified domain. Then, the request is flooded down the tree and installed on all intermediate and leaf nodes.

Update When a level-i virtual node receives an update for an attribute from a child below, it first recomputes the level-i aggregate value for the specified key, stores that value in its reduction MIB, and then, subject to the function’s up and domain parameters, passes the updated value to the appropriate parent based on the attribute key. Also, the level-i (i ≥ 1) virtual node sends the updated level-i aggregate to all its children if the function’s down parameter exceeds zero. Upon receipt of a level-i aggregate from a parent, a level-k virtual node stores the value in its ancestor MIB and, if k > i − down, forwards this aggregate to its children.
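The following sketch captures this up/down propagation logic at a virtual node; the fields and helper methods are hypothetical placeholders of ours, and only the conditions mirror the rules just described.

    // Sketch of AML update handling at a level-i virtual node. All fields
    // and helpers are hypothetical placeholders; only the up/down logic
    // mirrors the propagation rules described in the text.
    abstract class VirtualNode {
        protected final int level;   // i: this virtual node's level

        VirtualNode(int level) {
            this.level = level;
        }

        // A child reported a new value for (attrType, attrName); "up" is the
        // installed function's up parameter. (Domain restriction omitted.)
        void onChildUpdate(String attrType, String attrName, int up) {
            Object levelIAggregate = recomputeAggregate(attrType, attrName);
            storeInReductionMib(attrType, attrName, levelIAggregate);
            if (up > level) {
                sendToParent(attrType, attrName, levelIAggregate);
            }
            if (level >= 1 && downOf(attrType) > 0) {
                sendToChildren(attrType, attrName, level, levelIAggregate);
            }
        }

        // A parent scattered down a level-i aggregate; we are at level k = this.level.
        void onAncestorAggregate(String attrType, String attrName,
                                 int aggregateLevel, Object value) {
            storeInAncestorMib(attrType, attrName, aggregateLevel, value);
            if (level > aggregateLevel - downOf(attrType)) {
                sendToChildren(attrType, attrName, aggregateLevel, value);
            }
        }

        // Hypothetical machinery.
        protected abstract Object recomputeAggregate(String attrType, String attrName);
        protected abstract void storeInReductionMib(String attrType, String attrName, Object v);
        protected abstract void storeInAncestorMib(String attrType, String attrName, int lvl, Object v);
        protected abstract void sendToParent(String attrType, String attrName, Object v);
        protected abstract void sendToChildren(String attrType, String attrName, int lvl, Object v);
        protected abstract int downOf(String attrType);
    }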

Probe A Probe collects and returns the aggregate value for a specified attribute key for a specified level of the tree. As Figure 2 illustrates, the system satisfies a probe for a level-i aggregate value using a four-phase protocol that may be short-circuited when updates have previously propagated either results or partial results up or down the tree. In phase 1, the route probe phase, the system routes the probe up the attribute key’s tree to either the root of the level-i subtree or to a node that stores the requested value in its ancestor MIB. In the former case, the system proceeds to phase 2 and in the latter it skips to phase 4. In phase 2, the probe scatter phase, each node that receives a probe request sends it to all of its children unless the node’s reduction MIB already has a value that matches the probe’s attribute key, in which case the node initiates phase 3 on behalf of its subtree. In phase 3, the probe aggregation phase, when a node receives values for the specified key from each of its children, it executes the aggregate function on these values and either (a) forwards the result to its parent (if its level is less than i) or (b) initiates phase 4 (if it is at level i). Finally, in phase 4, the aggregate routing phase, the aggregate value is routed down to the node that requested it. Note that in the extreme case of a function installed with up = down = 0, a level-i probe can touch all nodes in a level-i subtree, while in the opposite extreme case of a function installed with up = down = ALL, a probe is a completely local operation at a leaf.

For probes that include phases 2 (probe scatter) and 3 (probe aggregation), an issue is how to decide when a node should stop waiting for its children to respond and send up its current aggregate value. A node stops waiting for its children when one of three conditions occurs: (1) all children have responded, (2) the ADHT layer signals one or more reconfiguration events that mark all children that have not yet responded as unreachable, or (3) a watchdog timer for the request fires. The last case accounts for nodes that participate in the ADHT protocol but that fail at the AML level.

At a virtual node, continuous probes are handled similarly to one-shot probes except that such probes are stored in the outstanding probes table for the time period expTime specified in the probe. Thus each update for an attribute triggers re-evaluation of continuous probes for that attribute.

We implement a lease-based mechanism for dynamic adaptation. A level-l virtual node for an attribute can issue a lease for the level-l aggregate to a parent or a child only if up is greater than l or it has leases from all its children. A virtual node at level l can issue a lease for a level-k aggregate for k > l to a child only if down > k − l or if it has the lease for that aggregate from its parent. Now a probe for a level-k aggregate can be answered by a level-l virtual node if it has a valid lease, irrespective of the up and down values. We are currently designing different policies to decide when to issue a lease and when to revoke a lease and are also evaluating them with the above mechanism.
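Restating these lease rules as code (a sketch of ours, with hypothetical helper predicates; the conditions follow our reading of the rules above):

    // Sketch of the lease-issuing rules for dynamic adaptation. The helper
    // predicates are hypothetical; only the conditions mirror the rules
    // stated in the text.
    abstract class LeaseRules {
        protected final int level;   // l: level of this virtual node

        LeaseRules(int level) {
            this.level = level;
        }

        // May we promise a parent or child a continuously fresh level-l aggregate?
        boolean canIssueOwnLevelLease(int up) {
            return up > level || hasLeasesFromAllChildren();
        }

        // May we promise a child a continuously fresh level-k aggregate (k > l)?
        boolean canIssueAncestorLevelLease(int k, int down) {
            return down > k - level || hasLeaseFromParent(k);
        }

        // A probe for a level-k aggregate can be answered locally whenever a
        // valid lease is held, regardless of the static up/down settings.

        protected abstract boolean hasLeasesFromAllChildren();
        protected abstract boolean hasLeaseFromParent(int aggregateLevel);
    }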

Our current prototype does not implement access control on install, update, and probe operations, but we plan to implement Astrolabe’s [38] certificate-based restrictions. Also, our current prototype does not restrict the resource consumption in executing the aggregation functions; however, techniques from research on resource management in server systems and operating systems [2, 3] can be applied here.

    6 Robustness

In large scale systems, reconfigurations are common. Our two main principles for robustness are to guarantee (i) read availability – probes complete in finite time, and (ii) eventual consistency – updates by a live node will be visible to probes by connected nodes in finite time. During reconfigurations, a probe might return a stale value for two reasons. First, reconfigurations lead to incorrectness in the previous aggregate values. Second, the nodes needed for aggregation to answer the probe become unreachable. Our system also provides two hooks that applications can use for improved end-to-end robustness in the presence of reconfigurations: (1) on-demand re-aggregation and (2) application controlled replication.

Our system handles reconfigurations at two levels – adaptation at the ADHT layer to ensure connectivity and adaptation at the AML layer to ensure access to the data in SDIMS.

    6.1 ADHT Adaptation

Our ADHT layer adaptation algorithm is the same as Pastry’s adaptation algorithm [32] — the leaf sets are repaired as soon as a reconfiguration is detected and the routing table is repaired lazily. Note that maintaining extra leaf sets does not degrade the fault-tolerance property of the original Pastry; indeed, it enhances the resilience of ADHTs to failures by providing additional routing links. Due to redundancy in the leaf sets and the routing table, updates can be routed towards their root nodes successfully even during failures. Also note that the administrative isolation property satisfied by our ADHT algorithm ensures that the reconfigurations in a level-i domain do not affect the probes for level i in a sibling domain.

    6.2 AML Adaptation

Broadly, we use two types of strategies for AML adaptation in the face of reconfigurations: (1) replication in time as a fundamental baseline strategy, and (2) replication in space as an additional performance optimization that falls back on replication in time when the system runs out of replicas. We provide two mechanisms for replication in time. First, lazy re-aggregation propagates already received updates to new children or new parents in a lazy fashion over time. Second, applications can reduce the probability of probe response staleness during such repairs through our flexible API with appropriate setting of the down parameter.

Lazy Re-aggregation: The DHT layer informs the AML layer about reconfigurations in the network using the following three function calls – newParent, failedChild, and newChild. On newParent(parent, prefix), all probes in the outstanding-probes table corresponding to prefix are re-evaluated. If parent is not null, then aggregation functions and already existing data are lazily transferred in the background. Any new updates, installs, and probes for this prefix are sent to the parent immediately. On failedChild(child, prefix), the AML layer marks the child as inactive and any outstanding probes that are waiting for data from this child are re-evaluated. On newChild(child, prefix), the AML layer creates space in its data structures for this child.
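Expressed as an interface (a sketch of ours; the callback names follow the text, while the NodeHandle and Prefix types are placeholders):

    // Sketch of the reconfiguration callbacks the DHT layer delivers to the
    // AML layer. NodeHandle and Prefix are placeholder types of ours.
    interface NodeHandle {}
    interface Prefix {}

    interface AmlReconfigurationListener {
        // A (possibly null) new parent now serves the aggregation trees for
        // keys under this prefix: re-evaluate affected outstanding probes and
        // lazily transfer installed functions and existing data to it.
        void newParent(NodeHandle parent, Prefix prefix);

        // A child became unreachable: mark it inactive and re-evaluate
        // outstanding probes waiting on it.
        void failedChild(NodeHandle child, Prefix prefix);

        // A new child joined: allocate child-MIB state for it.
        void newChild(NodeHandle child, Prefix prefix);
    }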

Figure 8 shows the time line for the default lazy re-aggregation upon reconfiguration. Probes initiated between points 1 and 2 that are affected by reconfigurations are re-evaluated by the AML upon detecting the reconfiguration. Probes that complete or start between points 2 and 8 may return stale answers.


Figure 8: Default lazy data re-aggregation time line. The labeled points (1–8) mark the reconfiguration happening, the DHT noticing it, partial and then complete DHT repair (complete at point 6), and the starts and end of lazy data re-aggregation rounds (ending at point 8).

On-demand Re-aggregation: The default lazy aggregation scheme lazily propagates the old updates in the system. Additionally, using the up and down knobs in the Probe API, applications can force on-demand fast re-aggregation of updates to avoid staleness in the face of reconfigurations. In particular, if an application detects or suspects that an answer is stale, it can re-issue the probe with increased up and down parameters to force a refresh of the cached data. Note that this strategy is useful only after the DHT adaptation is complete (point 6 on the time line in Figure 8).
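As a rough illustration of this retry pattern, a client might wrap its probe call as sketched below. The SdimsClient interface and the Probe signature are assumptions for illustration, not the prototype's exact API.

// Sketch of a staleness-driven retry; the interface below is assumed.
class StalenessRetry {
    interface SdimsClient {
        Object probe(String attrType, String attrName, int level, int up, int down);
    }

    static Object probeWithRefresh(SdimsClient sdims, String type, String name,
                                   int level, int up, int down) {
        Object answer = sdims.probe(type, name, level, up, down);
        if (looksStale(answer)) {
            // Re-issue the probe with larger up and down values to force
            // on-demand re-aggregation along the path rather than returning
            // cached (possibly stale) aggregates.
            answer = sdims.probe(type, name, level, up + 1, down + 1);
        }
        return answer;
    }

    // Application-specific staleness test; treating null as stale is only a
    // placeholder here.
    static boolean looksStale(Object answer) {
        return answer == null;
    }
}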

Replication in Space: Replication in space is more challenging in our system than in a DHT file-location application: in the latter, it can be achieved simply by replicating the root node's contents, whereas in our system all internal nodes have to be replicated along with the root.

In our system, applications control replication in space using the up and down knobs in the Install API; with large up and down values, aggregates at the intermediate virtual nodes are propagated to more nodes in the system. By reducing the number of nodes that have to be accessed to answer a probe, applications can reduce the probability of incorrect results caused by the failure of nodes that do not contribute to the aggregate. For example, in a file-location application, using a positive down parameter ensures that a file's global aggregate is replicated on nodes other than the root. Probes for the file's location can then be answered without accessing the root and hence are not affected by the failure of the root. Note, however, that this technique is not appropriate in all cases. An aggregated value in the file-location system is valid as long as the node hosting the file is active, irrespective of the status of other nodes in the system, whereas an application that counts the number of machines in a system may receive incorrect results irrespective of the replication. If reconfigurations are only transient (e.g., a node temporarily not responding due to a burst of load), the replicated aggregate closely or correctly reflects the current state.
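A minimal sketch of this policy choice for the file-location example follows. The Install signature and the AggregationFunction interface are assumed for illustration; only the up/down policy mirrors the text above.

import java.util.List;
import java.util.Objects;

// Sketch of spatial replication for a file-location attribute (assumed API).
class FileLocationInstall {
    interface AggregationFunction {
        Object aggregate(List<Object> childValues);
    }
    interface Sdims {
        void install(String attrType, AggregationFunction func, int up, int down);
    }

    static void installFileLocation(Sdims sdims) {
        // Any non-null child value identifies a node holding the file.
        AggregationFunction pickHolder = children ->
                children.stream().filter(Objects::nonNull).findFirst().orElse(null);

        // Propagate updates all the way up (modeled here as a large 'up'
        // value) and push the resulting aggregate one level back down, so a
        // probe can be answered below the root and survives the root's failure.
        final int UP_ALL = Integer.MAX_VALUE;
        sdims.install("FileLocation", pickHolder, UP_ALL, /* down = */ 1);
    }
}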

    7 Evaluation

We have implemented a prototype of SDIMS in Java using the FreePastry framework [32] and performed large-scale simulation experiments and micro-benchmark experiments on two real networks: 187 machines in the department and 69 machines on the PlanetLab [27] testbed.

Figure 9: Flexibility of our approach. Average number of messages per operation vs. read-to-write ratio for different UP and DOWN values (Update-All; Up=ALL, Down=9; Up=ALL, Down=6; Update-Up; Up=5, Down=0; Up=2, Down=0; Update-Local) in a network of 4096 nodes.

In all experiments, we use static up and down values and turn off dynamic adaptation. Our evaluation supports four main conclusions. First, the flexible API provides different propagation strategies that minimize communication resources at different read-to-write ratios. For example, in our simulations we observe Update-Local to be efficient for read-to-write ratios below 0.0001, Update-Up around 1, and Update-All above 50000. Second, our system is scalable with respect to both nodes and attributes. In particular, we find that the maximum node stress in our system is an order of magnitude lower than that observed with an Update-All, gossiping approach. Third, in contrast to unmodified Pastry, which violates the path convergence property in up to 14% of cases, our system conforms to the property. Fourth, the system is robust to reconfigurations and adapts to failures within a few seconds.

    7.1 Simulation Experiments

Flexibility and Scalability: A major innovation of our system is its ability to provide flexible computation and propagation of aggregates. In Figure 9, we demonstrate the flexibility exposed by the aggregation API explained in Section 3. We simulate a system with 4096 nodes arranged in a domain hierarchy with a branching factor (bf) of 16 and install several attributes with different up and down parameters. We plot the average number of messages per operation incurred over a wide range of read-to-write ratios for the different attributes. Simulations with other network sizes and branching factors reveal similar results. This graph clearly demonstrates the benefit of supporting a wide range of computation and propagation strategies. Although having a small UP value is efficient for attributes with low read-to-write ratios (write-dominated applications), the probe latency, when reads do occur, may be high, since the probe needs to aggregate the data from all the nodes that did not send their aggregate up. Conversely, applications that wish to improve probe overheads or latencies can increase their UP and DOWN propagation at a potential cost of an increase in write overheads.
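To see why the per-operation cost pivots on the read-to-write ratio, the following back-of-envelope model can help; it is entirely our own simplification (not the paper's simulator or cost model), it ignores caching, partial aggregation, and gossip overheads, and its outputs should be read qualitatively rather than as a reproduction of Figure 9.

// Crude amortized cost model (an assumption for illustration only):
// expected messages per operation for three strategies on a complete
// aggregation tree with n nodes and branching factor bf.
class PropagationCostModel {
    static double messagesPerOp(String strategy, double readToWriteRatio, int n, int bf) {
        double height = Math.ceil(Math.log(n) / Math.log(bf));   // levels in the tree
        double readCost, writeCost;
        switch (strategy) {
            case "Update-Local": readCost = n;          writeCost = 0;      break; // reads gather from every node
            case "Update-Up":    readCost = 2 * height; writeCost = height; break; // reads climb to the root and back
            case "Update-All":   readCost = 0;          writeCost = n;      break; // writes reach every node
            default: throw new IllegalArgumentException(strategy);
        }
        // Amortize: readToWriteRatio reads for every write.
        return (readToWriteRatio * readCost + writeCost) / (readToWriteRatio + 1);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"Update-Local", "Update-Up", "Update-All"}) {
            System.out.printf("%-12s at R=1: %.1f msgs/op%n", s, messagesPerOp(s, 1.0, 4096, 16));
        }
    }
}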

Figure 10: Maximum node stress for a gossiping approach vs. the ADHT-based approach for different numbers of nodes (256, 4096, and 65536) with an increasing number of sparse attributes.

Compared to an existing Update-All single aggregation tree approach [38], scalability in SDIMS comes from (1) leveraging DHTs to form multiple aggregation trees that split the load across nodes and (2) flexible propagation that avoids propagating all updates to all nodes. Figure 10 demonstrates SDIMS's scalability with nodes and attributes. For this experiment, we build a simulator to simulate both Astrolabe [38] (a gossiping, Update-All approach) and our system for an increasing number of sparse attributes. Each attribute corresponds to membership in a multicast session with a small number of participants. For this experiment, the session size is set to 8, the branching factor is set to 16, the propagation mode for SDIMS is Update-Up, and the participant nodes perform continuous probes for the global aggregate value. We plot the maximum node stress (in terms of messages) observed in both schemes for different sized networks with an increasing number of sessions when the participant of each session performs an update operation. Clearly, the DHT-based scheme is more scalable with respect to attributes than an Update-All gossiping scheme. Observe that at a constant number of attributes, as the number of nodes increases in the system, the maximum node stress increases in the gossiping approach, while it decreases in our approach as the load of aggregation is spread across more nodes. Simulations with other session sizes (4 and 16) yield similar results.

Administrative Hierarchy and Robustness: Although the routing protocol of ADHT might lead to an increased number of hops to reach the root for a key as compared to the original Pastry, the algorithm conforms to the path convergence and locality properties and thus provides the administrative isolation property. In Figure 11, we quantify the increased path length through comparisons with unmodified Pastry for different sized networks with different branching factors of the domain hierarchy tree.

Figure 11: Average path length to root in Pastry versus ADHT for different branching factors (bf = 4, 16, 64). Note that all lines corresponding to Pastry overlap.

Figure 12: Percentage of probe pairs whose paths to the root did not conform to the path convergence property with Pastry, for bf = 4, 16, and 64.

To quantify the path convergence property, we perform simulations with a large number of probe pairs – each pair probing for a random key starting from two randomly chosen nodes. In Figure 12, we plot the percentage of probe pairs for unmodified Pastry that do not conform to the path convergence property. When the branching factor is low, the domain hierarchy tree is deeper, resulting in a larger difference between Pastry and ADHT in the average path length; but it is at these small domain sizes that path convergence fails more often with the original Pastry.

7.2 Testbed Experiments

We run our prototype on 180 department machines (some machines ran multiple node instances, so this configuration has a total of 283 SDIMS nodes) and also on 69 machines of the PlanetLab [27] testbed. We measure the performance of our system with two micro-benchmarks. In the first micro-benchmark, we install three aggregation functions of types Update-Local, Update-Up, and Update-All, perform an update operation on all nodes for all three aggregation functions, and measure the latencies incurred by probes for the global aggregate from all nodes in the system.

Figure 13: Latency of probes for the aggregate at the global root level with three different modes of aggregate propagation (Update-All, Update-Up, Update-Local) on (a) department machines and (b) PlanetLab machines.

Figure 14: Micro-benchmark on the department network showing the behavior of probes from a single node while failures occur at other nodes. All 283 nodes assign a value of 10 to the attribute.

Figure 13 shows the observed latencies for both testbeds. Notice that the latency of Update-Local is high compared to the Update-Up policy. This is because the latency of Update-Local is affected by the presence of even a single slow machine or a single machine with a high-latency network connection.

In the second benchmark, we examine robustness. We install one aggregation function of type Update-Up that performs a sum operation on an integer-valued attribute. Each node updates the attribute with the value 10. We then monitor the latencies and the results returned by the probe operation for the global aggregate on one chosen node, while we kill some nodes after every few probes. Figure 14 shows the results on the departmental testbed. Due to the nature of the testbed (machines in a department), there is little change in the latencies even in the face of reconfigurations. In Figure 15, we present the results of the experiment on the PlanetLab testbed. The root node of the aggregation tree is terminated after about 275 seconds. There is a 5X increase in the latencies after the death of the initial root node, as a more distant node becomes the root node after repairs. In both experiments, the values returned by probes start reflecting the correct situation within a short time after the failures.

Figure 15: Probe performance during failures on 69 machines of the PlanetLab testbed.

From both the testbed benchmark experiments and the simulation experiments on flexibility and scalability, we conclude that (1) the flexibility provided by SDIMS allows applications to trade off read-write overheads (Figure 9), read latency, and sensitivity to slow machines (Figure 13), (2) a good default aggregation strategy is Update-Up, which has moderate overheads on both reads and writes (Figure 9), has moderate read latencies (Figure 13), and is scalable with respect to both nodes and attributes (Figure 10), and (3) small domain sizes are the cases where DHT algorithms fail to provide path convergence more often, and SDIMS ensures path convergence with only a moderate increase in path lengths (Figure 12).

    7.3 Applications

SDIMS is designed as a general distributed monitoring and control infrastructure for a broad range of applications. Above, we discuss some simple micro-benchmarks, including a multicast membership service and a calculate-sum function. Van Renesse et al. [38] provide detailed examples of how such a service can be used for a peer-to-peer caching directory, a data-diffusion service, a publish-subscribe system, barrier synchronization, and voting. Additionally, we have initial experience using SDIMS to construct two significant applications: the control plane for a large-scale distributed file system [12] and a network monitor for identifying "heavy hitters" that consume excess resources.

Distributed file system control: The PRACTI (Partial Replication, Arbitrary Consistency, Topology Independence) replication system provides a set of mechanisms for data replication over which arbitrary control policies can be layered. We use SDIMS to provide several key functions in order to create a file system over the low-level PRACTI mechanisms.

First, nodes use SDIMS as a directory to handle read misses. When a node n receives an object o, it updates the (ReadDir, o) attribute with the value n; when n discards o from its local store, it resets (ReadDir, o) to NULL. At each virtual node, the ReadDir aggregation function simply selects a random non-null child value (if any), and we use the Update-Up policy for propagating updates. Finally, to locate a nearby copy of an object o, a node n1 issues a series of probe requests for the (ReadDir, o) attribute, starting with level 1 and increasing the level value with each repeated probe request until a non-null node ID n2 is returned. n1 then sends a demand read request to n2, and n2 sends the data if it has it. Conversely, if n2 does not have a copy of o, it sends a nack to n1, and n1 issues a retry probe with the down parameter set to a value larger than that used in the previous probe in order to force on-demand re-aggregation, which will yield a fresher value for the retry.
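A compact sketch of this directory pattern follows. The Sdims interface, method names, and level numbering are illustrative assumptions, not PRACTI's or the prototype's actual code.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of the ReadDir directory pattern described above (assumed API).
class ReadDirectory {
    interface Sdims {
        void update(String attrType, String attrName, String value);
        String probe(String attrType, String attrName, int level);
    }

    // ReadDir aggregation function: pick a random non-null child value, if any.
    static String readDirAggregate(List<String> childValues, Random rnd) {
        List<String> nonNull = new ArrayList<>();
        for (String v : childValues) if (v != null) nonNull.add(v);
        return nonNull.isEmpty() ? null : nonNull.get(rnd.nextInt(nonNull.size()));
    }

    // Locate a nearby copy of object o by probing increasing levels until a
    // non-null node id is returned.
    static String locate(Sdims sdims, String o, int maxLevel) {
        for (int level = 1; level <= maxLevel; level++) {
            String holder = sdims.probe("ReadDir", o, level);
            if (holder != null) return holder;
        }
        return null;   // no copy found at any level
    }
}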

Second, nodes subscribe to invalidations and updates for interest sets of files, and nodes use SDIMS to set up and maintain per-interest-set, network-topology-sensitive spanning trees for propagating this information. To subscribe to invalidations for interest set i, a node n1 first updates the (Inval, i) attribute with its identity n1, and the aggregation function at each virtual node selects one non-null child value. Finally, n1 probes increasing levels of the (Inval, i) attribute until it finds the first node n2 ≠ n1; n1 then uses n2 as its parent in the spanning tree. n1 also issues a continuous probe for this attribute at this level so that it is notified of any change to its spanning-tree parent. Spanning trees for streams of pushed updates are maintained in a similar manner.
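The subscription logic can be sketched as follows; again, the interface and parameter names are assumptions for illustration rather than the actual implementation.

import java.util.function.Consumer;

// Sketch of per-interest-set spanning-tree subscription (assumed API).
class InvalSubscription {
    interface Sdims {
        void update(String attrType, String attrName, String value);
        String probe(String attrType, String attrName, int level);
        void continuousProbe(String attrType, String attrName, int level, Consumer<String> onChange);
    }

    // Node 'self' joins the spanning tree for interest set 'i'; returns the
    // chosen parent, or null if no other subscriber exists (self is the root).
    static String subscribe(Sdims sdims, String self, String i, int maxLevel) {
        sdims.update("Inval", i, self);                      // advertise membership
        for (int level = 1; level <= maxLevel; level++) {
            String candidate = sdims.probe("Inval", i, level);
            if (candidate != null && !candidate.equals(self)) {
                // Watch this level so the node learns when its parent changes.
                sdims.continuousProbe("Inval", i, level, newParent -> {
                    // Re-run parent selection on change (omitted).
                });
                return candidate;
            }
        }
        return null;
    }
}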

In the future, we plan to use SDIMS for at least two additional services within this replication system. First, we plan to use SDIMS to track the read and write rates to different objects; prefetch algorithms will use this information to prioritize replication [40, 41]. Second, we plan to track the ranges of invalidation sequence numbers seen by each node for each interest set in order to augment the spanning trees described above with additional "hole filling" to allow nodes to locate specific invalidations they have missed.

Overall, our initial experience with using SDIMS for the PRACTI replication system suggests that (1) the general aggregation interface provided by SDIMS simplifies the construction of distributed applications: given the low-level PRACTI mechanisms, we were able to construct a basic file system that uses SDIMS for several distinct control tasks in under two weeks; and (2) the weak consistency guarantees provided by SDIMS meet the requirements of this application: each node's controller effectively treats information from SDIMS as hints, and if a contacted node does not have the needed data, the controller retries, using SDIMS's on-demand re-aggregation to obtain a fresher hint.

Distributed heavy hitter problem: The goal of the heavy hitter problem is to identify network sources, destinations, or protocols that account for significant or unusual amounts of traffic. As noted by Estan et al. [13], this information is useful for a variety of applications such as intrusion detection (e.g., port scanning), denial of service detection, worm detection and tracking, fair network allocation, and network maintenance. Significant work has been done on developing high-performance stream-processing algorithms for identifying heavy hitters at one router, but this is just a first step; ideally these applications would like not just one router's view of the heavy hitters but an aggregate view.

We use SDIMS to allow local information about heavy hitters to be pooled into a view of global heavy hitters. For each destination IP address IPx, a node updates the attribute (DestBW, IPx) with the number of bytes sent to IPx in the last time window. The aggregation function for attribute type DestBW is installed with the Update-Up strategy and simply adds the values from child nodes. Nodes perform a continuous probe for the global aggregate of the attribute and raise an alarm when the global aggregate value goes above a specified limit. Note that only nodes sending data to a particular IP address perform probes for the corresponding attribute. Also note that techniques from [25] can be extended to the hierarchical case to trade off precision for communication bandwidth.
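A sketch of this pattern is shown below; the Sdims interface, method names, and threshold handling are assumptions for illustration, not the deployed monitor.

import java.util.List;
import java.util.function.LongConsumer;

// Sketch of the heavy-hitter pooling pattern described above (assumed API).
class HeavyHitterMonitor {
    interface Sdims {
        void update(String attrType, String attrName, long value);
        void continuousProbeGlobal(String attrType, String attrName, LongConsumer onAggregate);
    }

    // Aggregation function for DestBW: sum the per-child byte counts.
    static long sumChildren(List<Long> childValues) {
        long total = 0;
        for (Long v : childValues) if (v != null) total += v;
        return total;
    }

    // Register interest in the global aggregate for IPx once, raising an
    // alarm whenever the global byte count exceeds 'limitBytes'.
    static void watchGlobal(Sdims sdims, String ipX, long limitBytes) {
        sdims.continuousProbeGlobal("DestBW", ipX, total -> {
            if (total > limitBytes) {
                System.err.println("Heavy hitter alarm: " + ipX + " received " + total + " bytes");
            }
        });
    }

    // At the end of each time window, publish the local bytes sent to IPx.
    static void reportWindow(Sdims sdims, String ipX, long bytesThisWindow) {
        sdims.update("DestBW", ipX, bytesThisWindow);
    }
}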

    8 Related Work

The aggregation abstraction we use in our work is heavily influenced by the Astrolabe [38] project. Astrolabe adopts a Propagate-All approach and unstructured gossiping techniques to attain robustness [5]. However, any gossiping scheme requires aggressive replication of the aggregates. While such aggressive replication is efficient for read-dominated attributes, it incurs high message cost for attributes with a small read-to-write ratio. Our approach provides a flexible API for applications to set propagation rules according to their read-to-write ratios. Other closely related projects include Willow [39], Cone [4], DASIS [1], and SOMO [45]. Willow, DASIS, and SOMO build a single tree for aggregation. Cone builds a tree per attribute and requires a total order on the attribute values.

Several academic [15, 21, 42] and commercial [37] distributed monitoring systems have been designed to monitor the status of large networked systems. Some of them are centralized, where all the monitoring data is collected and analyzed at a central host. Ganglia [15, 23] uses a hierarchical system where the attributes are replicated within clusters using multicast and then cluster aggregates are further aggregated along a single tree. Sophia [42] is a distributed monitoring system designed with a declarative logic programming model where the location of query execution is both explicit in the language and can be calculated during evaluation. This research is complementary to our work. TAG [21] collects information from a large number of sensors along a single tree.

The observation that DHTs internally provide a scalable forest of reduction trees is not new. Plaxton et al.'s [28] original paper describes not a DHT, but a system for hierarchically aggregating and querying object location data in order to route requests to nearby copies of objects. Many systems, building upon both Plaxton's bit-correcting strategy [32, 46] and upon other strategies [24, 29, 35], have chosen to hide this power and export a simple and general distributed hash table abstraction as a useful building block for a broad range of distributed applications. Some of these systems internally make use of the reduction forest not only for routing but also for caching [32]; for simplicity, however, these systems do not generally export this powerful functionality in their external interface. Our goal is to develop and expose the internal reduction forest of DHTs as a similarly general and useful abstraction.

Although object location is a predominant target application for DHTs, several other applications like multicast [8, 9, 33, 36] and DNS [11] are also built using DHTs. All these systems implicitly perform aggregation on some attribute, and each one of them must be designed to handle any reconfigurations in the underlying DHT. With the aggregation abstraction provided by our system, designing and building such applications becomes easier.

Internal DHT trees typically do not satisfy the domain locality properties required in our system. Castro et al. [7] and Gummadi et al. [17] point out the importance of path convergence from the perspective of achieving efficiency and investigate the performance of Pastry and other DHT algorithms, respectively. SkipNet [18] provides domain-restricted routing, where a key search is limited to the specified domain. This interface can be used to ensure path convergence by searching in the lowest domain and moving up to the next domain when the search reaches the root in the current domain. Although this strategy guarantees path convergence, it loses the aggregation-tree abstraction property of DHTs, as the domain-constrained routing might touch a node more than once (as it searches forward and then backward to stay within a domain).

There are some ongoing efforts to provide the relational database abstraction on DHTs: PIER [19] and Gribble et al. [16]. This research mainly focuses on supporting the "Join" operation for tables stored on the nodes in a network. We consider this research to be complementary to our work; the approaches can be used in our system to handle composite probes – e.g., find the nearest machine that has the file "foo" and more than 2 GB of memory.

    9 Conclusions

This paper presents a Scalable Distributed Information Management System (SDIMS) that aggregates information in large-scale networked systems and that can serve as a basic building block for a broad range of applications. For large-scale systems, hierarchical aggregation is a fundamental abstraction for scalability. We build our system by extending ideas from Astrolabe and DHTs to achieve (i) scalability with respect to both nodes and attributes through a new aggregation abstraction that helps leverage the DHT's internal trees for aggregation, (ii) flexibility through a simple API that lets applications control propagation of reads and writes, (iii) administrative isolation through simple augmentations of current DHT algorithms, and (iv) robustness to node and network reconfigurations through lazy re-aggregation, on-demand re-aggregation, and tunable spatial replication.

Our system is still in a nascent state. The initial work does provide evidence that we can achieve scalable distributed information management by leveraging the aggregation abstraction and DHTs. Our work also opens up many research issues on different fronts that need to be solved. Below, we enumerate some future research directions.

1. Robustness: In our current system, in spite of our current techniques, reconfigurations are costly (O(m · d · log² N), where m is the number of attributes, N is the number of nodes in the system, and d is the fraction of attributes that each node is interested in [44]). Malkhi et al. [22] propose Supernodes to reduce the number of reconfigurations at the DHT level; this technique can be leveraged to reduce the number of reconfigurations at the Aggregation Management Layer.

2. Self-tuning adaptation: The read-to-write ratios of applications are dynamic. Instead of applications choosing the right strategy, the system should be able to self-tune the aggregation and propagation strategy according to the changing read-to-write ratios.

3. Handling Composite Queries: Queries involving multiple attributes pose an issue in our system, as different attributes are aggregated along different trees.

4. Caching: While caching is employed effectively in DHT file-location applications, further research is needed to apply this concept in our general framework.

Acknowledgements

We are grateful to J. C. Browne, Robert van Renesse, Amin Vahdat, Jay Lepreau, and the anonymous reviewers for their helpful comments on this work.


References

[1] K. Albrecht, R. Arnold, M. Gahwiler, and R. Wattenhofer. Join and Leave in Peer-to-Peer Systems: The DASIS Approach. Technical report, CS, ETH Zurich, 2003.

[2] G. Back, W. H. Hsieh, and J. Lepreau. Processes in KaffeOS: Isolation, Resource Management, and Sharing in Java. In Proc. OSDI, Oct. 2000.

[3] G. Banga, P. Druschel, and J. Mogul. Resource Containers: A New Facility for Resource Management in Server Systems. In OSDI99, Feb. 1999.

[4] R. Bhagwan, P. Mahadevan, G. Varghese, and G. M. Voelker. Cone: A Distributed Heap-Based Approach to Resource Selection. Technical Report CS2004-0784, UCSD, 2004.

[5] K. P. Birman. The Surprising Power of Epidemic Communication. In Proceedings of FuDiCo, 2003.

[6] B. Bloom. Space/time tradeoffs in hash coding with allowable errors. Comm. of the ACM, 13(7):422–425, 1970.

[7] M. Castro, P. Druschel, Y. C. Hu, and A. Rowstron. Exploiting Network Proximity in Peer-to-Peer Overlay Networks. Technical Report MSR-TR-2002-82, MSR.

[8] M. Castro, P. Druschel, A.-M. Kermarrec, A. Nandi, A. Rowstron, and A. Singh. SplitStream: High-bandwidth Multicast in a Cooperative Environment. In SOSP, 2003.

[9] M. Castro, P. Druschel, A.-M. Kermarrec, and A. Rowstron. SCRIBE: A Large-scale and Decentralised Application-level Multicast Infrastructure. IEEE JSAC (Special issue on Network Support for Multicast Communications), 2002.

[10] J. Challenger, P. Dantzig, and A. Iyengar. A scalable and highly available system for serving dynamic data at frequently accessed web sites. In Proceedings of ACM/IEEE Supercomputing '98 (SC98), Nov. 1998.

[11] R. Cox, A. Muthitacharoen, and R. T. Morris. Serving DNS using a Peer-to-Peer Lookup Service. In IPTPS, 2002.

[12] M. Dahlin, L. Gao, A. Nayate, A. Venkataramani, P. Yalagandula, and J. Zheng. PRACTI replication for large-scale systems. Technical Report TR-04-28, The University of Texas at Austin, 2004.

[13] C. Estan, G. Varghese, and M. Fisk. Bitmap algorithms for counting active flows on high speed links. In Internet Measurement Conference 2003, 2003.

[14] Y. Fu, J. Chase, B. Chun, S. Schwab, and A. Vahdat. SHARP: An architecture for secure resource peering. In Proc. SOSP, Oct. 2003.

[15] Ganglia: Distributed Monitoring and Execution System. http://ganglia.sourceforge.net.

[16] S. Gribble, A. Halevy, Z. Ives, M. Rodrig, and D. Suciu. What Can Peer-to-Peer Do for Databases, and Vice Versa? In Proceedings of WebDB, 2001.

[17] K. Gummadi, R. Gummadi, S. D. Gribble, S. Ratnasamy, S. Shenker, and I. Stoica. The Impact of DHT Routing Geometry on Resilience and Proximity. In SIGCOMM, 2003.

[18] N. J. A. Harvey, M. B. Jones, S. Saroiu, M. Theimer, and A. Wolman. SkipNet: A Scalable Overlay Network with Practical Locality Properties. In USITS, Mar. 2003.

[19] R. Huebsch, J. M. Hellerstein, N. Lanham, B. T. Loo, S. Shenker, and I. Stoica. Querying the Internet with PIER. In Proceedings of the VLDB Conference, May 2003.

[20] C. Intanagonwiwat, R. Govindan, and D. Estrin. Directed diffusion: a scalable and robust communication paradigm for sensor networks. In MobiCom, 2000.

[21] S. R. Madden, M. J. Franklin, J. M. Hellerstein, and W. Hong. TAG: a Tiny AGgregation Service for Ad-Hoc Sensor Networks. In OSDI, 2002.

[22] D. Malkhi. Dynamic Lookup Networks. In FuDiCo, 2002.

[23] M. L. Massie, B. N. Chun, and D. E. Culler. The ganglia distributed monitoring system: Design, implementation, and experience. In submission.

[24] P. Maymounkov and D. Mazieres. Kademlia: A Peer-to-peer Information System Based on the XOR Metric. In Proceedings of IPTPS, Mar. 2002.

[25] C. Olston and J. Widom. Offering a precision-performance tradeoff for aggregation queries over replicated data. In VLDB, pages 144–155, Sept. 2000.

[26] K. Petersen, M. Spreitzer, D. Terry, M. Theimer, and A. Demers. Flexible Update Propagation for Weakly Consistent Replication. In Proc. SOSP, Oct. 1997.

[27] PlanetLab. http://www.planet-lab.org.

[28] C. G. Plaxton, R. Rajaraman, and A. W. Richa. Accessing Nearby Copies of Replicated Objects in a Distributed Environment. In ACM SPAA, 1997.

[29] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A Scalable Content Addressable Network. In Proceedings of ACM SIGCOMM, 2001.

[30] S. Ratnasamy, S. Shenker, and I. Stoica. Routing Algorithms for DHTs: Some Open Questions. In IPTPS, Mar. 2002.

[31] T. Roscoe, R. Mortier, P. Jardetzky, and S. Hand. InfoSpect: Using a Logic Language for System Health Monitoring in Distributed Systems. In Proceedings of the SIGOPS European Workshop, 2002.

[32] A. Rowstron and P. Druschel. Pastry: Scalable, Distributed Object Location and Routing for Large-scale Peer-to-peer Systems. In Middleware, 2001.

[33] S. Ratnasamy, M. Handley, R. Karp, and S. Shenker. Application-level Multicast using Content-addressable Networks. In Proceedings of NGC, Nov. 2001.

[34] W. Stallings. SNMP, SNMPv2, and CMIP. Addison-Wesley, 1993.

[35] I. Stoica, R. Morris, D. Karger, F. Kaashoek, and H. Balakrishnan. Chord: A scalable Peer-To-Peer lookup service for internet applications. In ACM SIGCOMM, 2001.

[36] S. Zhuang, B. Zhao, A. Joseph, R. Katz, and J. Kubiatowicz. Bayeux: An Architecture for Scalable and Fault-tolerant Wide-Area Data Dissemination. In NOSSDAV, 2001.

[37] IBM Tivoli Monitoring. www.ibm.com/software/tivoli/products/monitor.

[38] R. VanRenesse, K. P. Birman, and W. Vogels. Astrolabe: A Robust and Scalable Technology for Distributed System Monitoring, Management, and Data Mining. TOCS, 2003.

[39] R. VanRenesse and A. Bozdog. Willow: DHT, Aggregation, and Publish/Subscribe in One Protocol. In IPTPS, 2004.

[40] A. Venkataramani, P. Weidmann, and M. Dahlin. Bandwidth constrained placement in a WAN. In PODC, Aug. 2001.

[41] A. Venkataramani, P. Yalagandula, R. Kokku, S. Sharif, and M. Dahlin. Potential costs and benefits of long-term prefetching for content distribution. Elsevier Computer Communications, 25(4):367–375, Mar. 2002.

[42] M. Wawrzoniak, L. Peterson, and T. Roscoe. Sophia: An Information Plane for Networked Systems. In HotNets-II, 2003.

[43] R. Wolski, N. Spring, and J. Hayes. The network weather service: A distributed resource performance forecasting service for metacomputing. Journal of Future Generation Computing Systems, 15(5-6):757–768, Oct. 1999.

[44] P. Yalagandula. SDIMS: A Scalable Distributed Information Management System. Ph.D. Proposal, Feb. 2004.

[45] Z. Zhang, S.-M. Shi, and J. Zhu. SOMO: Self-Organized Metadata Overlay for Resource Management in P2P DHT. In IPTPS, 2003.

[46] B. Y. Zhao, J. D. Kubiatowicz, and A. D. Joseph. Tapestry: An Infrastructure for Fault-tolerant Wide-area Location and Routing. Technical Report UCB/CSD-01-1141, UC Berkeley, Apr. 2001.