
Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications

Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishnan
MIT Laboratory for Computer Science

[email protected]
http://pdos.lcs.mit.edu/chord/

Abstract

A fundamental problem that confronts peer-to-peer applications is to efficiently locate the node that stores a particular data item. This paper presents Chord, a distributed lookup protocol that addresses this problem. Chord provides support for just one operation: given a key, it maps the key onto a node. Data location can be easily implemented on top of Chord by associating a key with each data item, and storing the key/data item pair at the node to which the key maps. Chord adapts efficiently as nodes join and leave the system, and can answer queries even if the system is continuously changing. Results from theoretical analysis, simulations, and experiments show that Chord is scalable, with communication cost and the state maintained by each node scaling logarithmically with the number of Chord nodes.

1. Introduction

Peer-to-peer systems and applications are distributed systems without any centralized control or hierarchical organization, where the software running at each node is equivalent in functionality. A review of the features of recent peer-to-peer applications yields a long list: redundant storage, permanence, selection of nearby servers, anonymity, search, authentication, and hierarchical naming. Despite this rich set of features, the core operation in most peer-to-peer systems is efficient location of data items. The contribution of this paper is a scalable protocol for lookup in a dynamic peer-to-peer system with frequent node arrivals and departures.

The Chord protocol supports just one operation: given a key, it maps the key onto a node. Depending on the application using Chord, that node might be responsible for storing a value associated with the key. Chord uses a variant of consistent hashing [11] to assign keys to Chord nodes. Consistent hashing tends to balance load, since each node receives roughly the same number of keys,

University of California, Berkeley. [email protected]
Authors in reverse alphabetical order.

This research was sponsored by the Defense Advanced Research Projects Agency (DARPA) and the Space and Naval Warfare Systems Center, San Diego, under contract N66001-00-1-8933.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGCOMM'01, August 27-31, 2001, San Diego, California, USA.
Copyright 2001 ACM 1-58113-411-8/01/0008 ...$5.00.

and involves relatively little movement of keys when nodes join and leave the system.

Previous work on consistent hashing assumed that nodes were aware of most other nodes in the system, making it impractical to scale to a large number of nodes. In contrast, each Chord node needs "routing" information about only a few other nodes. Because the routing table is distributed, a node resolves the hash function by communicating with a few other nodes. In the steady state, in an N-node system, each node maintains information only about O(log N) other nodes, and resolves all lookups via O(log N) messages to other nodes. Chord maintains its routing information as nodes join and leave the system; with high probability each such event results in no more than O(log^2 N) messages.

Three features that distinguish Chord from many other peer-to-peer lookup protocols are its simplicity, provable correctness, and provable performance. Chord is simple, routing a key through a sequence of O(log N) other nodes toward the destination. A Chord node requires information about O(log N) other nodes for efficient routing, but performance degrades gracefully when that information is out of date. This is important in practice because nodes will join and leave arbitrarily, and consistency of even O(log N) state may be hard to maintain. Only one piece of information per node need be correct in order for Chord to guarantee correct (though slow) routing of queries; Chord has a simple algorithm for maintaining this information in a dynamic environment.

The rest of this paper is structured as follows. Section 2 compares Chord to related work. Section 3 presents the system model that motivates the Chord protocol. Section 4 presents the base Chord protocol and proves several of its properties, while Section 5 presents extensions to handle concurrent joins and failures. Section 6 demonstrates our claims about Chord's performance through simulation and experiments on a deployed prototype. Finally, we outline items for future work in Section 7 and summarize our contributions in Section 8.

2. Related Work

While Chord maps keys onto nodes, traditional name and location services provide a direct mapping between keys and values. A value can be an address, a document, or an arbitrary data item. Chord can easily implement this functionality by storing each key/value pair at the node to which that key maps. For this reason and to make the comparison clearer, the rest of this section assumes a Chord-based service that maps keys onto values.

DNS provides a host name to IP address mapping [15]. Chord can provide the same service with the name representing the key and the associated IP address representing the value.


Chord requires no special servers, while DNS relies on a set of special root servers. DNS names are structured to reflect administrative boundaries; Chord imposes no naming structure. DNS is specialized to the task of finding named hosts or services, while Chord can also be used to find data objects that are not tied to particular machines.

The Freenet peer-to-peer storage system [4, 5], like Chord, is decentralized and symmetric and automatically adapts when hosts leave and join. Freenet does not assign responsibility for documents to specific servers; instead, its lookups take the form of searches for cached copies. This allows Freenet to provide a degree of anonymity, but prevents it from guaranteeing retrieval of existing documents or from providing low bounds on retrieval costs. Chord does not provide anonymity, but its lookup operation runs in predictable time and always results in success or definitive failure.

The Ohaha system uses a consistent hashing-like algorithm for mapping documents to nodes, and Freenet-style query routing [18]. As a result, it shares some of the weaknesses of Freenet. Archival Intermemory uses an off-line computed tree to map logical addresses to machines that store the data [3].

The Globe system [2] has a wide-area location service to map object identifiers to the locations of moving objects. Globe arranges the Internet as a hierarchy of geographical, topological, or administrative domains, effectively constructing a static world-wide search tree, much like DNS. Information about an object is stored in a particular leaf domain, and pointer caches provide search shortcuts [22]. The Globe system handles high load on the logical root by partitioning objects among multiple physical root servers using hash-like techniques. Chord performs this hash function well enough that it can achieve scalability without also involving any hierarchy, though Chord does not exploit network locality as well as Globe.

The distributed data location protocol developed by Plaxton et al. [19], a variant of which is used in OceanStore [12], is perhaps the closest algorithm to the Chord protocol. It provides stronger guarantees than Chord: like Chord it guarantees that queries make a logarithmic number of hops and that keys are well balanced, but the Plaxton protocol also ensures, subject to assumptions about network topology, that queries never travel further in network distance than the node where the key is stored. The advantage of Chord is that it is substantially less complicated and handles concurrent node joins and failures well. The Chord protocol is also similar to Pastry, the location algorithm used in PAST [8]. However, Pastry is a prefix-based routing protocol, and differs in other details from Chord.

CAN uses a d-dimensional Cartesian coordinate space (for some fixed d) to implement a distributed hash table that maps keys onto values [20]. Each node maintains O(d) state, and the lookup cost is O(d N^(1/d)). Thus, in contrast to Chord, the state maintained by a CAN node does not depend on the network size N, but the lookup cost increases faster than log N. If d = log N, CAN lookup times and storage needs match Chord's. However, CAN is not designed to vary d as N (and thus log N) varies, so this match will only occur for the "right" N corresponding to the fixed d. CAN requires an additional maintenance protocol to periodically remap the identifier space onto nodes. Chord also has the advantage that its correctness is robust in the face of partially incorrect routing information.

Chord's routing procedure may be thought of as a one-dimensional analogue of the Grid location system [14]. Grid relies on real-world geographic location information to route its queries; Chord maps its nodes to an artificial one-dimensional space within which routing is carried out by an algorithm similar to Grid's.

Chord can be used as a lookup service to implement a variety of systems, as discussed in Section 3. In particular, it can help avoid single points of failure or control that systems like Napster possess [17], and the lack of scalability that systems like Gnutella display because of their widespread use of broadcasts [10].

3. System Model

Chord simplifies the design of peer-to-peer systems and applications based on it by addressing these difficult problems:

Load balance: Chord acts as a distributed hash function, spreading keys evenly over the nodes; this provides a degree of natural load balance.

Decentralization: Chord is fully distributed: no node is more important than any other. This improves robustness and makes Chord appropriate for loosely-organized peer-to-peer applications.

Scalability: The cost of a Chord lookup grows as the log of the number of nodes, so even very large systems are feasible. No parameter tuning is required to achieve this scaling.

Availability: Chord automatically adjusts its internal tables to reflect newly joined nodes as well as node failures, ensuring that, barring major failures in the underlying network, the node responsible for a key can always be found. This is true even if the system is in a continuous state of change.

Flexible naming: Chord places no constraints on the structure of the keys it looks up: the Chord key-space is flat. This gives applications a large amount of flexibility in how they map their own names to Chord keys.

The Chord software takes the form of a library to be linked with the client and server applications that use it. The application interacts with Chord in two main ways. First, Chord provides a lookup(key) algorithm that yields the IP address of the node responsible for the key. Second, the Chord software on each node notifies the application of changes in the set of keys that the node is responsible for. This allows the application software to, for example, move corresponding values to their new homes when a new node joins.

The application using Chord is responsible for providing any desired authentication, caching, replication, and user-friendly naming of data. Chord's flat key space eases the implementation of these features. For example, an application could authenticate data by storing it under a Chord key derived from a cryptographic hash of the data. Similarly, an application could replicate data by storing it under two distinct Chord keys derived from the data's application-level identifier.
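The two interaction points just described can be sketched as follows in Python. The class and callback names (ChordLibrary, on_keys_changed, move_values) are illustrative only, not the actual API of our implementation; the sketch merely shows how an application would link against such a library.

    # Illustrative sketch of the library interface described above;
    # the names used here are hypothetical.
    class ChordLibrary:
        def __init__(self, on_keys_changed):
            # Callback invoked when the set of keys this node is
            # responsible for changes (e.g. a new node joined), so the
            # application can move the corresponding values.
            self.on_keys_changed = on_keys_changed

        def lookup(self, key):
            # Returns the IP address of the node responsible for `key`;
            # resolution via the Chord protocol is omitted in this sketch.
            raise NotImplementedError

    def move_values(keys):
        print("responsibility changed for keys:", keys)

    chord = ChordLibrary(on_keys_changed=move_values)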

The following are examples of applications for which Chord would provide a good foundation:

Cooperative Mirroring, as outlined in a recent proposal [6]. Imagine a set of software developers, each of whom wishes to publish a distribution. Demand for each distribution might vary dramatically, from very popular just after a new release to relatively unpopular between releases. An efficient approach for this would be for the developers to cooperatively mirror each others' distributions. Ideally, the mirroring system would balance the load across all servers, replicate and cache the data, and ensure authenticity. Such a system should be fully decentralized in the interests of reliability, and because there is no natural central administration.

Time-Shared Storage for nodes with intermittent connectivity. If a person wishes some data to be always available, but their machine is only occasionally available, they can offer to store others' data while they are up, in return for having their data stored elsewhere when they are down.


Figure 1: Structure of an example Chord-based distributed storage system.

The data's name can serve as a key to identify the (live) Chord node responsible for storing the data item at any given time. Many of the same issues arise as in the Cooperative Mirroring application, though the focus here is on availability rather than load balance.

Distributed Indexes to support Gnutella- or Napster-like keyword search. A key in this application could be derived from the desired keywords, while values could be lists of machines offering documents with those keywords.

Large-Scale Combinatorial Search, such as code breaking. In this case keys are candidate solutions to the problem (such as cryptographic keys); Chord maps these keys to the machines responsible for testing them as solutions.

Figure 1 shows a possible three-layered software structure for a cooperative mirror system. The highest layer would provide a file-like interface to users, including user-friendly naming and authentication. This "file system" layer might implement named directories and files, mapping operations on them to lower-level block operations. The next layer, a "block storage" layer, would implement the block operations. It would take care of storage, caching, and replication of blocks. The block storage layer would use Chord to identify the node responsible for storing a block, and then talk to the block storage server on that node to read or write the block.
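A minimal sketch of how such a block storage layer might sit on top of Chord, assuming a hypothetical chord.lookup call (as in the library sketch above) and a hypothetical node-to-node RPC for fetching blocks:

    import hashlib

    def block_key(block_data, m=160):
        # Hash the block contents into the m-bit Chord identifier space
        # (m = 160 matches SHA-1, the base hash used for identifiers).
        return int.from_bytes(hashlib.sha1(block_data).digest(), "big") % (2 ** m)

    def read_block(chord, fetch_block_rpc, key):
        # Ask Chord which node is responsible for the key, then contact
        # the block storage server on that node.
        node_address = chord.lookup(key)
        return fetch_block_rpc(node_address, key)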

4. The Base Chord Protocol

The Chord protocol specifies how to find the locations of keys, how new nodes join the system, and how to recover from the failure (or planned departure) of existing nodes. This section describes a simplified version of the protocol that does not handle concurrent joins or failures. Section 5 describes enhancements to the base protocol to handle concurrent joins and failures.

4.1 Overview

At its heart, Chord provides fast distributed computation of a hash function mapping keys to nodes responsible for them. It uses consistent hashing [11, 13], which has several good properties. With high probability the hash function balances load (all nodes receive roughly the same number of keys). Also with high probability, when an N-th node joins (or leaves) the network, only an O(1/N) fraction of the keys are moved to a different location; this is clearly the minimum necessary to maintain a balanced load.

Figure 2: An identifier circle consisting of the three nodes 0, 1, and 3. In this example, key 1 is located at node 1, key 2 at node 3, and key 6 at node 0.

Chord improves the scalability of consistent hashing by avoiding the requirement that every node know about every other node. A Chord node needs only a small amount of "routing" information about other nodes. Because this information is distributed, a node resolves the hash function by communicating with a few other nodes. In an N-node network, each node maintains information only about O(log N) other nodes, and a lookup requires O(log N) messages.

Chord must update the routing information when a node joins or leaves the network; a join or leave requires O(log^2 N) messages.

4.2 Consistent Hashing

The consistent hash function assigns each node and key an m-bit identifier using a base hash function such as SHA-1 [9]. A node's identifier is chosen by hashing the node's IP address, while a key identifier is produced by hashing the key. We will use the term "key" to refer to both the original key and its image under the hash function, as the meaning will be clear from context. Similarly, the term "node" will refer to both the node and its identifier under the hash function. The identifier length m must be large enough to make the probability of two nodes or keys hashing to the same identifier negligible.

Consistent hashing assigns keys to nodes as follows. Identifiers are ordered in an identifier circle modulo 2^m. Key k is assigned to the first node whose identifier is equal to or follows (the identifier of) k in the identifier space. This node is called the successor node of key k, denoted by successor(k). If identifiers are represented as a circle of numbers from 0 to 2^m - 1, then successor(k) is the first node clockwise from k.

Figure 2 shows an identifier circle with m = 3. The circle has three nodes: 0, 1, and 3. The successor of identifier 1 is node 1, so key 1 would be located at node 1. Similarly, key 2 would be located at node 3, and key 6 at node 0.

Consistent hashing is designed to let nodes enter and leave the network with minimal disruption. To maintain the consistent hashing mapping when a node n joins the network, certain keys previously assigned to n's successor now become assigned to n. When node n leaves the network, all of its assigned keys are reassigned to n's successor. No other changes in assignment of keys to nodes need occur. In the example above, if a node were to join with identifier 7, it would capture the key with identifier 6 from the node with identifier 0.
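This assignment rule can be made concrete with a short Python sketch over the m = 3 example of Figure 2. The sketch is an illustration only; Chord never materializes the full node list at any single node.

    def successor(key, nodes, m=3):
        # The successor of a key is the first node clockwise from the
        # key's identifier on the circle of 2^m identifiers.
        ring = sorted(n % (2 ** m) for n in nodes)
        for n in ring:
            if n >= key:
                return n
        return ring[0]  # wrap past 2^m - 1 back to the lowest identifier

    nodes = [0, 1, 3]
    assert successor(1, nodes) == 1
    assert successor(2, nodes) == 3
    assert successor(6, nodes) == 0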

The following results are proven in the papers that introduced consistent hashing [11, 13]:

THEOREM 1. For any set of N nodes and K keys, with high probability:

1. Each node is responsible for at most (1 + ε)K/N keys.


2. When an (N + 1)st node joins or leaves the network, responsibility for O(K/N) keys changes hands (and only to or from the joining or leaving node).

When consistent hashing is implemented as described above, the theorem proves a bound of ε = O(log N). The consistent hashing paper shows that ε can be reduced to an arbitrarily small constant by having each node run O(log N) "virtual nodes", each with its own identifier.

The phrase "with high probability" bears some discussion. A simple interpretation is that the nodes and keys are randomly chosen, which is plausible in a non-adversarial model of the world. The probability distribution is then over random choices of keys and nodes, and says that such a random choice is unlikely to produce an unbalanced distribution. One might worry, however, about an adversary who intentionally chooses keys to all hash to the same identifier, destroying the load balancing property. The consistent hashing paper uses "k-universal hash functions" to provide certain guarantees even in the case of nonrandom keys.

Rather than using a k-universal hash function, we chose to use the standard SHA-1 function as our base hash function. This makes our protocol deterministic, so that the claims of "high probability" no longer make sense. However, producing a set of keys that collide under SHA-1 can be seen, in some sense, as inverting, or "decrypting", the SHA-1 function. This is believed to be hard to do. Thus, instead of stating that our theorems hold with high probability, we can claim that they hold "based on standard hardness assumptions."

For simplicity (primarily of presentation), we dispense with the use of virtual nodes. In this case, the load on a node may exceed the average by (at most) an O(log N) factor with high probability (or in our case, based on standard hardness assumptions). One reason to avoid virtual nodes is that the number needed is determined by the number of nodes in the system, which may be difficult to determine. Of course, one may choose to use an a priori upper bound on the number of nodes in the system; for example, we could postulate at most one Chord server per IPv4 address. In this case running 32 virtual nodes per physical node would provide good load balance.
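As an illustration of the virtual-node idea, a physical node could derive identifiers for its virtual nodes by hashing its address together with an index. The helper below and the address format are hypothetical; they are not part of the Chord protocol itself.

    import hashlib

    def virtual_node_ids(address, count, m=160):
        # One m-bit identifier per virtual node, derived from "address:index".
        ids = []
        for i in range(count):
            digest = hashlib.sha1(("%s:%d" % (address, i)).encode()).digest()
            ids.append(int.from_bytes(digest, "big") % (2 ** m))
        return ids

    ids = virtual_node_ids("192.0.2.1", count=32)  # 32 virtual nodes, as above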

4.3 Scalable Key Location

A very small amount of routing information suffices to implement consistent hashing in a distributed environment. Each node need only be aware of its successor node on the circle. Queries for a given identifier can be passed around the circle via these successor pointers until they first encounter a node that succeeds the identifier; this is the node the query maps to. A portion of the Chord protocol maintains these successor pointers, thus ensuring that all lookups are resolved correctly. However, this resolution scheme is inefficient: it may require traversing all N nodes to find the appropriate mapping. To accelerate this process, Chord maintains additional routing information. This additional information is not essential for correctness, which is achieved as long as the successor information is maintained correctly.

As before, let m be the number of bits in the key/node identifiers. Each node, n, maintains a routing table with (at most) m entries, called the finger table. The i-th entry in the table at node n contains the identity of the first node, s, that succeeds n by at least 2^(i-1) on the identifier circle, i.e., s = successor(n + 2^(i-1)), where 1 <= i <= m (and all arithmetic is modulo 2^m). We call node s the i-th finger of node n, and denote it by n.finger[i].node (see Table 1). A finger table entry includes both the Chord identifier and the IP address (and port number) of the relevant node. Note that the first finger of n is its immediate successor on the circle; for convenience we often refer to it as the successor rather than the first finger.

Notation             Definition
finger[k].start      (n + 2^(k-1)) mod 2^m, 1 <= k <= m
finger[k].interval   [finger[k].start, finger[k+1].start)
finger[k].node       first node >= n.finger[k].start
successor            the next node on the identifier circle; finger[1].node
predecessor          the previous node on the identifier circle

Table 1: Definition of variables for node n, using m-bit identifiers.
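A short sketch of the finger[k].start computation from Table 1, reproducing the values shown for node 1 in Figure 3(a):

    def finger_start(n, k, m):
        # finger[k].start = (n + 2^(k-1)) mod 2^m, for 1 <= k <= m
        return (n + 2 ** (k - 1)) % (2 ** m)

    n, m = 1, 3
    print([finger_start(n, k, m) for k in range(1, m + 1)])  # [2, 3, 5]
    # The corresponding intervals are [2,3), [3,5), and [5,1), the last
    # one wrapping around the circle back to n.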

In the example shown in Figure 3(b), the finger table of node 1 points to the successor nodes of identifiers (1 + 2^0) mod 2^3 = 2, (1 + 2^1) mod 2^3 = 3, and (1 + 2^2) mod 2^3 = 5, respectively. The successor of identifier 2 is node 3, as this is the first node that follows 2, the successor of identifier 3 is (trivially) node 3, and the successor of 5 is node 0.

This scheme has two important characteristics. First, each node stores information about only a small number of other nodes, and knows more about nodes closely following it on the identifier circle than about nodes farther away. Second, a node's finger table generally does not contain enough information to determine the successor of an arbitrary key k. For example, node 3 in Figure 3 does not know the successor of 1, as 1's successor (node 1) does not appear in node 3's finger table.

What happens when a node n does not know the successor of a key k? If n can find a node whose ID is closer than its own to k, that node will know more about the identifier circle in the region of k than n does. Thus n searches its finger table for the node whose ID most immediately precedes k, and asks that node for the node it knows whose ID is closest to k. By repeating this process, n learns about nodes with IDs closer and closer to k.

The pseudocode that implements the search process is shown in Figure 4. The notation n.foo() stands for the function foo() being invoked at and executed on node n. Remote calls and variable references are preceded by the remote node identifier, while local variable references and procedure calls omit the local node. Thus n.foo() denotes a remote procedure call on node n, while n.bar, without parentheses, is an RPC to look up a variable bar on node n.

find_successor works by finding the immediate predecessor node of the desired identifier; the successor of that node must be the successor of the identifier. We implement find_predecessor explicitly, because it is used later to implement the join operation (Section 4.4).

When node n executes find_predecessor, it contacts a series of nodes moving forward around the Chord circle towards id. If node n contacts a node n' such that id falls between n' and the successor of n', find_predecessor is done and returns n'. Otherwise node n asks n' for the node n' knows about that most closely precedes id. Thus the algorithm always makes progress towards the predecessor of id.

As an example, consider the Chord ring in Figure 3(b). Suppose node 3 wants to find the successor of identifier 1. Since 1 belongs to the circular interval [7, 3), it belongs to node 3's third finger interval; node 3 therefore checks the third entry in its finger table, which is 0. Because 0 precedes 1, node 3 will ask node 0 to find the successor of 1. In turn, node 0 will infer from its finger table that 1's successor is the node 1 itself, and return node 1 to node 3.

The finger pointers at repeatedly doubling distances around the circle cause each iteration of the loop in find_predecessor to halve the distance to the target identifier. From this intuition follows a theorem:


Figure 3: (a) The finger intervals associated with node 1: finger[1].start = 2, finger[2].start = 3, finger[3].start = 5, with intervals [2,3), [3,5), and [5,1). (b) Finger tables and key locations for a net with nodes 0, 1, and 3, and keys 1, 2, and 6. Node 0: starts 1, 2, 4; intervals [1,2), [2,4), [4,0); successors 1, 3, 0; stores key 6. Node 1: starts 2, 3, 5; intervals [2,3), [3,5), [5,1); successors 3, 3, 0; stores key 1. Node 3: starts 4, 5, 7; intervals [4,5), [5,7), [7,3); successors 0, 0, 0; stores key 2.

THEOREM 2. With high probability (or under standard hardness assumptions), the number of nodes that must be contacted to find a successor in an N-node network is O(log N).

PROOF. Suppose that node n wishes to resolve a query for the successor of k. Let p be the node that immediately precedes k. We analyze the number of query steps to reach p.

Recall that if n != p, then n forwards its query to the closest predecessor of k in its finger table. Suppose that node p is in the i-th finger interval of node n. Then since this interval is not empty, node n will finger some node f in this interval. The distance (number of identifiers) between n and f is at least 2^(i-1). But f and p are both in n's i-th finger interval, which means the distance between them is at most 2^(i-1). This means f is closer to p than to n, or equivalently, that the distance from f to p is at most half the distance from n to p.

If the distance between the node handling the query and the predecessor p halves in each step, and is at most 2^m initially, then within m steps the distance will be one, meaning we have arrived at p.

In fact, as discussed above, we assume that node and key identifiers are random. In this case, the number of forwardings necessary will be O(log N) with high probability. After log N forwardings, the distance between the current query node and the key k will be reduced to at most 2^m / N. The expected number of node identifiers landing in a range of this size is 1, and it is O(log N) with high probability. Thus, even if the remaining steps advance by only one node at a time, they will cross the entire remaining interval and reach key k within another O(log N) steps.

In the section reporting our experimental results (Section 6), we will observe (and justify) that the average lookup time is (1/2) log N.

4.4 Node Joins

In a dynamic network, nodes can join (and leave) at any time. The main challenge in implementing these operations is preserving the ability to locate every key in the network. To achieve this goal, Chord needs to preserve two invariants:

1. Each node's successor is correctly maintained.

2. For every key k, node successor(k) is responsible for k.

In order for lookups to be fast, it is also desirable for the finger tables to be correct.

This section shows how to maintain these invariants when a single node joins. We defer the discussion of multiple nodes joining simultaneously to Section 5, which also discusses how to handle a node failure.

// ask node n to find id's successor
n.find_successor(id)
    n' = find_predecessor(id);
    return n'.successor;

// ask node n to find id's predecessor
n.find_predecessor(id)
    n' = n;
    while (id ∉ (n', n'.successor])
        n' = n'.closest_preceding_finger(id);
    return n';

// return closest finger preceding id
n.closest_preceding_finger(id)
    for i = m downto 1
        if (finger[i].node ∈ (n, id))
            return finger[i].node;
    return n;

Figure 4: The pseudocode to find the successor node of an identifier id. Remote procedure calls and variable lookups are preceded by the remote node.
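For readers who prefer runnable code, the following Python sketch mirrors Figure 4 over a static ring. It assumes a global view of all nodes when building successor pointers and finger tables, which a real Chord node never has; the point is only to make the circular interval tests and the lookup loop concrete.

    def in_interval(x, a, b, inc_right=False):
        # Membership test for the circular interval (a, b), or (a, b]
        # when inc_right is set; the interval may wrap around zero.
        if inc_right and x == b:
            return True
        if a < b:
            return a < x < b
        return x > a or x < b

    class Node:
        def __init__(self, ident, m):
            self.id, self.m = ident, m
            self.finger = []       # finger[i - 1] holds the i-th finger
            self.successor = None

        def closest_preceding_finger(self, ident):
            for f in reversed(self.finger):
                if in_interval(f.id, self.id, ident):
                    return f
            return self

        def find_predecessor(self, ident):
            n = self
            while not in_interval(ident, n.id, n.successor.id, inc_right=True):
                n = n.closest_preceding_finger(ident)
            return n

        def find_successor(self, ident):
            return self.find_predecessor(ident).successor

    def build_ring(ids, m):
        # Build correct successors and fingers from a global view of the
        # circle (for illustration only).
        ring = sorted(ids)
        nodes = {i: Node(i, m) for i in ring}
        def succ(k):
            k = k % (2 ** m)
            for i in ring:
                if i >= k:
                    return nodes[i]
            return nodes[ring[0]]
        for i in ring:
            nodes[i].successor = succ(i + 1)
            nodes[i].finger = [succ(i + 2 ** (j - 1)) for j in range(1, m + 1)]
        return nodes

    nodes = build_ring([0, 1, 3], m=3)
    assert nodes[3].find_successor(1).id == 1   # the example worked above
    assert nodes[0].find_successor(6).id == 0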

Before describing the join operation, we summarize its performance (the proof of this theorem is in the companion technical report [21]):

THEOREM 3. With high probability, any node joining or leaving an N-node Chord network will use O(log^2 N) messages to re-establish the Chord routing invariants and finger tables.

To simplify the join and leave mechanisms, each node in Chord maintains a predecessor pointer. A node's predecessor pointer contains the Chord identifier and IP address of the immediate predecessor of that node, and can be used to walk counterclockwise around the identifier circle.

To preserve the invariants stated above, Chord must perform three tasks when a node n joins the network:

1. Initialize the predecessor and fingers of node n.

2. Update the fingers and predecessors of existing nodes to reflect the addition of n.

3. Notify the higher layer software so that it can transfer state (e.g. values) associated with keys that node n is now responsible for.

We assume that the new node learns the identity of an existing Chord node n' by some external mechanism.


Figure 5: (a) Finger tables and key locations after node 6 joins. (b) Finger tables and key locations after node 3 leaves. Changed entries are shown in black, and unchanged in gray.

Node n uses n' to initialize its state and add itself to the existing Chord network, as follows.

Initializing fingers and predecessor: Node n learns its predecessor and fingers by asking n' to look them up, using the init_finger_table pseudocode in Figure 6. Naively performing find_successor for each of the m finger entries would give a runtime of O(m log N). To reduce this, n checks whether the i-th finger is also the correct (i+1)-th finger, for each i. This happens when finger[i].interval does not contain any node, and thus finger[i].node >= finger[i+1].start. It can be shown that the change reduces the expected (and high probability) number of finger entries that must be looked up to O(log N), which reduces the overall time to O(log^2 N).

As a practical optimization, a newly joined node n can ask an immediate neighbor for a copy of its complete finger table and its predecessor. n can use the contents of these tables as hints to help it find the correct values for its own tables, since n's tables will be similar to its neighbors'. This can be shown to reduce the time to fill the finger table to O(log N).

Updating fingers of existing nodes: Node n will need to be entered into the finger tables of some existing nodes. For example, in Figure 5(a), node 6 becomes the third finger of nodes 0 and 1, and the first and the second finger of node 3.

Figure 6 shows the pseudocode of the update_finger_table function that updates existing finger tables. Node n will become the i-th finger of node p if and only if (1) p precedes n by at least 2^(i-1), and (2) the i-th finger of node p succeeds n. The first node, p, that can meet these two conditions is the immediate predecessor of n - 2^(i-1). Thus, for a given n, the algorithm starts with the i-th finger of node n, and then continues to walk in the counter-clockwise direction on the identifier circle until it encounters a node whose i-th finger precedes n.

We show in the technical report [21] that the number of nodes that need to be updated when a node joins the network is O(log N) with high probability. Finding and updating these nodes takes O(log^2 N) time. A more sophisticated scheme can reduce this time to O(log N); however, we do not present it as we expect implementations to use the algorithm of the following section.

Transferring keys: The last operation that has to be performed when a node n joins the network is to move responsibility for all the keys for which node n is now the successor. Exactly what this entails depends on the higher-layer software using Chord, but typically it would involve moving the data associated with each key to the new node.

#define successor finger[1].node

// node n joins the network;
// n' is an arbitrary node in the network
n.join(n')
    if (n')
        init_finger_table(n');
        update_others();
        // move keys in (predecessor, n] from successor
    else // n is the only node in the network
        for i = 1 to m
            finger[i].node = n;
        predecessor = n;

// initialize finger table of local node;
// n' is an arbitrary node already in the network
n.init_finger_table(n')
    finger[1].node = n'.find_successor(finger[1].start);
    predecessor = successor.predecessor;
    successor.predecessor = n;
    for i = 1 to m - 1
        if (finger[i+1].start ∈ [n, finger[i].node))
            finger[i+1].node = finger[i].node;
        else
            finger[i+1].node = n'.find_successor(finger[i+1].start);

// update all nodes whose finger
// tables should refer to n
n.update_others()
    for i = 1 to m
        // find last node p whose i-th finger might be n
        p = find_predecessor(n - 2^(i-1));
        p.update_finger_table(n, i);

// if s is the i-th finger of n, update n's finger table with s
n.update_finger_table(s, i)
    if (s ∈ [n, finger[i].node))
        finger[i].node = s;
        p = predecessor; // get first node preceding n
        p.update_finger_table(s, i);

Figure 6: Pseudocode for the node join operation.


Node n can become the successor only for keys that were previously the responsibility of the node immediately following n, so n only needs to contact that one node to transfer responsibility for all relevant keys.

5. Concurrent Operations and Failures

In practice Chord needs to deal with nodes joining the system concurrently and with nodes that fail or leave voluntarily. This section describes modifications to the basic Chord algorithms described in Section 4 to handle these situations.

5.1 Stabilization

The join algorithm in Section 4 aggressively maintains the finger tables of all nodes as the network evolves. Since this invariant is difficult to maintain in the face of concurrent joins in a large network, we separate our correctness and performance goals. A basic "stabilization" protocol is used to keep nodes' successor pointers up to date, which is sufficient to guarantee correctness of lookups. Those successor pointers are then used to verify and correct finger table entries, which allows these lookups to be fast as well as correct.

If joining nodes have affected some region of the Chord ring, a lookup that occurs before stabilization has finished can exhibit one of three behaviors. The common case is that all the finger table entries involved in the lookup are reasonably current, and the lookup finds the correct successor in O(log N) steps. The second case is where successor pointers are correct, but fingers are inaccurate. This yields correct lookups, but they may be slower. In the final case, the nodes in the affected region have incorrect successor pointers, or keys may not yet have migrated to newly joined nodes, and the lookup may fail. The higher-layer software using Chord will notice that the desired data was not found, and has the option of retrying the lookup after a pause. This pause can be short, since stabilization fixes successor pointers quickly.

Our stabilization scheme guarantees to add nodes to a Chord ring in a way that preserves reachability of existing nodes, even in the face of concurrent joins and lost and reordered messages. Stabilization by itself won't correct a Chord system that has split into multiple disjoint cycles, or a single cycle that loops multiple times around the identifier space. These pathological cases cannot be produced by any sequence of ordinary node joins. It is unclear whether they can be produced by network partitions and recoveries or intermittent failures. If produced, these cases could be detected and repaired by periodic sampling of the ring topology.

Figure 7 shows the pseudocode for joins and stabilization; this code replaces Figure 6 to handle concurrent joins. When node n first starts, it calls join(n'), where n' is any known Chord node. The join function asks n' to find the immediate successor of n. By itself, join does not make the rest of the network aware of n.

Every node runs stabilize periodically (this is how newly joined nodes are noticed by the network). When node n runs stabilize, it asks n's successor for the successor's predecessor p, and decides whether p should be n's successor instead. This would be the case if node p recently joined the system. stabilize also notifies node n's successor of n's existence, giving the successor the chance to change its predecessor to n. The successor does this only if it knows of no closer predecessor than n.

As a simple example, suppose node n joins the system, and its ID lies between nodes np and ns. n would acquire ns as its successor. Node ns, when notified by n, would acquire n as its predecessor. When np next runs stabilize, it will ask ns for its predecessor (which is now n); np would then acquire n as its successor. Finally, np will notify n, and n will acquire np as its predecessor. At this point, all predecessor and successor pointers are correct.

n.join(n')
    predecessor = nil;
    successor = n'.find_successor(n);

// periodically verify n's immediate successor,
// and tell the successor about n.
n.stabilize()
    x = successor.predecessor;
    if (x ∈ (n, successor))
        successor = x;
    successor.notify(n);

// n' thinks it might be our predecessor.
n.notify(n')
    if (predecessor is nil or n' ∈ (predecessor, n))
        predecessor = n';

// periodically refresh finger table entries.
n.fix_fingers()
    i = random index into finger[];
    finger[i].node = find_successor(finger[i].start);

Figure 7: Pseudocode for stabilization.

As soon as the successor pointers are correct, calls to find_predecessor (and thus find_successor) will work. Newly joined nodes that have not yet been fingered may cause find_predecessor to initially undershoot, but the loop in the lookup algorithm will nevertheless follow successor (finger[1]) pointers through the newly joined nodes until the correct predecessor is reached. Eventually fix_fingers will adjust finger table entries, eliminating the need for these linear scans.

The following theorems (proved in the technical report [21]) show that all problems caused by concurrent joins are transient. The theorems assume that any two nodes trying to communicate will eventually succeed.

THEOREM 4. Once a node can successfully resolve a given query, it will always be able to do so in the future.

THEOREM 5. At some time after the last join all successor pointers will be correct.

The proofs of these theorems rely on an invariant and a termination argument. The invariant states that once one node can reach another node via successor pointers, it always can. To argue termination, we consider the case where two nodes both think they have the same successor s. In this case, each will attempt to notify s, and s will eventually choose the closer of the two (or some other, closer node) as its predecessor. At this point the farther of the two will, by contacting s, learn of a better successor than s. It follows that every node progresses towards a better and better successor over time. This progress must eventually halt in a state where every node is considered the successor of exactly one other node; this defines a cycle (or set of them, but the invariant ensures that there will be at most one).

We have not discussed the adjustment of fingers when nodes join because it turns out that joins don't substantially damage the performance of fingers. If a node has a finger into each interval, then these fingers can still be used even after joins. The distance halving argument is essentially unchanged, showing that O(log N) hops suffice to reach a node "close" to a query's target. New joins influence the lookup only by getting in between the old predecessor and successor of a target query. These new nodes may need to be scanned linearly (if their fingers are not yet accurate).


But unless a tremendous number of nodes joins the system, the number of nodes between two old nodes is likely to be very small, so the impact on lookup is negligible. Formally, we can state the following:

THEOREM 6. If we take a stable network with N nodes, and another set of up to N nodes joins the network with no finger pointers (but with correct successor pointers), then lookups will still take O(log N) time with high probability.

More generally, so long as the time it takes to adjust fingers is less than the time it takes the network to double in size, lookups should continue to take O(log N) hops.

5.2 Failures and Replication

When a node n fails, nodes whose finger tables include n must find n's successor. In addition, the failure of n must not be allowed to disrupt queries that are in progress as the system is re-stabilizing.

The key step in failure recovery is maintaining correct successor pointers, since in the worst case find_predecessor can make progress using only successors. To help achieve this, each Chord node maintains a "successor-list" of its r nearest successors on the Chord ring. In ordinary operation, a modified version of the stabilize routine in Figure 7 maintains the successor-list. If node n notices that its successor has failed, it replaces it with the first live entry in its successor list. At that point, n can direct ordinary lookups for keys for which the failed node was the successor to the new successor. As time passes, stabilize will correct finger table entries and successor-list entries pointing to the failed node.
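The failover step just described can be sketched as follows. Here is_alive stands in for whatever failure detection (for example, RPC timeouts) an implementation uses, and the attribute names are illustrative rather than part of the protocol.

    def repair_successor(node, is_alive):
        # Replace a failed successor with the first live entry of the
        # successor-list, as described above.
        if is_alive(node.successor):
            return
        for candidate in node.successor_list:
            if is_alive(candidate):
                node.successor = candidate
                return
        raise RuntimeError("all r successors appear to have failed")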

After a node failure, but before stabilization has completed, other nodes may attempt to send requests through the failed node as part of a find_successor lookup. Ideally the lookups would be able to proceed, after a timeout, by another path despite the failure. In many cases this is possible. All that is needed is a list of alternate nodes, easily found in the finger table entries preceding that of the failed node. If the failed node had a very low finger table index, nodes in the successor-list are also available as alternates.

The technical report proves the following two theorems that show that the successor-list allows lookups to succeed, and be efficient, even during stabilization [21]:

THEOREM 7. If we use a successor list of length r = O(log N) in a network that is initially stable, and then every node fails with probability 1/2, then with high probability find_successor returns the closest living successor to the query key.

THEOREM 8. If we use a successor list of length r = O(log N) in a network that is initially stable, and then every node fails with probability 1/2, then the expected time to execute find_successor in the failed network is O(log N).

The intuition behind these proofs is straightforward: a node's r successors all fail with probability 2^(-r) = 1/N, so with high probability a node will be aware of, so able to forward messages to, its closest living successor.

The successor-list mechanism also helps higher layer software replicate data. A typical application using Chord might store replicas of the data associated with a key at the k nodes succeeding the key. The fact that a Chord node keeps track of its r successors means that it can inform the higher layer software when successors come and go, and thus when the software should propagate new replicas.
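A sketch of how a higher layer might place replicas at the k nodes succeeding a key, using the successor-list; store_rpc and the attribute names are hypothetical application-level helpers, not part of Chord:

    def store_with_replicas(chord_node, key, value, k, store_rpc):
        # Store the value at the key's successor and at the next k-1
        # nodes on the ring, taken from the successor's successor-list.
        primary = chord_node.find_successor(key)
        for target in [primary] + list(primary.successor_list[:k - 1]):
            store_rpc(target, key, value)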

Figure 9: The 1st and the 99th percentiles of the number of keys per node as a function of virtual nodes mapped to a real node. The network has 10^4 real nodes and stores 10^6 keys.

6. Simulation and Experimental Results

In this section, we evaluate the Chord protocol by simulation. The simulator uses the lookup algorithm in Figure 4 and a slightly older version of the stabilization algorithms described in Section 5. We also report on some preliminary experimental results from an operational Chord-based system running on Internet hosts.

6.1 Protocol Simulator

The Chord protocol can be implemented in an iterative or recursive style. In the iterative style, a node resolving a lookup initiates all communication: it asks a series of nodes for information from their finger tables, each time moving closer on the Chord ring to the desired successor. In the recursive style, each intermediate node forwards a request to the next node until it reaches the successor. The simulator implements the protocols in an iterative style.
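For contrast with the iterative driver shown after Figure 4, the recursive style can be sketched as below, reusing the Node and in_interval helpers from that earlier sketch; the reply callback is hypothetical and stands in for however an implementation returns the answer to the original querier.

    def recursive_find_successor(node, ident, reply, in_interval):
        # If ident falls between this node and its successor, the
        # successor is the answer; otherwise forward to a closer finger.
        if in_interval(ident, node.id, node.successor.id, inc_right=True):
            reply(node.successor)
        else:
            recursive_find_successor(node.closest_preceding_finger(ident),
                                     ident, reply, in_interval)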

6.2 Load Balance

We first consider the ability of consistent hashing to allocate keys to nodes evenly. In a network with N nodes and K keys we would like the distribution of keys to nodes to be tight around K/N.

We consider a network consisting of 10^4 nodes, and vary the total number of keys from 10^5 to 10^6 in increments of 10^5. For each value, we repeat the experiment 20 times. Figure 8(a) plots the mean and the 1st and 99th percentiles of the number of keys per node. The number of keys per node exhibits large variations that increase linearly with the number of keys. For example, in all cases some nodes store no keys. To clarify this, Figure 8(b) plots the probability density function (PDF) of the number of keys per node when there are 5 x 10^5 keys stored in the network. The maximum number of keys stored by any node in this case is 457, or 9.1 times the mean value. For comparison, the 99th percentile is 4.6 times the mean value.

One reason for these variations is that node identifiers do not uniformly cover the entire identifier space. If we divide the identifier space in N equal-sized bins, where N is the number of nodes, then we might hope to see one node in each bin. But in fact, the probability that a particular bin does not contain any node is (1 - 1/N)^N. For large values of N this approaches e^(-1) ≈ 0.368.

As we discussed earlier, the consistent hashing paper solves this problem by associating keys with virtual nodes, and mapping multiple virtual nodes (with unrelated identifiers) to each real node. Intuitively, this will provide a more uniform coverage of the identifier space.


Figure 8: (a) The mean and 1st and 99th percentiles of the number of keys stored per node in a 10^4 node network. (b) The probability density function (PDF) of the number of keys per node. The total number of keys is 5 x 10^5.

For example, if we allocate log N randomly chosen virtual nodes to each real node, with high probability each of the N bins will contain O(log N) nodes [16]. We note that this does not affect the worst-case query path length, which now becomes O(log(N log N)) = O(log N).

To verify this hypothesis, we perform an experiment in which we allocate r virtual nodes to each real node. In this case keys are associated to virtual nodes instead of real nodes. We consider again a network with 10^4 real nodes and 10^6 keys. Figure 9 shows the 1st and 99th percentiles for r = 1, 2, 5, 10, and 20, respectively. As expected, the 99th percentile decreases, while the 1st percentile increases with the number of virtual nodes, r: the 99th percentile falls much closer to the mean value, while the 1st percentile rises from 0 toward the mean value. Thus, adding virtual nodes as an indirection layer can significantly improve load balance. The tradeoff is that routing table space usage will increase as each actual node now needs r times as much space to store the finger tables for its virtual nodes. However, we believe that this increase can be easily accommodated in practice. For example, assuming a network with N = 10^6 nodes, and assuming r = log N, each node has to maintain a table with only log^2 N ≈ 400 entries.

6.3 Path Length

The performance of any routing protocol depends heavily on the length of the path between two arbitrary nodes in the network. In the context of Chord, we define the path length as the number of nodes traversed during a lookup operation. From Theorem 2, with high probability, the length of the path to resolve a query is O(log N), where N is the total number of nodes in the network.

To understand Chord's routing performance in practice, we simulated a network with N = 2^k nodes, storing 100 x 2^k keys in all. We varied k from 3 to 14 and conducted a separate experiment for each value. Each node in an experiment picked a random set of keys to query from the system, and we measured the path length required to resolve each query.

Figure 10(a) plots the mean, and the 1st and 99th percentiles of path length as a function of k. As expected, the mean path length increases logarithmically with the number of nodes, as do the 1st and 99th percentiles. Figure 10(b) plots the PDF of the path length for a network with 2^12 nodes (k = 12).

Figure 10(a) shows that the path length is about (1/2) log N. The reason for the 1/2 is as follows. Consider some random node and a random query. Let the distance in identifier space be considered in binary representation.

Figure 11: The fraction of lookups that fail as a function of the fraction of nodes that fail.

The most significant bit of this distance can be corrected to 0 by following the node's corresponding finger. If the next significant bit of the distance is 1, it too needs to be corrected by following a finger, but if it is 0, then no finger is followed; instead, we move on to the next bit. In general, the number of fingers we need to follow will be the number of ones in the binary representation of the distance from node to query. Since the distance is random, we expect half the bits to be ones.

6.4 Simultaneous Node Failures

In this experiment, we evaluate the ability of Chord to regain consistency after a large percentage of nodes fail simultaneously. We consider again a 10^4-node network that stores 10^6 keys, and randomly select a fraction p of nodes that fail. After the failures occur, we wait for the network to finish stabilizing, and then measure the fraction of keys that could not be looked up correctly. A correct lookup of a key is one that finds the node that was originally responsible for the key before the failures; this corresponds to a system that stores values with keys but does not replicate the values or recover them after failures.

Figure 11 plots the mean lookup failure rate and the 95% confidence interval as a function of p. The lookup failure rate is almost exactly p. Since this is just the fraction of keys expected to be lost due to the failure of the responsible nodes, we conclude that there is no significant lookup failure in the Chord network.



Figure 10: (a) The path length as a function of network size. (b) The PDF of the path length in the case of a 2^12-node network.

Figure 12: The fraction of lookups that fail as a function of the rate (over time) at which nodes fail and join. Only failures caused by Chord state inconsistency are included, not failures due to lost keys.

For example, if the Chord network had partitioned in two equal-sized halves, we would expect one-half of the requests to fail because the querier and target would be in different partitions half the time. Our results do not show this, suggesting that Chord is robust in the face of multiple simultaneous node failures.
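To see why the lost fraction should track p, one can fail a random fraction p of nodes in a small consistent-hashing sketch (ours, not the paper's simulator; the node and key counts are arbitrary) and count the keys whose original successor is gone:

    import hashlib
    import random
    from bisect import bisect_left

    def chord_id(x):
        return int(hashlib.sha1(str(x).encode()).hexdigest(), 16)

    def lost_key_fraction(num_nodes=1000, num_keys=50000, p=0.2):
        ring = sorted((chord_id(("node", n)), n) for n in range(num_nodes))
        ids = [i for i, _ in ring]
        failed = set(random.sample(range(num_nodes), int(p * num_nodes)))
        lost = 0
        for k in range(num_keys):
            owner = ring[bisect_left(ids, chord_id(("key", k))) % num_nodes][1]
            lost += owner in failed
        return lost / num_keys

    print(lost_key_fraction(p=0.2))  # close to 0.2

Because keys are spread roughly evenly over nodes, the fraction of keys whose responsible node failed is close to p, which is exactly the failure rate the experiment measures.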

6.5 Lookups During Stabilization

A lookup issued after some failures but before stabilization has completed may fail for two reasons. First, the node responsible for the key may have failed. Second, some nodes' finger tables and predecessor pointers may be inconsistent due to concurrent joins and node failures. This section evaluates the impact of continuous joins and failures on lookups.

In this experiment, a lookup is considered to have succeeded if it reaches the current successor of the desired key. This is slightly optimistic: in a real system, there might be periods of time in which the real successor of a key has not yet acquired the data associated with the key from the previous successor. However, this method allows us to focus on Chord's ability to perform lookups, rather than on the higher-layer software's ability to maintain consistency of its own data. Any query failure will be the result of inconsistencies in Chord. In addition, the simulator does not retry queries: if a query is forwarded to a node that is down, the query simply fails. Thus, the results given in this section can be viewed as the worst-case scenario for the query failures induced by state inconsistency.

Because the primary source of inconsistencies is nodes joining and leaving, and because the main mechanism to resolve these inconsistencies is the stabilize protocol, Chord's performance will be sensitive to the frequency of node joins and leaves versus the frequency at which the stabilization protocol is invoked.

In this experiment, key lookups are generated according to a Poisson process at a rate of one per second. Joins and failures are modeled by a Poisson process with a mean arrival rate of R. Each node runs the stabilization routines at randomized intervals averaging 30 seconds; unlike the routines in Figure 7, the simulator updates all finger table entries on every invocation. The network starts with 500 nodes.

Figure 12 plots the average failure rates and confidence intervals. A node failure rate of 0.01 corresponds to one node joining and leaving every 100 seconds on average. For comparison, recall that each node invokes the stabilize protocol once every 30 seconds. In other words, the graph x axis ranges from a rate of 1 failure per 3 stabilization steps to a rate of 3 failures per one stabilization step. The results presented in Figure 12 are averaged over approximately two hours of simulated time. The confidence intervals are computed over 10 independent runs.

The results of Figure 12 can be explained roughly as follows. The simulation has 500 nodes, meaning lookup path lengths average around 5. A lookup fails if its finger path encounters a failed node. If k nodes fail, the probability that one of them is on the finger path is roughly 5k/500, or k/100. This would suggest a failure rate of about 3% if we have 3 failures between stabilizations. The graph shows results in this ballpark, but slightly worse, since it might take more than one stabilization to completely clear out a failed node.
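A back-of-the-envelope version of this estimate (our own helper, under the stated assumptions of 500 nodes, an average path of about 5 hops, and failures accumulating between 30-second stabilization rounds):

    def expected_failure_rate(nodes=500, path_len=5, failures=3):
        # Probability that at least one of the ~path_len hops lands on one of
        # the `failures` nodes that have failed since the last stabilization.
        p_hop_failed = failures / nodes
        return 1 - (1 - p_hop_failed) ** path_len

    print(expected_failure_rate())  # ~0.03, matching the ~3% estimate above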

6.6 Experimental Results

This section presents latency measurements obtained from a prototype implementation of Chord deployed on the Internet. The Chord nodes are at ten sites on a subset of the RON test-bed in the United States [1], in California, Colorado, Massachusetts, New York, North Carolina, and Pennsylvania. The Chord software runs on UNIX, uses 160-bit keys obtained from the SHA-1 cryptographic hash function, and uses TCP to communicate between nodes. Chord runs in the iterative style. These Chord nodes are part of an experimental distributed file system [7], though this section considers only the Chord component of the system.

Figure 13 shows the measured latency of Chord lookups over a range of numbers of nodes.



Figure 13: Lookup latency on the Internet prototype, as a function of the total number of nodes. Each of the ten physical sites runs multiple independent copies of the Chord node software.

Experiments with a number of nodes larger than ten are conducted by running multiple independent copies of the Chord software at each site. This is different from running O(log N) virtual nodes at each site to provide good load balance; rather, the intention is to measure how well our implementation scales even though we do not have more than a small number of deployed nodes.

For each number of nodes shown in Figure 13, each physical site issues 16 Chord lookups for randomly chosen keys, one by one. The graph plots the median, the 5th, and the 95th percentile of lookup latency. The median latency ranges from 180 to 285 ms, depending on the number of nodes. For the case of 180 nodes, a typical lookup involves five two-way message exchanges: four for the Chord lookup, and a final message to the successor node. Typical round-trip delays between sites are 60 milliseconds (as measured by ping). Thus the expected lookup time for 180 nodes is about 300 milliseconds, which is close to the measured median of 285 ms. The low 5th-percentile latencies are caused by lookups for keys close (in ID space) to the querying node and by query hops that remain local to the physical site. The high 95th percentiles are caused by lookups whose hops follow high-delay paths.

The lesson from Figure 13 is that lookup latency grows slowly with the total number of nodes, confirming the simulation results that demonstrate Chord's scalability.

7. Future Work

Based on our experience with the prototype mentioned in Section 6.6, we would like to improve the Chord design in the following areas.

Chord currently has no specific mechanism to heal partitioned rings; such rings could appear locally consistent to the stabilization procedure. One way to check global consistency is for each node n to periodically ask other nodes to do a Chord lookup for n; if the lookup does not yield node n, there may be a partition. This will only detect partitions whose nodes know of each other. One way to obtain this knowledge is for every node to know of the same small set of initial nodes. Another approach might be for nodes to maintain long-term memory of a random set of nodes they have encountered in the past; if a partition forms, the random sets in one partition are likely to include nodes from the other partition.
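A minimal sketch of this self-lookup check, assuming a hypothetical lookup(start_node, key) helper and a long-term random sample of previously encountered nodes (the names and structure here are ours, not part of the Chord protocol):

    import random

    def possibly_partitioned(self_id, remembered_nodes, lookup, sample_size=3):
        # Ask a few remembered nodes to resolve our own identifier; if none of
        # them routes back to us, we may be on the wrong side of a partition.
        candidates = random.sample(remembered_nodes,
                                   min(sample_size, len(remembered_nodes)))
        return all(lookup(node, self_id) != self_id for node in candidates)

A node for which this check returns true could then rejoin through one of the remembered nodes to merge the rings.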

A malicious or buggy set of Chord participants could present an incorrect view of the Chord ring. Assuming that the data Chord is being used to locate is cryptographically authenticated, this is a threat to availability of data rather than to authenticity. The same approach used above to detect partitions could help victims realize that they are not seeing a globally consistent view of the Chord ring.

An attacker could target a particular data item by inserting a node into the Chord ring with an ID immediately following the item's key, and having the node return errors when asked to retrieve the data. Requiring (and checking) that nodes use IDs derived from the SHA-1 hash of their IP addresses makes this attack harder.

Even O(log N) messages per lookup may be too many for some applications of Chord, especially if each message must be sent to a random Internet host. Instead of placing its fingers at distances that are all powers of 2, Chord could easily be changed to place its fingers at distances that are all integer powers of 1 + 1/d. Under such a scheme, a single routing hop could decrease the distance to a query to 1/(1 + d) of the original distance, meaning that log_{1+d} N hops would suffice. However, the number of fingers needed would increase to log N / log(1 + 1/d).
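To make the tradeoff concrete, the following sketch (illustrative only; the parameter names are ours) evaluates the hop count and finger count implied by a base of 1 + 1/d:

    import math

    def finger_tradeoff(n, d):
        # Each hop shrinks the remaining distance by a factor of (1 + d),
        # so about log base (1 + d) of N hops suffice.
        hops = math.log(n) / math.log(1 + d)
        # One finger per integer power of (1 + 1/d), i.e. log N / log(1 + 1/d).
        fingers = math.log(n) / math.log(1 + 1 / d)
        return hops, fingers

    for d in (1, 3, 7):
        hops, fingers = finger_tradeoff(2 ** 20, d)
        print(d, round(hops, 1), round(fingers, 1))
    # d = 1 recovers the usual Chord numbers for N = 2^20: ~20 hops and ~20 fingers.

Larger d buys fewer hops at the cost of proportionally more fingers to maintain.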

A different approach to improving lookup latency might be to use server selection. Each finger table entry could point to the first k nodes in that entry's interval on the ID ring, and a node could measure the network delay to each of the k nodes. The k nodes are generally equivalent for purposes of lookup, so a node could forward lookups to the one with lowest delay. This approach would be most effective with recursive Chord lookups, in which the node measuring the delays is also the node forwarding the lookup.
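A sketch of the server-selection rule, assuming each finger entry already carries its first k candidate nodes together with measured delays (the data layout and names are ours):

    def choose_next_hop(candidates):
        # candidates: list of (node, measured_delay_ms) for the first k nodes in
        # the finger entry's interval; all make equivalent routing progress, so
        # pick the one with the lowest measured delay.
        return min(candidates, key=lambda c: c[1])[0]

    print(choose_next_hop([("n17", 82.0), ("n23", 14.5), ("n42", 60.3)]))  # -> n23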

8. Conclusion

Many distributed peer-to-peer applications need to determine the node that stores a data item. The Chord protocol solves this challenging problem in a decentralized manner. It offers a powerful primitive: given a key, it determines the node responsible for storing the key's value, and does so efficiently. In the steady state, in an N-node network, each node maintains routing information for only about O(log N) other nodes, and resolves all lookups via O(log N) messages to other nodes. Updates to the routing information for nodes leaving and joining require only O(log^2 N) messages.

Attractive features of Chord include its simplicity, provable correctness, and provable performance even in the face of concurrent node arrivals and departures. It continues to function correctly, albeit at degraded performance, when a node's information is only partially correct. Our theoretical analysis, simulations, and experimental results confirm that Chord scales well with the number of nodes, recovers from large numbers of simultaneous node failures and joins, and answers most lookups correctly even during recovery.

We believe that Chord will be a valuable component for peer-to-peer, large-scale distributed applications such as cooperative file sharing, time-shared available storage systems, distributed indices for document and service discovery, and large-scale distributed computing platforms.

Acknowledgments

We thank Frank Dabek for the measurements of the Chord prototype described in Section 6.6, and David Andersen for setting up the testbed used in those measurements.

9. References

[1] ANDERSEN, D. Resilient overlay networks. Master's thesis, Department of EECS, MIT, May 2001. http://nms.lcs.mit.edu/projects/ron/.

[2] BAKKER, A., AMADE, E., BALLINTIJN, G., KUZ, I., VERKAIK, P., VAN DER WIJK, I., VAN STEEN, M., AND TANENBAUM, A. The Globe distribution network. In Proc. 2000 USENIX Annual Conf. (FREENIX Track) (San Diego, CA, June 2000), pp. 141–152.

[3] CHEN, Y., EDLER, J., GOLDBERG, A., GOTTLIEB, A., SOBTI, S., AND YIANILOS, P. A prototype implementation of archival intermemory. In Proceedings of the 4th ACM Conference on Digital Libraries (Berkeley, CA, Aug. 1999), pp. 28–37.

[4] CLARKE, I. A distributed decentralised information storage and retrieval system. Master's thesis, University of Edinburgh, 1999.

[5] CLARKE, I., SANDBERG, O., WILEY, B., AND HONG, T. W. Freenet: A distributed anonymous information storage and retrieval system. In Proceedings of the ICSI Workshop on Design Issues in Anonymity and Unobservability (Berkeley, California, June 2000). http://freenet.sourceforge.net.

[6] DABEK, F., BRUNSKILL, E., KAASHOEK, M. F., KARGER, D., MORRIS, R., STOICA, I., AND BALAKRISHNAN, H. Building peer-to-peer systems with Chord, a distributed location service. In Proceedings of the 8th IEEE Workshop on Hot Topics in Operating Systems (HotOS-VIII) (Elmau/Oberbayern, Germany, May 2001), pp. 71–76.

[7] DABEK, F., KAASHOEK, M. F., KARGER, D., MORRIS, R., AND STOICA, I. Wide-area cooperative storage with CFS. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP '01) (To appear; Banff, Canada, Oct. 2001).

[8] DRUSCHEL, P., AND ROWSTRON, A. Past: Persistent and anonymous storage in a peer-to-peer networking environment. In Proceedings of the 8th IEEE Workshop on Hot Topics in Operating Systems (HotOS 2001) (Elmau/Oberbayern, Germany, May 2001), pp. 65–70.

[9] FIPS 180-1. Secure Hash Standard. U.S. Department of Commerce/NIST, National Technical Information Service, Springfield, VA, Apr. 1995.

[10] Gnutella. http://gnutella.wego.com/.

[11] KARGER, D., LEHMAN, E., LEIGHTON, F., LEVINE, M., LEWIN, D., AND PANIGRAHY, R. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing (El Paso, TX, May 1997), pp. 654–663.

[12] KUBIATOWICZ, J., BINDEL, D., CHEN, Y., CZERWINSKI, S., EATON, P., GEELS, D., GUMMADI, R., RHEA, S., WEATHERSPOON, H., WEIMER, W., WELLS, C., AND ZHAO, B. OceanStore: An architecture for global-scale persistent storage. In Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2000) (Boston, MA, November 2000), pp. 190–201.

[13] LEWIN, D. Consistent hashing and random trees: Algorithms for caching in distributed networks. Master's thesis, Department of EECS, MIT, 1998. Available at the MIT Library, http://thesis.mit.edu/.

[14] LI, J., JANNOTTI, J., DE COUTO, D., KARGER, D., AND MORRIS, R. A scalable location service for geographic ad hoc routing. In Proceedings of the 6th ACM International Conference on Mobile Computing and Networking (Boston, Massachusetts, August 2000), pp. 120–130.

[15] MOCKAPETRIS, P., AND DUNLAP, K. J. Development of the Domain Name System. In Proc. ACM SIGCOMM (Stanford, CA, 1988), pp. 123–133.

[16] MOTWANI, R., AND RAGHAVAN, P. Randomized Algorithms. Cambridge University Press, New York, NY, 1995.

[17] Napster. http://www.napster.com/.

[18] Ohaha, Smart decentralized peer-to-peer sharing. http://www.ohaha.com/design.html.

[19] PLAXTON, C., RAJARAMAN, R., AND RICHA, A. Accessing nearby copies of replicated objects in a distributed environment. In Proceedings of the ACM SPAA (Newport, Rhode Island, June 1997), pp. 311–320.

[20] RATNASAMY, S., FRANCIS, P., HANDLEY, M., KARP, R., AND SHENKER, S. A scalable content-addressable network. In Proc. ACM SIGCOMM (San Diego, CA, August 2001).

[21] STOICA, I., MORRIS, R., KARGER, D., KAASHOEK, M. F., AND BALAKRISHNAN, H. Chord: A scalable peer-to-peer lookup service for Internet applications. Tech. Rep. TR-819, MIT LCS, March 2001. http://www.pdos.lcs.mit.edu/chord/papers/.

[22] VAN STEEN, M., HAUCK, F., BALLINTIJN, G., AND TANENBAUM, A. Algorithmic design of the Globe wide-area location service. The Computer Journal 41, 5 (1998), 297–310.
