topology-aware routing in structured peer-to-peer overlay...

19
Topology-aware routing in structured peer-to-peer overlay networks Miguel Castro Peter Druschel Y. Charlie Hu Antony Rowstron Microsoft Research, 7 J J Thomson Close, Cambridge, CB3 0FB, UK. Rice University, 6100 Main Street, MS-132, Houston, TX 77005, USA. Purdue University, 1285 EE Building, West Lafayette, IN 47907, USA. Abstract Structured peer-to-peer (p2p) overlay networks like CAN, Chord, Pastry and Tapestry offer a novel platform for a variety of scalable and decentralized distributed ap- plications. They provide efficient and fault-tolerant rout- ing, object location and load balancing within a self- organizing overlay network. One important aspect of these systems is how they exploit network proximity in the underlying Internet. We present a study of topology- aware routing approaches in p2p overlays, identify prox- imity neigbor selection as the most promising technique, and present an improved design in Pastry. Results ob- tained via analysis and via simulation of two large-scale topology models indicate that it is possible to efficiently exploit network proximity in self-organizing p2p sub- strates. Proximity neighbor selection incurs only a mod- est additional overhead for organizing and maintaining the overlay network. The resulting locality properties im- prove application performance and reduce network us- age in the Internet substantially. Finally, we show that the impact of proximity neighbor selection on the load balancing in the p2p overlay is minimal. 1 Introduction Several recent systems (e.g., CAN, Chord, Pastry and Tapestry [7, 13, 10, 16, 6]) provide a self-organizing sub- strate for large-scale peer-to-peer applications. Among other uses, these systems can implement a scalable, fault- tolerant distributed hash table, in which any item can be located within a bounded number of routing hops, using a small per-node routing table. While there are algorith- mic similarities among each of these systems, one impor- tant distinction lies in the approach they take to consider- ing and exploiting proximity in the underlying Internet. Chord in its original design, for instance, does not con- sider network proximity at all. As a result, its protocol for maintaining the overlay network is very light-weight, but messages may travel arbitrarily long distances in the Internet in each routing hop. In a version of CAN, each node measures its network delay to a set of landmark nodes, in an effort to deter- mine its relative position in the Internet and to construct an Internet topology-aware overlay. Tapestry and Pastry construct a topology-aware overlay by choosing nearby nodes for inclusion in their routing tables. Early results for the resulting locality properties are promising. How- ever, these results come at the expense of a significanly more expensive overlay maintenance protocol, relative to Chord. Also, proximity based routing may compromise the load balance in the p2p overlay network. Moreover, it remains unclear to what extent the locality properties hold in the actual Internet, with its complex, dynamic, and non-uniform topology. As a result, the cost and ef- fectiveness of proximity based routing in these p2p over- lays remain unclear. This paper presents a study of proximity based routing in structured p2p overlay networks, and presents results of an analysis and of simulations based on two large- scale Internet topology models. The specific contribu- tions of this paper include a comparison of approaches to proximity based routing in structured p2p overlay networks, which identifies proximity neighbor selection in prefix- based protocols like Tapestry and Pastry as the most promising technique; 1

Upload: nguyenanh

Post on 01-May-2019

218 views

Category:

Documents


0 download

TRANSCRIPT

Topology-a ware routing in structured peer-to-peer overla ynetw orks

Miguel Castro�

PeterDruschel�

Y. CharlieHu�

Antony Rowstron�

�Microsoft Research,7 JJ ThomsonClose,Cambridge,CB3 0FB,UK.�RiceUniversity, 6100Main Street,MS-132,Houston,TX 77005,USA.�PurdueUniversity, 1285EE Building, WestLafayette,IN 47907,USA.

Abstract

Structured peer-to-peer (p2p) overlay networks likeCAN,Chord, PastryandTapestryoffer a novel platformfor avarietyof scalableanddecentralizeddistributedap-plications.They provideefficientandfault-tolerant rout-ing, object location and load balancing within a self-organizing overlay network. One important aspectofthesesystemsis how they exploit networkproximity intheunderlyingInternet. We presenta studyof topology-awareroutingapproachesin p2poverlays,identifyprox-imity neigborselectionasthemostpromisingtechnique,and presentan improved designin Pastry. Resultsob-tainedvia analysisandvia simulationof two large-scaletopology modelsindicatethat it is possibleto efficientlyexploit network proximity in self-organizing p2p sub-strates.Proximityneighborselectionincurs only a mod-estadditional overheadfor organizingand maintainingtheoverlaynetwork.Theresultinglocality propertiesim-prove applicationperformanceand reducenetworkus-age in the Internetsubstantially. Finally, we showthatthe impact of proximity neighborselectionon the loadbalancingin thep2poverlayis minimal.

1 Intr oduction

Several recentsystems(e.g., CAN, Chord, Pastry andTapestry[7, 13, 10, 16, 6]) provideaself-organizingsub-stratefor large-scalepeer-to-peerapplications. Amongotheruses,thesesystemscanimplementascalable,fault-tolerantdistributedhashtable,in which any item canbelocatedwithin a boundednumberof routinghops,usinga smallper-noderoutingtable.While therearealgorith-mic similaritiesamongeachof thesesystems,oneimpor-

tantdistinctionlies in theapproachthey take to consider-ing andexploiting proximity in the underlyingInternet.Chordin its original design,for instance,doesnot con-sidernetwork proximity at all. As a result, its protocolfor maintainingtheoverlaynetwork is very light-weight,but messagesmaytravel arbitrarily long distancesin theInternetin eachroutinghop.

In a versionof CAN, eachnodemeasuresits networkdelay to a setof landmarknodes,in an effort to deter-mineits relative positionin theInternetandto constructanInternettopology-awareoverlay. TapestryandPastryconstructa topology-awareoverlay by choosingnearbynodesfor inclusionin their routing tables.Early resultsfor theresultinglocality propertiesarepromising.How-ever, theseresultscomeat theexpenseof a significanlymoreexpensiveoverlaymaintenanceprotocol,relative toChord. Also, proximity basedroutingmaycompromisethe loadbalancein thep2poverlaynetwork. Moreover,it remainsunclearto what extent the locality propertieshold in the actualInternet,with its complex, dynamic,andnon-uniformtopology. As a result,thecostandef-fectivenessof proximity basedroutingin thesep2pover-laysremainunclear.

Thispaperpresentsastudyof proximity basedroutingin structuredp2poverlaynetworks, andpresentsresultsof an analysisand of simulationsbasedon two large-scaleInternettopology models. The specificcontribu-tionsof thispaperinclude

� a comparisonof approachesto proximity basedrouting in structuredp2p overlay networks, whichidentifies proximity neighbor selectionin prefix-basedprotocolslikeTapestryandPastryasthemostpromisingtechnique;

1

� improved nodejoin and overlay maintenancepro-tocols for proximity neighborselectionin Pastry,whichsignificanclyreducetheoverheadof creatingandmaintaininga topology-awareoverlay;� astudyof thecostsandbenefitsof proximity neigh-bor selectionvia analysisandsimulationbasedontwo large-scaleInternettopologymodels;� a studyof the impactof proximity neighborselec-tion on theloadbalancingin thep2poverlaybasedonsimulationson a large-scaletopologymodel.

Comparedto theoriginal Pastrypaper[10], this workaddsa comparisonwith other proposedapproachestotopology-awarerouting,new nodejoin andoverlaymain-tenanceprotocolsthat dramaticallyreducethe cost ofoverlayconstructionandmaintenance,anew protocoltolocateanearbycontactnode,resultsof a formalanalysisof Pastry’s routing propertiesand extensive simulationresultson two differentnetwork topologymodels.

The rest of this paperis organizedas follows. Pre-vious work on structuredp2p overlays is discussedinSection2. Approachesto topology-awareroutingin p2poverlaysarepresentedin Section3. Section4 presentsPastry’s implementationof proximity neighborselection,includingnew efficient protocolsfor nodejoin andover-lay maintenance.An analysisof Pastry’s locality proper-tiesfollow in Section5. Section6 presentsexperimentalresults,andwe concludein Section7.

2 Background and prior work

In this section,we presentsomebackgroundon struc-turedp2p overlay protocolslike CAN, Chord,TapestryandPastry. (We do not considerunstructuredp2pover-layslikeGnutellaandFreenetin thispaper[1, 2]). Spacelimitationspreventusfrom a detaileddiscussionof eachprotocol.Instead,we give a moredetaileddescriptionofPastry, as an exampleof a structuredp2p overlay net-work, and then point out relevant differenceswith theotherprotocols.

2.1 Pastry

Pastry is a scalable,fault resilient, and self-organizingpeer-to-peersubstrate.EachPastrynodehasa unique,uniform randomlyassignednodeIdin a circular 128-bit

identifier space. Given a 128-bit key, Pastry routesanassociatedmessagetowardsthe live nodewhosenodeIdis numericallyclosestto thekey. Moreover, eachPastrynodekeepstrackof its neighboringnodesin thenames-paceandnotifiesapplicationsof changesin theset.

Node state: For the purposeof routing, nodeIdsandkeys are thoughtof as a sequenceof digits in base

���( � is a configurationparameterwith typical value4). Anode’sroutingtableis organizedinto � ����� � rowsand

� �columns.The

���entriesin row of theroutingtablecon-

tain the IP addressesof nodeswhosenodeIdssharethefirst digits with thepresentnode’s nodeId;the ���� thnodeIddigit of thenodein column � of row equals� .Thecolumnin row thatcorrespondsto thevalueof the ���� ’s digits of thelocal node’s nodeIdremainsempty.Figure1 depictsasampleroutingtable.

A routingtableentry is left emptyif no nodewith theappropriatenodeIdprefixis known. Theuniformrandomdistribution of nodeIdsensuresanevenpopulationof thenodeIdspace;thus,on averageonly ������� ����� � levelsarepopulatedin theroutingtable.Eachnodealsomaintainsa leaf set. The leaf setis thesetof � nodeswith nodeIdsthatarenumericallyclosestto thepresentnode’s nodeId,with � ��� largerand � ��� smallernodeIdsthanthecurrentnode’s id. A typical value for � is approximately � "!���#� �%$ �&� . The leaf setensuresreliablemessagedeliveryandis usedto storereplicasof applicationobjects.

Messagerouting: At eachroutingstep,a nodeseekstoforwardthemessageto anodewhosenodeIdshareswiththekey aprefix thatis at leastonedigit (or � bits) longerthanthecurrentnode’ssharedprefix. If nosuchnodecanbe found in the routing table,the messageis forwardedto a nodewhosenodeIdsharesa prefix with the key aslong asthecurrentnode,but is numericallycloserto thekey thanthe presentnode’s id. Several suchnodescannormally be found in the routing table; moreover, sucha nodeis guaranteedto exist in the leaf set unlessthemessagehasalreadyarrivedat thenodewith numericallyclosestnodeIdor its immediateneighbor. And, unlessall� ��� nodesin onehalf of the leaf sethave failedsimulta-neously, at leastoneof thosenodesmustbelive.

The Pastry routing procedureis shown in Figure 3.Figure2 shows thepathof anexamplemessage.Analy-sisshowsthattheexpectednumberof forwardinghopsisslightly below �����#� ���'�&� , with a distribution that is tightaroundthe mean. Moreover, simulationshows that theroutingis highly resilientto nodefailures.

2

0x( 1

x( 2x( 3)x( 4

x( 5*x( 7

x( 8+x( 9,x( a-

x( b.x( c/

x( d0x( e1

x( f2x(

60x( 6

1x( 6

2x( 6

3)x( 6

4x( 6

6x( 6

7x( 6

8+x( 6

9,x( 6

a-x( 6

b.x( 6

c/x( 6

d0x( 6

e1x( 6

f2x(

65*0x(

65*13x(

65*24x(

65*3)x(

65*45x(

65*5*x(

65*6x(

65*7x(

65*8+x(

65*9,x(

65*b.x(

65*c/x(

65*d0x(

65*e1x(

65*f2x(

65*a-0x(

65*a-2x(

65*a-3)x(

65*a-4x(

65*a-5*x(

65*a-6x(

65*a-7x(

65*a-8+x(

65*a-9,x(

65*a-a-x(

65*a-b.x(

65*a-c/x(

65*a-d0x(

65*a-e1x(

65*a-f2x(

Figure1: Routingtableof a Pastrynodewith nodeId 67�89�;: , �=<?> . Digits areinbase16, : representsanarbitrarysuffix.

d@4A6BaC 1cD

RoE uF tG eH (I d@ 46BaC 1cD )J

d@46B2bKaC

d@4213Lf

d@13Ld@aC 3L

6B5MaC 1fcD

d@46B7NcD 4d

@4A7N1fO1

OP

2Q

128 - 1

Figure2: Routinga messagefrom node67�8R�TSVU with key W�>X6�89�YU . Thedotsdepictlivenodesin Pastry’scircularnamespace.

2.2 CAN, Chord, Tapestry

Next, webriefly describeCAN, ChordandTapestry, withan emphasison the differencesof theseprotocolswhencomparedto Pastry.

Tapestryis very similar to Pastrybut differs in its ap-proachto mappingkeys to nodesin the sparselypopu-lated id space,and in how it managesreplication. InTapestry, thereis no leaf set and neighboringnodesinthe namespaceare not aware of eachother. When anode’s routing table doesnot have an entry for a nodethatmatchesa key’s th digit, themessageis forwardedto the nodewith the next highervalue in the th digit,modulo

� �, found in the routing table. This procedure,

calledsurrogaterouting, mapskeys to auniquelivenodeif thenoderouting tablesareconsistent.For fault toler-ance,Tapestryinsertsreplicasof dataitemsusingdiffer-entkeys.

Like Pastry, Chord usesa circular id space. UnlikePastry, Chord forwardsmessagesonly in clockwisedi-rection in the circular id space. Insteadof the prefix-basedrouting table in Pastry, Chord nodesmaintainafingertable,consistingof up to � � pointersto otherlivenodes.The Z th entry in thefinger tableof node refersto thelivenodewith thesmallestnodeIdclockwisefrom [� �]\�^ � . The first entry points to ’s successor, andsubsequententriesrefer to nodesat repeatedlydoublingdistancesfrom . Eachnodealsomaintainspointerstoits predecessorandto its successorsin theid space(thesuccessorlist). Similar to Pastry’s leafset,thissuccessorlist is usedto replicateobjectsfor fault tolerance.Theexpectednumberof routinghopsin Chordis

�� ���#� � � .

CAN routesmessagesin a W -dimensionalspace,whereeachnodemaintainsa routing table with _a`�Wcb entriesandany nodecanbereachedin _a`�W � �edgf b routinghops.Theentriesin anode’s routingtablereferto its neighborsin the W -dimensionalspace.Unlike Pastry, TapestryandChord,CAN’s routing tabledoesnot grow with thenet-work size,but the numberof routing hopsgrows fasterthan ���#� � in this case.

3 Topology-aware routing

In this section, we describeand compare three ap-proachesto topology-aware routing in structuredover-lay networksthathave beenproposed,namelytopology-basednodeIdassignment, proximityrouting, andproxim-ity neighborselection[9].Proximity routing: With proximity routing,theoverlayis constructedwithout regard for the physicalnetworktopology. The techniqueexploits the fact that when amessageis routed,therearepotentiallyseveral possiblenext hop neighborsthat arecloserto the message’s keyin the id space.The ideais to select,amongthe possi-ble next hops,theonethat is closestin thephysicalnet-work or onethatrepresentsagoodcompromisebetweenprogressin theid spaceandproximity. With h alternativehopsin eachstep,theapproachcanreducetheexpecteddelay in eachhop from the averagedelaybetweentwonodesto theexpecteddelayof thenearestamongh nodeswith randomlocationsin thenetwork. Themain limita-tion is that the benefitsdependon the magnitudeof h ;with practicalprotocols, h is small. Moreover, depend-

3

(1)i if ( jlk monqpsrqtvuwr;rqxzy|{~}l�����Y��{z�����'� )(2) // j is within rangeof local leaf set(mod �]� ��� )(3) forwardto {�� , s.th. � js��{��e� is minimal;(4) else(5) // usetheroutingtable(6) Let ����n;�X�ey|jl�e�� ;(7) if ( ������ existsandis live)(8) forwardto � � �� ;(9) else(10) // rarecase(11) forwardto t���{���� , s.th.(12) n;�X�ey�t��ej���� � ,(13) � tV��jc�¡¢� �£��jc�

Figure 3: Pastry routing procedure,executedwhen amessagewith key W arrivesatanodewith nodeId8 . ¤ \¥ istheentryin theroutingtable ¤ atcolumn Z androw � . ¦ \is thei-th closestnodeIdin theleaf set ¦ , wherea nega-tive/positive index indicatescounterclockwise/clockwisefrom the local nodein the id space,respectively. ¦ ^ ¥ dg�and ¦ ¥ dg� arethenodesat theedgesof the local leaf set.W ¥ representsthe � ’s digit in the key W . §]¨���`�8R©��;b is thelengthof theprefix sharedamong8 and � , in digits.

ing ontheoverlayprotocol,greedilychoosingtheclosesthopmay leadto an increasein thetotal numberof hopstaken. While proximity routingcanyield significantim-provementsover a systemwith no topology-awarerout-ing, its performancefalls shortof what canbe achievedwith the following two approaches.The techniquehasbeenusedin CAN andChord[7, 4].

Topology-basednodeId assignment: Topology-basednodeIdassignmentattemptsto map the overlay’s logi-cal id spaceonto the physicalnetwork suchthat neigh-bouring nodesin the id spaceareclosein the physicalnetwork. The techniquehasbeensuccessfullyusedin aversionof CAN, andhasachieved delaystretchresultsof two or lower [7, 8]. However, the approachhassev-eral drawbacks. First, it destroys the uniform popula-tion of the id space,causingloadbalancingproblemsinthe overlay. Second,the approachdoesnot work wellin overlaysthat usea one-dimensionalid space(Chord,Tapestry, Pastry), becausethe mappingis overly con-strained. Lastly, neighboringnodesin the id spacearemorelikely to suffer correlatedfailures,which canhaveimplicationsfor robustnessandsecurityin protocolslikeChordandPastry, which replicateobjectson neighborsin theid space.

Proximity neighbour selection: Like the previoustechnique, proximity neighbor selection constructsatopology-awareoverlay. However, insteadof biasingthenodeIdassignment,theideais to chooseroutingtableen-triesto referto thetopologicallynearestamongall nodeswith nodeIdin the desiredportion of the id space.Thesuccessof this techniquedependson thedegreeof free-dom an overlay protocol hasin choosingrouting tableentrieswithout affecting the expectednumberof rout-ing hops. In prefix-basedprotocolslike TapestryandPastry, the upperlevels of the routing tableallow greatfreedomin this choice,with lower levels leaving expo-nentially lesschoice. As a result,theexpecteddelayofthefirst hop is very low, it increasesexponentiallywitheachhop, andthe delayof the final hop dominates.Asonecanshow, this leadsto low delaystretchandotherusefulproperties.A limitation of this techniqueis thatitdoesnotwork for overlayprotocolslikeCAN andChord,which requirethat routing tableentriesrefer to specificpointsin theid space.

Discussion: Proximity routing is the most light-weighttechnique,sinceit doesnot constructa topology-awareoverlay. But, its performanceis limited sinceit canonlyreducetheexpectedper-hopdelayto theexpecteddelayof the nearestamonga small number h of nodeswithrandomlocationsin the network. With topology-awarenodeIdassignment,the expectedper-hop delay can beaslow asthe averagedelayamongneighboringoverlaynodesin the network. However, the techniquesuffersfrom loadimbalanceandrequiresa high-dimensionalidspaceto beeffective.

Proximity-neighborselectioncanbeviewedasacom-promise that preserves the load balanceand robust-nessaffordedby a randomnodeIdassignment,but stillachieves a small constantdelay stretch. In the follow-ing sections,we show that proximity neighborselec-tion can be implementedin Pastry and Tapestrywithlow overhead,that it achievescomparabledelaystretchto topology-basednodeId assignmentwithout sacrific-ing load balancingor robustness,and that is hasaddi-tional route convergencepropertiesthat facilitate effi-cientcachingandmulticastingin theoverlay. Moreover,we confirm theseresultsvia simulationson two large-scaleInternettopologymodels.

4

4 Prª

oximity neighbor selection:Pastry

This sectionshows how proximity basedneighborse-lection is usedin Pastry. We describenew node joinandoverlay maintenanceprotocolsthat significantlyre-ducetheoverheadcomparedto theoriginalprotocolsde-scribedin [10]. Moreover, we presenta new protocolthat allows nodesthat wish to join the overlay to locateanappropriatecontactnode.

It isassumedthateachPastrynodecanmeasureits dis-tance,in termsof a scalarproximity metric,to any nodewith aknown IP address.Thechoiceof aproximity met-ric dependson thedesiredqualitiesof theresultingover-lay (e.g.,low delay, highbandwidth,low network utiliza-tion). In practice,averageround-triptime hasproven tobeagoodmetric.

Pastryusesproximity neighborselectionasintroducedin theprevioussection.Selectingroutingtableentriestorefer to the preciselynearestnodewith an appropriatenodeIdis expensive in a largesystem,becauseit requires_�` � b communication.Therefore,Pastryusesheuristicsthatrequireonly _a`������ � � � b communicationbut only en-surethatroutingtableentriesareclosebut notnecessarilytheclosest.More precisely, Pastryensuresthefollowinginvariantfor eachnode’s routingtable:Proximity invariant: Each entry in a node « ’s routingtable refers to a nodethat is near « , according to theproximity metric, amongall live Pastry nodeswith theappropriatenodeIdprefix.

In Section4.1,weshow how Pastry’snodejoiningpro-tocol maintainsthe proximity invariant. Next, we con-sider the effect of the proximity invariant on Pastry’srouting. Observe that as a result of the proximity in-variant, a messageis normally forwardedin eachrout-ing step to a nearbynode, accordingto the proximitymetric, amongall nodeswhosenodeIdsharesa longerprefix with the key. Moreover, the expecteddistancetraveled in eachconsecutive routing step increasesex-ponentially, becausethe densityof nodesdecreasesex-ponentiallywith the length of the prefix match. Fromthis property, onecanderive threedistinct propertiesofPastrywith respectto network locality:Total distancetraveled (delay stretch): The expecteddistanceof the last routing step tendsto dominatethetotal distancetraveledby a message.As a result,theav-eragetotal distancetraveled by a messageexceedsthedistancebetweensourceanddestinationnodeonly by asmallconstantvalue.

Local route convergence:Thepathsof two Pastrymes-sagessentfrom nearbynodeswith identicalkeys tendtoconverge at a nodenearthe sourcenodes,in the prox-imity space.To seethis,observe thatin eachconsecutiveroutingstep,themessagestravel exponentiallylargerdis-tancestowardsan exponentiallyshrinkingsetof nodes.Thus,theprobabilityof a routeconvergenceincreasesineachstep,even in the casewhereearlier(smaller)rout-ing stepshavemovedthemessagesfartherapart.Thisre-sult hassignificancefor cachingapplicationslayeredonPastry. Popularobjectsrequestedby a nearbynodeandcachedby all nodesalongtheroutearelikely to befoundwhenanothernearbynoderequeststheobject.Also, thispropertyis exploited in Scribe[12] to achieve low linkstressin anapplicationlevel multicastsystem.Locating the nearest replica: If replicasof an objectarestoredon h nodeswith adjacentnodeIds,Pastrymes-sagesrequestingtheobjecthave a tendency to first reacha nodenear the client node. To seethis, observe thatPastrymessagesinitially take smallstepsin theproxim-ity space,but large stepsin the nodeIdspace.Applica-tions can exploit this propertyto make surethat clientrequestsfor anobjecttendto behandledby areplicathatis neartheclient. Exploiting thispropertyis application-specific,andis discussedin [11].

An analysisof thesepropertiesfollows in Section5.Simulation and measurementresults that confirm andquantifythesepropertiesfollow in Section6.

4.1 Maintaining the overlay

Next, we presentthe new protocolsfor nodejoin, nodefailureandroutingtablemaintenancein Pastryandshowhow theseprotocolsmaintain the proximity invariant.The new nodejoin and routing table maintenancepro-tocolssupersedethe“secondphase”of the join protocoldescribedin the original Pastrypaper, which hadmuchhigheroverhead[10].

When joining the Pastry overlay, a new node withnodeId« mustcontactanexistingPastrynode¬ . ¬ thenroutesa messageusing « asthekey, andthenew nodeobtainsthe th row of its routing table from the nodeencounteredalongthe pathfrom ¬ to « whosenodeIdmatches« in the first ®­�� digits. We will show thattheproximity invariantholdson « ’s resultingroutingta-ble,if node¬ is nearnode« , accordingto theproximitymetric.

First, considerthe top row of « ’s routing table,ob-

5

tainedfrom node ¬ . Assumingthe triangle inequalityholdsin theproximity space,it is easyto seethattheen-tries in thetop row of ¬ ’s routingtablearealsocloseto« . Next, considerthe th row of « ’s routingtable,ob-tainedfrom thenode¬°¯ encounteredalongthepathfrom¬ to « . By induction,this nodeis Pastry’s approxima-tion to thenodeclosestto ¬ thatmatches« ’s nodeIdinthefirst  ­±� digits. Therefore,if the triangleinequal-ity holds,wecanusethesameargumentto concludethattheentriesof the th row of ¬ ¯ ’s routingtableshouldbecloseto « .

At thispoint,wehaveshown thattheproximity invari-antholdsin « ’s routingtable.To show thatthenodejoinprotocolmaintainstheproximity invariantglobally in allPastrynodes,we mustnext show how theroutingtablesof otheraffectednodesareupdatedto reflect « ’s arrival.Once« hasinitialized its own routingtable,it sendsthe th row of its routing tableto eachnodethatappearsasan entry in that row. This serves both to announceitspresenceandto propagateinformationaboutnodesthatjoinedpreviously. Eachof thenodesthat receivesa rowtheninspectsthe entriesin the row, performsprobestomeasureif « or oneof theentriesis nearerthanthecor-respondingentryin its own routingtable,andupdatesitsroutingtableasappropriate.

To seethat this procedureis sufficient to restoretheproximity invariantin all affectednodes,considerthat «andthenodesthatappearin row of « ’s routing tableform a groupof

� �nearbynodeswhosenodeIdsmatch

in the first digits. It is clear that thesenodesneedtoknow of « ’sarrival, since« maydisplaceamoredistantnodein oneof the node’s routing tables. Conversely, anodewith identicalprefix in thefirst digits that is nota memberof this groupis likely to bemoredistantfromthe membersof the group,andthereforefrom « ; thus,« ’s arrival is not likely to affect its routing table and,with high probability, it doesnot needto beinformedof« ’s arrival.Node failur e: Failed routing tablesentriesarerepairedlazily, whenever a routing tableentry is usedto routeamessage.Pastryroutesthemessageto anothernodewithnumericallyclosernodeId.If thedownstreamnodehasaroutingtableentrythatmatchesthenext digit of themes-sage’skey, it automaticallyinformstheupstreamnodeofthatentry.

Weneedto show thattheentrysuppliedby thisproce-dure satisfiesthe proximity invariant. If a numericallyclosernodecan be found in the routing table, it must

be an entry in the samerow as the failed node. If thatnodesuppliesa substituteentry for the failed node,itsexpecteddistancefrom the local nodeis thereforelow,sinceall threenodesarepartof thesamegroupof nearbynodeswith identicalnodeIdprefix. On the otherhand,if no replacementnodeis suppliedby the downstreamnode,we trigger theroutingtablemaintenancetask(de-scribedin the next section)to find a replacemententry.In eithercase,theproximity invariantis preserved.

Routing table maintenance:Theroutingtableentriesproducedby thenodejoin protocolandtherepairmech-anismsarenot guaranteedto be the closestto the localnode. Several factorscontribute to this, including theheuristicnatureof thenodejoin andrepairmechanismswith respectto locality. Also, many practicalproxim-ity metricsdo not strictly satisfy the triangle inequalityandmay vary over time. However, limited imprecisionis consistentwith theproximity invariant,andaswe willshow in Section6, it doesnot have a significantimpactonPastry’s locality properties.

However, oneconcernis thatdeviationscouldcascade,leadingto a slow deteriorationof the locality propertiesover time. To preventa deteriorationof theoverall routequality, eachnoderunsa periodicrouting tablemainte-nancetask(e.g.,every 20 minutes). The taskperformsthefollowing procedurefor eachrow of thelocal node’srouting table. It selectsa randomentry in the row, andrequestsfrom the associatednodea copy of that node’scorrespondingrouting tablerow. Eachentry in that rowis thencomparedto thecorrespondingentry in the localroutingtable.If they differ, thenodeprobesthedistanceto both entriesand installs the closestentry in its ownroutingtable.

The intuition behindthis maintenanceprocedureis toexchangerouting information amonggroupsof nearbynodeswith identicalnodeIdprefix. A nearbynodewiththeappropriateprefixmustbeknow to at leastonemem-ber of the group; the procedureensuresthat the entiregroupwill eventuallylearnof thenode,andadjusttheirroutingtablesaccordingly.

Whenever a Pastrynodereplacesa routingtableentrybecausea closernodewas found, the previous entry iskept in a list of alternateentries(up to ten suchentriesaresaved in theimplementation).Whentheprimaryen-try fails, oneof the alternatesis useduntil andunlessacloserentry is foundduringthenext periodicroutingta-blemaintenance.

6

(1)disco² ver(seed)(2) nodes= getLeafSet(seed)(3) forall nodein nodes(4) nearNode= closerToMe(node,nearNode)(5) depth= getMaxRoutingTableLevel(nearNode)(6) while (depth ³ 0)(7) nodes= getRoutingTable(nearNode,depth- -)(8) forall nodein nodes(9) nearNode= closerToMe(node,nearNode)(10) endwhile(11) do(12) nodes= getRoutingTable(nearNode,0)(13) currentClosest= nearNode(14) forall nodein nodes(15) nearNode= closerToMe(node,nearNode)(16) while (currentClosest!= nearNode)(17) returnnearNode

Figure 4: Simplified nearbynodediscovery algorithm.seedis the Pastry node initially known to the joiningnode.

4.2 Locating a nearby node

Recall that for the nodejoin algorithm to preserve theproximity invariant,thestartingnode ¬ mustbeclosetothenew node « , amongall live Pastrynodes.This begsthe questionof how a newly joining nodecandetectanearbyPastrynode. Oneway to achieve this is to per-form an“expandingring” IP multicast,but this assumestheavailability of IP multicast.In Figure4, we presentanew, efficient algorithmby which anodemaydiscover anearbyPastrynode,giventhatit hasknowledgeof somePastrynodeat any location.Thus,a joining nodeis onlyrequiredto obtainknowledgeof any Pastrynodethroughout-of-bandmeans,asopposedto obtainingknowledgeof a nearbynode. The algorithm exploits the propertythat location of the nodesin the seeds’leaf set shouldbe uniformly distributed over the network. Next, hav-ing discoveredthe closestleaf set member, the routingtabledistancepropertiesareexploitedto move exponen-tially closerto the locationof the joining node. This isachieved bottomup by picking the closestnodeat eachlevel andgettingthenext level from it. Thelastphasere-peatstheprocessfor thetop level until no moreprogressis made.

5 Analysis

In this section,we presentanalyticalresultsfor Pastry’srouting properties.First, we analyzethe distribution ofthenumberof routinghopstakenwhenaPastrymessagewith arandomlychosenkey is sentfrom arandomlycho-senPastrynode.Thisanalysisthenformsthebasisfor ananalysisof Pastry’s locality properties.Throughoutthisanalysis,we assumethateachPastrynodehasa perfectroutingtable.Thatis, aroutingtableentrymaybeemptyonly if no nodewith anappropriatenodeIdprefix exists,andall routingtableentriespoint to thenearestnode,ac-cordingto theproximity metric. In practice,Pastrydoesnot guaranteeperfectrouting tables. Simulationresultspresentedin Section6 show thattheperformancedegra-dationdueto this inaccuracy is minimal. In the follow-ing, we presentthemainanalyticalresultsandleave outthedetailsof theproofsin AppendixA.

5.1 Routeprobability matrix

Althoughthenumberof routinghopsin Pastryis asymp-totically �����#� � � �&� , theactualnumberof routinghopsisaffectedby theuseof theleafsetandtheprobability thatthemessagekey alreadysharesa prefix with thenodeIdof the starting node and intermediatenodesalong therouting path. In the following, we analyzethe distribu-tion of thenumberof routinghopsbasedonthestatisticalpopulationof thenodeIdspace.SincetheassignmentofnodeIdsis assumedto berandomlyuniform,thispopula-tion canbecapturedby thebinomialdistribution (see,forexample,[3]). For instance,thedistribution of thenum-ber of nodeswith a given valueof the mostsignificantnodeIddigit, outof � nodes,is givenby �]`vh9´ � ©T� ����� b .

Recallfrom Figure3 thatat eachnode,a messagecanbeforwardedusingoneof threebranchesin theforward-ing procedure.In caseµ�¶ , themessageis forwardedus-ing the leaf set ¦ (line 3); in caseµ�· usingthe routingtable ¤ (line 8); andin caseµ¹¸ usinga nodein ¦�º»¤(lines11-13).Weformally definetheprobabilitiesof tak-ing thesebranchesaswell asof two specialcasesin thefollowing.

Definition 1 Let ¼¾½���`v¨¿©À��© � ©Àµ�ÁÂb denotetheprobabil-ity of takingbranch µ Á ©�«®Ã�Ä�¬=©ÀÅ�©'Æ�Ç , at the `v¨�����b thhopin routinga message with randomkey, startingfroma noderandomlychosenfrom � nodes,with a leaf setof size � . Furthermore, we define¼¾½���`v¨¿©À��© � ©ÀµÉȶ b as

7

the probability that the node encountered after the ¨ -th hop is already the numerically closestnode to themessage, and thus the routing terminates,and define¼¾½���`v¨¿©À��© � ©ÀµÊÈ· b as the probability that the node en-countered after the ¨ -th hopalreadysharesthe `v¨"�Ë��bdigits with thekey, thusskippingthe `v¨��±��b th hop.

We denote¼¾½��]`v¨z©À��© � ©Àµ Á b'©�¨sÃ�Ì ÍΩT� ��� �=­Ï�qÐo©�«ÑÃÄ�¬=©À¬sÈv©ÀÅ�©ÀÅ�ÈÒ©'Æ�Ç astheprobabilitymatrixof Pastryrout-ing. The following Lemmagivesthebuilding block forderiving the full probability matrix as a function of �and � .Lemma 1 Assumebranch µ¹· hasbeentakenduringthefirst ¨ hopsin routinga randommessage Ó , i.e. themes-sage Ó is at an intermediatenode « which shares thefirst ¨ digits with Ó . Let Ô be the total numberof ran-domuniformlydistributednodeIdsthat share the first ¨digits with Ó . Theprobabilitiesin takingdifferentpathsat the `v¨��±��b th hopisÕÖÖÖÖ

×ØlÙ;ÚTÛ yÒ�¾�e�o��Ü���ÝRÞ¿�ØlÙ;ÚTÛ yÒ�¾�e�o��Ü���Ý�ßÞ �ØlÙ;ÚTÛ yÒ�¾�e�o��Ü���ÝRà��ØlÙ;ÚTÛ yÒ�¾�e�o��Ü���Ý�ßà �ØlÙ;ÚTÛ yÒ�¾�e�o��Ü���ÝRá��

â�ããããä �

�gå�} �æ�gçéè

êæë�ì ç�èÛ yîí è]ï Üa�ñð��ò �Vó

ê } ëvìæë çéèÛ yîí ï Üô��í è � j� ò � ð �Vó ØXÙYÚTÛ Ø � Û�õ yîí���í è �eÜô�öí è �öí��g�����Ò�

where ¼¾½��� ¼R8Î�'U`ø÷ ¥ ©%÷Tù;©%÷Tú�©�¨¿©À�ob calculatesthefiveprob-abilitiesassumingthere are ÷ ¥ ©%÷TùT©%÷Yú nodeIdsthatsharedthefirst ¨ digits with Ó , but whose `v¨��û��b th digits aresmallerthan,equalto, andlarger thanthatof Ó , respec-tively.

Sincetherandomlyuniformly distributednodeIdsthatfall in a particularsegmentof thenamespacecontainingafixedprefixof ¨ digits follow thebinomialdistribution,the ¨ th row of the probability matrix canbe calculatedby summingoverall possiblenodeIddistributionsin thatsegmentof thenamespacetheprobabilityof eachdistri-bution multipliedby its correspondingprobabilityvectorgiven by Lemma1. Figure5 plots the probabilitiesoftakingbranchesµ�¶ , µ�· , and µ�¸ at eachactualhop(i.e.after theadjustmentof collapsingskippedhops)of Pas-try routing for � <ü6�ÍÍÍÍ , with �°<þý � and ��<ÿ> . Itshowsthatthe ���#� �%$ ` � b -th hopis dominatedby µ�¶ hopswhile earlierhopsaredominatedby µ�· hops.Theaboveprobabilitymatrix canbe usedto derive thedistribution

0

0.2

0.4

0.6

0.8

1

0 1 2 3 4Pro

babi

litie

s of

taki

ng b

ranc

hes

PA

, PB

, and

PC

�Hop number h

N=60000, l=32, b=4, Expected (hops) = 3.67

prob(h,l,N,Pa)prob(h,l,N,Pb)prob(h,l,N,Pc)

Figure 5: Probabilities Ý Ù yÒ�¾�e�%� � ��ÝRÞz� , Ý Ù yÒ�¾�e�o� � �eÝ à � ,Ý Ù y������o� � �eÝ á � andexpectednumberof hopsfor� ����������� ,

with �����]� and Û ��� . (Fromanalysis.)

of thenumbersof routinghopsin routinga randommes-sage. Figure 6 plots this distribution for � < 6�ÍÍÍÍwith �ñ< ý � and �Â<Ñ> . Theprobabilitymatrix canalsobeusedto derive theexpectednumberof routinghopsinPastryroutingaccordingto thefollowing theorem.

0

0.2

0.4

0.6

0.8

1

0 1 2 3 4 5 6

Pro

babi

lity

Number of routing hops

N=60000, l=32, b=4, Expected (hops) = 3.67

Figure6: Distributionof thenumberof routinghopspermes-sagefor

� ������������ , with �¾���#� and Û ��� . (Fromanalysis.)

Theorem 1 Let theexpectednumberof additionalhopsafter taking µ ¸ for thefirst time, at the ¨ th hop,be de-notedas Æ ���w`v¨¿©À��© � ©Àµ ¸ b . Theexpectednumberof rout-ing hopsin routinga message with randomkey Ó start-ing froma noderandomlychosenfromthe � nodesis

� ����� ò } �æ� çéè ØXÙYÚTÛ yÒ�¾�e�%� � ��Ý Þ ��� ØXÙYÚTÛ y������o� � �eÝ ßÞ ���ØXÙYÚTÛ y������o� � �eÝ à ��� ØXÙYÚTÛ y������o� � �eÝ ßà ���ØlÙ;ÚTÛ yÒ�¾�e�o� � �eÝ á ����������yÒ�¾�e�o� � �eÝ á �Vó ØXÙYÚTÛ yÒ�¾�e�%� � ��Ý á �8

5.2 Expectedrouting distance

The above routing hop distribution is derived solelybasedontherandomlyuniformdistributionof nodeIdsinthenamespace.Coupledwith proximity neighborselec-tion in maintainingtheentriesin Pastry’s routingtables,the routing hop distribution canbe usedto analyzetheexpectedtotal routedistance.

To make the analysistractable,it is assumedthat thelocationsof thePastrynodesarerandomuniformly dis-tributedoverthesurfaceof asphere,andthattheproxim-ity metricusedby Pastryequalsthegeographicdistancebetweenpairsof Pastrynodesonthesphere.Theuniformdistribution of nodelocationsandtheuseof geographicdistanceastheproximity metricareclearlynot realistic.In Section6 we will presenttwo setsof simulationre-sults,onefor conditionsidenticalto thoseassumedin theanalysis,andonebasedon Internettopologymodels.Acomparisonof theresultsindicatesthattheimpactof ourassumptionson theresultsis limited.

Since Pastry nodesare uniformly distributed in theproximity space,the averagedistancefrom a randomnodeto thenearestnodethatsharesthefirst digit, thefirsttwo digits,etc.,canbecalculatedbasedonthedensityofsuchnodes.Thefollowing Lemmagivestheaveragedis-tancein eachhop traveledby a messagewith a randomkey sentfrom a randomstartingnode,asa function ofthehopnumberandthehoptype.

Lemma 2 (1) In routing message Ó , after ¨»µ · hops,if ¤ ���� is not empty, theexpected��'¼ W�Ze§��Y`v¨¿©À¤�©Àµ¹· b is¤ÉU;�§ ^ � `g�£­ � ��� �! #"%$& #"' b .(2) In routingmessage Ó , if path µ�¶ is takenat anygivenhop,thehopdistance��'¼ W�Z�§(�Y`v¨¿©À¤�©Àµ�¶~b is )+* ,� .(3) In routing message Ó , after ¨ hops, if pathµ ¸ is taken, the hop distance ¨¾��¼ WXZe§(�Y`v¨¿©À¤�©Àµ ¸ b is¨��'¼ W�Z�§(�Y`v¨&­Ï��©À¤�©Àµ�· b , which with high probability isfollowedbya hoptakenvia µ�¶ , i.e. with distance)+* ,� .

The above distance��'¼ W�Ze§��Y`v¨¿©À¤�©Àµ · b comesfromthe densityargument. AssumingnodeIdsareuniformlydistributedoverthesurfaceof thesphere,theaveragedis-tanceof the next µ�· hop is the radiusof a circle thatcontainsonaverageonenodeId(i.e. thenearestone)thatsharev¨=����b digits with Ó .

Giventhevectorof theprobabilitiesof takingbranchesµ�¶ , µ�· , and µ¹¸ at theactual th hop(e.g.Figure5), andtheabovevectorof per-hopdistancefor thethreetypesof

hopsatthe ¨ th hop,theaveragedistanceof the ¨ th actualhopis simply thedot-productof thetwo vectors,i.e. theweightedsumof the hop distancesby the probabilitiesthat they are taken. Theseresultsare presentedin thenext sectionalongwith simulationresults.

5.3 Local route convergence

Next, we analyzePastry’s route convergenceproperty.Specifically, whentwo randomPastrynodessendames-sagewith thesamerandomlychosenkey, weanalyzetheexpecteddistancethetwo messagestravel in theproxim-ity spaceuntil thepointwheretheir routesconverge,asafunctionof thedistancebetweenthestartingnodesin theproximity space.

To simplify theanalysis,we considerthreescenarios.In theworst-casescenario,it is assumedthatateachrout-ing hop prior to the point wheretheir routesconverge,the messagestravel in oppositedirectionsin the prox-imity space.In theaverage-casescenario,it is assumedthat prior to convergence,the messagestravel suchthattheirdistancein theproximity spacedoesnot change.Inthebestcasescenario,themessagestravel towardseachotherin theproximity spaceprior to their convergence.

For eachof the above threescenarios,we derive theprobability that the two routesconverge after eachhop.The probability is estimatedas the intersectingareaofthe two circlespotentiallycoveredby the two routesateachhopasapercentageof theareaof eachcircle. Cou-pling thisprobabilityvectorwith thedistancevector(fordifferenthops)givestheexpecteddistancetill routecon-vergence.

Theorem 2 Let Æö� and Æ � be the two starting nodeson a sphere of radius ¤ from which messages with anidentical,randomkey are beingrouted.Let thedistancebetweenÆö� and Æ � be WXÍ . Thenthe expecteddistancethatthetwomessageswill travelbefore their pathsmergeis

j#mon'tÀy|j-���e���¿� �/.�021 å43æë çéè �65ë7� çéè y ð �

ØXÙYÚTÛ � ÚgØ y�mg��j��X�e���%�e� ÚgØ j#mon'tÀy�í#�e���where ¼¾½�� ¨¾��¼¹`ø÷©ÀWXÍΩÀ¤Âb < 8#9 �(:<; f \>=@? 9 ACB ,ED B f A!B ,ED8�F<GCH@I4J4K<LM9 �(:<; f \>=@? 9 ACB ,ED B ,ED ,W�÷ < WXÍö� �ON#PRQ�S AQUTWV ¨��'¼ W�Z�§(�Y`ø÷©À¤Âb in the worst case,or W�÷®<üWXÍ in the average case, or We÷[<þ� 8l:�`�ÍΩÀWXÍ"­�XN PRQ�S AQUTWV ¨¾�'¼ WXZe§��;`ø÷�©À¤Âb�b in the bestcase, respectively,Y `�½#©ÀW¾©À¤Âb denotestheintersectingareaof two circlesofradius ½ centered at two pointson a sphere of radius ¤

9

that are a distanceof W�Z � ½ apart, andY =@[ ú!\U]�ù@^ `�½#©À¤Âb

denotesthe surfacearea of a circle of radius ½ on asphere of radius ¤ .

Figure 7 plots the averagedistancetraveled by twomessagessentfrom two randomPastry nodeswith thesamerandomkey, asa functionof thedistancebetweenthetwo startingnodes.Resultsareshown for the“worstcase”,“averagecase”,and“bestcase”analysis.

0

500

1000

1500

2000

2500

0 500 1000 1500 2000 2500 3000 3500

Ave

rage

Pas

try

dist

ance

to c

onve

rgen

ce p

oint

_Network distance between source nodes

Worst case, N=60kAverage case, N=60k

Best case, N=60k

Figure 7: Distanceamongsourcenodesrouting messageswith thesamekey, versusthedistancetraverseduntil the twopathsconverge, for a 60,000nodePastrynetwork, with l=32andb=4. (Fromanalysis.)

6 Experimental results

Our analysisof proximity neighborselectionin Pastryhasrelied on assumptionsthat do not generallyhold inthe Internet. For instance,the triangle inequality doesnot generallyhold for mostpracticalproximity metricsin theInternet.Also, nodesarenotuniformly distributedin the resultingproximity space.Therefore,it is neces-saryto confirm the robustnessof Pastry’s locality prop-ertiesundermorerealisticconditions.In thissection,wepresentexperimentalresultsquantifyingtheperformanceof proximity neighborselectionin Pastry underrealis-tic conditions.The resultswereobtainedusinga Pastryimplementationrunningon top of a network simulator,using Internettopologymodels. The Pastryparameterswereset to �&< > andthe leafsetsize �Â< ý � . Unlessotherwisestated,resultswhereobtainedwith asimulatedPastryoverlaynetwork of 60,000nodes.

6.1 Network topologies

Threesimulatednetwork topologieswereusedin theex-periments. The “Sphere” topology correspondsto thetopology assumedin the analysisof Section5. Nodesareplacedat uniformly randomlocationson thesurfaceof a spherewith radius 1000. The distancemetric isbasedon thetopologicaldistancebetweentwo nodesonthe sphere’s surface. Resultsproducedwith this topol-ogymodelshouldcorrespondcloselyto theanalysis,andit wasusedprimarily to validatethesimulationenviron-ment. However, thespheretopologyis not realistic,be-causeit assumesa uniform randomdistribution of nodeson the Sphere’s surface,andits proximity spaceis veryregularandstrictly satisfiesthetriangleinequality.

A secondtopology wasgeneratedusing the GeorgiaTech transit-stubnetwork topology model [15]. Theround-tripdelay(RTT) betweentwo nodes,asprovidedby thetopologygraphgenerator, is usedastheproximitymetricwith this topology. We usea topologywith 5050nodesin thecore,wherea LAN with anaverageof 100nodesis attachedto eachcorenode. Out of the result-ing 505,000LAN nodes,60,000randomlychosennodesformaPastryoverlaynetwork. As in therealInternet,thetriangleinequalitydoesnot hold for RTTs amongnodesin thetopologymodel.

Finally, we usedthe Mercator topology and routingmodels [14]. The topology model contains102,639routersand it wasobtainedfrom real measurementsoftheInternetusingtheMercatorprogram[5]. Theauthorsof [14] usedreal dataandsomesimpleheuristicsto as-signanautonomoussystemto eachrouter. TheresultingAS overlayhas2,662nodes.Routingis performedhier-archicallyasin theInternet.A routefollows theshortestpathin theAS overlaybetweentheAS of thesourceandtheAS of thedestination.Therouteswithin eachAS fol-low theshortestpathto a routerin thenext AS of theASoverlaypath.

We built a Pastry overlay with 60,000nodeson thistopologyby pickingarouterfor eachnoderandomlyanduniformly, andattachingthe nodedirectly to the routerwith a LAN link. Sincethe topology is not annotatedwith delay information, the numberof routing hopsinthetopologywasusedastheproximity metricfor Pastry.We count the LAN hopswhen reportingthe length ofthe Pastryroutes. This is conservative becausethe costof thesehopsis usuallynegligible andPastry’s overheadwouldbelower if we did not countLAN hops.

10

6.2 Pastry routing hopsand distanceratio

In the first experiment, 200,000lookup messagesareroutedusingPastryfrom randomlychosennodes,usingarandomkey. Figure8 showsthenumberof Pastryroutinghopsandthedistanceratio for thespheretopology. Dis-tanceratio is definedastheratioof thedistancetraversedby a Pastrymessageto the distancebetweenits sourceanddestinationnodes,measuredin termsof theproxim-ity metric. The distanceratio canbe interpretedas thepenalty, expressedin termsof the proximity metric, as-sociatedwith routinga messagesthroughPastryinsteadof sendingthemessagedirectly in theInternet.

Four setsof resultsareshown. “Expected”representstheresultsof theanalysisin Section5. “Normal routingtable”showsthecorrespondingexperimentalresultswithPastry. “Perfect routing table” shows resultsof experi-mentswith a versionof Pastrythat usesperfectroutingtable. That is, eachentry in the routing table is guar-anteedto point to the nearestnodewith the appropriatenodeIdprefix. Finally, “No locality” showsresultswith aversionof Pastrywherethelocality heuristicshave beendisabled.

3.67 3.68 3.68 3.69

1.26 1.33 1.37

3.68

0

0.5

1

1.5

2

2.5

3

3.5

4

Expected PerfectRoutingTable

NormalRoutingTable

NoLocality

Expected PerfectRoutingTable

NormalRoutingTable

NoLocality

Number of hops Distance ratio

Figure 8: Numberof routing hopsand distanceratio,spheretopology.

All experimentalresultscorrespondwell with the re-sultsof theanalysis,thusvalidatingtheexperimentalap-paratus. As expected,the expectednumberof routinghopsis slightly below ���#�a` $ 6�ÍΩÀÍÍÍ < ýcbed�f andthedis-tanceratio is small. The reportedhop countsarevirtu-ally independentof the network topology, thereforewepresentthemonly for thespheretopology.

Thedistanceratioobtainedwith perfectroutingtablesis only marginally betterthanthatobtainedwith therealPastryprotocol. This confirmsthat thenodejoin proto-col producesrouting tablesof high quality, i.e., entries

refer to nodesthat are nearly the closestamongnodeswith theappropriatenodeIdprefix. Finally, thedistanceratioobtainedwith thelocality heuristicsdisabledis sig-nificantly worse. This speaksboth to the importanceoftopology-awarerouting,andtheeffectivenessof proxim-ity neighborselection.

6.3 Routing distance

0

200

400

600

800

1000

1200

1400

1600

1800

1 2 3 4 5Hop number

Per

-hop

dis

tanc

e

Expected

Perfect Routing Table

Normal Routing Table

No Locality

Figure9: Distancetraversedperhop,spheretopology.

Figure 9 shows the distancemessagestravel in eachconsecutive routing hop. The resultsconfirm the expo-nential increasein the expecteddistanceof consecutivehopsup to the fourth hops,aspredictedby theanalysis.Note that the fifth hop is only taken by a tiny fraction(0.004%)of themessages.Moreover, in theabsenceofthe locality heuristics,the averagedistancetraveled ineachhopis constantandcorrespondsto theaveragedis-tancebetweennodes( g(h�fcgOikj<lnmpo�qMrts , wherer is theradiusof thesphere).

0

100

200

300

400

500

600

1 2 3 4 5

Hop Number

Per

-ho

p d

ista

nce

Normal Routing Tables

Perfect Routing Tables

No locality

Figure10: Distancetraversedperhop,GATechtopology.

Figures10 and11 show thesameresultsfor theGAT-ech and the Mercator topologies,respectively. Due tothenon-uniformdistribution of nodesandthemorecom-plex proximity spacein thesetopologies,the expected

11

0

2

4

6

8

10

12

14

16

18

1 2 3 4 5Hop Number

Per

-hop

dis

tanc

e

Normal Routing Tables

Perfect Routing Tables

No Locality

Figure11: Distancetraversedper hop, Mercatortopol-ogy.

distancein eachconsecutive routing stepno longer in-creasesexponentially, but it still increasesmonotonically.Moreover, thenodejoin algorithmcontinuesto produceroutingtablesthatrefer to nearbynodes,asindicatedbythemodestdifferencein hopdistanceto theperfectrout-ing tablesin thefirst threehops.

The proximity metric usedwith the Mercator topol-ogymakesproximity neighborselectionappearin anun-favorablelight. Sincethe numberof nodeswithin u IProutinghopsincreasesveryrapidlywith u , thereareveryfew “nearby”Pastrynodes.Observe thattheaveragedis-tancetraveled in the first routing hop is almosthalf oftheaveragedistancebetweennodes(i.e., it takesalmosthalf the averagedistancebetweennodesto reachabout16 otherPastrynodes).As a result,Pastrymessagestra-verserelatively longdistancesin thefirst few hops,whichleadsto a relatively high distanceratio. Nevertheless,theseresultsdemonstratethatproximity neighborselec-tion workswell evenunderadverseconditions.

Figures12,13and14show rasterplotsof thedistancemessagestravel in Pastry, asa function of the distancebetweenthesourceanddestinationnodes,for eachof thethreetopologies,respectively. Messagesweresentfrom20,000randomlychosensourcenodeswith randomkeysin this experiment.Themeandistanceratio is shown ineachgraphasa solid line.

The resultsshow that the distribution of the distanceratio is relatively tight aroundthe mean. Not surpris-ingly, thespheretopologyyields thebestresults,duetoits uniform distribution of nodesandthegeometryof itsproximity space.However, the far more realisticGAT-echtopologyyieldsstill very goodresults,with a meandistanceratio of 1.59,a maximaldistanceratio of about8.5,anddistribution that is fairly tight aroundthemean.

Even the leastfavorableMercatortopologyyields goodresults,with a meandistanceration of 2.2 anda maxi-mumof about6.5.

0

1000

2000

3000

4000

5000

6000

7000

0 400 800 1200 1600 2000 2400 2800 3200

Distance between source and destination

Dis

tan

ce t

rave

led

by

Pas

try

mes

sag

e

Mean = 1.37

Figure 12: Distancetraversedversusdistancebetweensourceanddestination,spheretopology.

0

500

1000

1500

2000

2500

0 200 400 600 800 1000 1200 1400

Distance between source and destination

Dis

tan

ce t

rave

led

by

Pas

try

mes

sag

eMean = 1.59

Figure 13: Distancetraversedversusdistancebetweensourceanddestination,GATechtopology.

6.4 Local route convergence

The next experimentevaluatesthe local route conver-gencepropertyof Pastry. In the experiment,10 nodeswereselectedrandomly, andthenfor eachof thesenodes,6,000othernodeswerechosensuchthat thetopologicaldistancebetweeneachpairprovidesgoodcoverageof therangeof possibledistances.Then,100randomkeyswerechosenandmessageswhereroutedvia Pastryfrom eachof thetwo nodesin apair, with agivenkey.

To evaluatehow early the pathsconvergence,we usethe metric jwvyxv�x{z}|M~��� vyxvyx{z}|2�� qMrts where, ��� is the distancetraveledfrom thenodewherethe two pathsconverge tothedestinationnode,and � `v and ���v arethedistancestrav-eledfrom eachsourcenodeto thenodewherethepathsconverge. The metric expressesthe averagefraction ofthelengthof thepathstraveledby thetwo messagesthatwasshared.Note that themetric is zerowhenthepaths

12

0

10

20

30

40

50

60

70

80

0 5 10 15 20 25 30 35 40 45

Distance between source and destination

Dis

tan

ce t

rave

led

by

Pas

try

mes

sag

e Mean = 2.2

Figure 14: Distancetraversedversusdistancebetweensourceanddestination,Mercatortopology.

converge in thedestination.Figures15, 16 and17 showthe averageof the convergencemetricsversusthe dis-tancebetweenthetwo sourcenodes.As expected,whenthedistancebetweenthesourcenodesis small,thepathsare likely to converge quickly. This result is importantfor applicationsthatperformcaching,or rely onefficientmulticasttrees[11, 12].

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 400 800 1200 1600 2000 2400 2800 3200

Distance between two source nodes

Co

nve

rgen

ce m

etri

c

Figure15: Convergencemetric versusthe distancebe-tweenthesourcenodes,spheretopology.

6.5 Overheadof nodejoin protocol

Next, wemeasuretheoverheadincurredby thenodejoinprotocolto maintaintheproximity invariant in the rout-ing tables. We quantify this overheadin termsof thenumberof probes, whereeachprobecorrespondsto thecommunicationrequiredto measurethedistance,accord-ing to theproximitymetric,amongtwo nodes.Of course,in our simulatednetwork, a probesimply involveslook-ing upthecorrespondingdistanceaccordingto thetopol-ogy model. However, in a real network, probingwouldlikely requireat leasttwo messageexchanges.Thenum-ber of probesis thereforea meaningfulmeasureof theoverheadrequiredto maintaintheproximity invariant.

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 200 400 600 800 1000 1200

Distance between two source nodes

Con

verg

ence

met

ric

Figure16: Convergencemetricversusdistancebetweenthesourcenodes,GATechtopology.

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0 5 10 15 20 25 30 35 40

Distance between two source nodes

Co

nve

rgen

ce M

etri

c

Figure17: Convergencemetricversusdistancebetweenthesourcenodes,Mercatortopology.

Theaveragenumberof probesperformedby a newlyjoining nodewas29,with a minimumof 23 anda maxi-mumof 34. Theseresultswerevirtually independentoftheoverlay size,which we variedfrom 1,000to 60,000nodes. In eachcase,the probesperformedby the lasttennodesthat joined thePastrynetwork wererecorded,which are the nodeslikely to perform the most probesgiven the sizeof the network at that stage. The corre-spondingaveragenumberof probesperformedby otherPastrynodesduring the join wasabout70, with a mini-mumof 2 andamaximumof 200.

It is assumedherethatonceanodehasprobedanothernode,it storesthe resultanddoesnot probeagain. Thenumberof nodescontactedduring the joining of a newnodeis jyst����g�q2�<��� �M�C� � � , whereN is the numberofPastrynodes.This follows from theexpectednumberofnodesin the routing table, and the sizeof the leaf set.Although every nodethat appearsin the joining node’srouting table receives information aboutall the entriesin thesamerow of the joining node’s routing table,it isvery likely that the receiving nodealreadyknows manyof thesenodes,andthustheir distance.As a result, the

13

numberof probesperformedpernodeis low (onaveragelessthan2). This meansthat the total numberof nodesprobedis low, andtheprobingis distributedover a largenumberof nodes.Theresultswerevirtually identicalfortheGATechandtheMercatortopologies.

6.6 Nodefailur e

In the next experiment,we evaluatethe nodefailure re-coveryprotocol(Section4.1)andtheroutingtablemain-tenance(Section4.1).Recallthatleafsetrepairis instan-taneous,failed routing table entriesare repairedlazilyuponnext use,anda periodicroutingtablemaintenancetaskrunsperiodically(every20mins)to exchangeinfor-mationwith randomlyselectedpeers.

In theexperiment,a50,000nodePastryoverlayis cre-atedbasedon the GATechtopology, and200,000mes-sagesfrom randomsourceswith randomkeysarerouted.Then,20,000randomlyselectednodesaremadeto failsimultaneously, simulatingconditionsthatmightoccurintheeventof anetwork partition.Priorto thenext periodicroutingtablemaintenance,a new setof 200,000randommessagearerouted.After anotherperiodicroutingtablemaintenance,anotherset of 200,000randommessagesarerouted.

Figure18 shows boththenumberof hopsandthedis-tanceratio at variousstagesin this experiment. Shownaretheaveragenumberof routing hopsandtheaveragedistanceratio,for 200,000messageseachbeforethefail-ure, after the failure, after the first andafter the secondround of routing table maintenance.The “no failure”result is includedfor comparisonand correspondsto a30,000nodePastryoverlay with no failures. Moreover,to isolatetheeffectsof theroutingtablemaintenance,wegive resultswith and without the routing table mainte-nanceenabled.

During the first 200,000messagetransmissionsafterthemassivenodefailure,theaveragenumberof hopsandaveragedistanceratio increaseonly mildly (from 3.54to4.17and1.6to 1.86,respectively). Thisdemonstratestherobustnessof Pastryin thefaceof massive nodefailures.After eachround,theresultsimproveandapproachthosebeforethefailureaftertwo rounds.

With theroutingtablemaintenancedisabled,boththenumberof hopsandthedistanceratio do not recover asquickly. Considerthat the routing table repair mecha-nism is lazy and only repairsentriesthat are actuallyused.Moreover, arepairgenerallyinvolvesanextrarout-

3.6

4.2

3.83.6

3.5

1.6

1.91.7 1.6 1.6

3.6

4.24.1 4.1

3.5

1.6

1.9 1.8 1.81.6

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

Beforefailure

Afterfailure

After 1round

After 2rounds

Nofailure

Beforefailure

Afterfailure

After 1round

After 2rounds

Nofailure

50000 30000 50000 30000

Number of Hops Distance Ratio

Routing Table Maintenance Enabled

Routing Table Maintenance Disabled

Figure18: Routinghopsanddistanceratio for a 50,000nodePastryoverlay when20,000nodessimultaneouslyfail, GATechtopology.

inghop,becausethemessageis routedtoanodethatdoesnot sharea longerprefix with thekey. Eachconsecutiveburstof 200,000messagesis likely to encounterdifferentrouting tableentriesthat have not yet beenfixed (about95,000entrieswererepairedduringeachbursts).Thepe-riodic routing tablemaintenance,on the otherhand,re-placesfailedentriesthathavenotyetbeenusedaspartofits routine.It is intuitive to seewhy thedistanceratio re-covers more slowly without routing table maintenance.The replacemententry provided by the repair mecha-nisms is generallyrelatively close,but not necessarilyamongthe closest. The periodic routing table mainte-nanceperformsprobingandis likely to replacesuchanentrywith abetterone.Finally, wepointout thatroutingtablemaintanancealsotakescareof changingdistancesamongnodesover time.

Wealsomeasuredthecostof theperiodicroutingtablemaintenance,in termsof network probes,to determinethe distanceof nodes. On average,lessthan 20 nodesarebeingprobedeachtimeanodeperformsroutingtablemaintenance,with a maximumof 82 probes.Sincetheroutingtablemaintenanceis performedevery20minutesand the probesare likely to target different nodes,thisoverheadis not significant.

6.7 Load balance

Next, weconsiderhow maintainingtheproximity invari-antin theroutingtablesaffectsloadbalancein thePastryroutingfabric. In thesimplePastryalgorithmwithout thelocality heuristics,or in protocolslike Chordthat don’tconsidernetwork proximity, the “indegree” of a node,i.e., thenumberof routingtableentriesreferringto aanygivennode,shouldbebalancedacrossall nodes.This is

14

a desirableproperty, asit tendsto balancemessagefor-wardingloadacrossall participatingnodesin theoverlay.

Whenrouting tablesentriesare initialized to refer tothe nearestnodewith the appropriateprefix, this prop-ertymaybecompromised,becausethedistribution of in-degreesis now influencedby thestructureof theunderly-ing physicalnetwork topology. Thus,thereis aninherenttradeoff betweenproximity neighborselectionandloadbalancein the routing fabric. The purposeof the nextexperimentis to quantify thedegreeof imbalancein in-degreesof nodes,causedby theproximity invariant.

Figure19 shows the cumulative distribution of inde-greesfor a 60,000node Pastry overlay, basedon theGATech topology. As expected,the resultsshow thatthe distribution of indegreesis not perfectly balanced.The resultsalsoshow that the imbalanceis mostsignif-icantat the top levels of the routing table(not shown inthegraph),andthat thedistribution hasa thin tail. Thissuggeststhat it is appropriateto dealwith thesepoten-tial hotspotsreactively ratherthanproactively. If oneofthenodeswith ahigh indegreebecomesahotspot,whichwill dependon the workload, it cansendbackoff mes-sages. The nodesthat receive sucha backoff messagefind analternative nodefor thesameslot usingthesametechniqueasif thenodewasfaulty. Sincethemostsig-nificant imbalanceoccursat thetop levelsof theroutingtable,changingrouting tableentriesto point to analter-native nodewill not increasethe distanceratio signifi-cantly. Therearemany alternative nodesthatcanfill outtheseslotsandthedistancetraversedin thefirst hopsac-countsfor asmallfractionof thetotaldistancetraversed.Weconcludethatimbalancein theroutingfabricasa re-sult of the proximity invariant doesnot appearto be asignificantproblem.

0

10000

20000

30000

40000

50000

60000

1 10 100 1000 10000

In-degree

Cu

mal

ativ

e n

um

ber

of

no

des

1023

Figure19: Indegreedistribution of 60,000Pastrynodes,GATechtopology.

Exact Average Average Numberclosest Distance RT0 Distance Probes

Sphere 95.3% 11.0 37.1 157GATech 83.7% 82.1 34.1 258Mercator 32.1% 2.6 6.0 296

Table1: Resultsfor theclosestnodediscoveryalgorithm.

6.8 Discovering a nearby seednode

Next, we evaluatethediscovery algorithmusedto find anearbynode,presentedin Section4.2. In eachof 1,000trials, we chosea pair of nodesrandomly amongthe60,000Pastrynodes.Onenodein thepair is consideredthe joining nodethat wishesto locatea nearbyPastrynode,theotheris treatedastheseedPastrynodeknownto the joining node. Using this seednode,thenodedis-covery algorithmwasusedto discover a nodenearthejoining node,accordingto theproximity metric. Table1shows theresultsfor the threedifferenttopologies.Thefirst column shows the numberof times the algorithmproducedtheclosestexisting node. Thesecondcolumnshows the averagedistance,accordingto the proximitymetric, of the nodeproducedby the algorithm, in thecaseswherethe nearestnodewasnot found. For com-parison,thethird columnshows theaveragedistancebe-tweena nodeandits row zeroroutingtableentries.Thefourth column shows the numberof probesperformedpertrial.

In the spheretopology, over 95% of the found nodesaretheclosest.Whentheclosestis not found, theaver-agedistanceto the found nodeis significantlylessthanthe averagedistanceto the entriesin the first level ofthe routing table. More interestingly, this is also truefor the Mercator topology, even thoughthe numberoftimestheclosestnodewasfound is low with this topol-ogy. TheGATechresultis interesting,in thatthefractionof caseswherethe nearestnodewasfound is very high(almost84%), but the averagedistanceof the producesnodein the caseswheretheclosestnodewasnot foundis high. Thereasonis thatthehighly regularstructureofthis topologycausesthealgorithmto sometimesgetintoa “local minimum”, by gettingtrappedin a nearbynet-work. Overall,thealgorithmfor locatinganearbynodeiseffective. Resultsshow thatthealgorithmsallows newlyjoining nodesto efficiently discover anearbynodein theexistingPastryoverlay.

15

6.9 Testbedmeasurements

Finally, we performedpreliminary measurementsin asmall testbedof about20 sitesthroughoutthe US andEurope.Themeasuredresultswereasexpected,but thetestbedis toosmallto obtaininterestinganrepresentativeresults. We expectthat a currentinitiative by a numberof organizationsto puttogetheralargerwide-areatestbedwill allow us to includesuchresultsin thefinal versionof this paper.

7 Conclusion

This paperpresentsa study of topology-aware routingin structuredp2p overlay protocols. We compareap-proachesto topology-aware routing and identify prox-imity neighbor selectionas the most promising tech-nique. We presentimproved protocols for proximitybasedneighborselectionin Pastry, which significantlyreducetheoverheadof topology-awareoverlayconstruc-tion andmaintenance.Analysisandsimulationsconfirmthat proximity neighbor selectionyields good perfor-manceatvery low overhead.Weconcludethattopology-awarerouting canbe accomplishedeffectively andwithlow overheadin aself-organizing,structuredpeer-to-peeroverlaynetwork.

References

[1] The Gnutella protocol specification, 2000.http://dss.clip2.com/GnutellaProtocol04.pdf.

[2] I. Clarke, O. Sandberg, B. Wiley, and T. W. Hong.Freenet:A distributed anonymousinformation storageand retrieval system. In Workshopon Design Issuesin AnonymityandUnobservability, pages311–320,July2000.ICSI, Berkeley, CA, USA.

[3] T. H. Cormen,C. E. Leiserson,andR. L. Rivest. Intro-ductionto Algorithms. TheMIT Press,Cambridge,MA,1990.

[4] F. Dabek,M. F. Kaashoek,D. Karger, R. Morris, andI. Stoica. Wide-areacooperative storagewith cfs. In18thACM Symposiumon OperatingSystemsPrinciples,Oct.2001.

[5] R. GovindanandH. Tangmunarunkit.Heuristicsfor in-ternetmapdiscovery. In Proc. 19th IEEE INFOCOM,pages1371–1380,Tel Aviv, Israel,March2000.IEEE.

[6] P. Maymounkov andD. Mazieres. Kademlia: A peer-to-peerinformationsystembasedon the xor metric. In

Proceedingsof the1st InternationalWorkshopon Peer-to-PeerSystems(IPTPS’02), Boston,MA, Mar. 2002.

[7] S. Ratnasamy, P. Francis, M. Handley, R. Karp, andS. Shenker. A ScalableContent-AddressableNetwork.In Proc.of ACM SIGCOMM, Aug. 2001.

[8] S. Ratnasamy, M. Handley, R. Karp, and S. Shenker.Topologically-awareoverlayconstructionandserver se-lection. In Proc. 21stIEEE INFOCOM, New York, NY,June2002.

[9] S. Ratnasamy, S. Shenker, andI. Stoica. Routingalgo-rithmsfor dhts:Someopenquestions.In Proceedingsofthe1st InternationalWorkshopon Peer-to-PeerSystems(IPTPS’02), Boston,MA, Mar. 2002.

[10] A. Rowstron and P. Druschel. Pastry: Scalable,dis-tributedobjectlocationandroutingfor large-scalepeer-to-peersystems. In International Conferenceon Dis-tributedSystemsPlatforms(Middleware), Nov. 2001.

[11] A. RowstronandP. Druschel.Storagemanagementandcachingin PAST, a large-scale,persistentpeer-to-peerstorageutility. In 18th ACM Symposiumon OperatingSystemsPrinciples, Oct.2001.

[12] A. Rowstron,A.-M. Kermarrec,M. Castro,andP. Dr-uschel. Scribe: The designof a large-scaleevent noti-fication infrastructure.In Third InternationalWorkshopon NetworkedGroupCommunications, Nov. 2001.

[13] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek,andH. Balakrishnan.Chord:A scalablepeer-to-peerlookupservicefor internetapplications. In Proceedingsof theACM SIGCOMM’01 Conference, SanDiego,California,August2001.

[14] H. Tangmunarunkit, R. Govindan, D. Estrin, andS. Shenker. The impact of routing policy on internetpaths. In Proc. 20th IEEE INFOCOM, Alaska, USA,Apr. 2001.

[15] E. Zegura, K. Calvert, and S. Bhattacharjee. How tomodelaninternetwork. In INFOCOM96, 1996.

[16] B. Y. Zhao, J. D. Kubiatowicz, and A. D. Joseph.Tapestry:An infrastructurefor fault-resilientwide-arealocationandrouting. TechnicalReportUCB//CSD-01-1141,U. C. Berkeley, April 2001.

Appendix A: Proofsof analytical results

We give proofsfor theLemmasandTheoremsstatedinSection5. Their numberingsheremaybedifferentfrombefore.

16

A.1 Routeprobability matrix

Lemma1 statesthe probability of taking path ��� at aspecifichopduringtheroutingof a randommessage.

Lemma 1 Assumepath ��� has beentaken during thefirst � hopsin routingmessage with key � , i.e. themes-sageis at anode� thatsharesthefirst � digitswith � . Ifthereexist �(� nodeIdssmallerthan � thatshare thesame� digitswith � , and �U� nodeIdslarger than � that sharethesame� digits with � , thentheprobability that node� will forward the message using ��� (i.e., � is within� ’s leaf set)is� o���� ��� j �(�@¡@� � ¡C�}¡!��q¢i £ �¤2¥ z ¤2¦ if h = 0

min§ ¤ ¥%¨ �/© �!ª z min§ ¤ ¦�¨ �/© �!ª¤ ¥ z ¤¦ if h « 0

Proof: Assumethe numericallyclosestnodeIdalwayssharessomeprefix with the messagekey. When themessagekey is within �yrts nodesfrom the boundaryof the subdomainsharing the sameprefix of � dig-its, the numberof nodesin the subdomainwhoseleaf-setscover � dropsto min j �(�y¡!��rts�q � min j �(�-¡!��rts�q , andthus the probability that ��� will be taken next ismin § ¤¥ z �¬© �!ª z min § ¤2¦ z �/© �!ª¤ ¥ z ¤ ¦ . In the very first hop, the pre-fix is of zerolength,thusthereis nobundaryeffect.

Since the numerically closestnodeId to message�maynot shareany leadingdigits with thekey, theaboveprobabilityfailsto accountfor oneadditionalroutinghopin suchcases.Sincethiscaseis rarein practice,it hasvir-tually noeffect on theabove probability.

Lemma 2 Assumebranch � � hasbeentakenduringthefirst � hopsin routinga randommessage � , i.e. themes-sage � is at a node� which sharesthefirst � digitswith� . Let ­ bethetotal numberof randomuniformlydis-tributednodeIdsthat share thefirst � digits with � . Theprobabilitiesin takingdifferentbranchesat the jy� � g�q thhopis

®°¯²±´³2µ�³2¶¸·}¹»º¼¼¼¼½�¾a¿ÁÀ ¯%±�³µ�³2¶O³2Ã�Ä}·

¾a¿ÁÀ ¯%±�³µ�³2¶O³2ÃÆÅÄ ·

¾a¿ÁÀ ¯%±�³µ�³2¶O³2Ã�Ç�·

¾a¿ÁÀ ¯%±�³µ�³2¶O³2ÃÆÅÇ ·

¾a¿ÁÀ ¯%±�³µ�³2¶O³2Ã�È�·

ÉCÊÊÊÊË ¹ÍÌ4ÎÏÑÐÒÓ4ÔÑÕ»ÖÒ×�Ø ÔÑÕ Â ¯¬ÙÁÚ�Û2¶O³ ÜÝ�Þ ·EßÖ Ï × ØÒ× ÔÑÕ Â ¯¬Ù+Û2¶áàâÙ Ú ³ ãÝ Þ à Ü ·Wß ¾+¿UÀ

¾aäÂMå ¯¬Ù�³<Ù Ú ³2¶æàOÙ Ú àçÙ�³2±�³µ%·

Proof: Thereare s � subdomainsof nodeIdsthatsharethefirst � digits as � but differ in the jy� � g�q th digit. With

equalprobabilities, � and � can fall into any of thesest� subdomains.Within eachsubdomain,the numberofnodes��è followsthebinomialdistribution, i.e. with prob-ability �tj �(èÑé!­p¡ `� � q , �(è nodescanendup in eachsubdo-main.Dependingonwhichsubdomain� falls into, therecanbebetweenê andupto s � �Íg subdomainsto theleftof � ’ssubdomain.If thereare ë subdomainsto theleft of� ’ssubdomain,thenumberof nodeIds� in thosesubdo-mainsfollows binomialdistribution �tj �Ñé � �ì��è#¡ �� �Mí ` q .Each iteration in the innermost summationcorre-spondsto a particulardistribution of �(� ,� v , and �(� , thenumberof nodeIdsto the left of, within the sameas,to the right of � ’s subdomain. In the formula above,thesevaluesare ��¡@�(è , and ­î�ì�(è��p� . Thevectorfunc-tion � o���� ��� �{��j ����¡@� v ¡@�U�t¡C�}¡!��q takes sucha distribution,andassumesequalprobability that � canbe any of the��� � � v � �U� nodeIdsand � canbeanywherein thenamespacespannedby thesenodeIds,andcalculatestheprob-abilitiesthatnode � will forwardmessage� using ��� ,�°ï� , � � , �°ï� , or ��ð , respectively.

Function� ot��� �E� ����j �(�@¡@� v ¡@� � ¡C��¡!�yq is calculatedasfol-lows. If � v i ê , � ’s subdomainis empty, the nextrouting hop takes either ��� or � ð . The probabilityof ��� is � o���� ��� j �(�@¡@�U�t¡C�}¡!��q , and that of ��� is gñ�� ot��� �E� j ���y¡@�U�t¡C�}¡!��q . Sincewe assumeuniform distribu-tion of j ��� � � � q in thesubdomainof thenamespace,theprobabilityof � ï� , i.e. � is numericallyclosestto � , isg�r#j ��� � �(�Uq . If � v «�ê , � ’s subdomainis not empty, thenext routinghoptakeseither ��� or ��� , andtheprobabil-ity of �òï� is g�r#j �(� � � v � �U��q . Sincethereare � v nodeIdsthat sharethe first jy� � g�q digits with � , therecanbej � v � g�q intervals in � ’s subdomainthat � can fall inwith equalprobability. For eachinterval, probabilitiesof��� and ��� arecaculatedbefore. Furthermore,if ��� isnot taken and ��� is taken, thereis a certainprobabilitythat � is amongthe � v nodesthatalreadysharesthenextdigit with � , in whichcasethenext hopis skipped.Thisprobabilitycontributesto �òï� . Theprobabilityequalsthenumberof nodeIdsamongthe � v nodesthatarenotwithin��rts from � , over the j ��� � � v � �U�(q possiblecandidatesof � .

Lemma 3 Let � be the total numberof randomuni-formly distributed nodeIds. Row 0 of the route prob-ability matrix, i.e. the probabilities in taking differ-ent branchesat the first hop in routing a randommes-sage from a randomstarting nodeId, is óôj<ê#¡!�2¡ � q�iõ j<ê#¡!�2¡ � q .

17

To calculatetheprobabilitiesat subsequenthops,notethatvalue �(ö givesthenumberof nodeIdsthat sharethefirst digit with message� , and the probabilitiesat thesecondhop is conditionedon the �(ö value. Oneway ofcalculatingthemis thusto repeattheabovecaseanalysisrecursively using��ö atthesecondhop, ��` atthethird hop,etc. Thefollowing theoremgivestheprobabilitiesat thejy� � g�q th hop.

Theorem 1 Let N be the total numberof randomuni-formly distributed nodeIds. Row � of the route prob-ability matrix, i.e. the probabilities in taking differentbranchesat the jy� � g�q th hopin routinga randommes-sage � startingfroma randomnodeId,is

÷R¯%±�³µ@³øù·}¹ º¼¼¼¼½Í¾a¿UÀ ¯%±�³µ@³øç³2Ã�Ä}·

¾a¿UÀ ¯%±�³µ@³øç³2ÃÆÅÄ ·

¾a¿UÀ ¯%±�³µ@³øç³2Ã Ç ·

¾a¿UÀ ¯%±�³µ@³øç³2ÃÆÅÇ ·

¾a¿UÀ ¯%±�³µ@³øç³2Ã È ·

É ÊÊÊÊË ¹ Ý ÞnúÒ×<û ÔÑÕ Â ¯¬Ù Õ Û2øç³ ÜÝ Þ ·EßÝ Þ ×yûÒ×�ü ÔÑÕ Â ¯¬Ù Ð Û<Ù Õ ³ýÜÝ�Þ ·Wß�þ/þ¬þ�ß Ý Þ ×yØCÿ��Ò×�Ø!ÿ ü ÔÑÕ Â ¯/Ù Ú ÏÑÐ Û²Ù Ú ÏcÌ ³ýÜÝ�Þ ·Wß{® ¯%±�³µ�³<Ù Ú Ï´Ð ·Theorem 2 Let theexpectednumberof additionalhopsafter first time taking � ð at the � th hop be denotedas����� jy�}¡!�2¡ � ¡!� ð q . Theexpectednumberof routinghopsin routinga message with randomkey � startingfromanoderandomlychosenfromthe � nodesisÐ@Ì �� Þ Ï´ÐÒÚ ÔÑÕ ¾a¿ÁÀ

 ¯%±�³µ�³2øç³Ã Ä ·�à¾a¿UÀ

 ¯%±�³µ@³øç³2à ÅÄ ·�¾+¿UÀ

 ¯%±�³µ@³øç³2Ã Ç ·�à¾a¿UÀ

 ¯%±�³µ@³øç³2à ÅÇ ·�¾+¿UÀ

 ¯²±´³2µ�³2øç³Ã È ·�� ���� ¯²±´³2µ�³øO³Ã È ·Wß¾a¿ÁÀ

 ¯%±�³µ�³2øç³Ã È ·Proof: Thesum� o����tjy�}¡!�2¡ � ¡!� � q � � o����tjy�}¡!�2¡ � ¡!� � q �� o����tjy�}¡!�2¡ � ¡!� ð q is theprobabilitythattheroutingtakesthe jy� � g�q th hop, and out of � o����tjy�}¡!�2¡ � ¡!��� q , withprobability � o����tjy�}¡!�2¡ � ¡!� ï� q , the routing skips futurehopsby one,andout of � ot���tjy�}¡!�2¡ � ¡!��� q , with proba-bility � o����-jy��¡!�2¡ � ¡!� ï� q , the routingskipsthe ��� hopatthe jy� � g�q th hop. If the intermediatenodeafter taking� hopssharesadditionaldigits otherthanthe jy� � g�q thdigit, the additionalskippedhopswill be accountedforby � ot���tjy� � gt¡!�2¡ � ¡!�òï� q , � ot���tjy� � sc¡!�2¡ � ¡!�òï� q , etc.

Lemma 4 Assumethat branch � � was taken duringthe first � hops in routing a message with key � ,branch � ð is taken in the jy� � g�q th hop, and thereare u nodeIdssharing the first jy� � g�q digits as mes-sage � . The probability that the routing finishesin

the next hop is � o���� � ��jyuE¡!��q»i � �/© �¤�� ` �tj �ÑéCuE¡ `� � q ����¤�� �/© � z ` �tj �´éCuE¡ `� � q�� �¬© �¤ .

Proof: Let thenodereachedafter taking branch � ð forthefirst timebenode� . Recall � is selectedfrom �����thatis numericallyclosestto � whosenodeIdalsosharesthefirst � digits with � . Let "! bethesetof nodeIdsin� ’s subdomain,i.e. sharingthefirst jy� � g�q digits with� . Theroutingfinishesin onestepafter � if

# "!%$ ��rts . Theprobabilityis � �¬© �¤�� ` �-j �´éCuE¡ `� � q . Or,

# "! « ��rts , andthe � ð hopreachedoneof theright-most �yrts nodesin "! . Theprobabilityof "!n« ��rtsis �&�¤�� �/© � z ` �tj �ÑéCuE¡ `� � q . The probability of later is�/© �'�( , becauseunderrandomuniformdistribution, theprobabilityof any of the rightmost �yrts nodesin *)shows up asany node’s routingtableentryis

�/© �',+ .

The distribution of � o����-jy��¡!�2¡ � ¡!� ð q shows that itsvalueonly becomesnot insignificantwhen � i �<��� � �!� ,when the value u above follows binomial distribution�tjyuEé � ¡(g�rtst� è q . In suchcases,� o���� � ��jyuE¡!��qÍ« ê#bed�d�- .Thus, for non-insignificantvaluesof � o����tjy�}¡!�2¡ � ¡!��ð¢q ,with very high probability, the routing takes one extrahopaftertaking � ð , i.e.

����� jy�}¡!�2¡ � ¡!� ð q/. g .A.2 Local route convergence

Theorem 3 Let� g and

� s be the two starting nodeson a sphere of radius � from which messages with anidentical,randomkey are beingrouted.Let thedistancebetween

� g and� s be ë+ê . Thenthe expecteddistance

thatthetwomessageswill travelbefore their pathsmergeisã,021�3 ¯ ã�4 ³65Æ·}¹�7982: � Î úÒ× ÔÑÕ<;>= ×?

;ÔÑÕ ¯ Ü à

¾+¿UÀ ±À4¾¯ 0 ³ ã,4 ³65Æ·@·±

À4¾ã,021�3 ¯/Ù�³65Æ·

where � o���� ��� � j ��¡!ë+ê#¡@� q i ' § èBADC �FE |HG § ¤ ¨ I ª ¨ � ¤ ¨ I ª',JDK ¦HLM �DN § èBADC �FE |HG § ¤ ¨ I ª ¨ I ª ,ë2� iîë+ê � sO�P� �,Q ¤� � ö �´� � ë�R2�BSUj ��¡@� q in the worst case,or ë2��i ë+ê in the average case, or ë�ÍiUT �WV j<ê#¡!ë+ê¸�sX� � �,Q ¤� � ö ��� � ëYR��SÁj ��¡@� qMq in the bestcase, respectively, �j²o�¡!ë�¡@� q denotestheintersectingareaof two circlesofradius o centered at two pointson a sphere of radius �that are a distanceof ë[Zws-o apart, and |H\ �@]_^ v2` j²o-¡@� qdenotesthe surfacearea of a circle of radius o on asphere of radius � .

18

Proof: Without loss of generality, we assumethe twostarting nodesare on the equatorof the sphere. Theexpected distance of the first hop traveled in Pas-try routing is ��� � ëYR�BSUj<ê#¡@� q . If we draw two cir-cles around the two starting nodes with radius of�´� � ë�R2�BSUj<ê#¡@� q , the intersectingarea of the two cir-cles will be Æjy�´� � ë�R2�BSUj<ê#¡@� q{¡!ë´¡@� q . Thus the prob-ability that, after the first hop, the two paths con-verge into the same node is � o��t� �´� � j<ê#¡!ë+ê#¡@� q i Æjy��� � ëYR�BSUj<ê#¡@� q{¡!ë+ê#¡@� qMra |2\ �@]_^ v2` jy�´� � ë�R2�BSUj<ê#¡@� q{¡@� q .

If thetwo pathsdid not convergeafterthefirst hop,inthe worst case,the two messagesmove in oppositedi-rections,andthedistancebetweenthetwo nodesreachedby the two messagesis ë�g i ë+ê � s���� � ëYR�BSUj<ê#¡@� q ; inthebestcase,ë�g i<T �WV j<ê#¡!ëaêâ��s��´� � ëYR�BSUj<ê#¡@� qMq ; andsincewith equalprobability thehopmaymove in all di-rections,ë�g iRëaê in theaveragecase.

Sincethe expecteddistanceof the secondhop trav-eled in Pastry routing is ��� � ëYR��SÁj4gt¡@� q , the prob-ability that after second hop, the two paths con-verge into the same node is � o��t� �´� � j4gt¡!ë+ê#¡@� q i Æjy��� � ëYR�BSUj4gt¡@� q{¡!ë�gt¡@� qMra |2\ �@]_^ v2` jy�´� � ë�R2�BSUj4gt¡@� q{¡@� q .

Theanalysisfor subsequenthopsis analogous.

19