PastryScalable, decentralized object location and routing for large-scale peer-to-peer systems
Peter Druschel, Rice University
Antony Rowstron, Microsoft Research Cambridge, UK
Modified and Presented by:JinYoung You
Outline
• Background
• Pastry
• Pastry proximity routing
• Related Work
• Conclusions
Background
Peer-to-peer systems
• distribution
• decentralized control
• self-organization
• symmetry (communication, node roles)
Common issues
• Organize and maintain the overlay network
  – node arrivals
  – node failures
• Resource allocation / load balancing
• Resource location
• Network proximity routing

Idea: provide a generic p2p substrate
Architecture
(Figure: layered architecture. From bottom to top: the Internet (TCP/IP), the Pastry p2p substrate (a self-organizing overlay network), and a p2p application layer with applications such as network storage and event notification.)
Structured p2p overlays
One primitive: route(M, X): route message M to the live node with nodeID closest to key X

• nodeIDs and keys are drawn from a large, sparse id space
Distributed Hash Tables (DHT)
(Figure: keys k1..k6 with values v1..v6 mapped onto nodes of the P2P overlay network.)

Operations: insert(k,v), lookup(k)

• p2p overlay maps keys to nodes
• completely decentralized and self-organizing
• robust, scalable
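The insert/lookup operations above can be sketched in terms of the single route primitive. This is our own toy illustration (the class and method names are not from the slides); nodes are simulated as local dicts and routing is reduced to picking the numerically closest nodeID.

```python
# Toy sketch of a DHT built on "route to the node whose id is closest to the
# key". Nodes are simulated in-process as dicts keyed by nodeID.
class ToyOverlay:
    def __init__(self, node_ids):
        self.nodes = {n: {} for n in node_ids}  # nodeID -> local store

    def _route(self, key: int) -> int:
        # Deliver at the live node with id numerically closest to the key.
        return min(self.nodes, key=lambda n: abs(n - key))

    def insert(self, key: int, value):
        self.nodes[self._route(key)][key] = value

    def lookup(self, key: int):
        return self.nodes[self._route(key)].get(key)
```

For example, with nodeIDs {10, 100, 1000}, key 90 maps to node 100, so an insert and a later lookup of key 90 meet at the same node.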
Why structured p2p overlays?
• Leverage pooled resources (storage, bandwidth, CPU)
• Leverage resource diversity (geographic, ownership)
• Leverage existing shared infrastructure
• Scalability
• Robustness
• Self-organization
Outline
• Background
• Pastry
• Pastry proximity routing
• Related Work
• Conclusions
Pastry: Object distribution
Consistent hashing [Karger et al. '97]

• 128-bit circular id space (0 ... 2^128 - 1)
• nodeIDs (uniform random)
• objIDs (uniform random)

Invariant: the node with the numerically closest nodeID maintains the object
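The invariant can be made concrete with a short sketch. This is our own illustration (names such as `responsible_node` are not from the slides); it treats numeric closeness on the circular 128-bit id space as distance along the ring.

```python
# Illustrative sketch: which node maintains an object under the invariant
# "the node with the numerically closest nodeID maintains the object"?
RING = 2 ** 128  # size of the circular id space

def ring_distance(a: int, b: int) -> int:
    """Shortest distance between two ids on the circular id space."""
    d = abs(a - b)
    return min(d, RING - d)

def responsible_node(obj_id: int, node_ids: list[int]) -> int:
    """The live node whose nodeID is closest to obj_id maintains the object."""
    return min(node_ids, key=lambda n: ring_distance(n, obj_id))
```

With nodeIDs {10, 100, 1000} and objID 60, the object is maintained by node 100.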
Pastry: Object insertion/lookup
A message with key X is routed to the live node with nodeID closest to X (Route(X) on the 0 ... 2^128 - 1 circle).

Problem: a complete routing table is not feasible
Pastry: Routing
Tradeoff:
• O(log N) routing table size
• O(log N) message forwarding steps
Pastry: Routing table (#65a1fc)

Row 0 (no shared prefix; own digit 6 omitted):
0x 1x 2x 3x 4x 5x 7x 8x 9x ax bx cx dx ex fx

Row 1 (prefix 6; own digit 5 omitted):
60x 61x 62x 63x 64x 66x 67x 68x 69x 6ax 6bx 6cx 6dx 6ex 6fx

Row 2 (prefix 65; own digit a omitted):
650x 651x 652x 653x 654x 655x 656x 657x 658x 659x 65bx 65cx 65dx 65ex 65fx

Row 3 (prefix 65a; own digit 1 omitted):
65a0x 65a2x 65a3x 65a4x 65a5x 65a6x 65a7x 65a8x 65a9x 65aax 65abx 65acx 65adx 65aex 65afx

(x stands for an arbitrary suffix; the table has log16 N rows)
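A small helper (ours, not the paper's) shows how a key indexes into this table: the row is the length of the hex prefix shared with the local nodeID, and the column is the key's next digit.

```python
def table_slot(node_id: str, key: str):
    """Return (row, column digit) for `key` in `node_id`'s routing table.

    Assumes hex-string ids of equal length with key != node_id.
    """
    l = 0
    while l < len(key) and node_id[l] == key[l]:
        l += 1
    return l, key[l]

# For node 65a1fc: key d46a1c shares no prefix -> row 0, column d;
# key 65b2aa shares "65"   -> row 2, column b.
```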
Pastry: Routing
Properties:
• log16 N steps
• O(log N) state

(Figure: node 65a1fc routes Route(d46a1c) via d13da3, d4213f, d462ba to d467c4, the live node with nodeID numerically closest to the key; d471f1 is the other nearby node in the id space.)
Pastry: Leaf sets
Each node maintains the IP addresses of the nodes with the L/2 numerically closest larger and L/2 numerically closest smaller nodeIDs.
• routing efficiency/robustness
• fault detection (keep-alive)
• application-specific local coordination
Pastry: Routing procedure

if (destination D is within range of our leaf set)
    forward to the numerically closest member
else
    let l = length of prefix shared with D
    let d = value of the l-th digit in D's address
    if (R[l][d] exists)
        forward to R[l][d]
    else
        forward to a known node that
        (a) shares at least as long a prefix with D, and
        (b) is numerically closer to D than this node
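The procedure can be rendered directly in Python. This is our sketch under assumed data structures: `leaf_set` is a list of numerically nearby hex nodeIDs, and `routing_table` maps (row, digit) to a nodeID for the entries that exist.

```python
# Sketch of Pastry's routing decision at one node (data structures assumed).
def shared_prefix_len(a: str, b: str) -> int:
    l = 0
    while l < min(len(a), len(b)) and a[l] == b[l]:
        l += 1
    return l

def next_hop(node_id, key, leaf_set, routing_table, known_nodes=()):
    """Choose where to forward a message whose destination key is `key`."""
    num = lambda x: int(x, 16)
    # 1. Key within leaf set range: forward to the numerically closest member.
    candidates = set(leaf_set) | {node_id}
    lo, hi = min(map(num, candidates)), max(map(num, candidates))
    if leaf_set and lo <= num(key) <= hi:
        return min(candidates, key=lambda n: abs(num(n) - num(key)))
    # 2. Routing table entry at (shared prefix length, next digit of the key).
    l = shared_prefix_len(node_id, key)  # assumes key != node_id here
    entry = routing_table.get((l, key[l]))
    if entry is not None:
        return entry
    # 3. Rare case: any known node sharing at least as long a prefix that is
    #    numerically closer to the key than this node.
    for n in known_nodes:
        if (shared_prefix_len(n, key) >= l
                and abs(num(n) - num(key)) < abs(num(node_id) - num(key))):
            return n
    return node_id  # this node is the closest we know of: deliver locally
```

For instance, at node 65a1fc a message for key d46a1c misses the leaf set range and is forwarded via the row-0, column-d routing table entry.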
Pastry: Performance
Integrity of overlay / message delivery:
• guaranteed unless L/2 nodes with adjacent nodeIDs fail simultaneously

Number of routing hops:
• no failures: < log16 N expected, 128/b + 1 max
• during failure recovery: O(N) worst case, average case much better
Pastry: Self-organization
Initializing and maintaining routing tables and leaf sets
• Node addition
• Node departure (failure)
Pastry: Node addition
New node: d46a1c

(Figure: the new node's join message, Route(d46a1c), travels from nearby node 65a1fc via d13da3, d4213f, d462ba toward d467c4/d471f1, the nodes numerically closest to the new nodeID.)
Node departure (failure)
Leaf set members exchange keep-alive messages
• Leaf set repair (eager): request set from farthest live node in set
• Routing table repair (lazy): get table from peers in the same row, then higher rows
Pastry: Experimental results
Prototype:
• implemented in Java
• emulated network
• deployed testbed (currently ~25 sites worldwide)
Pastry: Average # of hops
(Figure: average number of routing hops vs. number of nodes, 1,000 to 100,000; Pastry closely tracks log16 N.)
L=16, 100k random queries
Pastry: # of hops (100k nodes)
Hops:        0       1       2       3       4       5       6
Probability: 0.0000  0.0006  0.0156  0.1643  0.6449  0.1745  0.0000

L=16, 100k random queries
Pastry: # routing hops (failures)
L=16, 100k random queries, 5k nodes, 500 failures

Average hops per lookup: 2.73 (no failure), 2.96 (during failure), 2.74 (after routing table repair)
Outline
• Background
• Pastry
• Pastry proximity routing
• Related Work
• Conclusions
Pastry: Proximity routing
Assumption: a scalar proximity metric
• e.g. ping delay, # of IP hops
• a node can probe its distance to any other node

Proximity invariant: each routing table entry refers to a node close to the local node (in the proximity space), among all nodes with the appropriate nodeID prefix.
Pastry: Routes in proximity space
(Figure: the route Route(d46a1c) from 65a1fc via d13da3, d4213f, d462ba to d467c4, shown both in the nodeID space and in the proximity space; d471f1 is the other nearby nodeID.)
Pastry: Distance traveled
(Figure: relative distance traveled vs. number of nodes, 1,000 to 100,000; Pastry compared against a complete routing table.)
L=16, 100k random queries, Euclidean proximity space
Pastry: Locality properties

1) The expected distance traveled by a message in the proximity space is within a small constant of the minimum
2) Routes of messages sent by nearby nodes with the same key converge at a node near the source nodes
3) Among the k nodes with nodeIDs closest to the key, the message is likely to reach the node closest to the source node first
Pastry: Node addition

New node: d46a1c

(Figure: the join route Route(d46a1c) via 65a1fc, d13da3, d4213f, d462ba toward d467c4/d471f1, shown both in the nodeID space and in the proximity space.)
Pastry delay
(Figure: distance traveled by a Pastry message vs. distance between source and destination; mean = 1.59.)
GATech topology, 0.5M hosts, 60K nodes, 20K random messages
Pastry: API
• route(M, X): route message M to node with nodeID numerically closest to X
• deliver(M): deliver message M to application
• forwarding(M, X): message M is being forwarded towards key X
• newLeaf(L): report change in leaf set L to application
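The exported API can be written down as an interface. The method names are from the slide; the Python framing (abstract classes, the `PastrySubstrate` name) is ours.

```python
# Sketch of the Pastry API as abstract interfaces (our framing).
from abc import ABC, abstractmethod

class PastryApp(ABC):
    """Upcalls the substrate makes into the application."""

    @abstractmethod
    def deliver(self, msg):
        """msg arrived at the node with nodeID numerically closest to its key."""

    @abstractmethod
    def forwarding(self, msg, key):
        """msg is passing through this node on its way toward key."""

    @abstractmethod
    def newLeaf(self, leaf_set):
        """The local leaf set L changed."""

class PastrySubstrate(ABC):
    @abstractmethod
    def route(self, msg, key):
        """Route msg to the node with nodeID numerically closest to key."""
```

An application subclasses `PastryApp` and implements the three upcalls; the substrate invokes them as messages flow through the node.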
Pastry: Security
• Secure nodeID assignment
• Secure node join protocols
• Randomized routing
• Byzantine fault-tolerant leaf set membership protocol
Outline
• Background
• Pastry
• Pastry proximity routing
• Related Work
• Conclusions
Pastry: Related work
• Chord [Sigcomm'01]
• CAN [Sigcomm'01]
• Tapestry [TR UCB/CSD-01-1141]
• PAST [SOSP'01]
• SCRIBE [NGC'01]
Outline
• Background
• Pastry
• Pastry proximity routing
• Related Work
• Conclusions
Conclusions
• Generic p2p overlay network
• Scalable, fault resilient, self-organizing, secure
• O(log N) routing steps (expected)
• O(log N) routing table size
• Network proximity routing

For more information: http://www.cs.rice.edu/CS/Systems/Pastry
Thanks
• Any Questions?
Outline
• PAST
• SCRIBE
PAST: Cooperative, archival file storage and distribution
• Layered on top of Pastry
• Strong persistence
• High availability
• Scalability
• Reduced cost (no backup)
• Efficient use of pooled resources
PAST API
• Insert: store a replica of a file at k diverse storage nodes
• Lookup: retrieve the file from a nearby live storage node that holds a copy
• Reclaim: free the storage associated with a file

Files are immutable.
PAST: File storage
(Figure: a file is stored by routing an Insert message with key fileID.)
PAST: File storage
Storage invariant: file "replicas" are stored on the k nodes with nodeIDs closest to fileID (k is bounded by the leaf set size).

(Figure: Insert fileID with k = 4; the four nodes closest to fileID each store a replica.)
PAST: File Retrieval
• the file is located in log16 N steps (expected)
• a Lookup usually locates the replica nearest the client C

(Figure: client C's Lookup for fileID reaches one of the k replicas.)
PAST: Exploiting Pastry
• Random, uniformly distributed nodeIDs: replicas stored on diverse nodes
• Uniformly distributed fileIDs, e.g. SHA-1(filename, public key, salt): approximate load balance
• Pastry routes to the closest live nodeID: availability, fault tolerance
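The fileID construction named above (SHA-1 over filename, public key, and salt) can be sketched with the standard library. The length-prefixing and the exact field order here are our assumptions, not from the slides.

```python
# Sketch: a 160-bit fileID from SHA-1(filename, public key, salt).
import hashlib

def file_id(filename: str, public_key: bytes, salt: bytes) -> int:
    h = hashlib.sha1()
    for part in (filename.encode("utf-8"), public_key, salt):
        h.update(len(part).to_bytes(4, "big"))  # unambiguous field boundaries
        h.update(part)
    return int.from_bytes(h.digest(), "big")
```

Because SHA-1 output is effectively uniform, fileIDs spread evenly over the id space, which gives the approximate load balance mentioned above.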
PAST: Storage management
• Maintain the storage invariant
• Balance free space when global utilization is high, despite:
  – statistical variation in the assignment of files to nodes (fileID/nodeID)
  – file size variations
  – node storage capacity variations
• Local coordination only (leaf sets)
Experimental setup
• Web proxy traces from NLANR: 18.7 GB; mean 10.5 KB, median 1.4 KB, min 0, max 138 MB
• Filesystem: 166.6 GB; mean 88 KB, median 4.5 KB, min 0, max 2.7 GB
• 2250 PAST nodes (k = 5); truncated normal distributions of node storage sizes, mean = 27/270 MB
Need for storage management
• No diversion (t_pri = 1, t_div = 0):
  – max utilization 60.8%
  – 51.1% of inserts failed
• Replica/file diversion (t_pri = 0.1, t_div = 0.05):
  – max utilization > 98%
  – < 1% of inserts failed
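One plausible reading of the t_pri/t_div parameters, sketched below as our own assumption (the precise rule is in the PAST paper): a node accepts a replica only if the file consumes at most a threshold fraction of its remaining free space, with a stricter threshold for replicas diverted from other nodes.

```python
# Hedged sketch of a threshold-based acceptance test (our interpretation).
def accepts(file_size: int, free_space: int, threshold: float) -> bool:
    """Accept a replica only if it is small relative to remaining free space."""
    return free_space > 0 and file_size / free_space <= threshold

T_PRI, T_DIV = 0.1, 0.05  # thresholds from the slide's second configuration
```

Under this reading, t_pri = 1 accepts anything that physically fits (no diversion), while t_pri = 0.1 keeps large files off nearly-full nodes, which is consistent with the utilization numbers above.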
PAST: File insertion failures
(Figure: failed insertions plotted by file size (0 to 2 MB) and failure ratio (0% to 30%) against global utilization (0% to 100%).)
PAST: Caching
• Nodes cache files in the unused portion of their allocated disk space
• Files are cached on nodes along the routes of lookup and insert messages

Goals:
• maximize query throughput for popular documents
• balance query load
• improve client latency
PAST: Caching
(Figure: a Lookup for fileID is satisfied by a cached copy on a node along the route.)
PAST: Caching
(Figure: global cache hit rate (0 to 1) and average number of routing hops (0 to 2.5) vs. utilization (0% to 100%), for GD-S and LRU cache replacement and for no caching (None).)
PAST: Security
• No read access control; users may encrypt content for privacy
• File authenticity: file certificates
• System integrity: nodeIDs and fileIDs are non-forgeable; sensitive messages are signed
• Routing is randomized
PAST: Storage quotas
Balance storage supply and demand:
• a user holds a smartcard issued by brokers
  – hides the user's private key and usage quota
  – debits the quota upon issuing a file certificate
• storage nodes hold smartcards
  – advertise supply quota
  – storage nodes are subject to random audits within leaf sets
PAST: Related Work
• CFS [SOSP'01]
• OceanStore [ASPLOS 2000]
• FarSite [Sigmetrics 2000]
Outline
• PAST
• SCRIBE
SCRIBE: Large-scale, decentralized multicast
• Infrastructure to support topic-based publish-subscribe applications
• Scalable: large numbers of topics and subscribers, wide range of subscribers per topic
• Efficient: low delay, low link stress, low node overhead
SCRIBE: Large scale multicast
(Figure: nodes Subscribe to a topicID; a Publish to the topicID is disseminated along the resulting multicast tree rooted at the node with nodeID closest to the topicID.)
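A hedged sketch of how such a tree forms on top of Pastry routing (our illustration; `route_path` is a stand-in for the Pastry route toward the topicID): each node on a subscriber's route adds the previous hop as a child and forwards the subscription only if it was not already in the tree.

```python
# Sketch of Scribe-style tree building and dissemination (assumed structures:
# children maps (node, topic) -> set of child nodes; in_tree is a set of
# (node, topic) pairs already joined to the tree).
def subscribe(subscriber, topic, route_path, children, in_tree):
    prev = subscriber
    for node in route_path(subscriber, topic):
        children.setdefault((node, topic), set()).add(prev)
        if (node, topic) in in_tree:
            break  # joined an existing branch; stop forwarding
        in_tree.add((node, topic))
        prev = node

def publish(root, topic, children, msg, deliver):
    # Flood the message down the children tree from the topic's root.
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get((node, topic), ()):
            deliver(child, msg)
            stack.append(child)
```

Because subscriptions stop at the first node already in the tree, the tree's fan-out follows the union of the subscribers' Pastry routes, which is what keeps per-node and per-link load low.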
Scribe: Results
• Simulation results
• Comparison with IP multicast: delay, node stress and link stress
• Experimental setup:
  – Georgia Tech Transit-Stub model
  – 100,000 nodes randomly selected out of 0.5M
  – Zipf-like subscription distribution, 1,500 topics
Scribe: Topic popularity
(Figure: group size (log scale, 1 to 100,000) vs. group rank (0 to 1,500).)
gsize(r) = floor(N * r^(-1.25) + 0.5); N = 100,000; 1,500 topics
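The popularity model in the caption is directly computable; `gsize` below is the slide's own formula.

```python
# Group size of the topic with rank r: floor(N * r^(-1.25) + 0.5).
import math

def gsize(r: int, n: int = 100_000) -> int:
    return math.floor(n * r ** -1.25 + 0.5)
```

The most popular topic has 100,000 subscribers, and sizes fall off steeply with rank, giving the Zipf-like distribution used in the experiments.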
Scribe: Delay penalty
(Figure: cumulative number of groups (0 to 1,500) vs. delay penalty (0 to 5), for RMD and RAD.)
Relative delay penalty, average and maximum
Scribe: Node stress
(Figure: histograms of the number of nodes vs. the total number of children table entries, full range 0 to 1,100 plus a zoomed view.)
Scribe: Link stress
(Figure: number of links vs. link stress, log scale 1 to 10,000, for Scribe and IP multicast; maximum stress marked.)
One message published in each of the 1,500 topics
Scribe: Summary
Self-configuring p2p framework for topic-based publish-subscribe

• Scribe achieves reasonable performance compared to IP multicast
  – scales to a large number of subscribers
  – scales to a large number of topics
  – good distribution of load

For more information: http://www.cs.rice.edu/CS/Systems/Pastry
Status
Functional prototypes:
• Pastry [Middleware 2001]
• PAST [HotOS-VIII, SOSP'01]
• SCRIBE [NGC 2001, IEEE JSAC]
• SplitStream [submitted]
• Squirrel [PODC'02]

http://www.cs.rice.edu/CS/Systems/Pastry
Current Work
• Security
  – secure routing / overlay maintenance / nodeID assignment
  – quota system
• Keyword search capabilities
• Support for mutable files in PAST
• Anonymity / anti-censorship
• New applications
• Free software releases