Distributed Shared Memory, Related Issues, and New Challenges in Large-Scale Dynamic Systems
Vincent Gramoli
ICDCS 2007, June
Fernandez, Gramoli, Jimenez, Kermarrec, Raynal
Roadmap
• Large-Scale Dynamic Systems: Context and Motivations
• Distributed Shared Memory: Facing Dynamism; Facing Scalability
• A Probabilistic Solution: Facing Dynamism and Scalability
• A New Challenge: Distributed Slicing
The Scale-Shift of Distributed Systems
Internet network growth
• IPv4 to IPv6
Personal devices multiply
• All tend to be connected together
17 billion network devices by 2012, as predicted by IDC
[Figure: number of network devices over time]
Drawbacks of this Scale-Shift
Heterogeneity
• Each device acts independently
• Each device has a distinct lifetime
Out-of-control
• Global monitoring is impossible
• The system is unpredictable
Dynamism
• At any time, some participants may leave or fail
• And some others may join
• Unbounded number of leaves/failures/joins
Problem: How to Communicate?
Shared-Memory Paradigm
• Simple programming style
• Appealing design for algorithms
Message-Passing Paradigm
• Better suited for delayed messages
• Fault-tolerant system
Emulating Shared Memory in Message Passing
• Simplicity
• Fault tolerance
[Figure: processes P1 sharing a memory MEM; processes P1 connected by a LINK]
Distributed Shared Memory: the emulation
Consistency criterion: a set of rules
Atomic object:
• If an operation ends before another starts, then it cannot be ordered after it
• Write operations are totally ordered and read operations are ordered w.r.t. write operations
• A read returns the last value written (or the default one if none exists)
If objects are atomic, then the system looks like a shared-memory model!
Quorum-based DSM [ABD]
Quorums
• Mutually intersecting sets of nodes
• Q1 ∩ Q2 ≠ Ø, Q1 ∩ Q3 ≠ Ø, Q2 ∩ Q3 ≠ Ø
[Figure: three overlapping quorums Q1, Q2, Q3]
Each node maintains:
• A local value v of the object
• A unique version number t of this value
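The intersection property can be checked mechanically. The sketch below is only an illustration: it assumes majority quorums over five hypothetical nodes n1 to n5, which is one common way to build mutually intersecting sets and is not stated on the slides. The helper names `is_quorum_system` and `majority_quorums` are made up for this example.

```python
from itertools import combinations


def is_quorum_system(quorums):
    """Return True if every pair of quorums shares at least one node."""
    return all(set(a) & set(b) for a, b in combinations(quorums, 2))


def majority_quorums(nodes):
    """All subsets of minimal majority size; any two majorities intersect."""
    size = len(nodes) // 2 + 1
    return [set(q) for q in combinations(nodes, size)]


nodes = ["n1", "n2", "n3", "n4", "n5"]   # hypothetical node names
quorums = majority_quorums(nodes)
print(len(quorums), is_quorum_system(quorums))
print(is_quorum_system([{"n1", "n2"}, {"n3", "n4"}]))  # disjoint sets
```

Two disjoint sets such as {n1, n2} and {n3, n4} fail the check, which is exactly the situation a quorum system must rule out.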
Operations
A node reads the object value by
• Asking all nodes of a quorum for their value and tag: Get <v,t>
• Choosing the value with the largest tag
• Replicating this value to all nodes of a quorum: Set <v,t>
A node writes a new object value by
• Asking all nodes of a quorum for their tags: Get <v,t>
• Choosing a higher tag than any tag returned: t' = t + 1
• Replicating its value with the new tag to a quorum: Set <v',t'>
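The two-phase read and write can be sketched as a small in-memory simulation. This is a sequential illustration under stated assumptions, not the [ABD] algorithm as published: there is no real message passing or concurrency, quorums are random majorities, and the tag is modeled as a (counter, writer id) pair so that tags stay unique. The class names `Replica` and `Register` are hypothetical.

```python
import random


class Replica:
    """One node: stores a value and the tag (version) attached to it."""

    def __init__(self):
        self.value, self.tag = None, (0, 0)   # tag = (counter, writer id)

    def get(self):
        return self.value, self.tag

    def set(self, value, tag):
        if tag > self.tag:                    # keep only the most recent pair
            self.value, self.tag = value, tag


class Register:
    """Atomic register emulated over n replicas with majority quorums."""

    def __init__(self, n):
        self.replicas = [Replica() for _ in range(n)]
        self.quorum_size = n // 2 + 1

    def _quorum(self):
        # Any majority works; a random one stands in for slow or failed nodes.
        return random.sample(self.replicas, self.quorum_size)

    def read(self):
        # Phase 1: Get <v,t> from a quorum and keep the pair with the largest tag.
        value, tag = max((r.get() for r in self._quorum()), key=lambda vt: vt[1])
        # Phase 2: Set <v,t> on a quorum before returning.
        for r in self._quorum():
            r.set(value, tag)
        return value

    def write(self, value, writer_id):
        # Phase 1: Get the tags of a quorum.
        max_tag = max(r.get()[1] for r in self._quorum())
        # Phase 2: Set <v',t'> with a strictly higher tag on a quorum.
        new_tag = (max_tag[0] + 1, writer_id)
        for r in self._quorum():
            r.set(value, new_tag)


reg = Register(5)
reg.write("v1", writer_id=1)
print(reg.read())
```

Because any two majorities intersect, the read in this sequential run always meets at least one replica that stored the latest write.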
Writing a value v1 (animation over quorums Q1, Q2, Q3)
• Input: v1
• max tag? The quorum answers t
• <v1,t1> (with t1 > t) is replicated to a quorum
Reading a value (animation over quorums Q1, Q2, Q3)
• value? tag? The quorum answers <v1,t1>
• <v1,t1> is replicated to a quorum
• Output: v1
Dynamic DSM [RDS]
Dynamism => Unbounded number of failures
Solution: Reconfiguration
• Replacing quorums periodically with quorums of active nodes.
Problem: Q1 ∩ Q2 = Ø and Q1 ∩ Q3 = Ø and Q2 ∩ Q3 = Ø
[Figure: quorums Q1, Q2, Q3]
All must agree on the next set of quorums
• Quorum-based consensus algorithm: Paxos
Reconfiguration must not block operations
• Up-to-date information is passed from old quorums to new quorums during reconfiguration
• Operations that discover a new quorum set must restart using it.
Algorithm
Reconfiguration is based on Paxos (a 3-phase leader-based consensus algorithm)
• l is the leader
• c is the current configuration
• configs is the set of active configurations
• A ballot has a unique identifier b and a value v, which is a configuration
Paxos phases:
• Prepare: l creates a new ballot and chooses/gets the value to propose.
• Propose: l proposes <b,v> and gathers votes from a majority.
• Propagate: l propagates the decision.
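The three phases can be illustrated with a minimal single-decree Paxos sketch. This is plain Paxos run sequentially in memory, not the [RDS] reconfiguration protocol itself: it leaves out the configs set and the tag/val transfer, and the names `Acceptor` and `reconfigure` are hypothetical.

```python
class Acceptor:
    """A node voting on ballots; state follows single-decree Paxos."""

    def __init__(self):
        self.promised = 0       # highest ballot identifier promised so far
        self.accepted = None    # (ballot identifier, configuration) or None

    def on_prepare(self, b):
        if b > self.promised:
            self.promised = b
            return True, self.accepted
        return False, None

    def on_propose(self, b, v):
        if b >= self.promised:
            self.promised, self.accepted = b, (b, v)
            return True
        return False


def reconfigure(acceptors, b, wanted):
    """Leader l tries to get a configuration decided with ballot b."""
    majority = len(acceptors) // 2 + 1
    # Prepare: collect promises and any previously accepted value.
    replies = [a.on_prepare(b) for a in acceptors]
    granted = [accepted for ok, accepted in replies if ok]
    if len(granted) < majority:
        return None
    earlier = [acc for acc in granted if acc is not None]
    # A value accepted under an earlier ballot takes precedence over `wanted`.
    v = max(earlier, key=lambda acc: acc[0])[1] if earlier else wanted
    # Propose: gather votes from a majority.
    votes = sum(a.on_propose(b, v) for a in acceptors)
    if votes < majority:
        return None
    # Propagate: the decision would now be sent to all nodes.
    return v


nodes = [Acceptor() for _ in range(3)]
print(reconfigure(nodes, 1, "c1"))   # first ballot decides c1
print(reconfigure(nodes, 2, "c2"))   # a later ballot must re-propose c1
```

The second call shows why the leader "chooses/gets" its value: once a configuration has been accepted by a majority, later ballots can only confirm it.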
Recon(c,c') message flow (animation with leader l and quorums Q1, Q2)
Prepare phase
• l creates a new larger ballot b
• <1a, b>
• <1b, b, configs, <b'', c''>>
• l updates its ballot's value v with the one received
• l updates its configs set
Propose phase
• <2a, b, c, v>
• <2b, b, c, v, tag, val>
• Nodes update their tag and val
• Nodes add v to their configs set
Propagation phase
• <3a, c, v, tag, val>
• Nodes update their tag and val
• Nodes remove configuration c from their configs set
Good News: The overhead latency to cope with dynamism is low
Bad News: In both solutions, congestion may increase the operation latency
First Conclusion
Communication complexity must be reduced to face scalability!
Scalable DSM [SQUARE]
Object replicated on failure-prone nodes
• The replicas r1, …, rk share a 2-dimensional coordinate space
[Figure: grid of replicas r1 r2 r3 r4 / r5 r6 r7 r8 / … / rk-1 rk]
Communication through neighborhood
• Each replica ri can communicate only with its nearest neighbors
Topology takeover mechanism [CAN]
• Upon node failure/departure, the space sharing is modified accordingly
• If a node ri fails, a takeover node rj replaces it
Dynamic Quorums
• Vertical Quorum: all replicas responsible for an abscissa x
• Horizontal Quorum: all replicas responsible for an ordinate y
For any horizontal quorum H and any vertical quorum V: H ∩ V ≠ Ø
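A quick way to see why H ∩ V ≠ Ø holds is to build the quorums on a grid. The sketch below assumes a regular 3 x 4 grid with one replica per cell, which simplifies the coordinate space shared by the replicas (zones may have different shapes in practice). The helper names are hypothetical.

```python
def horizontal_quorum(grid, y):
    """All replicas responsible for ordinate y (one row of the grid)."""
    return set(grid[y])


def vertical_quorum(grid, x):
    """All replicas responsible for abscissa x (one column of the grid)."""
    return {row[x] for row in grid}


rows, cols = 3, 4
grid = [[f"r{y * cols + x + 1}" for x in range(cols)] for y in range(rows)]

# Every row meets every column in exactly the replica at their crossing.
for y in range(rows):
    for x in range(cols):
        common = horizontal_quorum(grid, y) & vertical_quorum(grid, x)
        assert common == {grid[y][x]}

print(sorted(horizontal_quorum(grid, 0)), sorted(vertical_quorum(grid, 2)))
```

In this regular layout a quorum has about √k replicas out of k, which is where the reduced communication cost comes from.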
Operation Execution
Read Operation:
1) Get up-to-date values and tags of a horizontal quorum,
2) Replicate this value on a vertical quorum.
Write Operation:
1) Get up-to-date value on a horizontal quorum,
2) Replicate the value to write (and a higher tag) on the same vertical quorum
Self-Adjusting Memory
• Memory thwarts if the requested replica is overloaded: other replicas on its diagonal are contacted in turn until a non-overloaded one is found.
• Memory expands if all contacted replicas are overloaded: a node outside the memory is added, and the object value is replicated at this node.
• Memory shrinks if a replica gets underloaded: the replica simply leaves the memory after notifying its neighbors.
Good News: The memory self-adapts well in the face of dynamism
Good News: The load is well-balanced over the replicas
Bad News: The operation latency increases with the load (request rate)
Second Conclusion
No way to avoid the tradeoff between communication and time complexity!
Scalable and Dynamic DSM [TQS]
Motivations for Probabilistic Solutions
• The tradeoff between time and message complexity prevents deterministic solutions
Allowing more realistic models
• Any node can fail independently
• Even if it is unlikely that many nodes fail at the same time
Quality of Service (QoS) is often expressed in terms of percentage of success
Dynamic System
• n interconnected nodes
• Nodes join/leave the system
• A joining node is new
• c is the churn:
  • At each time unit, ⌊cn⌋ nodes leave the network
  • At each time unit, ⌊cn⌋ nodes enter the network
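The effect of churn on the initial population can be simulated in a few lines. The sketch assumes that the ⌊cn⌋ leaving nodes are chosen uniformly at random at each time unit, which the slide does not state. Under that assumption the surviving fraction after Δ time units is about (1-c)^Δ. The function name `surviving_fraction` is hypothetical.

```python
import math
import random


def surviving_fraction(n, c, delta, seed=0):
    """Fraction of the n initial nodes still present after delta time units."""
    rng = random.Random(seed)
    nodes = list(range(n))            # identifiers below n are the initial nodes
    next_id = n
    k = math.floor(c * n)             # floor(cn) leaves and joins per time unit
    for _ in range(delta):
        rng.shuffle(nodes)
        nodes = nodes[k:]             # k uniformly chosen nodes leave
        nodes.extend(range(next_id, next_id + k))   # k brand-new nodes join
        next_id += k
    return sum(1 for x in nodes if x < n) / n


n, c, delta = 10_000, 0.25, 3
print(surviving_fraction(n, c, delta), (1 - c) ** delta)
```

The system size stays at n throughout, but after a few time units a large share of the nodes that held a value may be gone, which is what the quorum size has to compensate for.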
Probabilistic Atomicity
• If an operation ends before another starts, then it is ordered after it with probability $e^{-\beta^2}$ (with β a constant). If this happens, the preceding operation is considered unsuccessful.
• Write operations are totally ordered and read operations are ordered w.r.t. write operations
• A read returns the last successfully written value (or the default one if none exists) with probability $1 - e^{-\beta^2}$ (with β a constant). If this does not hold, then the read is unsuccessful.
Gossip-based algorithm in parallel
• Shuffle the set of neighbors using a gossip-based algorithm (e.g. Cyclon)
Quorum contact
- Disseminate a message with TTL l to k neighbors, such that #contacted nodes = $\beta \sqrt{n} / (1-c)^{\Delta/2}$
- Decrement the TTL of a message the first time it is received.
- Forward received messages to k neighbors if their TTL is not null.
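The dissemination step can be sketched as a simple simulation. It assumes that each forward goes to a node drawn uniformly at random among n nodes (standing in for the shuffled neighbor sets) and that a node that has already received the message does not forward it again. How l and k are actually chosen to hit the target count is not given on the slide, so `ttl_for` below is only one hypothetical rule based on the upper bound k + k² + … + k^l.

```python
import random


def disseminate(n, k, ttl, seed=0):
    """Return how many distinct nodes receive a message sent with this TTL."""
    rng = random.Random(seed)
    contacted = set()
    frontier = [(rng.randrange(n), ttl) for _ in range(k)]   # initiator's k sends
    while frontier:
        node, t = frontier.pop()
        if node in contacted:
            continue                  # duplicates are not forwarded again
        contacted.add(node)
        t -= 1                        # decrement the TTL on first reception
        if t > 0:                     # forward only while the TTL is not null
            frontier.extend((rng.randrange(n), t) for _ in range(k))
    return len(contacted)


def ttl_for(target, k):
    """Smallest TTL whose upper bound k + k^2 + ... + k^ttl reaches target."""
    ttl, reach = 0, 0
    while reach < target:
        ttl += 1
        reach += k ** ttl
    return ttl


print(ttl_for(100, 3), disseminate(n=1000, k=3, ttl=4))
```

The count returned by `disseminate` is usually below the upper bound because random forwards sometimes hit a node that was already contacted.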
Read
• Ask a quorum of nodes for their values
• Replicate the most up-to-date value to a quorum
Write
• Ask a quorum of nodes for their tags
• Choose a strictly higher tag
• Replicate the value to write with the new tag
Assumptions:
• Success regularity: at least one operation succeeds every Δ time units.
• The underlying gossip-based algorithm provides each node with a neighbor chosen uniformly at random.
Results:
• Expected messages: $O(\sqrt{nD})$ without shuffling,
• Expected latency: $O(\log(nD))$,
• where $D = (1-c)^{-\Delta}$ is the dynamic parameter.
Given the application requirement in terms of QoS, the quorum size can be tuned.
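Tuning the quorum size to a QoS target can be illustrated numerically. The sketch assumes a quorum size of β√(nD) with D = (1-c)^(-Δ), and a success probability of 1 - e^(-β²) as in the probabilistic atomicity definition. The function names are hypothetical and constant factors are ignored.

```python
import math


def dynamic_parameter(c, delta):
    """D = (1 - c)^(-delta): grows with the churn c and the period delta."""
    return (1 - c) ** (-delta)


def quorum_size(n, c, delta, beta):
    """Number of nodes to contact, assuming a size of beta * sqrt(n * D)."""
    return math.ceil(beta * math.sqrt(n * dynamic_parameter(c, delta)))


def beta_for(success_probability):
    """Smallest beta such that 1 - exp(-beta^2) >= success_probability."""
    return math.sqrt(-math.log(1 - success_probability))


beta = beta_for(0.99)                 # QoS target: 99% of successful operations
print(round(beta, 3), quorum_size(n=10_000, c=0.01, delta=10, beta=beta))
```

A stricter QoS target raises β only slowly (it grows like the square root of a logarithm), while higher churn or a longer Δ raises D and hence the quorum size.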
Third Conclusion
Trading deterministic for probabilistic guarantees seems to be the solution!
What could be the next step?
Today in Peer-to-Peer:
• Every node must participate
• Avoid free-riding/lurking at any cost!
However, an old story says:
• Gnutella performance was limited by the performance of its least capable nodes
Some classifying solutions are efficient:
• Recommendation helps decision (e.g. eBay)
• Supernodes/ultrapeers help sharing files (e.g. Kazaa)
• Supernodes help NAT/firewall bypassing (e.g. Skype)
Generally, we can benefit from heterogeneity
• Streaming service needs nodes with the highest bandwidth
• Non-critical service can run on unstable nodes
• File-sharing service requires nodes with many files
• …
Problem: Distributed Slicing [RANK]
Classifying nodes into categories, called slices
• Based on individual characteristics: attributes
• A slice corresponds to a portion of the system
Typically, answering the question: HOW AM I COMPARED TO OTHERS?
[Figure: nodes labeled with their attribute values 68, 70, 8, 72, 62, 75, 65, 20, 71, 48, 59, 89, 27]
…using their attribute values (assume a single attribute for simplicity)
[Figure: the nodes are placed on an axis of attribute values ai from 0 to 100, mapped to normalized indices pi from 0 to 1, and the index axis is split into slices si numbered #1 to #4]
Existing solutions use a gossip-based mechanism
[Animation: the same nodes exchange attribute values by gossip (68 and 89, 72 and 20, 48 and 75); the fractions 4/11, 0/2, 1/4 and 1/3 are shown next to nodes between exchanges]
Performance achieved so far:
• d is the distance from pi' (the position estimate of i) to the closest slice boundary.
• For a confidence coefficient of 99.99%, the required number of attribute values drawn is: $m_i \ge z \, p_i' (1 - p_i') / d^2$, with $z < 16$ a constant.
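The bound can be evaluated with a short sketch. Several points are assumptions made for illustration: the position estimate is taken as the fraction of drawn attribute values that are lower than or equal to the node's own, slices have equal width, and z is set to its upper bound 16. All function names are hypothetical.

```python
import math


def position_estimate(own, seen):
    """Fraction of drawn attribute values that are <= the node's own value."""
    return sum(1 for a in seen if a <= own) / len(seen)


def slice_and_distance(p_est, num_slices):
    """Slice index (0-based) and distance d to the closest slice boundary.

    Assumes equal-width slices and num_slices >= 2.
    """
    width = 1 / num_slices
    index = min(int(p_est / width), num_slices - 1)
    boundaries = [b * width for b in range(1, num_slices)]
    d = min(abs(p_est - b) for b in boundaries)
    return index, d


def required_samples(p_est, d, z=16.0):
    """Lower bound on m_i from m_i >= z * p (1 - p) / d^2 (requires d > 0)."""
    return math.ceil(z * p_est * (1 - p_est) / d ** 2)


p = position_estimate(68, [8, 89, 20, 72])      # two of four values are lower
index, d = slice_and_distance(0.375, num_slices=4)
print(p, index, d, required_samples(0.375, d))
```

The closer an estimate lies to a slice boundary, the smaller d becomes and the more values a node must draw before it can trust its slice.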
Fourth Conclusion
Distributed Slicing is a new Challenge!
References
[RANK] Distributed Slicing in Dynamic Systems. A. Fernandez, V. Gramoli, E. Jimenez, A-M. Kermarrec, and M. Raynal. ICDCS 2007.
[TQS] Timed Quorum System for Large-Scale Dynamic Environments. V. Gramoli and M. Raynal. IRISA TR1859, 2007.
[SQUARE] SQUARE: Scalable Quorum-based Atomic Memory with Local Reconfiguration. V. Gramoli, E. Anceaume, and A. Virgilitto. ACM SAC 2007.
[Cyclon] Cyclon: Inexpensive Membership Management for Unstructured P2P Overlays. S. Voulgaris, D. Gavidia, and M. van Steen. Journal of Network and Systems Management 13(2), 2005.
[RDS] Reconfigurable Distributed Storage for Dynamic Networks. G. Chokler, S. Gilbert, V. Gramoli, P.M. Musial, and A.A. Shvartsman. OPODIS 2005.
[CAN] A Scalable Content Addressable Network. S. Ratnasamy, P. Francis, M. Handley, R.M. Karp, and S. Shenker. ACM SIGCOMM 2001.
[ABD] Sharing Memory Robustly in Message-Passing Systems. H. Attiya, A. Bar-Noy, and D. Dolev. JACM 1995.