Distributed Shared Memory, Related Issues, and New Challenges in Large-Scale Dynamic Systems

Vincent Gramoli

ICDCS 2007, June

Fernandez, Gramoli, Jimenez, Kermarrec, Raynal

Roadmap

Large-Scale Dynamic Systems
• Context and Motivations

Distributed Shared Memory
• Facing Dynamism
• Facing Scalability
• A Probabilistic Solution Facing Dynamism and Scalability

A New Challenge
• Distributed Slicing

The Scale-Shift of Distributed Systems

Internet: network growth, from IPv4 to IPv6.

Personal devices multiply, and all tend to be connected together.

[Plot: number of network devices over time] 17 billion network devices by 2012, as predicted by IDC.

Drawbacks of this Scale-Shift

Heterogeneity
• Each device acts independently
• Each device has a distinct lifetime

Out-of-control
• Global monitoring is impossible
• The system is unpredictable

Dynamism
• At any time some participants may leave or fail
• And some others may join
• Unbounded number of leaves/failures/joins

Problem: How to Communicate?

Shared-Memory Paradigm
• Simple programming style
• Appealing design for algorithms

Message-Passing Paradigm
• Better suited for delayed messages
• Fault-tolerant system

Emulating Shared Memory in Message Passing
• Simplicity
• Fault tolerance

[Figure: processes sharing a memory (MEM) versus processes connected by a link (LINK)]

Distributed Shared Memory: the emulation

Consistency criterion: a set of rules

Atomic object:

• If an operation ends before another starts, then it cannot be ordered after it

• Write operations are totally ordered and read operations are ordered w.r.t. write operations

• A read returns the last value written (or the default one if none exists)

If objects are atomic, then the system looks like a shared-memory model!

Quorum-based DSM [ABD]

Quorums: mutually intersecting sets of nodes

[Figure: three quorums Q1, Q2, Q3]

Q1 ∩ Q2 ≠ Ø, Q1 ∩ Q3 ≠ Ø, Q2 ∩ Q3 ≠ Ø

Each node maintains:
• A local value v of the object
• A unique version number t of this value
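The intersection property can be checked mechanically. The sketch below assumes majority quorums over five hypothetical nodes (the slides do not say how quorums are built; majorities are just one common choice) and verifies that every pair of quorums shares a node.

```python
from itertools import combinations


def majority_quorums(nodes):
    """Return every subset of `nodes` that holds a strict majority of them."""
    size = len(nodes) // 2 + 1
    return [frozenset(q) for q in combinations(nodes, size)]


def mutually_intersecting(quorums):
    """True if every pair of quorums shares at least one node."""
    return all(q1 & q2 for q1, q2 in combinations(quorums, 2))


# Hypothetical node names, chosen only for illustration.
nodes = ["n1", "n2", "n3", "n4", "n5"]
quorums = majority_quorums(nodes)
print(len(quorums), mutually_intersecting(quorums))
```

Two disjoint sets such as {n1} and {n2} fail the same check, which is the situation reconfiguration has to avoid.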

Quorum-based DSM [ABD]

Operations

A node reads the object value by:
• Asking all nodes of a quorum for their value and tag
• Choosing the value with the largest tag
• Replicating this value to all nodes of a quorum

A node writes a new object value by:
• Asking all nodes of a quorum for their tags
• Choosing a higher tag than any tag returned
• Replicating its value with the new tag to a quorum

Read: Get <v,t>, then Set <v,t>

Write: Get <v,t>, then Set <v’,t’> with t’ > t (e.g. t’ = t + 1)
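The two operations can be sketched with in-memory replicas. This is a minimal illustration, assuming a single writer (so an integer tag is enough to stay unique), no failures, and synchronous calls instead of messages; the class and function names are hypothetical and not taken from [ABD].

```python
class Replica:
    """One node: a local value and the tag (version number) of that value."""

    def __init__(self):
        self.value, self.tag = None, 0  # default value, initial tag

    def get(self):
        return self.value, self.tag

    def set(self, value, tag):
        # Adopt the incoming pair only if it is strictly newer.
        if tag > self.tag:
            self.value, self.tag = value, tag


def read(get_quorum, set_quorum):
    # Ask a quorum for <value, tag> and keep the pair with the largest tag.
    value, tag = max((r.get() for r in get_quorum), key=lambda vt: vt[1])
    # Replicate that pair to a quorum before returning.
    for r in set_quorum:
        r.set(value, tag)
    return value


def write(new_value, get_quorum, set_quorum):
    # Ask a quorum for its tags and pick a strictly higher one.
    new_tag = max(r.get()[1] for r in get_quorum) + 1
    for r in set_quorum:
        r.set(new_value, new_tag)


replicas = [Replica() for _ in range(5)]
q1 = replicas[0:3]                            # three pairwise intersecting quorums
q2 = replicas[2:5]
q3 = [replicas[0], replicas[1], replicas[4]]

write("v1", q1, q2)
result = read(q3, q1)
print(result)
```

The read sees v1 even though it queries a different quorum than the one written to, because the two quorums intersect in at least one replica.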

Quorum-based DSM [ABD]

[Figure sequence over quorums Q1, Q2, Q3]

Writing a value v1
1) Input: v1
2) The writer asks a quorum "max tag?" and receives t
3) The writer sends v1,t1 (with t1 > t) to a quorum

Reading a value
1) The reader asks a quorum "value? tag?" and receives v1,t1
2) The reader sends v1,t1 to a quorum
3) Output: v1

Dynamic DSM [RDS]

Dynamism => unbounded number of failures

Solution: Reconfiguration
• Replacing quorums periodically with quorums of active nodes

[Figure: quorums Q1, Q2, Q3]

Problem: Q1 ∩ Q2 = Ø, Q1 ∩ Q3 = Ø, and Q2 ∩ Q3 = Ø

Dynamic DSM [RDS]

All must agree on the next set of quorums
• Quorum-based consensus algorithm: Paxos

Reconfiguration must not block operations
• Up-to-date information is passed from old quorums to new quorums during reconfiguration
• Operations that discover a new quorum set must restart using it

Dynamic DSM [RDS]

Algorithm
• Reconfiguration is based on Paxos (a three-phase, leader-based consensus algorithm)
• l is the leader
• c is the current configuration
• configs is the set of active configurations
• A ballot has a unique identifier b and a value v, which is a configuration

Paxos phases:
• Prepare: l creates a new ballot and chooses/gets the value to propose.
• Propose: l proposes <b,v> and gathers votes from a majority.
• Propagate: l propagates the decision.
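The three phases can be illustrated with a generic single-decree Paxos sketch. It is not the [RDS] algorithm itself: it ignores the configs set, tags, and values, runs synchronously in memory, and uses hypothetical names (`Acceptor`, `run_ballot`). It only shows why a later leader must adopt a value that an earlier ballot may already have decided.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1   # highest ballot identifier promised so far
        self.voted = None    # last (ballot, value) voted for, if any

    def on_prepare(self, b):
        if b > self.promised:
            self.promised = b
            return self.voted if self.voted else (-1, None)
        return None  # ballot too old: no answer

    def on_propose(self, b, v):
        if b >= self.promised:
            self.promised = b
            self.voted = (b, v)
            return True
        return False


def run_ballot(acceptors, b, proposal):
    """Leader side of one ballot; returns the decided value or None."""
    majority = len(acceptors) // 2 + 1

    # Prepare: collect earlier votes from a majority.
    answers = [a.on_prepare(b) for a in acceptors]
    answers = [x for x in answers if x is not None]
    if len(answers) < majority:
        return None
    prev_b, prev_v = max(answers, key=lambda bv: bv[0])
    v = prev_v if prev_b >= 0 else proposal  # adopt an earlier vote if one exists

    # Propose: gather votes for <b, v> from a majority.
    votes = sum(a.on_propose(b, v) for a in acceptors)
    if votes < majority:
        return None

    # Propagate: here the decision is simply returned to the caller.
    return v


acceptors = [Acceptor() for _ in range(3)]
first = run_ballot(acceptors, 1, "config-A")
second = run_ballot(acceptors, 2, "config-B")
print(first, second)
```

The second ballot proposes a different configuration but still decides the first one, which is the agreement property reconfiguration relies on.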

Dynamic DSM [RDS]

[Figure sequence: the leader l runs Recon(c,c’) over quorums Q1 and Q2]

Prepare phase
• l creates a new larger ballot b
• Message <1a, b>
• Message <1b, b, configs, <b’’, c’’>>
• l updates its ballot’s value v with the one received
• l updates its configs set

Propose phase
• Message <2a, b, c, v>
• Message <2b, b, c, v, tag, val>
• Nodes update their tag and val
• Nodes add v to their configs set

Propagation phase
• Message <3a, c, v, tag, val>
• Nodes update their tag and val
• Nodes remove configuration c from their configs set

Dynamic DSM [RDS]

Good News: The overhead latency to cope with dynamism is low

Dynamic DSM [RDS]

Bad News: In both solutions, congestion may increase the latency

First Conclusion

Communication complexity must be reduced to face scalability!

Scalable DSM [SQUARE]

Object replicated on failure-prone nodes
• The replicas r1, …, rk share a 2-dimensional coordinate space

[Figure: replicas r1, r2, r3, r4, r5, r6, r7, r8, …, rk-1, rk laid out in the coordinate space]

Scalable DSM [SQUARE]

Communication through neighborhood
• Each replica ri can communicate only with its nearest neighbors

[Figure: replica ri and its neighbors]

Scalable DSM [SQUARE]

Topology takeover mechanism [CAN]
• Upon node failure/departure, the space sharing is modified accordingly
• If a node ri fails, a takeover node rj replaces it

[Figure: rj takes over the zone of ri]

Scalable DSM [SQUARE]

Dynamic Quorums
• Vertical quorum: all replicas responsible for an abscissa x
• Horizontal quorum: all replicas responsible for an ordinate y

For any horizontal quorum H and any vertical quorum V:

H ∩ V ≠ Ø
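A small sketch of why a row and a column always meet. It assumes a regular grid in which each replica is responsible for exactly one cell (x, y); in [SQUARE] the zones need not be regular, so this only illustrates the intersection argument. The function names are hypothetical.

```python
def horizontal_quorum(cols, y):
    """All cells of the row at ordinate y."""
    return {(x, y) for x in range(cols)}


def vertical_quorum(rows, x):
    """All cells of the column at abscissa x."""
    return {(x, y) for y in range(rows)}


rows, cols = 3, 4
# Every row/column pair shares exactly the cell where they cross.
always_intersect = all(
    horizontal_quorum(cols, y) & vertical_quorum(rows, x) == {(x, y)}
    for x in range(cols)
    for y in range(rows)
)
print(always_intersect)
```

Each quorum has on the order of the square root of the number of replicas, which is what keeps the per-operation communication low.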

Scalable DSM [SQUARE]

Operation Execution

Read Operation:
1) Get up-to-date values and tags of a horizontal quorum,
2) Replicate this value on a vertical quorum.

Write Operation:
1) Get up-to-date value on a horizontal quorum,
2) Replicate the value to write (and a higher tag) on the same vertical quorum.

Scalable DSM [SQUARE]

Self-Adjusting Memory

Memory thwarts if the requested replica is overloaded: other replicas on its diagonal are contacted in turn until a non-overloaded one is found.

Memory expands if all contacted replicas are overloaded: a node outside the memory is added, and the object value is replicated at this node.

Memory shrinks if a replica gets underloaded: the replica simply leaves the memory after notifying its neighbors.

Scalable DSM [SQUARE]

Good News: The memory self-adapts well in the face of dynamism

Scalable DSM [SQUARE]

Good News: The load is well-balanced over the replicas

Scalable DSM [SQUARE]

Bad News: The operation latency increases with the load (request rate)

Second Conclusion

No way to avoid the tradeoff between communication and time complexity!

Scalable and Dynamic DSM [TQS]

Motivations for Probabilistic Solutions

• The tradeoff between time and message complexity prevents deterministic solutions

• Allowing more realistic models: any node can fail independently, even if it is unlikely that many nodes fail at the same time

• Quality of Service (QoS) is often expressed in terms of percentage of success

Scalable and Dynamic DSM [TQS]

Dynamic System
• n interconnected nodes
• Nodes join/leave the system
• A joining node is new
• c is the churn:
  • At each time unit, $\lfloor cn \rfloor$ nodes leave the network
  • At each time unit, $\lfloor cn \rfloor$ nodes enter the network

Scalable and Dynamic DSM [TQS]

Probabilistic Atomicity

• If an operation ends before another starts, then it is ordered after it with probability $e^{-\beta^2}$ (with $\beta$ a constant). If this happens, the preceding operation is considered unsuccessful.

• Write operations are totally ordered and read operations are ordered w.r.t. write operations

• A read returns the last successfully written value (or the default one if none exists) with probability $1 - e^{-\beta^2}$ (with $\beta$ a constant). If this does not hold, then the read is unsuccessful.

Scalable and Dynamic DSM [TQS]

Gossip-based algorithm in parallel
• Shuffle the set of neighbors using a gossip-based algorithm (e.g. Cyclon)

Quorum contact
• Disseminate a message with TTL l to k neighbors, such that the number of contacted nodes is $\beta \sqrt{n} / (1-c)^{\Delta/2}$
• Decrement the received TTL if the message is received for the first time
• Forward received messages to k neighbors if their TTL is not null
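The quorum-contact step can be simulated in a few lines. This sketch makes several assumptions that the slide does not state: neighbors are drawn uniformly at random at each hop (standing in for the shuffled views), a node that receives the message a second time drops it, and there is no churn during the dissemination. The function name and parameters are hypothetical.

```python
import random
from collections import deque


def contact_quorum(n, k, ttl, origin=0, seed=0):
    """Return the set of nodes reached by a TTL-bounded dissemination."""
    rng = random.Random(seed)

    def pick_neighbors(node):
        # k distinct neighbors chosen uniformly among the other nodes.
        others = [i for i in range(n) if i != node]
        return rng.sample(others, k)

    contacted = set()
    queue = deque((nb, ttl) for nb in pick_neighbors(origin))
    while queue:
        node, msg_ttl = queue.popleft()
        if node in contacted:
            continue          # duplicate reception: dropped (assumption)
        contacted.add(node)
        msg_ttl -= 1          # decrement the TTL on first reception
        if msg_ttl > 0:       # forward only while the TTL is not null
            queue.extend((nb, msg_ttl) for nb in pick_neighbors(node))
    return contacted


reached = contact_quorum(n=100, k=3, ttl=3)
print(len(reached))
```

With fan-out k and TTL l, at most k + k^2 + … + k^l nodes are reached, so the two parameters together bound the quorum size.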

Scalable and Dynamic DSM [TQS]

Read
• Ask a quorum of nodes for their values
• Replicate the most up-to-date value to a quorum

Write
• Ask a quorum of nodes for their tags
• Choose a strictly higher tag
• Replicate the value to write with the new tag

Scalable and Dynamic DSM [TQS]

Assumptions:
• Success regularity: at least one operation succeeds every $\Delta$ time units.
• The underlying gossip-based algorithm provides each node with a neighbor chosen uniformly at random.

Results:
• Expected messages: $O(\sqrt{nD})$ without shuffling,
• Expected latency: $O(\log nD)$,
• where $D = (1-c)^{-\Delta}$ is the dynamic parameter.

Given the application requirement in terms of QoS, the quorum size can be tuned.
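A rough sketch of that tuning, under two assumptions labeled here: the success probability of an operation is read as 1 − e^(−β²), and the number of nodes to contact is read as β·√(nD) with D = (1−c)^(−Δ). The helper names are hypothetical and the constants hidden by the O-notation are ignored.

```python
import math


def dynamic_parameter(c, delta):
    """D = (1 - c)^(-delta), for churn c and success period delta."""
    return (1 - c) ** (-delta)


def beta_for(success_probability):
    """Smallest beta such that 1 - exp(-beta^2) reaches the target."""
    return math.sqrt(-math.log(1 - success_probability))


def quorum_size(n, c, delta, beta):
    """Number of nodes to contact, read as beta * sqrt(n * D)."""
    return math.ceil(beta * math.sqrt(n * dynamic_parameter(c, delta)))


beta = beta_for(0.99)  # target: 99% of operations succeed
static_size = quorum_size(n=10000, c=0.0, delta=10, beta=2.0)
dynamic_size = quorum_size(n=10000, c=0.01, delta=10, beta=2.0)
print(round(beta, 3), static_size, dynamic_size)
```

A stricter QoS target raises β, and higher churn raises D; both enlarge the quorum to contact.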

Third Conclusion

Trading deterministic for probabilistic guarantees seems to be the solution!

What could be the next step?

Today in Peer-to-Peer:
• Every node must participate
• Avoid free-riding/lurking at any cost!

However, an old story says:
• Gnutella performance was limited by the performance of its least capable nodes

What could be the next step?

Some classifying solutions are efficient:
• Recommendation helps decision (e.g. eBay)
• Supernodes/ultrapeers help sharing files (e.g. Kazaa)
• Supernodes help NAT/FW bypassing (e.g. Skype)

Generally, we can benefit from heterogeneity:
• Streaming service needs nodes with highest bandwidth
• Non-critical service can run on unstable nodes
• File-sharing service requires nodes with many files
• …

Problem

Classifying nodes into categories, or slices
• Based on individual characteristics: attributes
• A slice corresponds to a portion of the system

Typically, answering the question: HOW AM I COMPARED TO OTHERS?

Distributed Slicing [RANK]

[Figure: nodes with attribute values 68, 70, 8, 72, 62, 75, 65, 20, 71, 48, 59, 89, 27]

…using their attribute values (a single attribute is assumed for simplicity)

[Figure: attribute values ai on an axis from 0 to 100, normalized indices pi on an axis from 0 to 1, and slices si (#1 to #4) on an axis from 0 to 1]
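The mapping from attribute values to slices can be sketched with global knowledge, which a real node does not have. Assumptions in this sketch: the figure's values are read as 68, 70, 8, 72, 62, 75, 65, 20, 71, 48, 59, 89, 27; the normalized index of a node is its rank divided by n; and the four slices have equal size, with #1 holding the lowest values. The function name is hypothetical.

```python
def slices_of(values, num_slices):
    """Map each attribute value to (normalized index, slice number)."""
    n = len(values)
    result = {}
    for rank, a in enumerate(sorted(values), start=1):
        p = rank / n  # normalized index in (0, 1]
        # ceil(p * num_slices), computed with integers to avoid rounding issues
        s = (rank * num_slices + n - 1) // n
        result[a] = (p, s)
    return result


attributes = [68, 70, 8, 72, 62, 75, 65, 20, 71, 48, 59, 89, 27]
assignment = slices_of(attributes, num_slices=4)
print(assignment[8][1], assignment[68][1], assignment[89][1])
```

The distributed problem is to let each node estimate its own normalized index without ever collecting the full sorted list.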

Distributed Slicing [RANK]

Existing solutions use a gossip-based mechanism

[Figure sequence over the same attribute values (68, 70, 8, 72, 62, 75, 65, 20, 71, 48, 59, 89, 27): the frames show the fractions 4/11, 0/2, 1/4, and 1/3 and the value pairs (68, 89), (72, 20), and (48, 75)]

Distributed Slicing [RANK]

Performance achieved so far:
• d is the distance from pi’ (the position estimate of i) to the closest slice boundary.
• For a confidence coefficient of 99.99%, the required number of attribute values drawn is:

$m_i \geq z \, p_i' (1 - p_i') / d^2$, with $z < 16$ a constant.
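The bound is easy to evaluate numerically. This sketch takes z = 16, the upper limit given above, so the result is a conservative sample count; the function name and the example inputs are hypothetical.

```python
import math


def required_samples(p_est, d, z=16.0):
    """Smallest integer m with m >= z * p_est * (1 - p_est) / d^2."""
    return math.ceil(z * p_est * (1 - p_est) / d ** 2)


far_from_boundary = required_samples(p_est=0.5, d=0.25)
near_a_boundary = required_samples(p_est=0.5, d=0.125)
print(far_from_boundary, near_a_boundary)
```

Halving the distance to the closest slice boundary quadruples the number of attribute values a node has to draw, so nodes near a boundary are the expensive ones.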

Fourth Conclusion

Distributed Slicing is a new Challenge!

References

[RANK] A. Fernandez, V. Gramoli, E. Jimenez, A-M. Kermarrec, and M. Raynal. Distributed Slicing in Dynamic Systems. ICDCS 2007.

[TQS] V. Gramoli and M. Raynal. Timed Quorum System for Large-Scale Dynamic Environments. IRISA TR1859, 2007.

[SQUARE] V. Gramoli, E. Anceaume, and A. Virgillito. SQUARE: Scalable Quorum-based Atomic Memory with Local Reconfiguration. ACM SAC 2007.

[Cyclon] S. Voulgaris, D. Gavidia, and M. van Steen. Cyclon: Inexpensive Membership Management for Unstructured P2P Overlays. Journal of Network and Systems Management 13(2), 2005.

[RDS] G. Chockler, S. Gilbert, V. Gramoli, P.M. Musial, and A.A. Shvartsman. Reconfigurable Distributed Storage for Dynamic Networks. OPODIS 2005.

[CAN] S. Ratnasamy, P. Francis, M. Handley, R.M. Karp, and S. Shenker. A Scalable Content Addressable Network. ACM SIGCOMM 2001.

[ABD] H. Attiya, A. Bar-Noy, and D. Dolev. Sharing Memory Robustly in Message-Passing Systems. JACM 1995.
