Distributed Algorithms
Luc J. B. Onana Alima, Seif Haridi

TRANSCRIPT

Page 1: Distributed Algorithms Luc J. B. Onana Alima Seif Haridi

Distributed Algorithms

Luc J. B. Onana Alima Seif Haridi

Page 2:

Introduction

• What is a distributed system? A set of autonomous processors interconnected in some way.

• What is a distributed algorithm (protocol)? Concurrently executing components, each on a separate processor.

• Distributed algorithms can be extremely complex: many components run concurrently; locality; failures; non-determinism; independent inputs; no global clock; uncertain message delivery; uncertain message ordering; etc.

• Can we understand everything about their executions?

Page 3:

Ch9: Models of Distributed Computation

• Preliminaries: Notations; Assumptions

• Causality: Lamport Timestamps; Vector Timestamps; Causal Communication

• Distributed Snapshots
• Modeling a Distributed Computation: Execution DAG; Predicates
• Failures in Distributed Systems

Page 4:

Ch9 Models: Preliminaries Assumptions

A1: No shared variables among processors

A2: On each processor there are a number of executing threads

A3: Communication is by sending and receiving messages; send(dest,action,param) is non-blocking.

A4: Event-driven algorithms: a reaction runs upon receipt of a declared event. Events: sending or receiving a message; etc. An event is buffered until it is handled; a dedicated thread handles some events at any time.

Page 5:

Ch9 Models: Preliminaries Notations
Waiting for events

wait for A1, A2, …, An
  on Ai(source; param) do
    code to handle Ai, 1 <= i <= n
end

Waiting for an event from p up to T seconds

wait until p sends (event; param), timeout = T
  on timeout do
    timeout action
  on event(param) from p do
    successful response actions
end

Page 6:

Ch9 Models: Preliminaries Notations
Waiting for events

wait for A1, A2, …, An
  on Ai(source; param) do
    code to handle Ai, 1 <= i <= n
end

Waiting for an event from p up to T seconds

wait for p, timeout = T
  on timeout do
    time-out action
  on Ai(param) from p do
    action
end

Page 7:

Ch9 Models: Preliminaries Notations
Waiting for responses from a set of processors up to T seconds

wait up to T seconds for (event; param) messages
  event: <message handling code>

To be considered if necessary.

Page 8:

Ch9 Models: Preliminaries
Concurrency control within an instance of a protocol

Definition: let P be a protocol. If the instance of P at processor q consists of threads T1, T2, T3, …, Tn, we say that T1, T2, …, Tn are in the same family. They access the same set of variables, hence the need for concurrency control. Assumption used:

A5: Once a thread gains control of the processor, it does not release control to a thread of the same family until it is blocked.

Page 9:

Ch9 Models: Causality

There is no global time in a distributed system; processors cannot make simultaneous observations of global states. Causality serves as a supporting property.

Provided traveling backward in time is excluded, distributed systems are causal: the cause precedes the effect.

The sending of a message precedes the receipt of that message.

Page 10:

Ch9 Models: Causality
System composition: we assume a distributed system composed of a set of processors P = {p1, …, pM}.

Each processor reacts upon receipt of an event

Two classes of events:
External/communication events: sending a message; receiving a message.
Internal events: local input/output; raising of a signal; decision on a commit point (database); etc.

Page 11:

Ch9 Models: Causality
Notations: E is the set of all possible events in our system; Ep is the set of all events in E that occur at processor p.

We are interested in defining orders between events. Why?

In many cases, orders are necessary for coordinating distributed activities (e.g. many concurrency control algorithms use ordering of events; we’ll see this later).

Page 12:

Ch9 Models: Causality
Orders between events
1) On the same processor p. Order: <p. e <p e' means ``e occurs before e' in p´´. If e and e' occur on the same processor p, then either e <p e' or e' <p e, i.e. on the same processor events are totally ordered.

[Space-time diagram: on processor p's timeline, e occurs before e', so e <p e'.]

Page 13:

Ch9 Models: Causality
Orders between events
2) Of sending message m and receiving message m. Order: <m. If e is the sending of message m and e' the receipt of message m, then e <m e'.

Page 14:

Ch9 Models: Causality
Orders between events
3) In general (i.e. all events in E are considered). Order: <H, ``happens-before´´ or ``can causally affect´´.

Definition: <H is the union of <p and <m (for all p, m), closed under transitivity (i.e. if e1 <H e2 and e2 <H e3, then e1 <H e3).

Definition: we define a causal path from e to e' as a sequence of events e1, e2, …, en such that 1) e = e1 and e' = en; 2) for each i in {1,..,n-1}, ei <p ei+1 for some processor p or ei <m ei+1 for some message m.

Thus, e <H e' if and only if there is a causal path from e to e'.
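The definition above can be sketched executably. The following is our own illustration (event names and the helper are hypothetical, not part of the slides): it builds <H as the transitive closure of the union of the per-processor orders <p and the message order <m.

```python
from itertools import product

def happens_before(proc_order, msg_order):
    """Build <H as the transitive closure of the union of <p and <m.
    proc_order: (e, e') pairs with e <p e' on some processor.
    msg_order:  (send, recv) pairs with send <m recv.
    Returns the set of (e, e') pairs with e <H e'."""
    hb = set(proc_order) | set(msg_order)
    changed = True
    while changed:                       # naive transitive closure
        changed = False
        new = {(a, d) for (a, b), (c, d) in product(hb, hb) if b == c}
        if not new <= hb:
            hb |= new
            changed = True
    return hb

# e1 <p e2 on one processor; e2 is the send of a message received as e3.
hb = happens_before([("e1", "e2")], [("e2", "e3")])
assert ("e1", "e3") in hb                # causal path e1 -> e2 -> e3
assert ("e3", "e1") not in hb            # e3 cannot causally affect e1
```

Naive closure is quadratic per pass; it is only meant to make the definition concrete on tiny examples.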

Page 15:

Ch9 Models: Causality
Happens-before is a partial order

It is possible to have two events e and e' (e ≠ e') such that neither e <H e' nor e' <H e.

If two events e and e' are such that neither e <H e' nor e' <H e, then e and e' are concurrent and we write e || e'. The possibility of concurrent events means that the happens-before relation <H is only a partial order.

Page 16:

Ch9 Models: Causality
Space-time diagram: happens-before DAG

[Space-time diagram over p1, p2, p3 with events e1–e8; dependencies must point forward in time.]

There is no causal path from e1 to e2 or from e2 to e1: e1 and e2 are concurrent.
There is no causal path from e1 to e6 or from e6 to e1: e1 and e6 are concurrent.
There is no causal path from e2 to e6 or from e6 to e2: e2 and e6 are concurrent.

Page 17:

Ch9 Models: Causality
Space-time diagram: happens-before DAG

[Same space-time diagram over p1, p2, p3 with events e1–e8.]

Compare: e1 and e7; e1 and e8; e5 and e2; e4 and e6.

Page 18:

Ch9 Models: Causality
Global Logical Clock (timestamps)

Although there is no global time in a distributed system, a Global Logical Clock (GLC) that assigns a total order to the events in a distributed system is very useful.

Such a global logical clock can be used to arbitrate requests for resources in a fair manner, break deadlocks, etc. A GLC should assign a timestamp t(e) to each event e such that t(e) < t(e') or t(e') < t(e) for e ≠ e'; furthermore, the order imposed by the GLC should be consistent with <H, that is, if e <H e' then t(e) < t(e').

Page 19:

Ch9 Models: Causality
Lamport’s Algorithm

Gives a Global Logical Clock consistent with <H.

Each event e receives an integer e.TS such that e <H e' ⇒ e.TS < e'.TS. Concurrent events (unrelated by <H) are ordered according to the processor address (assume these are integers).

Timestamps: t(e) = (e.TS, p) when e occurs at processor p.
Ordering of timestamps: (e.TS, p) < (e'.TS, q) iff e.TS < e'.TS, or e.TS = e'.TS and p < q.

Page 20:

Ch9 Models: Causality
Lamport’s Algorithm (cont.)

Each processor p maintains a local timestamp my_TS.

Each processor attaches its timestamp to all messages that it sends.

Page 21:

Ch9 Models: Causality
Lamport’s timestamp algorithm

Initially, my_TS = 0
wait for any event e
  on e do
    if e is the receipt of message m then
      my_TS := max(m.TS, my_TS) + 1; e.TS := my_TS
    elseif e is an internal event then
      my_TS := my_TS + 1; e.TS := my_TS
    elseif e is the sending of message m then
      my_TS := my_TS + 1; e.TS := my_TS; m.TS := my_TS
    end
end
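As a sanity check, the rules on this slide can be rendered as a small Python class. This is a minimal sketch (class and method names are ours, not the slides'):

```python
class LamportClock:
    """One Lamport clock per processor: my_TS is bumped on every event;
    a receipt additionally folds in the timestamp carried by the message."""
    def __init__(self):
        self.my_TS = 0                         # initially my_TS = 0

    def internal(self):
        self.my_TS += 1                        # internal event
        return self.my_TS                      # e.TS

    def send(self):
        self.my_TS += 1                        # sending event
        return self.my_TS                      # e.TS, also attached as m.TS

    def receive(self, m_TS):
        self.my_TS = max(m_TS, self.my_TS) + 1 # receipt of message m
        return self.my_TS                      # e.TS

p, q = LamportClock(), LamportClock()
m_TS = p.send()                # p's send gets timestamp 1
e_TS = q.receive(m_TS)         # q: max(1, 0) + 1 = 2
assert m_TS < e_TS             # send <H receive implies smaller timestamp
```

Ties between concurrent events would be broken by the processor address, as the slides note, by comparing pairs (e.TS, p).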

Page 22:

Ch9 Models: Causality
Lamport’s Algorithm (cont.)

Lamport’s algorithm ensures that e <H e' ⇒ e.TS < e'.TS. Reason: if e1 <p e2 or e1 <m e2, then e2 is assigned a higher timestamp than e1.

Note: it is easy to see that the algorithm as presented does not assign a total order to the events in the system; the processor address is used to break ties.

Page 23:

Ch9 Models: Causality
Lamport’s timestamps illustrated

[Space-time diagram over p1, p2, p3 with events e1–e8 and Lamport timestamps (e.TS, p): (1,1), (2,1), (3,1) on p1; (1,2), (2,2), (3,2) on p2; (1,3), (4,3) on p3.]

Why is e7 labeled (3,1)? Why is e8 labeled (4,3)?

Page 24:

Ch9 Models: Causality
Lamport’s timestamp algorithm has the following properties:

Completely distributed

Simple

Fault tolerant

Minimal overhead

Many applications

Page 25:

Ch9 Models: Causality
Vector Timestamps

Lamport timestamps guarantee that if e <H e' then e.TS < e'.TS, but there is no guarantee that if e.TS < e'.TS then e <H e'.

Problem: given two arbitrary events e and e' in E, we want to determine whether they are causally related.

Why is this problem interesting?

Page 26:

Ch9 Models: Causality
Knowing when two events are causally related is useful. To see this, consider the following H-DAG, in which O is a mobile object.

[Space-time diagram over p1, p2, p3 with messages m1, m2, m3: p1 migrates O to p2 (m1) and tells p3 ``On p2´´ (m2); p3 then asks p2 ``Where is O?´´ (m3); m3 arrives at p2 before m1, so p2 answers ``I don't know´´. Error!]

When you debug the system after this point, you will find that the object is at p2. So why doesn't p2 know where the object is?

Page 27:

Ch9 Models: Causality
Causally-precedes relation <c between messages

Let s(m) be the event of sending message m and r(m) the event of receiving message m.
Definition: m1 <c m2 if s(m1) <H s(m2).
A causality violation occurs when there are messages m1 and m2 and a processor p such that s(m1) <H s(m2) and r(m2) <p r(m1).

[Diagram: the simplest form of causality violation: both sending events on the same processor p1, both receiving events on the same processor p2, with r(m2) before r(m1).]

Page 28:

Ch9 Models: Causality Causality violation (ex: distributed object system)

When p3 receives the ``I don't know´´ message from p2, p3 has inconsistent information: from p1, p3 knows O is on p2, but from p2, p3 knows O is not on p2!

The source of the problem: m1 <c m3 but r(m3) <p2 r(m1), i.e. there is a causality violation.

Thus, for two events e and e', if we know exactly whether e <H e', then we can detect causality violations.

Vector timestamps give us this.
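Using the vector-timestamp comparison developed on the following slides, a detector at p2 can be sketched as follows. This is our own minimal illustration; the helper names are ours, and the sample timestamps are plausible values in the spirit of the example, not read off the diagram:

```python
def vt_less(a, b):
    """a <V b: componentwise <= and strictly smaller somewhere."""
    return all(x <= y for x, y in zip(a, b)) and a != b

def check_arrival(delivered, m_vt):
    """Flag a causality violation if the arriving message causally
    precedes a message already handed to the application."""
    violation = any(vt_less(m_vt, d_vt) for d_vt in delivered)
    delivered.append(m_vt)
    return violation

delivered = []
assert check_arrival(delivered, (3, 2, 3)) is False  # m3 arrives first: fine
assert check_arrival(delivered, (3, 0, 1)) is True   # m1 <c m3 arrives late!
```

The detector only reports the violation after the fact; preventing it is the job of a causally ordered communication subsystem, discussed later.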

Page 29:

Ch9 Models: Causality
Vector Timestamps
Idea: each event e indicates, for each processor p, all events at p that are causally before e.

Page 30:

Ch9 Models: Causality
The idea illustrated

[Space-time diagram over p1, p2, p3, p4: each processor numbers its local events 1, 2, 3, …; for the highlighted event e, the events at each processor that are causally before e form an initial prefix of that processor's events.]

Page 31:

Ch9 Models: Causality
Vector Timestamps

Idea: each event e indicates which events at each processor p causally precede e.

Each event e has a vector timestamp e.VT such that e.VT <V e'.VT ⇔ e <H e'.

e.VT is an array with an entry for each processor p.

For any processor p, e.VT[p] is an integer, and e.VT[p] = k means that e causally follows the first k events that occur at p (one assumes that each event follows itself).

Page 32:

Ch9 Models: Causality
The meaning of e.VT[p] illustrated

[Same space-time diagram over p1, p2, p3, p4 with locally numbered events; for the highlighted event e: e.VT[p1] = 3, e.VT[p2] = 6, e.VT[p3] = 4, e.VT[p4] = 2.]

Page 33:

Ch9 Models: Causality
Vector Timestamps

The ordering <V on vector timestamps is defined as: e.VT <V e'.VT iff a) e.VT[i] ≤ e'.VT[i] for all i in {1,..,M}, and b) there is a j in {1,..,M} such that e.VT[j] < e'.VT[j].

Examples: (1,0,3) <V (2,0,5); (1,1,3) <V (2,1,3); (1,1,3) ≮V (1,0,3); (1,1,3) ≮V (1,1,3).

Property: e.VT <V e'.VT only if e' causally follows every event that e causally follows.
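The definition of <V is a one-liner in Python; the sketch below (function name is ours) checks it against the slide's four examples:

```python
def vt_less(a, b):
    """e.VT <V e'.VT iff a) a[i] <= b[i] for all i,
    and b) a[j] < b[j] for some j."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

# The slide's examples:
assert vt_less((1, 0, 3), (2, 0, 5))
assert vt_less((1, 1, 3), (2, 1, 3))
assert not vt_less((1, 1, 3), (1, 0, 3))   # incomparable: concurrent events
assert not vt_less((1, 1, 3), (1, 1, 3))   # equal, hence not strictly less
```

Two events are concurrent exactly when neither timestamp is <V the other.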

Page 34:

Ch9 Models: Causality
Comparison of vector timestamps illustrated

[Space-time diagram over p1, p2, p3, p4 with events e1, e2, e3:]
e1.VT = (5,4,1,3); e2.VT = (3,6,4,2); e3.VT = (0,0,1,3).
e3.VT <V e1.VT.
There is no causal path from e1 to e2 or from e2 to e1: e1 and e2 are concurrent (their vector timestamps are incomparable under <V).

Page 35:

Ch9 Models: Causality
The property illustrated

[Space-time diagram over p1, p2, p3, p4 with events e and e':]
We have e.VT = (0,1,4,2), e'.VT = (3,6,4,2), and e.VT <V e'.VT: e' causally follows every event that e causally follows.

Page 36:

Ch9 Models: Causality
Vector timestamps algorithm

Initially, my_VT = [0, …, 0]
wait for any event e
  on e do
    if e is the receipt of message m then
      for i := 1 to M do my_VT[i] := max(m.VT[i], my_VT[i]);
      my_VT[self] := my_VT[self] + 1;
      e.VT := my_VT
    elseif e is an internal event then
      my_VT[self] := my_VT[self] + 1; e.VT := my_VT
    elseif e is the sending of message m then
      my_VT[self] := my_VT[self] + 1; e.VT := my_VT; m.VT := my_VT
    end
end

Here we assume that each processor knows the names of all the processors in the system. How can we achieve this assumption? We’ll see later.
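A minimal Python rendering of this algorithm follows (class and method names are ours; processors are indexed 0..M-1 rather than 1..M):

```python
class VectorClock:
    """Vector timestamps for processor self_id in a system of M processors.
    Assumes every processor knows M, as the slide notes."""
    def __init__(self, self_id, M):
        self.self_id = self_id
        self.my_VT = [0] * M                  # initially my_VT = [0,...,0]

    def internal(self):
        self.my_VT[self.self_id] += 1         # internal event
        return list(self.my_VT)               # e.VT

    def send(self):
        self.my_VT[self.self_id] += 1         # sending event
        return list(self.my_VT)               # e.VT, attached to m as m.VT

    def receive(self, m_VT):
        # componentwise max with the incoming timestamp, then count this event
        self.my_VT = [max(a, b) for a, b in zip(self.my_VT, m_VT)]
        self.my_VT[self.self_id] += 1
        return list(self.my_VT)               # e.VT

p0, p1 = VectorClock(0, 2), VectorClock(1, 2)
m = p0.send()              # p0: [1, 0]
e = p1.receive(m)          # p1: max([0,0],[1,0]) = [1,0], then [1,1]
assert m == [1, 0] and e == [1, 1]
```

Copies are returned (`list(...)`) so that a stored e.VT is not mutated by later events.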

Page 37:

Ch9 Models: Causality Vector Timestamp algorithm

Ensures: e <H e' ⇒ e.VT <V e'.VT.

Reason: 1) e <p e': the case of internal events at processor p; e.VT <V e'.VT. 2) e <m e': the case of receiving message m; e.VT <V e'.VT.

Page 38:

Ch9 Models: Causality Vector Timestamp algorithm

Ensures: e.VT <V e'.VT ⇒ e <H e'.

Reason: assume ¬(e <H e'); there are two cases to consider.
1) If e || e': let e be the l-th event at its processor p. Since e does not causally precede e', e' causally follows at most the first k < l events at p, so e'.VT[p] = k < l = e.VT[p], which implies that e.VT ≮V e'.VT.

[Diagram: processor p with e as its l-th event and e'.VT[p] = k, where k < l.]

Page 39:

Ch9 Models: Causality Vector Timestamp algorithm

Ensures (cont.): e.VT <V e'.VT ⇒ e <H e'.

Reason: assume ¬(e <H e'); there are two cases to consider.
2) If e' <H e, then e'.VT <V e.VT (from the previous slide), and hence e.VT ≮V e'.VT.

Page 40:

Ch9 Models: Causality
Detecting causality violation in the distributed object system example

If we know, for every pair of events, whether they are causally related, we can detect causality violations in the distributed object system example by installing a causality-violation detector at every processor.

[Space-time diagram over p1, p2, p3 with messages m1, m2, m3 as before, now with vector timestamps (1,0,0), (0,0,1), (2,0,1), (3,0,1), (3,0,2), (3,0,3), (3,1,3), (3,2,3), (3,3,3), (3,2,4) attached to the events.]

If we attach a vector timestamp to each event (and message) of the distributed object system example, then each processor can detect a causality violation. E.g. p2 can detect that a causality violation occurred when it receives m1: m1 <c m3 but r(m3) <p2 r(m1).

Page 41:

Ch9 Models: Causality
Causal communication

Causality violations can lead to undesirable situations.

A processor usually cannot choose the order in which messages arrive, but a processor can decide the order in which the applications executing on it have messages delivered to them.

This leads to the need for communication subsystems with specified properties; e.g. one may require a communication subsystem that delivers messages in causal order.

Advantage: the design of many distributed algorithms becomes easier (e.g. a simple object-migration protocol).

Page 42:

Ch9 Models: Causality
Causal communication

Can we build a communication subsystem that guarantees delivery of messages in causal order?

No for unicast message sending; yes for multicast.

Page 43:

Ch9 Models: Causality

Causal communication (an attempted solution)

Idea: hold back messages that arrive ``too soon´´; deliver a held-back message m only when you are assured that you will not receive an m' such that m' causally precedes m.

The implementation of this idea is similar to the implementation of FIFO communication.

[Layered diagram: Applications sit above the communication subsystem (CSS), which sits above the Network.]

Page 44:

Ch9 Models: Causality
FIFO communication (TCP): the problem

Assume 1) p and q are connected by an oriented communication line from p to q that satisfies: messages sent are eventually received; messages sent by p can arrive at q in any order.

2) q delivers messages received from p to an application A running at q.

The problem is to devise a distributed algorithm that enables processor q to deliver to A the messages received from p in the order p sent them.

Page 45:

Ch9 Models: Causality
FIFO communication: implementation (idea)

The solution consists of one algorithm for p and one for q.

Algorithm for p: p sequentially numbers each message it sends to q. q knows that messages are sequentially numbered.

Algorithm for q (idea): upon receipt of a message m with sequence number x, if q has not yet delivered the message with sequence number x-1, q delays the delivery of m until m can be delivered in sequence.
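The hole-checking logic on q's side fits in a few lines of Python. A minimal sketch (names are ours; sequence numbers start at 0 here):

```python
class FifoReceiver:
    """q's side: deliver messages from p in sequence-number order,
    buffering any message that arrives ahead of a hole."""
    def __init__(self):
        self.next_seq = 0        # number of the next message to deliver
        self.buffered = {}       # seq -> message, held back
        self.delivered = []      # what the application A has seen

    def receive(self, seq, msg):
        self.buffered[seq] = msg
        # deliver as long as there is no hole
        while self.next_seq in self.buffered:
            self.delivered.append(self.buffered.pop(self.next_seq))
            self.next_seq += 1

q = FifoReceiver()
q.receive(1, "b")            # hole: message 0 missing, so buffer
q.receive(0, "a")            # fills the hole: deliver "a" then "b"
q.receive(2, "c")
assert q.delivered == ["a", "b", "c"]
```

This is exactly the "no hole, deliver; hole, buffer" rule illustrated on the next slide.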

Page 46:

Ch9 Models: Causality
FIFO communication: implementation (idea, cont.)

Algorithm for q (idea, cont.):

[Diagram: when message number x arrives and all messages up to x-1 have already been delivered, there is no hole, so deliver; otherwise there is a hole, so buffer.]

Page 47:

Ch9 Models: Causality
Causal communication: implementation (idea)

Assumption (PTP): all point-to-point messages are delivered in the order sent.

Instead of using sequence numbers (as in the FIFO implementation), we use timestamps; Lamport timestamps or vector timestamps can be used.

Idea: whenever processor q receives a message m from processor p, q holds back m until it is assured that no message m' <c m will be delivered from any other processor.

Page 48:

Ch9 Models: Causality
Causal communication: implementation (idea, variables used)

At processor self:
blocked[i] = queue of blocked messages received from pi;
earliest[i] = (head(blocked[i])).timestamp, or 1_i if blocked[i] is empty (1_i denotes the vector with 1 in entry i and 0 elsewhere);
delivery_list: the messages in delivery_list are causally ordered.

[Diagram: one pair blocked[i] / earliest[i] per processor pi (i = 1, …, M), feeding delivery_list.]

Page 49:

Ch9 Models: Causality
Causal communication: implementation (idea, variables update)

When processor self receives a message m from p, it performs the following steps in order:

Step 1: if blocked[p] is empty, then set earliest[p] to m.timestamp; /* because assumption (PTP) guarantees that no earlier message can be received from p */
Step 2: enqueue message m in blocked[p];
Step 3: unblock, one after another, all blocked messages that can be unblocked; add each unblocked message to delivery_list; update earliest if necessary. How do we determine when a message can be unblocked?
Step 4: deliver the messages in delivery_list.

Page 50:

Ch9 Models: Causality
Causal communication: implementation (idea, variables update)

Step 3 detailed: assume we use vector timestamps.

Step 3 refined: unblock, one after another, all blocked messages that can be unblocked. The message m at the head of the holding queue for processor k can be unblocked only if the ``time´´ of processor k according to m is smaller than the ``time´´ of processor k according to any other message m' (if any) at the head of a holding queue. More precisely, blocked[k] can be unblocked only if

(∀ i ∈ {1,..,M}, i ≠ k, i ≠ self : earliest[k][i] < earliest[i][i])

Thus, the details of Step 3 are:

Page 51:

Ch9 Models: Causality
Causal communication: implementation (idea, variables update)

Step 3 detailed (cont.): blocked[k] can be unblocked only if (∀ i ∈ {1,..,M}, i ≠ k, i ≠ self : earliest[k][i] < earliest[i][i]). Combining this condition with the fact that messages are unblocked one after another, we obtain a while loop:

while (∃ k ∈ {1,..,M} : blocked[k] ≠ empty ∧ (∀ i ∈ {1,..,M}, i ≠ k, i ≠ self : earliest[k][i] < earliest[i][i])) do
  remove the first message of blocked[k] and add it to delivery_list;
  if blocked[k] ≠ empty then
    earliest[k] := (head(blocked[k])).timestamp /* vector timestamp */
  else
    earliest[k] := earliest[k] + 1_k
end

Deliver the messages in delivery_list.

Page 52:

Ch9 Models: Causality
Causal communication: implementation (the complete scheme)

Initially, for each k in {1,..,M}: earliest[k] := 1_k; blocked[k] := empty.

wait for a message from any processor
  on the receipt of message m from processor p do
    delivery_list := empty;
    Step 1; Step 2; Step 3; Step 4
end
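The complete scheme can be rendered in Python. This is our own sketch (class, method names and the 0-based indexing are ours); the unblocking test is Step 3's condition, and FIFO point-to-point links (PTP) are assumed:

```python
class CausalReceiver:
    """Hold-back delivery at processor self_id among M processors,
    following Steps 1-4 with vector timestamps."""
    def __init__(self, self_id, M):
        self.self_id, self.M = self_id, M
        self.blocked = [[] for _ in range(M)]    # one hold-back queue per sender
        self.earliest = [[1 if i == k else 0 for i in range(M)]
                         for k in range(M)]      # earliest[k] = 1_k initially

    def _can_unblock(self, k):
        # head of blocked[k] is deliverable if no other head could precede it
        return bool(self.blocked[k]) and all(
            self.earliest[k][i] < self.earliest[i][i]
            for i in range(self.M) if i not in (k, self.self_id))

    def receive(self, p, vt, msg):
        delivery_list = []
        if not self.blocked[p]:                  # Step 1
            self.earliest[p] = list(vt)
        self.blocked[p].append((list(vt), msg))  # Step 2
        while True:                              # Step 3
            k = next((j for j in range(self.M) if self._can_unblock(j)), None)
            if k is None:
                break
            _, m = self.blocked[k].pop(0)
            delivery_list.append(m)
            if self.blocked[k]:
                self.earliest[k] = list(self.blocked[k][0][0])
            else:
                self.earliest[k][k] += 1         # earliest[k] := earliest[k] + 1_k
        return delivery_list                     # Step 4

r = CausalReceiver(self_id=2, M=3)
# m2 (sent by p1 after it saw m1) arrives before m1 (sent by p0): held back.
assert r.receive(1, (1, 2, 0), "m2") == []
# m1 arrives: both messages can now be delivered, in causal order.
assert r.receive(0, (1, 0, 0), "m1") == ["m1", "m2"]
```

The next slides show a scenario in which this scheme blocks a message forever, so treat the sketch as an illustration of the slides' rules rather than a production protocol.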

Page 53:

Ch9 Models: Causality
Detecting causality violation in the distributed object system example (recap)

If we know, for every pair of events, whether they are causally related, we can detect causality violations in the distributed object system example by installing a causality-violation detector at every processor.

[Same space-time diagram as before: p1, p2, p3 with messages m1, m2, m3 and vector timestamps (1,0,0), (0,0,1), (2,0,1), (3,0,1), (3,0,2), (3,0,3), (3,1,3), (3,2,3), (3,3,3), (3,2,4) attached to the events.]

Page 54:

Ch9 Models: Causality
A problem with the causal communication implementation previously given

The communication subsystem at processor self might never deliver some messages.

Page 55:

Ch9 Models: Causality
Causal communication: the problem illustrated

[Space-time diagram over p1, p2, p3, p4 with vector timestamps (1,0,0,0), (1,0,0,2), (0,0,1,0), (2,0,1,0), (3,0,1,0), (3,0,2,0), (3,0,3,0), (3,1,3,0); p3 sends message M to p2.]

Message M is never delivered by the communication subsystem running at processor p2 (self is p2):
blocked[p3] contains M, with M = head(blocked[p3]) and earliest[p3][p1] = 3;
blocked[p1] = empty, so earliest[p1][p1] = 1;
blocked[p4] = empty, so earliest[p4][p1] = 1.
The unblocking condition requires earliest[p3][p1] < earliest[p1][p1], i.e. 3 < 1, which never becomes true if no further messages arrive from p1.

Page 56:

Ch9 Models: Distributed Snapshots
Assumptions/definitions

The system is connected, that is, there is a path between every pair of processors; Ci,j denotes the channel from pi to pj.

Communication channels are reliable and FIFO: messages sent are eventually received, in the order sent.

The state of Ci,j is the ordered list of messages sent by pi but not yet received at pj (we will soon make this definition precise).

The state of a processor (at an instant) is the assignment of a value to each variable of that processor.

Page 57:

Ch9 Models: Distributed Snapshots
Assumptions (cont.)

Global state of the system: (S, L), where S = (s1, .., sM) are the processor states and L the channel states.

A global state cannot be taken instantaneously; it must be computed in a distributed manner.

The problem: devise a distributed algorithm that computes a consistent global state.

What do we mean by a consistent global state?

Page 58:

Ch9 Models: Distributed Snapshots
Meaning of consistent global state: Example 1

[Diagram: processors p and q connected by channels Cp,q and Cq,p.]

Two possible states for each processor: s0, s1. In s0 the processor does not have the token; in s1 the processor has the token.

The system contains exactly one token, which moves back and forth between p and q. Initially, p has the token. Events: sending/receiving the token.

Page 59:

Ch9 Models: Distributed Snapshots
Meaning of consistent global state: global states of the system of Example 1

[Four diagrams of the system p, q with channels Cp,q and Cq,p, one per reachable global state: the token held by p; the token in transit in Cp,q; the token held by q; the token in transit in Cq,p.]

Page 60:

Ch9 Models: Distributed Snapshots
Meaning of consistent global state (informal)

A global state G is consistent if it is one that could have occurred.

Consider a system with two possible runs (non-determinism).

[Diagram: the actual transitions of one run, with G lying off that run; the output of the snapshot algorithm can be G!]

Page 61:

Ch9 Models: Distributed Snapshots
Consistent global state (formal)

S = {s1,..,sM}; oi is the event of observing si at pi; O(S) = {o1,..,oM}.

Definition: S is a consistent cut iff {o1,..,oM} is consistent with causality.

Definition: {o1,..,oM} is consistent with causality iff

(∀ e, oi : e ∈ Ei ∧ e <H oi : (∀ e' : e' ∈ Ej ∧ e' <H e : e' <H oj))

Notation: s(m) = event of sending m; r(m) = event of receiving m.

[Intuition diagram: if e is observed at pi (e <H oi) and e' at pj causally precedes e, then e' must also be observed at pj (e' <H oj).]

Page 62:

Ch9 Models: Distributed Snapshots
Precision about ``message sent but not yet received´´

Definition: given O(S) = {o1,..,oM} and a message m from pi to pj: if s(m) <pi oi and oj <pj r(m), then m is sent but not yet received (relative to O).

[Diagram: p2 observes its state (o2), then asks p1 to do the same (o1); messages m1, m2, m3 are in transit across the cut. The global state resulting from o1 and o2 must contain m1, m2, m3.]

Page 63:

Ch9 Models: Distributed Snapshots
Meaning of consistent global state (cont.)

Definition: a global state (S, L) is consistent iff S is a consistent cut and L contains all messages sent but not yet received (relative to O(S)).

Page 64:

Ch9 Models: Distributed Snapshots
Examples of global states (questions)

[Two space-time diagrams over p1, p2, p3, with observation points o1, o2, o3 and o'1, o'2, o'3.]

Is O = {o1, o2, o3} consistent with causality?
Is O' = {o'1, o'2, o'3} consistent with causality?

Page 65:

Ch9 Models: Distributed Snapshots
Why is a consistent global state useful? (an example)

Processors p1 and p2 make use of resources r1 and r2.

A deadlocked global state of a distributed system is one in which there is a cycle in the wait-for graph.

Deadlock property: once a distributed system enters a deadlocked state, all subsequent global states are deadlocked.

[Space-time diagram over p1, r1, r2, p2 with Req/Ok/Rel messages and observation points marked 1 through 4.]

Assume we have a ``tough guy´´ called the deadlock detector, whose goal is to observe the processors and the resources at some points of their processing and then check whether there is a cycle in the wait-for graph; if so, he claims that there is a deadlock. Our guy observes the processors and the resources at the points marked 1 through 4.

Page 66:

Ch9 Models: Distributed Snapshots
Why is a consistent global state useful? (example, cont.)

[Same space-time diagram over p1, r1, r2, p2. To see why, assume a correct transaction for using a resource consists of three steps: Req, Ok, Rel.]

The deadlock detector observes the processors and the resources at the points marked 1 through 4 and finds:

[Wait-for graph with a cycle among p1, r1, p2, r2, where x → y means x is waiting for y.]

Page 67:

Ch9 Models: Distributed Snapshots
Why is a consistent global state useful? (example, cont.)

[Same space-time diagram and wait-for graph as before, with x → y meaning x is waiting for y.]

Is there actually a deadlock in the system?

The answer is NO. There is only a phantom deadlock. The detector's claim is due to the fact that he made an inconsistent observation, which led to a wrong result!

Page 68:

Ch9 Models: Distributed Snapshots
The snapshot algorithm (informal)

Uses special messages: snapshot tokens (stok). There are two types of participating processors: the initiating processor and the others.

The algorithm for the initiating processor:
- records its state;
- sends a stok on each outgoing channel;
- starts to record the state of its incoming channels.

Recording of the state of an incoming channel c is finished when a stok is received along it.

Page 69:

Ch9 Models: Distributed Snapshots
The snapshot algorithm (informal, cont.)

Uses special messages: snapshot tokens (stok). Types of participating processors: initiating, others.

The algorithm for any other processor:
- records its state on receipt of a stok for the first time (assume the first stok is received along channel c);
- records the state of c as empty;
- sends one stok on each outgoing channel;
- starts to record the state of all other incoming channels.

Recording of the state of an incoming channel c' ≠ c is finished when a stok is received along it.
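Both roles share the same recording rules, so a single per-processor class can sketch them. This is our own message-driven rendering (class, channel and state names are ours; one snapshot instance, reliable FIFO channels assumed):

```python
class SnapshotProcessor:
    """Per-processor stok rules: record own state, flood stoks,
    and record in-flight messages on incoming channels."""
    def __init__(self, incoming, outgoing):
        self.incoming, self.outgoing = incoming, outgoing
        self.sent_stoks = []          # outgoing channels a stok was sent on
        self.recorded_state = None    # own recorded state
        self.channel_state = {}       # incoming channel -> recorded messages
        self.recording = set()        # incoming channels still being recorded

    def record(self, local_state, first_channel=None):
        """The initiator calls this directly; any other processor
        reaches it via its first stok."""
        self.recorded_state = local_state
        self.sent_stoks += list(self.outgoing)   # one stok per outgoing channel
        self.channel_state = {c: [] for c in self.incoming}
        # the channel the first stok came in on is recorded as empty
        self.recording = set(self.incoming) - {first_channel}

    def on_message(self, channel, msg, local_state):
        if msg == "stok":
            if self.recorded_state is None:
                self.record(local_state, first_channel=channel)  # first stok
            else:
                self.recording.discard(channel)   # channel recording finished
        elif channel in self.recording:
            self.channel_state[channel].append(msg)  # message in flight

# Non-initiating q with one incoming channel Cpq and one outgoing Cqp:
q = SnapshotProcessor(incoming=["Cpq"], outgoing=["Cqp"])
q.on_message("Cpq", "stok", local_state="s0")
assert q.recorded_state == "s0" and q.sent_stoks == ["Cqp"]
assert q.channel_state == {"Cpq": []}    # Cpq recorded as empty
```

Messages arriving on a channel after its stok are simply ignored by the recording, which matches "recording of c is finished when a stok is received along it".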

Page 70:

Ch9 Models: Distributed Snapshots
The snapshot algorithm (idea, cont.)

Notation: T(p, state) is the time at p when p records its state; T(p, stok, c) is the time at p when p receives a stok along c.

The state of an incoming channel c of p is the sequence of messages that p receives in the open interval ]T(p, state), T(p, stok, c)[. Recall that the state of c is recorded by p.

Page 71:

Ch9 Models: Distributed Snapshots
The snapshot algorithm illustrated: taking a snapshot of a token-passing system

[Sequence of diagrams for processors p and q with channels Cp,q and Cq,p:]
1. p records its state s0 and sends a stok on Cp,q.
2. q receives the stok: q records its state s0 and the state of Cp,q (Lpq = {}), then sends a stok on Cq,p.
3. p receives the token, and when the stok then arrives, p finishes recording the state of Cq,p.

Recorded global state: S = {s0, s0}; L = {Lpq, Lqp}, with the token captured in the state of Cq,p.

Page 72:

Ch9 Models: Distributed Snapshots
Applications of snapshots

Detecting stable state predicates (or properties).

A state predicate P is said to be stable if P(G) ⇒ P(G') for every G' that is reachable from G.

Examples: deadlock; termination; loss of the token; etc.

Page 73:

Ch9 Models:Distributed SnapshotsThe snapshot algorithm(in the book) Accounts for the possibility of different concurrent snapshots; To achieve this, Each snapshot is identified by the name of the initiating processor A processor might initiate a new snapshot while the first is still being collected; version number To achieve this, Version numbers are used (for simplicity, when a processor r requests a new version of the snapshot, the old snapshot is cancelled)

Diffusing computation: one useful technique for designing distributed algorithms.


Ch9 Models: Diffusing computation

Assume a connected network (i.e. for each pair of processors in the system, there is a path connecting them) and that messages sent are eventually received.

The problem: a processor p has some information Info that it wants to send to all other processors.

Processors that are directly connected are called neighbors; each processor knows its neighbors.


Ch9 Models: Diffusing computation
Diffusing computation (a solution)

The algorithm for the initiator i:
  for each neighbor k
    send(k, Info)

The algorithm for any other processor:
  wait for message from any neighbor
  on receipt of Info from some neighbor p do
    for each neighbor k ≠ p
      send(k, Info)
    end
  end

There are two problems with this algorithm:

Problem 1: there might be unprocessed messages left in some channels

Problem 2: processor p does not know if and when all other processors have received Info


Ch9 Models: Diffusing computation

Diffusing computation (a solution, cont.)
Solution to problems 1 and 2: we want the initiator to be informed of the fact that all the processors have received Info.

Variables used:
• my_neighbors: the set of identities of all my neighbors;
• my_wlist: the list of neighbors from which I am waiting for a message containing Info.

The algorithm for the initiator i:
  Step 1:
    for each k in my_neighbors
      send(k, Info)
  Step 2:
    my_wlist := my_neighbors;
    while my_wlist is not empty do
      wait for message from any k in my_wlist
      on receipt of Info from k in my_wlist do
        my_wlist := my_wlist \ {k}
      end
    end


Ch9 Models: Diffusing computation

Diffusing computation (a solution, cont.): the algorithm for a non-initiating processor consists of three steps, Step 1, Step 2 and Step 3, in that order.

  Step 1:
    wait for a message from any k in my_neighbors
    on receipt of Info from k do
      my_parent := k;
      for each j in my_neighbors \ {k}
        send(j, Info)
      end
    end

  Step 2:
    my_wlist := my_neighbors \ {my_parent};
    while my_wlist is not empty do
      wait for message from any k in my_wlist
      on receipt of Info from k in my_wlist do
        my_wlist := my_wlist \ {k}
      end
    end

  Step 3:
    send(my_parent, Info)

Why is this distributed algorithm correct (i.e. each processor receives Info, the initiator eventually learns that each processor has received Info, and there is no deadlock)?
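Steps 1, 2 and 3 above can be sketched as a message-driven simulation (hedged: the `Node` and `run` scaffolding, the FIFO scheduler, and the triangle topology in the test are illustrative assumptions; Info messages double as echoes, exactly as in the slides):

```python
from collections import deque

# Sketch of the diffusing computation with acknowledgements.
# Node/run and the queue-based scheduler are illustrative assumptions.

class Node:
    def __init__(self, name, neighbors, initiator=False):
        self.name = name
        self.neighbors = set(neighbors)
        self.initiator = initiator
        self.parent = None
        self.wlist = set()
        self.got_info = initiator
        self.done = False

    def start(self, net):
        # Initiator: Step 1 (flood Info), Step 2 (wait for all echoes).
        for k in self.neighbors:
            net.append((self.name, k))
        self.wlist = set(self.neighbors)

    def receive(self, sender, net):
        if not self.got_info:
            # Non-initiator Step 1: remember the parent, forward Info.
            self.got_info = True
            self.parent = sender
            self.wlist = self.neighbors - {sender}
            for k in self.wlist:
                net.append((self.name, k))
        else:
            # Step 2: an Info from a waited-for neighbor acts as an echo.
            self.wlist.discard(sender)
        if not self.wlist and not self.done:
            self.done = True
            if self.parent is not None:
                net.append((self.name, self.parent))  # Step 3: echo to parent

def run(nodes, initiator):
    net = deque()                      # in-flight (src, dst) messages
    nodes[initiator].start(net)
    while net:
        src, dst = net.popleft()
        nodes[dst].receive(src, net)
```

On a small topology the initiator floods Info, every other node picks the sender of its first Info as its parent, and the initiator's waiting list eventually drains, which is the intuition behind the correctness question above.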


Ch9 Models: Diffusing computation

Spanning tree construction
A spanning tree of a graph is a tree whose nodes are all those in the graph and whose edges are a subset of those in the graph. (Figure: the channels along which processors received Info for the first time form a spanning tree rooted at the initiator p.)


Ch9 Models: Distributed computation
Formal models (non-deterministic interleaving): understand how distributed computations actually occur.

Intuition: a distributed system has:
• Global states: (S, L), see Snapshots; initially, each processor is in an initial local state and each communication channel is empty;
• Events: the occurrence of an event causes a transition of the system from the current global state to a new global state;
• Computations: sequences of events from initial global states.


Ch9 Models: Distributed computation
More precisely: an event is e = (p, s, s′, m, c), where p ∈ P; s, s′ are local states of p; m ∈ M ∪ {NULL} (M = set of all possible messages); c ∈ C ∪ {NULL} (C = set of all channels).

Interpretation of e = (p, s, s′, m, c): e takes p from s to s′ and possibly sends or receives m on c.
• If m (and c) is NULL then e is an internal event; no channel is affected by the occurrence of e.
• Otherwise:
  – if c is an incoming channel then m is removed from c;
  – if c is an outgoing channel then m is added to c.


Ch9 Models: Distributed computation
Occurrence of an event (execution of an event): an event e = (p, s, s′, m, c) can occur in a global state G only if some condition, termed the enabling condition of e, is satisfied in G.

The enabling condition of e=(p,s,s’,m,c) is a condition on the state of p and the channels attached to p; example: the program counter has a specific value;

Transition of the system: If e=(p,s,s’,m,c) can occur in G, then the execution of e by p changes the global state by changing only the state of p and possibly the state of one channel attached to p.


Ch9 Models: Distributed computation
More precisely (cont.), two functions. Let G be a global state and e an event:
• Ready(G) = the set of all events that can occur in G;
• Next(G, e) = the global state just after the occurrence of e.

Assume: G0 = initial global state; Gi = the global state when event ei occurs; seq = <e0, e1, …, en> a sequence of events.

Definition: seq is a computation of the system if
1) ∀ i ∈ {0, …, n}: ei ∈ Ready(Gi);
2) ∀ i ∈ {0, …, n}: Gi+1 = Next(Gi, ei).

Note: non-deterministic selection in Ready(Gi).


Ch9 Models: Distributed computation
Correctness:
• State predicate: an assertion on global states;
• Correctness property: an assertion on computations.

Definition: A distributed algorithm is correct if each of its computations satisfies the correctness property.

Proving correctness: Show that each global state reachable from the initial global state satisfies some well-defined state predicate. In general, one uses invariant assertions.


Ch9 Models: Distributed computation
"Eventually" and "Always" properties. Let G0 be an initial global state; R(G0) = all computations that start in G0; A a state predicate; Q an assertion on computations.

• eventually(A, G0, Q) means: starting from G0, for any computation for which Q holds, there is a global state that satisfies A (from now on, something good will happen).
• always(A, G0, Q) means: A is always true starting from G0, for any computation for which Q holds.
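For a finite computation prefix, given as a list of global states, these two properties can be checked directly (an illustrative sketch only; the real definitions quantify over all, possibly infinite, computations in R(G0)):

```python
# Sketch: checking "eventually" and "always" over a finite sequence of
# global states, i.e. one computation prefix. A is a state predicate.

def eventually(A, states):
    # Some global state along the prefix satisfies A.
    return any(A(G) for G in states)

def always(A, states):
    # Every global state along the prefix satisfies A.
    return all(A(G) for G in states)
```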


Ch9 Models: Distributed computation
Failures in a distributed system

• In a distributed system, failures occur: an additional complication in designing distributed algorithms.
• For a distributed system to be dependable, fault tolerance must be incorporated.
• A fault-tolerant algorithm is one which minimizes the impact of certain faults on the service provided by the system.

Fault classification: fail-stop; timing faults; Byzantine faults; transient faults; etc.