SysRép / 2.5 A. Schiper Eté 2007 1 2.5 The consensus problem


2.5 The consensus problem


Motivation

• Implementation of atomic broadcast and other group communication primitives in the presence of failures is a difficult problem

• Consensus: problem that is the common denominator for the implementation of the various group communication primitives

• Model: static groups, crash-stop


Definitions

Processes:

• correct process: process that never crashes during its whole execution

• faulty process: process that is not correct


Definitions (2)

Channels:

• Reliable channel: if p executes send (m) to q and q is correct, then q eventually receives m

• Quasi-reliable channel: if p executes send (m) to q and p, q are correct, then q eventually receives m


Specification of consensus

Informal:

• n processes: p1, …, pn

• Each process pi has an initial value vi

• Processes must agree on a common value that is the initial value of one of the processes

[Figure: three processes with initial values 4, 7 and 1; all three agree on 7]


Specification of consensus (2)

Formal

• Consensus is defined by two primitives:
– propose(v): primitive by which a process proposes an initial value
– decide(v): primitive by which a process decides

[Figure: three processes execute propose(4), propose(7) and propose(1); each of them then executes decide(7)]


Specification of consensus (3)

Propose and decide must satisfy the following properties:

• Validity: if a process decides v, then v was proposed by some process (v is the initial value of some process)

• Agreement: Two correct processes cannot decide differently

• Termination: Every correct process eventually decides
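The three properties can be checked mechanically on a single finished run. The sketch below is only an illustration: the function name `check_consensus` and the way a run is encoded (three dictionaries/sets) are assumptions, and termination is judged on a finite run, which only approximates "eventually".

```python
from typing import Dict, Set

def check_consensus(proposed: Dict[str, int],
                    decided: Dict[str, int],
                    correct: Set[str]) -> Dict[str, bool]:
    """Check the three consensus properties on one finished run.

    proposed: initial value of every process
    decided:  decision of every process that decided
    correct:  processes that never crashed in this run
    (This encoding of a run is a hypothetical one, chosen for the example.)
    """
    # Validity: every decided value was proposed by some process.
    validity = all(v in proposed.values() for v in decided.values())
    # Agreement: correct processes decided at most one distinct value.
    correct_decisions = {decided[p] for p in correct if p in decided}
    agreement = len(correct_decisions) <= 1
    # Termination: every correct process has decided by the end of the run.
    termination = all(p in decided for p in correct)
    return {"validity": validity, "agreement": agreement,
            "termination": termination}

run = check_consensus(
    proposed={"p1": 4, "p2": 7, "p3": 1},
    decided={"p1": 7, "p2": 7, "p3": 7},
    correct={"p1", "p2", "p3"},
)
print(run)
```

The run used here is the propose(4), propose(7), propose(1) example: all three properties hold.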


Specification of consensus (4)

Uniform consensus:

• Validity: if a process decides v, then v was proposed by some process (is the initial value of some process)

• Uniform agreement: Two processes (correct or faulty) cannot decide differently

• Termination: Every correct process eventually decides


Solving consensus

• Consensus is easy to solve if processes do not crash and if channels are reliable

• Otherwise not so easy …

• Solvability of consensus depends on the system model (which defines assumptions about processes and channels)


System models: synchronous system

• Bound Δ on message delay: If message m is sent by process p to process q at time t, then q receives the message no later than at time t+Δ.

• Bound Φ ≥ 1 on relative speed of processes: If the fastest process takes x time units to do some computation, then the slowest process does not take more than Φx time units to do the same computation


System models: synchronous system (2)

A synchronous system allows accurate failure detection

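• Handling of “are you alive”: x time units for the fastest process, Φx time units for the slowest process (Φ: bound on relative process speed)

• Timeout of p: 2Δ + Φx (Δ: bound on message delay)

[Figure: p sends “are you alive” to q and q answers “yes”; p waits at most 2Δ + Φx for the answer]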
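The timeout is one message delay for the request, the handling time of the slowest possible process, and one message delay for the answer. A minimal sketch, assuming the symbols Δ (bound on message delay) and Φ (bound on relative process speed); the function name and the numeric values are made up for the example.

```python
def alive_timeout(delta: float, phi: float, x: float) -> float:
    """Worst-case time p waits for the answer to "are you alive".

    delta: bound on one-way message delay (assumed symbol: Δ)
    phi:   bound on relative process speed, >= 1 (assumed symbol: Φ)
    x:     time the fastest process needs to handle "are you alive"
    """
    # request (delta) + handling by the slowest process (phi * x) + reply (delta)
    return 2 * delta + phi * x

print(alive_timeout(delta=10.0, phi=2.0, x=1.0))  # 22.0
```

If no answer arrives within this time, p can conclude that q has crashed without ever being wrong, which is what accurate failure detection means here.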


System models: asynchronous system

• No bound on message delay

• No bound on process relative speed

Not possible to know whether a process has crashed or not


Synchronous round model

• First goal: solve consensus in the synchronous model

• As is often done, we express the consensus algorithm in a computation model composed of rounds, which can be implemented in the synchronous model

• Name: synchronous round model


Synchronous round model

In every round r, each process p:
• Sends a message to all processes
• Receives the messages sent in round r
• Does some local computation

[Figure: during round r, process p moves from state st to state st’]


Synchronous round model (2)

In every round r, each process p:
• …
• Receives the messages sent in round r
• …

• If p does not crash in round r: all processes that do not crash in round r (or before) receive p’s message

• If p crashes in round r: some processes might receive p’s message, some other processes might not receive p’s message


floodSet: example 1

• f=2

[Figure: p1, p2, p3 start with the sets {3}, {7}, {5}. No process crashes: after each of the rounds r=1, r=2, r=3 every process holds {3,5,7}, and all three execute DECIDE(3)]


Synchronous round model: floodSet algorithm

Parameter f: maximum number of processes that can crash

State:

Wp : set of values, initially {vp} {p’s initial value}

Round r

Sr:

send Wp to all processes

Tr:

forall q from which Wq received do Wp := Wp ∪ Wq

if r = f+1 then DECIDE (min (Wp))
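The algorithm can be tried out with a small simulation of the synchronous round model. This is a sketch under stated assumptions, not the course's reference code: the function name `flood_set` and the `crashes` parameter (a crash schedule saying in which round a process crashes and who still receives its last message) are hypothetical.

```python
from typing import Dict, Optional, Set

def flood_set(initial: Dict[str, int], f: int,
              crashes: Optional[Dict[str, tuple]] = None) -> Dict[str, int]:
    """Simulate floodSet in the synchronous round model.

    initial: initial value v_p of each process p
    f:       maximum number of processes that can crash
    crashes: hypothetical crash schedule, process -> (round, receivers);
             the process crashes in that round and its message of that
             round reaches only the listed receivers
    Returns the decision of every process that did not crash.
    """
    crashes = crashes or {}
    W: Dict[str, Set[int]] = {p: {v} for p, v in initial.items()}
    alive = set(initial)
    for r in range(1, f + 2):                      # rounds 1 .. f+1
        inbox: Dict[str, list] = {p: [] for p in alive}
        crashing = set()
        for p in alive:                            # S_r: send W_p to all
            if p in crashes and crashes[p][0] == r:
                receivers = set(crashes[p][1])     # partial send, then crash
                crashing.add(p)
            else:
                receivers = set(alive)
            for q in receivers & alive:
                inbox[q].append(set(W[p]))
        alive -= crashing
        for p in alive:                            # T_r: W_p := W_p ∪ W_q
            for Wq in inbox[p]:
                W[p] |= Wq
    # After round f+1 every surviving process executes DECIDE(min(W_p)).
    return {p: min(W[p]) for p in alive}

# Failure-free run with f = 2.
print(flood_set({"p1": 3, "p2": 7, "p3": 5}, f=2))
# p1 crashes in round 1 (only p2 gets {3}); p2 crashes in round 2 (nobody gets its set).
print(flood_set({"p1": 3, "p2": 7, "p3": 5}, f=2,
                crashes={"p1": (1, ["p2"]), "p2": (2, [])}))
```

In the second run p3 never learns the value 3, but since it is the only surviving process its decision 5 does not violate agreement. Running f+1 rounds guarantees at least one round without a crash, after which all surviving processes hold the same set.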


floodSet: example 2

• f=2

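[Figure: p1, p2, p3 start with the sets {3}, {7}, {5}. Two processes crash, one in round 1 and one in round 2; one of them holds {3,5,7} when it crashes. p3 holds {5,7} after rounds 1 and 2 and executes DECIDE(5) after round r=3]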


Proof

• Validity: if a process decides v, then v was proposed by some process (v is the initial value of some process)

• Termination: Every correct process eventually decides

• Agreement: Two correct processes cannot decide differently


FLP impossibility result

• Consensus is solvable in the synchronous system

• What about the asynchronous system model?

• Fischer-Lynch-Paterson (1985):

Consensus is not solvable in an asynchronous system with reliable channels if one single process may crash.


FLP impossibility result (2)

• What does not solvable mean?

• There exists no algorithm A such that consensus is solved in all runs of A compatible with the system model

• This does not mean that A cannot solve consensus in any run


Discussion

• Asynchronous system:
– too weak to solve consensus

• Synchronous system:
– allows us to solve consensus
– drawback: requires estimating the worst-case message transmission delay (e.g., must include possible retransmissions)
– this estimate has a direct impact on the crash detection time, and on the duration of the black-out period that follows a crash

• Question: is it possible to solve consensus without making mistakes in the crash detection?


Discussion (2)

1. Partially synchronous model (Dwork, Lynch, Stockmeyer, 1988):
– Model in between the synchronous model and the asynchronous model.
– The bounds Δ (message delay) and Φ (relative process speed) of the synchronous model:
   1. Exist but are unknown, or
   2. Are known but hold only from a time T on, called global stabilization time

2. Augmenting the asynchronous system with failure detectors (Chandra, Toueg, 1996)


Failure detectors

• Each FDi maintains a list of suspected processes

• Each FDi can make a mistake by suspecting a process that has not crashed

• Each FDi can change its mind by removing a process from its list of suspects

• No agreement among FDi’s is required

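[Figure: four processes p1, p2, p3, p4, each with its own failure detector module FD1, FD2, FD3, FD4. The modules output different suspect lists, e.g. {p2, p3}, {p1}, { }, and one list changes over time from {p2, p3} to {p2}]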
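One common way to obtain such a module in practice is heartbeats plus a timeout. The class below is a minimal sketch under that assumption; the class name, method names and timing values are hypothetical and not taken from the course. It shows the two behaviours listed above: a wrong suspicion, and a change of mind.

```python
from typing import Dict, Set

class TimeoutFailureDetector:
    """Local failure-detector module FD_i based on heartbeats and a timeout."""

    def __init__(self, peers: Set[str], timeout: float) -> None:
        self.timeout = timeout
        # Time at which each peer was last heard from (0.0 = start of the run).
        self.last_heard: Dict[str, float] = {q: 0.0 for q in peers}
        self.suspected: Set[str] = set()

    def on_heartbeat(self, q: str, now: float) -> None:
        """A message from q arrived: q is alive, so stop suspecting it."""
        self.last_heard[q] = now
        self.suspected.discard(q)        # FD_i changes its mind

    def tick(self, now: float) -> Set[str]:
        """Suspect every peer that has been silent for longer than timeout."""
        for q, t in self.last_heard.items():
            if now - t > self.timeout:
                self.suspected.add(q)    # may be a mistake if q is only slow
        return set(self.suspected)

fd1 = TimeoutFailureDetector(peers={"p2", "p3"}, timeout=5.0)
fd1.on_heartbeat("p2", now=1.0)
print(fd1.tick(now=6.5))   # p2 and p3 have both been silent for more than 5.0
fd1.on_heartbeat("p2", now=7.0)
print(fd1.tick(now=8.0))   # p2 was wrongly suspected and is removed again
```

Each process runs its own instance, so two modules can hold different suspect lists at the same time, which matches the point that no agreement among the FDi is required.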


Failure detectors (2)

• Without adding constraints on the output of the failure detectors, the new model is equivalent to the asynchronous model

• Two types of constraints on the output of failure detectors:
– Constraints related to crashed processes: completeness properties
– Constraints related to correct processes: accuracy properties

• A failure detector is defined by a pair (c, a):
– c: a completeness property
– a: an accuracy property


Completeness

• Strong completeness: Every process that crashes is eventually permanently suspected by every correct process

• Weak completeness: Every process that crashes is eventually permanently suspected by some correct process.


Accuracy

• Strong accuracy: No process is suspected before it crashes

• Weak accuracy: Some correct process is never suspected

• Eventual strong accuracy: There is a time after which correct processes are not suspected by any correct process

• Eventual weak accuracy: There is a time after which some correct process is never suspected
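The two non-eventual accuracy properties can be checked on a recorded trace of suspect lists. The sketch below assumes a hypothetical trace encoding (one snapshot of all suspect lists per time step) and hypothetical function names. The eventual variants talk about an unbounded future and cannot be decided from a finite trace, so they are left out.

```python
from typing import Dict, List, Optional, Set

# trace[t][p] is the set of processes suspected by FD_p at time step t.
Trace = List[Dict[str, Set[str]]]

def strong_accuracy(trace: Trace, crash_time: Dict[str, Optional[int]]) -> bool:
    """No process is suspected before it crashes.

    crash_time[q] is the step at which q crashes, or None if q is correct.
    """
    for t, snapshot in enumerate(trace):
        for suspects in snapshot.values():
            for q in suspects:
                crashed_at = crash_time.get(q)
                if crashed_at is None or t < crashed_at:
                    return False
    return True

def weak_accuracy(trace: Trace, crash_time: Dict[str, Optional[int]]) -> bool:
    """Some correct process is never suspected."""
    correct = {p for p, t in crash_time.items() if t is None}
    ever_suspected = set().union(*(s for snap in trace for s in snap.values()))
    return bool(correct - ever_suspected)

crash_time = {"p1": None, "p2": None, "p3": 2}     # p3 crashes at step 2
trace = [
    {"p1": set(), "p2": {"p3"}, "p3": set()},       # p2 suspects p3 too early
    {"p1": set(), "p2": {"p3"}, "p3": set()},
    {"p1": {"p3"}, "p2": {"p3"}},
]
print(strong_accuracy(trace, crash_time))  # False
print(weak_accuracy(trace, crash_time))    # True
```

The example trace violates strong accuracy because p3 is suspected at step 0, before its crash, while weak accuracy holds because p1 and p2 are never suspected.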


Failure detectors

• Perfect failure detector:
– Strong completeness, strong accuracy
– Notation: P

• Eventually perfect failure detector:
– Strong completeness, eventual strong accuracy
– Notation: ◇P

• Strong failure detector:
– Strong completeness, weak accuracy
– Notation: S

• Eventually strong failure detector:
– Strong completeness, eventual weak accuracy
– Notation: ◇S

• Eventually weak failure detector:
– Weak completeness, eventual weak accuracy
– Notation: ◇W


Solving consensus with ◇S

• Proposed by Chandra, Toueg (1996)

• Hyp:
– f < n/2
– ◇S:
   • Eventual weak accuracy: There is a time after which some correct process is no longer suspected by any correct process
   • Strong completeness: Every process that crashes is eventually permanently suspected by every correct process


Solving consensus with ◇S (2)

Basic idea:

• Process p1 tries to impose its initial value as the decision

• How many acks should p1 wait for?
  A majority, i.e., ⌈(n+1)/2⌉

[Figure: p1 sends v1 to all processes, collects acks, then executes decide (v1)]
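The reason for waiting for a majority is that any two majorities intersect, so a later coordinator that hears from a majority reaches at least one process that acked the earlier value. A small sketch checking this by brute force for n = 5; the helper name `majority` is made up for the example.

```python
import math
from itertools import combinations

def majority(n: int) -> int:
    """Smallest number of processes that forms a majority of n: ceil((n+1)/2)."""
    return math.ceil((n + 1) / 2)

# Any two majorities of the same group share at least one process.
n = 5
procs = range(1, n + 1)
quorums = list(combinations(procs, majority(n)))
assert all(set(a) & set(b) for a in quorums for b in quorums)

print([majority(k) for k in range(1, 8)])  # [1, 2, 2, 3, 3, 4, 4]
```

With the hypothesis f < n/2, at least n - f processes are correct and n - f is never smaller than the majority size, so waiting for a majority of acks cannot block forever.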


Solving consensus with ◇S (3)

• What if p1 crashes ?

• Process p2 takes over the role of p1

• Can p2 ignore what p1 has done previously?

• What is the problem?

[Figure: p2 sends v2 to all processes, collects acks, then executes decide (v2)]


Solving consensus with ◇S (4)

• If some process has decided v1, then p2 must ignore v2 and must try to impose v1 as the decision

• p2 must be able to discover that v1 might have been decided

[Figure: p2 sends a value x to all processes, collects acks, then decides]

• pi: if v1 received from p1 then send v1 to p2

• p2: if v1 received then x = v1 else x = v2


Solving consensus with ◇S (5)

• If p2 does not succeed, then p3 takes over

• If p3 does not succeed, then p4 takes over

• …

• If pn does not succeed, then p1 takes over

• …

• This is called: rotating coordinator
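With rounds numbered from 1, the rotation can be written as a one-line function. A minimal sketch; the function name and the numbering convention (round 1 is led by p1) are assumptions made for the example.

```python
def coordinator(r: int, n: int) -> int:
    """Index i of the coordinator p_i of round r (rounds and processes start at 1)."""
    return ((r - 1) % n) + 1

print([coordinator(r, n=4) for r in range(1, 10)])  # [1, 2, 3, 4, 1, 2, 3, 4, 1]
```

Every process can compute the coordinator of a round locally, so no message exchange is needed to agree on who takes over.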


Solving consensus with ◇S (6)

• Rotating coordinator

• pi is the new coordinator: what value should pi choose?

• The values sent are time-stamped with round numbers; the value with the largest time-stamp is chosen

[Figure: pi receives value vx from px and value vy from py]
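The selection rule of the new coordinator can be sketched as follows. The `Estimate` record and the convention that time-stamp 0 marks an initial value nobody has tried to impose yet are assumptions made for this example.

```python
from typing import List, NamedTuple

class Estimate(NamedTuple):
    sender: str
    value: int
    ts: int      # round in which the sender last adopted this value (0 = initial)

def choose_value(estimates: List[Estimate]) -> int:
    """Coordinator rule: adopt the value carrying the largest time-stamp."""
    return max(estimates, key=lambda e: e.ts).value

received = [Estimate("px", 4, ts=0), Estimate("py", 7, ts=2), Estimate("pz", 1, ts=0)]
print(choose_value(received))  # 7
```

Here py adopted 7 in round 2, so 7 might already have been decided; the coordinator therefore picks 7 instead of its own or another initial value.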


Solving consensus with ◇S (7)

[Figure: one round of the algorithm, driven by the coordinator (coord) and divided into phase 1, phase 2, phase 3 and phase 4]


2.6 Atomic broadcast in the crash-stop model


• Reliable broadcast (specification)

• Atomic broadcast (specification)

• Reliable broadcast (implementation)

• Atomic broadcast (implementation)


Reliable broadcast

• Unreliable broadcast of message m to group g:
– If the sender is correct, then every correct process in g eventually receives m
– If the sender crashes, then some correct processes in g might receive m, and others not.

• We may want stronger guarantees ⇒ reliable broadcast


Reliable broadcast (2)

• Defined by the primitives rbcast and rdeliver

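• Convention:
– g dropped
– sender is member of g

[Figure: layer stack, from top to bottom: Replication technique, Group communication, Transport layer. The group communication layer offers rbcast (g, m) and rdeliver (m) to the layer above; the transport layer offers send (m) to p and receive (m)]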


Reliable broadcast (3)

Rbcast and rdeliver satisfy the following properties:

• Validity: If a correct process executes rbcast(m), then it eventually rdelivers m.

• Agreement: If a correct process rdelivers m, then all correct processes eventually rdeliver m.

• Integrity: For any message m, every correct process rdelivers m at most once, and only if m was previously rbcast.


Uniform reliable broadcast

Uniform reliable broadcast:

agreement → uniform agreement

• Uniform agreement: If a process (correct or faulty) rdelivers m, then all correct processes eventually rdeliver m.


Atomic broadcast

• Uniform reliable broadcast plus the following property:

• Uniform total order: If some process (correct or faulty) adelivers m before m’, then every process adelivers m’ only after having adelivered m.

NB Should be called uniform atomic broadcast. To simplify, atomic broadcast is often used.


Solving reliable broadcast

• Can be solved in an asynchronous system with quasi-reliable channels for f < n

To rbcast(m):
    send(m) to all processes

Upon reception of m for the first time do
    if pi ≠ sender(m) then send(m) to all processes
    rdeliver(m)
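The relay step is what makes the broadcast reliable: if the sender crashes in the middle of its send loop, any correct process that did receive m forwards it to everybody before rdelivering it. The toy simulation below illustrates this; the function name, the `reached_before_crash` parameter and the assumption that all remaining processes are correct are hypothetical simplifications.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

def simulate_rbcast(procs: List[str], sender: str, m: str,
                    reached_before_crash: List[str]) -> Dict[str, List[str]]:
    """Run the relay algorithm when the sender crashes during its send loop.

    reached_before_crash: hypothetical list of processes to which the
    sender managed to send m before crashing. All other processes are
    correct and no message between them is lost in this toy run.
    Returns, for every correct process, the messages it rdelivered.
    """
    correct = [p for p in procs if p != sender]
    delivered: Dict[str, List[str]] = {p: [] for p in correct}
    seen: Set[Tuple[str, str]] = set()
    network = deque((q, m) for q in reached_before_crash if q != sender)
    while network:
        p, msg = network.popleft()
        if (p, msg) in seen:
            continue                      # only the first reception counts
        seen.add((p, msg))
        # p is not the sender, so it first relays m to the other processes
        # (the crashed sender is skipped, sending to it has no effect) ...
        network.extend((q, msg) for q in correct if q != p)
        # ... and only then rdelivers m.
        delivered[p].append(msg)
    return delivered

print(simulate_rbcast(["p1", "p2", "p3", "p4"], "p1", "m", ["p3"]))
print(simulate_rbcast(["p1", "p2", "p3", "p4"], "p1", "m", []))
```

In the first run only p3 hears from the crashing sender, yet p2, p3 and p4 all rdeliver m. In the second run nobody receives m and nobody rdelivers it. Both outcomes satisfy agreement.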


Solving atomic broadcast

• Atomic broadcast is also subject to the FLP impossibility result:
– shown by contradiction: if atomic broadcast were solvable, then consensus would also be solvable

• We will show that if consensus is solvable, then atomic broadcast is also solvable

• Consensus and atomic broadcast are equivalent problems
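The direction "atomic broadcast solvable implies consensus solvable" rests on a short reduction: every process abcasts its proposal and decides the first proposal it adelivers. The sketch below models the atomic broadcast layer by a single delivery order handed in by the caller; that parameter and the function name are assumptions for illustration.

```python
from typing import Dict, List

def consensus_from_abcast(proposals: Dict[str, int],
                          total_order: List[str]) -> Dict[str, int]:
    """Each process abcasts its proposal and decides the first value it adelivers.

    total_order stands in for the atomic broadcast layer: it is the single
    order (a hypothetical one, chosen by the caller) in which every process
    adelivers the broadcast proposals, identified here by their sender.
    """
    first_sender = total_order[0]
    # Total order: every process sees the same first message, hence agreement.
    return {p: proposals[first_sender] for p in proposals}

print(consensus_from_abcast({"p1": 4, "p2": 7, "p3": 1}, ["p2", "p3", "p1"]))
# {'p1': 7, 'p2': 7, 'p3': 7}
```

Validity holds because the decided value was abcast by some process, and agreement follows from the total order property.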


Solving atomic broadcast (3)

[Figure: processes execute abcast(m1), abcast(m2), abcast(m3) and abcast(m4); a sequence of consensus instances is run, and after each instance messages are adelivered (adeliver(m1), adeliver(m2), adeliver(m3), adeliver(m4))]


Solving atomic broadcast (4)

Principle of the algorithm:

• Sequence of instances of consensus (numbered 1, 2, …)

• Each consensus on a set of messages

• Initial value for each consensus: set of messages

• Let msg^k be the set of messages decided by consensus #k:
– The messages in msg^k are adelivered before the messages in msg^(k+1)
– The messages in msg^k are adelivered in some deterministic order (e.g., according to their IDs)


Solving atomic broadcast (5)

Initialization
    ki := 0; adeliveredi := ∅; rdeliveredi := ∅

To abcast(m):
    rbcast(m)

Upon rdeliver(m) do
    rdeliveredi := rdeliveredi ∪ {m}

Upon rdeliveredi ⊈ adeliveredi do
    ki := ki + 1
    aUndelivered := rdeliveredi \ adeliveredi
    propose(ki, aUndelivered)
    wait until decide(ki, msg^ki)
    adeliver^ki := msg^ki \ adeliveredi
    adeliver the messages in adeliver^ki in some deterministic order
    adeliveredi := adeliveredi ∪ adeliver^ki
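The algorithm can be exercised in a failure-free toy run. In the sketch below the consensus box is replaced by a trivial stand-in (decide the proposal of the process with the smallest name), and the growth of each rdelivered set is given as input; both are assumptions made for illustration, as are all function and parameter names.

```python
from typing import Dict, List, Set

def consensus(k: int, proposals: Dict[str, Set[str]]) -> Set[str]:
    """Stand-in for consensus instance #k: decide the proposal of the
    process with the smallest name. Any rule that picks one proposed
    value for everybody would do in this failure-free toy run."""
    return set(proposals[min(proposals)])

def atomic_broadcast(rdelivered: Dict[str, List[Set[str]]]) -> Dict[str, List[str]]:
    """rdelivered[p][k-1] is the set of messages p has rdelivered when it
    starts consensus instance k. Returns each process's adeliver sequence."""
    adelivered: Dict[str, List[str]] = {p: [] for p in rdelivered}
    rounds = len(next(iter(rdelivered.values())))
    for k in range(1, rounds + 1):
        # aUndelivered := rdelivered \ adelivered, proposed to instance k
        proposals = {p: rdelivered[p][k - 1] - set(adelivered[p])
                     for p in rdelivered}
        msg_k = consensus(k, proposals)              # decide(k, msg^k)
        for p in rdelivered:
            new = msg_k - set(adelivered[p])         # adeliver^k
            adelivered[p].extend(sorted(new))        # deterministic order
    return adelivered

# The three processes rdeliver m1, m2, m3 in different orders.
out = atomic_broadcast({
    "p1": [{"m2"}, {"m1", "m2", "m3"}],
    "p2": [{"m1", "m2"}, {"m1", "m2", "m3"}],
    "p3": [{"m3"}, {"m1", "m2", "m3"}],
})
print(out)
```

All three processes adeliver m2 first (the decision of instance 1) and then m1 and m3 (the decision of instance 2, sorted by ID), although they rdelivered the messages in different orders.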


Quorum systems vs. group communication

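[Figure: a client c and three servers s1, s2, s3 implementing a server with inc/dec operations. With group communication: c sends inc/dec to the servers. With quorum systems: c accesses the servers with read and write operations, under mutual exclusion]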


Quorum systems vs. group communication (2)

• Solution based on quorum systems
– majority of correct servers
– mutual exclusion
– perfect failure detector

• Solution based on group communication
– majority of correct servers
– ◇S failure detector