SysRép / 2.5, A. Schiper, Summer 2007

2.5 The consensus problem


Motivation

• Implementation of atomic broadcast and other group communication primitives in the presence of failures is a difficult problem

• Consensus: problem that is the common denominator for the implementation of the various group communication primitives

• Model: static groups, crash-stop


Definitions

Processes:

• correct process: process that does not crash in its whole execution

• faulty process: process that is not correct


Definitions (2)

Channels:

• Reliable channel: if p executes send (m) to q and q is correct, then q eventually receives m

• Quasi-reliable channel: if p executes send (m) to q and p, q are correct, then q eventually receives m


Specification of consensus

Informal:

• n processes: p1, …, pn

• Each process pi has an initial value vi

• Processes must agree on a common value that is the initial value of one of the processes

[Figure: three processes with initial values 4, 7 and 1 agree on the common value 7]


Specification of consensus (2)

Formal

• Consensus defined by two primitives:

– propose(v): primitive by which a process proposes an initial value

– decide(v): primitive by which a process decides

[Figure: three processes execute propose(4), propose(7) and propose(1); each then executes decide(7)]



Propose and decide must satisfy the following properties:

• Validity: if a process decides v, then v was proposed by some process (v is the initial value of some process)

• Agreement: Two correct processes cannot decide differently

• Termination: Every correct process eventually decides



Uniform consensus:

• Validity: if a process decides v, then v was proposed by some process (is the initial value of some process)

• Uniform agreement: Two processes (correct or faulty) cannot decide differently
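The properties above can be checked mechanically on a finished run. The following is a minimal sketch, not part of the original slides: the function name `check_consensus` and its dictionary-based run representation are assumptions made for illustration. Termination is an "eventually" property, so on a finite run it can only be approximated by "every correct process has decided by the end".

```python
from typing import Dict, Hashable, Optional, Set


def check_consensus(proposed: Dict[str, Hashable],
                    decided: Dict[str, Optional[Hashable]],
                    correct: Set[str],
                    uniform: bool = False) -> Dict[str, bool]:
    """Check one finished run against the consensus properties.

    proposed: process -> initial value
    decided:  process -> decided value, or None if it never decided
    correct:  processes that did not crash in the run
    uniform:  if True, agreement also covers processes that crashed
    """
    decisions = {p: v for p, v in decided.items() if v is not None}
    # Validity: every decided value is the initial value of some process.
    validity = all(v in proposed.values() for v in decisions.values())
    # Agreement looks at correct processes only; uniform agreement at all.
    scope = decisions if uniform else {
        p: v for p, v in decisions.items() if p in correct}
    agreement = len(set(scope.values())) <= 1
    # Termination (finite-run approximation): every correct process decided.
    termination = all(p in decisions for p in correct)
    return {"validity": validity, "agreement": agreement,
            "termination": termination}
```

For example, a run in which a faulty p1 decides 4 before crashing while p2 and p3 decide 7 satisfies agreement but not uniform agreement.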



Solving consensus

• Consensus is easy to solve if processes do not crash and if channels are reliable

• Otherwise not so easy …

• Solvability of consensus depends on the system model (which defines assumption about processes and channels)


System models: synchronous system

• Bound Δ on message delay: If message m is sent by process p to process q at time t, then q receives the message no later than at time t+Δ.

• Bound β on relative speed of processes: If the fastest process takes x time units to do some computation, then the slowest process does not take more than xβ time units to do the same computation


System models: synchronous system (2)

A synchronous system allows accurate failure detection

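• Handling of “are you alive”: x time units for the fastest process, xβ time units for the slowest process (β: bound on relative process speed)

• Timeout of p: 2Δ + xβ (Δ: bound on message delay)

[Figure: p sends “are you alive” to q and q replies “yes”; p waits at most 2Δ + xβ]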
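As a worked illustration of the timeout, here is a small sketch. It writes `delta` for the bound on message delay and `beta` for the bound on relative process speed; these parameter names, and both function names, are assumptions introduced for the example.

```python
def alive_timeout(delta: float, beta: float, x: float) -> float:
    """Worst-case round trip of an "are you alive" query.

    One message delay for the request (delta), the slowest possible
    handling time (x * beta), and one message delay for the reply (delta).
    """
    return 2 * delta + x * beta


def has_crashed(elapsed: float, delta: float, beta: float, x: float) -> bool:
    """In a synchronous system, no reply within the timeout means a crash."""
    return elapsed > alive_timeout(delta, beta, x)
```

With delta = 10, beta = 3 and x = 2 the timeout is 2·10 + 2·3 = 26 time units. A reply that has not arrived after 26 time units can only be explained by a crash of q, which is why failure detection is accurate in this model.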


System models: asynchronous system

• No bound on message delay

• No bound on process relative speed

Not possible to know whether a process has crashed or not


Synchronous round model

• First goal: solve consensus in the synchronous model

• As often done, we express consensus algorithm in a computation model composed of rounds, that can be implemented in the synchronous model

• Name: synchronous round model


Synchronous round model

In every round r, each process p:

• Sends a message to all processes

• Receives the messages sent in round r

• Does some local computation

[Figure: during round r, process p goes from state st to state st’]


Synchronous round model (2)

In every round r, each process p:

• …

• Receives the messages sent in round r

• …

• If p does not crash in round r: all processes that do not crash in round r (or before) receive p’s message

• If p crashes in round r: some processes might receive p’s message, some other processes might not receive p’s message


floodSet: example 1

• f=2

[Figure: p1, p2 and p3 start with {3}, {7} and {5}. After each of the rounds r=1, r=2 and r=3, every process holds {3,5,7}. All three execute DECIDE(3).]


Synchronous round model: floodSet algorithm

Parameter f: maximum number of processes that can crash

State:

Wp : set of values, initially {vp} {vp is p’s initial value}

Round r

Sr:

send Wp to all processes

Tr:

forall q from which Wq received do Wp := Wp ∪ Wq

if r = f+1 then DECIDE (min (Wp))
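The floodSet rounds can be simulated in a few lines. This is a sketch under stated assumptions, not the course's implementation: the function name `flood_set` and the way crashes are described (a crash round plus the set of receivers still reached in that round) are choices made for the example.

```python
from typing import Dict, Optional, Set, Tuple


def flood_set(initial: Dict[str, int], f: int,
              crashes: Optional[Dict[str, Tuple[int, Set[str]]]] = None
              ) -> Dict[str, int]:
    """Simulate floodSet in the synchronous round model.

    initial: process -> initial value
    f:       maximum number of processes that can crash
    crashes: process -> (round in which it crashes,
                         receivers still reached in that round)
    Returns the decision of every process that did not crash.
    """
    crashes = crashes or {}
    W = {p: {v} for p, v in initial.items()}
    alive = set(initial)
    for r in range(1, f + 2):                      # rounds 1 .. f+1
        # Send step: snapshot of what each live process sends this round.
        sent = {p: set(W[p]) for p in alive}
        crashing = {p for p in alive if p in crashes and crashes[p][0] == r}
        alive -= crashing
        # Transition step: each surviving process unions what it received.
        for q in alive:
            for p, w in sent.items():
                if p in crashing and q not in crashes[p][1]:
                    continue                       # p crashed before reaching q
                W[q] |= w
    return {p: min(W[p]) for p in alive}
```

Usage, reproducing a failure-free run and a run with two crashes (f = 2, so three rounds):

```python
init = {"p1": 3, "p2": 7, "p3": 5}
print(flood_set(init, f=2))
print(flood_set(init, f=2, crashes={"p1": (1, {"p2"}), "p2": (2, set())}))
```

In the second run p1 reaches only p2 before crashing and p2 crashes before relaying, so the value 3 never reaches p3, which decides on the minimum of {5,7}.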


floodSet: example 2

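• f=2

[Figure: rounds r=1, r=2, r=3. p1 starts with {3} and crashes. p2 starts with {7}, reaches {3,5,7} and then crashes. p3 starts with {5}, holds {5,7} in the following rounds and executes DECIDE(5).]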


Proof

• Validity: if a process decides v, then v was proposed by some process (v is the initial value of some process)


• Agreement: Two correct processes cannot decide differently


FLP impossibility result

• Consensus is solvable in the synchronous system

• What about the asynchronous system model?

• Fischer-Lynch-Paterson (1985):

Consensus is not solvable in an asynchronous system with reliable channels if one single process may crash.


FLP impossibility result (2)

• What does not solvable mean?

• There exists no algorithm A such that in all runs of A compatible with the system model, consensus is solved

• This does not mean that A cannot solve consensus in any run


Discussion

• Asynchronous system:

– Too weak to solve consensus

• Synchronous system:

– Allows us to solve consensus

– Drawback: requires estimating the worst-case message transmission delay (e.g., must include possible retransmissions)

– This estimate has a direct impact on the crash detection time, and on the duration of the black-out period that follows a crash

• Question: is it possible to solve consensus without making mistakes in the crash detection?


Discussion (2)

1. Partially synchronous model (Dwork, Lynch, Stockmeyer, 1988):

– Model in between the synchronous model and the asynchronous model.

– The bounds Δ (message delay) and β (relative process speed) of the synchronous model:

(a) Exist but are unknown, or

(b) Are known but hold only from a time T on, called global stabilization time

2. Augmenting the asynchronous system with failure detectors (Chandra, Toueg, 1996)


Failure detectors

• Each FDi : maintains a list of suspected processes

• Each FDi can make a mistake by suspecting a process that has not crashed

• Each FDi can change its mind by removing a suspected process

• No agreement among FDi’s is required

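[Figure: processes p1, p2, p3, p4, each with its own failure detector FD1, FD2, FD3, FD4. The detectors output different suspect lists, e.g. {p2, p3}, {p1}, { }, and a list can change over time.]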
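The behaviour listed above (a local suspect list, possible mistakes, changing one's mind) can be illustrated with a timeout-based module. This is a hypothetical sketch: the slides do not say how a failure detector is implemented, and the heartbeat mechanism, the class name `LocalFailureDetector` and the timeout-doubling rule are all assumptions made for the example.

```python
from typing import Dict, Set


class LocalFailureDetector:
    """Suspect list of one process, driven by heartbeats (assumed mechanism)."""

    def __init__(self, peers: Set[str], timeout: float) -> None:
        self.timeout: Dict[str, float] = {q: timeout for q in peers}
        self.last_heard: Dict[str, float] = {q: 0.0 for q in peers}
        self.suspected: Set[str] = set()

    def on_heartbeat(self, q: str, now: float) -> None:
        """Record a heartbeat from q received at local time `now`."""
        self.last_heard[q] = now
        if q in self.suspected:
            # The suspicion was a mistake: change mind, and be more patient.
            self.suspected.discard(q)
            self.timeout[q] *= 2

    def tick(self, now: float) -> Set[str]:
        """Suspect every peer that has been silent for longer than its timeout."""
        for q, heard in self.last_heard.items():
            if now - heard > self.timeout[q]:
                self.suspected.add(q)
        return set(self.suspected)
```

Each process runs its own instance, so two detectors may well output different lists at the same time, which matches the remark that no agreement among the FDi's is required.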


Failure detectors (2)

• Without adding constraints on the output of the failure detectors, the new model is equivalent to the asynchronous model

• Two types of constraints on the output of failure detectors:

– Constraints related to crashed processes: completeness properties

– Constraints related to correct processes: accuracy properties

• A failure detector is defined by a pair (c, a):

– c: a completeness property

– a: an accuracy property


Completeness

• Strong completeness: Every process that crashes is eventually permanently suspected by every correct process

• Weak completeness: Every process that crashes is eventually permanently suspected by some correct process.


Accuracy

• Strong accuracy: No process is suspected before it crashes

• Weak accuracy: Some correct process is never suspected

• Eventual strong accuracy: There is a time after which correct processes are not suspected by any correct process

• Eventual weak accuracy: There is a time after which some correct process is never suspected


Failure detectors

• Perfect failure detector:

– Strong completeness, strong accuracy

– Notation: P

• Eventually perfect failure detector:

– Strong completeness, eventual strong accuracy

– Notation: ◇P

• Strong failure detector:

– Strong completeness, weak accuracy

– Notation: S

• Eventually strong failure detector:

– Strong completeness, eventual weak accuracy

– Notation: ◇S

• Eventually weak failure detector:

– Weak completeness, eventual weak accuracy

– Notation: ◇W


Solving consensus with ◇S

• Proposed by Chandra, Toueg (1996)

• Hypotheses:

– f < n/2

– Failure detector ◇S, i.e.:

• Eventual weak accuracy: There is a time after which some correct process is no longer suspected by any correct process

• Strong completeness: Every process that crashes is eventually permanently suspected by every correct process


Solving consensus with ◇S (2)

Basic idea:

• Process p1 tries to impose its initial value as the decision

• How many acks should p1 wait for? A majority, i.e., ⌈(n+1)/2⌉

[Figure: p1 sends v1 to all processes, collects acks, then executes decide (v1)]
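A quick worked check of the majority size, as a sketch (the function name `majority` is an assumption). The key fact used later is that any two majorities of the same n processes share at least one process, because two sets of that size together contain more than n members.

```python
import math


def majority(n: int) -> int:
    """Smallest number of processes that forms a majority of n."""
    return math.ceil((n + 1) / 2)


# Two majorities always overlap: 2 * majority(n) > n for every n >= 1.
assert all(2 * majority(n) > n for n in range(1, 100))
```

For n = 3 a majority is 2 processes, for n = 4 it is 3, and for n = 5 it is 3. With f < n/2 crashes, a majority of processes always remains to send acks.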



• What if p1 crashes ?

• Process p2 takes over the role of p1

• Can p2 ignore what p1 has done previously?

• What is the problem?

[Figure: p2 sends v2 to all processes, collects acks, then executes decide (v2)]



• If some process has decided v1, then p2 must ignore v2 and must try to impose v1 as the decision

• p2 must be able to discover that v1 might have been decided

• pi: if v1 received from p1 then send v1 to p2

• p2: if v1 received then x = v1 else x = v2

[Figure: p2 sends x to all processes, collects acks, then decides (shown as decide (v2))]



• If p2 does not succeed, then p3 takes over

• If p3 does not succeed, then p4 takes over

• …

• If pn does not succeed, then p1 takes over

• …

• This is called: rotating coordinator



• Rotating coordinator

• pi is the new coordinator: what value should pi choose?

• The values sent are time-stamped with round numbers; the value with the largest time-stamp is chosen

[Figure: pi receives value vx from px and value vy from py]
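The two rules above, rotation of the coordinator and choice of the value with the largest time-stamp, can be sketched as follows. The function names, the numbering of rounds from 1, and the (value, timestamp) pair representation are assumptions made for illustration; this is only a fragment, not the full Chandra-Toueg algorithm.

```python
from typing import Hashable, List, Tuple


def coordinator(round_no: int, n: int) -> int:
    """Index i of the coordinator p_i of a round: p1, p2, ..., pn, p1, ..."""
    return (round_no - 1) % n + 1


def choose_value(estimates: List[Tuple[Hashable, int]]) -> Hashable:
    """Pick the value carrying the largest time-stamp among those received.

    Each estimate is a pair (value, round number in which it was adopted).
    On equal time-stamps the first estimate in the list is returned.
    """
    value, _timestamp = max(estimates, key=lambda e: e[1])
    return value
```

If a value might already have been decided in an earlier round, a majority of processes holds it with that round's time-stamp, so it shows up in any majority of estimates the new coordinator collects.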



[Figure: one round of the algorithm, led by its coordinator and divided into phase 1, phase 2, phase 3 and phase 4]


2.6 Atomic broadcast in the crash-stop model


• Reliable broadcast (specification)

• Atomic broadcast (specification)

• Reliable broadcast (implementation)

• Atomic broadcast (implementation)


Reliable broadcast

• Unreliable broadcast of message m to group g

– If the sender is correct, then every correct process in g eventually receives m

– If the sender crashes, then some correct processes in g might receive m, and others not.

• We may want stronger guarantees ⇒ reliable broadcast


Reliable broadcast (2)

• Defined by the primitives rbcast and rdeliver

• Convention:

– g dropped

– sender is member of g

[Figure: layer stack. Replication technique on top, Group communication in the middle (rbcast (g, m), rdeliver (m)), Transport layer at the bottom (send (m) to p, receive (m))]


Reliable broadcast (3)

Rbcast and rdeliver satisfy the following properties:

• Validity: If a correct process executes rbcast(m), then it eventually rdelivers m.

• Agreement: If a correct process rdelivers m, then all correct processes eventually rdeliver m.

• Integrity: For any message m, every correct process rdelivers m at most once, and only if m was previously rbcast.


Uniform reliable broadcast

Uniform reliable broadcast: agreement is replaced by uniform agreement

• Uniform agreement: If a process (correct or faulty) rdelivers m, then all correct processes eventually rdeliver m.


Atomic broadcast

• Uniform reliable broadcast plus the following property:

• Uniform total order: If some process (correct or faulty) adelivers m before m’, then every process adelivers m’ only after having adelivered m.

NB: Strictly speaking, this should be called uniform atomic broadcast. To simplify, the name atomic broadcast is often used.


Solving reliable broadcast

• Can be solved in an asynchronous system with quasi-reliable channels for f < n

To rbcast(m):

send(m) to all processes

Upon reception of m for the first time do

if pi ≠ sender(m) then send(m) to all processes

rdeliver(m)
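The relay rule can be exercised with a small single-threaded simulation. This is a sketch under assumptions: the function name `rbcast_run`, the FIFO list standing in for the network, and the `sender_reaches` argument that models a sender crashing part-way through its initial send are all introduced for the example.

```python
from typing import Dict, List, Optional, Set, Tuple


def rbcast_run(processes: List[str], sender: str, m: str,
               sender_reaches: Optional[Set[str]] = None
               ) -> Dict[str, List[str]]:
    """Simulate rbcast(m) and return process -> list of rdelivered messages.

    If sender_reaches is given, the sender crashes after its send(m) has
    reached only those processes; otherwise nobody crashes.
    """
    crashed = set() if sender_reaches is None else {sender}
    targets = set(processes) if sender_reaches is None else set(sender_reaches)
    in_transit: List[Tuple[str, str]] = [(q, m) for q in processes if q in targets]
    seen: Dict[str, Set[str]] = {p: set() for p in processes}
    delivered: Dict[str, List[str]] = {p: [] for p in processes}
    while in_transit:
        q, msg = in_transit.pop(0)
        if q in crashed or msg in seen[q]:
            continue                    # crashed, or not the first reception
        seen[q].add(msg)
        if q != sender:                 # relay to all before delivering
            in_transit.extend((r, msg) for r in processes)
        delivered[q].append(msg)
    return delivered
```

If the sender crashes after reaching a single process, that process relays m before it rdelivers, so every other correct process still rdelivers m. If the sender reaches nobody, no process rdelivers m, which also satisfies agreement.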


Solving atomic broadcast

• Atomic broadcast also subject to the FLP impossibility result:

– shown by contradiction: if atomic broadcast solvable, then consensus also solvable

• We will show that if consensus solvable, then atomic broadcast also solvable

• Consensus and atomic broadcast are equivalent problems


Solving atomic broadcast (3)

[Figure: processes execute abcast(m1), abcast(m2), abcast(m3) and abcast(m4); a sequence of consensus instances orders the messages, which are then adelivered (adeliver(m1), adeliver(m2), adeliver(m3), adeliver(m4))]



Principle of the algorithm:

• Sequence of instances of consensus (numbered 1, 2, …)

• Each consensus on a set of messages

• Initial value for each consensus: set of messages

• Let msg_k be the set of messages decided by consensus #k:

– The messages in msg_k are adelivered before the messages in msg_(k+1)

– The messages in msg_k are adelivered in some deterministic order (e.g., according to their IDs)



Initialization

ki := 0; adeliveredi := ∅; rdeliveredi := ∅

To abcast(m):

rbcast(m)

Upon rdeliver(m) do

rdeliveredi := rdeliveredi ∪ {m}

Upon rdeliveredi ⊈ adeliveredi do {some rdelivered message is not yet adelivered}

ki := ki + 1

aUndelivered := rdeliveredi − adeliveredi

propose(ki , aUndelivered)

wait until decide (ki , msg_ki)

adeliver_ki := msg_ki − adeliveredi

adeliver the messages in adeliver_ki in some deterministic order

adeliveredi := adeliveredi ∪ adeliver_ki
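The reduction of atomic broadcast to consensus can be sketched in a few lines. Everything named here is an assumption made for illustration: the class `AbcastProcess`, the method `step`, and in particular `first_proposal_wins`, which is only a single-machine stand-in for a real consensus algorithm (instance k decides the first value proposed to it). Messages are represented by their IDs as strings, and the deterministic order is the sorted order of IDs.

```python
from typing import Callable, Dict, FrozenSet, List, Set

Consensus = Callable[[int, FrozenSet[str]], FrozenSet[str]]


def first_proposal_wins() -> Consensus:
    """Stand-in for consensus: instance k decides the first value proposed."""
    decided: Dict[int, FrozenSet[str]] = {}

    def propose(k: int, value: FrozenSet[str]) -> FrozenSet[str]:
        return decided.setdefault(k, value)

    return propose


class AbcastProcess:
    def __init__(self, consensus: Consensus) -> None:
        self.k = 0
        self.rdelivered: Set[str] = set()
        self.adelivered: Set[str] = set()
        self.order: List[str] = []      # sequence of adelivered messages
        self.consensus = consensus

    def rdeliver(self, m: str) -> None:
        self.rdelivered.add(m)

    def step(self) -> None:
        """Run consensus instances while some rdelivered message is pending."""
        while not self.rdelivered <= self.adelivered:
            self.k += 1
            undelivered = frozenset(self.rdelivered - self.adelivered)
            msg_k = self.consensus(self.k, undelivered)
            batch = sorted(msg_k - self.adelivered)   # deterministic order
            self.order.extend(batch)                  # adeliver the batch
            self.adelivered |= set(batch)
```

Because every process goes through the same instances 1, 2, … and each instance yields the same decided set everywhere, all processes build the same `order`, even when they rdeliver the messages in different orders.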


Quorum systems vs. group communication

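[Figure: a server with inc/dec operations, replicated on s1, s2 and s3 and accessed by a client c. Left panel, "With group communication": c sends inc/dec to the servers. Right panel, "With quorum systems": c performs a read followed by a write, under mutual exclusion.]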

Quorum systems vs. group communication (2)

• Solution based on quorum systems

– majority of correct servers

– mutual exclusion

– perfect failure detector

• Solution based on group communication

– majority of correct servers

– ◇S failure detector
