SysRép / 2.5 A. Schiper Eté 2007 1
2.5 The consensus problem
Motivation
• Implementation of atomic broadcast and other group communication primitives in the presence of failures is a difficult problem
• Consensus: problem that is the common denominator for the implementation of the various group communication primitives
• Model: static groups, crash-stop
Definitions
Processes:
• correct process: process that does not crash during its whole execution
• faulty process: process that is not correct
Definitions (2)
Channels:
• Reliable channel: if p executes send (m) to q and q is correct, then q eventually receives m
• Quasi-reliable channel: if p executes send (m) to q and p, q are correct, then q eventually receives m
Specification of consensus
Informal:
• n processes: p1, …, pn
• Each process pi has an initial value vi
• Processes must agree on a common value that is the initial value of one of the processes
[Figure: three processes with initial values 4, 7 and 1; all three agree on 7]
Specification of consensus (2)
Formal
• Consensus defined by two primitives:
– propose(v): primitive by which a process proposes its initial value v
– decide(v): primitive by which a process decides the value v
[Figure: three processes execute propose(4), propose(7) and propose(1); all three execute decide(7)]
Specification of consensus (3)
Propose and decide must satisfy the following properties:
• Validity: if a process decides v, then v was proposed by some process (v is the initial value of some process)
• Agreement: Two correct processes cannot decide differently
• Termination: Every correct process eventually decides
Specification of consensus (4)
Uniform consensus:
• Validity: if a process decides v, then v was proposed by some process (is the initial value of some process)
• Uniform agreement: Two processes (correct or faulty) cannot decide differently
• Termination: Every correct process eventually decides
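The consensus properties can be checked mechanically on the record of a finished run. The following is a minimal sketch, not part of the original slides: the `ProcessRun` record and the four function names are hypothetical, and termination is only approximated, since a finite record cannot express "eventually".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProcessRun:
    proposed: int            # value passed to propose(v)
    decided: Optional[int]   # value passed to decide(v), None if the process never decided
    correct: bool            # True if the process never crashed in this run


def validity(run: List[ProcessRun]) -> bool:
    """Every decided value was proposed by some process."""
    proposals = {p.proposed for p in run}
    return all(p.decided in proposals for p in run if p.decided is not None)


def agreement(run: List[ProcessRun]) -> bool:
    """Correct processes that decided all decided the same value."""
    return len({p.decided for p in run if p.correct and p.decided is not None}) <= 1


def uniform_agreement(run: List[ProcessRun]) -> bool:
    """All processes that decided (correct or faulty) decided the same value."""
    return len({p.decided for p in run if p.decided is not None}) <= 1


def termination(run: List[ProcessRun]) -> bool:
    """Every correct process has decided by the end of the recorded run."""
    return all(p.decided is not None for p in run if p.correct)


# Run from the slides: proposals 4, 7, 1 and three decisions of 7.
slide_run = [ProcessRun(4, 7, True), ProcessRun(7, 7, True), ProcessRun(1, 7, True)]
# Hypothetical run: a faulty process decides 4 and crashes, the correct ones decide 7.
faulty_decider_run = [ProcessRun(4, 4, False), ProcessRun(7, 7, True), ProcessRun(1, 7, True)]
```

The second run satisfies agreement but violates uniform agreement, which is exactly the difference between the two specifications.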
Solving consensus
• Consensus is easy to solve if processes do not crash and if channels are reliable
• Otherwise not so easy …
• Solvability of consensus depends on the system model (which defines the assumptions about processes and channels)
System models: synchronous system
• Bound on message delay: there is a known bound Δ such that if message m is sent by process p to process q at time t, then q receives the message no later than at time t + Δ.
• Bound on relative speed of processes: there is a known bound β such that if the fastest process takes x time units to do some computation, then the slowest process does not take more than βx time units to do the same computation
System models: synchronous system (2)
A synchronous system allows accurate failure detection
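• Handling of “are you alive”: x time units for the fastest process, hence at most βx time units for the slowest process (β: bound on relative process speed)
• Timeout of p: 2Δ + βx (Δ: bound on message delay)
[Figure: p sends “are you alive” to q and q answers “yes”; p waits at most 2Δ + βx for the answer]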
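A small worked example of this timeout, as a sketch under stated assumptions: `delta` stands for the bound on message delay and `beta` for the bound on relative process speed (both names are chosen here for illustration), and `x` is the time the fastest process needs to handle the request.

```python
def ping_timeout(delta: float, beta: float, x: float) -> float:
    """Worst-case wait for the reply: request delay + slowest handling + reply delay."""
    return delta + beta * x + delta


def crashed(waited: float, reply_received: bool,
            delta: float, beta: float, x: float) -> bool:
    """In a synchronous system, silence beyond the timeout proves that q has crashed."""
    return not reply_received and waited > ping_timeout(delta, beta, x)


# Hypothetical numbers: 10 ms message delay bound, slowest process twice as slow,
# 3 ms to handle the request on the fastest process.
timeout_ms = ping_timeout(delta=10.0, beta=2.0, x=3.0)
```

With these numbers the timeout is 10 + 2·3 + 10 = 26 ms; a reply that is still missing after that cannot be a mere delay.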
System models: asynchronous system
• No bound on message delay
• No bound on process relative speed
Consequence: it is not possible to know whether a process has crashed or is merely slow
Synchronous round model
• First goal: solve consensus in the synchronous model
• As is often done, we express the consensus algorithm in a computation model composed of rounds, which can be implemented in the synchronous model
• Name: synchronous round model
Synchronous round model
In every round r, each process p:
• Sends a message to all processes
• Receives the messages sent in round r
• Does some local computation
[Figure: during round r, process p moves from state st to state st’]
Synchronous round model (2)
In every round r, each process p:
• …
• Receives the messages sent in round r
• …
• If p does not crash in round r: all processes that do not crash in round r (or before) receive p’s message
• If p crashes in round r: some processes might receive p’s message, and others might not
floodSet: example 1
• f=2
[Figure: p1, p2 and p3 start with the sets {3}, {7} and {5}. No process crashes: after each of the rounds r=1, r=2 and r=3 every process holds {3,5,7}, and at the end of round 3 all three execute DECIDE(3)]
Synchronous round model: floodSet algorithm
Parameter f: maximum number of processes that can crash
State:
    Wp : set of values, initially {vp}    {vp is p’s initial value}
Round r
    Sr:
        send Wp to all processes
    Tr:
        forall q from which Wq received do Wp ← Wp ∪ Wq
        if r = f+1 then DECIDE (min (Wp))
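The floodSet algorithm can be tried out with a small simulation of the synchronous round model. This is a sketch under assumptions, not the course's implementation: the function name `flood_set` and the `crashes` argument (which says in which round a process crashes and which processes still receive its last message) are hypothetical.

```python
def flood_set(initial, f, crashes=None):
    """Simulate floodSet for f+1 rounds.

    initial: dict process -> initial value
    f:       maximum number of processes that can crash
    crashes: dict process -> (round, set of processes that still receive
             the crashing process's message in that round)
    Returns dict process -> decision, for the processes that never crash.
    """
    crashes = crashes or {}
    W = {p: {v} for p, v in initial.items()}
    alive = set(initial)
    for r in range(1, f + 2):
        # Sending step: every process still alive sends its current set W.
        sent = {p: set(W[p]) for p in alive}
        crashing = {p for p in alive if p in crashes and crashes[p][0] == r}
        # Transition step: processes that survive round r merge what they received.
        for p in alive - crashing:
            for q, w_q in sent.items():
                if q in crashing and p not in crashes[q][1]:
                    continue  # q crashed before its message reached p
                W[p] |= w_q
        alive -= crashing
    # End of round f+1: every surviving process decides the minimum of its set.
    return {p: min(W[p]) for p in alive}


values = {"p1": 3, "p2": 7, "p3": 5}
no_crash = flood_set(values, f=2)
# p1 crashes in round 1 and reaches only p2; p2 crashes in round 2 and reaches nobody.
two_crashes = flood_set(values, f=2, crashes={"p1": (1, {"p2"}), "p2": (2, set())})
```

With no crash all three processes decide 3. In the second run the value 3 never reaches p3, which ends with {5,7} and decides 5; agreement still holds because p3 is the only process left.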
floodSet: example 2
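• f=2
[Figure: p1, p2 and p3 start with the sets {3}, {7} and {5}. Two processes crash during the execution, one of them after reaching {3,5,7}. The surviving process holds {5}, then {5,7}, then {5,7}: the value 3 never reaches it, and at the end of round 3 it executes DECIDE(5)]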
Proof
• Validity: if a process decides v, then v was proposed by some process (v is the initial value of some process)
• Termination: Every correct process eventually decides
• Agreement: Two correct processes cannot decide differently
FLP impossibility result
• Consensus is solvable in the synchronous system
• What about the asynchronous system model?
• Fischer-Lynch-Paterson (1985):
Consensus is not solvable in an asynchronous system with reliable channels if one single process may crash.
FLP impossibility result (2)
• What does not solvable mean?
• There exists no algorithm A such that consensus is solved in all runs of A compatible with the system model
• This does not mean that consensus is never solved: an algorithm may solve consensus in some runs, but no algorithm solves it in all runs
Discussion
• Asynchronous system:
– Too weak to solve consensus
• Synchronous system:
– Allows us to solve consensus
– Drawback: requires estimating the worst-case message transmission delay (e.g., it must include possible retransmissions)
– This estimate has a direct impact on the crash detection time, and on the duration of the black-out period that follows a crash
• Question: is it possible to solve consensus even if crash detection makes mistakes?
Discussion (2)
1. Partially synchronous model (Dwork, Lynch, Stockmeyer, 1988):
– Model in between the synchronous model and the asynchronous model.
– The bounds of the synchronous model (Δ on message delay, β on relative process speed):
    1. Exist but are unknown, or
    2. Are known but hold only from a time T on, called global stabilization time
2. Augmenting the asynchronous system with failure detectors (Chandra, Toueg, 1996)
Failure detectors
• Each FDi maintains a list of suspected processes
• Each FDi can make a mistake by suspecting a process that has not crashed
• Each FDi can change its mind by removing a process from its list of suspected processes
• No agreement among FDi’s is required
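[Figure: four processes p1, p2, p3 and p4, each with its local failure detector FD1, FD2, FD3 and FD4. The lists of suspected processes shown are {p2, p3}, {p1}, { }, {p2, p3} and {p2}: the detectors do not have to agree with one another]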
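To make the behaviour of a local failure detector concrete, here is a minimal sketch of one common implementation style, based on heartbeats and timeouts. It is an assumption for illustration, not something described in the slides: the class name, the method names and the timeout-doubling policy are all hypothetical.

```python
class HeartbeatFailureDetector:
    """Local module FDi: suspects a process when no heartbeat arrived in time."""

    def __init__(self, processes, initial_timeout=1.0):
        self.timeout = {q: initial_timeout for q in processes}
        self.last_heard = {q: 0.0 for q in processes}
        self.suspected = set()

    def on_heartbeat(self, q, now):
        """Called when a heartbeat from q is received at time `now`."""
        self.last_heard[q] = now
        if q in self.suspected:
            # The suspicion was a mistake: change our mind and be more patient.
            self.suspected.discard(q)
            self.timeout[q] *= 2

    def check(self, now):
        """Suspect every process that has been silent for longer than its timeout."""
        for q, heard in self.last_heard.items():
            if now - heard > self.timeout[q]:
                self.suspected.add(q)
        return set(self.suspected)


fd = HeartbeatFailureDetector(["p2", "p3"])
first = fd.check(now=1.5)        # both processes have been silent for too long
fd.on_heartbeat("p2", now=2.0)   # p2 was only slow: the suspicion is withdrawn
second = fd.check(now=2.5)
```

The first check suspects p2 wrongly; the late heartbeat removes it from the list, so `second` contains only p3. Each process runs its own instance, which is why different detectors can output different lists.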
Failure detectors (2)
• Without adding constraints on the output of the failure detectors, the new model is equivalent to the asynchronous model
• Two types of constraints on the output of failure detectors:
– Constraints related to crashed processes: completeness properties
– Constraints related to correct processes: accuracy properties
• A failure detector is defined by a pair (c, a):
– c: a completeness property
– a: an accuracy property
Completeness
• Strong completeness: Every process that crashes is eventually permanently suspected by every correct process
• Weak completeness: Every process that crashes is eventually permanently suspected by some correct process.
Accuracy
• Strong accuracy: No process is suspected before it crashes
• Weak accuracy: Some correct process is never suspected
• Eventual strong accuracy: There is a time after which correct processes are not suspected by any correct process
• Eventual weak accuracy: There is a time after which some correct process is never suspected by any correct process
Failure detectors
• Perfect failure detector:
– Strong completeness, strong accuracy
– Notation: P
• Eventually perfect failure detector:
– Strong completeness, eventual strong accuracy
– Notation: ◇P
• Strong failure detector:
– Strong completeness, weak accuracy
– Notation: S
• Eventually strong failure detector:
– Strong completeness, eventual weak accuracy
– Notation: ◇S
• Eventually weak failure detector:
– Weak completeness, eventual weak accuracy
– Notation: ◇W
Solving consensus with ◇S
• Proposed by Chandra, Toueg (1996)
• Hyp:
– f < n/2
– ◇S, i.e.:
    • Eventual weak accuracy: There is a time after which some correct process is no longer suspected by any correct process
    • Strong completeness: Every process that crashes is eventually permanently suspected by every correct process
Solving consensus with ◇S (2)
Basic idea:
• Process p1 tries to impose its initial value as the decision
• How many acks should p1 wait for? A majority, i.e., ⌈(n+1)/2⌉
[Figure: p1 sends v1 to all processes, collects the acks, then executes decide(v1)]
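A short sketch of why a majority is the natural threshold: any two majorities of the same n processes have at least one process in common, so whoever later hears from a majority also hears from at least one process that acked v1. The function names below are hypothetical, and the brute-force check is only meant for small n.

```python
import math
from itertools import combinations


def majority(n: int) -> int:
    """Smallest number of processes that forms a majority among n."""
    return math.ceil((n + 1) / 2)


def majorities_intersect(n: int) -> bool:
    """Check by brute force that any two majorities share at least one process."""
    quorums = [set(q) for q in combinations(range(n), majority(n))]
    return all(a & b for a in quorums for b in quorums)


def enough_correct_processes(n: int, f: int) -> bool:
    """With at most f crashes, can a coordinator still collect a majority of acks?"""
    return n - f >= majority(n)
```

For n = 5 the majority is 3, so the hypothesis f < n/2 (at most 2 crashes) leaves exactly enough correct processes to answer.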
Solving consensus with ◇S (3)
• What if p1 crashes ?
• Process p2 takes over the role of p1
• Can p2 ignore what p1 has done previously?
• What is the problem?
[Figure: p2 sends v2 to all processes, collects the acks, then executes decide(v2)]
Solving consensus with ◇S (4)
• If some process has decided v1, then p2 must ignore v2 and must try to impose v1 as the decision
• p2 must be able to discover that v1 might have been decided
[Figure: every process pi that received v1 from p1 sends v1 to p2. p2 sets x = v1 if it received v1, and x = v2 otherwise. p2 then sends x to all processes, collects the acks, and decides x]
Solving consensus with ◇S (5)
• If p2 does not succeed, then p3 takes over
• If p3 does not succeed, then p4 takes over
• …
• If pn does not succeed, then p1 takes over again
• …
• This is called: rotating coordinator
Solving consensus with ◇S (6)
• Rotating coordinator
• pi is the new coordinator: what value should pi choose?
• The values sent are time-stamped with round numbers; the value with the largest time-stamp is chosen
[Figure: pi receives value vx from px and value vy from py, and must choose one of them]
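The selection rule of the new coordinator can be sketched in a few lines. The names `Estimate`, `choose` and `coordinator` are hypothetical, and the convention that a time-stamp of 0 means "never imposed by a coordinator" is an assumption made for this example.

```python
from typing import List, NamedTuple


class Estimate(NamedTuple):
    value: int
    ts: int  # round number in which the value was last adopted (0 = initial value)


def choose(estimates: List[Estimate]) -> int:
    """Value the new coordinator tries to impose: the one with the largest time-stamp."""
    return max(estimates, key=lambda e: e.ts).value


def coordinator(round_number: int, n: int) -> int:
    """Rotating coordinator: p1 in round 1, ..., pn in round n, then p1 again."""
    return (round_number - 1) % n + 1


# Hypothetical estimates received by the coordinator of round 3.
received = [Estimate(value=4, ts=0), Estimate(value=7, ts=2), Estimate(value=1, ts=1)]
chosen = choose(received)
```

Here the coordinator picks 7, the value adopted most recently (in round 2), rather than its own initial value. This is how a value that might already have been decided is carried over to later rounds.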
Solving consensus with ◇S (7)
[Figure: one round of the algorithm, driven by the coordinator (coord) and divided into phase 1, phase 2, phase 3 and phase 4]
2.6 Atomic broadcast in the crash-stop model
• Reliable broadcast (specification)
• Atomic broadcast (specification)
• Reliable broadcast (implementation)
• Atomic broadcast (implementation)
Reliable broadcast
• Unreliable broadcast of message m to group g:
– If the sender is correct, then every correct process in g eventually receives m
– If the sender crashes, then some correct processes in g might receive m, and others not
• We may want stronger guarantees ⇒ reliable broadcast
Reliable broadcast (2)
• Defined by the primitives rbcast and rdeliver
• Convention:
– g dropped
– sender is member of g
[Figure: layered architecture. The replication technique calls rbcast(g, m) and receives rdeliver(m) from the group communication layer, which itself uses send(m) to p and receive(m) of the transport layer]
Reliable broadcast (3)
Rbcast and rdeliver satisfy the following properties:
• Validity: If a correct process executes rbcast(m), then it eventually rdelivers m.
• Agreement: If a correct process rdelivers m, then all correct processes eventually rdeliver m.
• Integrity: For any message m, every correct process rdelivers m at most once, and only if m was previously rbcast.
Uniform reliable broadcast
Uniform reliable broadcast: agreement is replaced by uniform agreement
• Uniform agreement: If a process (correct or faulty) rdelivers m, then all correct processes eventually rdeliver m.
Atomic broadcast
• Uniform reliable broadcast plus the following property:
• Uniform total order: If some process (correct or faulty) adelivers m before m’, then every process adelivers m’ only after having adelivered m.
NB: Strictly speaking this should be called uniform atomic broadcast. To simplify, the term atomic broadcast is often used.
Solving reliable broadcast
• Can be solved in an asynchronous system with quasi-reliable channels for f < n
To rbcast(m):
    send(m) to all processes
Upon reception of m for the first time do
    if pi ≠ sender(m) then send(m) to all processes
    rdeliver(m)
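The relay step is what gives agreement when the sender crashes in the middle of its broadcast. The simulation below is a sketch for a single message under simplifying assumptions: only the sender may crash, channels deliver every copy that was actually sent, and the function name `simulate_rbcast` and its `reached_before_crash` argument are hypothetical.

```python
from collections import deque


def simulate_rbcast(processes, sender, reached_before_crash=None):
    """Return the set of processes that rdeliver the message.

    reached_before_crash: None if the sender is correct; otherwise the set of
    processes that the sender's send(m) reached before the sender crashed.
    """
    sender_crashes = reached_before_crash is not None
    first_targets = reached_before_crash if sender_crashes else processes
    in_transit = deque(first_targets)   # destinations of copies of m still in transit
    rdelivered = set()
    while in_transit:
        p = in_transit.popleft()
        if p in rdelivered or (sender_crashes and p == sender):
            continue                    # duplicate copy, or destination has crashed
        if p != sender:
            in_transit.extend(processes)  # first reception: relay m to all processes
        rdelivered.add(p)               # rdeliver(m)
    return rdelivered


group = ["p1", "p2", "p3", "p4"]
all_deliver = simulate_rbcast(group, "p1")
partial_send = simulate_rbcast(group, "p1", reached_before_crash={"p3"})
nobody = simulate_rbcast(group, "p1", reached_before_crash=set())
```

When the crashing sender reaches only p3, the relay by p3 makes p2 and p4 rdeliver the message as well. When it reaches nobody, no correct process rdelivers it, which is also allowed by the specification.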
Solving atomic broadcast
• Atomic broadcast is also subject to the FLP impossibility result:
– Shown by contradiction: if atomic broadcast were solvable, then consensus would also be solvable
• We will show that if consensus solvable, then atomic broadcast also solvable
• Consensus and atomic broadcast are equivalent problems
Solving atomic broadcast (3)
[Figure: the messages m1, m2, m3 and m4 are abcast by different processes. Three successive consensus instances fix the delivery order; the figure shows adeliver(m4), adeliver(m2), adeliver(m1) and adeliver(m3)]
Solving atomic broadcast (4)
Principle of the algorithm:
• Sequence of instances of consensus (numbered 1, 2, …)
• Each consensus instance decides on a set of messages
• Initial value for each consensus: set of messages
• Let msg^k be the set of messages decided by consensus #k:
– The messages in msg^k are adelivered before the messages in msg^(k+1)
– The messages in msg^k are adelivered in some deterministic order (e.g., according to their IDs)
Solving atomic broadcast (5)
Initialization
    ki := 0; adeliveredi := ∅; rdeliveredi := ∅
To abcast(m):
    rbcast(m)
Upon rdeliver(m) do
    rdeliveredi := rdeliveredi ∪ {m}
Upon rdeliveredi ⊄ adeliveredi do
    ki := ki + 1
    aUndelivered := rdeliveredi − adeliveredi
    propose(ki, aUndelivered)
    wait until decide(ki, msg^ki)
    adeliver^ki := msg^ki − adeliveredi
    adeliver the messages in adeliver^ki in some deterministic order
    adeliveredi := adeliveredi ∪ adeliver^ki
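The reduction of atomic broadcast to consensus can be exercised with a toy simulation. This is a sketch under strong assumptions: consensus is replaced by a hypothetical `ConsensusOracle` that decides the first value proposed for each instance, reliable broadcast is replaced by direct calls to `on_rdeliver`, and nothing crashes. The class and method names are illustrative only.

```python
class ConsensusOracle:
    """Stand-in for consensus: instance k decides the first value proposed for k."""

    def __init__(self):
        self.decisions = {}

    def propose(self, k, value):
        return self.decisions.setdefault(k, frozenset(value))


class Process:
    def __init__(self, oracle):
        self.oracle = oracle
        self.k = 0
        self.rdelivered = set()
        self.adelivered = set()
        self.log = []  # messages in the order in which they were adelivered

    def on_rdeliver(self, m):
        self.rdelivered.add(m)

    def step(self):
        """Run consensus instances while some rdelivered message is not yet adelivered."""
        while not self.rdelivered <= self.adelivered:
            self.k += 1
            undelivered = self.rdelivered - self.adelivered
            decided = self.oracle.propose(self.k, undelivered)
            batch = decided - self.adelivered
            self.log.extend(sorted(batch))  # deterministic order: by message ID
            self.adelivered |= batch


oracle = ConsensusOracle()
p1, p2 = Process(oracle), Process(oracle)
p1.on_rdeliver("m1")
p1.step()                 # consensus #1 decides {m1}
p2.on_rdeliver("m2")      # p2 receives the messages in the opposite order
p2.on_rdeliver("m1")
p2.step()                 # p2 adopts decision #1, then consensus #2 decides {m2}
p1.on_rdeliver("m2")
p1.step()
```

Although p1 and p2 rdeliver m1 and m2 in different orders, both end with the same adelivery log, because the order is fixed by the sequence of consensus decisions and not by arrival order.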
Quorum systems vs. group communication
[Figure: a server with inc/dec operations is replicated on s1, s2 and s3 and accessed by a client c. With group communication, c sends the inc/dec operations to the replicas. With quorum systems, c performs read and write operations on the replicas under mutual exclusion]
Quorum systems vs. group communication (2)
• Solution based on quorum systems:
– majority of correct servers
– mutual exclusion
– perfect failure detector
• Solution based on group communication:
– majority of correct servers
– ◇S failure detector