introduction into fault-tolerant distributed algorithms and their modeling (part 2)

Model Checking of Fault-Tolerant Distributed Algorithms Part II: Modeling Fault-tolerant Distributed Algorithms Annu Gmeiner Igor Konnov Ulrich Schmid Helmut Veith Josef Widder TMPA 2014, Kostroma, Russia Josef Widder (www.forsyte.at) Checking Fault-Tolerant Distributed Algos TMPA’14, Nov. 2014 1 / 64

Upload: iosif-itkin

Post on 02-Jul-2015


DESCRIPTION

Josef Widder, Vienna University of Technology

TRANSCRIPT

Page 1: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Model Checking of Fault-Tolerant Distributed Algorithms

Part II: Modeling Fault-tolerant Distributed Algorithms

Annu Gmeiner, Igor Konnov, Ulrich Schmid, Helmut Veith, Josef Widder

TMPA 2014, Kostroma, Russia

Josef Widder (www.forsyte.at) Checking Fault-Tolerant Distributed Algos TMPA’14, Nov. 2014 1 / 64

Page 2: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Why Modeling and Verification?

Let’s have a look at the recent technical report by amazon.com [1]:

“We have found that the standard verification techniques in industry are necessary but not sufficient. We use deep design reviews, code reviews, static code analysis, stress testing, fault-injection testing [...], but we still find that subtle bugs can hide in complex concurrent fault-tolerant systems...”

“... We have found that testing the code is inadequate as a method to find subtle errors in design, as the number of reachable states of the code is astronomical.”

[1] C. Newcombe et al. Use of Formal Methods at Amazon Web Services, 2014

Page 3: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Why Modeling and Verification? (cont.)

“... the final executable code is unambiguous, but contains an overwhelming amount of detail.”

“We needed to be able to capture the essence of a design in a few hundred lines of precise description.”

“Engineers naturally focus on designing the ‘happy case’ for a system, i.e. the processing path in which no errors occur.”

“... the shortest error trace exhibiting the bug contained 35 high level steps. The improbability of such compound events is not a defense against such bugs; historically, AWS has observed many combinations of events at least as complicated as those that could trigger this bug.”

Page 4: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

What are we doing?

Our approach:

  A high-level description of a design, which is precise.
  A sound verification method, as complete as possible.

More specifically:

  a modeling approach suitable for model checking
  an automatic parameterized verification method
  targeting fault-tolerant distributed algorithms

(We are not interested in verifying mathematical toy examples)


Page 6: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Why Model Checking?

  an alternative proof approach
  useful counter-examples
  ability to define and vary assumptions about the system and see why it breaks
  closer to code level
  good degree of automation

Page 7: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Distributed Algorithms: Model Checking Challenges

  unbounded data types
    unbounded number of rounds (round numbers part of messages)
  parameterization in multiple parameters
    among n processes, f ≤ t are faulty, with n > 3t
  contrast to concurrent programs
    diverse fault models (adverse environments)
  continuous time
    fault-tolerant clock synchronization
  degrees of concurrency: synchronous, asynchronous, partially synchronous
    a process makes at most 5 steps between 2 steps of any other process

Page 8: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Fault-tolerant distributed algorithms

[Figure: n processes, at most t of which might be faulty and f of which are actually faulty]

  n processes communicate by messages
  all processes know that at most t of them might be faulty
  f are actually faulty


Page 11: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Challenge #1: fault models

  clean crashes: least severe
    faulty processes prematurely halt after/before “send to all”
  crash faults:
    faulty processes prematurely halt (also) in the middle of “send to all”
  omission faults:
    faulty processes follow the algorithm, but some messages sent by them might be lost
  symmetric faults:
    faulty processes send arbitrarily to all or nobody
  Byzantine faults: most severe
    faulty processes can do anything
    encompass all behaviors of above models

Page 12: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Challenges #2 & #3: Pseudo-code and Communication

Translate pseudo-code to a formal description
  that allows us to verify the algorithm
  and does not oversimplify the original algorithm.

Assumptions about the communication medium
  are usually written in plain English,
  spread across research papers,
  constitute folklore knowledge.

Page 13: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Asynchronous Reliable Broadcast (Srikanth & Toueg, 87)

The core of the classic broadcast algorithm from the DA literature.
It solves an agreement problem depending on the inputs v_i.

    Variables of process i
      v_i : {0, 1}       initially 0 or 1
      accept_i : {0, 1}  initially 0

    An atomic step:
      if v_i = 1
      then send (echo) to all;

      if received (echo) from at least t + 1 distinct processes
         and not sent (echo) before
      then send (echo) to all;

      if received (echo) from at least n - t distinct processes
      then accept_i := 1;

  asynchronous
  t Byzantine faults
  correct if n > 3t
  the code is parameterized in n and t ⇒ process template P(n, t, f)
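The atomic step of the broadcast algorithm can be read as a function of the local state and the number of distinct processes from which (echo) has been received. The following Python sketch is only an illustration under stated assumptions: the names `Process` and `atomic_step` are hypothetical, the message count is passed in as a plain integer, and the first rule is guarded by "not sent before" so that each process sends (echo) at most once, which the pseudo-code states explicitly only for the second rule.

```python
from dataclasses import dataclass


@dataclass
class Process:
    v: int              # input value v_i, 0 or 1
    sent: bool = False  # has (echo) already been sent to all?
    accept: int = 0     # accept_i, initially 0


def atomic_step(p, echoes_from_distinct, n, t):
    """Run one atomic step; return True if p sends (echo) to all in this step."""
    send = False
    # Rule 1: a process with input 1 sends (echo).
    # Assumption: guarded by "not sent before", so it fires at most once.
    if p.v == 1 and not p.sent:
        send = True
    # Rule 2: at least t + 1 echoes from distinct processes, not sent before.
    if echoes_from_distinct >= t + 1 and not p.sent:
        send = True
    # Rule 3: at least n - t echoes from distinct processes.
    if echoes_from_distinct >= n - t:
        p.accept = 1
    if send:
        p.sent = True
    return send
```

With n = 4 and t = 1, a process with v = 0 stays silent after one echo, relays after two, and accepts after three.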


Page 15: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Typical Structure of a Computation Step

  receive messages

  compute using messages and local variables
  (description in English with basic control flow: if-then-else)

  send messages

[Figure annotations on the step: atomic, implicit, pseudo-code]


Page 17: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Challenge #4: Parameterized Model Checking

Parameterized model checking problem:
  given a process template P(n, t, f),
  resilience condition RC: n > 3t ∧ t ≥ f ≥ 0,
  fairness constraints Φ, e.g., “all messages will be delivered”,
  and an LTL-X formula ϕ,

show for all n, t, and f satisfying RC:

\[ (P(n, t, f))^{n-f} + f \text{ faults} \models (\Phi \rightarrow \varphi) \]

[Figure: n processes of which at most t might be faulty; n processes of which f are actually faulty]
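To see why the problem is called parameterized, it can help to list the parameter triples admitted by the resilience condition. The sketch below is illustrative only: the function name `resilient_parameters` and the cut-off `max_n` are assumptions introduced here. Any fixed-parameter check covers finitely many such triples, while the parameterized problem quantifies over all of them.

```python
def resilient_parameters(max_n):
    """All (n, t, f) with 1 <= n <= max_n satisfying n > 3t and t >= f >= 0."""
    return [
        (n, t, f)
        for n in range(1, max_n + 1)
        for t in range(0, n)
        for f in range(0, t + 1)
        if n > 3 * t
    ]


if __name__ == "__main__":
    for triple in resilient_parameters(7):
        print(triple)
```

The list grows without bound as `max_n` increases, which is the reason a different technique is needed for the general case.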

Page 18: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Challenge #5: Liveness in Distributed Algorithms

Interplay of safety and liveness is a central challenge in DAs
  achieving safety and liveness is non-trivial
  asynchrony and faults lead to impossibility results (recall first part of lecture (Fischer et al., 1985))

Rich literature to verify safety (e.g. in concurrent systems)

Distributed algorithms perspective:
  “doing nothing is always safe”
  “tools verify algorithms that actually might do nothing”

Verification efforts often have to simplify assumptions


Page 20: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Summary

We have to model:
  faults,
  communication medium captured in English,
  algorithms written in pseudo-code.

and check:
  safety and liveness
  of parameterized systems
  with unbounded integers,
  non-standard fairness constraints.

Page 21: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Existing formalization frameworks

[Figure: landscape of existing frameworks, from design & specification to implementation]

  TLA+/PlusCal: concurrent algorithms; proving / TLC
  (Timed) IOA: asynchronous DA; proving / UPPAAL
  PVS: theorem proving
  DISTAL: simulation
  PBFT: implementation
  ?: (parameterized) model checking of FTDAs

Page 22: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Alternative frameworks

TLA (temporal logic of actions):
  used to design (distributed) algorithms by refinement of the spec
  verification with proof assistants (low degree of automation)

Encodings of DA in proof assistant PVS (e.g., by Rushby):
  ad-hoc encoding
  found a bug in a published synchronous Byzantine Agreement algorithm (Lincoln & Rushby, 1993)

I/O-Automata:
  originally designed to write clearer hand-written proofs
  limited tool support, e.g., Veromodo toolset is still in beta
  suitable only for asynchronous distributed algorithms

proof assistants are very general, but with low automation degree
“everything is possible, but nothing is easy”


Page 24: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Simulation and Implementation

Distal:
  Domain-specific language (Biely et al., 2013)
  Simulation and performance evaluation of fault-tolerant algorithms

Practical Byzantine Fault-Tolerance (Castro et al., 1999) and other practical algorithms:
  Implementation with optimizations
  Precise semantics is unclear
  The system is partially synchronous: non-divergent message delays are assumed

Page 25: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

In this part

We introduce an efficient encoding in PROMELA.

Verify safety and liveness of fault-tolerant algorithms (fixed parameters).

Find counterexamples for parameters known from the literature.

This proves adequacy of our modeling.

Page 26: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Preliminaries: Promela

Page 27: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Promela

PROMELA ≡ PROcess MEta LAnguage

SPIN ≡ Simple Promela INterpreter (not that simple any more)

Here we give a short introduction and cover only the features important to our work.

Detailed documentation, tutorials, and books on: http://spinroot.com (Gerard Holzmann)

Page 28: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Top-level: global variables and processes

    /* global declarations visible to all processes */
    int x;                        /* a global integer (as in C) */
    mtype = { X, Y };             /* constant message types */
    /* a FIFO channel with at most 2 messages of type mtype */
    chan c = [2] of { mtype };

    active[2] proctype ProcA() {  /* two processes are created at the initial state */
        ...
    }

    proctype ProcB() {            /* processes can be created later using: run ProcB() */
        ...
    }

    init {                        /* a special process, use to create other processes */
        run ProcB(); run ProcB();
    }

Page 29: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

One process: Basics

    int x, y;

    active proctype ProcA() {
        int z;     /* declare a local variable */
        z = x;     /* assignment */
        x > y;     /* block until the expression evaluates to true */
        true;      /* one step to execute, no effect */
        z++;
        skip;      /* same as true */
    }

Page 30: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

One process: Control flow

    int x, y;

    active proctype P() {
    main:
        /* a guarded command: non-deterministically selects an option
           whose first expression is not blocked, then continues
           executing the rest of the option step-by-step */
        if
        :: x == 0 -> x = 1;
        :: y == 0 -> y = 1;
        :: x == 1 && y == 1 -> x = 0; y = 0;
        fi;
        goto main;
    }

Page 31: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

One process: Control flow (cont.)

    int x = 0, y = 0;

    active proctype P() {
    main:
        if
        :: x == 0 -> x = 1;
        :: y == 0 -> y = 1;
        :: x == 1 && y == 1 -> x = 0; y = 0;
        fi;
        goto main;
    }

    Run 1      Run 2      Run 3
    x=0,y=0    x=0,y=0    x=0,y=0
    x=1,y=0    x=0,y=1    x=1,y=0
    x=1,y=1    x=1,y=1    x=1,y=1
    x=0,y=0    x=0,y=0    x=0,y=0
    x=0,y=1    x=1,y=0    x=0,y=1
    x=1,y=1    x=1,y=1    x=1,y=1
    x=0,y=0    x=0,y=0    x=0,y=0
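The runs listed for the guarded command can be reproduced by enumerating, for each valuation of x and y, which options are enabled. The following Python sketch is an illustration, not part of the Promela model: the function names are hypothetical, and the third option (`x = 0; y = 0`) is collapsed into a single step, as the table of runs does.

```python
def options(x, y):
    """Successor valuations (x, y) of every enabled option of the if..fi."""
    succ = []
    if x == 0:             # :: x == 0 -> x = 1
        succ.append((1, y))
    if y == 0:             # :: y == 0 -> y = 1
        succ.append((x, 1))
    if x == 1 and y == 1:  # :: x == 1 && y == 1 -> x = 0; y = 0
        succ.append((0, 0))
    return succ


def reachable(initial=(0, 0)):
    """All valuations reachable by repeating the if..fi from `initial`."""
    seen = {initial}
    stack = [initial]
    while stack:
        for nxt in options(*stack.pop()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen
```

In the initial valuation two options are enabled, which is where the runs diverge; every later choice is forced until the valuation returns to x = 0, y = 0.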

Page 32: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

One process: Loops

    int x;

    active proctype P() {
        do  /* a do..od loop */
        :: x == 10 -> x = 0;
        :: x == 10 -> break;
        :: x < 10 -> x++;
        od;

    A:  /* basically the same; goto A introduces one more step */
        if
        :: x == 10 -> x = 0;
        :: x == 10 -> goto B;
        :: x < 10 -> x++;
        fi;
        goto A;

    B:  skip  /* a label must be attached to a statement */
    }

Page 33: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Many Processes: Interleavings

Pure interleaving semantics

Every statement is executed atomically

    int x = 0, y = 1;

    active[2] proctype A() {
        x = 1 - x;
        y = 1 - y;
    }

[Figure: interleavings of the steps of A[0] and A[1]]

The red path is an example execution where the steps of processes 0 and 1 alternate.

Page 34: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Many Processes: Atomics

use atomic { ... } to make execution of a sequence indivisible.

non-deterministic choice with if..fi is still allowed!

    int x = 0, y = 1;

    active[2] proctype A() {
        atomic {
            x = 1 - x;
            y = 1 - y;
        }
    }

[Figure: interleavings of the atomic steps of A[0] and A[1]]

Larger atomic steps lead to fewer possible paths and states.
Note: different atomicity degrees may lead to different verification results
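The effect of atomicity on the number of paths and states can be checked by brute force for the two-process example. The Python sketch below is a hand-written stand-in for what Spin explores, under assumptions made here: a global state is the tuple (x, y, pc of A[0], pc of A[1]), a program counter of 2 means the process has finished, and `explore` is a hypothetical helper name.

```python
def explore(atomic):
    """Return (number of reachable global states, number of complete runs)."""
    last_pc = 2
    initial = (0, 1, 0, 0)  # (x, y, pc of A[0], pc of A[1])

    def successors(state):
        x, y, *pcs = state
        result = []
        for k in (0, 1):
            pc = pcs[k]
            if pc == last_pc:
                continue  # process k has terminated
            nx, ny, npc = x, y, pc
            # Execute one statement, or the whole body if it is atomic.
            while npc < last_pc:
                if npc == 0:
                    nx = 1 - nx  # x = 1 - x
                else:
                    ny = 1 - ny  # y = 1 - y
                npc += 1
                if not atomic:
                    break
            new_pcs = list(pcs)
            new_pcs[k] = npc
            result.append((nx, ny, *new_pcs))
        return result

    seen = set()

    def count_runs(state):
        seen.add(state)
        succ = successors(state)
        if not succ:
            return 1
        return sum(count_runs(s) for s in succ)

    runs = count_runs(initial)
    return len(seen), runs
```

Calling `explore(False)` and `explore(True)` side by side shows how wrapping the body in `atomic` shrinks both numbers for this example.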


Page 36: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

(Asynchronous) message passing

    mtype = { A, B };
    chan chan1 = [1] of { mtype };  /* queue of size 1 */
    chan chan2 = [1] of { mtype };

    active proctype Ping() {
        chan1!A;                    /* insert A to "chan1" */
        do
        /* when B is on the top of "chan2",
           remove it and insert A to "chan1" */
        :: chan2?B -> chan1!A;
        od;
    }

    active proctype Pong() {
        do
        :: chan1?A -> chan2!B;
        od;
    }

Page 37: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Blocking receive

    mtype = { A, B };
    chan chan1 = [1] of { mtype };
    chan chan2 = [1] of { mtype };

    active proctype Ping() {
        chan1!A;
        do
        :: chan2?B ->       /* <-- deadlock! */
            chan1!A;
        od;
    }

    active proctype Pong() {
        do
        :: chan1?A ->
            chan1?A;        /* <-- deadlock! */
            chan2!B;
        od;
    }

Ping sends A, Pong receives A, chan1?A is blocked

Page 38: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Blocking send

    mtype = { A, B };
    chan chan1 = [1] of { mtype };
    chan chan2 = [1] of { mtype };

    active proctype Ping() {
        chan1!A;
        do
        :: chan2?B ->
            chan1!A;
            chan1!A;
            chan1!A;
            chan1!A;        /* <-- deadlock! */
        od;
    }

    active proctype Pong() {
        do
        :: chan1?A ->
            chan2!B;        /* <-- deadlock! */
        od;
    }

When chan1=[A] and chan2=[B], the system deadlocks
The shortest counter-example has 10 steps
Use Spin to find it
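The deadlock in the blocking-send example can also be found with a few lines of breadth-first search, which is roughly what a model checker does when it looks for a shortest counter-example. The Python sketch below is an illustration under assumptions: the loop body of Ping is taken to contain four consecutive sends `chan1!A`, as the flattened listing suggests; a channel of capacity 1 is modeled as 0 (empty) or 1 (full); and the names `successors` and `shortest_deadlock` are hypothetical.

```python
from collections import deque

PING_SENDS_IN_LOOP = 4  # assumption: four chan1!A statements in the loop body


def successors(state):
    """Enabled steps from state (chan1, chan2, pc of Ping, pc of Pong)."""
    c1, c2, p, q = state
    nxt = []
    # Ping: pc 0 is the initial send, pc 1 is chan2?B, pc 2..5 are the sends.
    if p == 1:
        if c2 == 1:  # chan2?B is executable only if chan2 holds a message
            nxt.append((c1, 0, 2, q))
    elif c1 == 0:    # chan1!A is executable only if chan1 is not full
        new_p = 1 if p in (0, PING_SENDS_IN_LOOP + 1) else p + 1
        nxt.append((1, c2, new_p, q))
    # Pong: pc 0 is chan1?A, pc 1 is chan2!B.
    if q == 0 and c1 == 1:
        nxt.append((0, c2, p, 1))
    elif q == 1 and c2 == 0:
        nxt.append((c1, 1, p, 0))
    return nxt


def shortest_deadlock(initial=(0, 0, 0, 0)):
    """Breadth-first search for a state without successors; returns (state, steps)."""
    depth = {initial: 0}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        succ = successors(state)
        if not succ:
            return state, depth[state]
        for t in succ:
            if t not in depth:
                depth[t] = depth[state] + 1
                queue.append(t)
    return None
```

Under these assumptions the search stops in a state where both channels are full, Ping is stuck at its last send, and Pong is stuck at `chan2!B`, after 10 steps.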

Page 39: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Promela vs. C

PROMELA looks like C

But it is not!

  Non-determinism in the if statements (internal non-determinism)
  Non-deterministic scheduler (external non-determinism)
  Atomic statements
  Message passing

PROMELA is a modeling language

Page 40: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Preliminaries:

  Kripke Structures
  Linear Temporal Logic (LTL)
  Control Flow Automata (CFA)

Page 41: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Kripke structures

A Kripke structure is a tuple M = (S, S0, R, AP, L), where:

  S is a set of states,
  S0 ⊆ S is the set of initial states,
  R ⊆ S × S is a transition relation,
  AP is a set of atomic propositions,
  L : S → 2^AP is a state-labeling function.

[Figure: example Kripke structure with labels s0 : {r}, s1 : {y}, s2 : {y}, s3 : {r, y, g}, s4 : {g}]

Page 42: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Linear Temporal Logic

An LTL formula is defined inductively w.r.t. atomic propositions AP:

  (base) p ∈ AP is an LTL formula,
  if ϕ and ψ are LTL formulas, then the following expressions are LTL formulas:
    Nexttime: Xϕ,
    Eventually: Fϕ,
    Globally: Gϕ,
    Until: ψ U ϕ,
    Boolean combinations: ϕ ∧ ψ, ϕ ∨ ψ, and ¬ϕ.

[Figure: four example paths over states s0, ..., s4 illustrating the temporal operators; on the last path ψ holds in the first two states and ϕ in the third]
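The inductive definition translates directly into a recursive evaluator. The Python sketch below is an illustration under assumptions chosen here: formulas are nested tuples such as `("U", ("ap", "psi"), ("ap", "phi"))`, and the infinite path is given as a lasso, that is, a finite list of label sets whose last position jumps back to `loop_start` forever. The function name `holds` is hypothetical.

```python
def holds(formula, labels, loop_start):
    """List of booleans: does `formula` hold at each position of the lasso?

    labels[i] is the set of atomic propositions true at position i; after
    the last position the path continues at position loop_start.
    """
    n = len(labels)

    def succ(i):
        return i + 1 if i + 1 < n else loop_start

    def future(i):
        # Positions visited from position i onwards (including i).
        return range(min(i, loop_start), n)

    op = formula[0]
    if op == "ap":
        return [formula[1] in labels[i] for i in range(n)]
    if op == "not":
        a = holds(formula[1], labels, loop_start)
        return [not v for v in a]
    if op in ("and", "or"):
        a = holds(formula[1], labels, loop_start)
        b = holds(formula[2], labels, loop_start)
        if op == "and":
            return [u and v for u, v in zip(a, b)]
        return [u or v for u, v in zip(a, b)]
    if op == "X":
        a = holds(formula[1], labels, loop_start)
        return [a[succ(i)] for i in range(n)]
    if op == "F":
        a = holds(formula[1], labels, loop_start)
        return [any(a[j] for j in future(i)) for i in range(n)]
    if op == "G":
        a = holds(formula[1], labels, loop_start)
        return [all(a[j] for j in future(i)) for i in range(n)]
    if op == "U":  # ("U", psi, phi) encodes psi U phi
        psi = holds(formula[1], labels, loop_start)
        phi = holds(formula[2], labels, loop_start)
        sat = list(phi)
        for _ in range(n):  # least fixpoint, reached after at most n rounds
            sat = [phi[i] or (psi[i] and sat[succ(i)]) for i in range(n)]
        return sat
    raise ValueError("unknown operator: %r" % (op,))
```

For example, on a five-state path labeled ψ, ψ, ϕ, (nothing), (nothing) with a self-loop on the last state, `holds(("U", ("ap", "psi"), ("ap", "phi")), labels, 4)[0]` is true, while Gψ fails at position 0.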

Page 43: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Recall: Typical Structure of a Computation Step

  receive messages

  compute using messages and local variables
  (description in English with basic control flow: if-then-else)

  send messages

[Figure annotations on the step: atomic, implicit, pseudo-code]

Page 44: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

CFA: Intermediate representation

Intermediate representation of a loop body: a path from qI to qF encodes one iteration.

Every variable is assigned at most once (SSA).

    active proctype P() {
        int x, y;

        do
        :: x == 0 -> x = 1;
        :: x == 1 -> x = 2;
        :: x == 2 -> x = 0;
        :: x == 1 -> x = 0; y = 1 - y;
        od;
    }

[Figure: CFA with nodes qI, q0, q1, q2, q3, q4, qF; edges leaving qI carry the guards x = 0, x = 1, x = 2, x = 1, followed by edges with the assignments x′ = 1, x′ = 2, x′ = 0, and x′ = 0 with y′ = 1 − y]

Page 45: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Example: from a CFA to a Kripke structure

Kripke structure M(n, t, f) = (S, S0, R, AP, L) of N(n, t, f) processes

For a path π from qI to qF construct a formula φπ(x, y, x′, y′)

A state is a pair of x = (x1, ..., xN) ∈ ℕ^N and y = (y1, ..., yN) ∈ ℕ^N, and the initial states are S0 = {(0, ..., 0)}

((x, y), (x′, y′)) ∈ R iff there are a process index k, 1 ≤ k ≤ N, and a path π such that:

  [k moves]: φπ(xk, yk, x′k, y′k) holds
  [others do not]: ∀i ∈ {1, ..., N} \ {k}. x′i = xi, y′i = yi.

Propositions AP = {[∃i. yi ≠ 0], [∀i. yi ≠ 0]}, and a state (x, y) is labeled as: p ∈ L((x, y)) iff (x, y) |= p.

[Figure: the CFA of the loop body from the slide “CFA: Intermediate representation”]
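The construction can be carried out mechanically for small N. The Python sketch below is an illustration with names chosen here (`local_successors`, `kripke`, and the proposition names are hypothetical): the four CFA paths of the example loop are hard-coded as guarded updates of one process, and the global transition relation lets exactly one process k move while all others keep their values. Only the reachable part of the state space is built.

```python
def local_successors(x, y):
    """Pairs (x', y') allowed by some path formula of the example CFA."""
    succ = []
    if x == 0:
        succ.append((1, y))      # x == 0 -> x = 1
    if x == 1:
        succ.append((2, y))      # x == 1 -> x = 2
        succ.append((0, 1 - y))  # x == 1 -> x = 0; y = 1 - y
    if x == 2:
        succ.append((0, y))      # x == 2 -> x = 0
    return succ


def kripke(N):
    """Reachable part of the Kripke structure of N copies of the process."""
    initial = ((0,) * N, (0,) * N)
    states, transitions = {initial}, set()
    frontier = [initial]
    while frontier:
        state = frontier.pop()
        xs, ys = state
        for k in range(N):  # [k moves]
            for nx, ny in local_successors(xs[k], ys[k]):
                # [others do not]: only component k changes
                nxt = (xs[:k] + (nx,) + xs[k + 1:],
                       ys[:k] + (ny,) + ys[k + 1:])
                transitions.add((state, nxt))
                if nxt not in states:
                    states.add(nxt)
                    frontier.append(nxt)
    labels = {}
    for s in states:
        props = set()
        if any(v != 0 for v in s[1]):
            props.add("exists i. y_i != 0")
        if all(v != 0 for v in s[1]):
            props.add("forall i. y_i != 0")
        labels[s] = props
    return states, {initial}, transitions, labels
```

Each process can reach every combination of x in {0, 1, 2} and y in {0, 1}, so the number of reachable states grows as 6^N, a small-scale view of why fixed-parameter checking becomes expensive as N grows.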

Page 46: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Example: Properties of ST87 in LTL

Unforgeability (safety). If v_i = 0 for all correct processes i, then for all correct processes j, accept_j remains 0 forever.

\[ \mathbf{G}\left(\left(\bigwedge_{i=1}^{n-f} v_i = 0\right) \rightarrow \mathbf{G}\left(\bigwedge_{j=1}^{n-f} accept_j = 0\right)\right) \]

Completeness (liveness). If v_i = 1 for all correct processes i, then there is a correct process j that eventually sets accept_j to 1.

\[ \mathbf{G}\left(\left(\bigwedge_{i=1}^{n-f} v_i = 1\right) \rightarrow \mathbf{F}\left(\bigvee_{j=1}^{n-f} accept_j = 1\right)\right) \]

Relay (liveness). If a correct process i sets accept_i to 1, then eventually all correct processes j set accept_j to 1.

\[ \mathbf{G}\left(\left(\bigvee_{i=1}^{n-f} accept_i = 1\right) \rightarrow \mathbf{F}\left(\bigwedge_{j=1}^{n-f} accept_j = 1\right)\right) \]

Page 47: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Model Checking Problems

Finite-state MC
Input:
  a process template P,
  an LTL formula ϕ (including fairness),
  values of parameters n, t, and f.
Problem: check whether M(n, t, f) |= ϕ.

Parameterized MC
Input:
  a process template P,
  an LTL formula φ (including fairness) with atomic propositions of the form [∃i. xi < y] and [∀i. xi < y]
Problem: check whether ∀n, t, f : n > 3t ∧ t ≥ f ∧ f ≥ 0. M(n, t, f) |= φ.

Page 48: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Parameterized modeling & non-parameterized model checking

as in SPIN’13: (John et al., 2013)

Page 49: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Modeling of threshold-based algorithms in Promela. . .

We introduce an efficient encoding of threshold-based fault-tolerant algorithms in PROMELA (with parametrization!)

Verify safety and liveness of fault-tolerant algorithms (fixed parameters).

Find counterexamples for parameters known from the literature.

This proves adequacy of our modeling.

For our method, we exploit specifics of FTDAs:
  1. central feature of the algorithms (message counting);
  2. specific message passing (we do not need to know who sent but how many of them sent messages);
  3. the way faults affect messages (again, counting messages).
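One way to picture the counting idea is to drop channels entirely and keep only numbers. The Python sketch below is a hypothetical illustration, not the encoding used in the talk: it assumes that each correct process keeps a counter `nrcvd` of received (echo) messages, that the number `nsnt` of echoes sent by correct processes is global, and that with f Byzantine processes a correct process may count at most `nsnt + f` echoes. It applies the three threshold rules of the Srikanth & Toueg algorithm and explores all reachable states for fixed n, t, f.

```python
def explore(n, t, f, v):
    """Reachable states of the n - f correct processes; v holds their inputs.

    A state is a tuple of (nrcvd, sent, accept) triples, one per correct process.
    """
    correct = n - f
    init = tuple((0, False, False) for _ in range(correct))
    seen = {init}
    frontier = [init]
    while frontier:
        state = frontier.pop()
        nsnt = sum(1 for (_, sent, _) in state if sent)
        for k, (nrcvd, sent, accept) in enumerate(state):
            # Assumption: process k may have received any number of echoes
            # between its old count and nsnt + f (faulty processes may add f).
            for new_rcvd in range(nrcvd, nsnt + f + 1):
                new_sent = sent or v[k] == 1 or new_rcvd >= t + 1
                new_accept = accept or new_rcvd >= n - t
                succ = (state[:k]
                        + ((new_rcvd, new_sent, new_accept),)
                        + state[k + 1:])
                if succ not in seen:
                    seen.add(succ)
                    frontier.append(succ)
    return seen


def unforgeability_holds(n, t, f):
    """True if no correct process accepts when all correct inputs are 0."""
    states = explore(n, t, f, v=[0] * (n - f))
    return not any(accept for s in states for (_, _, accept) in s)
```

Under these assumptions, `unforgeability_holds(4, 1, 1)` reports that the property holds, while `unforgeability_holds(4, 1, 2)`, where more processes are faulty than the algorithm tolerates, finds a state in which a correct process accepts.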


Page 51: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Case Studies

We consider a number of threshold-based algorithms.

Our running example ST87 for
  1. Byzantine faults (BYZ)
  2. omission faults (OMIT)
  3. symmetric faults (SYMM)
  4. clean crashes (CLEAN).

5. Folklore reliable broadcast for clean crashes [Chandra & Toueg 96, CT96]

(to be continued)

Page 52: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Characteristics of the FTDA by Srikanth & Toueg, 87

    Variables of process i
      v_i : {0, 1}       initially 0 or 1
      accept_i : {0, 1}  initially 0

    An atomic step:
      if v_i = 1
      then send (echo) to all;

      if received (echo) from at least t + 1 distinct processes
         and not sent (echo) before
      then send (echo) to all;

      if received (echo) from at least n - t distinct processes
      then accept_i := 1;

  the algorithm consists of threshold-guarded commands, only thresholds t + 1 and n − t
  communication is by “send to all”
  how processes distinguish distinct senders is not part of the algorithm (i.e., algorithm description is high level)


Page 54: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Case Studies (cont.): Larger Algorithms

more involved algorithms in the purely asynchronous setting:

6. Asynchronous Byzantine Agreement (Bracha & Toueg 85, BT85)
     Byzantine faults
     two phases and two message types
     five status values
     properties: unforgeability, correctness (liveness), agreement (liveness)

7. Condition-based Consensus (Mostefaoui et al. 01, MRRR01)
     crash faults
     two phases and four message types
     nine status variables
     properties: validity, agreement, termination (liveness)

8. Fast Byzantine Consensus: common case (Martin, Alvisi 06, MA06)
     Byzantine faults
     the core part of the algorithm
     no cryptography

Page 55: Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Experimental Results at Glance

Algorithm  Fault  Parameters                         Resilience              Properties          Time
1. ST87    BYZ    n = 7, t = 2, f = 2                n > 3t                  U, C, R             6 sec.
1. ST87    BYZ    n = 7, t = 3, f = 2                n > 3t                  U, C, R             5 sec.
1. ST87    BYZ    n = 7, t = 1, f = 2                n > 3t                  U, C, R             1 sec.
2. ST87    OMIT   n = 5, t = 2, f = 2                n > 2t                  U, C, R             4 sec.
2. ST87    OMIT   n = 5, t = 2, f = 3                n > 2t                  U, C, R             5 sec.
3. ST87    SYMM   n = 5, t = 1, fp = 1, fs = 0       n > 2t                  U, C, R             1 sec.
3. ST87    SYMM   n = 5, t = 2, fp = 3, fs = 1       n > 2t                  U, C, R             1 sec.
4. ST87    CLEAN  n = 3, t = 2, fc = 2, fnc = 0      n > t                   U, C, R             1 sec.
5. CT96    CRASH  n = 2                              -                       U, C, R             1 sec.
6. BT85    BYZ    n = 5, t = 1, f = 1                n > 3t                  R                   131 sec.
6. BT85    BYZ    n = 5, t = 1, f = 2                n > 3t                  R                   1 sec.
6. BT85    BYZ    n = 5, t = 2, f = 2                n > 3t                  R                   1 sec.
7. MRRR01  CRASH  n = 3, t = 1, f = 1                n > 2t                  V0, V1, A, T        1 sec.
7. MRRR01  CRASH  n = 3, t = 1, f = 2                n > 2t                  V0, V1, A, T        1 sec.
8. MA06    BYZ    p = 4, a = 6, l = 4, t = 1, f = 1  p > 3t, a > 5t, l > 3t  CS1, CS3, CL1, CL2  3 hrs.
8. MA06    BYZ    p = 4, a = 5, l = 4, t = 1, f = 1  p > 3t, a > 5t, l > 3t  CS1, CS3, CL1, CL2  14 min.
8. MA06    BYZ    p = 4, a = 6, l = 4, t = 1, f = 2  p > 3t, a > 5t, l > 3t  CS1, CS3, CL1, CL2  2 sec.


Experimental Results: on ST87, the Byzantine Case

[Plots: Time (sec, logscale) and Memory (MB, logscale), with curves for no faults (f = 0) and two faults (f = 2).]

The more faults we have, the easier the problem is:
  - Two faults: we can check systems of up to nine processes
  - No faults: we can check systems of up to seven processes

Precision of modeling: we found counter-examples for the corner cases n = 3t and f > t, where the resilience condition is violated. (June 2013: somebody wrote on Wikipedia that n = 3t should work :-)


Discussion of the specifications

Unforgeability. If v_i = 0 for all correct processes i, then for all correct processes j, accept_j remains 0 forever.

$$\mathbf{G}\Bigl(\Bigl(\bigwedge_{i=1}^{n-f} v_i = 0\Bigr) \rightarrow \mathbf{G}\Bigl(\bigwedge_{j=1}^{n-f} \mathit{accept}_j = 0\Bigr)\Bigr)$$

The specifications of Byzantine FTDAs have the following features:

  - Only the states of correct processes are evaluated. Faulty processes may be Byzantine (no assumption on their behavior).
  - Specifications do not talk about individual processes. Only global safety and progress are important.
  - Indexed temporal logic is not required! Quantification over processes is on the level of atomic propositions.
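To illustrate how unforgeability quantifies over correct processes only inside the atomic propositions, here is a small Python check on a finite trace. It is a sketch under stated assumptions: a finite prefix only approximates the LTL semantics over infinite runs, and the state layout (one `(v, accept)` pair per correct process) and the name `unforgeability_holds` are chosen here for illustration.

```python
def unforgeability_holds(trace):
    """Check G((AND_i v_i = 0) -> G(AND_j accept_j = 0)) on a finite trace.

    trace: list of global states; each state is a tuple of (v, accept)
    pairs, one pair per correct process (faulty processes are not listed).
    """
    for k, state in enumerate(trace):
        all_v_zero = all(v == 0 for v, _ in state)
        if all_v_zero:
            # From position k on, no correct process may accept.
            for later in trace[k:]:
                if any(acc != 0 for _, acc in later):
                    return False
    return True
```

Both atomic propositions are conjunctions over all correct processes, so no process index appears outside them.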


Threshold-Guarded Distributed Algorithms

Standard construct: quantified guards (t = f = 0)

  Existential Guard
    if received m from some process then ...

  Universal Guard
    if received m from all processes then ...

what if faults might occur?

Fault-Tolerant Algorithms: n processes, at most t are Byzantine

  Threshold Guard
    if received m from n − t processes then ...

(the processes cannot refer to f!)
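The three kinds of guards can be written as predicates over the number of distinct senders. This is only an illustrative sketch: the function names and the counter argument `nrcvd` are assumptions, and none of the guards takes f as an argument, since processes cannot refer to it.

```python
def existential_guard(nrcvd):
    """received m from some process"""
    return nrcvd >= 1


def universal_guard(nrcvd, n):
    """received m from all processes"""
    return nrcvd >= n


def threshold_guard(nrcvd, n, t):
    """received m from n - t processes"""
    return nrcvd >= n - t
```

For example, with n = 4 and t = 1, a single silent faulty process means a correct process may never see more than 3 messages: the universal guard then never fires, while the threshold guard does.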


Counting Argument in Threshold-Guarded Algorithms

[Diagram: n processes, with the bounds t and f marked, and a threshold of t + 1.]

Threshold t + 1: at least one non-faulty process sent the message

Correct processes count incoming messages from distinct processes
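The counting argument can be checked by brute force for small parameters. The sketch below enumerates every set of t + 1 distinct senders and tests whether it contains a non-faulty process; the function name and the convention that processes 0..f-1 are the faulty ones are assumptions made for illustration.

```python
from itertools import combinations


def some_correct_sender_guaranteed(n, t, f):
    """True iff every set of t + 1 distinct senders out of n processes
    contains at least one non-faulty process, given f faulty processes."""
    faulty = set(range(f))  # assumption: processes 0..f-1 are the faulty ones
    for senders in combinations(range(n), t + 1):
        if all(p in faulty for p in senders):
            return False    # t + 1 messages could all come from faulty processes
    return True
```

The guarantee holds whenever f ≤ t and fails as soon as f > t (provided n is large enough to pick t + 1 senders).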


Modeling threshold-based algorithms in Promela

As the distributed algorithms are given in pseudo-code, we have to decide on how to encode in PROMELA:
  - send to all and receive
  - counting expressions "received <m> from n − t distinct processes"
  - faults

In what follows, we compare side-by-side two solutions:
  - A straightforward encoding with PROMELA channels and explicit representation of faulty processes. [Solution 1]
  - An advanced encoding with shared variables and fault injection. [Solution 2]

To decouple encoding of reliable message passing and of faults, we first consider message passing without faults, and then show how to encode faults.


Template in Promela

We implement the following loop; each iteration is one atomic step:

  1. receive messages
  2. compute using messages and local variables
     (description in English with basic control flow if-then-else)
  3. send messages

The corresponding PROMELA template:

  /* shared state: a variable or a channel */
  active[N(n,t,f)] proctype P() {
    /* local variable to count messages from distinct processes */
    int nrcvd;
    /* initialization */
  loop:
    atomic {
      /* 1. receive and count messages
         2. compute using nrcvd
         3. send messages */
    }
    goto loop;
  }


Modeling Message Passing

All our case studies are designed with the assumption of classic reliable asynchronous message passing as in (Fischer et al., 1985):

  - non-blocking communication: operations "receive" and "send" are executed immediately.
  - if a message can be received now, it may also be received later: a process does not have to receive a message as soon as it is able to.
  - every sent message is eventually received, but there are no bounds on the delays.


Solution 1: Message Passing using Promela channels

A straightforward encoding using message channels:

  /* message type */
  mtype = { ECHO };
  /* point-to-point channels */
  chan p2p[N][N] = [1] of { mtype };
  /* tag received messages */
  bit rx[N][N];

Sending a message to all processes:

  /* indices run from 0 to N-1, as in the receive loop */
  for (i : 0 .. N-1) { p2p[_pid][i]!ECHO; }

Note: _pid denotes the process identifier in PROMELA (we use it solely to encode message passing).


Solution 1: Message Passing (cont.)

Receiving and counting messages from distinct processes (no faults yet):

  /* local */ int nrcvd = 0;  /* initially, no messages */
  ...
  i = 0;
  do /* is there a message from process i? */
    :: (i < N) && nempty(p2p[i][_pid]) ->
         p2p[i][_pid]?ECHO;          /* remove it */
         if
           :: !rx[i][_pid] ->        /* 1. the first time: */
                rx[i][_pid] = 1;     /*    a. mark as received */
                nrcvd++; break;      /*    b. increase local counter */
           :: rx[i][_pid];           /* 2. ignore a duplicate */
         fi; i++;                    /* next process */
    :: (i < N) -> i++;               /* receive nothing from i */
    :: i == N -> break;
  od
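The receive loop can be mirrored in Python to make the role of the `rx` flags explicit. This is a hedged sketch, not a translation that SPIN would accept: the queues stand in for the channels into one receiver, and the callback `choose`, which resolves the nondeterministic choice "receive from i now or not", is an assumption introduced here.

```python
from collections import deque


def receive_and_count(inbox, rx, nrcvd, choose):
    """One pass over the incoming channels of a single receiver.

    inbox[i]: deque of pending messages from process i
    rx[i]:    True once a message from i has been counted
    choose:   function i -> bool, resolves the nondeterministic choice
    Returns the updated counter of distinct senders.
    """
    for i in range(len(inbox)):
        if inbox[i] and choose(i):
            inbox[i].popleft()      # remove the message from the channel
            if not rx[i]:
                rx[i] = True        # first message from i: mark as received
                nrcvd += 1          # and increase the local counter
                break               # as in the PROMELA loop: stop after counting one
            # otherwise it is a duplicate and is ignored
    return nrcvd
```

Calling it repeatedly with `choose = lambda i: True` counts each sender once, no matter how many duplicates arrive.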


Solution 2: Simulating message passing with variables

Keeping the number of send-to-all's by (correct) processes:

  int nsnt; /* shared variable:
               number of send-to-all's sent by correct processes */

Sending a message to all:

  nsnt++;

Receiving and counting messages from distinct processes (no faults):

  if /* pick a larger value ≤ nsnt */
    :: ((nrcvd + 1) <= nsnt) -> nrcvd++;  /* one more message */
    :: skip;                              /* or nothing */
  fi;

Reliable communication as a fairness property:

$$\mathbf{F}\,\mathbf{G}\,[\forall i.\; \mathit{nrcvd}_i \ge \mathit{nsnt}]$$
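The nondeterministic counter update can be explored in Python by computing the set of successor values. This is a sketch under an assumption: the guard is read as nrcvd + 1 ≤ nsnt, which is what the comment "pick a larger value ≤ nsnt" and the fairness property suggest. The function names are hypothetical.

```python
def receive_step(nrcvd, nsnt):
    """Possible values of nrcvd after one receive step (no faults)."""
    successors = {nrcvd}            # the `skip` branch: receive nothing
    if nrcvd + 1 <= nsnt:           # guard: stay at or below nsnt
        successors.add(nrcvd + 1)   # receive one more message
    return successors


def reachable_counters(nsnt):
    """All values nrcvd can reach from 0 while nsnt stays fixed."""
    seen, frontier = {0}, [0]
    while frontier:
        value = frontier.pop()
        for nxt in receive_step(value, nsnt):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen
```

The counter can lag behind nsnt for arbitrarily many steps (the `skip` branch), which is why reliable communication has to be stated separately as a fairness property.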


Solution 2: Some questions you might ask

Q1: instead of

  if
    :: ((nrcvd + 1) <= nsnt) -> nrcvd++;  /* one more message */
    :: skip;                              /* or nothing */
  fi;

why can't we just write:

  nrcvd = nsnt;

A1: You can, but that will be another model, not [FLP85]!

[FLP85] only guarantees that every message is eventually received.


Solution 2: Some questions you might ask (cont.)

Reliable communication: every sent message is eventually received.

Q2: Why do we write

$$\mathbf{F}\,\mathbf{G}\,[\forall i.\; \mathit{nrcvd}_i \ge \mathit{nsnt}] \qquad (1)$$

instead of:

$$\forall i.\; \mathbf{G}\,\mathbf{F}\,[\mathit{nrcvd}_i \ge \mathit{nsnt}] \qquad (2)$$

A2: We would like to write (2), but this would require another logic, called indexed LTL, which causes problems in the parameterized case.

For threshold-based algorithms, the value of nsnt changes at most n times.

Under this assumption, nsnt eventually stops changing, and since each nrcvd_i never decreases, (2) is equivalent to (1).


Solution 1 (cont.): Explicit Modeling of Faults

(Lamport et al., 1982) introduce Byzantine processes that can virtually do anything.

In our case, Byzantine behavior boils down to sending ECHO to some of the correct processes and not sending ECHO to the others:

  active[F] proctype Byz() {
  step:
    atomic {
      i = 0;
      do
        :: i < N -> p2p[_pid][i]!ECHO; i++;  /* send ECHO to process i */
        :: i < N -> i++;                     /* or not */
        :: i == N -> break;
      od
    };
    goto step;
  }
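To see why explicit Byzantine processes are expensive for the model checker, one can enumerate the choices of a single `Byz` step in Python. This is an illustrative sketch only: the function name is hypothetical, and it counts recipient subsets rather than SPIN states.

```python
from itertools import product


def byzantine_send_patterns(n):
    """All recipient sets a Byzantine process can choose in one step:
    for each of the n processes, either send ECHO or not."""
    patterns = []
    for choice in product([False, True], repeat=n):
        patterns.append(frozenset(i for i, send in enumerate(choice) if send))
    return patterns
```

A single step already has 2^n outcomes per Byzantine process, and these choices are interleaved with the steps of the correct processes.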


Solution 2 (cont.): Injecting Faults into Message Counters

We instantiate n − f correct processes and no faulty processes.

Instead, we say that the correct processes may receive up to f additional messages due to faults:

  if
    :: ((nrcvd + 1) <= nsnt + f) -> nrcvd++;  /* receive one more message */
    :: skip;                                  /* or nothing */
  fi;

The fairness still forces the processes to receive all the messages sent by the correct processes:

$$\mathbf{F}\,\mathbf{G}\,[\forall i.\; \mathit{nrcvd}_i \ge \mathit{nsnt}]$$

Note: each correct process sends at most one ECHO message.


Solution 2 (cont.): Modeling different kinds of faults

Byzantine faults (previous slide):
  - create only correct processes, i.e., n − f processes
  - only they have to satisfy the spec
  - extra messages from Byzantine processes: ((nrcvd + 1) <= nsnt + f)
  - fairness (reliable communication): $\mathbf{F}\,\mathbf{G}\,[\forall i.\; \mathit{nrcvd}_i \ge \mathit{nsnt}]$

Omission faults (processes fail to send messages):
  - create all processes, i.e., n processes
  - all of them are mentioned in the specification
  - no additional messages: ((nrcvd + 1) <= nsnt)
  - fairness (with possible message loss due to faults): $\mathbf{F}\,\mathbf{G}\,[\forall i.\; \mathit{nrcvd}_i \ge \mathit{nsnt} - f]$

Crash faults: similar to omissions, with a crash control state added
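The difference between the Byzantine and the omission encodings comes down to three numbers, which the following Python sketch tabulates. The names `CounterBounds` and `counter_bounds` are hypothetical, the upper bound reads the receive guard as nrcvd + 1 ≤ bound, and crash faults are left out because the slide does not spell out their encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterBounds:
    processes: int        # number of process instances created in the model
    max_nrcvd: int        # largest value nrcvd may take for the given nsnt
    fair_min_nrcvd: int   # value nrcvd must eventually reach (fairness)


def counter_bounds(kind, n, f, nsnt):
    """Bounds on the message counter for one fault model."""
    if kind == "byzantine":
        # only correct processes exist; faults add up to f extra messages
        return CounterBounds(n - f, nsnt + f, nsnt)
    if kind == "omission":
        # all processes exist; up to f messages may be lost
        return CounterBounds(n, nsnt, nsnt - f)
    raise ValueError("unknown fault kind: " + kind)
```

For n = 4, f = 1 and nsnt = 3, the Byzantine encoding creates 3 processes whose counters may reach 4 and must eventually reach 3.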


Experiments: Solution 1 vs. Solution 2

[Plots: States (logscale) and Memory (MB, logscale, ≤ 12 GB), both against the number of processes N, for N = 3 to 8.]

Solution 1: channels + explicit Byzantine processes (blue)
Solution 2: shared variables + fault injection (red)

in the presence of one Byzantine faulty process (f = 1)
(case f = 2 runs out of memory too fast)


Summary

We show how to model threshold-based fault-tolerant algorithms, starting with an imprecise description

We create PROMELA models using expert advice.

The tool demonstrates that the model behaves as predicted by theory (for concrete values of parameters)

This reference implementation allows us to optimize the encoding

... and to make the model amenable to parameterized verification


Tomorrow:

parameterized model checking


References I

Biely, M., Delgado, P., Milosevic, Z., & Schiper, A. 2013 (June). Distal: A framework for implementing fault-tolerant distributed algorithms. Pages 1–8 of: Dependable Systems and Networks (DSN), 2013 43rd Annual IEEE/IFIP International Conference on.

Castro, Miguel, Liskov, Barbara, et al. 1999. Practical Byzantine fault tolerance. Pages 173–186 of: OSDI, vol. 99.

Fischer, Michael J., Lynch, Nancy A., & Paterson, M. S. 1985. Impossibility of Distributed Consensus with one Faulty Process. J. ACM, 32(2), 374–382. http://doi.acm.org/10.1145/3149.214121.

John, Annu, Konnov, Igor, Schmid, Ulrich, Veith, Helmut, & Widder, Josef. 2013. Towards Modeling and Model Checking Fault-Tolerant Distributed Algorithms. Pages 209–226 of: SPIN. LNCS, vol. 7976.

Lamport, Leslie, Shostak, Robert E., & Pease, Marshall C. 1982. The Byzantine Generals Problem. ACM Trans. Program. Lang. Syst., 4(3), 382–401.


References II

Lincoln, P., & Rushby, J. 1993. A formally verified algorithm for interactive consistency under a hybrid fault model. Pages 402–411 of: FTCS-23. http://dx.doi.org/10.1109/FTCS.1993.627343.


Folklore Reliable Broadcast (e.g., Chandra & Toueg, 96)

Correct processes agree on value v_i in the presence of crash faults.

Variables of process i

  v_i : {0, 1} initially 0 or 1
  accept_i : {0, 1} initially 0

An atomic step:

  if (v_i = 1 or received <echo> from some process)
     and accept_i = 0
  then begin
    send <echo> to all;  /* when crashing it sends to a subset of processes */
    accept_i := 1;       /* it can also crash here */
  end


Verification Problem as in Distributed Computing

Given a distributed algorithm A and specifications ϕU, ϕC, ϕR:

Fix n and t with n > 3t, and show that every execution of A(n, t) satisfies ϕU, ϕC, ϕR.

In every execution:
  - the number of faulty processes is restricted, i.e., f ≤ t;
  - processes can use n and t in the code, but not f;
  - f is constant (if a process fails late, its "correct" behavior was a Byzantine trick).

[Diagram: a distributed system A(n, t), instantiated for f = 0, ..., f = t.]

Counterexamples when f > t?
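For a fixed bound on n, the parameter values covered by this verification problem can be listed explicitly. The sketch below is only an illustration; the function name `admissible_parameters` and the bound `max_n` are assumptions.

```python
def admissible_parameters(max_n):
    """All triples (n, t, f) with n <= max_n, n > 3t and f <= t."""
    triples = []
    for n in range(1, max_n + 1):
        for t in range(0, n):
            if n > 3 * t:                  # resilience condition
                for f in range(0, t + 1):  # actual number of faults, f <= t
                    triples.append((n, t, f))
    return triples
```

Each triple corresponds to one fixed-size instance that a model checker can handle; covering all n at once requires parameterized verification.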


Experiments: Channels vs. Shared Variables

enumerating reachable states in SPIN with POR and state compression

[Plots: States (logscale) and Memory (MB, logscale, limit of 12 GB), both against the number of processes N, for N = 3 to 8.]
