Moving away from the independent and identically distributed failure assumption

University of California, San Diego
Flavio Junqueira
Research Exam/Thesis Proposal
Advisors: Keith Marzullo and Geoffrey M. Voelker


Common approach for distributed systems: replicate!
  Cheaper than investing in ultra-reliable, specialized components
  Enhances performance and availability
  E.g. processes in software-based systems

Typical replication strategy
  Compute a threshold t on the number of process failures
  Determine the degree of replication required, depending on the problem (e.g. n > 3t for Consensus with arbitrary failures)
  Replicate to this degree

Well suited for independent and identically distributed failures (IID failure assumption)
  Non-negligible probability of t failures in any subset of size t+1
  Is it often a reasonable assumption?

Where IID does not apply…

Systems for the Internet
  Hosts execute the same popular software systems
  Hosts share the same vulnerabilities

Some major outbreaks
  Code Red: over 360,000 hosts [Moore02]
  Sapphire: over 75,000 hosts


A threshold on the number of failures is unrealistic.

Where IID does not apply…

Quorum systems in a wide-area network [Amir96]
  Failures are strongly correlated
    Power outages
    Network partitions

Software bugs [Little01]
  Single version
    A demand may cause all replicas to crash
  Multiple independently-developed versions
    Difficulty of a demand: difficulty in handling it
    Level of difficulty varies among the demands
    More difficult demands tend to cause multiple versions to fail

Where IID does not apply…

Multi-computer systems [Tang92]
  Correlated failures due to shared resources
    Network errors
    Shared memory
  Impact on availability, reliability, and performance

Grid computing
  Master delegates computation
    Waits for replies from slaves
  Replicate to achieve fault tolerance
  Dependent failures: same sub-network, same software systems, etc.

System model
Modeling failures
  The classical approach: the threshold model
  An alternative to the threshold model: cores/survivor sets
Applying it to problems: Consensus
  Traditional results on Consensus
  Consensus in the core/survivor set model
Generalizing the results for Consensus
  General bounds on process replication
Coping with dependent failures in the real world
  A few systems that assume dependent failures
  An application: the Phoenix Recovery System

System model

Set of processes $\Pi = \{p_1, p_2, \ldots, p_n\}$
  A process is a unit of computation
  Processes communicate by exchanging messages

Reliable channels
  Validity: If a correct process p sends a message m to a correct process q, then q eventually receives m
  Integrity: A process p receives a message m from some process q only if q sent m to p

System model

[Figure: a set of processes exchanging messages over reliable channels. A distributed algorithm is a collection of state machines (shown for processes p and q); a step of a process is one transition of its state machine; an execution is a sequence of steps of processes.]


Distributed algorithm
  Collection of state machines, one for each process p
  Proceeds in steps of processes
  In a step, a process p
    Sends a message to a single process
    Receives a message from a single process
    Undergoes a state transition

Execution
  Sequence of steps of processes in $\Pi$

Timing assumptions

Synchronous systems
  Clock drift, message delay, and processor speed are bounded
  Execution in synchronous rounds
  In a synchronous round, a process
    Sends messages to any number of processes
    Receives messages from any number of processes
    Undergoes a state transition

Asynchronous systems
  No bounds on clock drift, message delay, or processor speed

Failure modes for processes

Crash failures
  For every faulty process p in some execution of an algorithm A, there is a time $t_p$ after which p stops executing steps of A

Arbitrary failures
  A faulty process can deviate arbitrarily from the specification of the algorithm
  E.g. crashing, sending messages selectively, arbitrarily modifying the content of messages

Receive-omission failures
  A faulty process either crashes or selectively fails to receive messages

Assumptions
  Once a process fails, it does not recover
  Probability of a total failure is negligible

Modeling failures

The threshold model

Threshold t on the number of process failures
  Degree of reliability: $R \in [0,1]$
  The probability of t+1 process failures is smaller than 1-R
  Simple and compact representation (n > f(t))

SIFT project [Wensley76]
  Ultra-reliable computer system
  Process failures are arbitrary, but non-malicious
  Hardware designed to isolate faults (independent failures)
  Similar hardware (identically distributed process failures)
  IID failure assumption is valid

What if failures are not IID?
  Still safe
    t is the size of the largest subset of faulty processes in any execution
    It does not hurt to consider more
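As a rough illustration of how a threshold could be derived when the IID assumption does hold, the sketch below computes a binomial tail probability and picks the smallest t that meets a target reliability R. The function names, the per-process failure probability `p`, and the reading of "probability of t+1 process failures" as "more than t failures" are assumptions made for this example; they do not come from the slides.

```python
from math import comb


def failure_tail(n, p, t):
    """Probability that more than t of n processes fail,
    assuming failures are IID with per-process probability p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(t + 1, n + 1))


def smallest_threshold(n, p, reliability):
    """Smallest t such that P[more than t failures] < 1 - reliability."""
    for t in range(n + 1):
        if failure_tail(n, p, t) < 1 - reliability:
            return t
    return n  # only reached when reliability == 1


if __name__ == "__main__":
    # Hypothetical numbers: 4 processes, each failing with probability 0.5.
    print(failure_tail(4, 0.5, 3))
    print(smallest_threshold(4, 0.5, 0.9))
```

Once t is known, the degree of replication follows from the problem at hand, for example n > 3t for Consensus with arbitrary failures. When failures are correlated the binomial model no longer applies, which is the situation the rest of the talk addresses.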

Limitations of the threshold model

[Figure legend] R: target degree of reliability; >R: subset of processes has reliability greater than R

An alternative to the threshold model

Desirable properties
  Expressive: scenarios in the previous slide
  Flexible: not tied to any particular way of characterizing failures
  General: widely applicable

Cores [JM03a]
  A core c: minimal reliable subset of processes
  At least one process in c is correct in every execution of the system
  Generalize subsets of size t+1

Survivor sets [JM03a]
  A survivor set: contains all the correct processes of some execution
  Generalize subsets of size n-t

Cores and Survivor sets

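R: desired degree of reliability
$r(X)$, $X \subseteq \Pi$: evaluates to the reliability of X

A subset $C \subseteq \Pi$ is a core of $\Pi$ iff
  $r(C) \ge R$
  $\forall p \in C$, $r(C - \{p\}) < R$
$C_\Pi$: set of cores of $\Pi$

A subset $S \subseteq \Pi$ is a survivor set of $\Pi$ iff
  $\forall C \in C_\Pi$, $S \cap C \neq \emptyset$
  $\forall p \in S$, $\exists C \in C_\Pi$ such that $(p \in C)$ and $((S - \{p\}) \cap C = \emptyset)$
$S_\Pi$: set of survivor sets of $\Pi$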

Cores and survivor sets are the dual of each other
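The duality can be checked mechanically on a small system. By the definition above, a survivor set is a minimal set of processes that intersects every core, and by duality the same construction applied to the survivor sets gives back the cores. The sketch below is a brute-force illustration only; the function name `minimal_hitting_sets` and the single-letter process names are assumptions made for the example, and the enumeration is exponential in the number of processes.

```python
from itertools import combinations


def minimal_hitting_sets(processes, family):
    """All minimal subsets of `processes` that intersect every set in `family`."""
    family = [frozenset(s) for s in family]
    found = []
    # Enumerate by increasing size, so any candidate that contains an
    # already-found hitting set cannot be minimal and is skipped.
    for size in range(1, len(processes) + 1):
        for candidate in combinations(sorted(processes), size):
            s = frozenset(candidate)
            if any(prev <= s for prev in found):
                continue
            if all(s & member for member in family):
                found.append(s)
    return set(found)


if __name__ == "__main__":
    # The five-process example used later in the talk (pa..pe written a..e).
    procs = "abcde"
    cores = {frozenset(c) for c in ["abc", "ad", "ae", "bd", "be", "cd", "ce", "de"]}
    survivor_sets = minimal_hitting_sets(procs, cores)
    print(sorted("".join(sorted(s)) for s in survivor_sets))
    # Applying the same construction to the survivor sets recovers the cores.
    print(minimal_hitting_sets(procs, survivor_sets) == cores)
```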

An alternative definition

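Design of algorithms
  $\Phi$: the set of allowed executions
  $up(\phi)$: the set of correct processes in execution $\phi$

A subset $C \subseteq \Pi$ is a core of $\Pi$ iff
  $\forall \phi \in \Phi$, $C \cap up(\phi) \neq \emptyset$
  $\forall C' \subset C$, $\exists \phi \in \Phi$ s.t. $C' \cap up(\phi) = \emptyset$
$C_\Pi$: set of cores of $\Pi$

A subset $S \subseteq \Pi$ is a survivor set of $\Pi$ iff
  $\exists \phi \in \Phi$ s.t. $S = up(\phi)$
  $\forall S' \subset S$, $\forall \phi \in \Phi$, $S' \neq up(\phi)$
$S_\Pi$: set of survivor sets of $\Pi$

$\langle \Pi, C_\Pi, S_\Pi \rangle$: system configuration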

An example

Blue, Red, and Yellow fail independently
Failures of Yellow processes are highly correlated
r({Red, Blue, Yellow}) = R


Another example

Blue: highly-reliable server
Red: client
Failures of Blue and Red are negatively correlated
Probability of more than 3 Red processes failing is negligible


Determining cores and survivor sets

Probability models
  E.g. Markov models used in the analysis of dynamic fault trees [Ren98]
  To find cores: minimal subset of processes s.t. the probability of total failure in the subset is negligible
  Often difficult in practice

Attribute-based model [JM02]
  Processes characterized by attributes
  Attributes determine failure correlation
  Finding a core is NP-hard

Color-based model [JM02]
  Single attribute characterizes a process
  Polynomial time algorithm to find cores

Cores/Survivor sets vs. Quorum systems

Cores, Survivor sets, Quorums
  Subsets of processes

Quorums [Giff79]
  Enforce mutual exclusion [GM85]
  E.g. one-copy serializability
  Quorums necessarily intersect
  Execute operations on behalf of the system

Cores/Survivor sets
  Do not necessarily execute operations on behalf of the system
  Weaker than quorums: no intersection requirement a priori
  Generalize objects commonly used in proofs and algorithms
    Cores: subsets of size t+1
    Survivor sets: subsets of size n-t


Motivation for Consensus

Replication often requires coordination

Coordination problems
  Atomic broadcast
  Clock synchronization
  Agreement on fault-tolerant processors (FTP)

Consensus specification

Each process begins with a proposed value $v \in V$
Goal: agree on a single value

Typical Consensus definition [Attiya98]
  Agreement: No two correct processes decide on different values
  Termination: Every correct process eventually decides
  Validity: If a process p decides on value v, then v was proposed by some process q
  Strong validity: If every process has v as its initial value, then v is the only possible decision value [Attiya98]
  Vector validity: A correct process decides on a vector vect such that [Doudou98]
    1. If $p_i$ is correct, then vect[i] has the initial value of $p_i$ or null
    2. At least t+1 elements of vect are initial values of correct processes


Synchronous systems - Crash failures

Solution for any number of failures
  Full-information algorithm (t+1 rounds, $n \ge t+1$)

Early-deciding algorithms [LF82, CB00]
  For any execution with f failures, correct processes decide in at most f+1 rounds
  Clean round: round in which no process fails
  A process receives messages from the same set of processes in two consecutive rounds
  Message complexity: $O(f \cdot |\Pi|^2)$

In the core/survivor set model

Algorithm SyncCrash [JM03a, JM03d]
  Choose a core C, preferably the smallest
  Execute the early-deciding algorithm among the processes of C
  Every process in $\Pi$ has an array of |C| positions, one for each process in C
  Processes in C send messages to processes in $\Pi - C$ as well
  A process decides when a round with no failures in C happens

(Sufficient condition on C: formula lost in extraction)

Decision in at most |C| rounds
  If |C|-1 < t, then it improves on the number of rounds
Message complexity: $O(f \cdot |C| \cdot |\Pi|)$


Synchronous systems - Arbitrary failures

Impossible if $n \le 3t$ [Lamport82]
  Strong Consensus

Proof idea
  Consensus algorithm that solves the problem for $|\Pi| \le 3t$
  Execution in which agreement is violated
  Assume $|\Pi| \le 3t$
  Partition (A, B, C) of $\Pi$ s.t. each subset has at most t processes

[Figure: three executions over the partition (A, B, C): Execution 1 (A, B, C: v), Execution 2 (A, B, C: v'), and Execution 3 (A: v; B: v'; C: *). Each panel shows the values that every subset observes from the other two subsets.]

In the core/survivor set model

Lower bound on process replication [JM03a, JM03d]
  Byzantine Partition: Every partition (A, B, C) of $\Pi$ is such that at least one of the subsets contains a core
  Byzantine Intersection:
    The intersection of every pair of survivor sets in $S_\Pi$ contains a core
    The intersection of every three survivor sets in $S_\Pi$ is not empty

[Figure: the three scenarios of the proof, Scenario (A, B, C: v), Scenario (A, B, C: v'), and Scenario (A: v; B: v'; C: *), annotated as follows: $A \cup C$ contains a survivor set $S_1$; $B \cup C$ contains a survivor set $S_2$; $A \cup B$ contains a survivor set $S_3$.]

Equivalence of Byzantine Intersection and Partition



In a partition (A, B, C):
  No subset contains a core
  All processes in A can be faulty
  All processes in B can be faulty
  All processes in C can be faulty
  $S_1 \cap S_2 \cap S_3$ is empty

[Figure: partition (A, B, C) shown once for each of the two properties, with the annotations above.]

Solving Consensus for arbitrary failures

In the threshold model: Lamport et al. [Lamport82]
  Solution for n > 3t in t+1 rounds

In the core/survivor set model
  Modified algorithm by Lamport et al.
  Solution for systems satisfying Byzantine Partition
  Replace subsets of processes of size n-t by survivor sets
  Replace majority by intersection of two survivor sets

Enables a solution for some systems
  $\Pi = \{p_a, p_b, p_c, p_d, p_e\}$
  $C_\Pi = \{\{p_a,p_b,p_c\}, \{p_a,p_d\}, \{p_a,p_e\}, \{p_b,p_d\}, \{p_b,p_e\}, \{p_c,p_d\}, \{p_c,p_e\}, \{p_d,p_e\}\}$
  $S_\Pi = \{\{p_a,p_b,p_c,p_d\}, \{p_a,p_b,p_c,p_e\}, \{p_a,p_d,p_e\}, \{p_b,p_d,p_e\}, \{p_c,p_d,p_e\}\}$
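Whether a given system meets the replication requirement can be checked by brute force. The sketch below tests Byzantine Partition and Byzantine Intersection on the five-process example and on two threshold systems. The function names and the choice to allow empty subsets in a partition are assumptions made for this illustration; the enumeration is exponential and only meant for small systems.

```python
from itertools import combinations, product


def contains_core(subset, cores):
    return any(core <= subset for core in cores)


def byzantine_partition(processes, cores):
    """Every partition (A, B, C) has at least one subset that contains a core."""
    procs = sorted(processes)
    for labels in product(range(3), repeat=len(procs)):
        parts = [frozenset(p for p, l in zip(procs, labels) if l == i) for i in range(3)]
        if not any(contains_core(part, cores) for part in parts):
            return False
    return True


def byzantine_intersection(cores, survivor_sets):
    """Pairwise intersections contain a core; three-way intersections are not empty."""
    pairs_ok = all(contains_core(s1 & s2, cores)
                   for s1, s2 in combinations(survivor_sets, 2))
    triples_ok = all(s1 & s2 & s3 for s1, s2, s3 in combinations(survivor_sets, 3))
    return pairs_ok and triples_ok


def threshold_system(n, t):
    """Cores are the subsets of size t+1, survivor sets the subsets of size n-t."""
    procs = range(n)
    cores = {frozenset(c) for c in combinations(procs, t + 1)}
    survivors = {frozenset(s) for s in combinations(procs, n - t)}
    return frozenset(procs), cores, survivors


if __name__ == "__main__":
    cores = {frozenset(c) for c in ["abc", "ad", "ae", "bd", "be", "cd", "ce", "de"]}
    survivors = {frozenset(s) for s in ["abcd", "abce", "ade", "bde", "cde"]}
    print(byzantine_partition("abcde", cores), byzantine_intersection(cores, survivors))
    for n, t in [(3, 1), (4, 1)]:
        procs, c, s = threshold_system(n, t)
        print(n, t, byzantine_partition(procs, c), byzantine_intersection(c, s))
```

On the threshold systems the two checks agree with n > 3t, which is consistent with the equivalence of the two properties stated earlier.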

Lower bound on the number of rounds

Definitions
  Replication requirement (e.g. Byzantine Partition)
  $sys' = \langle \Pi', C_{\Pi'}, S_{\Pi'} \rangle$ is a subsystem of $sys = \langle \Pi, C_\Pi, S_\Pi \rangle$ iff (conditions lost in extraction)
  A subsystem is minimal if there is no smaller subsystem

Theorem [JM03a, JM03b]: Given
  a system $sys = \langle \Pi, C_\Pi, S_\Pi \rangle$
  sys', a minimal subsystem of sys
  A, a Consensus algorithm
  $f = |\Pi'| - \min\{|S| : S \in S_{\Pi'}\}$ (maximum number of failures in the subsystem)
then
  f+1 rounds to decide (there are at least two correct processes)
  min(...) rounds to decide (all processes can be faulty but one; expression lost in extraction)

Back to the example

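$\Pi = \{p_a, p_b, p_c, p_d, p_e\}$
$C_\Pi = \{\{p_a,p_b,p_c\}, \{p_a,p_d\}, \{p_a,p_e\}, \{p_b,p_d\}, \{p_b,p_e\}, \{p_c,p_d\}, \{p_c,p_e\}, \{p_d,p_e\}\}$
$S_\Pi = \{\{p_a,p_b,p_c,p_d\}, \{p_a,p_b,p_c,p_e\}, \{p_a,p_d,p_e\}, \{p_b,p_d,p_e\}, \{p_c,p_d,p_e\}\}$

Crash failures
  Minimal subsystem: smallest core C, $|C| = 2$, $C \in C_\Pi$
  Lower bound on the number of rounds: $f + 1 = 1 + 1 = 2$ (worst case)

Arbitrary failures
  Minimal subsystem: $\langle \Pi, C_\Pi, S_\Pi \rangle$
  Lower bound on the number of rounds: $f + 1 = 2 + 1 = 3$ (worst case)

Bound is different for crash and arbitrary failures!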

Asynchronous systems

No solution for pure asynchronous systems even for a single crash failure [FLP85]
  Slow process vs. faulty process: requires a liveness property

Common approaches
  Partially synchronous systems [DLS88]
  Extend model with failure detectors [CT96]

Crash failures ($\Diamond S$ [CT96])
  Crash Partition: Every partition (A, B) of $\Pi$ is such that either A or B contains a core
  Crash Intersection: The intersection of every two survivor sets contains a core (coterie [GM88])

Arbitrary failures ($\Diamond M$ [Doudou98])
  Byzantine Partition/Intersection

Related work - Hybrid failures models

Moves away only from the identically distributed failure assumption

Different failure modes, one class for each mode [LR94]
  Manifest (c): detectable failures (e.g. corrupted messages)
  Symmetric (s): behavior deviates arbitrarily, but it is the same for every other processor (e.g. send the same erroneous value to every other process)
  Arbitrary (a): behavior deviates arbitrarily (e.g. send different values to different processes)

Algorithm for the Oral messages problem: $n > 2a + 2s + c + m$, $a \le m$


Replication requirements elsewhere

More general descriptions of failure scenarios
  Fail-prone systems [Malkhi97]
  Collusion and adversary structures (malicious players) [Hirt97]

Martin et al. [Martin02]
  Confirmable writes in quorum systems
  Property: for every subset B in a fail-prone system and every pair of quorums $Q_1$, $Q_2$, we have that $(Q_1 \cap Q_2) \setminus B \neq \emptyset$
    I.e. the intersection of every pair of quorums contains a core

Hirt and Maurer [Hirt97]
  Secure multi-party protocols
  Passive model: no pair of collusions can add up to the set of players
    I.e. the set of correct players is a coterie
  Active model: no three adversaries can add up to the set of players
    I.e. the intersection of three sets of correct players is not empty

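Generalizing n > kt (Work in progress)

Motivation: k integer

Properties establishing bounds on process replication are similar across problems

Asynchronous crash Consensus ($\Diamond W$)
  Threshold model (TM): n > 2t
  Cores/survivor sets (C/SS): $\forall S_1, S_2 \in S_\Pi : S_1 \cap S_2 \neq \emptyset$

State-machine replication: arbitrary failures
  TM: n > 2t
  C/SS: $\forall S_1, S_2 \in S_\Pi : S_1 \cap S_2 \neq \emptyset$

Synchronous arbitrary Consensus
  TM: n > 3t
  C/SS: $\forall S_1, S_2, S_3 \in S_\Pi : S_1 \cap S_2 \cap S_3 \neq \emptyset$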

Motivation: k rational

Consensus for synchronous systems with receive-omission faults

In the threshold model: $n > 3t/2$

Proof idea (partition of the processes into A, B, and C)
  Execution 1
    Processes in B and C crash
    Processes in A propose 0 and decide upon 0
  Execution 2
    Processes in A and C crash
    Processes in B propose 1 and decide upon 1
  Execution 3
    Processes in A omit to receive msgs from processes not in A
    Processes in B omit to receive msgs from processes not in B
    Processes in A propose 0 and decide upon 0
    Processes in B propose 1 and decide upon 1
    Agreement is violated!




Generalizing the partition and the intersection properties

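$(\beta, \gamma)$-Partition. For every partition $A = \{A_1, A_2, \ldots, A_\beta\}$ of $\Pi$, there is a subset $A' = \{A_{k_1}, A_{k_2}, \ldots, A_{k_\gamma}\} \subseteq A$ such that $\bigcup_i A_{k_i}$ contains a core

$(\beta, \gamma)$-Intersection. For every collection $S' \subseteq S_\Pi$ of $\beta$ survivor sets, there is a subset $S'' \subseteq S'$ with $|S''| = (\beta - \gamma) + 1$ such that $\bigcap_{S \in S''} S \neq \emptyset$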

Some intuition on the generalized properties

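$\beta = 3$, $\gamma = 2$

Threshold model, partition (A, B, C): at least one of
  $B \cup C$ contains at least t+1 processes
  $A \cup C$ contains at least t+1 processes
  $A \cup B$ contains at least t+1 processes

Core/survivor set model, partition (A, B, C): at least one of
  $B \cup C$ contains a core
  $A \cup C$ contains a core
  $A \cup B$ contains a core

$\forall S_1, S_2, S_3 \in S_\Pi : (S_1 \cap S_2) \cup (S_1 \cap S_3) \cup (S_2 \cap S_3) \neq \emptyset$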
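The generalized properties can be explored numerically on threshold systems. The sketch below is one reading of the intuition above, not the authors' formal statement: with two parameters, here called `beta` and `gamma` (names assumed, since the symbols are lost in this extraction), the partition property asks that in every partition into `beta` subsets some `gamma` of them have a union containing a core, and the intersection property asks that among every `beta` distinct survivor sets some `beta - gamma + 1` of them have a non-empty intersection. All function names are hypothetical.

```python
from itertools import combinations, product


def threshold_system(n, t):
    """Cores are the subsets of size t+1, survivor sets the subsets of size n-t."""
    procs = range(n)
    cores = {frozenset(c) for c in combinations(procs, t + 1)}
    survivors = {frozenset(s) for s in combinations(procs, n - t)}
    return frozenset(procs), cores, survivors


def partition_holds(processes, cores, beta, gamma):
    """In every partition into beta subsets (possibly empty, an assumption),
    some gamma of them have a union that contains a core."""
    procs = sorted(processes)
    for labels in product(range(beta), repeat=len(procs)):
        parts = [frozenset(p for p, l in zip(procs, labels) if l == i) for i in range(beta)]
        if not any(any(core <= frozenset().union(*chosen) for core in cores)
                   for chosen in combinations(parts, gamma)):
            return False
    return True


def intersection_holds(survivor_sets, beta, gamma):
    """Among every beta distinct survivor sets, some beta-gamma+1 intersect."""
    k = beta - gamma + 1
    for collection in combinations(survivor_sets, beta):
        if not any(frozenset.intersection(*chosen) for chosen in combinations(collection, k)):
            return False
    return True


if __name__ == "__main__":
    # n = 4, t = 2: n > 3t/2 holds (4 > 3), but n > 2t does not (4 > 4 is false).
    procs, cores, survivors = threshold_system(4, 2)
    print(partition_holds(procs, cores, 3, 2), intersection_holds(survivors, 3, 2))
    print(partition_holds(procs, cores, 2, 1), intersection_holds(survivors, 2, 1))
```

Under this reading, the parameters (3, 1) give back the Byzantine properties and (2, 1) the crash properties, while (3, 2) corresponds to a rational bound of 3/2.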

Bounds on process replication

Lower bound
  Every set of processes that satisfies $n > \beta t / \gamma$ also satisfies $(\beta, \gamma)$-Partition
  In every partition of $\Pi$ into $\beta$ subsets, there are $\gamma$ subsets s.t. the union contains at least t+1 processes, and consequently a core

Upper bound (work in progress)
  If a problem P can be solved by an algorithm A in a system satisfying $n \ge kt + 1$, $k \ge 1$, k integer, then P can be solved in a system satisfying (k,1)-Partition
  Simulate a system under the threshold model

Rational k
  Looking for a candidate algorithm to motivate

Algorithms designed under the threshold model can be automatically translated to our model, for integer k

There is no need to rethink the whole FT distributed systems world

If it simplifies, one may design an algorithm under the threshold model and later translate using our technique

Correlated failures in the real world

(work in progress)

Background: Systems considering dependent failures

Oceanstore [WMK02]
  Online mechanism to correlate failures
  Identify subsets of maximally independent failures
  Problem
    Correlates failures only after they have happened
    Not useful for malicious behavior

PASIS [BWWG02]
  Survivable storage systems
  Add correlation level to classical model of availability
  Two models to determine correlation level
    Conditional probabilities
    Beta-binomial distribution
  Problem: requires the computation of failure distributions


Coping with Internet catastrophes: Phoenix

Possible approaches
  Contain Internet pathogens: very challenging [Moore03b]
  Recover from catastrophes: replicate data

Typical replication strategy
  Assume independent host failures
  Compute a threshold t on the number of failures
  Replicate to this degree

Shared vulnerabilities
  Dependent host failures
  Independent host failures are not a suitable assumption
  Threshold t on the number of host failures
    From previous events, t can be large
    Code Red worm infected over 360,000 hosts


Our replication strategy

Desirable properties
  Enable recovery of data after an Internet catastrophe
  Small replica sets

Informed strategy for replica placement [JBMSV03]
  Sets of hosts that fail independently
  Hosts executing different sets of software systems

Classes of software systems: attributes
  E.g. operating system

Potentially vulnerable software systems: attribute values
  E.g. Linux, Windows

Replicate data on a set of hosts that have different values for each attribute: cores
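One way to picture the placement rule is a greedy search that starts from the host owning the data and keeps adding hosts until no attribute value is shared by every member of the set. This is a hypothetical sketch, not the heuristic used by Phoenix: the function name `find_core`, the host names, and the attribute values are made up for the example, and the greedy choice does not guarantee a minimal set.

```python
def find_core(origin, hosts):
    """Greedily build a set of hosts, starting at `origin`, such that no
    (attribute, value) pair is common to all members. Returns None if the
    remaining hosts cannot eliminate every shared value."""
    shared = set(hosts[origin].items())  # values still common to all members
    core = [origin]
    candidates = sorted(h for h in hosts if h != origin)
    while shared:
        # Pick the candidate that shares the fewest still-common values.
        best = min(candidates,
                   key=lambda h: len(shared & set(hosts[h].items())),
                   default=None)
        if best is None or shared <= set(hosts[best].items()):
            return None  # no candidate removes any shared value
        core.append(best)
        candidates.remove(best)
        shared &= set(hosts[best].items())
    return core


if __name__ == "__main__":
    # Hypothetical attribute configurations.
    hosts = {
        "red":    {"os": "linux",   "web_server": "apache", "browser": "firefox"},
        "green":  {"os": "windows", "web_server": "iis",    "browser": "ie"},
        "yellow": {"os": "windows", "web_server": "apache", "browser": "ie"},
        "blue":   {"os": "linux",   "web_server": "iis",    "browser": "ie"},
    }
    print(find_core("red", hosts))
    without_green = {h: cfg for h, cfg in hosts.items() if h != "green"}
    print(find_core("red", without_green))
```

With the made-up configurations above, a host that differs from the origin in every attribute yields a two-host set; when no such host exists, more hosts are needed to cover every shared value.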


An example

Attributes (two values each; the value icons are lost in extraction)
  Operating system
  Web server
  Web browser

Cores
  Red and Green (orthogonal core)
  Red, Yellow, and Blue

[Figure: hosts colored by attribute configuration, connected through Phoenix.]


Feasibility of this approach

What is the impact of diversity on storage overhead and load?
  Diversity: distribution of attribute configurations
  Storage overhead: size of the replica set (core)
  Storage load: given a host h, the number of cores in which h participates

Simulations
  Levels of diversity
  Varying attribute sets


System model

A set H of hosts
A set A of attributes
  Every attribute has the same cardinality y
A mapping M from hosts to attribute configurations

Diversity
  Determined by M
  Often skewed in practice (93% Windows) [OneStat]

Modeling diversity
  Single parameter $f \in [0.5, 1)$
  A share f of the hosts has a share (1-f) of the attribute configurations

[Figure: distribution of hosts over attribute configurations. Example 1: f = 0.5. Example 2: f = 0.75.]


Heuristic to find cores

Attributes (two values each; the value icons are lost in extraction)
  Operating system
  Web server
  Web browser

Cores
  Red and Green
  Red, Yellow, and Blue

[Figure: two panels of hosts colored by attribute configuration, connected through Phoenix.]


Summary of results

Simulated for 1,000 hosts
  8 attributes, 2 values per attribute
    f=0.7: core size = 2/2.34/6 (min/avg/max), storage load = 21
    f=0.95: core size = 2/3.49/7 (min/avg/max), storage load = 151
  8 attributes, 4 values per attribute
    f=0.7: core size = 2/2.00/2 (min/avg/max), storage load = 6
    f=0.95: core size = 2/2.01/3 (min/avg/max), storage load = 52

Conclusions
  Even for highly skewed diversity, the average core size is small
  More attribute values reduce core size variation


Wrapping up

Process failures are often non-IID

Core/survivor set model
  Enables one to model non-IID failures
  Abstracts failure probability distributions
  Generalizes objects commonly used in algorithms and proofs

Consensus
  Improves on the number of rounds
  Enables solutions in systems in which Consensus is not solvable under the threshold model

Generalizing the results for Consensus
  General lower bound on process replication
  Automatic translation of algorithms
  Compatible with previous works

The Phoenix recovery system
  An application that uses the core abstraction
  Determines cores by using attributes of hosts (shared vulnerabilities)
  Significantly reduces storage overhead compared with a solution under the threshold model
  Current status: we are working on the design of a prototype

Future work

Impact on reliability and performance
  Fewer executions allowed
  Another requirement: compute cores/survivor sets

Static vs. dynamic cores/survivor sets
  Processes joining and leaving
  Changes in reliability

Implementation issues
  Representation of cores and survivor sets
  Determining the cores/survivor sets of a system
  Applicability to the various systems

Phoenix
  Determining good sets of attributes
  Heuristics to find cores: storage overhead vs. storage load

Dissertation plan

Representation of non-IID failures
  Core/Survivor set model [JM03c]
  Application to Consensus [JM03a]

General bounds
  Lower bound
  Algorithm translation (work in progress)
  Submission to PODC 2004 (January 2004)

Phoenix [JBMSV03C]
  Method for determining attribute sets
  Heuristics to find cores that consider both storage load and overhead
  Implementation details
  Submissions to OSDI (March 2004) and NSDI (September 2004)

Timeline (Jul 1, 2003 to Dec 1, 2004)
  8/2 - 10/27: Microsoft Internship
  10/31 - 1/26: Generalization
  10/31 - 3/15: Phoenix
  3/30 - 12/1: Phoenix, journal papers and dissertation
  12/1: Thesis defense

Evaluation of quorum systems in a wide-area network [Amir96]
  Crashes are strongly correlated
    Power outages
      Computers in the same room
      Total failure during the experiment
    Wide-area outage
  Network partitions
    Quorums partially unreachable
    Computers in different segments
    Computers in the same segment
      Switching devices
      Bridges

Where IID does not apply…


Software bugs [Little01]
  Single version
    A demand may cause all replicas to crash
    E.g. state-machine replication
  Multiple independently-developed versions
    Difficulty of a demand: difficulty in handling it
    Level of difficulty varies among the demands
    More difficult demands tend to cause multiple versions to fail

Failures are not independent
  Computing a threshold is not practical
  Model of dependent failures based on shared vulnerabilities
Storage overhead is small even for highly skewed diversity
Storage load can be large
  Has to be considered by the heuristic that finds cores
  Increases the average core size

Synchronous systems - Crash failures

Solution for any number of failures
Early-deciding algorithm: decision in f+1 rounds, where f is the number of failures in a given execution [Charron-Bost and Schiper]
  1. Every process keeps an array of initial values
  2. In every round, a process:
    1. sends its array of initial values to all the other processes
    2. receives messages from other processes (array of initial values or decide)
    3. updates its array according to the received arrays
    4. decides if it receives a decide message, and then sends a decide message to all the other processes in the next round
    5. decides if the round is detected as clean, and then sends a decide message to all the other processes in the next round
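The steps above can be sketched as a small round-based simulation. This is a simplified illustration, not the algorithm of the cited paper: the function name, the crash schedule format (a process that crashes in a round reaches only the listed receivers), the rule that a round is clean when a process hears from the same senders as in the previous round, and the choice of the minimum known value as the decision are all assumptions made for the example.

```python
def early_deciding(initial, crashes=None):
    """initial: {pid: value}. crashes: {pid: (round, receivers reached in that round)}.
    Returns {pid: (decided value, round of decision)} for processes that decide."""
    crashes = crashes or {}
    procs = sorted(initial)
    known = {p: {p: initial[p]} for p in procs}   # step 1: array of initial values
    heard_prev = {p: set(procs) for p in procs}   # senders heard in the previous round
    announce = {}                                 # pid -> decision still to be broadcast
    decision = {}
    active = set(procs)
    rnd = 0
    while active:
        rnd += 1
        inbox = {p: [] for p in active}
        for p in sorted(active):
            crashing = p in crashes and crashes[p][0] == rnd
            targets = (set(crashes[p][1]) if crashing else set(active)) & active
            msg = ("decide", announce[p]) if p in announce else ("values", dict(known[p]))
            for q in targets:
                inbox[q].append((p, msg))
        # Crashed processes stop; processes that just announced a decision halt.
        active = {p for p in active
                  if not (p in crashes and crashes[p][0] == rnd) and p not in announce}
        for p in sorted(active):
            senders = {s for s, _ in inbox[p]}
            decided = [m[1] for _, m in inbox[p] if m[0] == "decide"]
            for _, m in inbox[p]:
                if m[0] == "values":
                    known[p].update(m[1])
            if decided:                            # step 2.4: adopt a received decision
                value = decided[0]
            elif senders == heard_prev[p]:         # step 2.5: round detected as clean
                value = min(known[p].values())
            else:
                heard_prev[p] = senders
                continue
            decision[p] = (value, rnd)
            announce[p] = value
    return decision


if __name__ == "__main__":
    values = {0: 5, 1: 3, 2: 8, 3: 1}
    print(early_deciding(values))                    # no failures
    print(early_deciding(values, {3: (1, {1})}))     # process 3 crashes in round 1, reaching only process 1
```

In the second run only process 1 learns the value of the crashed process in round 1; the other processes adopt its decision one round later, so all surviving processes agree after f+1 = 2 rounds.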


Upper bound on process replication

Conjecture: Suppose a correct algorithm A that requires $n > kt$ under the threshold model
  Replace $(n - k' \cdot t)$, $0 < k' < k$, by the intersection of $k'$ survivor sets in A to generate algorithm A'
  Transformed algorithm A' is correct

Intuition: k=3
  In every execution: at least 2t+1 correct processes
  Survivor sets: subsets of processes of size 2t+1
  Cores: subsets of size t+1
  Every two subsets of size 2t+1 (survivor sets) intersect in at least t+1 processes (a core)
  The intersection of two survivor sets contains at least one correct process
  At least one intersection of two survivor sets contains only correct processes


Solving Consensus for arbitrary failures

Algorithm by Lamport, Shostak, and Pease [Lamport82]
  Each process keeps a copy of the tree
  Level i of the tree: values received at round i
  In round 0, a correct process broadcasts its initial value
  In round i, a correct process sends the values at level i-1

[Figure: tree kept by each process, with depths 0, 1, 2, ..., t, t+1 (and generic depths l-1 and l); nodes are labeled by sequences of distinct process identifiers $p_{i_1} p_{i_2} \ldots p_{i_k}$.]

Each correct process
  Traverses the tree in post-order
  Evaluates each node as follows
    If the node is a leaf, then it evaluates to its current value
    Otherwise, it uses the majority (null if there is no majority)

Claim: if process p is correct, then
  The value of node($\sigma p$) is the same in every correct process q
  It is the value p sent at round $|\sigma p|$ regarding $\sigma$

Proof idea
  Recursion on the levels of the tree
  Base case: level t+1 (leaves)
  Ind. hypothesis: claim valid for every level $l \le t+1$
  Ind. step: prove for l-1

[Figure: leaves of the tree at depth t+1, as evaluated at round t+1.]

Algorithm for the core/survivor set model

Adapt the algorithm from Lamport et al.
  In the original algorithm
    [Figure: tree with depth t and depth t+1 (leaves)]
  In our algorithm: given a system $\langle \Pi, C_\Pi, S_\Pi \rangle$
    Intersection of two survivor sets instead of majority
    [Figure: tree with depth l and depth l+1 (leaves)]

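$\Pi = \{p_a, p_b, p_c, p_d, p_e\}$
$C_\Pi = \{\{p_a,p_b,p_c\}, \{p_a,p_d\}, \{p_a,p_e\}, \{p_b,p_d\}, \{p_b,p_e\}, \{p_c,p_d\}, \{p_c,p_e\}, \{p_d,p_e\}\}$
$S_\Pi = \{\{p_a,p_b,p_c,p_d\}, \{p_a,p_b,p_c,p_e\}, \{p_a,p_d,p_e\}, \{p_b,p_d,p_e\}, \{p_c,p_d,p_e\}\}$

In the threshold model
  Threshold on the number of failures: 2
  Minimum number of processes: 3t+1 = 7
  There is no solution!

[Figure: tree for the example system.]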

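Executing the algorithm

$\Pi = \{p_a, p_b, p_c, p_d, p_e\}$
$C_\Pi = \{\{p_a,p_b,p_c\}, \{p_a,p_d\}, \{p_a,p_e\}, \{p_b,p_d\}, \{p_b,p_e\}, \{p_c,p_d\}, \{p_c,p_e\}, \{p_d,p_e\}\}$
$S_\Pi = \{\{p_a,p_b,p_c,p_d\}, \{p_a,p_b,p_c,p_e\}, \{p_a,p_d,p_e\}, \{p_b,p_d,p_e\}, \{p_c,p_d,p_e\}\}$

$p_a$ and $p_c$ are faulty

[Figure: state of the tree at Time 1, Time 2, and Time 3. Legend: nodes with possibly different values across correct processes; nodes with the same value across correct processes.]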

Asynchronous systems

No solution for pure asynchronous systems even for a single crash failure [FLP85]

Slow process vs. faulty process
  Requires a liveness property
  Approach 1: consider more realistic timing assumptions
    Partially synchronous systems [DLS88]
    Difficult to evaluate parameters in practice
  Approach 2: extend model with failure detectors [CT96]
    Unreliable failure detectors


Asynchronous systems - Crash failures

$\Diamond W$ is the weakest class of failure detectors that enable a solution to Consensus [CHT96]
  Weak completeness: eventually every process that crashes is suspected by some correct process
  Eventual weak accuracy: there is some correct process which is eventually not suspected by any other correct process

Lower bound on process replication: $n \ge 2t+1$

Proof idea (partition of the processes into A and B):
  Initial value of processes in A: v. Processes in A decide v
  Initial value of processes in B: v'. Processes in B decide v'
  No faulty process. Messages from A to B are delayed. Processes in A decide v and processes in B decide v'. Agreement is violated


An algorithm

Rotating coordinator paradigm [CT96]
  Assumes
    Strong completeness: eventually every correct process suspects forever every faulty process

[Figure: one round of the algorithm for t=2, n=5, in an execution with no suspicions or failures]
  Every process sends an estimate message to the coordinator
  Coordinator gathers t+1 estimates and proposes a new estimate
  Processes acknowledge the reception of an estimate from the coordinator
  Coordinator gathers t+1 acks and broadcasts a decide message


Proof of correctness

No correct process stops (does not decide, does not move on) in a round i
  A correct process either
    Decides in a round
    Eventually suspects the coordinator (Strong Completeness) and moves on to the next round

Eventually there is a round in which the coordinator is not suspected by any correct process
  Ensured by Eventual Weak Accuracy

If not all processes decide in the same round
  Once some process decides, the decision value is "locked"


In the core/survivor set model

Lower bound on process replication [JM03d]
  Crash Partition: There is no partition (A, B) of the processes in $\Pi$ such that none of the subsets contains a core
  Crash Intersection: The intersection of every two survivor sets contains a core

Crash Partition $\Leftrightarrow$ Crash Intersection

Bound is tight: Chandra and Toueg's algorithm modified
  In the original algorithm: coordinator waits for n-t replies
  In our algorithm: coordinator waits for a reply from a survivor set


Proof idea

Layering technique [Keidar]
  Layer: [p, [i]]
    Process p fails but sends messages to processes $p_i, \ldots, p_n$

Apply layers to system states
  A state is composed of the states of the processes
  Similar states x, y: only a single process can distinguish x from y
  A set of states is similarly connected iff for every pair of states in the set, there is a chain of similar states connecting them

Set of initial states is similarly connected
Applying layers to a similarly connected set of states generates another similarly connected set of states
Cannot apply layers indefinitely


Asynchronous systems - Arbitrary failures

Faulty processes can behave arbitrarily
  Correct to a subset of processes
  Strong completeness does not make sense

Mute process [Doudou98]
  A process $p_j$ is mute to a process $p_i$ iff there is a time t after which $p_j$ stops sending messages to $p_i$ forever

Mute completeness
  Every process $p_i$ eventually suspects forever a process $p_j$ that is mute to $p_i$
  Equivalent to $\Diamond S$ if processes fail only by crashing


Lower bound on process replication

Lower bound: n ≥ 3t + 1 (Strong Consensus)
Proof idea: assume n ≤ 3t, and a partition (A, B, C) of the processes such that each of A, B, and C contains at most t processes

Scenario 1: All processes in B crash at time 0
  Processes in A and C propose value v and decide v
Scenario 2: All processes in A crash at time 0
  Processes in B and C propose value v’ and decide v’
Scenario 3: All processes in C are arbitrarily faulty
  Processes in C behave toward processes in A as in Scenario 1, and toward processes in B as in Scenario 2
  Messages from A to B and conversely are delayed until after the last process decides
  Processes in A propose v, and processes in B propose v’
  Processes in A cannot distinguish Scenario 1 from Scenario 3
  Processes in B cannot distinguish Scenario 2 from Scenario 3
  Processes in A and B decide upon different values (agreement violation)
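The argument needs a three-way partition in which any single block may be entirely faulty. The sketch below shows the arithmetic, assuming (as in the usual form of this argument) that each block holds at most t processes; the function name is hypothetical. Such a partition exists exactly when n ≤ 3t.

```python
def three_way_partition(processes, t):
    """Split processes into blocks (A, B, C) of at most t each, or return None.

    A and B take up to t processes each; C takes the remainder, which has
    size n - 2t <= t whenever n <= 3t.
    """
    processes = list(processes)
    if len(processes) > 3 * t:
        return None  # with n >= 3t + 1 some block would exceed t processes
    return processes[:t], processes[t:2 * t], processes[2 * t:]
```

For n = 6 and t = 2 this yields three blocks of two processes, so the three scenarios can be constructed; for n = 7 and t = 2 no such partition exists.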


An algorithm for Vector Consensus

Requires
  n ≥ 3t + 1
  Digitally signed messages
  Certificates
    Certify message content
    E.g. Decision message has to contain enough Estimate messages from other processes
  Each process has a list of faulty processes
    FIFO channels: out of order messages
    Corrupted messages
1st stage
  Each process broadcasts its initial value
  Each process composes a proposed vector with received values
Move on to the next round after receiving at least 2·t+1 current estimates
Processes exchange suspicion messages

An algorithm for Vector Consensus (cont.)

2nd stage: asynchronous rounds of message exchange
  (coordinator) Forward estimate received from the coordinator
  Coordinator’s estimate
    Decide after receiving at least 2·t+1 estimate msgs
  Coordinator crashes and does not send estimate
    Processes exchange current estimates after receiving at least 2·t+1 suspicion messages



In the core/survivor set model

Byzantine Intersection/Partition is necessary and sufficient [JM03d]
Necessity proof
  Assume Byzantine Partition does not hold
  Scenario in which processes decide upon different values
Sufficiency proof
  Modify algorithm by Doudou and Schiper
  Original algorithm: process waits for messages from 2/3 of the processes
  In our algorithm: process waits for messages from a survivor set
Observation
  In the original protocol: wait for t+1 suspicion messages
  In our algorithm: wait for messages from processes in S1 ∩ S2, where S1 and S2 are survivor sets
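The change from counting replies to waiting for a survivor set can be sketched as a predicate over the set of senders heard from so far. This is an illustration only: the function names are hypothetical, and the threshold-model helper assumes that the survivor sets there are exactly the subsets of n − t processes.

```python
from itertools import combinations

def heard_from_survivor_set(senders, survivor_sets):
    """True once the senders include every process of some survivor set."""
    senders = frozenset(senders)
    return any(survivor <= senders for survivor in survivor_sets)

def threshold_survivor_sets(processes, t):
    """Threshold model (assumption): survivor sets are the subsets of size n - t."""
    processes = sorted(processes)
    size = len(processes) - t
    return [frozenset(c) for c in combinations(processes, size)]
```

In the threshold model the predicate behaves like a count of n − t replies. With dependent failures it can be satisfied by a smaller set, for example two processes that are known never to fail together with the rest.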

Generalizing the partition and the intersection properties

(…, …)-Partition. For every partition A = {A1, A2, …} of …, there is a subset A' = {…} such that: … contains a core

(…, …)-Intersection. For every …: …

Upper bound on process replication

[Figure: a physical system with processes a, b, c, d, e, f simulating a virtual system with processes a', b', c', d', e', f', g', h', i'. Labels: Physical process, Simulated processes, Physical system, Virtual system]

Every core in the virtual system (subset of 3 processes) is simulated by a core in the physical system

Every subset of size 3 in the virtual system contains at least one correct process


Proposed algorithm

Algorithm: given a system, let x be the size of the largest core
  Any process p in the system simulates at most (x - xp + 1) virtual processes
Conjecture: necessary and sufficient for any subset of t+1 processes in the virtual system to map to a core in the physical system
  Necessity: straightforward (counterexample)
  Sufficiency:
    There are sufficient physical processes to simulate virtual processes
    Byzantine Partition ⇒ t+1 processes map to a core
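The mapping condition can be checked mechanically on a small example. The sketch below assumes one reading of "map to a core": the physical processes that simulate any t+1 virtual processes must together contain a physical core. The assignment and the physical cores used here are hypothetical, chosen only to fit a system with six physical and nine virtual processes; they are not taken from the slides.

```python
from itertools import combinations

def contains_core(group, cores):
    """True if some core is a subset of `group`."""
    return any(core <= group for core in cores)

def simulation_is_safe(simulated_by, physical_cores, t):
    """Check that every t + 1 virtual processes are hosted by a set containing a core.

    `simulated_by` maps each virtual process to the physical process running it.
    """
    for subset in combinations(sorted(simulated_by), t + 1):
        hosts = frozenset(simulated_by[v] for v in subset)
        if not contains_core(hosts, physical_cores):
            return False
    return True

# Hypothetical assignment: a, b, c each simulate two virtual processes.
simulated_by = {"a'": "a", "b'": "a", "c'": "b", "d'": "b", "e'": "c",
                "f'": "c", "g'": "d", "h'": "e", "i'": "f"}
# Hypothetical cores: any pair touching a, b, or c, plus {d, e, f}.
physical_cores = [frozenset(p) for p in combinations("abcdef", 2)
                  if set(p) & set("abc")] + [frozenset("def")]
```

With these assumed cores, every group of three virtual processes is hosted by at least two physical processes, and any such host set contains a core, so the check passes for t = 2.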


Our replication strategy

Classes of software systems: attributes
  E.g. Operating system
Potentially vulnerable software systems: attribute values
  E.g. Linux, Windows
Replicate data on a set of hosts that have different values for each attribute: cores
Tolerating the failure of k values
  No permutation of k attribute values covers all the hosts in a core
  Current assumption: k=1
    At least two distinct values per attribute in a core
Definitions
  Attribute configuration: attribute values of a host
  Diversity: distribution of attribute configurations

Choosing a core

Decision problem is NP-Complete (Set cover)
Finding a core for host hi

1. Make a list L of hosts orthogonal to hi
2. If L is not empty
   1. Choose a host hj s.t. hj ∈ L;
   2. Return {hi, hj};
3. Else
   1. R ← {hi};
   2. Make a list L’ of hosts that have different attribute configurations;
   3. For each attribute a in A, choose randomly a host hj in L’ s.t. hj has a different value for a;
   4. R ← R ∪ {hj};
   5. Repeat 3 and 4 until R covers all attributes or L’ is empty;
   6. Return R.
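The heuristic above can be sketched in a few lines. This is an illustrative reading, not the authors' implementation: it assumes a host's configuration is a tuple of attribute values, that two hosts are "orthogonal" when they differ in every attribute, and that an attribute is covered once some chosen host differs from hi on it. The names `orthogonal` and `find_core` are hypothetical.

```python
import random

def orthogonal(cfg_a, cfg_b):
    """Assumed meaning: the two configurations differ in every attribute."""
    return all(x != y for x, y in zip(cfg_a, cfg_b))

def find_core(hi, hosts, rng=random):
    """hosts maps a host name to its tuple of attribute values."""
    cfg = hosts[hi]
    others = sorted(h for h in hosts if h != hi)
    # Steps 1-2: a single orthogonal host gives a two-host core.
    orth = [h for h in others if orthogonal(hosts[h], cfg)]
    if orth:
        return {hi, rng.choice(orth)}
    # Step 3: add hosts until every attribute has a second value.
    core = {hi}
    candidates = [h for h in others if hosts[h] != cfg]
    uncovered = list(range(len(cfg)))
    while uncovered and candidates:
        attr = uncovered[0]
        options = [h for h in candidates if hosts[h][attr] != cfg[attr]]
        if not options:
            uncovered.pop(0)  # no remaining host differs on this attribute
            continue
        hj = rng.choice(options)
        core.add(hj)
        candidates.remove(hj)
        # Keep only the attributes on which hj still matches hi.
        uncovered = [a for a in uncovered if hosts[hj][a] == cfg[a]]
    return core
```

For example, with hosts `("linux", "apache")` and `("windows", "iis")` the first branch returns a two-host core, while removing the second host forces the greedy branch to combine one host that differs on the operating system with another that differs on the web server.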

Core size for scenario 8/2

1,000 hosts
8 attributes [ICAT]
2 values per attribute
  “Linux vs. Windows”
Average core size is small even for highly skewed diversity

Core size for scenario 8/4

1,000 hosts
8 attributes
4 values per attribute
More attribute values reduce core size variation

Storage load

1,000 hosts
For highly skewed diversity, storage load can be high

System design issues

Fully-distributed system
  No single point of failure
  Leverage research on P2P systems
Announcing available configurations
  DHT-based approach
Encryption scheme to protect against data corruption
Recovering from a catastrophe
  Time to recover is not critical
  Coping with a large number of requests
    Threshold on the number of accepted requests
    Exponential backoff
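The two mechanisms for coping with a burst of recovery requests can be sketched together. This is a generic illustration, not the system's design: the class name, the function name, and the parameters (`max_accepted`, `base`, `cap`) are all hypothetical.

```python
class RecoveryServer:
    """Accepts at most `max_accepted` concurrent recovery requests."""

    def __init__(self, max_accepted):
        self.max_accepted = max_accepted
        self.accepted = 0

    def try_accept(self):
        """Return True if the request is admitted, False if it must retry later."""
        if self.accepted >= self.max_accepted:
            return False
        self.accepted += 1
        return True

    def finish(self):
        """Release one slot when a recovery completes."""
        self.accepted = max(0, self.accepted - 1)

def backoff_delays(base, attempts, cap):
    """Delay before each retry: doubles every attempt, bounded by `cap`."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]
```

A rejected client would wait for the next delay in the sequence before retrying. Since the time to recover is not critical, long delays are acceptable, and in practice a random jitter is often added so that clients do not retry in lockstep.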

Lower bound on process replication

Claim: Every set of processes that satisfies …, also satisfies (…, …)-Partition

Proof idea. Given a set …, construct a partition as follows:
  A1 … Al: t/… processes
  Al+1 … A…: t/… processes
  There is at least one subset of … elements Ai such that the union of these subsets contains t processes
  Add one process to …

[Figure: partition blocks A1 … Ak, Ak+1 …, with a legend for the integral part and the fractional part]

Upper bound on process replication

Claim: If a problem P can be solved by an algorithm A in a system satisfying n ≥ k·t + 1, k ≥ 1, k integer, then P can be solved by a system satisfying (k,1)-Partition

Suppose that A requires k=4
System satisfying (4,1)-Partition
  Maximum number of failures: 2
Virtual system defined under the threshold model
  Process set: {a, b, c, d, e, f, g, h, i}
  Satisfies n ≥ 4t + 1 for t = 2
  Simulate the virtual system with …


Future work

Impact on reliability and performance
  Fewer executions allowed
    What are the chances that an execution not assumed happens?
  Another requirement: compute cores/survivor sets
Static vs. dynamic cores/survivor sets
  Processes joining and leaving
  Changes in reliability
Implementation issues
  Representation of cores and survivor sets
  Determining the cores/survivor sets of a system
  Applicability on the various systems

Future work

Applicability of the Consensus solutions
  Look at existing systems that use Consensus as a primitive
  Evaluate the benefits in practice of using our solutions
Solutions for hybrid failure models
  Translate … to our model

Future work

No protocols with rational k so far
  Any known candidate?

Finish formal proof of algorithm translation

Future work

How do we determine the attributes?
  Resilience depends on the attributes
  Vulnerability databases
  Dynamic attributes: new attributes and values
How many attributes do we need?
  The number of attributes impacts storage overhead
What is a good level of granularity for the attributes?
  E.g. {Windows} vs. {Win_95, Win_98, Win_2000, Win_XP}
Other challenges
  Heuristics for finding cores: storage overhead and storage load
Efficacy
  How do we assess the efficacy of a prototype?
  Major Internet incidents are not so frequent

Generalizing the partition and the intersection properties

(…, …)-Partition. For every partition A = {A1, A2, …} of …, there is a subset A' = {…} such that: … contains a core

(…, …)-Intersection. For every …: …

Generalizing the partition and the intersection properties

Given …, and integers …:

(…, …)-Partition. For every partition A = {A1, A2, …} of …, there is a subset A' = {…} such that: … contains a core

(…, …)-Intersection. For every {S1, S2, …}, there is a subset …, such that: …


