Finding Liveness Bugs In Distributed Systems

Posted on 20-Dec-2015


TRANSCRIPT

Page 1:

Finding Liveness Bugs In Distributed Systems

R. Jhala [C. Killian, J. Anderson, A. Vahdat], UC San Diego

Page 2:

Concurrent, Distributed Systems

Stock Exchanges, Telecoms, Commuter Rail

Page 3:

Concurrent, Distributed Systems

System: Nodes exchanging Messages

Execution:
1. Node gets a message event
2. Executes the event handler
   - Updates node state
   - Sends new messages
3. Repeat…
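The loop above can be sketched in a few lines. This is a hypothetical toy (not Mace's actual API): nodes are names, each event is a (node, message) pair, and a handler may update the node's state and send new messages.

```python
import collections

# Toy sketch of the execution model: drain a FIFO queue of
# (node, message) events; each delivery runs a handler that may
# update node state and return newly sent messages.
def run(nodes, handlers, initial_events):
    """Run events to quiescence; return the final per-node state."""
    queue = collections.deque(initial_events)
    state = {n: {} for n in nodes}
    while queue:
        node, msg = queue.popleft()                 # 1. node gets message event
        sent = handlers[msg["type"]](node, state[node], msg)  # 2. run handler
        queue.extend(sent)                          # handler's new messages
    return state                                    # 3. repeat until quiescent
```

A real system never quiesces and interleaves deliveries across nodes; the scheduler that picks the next event is exactly where the rest of the talk intervenes.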

Page 4:

Distributed Systems: Challenges

System: Nodes exchanging Messages

Challenges
- Nodes: enter, leave, fail
- Messages: reordered, lost

The system must stay available:
- Eventually, all nodes regroup
- Eventually, all packets are delivered
- Eventually, something good happens

Liveness Properties

Page 5:

The Space of System Executions

[Diagram: an execution tree over two nodes, branching from the Initial State on choices such as event@1, event@2, fail@1, fail@2]

At each state, the scheduler picks:
1. A node n
2. An event @n
3. Executes the event code

Page 6:

An Execution = Sequence of Choices

[Diagram: the same execution tree, with one path of choices (event@1, event@2, fail@1, ...) highlighted]


Page 10:

Bad States

Safety Bugs: an execution that drives the system to a bad state

Bad States:
• Null dereferences
• Buffer overflows
• Assertion failures
• Low-level crashes

[Diagram: an execution (event@2, fail@2) leading from the initial state to a bad state]

Page 11:

How to find Safety Bugs?
Find a path from the Initial State to the Bad States
by systematically exploring executions
(iterating over sequences of choices).
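The systematic exploration can be sketched as a bounded breadth-first search. This is an illustrative toy interface (invented here, not a real model checker's API): `enabled` lists the schedulable choices in a state, `step` executes one.

```python
from collections import deque

# Toy bounded search: enumerate sequences of scheduler choices
# breadth-first and return the first sequence that drives the
# system into a bad state.
def find_safety_bug(initial, enabled, step, is_bad, max_depth):
    """BFS over executions; returns a counterexample trace or None."""
    frontier = deque([(initial, [])])
    while frontier:
        state, trace = frontier.popleft()
        if is_bad(state):
            return trace                      # sequence of choices to the bug
        if len(trace) < max_depth:
            for choice in enabled(state):     # every schedulable event here
                frontier.append((step(state, choice), trace + [choice]))
    return None                               # no bad state within the bound
```

Real checkers add state hashing and partial-order reduction to tame the branching, but the shape (iterate over sequences of choices) is the same.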

Page 12:

Model Checking for Safety Bugs

Find a path from Initial to Bad by systematically exploring executions
[Verisoft 97, Cmc 04, Chess 07]

Page 13:

Safety Properties are too Low Level

Find a path from Initial to Bad by systematically exploring executions
[Verisoft 97, Cmc 04, Chess 07]

Page 14:

Safety Properties are too Low Level

Distributed systems are designed for crashes and failures.

Challenge: end-to-end problems (Liveness bugs)

Page 15:

Live States

[Diagram: the state space, with the Initial State, Bad States, and Live States regions]

Good States: all nodes regroup; all packets delivered
Live States: eventually something good happens

Page 16:

Live Executions

[Diagram: executions from the Initial State that reach the Live States]

Page 17:

Liveness Violations

[Diagram: an execution from the Initial State that avoids the Live States]

A liveness violation: an execution that never reaches a live state.

Page 18:

How to Find Liveness Violations?

Explore all executions? There are infinitely many...

Page 19:

How to Find Liveness Violations?

Explore all executions up to a bound?

Combinatorial explosion limits exhaustive search to depth < 50,
but liveness violations appear at depths >> 50.
[Verisoft 97, Cmc 04, Chess 07]

Page 20:

How to Find Liveness Violations?

Looks pretty hopeless...

Page 21:

Idea 1: Dead States

Dead States: no execution can reach the live states; recovery is impossible.

Page 22:

Idea 1: Dead States

To find Liveness bugs, look for Dead executions.
But how to tell if a state is Dead?

Page 23:

Idea 2: Random Walks

Execute long random walks from the state:
- from a Dead state: Pr[reaching live] = 0
- from a recoverable state: Pr[reaching live] = 1

Page 24:

Executions and Random Walks

At each execution step:
1. Scheduler picks a node n
2. Scheduler picks an event @n
3. Executes the event code

Random Walk: the scheduler picks randomly
(from some probability distribution over nodes and events).
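A random-walk scheduler is a one-liner on top of the execution model. This sketch uses the same invented toy interface as before (`enabled`, `step`); here the distribution is uniform over enabled events.

```python
import random

# Toy random-walk scheduler: at each step, execute one of the
# enabled events chosen uniformly at random, up to max_steps.
def random_walk(state, enabled, step, is_live, max_steps, rng=random):
    """Walk up to max_steps; return True iff a live state was reached."""
    for _ in range(max_steps):
        if is_live(state):
            return True
        events = enabled(state)
        if not events:                            # nothing schedulable
            break
        state = step(state, rng.choice(events))   # the random choice
    return is_live(state)
```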

Page 25:

Liveness Bugs = Search + Random Walks

1. Systematic Search: find candidate states
2. Random Walk: test if a candidate is dead

Iterate.
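Combining the two ideas gives the overall procedure. This is a toy sketch, not MaceMC itself; all names and interfaces are invented, and the walk count and length are arbitrary illustration values. A candidate from which no long random walk reaches a live state is flagged as likely dead.

```python
import random

# Toy sketch of search + random walks: the systematic search supplies
# candidate states; each is probed with several long random walks.
def find_liveness_bug(candidates, enabled, step, is_live,
                      walk_len=1000, num_walks=20, rng=random):
    for cand in candidates:                       # from systematic search
        if not any(_walk(cand, enabled, step, is_live, walk_len, rng)
                   for _ in range(num_walks)):
            return cand                           # no walk recovered: likely dead
    return None                                   # every candidate looked live

def _walk(state, enabled, step, is_live, walk_len, rng):
    for _ in range(walk_len):
        if is_live(state):
            return True
        events = enabled(state)
        if not events:
            break
        state = step(state, rng.choice(events))
    return is_live(state)
```

Note the verdict is probabilistic: a walk that never reaches liveness is only evidence of deadness, which is why the walk length must dwarf the typical steps-to-liveness (next slide).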

Page 26:

Liveness Bugs = Search + Random Walks

If the walk length >> avg. steps to liveness,
then a non-live walk is likely a liveness bug!

But a 100,000-step execution (vs. ~1k events typically needed to reach liveness) yields a 2 GB log file.
How to pinpoint the bug?

Page 27:

Idea 3: The Critical Transition

The Critical Transition: the step where the system moves from a recoverable state to a dead state.
How to find the Critical Transition without knowing the Dead States?
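The answer on the next slides is binary search over the execution, using random walks as the recoverability test. A toy sketch (invented interface, not MaceMC's): `replay(prefix)` reconstructs the state after a prefix of the trace, and `is_recoverable(state)` returns True if some random walk from that state reaches a live state. It assumes the empty prefix is recoverable and the full trace is dead.

```python
# Toy binary search for the critical transition in a non-live trace.
def critical_transition(trace, replay, is_recoverable):
    """Return the length of the shortest dead prefix;
    trace[result - 1] is the critical transition event."""
    lo, hi = 0, len(trace)        # invariant: trace[:lo] recoverable,
    while lo + 1 < hi:            #            trace[:hi] dead
        mid = (lo + hi) // 2
        if is_recoverable(replay(trace[:mid])):
            lo = mid              # still recoverable after mid events
        else:
            hi = mid              # already dead after mid events
    return hi
```

This turns a 100,000-event trace into about log2(100,000) ≈ 17 recoverability probes, each a batch of random walks.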

Page 28:

Idea 3: The Critical Transition

Binary Search using Random Walks!


Page 30:

Idea 3: The Critical Transition

The Critical Transition: where the system moves from a recoverable to a dead state.
Finding it pinpoints the bug.

Page 31:

Recap

Liveness Bugs Found: the system has shot itself (but doesn't know it yet)

Systematic Search: finds candidate dead states

Random Walks: determine if a candidate is dead

Critical Transition: the event that makes recovery impossible

Page 32:

Bells and Whistles (1/2)

Random Walk Bias
• Assign "likely" events higher weight
• e.g. application > network > timer > fail

Bugs are not missed
• The random walk only tests deadness

Live states are reached sooner
• Error traces are shorter and simpler
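The bias amounts to a weighted choice instead of a uniform one. The event classes come from the slide; the numeric weights below are invented for illustration.

```python
import random

# Toy weighted scheduler bias: "likely" event classes get higher
# weight, so application events are scheduled more often than
# network, timer, and failure events.
WEIGHTS = {"app": 8, "net": 4, "timer": 2, "fail": 1}   # hypothetical values

def pick_event(events, rng=random):
    """Weighted random choice over pending events by event class."""
    return rng.choices(events,
                       weights=[WEIGHTS[e["class"]] for e in events])[0]
```

Because every class keeps nonzero weight, all interleavings stay reachable, which is why the bias cannot cause bugs to be missed.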

Page 33:

Bells and Whistles (2/2)

Prefix-Based Search
• Restart the search after reaching liveness
• Analyzes the effect of failures in "steady state"

Page 34:

Evaluation

[Diagram: a Mace (C++) system plus Liveness Properties are fed to MaceMC, which outputs Liveness Bugs and the Critical Transition]

Page 35:

Systems

RandTree: random overlay tree with a maximum degree.
MaceTransport: user-level, reliable messaging service.
Pastry: key-based routing, using an overlay ring.
Chord: key-based routing, using an overlay ring.

Page 36:

Liveness Properties

RandTree: eventually, all nodes form a single tree.
MaceTransport: eventually, all messages are acknowledged.
Pastry: eventually, all nodes form a ring.
Chord: eventually, all nodes form a ring.

Page 37:

Sample Bug: RandTree

Nodes with Child and Parent pointers.
Property: eventually the nodes form a tree.

Page 38:

Sample Bug: RandTree

1. C requests to join under A
2. A sends an ack
3. C fails and restarts
4. C ignores the ack from A
5. C joins under B

Bug: the system is stuck as a DAG!
C's failure was not propagated to A.

Page 39:

Liveness Bugs Yield Safety Assertions

Dead States: violations of a priori unknown safety properties.

Critical Transition: helps identify dead states; yields new safety properties and bugs.

Page 40:

New Safety Property: Chord

Nodes with Fwd and Back pointers.
Property: eventually the nodes form a ring.

Critical Transition to a Dead State where: n.back = n, n.fwd = m

New Safety Property: IF n.back = n THEN n.fwd = n
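The learned property can be checked as an ordinary runtime assertion. The `fwd`/`back` field names come from the slide; the `Node` record here is a toy stand-in for a Chord node.

```python
# The invariant from the slide as a checkable predicate: a node whose
# back pointer is itself (it believes it is alone in the ring) must
# not point forward to another node.
class Node:
    def __init__(self):
        self.fwd = self          # a fresh node forms a singleton ring
        self.back = self

def ring_safety_ok(n):
    """IF n.back == n THEN n.fwd == n."""
    return n.fwd is n if n.back is n else True
```

Violations of this predicate are the dead-state shape found above (n.back = n, n.fwd = m), so asserting it catches the bug the moment it happens instead of after a 100,000-step walk.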

Page 41:

Scorecard

System         Bugs  Liveness  Safety
MaceTransport    11         5       6
RandTree         17        12       5
Pastry            5         5       0
Chord            19         9      10
Totals           52        31      21

Several "protocol level" bugs; MaceMC is routinely used by Mace programmers.

Page 42:

Programming Challenges

How to handle unexpected events?

How to propagate the effects of failures?

How to limit the impact on performance?

Page 43:

Take Away Message

Liveness bugs are very important.
Randomness helps.

Page 44:

www.macesystems.org(papers, code, etc.)