Post on 21-Dec-2015
© 2006 Andreas Haeberlen, MPI-SWS
The Case for Byzantine Fault Detection
Andreas Haeberlen
MPI-SWS / Rice University
Petr Kouznetsov MPI-SWS
Peter Druschel MPI-SWS
Challenge: Byzantine faults
  Distributed systems are subject to a variety of failures and attacks
    Hacker break-in
    Freeloading
    Censorship
    Data corruption
    Software/hardware failure
  Byzantine failure model: Faulty nodes may exhibit arbitrary behavior
  Dependable systems must be protected against Byzantine faults
Existing approach: Fault tolerance
  Byzantine fault tolerance (BFT) can mask a limited number of Byzantine faults
  Example: Castro and Liskov [OSDI'99]
  [Figure: a client and a set of server replicas]
Alternative approach: Fault detection
  Nodes monitor each other for faulty behavior
  When a fault occurs, the correct nodes
    identify the faulty node(s)
    distribute evidence of the fault
  Nodes can isolate the faulty node and initiate recovery
  Byzantine Fault Detection (BFD)
  [Figure: three panels showing nodes A, B, C, D, and E, with the labels "Set X=5", "OK", "X=?", "X=7", "E: X=5", and "7!"]
Best approach depends on the application
  Typical application for Fault Tolerance: Air traffic control
    Failures may be fatal!
    Goal: Mask fault symptoms
    Delays negligible, bandwidth plentiful, few nodes
    [Figure: machine room]
  Typical application for Fault Detection: Inter-domain routing
    Best-effort service
    Goal: Find faulty components
    Wide-area delays, limited bandwidth, many nodes
    [Figure: networks labeled AT&T, Sprint, and Level3]
Detection can provide accountability
  In an accountable system:
    Actions are undeniable
    State is tamper-evident
    Correctness can be certified
  Good nodes can provide evidence that they are good
  Bad nodes cannot hide evidence of misbehavior
  Proven concept in society
    Banking, administration ...
  Desirable for distributed systems [Yumerefendi05]
    Example: Building trust in federated systems
What about performance?
  If up to f nodes can be faulty, we need f+1 replicas to guarantee detection (fault tolerance: 3f+1)
    More throughput using the same resources
    Works even when >33% of the nodes can become faulty
  Detection can defer overhead to periods of low load
    System can deliver high peak throughput
  Detection does not require consensus
    Potentially less expensive than BFT
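The replica counts on this slide can be turned into a small calculation. The sketch below only restates the slide's f+1 and 3f+1 bounds; the function names are hypothetical and are not taken from the talk or the paper.

```python
def replicas_for_detection(f: int) -> int:
    """Replicas needed to guarantee detection of up to f faulty nodes (f+1)."""
    return f + 1


def replicas_for_tolerance(f: int) -> int:
    """Replicas needed to mask up to f faulty nodes with BFT (3f+1)."""
    return 3 * f + 1


def max_faults_detection(n: int) -> int:
    """Largest f with f+1 <= n."""
    return n - 1


def max_faults_tolerance(n: int) -> int:
    """Largest f with 3f+1 <= n."""
    return (n - 1) // 3


if __name__ == "__main__":
    for f in (1, 2, 3):
        print(f, replicas_for_detection(f), replicas_for_tolerance(f))
```

With the same ten machines, the bounds allow detection with up to nine faulty nodes but masking with at most three, which is the sense in which detection still works when more than a third of the nodes can become faulty.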
Outline
  Introduction
  BFD abstraction
  PeerReview algorithm
  Conclusion
How is BFD used?
  Each correct node has state machine + detector
  Detector can inspect all messages at its local node
  When detector observes a fault on another node,
    it informs its local application, and
    it provides evidence of the fault to other detectors
  No assumptions about faulty nodes
  [Figure: a node made up of application, state machine, and detector, attached to the network; the detector reports "Node X is faulty!"]
Only observable faults can be detected
  Two classes of observable faults:
    Detectable faultiness: Node breaks the protocol
    Detectable ignorance: Node refuses to respond
  As long as the faulty node continues to follow the protocol, BFD cannot detect this!
  [Figure: three scenarios among nodes A, B, and C, each with the messages "Set X=5", "OK", and "Get X"]
    Correct: the reply to "Get X" is "5"
    Detectably ignorant: "Get X" gets no reply
    Detectably faulty: the reply to "Get X" is "7"
BFD can give strong guarantees
  Three types of detector output
    Trusted, suspected, exposed
  Strong completeness
    "No false negatives"
  Strong accuracy
    "No false positives"
  Precise definitions are in the paper
  [Figure: the detector outputs Trusted, Suspected, and Exposed]
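One way to picture the three detector outputs is as a per-node status kept by each detector. The sketch below is an illustration only: the class names, the method names, and the transition rules (an unanswered challenge leads to suspected, an answer restores trusted, verified evidence leads to exposed and stays there) are assumptions made for this example, and the precise definitions are the ones in the paper.

```python
from enum import Enum


class Status(Enum):
    TRUSTED = "trusted"
    SUSPECTED = "suspected"
    EXPOSED = "exposed"


class DetectorView:
    """Hypothetical bookkeeping of one detector's output for each remote node."""

    def __init__(self):
        self._status = {}

    def status(self, node):
        # Assumption: a node is trusted until something is observed about it.
        return self._status.get(node, Status.TRUSTED)

    def challenge_unanswered(self, node):
        # Assumption: ignorance makes a node suspected, never exposed.
        if self.status(node) is not Status.EXPOSED:
            self._status[node] = Status.SUSPECTED

    def challenge_answered(self, node):
        # Assumption: a suspected node becomes trusted again once it responds.
        if self.status(node) is Status.SUSPECTED:
            self._status[node] = Status.TRUSTED

    def evidence_verified(self, node):
        # Assumption: verified evidence of faultiness is permanent.
        self._status[node] = Status.EXPOSED


if __name__ == "__main__":
    view = DetectorView()
    view.challenge_unanswered("B")
    print(view.status("B"))
```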
Outline
  Introduction
  BFD abstraction
  PeerReview algorithm
  Conclusion
Assumptions
  1. Protocol can be modeled as a deterministic state machine
  2. Each node has a strong identity, as well as a public/private keypair for signing messages
  3. The faulty nodes cannot
       prevent two correct nodes from communicating
       break the cryptographic keys
Secure logging
  All messages are signed and acknowledged
  Each node keeps a log of all local inputs and outputs
  Nodes must commit to the contents of their log
    Log is tamper-evident [Maniatis02]
  [Figure: nodes A, B, and C with their log entries]
    B's log: Rcv(A, "Set X=5"), Send(A, "Okay"), Rcv(C, "Get X"), Send(C, "5")
    Entries at A: Snd(B, "Set X=5"), Rcv(B, "Okay")
    Entries at C: Snd(B, "Get X"), Rcv(B, "5")
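A common way to make a log tamper-evident is to chain its entries with a hash, so that the hash of the newest entry commits to everything before it. The sketch below shows that general idea with SHA-256 from Python's standard library. It is not the construction from [Maniatis02] or from the paper: signatures are left out, and the names `entry_hash`, `Log`, and `verify` are hypothetical.

```python
import hashlib

GENESIS = b"\x00" * 32  # assumed starting value for an empty log


def entry_hash(prev_hash, seq, kind, peer, payload):
    """Hash one log entry together with the hash of its predecessor."""
    h = hashlib.sha256()
    h.update(prev_hash)
    for part in (str(seq), kind, peer, payload):
        data = part.encode("utf-8")
        h.update(len(data).to_bytes(4, "big"))  # length prefix avoids ambiguity
        h.update(data)
    return h.digest()


class Log:
    """Append-only log; the latest hash acts as a commitment to its contents."""

    def __init__(self):
        self.entries = []  # list of (kind, peer, payload)
        self.hashes = []

    def append(self, kind, peer, payload):
        prev = self.hashes[-1] if self.hashes else GENESIS
        seq = len(self.entries)
        self.entries.append((kind, peer, payload))
        self.hashes.append(entry_hash(prev, seq, kind, peer, payload))
        return self.hashes[-1]


def verify(entries, commitment):
    """Recompute the chain and compare it with a previously seen commitment."""
    prev = GENESIS
    for seq, (kind, peer, payload) in enumerate(entries):
        prev = entry_hash(prev, seq, kind, peer, payload)
    return prev == commitment


if __name__ == "__main__":
    log = Log()
    log.append("Rcv", "A", "Set X=5")
    log.append("Send", "A", "Okay")
    log.append("Rcv", "C", "Get X")
    commitment = log.append("Send", "C", "5")
    print(verify(log.entries, commitment))
```

If B later presents a log in which any entry differs, the recomputed chain no longer matches a commitment that another node already holds.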
Detecting ignorance
  If a node refuses to acknowledge a message
    Send message as evidence to other nodes
    Correct nodes will challenge the ignorant node to prove that its log contains a 'Rcv' entry for that message
    A correct node can always respond
  [Figure: nodes A, B, and C; B's log: Rcv(A, "Set X=5"), Send(A, "Okay"), Rcv(C, "Get X")]
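The challenge step can be sketched as follows. This is a toy model under stated assumptions: nodes are plain Python objects, a response is the matching log entry with its position, and nothing is signed. `CorrectNode`, `IgnorantNode`, and `run_challenge` are hypothetical names, not part of PeerReview.

```python
class CorrectNode:
    """A correct node logs the challenged message if needed, then answers."""

    def __init__(self):
        self.log = []  # list of (kind, peer, payload)

    def handle_challenge(self, sender, message):
        entry = ("Rcv", sender, message)
        if entry not in self.log:
            self.log.append(entry)
        return self.log.index(entry), entry


class IgnorantNode:
    """A node that refuses to respond to challenges."""

    def handle_challenge(self, sender, message):
        return None


def run_challenge(node, sender, message):
    """Return 'trusted' if the node proves it logged the message, else 'suspected'."""
    response = node.handle_challenge(sender, message)
    if response is None:
        return "suspected"
    _position, entry = response
    return "trusted" if entry == ("Rcv", sender, message) else "suspected"


if __name__ == "__main__":
    print(run_challenge(CorrectNode(), "C", "Get X"))
    print(run_challenge(IgnorantNode(), "C", "Get X"))
```

The correct node can always answer because the challenge itself carries the message, so it can add the 'Rcv' entry on the spot.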
Detecting faultiness
  Nodes can audit each other's log at any time
    Auditors replay input in the log, compare output
  If a divergence is detected
    Send log as evidence to other nodes
    Other nodes can repeat the same procedure to check whether the node is really faulty (no he-said-she-said!)
  [Figure: nodes A, B, and C; an auditor replays B's log, starting from snapshots, on B', the state machine B is expected to run]
    B's log: Rcv(A, "Set X=5"), Send(A, "Okay"), Rcv(B, "Get X"), Send(B, "7")
    Output of B': Rcv(A, "Set X=5"), Send(A, "Okay"), Rcv(B, "Get X"), Send(B, "5")
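The audit by replay can be sketched with a toy deterministic state machine. Everything here is an assumption made for illustration: the key-value machine, the message syntax "Set X=5" and "Get X", the rule that each input produces exactly one output to the same peer, and the names `KeyValueMachine` and `audit`. Snapshots and signatures are left out.

```python
class KeyValueMachine:
    """Toy deterministic state machine standing in for the audited protocol."""

    def __init__(self):
        self.store = {}

    def handle(self, message):
        verb, _, rest = message.partition(" ")
        if verb == "Set":
            key, _, value = rest.partition("=")
            self.store[key] = value
            return "Okay"
        if verb == "Get":
            return self.store.get(rest, "")
        return "Error"


def audit(log, machine_factory=KeyValueMachine):
    """Replay logged inputs on a fresh reference machine.

    Returns the index of the first entry that diverges from the reference
    output, or None if the log is consistent with the reference machine.
    """
    machine = machine_factory()
    expected = None  # (peer, output) the reference machine owes next
    for index, (kind, peer, payload) in enumerate(log):
        if kind == "Rcv":
            if expected is not None:
                return index  # an expected output was never logged
            expected = (peer, machine.handle(payload))
        elif kind == "Send":
            if expected != (peer, payload):
                return index  # logged output differs from the replayed one
            expected = None
    return None


if __name__ == "__main__":
    faulty_log = [
        ("Rcv", "A", "Set X=5"),
        ("Send", "A", "Okay"),
        ("Rcv", "C", "Get X"),
        ("Send", "C", "7"),
    ]
    print(audit(faulty_log))
```

Because the reference machine is deterministic, any other node that receives the same log as evidence can run the same replay and reach the same verdict.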
Summary
  New approach: Byzantine Fault Detection
    Alternative to fault tolerance
    Provides accountability
  Fault Detection can give strong guarantees
    Eventual strong accuracy and completeness
  Early results indicate Fault Detection is practical
    Example: PeerReview algorithm
Thank you!