Fundamentals of Fault-Tolerant Distributed Computing in Asynchronous Environments


Fundamentals of Fault-Tolerant Distributed Computing in Asynchronous Environments. Paper by Felix C. Gartner. Graeme Coakley, COEN 317, November 23, 2003.


  • Fundamentals of Fault-Tolerant Distributed Computing in Asynchronous Environments
    Paper by Felix C. Gartner

    Graeme Coakley
    COEN 317
    November 23, 2003

  • Presentation Overview
    Introduction
    Terminology
    Formal View of Fault Tolerance
    Four Types of Fault Tolerance
    Redundancy as the Key to Fault Tolerance
    Models of Computation And Their Relevance
    Achieving Safety
    Achieving Liveness
    Conclusions

  • Introduction
    Until the early 1990s, work in fault-tolerant computing focused on specific technologies and applications.
    This resulted in distinct terminologies and methodologies.

    Goals
    Structure the area clearly.
    Survey the fundamental building blocks.

  • Terminology
    States, Configurations, and Guarded Commands
    Distributed system: a finite set of processes.
    Local state: the variables of each process.
    State transition: defines an event (send, receive, or internal event).
    Guarded commands: abstractly represent a local algorithm, written in the form guard => command.
    Configuration: consists of the local states of all processes plus the state of the communication subsystem.

  • process Ping
      var z : ℕ init 0
          ack : boolean init true
    begin
      ¬ack ∧ rcv(m) => ack := true; z := z + 1
      ack => snd(a); ack := false
    end

    process Pong
      var wait : boolean init true
    begin
      ¬wait => snd(m); wait := true
      wait ∧ rcv(a) => wait := false
    end
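The Ping and Pong processes can be simulated in a few lines of Python. This is a minimal sketch, not part of the original slides. It assumes reliable FIFO channels, reads the guards with the negations `not ack` and `not wait` (the slide text appears to have lost them), and uses a fixed scheduling order. The names `run_ping_pong`, `to_ping`, and `to_pong` are illustrative.

```python
from collections import deque


def run_ping_pong(rounds):
    """Run the Ping/Pong guarded commands until Ping has counted `rounds` replies."""
    to_pong, to_ping = deque(), deque()  # assumed reliable FIFO channels
    z, ack, wait = 0, True, True         # initial local states from the slide
    trace = []
    while z < rounds:
        if ack:
            # Ping: ack => snd(a); ack := false
            to_pong.append("a")
            ack = False
            trace.append("ping:snd(a)")
        elif to_ping:
            # Ping: (not ack) and rcv(m) => ack := true; z := z + 1
            to_ping.popleft()
            ack = True
            z += 1
            trace.append("ping:rcv(m)")
        elif wait and to_pong:
            # Pong: wait and rcv(a) => wait := false
            to_pong.popleft()
            wait = False
            trace.append("pong:rcv(a)")
        elif not wait:
            # Pong: (not wait) => snd(m); wait := true
            to_ping.append("m")
            wait = True
            trace.append("pong:snd(m)")
        else:
            raise RuntimeError("no guard is enabled")
    return z, trace


z, trace = run_ping_pong(3)
print(z, trace[:4])
```

Each round executes four events, one per guarded command, so the trace is a record of one possible sequence of configurations of the system.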

  • Terminology (continued)
    Defining Faults and Fault Models
    Fault: may cause an error.
    Error: may lead to a failure.
    Failure: the system has left its correctness specification.

    Models: crash failure, fail-stop, and Byzantine.

    Fault: can be modeled as an unwanted state transition of a process.

  • Terminology (continued)
    Properties of Distributed Systems: Safety and Liveness
    Safety property: some specific bad thing never happens within the system.

    Liveness property: some good thing will eventually happen during system execution.

    Problem specification: consists of a safety property and a liveness property.

  • Formal View of Fault Tolerance
    Definition:
    A distributed program A is said to tolerate faults from a fault class F for an invariant P iff there exists a predicate T for which the following requirements hold:
    P => T.
    T is closed in A and T is closed in F.
    Starting from any state where T holds, every computation that executes actions from A alone eventually reaches a state where P holds.
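For a finite state space, the three requirements of the definition can be checked mechanically. The sketch below is an illustration under stated assumptions, not the paper's method: the program and the fault class are modeled as hypothetical functions mapping a state to its list of successor states, and the countdown example is invented for this note.

```python
def closed(pred, states, step):
    """pred is closed in `step` if no transition leads from a pred-state out of pred."""
    return all(pred(t) for s in states if pred(s) for t in step(s))


def converges(T, P, states, program):
    """Check that every program-only computation from a T-state reaches a P-state."""
    good = {s for s in states if P(s)}
    changed = True
    while changed:
        changed = False
        for s in states:
            successors = program(s)
            # s inevitably reaches P if it can move and all of its moves do.
            if s not in good and successors and all(t in good for t in successors):
                good.add(s)
                changed = True
    return all(s in good for s in states if T(s))


def tolerates(P, T, states, program, faults):
    p_implies_t = all(T(s) for s in states if P(s))
    return (p_implies_t
            and closed(T, states, program)
            and closed(T, states, faults)
            and converges(T, P, states, program))


# Hypothetical example: a counter that counts down to 0; faults may set it to 3.
STATES = range(4)
countdown = lambda x: [x - 1] if x > 0 else []
stuck = lambda x: []                 # a program with no corrective actions
corrupt = lambda x: [3]              # fault class F
P = lambda x: x == 0                 # invariant
T = lambda x: True                   # fault span: every state

print(tolerates(P, T, STATES, countdown, corrupt))
print(tolerates(P, T, STATES, stuck, corrupt))
```

The second program fails the check because it never leaves the states 1, 2, and 3 once a fault has put it there, so the convergence requirement does not hold.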

  • Four Types of Fault Tolerance

                                    Liveness property satisfied
                                    Yes           No
    Safety property       Yes       Masking       Fail-safe
    satisfied             No        Nonmasking    None
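The four types can be read as a function of two booleans: whether safety and whether liveness are still satisfied in the presence of faults. A minimal sketch (the function name `classify` is illustrative):

```python
def classify(safety_holds: bool, liveness_holds: bool) -> str:
    """Name the type of fault tolerance given which properties survive faults."""
    if safety_holds and liveness_holds:
        return "masking"
    if safety_holds:
        return "fail-safe"
    if liveness_holds:
        return "nonmasking"
    return "none"


for safe in (True, False):
    for live in (True, False):
        print(safe, live, classify(safe, live))
```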

  • Redundancy as the Key to Fault Tolerance
    Defining Redundancy:
    A distributed program A is said to be redundant in space iff for all executions e of A in which no faults occur, the set of all configurations of A contains configurations that are not reached in e.
    A is said to be redundant in time iff for all executions e of A in which no faults occur, the set of actions of A contains actions that are never executed in e.
    A program is said to employ redundancy iff it is redundant in space or in time.

  • Example: program with redundancy in space and in time

    process Redundancy
      var x ∈ {0, 1, 2} init 1   {* local state *}
    begin
      {* normal program actions: *}
      x = 1 => x := 2   {* 1 *}
      x = 2 => x := 1   {* 2 *}
      x = 0 => x := 1   {* 3 *}
      {* fault action: *}
      true => x := 0
    end
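The example can be explored in Python to see both kinds of redundancy. This sketch encodes the three program actions as guard/command pairs and runs the program without faults. Because the guards are mutually exclusive, the single fault-free run from the initial state covers every fault-free execution. The names `ACTIONS`, `step`, and `fault_free_run` are illustrative.

```python
STATES = {0, 1, 2}
INIT = 1

# Normal program actions, numbered as on the slide: number -> (guard, command).
ACTIONS = {
    1: (lambda x: x == 1, lambda x: 2),
    2: (lambda x: x == 2, lambda x: 1),
    3: (lambda x: x == 0, lambda x: 1),
}


def step(x):
    """Execute the enabled program action in state x; return (action number, new state)."""
    for number, (guard, command) in ACTIONS.items():
        if guard(x):
            return number, command(x)
    raise RuntimeError("no action enabled")


def fault_free_run(steps=10):
    """Run without faults and record the visited states and executed actions."""
    x = INIT
    visited, executed = {x}, set()
    for _ in range(steps):
        number, x = step(x)
        executed.add(number)
        visited.add(x)
    return visited, executed


visited, executed = fault_free_run()
unreached_states = STATES - visited       # redundancy in space
unused_actions = set(ACTIONS) - executed  # redundancy in time
print(unreached_states, unused_actions)

# After the fault action sets x := 0, only action 3 is enabled and it restores x = 1.
print(step(0))
```

State 0 and action 3 are never used when no fault occurs; they exist only to absorb and correct the fault.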

  • Redundancy as the Key to Fault Tolerance (continued)
    Claim:
    If A is a nontrivial distributed program that does not employ redundancy, then A may violate its correctness specification in the presence of faults.

    Conclusion:
    Redundancy is a necessary condition for fault tolerance, but not a sufficient one.
    Redundancy in space is widespread.

  • Models of Computation And Their Relevance
    Models of Distributed Systems
    Synchronous systems: there are real-time bounds on message transmission and process response times.
    Partially synchronous systems: intermediate models in which such bounds hold to varying degrees.
    Asynchronous systems: no timing bounds are assumed.
    This is the weakest model, and a realistic one in many applications.
    Every algorithm that works in this model works in all other models.
    It is impossible to detect whether a process has crashed or is merely slow.

  • Achieving Safety: Detection as the Basis for Safety
    To ensure safety, we need to employ detection and subsequently inhibit dangerous actions.

    Common detection mechanisms: parity, checksums.

    Detection includes checking whether a certain predicate Q holds over the entire system.
    Q is easier to specify if the type and effect of faults from F are known.
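A parity bit is perhaps the simplest detection predicate. The sketch below assumes an even-parity scheme and a fault class of bit flips; the function names are illustrative. It also shows why knowing the fault class matters: the predicate catches any single flip but misses a double flip.

```python
def parity(bits):
    """Return 0 if the number of 1-bits is even, 1 otherwise."""
    return sum(bits) % 2


def encode(bits):
    """Append a parity bit so that the whole word has even parity."""
    return bits + [parity(bits)]


def q_holds(word):
    """Detection predicate Q: the word has even parity."""
    return parity(word) == 0


def flip(word, *positions):
    """Model a fault: return a copy of the word with the given bits flipped."""
    return [b ^ 1 if i in positions else b for i, b in enumerate(word)]


word = encode([1, 0, 1, 1])
print(q_holds(word))              # no fault
print(q_holds(flip(word, 2)))     # one flipped bit
print(q_holds(flip(word, 0, 3)))  # two flipped bits
```

A receiver that refuses to act on a word for which Q fails inhibits the dangerous action, which is the safety half of the scheme.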

  • Achieving Safety: Detection in Distributed Settings
    Deciding whether a predicate over the global state holds is not easy.

    Cooper and Marzullo introduced two predicate transformers:
    Possibly(Q) is true iff there exists a continuous observation of the computation for which Q holds at some point.
    Definitely(Q) is true iff for all possible continuous observations of the computation Q holds at some point.
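The two transformers above can be evaluated by brute force on a small computation. The sketch below is an illustration, not Cooper and Marzullo's algorithm as published: it assumes each event carries a vector clock, represents a global state as a cut (the number of events taken from each process), and treats an observation as a path of consistent cuts from the initial to the final cut. All names are hypothetical.

```python
from itertools import product


def consistent(cut, clocks):
    """A cut is consistent if no included event depends on an excluded one."""
    for i, count in enumerate(cut):
        if count > 0:
            clock = clocks[i][count - 1]  # vector clock of the last included event
            if any(clock[j] > cut[j] for j in range(len(cut))):
                return False
    return True


def holds(q, cut, states):
    # states[i][k] is the local state of process i after k events.
    return q(*[states[i][k] for i, k in enumerate(cut)])


def possibly(q, states, clocks):
    """True iff some consistent cut satisfies q."""
    ranges = [range(len(c) + 1) for c in clocks]
    return any(consistent(cut, clocks) and holds(q, cut, states)
               for cut in product(*ranges))


def definitely(q, states, clocks):
    """True iff no observation can reach the final cut while avoiding q."""
    start = tuple(0 for _ in clocks)
    final = tuple(len(c) for c in clocks)
    if holds(q, start, states):
        return True
    frontier, seen = [start], {start}
    while frontier:
        cut = frontier.pop()
        if cut == final:
            return False  # found an observation on which q never held
        for i in range(len(cut)):
            if cut[i] < len(clocks[i]):
                nxt = cut[:i] + (cut[i] + 1,) + cut[i + 1:]
                if (nxt not in seen and consistent(nxt, clocks)
                        and not holds(q, nxt, states)):
                    seen.add(nxt)
                    frontier.append(nxt)
    return True


# Two processes: P0 sets x from 0 to 1, P1 sets y from 0 to 1.
states = [[0, 1], [0, 1]]
q = lambda x, y: x == 1 and y == 0
independent = [[(1, 0)], [(0, 1)]]  # the two events are concurrent
with_message = [[(1, 0)], [(1, 1)]]  # P1's event receives a message from P0's event

print(possibly(q, states, independent), definitely(q, states, independent))
print(possibly(q, states, with_message), definitely(q, states, with_message))
```

When the two events are concurrent, an observer may see P1's event first, so Q is possible but not certain. Once the message orders the events, every observation passes through the state x = 1, y = 0.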

  • Achieving Safety: Adapting Consensus Algorithms
    Consensus: a set of processes, each with an initial value, must all decide on a common value.

    A central process can act as an observer that constructs all possible observations.
    The central process scheme is not very fault tolerant:
    The central observer can crash.
    The central observer can send arbitrary messages.

    Solution: diffuse information among all nodes.

  • Achieving Safety: Detecting Process Crashes
    Fully asynchronous model: process crashes are impossible to detect.

    Chandra and Toueg proposed unreliable failure detectors to extend the asynchronous model.
    The main property of failure detectors is accuracy:
    Weak: at least one correct process is never suspected of having crashed.
    Eventually weak: the failure detector may suspect every process at one time or another, but there is a time after which some correct process is no longer suspected.
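A common way to approximate such a detector in practice is with heartbeats and timeouts. The sketch below is an assumption-laden illustration, not a construction from the slides: the class `HeartbeatDetector` is hypothetical, time is a logical counter passed in by the caller, and the timeout is doubled after every mistaken suspicion. In a fully asynchronous system no timeout is ever guaranteed to be long enough, so the accuracy properties hold only if message delays eventually become bounded.

```python
class HeartbeatDetector:
    """Unreliable failure detector: suspects processes whose heartbeats are late."""

    def __init__(self, processes, timeout=2):
        self.timeout = {p: timeout for p in processes}
        self.last_seen = {p: 0 for p in processes}
        self.suspected = set()

    def heartbeat(self, process, now):
        """Record a heartbeat; a suspected sender was suspected wrongly."""
        self.last_seen[process] = now
        if process in self.suspected:
            self.suspected.discard(process)
            self.timeout[process] *= 2  # be more patient with this process

    def check(self, now):
        """Suspect every process whose last heartbeat is older than its timeout."""
        for process, seen in self.last_seen.items():
            if now - seen > self.timeout[process]:
                self.suspected.add(process)
        return set(self.suspected)


detector = HeartbeatDetector(["p", "q"])
detector.heartbeat("p", now=1)
detector.heartbeat("q", now=1)
print(detector.check(now=2))    # both heartbeats are recent
print(detector.check(now=4))    # both are late and become suspected
detector.heartbeat("p", now=5)  # p was only slow: the suspicion is revoked
print(detector.check(now=8))    # q stays suspected
```

The detector is allowed to make mistakes, as with `p` above. What matters for the algorithms built on top of it is that mistakes about some correct process eventually stop.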

  • Achieving Liveness: Correction
    Liveness is tied to the notion of correction.
    Correction refers to turning a bad state into a good one.
    Common methods include retransmission, error-correcting codes, rollback recovery, and rollforward recovery.

    On detecting a bad state via a detection predicate Q, the system must try to impose a new target predicate R onto the system.

  • Achieving Liveness: Correction via Consensus
    Correction corresponds to the decision phase of consensus algorithms.

    State machine approach (Schneider)
    Servers are made fault tolerant by replicating them and coordinating their behavior via consensus algorithms.

    Other methods are based on several forms of fault-tolerant broadcast.
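The state machine approach can be sketched as follows. This is a simplified illustration under strong assumptions: the consensus layer is not modeled and is taken to have already delivered the same command log, in the same order, to every replica; a faulty replica is modeled by a hypothetical flag that makes it return a wrong reply. The names `Replica` and `replicated_apply` are illustrative.

```python
from collections import Counter


class Replica:
    """A deterministic state machine holding a single integer."""

    def __init__(self, faulty=False):
        self.value = 0
        self.faulty = faulty

    def apply(self, command):
        operation, argument = command
        if operation == "add":
            self.value += argument
        elif operation == "set":
            self.value = argument
        # A faulty replica answers with a wrong value (hypothetical fault model).
        return -1 if self.faulty else self.value


def replicated_apply(replicas, log):
    """Apply the agreed log to every replica and return the majority reply to the last command."""
    result = None
    for command in log:
        replies = Counter(replica.apply(command) for replica in replicas)
        result, votes = replies.most_common(1)[0]
        if votes <= len(replicas) // 2:
            raise RuntimeError("no majority among replies")
    return result


replicas = [Replica(), Replica(), Replica(faulty=True)]
print(replicated_apply(replicas, [("set", 5), ("add", 3)]))
```

Because the replicas are deterministic and see the same commands in the same order, the correct ones always agree, and voting over the replies masks the faulty one.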

  • Conclusions
    This paper introduces a formal approach to structure the area of fault-tolerant distributed computing, survey fundamental methodologies, and discuss their relations.
    This approach reveals the inherent limitations of fault-tolerance methodologies and their interactions with system models.
    This paper could not integrate the entire area of fault-tolerant distributed computing.
    Many topics still need further attention.

  • Questions