


Fundamentals of Fault-Tolerant Distributed Computing in Asynchronous Environments

FELIX C. GÄRTNER

Darmstadt University of Technology

Fault tolerance in distributed computing is a wide area with a significant body of literature that is vastly diverse in methodology and terminology. This paper aims at structuring the area and thus guiding readers into this interesting field. We use a formal approach to define important terms like fault, fault tolerance, and redundancy. This leads to four distinct forms of fault tolerance and to two main phases in achieving them: detection and correction. We show that this can help to reveal inherently fundamental structures that contribute to understanding and unifying methods and terminology. By doing this, we survey many existing methodologies and discuss their relations. The underlying system model is the close-to-reality asynchronous message-passing model of distributed computing.

Categories and Subject Descriptors: A.1 [General Literature]: Introductory and Survey; C.4 [Computer Systems Organization]: Performance of Systems—Modeling techniques; Reliability, availability, and serviceability

General Terms: Algorithms, Design, Reliability, Theory

Additional Key Words and Phrases: Asynchronous system, agreement problem, consensus problem, failure correction, failure detection, fault models, fault tolerance, liveness, message passing, possibility detection, predicate detection, redundancy, safety

1. INTRODUCTION

Research in fault-tolerant distributed computing aims at making distributed systems more reliable by handling faults in complex computing environments. Moreover, the increasing dependence of society on well-designed and well-functioning computer systems has led to an increasing demand for dependable systems, systems with quantifiable reliability properties. The necessity for such quantification is especially obvious in mission-critical settings like flight control systems or software to control (nuclear) power plants. Until the early 1990s, work in fault-tolerant computing focused on specific technologies and applications, resulting in apparently unrelated subdisciplines with distinct terminologies and methodologies. Despite attempts to structure the field and unify its terminology [Cristian 1991; Laprie 1985], Arora and Gouda [1993] subsequently assessed that “the discipline itself seems to be fragmented.” Hence, people approaching the field are often perplexed, sometimes even annoyed.

This work was supported by the Deutsche Forschungsgemeinschaft (DFG) as part of the Graduiertenkolleg ISIA at Darmstadt University of Technology.
Author’s address: Department of Computer Science, Darmstadt University of Technology, Alexanderstraße 10, D-64283 Darmstadt, Germany; email: [email protected].
Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.
© 1999 ACM 0360-0300/99/0300–0013 $5.00

ACM Computing Surveys, Vol. 31, No. 1, March 1999

Since that time, much progress was made by viewing the area in a more abstract and formal way. This has led to a clearer understanding of the essential and inherent problems in the field and shown what can be done to harness the complexity of systems that counteract faults. This paper draws from these results and uses a formal approach to structure fault-tolerant distributed computing. It concentrates on an important and intensely studied system environment called the asynchronous system model. Informally, this is a model in which processors communicate by sending messages to one another delivered with arbitrary delay, and in which the speeds of the nodes can get out of sync to an arbitrary extent. We use this model as a starting point because it is the weakest model (i.e., methods for this model work in other models, too) and is arguably the most realistic for distributed computing in today’s large-scale wide-area networks.

The purpose of this paper is twofold. First, we want to structure the area clearly, using the latest research results on formalizations of fault tolerance [Arora and Kulkarni 1998a; Arora and Kulkarni 1998b]. For people familiar with the area, this perspective may be unusual and, at first glance, collide with the traditional approaches to structuring the field [Nelson 1990; Cristian 1991; Jalote 1994], but we think the approach taken here offers insights that complement current taxonomies and contribute to the understanding of fault-tolerance phenomena. This paper can also serve as an introductory tutorial.

Our second aim is to survey the fundamental building blocks that can be used to construct fault-tolerant applications in asynchronous systems. By doing this, we also want to show that formalization and abstraction can lead to more clarity and interesting insights. However, to comply with a tutorial style, this survey tries to argue in an informal way, retaining formalizations only where needed. We assume that the reader has some basic understanding of computers, formal systems, and logic, but not necessarily of distributed systems theory. Note that we are not concerned with security aspects, although they are becoming more and more important and are sometimes discussed under the heading of fault tolerance.

The structure of the paper echoes the two goals. First, we define the relevant terms and the basic system model used throughout (Section 2). Next, we formally define what it means for a system to tolerate certain kinds of faults (Section 3) and derive four basic forms of fault tolerance in Section 4. This leads the way to a discussion of methods for achieving fault tolerance: Section 5 shows that there can be no fault tolerance without redundancy. We briefly revisit system models and argue for the asynchronous system model in Section 6, and then discuss refined and practical examples of fault-tolerance concepts in Sections 7 and 8. Finally, Section 9 discusses related work and sketches the current state of the art. Section 10 concludes the paper and outlines directions for future work.

CONTENTS

1. Introduction
2. Terminology
   2.1 States, Configurations, and Guarded Commands
   2.2 Defining Faults and Fault Models
   2.3 Properties of Distributed Systems: Safety and Liveness
3. A Formal View of Fault Tolerance
4. Four Forms of Fault Tolerance
5. Redundancy as the Key to Fault Tolerance
   5.1 Defining Redundancy
   5.2 No Fault Tolerance Without Redundancy
   5.3 Conclusions from the Necessity of Redundancy
6. Models of Computation and Their Relevance
7. Achieving Safety
   7.1 Detection as the Basis for Safety
   7.2 Detection in Distributed Settings
   7.3 Adapting Consensus Algorithms for Fault-Tolerant Possibility Detection
   7.4 Detecting Process Crashes
   7.5 Summary
8. Achieving Liveness
   8.1 Correction as the Basis for Achieving Liveness
   8.2 Correction via Consensus
   8.3 Summary
9. Related and Current Research
10. Conclusions and Future Work

2. TERMINOLOGY

The benefits of fault tolerance are usually advertised as improving dependability—the amount of trust that can justifiably be put in a system. Normally, dependability is defined in statistical terminology, stating the probability that the system is functional and provides the expected service at a specific point in time. This results in common definitions like the well-known mean time to failure (MTTF). While terms like dependability, reliability, and availability are important in practical settings, they are not central to this paper because we focus on the design phase of fault tolerance, not on the evaluation phase. So while the above terms may remain a little imprecise, it is important to define the characteristics of a distributed system precisely. This is done in the following section. For precise definitions of the defining attributes of dependability, see other introductory works, e.g., the book by Jalote [1994].

2.1 States, Configurations, and Guarded Commands

We model a distributed system as a finite set of processes that communicate by sending messages from a fixed message alphabet through a communication subsystem. The variables of each process define its local state. Each process runs a local algorithm that results in a sequence of atomic transitions of its local state. Every state transition defines an event, which can be a send event, a receive event, or an internal event. No assumptions regarding network topology or message delivery properties of the communication system are made, except that sending messages between arbitrary processes must be possible.

We use guarded commands [Dijkstra 1975] as a notation to abstractly represent a local algorithm. A guarded command is a pair consisting of a boolean expression over the local state (called the guard) and an assignment that signifies an atomic and instantaneous change of the local state (called the command). It is written as ⟨guard⟩ → ⟨command⟩. A guarded command (which for the sake of variety is sometimes called an action) is said to be enabled if its guard evaluates to true. A local algorithm is a set of guarded commands (in the notation, the commands are separated by a ‘[]’ symbol). A process executes a step in its local algorithm by evaluating all its guards and nondeterministically choosing one enabled action, of which it executes the assignment. We assume that the choice of actions is fair (meaning that an action that is enabled infinitely often is eventually chosen and executed); this corresponds to strong fairness [Tel 1994]. Communication is embedded within the notation as follows: the guard can contain a special rcv statement that evaluates to true iff the corresponding message can be received. A command may contain one or more snd statements that, in effect, send a message to another process.
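This execution model — evaluate all guards, then nondeterministically execute the command of one enabled action — can be sketched in a few lines of Python. This is an illustrative interpreter, not part of the paper's notation; the names GuardedCommand and step are ours.

```python
import random

class GuardedCommand:
    """A pair <guard> -> <command> over a mutable local state."""
    def __init__(self, guard, command):
        self.guard = guard      # function: state -> bool
        self.command = command  # function: state -> None (atomic assignment)

def step(state, actions):
    """Execute one step: nondeterministically pick one enabled action."""
    enabled = [a for a in actions if a.guard(state)]
    if not enabled:
        return False            # no action enabled: the process is blocked
    random.choice(enabled).command(state)
    return True

# A toy local algorithm: keep incrementing x until it reaches 3.
state = {"x": 0}
actions = [GuardedCommand(lambda s: s["x"] < 3,
                          lambda s: s.__setitem__("x", s["x"] + 1))]
while step(state, actions):
    pass
# At this point no guard is enabled any more and x equals 3.
```

A fair scheduler would additionally have to ensure that an action enabled infinitely often is eventually chosen; the uniform random choice above satisfies this with probability one but is only one possible implementation.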

A distributed program (or distributed algorithm) consists of a set of local algorithms. The overall system state of such a program, called a configuration, consists of all the local states of all processes plus the state of the communication subsystem (i.e., the messages in transit). To make this paper more readable, we often use the terms configuration and state synonymously, and explicitly write “local” state when referring to the local state of a process. All processes have a local starting state that defines the starting configuration of the system. We do not attempt to distinguish exactly among the terms distributed system, distributed program, and distributed algorithm. While the latter two are used synonymously, the former usually refers to the entirety of program, state, and execution environment (i.e., hardware).

To help in understanding all these definitions, consider Figure 1, which depicts a simple distributed algorithm using guarded command notation. The algorithm is the well-known “ping-pong” example of two processes that keep alternately sending messages to each other. For example, the local state of process Ping consists of the values of the variables z and ack (which are initialized to 0 and true, respectively). The global state in turn consists of the values of all variables (z, ack, and wait), as well as the set of messages that may be in transit (i.e., a or m). The starting configuration is the global state where z = 0, ack = true, wait = true, and where no messages are in transit. Note that, at every point in time, there is always at most one enabled action in both processes: either it is the process’s turn to send or the process is waiting to receive. If a guarded command is enabled, it may be executed, resulting in a change of the local state and equally in a change of the global configuration.
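Based on this description, the ping-pong algorithm can be rendered in Python roughly as follows. The exact guards of Figure 1 are our reconstruction from the text (variable names z, ack, and wait follow the description; the message queues stand in for the communication subsystem).

```python
from collections import deque

# Channels: messages in transit between the two processes.
to_pong, to_ping = deque(), deque()

# Local states as described in the text.
ping = {"z": 0, "ack": True}
pong = {"wait": True}

def ping_step():
    if ping["ack"]:                       # ack -> snd m; z := z + 1; ack := false
        to_pong.append("m")
        ping["z"] += 1
        ping["ack"] = False
        return True
    if to_ping and to_ping[0] == "a":     # rcv a -> ack := true
        to_ping.popleft()
        ping["ack"] = True
        return True
    return False

def pong_step():
    if pong["wait"] and to_pong and to_pong[0] == "m":  # rcv m -> snd a
        to_pong.popleft()
        to_ping.append("a")
        return True
    return False

# Run six steps; as noted above, at most one action is enabled per process,
# so the two processes strictly alternate: send m, reply a, receive a, ...
for _ in range(6):
    ping_step() or pong_step()
```

After these six steps, Ping has completed two full rounds (z = 2, ack = true) and no messages are in transit, matching the intended alternation.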

2.2 Defining Faults and Fault Models

There is considerable ambiguity in the literature on the meaning of some central terms like fault and failure. Cristian [1991] remarks that “what one person calls a failure, a second person calls a fault, and a third person might call an error.” The term fault is usually used to name a defect at the lowest level of abstraction, e.g., a memory cell that always returns the value 0 [Jalote 1994]. A fault may cause an error, which is a category of the system state. An error, in effect, may lead to a failure, meaning that the system deviates from its correctness specification. These terms are admittedly vague, and here again formalization can help clarification.

Traditionally, faults were handled by describing the resulting behavior of the system and were grouped into a hierarchic structure of fault classes or fault models [Cristian 1991; Schneider 1993a]. Well-known examples are the crash failure model (in which processors simply stop executing at a specific point in time), fail-stop (in which a processor crashes, but this may be easily detected by its neighbors), or Byzantine (in which processors may behave in arbitrary, even malevolent, ways). System correctness was always proved with respect to a specific fault model. But, unfortunately, slight ambiguities or differences in definition have worked against any common understanding of even simple fault models, and thus more formal methods were developed to describe them.

A formal approach to defining the term “fault” is usually based on the observation that systems change their state as a result of two quite similar event classes: normal system operation and fault occurrences [Cristian 1985]. Thus, a fault can be modeled as an unwanted (but nevertheless possible) state transition of a process. By using additional (virtual) variables to extend the actual state space of a process, every kind of fault from common fault classes can be simulated [Arora 1992; Arora and Gouda 1993; Völzer 1998; Gärtner 1998]. As an example, consider Figure 2, which shows code from one of the processes in Figure 1 that may be subject to crash failures. The occurrence of a crash can be modeled by adding an internal boolean “fault variable” up to the process’s state, which is initially true. An additional action up → up := false, which models the crash, can now be added to the set of guarded commands. To comply with the expected behavior of the crash model, all other actions of the process must be stopped, which can be achieved by adding up as an additional conjunct to the guards of these actions. Where virtual error variables may inhibit normal program actions, we call the error variable effective. (Note that a subsequent repair action must simply set up to true again.) The addition of virtual error variables and fault actions can be viewed as a transformation of the initial program into a possibly faulty program [Gärtner 1998]. Investigating and improving the properties of such programs is the subject of fault-tolerance methodologies.

Figure 1. A simple distributed algorithm.
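The transformation just described — add an error variable up, conjoin up to every guard, and add the fault action up → up := false — can be written down mechanically. The following Python sketch applies it to a toy process (the class and function names are our own, not from the paper):

```python
class GC:
    """A guarded command: <guard> -> <command>."""
    def __init__(self, guard, command):
        self.guard, self.command = guard, command

def add_crash_fault(actions):
    """Add an (initially true) error variable `up`, conjoin `up` to every
    guard, and return the fault action  up -> up := false  that models
    the crash."""
    weakened = [GC(lambda s, g=a.guard: s["up"] and g(s), a.command)
                for a in actions]
    crash = GC(lambda s: s["up"], lambda s: s.__setitem__("up", False))
    return weakened, crash

# Toy process: keeps incrementing x forever.
prog = [GC(lambda s: True, lambda s: s.__setitem__("x", s["x"] + 1))]
normal, crash = add_crash_fault(prog)

state = {"x": 0, "up": True}
normal[0].command(state)      # one normal step: x becomes 1
crash.command(state)          # the fault action fires: up := false
# The error variable is effective: no normal action is enabled any more.
assert not any(a.guard(state) for a in normal)
```

A repair action would simply set up back to true, re-enabling the weakened guards.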

We call the state containing such additional error variables the extended local state, or the extended configuration when referring to the entire system. We therefore define a fault as an action on the (possibly extended) state of a process. A set of faults is called a fault class. The advantages of this definition lie not only in the clarity of its semantics; by defining faults in this way, programs susceptible to faults can be reasoned about easily by using the standard techniques aimed at normal fault-free operation. We avoid the term error and use the term failure to denote the fact that a system has not behaved according to its specification.

2.3 Properties of Distributed Systems: Safety and Liveness

We adopt the usual definitions of system properties proposed by Alpern and Schneider [1985], whereby an execution of a distributed program is an infinite sequence e = c0, c1, c2, ... of global system configurations. Configuration c0 is the starting configuration, and all subsequent configurations ci (with i > 0) result from ci−1 by executing a single enabled guarded statement. Finite executions (e.g., of terminating programs) are technically turned into infinite sequences by infinitely repeating the final configuration. In places where we explicitly refer to finite executions, we call them partial executions.

A property of a distributed program is a set of system executions. A distributed program always defines a property in itself, namely the set of all system executions that are possible from its starting configuration. A specific property p is said to hold for a distributed program if the set of sequences defined by the program is contained in p.

Lamport [1977] reported on two major classes of system properties necessary to describe any useful system behavior: safety and liveness. Informally, a safety property states that some specific “bad thing” never happens within a system. This can be characterized formally by specifying when an execution e is not safe for a property p (i.e., not contained in a safety property p): if e ∉ p, there must be an identifiable discrete event within e that prohibits all possible continuations of the execution from being safe [Alpern and Schneider 1985] (this is the unwanted and irremediable “bad thing”). For example, consider a traffic light that controls the traffic at a road intersection. A simple safety property of this system can be stated as follows: at every point in time, no two traffic lights shall show green. The system would not be safe if there were an execution of the system where two distinct traffic lights show green.

Figure 2. A process that may crash.

A safety property is usually expressed by a set of “legal” system configurations, commonly referred to as an invariant. By proving that a distributed program is safe, we are assured that the system will always remain within this set of safe states. Thus the safety property unconditionally prohibits the system from switching into configurations not in this set. (In general, if the violation of a property p can be observed in finite time, then p is a safety property.)
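Because a safety violation is observable in finite time, safety can be checked by testing every configuration of a (partial) execution against the invariant. A minimal sketch for the traffic-light invariant (the encoding of configurations as dictionaries is our choice):

```python
def is_safe(execution, invariant):
    """A safety property fails iff some finite prefix already contains a
    configuration outside the invariant (the irremediable 'bad thing')."""
    return all(invariant(c) for c in execution)

# Configurations: the colour currently shown by each of two traffic lights.
def no_two_greens(config):
    return list(config.values()).count("green") <= 1

ok  = [{"l1": "green", "l2": "red"},   {"l1": "red", "l2": "green"}]
bad = [{"l1": "red",   "l2": "red"},   {"l1": "green", "l2": "green"}]

assert is_safe(ok, no_two_greens)
assert not is_safe(bad, no_two_greens)
```

Note that no such finite check exists for liveness, which is exactly what distinguishes the two classes of properties.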

On the other hand, a liveness property claims that some “good thing” will eventually happen during system execution. Formally, a partial execution of a system is live for property p iff it can be extended so as to still remain in p. A liveness property is one for which every partial execution is live [Alpern and Schneider 1985]. Consider the traffic-light example again. The liveness property could be stated as follows: a car waiting at a red traffic light must eventually receive a green signal and be allowed to cross the intersection. This shows that liveness properties are “eventuality” properties (as opposed to “always” properties like safety). Every partial execution must be live, i.e., the system guarantees that a car driver will eventually cross the street, whereby the expected “good thing” happens. Note that this good thing need not be discrete: the liveness property refers to the crossing of all cars that arrive at the intersection. Thus, liveness properties can capture notions of progress.

The most common example of liveness in distributed systems is termination. Examples where the prescribed “good thing” is not discrete are properties like guaranteed service (which states that every request will eventually be satisfied). Liveness properties make no forecast as to when the “good thing” will happen; they only keep it possible. The actual proof that a system satisfies a liveness property usually includes a well-foundedness argument.

A problem specification consists of a safety property and a liveness property. A distributed algorithm A is said to be correct regarding a problem specification if both the safety and the liveness property hold for A.

3. A FORMAL VIEW OF FAULT TOLERANCE

Informally, fault tolerance is the ability of a system to behave in a well-defined manner once faults occur. When designing fault tolerance, a first prerequisite is to specify the fault class that should be tolerated [Avizienis 1976; Arora and Kulkarni 1998b]. As we saw in the previous section, this was traditionally done by naming one of the standard fault models (crash, fail-stop, etc.), but is done more concisely by specifying a fault class (a set of fault actions). The next step is to enrich the system under consideration with components or concepts that provide protection against faults from the fault class.

Example 1. Consider the example program in Figure 3. It shows a simple process that keeps on changing its only variable x between the values 1 and 2. To make it fault tolerant, we first have to say which faults it should tolerate. To keep things simple, we consider only a single type of fault, specified by the fault action true → x := 0. The next step is to provide a protection mechanism against this type of fault, namely the action x = 0 → x := 1.

Figure 3. A simple fault-tolerant program.
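The program of Example 1 is small enough to simulate directly. The following Python sketch interleaves the program actions (including the protection action x = 0 → x := 1) with the fault action true → x := 0; the scheduling and the fault frequency are our own choices, not part of the example.

```python
import random
random.seed(42)   # make this sketch deterministic

state = {"x": 1}

# Program actions: toggle x between 1 and 2, plus the protection action.
program = [
    (lambda s: s["x"] == 1, lambda s: s.__setitem__("x", 2)),
    (lambda s: s["x"] == 2, lambda s: s.__setitem__("x", 1)),
    (lambda s: s["x"] == 0, lambda s: s.__setitem__("x", 1)),  # protection
]
fault = (lambda s: True, lambda s: s.__setitem__("x", 0))       # true -> x := 0

for _ in range(1000):
    # With some probability the fault action fires instead of the program.
    actions = [fault] if random.random() < 0.2 else program
    enabled = [(g, c) for (g, c) in actions if g(state)]
    g, c = random.choice(enabled)
    c(state)
    # The fault can perturb x to 0, but x never leaves {0, 1, 2}:
    assert state["x"] in {0, 1, 2}
```

Whenever the fault stops firing for a while, the protection action returns x to 1 and the toggling resumes, which is precisely the recovery behavior formalized next.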

We have now refined our intuition to the point that we can formally define what it means for a system to tolerate a certain fault class.

Definition 1. A distributed program A is said to tolerate faults from a fault class F for an invariant P iff there exists a predicate T for which the following three requirements hold:

—At any configuration where P holds, T also holds (i.e., P ⇒ T).

—Starting from any state where T holds, if any actions of A or F are executed, the resulting state will always be one in which T holds (i.e., T is closed in A and T is closed in F).

—Starting from any state where T holds, every computation that executes actions from A alone eventually reaches a state where P holds.

If a program A tolerates faults from a fault class F for invariant P, we say that A is F-tolerant for P.

To understand this definition, recall the program from Example 1. The invariant of the process can be stated as P ≡ x ∈ {1,2}, and the fault class to be tolerated as F = {true → x := 0}. The predicate T is called the fault span [Arora and Kulkarni 1998b], and can be understood as a limit to the perturbations made possible by the faults from F. In our example, T is equivalent to x ∈ {0,1,2}. Definition 1 states that faults from F are handled by knowing the fault span T (see Figure 4, where predicates are depicted as state sets). As long as such faults f ∈ F occur, the system may leave the set P but will always remain within T. When fault actions are not executed for a sufficiently long period of time, but only normal program actions t ∉ F, the system will eventually reach P again and resume “normal” behavior.
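For the tiny state space of Example 1, the three conditions of Definition 1 can even be checked exhaustively. The following sketch encodes actions as state-to-state functions over the value of x (this encoding, and the bounded convergence check, are our own simplifications):

```python
# Example 1, encoded over the finite state space of x.
P = {1, 2}                                # invariant
T = {0, 1, 2}                             # fault span
A = [lambda x: 2 if x == 1 else x,        # x = 1 -> x := 2
     lambda x: 1 if x == 2 else x,        # x = 2 -> x := 1
     lambda x: 1 if x == 0 else x]        # x = 0 -> x := 1 (protection)
F = [lambda x: 0]                         # true -> x := 0

# Condition 1: P implies T.
assert P <= T
# Condition 2: T is closed under both program and fault actions.
assert all(a(x) in T for x in T for a in A + F)
# Condition 3: from every state in T, program actions alone reach P.
def converges(x, bound=10):
    for _ in range(bound):
        if x in P:
            return True
        x = next(a(x) for a in A if a(x) != x)   # take one enabled step
    return x in P
assert all(converges(x) for x in T)
```

All three assertions hold, so the program is F-tolerant for P with fault span T, exactly as claimed in the text.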

Note also that the definition does not refer directly to the problem specification of A. It assumes that the invariant P and the fault span T characterize the program’s safety property and that liveness is best achieved within the invariant. (Examples of systems that may not be live starting from states in P are given in the next section.) We therefore call states within P legal, states within T but not within P illegal, and — because they are not handled — states outside of T untolerated (see Figure 5). For instance, the correctness specification of the program in Example 1 could be stated as

(safety) The value of x is always either 1 or 2.

(liveness) At any point in the execution of the program, there is a future point where x will be 1 and there is a future point where x will be 2.

The legal states of the program therefore are those where x ∈ {1,2}, and the state x = 0 is called illegal. Because the program cannot recover from x = 3, this state is called untolerated. In legal states, the program will satisfy safety, whereas safety may or may not be met in illegal states. The reason for not referring directly to the correctness specifications of A is to be able to define different forms of fault tolerance, as explained in the next section.

Figure 4. Schematic overview of Definition 1.

In general, there can be different predicates T that cause A to be fault tolerant. This is an indication that the same fault tolerance can be achieved by different means. In the above example, if T were specified as x ∈ {0,1,2,3}, an additional “recovery” action x = 3 → x := 1 would be necessary to comply with the definition. As more states are tolerated, this is a distinct fault-tolerance ability.

4. FOUR FORMS OF FAULT TOLERANCE

To behave correctly, a distributed program A must satisfy both its safety and its liveness properties. This may no longer be the case if faults from a certain fault class are allowed to occur. So if faults occur, how are the properties of A affected? Four different possible combinations are shown in Table I.

If a program A still satisfies both its safety and its liveness properties in the presence of faults from a specified fault class F, then we say that A is masking fault tolerant for fault class F. This is the strictest, most costly, and most desirable form of fault tolerance, because the program is able to tolerate the faults transparently. Formally, this type is an instantiation of Definition 1 where the invariant P is equal to the fault span T. This means that faults from F and program actions from A cannot lead to states outside of P, thus never violating safety and liveness.

If neither safety nor liveness is guaranteed in the presence of faults from F, then the program does not offer any form of fault tolerance. This is the weakest, cheapest, most trivial, and most undesirable form of fault tolerance. In fact, one shouldn’t speak of a fault-tolerance ability here at all, at least not when referring to fault class F.

Two intermediate combinations exist: one guarantees safety but is not live (e.g., does not terminate properly); the other may still meet the liveness specification but is not safe. The former is called fail-safe fault tolerance and is preferable to the latter whenever safety is much more important than liveness. An example is the ground control system of the Ariane 5 space rocket project [Dega 1996]. The system was masking fault tolerant for a single component failure, but was also designed to stop in a safe state whenever two successive component failures occurred [Dega 1996]. For the latter type of faults, the launch of the rocket (liveness) was less important than the protection of its precious cargo and launch site (safety). Formally, it must be possible to divide the fault span into two disjoint sets, L and R. Both state sets are safe, but the system is live only in L. This is guaranteed if faults from a certain fault class F1 are considered. However, if faults from another, more severe, fault class F2 occur, the system may reach states in R that are safe but not live anymore.¹

¹ This captures the idea of a system being live as long as possible. Of course, liveness is a static

Figure 5. Overview of state sets (with corresponding predicates) and colloquial characterizations.

Table I. Four Forms of Fault Tolerance

                 live          not live
    safe         masking       fail-safe
    not safe     nonmasking    none
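The four combinations of Table I can be summarized as a small decision function. This is an illustrative sketch; the function name and boolean encoding are our own, not from the paper:

```python
def fault_tolerance_form(safe: bool, live: bool) -> str:
    """Classify a program's fault tolerance with respect to a fault
    class F, given whether its safety and liveness properties still
    hold in the presence of faults from F (cf. Table I)."""
    if safe and live:
        return "masking"      # strictest, most desirable form
    if safe:
        return "fail-safe"    # stays safe, but progress may stop
    if live:
        return "nonmasking"   # keeps making progress, safety may break
    return "none"             # no fault tolerance w.r.t. F
```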

8 • F. C. Gärtner

ACM Computing Surveys, Vol. 31, No. 1, March 1999


In contrast to masking fault tolerance, we call the latter combination from Table I (which ensures liveness but not safety) nonmasking, as the effects of faults are revealed by an invalidation of the safety property. In effect, the user may experience a certain amount of incorrect system behavior (i.e., failures). For example, a calculation result will be wrong or a replicated variable may not be up to date [Gärtner and Pagnia 1998]. But at least liveness is guaranteed, e.g., the program will terminate or a request will be granted.2

With respect to Definition 1, the fault span T contains states where the safety property does not hold.

Research has traditionally focused on forms of fault tolerance that continuously ensure safety. In particular, masking fault tolerance has been a major area of research [Avizienis 1976; Jalote 1994; Nelson 1990]. This can be attributed to the fact that in most fault-tolerance applications, safety is much more important than liveness.3 In many cases an invalidation of liveness is also more readily tolerated and observed by the user, because no liveness means no progress. In such cases, the user or an operator usually reluctantly (but at least safely) restarts the application.

For a long time nonmasking fault tolerance has been the "ugly duckling" in the field, as application scenarios for this type of fault tolerance are not readily visible (some are given by Arora et al. [1996] and by Singhai et al. [1998]). However, the potential of nonmasking fault tolerance lies in the fact that it is strictly weaker than masking fault tolerance, and it can therefore be used in cases where masking fault tolerance is too costly to implement or even provably impossible.

There has recently been much work on a specialization of nonmasking fault tolerance, called self-stabilization [Schneider 1993b; Dijkstra 1974], due to its intriguing power: formally, a program is said to be self-stabilizing iff the fault span from Definition 1 is the predicate true. This means that the program can recover from arbitrary perturbations of its internal state, so self-stabilizing programs can tolerate any kind of transient faults. However, examples show that such programs are quite difficult to construct and verify [Theel and Gärtner 1998]. Also, their nonmasking nature has inhibited them from yet becoming practically relevant.4
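As a concrete illustration of self-stabilization (not an example from this paper), the classic K-state token ring of Dijkstra [1974] can be sketched as follows. The fault span is true: starting from an arbitrary, possibly fault-corrupted configuration, the ring converges to configurations with exactly one "token" (one enabled process). The function names and the round-robin scheduler are our own simplifications:

```python
def enabled_moves(states, K):
    """Indices of processes currently holding a 'token'. Process 0 is
    enabled iff it equals its left neighbor (the last process); every
    other process i is enabled iff its state differs from process i-1."""
    n = len(states)
    moves = [0] if states[0] == states[n - 1] else []
    moves += [i for i in range(1, n) if states[i] != states[i - 1]]
    return moves

def move(states, K, i):
    """Let process i make its move (central-daemon semantics)."""
    if i == 0:
        states[0] = (states[0] + 1) % K
    else:
        states[i] = states[i - 1]

def stabilize(states, K, sweeps=None):
    """Run fair round-robin sweeps; with K >= len(states) this
    converges from any initial configuration to a legal one."""
    n = len(states)
    for _ in range(sweeps or K * n):
        for i in range(n):
            if i in enabled_moves(states, K):
                move(states, K, i)
    return states
```

After `stabilize([3, 1, 4, 1], K=5)`, exactly one process is enabled, no matter how the initial state was perturbed; this is the legal-state set of the ring.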

It should be noted that the liveness properties that are guaranteed in masking and nonmasking fault tolerance are guaranteed only eventually, i.e., the system may be inhibited from making progress as long as faults occur. This can be viewed as a temporary stagnation outside of the invariant but within the fault span. Thus, the system may slow down, but will at least eventually make progress. This can model aspects of a phenomenon known as graceful degradation [Herlihy and Wing 1991].

5. REDUNDANCY AS THE KEY TO FAULT TOLERANCE

5.1 Defining Redundancy

When dealing with fault-tolerance properties of distributed systems, a first observation is immediately at hand: no matter how well designed or how fault tolerant a system is, there is always the possibility of a failure if faults are too

2 Nonmasking fault tolerance usually requires the safety property to hold eventually [Arora and Kulkarni 1998b]. It is an open question whether programs that continuously violate safety are of any practical use.
3 Interestingly, we are more liable to call the fact that a system has not met its safety property a failure than when it invalidates liveness. Note that in the strict sense of a failure, both fail-safe and nonmasking fault tolerance can lead to failures. But since at least one of the two necessary correctness conditions holds, we should speak of a partial failure only.

4 To our knowledge, Singhai et al. [1998] are the first and only ones to describe a self-stabilizing algorithm implemented in a commercial system.


frequent or too severe. At first glance this frustrating evidence is an immediate consequence of the topic discussed in this section, i.e., fault tolerance is always limited in space or time.

The second central observation, which we try to substantiate here, is that to be able to tolerate faults, one must employ a form of redundancy. The usual meaning of redundancy implies that something is present but not needed, because something else does the same thing. For example, in information theory, the bits in a code that do not carry information are redundant. We now define what it means for a distributed program to be redundant and distinguish two forms of redundancy, namely in space and in time.

Definition 2. A distributed program A is said to be redundant in space iff for all executions e of A in which no faults occur, the set of all configurations of A contains configurations that are not reached in e.

Dually, A is said to be redundant in time iff for all executions e of A in which no faults occur, the set of actions of A contains actions that are never executed in e.

A program is said to employ redundancy iff it is redundant either in space or in time.

To better understand the intuition behind Definition 2, consider the code in Figure 6. The process in Figure 6 is redundant in space because in normal operation the state x = 0 is never reached. Redundancy in space refers to the superfluous part of the state of a system, i.e., states never reached when faults do not occur. (A program that is not redundant in space must eventually visit every state in every possible execution.)

On the other hand, redundancy in time refers to superfluous state transitions of a system, i.e., the superfluous work it performs. For example, in Figure 6 action 3 is never executed during normal operation because the state with x = 0 is never reached. Thus, according to Definition 2, the program is redundant in time. (A program that is not redundant in time must eventually execute every action in every possible execution.)
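Figure 6 itself is not reproduced here; a guarded-command process in its spirit (our hypothetical reconstruction, including the action numbering) might look like this:

```python
def act(x):
    """One guarded-command step. In fault-free runs starting at x == 1,
    only actions 1 and 2 ever fire and the state x == 0 is never
    reached: x == 0 is redundancy in space, and action 3, which fires
    only there, is redundancy in time."""
    if x == 1:       # action 1: normal operation
        return 2
    elif x == 2:     # action 2: normal operation
        return 1
    elif x == 0:     # action 3: correction, reachable only via a fault
        return 1
```

A fault that perturbs x to 0 is thereby detectable (x has left the fault-free state set) and is corrected by action 3 in the next step.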

The definition of redundancy in space is quite strong. Most programs we write contain states that are never reached simply because (for example) the ranges of variables are not properly subtyped. In fact, every program that does anything useful is redundant in space. This is obvious from a simple example: imagine a program that first calculates the sum s of two variables x and y (which contain nondeterministic integer values from a range 0…m) and then stops. (Nondeterministic values could depend, for example, on the relative ordering of messages received.) The result in s is in the range 0…2m, and when the sum has been calculated, there are many local states that are never reached during the computation, and thus are redundant. By this argument, one can show that all programs that do some simple form of arithmetic contain unsafe states that may be reached if faults occur. If some of these states are outside the fault span, then the program cannot tolerate these kinds of faults. This is an indication that redundancy alone does not guarantee fault

Figure 6. A program with redundancy in space and in time.


tolerance. However, in the next section we see that redundancy is necessary to prevent failures.

5.2 No Fault Tolerance Without Redundancy

The code in Figure 6 from the previous subsection employs redundancy. The objective now is to look at programs lacking any form of redundancy and study their fault-tolerance properties. Figure 7 contains an example of such a program. It shows a process that periodically runs through the states x = 0, 1, 2, 3, 0, 1, 2, 3, … forever. What can we say about the fault-tolerance properties of this code?

As a first observation, the reader will notice that the usual invariant of the program, namely P ≡ x ∈ {0, 1, 2, 3}, cannot be violated, because in effect it reduces to the predicate true. This is because no redundancy in space implies that all states are reached during normal operation of the program. So such a program cannot violate safety properties defined as state sets. However, we now see that even these kinds of programs can still deviate from their specification in the presence of faults. The argument is valid for all kinds of nontrivial computations (we give only a short, informal characterization of the term "nontrivial computation": it rules out trivial programs that do nothing and thus can tolerate any kind of faults).

We use a very weak notion of a "nontrivial" program here, so that the result is as strong as possible, and use the code in Figure 7 as an example. In a nontrivial program, every event on each individual process is assumed to be meaningful to the progress of the computation. So, in order to be correct, the program must satisfy the following specification: starting from an initial configuration, all specified events must occur in a certain prescribed order.5 For the code from Figure 7, this means that every execution must be of the form 0_0, 1_0, 2_0, 3_0, 0_1, 1_1, 2_1, 3_1, 0_2, 1_2, 2_2, 3_2, 0_3, … (where n_i denotes the event of the i-th occurrence of the state x = n). Thus, the program must satisfy the following two properties:

(liveness) For all i in N and for all n in {0, 1, 2, 3}, n_i must eventually occur.

(safety) If n_i and m_j are two successive events within the computation, then either i = j and n + 1 = m, or i + 1 = j, n = 3, and m = 0.
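The nonredundant cycle of Figure 7 and its safety condition can be sketched directly. This is our hypothetical reconstruction; the checker simply encodes the safety property above:

```python
def next_state(x):
    # Figure 7-style process: cycles 0, 1, 2, 3, 0, 1, ... forever.
    # Every state is reached and every action fires, so the program
    # is redundant neither in space nor in time.
    return (x + 1) % 4

def safe_pair(n, m):
    """Safety for successive events with states n, m: either m = n + 1
    (within the same round) or n = 3 and m = 0 (the round wraps)."""
    return m == n + 1 or (n == 3 and m == 0)

def trace_is_safe(trace):
    """Check every pair of successive states in an observed trace."""
    return all(safe_pair(n, m) for n, m in zip(trace, trace[1:]))
```

For example, a fault that decrements x from 1 back to 0 yields the trace [0, 1, 0, …], which `trace_is_safe` rejects, matching the "time warp into the past" argument below.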

The central point of this section is formulated in the following claim.

Claim 1. If A is a nontrivial distributed program that does not employ redundancy, then A may become incorrect with regard to its correctness specification in the presence of nontrivial faults.

In the context of this claim, a nontrivial fault is a fault action that actually changes program variables or effective error variables (recall that error variables are called effective if they govern the applicability of program actions). We now argue for this claim using the example code in Figure 7 and its correctness specification above.

We have already seen that the code in

5 The order can be thought of as a form of intransitive causality relation [Lamport 1978; Schwarz and Mattern 1994]. This notion includes normal sequential programs as well as distributed computations.

Figure 7. A program that does not employ redundancy.


Figure 7 does not employ redundancy, and so its set of configurations contains no redundant (and thus possibly illegal) states. So safety properties that are expressible as sets of legal states cannot be violated. Also, all program actions are used during normal executions, so there are no program actions that are merely inhibited from occurring due to error variables within their guard. Our goal is now to show that this program (and analogously the program A from Claim 1) does not satisfy its specification.

The argument is straightforward: assume that at some point during an execution a nontrivial fault occurs; this is shown for our example program in Figure 8. The configuration in which the fault occurs is one where x = 1. Since the fault is nontrivial, it results in a change of the extended system configuration. Here, two cases can occur: (1) the fault changes the normal program variables, or (2) the fault changes any (virtual) error variables. (A third case, in which both program and error variables are changed, can be ignored, since it can be reduced to one of the first two cases.) Let's look at these two cases in turn.

Case 1 (program variables have changed). For the example code, this means that the value of x must have changed, and thus the program finds itself in a new configuration with a different value of x. Let's assume that x was decremented and changed to 0. Now a state where x = 0 follows a state where x = 1, and thus the safety condition is violated. In general, such changes can cause a program to be "time-warped" into the past and execute events that causally precede events that have long since taken place, thus violating causality.6

Something similar occurs if the change to x is an increment. If x is set to 2, nobody could distinguish the transition from a normal program action, but we cannot assume that faults are always so benign. So let's assume x was changed to 3. Now the program misses an occurrence of x = 2, and thus safety is violated again. In general, this corresponds to a "time warp" into the future that jumps over important events that are necessary for the program to function correctly.7

Case 2 (effective error variables have changed). The code in Figure 7 contains neither fault actions nor (virtual) error variables. This makes reasoning about the properties of the program a little complicated, since, in this case, we do not know the concrete effects of changes to error variables. However, we assume that at least one effective error variable e has changed. Without loss of generality, let e be a boolean variable that changed from true to false. Because e is effective, it now inhibits regular program actions from occurring (i.e., those actions that have e as a conjunct in their guard). For example, e might inhibit action 1, action 2, or an arbitrary set of the four actions from Figure 7. It is clear that inhibiting a single action will inhibit progress of the complete process and that this fact is due to the nonredundant nature of the

6 Imagine the state x = 1 as the receipt of a message that the program is about to send when x = 0.
7 Imagine the missed event as the "commit" operation of a transaction that is the prerequisite of the state where x = 3.

Figure 8. Results of fault actions on a nonredundant program.


code. Hence, the liveness condition is violated.

In general, fault actions can result in a degraded functionality of a system (components may fail partially or complete processors may crash totally). If the system is nonredundant, the missing functionality may have been vital. Obviously, this may result in system failures if faults happen at the wrong time.

This argument should be sufficient to support Claim 1 (it could have been formally stated as a theorem, but this seemed unnecessary in a tutorial survey). The essential observation from these findings is that redundancy is a necessary prerequisite for fault tolerance.

5.3 Conclusions from the Necessity of Redundancy

The results of investigating nonredundant programs in the previous section are fundamental: while redundancy is not sufficient for fault tolerance, it is a necessary condition and thus must be applied intelligently. However, the argument also reveals another essential problem within fault-tolerant computing: it must be possible for the system to detect that a fault has occurred and to "know" about this fact (in the precise sense of knowledge, as defined, e.g., by Halpern and Moses [1990]). Detection mechanisms need either state space and/or program actions. As such mechanisms are not really used in fault-free executions, they form part of the system's redundancy.

As we see later in this text, detection mechanisms alone may suffice to ensure safety. But it should already have become clear that on detection of a fault, some action must be taken to correct it if liveness is to be guaranteed. Correction hereby implies detection, and this is a hint that detection (and thus safety) is easier to achieve than correction (which in effect means liveness). This is promising, since in Section 4 we noted that in practical settings safety is more important than liveness. However, adding detection and correction mechanisms is a complicated matter, because these mechanisms must themselves withstand the faults against which they should protect (the problem of faults occurring in a fault-exception mode). To state it more clearly: detection and correction mechanisms must themselves be fault tolerant.8

In practical fault-tolerant systems, redundancy in space is very widespread. It is normally achieved by supplying a component more than once. Practical examples are pervasive in the literature and comprise systems that multiply the software space (e.g., a parity bit in data transmission) or increase the hardware space (e.g., the concept of process pairs residing on different processor boards in the Tandem NonStop kernel [Bartlett 1978]). Of course, the border between hardware and software redundancy is not very sharp [Avizienis 1976]. In fact, hardware redundancy nearly always employs software redundancy because identical components usually run identical programs with identical states (see for example the process group concept of Isis [Birman 1993]). However, hardware redundancy is applied nearly always in fixed multiples of a certain base unit (processor, disk, etc.), whereas software redundancy can also employ fractions (as in error detection codes or the popular RAID systems [Patterson et al. 1988]).

On the other hand, redundancy in time can be thought of as repeating the

8 A technical note: the safety property of the example program of Claim 1 is not expressible as an invariant (i.e., a predicate over the system state). However, by adding a variable i that keeps track of the execution, a safety violation can be detected [Lynch 1996]. But now the program may be exposed to faults manipulating i and, by the same argument, there are (new) safety properties (involving i) that are again not expressible as state sets. So if a safety property p is expressible as a set of configurations, "always safe" programs exist. If p is of another form, we must rely on the fault tolerance of the detection mechanism to ensure safety.


same computation again and again within the same system. Practical examples of this behavior are exhibited by systems that roll back to a consistent state once a fault becomes visible (rollback recovery) or that calculate a result a given number of times to detect transient hardware faults.
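A minimal sketch of the second kind of redundancy in time (repeating a computation to detect or mask transient faults); the voter and all names are illustrative, not from the paper:

```python
from collections import Counter

def vote(f, x, copies=3):
    """Run f(x) several times on possibly fault-prone hardware.
    A majority among the repeated runs masks a transient fault in a
    minority of them; if no majority exists, the fault is at least
    detected, so the caller can stop safely."""
    results = [f(x) for _ in range(copies)]
    value, count = Counter(results).most_common(1)[0]
    if 2 * count > copies:
        return value                      # fault (if any) is masked
    raise RuntimeError("fault detected: no majority among repeated runs")
```

With `copies=3` this masks one transient fault per invocation; detection without masking would already be achieved by `copies=2` and an equality check.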

Another point of importance for the next section is the distinction between changes in the program's state and changes in the error variables. If a fault changes normal program variables, redundancy makes it possible to detect this change and often also to correct it. In such cases, redundancy can be seen as a form of "immediate repair." On the other hand, if only error variables change, the fault is somewhat more difficult to detect, since knowledge must be derived from something that is, in the worst case, not present (inhibited actions). This is a serious restriction in some of the widely used system models of distributed computation and is investigated in the next section.

6. MODELS OF COMPUTATION AND THEIR RELEVANCE

To be able to reason formally about properties of distributed systems and to refine our intuition about them, we must abstract away those aspects that are unnecessary for investigating a specific phenomenon. We then normally design a set of attributes and associated rules that define how the object in question is supposed to behave. By doing this, we build a model of the object, and the difficulty in doing so is to define a model that is both accurate and tractable [Schneider 1993a].

Because building a model divides important from unimportant issues, there is clearly no single correct model of distributed systems. Even the basic definitions in Section 2 are only one way to model the subject (albeit a widely accepted one). Lamport and Lynch [1990] compare different models with different personal views. But despite this fact, certain aspects of models appear repeatedly in the literature. For example, processes are usually modeled as state machines (or transition systems) that in turn are used to model the execution of larger distributed systems. Other aspects comprise assumptions about the network topology, the atomicity of actions, or the communication primitives available (some key words in this context are shared variables, point-to-point vs. broadcast/multicast transmission, and synchronous vs. asynchronous communication [Charron-Bost et al. 1996]).

Existing models of distributed systems differ in one particularly important aspect: their inherent notion of real time. This is usually expressed in certain assumptions about process execution speeds and message delivery delays. In synchronous systems, there are real-time bounds on message transmission and process response times. If no such assumptions are made, the system is called asynchronous [Schneider 1993a; Lamport and Lynch 1990]. Intermediate models that have such bounds to a varying degree are often found [Lynch 1996; Dolev et al. 1987; Dwork et al. 1988]; these are often called partially synchronous.

It is well known that the asynchronous model is the weakest one, meaning that every system is asynchronous (thus Schneider [1993a] calls asynchrony a "non-assumption"). A consequence of this observation is that every algorithm that works in the asynchronous model also works in all other models. On the other hand, algorithms for synchronous systems are prone to incorrect behavior if the implementation violates even a single timing constraint. This is why the asynchronous model is so attractive and has attracted so much interest in distributed systems theory.

Apart from its theoretical attractions, it has been argued [Chandra and Toueg 1996; Babaoglu et al. 1994; Le Lann 1995; Guerraoui and Schiper 1997] that the asynchronous model is also very realistic in many practical applications. In today's large-scale systems, network congestion and bandwidth shortage, in


addition to unreliable components, have contributed to a significant failure rate in reliable message delivery [Golding 1991; Long et al. 1991]. Highly variable workloads on network nodes also make reasoning based on time and timeouts a delicate and error-prone undertaking. All these facts are sources of asynchrony and contribute to the practical appeal of having few or, better, no synchrony assumptions.

However, as mentioned earlier, this model has severe drawbacks, especially for fault-tolerance applications: for example, it can be shown [Chandy and Misra 1986] that in such models it is impossible to detect whether a process has crashed or not. Intuitively, this is a result of allowing processes to be arbitrarily slow [Chandra and Toueg 1996]. This fact is important because we saw in the previous section that detection is the first step towards achieving fault tolerance. As a consequence, solutions to many problems that require all processes to participate in a nontrivial way are impossible [Fischer et al. 1985; Chandra et al. 1996; Fischer et al. 1986; Sabel and Marzullo 1995]. But in a positive light, it is commonly said that such impossibility is a blessing to the formal sciences because a lot of work can be done at the fringes [Dolev et al. 1987] of the problem, pushing the edge of the possible further towards the impossible. There are achievable cases [Attiya et al. 1987] in such environments too. We outline the two basic approaches now and consider them in detail in the following two sections.

The two approaches can be characterized as either restricting the system or extending the model. The former considers only systems and problems for which the existence of solutions has been proven [Attiya et al. 1987; Hadzilacos and Toueg 1994; Dolev et al. 1986]. This includes standard methods for traditional (nondistributed) uniprocessor fault tolerance because, as explained earlier, faults that manifest themselves as perturbations of local state variables are usually easily detectable, especially if they occur regularly, and thus can be anticipated.

The second approach deals with extending the model. This means that some synchrony assumptions are added to the model, e.g., a maximum message-delivery delay.9 This is a sensible approach when the system in question actually guarantees certain timing constraints. This is the case, for example, in real-time operating systems concerned with process response times. Another example is networks of processors that share a common bus, for which a maximum message latency can be guaranteed. However, in this approach it is obviously desirable to add as few synchrony requirements as possible and also to add them only to those parts of the system whose correctness depends on them. We meet such modular approaches in the following two sections, which discuss the two essential properties that must be achieved in order to guarantee fault tolerance.

7. ACHIEVING SAFETY

In Section 4 we saw the four distinct forms of fault tolerance based on the presence or absence of the system's safety and liveness properties when faults are allowed to occur. It was also noted that a program's safety property is usually related to the invariant P and the fault span T of Definition 1 (recall Figure 5). In particular, T can contain unsafe configurations, i.e., global states that violate safety. A result of the considerations in Section 5, and especially the argument accompanying Claim 1, is that the first step towards

9 The original work on partial synchrony [Dwork et al. 1988] assumes that upper bounds on message-delivery delay or relative processor speeds exist, but they are either unknown or hold only eventually. Other prominent models are the timed asynchronous model of Cristian and Fetzer [1998] and the quasi-synchronous model of Almeida et al. [1998] and Almeida and Veríssimo [1998]. The former mainly assumes a bounded drift rate of local hardware clocks, while the latter postulates that only part of the network is truly synchronous.


fault tolerance is to acquire knowledge about the fact that a fault has occurred. In this section, we consider basic methods for fault detection and see that in most cases they suffice to ensure safety. Detection mechanisms can therefore be used to implement fail-safe fault tolerance or as a first step towards masking fault tolerance.

7.1 Detection as the Basis for Safety

Confronted with the need to build safe applications, system designers who have read Section 5 may opt for an (at first glance) trivial method to ensure the usual safety properties: the program should not employ redundancy in space. If T ≡ P ≡ true, there are no illegal configurations the program can be in, and safety (and thus fail-safe fault tolerance) is trivially always satisfied, no matter what kinds of faults occur. This is, however, not so trivial in a distributed setting.

Recall that a system configuration is defined as the collection of all the local states of the processes plus the state of the communication subsystem. Thus, even for small systems, the number of different states that a system can be in is enormous. In fact, if there is no bound on the number of messages that the communication subsystem can buffer, there are infinitely many configurations. This finding, together with the observation from Section 5 that all programs that do some form of arithmetic are redundant in space, contributes to the fact that systems almost always contain configurations that lie outside of P or T, and thus may not satisfy safety. However, T characterizes the set of states reachable by faults from the fault class we want to tolerate, and so we can restrict our attention to states within T.

To continuously ensure safety, we should employ detection and subsequently inhibit "dangerous actions" [Arora and Kulkarni 1998c]. For example, single-bit errors over a transmission line can be detected by a parity bit that is sent along with the message. If we want to tolerate single-bit errors safely, the parity is computed and checked against the parity bit. If a fault is detected, the system must be inhibited from doing anything important with the transferred data, e.g., updating a local database or printing the message to the screen. Inhibiting such a dangerous action can be modeled by adding the result of the detection to its guard. An instructive application of this method is known as fail-stop processors [Schlichting and Schneider 1983], in which nodes that detect internal faults simply stop working in a way detectable from the outside. Arora and Kulkarni [1998c] view this as eliminating unsafe states from the fault span T, thus moving T closer to P.

Detection mechanisms such as parity

are common in practical systems: prominent generalizations of parity checking are the well-known error detection codes or cryptographic checksums and fingerprints used to guard data integrity. Modules that compare the results of repeated executions of a computation (comparators, voters) are also good examples. Other, broader notions are exception conditions and acceptance or plausibility tests of computed values.
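The parity-bit example can be sketched as follows (the `with_parity`/`deliver` names are ours). The detection result guards the "dangerous action," so single-bit errors are tolerated fail-safely:

```python
def with_parity(bits):
    """Sender side: append an even-parity bit to the data bits."""
    return bits + [sum(bits) % 2]

def deliver(frame, commit):
    """Receiver side. The detection predicate is a guard on the
    dangerous action 'commit': if a single bit was flipped in
    transit, the parity check fails and the action is inhibited."""
    data, p = frame[:-1], frame[-1]
    if sum(data) % 2 != p:      # detection: parity mismatch
        return None              # fail-safe: stop, do nothing important
    commit(data)                 # guard passed: act on the data
    return data
```

Note that this detects any odd number of bit flips but masks nothing: it yields fail-safe, not masking, fault tolerance for the single-bit fault class.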

Taken formally, detection always includes checking whether a certain predicate Q holds over the extended system state. If the type and effect of faults from F are known, it is in general easy to specify Q. Detection is also easy as long as the fault classes are not too severe (because the detection mechanism itself must be fault tolerant) and as long as nondistributed settings are considered. In distributed settings, the detectability of Q depends on different factors and is constrained by system properties, especially if the asynchronous model is chosen (see Section 6).


7.2 Detection in Distributed Settings

In distributed systems without a common time frame, deciding whether a predicate over the global state does or does not hold is in general not easy. One would usually assume a central observer that would watch the execution of a distributed algorithm and instantaneously take action if a given predicate were to hold. But because there is no central lookout point from which to observe the entire system at once, observing distributed computations in a consistent and truthful way is a key issue [Babaoglu and Marzullo 1993]. In fact, settings can easily be constructed in which two nodes observe the same computation but arrive at different decisions on whether a global predicate held or not [Schwarz and Mattern 1994; Babaoglu and Marzullo 1993]. This means that in general it makes no sense to ask about the validity of a global predicate without referring to a specific observer or (more generally) to a specific set of observations.

To simplify this matter, Cooper and Marzullo [1991] introduced two predicate transformers called possibly and definitely.10

Definition 3. Let Q be an arbitrary predicate over the system state. The predicate possibly(Q) is true iff there exists a continuous observation of the computation for which Q holds at some point. The predicate definitely(Q) is true iff for all possible continuous observations of the computation Q holds at some point.
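Definition 3 can be sketched by brute force (our own illustration; it ignores messages between the processes, so every interleaving of local events counts as a consistent observation): enumerate all observations of a two-process computation and evaluate Q along each.

```python
def observations(h1, h2):
    """All continuous observations (interleavings) of two local histories,
    each yielded as the sequence of global states (x, y) it passes through.
    h1 and h2 list the successive values written by each process."""
    def rec(i, j, states):
        if i == len(h1) and j == len(h2):
            yield states
            return
        if i < len(h1):   # next observed event is a step of process 1
            yield from rec(i + 1, j, states + [(h1[i], states[-1][1])])
        if j < len(h2):   # next observed event is a step of process 2
            yield from rec(i, j + 1, states + [(states[-1][0], h2[j])])
    yield from rec(0, 0, [(0, 0)])    # both variables start at 0

def possibly(Q, h1, h2):
    return any(any(Q(s) for s in obs) for obs in observations(h1, h2))

def definitely(Q, h1, h2):
    return all(any(Q(s) for s in obs) for obs in observations(h1, h2))

# Process 1 sets x to 1 then 2; process 2 sets y to 1 then 2.
Q = lambda s: s == (1, 1)
assert possibly(Q, [1, 2], [1, 2])        # some observer may see x = y = 1,
assert not definitely(Q, [1, 2], [1, 2])  # but another observer misses it
assert definitely(lambda s: s[0] + s[1] == 2, [1, 2], [1, 2])
```

The pair of assertions on Q is exactly the phenomenon described above: two observers of the same computation can disagree on whether Q ever held, so only possibly(Q) and definitely(Q) are observer-independent.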

Recall that in a nondistributed setting, one can eliminate an unsafe point from the fault span by adding a detection term to a possibly dangerous action. To ensure safety in a distributed setting, the analogy is that all processes watch out for an unwanted situation specified by Q that may lead to an unsafe state. In terms of the predicate transformers of Definition 3, the system must take action if possibly(Q) is detected [Garg 1997].

Algorithms that detect possibly(Q) usually take the following approach: Every time a node within the network undergoes a state transition that might affect the validity of Q, it sends a message to a central observer process. The observer process assembles the incoming information in such a way that it has an overview over all possible observations. Because it is not known which of these observations is true, this scheme still doesn't suffice to check whether a predicate Q actually holds. However, there is now enough information to conclude whether the predicate possibly holds or not. Note that this scheme must be implemented in a fault-tolerant manner if it is to be of any use in fault detection. In addition, it must be possible to inhibit a process from doing something bad if possibly(Q) is detected. This usually requires all processes to wait for an acknowledgment from the observer process.

There are some algorithms that can be used to detect possibly(Q) for general predicates [Stoller 1997; Stoller and Schneider 1995; Cooper and Marzullo 1991; Marzullo and Neiger 1991] or restricted ones [Garg and Waldecker 1994; Garg and Waldecker 1996] that consist of a conjunction of so-called local predicates. (A predicate q is called local to process p iff the truth value of q depends only on the local state of p [Charron-Bost et al. 1995]. See the papers by Schwarz and Mattern [1994] and Chase and Garg [1998] for introductory surveys.) Garg and Mitchell [1998] were the first to consider fault-tolerance issues along with possibility detection. Interestingly, there is a large body of literature on algorithms that have a close resemblance to possibility detection, whose fault-tolerance properties have been studied in great detail. These algorithms usually appear under the heading of "consensus" [Turek and Shasha 1992].

10 The definitions used here are rather informal, but should be sufficient for clarity. Interested readers are referred to background work by Babaoglu and Marzullo [1993] and Schwarz and Mattern [1994], as well as the initial papers by Cooper and Marzullo [1991] and Marzullo and Neiger [1991].

7.3 Adapting Consensus Algorithms for Fault-Tolerant Possibility Detection

The problem of consensus can be viewed as a general form of agreement [Guerraoui and Schiper 1997]. Here, a set of processes (each of which possesses an initial value) must all decide on a common value. Again, this is trivial in fault-free scenarios, but becomes interesting if processes are faulty and if the asynchronous system model is used.

The similarity between detection of predicates and consensus is visible when a very weak form of consensus is considered in which only one process must eventually decide [Fischer et al. 1985]. Algorithms for this task must diffuse the initial values within the system and collect the information at a single node. Now assume that the initial value of every process is a specific step it wants to take next. Then the central node that collects these values acquires knowledge about the state of all processes and what they want to do next. In effect, the central process acts as an observer that (as in possibility detection schemes) can construct all possible observations. If the other processes are willing to wait for an acknowledgment from the observer before taking their anticipated action, the process can subsequently influence the system.

Of course, this scheme is not very fault tolerant. For example, if the central observer crashes before sending acknowledgments, the system can be blocked forever. Or even worse: if the observer goes haywire by sending arbitrary messages to the other nodes, anything can happen to the system.

The idea of making consensus algorithms fault tolerant is to diffuse information to all nodes. If a certain subset of processes is also allowed to behave malevolently, then even more complicated mechanisms are used [Lamport et al. 1982]. However, it should be clear that consensus algorithms need a diffusion phase to be able to work correctly: Information must be received from all correct nodes by all correct nodes in order to arrive at a common decision [Bracha and Toueg 1985; Barborak et al. 1993; Chandra and Toueg 1996]. Thus, in algorithms for stronger versions of the problem, where all correct nodes must eventually decide, every process acts as an observer independently. If the scheme is repeated every time a process step may influence the validity of a predicate Q, a consensus algorithm can be used to detect possibly(Q).

An important result states that consensus is impossible in asynchronous systems where at least one process may crash [Fischer et al. 1985]. However, the problem becomes solvable if some weak forms of synchrony are introduced [Dolev et al. 1987; Fischer et al. 1986; Dwork et al. 1988; Chandra and Toueg 1996]. As we remarked previously, there even exist solutions to the consensus problem if a limited number of nodes behave maliciously and follow the very unfavorable Byzantine fault model [Lamport et al. 1982]. Using such algorithms, detection can be achieved in a very fault-tolerant manner.
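Since consensus is unsolvable in the purely asynchronous crash model, any runnable illustration must assume some synchrony. The following sketch (our own, with invented names; it is an illustration of the diffusion-plus-decision structure, not an asynchronous solution) assumes synchronous rounds: with at most f crash faults, f + 1 rounds of flooding the known values guarantee that all surviving processes decide the same value.

```python
import random

def flood_consensus(initial, f, crashed=None, crash_round=None):
    """Round-based flooding consensus for the crash fault model under a
    synchronous assumption: f+1 rounds of all-to-all diffusion, then
    every surviving process decides the minimum value it knows."""
    n = len(initial)
    known = [{v} for v in initial]          # values each process knows
    alive = [True] * n
    for rnd in range(f + 1):
        inbox = [set() for _ in range(n)]
        for p in range(n):
            if not alive[p]:
                continue
            if p == crashed and rnd == crash_round:
                alive[p] = False            # crash mid-round: only a
                receivers = range(random.randrange(n + 1))  # prefix is reached
            else:
                receivers = range(n)
            for q in receivers:
                inbox[q] |= known[p]
        for q in range(n):
            if alive[q]:
                known[q] |= inbox[q]
    return {p: min(known[p]) for p in range(n) if alive[p]}

decisions = flood_consensus([3, 1, 2], f=1, crashed=1, crash_round=0)
assert len(set(decisions.values())) == 1    # agreement among survivors
```

With at most f crashes, some round is crash-free, after which all surviving processes know the same set of values; no such round structure exists in a truly asynchronous system, which is the crux of the impossibility result cited above.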

7.4 Detecting Process Crashes

Up to now we have implicitly considered predicates that are formed over the system state without virtual error variables. How can predicates that include the latter items be detected in asynchronous systems? We have already noted that in the fully asynchronous model it is impossible to detect a process crash [Chandy and Misra 1986] and have attributed this to the absence of time in this model. Chandra and Toueg [1996] proposed a modular way of extending the asynchronous model to detect process crashes.


In their theory of unreliable failure detectors, they propose a program module that acts as an unreliable oracle on the functional states of neighboring processes. The main property of failure detectors is their accuracy: In its weakest form, a failure detector will never suspect at least one correct process of having crashed. This property is called weak accuracy. Because weak accuracy is difficult to achieve, it is often required only that the property eventually hold. Thus, an eventually weak failure detector may suspect every process at one time or another, but there is a time after which some correct process is no longer suspected. In effect, an eventually weak failure detector may make infinitely many mistakes in predicting the functional states of processes, but it is guaranteed to stop making mistakes about at least one process.

Requiring weak accuracy alone, however, doesn't ensure that a crashed node is suspected at all; it merely prohibits the detection mechanism from wrongly suspecting a correct node. So a second property of failure detectors is necessary, which Chandra and Toueg call completeness. Informally, completeness requires that every process that crashes is eventually suspected by some correct process.
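The two properties can be made concrete with the usual timeout-based construction (a sketch of our own, run in simulated discrete time): silence beyond a timeout triggers suspicion (completeness), and every suspicion that turns out to be a mistake makes the detector more patient, so mistakes about a correct process eventually stop (the eventual form of accuracy).

```python
class HeartbeatDetector:
    """Simulated timeout-based failure detector (eventually-weak style)."""
    def __init__(self, processes):
        self.last_seen = {p: 0 for p in processes}
        self.timeout = {p: 1 for p in processes}   # initial timeout guess
        self.suspects = set()

    def heartbeat(self, p, now):
        self.last_seen[p] = now
        if p in self.suspects:      # the suspicion was a mistake:
            self.suspects.discard(p)
            self.timeout[p] += 1    # be more patient with p from now on

    def tick(self, now):
        for p, seen in self.last_seen.items():
            if now - seen > self.timeout[p]:
                self.suspects.add(p)    # completeness: suspect silence

# "p" is correct but slow (a heartbeat arrives only every 3 steps);
# "q" crashes at time 10 and falls silent forever.
d = HeartbeatDetector(["p", "q"])
for now in range(1, 101):
    if now % 3 == 0:
        d.heartbeat("p", now)
    if now <= 10:
        d.heartbeat("q", now)
    d.tick(now)

assert "q" in d.suspects        # the crashed process stays suspected
assert "p" not in d.suspects    # mistakes about "p" eventually stopped
```

The detector briefly suspects "p" before its timeout adapts, which is exactly the finite number of mistakes an eventually weak failure detector is permitted.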

It can be shown that different forms of unreliable failure detectors (sometimes simply called failure suspectors [Schiper and Riccardi 1993]) are sufficient to solve important problems in asynchronous systems [Chandra and Toueg 1996; Schiper 1997; Schiper and Riccardi 1993]. While there are still unresolved implementation issues [Aguilera et al. 1997a; Chandra and Toueg 1996; Guerraoui and Schiper 1997], these failure detectors simplify the task of designing algorithms for asynchronous systems at large, because they encapsulate the notion of time neatly within a program module.11 So failure detectors can be used to detect inhibited actions, and thus the change of virtual error variables.

7.5 Summary

We have seen that detection is the key to safety: faults must be detected and actions must be inhibited in order to remain safe. Detection is easy locally, but in distributed settings it is generally difficult and requires intrinsically fault-tolerant mechanisms (like consensus algorithms and failure detectors) to be achieved.

8. ACHIEVING LIVENESS

As explained in Section 4, safety is more important than liveness in many practical situations. But to be truly (i.e., masking) fault tolerant, liveness must also be achieved, and it was noted that though liveness is less important, it is often more difficult to achieve. Detection is sufficient for safety. However, to be live, a fault must not only be detected but also corrected.

8.1 Correction as the Basis for Achieving Liveness

Liveness is tied to the notion of correction just as safety is bound to detection. The term correction refers to turning a bad state into a good one, and thus implies a notion of recovery. The hope is that while liveness may be inhibited by faults, subsequent recovery will establish a good state from which liveness is eventually resumed.

11 Obviously, models using unreliable failure detectors are no longer truly asynchronous; they merely produce the illusion of an asynchronous system by encapsulating all references to time.

Again, correction is easy in local situations: for example, once a single-bit error over a communication line is detected via a parity check, the situation can be corrected by requesting a retransmission of the data. Retransmission is the correction that leads to a state in which the data is received correctly. Broader notions of parity (analogous to error-detection codes) are the well-known error-correction codes. Notions of a local reset also exist and are often called rollback recovery. Dually, there also exist methods called rollforward recovery, by which the system state is forcefully advanced to some future good state. All these methods are examples of correction actions.
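The two local flavors of correction can be made concrete (our own minimal sketch, with invented names): a parity bit carries only enough redundancy to detect a single-bit error, so correction means retransmission, whereas an error-correcting code (here the simplest one, a 3x repetition code) carries enough redundancy to correct in place.

```python
def parity(bits):
    return sum(bits) % 2

def send_with_parity(payload, channel):
    """Detect via an even-parity bit; correct by retransmitting."""
    while True:
        frame = channel(payload + [parity(payload)])
        if parity(frame) == 0:   # detection predicate does not hold
            return frame[:-1]
        # otherwise a single-bit error was detected: retransmit

def make_flaky_channel():
    """Channel that flips one bit on the first transmission only."""
    calls = {"n": 0}
    def channel(frame):
        calls["n"] += 1
        frame = list(frame)
        if calls["n"] == 1:
            frame[0] ^= 1        # transient single-bit fault
        return frame
    return channel

def correct_repetition(coded):
    """3x repetition code: a majority vote corrects one bad copy per
    bit, with no retransmission needed (correction in place)."""
    return [1 if sum(coded[i:i + 3]) >= 2 else 0
            for i in range(0, len(coded), 3)]

data = [1, 0, 1, 1]
assert send_with_parity(data, make_flaky_channel()) == data

coded = [b for bit in data for b in (bit, bit, bit)]
coded[4] ^= 1                    # corrupt one copy of the second bit
assert correct_repetition(coded) == data
```

The trade-off visible here recurs throughout the section: detection plus a correction action (retransmission) costs less redundancy but needs a second round, while stronger redundancy (the code) corrects without one.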

While these methods are easy to design and add in centralized settings, the distributed case is again a little more complicated. In general, on detecting a bad state via a detection predicate Q, the system must try to impose a new target predicate R onto the system. This can be seen as a general form of constraint satisfaction [Arora et al. 1996] or distributed reset [Arora and Gouda 1994].

The choice of the target predicate R depends on the current situation within the system. In general, it is not easy to decide which predicate to choose, since the view of individual processes is restricted and a good choice for one process could be a bad choice for others. The method most frequently used in distributed settings arises from democracy: processes take a majority vote and impose the result of the vote onto themselves. Common examples of voting systems can be found in algorithms that manage multiple copies of a single data item in a distributed system. General methods to keep the copies consistent and to guarantee progress are known as data replication schemes [Bernstein et al. 1987]. In systems that enforce strong consistency constraints on data items (replicas), a process must often obtain a majority vote from the other processes to be able to update the data item in a mutually exclusive way. Note that the necessity of obtaining a majority ensures safety (here, consistency), and the fact that some process will eventually win a majority guarantees liveness (here, update progress).
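A minimal model of such a voting scheme (our own sketch, not a particular published protocol): writes and reads must each gather a majority of replicas. Because any two majorities intersect, a read always sees the latest committed version, even when a minority of replicas is unreachable.

```python
class Replica:
    def __init__(self):
        self.version, self.value = 0, None

def majority_of(replicas, reachable):
    need = len(replicas) // 2 + 1
    if len(reachable) < need:
        raise RuntimeError("no majority: refuse to proceed (safety)")
    return reachable[:need]

def write(replicas, reachable, value):
    quorum = majority_of(replicas, reachable)
    # this quorum intersects every previous write quorum, so the maximum
    # version seen here is at least the latest committed version
    version = max(r.version for r in quorum) + 1
    for r in quorum:
        r.version, r.value = version, value

def read(replicas, reachable):
    quorum = majority_of(replicas, reachable)
    return max(quorum, key=lambda r: r.version).value

rs = [Replica() for _ in range(5)]
write(rs, rs[1:4], "v1")            # replicas 0 and 4 are unreachable
assert read(rs, rs[0:3]) == "v1"    # a different majority still sees it
write(rs, rs[2:5], "v2")
assert read(rs, rs[0:3]) == "v2"
```

The exception in `majority_of` is the safety half of the observation in the text: without a majority the system refuses to act; liveness resumes as soon as some majority is reachable again.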

8.2 Correction via Consensus

The notion of correction also corresponds to the decision phase of consensus algorithms (see Section 7). When initial values are received from all functional processes, a node must irrevocably make a decision. This is sometimes done on the first nontrivial value received (for example, if processes can merely crash), or on a majority vote of all processes (if, for example, it is assumed that a certain fraction of processes may run haywire and upset others by sending arbitrary messages). In any case, the decision value is the same in all participating functional processes, and thus can be used as a clue to the next action they wish to take.

An obvious approach to ensuring liveness in distributed systems, which arises from the considerations above, was proposed by Schneider [1990] and is called the state machine approach. In it, servers are made fault tolerant by replicating them and coordinating their behavior via consensus algorithms. The idea is that servers act in a coordinated fashion: the servers diffuse their next move and agree on it using consensus. Next moves can be internal actions (for example, updating local data structures) or answers to requests by clients. The fault-tolerance properties of this approach depend on the type of consensus algorithm used.
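A toy rendering of the approach (our own sketch; the consensus layer is stubbed out by a fixed deterministic order, where a real system would run a consensus protocol): because replicas are deterministic state machines and apply the same agreed command sequence, their states never diverge.

```python
class CounterReplica:
    """A deterministic state machine; commands are ('add', n) or ('mul', n)."""
    def __init__(self):
        self.state = 0

    def apply(self, cmd):
        op, n = cmd
        self.state = self.state + n if op == "add" else self.state * n

def agree_on_order(proposals):
    """Stand-in for the consensus/atomic-broadcast layer: here we just
    fix a deterministic total order on the proposed commands."""
    return sorted(proposals)

replicas = [CounterReplica() for _ in range(3)]
log = agree_on_order([("add", 2), ("mul", 3), ("add", 1)])
for cmd in log:                  # every replica applies the same log
    for r in replicas:
        r.apply(cmd)

assert len({r.state for r in replicas}) == 1   # identical states
assert replicas[0].state == 9                  # ((0 + 1) + 2) * 3
```

Since all replicas hold the same state, any one of them can answer a client request; how many replica failures this masks is determined entirely by the consensus algorithm behind `agree_on_order`.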

Other methods that implement fault-tolerant services are based on several forms of fault-tolerant broadcasts [Hadzilacos and Toueg 1994]. It is interesting that a refined version of such communication primitives, called atomic broadcast, is closely related to consensus [Hadzilacos and Toueg 1994; Chandra and Toueg 1996]. This is another clue to the importance and omnipresence of consensus, often underestimated in practice, in fault-tolerant distributed computing [Guerraoui and Schiper 1997].

In this section, we briefly surveyed the basic methodologies for achieving liveness in distributed systems. While ensuring safety requires merely inhibiting dangerous actions, guaranteeing liveness requires coordinated correction actions, which are more liable to faults than mere detection. Again, consensus protocols play an important role in asynchronous systems.

8.3 Summary

This section describes fundamental methods for achieving liveness in distributed systems. The basic idea is to impose a state predicate on the system in cases where possibly illegal or unsafe states are detected. Detection is thus a prerequisite to correction. However, to prevent misunderstanding, it should be remarked that the methods presented in the previous two sections do not imply that masking fault tolerance is always possible [Arora and Kulkarni 1998c]. Depending on the fault class in question, there can be faults that lead a system directly into a state where safety is violated. Methods that ensure safety depend on detecting a fault before it can cause a transition to an unsafe state. This is obviously not possible for all fault classes. However, it may still be possible to guarantee liveness in such cases [Gärtner and Pagnia 1998; Schneider 1993b]. Thus, safety and liveness in the presence of faults may be achieved both at the same time or one without the other, supporting the utility of the four forms of fault tolerance introduced in Section 4.

9. RELATED AND CURRENT RESEARCH

The idea of dividing fault-tolerance actions into detection and correction phases goes back as far as 1976, to a basic survey by Avizienis [1976]. It is notable that relevant surveys of the area of fault-tolerant distributed computing [Jalote 1994; Siewiorek and Swarz 1992] contain this structure implicitly, without explicitly mentioning it. Lately, Arora and Kulkarni [1998a] have incorporated the idea of detection and correction into a methodology for adding fault-tolerance properties to intolerant programs in a stepwise and noninterfering manner (the concept of multitolerance [Arora and Kulkarni 1998b]).

The four forms of fault tolerance discussed in Section 4 result from their continuing efforts to find adequate formalizations of fault tolerance and related terms [Arora 1992; Arora and Gouda 1993; Arora et al. 1996; Kulkarni and Arora 1997; Arora and Kulkarni 1998b; 1998c; 1998a]. However, their model of computation is based on locally shared variables (i.e., processes have read access to the variables of neighboring nodes) and so has rather tight synchrony assumptions that result in elegant solutions to difficult problems [Kulkarni and Arora 1997]. The author is unaware of any attempt to generalize their observations. The present paper is a first step in this direction.

Interest in the topic of consensus in asynchronous distributed systems was very strong during the 1980s [Barborak et al. 1993] and seemed to fade away after Fischer et al. [1985] published their famous impossibility theorem. However, the concept of unreliable failure detectors [Chandra and Toueg 1996] has made practical solutions to this problem a little more feasible, and has also led to a flurry of research on which problems can now be solved with which kind of failure detector [Aguilera et al. 1997b; Sabel and Marzullo 1995; Chandra et al. 1996; Guerraoui et al. 1995; Schiper and Sandoz 1994]. There is strong evidence [Chandra and Toueg 1996] that the various classes of failure detectors capture the notion of time more adequately than the different levels of synchrony that were the subjects of previous research [Dolev et al. 1987; Dwork et al. 1988]. Current work focuses on extending failure suspectors to deal with node behavior other than the crash model [Doudou and Schiper 1997; Oliveira et al. 1997; Aguilera et al. 1998; Dolev et al. 1996; Hurfin et al. 1997]. The ideas in Section 5, dealing with the formalization of redundancy and its consequences, have, to the best of our knowledge, not appeared in the literature.


10. CONCLUSIONS AND FUTURE WORK

Despite substantial research effort, the design and implementation of distributed software remains a complex and intriguing endeavor [Schwarz and Mattern 1994; Hadzilacos and Toueg 1994]. However, although distributed systems have many inherent limitations (such as the lack of a global view or of a common time frame), many problems are easy to solve in ideal and fault-free environments. The possibility of faults complicates the matter substantially, but faults must be handled in order to achieve the dependability and reliability necessary in many practical applications. Therefore, aspects of fault tolerance are of considerable importance today.

In this paper, we have tried to use a formal approach to structure the area of fault-tolerant distributed computing, survey fundamental methodologies, and discuss their relations. We are convinced that continuing attempts to formalize important concepts in this area can reveal many aspects of its underlying structure. This should lead to a better understanding of the subject, and naturally to better, more reliable, and more dependable systems in practice. The advantages of a formal approach, however, also lie in the fact that it reveals the inherent limitations of fault-tolerance methodologies and their interactions with system models. Stating the impossibility of unconditional dependability is as important as trying to build systems that are increasingly reliable.

It is clear that this paper could not integrate the entire area of fault-tolerant distributed computing. Many topics still need further attention, such as aspects of fault diagnosis [Barborak et al. 1993], reliable communication [Hadzilacos and Toueg 1994; Basu et al. 1996b; Basu et al. 1996a], network topology and partitions [Ricciardi et al. 1993; Babaoglu et al. 1997; Aguilera et al. 1997c; Dolev et al. 1996], and software design faults [Jalote 1994]. We spared the reader probabilistic definitions of central terms like reliability and availability [Jalote 1994] and detailed discussions of implementation issues, test methods (such as fault injection [Hsueh et al. 1997]), and practical example systems [Siewiorek and Swarz 1992; Spector and Gifford 1984; Dega 1996]. All these issues need to be dealt with to see if the structures proposed in this paper can actually coherently organize most of the aspects of the field. The inherent interrelations that this paper reveals (e.g., between consensus and fault-tolerant possibility detection) also need to be pursued further.

Work can continue along other lines as well. For example, it is extremely important to characterize current information technology in terms of models and fault classes (i.e., to know which fault models to follow in which situations). Here, evaluation and simulation of practical systems is a key area [Chandra and Chen 1998; Kuhn 1997]. The effect of such considerations may be to embed practical systems into theoretical models more adequately (and thus more reliably). To this end, the benefits of a formal treatment of the subject can be directly transferred to practical settings, thus easing the task of designing, implementing, and maintaining fault-tolerant distributed systems.

ACKNOWLEDGMENTS

I am indebted to Friedemann Mattern and Henning Pagnia for reading a first draft of this paper and for their suggestions, which substantially improved the presentation. Hagen Völzer helped me clarify my understanding of safety and liveness during the final revision phase. I also wish to thank the anonymous referees for their encouraging comments and the Deutsche Forschungsgemeinschaft for making this work possible.

REFERENCES

AGUILERA, M. K., CHEN, W., AND TOUEG, S. 1997a. Heartbeat: A timeout-free failure detector for quiescent reliable communication. In Proceedings of the 11th International Workshop on Distributed Algorithms (WDAG97, Sept. 1997). 126–140.

AGUILERA, M. K., CHEN, W., AND TOUEG, S. 1997b. On the weakest failure detector for quiescent reliable communication. Technical Report TR97-1640. Department of Computer Science, Cornell University, Ithaca, NY.

AGUILERA, M. K., CHEN, W., AND TOUEG, S. 1997c. Quiescent reliable communication and quiescent consensus in partitionable networks. Technical Report TR97-1632. Department of Computer Science, Cornell University, Ithaca, NY.

AGUILERA, M. K., CHEN, W., AND TOUEG, S. 1998. Failure detection and consensus in the crash-recovery model. In Proceedings of the 12th International Symposium on Distributed Computing (DISC, Sept. 1998). 231–245.

ALMEIDA, C. AND VERÍSSIMO, P. 1998. Using light-weight groups to handle timing failures in quasi-synchronous systems. In Proceedings of the 19th IEEE Symposium on Real-Time Systems (Madrid, Spain, Dec.). IEEE Computer Society Press, Los Alamitos, CA.

ALMEIDA, C., VERÍSSIMO, P., AND CASIMIRO, A. 1998. The quasi-synchronous approach to fault-tolerant and real-time communication and processing. Technical Report CTI RT-98-04. Instituto Superior Técnico, Lisboa, Portugal.

ALPERN, B. AND SCHNEIDER, F. B. 1985. Defining liveness. Inf. Process. Lett. 21, 4 (Oct.), 181–185.

ARORA, A. AND GOUDA, M. 1993. Closure and convergence: a foundation of fault-tolerant computing. IEEE Trans. Softw. Eng. 19, 11 (Nov. 1993), 1015–1027.

ARORA, A. AND GOUDA, M. G. 1994. Distributed reset. IEEE Trans. Comput. 43, 9 (Sept.), 1026–1038.

ARORA, A., GOUDA, M., AND VARGHESE, G. 1996. Constraint satisfaction as a basis for designing nonmasking fault-tolerance. J. High Speed Netw. 5, 3, 293–306.

ARORA, A. AND KULKARNI, S. 1998a. Detectors and correctors: A theory of fault-tolerance components. In Proceedings of the 18th IEEE International Conference on Distributed Computing Systems (ICDCS98, May).

ARORA, A. AND KULKARNI, S. S. 1998b. Component based design of multitolerant systems. IEEE Trans. Softw. Eng. 24, 1 (Jan.), 63–78.

ARORA, A. AND KULKARNI, S. S. 1998c. Designing masking fault tolerance via nonmasking fault tolerance. IEEE Trans. Softw. Eng. 24, 6 (June).

ARORA, A. K. 1992. A foundation of fault-tolerant computing. Ph.D. Dissertation. University of Texas at Austin, Austin, TX.

ATTIYA, H., BAR-NOY, A., DOLEV, D., KOLLER, D., PELEG, D., AND REISCHUK, R. 1987. Achievable cases in an asynchronous environment. In Proceedings of the 28th Annual Symposium on Foundations of Computer Science (Oct. 1987). IEEE Computer Society Press, Los Alamitos, CA, 337–346.

AVIZIENIS, A. 1976. Fault-tolerant systems. IEEE Trans. Comput. 25, 12 (Dec.), 1304–1312.

BABAOGLU, Ö., BARTOLI, A., AND DINI, G. 1994. Replicated file management in large-scale distributed systems. In Proceedings of the 8th International Workshop on Distributed Algorithms (WDAG94). Springer-Verlag, Berlin, Germany, 1–16.

BABAOGLU, Ö., DAVOLI, R., AND MONTRESOR, A. 1997. Partitionable group membership: specification and algorithms. Technical Report UBLCS-97-1. Department of Computer Science, University of Bologna, Bologna, Italy.

BABAOGLU, Ö. AND MARZULLO, K. 1993. Consistent global states of distributed systems: fundamental concepts and mechanisms. In Distributed Systems (2nd ed.), S. Mullender, Ed. Addison-Wesley Longman Publ. Co., Inc., Reading, MA, 55–96.

BARBORAK, M., DAHBURA, A., AND MALEK, M. 1993. The consensus problem in fault-tolerant computing. ACM Comput. Surv. 25, 2 (June 1993), 171–220.

BARTLETT, J. F. 1978. A "NonStop" operating system. In Proceedings of the 11th Hawaii International Conference on System Sciences. IEEE Computer Society Press, Los Alamitos, CA.

BASU, A., CHARRON-BOST, B., AND TOUEG, S. 1996a. Simulating reliable links with unreliable links in the presence of process crashes. In Proceedings of the 10th International Workshop on Distributed Algorithms (WDAG96). 105–122.

BASU, A., CHARRON-BOST, B., AND TOUEG, S. 1996b. Solving problems in the presence of process crashes and lossy links. Technical Report TR96-1609. Department of Computer Science, Cornell University, Ithaca, NY.

BERNSTEIN, P. A., HADZILACOS, V., AND GOODMAN, N. 1987. Concurrency Control and Recovery in Database Systems. Addison-Wesley Longman Publ. Co., Inc., Reading, MA.

BIRMAN, K. P. 1993. The process group approach to reliable distributed computing. Commun. ACM 36, 12 (Dec. 1993), 37–53.

BRACHA, G. AND TOUEG, S. 1985. Asynchronous consensus and broadcast protocols. J. ACM 32, 4 (Oct. 1985), 824–840.

CHANDRA, S. AND CHEN, P. 1998. How fail-stop are faulty programs? In Proceedings of the 28th IEEE Symposium on Fault Tolerant Computing Systems (FTCS-28, June). IEEE Computer Society Press, Los Alamitos, CA, 240–249.

CHANDRA, T. D., HADZILACOS, V., AND TOUEG, S. 1996. The weakest failure detector for solving consensus. J. ACM 43, 4, 685–722.

CHANDRA, T. D., HADZILACOS, V., TOUEG, S., AND CHARRON-BOST, B. 1996. On the impossibility of group membership. In Proceedings of the Fifteenth Annual ACM Symposium on Principles of Distributed Computing (PODC '96, Philadelphia, PA, May 23–26, 1996), J. E. Burns and Y. Moses, Eds. ACM Press, New York, NY, 322–330.

CHANDRA, T. D. AND TOUEG, S. 1996. Unreliable failure detectors for reliable distributed systems. J. ACM 43, 2 (Mar.), 225–267.

CHANDY, K. M. AND MISRA, J. 1986. How processes learn. Distrib. Comput. 1, 1 (Jan. 1986), 40–52.

CHARRON-BOST, B., DELPORTE-GALLET, C., AND FAUCONNIER, H. 1995. Local and temporal predicates in distributed systems. ACM Trans. Program. Lang. Syst. 17, 1 (Jan. 1995), 157–179.

CHARRON-BOST, B., MATTERN, F., AND TEL, G. 1996. Synchronous, asynchronous, and causally ordered communication. Distrib. Comput. 9, 173–191.

CHASE, C. M. AND GARG, V. K. 1998. Detection of global predicates: Techniques and their limitations. Distrib. Comput. 11, 4, 191–201.

COOPER, R. AND MARZULLO, K. 1991. Consistent detection of global predicates. SIGPLAN Not. 26, 12 (Dec. 1991), 167–174.

CRISTIAN, F. 1985. A rigorous approach to fault-tolerant programming. IEEE Trans. Softw. Eng. 11, 1 (Jan.), 23–31.

CRISTIAN, F. 1991. Understanding fault-tolerant distributed systems. Commun. ACM 34, 2 (Feb. 1991), 56–78.

CRISTIAN, F. AND FETZER, C. 1998. The timed asynchronous distributed system model. In Proceedings of the 28th IEEE Symposium on Fault Tolerant Computing Systems (FTCS-28, June). IEEE Computer Society Press, Los Alamitos, CA, 140–149.

DEGA, J.-L. 1996. The redundancy mechanisms of the Ariane 5 operational control center. In Proceedings of the 26th IEEE Symposium on Fault Tolerant Computing Systems (FTCS-26). IEEE Computer Society, New York, NY, 382–386.

DIJKSTRA, E. W. 1974. Self-stabilizing systems in spite of distributed control. Commun. ACM 17, 11, 643–644.

DIJKSTRA, E. W. 1975. Guarded commands, nondeterminacy, and formal derivation of programs. Commun. ACM 18, 8 (Aug.), 453–457.

DOLEV, D., DWORK, C., AND STOCKMEYER, L. 1987. On the minimal synchronism needed for distributed consensus. J. ACM 34, 1 (Jan. 1987), 77–97.

DOLEV, D., FRIEDMAN, R., KEIDAR, I., AND MALKHI, D. 1996. Failure detectors in omission failure environments. Technical Report TR96-1608. Department of Computer Science, Cornell University, Ithaca, NY.

DOLEV, D., LYNCH, N. A., PINTER, S. S., STARK, E. W., AND WEIHL, W. E. 1986. Reaching approximate agreement in the presence of faults. J. ACM 33, 3 (July 1986), 499–516.

DOUDOU, A. AND SCHIPER, A. 1997. Mutenessdetectors for consensus with Byzantine pro-cesses. Technical report, EPFL. Ecole Poly-technique Federale de Lausanne, Lausanne,Switzerland.

DWORK, C., LYNCH, N., AND STOCKMEYER,L. 1988. Consensus in the presence of par-tial synchrony. J. ACM 35, 2 (Apr. 1988),288–323.

FISCHER, M., LYNCH, N., AND PATERSON,M. 1985. Impossibility of distributed con-sensus with one faulty process. J. ACM 32, 2(Apr. 1985), 374–382.

FISCHER, M. J., LYNCH, N. A., AND MERRITT,M. 1986. Easy impossibility proofs for dis-tributed consensus problems. Distrib. Com-put. 1, 1 (Jan. 1986), 26–39.

GARG, V. K. 1997. Observation and control fordebugging distributed computations. In 3rdInt. Workshop on Automated Debugging(AADEBUG 97, Linköping, Sweden,May). 1–12.

GARG, V. K. AND MITCHELL, J. R. 1998. Distribut-ed predicate detection in a faulty environmen-t. In Proceedings of the 18th IEEE Interna-tional Conference on Distributed ComputingSystems (ICDCS98, May).

GARG, V. K. AND WALDECKER, B. 1994. Detectionof weak unstable predicates in distributedprograms. IEEE Trans. Parallel Distrib.Syst. 5, 3, 299–307.

GARG, V. K. AND WALDECKER, B. 1996. Detectionof strong unstable predicates in distributedprograms. IEEE Trans. Parallel Distrib.Syst. 7, 12, 1323–1333.

GÄRTNER, F. C. 1998. Specifications for fault tolerance: A comedy of failures. Technical Report TUD-BS-1998-03. Darmstadt University of Technology, Darmstadt, Germany.

GÄRTNER, F. C. AND PAGNIA, H. 1998. Enhancing the fault tolerance of replication: another exercise in constrained convergence. In Digest of Fast Abstracts of the 28th IEEE Symposium on Fault Tolerant Computing Systems (FTCS-28, June). IEEE Computer Society Press, Los Alamitos, CA.

GOLDING, R. A. 1991. Accessing replicated data in a large-scale distributed system. Technical Report UCSC-CRL-91-18. Computer Research Laboratory, University of California, Santa Cruz, CA.

GUERRAOUI, R., LARREA, M., AND SCHIPER, A. 1995. Non blocking atomic commitment with an unreliable failure detector. In Proceedings of the 14th IEEE Symposium on Reliable Distributed Systems (SRDS95, Bad Neuenahr, Germany, Sept. 1995). IEEE Press, Piscataway, NJ.

24 • F. C. Gärtner

ACM Computing Surveys, Vol. 31, No. 1, March 1999

GUERRAOUI, R. AND SCHIPER, A. 1997. Consensus: the big misunderstanding. In Proceedings of the 6th Workshop on Future Trends of Distributed Computing Systems (FTDCS-6, Oct. 1997).

HADZILACOS, V. AND TOUEG, S. 1994. A modular approach to fault-tolerant broadcasts and related problems. Technical Report TR94-1425. Department of Computer Science, Cornell University, Ithaca, NY.

HALPERN, J. Y. AND MOSES, Y. 1990. Knowledge and common knowledge in a distributed environment. J. ACM 37, 3 (July 1990), 549–587.

HERLIHY, M. P. AND WING, J. M. 1991. Specifying graceful degradation. IEEE Trans. Parallel Distrib. Syst. 2, 1, 93–104.

HSUEH, M.-C., TSAI, T. K., AND IYER, R. K. 1997. Fault injection techniques and tools. IEEE Computer 30, 4, 75–82.

HURFIN, M., MOSTEFAOUI, A., AND RAYNAL, M. 1997. Consensus in asynchronous systems where processes can crash and recover. Technical Report 1144. IRISA, Rennes Cedex, France.

JALOTE, P. 1994. Fault tolerance in distributed systems. Prentice-Hall, Inc., Upper Saddle River, NJ.

KUHN, D. R. 1997. Sources of failure in the public switched telephone network. IEEE Computer 30, 4, 31–36.

KULKARNI, S. S. AND ARORA, A. 1997. Compositional design of multitolerant repetitive Byzantine agreement. In Proceedings of the 18th International Conference on the Foundations of Software Technology and Theoretical Computer Science (Kharagpur, India).

LAMPORT, L. 1977. Proving the correctness of multiprocess programs. IEEE Trans. Softw. Eng. 3, 2 (Mar.), 125–143.

LAMPORT, L. 1978. Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21, 7, 558–565.

LAMPORT, L. AND LYNCH, N. 1990. Distributed computing: models and methods. In Handbook of Theoretical Computer Science (vol. B): Formal Models and Semantics, J. van Leeuwen, Ed. MIT Press, Cambridge, MA, 1157–1199.

LAMPORT, L., SHOSTAK, R., AND PEASE, M. 1982. The Byzantine generals problem. ACM Trans. Program. Lang. Syst. 4, 3 (July), 382–401.

LAPRIE, J. C. 1985. Dependable computing and fault tolerance: concepts and terminology. In Proceedings of the 15th IEEE Symposium on Fault Tolerant Computing Systems (FTCS-15, June 1985). IEEE Computer Society Press, Los Alamitos, CA, 2–11.

LE LANN, G. 1995. On real-time and non real-time distributed computing. In Proceedings of the 9th International Workshop on Distributed Algorithms (WDAG95, Sept.). Springer-Verlag, Berlin, Germany, 51–70.

LONG, D. D. E., CARROLL, J. L., AND PARK, C. J. 1991. A study of the reliability of Internet sites. In Proceedings of the 10th IEEE Symposium on Reliable Distributed Systems (Pisa, Italy, Sept.). IEEE Press, Piscataway, NJ, 177–186.

LYNCH, N. 1996. Distributed Algorithms. Morgan Kaufmann, San Mateo, CA.

MARZULLO, K. AND NEIGER, G. 1991. Detection of global state predicates. In Proceedings of the 5th International Workshop on Distributed Algorithms (WDAG91). 254–272.

MULLENDER, S., Ed. 1993. Distributed Systems (2nd ed.). Addison-Wesley Longman Publ. Co., Inc., Reading, MA.

NELSON, V. P. 1990. Fault-tolerant computing: fundamental concepts. IEEE Computer 23, 7 (July), 19–25.

OLIVEIRA, R., GUERRAOUI, R., AND SCHIPER, A. 1997. Consensus in the crash-recover model. Technical Report TR-97/239. Ecole Polytechnique Federale de Lausanne, Lausanne, Switzerland.

PATTERSON, D. A., GIBSON, G., AND KATZ, R. H. 1988. A case for redundant arrays of inexpensive disks (RAID). In Proceedings of the Conference on Management of Data (SIGMOD ’88, Chicago, IL, June 1-3, 1988), H. Boral and P.-A. Larson, Eds. ACM Press, New York, NY, 109–116.

RICCIARDI, A., SCHIPER, A., AND BIRMAN, K. 1993. Understanding partitions and the “no partition” assumption. In Proceedings of the 4th Workshop on Future Trends of Distributed Computing Systems (FTDCS-4).

SABEL, L. S. AND MARZULLO, K. 1995. Election vs. consensus in asynchronous systems. Technical Report TR95-1488. Department of Computer Science, Cornell University, Ithaca, NY.

SCHIPER, A. 1997. Early consensus in an asynchronous system with a weak failure detector. Distrib. Comput. 10, 3, 149–157.

SCHIPER, A. AND RICCARDI, A. 1993. Virtually-synchronous communication based on a weak failure suspector. In Proceedings of the 23rd IEEE Symposium on Fault Tolerant Computing Systems (FTCS-23). IEEE Computer Society Press, Los Alamitos, CA, 534–543.

SCHIPER, A. AND SANDOZ, A. 1994. Primary partition “virtually-synchronous communication” harder than consensus. In Proceedings of the 8th International Workshop on Distributed Algorithms (WDAG94). Springer-Verlag, New York, 39–52.

SCHLICHTING, R. D. AND SCHNEIDER, F. B. 1983. Fail stop processors: An approach to designing fault-tolerant computing systems. ACM Trans. Comput. Syst. 1, 3 (Aug.), 222–238.

SCHNEIDER, F. B. 1990. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Comput. Surv. 22, 4 (Dec. 1990), 299–319.

SCHNEIDER, F. B. 1993. What good are models and what models are good? In Distributed Systems (2nd ed.), S. Mullender, Ed. Addison-Wesley Longman Publ. Co., Inc., Reading, MA, 17–26.

SCHNEIDER, M. 1993. Self-stabilization. ACM Comput. Surv. 25, 1 (Mar. 1993), 45–67.

SCHWARZ, R. AND MATTERN, F. 1994. Detecting causal relationships in distributed computations: in search of the holy grail. Distrib. Comput. 7, 149–174.

SIEWIOREK, D. P. AND SWARZ, R. S. 1992. Reliable computer systems (2nd ed.): design and evaluation. Digital Press, Newton, MA.

SINGHAI, A., LIM, S.-B., AND RADIA, S. R. 1998. The SunSCALR framework for internet servers. In Proceedings of the 28th IEEE Symposium on Fault Tolerant Computing Systems (FTCS-28, June). IEEE Computer Society Press, Los Alamitos, CA, 108–117.

SPECTOR, A. AND GIFFORD, D. 1984. The space shuttle primary computer system. Commun. ACM 27, 9, 874–900.

STOLLER, S. D. 1997. Detecting global predicates in distributed systems with clocks. In Proceedings of the 11th International Workshop on Distributed Algorithms (WDAG ’97, Sept. 1997), M. Mavronicolas and P. Tsigas, Eds. Lecture Notes in Computer Science, vol. 1320. Springer-Verlag, Berlin, Germany, 185–199.

STOLLER, S. D. AND SCHNEIDER, F. B. 1995. Faster possibility detection by combining two approaches. In Proceedings of the 9th International Workshop on Distributed Algorithms (WDAG95, Sept.). Springer-Verlag, Berlin, Germany, 318–332.

TEL, G. 1994. Introduction to distributed algorithms. Cambridge University Press, New York, NY.

THEEL, O. AND GÄRTNER, F. C. 1998. On proving the stability of distributed algorithms: self-stabilization vs. control theory. In Proceedings of the International Computers Conference on Systems, Signals, Control (SSCC’98, Durban, South Africa, Sept.), V. B. Bajic, Ed. 58–66.

TUREK, J. AND SHASHA, D. 1992. The many faces of consensus in distributed systems. IEEE Computer 25, 6 (June), 8–17.

VÖLZER, H. 1998. Verifying fault tolerance of distributed algorithms formally: An example. In Proceedings of the International Conference on Application of Concurrency to System Design (CSD98, Fukushima, Japan, Mar. 1998). IEEE Computer Society Press, Los Alamitos, CA, 187–197.

Received: June 1998; revised: January 1999; accepted: January 1999
