ieee transactions on parallel and …motta/dim070/apresentacoes/faultinjection.pdfa...

13
A Global-State-Triggered Fault Injector for Distributed System Evaluation Ramesh Chandra, Ryan M. Lefever, Kaustubh R. Joshi, Michel Cukier, Member, IEEE, and William H. Sanders, Fellow, IEEE Abstract—Validation of the dependability of distributed systems via fault injection is gaining importance because distributed systems are being increasingly used in environments with high dependability requirements. The fact that distributed systems can fail in subtle ways that depend on the state of multiple parts of the system suggests that a global-state-based fault injection mechanism should be used to validate them. However, global-state-based fault injection is challenging since it is very difficult in practice to maintain the global state of a distributed system at runtime with minimal intrusion into the system execution. This paper presents Loki, a global- state-based fault injector, which has been designed with the goals of low intrusion, high precision, and high flexibility. Loki achieves these goals by utilizing the ideas of partial view of global state, optimistic synchronization, and offline analysis. In Loki, faults are injected based on a partial view of the global state of the system, and a postruntime analysis is performed to place events and injections into a single global timeline and to discard experiments with incorrect fault injections. Finally, the experiments with correct fault injections are used to estimate user-specified performance and dependability measures. A flexible measure language has been designed that facilitates the specification of a wide range of measures. Index Terms—Distributed systems, reliable systems, system evaluation, fault injection, partial view of global state, offline clock synchronization, measure estimation. æ 1 INTRODUCTION T HE increasing use of distributed systems in applications with high availability and reliability requirements, ranging from commercial Web servers to air traffic control systems, makes it necessary to develop techniques to validate the dependability of these systems. Fault injection has been among the most practical and effective of these techniques, and can be defined as a way to test a fault- tolerant system with respect to a class of inputs specific to such a system, i.e., the faults [2]. Since failures in a distributed system can depend on its global state, it is important that fault injections in a distributed system be triggered based on its global state. Global-state-based triggers are useful for both fault removal and system evaluation. Global-state-based triggers are useful for fault removal because they make it possible to introduce faults at very precise points in the execution of the system and observe the effects in detail. Since system evaluation requires the fault injection profile to be repre- sentative of reality, it may not be obvious why sophisticated global-state-based triggers are needed for evaluation. However, there are three important reasons why such triggers are useful. First, since global-state-based triggers can be used to drive a distributed system into certain global states that can be hard to reach otherwise, they are useful in exercising and evaluating hard-to-reach parts of the system. Second, distributed systems are evolving to the point that both the fault models and the measures needed to describe their dependability properties are dependent on their state. For example, software faults are part of an increasingly important class of faults that may need certain precondi- tions in order to manifest themselves. Global triggers can be used both to drive the system into the states where these preconditions are true, and to inject faults only when they are. Third, global-state-based fault triggers can be used to obtain conditional system measures, which can then be used with models to get system measures for large classes of workloads. Global-state-based fault injection of a distributed system is inherently difficult, because individual nodes in the system may not be aware of the current global state of the entire system at any point in time. Past research in fault injection has focused on stand-alone systems (e.g., [21], [20], [24]), distributed systems with local-state-based triggers (e.g., [12], [9], [22]), and centralized simulation of faults in distributed systems [1]; very little work has been done in global-state-based fault injection of distributed systems. Existing work on determining the global state of a distributed system includes techniques such as snapshot algorithms [7], distributed debugging [18], and distributed monitoring [16], [4]. Those methods log the local states of nodes in the system for postprocessing, and are useful for performance monitoring or fault monitoring. Because of the offline nature of those techniques, it is not possible to create the required fault injector simply by extending them with triggering mechanisms so that they can inject faults. Similarly, synchronizing the system at every state change IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 15, NO. 7, JULY 2004 593 . R. Chandra is with the Department of Computer Science, Stanford University, 353 Serra Mall, Room 407, Stanford, CA 94305. E-mail: [email protected]. . R.M. Lefever, K.R. Joshi, and W.H. Sanders are with the Coordinated Science Laboratory and Department of Electrical and Computer Engineer- ing, University of Illinois at Urbana-Champaign, 1308 West Main Street, Urbana, IL 61801. E-mail: {lefever, joshi1, whs}@crhc.uiuc.edu. . M. Cukier is with the Department of Mechanical Engineering, University of Maryland, 2100 Marie Mount Hall, College Park, MD 20742-7531. E-mail: [email protected]. Manuscript received 27 Mar. 2003; revised 22 Oct. 2003; accepted 11 Dec. 2003. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPDS-0043-0303. 1045-9219/04/$20.00 ß 2004 IEEE Published by the IEEE Computer Society

Upload: vukhuong

Post on 16-Mar-2018

215 views

Category:

Documents


1 download

TRANSCRIPT

A Global-State-Triggered Fault Injector forDistributed System Evaluation

Ramesh Chandra, Ryan M. Lefever, Kaustubh R. Joshi,

Michel Cukier, Member, IEEE, and William H. Sanders, Fellow, IEEE

Abstract—Validation of the dependability of distributed systems via fault injection is gaining importance because distributed systems

are being increasingly used in environments with high dependability requirements. The fact that distributed systems can fail in subtle

ways that depend on the state of multiple parts of the system suggests that a global-state-based fault injection mechanism should be

used to validate them. However, global-state-based fault injection is challenging since it is very difficult in practice to maintain the

global state of a distributed system at runtime with minimal intrusion into the system execution. This paper presents Loki, a global-

state-based fault injector, which has been designed with the goals of low intrusion, high precision, and high flexibility. Loki achieves

these goals by utilizing the ideas of partial view of global state, optimistic synchronization, and offline analysis. In Loki, faults are

injected based on a partial view of the global state of the system, and a postruntime analysis is performed to place events and

injections into a single global timeline and to discard experiments with incorrect fault injections. Finally, the experiments with correct

fault injections are used to estimate user-specified performance and dependability measures. A flexible measure language has been

designed that facilitates the specification of a wide range of measures.

Index Terms—Distributed systems, reliable systems, system evaluation, fault injection, partial view of global state, offline clock

synchronization, measure estimation.

1 INTRODUCTION

THE increasing use of distributed systems in applicationswith high availability and reliability requirements,

ranging from commercial Web servers to air traffic controlsystems, makes it necessary to develop techniques tovalidate the dependability of these systems. Fault injectionhas been among the most practical and effective of thesetechniques, and can be defined as a way to test a fault-tolerant system with respect to a class of inputs specific tosuch a system, i.e., the faults [2].

Since failures in a distributed system can depend on itsglobal state, it is important that fault injections in adistributed system be triggered based on its global state.Global-state-based triggers are useful for both fault removaland system evaluation. Global-state-based triggers areuseful for fault removal because they make it possible tointroduce faults at very precise points in the execution ofthe system and observe the effects in detail. Since systemevaluation requires the fault injection profile to be repre-sentative of reality, it may not be obvious why sophisticatedglobal-state-based triggers are needed for evaluation.However, there are three important reasons why suchtriggers are useful. First, since global-state-based triggers

can be used to drive a distributed system into certain globalstates that can be hard to reach otherwise, they are useful inexercising and evaluating hard-to-reach parts of the system.Second, distributed systems are evolving to the point thatboth the fault models and the measures needed to describetheir dependability properties are dependent on their state.For example, software faults are part of an increasinglyimportant class of faults that may need certain precondi-tions in order to manifest themselves. Global triggers can beused both to drive the system into the states where thesepreconditions are true, and to inject faults only when theyare. Third, global-state-based fault triggers can be used toobtain conditional system measures, which can then beused with models to get system measures for large classesof workloads.

Global-state-based fault injection of a distributed systemis inherently difficult, because individual nodes in thesystem may not be aware of the current global state of theentire system at any point in time. Past research in faultinjection has focused on stand-alone systems (e.g., [21], [20],[24]), distributed systems with local-state-based triggers(e.g., [12], [9], [22]), and centralized simulation of faults indistributed systems [1]; very little work has been done inglobal-state-based fault injection of distributed systems.Existing work on determining the global state of adistributed system includes techniques such as snapshotalgorithms [7], distributed debugging [18], and distributedmonitoring [16], [4]. Those methods log the local states ofnodes in the system for postprocessing, and are useful forperformance monitoring or fault monitoring. Because of theoffline nature of those techniques, it is not possible to createthe required fault injector simply by extending them withtriggering mechanisms so that they can inject faults.Similarly, synchronizing the system at every state change

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 15, NO. 7, JULY 2004 593

. R. Chandra is with the Department of Computer Science, StanfordUniversity, 353 Serra Mall, Room 407, Stanford, CA 94305.E-mail: [email protected].

. R.M. Lefever, K.R. Joshi, and W.H. Sanders are with the CoordinatedScience Laboratory and Department of Electrical and Computer Engineer-ing, University of Illinois at Urbana-Champaign, 1308 West Main Street,Urbana, IL 61801. E-mail: {lefever, joshi1, whs}@crhc.uiuc.edu.

. M. Cukier is with the Department of Mechanical Engineering, Universityof Maryland, 2100 Marie Mount Hall, College Park, MD 20742-7531.E-mail: [email protected].

Manuscript received 27 Mar. 2003; revised 22 Oct. 2003; accepted 11 Dec.2003.For information on obtaining reprints of this article, please send e-mail to:[email protected], and reference IEEECS Log Number TPDS-0043-0303.

1045-9219/04/$20.00 � 2004 IEEE Published by the IEEE Computer Society

and performing the required fault injections will not work,since it is very intrusive to system execution and couldchange the behavior of the system in an undesired way.Thus, a new mechanism is needed that will perform global-state-based fault injection in distributed systems.

Loki is a distributed system fault injector that has beendeveloped to fill this need. We believe that the keyrequirements for a fault injector are low intrusion, highprecision, andhigh flexibility; Loki has beendesigned tomeetthese requirements in the context of global-state-based faultinjection. A few key ideas have shaped Loki’s design so as tomake this possible. Those ideas include the concepts of

1. state machine abstraction,2. optimistic synchronization, and3. separation of mechanism from policy in fault

injection, as well as the observations that4. fault triggers depend only on a part of the global

state and that5. offline analysis is sufficient for system evaluation.

Items 2, 4, and 5 help in decreasing the intrusion andincreasing the precision of fault injection using Loki.Observation 4 implies that, in general, the triggers for faultinjections in a particular node depend on the state of only afew other nodes. Loki makes use of this observation tooptimize the amount of state update traffic on the networkby sending state change notifications only between therequired nodes, and by tracking only the required part ofthe global state at each of the nodes. Observation 5 suggeststhat since fault injection is a process of evaluation, it doesnot need online synchronization. Loki utilizes this observa-tion and performs offline analysis, thus decreasing theruntime intrusion. With respect to item 2, while checkingfault triggers and performing fault injections, Loki optimis-tically assumes that the nodes are correctly synchronizeddue to the state change notifications, i.e., the locallyavailable view of the global state is assumed to be right.However, this assumption could result in fault injections inthe wrong global states. We remedy that during the offlineanalysis by verifying each fault injection and discardingexperiments with incorrect fault injections. This loosecoupling among the components of Loki decreases itsruntime intrusion and improves its scalability; thus, Lokiperforms well as long as the relevant state changes are nottoo rapid. In addition, Loki uses hardware clocks wheneverpossible to further decrease runtime intrusion [13] andrecord experiment data with precision.

In the past,many fault injectors have been built for specificsystems, and it was nontrivial to extend them to evaluateother systems. In contrast, Loki has been built to be flexible sothat it can be used for evaluating different kinds of systemswith varying fault types. This has been achieved through useof item 3, i.e., separation of themechanismused to carry out afault injection (namely, application-independent tasks suchas running of experiments, collection ofmeasurements, clocksynchronization, analysis of results, and obtaining of mea-sures) from the policy of fault injection (namely, application-specific tasks such as state machine specification, types offaults, and specification of fault triggers). Additionally, Lokiprovides flexibility in the kinds of measures that the user canobtain fromexperiment data.Most previous fault injectors donot provide such flexiblemeans to process the data collected.

An extensive graphical user interface has been developed forLoki to assist the user in the specification of fault injectioncampaigns, execution of the campaigns, and obtaining ofdesired measures from the results of the campaigns. Inaddition to being the management console for the Loki faultinjector, the graphical interface performs important runtimefunctions.AdescriptionofLoki’s graphical user interface andother operational details of the Loki tool are described inAppendix A (which can be found on the Computer SocietyDigital Library at http://computer.org/tpds/archives.htm).

In this paper, we present proof-of-concept results fromexperiments in which Loki was used to inject faults in asimple leader election protocol. The results verify the Lokifault injection tool’s functionality. However, Loki has alsobeen used in two case studies involving experimentalevaluation of large-scale distributed systems. The first casestudy is an experimental evaluation of unavailability due tothe group membership protocol in the Ensemble groupcommunication system [14]. The second case study is anexperimental evaluation of some of the protocols used toprovide high availability in the Coda distributed file system[17]. Both case studies demonstrate the efficacy of Loki inevaluating complex distributed systems.

2 LOKI CONCEPTS AND TERMINOLOGY

. State and state machine specification: The concept ofstate is fundamental to Loki. The user has to choosethe abstraction of states for each component of thedistributed system and define a state machine totrack the component’s state at that level of abstrac-tion. The execution of the component generates localevents that trigger transitions in the state machine.The current state of any component is called its localstate, and the vector of the local states of all thecomponents in the system is the global state.

. Node: The user should divide his/her distributedapplication into a set of basic components, each ofwhich has a state machine specification and a faulttrigger specification. Each of these basic components,combined with the Loki runtime code, is called anode. All fault injections and recordings of the statechange and fault injection times are performed at thenode level.

. Fault trigger: A fault trigger specifies the conditionsunder which a particular fault is to be injected into anode. It is defined as a Boolean expression over theglobal state of the system under study. Variables ofthe expression are represented by (state machine,state) pairs. A fault is injected into the node if thefault trigger transitions from a false to a true becauseof a global state change. An example trigger is((SM1:ELECT)&(SM2:FOLLOW)), which indicatesthat the corresponding fault is to be injected whenthe system transitions from a global state in whichthe trigger is not satisfied to a global state in whichthe state machine SM1 is in state ELECT and statemachine SM2 is in state FOLLOW. Note that Loki doesnot place any restriction on the type of fault (e.g.,random bit-flip or message corruption). This gives

594 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 15, NO. 7, JULY 2004

the user considerable flexibility in choosing thefaults best suited for the system under study.

. Partial view of global state: To inject faults into a nodebased on the node’s fault triggers, it is sufficient totrack the “interesting” portion of the global state thatis necessary for fault injection at that node, i.e., thevector of the local states of the nodes on which thefault triggers depend. This interesting portion iscalled the partial view of global state at that node.

. Fault injection campaign, study, and experiment: Theprocess of fault injection of a distributed systemusing Loki consists of one or more fault injectioncampaigns. Each campaign consists of one or morestudies. For each study, the user defines the nodesfor his/her system. Although the division of thefault injection process into campaigns and studies isleft to the user’s discretion, it is desirable for a studyto consist of a set of similar fault injections and for acampaign to consist of a set of correlated studies.That is so the user can obtain meaningful resultsfrom his/her experiments using Loki’s measureestimator, which is described in Section 5. To obtainstatistically accurate measures, several runs of eachstudy are performed along with the associated faultinjections. Each of these runs is called an experiment.

. Fault Injection Process: Loki’s fault injection processcomprises a specification process and a campaignevaluation process. During the specification process,the user provides the system’s state abstraction, faultdetails, and measures to be computed, and preparesthe system under study for fault injection. The

campaign evaluation process then makes use of thisspecification. The campaign evaluation process issubdivided into a runtime phase to execute the faultinjection campaign, an analysis phase to check forproper fault injections, and a measure estimationphase to obtain desired measures from global time-lines created in the analysis phase.

3 THE RUNTIME PHASE

During the runtime phase, fault injection campaigns areexecuted. The runtime phase consists of executing thespecified number of experiments in every study in acampaign, performing the required fault injections, andrecording the required experiment data. The Loki runtimeis the module of the Loki fault injection tool that managesthe runtime phase of the fault injection campaigns.

The Loki runtime architecture is illustrated in Fig. 1, withfour nodes on three hosts. In this architecture, one portionof the Loki runtime is attached to each of the nodes in thesystem; the node’s runtime performs fault-injection-relatedoperations. Each node’s runtime consists of several compo-nents, namely, the state machine, state machine transport,parser, probe, recorder, and heartbeat provider. All of thecomponents except the probe are independent of the systemunder study; the user is responsible for the probeimplementation. The runtime also contains one local daemonper host and one central daemon for the entire runtimesystem. The daemons keep track of the dynamic informa-tion regarding the current nodes in the system, andfacilitate routing of messages between nodes even whennodes are dynamically exiting and entering the system.

CHANDRA ET AL.: A GLOBAL-STATE-TRIGGERED FAULT INJECTOR FOR DISTRIBUTED SYSTEM EVALUATION 595

Fig. 1. The Loki runtime architecture.

The main function of a state machine1 is to keep track ofthe partial view of global state necessary for fault injectioninto its node. Even if multiple nodes share the sameexecutable code, each node has a separate state machinewith a unique name. To keep track of the partial view ofglobal state of a node, the node’s state machine has to trackboth its local state and the state of the required remotenodes. To keep track of the local state, the state machineuses the state machine specification provided by the useralong with the node’s local event notifications. When thestate machine receives a local event notification, it computesits new state based on both its old state and the local event.To keep track of the state of remote nodes, the statemachines send only the required state change notificationsto each other, in accordance with the specification. Theideas of maintaining only the required partial view of globalstate and transmitting only the required state changenotifications both help to decrease the intrusion to theapplication.

The state machines use their respective state machinetransports to send notification messages to each other. Thestate machine transports abstract the underlying commu-nication system, and provide a simple interface for sendingand receiving notification messages. To do so, the statemachine transports maintain communication routes to theirrespective local daemons.

On every change in the partial view of global state of anode, its state machine notifies its fault parser. The faultparser uses the partial view of global state to check whetherany local fault triggers have transitioned from false to true.If they have, the parser instructs its probe to inject thecorresponding faults.

The probe is the only system-dependent component ofthe Loki runtime, and performs two main functions:notifying its state machine of any local events andperforming the actual fault injection into the node wheninstructed to do so by the fault parser. Since the probe issystem-dependent, the user has to implement it as part ofthe instrumentation of the distributed application, asdiscussed in Section A.2.2 of the Appendix (http://computer.org/tpds/archives.htm).

The function of a node’s recorder is to record informationregarding local state changes and fault injections, alongwith their occurrence times, to a local timeline. To improveprecision and decrease intrusion, the recorder uses hard-ware clocks [13]. Finally, the heartbeat provider sends “I amalive” messages to its node’s local daemon. This aids thelocal daemon in detecting crashes.

The main functions of the local daemon are to coordinatethe experiment execution on its host, start new nodes at thebeginning of an experiment, route the notification messagesbetween hosts, monitor the nodes on its host for crashes,and manage node crashes and restarts. To perform theseoperations, the local daemons coordinate with each otherand with the central daemon using reliable, orderedcommunication links (currently TCP). Each local daemon

communicates with the nodes on its host using IPC(currently shared memory with semaphores).

The central daemon manages the runtime phase of thecampaigns. That phase includes starting up all thecomponents required for each experiment, aborting experi-ments if there are abnormal situations (such as localdaemon crashes or experiment hangs), and determiningwhen experiments have completed. The central daemoncoordinates with the local daemons to perform thesefunctions.

Nodes, local daemons, and the central daemon are allsingle processes. Though a node consists of many compo-nents, it has only three Loki runtime threads in addition tothe application threads. The three runtime threads are thestate machine transport thread, parser thread, and heartbeatthread. The state machine component of the runtimeexecutes in the application threads. The Loki front endfunctions as the central daemon; for more details regardingits implementation, refer to Section A.2 of the Appendix. Adetailed description of the operation of the runtime can befound in Section A.1 of the Appendix.

4 THE ANALYSIS PHASE

By the time the runtime phase of the fault injection campaignhas completed, a large amount of timeline data has beenrecorded during the running of experiments. This data has tobe processed before meaningful conclusions can be drawnabout the systemunder study. The processing consists of twophases: an analysis phase and a measure estimation phase.The analysis phase, described in this section, consists of twosteps, namely, the conversion of local timelines in eachexperiment into a single global timeline for that experiment,and the verification of the correctness of fault injections ineach experiment. Experiments in which fault injectionsoccurred in incorrect global states are discarded. Note that“analysis” in this context refers only to the process ofcomputing a correct global timeline from raw measurementdata. It does not include the analysis of the applicationbehavior or detected faults; that analysis can be done by theuser based on the estimated measures and global timelineafter the measure estimation phase.

4.1 Conversion to Global Timeline

Loki uses an offline clock synchronization algorithm tocalibrate the clocks on the multiple machines on which thefault injector operates, so that all the local timelines of anexperiment can be combined into a single global timeline[13]. One machine’s clock is taken as the reference, and theoffset and drift rate of every other machine’s clock areestimated relative to the reference clock. These offsets anddrift rates are then used to place all the local times onto asingle global timeline.

Loki assumes that the drifts of the processor clocks of thedifferent machines in the distributed system are linear [11].Therefore, if there are m machines in the system, numbered1 to m, we have the following relation between theprocessor clock time CiðtÞ on machine i and the processorclock time CjðtÞ on machine j:

CjðtÞ � �ij þ �ijCiðtÞ; i; j ¼ 1; . . . ;m; ð1Þ

596 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 15, NO. 7, JULY 2004

1. The implementation of the system-independent state machinecomponent of the runtime is referred to as the state machine, while thesystem-dependent description of the state machine is referred to as the statemachine specification.

where �ij is the offset between the clocks of machine i andmachine j at t ¼ 0, and �ij is the drift (or skew) of the clockof machine j with respect to the clock of machine i.

If a machine r is chosen as the reference machine, thecalibration of the clocks of the machines in the system isreduced to a computation of �ri and �ri, for i ¼ 1; . . . ;m. (Itcan be easily seen that �rr ¼ 0 and �rr ¼ 1.) To computethese values, synchronization messages are passed betweenthe reference machine and all the other machines, beforeand after each experiment or study, during the runtimephase. During this process, local timestamps are recordedfor each message at the sender and receiver. The synchro-nization messages are passed between, not during, experi-ments, so that the intrusion into application execution isreduced.

Fig. 2a gives a graphical illustration of these timestampsfor hypothetical messages between machine i and machiner. The x-axis is the local time, CrðtÞ, on machine r, and the y-axis, CiðtÞ, is the local time on machine i. Each pointrepresents a message, whose coordinates are the local sendor receive timestamps at machines r and i. Note that, sincewe are interested only in the times of occurrences of eventsrelative to each other, the timeline axes can be translated asdesired. In particular, the axes can be translated so that thepoint corresponding to the first synchronization messagefalls on either of the axes or on both. The line L � CiðtÞ ¼�ri þ �riCrðtÞ is the exact relation between local times onmachine r and machine i. If the message delays in thenetwork are zero, then all the points would lie on L.However, since, in practice, message delays are not zero,the points are separated from L by a distance equal to theirmessage delay. All the points above L (represented byunfilled circles) represent messages from machine r tomachine i, and the points below L (represented by filledcircles) represent messages from machine i to machine r. So,for example, in the absence of delays, the messagerepresented by point x would actually have been at x0,and the transmission delay from machine r to i for thismessage is the vertical distance between x and x0.

The question is how to determine L given only the points(representing messages) in the figure. If the message delaysare fixed (or known in some other way), then it is a trivialmatter to compute L from the points. However, in a generaldistributed system, message delays are variable, so theexact determination of L becomes impossible. It is, how-ever, possible to estimate bounds on L, i.e., to estimate thebounds on �ri (½��

ri; �þri�) and �ri (½��

ri; �þri�). To do so, observe

tha t L1 � CiðtÞ ¼ ��ri þ �þ

riCrðtÞ and L2 � CiðtÞ ¼ �þri þ

��riCrðtÞ are the two extreme estimates of L because they

satisfy L’s property of dividing the points represented byfilled and unfilled circles. Therefore, Loki uses L1 and L2 tocompute the lower and upper bounds of �ri and �ri,namely, ��

ri; �þri; �

�ri, and �þ

ri. L1 and L2 themselves arecomputed by first taking the convex hulls for the upper andlower sets of points and then joining their “edge” points, asillustrated in Fig. 2a.

Further details regarding Loki’s analysis algorithm canbe found in [13]. Note that ½��

ri; �þri� and ½��

ri; �þri� are not

confidence intervals for �ri and �ri. With confidenceintervals, a value has only a high probability of being in acertain interval, but in this situation, the correct values of�ri and �ri are always in the intervals ½��

ri; �þri� and ½��

ri; �þri�,

respectively (even though their exact values are unknown).Methods for increasing the accuracy of the estimates of �ri

and �ri include increasing the number of synchronizationmessages, increasing the duration of synchronizationmessages, and increasing the duration of each experimentby adding delays before and after the experiment.

As mentioned in Section 3, the occurrence times of every

local state change and fault injection in a node are recorded

in its local timeline. The times used in the local timeline are

the local times of the node’s machine. During the conver-

sion of the local timelines of all the nodes into a single

global timeline, the occurrence times of all the events and

fault injections have to be projected onto a single (reference)

timeline. This is done as follows: Suppose an event occurred

CHANDRA ET AL.: A GLOBAL-STATE-TRIGGERED FAULT INJECTOR FOR DISTRIBUTED SYSTEM EVALUATION 597

Fig. 2. Estimation of parameters in the analysis phase. (a) Estimation of �ri, �ri, and inaccuracy� and (b) estimation of the upper bound on �þri � ��

ri.

on machine i at local time CiðT Þ. Then, from (1), we have

the reference clock time as CrðT Þ ¼ CiðT Þ��ri

�ri. However, since

only the bounds (and not the exact values) for �ri and �ri

are known, only the upper and lower bounds of CrðT Þ canbe found using CrðT Þ� ¼ CiðT Þ��þ

ri

�þri

and CrðT Þþ ¼ CiðT Þ���ri

��ri

.

Therefore, an event occurring at local time CiðT Þ on

machine i corresponds to an event occurring between time

bounds CrðT Þ� and CrðT Þþ on the reference machine r.

Using this method, Loki projects all the events in the local

timeline of all the nodes in the system onto a (reference)

global timeline. Graphically, all the CrðT Þ� fall on the line

L0, and all the CrðT Þþ fall on the line L00, where L0 and L00

are as shown in Fig. 2a. L0 is parallel to L1 and L00 is parallel

to L2, with L0 � CiðtÞ ¼ �þri þ �þ

riCrðtÞ�, and L00 � CiðtÞ =

��ri þ ��

riCrðtÞþ.A valid point of concern here is �, the size of the interval

½CrðT Þ�; CrðT Þþ�; � is the inaccuracy in the computedglobal time, and determines the precision of all events in thesystem, including fault injections. In Fig. 2a, for any localtime CiðtÞ, � is the horizontal distance between lines L0 andL00 at CiðtÞ. Obviously, it is desirable that � be as small aspossible. It can be seen from the figure that � increaseslinearly with increasing CrðtÞ. The maximum inaccuracy,�max, occurs at the end of a study. As illustrated in thefigure, �max can be decomposed into �1;�2, and �3, where�1 ¼ �þ

ri���ri

�þri

;�3 ¼ �þri���

ri

��ri

, and �2 is less than twice themaximum message delay (as measured in global time).Without loss of generality, it can be assumed that the lastfilled circle has a larger CiðtÞ than the last unfilled circle,and that �2 is measured along the horizontal line passingthrough the last unfilled circle (as shown in Fig. 2a). Then, itis easy to see that the delays of messages corresponding tothe last unfilled and filled circles (d1 and d2, respectively)are less than the maximum message delays and, hence, �2

is less than twice the maximum message delay. Fig. 2bshows the first messages during a synchronization phase.As shown in the figure, we ensure in our implementationthat machine i sends the first synchronization message a tothe reference machine, and the reference machine sends itsfirst message b immediately after it receives a. Hence, it canbe observed that ð�þ

ri � ��riÞ is bounded by the maximum

message delay. Since both ��ri and �þ

ri are constants (becauseof the linear drift assumption), this means that both �1 and�3 are bounded by a constant factor of maximum messagedelay. Hence, �max is bounded by a constant factor of themaximum message delay. Also, it is possible to decrease �by increasing the accuracy of estimates on �ri and �ri asdescribed above, and by decreasing message delay (whichincreases accuracy of �ri).

4.2 Verification of Fault Injections

After the conversion to the global timeline, the fault injectionsare checked to determinewhether theywere proper, i.e., theyoccurred in the correct global state as specified in the faultspecification. For the purposes of this discussion we refer totheevent that causeda fault injectionpredicate tobecome trueas the entry event for the corresponding fault, and the eventthat caused the fault injection predicate to become false as theexit event for the fault.

Improper fault injections are removed via a check thatdetermines whether the time interval between the upperand lower global-time bounds of a fault injection comple-tely lies within the time interval between the upper andlower global-time bounds of the fault’s entry event and exitevent. More specifically, the upper bound of the entry eventand lower bound of the fault injection time are used todetermine whether the fault was injected after the state wasentered. Likewise, the lower bound of the exit event andupper bound of the fault injection time are used todetermine whether the fault was injected before the statewas exited. If both the criteria are met, the fault was injectedas intended. This procedure is repeated for each injectionthat should have been made in the experiment; theexperiment is marked as successful only if all the injectionsin the experiment were done correctly. The unsuccessfulexperiments are discarded and not used for measurecomputation.

5 THE MEASURE ESTIMATION PHASE

Measure estimation is a key component of performance anddependability assessment using fault injection. The mea-sures to be obtained from a fault injection study are highlysystem- and user-dependent. For example, computing thesystem’s coverage of a fault depends on the user’sdefinition of a failure and recovery. Hence, any mechanismfor measure estimation in fault injection should be flexible.To this end, a flexible mechanism for measure estimationhas been developed in Loki.

5.1 Overview of Measure Estimation in Loki

Measure estimation consists of two steps: measure specifica-tion and measure computation. During measure specification,the user specifies the measures he/she wants to compute,and Loki uses these specifications to perform the measurecomputation. Loki provides the user with a flexible measurespecification language, and uses statistical techniques tocompute estimators from the data collected.

Measures in Loki are at two levels: the study level and thecampaign level. As mentioned in Section 2, a study consists ofsimilar fault injections, and a campaign generally consists ofcorrelated studies. Measure estimation at both levelsconsists of both measure specification and measure com-putation. A study-level measure is specified for a particularstudy and is computed from the data collected duringexecution of the experiments in that study. A campaign-level measure is specified for a particular campaign and iscomputed by combining study-level measures of a subset ofits studies. The measures language is split into two levels toallow computation of measures, such as fault coverage, thatmay involve several types of faults. For example, a user canconduct multiple studies (one for each fault type), definestudy-level measures on each study to compute coveragefor its fault type, and then combine the results via acampaign-level measure to compute overall coverage forseveral probable fault type distributions.

5.2 Study-Level Measures

A study-level measure is computed using the globaltimelines of the experiments in the corresponding study.Informally, a study-level measure consists of 1) a sequenceof filters that eliminate all the experiments of the studywhose global timelines do not satisfy certain conditions,

598 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 15, NO. 7, JULY 2004

and 2) the computation of a function using the globaltimelines of the remaining experiments, whose output is asample for the study-level measure. More formally, a study-level measure is based on the concepts of predicate,observation function, and subset selection. The above threeconcepts may be described as follows:

. Predicate: Predicates in Loki are Boolean expressionsused to test the global timeline to identify whethercertain conditions are satisfied by querying thedifferent attributes of the state machines (i.e., states,events, and times), and are either true or false as afunction of time. Predicates combine state machinequeries with the Boolean operators. Queries can testfor the occurrence of a specific state or a specific eventin a particular machine in specified intervals of time.An example of a predicate is ((StateMachine1,

State1, 10 < t < 20)|(StateMachine2,

Event5, 30 < t < 40)). This predicate is only trueduring any time between 10 and 20 ms whenStateMachine1 is in State1, and during any timebetween 30 and 40 ms when Event5 occurs inStateMachine2. The outcome of a predicate at aparticular time is called a predicate value. The predicateapplied to theglobal timelinegenerated in the analysisphase results in a Boolean-valued function of timecalled the predicate value timeline. The obtainedpredicate value timeline contains a combination ofimpulses and steps. Fig. 3 gives an example of a globaltimeline and shows the predicate value timelineobtained upon application of the above predicate tothe global timeline. For more examples of predicates,refer to Section 6.1.

. Observation function: For each defined predicate, theuser must specify an observation function. Theobservation function is used to extract a single valuefrom the predicate value timeline. The input to anobservation function is a predicate value timeline,and the output is an observation function value. Lokiprovides the functions count, outcome, dura-

tion, instant, and total_duration, whichextract information about the predicate value time-line. The observation function can be specified asany C function that uses the above functions alongwith any mathematical functions and returns anumerical value. An example of an observationfunction is total_duration(True, 10, 35).This returns the total duration for which thepredicate timeline was true between 10 and 35 ms.Its value for the predicate value timeline of Fig. 3 is6.5. For more examples of observation functions,refer to Section 6.1.

. Subset selection: After obtaining the previous pre-dicate value timelines and the associated observationfunction values, a user might be interested inestimating a measure from a subset of experimentsof the study. Loki provides the user with the abilityto select a subset of experiments based on theobservation function values. Using standard math-ematical functions that can be compiled with astandard C compiler and observation functionvalues, the user can define a subset function thatreturns true or false. The first subset selection ispredefined to include all experiments. An exampleof a subset selection is all the experiments withpositive observation function values.

Formally, a study-level measure is specified as anordered sequence of tuples: ((subset_selection1, predicate1,observation_function1), . . . , (subset_selectionn, predicaten,observation_functionn)), where n � 1. The subset_selection1

function is always true. A study-level measure is computedas follows: Since subset_selection1 is always true, predicate1and observation_function1 are computed on the globaltimelines of all the experiments in the study. The outputvalue of observation_function1 for each of these experi-ments is checked to determine whether it satisfies thesubset_selection2. If it does not, then the experiment iseliminated from further consideration. The global timelinesof the experiments that survive this filtering are used tocompute predicate2 and observation_function2. This pro-cess of filtering is continued for all the tuples in thesequence, when the global timelines of the unfilteredexperiments are used to compute predicaten and ob-servation_functionn. The output of observation_functionn

for each remaining experiment is called the final observationfunction value for that experiment, and is a sample for thestudy-level measure.

5.3 Campaign-Level Measures

Loki processes final observation function values to obtaina study-level measure, and combines study-level measuresto obtain the campaign measure estimation. The campaignmeasure could be completely characterized if its prob-ability distribution could be obtained. However, inpractice, the distribution cannot be calculated. Therefore,for all practical purposes, knowledge of the moments isequivalent to knowledge of the distribution function, inthe sense that it is theoretically possible to exhibit all theproperties of the distribution in terms of the moments [23,pp. 108-109]. In practice, the properties obtained whencalculating the first four moments are very close to theproperties of the real distribution. That is the approachthat Loki takes. There are three types of campaignmeasures, which are described below.

CHANDRA ET AL.: A GLOBAL-STATE-TRIGGERED FAULT INJECTOR FOR DISTRIBUTED SYSTEM EVALUATION 599

Fig. 3. Predicate value timeline example.

Simple Sampling Measures are used when the user doesnot want to differentiate among the final observationfunction values of different study-level measures. Simplesampling measures are obtained by considering all theselected study-level measures to be similar such that allof their final observation function values are contained ina single sample, i.e., they are all instances of the samerandom variable.

Stratified Weighted Measures are used when study-levelmeasures need to be combined using a user-specifiedlinear weighted function, e.g., axþ by. The moments forsuch measures are determined by computing themoments for each constituent study-level measure,weighting them using the user-defined weights, andsumming them. This computation assumes that thepowers of random variables representing the finalobservation function values are independent acrossstudies. There are several reasons for focusing onstratified weighted measures when evaluating faulttolerance mechanisms. For example, one importantparameter is the coverage of a fault tolerance mechanism[5]. When several independent studies are being con-sidered, the overall coverage of the fault tolerancemechanism is defined by a linear weighted functionacross studies [19].

Stratified User Measures are used when study-levelmeasures need to be combined using a user-definedfunction other than a linearly weighted function.Unfortunately, statistical features are not computed forthese measures, since the calculation of the first fourmoments is not a trivial task for any arbitrary functionthat combines final observation function values of thevarious studies. Hence, Loki does not combine finalobservation function values individually, but rathercombines the means of the final observation functionvalues for each study-level measure.For the simple sampling and stratified weighted mea-

sures, the important statistical properties, such as the firstfour moments, the skewness coefficient, the kurtosiscoefficient, and the percentiles for various �-levels, arecomputed by Loki as described in [6]. For stratified usermeasures, only the single value returned by the user-defined function is computed.

6 SAMPLE APPLICATION: A SIMPLE ELECTION

PROTOCOL

In this section, we describe the instrumentation of a sampledistributed application, as a proof-of-concept of Loki’scapabilities. Loki has also been used in two case studiesinvolving experimental evaluation of complex distributedsystems [14], [17]. The sample application, described here, is

a simple leader election protocol, in which n processes electa leader from among themselves. To do so, each processchooses a random number and sends it to the remainingn� 1 processes. The process that chose the highest numberis elected as the leader. In case of ties, the arbitration isrepeated until a leader is selected. Process failures aredetected by other processes via socket connection closure. Ifthe leader fails, then the remaining processes reelect a newleader using the same protocol but with one fewer node. Inthis simple protocol, we assume that only one node fails at atime and that a leader reelection is completed between twosuccessive node failures.

For this application, all nodes use the same state machinespecification, which is shown in Fig. 4. The nodes in thestate machine are labeled with state names correspondingto the phases of the application, and the edges are labeledwith transition names corresponding to local event notifica-tions. We use a simple crash failure fault model to illustratethe fault-triggering and measure estimation capabilities ofLoki, and to measure Loki’s runtime efficiency. Weimplemented this fault model simply by writing to a nullpointer dereference, thus causing a crash. However, asmentioned in Section 2, the fault model supported by Lokiis flexible and can be arbitrarily complex depending on theapplication and type of evaluation. The fault triggers for theexample are shown in Table 1.

6.1 Measures

Using Loki’s measure language, we were able to computeseveral interesting measures for the election protocolapplication. These measures are shown below, along withtheir definitions in the Loki measure language. Completedetails of the functions used in the measure language aregiven in [6]. For the purposes of the example, the functionstotal_duration(TRUE, t1, t2) and count(UP,

STEP, t1, t2) return the amount of time the predicatetimeline was TRUE and the number of low-to-high step

600 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 15, NO. 7, JULY 2004

Fig. 4. State machine specification for a simple leader election protocol.

TABLE 1Fault Expressions for the Simple Leader Election Protocol

transitions on the predicate timeline between times t1 and

t2, respectively. START_TIME and END_TIME refer to the

experiment start time and experiment end time, respec-

tively. Tuples for the predicate timelines are shown in the

format (statemachine, state) or (statemachine,

event), as appropriate.

. Availability for this application can be defined as thefraction of the total execution time for which there is afunctional leader (i.e., leader is in the LEAD state). Itcan be defined in Loki via a single tuple with thepredicate function (nsx, LEAD)|(corvette,

LEAD)|(xj220, LEAD) and the observation func-tion total_duration(TRUE, START_TIME,

END_TIME)/(END_TIME - START_TIME).. An important global state in this application is the

one in which any surviving node is in the ELECT

state. Since we assume that crashes do not occurwhen an election is in progress, such a state can beexpressed using the following predicate function:(nsx, ELECT)|(corvette, ELECT)|(xj220,

ELECT).Based on the above predicate function, a few

interesting measures can be computed. In particular,

we can compute the number of elections per run by

counting the number of times this global state was

entered; the observation function is count(UP,

STEP, START_TIME, END_TIME).We can also compute the amount of time spent per

election by dividing the time spent by the system in

thisglobal stateby thenumberof times thisglobal state

was entered. The observation function for this is

total_duration(TRUE, START_TIME, END_

TIME)/count(UP, STEP, START_TIME, END_

TIME).. We can also compute the total number of crashes

that occurred. To do so, we can use the fact that Lokiautomatically detects a node crash and generates aspecial CRASH event that corresponds to the crash inthe timeline. We can simply count the number ofthese crash events to get the required measure. Thecounting is done via a single tuple consisting of thepredicate function (nsx, CRASH)|(corvette,

CRASH)|(xj220, CRASH) and the observationfunction count(UP, IMPULSE, START_TIME,

END_TIME).

6.2 Performance Evaluation

Two important metrics for evaluating the performance of a

tool such as Loki are intrusiveness and injection efficiency.

Intrusiveness is the amount of perturbation in the applica-

tion’s execution caused by the process of fault injection.

Injection efficiency is the probability of correct fault injection

(i.e., in the correct global state) by Loki under various

conditions. It is important because it ultimately determines

the granularity at which faults can be injected. We present

some results regarding these metrics in the following

sections. Note that the results presented below only measure

the overhead of the Loki runtime (i.e., intrusiveness specific

to global-state-based fault triggering). Any probe-specific or

fault-model-specific intrusivenesswouldbe inaddition to theintrusiveness results presented below.

6.2.1 Injection Efficiency

In order to understand the factors on which the injectionefficiency of Loki depends, we make the followingobservation. As described in Section 4.2, fault injectionsmust be discarded by the postruntime analysis only if thebounds for the fault injection event overlap the bounds forthe entry event or the exit event. The probability that suchoverlap will occur depends primarily on two factors.

The first factor is the precision of the fault injectionmechanism, where precision refers to the time between theentry event and the actual fault injection. Although a lowernumerical value for precision indicates a responsiveinjection mechanism, it increases the probability thatexperiments containing correctly injected faults will bethrown away by the postruntime analysis. To understandwhy, we define the interval between the upper and lowerbounds associated with each event on the global timeline asthe uncertainty interval. We cannot resolve the time of eventoccurrence to be any narrower than this interval. The lowerthe numerical value of the precision of the fault injector, thecloser the fault injection event will be to the entry event, andthe higher the probability that the injection will bediscarded because the uncertainty intervals for these twoevents overlap. The probability also increases with thelength of the uncertainty intervals. To improve the injectionefficiency, it is possible to decrease the uncertainty intervalby decreasing the value of � in the analysis phase, asexplained in Section 4.1.

The second factor that influences the injection efficiencyis the overlap of the injection event with the bounds for theexit event. The amount of time spent by the system in theglobal state in which the injection is to occur determineswhether such an overlap occurs. We refer to this time as thestate-holding time. Clearly, the smaller the state-holding time,the higher the probability that an overlap will occur and thefault will be injected incorrectly. It is important to note thatthe holding time required for successful injection dependson both the precision of the fault injection mechanism andthe uncertainty interval (the longer the uncertainty interval,the higher the minimum holding time). However, oneimportant difference between the first factor and the secondis that the minimum holding time requirement affects theinjection efficiency only for short-lived states, while theoverlap between uncertainty intervals for the fault injectionevent and the entry event interval affects the efficiency of allfault injections.

To measure the uncertainty interval, we ran four studiesof the election protocol, which had 25, 50, 100, and200 experiments, respectively. The runs were performedon 1GHz Pentium III machines running Linux connected toa switched 100 Mbps Ethernet network. Fig. 5a shows theuncertainty interval for the experiments in each study. Theduration of each experiment was between 1.5 and 2 seconds,with 3 seconds between experiments. Due to the nature ofthe clock synchronization algorithm, the uncertainty inter-vals rise as the studies progress, but the maximum value ofuncertainty is the same for all studies irrespective of the

CHANDRA ET AL.: A GLOBAL-STATE-TRIGGERED FAULT INJECTOR FOR DISTRIBUTED SYSTEM EVALUATION 601

number of experiments in them. This follows from the proofof bounded uncertainity intervals in Section 4.1, whichshows that the maximum value is dependent on only thenetwork latency.

Figs. 5b and 5c demonstrate the minimum holding timeproperties of Loki. To get those plots, we ran a set of400 runs of the election protocol. In different runs, wevaried the amount of time the three nodes spent in theELECT state by artificially inducing a delay inside that state.The delay induced was between 1 and 20 ms in incrementsof 1 ms (a constant induced delay per study). Then, we triedto inject the fault nelectfault into node nsx. nelect-fault is triggered when all three nodes are in the ELECT

state. Its definition is given in Table 1. Fig. 5b shows thenumber of experiments for each value of actual time spentby all three nodes in the ELECT state. As seen, thisdistribution of the actual time spent is not uniform, eventhough the induced delays were. We believe this is due toOS scheduler artifacts. Fig. 5c shows a plot of the fraction offaults successfully injected (injection efficiency) versus theactual time spent by all nodes in the ELECT state rounded tothe nearest millisecond. It can be seen that the injectionefficiency is nearly 100 percent for all state-holding timesabove 5 ms. For lower state-holding times, the 10 ms OSscheduler timeslice causes the injection efficiency to drop.However, the efficiency is still greater than 60 percent for allholding times above 1 ms.

6.2.2 Intrusiveness

Intrusiveness is another important metric for any instru-

mentation tool. It is important to minimize intrusiveness as

much as possible in order to get the most accuratemeasurements possible. Since the Loki runtime attaches

itself to the application being instrumented, it might be

argued that Loki is more intrusive to the application than a

tool that monitors the application while running as a

separate process. However, we claim that this is not so.

Even if an instrumentation tool runs as a separate process,the OS would have to schedule its threads. Those threads

would compete for the same CPU resources that Loki

threads now compete for; hence, the situation would not be

any different. The only other ways that the application

interacts with Loki are through the notifyEvent and

injectFault methods. The injectFault method isinvoked by the fault parser and runs in a different thread

from the application and, hence, does not contribute to

intrusiveness beyond the competition for CPU resources

mentioned above. However, the notifyEvent call is made

by the application thread and does affect the intrusiveness.

Hence, for purposes of the current Loki implementation, wecharacterize intrusiveness by the amount of time it takes a

notifyEvent call to return. We call this time the

notification penalty.

602 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 15, NO. 7, JULY 2004

Fig. 5. Loki performance results. (a) Uncertainty intervals over the course of a study, (b) actual delay distribution, (c) injection efficiency, and

(d) notification penalty.

We measured the distribution of the notification penaltyfor node nsx during the execution of the election protocolapplication, as described in the previous section. Fig. 5dshows the distribution for four different event notificationsin the election protocol. It can be seen that these distribu-tions are fairly narrow in width, but tend to have long tails.However, the peaks are different for different events. This isexplained as follows: The INIT event notification initializesthe state machine and is the first event notification that thenode sends to indicate that it is functional. This INIT eventnotification also informs all other components of theruntime about the new state machine that has come up.Therefore, it is seen to take more time than other events. TheLEADER and FOLLOWER events do not need any remotenotifications. These state transitions are completely local,requiring no IPC, and represent the best case for normalnotifications. The INIT_DONE event requires that notifica-tions be sent to the other two state machines. Its notificationtherefore requires more time than those of LEADER andFOLLOWER events, since IPC is involved. More importantly,each of the four notifications takes between 30 �s in the bestcase to approximately 280 �s in the worst case. Hence, it canbe argued that the notification penalty and, thus, intrusive-ness due to Loki, are reasonably low.

7 LIMITATIONS OF LOKI

The Loki fault injector, as it is currently implemented, hassome limitations. First, Loki has been implemented inC/C++onLinux on an x86 architecture, and the Loki library has to belinked into the application code. Availability of applicationsources makes this easier. However, instrumentation ofapplications that have been written in languages other thanC++, or for which source code is unavailable, is possiblethrough use of a C++ wrapper, as we did for the Ensemblecase study [14].

Second, Loki has been designed for injecting faults at thegranularity of hundreds of microseconds. In its currentform, it cannot inject faults at a finer granularity. Third, thelinear clock drift assumption made by Loki may not hold ifexperiments are very long (running into months), or if thereare significant changes in temperature during the experi-ment. Finally, Loki does not currently have a prebuiltlibrary of probes for different fault models. A user musttherefore build such probes himself/herself.

8 RELATED WORK

Many of the earlier fault injectors were developed withspecific systems and fault types in mind. Fault injection andevaluation tools for stand-alone systems include MESSA-LINE [3], FIAT [20], FERRARI [15], and FTAPE [24]. Thefault injection techniques used in those tools encompasshardware-implemented fault injection (e.g., MESSALINE)and software-implemented fault injection (e.g., FIAT).Though it would be nontrivial to extend those tools toevaluate other systems (particularly distributed systems),they served their intended purposes well.

Past research work in monitoring and measurement ofdistributed systems is also interesting, and is related to thework in this paper. JEWEL [16] is a measurement system

that performs online monitoring, evaluation, and visualiza-tion, as well as offline analysis, of distributed systems. SPI[4] provides a flexible framework for distributed systemevaluation and visualization, and is based on the event-action programming model, in which “ea-machines” ob-serve events in the system and execute actions. Althoughboth of those tools can be extended for fault injection, theydo not have the required flexibility and cannot ensure thatfaults have been injected in the correct global states.

Several of the existing fault injectors have been targetedspecifically toward distributed systems, and are well-suitedfor their intended applications. However, they do not haveall the requirements necessary for global-state-based dis-tributed system fault injection. In CESIUM [1], the dis-tributed execution of the processes in the system issimulated in a single address space, and fault injection isperformed within the simulation. Though this approachgives the user a great deal of control over his/herdistributed system, it cannot fully mirror the operation ofthe system in a real environment. DOCTOR [12] injectsmemory, CPU, or communication faults probabilistically orbased on past history, and has an integrated workloadgenerator. EFA [10] was designed for system verification,and generates random fault cases, user-defined fault cases,and/or fault cases derived from an analysis of the sourceprogram of the system. It allows the user to express faultlocations and types using a special language, and providessupport for controlling the sequence of concurrent events ina distributed system. The Orchestra fault injector [9] wasdeveloped for fault injection in protocol stacks, andintegrates into the system under study as a layer in theprotocol stack. NFTAPE [22] was designed with flexibilityand portability in mind, and is divided into system-independent and system-dependent parts. The system-independent part executes the experiments, monitors thesystem, and collects observations. The actual fault injectionis done by lightweight fault injectors that are system-dependent and are similar in concept to the probes in Loki.NFTAPE uses a modular triggering mechanism andsupports a variety of triggers, such as time-based triggersand event-based triggers. NFTAPE has been used to injectvarious types of faults into systems, including hardware-based, software-based, and simulation-based faults.

All of the tools described above have been successfullyapplied to many systems. However, to the best of ourknowledge, none of the earlier fault injectors have all thecapabilities desired for global-state-based fault injection,namely, flexibility in fault types, fault injection based onglobal state, verification that the injections were correct, andaccurate computation of a wide range of user-specifiedmeasures. Loki has been designed with those capabilities inmind, and we believe that this makes Loki a unique tool forglobal-state-based fault injection in distributed systems.

9 CONCLUSIONS

The Loki fault injection tool has tackled the challengingproblem of global-state-based distributed system faultinjection by tying together fault injection based on a partialview of the global state, optimistic synchronization, andoffline analysis. A flexible measure language, statistical

CHANDRA ET AL.: A GLOBAL-STATE-TRIGGERED FAULT INJECTOR FOR DISTRIBUTED SYSTEM EVALUATION 603

estimation of user-specified performance and dependability

measures, and an easy-to-use user interface have also been

designed and developed as part of the fault injection tool.Flexibility, in regards to the choice of fault types and the

type of system under study, has been one of the main

philosophies of Loki. The experimental results presented in

this paper indicate that Loki’s fault injection efficiency is

high. We believe that the low intrusion and high efficiency

of global-state-based fault injection in Loki, combined with

its flexibility and statistical measure estimation capabilities,

will make it an invaluable tool in distributed system

evaluation. The usefulness of the Loki fault injector has

been illustrated in two case studies involving the evaluation

of real-life, mature distributed systems.

ACKNOWLEDGMENTS

This material is based on work supported by the US

National Science Foundation under ITR Contract 0086096.

Any opinions, findings, and conclusions or recommenda-

tions expressed in this publication are those of the authors

and do not necessarily reflect the views of the US National

Science Foundation.

REFERENCES

[1] G. Alvarez and F. Cristian, “Centralized Failure Injection forDistributed, Fault-Tolerant Protocol Testing,” Proc. 17th IEEE Int’lConf. Distributed Computing Systems, pp. 78-85, May 1997.

[2] J. Arlat, M. Aguera, L. Amat, Y. Crouzet, J.C. Fabre, J.C. Laprie, E.Martin, and D. Powell, “Fault Injection for DependabilityValidation: A Methodology and Some Applications,” IEEE Trans.Software Eng., vol. 16, no. 2, pp. 166-182, Feb. 1990.

[3] J. Arlat, Y. Crouzet, and J.C. Laprie, “Fault Injection forDependability Validation of Fault-Tolerant Computer Systems,”Proc. 19th Int’l Symp. Fault-Tolerant Computing, pp. 348-355, June1989.

[4] D. Bhatt, R. Jha, T. Steeves, R. Bhatt, and D. Wills, “SPI: AnInstrumentation Development Environment for Parallel/Distrib-uted Systems,” Proc. Ninth Int’l Parallel Processing Symp., pp. 494-501, 1995.

[5] W.G. Bouricius, W.C. Carter, D.C. Jessep, P.R. Schneider, and A.B.Wadia, “Reliability Modeling for Fault-Tolerant Computers,”IEEE Trans. Computers, vol. 20, no. 11, pp. 1306-1311, 1971.

[6] R. Chandra, M. Cukier, R.M. Lefever, and W.H. Sanders,“Dynamic Node Management and Measure Estimation in aState-Driven Fault Injector,” Proc. 19th IEEE Symp. ReliableDistributed Systems, pp. 248-257, Oct. 2000.

[7] K. Chandy and L. Lamport, “Distributed Snapshots: Determiningthe Global States of Distributed Systems,” ACM Trans. ComputerSystems, vol. 3, no. 1, pp. 63-75, 1985.

[8] K. Chandy and J. Misra, “An Example of Stepwise Refinement ofDistributed Programs: Quiescence Detection,” ACM Trans. Pro-gram Languages and Systems, vol. 8, no. 3, pp. 326-343, July 1986.

[9] S. Dawson, F. Jahanian, T. Mitton, and T.L. Tung, “Testing ofFault-Tolerant and Real-Time Distributed Systems via ProtocolFault Injection,” Proc. 26th Int’l Symp. Fault-Tolerant Computing,pp. 404-414, June 1996.

[10] K. Echtle and M. Leu, “The EFA Fault Injector for Fault-TolerantDistributed System Testing,” Proc. IEEE Workshop Fault-TolerantParallel and Distributed Systems, pp. 28-35, 1992.

[11] C.E. Ellingston and R.J. Kulpinski, “Dissemination of SystemTime,” IEEE Trans. Comm., vol. 21, pp. 605-623, May 1973.

[12] S. Han, K.G. Shin, and H.A. Rosenberg, “DOCTOR: An IntegratedSoftware Fault Injection Environment for Distributed Real-TimeSystems,” Proc. Int’l Computer Performance and Dependability Symp.,pp. 204-213, 1995.

[13] D.A. Henke, “Loki—An Empirical Evaluation Tool for DistributedSystems: The Experiment Analysis Framework,” master’s thesis,Univ. of Illinois at Urbana-Champaign, 1998.

[14] K.R. Joshi, M. Cukier, and W.H. Sanders, “Experimental Evalua-tion of the Unavailability Induced by a Group MembershipProtocol,” Proc. Fourth European Dependable Computing Conf.,pp. 23-25, Oct. 2002.

[15] G.A. Kanawati, N.A. Kanawati, and J.A. Abraham, “FERRARI: ATool for the Validation of System Dependability Properties,” Proc.22nd Int’l Symp. Fault-Tolerant Computing, pp. 336-344, July 1992.

[16] F. Lange, R. Kroeger, and M. Gergeleit, “JEWEL: Design andImplementation of a Distributed Measurement System,” IEEETrans. Parallel and Distributed Systems, vol. 3, pp. 657-671, Nov.1992.

[17] R.M. Lefever, M. Cukier, and W.H. Sanders, “An ExperimentalEvaluation of Correlated Network Partitions in the CodaDistributed File System,” Proc. 22nd IEEE Symp. Reliable DistributedSystems, pp. 273-282, Oct. 2003.

[18] K. Marzullo and G. Neiger, “Detection of Global State Predicates,”Proc. Fifth Int’l Workshop Distributed Algorithms, pp. 254-272, 1991.

[19] D. Powell, E. Martins, J. Arlat, and Y. Crouzet, “Estimators forFault Tolerance Coverage Evaluation,” Proc. 23rd Int’l Symp. Fault-Tolerant Computing, pp. 228-237, 1993.

[20] Z. Segall, D. Vrsalovic, D. Siewiorek, D. Yaskin, J. Kownacki, J.Barton, D. Rancey, A. Robinson, and T. Lin, “FIAT—FaultInjection Based Automated Testing Environment,” Proc. 18th Int’lSymp. Fault-Tolerant Computing, pp. 102-107, 1988.

[21] K.G. Shin and Y.H. Lee, “Measurement and Application of FaultLatency,” IEEE Trans. Computers, vol. 35, no. 4, pp. 370-375, Apr.1986.

[22] D.T. Stott, B. Floering, D. Burke, Z. Kalbarczyk, and R.K. Iyer,“NFTAPE: A Framework for Assessing Dependability in Dis-tributed Systems with Lightweight Fault Injectors,” Proc. FourthIEEE Int’l Computer Performance and Dependability Symp., pp. 91-100, Mar. 2000.

[23] A. Stuart and J.K. Ord, Distribution Theory, Kendall’s AdvancedTheory of Statistics, 1. London: Edward Arnold, 1987.

[24] T. Tsai and R. Iyer, “Measuring Fault Tolerance with the FTAPEFault Injection Tool,” Proc. Eighth Int’l Conf. Modelling Techniquesand Tools for Computer Performance Evaluation, pp. 26-40, Sept. 1995.

Ramesh Chandra received the BTech degreefrom the Indian Institute of Technology, Madrasin 1998, and the MS degree from the Universityof Illinois at Urbana-Champaign in 2000, both incomputer science. He is a PhD candidate incomputer science at Stanford University. He isinterested in systems research, in general, and,in particular, operating systems, networking,distributed systems, and reliable systems.

Ryan M. Lefever received the BS degree incomputer engineering in 1999 and the MSdegree in electrical engineering in 2003 fromthe University of Illinois at Urbana-Champaign.He is currently pursuing the PhD degree inelectrical engineering at the University of Illinoisat Urbana-Champaign. His research interestsinclude dependable computing, autonomic com-puting, performance/dependability evaluation,and distributed systems.

Kaustubh R. Joshi received the BEng degreein computer engineering from the University ofPune in 1999 and the MS degree in computerscience from the University of Illinois at Urbana-Champaign in 2003. He is currently pursuing thePhD degree in computer science at the Uni-versity of Illinois at Urbana-Champaign. Hisresearch interests include autonomic computing,design and validation of dependable distributedsystems, and performance/dependability model-ing. He is a member of the ACM.

604 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 15, NO. 7, JULY 2004

Michel Cukier received the physics engineeringdegree from the Free University of Brussels,Belgium, in 1991, and the PhD degree inengineering from the National Polytechnic In-stitute of Toulouse, France, in 1996. He is anassistant professor in the Center for ReliabilityEngineering in the Department of MechanicalEngineering at the University of Maryland,College Park. From 1991-1992, he was aninstructor at the Free University of Brussels.

From 1992 to 1996, he was at LAAS-CNRS, Toulouse, France for hisdoctoral work on coverage estimation of fault-tolerant systems. From1996 to 2001, he was a researcher in the Perform Research Group inthe Coordinated Science Laboratory at the University of Illinois, Urbana-Champaign. His research interests included intrusion tolerance byadaptation in distributed systems, adaptive fault tolerance in distributedsystems, the evaluation of fault-tolerant systems combining modelingand fault injection, and the estimation of fault tolerance coverage. Aspart of this work, he is a codeveloper of the AQuA Architecture, anarchitecture that provides dependable distributed objects. His currentresearch interests include security evaluation, intrusion tolerance,distributed system validation, fault injection, and software testing. Heis member of the IEEE and the IEEE Computer Society.

William H. Sanders is a professor in theDepartment of Electrical and Computer Engi-neering and the Coordinated Science Laboratoryat the University of Illinois. He is vice-chair ofIFIP Working Group 10.4 on Dependable Com-puting. In addition, he serves on the editorialboard of IEEE Transactions on Reliability, and isthe area editor for simulation and modeling ofcomputer systems for the ACM Transactions onModeling and Computer Simulation. He is a past

chair of the IEEE Technical Committee on fault-tolerant computing. Dr.Sanders’s research interests include performance/dependability evalua-tion, dependable computing, and reliable distributed systems. He haspublished more than 140 technical papers in these areas. He served asthe general chair of the 2003 Illinois International Multiconference onMeasurement, Modelling, and Evaluation of Computer-CommunicationSystems. He has served as cochair of the program committees of the29th International Symposium on Fault-Tolerant Computing (FTCS-29),the Sixth IFIP Working Conference on Dependable Computing forCritical Applications, Sigmetrics 2003, PNPM 2003, and PerformanceTools 2003, and has served on the program committees of numerousconferences and workshops. He is a codeveloper of three tools forassessing the performability of systems represented as stochasticactivity networks: METASAN, UltraSAN, and Mobius. Mobius andUltraSAN have been distributed widely to industry and academia; morethan 300 licenses for the tools have been issued to universities,companies, and NASA for evaluating the performance, dependability,security, and performability of a variety of systems. He is also acodeveloper of the Loki distributed system fault injector and the AQuA/ITUA middlewares for providing dependability/security to distributed andnetworked applications. He is a fellow of the IEEE, IEEE ComputerSociety, and the ACM.

. For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.

CHANDRA ET AL.: A GLOBAL-STATE-TRIGGERED FAULT INJECTOR FOR DISTRIBUTED SYSTEM EVALUATION 605