

Proceedings, IEEE International Computer Performance and Dependability Symposium (IPDS'98), Durham, NC, USA, 7-9 Sept. 1998. IEEE Computer Society.
0-8186-8679-0/98 $10.00 © 1998 IEEE

Action Models: A Reliability Modeling Formalism for Fault-Tolerant Distributed Computing Systems

Aad P. A. van Moorsel
Distributed Software Research Department
Bell Laboratories Research, Lucent Technologies
600 Mountain Ave., Murray Hill, NJ 07974, USA
[email protected]

Abstract

Modern-day computing system design and development is characterized by increasing system complexity and ever shortening time to market. For modeling techniques to be deployed successfully, they must conveniently deal with complex system models, and must be quick and easy to use by non-specialists. In this paper we introduce “action models,” a modeling formalism that tries to achieve the above goals for reliability evaluation of fault-tolerant distributed computing systems, including both software and hardware in the analysis. The metric of interest in action models is the job success probability, and we will argue why the traditional availability metric is insufficient for the evaluation of fault-tolerant distributed systems. We formally specify action models, and introduce path-based solution algorithms to deal with the potential solution complexity of created models. In addition, we show several examples of action models, and use a preliminary tool implementation to obtain reliability results for a reliable clustered computing platform.

1. Introduction

Model-based evaluation of the reliability of distributed systems has traditionally required expert-level knowledge of modeling techniques such as fault trees, reliability block diagrams, Markov chains and stochastic Petri nets [24]. In modern-day system design and development, which is increasingly driven by short time to market, there is usually little opportunity to add analysis specialists to a team or to train design engineers in reliability evaluation methodology. As a consequence, there is a growing need for intuitive modeling formalisms that allow non-specialists to evaluate system reliability.

In this paper we introduce the action model formalism, which aims at providing a high-level, robust modeling approach to analyze the reliability of fault-tolerant distributed systems. In particular, we want to be able to study the influence fault tolerance mechanisms have on the overall reliability experienced by system users (system load). In very crude terms, action models are higher-level constructs that represent system dynamics in ways similar to stochastic Petri nets [1], but have an underlying mathematical representation that is close to fault trees, reliability block diagrams and the like.

A prerequisite to the construction of a model and a modeling formalism is a decision on the metric of interest. We argue that metrics of interest should be user-oriented metrics [27]. In such measures, interest is in the completion of jobs submitted by a user. Therefore, we will focus on computing reliability in terms of the following metric: the probability that a submitted job completes successfully. (In fact, in action models we generalize this notion to the probability that a submitted job completes at ‘a certain level of success.’)

The job success probability is more appropriate for the evaluation of fault-tolerant distributed systems than the traditional availability metric. Even when a useful system-level interpretation of unavailability can be found, and when the user pattern is such that unavailability can be transformed into job failures, there still are various job failures that are caused by phenomena not included in system unavailability (think of software bugs, transient operating system failures and other ‘glitches’). Moreover, unavailability in distributed systems can often be made arbitrarily small by adding redundancy, while the other failures tend to be much more persistent. Hence, for highly-reliable fault-tolerant systems, the availability metric becomes relatively less important.

Parallel to the choice of the metric, the modeling formalism is also based on a ‘user view,’ as opposed to a system view. This also motivates the term action model: we look at the system with respect to the actions the user wants it to carry out. Action models define, for each job offered to the system, a flow through the system in the form of a sequence of tasks. Fault tolerance mechanisms will introduce alternative flows in case of failure of resources. Dependencies between resource failure probabilities are captured by dependency coefficients associated with the various system resources. Such dependencies are particularly relevant in software reliability modeling when modeling the interaction between software components [13, 16]. Also included in an action model is the operational or user profile; in telecommunication switches for instance, different calls generate different sequences of operations in the switch, and the combination and distribution of calls executed by the switch is the operational profile. Operational profiles are being used systematically in distributed system testing and software reliability engineering [15, 20], and they are equally important when evaluating the system reliability.

The freedom we have in designing an intuitive modeling formalism is restricted by the need to efficiently solve the resulting models. Since failures typically occur rarely in reliability models, discrete-event simulation often suffers from prohibitively long simulation times (see [21, 27] and references therein). On the other hand, state-space based analytic methods may also require excessive CPU or memory resources because of the size of the model, and thus cannot be blindly applied either. Hence, we cannot just define higher-level constructs on top of an existing modeling methodology, but in fact have to tailor the modeling formalism (and underlying mathematical formulation) so that it leads to solvable models. We therefore carefully balance the modeling power of action models with their tractability. In particular, we decide not to include timing aspects in action models, but aim at models solvable by combinatorial means. We will introduce path-based solution algorithms as the generic solution approach to action models.

There are various reliability modeling approaches in frequent use [17, 24], but they are not specific to the domain of distributed fault-tolerant computing systems. Therefore, they demand substantial abstraction ability from the modeler. In addition, these formalisms usually do not deal with dependencies and operational profiles in a generic way. Recently, there have been other attempts to construct domain-specific formalisms. Mendiratta [18] discusses particular Markov models that are intimately tied to fault classification and fault-tolerance mechanisms. We will provide hooks in action models to do similar things. Kanoun and Borrel [12] provide constructs for modeling fault dependencies, more powerful than our dependency constructs, but also more complex to use. An issue we do not deal with is discussed by Dugan [6] and Dugan and Lyu [7], who use Markov chains in addition to fault trees to model fault handling and system configuration changes. The work on manufacturing systems by Shah [26] has partly the same motivation as ours: high-level, domain-specific modeling constructs effectively model classes of systems. In this work, however, the models are mapped on stochastic activity networks [19, 25], thus resulting in familiar solution limitations.

This paper lays the groundwork for action models. Section 2 provides a non-formal introduction to the basic concepts. A formal specification of action models and their execution policies is given in Section 3. Section 4 introduces path-based solution algorithms, which can deal with the inclusion of dependencies and circumvent the state-space explosion problem. Section 5 then shows results obtained from action models for a reliable clustered computing platform.

2. Non-Formal Introduction and Semantics

In action models the system is viewed as an entity that offers a service to a variety of “jobs.” Each job consists of a series of “sub-jobs,” which all rely on certain hardware and software resources. As an example, think of telecommunication switch systems, which must serve various types of calls (terminal calls, 1-800 calls). These calls are the jobs submitted to the system, and for all call classes, one wants to assess the probability that a call successfully completes, so that it can be decided whether requirements are being met.

Action models consist of the following hierarchy of building blocks:

• Resources, which model the physical software and hardware used by tasks; the mode of a resource represents the state of operation of the software/hardware.

• Tasks, which use a collection of resources, and model the execution of a sub-job.

• Flows, which constitute an ordered sequence of tasks, representing the complete execution of a job offered to the system. Each flow starts from a source and ends at a sink. The ordering of tasks is defined by transitions and by cases associated to transitions. The case that is chosen depends on the mode the resources are in.

• Operational Profile, which specifies a set of flows.

All four building blocks will now be further introduced. A formal specification is postponed to Section 3; in this section we concentrate on informally introducing the semantic meaning of action model constructs. We manufactured the simple example model in Figure 1 to illustrate the discussion: it contains just one task, with a hardware (“HW”) and software (“SW”) resource, both with modes “up” and “down.”


2.1. Resource

Each resource contains attributes that specify properties of the resource. We currently include:

• (failure) modes, with associated (failure) mode probabilities,

• mode dependencies, with associated dependency coefficients.

In Figure 1 there is a hardware and a software resource. Each has modes “up” and “down,” with probabilities, say, $f_h$ and $1 - f_h$ for hardware and $f_s$ and $1 - f_s$ for software. (For this case, the probabilities are indeed “failure probabilities.”) The dependency coefficients are, say, $d_h$ and $d_s$ for the hardware down mode and software down mode, respectively.

Modes are intended to allow modeling of any kind of characterization of the resources (more than just up and down). In particular, Section 5 shows examples in which modes are used to distinguish fault classes (using for instance the classification in [14] one may distinguish design faults, transient faults, etc.).

The dependency coefficients express how the mode probabilities change when a resource is accessed more than once. It works as follows. Let $m_1$ and $m_2$ be modes of resources $r_1$ and $r_2$, with mode probabilities $f_1$ and $f_2$, and let $d_{1,2}$ be the dependency coefficient expressing the dependency between $m_1$ and $m_2$. Then the occurrence probability $p_2$ that resource $r_2$ is in mode $m_2$ equals $p_2 = d_{1,2}^{k_1} f_2$, where $k_1$ is the number of times resource $r_1$ has been in mode $m_1$ before the current access to resource $r_2$. Two particular dependencies can be modeled in this way: self-dependencies, where the dependency is among subsequent instances of use of the same resource and mode, and cross-dependencies, where dependence is across different modes and (possibly) different resources.
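As a small illustration of the formula $p_2 = d_{1,2}^{k_1} f_2$, the following Python sketch evaluates the occurrence probability for a growing number of earlier occurrences. The function name and the numerical values are made up for this example and do not come from the paper.

```python
def occurrence_probability(f2: float, d12: float, k1: int) -> float:
    """Return p_2 = d_{1,2}^{k_1} * f_2 for k1 earlier occurrences of mode m_1."""
    if k1 < 0:
        raise ValueError("k1 must be non-negative")
    return (d12 ** k1) * f2


# Self-dependency with illustrative numbers (assumptions, not from the paper):
f = 0.01   # base probability of the "down" mode
d = 5.0    # each earlier "down" occurrence makes the next one five times as likely
probs = [occurrence_probability(f, d, k) for k in range(3)]
print(probs)
```

With $d = 1$ the accesses are independent, since the probability stays at $f$ regardless of the history.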

Figure 1. Example model.

Note that every time a resource is accessed, its mode probabilities must sum to 1. This also restricts the range of values the dependency coefficients may take. A generic solution to this problem is to treat the coefficients and mode probabilities as weights, in which case normalization to 1 must be carried out after evaluation of the weight of each case. See also Section 5.1 for an example and further discussion.
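The weight interpretation can be sketched in a few lines of Python. This is only one possible reading of the normalization step; the helper name and the numbers are assumptions chosen for illustration.

```python
from typing import Dict


def normalized_mode_probabilities(weights: Dict[str, float]) -> Dict[str, float]:
    """Scale non-negative mode weights so that they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return {mode: w / total for mode, w in weights.items()}


# Illustrative values: f = 0.2, d = 2, two earlier "down" occurrences.
f, d, k = 0.2, 2.0, 2
weights = {"down": (d ** k) * f, "up": 1.0 - f}   # 0.8 + 0.8 no longer sums to 1
probs = normalized_mode_probabilities(weights)
print(probs)
```

After normalization both modes end up equally likely here, which shows how strongly a dependency coefficient can shift the mode probabilities.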

2.2. Task

A task uses a collection of resources. Hence, tasks are defined by the set of resources that they employ. In Figure 1 there is only one task (the box labeled “Base Task”), using resources hardware and software. If a task uses only one resource it represents a minimal unit of “work,” but we will typically use more generic tasks in a model, which can be seen as concatenations of such minimal tasks.

There are two special ‘dummy’ tasks: sources and sinks. The source is displayed in Figure 1 by the arrow labeled “source,” and two sinks exist in the model: “Failed” and “Succeeded,” represented by ovals. The source can be seen as a task requiring no resources; it always successfully completes. The sinks are “absorbing” entities, as we will see when discussing flows. We allow multiple sinks, to be able to distinguish various degrees of ‘success’ and ‘failure.’

2.3. Flow

Flows are sequences of tasks. Flows are implicitly specified by introducing transitions between tasks. In action models, the transitions contain cases (similar to those in stochastic activity networks [19, 25]), which connect with a downstream task. The way flows are formed by the combination of tasks and transitions is defined by the execution model, which is a Petri-net variant.

In Figure 1, there is one transition out of the base task. That transition has two cases, and with each case is associated a case formula. The case formula is a boolean expression over the resource modes in the task. In the example, the formula with the first case is ‘hardware in up mode and software in up mode.’ Obviously, the other case must be such that together the cases cover all possibilities: ‘hardware is down and/or software is down.’ Using the mode probabilities, the probability that a case is chosen can be evaluated (a case is chosen when the corresponding case formula holds). When a case is chosen, the next task in the flow is found by following the arc out of that case.
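To make the evaluation of case probabilities concrete, the sketch below enumerates all mode combinations of the resources in a task and sums the probability of those combinations for which a case formula holds. It assumes independent resources (no dependency coefficients), and the mode probabilities are invented for the example.

```python
from itertools import product


def case_probability(mode_probs, formula):
    """Probability that `formula` holds, assuming independent resources.

    mode_probs: {resource: {mode: probability}}
    formula:    function taking {resource: mode} and returning a bool
    """
    resources = list(mode_probs)
    total = 0.0
    # Enumerate every joint assignment of one mode per resource.
    for modes in product(*(mode_probs[r] for r in resources)):
        assignment = dict(zip(resources, modes))
        if formula(assignment):
            weight = 1.0
            for r, m in assignment.items():
                weight *= mode_probs[r][m]
            total += weight
    return total


# The two cases of the example model, with made-up mode probabilities.
mode_probs = {"HW": {"up": 0.99, "down": 0.01}, "SW": {"up": 0.95, "down": 0.05}}
p_ok = case_probability(mode_probs, lambda a: a["HW"] == "up" and a["SW"] == "up")
p_fail = case_probability(mode_probs, lambda a: a["HW"] == "down" or a["SW"] == "down")
print(p_ok, p_fail)
```

Because the two formulas are complementary, the two case probabilities add up to 1, which is the covering requirement stated above.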

To be able to specify an operational profile, we introduce success flows. The success flow specifies the sequence of tasks that must be carried out successfully. The success flow starts from a source, and finishes in the ‘success’ sink. The transitions, cases, resource modes, etc., then implicitly specify all possible other flows that may take place, including flows that contain fault tolerance operations.

2.4. Operational Profile

The operational profile is a collection of success flows, with associated probabilities. We do not allow for dependencies among flows in the current action models, since this would severely complicate the execution rules.

The operational profile is an important component in determining the reliability of software, and should also be part of a reliability study of a distributed system running this software. It addresses the fact that each application running on the system may experience different system reliability, depending on what resources are being used, and in what order.
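Since flows are independent of one another, the job success probability over the whole profile is the profile-weighted sum of the per-flow success probabilities. The Python sketch below shows this weighting; the function name and the two-flow profile are hypothetical.

```python
def overall_success_probability(profile):
    """profile: list of (flow probability, success probability of that flow)."""
    total_weight = sum(w for w, _ in profile)
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("operational profile probabilities must sum to 1")
    return sum(w * p for w, p in profile)


# Hypothetical profile: 70% of the jobs follow a short flow, 30% a longer one.
profile = [(0.7, 0.999), (0.3, 0.95)]
p_job = overall_success_probability(profile)
print(p_job)
```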

3. Formal Specification

An action model is a particular variant of a net. That is, there is a set $S$ of places (tasks), a set $T$ of transitions, and a binary flow relation $F$ specifying the connections between tasks and transitions [23]. For a complete definition, the tasks $S$ and transitions $T$ require further specification, and specific execution rules must be defined to determine the flows $F$. In addition, the operational profile introduces success flows $F_s$ and a distribution $\pi_0$ over the success flows. Section 3.1 first completes the specification of tasks, transitions, flows and initial distribution, and Section 3.2 then defines execution rules for action models.

3.1. Specification

An action model is defined by a tuple $A = (S, T, F, R, C, F_s, \pi_0)$, where $S$ is the set of tasks, $T$ the transitions, $F$ the flows, $R$ the resources, $C$ the cases, $F_s$ the success flows, and $\pi_0$ a distribution on the success flows. These elements are further specified as:

Definition 1 A resource $r \in R$ is a tuple $(M_r, P_r, D_r)$, and $R$ is the finite set of all resources $r_1, r_2, \ldots, r_{|R|}$, such that:

• $M_r$ is a finite set of modes $m_{r,1}, \ldots, m_{r,|M_r|}$,

• $P_r$ is a probability distribution $p_{r,1}, \ldots, p_{r,|M_r|}$, such that for every mode $m_{r,i}$, $i = 1, \ldots, |M_r|$, there is a probability $p_{r,i}$,

• $D_r$ is a finite set of real-valued dependency coefficients $d_{r,1}, \ldots, d_{r,|D_r|}$, specifying dependencies between mode probabilities.

There are additional requirements for the values of the dependency coefficients, so that the mode probabilities sum to 1 for each resource (see the definition of flows). If desired, explicit restrictions on $D_r$ can be formulated so that the above holds. We will not do so, since we want to allow various semantic interpretations.

Definition 2 With each task $s \in S$ is associated a set of resources $R_s \subseteq R$.

For convenience, we introduce two subsets of $S$ corresponding to the sources and the sinks: $S_e \subseteq S$ are the sources, $S_s \subseteq S$ are the sinks. Both sources and sinks have empty resource sets associated to them.

Definition 3 Transitions $t \in T$ are tuples $(C_t, B_t, P_t)$, where

• $C_t$ is a finite set of cases $c_{t,1}, \ldots, c_{t,|C_t|}$,

• for every case $c_{t,i}$, $i = 1, \ldots, |C_t|$, there exists a resource formula $b_{t,i}$ over the set $R$ of resources and their modes; $B_t$ is then the set $\{b_{t,1}, \ldots, b_{t,|C_t|}\}$,

• $P_t$ is a probability distribution $p_{t,1}, \ldots, p_{t,|C_t|}$, such that for every case $c_{t,i}$, $i = 1, \ldots, |C_t|$, there is a probability $p_{t,i}$.

Only if $B_t$ is the empty set can (and must) the probability distribution $P_t$ be specified. Otherwise, the distribution $P_t$ is derived from $B_t$, through computation of the likelihoods that the respective formulae are true. When defining the execution rules it will become clear how this is done.

Definition 4 $C$ is the set of all cases: $C = \bigcup_{t \in T} C_t$.

Then, the flow relation $F$ is:

Definition 5 The set of flows is a binary relation $F \subseteq (S \times T) \cup (C \times S)$.

The difference with a normal net is that cases enter in the relation, instead of just transitions.

The success flows, which are introduced by modeling the operational profile, are defined as:

Definition 6 A success flow $f$ is a subset $f \subseteq F$. The set of all success flows is $F_s$.

Typically, the success flows will be restricted to those cases that correspond to ‘success.’ The success flows define a different sequence of tasks for each job in the operational profile, while the rest of the model (including the fault tolerance mechanisms) remains unaltered. Finally, to specify the occurrence probabilities of the different success flows, we introduce the distribution $\pi_0$:

Definition 7 The initial distribution $\pi_0$ is a probability distribution over the success flows $f \in F_s$.
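The tuple of Definitions 1 to 7 maps naturally onto a handful of record types. The following Python sketch is one hypothetical encoding, not the specification language of the tool discussed in Section 5.3; all class and field names are assumptions, cases are identified by globally unique strings, and the mode probabilities are made up. It encodes the single-task model of Figure 1, with the source modeled as a resource-free task whose transition has one case of probability 1.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, List, Optional, Set, Tuple

Formula = Callable[[Dict[str, str]], bool]   # maps {resource: mode} to True/False
Arc = Tuple[str, str]                        # (task, transition) or (case, task)


@dataclass
class Resource:                              # (M_r, P_r, D_r)
    modes: List[str]
    probabilities: Dict[str, float]          # one probability per mode
    dependencies: Dict[str, float] = field(default_factory=dict)


@dataclass
class Transition:                            # (C_t, B_t, P_t)
    cases: List[str]
    formulas: Dict[str, Formula] = field(default_factory=dict)
    probabilities: Optional[Dict[str, float]] = None   # given only if B_t is empty


@dataclass
class ActionModel:                           # (S, T, F, R, C, F_s, pi_0)
    tasks: Dict[str, Set[str]]               # task -> names of its resources R_s
    transitions: Dict[str, Transition]
    flows: Set[Arc]
    resources: Dict[str, Resource]
    success_flows: List[FrozenSet[Arc]]
    pi0: List[float]                         # one probability per success flow

    @property
    def cases(self) -> Set[str]:             # C is the union of all C_t
        return {c for t in self.transitions.values() for c in t.cases}


# The single-task model of Figure 1, with made-up mode probabilities.
base = Transition(
    cases=["ok", "fail"],
    formulas={
        "ok": lambda a: a["HW"] == "up" and a["SW"] == "up",
        "fail": lambda a: a["HW"] == "down" or a["SW"] == "down",
    },
)
start = Transition(cases=["go"], probabilities={"go": 1.0})
success = frozenset({("Source", "start"), ("go", "Base Task"),
                     ("Base Task", "base"), ("ok", "Succeeded")})
model = ActionModel(
    tasks={"Source": set(), "Base Task": {"HW", "SW"},
           "Succeeded": set(), "Failed": set()},
    transitions={"start": start, "base": base},
    flows=set(success) | {("fail", "Failed")},
    resources={
        "HW": Resource(["up", "down"], {"up": 0.99, "down": 0.01}),
        "SW": Resource(["up", "down"], {"up": 0.95, "down": 0.05}),
    },
    success_flows=[success],
    pi0=[1.0],
)
```

Keeping the success flow as a subset of the flow relation mirrors Definition 6: the arc from the failing case to the “Failed” sink belongs to the model but not to the success flow.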

3.2. Execution

The execution rules of an action model are similar to those of stochastic activity networks [19, 25], and we will specify the execution rules using activity-network and Petri-net terminology such as token, marking, firing and enabled transition, without further defining these. There are special features in action models we must pay special attention to. The dependency coefficients, which determine case probabilities, may depend on the complete ‘past’ of the model evolution, and we thus must take into account the whole ‘evolution history.’ In addition, the success flows and the initial distribution $\pi_0$ over these flows give a particular twist quite different from the initial marking in Petri nets.

Definition 8 The input bag $S_t \subseteq S$ of a transition $t \in T$ contains all elements $s \in S$ for which $(s, t) \in F$.

Definition 9 The output bag of case $c_i$, $i = 1, \ldots, |C_t|$, for transition $t \in T$ is given by all $s \in S$ for which $(c_i, s) \in F$.

Definition 10 A transition $t \in T$ is enabled if $s$ contains at least one token for all $s \in S$ with $(s, t) \in F$.

Definition 11 A transition $t \in T$ that fires distributes, with probability $p_{t,i}$, $i = 1, \ldots, |C_t|$, a token to the places in the output bag of case $c_{t,i}$.

A distribution of tokens over the tasks is called a marking, and $M$ is the set of all possible (reachable) markings. To evaluate the probabilities $p_{t,i}$, $i = 1, \ldots, |C_t|$, we add a history $h_m$ to each marking $m \in M$, which lists all the cases that are part of the evolution. Formally:

Definition 12 Let for some marking $m \in M$ the history be given by $h_m$. If case $c \in C$ is chosen in marking $m$, the history $h_{m^+}$ for the next marking is given by the sequence $h_{m^+} = (h_m, c)$. The history $h_{m_0}$ for the initial marking $m_0 \in M$ is the empty set.

Definition 13 At transition $t \in T$, the probability associated with case $c_{t,i}$, $i = 1, \ldots, |C_t|$, is computed from evaluating the likelihood that $b_{t,i}$ holds true, given the failure probabilities, dependency coefficients and evolution history $h_m$ of the current marking $m \in M$.

The specification of the operational profile through success flows $F_s$ means that the execution must be considered for each element of $F_s$. Hence, by repetition of execution for all success flows in $F_s$, the execution rules are complete.

There are important and intricate issues associated with whether an action model is well specified [3, 22]. In non-formal terms, an instance of an action model is well specified if all possible evolutions that follow from the execution rules result in the same set of reachable markings, with the same probabilities $p_m$ associated to them. It is beyond the scope of this paper to further discuss this topic, but it implies that the user must take care in using dependency coefficients in a correct way, and must be aware of well-specified issues familiar from nets with immediate transitions [3, 22].

4. Path-Based Solution Algorithm

In action models we need to solve for the absorption probabilities of the sinks. We do this using ‘on-the-fly’ or ‘path-based’ algorithms [2, 5, 27, 28]. Compared to state-space generation approaches for Petri nets, on-the-fly algorithms combine generation of the underlying mathematical representation of the model with its solution. Hence, the complete mathematical representation (for instance in terms of a transition matrix) is not generated before solving, thus circumventing a potential memory bottleneck. Compared to combinatorial algorithms [4, 24], path-based algorithms may lose efficiency for some models, but provide straightforward means for dealing with various dependencies.

We give the base algorithm in Algorithm 1. Starting from the initial marking $m_0 \in M$, it generates all possible sequences of markings $m \in M$, keeping track of the evolution history $h_m$ for each marking, and the probability $p_m$ that the marking is reached. If a marking contains tokens in the sinks, the probability $p_m$ is added to the absorption probability for that set of sinks. The power set $S_s^p$ over the set of sinks $S_s$ gives all possible distributions of sinks with and without tokens, and the algorithm thus sums the absorption probability for each of the elements in $S_s^p$.

Figure 2. Redundancy model.

Algorithm 1 Solution algorithm

The sink absorption probabilities for an action model can be computed as:

    $p_{m_0} = 1$; $\Lambda = \{m_0\}$; $p_{s_d} = 0$ for all $s_d \in S_s^p$;
    While ($\Lambda \neq \emptyset$) do {
        Select $m \in \Lambda$, with belonging $p_m$ and $h_m$;
        Determine the set of all possible next markings $D(m)$;
        $\Lambda = \Lambda \cup D(m)$;
        For each $m_d \in D(m)$ do {
            Compute $p_{m_d}$ from $p_m$, $h_m$, and the selected case $c \in C$;
            If $m_d$ is absorbing do {
                Determine $s_d = S_s^p \cap m_d$;
                $p_{s_d}$ += $p_{m_d}$;
                $\Lambda = \Lambda - \{m_d\}$;
            }
        }
        $\Lambda = \Lambda - \{m\}$;
    }

The above algorithm must be repeated for all success flows in $F_s$, and then the results must be weighted according to the distribution $\pi_0$. Note that the above algorithm is only guaranteed to terminate for action models with a finite and acyclic reachability graph. If the model is not acyclic and finite, a stopping criterion must be added or partial solutions can be obtained, similar to those discussed in [2, 28]. In Section 5 we will see, however, that in many cases jobs can be naturally modeled as an acyclic sequence of tasks.
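A minimal Python sketch of the path enumeration is given below. It makes a simplifying assumption that is not part of Algorithm 1: only one token circulates, so a marking is just the task currently holding the token and each sink is its own absorbing marking. The callback `next_cases`, the case names, and the parameter values are hypothetical; the example model follows the redundancy model of Section 5.1, with the failure case evaluated first and the success case taking the remaining probability.

```python
from collections import defaultdict


def sink_absorption_probabilities(source, sinks, next_cases):
    """Enumerate all paths from `source` and sum their probabilities per sink.

    next_cases(task, history) -> list of (case, probability, next_task),
    where `history` is the tuple of cases chosen so far (the history h_m).
    """
    absorbed = defaultdict(float)
    frontier = [(source, 1.0, ())]            # entries are (marking, p_m, h_m)
    while frontier:
        task, p_m, h_m = frontier.pop()
        for case, p_case, nxt in next_cases(task, h_m):
            p_next = p_m * p_case
            if p_next == 0.0:
                continue                      # path cannot occur
            if nxt in sinks:
                absorbed[nxt] += p_next       # absorbing marking reached
            else:
                frontier.append((nxt, p_next, h_m + (case,)))
    return dict(absorbed)


def redundancy_cases(f, d):
    """Three systems A, B, C sharing one resource with self-dependency d."""
    order = ["A", "B", "C"]

    def next_cases(task, history):
        k = history.count("fail")             # earlier failures of the resource
        p_fail = (d ** k) * f                 # dependency-adjusted failure weight
        i = order.index(task)
        on_fail = order[i + 1] if i + 1 < len(order) else "Failed"
        return [("ok", 1.0 - p_fail, "Succeeded"), ("fail", p_fail, on_fail)]

    return next_cases


result = sink_absorption_probabilities(
    "A", {"Succeeded", "Failed"}, redundancy_cases(f=0.1, d=2.0)
)
print(result)
```

Only the current path probability and history are stored per frontier entry, so no transition matrix is ever built, which is the memory advantage noted above.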

For an efficient implementation of Algorithm 1 some important issues must be addressed. In particular, for larger models, quick approximate solutions may be achieved by following the most significant paths first. That implies that the choice of the next marking $m_d$ from the set $D(m)$ is important. The issues that come up are similar to those discussed in [2, 28], although the setting of action models is less generic, thus potentially simplifying the successful application of path-based algorithms.
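One way to follow the most significant paths first is to keep the unexplored markings in a priority queue ordered by path probability. The Python sketch below is an illustration under the same single-token simplification (a marking is the task holding the token); the expansion budget used as stopping criterion, the function names, and the chain model are assumptions and are not taken from [2, 28].

```python
import heapq
from collections import defaultdict
from itertools import count


def best_first_absorption(source, sinks, next_cases, max_expansions):
    """Expand the most probable unexplored marking first, up to a budget.

    Returns (absorption probabilities found so far, unexplored probability mass).
    """
    absorbed = defaultdict(float)
    tie = count()                              # tie-breaker, so tasks are never compared
    heap = [(-1.0, next(tie), source, ())]     # negated p_m turns heapq into a max-heap
    expanded = 0
    while heap and expanded < max_expansions:
        neg_p, _, task, h_m = heapq.heappop(heap)
        expanded += 1
        for case, p_case, nxt in next_cases(task, h_m):
            p_next = -neg_p * p_case
            if nxt in sinks:
                absorbed[nxt] += p_next
            elif p_next > 0.0:
                heapq.heappush(heap, (-p_next, next(tie), nxt, h_m + (case,)))
    unexplored = sum(-neg_p for neg_p, *_ in heap)
    return dict(absorbed), unexplored


def chain(n, q):
    """Hypothetical job of n sequential tasks, each failing with probability q."""
    def next_cases(task, history):
        nxt = task + 1 if task + 1 < n else "Succeeded"
        return [("ok", 1.0 - q, nxt), ("fail", q, "Failed")]
    return next_cases


sinks = {"Succeeded", "Failed"}
partial, rest = best_first_absorption(0, sinks, chain(3, 0.1), max_expansions=2)
full, none_left = best_first_absorption(0, sinks, chain(3, 0.1), max_expansions=100)
print(partial, rest)
print(full, none_left)
```

The unexplored mass returned with a partial result bounds the error: each sink probability can grow by at most that amount if the enumeration were continued.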

5. Example Models

In this section we illustrate the use of action models to model some basic fault tolerance mechanisms, and show the application of action models to the evaluation of a reliable clustered computing platform.

5.1. Redundancy

Figure 2 shows an example of a simple redundancy model. In the system we only consider a single hardware resource (“HW”), with two modes: “failure” and “success.” In case of a failure in “system A,” two redundant systems (“system B” and “system C”) are available for retrying the task.

Figure 3. Fault classes model.

The redundancy model illustrates the intricacies introduced by dependency coefficients. Assume the failure mode probability for resource HW is $f$, and the (self-)dependency coefficient is $d$. Then the probability of a failure at system C (after two previous failures in system A and system B) is $d^2 f$. Hence, for $d$ it must hold that $0 \le d < f^{-1/2}$. This bound depends on the failure probability as well as the number of redundant resources, and must thus be reevaluated for every model instance, although a robust choice exists in $0 \le d \le 1$. An additional difficulty is that for the case probabilities to sum to 1, the probability of success at system C should be $1 - d^2 f$, which contradicts its specified value $1 - f$. In the redundancy model, we solve this problem by letting the failure mode “dominate” the computation of the case probabilities. That is, we first evaluate the failure case, and scale the success case so that the case probabilities sum to 1. This works satisfactorily for models with a single dependency coefficient. As a general solution, we evaluate all case probabilities first, from top to bottom at each transition, and then normalize them to 1 (thus treating the “probabilities” as weights). Note, however, that the consequences of modeling dependency coefficients become hard to predict. Even more intricate difficulties can be thought of when introducing dependencies, and a powerful theory to deal with dependency issues is a topic that deserves further detailed research.
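As a worked illustration of the “dominating” failure mode, the following derivation writes out the case probabilities for the three systems. It assumes that the job fails only if systems A, B and C all fail, and that the failure probability at system B (after one earlier failure) is $d f$, which follows the same pattern as the value $d^2 f$ given above for system C.

```latex
\begin{align*}
  P(\text{fail at A}) &= f,      & P(\text{success at A}) &= 1 - f, \\
  P(\text{fail at B}) &= d f,    & P(\text{success at B}) &= 1 - d f, \\
  P(\text{fail at C}) &= d^2 f,  & P(\text{success at C}) &= 1 - d^2 f, \\[4pt]
  P(\text{job fails}) &= f \cdot d f \cdot d^2 f = d^3 f^3, \\
  P(\text{job succeeds}) &= (1 - f) + f (1 - d f) + f \cdot d f \, (1 - d^2 f)
                          = 1 - d^3 f^3 .
\end{align*}
```

For example, with $f = 0.1$ and $d = 2$ the bound $d < f^{-1/2} \approx 3.16$ is satisfied, and the job failure probability is $d^3 f^3 = 0.008$, compared with $f^3 = 0.001$ when the three attempts are independent ($d = 1$).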

5.2. Fault classes

Modes of resources are useful to model different fault classes. We illustrate this in Figure 3. The task “Processing” contains a single hardware resource “HW,” with five modes: four failure modes and one success mode. The different failure modes trigger different fault tolerance mechanisms, as described by Huang and Kintala [9]. If the fault tolerance task (“FT1” to “FT4”) fails, the path ends in the “Failed” sink, otherwise in the “Succeeded” sink.

The fault classes model illustrates the kind of analyses we want to perform using action models. We are interested in the influence of fault-tolerance mechanisms on user-perceived reliability, as opposed to analyzing and optimizing a particular fault tolerance mechanism itself. Detailed analysis of the individual fault tolerance mechanisms can be used to provide parameter values for the models. Using these parameters, the impact of individual mechanisms on the user-perceived reliability can then be determined.
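As a rough illustration of how such parameters combine, the sketch below computes the probability of reaching the “Succeeded” sink in a model shaped like the fault classes example: one resource with several failure modes, each handled by its own fault tolerance task. The mode names, the probabilities, and the function name `success_probability` are hypothetical; in practice the values would come from the detailed analysis of each mechanism.

```python
def success_probability(mode_probs, ft_failure_probs):
    """Probability of ending in the 'Succeeded' sink.

    mode_probs:       failure mode -> probability that the resource is in that mode
    ft_failure_probs: failure mode -> probability that the fault tolerance task
                      triggered by that mode fails
    """
    p_any_failure = sum(mode_probs.values())
    if p_any_failure > 1.0:
        raise ValueError("failure mode probabilities exceed 1")
    # Path 1: the resource is in its success mode.
    p_success_mode = 1.0 - p_any_failure
    # Paths 2..n: a failure mode occurs and its fault tolerance task succeeds.
    p_recovered = sum(
        p * (1.0 - ft_failure_probs[mode]) for mode, p in mode_probs.items()
    )
    return p_success_mode + p_recovered


# Hypothetical parameters for four fault classes.
mode_probs = {"mode1": 1e-3, "mode2": 1e-3, "mode3": 1e-3, "mode4": 1e-3}
ft_failure_probs = {"mode1": 0.10, "mode2": 0.20, "mode3": 0.05, "mode4": 0.50}

p_succeeded = success_probability(mode_probs, ft_failure_probs)
p_without_ft = 1.0 - sum(mode_probs.values())
print(f"with fault tolerance:    {p_succeeded:.5f}")
print(f"without fault tolerance: {p_without_ft:.5f}")
```

Comparing the two printed values shows the contribution of the mechanisms to user-perceived reliability, which is the quantity of interest here, rather than the quality of any single mechanism.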

5.3. A Reliable Computing Cluster

We conclude by showing some reliability computations for a model of a reliable clustered computing platform, somewhat similar to [11]. We obtain results from a tool we developed for the analysis of action models. The


Figure 4. Reliable computing cluster model.

tool is a prototype tool that operates on an earlier execution model that is simpler than the Petri-net like formulation in Section 3.2. The user interface for the tool is textual, since so far the focus of the code development has been on a basic specification language and on the solution algorithms.

Figure 4 gives a model of the cluster, studying the effect of a watchdog mechanism. We consider the success flow of a job to consist of a single task (labeled “Primary”). To succeed, the hardware (“HW”), operating system (“OS”), and the software (“SW”) and software libraries (“Libs”) should be functioning correctly. We assume that hardware failures may lead to a crash, operating system and software may contain bugs, and the operating system may experience the aging phenomenon (aging implies that system resources are slowly getting exhausted [8, 10]).

The watchdog can only tolerate crash failures, not any of the other failure modes. This is specified by the formulae associated with the transition out of the primary task. (Note that the formula at the second case should be read as ‘(bug OR aging) AND NOT crash,’ to assure non-conflicting conditions for all cases.) If a bug or aging failure occurs, the job has failed. For the watchdog to operate correctly, it needs the system interface resource (“SI”), the hardware implemented watchdog (“WD”) and an up and running Ethernet (“Ether”). If all this is functioning, the “Spare” task gets executed. In the spare, if any of the mentioned software or


hardware failures occurs, there is no further fault tolerance provided. In that case, the “Failed” sink is reached, otherwise the “Succeeded” sink.
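The requirement that the case formulae be non-conflicting can be checked mechanically. The sketch below is an illustration only: it encodes the two formulae stated for the transition out of the primary task, adds an assumed third formula for the success case (no crash, bug, or aging), and enumerates all mode combinations to confirm that exactly one case applies to each. The case labels and function names are hypothetical.

```python
from itertools import product

MODES = ("crash", "bug", "aging")


def case_conditions(active):
    """Evaluate the case formulae for a set of active failure modes."""
    crash = "crash" in active
    bug_or_aging = "bug" in active or "aging" in active
    return {
        "watchdog": crash,                            # first case: crash
        "failed": bug_or_aging and not crash,         # second case: (bug OR aging) AND NOT crash
        "success": not crash and not bug_or_aging,    # assumed formula for the success case
    }


def select_case(active):
    """Return the unique case that holds, or raise if the formulae conflict."""
    matching = [name for name, holds in case_conditions(active).items() if holds]
    if len(matching) != 1:
        raise ValueError(f"formulae are conflicting or incomplete for {sorted(active)}")
    return matching[0]


# Enumerate all 2^3 combinations of active failure modes.
outcomes = {}
for flags in product((False, True), repeat=len(MODES)):
    active = frozenset(mode for mode, on in zip(MODES, flags) if on)
    outcomes[active] = select_case(active)

for active, case in sorted(outcomes.items(), key=lambda item: sorted(item[0])):
    print(sorted(active), "->", case)
```

Dropping the ‘AND NOT crash’ part of the second formula would make `select_case` raise for combinations such as crash together with bug, which is the conflict the note in the text guards against.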

For illustration purposes, we investigate for this system how dependency influences the reliability. We took all mode probabilities to be $10^{-4}$ (the complementary probability always is the ‘success’ mode, which is not depicted). The model is still relatively simple, and results can be checked easily, but dependency coefficients may considerably complicate the analysis. Assume that a dependency coefficient $d$ for the crash mode for the hardware resource is introduced. Then Figure 5 shows the probability of success (sink “Succeeded”) when submitting a job to the system. Note that if $d = f^{-1} = 10{,}000$, a crash failure is certain to be repeated in the spare unit, and fault-tolerance for crash failures will not achieve improvement in reliability. If $d = 0$, the recovery from a crash failure is guaranteed, as long as the watchdog is functioning. The unreliability is dominated by the influence of bug and aging failures, and the improvement achieved by the watchdog is thus limited. We see from Figure 5 that a reliability close to 0.9996 is guaranteed for a dependency coefficient of up to about 100. For high crash failure correlation, the watchdog mechanism starts losing its ability to improve the reliability.
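Because the model is simple, the trend can be checked by hand. The following sketch does so under explicitly assumed structure: five independent failure modes in the primary task (HW crash, OS bug, OS aging, SW bug, Libs bug), three resources needed by the watchdog (SI, WD, Ether), and a spare with the same modes whose crash probability is scaled to d·f. It does not reproduce the tool's normalization of case weights, so the values are an approximation and not the tool's output; the function name is hypothetical.

```python
def job_success_probability(d, f=1e-4):
    """Approximate probability of reaching the 'Succeeded' sink (assumed model structure)."""
    # No failure mode active in the primary task: job succeeds directly.
    p_primary_ok = (1.0 - f) ** 5
    # A crash in the primary hands the job to the watchdog.
    p_crash = f
    # Watchdog needs SI, WD and Ether to be functioning.
    p_watchdog_ok = (1.0 - f) ** 3
    # In the spare, the crash probability is scaled by the dependency coefficient.
    p_spare_crash = min(1.0, d * f)
    # The spare also needs its four non-crash modes (two OS, SW, Libs) to be inactive.
    p_spare_ok = (1.0 - p_spare_crash) * (1.0 - f) ** 4
    return p_primary_ok + p_crash * p_watchdog_ok * p_spare_ok


for d in (0, 1, 10, 100, 1000, 10000):
    print(f"d = {d:>5}: {job_success_probability(d):.6f}")
```

Under these assumptions the success probability stays near 0.9996 for small d and falls toward the no-watchdog value of about 0.9995 as d approaches 10,000, in line with the behavior described for Figure 5.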

As mentioned, the presented cluster model is a rough representation of a real system like [11], but it already can


[Plot: probability of success, ranging from about 0.9995 to 0.9996, against the dependency coefficient of the crash mode on a logarithmic axis from 10 to 10,000.]

Figure 5. The influence of correlation in crash failures for the reliable clustered computing model.

be used to study the influence of parameter values. Additions to the model can be made to investigate more details, or to investigate the reliability impact of other potential fault tolerance mechanisms (one can easily investigate what the improvement is if rejuvenation is added to recover from aging-related failures). Also, different user applications, making use of different resources, may be modeled using the operational profile modeling construct.

6. Conclusion

Action models provide an intuitive, high-level, modeling formalism for fault-tolerant distributed computing systems, to analyze the impact of fault tolerance mechanisms on the user-perceived reliability. The modeling formalism follows a user or load-driven view of distributed systems, modeling sequences of actions that are required by a job offered to the system. It combines aspects of Petri nets with traditional combinatorial reliability modeling formalisms. It supports the modeling of software as well as hardware components, and their failure behavior, and includes constructs to model dependency and operational profiles.

In this paper, we formally specified action models, provided a path-based solution algorithm, and provided examples of the use of action models. At this point, we cannot claim to have achieved our goals of providing an intuitive, robust formalism, although the early examples and results are encouraging. Further research must attend to efficient solution algorithms, elegant specification methods for the case formulae, and a theory for dependency coefficients. In addition, we are attempting to further simplify modeling with action models by introducing other high-level constructs. The current paper establishes the fundamentals for all such future developments.

References

[1] M. Ajmone Marsan, G. Balbo, G. Conte, S. Donatelli, and G. Franceschinis. Modelling with Generalized Stochastic Petri Nets. Series in Parallel Computing. John Wiley & Sons, New York, NY, USA, 1995.

[2] G. Ciardo. Discrete-time Markovian stochastic Petri nets. In W. J. Stewart, editor, Computations with Markov Chains, chapter 20, pages 339-358. Kluwer Academic Publishers, Boston, 1995.

[3] G. Ciardo and R. Zijal. Well-defined stochastic Petri nets. In 4th International Workshop on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, pages 278-284, San Jose, CA, USA, Feb. 1996. IEEE, IEEE Computer Society Press.

[4] C. Colbourn. The Combinatorics of Network Reliability. Oxford University Press, New York, NY, USA, 1989.

[5] D. D. Deavours and W. H. Sanders. “On-the-fly” solution techniques for stochastic Petri nets and extensions. In 7th International Workshop on Petri Nets and Performance Models, pages 132-141, Saint Malo, France, June 1997. IEEE, IEEE Computer Society Press.

[6] J. B. Dugan. Fault trees and imperfect coverage. IEEE Transactions on Reliability, 38(2):177-185, June 1989.

[7] J. B. Dugan and M. R. Lyu. Dependability modeling for fault-tolerant software and systems. In B. Krishnamurthy and M. R. Lyu, editors, Software Fault Tolerance, volume 3 of Trends in Software, chapter 5, pages 109-138. John Wiley & Sons, Chichester, UK, 1995.

[8] S. Garg, A. P. A. van Moorsel, K. S. Trivedi, and K. Vaidyanathan. Age estimation and failure forecasting in software systems using time-series analysis. In 9th International Symposium on Software Reliability Engineering, Paderborn, Germany, Nov. 1998. IEEE, IEEE Computer Society Press.

[9] Y. Huang and C. Kintala. Software fault-tolerance in the application layer. In B. Krishnamurthy and M. R. Lyu, editors, Software Fault Tolerance, volume 3 of Trends in Software, chapter 10, pages 231-248. John Wiley & Sons, New York, 1995.

[10] Y. Huang, C. Kintala, N. Koletis, and N. D. Fulton. Software rejuvenation - design, implementation and analysis. In 25th Fault-tolerant Computing Symposium, pages 381-390, Pasadena, CA, June 1995. IEEE, IEEE Computer Society.

[11] G. Hughes-Fenchel. A flexible clustered approach to high availability. In 27th International Symposium on Fault-Tolerant Computing, pages 314-318, Seattle, WA, June 1997. IEEE, IEEE Computer Society Press.

[12] K. Kanoun and M. Borrel. Dependability of fault-tolerant systems - explicit modeling of the interactions between hardware and software components. In 2nd annual IEEE International Computer Performance & Dependability Symposium, pages 252-261, Urbana-Champaign, IL, Sept. 1996. IEEE, IEEE Computer Press.

[13] C. M. R. Kintala. Reliable software systems using reusable software components. In 16th Symposium on Reliable Distributed Systems, page 43, Durham, NC, USA, Oct. 1997. IEEE, IEEE Computer Society Press.

[14] J. C. Laprie. Dependability: Basic Concepts and Terminology. Springer Verlag, Wien, Austria, 1992.

[15] M. R. Lyu. Handbook of Software Reliability Engineering. McGraw-Hill/IEEE Computer Society, New York, NY/Los Alamitos, CA, 1996.

[16] M. R. Lyu, S. Rangarajan, and A. P. A. van Moorsel. Optimization of reliability allocation and development schedule for software systems. In 8th International Symposium on Software Reliability Engineering, pages 336-347, Albuquerque, NM, USA, Nov. 1997. IEEE, IEEE Computer Society Press.

[17] M. Malhotra and K. S. Trivedi. Power-hierarchy of dependability-model types. IEEE Transactions on Reliability, 43(3):493-501, Sept. 1994.

[18] V. B. Mendiratta. Assessing the reliability impacts of software fault-tolerance mechanisms. In 7th International Symposium on Software Reliability Engineering, pages 99-103, New York, NY, USA, 1996. IEEE, IEEE Computer Society Press.

[19] J. F. Meyer, A. Movaghar, and W. H. Sanders. Stochastic activity networks: Structure, behavior and applications. In International Conference on Timed Petri Nets, pages 106-115, Torino, Italy, July 1985.

[20] J. D. Musa, A. Iannino, and K. Okumoto. Software Reliability: Measurement, Prediction, Application. McGraw-Hill, New York, NY, USA, professional edition, 1990.

[21] V. F. Nicola, P. Heidelberger, and P. Shahabuddin. Uniformization and exponential transformation: Techniques for fast simulation of highly-dependable non-Markovian systems. In 22nd International Symposium on Fault-Tolerant Computing, pages 130-139. IEEE, IEEE Computer Society Press, 1992.

[22] M. A. Qureshi, W. H. Sanders, A. P. A. van Moorsel, and R. German. Algorithms for the generation of state-level representations of stochastic activity networks with general reward structures. IEEE Transactions on Software Engineering, 22(9):603-614, Sept. 1996.

[23] W. Reisig. Petri Nets (An Introduction), volume 4 of Monographs on Theoretical Computer Science. Springer-Verlag, Berlin, Germany, 1985.

[24] R. A. Sahner, K. S. Trivedi, and A. Puliafito. Performance and Reliability Analysis of Computer Systems, An Example-Based Approach Using the SHARPE Software Package. Kluwer, Boston, MA, 1996.

[25] W. H. Sanders and J. F. Meyer. Reduced base model construction methods for stochastic activity networks. IEEE Journal on Selected Areas in Communications, 9(1):25-36, January 1991.

[26] H. V. Shah. Performance evaluation of manufacturing systems using stochastic activity networks. Master’s thesis, University of Arizona, Tucson, Arizona, Dec. 1991.

[27] A. P. A. van Moorsel. Performability Evaluation Concepts and Techniques. PhD thesis, University of Twente, The Netherlands, 1993.

[28] A. P. A. van Moorsel and B. R. Haverkort. Probabilistic evaluation for the analytical solution of large Markov models: Algorithms and tool support. Microelectronics and Reliability, 36(6):733-753, 1996.
