
MERLIN GERIN: mastering electrical power

GROUPE SCHNEIDER

cahiers techniques 144

introduction to dependability design

P. Bonnefoi

MERLIN GERIN
service information
38050 Grenoble Cedex
France
tel.: 76.57.60.60

E/CT 144, December 1990

Pascal Bonnefoi earned his engineering degree from ESE in 1985. After working for a year in Operational Research for the French Navy, he started work as a reliability analyst for Merlin Gerin in 1986, on reliability studies for which he developed a series of special software packages. He also taught courses in this field in the industrial and academic worlds. He is presently working as a software engineer for HANDEL, a Merlin Gerin subsidiary.

cahiers techniques Merlin Gerin n° 144 / p.1

Table of contents

1. Importance of dependability
   In housing
   In services
   In industry
2. Dependability characteristics
   Reliability
   Failure rate
   Availability
   Maintainability
   Safety
3. Dependability characteristics interdependence
   Interrelated quantities
   Conflicting requirements
   Time average related quantities
4. Types of defects
   Physical defects
   Design defects
   Operating errors
5. From component to system: modeling aspects
   Data bases for system components
   FMECA method
   Reliability block diagram
   Fault trees analysis
   State graphs
6. Conclusion
7. References and Standards


Equipment failures, unavailability of a power supply, stoppage of automated equipment and accidents are quickly becoming unacceptable events, be it to the ordinary citizen or to industrial manufacturers.
Dependability and its components (reliability, maintainability, availability and safety) have become a science that no designer can afford to ignore.
This technical report presents the basic concepts of dependability and explains its basic computational methods.
Some examples and several numerical values are given to complement the formulas, along with references to the various computer tools usually applied in this field.


1. importance of dependability

Prehistoric men had to depend on their arms for survival. Modern man is surrounded by ever more sophisticated tools and systems on which he depends for safety, efficiency and comfort.

in housing
Ordinary citizens are especially concerned in everyday life by: the reliability of the TV set, the availability of the mains supply, the maintainability of freezers and cars, the safety of their boiler valves.

in services
Bankers and, in general, service industries give a lot of weight to: computer reliability, availability of heating, maintainability of elevators, fire related safety.

in industry
In competitive industries it is not possible to tolerate production losses. This is even more so for complex industrial processes. In these cases one strives to obtain the best: reliability of command and control systems, availability of machine tools, maintainability of production tools, safety of personnel and invested capital.

These characteristics, known under the general term of DEPENDABILITY, are related to the concept of reliance (to depend upon something). They are quantified in relation to a goal, they are computed in terms of a probability and they are obtained by the choice of an architecture and its components. They can be verified by suitable tests or by experience.

For over 20 years Merlin Gerin has pioneered work in the DEPENDABILITY field: in the past, with its contribution to the design of nuclear power plants or the high availability of power supplies used at the launching site of the ARIANE space program; nowadays, by its design of products and systems used worldwide.

2. dependability characteristics

reliability
Light bulbs are used by everyone: individuals, bankers and industrial workers. When turned on, a light bulb is expected to work until turned off. Its reliability is the probability that it works until time t and it is a measure of the light bulb’s aptitude to function correctly.

Definition:
The reliability of an item is the probability that this item will be able to perform the function it was designed to accomplish under given conditions during a time interval (t1,t2); it is written R(t1,t2).
This definition follows the one given by the IEC (International Electrotechnical Commission) International Electrotechnical Vocabulary, Chapter 191. Certain basic concepts used by this definition must be detailed:

Function: reliability is a characteristic assigned to the system’s function. Knowledge of its hardware architecture is usually not enough. Functional analysis methods must be used to determine the reliability.

Conditions: the environment has a fundamental role in reliability. This is also true for the operating conditions. Hardware aspects are clearly insufficient.

Time interval: we wish to emphasize an interval of time as opposed to a specific instant. Initially, the system is supposed to work. The problem is to determine for how long. In general t1 = 0 and it is possible to write R(t) for the reliability function.

failure rate
Consider the light bulb example again. Its failure rate at time t, written as λ(t), gives the probability per unit time that it will suddenly burn out in the interval of time (t, t+∆t), given that it kept working until time t. Failure rates are time rates and, as such, their units are inverse time.

Mathematically, the failure rate is written as:

$$\lambda(t) = \lim_{\Delta t \to 0} \left( \frac{1}{\Delta t} \cdot \frac{R(t) - R(t+\Delta t)}{R(t)} \right) = -\frac{1}{R(t)} \frac{dR(t)}{dt} \qquad (1)$$

For a human being, the failure rate measures the probability of death occurring in the next hour: λ(20 years) = 10^-6 per hour.
If λ is represented as a function of age, one obtains the curve given in figure 1.


fig. 1: bathtub curve

fig. 2: exponential reliability

fig. 3: wearout reliability curve

After the high values corresponding to the infant mortality period, λ reaches the value of adult age, during which it remains constant since causes of death are mainly accidental and thus independent of age. After 60 years of age, old age causes λ to increase. Experience seems to show that many electronic components follow a similar bathtub curve, from which the same terminology is borrowed: infant mortality, useful life and wearout.

During the useful life, λ is constant and Equation (1), integrated with R(0) = 1, gives R(t) = exp(-λt). This is the exponential distribution and the shape of the reliability function is given in figure 2.

The exponential distribution is one among many other possibilities. Mechanical devices which are subject to wearout from the beginning of their operating life can follow other distributions, such as Weibull’s distribution. In this case the failure rate is time dependent. A curve illustrating the time dependency of λ is seen in figure 3, in which no plateau, as in figure 1, exists.
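As an illustration, the link between Equation (1) and the two distributions mentioned above can be checked numerically. The Python sketch below is not part of the original report: the function names, the Weibull parameterization (shape beta, scale eta) and the numerical values are assumptions chosen only for the example.

```python
import math


def exponential_reliability(lam, t):
    """R(t) = exp(-lam * t) for a constant failure rate lam (per hour)."""
    return math.exp(-lam * t)


def weibull_reliability(beta, eta, t):
    """R(t) = exp(-(t / eta) ** beta); beta and eta are assumed parameter names."""
    return math.exp(-((t / eta) ** beta))


def weibull_failure_rate(beta, eta, t):
    """Analytical Weibull failure rate: (beta / eta) * (t / eta) ** (beta - 1)."""
    return (beta / eta) * (t / eta) ** (beta - 1)


def failure_rate_from_reliability(reliability, t, dt=1e-3):
    """Finite-difference approximation of Equation (1)."""
    r_t = reliability(t)
    return (r_t - reliability(t + dt)) / (dt * r_t)


if __name__ == "__main__":
    lam = 1e-3              # hypothetical constant failure rate, per hour
    beta, eta = 2.0, 1000.0  # hypothetical wearout law (beta > 1)
    for t in (100.0, 500.0, 900.0):
        const_rate = failure_rate_from_reliability(
            lambda x: exponential_reliability(lam, x), t)
        wear_rate = failure_rate_from_reliability(
            lambda x: weibull_reliability(beta, eta, x), t)
        print(f"t={t:6.0f} h  exponential: {const_rate:.3e}  Weibull: {wear_rate:.3e}")
```

With a shape parameter greater than 1 the Weibull failure rate grows with time, which corresponds to the wearout behaviour of figure 3, whereas the exponential case returns the same rate at every instant.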

availability
To illustrate the concept of availability consider the case of an automobile. A vehicle must start and run upon demand. Its past history may be of little relevance. The availability is a measure of its aptitude to run properly at a given instant.

Definition:
The availability of a device is the probability that this device is in a state to perform the function for which it was designed, under given conditions and at a given time t, under the assumption that the required external conditions are assured. We will use the symbol A(t).
This definition, inspired by the one given by the IEC, mimics the one for reliability. However, its time characteristics are fundamentally different since the concept of interest is an instant of time instead of a time interval. For a repairable system, functioning at time t does not necessarily imply functioning throughout [0,t]. This is the main difference between availability and reliability.

It is possible to plot the availability curve as a function of time for a repairable device having exponential times to failure and to repair (see figure 4).
It can be seen that the availability has a limiting value which, by definition, is the asymptotic availability. This limit is reached after a certain time. The limiting reliability is always zero since, eventually, all devices will fail. (This last point is controversial when dealing with software.)
Consider again the case of the automobile. Two kinds of cars can have poor availability: those with frequent failures and those which do not fail often but instead spend a long time in the garage for repairs. Thus, although reliability is an important component of availability, the aptitude to be promptly repaired is also of paramount importance: this is measured by the maintainability.

maintainability
Many designers seek top performance for their products, sometimes neglecting to consider the possibility of failure. When all the effort has been concentrated on having a functioning system, it is difficult to consider what would happen in case of failure. Still, this is a fundamental question to ask. If a system is to have high availability, it should very rarely fail but it should also be possible to repair it quickly. In this context, the repair activity must encompass all the actions leading to system restoration, including logistics. The aptitude of a system to be repaired is therefore measured by its maintainability.

Definition:
The maintainability of an item is the probability that a given active maintenance operation can be accomplished in a given time interval [t1,t2]. It is written as M(t1,t2).
This definition also follows closely that of the IEC’s international vocabulary. It shows that maintainability is related to repair in a manner similar to that of reliability and failure. The maintainability M(t) is also defined using the same hypotheses as R(t).
The repair rate µ(t) is introduced in a way analogous to the failure rate. When it can be considered constant, the implication is an exponential distribution of repair times:

M(t) = 1 - exp(-µt).
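The availability curve of figure 4 and the maintainability function can be sketched together for a device with constant failure and repair rates. The closed form used for A(t) below is the standard result for a two-state (working/failed) model that starts in the working state; it is not written out in this report and is used here as an assumption, as are the function names and the numerical rates.

```python
import math


def maintainability(mu, t):
    """Probability that a repair is completed within t (constant repair rate mu)."""
    return 1.0 - math.exp(-mu * t)


def availability(lam, mu, t):
    """A(t) for a two-state model, device assumed to be working at t = 0."""
    a_inf = mu / (lam + mu)  # asymptotic availability
    return a_inf + (lam / (lam + mu)) * math.exp(-(lam + mu) * t)


if __name__ == "__main__":
    lam = 1e-3  # hypothetical failure rate, per hour
    mu = 1e-1   # hypothetical repair rate, per hour
    for t in (0.0, 10.0, 50.0, 500.0):
        print(f"t={t:6.1f} h  A(t)={availability(lam, mu, t):.6f}  "
              f"M(t)={maintainability(mu, t):.6f}")
```

In this sketch A(t) starts at 1 and settles on its asymptotic value after a few multiples of 1/(λ+µ), while M(t) rises from 0 towards 1.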

safety
It is possible to distinguish between dangerous failures and safe ones. The difference does not lie so much in the failures themselves as in their consequences. Switching off the light signals in a train station, or suddenly switching them from green to red, has an impact (all trains stop) but is not functionally dangerous. The situation is totally different if the lights accidentally all turn green. Safety is the probability of avoiding dangerous events.

The concept of safety is closely linked to that of risk which, in turn, depends not only on the probability of occurrence but also on the criticality of the event. It is possible to accept a life-threatening risk (maximum criticality) if the probability of such an event is minimal. If it is just a matter of a broken limb, the acceptable probability might be greater. The curve in figure 5 illustrates the concept of acceptable risk.

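fig. 4: availability as a function of time (axes: time t and availability, labelled D(t) in the figure, with the level 1 and the asymptotic value D∞ marked)

fig. 5: the level of risk is a function of both criticality and probability of occurrence (axes: criticality and probability of occurrence; the curve separates the acceptable risk region from the unacceptable risk region)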


3. dependability characteristics interdependence

interrelated quantities
The examples given so far have shown that the concept of dependability is a function of four quantifiable characteristics, related to each other in the way shown by figure 6.
These four quantities must be considered in all dependability studies. Dependability is thus often designated by the initials RAMS:
Reliability: probability that the system is failure free in the interval [0,t].
Availability: probability that the system works at time t.
Maintainability: probability that the system is repaired in the interval [0,t].
Safety: probability that a catastrophic event is avoided.

conflicting requirements
Some of the requirements of dependability can be contradictory.

An improved maintainability can bring about some choices which degrade the reliability (for example, the addition of components to simplify the assembly-disassembly operations). The availability is therefore a compromise between reliability and maintainability. A dependability study allows the analyst to obtain a numerical estimate of this compromise.

Similarly, safety and availability might conflict with each other.

We have noted that the safety of a system is defined as the probability of avoiding a catastrophic event; it is often maximum when the system is stopped. In this case, its availability is zero! Such a case arises when a bridge is closed to traffic because there is a risk of collapse. Conversely, to improve the availability of their fleet, certain airlines are known to have neglected their preventive maintenance activities, thus diminishing flight safety. In order to ascertain the optimum compromise between safety and availability it is necessary to compute these characteristics rigorously.

A system can be described as being in one of three states, see figure 7. In addition to the normal functioning state, two failed states can be considered: a failsafe state and a state of dangerous failure. In order to simplify this description we include in the failed states all modes of degraded performance, labeled “incorrect performance”.

The time spent before leaving state A is characteristic of the reliability. The time spent in state B, after a safe failure, is characteristic of the maintainability. The ratio between the time spent in state A and the total time is characteristic of the availability.
The aptitude of the system to avoid spending any time in state C is characteristic of safety. It can be seen that state B is acceptable in terms of safety but is a source of unavailability.

fig. 6: the components of dependability (reliability, maintainability, availability, safety)

fig. 7: failsafe: availability; dangerous failure: safety (state A: normal functioning; state B: incorrect performance and not dangerous; state C: incorrect performance and dangerous; transitions shown: failsafe, repair, dangerous failure)


time average related quantities
In addition to the previously mentioned probabilities of occurrence of events (reliability, availability, maintainability and safety), it is common to use mean times before the occurrence of events in order to describe dependability.

Mean times
It is useful to recall here the exact definition of all the mean times as they are often misunderstood. The worst example of abuse is probably the most widely known, the MTBF, which is often confused with lifetime.
On average, in a homogeneous population of items following an exponential distribution, about 2/3 of these items will have failed after a time equal to the MTBF. A single system having a constant failure rate will have a 63% chance of having failed after such a time, since 1 - exp(-1) ≈ 0.63.
The definitions and relative positions of these mean times during the life of a system are given in figure 8.

MTTF or MTFF (Mean Time To First Failure): the mean time before the occurrence of the first failure.

MTBF (Mean Time Between Failures): mean time between two consecutive failures in a repairable system.

MDT (Mean Down Time): mean time between the instant of failure and total restoration of the system. It includes the failure detection time, the repair time and the reset time.

MTTR (Mean Time To Repair): mean time to actually restore the system to an operating condition.

MUT (Mean Up Time): mean failure free time.

fig. 8: diagram for mean times in the case of a system with no interruptions due to preventive maintenance (time line alternating between up state and failed state, marking MTTF, MUT, MDT and MTBF between successive failure and repair events)

Important relations and numerical values
There are many mathematical relations linking the quantities introduced thus far:
For an exponential distribution with R(t) = exp(-λt) one has MTTF = 1/λ. In this case, for a non repairable system, we have MTBF = MTTF (in fact, in this case, all failures are “first” failures). This explains why the classical formula used for electronic components (non repairable) is: MTBF = 1/λ.
The above formula is only valid for exponential distributions (constant failure rates) and, strictly speaking, for non repaired items, although it is possible to apply it to repaired systems with very small MDTs. Analogously, when repair times obey an exponential distribution, it is possible to show that MTTR = 1/µ.

One also has: MTBF = MUT + MDT. In general it is also true that MDT = MTTR, except for the logistic delay and restart times. Furthermore:

asymptotic availability

$$A_\infty = \lim_{t \to +\infty} A(t) = \frac{MUT}{MTBF}$$

This formula illustrates the assertion given earlier concerning the availability (ratio of correct performance time to total time). This quantity corresponds to the asymptotic value given in figure 4.

asymptotic unavailability

$$U_\infty = \lim_{t \to +\infty} \left(1 - A(t)\right) = 1 - A_\infty$$

The asymptotic unavailability is usually easier to express numerically than the availability: it is much easier to read 10^-6 than 0.999999.

For exponential distributions, using the equations MUT = 1/λ and MDT = 1/µ one obtains:

$$U_\infty = \frac{\lambda}{\lambda + \mu} \quad \text{or} \quad A_\infty = \frac{\mu}{\lambda + \mu}$$
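The relations between mean times and asymptotic availability can be verified with a few lines of Python. This is only a sketch: the function names and the sample rates are assumptions introduced for the illustration, not values from the report.

```python
import math


def probability_failed_by(t, lam):
    """Probability that an item with constant failure rate lam has failed by time t."""
    return 1.0 - math.exp(-lam * t)


def asymptotic_availability(mut, mdt):
    """A_inf = MUT / MTBF, with MTBF = MUT + MDT."""
    return mut / (mut + mdt)


def asymptotic_unavailability(lam, mu):
    """U_inf = lam / (lam + mu) for exponential times to failure and to repair."""
    return lam / (lam + mu)


if __name__ == "__main__":
    lam = 1e-4  # hypothetical failure rate, per hour (MUT = 1/lam)
    mu = 1e-1   # hypothetical repair rate, per hour (MDT = 1/mu)
    mtbf_non_repaired = 1.0 / lam
    print("P(failed by t = MTBF):", probability_failed_by(mtbf_non_repaired, lam))
    print("A_inf from mean times:", asymptotic_availability(1.0 / lam, 1.0 / mu))
    print("U_inf from rates     :", asymptotic_unavailability(lam, mu))
```

The first printed value illustrates why the MTBF should not be read as a lifetime: well over half of the items have already failed by then.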


device | λ (/h) | MTTF
resistors | 10^-9 | 1000 centuries
microprocessors | 10^-6 | 100 years
fuses and circuit-breakers, 300 ft. cables, busbars | 10^-7 to 10^-6 | 100 to 1000 years
generator | 10^-5 | 10 years
mains outages | 10^-2 | 4 days

fig. 9: failure rates and mean times to failure for certain devices belonging to the electronic and electrotechnical fields

λ is often much smaller than µ since the repair times are much smaller than the times to failure. It is therefore possible to simplify the denominator and write:

$$U_\infty \approx \frac{\lambda}{\mu} = \lambda \cdot MTTR$$

This last formula illustrates, in the case of exponential distributions, the compromise between reliability and maintainability which has to be optimized to improve the availability.
The table of figure 9 gives failure rates and mean times to failure for certain devices belonging to the electronic and electrotechnical fields.

It can be seen that the reliability is degraded when the complexity of the system increases. This corresponds to a well-known rule of dependability design: simplify as much as possible.

The concept of mean time is often misunderstood. For example, the next two sentences have, for exponential distributions, the same meaning: “The MTTF is 100 years” and “The odds are one in 100 of observing a failure in the first year”. Still, the second sentence seems more worrisome for a manufacturer selling 10 000 devices of this type per year: on average, about 100 units will fail in the first year.

To illustrate the impact of redundancy on the unavailability, consider the national power grid. One is concerned with the delivery of energy to the final user. The unavailability is about 10^-3. This corresponds to about 9 hours of downtime per year. For a computer room having a heavily redundant system of Uninterruptible Power Supplies (UPS), it is possible to reduce this figure by a factor of 1000 to 10 000.
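The orders of magnitude quoted in this passage can be reproduced with a short calculation. The sketch below assumes a year of 8760 hours and exponential distributions; the function names are hypothetical and chosen only for this illustration.

```python
import math

HOURS_PER_YEAR = 8760.0  # assumption: non-leap year


def first_year_failure_probability(mttf_years):
    """Probability of failing during the first year for a constant failure rate."""
    return 1.0 - math.exp(-1.0 / mttf_years)


def expected_first_year_failures(units, mttf_years):
    """Average number of units, out of `units` sold, failing in their first year."""
    return units * first_year_failure_probability(mttf_years)


def downtime_hours_per_year(unavailability):
    """Mean yearly downtime corresponding to an asymptotic unavailability."""
    return unavailability * HOURS_PER_YEAR


def approximate_unavailability(lam, mttr):
    """U_inf ~ lam * MTTR, valid when lam is much smaller than mu = 1 / MTTR."""
    return lam * mttr


if __name__ == "__main__":
    print("first-year failure odds :", first_year_failure_probability(100.0))
    print("failures among 10 000   :", expected_first_year_failures(10_000, 100.0))
    print("grid downtime (h/year)  :", downtime_hours_per_year(1e-3))
    for factor in (1_000, 10_000):
        seconds = downtime_hours_per_year(1e-3 / factor) * 3600.0
        print(f"UPS, gain {factor:>6}: {seconds:.1f} s/year")
```

The figures agree with the text: odds of roughly one in 100, about 100 failed units, and a little under 9 hours of yearly downtime for an unavailability of 10^-3.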

4. types of defects

The design of a system with respect to its dependability goals implies the need to identify and take into account the various possible causes of defects.
One can suggest the following classification:

physical defects
induced by internal causes (breakdown of a component) or external causes (electromagnetic interference, vibrations, ...).

design defects
comprising hardware and software design errors.

operating errors
arising from an incorrect use of the equipment: hardware being used in an inappropriate environment, human operating or maintenance errors, sabotage.

The various techniques discussed in this document concern mostly physical defects. Nevertheless, human and software errors are also very important, although the state of the art in these fields is not as advanced as for physical defects. Still, within the scope of this document, we feel the following elements are worth mentioning:

Software aspects
The reliability of a piece of software in which all the inputs are exhaustively tested is equal to 1 forever. Nevertheless, this is unrealistic for real life, complex programs.
Having two redundant programs implies development by different software teams using different algorithms. This is the principle behind fault tolerant software, in which a majority vote may be implemented.
Most software reliability models can be split into two major categories: complexity models, based upon a measure of the complexity of the code or algorithm; and reliability growth models, based upon previously observed failure history.
The quantitative evaluation of the different models does not yet allow for a systematic study of software reliability. The best results are obtained in particular cases and for given environments (language, methods). This is the case for the SPIN (Integrated Digital Protection System) software developed by Merlin Gerin for use in nuclear power plants. Merlin Gerin is also an active participant in different working groups dealing with software reliability (see references). The Technical Paper CT 117, entitled “Methods for developing dependability related software”, gives further details on this subject.

Human reliability

Qualitative approaches are predominant in this field. The efforts lie mostly in the modeling of the human operator, task classification and human errors. The most advanced studies belong to the nuclear and aerospace industries. Human behavior is known as much from simulators as from field reports, and both sources can be compared with each other. Some references propose numerical values; however, these must be used with the utmost caution. According to these references it is feasible to assign an error probability depending on the nature of the activity: mechanical, procedural or cognitive action.

Some of the recent major catastrophes have shown that the human factor can have great impact, not only from the operator’s standpoint but also at the design stage. The more freedom of action is given to a human operator, the more the risks are increased. This also includes management, as the Challenger Space Shuttle accident has shown: it is possible to go all the way up to the designers of the working structure of the design team! Many disciplines are called upon to tackle the problem of human reliability, among them psychology and ergonomics.

5. from component to system: modeling aspects

data bases for system components

Electronics
Reliability calculations have been widely used in this field for many years. The two best known data bases are the Military Handbook 217 (version E at present) issued in the U.S. and the “Recueil de données de fiabilité” (reliability data handbook) from CNET (French Telecom Center); see figure 11 for an example. Merlin Gerin participates in its updates.
These data bases allow the calculation of the failure rates of electronic components, assumed to be constant. These rates are a function of the application characteristics, environment, load, etc. The type of component is also relevant, e.g., number of gates, value of the resistance, etc. Computation is usually faster with the CNET approach but many specialized computer programs exist to implement either technique with ease.
As an example, let us take a 50 kΩ resistor mounted on an electronic board inside an electric switchboard. It is necessary to consult the table given in figure 11 in order to determine the corresponding correcting values. The environment is “au sol” (fixed, ground) and therefore the environment corrective factor is:
ΠE = 2.9
The resistance value gives the corresponding multiplying factor:
ΠR = 1
This resistor is taken as being “non qualified”, which gives the multiplying quality factor:
ΠQ = 7.5
The load factor ρ is a characteristic of the application, as opposed to the other factors which are characteristic of the component itself. If the load factor is 0.7 and the environmental temperature for the board is 90°C, the diagram gives:
λb = 15 (expressed in 10^-9 per hour, as the result below implies)
The global failure rate for this resistor is thus obtained by multiplying all the corrective factors and the base failure rate:
λ = λb.ΠR.ΠE.ΠQ = 0.33 x 10^-6 / hour
If the reliability goals have been integrated at the design stage, then: better thermal designs will allow a lowering of the environment temperature; better board designs will lower the load factor ρ.
With t = 60°C and ρ = 0.2 the diagram gives:
λb = 1.7
If a qualified component is now selected, we have ΠQ = 2.5, which gives λ = 0.012 x 10^-6 / hour, that is, an improvement by a factor of about 30.
Knowledge of the reliability of each component provides a means to obtain the reliability of the boards (which are repairable or replaceable), and therefore that of whole electronic systems. This is done by using the techniques described in the rest of this report.
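The two resistor calculations above reduce to a product of factors, which the following sketch reproduces. It assumes that the base failure rate λb read from the diagram is expressed in 10^-9 per hour, which is what the quoted results imply; the function and variable names are hypothetical.

```python
BASE_RATE_UNIT = 1e-9  # assumption: lambda_b is read in 10^-9 failures per hour


def component_failure_rate(lambda_b, pi_r, pi_e, pi_q):
    """Failure rate in 1/hour: base rate times the corrective factors."""
    return lambda_b * pi_r * pi_e * pi_q * BASE_RATE_UNIT


if __name__ == "__main__":
    # Initial design: 90 degC, load factor 0.7, non qualified component.
    initial = component_failure_rate(lambda_b=15.0, pi_r=1.0, pi_e=2.9, pi_q=7.5)
    # Improved design: 60 degC, load factor 0.2, qualified component.
    improved = component_failure_rate(lambda_b=1.7, pi_r=1.0, pi_e=2.9, pi_q=2.5)
    print(f"initial : {initial:.3e} per hour")
    print(f"improved: {improved:.3e} per hour")
    print(f"gain    : {initial / improved:.1f}")
```

The computed gain is slightly below the rounded factor of 30 quoted in the text, which is consistent with the rounding of the two failure rates.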


Mechanics and electromechanics
Data bases in these fields exist although they are not really “standards”. Some sources are:
RAC, NPRD 3: report by the Reliability Analysis Center (RADC, Griffiss AFB), under contract from the US DoD, dealing with non electronic parts.
IEEE STD 500: field data on the reliability of electrical, electronic and mechanical equipment used in nuclear power plants.

In France and the US, some reference books exist that deal specifically with mechanical components.

As an example of data relevant to our activities, figure 10 gives some information concerning circuit breakers. This comes from RAC’s NPRD 3-1985. First, there is a failure mode distribution in a pie chart. For example, 34% of all field failures are due to the circuit breaker failing to open when it should. The table in figure 10 gives a point estimate of the failure rate for the thermal function of circuit breakers.

The various information items given are as follows:
environment: GF, Ground Fixed, industrial conditions;
failure rate estimate: 0.335 x 10^-6 h^-1;
a 60% confidence interval for the failure rate, using the 20% lower and 80% upper bounds;
the number of records used in this calculation, i.e. 2;
the number of observed failures: here 3;
the total number of operating hours: 8.944 x 10^6 h.

Knowledge of the global failure rate and of the failure mode distribution allows the calculation of the probability of specific events by using a simple proportionality rule.

For example, for the “stuck closed” mode, we have a corresponding failure rate of:

$$0.335 \times 10^{-6} \times \frac{34}{100} = 1.14 \times 10^{-7} \ \text{h}^{-1}$$

Another approach can sometimes be more relevant: instead of considering the calendar time, the number of make-break operations can be tallied. Then, a test is planned in which a sample is selected and the reliability is estimated using a more realistic model (e.g. Weibull distribution).

Which technique to use is largely a matter of determining the kind of failure one wishes to study: contact wear is related to the number of make and break cycles whereas corrosion is time dependent. Specific use and environment conditions are always important.

component part type | APPL ENV | user code | point estimate | 60% upper single-side | 20% lower interval | 80% upper interval | # of recs | # of fail | operating HRS (E6)
thermal | GF | M | 0.335 | - | 0.171 | 0.621 | 2 | 3 | 8.944

Failure modes shown in the pie chart: noisy, no movement, intermittent, degraded, stuck closed, stuck open, out of adjustment, others. “Stuck closed” accounts for 34%; the remaining slices are 15%, 15%, 9%, 8%, 8%, 6% and 4%.

fig. 10: failure modes and reliability data for circuit breakers
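The figures of the circuit breaker example can be cross-checked with a few lines of Python. The sketch assumes that the point estimate is simply the number of observed failures divided by the cumulated operating hours, which matches the tabulated values; the function names are hypothetical.

```python
def point_estimate(failures, operating_hours):
    """Failure rate estimate in 1/hour: observed failures over cumulated hours."""
    return failures / operating_hours


def mode_failure_rate(global_rate, mode_share_percent):
    """Proportionality rule: share of the global failure rate due to one mode."""
    return global_rate * mode_share_percent / 100.0


if __name__ == "__main__":
    rate = point_estimate(failures=3, operating_hours=8.944e6)
    print(f"point estimate    : {rate:.3e} per hour")
    stuck_closed = mode_failure_rate(0.335e-6, 34.0)
    print(f"stuck closed mode : {stuck_closed:.3e} per hour")
```

Dividing 3 failures by 8.944 million hours returns the tabulated 0.335 x 10^-6 per hour, and the proportionality rule then gives the rate of the “stuck closed” mode.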


fig. 11: example of CNET publications

Readers interested in this kind of information can refer to the American standard MIL HDBK 217 E.


Failure Modes, Effects and Criticality Analysis (FMECA) method
This is a technique to analyse the reliability of a system in terms of the failure modes of its components. The IEC has issued a standard (IEC 812) giving a description of this technique. Each element of the system can, in turn, be analyzed using one of the relevant data bases. The hardware structure of the system as well as its functional characteristics allow the analyst to inductively assess the effect on the system of each failure mode of each element.
An FMECA should also give an estimate of the criticality of each failure mode, see figure 12. This depends on two factors: the probability of occurrence of the failure and the seriousness of its consequences. Thus an FMECA is a tool to study the influence of component failures on the system.
The main interest of this technique lies in its exhaustiveness. It is nevertheless incomplete in that combinations of effects must be considered separately. This can be accomplished using the methods described in the rest of this chapter.

fig. 12: example of FMECA table

component | function | failure mode | cause | effect | criticality | comments
circuit-breaker | switch | stuck closed | solder | no shedding | 2 |
« | « | unable to close | mechanical | no power | 2 |
« | short circuit prot. | unable to open | solder | no protect | 4 | action
« | current path | sudden open | adjustment | no power | 3 |
« | « | heat | bad contact | electronic failure | 2 |

Reliability Block Diagram (RBD)
The RBD method is a simple tool to represent a system through its (non-repairable) components. Using the RBD allows the computation of the reliability of systems having series, parallel, bridge and k-out-of-n architectures, or any combination of these. Although it is possible to apply the RBD technique to repairable systems, the implementation is much more difficult.

Series-parallel systems
Two components are in series, from the reliability standpoint, if both are necessary to perform a given function. They are in parallel when the system works if at least one of the two components works, see figure 13.
These considerations are easily generalized to more than two components.
Whenever two components are in series and can be considered to be independent (the failure of one does not modify the probability of failure of the other), the reliability of this system can be calculated by multiplying the individual reliabilities together, since the first component AND the second must work:

R(t) = R1(t).R2(t)

fig. 13: series/parallel systems

In the case of two independent components in parallel, the system works if one OR the other works. It is easy to calculate the unreliability of the system since it is equal to the product of the two component unreliabilities: the system fails if the first component AND the second component fail:
1 - R(t) = (1 - R1(t)).(1 - R2(t))

Or equivalently:
R(t) = R1(t) + R2(t) - R1(t).R2(t)

In this case, components 1 and 2 are said to be in active redundancy. The redundancy would be passive if one of the parallel components were turned on only in the case of failure of the first. This is the case of auxiliary power generators.

For the particular case of non repairable components following an exponential distribution of times to failure, one can write:
For the series case:
R(t) = exp(-λ1t).exp(-λ2t) = exp(-(λ1+λ2)t)
It follows that the system’s times to failure also follow an exponential distribution (constant failure rate), since the reliability function is an exponential with:

λ = λ1 + λ2

For the parallel case:
R(t) = exp(-λ1t) + exp(-λ2t) - exp(-(λ1+λ2)t)
Here, the reliability function is not an exponential. Therefore, it can be concluded that the failure rate is not constant.

fig. 14: K/N redundant systems

fig. 15: bridge systems

All these formulas can be generalized to a system with n non repairable components, mixing series and parallel architectures.
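The generalization to n independent components can be sketched with two small helper functions that may be nested to describe mixed architectures. The names `series` and `parallel` and the numerical values are assumptions made for this illustration.

```python
import math


def series(*reliabilities):
    """All components are needed: product of the reliabilities."""
    return math.prod(reliabilities)


def parallel(*reliabilities):
    """At least one component is needed: 1 minus the product of unreliabilities."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


if __name__ == "__main__":
    # Two components in series, then the same two in active redundancy.
    print("series  :", series(0.9, 0.8))
    print("parallel:", parallel(0.9, 0.8))
    # Mixed architecture: a redundant pair in series with a third component.
    print("mixed   :", series(parallel(0.9, 0.9), 0.95))
    # Exponential case: the series system has failure rate lambda1 + lambda2.
    lam1, lam2, t = 2e-4, 1e-4, 1000.0
    print("exp     :", series(math.exp(-lam1 * t), math.exp(-lam2 * t)),
          math.exp(-(lam1 + lam2) * t))
```

Because each helper returns a plain probability, a block diagram of any depth can be written as nested calls, provided the components are independent.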

k-out-of-n redundanciesA k-out-of-n system, or simply K/N, is a n-component system in which k or morecomponents are needed for the system towork properly. We will consider only ac-tive redundancies here, see figure 14:

Let us call Ri(t) the reliability of each oneof the n components of the system. Insome simple cases the reliability of thesystem can be computed by adding thefavourable combinations:

2/3 system:

R = R1.R2 + R1.R3 + R2.R3 - 2.R1.R2.R3

(the last term removes the multiple counting of the state in which all three components work)

series system (n/n):

$$R(t) = \prod_{i=1}^{n} R_i(t)$$

parallel system (1/n):

$$1 - R(t) = \prod_{i=1}^{n} \bigl(1 - R_i(t)\bigr)$$

k/n system of identical components: if we write Ri(t) = r(t), then,

$$R(t) = \sum_{i=k}^{n} C_n^i \, r(t)^i \, \bigl(1 - r(t)\bigr)^{n-i}$$

where C_n^i is the number of combinations of i components among n.

Bridge systems
These are systems which cannot be described by simple series-parallel combinations. They can, however, be reduced to series-parallel cases by an iterative procedure, see figure 15.

In order to compute the reliability of this system in terms of the five non repairable component reliabilities it is necessary to apply conditional probabilities:

R = R3.R(given that 3 works) + (1-R3).R(given that 3 has failed).

It is thus possible to derive the system reliability R(t) by decomposing the original bridge system into the two disjoint systems illustrated in figure 16.

Example: reliability of an intrusion detection system.
The system consists of two sensors, a vibration sensor and a photoelectric cell. Each of these sensors could be connected to its specific alarm, as in figure 17, and we would have two independent branches. However, a bridge system would result if each sensor is connected to either one of the two alarms, as in figure 18, through a coupler. We will calculate the reliability improvement due to this modification. Let us also suppose that the mission time of this system is three months, i.e., the maximum expected absence during which the system must function. Furthermore, after each mission, the system is thoroughly checked and maintained and can be considered as good as new when reset. During the mission, there are no repairable elements.

Let us use the following realistic constant failure rates to obtain the different orders of magnitude:

Vibration sensor: λ1 = 2.10^-4
Photoelectric cell: λ2 = 10^-4
Coupler: λ3 = 10^-5
Alarms: λ4 = λ5 = 4.10^-4

All these failure rates are given in (hours)^-1.

computation for Diagram A of figure 17.

This is a simple case of two parallel branches, each having two components in series:

Reliability of Branch 1: R1(t).R4(t)
Reliability of Branch 2: R2(t).R5(t)
System reliability: RA(t) = R1(t).R4(t) + R2(t).R5(t) - R1(t).R4(t).R2(t).R5(t)

Using Ri(t) = exp(-λit) with t = 3 months = 2190 hours as the mission time one obtains: RA(3 months) = 0.51.
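The k/n formula for identical components given earlier in this section can be sketched in a few lines. This is only an illustration: the function name is hypothetical and the component reliability r = 0.9 is an arbitrary value chosen to show that the n/n, 1/n and 2/3 cases fall out of the same sum.

```python
from math import comb

def k_out_of_n_reliability(k, n, r):
    """Probability that at least k of n identical, independent components work.

    r is the reliability r(t) of one component at the time of interest.
    """
    return sum(comb(n, i) * r**i * (1 - r)**(n - i) for i in range(k, n + 1))

if __name__ == "__main__":
    r = 0.9  # arbitrary illustrative value
    print("series (3/3):  ", k_out_of_n_reliability(3, 3, r))
    print("2/3:           ", k_out_of_n_reliability(2, 3, r))
    print("parallel (1/3):", k_out_of_n_reliability(1, 3, r))
```

For identical components the 2/3 result equals 3r² - 2r³, which is the 2/3 formula above with R1 = R2 = R3 = r.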


fig. 16: decomposition of a bridge system
[two series-parallel diagrams built from components 1, 2, 4 and 5]

fig. 17: alarms with no coupling, diagram A
[diagram: vibration sensor (1) in series with alarm 1 (4); photoelectric cell (2) in series with alarm 2 (5)]

computation for Diagram B of figure 18

This is the bridge system. Whenever the coupler has failed we are back to the diagram of figure 17. On the other hand, when it works, we have 1 and 2 in parallel, both in series with 4 and 5, themselves in parallel. The system reliability for figure 18 is then:

RB = (1-R3).RA + R3.(R1+R2-R1.R2).(R4+R5-R4.R5)

The numerical computation gives RB(3 months) = 0.61.

In spite of the excellent reliability of the coupler, the system's reliability is only marginally improved (from 0.51 to 0.61). This numerical example shows, through a simple calculation, that there is not much sense in having a more expensive set-up.
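The two figures can be reproduced with a short script. The sketch below uses the failure rates and the 2190-hour mission time of the example; the structure function used for the brute-force check is an assumption based on the description of diagram B (the coupler lets either sensor reach either alarm), and all names are hypothetical.

```python
import math
from itertools import product

MISSION_HOURS = 2190.0  # three months
RATES = {1: 2e-4, 2: 1e-4, 3: 1e-5, 4: 4e-4, 5: 4e-4}  # (hours)^-1
R = {i: math.exp(-lam * MISSION_HOURS) for i, lam in RATES.items()}

def parallel(a, b):
    """Reliability of two independent blocks in active redundancy."""
    return a + b - a * b

# Diagram A: two independent sensor-alarm branches.
r_a = parallel(R[1] * R[4], R[2] * R[5])

# Diagram B: conditional decomposition on the state of the coupler (3).
r_b = (1 - R[3]) * r_a + R[3] * parallel(R[1], R[2]) * parallel(R[4], R[5])

def diagram_b_works(up):
    """Assumed structure function of diagram B; up[i] is True if i works."""
    direct = (up[1] and up[4]) or (up[2] and up[5])
    crossed = up[3] and ((up[1] and up[5]) or (up[2] and up[4]))
    return direct or crossed

def brute_force_reliability():
    """Sum the probabilities of all 2^5 component states in which B works."""
    total = 0.0
    for states in product([True, False], repeat=5):
        up = dict(zip(range(1, 6), states))
        if diagram_b_works(up):
            prob = 1.0
            for i, ok in up.items():
                prob *= R[i] if ok else 1 - R[i]
            total += prob
    return total

if __name__ == "__main__":
    print("RA =", round(r_a, 2))
    print("RB =", round(r_b, 2))
    print("RB by enumeration =", round(brute_force_reliability(), 2))
```

The enumeration is only practical for a handful of components, but it gives an independent check of the conditional-probability decomposition.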

fig. 18: system with coupler, diagram B
[diagram: sensors 1 and 2, alarms 4 and 5, linked through the coupler 3]

Case of repairable elements
RBD's cannot be used as systematically as before:
- for two components in parallel, the equation relating R(t) to R1(t) and R2(t) is no longer valid. In fact, a system working throughout the interval [0,t] may correspond to components 1 and 2 working alternately: with non repairable components at least one component must work throughout [0,t], whereas with repairable components both can fail, but not simultaneously.
- the equation R(t) = R1(t).R2(t) remains valid for a series system of two repairable components.
- in the case of repairable components the main concern is the numerical estimate of the availability. It is possible to use the RBD's with the same formulas as for the reliability calculations:
A(t) = A1(t).A2(t) for a series system
A(t) = A1(t)+A2(t)-A1(t).A2(t) for parallel systems.

These formulas are valid only for simple cases. For instance, the formula A(t) = A1(t)+A2(t)-A1(t).A2(t) ceases to be valid if only one repairman is available (instead of as many as necessary). This sequential feature, i.e. having a component waiting to be repaired while the other is being serviced, cannot be modelled by a simple RBD. In these cases the State Graphs, to be dealt with later, are the appropriate tool.


fault trees analysis
The computation of the system's failure probability is the main goal of this type of analysis. It is based upon a graphical construction representing all the combinations of events, essentially through AND-gates and OR-gates, that may lead to a catastrophic event. Except for extremely simple cases, computer resources must be used to evaluate the probability of the catastrophic event. It is then possible to modify the structure of the system's design to lower this probability.

Basic procedure
A deep understanding of the system and a clear definition of the "catastrophic event" are essential to build the fault tree. The catastrophic event, sometimes called the "top event", is then analyzed in terms of its immediately preceding causes. Then, each one of these causes is analyzed in terms of its own immediately preceding causes until the basic events are reached. The basic events are assumed to be independent.
A simple example is given in figure 19 and its corresponding fault tree in figure 20. This tree only contains OR-gates connecting the intermediate events (rectangles) and the basic events. The basic events are represented by circles.
It is convenient to define a cut-set as a combination of basic events whose simultaneous occurrence is sufficient to produce the top event.
The analysis proceeds in two phases:

- qualitative analysis: the minimal cut-sets, or min cuts, are obtained. The min cuts are the minimal combinations of basic events that lead to the top event. The order of a min cut is simply the number of basic events it contains.

- quantitative analysis: this is performed using the min cuts and the probability of occurrence of the basic events. This gives an approximate value for the probability of the top event. It is also necessary to validate the accuracy of this approximation in a systematic fashion. Then, depending on the objectives of the analysis, different probabilities are used to compute the system reliability or its availability.

We can illustrate these ideas by two examples:

[fault tree diagram; events shown: motor failure, motor idling and unable to start, no power, dead battery, no "-" link, no "+" link, open wire, fuse, switch; levels labelled "immediate causes" and "intermediate causes"]

- an overhead projector with one lamp inside and one spare. The top event is "no working lamp available", see figure 21. A single AND-gate is necessary. The probability of this happening is seen to be 2.10^-3, i.e. 2 in a thousand.

fig. 19: electrical supply for a motor
[circuit diagram: motor M, fuse, switch]

fig. 20: fault tree for fig. 19 circuit. The top event is: motor unable to start


fig. 23: low voltage network

fig. 21: fault tree for an overhead projector

fig. 22: a fault tree for a light bulb

- a simple light bulb. The top event is "no light", see figure 22. A single OR-gate is necessary. The probability of the top event is seen to be about 0.001, i.e. a chance of about one in a thousand of not having light. The main cause for this event is the burn-out of the light bulb.

In the general case it is often possible to obtain an exact calculation of the probability of the top event using recursion instead of the min cuts: Boolean probability calculations are performed for each gate in terms of the sub-trees being input to the gate considered. The assumption of independence must be verified but this procedure leads to an exact evaluation of the top event. Thus, the recursive calculation allows a comparison to the min-cut approach. Both methods are complementary.
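The recursive gate-by-gate calculation can be sketched as follows for the two examples of figures 21 and 22. The tree encoding (a float for a basic event, a pair of gate name and children for a gate) is a hypothetical choice made for this sketch, and the result is exact only if the inputs of each gate are independent, i.e. no basic event is repeated in several branches.

```python
from math import prod

def evaluate(node):
    """Probability of the event described by node.

    node is either a float (probability of a basic event) or a tuple
    (gate, children) with gate equal to "AND" or "OR".
    """
    if isinstance(node, float):
        return node
    gate, children = node
    probs = [evaluate(child) for child in children]
    if gate == "AND":
        return prod(probs)
    if gate == "OR":
        return 1.0 - prod(1.0 - p for p in probs)
    raise ValueError("unknown gate: " + str(gate))

# Figure 21: overhead projector, both lamps must be unavailable.
projector = ("AND", [0.05, 0.04])
# Figure 22: light bulb, either no mains or a dead bulb gives "no light".
light_bulb = ("OR", [1e-4, 1e-3])

if __name__ == "__main__":
    print("projector :", evaluate(projector))
    print("light bulb:", evaluate(light_bulb))
    # Sum over the two order 1 min cuts, an upper bound for comparison.
    print("light bulb, min-cut sum:", 1e-4 + 1e-3)
```

For the light bulb, the sum over the two order 1 min cuts is slightly larger than the recursive result, which illustrates why the accuracy of the min-cut approximation has to be checked.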

Application of fault tree using min-cuts to the availability of a low voltage network.
The fault tree corresponding to the network given in figure 23 is shown in figure 24. Power is considered to be either present or absent. The top event is assumed to be the absence of power at the output, noted E.
In building this tree certain assumptions are made:

- only two failure modes are considered for the circuit-breakers: sudden contact break and failure to open upon a short-circuit.
- each transformer line can, by itself, supply voltage to the main network, to which E belongs.
- the two mains supplies are coming from two different Medium Voltage sources. This reduces the Common Mode failure to the unavailability of the High Voltage supply.

Each event in the Fault Tree will have a certain probability of occurrence associated with it. In this case the probability will be the unavailability. The unavailability associated with the basic events is calculated by the formula:
U ≈ λ.MTTR.

λ is the failure rate corresponding to a particular failure mode of a component. It can be obtained from several sources of field data.

[labels of the fig. 23 network diagram: A, B, Busbar 1, C, D, Busbar 2, Busbar 3, E, F]

[content of fig. 21] AND-gate, top event "no light" with failure probability P; inputs: 1st light bulb dead (P1), 2nd light bulb dead or missing (P2); one order 2 min-cut:
P = P1 x P2 = 0.05 x 0.04 = 2.10^-3

[content of fig. 22] OR-gate, top event "no light" with failure probability P; inputs: no mains (P1), light bulb dead (P2); two order 1 min-cuts:
1 - P = (1 - P1)(1 - P2) = (1 - 10^-4)(1 - 10^-3) = 0.9989


fig. 24: fault tree corresponding to Fig. 23 network

[fault tree diagram. Top event: no power in output E. Events shown include: BB 3 failure, no power to BB 3, sudden opening of C.B. E, short circuit through F, wire failure, sudden opening of C.B. D, no power to BB 1, C.B. F stuck on short circuit, short circuit above F, BB 1 failure, short circuit through C, double line failure, no HV supply, C.B. C stuck on short circuit, line A, line B, cable, BB 2, transfo A, C.B. A, transfo B, C.B. B. Gates are labelled G11 to G62 and basic events 2*1* to 7*4*.]


fig. 26: elementary state graph

MTTR is the Mean Time to Repair and it depends on the component being considered as well as on the particular installation, technology, geographical location and service contract.

In some instances a specific value of a probability is unknown. A worst case situation, or upper bound, is therefore assumed. For example, we have taken the upper bound probability of a short-circuit above F to be 10^-2.

The results of the Fault Tree Analysis, shown in figure 25, indicate that the unavailability on output E is 10^-5, which corresponds to about 5 minutes per year. The min cut approach allows, in addition to the calculation of the probability of the top event, the assessment of the weight each min cut carries in producing the top event. Figure 25 also shows this weight, as the percentage of the total unavailability which can be attributed to each min cut. This contribution is one measure of the importance of the min cut.

A quick examination of the relative importance of the min cuts shows that the cable linking busbar 1 to busbar 3 (third min cut) is critical. To a lesser extent this is also true of the two busbars 1 and 3. If these components were improved, the mains supply would then become critical. If a further improvement of the overall availability became essential, it would be necessary to incorporate an auxiliary power supply, such as a diesel generator. A detailed study of the availability of an electrical supply is presented in Merlin Gerin's Technical paper "Sureté et distribution électrique" (in French).
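The orders of magnitude quoted above are easy to check. The sketch below converts an unavailability into minutes of outage per year and ranks the largest min-cut contributions read from figure 25; the failure rate and MTTR used to illustrate U ≈ λ.MTTR are made-up values, not data from the study, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 8760.0

def unavailability(failure_rate_per_hour, mttr_hours):
    """U ~ lambda * MTTR, a good approximation when lambda * MTTR << 1."""
    return failure_rate_per_hour * mttr_hours

def downtime_minutes_per_year(u):
    """Expected outage time per year for an unavailability u."""
    return u * HOURS_PER_YEAR * 60.0

# Percent contributions of the largest min cuts, as listed in figure 25
# (key: min cut number, value: percent of the total unavailability).
CONTRIBUTIONS = {1: 9.5, 2: 1.6, 3: 68.0, 4: 1.6, 5: 0.013, 6: 9.5, 7: 9.9}

def ranked_min_cuts(contributions):
    """Min cuts sorted from the most to the least important."""
    return sorted(contributions.items(), key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    # Hypothetical component: one failure per 10^6 hours, 10 hours to repair.
    print("U =", unavailability(1e-6, 10.0))
    print("minutes per year at U = 1e-5:", downtime_minutes_per_year(1e-5))
    for number, percent in ranked_min_cuts(CONTRIBUTIONS):
        print("min cut", number, ":", percent, "%")
```

The ranking puts min cut 3 first, in line with the remark that the cable between busbar 1 and busbar 3 dominates the unavailability.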

state graphs
State graphs, also called Markov graphs, allow a powerful modeling of systems under certain restrictive assumptions. The analysis proceeds from the actual construction of the graph to solving the corresponding equations and, finally, to the interpretation of results in terms of reliability and unavailability. Mathematically, a great simplification is obtained by considering only the calculation of time independent quantities.

Construction of the graph
The graph represents all the possible states of the system as well as the transitions between these states. These transitions correspond to the different events that concern the components of the system. In general, these events are either failures or repairs. As a consequence, the transition rates between states are essentially failure rates or repair rates, possibly weighted by probabilities such as that of a piece of equipment failing to start on demand.

The graph in figure 26 shows the behavior of a system with a single repairable component.

Assumptions
A model is said to be markovian if the following conditions are satisfied:
- the evolution of the system depends only on its present state and not on its past history,
- the transition rates are constant, i.e. only exponential distributions are considered,
- there is a finite number of states,
- at any given time there cannot be more than one transition.

[content of fig. 26: an "up state" and a "down state"; λ: failure rate (up to down), µ: repair rate (down to up)]

fig. 25: contributions of network components to its unavailability

unavailability: 1.01 E-05, i.e. 1.01 10^-5
list of min cuts and their importance (number : min cut as indicated on the fault tree : percent contribution)

 1 : 2*1* : 9.5
 2 : 2*3* : 1.6
 3 : 3*1* : 68
 4 : 3*2* : 1.6
 5 : 3*4*, 3*5* : 0.013
 6 : 4*1* : 9.5
 7 : 5*2* : 9.9
 8 : 5*4*, 6*3* : 9.1 E-6
 9 : 5*4*, 6*4* : 3.2 E-6
10 : 7*1*, 7*3* : 0.00058
11 : 7*1*, 7*4* : 1.3 E-5
12 : 7*2*, 7*3* : 1.3 E-5
13 : 7*2*, 7*4* : 2.7 E-7

Equations
Under the above hypotheses, the probability of the system being in state Ei at time t+dt can be written as:
Pi(t+dt) = P(the system is in state Ei and it stays there) + P(the system comes from another state Ej).

For a graph having n states, n differential equations are obtained which can be written as:

dΠ(t)/dt = Π(t).[A]

where: Π(t) = [P1(t), P2(t), …, Pn(t)]

[A] is called the transition matrix of the graph.

The solution of this equation in matrix form is performed by computer and gives the probabilities Pi(t), that is, the probability of the system being in state i as a function of all the transition rates and the initial state.

Computation of dependability quantities
The availability being the probability of the system being in a working state, it follows:

$$A(t) = \sum_i P_i(t)$$

where the sum runs over the working states and Pi(t) = probability of being in working state Ei.
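As an illustration of how such a system of equations is solved numerically, the sketch below integrates dΠ(t)/dt = Π(t).[A] for the elementary two-state graph of figure 26 with a classical fourth-order Runge-Kutta scheme. The rates λ and µ are arbitrary illustrative values, the integration scheme is a choice made for this sketch (not necessarily what the dedicated tools use), and the result is compared with the well-known closed form for a single repairable component starting in the up state.

```python
import math

def integrate_state_probabilities(a, p0, t_end, steps):
    """Integrate dP/dt = P.[A] from 0 to t_end with the RK4 scheme.

    a[i][j] is the transition rate from state i to state j (i != j) and
    a[i][i] is minus the sum of the rates leaving state i.
    """
    n = len(p0)

    def deriv(p):
        return [sum(p[i] * a[i][j] for i in range(n)) for j in range(n)]

    p = list(p0)
    h = t_end / steps
    for _ in range(steps):
        k1 = deriv(p)
        k2 = deriv([p[i] + 0.5 * h * k1[i] for i in range(n)])
        k3 = deriv([p[i] + 0.5 * h * k2[i] for i in range(n)])
        k4 = deriv([p[i] + h * k3[i] for i in range(n)])
        p = [p[i] + h / 6.0 * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i])
             for i in range(n)]
    return p

# Illustrative rates in (hours)^-1 (assumed values).
LAM, MU = 1e-3, 1e-1
# State 0 is the up state, state 1 is the down state.
A = [[-LAM, LAM],
     [MU, -MU]]
T_END = 50.0

probabilities = integrate_state_probabilities(A, [1.0, 0.0], T_END, 5000)
availability = probabilities[0]
closed_form = MU / (LAM + MU) + LAM / (LAM + MU) * math.exp(-(LAM + MU) * T_END)

if __name__ == "__main__":
    print("A(t) by integration:", availability)
    print("A(t) closed form   :", closed_form)
```

The same routine accepts any transition matrix, so larger graphs only require filling in [A] and choosing which states count as working.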


The reliability is the probability of being in a working state without ever having passed through a down state. A graph is constructed by deleting all transitions going from a failed state to a working state. Once the new probabilities Pi'(t) are obtained, we have:

$$R(t) = \sum_i P_i'(t)$$

where the sum runs over the working states.

The characteristic mean times MTTF, MTTR, MUT, MDT, MTBF are calculated using matrix calculus and some of the equations already discussed. For the MTTF, the initial state of the system must be specified in terms of the probabilities of the system being initially in each one of its different states.

There are two other quantities which are very simple to obtain:

- the mean time of state occupancy:
  Ti = 1 / Σ (rates of departure from state i)

- the occupancy frequency corresponding to state i:
  fi = Pi / Ti

Application: Uninterruptible Power Supplies (UPS) in parallel
A UPS is a device which improves the quality of the electrical supply. It is often used for critical applications such as computers and their peripherals. We will consider a typical configuration (Triple Modular Redundancy), i.e. the UPS's constitute a 2/3 redundant system. The unavailability is not the only quantity of interest: the MTTF gives the mean time before the first black-out.
In the construction of the state graph it is here possible to use the fact that the three UPS's are identical and therefore states can be grouped according to the number of failed UPS's. The failure and repair rates for the UPS's, λ and µ respectively, are given in figure 27.

The number associated with each state corresponds to the number of failed UPS's. Each working UPS in state Ei adds its own exit rate λ towards state Ei+1. These exit rates are 3λ, 2λ and λ respectively.

The up states are 0 and 1. We assume that the repair strategy is such that three repairmen can work simultaneously, one on each UPS. Thus, the transition rates corresponding to the repair activity are proportional to the number of failed UPS's in the state being considered. The numerical values are as follows:

λ = 2.10^-5 h^-1 ; µ = 10^-1 h^-1

fig. 27: UPS's in parallel
[state graph: state 0, state 1, state 2, state 3; failure transitions 0 to 1, 1 to 2, 2 to 3 with rates 3λ, 2λ, λ; repair transitions 1 to 0, 2 to 1, 3 to 2 with rates µ, 2µ, 3µ]

Figure 28 gives the computed results corresponding to the time independent quantities. It can be seen that the MTTF is here 4.17 10^7 hours whereas the non redundant case (3/3) has an MTTF equal to 1/(3λ) = 1.67 10^4 hours.

For the asymptotic unavailability the change is from 1.2 10^-7 for the redundant system to 6 10^-4 for the non redundant (3/3) system. The comparison of these figures is easily visualized through the graph itself: in the redundant case, the unavailability is calculated by summing the probabilities of the two failed states, i.e., U = P2+P3 while, in the non redundant case, the sum is performed over three failed states: U = P1+P2+P3.

fig. 28: values corresponding to the graph on figure 27

Time independent quantities:

Unavailability : 1.199360E-07    Availability : 9.999999E-01
MTTF : 4.169167E+07    MTTR : 8.333667E+00
MUT : 4.169167E+07    MDT : 5.000333E+00
MTBF : 4.169167E+07
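The time independent quantities of the UPS example can be cross-checked without any matrix software, because the graph of figure 27 is a simple chain. The sketch below is an independent check under stated assumptions, not the method used to produce figure 28: steady-state probabilities come from balancing the flows between neighbouring states, the MUT is taken as the availability divided by the frequency of entering the down states, and the MTTF is computed for a system starting in state 0. All function and variable names are hypothetical.

```python
LAM, MU = 2e-5, 1e-1  # failure and repair rates of one UPS, in h^-1

def steady_state(n, lam, mu):
    """Steady-state probabilities of states 0..n (number of failed units).

    Flow balance between neighbours: P[k] * (n-k)*lam = P[k+1] * (k+1)*mu.
    """
    weights = [1.0]
    for k in range(n):
        fail_rate = (n - k) * lam    # transition k -> k+1
        repair_rate = (k + 1) * mu   # transition k+1 -> k
        weights.append(weights[-1] * fail_rate / repair_rate)
    total = sum(weights)
    return [w / total for w in weights]

p = steady_state(3, LAM, MU)
u_redundant = p[2] + p[3]             # 2/3 system: down states 2 and 3
u_non_redundant = p[1] + p[2] + p[3]  # 3/3 system: down states 1, 2 and 3

# MUT of the 2/3 system: up-state probability divided by the frequency of
# the only transition into the down states (state 1 -> state 2, rate 2*lam).
mut_2_of_3 = (p[0] + p[1]) / (p[1] * 2 * LAM)

# MTTF of the 2/3 system starting from state 0, from the two equations
#   T0 = 1/(3*lam) + T1
#   T1 = 1/(2*lam + mu) + mu/(2*lam + mu) * T0
mttf_2_of_3 = (5 * LAM + MU) / (6 * LAM ** 2)
mttf_3_of_3 = 1 / (3 * LAM)

if __name__ == "__main__":
    print("unavailability 2/3:", u_redundant)
    print("unavailability 3/3:", u_non_redundant)
    print("MUT 2/3 :", mut_2_of_3)
    print("MTTF 2/3:", mttf_2_of_3)
    print("MTTF 3/3:", mttf_3_of_3)
```

With the rates of the example this reproduces the unavailability and MUT of figure 28, and an MTTF of about 4.17 10^7 hours for the redundant system against about 1.67 10^4 hours for the non redundant one.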


6. conclusion

Dependability is a concept becoming ever more critical for comfort, efficiency and safety. It can be controlled and calculated. It can be designed in, be it for devices, architectures or systems. Dependability characteristics are now frequently included in specifications and contracts. The existence of computational methods and tools allows the systematic study of dependability during the design phase and for quality assurance purposes.
An intuitive insight, combined with exact or approximate calculations, allows the comparison of different configurations and thus provides an evaluation of the risk associated with a better performance, i.e. performance adapted to clearly specified needs.


7. references and standards

Military Handbook 217E, DoD (U.S.A.), October 1986.

Recueil de données de fiabilité, CNET (Centre National d'Etudes des Télécommunications, France), 1983.

IEEE Std. 493 and IEEE Std. 500 (Institute of Electrical and Electronic Engineers), 1980 and 1984.

NPRD document 3, Nonelectronics Parts Reliability Data, Reliability Analysis Center (RADC), 1985.

A. Pagès, M. Gondran: "Fiabilité des systèmes", Eyrolles, France, 1983.

A. Villemeur: "Sureté de fonctionnement des systèmes industriels", Eyrolles, France, 1988.

International Electrotechnical Vocabulary, VEI 191, International Electrotechnical Commission, June 1988.

Proceedings of the 15th InterRam conference, Portland, Oregon, June 1988.

C. Marcovici, J. C. Ligeron: "Techniques de fiabilité en mécanique", Pic, France, 1974.

EPRI document 3593, Electrical Power Research Institute, Hannaman, Spurgin, 1984.

NUREG document 2254, US Nuclear Regulatory Commission, Bell, Swain, 1983.

Merlin Gerin Technical Report 117: "Méthode de développement d'un logiciel de sureté", A. Jourdil, R. Galera, 1982.

Merlin Gerin Technical Report 134: "Approche industrielle de la sureté de fonctionnement", H. Krotoff, 1985.

Merlin Gerin Technical Report 148: "Sureté et distribution électrique", G. Gatine, 1990.

IEC Standard 271: List of basic terms, definitions and related mathematics for reliability.

IEC Standard 300: Reliability and maintainability management.

IEC Standard 362: Guide for the collection of reliability, availability and maintainability data from field performance of electronic items.

IEC Standard 409: Guide for the inclusion of reliability clauses into specifications for components (or parts) for electronic equipment.

IEC Standard 605: Equipment Reliability Testing.

IEC Standard 706: Guide on maintainability of equipment.

IEC Standard 812: Analysis techniques for system reliability - Procedure for failure mode and effects analysis (FMEA).

IEC Standard 863: Presentation of reliability, maintainability and availability predictions.

IEC Standard 1014: Programmes for reliability growth.

Merlin Gerin's dependability experts have published extensively in this field and have presented papers in most international reliability conferences.

Merlin Gerin is also an active participant in several national and international committees dealing with dependability:
- presidency of the French National Committee for IEC TC 56 activities (dependability) and expert with IEC Working Group 4, TC 56 (statistical methods),
- software dependability with the European Group of EWICS-TC7: computer and critical applications,
- French AFCET Working Group on computer systems dependability,
- updating contributions to the French CNET Electronic components reliability handbook,
- Working Group IFIP 10.4 on Dependable Computing.
