

Introduction to Reliability Theory and Practice

John B. Bowles

John B. Bowles

Computer Science and Engineering

University of South Carolina

Columbia, SC 29208 USA

e-mail: [email protected]


Summary & Purpose

This tutorial serves as an introduction to the main concepts and techniques in reliability and provides an

overview of the field. It tells what reliability is, describes how reliability is measured, modeled, analyzed,

and managed, and illustrates system models and design techniques aimed at improving system reliability.

The practicalities and misconceptions of some common approaches to achieving reliability, such as the use

of redundancy, are also discussed.

John Bowles is an Associate Professor in the Computer Science and Engineering Department at the

University of South Carolina where he teaches and does research in reliable system design. Previously he

was employed by NCR Corporation and Bell Laboratories where he worked on several large system design

projects. He has a BS in Engineering Science from the University of Virginia, an MS in Applied

Mathematics from the University of Michigan, and a Ph.D. in Computer Science from Rutgers University.

In 1987 he was a visiting scholar at the US Air Force Institute of Technology, Wright-Patterson AFB. Dr.

Bowles is a Senior Member of IEEE, a Member of SAE and ASQ, and an ASQC Certified Reliability

Engineer.

Contents

1. Introduction

1.1. What is Reliability?

1.2. Questions Addressed by Reliability

1.3. Why is Reliability Important?

1.4. Scope of Reliability

2. Measures and Models of Reliability

2.1. Exponential Failure Distribution

2.2. Weibull Failure Distribution

3. Component-Level Reliability

3.1. Improving Early-Life Reliability

3.2. Improving Component Reliability

3.3. Software Reliability

4. System-Level Reliability

4.1. System Structure

4.2. Effects of the Failure Distribution on System Reliability

4.3. Reliability Specification and Allocation

5. System-Level Analysis Techniques

5.1. Failure Modes and Effects Analysis

5.2. Fault Trees

5.3. Markov Models

6. Perspectives and Future Challenges


1. INTRODUCTION

1.1. What is Reliability?

In everyday parlance something is said to be reliable if it

works when it is needed. Thus, a car is considered reliable if

we are confident that it will start and takes us where we want

to go. Conversely, it is unreliable if it frequently will not start

or stops unexpectedly before we get to our destination.

A narrow, working definition of reliability is that "reliability is the probability that a device will operate successfully for a specified period of time, under specified conditions, when used in the manner and for the purpose for which it was intended." This definition allows us to define

the characteristics of reliability in quantitative and measurable

terms. It has several important words. First, reliability is a

probability. This recognizes that we cannot say absolutely

that something will work. Rather, we specify a probability

that something will work. If we have many similar items this

is easily interpreted as the relative number of items that work.

Interpretation becomes more problematic if we have only a

single item or one that is used only once. Second, we must

specify what constitutes successful operation. For devices

that are either working or broken this is easy; for items that

degrade with use, some minimum level of performance must

be required. Third, the period of time must be specified. For military items this is often the time needed to accomplish some "mission"; for commercial items, it may be the "warranty period"; for other items, it may be the "working life" or some other arbitrary interval. For non-repairable items the question of interest is "will the item last for the period of interest?"; for repairable items it is "for what part of the interval will the item be working?" Finally, the item must be used for the purpose for which it was designed; if it is used otherwise it can hardly be expected to operate successfully.

A broad definition of reliability, which defines the field, is that "Reliability is the science of predicting, analyzing, preventing, and mitigating failures over time" [1]. This defines reliability as an area of expertise with well-defined sub-disciplines, all of which are associated with failures in some way. As a science, reliability is based on well-defined principles and poses testable hypotheses whose veracity can

then be determined. It also associates reliability with many

other fields of endeavor. Prediction implies the use of

models—primarily mathematical—to foretell when or how

failures will occur. Analysis seeks to explain not only why a

failure occurred but also to quantify ―how many‖ and ―how

often‖ failures occur. Thus, reliability is closely associated

with the physical sciences of physics, chemistry, mechanics,

and electronics; statistics and data collection; and since

people are a part of most systems, human factors and even

psychiatry as well. Prevention tries to stop a failure from

occurring or stop a minor failure from escalating into a major

catastrophe; thus it associates reliability with the design arts.

Finally, mitigation seeks to find ways to offset the effects of a

failure when one occurs.

1.2. Questions Addressed by Reliability

Reliability seeks to answer many questions about an item,

all of which ultimately affect how the product is designed.

The traditional and fundamental question is "How long will the item last?" This question is important because equipment

that lasts a long time is usually preferred over equipment that

only lasts a short while. For example, most people would

prefer a television set that lasts 10 years to one that lasts only

2 years. But the answers to many other questions depend on,

or influence, the answer to this question:

What is the system availability? The longer a system

lasts between failures and the faster its repair time, the

higher its availability will be.

How can failures be prevented? Items can be made to

last longer by eliminating potential failures. Failures can

be prevented by making changes in the design of a

device, changes in the materials of which it is made,

changes in the way the product is maintained, and

changes in the way the product is used. The more

failures that can be prevented, the longer the item will

last.

What is the life-cycle cost of a piece of equipment? Life-

cycle costs include the initial cost of a piece of

equipment as well as the costs of repairing it when it

fails. They also include the costs of keeping an inventory

of spare parts, transporting them to where they are

needed, the lost opportunity costs when the item is out of

service, and the cost of disposing of the equipment at the

end of its life. Minimizing these costs determines an

optimum lifetime for which an item should be designed

to last and reliability is an integral part of the calculation.

What are the biggest risks? Risks are most often

associated with what happens when something fails.

Thus, the biggest risks are those which have the worst

consequences when a failure occurs and which are most

likely to occur.

1.3. Why is Reliability Important?

Reliability has a significant impact on almost every aspect

of ownership of a product:

Cost of ownership. Reliability affects both the initial cost

of an item and the cost of repairing the item. Sometimes

changes to increase reliability increase the initial cost or

the maintenance costs of equipment. The use of more

expensive materials such as gold and using soldered

rather than socketed electronic components are examples

of this. But surprisingly often there is a synergistic effect

and the same changes that produce greater reliability also

result in lower costs. The progression of electronic

components from vacuum tubes to discrete transistors to

integrated circuits to very large integrated circuits is an

example of this synergy.

Ability to meet service objectives. An item that has failed

is not serving the purpose for which it was purchased.

Thus, the item must be working if service objectives that

depend on the item are to be met. The higher the

reliability of a piece of equipment, the fewer failures it

will have; fewer failures mean that fewer resources will

have to be diverted to respond to the failures; hence,

more resources can be focused on the primary task.

Customer satisfaction. Customers purchase things

because they expect them to be used. If an item fails, it


cannot be used for the purpose intended by the person

who bought it. This leads to dissatisfaction with the

product. If it happens too often the customer may

abandon not only that product but also others from the

same manufacturer. Thus higher reliability can bring

about greater customer satisfaction with a product and

lower reliability brings about customer dissatisfaction.

Ability to market products or services. Since greater

reliability has both a cost benefit and brings about greater

customer satisfaction, it can be an important marketing

tool. Greater reliability enabled Japanese automobile

manufacturers to take a larger market share from their

American counterparts in the 1980s.

Safety. Reliability is closely associated with some

aspects of safety. When an item fails, it is no longer

doing the function it was intended to do; that can have

very serious consequences if the item was performing a

safety critical function. For example, if the motor in an

airplane or an automobile fails, the vehicle will

immediately lose power and speed. How the item fails

can also be important, for example certain failure modes

might cause an item to explode or to damage other

components in the system.

1.4. Scope of Reliability

Reliability affects every device from the smallest

transistor on an integrated circuit to communication networks

that encompass the entire world. Since designers want the

things they build to "work," they would like for the things

they design to be reliable. They would like for a single

failure of a single device to not prevent an entire large system

from working. Thus, reliability affects the choice of materials

used in a device, the design of the operational units in a

device, the way individual devices are built up into larger

pieces of equipment, and the way equipment is combined into

systems.

Reliability is also closely related to, and associated with,

several other terms with which it is sometimes confused.

Quality is most often used with regard to manufacturing

processes. It refers to how well the manufactured item

conforms to its specifications. It has been said that

reliability must be designed into a product. Poor quality,

meaning that the product does not conform to

specifications, can lead to poor reliability; but quality, no

matter how good, cannot make up for a lack of reliability

in the product design.

Availability, like reliability, is a probability; it is the

probability that a device is operational at some instant in

time. Availability is often interpreted as the fraction of

some time period that an item is operational. Availability

is sometimes used as a measure of reliability for

repairable devices.

Maintainability is also a probability. It is the probability

that a device that is not working can be restored to

working condition in a specified amount of time.

Reliability and Maintainability determine availability.

2. MEASURES AND MODELS OF RELIABILITY

The fundamental unit of measure in reliability is the time

that an item is operational until it fails. Since this can never

be known with certainty in advance it is expressed as a

probability, usually designated by its distribution function as

F(t). F(t) tells the proportion of items, out of those that were

initially working at time 0 that have failed by time t. If N_F(t) is the number of items that have failed by time t, and N is the initial number of items, then a reasonable estimate of F(t), denoted F̂(t), is

    F̂(t) = N_F(t) / N.

Similarly, the reliability, designated by R(t), is the

proportion of items that were initially working at time 0 that

are still working at time t.

R(t) = 1 – F(t).

If N_W(t) is the number of items that are still working at time t (N_F(t) + N_W(t) = N), then R(t) can be estimated by:

    R̂(t) = N_W(t) / N.
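As a minimal sketch of these estimates (the failure times below are hypothetical, chosen only for illustration), F̂(t) and R̂(t) can be computed directly from a list of observed failure times:

```python
# Hypothetical failure times (hours) for N = 10 items, all working at time 0.
failure_times = [120, 340, 560, 610, 790, 905, 1100, 1450, 1700, 2200]
N = len(failure_times)

def F_hat(t):
    """Estimated failure distribution: fraction failed by time t, N_F(t)/N."""
    return sum(1 for x in failure_times if x <= t) / N

def R_hat(t):
    """Estimated reliability: fraction still working at time t, N_W(t)/N."""
    return sum(1 for x in failure_times if x > t) / N

# At t = 1000 hours, 6 of the 10 items have failed:
# F_hat(1000) = 0.6, R_hat(1000) = 0.4, and F_hat + R_hat = 1 at any t.
```
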

The failure density function, defined as the derivative of

the failure distribution, is often more useful than the failure

distribution for purposes of mathematical calculations:

f(t) = dF(t)/dt.

Conversely,

    F(t) = ∫_0^t f(x) dx.

Since F(t) is a probability and F(t) → 1 as t → ∞, the area under a curve of the density function must be 1. Thus, f(t)Δt defines a rectangle of height f(t) and width Δt, which is approximately the fraction of the area under the f(t) curve between t and t+Δt. As Δt → 0 this approximation becomes exact, and f(t)dt is the probability that a device fails "at" time t (more precisely, it is the probability that the device fails between time t and t+dt).

The hazard function, designated as h(t), is often called the

instantaneous failure rate for a device. It is defined as

h(t) = f(t)/R(t).

h(t)Δt is the probability that a device fails in the next increment of time, Δt, given that it was working at time t. In general:

    R(t) = exp(−∫_0^t h(x) dx).

The system availability, A(t), is the probability that the

system is working at time t. For non-repairable systems, A(t)

= R(t).

Since the distribution function is usually too complex to be easily understood or compared, several "figures of merit" are often used in assessing reliability. These are:

Mean Time To Failure (MTTF)

Mean Time Between Failures (MTBF)

Mean Time To First Failure (MTFF)

Mean Time To Repair (MTTR) (for repairable

systems)

Average Availability (A)


Reliability at time t_i: R(t_i)

Availability at time t_i: A(t_i)

The failure distribution function, F(t), describes how items in a population fail over time. It is usually represented as a "standard" mathematical function and, for any given set of

failure data, statistical analyses are done to determine the

function parameters that best describe the data. The

exponential and Weibull functions (discussed here) are

among the most commonly used distribution functions but

many others are used for particular types of components.

2.1. Exponential Failure Distribution Function

The exponential failure distribution (also known as the constant failure rate distribution) is the simplest failure distribution function. It has only a single parameter, λ, which is called the failure rate. λ is constant with respect to time but it may be a function of other factors such as the type of device, its load, its complexity, its manufactured quality, how familiar the system designers are with using it, the operating environment, the temperature, etc. Letting t be the time period of interest, the exponential distribution is described by:

    Distribution function: F(t) = 1 − exp(−λt)

    Density function: f(t) = λ exp(−λt)

    Reliability function: R(t) = exp(−λt)    (1)

    Hazard function: h(t) = λ

    MTTF = 1/λ
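These relations can be checked numerically; in the sketch below the failure rate λ = 0.001/hr is an arbitrary illustrative value, and the MTTF is recovered by integrating R(t):

```python
import math

lam = 0.001  # hypothetical failure rate, per hour

F = lambda t: 1 - math.exp(-lam * t)      # distribution function
f = lambda t: lam * math.exp(-lam * t)    # density function
R = lambda t: math.exp(-lam * t)          # reliability function
h = lambda t: f(t) / R(t)                 # hazard function; constant, equals lam

# MTTF is the integral of R(t) from 0 to infinity; a midpoint-rule sum
# over 0..100,000 hours recovers ~1000 hours = 1/lam.
dt = 10.0
mttf = sum(R((i + 0.5) * dt) * dt for i in range(10_000))
```
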

The exponential distribution has many advantages for

reliability analysis:

1) The assumption of a constant failure rate leads to highly tractable and easily constructed system models. For non-redundant systems the system failure rate is simply the sum of the individual component failure rates,

    λ_sys = Σ_{i=1}^{n} λ_i.

For non-repairable systems, with n-level redundancy, the system MTTF can be calculated via simple formulas such as:

    MTTF_sys = (1/λ) Σ_{i=1}^{n} (1/i).

2) Markov models are often the only feasible way of

modeling the states (both operational and failed) and the

transitions between states of a complex system. Such

models implicitly assume constant transition rates

between states, although in some cases non-constant rates

can be modeled by combinations of states.

3) Some databases store item failure rates making them

readily available and the corresponding item failure

probabilities are easily constructed for any time period by

applying the appropriate time interval to eqn. (1). The

most widely used reliability prediction procedures, including the Bellcore Reliability Prediction Procedure (Bellcore's name has since been changed to Telcordia Technologies) and Mil-Std-217*, generally assume a failure rate that is constant in time but a function of other properties of the component being modeled and the environment in which it is placed [2].

4) Textbooks on reliability often use the exponential failure

distribution to give easily solved models that illustrate

the concepts being explained. Thus, if they use the

exponential distribution, novice practitioners have readily

available models to follow; even experienced

practitioners often assume a constant failure rate in their analyses, knowing that it does not apply, but assuming (or hoping) that the results will be "robust" or "close enough".

5) Failures are relatively rare events in the life of a system.

Thus, collecting failure data is a time consuming and

expensive task and often only small sample sizes are

available. Furthermore, since recording the time of

failure (if it is known) is usually of secondary importance

when a system fails, failure data is often incomplete and

frequently uncertain as to the exact time of failure, the

cause, or even what item failed. Under these

circumstances the statistical methods available for

computing failure distributions often do not support

finding more than a one-parameter distribution.

6) From a theoretical perspective Drenick’s theorem states

that under suitable conditions the reliability of any

system that is maintained by replacing failed components

approaches the exponential as a limiting distribution [22].

This is often given as ―theoretical justification‖ for

assuming an exponential failure distribution; however,

simulation studies have shown that it takes many

generations for a system to reach equilibrium, so that as a

practical matter, most systems do not last long enough to

reach this steady state [3, p. 88]. Bowles and Dobbins

have also shown that in certain cases the failures in a

repairable, redundant system can be closely

approximated by an exponential distribution [4].
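The formulas in advantage 1) can be sketched as follows; the component failure rates are hypothetical values chosen only to illustrate the arithmetic:

```python
# Series (non-redundant) system: the system failure rate is the sum of the
# component failure rates, lam_sys = sum(lam_i).
rates = [2e-6, 5e-6, 1e-6]   # hypothetical component failure rates, per hour
lam_sys = sum(rates)          # 8e-6 per hour
mttf_series = 1 / lam_sys     # 125,000 hours

# n identical non-repairable units in parallel, each with rate lam:
# MTTF_sys = (1/lam) * (1 + 1/2 + ... + 1/n)
def mttf_parallel(lam, n):
    return (1 / lam) * sum(1 / i for i in range(1, n + 1))
```

Note the diminishing return this formula implies: with lam = 1e-4/hr, a single unit has an MTTF of 10,000 hours, while a duplex pair has 15,000 hours; duplication raises the MTTF by only 50%, not 100%.
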

On the other hand, some well-known characteristics of the

exponential distribution make little sense from a reliability

point of view:

1) It implies that failures are due to chance events,

uniformly distributed over any interval of time. Thus, it

ignores the existence of many well-known failure

mechanisms whose effects are cumulative. For example,

many components experience failures from mechanisms

such as: wear (e.g., bearings and other mechanical

components), fatigue (e.g., structural components),

chemical reactions (e.g., corrosion), burning (e.g.,

filaments in electronic tubes), and submicroscopic effects

(e.g., ion migration in integrated circuits).

2) The assumption that the failure rate does not change over time removes age from the reliability model; thus the probability that an item fails is the same regardless of whether the item is brand new or has been operating for many years. This so-called "memorylessness" characteristic of the exponential distribution means that the model cannot capture degradation over time, such as occurs in batteries, and a used system is "as good as new" from a reliability point of view.

* Mil-Hdbk-217 was last updated as revision F in January 1990 and is no longer supported by the US Department of Defense. However, we will use some of the device models here for illustrative purposes.

In summary, the exponential model captures the important

concepts in the definition of reliability, is easy to understand,

and leads to highly tractable system models. Unfortunately, it is

also easily misapplied and can provide very misleading

results. We will return to this point in section 4.1. The

characteristics of the exponential distribution and its

suitability for modeling device failures in reliability work

have been a source of endless debate within the reliability

community.

2.2. Weibull Failure Distribution Function

The Weibull distribution function is a two-parameter distribution, as shown in Figure 1. λ is often called the scale parameter and β is the shape parameter. Since it is parameterized, the Weibull is able to capture the failure characteristics of many different types of systems. It is described by:

    Distribution function: F(t) = 1 − exp(−(λt)^β)

    Density function: f(t) = λβ(λt)^(β−1) exp(−(λt)^β)

    Reliability function: R(t) = exp(−(λt)^β)

    Hazard function: h(t) = λβ(λt)^(β−1)

    MTTF = (1/λ) Γ(1 + 1/β)

[Figure 1. Weibull failure density function f(t) vs. time t, for β = 0.6, 1 (exponential), 2, and 5. MTTF = 1000 for all four density functions.]

Figure 1 shows the density function for the Weibull distribution for several values of β. Observe that as β increases the density function becomes more "bell shaped" and clustered about its mean. A Weibull distribution with β = 3.5 approximates the normal distribution.
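The densities in Figure 1 can be reproduced numerically. In this sketch the scale parameter is derived from the MTTF relation above, λ = Γ(1 + 1/β)/MTTF, and a midpoint-rule integration confirms that each density has area 1 and mean 1000:

```python
import math

MTTF = 1000.0  # matches Figure 1

def weibull_density(t, beta):
    """Weibull density with the scale lam chosen so that MTTF = 1000,
    using MTTF = (1/lam) * Gamma(1 + 1/beta)."""
    lam = math.gamma(1 + 1 / beta) / MTTF
    return lam * beta * (lam * t) ** (beta - 1) * math.exp(-((lam * t) ** beta))

def area_and_mean(beta, dt=1.0, t_max=20_000.0):
    """Midpoint-rule integrals of f(t) and t*f(t) over [0, t_max]."""
    area = mean = 0.0
    for i in range(int(t_max / dt)):
        t = (i + 0.5) * dt
        p = weibull_density(t, beta) * dt
        area += p
        mean += t * p
    return area, mean   # approximately (1.0, 1000.0) for each beta
```
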

The Weibull hazard function, h(t), is illustrated in Figure 2 for different values of β. Observe that it is decreasing for β < 1: as items age, they are less likely to fail. This matches our intuition for many items, such as software and metals that "work harden"; when such an item is first put into service, we expect it to fail, but after it has been operating a while, our confidence that it will not fail in the next increment of time increases. For β = 1 the Weibull distribution reduces to the exponential and the failure rate is constant. For β > 1, the Weibull distribution has an increasing failure rate: as items age, they are more likely to fail. This represents the reliability characteristics of such items as mechanical devices that wear out as they age.

[Figure 2. Weibull hazard function h(t) (failures/hr) vs. time, for β > 1, β = 1, and β < 1.]
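The three regimes in Figure 2 follow directly from h(t) = λβ(λt)^(β−1) and can be verified at two sample times (the λ value here is an arbitrary illustrative choice):

```python
def weibull_hazard(t, lam, beta):
    """Weibull hazard function h(t) = lam * beta * (lam*t)**(beta - 1)."""
    return lam * beta * (lam * t) ** (beta - 1)

lam = 0.001            # hypothetical scale parameter, per hour
early, late = 100.0, 1000.0

decreasing = weibull_hazard(early, lam, 0.6) > weibull_hazard(late, lam, 0.6)
constant   = weibull_hazard(early, lam, 1.0) == weibull_hazard(late, lam, 1.0)
increasing = weibull_hazard(early, lam, 5.0) < weibull_hazard(late, lam, 5.0)
# decreasing, constant, and increasing are all True, matching Figure 2.
```
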

The ready availability of software to calculate the parameters of the Weibull distribution that best match a set of failure data has led to a subdiscipline of reliability known as "Weibull analysis" [23]. A careful Weibull analysis can give many hints as to the causes of failure for a device. For example, if failures are occurring during the decreasing failure period (i.e., β < 1), they are most likely due to components that were either defective or inadequately burned-in; if they occur during the wearout period (i.e., β > 1), they are most likely due to items wearing out; if the data can be partitioned into sets having different failure characteristics, the components may have come from different manufacturing lots or from different manufacturers; etc.

The reliability characteristics of many systems follow what is often described as a "bathtub shaped curve" like that shown in Figure 3. This is a composite of decreasing, constant, and increasing failure rates at different times.

[Figure 3. "Bathtub" curve for the reliability of most systems: h(t) (failures/hr) vs. time, with infant mortality, steady state, and wearout periods.]

When the device is first put into service (i.e., the "infant mortality" period) the failure rate decreases with time as defective components fail and are removed from the population. During the "steady state" period system failures occur due to accidents and random events rather than inherent system weaknesses and the failure rate is approximately constant. During the "wearout" period the failure rate increases as accumulated wear and stresses lead to a loss of resiliency.

It is important to realize that the bathtub shaped curve,

and in fact, any failure distribution, describes a population of

devices rather than a single item. For example, if we consider

the failure distribution for a population of automobile tires,

during the burn-in period tires with defects such as being out-of-round, with bubbles in the rubber, or which have poor

bonding between the cords and the rubber will fail quickly;

once these (defective) tires are removed from the population,

those that remain fail largely due to accidental road hazards or

punctures; as time goes by the tires accumulate wear, the

rubber gets thinner, and the flexing during use slowly

weakens the bonds with the cords leading to more failures.

The "bathtub shaped" curve for human life (Figure 4), consisting of a decreasing "infant mortality" period, a constant "adult life" period, and an increasing "old age"

period, also provides some perspective to the model. During

adult life the MTTF is approximately 800 years. This does

not imply that people live to be 800 years old. It means that

the death rate is 0.00125 deaths/yr. or about 5 deaths/yr. in a

population of 4000 adults. The same interpretation should be

kept in mind when considering electronic components whose

MTTFs are typically several million hours.

[Figure 4. Bathtub shaped curve of human reliability: death rate h(t) (deaths/yr) vs. age (years), with an infant mortality period, an adult life period at about 0.00125 deaths/yr (MTTF = 800 yr), and an old age period.]

3. COMPONENT-LEVEL RELIABILITY

Traditionally, reliability engineering has focused on the

failure characteristics of the individual components in a

system. The main idea is that the overall device reliability is

determined by the reliability of its individual components is

improved. Efforts to improve the reliability of individual

components have concentrated on reducing the initial failure

rate of a device and shortening the infant-mortality or burn-in

period, and reducing the failure rate during the constant

failure period and extending the constant failure period as

long as possible.

3.1. Improving Early-Life Reliability

The objective during the early life of a product is to make

the initial failure rate as small as possible and to shorten the

burn-in period as much as possible. Techniques for doing this

focus on ensuring that high quality parts are used, that the

parts are properly burned-in, and that component tolerances

are handled properly.

3.1.1. Part Quality

Some simple calculations help to show that a high level of

part quality is required. Consider a system with 1000 parts

and suppose the part defect rate is 0.005. Then the

probability that a system will contain all good parts is (0.995)^1000 = 0.0066; the probability that the system will need to be repaired to fix a defective component is 0.9934.

Furthermore, the expected number of defective components

per system is 1000×0.005 = 5 and hence, on average, 5

repairs will need to be made due to defective components.
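These part-count figures are easy to reproduce:

```python
n_parts = 1000
p_defect = 0.005                          # per-part defect rate

p_all_good = (1 - p_defect) ** n_parts    # (0.995)**1000, about 0.0066
p_repair_needed = 1 - p_all_good          # about 0.9934
expected_defects = n_parts * p_defect     # 5 defective parts per system
```
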

Incoming part inspections are one way to reduce the

number of defective parts received. If we receive a shipment

of 1000 parts with a defect rate of 5/1000 parts we would

expect to have 5 bad components in the shipment. Suppose

we test 100 components, a 10% sample, from this shipment.

Then the probability that we would have 0 bad parts in the sample is found from the hypergeometric distribution:

    [C(5,0) × C(995,100)] / C(1000,100) = 0.59
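The same hypergeometric calculation, sketched with Python's `math.comb`:

```python
from math import comb

def p_no_defects_in_sample(N=1000, defects=5, sample=100):
    """Probability that the sample contains none of the defective parts:
    C(defects, 0) * C(N - defects, sample) / C(N, sample)."""
    return comb(defects, 0) * comb(N - defects, sample) / comb(N, sample)

# Even a 10% sample has about a 59% chance of missing all five bad parts.
```
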

Clearly, we must cut down on the number of defective

parts to reduce the probability that the product will be

defective, but sampling incoming parts to determine

acceptance or rejection of a shipment is not an effective way

of doing this. For even moderate defect rates the probability

of detecting a bad part in a small sample is not high and

increasing the sample size increases the cost of testing

components, especially if the tests are destructive.

Furthermore, as the quality of incoming parts improves,

sampling becomes increasingly less effective as a way of

finding unacceptable shipments.

Measuring the variance in a shipment of incoming parts is

a more effective way than sampling of determining how many

bad parts are likely to be in the shipment. The parts produced

in any manufacturing operation will vary slightly due to many

factors—for example, wear on the machinery used in the

manufacturing operation, variations in the materials used,

differences in the machine operators, to name just a few. The

combined result of all these variations is that the part

parameter can be closely modeled as having a normal

(Gaussian) distribution. This distribution is characterized by

its mean and standard deviation (σ), as shown in Figure 5. 68.3% of parts will lie within ±1σ of the mean, 95.4% will lie within ±2σ of the mean, and 99.7% will lie within ±3σ of the mean. 3σ is usually taken as a measure of the variation in the manufacturing process. By measuring the variance of the incoming parts and knowing the allowable tolerance limits, one can estimate the number of bad (out of tolerance) parts in a shipment. This procedure has been formalized in the use of the process capability index.

[Figure 5. Normal distribution of a part parameter, showing the standard deviation (σ), the ±3σ limits, and the part tolerance limits.]


The product process capability indices, Cp and Cpk, compare the part tolerance specification to the 3σ manufacturing variation [5]. The Cpk index is defined as the ratio of the difference between the process mean parameter value (x̄) and the nearest tolerance limit to the 3σ process variation:

    Cpk = min(|TU − x̄|, |TL − x̄|) / (3σ)

where TU is the upper tolerance limit and TL is the lower tolerance limit. Ideally the process mean parameter value will equal the nominal part specification, but often this is not the case. A Cpk of 1 implies that the specification tolerance limits are ±3σ of the production process (assuming the mean and nominal parameter values are equal and the specification limits are symmetrical).

Table 1 gives the probability that the production process

produces an item that is out of specification for various values

of Cpk. Since Cpk considers the nearest tolerance limit to the

mean, the Probability Out-of-Tolerance in Table 1 is a worst-

case estimate.

Table 1. Cpk index probability values

    Cpk    Tolerance Limit    Probability Out of Tolerance    Defect Rate*
    0.33        ±1σ               0.31731                     317,300 PPM
    0.67        ±2σ               0.0455026                    45,502 PPM
    1.00        ±3σ               0.00270                       2,700 PPM
    1.33        ±4σ               0.0000633                      63.5 PPM
    1.67        ±5σ               0.000000573                   0.573 PPM
    2.00        ±6σ               1.973×10⁻⁹                     1.97 PPB

    * PPM = Parts Per Million; PPB = Parts Per Billion
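Assuming a normal process, the worst-case probabilities in Table 1 are the two normal tails beyond ±3·Cpk standard deviations, which can be computed with the complementary error function:

```python
import math

def p_out_of_tolerance(cpk):
    """Worst-case (two-sided) normal tail probability beyond 3*cpk
    standard deviations: P = erfc(3 * cpk / sqrt(2)), as in Table 1."""
    return math.erfc(3 * cpk / math.sqrt(2))

# Reproduces Table 1: cpk = 1/3 -> 0.31731, 1.0 -> 0.00270, 2.0 -> ~1.97e-9
```
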

Usually x̄ is not known until after the production process is set up and some units have been produced. Thus, in the design process the Cp index, defined as:

    Cp = (TU − TL) / (6σ)

may provide an estimate of the out-of-tolerance probability. The Cp and Cpk indices are related by the factor K,

    Cpk = Cp (1 − K)

where

    K = |x̄ − Tn| / [(TU − TL) / 2]

and Tn is the nominal parameter value [5].
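A small sketch of these index calculations; the tolerance limits and process statistics below are hypothetical, and the nominal value is assumed to sit midway between the tolerance limits:

```python
def process_indices(t_low, t_high, mean, sigma):
    """Cp, Cpk, and the offset factor K for a process with the given
    tolerance limits, mean, and standard deviation."""
    t_nom = (t_low + t_high) / 2                      # assumed nominal value
    cp = (t_high - t_low) / (6 * sigma)
    cpk = min(t_high - mean, mean - t_low) / (3 * sigma)
    k = abs(mean - t_nom) / ((t_high - t_low) / 2)
    return cp, cpk, k

cp, cpk, k = process_indices(t_low=9.0, t_high=11.0, mean=10.2, sigma=0.25)
# cp = 1.33, cpk = 1.07, k = 0.2, and Cpk = Cp * (1 - K) holds.
```
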

A Cpk (or Cp) of 1.33 or higher is generally considered "good", 1.0 to 1.32 "marginal", and below 1.0 "bad".* The capability index is often thought of as a measure of component quality rather than reliability. However, it is noteworthy that the design group and the manufacturing group must work together to simultaneously maximize the allowable tolerances in the design and to minimize the variation in the manufacturing processes.

* As process technology improves, these definitions of "good", "marginal" and "bad" will likely change and higher Cp or Cpk indices will be expected.

3.1.2. Taguchi Loss Function Tolerancing

The Taguchi loss function provides another way of

assessing the effect of variability in a design and, more

importantly from a managerial perspective, it relates the

variability to cost. Cost is usually far more important to

management than an abstract quality such as reliability. The

Taguchi loss function is defined as:

L(y) = k(y − y0)²    (2)

where y is the characteristic of interest (e.g., a dimension, performance measure, lifetime, etc.); y0 is the target value of the parameter; and k is a constant that depends on the organization's cost structure [6].

L(y) associates a cost with the amount the parameter is off

target. Unlike other measures of variability, which treat all

items within specifications as equally good, the loss function

imposes a cost whenever the parameter differs from its target

value, even if it is still within specifications. Furthermore, the

cost increases rapidly as the parameter moves away from its

target value. This is illustrated in Figure 6 along with an off-

target density function for the parameter.

Figure 6. Taguchi loss function L(y) = k(y − y0)², shown together with an off-target density function f(y) centered away from the target value y0.

The Taguchi loss function assumes that any parameter that

is off its target value imposes a cost on society. For example,

a pair of shoes that are ½ size too small might be purchased

but worn only once if they cause blisters, thus wasting their

initial cost. A pair that is ¼ size too small might be worn

only occasionally (they are ―uncomfortable‖), thus the buyer

does not realize their full value.

Using the cost function, an expected loss can be

developed for a manufacturing process that depends on both

the process variability and how much (if any) it has shifted

from its nominal value (y0):

E[L] = k[σ² + (μ − y0)²]    (3)

In eqn. (3), μ is the process mean and σ² is its variance.

The loss function provides a tool for evaluating the effect

of variation on possible design changes. For example,

consider the transistor circuit in Figure 7a with the gain

characteristic in Figure 7b. The loss function is calibrated by


considering the loss at an extreme point. Suppose that the

nominal line voltage is 115V and the circuit, valued at $100,

is destroyed if the voltage rises to 140 V. Then eqn. (2) reduces to $100 = k(140 − 115)², which yields k = $0.16/V². If the transistor has a nominal gain of 23, giving an on-target output of 115 V with a standard deviation of 2.5 V, we find an expected loss of $1.00 that

must be included in the warranty cost. If a transistor with less

variation would be too expensive we can try moving the

nominal operating point to the flat part of the gain curve by

using a transistor with a nominal gain of 35 and a standard

deviation of 0.66 V. This produces an off-target output

voltage of 122 V. From eqn. (3) the expected loss is $7.91 of

which $7.84 is due to being off-target and $0.07 is due to

variation. Replacing the load resistor with one of 60 kΩ brings the design back on-target by reducing the gain and gives an expected loss of $0.07.
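The three design alternatives above can be compared directly with eqn. (3). This sketch assumes the output-voltage standard deviations quoted in the text (2.5 V and 0.66 V) and the calibrated k = $0.16/V²:

```python
def expected_loss(k, mean, target, sigma):
    # Eqn. (3): E[L] = k * (sigma^2 + (mean - target)^2)
    return k * (sigma**2 + (mean - target)**2)

K = 0.16        # $/V^2, calibrated from $100 = k*(140 - 115)^2
TARGET = 115.0  # nominal output voltage (V)

on_target_noisy  = expected_loss(K, 115.0, TARGET, 2.5)   # original design
off_target_quiet = expected_loss(K, 122.0, TARGET, 0.66)  # gain-35 transistor
retargeted       = expected_loss(K, 115.0, TARGET, 0.66)  # with 60 kOhm load

print(on_target_noisy, off_target_quiet, retargeted)
```

The three results reproduce the $1.00, $7.91, and $0.07 expected losses discussed in the text, and the $7.91 case splits, as stated, into $7.84 from the (122 − 115)² off-target term and $0.07 from the variance term.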

Figure 7. Transistor circuit (a): 150 V supply, load resistor R, input Vin, output Vout. Gain characteristic (b): Vout (105 to 120 V) versus transistor gain (10 to 50) for R = 40 kΩ.

3.2. Improving Component Reliability

The objective with "fault intolerant" (also called "fault avoidance") techniques is to reduce the overall failure rate over the life of the system, essentially lowering the bathtub curve in Figure 3. In terms of the exponential reliability model, eqn. (1) (which is applicable during the constant failure rate period), they seek to reduce the failure rate λ. A common approach in many reliability models (e.g., Mil-Hdbk-217 [7], Bellcore Reliability Prediction Procedure [8]) is to express the failure rate as the product of a base failure rate λb and several "pi factors" determined by the part quality (πQ), the environment (πE), and other factors. Thus

λp = λb·πQ·πE·…

Reducing the overall failure rate involves:
- improving component quality
- derating components
- improving the environment
- reducing complexity
- miscellaneous techniques
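As a minimal sketch of the pi-factor form, the part failure rate is simply the base rate scaled by each factor. The numeric values below are illustrative placeholders, not values taken from Mil-Hdbk-217:

```python
from math import prod

def part_failure_rate(lambda_b, pi_factors):
    # lambda_p = lambda_b * pi_Q * pi_E * ...
    return lambda_b * prod(pi_factors)

# Hypothetical example: a base rate of 0.001 failures/hour with an
# assumed quality factor of 2.0 and environment factor of 4.0.
lam = part_failure_rate(0.001, [2.0, 4.0])
print(lam)
```

The multiplicative structure is the point: improving any single factor (better part quality, a benign environment) scales the whole part failure rate down proportionally.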

3.2.1. Improving Component Quality

Most manufacturers produce several quality grades of

parts. These grades are determined partly by the number and

types of monitors, controls, and inspections put on the

manufacturing process. Mil-Hdbk-217 provides quality

rating scales based on the Joint Army Navy (JAN)

specification used by the military.

Part quality can also be improved by doing burn-in tests

and acceptance testing on incoming parts (but see the

discussion on part quality in Section 3.1.1). Working with

suppliers to reduce variability (discussed earlier) and

performing quality audits of supplier manufacturing are also

effective in improving component quality. Although it is a

―management‖ technique rather than a ―design‖ technique,

the practice of qualifying suppliers who deliver high quality

parts and giving them a ―preferred status‖ is an effective way

of improving component quality.

3.2.2. Derating Components

Components are derated by using them at less than their

rated level of stress. Usually this is done by replacing a

component with one of greater strength. Different

components are derated in different ways as shown in

Table 2.

Table 2. Component derating methods

Component type    Load parameter
Resistor          Operating power
Capacitor         Applied voltage
Diode             Applied current
Transistor        Power dissipated
Semiconductor     Power dissipated

Some rough but useful guidelines for derating provide that

it should not be conservative to the point of increasing costs

excessively. (Usually higher rated parts are more expensive

than lower rated parts.) Nor should it be done to the point

where the device complexity must be increased to gain back

the performance lost by derating, thus offsetting the benefits

of derating. Typical derating factors range from 0.5 to 0.8.

Light bulbs provide an excellent example of the

effectiveness of derating at increasing component lifetimes

and the tradeoffs that often must be made with performance.

Standard 60 W light bulbs are rated by the manufacturer in

terms of their life expectancy and the amount of light

produced. Table 3 shows this relationship for several

manufacturers' light bulbs.

Table 3. Light output (Lumens) and Life (Hours) of 60 W light bulbs.

Lumens    Life (Hours)
870       1000
680       1500
620       4000
563       5400

Over-rating is the opposite of derating. This is sometimes

done to gain performance, but almost always, at the expense

of reliability. This effect was clearly illustrated in a senior

design project where students found that they could get better

performance from a small 5 V motor by running it at 10 to 15

V. Although the motor appeared to work fine at first, the

class was plagued by a rash of motor failures.

3.2.3. Reducing Complexity

One measure of complexity is the number of separate


devices needed to implement a given function or piece of

equipment. By this measure a circuit board that utilizes ten

integrated circuits to implement some function is more

complex than one that uses two. Similarly, a bumper for a car

that is built up from several separate pieces is more complex

than one that is a single unit. Complexity is reduced by

achieving a higher level of component integration and

reducing the number of components needed to implement a

function. The failure rate for non-redundant systems is the

sum of the individual component failure rates. Thus, if there

are fewer components, all else being the same, there will

generally be a lower failure rate. But all else is usually not

the same. One Very Large Scale Integrated (VLSI) circuit

chip might replace several Small Scale Integrated (SSI)

circuit chips on a board, but its failure rate will likely be

higher than that of an SSI chip. Similarly, the single unit

bumper might be more difficult to manufacture within

tolerances than each piece of the multi-piece bumper.

However, in many cases, device reliability and the level of

integration have a power law relationship so that the

reduction in the number of components more than

compensates for any accompanying increase in the

component failure rate caused by increasing its level of

integration. Consider, for example, a 1 M bit DRAM. The device model from Mil-Hdbk-217 is

λ = [C1·πT + C2·πE + λCYC]·πQ·πL

The important factors are C1, the circuit complexity failure rate, and C2, the package complexity failure rate. The C1 complexity factor of a DRAM is determined by the number of "bits", approximately

C1 = 0.00125·(B/16,384)^(1/2)

where B is the number of bits. The C2 complexity factor depends on the type of package; for a DIP,

C2 = 9.0×10⁻⁵·N^1.51

where N is the number of pins in the package. These values result in the C1 and C2 factors shown in Table 4 for the DRAMs.

Table 4. DRAM circuit (C1) and packaging (C2) complexity factors

Device                 C1       C2
64 K DRAM, 18 pins     0.0025   0.0071
256 K DRAM, 20 pins    0.0050   0.0083
1 M DRAM, 22 pins      0.010    0.0096

A 1 M bit memory built from these chips would result in

the values of C1 and C2 in Table 5, which shows the

advantage of increasing the level of component integration.

Table 5. Circuit (C1) and packaging (C2) complexity factors for 1 M bit memory

Device                 Number of chips   Total C1   Total C2
64 K DRAM, 18 pins     16                0.04       0.114
256 K DRAM, 20 pins    4                 0.02       0.033
1 M DRAM, 22 pins      1                 0.01       0.01
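The Table 5 totals follow directly from Table 4. The sketch below recomputes them, deriving each C2 from the DIP pin-count formula and taking the per-chip C1 values from Table 4:

```python
def c2_dip(pins):
    # Package complexity factor for a DIP: C2 = 9.0e-5 * N^1.51
    return 9.0e-5 * pins**1.51

# (device, pins, per-chip C1 from Table 4, chips needed for 1 M bit)
devices = [("64 K DRAM", 18, 0.0025, 16),
           ("256 K DRAM", 20, 0.0050, 4),
           ("1 M DRAM", 22, 0.0100, 1)]

for name, pins, c1, chips in devices:
    total_c1 = chips * c1
    total_c2 = chips * c2_dip(pins)
    print(f"{name}: total C1 = {total_c1:.3f}, total C2 = {total_c2:.3f}")
```

Although the per-chip C1 roughly doubles with each fourfold increase in capacity, the total C1 for a 1 M bit memory still falls by half at each step, and the packaging contribution C2 falls even faster, which is the advantage of integration the text describes.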

The steady decrease of the device size in integrated

circuits has led to an increasing level of component

integration in electronic devices. The result is: fewer

components, fewer solder joints, less required board space

allowing smaller boards, generally less required power

resulting in cooler operating temperatures, better

performance, and improved reliability. The synergy produced

by these advantages has driven electronics toward ever higher

levels of component integration.

Another goal of reducing complexity is to simplify device

assembly. To do this the designer should: Use fewer parts;

use fewer distinct parts or part numbers; key parts so that they

can be assembled only one way; and use fewer vendors.

Pugh's complexity factor provides a useful measure of complexity [9]:

C = (P·T·I)^(1/3) / F

where:
P = number of parts
T = number of types of parts
I = number of interfaces
F = number of functions
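A small sketch of Pugh's factor, comparing a hypothetical multi-piece assembly against a single-piece design delivering the same functions (the counts are invented for illustration):

```python
def pugh_complexity(parts, types, interfaces, functions):
    # Pugh's complexity factor: C = (P * T * I)^(1/3) / F
    return (parts * types * interfaces) ** (1.0 / 3.0) / functions

# Hypothetical comparison, e.g. a built-up bumper vs. a single-unit one.
multi  = pugh_complexity(parts=12, types=6, interfaces=16, functions=2)
single = pugh_complexity(parts=3,  types=2, interfaces=4,  functions=2)
print(multi, single)
```

The cube root keeps the measure from being dominated by any one count, and dividing by the number of functions rewards designs that deliver more function with the same part count.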

3.2.4. Improving the Environment

Environmental effects on the components in a system can

be reduced by isolating the components to protect them from

the environment or by choosing components that are tolerant

of the environment. For example, germanium transistors are

known to be more tolerant of radiation than silicon transistors

and should be used if a circuit must operate in that type of

environment. Rubber supports are often used to shield parts

from shocks and vibration in the environment.

In many cases environmental factors interact, either intensifying or weakening one another's effects. For example, the mechanical deterioration due to sand and dust in the environment is weakened by high temperatures but accelerated when combined with vibration. Mil-Hdbk-338 describes many environmental effects [9].

One particularly important environmental effect is heat.

Several models are used to show the effect of heat on component reliability. The Arrhenius model is one of the best known of these.

Arrhenius Model. In the late 1800s Svante Arrhenius, a Swedish chemist, found that a plot of the log of the reaction rate versus the inverse of the temperature was a straight line for the inversion of sucrose. In modern terminology this is expressed as

k = k0·exp(−Ea/(Kb·T))

where:
T is the absolute temperature (kelvins)
Ea is the activation energy (eV)
Kb is the Boltzmann constant (8.617×10⁻⁵ eV/K)
k0 is a constant.

The Arrhenius equation is often used to describe the effect

of steady state temperature on many of the physical and


chemical processes such as ion drift and impurity diffusion

that lead to component failures. Assuming that failures occur

when the concentration of the reactant corresponding to a

particular failure mechanism reaches some critical value, the

change in time to failure from t1 to t2 due to a change in

temperature from T1 to T2 is expressed as

t2/t1 = exp[(Ea/Kb)·(1/T2 − 1/T1)]

We observe that if T2 > T1 then t1 > t2, implying that the device fails sooner at the higher temperature. This

relationship between temperature and reliability is captured in

many reliability prediction procedures. For example, Mil-Hdbk-217 provides the following temperature factor for composition resistors:

πT = exp[(−Ea/(8.617×10⁻⁵))·(1/T − 1/298)]

Here, the activation energy Ea for composition resistors is determined experimentally to be 0.2 eV, T is the absolute temperature, and 298 K (25 °C) is the base temperature for measurement. Figure 8 shows the effect of temperature on πT.

Figure 8. Temperature acceleration factor πT versus temperature (0 to 150 °C).
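The curve in Figure 8 can be reproduced numerically. This sketch uses the composition-resistor activation energy of 0.2 eV and the 298 K reference temperature given above:

```python
from math import exp

K_B = 8.617e-5  # Boltzmann constant, eV/K

def pi_t(temp_c, ea=0.2, t_ref=298.0):
    # Arrhenius-style temperature factor relative to 25 C (298 K):
    # pi_T = exp[(-Ea / K_B) * (1/T - 1/T_ref)]
    t = temp_c + 273.0
    return exp((-ea / K_B) * (1.0 / t - 1.0 / t_ref))

for temp in (25, 50, 100, 150):
    print(f"{temp:3d} C  pi_T = {pi_t(temp):.2f}")
```

At 25 °C the factor is 1 by construction, and by 150 °C it has grown to roughly 10, matching the range shown in Figure 8.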

Accelerated Stress Testing. In accelerated stress testing

components are subjected to a much harsher environment

than their normal operating environment in order to induce

the equipment to fail. Then by examining the types of

failures that occur the design can be changed to make the

equipment more robust. The stressed environment usually

involves operating the equipment at

higher and lower temperatures,

with high levels of vibration and sometimes shock

at higher and lower voltages

cycling between high and low temperatures

To the extent that the stressed environment does not induce

new failure mechanisms this type of testing can reveal

weaknesses in the design, which can then be improved upon

[12].

3.2.5. Miscellaneous Techniques

The designer must be aware of the factors that affect the

failure rate of his or her particular design. For example, NPN

transistors have a lower inherent failure rate than PNP

transistors due to their having a lower junction temperature;

so the logic in digital circuits should be designed to use NPN

transistors.

General Electric found that the expected life of an

incandescent lamp is reduced by 25% if a single spot on the

filament is just 1% less than specifications and that the

mandrel on which the filament is coiled must be accurate to

0.0001 in. or the life of the lamp is reduced by 20%.

Knowing this, manufacturers set very tight tolerance limits

and manufacturing controls to have as little variation in the

filament diameter as possible.

3.3. Software Reliability

Software is a major component of most systems with any

degree of complexity and its importance is increasing —

particularly with respect to embedded control systems.

Software cannot be ignored in any assessment of reliability.

Software differs from hardware in several respects:
- It does not wear out or degrade with use or over time.
- It can fail only when it is being used. Hence reliability must be measured with respect to execution time.
- Errors are "logical" rather than the result of some physical parameter changing.
- It is generally not tested to the same degree as hardware.
- It can be much more complicated than anything done in hardware.

Many models have been developed for assessing software

reliability. One relatively simple model, which we will

discuss, is the Basic Execution Time Model [13]. This model

assumes that a program contains a finite number of faults and

that the failure rate decreases as faults are found and fixed.

Thus, the software failure rate as a function of the number of faults found (μ) is:

λ(μ) = λ0·(1 − μ/V0)    (4)

where:
λ0 is the initial failure rate
μ is the number of faults found
V0 is the initial number of faults

This is shown in Figure 9. Eqn. (4) leads to a software reliability model in terms of the program execution time τ:

λ(τ) = λ0·exp(−λ0·τ/V0)

The model can be used to determine how much more execution time will be required for testing and how many more faults must be found and fixed to achieve a given level of reliability. By considering the limiting resources available (programmers to fix faults when they are found, test personnel, machines to use for running tests, etc.), the execution time estimates can be mapped into an elapsed time model.

Figure 9. Basic Execution Time software reliability model: λ(μ) = λ0(1 − μ/V0), plotted against the number of faults found (μ) from 0 to V0.

Table 6. Software defect ratio per 1000 lines of code

Development Process                                   Total errors in   Faults remaining
                                                      development       at delivery
Traditional development (bottom-up design,
  unstructured code, fault removal through testing)   50-60             15-18
Modern practice (top-down design, structured code,
  design/code reviews & inspections, incremental
  releases)                                           20-40             2-4
Advanced software engineering practices
  (verification practice, reliability measurement
  & tracking)                                         0-20              0-1

The initial failure rate can be found using rules of thumb

for the development methodology. Table 6 shows some

rough guidelines. Alternatively, it can be determined by measuring the number of failures during the first few hours of testing.
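Under the basic execution time model these planning questions have closed forms (Musa's results): reaching a target failure rate λF from a present rate λP requires removing Δμ = (V0/λ0)·(λP − λF) more faults over Δτ = (V0/λ0)·ln(λP/λF) more execution time. A sketch with illustrative numbers:

```python
from math import log

def additional_faults(v0, lam0, lam_present, lam_target):
    # Faults still to be found and fixed to reach the target rate.
    return (v0 / lam0) * (lam_present - lam_target)

def additional_exec_time(v0, lam0, lam_present, lam_target):
    # Additional execution time required (units of 1/lambda).
    return (v0 / lam0) * log(lam_present / lam_target)

# Illustrative only: 100 initial faults, initial rate 10 failures/CPU-hr,
# currently at 2 failures/CPU-hr, targeting 0.5 failures/CPU-hr.
dmu  = additional_faults(100, 10.0, 2.0, 0.5)
dtau = additional_exec_time(100, 10.0, 2.0, 0.5)
print(dmu, dtau)
```

Note the asymmetry: the remaining-fault count falls linearly with the failure rate, but the testing time grows logarithmically, so each further factor-of-two reduction in failure rate costs a constant amount of additional execution time.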

4 SYSTEM-LEVEL RELIABILITY

4.1. System Structure

Most devices are composed of more than one component.

In such cases, the reliability of the device depends upon the

reliability of the underlying components and how they are

configured to form the higher-level system. From a reliability

point of view we are mainly concerned with how failures of

the system’s components affect the operation of the system.

This effect is captured in a ―reliability logic diagram‖ which

shows the reliability structure of the system.

Series Systems. A ―series system‖ is one in which the device

fails if any of its components fail. It is represented by a

―series‖ diagram like that in Figure 10. Light switches

connected in series are a good analogy of a series system. If a

component is working, the corresponding switch is closed; if

the component has failed the corresponding switch is open.

For the light to be ―on‖ all of the switches must be closed—

i.e. all of the components must be working. Most simple

devices are of this type.

Since a series system fails when the first component fails,

the time to failure for the system is the minimum of all the

component times to failure. Thus, the probability distribution

of the time to failure for a series system with n components is,

Pr(TSYS > t) = Pr(T1 > t AND T2 > t AND ... AND Tn > t)

where TSYS is the system time to failure and Ti, i = 1, ... n, are

the component times to failure. When the component states

(working, not working) are mutually independent, this

reduces to the product of the reliabilities of its constituent

parts:

RSYS = R1 R2 ... Rn

where RSYS is the system reliability and Ri is the reliability of

component i.

Redundant (Parallel) Systems. When the system has

redundant components, the redundancy can be represented by

"parallel" blocks in the diagram as illustrated in Figure 11.

Again, using the light switch analogy, light switches in

parallel are a good analogy for a redundant system—the light

is on if either switch is closed. The reliability of a system

with redundant components depends on the type of

redundancy used. In the simplest case, the system will

continue to operate as long as any one of the components

operates. In that case the system time to failure is the

maximum of the time to failure of its components, and we

have,

Pr(TSYS > t) = Pr(T1 > t OR T2 > t OR ... OR Tn > t)

The system reliability is then given by the expression:

RSYS = 1 − [(1 − R1)(1 − R2) ... (1 − Rn)]

where again, Ri is the reliability of the i-th redundant component and independence is assumed.
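Under the independence assumption, both structures reduce to one-line computations; a minimal sketch:

```python
def series_reliability(rs):
    # Series: the system works only if every component works.
    out = 1.0
    for r in rs:
        out *= r
    return out

def parallel_reliability(rs):
    # Parallel: the system works if at least one component works.
    fail = 1.0
    for r in rs:
        fail *= (1.0 - r)
    return 1.0 - fail

print(round(series_reliability([0.9, 0.9]), 6))    # 0.81
print(round(parallel_reliability([0.9, 0.9]), 6))  # 0.99
```

The two 0.9 components illustrate the asymmetry discussed in the text: in series their unreliabilities compound against the system (0.81), while in parallel they compound in its favor (0.99).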

Redundancy has two main uses in design:
- to eliminate single points of failure
- to increase reliability or availability

Eliminating single points of failure is very important in

military systems and becoming more important in commercial

systems since such points are obvious targets for saboteurs.

Redundancy is also a requirement for high availability

systems in which repairs must be made without shutting down

the system and it is an important consideration for safety

critical systems.

Redundancy is often viewed as an ―easy way‖ to increase

system reliability. A simple calculation shows why. Suppose

each component in a redundant system configuration has a

(fairly poor) reliability of 0.9 for some time period. A two-

component system would give a reliability of 0.99; three

components would give a reliability of 0.999; four

components would give a reliability of 0.9999, and so on. It

is all very simple, very appealing, and very, very misleading.

When used to increase reliability—that is extend the

lifetime of the system with high probability—redundancy

may be particularly ineffective and it depends critically on the

component failure distribution function. We will discuss this

further in Section 4.2.

Other types of redundancy besides parallel redundancy,

that are often used include M-out-of-N:G (read ―M out of N

good‖) redundancy in which the system continues to operate

as long as M of its original N components are operational, and

standby redundancy in which a spare module can be switched

in to replace a failed module. These types of redundancy are


most often used in degradable systems and in systems with

backup subsystems. One problem with all types of

redundancy is that as the level of redundancy increases, the

reliability of whatever mechanism is used to determine which

modules are good or to switch in good modules (and switch

out bad modules) can quickly dominate the overall system

reliability.

Combination Systems. Most reliability logic diagrams

consist of combinations of series and parallel components.

Figure 12 shows a reliability logic diagram for a high

availability computer system with redundant processor, disk

array controllers, and mirrored disks. Observe that in this

system each pair of redundant disk array controllers is in

series with a processor; this subsystem, in turn, is redundant,

and the entire processor subsystem is in series with the disk

array subsystem which is itself quadruple redundant. The

analysis of such systems consists of analyzing each series or

parallel subsystem iteratively until the whole system is

analyzed. This is straightforward but it can become rather

tedious.
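The iterative reduction just described can be sketched numerically for the Figure 12 system. The component reliabilities used here are illustrative assumptions, not values given in the text:

```python
def series(*rs):
    out = 1.0
    for r in rs:
        out *= r
    return out

def parallel(*rs):
    fail = 1.0
    for r in rs:
        fail *= (1.0 - r)
    return 1.0 - fail

# Assumed component reliabilities (illustrative only).
R_PROC, R_CTRL, R_DISK = 0.99, 0.98, 0.95

# One branch: a processor in series with a parallel pair of
# disk array controllers.
branch = series(R_PROC, parallel(R_CTRL, R_CTRL))

# Two such branches in parallel, in series with the quadruple-
# redundant disk array subsystem.
r_sys = series(parallel(branch, branch), parallel(*4 * [R_DISK]))
print(round(r_sys, 6))
```

Each call to series() or parallel() collapses one subsystem of the diagram into a single equivalent block, exactly the iterative procedure the text describes.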

Figure 10. "Series" reliability logic diagram: blocks 1, 2, ..., n in series.

Figure 11. "Parallel" reliability logic diagram: blocks 1, 2, ..., n in parallel.

Figure 12. Reliability logic diagram for a high-availability computer system: two parallel branches, each a processor in series with a parallel pair of disk array controllers, all in series with four parallel disk arrays.

Figure 13. Diodes electrically in series.

The type of reliability logic diagram used to model the

reliability of a system depends strongly on the type of failure

mode being considered. For example, a system consisting of

two diodes that are electrically in series (Figure 13), is

modeled as a series system (both must work) if the failure

mode "open" (current is blocked) is being analyzed. They are

modeled as a parallel system (either must work) if the failure

mode "short" (no rectification) is being considered. The

analysis of a system having redundant components can get

very complex when they can have different failure modes.

4.2. Effects of the Failure Distribution on System Reliability

The reliability of a system depends critically on the

failure distribution function. This is illustrated in Tables 7 and 8 for redundant and series systems with 3 different

distribution functions. All three distribution functions have

the same MTTF for a single component.

Table 7 shows the effect of increasing the level of

redundancy on the system MTTF, assuming no repairs are

made following a failure. In all three cases the MTTF

increases with the level of redundancy. For the 4-unit parallel

system the exponential distribution has approximately double

the MTTF of the simplex system. For the first Weibull distribution (β = 0.6), which has a decreasing hazard function, the gain is even greater at all levels of redundancy. On the other hand, for the second Weibull distribution (β = 5), which has an increasing hazard function, the increase is not nearly as much and diminishing returns quickly set in as the redundancy is increased.

Table 7. MTTF for redundant system (without repair) with different failure distributions.

Parallel System     Weibull           Exponential   Weibull
                    λ = 0.020251      λ = 0.001     λ = 6.5×10⁻¹⁶
                    β = 0.6                         β = 5
1-unit (Simplex)    1000              1000          1000
2-unit              1685              1500          1130
3-unit              2215              1833          1192
4-unit              2652              2083          1231

The opposite result occurs for the series system as shown

in Table 8. Table 8 shows the system MTTF as more

components are added to a series system for the same 3

distribution functions. For the exponential distribution, the 4-

unit series system has an MTTF one quarter that of the simplex system. For the first Weibull distribution (β = 0.6) the MTTF decreases even faster; for the second Weibull distribution (β = 5) the decrease is much less.

Table 8. MTTF for series system with different failure distributions.

Series System       Weibull           Exponential   Weibull
                    λ = 0.020251      λ = 0.001     λ = 6.5×10⁻¹⁶
                    β = 0.6                         β = 5
1-unit (Simplex)    1000              1000          1000
2-unit              315               500           871
3-unit              160               333           803
4-unit              99                250           758
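Entries like those in Tables 7 and 8 can be reproduced by numerically integrating the system survivor function, since MTTF = ∫₀^∞ Rsys(t) dt. This sketch does so for the exponential column (λ = 0.001), where the exact 2-unit parallel and series values are 1500 and 500 hours:

```python
from math import exp

def mttf(system_r, t_max, steps=200_000):
    # MTTF = integral of the system survivor function (trapezoid rule,
    # truncated at t_max where the survivor function is negligible).
    dt = t_max / steps
    total = 0.0
    for i in range(steps + 1):
        w = 0.5 if i in (0, steps) else 1.0
        total += w * system_r(i * dt)
    return total * dt

lam = 0.001
r = lambda t: exp(-lam * t)                        # one exponential unit
par2 = mttf(lambda t: 1 - (1 - r(t))**2, 50_000)   # 2-unit parallel
ser2 = mttf(lambda t: r(t)**2, 50_000)             # 2-unit series
print(round(par2), round(ser2))                    # prints: 1500 500
```

Swapping in a Weibull survivor function, R(t) = exp(−λ·t^β), reproduces the other two columns, which is a quick way to check the sensitivity of system MTTF to the assumed distribution.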

The results in Tables 7 and 8 are easily explained with

reference to Figure 1, which shows the density functions for

the three distributions.


First, comparing the exponential and the first Weibull (β = 0.6) distributions, observe that the exponential distribution

has a very long tail. 63.2% of the component population fail

before 1 MTTF, but those that survive may survive quite a

long time; 13.5% survive longer than 2 MTTF and 5% longer

than 3 MTTF. Thus for the parallel system, which works as

long as any unit is working, increasing the level of

redundancy greatly increases the likelihood that at least one

unit has not failed and the MTTF is greatly increased. This

effect is even more pronounced in the Weibull distribution in

which the failure rate decreases with time—only 27.8%

survive more than 1 MTTF, but 14.4% survive for 2 MTTF

and 8.5% survive for 3 MTTF. For the series system, the

large proportion of components in both distributions that fail

well before 1 MTTF increases the likelihood that at least one

component will fail resulting in a greatly reduced MTTF as

the number of components increases. 39.4% of components

from the exponential distribution fail before 0.5 MTTF; 57%

of components from the Weibull distribution.

Comparing the exponential and second Weibull (β = 5) distributions, we observe that the Weibull distribution is very

clustered about its MTTF; very few components fail before

1/2 MTTF or last longer than 1.5 MTTF. Thus, in the parallel

system, adding more components easily extends the lifetime

of the system for exponentially distributed components but it

is very unlikely to extend the system lifetime much beyond

1.5 MTTF for the Weibull. Similarly, in a series system one

of the exponentially distributed components is quite likely to

fail well before 1 MTTF whereas this is much less likely for

the Weibull distribution.

As already discussed, the exponential distribution is

particularly tractable and makes reliability calculations quite

easy. However, as we have seen in Tables 7 and 8 the effect

of assuming a constant failure rate when the actual failure

probability is either increasing or decreasing can lead to very

erroneous results. This is especially true for the clustered

distribution (Weibull β = 5); redundancy does not appreciably

increase the system lifetime as might be anticipated and a

series combination of components does not shorten it as much

as might be anticipated.

Modern, high quality manufacturing strives to reduce the

variance in the products produced. Thus, important product

parameters tend to be tightly clustered about their mean

values with only small variation. A consequence of this is

that products whose predominant failure mode is due to

wearout tend to have lifetime distributions that are also tightly

clustered. Examples of such components include

incandescent lamps, electronic tubes, and many mechanical

components. As an example, a study by General Electric

found that the expected life of an incandescent lamp is

reduced by 25% if a single spot on the filament is just 1% less

than specifications. Knowing this the manufacturer set very

tight tolerance limits and manufacturing controls to have as

little variation in the filament diameter as possible. The

predictable result is that the lamps’ lifetimes are tightly

clustered.

From another perspective, a manufacturer who produces

lamps with an advertised mean lifetime of 1000 hours (typical

of standard 60W light bulbs) would receive many complaints

if many of those bulbs lasted only 500 hours. A constant

failure rate predicts that 39% of such lamps would fail within

their first 500 hours of operation and 22% would fail in their

first 250 hours of operation. A manufacturer striving for a

reputation for high quality products could not tolerate this.

Other components without a predominant wearout failure

mode might be better described by a constant or decreasing

failure rate. Many electronic parts are of this type. For

products with a decreasing failure rate, allowing a significant

―burn-in‖ period before shipment can be a useful way of

improving reliability. Products that do not fail then are less

likely to fail later. For example, the reliability of disk drives

during the first few months of operation is typically less than

it is later [14].

The implications are clear. Ignoring the distribution and

assuming a constant failure rate equal to the inverse of the

system or component MTTF can give extremely misleading

results when modeling system reliability. The results of such

models are not ―conservative‖ in the sense that they under

estimate reliability, nor are they ―close approximations‖, nor

do they provide ―bounds‖ on the system reliability. They are

simply wrong! Reliability engineers must determine and

consider the actual failure distributions of the components in

their designs. Models based on generic failure rates, when in

fact the actual failure rate is either increasing or decreasing,

are of little value and might be harmful to a design.

4.3. Reliability Specification and Allocation

Reliability is a non-deterministic performance

requirement in much the same way as appearance. This

makes it difficult to specify and even more difficult to

measure whether it has been achieved. Most authorities warn

against asserting vague reliability requirements such as ―as

reliable as possible‖, or ―will have high reliability‖, or ―more

reliable than product X‖, or even ―the reliability will be

99%‖. They suggest focusing instead on describing the

environment in which the product will function, defining

failures in terms of its functions, and identifying particularly

critical failure modes that must have a very low probability of

occurrence. Even so, a quantitative reliability requirement

can be useful and sometimes gives perspective on what will

be required of the system.

As noted in Section 1.3 reliability affects such factors as

the product warranty, the cost of ownership, customer

satisfaction, and the competitive positioning of the product in

the market. In the computer industry, sophisticated customers

are beginning to demand that manufacturers provide a

reliability or availability guarantee with their products. An

economic analysis of these requirements can often be used to

quantify the required reliability for a product.

Some "facts" from the medical profession help to put this

in perspective. If the medical profession had 99.9%

reliability:

Surgeons would do 26,000 bad operations per year;
Pharmacists would fill 20,000 drug prescriptions wrong;
Doctors and nurses would drop 15,000 newborn babies.

This of course results from the huge number of operations,


prescriptions filled, and babies born each year.

Reliability allocation is the process of distributing the

required system reliability, Rsys(t), among the various

modules making up the system. In general reliability is

allocated assuming a series system:

Rsys(t) = R1(t) × R2(t) × ... × Rn(t)

where Ri(t) is the ith

module reliability and n is the number of

system modules.

The simplest allocation technique is equal reliability

allocation, in which case,

Ri(t) = [Rsys(t)]^(1/n)

With 5 modules a system reliability Rsys(t) = 0.99

generates a module reliability requirement of Ri(t) = 0.9980;

10 modules generate a module reliability requirement of

0.9990; 20 modules generate a requirement of 0.9995. Equal

reliability gives the lowest reliability requirement for all

modules, requires no knowledge of function, and usually

provides a good ballpark estimate of the module reliability

needed to achieve the system reliability. It is worthwhile to

observe how quickly the reliability requirement for each

module increases as the number of modules increases.
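These figures are easy to reproduce; the short sketch below (plain Python, values taken from the worked example above) computes the equal-allocation module reliability Ri(t) = [Rsys(t)]^(1/n) for several module counts.

```python
# Equal reliability allocation: each of n series modules must meet
# Ri = Rsys**(1/n) for the series system to meet Rsys.
def equal_allocation(r_sys, n):
    """Per-module reliability required for n equal series modules."""
    return r_sys ** (1.0 / n)

for n in (5, 10, 20):
    print(n, round(equal_allocation(0.99, n), 4))  # 0.998, 0.999, 0.9995
```

Note how quickly the per-module requirement tightens as n grows, matching the 0.9980 / 0.9990 / 0.9995 progression quoted above.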

Another common allocation technique is complexity

weighting where complexity is measured by the number of

module components. In this technique the failure rate for

each module is set proportionally to the number of module

components: λi = (ni/N)·λsys, where:

λi = module failure rate
λsys = system failure rate
ni = estimated number of components in module i
N = n1 + n2 + ... + nn, the total number of components over all modules

The allocated module reliability is

Ri(t) = exp(-λi t) = [Rsys(t)]^(ni/N)     (5)

The count of active components such as transistors,

diodes, and integrated circuits, is often adjusted by a

weighting factor (typically 1.3 to 3.0) since active

components tend to fail more often than passive components.
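As a sketch of the complexity-weighting arithmetic (the component counts below are illustrative, not from the source):

```python
import math

# Complexity-weighted allocation: lambda_i = (n_i/N)*lambda_sys, so that
# Ri(t) = exp(-lambda_i*t) = Rsys(t)**(n_i/N).  Active-part counts would be
# multiplied by a weighting factor (1.3 to 3.0) before summing.
def complexity_allocation(r_sys, counts):
    """Allocate r_sys over modules in proportion to component counts."""
    N = sum(counts)
    return [r_sys ** (n_i / N) for n_i in counts]

counts = [120, 60, 20]                     # illustrative component counts
r_mods = complexity_allocation(0.99, counts)
print(r_mods, math.prod(r_mods))           # product recovers Rsys = 0.99
```

Because the exponents ni/N sum to 1, the product of the allocated module reliabilities always recovers the system requirement exactly.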

Another technique that is especially useful in the early

concept phase of the design, is the relative weight allocation

method. In this technique each subsystem is ranked relatively

(on a scale of 1 to 10) according to such factors as its

complexity, its degree of state-of-the-art, its working

environment, the time needed to achieve a given level of

reliability, the importance of the module and any other factor

deemed important. The factors are then summed to provide a

value ni, i = 1, … n, for each module and a reliability is

calculated as in eqn. (5). Table 9 illustrates the method.

Inexperienced designers tend to treat reliability allocation

as a "numbers game". Instead, it should be used as a tool for

identifying what problems are likely to occur in the design.

This is especially true of the "relative weight" technique

which can be used early in the design process as the concept

is being developed. As is evident from Table 9, the auto pilot

will be more complex, require more development time,

operate in a harsher environment, and push the technology

more (more state-of-the-art) than the communications

subsystem. Hence, it is likely to have a lower reliability and

is assigned a lower reliability requirement than the

communications subsystem. The important point in the

analysis is not the lower reliability requirement for the auto

pilot per se, but rather the factors that lead up to it and that

must be considered in its design. The analysis suggests that

adequate time must be allowed to develop the autopilot, that

new and possibly untried technology may be needed in its

construction, and that interactions between the autopilot and

the environment must be carefully assessed. Insights that

identify areas of concern are the important outcomes of the

allocation process rather than the specific numbers produced.

5. SYSTEM-LEVEL ANALYSIS TECHNIQUES

Several system analysis techniques allow the analyst to

determine how low-level failures affect the reliability of the

system. The most frequently used of these analysis

techniques are Failure Modes and Effects Analysis (FMEA),

fault tree analysis, and the use of Markov models. FMEA

examines the effect that each failure mode of an item has on

the system; fault tree analysis determines what low-level

failures can cause a given system-level failure; and Markov

models describe the changes in the state of the system

following a device failure.

5.1 Failure Modes and Effects Analysis

Failure modes and effects analysis is one of the most

effective tools available for building a reliable system. It

requires that the designer examine each item in the system,

consider all the ways that the item can fail and either 1)

Table 9. Relative weight reliability allocation for a rocket with an overall reliability of 0.99.

Major Subsystem    Complexity  State of  Time to   Environ-  Module    Ratio    λi         Ri
                               the Art   Develop   ment      Sum (ni)  (ni/N)
Fuel                   6          5        10         5        26      0.181    0.001815   0.998187
Auxiliary Power        5          4         8         5        22      0.153    0.001535   0.998466
Communications         6          1         5         2        14      0.0972   0.0009771  0.9990234
Auto Pilot             8          6         9         7        30      0.208    0.002094   0.997908
Navigation             7          6         8         6        27      0.188    0.001884   0.998117
Ecology                8          7         8         2        25      0.174    0.001745   0.998257
Total                                                         144               0.01005    0.99
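The Table 9 numbers can be reproduced in a few lines (a sketch; a unit mission time t = 1 is assumed, so that λsys = -ln 0.99):

```python
import math

# Relative-weight allocation for the rocket of Table 9, Rsys = 0.99.
# Each module's factor scores are summed to n_i; lambda_i = (n_i/N)*lambda_sys.
weights = {"Fuel": 26, "Auxiliary Power": 22, "Communications": 14,
           "Auto Pilot": 30, "Navigation": 27, "Ecology": 25}
N = sum(weights.values())                  # 144
lam_sys = -math.log(0.99)                  # about 0.01005
for name, n_i in weights.items():
    lam_i = (n_i / N) * lam_sys            # allocated module failure rate
    print(f"{name:16s} {n_i/N:.3f} {lam_i:.6f} {math.exp(-lam_i):.6f}")
```

Running this reproduces the Ratio, λi, and Ri columns of Table 9 to rounding.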


accept the consequences; 2) find some way to mitigate their

effects on the system; or 3) eliminate the failure altogether. It

provides a basis for recognizing specific component failure

modes identified in component and system prototype tests,

and failure modes developed from historical "lessons learned"

in design requirements. It aids in identifying unacceptable

failure effects that prevent achieving design requirements. It

is used to assess the safety of system components, and to

identify design modifications and corrective actions needed to

mitigate the effects of a failure on the system. It is used in

planning system maintenance activities, subsystem design,

and as a framework for system failure detection and isolation.

In the military, aerospace, nuclear, and other industries where

safety issues are of prime importance, FMECA has become

an essential part of the system safety analysis [15, 16].

Table 10 shows a partial FMEA analysis of the stop valve

for the gas hot water heater in Figure 10. By focusing on how

the system behaves when a component fails, FMEA gives the

system designers deeper insights into the role that each

component plays in the system operation. However,

designers whose focus is necessarily on creating and

implementing the functions that the system will do, generally

find it to be a tedious and discouraging analysis to perform.

Figure 10. Schematic for a domestic hot water heater.

One of the most difficult tasks in performing a FMEA is

to identify component failure modes. Most analysts can

readily identify failure modes such as "open" and "short" for

electrical components but they have difficulty identifying

other types of failure modes that can be broadly characterized

as "partial operation". Libraries of potential component

failure modes can be very helpful in this regard.

FMEA originated in the aerospace industry in the 1960s

as demands for greater reliability drove studies of component

failures to be broadened to include the effects of the failures

on the systems of which they were a part. Thus, FMEAs were

traditionally done near the end of the design process and the

analysis focused on the physical, piecepart components as in

Table 10. Thus, it could have little impact on the design

except when it revealed a major safety-related failure effect.

More recently, FMEA has been extended to become a more

effective tool in the design process.

A functional FMEA focuses on functional failures. These

types of failures can be identified early in the design process

when only a functional description of the system is available.

Functional failures are usually a failure to perform some task

or doing the task incorrectly. Resolution of these types of

failures is accomplished by changing the system

requirements.

An interface FMEA focuses on failures of the interfaces

between the major functional modules of a system. This type

of analysis can be done before the internal design of the

modules has even begun and it can reveal deficiencies in the

module interconnects.

FMEA has also been extended to software where it is

particularly effective on systems such as microprocessor

based control systems that have little internal hardware

checking. Finally, FMEA has been applied to the

manufacturing process by which a device is produced.

Computerization has also made the FMEA task more

efficient. System simulations show the effects of failures

more readily than analysis in all but the simplest systems,

libraries of part failure modes ensure that all failure modes are

considered, and groups of failure modes having similar

consequences can sometimes be considered together rather

than repeating the analysis for each one individually [16].

Table 10. Failure mode analysis of hot water heater stop valve.

Component: stop valve

Failure Mode                                        Local Effect                                 System Effect
1) Fails closed                                     Burner off                                   No hot water
2) Fails open                                       Burner won't shut off                        Overheats, release valve releases pressure, may get scalded
3) Does not open fully                              Burner not fully on                          Water heats slowly or doesn't reach desired temperature
4) Does not respond to controller -- stays open     (same as 2)
5) Does not respond to controller -- stays closed   (same as 1)
6) Leaks through valve                              Burner won't shut off, burns at low level    Water overheats (possibly)
7) Leaks around valve                               Gas leaks into room                          Possible fire or gas asphyxiation
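When a worksheet like Table 10 grows large, it helps to keep it as data so rows can be filtered and sorted; a minimal, hypothetical sketch (the field names are illustrative, not from any FMEA standard):

```python
# A few Table 10 rows kept as records; the "safety" flag marks failure
# modes whose system effect is a safety hazard.  Field names are invented
# here for illustration only.
fmea_rows = [
    {"component": "stop valve", "mode": "fails closed",
     "local": "burner off", "system": "no hot water", "safety": False},
    {"component": "stop valve", "mode": "fails open",
     "local": "burner won't shut off",
     "system": "overheats; release valve releases pressure", "safety": True},
    {"component": "stop valve", "mode": "leaks around valve",
     "local": "gas leaks into room",
     "system": "possible fire or asphyxiation", "safety": True},
]

# Pull out the safety-critical modes for corrective-action tracking.
safety_critical = [r["mode"] for r in fmea_rows if r["safety"]]
print(safety_critical)
```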


5.2. Fault Tree Analysis

A fault tree is developed by considering a single,

important system-level failure and then identifying lower-

level failures that can cause that failure either directly or

indirectly or in combination with other failures. By

developing the lower-level failure mechanisms necessary to

cause the top-level effect a total overview of the system is

achieved. Once completed, the fault tree allows the system

designer to easily evaluate the effect of low-level changes on

the system safety and reliability.

Beginning with the top-level failure, the fault tree for a

failure mode at a given level is built up from combinations of

subsystem failures at levels lower than that at which the

failure is postulated. If the failure mode can result from any

of several lower level events, it is represented logically as the

OR of those events; if it can result only if all of several lower

level events occur, it is the AND of those events. This build-

up of failures gives a very visual representation of how

failures will propagate in the system. Table 11 shows the

symbols used to represent the logic gates in a fault tree.

Figure 11 shows an example of a partial fault tree for the

failure ―Box free falls‖ for a passenger elevator. Observe that

the fault tree can include operational causes of failure such as

―control unit disengages brake‖ as well as component failures

such as ―broken cable‖, and subsystem failures such as

―motor failure‖.

A fault tree is analyzed by determining the terminal node

probability of failure and combining these through the AND

and OR gates to determine the probability of the top event.

The fault tree can also be analyzed to find "cut sets"—

combinations of components whose failures are sufficient to

cause the top event.
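Under an independence assumption the gate arithmetic is simple; the sketch below (event probabilities are illustrative, not from the source) combines basic-event probabilities through AND and OR gates.

```python
import math

# OR gate: the output event occurs if any input event occurs;
# AND gate: only if all input events occur.  Both formulas assume the
# input events are statistically independent.
def gate_or(*ps):
    return 1.0 - math.prod(1.0 - p for p in ps)

def gate_and(*ps):
    return math.prod(ps)

# Illustrative fragment of the elevator tree: "no holding brake" is the OR
# of worn material, stuck solenoid, and control disengaging the brake; the
# free-fall path here needs that AND a slipped or broken cable.
p_no_brake = gate_or(1e-3, 5e-4, 1e-4)
p_top = gate_and(gate_or(2e-4, 1e-4), p_no_brake)
print(p_top)
```

Evaluating the tree bottom-up this way gives the top-event probability from the terminal-node probabilities, as the text describes.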

Recent advances in fault tree technology have enabled the

fault tree to better analyze the types of failures encountered in

computer systems. These include sequence failures and

maintenance operations [17, 18].

Key:
1. Box free falls
2. Cable slips off pulley
3. Holding brake failure
4. Broken cable
5. No holding brake
6. Motor turns free
7. Worn friction material
8. Stuck brake solenoid
9. Control unit disengages brake
10. No power to motor
11. Motor failure

Figure 11. Fault tree representation of the "Elevator box free falls" failure [19].

Table 11. Symbols used in the construction of a fault tree.

Basic Event (0 inputs): An event that is not further decomposed and for which reliability information is available. Reliability model: a component failure mode, or a failure mode cause.

Conditional Event (0 inputs): An event that is a condition of occurrence of another event when both must occur for the output to occur. Reliability model: occurrence of an event that must occur for another event to occur.

Undeveloped Event (0 inputs): A part of the system that has not yet been developed or defined. Reliability model: a contributor to the probability of failure, but the structure of that system part has not been defined.

Transfer Gate (0 inputs): A gate indicating that the corresponding part of the system fault tree is developed on another page or part of the diagram. Reliability model: a partial reliability block diagram shown in another location of the overall system block diagram.

OR Gate (2 or more inputs): The output event occurs if any of its input events occur. Reliability model: failure occurs if any of the parts of the system fail (series system).

Majority OR Gate (3 or more inputs): The output event occurs if m of the input events occur. Reliability model: k-out-of-n module redundancy.

Exclusive OR Gate (2 inputs): The output event occurs if one but not both input events takes place. Reliability model: failure occurs only if one, but not both, of the two possible failures occurs.

AND Gate (2 or more inputs): The output event occurs only if all of the input events occur. Reliability model: failure occurs if all of the parts of the system fail (redundant system).

NOT Gate (1 input): The output event occurs only if the input event does not occur. Reliability model: an exclusive event or preventative measure does not take place.


5.3. Markov Models

Markov models describe a system in terms of a set of

states and transitions between those states [2, 20]. In

reliability models the states usually represent the various

working and failed conditions of the system. Transitions

between states occur as various components fail and as

repairs are made. A Markov model is memoryless in the

sense that transitions between states depend only on the state

that the system is in and not its previous history; in particular,

the probability of a transition from one state to another does

not depend on how the system came to be in its present state

nor on how long it has been in that state. This memoryless

property of a Markov model implies that components in the

model must have a constant failure rate and a constant repair

rate. Figure 12 shows an example of a Markov model for a

repairable system. It has two states: S1 is the working state and S0 is the failed state. When in state S1, failures occur at rate λ and take the system to the failed state, S0. When in state S0, the system is repaired at rate μ, and a repair restores the system to the working condition. According to this model the system follows a pattern of "working", "failed", "working", "failed", etc., with the working state ending in a failure and the failed state ending in a repair.

Markov models are particularly useful for analyzing

repairable systems and calculating reliability measures such

as system availability. For the model in Figure 12 the system

availability at any time t, A(t), is the probability it is in state

S1 (working). If it is initially in state S1 (working) this is,

A(t) = μ/(λ+μ) + [λ/(λ+μ)] exp[-(λ+μ)t]

The steady state availability, A, is

A = μ/(λ+μ)

Figure 12. Markov model of a system with repair.
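A short numerical check of these two formulas (the failure and repair rates are illustrative):

```python
import math

# Point availability of the two-state model of Figure 12, starting in the
# working state S1: A(t) = mu/(lam+mu) + (lam/(lam+mu))*exp(-(lam+mu)*t).
def availability(t, lam, mu):
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)

lam, mu = 0.001, 0.1                 # illustrative failure and repair rates
print(availability(0.0, lam, mu))    # 1.0: the system starts working
print(mu / (lam + mu))               # steady-state availability, about 0.9901
```

As t grows, A(t) decays from 1 toward the steady-state value μ/(λ+μ).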

A slightly more complex model is shown in Figure 13. It

has 3 states for modeling the reliability and availability of a 2-

unit redundant system with repair. In this model we assume

that the units are identical and have a constant failure rate, λ.

If only one unit has failed the system is still operational; if

both units fail, the system fails. State S2 is the state with all

units working. State S1 is the state with one unit failed. Note

that this state does not distinguish which unit has failed since

the system behavior is the same in both cases. State S0 is the

system failed state—it is entered if both units have failed at

the same time. The failure rate for the transition from state S2 to state S1 is 2λ since both units are working and either can fail; from state S1 to S0 the failure rate is λ since there is only one unit that can fail. The repair rate when one unit has failed, i.e., for transitions from state S1 back to S2, is μ1. When the system has failed (i.e., it is in state S0) the model assumes that it is repaired at rate μ2 and that the repair fully restores the system to state S2.

From this model, it is straightforward, but rather tedious,

to find an expression for the subsystem reliability. States S2

and S1 are "working" states; hence the system availability is

the probability that the system is in one of those states. The

system reliability is the probability the system has not entered state S0 by time t.

Figure 13. 3-state Markov model of a 2-unit parallel system with repair.

The number of states and transitions in a Markov model

can be very large for a complex system since every

component failure (or failure mode!) can potentially result in

a different system state. Many software programs are

available for analyzing Markov models.
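For the three-state model of Figure 13 the state equations can also be integrated numerically; the sketch below uses a simple Euler step with illustrative rates (λ, μ1, μ2 as named in the figure).

```python
# Euler integration of the Figure 13 model.  p2, p1, p0 are the
# probabilities of states S2 (both units up), S1 (one failed), S0 (failed).
lam, mu1, mu2 = 0.001, 0.1, 0.05    # illustrative failure/repair rates
p2, p1, p0 = 1.0, 0.0, 0.0          # start with both units working
dt = 0.01
for _ in range(100_000):            # integrate out to t = 1000
    d2 = -2*lam*p2 + mu1*p1 + mu2*p0        # flow into/out of S2
    d1 = 2*lam*p2 - (lam + mu1)*p1          # flow into/out of S1
    d0 = lam*p1 - mu2*p0                    # flow into/out of S0
    p2, p1, p0 = p2 + d2*dt, p1 + d1*dt, p0 + d0*dt
print("steady-state availability ~", p2 + p1)
```

The availability is the probability of being in a working state, p2 + p1; with repair rates much larger than the failure rate it is very close to 1.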

6. PERSPECTIVES AND FUTURE CHALLENGES

Reliability has always been an important attribute of any

device for the simple reason that if the product couldn’t be

counted upon to function as intended it was of little value.

Reliability began to develop as a discipline with the advent of

the industrial age when formal consideration of reliability

problems began to be undertaken [21]. Design of equipment

for a certain anticipated life also dates back to this time

period. At first efforts to characterize equipment reliability

were applied mainly to mechanical equipment. A classic

example is the studies of the life characteristics of ball and

roller bearings in the early days of railroad transportation.

Electrification brought with it the need to make the

electric power supply much more reliable. This reliability

was achieved through the parallel operation of generators,

transformers, and transmission lines and the interlinking of

high-voltage supply lines into nation-wide power grids. Thus,

the use of redundancy and parallel operation, as well as better

equipment, solved what had once been a very real problem.

Today, the overall reliability of our power supply and

communication networks is truly astonishing.

Aircraft introduced a new set of constraints which made

the reliability problems of airborne equipment more difficult

to solve than those of stationary and land-based equipment.

They also added safety as an important concern. These

problems were solved largely through the intuition and

ingenuity of aircraft designers.

The age of electronics, the age of high-speed jet aircraft,

the age of missiles and spacecraft brought reliability into a

new era. Previously, the reliability problem had been largely

solved by the use of high safety factors, extensive use of

redundancy, and learning from the failures and breakdowns of

earlier designs when designing equipment of similar design.

Safety factors and redundancy added tremendously to the size

and weight of a piece of equipment at a time when aircraft

and missile development were demanding just the opposite so


that thousands of components could be squeezed into small

volumes. At the same time, the rapid progress of technology

nullified the efforts of those who hoped to learn from and

correct the mistakes of previous designs in the next design—

technological changes meant that the next design had to be

radically different from its predecessor. Very little use could

be made of the experience gained from previous mistakes and

neither time nor money was available for redesign as both had

to be directed to the next project. The earlier intuitive

approach to solving problems and the practice of redesigning

earlier designs, which had previously been used so

successfully, gave way to an entirely new approach to

reliability—one that was statistically defined, calculated, and

designed.

The computer age has brought about yet another change in

both the way reliability is practiced and in the problems that

must be addressed. Computer based tools now do many of

the mathematical calculations needed to analyze failure data;

and computer simulations permit the timely evaluation of

many alternative designs. Computers have also become an

integral part of almost every type of equipment with any

degree of complexity. Software controls the computer, and

even a small program can be incredibly complex.

Increasingly, simple testing is being seen as an inadequate

approach for providing reliable software. Instead, attention

has been focused on reducing the number of errors in the

development process and reliability tools such as failure

modes and effects analysis, fault tree analysis, and Markov

models have been extended so that they can be applied to

software systems.

In addition, human errors in operating and maintaining

equipment are becoming the predominant cause of failure.

Thus "human reliability" is emerging as an important area of

study and equipment must be designed so that it not only

functions correctly, but that it is easy to use and maintain

correctly (and difficult to use or maintain incorrectly).

The reliance of equipment on computers and the

connection of those computers into extensive communication

networks have also introduced a new type of challenge for

reliability engineers. The challenge is "man-made" but very

real nevertheless. Software viruses that can be carried on

programs from one computer to another, software worms that

can move along the connections of a communications

network, logic bombs that can be loaded into the software of a

computer, and Trojan horses that can hide malicious

programs in attractive packages all can cause catastrophic

failures of the computers that manage entire enterprises. So-

called "denial of service" attacks and "unauthorized access" to protected data housed in a computer can also prevent equipment from "operating successfully".

Finally, simply defining reliability in the 21st century is a

challenge. How should "reliability" be defined for an Internet service? What is a "failure" for a packet network?

With these new threats and challenges there are many

questions for reliability engineers to ponder and provide

useful answers to.

ACKNOWLEDGEMENT

I would like to thank John Healy, from whose "Basic Reliability" tutorial [1], given for many years at this

symposium, I borrowed much of the introductory material.

Also, the perspective on the early days of reliability

engineering is based on a similar discussion by Igor

Bazovsky, who wrote one of the first reliability engineering

textbooks [21].

REFERENCES

1. J. D. Healy, "Basic Reliability", Ann. Reliability and Maintainability Symp. Tutorial Notes, 2000.
2. J. B. Bowles, "Survey of Reliability Prediction Procedures for Microelectronic Devices", IEEE Transactions on Reliability, March 1993, pp. 2-12.
3. F. R. Nash, Estimating Device Reliability: Assessment of Credibility, Kluwer, Boston, 1993.
4. J. B. Bowles, "Simple, approximate, system reliability and availability analysis techniques", Reliability Review, Vol. 20, September 2000, pp. 5-11, 26-27.
5. P. D. T. O'Connor, Practical Reliability Engineering, 3rd ed. revised, Wiley, New York, 1995.
6. R. Roy, A Primer on the Taguchi Method, Van Nostrand Reinhold, NY, 1990.
7. "Reliability Prediction Procedure for Electronic Equipment", Mil-Hdbk-217F, December 1991.
8. "Reliability Prediction Procedure for Electronic Equipment", TR-NWT-000332, Issue 4, Bellcore, September 1992.
9. S. Pugh, "Quality assurance and design: the problem of cost versus quality", Quality Assurance, Vol. 4, March 1978, pp. 3-6.
10. "Electronic Reliability Design Handbook", Mil-Hdbk-338, October 1988, App. B, "Environmental Considerations in Design".
11. D. J. Klinger, Y. Nakada, M. A. Menendez, AT&T Reliability Manual, Van Nostrand Reinhold, 1990.
12. H. A. Chan and T. P. Parker, "Product Reliability Through Stress Testing", Ann. Reliability and Maintainability Symp. Tutorial Notes, 2000.
13. J. D. Musa, A. Iannino, and K. Okumoto, Software Reliability: Prediction and Measurement, McGraw-Hill, New York, 1987.
14. J. G. Elerath, "Specifying reliability in the disk drive industry: no more MTBF's", Proc. Ann. Reliability and Maintainability Symp., 2000, pp. 194-199.
15. J. B. Bowles, "Failure modes, effects, and criticality analysis", Ann. Reliability and Maintainability Symp. Tutorial Notes, 1999.
16. Society of Automotive Engineers, "Recommended failure modes and effects analysis procedures for non-automobile applications", SAE ARP5580, May 2000 (Draft).
17. L. L. Pullum and J. B. Dugan, "Fault-tree models for the analysis of complex computer-based systems", Proc. Ann. Reliability and Maintainability Symp., 1996, pp. 200-207.
18. J. B. Dugan, "Fault-tree analysis of computer-based systems", Ann. Reliability and Maintainability Symp. Tutorial Notes, 1999.
19. Reliability Analysis Center, Fault Tree Analysis Application Guide, 1990.
20. M. L. Shooman, Probabilistic Reliability: An Engineering Approach, 2nd Ed., Krieger, Malabar, 1990.
21. I. Bazovsky, Reliability Theory and Practice, Prentice-Hall, Englewood Cliffs, 1961.
22. R. F. Drenick, "The failure law of complex equipment", Journal of the Society of Industrial and Applied Mathematics, Vol. 8, December 1960, pp. 680-690.
23. R. H. Salzman, "Understanding Weibull Analysis", Ann. Reliability and Maintainability Symp. Tutorial Notes, 2000.