1 product reliability chris nabavi bsc smieee © 2006 pce systems ltd

1

Product Reliability

Chris Nabavi BSc SMIEEE

© 2006 PCE Systems Ltd

2

Reliability

Reliability is the probability that a piece of equipment will operate for a specified period of time under the working conditions for which it was designed.

3

The “Bathtub” Curve

[Figure: failures per hour plotted against time. The curve has three phases: infant mortality, the operational life phase, and end of life.]

4

Operational Strategy

1. Run the equipment without traffic until the infant mortality period has passed (the burn-in period)

2. Use the equipment during the operational life period

3. Retire or replace the equipment before the end of life period

5

Failure Rate

This is a statistical measure, applicable to a large number of samples.

The failure rate, $\lambda$, is the number of failures per unit time, divided by the number of items in the test.

$\lambda$ is constant during the operational phase.

$\lambda$ is often expressed in % per 1000 hours or in FITs (failures in $10^9$ hours).

6

Mean Time Between Failures (MTBF)

This is a statistical measure, applicable to a large number of samples.

The MTBF is the average time between failures, times the number of items in the test.

$\text{MTBF} = 1 / \text{failure rate} = 1 / \lambda$

7

Measured MTBF and Failure Rate

A manufacturer tests 3000 light bulbs for 300 hours and observes 5 failures.

Note: we don’t know the average time between failures from this test, because the bulbs have not all failed! But approximately:

MTBF = 3000 x 300 / 5 = 180,000 hours

Failure rate = 0.556 % per 1000 hours

This measured MTBF is an underestimate of the true MTBF.
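The light bulb arithmetic can be reproduced with a short Python sketch. The function names are illustrative only, and the calculation assumes a constant failure rate, so that total item-hours divided by the number of failures approximates the MTBF.

```python
def measured_mtbf(items: int, hours: float, failures: int) -> float:
    """Approximate MTBF in hours: total item-hours divided by observed failures."""
    return items * hours / failures


def failure_rate_units(mtbf_hours: float) -> dict:
    """Express the failure rate (1 / MTBF) in the units mentioned on the slides."""
    rate_per_hour = 1.0 / mtbf_hours
    return {
        "per_hour": rate_per_hour,
        "percent_per_1000h": rate_per_hour * 1000 * 100,
        "fits": rate_per_hour * 1e9,  # failures in 10^9 hours
    }


mtbf = measured_mtbf(items=3000, hours=300, failures=5)
units = failure_rate_units(mtbf)

print(f"MTBF: {mtbf:,.0f} hours")
print(f"Failure rate: {units['percent_per_1000h']:.3f} % per 1000 hours")
print(f"Failure rate: {units['fits']:.0f} FITs")
```

The same failure rate is roughly 5556 FITs, which shows how the two units on the Failure Rate slide relate.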

8

MTBF and End of Life

MTBF is a measure of quality and has nothing to do with the expected lifetime.

To visualise this, think of a candle. After three hours, the wax will all be used up and it will have reached its end of life. This is its expected lifetime.

However, a quality candle (higher MTBF) will be less likely to fizzle out halfway down. If we light a new candle, just as each old one runs out of wax, the mean time between being unexpectedly plunged into darkness is the MTBF.

9

Failure Rate (Graphical Representation)

[Figure: failures per hour plotted against time. An annotation on the curve reads: "The failure rate is the size of this gap".]

10

Example: Typical Hard Disc

Rated or expected life = 5 years

Guaranteed life = 3 years

MTBF = 1,000,000 hours (approx. 114 years)

Modern hard discs are fairly reliable, but being mechanical, they wear out after a few years.

11

Disc Replacement Strategy

Observation: The expected life is much less than the MTBF and discs are the “weak link” in the system.

Conclusion: Replace the discs just before they wear out under a preventative maintenance program.

12

Example MTBF Figures

ITEM                            MTBF
N. American Power Utility       2 months !
Router                          10 years
Uninterruptible Power Supply    11 years
File Server                     14 years
Ethernet Hub                    120 years
Transistor                      30,000 years
Resistor                        100,000 years

13

Operational Life Phase

Reliability theory only works in the operational life phase, where the failure rates are constant.

With this proviso, the maths is well established and closely related to statistics.

There is a large amount of statistical theory concerned with sampling procedures, aimed at estimating the MTBF of components.

From now on, we are only concerned with the operational phase.

14

Probability of Survival

[Figure: probability of survival (vertical axis, from 0 to 1) plotted against time. The curve $p = e^{-\lambda t}$ falls to 0.37 at time t = MTBF.]
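The survival curve is the exponential $p = e^{-\lambda t}$ with $\lambda = 1/\text{MTBF}$. A minimal Python sketch follows; the function name is illustrative, and the 5-year evaluation reuses the hard disc figures from the earlier example as an assumption. It covers random failures in the operational phase only, not wear-out.

```python
import math

HOURS_PER_YEAR = 8760


def survival_probability(t_hours: float, mtbf_hours: float) -> float:
    """Probability that a unit is still working after t_hours, for a constant failure rate."""
    failure_rate = 1.0 / mtbf_hours
    return math.exp(-failure_rate * t_hours)


# At t = MTBF the survival probability is 1/e, the 0.37 marked on the curve.
p_at_mtbf = survival_probability(1_000_000, 1_000_000)

# Assumed example: a disc with an MTBF of 1,000,000 hours over a 5-year rated life.
p_five_years = survival_probability(5 * HOURS_PER_YEAR, 1_000_000)

print(f"Survival at t = MTBF: {p_at_mtbf:.2f}")
print(f"Survival after 5 years: {p_five_years:.3f}")
```

Only about 37% of units are expected to survive to the MTBF, so the MTBF should not be read as a typical lifetime.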

15

Non-Redundant System Reliability

For a system S, made up of components A, B, C, etc.

$1/\text{MTBF}_S = 1/\text{MTBF}_A + 1/\text{MTBF}_B + 1/\text{MTBF}_C + \text{etc.}$

or $\lambda_S = \lambda_A + \lambda_B + \lambda_C + \text{etc.}$

These formulae are used to calculate the MTBF or failure rate of equipment, using published tables covering everything from a soldered joint to a disc sub-system.
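As a sketch of how the non-redundant formula is applied, the Python below adds the component failure rates and inverts the sum. The three-component system (router, UPS and file server, all needed at once) is a hypothetical illustration that borrows MTBF figures from the example table.

```python
def series_mtbf(component_mtbfs):
    """MTBF of a non-redundant system: the component failure rates (1 / MTBF) add up."""
    system_failure_rate = sum(1.0 / m for m in component_mtbfs)
    return 1.0 / system_failure_rate


# Hypothetical system: router (10 years), UPS (11 years), file server (14 years).
system_years = series_mtbf([10, 11, 14])
print(f"System MTBF: {system_years:.2f} years")
```

The system MTBF comes out at under 4 years, lower than that of its weakest component, because any single failure stops the whole system.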

16

Mean Time To Repair (MTTR)

The formulae discussed earlier assume zero maintenance, i.e. if a device breaks down, it is not fixed.

Often, it is important to know the probability of fixing a broken system within a given time T.

For this we need to know the MTTR, which is worked out by examining all the steps involved and the failure modes.

The probability of fixing the broken system within time T can then be predicted using similar exponentials to those seen already.
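The slide does not give the repair formula. One common reading of "similar exponentials" is to assume repair times are exponentially distributed with mean MTTR, so that the probability of completing a repair within time T is $1 - e^{-T/\text{MTTR}}$. The Python sketch below rests entirely on that assumption, and the 3-hour MTTR is a made-up figure.

```python
import math


def repair_probability(t_hours: float, mttr_hours: float) -> float:
    """Probability of finishing a repair within t_hours.

    Assumes exponentially distributed repair times with mean mttr_hours;
    this model is an assumption, not something stated on the slide.
    """
    return 1.0 - math.exp(-t_hours / mttr_hours)


# Hypothetical figures: MTTR of 3 hours, asking about a 6-hour window.
p_fixed = repair_probability(t_hours=6, mttr_hours=3)
print(f"Probability of repair within 6 hours: {p_fixed:.3f}")
```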

17

Operational Readiness

Operational readiness is the probability that a system will be ready to fulfil its function when called upon.

E.g. the probability that an email sent at a random time will get through.

$\text{Operational readiness} = \text{MTBF} / (\text{MTBF} + \text{MTTR})$
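The readiness formula is easy to evaluate directly. In the Python sketch below the file server MTBF of 14 years is taken from the example table, while the two MTTR values (3 hours and 3 weeks) are assumptions chosen only to show how strongly repair time affects the result.

```python
HOURS_PER_YEAR = 8760


def operational_readiness(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is expected to be ready: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


server_mtbf = 14 * HOURS_PER_YEAR  # file server figure from the example table

fast_repair = operational_readiness(server_mtbf, mttr_hours=3)            # assumed MTTR
slow_repair = operational_readiness(server_mtbf, mttr_hours=3 * 7 * 24)   # assumed MTTR

print(f"Readiness with a 3-hour MTTR: {fast_repair:.6f}")
print(f"Readiness with a 3-week MTTR: {slow_repair:.6f}")
```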

18

Active Redundancy

[Figure: Device 1 and Device 2 connected in parallel. Either device can do the job.]

19

Active Redundancy Calculations

$\text{MTBF}_S = 1/\lambda_1 + 1/\lambda_2 - 1/(\lambda_1 + \lambda_2)$

Probability of survival $= e^{-\lambda_1 t} + e^{-\lambda_2 t} - e^{-\lambda_1 t} \cdot e^{-\lambda_2 t}$

For an active redundancy system S, made from two identical sub-systems A:

$\text{MTBF}_S = 1.5 \times \text{MTBF}_A$

Note: The failure rate is no longer constant with time.
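The factor of 1.5 can be checked numerically. The Python sketch below evaluates the closed-form MTBF for two identical devices and compares it with the area under the survival curve, which equals the mean time to failure. The 1000-hour device MTBF and the integration step are arbitrary illustrative choices.

```python
import math


def active_survival(t: float, lam1: float, lam2: float) -> float:
    """Probability that at least one of two parallel devices is still working at time t."""
    p1 = math.exp(-lam1 * t)
    p2 = math.exp(-lam2 * t)
    return p1 + p2 - p1 * p2


def active_mtbf(lam1: float, lam2: float) -> float:
    """Closed-form MTBF of the two-device active redundancy system."""
    return 1 / lam1 + 1 / lam2 - 1 / (lam1 + lam2)


def integrate_survival(lam1: float, lam2: float, t_max: float, dt: float) -> float:
    """Trapezoidal estimate of the area under the survival curve from 0 to t_max."""
    steps = int(t_max / dt)
    total = 0.0
    for i in range(steps):
        t0 = i * dt
        left = active_survival(t0, lam1, lam2)
        right = active_survival(t0 + dt, lam1, lam2)
        total += 0.5 * (left + right) * dt
    return total


MTBF_A = 1000.0  # arbitrary illustrative value, in hours
lam = 1 / MTBF_A

closed_form = active_mtbf(lam, lam)
numeric = integrate_survival(lam, lam, t_max=20 * MTBF_A, dt=1.0)

print(f"Closed form: {closed_form:.1f} hours")
print(f"Numerical integration: {numeric:.1f} hours")
```

Both values agree with 1.5 x MTBF_A. Duplicating a device therefore does not double the MTBF when both devices are powered and ageing together.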

20

Passive Redundancy

[Figure: Device 1 and Device 2 with a switch-over between them. When device 1 fails, switch over to device 2. Device 2 is not normally powered.]

21

Passive Redundancy Calculations

For a passive redundancy system S, made from two identical sub-systems A, and ignoring the reliability of the switch-over system:

Probability of survival $= e^{-\lambda_A t} \times (1 + \lambda_A t)$

$\text{MTBF}_S = 2 \times \text{MTBF}_A$

Note: The failure rate is no longer constant with time.
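A small Python sketch of the passive redundancy formulae, comparing the standby pair with a single device at the same point in time. The 1000-hour device MTBF is an arbitrary illustrative value, and the switch-over is assumed perfect, as on the slide.

```python
import math


def single_survival(t: float, lam: float) -> float:
    """Survival probability of one device with constant failure rate lam."""
    return math.exp(-lam * t)


def passive_survival(t: float, lam: float) -> float:
    """Survival probability of two identical devices in passive (standby) redundancy."""
    return math.exp(-lam * t) * (1 + lam * t)


def passive_mtbf(lam: float) -> float:
    """MTBF of the standby pair: the spare only starts ageing when switched in."""
    return 2 / lam


MTBF_A = 1000.0  # arbitrary illustrative value, in hours
lam_a = 1 / MTBF_A

p_single = single_survival(MTBF_A, lam_a)
p_passive = passive_survival(MTBF_A, lam_a)

print(f"Single device at t = MTBF_A: {p_single:.3f}")
print(f"Standby pair at t = MTBF_A: {p_passive:.3f}")
print(f"Standby pair MTBF: {passive_mtbf(lam_a):.0f} hours")
```

At t = MTBF_A the standby pair is twice as likely to be working as a single device.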

22

Error Detection and Correction

There are two trivially simple ways to guard against errors:

Send the information twice: then at the receiver, if the copies are different, we have detected an error.

Send the information three times: then at the receiver, accept the majority verdict to correct an error.

But we can do better than this .....
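The two simple schemes on this slide can be sketched in a few lines of Python. The function names and the sample bit pattern are illustrative only.

```python
def detect_error(copy_a, copy_b) -> bool:
    """Send twice: a mismatch between the two received copies reveals an error."""
    return list(copy_a) != list(copy_b)


def majority_vote(copy_a, copy_b, copy_c):
    """Send three times: take the majority value at each bit position."""
    return [1 if a + b + c >= 2 else 0 for a, b, c in zip(copy_a, copy_b, copy_c)]


message = [1, 0, 1, 1]
corrupted = [1, 0, 0, 1]  # third bit flipped in transit

print(detect_error(message, corrupted))
print(majority_vote(message, corrupted, message))
```

Sending twice doubles the traffic and can only detect an error. Sending three times triples the traffic in order to correct one.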

23

The Hamming (7,4) code

0 1 0 0   1 0 1
0 0 1 0   1 1 0
1 0 0 0   0 1 1
0 0 0 1   1 1 1

In each row, the first four bits are information bits (green on the slide) and the last three are parity check bits (pink on the slide).

16 codewords can be obtained by adding any rows mod 2.

All 16 codewords have a Hamming distance of 3 or more, so the code can correct a single error.

The Golay (23,12) code has 12 information bits and 11 parity check bits and can correct 3 errors.
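The distance claim can be checked by brute force. The Python sketch below builds all 16 codewords from the four rows shown on the slide, measures the minimum Hamming distance between any two of them, and corrects a single flipped bit by picking the nearest codeword. Nearest-codeword search is used here only for clarity; it is not the efficient syndrome decoding a real implementation would use.

```python
from itertools import product

# The four rows from the slide: 4 information bits followed by 3 parity check bits.
GENERATOR = [
    (0, 1, 0, 0, 1, 0, 1),
    (0, 0, 1, 0, 1, 1, 0),
    (1, 0, 0, 0, 0, 1, 1),
    (0, 0, 0, 1, 1, 1, 1),
]


def all_codewords():
    """Add every subset of the rows mod 2 to obtain the 16 codewords."""
    words = []
    for coefficients in product((0, 1), repeat=4):
        word = [0] * 7
        for use_row, row in zip(coefficients, GENERATOR):
            if use_row:
                word = [(w + r) % 2 for w, r in zip(word, row)]
        words.append(tuple(word))
    return words


def hamming_distance(a, b) -> int:
    """Number of bit positions in which two words differ."""
    return sum(x != y for x, y in zip(a, b))


def correct(received, words):
    """Return the codeword closest to the received word."""
    return min(words, key=lambda w: hamming_distance(w, received))


CODEWORDS = all_codewords()
min_distance = min(
    hamming_distance(a, b)
    for i, a in enumerate(CODEWORDS)
    for b in CODEWORDS[i + 1:]
)

sent = CODEWORDS[5]
received = list(sent)
received[2] ^= 1  # flip one bit in transit

print(len(set(CODEWORDS)), min_distance)
print(correct(tuple(received), CODEWORDS) == sent)
```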

24

Redundant Array of Independent Discs

There are 5 RAID levels:

1. Mirrored discs

2. Hamming code error correction

3. Single check disc per group

4. Independent read and write

5. Spread data and parity over all discs

In each case, disc errors are corrected; the differences are largely in the system performance.

25

Effect of Non-Maintenance

Consider a file server with 6 discs in an array.

The probability of getting a disc failure in a 6-disc RAID array in one year is about 5%, which is fairly high. Assume that this happens and the problem is left unfixed for a further 3 weeks.

The probability of another disc failing in this time is about 0.25%. If this happens too, you lose the server!

The odds of this happening in any year are 5% of 0.25%, or 1 in 8000, divided by the number of RAIDs.
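The figures on this slide can be approximated from a per-disc MTBF. The Python sketch below assumes each disc has the 1,000,000-hour MTBF quoted in the earlier hard disc example (the slide itself does not state which MTBF was used) and assumes a constant failure rate. The function name is illustrative.

```python
import math

HOURS_PER_YEAR = 8760
DISC_MTBF_HOURS = 1_000_000  # assumption: taken from the earlier hard disc example


def any_disc_fails(n_discs: int, hours: float, mtbf_hours: float) -> float:
    """Probability that at least one of n_discs fails within the given number of hours."""
    return 1.0 - math.exp(-n_discs * hours / mtbf_hours)


# First failure: any of the 6 discs, at some point during one year.
p_first = any_disc_fails(6, HOURS_PER_YEAR, DISC_MTBF_HOURS)

# Second failure: any of the 5 remaining discs during the 3 unrepaired weeks.
p_second = any_disc_fails(5, 3 * 7 * 24, DISC_MTBF_HOURS)

p_loss = p_first * p_second

print(f"First failure in a year: {p_first:.1%}")
print(f"Second failure within 3 weeks: {p_second:.2%}")
print(f"Odds of losing the array: about 1 in {1 / p_loss:,.0f}")
```

Under these assumptions the results land close to the rounded 5%, 0.25% and 1 in 8000 quoted on the slide.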

26

Effect of Improving the MTTR

If the MTTR of the RAID had been 3 hours instead of 3 weeks, the odds are somewhat different:

The probability of the first failure is still 5%. But now the probability of getting a second failure in the ensuing 3 hours is about 0.0015% instead of 0.25%, because 3 hours is 1/168 of 3 weeks and the failure rate is constant.

So the odds of losing the RAID in any year improve to roughly 1 in 1,300,000 from the previous 1 in 8000.

Moral of the story: Fix the First Fault Fast
