
International Journal of Reliability, Quality and Safety Engineering
Vol. 18, No. 1 (2011) 61–98
© World Scientific Publishing Company
DOI: 10.1142/S0218539311004093

REDUNDANCY ISSUES IN SOFTWARE AND HARDWARE SYSTEMS: AN OVERVIEW

MADHU JAIN∗,‡ and RITU GUPTA†,§

∗Department of Mathematics, I.I.T. Roorkee-247667, India

†Department of Mathematics, Institute of Basic Science, Khandari, Agra-282002, India

‡[email protected]§gupta [email protected]

Received 22 September 2010
Revised 10 January 2011

Redundancy is a widely used technique for building computing systems that continue to operate satisfactorily in the presence of faults occurring in hardware and software components. The principal objective of applying redundancy is to achieve reliability goals subject to techno-economic constraints. Owing to the many applications arising in both industrial and military organizations, especially in embedded fault tolerant systems including telecommunication, distributed computer systems, automated manufacturing systems, etc., the reliability and dependability measures of redundant computer-based systems have become attractive features for system designers and production engineers. However, even with the best design of redundant computer-based systems, software and hardware failures may still occur due to many failure mechanisms, leading to serious consequences such as huge economic losses, risk to human life, etc. The objective of the present survey article is to discuss various key aspects, failure consequences and methodologies of redundant systems, along with software and hardware redundancy techniques which have been developed at the reliability engineering level. The methodological aspects which depict the required steps to build a block diagram composed of components in different configurations, as well as Markov and non-Markov state transition diagrams representing the structural system, have been elaborated. Furthermore, we describe the reliability of a specific redundant system and its comparison with a non-redundant system to demonstrate the tractability of the proposed models and its performance analysis.

Keywords: Redundancy; software and hardware system; reliability; survey; fault-tolerance.

Nomenclature and Notations

NHPP : Non-homogeneous Poisson process
RBD : Reliability block diagram
TMR : Triple modular redundancy
NMR : N-modular redundancy


Int. J. Rel. Qual. Saf. Eng. 2011.18:61-98. Downloaded from www.worldscientific.com by UNIVERSITY OF HONG KONG LIBRARIES - ACQUISITIONS SERVICES DEPARTMENT on 04/25/13. For personal use only.

http://dx.doi.org/10.1142/S0218539311004093



DWC : Duplication with comparison
SS : Standby sparing
PAS : Pair-and-a-spare
NMRS : N-modular redundancy with spares
SP : Self purging
SOM : Sift-out modular
TD : Triplex-duplex
RB : Recovery blocks
AT : Acceptance test
NVP : N-version programming
NSCP : N-self-checking programming
RtB : Retry block
NCP : N-copy programming
DM : Decision mechanism
MTTF : Mean time-to-failure
N : Number of modules
$X(t)$ : State of the system at time $t$
$\lambda$ : Failure rate parameter
$p_i(t)$ : Probability that the system is in state $i$, $i = 1, 2, \ldots, N$
$R_i$ : Reliability of module $i$, $i = 1, 2, \ldots, N$
$R_s(t)$ : Reliability of the system at time $t$

1. Introduction

Advances in information technology over the past decades have resulted in an exponential growth of computer-based systems. With fast-growing technology, a computer system may consist of hundreds or thousands of interacting software and hardware components. The evolutionary improvements in hardware and software performance include changes in computer architecture, vast increases in memory and storage capacity, and a wide variety of exotic input and output options. During the last few years, computer applications have become ever more complex in both design and architecture; however, they are often built from unreliable hardware and software components. The primary goals of hardware/software engineering are to improve the quality of products and to increase productivity in terms of product reliability and efficiency.

Redundancy issues in both hardware and software systems play a key role in the successful design and functioning of sophisticated computer systems. For embedded computer systems consisting of software and hardware components, redundancy can be achieved by applying extra copies of hardware and software components in parallel, which provide an alternate path for successful operation. Redundancy techniques based on reliability theory are commonly applied to achieve high reliability, maintainability, availability and safety of the system, and also to handle the system workloads. Highly reliable systems are required in situations in which repair action cannot be taken (e.g. spacecraft), and in situations where computers are employed to perform a critical function in which even a small amount of time lost due to repairs cannot be tolerated, such as flight-control systems, nuclear missiles, etc.

System redundancy is common in many real-time systems, as it is practically impossible to make a perfect system in which component failure (hardware/software) leading to system failure does not take place. The realization of redundancy techniques through redundant hardware and software components was conceived in the early 1950s. A vast literature can be found on various hardware and software redundancy techniques and their reliability modeling for redundant systems.11,55,82,97,98,101 Hardware redundancy techniques can be broadly categorized into two forms, static and dynamic redundancy. Static redundancy involves fault masking, error detecting and correcting codes, etc. and provides an effective immediate action. Dynamic redundancy includes standby spare components, which may be hot, cold or warm, to replace the faulty ones. Sometimes any amount of redundant hardware can fail due to faulty software; as such, hardware redundancy alone is not sufficient to achieve a highly reliable or available computer system. The technology developed for software redundant systems yields a high degree of satisfaction in terms of reliability and availability. As a result, software redundancy techniques have become an implicit requirement in most applications. Software redundancy is achieved by incorporating additional software components that are not exactly identical but are similar in functionality.

Hardware and software reliability are both important factors affecting system reliability and have been discussed by Yamada and Osaki,83 Dhillon and Ugwu,19 Nassar,67 Levitin,59 Immonen and Niemela,39 Yang et al.101,102 and many others. It is widely recognized that software reliability differs from hardware reliability, since the causes of failure in hardware and software are different and software reliability reflects design perfection rather than manufacturing perfection. Hardware reliability changes during certain periods, such as at initial use or at the end of useful life, whereas software reliability varies during the development and testing phases. Moreover, unlike hardware, software does not degrade physically as a function of time or environmental stresses. However, it is not reasonable to expect a software system to operate on the same input data, computing environment and user requirements indefinitely. Thus, in the computing world the emphasis on achieving high reliability is shifting from hardware to software, because software faults are the root causes of a high percentage of operational system failures in real-time embedded computer systems.

An adequate representation of the reliability model is also important and should be carefully taken into account. The most widely used reliability modeling technique is combinatorial modeling, such as reliability block diagrams (RBD) and fault/event trees. These are useful modeling tools; however, they are not capable of depicting the dynamic behavior of the system, such as load sharing, standby redundancy, imperfect fault coverage, complex repair policies, etc. This is mainly because the interaction between two or more components/subsystems can affect the system performance significantly. To overcome this problem, state space methods such as Markov models, Petri nets and hybrid models (combinations of combinatorial and state space methods) may be adopted in order to analyze more complicated systems.

In this article, we provide a comprehensive survey of redundancy issues associated with the design and analysis of software and hardware redundant systems. The remainder of this article is organized as follows. In Sec. 2, we discuss software and hardware faults and their classification. In Sec. 3, some elementary techniques related to the fault/failure phenomenon that are used to improve the performance of redundant systems, and various failure-related issues, are addressed. Section 4 is concerned with the methodological aspects of solving reliability problems. Section 5 highlights the basic features of system redundancy. Section 6 presents hardware redundancy techniques. Software redundancy techniques are discussed in Sec. 7. Finally, Sec. 8 summarizes the contributions of the present work.

2. Classification of Hardware and Software Faults

Generally, a fault is defined as a physical defect, imperfection or design flaw in hardware or software components. An error is a manifestation of the fault caused by its activation or execution. A failure is the incorrect performance of some function in the system. Specifically, faults are the causes of errors, and errors are the causes of failures.73 The terms error and fault are often used synonymously, and both can propagate to a system failure. Faults can occur during a computer's specification, design, implementation, modification and installation, and throughout its operational life. One way to classify faults is by two attributes: the nature and the duration of faults.24 The nature of a fault refers to whether the fault is a hardware fault or a software fault, whereas the duration of a fault refers to the way the fault is activated. With respect to fault duration, hardware faults can be classified into the following three forms.

(i) Permanent faults: A fault is said to be permanent if it continues to remain active until it is repaired. Physical defects in the hardware such as short circuits and connection disruptions, as well as design errors, are examples of permanent faults.

(ii) Transient faults: Transient faults remain active for a short period of time and disappear quickly. They are often detected through the errors that result from their propagation. They are usually referred to as soft faults or glitches and are mostly induced by random environmental disturbances such as voltage fluctuations, electromagnetic interference, radiation, etc.

(iii) Intermittent faults: Intermittent faults activate, deactivate and reactivate repeatedly. They are often attributed to design errors that result in marginal or unstable hardware. Intermittent faults are difficult to predict, but their effects are highly correlated. In the presence of these faults, the system works well most of the time but fails under certain environmental conditions, as in the case of a fault due to a loose wire.

The nature of software faults is different from that of hardware faults. The source of failure in software is a design fault, while the causes of hardware failure may be physical deterioration, a manufacturing defect or poor quality of material. Software faults are all design faults, which are harder to visualize, classify, detect and correct. The faults may be due to programming errors, specification errors, etc. Software cannot physically break after being installed in a computer system; however, some latent faults in the program code may be activated during operation. This type of problem can be seen under heavy or unusual workloads of the programs, which eventually lead to system failure. Software faults that are activated on every execution are usually detected in the testing phase and are corrected before the software is released. A few software faults that are rarely activated escape the testing and debugging processes and therefore remain in the software when it is released. Such latent faults are not activated on every execution; the faulty piece of code is executed only when a certain external event, the fault trigger event, occurs.

A typical way to classify software faults (bugs) is based on the type of failure that occurs in the software. Software faults can be classified into the following three categories, which are defined in the context of software operation.

(i) Heisenbugs: Heisenbugs are bugs which may or may not cause a failure for a given operation. If a Heisenbug is present in the system, the error may disappear when the operation is retried. Heisenbugs are also called transient or intermittent faults.31

(ii) Bohrbugs: Bohrbugs are bugs which always cause a failure when a particular operation is performed. In the presence of a Bohrbug, retrying the operation that caused the failure always fails again. Bohrbugs are also called permanent faults.31

Most industrial software systems are released after passing design reviews, quality assurance, and alpha, beta and gamma tests. They are mostly free from Bohrbugs, but Heisenbugs may persist. In this case, if the software system is restarted, it would function correctly.

(iii) Aging-related bugs: These are bugs which appear when software systems run continuously for a long time; in such cases, due to aging, the software systems tend to show degraded performance and an increased failure occurrence rate. These bugs depend on the internal environment of the system, such as physical memory left unreleased because of memory leaks.32
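The operational difference between Heisenbugs and Bohrbugs can be illustrated with a small retry sketch. The names `make_flaky_operation`, `always_failing_operation` and `retry` are hypothetical and introduced only for illustration, and a Heisenbug is modeled very crudely as an operation that fails a fixed number of times before succeeding.

```python
class TransientError(Exception):
    """Raised by the hypothetical operation when a Heisenbug-like fault is triggered."""


def make_flaky_operation(failures_before_success):
    """Return an operation that fails a fixed number of times, then succeeds.

    This is a deterministic stand-in for a Heisenbug (assumption for illustration).
    """
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientError("transient fault triggered")
        return "ok"

    return operation


def always_failing_operation():
    """Stand-in for a Bohrbug: the same operation fails on every attempt."""
    raise ValueError("deterministic fault")


def retry(operation, max_attempts=3):
    """Run operation up to max_attempts times; return (result, attempts_used)."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(), attempt
        except Exception as error:  # broad on purpose; real code should be narrower
            last_error = error
    raise last_error
```

Under these assumptions, retrying recovers from the transient fault, whereas the deterministic fault is raised again after the last attempt, which is why retry and restart help mainly against Heisenbugs.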




3. Some Failure/Fault Related Issues

A system is said to have failed if it does not perform as expected. It is worthwhile to mention some important failure/fault-related issues that arise when examining the reliability of redundant systems under active and standby configurations.

3.1.

In practice, a number of failure factors can significantly reduce the reliability of redundant systems. Common cause failures and load sharing phenomena are the failure issues of most concern in active redundant systems. In standby redundant systems, switching failure is a common consideration.

3.1.1. Common cause failure

In a common cause failure, two or more redundant components in the system may fail simultaneously due to some common reason. Such failures have a strong potential to reduce the benefit gained from redundant configurations. Common cause failures may be produced by common electric connections, shared environmental stresses such as dust, vibration and humidity, or common maintenance problems. Power failure in manufacturing systems, common communication buses in computer networks, etc. are also important examples of common cause failure. A great deal of engineering effort is expended on identifying possible common cause failure mechanisms and eliminating them. However, in some cases it may be impossible to eliminate the causes entirely, and therefore reliability modeling must take them into account. In this spirit, a considerable amount of work incorporating this concept has been done by many researchers. An analytical methodology and examples of common cause failure have been presented by Mosleh.65 Jain and Ghimire42 obtained the reliability of a k-r-out of-n: G system subject to random and common cause failure. The reliability of a two-unit system with common cause shock failures was computed by Jain.41 Jain et al.43 suggested loading policies for an M-r-out of-N: G system subject to common cause failure. Kang et al.46 investigated standby safety systems consisting of more than two redundant components.
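To see how a common cause can erode the benefit of redundancy, the following minimal sketch uses the beta-factor simplification. This model is an assumption chosen for illustration and is not taken from the works cited above: each component has total failure rate `lam`, a fraction `beta` of which is attributed to a shared cause that fails all components at once, and all lifetimes are exponential.

```python
import math


def parallel_reliability_independent(lam, t, n=2):
    """Reliability of n parallel components with independent exponential lifetimes."""
    r = math.exp(-lam * t)
    return 1.0 - (1.0 - r) ** n


def parallel_reliability_beta_factor(lam, beta, t, n=2):
    """Reliability of n parallel components under the beta-factor assumption.

    Independent failures occur at rate (1 - beta) * lam per component;
    a common cause shock at rate beta * lam fails every component together.
    """
    r_independent = math.exp(-(1.0 - beta) * lam * t)
    survive_common_cause = math.exp(-beta * lam * t)
    return (1.0 - (1.0 - r_independent) ** n) * survive_common_cause


if __name__ == "__main__":
    lam, t = 1e-3, 1000.0  # hypothetical rate (per hour) and mission time (hours)
    for beta in (0.0, 0.1, 0.5, 1.0):
        print(beta, round(parallel_reliability_beta_factor(lam, beta, t), 4))
```

With `beta = 0` the sketch reduces to the independent parallel system, and with `beta = 1` the redundant pair is no better than a single component.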

3.1.2. Load sharing

Another cause of reliability degradation in active redundant systems is load sharing. In load sharing systems, the failure of one component increases the stress level on the others and therefore increases the failure rates of the surviving components. The stress may be an electrical load, a load caused by high temperature or an information load. This introduces failure dependency between the load sharing components, which increases the complexity of analyzing redundant systems. Fortunately, in a redundant system with sufficient capacity, the increased failure rate is unlikely to lead to unacceptable failure probabilities. When the system experiences the first failure, if it is detected, the system may be required to work for only a short period of time before the completion of repairs. From this viewpoint, the load sharing degradation problem is less serious than common cause failures. Some research investigations have appeared in the literature on load sharing redundant systems. A flow-graph based approach has been suggested by Jenab and Dhillon44 to analyze multi-state k-out-of-n:G/F load sharing systems. Singh et al.86 investigated k-component load sharing systems and obtained the load sharing parameters under classical and Bayesian set-ups. In this sequence, an optimal load allocation for load sharing k-out of-n:F systems has been obtained by Yamamoto et al.100 A recent contribution in this field is due to Deshpande et al.17
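The failure dependency introduced by load sharing can be sketched with a small Markov model. The sketch assumes two identical components without repair: while both work, each fails at rate `lam`; after the first failure the survivor carries the full load and fails at a higher rate `lam_after`. Both parameter names and the numerical values are hypothetical.

```python
import math


def load_sharing_reliability(lam, lam_after, t):
    """Reliability of a two-component load-sharing parallel system.

    States: both components up (left at rate 2 * lam), one component up
    (left at rate lam_after), system failed. The result is the probability
    of being in either of the first two states at time t.
    """
    two_lam = 2.0 * lam
    p_both_up = math.exp(-two_lam * t)
    if math.isclose(two_lam, lam_after):
        # Limiting form of the expression below when lam_after equals 2 * lam.
        p_one_up = two_lam * t * math.exp(-two_lam * t)
    else:
        p_one_up = (two_lam / (two_lam - lam_after)) * (
            math.exp(-lam_after * t) - math.exp(-two_lam * t)
        )
    return p_both_up + p_one_up


if __name__ == "__main__":
    lam, t = 1e-3, 1000.0  # hypothetical values
    for factor in (1.0, 1.5, 2.0, 3.0):
        print(factor, round(load_sharing_reliability(lam, factor * lam, t), 4))
```

When `lam_after` equals `lam` there is no load effect and the result coincides with the ordinary parallel formula $1 - (1 - e^{-\lambda t})^2$; larger values of `lam_after` lower the system reliability.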

3.1.3. Switching failure

In many systems with standby components, a standby component automatically becomes active in the event of the failure of the active component. However, in some standby systems a switching device is also present and is used to switch over to the standby component when an active component fails, so that the system can resume operation. The presence of a switching device has a significant effect on the reliability of a standby system, so the failure and reliability properties of the switch must also be included when analyzing redundant systems. Standby systems are inherently superior to active systems, but most of this superiority depends on the reliability of the standby switch. Many models have been developed by several researchers for analyzing standby redundant systems with switching failures, including notable contributions due to Alidrisi,3 Chung,14 Ke et al.,49 Pan,69 Wang and Chen94 and others.
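The dependence of standby superiority on the switch can be made concrete with a minimal sketch. It assumes two identical units with exponential lifetimes of rate `lam`, a cold standby that cannot fail while idle, no repair, and a switch that succeeds with probability `switch_success` at the instant of switchover. The function and parameter names are hypothetical.

```python
import math


def cold_standby_reliability(lam, t, switch_success=1.0):
    """Two-unit cold standby: R(t) = exp(-lam * t) * (1 + p * lam * t).

    The first term covers the primary surviving to t; the second covers the
    primary failing before t, the switch working (probability p), and the
    standby surviving the remaining time.
    """
    return math.exp(-lam * t) * (1.0 + switch_success * lam * t)


def active_parallel_reliability(lam, t):
    """Two-unit active (parallel) redundancy with independent units."""
    r = math.exp(-lam * t)
    return 1.0 - (1.0 - r) ** 2


if __name__ == "__main__":
    lam, t = 1e-3, 1000.0  # hypothetical values
    print("active parallel:", round(active_parallel_reliability(lam, t), 4))
    for p in (1.0, 0.9, 0.5):
        print("standby, p =", p, round(cold_standby_reliability(lam, t, p), 4))
```

Under these assumptions the standby arrangement outperforms active redundancy when the switch is perfect, but a sufficiently unreliable switch reverses the ordering.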

3.2.

Achieving a highly reliable system from the customer's perspective is very demanding for all software developers and system designers. To tackle fault-related issues, the following techniques are commonly used (see Refs. 50, 58 and 64).

3.2.1. Fault prevention/avoidance

Fault prevention techniques are used to prevent the occurrence or introduction of faults in a dependable system working in a computing environment. Fault prevention is achieved by quality control techniques employed during the development and design of hardware and software. Rigorous design rules, component screening and testing techniques are employed in hardware, while structured programming, modularization and formal verification techniques are employed in software. Another common approach to fault prevention is shielding from operational physical faults caused by radiation, humidity, heat, etc. User and operation faults are prevented by training and rigorous maintenance procedures. Malicious faults are prevented by firewalls and similar security measures.

3.2.2. Fault tolerance

Fault tolerance is one of the important approaches to achieving highly reliable computing systems. Fault tolerance techniques deliver continuous service in the presence of hardware and software faults by providing redundant hardware and software components. A typical fault tolerant system generally includes fault detection, location, containment and subsequent recovery.

(i) Fault detection generates an error signal or message when a fault occurs within a system. Detection is done by techniques such as an acceptance test or a comparator; in general, detection alone cannot determine which component or module has failed.

(ii) Fault location is a mechanism to determine where a fault has occurred.

(iii) Fault containment is the process of isolating a fault so that its effects do not spread throughout the system. It prevents further propagation of the effects of faults, for example through exception handling routines that treat unsuccessful operations. Fault isolation can be achieved by multiple request protocols, by employing consistency checks between modules and by performing frequent fault detection.

(iv) Fault recovery is a process that transforms a system state containing one or more isolated faults into an operational state without faults that may be activated again. This mechanism recovers system operations from erroneous conditions, for example through checkpointing and rollback mechanisms.

(v) Fault masking is another fault tolerance technique; it hides the occurrence of faults and prevents faults from resulting in errors. It provides continuous system operation and prevents faults in the system from introducing errors into the informational structure of that system.
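Two of the mechanisms listed above, detection by a comparator and masking by a voter, can be sketched in a few lines. The functions `compare` and `majority_vote` are hypothetical illustrations that assume module outputs are hashable values which can be compared for exact equality.

```python
from collections import Counter


def compare(output_a, output_b):
    """Comparator for duplicated modules: True if the two outputs agree.

    A disagreement detects that a fault has occurred but does not
    locate the faulty module.
    """
    return output_a == output_b


def majority_vote(outputs):
    """Voter for N redundant modules: return the strict-majority output.

    A minority of faulty outputs is masked. If no value has a strict
    majority, the fault cannot be masked and an error is raised.
    """
    if not outputs:
        raise ValueError("at least one module output is required")
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        raise RuntimeError("no majority: fault cannot be masked")
    return value


if __name__ == "__main__":
    print(compare(5, 7))             # disagreement between duplicated modules
    print(majority_vote([5, 5, 7]))  # the single faulty output is outvoted
```

With three modules the voter tolerates one faulty output, whereas the comparator alone only signals that the two copies disagree.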

3.2.3. Fault removal

The fault removal technique is mainly used to reduce the number of faults present in the system. Fault removal is performed during both the development and operational phases of the system. Fault removal during the development phase is completed in three steps:

(i) Verification
(ii) Diagnosis
(iii) Correction

Verification is the process of checking whether the system satisfies the pre-specified conditions. If it does not, the next step is diagnosing the faults that prevented the verification conditions from being fulfilled and then performing the necessary corrections. During the operational phase, fault removal is performed in the following two forms:

(i) Corrective maintenance
(ii) Preventive maintenance

During the operational phase, corrective maintenance is performed to remove faults that have produced one or more errors and have been reported. In preventive maintenance, adjustments are made, or parts which may become faulty during normal operation are replaced, before a system failure occurs. In addition, preventive maintenance avoids the high cost of replacement and damage to the surroundings of the system components. It should be mentioned that the corrective and preventive forms of fault removal are applied to fault tolerant as well as non-fault tolerant systems, which can be maintained without interrupting service delivery or during a service outage.

3.2.4. Fault/failure forecasting

The faults in a system are forecast by evaluating the system behavior with respect to fault occurrence or activation. The evaluation can be either qualitative or quantitative. The aim of qualitative evaluation is to identify, classify and rank the failure modes, or the event combinations in terms of component failures, that may result in system failures. Failure mode and effect analysis is an example of a qualitative evaluation method. Quantitative evaluation is performed in terms of the probabilities to which dependability attributes such as availability, reliability, safety, maintainability, etc. are satisfied. Various reliability techniques, namely Markov chains, stochastic Petri nets, etc., are used for quantitative evaluation. Some methods can be used to perform both qualitative and quantitative evaluation; reliability block diagrams, fault trees, etc. fall in this category.

4. Reliability Modeling

The analysis of redundant systems in the literature is focused mainly on determining the reliability, which is an important quality measure of a system. Computer-based systems are viewed as one of many system components. System analysts often consider the estimation of hardware and software reliability essential in order to estimate the full system reliability. Hardware reliability encompasses a wide spectrum of analyses that strive systematically to reduce or eliminate system failures which adversely affect performance. Software reliability likewise strives systematically to reduce or eliminate failures which adversely affect the performance of a software program. Hardware reliability can be improved by better design, better material, applying redundancy and accelerated life testing, while software reliability can be improved by increasing the testing effort and by correcting detected faults.




Reliability is “the probability of failure free operation of a computer program in a specified environment for a specified period of time”.

Mathematically, the system reliability $R_s(t)$ is defined as the conditional probability that the system operates correctly throughout the interval $[t_0, t]$, given that it was operating correctly at $t_0$, i.e. $R_s(t) = P(z > t)$, $t \ge t_0$, where $z$ is a random variable denoting the time-to-failure. The unreliability, or failure time distribution, $F(t)$ is defined as the conditional probability that the system fails by time $t$: $F(t) = P(z \le t)$, $t \ge t_0$. If $f(t)$ is the probability density function of the time-to-failure random variable $z$, then $R_s(t) = \int_t^{\infty} f(u)\,du$. If the lifetime of the system is exponentially distributed, then $R_s(t) = e^{-\lambda t}$.

The overall system reliability can be evaluated by developing reliability models. The complexity of reliability models depends on various factors such as mission profile, function criticality and redundancy characteristics. The main techniques used for reliability modeling are:

(i) Combinatorial modeling
(ii) Markov modeling
(iii) Non-Markovian models
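Before turning to the individual modeling techniques, the exponential reliability law stated earlier in this section can be sketched numerically. The function names and the failure rate value are hypothetical; the relation MTTF $= 1/\lambda$ is the standard result for an exponentially distributed lifetime.

```python
import math


def reliability(failure_rate, t):
    """R_s(t) = exp(-lambda * t) for an exponentially distributed lifetime."""
    return math.exp(-failure_rate * t)


def unreliability(failure_rate, t):
    """F(t) = 1 - R_s(t), the probability that the system fails by time t."""
    return 1.0 - reliability(failure_rate, t)


def mttf(failure_rate):
    """Mean time-to-failure of an exponential lifetime, 1 / lambda."""
    return 1.0 / failure_rate


if __name__ == "__main__":
    lam = 1e-3  # hypothetical failure rate, failures per hour
    print("MTTF (hours):", mttf(lam))
    for hours in (100, 1000, 5000):
        print(hours, round(reliability(lam, hours), 4))
```

At a mission time equal to the MTTF, the reliability has already dropped to $e^{-1}$, roughly 0.37, which is one reason a single module is rarely adequate for long missions.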

4.1. Combinatorial modeling

Combinatorial modeling is an analytical approach in which governing equations describe the system behavior; it has been discussed by many researchers. Recent developments in this field include Carrasco and Sune,10 Choi and Seong,12 Distefano and Puliafito,21 and others. In this technique, we enumerate all the possible ways in which a system can continue to operate, given the probability of failure of its individual components. In other words, the failures of the individual components, which are assumed to be mutually independent, are enumerated to estimate the system's reliability. Many configurations are used to model the interconnection among the system's components, such as series, parallel, series-parallel, parallel-series, M-out of-N, etc. Combinatorial modeling of system reliability mainly includes two approaches: (i) reliability block diagrams and (ii) fault trees. Here we discuss the reliability block diagram, which is the oldest and most common reliability model.

Reliability block diagram (RBD): Reliability block diagrams are widely used in engineering and other industrial settings to describe the behavior of the system's components. The components are represented as blocks showing the operational dependency between them from the reliability viewpoint. The system can be broadly categorized into two configurations, i.e. series and parallel. More complex redundant systems may be analyzed by building mixed configurations such as series-parallel, parallel-series, M-out of-N systems, etc.

Fig. 1. Failures sequence of hardware and software faults. [Figure not reproduced. Node labels: Hardware Fault, Software Fault, Error, Error Recovery, Undetected Failure, System Failure, No Failure. Fault causes shown: specification, design, human, data corruption, electrical interference, process, wearout, overstress.]

(i) Series configuration : A system is said to be in series configuration when all the components (blocks) are necessary for the system to be operational, i.e. the failure of any single component leads to system failure. The graphical representation of a series system is shown in Fig. 2(a).

(ii) Parallel configuration : A parallel configuration is shown in Fig. 2(b). In such a configuration, the system fails only when all components of the system fail. The system reliability in a parallel configuration is higher than the reliability of any single component.

(iii) Series-parallel/parallel-series configuration : Some systems are made up of combinations of several series and parallel configurations, as shown in Figs. 2(c) and 2(d). To obtain the system reliability in such cases, one way is to break the total system configuration into subsystems. Each of these subsystems is then considered separately as a component and its reliability is calculated. Finally, these component reliabilities are combined into a single system to obtain its reliability.
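The decomposition procedure for configurations (i) to (iii) can be sketched as follows, assuming statistically independent components with known reliabilities. The function names are hypothetical, and which mixed arrangement is labeled series-parallel and which parallel-series varies between texts, so the helpers are named after what they compute.

```python
from math import prod


def series(reliabilities):
    """Series block: every component must work, so reliabilities multiply."""
    return prod(reliabilities)


def parallel(reliabilities):
    """Parallel block: the system fails only if all components fail."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


def series_of_parallel_groups(groups):
    """Subsystems connected in series, each a parallel group of components."""
    return series(parallel(group) for group in groups)


def parallel_of_series_chains(chains):
    """Subsystems connected in parallel, each a series chain of components."""
    return parallel(series(chain) for chain in chains)


if __name__ == "__main__":
    subsystems = [[0.9, 0.9], [0.8, 0.8]]  # hypothetical component reliabilities
    print(round(series_of_parallel_groups(subsystems), 4))
    print(round(parallel_of_series_chains(subsystems), 4))
```

Each subsystem is first reduced to a single equivalent reliability, and the equivalent values are then combined, mirroring the step-by-step reduction described above.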

Fig. 2. (a) Series configuration with N components; (b) Parallel configuration with N components; (c) Series-parallel configuration; (d) Parallel-series configuration. [Figure not reproduced. Panels (c) and (d) show Subsystem 1, Subsystem 2, ..., Subsystem k containing N1, N2, ..., Nk components, respectively.]

(iv) M-out of-N system configuration : The term M-out of-N system is often used to indicate either a G system or an F system or both. Both parallel (1-out of-N: G or N-out of-N: F) and series (1-out of-N: F or N-out of-N: G) systems are special cases of the M-out of-N system. The M-out of-N system structure is a very popular type of redundancy in fault tolerant systems, including industrial and military systems.

An N-component system that works (or is good) if and only if at least M of its N components work (or are good) is called an M-out of-N: G system. The M-out of-N: F system fails if and only if at least M of its N components fail. A triple modular redundant system uses a 2-out of-3: G voting configuration. Several varieties of M-out of-N systems are described below.

• Consecutive M-out of-N system: A consecutive M-out of-N system consists of N linearly or cyclically ordered components such that the system fails if and only if at least M consecutive components fail.

• Weighted M-out of-N system: A weighted M-out of-N system is an N-component system in which each component carries its own positive integer weight, such that the system is good if and only if the total weight of the working components is at least M, a pre-specified value. In mathematical terms, component i carries a weight $w_i$, $w_i > 0$ for $i = 1, 2, \ldots, N$, and $w = \sum_{i=1}^{N} w_i$ is the total weight of all the components. Thus, the M-out of-N: G system can be seen as a special case of the weighted M-out of-N: G system in which each component has a weight of 1.

• M-K-out of-N system: An M-K-out of-N system fails if fewer than M or more than K components function simultaneously, i.e. for successful operation of the system, at least M and at most K components must function properly.
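The three variants above differ only in the rule that decides whether a given pattern of working and failed components counts as a working system. As a minimal sketch, assuming independent components and a small N, the reliability of each variant can be obtained by enumerating all component states. The function names (`system_reliability`, `consecutive_f`, `weighted_g`, `m_k_g`) are illustrative and do not come from the paper, and only the linear (not cyclic) consecutive system is covered.

```python
from itertools import product

def system_reliability(p, works):
    """Sum the probability of every component state for which works(state) is True.

    p     : list of component reliabilities (independent components assumed)
    works : illustrative structure function mapping a tuple of booleans
            (True = component works) to True/False for the whole system
    """
    total = 0.0
    for state in product([True, False], repeat=len(p)):
        prob = 1.0
        for up, pi in zip(state, p):
            prob *= pi if up else (1.0 - pi)
        if works(state):
            total += prob
    return total

def consecutive_f(m):
    """Linear consecutive M-out of-N system: fails iff at least m adjacent components fail."""
    def works(state):
        run = 0  # length of the current run of failed components
        for up in state:
            run = 0 if up else run + 1
            if run >= m:
                return False
        return True
    return works

def weighted_g(weights, m):
    """Weighted M-out of-N: G system: works iff the working components weigh at least m."""
    return lambda state: sum(w for up, w in zip(state, weights) if up) >= m

def m_k_g(m, k):
    """M-K-out of-N system: works iff between m and k components (inclusive) function."""
    return lambda state: m <= sum(state) <= k

if __name__ == "__main__":
    p = [0.9, 0.9, 0.9, 0.9]
    print(system_reliability(p, consecutive_f(2)))
    print(system_reliability(p, weighted_g([2, 1, 1, 1], 3)))
    print(system_reliability(p, m_k_g(2, 3)))
```

The enumeration visits 2^N states, so it is only meant for checking small examples; with unit weights, `weighted_g` reduces to the ordinary M-out of-N: G system, as stated in the text.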

The system reliabilities for different configurations are summarized in Table 1, assuming non-identical components except for the M-out of-N configurations. The simplest case of components in M-out of-N configuration while analyzing system reliability is considered when the components are mutually independent and identical. The reliability of the M-out of-N system can then be evaluated by using the binomial distribution.

Table 1. Reliability models.

| Serial no. | System configuration | System reliability |
|---|---|---|
| 1 | Series | $R_{series}(t) = \prod_{i=1}^{N} R_i$, where $R_i$ is the reliability of the ith component |
| 2 | Parallel | $R_{parallel}(t) = 1 - \prod_{i=1}^{N} (1 - R_i)$, where $R_i$ is the reliability of the ith component |
| 3 | Series-Parallel | $R_{s-p}(t) = \prod_{i=1}^{N} R_{parallel,i}$, where $R_{parallel,i}$ is the reliability of the ith subsystem |
| 4 | Parallel-Series | $R_{p-s}(t) = 1 - \prod_{i=1}^{N} (1 - R_{series,i})$, where $R_{series,i}$ is the reliability of the ith subsystem |
| 5 | M-out of-N: G | $R_{M\text{-out of-}N}(t) = \sum_{i=0}^{N-M} \binom{N}{i} (1 - R)^{i} R^{N-i}$, where R is the reliability of a component |
| 6 | M-K-out of-N: G | $R_{M\text{-}K\text{-out of-}N}(t) = \sum_{i=0}^{K-M} \binom{N}{K-i} (1 - R)^{N-K+i} R^{K-i}$, where R is the reliability of a component |
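The entries of Table 1 translate directly into a few lines of code. The following is a minimal sketch, assuming independent components; the function names are illustrative only, and each subsystem is passed as a list of its component reliabilities.

```python
from math import comb, prod

def series(rs):
    """Series system: every component must work."""
    return prod(rs)

def parallel(rs):
    """Parallel system: fails only if every component fails."""
    return 1.0 - prod(1.0 - r for r in rs)

def series_parallel(subsystems):
    """Parallel subsystems connected in series (Table 1, row 3)."""
    return prod(parallel(sub) for sub in subsystems)

def parallel_series(subsystems):
    """Series subsystems connected in parallel (Table 1, row 4)."""
    return 1.0 - prod(1.0 - series(sub) for sub in subsystems)

def m_out_of_n_good(m, n, r):
    """M-out of-N: G system with identical components of reliability r.

    Sums over i = number of failed components, from 0 to n - m (Table 1, row 5).
    """
    return sum(comb(n, i) * (1.0 - r) ** i * r ** (n - i) for i in range(n - m + 1))

if __name__ == "__main__":
    print(series([0.9, 0.95, 0.99]))
    print(parallel([0.9, 0.95, 0.99]))
    print(m_out_of_n_good(2, 3, 0.9))  # 2-out of-3: G, i.e. TMR with a perfect voter
```

Setting m = 1 or m = n in `m_out_of_n_good` recovers the parallel and series cases for identical components, which gives a quick consistency check on the formulas.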

Reliability block diagrams are gaining popularity because they are easy to understand and can be used for modeling real time redundant systems. However, RBDs, like other combinatorial reliability models, have a number of serious limitations, as given below.

• RBDs assume that the system components are limited to operational and failed states and that the system configuration does not change during the mission.

• The failures of the individual components are assumed to be independent. Thus system reliability cannot be adequately represented when it is affected by the sequence of component failures.

4.2. Markov modeling

Markov modeling is one of the most powerful tools available to system engineers and designers for analyzing complex redundant systems. Markov models are preferred when the system is more complex and the reliability expressions cannot easily be modeled combinatorially. Markov modeling gives results for both the time-dependent evolution and the steady state of the system. In Markov modeling, the system is represented by a number of states and state transitions. The next transition of the system depends only on its present state, not on its past states. Transitions may be triggered by a variety of possible events (i.e. failure, repair, etc.) and are characterized by a probability distribution under reasonable conditions. In a large number of Markov models in reliability analysis, the transition times follow exponential distributions with constant failure or repair rates. Markov models are also useful for describing electronic systems or system components that are repairable and either function or fail. Computer based systems use various components, namely CPUs, RAM, network cards, hard disk controllers and hard disks; such systems can be described by a Markov model.

A simple Markov model for a one-component repairable system is depicted by the transition flow diagram shown in Fig. 3. Reibman79 presented an overview of numerical approaches for transient analysis of Markov as well as Markov reward models in fault tolerant systems and derived many instantaneous and cumulative reliability measures. Reliability modeling for multiple repairable systems based on the Markov process has been done by Islamov.40 Sharma and Kumar85 proposed a Markovian approach to model the behavior of safety engineering systems.
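For the one-component repairable system with constant failure rate λ and repair rate µ, the probability p(t) of being operational satisfies dp/dt = −λ p + µ (1 − p). The sketch below, which is not taken from the paper, evaluates the standard closed-form solution for a unit that starts operational and cross-checks it with a simple forward Euler integration; the function names and the step count are assumptions made for illustration.

```python
import math

def availability(t, lam, mu):
    """Closed-form probability that the unit is operational at time t (operational at t = 0)."""
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)

def steady_state_availability(lam, mu):
    """Long-run fraction of time the unit is operational."""
    return mu / (lam + mu)

def availability_euler(t, lam, mu, steps=20000):
    """Forward Euler integration of dp/dt = -lam*p + mu*(1 - p), starting from p(0) = 1."""
    dt = t / steps
    p = 1.0
    for _ in range(steps):
        p += dt * (-lam * p + mu * (1.0 - p))
    return p

if __name__ == "__main__":
    lam, mu = 0.5, 2.0  # illustrative rates
    for t in (0.0, 0.5, 1.0, 5.0):
        print(t, availability(t, lam, mu), availability_euler(t, lam, mu) if t > 0 else 1.0)
    print("steady state:", steady_state_availability(lam, mu))
```

The transient term decays at rate λ + µ, so the availability settles to µ/(λ + µ) quickly when repair is fast compared with failure.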

A more complicated example in Markov modeling would be a combined hardware and software system. In this situation, the software repair time is the time required to bring the software back into service, not the time required to detect and remove the bug. The state transition diagram of a combined hardware and software system is shown in Fig. 4. A Markov model for availability analysis of distributed software/hardware systems has been developed by Lai et al.56 Dominguez-Garcia et al.23 suggested an integrated methodology for the reliability evaluation and dynamic performance analysis of fault-tolerant systems.

Fig. 3. Transition diagram of a one unit repairable system. [Figure not reproduced. States: Operational and Failed, with failure rate (λ) from Operational to Failed and repair rate (µ) from Failed to Operational.]

Reliability modeling of TMR system: Here we illustrate the concept of the Markov model as well as the RBD for a TMR redundant system consisting of three modules, two of which are required for the system to function properly. It is assumed that the component failures are mutually independent and the voter is perfect.

(a) RBD model: Let us denote the module reliabilities by $R_1$, $R_2$ and $R_3$. The reliability of a TMR system is given by

$$R_{TMR} = R_1 R_2 R_3 + (1 - R_1) R_2 R_3 + (1 - R_2) R_1 R_3 + (1 - R_3) R_1 R_2 \tag{1}$$

where

$R_1 R_2 R_3$ = Prob{module 1 functions correctly ∩ module 2 functions correctly ∩ module 3 functions correctly},

$(1 - R_i) R_j R_k$ = Prob{module i has failed ∩ module j functions correctly ∩ module k functions correctly} for distinct i, j, k = 1, 2, 3.

Fig. 4. Markov model for combined hardware and software repairable system. [Figure not reproduced. States: Operational, Hardware failed and Software failed; $\lambda_h$, $\lambda_s$: hardware and software failure rates; $\mu_h$, $\mu_s$: hardware and software repair rates.]

If all the components are identical, then $R_1 = R_2 = R_3 = R$ (say), and Eq. (1) yields

$$R_{TMR} = 3R^2 - 2R^3 \tag{2}$$
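As a quick numerical check, Eq. (1) can be evaluated term by term and compared with Eq. (2) for identical modules. This is a minimal sketch under the same assumptions as the text (independent module failures, perfect voter); the function names are illustrative.

```python
def tmr_reliability(r1, r2, r3):
    """Eq. (1): all three modules work, or exactly one has failed."""
    return (r1 * r2 * r3
            + (1 - r1) * r2 * r3
            + (1 - r2) * r1 * r3
            + (1 - r3) * r1 * r2)

def tmr_identical(r):
    """Eq. (2): TMR reliability for three identical modules."""
    return 3 * r ** 2 - 2 * r ** 3

if __name__ == "__main__":
    for r in (0.3, 0.5, 0.9, 0.99):
        print(r, tmr_reliability(r, r, r), tmr_identical(r))
```

Since 3R² − 2R³ − R = −R(2R − 1)(R − 1), TMR improves on a single module only when the module reliability R exceeds 0.5; below that value the voting arrangement is worse than one module alone.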

(b) A Markov model: The TMR system can be modeled by assuming an exponentially distributed lifetime for each module. Taking λ as the failure rate of a module, the system can be represented by three states as follows:

State 1: only one module or no module is working (failure state)
State 2: two modules are operational (operational state)
State 3: three modules are operational (operational state)

The aim of Markov modeling of the TMR system is to calculate $p_i(t)$, the probability that the system is in state i (i = 1, 2, 3) at time t. From the state transition diagram shown in Fig. 5, we can construct the state transition equations as follows.

$$\frac{d}{dt} p_1(t) = 2\lambda p_2(t) \tag{3}$$

$$\frac{d}{dt} p_2(t) = 3\lambda p_3(t) - 2\lambda p_2(t) \tag{4}$$

$$\frac{d}{dt} p_3(t) = -3\lambda p_3(t) \tag{5}$$

On solving the above system of Eqs. (3)–(5) with the initial condition $p_3(0) = 1$, we get

$$p_1(t) = 1 - 3e^{-2\lambda t} + 2e^{-3\lambda t}, \quad p_2(t) = 3e^{-2\lambda t} - 3e^{-3\lambda t}, \quad p_3(t) = e^{-3\lambda t}. \tag{6}$$

The reliability of the TMR system is obtained as

$$R_{TMR}(t) = p_2(t) + p_3(t) = 1 - p_1(t) = 3e^{-2\lambda t} - 2e^{-3\lambda t} \tag{7}$$
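The state equations can also be integrated numerically, which is how larger Markov models without a convenient closed form are usually handled. The sketch below applies a classical fourth-order Runge-Kutta scheme to the three-state TMR chain (state 3 to state 2 at rate 3λ, state 2 to state 1 at rate 2λ, all modules working at t = 0) and compares the result with Eq. (7). The function names and step count are assumptions for illustration.

```python
import math

def tmr_derivatives(p, lam):
    """Right-hand sides of the state equations for p = (p1, p2, p3)."""
    p1, p2, p3 = p
    return (2 * lam * p2,                  # inflow into the failed state
            3 * lam * p3 - 2 * lam * p2,   # two modules operational
            -3 * lam * p3)                 # three modules operational

def integrate_tmr(t_end, lam, steps=1000):
    """Classical RK4 integration from the initial state p3(0) = 1."""
    h = t_end / steps
    p = (0.0, 0.0, 1.0)
    for _ in range(steps):
        k1 = tmr_derivatives(p, lam)
        k2 = tmr_derivatives(tuple(x + 0.5 * h * k for x, k in zip(p, k1)), lam)
        k3 = tmr_derivatives(tuple(x + 0.5 * h * k for x, k in zip(p, k2)), lam)
        k4 = tmr_derivatives(tuple(x + h * k for x, k in zip(p, k3)), lam)
        p = tuple(x + h / 6.0 * (a + 2 * b + 2 * c + d)
                  for x, a, b, c, d in zip(p, k1, k2, k3, k4))
    return p

def tmr_closed_form(t, lam):
    """Eq. (7): R_TMR(t) = 3 exp(-2 lam t) - 2 exp(-3 lam t)."""
    return 3 * math.exp(-2 * lam * t) - 2 * math.exp(-3 * lam * t)

if __name__ == "__main__":
    lam, t = 0.5, 1.0  # illustrative values
    p1, p2, p3 = integrate_tmr(t, lam)
    print("numerical R_TMR:", p2 + p3)
    print("closed form    :", tmr_closed_form(t, lam))
```

The three derivatives sum to zero, so the state probabilities keep summing to one throughout the integration, which is a useful sanity check on any hand-written set of state equations.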

Now we compare the TMR (redundant) and simplex (non-redundant) systems in terms of reliability as a function of time (see Fig. 6(a)) and MTTF as a function of failure rate (see Fig. 6(b)).

Fig. 5. State transition diagram of a TMR system. [Figure not reproduced. State 3 moves to state 2 at rate 3λ, and state 2 moves to state 1 at rate 2λ.]


Fig. 6. Comparison of TMR and simplex systems for (a) reliability (b) MTTF. [Figure not reproduced. Panel (a) plots reliability against time (t) from 0 to 5; panel (b) plots MTTF against failure rate (λ) from 0.5 to 6.5; each panel shows one curve for the simplex system (s) and one for TMR.]

The reliability of a single component system (i.e. simplex system) having an exponentially distributed lifetime with failure rate λ is given by

$$R_s(t) = e^{-\lambda t}. \tag{8}$$

The MTTF for the simplex system is obtained as

$$MTTF_s = \int_0^\infty R_s(t)\,dt = \frac{1}{\lambda} \tag{9}$$

and the MTTF for the TMR system is given by

$$MTTF_{TMR} = \int_0^\infty R_{TMR}(t)\,dt = \frac{5}{6\lambda}. \tag{10}$$

It is noticed from Fig. 6(a) that the reliability of the TMR system is higher than the reliability of the simplex system in the time period between 0 and approximately 1.4 when we set the failure rate λ = 0.5 as a default parameter. Beyond this period, the reliability of the simplex system becomes higher. Moreover, Fig. 6(b) shows a decreasing trend in the mean time to failure of both simplex and TMR systems as λ increases. It is also observed that the MTTF is higher for the simplex system than for the TMR system. Therefore it is concluded that the TMR system is suitable for short missions, since the TMR system reliability ultimately falls below that of the simplex system for long missions (t > z). Spare provisioning is a better option for long missions, where failure is likely to occur only when all the spares are exhausted.

Consequently, for $z = (\ln 2)/\lambda \approx 0.69/\lambda$ (i.e. z ≈ 1.4 when λ = 0.5), we can say that

$$R_{TMR}(t) \geq R_{simplex}(t), \quad 0 \leq t \leq z$$

$$R_{TMR}(t) \leq R_{simplex}(t), \quad z \leq t < \infty \quad \text{and} \quad MTTF_{TMR} < MTTF_{simplex}.$$
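The crossover time z can be located numerically from Eqs. (7) and (8). The following sketch uses bisection on the difference of the two reliabilities, which is positive before the crossover and negative after it; the function names, the search interval and the tolerance are assumptions chosen for illustration. Substituting y = e^{−λt} into the equality of Eqs. (7) and (8) gives 2y² − 3y + 1 = 0, whose non-trivial root y = 1/2 corresponds to z = (ln 2)/λ, and the code checks against that value.

```python
import math

def r_simplex(t, lam):
    """Eq. (8): reliability of a single module."""
    return math.exp(-lam * t)

def r_tmr(t, lam):
    """Eq. (7): reliability of the TMR system with a perfect voter."""
    return 3 * math.exp(-2 * lam * t) - 2 * math.exp(-3 * lam * t)

def crossover_time(lam, tol=1e-12):
    """Bisection for the time z > 0 at which R_TMR(z) = R_simplex(z)."""
    lo, hi = 1e-9, 10.0 / lam  # TMR is better just after t = 0 and worse at 10/lam
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if r_tmr(mid, lam) > r_simplex(mid, lam):
            lo = mid   # still in the region where TMR is better
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    lam = 0.5  # default failure rate used in the text
    print("numerical z :", crossover_time(lam))
    print("ln(2) / lam :", math.log(2) / lam)
    print("MTTF simplex:", 1 / lam, " MTTF TMR:", 5 / (6 * lam))
```

At the crossover each module has reliability exactly 0.5, which matches the observation that majority voting only helps while the individual modules are more likely to work than not.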

4.3. Non-Markovian models

In stochastic Markov modeling, the most idealized assumption is that the repair/service times of the components are exponentially distributed, i.e. they possess the memoryless property. When this assumption is removed, the resulting stochastic process is said to be non-Markovian.30 In non-Markovian modeling, the transitions between states are still performed according to Markovian laws, while holding times such as lifetime, repair time, service time, etc. are described by random variables which follow general probability distributions.20

Let {X(t), t ∈ T} be the stochastic process denoting the state of the system at time t. If the stochastic process {X(t); t ≥ 0} is non-Markovian with non-exponential repair/service times, then the future time-dependent evolution of the system depends on the elapsed or remaining repair time and service time. In reliability engineering/modeling, the most popular analytical technique, known as the supplementary variable technique and originated by Cox in 1955, is largely used for such non-Markovian models; it converts a non-Markovian process into a Markovian one. Recent developments in reliability modeling using the supplementary variable technique include Garg et al.,29 Oliveira et al.,68 Wang and Chen,94 Zhang,106 Zhang and Wang107 and many others. There are other analytical methods, namely the embedded Markov chain and dummy states methods, which are also used for analysis and are based on reduction to the Markovian case. Reliability models based on the embedded Markov chain technique have been developed by Agarwal,2 El-Karaksy et al.,26 Huang and Chang,37 Schoenig et al.84 and many others.

5. System Redundancy

Redundancy is a fundamental prerequisite for a system that must either recover from or hide failures. For a redundant system to continue correct operation in the presence of faults, the redundancy must be properly managed. The redundancy issues in any system are deeply interrelated and ultimately determine system reliability. We now describe the different types of redundancy that can be built into a system. In the literature, there are various methods, techniques and terminologies to categorize system redundancy. Here we outline the most common means of providing redundancy.

5.1. Active system redundancy

A system with active redundancy has all components operating simultaneously in parallel. All the components are in use at the same time, even though only one, or fewer than the number of operating components, is required for successful functioning of the system. In active redundancy, the failure of a component has no effect on the failure rate of the surviving components. We now cite some research works that have incorporated the concept of active redundancy in reliability models. Carr and Savage9 proposed a unified methodology for systems with active redundancy to determine the reliability index. Valdes and Zequeira91,92 presented the optimal allocation of components in a two-component series system when they are used as active redundant components. In the same direction, active redundancy allocation for a k-out of-n: F system has been examined by Bueno and Carmo.8 Most recently, Valdes et al.93 discussed some stochastic comparisons in series systems with active redundancy.

5.2. Standby system redundancy

Systems with standby redundancy consist of two or more components. Only one component, called the primary component, operates at a time to accomplish the system function, while the other components may be in hot, cold or warm standby mode. In standby redundancy, the failure of one component can strongly affect the failure characteristics of the others in terms of increased failure rates, because they are then placed under load. Standby redundant systems can be divided into three categories as follows.

• Hot standby: The hot standby components have the same failure rate as the primary component. The failure rate of one component is not affected by the other components, whether they are performing or non-performing. Hence, the components are statistically independent.

• Warm standby: The warm standby components have a lower failure rate than the primary component, because they are subject to a lower load until the primary component fails.

• Cold standby: The cold standby components have zero failure rate. They never fail while they are in standby mode, which preserves the component reliability. When the primary component fails, a cold standby component takes over the primary component's responsibilities, and its failure characteristic then becomes the same as that of the primary component.
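The three standby modes can be compared with a single formula in the simplest setting. The sketch below is an illustration under assumptions that are not stated in the text: two identical units, exponential lifetimes, a perfect and instantaneous switch, and no repair. With operating failure rate λ and dormant failure rate λ_s, conditioning on the failure time of the primary unit gives R(t) = e^{−λt}[1 + (λ/λ_s)(1 − e^{−λ_s t})]; λ_s = λ is the hot case, 0 < λ_s < λ the warm case, and the limit λ_s → 0 gives the cold case e^{−λt}(1 + λt). The function name is illustrative.

```python
import math

def standby_reliability(t, lam, lam_standby):
    """Reliability of a primary unit backed by one standby unit.

    lam         : failure rate of a unit while operating
    lam_standby : failure rate of the spare while dormant
                  (lam = hot, 0 < value < lam = warm, 0 = cold)
    Assumes a perfect, instantaneous switch and no repair.
    """
    if lam_standby == 0.0:
        factor = 1.0 + lam * t  # cold standby limit
    else:
        factor = 1.0 + (lam / lam_standby) * (1.0 - math.exp(-lam_standby * t))
    return math.exp(-lam * t) * factor

if __name__ == "__main__":
    lam, t = 0.5, 1.0  # illustrative values
    print("hot :", standby_reliability(t, lam, lam))
    print("warm:", standby_reliability(t, lam, 0.1))
    print("cold:", standby_reliability(t, lam, 0.0))
```

Under these idealized assumptions the cold spare gives the highest reliability and the hot spare the lowest, since a dormant unit that cannot fail is always available when needed; in practice the switchover delay discussed later in this section works in the opposite direction.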

In general, standby redundancy has received more attention in the past. Kapur and Kapoor,47 and Yearout et al.104 presented surveys on standby redundancy in reliability. Apart from theoretical interest, standby redundancy has also been used extensively in reliability models, to which a notable contribution has been made by Azaron et al.6 They applied a shortest path approach in stochastic networks, called E-networks, to evaluate the reliability function of time-dependent systems with standby redundancy. Redundant systems with warm standby components have been studied by Papageorgiou and Kokolakis,70 Wang et al.,95 Zhang et al.105 and many others.

In standby redundancy, there is an inevitable period of disruption between the occurrence of a failure and the redundant component being brought into operation. During this period, the system or application may stop functioning or the system response may be delayed. This is the main disadvantage of standby redundancy. Such an approach is rarely satisfactory for critical systems in modern commercial and industrial situations. Compared to standby redundancy, active redundancy tends to have a shorter switchover time when a failure occurs. Thus, active redundancy is suitable for computer installations. The use of mirrored disks in a server computer is an example of active redundancy.

6. Hardware Redundancy

Hardware redundancy can be achieved by providing two or more physical copies of a hardware component. A typical computer system can include redundant processors, disk drives, memories, power supplies or buses, which can be switched in automatically to replace failed components. The following are the three basic forms of hardware redundancy66,90:

• Passive (Static) redundancy
• Active (Dynamic) redundancy
• Hybrid redundancy

6.1. Passive redundancy

Passive redundancy techniques employ fault masking to hide the occurrence of faults within a set of redundant hardware modules. As soon as a fault occurs, the effect of the faulty module is immediately masked by permanently connected and continually operational redundant modules. In this technique, a number of identical modules execute the same functions and their outputs are voted on to remove the errors created by a faulty module. Most passive approaches are developed around the concept of N-modular redundancy (NMR), wherein N copies of a module operate simultaneously.

Triple modular redundancy (TMR), in which three identical modules are arranged in parallel, is the basic and most common arrangement of passive redundancy, as shown in Fig. 7(a). In TMR, all the modules perform the same task at the same time, and their results or outputs are sent to a majority voter which determines the correct result. If one of the modules gives a wrong result, the majority voter masks the fault that caused the incorrect result by adopting the result agreed on by the remaining two fault-free modules. This redundancy ensures that a single faulty module out of three does not corrupt the performance of the system. Thus a TMR system can mask or tolerate the fault of only one module. However, even when the modules are fault-free, the system fails if the voter fails and produces an erroneous result. The voter is therefore a single point of failure, which is the primary weakness of TMR. To overcome this difficulty, three voters which provide three independent outputs can be used (see Fig. 7(b)).
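The masking behaviour of the majority voter can be illustrated in software. The sketch below is an analogy only, since the paper describes hardware modules: modules are modelled as ordinary functions, and the names `majority_vote` and `tmr` are made up for this example. Module outputs are assumed to be hashable values that can be compared for equality.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of modules.

    Raises ValueError when no value is produced by more than half of the
    modules, i.e. when the faults can no longer be masked.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count > len(outputs):
        return value
    raise ValueError("no majority among module outputs")

def tmr(modules, x):
    """Run all modules on the same input and vote on their outputs."""
    return majority_vote([module(x) for module in modules])

if __name__ == "__main__":
    good = lambda x: x * x   # fault-free module
    faulty = lambda x: -1    # module stuck at a wrong value
    print(tmr([good, faulty, good], 3))  # the single faulty output is outvoted
```

The voter itself is ordinary code here, which mirrors the single point of failure noted above: if `majority_vote` were wrong, three correct modules would not help.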

Fig. 7. (a) TMR with single voter; (b) TMR with triplicate voters. [Figure not reproduced. In (a), Inputs 1 to 3 feed Modules 1 to 3, whose results go to a single voter that produces the output; in (b), the module results go to Voters 1 to 3, which produce Outputs 1 to 3.]

To achieve a higher level of fault tolerance, the N-modular redundancy (NMR) technique, which is a generalization of TMR, can be used. In an NMR system, N = 2n + 1 redundant modules can mask or tolerate n module faults under the assumption of perfect voting. The NMR technique is simple but expensive, and it provides uninterrupted service in the presence of faults, since a fault in the redundant modules does not delay the results unless the number of faulty modules exceeds the tolerance of the voting. These techniques are suitable for real time applications with short mission times, such as the space shuttle computer control system.87,88 During flight-critical phases of a mission, a system of four redundant computers with majority voting is used to achieve high reliability. This system is designed to cope with two successive failures. If a computer fails, it is outvoted by the other three computers. If a second computer then becomes defective, it is also outvoted by the remaining two. If all four computers fail, another computer, which runs an independently developed programme, performs the critical functions.
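The relation N = 2n + 1 and the resulting reliability can be sketched as follows, assuming identical, independent modules and a perfect voter as in the text. The function names are illustrative, and the reliability expression is simply the M-out of-N: G formula with M equal to the majority size.

```python
from math import comb

def nmr_tolerated_faults(n_modules):
    """Number of module faults n that N = 2n + 1 modules can mask."""
    if n_modules < 1 or n_modules % 2 == 0:
        raise ValueError("NMR with majority voting uses an odd number of modules")
    return (n_modules - 1) // 2

def nmr_reliability(n_modules, r):
    """Probability that a majority of N identical modules (reliability r) work."""
    majority = n_modules // 2 + 1
    return sum(comb(n_modules, j) * r ** j * (1.0 - r) ** (n_modules - j)
               for j in range(majority, n_modules + 1))

if __name__ == "__main__":
    for n_modules in (1, 3, 5, 7):
        print(n_modules, nmr_tolerated_faults(n_modules), nmr_reliability(n_modules, 0.9))
```

For N = 3 the expression reduces to 3R² − 2R³, i.e. Eq. (2); adding modules raises the reliability only while the module reliability is above 0.5, which is why NMR is described as suited to short missions.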

6.2. Active redundancy

In active redundancy, fault tolerance is achieved by detecting the existence of faults and performing some action to remove the faulty hardware from the system. Active redundancy techniques involve fault detection, fault location and fault recovery in order to achieve fault tolerance. In this approach, no attempt at fault masking is made, so the system may temporarily produce an erroneous result, which must be acceptable in the application. After fault detection, the system is reconfigured and brought back to its original operational status.

There are many techniques which can be used for fault detection. The most common form of fault detection is duplication with comparison (DWC), as shown in Fig. 8(a). In DWC, two identical hardware modules perform the same computations in parallel. The computation results of both modules are compared using a comparator device. If the results do not match, an error signal is generated. Thus a system under duplication with comparison operates correctly only when both modules operate correctly. This technique can detect a fault in one module but cannot identify which module is faulty. When a fault occurs, the comparator detects it and the normal functioning of the system stops. To avoid this problem, an external action may sometimes be taken to switch to another module when a fault is detected.
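The comparator logic of DWC is small enough to sketch directly. As with the voter, this is a software analogy of a hardware arrangement, and the function name and the convention of returning `None` together with an error flag are assumptions made for the example.

```python
def duplication_with_comparison(module_a, module_b, x):
    """Run two modules on the same input and compare their results.

    Returns (output, error_signal). On a mismatch the output is withheld
    (None), because the comparator cannot tell which module is faulty.
    """
    out_a = module_a(x)
    out_b = module_b(x)
    error_signal = out_a != out_b
    return (None if error_signal else out_a), error_signal

if __name__ == "__main__":
    good = lambda x: x + 1
    faulty = lambda x: x + 2
    print(duplication_with_comparison(good, good, 10))
    print(duplication_with_comparison(good, faulty, 10))
```

Note that two modules failing in the same way would produce matching wrong outputs and pass the comparison, which is one reason identical design faults are a concern for duplicated systems.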

The standby-sparing (SS) technique is another form of active redundancy.73 It is also known as the standby-replacement technique, in which one of the modules, the so-called primary module, is operational and one or more modules serve as standbys or spares, as illustrated in Fig. 8(b). If a fault is detected and located, the faulty primary module is removed and replaced by a standby module. In standby-sparing, the reconfiguration is done by a switching device which monitors the primary module and switches operation to a standby if an error is found. Standby sparing schemes are categorized as hot and cold standby sparing.73

In hot standby sparing, all spare modules are powered up and ready to be switched into operation immediately after the primary module fails. In cold standby sparing, the spare modules are powered down and are powered on only when they are needed to replace a faulty module. The advantage of cold standby sparing is that the spares do not consume power until needed. An example of hot standby sparing can be seen in a process control system that controls a chemical reaction, where reconfiguration time needs to be minimized. Satellite applications, where power consumption is extremely critical, use the cold standby sparing technique.
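The switching behaviour of standby sparing can be sketched as a small class. This is a hypothetical illustration, not a description of any real system: the class name, the `is_faulty` callback standing in for the fault detector, and the policy of discarding a module permanently once it is judged faulty are all assumptions.

```python
class StandbySparing:
    """Primary module with spares, switched on detection of a fault."""

    def __init__(self, modules, is_faulty):
        # modules[0] is the primary; the remaining entries are the spares, in order.
        self.modules = list(modules)
        # Hypothetical fault detector: is_faulty(x, y) is True when output y is wrong.
        self.is_faulty = is_faulty
        self.switches = 0  # number of reconfigurations performed so far

    def run(self, x):
        """Return the output of the first module that passes the fault detector."""
        while self.modules:
            y = self.modules[0](x)
            if not self.is_faulty(x, y):
                return y
            self.modules.pop(0)  # remove the faulty primary; next spare takes over
            self.switches += 1
        raise RuntimeError("all spares exhausted")

if __name__ == "__main__":
    broken = lambda x: None   # a module that produces no valid result
    spare = lambda x: x + 1
    system = StandbySparing([broken, spare], is_faulty=lambda x, y: y is None)
    print(system.run(1), system.switches)
```

In a cold standby arrangement the spare would additionally need to be powered up before the retry, which is where the reconfiguration delay discussed in the text comes from.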

The pair-and-a-spare (PAS) active redundancy approach is a combination of the duplication with comparison and standby sparing techniques. Figure 8(c) represents the basic structure of the PAS scheme. In this approach, two modules are arranged in parallel and always kept in operation. Their outputs are compared to provide the error detection capability required in standby sparing. When a fault is detected, the system is reconfigured: the faulty module is removed and replaced with a spare one. This approach is used in the commercial Stratus computer systems.

6.3. Hybrid redundancy

Fig. 8. (a) Duplication with comparison; (b) Standby-sparing redundancy; (c) Pair-and-a-spare redundancy. [Figure not reproduced.]

Fig. 9. (a) N-modular redundancy with spares; (b) Triplex-duplex redundancy. [Figure not reproduced.]

The combination of both passive and active redundancy techniques is hybrid redundancy.45 The implementation of a system under the hybrid approach is usually very expensive, and it is used in real time applications that require high integrity of computations and the highest levels of reliability. Hybrid redundancy techniques use fault masking, fault detection, fault location and fault recovery processes to reconfigure the system after the occurrence of a fault. There are three hybrid redundancy approaches, namely (i) N-modular redundancy with spares, (ii) self-purging redundancy and (iii) triplex-duplex redundancy. Figure 9(a) shows the basic arrangement of N-modular redundancy with spares (NMRS). The NMRS approach consists of N identical modules and S additional spare modules. All the modules are arranged in a voting configuration. Initially the N modules take the input in parallel and their results are compared. If no fault is detected, the result is passed on as output. But in case of disagreement, i.e. when a fault is detected, two possibilities may arise:

(a) The majority voter masks the erroneous result of the active modules, and the result is then passed on as output.

(b) The switching device replaces the faulty module by one of the spare modules, and the system then continues using N main modules and S − 1 spares.
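One computation cycle of NMR with spares, combining the masking in (a) with the replacement in (b), might look as follows. This is a rough software sketch under assumptions not made in the text: modules are functions with hashable outputs, a module is treated as faulty whenever it disagrees with the voted value, and a disagreeing module is simply left in place (and keeps being outvoted) once the spares run out. The function name is made up for the example.

```python
from collections import Counter

def nmr_with_spares_step(active, spares, x):
    """Vote on the active modules, then swap disagreeing modules for spares.

    Returns (output, new_active, remaining_spares).
    """
    outputs = [module(x) for module in active]
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        raise RuntimeError("no majority: too many faulty modules to mask")
    remaining = list(spares)
    new_active = []
    for module, out in zip(active, outputs):
        if out != value and remaining:
            new_active.append(remaining.pop(0))  # switch a spare in for the faulty module
        else:
            new_active.append(module)            # keep the module as it is
    return value, new_active, remaining

if __name__ == "__main__":
    good = lambda x: 2 * x
    bad = lambda x: 0
    spare = lambda x: 2 * x
    out, active, spares = nmr_with_spares_step([good, bad, good], [spare], 5)
    print(out, len(active), len(spares))
```

After the swap the system is back to N fault-free active modules with S − 1 spares, so it can mask a further fault, which is the advantage over plain NMR.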

Another form of hybrid redundancy is self-purging (SP) redundancy, introduced by Lombardi60 and Losq.63 In the SP configuration, N active modules working in parallel are provided to the system and arranged under a voting scheme. In this approach, an individual switch is provided for each module. As a result, each module is capable of removing itself from the system when it is found to be faulty. The voter produces the system output and masks any fault when it occurs. When a disagreement is detected between the output of a module and the output of the voter, the switch of that module opens and removes (purges) the faulty module from the system.

The sift-out modular (SOM) redundancy approach is another example of hybrid redundancy, developed by De Sousa and Mathur in 1978. The system has N identical modules and three basic elements, namely a comparator, a detector and a collector. The outputs of the N modules are compared with each other by the comparator, which reports the result of each comparison. If any disagreement is found by the comparator, the detector removes the module which disagrees with a majority of the remaining modules. The collector element is responsible for reporting the output of each module comparison and also the output from the detector that indicates which module is faulty.

Figure 9(b) shows the basic arrangement of the hybrid redundancy based on the triple-modular and duplication with comparison redundancy techniques, which is known as triplex-duplex (TD) redundancy. In this arrangement, there are three primary modules and each module has a duplicate module. Thus, a total of six identical modules, grouped in three pairs, compute in parallel. The computation results of each pair are compared using a comparator. The output given by the majority voter is passed on as the final output of the system.

A brief summary of work done on hardware redundancy techniques is presented below in Table 2.

7. Software Redundancy

Software redundancy can be implemented by adding software components which are not exactly identical but are similar in functionality. Software redundancy techniques are designed to allow a system to tolerate software faults that remain in the system after its development. These techniques provide the software system with a mechanism to prevent the occurrence of faults from leading to system failure.


Table 2. Hardware redundancy techniques.

| Serial no. | Technique | Authors | Findings |
|---|---|---|---|
| 1 | Duplication with comparison | Hohl et al. (1993) | Presented a detailed comparison between watchdog processors and master-checker type duplication for achieving fault tolerance in multiprocessor systems, from the viewpoint of fault coverage, hardware overhead and time overhead |
| | | Tahir et al. (1995) | Proposed a fault tolerant arithmetic unit using DWC and residue code techniques to find the best design in terms of lower cost and better error coverage |
| | | Hashimoto et al. (2000) | Developed a scheduling algorithm to tolerate a single processor failure in multiprocessor systems. The algorithm duplicates all tasks of a program, which reduces the high communication overhead |
| | | Kim and Somani (2001) | Used component-level duplication to examine the dynamic control signals in microprocessor control logic |
| 2 | N-modular redundancy | Lombardi and Ratheal (1983) | Discussed steady-state availabilities of static and dynamic N-modular redundant fault tolerant systems |
| | | Koutny and Mancini (1989) | Constructed a software system that permits redundant systems to be robust with respect to failures in replicated processors |
| | | Flammini et al. (2009) | Proposed a new approach to the safety evaluation of N-modular redundant computer systems in the presence of imperfect maintenance |
| 3 | Triple-modular redundancy | Pham et al. (1996) | Examined the reliability and mean time to failure (MTTF) of the TMR system |
| | | Krstic et al. (2005) | Presented a fault-tolerant voter under the TMR scheme that is capable of selecting the mid value from the correct consensus |
| 4 | N-modular redundancy with spares | Lombardi et al. (1982) | Discussed the system reliability of duplex-hybrid systems with standby spares |
| | | Krishna (1993) | Discussed NMR with a spare processor approach to show the impact of workload on the reliability of real-time processor triads |
| | | Dabney et al. (2008) | Designed a simple dual-redundant fault tolerant test control system architecture and presented a survey of existing fault tolerant control systems |

Table 2. (Continued)

| Serial no. | Technique | Authors | Findings |
|---|---|---|---|
| 5 | Standby-sparing redundancy | Schmitz et al. (2004) | Made a significant contribution to reducing the energy consumption in standby-sparing. To accomplish this, dynamic voltage scaling (DVS) and dynamic power management (DPM) were presented for the primary module and the spare module, respectively |
| | | Eljali et al. (2009) | Developed an online energy-management method to analyze hard real time systems, which uses a slack reclamation scheme to reduce the energy consumption of both the primary and spare modules |
| 6 | Self-purging redundancy | Razavi (1993) | Developed a modified self-purging system that contains a digital voter which adjusts the voter threshold automatically as failed modules are purged |
| | | Quintana et al. (2001) | Presented an efficient implementation of the voter that helps to reduce circuit complexity and delay |

The traditional hardware redundancy techniques are designed to tackle manufacturing faults first and environmental or other faults second. Software redundancy techniques are modeled on hardware redundancy techniques, but hardware redundancy techniques do not protect the system against software design and specification faults. For example, the TMR system was developed to mask single errors by replicating the same hardware module, but the same approach cannot be applied to a software implementation with triplicated modules and voting on their outputs: a fault in the module cannot be tolerated because all copies contain the identical fault. Software redundancy techniques therefore attempt to leverage the experience of hardware redundancy techniques to solve a different problem. To accomplish this, and to create a proper redundant software system, the concept of design diversity has to be applied. Design diversity is the fault tolerance approach that is capable of handling common-mode design faults. Under this approach, it is considered more effective to vary a design at a high level of abstraction, since varying the function (algorithm) is more effective than varying the implementation details of a design, e.g. by using different programming languages. One way of classifying software redundancy techniques is along the lines of diversity⁷⁵ as follows:

(i) Design diversity based software redundancy
(ii) Data diversity based software redundancy.
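The argument that voting over identical software copies cannot mask a design fault, whereas voting over diverse versions can, may be illustrated with a small Python sketch. Everything in it is an assumption made for the example: the specification (integer square root), the deliberately faulty routine `isqrt_buggy`, and the helper `majority`.

```python
import math
from collections import Counter

def majority(results):
    """Return the value produced by more than half of the versions."""
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) / 2:
        raise RuntimeError("no majority")
    return value

def isqrt_buggy(n):
    # Hypothetical design fault: the loop test should be "<= n",
    # so perfect squares are rounded down by one.
    r = 0
    while (r + 1) * (r + 1) < n:
        r += 1
    return r

def isqrt_bisect(n):
    # Diverse algorithm: bisection with invariant lo*lo <= n < hi*hi.
    lo, hi = 0, n + 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid
    return lo

# Triplicating the same faulty module: all copies share the fault.
identical = majority([isqrt_buggy(9), isqrt_buggy(9), isqrt_buggy(9)])
# Design diversity: the faulty version is outvoted by two other designs.
diverse = majority([isqrt_buggy(9), isqrt_bisect(9), math.isqrt(9)])
print(identical, diverse)  # prints "2 3"
```

The three identical copies agree on the wrong value, so the voter has nothing to detect, while the diverse set masks the same fault.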

7.1. Design diversity based software redundancy

Design diversity is a solution for redundant software systems insofar as it is possible to create diverse yet equivalent specifications, so that developers can design software versions that do not share common faults. Design diversity approaches are used in a multiple-version software environment. The software versions are functionally equivalent, independently developed programs that provide the capability to tolerate software design faults. The most common examples of software redundancy techniques based on design diversity are recovery blocks, N-version programming and N-self-checking programming.

The recovery block (RB) scheme evolved as a software redundancy technique; it was initiated by Horning³⁶ and further developed by Brian Randell in the early 1970s. The basic structure of the recovery block scheme is shown in Fig. 10(a). In this technique, multiple software versions are implemented for the same program, of which one is the primary version and the others are alternate versions. RB relies on three mechanisms: (i) an acceptance test (AT), (ii) a checkpoint and (iii) a restart. First the primary version is executed, and the acceptance test is applied to its result. If the result passes the AT, it is accepted as reliable. If the AT detects an error, a rollback signal is sent to the switch, which transfers execution to another version of the software program (module). This process is repeated until some version passes the AT or all versions fail. The checkpoint is created before a version is executed and is needed to recover the system state after a version fails.

N-version programming (NVP) is another software redundancy technique based on design diversity; it was investigated by Elmendorf in 1972. In this technique, all N software versions are executed simultaneously and their results are sent to a decision mechanism, referred to as the 'majority voter', which selects the output result (see Fig. 10(b)). The goal of N-version programming systems is to minimize the probability of similar errors at decision points.
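The control flow of the two schemes can be contrasted in a short Python sketch. It is a simplified illustration under stated assumptions: the function names `recovery_block` and `n_version`, the toy task (finding the largest element of a list) and the faulty `primary` version are hypothetical, the versions run sequentially rather than simultaneously, and the checkpoint is modeled as a deep copy of the input state.

```python
import copy
from collections import Counter

def recovery_block(state, versions, acceptance_test):
    """Try the primary, then each alternate, until a result passes the AT."""
    checkpoint = copy.deepcopy(state)         # checkpoint before execution
    for version in versions:                  # versions[0] is the primary
        work = copy.deepcopy(checkpoint)      # roll back to the checkpoint
        try:
            result = version(work)
        except Exception:
            continue                          # a crash counts as a failure
        if acceptance_test(result):
            return result
    raise RuntimeError("all versions failed the acceptance test")

def n_version(x, versions):
    """Run every version on the same input and vote on the results."""
    results = [v(x) for v in versions]        # stand-in for parallel runs
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) / 2:
        raise RuntimeError("no majority")
    return value

data = [7, 2, 5]
primary = lambda xs: max(xs[1:])              # fault: skips the first element
alternate = lambda xs: sorted(xs)[-1]
accept = lambda r: r in data and all(r >= v for v in data)

print(recovery_block(data, [primary, alternate], accept))  # prints 7
print(n_version(data, [primary, alternate, max]))          # prints 7
```

The recovery block needs an acceptance test but executes alternates only on demand, whereas the N-version scheme needs no acceptance test but always pays for all N executions.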

N-self-checking programming (NSCP), proposed by Laprie et al.⁵⁷ and Yau and Cheung,¹⁰³ is also a design-diverse software redundancy technique; it combines features of both recovery blocks and N-version programming. The NSCP approach uses program redundancy to check its own behavior during execution, which can be done using either an acceptance test or a comparator. An NSCP scheme that uses comparators resembles triplex-duplex hardware redundancy.
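A comparator-based variant of this idea might be sketched as below. The sketch assumes that each self-checking component is a pair of diverse versions with a comparator, and that components are consulted in order until one delivers a result; the names `self_checking_pair` and `nscp` and the toy squaring versions are hypothetical, and real NSCP systems typically run the components concurrently.

```python
def self_checking_pair(v1, v2):
    """Build a self-checking component from two diverse versions."""
    def component(x):
        a, b = v1(x), v2(x)
        if a != b:                      # comparator detects a disagreement
            raise ValueError("comparator mismatch")
        return a
    return component

def nscp(x, components):
    """Deliver the result of the first component that checks itself clean."""
    for component in components:
        try:
            return component(x)
        except ValueError:
            continue                    # fall back to the next (spare) component
    raise RuntimeError("no self-checking component delivered a result")

square_bad = lambda x: x * x if x else 1     # hypothetical fault at x == 0
square_mul = lambda x: x * x
square_pow = lambda x: x ** 2
square_sum = lambda x: sum(abs(x) for _ in range(abs(x)))

components = [
    self_checking_pair(square_bad, square_mul),   # acting component
    self_checking_pair(square_pow, square_sum),   # spare component
]
print(nscp(0, components), nscp(3, components))   # prints "0 9"
```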

7.2. Data diversity based software redundancy

Data diversity was introduced by Ammann⁴ and Ammann and Knight.⁵ These approaches are used in a multiple-data-representation environment and use only one version of the software. Data diversity approaches utilize different representations of the input data to provide the capability to tolerate software design faults, and they are cheaper to implement than design diversity approaches. Examples of software redundancy techniques based on data diversity include retry blocks and N-copy programming.

Fig. 10. (a) Recovery-block; (b) N-version programming. [Panel (a) labels: Input, Version 1 to Version N, Selection Switch, Acceptance Test, Checkpoint Memory, Output. Panel (b) labels: Input, Version 1 to Version N, Selection Algorithm, Output.]

The retry block (RtB) software redundancy technique, shown in Fig. 11(a), is an adaptation of the recovery block scheme that uses only one algorithm rather than the multiple algorithms used in a recovery block. The execution results of the retry block are evaluated by an acceptance test; if the test fails, the checkpoint is restored and the same algorithm is executed again on re-expressed input data.

Fig. 11. (a) Retry-block; (b) N-copy programming. [Panel (a) labels: Input, Checkpoint Memory, Execute Algorithm, Acceptance Test, Pass, Fail, Restore Checkpoint, Discard Checkpoint, Signal Exception, Output. Panel (b) labels: Input, Copy 1 to Copy N, Decision Algorithm, Failure Exception, Output.]

Another software redundancy approach based on data diversity is N-copy programming (NCP), shown in Fig. 11(b). NCP resembles the N-modular hardware redundancy scheme. This technique uses one or more data re-expression algorithms. An NCP system consists of N copies of a program executing in parallel, with each copy running on a different input set produced by re-expression. The best output result of the system is selected using a modified voting scheme. In this technique, the data re-expression algorithms are first executed concurrently to re-express the input data, and then the N copies are executed concurrently. The results of the N copies are sent to the decision mechanism (DM). If the DM adjudicates a correct result, that result is returned; otherwise an error signal is raised.
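Both data-diversity schemes can be sketched with a single faulty algorithm and a few re-expression functions. The sketch below is illustrative only: the names `retry_block` and `n_copy`, the faulty routine `faulty_max` and the re-expressions (reversing and rotating a list, which leave the largest element unchanged) are assumptions, the copies run sequentially rather than concurrently, and a plain majority vote stands in for the modified voting scheme.

```python
from collections import Counter

def retry_block(x, algorithm, acceptance_test, re_expressions):
    """Run one algorithm; on a failed AT, re-express the input and retry."""
    attempts = [lambda v: v] + list(re_expressions)   # original input first
    for re_express in attempts:
        try:
            result = algorithm(re_express(x))
        except Exception:
            continue                                  # treat a crash as failure
        if acceptance_test(result):
            return result
    raise RuntimeError("retry block exhausted its re-expressions")

def n_copy(x, algorithm, re_expressions):
    """Run N copies of one algorithm on re-expressed inputs and vote."""
    results = [algorithm(r(x)) for r in re_expressions]
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) / 2:
        raise RuntimeError("decision mechanism found no majority")
    return value

data = [7, 2, 5]
faulty_max = lambda xs: max(xs[1:])       # design fault: skips the first element
identity = lambda xs: list(xs)
reverse = lambda xs: xs[::-1]
rotate = lambda xs: xs[1:] + xs[:1]
accept = lambda r: all(r >= v for v in data)

print(retry_block(data, faulty_max, accept, [reverse, rotate]))   # prints 7
print(n_copy(data, faulty_max, [identity, reverse, rotate]))      # prints 7
```

The same faulty program fails on the original input but succeeds on the re-expressed inputs, which is the effect data diversity relies on; it helps only when the fault is sensitive to the input representation.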

Many researchers have done a considerable amount of work on software redundancy techniques; some of these studies are listed in Table 3.

Table 3. Software redundancy techniques.

| Serial no. | Technique | Authors | Findings |
|---|---|---|---|
| 1 | Recovery blocks | Berman and Kumar (1999) | Discussed optimization models for a fault tolerant software system for both independent and consensus recovery block schemes under cost and reliability constraints |
| | | Abulnaja (2005) | Presented a component-based recovery block technique |
| | | Wattanapongskorn and Coit (2007) | Discussed embedded system design and optimization issues considering component redundancy and uncertainty in the component reliability estimates |
| 2 | N-version programming | Pham (1994) | Discussed the optimization of the cost of NVP systems subject to a desired reliability level |
| | | Kapur et al. (2007) | Presented the optimal release policy for a 3VP system, minimizing cost subject to a reliability constraint under a fuzzy environment |
| | | Proenza et al. (2009) | Suggested an improved design of the NVP execution architecture that resolves some potential inconsistencies |
| 3 | N-self-checking programming | Romanovsky (1997) | Introduced a general concept of the N-SCP scheme |
| | | Djordjevic et al. (2004) | Provided an approach to partially self-checking combinatorial circuit design which is similar to DWC, wherein the comparator works as a checker that detects any erroneous result |
| 4 | Retry blocks | Huang and Kintala (1995) | Provided a C-style construction of the retry block scheme for programming purposes, implemented using macros |
| 5 | N-copy programming | Christmansson et al. (1994) | Applied the data diversity technique via N-copy programming to tolerate software design faults in a flight control system, which gives better correct-computation results and higher reliability than other techniques |

8. Concluding Remarks

The complexity of embedded computer systems is increasing day by day in line with the requirements of safety-critical, mission-critical and business-critical applications. In these applications, a system failure may cause heavy losses in terms of human lives or environmental disaster. Redundancy has become one of the best ways to build highly reliable and available computer systems in different configurations. Redundancy technology makes systems capable of tolerating both expected and unexpected software/hardware faults. In the literature, various software/hardware redundancy techniques have been developed to manage these faults. Hardware redundancy allows recovery (repair) from failures rather than preventing them, and thus provides fault tolerance in the system with respect to operational faults. On the other hand, software redundancy plays a key role in most redundant computer systems, since computers that recover from failures mainly by hardware means use software to control their recovery and decision-making processes. In the present survey article we have discussed some of the fundamental issues of redundant computer-based systems related to design and reliability analysis in a different framework. An overview of classified software/hardware faults and of various hardware/software redundancy techniques has been presented. The methodological aspects discussed for the reliability modeling of redundant systems may provide insight to system designers and developers for improving software and hardware systems subject to techno-economic constraints. Our study may be helpful in resolving the problems introduced by hardware failures and software faults in many computer-based engineering systems.

References

1. O. A. Abulnaja, Component-based recovery block technique, AIML Journal 5(2) (2005) 1–5.

2. M. Agarwal, Imbedded semi-Markov process applied to stochastic analysis of a two-unit standby system with two types of failures, Microelectronics Reliability 25(3) (1985) 561–571.

3. M. M. Alidrisi, The reliability of a dynamic warm standby redundant system of n components with imperfect switching, Microelectronics Reliability 32(6) (1992) 851–859.

4. P. E. Ammann, Data diversity: An approach to software fault tolerance, Proceedings of FTCS-17, Pittsburgh, PA (1987), pp. 113–117.

5. P. E. Ammann and J. C. Knight, Data diversity: An approach to software fault tolerance, IEEE Transactions on Computers 37(4) (1988) 418–425.

6. A. Azaron, H. Katagiri, M. Sakawa and M. Modarres, Reliability function of a class of time-dependent systems with standby redundancy, European Journal of Operational Research 164(2) (2005) 378–386.

7. O. Berman and U. D. Kumar, Optimization models for recovery block schemes, European Journal of Operational Research 115(2) (1999) 368–379.

8. V. D. C. Bueno and I. M. D. Carmo, Active redundancy allocation for a k-out-of-n:F system of dependent components, European Journal of Operational Research 176(2) (2007) 1041–1051.

9. S. M. Carr and G. J. Savage, A unified methodology for reliability assessment of systems with active redundancy, Reliability Engineering & System Safety 34(2) (1991) 181–219.

10. J. A. Carrasco and V. Sune, Combinatorial methods for the evaluation of yield and operational reliability of fault-tolerant systems-on-chip, Microelectronics Reliability 44(2) (2004) 339–350.

11. R. J. Chevance, Hardware and software solutions for high availability, Server-Architecture (2005), 609–652.

12. J. G. Choi and P. H. Seong, Reliability assessment of embedded digital system using multi-state function, Reliability Engineering & System Safety 91(3) (2006) 261–269.

13. J. Christmansson, Z. Kalbarczyk and J. Torin, Dependable flight control system by data diversity and self-checking components, Microprocessing and Microprogramming 40(2–3) (1994) 207–222.

14. W. K. Chung, Reliability of imperfect switching of cold standby systems with multiple non-critical and critical errors, Microelectronics and Reliability 35(12) (1995) 1479–1482.

15. D. R. Cox, The analysis of non-Markovian stochastic process by inclusion of supplementary variables, Mathematical Proceedings of the Cambridge Philosophical Society 51(3) (1955) 433–441.

16. R. W. Dabney, L. Etzkorn and G. W. Cox, A fault tolerant approach to test control utilizing dual-redundant processors, Advances in Engineering Software 39(5) (2008) 371–383.

17. J. V. Deshpande, I. Dewan and U. V. Naik-Nimbalkar, A family of distributions to model load sharing systems, Journal of Statistical Planning and Inference 140(6) (2010) 1441–1451.

18. P. T. De-Sousa and F. P. Mathur, Sift-out modular redundancy, IEEE Transactions on Computers C-27(7) (1978) 624–627.

19. B. S. Dhillon and K. I. Ugwu, Bibliography of literature on computer hardware reliability, Microelectronics and Reliability 26(1) (1986) 131–153.

20. A. Di-Macro, A semi-Markov model of a three state generating unit, IEEE Transactions on Power Apparatus and Systems PAS-91(5) (1972) 2154–2160.

21. S. Distefano and A. Puliafito, Reliability and availability analysis of dependent–dynamic systems with DRBDs, Reliability Engineering & System Safety 94(9) (2009) 1381–1393.

22. G. L. Djordjevic, M. K. Stojcev and T. R. Stankovic, Approach to partially self-checking combinatorial circuits design, Microelectronics Journal 35(12) (2004) 945–952.

23. A. D. Dominguez-Garcia, J. G. Kassakian, J. E. Schindall and J. J. Zinchuk, An integrated methodology for the dynamic performance and reliability evaluation of fault tolerant systems, Reliability Engineering and System Safety 93(11) (2008) 1628–1649.

24. E. Dubrova, Fault Tolerant Design: An Introduction, Kluwer Academic Publishers (2007).

25. A. Eljali, B. M. Al-Hashimi and P. Eles, A standby-sparing technique with low energy overhead for fault tolerant hard real time systems, Proceedings of the 7th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis, Power-aware design methodology (2009), pp. 193–202.

26. M. R. El-Karaksy, A. S. Nouh and A. R. Al-Obaidan, Performance analysis of timed Petri net models for communication protocols: A methodology and a package, Computer Communications 13(2) (1990) 73–82.

27. W. R. Elmendorf, Fault-tolerant programming, Proceedings of FTCS-2, Newton, MA (1972), pp. 79–83.

28. F. Flammini, S. Marrone, N. Mazzocca and V. Vittorini, A new modeling approach to the safety evaluation of N-modular redundant computer systems in the presence of imperfect maintenance, Reliability Engineering and System Safety 94(9) (2009) 1422–1432.

29. S. Garg, J. Singh and D. V. Singh, Availability analysis of crank-case manufacturing in a two-wheeler automobile industry, Applied Mathematical Modelling 34(6) (2010) 1672–1683.

30. R. German, Non-Markovian Analysis, Springer-Verlag (2002), 156–182.

31. J. Gray, Why do computers stop and what can be done about it? in Proceedings of the 5th Symposium on Reliability in Distributed Software and Database Systems (1986), pp. 3–12.

32. M. Grottke and K. S. Trivedi, Fighting bugs: Remove, retry, replicate and rejuvenate, Software Technologies (2007), 107–109.

33. H. Guo and X. Yang, Automatic creation of Markov models for reliability assessment of instrumented systems, Reliability Engineering and System Safety 93(6) (2008) 829–837.

34. K. Hashimoto, T. Tsuchiya and T. Kikuno, A new approach to fault-tolerant scheduling using task duplication in microprocessor systems, Journal of Systems and Software 53(2) (2000) 159–171.

35. W. Hohl, E. Michel and A. Pataricza, Hardware support for error detection in multiprocessor systems - a case study, Microprocessors and Microsystems 17(4) (1993) 201–206.

36. J. J. Horning, A program structure for error detection and recovery, New York: Springer-Verlag 16 (1974) 171–187.

37. C. Y. Huang and Y. R. Chang, An improved decomposition scheme for assessing the reliability of embedded systems by using dynamic fault trees, Reliability Engineering & System Safety 92(10) (2007) 1403–1412.

38. Y. Huang and C. Kintala, Software Fault Tolerance in the Application Layer, John Wiley & Sons Ltd (1995).

39. A. Immonen and E. Niemela, Survey of reliability and availability prediction methods from the view point of software architecture, Software Systems Modeling 7 (2008) 49–65.

40. R. T. Islamov, Using Markov reliability modeling for multiple repairable systems, Reliability Engineering and System Safety 44(2) (1994) 113–118.

41. M. Jain, Reliability of a two-unit system with common cause shock failures, International Journal of Pure & Applied Mathematics 29(12) (1998) 1281–1289.

42. M. Jain and R. P. Ghimire, Reliability of k-r-out-of-n: G system subject to random and common cause failure, Performance Evaluation 29 (1997) 213–218.

43. M. Jain, S. Maheshwari and Rakhee, Study of loading policies for k-r-out-of-N: G system subject to common cause failure, R & D Quality Quest 4(2) (2002) 15–23.

44. K. Jenab and B. S. Dhillon, Assessment of reversible multi-state k-out-of-n:G/F/load sharing systems with flow-graph models, Reliability Engineering and System Safety 91(7) (2006) 765–771.

45. B. W. Johnson, Design and analysis of fault tolerant digital systems, Addison-Wesley (1989).

46. D. L. Kang, M. J. Hwang and S. H. Han, Estimation of common cause failure probabilities of the components under mixed testing scheme, Annals of Nuclear Energy 36(4) (2009) 493–497.

47. P. K. Kapur and K. R. Kapoor, Effect of standby redundancy in system reliability, Microelectronics Reliability 15(5) (1976) 376.

48. P. K. Kapur, A. Gupta and P. C. Jha, Reliability growth modeling and optimal release policy under fuzzy environment of an N-version programming system incorporating the effect of fault removal efficiency, International Journal of Automation and Computing 4(4) (2007) 369–379.

49. J. B. Ke, W. C. Lee and K. H. Wang, Reliability and sensitivity analysis of a system with multiple unreliable service stations and standby switching failures, Physica A: Statistical Mechanics and its Applications 380 (2007) 455–469.

50. J. Kienzle, Software Fault Tolerance: An Overview, Springer Berlin/Heidelberg (2003).

51. S. Kim and A. K. Somani, On-line integrity monitoring of microprocessor control logic, Microelectronics Journal 32(12) (2001) 999–1007.

52. M. Koutny and L. V. Mancini, Synchronizing events in replicated systems, Journal of Systems and Software 9(3) (1989) 183–190.

53. C. M. Krishna, The impact of workload on the reliability of real-time processor triads, Microelectronics Reliability 33(8) (1993) 1169–1178.

54. M. D. Krstic, M. K. Stojcev, G. L. Djordjevic and I. D. Andrejic, A mid-value select voter, Microelectronics and Reliability 45(3–4) (2005) 733–738.

55. K. Kuspert, Principles of error detection in storage structures of database systems, Reliability Engineering 14(4) (1986) 275–290.

56. C. D. Lai, M. Xie, K. L. Poh, Y. S. Dai and P. Yang, A model for availability analysis of distributed software/hardware systems, Information and Software Technology 44(6) (2002) 343–350.

57. J. C. Laprie, Definition and analysis of hardware and software fault tolerant architecture, IEEE Computer 23(7) (1990) 39–51.

58. J. C. Laprie, Dependability: Basic concepts and terminology, Springer-Verlag (1992).

59. G. Levitin, Reliability and performance analysis of hardware-software systems with fault tolerant software components, Reliability Engineering and System Safety 91(5) (2006) 570–579.

60. F. Lombardi, Availability modeling of ring microcomputer systems, Microelectronics Reliability 22(2) (1982) 295–308.

61. F. Lombardi and S. Ratheal, Analysis of series deviance in a parallel state transition diagram and applications to fault tolerant computing, Microelectronics Reliability 23(5) (1983) 963–980.

62. F. Lombardi, V. Obac-Roda and M. M. Islam, Reliability study of duplex-hybrid systems, Microelectronics and Reliability 22(3) (1982) 457–470.

63. J. Losq, A highly efficient redundancy scheme: self-purging redundancy, IEEE Transactions on Computers C-25(6) (1976) 569–578.

64. M. R. Lyu, Handbook of software reliability engineering, IEEE Computer Society Press and McGraw-Hill (1996).

65. A. Mosleh, Common cause failure: An analysis methodology and examples, Reliability Engineering 34 (1991) 249–292.

66. M. Mukurani, Task-based dynamic fault tolerance for humanoid robot applications and its hardware implementation, Journal of Computers 3(8) (2008) 40–48.

67. S. M. Nassar, Software reliability, Computers & Industrial Engineering 11(1–4) (1986) 613–618.

68. E. A. Oliveira, A. C. M. Alvim and P. F. Frutuoso-e-Melo, Unavailability analysis of safety systems under aging by supplementary variables with imperfect repair, Annals of Nuclear Energy 32(2) (2005) 241–252.

69. J. N. Pan, Reliability prediction of imperfect switching systems subject to Weibull failures, Computers & Industrial Engineering 34(2) (1998) 481–492.

70. E. Papageorgiou and G. Kokolakis, Reliability analysis of a two-unit general parallel system with (n − 2) warm standbys, European Journal of Operational Research 201(3) (2010) 821–827.

71. H. Pham, On the optimal design of N-version software systems subject to constraints, Journal of Systems and Software 27(1) (1994) 55–61.

72. H. Pham, A. Suprasad and R. B. Misra, Reliability analysis of k-out-of-n systems with partially repairable multi-state components, Microelectronics and Reliability 36(10) (1996) 1407–1415.

73. D. K. Pradhan, Fault-tolerant computer system design, Prentice Hall, Englewood Cliffs, NJ (1996).

74. J. Proenza, J. Miro-Julia and H. Hansson, Managing redundancy in CAN-based networks supporting N-version programming, Computer Standards & Interfaces 31(1) (2009) 120–127.

75. L. L. Pullum, Software fault tolerance techniques and implementation, Artech House (2001).

76. J. M. Quintana, M. J. Avedillo and J. L. Huertas, Efficient realization of a threshold voter for self-purging redundancy, Journal of Electronic Testing: Theory and Applications 17(1) (2001) 69–73.

77. B. Randell, System structure for software fault tolerance, IEEE Transactions on Software Engineering SE-1(2) (1975) 220–232.

78. H. M. Razavi, Self-purging redundancy with automatic threshold adjustment, IEE Proceedings of Circuits, Devices and Systems 140(4) (1993) 233–236.

79. A. Reibman, R. Smith and K. Trivedi, Markov and Markov reward model transient analysis: An overview of numerical approaches, European Journal of Operational Research 40(2) (1989) 257–267.

80. A. Romanovsky, Practical exception handling and resolution in concurrent programs, Computer Languages 23(1) (1997) 43–58.

81. O. Rooks and M. Armbruster, Duo duplex drive-by-wire computer system, Reliability Engineering and System Safety 89(1) (2005) 71–80.

82. R. Samet, Design and implementation of highly reliable dual-computer systems, Computers & Security 28(7) (2009) 710–722.

83. M. T. Schmitz, B. M. Al-Hashimi and P. Eles, System-level design techniques for energy-efficient embedded systems, Norwell, MA: Kluwer (2004).

84. R. Schoenig, J. F. Aubry, T. Cambois and T. Hutinet, An aggregation method of Markov graphs for the reliability analysis of hybrid systems, Reliability Engineering & System Safety 91(2) (2006) 137–148.

85. R. K. Sharma and S. Kumar, Performance modeling in critical engineering systems using RAM analysis, Reliability Engineering and System Safety 93(6) (2008) 913–919.

86. B. Singh, K. K. Sharma and A. Kumar, A classical and Bayesian estimation of a k-components load sharing parallel system, Computational Statistics & Data Analysis 52(12) (2008) 5175–5185.

87. J. R. Sklaroff, Redundancy management technique for space shuttle computers, IBM JRD 20(1) (1976) 20–28.

88. A. Spector and D. Gifford, The space shuttle primary computer system, Communications of the ACM 27(9) (1984) 872–900.

89. J. M. Tahir, S. S. Dlay, R. N. G. Naguib and O. R. Hinton, Fault tolerant arithmetic unit using duplication and residue codes, Integration, the VLSI Journal 18(2–3) (1995) 187–200.

90. W. Torres-Pomales, Software fault tolerance: A tutorial, NASA/TM-2000-210616 (2000).

91. J. E. Valdes and R. I. Zequeira, On the optimal allocation of an active redundancy in a two-component series system, Statistics & Probability Letters 63(3) (2003) 325–332.

92. J. E. Valdes and R. I. Zequeira, On the optimal allocation of two active redundanciesin a two-component series system, Operations Research Letters 34(1) (2006) 49–52.

93. J. E. Valdes, G. Arango, R. I. Zequeira and G. Brito, Some stochastic comparisonsin series systems with active redundancy, Statistics & Probability Letters 80(11–12)(2010) 945–949.

94. K. H. Wang and Y. J. Chen, Comparative analysis of availability between three sys-tems with general repair times, reboot delay and switching failures, Applied Mathe-matics and Computation 215(1) (2009) 384–394.

95. K. H. Wang, W. L. Dong and J. B. Ke, Comparison of reliability and the availabilitybetween four systems with warm standby components and standby switching failures,Applied Mathematics and Computation 183(2) (2006) 1310–1322.

96. N. Wattanapongskorn and D. W. Coit, Fault-tolerant embedded system design andoptimization considering reliability estimation uncertainty, Reliability Engineering &System Safety 92(4) (2007) 395–407.

97. J. Wu, Y. Wang and E. B. Fernandez, A uniform approach to software and hardwarefault tolerance, Journal of System and Software 26(2) (1994) 117–127.

98. K. Wu, P. Mishra and R. Karri, Concurrent error detection of fault-based side chan-nel cryptanalysis of 128-bit RC6 block cipher, Microelectronics Journal 34(1) (2003)31–39.

99. S. Yamada and S. Osaki, Reliability growth models for hardware and software sys-tems based on non-homogeneous Poisson processes: A survey, Microelectronics Reli-ability 23(1) (1983) 91–112.

100. W. Yamamoto, L. Jin and K. Suzuki, Optimal allocations for load-sharing k-out-of-n:F systems, Journal of Statistical Planning and Inference 139(5) (2009) 1777–1781.

101. B. Yang, H. Hu and S. Guo, Cost oriented task allocation and hardware redun-dancy policies in heterogeneous distributed computing systems considering softwarereliability, Computers & Industrial Engineering 56(4) (2009) 1687–1696.

102. B. Yang, X. Li, M. Xie and T. Feng, A generic data-driven software reliabilitymodel with model mining technique, Reliability Engineering and System Safety 95(6)(2010) 671–678.

103. S. S. Yau and R. C. Cheung, Design of self-checking software, Proceedings of the International Conference on Reliable Software 10(6) (1975) 450–455.

104. R. D. Yearout, P. Reddy and D. L. Grosh, Standby redundancy in reliability - A review, Microelectronics Reliability 27(5) (1987) 937.

105. T. Zhang, M. Xie and M. Horigome, Availability and reliability of k-out-of-(M+N):G warm standby systems, Reliability Engineering and System Safety 91(4) (2006) 381–387.

106. Y. L. Zhang, A geometrical process repair model for a repairable system with delayed repair, Computers & Mathematics with Applications 55(8) (2008) 1629–1643.

107. Y. L. Zhang and G. J. Wang, A deteriorating cold standby repairable system with priority in use, European Journal of Operational Research 183(1) (2007) 278–295.

About the Authors

Madhu Jain, Faculty, Dept. of Mathematics, I.I.T. Roorkee, received her M.Sc., M.Phil., Ph.D. and D.Sc. degrees in Mathematics from the University of Agra. She was a gold medalist of Agra University at the M.Phil. level. She has more than 250 research publications in refereed international and national journals and more than 20 books to her credit, in addition to two reference books. She was a recipient of the Young Scientist award and the SERC visiting fellowship of the Department of Science and Technology (India), and the Career award of the University Grants Commission (India). She has successfully completed six sponsored major research projects of DST, UGC and CSIR. Her current research interests include Stochastic Modelling, Soft Computing, Bio-informatics, Reliability and Queueing Theory. Twenty-five candidates have received their Ph.D. degrees under her supervision. She has visited more than 25 reputed universities and institutes in the USA, Canada, the UK, Germany, France, Holland and Belgium. She has participated and presented her research work in more than 30 international and 75 national conferences and seminars.

Ritu Gupta is presently conducting research leading to a Ph.D. degree at the Institute of Basic Science, Dr. B. R. Ambedkar University, Agra. She is a first-division scholar at the degree and postgraduate levels. She has published three research papers in reputed journals and refereed proceedings. She has attended and presented her work at two international and eight national conferences and seminars. Her areas of interest are software and hardware reliability and performance modeling of redundant systems.
